
Code Submission Round Launched

🧑‍💻 Baselines Published! | Task 1: 0.850 | Task 2: 0.741 | Task 3: 0.832 (NOTE: These scores were updated after the test sets were cleaned)

🚀 Datasets Released & Submissions Open: Announcement

If this dataset has been useful for your research, please consider citing the following paper:

@misc{reddy2022shopping,
  title={Shopping Queries Dataset: A Large-Scale {ESCI} Benchmark for Improving Product Search},
  author={Chandan K. Reddy and Lluís Màrquez and Fran Valero and Nikhil Rao and Hugo Zaragoza and Sambaran Bandyopadhyay and Arnab Biswas and Anlu Xing and Karthik Subbian},
  year={2022},
  eprint={2206.06588},
  archivePrefix={arXiv}
}

## 🕵️ Introduction

Improving the relevance of search results can significantly improve the customer experience and their engagement with search. Despite the recent advancements in the field of machine learning, correctly classifying items for a particular user search query for shopping is challenging. The presence of noisy information in the results, the difficulty of understanding the query intent, and the diversity of the items available are some of the reasons that contribute to the complexity of this problem.

When developing online shopping applications, extremely high accuracy in ranking is needed. Even more so when deploying search in mobile and voice search applications, where a small number of irrelevant items can break the user experience.

In these applications, the notion of binary relevance limits the customer experience. For example, for the query “iPhone”, would an iPhone charger be relevant, irrelevant, or somewhere in between? In fact, many users search for “iPhone” to find and purchase a charger: they expect the search engine to understand their needs.

For this reason we break down relevance into the following four classes (ESCI) which are used to measure the relevance of the items in the search results:

• Exact (E): the item is relevant for the query, and satisfies all the query specifications (e.g., water bottle matching all attributes of a query “plastic water bottle 24oz”, such as material and size)

• Substitute (S): the item is somewhat relevant: it fails to fulfill some aspects of the query but the item can be used as a functional substitute (e.g., fleece for a “sweater” query)

• Complement (C): the item does not fulfill the query, but could be used in combination with an exact item (e.g., track pants for “running shoe” query)

• Irrelevant (I): the item is irrelevant, or it fails to fulfill a central aspect of the query (e.g. socks for a “pant” query)

In this challenge, we introduce the “Shopping Queries Data Set”, a large dataset of difficult search queries, published with the aim of fostering research in the area of semantic matching of queries and products. For each query, the dataset provides a list of up to 40 potentially relevant results, together with ESCI relevance judgements (Exact, Substitute, Complement, Irrelevant) indicating the relevance of each product to the query. Each query-product pair is accompanied by additional information. The information accompanying each product is public catalog data, including the title, product description, and additional product-related bullet points.

The dataset is multilingual, as it contains queries in English, Japanese, and Spanish. With this data, we propose three different tasks, consisting of:

1. ranking the results list.
2. classifying the query/product pairs into E, S, C, or I categories.
3. identifying substitute products for a given query.

The primary objective of this competition is to build new ranking strategies and, simultaneously, identify interesting categories of results (i.e., substitutes) that can be used to improve the customer experience when searching for products.

The three different tasks for this KDD Cup competition using our Shopping Queries Dataset are:

1. Query-Product Ranking
2. Multiclass Product Classification
3. Product Substitute Identification

We will explain each of these tasks in detail below.

### Task 1: Query-Product Ranking

Given a user-specified query and a list of matched products, the goal of this task is to rank the products so that the relevant products are ranked above the non-relevant ones. This is similar to standard information retrieval tasks, but specifically in the context of product search in e-commerce. The input for this task will be a list of queries with their identifiers. The system will have to output a CSV file with the query_id in the first column and the product_id in the second column, where for each query_id the first row is the most relevant product and the last row the least relevant product. The input data for each query will be sorted by Exacts, Substitutes, Complements, and Irrelevants. In the following example for query_1, product_50 is the most relevant item and product_80 is the least relevant item.

Input:

query_id query query_locale product_id
Query_1 "Query_1" us product_23
Query_2 "Query_2" us product_234

Output:

query_id product_id
Query_1 product_50
Query_1 product_900
Query_1 product_80
Query_2 product_32

The metadata about each of the products will be available in product_catalogue-v0.1.csv, which will have the following columns: product_id, product_title, product_description, product_bullet_point, product_brand, product_color_name, product_locale.
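As a sketch of how the catalogue file might be loaded, the column names and filename come from the description above, while the sample row and the indexing scheme are illustrative assumptions:

```python
import csv
import io

# Columns of product_catalogue-v0.1.csv, as listed above.
CATALOGUE_COLUMNS = [
    "product_id", "product_title", "product_description",
    "product_bullet_point", "product_brand", "product_color_name",
    "product_locale",
]

# Stand-in for open("product_catalogue-v0.1.csv", newline=""); the row
# contents are invented for illustration.
sample = io.StringIO(
    ",".join(CATALOGUE_COLUMNS) + "\n"
    + "product_23,Water Bottle 24oz,,BPA free,Acme,blue,us\n"
)

# Index by (product_id, product_locale), since the same product id can
# appear in more than one locale.
reader = csv.DictReader(sample)
catalogue = {(row["product_id"], row["product_locale"]): row for row in reader}

print(catalogue[("product_23", "us")]["product_title"])  # Water Bottle 24oz
```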

Normalized Discounted Cumulative Gain (nDCG) is a commonly used relevance metric. Highly relevant documents appearing lower in a search results list should be penalized, as the graded relevance is reduced logarithmically proportional to the position of the result. In our case there are four degrees of relevance (rel) for each query-product pair: Exact, Substitute, Complement, and Irrelevant, to which we assign gains of 1.0, 0.1, 0.01, and 0.0, respectively.

DCG_p shows how to compute the Discounted Cumulative Gain (DCG) over the first p products retrieved by the search engine. IDCG_p computes the DCG for the list of those p products sorted by their relevance (|REL_p|); it therefore returns the maximum achievable DCG score.

$DCG_p = \sum_{i=1}^p \frac{2^{rel_i}-1}{log_2(i+1)}$

$\large IDCG_p = \sum_{i=1}^{|REL_p|} \frac{2^{rel_i}-1}{log_2(i+1)}$

Search results lists vary in length depending on the query. Comparing a search engine's performance from one query to the next cannot be done consistently using DCG alone, so the cumulative gain at each position for a chosen value of p must be normalized across queries: nDCG_p is obtained by dividing DCG_p by IDCG_p.

$\large nDCG_p = \frac{DCG_p}{IDCG_p}$
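The metric above can be sketched in a few lines of Python. The gain values are those stated in the text; the example ranking is invented:

```python
import math

# Gains for the four ESCI relevance degrees, as specified above.
GAINS = {"exact": 1.0, "substitute": 0.1, "complement": 0.01, "irrelevant": 0.0}

def dcg(labels):
    """DCG of a ranked list of ESCI labels: gain (2^rel - 1), log2 position discount."""
    return sum((2 ** GAINS[l] - 1) / math.log2(i + 2) for i, l in enumerate(labels))

def ndcg(labels):
    """nDCG = DCG of the ranking divided by the DCG of the ideal ordering."""
    ideal = dcg(sorted(labels, key=GAINS.get, reverse=True))
    return dcg(labels) / ideal if ideal > 0 else 0.0

print(ndcg(["exact", "substitute", "irrelevant"]))  # 1.0 (already ideally ordered)
print(ndcg(["irrelevant", "substitute", "exact"]))  # < 1.0: exacts ranked too low
```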

### Task 2: Multiclass Product Classification

Given a query and a result list of products retrieved for this query, the goal of this task is to classify each product as being an Exact, Substitute, Complement, or Irrelevant match for the query. The micro-F1 will be used to evaluate the methods.

The input to this task will be (query, product) pairs, along with product metadata. Specifically, rows of the dataset will have the following form:

example_id query product_id query_locale
example_1 11 degrees product0 us
example_2 11 degrees product1 us
example_3 針なしほっちきす product2 jp
example_4 針なしほっちきす product3 jp

Additionally, the training data will contain the E/S/C/I label for each (query, product) pair. The model will output a CSV file with the example_id in the first column and the esci_label in the second column. In the following example, the system predicts exact for example_1, complement for example_2, irrelevant for example_3, and substitute for example_4.

example_id esci_label
example_1 exact
example_2 complement
example_3 irrelevant
example_4 substitute

The metadata about each of the products will be available in product_catalogue-v0.1.csv, which will have the following columns: product_id, product_title, product_description, product_bullet_point, product_brand, product_color_name, product_locale.

F1 Score is a commonly used metric for multi-class classification. The F_1 formula below shows how it is computed from the number of true positives (TP), false positives (FP), and false negatives (FN). We decided to use the micro-averaged F1 Score because the four classes are unbalanced (65.17% Exact, 21.91% Substitute, 2.89% Complement, and 10.04% Irrelevant), and this metric is robust in this situation.

$\large F_1 = 2 \cdot \frac{precision \cdot recall}{precision + recall} = \frac{TP}{TP + \frac{1}{2} (FP + FN)}$

Micro averaging F1 Score computes a global average F1 Score by counting the sums of the TP, FP, and FN values across all classes.
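A minimal, dependency-free sketch of that pooled computation (the example labels are invented):

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1: sum TP/FP/FN over all classes, then apply the F1 formula."""
    classes = set(y_true) | set(y_pred)
    tp = fp = fn = 0
    for c in classes:
        tp += sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp += sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn += sum(t == c and p != c for t, p in zip(y_true, y_pred))
    return tp / (tp + 0.5 * (fp + fn)) if tp + fp + fn else 0.0

# Invented predictions: 3 of 4 examples are labeled correctly.
y_true = ["exact", "exact", "substitute", "irrelevant"]
y_pred = ["exact", "substitute", "substitute", "irrelevant"]
print(micro_f1(y_true, y_pred))  # 0.75
```

Note that in a single-label multiclass setting, every misclassification counts once as a false positive and once as a false negative, so micro-averaged F1 reduces to plain accuracy.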

### Task 3: Product Substitute Identification

This task will measure the ability of the systems to identify the substitute products in the list of results for a given query. The notion of “substitute” is exactly the same as in Task 2. The F1 score for the substitute class will be used to evaluate and rank the approaches in the leaderboard.

The input of this third task is the same as the input of the second task. The system will have to output a CSV file with the example_id in the first column and the substitute_label in the second column. In the following example, the system predicts no_substitute for example_1 and example_2, and substitute for example_3 and example_4.

Input:

example_id query product_id query_locale
example_1 query_1 product0 us
example_2 query_2 product1 us
example_3 query_3 product2 jp
example_4 query_4 product3 jp

Output:

example_id substitute_label
example_1 no_substitute
example_2 no_substitute
example_3 substitute
example_4 substitute
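Since Task 3 uses exactly the same notion of “substitute” as Task 2, one straightforward way to produce this output is to binarize ESCI predictions. A sketch, where the example ids and predicted labels are invented:

```python
import csv
import io

# Hypothetical per-example ESCI predictions (e.g., from a Task 2 model).
predictions = {
    "example_1": "exact",
    "example_2": "complement",
    "example_3": "substitute",
    "example_4": "substitute",
}

# Write the Task 3 submission: substitute iff the ESCI label is substitute.
out = io.StringIO()  # stand-in for open("submission.csv", "w", newline="")
writer = csv.writer(out)
writer.writerow(["example_id", "substitute_label"])
for example_id, esci in predictions.items():
    label = "substitute" if esci == "substitute" else "no_substitute"
    writer.writerow([example_id, label])

print(out.getvalue())
```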

The metadata about each of the products will be available in product_catalogue-v0.1.csv, which will have the following columns: product_id, product_title, product_description, product_bullet_point, product_brand, product_color_name, product_locale.

As this task consists of binary classification and the classes are unbalanced (33% substitute and 67% no_substitute), we again use the micro-averaged F1 score as the evaluation metric, as in the second task.

Each task will have its own separate leaderboard. The metrics described for each task will be used for ranking the teams. Throughout the competition, we will maintain a leaderboard for models evaluated on the public test set.

At the end of the competition, we will maintain a private leaderboard for models evaluated on the private test set. This latter leaderboard will be used to make decisions on who the winners are for each task in the competition. The leaderboard on the public test set is meant to guide the participants on their model performance, and compare it with that of other participants.

## 💾 Dataset (Shopping Queries Data Set)

We provide two different versions of the data set: a version for Task 1 that is reduced in terms of the number of examples, and a larger version for Tasks 2 and 3.

The training data set contains a list of query-result pairs with annotated E/S/C/I labels. The data is multilingual, with queries in English, Japanese, and Spanish. The examples in the data set have the following fields: example_id, query, query_id, product, product_title, product_description, product_bullet_point, product_brand, product_color, product_locale, and esci_label.

Although the data set is shared, the three tasks are independent, so the participants will have to provide results separately for each of them. The test data set will be similarly structured, except that the last field (esci_label) will be withheld.

The Shopping Queries Data Set is a large-scale manually annotated data set composed of challenging customer queries.

Although we will provide results broken down by language, the metrics used to define the final ranking of the systems will consider an average over the three languages (micro-averaged). There are two versions of the dataset. The reduced version contains 48,300 unique queries and 1,118,117 rows, each corresponding to a <query, item> judgement. The larger version contains 130,652 unique queries and 2,621,738 judgements. In the reduced version, queries deemed to be “easy” have been filtered out. The data is stratified by query into three splits (train, public test, and private test) at 70%, 15%, and 15%, respectively.
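The query-stratified split described above can be sketched as follows. The 70/15/15 proportions come from the text; the shuffle-based procedure and the query ids are illustrative assumptions, not the organizers' actual method:

```python
import random
from collections import defaultdict

def split_by_query(rows, seed=0):
    """70/15/15 split stratified by query: all judgements for a query stay together."""
    by_query = defaultdict(list)
    for row in rows:
        by_query[row["query_id"]].append(row)
    queries = sorted(by_query)
    random.Random(seed).shuffle(queries)  # deterministic shuffle of query ids
    n = len(queries)
    cut1, cut2 = int(0.70 * n), int(0.85 * n)
    splits = {"train": [], "public_test": [], "private_test": []}
    for i, q in enumerate(queries):
        name = "train" if i < cut1 else "public_test" if i < cut2 else "private_test"
        splits[name].extend(by_query[q])
    return splits

# Toy data: 20 invented queries, 5 judgements each.
rows = [{"query_id": f"query_{i % 20}", "product_id": f"product_{i}"} for i in range(100)]
splits = split_by_query(rows)
print({k: len(v) for k, v in splits.items()})
# {'train': 70, 'public_test': 15, 'private_test': 15}
```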

A summary of our Shopping Queries Data Set is given in the two tables below showing the statistics of the reduced and larger version, respectively. These tables include the number of unique queries, the number of judgements, and the average number of judgements per query (i.e., average depth) across the three different languages.

Total Train Public Test Private Test
Language #Queries #Judgements Avg. Depth #Queries #Judgements Avg. Depth #Queries #Judgements Avg. Depth #Queries #Judgements Avg. Depth
English 29,844 601,462 20.1 20,888 419,730 20.1 4,477 91,062 20.3 4,479 90,670 20.2
Spanish 8,049 218,832 27.2 5,632 152,920 27.2 2,208 32,908 27.2 1,209 33,004 27.3
Japanese 10,407 297,885 28.6 7,284 209,094 28.7 1,561 43,832 28.1 1,562 44,959 28.8
US+ES+JP 48,300 1,118,179 23.2 33,804 781,744 23.1 7,246 167,802 23.2 7,250 168,633 23.3

Table 1: Summary of the Shopping queries data set for task 1 (reduced version) - the number of unique queries, the number of judgements, and the average number of judgements per query.

Total Train Public Test Private Test
Language #Queries #Judgements Avg. Depth #Queries #Judgements Avg. Depth #Queries #Judgements Avg. Depth #Queries #Judgements Avg. Depth
English 97,345 1,819,105 18.7 68,139 1,272,626 18.7 14,602 274,261 18.8 14,604 272,218 18.6
Spanish 15,180 356,578 23.5 10,624 249,721 23.5 2,277 53,494 23.5 2,279 53,363 23.4
Japanese 18,127 446,055 24.6 12,687 312,397 24.6 2,719 66,612 24.5 2,721 67,046 24.6
US+ES+JP 130,652 2,621,738 20.1 91,450 1,834,744 20.1 19,598 394,367 20.1 19,604 392,627 20.0

Table 2: Summary of the Shopping queries data set for tasks 2 and 3 (larger version) - the number of unique queries, the number of judgements, and the average number of judgements per query.

Note that we will provide two test sets. One is given to the participants (Public Test set) along with the training data and the performance of the models developed by the participants will be shown on the leaderboard. We have also built a holdout Private Test set with a similar distribution.

Towards the end of the competition, participants will need to submit their final models on the site. The models will be evaluated on the private test set by automated evaluators hosted on the AIcrowd platform. Teams can improve their solutions and submit improved versions of their models, but the leaderboard on this private test set will remain private until the end of the competition. We hope this will make the models more generalizable and perform well on unseen test data (not fine-tuned for a specific test set). The final ranking of the teams will be based exclusively on the results on the Private Test data set.

## 💡 Baseline Methods

In order to ensure the feasibility of the proposed tasks, we provide the results obtained by standard baseline models run on these data sets. For the first task (ranking), we have run basic retrieval models (such as BM25) along with a BERT model. For the remaining two tasks (classification), we provide the results of multilingual BERT-based models as the initial baselines.

## 📅 Timeline

• Start of the competition: March 15, 2022
• Dataset release: March 28, 2022 (updated from March 25)
• Initial submissions open: March 28, 2022 (updated from March 25)
• Baselines published: April 5, 2022
• Code submissions: May 20, 2022
• 🥶 Team freeze deadline: July 1, 2022
• Competition ends: July 15, 2022
• Announcement of winners: July 22, 2022
• Workshop paper (online) submission deadline: Aug 1, 2022
• KDD Cup Workshop: Aug 15, 2022

## 🏆 Prizes

There are prizes for all three tasks. For each task, the top three positions on the leaderboard win the following cash prizes:

• First place: $4,000
• Second place: $2,000
• Third place: $1,000

AWS Credits: For each of the three tasks, the teams/participants that finish between the 4th and 10th position on the leaderboard will receive AWS credits worth $500.

SIGKDD Workshop: Along with that, the top-3 teams from each task and a few selected other teams in the top-10 will have an opportunity to showcase their work to the research community at the KDD Cup Workshop.

### Community Contribution Prizes

We are excited to announce the following Community Contribution Prizes:

More details about the Community Contribution Prizes are available here.

## 🏆 KDD Cup Workshop

The KDD Cup workshop will be held in conjunction with the KDD conference on August 15th, 2022. The selected winners will have an opportunity to present their work at this venue.

## 🗂 Rules

You can find the challenge rules over here. Please read and accept the rules to participate in this challenge.

## 📱 Contact

The contact email is: esci-challenge@amazon.com

Organizers of this competition are:

• Lluis Marquez
• Fran Valero
• Nikhil Rao
• Hugo Zaragoza
• Arnab Biswas
• Anlu Xing
• Chandan K Reddy

## 🤝 Acknowledgements

We thank Sahika Genc for helping us establish the AIcrowd partnership. A special thanks to our partners in AWS, Paxton Hall and Cameron Peron, for supporting with the AWS credits for 21 winning teams.
