
Code Submission Round Launched

🧑‍💻 Baselines Published! | Task 1: 0.850 | Task 2: 0.741 | Task 3: 0.832 (NOTE: These scores were updated after the test sets were cleaned)

🚀 Datasets Released & Submissions Open: Announcement

If this dataset has been useful for your research, please consider citing the following paper:

@misc{reddy2022shopping,
  title={Shopping Queries Dataset: A Large-Scale {ESCI} Benchmark for Improving Product Search},
  author={Chandan K. Reddy and Lluís Màrquez and Fran Valero and Nikhil Rao and Hugo Zaragoza and Sambaran Bandyopadhyay and Arnab Biswas and Anlu Xing and Karthik Subbian},
  year={2022},
  eprint={2206.06588},
  archivePrefix={arXiv}
}

## 🕵️ Introduction

Improving the relevance of search results can significantly improve the customer experience and their engagement with search. Despite the recent advancements in the field of machine learning, correctly classifying items for a particular user search query for shopping is challenging. The presence of noisy information in the results, the difficulty of understanding the query intent, and the diversity of the items available are some of the reasons that contribute to the complexity of this problem.

When developing online shopping applications, extremely high accuracy in ranking is needed. Even more so when deploying search in mobile and voice search applications, where a small number of irrelevant items can break the user experience.

In these applications, the notion of binary relevance limits the customer experience. For example, for the query “iPhone”, would an iPhone charger be relevant, irrelevant, or somewhere in between? In fact, many users search for “iPhone” to find and purchase a charger: they expect the search engine to understand their needs.

For this reason we break down relevance into the following four classes (ESCI) which are used to measure the relevance of the items in the search results:

• Exact (E): the item is relevant for the query, and satisfies all the query specifications (e.g., water bottle matching all attributes of a query “plastic water bottle 24oz”, such as material and size)

• Substitute (S): the item is somewhat relevant: it fails to fulfill some aspects of the query but the item can be used as a functional substitute (e.g., fleece for a “sweater” query)

• Complement (C): the item does not fulfill the query, but could be used in combination with an exact item (e.g., track pants for “running shoe” query)

• Irrelevant (I): the item is irrelevant, or it fails to fulfill a central aspect of the query (e.g. socks for a “pant” query)

In this challenge, we introduce the “Shopping Queries Data Set”, a large dataset of difficult search queries, published with the aim of fostering research in the area of semantic matching of queries and products. For each query, the dataset provides a list of up to 40 potentially relevant results, together with ESCI relevance judgements (Exact, Substitute, Complement, Irrelevant) indicating the relevance of each product to the query. Each query-product pair is accompanied by additional information. The information accompanying each product is public catalog data, including the title, product description, and additional product-related bullet points.

The dataset is multilingual, as it contains queries in English, Japanese, and Spanish. With this data, we propose three different tasks, consisting of:

1. ranking the results list.
2. classifying the query/product pairs into E, S, C, or I categories.
3. identifying substitute products for a given query.

The primary objective of this competition is to build new ranking strategies and, simultaneously, identify interesting categories of results (i.e., substitutes) that can be used to improve the customer experience when searching for products.

The three different tasks for this KDD Cup competition using our Shopping Queries Dataset are:

1. Query-Product Ranking
2. Multiclass Product Classification
3. Product Substitute Identification

We will explain each of these tasks in detail below.

### Task 1: Query-Product Ranking

Given a user-specified query and a list of matched products, the goal of this task is to rank the products so that the relevant products are ranked above the non-relevant ones. This is similar to standard information retrieval tasks, but specifically in the context of product search in e-commerce. The input for this task will be a list of queries with their identifiers. The system will have to output a CSV file with the query_id in the first column and the product_id in the second column, where for each query_id the first row is the most relevant product and the last row the least relevant product. The input data for each query will be sorted by Exacts, Substitutes, Complements, and Irrelevants. In the following example for query_1, product_50 is the most relevant item and product_80 is the least relevant item.

Input:

query_id query query_locale product_id
Query_1 "Query_1" us product_23
Query_2 "Query_2" us product_234

Output:

query_id product_id
Query_1 product_50
Query_1 product_900
Query_1 product_80
Query_2 product_32

The metadata about each of the products will be available in product_catalogue-v0.1.csv, which will have the following columns: product_id, product_title, product_description, product_bullet_point, product_brand, product_color_name, product_locale.
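As a sketch of how the catalogue file might be loaded, the column names and filename come from the description above, while the sample row and the indexing scheme are illustrative assumptions:

```python
import csv
import io

# Columns of product_catalogue-v0.1.csv, as listed above.
CATALOGUE_COLUMNS = [
    "product_id", "product_title", "product_description",
    "product_bullet_point", "product_brand", "product_color_name",
    "product_locale",
]

# Stand-in for open("product_catalogue-v0.1.csv", newline=""); the row
# contents are invented for illustration.
sample = io.StringIO(
    ",".join(CATALOGUE_COLUMNS) + "\n"
    + "product_23,Water Bottle 24oz,,BPA free,Acme,blue,us\n"
)

# Index by (product_id, product_locale), since the same product id can
# appear in more than one locale.
reader = csv.DictReader(sample)
catalogue = {(row["product_id"], row["product_locale"]): row for row in reader}

print(catalogue[("product_23", "us")]["product_title"])  # Water Bottle 24oz
```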

Normalized Discounted Cumulative Gain (nDCG) is a commonly used relevance metric. Highly relevant documents appearing lower in a search results list should be penalized, as the graded relevance is reduced logarithmically proportional to the position of the result. In our case there are four degrees of relevance (rel) for each query-product pair: Exact, Substitute, Complement, and Irrelevant, to which we assign gains of 1.0, 0.1, 0.01, and 0.0, respectively.

DCG_p shows how to compute the Discounted Cumulative Gain (DCG) over the first p products retrieved by the search engine. IDCG_p computes the DCG for the list of those p products sorted by their relevance (|REL_p|); it therefore returns the maximum achievable DCG score.

$DCG_p = \sum_{i=1}^p \frac{2^{rel_i}-1}{log_2(i+1)}$

$\large IDCG_p = \sum_{i=1}^{|REL_p|} \frac{2^{rel_i}-1}{log_2(i+1)}$

Search results lists vary in length depending on the query. Comparing a search engine's performance from one query to the next cannot be done consistently using DCG alone, so the cumulative gain at each position for a chosen value of p must be normalized across queries: nDCG_p is obtained by dividing DCG_p by IDCG_p.

$\large nDCG_p = \frac{DCG_p}{IDCG_p}$
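The metric above can be sketched in a few lines of Python. The gain values are those stated in the text; the example ranking is invented:

```python
import math

# Gains for the four ESCI relevance degrees, as specified above.
GAINS = {"exact": 1.0, "substitute": 0.1, "complement": 0.01, "irrelevant": 0.0}

def dcg(labels):
    """DCG of a ranked list of ESCI labels: gain (2^rel - 1), log2 position discount."""
    return sum((2 ** GAINS[l] - 1) / math.log2(i + 2) for i, l in enumerate(labels))

def ndcg(labels):
    """nDCG = DCG of the ranking divided by the DCG of the ideal ordering."""
    ideal = dcg(sorted(labels, key=GAINS.get, reverse=True))
    return dcg(labels) / ideal if ideal > 0 else 0.0

print(ndcg(["exact", "substitute", "irrelevant"]))  # 1.0 (already ideally ordered)
print(ndcg(["irrelevant", "substitute", "exact"]))  # < 1.0: exacts ranked too low
```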

### Task 2: Multiclass Product Classification

Given a query and a result list of products retrieved for this query, the goal of this task is to classify each product as being an Exact, Substitute, Complement, or Irrelevant match for the query. The micro-F1 will be used to evaluate the methods.

The input to this task will be (query, product) pairs, along with product metadata. Specifically, rows of the dataset will have the following form:

example_id query product_id query_locale
example_1 11 degrees product0 us
example_2 11 degrees product1 us
example_3 針なしほっちきす product2 jp
example_4 針なしほっちきす product3 jp

Additionally, the training data will contain the E/S/C/I label for each (query, product) pair. The model will output a CSV file with the example_id in the first column and the esci_label in the second column. In the following example, the system predicts exact for example_1, complement for example_2, irrelevant for example_3, and substitute for example_4.

example_id esci_label
example_1 exact
example_2 complement
example_3 irrelevant
example_4 substitute

The metadata about each of the products will be available in product_catalogue-v0.1.csv, which will have the following columns: product_id, product_title, product_description, product_bullet_point, product_brand, product_color_name, product_locale.

F1 Score is a commonly used metric for multi-class classification. The F_1 formula below shows how it is computed from the number of true positives (TP), false positives (FP), and false negatives (FN). We decided to use the micro-averaged F1 Score because the four classes are unbalanced (65.17% Exact, 21.91% Substitute, 2.89% Complement, and 10.04% Irrelevant), and this metric is robust in this situation.

$\large F_1 = 2 \cdot \frac{precision \cdot recall}{precision + recall} = \frac{TP}{TP + \frac{1}{2} (FP + FN)}$

Micro averaging F1 Score computes a global average F1 Score by counting the sums of the TP, FP, and FN values across all classes.
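A minimal, dependency-free sketch of that pooled computation (the example labels are invented):

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1: sum TP/FP/FN over all classes, then apply the F1 formula."""
    classes = set(y_true) | set(y_pred)
    tp = fp = fn = 0
    for c in classes:
        tp += sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp += sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn += sum(t == c and p != c for t, p in zip(y_true, y_pred))
    return tp / (tp + 0.5 * (fp + fn)) if tp + fp + fn else 0.0

# Invented predictions: 3 of 4 examples are labeled correctly.
y_true = ["exact", "exact", "substitute", "irrelevant"]
y_pred = ["exact", "substitute", "substitute", "irrelevant"]
print(micro_f1(y_true, y_pred))  # 0.75
```

Note that in a single-label multiclass setting, every misclassification counts once as a false positive and once as a false negative, so micro-averaged F1 reduces to plain accuracy.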

### Task 3: Product Substitute Identification

This task will measure the ability of the systems to identify the substitute products in the list of results for a given query. The notion of “substitute” is exactly the same as in Task 2. The F1 score for the substitute class will be used to evaluate and rank the approaches in the leaderboard.

The input of this third task is the same as the input of the second task. The system will have to output a CSV file with the example_id in the first column and the substitute_label in the second column. In the following example, the system predicts no_substitute for example_1 and example_2, and substitute for example_3 and example_4.

Input:

example_id query product_id query_locale
example_1 query_1 product0 us
example_2 query_2 product1 us
example_3 query_3 product2 jp
example_4 query_4 product3 jp

Output:

example_id substitute_label
example_1 no_substitute
example_2 no_substitute
example_3 substitute
example_4 substitute
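Since Task 3 uses exactly the same notion of “substitute” as Task 2, one straightforward way to produce this output is to binarize ESCI predictions. A sketch, where the example ids and predicted labels are invented:

```python
import csv
import io

# Hypothetical per-example ESCI predictions (e.g., from a Task 2 model).
predictions = {
    "example_1": "exact",
    "example_2": "complement",
    "example_3": "substitute",
    "example_4": "substitute",
}

# Write the Task 3 submission: substitute iff the ESCI label is substitute.
out = io.StringIO()  # stand-in for open("submission.csv", "w", newline="")
writer = csv.writer(out)
writer.writerow(["example_id", "substitute_label"])
for example_id, esci in predictions.items():
    label = "substitute" if esci == "substitute" else "no_substitute"
    writer.writerow([example_id, label])

print(out.getvalue())
```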

The metadata about each of the products will be available in product_catalogue-v0.1.csv, which will have the following columns: product_id, product_title, product_description, product_bullet_point, product_brand, product_color_name, product_locale.

As this task consists of binary classification and the classes are unbalanced (33% substitute and 67% no_substitute), we again use the micro-averaged F1 score as the evaluation metric, as in the second task.

Each task will have its own separate leaderboard. The metrics described for each task will be used for ranking the teams. Throughout the competition, we will maintain a leaderboard for models evaluated on the public test set.

At the end of the competition, we will maintain a private leaderboard for models evaluated on the private test set. This latter leaderboard will be used to make decisions on who the winners are for each task in the competition. The leaderboard on the public test set is meant to guide the participants on their model performance, and compare it with that of other participants.

## 💾 Dataset (Shopping Queries Data Set)

We provide two different versions of the data set: a version for Task 1 that is reduced in terms of the number of examples, and a larger version for Tasks 2 and 3.

The training data set contains a list of query-result pairs with annotated E/S/C/I labels. The data is multilingual, with queries in English, Japanese, and Spanish. The examples in the data set have the following fields: example_id, query, query_id, product, product_title, product_description, product_bullet_point, product_brand, product_color, product_locale, and esci_label.

Although the data set is shared, the three tasks are independent, so the participants will have to provide results separately for each of them. The test data set will be similarly structured, except that the last field (esci_label) will be withheld.

The Shopping Queries Data Set is a large-scale manually annotated data set composed of challenging customer queries.

Although we will provide results broken down by language, the metrics used to define the final ranking of the systems will consider an average over the three languages (micro-averaged). There are two versions of the dataset. The reduced version contains 48,300 unique queries and 1,118,117 rows, each corresponding to a <query, item> judgement. The larger version contains 130,652 unique queries and 2,621,738 judgements. In the reduced version, queries deemed to be “easy” have been filtered out. The data is stratified by query into three splits (train, public test, and private test) at 70%, 15%, and 15%, respectively.
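The query-stratified split described above can be sketched as follows. The 70/15/15 proportions come from the text; the shuffle-based procedure and the query ids are illustrative assumptions, not the organizers' actual method:

```python
import random
from collections import defaultdict

def split_by_query(rows, seed=0):
    """70/15/15 split stratified by query: all judgements for a query stay together."""
    by_query = defaultdict(list)
    for row in rows:
        by_query[row["query_id"]].append(row)
    queries = sorted(by_query)
    random.Random(seed).shuffle(queries)  # deterministic shuffle of query ids
    n = len(queries)
    cut1, cut2 = int(0.70 * n), int(0.85 * n)
    splits = {"train": [], "public_test": [], "private_test": []}
    for i, q in enumerate(queries):
        name = "train" if i < cut1 else "public_test" if i < cut2 else "private_test"
        splits[name].extend(by_query[q])
    return splits

# Toy data: 20 invented queries, 5 judgements each.
rows = [{"query_id": f"query_{i % 20}", "product_id": f"product_{i}"} for i in range(100)]
splits = split_by_query(rows)
print({k: len(v) for k, v in splits.items()})
# {'train': 70, 'public_test': 15, 'private_test': 15}
```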

A summary of our Shopping Queries Data Set is given in the two tables below showing the statistics of the reduced and larger version, respectively. These tables include the number of unique queries, the number of judgements, and the average number of judgements per query (i.e., average depth) across the three different languages.

Total Train Public Test Private Test
Language #Queries #Judgements Avg. Depth #Queries #Judgements Avg. Depth #Queries #Judgements Avg. Depth #Queries #Judgements Avg. Depth
English 29,844 601,462 20.1 20,888 419,730 20.1 4,477 91,062 20.3 4,479 90,670 20.2
Spanish 8,049 218,832 27.2 5,632 152,920 27.2 2,208 32,908 27.2 1,209 33,004 27.3
Japanese 10,407 297,885 28.6 7,284 209,094 28.7 1,561 43,832 28.1 1,562 44,959 28.8
US+ES+JP 48,300 1,118,179 23.2 33,804 781,744 23.1 7,246 167,802 23.2 7,250 168,633 23.3

Table 1: Summary of the Shopping queries data set for task 1 (reduced version) - the number of unique queries, the number of judgements, and the average number of judgements per query.

Total Train Public Test Private Test
Language #Queries #Judgements Avg. Depth #Queries #Judgements Avg. Depth #Queries #Judgements Avg. Depth #Queries #Judgements Avg. Depth
English 97,345 1,819,105 18.7 68,139 1,272,626 18.7 14,602 274,261 18.8 14,604 272,218 18.6
Spanish 15,180 356,578 23.5 10,624 249,721 23.5 2,277 53,494 23.5 2,279 53,363 23.4
Japanese 18,127 446,055 24.6 12,687 312,397 24.6 2,719 66,612 24.5 2,721 67,046 24.6
US+ES+JP 130,652 2,621,738 20.1 91,450 1,834,744 20.1 19,598 394,367 20.1 19,604 392,627 20.0

Table 2: Summary of the Shopping queries data set for tasks 2 and 3 (larger version) - the number of unique queries, the number of judgements, and the average number of judgements per query.

Note that we will provide two test sets. One is given to the participants (Public Test set) along with the training data and the performance of the models developed by the participants will be shown on the leaderboard. We have also built a holdout Private Test set with a similar distribution.

Towards the end of the competition, participants will need to submit their final models on the site. The models will be evaluated on the private test set by automated evaluators hosted on the AIcrowd platform. Teams can improve their solutions and submit improved versions of their models, but the leaderboard on this private test set will remain private until the end of the competition. We hope this will make the models more generalizable and perform well on unseen test data (not fine-tuned for a specific test set). The final ranking of the teams will be based exclusively on the results on the Private Test data set.

## 💡 Baseline Methods

In order to ensure the feasibility of the proposed tasks, we provide the results obtained by standard baseline models run on these data sets. For the first task (ranking), we have run basic retrieval models (such as BM25) along with a BERT model. For the remaining two tasks (classification), we provide the results of multilingual BERT-based models as the initial baselines.

## 📅 Timeline

• Start of the competition: March 15, 2022
• Dataset release: March 28, 2022 (updated from March 25)
• Initial submissions open: March 28, 2022 (updated from March 25)
• Baselines published: April 5, 2022
• Code submissions: May 20, 2022
• 🥶 Team freeze deadline: July 1, 2022
• Competition ends: July 15, 2022
• Announcement of winners: July 22, 2022
• Workshop paper (online) submission deadline: Aug 1, 2022
• KDD Cup Workshop: Aug 15, 2022

## 🏆 Prizes

There are prizes for all three tasks. For each task, the top three positions on the leaderboard win the following cash prizes:

• First place: $4,000
• Second place: $2,000
• Third place: $1,000

AWS Credits: For each of the three tasks, the teams/participants that finish between the 4th and 10th position on the leaderboard will receive AWS credits worth $500.

SIGKDD Workshop: Along with that, the top-3 teams from each task and a few selected other teams in the top-10 will have an opportunity to showcase their work to the research community at the KDD Cup Workshop.

### Community Contribution Prizes

We are excited to announce the following Community Contribution Prizes:

More details about the Community Contribution Prizes are available here.

## 🏆 KDD Cup Workshop

The KDD Cup workshop will be held in conjunction with the KDD conference on August 15th, 2022. The selected winners will have an opportunity to present their work at this venue.

## 🗂 Rules

You can find the challenge rules over here. Please read and accept the rules to participate in this challenge.

## 📱 Contact

The contact email is: esci-challenge@amazon.com

Organizers of this competition are:

• Lluis Marquez
• Fran Valero
• Nikhil Rao
• Hugo Zaragoza
• Arnab Biswas
• Anlu Xing
• Chandan K Reddy

## 🤝 Acknowledgements

We thank Sahika Genc for helping us establish the AIcrowd partnership. A special thanks to our partners in AWS, Paxton Hall and Cameron Peron, for supporting with the AWS credits for 21 winning teams.
