Problem Statements


πŸ“° News

(April 2nd, 2024) We have received many enquiries along the lines of 'Why did my submission fail?'. To address the most common questions, we have compiled an FAQ and a checklist to go through before you submit. Please review the checklist for common errors and our suggestions.

(April 2nd, 2024) We would like to remind participants that Mistral models are slower at inference than the baseline, Vicuna, and may run into prediction timeout issues. Please pay attention when you use Mistral or related models.

(March 30th, 2024) We have observed that some teams are making a large number of failed submissions. We are therefore enforcing a new rule limiting the number of failed submissions. After reaching the limit, your team will be suspended from submitting for a week, until the limit refreshes.

  • For Tracks 1-4, each team can make up to 20 failures summed over all 4 tracks.
  • For Track 5 All-around, each team can make 3 failures per week.

We set this limit to encourage participants to fully test their solutions before submission, instead of using our system for debugging. However, we are aware that submission failures do happen, and sometimes no error message is given. We are working on improving this experience.

(March 26th, 2024) We have released a larger development set with more tasks here. The new column 'track' indicates which track each question comes from. Feel free to explore!

(March 25th, 2024) Finally, after arduous testing, we are ready to receive GPU submissions to Tracks 1-5! As a warm-up period, in the first 1-2 weeks we will apply a tight submission limit.

  • For Tracks 1-4, each team can make 2 submissions per week per track.
  • For Track 5 All-around, each team can make 1 submission per week.
  • We allow a maximum of 15 seconds for each sample prediction. For your information, the baseline, Vicuna-7B, takes up to 3 seconds per sample.
  • The maximum size of a submission is 25GB (including model checkpoints). We will scale it up as the competition progresses. 25GB would be enough for a 7B model checkpoint. 
  • Network connection is currently disabled for all websites (including Hugging Face). We are working on a model cache that you can use to build your solutions. For now, please use git lfs to upload your checkpoints.

We will see how the cluster performs and ramp up the limits as the challenge progresses. Read the submission instructions here and the baseline solution here, and let the fun begin!

(March 21st, 2024) We have implemented a simple baseline that prompts Vicuna-7B to generate all solutions. You can find the baseline implementation here.

(March 21st, 2024) We have received many inquiries about the competition dataset. We would like to highlight that we adopt a hidden-test-set setting, where the test dataset is hidden from participants. We do provide a few-shot development dataset for you to explore the formats of the questions.

πŸ“„ External Resource

Since we do not provide large-scale training datasets, all solutions will have to rely heavily on external resources. We would like to highlight that all solutions submitted to this challenge must be based on resources (e.g. datasets and models) that are publicly available. Submissions must not contain proprietary data or model checkpoints. Participants may paraphrase or extend existing datasets (e.g. via manual labeling, or labeling/generation with GPT), but should make their extended datasets available after the competition.

We list some public resources that may be helpful to your solutions.

  • ECInstruct: An instruction tuning dataset also based on Amazon raw data.
  • Amazon-M2: A multi-lingual Amazon session dataset with rich meta-data used for KDD Cup 2023.
  • Amazon-ESCI: A multi-lingual Amazon query-product relation dataset used for KDD Cup 2022.

Challenge at a Glance

Imagine you're trying to find the perfect gift for a friend's birthday through an online store. You have to browse countless products, read reviews to gauge quality, compare prices, and finally decide on a purchase. This process is time-consuming and can be overwhelming given the sheer volume of information and options: a web of products, reviews, and prices to navigate while trying to make the best decision according to your own understanding and preferences.

This challenge aims to simplify the process with Large Language Models (LLMs). While current techniques often fall short in understanding the nuances of specific shopping terms and knowledge, customer behaviors, preferences, and the diverse nature of products and languages, we believe that LLMs, with their multi-task and few-shot learning abilities, have the potential to master such complexities of online shopping. Motivated by the potential, this challenge introduces ShopBench, a comprehensive benchmark that mimics these real-world online shopping complexities. We invite participants to design powerful LLMs to improve how state-of-the-art techniques can better assist us in navigating online shopping, making it a more intuitive and satisfying experience, much like a knowledgeable shopping assistant would in real life.

πŸ›οΈ Introduction

Online shopping is complex, involving various tasks from browsing to purchasing, all requiring insights into customer behavior and intentions. This necessitates multi-task learning models that can leverage shared knowledge across tasks. Yet, many current models are task-specific, increasing development costs and limiting effectiveness. Large language models (LLMs) have the potential to change this by handling multiple tasks through a single model with minor prompt adjustments. Furthermore, LLMs can also improve customer experiences by providing interactive and timely recommendations. However, online shopping, as a highly specialized domain, features a wide range of domain-specific concepts (e.g. brands, product lines) and knowledge (e.g. which brand produces which products), making it challenging to adapt existing powerful LLMs from general domains to online shopping.

Motivated by the potential and challenges of LLMs, we present ShopBench, a massive challenge for online shopping, with 57 tasks and ~20,000 questions derived from real-world Amazon shopping data. All questions in this challenge are re-formulated into a unified text-to-text generation format to accommodate the exploration of LLM-based solutions. ShopBench focuses on four key shopping skills (which serve as Tracks 1-4):

  • shopping concept understanding
  • shopping knowledge reasoning
  • user behavior alignment
  • multi-lingual abilities

In addition, we set up Track 5: All-around to encourage even more versatile and all-around solutions. Track 5 requires participants to solve all questions in Tracks 1-4 with a single solution, which is expected to be more principled and unified than track-specific solutions to Tracks 1-4. We will correspondingly assign larger awards to Track 5.

We hope that this challenge can provide participants with valuable hands-on experiences in developing state-of-the-art LLM-based techniques for real-world problems. We also believe that the challenge will benefit the industry of online user-oriented services with strong and ready-to-use LLM-based solutions, as well as the whole machine learning community with helpful insights and guidelines on LLM training and development.

πŸ“… Timeline

There will be two phases in the challenge. Phase 1 will be open to all teams who sign up. After Phase 1, we will apply a top k% cutoff, and only teams in the top k% of Phase 1 will proceed to Phase 2. We will keep you updated about the value of k as we reach a steady number of participants.

Correspondingly, ShopBench will be split into two disjoint test sets, with Phase 2 containing harder samples and tasks. The final winners will be determined solely with Phase 2 data.

  • Website Online and Registration Begin: 15th March, 2024 23:55 UTC
  • Phase 1 Start Date: 18th March, 2024 23:55 UTC
  • Entry Freeze Deadline and Phase 1 End Date: 10th May, 2024 23:55 UTC
  • Phase 2 Start Date: 15th May, 2024 23:55 UTC
  • End Date: 10th July, 2024 23:55 UTC
  • Winner Notification: 15th July, 2024
  • Winner Announcement: 26th August, 2024 (At KDD 2024)

πŸ† Prizes

The challenge carries a prize pool of $41,500 categorized into the following three types of prizes:

  • Winner Prizes: We will award winners (first, second, and third places) in each track with cash prizes.
  • AWS Credits: Teams immediately after the winners in each track will be awarded with AWS credits.
  • Student Awards: We are aware that developing LLMs requires significant computational resources and engineering effort, which students often cannot access. Therefore, we set up a dedicated Student Award for the best all-student team (i.e. every participant is a student) in each track, to motivate students to develop resource-efficient solutions.

Specifically, Tracks 1-4 carry the following prizes:

  • πŸ₯‡ First place: $2,000
  • πŸ₯ˆ Second place: $1,000
  • πŸ₯‰ Third place: $500
  • 4th-7th places: AWS Credit $500
  • πŸ… Student Award: $750

Track 5 (all-around) carries the following prizes:

  • πŸ₯‡ First place: $7,000
  • πŸ₯ˆ Second place: $3,500
  • πŸ₯‰ Third place: $1,500
  • 4th-8th places: AWS Credit $500
  • πŸ… Student Award: $2,000

All awards are cumulative. For example, if your solution ranks 2nd in Track 5 all-around and also ranks 3rd in Track 4, you can get a total cash prize of $3,500 + $500 = $4,000. However, Track 5 solutions will not automatically be considered for Tracks 1-4. You have to make a submission to a track to be eligible for its prizes.

In addition to cash prizes, the winning teams will also have the opportunity to present their work at the KDD Cup workshop 2024, held in conjunction with ACM SIGKDD 2024.

πŸ“Š Dataset

ShopBench, used in this challenge, is an anonymized, multi-task dataset sampled from real-world Amazon shopping data. Statistics of ShopBench are given in the following table.

| # Tasks | # Questions | # Products | # Product Categories | # Attributes | # Reviews | # Queries |
|---------|-------------|------------|----------------------|--------------|-----------|-----------|
| 57      | 20,598      | ~13,300    | 400                  | 1,032        | ~11,200   | ~4,500    |

ShopBench is split into a few-shot development set and a test set to better mimic real-world applications, where you never know the customer's questions beforehand. With this setting, we encourage participants to use any publicly available resource (e.g. pre-trained models, text datasets) to construct their solutions, rather than overfitting the given development data (e.g. by generating pseudo data samples with GPT).

The development datasets will be given in json format with the following fields.

  • input_field: This field contains the instructions and the question that should be answered by the model.
  • output_field: This field contains the ground truth answer to the question.
  • task_type: This field contains the type of the task (Details in the next Section, "Tasks")
  • task_name: This field contains the name of the task. However, the exact task names are redacted. We provide hashed task names instead (e.g. task1, task10).
  • metric: This field contains the metric used to evaluate the question (Details in Section "Evaluation Metrics").
  • track: This field specifies the track the question comes from.

However, the test dataset (which will be hidden from participants) will have a different format with only two fields:

  • input_field, which is the same as above.
  • is_multiple_choice: This field contains a True or False that indicates whether the question is a multiple choice or not. The detailed 'task_type' will not be given to participants.
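As a concrete illustration, development-set records can be loaded and grouped by track as sketched below. The two records are invented examples that only mirror the schema described above; they are not real ShopBench data.

```python
from collections import Counter

# Illustrative records matching the development-set schema above
# (invented examples, not real ShopBench data).
dev_records = [
    {
        "input_field": "Which of the following is a coffee brand? A. Lavazza B. Dyson",
        "output_field": "A",
        "task_type": "multiple_choice",
        "task_name": "task1",
        "metric": "accuracy",
        "track": "track1",
    },
    {
        "input_field": "Rank the products by relevance to the query 'wireless mouse'.",
        "output_field": "2,1,3",
        "task_type": "ranking",
        "task_name": "task10",
        "metric": "ndcg",
        "track": "track3",
    },
]

# Group questions by track, e.g. to build track-specific prompt templates.
per_track = Counter(rec["track"] for rec in dev_records)
print(per_track)
```

In the real development file, the same grouping works unchanged once the records are loaded with `json.load`.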

πŸ‘¨β€πŸ’»πŸ‘©β€πŸ’» Tasks

ShopBench is constructed to evaluate four important shopping skills, which correspond to Tracks 1-4 of the challenge.

  • Shopping Concept Understanding: There are many domain-specific concepts in online shopping, such as brands, product lines, etc. Moreover, these concepts often exist in short texts, such as queries, making it even more challenging for models to understand them without adequate contexts. This skill emphasizes the ability of LLMs to understand and answer questions related to these concepts.
  • Shopping Knowledge Reasoning: Complex reasoning with implicit knowledge is involved when people make shopping decisions, such as numeric reasoning (e.g. calculating the total amount of a product pack), multi-step reasoning (e.g. identifying whether two products are compatible with each other). This skill focuses on evaluating the model's reasoning ability on products or product attributes with domain-specific implicit knowledge.
  • User Behavior Alignment: User behavior modeling is of paramount importance in online shopping. However, user behaviors are highly diverse, including browsing, purchasing, query-then-clicking, etc. Moreover, most of them are implicit and not expressed in texts. Therefore, aligning with heterogeneous and implicit shopping behaviors is a unique challenge for language models in online shopping, which is the primary aim of this track.
  • Multi-lingual Abilities: Multi-lingual models are especially desired in online shopping as they can be deployed in multiple marketplaces without re-training. Therefore, we include a separate multi-lingual track, including multi-lingual concept understanding and user behavior alignment, to evaluate how a single model performs in different shopping locales without re-training.

In addition, we set up Track 5: All-around, requiring participants to solve all questions in Tracks 1-4 with a unified solution, to further emphasize the generalizability and versatility of the solutions.

ShopBench involves a total of 5 types of tasks, all of which are re-formulated to text-to-text generation to accommodate LLM-based solutions.

  • Multiple Choice: Each question is associated with several choices, and the model is required to output a single correct choice.
  • Retrieval: Each question is associated with a requirement and a list of candidate items, and the model is required to retrieve all items that satisfy the requirement.
  • Ranking: Each question is associated with a requirement and a list of candidate items, and the model is required to re-rank all items according to how each item satisfies the requirement.
  • Named Entity Recognition: Each question is associated with a piece of text and an entity type. The model is required to extract all phrases from the text that fall in the entity type.
  • Generation: Each question is associated with an instruction and a question, and the model is required to generate text pieces following the instruction to answer the question. There are multiple types of generation questions, including extractive generation, translation, elaboration, etc.

To test the generalization ability of the solutions, the development set covers only a subset of the 57 tasks, leaving some tasks unseen throughout the challenge. However, all 5 task types are covered in the development set to help participants understand the prompts and output formats.
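Since the hidden test set exposes only `input_field` and `is_multiple_choice`, a solution's prompt construction can branch on that flag alone. The template below is a minimal, hypothetical sketch, not the official baseline's wording:

```python
def build_prompt(record: dict) -> str:
    """Wrap a test record's input_field in a simple zero-shot prompt.

    The test set exposes only `input_field` and `is_multiple_choice`,
    so this flag is the only task signal available at inference time.
    The instruction suffixes here are illustrative assumptions.
    """
    if record["is_multiple_choice"]:
        return record["input_field"] + "\nAnswer with the choice letter only."
    return record["input_field"] + "\nAnswer concisely."

example = {
    "input_field": "Which brand makes the iPhone? A. Apple B. Samsung",
    "is_multiple_choice": True,
}
print(build_prompt(example))
```

Keeping multiple-choice answers to a single letter also makes them easy for the rule-based answer parsers (described under Evaluation) to score.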

πŸ–Š Evaluation Framework

Evaluation Protocol

To ensure a thorough and unbiased evaluation, the challenge uses a hidden test set that will remain undisclosed to participants to prevent manual labeling or manipulation, and to promote generalizable solutions.

Evaluation Metrics

ShopBench includes multiple types of tasks, each requiring specific metrics for evaluation. The metrics selected are as follows:

  • Multiple Choice: Accuracy is used to measure the performance for multiple choice questions.
  • Ranking: Normalized Discounted Cumulative Gain (NDCG) is used to evaluate ranking tasks.
  • Named Entity Recognition (NER): Micro-F1 score is used to assess NER tasks.
  • Retrieval: Hit@3 is used to assess retrieval tasks. The number of positive samples does not exceed 3 for any question in ShopBench.
  • Generation: Metrics vary based on the task type:
      • Extraction tasks (e.g., keyphrase extraction) use ROUGE-L.
      • Translation tasks use the BLEU score.
      • For other generation tasks, we employ a Sentence Transformer to compute sentence embeddings of the generated text x_gen and the ground-truth text x_gt. We then use the cosine similarity between x_gen and x_gt (clipped to [0, 1]) as the metric. This approach focuses the evaluation on text semantics rather than token-level accuracy.
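The embedding-based generation metric above can be sketched in plain Python. The toy vectors here stand in for Sentence Transformer embeddings (in practice produced by something like `model.encode(text)`):

```python
import math

def clipped_cosine(e_gen, e_gt):
    """Cosine similarity between two embedding vectors, clipped to [0, 1]."""
    dot = sum(a * b for a, b in zip(e_gen, e_gt))
    norm = (math.sqrt(sum(a * a for a in e_gen))
            * math.sqrt(sum(b * b for b in e_gt)))
    cos = dot / norm
    return min(max(cos, 0.0), 1.0)

# Identical embeddings score 1.0; opposed embeddings are clipped to 0.0.
print(clipped_cosine([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))
print(clipped_cosine([1.0, 0.0], [-1.0, 0.0]))
```

Clipping to [0, 1] keeps this metric on the same scale as accuracy, NDCG, and the other per-task metrics, so scores can be averaged directly.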

As all tasks are converted into text generation tasks, rule-based parsers will parse the answers from participants' solutions. Answers that parsers cannot process will be scored as 0. The parsers will be available to participants.

Since all these metrics take values in [0, 1], we calculate the unweighted average of the metrics over all tasks within each track (macro-averaging) to determine the overall score for that track and identify track winners. Track 5 applies the same rule: the metrics of all tasks across Tracks 1-4 are macro-averaged (rather than averaging the four track scores).
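A track score is thus the unweighted mean of its per-task metrics. With hypothetical task names and numbers:

```python
from statistics import mean

# Per-task metric values for one track (illustrative numbers only;
# real task names are hashed, e.g. task1, task10).
task_scores = {"task1": 0.62, "task7": 0.48, "task12": 0.55}

# Macro-averaging: every task counts equally, regardless of how many
# questions it contains.
track_score = mean(task_scores.values())
print(round(track_score, 4))  # -> 0.55
```

Because tasks are weighted equally, a solution cannot boost its track score by excelling only on the tasks with the most questions.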

πŸš€ Baseline Solutions

We tested ShopBench with baseline solutions to gauge the feasibility of the challenge. A pipeline was developed to prompt LLMs to answer questions in a zero-shot approach, serving as an initial guide for participants. Results for an open-source LLM, Vicuna-7B, and two proprietary LLMs, Claude 2 and Amazon Titan, are presented in the following table.

| Model | Track 1: Shopping Concept Understanding | Track 2: Shopping Knowledge Reasoning | Track 3: User Behavior Alignment | Track 4: Multi-lingual Abilities | Track 5: All-around |
|-------|------|------|------|------|------|
| Vicuna-7B-v1.5 | 0.5273 | 0.4453 | 0.4103 | 0.4382 | 0.4785 |
| Claude 2       | 0.7511 | 0.6382 | 0.6322 | 0.6524 | 0.6960 |
| Amazon Titan   | 0.6105 | 0.4500 | 0.5063 | 0.5531 | 0.5556 |

Vicuna-7B demonstrates the challenge's feasibility with non-trivial scores across all tracks using zero-shot prompts. Moreover, the comparison between Vicuna-7B and Claude 2 reveals a considerable performance gap (approximately 0.2 across all tracks), indicating potential for improvement from the baseline. We encourage participants to develop effective solutions to close or even eliminate the gap.

Note: Both Amazon Titan and Claude 2 (and even Claude 3) are accessible through AWS Bedrock. We will host a tutorial on how to use AWS Bedrock in late March, and will also distribute a small amount of credits for each team to get hands-on. Please stay tuned!

πŸ—ƒοΈ Submission

The challenge is evaluated as a code competition. Participants must submit their code and essential resources, such as fine-tuned model weights and indices for Retrieval-Augmented Generation (RAG), which will be run on our servers to generate predictions for evaluation.

Submission Instructions

For submission instructions, please see the starter kit and the submission guideline.

Hardware and System Configuration

We apply a limit on the hardware available to each participant to run their solutions. Specifically,

  • All solutions will be run on AWS g4dn.12xlarge instances equipped with NVIDIA T4 GPUs.
  • Solutions for Phase 1 will have access to 2 x NVIDIA T4 GPU.
  • Solutions for Phase 2 will have access to 4 x NVIDIA T4 GPU. Please note that the NVIDIA T4 uses a somewhat outdated architecture and is thus not compatible with certain acceleration toolkits (e.g. Flash Attention), so please check compatibility carefully.

Besides, the following restrictions will also be imposed.

  • Network connection will be disabled.
  • Each submission will be assigned a certain amount of time to run. Submissions that exceed the time limits will be killed and will not be evaluated. The tentative time limit is set as follows.
| Phase   | Track 1     | Track 2    | Track 3    | Track 4    | Track 5 |
|---------|-------------|------------|------------|------------|---------|
| Phase 1 | 140 minutes | 40 minutes | 60 minutes | 60 minutes | 5 hours |

For reference, the baseline solution with zero-shot Vicuna-7B (Find it here) consumes the following amount of time.

| Phase   | Track 1     | Track 2    | Track 3     | Track 4     |
|---------|-------------|------------|-------------|-------------|
| Phase 1 | ~50 minutes | ~3 minutes | ~25 minutes | ~35 minutes |
  • Each team will be able to make up to 2 submissions per week for each of Tracks 1-4, and 1 Track 5 all-around submission per week.

Based on the hardware and system configuration, we recommend that participants begin with 7B models. According to our experiments, 7B models like Vicuna-7B and Mistral can run inference smoothly on 2 NVIDIA T4 GPUs (inference with Mistral is slower and may run into timeouts), while 13B models will result in out-of-memory (OOM) errors.
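A back-of-the-envelope fp16 memory estimate shows why: at 2 bytes per parameter, a 7B model's weights alone take about 13 GiB, which splits comfortably across two 16 GiB T4s, while a 13B model's ~24 GiB of weights leaves little headroom for activations and the KV cache. A sketch of the arithmetic:

```python
def fp16_weight_gib(n_params_billion: float) -> float:
    """Rough GiB needed for fp16 weights alone (2 bytes per parameter).

    Activations, the KV cache, and framework overhead add several
    more GiB on top, so this is a lower bound on the VRAM required.
    """
    return n_params_billion * 1e9 * 2 / 2**30

print(round(fp16_weight_gib(7), 1))   # -> 13.0 GiB
print(round(fp16_weight_gib(13), 1))  # -> 24.2 GiB
```

Quantized checkpoints (e.g. 8-bit or 4-bit) shrink the weight footprint further, at some cost in accuracy; they also count toward the 25GB submission size limit.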

Evaluation and Leaderboard

Submissions are evaluated on the undisclosed test set; the results populate a live leaderboard and determine the final winners.

Use of External Resources

By providing only a few-shot development set, we encourage participants to exploit public resources to build their solutions. However, participants should ensure that the datasets and models they use are publicly available and equally accessible to all participants. This constraint rules out proprietary datasets and models held by large corporations. Participants are allowed to re-formulate existing datasets (e.g. adding extra data/labels manually or with ChatGPT), but should make them publicly available after the competition.

Technical Report and Code Submission

Upon the end of the competition, we will notify potential winners, who will be required to submit a technical report describing their solution, along with the code necessary to reproduce it. The organizers will review the submitted materials to check whether the solution follows the rules of the challenge. Teams whose solutions pass the review will get the chance to present them at the KDD Cup 2024 Workshop.

πŸ›οΈ KDD Cup Workshop

KDD Cup is an annual data mining and knowledge discovery competition organised by the Association for Computing Machinery's Special Interest Group on Knowledge Discovery and Data Mining (ACM SIGKDD). The competition aims to promote research and development in data mining and knowledge discovery by providing a platform for researchers and practitioners to share their innovative solutions to challenging problems in various domains. The KDD Cup Workshop 2024 will be held in Barcelona, Spain, from Sunday, August 25, 2024, to Thursday, August 29, 2024, in conjunction with ACM SIGKDD 2024.

πŸ“± Contact

Please use yilun.jin@connect.ust.hk and kddcup2024@amazon.com for all communication to reach the Amazon KDD cup 2024 team.

Organizers of this competition primarily come from the Amazon Rufus Team.

🀝 Acknowledgements

We thank our partner at AWS, Paxton Hall, for supporting the AWS credits for the winning teams and the competition.


