AIcrowd | Meta Comprehensive RAG Benchmark: KDD Cup 2024

Problem Statements

Meta KDD Cup 24 - CRAG - Retrieval Summarization

Testing RAG Systems with Limited Web Pages

Meta KDD Cup 24 - CRAG - Knowledge Graph and Web Retrieval

Evaluating RAG Systems With Mock KGs and APIs

Meta KDD Cup 24 - CRAG - End-to-End Retrieval-Augmented Generation

Enhance RAG systems With Multiple Web Sources & Mock API

🌟 Introducing the Meta Comprehensive RAG Benchmark Challenge! 🌟

🗞️ News

(Jun 21st, 2024) ‼️ ⏰ Deadline Extended for Submission Selection Form (June 22nd 12:00 UTC): Fill the form to select submission ID
(Jun 19th, 2024) ‼️ Select Submission ID for Final Evaluation: Fill the form to select submission ID.

(Jun 12th, 2024) 🏁Baseline for Task 2: We have released a baseline for Task 2 in the Starter kit. Check out the KG Baseline.

(Jun 11th, 2024) 📜CRAG paper featured by Hugging Face Daily Papers: Our paper about CRAG has been featured by Hugging Face as Daily Papers.

(May 16th, 2024) 📚Submission limit in Phase 1b: We have increased the submission limit to 10 submissions/week in Phase 1b.

(May 14th, 2024) 🚀Announcements: New batch prediction interface launched and Phase 1 extended to May 27, 2024, with V3 dataset release and updated baselines.

(May 10th, 2024) 📊Data updated to V3: We have updated Task 1 and Task 3 data to V3. V3 added alternative answers (alt_ans) to the questions and fixed ~100 questions or answers that contained errors in V2.

(May 7th, 2024) 📙Test Set in Phase 2: We will soon switch to using an unreleased private test set for the leaderboard and the Phase 2 competition.

(May 6th, 2024) 🔑Phase 2 Entry: All teams that have at least one successful submission in Phase 1 can enter Phase 2.

(April 24th, 2024) 🔣 Addition of query_time to the generate_answer Interface, and interim increase of timeouts to 30s!

(April 23rd, 2024) 🧳Llama 3 Models: Participants can use Llama 3 Models to build their RAG solutions. Llama 3 models can be downloaded here.

(April 22nd, 2024) 🗒️Office hours: We will host an office hour on Apr 23 2024, 6–7pm PST. Please join to share your questions.

(April 19th, 2024) 📚Submission limit in Phase 1: We have increased the submission limit to 6 submissions/week in Phase 1.

(April 14th, 2024) 📊Data updated to V2: We have updated Task 1 and Task 3 data to V2. V2 replaced low quality questions and fixed some ground truth answers.

(April 8th, 2024) 🏁Baselines available: We have released two baselines which are submission ready.

(April 1st, 2024) 🚀Submissions are open now. And our 🚀Starter Kit is available to help you quickly onboard and make the first submission.

💬 Introduction

How often do you encounter hallucinated responses from LLM-based AI agents? How can we make LLMs trustworthy in providing accurate information? Despite the advancements of LLMs, the issue of hallucination persists as a significant challenge; that is, LLMs may generate answers that lack factual accuracy or grounding. Studies have shown that GPT-4's accuracy in answering questions referring to slow-changing or fast-changing facts is below 15% [1]; even for stable (never-changing) facts, GPT-4's accuracy in answering questions referring to torso-to-tail (less popular) entities is below 35% [2].

Retrieval-Augmented Generation (RAG) has recently emerged as a promising solution to alleviate LLMs' lack of knowledge and has attracted a lot of attention from both academic research and industry. Given a question, a RAG system searches external sources to retrieve relevant information, and then provides grounded answers; see figure below for an illustration.

Despite its potential, RAG still faces many challenges, like selecting the most relevant information to ground the answer, reducing question answering latency, and synthesizing information to answer complex questions, urging research and development in this domain. The Meta Comprehensive RAG Challenge (CRAG) aims to provide a good benchmark with clear metrics and evaluation protocols, to enable rigorous assessment of the RAG systems, drive innovations, and advance the solutions.

💻 What is the Comprehensive RAG (CRAG) Benchmark?

The Comprehensive RAG (CRAG) Benchmark evaluates RAG systems across five domains and eight question types, and provides a practical set-up to evaluate RAG systems. In particular, CRAG includes questions with answers that change over periods ranging from seconds to years; it considers entity popularity and covers not only head, but also torso and tail facts; it contains simple-fact questions as well as 7 types of complex questions such as comparison, aggregation and set questions to test the reasoning and synthesis capabilities of RAG solutions.

📅 Timeline

There will be two phases in the challenge. Phase 1 will be open to all teams who sign up. All teams that have at least one successful submission in Phase 1 can enter Phase 2.

Phase 1: Open Competition

Website Open, Data Available, and Registration Begin: March 20, 2024, 23:55 UTC
Phase 1 Submission Start Date: April 1, 2024, 23:55 UTC
Phase 1 Submission End Date: May 27, 2024, 23:55 UTC

Phase 2: Competition for Top Teams

Phase 2 Start Date: May 28, 2024, 23:55 UTC
Registration and Team Freeze Deadline: May 31, 2024, 23:55 UTC
Phase 2 End Date: June 20, 2024, 23:55 UTC

Winners Announcement

Winner Notification: July 15, 2024
Winner Public Announcement: August 26, 2024 (At KDD Cup Winners event)

🏆 Prizes

The challenge boasts a prize pool of USD 31,500. There are prizes for all three tasks. For each task, the following teams will win cash prizes:

🥇 First Place: $4,000
🥈 Second Place: $2,000
🥉 Third Place: $1,000
💐 First Place for each of the 7 complex question types: $500

The first, second, and third prize winners are not eligible to win any prize based on a complex question type on the same task.

💻 META Comprehensive RAG Challenge

A RAG QA system takes a question Q as input and outputs an answer A; the answer is generated by LLMs according to information retrieved from external sources, or directly from the knowledge internalized in the model. The answer should provide useful information to answer the question, without adding any hallucination or harmful content such as profanity.

🏹 Challenge Tasks

This challenge comprises three tasks designed to improve question-answering (QA) systems.

TASK #1: WEB-BASED RETRIEVAL SUMMARIZATION
Participants receive 5 web pages per question, potentially containing relevant information. The objective is to measure the systems' capability to identify and condense this information into accurate answers.

TASK #2: KNOWLEDGE GRAPH AND WEB AUGMENTATION
This task introduces mock APIs to access information from underlying mock Knowledge Graphs (KGs), with structured data possibly related to the questions. Participants use mock APIs, inputting parameters derived from the questions, to retrieve relevant data for answer formulation. The evaluation focuses on the systems' ability to query structured data and integrate information from various sources into comprehensive answers.

TASK #3: END-TO-END RAG
The third task increases complexity by providing 50 web pages and mock API access for each question, which contain both relevant information and noise. It assesses the systems' skill in selecting the most important data from a larger set, reflecting the challenges of real-world information retrieval and integration.

Each task builds upon the previous, steering participants toward developing sophisticated end-to-end RAG systems. This challenge showcases the potential of RAG technology in navigating and making sense of extensive information repositories, setting the stage for future AI research and development breakthroughs.

💯 Evaluation Metrics

RAG systems are evaluated using a scoring method that measures response quality to questions in the evaluation set. Responses are rated as perfect, acceptable, missing, or incorrect:

Perfect: The response correctly answers the user question and contains no hallucinated content.
Acceptable: The response provides a useful answer to the user question, but may contain minor errors that do not harm the usefulness of the answer.
Missing: The response does not provide the requested information, for example “I don’t know”, “I’m sorry I can’t find …”, or similar sentences that give no concrete answer to the question.
Incorrect: The response provides wrong or irrelevant information to answer the user question.

Scores are given as follows: perfect = 1 point, acceptable = 0.5 points, missing = 0 points, and incorrect = -1 point. The overall score is a macro-average across all domains, with questions weighted based on type popularity and entity popularity (weights will not be disclosed).
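As a rough illustration of the scoring rule above, the following Python sketch computes a macro-average over domains from per-question ratings. The function name, the input layout, and the optional per-question weights are assumptions made for this example; the real weights are not disclosed, so the sketch defaults to equal weights.

```python
from collections import defaultdict

# Point values taken from the scoring rule above.
POINTS = {"perfect": 1.0, "acceptable": 0.5, "missing": 0.0, "incorrect": -1.0}


def macro_average_score(ratings, weights=None):
    """Average the per-domain weighted mean scores.

    ratings: list of (domain, rating) pairs; rating must be a key of POINTS.
    weights: optional per-question weights (hypothetical, since the real
             weights are not disclosed); defaults to 1.0 for every question.
    """
    if weights is None:
        weights = [1.0] * len(ratings)
    totals = defaultdict(float)  # weighted points per domain
    norms = defaultdict(float)   # total weight per domain
    for (domain, rating), w in zip(ratings, weights):
        totals[domain] += w * POINTS[rating]
        norms[domain] += w
    per_domain = [totals[d] / norms[d] for d in totals]
    return sum(per_domain) / len(per_domain)


example = [
    ("finance", "perfect"),
    ("finance", "incorrect"),
    ("music", "acceptable"),
    ("music", "missing"),
]
# finance mean = 0.0, music mean = 0.25, macro-average = 0.125
print(macro_average_score(example))
```

Because an incorrect answer costs a full point while a missing answer costs nothing, a wrong guess cancels out a perfect answer in the same domain.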

🖊 Evaluation Techniques

This challenge employs both automated (auto-eval) and human (human-eval) evaluations. Auto-eval selects the top ten teams, while human-eval decides the top three for each task. Auto-evaluators will only consider responses that begin within 5 seconds and limit them to 50 tokens to promote concise answers. Complete evaluation of longer responses will be done in the human-eval stage.

Automatic Evaluation: Automatic evaluation employs rule-based matching and GPT-4 assessment to check answer correctness. It will assign one of three scores: correct (1 point), missing (0 points), or incorrect (-1 point).
Human Evaluation: Human annotators will rate each response as Perfect, Acceptable, Missing, or Incorrect. In addition, human evaluators will require basic fluency for an answer to be considered Perfect.

To reduce turnaround time in the auto-eval, for each submission, we will use a random subset containing 20% of the questions to calculate an approximate score. The approximate score will be used for the leaderboard in Phase 1. In Phase 2, if the approximate score is within the top percentage, we will conduct evaluation on the full set.
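The subset-based approximation described above can be sketched as follows. This is a simplified illustration, not the organizers' evaluator: the function name, the fixed seed, and the plain unweighted mean are assumptions, and the real score is additionally macro-averaged and weighted.

```python
import random

# Auto-eval point values from the description above.
AUTO_POINTS = {"correct": 1, "missing": 0, "incorrect": -1}


def approximate_score(labels, fraction=0.2, seed=0):
    """Mean auto-eval score over a random subset of the questions.

    labels: list of auto-eval labels, one per question.
    fraction: share of questions to sample (20% per the text above).
    seed: hypothetical fixed seed so the sketch is reproducible.
    """
    if not labels:
        raise ValueError("labels must not be empty")
    rng = random.Random(seed)
    k = max(1, round(len(labels) * fraction))
    subset = rng.sample(labels, k)  # sampling without replacement
    return sum(AUTO_POINTS[label] for label in subset) / k


labels = ["correct"] * 60 + ["missing"] * 25 + ["incorrect"] * 15
full_score = sum(AUTO_POINTS[label] for label in labels) / len(labels)
print("full-set score:", full_score)
print("approximate score on a 20% subset:", approximate_score(labels))
```

The subset score is an estimate, so two submissions with close full-set scores may swap places on a leaderboard that uses it.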

📙 Evaluation Details

Every query is associated with a query_time (when the query was made). The query_time may affect the answers, in particular for dynamic questions.
All False Premise questions should be answered with the standard response “invalid question”.
All Missing answers should use the standard response “I don't know.”.
The ground truth is the answer that was correct at the point when the question was posed and data were collected.
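One way to apply the two standard responses listed above is a small post-processing step before an answer is returned. The sketch below is an assumption-laden illustration: `finalize_answer` and its `is_false_premise` and `confident` flags are hypothetical names, and how a system decides those flags is left open.

```python
# Standard response strings quoted from the evaluation details above.
INVALID_QUESTION = "invalid question"
DONT_KNOW = "I don't know."


def finalize_answer(candidate, is_false_premise=False, confident=True):
    """Map a candidate answer onto the standard response formats.

    candidate: the answer text produced by the RAG pipeline (may be empty).
    is_false_premise: hypothetical flag set by the system's own detector.
    confident: hypothetical flag; when False the system abstains.
    """
    if is_false_premise:
        return INVALID_QUESTION
    if not confident or candidate is None or not candidate.strip():
        return DONT_KNOW
    return candidate.strip()


print(finalize_answer("  Ang Lee  "))
print(finalize_answer("Some album", is_false_premise=True))
print(finalize_answer("a weak guess", confident=False))
```

Since a missing answer scores 0 and an incorrect answer scores -1, abstaining when the system is unsure is the safer choice under this scoring.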

📊 CRAG Dataset Description

📝 Question Answer Pairs

CRAG includes question-answer pairs that mirror real scenarios. It covers five domains: Finance, Sports, Music, Movies, and Encyclopedia Open domain. These domains represent the spectrum of information change rates—rapid (Finance and Sports), gradual (Music and Movies), and stable (Open domain).

CRAG includes eight types of questions in English:

Simple question: Questions asking for simple facts, such as the birth date of a person and the authors of a book.
Simple question with some condition: Questions asking for simple facts with some given conditions, such as stock price on a certain date and a director's recent movies in a certain genre.
Set question: Questions that expect a set of entities or objects as the answer. An example is what are the continents in the southern hemisphere?
Comparison question: Questions that may compare two entities, such as who started performing earlier, Adele or Ed Sheeran?
Aggregation question: Questions that may need aggregation of retrieval results to answer, for example, how many Oscar awards did Meryl Streep win?
Multi-hop question: Questions that may require chaining multiple pieces of information to compose the answer, such as who acted in Ang Lee's latest movie?
Post-processing question: Questions that need reasoning or processing of the retrieved information to obtain the answer, for instance, How many days did Thurgood Marshall serve as a Supreme Court justice?
False Premise question: Questions that have a false proposition or assumption; for example, What's the name of Taylor Swift's rap album before she transitioned to pop? (Taylor Swift didn't release any rap album.)

📁 Retrieval Contents

The dataset includes web search results and mock KGs to mimic real-world RAG retrieval sources. Web search contents were created by storing up to 50 pages from search queries related to each question. Mock KGs were created using the data behind the questions, supplemented with "hard negative" data to simulate a more challenging retrieval environment. Mock APIs facilitate structured searches within these KGs, and we provide the same API for all five domains to simulate Knowledge Graph access.

📘 Submission and Participation

Participants must submit their code and model weights to run on the host's server for evaluation.

🧭 Model

This KDD Cup requires participants to use Llama models to build their RAG solution. Specifically, participants can use or fine-tune the following Llama 2 or Llama 3 models from https://llama.meta.com/llama-downloads:

Meta-Llama-3-8B
Meta-Llama-3-8B-Instruct
Meta-Llama-3-70B
Meta-Llama-3-70B-Instruct
llama-2-7b
llama-2-7b-chat
llama-2-70b
llama-2-70b-chat

Any non-Llama models used must be under the 1.5B-parameter size limit.

🔨 Hardware and system configuration

We set a limit on the hardware available to each participant to run their solution. Specifically, all submissions will be run on an AWS G4dn.12xlarge instance equipped with 4 NVIDIA T4 GPUs, each with 16 GB of GPU memory. Please note that:

llama-2-7b and llama-2-7b-chat in full precision can run on 2 T4 GPUs.
llama-2-70b and llama-2-70b-chat in full precision cannot be directly run on 4 T4 GPUs. Quantization or other techniques need to be applied to make the model runnable. You may directly use the following quantized llama-2-70b model: Llama-2-70B-GGML, available at https://huggingface.co/TheBloke/Llama-2-70B-GGML.
The NVIDIA T4 does not use the latest GPU architecture and hence might not be compatible with certain acceleration toolkits (e.g. Flash Attention), so please make sure the submitted solution is compatible with this configuration.
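The memory notes above can be checked with back-of-the-envelope arithmetic. The sketch below assumes nominal parameter counts of 7 and 70 billion, takes "full precision" to mean 32-bit weights, and counts the weights only; activations, the KV cache, and framework overhead are ignored, so real requirements are higher.

```python
import math

T4_MEMORY_GB = 16  # per-GPU memory stated above
NUM_GPUS = 4       # GPUs on the evaluation instance


def weight_memory_gb(params_billion, bits_per_param):
    """Approximate size of the model weights alone, in GB (10^9 bytes)."""
    return params_billion * bits_per_param / 8


for name, params_billion in [("llama-2-7b", 7), ("llama-2-70b", 70)]:
    for bits in (32, 16, 4):
        need = weight_memory_gb(params_billion, bits)
        gpus = math.ceil(need / T4_MEMORY_GB)  # T4s needed for weights only
        verdict = "fits" if gpus <= NUM_GPUS else "does not fit"
        print(f"{name} at {bits}-bit: {need:.1f} GB, "
              f"{gpus} T4(s) for weights, {verdict} on {NUM_GPUS} T4s")
```

Under these assumptions the 7B model needs about 28 GB at 32-bit, which matches the two-GPU note above, while the 70B model needs about 280 GB at 32-bit and 140 GB at 16-bit, both far beyond the 64 GB available, which is why quantization is suggested.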

Moreover, the following restrictions will also be imposed.

Network connection will be disabled.
Each example will have a time-out limit of 10 seconds.
To encourage concise answers, each answer will be truncated to 75 BPE tokens in the auto-eval. In human-eval, graders will check the first 75 BPE tokens to find valid answers, but check the whole response to judge for hallucination.
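To see what the truncation rule means for a submission, the sketch below cuts an answer to a token budget. The exact BPE tokenizer used by the evaluator is not specified here, so whitespace splitting stands in for it; `truncate_answer` and its `tokenize` and `detokenize` parameters are hypothetical names that let a real tokenizer be plugged in.

```python
MAX_TOKENS = 75  # auto-eval limit stated above


def truncate_answer(answer, max_tokens=MAX_TOKENS,
                    tokenize=str.split, detokenize=" ".join):
    """Keep only the first max_tokens tokens of an answer.

    tokenize / detokenize default to whitespace splitting and joining,
    which is only a stand-in for the evaluator's BPE tokenizer.
    """
    tokens = tokenize(answer)
    return detokenize(tokens[:max_tokens])


long_answer = " ".join(f"word{i}" for i in range(100))
short_answer = "Ang Lee"

print(len(truncate_answer(long_answer).split()))  # 75 whitespace tokens kept
print(truncate_answer(short_answer))              # short answers are unchanged
```

Putting the concrete answer at the start of the response keeps it inside the part that both the auto-eval and the human graders look at first.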

🤝 Use of external resources

By providing only a small development set, we encourage participants to exploit public resources to build their solutions. However, participants should ensure that the datasets or models they use are publicly available and equally accessible to all participants. Such a constraint rules out proprietary datasets and models by large corporations. Participants are allowed to re-formulate existing datasets (e.g., adding additional data/labels manually or with Llama models), but award winners are required to make them publicly available after the competition.

🔑 Baseline implementation

We provide baseline RAG implementations based on the llama-2-chat-7b model to help participants onboard quickly.

📘 Participation and Submission

🔑 Registration

Each team can have 1–5 participants. Teams need to register at Link before submitting their solutions. Team membership must be frozen by May 31, 2024, during Phase 2.

🤝 Solution submission

Participants must submit their code and model weights to run on the host's server for evaluation.

Phase I: Each team can make up to 6 submissions per week for all 3 tracks.
Phase II: Each participating team can make up to 6 submissions for all 3 tracks together over the challenge.

💻 Technical report submission

At the end of the competition, we will notify potential winners, who will be required to submit a technical report describing their solutions, as well as the code necessary to reproduce them. The organizers will review eligibility and the teams’ submitted contents to verify compliance with the rules of the challenge. Winning teams who comply with the rules may be invited to present their work at the KDD Cup 2024 Workshop (see rules for more details).

🏛️ KDD Cup Workshop

KDD Cup is an annual data mining and knowledge discovery competition organized by the Association for Computing Machinery's Special Interest Group on Knowledge Discovery and Data Mining (ACM SIGKDD). The competition aims to promote research and development in data mining and knowledge discovery by providing a platform for researchers and practitioners to share their innovative solutions to challenging problems in various domains. The KDD Cup 2024 will be held in Barcelona, Spain, from Sunday, August 25, 2024, to Thursday, August 29, 2024.

⛰ What Makes CRAG Stand Out?

Realism: First and foremost, a good benchmark shall best reflect real use cases. In other words, a solution that achieves high metrics in the benchmark shall also perform very well in real scenarios. CRAG's query construction considers smart assistant use cases and is realistic; weights are applied according to complexity type and entity popularity, so that the metrics reflect how well real user needs are satisfied.
Richness: A good benchmark shall contain a diverse set of instance types, covering both common use cases and some complex and advanced use cases, to reveal possible limitations of existing solutions in various aspects. CRAG covers five domains, considers facts of different timeliness (real-time, fast-changing, slow-changing, stable) and different popularities (head, torso, tail), and contains questions of different complexities (from simple facts to requiring reasoning).
Reliability: A good benchmark shall allow reliable assessment of metrics. CRAG has manually verified ground truths; the metrics are carefully designed to distinguish correct, incorrect, and missing answers; automatic evaluation mechanisms are designed and provided; the number of instances allows for statistically significant metrics; and tasks are carefully designed to test different key technical components of the solutions.
Accessibility: CRAG provides not only the problem set and ground truths, but also the mock data sources for retrieval to ensure fair comparisons.

🗂️ Related Work

Benchmark	Web retrieval	KG search	Mock API	Dynamic question	Torso and tail facts	Beyond Wikipedia
QALD-10 [3]	❌	✅	❌	❌	❌	❌
MS MARCO [4]	✅	❌	❌	not explicitly	not explicitly	✅
Natural Questions [5]	✅	❌	❌	not explicitly	not explicitly	❌
RGB [6]	✅	❌	❌	❌	❌	✅
FreshLLM [1]	❌	❌	❌	✅	❌	✅
CRAG	✅	✅	✅	✅	✅	✅

📱 Contact

Please use crag-kddcup-2024@meta.com for all communications to reach the Meta KDD Cup 2024 team.

The organizers of this KDD Cup are scientists and engineers from Meta Reality-Labs and Hong Kong University of Science & Technology (HKUST, HKUST-GZ). They are:

Xiao Yang
Kai Sun
Hao Xin
Yushi Sun
Sajal Choudhary
Yifan Ethan Xu
Nikita Bhalla
Xiangsen Chen
Rongze Daniel Gui
Ziran Will Jiang
Ziyu Jiang
Brian Moran
Chenyu Yang
Hanwen Zha
Nan Tang
Lei Chen
Nicolas Scheffer
Yue Liu
Rakesh Wanga
Anuj Kumar
Xin Luna Dong

Competition rules:

https://www.aicrowd.com/challenges/meta-comprehensive-rag-benchmark-kdd-cup-2024/challenge_rules

References

[1] Tu Vu et al., "FreshLLMs: Refreshing Large Language Models with search engine augmentation", arXiv, 10/2023. Available at: https://arxiv.org/abs/2310.03214

[2] Kai Sun et al., "Head-to-Tail: How Knowledgeable are Large Language Models (LLMs)? A.K.A. Will LLMs Replace Knowledge Graphs?", NAACL, 2024. Available at: https://arxiv.org/abs/2308.10168

[3] Ricardo Usbeck et al., "QALD-10–The 10th challenge on question answering over linked data", Semantic Web Preprint (2023), 1–15. Available at: https://www.semantic-web-journal.net/content/qald-10-%E2%80%94-10th-challenge-question-answering-over-linked-data

[4] Payal Bajaj et al., "MS MARCO: A human generated machine reading comprehension dataset", arXiv, 2016. Available at: https://arxiv.org/abs/1611.09268

[5] Tom Kwiatkowski et al., "Natural questions: a benchmark for question answering research", Transactions of the Association for Computational Linguistics 7 (2019), 453–466. Available at: https://aclanthology.org/Q19-1026/

[6] Jiawei Chen et al., "Benchmarking large language models in retrieval-augmented generation", arXiv preprint arXiv:2309.01431 (2023). Available at: https://arxiv.org/abs/2309.01431