Activity
Challenges Entered
Improve RAG with Real-World Benchmarks | KDD Cup 2025
Latest submissions
| Status | Submission ID |
|---|---|
| graded | 289246 |
| graded | 288918 |
| graded | 288888 |
Improve RAG with Real-World Benchmarks
Latest submissions
| Status | Submission ID |
|---|---|
| failed | 267153 |
| graded | 267152 |
| graded | 266979 |
What data should you label to get the most value for your money?
Latest submissions
| Status | Submission ID |
|---|---|
| graded | 173244 |
3D Seismic Image Interpretation by Machine Learning
Play in a realistic insurance market, compete for profit!
Multi Agent Reinforcement Learning on Trains.
Evaluating RAG Systems With Mock KGs and APIs
Latest submissions
| Status | Submission ID |
|---|---|
| graded | 263889 |
| graded | 263880 |
| failed | 263878 |
Enhance RAG systems With Multiple Web Sources & Mock API
Latest submissions
| Status | Submission ID |
|---|---|
| failed | 267153 |
| graded | 267152 |
| failed | 265531 |
Meta CRAG - MM Challenge 2025
Can you please tell me the default local directory where the model files are downloaded?
5 months ago

Not sure if this helps. You can try local_files_only, or test your code with HF_HUB_OFFLINE=1 set as an environment variable. I observed that some packages use the Hugging Face Hub snapshot method to check whether a repo contains certain files; if local_files_only is not set in the snapshot call, the package will try to make a network request internally and will not get a correct response.
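For reference, a minimal sketch of both approaches (the repo id below is just an example, not the competition's default):

```python
import os

# Fail fast instead of hitting the network when resolving Hub files.
# Must be set before huggingface_hub is imported.
os.environ["HF_HUB_OFFLINE"] = "1"

from huggingface_hub import snapshot_download

# local_files_only=True resolves the repo from the local cache
# (~/.cache/huggingface/hub by default) without any network request.
local_path = snapshot_download(
    repo_id="meta-llama/Llama-2-7b-chat-hf",  # example repo id
    local_files_only=True,
)
print(local_path)  # the local directory holding that model's files
```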
Why do submissions 286836 and 286834 hang?
5 months ago

- AIcrowd | Single-source Augmentation | Submission #286836: this one hangs at the generation stage at 56%.
- AIcrowd | Single-source Augmentation | Submission #286834: this one hangs at the validation stage, with the last log message:

  2025-06-04 13:19:46.032 | INFO | <private_file>:register_agent:285 - Registering agent with oracle...

Note: all the solutions passed our offline evaluation with one L40S GPU.
Image and web search API updates and feedback
6 months ago

Please check and update the quality of the web search index. I observed that many web results for the 0.1.2 validation set are not correct. For example:
```python
results = search_pipeline("Arancini")
results[2]
>>> {'index': 'https://en.wikipedia.org/wiki/Italy, https://en.wikipedia.org/wiki/Arancini_chunk_0',
     'score': 0.5711636543273926,
     'page_name': '',
     'page_snippet': '',
     'page_url': 'https://en.wikipedia.org/wiki/Italy, https://en.wikipedia.org/wiki/Arancini'}
```
What is the reason that a single search result can return two URLs joined by ','? And how can two different webpages have the same score? No matter the scenario (e.g., chunking that makes the content identical), those results should be flattened.
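In the meantime, a minimal client-side workaround could look like the sketch below (`search_pipeline` is the competition's search API from the example above; the flattening rule itself is my own assumption):

```python
def flatten_results(results):
    """Split results whose 'page_url' holds several comma-joined URLs
    into one entry per URL, duplicating the other fields."""
    flat = []
    for r in results:
        for url in r["page_url"].split(","):
            entry = dict(r)
            entry["page_url"] = url.strip()
            flat.append(entry)
    return flat

results = flatten_results(search_pipeline("Arancini"))
```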
Gitlab webpage has SSL certificate problem: certificate has expired
6 months ago

gitlab.aicrowd.com is having an SSL certificate issue: the certificate has expired.
Ground Truth (ans_full) and Auto Evaluation
6 months ago

Dear Organizers,

We have observed many instances where the automatic evaluation script, local_evaluation.py, is not stable when processing short answers. These answers, despite being correct, are often shorter than the provided "ground truth" responses.
For example, in the v0.1.2 validation set:

- Interaction ID: 00663475-7bf0-4c70-bba5-80bd9425082d
- Query: When did this artist release his first studio album?
- Ground Truth: Chuck Berry released his first studio album, After School Session, in 1957.
- Agent Response: 1957
- local_evaluation.py: {'accuracy': False}
We consider the agent's response of "1957" to be correct, yet local_evaluation.py frequently marks it as incorrect. This is because the system prompt reads:
"You are an expert evaluator for question answering systems. "
"Your task is to determine if a prediction correctly answers a question based on the ground truth.\n\n"
"Rules:\n"
"1. The prediction is correct if it captures all the key information from the ground truth.\n"
"2. The prediction is correct even if phrased differently as long as the meaning is the same.\n"
"3. The prediction is incorrect if it contains incorrect information or is missing essential details.\n"
"Output a JSON object with a single field 'accuracy' whose value is true or false."
We are wondering how such cases will be judged in the online auto evaluation, and more specifically, how the organizers will assess them. Personally, I would consider "1957" a perfectly correct answer (score = 1), though it could also be treated as an acceptable answer (score = 0.5).
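For reference, a lenient rule-based pre-check along these lines could be run before the LLM judge; this is a hypothetical helper, not part of the official local_evaluation.py:

```python
def short_answer_match(prediction: str, ground_truth: str) -> bool:
    """Treat a short prediction as correct when it appears verbatim
    inside the longer ground-truth answer."""
    pred = prediction.strip().lower()
    return bool(pred) and pred in ground_truth.lower()

# "1957" is contained in the full ground-truth sentence, so it passes.
assert short_answer_match(
    "1957",
    "Chuck Berry released his first studio album, After School Session, in 1957.",
)
```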
We've noted this is a common occurrence because the ground truth in the current data includes both an answer and a brief reason (ans_full), which differs from last year's format, where we had both short and full answers.
Thank you for your clarification on this matter.
Sincerely,
Fanyou
Please Confirm Submission Limits for Phase 2
6 months ago

Could we have an extra environment, used only for debugging purposes, that does not return a score but checks whether a submission is valid?
💬 Feedback & Suggestions
7 months ago

The Hugging Face dataset crag-mm-2025/crag-mm-single-turn-public:v0.1.1 (link) has a duplicated domain: {2: "plants and gardening", 4: "plants and gardening "}. The label texts differ only by a single space character. This is a potential data-quality issue.
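A quick check for such near-duplicate labels might look like this sketch (the dict literal is taken from the post; how the labels are exposed in the dataset is an assumption):

```python
from collections import defaultdict

# Domain id -> label text, as observed in v0.1.1.
labels = {2: "plants and gardening", 4: "plants and gardening "}

groups = defaultdict(list)
for idx, name in labels.items():
    groups[name.strip()].append((idx, name))

# Report any labels that collide once surrounding whitespace is stripped.
for canonical, variants in groups.items():
    if len(variants) > 1:
        print(f"Duplicate after strip(): {canonical!r} -> {variants}")
```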
What's the G6e instance size for evaluation?
7 months ago

Hi organizers,
I saw that the rules mention:
All submissions will be run on a single G6e instance with a NVIDIA L40s GPU with 48GB of GPU memory on AWS.
I want to understand which size [1] of G6e instance is used. g6e.xlarge? This information will help with our offline evaluation.
[1] Amazon EC2 G6e Instances | Amazon Web Services
Best
Fanyou
Meta Comprehensive RAG Benchmark: KDD Cup 2024
Final Evaluation Process & Team Scores
Over 1 year ago

Can we obtain the full rankings for the 3 main tasks? At the least, I want to understand how far I am from the top teams.
Has the Winner Notification already been sent?
Over 1 year ago

We heard from the organizers by email that some of the human annotations are still ongoing.
Copied from email:
We are still in the middle of annotations for other challenge tasks and will announce winners by email once the annotations are ready. The official winner announcement for the CRAG challenge will be made in early August.
‼️ ⏰ Select Submission ID before 20th June, 2024 23:55 UTC
Over 1 year ago

Can we confirm our final submissions in the Google Form after June 20th (e.g., sometime on June 21st, once the online evaluation has finished)? Some people want to select their final solution based on the Round 2 score. Besides, the evaluation system is currently stuck because many people are submitting at once.
‼️ ⏰ Select Submission ID before 20th June, 2024 23:55 UTC
Over 1 year ago

Can we submit the same submission for all 3 tasks? The aicrowd.json might be the same, but the code can handle all three task settings.
Can we submit a solution that has not been tested in the online evaluation?
Over 1 year ago

Hi,

My online submission quota is low due to debugging. I am wondering whether, for the final submission, we can choose a solution that has not yet been tested online.
Best
Fanyou
Submission Failure due to Private Test: Evaluation timed out 😢
Over 1 year ago

Could you help check the reason for the failure of those submissions:
Those submissions made small changes to previously successful submissions, and all were tested successfully on the provided public dataset. Besides, all of them passed the validation step but got stuck at the start of the evaluations, so there are no progress bars for the evaluations.
Now I can identify which part of my code causes the problem, but I am still not able to reproduce it offline. I hope I can get some error messages from the logs to help me solve the problem.
Best
Fanyou
Phase 1 has released the dataset; how will a cut-off be applied to limit Phase 2?
Over 1 year ago

Hi organizers,

Apparently there are now (April 30th) two teams in Track 1 using the public test set [1] to obtain a nearly full score (~0.98). I am wondering, in this scenario, how a cut-off can be applied in Phase 2. Every participant just needs to upload the public test set to obtain a similarly full score. Is there still a potential cut-off?
[1] What does `split` field mean? - #3 by graceyx.yale
Best
Fanyou
Regarding the maximum number of response tokens for Llama 3
Over 1 year ago

@aicrowd_team Yes, I understand that the code already includes this tokenizer. But Llama 3 has a different vocabulary size (128K vs. 32K): in some cases, its output token count will be smaller than Llama 2's even when the output texts are identical. In terms of model performance, Llama 3 is better (according to the report), and I foresee that people will use it. So I suggest replacing the current tokenizer used for truncating predictions with Llama 3's.
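For illustration, a quick comparison along these lines shows the gap (the repo ids are my assumption, and both repos are gated on Hugging Face):

```python
from transformers import AutoTokenizer

llama2 = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
llama3 = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

text = "The Eiffel Tower was completed in 1889 and is located in Paris, France."

# Llama 3's 128K-entry vocabulary typically encodes the same text into
# fewer tokens than Llama 2's 32K-entry vocabulary, so a fixed token cap
# truncates different amounts of text under the two tokenizers.
print(len(llama2(text)["input_ids"]), len(llama3(text)["input_ids"]))
```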
Regarding the maximum number of response tokens for Llama 3
Over 1 year ago

I want to raise the organizers' attention to the fact that Llama 3 has a larger vocabulary size (128K) compared to Llama 2 (32K). So the rules need to clearly define which tokenizer is used to truncate the response (previously the code used the Llama 2 tokenizer).
Best
Fanyou
Are we allowed to use Llama 3?
Over 1 year ago

Hi Organizers,

Meta has introduced Llama 3, and it is available on Hugging Face. I am wondering if we can use it for the competition. The Llama 3 8B model might be a good choice.
Best
Fanyou
Can we use other LLMs at the training stage?
Over 1 year ago

Hi Organizers,

I want to understand whether we can use other LLMs (not the Llama 2 family) during the training stage, specifically for RLHF and data generation.
Below is the raw requirement for models:
This KDD Cup requires participants to use Llama models to build their RAG solution. Specifically, participants can use or fine-tune the following 4 Llama 2 models from https://llama.meta.com/llama-downloads:
- llama-2-7b
- llama-2-7b-chat
- llama-2-70b
- llama-2-70b-chat
Best
Fanyou
Important Update on Missing/Refusal Rate
5 months ago

The high missing rate needs extra clarification: either treat it as a hard constraint or fuse it into the final metric. As it will significantly impact the strategy of whether to provide an answer, it might be better to extend the competition by one or two weeks. Personally, I do not think it is a good idea to change the rules at this point.