wufanyou
FANYOU WU

Location

Seattle, US

Challenges Entered

Improve RAG with Real-World Benchmarks | KDD Cup 2025

Latest submissions
  • graded 289246
  • graded 288918
  • graded 288888

Improve RAG with Real-World Benchmarks

Latest submissions
  • failed 267153
  • graded 267152
  • graded 266979

What data should you label to get the most value for your money?

Latest submissions
  • graded 173244

Latest submissions
  • graded 195825
  • graded 195824
  • graded 195801

3D Seismic Image Interpretation by Machine Learning

Latest submissions

No submissions made in this challenge.

Play in a realistic insurance market, compete for profit!

Latest submissions

No submissions made in this challenge.

Multi Agent Reinforcement Learning on Trains.

Latest submissions

No submissions made in this challenge.

Evaluating RAG Systems With Mock KGs and APIs

Latest submissions
  • graded 263889
  • graded 263880
  • failed 263878

Enhance RAG systems With Multiple Web Sources & Mock API

Latest submissions
  • failed 267153
  • graded 267152
  • failed 265531
Participant Rating

  • TLab Seismic Facies Identification Challenge
  • ETS-Lab ESCI Challenge for Improving Product Search
  • ETSLab Meta Comprehensive RAG Benchmark: KDD Cup 2024
  • ETSLab Meta CRAG - MM Challenge 2025

Meta CRAG - MM Challenge 2025

Important Update on Missing/Refusal Rate

5 months ago

The high missing rate needs extra clarification, either as a hard constraint or fused into the final metric. Since it will significantly impact the strategy of whether to provide an answer, it might be better to extend the competition by one or two weeks. Personally, I do not think it is a good idea to change the rules at this point.

Can you please tell me the default local directory where the model files are downloaded?

5 months ago

Not sure if this helps. You can try local_files_only, or test your code with HF_HUB_OFFLINE=1 set as an environment variable. I observed that some packages use the Hugging Face Hub snapshot method to check whether a repo contains certain files; if local_files_only is not set in that snapshot call, it will try to make a network request but will not get a correct response.
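
As a minimal sketch of the two workarounds above (the repo id is a placeholder for whatever model you bundle, not the challenge's actual model), assuming the huggingface_hub package:

import os

# Option 1: force offline mode globally before any Hugging Face imports.
os.environ["HF_HUB_OFFLINE"] = "1"

from huggingface_hub import snapshot_download

# Option 2: pass local_files_only explicitly so the call fails fast on a cache
# miss instead of issuing network requests. Replace the placeholder repo id
# with the model you ship inside your submission.
local_path = snapshot_download(
    repo_id="your-org/your-model",
    local_files_only=True,
)

# By default, files resolve under the Hub cache, typically ~/.cache/huggingface/hub.
print(local_path)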

Why do submissions 286836 and 286834 hang?

5 months ago

Note: all of the solutions passed our offline evaluation with one L40S GPU.

Image and web search API updates and feedback

6 months ago

Please check and update the quality of the web search index. I observed that many web results for the 0.1.2 validation set are not correct.

e.g.:

results = search_pipeline("Arancini")
results[2]
>>> {'index': 'https://en.wikipedia.org/wiki/Italy, https://en.wikipedia.org/wiki/Arancini_chunk_0',
 'score': 0.5711636543273926,
 'page_name': '',
 'page_snippet': '',
 'page_url': 'https://en.wikipedia.org/wiki/Italy, https://en.wikipedia.org/wiki/Arancini'}

What is the reason that a single search result can return two URLs joined by ','? And how can different webpages have the same score? Regardless of the scenario (e.g., chunking that makes the content identical), those results should be flattened.
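
For clarity, this is the kind of flattening I mean; a rough sketch against the result fields shown above, not part of the official search pipeline:

def flatten_results(results):
    """Split any result whose 'page_url' packs several comma-separated URLs
    into one entry per URL, keeping the same score and snippet."""
    flat = []
    for r in results:
        for url in (u.strip() for u in r["page_url"].split(",")):
            entry = dict(r)
            entry["page_url"] = url
            flat.append(entry)
    return flat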

Ground Truth (ans_full) and Auto Evaluation

6 months ago

Dear Organizers,

We have observed many instances where the automatic evaluation script, local_evaluation.py, is not stable when processing short answers. These answers, despite being correct, are often shorter than the provided "ground truth" responses.

For example, in the v.0.1.2 validation set:

  • Interaction ID: 00663475-7bf0-4c70-bba5-80bd9425082d
  • Query: When did this artist release his first studio album?
  • Ground Truth: Chuck Berry released his first studio album, After School Session, in 1957.
  • Agent Response: 1957
  • local_evaluation.py: {'accuracy': False}

We consider the agent's response of "1957" to be correct, yet local_evaluation.py frequently marks it as incorrect. This is because the system prompt reads:

"You are an expert evaluator for question answering systems. "
"Your task is to determine if a prediction correctly answers a question based on the ground truth.\n\n"
"Rules:\n"
"1. The prediction is correct if it captures all the key information from the ground truth.\n"
"2. The prediction is correct even if phrased differently as long as the meaning is the same.\n"
"3. The prediction is incorrect if it contains incorrect information or is missing essential details.\n"
"Output a JSON object with a single field 'accuracy' whose value is true or false."

We are wondering how such cases will be judged in the online auto evaluation, and more specifically, how organizers will assess them. Personally, I would consider "1957" a perfectly correct answer (score = 1), though it could also be treated as an acceptable answer (score = 0.5).

We've noted this is a common occurrence because the ground truth in the current data includes both an answer and a brief reason (ans_full), which differs from last year's format where we had both short and full answers.

Thank you for your clarification on this matter.

Sincerely,

Fanyou

Please Confirm Submission Limits for Phase 2

6 months ago

Could we have an extra environment, used just for debugging purposes, that does not return a score but checks whether a submission is valid?

💬 Feedback & Suggestions

7 months ago

The Hugging Face dataset crag-mm-2025/crag-mm-single-turn-public:v0.1.1 (link) has a duplicated domain: {2: "plants and gardening", 4: "plants and gardening "}. The label text differs only by a single space character. It is a potential data quality issue.
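
A quick check that surfaces such near-duplicate labels; this is my own snippet, not part of the dataset tooling, and the label list is just the pair quoted above:

from collections import defaultdict

def find_whitespace_duplicates(labels):
    """Group labels that become identical after stripping surrounding whitespace."""
    groups = defaultdict(set)
    for label in labels:
        groups[label.strip()].add(label)
    return {k: v for k, v in groups.items() if len(v) > 1}

print(find_whitespace_duplicates(["plants and gardening", "plants and gardening "]))
# -> {'plants and gardening': {'plants and gardening', 'plants and gardening '}}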

What's the G6e instance size for evaluation?

7 months ago

Hi organizers,
I saw the rules mention that:

All submissions will be run on a single G6e instance with a NVIDIA L40s GPU with 48GB of GPU memory on AWS.

I want to understand which size [1] of G6e instance is used. g6e.xlarge? This information will help with our offline evaluation.

[1] Amazon EC2 G6e Instances | Amazon Web Services

Best
Fanyou

Meta Comprehensive RAG Benchmark: KDD Cup 2024

Final Evaluation Process & Team Scores

Over 1 year ago

Can we obtain the full rankings for the 3 main tasks? At the very least, I want to understand how far away I am from the top teams.

Has the Winner Notification already been sent?

Over 1 year ago

We heard from the organizers by email that some of the human annotations are still ongoing.

Copied from email:

We are still in the middle of annotations for other challenge tasks and will announce winners by email once the annotations are ready. The official winner announcement for the CRAG challenge will be made in early August.

‼️ ⏰ Select Submission ID before 20th June, 2024 23:55 UTC

Over 1 year ago

Can we confirm our final submissions in the Google Form after June 20th, e.g., sometime on June 21st once the online evaluation has finished? This is because some people want to select their final solution based on the round 2 score. Besides, the evaluation system is currently stuck because a lot of people are submitting to it.

‼️ ⏰ Select Submission ID before 20th June, 2024 23:55 UTC

Over 1 year ago

Can we submit the same submission for all 3 tasks? The aicrowd.json might be the same, but the code is able to handle all three task settings.

Can we submit a solution that has not been tested online?

Over 1 year ago

Hi,

My online submission limit is low due to debugging. For the final submission, I am wondering whether we can choose a solution that has not yet been tested online.

Best
Fanyou

Submission Failed due to Private Test: Evaluation timed out 😢

Over 1 year ago

@aicrowd_team

Could you help check what the reason is for the failure of those submissions:

Those submissions made small changes to previous successful submissions, and all were tested successfully on the provided public dataset. In addition, all of those submissions passed the validation step but got stuck at the start of the evaluation, so there are no progress bars for the evaluations.

Now I can identify which part of my code creates the problem, but I am still not able to reproduce it offline. I wish I could get some error messages from the logs to help me solve the problem.

Best
Fanyou

Phase 1 has released the dataset, so how will a cut-off be applied to limit Phase 2?

Over 1 year ago

Hi organizers,

Apparently there are now (April 30th) two teams in Track 1 using the public test set [1] to obtain a nearly full score (~0.98). I am wondering how, in this scenario, a cut-off can be applied for Phase 2? Every participant just needs to upload the public test set to obtain a similarly full score. Is there still a potential cut-off?

[1] What does `split` field mean? - #3 by graceyx.yale

Best
Fanyou

Regarding the maximum number of response tokens for Llama 3

Over 1 year ago

@aicrowd_team Yes, I understand that the code already includes this tokenizer. But Llama 3 has a different vocabulary size (128K vs. 32K). In some cases, the number of output tokens will be smaller than with Llama 2 even if the output text is the same. In terms of model performance, Llama 3 is better (according to the report), and I foresee that people might use it. So I suggest replacing the current tokenizer used for truncating predictions with Llama 3's.
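
As an illustration of why the choice of tokenizer matters for truncation, here is a rough sketch using the transformers library; the model ids and the 75-token budget are assumptions for the example, not the official rule:

from transformers import AutoTokenizer

def truncate_response(text, tokenizer_name, max_tokens=75):
    """Truncate a prediction to a fixed token budget under a given tokenizer."""
    # Note: the meta-llama repos are gated, so downloading them requires access.
    tok = AutoTokenizer.from_pretrained(tokenizer_name)
    ids = tok.encode(text, add_special_tokens=False)
    return tok.decode(ids[:max_tokens])

text = "Chuck Berry released his first studio album, After School Session, in 1957."
# The same text generally yields fewer tokens under Llama 3's 128K vocabulary
# than under Llama 2's 32K vocabulary, so the cut-off point differs.
print(truncate_response(text, "meta-llama/Llama-2-7b-chat-hf"))
print(truncate_response(text, "meta-llama/Meta-Llama-3-8B-Instruct"))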

Regarding the maximum number of response tokens for Llama 3

Over 1 year ago

I want to bring to the organizers' attention that Llama 3 has a larger vocabulary size (128K) compared to Llama 2 (32K). So we need to clearly define in the rules which tokenizer is used to truncate the response (previously the code used the Llama 2 tokenizer).

Best
Fanyou

Are we allowed to use Llama 3?

Over 1 year ago

Hi Organizers,

Meta has introduced Llama 3, and it is available on Hugging Face. I am wondering if we can use it for the competition. The Llama 3 8B model might be a good choice.

Best
Fanyou

Can we use other LLMs at the training stage?

Over 1 year ago

Hi Organizers,

I want to understand whether we can use other LLMs (not the Llama 2 family) during the training stage, specifically for RLHF and data generation.

Below is the original requirement for models:

This KDD Cup requires participants to use Llama models to build their RAG solution. Specifically, participants can use or fine-tune the following 4 Llama 2 models from https://llama.meta.com/llama-downloads:

  • llama-2-7b
  • llama-2-7b-chat
  • llama-2-70b
  • llama-2-70b-chat

Best
Fanyou
