Activity
Ratings Progression
Challenge Categories
Challenges Entered
Improve RAG with Real-World Benchmarks
Latest submissions
Status | ID
---|---
failed | 252849
failed | 252492
failed | 252486
Revolutionising Interior Design with AI
Latest submissions
Status | ID
---|---
graded | 251762
graded | 251753
graded | 251702
Multi-Agent Dynamics & Mixed-Motive Cooperation
Latest submissions
Status | ID
---|---
graded | 243898
failed | 242953
failed | 242945
Specialize and Bargain in Brave New Worlds
Latest submissions
Status | ID
---|---
submitted | 246741
submitted | 246661
submitted | 246539
Trick Large Language Models
Latest submissions
Shopping Session Dataset
Latest submissions
Small Object Detection and Classification
Latest submissions
Status | ID
---|---
graded | 240507
graded | 240506
graded | 240490
Understand semantic segmentation and monocular depth estimation from downward-facing drone images
Latest submissions
Status | ID
---|---
submitted | 218884
graded | 218883
submitted | 218875
Identify user photos in the marketplace
Latest submissions
Status | ID
---|---
graded | 210413
failed | 210389
graded | 210283
A benchmark for image-based food recognition
Latest submissions
Status | ID
---|---
graded | 181873
graded | 181872
graded | 181870
Using AI For Building's Energy Management
Latest submissions
Status | ID
---|---
graded | 205123
failed | 204464
failed | 204102
What data should you label to get the most value for your money?
Latest submissions
Interactive embodied agents for Human-AI collaboration
Latest submissions
Improving the HTR output of Greek papyri and Byzantine manuscripts
Latest submissions
Machine Learning for detection of early onset of Alzheimer's
Latest submissions
A benchmark for image-based food recognition
Latest submissions
5 Puzzles, 3 Weeks. Can you solve them all?
Latest submissions
Project 2: Road extraction from satellite images
Latest submissions
Project 2: Build our own text classifier system and test its performance.
Latest submissions
5 PROBLEMS 3 WEEKS. CAN YOU SOLVE THEM ALL?
Latest submissions
Predict if users will skip or listen to the music they're streamed
Latest submissions
5 puzzles and 1 week to solve them!
Latest submissions
Estimate depth in aerial images from monocular downward-facing drone
Latest submissions
Status | ID
---|---
submitted | 218884
graded | 218883
graded | 218801
Perform semantic segmentation on aerial images from monocular downward-facing drone
Latest submissions
Status | ID
---|---
submitted | 218875
graded | 218874
submitted | 218871
Commonsense Dialogue Response Generation
Latest submissions
Status | ID
---|---
graded | 252068
graded | 250634
graded | 250621
Commonsense Persona Knowledge Linking
Latest submissions
Status | ID
---|---
graded | 250633
graded | 250629
failed | 250626
Participant | Rating
---|---
gaurav_singhal | 0
Team | Challenge
---|---
nebula | Food Recognition Benchmark 2022
Sneaky_Ninjas | NeurIPS 2022: CityLearn Challenge
925ers | Visual Product Recognition Challenge 2023
gs-sai | Scene Understanding for Autonomous Drone Delivery (SUADD'23)
WeekendWarriors | MeltingPot Challenge 2023
MoonWalkers | Commonsense Persona-Grounded Dialogue Challenge 2023
Meta Comprehensive RAG Benchmark: KDD Cup 2
Meta KDD Cup 24 - CRAG - Retrieval Summarization
Are the evaluation QA values in qa.json correct?
5 days ago
The evaluation dataset questions consist of stock prices, but none of the answers are accurate as of Feb 16th, the last date given in the dataset. I used Nasdaq and MarketWatch to check the prices on that day, but none of them matched.
you can check the closing price on https://www.nasdaq.com/market-activity/stocks/tfx
For example, the closing price on Feb 16th is $251.07, but the answer is given as $249.07:
"interaction_id": 0,
"query": "what's the current stock price of Teleflex Incorporated Common Stock in USD?",
"answer": "249.07 USD"
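A mismatch like this can be checked programmatically. Below is a minimal sketch that compares a qa.json-style record against a hand-collected reference closing price; the `reference_close` table and `check_answer` helper are illustrative, not part of the challenge tooling, and the TFX value is the Nasdaq closing price cited above.

```python
import json

# Hand-collected reference closing prices (ticker -> USD); the TFX value is
# the Feb 16th Nasdaq closing price cited in the post above.
reference_close = {"TFX": 251.07}

def check_answer(record, ticker, tolerance=0.01):
    """Compare a qa.json-style answer string like '249.07 USD' to a reference price."""
    answered = float(record["answer"].split()[0])
    expected = reference_close[ticker]
    return abs(answered - expected) <= tolerance, answered, expected

raw = '{"interaction_id": 0, "query": "what\'s the current stock price of Teleflex Incorporated Common Stock in USD?", "answer": "249.07 USD"}'
record = json.loads(raw)
ok, answered, expected = check_answer(record, "TFX")
print(ok, answered, expected)  # False 249.07 251.07 -- the answer is $2.00 off
```

Running this over every record in qa.json would surface all answers that disagree with the reference source.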
Generative Interior Design Challenge 2024
Top teams solutions
9 days ago
Nice use of an external dataset and of converting it to this challenge's format. I explored different variations at inference time using prompt engineering. I used a stronger segmentation model (swin-base-IN21k) and modified the control items to include pillars for better geometry, along with different prompt engineering techniques. Even though the baseline gave me a better score, it was really inconsistent. Finally, I submitted a Realistic Vision model from ComfyUI, which gave stable and consistent results, and given the human evaluations I did expect some randomness in the leaderboard.
I would like to express my gratitude to the organizers of this challenge. The challenge is new and exciting, but because there are only 40 images in the test dataset, the human evaluations are much noisier and more inconsistent. It was really fun exploring Stable Diffusion models and their adapters. When I have more processing power, I want to work on this again in the near future.
Generative Interior Design Challenge: Top 3 Teams
11 days ago
@lavanya_nemani You cannot judge the test dataset's performance based on 3 public images. As the scores are really close, the annotators' preferences can shift things a little.
Generative Interior Design Challenge: Top 3 Teams
11 days ago
Congratulations to the winners! Please also post this in Discord; we had no idea this post existed.
Build fail -- Ephemeral storage issue
19 days ago
@lavanya_nemani The maximum size your submission can have is 10GB, so keep at most 10GB inside the submission tag. Check your models folder and delete unused files. The repo itself can exceed 10GB, and you don't need to delete anything there.
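A quick way to find what is eating the 10GB budget is to walk the folder and list the largest files. This is a small stdlib sketch, not AIcrowd tooling; the `models` folder name is taken from the reply above and is only checked if it exists.

```python
import os

LIMIT_BYTES = 10 * 1024**3  # 10 GB submission limit mentioned above

def audit_dir(root, top_n=10):
    """Return (total size in bytes, largest files) for a directory tree."""
    sizes = []
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path):
                sizes.append((os.path.getsize(path), path))
    sizes.sort(reverse=True)  # largest files first
    return sum(s for s, _ in sizes), sizes[:top_n]

if os.path.isdir("models"):  # hypothetical submission folder
    total, largest = audit_dir("models")
    print(f"total: {total / 1024**3:.2f} GB (limit 10 GB)")
    for size, path in largest:
        print(f"{size / 1024**2:8.1f} MB  {path}")
```

Deleting or excluding the top few entries is usually enough to get back under the limit.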
Submission stuck at intermediate state
20 days ago
The evaluation for my submission #251530 completed, but it got stuck and did not proceed to the human evaluation phase.
Commonsense Persona-Grounded Dialogue Chall-459c12
Service Announcement: Delays in GPU Node Provisioning
About 1 month ago
Can we try new submissions now?
Task 1: Commonsense Dialogue Response Generation
Updates to Task 1 Metrics
About 1 month ago
The only problem is that the leaderboard is dominated by ChatGPT's PE track, yet the PeaCoK paper's human evaluation does not show ChatGPT/GPT-4 on top. It is as if the paper's findings can be overturned simply by prompt engineering. Could someone please clarify this?
In the human evaluation, we find that facts generated by COMET-BART receive a high acceptance rate by crowdworkers for plausibility, slightly beating few-shot GPT-3. We also find that the zero-shot GPT-3.5 model, although more advanced than the GPT-3 baseline model, scores, on average, ~15.3% and ~9.3% lower than COMET-BART in terms of automatic metrics and human acceptance, respectively.
Is anyone encountering an SSL issue with the image build caching API? I don't know if there is something I can do about this.
About 1 month ago
The same thing is happening for me in Task 2 as well. The evaluation gets initialized, but then it reports a time-out error. @dipam Can you make the logs easier to understand?
Is anyone encountering an SSL issue with the image build caching API? I don't know if there is something I can do about this.
About 1 month ago
Yeah, resubmitting the same commit.
Is anyone encountering an SSL issue with the image build caching API? I don't know if there is something I can do about this.
About 2 months ago
Try creating the submission again; it happened for me as well.
Updates to Task 1 Metrics
About 2 months ago
@dipam Using ChatGPT, you are producing a score between 0 and 5, but is there a backup pipeline if the generated value is a random text string or an out-of-bounds number? For the human evaluation, it would also be preferable to select the best LB entries from both the GPU and PE tracks.
Updates to Task 1 Metrics
About 2 months ago
It's not clear what this metric is. Is it embedding similarity? Or just asking ChatGPT to generate a floating-point score based on humanness?
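The "backup pipeline" question above boils down to robust parsing of the judge model's free-text reply. A minimal sketch of one way to do it, assuming a 0-5 scale as discussed above (the function name and fallback policy are illustrative, not the organizers' actual pipeline):

```python
import re

def parse_judge_score(text, lo=0.0, hi=5.0, fallback=0.0):
    """Extract the first number from an LLM judge's reply.
    Out-of-bounds values are clamped to [lo, hi]; if no number is
    found at all, a fixed fallback score is returned."""
    match = re.search(r"-?\d+(?:\.\d+)?", text)
    if match is None:
        return fallback
    return min(hi, max(lo, float(match.group())))

print(parse_judge_score("Score: 4.5"))            # 4.5
print(parse_judge_score("I'd rate this a 7"))     # 5.0 (clamped)
print(parse_judge_score("great response!"))       # 0.0 (fallback)
```

Clamping plus a fallback keeps a single malformed judge reply from crashing or skewing the aggregate metric.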
Can somebody please help us create a submission without any errors?
2 months ago
If anyone else is having the same problem: creating tags smaller than 10GB will work properly.
Can somebody please help us create a submission without any errors?
3 months ago
I am creating new submissions, but they fail without any useful logs. If I fork the repo, it already exceeds 12GB, which leaves only 4GB under the 16GB limit. Every time I create a new submission, it says the evaluation timed out, even though the evaluation never started. Discord is inactive, so I am posting the problem here.
Task 2: Commonsense Persona Knowledge Linking
Submissions failing
4 months ago
Thanks for the reply. Are you planning to create a baseline for Task 2 from the ComFact repo model?
Submissions failing
4 months ago
For Task 2, I attempted to submit the baseline model, which passed the warm-up phase but failed the subsequent one. I brought up this issue a few days ago, but it has not been resolved yet. Could you check why it failed? @dipam
MosquitoAlert Challenge 2023
Mosquito Alert Challenge Solutions
6 months ago
Thank you for sharing your solution. Nice use of ensemble techniques. Is your improvement score lower on the noisy external dataset because of the use of multiple small models?
Mosquito Alert Challenge Solutions
6 months ago
Now that the private leaderboard is online, I am going to share the details of my submission. Initially, I attempted to improve the F1 score using detection models alone, but they were not competitive with the top of the LB, so I used YOLOv5 as a classification model on the detected bboxes. However, the PyTorch implementation had some errors, resulting in poor performance and in different results between CLI evaluation and PyTorch inference. After a few failed attempts, I learned that others had the same mIoU score for their submissions, so I switched to the YOLOv8-L classification model, which reached an F1 score of about 70 on the LB. For detection, I used YOLOv5-S with a CBAM layer as the last layer at an image size of 640, resulting in 233 filtered predictions. YOLOv5m6 with a transformer head at a 1280 image size generated better results, but its inference time is very high, so I didn't use that model for submissions with classification models.
For the classification task, I used this Colab notebook with multiple modifications. I couldn't figure out the LB difference, but then I saw the external-dataset posts after the competition ended. So external datasets are allowed, but external Mosquito Alert image datasets are not.
The classification training details are in the shared notebook; I used Wandb for experiment tracking and OpenVINO for the bigger models to fit the 2-second inference budget. There could be a slight difference between TPU and CPU inference.
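The detect-then-classify pipeline described above hinges on filtering the raw detections before cropping them for the classifier. A pure-Python sketch of that filtering step (confidence threshold plus greedy NMS), independent of any particular YOLO implementation; the thresholds are illustrative defaults, not the values used in the actual submission:

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def filter_detections(dets, conf_thresh=0.25, iou_thresh=0.5):
    """Confidence filter + greedy NMS over (box, score) detections.
    Keeps the highest-scoring box of each overlapping cluster."""
    dets = sorted((d for d in dets if d[1] >= conf_thresh),
                  key=lambda d: d[1], reverse=True)
    kept = []
    for box, score in dets:
        if all(iou(box, k[0]) < iou_thresh for k in kept):
            kept.append((box, score))
    return kept
```

The surviving boxes would then be cropped from the image and passed to the classification model, as in the two-stage setup described in the post.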
Notebooks
- Generative Interior Design Submission using Colab: a Generative Interior Design submission notebook using Colab's free version. saidinesh_pola · About 1 month ago
- Mosquito-Classification: a notebook for improving the classification of mosquito detections from the starter notebook. saidinesh_pola · 6 months ago
Can we assume the same websites in both the test and training datasets?
4 days ago
Can we assume that the same websites appear in both the test and training datasets?