Organiser Interview | Ann Kennedy
Challenge participants and organisers are the two main foundations of AIcrowd. We have shared several participant stories on our platform, and now we want to share behind-the-scenes details from your favourite challenges!
Join us for the first [blog name] as we chat with Ann Kennedy, organiser of the Multi-Agent Behavior Challenge. Ann, a theoretical neuroscientist, studies the structure of animal behaviour using dynamical systems, statistical modelling, and machine learning. Her work at Northwestern aims to develop new theories and models to better understand how neural networks govern function and shape behaviour across the animal kingdom.
What is the Multi-Agent Behavior Challenge?
This challenge aimed to determine how close we can come to fully automating the behaviour annotation process, and where the most significant roadblocks in that process lie. The challenge utilises a new, high-quality animal pose dataset, and each challenge task is modelled after real-world problems faced by behavioural neuroscientists. The challenge was broken into three linked tasks, each carrying a USD 3000 cash prize pool.
Now that you are all caught up on the challenge and know more about our organiser, let's hear about Ann's experience hosting the challenge and her thoughts on the winning submissions.
AIcrowd has hosted several successful ML challenges in the past few years. We asked Ann about her motivation for hosting this challenge with us.
Ann Kennedy: The area of automatically quantifying animal behaviour is a new field. It only really became a thing around 2018, with the release of packages like DeepLabCut for tracking animal postures. There aren't any good standardized datasets or approaches that people use to study animal behaviour, so people kept publishing papers on their new methods for behaviour analysis, but they'd all apply them to their own in-house datasets. You couldn't tell which strategies worked and which methods didn't. We wanted to establish a benchmark dataset that the field could use moving forward to show how well their models work.
Why did Northwestern Lab for Theoretical Neuroscience and Behavior choose AIcrowd for hosting this challenge?
Ann Kennedy: We looked at a handful of sites that host AI challenges; mostly, my collaborator, Jen, searched sites for hosting machine learning challenges. The reason we went with AIcrowd was that it allows more flexibility in evaluation metrics than the others, which seemed more limited in the evaluation metrics you could use.
Were you satisfied with the level of flexibility offered in evaluation metrics? How would you describe your overall experience?
Ann Kennedy: Working with AIcrowd was a smooth journey. You guys gave us more support than we were expecting. We had never done one of these before, so we didn't expect help with establishing all of the baseline code; we figured we'd have to do that ourselves. It was great to get help with it, and it made it easier for people to join the challenge. Since they had access to the baseline notebooks, they could get started with the dataset without doing all the basic wrangling at the start.
What do you think about the AIcrowd community?
Ann Kennedy: In addition to the baselines, you guys helped us think about what kinds of problems these challenges are suited to. We discussed the trade-off between having data and models, and identifying what was missing for a better solution. And it was nice to see that the challenge went off well and that a decent number of people participated; we weren't sure what to expect. Despite it being a niche problem, it was great to see engagement from a broader community. The top teams had few participants from a neuroscience background, yet those who were not from the field were all still able to understand the problem and make meaningful contributions.
What are your thoughts on the winner’s solution? What are some key takeaways for you?
Ann Kennedy: The winners' solutions performed well on the test set. They were often hand-tuning the thresholds on their classifiers to improve performance. Everybody struggled to score above the baseline on Task 3.
Through this, we learnt to design our tasks in a non-linear way, because we ended up with a situation where everybody was solving task one, while tasks two and three had much less participation until the very end. So if we were doing this again, we would change the design of the tasks and make them different enough that you could work on them independently, or have a single focused task.
In terms of helping the neuroscience community, was this challenge a success? How has the challenge helped the research project?
Ann Kennedy: Yes, this challenge helped the field a lot, and it's been great for the lab. Having this dataset out there has got people talking about ways of standardizing behaviour analysis. I am involved with a group discussing how to establish standard metrics and benchmarks in the field. This is something we hope to continue for another couple of years, adding in different tasks and fleshing out the space of problems that remain unsolved in behavioural analysis.
Was there any hurdle in hosting this challenge? How can we improve?
Ann Kennedy: There was a short blip before the challenge launch that caused a minor delay: preparing the baseline took a bit longer than expected, which pushed the launch date by a few days. The baseline code wasn't performing very well for one of the tasks, so tweaking it took extra time and effort. But in the end, AIcrowd and our team were able to prepare a well-performing, accessible baseline, even for Tasks 2 and 3, which didn't seem too feasible at first. As they were designed to be in a data-limited regime, a lot of off-the-shelf models just didn't perform very well; it required clever handling of small amounts of data to still produce an excellent result.
Any final thoughts and takeaways you’d like to share?
Ann Kennedy: This challenge was not about identifying who has the best algorithm. Through exploratory challenges like this, we can determine what is difficult to do in this field and which areas to focus on in the future. It's not just a bake-off; it's about understanding where the areas are that researchers need to focus on. We were able to achieve that goal with this challenge and implement the learnings in subsequent iterations.