⚔️ Problem statement
In this Task, you will be given a dataset of raw overhead videos and tracking data containing triplets of socially interacting mice. Rather than being asked to detect a specific behavior of interest, we ask you to submit a frame-by-frame representation of the dataset—for example, a low-dimensional embedding of animals' trajectories over time. This task will tell us to what extent information in raw video that isn't captured by tracking data improves our ability to characterize animal behavior using unsupervised methods. You can read about a few existing methods for representation learning of individual animal behavior here, here, here, and here.
To evaluate the quality of your learned representations, we will take a practical approach: we'll use representations as input to train linear classifiers on many different "hidden" tasks (each task will have its own neural classifier or regressor), such as detecting occurrence of experimenter-defined actions or distinguishing between two different strains of mice. The goal is therefore to create a representation that captures behavior and generalizes well to any downstream task.
Join our Computational Behavior Slack to discuss the challenge, ask questions, find teammates, or chat with the organizers!
We provide overhead videos and corresponding frame-by-frame animal pose estimates of trios of interacting mice filmed at 30 Hz. Animal poses are characterized by the tracked locations of body parts on each animal, termed "keypoints."
Keypoints are stored in an ndarray with the following properties:
- Dimensions: (# frames) x (animal ID) x (body part) x (x, y coordinate).
- Units: pixels; coordinates are relative to the entire image. Original image dimensions are 850 x 850 for the mouse dataset.
Body parts are ordered: 1) nose, 2) left ear, 3) right ear, 4) neck, 5) left forepaw, 6) right forepaw, 7) center back, 8) left hindpaw, 9) right hindpaw, 10) tail base, 11) tail middle, 12) tail tip.
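As a minimal sketch of how this layout can be indexed, the snippet below looks up a body part by name and computes a per-frame nose-to-nose distance between two mice. The helper names (`body_part_xy`, `nose_distance`) and the synthetic array are illustrative assumptions, not part of the challenge starter kit; the only facts taken from the description are the axis order and the body-part order.

```python
import numpy as np

# Body-part order as listed in the task description (0-based in the array).
BODY_PARTS = [
    "nose", "left ear", "right ear", "neck",
    "left forepaw", "right forepaw", "center back",
    "left hindpaw", "right hindpaw", "tail base",
    "tail middle", "tail tip",
]
PART_INDEX = {name: i for i, name in enumerate(BODY_PARTS)}


def body_part_xy(keypoints, animal, part):
    """Return the (frames, 2) pixel trajectory of one body part of one animal."""
    return keypoints[:, animal, PART_INDEX[part], :]


def nose_distance(keypoints, animal_a, animal_b):
    """Per-frame Euclidean distance (in pixels) between two animals' noses."""
    diff = body_part_xy(keypoints, animal_a, "nose") - body_part_xy(keypoints, animal_b, "nose")
    return np.linalg.norm(diff, axis=-1)


# Synthetic stand-in for one sequence: 1800 frames, 3 mice, 12 parts, (x, y).
rng = np.random.default_rng(0)
keypoints = rng.uniform(0, 850, size=(1800, 3, 12, 2))

nose = body_part_xy(keypoints, 0, "nose")
dist = nose_distance(keypoints, 0, 1)
```

Looking parts up by name rather than by hard-coded integer makes it harder to confuse the 1-based numbering in the text with the 0-based array index.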
The placement of these keypoints is illustrated below:
The following files are available in the Resources section on the Challenge Page. A "sequence" is a continuous recording of social interactions between animals: sequences are 60 seconds long (1800 frames at 30 Hz) in the mouse video dataset. The sequence_id is a random hash to anonymize experiment details, and can be used to map keypoint and annotation data to a corresponding video clip (named <sequence_id>.avi). NaNs indicate missing data; these occur because not all videos are labelled for all tasks. Data are padded with NaNs so that all arrays are the same size.
- user_train.npy: Set of videos where three public tasks are provided, for your local validation, with the following schema:
- submission_keypoints.npy: Keypoints for the submission clips, with the following schema:
- frame_number_map.npy: A map of frame numbers for each clip, to be used for the submission embeddings array
- sample_submission.npy: Template for a sample submission for this task, with the following schema:
- userTrain_videos.zip: Videos for the userTrain sequences, all 512x512 grayscale, 30 fps, 1800 frames each
- submission_videos.zip: Videos for the submission sequences, all 512x512 grayscale, 30 fps, 1800 frames each
- userTrain_videos_resized_224.zip: Videos resized for convenience, 224x224 grayscale, 30 fps, 1800 frames each
- submission_videos_resized_224.zip: Videos resized for convenience, 224x224 grayscale, 30 fps, 1800 frames each
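Because missing data and padding are both encoded as NaN, it can help to build explicit masks before computing features or fitting anything on the labels. The following is a small sketch on synthetic arrays; the function names are hypothetical, and the assumption that a padded frame is NaN in every coordinate is ours rather than something the description guarantees.

```python
import numpy as np


def valid_frame_mask(keypoints):
    """True for frames in which no keypoint coordinate is NaN."""
    # keypoints: (frames, animals, body parts, 2)
    return ~np.isnan(keypoints).any(axis=(1, 2, 3))


def labeled_frame_mask(labels):
    """True for frames that carry a label (NaN marks an unlabeled frame)."""
    return ~np.isnan(labels)


# Synthetic example: 10 frames, the last 3 are NaN padding.
keypoints = np.zeros((10, 3, 12, 2))
keypoints[7:] = np.nan
labels = np.array([0, 1, 1, 0, np.nan, np.nan, 0, np.nan, np.nan, np.nan])

frames_ok = valid_frame_mask(keypoints)
labels_ok = labeled_frame_mask(labels)
usable = frames_ok & labels_ok  # frames with both pose data and a label
```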
In sample_submission, each key in the frame_number_map dictionary refers to the unique sequence ID of a video in the test set. The item for each key is expected to be the start and end index for slicing the embeddings numpy array to get the corresponding embeddings. The embeddings array is a 2D ndarray of floats of size total_frames by X, where X is the dimension of your learned embedding (6 in the above example; maximum permitted embedding dimension is 128), representing the embedded value of each frame in the sequence. total_frames is the sum of all the frames of the sequences; the array should be the concatenation of the embeddings of all the clips.
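The layout described above can be sketched as follows: per-sequence embeddings are written into one 2D array at the slice given by the frame number map. This is an illustration under stated assumptions, not the official submission code: the dictionary keys `"frame_number_map"` and `"embeddings"`, the end-exclusive `(start, end)` convention, and the helper `assemble_embeddings` are all assumed here, and the toy map stands in for the provided frame_number_map.npy.

```python
import numpy as np

MAX_DIM = 128  # maximum permitted embedding dimension, per the task description


def assemble_embeddings(frame_number_map, per_sequence, dim):
    """Place each sequence's (n_frames, dim) embedding at its (start, end) slice."""
    if dim > MAX_DIM:
        raise ValueError(f"embedding dimension {dim} exceeds {MAX_DIM}")
    # Assumption: `end` is exclusive, as in ordinary Python slicing.
    total_frames = max(end for _, end in frame_number_map.values())
    embeddings = np.full((total_frames, dim), np.nan, dtype=np.float32)
    for seq_id, (start, end) in frame_number_map.items():
        emb = np.asarray(per_sequence[seq_id], dtype=np.float32)
        if emb.shape != (end - start, dim):
            raise ValueError(f"{seq_id}: expected {(end - start, dim)}, got {emb.shape}")
        embeddings[start:end] = emb
    if np.isnan(embeddings).any():
        raise ValueError("some rows were never filled or contain NaN")
    return embeddings


# Toy stand-in for the provided map: two short "sequences".
frame_number_map = {"seq_a": (0, 4), "seq_b": (4, 10)}
rng = np.random.default_rng(0)
per_sequence = {
    seq_id: rng.normal(size=(end - start, 6))
    for seq_id, (start, end) in frame_number_map.items()
}

embeddings = assemble_embeddings(frame_number_map, per_sequence, dim=6)
submission = {"frame_number_map": frame_number_map, "embeddings": embeddings}
```

Filling a preallocated array from the map, rather than concatenating in whatever order the sequences were processed, keeps each clip's rows aligned with the map's indices.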
To help you evaluate the quality of your embeddings, we provide labels for two sample evaluation subtasks:
- Chasing is a "frame-level" task, meaning each frame in a 1-minute clip receives a binary label. Chasing frames are those in which any one mouse is pursuing any other; the mice must be within a given distance of each other, and traveling above a given speed, for at least one second. Frames are labeled with a 1 when chasing is detected, and 0 otherwise.
- Light cycle is a "sequence-level" task, meaning its value is the same for all frames in a sequence. Here, sequences are labeled with a 1 when lights are on, and 0 when lights are off. (Mice are nocturnal, so you will observe more movement when lights are off.) The lights are not perceptible by watching the raw videos, since the videos are captured with an infrared camera.
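For local validation against labels like these, one option is to fit a simple linear probe on your embeddings. The sketch below uses a ridge-regularized least-squares classifier in plain NumPy as a stand-in; it is not the organizers' evaluator, and the function names, the regularization strength, the 0.5 threshold, and the synthetic data are all assumptions made for illustration. It also shows how a sequence-level label can be repeated so that every frame in a sequence carries the same value.

```python
import numpy as np


def fit_linear_probe(X, y, l2=1e-3):
    """Least-squares linear classifier fitted on labeled rows only (NaN = unlabeled)."""
    keep = ~np.isnan(y)
    Xb = np.hstack([X[keep], np.ones((int(keep.sum()), 1))])  # append a bias column
    A = Xb.T @ Xb + l2 * np.eye(Xb.shape[1])
    return np.linalg.solve(A, Xb.T @ y[keep])


def predict(w, X):
    """Threshold the linear score at 0.5 to get binary predictions."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return (Xb @ w > 0.5).astype(int)


# Synthetic frame-level task: 600 frames of 6-dimensional embeddings,
# with a binary label that depends on the first embedding dimension.
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 6))
y = (X[:, 0] > 0).astype(float)

y_train = y[:400].copy()
y_train[::10] = np.nan  # pretend some training frames are unlabeled

w = fit_linear_probe(X[:400], y_train)
accuracy = float(np.mean(predict(w, X[400:]) == y[400:]))

# Sequence-level task: one label per sequence, repeated for each of its frames.
light_per_sequence = np.array([1.0, 0.0, 1.0])
light_per_frame = np.repeat(light_per_sequence, 1800)
```

A probe this simple mostly measures whether the relevant information is linearly accessible in the embedding, which is the property the hidden tasks are described as testing.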
Sample submission format is described in the Files section above.
To test out the system, you can start by uploading the provided sample_submission.npy. When you make your own submissions, they should follow the same format.
Also check out the notebooks provided in the Notebooks tab for baselines provided by us. Your community contributions will also be shown in this section.
The cash prize pool for this task is $3,000 USD total:
- 🥇 1st on leaderboard: $1500 USD
- 🥈 2nd on the leaderboard: $1000 USD
- 🥉 3rd on the leaderboard: $500 USD
Additional prizes to be announced, including a speaker opportunity at our CVPR 2022 workshop.