AIcrowd | Evoked Expressions from Videos Challenge (@CVPR 2021)

Overview

What do you feel when you are watching cat videos? How about movie trailers or sports games? Videos can evoke a wide range of affective responses in viewers, such as amusement, sadness, surprise, amongst many others. This affective response changes over time throughout the video.

With the growing number of videos available online, automated methods to help retrieve, categorize, and recommend videos to viewers are becoming increasingly important. In this setting, the ability to predict evoked facial expressions from a video, before viewers watch it, can help with content creation as well as video categorization and recommendation. Predicting evoked facial expressions from video is challenging, as it requires modeling signals from different modalities (visual and audio), potentially over long timescales. Additionally, some expressions are rarer (e.g., amusement) than others (e.g., interest). Techniques that are multimodal, capture temporal context, and address dataset imbalance will be helpful in this task.

To benchmark the ability of models to predict viewer reactions, our challenge uses the Evoked Expressions from Videos (EEV) dataset, a large-scale dataset for studying viewer responses to videos. Each video is annotated at 6 Hz with 15 continuous evoked expression labels, corresponding to the facial expression of viewers who reacted to the video. In total, there are 8 million annotations of viewer facial reactions to 5,153 videos (370 hours). The EEV dataset is based on publicly available YouTube data and contains a diverse set of video content, including music videos, trailers, games, and animations. Since this is a new dataset for exploring affective signals, we encourage participants to experiment with existing video understanding methods, and also novel methods that could improve our ability to model affective signals from video.

Problem Statement

Given a video (with visual and audio signals), how well can models predict viewer facial reactions at each frame when watching the video? Our challenge uses the EEV dataset, a novel dataset collected using reaction videos, to study these facial expressions as viewers watch the video. The 15 facial expressions annotated in the dataset are: amusement, anger, awe, concentration, confusion, contempt, contentment, disappointment, doubt, elation, interest, pain, sadness, surprise, and triumph. Each expression takes a value from 0 to 1 in each frame, corresponding to the confidence that the expression is present. The EEV dataset is collected using publicly available videos, and a detailed description of the full dataset is here.

The EEV dataset is available at: https://github.com/google-research-datasets/eev. You can download the train/val/test csv files using git-lfs. Each Video ID in the dataset corresponds to a YouTube video ID. Because the dataset links to YouTube, some videos may become unavailable over time; however, we anticipate that this number will be small compared to the total dataset size. We describe below what to do if you are unable to access a video in the train, val, or test set.

This dataset contains:

train.csv: training set of video ids, timestamps, and corresponding expression values. Rows with all 0 expressions (all 15 expressions sum to 0) should be considered unlabelled as they correspond to frames with no annotations from the dataset. If you are unable to access one of these videos, treat the video as having no annotations.
val.csv: this is the suggested validation set of video ids, timestamps, and corresponding expression values. Rows with all 0 expressions (all 15 expressions sum to 0) should be considered unlabelled as they correspond to frames with no annotations from the dataset. If you are unable to access one of these videos, treat the video as having no annotations.
test.csv: this is the test set, with only video ids and timestamps available. The goal is to predict the expression values for each frame in the test set. If you are unable to access one of these videos, you should still include the video id in your submission, but fill in the values for every expression with 0s. Please see this discussion topic for a list of test IDs you may fill with 0s in your test set. Everyone will be evaluated on the same set of available videos (even if you submit predictions on a video that later became unavailable).
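
The labelling convention described above can be applied when reading the csv files. The following is a minimal sketch using only the Python standard library; the helper name `split_labelled`, the video id `vid_a`, and the in-memory sample rows are illustrative assumptions and are not part of the dataset or its tooling. It treats any row whose 15 expression values sum to 0 as unlabelled.

```python
import csv
import io

# The 15 expression columns, in the order listed in the problem statement.
EXPRESSIONS = [
    "amusement", "anger", "awe", "concentration", "confusion", "contempt",
    "contentment", "disappointment", "doubt", "elation", "interest", "pain",
    "sadness", "surprise", "triumph",
]


def split_labelled(csv_file):
    """Split rows of a train/val-style csv into (labelled, unlabelled) lists.

    A row is considered unlabelled when all 15 expression values sum to 0.
    """
    labelled, unlabelled = [], []
    for row in csv.DictReader(csv_file):
        total = sum(float(row[name]) for name in EXPRESSIONS)
        (labelled if total > 0 else unlabelled).append(row)
    return labelled, unlabelled


# Hypothetical two-row sample using the same column layout as train.csv.
header = "Video ID,Timestamp (milliseconds)," + ",".join(EXPRESSIONS)
sample = "\n".join([
    header,
    "vid_a,0," + ",".join(["0"] * 15),            # all zeros: unlabelled
    "vid_a,166666,0.2," + ",".join(["0"] * 14),   # amusement = 0.2: labelled
])

labelled, unlabelled = split_labelled(io.StringIO(sample))
print(len(labelled), len(unlabelled))
```

For a real file, the same function could be called on `open("train.csv", newline="")`, assuming the file has been downloaded from the repository above.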

Finally, if you find our dataset useful, please consider citing:

@article{sun2021eev,
      title={EEV: A Large-Scale Dataset for Studying Evoked Expressions from Video}, 
      author={Sun, Jennifer J and Liu, Ting and Cowen, Alan S and Schroff, Florian and Adam, Hartwig and Prasad, Gautam},
      year={2021},
      journal={arXiv preprint arXiv:2001.05488}
}

Competition Timeline

March 1st: Challenge start
April 24th: Challenge end
May 1st: Final results announced
June 19th: Top 3 participants are invited to speak at AUVi@CVPR2021

Challenge participants can optionally submit their methods to https://sites.google.com/view/auvi-cvpr2021 (paper submission deadline moved from March 21st to April 6th, camera ready deadline April 15th).

Prizes & Workshop Information

The top three participants in our challenge will be invited to speak about their submission methods at the Affective Understanding in Video Workshop @ CVPR 2021.

Workshop info: https://sites.google.com/view/auvi-cvpr2021

Workshop date: June 19th 2021

Workshop Abstract: Videos allow us to capture and model the temporal nature of expressed affect, which is crucial in achieving human-level video understanding. Affective signals in videos are expressed over time across different modalities through music, scenery, camera angles, and movement, as well as with character tone, facial expressions, and body language. With the widespread availability of video recording technology and increasing storage capacity, we now have an expanding amount of public academic data to better study affective signals from video. Additionally, there is a growing number of temporal modeling methods that have not yet been well-explored for affective understanding in video. Our workshop seeks to further explore this area and support researchers to compare models quantitatively at a large scale to improve the confidence, quality, and generalizability of the models. We encourage advances in datasets, models, and statistical techniques that leverage videos to improve our understanding of expressed affect and applications of these models to fields such as social assistive robotics, video retrieval and creation, and assistive driving that have direct and clear benefits to humans.

Rules

The general rule is that participants should use only the provided training and validation videos to train a model to classify the test videos. Please see Challenge Rules for more details.

Submission & Evaluation

We require participants to submit the csv file corresponding to the test set videos and frames, in the same format as the provided train and val csv files (see https://github.com/google-research-datasets/eev). Please make sure the first entry is "Video ID", the next entry is "Timestamp (milliseconds)", followed by the 15 expressions. The sample format is:

Video ID,Timestamp (milliseconds),amusement,anger,awe,concentration,confusion,contempt,contentment,disappointment,doubt,elation,interest,pain,sadness,surprise,triumph
02nCBx8WqW0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
02nCBx8WqW0,166666,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
02nCBx8WqW0,333333,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
02nCBx8WqW0,5000000,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
02nCBx8WqW0,6666660,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
02nCBx8WqW0,8333330,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
02nCBx8WqW0,10000000,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
02nCBx8WqW0,11666660,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0

(Note that this is just a sample and does not correspond to actual values in test.csv!)
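
A file in this layout could be produced along the following lines. This is a sketch under stated assumptions: `dummy_predict` is a hypothetical stand-in for a trained model, `unavailable` is a hypothetical set of test video ids that could not be downloaded, the ids `vid_a` and `vid_b` are made up, and the output goes to an in-memory buffer rather than a real file. Rows for unavailable videos are kept and filled with 0s, as requested in the dataset description.

```python
import csv
import io

EXPRESSIONS = [
    "amusement", "anger", "awe", "concentration", "confusion", "contempt",
    "contentment", "disappointment", "doubt", "elation", "interest", "pain",
    "sadness", "surprise", "triumph",
]
HEADER = ["Video ID", "Timestamp (milliseconds)"] + EXPRESSIONS


def write_submission(test_rows, predict, unavailable, out):
    """Write one csv row per (video id, timestamp) pair from the test set.

    test_rows:   iterable of (video_id, timestamp) pairs read from test.csv
    predict:     callable returning 15 scores in [0, 1] for a given frame
    unavailable: set of video ids that could not be accessed
    out:         writable text file object
    """
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(HEADER)
    for video_id, timestamp in test_rows:
        if video_id in unavailable:
            scores = [0.0] * len(EXPRESSIONS)  # zero-fill inaccessible videos
        else:
            scores = list(predict(video_id, timestamp))
        writer.writerow([video_id, timestamp] + scores)


def dummy_predict(video_id, timestamp):
    """Hypothetical model: returns a constant score for every expression."""
    return [0.5] * len(EXPRESSIONS)


test_rows = [("vid_a", 0), ("vid_a", 166666), ("vid_b", 0)]
buffer = io.StringIO()
write_submission(test_rows, dummy_predict, {"vid_b"}, buffer)
lines = buffer.getvalue().splitlines()
print(lines[0])
print(lines[-1])
```

Every row, including the header, has 17 comma-separated fields: the video id, the timestamp, and the 15 expression values.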

Submissions will be evaluated using the correlation computed for each expression in each video, then averaged over the expressions and the videos. The correlation we use is the Pearson correlation coefficient, as defined in scipy:

$r = \frac{\sum (x - m_{x}) (y - m_{y})}{\sqrt{\sum (x - m_{x})^{2} \sum (y - m_{y})^{2}}}$

where $x$ contains the predicted values (0 to 1) for an expression, $y$ contains the corresponding ground truth values (0 to 1), $m_{x}$ is the mean of $x$, and $m_{y}$ is the mean of $y$. Note that the correlation is computed separately for each video.
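
The metric described above can be sketched in plain Python. This illustrates the formula only and is not the organizers' evaluation script: the function names and the nested-dictionary input layout are hypothetical, averaging first over expressions and then over videos is one reading of the description, and skipping expressions whose predictions or labels are constant within a video (where $r$ is undefined) is an assumption about a case the description does not cover. In practice, `scipy.stats.pearsonr` computes the same coefficient.

```python
import math


def pearson_r(x, y):
    """Pearson correlation following the formula above; None if undefined."""
    m_x = sum(x) / len(x)
    m_y = sum(y) / len(y)
    num = sum((a - m_x) * (b - m_y) for a, b in zip(x, y))
    den = math.sqrt(
        sum((a - m_x) ** 2 for a in x) * sum((b - m_y) ** 2 for b in y)
    )
    return num / den if den > 0 else None


def challenge_score(pred, truth):
    """Average per-expression correlations within each video, then over videos.

    pred and truth map video id -> expression name -> list of per-frame values.
    Expressions with an undefined correlation are skipped (an assumption).
    """
    video_scores = []
    for video_id, expressions in truth.items():
        rs = []
        for name, y in expressions.items():
            r = pearson_r(pred[video_id][name], y)
            if r is not None:
                rs.append(r)
        if rs:
            video_scores.append(sum(rs) / len(rs))
    return sum(video_scores) / len(video_scores)


# Hypothetical toy data: one video, two expressions, three frames.
truth = {"vid_a": {"amusement": [0.1, 0.2, 0.3], "anger": [0.3, 0.2, 0.1]}}
pred = {"vid_a": {"amusement": [0.2, 0.4, 0.6], "anger": [0.1, 0.2, 0.3]}}

# amusement is perfectly correlated and anger perfectly anti-correlated,
# so the average is approximately 0.
score = challenge_score(pred, truth)
print(score)
```

Because the correlation is invariant to the scale and offset of the predictions, a model is rewarded for tracking how each expression rises and falls over a video rather than for matching absolute values.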

Affiliates

Team

Gautam Prasad (Google)
Jennifer J. Sun (Caltech)
Vikash Gupta (Mayo Clinic)
Salil Soman (Beth Israel Deaconess Medical Center)

Contact

If you have any questions, please contact the AUVi workshop organizers at auvi.workshop@gmail.com.