Generate Synchronized & Contextually Accurate Videos
Welcome to the Sounding Video Generation (SVG) Challenge 2024! This challenge invites you to build models that generate synchronized and contextually accurate videos. Participants can showcase their skills and push the boundaries of sounding video generation across two tracks:
- Temporal Alignment
- Spatial Alignment
Introduction
Video generation research has progressed significantly, with large-scale diffusion models producing realistic videos. However, sounding video generation, which involves well-aligned video and audio modalities, remains underexplored. The SVG Challenge aims to advance this field by providing a platform for benchmarking and showcasing state-of-the-art models.
The Sounding Video Generation Challenge
Build state-of-the-art AI models to generate videos, ensuring the audio is synchronized and contextually appropriate.
Temporal Alignment Track
This track aims to generate videos that are temporally and semantically aligned with their corresponding audio. This involves producing high-resolution videos (256x256 pixels, 8fps) with monaural audio (1 channel, 16kHz).
You will tackle two types of alignment:
- Semantic Alignment: The audio's semantic class should match the video. For instance, if the video shows a dog barking, the audio should contain a barking sound.
- Temporal Alignment: The audio should be synchronized with the video. For example, the barking sound should occur precisely when the dog is seen barking.
In this track, submissions will be evaluated on how well the audio and video synchronize over time. Participants will train on a customized dataset, SVGTA24, derived from the Greatest Hits dataset and supplied with prepared video captions. A baseline model based on AnimateDiff and AudioLDM is provided. Submissions will be tested on a set of text prompts to assess synchronization.
More details are available on the Temporal Alignment Track page.
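The format requirements above (256x256 video at 8 fps, monaural 16 kHz audio, with both streams covering the same duration) can be sketched as a simple validation check. This is an illustrative sketch only; the function name, array layouts, and tolerance are assumptions, not part of the official submission interface.

```python
import numpy as np

# Assumed spec constants for the Temporal Alignment track,
# taken from the track description above.
VIDEO_SIZE = 256      # 256x256 pixels
VIDEO_FPS = 8         # 8 frames per second
AUDIO_SR = 16_000     # 16 kHz sample rate
AUDIO_CHANNELS = 1    # monaural audio

def check_temporal_submission(video: np.ndarray, audio: np.ndarray) -> float:
    """Check that a generated clip matches the track format and that video
    and audio cover the same duration. Returns the duration in seconds.

    video: (num_frames, height, width, 3) uint8 array
    audio: (num_samples, channels) float array
    """
    num_frames, h, w, c = video.shape
    assert (h, w, c) == (VIDEO_SIZE, VIDEO_SIZE, 3), "expected 256x256 RGB frames"

    num_samples, channels = audio.shape
    assert channels == AUDIO_CHANNELS, "track 1 uses monaural (1-channel) audio"

    # Video and audio durations should agree to within one frame period.
    video_dur = num_frames / VIDEO_FPS
    audio_dur = num_samples / AUDIO_SR
    assert abs(video_dur - audio_dur) < 1 / VIDEO_FPS, "video/audio duration mismatch"
    return video_dur
```

For example, a 4-second clip would carry 32 frames (4 × 8 fps) and 64,000 audio samples (4 × 16,000 Hz).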
Spatial Alignment Track
This track aims to create videos with spatially aligned audio, giving a sense of space and direction. This involves producing high-resolution videos (256x256 pixels, 4fps) with stereo audio (2 channels, 16kHz).
Participants should focus on generating videos where the spatial alignment of the audio enhances the sense of space and direction, ensuring that the audio and video components are well-integrated.
Participants will use a customized dataset, SVGSA24, derived from the STARSS23 dataset: the original equirectangular-view videos with Ambisonics audio have been converted to perspective-view videos with stereo audio. Additionally, we have curated content focusing on on-screen speech and instrument sounds. Participants will train on this dataset and submit systems that generate video together with a 2-channel audio signal. A baseline model based on MM-Diffusion is provided. Evaluation will consider how well the generated video and audio align spatially.
More details are available on the Spatial Alignment Track page.
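The Ambisonics-to-stereo conversion mentioned above can be illustrated with a minimal first-order decode. This is a sketch under assumptions: it presumes B-format audio in ACN channel order [W, Y, Z, X] with SN3D normalization and decodes with a simple cardioid pair facing ±90°; the actual pipeline used to build SVGSA24 is not specified here and may differ.

```python
import numpy as np

def foa_to_stereo(foa: np.ndarray) -> np.ndarray:
    """Decode first-order Ambisonics to stereo with virtual cardioid mics.

    foa: (num_samples, 4) array with channels [W, Y, Z, X] (ACN order, SN3D).
    Returns a (num_samples, 2) stereo array [left, right].
    """
    w, y = foa[:, 0], foa[:, 1]
    # A cardioid virtual mic at azimuth theta picks up
    # 0.5 * (W + cos(theta)*X + sin(theta)*Y); at +/-90 degrees the X term
    # vanishes, leaving only W and the left-right component Y.
    left = 0.5 * (w + y)
    right = 0.5 * (w - y)
    return np.stack([left, right], axis=1)
```

With this decode, a plane wave arriving from the left (W = Y = s, X = Z = 0 under SN3D) lands entirely in the left channel, which is the intended panning behavior.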
Timeline
The SVG Challenge takes place in two rounds, with an additional warm-up round:
- Warm-Up Round: 1st Oct 2024
- Phase I: 21st Oct 2024
- Phase II: 2nd Dec 2024
- Challenge End: 31st January 2025
Prizes
The total prize pool is $35,000, divided between the two tracks. Teams can win prizes across multiple leaderboards.
Track 1: Temporal Alignment ($17,500)
- First place: $10,000
- Second place: $5,000
- Third place: $2,500
Track 2: Spatial Alignment ($17,500)
- First place: $10,000
- Second place: $5,000
- Third place: $2,500
Please refer to the Challenge Rules for more details on the Open Sourcing criteria for eligibility.