AIcrowd | Scene Understanding for Autonomous Drone Delivery (SUADD'23)

Round 1: Completed

Round 2: Completed

AIcrowd & Amazon Prime Air

Problem Statements

Weight: 1.0

Semantic Segmentation

Perform semantic segmentation on aerial images from a monocular downward-facing drone camera

Weight: 1.0

Mono Depth Perception

Estimate depth in aerial images from a monocular downward-facing drone camera

🎯 Select your final submissions

- Semantic Segmentation selection form

- Mono Depth selection form

📣 Update in segmentation scoring metrics

🗳️ How to download dataset via CLI

🕵️‍♂️ Introduction

Unmanned Aircraft Systems (UAS) have various applications, such as environmental studies, emergency response and package delivery. The safe operation of fully autonomous UAS requires robust perception systems.

For this challenge, we focus on images from a single downward-facing camera, using them to estimate the scene's depth and to perform semantic segmentation. The results of these two tasks can help the development of safe and reliable autonomous control systems for aircraft.

This challenge includes the release of a new dataset of drone images for benchmarking semantic segmentation and mono-depth perception. The images in this dataset show realistic backyard scenarios of variable content and were taken at a range of altitudes Above Ground Level (AGL).

This challenge aims to foster the development of fully autonomous Unmanned Aircraft Systems (UAS).

Achieving full autonomy means overcoming a multitude of challenges. To navigate fully autonomously, a drone needs to understand both the objects in a scene and their scale and distance.

This project's two key computer vision components are semantic segmentation and depth perception.

With this challenge, we aim to inspire the Computer Vision community to develop new insights and advance the state of the art in perception tasks involving drone images.

👩‍🎓 Key Tasks

Understanding the 3D scene below the drone is helpful for many of the challenges autonomous drones must address. Semantic segmentation and depth perception are two key components of this. Hence these are the two main goals of this challenge.

Each of these two tasks has its own benchmark. Both are to be solved using data from a single greyscale camera.

Task 1: Semantic Segmentation

Semantic segmentation is the labelling of the pixels of an image according to the category of the object to which they belong. The output for this task is an image in which each pixel has the value of the class it represents.

For this task, we focus on labels that matter for a safe landing, such as the location of humans and animals, round or flat surfaces, tall grass and water elements, vehicles and so on. The complete list of the 16 labels is: [WATER, ASPHALT, GRASS, HUMAN, ANIMAL, HIGH_VEGETATION, GROUND_VEHICLE, FAÇADE, WIRE, GARDEN_FURNITURE, CONCRETE, ROOF, GRAVEL, SOIL, PRIMEAIR_PATTERN, SNOW].
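As a minimal sketch of the expected output format, the snippet below turns per-class scores into an image in which each pixel holds a class value. The integer ids simply follow the order of the label list above, which is an assumption, and `scores_to_label_map` is a hypothetical helper name; the official id mapping should be taken from the dataset documentation.

```python
import numpy as np

# Class names as listed in the challenge description. The integer ids
# follow that listing order, which is an ASSUMPTION for illustration only.
CLASS_NAMES = [
    "WATER", "ASPHALT", "GRASS", "HUMAN", "ANIMAL", "HIGH_VEGETATION",
    "GROUND_VEHICLE", "FAÇADE", "WIRE", "GARDEN_FURNITURE", "CONCRETE",
    "ROOF", "GRAVEL", "SOIL", "PRIMEAIR_PATTERN", "SNOW",
]
CLASS_TO_ID = {name: idx for idx, name in enumerate(CLASS_NAMES)}


def scores_to_label_map(scores: np.ndarray) -> np.ndarray:
    """Convert per-class scores of shape (C, H, W) into an (H, W) label image."""
    if scores.ndim != 3 or scores.shape[0] != len(CLASS_NAMES):
        raise ValueError("expected scores of shape (16, H, W)")
    # Each pixel gets the id of its highest-scoring class.
    return np.argmax(scores, axis=0).astype(np.uint8)
```

Any segmentation model that produces one score per class and pixel can be plugged in front of this step; the argmax is only the final conversion to a label image.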

Task 2: Mono-Depth Estimation

Depth estimation measures the distance between the camera and the objects in the scene. It is an important perception task for an autonomous aerial drone. With a pair of stereo cameras, this task can be solved with stereo vision methods. This challenge instead aims to create a model that predicts the depth of every pixel using the information from a single camera.

The output of this task must be an image of equal size to the input image, in which every pixel contains a depth value.
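The size requirement can be checked before submitting. The following is a small sketch, assuming NumPy arrays for both the input image and the prediction; `validate_depth_prediction` is a hypothetical helper, and the choice of `float32` and the rejection of non-finite values are assumptions rather than rules stated by the challenge.

```python
import numpy as np


def validate_depth_prediction(image: np.ndarray, depth: np.ndarray) -> np.ndarray:
    """Check that `depth` holds one depth value per pixel of `image`."""
    height, width = image.shape[:2]
    if depth.shape != (height, width):
        raise ValueError(
            f"depth shape {depth.shape} does not match image size {(height, width)}"
        )
    # ASSUMPTION: NaN or infinite depths are treated as invalid predictions.
    if not np.all(np.isfinite(depth)):
        raise ValueError("depth contains NaN or infinite values")
    # ASSUMPTION: float32 is used as the output dtype for illustration.
    return depth.astype(np.float32)
```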

💾 Dataset

The dataset consists of a collection of flight frames at given timestamps taken from one of the downward cameras of our drones during dedicated data collection operations, not during customer delivery operations.

The dataset contains 412 flights, 2056 total frames (5 frames per flight at different AGLs), full semantic segmentation annotations of all frames, and depth estimations. The dataset has been split into training and (public) test datasets. Although the challenge is scored using a private test dataset, we considered this split useful because it allows teams to share their results even after the challenge ends.

This dataset contains bird's-eye-view greyscale images taken between 5 m and 25 m AGL. Annotations for the semantic segmentation task are fully labelled images across 16 distinct classes, while annotations for the mono-depth estimation task have been computed with geometric stereo-depth algorithms. To the best of our knowledge, this is the largest dataset with full semantic annotations and mono-depth estimation ground truth over a wide range of AGLs and different scenes.
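For readers unfamiliar with geometric stereo depth, the textbook relation for a rectified stereo pair is depth = focal length × baseline / disparity. The sketch below illustrates only that general relation; the source does not say which algorithm or camera parameters were used for the ground truth, and the function and parameter names are hypothetical.

```python
import numpy as np


def depth_from_disparity(disparity_px, focal_length_px: float, baseline_m: float) -> np.ndarray:
    """Textbook rectified-stereo relation: Z = f * B / d.

    disparity_px: per-pixel disparity in pixels
    focal_length_px: focal length in pixels (illustrative parameter)
    baseline_m: distance between the two cameras in metres (illustrative)
    """
    disparity = np.asarray(disparity_px, dtype=np.float64)
    depth = np.full(disparity.shape, np.nan)
    valid = disparity > 0  # zero disparity has no finite depth; left as NaN
    depth[valid] = focal_length_px * baseline_m / disparity[valid]
    return depth
```

Pixels without a valid disparity stay as NaN here, which is one common convention for marking missing depth; the dataset may encode missing values differently.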

Images can be in uint8 or uint16 format. They can be loaded with OpenCV, for example.
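A minimal loading sketch, assuming the opencv-python package is installed: `cv2.IMREAD_UNCHANGED` keeps the stored bit depth, so uint16 files are not silently converted to uint8. The helper names and the scaling to [0, 1] by the dtype's maximum value are assumptions for illustration, not part of the challenge specification.

```python
import numpy as np


def load_greyscale(path: str) -> np.ndarray:
    """Read an image file while preserving its bit depth (uint8 or uint16)."""
    # Imported here so that to_unit_range can be used without OpenCV installed.
    import cv2

    image = cv2.imread(path, cv2.IMREAD_UNCHANGED)
    if image is None:  # cv2.imread returns None instead of raising on failure
        raise FileNotFoundError(f"could not read image: {path}")
    return image


def to_unit_range(image: np.ndarray) -> np.ndarray:
    """Scale a uint8 or uint16 image to float32 values in [0, 1]."""
    if image.dtype not in (np.uint8, np.uint16):
        raise TypeError(f"unsupported dtype: {image.dtype}")
    # ASSUMPTION: the full dtype range is used (255 for uint8, 65535 for uint16).
    return image.astype(np.float32) / np.iinfo(image.dtype).max
```

Normalising both formats to the same range lets one model consume either kind of file; whether that is the right preprocessing depends on how the images were encoded.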

Ethical Considerations About The Data

The challenge dataset contains images from realistic flight footage taken as part of our research and development programs, not from real customer deliveries. Furthermore, all personal identifiers have been removed.

📕 Starter Kit

Check out these easy-to-follow starter kits and baselines to get familiar with the documentation, submission flow and setup. The starter kit will help you make your first submission.