🎉 The challenge has ended.


πŸ“ Submit Abstracts explaining your approaches to the ICCV Workshop here. (Deadline: 10th Sep)
βœ… Select submissions to be evaluated on full dataset here. (Deadline: 3rd Sep)

🚀 Make your submission with the starter kit | Discussions

😎 SiamMOT baseline with EDR=0.669 and AFDR=0.6265

chat on Discord

πŸ•΅οΈβ€β™‚οΈ Introduction

One of the important challenges of autonomous flight is the Sense and Avoid (SAA) task to maintain enough separation from obstacles. While the route of an autonomous drone might be carefully planned ahead of its mission, and the airspace is relatively sparse, there is still a chance that the drone will encounter unforeseen airborne objects or static obstacles during its autonomous flight.

The autonomous SAA module has to take on the tasks of situational awareness, decision making, and flying the aircraft, while performing an evasive maneuver.

There are several alternatives for onboard sensing including radar, LIDAR, passive electro-optical sensors, and passive acoustic sensors. Solving the SAA task with visual cameras is attractive because cameras have relatively low weight and low cost.

For the purpose of this challenge, we consider a solution that relies solely on a single visual camera and on Computer Vision techniques that analyze monocular video.

Flying airborne objects pose unique challenges compared to static obstacles. In addition to the typical small size, it is not sufficient to merely detect and localize those objects in the scene, because prediction of the future motion is essential to correctly estimate if the encounter requires a collision avoidance maneuver and create a safer route. Such prediction will typically rely on analysis of the motion over a period of time, and therefore requires association of the detected objects across the video frames.

As a preliminary stage for determining whether a collision avoidance maneuver is necessary, this challenge is concerned with spatio-temporal airborne object detection and tracking. Given a new Airborne Object Tracking dataset, submissions are evaluated on two benchmarks:

  1. Airborne detection and tracking
  2. Frame-level airborne detection

💾 Dataset

Airborne Object Tracking Dataset (AOT) Description

The Airborne Object Tracking (AOT) dataset is a collection of flight sequences collected onboard aerial vehicles with high-resolution cameras. To generate those sequences, two aircraft are equipped with sensors and fly planned encounters (e.g., Helicopter1 in Figure 1(a)). The trajectories are designed to create a wide distribution of distances, closing velocities, and approach angles. In addition to the so-called planned aircraft, AOT also contains other unplanned airborne objects, which may be present in the sequences (e.g., Airborne1 in Figure 1(a)). Those objects are also labeled, but their distance information is not available. Airborne objects usually appear quite small at the distances that are relevant for early detection: 0.01% of the image size on average, down to a few pixels in area (compared to common object detection datasets, in which objects cover a more considerable portion of the image). This makes AOT a new and challenging dataset for the detection and tracking of potentially approaching airborne objects.

Figure 1: Overview of the Airborne Object Tracking (AOT) dataset, more details in the "Data Diversity" section

In total, AOT includes close to 164 hours of flight data:

  • 4,943 flight sequences of around 120 seconds each, collected at 10 Hz in diverse conditions. Each sequence typically includes at most one planned encounter, although some may include more.
  • 5.9M+ images
  • 3.3M+ 2D annotations
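The totals above follow from the per-sequence figures. As a rough cross-check (a sketch that assumes every sequence lasts exactly 120 seconds, whereas the text says "around 120 seconds"), the arithmetic is:

```python
# Figures quoted in the list above.
NUM_SEQUENCES = 4943
SEQUENCE_SECONDS = 120  # nominal length; real sequences are "around" this long
FRAME_RATE_HZ = 10

# Total flight time in hours and total number of frames under that assumption.
total_hours = NUM_SEQUENCES * SEQUENCE_SECONDS / 3600
total_images = NUM_SEQUENCES * SEQUENCE_SECONDS * FRAME_RATE_HZ

print(f"{total_hours:.1f} hours, {total_images / 1e6:.2f}M images")
```

Both values land slightly above the quoted "close to 164 hours" and "5.9M+ images", which is expected if some sequences are a little shorter than the nominal length.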

Video 1: Flight sequence of the drone, with other airborne objects annotated

Dataset Diversity

A unique feature of AOT compared to comparable existing datasets is the wide spectrum of challenging conditions it covers for the detection and tracking of airborne objects. 

Figure 2: Sample images showcasing the diversity in the AOT dataset

  • Airborne object size: often a direct proxy for the distance to the object, the area of objects in the dataset varies from 4 to 1000 pixels, as illustrated in Fig. 1 (b). Note that the ground truth for tiny and small objects cannot be marked perfectly tight; instead, it is approximated with circles of radius 3 and 8 pixels respectively, which yield the two bright horizontal lines in Fig. 1 (b).
  • Planned encounters:
    • Distance to the object: concentrated between 600 and 2,000 meters (25th to 75th percentiles)
    • Closing velocity (the velocity with which the object approaches the camera): up to 70 meters per second
    • Angle of approach:
      • Azimuth: from -60 to 60 degrees
      • Elevation: from -45 to 45 degrees
    • Collision risk: out of the planned encounters, it is estimated that 55% of them would qualify as potential collision trajectories and close encounters
  • Camera roll angle: related to camera trajectory, the bank angle goes up to 60 degrees in high bank turns
  • Altitude: the altitude of the camera varies from 24 to 1,600 meters above mean sea level (MSL) with most captures between 260 and 376 meters MSL. The captures are as low as 150 meters above ground, which is challenging to capture.
  • Distance to visual horizon: 80% of targets are above the horizon, 1% on the horizon, and 19% below. This feature particularly affects the amount of clutter in the background of the object.
  • Airborne object type: see Figure 1 (d) and Table 1 below
  • Sky conditions and visibility: sequences with clear, partly cloudy, cloudy, and overcast skies are provided. 69% of the sequences have good visibility, 26% have medium visibility, and 5% exhibit poor visibility conditions.
  • Light conditions:
    • Back-lit aircraft, sun flare, or overexposure are present in 5% of the sequences
    • Time of the day: data was captured only during well-lit daylight operations but at different times of the day, creating different sun angle conditions
  • Terrain: flat horizon, hilly terrain, mountainous terrain, shorelines

Table 1 below provides an overview of the objects present in the dataset. There are 3,306,350 frames without labels as they contain no airborne objects. Note that all airborne objects are labeled. For images with labels, there are on average 1.3 labels per image.



All Airborne Objects

* includes hot air balloons, ultra lights, drones, etc

Table 1: Types and distribution of airborne object labels

Data Collection Process

During a given data capture, the two sensor-equipped aircraft perform several planned rectilinear encounters, repositioning in between maneuvers. This large single data record is then split into digestible sequences of 120 seconds. Due to those cuts in the original record, individual sequences may comprise a mix of: rectilinear approaches, steep turns, with or without the aircraft in sight. As an example, it is possible to have a single rectilinear approach split across two sequences. In addition to the planned aircraft, a given sequence might contain other unplanned airborne objects like birds and small airplanes, or even no airborne objects.

Data Format

The data obtained from two front-facing cameras, the Inertial Navigation System (INS), and the GPS provide the onboard imagery, the orientation and position of the aircraft. The provided dataset will therefore include:

  1. Front-view, low-altitude videos (sampled as .png images at 10 FPS) 
  2. Distance to planned aircraft (calculated based on their GPS) 
  3. Manually labeled ground truth bounding boxes for all visible airborne objects. 

Additional details on the dataset

Dataset Folder Structure: The dataset is given as a training directory, while the validation and test sets are kept separate and are not available to competitors. To ensure generalization, sequences collected on the same day are either all in the training dataset or all in the validation / test dataset; they are never split between the two (the validation and test sets can share common days / areas).
The training set is further split into smaller directories (to facilitate download), each one containing ImageSets and Images folders. 

The ImageSets folder holds:

  1. groundtruth.json (and its tabular representation  groundtruth.csv), which contains metadata and ground truth information about sequence images.
  2. valid_encounters_maxRange700_maxGap3_minEncLen30.json, which contains information about encounters (defined in the 🎯 Benchmarks section) with planned aircraft within 700 m distance that last at least 3 seconds.

    For each encounter, we provide the corresponding sequence (sub-folder) name, the relevant image names, and additional information: distance statistics of the aircraft in the encounter, whether the encounter is below or above the horizon, and its length in frames.
    This file provides information on a representative set of images / sequences to start training with, in case usage of the full dataset is not possible.
  3. valid_encounters_maxRange700_maxGap3_minEncLen30.csv, a tabular representation of the encounter information from valid_encounters_maxRange700_maxGap3_minEncLen30.json (the image names that correspond to each encounter are omitted).

The Images folder holds the images, with one directory per sequence (a directory name is unique within each Images folder, but can repeat across different Images folders). An overview of the dataset split is provided in Table 2 below.


Size (TB)

validation + test

Table 2: Dataset size

Sequence format: Each sequence is contained in a directory labeled with a universally unique identifier (UUID); the directory contains the images of the sequence, captured at 10 Hz.

Image format: 2448 pixels wide by 2048 pixels high, encoded as 8-bit grayscale images and saved as PNG files (lossless compression). The filenames follow the convention <timestamp><uuid>.png. The timestamp is 19 characters, and the UUID is 32 characters. The field of view of the camera is 67.8 by 56.8 degrees, for an angular resolution of 0.48 mrad per pixel.
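The filename convention and the quoted angular resolution can both be checked with a few lines of Python. This is a minimal sketch: the helper names `parse_image_name` and `mrad_per_pixel` are hypothetical and not part of any provided tooling, and the unit of the timestamp is not stated in the text, so it is kept as a plain integer.

```python
import math
from typing import NamedTuple

TIMESTAMP_LEN = 19  # characters, per the filename convention above
UUID_LEN = 32       # characters, per the filename convention above


class ImageName(NamedTuple):
    timestamp: int  # unit not stated in the text; kept as a raw integer
    flight_id: str  # the 32-character UUID of the sequence


def parse_image_name(img_name: str) -> ImageName:
    """Split '<timestamp><uuid>.png' into its two fields."""
    stem, _, ext = img_name.rpartition(".")
    if ext != "png" or len(stem) != TIMESTAMP_LEN + UUID_LEN:
        raise ValueError(f"unexpected image name: {img_name!r}")
    return ImageName(int(stem[:TIMESTAMP_LEN]), stem[TIMESTAMP_LEN:])


def mrad_per_pixel(fov_deg: float, pixels: int) -> float:
    """Average angular resolution, in milliradians per pixel."""
    return math.radians(fov_deg) / pixels * 1000


parsed = parse_image_name("1566556046185850341673f29c3e4b4428fa26bc55d812d45d9.png")
horizontal = mrad_per_pixel(67.8, 2448)
vertical = mrad_per_pixel(56.8, 2048)
print(parsed.flight_id, round(horizontal, 2), round(vertical, 2))
```

Both axes round to the 0.48 mrad per pixel quoted above.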

Ground truth format: The groundtruth.json files contain 2 keys, metadata and samples, organized as follows.

  {
    "metadata": {
      "description": "PrimeAir, camera 0 ",  # Description of the sequences
      "last_modified": "Jan-08-2021 23:27:55",  # Last time the file was modified
      "version": "1.0"  # Version of the ground truth file
    },
    "samples": "[...]"  # Collection of sample sequences
  }

Code Block 1: structure of the groundtruth.json files

Each sample sequence is then provided with its own metadata and entities:

  {
    "metadata": {
      "data_path": "train/673f29c3e4b4428fa26bc55d812d45d9/",  # Relative path to video
      "fps": 10.0,  # Frequency of the capture in frames per second (FPS)
      "number_of_frames": 1199,  # Number of frames in the sequence
      "resolution": {
        "height": 2048,  # Height of the images in the sequence
        "width": 2448  # Width of the images in the sequence
      }
    },
    "entities": [...]  # Collection of entities (frames / objects)
  }

Code Block 2: structure of a sample sequence

Finally, each entity corresponds to an image ground truth label. If the label corresponds to a planned airborne object, its distance information may be available. Note that distance data is not available for other non-planned airborne objects in the scene. Fields that may not be available are marked as optional below. For example, an image frame may not contain an object label if no airborne object is present in the scene; however, some information about the image is still provided (frame number and timestamp).

  {
    "time": 1573043646380340792,  # Timestamp associated with the image
    "blob": {
      "frame": 3,  # Frame number associated with the image
      "range_distance_m": 1366  # (optional) Distance to planned airborne objects [m]
    },
    "id": "Airplane1",  # (optional) Identifier for the label (unique for the sequence)
    "bb": [1355.2, 1133.4, 6.0, 6.0],  # (optional) Bounding box [top, left, width, height]
    "labels": {
      "is_above_horizon": -1  # The object is Below(-1)/Not clear(0)/Above(1) the horizon
    },
    "flight_id": "673f29c3e4b4428fa26bc55d812d45d9",
    "img_name": "1566556046185850341673f29c3e4b4428fa26bc55d812d45d9.png"
  }

Code Block 3: structure of an entity (image label)
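As an illustration of how the three structures above nest, the sketch below walks a ground truth dictionary and yields only the entities that carry an object label. It rests on assumptions: the text only calls "samples" a "collection", so the code accepts either a list of sequences or a dictionary keyed by sequence name, and the helper names are hypothetical.

```python
from typing import Any, Dict, Iterator, Optional


def iter_labeled_entities(groundtruth: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
    """Yield entities that have both an 'id' and a 'bb' (i.e., an object label)."""
    samples = groundtruth["samples"]
    # Assumption: "samples" is a list of sequences or a dict of them.
    sequences = samples.values() if isinstance(samples, dict) else samples
    for sequence in sequences:
        for entity in sequence["entities"]:
            if "id" in entity and "bb" in entity:
                yield entity


def planned_distance_m(entity: Dict[str, Any]) -> Optional[float]:
    """Return the optional distance field, or None for non-planned objects."""
    return entity.get("blob", {}).get("range_distance_m")


# Tiny in-memory example shaped like Code Blocks 1-3 (values are illustrative).
groundtruth = {
    "metadata": {"version": "1.0"},
    "samples": {
        "673f29c3e4b4428fa26bc55d812d45d9": {
            "metadata": {"fps": 10.0, "number_of_frames": 2},
            "entities": [
                {"time": 1, "blob": {"frame": 1}},  # frame without any label
                {"time": 2, "blob": {"frame": 2, "range_distance_m": 1366},
                 "id": "Airplane1", "bb": [1355.2, 1133.4, 6.0, 6.0]},
            ],
        }
    },
}

labeled = list(iter_labeled_entities(groundtruth))
print(len(labeled), planned_distance_m(labeled[0]))
```

With a real file, the dictionary would come from `json.load` applied to groundtruth.json.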


Please check out DATASET.md to download the dataset and documentation. 

🎯 Benchmarks

The Challenge has two benchmarks: the airborne detection and tracking benchmark and the frame-level airborne object detection benchmark. Teams must clearly indicate which benchmark(s) the submission is participating in. The benchmarks are explained below.

1. Airborne Detection and Tracking Benchmark

The airborne detection and tracking task is essentially online multi-object tracking with private detections (i.e., detections generated by the algorithm and not provided from external input). There is a wide range of evaluation metrics for multi-object tracking; however, the unique nature of the problem imposes certain requirements that help us define specific metrics for the Airborne Detection and Tracking Benchmark. Those requirements and metrics are outlined below.
To ensure safe autonomous flight, the drone should be able to detect a possible collision with an approaching airborne object and maneuver to prevent it. However, unless a possible collision is detected, the best way to ensure a safe flight is to follow the originally planned route. Deviating from the planned route increases the chances of encounters with other airborne objects and static obstacles not previously captured by the drone camera. As such, false alarms that might trigger unnecessary maneuvers should be avoided, which imposes a very low budget of false alarms (high-precision detection). Another consideration is that, while early detection is generally desired, information from the early stages of an encounter might not be indicative of the future motion of the detected airborne object. Therefore, an effective alert must be based on detection (or tracking) that is late enough to allow accurate prediction of future motion, and yet early enough to allow time to maneuver. Typically, such a temporal window will depend on the closing velocity between the drone and the other airborne object. However, for simplicity, we will refer to the distance between the drone and the encountered airborne object to establish when the detections must occur. Finally, to capture sufficient information for future motion prediction, the object should be tracked for several seconds.

To summarize, the requirements for desired solutions are:

  1. Very low number of false alarms 
  2. Detections of the airborne object within the distance that allows maneuver (i.e., not too close) and is informative for future motion prediction (i.e., not too far away)
  3. Tracking the airborne object for sufficient time to allow future motion prediction 


Next, we define the airborne metrics that evaluate whether the above requirements are met.

The airborne metrics measure:

  1. Encounter-Level Detection Rate (EDR) - the number of successfully detected encounters divided by the total number of encounters that should be detected, where an encounter is defined as a temporal sequence (a subset of frames) in which the same planned aircraft (airborne object) is visible (i.e., is manually labeled) and is located within the pre-defined range of distances. An encounter is successfully detected if:
    1. Its respective airborne object is tracked for at least 3 seconds within the encounter duration.
    2. The detection and 3-second tracking occur before the airborne object is within 300 m of the drone or within the first 3 seconds of the encounter.
  2. False Alarm Rate per hour (HFAR) - the number of unique reported track ids that correspond to at least one false positive airborne report, divided by the total number of hours in the dataset.
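Once encounters have been marked as detected or not and false positive reports have been identified, both metrics reduce to simple ratios. The following is a minimal sketch, not the official evaluation code: the function names are hypothetical, and it assumes that a track id is only unique within a sequence, so false alarms are keyed by a (sequence, track id) pair.

```python
from typing import Hashable, Iterable


def encounter_detection_rate(num_detected: int, num_should_be_detected: int) -> float:
    """EDR: detected encounters over encounters that should be detected."""
    if num_should_be_detected <= 0:
        raise ValueError("there must be at least one encounter to detect")
    return num_detected / num_should_be_detected


def false_alarm_rate_per_hour(false_positive_tracks: Iterable[Hashable],
                              total_hours: float) -> float:
    """HFAR: each track counts once, however many false reports it produced."""
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    return len(set(false_positive_tracks)) / total_hours


# Illustrative input: one (sequence, track id) pair per false positive report.
false_reports = [("seqA", 0), ("seqA", 0), ("seqA", 3), ("seqB", 0)]
hfar = false_alarm_rate_per_hour(false_reports, total_hours=10.0)
edr = encounter_detection_rate(num_detected=67, num_should_be_detected=100)
print(hfar, edr)
```

The repeated `("seqA", 0)` report adds nothing to HFAR, which reflects the "counted once per track id" rule in the definition above.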

Evaluation of submissions

As previously outlined, one of the requirements of safe autonomous flight is a very low number of false alarms. Any solution that exceeds the available budget of false alarms will not be usable in practice due to safety concerns. To encourage realistic solutions and simplify the evaluation, we define an HFAR budget of 5 false alarms per 10 hours of flight (HFAR = 0.5). Any submission with HFAR > 0.5 will be published on the leaderboard, but will not be considered for ranking. All the submissions that have HFAR <= 0.5 will be ranked based on EDR.
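The ranking rule can be sketched as a filter followed by a sort. This is an illustration only: the `Submission` type and the example values are hypothetical, and ties in EDR are broken by lower HFAR, as stated in the Baselines section.

```python
from typing import Iterable, List, NamedTuple, Tuple

HFAR_BUDGET = 0.5  # 5 false alarms per 10 hours of flight, from the text


class Submission(NamedTuple):
    name: str
    edr: float
    hfar: float


def split_and_rank(submissions: Iterable[Submission]) -> Tuple[List[Submission], List[Submission]]:
    """Return (ranked, published_but_unranked) according to the HFAR budget."""
    submissions = list(submissions)
    within_budget = [s for s in submissions if s.hfar <= HFAR_BUDGET]
    over_budget = [s for s in submissions if s.hfar > HFAR_BUDGET]
    # Higher EDR first; equal EDR is ordered by lower HFAR.
    ranked = sorted(within_budget, key=lambda s: (-s.edr, s.hfar))
    return ranked, over_budget


ranked, unranked = split_and_rank([
    Submission("a", edr=0.60, hfar=0.4),
    Submission("b", edr=0.70, hfar=0.9),  # best EDR, but over the budget
    Submission("c", edr=0.60, hfar=0.2),
    Submission("d", edr=0.50, hfar=0.5),  # exactly on the budget: still ranked
])
print([s.name for s in ranked], [s.name for s in unranked])
```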

2. Frame-level Airborne Object Detection Benchmark

While the first benchmark of this challenge involves tracking, participants can also submit results for the frame-level airborne object detection benchmark. The frame-level metrics will measure:

  1. Average frame-level detection rate (AFDR) - a ratio between the number of the detected airborne objects and all the airborne objects that should be detected. For the purpose of this calculation, all the planned airborne aircraft within 700m distance will be considered. 
  2. False positives per image (FPPI) - a ratio between the number of false positive airborne reports and the number of images in the dataset.

Evaluation of submissions

To simplify the evaluation and encourage the development of realistic solutions, the results will be evaluated based on AFDR with a budget of FPPI. Any submission with FPPI > 0.0005 will be published on the leaderboard, but NOT considered for ranking. All the submissions that have FPPI <= 0.0005 will be ranked based on AFDR.
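To give a feel for how tight the FPPI budget is, the sketch below computes both frame-level metrics from raw counts and converts the budget into more intuitive units, using the 10 Hz capture rate stated in the dataset description. The function names and the example counts are hypothetical.

```python
FPPI_BUDGET = 0.0005   # from the text
FRAME_RATE_HZ = 10     # capture rate stated in the dataset description


def average_frame_level_detection_rate(num_detected: int, num_should_be_detected: int) -> float:
    """AFDR: detected airborne objects over objects that should be detected."""
    if num_should_be_detected <= 0:
        raise ValueError("there must be at least one object to detect")
    return num_detected / num_should_be_detected


def false_positives_per_image(num_false_positive_reports: int, num_images: int) -> float:
    """FPPI: false positive airborne reports over images in the dataset."""
    if num_images <= 0:
        raise ValueError("num_images must be positive")
    return num_false_positive_reports / num_images


# The budget expressed as "one false positive every N images" and per flight hour.
images_per_false_positive = 1 / FPPI_BUDGET                   # about 2000 images
false_positives_per_hour = FPPI_BUDGET * FRAME_RATE_HZ * 3600  # about 18 per hour

fppi = false_positives_per_image(num_false_positive_reports=3, num_images=12000)
print(fppi <= FPPI_BUDGET, round(images_per_false_positive), round(false_positives_per_hour))
```

Note that this per-hour figure is not directly comparable with the HFAR budget: FPPI counts every false positive report, while HFAR counts each track id only once.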

Additional details on detection evaluation and false alarms calculation 

We now elaborate on the definition of the encounters that form the set of encounters for the detection and tracking benchmark. Recall that a planned aircraft is equipped with GPS during data collection and therefore provides GPS measurements associated with its physical location. We further define a valid airborne encounter as an encounter with a planned aircraft during which the maximum distance to the aircraft is at most UPPER_BOUND_MAX_DIST. The upper bound on the maximum distance ensures that the detection will be benchmarked with respect to airborne objects that are not too far away from the camera. In addition, an upper bound on the minimum distance in the encounter is defined as UPPER_BOUND_MIN_DIST (to disregard encounters that do not get sufficiently close to the camera).
Note that dataset videos and the provided ground truth labels might contain other airborne objects that are not planned, or planned airborne objects that do not belong to valid encounters. The airborne metrics do not consider those objects for the detection rate calculation and treat them as 'don't care' (i.e., those detections will not be counted towards false alarms). Frame-level metrics consider non-planned objects and planned objects at range > 700m as 'don't care'.

Any airborne report (as defined in Table 3) that does not match an airborne object is considered a false positive and is counted once per track id as a false alarm. The reason is that a false alarm might trigger a potential maneuver, and hence false positives that occur later and correspond to the same object have a lower overall impact in real scenarios.

The definitions of successful detection and false positive depend on the matching criteria between the bounding box produced by the detector and the ground truth bounding box. A common matching measure for object detection is Intersection over Union (IoU). However, IoU is sensitive to small bounding boxes, and since our dataset contains very small objects, we propose to use extended IoU (eIoU), defined as follows:

  • If the ground truth area >= MIN_OBJECT_AREA, extended IoU = IoU, and
  • If the ground truth area < MIN_OBJECT_AREA, the ground truth bounding box is dilated to have an area of at least MIN_OBJECT_AREA, and all the detections (matched against this ground truth) are dilated to have an area of at least MIN_OBJECT_AREA. The dilation operation maintains the aspect ratio of the bounding boxes.

The reported bounding box is considered a match, if the eIoU between the reported bounding box and the ground truth bounding box is greater than IS_MATCH_MIN_IOU_THRESH.

If the eIoU between the reported bounding box and any ground truth is less than IS_NO_MATCH_MAX_IOU_THRESH the reported bounding box is considered a false positive.

Any other case that falls in between the two thresholds is considered neutral ('don't care'), due to possible inaccuracies in ground truth labeling.
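The matching rules above can be sketched in a few functions. This is an interpretation rather than the official metric code: it assumes boxes are given as (left, top, width, height), that dilation scales a box around its own center, and that a report is judged by its best eIoU over all ground truth boxes. The threshold values in the example are placeholders chosen for illustration; they are not taken from Table 4.

```python
import math
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # assumed layout: (left, top, width, height)


def dilate(box: Box, min_area: float) -> Box:
    """Scale a box about its center, keeping aspect ratio, to reach min_area."""
    left, top, w, h = box
    area = w * h
    if area >= min_area:
        return box
    if area <= 0:
        raise ValueError("box must have a positive area")
    scale = math.sqrt(min_area / area)
    new_w, new_h = w * scale, h * scale
    return (left - (new_w - w) / 2, top - (new_h - h) / 2, new_w, new_h)


def iou(a: Box, b: Box) -> float:
    """Plain Intersection over Union of two boxes."""
    overlap_w = max(0.0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    overlap_h = max(0.0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = overlap_w * overlap_h
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def extended_iou(gt: Box, det: Box, min_object_area: float) -> float:
    """eIoU: dilate both boxes only when the ground truth is below the minimum area."""
    if gt[2] * gt[3] < min_object_area:
        gt, det = dilate(gt, min_object_area), dilate(det, min_object_area)
    return iou(gt, det)


def classify_report(det: Box, gts: Sequence[Box], min_object_area: float,
                    match_min: float, no_match_max: float) -> str:
    """Return 'match', 'false_positive', or 'dont_care' for one reported box."""
    best = max((extended_iou(gt, det, min_object_area) for gt in gts), default=0.0)
    if best > match_min:
        return "match"
    if best < no_match_max:
        return "false_positive"
    return "dont_care"


# Placeholder constants for illustration only (not taken from Table 4).
params = dict(min_object_area=100.0, match_min=0.2, no_match_max=0.02)
gt_box = (100.0, 100.0, 6.0, 6.0)
near = extended_iou(gt_box, (102.0, 100.0, 6.0, 6.0), params["min_object_area"])
print(round(near, 3), classify_report((500.0, 500.0, 6.0, 6.0), [gt_box], **params))
```

In this example the 6 by 6 ground truth box is below the placeholder minimum area, so both boxes are dilated before the overlap is measured; a 2-pixel offset then scores higher than it would under plain IoU, which is the intended effect for very small objects.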

Please refer to Tables 3-4 for further clarifications on the terms mentioned in this section.



  • Bounding box: [top, left, width, height]
  • Planned airborne object: An airborne object with GPS (in the currently available datasets: Helicopter1, Airplane1) and a manually labeled ground truth bounding box in the image.
  • Encounter:
    1) An interval of time of at least MIN_SECS with a planned airborne object
    2) The segment can have gaps of length <= 0.1 * MIN_SECS, during which the ground truth might be missing or the object is at a farther range / not visible in the image
    3) A single encounter can include one airborne object only
  • Valid encounter (should be detected): An encounter with an airborne object such that:
    • minimum distance to the object <= UPPER_BOUND_MIN_DIST
    • maximum distance to the object <= UPPER_BOUND_MAX_DIST
  • Airborne report: Predicted bounding box, frame id, detection confidence score. Optional: track id. If not provided, the detection id will be used.
  • False positive airborne report: An airborne report that cannot be matched to ANY airborne object (i.e., eIoU with any airborne object is below IS_NO_MATCH_MAX_IOU_THRESH)
  • Detected airborne object: An airborne object that can be matched with an airborne report
  • Frame level detection rate per encounter: The ratio of the number of frames in which a specific airborne object is detected to the number of frames in which this object should be detected, within the considered temporal window of frames.

Table 3: Glossary
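The "valid encounter" entry of the glossary translates directly into a check on the per-frame distances of an encounter. The sketch below is illustrative: the function name is hypothetical, frames without a distance (allowed gaps) are represented as `None`, and the two bounds passed in the example are placeholder values rather than the constants from Table 4.

```python
from typing import Optional, Sequence


def is_valid_encounter(distances_m: Sequence[Optional[float]],
                       upper_bound_min_dist: float,
                       upper_bound_max_dist: float) -> bool:
    """Check both distance conditions of a valid encounter.

    distances_m holds one distance per frame; None marks a gap frame
    in which the ground truth distance is missing.
    """
    known = [d for d in distances_m if d is not None]
    if not known:
        return False
    return min(known) <= upper_bound_min_dist and max(known) <= upper_bound_max_dist


# Placeholder bounds for illustration only (not taken from Table 4).
bounds = dict(upper_bound_min_dist=330.0, upper_bound_max_dist=700.0)

approaching = [690.0, 640.0, None, 520.0, 410.0, 320.0]  # gets close enough
distant = [690.0, 650.0, 600.0, 580.0]                   # never gets close
print(is_valid_encounter(approaching, **bounds), is_valid_encounter(distant, **bounds))
```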
















At the ground truth resolution

Table 4: Constants

The metrics can evaluate .json files with the following dictionaries:

Result – List[Dict]:  with the following fields per element
'img_name'  - img_name as appears in the ground truth file
'detections'  - List[Dict]: with the following fields per element:
    'n'  - name of the class (typically airborne)
    'x'  - x coordinate of the center of the bounding box
    'y'  - y coordinate of the center of the bounding box
    'w'  - width 
    'h'  - height
    's'  – confidence / score
    'track_id'  / 'object_id'  - optional track or object id associated with the detection 

Please Note:

  1. It is very important to provide the correct img_name, such that the detections can be matched against ground truth 
  2. x,y,w,h should be provided in the coordinate system of the original image, with x,y representing the top-left pixel for the bounding box, and w and h representing the width and height respectively.


[
    {
        "detections": [
            {
                "x": 37.619754791259766,
                "y": 1843.8494873046875,
                "w": 63.83501434326172,
                "h": 69.88720703125,
                "track_id": 0,
                "n": "airborne",
                "s": 0.9474319815635681
            }
        ],
        "img_name": "1568151307970878896b37adfedec804a08bcbde18992355d9b.png"
    },
    {
        "detections": [
            {
                "x": 35.92606735229492,
                "y": 1838.3416748046875,
                "w": 71.85213470458984,
                "h": 84.0302734375,
                "track_id": 0,
                "n": "airborne",
                "s": 0.6456623077392578
            }
        ],
        "img_name": "1568151308170494735b37adfedec804a08bcbde18992355d9b.png"
    }
]

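A results list with the fields described above can be assembled with the standard `json` module. This is a minimal sketch: the helper names are hypothetical, and the x and y values are written exactly as given, so whichever coordinate convention the evaluator expects has to be applied by the caller beforehand.

```python
import json
from typing import Any, Dict, Iterable, List, Optional


def make_detection(x: float, y: float, w: float, h: float, score: float,
                   track_id: Optional[int] = None,
                   name: str = "airborne") -> Dict[str, Any]:
    """Build one element of the 'detections' list."""
    detection: Dict[str, Any] = {
        "x": float(x), "y": float(y), "w": float(w), "h": float(h),
        "n": name, "s": float(score),
    }
    if track_id is not None:  # the track id is optional
        detection["track_id"] = track_id
    return detection


def make_result(img_name: str, detections: Iterable[Dict[str, Any]]) -> Dict[str, Any]:
    """Build one element of the top-level result list."""
    return {"img_name": img_name, "detections": list(detections)}


results: List[Dict[str, Any]] = [
    make_result(
        "1568151307970878896b37adfedec804a08bcbde18992355d9b.png",
        [make_detection(37.62, 1843.85, 63.84, 69.89, score=0.947, track_id=0)],
    ),
    # An image with no reported objects simply has an empty detections list.
    make_result("1568151308170494735b37adfedec804a08bcbde18992355d9b.png", []),
]

serialized = json.dumps(results, indent=4)
print(serialized)
```

The `serialized` string can then be written to a .json file; copying `img_name` verbatim from the ground truth is what allows the detections to be matched, as noted above.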
💪 Baselines

The baselines for the two benchmarks of the challenge are provided by AWS Rekognition and are based on the SIAM-MOT model (https://www.amazon.science/publications/siammot-siamese-multi-object-tracking) trained on the AOT dataset. To simulate different submissions, Table 5 outlines the metrics results for various working points: different detection score thresholds and minimum track lengths required for airborne reports (see Table 3 for the definition of an airborne report). The inference was performed on the test dataset (which is not available to the public), which will be used to evaluate performance on the Public board during the challenge.

Note that while we present metrics for all the results, only the results marked with green (with corresponding HFAR < 0.5) will be ranked on the Detection and Tracking Benchmark board based on their EDR values (with ties broken based on lower HFAR). Similarly, only the results marked with green and yellow (with corresponding FPPI < 0.0005) will be ranked on the Detection Benchmark board based on their AFDR values (with ties broken based on lower FPPI). All the other results, e.g., those reported in the rows with red background, will be presented but not ranked.

If participants do not indicate a specific benchmark, they will be evaluated in both benchmarks and will be eligible for ranking based on the specific rule of each benchmark.

Table 5: Benchmark of SIAM-MOT

The codebase for the SiamMOT baseline is available in the starter kit here.

⚠️ Please note that identical SiamMOT models (with delta <= 1.5% in EDR or AFDR) would be disqualified from winning the prize.
An identical model is a model that uses the exact same code and config file provided with the baseline.

🚀 Submissions

This is a code-based challenge, where you will make your submissions through git tags.

We have prepared a Starter kit for you that you can clone to get started with the challenge 🙌

👉 Have any issues or queries? Please jump to this thread and ask away!

👉 FAQs and common mistakes while making a submission. Check them out.

Hardware Used?

We use "p3.2xlarge" instances to run your evaluations i.e. 8 vCPU, 61 GB RAM, V100 GPU.
(please enable GPU by putting "gpu": true in your aicrowd.json file)

📅 Timeline

🕐 Start Date: April 16th, 2021 at 18:00:00 UTC

Deadline: July 8th, 2021 at 00:00:00 UTC

⏰ New Deadline: September 1st, 2021 at 00:00:00 UTC

🏆 Winners Announced: September 30th, 2021

No additional registrations or entries will be accepted after the Entry Deadline. These dates are subject to change at Sponsor's discretion.

πŸ† Prizes

πŸ₯‡ The Top scoring submission for each benchmark will receive $15,000 USD

πŸ₯ˆ The Second best submission for each benchmark will receive $7,500 USD

πŸ₯‰ The Third place submission for each benchmark will receive $1,250 USD

πŸ… The Most β€œCreative” solution as determined by Sponsor’s sole discretion will receive $2,500 USD


To receive a prize, the Team must make the submission via the AIcrowd portal, and submit its training code and a detailed document describing its training process. The description must be written in English, with mathematical formulae as necessary. The description must be written at a level sufficient for a practitioner in computer science to reproduce the results obtained by the Team. It must describe the training and tuning process in enough detail for the results to be reproduced independently. Failure to submit both the code and description within one week of notification will disqualify that entry, and additional qualifying entries will be considered for prizes. Sponsor reserves the right not to award prizes to any Team whose results cannot be reproduced.

📢 ICCV 2021

The winning teams for each benchmark may be offered the opportunity to give a 15-minute oral presentation at an ICCV (International Conference on Computer Vision) Workshop.

Any winning team that would like to give a workshop presentation must submit an abstract for an ICCV paper. 


🔗 Links

🏆 Discussion Forum

💪 Leaderboard

📝 Notebooks


📱 Contact


