AIcrowd | NeurIPS 2020: Procgen Competition

Warm-Up Round: Completed

Round 1: Completed

Round 2: Completed #neurips #reinforcement_learning

OpenAI

🕵️ Introduction

Procgen Benchmark is a suite of 16 procedurally-generated gym environments designed to benchmark both sample efficiency and generalization in reinforcement learning. In this competition, participants will attempt to maximize agents' performance using a fixed number of environment interactions. Agents will be evaluated in each of these 16 publicly released environments, as well as in four secret test environments created specifically for this competition. By aggregating performance across so many diverse environments, we can obtain high quality metrics to judge the underlying algorithms.

Since all content is procedurally generated, each Procgen environment intrinsically requires agents to generalize to never-before-seen situations. These environments therefore provide a robust test of an agent's ability to learn in many diverse settings. Moreover, Procgen environments are designed to be lightweight and simple to use. Participants with limited computational resources will be able to easily reproduce baseline results and run new experiments. The design principles and the individual environments are described in more detail in the paper "Leveraging Procedural Generation to Benchmark Reinforcement Learning". Once the competition concludes, all four test environments will be publicly released.

📜 Rules

In all rounds, participants will be allotted 8 million timesteps in each environment to train their agents. When evaluating generalization, we will provide participants 200 levels from each environment during the training phase. Participants will also be restricted to no more than 2 hours of compute per environment, using a V100 GPU and 8 vCPUs.
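As a rough worked example of what these limits imply, the sketch below computes the average rate of environment steps needed to use the full 8 million timestep budget within the 2 hour compute limit. The function and constant names are illustrative and not part of the competition tooling.

```python
# Back-of-the-envelope throughput implied by the rules above.
TIMESTEP_BUDGET = 8_000_000   # timesteps per environment (from the rules)
COMPUTE_LIMIT_HOURS = 2       # hours of compute per environment (from the rules)


def required_steps_per_second(timesteps: int, hours: float) -> float:
    """Average environment steps per second needed to finish within the limit."""
    return timesteps / (hours * 3600)


rate = required_steps_per_second(TIMESTEP_BUDGET, COMPUTE_LIMIT_HOURS)
print(f"{rate:.0f} steps/s")
```

Because the 2 hours also cover gradient updates and any other overhead, the whole training loop, not just environment stepping, has to sustain roughly this rate.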

Participants are expected to operate in good faith and to not attempt to circumvent these restrictions.

🖊 Evaluation

Participants will train separate agents for each environment, with the number of environments varying in each round of the competition. In general, performance will be judged by the mean of the normalized returns across environments. In each environment, the normalized return is defined as:

$$R_{norm} = \frac{R - R_{min}}{R_{max} - R_{min}}$$

where:

$R$ is the raw expected return
$R_{min}$ and $R_{max}$ are constants chosen (per environment) to approximately bound $R$.

It is possible to choose these constants because each Procgen environment has a clear score ceiling. Using this definition, the normalized return is (almost) guaranteed to fall between 0 and 1. Since Procgen environments are designed to have similar difficulties, it’s unlikely that a small subset of environments will dominate this signal. We use the mean normalized return since it offers a better signal than the median, and since we do not need to be robust to outliers.
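A minimal sketch of this metric follows, assuming the min-max form of the normalized return, (R - R_min) / (R_max - R_min), as given in the Procgen paper. The environment names, bounds, and raw returns are hypothetical placeholders, not the constants used in the competition.

```python
from statistics import mean


def normalized_return(raw: float, r_min: float, r_max: float) -> float:
    """Min-max normalize a raw expected return using per-environment bounds."""
    if r_max <= r_min:
        raise ValueError("r_max must be greater than r_min")
    return (raw - r_min) / (r_max - r_min)


# Hypothetical per-environment bounds (r_min, r_max) and raw expected returns.
bounds = {"env_a": (5.0, 10.0), "env_b": (0.0, 40.0)}
raw_returns = {"env_a": 8.0, "env_b": 10.0}

scores = {
    name: normalized_return(raw_returns[name], *bounds[name]) for name in bounds
}
mean_score = mean(scores.values())
print(scores, mean_score)
```

Since the bounds are only approximate, a raw return slightly outside them yields a score slightly below 0 or above 1, which is why the range is "almost" guaranteed.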

📁 Competition Structure

Warm-Up Round

The warm-up round evaluates submissions solely on the CoinRun environment. Participants can become familiar with the codebase and submission pipeline without the need to consider multiple Procgen environments.

Round 1 (General Entry)

Round 1 will evaluate submissions on 3 of the public Procgen environments, as well as on 1 of the private test environments. Participants' final score will be the mean normalized return across these 4 environments.

This round will focus entirely on sample efficiency, with participants being given a budget of 8M timesteps for training.

Round 2 (Finals)

Round 2 will evaluate submissions on the 16 public Procgen environments, as well as on the 4 private test environments. Participants' final score will be a weighted average of the normalized return across these 20 environments, with the 4 private test environments together contributing the same weight as the 16 public environments together.
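One reading of this weighting is that the public group and the private group each contribute half of the final score. The sketch below implements that reading; it is an assumption about the scoring rule rather than the organizers' evaluation code, and the function name and example scores are hypothetical.

```python
from typing import Sequence


def round2_score(public_scores: Sequence[float],
                 private_scores: Sequence[float]) -> float:
    """Weighted average in which each group of environments carries weight 0.5.

    Assumes the public and private groups contribute equally as groups,
    so each private environment weighs more than each public one.
    """
    if not public_scores or not private_scores:
        raise ValueError("both groups need at least one score")
    public_mean = sum(public_scores) / len(public_scores)
    private_mean = sum(private_scores) / len(private_scores)
    return 0.5 * public_mean + 0.5 * private_mean


# Hypothetical normalized returns: 16 public and 4 private environments.
public = [0.5] * 16
private = [0.25] * 4
final_score = round2_score(public, private)
print(final_score)
```

Under this reading, each public environment carries a weight of 1/32 and each private environment a weight of 1/8.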

This round will evaluate agents on both sample efficiency and generalization.

Sample efficiency will be measured as before, with agents restricted to training for 8M timesteps.

Generalization will be measured by restricting agents to 200 levels from each environment during training (as well as 8M total timesteps). In both cases, agents will be evaluated on the full distribution of levels. We will have separate winners for the categories of sample efficiency and generalization.
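The two tracks differ only in how many levels the agent may see during training. A minimal sketch of the corresponding environment settings follows, using the `num_levels` and `start_level` options of the open-source Procgen package, where `num_levels=0` means an unlimited set of levels. The helper name, the track labels, and the choice of `start_level=0` are assumptions for illustration; the starter kit's own configuration format may differ.

```python
TRAINING_LEVELS_GENERALIZATION = 200  # from the rules


def procgen_env_kwargs(track: str) -> dict:
    """Return training-time Procgen options for a given competition track."""
    if track == "sample_efficiency":
        num_levels = 0  # 0 = unlimited levels in Procgen
    elif track == "generalization":
        num_levels = TRAINING_LEVELS_GENERALIZATION
    else:
        raise ValueError(f"unknown track: {track}")
    # start_level=0 is an assumed seed offset, not a value given in the rules.
    return {"num_levels": num_levels, "start_level": 0}


# Evaluation in both tracks uses the full distribution of levels.
EVAL_KWARGS = {"num_levels": 0, "start_level": 0}

train_kwargs = procgen_env_kwargs("generalization")
print(train_kwargs, EVAL_KWARGS)

# With the procgen package installed, these could be passed on as, e.g.:
#   gym.make("procgen:procgen-coinrun-v0", **train_kwargs)
```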

Because significant computation is required to train and evaluate agents in this final round, only the top 50 submissions from Round 1 will be eligible to submit solutions for Round 2. The leaderboard will report performance on a subset of all environments, specifically on 4 public environments and 1 private test environment.

The top 10 submissions will be subject to a more thorough evaluation, with their performance being averaged over 3 separate training runs. The final winners will be determined by this evaluation.

📅 Timeline

June 3rd - July 6th : Warm-Up Round
July 7th - September 8th : Round 1 (General Entry)
September 8th - October 19th : Round 2 (Finals)
October 20th - October 25th : Post Competition Analysis
October 25th : Final Results Announced
October 16th - November 10th : Post Competition Wrap-Up

💪 Getting Started

The starter kit of the competition is available at https://github.com/AIcrowd/neurips2020-procgen-starter-kit.

🙋 F.A.Q.

Why evaluate agents using 8M training timesteps?

We evaluate agents using 8M training timesteps since we believe this provides agents enough data to learn reasonable behaviors, while still posing a significant challenge for state-of-the-art algorithms. Our baseline implementation of PPO makes significant progress over this interval, but it fails to converge on most Procgen environments.

When measuring generalization, why restrict agents to 200 levels?

With 200 training levels, we find the generalization gap in many environments is in the Goldilocks zone: not too large and not too small (see Figure 13 in the Procgen paper). We believe a training set of this size will provide the best signal to measure algorithmic improvements.