NeurIPS 2020: Procgen Competition

Measure sample efficiency and generalization in reinforcement learning using procedurally generated environments

1 Authorship/Co-Authorship
Misc Prizes : To be announced

Starter Kit : https://github.com/aicrowd/neurips2020-procgen-starter-kit

πŸ•΅οΈ Introduction

Procgen Benchmark is a suite of 16 procedurally generated gym environments designed to benchmark both sample efficiency and generalization in reinforcement learning. In this competition, participants will attempt to maximize agents' performance using a fixed number of environment interactions. Agents will be evaluated in each of these 16 publicly released environments, as well as in four secret test environments created specifically for this competition. By aggregating performance across so many diverse environments, we can obtain high-quality metrics to judge the underlying algorithms.

Since all content is procedurally generated, each Procgen environment intrinsically requires agents to generalize to never-before-seen situations. These environments therefore provide a robust test of an agent's ability to learn in many diverse settings. Moreover, Procgen environments are designed to be lightweight and simple to use. Participants with limited computational resources will be able to easily reproduce baseline results and run new experiments. More details about the design principles and details of individual environments can be found in the paper Leveraging Procedural Generation to Benchmark Reinforcement Learning. Once the competition concludes, all four test environments will be publicly released.

📜 Rules

In all rounds, participants will be allotted 8 million timesteps in each environment to train their agents. When evaluating generalization, we will provide participants 200 levels from each environment during the training phase. Participants will also be restricted to no more than 2 hours of compute per environment, using a P100 GPU and 16 vCPUs.

Participants are expected to operate in good faith and to not attempt to circumvent these restrictions.
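
As a rough illustration of the limits above, a training loop could track its own budget as follows. This is a minimal sketch: the `TrainingBudget` class and its method are hypothetical names chosen for this example, not part of the starter kit, and the official evaluator enforces the limits on its own.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingBudget:
    """Per-environment limits taken from the rules (hypothetical helper)."""

    max_timesteps: int = 8_000_000      # 8 million environment timesteps
    max_seconds: float = 2 * 60 * 60    # 2 hours of compute

    def exhausted(self, timesteps_used, seconds_elapsed):
        """Return True once either limit has been reached."""
        return (
            timesteps_used >= self.max_timesteps
            or seconds_elapsed >= self.max_seconds
        )


budget = TrainingBudget()
```

A training script might call `budget.exhausted(...)` after each batch and stop cleanly before the hard limit is hit.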

🖊 Evaluation

Participants will train separate agents for each environment, with the number of environments varying in each round of the competition. In general, performance will be judged by the mean of the normalized returns across environments. In each environment, the normalized return is defined as :

\(R_{norm} = (R - R_{min})/(R_{max} - R_{min})\)

where :

  • \(R\) is the raw expected return
  • \(R_{min}\) and \(R_{max}\) are constants chosen (per environment) to approximately bound \(R\).

It is possible to choose these constants because each Procgen environment has a clear score ceiling. Using this definition, the normalized return is (almost) guaranteed to fall between 0 and 1. Since Procgen environments are designed to have similar difficulties, it's unlikely that a small subset of environments will dominate this signal. We use the mean normalized return since it offers a better signal than the median, and since we do not need to be robust to outliers.
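
The normalization and the mean across environments can be sketched in a few lines of Python. The environment names, bounds, and raw returns below are hypothetical placeholders chosen for illustration; the real per-environment constants are set by the organizers.

```python
from statistics import mean


def normalized_return(raw_return, r_min, r_max):
    """Map a raw expected return R to (R - R_min) / (R_max - R_min)."""
    if r_max <= r_min:
        raise ValueError("r_max must be greater than r_min")
    return (raw_return - r_min) / (r_max - r_min)


# Hypothetical (R_min, R_max) bounds and raw returns, for illustration only.
bounds = {"env_a": (5.0, 10.0), "env_b": (0.0, 40.0)}
raw_returns = {"env_a": 8.0, "env_b": 10.0}

scores = {
    name: normalized_return(raw_returns[name], *bounds[name])
    for name in bounds
}
mean_score = mean(list(scores.values()))
```

With these placeholder numbers, `env_a` scores (8 - 5) / (10 - 5) = 0.6 and `env_b` scores 10 / 40 = 0.25, so the mean normalized return is 0.425.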

πŸ“ Competition Structure

Warm-Up Round

The warm-up round evaluates submissions solely on the CoinRun environment. Participants can become familiar with the codebase and submission pipeline without the need to consider multiple Procgen environments.

Round 1 (General Entry)

Round 1 will evaluate submissions on 3 of the public Procgen environments, as well as on 1 of the private test environments. Participants' final score will be the mean normalized return across these 4 environments.

This round will focus entirely on sample efficiency, with participants being given a budget of 8M timesteps for training.

Round 2 (Finals)

Round 2 will evaluate submissions on the 16 public Procgen environments, as well as on the 4 private test environments. Participants' final score will be a weighted average of the normalized return across these 20 environments, with the private test environments contributing the same weight as the 16 public environments.
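
One way to read this weighting is that the 4 private environments together count as much as the 16 public environments together. The sketch below assumes that reading; the function name and the example scores are hypothetical, and the organizers' exact formula may differ.

```python
def round2_score(public_scores, private_scores):
    """Weighted mean in which each group contributes half of the total weight.

    Assumption: "the same weight" means the private group as a whole
    counts as much as the public group as a whole.
    """
    if not public_scores or not private_scores:
        raise ValueError("both groups need at least one score")
    public_mean = sum(public_scores) / len(public_scores)
    private_mean = sum(private_scores) / len(private_scores)
    return 0.5 * public_mean + 0.5 * private_mean


# Hypothetical normalized returns: 16 public and 4 private environments.
public = [0.5] * 16
private = [0.25] * 4
score = round2_score(public, private)
```

Under this reading the example yields 0.375, whereas an unweighted mean over all 20 environments would give 0.45, so each private environment carries four times the weight of a public one.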

This round will evaluate agents on both sample efficiency and generalization.

Sample efficiency will be measured as before, with agents restricted to training for 8M timesteps.

Generalization will be measured by restricting agents to 200 levels from each environment during training (as well as 8M total timesteps). In both cases, agents will be evaluated on the full distribution of levels. We will have separate winners for the categories of sample efficiency and generalization.
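
The difference between the two settings comes down to how many levels the environment may draw from. The sketch below builds Procgen keyword arguments for each case using the library's `num_levels` and `start_level` options, where `num_levels=0` requests the full level distribution. The helper names and track labels are hypothetical, and it is an assumption here that sample-efficiency training uses the full distribution; the starter kit configures environments through its own experiment files.

```python
def procgen_env_kwargs(track, env_name="coinrun"):
    """Build keyword arguments for a Procgen environment (hypothetical helper).

    num_levels=0 asks Procgen for the full level distribution;
    a positive value restricts the environment to that many levels.
    """
    if track == "generalization":
        num_levels = 200  # training restricted to 200 levels
    elif track in ("sample_efficiency", "evaluation"):
        num_levels = 0    # full distribution of levels (assumed for training)
    else:
        raise ValueError(f"unknown track: {track!r}")
    return {"env_name": env_name, "num_levels": num_levels, "start_level": 0}


def make_env(track, env_name="coinrun"):
    """Create the gym environment; requires the gym and procgen packages."""
    import gym  # imported lazily so the helper above works without gym

    kwargs = procgen_env_kwargs(track, env_name)
    name = kwargs.pop("env_name")
    return gym.make(f"procgen:procgen-{name}-v0", **kwargs)


train_kwargs = procgen_env_kwargs("generalization")
eval_kwargs = procgen_env_kwargs("evaluation")
```

Keeping the argument construction separate from `make_env` lets the same settings be reused in a config file or inspected without installing Procgen.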

Because significant computation is required to train and evaluate agents in this final round, only the top 50 submissions from Round 1 will be eligible to submit solutions for Round 2. The leaderboard will report performance on a subset of all environments, specifically on 4 public environments and 1 private test environment.

The top 10 submissions will be subject to a more thorough evaluation, with their performance being averaged over 3 separate training runs. The final winners will be determined by this evaluation.

📅 Timeline

  • June 3rd - July 6th : Warm-Up Round
  • July 7th - August 31st : Round 1 (General Entry)
  • September 1st - October 19th : Round 2 (Finals)
  • October 20th - October 25th : Post Competition Analysis
  • October 25th : Final Results Announced
  • October 16th - November 10th : Post Competition Wrap-Up

💪 Getting Started

The starter kit of the competition is available at https://github.com/AIcrowd/neurips2020-procgen-starter-kit.

    git clone git@github.com:AIcrowd/neurips2020-procgen-starter-kit.git
    cd neurips2020-procgen-starter-kit

    # Training example:
    python ./train.py -f experiments/procgen-0.yaml

    # Rollout example:
    # the env name and configuration are automatically picked up from 
    # the experiment config.

    python ./rollout.py \
        /tmp/ray/checkpoint_dir/checkpoint-0 \
        --run PPO \
        --episodes 100

    # NOTE: with default options, the checkpoint path has the following form:
    # ~/ray_results/procgen-ppo/<experiment-name>-<uuid>/checkpoint_1/checkpoint-1

🙋 F.A.Q.

Why evaluate agents using 8M training timesteps?

We evaluate agents using 8M training timesteps since we believe this provides agents enough data to learn reasonable behaviors, while still posing a significant challenge for state-of-the-art algorithms. Our baseline implementation of PPO makes significant, non-trivial progress over this interval, but it generally fails to converge on most Procgen environments.

When measuring generalization, why restrict agents to 200 levels?

With 200 training levels, we find the generalization gap in many environments is in the Goldilocks zone -- not too large and not too small (see Figure 13 in the Procgen paper). We believe a training set of this size will provide the best signal to measure algorithmic improvements.

What are the prizes for the competition?

We do not yet have a prize pool, but we are still actively searching for sponsors. If you are interested in sponsoring this competition with compute for the evaluations, or prizes for the winners, please reach out via email to Sharada Mohanty (mohanty@aicrowd.com) and Karl Cobbe (karl@openai.com).

👥 Team

The organizing team consists of:

  • Sharada Mohanty (AIcrowd)
  • Karl Cobbe (OpenAI)
  • Jyotish Poonganam (AIcrowd)
  • Shivam Khandelwal (AIcrowd)
  • Christopher Hesse (OpenAI)
  • Jacob Hilton (OpenAI)
  • John Schulman (OpenAI)
  • William H. Guss (OpenAI)

📱 Contact

If you have any questions, please contact Sharada Mohanty (mohanty@aicrowd.com) or Karl Cobbe (karl@openai.com).

