We are so excited to announce that the MineRL BASALT challenge has been selected for the NeurIPS 2021 competition track! Sign up now and try to win some of our cash prizes!

This page is preliminary. Please check back once the competition starts, as the rules may change.

🕵️ Introduction

The MineRL Benchmark for Agents that Solve Almost-Lifelike Tasks (MineRL BASALT) competition aims to promote research in the area of learning from human feedback in order to enable agents that can pursue tasks that do not have crisp, easily defined reward functions.

We will provide tasks consisting of a simple English-language description alongside a Gym environment, without any associated reward function, but with expert demonstrations. Participants will train agents for these tasks using their preferred methods. We expect typical solutions will use imitation learning, learning from comparisons, or similar methods based on human feedback. Submitted agents will be evaluated based on how well they complete the tasks, as judged by humans given the same description of the tasks.
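
To make this concrete, here is a minimal sketch of interacting with one of the task environments. The environment ID and the "pov" observation key are assumptions based on the usual MineRL conventions; the starter kit is authoritative for the exact names used in the competition.

```python
# Minimal interaction sketch. The env ID ("MineRLBasaltFindCave-v0") and the
# "pov" observation key are assumptions; check the starter kit for the exact
# names used this year.
import gym
import minerl  # noqa: F401  (importing minerl registers the environments with Gym)

env = gym.make("MineRLBasaltFindCave-v0")

obs = env.reset()
done = False
while not done:
    # There is no reward to learn from: choose actions from the pixel
    # observation (obs["pov"]) with a policy trained from demonstrations or
    # other human feedback. Random actions are used here as a placeholder.
    action = env.action_space.sample()
    obs, _reward, done, _info = env.step(action)  # the returned reward carries no signal

env.close()
```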

Tasks

Rather than being defined by a reward function, our tasks are defined by a human-readable description, which is given both to competitors and to the site visitors and workers who evaluate the videos the agents generate. A team's final score will be the average of their scores on all four tasks, so it is important to submit agents for every task.

Final evaluations are done by humans using the leaderboard (link coming soon), which includes more details and evaluation questions than the description below. So, you may want to go through the rating process on the leaderboard to see exactly how we instruct people to rate tasks, as this is ultimately what determines how your agents will be rated.

(A technical note: agents can at any point choose to terminate the trajectory through the Minecraft mechanic of throwing a snowball, and all agents are equipped with a snowball even if it is not explicitly listed in the "Resources" section for that task. A minimal code sketch of this mechanic appears after the task list below.)

  1. FindCave
    1. Description: The agent should search for a cave, and terminate the episode when it is inside one. 
    2. Resources: None 
  2. MakeWaterfall 
    1. Description: After spawning in a mountainous area, the agent should build a beautiful waterfall and then reposition itself to take a scenic picture of the same waterfall. The picture of the waterfall can be taken by orienting the camera and then throwing a snowball when facing the waterfall at a good angle.
    2. Resources: Two water buckets, a stone pickaxe, a shovel, and 20 cobblestone
  3. CreateVillageAnimalPen
    1. Description: After spawning in a (plains) village, build an animal pen next to one of the houses in the village. Animal pens must contain two of a single kind of animal; you are only allowed to pen chickens, cows, pigs, or sheep. Don't harm the village.
      You may need to terraform the area around a house to build a pen. Examples of harming the village include taking animals from existing pens, damaging existing houses or farms, and attacking villagers.
    2. Resources: 64 fence posts, 64 fence gates, carrots, seeds, and wheat for luring animals.
  4. BuildVillageHouse
    1. Description: Using items in your starting inventory, build a new house in the style of the village (random biome), in an appropriate location (e.g. next to the path through the village), without harming the village in the process. Then give a brief tour of the house (i.e. spin around slowly such that all of the walls and the roof are visible). 
    2. Resources: A stone pickaxe and a stone axe, and various building blocks.
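
As referenced in the technical note above, here is a minimal sketch of how an agent might end an episode by throwing its snowball. The "equip" and "use" action keys and the item name "snowball" follow common MineRL conventions but are assumptions; inspect env.action_space in the starter kit to confirm them.

```python
# Hypothetical episode-termination sketch: equip the snowball, then "use" it.
# Action keys and the item name are assumptions; check env.action_space.
import gym
import minerl  # registers the MineRL environments with Gym

env = gym.make("MineRLBasaltMakeWaterfall-v0")  # assumed env ID
obs = env.reset()

# ... act in the environment until the agent decides the task is done ...

end_action = env.action_space.noop()  # start from a no-op action dict
end_action["equip"] = "snowball"      # put the snowball in the agent's hand
end_action["use"] = 1                 # throwing it signals episode termination

# Depending on the environment version, equipping and using may need to
# happen on separate steps rather than in a single action.
obs, _reward, done, _info = env.step(end_action)
env.close()
```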

📜 Rules

The full set of rules is available here. Please do read them. Here we explain only a small subset of the rules that are particularly important:

  1. You cannot pull information out of the underlying Minecraft simulator; only information exposed through the interfaces of the environments we provide is allowed.
  2. If you train using in-the-loop human feedback, you are limited to 10 hours of human time in the course of training. Your code must expose the feedback interface, either through a GUI or a command-line interface, in a way that a person fluent in English can understand how to provide the feedback your algorithm requires after reading a Google Doc you submit containing at most 10 pages. This is necessary for retraining, since we will have to replicate both the computation of your algorithm and its requests for human feedback. (A toy sketch of one such interface appears after this list.)
  3. Submissions are limited to four days of compute on prespecified computing hardware. We will publish the hardware specifications once they are finalized.
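
As an illustration of rule 2, the sketch below shows one hypothetical command-line feedback interface that also tracks the 10-hour human-time budget. None of these names come from the competition code; it only illustrates the kind of interface and bookkeeping the rules ask for.

```python
# Hypothetical CLI for collecting pairwise human feedback during training,
# with a running tally against the 10-hour human-time budget. Illustrative
# only; not part of any official competition API.
import time

HUMAN_TIME_BUDGET_SECONDS = 10 * 60 * 60  # rule 2: at most 10 hours of human time


def ask_human_preference(clip_a, clip_b, time_used_seconds):
    """Ask a human which clip is better; return the answer and updated time used."""
    start = time.monotonic()
    print(f"Please watch {clip_a} and {clip_b}.")
    answer = ""
    while answer not in {"a", "b", "tie"}:
        answer = input("Which clip completed the task better? [a/b/tie] ").strip().lower()
    time_used_seconds += time.monotonic() - start
    if time_used_seconds > HUMAN_TIME_BUDGET_SECONDS:
        raise RuntimeError("10-hour human feedback budget exceeded.")
    return answer, time_used_seconds
```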

🖊 Evaluation

This competition will be judged according to human assessment of the generated trajectories. In particular, we will generate videos of two different agents acting in the environment and ask a human which agent performed the task better. After gathering many comparisons of this sort, we will produce a score for each agent using the TrueSkill system. This means the core competition environments will, by design, not include reward functions. We realize that this is a dramatic departure from the typical paradigm of reinforcement learning, and that it may imply a slower and more complicated workflow. However, we think it is an important problem to build solutions to, since we expect many potentially useful tasks to be difficult to pre-specify a reward function for.
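
For intuition, the sketch below shows how pairwise judgments can be turned into scores with the open-source trueskill Python package; the organizers' actual scoring pipeline may differ in its details.

```python
# Turning pairwise "which agent did the task better?" judgments into scores
# with the open-source `trueskill` package. Illustrative only.
import trueskill

ratings = {"agent_a": trueskill.Rating(), "agent_b": trueskill.Rating()}

# Each comparison records a (winner, loser) pair as judged by a human rater.
comparisons = [("agent_a", "agent_b"), ("agent_a", "agent_b"), ("agent_b", "agent_a")]

for winner, loser in comparisons:
    ratings[winner], ratings[loser] = trueskill.rate_1vs1(ratings[winner], ratings[loser])

for name, rating in sorted(ratings.items(), key=lambda kv: kv[1].mu, reverse=True):
    print(f"{name}: skill estimate {rating.mu:.1f} (uncertainty {rating.sigma:.1f})")
```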

Since we require humans to compare videos, it is actually quite expensive to evaluate a submission. As a result, we are planning to implement a system in which, to make a new submission, you must first provide some comparisons on our leaderboard. Essentially, you rate other people's submissions in order to "pay" for other people to rate yours. We will publish more details on this system once they are finalized.

๐Ÿ“ Competition Structure

Round 1: General Entry + Preliminary Evaluation

In this round, teams of up to 6 individuals will do the following:

  1. Register on the AICrowd competition website and receive the following materials:
    1. Starter code for running the environments for the competition tasks.
    2. Baseline implementations provided by the competition organizers.
    3. The human demonstration dataset.
    4. Docker images and a quick-start template that the competition organizers will use to validate the training performance of competitors' models.
    5. Scripts for provisioning the standard cloud compute system used to evaluate the sample efficiency of participants' submissions.
    6. (Optional) Form a team using the 'Create Team' button on the competition overview. Participants must be signed in to create a team.
  2. Develop and test procedures for efficiently training models to solve the competition tasks.
  3. Train their models against the four task environments using the local or Azure training scripts in the competition starter template, in less than four days of compute and with at most 10 hours of human-in-the-loop feedback. Submit their trained models (along with the training code) for evaluation when satisfied with them. The automated setup will generate videos of the agent's behavior on validation seeds and submit these for human comparison. The comparisons are then used to compute and report the metrics on the competition leaderboard.
  4. Repeat steps 2-3 until the submission round is complete!

Once the submission round is complete, the organizers will collect additional human comparisons until entries have sufficiently stable scores, after which the top 50 proceed to the second evaluation round.

Evaluation Round 2: Final Scores

In the second evaluation round, organizers will solicit comparisons between the submitted agents from paid contractors (whether hired directly or through Mechanical Turk or some other method), using the same leaderboard mechanism. Participants are strictly prohibited from providing comparisons in this evaluation round. This gives the final scores for each submission, subject to validation in the next round. The top 10 teams will advance to the validation round.

Validation

In the validation round, organizers will:

  • Fork the code repositories associated with the top submissions and scrub them of any files larger than 30 MB to ensure that participants are not using pre-trained models or large datasets.
  • Examine the code repositories of the top submissions on the leaderboard to ensure compliance with the competition rules.
  • Retrain these submissions, potentially with help from participants to ensure human feedback is provided appropriately.
  • Disqualify any submissions for which retraining provides a significantly worse agent than was submitted by the team.

💵 Prizes and Funding Opportunities

Thanks to the generosity of our sponsors, there will be $11,000 worth of cash prizes:

  1. First place: $5,000
  2. Second place: $3,000
  3. Third place: $2,000
  4. Most human-like: $500
  5. Creativity of research: $500

In addition, the top three teams will be invited to coauthor the competition report.

Note that, since we do not expect to be able to evaluate all submissions, prizes may be restricted to entries that reach the second evaluation phase or the validation phase, at the organizers' discretion. Prize winners are expected to present their solutions at NeurIPS.

We also have an additional $1,000 worth of prizes for participants who provide support for the competition:

  1. Community support: $500 (may be split across participants at the organizers' discretion)
  2. Lottery for leaderboard ratings (above and beyond those used to "pay" for submissions): 5 prizes, each worth $100

📅 Timeline

To be announced

💪 Getting Started

The challenge has not started yet, but you can still visit the resources below to warm up.

You can find the competition submission starter kit on GitHub here.

Here are some additional resources!

๐Ÿ“ Notes on downsampling

(Fairly in-the-weeds; you can probably just skip this section.)

To decrease the amount of compute needed to train and run the AIs, the AIs are typically given only a very low-resolution view of the Minecraft world, and must act on that view. However, for our human evaluation, we would like to show videos at a regular resolution, so that evaluators don't have to squint to see exactly what is happening.

As a result, by default we train our AIs in low-res environments, and then during evaluation we instantiate a high-res environment (which generates the video that humans watch), and downsample it to the low resolution before passing it to the AI. Unfortunately, the high-res + downsample combination produces slightly different images than using the low-res environment directly. AIs based on neural nets can be very sensitive to this difference. If you train an AI system on the low-res environment, it may work well in that setting, but then work poorly in the high-res + downsample case -- even though these would look identical to humans.

Our best guess is that this won't make much of a difference. However, if you would like to be safe, you can instead train on the high-res + downsample combination and never use the low-res environment at all. In this case, your AI will be tested in exactly the same conditions as it was trained in, so this sort of issue should not occur. The downside is that your training may take 2-3x longer.
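
If you go the high-res + downsample route, one way to do it is a Gym observation wrapper like the sketch below, which assumes a dict observation with a "pov" image and uses OpenCV for resizing. The target size and interpolation mode are placeholders; what matters is that the resize you train with matches the one used at evaluation time exactly.

```python
# Sketch of training on the high-res environment and downsampling yourself.
# The "pov" key, target size, and interpolation mode are assumptions; match
# whatever the evaluation pipeline uses so train and test inputs are identical.
import cv2
import gym


class DownsamplePOV(gym.ObservationWrapper):
    def __init__(self, env, size=(64, 64)):
        super().__init__(env)
        self.size = size
        # A full implementation would also shrink observation_space["pov"] to match.

    def observation(self, obs):
        obs = dict(obs)  # avoid mutating the wrapped env's observation
        obs["pov"] = cv2.resize(obs["pov"], self.size, interpolation=cv2.INTER_AREA)
        return obs


# Usage: env = DownsamplePOV(gym.make("MineRLBasaltFindCave-v0"), size=(64, 64))
```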

You can find more details on the issue here. One of the organizers benchmarked the two options on their personal machine here.

🙋 F.A.Q.

This F.A.Q. is the only official place for clarifications of the competition rules!

Q: Do I need to purchase Minecraft to participate?

 > A: No! MineRL includes a special version of Minecraft provided generously by the folks at Microsoft Research via Project Malmo.

We will be updating the FAQ soon!

Have more questions? Ask in Discord or on the Forum

🤝 Partners

Thank you to our amazing partners!

Microsoft

👥 Team

The organizing team consists of:

  • Rohin Shah (UC Berkeley)
  • Cody Wild (UC Berkeley)
  • Steven H. Wang (UC Berkeley)
  • Neel Alex (UC Berkeley)
  • Brandon Houghton (OpenAI and Carnegie Mellon University)
  • William H. Guss (OpenAI and Carnegie Mellon University)
  • Sharada Mohanty (AIcrowd)
  • Anssi Kanervisto (University of Eastern Finland)
  • Stephanie Milani (Carnegie Mellon University)
  • Nicholay Topin (Carnegie Mellon University)
  • Pieter Abbeel (UC Berkeley)
  • Stuart Russell (UC Berkeley)
  • Anca Dragan (UC Berkeley)

The advisory committee consists of:

  • Sergio Guadarrama (Google Brain)
  • Katja Hofmann (Microsoft Research)

📱 Contact

If you have any questions, please feel free to contact us on Discord or through the AIcrowd forum.