🕵️ Introduction

The MineRL Benchmark for Agents that Solve Almost-Lifelike Tasks (MineRL BASALT) competition aims to promote research in the area of learning from human feedback, in order to enable agents that can pursue tasks that do not have crisp, easily defined reward functions. Our sponsors have generously provided $11,000 in prize money to incentivize this research!

Real-world tasks are not simply handed to us with a reward function already defined, and it is often quite challenging to design one, even if you can verbally describe what you want done. To mimic this situation, the BASALT competition environments will, by design, not include reward functions. We realize that this is a dramatic departure from the typical paradigm of reinforcement learning, and that it may imply a slower and more complicated workflow. However, we think it is an important problem to build solutions to if we want AI systems to have effective and safe real-world impacts.

Our tasks are instead defined by a human-readable description, which is given both to the competitors and to the site visitors and workers doing the evaluation of the videos that trained agents generate. You may want to go through the evaluation process (described in the Evaluation section below) to see exactly how we instruct people to rate tasks, as this is ultimately what determines how your agents will be rated.

(Technical note: agents can at any point choose to terminate the trajectory through the Minecraft mechanic of throwing a snowball, and all agents are equipped with a snowball even if it is not explicitly listed in the “Resources” section under that task.) 

The four tasks in this competition are FindCave, MakeWaterfall, CreateVillageAnimalPen, and BuildVillageHouse.

🖊 Evaluation

This competition will be judged according to human assessment of the generated trajectories. In particular, for each task, we will generate videos of two different agents acting in the environment, and ask a human which agent performed the task better. After getting many comparisons of this sort, we will produce a score for each agent using the TrueSkill system, which, very roughly speaking, captures how often your agent is likely to "win" in a head-to-head comparison. You'll be able to see this score on the task-specific leaderboards. Your final score on the overall leaderboard will be an aggregate of your z-scores on all four tasks, so it is important to submit agents for every task. For more details on this aggregation procedure, see this doc.
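To make the z-score aggregation concrete, here is a minimal sketch in Python. It is not the organizers' procedure (that is described in the linked doc). It assumes the overall score is a plain average of per-task z-scores, and that an agent missing from a task receives the lowest z-score seen on that task. The function names `z_scores` and `overall_scores` are hypothetical.

```python
from statistics import mean, pstdev


def z_scores(task_scores):
    """Standardize one task's scores: {agent: score} -> {agent: z-score}."""
    values = list(task_scores.values())
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # All agents are tied on this task, so nobody is above or below average.
        return {agent: 0.0 for agent in task_scores}
    return {agent: (score - mu) / sigma for agent, score in task_scores.items()}


def overall_scores(scores_by_task):
    """Average each agent's per-task z-scores.

    Assumption (not from the competition docs): an agent with no entry for a
    task is assigned the lowest z-score observed on that task.
    """
    per_task_z = {task: z_scores(s) for task, s in scores_by_task.items()}
    agents = {agent for s in scores_by_task.values() for agent in s}
    overall = {}
    for agent in agents:
        zs = [z.get(agent, min(z.values())) for z in per_task_z.values()]
        overall[agent] = mean(zs)
    return overall
```

Under these assumptions, skipping a task drags the average down even if the agent does well elsewhere, which is why submitting to every task matters.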

Since we require humans to compare videos, it can be quite expensive to evaluate a submission. To incentivize people to provide comparisons, we will run a lottery: each comparison you provide will be treated as an entry in a $500 lottery. To provide comparisons, check out the "Evaluate Agents" tab above.

During the first evaluation phase, comparisons will be made by site visitors and fellow participants. Based on the ranking generated, we will then send the top 50 agents to a second evaluation phase where comparisons will be done by paid contractors who are not participating in the competition, to reduce potential sources of sabotage or noisiness in the rankings actually used for final top-three prize determination. 

📊 Dataset

To help you train capable agents, we have collected datasets of 40-80 human demonstrations for each of the tasks described above. You can find details about this dataset on the MineRL website.

Note that the demonstrations in the dataset are sometimes longer than the time available to the agent to complete the task.

💪 Getting Started

You can find the competition submission starter kit on GitHub here, as well as our baseline implementation of behavior cloning here. This baseline implementation includes both the training code used to train policies and the policies themselves, so you can test out the submission process by simply forking the repo, editing the AIcrowd configuration to specify one of the four tasks, and creating a tag to trigger submission.

Our currently available baselines are straightforward, and are intended mostly as a way to help you understand how very simple models perform in the environment, as well as demonstrating how to train a model with a common reinforcement learning framework (Stable Baselines 3) on these environments. They are created by training behavioral cloning on the full dataset of each task, taking the player’s pixel point of view observation as input, and trying to predict the action taken by the demonstrator at that timestep. The only action wrapper that they use is one that changes camera actions from a continuous space into a discrete up/down and left/right action, to make the scale of log probabilities for the camera action more comparable to the scale of log probabilities for other actions.
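As a rough illustration of the camera discretization idea, here is a minimal sketch in Python. It is not the wrapper used in the baselines. It assumes the camera action is a (pitch, yaw) pair of degree changes, and the names `discretize_camera`, `to_continuous`, `THRESHOLD_DEGREES`, and `STEP_DEGREES`, along with their values, are hypothetical.

```python
THRESHOLD_DEGREES = 1.0  # hypothetical dead zone: smaller movements count as "no move"
STEP_DEGREES = 10.0      # hypothetical fixed turn applied when replaying a discrete action


def discretize_axis(delta, threshold=THRESHOLD_DEGREES):
    """Map a continuous change in degrees to -1, 0, or +1."""
    if delta > threshold:
        return 1
    if delta < -threshold:
        return -1
    return 0


def discretize_camera(camera):
    """Convert a continuous (pitch, yaw) action into two three-way choices."""
    pitch, yaw = camera
    return discretize_axis(pitch), discretize_axis(yaw)


def to_continuous(discrete, step=STEP_DEGREES):
    """Convert the discrete choices back into degrees for the environment."""
    return [direction * step for direction in discrete]
```

With this encoding, the policy predicts one of three classes per camera axis, so the camera contributes categorical log probabilities just like the other actions. How coarsely to discretize is itself a tunable choice.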

The baselines are made with the starter kit and can themselves be submitted, so you could clone the baseline repo instead of the starter kit if you'd like to start from our baseline, rather than starting from scratch.

Here are some additional resources!

Struggling to think of ways to improve on our baselines? Here are a few ideas that we haven’t tried yet, in order of increasing ambition:

Integrating a richer set of observation types. For tasks other than FindCave, the environments in BASALT contain observations other than a point-of-view pixel array. However, for the sake of designing a straightforward set of training code that can be used on all environments, we currently only use POV observations for our baselines, even when other observation types are available. Exploring other architectures that integrate image-based, continuous, and categorical observation spaces together could likely improve performance in our more complex tasks.

Hyperparameter and architecture tuning. Machine learning algorithms depend on a great many hyperparameters, or “knobs” that can be turned to get slightly different learning behavior, such as the learning rate, or how much you discretize the camera action. It is often quite important to set these knobs just right, and doing so usually requires a lot of trial and error. We have already done some hyperparameter tuning, but there are likely more gains to be had through more tuning.

Dependence on the past. Our behavioral cloning (BC) implementation is memoryless, that is, there is no state or “memories” that it carries forward from the past to help make future decisions. This can cause problems with some of the tasks: for example, in MakeWaterfall, if the agent sees a waterfall, it is hard for it to know whether that waterfall is one that it created, or a naturally occurring waterfall. One default fix would be to use a recurrent neural network (RNN) that compresses the past states and actions into a “hidden state” that can inform future actions. We could also try other schemes:

  1. Instead of using an RNN, we could use a Transformer model. These models have been replacing RNNs in many areas in recent years, most prominently in natural language processing, where they were used to create GPT-3.
  2. We could use a “slow” and “fast” RNN, to accommodate both long-term and short-term history, as done for Capture the Flag.
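To show what carrying a “hidden state” buys you, here is a toy sketch in Python of a single-unit recurrent update. It is purely illustrative: the weights are hand-picked rather than trained, and the scalar observation (1.0 meaning “the agent placed water at this step”) is a hypothetical stand-in for real pixel observations.

```python
import math


def rnn_step(hidden, observation, w_hidden=1.0, w_obs=1.0):
    """One recurrent update: mix the previous hidden state with the new observation."""
    return math.tanh(w_hidden * hidden + w_obs * observation)


def final_hidden(observations):
    """Run the recurrence over a whole episode, starting from an empty memory."""
    hidden = 0.0
    for obs in observations:
        hidden = rnn_step(hidden, obs)
    return hidden
```

Both `final_hidden([1.0, 0.0])` and `final_hidden([0.0, 0.0])` end on the same observation, yet the first leaves a nonzero hidden state because the earlier event is remembered. A memoryless policy sees only the last observation and cannot tell the two episodes apart.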

Comparisons. Our BC baseline only learns from human demonstrations, but there is a lot of work suggesting that you can improve upon such a baseline by training on human comparisons between AI trajectories (after already training on demonstrations, so that the AI trajectories are reasonable, rather than spinning around randomly). Many of these techniques can be straightforwardly applied to our setting as well. (Note: We may release a baseline that learns from comparisons in the future.)
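One common way to learn from comparisons is to fit a reward model so that the preferred trajectory gets the higher predicted return, using a Bradley-Terry style loss. The sketch below, in Python, shows only that loss. It is one standard formulation rather than anything the organizers have released, and the function names are hypothetical.

```python
import math


def preference_probability(return_a, return_b):
    """Modelled probability that a human prefers trajectory A over trajectory B."""
    return 1.0 / (1.0 + math.exp(return_b - return_a))


def comparison_loss(return_a, return_b, a_preferred):
    """Negative log-likelihood of one human comparison.

    Computed as softplus(-margin) in a numerically stable form, where margin is
    the predicted return of the preferred trajectory minus that of the other.
    """
    margin = return_a - return_b if a_preferred else return_b - return_a
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

Minimizing this loss over many labelled pairs pushes the reward model to agree with the human rankings. The learned reward can then be used to improve the policy that was initialized from demonstrations.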

Corrections. One issue with the BC waterfall agent is that it doesn’t seem to realize that it should, y’know, create a waterfall. This isn’t too surprising -- it’s just trained to mimic what humans do, and only a tiny fraction of the human demonstrations involve creating a waterfall -- most of the time is spent navigating the extreme hills biome. One generic way that we might try to solve the problem is to ask humans to look at AI-generated trajectories, and suggest ways in which that trajectory could be improved. In the waterfall case, people might find specific times at which the agent was in a good spot to create a waterfall, and say “you should have made a waterfall at this point”. This paper applies this idea to human-robot interaction, but would need significant modification to apply to Minecraft. This paper applies this idea to Mario, though some parts rely on access to simulator state that is disallowed in this competition.

The kitchen sink. There have been a lot of proposals for types of human feedback; you can see this paper for an overview. You could try to implement several of these and then provide an interface for a human to decide which type of feedback to give next. It’s possible that a human intelligently interleaving the various types of feedback could do much better than a training setup that used just one of those feedback types. For example, demonstrations are great for getting started, comparisons are great for generic improvement, and corrections are great for fixing specific issues, so combining all three could do a lot better than any one thing individually.

However, this is probably way too ambitious for this competition, and we would not recommend trying to do all of this. You could select a small subset of the feedback types and implement an interface just for those.

📁 Competition Structure

[In the diagram above, blue cells are tasks that participants are responsible for, and yellow cells are tasks performed by organizers]

Round 1: General Entry + Preliminary Evaluation

In this round, teams of up to 6 individuals will do the following:

  1. Register on the AICrowd competition website and receive the following materials:
    1. Starter code for running the environments for the competition task.
    2. Baseline implementations provided by the competition organizers.
    3. The human demonstration dataset.
    4. Docker Images and a quick-start template that the competition organizers will use to validate the training performance of the competitor’s models.
    5. Scripts enabling the procurement of the standard cloud compute system used to evaluate the sample-efficiency of participants’ submissions.
    6. (Optional) Form a team using the ‘Create Team’ button on the competition overview. Participants must be signed in to create a team.
  2. Develop and test procedures for efficiently training models to solve the competition tasks.
  3. Train their models against the four task environments using the local training/Azure training scripts in the competition starter template. The training for all four tasks together must take at most four days of compute and at most 10 hours of human-in-the-loop feedback.
  4. Submit their trained models (along with the training code) for evaluation when satisfied with their models. Submission instructions can be found in the README on the submission template. The automated setup will generate videos of the agent's behavior under validation seeds, and submit them for human comparisons. These comparisons are then used to compute and report the metrics on the leaderboard of the competition.
  5. Repeat 2-4 until the submission round is complete!

Once the submission round is complete, the organizers will collect additional human comparisons until entries have sufficiently stable scores, after which the top 50 proceed to the second evaluation round.

Evaluation round 2: Final scores

In the second evaluation round, organizers will solicit comparisons between the submitted agents from paid contractors (whether hired directly or through Mechanical Turk or some other method), using the same leaderboard mechanism. Participants are strictly prohibited from providing comparisons in this evaluation round. This gives the final scores for each submission, subject to validation in the next round. The top 10 teams will advance to the validation round.

Validation

In the validation round, organizers will:

  • Fork the submitted code repositories, and scrub them of any files larger than 30MB to ensure that participants are not using any pre-trained models or large datasets.
  • Examine the code repositories of the top submissions on the leaderboard to ensure compliance with the competition rules.
  • Retrain these submissions, potentially with help from participants to ensure human feedback is provided appropriately.
  • Disqualify any submissions for which retraining provides a significantly worse agent than was submitted by the team.

📜 Rules

The full set of rules is available here. Please do read them. Here we only explain a small subset of the rules that are particularly important:

  1. You cannot pull information out of the underlying Minecraft simulator; only information provided in the interfaces of the environments we give is allowed.
  2. Submissions are limited to four days of compute on prespecified computing hardware to train models for all of the tasks. We will publish more details on the specs for the hardware after we finalize the details. 
  3. If you train using in-the-loop human feedback, you are limited to 10 hours of human feedback over the course of training all four models. The interface for providing feedback must be exposed by your code in a way that a person fluent in English can understand how to provide the feedback your algorithm requires, either through a GUI or a command-line interface, after reading a Google Doc you submit containing at most 10 pages. This is necessary for retraining, since we will have to replicate both the computation of your algorithm and its requests for human feedback.
    1. During retraining, while we aim to get human feedback to you as soon as possible, your program may have to wait for a few hours for human feedback to be available. (This will not count against the four day compute budget, though you are allowed to continue background computation during this time.)
    2. Human feedback will be provided by remote contractors, so your code should be resilient to network delays. (For example, contractors may find it especially challenging to play Minecraft well over this connection.)
    3. You are permitted to ask for human feedback in separate batches (e.g. every hour or so, you ask for 10 minutes of human feedback).

💵 Prizes and Funding Opportunities

Thanks to the generosity of our sponsors, there will be $11,000 worth of cash prizes:

  1. First place: $5,000
  2. Second place: $3,000
  3. Third place: $2,000
  4. Most human-like: $500
  5. Creativity of research: $500

In addition, the top three teams will be invited to coauthor the competition report.

Prizes, including those for human-likeness and creativity, will be restricted to entries that reach the second evaluation phase (top 50), and will be chosen at the organizers' discretion. Prize winners are expected to present their solutions at NeurIPS.

We also have an additional $1,000 worth of prizes for participants who provide support for the competition:

  1. Community support: $500 (may be split across participants at the organizers' discretion)
  2. Lottery for leaderboard ratings: 5 prizes each worth $100

📅 Timeline 

  • July 7: Competition begins! Participants are invited to download the starting kit and begin developing their submission.
  • October 15: Submission deadline. Submissions are closed and organizers begin the evaluation process.
  • November: Winners are announced and are invited to contribute to the competition writeup.
  • December 13-14: Presentation at NeurIPS 2021.

📝 Notes on downsampling

(Fairly in-the-weeds, you can probably just skip this section)

For the sake of decreasing the amount of compute needed to train and run the AIs, the AIs are typically only given a very low resolution view of the Minecraft world, and must act in that environment. However, for our human evaluation, we would like to show videos at a regular resolution, so that evaluators don't have to squint to see what exactly is happening.

As a result, by default we train our AIs in low-res environments, and then during evaluation we instantiate a high-res environment (which generates the video that humans watch), and downsample it to the low resolution before passing it to the AI. Unfortunately, the high-res + downsample combination produces slightly different images than using the low-res environment directly. AIs based on neural nets can be very sensitive to this difference. If you train an AI system on the low-res environment, it may work well in that setting, but then work poorly in the high-res + downsample case -- even though these would look identical to humans.

Our best guess is that this won't make much of a difference. However, if you would like to be safe, you can instead train on the high-res + downsample combination and never use the low-res environment at all. In this case, your AI will be tested in exactly the same conditions as it was trained in, so this sort of issue should not occur. The downside is that your training may take 2-3x longer.
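To see why two routes to the same low resolution can disagree, here is a small Python sketch that shrinks one image in two ways. It does not reproduce the actual renderer or the downsampling used in evaluation. Both functions are hypothetical stand-ins, with point sampling only loosely analogous to rendering directly at low resolution.

```python
def block_average(image, factor):
    """Downsample by averaging each factor x factor block of pixels.

    Assumes the image height and width are both divisible by factor.
    """
    height, width = len(image), len(image[0])
    out = []
    for top in range(0, height, factor):
        row = []
        for left in range(0, width, factor):
            block = [
                image[y][x]
                for y in range(top, top + factor)
                for x in range(left, left + factor)
            ]
            row.append(sum(block) / len(block))
        out.append(row)
    return out


def point_sample(image, factor):
    """Downsample by keeping only the top-left pixel of each block."""
    return [row[::factor] for row in image[::factor]]
```

On a fine checkerboard of 0 and 255 values, averaging 2x2 blocks gives a uniform grey of 127.5, while point sampling gives pure black. Real frames differ far less than this extreme case, but a neural net can still be sensitive to the small systematic shift.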

You can find more details on the issue here. One of the organizers benchmarked the two options on their personal machine here.

🙋 F.A.Q.

This F.A.Q. is the only official place for clarification of competition Rules!

Q: Do I need to purchase Minecraft to participate?

 > A: No! MineRL includes a special version of Minecraft provided generously by the folks at Microsoft Research via Project Malmo.

Q: Will you be releasing your setup for collecting demonstrations?

 > A: Unfortunately not -- our setup is fairly complex and not fit for public release. However, we expect that it would not be too hard to code up a simple keyboard and mouse interface to MineRL, and then record all observations and actions, in order to collect your own demonstrations.

Q: Will you re-run my training code? 

> A: Eventually, but not during Round 1. During Round 1, you will submit a pre-trained agent, which will be evaluated on novel seeds of the environment it was trained for. During Round 2, we will re-train the top submissions from Round 1 to validate that they reach comparable performance to the submitted agents when we run the submitted training code. 

Q: How will leaderboard rankings be calculated? 

> A: The competition homepage will include a place for visitors to rank pairs of compared trajectories, and these rankings will be aggregated into a TrueSkill score.

Q: What does “Minecraft internal state” (that participants aren't allowed to use) refer to?  

> A: It refers to hardcoded aspects of world state like “how far am I from a tree” and “what blocks are in a 360 degree radius around me”; things that either would not be available from the agent’s perspective, or that an agent would normally have to infer from data in a real environment, since the real world doesn’t have hardcoded state available.

Have more questions? Ask in Discord or on the Forum.

🤝 Partners

Thank you to our amazing partners!

Open Philanthropy - Idealist

Microsoft

 

👥 Team

The organizing team consists of:

  • Rohin Shah (UC Berkeley)
  • Cody Wild (UC Berkeley)
  • Steven H. Wang (UC Berkeley)
  • Neel Alex (UC Berkeley)
  • Brandon Houghton (OpenAI and Carnegie Mellon University)
  • William H. Guss (OpenAI and Carnegie Mellon University)
  • Sharada Mohanty (AIcrowd)
  • Anssi Kanervisto (University of Eastern Finland)
  • Stephanie Milani (Carnegie Mellon University)
  • Nicholay Topin (Carnegie Mellon University)
  • Pieter Abbeel (UC Berkeley)
  • Stuart Russell (UC Berkeley)
  • Anca Dragan (UC Berkeley)

The advisory committee consists of:

  • Sergio Guadarrama (Google Brain)
  • Katja Hofmann (Microsoft Research)
  • Andrew Critch (UC Berkeley)

📱 Contact

If you have any questions, please feel free to contact us on Discord or through the AIcrowd forum.
