NeurIPS 2021: MineRL Diamond Competition

Behavioural cloning baseline for the Research track

Introduction¶

This notebook contains the Behavioural Cloning baselines for the Research track of the MineRL 2021 competition. To run it you will need to enable GPU by going to Runtime -> Change runtime type and selecting GPU from the drop down list.

These baselines differ slightly from the standalone version of these baselines on github - the DATA_SAMPLES parameter is set to 400,000 instead of the default 1,000,000. This is done to fit into the RAM limits of Colab.

To train the agent using the obfuscated action space we first discretize the action space using KMeans clustering. We then train the agent using Behavioural cloning. The training takes 10-15 mins.

You can find more details about the obfuscation here:
K-means exploration

Also see the in-depth analysis of the obfuscation and the KMeans approach done by one of the teams in the 2020 competition:

Obfuscation and KMeans analysis

Please note that any attempt to work with the obfuscated state and action spaces should be general and work with a different dataset or even a completely new environment.

Setup¶

In [1]:

%%capture
!sudo add-apt-repository -y ppa:openjdk-r/ppa
!sudo apt-get purge openjdk-*
!sudo apt-get install openjdk-8-jdk
!sudo apt-get install xvfb xserver-xephyr vnc4server python-opengl ffmpeg

In [2]:

%%capture
!pip3 install --upgrade minerl
!pip3 install pyvirtualdisplay
!pip3 install pytorch
!pip3 install scikit-learn
!pip3 install -U colabgymrender

Import Libraries¶

In [3]:

import random
import numpy as np
import torch as th
from torch import nn
import gym
import minerl
from tqdm.notebook import tqdm
from colabgymrender.recorder import Recorder
from pyvirtualdisplay import Display
from sklearn.cluster import KMeans
import logging
logging.disable(logging.ERROR) # reduce clutter, remove if something doesn't work to see the error logs.

/usr/local/lib/python3.7/dist-packages/gym/logger.py:30: UserWarning: WARN: Box bound precision lowered by casting to float32
  warnings.warn(colorize('%s: %s'%('WARN', msg % args), 'yellow'))

Imageio: 'ffmpeg-linux64-v3.3.1' was not found on your computer; downloading it now.
Try 1. Download from https://github.com/imageio/imageio-binaries/raw/master/ffmpeg/ffmpeg-linux64-v3.3.1 (43.8 MB)
Downloading: 8192/45929032 bytes (0.0%)3506176/45929032 bytes (7.6%)7454720/45929032 bytes (16.2%)11239424/45929032 bytes (24.5%)15163392/45929032 bytes (33.0%)18874368/45929032 bytes (41.1%)22749184/45929032 bytes (49.5%)26566656/45929032 bytes (57.8%)30203904/45929032 bytes (65.8%)34045952/45929032 bytes (74.1%)37928960/45929032 bytes (82.6%)41754624/45929032 bytes (90.9%)45359104/45929032 bytes (98.8%)45929032/45929032 bytes (100.0%)
  Done
File saved as /root/.imageio/ffmpeg/ffmpeg-linux64-v3.3.1.

Neural network¶

In [4]:

class NatureCNN(nn.Module):
    """
    CNN from DQN nature paper:
        Mnih, Volodymyr, et al.
        "Human-level control through deep reinforcement learning."
        Nature 518.7540 (2015): 529-533.

    Nicked from stable-baselines3:
        https://github.com/DLR-RM/stable-baselines3/blob/master/stable_baselines3/common/torch_layers.py

    :param input_shape: A three-item tuple telling image dimensions in (C, H, W)
    :param output_dim: Dimensionality of the output vector
    """

    def __init__(self, input_shape, output_dim):
        super().__init__()
        n_input_channels = input_shape[0]
        self.cnn = nn.Sequential(
            nn.Conv2d(n_input_channels, 32, kernel_size=8, stride=4, padding=0),
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2, padding=0),
            nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1, padding=0),
            nn.ReLU(),
            nn.Flatten(),
        )

        # Compute shape by doing one forward pass
        with th.no_grad():
            n_flatten = self.cnn(th.zeros(1, *input_shape)).shape[1]

        self.linear = nn.Sequential(
            nn.Linear(n_flatten, 512),
            nn.ReLU(),
            nn.Linear(512, output_dim)
        )

    def forward(self, observations: th.Tensor) -> th.Tensor:
        return self.linear(self.cnn(observations))

Setup training¶

In [5]:

def train():
    # For demonstration purposes, we will only use ObtainPickaxe data which is smaller,
    # but has the similar steps as ObtainDiamond in the beginning.
    # "VectorObf" stands for vectorized (vector observation and action), where there is no
    # clear mapping between original actions and the vectors (i.e. you need to learn it)
    data = minerl.data.make("MineRLObtainIronPickaxeVectorObf-v0",  data_dir='data', num_workers=1)

    # First, use k-means to find actions that represent most of them.
    # This proved to be a strong approach in the MineRL 2020 competition.
    # See the following for more analysis:
    # https://github.com/GJuceviciute/MineRL-2020

    # Go over the dataset once and collect all actions and the observations (the "pov" image).
    # We do this to later on have uniform sampling of the dataset and to avoid high memory use spikes.
    all_actions = []
    all_pov_obs = []

    print("Loading data")
    trajectory_names = data.get_trajectory_names()
    random.shuffle(trajectory_names)

    # Add trajectories to the data until we reach the required DATA_SAMPLES.
    for trajectory_name in trajectory_names:
        trajectory = data.load_data(trajectory_name, skip_interval=0, include_metadata=False)
        for dataset_observation, dataset_action, _, _, _ in trajectory:
            all_actions.append(dataset_action["vector"])
            all_pov_obs.append(dataset_observation["pov"])
        if len(all_actions) >= DATA_SAMPLES:
            break

    all_actions = np.array(all_actions)
    all_pov_obs = np.array(all_pov_obs)

    # Run k-means clustering using scikit-learn.
    print("Running KMeans on the action vectors")
    kmeans = KMeans(n_clusters=NUM_ACTION_CENTROIDS)
    kmeans.fit(all_actions)
    action_centroids = kmeans.cluster_centers_
    print("KMeans done")

    # Now onto behavioural cloning itself.
    # Much like with intro track, we do behavioural cloning on the discrete actions,
    # where we turn the original vectors into discrete choices by mapping them to the closest
    # centroid (based on Euclidian distance).

    network = NatureCNN((3, 64, 64), NUM_ACTION_CENTROIDS).cuda()
    optimizer = th.optim.Adam(network.parameters(), lr=LEARNING_RATE)
    loss_function = nn.CrossEntropyLoss()

    num_samples = all_actions.shape[0]
    update_count = 0
    losses = []
    # We have the data loaded up already in all_actions and all_pov_obs arrays.
    # Let's do a manual training loop
    print("Training")
    for _ in range(EPOCHS):
        # Randomize the order in which we go over the samples
        epoch_indices = np.arange(num_samples)
        np.random.shuffle(epoch_indices)
        for batch_i in range(0, num_samples, BATCH_SIZE):
            # NOTE: this will cut off incomplete batches from end of the random indices
            batch_indices = epoch_indices[batch_i:batch_i + BATCH_SIZE]

            # Load the inputs and preprocess
            obs = all_pov_obs[batch_indices].astype(np.float32)
            # Transpose observations to be channel-first (BCHW instead of BHWC)
            obs = obs.transpose(0, 3, 1, 2)
            # Normalize observations. Do this here to avoid using too much memory (images are uint8 by default)
            obs /= 255.0

            # Map actions to their closest centroids
            action_vectors = all_actions[batch_indices]
            # Use numpy broadcasting to compute the distance between all
            # actions and centroids at once.
            # "None" in indexing adds a new dimension that allows the broadcasting
            distances = np.sum((action_vectors - action_centroids[:, None]) ** 2, axis=2)
            # Get the index of the closest centroid to each action.
            # This is an array of (batch_size,)
            actions = np.argmin(distances, axis=0)

            # Obtain logits of each action
            logits = network(th.from_numpy(obs).float().cuda())

            # Minimize cross-entropy with target labels.
            # We could also compute the probability of demonstration actions and
            # maximize them.
            loss = loss_function(logits, th.from_numpy(actions).long().cuda())

            # Standard PyTorch update
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            update_count += 1
            losses.append(loss.item())
            if (update_count % 1000) == 0:
                mean_loss = sum(losses) / len(losses)
                tqdm.write("Iteration {}. Loss {:<10.3f}".format(update_count, mean_loss))
                losses.clear()
    print("Training done")

    # Save network and the centroids into separate files
    np.save(TRAIN_KMEANS_MODEL_NAME, action_centroids)
    th.save(network.state_dict(), TRAIN_MODEL_NAME)
    del data

Parameters¶

In [6]:

# Parameters:
EPOCHS = 2  # how many times we train over dataset.
LEARNING_RATE = 0.0001  # Learning rate for the neural network.
BATCH_SIZE = 32
NUM_ACTION_CENTROIDS = 100  # Number of KMeans centroids used to cluster the data.

DATA_SAMPLES = 400000  # how many samples to use from the dataset. Impacts RAM usage

TRAIN_MODEL_NAME = 'research_potato.pth'  # name to use when saving the trained agent.
TEST_MODEL_NAME = 'research_potato.pth'  # name to use when loading the trained agent.
TRAIN_KMEANS_MODEL_NAME = 'centroids_for_research_potato.npy'  # name to use when saving the KMeans model.
TEST_KMEANS_MODEL_NAME = 'centroids_for_research_potato.npy'  # name to use when loading the KMeans model.

TEST_EPISODES = 10  # number of episodes to test the agent for.
MAX_TEST_EPISODE_LEN = 18000  # 18k is the default for MineRLObtainDiamondVectorObf.

Download the data¶

In [7]:

minerl.data.download(directory='data', environment='MineRLObtainIronPickaxeVectorObf-v0');

Download: https://minerl.s3.amazonaws.com/v4/MineRLObtainIronPickaxeVectorObf-v0.tar: 3083.0MB [01:10, 27.27MB/s]

Train¶

In [8]:

display = Display(visible=0, size=(400, 300))
display.start();

In [9]:

train()  # only need to run this once.

Loading data

100%|██████████| 5587/5587 [00:00<00:00, 129946.86it/s]

100%|██████████| 4275/4275 [00:00<00:00, 148025.70it/s]

100%|██████████| 6150/6150 [00:00<00:00, 138971.03it/s]

100%|██████████| 2859/2859 [00:00<00:00, 148297.88it/s]

100%|██████████| 4307/4307 [00:00<00:00, 137860.81it/s]

100%|██████████| 6535/6535 [00:00<00:00, 133216.25it/s]

100%|██████████| 3151/3151 [00:00<00:00, 150732.80it/s]

100%|██████████| 14869/14869 [00:00<00:00, 157115.88it/s]

100%|██████████| 3538/3538 [00:00<00:00, 158946.96it/s]

100%|██████████| 6376/6376 [00:00<00:00, 151696.52it/s]

100%|██████████| 3184/3184 [00:00<00:00, 137559.24it/s]

100%|██████████| 9355/9355 [00:00<00:00, 147974.15it/s]

100%|██████████| 5772/5772 [00:00<00:00, 140925.10it/s]

100%|██████████| 2552/2552 [00:00<00:00, 131848.59it/s]

100%|██████████| 5324/5324 [00:00<00:00, 148637.95it/s]

100%|██████████| 5525/5525 [00:00<00:00, 142272.76it/s]

100%|██████████| 4408/4408 [00:00<00:00, 145245.87it/s]

100%|██████████| 3388/3388 [00:00<00:00, 136608.62it/s]

100%|██████████| 2757/2757 [00:00<00:00, 134046.97it/s]

100%|██████████| 5260/5260 [00:00<00:00, 144152.05it/s]

  0%|          | 0/15115 [00:00<?, ?it/s]
100%|██████████| 15115/15115 [00:00<00:00, 140125.64it/s]

100%|██████████| 5291/5291 [00:00<00:00, 137478.55it/s]

  0%|          | 0/17390 [00:00<?, ?it/s]
100%|██████████| 17390/17390 [00:00<00:00, 142468.34it/s]

100%|██████████| 3535/3535 [00:00<00:00, 147255.53it/s]

100%|██████████| 3311/3311 [00:00<00:00, 128433.08it/s]

100%|██████████| 3686/3686 [00:00<00:00, 137549.98it/s]

100%|██████████| 7421/7421 [00:00<00:00, 145443.51it/s]

100%|██████████| 5019/5019 [00:00<00:00, 146658.48it/s]

100%|██████████| 12271/12271 [00:00<00:00, 143968.49it/s]

100%|██████████| 7374/7374 [00:00<00:00, 129117.47it/s]

100%|██████████| 2900/2900 [00:00<00:00, 121597.12it/s]

100%|██████████| 2183/2183 [00:00<00:00, 159212.74it/s]

100%|██████████| 4859/4859 [00:00<00:00, 135065.20it/s]

100%|██████████| 3042/3042 [00:00<00:00, 131799.07it/s]

100%|██████████| 2809/2809 [00:00<00:00, 136903.75it/s]

100%|██████████| 5346/5346 [00:00<00:00, 128222.32it/s]

100%|██████████| 5861/5861 [00:00<00:00, 140441.93it/s]

100%|██████████| 3258/3258 [00:00<00:00, 157467.65it/s]

100%|██████████| 3280/3280 [00:00<00:00, 138534.60it/s]

100%|██████████| 2994/2994 [00:00<00:00, 144908.22it/s]

100%|██████████| 7007/7007 [00:00<00:00, 120130.18it/s]

100%|██████████| 5720/5720 [00:00<00:00, 140068.89it/s]

100%|██████████| 9121/9121 [00:00<00:00, 144440.05it/s]

100%|██████████| 14323/14323 [00:00<00:00, 149334.10it/s]

100%|██████████| 5432/5432 [00:00<00:00, 140845.56it/s]

100%|██████████| 2723/2723 [00:00<00:00, 101616.54it/s]

100%|██████████| 2883/2883 [00:00<00:00, 145901.60it/s]

100%|██████████| 10362/10362 [00:00<00:00, 133089.71it/s]

100%|██████████| 4769/4769 [00:00<00:00, 140820.99it/s]

100%|██████████| 3645/3645 [00:00<00:00, 145530.20it/s]

100%|██████████| 9216/9216 [00:00<00:00, 136644.59it/s]

100%|██████████| 3508/3508 [00:00<00:00, 148819.33it/s]

100%|██████████| 5485/5485 [00:00<00:00, 117569.46it/s]

100%|██████████| 6340/6340 [00:00<00:00, 147660.52it/s]

100%|██████████| 7141/7141 [00:00<00:00, 139116.60it/s]

100%|██████████| 5395/5395 [00:00<00:00, 151614.22it/s]

100%|██████████| 5167/5167 [00:00<00:00, 134850.56it/s]

100%|██████████| 4484/4484 [00:00<00:00, 116452.90it/s]

100%|██████████| 6507/6507 [00:00<00:00, 145669.45it/s]

100%|██████████| 6473/6473 [00:00<00:00, 141106.46it/s]

100%|██████████| 7557/7557 [00:00<00:00, 138888.13it/s]

100%|██████████| 3339/3339 [00:00<00:00, 136099.56it/s]

100%|██████████| 5335/5335 [00:00<00:00, 142229.95it/s]

100%|██████████| 3536/3536 [00:00<00:00, 143881.90it/s]

100%|██████████| 7711/7711 [00:00<00:00, 145681.34it/s]

100%|██████████| 6544/6544 [00:00<00:00, 142228.42it/s]

100%|██████████| 5017/5017 [00:00<00:00, 144796.38it/s]

100%|██████████| 3827/3827 [00:00<00:00, 134266.85it/s]

100%|██████████| 4181/4181 [00:00<00:00, 127072.49it/s]

100%|██████████| 3490/3490 [00:00<00:00, 128637.02it/s]

100%|██████████| 3574/3574 [00:00<00:00, 115947.92it/s]

Running KMeans on the action vectors
KMeans done
Training

Download: https://minerl.s3.amazonaws.com/v4/MineRLObtainIronPickaxeVectorObf-v0.tar: 3083.0MB [07:08, 27.27MB/s]

Iteration 1000. Loss 2.209

Download: https://minerl.s3.amazonaws.com/v4/MineRLObtainIronPickaxeVectorObf-v0.tar: 3083.0MB [07:13, 27.27MB/s]

Iteration 2000. Loss 1.972

Download: https://minerl.s3.amazonaws.com/v4/MineRLObtainIronPickaxeVectorObf-v0.tar: 3083.0MB [07:17, 27.27MB/s]

Iteration 3000. Loss 1.911

Download: https://minerl.s3.amazonaws.com/v4/MineRLObtainIronPickaxeVectorObf-v0.tar: 3083.0MB [07:22, 27.27MB/s]

Iteration 4000. Loss 1.888

Download: https://minerl.s3.amazonaws.com/v4/MineRLObtainIronPickaxeVectorObf-v0.tar: 3083.0MB [07:26, 27.27MB/s]

Iteration 5000. Loss 1.857

Download: https://minerl.s3.amazonaws.com/v4/MineRLObtainIronPickaxeVectorObf-v0.tar: 3083.0MB [07:31, 27.27MB/s]

Iteration 6000. Loss 1.781

Download: https://minerl.s3.amazonaws.com/v4/MineRLObtainIronPickaxeVectorObf-v0.tar: 3083.0MB [07:35, 27.27MB/s]

Iteration 7000. Loss 1.744

Download: https://minerl.s3.amazonaws.com/v4/MineRLObtainIronPickaxeVectorObf-v0.tar: 3083.0MB [07:40, 27.27MB/s]

Iteration 8000. Loss 1.695

Download: https://minerl.s3.amazonaws.com/v4/MineRLObtainIronPickaxeVectorObf-v0.tar: 3083.0MB [07:44, 27.27MB/s]

Iteration 9000. Loss 1.671

Download: https://minerl.s3.amazonaws.com/v4/MineRLObtainIronPickaxeVectorObf-v0.tar: 3083.0MB [07:49, 27.27MB/s]

Iteration 10000. Loss 1.630

Download: https://minerl.s3.amazonaws.com/v4/MineRLObtainIronPickaxeVectorObf-v0.tar: 3083.0MB [07:53, 27.27MB/s]

Iteration 11000. Loss 1.627

Download: https://minerl.s3.amazonaws.com/v4/MineRLObtainIronPickaxeVectorObf-v0.tar: 3083.0MB [07:58, 27.27MB/s]

Iteration 12000. Loss 1.606

Download: https://minerl.s3.amazonaws.com/v4/MineRLObtainIronPickaxeVectorObf-v0.tar: 3083.0MB [08:02, 27.27MB/s]

Iteration 13000. Loss 1.564

Download: https://minerl.s3.amazonaws.com/v4/MineRLObtainIronPickaxeVectorObf-v0.tar: 3083.0MB [08:07, 27.27MB/s]

Iteration 14000. Loss 1.539

Download: https://minerl.s3.amazonaws.com/v4/MineRLObtainIronPickaxeVectorObf-v0.tar: 3083.0MB [08:11, 27.27MB/s]

Iteration 15000. Loss 1.536

Download: https://minerl.s3.amazonaws.com/v4/MineRLObtainIronPickaxeVectorObf-v0.tar: 3083.0MB [08:16, 27.27MB/s]

Iteration 16000. Loss 1.508

Download: https://minerl.s3.amazonaws.com/v4/MineRLObtainIronPickaxeVectorObf-v0.tar: 3083.0MB [08:20, 27.27MB/s]

Iteration 17000. Loss 1.498

Download: https://minerl.s3.amazonaws.com/v4/MineRLObtainIronPickaxeVectorObf-v0.tar: 3083.0MB [08:25, 27.27MB/s]

Iteration 18000. Loss 1.485

Download: https://minerl.s3.amazonaws.com/v4/MineRLObtainIronPickaxeVectorObf-v0.tar: 3083.0MB [08:29, 27.27MB/s]

Iteration 19000. Loss 1.474

Download: https://minerl.s3.amazonaws.com/v4/MineRLObtainIronPickaxeVectorObf-v0.tar: 3083.0MB [08:34, 27.27MB/s]

Iteration 20000. Loss 1.458

Download: https://minerl.s3.amazonaws.com/v4/MineRLObtainIronPickaxeVectorObf-v0.tar: 3083.0MB [08:38, 27.27MB/s]

Iteration 21000. Loss 1.427

Download: https://minerl.s3.amazonaws.com/v4/MineRLObtainIronPickaxeVectorObf-v0.tar: 3083.0MB [08:43, 27.27MB/s]

Iteration 22000. Loss 1.417

Download: https://minerl.s3.amazonaws.com/v4/MineRLObtainIronPickaxeVectorObf-v0.tar: 3083.0MB [08:47, 27.27MB/s]

Iteration 23000. Loss 1.405

Download: https://minerl.s3.amazonaws.com/v4/MineRLObtainIronPickaxeVectorObf-v0.tar: 3083.0MB [08:51, 27.27MB/s]

Iteration 24000. Loss 1.402

Download: https://minerl.s3.amazonaws.com/v4/MineRLObtainIronPickaxeVectorObf-v0.tar: 3083.0MB [08:56, 27.27MB/s]

Iteration 25000. Loss 1.385     
Training done

Start Minecraft¶

In [10]:

env = gym.make('MineRLObtainDiamondVectorObf-v0')
env = Recorder(env, './video', fps=60)

Run your agent¶

As the code below runs you should see episode videos and rewards show up. You can run the below cell multiple times to see different episodes.

In [11]:

action_centroids = np.load(TEST_KMEANS_MODEL_NAME)
network = NatureCNN((3, 64, 64), NUM_ACTION_CENTROIDS).cuda()
network.load_state_dict(th.load(TEST_MODEL_NAME))


num_actions = action_centroids.shape[0]
action_list = np.arange(num_actions)

for episode in range(TEST_EPISODES):
    obs = env.reset()
    done = False
    total_reward = 0
    steps = 0

    while not done:
        # Process the action:
        #   - Add/remove batch dimensions
        #   - Transpose image (needs to be channels-last)
        #   - Normalize image
        obs = th.from_numpy(obs['pov'].transpose(2, 0, 1)[None].astype(np.float32) / 255).cuda()
        # Turn logits into probabilities
        probabilities = th.softmax(network(obs), dim=1)[0]
        # Into numpy
        probabilities = probabilities.detach().cpu().numpy()
        # Sample action according to the probabilities
        discrete_action = np.random.choice(action_list, p=probabilities)

        # Map the discrete action to the corresponding action centroid (vector)
        action = action_centroids[discrete_action]
        minerl_action = {"vector": action}

        obs, reward, done, info = env.step(minerl_action)
        total_reward += reward
        steps += 1
        if steps >= MAX_TEST_EPISODE_LEN:
            break

    env.release()
    env.play()
    print(f'Episode #{episode + 1} reward: {total_reward}\t\t episode length: {steps}\n')

Output hidden; open in https://colab.research.google.com to view.

Content

3538

Show Comments

Comments

You must login before you can post a comment.