Ah, I see the issue now. I think the confusion comes from line 121:
[('forward', 1), ('jump', 1)]
This line doesn’t mean two actions (forward on the first tick, then jump on the next tick). Instead, it means that the forward and jump keys are both pressed for a single tick.
You can see that by printing out the no-op action:
act = env.action_space.noop()
print(act)
OrderedDict([('attack', 0), ('back', 0), ('camera', array([0., 0.], dtype=float32)), ('forward', 0), ('jump', 0), ('left', 0), ('right', 0), ('sneak', 0), ('sprint', 0)])
This is a single action that does nothing, because none of the keys are pressed. If you then do:
act['forward'] = 1
act['jump'] = 1
act will become a single action with both of those buttons pressed. This is what the ActionShaping() wrapper does. To create meta actions that perform 5 attacks and such, you will need to do something else; maybe frame skipping would be an easier way to achieve that?
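For illustration, here is a minimal frame-skipping sketch. The wrapper name, the `n_repeats` parameter, and its default value are all made up here; only the `gym.Wrapper` API it builds on is standard:

```python
import gym

class ActionRepeat(gym.Wrapper):
    """Hypothetical wrapper (not part of MineRL): repeats the chosen
    action for n_repeats consecutive ticks, e.g. 5 attacks in a row."""

    def __init__(self, env, n_repeats=5):
        super().__init__(env)
        self.n_repeats = n_repeats

    def step(self, action):
        total_reward = 0.0
        for _ in range(self.n_repeats):
            obs, reward, done, info = self.env.step(action)
            total_reward += reward
            if done:
                break
        return obs, total_reward, done, info
```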
The docstring of the class ActionShaping() should be enough to figure out how to adjust the actions for the RL part of the algo. What changes do you want to make and what have you tried?
Maybe playing Minecraft for a bit or watching a youtube guide would help with Minecraft knowledge?
Yes, you can use the *DenseVectorObf environments in the Research track of the competition.
Good catch, thank you! The links have been fixed.
Here’s some analysis our team did on the whole obfuscated action + KMeans thing:
A teaser: sometimes the agents don’t have a single action to look up.
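For context, the usual recipe looks roughly like the sketch below. This is an assumption-laden illustration, not our exact pipeline: it assumes the 64-dimensional 'vector' action space of the *Obf environments and the standard sklearn API, and `dataset_actions`, `n_clusters=32`, and the chosen action index are placeholders:

```python
import numpy as np
from sklearn.cluster import KMeans

# Placeholder for the (N, 64) obfuscated action vectors you would
# extract from the demonstration dataset:
dataset_actions = np.random.randn(1000, 64)

# Cluster the continuous action vectors into a discrete set:
kmeans = KMeans(n_clusters=32, random_state=0).fit(dataset_actions)

# The agent then picks one of 32 discrete actions; to step the
# environment, look up the corresponding cluster centre:
discrete_action = 7  # e.g. the output of your policy
action = {'vector': kmeans.cluster_centers_[discrete_action]}
# obs, reward, done, info = env.step(action)
```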
Working Colab example (credit to @tviskaron):
!sudo apt-get purge openjdk-*
!sudo apt-get install openjdk-8-jdk
!pip3 install --upgrade minerl
!sudo apt-get install xvfb xserver-xephyr vnc4server
!sudo pip install pyvirtualdisplay
from pyvirtualdisplay import Display
display = Display(visible=0, size=(640, 480))
display.start()

import gym
import minerl

env = gym.make('MineRLNavigateDense-v0')
obs = env.reset()
done = False
net_reward = 0

for _ in range(100):
    action = env.action_space.noop()
    action['camera'] = [0, 0.03 * obs["compassAngle"]]
    action['back'] = 0
    action['forward'] = 1
    action['jump'] = 1
    action['attack'] = 1
    obs, reward, done, info = env.step(action)
    net_reward += reward
    print("Total reward: ", net_reward)
It could be the weight initialization: PyTorch uses He (kaiming) uniform by default, while TensorFlow uses Glorot uniform. Using TensorFlow with glorot_uniform I get a score of 42 on starpilot, while using TensorFlow with he_uniform I get 19.
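If you want to test that hypothesis, something like this sketch re-initializes a PyTorch network the TensorFlow/Keras way (`xavier_uniform_` is PyTorch's Glorot implementation; the model below is just a placeholder to show usage):

```python
import torch.nn as nn

def glorot_init(module):
    # PyTorch defaults to Kaiming (He) uniform for Conv/Linear layers;
    # switch to Glorot (Xavier) uniform and zero biases, like TF/Keras.
    if isinstance(module, (nn.Conv2d, nn.Linear)):
        nn.init.xavier_uniform_(module.weight)
        if module.bias is not None:
            nn.init.zeros_(module.bias)

# Placeholder model, just to demonstrate applying the initializer:
model = nn.Sequential(nn.Conv2d(3, 16, 3), nn.ReLU(), nn.Conv2d(16, 16, 3))
model.apply(glorot_init)
```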
Will we be able to choose which submission to use for the final 16+4 evaluation? It might be the case that our best solution that was tested locally on 16 envs is not the same as the best one for the 6+4 envs on public LB.
So I was a little bored and decided to see how well I could play the procgen games myself.
python -m procgen.interactive --distribution-mode easy --vision agent --env-name coinrun
First I tried each game for 5-10 episodes to figure out what the keys do, how the game works, etc.
Then I played each game 100 times and logged the rewards. Here are the results:
| Environment | Mean reward | Mean normalized reward |
| --- | --- | --- |
The mean normalized score over all games was 0.882. It stayed relatively constant throughout the 100 episodes, i.e. I didn’t improve much while playing.
I’m not sure how useful this result would be as a “human benchmark” though - I could easily achieve ~1.000 score given enough time to think on each frame. Also, human visual reaction time is ~250ms, which at 15 fps would translate to us being at least 4 frames behind on our actions, which can be important for games like starpilot, chaser and some others.
That worked, thank you!
Does it work properly for everyone else? When I run it for 100 episodes, it only saves episodes 0, 1, 8, 27 and 64.
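Those indices are all perfect cubes, which matches the default capped cubic video schedule of gym's Monitor wrapper (it records episodes 0, 1, 8, 27, 64, ... up to episode 1000, then every 1000th). If that wrapper is what's being used here (an assumption on my part), you can force it to record every episode:

```python
from gym.wrappers import Monitor

# Override the default capped cubic schedule and record every episode;
# './video' is a placeholder output directory.
env = Monitor(env, './video', video_callable=lambda episode_id: True, force=True)
```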
There's a paint_vel_info flag that you can find under env_config in the .yaml files. There are also some flags that are not in the .yaml files but that people are using (e.g. use_backgrounds). You can find all of them if you scroll down here: https://github.com/openai/procgen
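These options can also be passed straight to gym.make, per the procgen README; the specific values here are just examples:

```python
import gym

# Procgen env options are plain keyword arguments on gym.make:
env = gym.make("procgen:procgen-coinrun-v0",
               paint_vel_info=True,
               use_backgrounds=False,
               distribution_mode="easy")
```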
Should we actually be allowed to change the environment? Maybe these settings should be reset when doing evaluation?
There was a mention that the final standings for round 2 would be based on more than 5 seeds to get a proper average performance. Is that going to happen? I didn't try to repeatedly submit similar models to overfit the 5 seeds for that reason.
Mine says it expires 28 May 2020; not sure if that's a set date or if it depends on when you redeem. I can't find the date when I redeemed.
Is the debug option off?
0.1, the same as a single door (there are 2 doors in each doorway).
And I was thinking I was going mad when my previously working submission suddenly broke after “disabling” debug.
Can’t wait! I’ve been trying to get my Dopamine-trained agent scored (only 5-7 floors so far), but the only response I get after every change is
The following containers terminated prematurely. : agent
and it’s not very helpful. It builds fine, but gets stuck in the evaluation phase.
In the Obstacle Tower paper there is a section on human performance. 15 people tried it multiple times and the max floor was 22. Am I reading this right? I finished all 25 floors on my very first try without much trouble.
How far did everyone else get and how many runs did you do? We could try collecting more data and make a more accurate human benchmark this way.
- Behavioural cloning baseline for the Research track: Research track baseline
- Behavioural cloning baseline for the Intro track: BC lumberjack plus script
- Fully scripted baseline for the Intro track: Meet Bulldozer the lumberjack
- Testing the MineRL environment: Test the environment by running a fixed sequence of actions in a fixed world