denamganai_kevin

Organization

University of York

Location

GB

Challenges Entered

Sample Efficient Reinforcement Learning in Minecraft

Latest submissions

No submissions made in this challenge.

Sample-efficient reinforcement learning in Minecraft

Latest submissions

failed 108049
failed 85717

Sample-efficient reinforcement learning in Minecraft

Latest submissions

No submissions made in this challenge.

A new benchmark for Artificial Intelligence (AI) research in Reinforcement Learning

Latest submissions

graded 5211
graded 5200
failed 5196

Unity Obstacle Tower Challenge

After *exactly* 2 hours of usage, ObstacleTowerEnv ends in SIGABRT, from gRPC

Almost 5 years ago

I have not found the root cause per se, but both hurdles, i.e. (1) creating environment instances within spawned/forked processes without raising UnityTimedOutException, and (2) using the environment instances without gRPC failing after exactly 2 hours, vanished once the following are set:

torch.multiprocessing.set_start_method('forkserver')
torch.multiprocessing.set_sharing_strategy('file_system')

Sources:

  1. https://pytorch.org/docs/master/multiprocessing.html#multiprocessing-cuda-sharing-details
  2. https://github.com/pytorch/pytorch/issues/11201
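The two settings above have to run once, before any worker process is started. Here is a minimal sketch of where they might go. `configure_multiprocessing` is a hypothetical helper name, not something from the post, and the fallback to the standard library is only there so the sketch still runs on a machine without PyTorch:

```python
import multiprocessing


def configure_multiprocessing(start_method="forkserver",
                              sharing_strategy="file_system"):
    """Apply the two settings from the post and return the configured module.

    Hypothetical helper: falls back to the standard library if PyTorch
    is not installed (in which case there is no sharing strategy to set).
    """
    try:
        import torch.multiprocessing as mp
    except ImportError:
        mp = multiprocessing

    # force=True avoids a RuntimeError if a start method was already chosen.
    mp.set_start_method(start_method, force=True)

    # Only torch.multiprocessing exposes set_sharing_strategy.
    if hasattr(mp, "set_sharing_strategy"):
        mp.set_sharing_strategy(sharing_strategy)
    return mp
```

In a training script this would typically be called at the top of the `if __name__ == "__main__":` block, before any environment or process is created. Note that the `forkserver` start method is only available on Unix-like systems.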

Hopefully this will help more people than just me, so good luck to whoever is reading this 🙂!

After *exactly* 2 hours of usage, ObstacleTowerEnv ends in SIGABRT, from gRPC

Almost 5 years ago

Hello everybody!

Here is the error I get from a Python process that I spawn to handle one instance of ObstacleTowerEnv, among many others:

E0712 18:36:48.452364988   30043 ev_epoll1_linux.cc:1061]    assertion failed: next_worker->initialized_cv

I am training by harvesting multiple instances of ObstacleTowerEnv with multiple processes, each environment being spawned with a different worker_id (following a previous discussion: Running multiple instances).

Nevertheless, the issue occurs independently of the number of environments/processes spawned.

I have traced it back to gRPC, which is used for the client-server communication of each ObstacleTowerEnv instance.

Since it terminates my harvesting processes with a SIGABRT, I meant to simply terminate the process, close the environment instance, and then start a new process and a new environment instance (with another worker_id), but it seems that there is something I am still not grasping.
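For reference, the restart strategy described above could be sketched as follows. This is only an illustration of the idea, which the post reports was not enough on its own. The names `died_from_sigabrt` and `supervise` are hypothetical, and `target` is assumed to be a function that creates, uses, and closes one environment for the given worker_id:

```python
import multiprocessing as mp
import signal


def died_from_sigabrt(exitcode):
    """On POSIX, a child killed by signal N reports exitcode == -N."""
    return exitcode == -signal.SIGABRT


def supervise(target, first_worker_id, max_restarts=3, ctx=None):
    """Run target(worker_id) in a child process, restarting on SIGABRT.

    Each restart uses a new worker_id. Returns one (worker_id, exitcode)
    pair per attempt.
    """
    ctx = ctx or mp.get_context()
    history = []
    worker_id = first_worker_id
    for _ in range(max_restarts + 1):
        proc = ctx.Process(target=target, args=(worker_id,))
        proc.start()
        proc.join()
        history.append((worker_id, proc.exitcode))
        if not died_from_sigabrt(proc.exitcode):
            break  # clean exit or a different failure: do not restart
        # Assumption: the next id is not in use by another environment.
        worker_id += 1
    return history
```

With several environments running at once, the new worker_id would have to be drawn from a pool of unused ids rather than simply incremented.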

Since I cannot skirt the problem, I rely on your good advice to guide me in some better directions please!

I am training with PyTorch and using the following packages, on Python 3.6.8 and Ubuntu 16.04.6 LTS (Xenial Xerus). I also reproduced the error on Ubuntu 18.04.2 LTS (Bionic Beaver):

absl-py==0.7.1
astor==0.8.0
atari-py==0.2.3
atomicwrites==1.3.0
attrs==19.1.0
backcall==0.1.0
cloudpickle==1.2.1
cycler==0.10.0
decorator==4.4.0
dill==0.3.0
docopt==0.6.2
future==0.17.1
gast==0.2.2
google-pasta==0.1.7
grpcio==1.11.1
gym==0.13.1
gym-rock-paper-scissors==0.1
h5py==2.9.0
importlib-metadata==0.18
ipdb==0.12
ipython==7.6.1
ipython-genutils==0.2.0
jedi==0.14.0
Keras-Applications==1.0.8
Keras-Preprocessing==1.1.0
kiwisolver==1.1.0
Markdown==3.1.1
matplotlib==3.1.1
mlagents-envs==0.6.2
more-itertools==7.1.0
numpy==1.16.1
-e git+https://github.com/Unity-Technologies/obstacle-tower-env/@474fbf00564ae1357373b1e2d72dcb9af095540b#egg=obstacle_tower_env
opencv-python==4.1.0.25
packaging==19.0
pandas==0.24.2
parso==0.5.0
pexpect==4.7.0
pickleshare==0.7.5
Pillow==5.4.1
pluggy==0.12.0
prompt-toolkit==2.0.9
protobuf==3.6.1
ptyprocess==0.6.0
py==1.8.0
pyglet==1.3.2
Pygments==2.4.2
PyOpenGL==3.1.0
pyparsing==2.4.0
pytest==3.10.1
python-dateutil==2.8.0
pytz==2019.1
PyYAML==5.1.1
-e git+https://github.com/Danielhp95/Generalized-RL-Self-Play-Framework/@bd872b3b547a008fe126a3584b83448157f5ee3d#egg=regym
scipy==1.3.0
seaborn==0.9.0
six==1.12.0
tensorboard==1.12.0
tensorboardX==1.8
tensorflow==1.12.0
tensorflow-estimator==1.14.0
termcolor==1.1.0
torch==1.1.0
torchvision==0.3.0
tqdm==4.32.2
traitlets==4.3.2
wcwidth==0.1.7
Werkzeug==0.15.4
wrapt==1.11.2
zipp==0.5.2

EDIT:

  1. I am realizing that it might be important to mention the following, with regard to the spawning of the processes and the creation of the environment instances: I call the ObstacleTowerEnv() constructor in the main process (many times), and then pass each instance as an argument to a new process that communicates with the main process via Queues.
    If I create the environment inside the spawned process, I end up with UnityTimedOutException…

  2. I am using PyTorch, which implements its own flavour of multiprocessing, and I am using that as well. At some point, I assumed that it was colliding with ObstacleTowerEnv’s own multiprocessing needs, but my inquiries were not fruitful…
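The layout from point 1, where each worker owns one environment and talks to the main process through queues, could look roughly like the sketch below. The command names (`"reset"`, `"step"`, `"close"`) and the `worker_loop` function are assumptions made for illustration, and `CountingEnv` is a stand-in with a Gym-style interface, not the ObstacleTowerEnv API. The demo uses in-process `queue.Queue` objects so that it runs without spawning anything:

```python
import queue


def worker_loop(env, commands, results):
    """Serve ("reset", None), ("step", action) and ("close", None) requests."""
    while True:
        name, payload = commands.get()
        if name == "reset":
            results.put(env.reset())
        elif name == "step":
            results.put(env.step(payload))
        elif name == "close":
            env.close()
            results.put(None)  # acknowledge shutdown
            break
        else:
            raise ValueError(f"unknown command: {name!r}")


class CountingEnv:
    """Stand-in environment (hypothetical): counts steps, ends after 3."""

    def __init__(self):
        self.t = 0
        self.closed = False

    def reset(self):
        self.t = 0
        return self.t

    def step(self, action):
        self.t += 1
        done = self.t >= 3
        return self.t, float(action), done, {}

    def close(self):
        self.closed = True


# In-process demo: queue the requests first, then let the loop drain them.
commands, results = queue.Queue(), queue.Queue()
for request in [("reset", None), ("step", 1), ("close", None)]:
    commands.put(request)
worker_loop(CountingEnv(), commands, results)
```

In the multi-process setting, the two queues would be `multiprocessing.Queue` objects (or their `torch.multiprocessing` equivalents) and `worker_loop` would be the `target` of each process.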
