
jyotish

Name: Jyotish
Organization: AIcrowd
Location: Guntur, IN


Challenges Entered

NeurIPS 2020: Procgen Competition
Measure sample efficiency and generalization in reinforcement learning using procedurally generated environments
Latest submissions: graded #74221, graded #69619, failed #69617

Recognise Handwritten Digits
Latest submissions: graded #60279, graded #60268

Crowdsourced Map Land Cover Prediction
Latest submissions: graded #60315, graded #60314

Predict Labor Class
Latest submissions: failed #71051, failed #71041

Real Time Mask Detection
Latest submissions: graded #67702, graded #67701, graded #67600
Gold 0 · Silver 0 · Bronze 2

Trustable (May 16, 2020)
Newtonian (May 16, 2020)

Badges

  • Has filled their profile page (May 16, 2020)
  • Kudos! You've won a bronze badge in this challenge. Keep up the great work! (Challenge: droneRL, May 16, 2020)
Participant Rating
BhaviD 0
vrv

NeurIPS 2020: Procgen Competition

Submission issues

About 3 hours ago

Hello @dipam_chakraborty

The GPU memory usage is not machine dependent, but it does vary a lot based on the version of tensorflow/pytorch used.

local runs are taking up only 14.3 GB (including the evaluation worker)

In that case, can you try replicating your local software stack during the evaluations (pip freeze / conda export)? Please let me know if I can help you with that.

Submission issues

About 16 hours ago

Hello @dipam_chakraborty

Every element shown on the graph is from a different node. Ideally, you would have 8 elements on the graph: 4 for training and 4 for rollouts. Sometimes you end up with more elements because we use preemptible nodes, and every new node also shows up there. For failed submissions, you might sometimes see fewer than 8 elements because the jobs exited before the metrics could be scraped.

Submission issues

About 18 hours ago

Hello @dipam_chakraborty

We re-queued the submission for evaluation.

Rllib custom env

About 18 hours ago

Hello @dipam_chakraborty

Thanks for sharing your thoughts. This is the reason why we want the participants to use ProcgenEnvWrapper. The evaluation env_config shared in the other post will be forced on the base env used in your wrapper, so it is not possible to override the config that we set by using a wrapper.

Rllib custom env

3 days ago

Hello @tim_whitaker @anton_makiievskyi

We are passing a rollout flag in env_config. During the training phase it is set to rollout = False, and during the rollouts it is set to rollout = True.
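
A minimal sketch of how a wrapper could read that flag (the wrapper name and usage here are illustrative, not part of the starter kit):

import gym

class MyWrapper(gym.Wrapper):
    def __init__(self, env, config):
        super().__init__(env)
        # The grader injects `rollout` into env_config:
        # False during the training phase, True during the rollouts.
        self.is_rollout = config.get("rollout", False)

    def reset(self, **kwargs):
        if self.is_rollout:
            # Disable any training-only behaviour here, for example.
            pass
        return self.env.reset(**kwargs)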

Different training steps shown in gitlab

5 days ago

Hello @kaixin

This looks like a bug in the plotting on the issue page. The complete plots and the corresponding logs (for 8M steps) are available on the submission dashboard. You can access the dashboard by following this link.

Custom Preprocessors

6 days ago

Hello @smileroom

You can place the code for your preprocessor at https://github.com/AIcrowd/neurips2020-procgen-starter-kit/blob/master/preprocessors/ and add the newly added custom preprocessor to the list of available preprocessors at https://github.com/AIcrowd/neurips2020-procgen-starter-kit/blob/master/preprocessors/__init__.py.

Then you should add it to the experiment YAML file, for example,
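
A minimal sketch of what that could look like (the experiment and preprocessor names are placeholders; custom_preprocessor is the rllib model-config key, but do check the starter kit's example experiment files):

procgen-my-experiment:
  env: "procgen_env_wrapper"
  run: "PPO"
  config:
    model:
      custom_preprocessor: my_custom_preprocessor  # the name registered in preprocessors/__init__.py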

I hope the last part is what you were looking for.

Change in daily submission limits for round 1

9 days ago

Hello,

We are decreasing the daily submission quota to 2 submissions per participant due to reduced cluster capacity.

How to install external libraries?

10 days ago

Hello @dipam_chakraborty

I don't think this was a network issue. With the network bug, the only possibilities are that the submission fails before starting or gets marked as failed after passing all the stages. I'm posting the relevant logs on the issue page.

Customized trainer

10 days ago

Hello @kaixin

Did you have a look at https://github.com/AIcrowd/neurips2020-procgen-starter-kit/tree/master/algorithms/random_policy?

If you are looking for an even lower-level implementation, this should be a good starting point: https://github.com/AIcrowd/neurips2020-procgen-starter-kit/blob/master/algorithms/custom_random_agent/custom_random_agent.py. If you plan to follow this one, please make sure that you are able to train your agent locally with the configuration provided in FAQ: Round 1 evaluations configuration.

Question about 'num_outputs' in models

10 days ago

Hello @khrho_af

num_outputs comes from the environment's action space. It is generally equal to the dimension of the action space.
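
For example, in an rllib custom model the value is handed to you in the constructor (a minimal sketch; the class name is a placeholder):

from ray.rllib.models.tf.tf_modelv2 import TFModelV2

class MyVisionNet(TFModelV2):
    def __init__(self, obs_space, action_space, num_outputs, model_config, name):
        super().__init__(obs_space, action_space, num_outputs, model_config, name)
        # For procgen's Discrete(15) action space, num_outputs == 15,
        # i.e. one output logit per discrete action.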

Please share your pain points so that we can improve the starter kit. Please feel free to post as many questions as you want. I’m sure that the community will be glad to help! :smiley: This will also help others who might be searching for answers to similar questions.

How to install external libraries?

10 days ago

Hello @dipam_chakraborty

I believe this is about submission #75240 and not #74986. We are still facing a few network issues on our end and the evaluation didn't start. However, we are re-queueing such failed submissions and working on fixing the root cause.

Sorry for the inconvenience.

I have question about max reward of env

11 days ago

Hello @minbeom

The max reward that is returned during the training is not the max reward of the environment. For every iteration we collect the mean, min and max reward values of the episodes finished during that iteration. For example, if a single iteration has 4 episodes and the rewards array for that iteration is, say, [0, 6, 20, 10], then

mean_reward = 9
min_reward = 0
max_reward = 20

Please note that these min, mean and max rewards are for the episodes of an iteration. The graph you shared plots these values for every iteration.
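
A quick standalone check of those per-iteration statistics (not competition code):

import numpy as np

episode_rewards = [0, 6, 20, 10]   # episodes finished in one iteration
print(np.mean(episode_rewards))    # 9.0 -> mean_reward
print(np.min(episode_rewards))     # 0   -> min_reward
print(np.max(episode_rewards))     # 20  -> max_reward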

AWS instance setup

11 days ago

Adding to @shivam’s response, we use the following values during the evaluation

RAY_MEMORY_LIMIT: "60129542144"
RAY_CPUS: "8"
RAY_STORE_MEMORY: "30000000000"
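
If you want to mirror these limits locally, one way (a sketch only, not the grader's code) is to read the same environment variables before calling ray.init:

import os

num_cpus = int(os.environ.get("RAY_CPUS", "8"))
object_store_memory = int(os.environ.get("RAY_STORE_MEMORY", "30000000000"))
# e.g. ray.init(num_cpus=num_cpus, object_store_memory=object_store_memory)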

Submission failed after training and rollouts completed

12 days ago

Hello @dipam_chakraborty

Apologies for this. We are having some network issues on the evaluation cluster, and hence a few submissions are getting marked as failed in spite of succeeding. Submission #74986 is one of them. We marked such submissions as succeeded and are working to fix the root cause.

FAQ: Round 1 evaluations configuration

12 days ago

Hello @xiaocheng_tang

Yes, you are free to pass additional parameters/flags in env_config. The only requirement is that the base env used by your gym wrapper should be ProcgenEnvWrapper provided in the starter kit.

The discussion from Rllib custom env might be useful to clear things up.

Getting Rmax from environment

13 days ago

Hello @jurgisp

I do not think there is a way to get the max reward value from the gym environments. [refer]

The Rmin and Rmax for the publicly available environments are available at

For the private env used for round 1 (caterpillar), the min and max rewards are [R_min, R_max] = [8.25, 24].

Please note that Rmin is the score for an agent that has no access to the observations and not the minimum possible score.

Round 1 is open for submissions 🚀

14 days ago

Hello @jurgisp

This was a config issue on our end. The issue is fixed and we re-queued your submission.

The error of have not qualified for this round.?

14 days ago

Hello all,

The issue is fixed, and you should be able to make submissions now. We requeued the failed submissions. In case you made multiple submissions to get the submission to work and want to cancel a submission (so that they don’t count towards the submission quota), please share the submission IDs that you want us to cancel.

The error of have not qualified for this round.?

14 days ago

Hello @RDL_lms

We do not have any minimum requirements to participate in round 1. This seems to have happened due to a config issue on our end. Can you try submitting again?

Round 1 is open for submissions 🚀

14 days ago

Hello @Paseul

You can put any value in env_name; it will be overridden by the relevant value during evaluation.

Round 1 is open for submissions 🚀

14 days ago

Hello all!

Thank you for your participation and enthusiasm during the warm-up round! We are accepting submissions for round 1.

Changes for Round 1

Hardware available for evaluations

The evaluations will run on the following hardware:

  • vCPUs: 8
  • RAM: 56 GB
  • GPU: 16 GB Tesla V100

Evaluation configuration

The configuration used during evaluations is available at FAQ: Round 1 evaluations configuration.

Environments

This round will run on three public (coinrun, bigfish, miner) and one private environment.

Scoring

The final score will be a weighted average of the mean normalized rewards for the public environments and the private environment:

Score = \frac{1}{6}*R_{coinrun} + \frac{1}{6}*R_{bigfish} + \frac{1}{6}*R_{miner} + \frac{1}{2}*R_{privateEnv}

R_{env} = Mean normalized reward for env.
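
The same rule written as a trivial helper (each argument is the mean normalized reward for that environment):

def round1_score(r_coinrun, r_bigfish, r_miner, r_private_env):
    # Public envs carry 1/6 weight each; the private env carries 1/2.
    return (r_coinrun + r_bigfish + r_miner) / 6.0 + r_private_env / 2.0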

Prizes

We are super excited to share that AWS is the official sponsor of this competition. Apart from sponsoring all the compute for this competition, AWS has generously extended the following prizes:

  • $10,000 in AWS credits for the TOP 50 participants of the warm-up round (you will receive an email shortly!)
  • The TOP 3 teams of the final round will each get $1000 in cash and $3000 in AWS credits!

For the next round

Because significant computation is required to train and evaluate agents in the final round, only the top 50 submissions from Round 1 will be eligible to submit solutions for Round 2.

FAQ: Round 1 evaluations configuration

19 days ago

Hello @jurgisp

The above configuration will be used during both training and rollouts.

Help..submission failed --- "Unknown grader_id"

23 days ago

Hello @dynmi

Nothing is wrong with your submission. The warm-up round is closed for new submissions and hence the error. We will send out an email to the participants when the next round starts, and you should be able to submit then.

Same marks on the testing video

24 days ago

Hello @victor_le

What values are allowed to be modified? The values listed in env_config in impala-baseline.yaml ?

We will set all the env config params (except rand_seed) to the default values during evaluations, i.e. none of them is supposed to be changed by the participants.

Is the grader open-source?

We can't open-source the grader at the moment, but you should be able to replicate the evaluation setup with the values mentioned in FAQ: Round 1 evaluations configuration.

FAQ: Round 1 evaluations configuration

24 days ago

The following values will be set during the evaluations. Any changes that you make to these parameters will be dropped and replaced with the default values during the evaluations.

stop:
  timesteps_total: 8000000
  time_total_s: 7200

checkpoint_freq: 25
checkpoint_at_end: True

env_config:
  env_name: <accordingly>
  num_levels: 0
  start_level: 0
  paint_vel_info: False
  use_generated_assets: False
  distribution_mode: easy
  center_agent: True
  use_sequential_levels: False
  use_backgrounds: True
  restrict_themes: False
  use_monochrome_assets: False

# We use this to generate the videos during training
evaluation_interval: 25
evaluation_num_workers: 1
evaluation_num_episodes: 3
evaluation_config:
  num_envs_per_worker: 1
  env_config:
    render_mode: rgb_array

During the rollouts, we will also pass a rand_seed to the procgen env.

Warmup round extension

25 days ago

Hello @maraoz @edan_meyer

Apologies for this. The compute sponsorship has been approved, and procurement is still in progress. We are working on it and will launch the next round as soon as procurement is complete.

Once the round is launched, we will send out an email to all the participants of the competition.

Resource limits for submissions

30 days ago

Hello @maraoz

The evaluations right now are running with 16 vCPUs, 56 GB RAM.

FAQ: Debugging the submissions

30 days ago

Hello @maraoz

Yes, it should be num_gpus + (num_workers+1)*num_gpus_per_worker <= 1. Thanks for pointing it out. Updated the post with the right variable name :smiley:

Rllib custom env

30 days ago

Hello @Mckiev

I’m not sure if I understood that right. We will use the same env for training and rollouts. The requirements from our side are

  • The base env you use should be the env returned by ProcgenEnvWrapper rather than the one you get from gym.make.
  • The wrapper that you use should extend gym.Wrapper class (in case you are writing one on your own).

Right way to use custom wrappers:

# `registry` is ray.tune.registry; MyWrapper is your own gym.Wrapper subclass.
from ray.tune import registry
from envs.procgen_env_wrapper import ProcgenEnvWrapper  # provided in the starter kit

registry.register_env(
    "my_custom_env",
    lambda config: MyWrapper(ProcgenEnvWrapper(config))
)

Wrong way to use custom wrappers:

registry.register_env(
    "my_custom_env",
    # This bypasses ProcgenEnvWrapper, so the rollouts will fail during evaluation.
    lambda config: MyWrapper(gym.make("procgen:procgen-coinrun-v0", **config))
)

During the evaluation (both training and rollouts), we will use the env with your custom wrapper (if any).

If you have a more complex use case (like you need to pass some custom env variables but they should not be passed to the base env),

def create_my_custom_env(config):
    # Pop custom keys first so they are not forwarded to the base procgen env.
    my_var = config.pop("my_var")
    env = ProcgenEnvWrapper(config)
    env = MyWrapper(env, my_var)
    return env

registry.register_env(
    "my_custom_env", create_my_custom_env
)

I hope this covers what you wanted to know.

Rllib custom env

About 1 month ago

Hello @bob_wei

Yes, the base env should be the ProcgenEnvWrapper provided in the starter kit. You can use any gym wrapper on top of this. If you use the env from gym.make instead of ProcgenEnvWrapper, the rollouts will fail.

Rllib custom env

About 1 month ago

Hello @bob_wei @Mckiev

We added support for using wrappers. Please give it a try. https://github.com/AIcrowd/neurips2020-procgen-starter-kit/tree/master/envs

2 hours training time limit

About 1 month ago

Hello @tim_whitaker

The time_total_s: 7200 limit is already in place during evaluations. We also limit the number of parallel evaluations, based on the resources available, to avoid long waits between evaluations, and we have a hard timeout of 2 hours plus a buffer period to close stuck evaluations. We will check if there are any issues with the scheduling time and adjust the buffer time accordingly.

I believe you are referring to submission #70205. The GitLab updates are failing for that submission, but the evaluation is still in progress. We will check why the updates on the GitLab issue page stopped.

Same marks on the testing video

About 1 month ago

Hello @lars12llt @karolisram

Yes, these values are not supposed to be changed. We will override these values from the next grader update.

Impala Pytorch Baseline Bug(s)

About 1 month ago

Hello @mtrazzi

You need to set "docker_build": true in your aicrowd.json file. Without this, we will use the default image and there will not be any pip installs from your requirements.txt.

Is anyone experiencing the same warnings/errors?

About 1 month ago

Hello @mtrazzi

First of all, thanks for your effort in writing down these warnings/errors here. I’m sure that a lot of participants would have similar questions.

Dashboard crashes with error while attempting to bind on address ('::1', 8265, 0, 0): cannot assign requested address. I solved this by adding webui_host='127.0.0.1' in ray_init in train.py (cf. stackoverflow) on google colab, not sure I need to do the same for the aicrowd submission (which would mean touching train.py).

You can ignore the dashboard error. You see that error because the port on which the dashboard is trying to bind is not available for it.


ls: cannot access '/outputs/ray-results/procgen-ppo/*/' this seems to be in how variables are set in run.sh. Don’t know why they would want to access ray-results early on.

We run the evaluations on preemptible instances, which means that the node the evaluation is running on can shut down at any time. Before starting the training, we check for any existing checkpoints and resume from that point. I understand that this is causing some confusion. We will hide these outputs in the next grader update.


given NumPy array is not writeable ( solved by downgrading to torch 1.3.1 locally, but still unclear how to downgrade when submitting , cf. discussion)

You can update the requirements.txt or edit your Dockerfile accordingly. Please make sure that you set "docker_build": true in your aicrowd.json file. If this is not set to true, we will not trigger a docker build and will use a default image.


[Errno 2] No such file or directory: 'merged-videos/training.mp4' : seems to be on aicrowd side, but maybe we need to change how we log videos? see this example or this PR.

This is not related to the issue you were mentioning. We generate a few videos every few iterations and upload them at regular intervals during training. These videos are shown on the GitLab issue page as well as the submission dashboard. This error basically means that we tried uploading a video but there was no video. This typically happens when the throughput is very low or the rendering doesn't work. If you are able to see a video on the aicrowd website and the dashboard, rendering is not the problem. It looks like we missed some error handling here (though it will not affect your evaluation in any way). We will fix this in the next grader update.


WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 67108864 bytes available. This may slow down performance! ... you may need to pass an argument with the flag '--shm-size' to 'docker run'. : this seems to be on aicrowd server side, but maybe we need to change the docker file or do clever things?

Yes, docker allocates very little shared memory by default. Typically, I do not expect this to have a drastic impact on performance; in terms of throughput, we were getting similar results on a physical machine and on the evaluation pipeline. But if you want us to increase this, please reach out to us and we will be glad to look into it.

Impala Pytorch Baseline Bug(s)

About 1 month ago

Hello @bob_wei

Can you try using a pytorch 1.3.x version?

Rllib custom env

About 1 month ago

Hello @bob_wei

You can use a custom preprocessor for this.

Selecting seeds during training

About 1 month ago

Hello @Leckofunny

This should be possible. We do support callbacks, so approach 2 that you mentioned is definitely doable. For the first approach, did you try passing it as a custom algorithm? According to this line,

You should be able to pass the train function from approach 1 as a custom algorithm. Can you give it a try? If that doesn't work, you should be able to extend the existing PPOTrainer class as a custom algorithm.

For the re-init part, can you try running env.close() followed by env.__init__() with parameters from the current env? I'm not sure if this is really the right way; I'll get back to you in case I find a better solution.

Training Error?

About 1 month ago

Hello @Paseul1

As mentioned in this comment, there was a problem with the interpreter line (shebang) in your run.sh. If this line is not provided, the script is run with the default shell, which happens to be /bin/sh in the evaluation environment. It worked locally for you because your default shell is probably bash, zsh, or equivalent. Please let me know if you are referring to a different submission.
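
For reference, the fix is to make the first line of run.sh an explicit interpreter line (assuming bash is what you intend):

#!/bin/bash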

How to save rollout video / render?

About 1 month ago

Hello @mtrazzi @xiaocheng_tang

This example should work.

#!/usr/bin/env python

import gym
import gym.wrappers

env = gym.make("procgen:procgen-coinrun-v0", render_mode="rgb_array")
env.metadata["render.modes"] = ["human", "rgb_array"]
env = gym.wrappers.Monitor(env=env, directory="./videos", force=True)

episodes = 10
_ = env.reset()

done = False
while episodes > 0:
    _, _, done, _ = env.step(env.action_space.sample())
    if done:
        _ = env.reset()
        episodes -= 1

How to save rollout video / render?

About 1 month ago

Hello @xiaocheng_tang

Do you see any warning that says something like render() returned None when you use the Monitor wrapper?

How to save rollout video / render?

About 1 month ago

Hello @xiaocheng_tang

Can you try the same with gym3==0.3.2 and procgen==0.10.3?

How to save rollout video / render?

About 1 month ago

Hello @mtrazzi

For procgen==0.10.x, you need to pass render_mode="rgb_array" as a config option to the environment for the videos.

PS: Note that passing render_mode="rgb_array" will have a performance impact and is not recommended to be used during training.

How to save rollout video / render?

About 1 month ago

Hello @mtrazzi

Can you share the procgen version that you are using?

Don't receive "graded"

About 1 month ago

Hello @liziniu

Your submission #69444 is still in evaluation and is not stuck. There are a lot of submissions, and hence most of them are in the queue.

You can check the gitlab issue to see what stage the evaluation is in. If you see something like "Preparing the cluster for you :rocket:", it means that the submission is waiting for cluster resources to become available. At the moment we can evaluate 4 submissions in parallel. You can also check the last-updated timestamp on the gitlab issue.

FAQ: Regarding rllib based approach for submissions

About 1 month ago

Hello @alexander_ermolov

We explained how the random agent code provided in the starter kit works in FAQ: Implementing a custom random agent

FAQ: Regarding rllib based approach for submissions

About 1 month ago

Hello @lars12llt

The CustomRandomAgent class is registered as custom/CustomRandomAgent. So you need to use custom/CustomRandomAgent or update its name.

FAQ: Implementing a custom random agent

About 1 month ago

In this post, we will try to demystify the random agent code provided in the starter kit. The most common question we heard from the participants was about the difference between the models and algorithms directories.

Our idea was to put all the models inside the models directory. These are the RL policy networks that you will build using convolutional layers, dense layers, and so on. The algorithms that govern the optimization of the RL policies go into the algorithms directory.

Implementing a custom random agent

Now, we will see what the code in the CustomRandomAgent class does [refer]. Our random agent does not learn anything. It returns random actions and collects rewards. First, we need to create the environment.

Now that we have the env ready, let’s randomly sample actions and run it till the episode finishes.

When training the agent, we want to run this in a loop rollouts_per_iteration times.

Now, let’s collect the rewards and return a dict containing training stats for a given iteration.
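
Since the code blocks themselves live in the starter kit, here is a compressed, illustrative sketch of the steps described above (this is not the actual CustomRandomAgent implementation; only rollouts_per_iteration is taken from this post, the rest is an assumption):

import gym
import numpy as np

def train_one_iteration(env, rollouts_per_iteration=10):
    episode_rewards = []
    for _ in range(rollouts_per_iteration):
        _ = env.reset()
        done, total_reward = False, 0.0
        while not done:
            # A random agent simply samples actions; a real agent would query its policy.
            _, reward, done, _ = env.step(env.action_space.sample())
            total_reward += reward
        episode_rewards.append(total_reward)
    # Training stats for this iteration, similar in spirit to rllib's results dict.
    return {
        "episode_reward_mean": float(np.mean(episode_rewards)),
        "episode_reward_min": float(np.min(episode_rewards)),
        "episode_reward_max": float(np.max(episode_rewards)),
        "episodes_this_iter": len(episode_rewards),
    }

env = gym.make("procgen:procgen-coinrun-v0")
print(train_one_iteration(env))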

That’s it! You can find the complete code for this agent at https://github.com/AIcrowd/neurips2020-procgen-starter-kit/blob/f8b943bffaf2c86a4c78043fcb0f1253ab1b42ba/algorithms/custom_random_agent/custom_random_agent.py

Now, how does rllib know that there is this custom agent that we want to use? We have a custom registry for this. First, list your python class as a custom algorithm here,

This will register the random agent class with the name custom/CustomRandomAgent. Now we need to add this to our experiments YAML file.

procgen-example:
  env: "procgen_env_wrapper"
  run: "custom/CustomRandomAgent"

So how does rllib know that it has to use these algorithms? We register all the custom algorithms and models in the train.py file!

Other resources on algorithms

  • Implement a custom loss function while leaving everything else as is.
  • https://docs.ray.io/en/master/rllib-concepts.html
  • More examples

FAQ: Debugging the submissions

About 1 month ago

How to know why my submission failed?

When things go south, we try our best to provide you with the most relevant message on the GitLab issue page. They look somewhat like these.

[Screenshots: evaluation logs with a timed-out error; evaluation logs with a "Training failed" message]

Well, "Training failed" is not of much use. No worries! We've got you covered. You can click on the Dashboard link on the issue page.

[Screenshot: the Dashboard link on the issue page]

Scroll a bit down. You should find a pane that displays the logs emitted by your training code.

Note: We do not provide the logs for rollouts on the dashboard to avoid data leaks. We will provide the relevant rollout logs if you tag us.

Common errors faced

Error says I’m requesting x/1.0 GPUs where x > 1

Make sure that num_gpus + (num_workers+1)*num_gpus_per_worker is always <= 1.
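
A quick way to sanity-check your experiment YAML values (plain Python arithmetic; plug in your own numbers):

num_gpus, num_workers, num_gpus_per_worker = 0.6, 6, 0.05
assert num_gpus + (num_workers + 1) * num_gpus_per_worker <= 1.0  # 0.95 here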

My submission often times out

A low throughput can be due to various reasons. Checking the following parameters is a good starting point.

Adjust the number of rollout workers and the number of gym environments each worker samples from. These values should be good initial values.

num_workers: 6
num_envs_per_worker: 20

Make sure your training worker uses GPU.

num_gpus: 0.6

Make sure that your rollout workers use a GPU.

num_gpus_per_worker: 0.05

Note: rllib does not allocate the specified amount of GPU memory to the workers. For example, having num_gpus: 0.5 does not mean that half of the GPU memory is allocated to the training process. These parameters are mainly useful when one has multiple GPUs: rllib uses them to figure out which worker goes on which GPU. Since the evaluations run on a single GPU, setting num_gpus and num_gpus_per_worker to a nominal non-zero positive value should suffice. For more information on the precise tuning of these parameters, refer.

Figuring out the right values for num_gpus and num_gpus_per_worker

You can run nvidia-smi on your machine when you start the training locally. It should report how much memory each of the workers is taking on your GPU. You can expect them to take more or less the same amount of memory during the evaluation. For example, say you are using num_workers: 6 for training locally. The output of nvidia-smi should look similar to this.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.64.00    Driver Version: 440.64.00    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 208...  Off  | 00000000:03:00.0 Off |                  N/A |
| 43%   36C    P2   205W / 250W |   6648MiB / 11178MiB |     90%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     16802      C   ray::RolloutWorker.sample()                819MiB   |
|    0     16803      C   ray::PPO.train()                           5010MiB  |
|    0     16808      C   ray::RolloutWorker.sample()                819MiB   |
|    0     16811      C   ray::RolloutWorker.sample()                819MiB   |
|    0     16813      C   ray::RolloutWorker.sample()                819MiB   |
|    0     16831      C   ray::RolloutWorker.sample()                819MiB   |
|    0     16834      C   ray::RolloutWorker.sample()                819MiB   |
+-----------------------------------------------------------------------------+

From this output, I know that a single rollout worker takes around 819 MB and the trainer takes around 5010 MB of GPU memory. The evaluations run on a Tesla P100, which has 16 GB of memory. So, I would set num_workers to 12. The GPU usage during the evaluation should then roughly be

5010 MB + 819 MB *(12+1) = 15.3 GB
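
The same estimate as a back-of-the-envelope check (same dummy numbers as above):

trainer_mb, worker_mb, num_workers = 5010, 819, 12
total_gb = (trainer_mb + worker_mb * (num_workers + 1)) / 1024
print(round(total_gb, 1))  # ~15.3, which fits on a 16 GB card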

Note: The above values are dummy values. Please do not use these values when making a submission.

Run your code locally to avoid wasting your submission quota

We assume that you have made the necessary changes in run.sh.

Make sure that your training phase runs fine

./run.sh --train

Make sure that the rollouts work.

./run.sh --rollout

If you are using "docker_build": true without modifying the Dockerfile, just to install python packages from requirements.txt:

  • Create a virtual environment using conda / virtualenv / python3 -m venv
  • Activate your new environment.
  • Run pip install -r requirements.txt.
  • Run ./run.sh --train.
  • Run ./run.sh --rollout.

In case you are using a completely new docker image, please build on top of the Dockerfile provided in the starter kit. You are free to choose a different base image; however, you need to make sure that all the packages that we were initially installing are still available. To avoid failures in the docker build step, we recommend that you try running the build before making a submission.

docker build .

This might take quite a while the first time you run it. But it will be blazing fast from the next time!

Not able to figure out what went wrong? Just tag @jyotish / @shivam on your issues page. We will help you! :smiley:

How can I rewrite my local algorithm as a custom algotithm in submission?

About 1 month ago

Hello @lars12llt

Welcome to the AIcrowd community! :smiley:

1. I believe your question is how you can assign a GPU for the evaluation. All the evaluations will be run on a Tesla P100 (16 GB).

2. Ideally, we expect the model/network related code to stay in the models directory, while the training-algorithm related code (like custom policy functions, etc.) goes into the algorithms directory. For the wrapper part, can you have a look at the custom preprocessors in rllib? https://docs.ray.io/en/master/rllib-models.html#custom-preprocessors

For the parallel sampling, you don't need to do it yourself; rllib does it for you. The num_workers parameter in the experiment config file (.yaml file) controls the number of rollout workers that are spawned. You can set the number of envs each worker runs using num_envs_per_worker.

3. Yes, we expect you to use the config file to set hyper-parameters. For why we want you to do it that way, please refer to FAQ: Regarding rllib based approach for submissions.

Some references that might help:

Submission/Gitlab Issue -> Git Tag

About 2 months ago

Hello @laxatives

You can view the submission details by clicking the "View" button of the respective submission. The "Code" button takes you to the code that was submitted for a particular submission. You can view all the submissions you made at https://www.aicrowd.com/challenges/neurips-2020-procgen-competition/submissions?my_submissions=true

You can compare two versions of your repository by visiting https://gitlab.aicrowd.com/username/repository/compare. In your case, this will be https://gitlab.aicrowd.com/laxatives/neurips-2020-procgen-starter-kit/compare

Some question about vCPU, Help!

About 2 months ago

Hello @DRL_AGI

The evaluations happen on Google Cloud Platform. The evaluations run on an n1-standard-16 VM which has 16 vCPUs and 60 GB RAM [refer]. The vCPUs are based on Intel Skylake architecture and have a base clock speed of 2.0 GHz [refer].

GPU utilization

About 2 months ago

Hello @shogoakiyama

Can you try setting

num_workers: 6 # Number of rollout workers to run
num_envs_per_worker: 20 # Number of envs to run per rollout worker
num_gpus: 0.6 # Fraction of GPU used by trainer
num_gpus_per_worker: 0.05 # Fraction of GPU used by rollout worker

Please make sure that num_gpus + num_gpus_per_worker*num_workers <= 1. Setting num_gpus to 0.5 doesn't mean that half of the GPU memory is available to the trainer process; rllib doesn't allocate GPU memory but schedules the workers based on these values. They exist to make it easier to scale the training process to multiple GPUs. Since we use a single GPU during the evaluation, setting these values to some small non-zero value should suffice.

Unusually large tensor in the starter code

About 2 months ago

Hello @the_raven_chaser

The impala baseline provided in the starter kit takes close to 15.8 GB of GPU memory. As a starting point, you can try setting num_workers: 1 in the experiment YAML file and see if it works. You can also try running nvidia-smi to check how much memory is being utilized by the trainer and the rollout worker. Based on that, you can try increasing num_workers to a higher number.

Multi-Task Challenge?

About 2 months ago

Hello @Leckofunny

Yes, the agent will be trained on unknown environments as well. The training and rollouts are done on the server-side. You only need to submit the code. You will be given access to the training logs once you make a submission.

High average reward but low score

About 2 months ago

Hello @quang_tran

Something went wrong with the rollouts for two of your submissions. We are looking into it.

We are using these for R_min and R_max values

The rollouts will be run for 1000 episodes with at most 1000 steps per episode. You can check the configuration used during rollouts here, Several questions about the competition

Evaluation failed for custom environment

About 2 months ago

Hello @justin_yuan

The Dockerfile seems to be missing for this particular submission and hence the error. You can check the code that was submitted by clicking the “Code” button for the corresponding submission on https://www.aicrowd.com/challenges/neurips-2020-procgen-competition/submissions?my_submissions=true

Several questions about the competition

About 2 months ago

Hello @tim_whitaker

Yes, any changes to train.py, rollout.py and envs/procgen_env_wrapper.py will be dropped during the evaluation. The custom algorithms (trainer-class related ones) and the custom models will be used during the rollouts as well. If the rollouts work as expected locally without altering the provided rollout.py, they should work during evaluations too.

What are the evaluation environments?

About 2 months ago

Hello @quang_tran

The warm-up round will only use the coinrun environment.

Number of levels in round 1?

About 2 months ago

Hello @quang_tran

This is probably what you were looking for? Several questions about the competition

Time out Error, Help!

About 2 months ago

Hello @mj1

Can you recheck the parameters that you were using? You can check your submissions here, https://www.aicrowd.com/challenges/neurips-2020-procgen-competition/submissions?my_submissions=true

You can click on the “code” button to view the state of the code at the time of your submissions.

Insufficient cluster resources to launch trial

About 2 months ago

Hello @CireNeikual

For why we want to use rllib, please refer FAQ: Regarding rllib based approach for submissions

Updating the requirements.txt won’t be enough. You also need to set "docker_build": true in your aicrowd.json file. More on using custom images can be found here, https://github.com/aicrowd/neurips2020-procgen-starter-kit#submission-environment-configuration

Please make sure to include the mlflow pip package in your custom image. We can’t post the evaluation updates on the gitlab issues page without this.

Change experiement yaml file

About 2 months ago

Hello @comet

You can use your own wrapper for the training phase, but we will use the procgen_env_wrapper provided in the starter kit for the rollouts.

You can use your own experiment yaml file. You need to make sure to make the necessary changes in run.sh (https://github.com/AIcrowd/neurips2020-procgen-starter-kit/blob/master/run.sh#L8) as well.

Help..how to deal with Submission failed : iiid?

About 2 months ago

Hello @alexander_ermolov

A few submissions failed yesterday due to a bug. We have queued the affected submissions for re-evaluation.

Submission available resources

About 2 months ago

Hello @iamhatesz

The evaluations are now running with 16 vCPUs. :smiley:

Impala Pytorch Baseline Bug(s)

About 2 months ago

Hello @gregory_eales,

The “fails but then continues” is because we re-evaluated your submission as it failed the first time due to an internal glitch.

Following is the throughput vs ray worker configuration on an evaluation node (with 1 P100 and 8 vCPUs) for the impala baseline (tensorflow version):

throughput     workers  envs_per_worker  cpus_per_worker
 757.7564057   2        2                1
 923.4482437   4        2                1
 993.133285    6        2                1
1006.230928    5        2                1
1107.859696    7        2                1
1109.078469    2        4                1
1362.100739    4        4                1
1409.114958    5        4                1
1457.701511    6        4                1
1460.446554    2        8                1
1534.667546    7        4                1
1613.769406    2        12               1
1732.683079    4        8                1
1735.013415    5        8                1
1756.762717    6        8                1
1803.119381    2        20               1
1811.492029    7        8                1
1824.598827    5        12               1
1827.744181    4        12               1
1831.147102    2        16               1
2035.535199    4        16               1
2106.670996    4        20               1
2108.46658     5        16               1
2128.366856    6        12               1
2206.309038    6        16               1
2218.835295    7        12               1
2224.173316    7        16               1
2243.448792    5        20               1
2291.233425    6        20               1
2329.457026    7        20               1

Help..how to deal with Submission failed : iiid?

About 2 months ago

Hello @breezeyuner, @laxatives

Thanks for reporting this. This was a bug on our end. We are restarting the evaluations for the affected submissions.

Several questions about the competition

2 months ago

Hello @the_raven_chaser

You can change the environment configuration as you like, and we will use the same configuration during the training phase. During the rollouts, we will force the following configuration:

{
    "num_levels": 0,
    "start_level": 0,
    "use_sequential_levels": false,
    "distribution_mode": "easy",
    "use_generated_assets": false
}

As long as your code uses rllib, things should work. You can add more environment variables or arguments to the python train.py line in run.sh, but we will still use the train.py wrapper that we provided to trigger the training. So any changes you make to train.py will be dropped.

This discussion should give you more context on why we want to enforce the use of a framework, FAQ: Regarding rllib based approach for submissions

All the best for the competition! :smiley:

Submission available resources

2 months ago

Hello @iamhatesz

Yes, we plan to extend it to 16 vCPUs in a few days.

Submission available resources

2 months ago

Hello @iamhatesz

At the moment, the evaluations are running on an 8 vCPU, 1 GPU (P100) node. Of the 8 vCPUs, one is reserved for the evaluation worker.

TypeError: cannot pickle 'property' object

2 months ago

Hello @gregory_eales, can you downgrade to python 3.7/3.6 and try again? There is a known issue with cloudpickle and python 3.8.

Evaluation Limits

5 days ago

Hello @tim_whitaker

We saw the post and are discussing this with the organizers.

Problem with rollouts on docker build

About 2 months ago

Hello @tim_whitaker

Is it possible to share a few details like

  • How long did the rollouts take locally to complete?
  • What hardware were the rollouts run on?
  • For how many episodes and with what max_steps were the rollouts run?

If you feel that this is something happening only during the evaluation, we can try to run your submission manually and see where things are going wrong.

Sorry for the late response.

RE: Several questions about the competition

About 2 months ago

Hello @tim_whitaker

Yes, any changes to train.py, rollout.py and envs/procgen_env_wrapper.py will be dropped during the evaluation.

MASKD

Getting INTERNAL SERVER ERROR

16 days ago

Hello,

We are re-evaluating the failed submissions. They should be updated in a while.

Novartis DSAI Challenge

Submitting Solution: Push Error and No code with current Tag

8 months ago

@devsahu, Can you try running the following in your repository’s directory and let us know if it works? In case it fails again, can you post the error that you get?

git add --all
git commit -m "<Your commit message>"
./submission.sh <solution tag>

Submitting Solution: Push Error and No code with current Tag

8 months ago

Hi @devsahu,

Sorry for the confusion. Can you run git log and git status commands while inside your repository’s directory and post their outputs?

Submitting Solution: Push Error and No code with current Tag

8 months ago

Hi @devsahu

Can you post the outputs of git log and git status?
