AIcrowd | REAL 2020 - Robot open-Ended Autonomous Learning

Round 1: Completed

Round 2: Completed

GOAL-Robots


🏆 Round 2 has started! 🚀 Starter Kit with instructions on how to submit

07 Nov 2020: Rules updated for Round 2. Round 2 starts on November 16th!

28 Oct 2020: REAL 2020 has been presented @ ICDL-2020.

Introduction

Robots that learn to interact with the environment autonomously.

Abstract

Open-ended learning, also named ‘life-long learning’, ‘autonomous curriculum learning’ or ‘no-task learning’, aims to build learning machines and robots that are able to acquire skills and knowledge in an incremental fashion. The REAL competition addresses open-ended learning with a focus on ‘Robot open-Ended Autonomous Learning’ (REAL), that is, on systems that: (a) acquire sensorimotor competence that allows them to interact with objects and physical environments; (b) learn in a fully autonomous way, i.e. with no human intervention, on the basis of mechanisms such as curiosity, intrinsic motivations, task-free reinforcement learning, self-generated goals, and any other mechanism that might support autonomous learning. The competition will have a two-phase structure where during a first ‘intrinsic phase’ the system will have a certain time to freely explore and learn in the environment, and then during an ‘extrinsic phase’ the quality of the autonomously acquired knowledge will be measured with tasks unknown at design time. The objective of REAL is to: (a) track the state of the art in robot open-ended autonomous learning; (b) foster research and the proposal of new solutions to the many problems posed by open-ended learning; (c) favour the development of benchmarks in the field.

Challenge

In this challenge, you will have to develop an algorithm to control a multi-link arm robot interacting with a table, a shelf and a few objects. The robot is supposed to interact with the environment and learn in an autonomous manner, i.e. no reward is provided from the environment to direct its learning. The robot has access to the state of its joint angles and to the output of a fixed camera seeing the table from above. By interacting with the environment, the robot should learn how to achieve different states of the environment: e.g. how to push objects around, how to bring them on top of the shelf and how to place them one on top of the other.

REAL environment

Evaluation

The evaluation of the algorithm is split into two phases: the intrinsic phase and the extrinsic phase.
- In the first phase, the algorithm will be able to interact with the environment, without being provided any reward. In this intrinsic phase, the algorithm is supposed to learn the dynamics of the environment and how to interact with it.
- In the second phase, the algorithm will be given a goal that it needs to achieve within a strict time limit. The goal will be provided to the robot as an image of the state of the environment it has to reach. This goal might require, for example, pushing an object to a certain position or moving one object on top of another.

How to do it

While the robot is given no reward from the environment, it is perfectly reasonable (and expected) that the algorithm controlling the robot will use some kind of “intrinsic” motivation derived from its interaction with the environment. Below, we list some of the approaches to this problem found in the current literature. On the other hand, it would be “easy” for a human who knows the environment (and the final tasks) as described on this page to develop a reward function tailored to this challenge, so that the robot specifically learns to grasp objects and move them around. This last approach is discouraged and is not eligible to win the competition (see the rules below). The spirit of the challenge is that the robot initially knows nothing about the environment or about what it will be asked to do, so the approach should be as general as possible.

Literature

This list gives a few examples of promising approaches found in the literature that can be adapted to address the challenge:

Carlos Florensa, David Held, Xinyang Geng, Pieter Abbeel. Automatic Goal Generation for Reinforcement Learning Agents.
Tianhe Yu, Gleb Shevchuk, Dorsa Sadigh, Chelsea Finn. Unsupervised Visuomotor Control through Distributional Planning Networks.
Deepak Pathak, Pulkit Agrawal, Alexei A. Efros, Trevor Darrell. Curiosity-driven Exploration by Self-supervised Prediction.
Ashvin Nair, Vitchyr Pong, Murtaza Dalal, Shikhar Bahl, Steven Lin, Sergey Levine. Visual Reinforcement Learning with Imagined Goals.
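
As a rough illustration of what an “intrinsic” signal can look like, the sketch below rewards the agent with the prediction error of a forward model, in the spirit of curiosity-driven exploration. It is a deliberately simplified assumption: a linear model on raw state vectors, not the method of any of the papers above, and the class name, dimensions and learning rate are hypothetical.

```python
import numpy as np


class ForwardModelCuriosity:
    """Linear forward model; the intrinsic reward is its squared prediction error."""

    def __init__(self, state_dim, action_dim, lr=0.5, seed=0):
        rng = np.random.default_rng(seed)
        # Weights mapping [state, action] to the predicted next state.
        self.W = rng.normal(scale=0.01, size=(state_dim, state_dim + action_dim))
        self.lr = lr

    def reward_and_update(self, state, action, next_state):
        x = np.concatenate([state, action])
        error = self.W @ x - next_state
        reward = 0.5 * float(error @ error)
        # Gradient of 0.5 * ||W x - s'||^2 with respect to W is outer(error, x).
        self.W -= self.lr * np.outer(error, x)
        return reward


# Toy transition (hypothetical values): repeating it makes it less "surprising".
curiosity = ForwardModelCuriosity(state_dim=2, action_dim=1)
state = np.array([0.5, -0.2])
action = np.array([0.1])
next_state = np.array([0.4, -0.1])
rewards = [curiosity.reward_and_update(state, action, next_state) for _ in range(50)]
```

Transitions the model already predicts well yield little reward, so an agent maximising this signal is pushed towards parts of the environment it has not yet learned. Note that the signal uses no information about the extrinsic goals or their scoring.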

The necessary software to participate is available on GitHub:

Real-robots Gym Environment: https://github.com/AIcrowd/real_robots
Starter Kit with baseline: https://github.com/AIcrowd/REAL2020_starter_kit

The Starter Kit includes the code of a baseline agent for Round 1 that participants can use and modify to make their submissions.

Rules

The rules of the competition will be as follows:

Overview The competition focuses on autonomous open-ended learning with a simulated robot. The setup features a simulated robot that in a first intrinsic phase interacts autonomously with a partially unknown environment and learns how to interact with it, and then in a second extrinsic phase has to solve a number of tasks on the basis of the knowledge acquired in the first phase. Importantly, in the intrinsic phase the system does not know the tasks it will have to solve in the extrinsic phase.
Simulator For this purpose, the competitors will be given a software kit with which they will be able to install the simulator of the robot and environment on their machines (see below).
Robot The robot will be formed by a seven degrees of freedom Kuka arm; a two degrees of freedom gripper; a top view camera.
Environment The environment used will be a simplified kitchen-like scenario formed by: a table with a shelf; one cube and two kitchen objects.
Training and testing phases Both during the development on the participant’s machines, and during their evaluation on the AIcrowd platform, the competitor systems will have to undergo two phases: an intrinsic phase of training and an extrinsic phase of testing. During the submission of the system, the participant can decide how many objects (1, 2, or 3) it will manage during the two phases, and this will determine an important dimension of the difficulty of the challenge. Choosing 1 or 2 objects will make the challenge easier but will cap the achievable performance at 1/3 or 2/3 of the maximum, respectively.
Intrinsic phase During the intrinsic phase, the robot will have to autonomously interact with an environment for a certain period of time during which it should acquire as much knowledge and skills as possible, to best solve the tasks in the extrinsic phase. Importantly, during the intrinsic phase the robot will not be aware of the tasks it will have to solve in the extrinsic phase.
Extrinsic phase During the extrinsic phase the system will be tested for the quality of the knowledge acquired during the intrinsic phase. The robot will have to solve a number of goals: each goal will involve a different configuration of 1 to 3 objects in the environment that the robot has to recreate starting from a different configuration.
Goal types. Goals will be drawn from the following classes of possible problems defined on the basis of the nature of the goal to accomplish:
(1) 2D goal type: overall goal defined in terms of the configuration of 1 to 3 objects on the table plane, never close to each other and with a fixed orientation;
(2) 2.5D goal type: overall goal defined in terms of the configuration of 1 to 3 objects set on the table plane and on the shelf, never close to each other and with a fixed orientation;
(3) 3D goal type: overall goal defined in terms of 1 to 3 objects set on the table plane and on the shelf, with any orientation and no minimum distance.
Each goal will be tested with a different starting configuration, which follows the same criteria as the goal. All objects will have to be moved from the starting configuration to reach the goal.
Learning time budget The time available for learning in the intrinsic phase is limited to 15 million time steps. Learning in the extrinsic phase will be possible but its utility will be strongly limited by the short time available to solve each task, consisting of 10 thousand time steps for solving each goal.
Computational limits All submissions are expected to be able to run the intrinsic phase and extrinsic phase within a certain time limit on the evaluation machines. Current limits are set to 6h for the extrinsic phase and 72h for the intrinsic phase on an 8 CPU, 64 GB RAM, Nvidia V100 16GB virtual machine. Limits will be announced before each Round starts.
Score The performance in the extrinsic phase for an overall goal \(g\) will be scored according to the following metric \(M_g\):
\( M_g = \sum_{o=1}^n \left[ e^{-c \lVert \mathbf{p}^*_o - \mathbf{p}_o \rVert} \right] \)
where \(n\) is the number of objects (1, 2, or 3), \(\mathbf{p}^*_o\) is the (x, y, z) position vector of the mass center of object \(o\) in the target goal, \(\mathbf{p}_o\) is the position of the object at the end of the task after the robot attempts to bring it to the goal position, and \(c\) is a constant ensuring that the per-object term is 0.25 if the distance to the goal position is 0.10 (10 cm). Note that the metric ranges in (0, 1] for each object: it is equal to 1.0 if the object is exactly at the goal position and decays exponentially with increasing distance from it. Placing all 3 objects exactly in the overall goal configuration therefore yields the maximum score of 3.0. The total Score \(M\) of a certain system will be the average of its scores across all goals:
\(M = \frac{1}{G} \sum_{g=1}^G M_g\)
where \(G\) is the number of all goals.
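
The score definition above can be sketched in a few lines of Python. This is an illustrative reading of the formulas, not the official scoring code: the function names are hypothetical, positions are assumed to be in meters, and \(c\) is derived from the stated condition \(e^{-c \cdot 0.10} = 0.25\), which gives \(c = \ln(4) / 0.10 \approx 13.86\).

```python
import math

# c is chosen so that exp(-c * 0.10) == 0.25, i.e. c = ln(4) / 0.10.
C = math.log(4.0) / 0.10


def object_score(goal_pos, final_pos, c=C):
    """Per-object term exp(-c * ||p*_o - p_o||) for (x, y, z) positions."""
    return math.exp(-c * math.dist(goal_pos, final_pos))


def goal_score(goal_positions, final_positions, c=C):
    """M_g: sum of the per-object terms over the objects of one goal."""
    return sum(object_score(g, f, c) for g, f in zip(goal_positions, final_positions))


def total_score(per_goal_scores):
    """M: average of M_g over all G goals."""
    return sum(per_goal_scores) / len(per_goal_scores)
```

For example, `object_score((0, 0, 0), (0.10, 0, 0))` evaluates to 0.25 (up to floating-point rounding), and a goal with three objects placed exactly on target scores 3.0. With a single object the best possible \(M_g\) is 1.0, which is why choosing 1 or 2 objects caps the result at 1/3 or 2/3 of the maximum.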
Knowledge transfer The only regularities (‘structure’) that are shared between the intrinsic and the extrinsic phase are related to the environment and objects; in particular in the intrinsic phase the robot has no knowledge about which tasks it will be called to solve in the extrinsic phase. Therefore, in the intrinsic phase the robot should undergo an autonomous open-ended learning process that should lead it to acquire, in the available time, as much knowledge and as many skills as possible to be ready to best face the unknown tasks of the following extrinsic phase.
Competition structure The competition will be divided in two Rounds plus a Final Evaluation. During both Round 1 and Round 2, only the extrinsic phase will be evaluated online by the competition servers. Participants will run the intrinsic phase on their machines and upload the code of their system, along with the acquired parameters, for evaluation.
- Round 1. The first round will offer a number of simplifications (see below) that the participants can freely choose to simplify some aspects of the challenge.
- Round 2. During the second round most of the simplifications will no longer be available (see below). At the end of Round 2, the Top 10 participants in the ranking will be selected for a final full evaluation.
- Final evaluation. See below.
Final evaluation The top 10 participants of Round 2 will be able to access the final evaluation. The final evaluation consists of a short round of one week in which the participants will be able to resubmit their systems, and this time their code will be run to simulate both the intrinsic and the extrinsic phase. Participants will be able to submit and evaluate their solutions online up to 3 times, and the best result will be used as their final score. During this week participants are still able to modify their submissions, although, given the short time span, this final round is mostly meant for correcting technical errors that might prevent the evaluation (since up to this final evaluation the intrinsic phase has only been run locally by participants). Submissions that fail without a score (timeouts or code crashes) are not counted towards the total of three submissions. The best submission scores obtained during this final evaluation will determine the winners of the competition.
Spirit of the rules As also explained above, the spirit of the rules is that during the intrinsic phase the robot is not explicitly given any task to learn and it does not know of the future extrinsic tasks, but it rather learns in a fully autonomous way.
As such, the Golden Rule is that it is explicitly forbidden to use the scoring function of the extrinsic phase or variants of it as a reward function to train the agent. Participants should give as little information as possible to the robot, rather the system should learn from scratch to interact with the objects using curiosity, intrinsic motivations, self-generated goals, etc.
Simplifications Given the difficulty of the competition and the many challenges it contains, and to encourage wide participation, some simplifications are allowed.
The following simplifications are always available, both in Round 1 and Round 2.
- Joint or position control. Two possible control modes will be available: (a) joint control: nine joint-angle commands, including two gripper DOFs, at each simulation step (the robot moves towards the desired joint angles through a PID controller); (b) position control: Cartesian position control, where commands require the robot to achieve a certain (x,y,z) position with the wrist at each step (this control will be pursued through an inverse kinematics model); the gripper orientation will be controlled as a quaternion; the gripper 2 DOFs will be controlled through joint-angle commands.
- Home position. The participant can recall a ‘home’ action that brings the arm back to an initial position standing over the table; if used, this action must be called at regular time intervals (variable time intervals are not allowed).
- Objects. For each submission the participant will decide how many objects to use (i.e. the cube; the cube and the tomato can; or the cube, the tomato can and the mustard bottle): using 1 or 2 objects will make the task easier for the robot but will lower the maximum obtainable score (1/3 or 2/3 of the maximum score achievable with 3 objects).
- Fixed wrist orientation. While using position control, participants may elect to have the wrist in a fixed orientation so that the gripper will be kept vertical, pointing downwards, at all times. Robots with the fixed wrist will not be able to reach the shelf.
- Closed gripper. Participants may elect not to use the gripper and keep it closed the whole time.
The following simplifications can only be used in Round 1.
- Additional observations. In addition to the standard observations (joint positions, touch sensors and camera image), the observation will include the position (x, y, z) of objects and a segmented image of the environment (an image where each pixel color is replaced with a number indicating the identity of the underlying object).
- Macro Action. Participants may elect to use ‘macro-actions’: instead of sending commands to the robot at each time step, participants can use a parameterized action, with the following parameters \(x_i, y_i, x_f, y_f\). The macro-action will move the arm from the home position to the location \((x_i, y_i, z)\) and then to \((x_f, y_f, z)\) before returning home again after a predetermined number of time steps. This macro action uses position control, with the elevation \(z\) determined automatically to have the gripper close to the table, and it also uses the fixed-wrist orientation and closed gripper. The macro-action corresponds to performing a push movement along the table.
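
To make the macro-action concrete, the sketch below expands the four parameters into a sequence of Cartesian wrist targets following the path home, start of push, end of push, home. It is only an illustration of the geometry: the function name, the home position, the fixed elevation `z` and the number of steps per segment are all hypothetical placeholders, since in the actual environment the elevation and timing are determined automatically.

```python
import numpy as np


def push_waypoints(x_i, y_i, x_f, y_f, z=0.05, home=(0.0, 0.0, 0.6),
                   steps_per_segment=50):
    """Expand macro-action parameters into wrist targets:
    home -> (x_i, y_i, z) -> (x_f, y_f, z) -> home."""
    corners = np.array([home, (x_i, y_i, z), (x_f, y_f, z), home], dtype=float)
    # Interpolation fractions in (0, 1]; the start point of each segment is
    # skipped so that corner points are not duplicated.
    t = np.linspace(0.0, 1.0, steps_per_segment + 1)[1:, None]
    segments = [start + t * (end - start)
                for start, end in zip(corners[:-1], corners[1:])]
    return np.vstack(segments)


# Hypothetical push of 20 cm along the y axis.
waypoints = push_waypoints(0.0, -0.1, 0.0, 0.1)
```

Each row of `waypoints` is one (x, y, z) target for position control; the middle segment, travelled at constant elevation, is the push along the table.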
Code inspection To be eligible for ranking, participants are required to open the source code of their submissions to the competition monitoring check. Submitted systems will be sampled for checking their compliance with the competition rules and spirit during the competition. The top 10 systems of the final ranking will all be checked for compliance with the competition rules and spirit before declaring the competition winners.
Eligibility Participants belonging to the GOAL-Robots project, AIcrowd, or other parts of the Organization Team may participate in the competition to provide baselines for other participants but are ineligible for the ranking and prizes of any phase of the competition.

Rule addendum:
As a special exception, it is allowed to crop the observation image, in the following manner:

cropped_observation = observation['retina'][0:180,70:250,:]

See this discussion for the rationale:
https://discourse.aicrowd.com/t/cropping-images-rule-exception-and-other-rules-clarifications/3770
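
The effect of the permitted crop can be checked with a stand-in array. In this sketch the 240x320 RGB shape of `observation['retina']` is an assumption made for illustration; only the slicing expression comes from the rule above.

```python
import numpy as np

# Stand-in for the camera observation; the 240x320x3 shape is an assumption.
observation = {"retina": np.zeros((240, 320, 3), dtype=np.uint8)}

# Keep rows 0..179 and columns 70..249, with all colour channels.
cropped_observation = observation["retina"][0:180, 70:250, :]
print(cropped_observation.shape)  # (180, 180, 3)
```

The slice returns a NumPy view rather than a copy, so call `.copy()` on it if the cropped image will be modified in place.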

Prizes

Round 1
Members of the top 3 teams will receive free registrations for the IEEE International Conference on Development and Learning (https://cdstc.gitlab.io/icdl-2020/).
Round 2
Top 3 teams will be invited to co-author a paper.
Top 3 teams will also receive free registrations to ICDL-2021 - 1 free registration and a 50% discounted registration for the first team, 1 free registration for the second team, 1 discounted registration for the third team.