# Challenge Rules

## Rules

The rules of the competition will be as follows:

• Overview The competition focuses on autonomous open-ended learning with a simulated robot. The setup features a simulated robot that, in a first intrinsic phase, autonomously interacts with a partially unknown environment and learns how to interact with it, and that, in a second extrinsic phase, has to solve a number of tasks on the basis of the knowledge acquired in the first phase. Importantly, in the intrinsic phase the system does not know the tasks it will have to solve in the extrinsic phase.
• Simulator To this end, competitors will be given a software kit with which they can install the simulator of the robot and environment on their machines (see below).
• Robot The robot consists of a seven-degrees-of-freedom Kuka arm, a two-degrees-of-freedom gripper, and a top-view camera.
• Environment The environment will be a simplified kitchen-like scenario formed by a table, a shelf, one cube, and two kitchen objects.
• Training and testing phases Both during development on the participants’ machines and during evaluation on the AIcrowd platform, the competitor systems will have to undergo two phases: an intrinsic phase of training and an extrinsic phase of testing. At submission time, the participant can decide how many objects (1, 2, or 3) the system will manage during the two phases; this determines an important dimension of the difficulty of the challenge. Choosing 1 or 2 objects will facilitate the challenge but will cap the achievable performance at 1/3 or 2/3 of the maximum, respectively.
• Intrinsic phase During the intrinsic phase, the robot will have to autonomously interact with an environment for a certain period of time during which it should acquire as much knowledge and skills as possible, to best solve the tasks in the extrinsic phase. Importantly, during the intrinsic phase the robot will not be aware of the tasks it will have to solve in the extrinsic phase.
• Extrinsic phase During the extrinsic phase the system will be tested for the quality of the knowledge acquired during the intrinsic phase. The robot will have to solve a number of goals: each goal will involve a different configuration of 1 to 3 objects in the environment that the robot has to recreate starting from a different configuration.

• Goal types. Goals will be drawn from the following classes of problems, defined by the nature of the goal to accomplish:
(1) 2D goal type: overall goal defined in terms of the configuration of 1 to 3 objects on the table plane, never close to each other and with a fixed orientation;
(2) 2.5D goal type: overall goal defined in terms of the configuration of 1 to 3 objects set on the table plane and on the shelf, never close to each other and with a fixed orientation;
(3) 3D goal type: overall goal defined in terms of 1 to 3 objects set on the table plane and on the shelf, with any orientation and no minimum distance.
Each goal will be tested with a different starting configuration, which follows the same criteria of the goal. All objects will have to be moved from the starting configuration to reach the goal.

• Learning time budget The time available for learning in the intrinsic phase is limited to 15 million time steps. Learning in the extrinsic phase will be possible, but its utility will be strongly limited by the short time available: 10,000 time steps to solve each goal.

• Computational limits All submissions are expected to be able to run the intrinsic phase and the extrinsic phase within a certain time limit on the evaluation machines. Current limits are set to 6 h for the extrinsic phase and 72 h for the intrinsic phase on an 8-CPU, 64 GB RAM, Nvidia V100 16 GB virtual machine. Limits will be announced before each Round starts.

• Score The performance in the extrinsic phase for an overall goal g will be scored according to the following metric $$M_g$$:
$$M_g = \sum_{o=1}^n e^{-c||\textbf{p}^*_o - \textbf{p}_o||}$$
where $$n$$ is the number of objects (1, 2, or 3), $$\textbf{p}^*_o$$ is the (x, y, z) position vector of the center of mass of object $$o$$ in the target goal, $$\textbf{p}_o$$ is the position of the object at the end of the task, after the robot attempts to bring it to the goal position, and $$c$$ is a constant chosen so that an object's contribution to the score is 0.25 when its distance to the goal position is 0.10 m (10 cm). Note that the metric ranges in (0, 1] for each object: it equals 1.0 if the object is exactly at the goal position and decays exponentially with increasing distance from it. Placing all 3 objects exactly in the overall goal configuration thus yields the maximum score of 3.0. The total score $$M$$ of a system will be the average of its scores across all goals:
$$M = \frac{1}{G} \sum_{g=1}^G M_g$$
where $$G$$ is the number of all goals.
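The scoring rule can be sketched numerically as follows. This is an illustrative re-implementation, not the official evaluation code; note that the constant $$c$$ is fully determined by the "0.25 at 10 cm" condition, giving $$c = \ln(4)/0.10 \approx 13.86$$.

```python
import math

def goal_score(goal_positions, final_positions, d_ref=0.10, s_ref=0.25):
    """Illustrative M_g: sum of per-object exponential distance scores.

    c is chosen so that an object at distance d_ref (10 cm) from its
    goal contributes s_ref (0.25) to the score.
    """
    c = -math.log(s_ref) / d_ref  # c = ln(4) / 0.10, about 13.86
    return sum(
        math.exp(-c * math.dist(p_goal, p_final))
        for p_goal, p_final in zip(goal_positions, final_positions)
    )

def total_score(goal_scores):
    """Illustrative M: the average of M_g across all goals."""
    return sum(goal_scores) / len(goal_scores)
```

For example, an object placed exactly at its goal contributes 1.0, while one 10 cm away contributes 0.25.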

• Knowledge transfer The only regularities (‘structure’) shared between the intrinsic and the extrinsic phase are those of the environment and objects; in particular, in the intrinsic phase the robot has no knowledge of which tasks it will be called to solve in the extrinsic phase. Therefore, in the intrinsic phase the robot should undergo an autonomous open-ended learning process that leads it to acquire, in the available time, as much knowledge and as many skills as possible, so as to be ready to face the unknown tasks of the following extrinsic phase.

• Competition structure The competition will be divided into two Rounds plus a Final Evaluation. During both Round 1 and Round 2, only the extrinsic phase will be evaluated online by the competition servers. Participants will run the intrinsic phase on their own machines and upload the code of their system, along with the acquired parameters, for evaluation.
• Round 1. The first round will offer a number of simplifications (see below) that participants can freely adopt to ease some aspects of the challenge.
• Round 2. During the second round most of the simplifications will no longer be available (see below). At the end of Round 2, the Top 10 participants in the ranking will be selected for a final full evaluation.
• Final evaluation The Top 10 participants of Round 2 will have access to the final evaluation: a short round of one week during which participants can resubmit their solutions, and this time their code will be run to simulate both the intrinsic and the extrinsic phase. Participants will be able to submit and evaluate their solutions online up to 3 times, and the best result will be used as their final score. During this week participants may still modify their submissions, although, given the short time span of one week, this final round is mostly meant for correcting technical errors that might prevent the evaluation (since up to this point the intrinsic phase has only been run locally by participants). Submissions that fail without a score (timeouts or code crashes) do not count towards the total of three submissions. The best submission scores obtained during this final evaluation will determine the winners of the competition.
• Spirit of the rules As also explained above, the spirit of the rules is that during the intrinsic phase the robot is not explicitly given any task to learn and it does not know of the future extrinsic tasks, but it rather learns in a fully autonomous way.
As such, the Golden Rule is that it is explicitly forbidden to use the scoring function of the extrinsic phase, or variants of it, as a reward function to train the agent. Participants should give as little information as possible to the robot; rather, the system should learn from scratch to interact with the objects using curiosity, intrinsic motivations, self-generated goals, etc.

• Simplifications Given the difficulty of the competition and the many challenges it contains, and to encourage wide participation, some simplifications are allowed.
The following simplifications are always available, both in Round 1 and Round 2.

• Joint or position control. Two control modes will be available: (a) joint control: nine joint-angle commands, including the two gripper DOFs, at each simulation step (the robot moves towards the desired joint angles through a PID controller); (b) position control: Cartesian position control, where commands require the robot to bring the wrist to a given (x, y, z) position at each step (pursued through an inverse kinematic model); the gripper orientation will be controlled as a quaternion, and the gripper's 2 DOFs will be controlled through joint-angle commands.
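The two control modes can be pictured as follows. The array layouts and field names below are illustrative assumptions for this sketch, not the simulator's actual API.

```python
import numpy as np

# (a) Joint control: nine joint-angle targets per simulation step
#     (7 arm joints + 2 gripper DOFs); a PID drives the robot toward them.
joint_action = np.zeros(9)

# (b) Position control: a Cartesian (x, y, z) target for the wrist,
#     a quaternion for the gripper orientation, and 2 gripper joint angles.
#     (The dictionary keys here are hypothetical.)
position_action = {
    "wrist_xyz": np.array([0.0, 0.5, 0.2]),
    "wrist_quaternion": np.array([1.0, 0.0, 0.0, 0.0]),  # (w, x, y, z)
    "gripper_joints": np.zeros(2),
}
```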

• Home position. Participants can invoke a ‘home’ action that brings the arm back to an initial position standing over the table; if used, this action must be called at regular time intervals (variable time intervals are not allowed).

• Objects. For each submission the participant will decide how many objects to use (i.e. the cube alone; the cube and the tomato can; or the cube, the tomato can, and the mustard bottle): using 1 or 2 objects will make things easier for the robot but will lower the maximum obtainable score (1/3 or 2/3 of the maximum score achievable with 3 objects).

• Fixed wrist orientation. While using position control, participants may elect to keep the wrist in a fixed orientation, so that the gripper stays vertical, pointing downwards, at all times. Robots with the fixed wrist will not be able to reach the shelf.

• Closed gripper. Participants may elect not to use the gripper and keep it closed the whole time.

The following simplifications can only be used in Round 1.

• Additional observations. In addition to the standard observations (joint positions, touch sensors and camera image), the observation will include the position (x, y, z) of objects and a segmented image of the environment (an image where each pixel color is replaced with a number indicating the identity of the underlying object).
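As a sketch of how such observations might be consumed, the snippet below crops the camera image to the region over the table; the observation layout and the 240x320 image size are assumptions for illustration only.

```python
import numpy as np

# Mock observation dict; 'retina' stands for the top-view camera image
# (the 240x320 RGB shape is an assumption made for this example).
observation = {"retina": np.zeros((240, 320, 3), dtype=np.uint8)}

# Crop the camera image down to the region of interest over the table.
cropped_observation = observation["retina"][0:180, 70:250, :]
```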

• Macro action. Participants may elect to use ‘macro-actions’: instead of sending commands to the robot at each time step, participants can use a parameterized action with parameters $$x_i, y_i, x_f, y_f$$. The macro-action will move the arm from the home position to the location $$(x_i, y_i, z)$$ and then to $$(x_f, y_f, z)$$, before returning home again after a predetermined number of time steps. This macro-action uses position control, with the elevation $$z$$ determined automatically so that the gripper is close to the table, and it also uses the fixed wrist orientation and the closed gripper. The macro-action corresponds to performing a push movement along the table.
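The push macro-action can be visualized as a fixed waypoint sequence. The home pose and table elevation below are assumed values for this sketch, not the simulator's actual constants.

```python
HOME = (0.0, 0.4, 0.5)   # assumed home pose standing over the table
TABLE_Z = 0.05           # assumed gripper elevation close to the table

def macro_action_waypoints(x_i, y_i, x_f, y_f):
    """Sketch of the push macro-action:
    home -> (x_i, y_i) -> (x_f, y_f) -> home, all at table height."""
    return [
        HOME,
        (x_i, y_i, TABLE_Z),  # descend to the push start point
        (x_f, y_f, TABLE_Z),  # push along the table to the end point
        HOME,                 # return to the home position
    ]
```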

• Code inspection To be eligible for ranking, participants are required to make the source code of their submissions available for competition monitoring checks. Submitted systems will be sampled during the competition to check their compliance with the competition rules and spirit. The top 10 systems of the final ranking will all be checked for compliance with the competition rules and spirit before the competition winners are declared.

• Eligibility Participants belonging to the GOAL-Robots project, AIcrowd, or other parts of the Organization Team may participate in the competition to provide baselines for other participants, but they are ineligible for the ranking and prizes of any phase of the competition.
