
# Multi-Agent Reinforcement Learning for Iterative Reasoning

🏆 Prize:

🚀 Submissions Open

👩‍💻 Check out the starter kit

## 🌐 Overview

In this competition, you are challenged to design an algorithm that performs well in:

• eliminating obviously bad actions in self-play, competing well against itself (multi-agent environment) 🤖🆚🤖
• minimizing regret and chasing reward on its own (single-agent/adversarial environment) 🤖💹💸

The challenge is open to the research community and to anyone interested in reinforcement learning. Top-performing algorithms would both further academic work in learning and game theory and indicate potential improvements to current applied strategies in recommendation, planning, and trading.

## 🤖 Multi-agent environment

Our multi-agent environments have N agents that each pick one of K possible actions at every iteration. Depending on the current state of the game, certain actions will be dominated, i.e. suboptimal regardless of the opponents' actions. Performance is measured by the Progress of Elimination (PoE): the proportion of dominated actions that the agent has iteratively eliminated from consideration. The environment will play the algorithm in the Diamond in the Rough and Market for Lemons games.

Diamond in the Rough (DIR) is an N = 2 (2-player) game where the optimal pair of plays from both agents (the diamond) is "hidden" among all K × K possible combinations. Progress in the game is gradual and turn-based: an agent can eliminate its next dominated action only after the opponent has eliminated theirs.
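To make iterated dominance elimination concrete, here is a minimal Python sketch. It is not the competition's DIR implementation: the 3 × 3 payoff matrices `ROW_PAYOFF` and `COL_PAYOFF` are hypothetical, chosen only so that eliminations alternate between the two players, and `progress_of_elimination` is an assumed reading of PoE in which an action counts as "eliminated" once the agent plays it with probability at most `tol`.

```python
import numpy as np

# Hypothetical payoffs (NOT the competition's DIR game).
# Entry [i, j] is the payoff when the row player picks i and the column player picks j.
ROW_PAYOFF = np.array([[1, 1, 1],
                       [3, 2, 2],
                       [2, 3, 3]])
COL_PAYOFF = np.array([[3, 1, 1],
                       [1, 3, 2],
                       [1, 2, 3]])


def iterated_elimination(row_payoff, col_payoff):
    """Repeatedly remove strictly dominated pure actions; return the survivors."""
    rows = list(range(row_payoff.shape[0]))
    cols = list(range(row_payoff.shape[1]))
    changed = True
    while changed:
        changed = False
        # A row action is dominated if another surviving row action does strictly
        # better against every surviving column action.
        for r in list(rows):
            if any(all(row_payoff[o, c] > row_payoff[r, c] for c in cols)
                   for o in rows if o != r):
                rows.remove(r)
                changed = True
        # Same check for the column player against the surviving row actions.
        for c in list(cols):
            if any(all(col_payoff[r, o] > col_payoff[r, c] for r in rows)
                   for o in cols if o != c):
                cols.remove(c)
                changed = True
    return rows, cols


def progress_of_elimination(action_probs, survivors, tol=1e-3):
    """Assumed PoE: share of dominated actions played with probability <= tol."""
    dominated = [a for a in range(len(action_probs)) if a not in survivors]
    if not dominated:
        return 1.0
    eliminated = sum(1 for a in dominated if action_probs[a] <= tol)
    return eliminated / len(dominated)


rows, cols = iterated_elimination(ROW_PAYOFF, COL_PAYOFF)
print(rows, cols)  # [2] [2]
print(progress_of_elimination([0.0, 0.5, 0.5], rows))  # 0.5
```

In this toy game no column action is dominated at the start. Column 0 only becomes dominated after row 0 is removed, and row 1 only becomes dominated after column 0 is removed, which mirrors the turn-based dependence described above.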

Akerlof's Market for Lemons simulates a used car market, with N seller agents each deciding whether or not (K = 2) to list their car for sale to a single buyer. A car sells if the price offered by the buyer (who is unaware of car quality) is at or above that seller's reservation price (a noisy approximation of car quality). As the seller agents try to sell their cars while pushing the price up to maximize revenue, the game reaches a Nash equilibrium of market collapse, in which the buyer lowers the price below the minimum reservation price and all sellers exit the market.
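The collapse dynamic can be illustrated with a small deterministic sketch. It is a simplification under stated assumptions, not the competition environment: the buyer is assumed to offer a fixed fraction (`discount`, a hypothetical parameter) of the average reservation price of the cars currently listed, and a seller is assumed to stay listed only while that offer covers their reservation price. The real game involves learning agents and noisy quality signals.

```python
def simulate_lemons(reservation_prices, discount=0.9, max_rounds=100):
    """Best-response dynamic for a toy lemons market (assumed pricing rule).

    Returns the reservation prices still listed at the end, plus a history
    of (offered price, number of listed cars) for each round.
    """
    listed = sorted(reservation_prices)
    history = []
    for _ in range(max_rounds):
        if not listed:
            break  # market has collapsed: nobody is listing
        # Assumption: buyer offers a discounted average of the listed cars.
        price = discount * sum(listed) / len(listed)
        history.append((price, len(listed)))
        # Sellers whose reservation price exceeds the offer withdraw.
        staying = [r for r in listed if r <= price]
        if len(staying) == len(listed):
            break  # stable: no seller wants to leave
        listed = staying
    return listed, history


listed, history = simulate_lemons([1.0, 2.0, 3.0, 4.0])
for price, n_listed in history:
    print(f"offer={price:.2f} listed={n_listed}")
print("still listed:", listed)
```

With positive reservation prices and `discount < 1`, the highest-priced listed car always sits above the offer, so at least one seller leaves each round until the market is empty.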

In the single-agent case, the agent picks an action at every iteration and observes only the bandit feedback for that decision. The objective, over time, is to minimize the expected regret, defined as the expected difference between the accumulated reward of the best fixed action in hindsight and the accumulated reward of the agent's chosen actions, averaged over iterations. To ensure robustness, the environment will play the algorithm against both an oblivious adversary (which may provide random rewards) and a non-oblivious adversary (which will tailor rewards to try to maximize regret).
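As a reference point for the bandit setting, the sketch below implements the standard EXP3 algorithm and an empirical average regret against a fixed reward table. It is plain EXP3, not the EXP3-DH benchmark named in the Evaluation section, and it does not use the competition's environment API: `reward_table`, `gamma`, and both function names are illustrative assumptions. A fixed table models an oblivious adversary only, and rewards are assumed to lie in [0, 1].

```python
import numpy as np


def exp3(reward_table, gamma=0.1, seed=0):
    """Run standard EXP3 on a (T, K) table of rewards in [0, 1]."""
    rng = np.random.default_rng(seed)
    n_rounds, n_actions = reward_table.shape
    weights = np.ones(n_actions)
    actions = np.empty(n_rounds, dtype=int)
    for t in range(n_rounds):
        # Mix the weight-proportional distribution with uniform exploration.
        probs = (1 - gamma) * weights / weights.sum() + gamma / n_actions
        a = rng.choice(n_actions, p=probs)
        reward = reward_table[t, a]  # bandit feedback: only this entry is seen
        # Importance-weighted estimate keeps the reward estimate unbiased.
        estimate = reward / probs[a]
        weights[a] *= np.exp(gamma * estimate / n_actions)
        weights /= weights.max()  # rescale to avoid overflow on long runs
        actions[t] = a
    return actions


def average_regret(reward_table, actions):
    """Best fixed action's total reward minus the agent's, per round."""
    n_rounds = len(actions)
    best_fixed = reward_table.sum(axis=0).max()
    earned = reward_table[np.arange(n_rounds), actions].sum()
    return (best_fixed - earned) / n_rounds


# Toy oblivious environment: action 1 always pays 1, the others pay 0.
table = np.zeros((5000, 3))
table[:, 1] = 1.0
print(average_regret(table, exp3(table)))
```

This computes realized regret on one run. Estimating the expected regret would require averaging over several seeds, and a non-oblivious adversary would need rewards generated in response to the agent's past actions rather than read from a fixed table.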

## 🛠 Submission

Submitted algorithms will interact with a collection of OpenAI Gym environments that will [?].🤠

Source code for the environments is available on the GitHub repository linked below. We encourage participants to use a local copy of this repository to build and test their algorithms, and to submit to the competition on AIcrowd only when they feel a submission is ready for evaluation.

The GitLab for the starter kit can be found here

## 📊 Evaluation

All submissions will be evaluated with an equal four-way weighting of the regret in each of the two single-agent environments and the PoE in each of the two games (DIR and Market for Lemons), where lower regret and higher PoE score better. This score will be averaged across the simulation's final iterations, to ensure stability of convergence to an equilibrium. The calculation of this score can be found in [?]. 🤠

Top-ranking submissions will meet the goal of the challenge by performing better than the EXP3 with Diminishing History (EXP3-DH) algorithm described in (Wu, Xu, Yao 2021). This benchmark algorithm is capable of eliminating all dominated actions in a polynomial number of rounds. As the challenge is research-focused, we also seek mathematical justification of the performance guarantees of the top algorithms.

## 📅 Timeline

• - Competition launches, submission system opens
• - Submission period: submissions allowed every [?] 🤠
• - Entry and team formation deadline
• - Winners Announced

## 💯 Team

This challenge has been organized by the Strategic IntelliGence for Machine Agents (SIGMA) Lab at the University of Virginia.

• Jibang Wu is a PhD student whose research lies at the intersection of learning and game theory. In addition to authoring the paper this competition is grounded on, he serves as its coordinator.
• Gustavo Moreira is an undergraduate student with a background in multi-agent systems and artificial intelligence. He serves as the lead platform developer and liaison with AIcrowd.
• Param Damle is an undergraduate student with a background in software development and applied machine learning. He is in charge of engineering the economic games and maintaining a fair competitive platform for the participants of this challenge.
• Haifeng Xu is the Alan Batson Assistant Professor in computer science at the University of Virginia and has served on the senior program committee of several conferences, including IJCAI and AAAI. His strong track record in AI and EconCS covers more than 45 publications, including the paper this competition is grounded on. He advises on the competition design and supervises its operation.