πŸ•΅οΈ Introduction

In this problem you will implement value iteration on a 10 × 10 gridworld, given its state space, action space, and rewards. Consider the 10 × 10 gridworld (GridWorld-1) shown in Figure 1:

Figure 1: GridWorld-1

  • State space: The gridworld has 100 distinct states, including a special absorbing state called Goal. There are two wormholes labeled IN, one Grey and one Brown; any action taken in an IN state teleports you to the state labeled "OUT" of the same color. The states labeled OUT are otherwise normal states.
  • Action space: There are 4 actions, A = {North, East, South, West}, each of which moves you one cell in the corresponding direction.
  • Transition model: The gridworld is stochastic. An action X ∈ {North, East, South, West} moves you one cell in direction X with probability 0.8, and with probability 0.1 each moves you one cell at angles of +90° and −90° to direction X (refer to Table 1 for more details). For example, if the selected action is North, you move one cell to the North of your current position with probability 0.8, one cell to the East with probability 0.1, and one cell to the West with probability 0.1. Transitions that would take you off the grid do not change the state. No transitions are available once you reach the Goal state. (A code sketch of these dynamics is given after this list.)

Table 1: Probability of moving one cell in each direction, for each chosen action

  Chosen action  North  East  South  West
  North           0.8    0.1   0.0    0.1
  East            0.1    0.8   0.1    0.0
  South           0.0    0.1   0.8    0.1
  West            0.1    0.0   0.1    0.8

  • Rewards: You receive a reward of −1 for every transition (including those that take you off the grid), except transitions into the Goal state, which give a reward of +100.
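
The dynamics above translate almost directly into code. The following is a minimal sketch, not part of the provided starter code: the goal and wormhole cell positions are placeholders to be replaced with the values from the data files, and the coordinate convention (row, column, both indexed 0–9) and names such as `move` and `transitions` are illustrative assumptions.

```python
# Minimal sketch of the GridWorld-1 dynamics described above.
# The goal and wormhole cells below are PLACEHOLDERS; the real positions
# come from the data files provided under Resources.

N = 10                                    # the grid is 10 x 10
GOAL = (0, 9)                             # placeholder goal cell (row, col)
WORMHOLE_IN = {(3, 3): (7, 7)}            # placeholder: IN cell -> matching OUT cell

# Action effects as (row, col) offsets.
ACTIONS = {"North": (-1, 0), "East": (0, 1), "South": (1, 0), "West": (0, -1)}
# For a chosen action X: 0.8 in direction X, 0.1 at +90 degrees, 0.1 at -90 degrees.
PERPENDICULAR = {"North": ("East", "West"), "South": ("East", "West"),
                 "East": ("North", "South"), "West": ("North", "South")}


def move(state, direction):
    """Deterministic one-cell move; moves off the grid leave the state unchanged."""
    row, col = state
    d_row, d_col = ACTIONS[direction]
    new_row, new_col = row + d_row, col + d_col
    if 0 <= new_row < N and 0 <= new_col < N:
        return (new_row, new_col)
    return state


def transitions(state, action):
    """Return a list of (probability, next_state, reward) triples for (state, action)."""
    if state == GOAL:
        return []                         # absorbing: no transitions out of Goal
    if state in WORMHOLE_IN:
        # Any action taken in an IN cell teleports you to its OUT cell.
        nxt = WORMHOLE_IN[state]
        return [(1.0, nxt, 100.0 if nxt == GOAL else -1.0)]
    side_a, side_b = PERPENDICULAR[action]
    triples = []
    for prob, direction in ((0.8, action), (0.1, side_a), (0.1, side_b)):
        nxt = move(state, direction)
        triples.append((prob, nxt, 100.0 if nxt == GOAL else -1.0))
    return triples
```

Representing the model as (probability, next state, reward) triples keeps the value-iteration backup to a single sum per state–action pair.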

Instructions

  1. Implement:
    1.  value iteration. Let S be the state space and A the action space. Starting from J_0(s) = 0 for all s ∈ S, repeatedly apply the Bellman optimality backup

        J_{i+1}(s) = max_{a∈A} Σ_{s'∈S} P(s' | s, a) [ R(s, a, s') + γ J_i(s') ]   ∀s ∈ S,

        where γ ∈ (0, 1] is the discount factor. (A Python sketch of this loop is given after Figure 2 below.)

    2.  a greedy policy w.r.t. J_i as

        π_i(s) = argmax_{a∈A} Σ_{s'∈S} P(s' | s, a) [ R(s, a, s') + γ J_i(s') ].

  2. As written, the value-iteration loop runs forever (see the update above); when would you stop your value iteration?
  3. Plot a graph of max_{s∈S} |J_{i+1}(s) − J_i(s)| vs. iterations (a plotting snippet is given at the end of this section).
  4. Tabulate the values J(s) and the greedy policy π(s), ∀s ∈ S, after 10 iterations, after 25 iterations, and after you stop the value iteration.
  5. Consider a new gridworld (GridWorld-2) as shown in Figure 2. Compare and contrast the behavior of J and the greedy policy π for GridWorld-1 and GridWorld-2.

Figure 2: GridWorld-2
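
For items 1 and 2 above, here is a minimal sketch of value iteration and greedy-policy extraction. It assumes a `transitions(state, action)` function like the one sketched after the environment description, and defaults to γ = 1 (an assumption that is plausible for this episodic task with an absorbing goal); the names, the stopping threshold `theta`, and the iteration cap are illustrative, not prescribed by the assignment.

```python
# Minimal sketch of value iteration and greedy-policy extraction.
# Assumes a transitions(state, action) function returning
# (probability, next_state, reward) triples, as sketched earlier.

def backup(J, state, action, gamma):
    """Expected one-step return of taking `action` in `state` under values J."""
    return sum(p * (r + gamma * J[nxt]) for p, nxt, r in transitions(state, action))


def value_iteration(states, actions, gamma=1.0, theta=1e-6, max_iters=10_000):
    """Run value iteration from J_0 = 0.

    Returns the final values J and the list of per-iteration gaps
    max_s |J_{i+1}(s) - J_i(s)|, which is what item 3 asks you to plot.
    gamma=1.0 is an ASSUMPTION (episodic task with an absorbing goal).
    """
    J = {s: 0.0 for s in states}
    deltas = []
    for _ in range(max_iters):
        # Bellman optimality backup for every state; the absorbing Goal state
        # has no outgoing transitions, so its backup is 0 and it stays at 0.
        J_new = {s: max(backup(J, s, a, gamma) for a in actions) for s in states}
        delta = max(abs(J_new[s] - J[s]) for s in states)
        deltas.append(delta)
        J = J_new
        if delta < theta:                 # stop once the values have (numerically) converged
            break
    return J, deltas


def greedy_policy(J, states, actions, gamma=1.0):
    """Greedy policy w.r.t. J: in each state, pick the action with the largest backup.
    (The action recorded at the absorbing Goal state is irrelevant.)"""
    return {s: max(actions, key=lambda a: backup(J, s, a, gamma)) for s in states}
```

A standard answer to item 2 is to stop once max_{s∈S} |J_{i+1}(s) − J_i(s)| drops below a small tolerance θ; the list of gaps returned above is exactly the quantity item 3 asks you to plot.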

You will write your solutions and make your submission through a notebook; follow the instructions in the starter code.
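
For the plot in item 3 (and the tabulation in item 4), one possible driver is sketched below, reusing the illustrative names from the two sketches above and assuming matplotlib is available in the notebook.

```python
import matplotlib.pyplot as plt

# Illustrative driver: enumerate all 100 cells, run value iteration, and plot
# max_s |J_{i+1}(s) - J_i(s)| against the iteration index (item 3).
STATES = [(row, col) for row in range(N) for col in range(N)]
ACTION_NAMES = list(ACTIONS)                     # ["North", "East", "South", "West"]

J, deltas = value_iteration(STATES, ACTION_NAMES)
pi = greedy_policy(J, STATES, ACTION_NAMES)      # J and pi can be tabulated for item 4

plt.plot(range(1, len(deltas) + 1), deltas)
plt.xlabel("iteration i")
plt.ylabel("max_s |J_{i+1}(s) - J_i(s)|")
plt.yscale("log")                                # optional: a log scale shows the trend more clearly
plt.show()
```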

πŸ“ Files

Under the Resources section you will find data files that contain the parameters of the environment for this problem.

πŸš€ Submission

Submissions will be made through a notebook following the instructions in the starter code.

πŸ“± Contact

  • RL TAs

Notebooks

  • [Starter Notebook] RL - Value Iteration (by ashivani)