RL Assignment 2 - Taxi

Starter notebook

Use this notebook to run your experiments and make a submission

What is the notebook about?

Problem - Taxi Environment Algorithms

This problem deals with a taxi environment and stochastic actions. The tasks you have to do are:

  • Implement Policy Iteration
  • Visualize the results
  • Explain the results

Setup AIcrowd Utilities 🛠

Do not edit this block.

In [3]:
!pip install -U aicrowd-cli > /dev/null
AIcrowd Runtime Configuration 🧷

Define configuration parameters.

In [36]:
import os

AICROWD_DATASET_PATH = os.getenv("DATASET_PATH", "f16aba30-5fd1-4e52-b3ec-148e106a3596_taxi_sample_data.zip")
AICROWD_RESULTS_DIR = os.getenv("OUTPUTS_DIR", "results")
API_KEY = "" # Get your key from https://www.aicrowd.com/participants/me (ctrl + click the link)
In [35]:
!aicrowd login --api-key $API_KEY
!aicrowd dataset download -c rl-assignment-2-taxi
API Key valid
Saved API Key successfully!
f16aba30-5fd1-4e52-b3ec-148e106a3596_taxi_sample_data.zip: 100% 16.5k/16.5k [00:00<00:00, 117kB/s]
In [37]:
DATASET_DIR = 'data/'
Install packages 🗃

Please add all package installations in this section

Import packages 💻

In [38]:
import numpy as np
import matplotlib.pyplot as plt 
import os

Prediction Phase

Taxi Environment

Read the environment to understand the functions, but do not edit anything

In [39]:
import numpy as np
In [40]:
class TaxiEnv_HW2:
    def __init__(self, states, actions, probabilities, rewards, initial_policy):        
        probabilities, rewards = self._build_prob_mapping(states, actions, probabilities,rewards)
        self.possible_states = states
        self._possible_actions = {st: actions for st in states}
        self._ride_probabilities = {st: pr for st, pr in zip(states, probabilities)}
        self._ride_rewards = {st: rw for st, rw in zip(states, rewards)}
        self.initial_policy = initial_policy

    def _build_prob_mapping(self,states, actions, probabilities,rewards):
        n_cities = len(states)
        n_actions = len(actions)

        probs = np.zeros((n_cities, n_actions, n_cities))
        rewards[0] = [0,0,0,0,0,0]    
        rews = np.zeros((n_cities, n_actions, n_cities))

        for src in range(n_cities):
          for action in ('1', '2'):
              for c, prob in probabilities[action].items():
                  dst = (src+c) % n_cities
                  probs[src][actions.index(action)][dst] = prob
                  rews[src][actions.index(action)][dst] = rewards[c][src]
          action = '3'
          action = actions.index(action)
          probs[src][action][0] = 1
        return probs, rews

    def _check_state(self, state):
        assert state in self.possible_states, "State %s is not a valid state" % state

    def _verify(self):
        """
        Verify that data conditions are met:
        - the number of actions matches the shape of the probability and reward arrays
        - every probability distribution adds up to 1
        """
        ns = len(self.possible_states)
        for state in self.possible_states:
            ac = self._possible_actions[state]
            na = len(ac)
            rp = self._ride_probabilities[state]
            assert np.all(rp.shape == (na, ns)), "Probabilities shape mismatch"
            rr = self._ride_rewards[state]
            assert np.all(rr.shape == (na, ns)), "Rewards shape mismatch"

            assert np.allclose(rp.sum(axis=1), 1), "Probabilities don't add up to 1"

    def possible_actions(self, state):
        """ Return all possible actions from a given state """
        return self._possible_actions[state]

    def ride_probabilities(self, state, action):
        """
        Return the ride probabilities from a state for a given action.
        The result is an array with values in the same order as self.possible_states.
        """
        actions = self.possible_actions(state)
        ac_idx = actions.index(action)
        return self._ride_probabilities[state][ac_idx]

    def ride_rewards(self, state, action):
        actions = self.possible_actions(state)
        ac_idx = actions.index(action)
        return self._ride_rewards[state][ac_idx]

Example of Environment usage

In [41]:
import numpy as np 

def check_taxienv():
    # These are the values as used in the assignment document, but they may be changed during submission, so do not hardcode anything

    states = [0, 1, 2, 3, 4, 5]

    actions = ['1','2','3']

    probs = {}
    probs['1'] = {-1: 1/2, 0: 1/4, 1: 1/4}
    probs['2'] = {-1: 1/16, 0: 3/4, 1: 3/16}

    rewards = {}
    rewards[-1] = [8,7,3,2,1,2]
    rewards[1]  = [8,8,5,1,3,9]

    initial_policy = {0:'1', 1:'1', 2:'1', 3:'1', 4:'1', 5:'1'}


    env = TaxiEnv_HW2(states, actions, probs, rewards, initial_policy)
    print("All possible states", env.possible_states)
    print("All possible actions from state B", env.possible_actions(1))
    print("Ride probabilities from state C with action 2", env.ride_probabilities(2, '2'))
    print("Ride rewards from state E with action 1", env.ride_rewards(4, '1'))

    base_kwargs = {"states": states, "actions": actions, 
                "probabilities": probs, "rewards": rewards,
                "initial_policy": initial_policy}
    return base_kwargs

base_kwargs = check_taxienv()
All possible states [0, 1, 2, 3, 4, 5]
All possible actions from state B ['1', '2', '3']
Ride probabilities from state C with action 2 [0.     0.0625 0.75   0.1875 0.     0.    ]
Ride rewards from state E with action 1 [0. 0. 0. 1. 0. 3.]
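
The wrap-around indexing used in _build_prob_mapping can be checked in isolation. The following is a minimal sketch, assuming the same six-state ring and the action '2' offsets from the example above; build_row is a hypothetical helper name and is not part of the environment class.

```python
import numpy as np

def build_row(src, offset_probs, n_states):
    """Spread offset probabilities around a ring of n_states (hypothetical helper)."""
    row = np.zeros(n_states)
    for offset, prob in offset_probs.items():
        # The modulo wraps index -1 to the last state and n_states to state 0
        row[(src + offset) % n_states] = prob
    return row

# Offsets for action '2' as defined in check_taxienv
offset_probs = {-1: 1/16, 0: 3/4, 1: 3/16}

row_0 = build_row(0, offset_probs, 6)  # offset -1 lands on state 5
row_2 = build_row(2, offset_probs, 6)  # no wrap-around needed here

print(row_0)
print(row_2)
```

Each row should sum to 1, which is the same condition that _verify asserts for every state and action.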

Task - Policy Iteration

Run policy iteration on the environment and generate the policy and expected reward

In [42]:
# 1.1 Policy Iteration
def policy_iteration(taxienv, gamma):
    # A list of all the states
    states = taxienv.possible_states
    # Initial values
    values = {s: 0 for s in states}

    # This is a dictionary mapping each state to its action -> e.g {0: '1', 1: '2', 2: '1'}
    policy = taxienv.initial_policy.copy()

    ## Begin code here

    # Hints - 
    # Do not hardcode anything
    # Only the final result is required for the results
    # Put any extra data in the "extra_info" dictionary for any plots etc
    # Use the helper functions taxienv.ride_rewards, taxienv.ride_probabilities,  taxienv.possible_actions
    # For terminating condition use the condition exactly mentioned in the pdf


    # Put your extra information needed for plots etc in this dictionary
    extra_info = {}

    ## Do not edit below this line

    # Final results
    return {"Expected Reward": values, "Policy": policy}, extra_info
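
The helper functions return, for one state and one action, a probability vector and a reward vector aligned with possible_states. The sketch below shows how a single backup could combine them, assuming the standard expected-return form (reward plus discounted value of the next state). The name q_value is hypothetical, and the terminating condition from the assignment PDF is not reproduced here.

```python
import numpy as np

def q_value(probs, rewards, values, gamma):
    """Expected one-step return: sum over s' of p(s') * (r(s') + gamma * V(s'))."""
    probs = np.asarray(probs, dtype=float)
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    return float(np.sum(probs * (rewards + gamma * values)))

# Vectors in the shape returned by ride_probabilities / ride_rewards
# for state index 4 and action '1' in the example environment
probs = [0.0, 0.0, 0.0, 0.5, 0.25, 0.25]
rewards = [0.0, 0.0, 0.0, 1.0, 0.0, 3.0]
values = [0.0] * 6  # the starter code initialises all values to zero

q_zero = q_value(probs, rewards, values, gamma=0.9)
q_ones = q_value(probs, rewards, [1.0] * 6, gamma=0.5)
print(q_zero, q_ones)
```

In the starter code, values is a dictionary keyed by state, so it would need to be converted to a list in the order of possible_states (for example [values[s] for s in states]) before being passed to a function like this one.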

Policy Iteration with different values of gamma

In [43]:
# 1.2 Policy Iteration with different values of gamma
def run_policy_iteration(env):
    gamma_values = np.arange(5, 100, 5)/100
    results, extra_info = {}, {}
    for gamma in gamma_values:
        results[gamma], extra_info[gamma] = policy_iteration(env, gamma)
    return results, extra_info
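
The discount factors used as dictionary keys above can be inspected directly. This is a small sketch that only reproduces the np.arange expression from run_policy_iteration; building the grid from integers and dividing by 100 gives a predictable number of points, whereas np.arange with a fractional step can be sensitive to floating-point rounding.

```python
import numpy as np

# Same expression as in run_policy_iteration
gamma_values = np.arange(5, 100, 5) / 100

n_gammas = len(gamma_values)
print(n_gammas)
print(gamma_values[0], gamma_values[-1])
```

The results dictionary therefore holds one entry per gamma, from the smallest to the largest value in this grid.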
In [44]:
# Do not edit this cell
def get_results(kwargs):

    taxienv = TaxiEnv_HW2(**kwargs)

    policy_iteration_results = run_policy_iteration(taxienv)[0]

    final_results = {}
    final_results["policy_iteration"] = policy_iteration_results

    return final_results
# Do not edit this cell, generate results with it as is

input_dir = os.path.join(DATASET_DIR, 'inputs')
if not os.path.exists(AICROWD_RESULTS_DIR):
    os.mkdir(AICROWD_RESULTS_DIR)

for params_file in os.listdir(input_dir):
  kwargs = np.load(os.path.join(input_dir, params_file), allow_pickle=True).item()
  results = get_results(kwargs)
  idx = params_file.split('_')[-1][:-4]
  np.save(os.path.join(AICROWD_RESULTS_DIR, 'results_' + idx), results)
In [48]:
# Check your score on the given test cases (There are more private test cases not provided)
result_folder = AICROWD_RESULTS_DIR
target_folder = os.path.join(DATASET_DIR, 'targets')

def check_algo_match(results, targets):
    param_matches = []
    for k in results:
        param_results = results[k]
        param_targets = targets[k]
        policy_match = param_results['Policy'] == param_targets['Policy']
        rv = [v for k, v in param_results['Expected Reward'].items()]
        tv = [v for k, v in param_targets['Expected Reward'].items()]
        rewards_match = np.allclose(rv, tv, rtol=3)
        equal = rewards_match and policy_match
        param_matches.append(equal)  # record whether this gamma matched the target
    return np.mean(param_matches)

def check_score(target_folder, result_folder):
    match = []
    for out_file in os.listdir(result_folder):
        res_file = os.path.join(result_folder, out_file)
        results = np.load(res_file, allow_pickle=True).item()
        idx = out_file.split('_')[-1][:-4]  # Extract the file number
        target_file = os.path.join(target_folder, f"targets_{idx}.npy")
        targets = np.load(target_file, allow_pickle=True).item()
        algo_match = []
        for k in targets:
            algo_results = results[k]
            algo_targets = targets[k]
            algo_match.append(check_algo_match(algo_results, algo_targets))
        match.append(np.mean(algo_match))  # average over algorithms for this file
    return np.mean(match)

if os.path.exists(target_folder):
    print("Shared data Score (normalized to 1):", check_score(target_folder, result_folder))
Shared data Score (normalized to 1): 0.8157894736842105

