In this problem we have a a taxi driver, who serves three cities A, B and C. On a regular workday, the taxi driver can find a new ride by choosing one of the following actions:
- Cruise the streets looking for a passenger.
- Go to the nearest taxi stand and wait in line.
- Wait for a call from the dispatcher (this is not possible in town B because of poor reception).
For a given town and a given action, there is a probability that the next trip will go to each of the towns A, B and C and a corresponding reward in monetary units associated with each such trip. This reward represents the income from the trip after all necessary expenses have been deducted. Please refer to the table below for the rewards and transition probabilities.
- pkij is the probability of getting a ride to town j, by choosing an action k while the driver was in town i.
- rkij is the immediate reward of getting a ride to town j, by choosing an action k while the driver was in town i.
- Implement the DP algorithm. Find the pseudocode below. Let S be the state space and A the action space.
- Tabulate the optimal policy and optimal value for each state in each round for N = 10.
- Consider a policy that always forces the driver to go to the nearest taxi stand, irrespective ofthe state. Is it optimal? Justify your answer.
You will be writing your solutions & making a submission through a notebook. You can follow the instructions in the starter notebook.
Resources section you will find data files that contains parameters for the environment for this problem.
Submissions will be made through a notebook following the instructions in the starter notebook.
- RL TAs
[Starter Notebook] RL - Taxi Problem