Python Reinforcement Learning

Solving the frozen lake problem

If you haven't fully grasped the concepts we have covered so far, don't worry; we will revisit all of them by working through the frozen lake problem.

Imagine there is a frozen lake stretching from your home to your office; you have to walk across the frozen lake to reach your office. But oops! There are holes in the frozen lake, so you have to be careful while walking to avoid getting trapped in them:

Look at the preceding diagram (the same grid is also rendered by the code sketch after the following list):

  • S is the starting position (home)
  • F is the frozen lake where you can walk
  • H denotes the holes, which you have to be careful to avoid
  • G is the goal (office)
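
To make this concrete, here is a minimal sketch of loading and rendering this grid, assuming OpenAI Gym's FrozenLake environment (the environment ID and rendering API may differ slightly depending on your Gym version):

```python
import gym

# Create the 4 x 4 frozen lake environment (S, F, H, and G tiles).
env = gym.make('FrozenLake-v0')

env.reset()   # place the agent back at the start state S
env.render()  # print the grid; the agent's current position is highlighted
```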

Okay, now, instead of you, let us use our agent to find the correct way to reach the office. The agent's goal is to find the optimal path to go from S to G without getting trapped at H. How can the agent achieve this? We give the agent +1 point as a reward if it walks correctly on the frozen lake and 0 points if it falls into a hole, so that the agent can determine which actions are right. The agent will now try to find the optimal policy. The optimal policy is the path that maximizes the agent's reward. If the agent is maximizing the reward, it is effectively learning to skip the holes and reach the destination.
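
As a quick illustration of how the agent interacts with this environment, the following sketch takes a few random actions and prints what the environment returns at each step (it assumes the classic Gym step API that returns four values):

```python
import gym

env = gym.make('FrozenLake-v0')
state = env.reset()

for _ in range(5):
    action = env.action_space.sample()                  # pick a random action
    next_state, reward, done, info = env.step(action)   # apply it to the environment
    print(state, action, next_state, reward, done)
    state = next_state
    if done:               # the episode ends when the agent falls into H or reaches G
        state = env.reset()
```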

We can model our problem as an MDP, which we studied earlier. An MDP consists of the following:

  • States: Set of states. Here we have 16 states (each little square box in the grid).
  • Actions: Set of all possible actions (left, right, up, down; these are all the four possible actions our agent can take in our frozen lake environment).
  • Transition probabilities: The probability of moving from one state (F) to another state (H) by performing an action a.
  • Reward probabilities: The probability of receiving a reward while moving from one state (F) to another state (H) by performing an action a (see the sketch after this list).
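
In Gym's FrozenLake environment, these model dynamics are exposed through the P attribute of the underlying environment; the following is a minimal sketch of inspecting them (the exact attribute path may vary with the Gym version):

```python
import gym

env = gym.make('FrozenLake-v0')

# env.unwrapped.P[state][action] is a list of
# (transition probability, next state, reward, done) tuples.
state, action = 0, 1   # state 0 is S; action 1 is "down" in this environment
for prob, next_state, reward, done in env.unwrapped.P[state][action]:
    print(prob, next_state, reward, done)
```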

Now our objective is to solve the MDP. Solving the MDP implies finding the optimal policy. We now introduce three special functions:

  • Policy function: Specifies what action to perform in each state
  • Value function: Specifies how good a state is
  • Q function: Specifies how good an action is in a particular state

When we say how good, what does that really mean? It means how good it is, in terms of the rewards the agent can obtain, to be in a given state or to perform a given action in that state.
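
In code, these three functions are commonly represented as simple arrays; the following sketch assumes the 16 states and 4 actions of our 4 x 4 frozen lake:

```python
import numpy as np

n_states, n_actions = 16, 4

policy = np.zeros(n_states, dtype=int)        # policy: one action chosen per state
value_function = np.zeros(n_states)           # V(s): how good it is to be in each state
q_function = np.zeros((n_states, n_actions))  # Q(s, a): how good each action is in each state
```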

Then, we represent the value function and the Q function using a special equation called the Bellman optimality equation. If we solve this equation, we can find the optimal policy. Here, solving the equation means finding the right value function and policy. If we find the right value function and policy, we obtain the optimal path, which yields the maximum reward.
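
For reference, the Bellman optimality equations for the value function and the Q function are commonly written as follows, where P denotes the transition probabilities, R the rewards, and γ the discount factor:

$$V^*(s) = \max_{a} \sum_{s'} P(s' \mid s, a)\,\big[R(s, a, s') + \gamma V^*(s')\big]$$

$$Q^*(s, a) = \sum_{s'} P(s' \mid s, a)\,\big[R(s, a, s') + \gamma \max_{a'} Q^*(s', a')\big]$$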

We will use a special technique called dynamic programming (DP) to solve the Bellman optimality equation. To apply DP, the model dynamics have to be known in advance, which basically means that the environment's transition probabilities and reward probabilities have to be known in advance. Since we know the model dynamics of the frozen lake environment, we can use DP here. We use two special DP algorithms to find the optimal policy (a short sketch of value iteration follows the list):

  • Value iteration
  • Policy iteration
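
As a preview, here is a minimal sketch of value iteration for the frozen lake, assuming Gym's FrozenLake environment; the discount factor and the number of sweeps are illustrative choices, and policy iteration is covered separately:

```python
import numpy as np
import gym

env = gym.make('FrozenLake-v0')
n_states = env.observation_space.n
n_actions = env.action_space.n
gamma = 0.99   # discount factor (illustrative choice)

# Repeatedly apply the Bellman optimality backup to every state.
value_table = np.zeros(n_states)
for _ in range(1000):
    previous = np.copy(value_table)
    for s in range(n_states):
        q_values = [
            sum(prob * (reward + gamma * previous[s_next])
                for prob, s_next, reward, done in env.unwrapped.P[s][a])
            for a in range(n_actions)
        ]
        value_table[s] = max(q_values)   # best expected return over all actions

# Extract a greedy policy from the converged value function.
policy = np.array([
    np.argmax([
        sum(prob * (reward + gamma * value_table[s_next])
            for prob, s_next, reward, done in env.unwrapped.P[s][a])
        for a in range(n_actions)
    ])
    for s in range(n_states)
])
print(policy.reshape(4, 4))   # one action (0-3) per cell of the grid
```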