Python Reinforcement Learning
上QQ阅读APP看书,第一时间看更新

The Bellman equation and optimality

The Bellman equation, named after Richard Bellman, American mathematician, helps us to solve MDP. It is omnipresent in RL. When we say solve the MDP, it actually means finding the optimal policies and value functions. There can be many different value functions according to different policies. The optimal value function  is the one which yields maximum value compared to all the other value functions:

 

Similarly, the optimal policy is the one which results in an optimal value function.

Since the optimal value function  is the one that has a higher value compared to all other value functions (that is, maximum return), it will be the maximum of the Q function. So, the optimal value function can easily be computed by taking the maximum of the Q function as follows:

  -- (3)

The Bellman equation for the value function can be represented as, (we will see how we derived this equation in the next topic):

It indicates the recursive relation between a value of a state and its successor state and the average over all possibilities.

Similarly, the Bellman equation for the Q function can be represented as follows:

 --- (4)

Substituting equation (4) in (3), we get:

The preceding equation is called a Bellman optimality equation. In the upcoming sections, we will see how to find optimal policies by solving this equation.