2.2 Reinforcement Learning
2.2.2 Markov Decision Process (MDP)
2.2.2.1 Planning
Finding (exactly or approximately) the optimal policy for a known MDP is known as planning. If an MDP is known (i.e. $T$ and $R$ are known), then in theory we can solve the Bellman equations (Bellman, 1957) to calculate the state and action value functions for some policy $\pi$:
\[ V^\pi(s) = \sum_a \pi(a \mid s)\, Q^\pi(s,a), \qquad Q^\pi(s,a) = R(s,a) + \gamma \sum_{s'} T(s' \mid s,a)\, V^\pi(s'). \]
In particular, when we wish to find $V^*$ and $Q^*$, we replace $\sum_a \pi(a \mid s)$, $V^\pi$ and $Q^\pi$ with $\max_a$, $V^*$ and $Q^*$ respectively. Then $\pi^*(s) = \arg\max_a Q^*(s,a)$ is the optimal policy. Occasionally, it will be useful to talk about the Bellman operator $T^\pi$ and the Bellman optimality operator $T^*$, which are convenient ways of expressing the Bellman equations. Let $N = |S|$; then for a deterministic policy $\pi$ the linear Bellman operator $T^\pi : \mathbb{R}^N \to \mathbb{R}^N$ is defined on vectors $v \in \mathbb{R}^N$ as
\[ (T^\pi v)(s) = R(s, \pi(s)) + \gamma \sum_{s'} T(s' \mid s, \pi(s))\, v(s'). \]
Similarly, the non-linear Bellman optimality operator $T^* : \mathbb{R}^N \to \mathbb{R}^N$ is defined on vectors $v \in \mathbb{R}^N$ as
\[ (T^* v)(s) = \max_a \Big( R(s,a) + \gamma \sum_{s'} T(s' \mid s,a)\, v(s') \Big). \]
$V^\pi$ is the fixed point of the Bellman operator $T^\pi$, i.e. $T^\pi V^\pi = V^\pi$. Indeed, one can show that it is the unique fixed point. The optimal value function $V^*$ is the unique fixed point of $T^*$. Both $T^\pi$ and $T^*$ are monotonic operators and contraction mappings under the max-norm with contraction factor $\gamma$.
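The operator view above can be checked numerically. The following is a minimal sketch, assuming a hypothetical two-state, two-action MDP whose transition and reward numbers are purely illustrative; the names `backup`, `bellman_pi`, `bellman_opt` and `max_norm` are introduced here and do not come from the text. It applies both operators to two arbitrary vectors to observe the contraction factor, then iterates $T^\pi$ to approach its fixed point.

```python
# Hypothetical 2-state, 2-action MDP; all numbers are illustrative assumptions.
GAMMA = 0.9
# T[s][a][s2] = probability of moving from state s to state s2 under action a
T = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.7, 0.3], [0.05, 0.95]]]
# R[s][a] = expected immediate reward
R = [[0.0, 1.0],
     [2.0, 0.5]]
STATES = range(2)
ACTIONS = range(2)


def backup(v, s, a):
    """One-step lookahead: R(s,a) + gamma * sum_{s'} T(s'|s,a) v(s')."""
    return R[s][a] + GAMMA * sum(T[s][a][s2] * v[s2] for s2 in STATES)


def bellman_pi(v, pi):
    """Bellman operator T^pi for a deterministic policy pi (list: state -> action)."""
    return [backup(v, s, pi[s]) for s in STATES]


def bellman_opt(v):
    """Bellman optimality operator T^*."""
    return [max(backup(v, s, a) for a in ACTIONS) for s in STATES]


def max_norm(u, v):
    """Max-norm distance between two value vectors."""
    return max(abs(x - y) for x, y in zip(u, v))


# Contraction: both operators shrink max-norm distances by a factor of at most gamma.
u, v = [0.0, 0.0], [10.0, -3.0]
pi = [1, 0]
ratio_pi = max_norm(bellman_pi(u, pi), bellman_pi(v, pi)) / max_norm(u, v)
ratio_opt = max_norm(bellman_opt(u), bellman_opt(v)) / max_norm(u, v)

# Fixed point: repeated application of T^pi approaches V^pi.
v_pi = [0.0, 0.0]
for _ in range(500):
    v_pi = bellman_pi(v_pi, pi)
residual = max_norm(bellman_pi(v_pi, pi), v_pi)
```

Both ratios should come out no larger than `GAMMA`, and `residual` should be numerically zero, which is what the fixed-point and contraction statements predict for this toy instance.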
Value iteration (VI). Value iteration uses the Bellman equations as stated above in an iterative way, in order to find the optimal policy. At each iteration $k+1$, value iteration calculates the action-values for each state-action pair from the values estimated at the previous stage $k$. It then obtains $V_k(s)$ from $Q_k(s,a)$ via a simple max over actions (or a sum weighted by $\pi(a \mid s)$ when evaluating a fixed policy). This approach is also called successive approximations.
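\[ Q_{k+1}(s,a) = R(s,a) + \gamma \sum_{s' \in S} T(s' \mid s,a)\, V_k(s'), \qquad V_k(s) = \max_a Q_k(s,a). \]
We start with $V_0(s) = \max_a R(s,a)$. Value iteration is guaranteed to converge asymptotically ($\lim_{k \to \infty} \|V_k - V^*\|_\infty = 0$), with each iteration taking $O(|S|^2 |A|)$ steps. In practice, we set some threshold $\theta$ such that we stop when $\|V_k - V_{k-1}\|_\infty < \theta$. Puterman (1994) showed that if $\theta = \frac{\varepsilon (1-\gamma)}{2\gamma}$ then $\|V_k - V^*\|_\infty < \varepsilon$ for a given $\varepsilon > 0$.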
An elegant way of looking at value iteration is simply as calculating, for some vector $v \in \mathbb{R}^{|S|}$, the limit of repeated applications of the Bellman optimality operator, $\lim_{k \to \infty} (T^*)^k v = V^*$.
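The update and stopping rule described for value iteration can be sketched as follows. This is an illustrative implementation under stated assumptions, not a reference one: the two-state MDP, its numbers, and the function name `value_iteration` are hypothetical, and the threshold is the $\theta = \varepsilon(1-\gamma)/(2\gamma)$ quoted from Puterman (1994).

```python
# Hypothetical 2-state, 2-action MDP; all numbers are illustrative assumptions.
GAMMA = 0.9
T = [[[0.9, 0.1], [0.2, 0.8]],      # T[s][a][s2] = P(s2 | s, a)
     [[0.7, 0.3], [0.05, 0.95]]]
R = [[0.0, 1.0],                    # R[s][a] = expected immediate reward
     [2.0, 0.5]]


def value_iteration(T, R, gamma, epsilon=1e-6):
    """Synchronous value iteration; returns (V, Q, greedy policy)."""
    n_states, n_actions = len(R), len(R[0])
    theta = epsilon * (1 - gamma) / (2 * gamma)   # stopping threshold
    v = [max(R[s]) for s in range(n_states)]      # V_0(s) = max_a R(s, a)
    while True:
        # Q_{k+1}(s, a) computed from V_k for every state-action pair
        q = [[R[s][a] + gamma * sum(T[s][a][s2] * v[s2] for s2 in range(n_states))
              for a in range(n_actions)]
             for s in range(n_states)]
        v_new = [max(q[s]) for s in range(n_states)]
        delta = max(abs(x - y) for x, y in zip(v_new, v))
        v = v_new
        if delta < theta:
            break
    # Greedy policy with respect to the final action-values (first maximiser on ties)
    policy = [q[s].index(max(q[s])) for s in range(n_states)]
    return v, q, policy


v_star, q_star, pi_star = value_iteration(T, R, GAMMA)
```

Each pass of the loop touches every $(s, a, s')$ triple once, which is where the $O(|S|^2 |A|)$ cost per iteration comes from.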
Policy iteration (PI). Policy iteration also uses a dynamic programming approach to finding an optimal policy, but instead of iterating the value function, we now iterate the policy. At stage $k$ we have some policy $\pi_k$, for which we can evaluate $V^{\pi_k}$. We then calculate $\pi_{k+1}$ as a greedy policy with respect to $V^{\pi_k}$:
\[ \pi_{k+1}(s) = \arg\max_a \Big( R(s,a) + \gamma \sum_{s'} T(s' \mid s,a)\, V^{\pi_k}(s') \Big). \]
The value of the policy $\pi_k$ may be determined via value iteration (for the fixed policy) or by solving the system of linear equations defined by the Bellman equations. Policy iteration is guaranteed to converge in finite time, i.e. $\pi_k = \pi_{k-1}$ for some $k$, and this policy is optimal.
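The evaluate-then-improve loop of policy iteration can be sketched as below. The MDP, its numbers, and the names `evaluate`, `greedy` and `policy_iteration` are assumptions made for illustration. Evaluation here uses repeated application of the Bellman operator for the fixed policy, one of the two options mentioned above; the tolerance `tol` is an arbitrary choice.

```python
# Hypothetical 2-state, 2-action MDP; all numbers are illustrative assumptions.
GAMMA = 0.9
T = [[[0.9, 0.1], [0.2, 0.8]],      # T[s][a][s2] = P(s2 | s, a)
     [[0.7, 0.3], [0.05, 0.95]]]
R = [[0.0, 1.0],                    # R[s][a] = expected immediate reward
     [2.0, 0.5]]


def lookahead(v, s, a, T, R, gamma):
    """R(s,a) + gamma * sum_{s'} T(s'|s,a) v(s')."""
    return R[s][a] + gamma * sum(T[s][a][s2] * v[s2] for s2 in range(len(R)))


def evaluate(policy, T, R, gamma, tol=1e-10):
    """Iterative policy evaluation: apply T^pi until successive values differ by < tol."""
    v = [0.0] * len(R)
    while True:
        v_new = [lookahead(v, s, policy[s], T, R, gamma) for s in range(len(R))]
        if max(abs(x - y) for x, y in zip(v_new, v)) < tol:
            return v_new
        v = v_new


def greedy(v, T, R, gamma):
    """Greedy policy with respect to v (first maximiser on ties)."""
    n_actions = len(R[0])
    return [max(range(n_actions), key=lambda a: lookahead(v, s, a, T, R, gamma))
            for s in range(len(R))]


def policy_iteration(T, R, gamma):
    """Alternate evaluation and greedy improvement until the policy stops changing."""
    policy = [0] * len(R)            # arbitrary initial policy
    n_improvements = 0
    while True:
        v = evaluate(policy, T, R, gamma)
        new_policy = greedy(v, T, R, gamma)
        n_improvements += 1
        if new_policy == policy:     # pi_k == pi_{k-1}: stop
            return policy, v, n_improvements
        policy = new_policy


pi_final, v_final, n_rounds = policy_iteration(T, R, GAMMA)
```

Because the evaluation step is approximate, exact ties between actions could in principle make the stopping test oscillate; the toy numbers used here have no ties.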
Value iteration, policy iteration and their variants can be performed either synchronously or asynchronously. In the descriptions of the algorithms above, we saw the synchronous case, where the updates at each iteration were over all state(-action) pairs. However, one can also perform updates for fewer states, or even one state at each iteration instead. Since the Bellman operator is a contraction mapping, the value function is guaranteed to improve with each iteration, and as long as every state is seen infinitely often we are guaranteed convergence (Sutton and Barto, 1998).
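As an illustration of the asynchronous case, the sketch below updates a single, randomly chosen state per step and writes the result in place, so later updates immediately see the newest values. The MDP and its numbers, the function name `async_value_iteration`, the uniform random state ordering, and the fixed update budget are all assumptions made for this example.

```python
import random

# Hypothetical 2-state, 2-action MDP; all numbers are illustrative assumptions.
GAMMA = 0.9
T = [[[0.9, 0.1], [0.2, 0.8]],      # T[s][a][s2] = P(s2 | s, a)
     [[0.7, 0.3], [0.05, 0.95]]]
R = [[0.0, 1.0],                    # R[s][a] = expected immediate reward
     [2.0, 0.5]]


def async_value_iteration(T, R, gamma, n_updates=5000, seed=0):
    """In-place value iteration that backs up one random state per update."""
    rng = random.Random(seed)
    n_states, n_actions = len(R), len(R[0])
    v = [0.0] * n_states
    for _ in range(n_updates):
        s = rng.randrange(n_states)  # every state keeps being selected over time
        v[s] = max(
            R[s][a] + gamma * sum(T[s][a][s2] * v[s2] for s2 in range(n_states))
            for a in range(n_actions)
        )
    return v


v_async = async_value_iteration(T, R, GAMMA)
```

A finite budget only approximates the "seen infinitely often" condition, but on a problem this small every state is backed up many times and the result satisfies the Bellman optimality equation to numerical precision.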
Comparison of VI and PI. The conventional wisdom is that the policy can converge (to the optimal policy) long before the values converge, so value iteration may spend unnecessary steps refining the value function even though it is already “good enough” for the purposes of extracting the optimal policy. Thus, using policy iteration with some fixed number of value determination steps for each policy update can be faster. Unfortunately, it is hard to know in advance how many steps of value determination are necessary, just as it is hard to know what accuracy threshold to fix for value iteration to give the optimal policy. In practice, the (relative) performance of both methods depends on the size of the domain and the structure of the value function.
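The claim that the policy can settle before the values do can be observed on a small example. The sketch below, which assumes a hypothetical two-state MDP with illustrative numbers and an arbitrary convergence tolerance of `1e-8`, runs value iteration and records both the last iteration at which the greedy policy changed and the iteration at which the values met the tolerance. It illustrates the effect on one instance and says nothing about how large the gap is in general.

```python
# Hypothetical 2-state, 2-action MDP; all numbers are illustrative assumptions.
GAMMA = 0.9
T = [[[0.9, 0.1], [0.2, 0.8]],      # T[s][a][s2] = P(s2 | s, a)
     [[0.7, 0.3], [0.05, 0.95]]]
R = [[0.0, 1.0],                    # R[s][a] = expected immediate reward
     [2.0, 0.5]]


def greedy_backup(v):
    """One synchronous backup: returns (new values, greedy policy)."""
    q = [[R[s][a] + GAMMA * sum(T[s][a][s2] * v[s2] for s2 in range(2))
          for a in range(2)]
         for s in range(2)]
    return [max(row) for row in q], [row.index(max(row)) for row in q]


v = [max(row) for row in R]          # V_0(s) = max_a R(s, a)
policy, policy_last_changed, k = None, 0, 0
while True:
    k += 1
    v_new, new_policy = greedy_backup(v)
    if new_policy != policy:         # remember when the greedy policy last moved
        policy, policy_last_changed = new_policy, k
    delta = max(abs(x - y) for x, y in zip(v_new, v))
    v = v_new
    if delta < 1e-8:                 # arbitrary tolerance for "values converged"
        break
values_converged = k
```

Comparing `policy_last_changed` with `values_converged` shows how many of the backups were spent refining values after the greedy policy had already stopped changing.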
Monte Carlo tree search (MCTS) methods. Monte Carlo tree search methods constitute a family of algorithms that build expectimax search trees over the search space. The MCTS agent samples trajectories from the environment that terminate after some horizon or run to the end of the episode. Each node in the tree stores the average return of the playouts made after taking the action sequence that led to the node. MCTS methods use heuristics to select the nodes most likely to lead to higher expected reward. At any point in time, the action with the highest value at the root node is the agent’s current best guess of the optimal action. In this sense MCTS methods are anytime algorithms. MCTS algorithms operate in the following stages.
1. Selection: Traverse the tree from the root downward, selecting child nodes until an unexpanded node is reached.
2. Expansion: If the unexpanded node does not end the episode, expand the node to all of its children and select one of them.
3. Playout: Play out a (often random) policy from the selected node.
4. Value update: Backpropagate the value of the sampled trajectory that was played out up the tree, following the path that was traversed from the root.
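The four stages above can be sketched on a toy problem. Everything specific in this example is an assumption made for illustration: the emulator `step` (a corridor of cells 0 to 4 with reward 1 for reaching cell 4), the discount factor, the horizon, and in particular the selection heuristic, which is a simple epsilon-greedy placeholder rather than any rule prescribed by the text.

```python
import random

ACTIONS = (0, 1)   # 0 = move left, 1 = move right
GAMMA = 0.9        # discount factor (assumed)
GOAL = 4


def step(state, action):
    """Hypothetical emulator: corridor of cells 0..4, reward 1 on reaching cell 4."""
    nxt = min(GOAL, max(0, state + (1 if action == 1 else -1)))
    done = nxt == GOAL
    return nxt, (1.0 if done else 0.0), done


class Node:
    def __init__(self, state, parent=None, reward=0.0, terminal=False):
        self.state, self.parent = state, parent
        self.reward, self.terminal = reward, terminal  # reward received on entry
        self.children = {}                             # action -> Node, empty until expanded
        self.visits, self.total = 0, 0.0               # playout count and summed returns

    def mean(self):
        return self.total / self.visits if self.visits else 0.0


def select_child(node, rng, eps=0.2):
    """Placeholder heuristic: unvisited children first, then epsilon-greedy on the mean."""
    kids = list(node.children.values())
    unvisited = [k for k in kids if k.visits == 0]
    if unvisited:
        return unvisited[0]
    if rng.random() < eps:
        return rng.choice(kids)
    return max(kids, key=Node.mean)


def mcts(root_state, n_iter=200, horizon=6, seed=0):
    rng = random.Random(seed)
    root = Node(root_state)
    for _ in range(n_iter):
        node, ret, depth = root, 0.0, 0
        # 1. Selection: descend until a node without children is reached.
        while node.children:
            node = select_child(node, rng)
            ret += GAMMA ** depth * node.reward
            depth += 1
        # 2. Expansion: add all children, then pick one of them.
        if not node.terminal and depth < horizon:
            for a in ACTIONS:
                s2, r, done = step(node.state, a)
                node.children[a] = Node(s2, node, r, done)
            node = rng.choice(list(node.children.values()))
            ret += GAMMA ** depth * node.reward
            depth += 1
        # 3. Playout: uniformly random policy until the horizon or episode end.
        state, done = node.state, node.terminal
        while not done and depth < horizon:
            state, r, done = step(state, rng.choice(ACTIONS))
            ret += GAMMA ** depth * r
            depth += 1
        # 4. Value update: back the sampled return up the traversed path.
        while node is not None:
            node.visits += 1
            node.total += ret
            node = node.parent
    best = max(ACTIONS, key=lambda a: root.children[a].mean())
    return best, root


best_action, tree = mcts(root_state=3)
```

Starting next to the goal, moving right ends the episode with the full reward, so the root's estimate for that action dominates. Because the root statistics are valid after any number of iterations, the loop can be cut short at any time, which is the anytime property noted above.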
Upper Confidence Bounds for Trees (UCT) by Kocsis and Szepesvári (2006) is an MCTS algorithm that uses UCB from the bandit setting for exploration in the generative model setting. We will use UCT in Chapter 6, where we have access to an emulator that, given a state and an action, can execute the action in that state and return the received reward. The pseudocode for UCT is provided in Algorithm 1.
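To make the role of UCB in the selection stage concrete, the sketch below scores each child by its mean playout value plus a UCB1-style exploration bonus. It is not a transcription of Algorithm 1: the tuple layout for children, the function names, the default exploration constant `c`, and the convention of trying unvisited actions first are assumptions made for this illustration.

```python
import math


def ucb1_score(mean_value, child_visits, parent_visits, c=math.sqrt(2)):
    """Mean value plus an exploration bonus that shrinks as the child is visited more."""
    return mean_value + c * math.sqrt(math.log(parent_visits) / child_visits)


def select_action(children, c=math.sqrt(2)):
    """Pick an action from `children`, a list of (action, visits, total_return) tuples.

    The tuple layout is hypothetical. Unvisited actions are tried first (assumed
    convention), which also avoids dividing by zero in the bonus term.
    """
    for action, visits, _ in children:
        if visits == 0:
            return action
    parent_visits = sum(visits for _, visits, _ in children)
    best = max(children,
               key=lambda ch: ucb1_score(ch[2] / ch[1], ch[1], parent_visits, c))
    return best[0]


# A rarely tried action can win on its bonus even with a lower mean value.
explore_choice = select_action([("a", 100, 60.0), ("b", 1, 0.5)])
# With c = 0 the bonus vanishes and the rule is purely greedy on the mean.
greedy_choice = select_action([("a", 100, 60.0), ("b", 1, 0.5)], c=0.0)
```

Larger values of `c` favour actions that have been tried less often, while `c = 0` reduces the rule to picking the highest mean; the appropriate constant depends on the scale of the returns.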