Approximation for Policy Iteration - Temporal Markov Decision Problems : Formalization and Reso

backups are performed asynchronously and in a relevant order.

[Bertsekas and Tsitsiklis, 1996] lay the basis of Asynchronous Policy Iteration. At iteration $n$, we select a subset $S_n$ of $S$ and perform a policy Bellman backup on all $s \in S_n$. This yields policy $\pi_{n+1}$ with:

$$\pi_{n+1}(s) = \begin{cases} \operatorname{argmax}_{a \in A} \left[ r(s,a) + \gamma \sum_{s' \in S} P(s'|s,a)\, V^{\pi_n}(s') \right] & \text{if } s \in S_n \\ \pi_n(s) & \text{if } s \notin S_n \end{cases} \tag{12.1}$$

One can similarly perform a certain number of $V_{k+1} = L_{\pi_n} V_k$ operations to update the value function on the states of $S_n$.

Hence, if one alternates one policy update and one value function update, the latter is equivalent to a Value Iteration update over the states of $S_n$. Similarly, if the number of value function updates is unbounded, we obtain the standard Policy Iteration method. Finally, if we alternate one policy update and $m_n$ value function updates, we obtain the Modified Policy Iteration algorithm.
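The scheme described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the array layout (`P` of shape actions x states x states, `r` of shape states x actions), the `subsets` callable returning $S_n$, and the parameter `m` (number of value backups per iteration) are all hypothetical names introduced here. The policy update uses the current value estimate in place of $V^{\pi_n}$, and the constant initial value is chosen so that $V_0 \le L_{\pi_0} V_0$ holds for any initial policy.

```python
import numpy as np

def async_policy_iteration(P, r, gamma, subsets, m=1, n_iters=200):
    """Asynchronous Policy Iteration sketch.

    P: (A, S, S) array, P[a, s, s2] = probability of s -> s2 under action a.
    r: (S, A) array of rewards.
    subsets: callable n -> iterable of state indices (the subset S_n).
    m: number of value backups V <- L_pi V performed on S_n per iteration.
    """
    n_states, _ = r.shape
    # Pessimistic constant start: guarantees V0 <= L_pi V0 for every policy.
    V = np.full(n_states, r.min() / (1.0 - gamma))
    pi = np.zeros(n_states, dtype=int)
    for n in range(n_iters):
        S_n = list(subsets(n))
        # Policy update restricted to S_n (current V stands in for V^{pi_n}).
        for s in S_n:
            q = r[s] + gamma * (P[:, s, :] @ V)  # one Q-value per action
            pi[s] = int(np.argmax(q))
        # m value updates restricted to S_n, performed in place.
        for _ in range(m):
            for s in S_n:
                V[s] = r[s, pi[s]] + gamma * (P[pi[s], s] @ V)
    return pi, V
```

With `m=1` each visit to a state amounts to a Value Iteration backup, while a large `m` with `subsets` returning every state approaches standard Policy Iteration; intermediate values of `m` correspond to Modified Policy Iteration.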

[Bertsekas and Tsitsiklis, 1996] prove that if the initial policy and value function satisfy $V_0 \le L_{\pi_0} V_0$ and if value function and policy updates are performed infinitely often in all states as $n$ tends to $+\infty$, then $V_n$ converges to $V^*$ and the policy converges to an optimal policy.

Finally, one can see Asynchronous Policy Iteration as an elegant way of formulating both Value and Policy Iteration algorithms. It also naturally introduces the use of approximate value functions for $V^\pi$ and helps distinguish between “conservative” methods (Modified Policy Iteration, matrix inversion) and “less conservative” methods (approximate policy evaluation) when analyzing convergence and optimality. Asynchronous Policy Iteration can similarly be presented from the point of view of actor-critic architectures.

12.2 Approximation for Policy Iteration

We have mentioned several times the possibility — and sometimes the need — for approximate policy evaluation methods. This section discusses the drastic assumption we made earlier about the existence of an evaluation black box and presents the different architectures of Approximate Policy Iteration.

12.2.1 Why Policy Iteration?

Let us start with a common-sense question: why would one prefer a Policy Iteration method to a Value Iteration one? In general, there is no particular reason to choose Policy Iteration over Value Iteration. Experience shows that exact Policy Iteration tends to converge in fewer iterations but more time (because of the evaluation phases) than Value Iteration, but this rule of thumb does not always apply and the time taken by the evaluation phase quickly becomes prohibitive.

Value Iteration methods have also received more attention in the Planning community because of their efficient representation of reward-to-go functions and the ease of manipulation of value functions. These value functions often have good properties, such as convexity,

Chapter 12. Real-Time Policy Iteration

monotonic evolution across the iterations, etc. On top of that, using value functions as a unified way of storing information facilitates the construction of asynchronous methods for value function optimization and allows the use of results from heuristic search.

However, in order to make problems tractable, one often turns towards approximation schemes. Part II is a good illustration of how value functions can be more complex objects than policies. [Anderson, 2000] analyses why approximating a policy can be easier than approximating a value function. By comparing Q-learning (cf. [Watkins, 1989]) and the direct gradient algorithm of [Baxter and Bartlett, 1999], both based on neural network approximators, Anderson presents an example where Q-learning oscillates between the optimal policy and a suboptimal policy, while the direct policy search method converges to the optimal policy. As mentioned in the conclusion of that paper, such an illustration does not support any general conclusion about the relative merits of policy-only versus value function methods.

However, it suggests that it might be relevant to examine the complexity of approximating value functions or policies for the problem at hand in order to choose the way we represent the agent’s strategy.

Since value functions often hold more information than policies, they might be harder to approximate with good enough granularity. Dedicating the resources of a function approximator to representing both relevant and irrelevant value function variations can be useless with respect to the final policy and can lead to non-convergence of algorithms or degradation of good policies.

Policy Iteration is basically an algorithm which explicitly stores the policy. However, it is often used in conjunction with value function storage, in order to facilitate policy evaluation. Being both a policy-based and a value-function-based algorithm gives Policy Iteration its robustness (the ability to actually find a good policy) but also its long execution time, because of the alternation of updates on the policy and the value function. While the optimization is performed directly on the policy, it needs to be propagated to the value function during the evaluation phase, which is the source of the drawbacks of Policy Iteration methods. Section 12.1.2’s analysis of Asynchronous Policy Iteration further underlined the coupling between value function and policy.

For the reasons presented above, and similarly to the first ideas of chapter 9, our approach has turned towards approximate Policy Iteration methods and towards direct improvement of the decision variables.

12.2.2 Convergence of Approximate Policy Iteration

As mentioned earlier, exact Policy Iteration converges in practice in fewer iterations than Value Iteration but usually takes more time because of the evaluation phase’s computational cost. Thus, as for Value Iteration, it is common to use approximation schemes for this evaluation phase. This approximation’s goal is to reduce the complexity of a policy’s evaluation while still trying to fit its value function as closely as possible.

Similarly to the Value Iteration case, one has a few results for Approximate Policy Iteration. The first of these results is that, for the same reason as presented in section 6.4:

Approximate Policy Iteration usually does not converge.

Depending on the approximation’s quality, the first iterations yield a close-to-optimal policy which then oscillates around the optimal policy. The previous section’s argument was that this policy might oscillate “less” than the approximate value function itself and therefore is more robust to approximation.

In the case of discounted problems, [Bertsekas and Tsitsiklis, 1996] show that if the approximation error (the critic’s error) is bounded as:

$$\exists\, \epsilon \in \mathbb{R}^+ \;/\; \forall f \in \mathcal{F}(S, \mathbb{R}),\quad \|\mathrm{Ap}(f) - f\|_\infty \le \epsilon \tag{12.2}$$

then one can bound the optimality loss due to approximation by:

$$\limsup_{k \to \infty} \|V^* - V^{\pi_k}\|_\infty \le \frac{2\gamma\epsilon}{(1-\gamma)^2} \tag{12.3}$$
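To get a feel for the magnitude of bound 12.3, here is a worked instance. The numerical values of $\gamma$ and of the approximation error $\epsilon$ (the bound on the critic's error in 12.2) are illustrative assumptions, not values taken from the text.

```latex
% Illustrative values only: epsilon = 0.01 in both cases.
\gamma = 0.9:\quad
  \frac{2\gamma\epsilon}{(1-\gamma)^2}
  = \frac{2 \times 0.9 \times 0.01}{(0.1)^2}
  = \frac{0.018}{0.01} = 1.8
\qquad
\gamma = 0.99:\quad
  \frac{2\gamma\epsilon}{(1-\gamma)^2}
  = \frac{2 \times 0.99 \times 0.01}{(0.01)^2}
  = \frac{0.0198}{0.0001} = 198
```

The $(1-\gamma)^2$ denominator means that the guarantee degrades quickly as the discount factor approaches one: for the same evaluation error, the bound is roughly a hundred times looser at $\gamma = 0.99$ than at $\gamma = 0.9$.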

We can even be a little more precise and write that:

$$\limsup_{k \to \infty} \|V^* - V^{\pi_k}\|_\infty \le \frac{2\gamma}{(1-\gamma)^2} \left( \sup_{j \le k} \|\mathrm{Ap}(V^{\pi_j}) - V^{\pi_j}\|_\infty \right) \tag{12.4}$$

In the case of undiscounted Stochastic Shortest Path problems, a similar bound exists, provided that $\epsilon$ is small enough. To establish this bound, one needs to introduce the quantity $\rho_\pi$ defined in equation 12.5. This $\rho_\pi$ is the maximum probability that the process is in a non-goal state after $|S|$ steps of applying $\pi$, starting in a non-goal state $s$:

$$\rho_\pi = \max_{s \notin \mathit{GoalStates}} \Pr\left( s_{|S|} \notin \mathit{GoalStates} \mid s_0 = s, \pi \right) \tag{12.5}$$
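For a finite problem, the quantity of equation 12.5 can be computed directly from the transition matrix induced by the policy. The sketch below is a minimal illustration under assumptions that are not in the text: goal states are taken to be absorbing, and the function name `rho_pi` and its arguments are hypothetical.

```python
import numpy as np

def rho_pi(P_pi, goal_states):
    """Maximum, over non-goal start states, of the probability of still
    being outside the goal after |S| steps under the policy.

    P_pi: (S, S) transition matrix induced by the policy.
    goal_states: iterable of goal state indices (assumed absorbing).
    """
    n_states = P_pi.shape[0]
    goal = set(goal_states)
    non_goal = [s for s in range(n_states) if s not in goal]
    # Transitions restricted to non-goal states.
    Q = P_pi[np.ix_(non_goal, non_goal)]
    # Row sums of Q^{|S|}: probability of never having reached the goal
    # within |S| steps, for each non-goal start state.
    still_out = np.linalg.matrix_power(Q, n_states).sum(axis=1)
    return float(still_out.max())
```

A value strictly below one indicates a proper policy in the sense used below, while a value of one reveals at least one start state from which the goal cannot be reached within $|S|$ steps.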

If we consider the sequence of policies generated by the Approximate Policy Iteration algorithm, we can introduce $\rho_k$:

$$\rho_k = \sup_{j \le k} \rho_{\pi_j} \tag{12.6}$$

And similarly, for all proper policies, i.e. for all policies such that $\rho_\pi < 1$ (policies that eventually lead to the goal with probability one), one can define $\rho$:

$$\rho = \sup_{\pi \in \mathit{ProperPolicies}} \rho_\pi \tag{12.7}$$

Since, for $\epsilon$ small enough, all policies are proper (cf. [Bertsekas and Tsitsiklis, 1996]), one can write:

$$\limsup_{k \to \infty} \|V^* - V^{\pi_k}\|_\infty \le \frac{2|S|\,(1 - \rho + |S|)\,\epsilon}{(1-\rho)^2} \tag{12.8}$$

The results from [Munos, 2003] generalize these results to the case of weighted quadratic norms. This is of crucial interest since many approximation techniques for value functions solve a regression problem defined in terms of $L_2$ norms (see footnote 2).

Footnote 2: A good counter-example is provided in [Guestrin et al., 2001], where an L


12.2.3 Approximation methods

Linear approximation architectures

A first set of Approximate Policy Iteration methods can be grouped under the name of “feature-based approximations” or, more commonly, “linear approximation architectures”. Even though all regression methods are more or less related to feature-based representations, this specific category uses a predefined finite set of feature functions. The idea is to represent the value function as a linear combination of features and thus to project the value functions (or the Q-functions) onto the subspace spanned by the features, as illustrated by equation 12.9.

$$V^\pi(s) = \sum_{i=1}^{k} w_i^\pi\, \phi_i(s) \tag{12.9}$$

The fixed-degree polynomial approximations of part II fall into this category, but not the piecewise polynomial approximation, since one cannot exhibit a finite basis for the space of bounded-degree piecewise polynomial functions. The linear approximation architecture has been used for instance in the Least-Squares Temporal Difference Learning (LSTD, [Bradtke and Barto, 1996]) algorithm for prediction tasks. This same method inspired the evaluation phase of the Least-Squares Policy Iteration (LSPI, [Lagoudakis and Parr, 2003]) algorithm. Another example of a direct linear approximation architecture is the approach of Approximate Linear Programming (ALP, see [Hauskrecht and Kveton, 2004] for example), which can be used for policy optimization or simply policy evaluation. These approaches provide a robust evaluation phase and help build efficient Policy Iteration algorithms, both from the model-based (planning) and the model-free (learning) point of view. Their main drawback lies in feature selection, as pointed out by [Kveton and Hauskrecht, 2006]. The next two families of algorithms try to overcome this difficulty.
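As an illustration of a linear architecture used for policy evaluation, the sketch below computes the weights of equation 12.9 by solving a projected Bellman equation when the model is known. It is a simplified, model-based analogue of the least-squares fixed point, not the LSTD or LSPI algorithm itself: the uniform weighting of states, the assumption that the resulting linear system is nonsingular, and the names `linear_policy_evaluation`, `Phi`, `P_pi` and `r_pi` are all assumptions introduced here.

```python
import numpy as np

def linear_policy_evaluation(Phi, P_pi, r_pi, gamma):
    """Weights w such that Phi @ w approximates V^pi.

    Phi:  (S, k) feature matrix, row s holds (phi_1(s), ..., phi_k(s)).
    P_pi: (S, S) transition matrix induced by the policy.
    r_pi: (S,) expected one-step reward under the policy.
    Solves Phi^T (Phi - gamma P_pi Phi) w = Phi^T r_pi, i.e. the Bellman
    residual of Phi @ w is made orthogonal to the feature subspace
    (uniform state weighting, assumed here for simplicity).
    """
    A = Phi.T @ (Phi - gamma * (P_pi @ Phi))
    b = Phi.T @ r_pi
    w = np.linalg.solve(A, b)  # assumes A is nonsingular
    return w, Phi @ w
```

With one indicator feature per state (`Phi` equal to the identity), the system reduces to the exact evaluation $(I - \gamma P_\pi) V = r_\pi$; with fewer features the quality of the result depends entirely on how well the chosen features can represent $V^\pi$, which is the feature selection issue mentioned above.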

Simulation-based methods

We distinguish a second family of methods which we could call “Monte-Carlo methods” or “simulation-based methods”, in the sense that they do not rely on a value function approximation architecture but on direct simulation and sampling to obtain an evaluation of the considered random variables. It is important to note that the families of algorithms we distinguish are closely related to each other: our point is not to categorize and separate algorithms but to provide a structured review of existing approximation methods. For instance, the LSTDQ evaluation in LSPI relies on the reusability of samples generated from the exploration versus exploitation trade-off. Monte-Carlo methods make extensive use of generative models, i.e. they assume that generating samples and experience can be done at a very low cost.
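The simplest instance of this idea is to estimate the value of a policy in one state by averaging the discounted returns of simulated trajectories. The sketch below assumes a generative model exposed as a hypothetical function `sample_step(s, a, rng)` returning a next state and a reward; the truncation horizon, the number of rollouts and all names are illustrative choices, not taken from the text.

```python
import random

def mc_policy_value(sample_step, policy, s0, gamma,
                    n_rollouts=1000, horizon=200, seed=0):
    """Monte-Carlo estimate of V^pi(s0) from simulated rollouts.

    sample_step(s, a, rng) -> (next_state, reward): hypothetical
    generative model of the problem.
    policy(s) -> action chosen by the evaluated policy in state s.
    Each rollout is truncated after `horizon` steps.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_rollouts):
        s, ret, discount = s0, 0.0, 1.0
        for _ in range(horizon):
            a = policy(s)
            s, reward = sample_step(s, a, rng)
            ret += discount * reward
            discount *= gamma
        total += ret
    return total / n_rollouts
```

The estimate carries both a truncation bias, of order $\gamma^{H}$ for horizon $H$ and bounded rewards, and a sampling variance that decreases with the number of rollouts, which is why such methods rely on samples being cheap to generate.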

Simulation-based approaches are quite close to the online approaches of RTDP or LAO*. The algorithm of [Kearns et al., 2002], for instance, explores the states reachable from a current state $s$ by simulating each action $N$ times and repeating until a certain depth $H$. This recursively defines value functions for horizon 1 to horizon $H$. Then, the best action found in $s$ is returned. This method is however quickly handicapped by the complexity of breadth-first search, and it is hard to reach sufficiently large values of $H$ and $N$ to guarantee optimality and convergence. A more focused alternative is what [Bertsekas and Tsitsiklis, 1996] present as simulation-based policy evaluation. It consists in calculating all $Q^\pi$-values

12.3. Heuristic forward search for Asynchronous Value Iteration

In document Temporal Markov Decision Problems : Formalization and Resolution (Page 193-197)