Dynamic Programming and Optimal Control 3rd Edition, Volume II. Chapter 6 Approximate Dynamic Programming


Dynamic Programming and Optimal Control

3rd Edition, Volume II

by

Dimitri P. Bertsekas

Massachusetts Institute of Technology

Chapter 6 Approximate Dynamic Programming

This is an updated version of the research-oriented Chapter 6 on Approximate Dynamic Programming. It will be periodically updated as new research becomes available, and will replace the current Chapter 6 in the book’s next printing.

In addition to editorial revisions, rearrangements, and new exercises, the chapter includes an account of new research, which is collected mostly in Sections 6.3 and 6.8. Furthermore, a lot of new material has been added, such as an account of post-decision state simplifications (Section 6.1), regression-based TD methods (Section 6.3), feature scaling (Section 6.3), policy oscillations (Section 6.3), λ-policy iteration and exploration enhanced TD methods, aggregation methods (Section 6.4), new Q-learning algorithms (Section 6.5), and Monte Carlo linear algebra (Section 6.8).

This chapter represents “work in progress.” It more than likely contains errors (hopefully not serious ones). Furthermore, its references to the literature are incomplete. Your comments and suggestions to the author at [email protected] are welcome. The date of last revision is given below.


6 Approximate

Dynamic Programming

Contents

6.1. General Issues of Cost Approximation
6.1.1. Approximation Architectures
6.1.2. Approximate Policy Iteration
6.1.3. Direct and Indirect Approximation
6.1.4. Simplifications
6.1.5. Monte Carlo Simulation
6.1.6. Contraction Mappings and Simulation
6.2. Direct Policy Evaluation - Gradient Methods
6.3. Projected Equation Methods
6.3.1. The Projected Bellman Equation
6.3.2. Projected Value Iteration - Other Iterative Methods
6.3.3. Simulation-Based Methods
6.3.4. LSTD, LSPE, and TD(0) Methods
6.3.5. Optimistic Versions
6.3.6. Multistep Simulation-Based Methods
6.3.7. Policy Iteration Issues - Exploration
6.3.8. Policy Oscillations - Chattering
6.3.9. λ-Policy Iteration
6.3.10. A Synopsis
6.4. Aggregation Methods
6.4.1. Cost Approximation via the Aggregate Problem
6.4.2. Cost Approximation via the Enlarged Problem
6.5. Q-Learning
6.5.1. Convergence Properties of Q-Learning
6.5.2. Q-Learning and Approximate Policy Iteration
6.5.3. Q-Learning for Optimal Stopping Problems
6.5.4. Finite Horizon Q-Learning
6.6. Stochastic Shortest Path Problems
6.7. Average Cost Problems
6.7.1. Approximate Policy Evaluation
6.7.2. Approximate Policy Iteration
6.7.3. Q-Learning for Average Cost Problems
6.8. Simulation-Based Solution of Large Systems
6.8.1. Projected Equations - Simulation-Based Versions
6.8.2. Matrix Inversion and Regression-Type Methods
6.8.3. Iterative/LSPE-Type Methods
6.8.4. Multistep Methods
6.8.5. Extension of Q-Learning for Optimal Stopping
6.8.6. Bellman Equation Error-Type Methods
6.8.7. Oblique Projections
6.8.8. Generalized Aggregation by Simulation
6.9. Approximation in Policy Space
6.9.1. The Gradient Formula
6.9.2. Computing the Gradient by Simulation
6.9.3. Essential Features of Critics
6.9.4. Approximations in Policy and Value Space
6.10. Notes, Sources, and Exercises
References


In this chapter we consider approximation methods for challenging, computationally intensive DP problems. We discussed a number of such methods in Chapter 6 of Vol. I and Chapter 1 of the present volume, such as rollout and other one-step lookahead approaches. Here our focus will be on algorithms that are mostly patterned after two principal methods of infinite horizon DP: policy and value iteration. These algorithms form the core of a methodology known by various names, such as approximate dynamic programming, or neuro-dynamic programming, or reinforcement learning.

A principal aim of the methods of this chapter is to address problems with a very large number of states n. In such problems, ordinary linear algebra operations, such as n-dimensional inner products, are prohibitively time-consuming, and indeed it may be impossible to even store an n-vector in computer memory. Our methods will involve linear algebra operations of dimension much smaller than n, and require only that the components of n-vectors be generated when needed rather than stored.

Another aim of the methods of this chapter is to address model-free situations, i.e., problems where a mathematical model is unavailable or hard to construct. Instead, the system and cost structure may be simulated (think, for example, of a queueing network with complicated but well-defined service disciplines at the queues). The assumption here is that there is a computer program that simulates, for a given control u, the probabilistic transitions from any given state i to a successor state j according to the transition probabilities pij(u), and also generates a corresponding transition cost g(i, u, j).

Given a simulator, it may be possible to use repeated simulation to calculate (at least approximately) the transition probabilities of the system and the expected stage costs by averaging, and then to apply the methods discussed in earlier chapters. The methods of this chapter, however, are geared towards an alternative possibility, which is much more attractive when one is faced with a large and complex system, and one contemplates approximations. Rather than estimate explicitly the transition probabilities and costs, we will aim to approximate the cost function of a given policy or even the optimal cost-to-go function by generating one or more simulated system trajectories and associated costs, and by using some form of “least squares fit.”

Implicit in the rationale of methods based on cost function approximation is of course the hypothesis that a more accurate cost-to-go approximation will yield a better one-step or multistep lookahead policy. This is a reasonable but by no means self-evident conjecture, and may in fact not even be true in a given problem. In another type of method, which we will discuss somewhat briefly, we use simulation in conjunction with a gradient or other method to approximate directly an optimal policy with a policy of a given parametric form. This type of method does not aim at good cost function approximation through which a well-performing policy may be obtained. Rather it aims directly at finding a policy with good performance.

Let us also mention two other approximate DP methods, which we have discussed at various points in other parts of the book, but which we will not consider further: rollout algorithms (Sections 6.4, 6.5 of Vol. I, and Section 1.3.5 of Vol. II), and approximate linear programming (Section 1.3.4).

Our main focus will be on two types of methods: policy evaluation algorithms, which deal with approximation of the cost of a single policy (and can also be embedded within a policy iteration scheme), and Q-learning algorithms, which deal with approximation of the optimal cost. Let us summarize each type of method, focusing for concreteness on the finite-state discounted case.

Policy Evaluation Algorithms

With this class of methods, we aim to approximate the cost function Jµ(i) of a policy µ with a parametric architecture of the form ˜J(i, r), where r is a parameter vector (cf. Section 6.3.5 of Vol. I). This approximation may be carried out repeatedly, for a sequence of policies, in the context of a policy iteration scheme. Alternatively, it may be used to construct an approximate cost-to-go function of a single suboptimal/heuristic policy, which can be used in an on-line rollout scheme, with one-step or multistep lookahead. We focus primarily on two types of methods.†

In the first class of methods, called direct, we use simulation to collect samples of costs for various initial states, and fit the architecture ˜J to the samples through some least squares problem. This problem may be solved by several possible algorithms, including linear least squares methods based on simple matrix inversion. Gradient methods have also been used extensively, and will be described in Section 6.2.
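As a rough illustration of the direct approach, the following Python sketch simulates discounted cost samples of a fixed policy from every initial state and fits a linear architecture Φr to them by ordinary least squares. The 3-state chain, the stage costs, the two features, the truncation horizon, and the names `sample_cost` and `direct_fit` are all assumptions made for this example and are not taken from the text.

```python
import random
import numpy as np

ALPHA = 0.9                       # discount factor (assumed)
P = [[0.5, 0.5, 0.0],             # transition probabilities under the fixed policy (assumed)
     [0.0, 0.5, 0.5],
     [0.5, 0.0, 0.5]]
G = [1.0, 2.0, 3.0]               # one-stage cost incurred at each state (assumed)

def next_state(i, rng):
    """Sample a successor of state i according to row i of P."""
    u, acc = rng.random(), 0.0
    for j, p in enumerate(P[i]):
        acc += p
        if u < acc:
            return j
    return len(P) - 1

def sample_cost(i, rng, horizon=80):
    """One simulated discounted cost starting at i, truncated after `horizon` stages."""
    total, disc = 0.0, 1.0
    for _ in range(horizon):
        total += disc * G[i]
        disc *= ALPHA
        i = next_state(i, rng)
    return total

def direct_fit(phi, starts_per_state=300, seed=0):
    """Least squares fit of phi @ r to simulated cost samples from every initial state."""
    rng = random.Random(seed)
    rows, targets = [], []
    for i in range(len(P)):
        for _ in range(starts_per_state):
            rows.append(phi[i])
            targets.append(sample_cost(i, rng))
    r, *_ = np.linalg.lstsq(np.array(rows), np.array(targets), rcond=None)
    return r

# Feature matrix with rows phi(i)' = (1, i): a constant and a linear basis function.
PHI = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
r_fit = direct_fit(PHI)
print("fitted parameter vector:", r_fit)
print("approximate costs:", PHI @ r_fit)
```

Because every state is used as an initial state equally often, the fit should approach the unweighted least squares projection of the true cost vector on the span of the two features as the number of samples grows.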

† In another type of policy evaluation method, often called the Bellman equation error approach, which we will discuss briefly in Section 6.8.6, the parameter vector r is determined by minimizing a measure of error in satisfying Bellman’s equation; for example, by minimizing over r

$$\| \tilde J - T \tilde J \|,$$

where ‖ · ‖ is some norm. If ‖ · ‖ is a Euclidean norm, and ˜J(i, r) is linear in r, this minimization is a linear least squares problem.

The second and currently more popular class of methods is called indirect. Here, we obtain r by solving an approximate version of Bellman’s equation. We will focus exclusively on the case of a linear architecture, where ˜J is of the form Φr, and Φ is a matrix whose columns can be viewed as basis functions (cf. Section 6.3.5 of Vol. I). In an important method of this type, we obtain the parameter vector r by solving the equation

$$\Phi r = \Pi T(\Phi r), \tag{6.1}$$

where Π denotes projection with respect to a suitable norm on the subspace of vectors of the form Φr, and T is either the mapping Tµ or a related mapping, which also has Jµ as its unique fixed point [here ΠT(Φr) denotes the projection of the vector T(Φr) on the subspace].†

We can view Eq. (6.1) as a form of projected Bellman equation. We will show that for a special choice of the norm of the projection, ΠT is a contraction mapping, so the projected Bellman equation has a unique solution Φr∗. We will discuss several iterative methods for finding r∗ in Section 6.3. All these methods use simulation and can be shown to converge under reasonable assumptions to r∗, so they produce the same approximate cost function. However, they differ in their speed of convergence and in their suitability for various problem contexts. Here are the methods that we will focus on in Section 6.3 for discounted problems, and also in Sections 6.6-6.8 for other types of problems. They all depend on a parameter λ ∈ [0, 1], whose role will be discussed later.

(1) TD(λ) or temporal differences method. This algorithm may be viewed as a stochastic iterative method for solving a version of the projected equation (6.1) that depends on λ. The algorithm embodies important ideas and has played an important role in the development of the subject, but in practical terms, it is usually inferior to the next two methods, so it will be discussed in less detail.

(2) LSTD(λ) or least squares temporal differences method. This algorithm computes and solves a progressively more refined simulation-based approximation to the projected Bellman equation (6.1).

(3) LSPE(λ) or least squares policy evaluation method. This algorithm is based on the idea of executing value iteration within the lower dimensional space spanned by the basis functions. Conceptually, it has the form

$$\Phi r_{k+1} = \Pi T(\Phi r_k) + \text{simulation noise}, \tag{6.2}$$

i.e., the current value iterate T(Φrk) is projected on the subspace S of vectors of the form Φr and is suitably approximated by simulation. The simulation noise tends to 0 asymptotically, so assuming that ΠT is a contraction, the method converges to the solution of the projected Bellman equation (6.1). There are also a number of variants of LSPE(λ). Both LSPE(λ) and its variants have the same convergence rate as LSTD(λ), because they share a common bottleneck: the slow speed of simulation.

† Another method of this type is based on aggregation (cf. Section 6.3.4 of Vol. I) and is discussed in Section 6.4. This approach can also be viewed as a problem approximation approach (cf. Section 6.3.3 of Vol. I): the original problem is approximated with a related “aggregate” problem, which is then solved exactly to yield a cost-to-go approximation for the original problem. The aggregation counterpart of the equation Φr = ΠT(Φr) has the form Φr = ΦDT(Φr), where Φ and D are matrices whose rows are restricted to be probability distributions (the aggregation and disaggregation probabilities, respectively).
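To make the projected equation and the noise-free form of the iteration (6.2) concrete, here is a small exact (model-based, no simulation) sketch in Python. It assumes that the projection norm is the Euclidean norm weighted by the steady-state distribution ξ of the chain, and it uses a made-up 3-state chain, cost vector, and feature matrix; none of these come from the text. Under that assumption, Φr = ΠT(Φr) is equivalent to the linear system Cr = d with C = Φ′Ξ(I − αP)Φ and d = Φ′Ξg, where Ξ is the diagonal matrix with ξ on its diagonal.

```python
import numpy as np

alpha = 0.9                                   # discount factor (assumed)
P = np.array([[0.5, 0.5, 0.0],                # transition matrix of the policy (assumed)
              [0.0, 0.5, 0.5],
              [0.5, 0.0, 0.5]])
g = np.array([1.0, 2.0, 3.0])                 # expected one-stage costs (assumed)
Phi = np.array([[1.0, 0.0],                   # feature matrix, rows phi(i)' (assumed)
                [1.0, 1.0],
                [1.0, 2.0]])

# Steady-state distribution xi by repeated multiplication with P.
xi = np.array([1.0, 0.0, 0.0])
for _ in range(200):
    xi = xi @ P
Xi = np.diag(xi)

# Projected Bellman equation written as the low-dimensional system C r = d.
C = Phi.T @ Xi @ (np.eye(3) - alpha * P) @ Phi
d = Phi.T @ Xi @ g
r_star = np.linalg.solve(C, d)

# Projection matrix onto the span of Phi with respect to the xi-weighted norm.
M = Phi.T @ Xi @ Phi
Pi = Phi @ np.linalg.solve(M, Phi.T @ Xi)

# Projected value iteration: Phi r_{k+1} = Pi T(Phi r_k), with the simulation noise omitted.
r = np.zeros(2)
for _ in range(500):
    TJ = g + alpha * (P @ (Phi @ r))          # T applied to the current iterate
    r = np.linalg.solve(M, Phi.T @ Xi @ TJ)   # parameters of the projection of TJ

print("solution of C r = d:        ", r_star)
print("projected value iteration r:", r)
```

In this sketch the matrix inversion route and the iterative route are expected to agree. Simulation-based methods would replace C, d, and the projection step by sample-based estimates.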

Q-Learning Algorithms

With this class of methods, we aim to compute, without any approximation, the optimal cost function (not just the cost function of a single policy). Q-learning maintains and updates for each state-control pair (i, u) an estimate of the expression that is minimized in the right-hand side of Bellman’s equation. This is called the Q-factor of the pair (i, u), and is denoted by Q∗(i, u). The Q-factors are updated with what may be viewed as a simulation-based form of value iteration, as will be explained in Section 6.5. An important advantage of using Q-factors is that when they are available, they can be used to obtain an optimal control at any state i simply by minimizing Q∗(i, u) over u ∈ U(i), so the transition probabilities of the problem are not needed.

On the other hand, for problems with a large number of state-control pairs, Q-learning is often impractical because there may be simply too many Q-factors to update. As a result, the algorithm is primarily suitable for systems with a small number of states (or for aggregated/few-state versions of more complex systems). There are also algorithms that use parametric approximations for the Q-factors (see Section 6.5), although their theoretical basis is generally less solid.
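The following Python sketch illustrates the lookup-table form of Q-learning on a tiny problem with two states and two controls. The transition probabilities, the costs, the stepsize rule 10/(10 + n), and the sweep over all pairs are assumptions chosen for this example. The simulator is the only part of the update that touches the transition probabilities; the exact Q-factors are computed separately by value iteration purely for comparison.

```python
import random

ALPHA = 0.5                                        # discount factor (assumed)
P = {(0, 0): [0.8, 0.2], (0, 1): [0.2, 0.8],       # p_ij(u), keyed by (i, u) (assumed)
     (1, 0): [0.6, 0.4], (1, 1): [0.1, 0.9]}
C = {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 0.5, (1, 1): 3.0}

def cost(i, u, j):
    """Transition cost g(i, u, j) (assumed form)."""
    return C[(i, u)] + float(j)

def simulate(i, u, rng):
    """Simulator: sample the next state j for the pair (i, u)."""
    return 0 if rng.random() < P[(i, u)][0] else 1

def q_learning(sweeps=20000, seed=0):
    rng = random.Random(seed)
    Q = {pair: 0.0 for pair in P}
    for n in range(sweeps):
        step = 10.0 / (10.0 + n)                   # diminishing stepsize (assumed rule)
        for (i, u) in P:                           # every pair is updated in every sweep
            j = simulate(i, u, rng)
            target = cost(i, u, j) + ALPHA * min(Q[(j, v)] for v in (0, 1))
            Q[(i, u)] += step * (target - Q[(i, u)])
    return Q

def exact_q(iters=200):
    """Model-based value iteration on Q-factors, used only as a reference."""
    Q = {pair: 0.0 for pair in P}
    for _ in range(iters):
        Q = {(i, u): sum(P[(i, u)][j] * (cost(i, u, j)
                                         + ALPHA * min(Q[(j, v)] for v in (0, 1)))
                         for j in (0, 1))
             for (i, u) in P}
    return Q

Q_sim = q_learning()
Q_ref = exact_q()
greedy = [min((0, 1), key=lambda u: Q_sim[(i, u)]) for i in (0, 1)]
print("simulated Q-factors:", Q_sim)
print("reference Q-factors:", Q_ref)
print("greedy controls:", greedy)
```

Note that the table has one entry per state-control pair, which is the source of the impracticality for large problems discussed above.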

Chapter Organization

Throughout this chapter, we will focus almost exclusively on perfect state information problems, involving a Markov chain with a finite number of states i, transition probabilities pij(u), and single stage costs g(i, u, j). Extensions of many of the ideas to continuous state spaces are possible, but they are beyond our scope. We will consider first, in Sections 6.1-6.5, the discounted problem using the notation of Section 1.3. Section 6.1 provides a broad overview of cost approximation architectures and their uses in approximate policy iteration. Section 6.2 focuses on direct methods for policy evaluation. Section 6.3 is a long section on a major class of indirect methods for policy evaluation, which are based on the projected Bellman equation. Section 6.4 discusses methods based on aggregation. Section 6.5 discusses Q-learning and its variations, and extends the projected Bellman equation approach to the case of multiple policies, and particularly to optimal stopping problems. Stochastic shortest path and average cost problems are discussed in Sections 6.6 and 6.7, respectively. Section 6.8 extends and elaborates on the projected Bellman equation approach of Sections 6.3, 6.6, and 6.7, discusses another approach based on the Bellman equation error, and generalizes the aggregation methodology. Section 6.9 describes methods based on parametric approximation of policies rather than cost functions.

6.1 GENERAL ISSUES OF COST APPROXIMATION

Most of the methodology of this chapter deals with approximation of some type of cost function (optimal cost, cost of a policy, Q-factors, etc.). The purpose of this section is to highlight the main issues involved, without going too deeply into the mathematical details.

We start with general issues of parametric approximation architectures, which we have also discussed in Vol. I (Section 6.3.5). We then consider approximate policy iteration (Section 6.1.2), and the two general approaches for approximate cost evaluation (direct and indirect; Section 6.1.3). In Section 6.1.4, we discuss various special structures that can be exploited to simplify approximate policy iteration. In Sections 6.1.5 and 6.1.6 we provide orientation into the main mathematical issues underlying the methodology, and focus on two of its main components: contraction mappings and simulation.

6.1.1 Approximation Architectures

The major use of cost approximation is for obtaining a one-step lookahead suboptimal policy (cf. Section 6.3 of Vol. I).† In particular, suppose that we use ˜J(j, r) as an approximation to the optimal cost of the finite-state discounted problem of Section 1.3. Here ˜J is a function of some chosen form (the approximation architecture) and r is a parameter/weight vector. Once r is determined, it yields a suboptimal control at any state i via the one-step lookahead minimization

$$\tilde\mu(i) = \arg\min_{u \in U(i)} \sum_{j=1}^{n} p_{ij}(u) \bigl( g(i, u, j) + \alpha \tilde J(j, r) \bigr). \tag{6.3}$$

The degree of suboptimality of $\tilde\mu$, as measured by $\|J_{\tilde\mu} - J^*\|_\infty$, is bounded by a constant multiple of the approximation error according to

$$\|J_{\tilde\mu} - J^*\|_\infty \le \frac{2\alpha}{1 - \alpha} \, \|\tilde J - J^*\|_\infty,$$

as shown in Prop. 1.3.7. This bound is qualitative in nature, as it tends to be quite conservative in practice.

† We may also use a multiple-step lookahead minimization, with a cost-to-go approximation at the end of the multiple-step horizon. Conceptually, single-step and multiple-step lookahead approaches are similar, and the cost-to-go approximation algorithms of this chapter apply to both.
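The one-step lookahead minimization and the associated suboptimality bound can be checked numerically on a toy problem. The Python sketch below uses a made-up problem with two states and two controls and a deliberately poor approximation ˜J; the problem data and all function names are assumptions for illustration only.

```python
ALPHA = 0.5                                        # discount factor (assumed)
STATES = (0, 1)
CONTROLS = (0, 1)
P = {(0, 0): [0.8, 0.2], (0, 1): [0.2, 0.8],       # p_ij(u), keyed by (i, u) (assumed)
     (1, 0): [0.6, 0.4], (1, 1): [0.1, 0.9]}
C = {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 0.5, (1, 1): 3.0}

def cost(i, u, j):
    """Transition cost g(i, u, j) (assumed form)."""
    return C[(i, u)] + float(j)

def q_value(i, u, J):
    """Expected one-stage cost plus discounted cost-to-go J at the next state."""
    return sum(P[(i, u)][j] * (cost(i, u, j) + ALPHA * J[j]) for j in STATES)

def lookahead_policy(J_tilde):
    """One-step lookahead policy: minimize q_value over the controls at each state."""
    return [min(CONTROLS, key=lambda u: q_value(i, u, J_tilde)) for i in STATES]

def policy_cost(mu, iters=200):
    """Cost vector of the stationary policy mu, computed by repeated application of T_mu."""
    J = [0.0, 0.0]
    for _ in range(iters):
        J = [q_value(i, mu[i], J) for i in STATES]
    return J

def optimal_cost(iters=200):
    """Optimal cost vector J*, computed by value iteration."""
    J = [0.0, 0.0]
    for _ in range(iters):
        J = [min(q_value(i, u, J) for u in CONTROLS) for i in STATES]
    return J

def sup_norm(a, b):
    return max(abs(x - y) for x, y in zip(a, b))

J_tilde = [10.0, 0.0]                              # a crude cost-to-go approximation
mu_tilde = lookahead_policy(J_tilde)
J_star = optimal_cost()
loss = sup_norm(policy_cost(mu_tilde), J_star)
bound = 2 * ALPHA / (1 - ALPHA) * sup_norm(J_tilde, J_star)
print("lookahead policy:", mu_tilde)
print("performance loss:", loss, " bound:", bound)
```

In this instance the actual loss is much smaller than the bound, in line with the remark that the bound tends to be conservative.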

An alternative possibility is to obtain a parametric approximation ˜Q(i, u, r) of the Q-factor of the pair (i, u), defined in terms of the optimal cost function J∗ as

$$Q^*(i, u) = \sum_{j=1}^{n} p_{ij}(u) \bigl( g(i, u, j) + \alpha J^*(j) \bigr).$$

Since Q∗(i, u) is the expression minimized in Bellman’s equation, given the approximation ˜Q(i, u, r), we can generate a suboptimal control at any state i via

$$\tilde\mu(i) = \arg\min_{u \in U(i)} \tilde Q(i, u, r).$$

The advantage of using Q-factors is that in contrast with the minimization (6.3), the transition probabilities pij(u) are not needed in the above minimization. Thus Q-factors are better suited to the model-free context.

Note that we may similarly use approximations to the cost functions Jµ and Q-factors Qµ(i, u) of specific policies µ. A major use of such approximations is in the context of an approximate policy iteration scheme; see Section 6.1.2.

The choice of architecture is very significant for the success of the approximation approach. One possibility is to use the linear form

$$\tilde J(i, r) = \sum_{k=1}^{s} r_k \phi_k(i), \tag{6.4}$$

where r = (r1, . . . , rs) is the parameter vector, and φk(i) are some known scalars that depend on the state i. Thus, for each state i, the approximate cost ˜J(i, r) is the inner product φ(i)′r of r and

$$\phi(i) = \begin{pmatrix} \phi_1(i) \\ \vdots \\ \phi_s(i) \end{pmatrix}.$$

We refer to φ(i) as the feature vector of i, and to its components as features (see Fig. 6.1.1). Thus the cost function is approximated by a vector in the subspace

$$S = \{ \Phi r \mid r \in \Re^s \},$$

where

$$\Phi = \begin{pmatrix} \phi_1(1) & \cdots & \phi_s(1) \\ \vdots & \ddots & \vdots \\ \phi_1(n) & \cdots & \phi_s(n) \end{pmatrix} = \begin{pmatrix} \phi(1)' \\ \vdots \\ \phi(n)' \end{pmatrix}.$$

[Figure 6.1.1: block diagram. State i → Feature Extraction Mapping → Feature Vector φ(i) → Linear Cost Approximator φ(i)′r.]

Figure 6.1.1 A linear feature-based architecture. It combines a mapping that extracts the feature vector φ(i) = (φ1(i), . . . , φs(i))′ associated with state i, and a parameter vector r to form a linear cost approximator.

We can view the s columns of Φ as basis functions, and Φr as a linear combination of basis functions.

Features, when well-crafted, can capture the dominant nonlinearities of the cost function, and their linear combination may work very well as an approximation architecture. For example, in computer chess (Section 6.3.5 of Vol. I) where the state is the current board position, appropriate features are material balance, piece mobility, king safety, and other positional factors.

Example 6.1.1 (Polynomial Approximation)

An important example of linear cost approximation is based on polynomial basis functions. Suppose that the state consists of q integer components x1, . . . , xq, each taking values within some limited range of integers. For example, in a queueing system, xk may represent the number of customers in the kth queue, where k = 1, . . . , q. Suppose that we want to use an approximating function that is quadratic in the components xk. Then we can define a total of 1 + q + q² basis functions that depend on the state x = (x1, . . . , xq) via

$$\phi_0(x) = 1, \qquad \phi_k(x) = x_k, \qquad \phi_{km}(x) = x_k x_m, \qquad k, m = 1, \ldots, q.$$

A linear approximation architecture that uses these functions is given by

$$\tilde J(x, r) = r_0 + \sum_{k=1}^{q} r_k x_k + \sum_{k=1}^{q} \sum_{m=k}^{q} r_{km} x_k x_m,$$

where the parameter vector r has components r0, rk, and rkm, with k = 1, . . . , q, m = k, . . . , q. In fact, any kind of approximating function that is polynomial in the components x1, . . . , xq can be constructed similarly.

It is also possible to combine feature extraction with polynomial approximations. For example, the feature vector φ(i) = (φ1(i), . . . , φs(i))′ transformed by a quadratic polynomial mapping, leads to approximating functions of the form

$$\tilde J(i, r) = r_0 + \sum_{k=1}^{s} r_k \phi_k(i) + \sum_{k=1}^{s} \sum_{\ell=1}^{s} r_{k\ell} \phi_k(i) \phi_\ell(i),$$

where the parameter vector r has components r0, rk, and rkℓ, with k, ℓ = 1, . . . , s. This function can be viewed as a linear cost approximation that uses the basis functions

$$w_0(i) = 1, \qquad w_k(i) = \phi_k(i), \qquad w_{k\ell}(i) = \phi_k(i) \phi_\ell(i), \qquad k, \ell = 1, \ldots, s.$$
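A minimal Python sketch of the quadratic architecture of Example 6.1.1 in the state components follows. It keeps only the products x_k x_m with m ≥ k, as in the double sum of the example, so the parameter vector has 1 + q + q(q + 1)/2 components. The function names and the ordering of the basis functions are choices made for this illustration.

```python
from itertools import combinations_with_replacement

def quadratic_features(x):
    """Return [1, x_1, ..., x_q, x_k * x_m for k <= m] for a state x = (x_1, ..., x_q)."""
    feats = [1.0]
    feats.extend(float(xk) for xk in x)
    feats.extend(float(xk * xm) for xk, xm in combinations_with_replacement(x, 2))
    return feats

def approx_cost(x, r):
    """Linear architecture: inner product of the feature vector of x with r."""
    phi = quadratic_features(x)
    if len(phi) != len(r):
        raise ValueError("r must have one component per basis function")
    return sum(rk * pk for rk, pk in zip(r, phi))

# Two queues holding 2 and 3 customers; r weights the constant term and x_2 squared.
x = (2, 3)
r = [1.0, 0.0, 0.0, 0.0, 0.0, 1.0]
print(quadratic_features(x))
print(approx_cost(x, r))
```

The approximation stays linear in r even though it is quadratic in x, which is what allows the linear-architecture algorithms of this chapter to be applied to it.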

Example 6.1.2 (Interpolation)

A common type of approximation of a function J is based on interpolation. Here, a set I of special states is selected, and the parameter vector r has one component ri per state i ∈ I, which is the value of J at i:

$$r_i = J(i), \qquad i \in I.$$

The value of J at states i ∉ I is approximated by some form of interpolation using r.

Interpolation may be based on geometric proximity. For a simple example that conveys the basic idea, let the system states be the integers within some interval, let I be a subset of special states, and for each state i let $\underline{i}$ and $\bar{i}$ be the states in I that are closest to i from below and from above. Then for any state i, ˜J(i, r) is obtained by linear interpolation of the costs $r_{\underline{i}} = J(\underline{i})$ and $r_{\bar{i}} = J(\bar{i})$:

$$\tilde J(i, r) = \frac{\bar{i} - i}{\bar{i} - \underline{i}} \, r_{\underline{i}} + \frac{i - \underline{i}}{\bar{i} - \underline{i}} \, r_{\bar{i}}.$$

The scalars multiplying the components of r may be viewed as features, so the feature vector of i above consists of two nonzero features (the ones corresponding to $\underline{i}$ and $\bar{i}$), with all other features being 0. Similar examples can be constructed for the case where the state space is a subset of a multidimensional space (see Example 6.3.13 of Vol. I).
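The interpolation architecture of Example 6.1.2 can be sketched in Python as follows. The feature vector of a state has at most two nonzero entries, located at its neighboring special states, with weights that grow with proximity, so that a special state reproduces its own value r_i. The special states and their values below are made up for the illustration.

```python
from bisect import bisect_right

def interpolation_features(i, special):
    """Feature vector of state i, given the sorted list `special` of special states."""
    if not special[0] <= i <= special[-1]:
        raise ValueError("state lies outside the range covered by the special states")
    phi = [0.0] * len(special)
    hi = bisect_right(special, i)          # index of the first special state above i
    if hi == len(special):                 # i is the largest special state
        phi[-1] = 1.0
        return phi
    lo = hi - 1                            # index of the closest special state at or below i
    lower, upper = special[lo], special[hi]
    w_upper = (i - lower) / (upper - lower)
    phi[lo] = 1.0 - w_upper                # weight on the neighbor from below
    phi[hi] = w_upper                      # weight on the neighbor from above
    return phi

def interpolated_cost(i, special, r):
    """Inner product of the interpolation features of i with the parameter vector r."""
    return sum(w * ri for w, ri in zip(interpolation_features(i, special), r))

special_states = [0, 10, 20]
r = [0.0, 5.0, 7.0]                        # r_i = J(i) at the special states (assumed values)
print(interpolation_features(4, special_states))
print(interpolated_cost(4, special_states, r))
print(interpolated_cost(15, special_states, r))
```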

A generalization of the preceding example is approximation based on aggregation; see Section 6.3.4 of Vol. I and the subsequent Section 6.4 in this chapter. There are also interesting nonlinear approximation architectures, including those defined by neural networks, perhaps in combination with feature extraction mappings (see Bertsekas and Tsitsiklis [BeT96], or Sutton and Barto [SuB98] for further discussion). In this chapter, we will mostly focus on the case of linear architectures, because many of the policy evaluation algorithms of this chapter are valid only for that case.

We note that there has been considerable research on automatic basis function generation approaches (see e.g., Keller, Mannor, and Precup [KMP06], and Jung and Polani [JuP07]). Moreover it is possible to use standard basis functions which may be computed by simulation (perhaps with simulation error). The following example discusses this possibility.


Example 6.1.3 (Krylov Subspace Generating Functions)

We have assumed so far that the columns of Φ, the basis functions, are known, and the rows φ(i)′ of Φ are explicitly available to use in the various simulation-based formulas. We will now discuss a class of basis functions that may not be available, but may be approximated by simulation in the course of various algorithms. For concreteness, let us consider the evaluation of the cost vector

$$J_\mu = (I - \alpha P_\mu)^{-1} g_\mu$$

of a policy µ in a discounted MDP. Then Jµ has an expansion of the form

$$J_\mu = \sum_{t=0}^{\infty} \alpha^t P_\mu^t g_\mu.$$

Thus $g_\mu, P_\mu g_\mu, \ldots, P_\mu^s g_\mu$ yield an approximation based on the first s + 1 terms of the expansion, and seem suitable choices as basis functions. Also a more general expansion is

$$J_\mu = J + \sum_{t=0}^{\infty} \alpha^t P_\mu^t q,$$

where J is any vector in $\Re^n$ and q is the residual vector

$$q = T_\mu J - J = g_\mu + \alpha P_\mu J - J;$$

this can be seen from the equation Jµ − J = αPµ(Jµ − J) + q. Thus the basis functions $J, q, P_\mu q, \ldots, P_\mu^{s-1} q$ yield an approximation based on the first s + 1 terms of the preceding expansion.

Generally, to implement various methods in subsequent sections with basis functions of the form $P_\mu^m g_\mu$, m ≥ 0, one would need to generate the ith components $(P_\mu^m g_\mu)(i)$ for any given state i, but these may be hard to calculate. However, it turns out that one can use instead single sample approximations of $(P_\mu^m g_\mu)(i)$, and rely on the averaging mechanism of simulation to improve the approximation process. The details of this are beyond our scope and we refer to the original sources (Bertsekas and Yu [BeY07], [BeY09]) for further discussion and specific implementations.
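As a small exact illustration of the Krylov basis idea (without the simulation-based sampling discussed in the cited sources), the Python sketch below forms the basis g, Pg, . . . , P^s g for a made-up 3-state chain and measures how well the cost vector of the policy is fitted in the unweighted least squares sense. The chain, the costs, the choice of an unweighted fit, and the function names are assumptions for this example.

```python
import numpy as np

def krylov_basis(P, g, s):
    """Matrix whose columns are g, P g, ..., P^s g."""
    cols = [g]
    for _ in range(s):
        cols.append(P @ cols[-1])
    return np.column_stack(cols)

def fit_error(P, g, alpha, s):
    """Euclidean norm of the least squares residual when fitting J_mu with the Krylov basis."""
    J = np.linalg.solve(np.eye(len(g)) - alpha * P, g)   # exact cost vector of the policy
    Phi = krylov_basis(P, g, s)
    r, *_ = np.linalg.lstsq(Phi, J, rcond=None)
    return float(np.linalg.norm(Phi @ r - J))

P = np.array([[0.5, 0.5, 0.0],       # transition matrix of the policy (assumed)
              [0.0, 0.5, 0.5],
              [0.5, 0.0, 0.5]])
g = np.array([1.0, 2.0, 3.0])        # expected one-stage costs (assumed)

errors = [fit_error(P, g, 0.9, s) for s in range(3)]
print(errors)
```

Since the subspaces are nested, the residual cannot increase with s. In this 3-state instance the three basis vectors are linearly independent, so the fit with s = 2 is exact up to rounding.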

We finally mention the possibility of optimal selection of basis functions within some restricted class. In particular, consider an approximation subspace

$$S_\theta = \{ \Phi(\theta) r \mid r \in \Re^s \},$$

where the s columns of the n × s matrix Φ(θ) are basis functions parametrized by a vector θ. Assume that for a given θ, there is a corresponding vector r(θ), obtained using some algorithm, so that Φ(θ)r(θ) is an approximation of a cost function J (various such algorithms will be presented later in this chapter). Then we may wish to select θ so that some measure of approximation quality is optimized. For example, suppose that we can compute the true cost values J(i) (or more generally, approximations to these values) for a subset of selected states I. Then we may determine θ so that

$$\sum_{i \in I} \bigl( J(i) - \phi(i, \theta)' r(\theta) \bigr)^2$$

is minimized, where φ(i, θ)′ is the ith row of Φ(θ). Alternatively, we may determine θ so that the norm of the error in satisfying Bellman’s equation,

$$\bigl\| \Phi(\theta) r(\theta) - T\bigl( \Phi(\theta) r(\theta) \bigr) \bigr\|^2,$$

is minimized. Gradient and random search algorithms for carrying out such minimizations have been proposed in the literature (see Menache, Mannor, and Shimkin [MMS06], and Yu and Bertsekas [YuB09]).

6.1.2 Approximate Policy Iteration

Let us consider a form of approximate policy iteration, where we compute simulation-based approximations ˜J(·, r) to the cost functions Jµ of stationary policies µ, and we use them to compute new policies based on (approximate) policy improvement. We impose no constraints on the approximation architecture, so ˜J(i, r) may be linear or nonlinear in r.

Suppose that the current policy is µ, and for a given r, ˜J(i, r) is an approximation of Jµ(i). We generate an “improved” policy $\bar\mu$ using the formula

$$\bar\mu(i) = \arg\min_{u \in U(i)} \sum_{j=1}^{n} p_{ij}(u) \bigl( g(i, u, j) + \alpha \tilde J(j, r) \bigr), \qquad \text{for all } i. \tag{6.5}$$

The method is illustrated in Fig. 6.1.2. Its theoretical basis was discussed in Section 1.3 (cf. Prop. 1.3.6), where it was shown that if the policy evaluation is accurate to within δ (in the sup-norm sense), then for an α-discounted problem, the method will yield in the limit (after infinitely many policy evaluations) a stationary policy that is optimal to within

$$\frac{2\alpha\delta}{(1 - \alpha)^2},$$

where α is the discount factor. Experimental evidence indicates that this bound is usually conservative. Furthermore, often just a few policy evaluations are needed before the bound is attained.

When the sequence of policies obtained actually converges to some $\hat\mu$, then it can be proved that $\hat\mu$ is optimal to within

$$\frac{2\alpha\delta}{1 - \alpha}$$

(see Section 6.3.8 and also Section 6.4.2, where it is shown that if policy evaluation is done using an aggregation approach, the generated sequence of policies does converge).

[Figure 6.1.2: block diagram. Guess Initial Policy → Approximate Policy Evaluation (Evaluate Approximate Cost Φr Using Simulation) → Policy Improvement (Generate “Improved” Policy $\bar\mu$) → back to Approximate Policy Evaluation.]

Figure 6.1.2 Block diagram of approximate policy iteration.

A simulation-based implementation of the algorithm is illustrated in Fig. 6.1.3. It consists of four parts:

(a) The simulator, which given a state-control pair (i, u), generates the next state j according to the system’s transition probabilities.

(b) The decision generator, which generates the control $\bar\mu(i)$ of the improved policy at the current state i for use in the simulator.

(c) The cost-to-go approximator, which is the function ˜J(j, r) that is used by the decision generator.

(d) The cost approximation algorithm, which accepts as input the output of the simulator and obtains the approximation $\tilde J(\cdot, \bar r)$ of the cost of $\bar\mu$.

Note that there are two policies µ and $\bar\mu$, and parameter vectors r and $\bar r$, which are simultaneously involved in this algorithm. In particular, r corresponds to the current policy µ, and the approximation ˜J(·, r) is used in the policy improvement Eq. (6.5) to generate the new policy $\bar\mu$. At the same time, $\bar\mu$ drives the simulation that generates samples to be used by the algorithm that determines the parameter $\bar r$ corresponding to $\bar\mu$, which will be used in the next policy iteration.
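The four parts listed above can be put together in a compact Python sketch. It is only an illustration under strong simplifying assumptions: a made-up problem with two states and two controls, a lookup-table architecture (one parameter per state) in place of a general ˜J(·, r), and plain Monte Carlo averaging of truncated trajectories as the cost approximation algorithm. All names and numbers are hypothetical.

```python
import random

ALPHA = 0.5                                        # discount factor (assumed)
STATES, CONTROLS = (0, 1), (0, 1)
P = {(0, 0): [0.8, 0.2], (0, 1): [0.2, 0.8],       # p_ij(u), keyed by (i, u) (assumed)
     (1, 0): [0.6, 0.4], (1, 1): [0.1, 0.9]}
C = {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 0.5, (1, 1): 3.0}

def cost(i, u, j):
    """Transition cost g(i, u, j) (assumed form)."""
    return C[(i, u)] + float(j)

def simulator(i, u, rng):
    """Part (a): generate the next state j for the pair (i, u)."""
    return 0 if rng.random() < P[(i, u)][0] else 1

def decision_generator(i, J_tilde):
    """Part (b): control of the improved policy at i, via the lookahead of Eq. (6.5).
    Part (c), the cost-to-go approximator, is the lookup table J_tilde."""
    def lookahead(u):
        return sum(P[(i, u)][j] * (cost(i, u, j) + ALPHA * J_tilde[j]) for j in STATES)
    return min(CONTROLS, key=lookahead)

def cost_approximation(policy_of, rng, runs=200, horizon=30):
    """Part (d): average simulated discounted costs of the policy from each start state."""
    J_new = []
    for start in STATES:
        total = 0.0
        for _ in range(runs):
            i, disc = start, 1.0
            for _ in range(horizon):
                u = policy_of(i)
                j = simulator(i, u, rng)
                total += disc * cost(i, u, j)
                disc *= ALPHA
                i = j
        J_new.append(total / runs)
    return J_new

def approximate_policy_iteration(J_tilde, iterations=4, seed=0):
    rng = random.Random(seed)
    for _ in range(iterations):
        # The improved policy drives the simulation; its evaluation replaces J_tilde.
        J_tilde = cost_approximation(lambda i, J=J_tilde: decision_generator(i, J), rng)
    policy = [decision_generator(i, J_tilde) for i in STATES]
    return policy, J_tilde

final_policy, final_J = approximate_policy_iteration([10.0, 0.0])
print("policy:", final_policy)
print("cost estimates:", final_J)
```

Restarting trajectories from every state, as done here, is feasible only for tiny problems, but it sidesteps the issue of states that the current policy rarely visits.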

The Issue of Exploration

Let us note an important generic difficulty with simulation-based policy iteration: to evaluate a policy µ, we need to generate cost samples using that policy, but this biases the simulation by underrepresenting states that are unlikely to occur under µ. As a result, the cost-to-go estimates of these underrepresented states may be highly inaccurate, causing potentially serious errors in the calculation of the improved control policy $\bar\mu$ via the policy improvement Eq. (6.5).

[Figure 6.1.3: block diagram with modules System Simulator, Decision Generator, Cost-to-Go Approximator (supplies values ˜J(j, r)), and Cost Approximation Algorithm; edge labels: Decision $\bar\mu(i)$, State i, Samples.]

Figure 6.1.3 Simulation-based implementation of the approximate policy iteration algorithm. Given the approximation ˜J(i, r), we generate cost samples of the “improved” policy $\bar\mu$ by simulation (the “decision generator” module). We use these samples to generate the approximator $\tilde J(i, \bar r)$ of $\bar\mu$.

The difficulty just described is known as inadequate exploration of the system’s dynamics because of the use of a fixed policy. It is a particularly acute difficulty when the system is deterministic, or when the randomness embodied in the transition probabilities is “relatively small.” One possibility for guaranteeing adequate exploration of the state space is to frequently restart the simulation and to ensure that the initial states employed form a rich and representative subset. A related approach, called iterative resampling, is to enrich the sampled set of states in evaluating the current policy µ as follows: derive an initial cost evaluation of µ, simulate the next policy $\bar\mu$ obtained on the basis of this initial evaluation to obtain a set of representative states S visited by $\bar\mu$, and repeat the evaluation of µ using additional trajectories initiated from S.

Still another frequently used approach is to artificially introduce some extra randomization in the simulation, by occasionally using a randomly generated transition rather than the one dictated by the policy µ (although this may not necessarily work because all admissible controls at a given state may produce “similar” successor states). This and other possibilities to improve exploration will be discussed further in Section 6.3.7.
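The effect of extra randomization on state coverage can be seen in a toy deterministic system. In the Python sketch below, the states are 0 to 3, the controls move one step left or right, and the evaluated policy always moves left, so without randomization only state 0 is ever visited. Applying the randomization to the control, with probability `epsilon` per step, is one possible reading of the idea and is an assumption of this example, as are all names and numbers.

```python
import random

N_STATES = 4
CONTROLS = (-1, +1)                  # move left or move right (assumed controls)

def step(i, u):
    """Deterministic transition, clamped to the range 0..N_STATES-1."""
    return min(max(i + u, 0), N_STATES - 1)

def policy(i):
    """The policy being evaluated: always move left."""
    return -1

def visit_counts(epsilon, n_steps=5000, seed=0):
    """Count state visits when a random control replaces the policy's control w.p. epsilon."""
    rng = random.Random(seed)
    counts = [0] * N_STATES
    i = 0
    for _ in range(n_steps):
        counts[i] += 1
        u = rng.choice(CONTROLS) if rng.random() < epsilon else policy(i)
        i = step(i, u)
    return counts

print("no randomization:  ", visit_counts(0.0))
print("with randomization:", visit_counts(0.5))
```

The samples collected with randomization no longer come from the policy being evaluated, which is why the way such samples are used requires care.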


Limited Sampling/Optimistic Policy Iteration

In the approximate policy iteration approach discussed so far, the policy evaluation of the cost of the improved policy $\bar\mu$ must be fully carried out. An alternative, known as optimistic policy iteration, is to replace the policy µ with the policy $\bar\mu$ after only a few simulation samples have been processed, at the risk of ˜J(·, r) being an inaccurate approximation of Jµ.

Optimistic policy iteration has been successfully used, among others, in an impressive backgammon application (Tesauro [Tes92]). However, the associated theoretical convergence properties are not fully understood. As will be illustrated by the discussion of Section 6.3.8 (see also Section 6.4.2 of [BeT96]), optimistic policy iteration can exhibit fascinating and counterintuitive behavior, including a natural tendency for a phenomenon called chattering, whereby the generated parameter sequence $\{r_k\}$ converges, while the generated policy sequence oscillates because the limit of $\{r_k\}$ corresponds to multiple policies.

We note that optimistic policy iteration tends to deal better with the problem of exploration discussed earlier, because with rapid changes of policy, there is less tendency to bias the simulation towards particular states that are favored by any single policy.

Approximate Policy Iteration Based on Q-Factors

The approximate policy iteration method discussed so far relies on the calculation of the approximation $\tilde J(\cdot, r)$ to the cost function $J_\mu$ of the current policy, which is then used for policy improvement using the minimization

$$\bar\mu(i) = \arg\min_{u \in U(i)} \sum_{j=1}^n p_{ij}(u)\big(g(i, u, j) + \alpha \tilde J(j, r)\big).$$

Carrying out this minimization requires knowledge of the transition probabilities $p_{ij}(u)$ and calculation of the associated expected values for all controls $u \in U(i)$ (otherwise a time-consuming simulation of these expected values is needed). A model-free alternative is to compute approximate Q-factors

$$\tilde Q(i, u, r) \approx \sum_{j=1}^n p_{ij}(u)\big(g(i, u, j) + \alpha J_\mu(j)\big), \tag{6.6}$$

and use the minimization

$$\bar\mu(i) = \arg\min_{u \in U(i)} \tilde Q(i, u, r) \tag{6.7}$$

for policy improvement. Here, $r$ is an adjustable parameter vector and $\tilde Q(i, u, r)$ is a parametric architecture, possibly of the linear form

$$\tilde Q(i, u, r) = \sum_{k=1}^s r_k \phi_k(i, u),$$

where $\phi_k(i, u)$ are basis functions that depend on both state and control [cf. Eq. (6.4)].

The important point here is that given the current policy µ, we can construct Q-factor approximations $\tilde Q(i, u, r)$ using any method for constructing cost approximations $\tilde J(i, r)$. The way to do this is to apply the latter method to the Markov chain whose states are the pairs (i, u), and the probability of transition from (i, u) to (j, v) is

pij(u) if v = µ(j),

and is 0 otherwise. This is the probabilistic mechanism by which state-control pairs evolve under the stationary policy µ.

A major concern with this approach is that the state-control pairs $(i, u)$ with $u \ne \mu(i)$ are never generated in this Markov chain, so they are not represented in the cost samples used to construct the approximation $\tilde Q(i, u, r)$ (see Fig. 6.1.4). This creates an acute difficulty due to diminished exploration, which must be carefully addressed in any simulation-based implementation. We will return to the use of Q-factors in Section 6.5, where we will discuss exact and approximate implementations of the Q-learning algorithm.
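The exploration difficulty can be seen directly by building the Markov chain over state-control pairs for a small model. The following is a minimal sketch, assuming a hypothetical 2-state, 2-control problem and a hypothetical policy `mu`; the names `p`, `mu`, and `pair_transition_prob` are illustrative and do not come from the text.

```python
# Sketch: Markov chain over state-control pairs (i, u) under a fixed policy mu.
# The 2-state, 2-control model and the policy below are hypothetical.

# p[u][i][j]: probability of moving from state i to state j under control u
p = {
    0: [[0.9, 0.1], [0.4, 0.6]],
    1: [[0.2, 0.8], [0.7, 0.3]],
}
mu = [0, 1]  # mu[j] is the control the policy applies at state j
pairs = [(i, u) for i in (0, 1) for u in (0, 1)]


def pair_transition_prob(i, u, j, v):
    """Probability of moving from pair (i, u) to pair (j, v)."""
    return p[u][i][j] if v == mu[j] else 0.0


# Transition matrix of the pair chain, rows and columns ordered as in `pairs`
P_pairs = [[pair_transition_prob(i, u, j, v) for (j, v) in pairs]
           for (i, u) in pairs]

# A pair can be visited after the first transition only if some row enters it
unexplored = [pairs[c] for c in range(len(pairs))
              if all(row[c] == 0.0 for row in P_pairs)]
print("pairs never generated after the first transition:", unexplored)
```

In this sketch the columns of the pairs $(i, u)$ with $u \ne \mu(i)$ are identically zero, so a simulated trajectory cannot supply cost samples for them unless some additional exploration mechanism is introduced.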

[Figure 6.1.4 diagram: states $i$, $j$ and state-control pairs $(i, u)$, $(j, v)$ under a fixed policy $\mu$; a transition from $(i, u)$ leads to $\big(j, \mu(j)\big)$ with probability $p_{ij}(u)$ and cost $g(i, u, j)$.]

Figure 6.1.4 Markov chain underlying Q-factor-based policy evaluation, associated with policy $\mu$. The states are the pairs $(i, u)$, and the probability of transition from $(i, u)$ to $(j, v)$ is $p_{ij}(u)$ if $v = \mu(j)$, and is 0 otherwise. Thus, after the first transition, the generated pairs are exclusively of the form $(i, \mu(i))$; pairs of the form $(i, u)$, $u \ne \mu(i)$, are not explored.

The Issue of Policy Oscillations

Contrary to exact policy iteration, which converges to an optimal policy in a fairly regular manner, approximate policy iteration may oscillate. By this we mean that after a few iterations, policies tend to repeat in cycles. The associated parameter vectors r may also tend to oscillate. This phenomenon is explained in Section 6.3.8 and can be particularly damaging, because there is no guarantee that the policies involved in the oscillation are “good” policies, and there is often no way to verify how well they perform relative to the optimal.

We note that oscillations can be avoided and approximate policy iteration can be shown to converge under special conditions that arise in particular when aggregation is used for policy evaluation. These conditions involve certain monotonicity assumptions regarding the choice of the matrix Φ, which are fulfilled in the case of aggregation (see Section 6.3.8, and also Section 6.4.2). However, when Φ is chosen in an unrestricted manner, as often happens in practical applications of the projected equation methods of Section 6.3, policy oscillations tend to occur generically, and often for very simple problems (see Section 6.3.8 for an example).

6.1.3 Direct and Indirect Approximation

We will now preview two general algorithmic approaches for approximating the cost function of a fixed stationary policy $\mu$ within a subspace of the form $S = \{\Phi r \mid r \in \Re^s\}$. (A third approach, based on aggregation, uses a special type of matrix $\Phi$ and is discussed in Section 6.4.) The first and most straightforward approach, referred to as direct, is to find an approximation $\tilde J \in S$ that matches best $J_\mu$ in some normed error sense, i.e.,

$$\min_{\tilde J \in S} \|J_\mu - \tilde J\|,$$

or equivalently,

$$\min_{r \in \Re^s} \|J_\mu - \Phi r\|$$

(see the left-hand side of Fig. 6.1.5).† Here, $\|\cdot\|$ is usually some (possibly weighted) Euclidean norm, in which case the approximation problem is a linear least squares problem, whose solution, denoted $r^*$, can in principle be obtained in closed form by solving the associated quadratic minimization problem. If the matrix $\Phi$ has linearly independent columns, the solution is unique and can also be represented as

$$\Phi r^* = \Pi J_\mu,$$

where $\Pi$ denotes projection with respect to $\|\cdot\|$ on the subspace $S$.‡ A major difficulty is that specific cost function values $J_\mu(i)$ can only be estimated through their simulation-generated cost samples, as we discuss in Section 6.2.

† Note that direct approximation may be used in other approximate DP contexts, such as finite horizon problems, where we use sequential single-stage approximation of the cost-to-go functions $J_k$, going backwards (i.e., starting with $J_N$, we obtain a least squares approximation of $J_{N-1}$, which is used in turn to obtain a least squares approximation of $J_{N-2}$, etc). This approach is sometimes called fitted value iteration.

‡ In what follows in this chapter, we will not distinguish between the linear operation of projection and the corresponding matrix representation, denoting them both by $\Pi$. The meaning should be clear from the context.

[Figure 6.1.5 diagram: two panels, each showing the subspace $S = \{\Phi r \mid r \in \Re^s\}$. Left panel, “Direct Method: Projection of cost vector $J_\mu$”, shows $J_\mu$ and its projection $\Pi J_\mu$. Right panel, “Indirect Method: Solving a projected form of Bellman’s equation”, shows $\Phi r$, $T_\mu(\Phi r)$, and $\Phi r = \Pi T_\mu(\Phi r)$.]

Figure 6.1.5 Two methods for approximating the cost function $J_\mu$ as a linear combination of basis functions (subspace $S$). In the direct method (figure on the left), $J_\mu$ is projected on $S$. In the indirect method (figure on the right), the approximation is found by solving $\Phi r = \Pi T_\mu(\Phi r)$, a projected form of Bellman’s equation.
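The direct approach described above amounts to a weighted least squares fit. The following is a minimal sketch, assuming two basis functions, hypothetical weights `xi`, and a hypothetical cost vector `J_mu` that is treated as known (in practice only simulation samples of it would be available). The helper names `solve_2x2` and `project_coeffs` are illustrative.

```python
# Sketch: direct approximation as a weighted least squares projection onto
# S = {Phi r}. All numbers are hypothetical; J_mu stands in for cost values
# that in practice would only be available through simulation samples.

def solve_2x2(A, b):
    """Solve the 2x2 linear system A r = b by Cramer's rule."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [(b[0] * A[1][1] - A[0][1] * b[1]) / det,
            (A[0][0] * b[1] - A[1][0] * b[0]) / det]


def project_coeffs(Phi, J, xi):
    """Return r* minimizing sum_i xi[i] * (J[i] - Phi[i] . r)**2 (two features)."""
    # Normal equations: (Phi' Xi Phi) r = Phi' Xi J
    A = [[sum(w * row[a] * row[b] for w, row in zip(xi, Phi)) for b in range(2)]
         for a in range(2)]
    rhs = [sum(w * row[a] * Ji for w, row, Ji in zip(xi, Phi, J)) for a in range(2)]
    return solve_2x2(A, rhs)


Phi = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]  # rows: features of state i
J_mu = [1.0, 2.9, 5.2, 7.1]                              # hypothetical cost vector
xi = [0.1, 0.2, 0.3, 0.4]                                # norm weights

r_star = project_coeffs(Phi, J_mu, xi)
Pi_J = [row[0] * r_star[0] + row[1] * r_star[1] for row in Phi]  # Phi r* = Pi J_mu
residual = [Ji - Pi for Ji, Pi in zip(J_mu, Pi_J)]
print("r* =", r_star)
```

The residual $J_\mu - \Phi r^*$ is orthogonal to the columns of $\Phi$ in the weighted inner product, which is the defining property of the projection $\Pi$.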

An alternative and more popular approach, referred to as indirect, is to approximate the solution of Bellman’s equation $J = T_\mu J$ on the subspace $S$ (see the right-hand side of Fig. 6.1.5). An important example of this approach, which we will discuss in detail in Section 6.3, leads to the problem of finding a vector $r^*$ such that

$$\Phi r^* = \Pi T_\mu(\Phi r^*). \tag{6.8}$$

We can view this equation as a projected form of Bellman’s equation. We will consider another type of indirect approach based on aggregation in Section 6.4.
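For a linear mapping $T_\mu J = g + \alpha P J$ and a projection with respect to a weighted Euclidean norm with weights $\xi$, Eq. (6.8) is equivalent to the orthogonality condition $\Phi' \Xi (\Phi r - g - \alpha P \Phi r) = 0$, which is a square linear system in $r$. The following is a minimal sketch of this reduction, assuming a hypothetical 3-state chain whose transition matrix is doubly stochastic (so that its steady-state distribution is uniform) and two hypothetical basis functions.

```python
# Sketch: solving the projected equation Phi r = Pi T_mu(Phi r) for a small
# discounted chain, with T_mu J = g + alpha P J. The model is hypothetical.

def solve_2x2(A, b):
    """Solve the 2x2 linear system A r = b by Cramer's rule."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [(b[0] * A[1][1] - A[0][1] * b[1]) / det,
            (A[0][0] * b[1] - A[1][0] * b[0]) / det]


alpha = 0.9
P = [[0.5, 0.3, 0.2],      # transition matrix of the policy (doubly stochastic,
     [0.2, 0.5, 0.3],      # so its steady-state distribution is uniform)
     [0.3, 0.2, 0.5]]
g = [1.0, 2.0, 0.5]        # expected one-stage costs
xi = [1.0 / 3] * 3         # projection weights = steady-state distribution
Phi = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]
n, s = 3, 2

# Orthogonality condition Phi' Xi (Phi r - g - alpha P Phi r) = 0, i.e. C r = d
P_Phi = [[sum(P[i][j] * Phi[j][k] for j in range(n)) for k in range(s)]
         for i in range(n)]
C = [[sum(xi[i] * Phi[i][a] * (Phi[i][b] - alpha * P_Phi[i][b]) for i in range(n))
      for b in range(s)] for a in range(s)]
d = [sum(xi[i] * Phi[i][a] * g[i] for i in range(n)) for a in range(s)]
r_star = solve_2x2(C, d)

# Check: projecting T_mu(Phi r*) back onto S should reproduce Phi r*
Phi_r = [row[0] * r_star[0] + row[1] * r_star[1] for row in Phi]
T_Phi_r = [g[i] + alpha * sum(P[i][j] * Phi_r[j] for j in range(n)) for i in range(n)]
A = [[sum(xi[i] * Phi[i][a] * Phi[i][b] for i in range(n)) for b in range(s)]
     for a in range(s)]
rhs = [sum(xi[i] * Phi[i][a] * T_Phi_r[i] for i in range(n)) for a in range(s)]
r_check = solve_2x2(A, rhs)
print("r* =", r_star, " projected-back coefficients =", r_check)
```

The final check applies $T_\mu$ to $\Phi r^*$ and projects the result back onto $S$; agreement of `r_check` with `r_star` confirms that $\Phi r^*$ is a fixed point of $\Pi T_\mu$ for this example.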

We note that solving projected equations as approximations to more complex/higher-dimensional equations has a long history in scientific computation in the context of Galerkin methods (see e.g., [Kra72]). For example, some of the most popular finite-element methods for partial differential equations are of this type. However, the use of the Monte Carlo simulation ideas that are central in approximate DP is an important characteristic that differentiates the methods of the present chapter from the Galerkin methodology.

An important fact here is that $\Pi T_\mu$ is a contraction, provided we use a special weighted Euclidean norm for projection, as will be proved in Section 6.3 for discounted problems (Prop. 6.3.1). In this case, Eq. (6.8) has a unique solution, and allows the use of algorithms such as LSPE(λ) and TD(λ), which are discussed in Section 6.3. Unfortunately, the contraction property of $\Pi T_\mu$ does not extend to the case where $T_\mu$ is replaced by $T$, the DP mapping corresponding to multiple/all policies, although there are some interesting exceptions, one of which relates to optimal stopping problems and is discussed in Section 6.5.3.

6.1.4 Simplifications

We now consider various situations where the special structure of the problem may be exploited to simplify policy iteration or other approximate DP algorithms.

Problems with Uncontrollable State Components

In many problems of interest the state is a composite $(i, y)$ of two components $i$ and $y$, and the evolution of the main component $i$ can be directly affected by the control $u$, but the evolution of the other component $y$ cannot. Then as discussed in Section 1.4 of Vol. I, the value and the policy iteration algorithms can be carried out over a smaller state space, the space of the controllable component $i$. In particular, we assume that given the state $(i, y)$ and the control $u$, the next state $(j, z)$ is determined as follows: $j$ is generated according to transition probabilities $p_{ij}(u, y)$, and $z$ is generated according to conditional probabilities $p(z \mid j)$ that depend on the main component $j$ of the new state (see Fig. 6.1.6). Let us assume for notational convenience that the cost of a transition from state $(i, y)$ is of the form $g(i, y, u, j)$ and does not depend on the uncontrollable component $z$ of the next state $(j, z)$. If $g$ depends on $z$ it can be replaced by

$$\hat g(i, y, u, j) = \sum_z p(z \mid j)\, g(i, y, u, j, z)$$

in what follows.

[Figure 6.1.6 diagram: from state $(i, y)$, control $u$ moves the controllable component to $j$ with probability $p_{ij}(u, y)$ and cost $g(i, y, u, j)$; the uncontrollable component $z$ of the next state $(j, z)$ is then generated with probability $p(z \mid j)$, with no control involved.]

Figure 6.1.6 States and transition probabilities for a problem with uncontrollable state components.


For an α-discounted problem, consider the mapping $\hat T$ defined by

$$(\hat T \hat J)(i) = \sum_y p(y \mid i)(T \hat J)(i, y) = \sum_y p(y \mid i) \min_{u \in U(i, y)} \sum_{j=0}^n p_{ij}(u, y)\big(g(i, y, u, j) + \alpha \hat J(j)\big),$$

and the corresponding mapping for a stationary policy $\mu$,

$$(\hat T_\mu \hat J)(i) = \sum_y p(y \mid i)(T_\mu \hat J)(i, y) = \sum_y p(y \mid i) \sum_{j=0}^n p_{ij}\big(\mu(i, y), y\big)\Big(g\big(i, y, \mu(i, y), j\big) + \alpha \hat J(j)\Big).$$

Bellman’s equation, defined over the controllable state component $i$, takes the form

$$\hat J(i) = (\hat T \hat J)(i), \qquad \text{for all } i. \tag{6.9}$$

The typical iteration of the simplified policy iteration algorithm consists of two steps:

(a) The policy evaluation step, which given the current policy $\mu^k(i, y)$, computes the unique $\hat J_{\mu^k}(i)$, $i = 1, \ldots, n$, that solve the linear system of equations $\hat J_{\mu^k} = \hat T_{\mu^k} \hat J_{\mu^k}$ or equivalently

$$\hat J_{\mu^k}(i) = \sum_y p(y \mid i) \sum_{j=0}^n p_{ij}\big(\mu^k(i, y), y\big)\Big(g\big(i, y, \mu^k(i, y), j\big) + \alpha \hat J_{\mu^k}(j)\Big)$$

for all $i = 1, \ldots, n$.

(b) The policy improvement step, which computes the improved policy $\mu^{k+1}(i, y)$, from the equation $\hat T_{\mu^{k+1}} \hat J_{\mu^k} = \hat T \hat J_{\mu^k}$ or equivalently

$$\mu^{k+1}(i, y) = \arg\min_{u \in U(i, y)} \sum_{j=0}^n p_{ij}(u, y)\big(g(i, y, u, j) + \alpha \hat J_{\mu^k}(j)\big),$$

for all $(i, y)$.

Approximate policy iteration algorithms can be similarly carried out in reduced form.
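To illustrate the reduction to the controllable component, the following minimal sketch runs value iteration with the mapping $\hat T$ over $i$ alone and compares the result with value iteration over the full state $(i, y)$. The tiny model (two values each for $i$, $y$, $u$) and the functions `p`, `g`, and `q_value` are hypothetical choices made for the illustration; value iteration is used here instead of policy iteration only to keep the sketch short.

```python
# Sketch: value iteration over the controllable component i only, compared with
# value iteration over the full state (i, y). The tiny model is hypothetical.

alpha = 0.9
I, Y, U = (0, 1), (0, 1), (0, 1)

# p_y[i][y]: probability that the uncontrollable component is y given main component i
p_y = [[0.3, 0.7], [0.6, 0.4]]


def p(i, u, y, j):
    """Transition probability p_ij(u, y) of the main component."""
    stay = 0.8 if u == 0 else 0.3
    if y == 1:
        stay -= 0.1
    return stay if j == i else 1.0 - stay


def g(i, y, u, j):
    """Stage cost g(i, y, u, j); it does not depend on the next y."""
    return 1.0 * i + 0.5 * u + 0.2 * y + (0.3 if j != i else 0.0)


def q_value(i, y, u, J_hat):
    """Expected cost of control u at (i, y), given reduced cost-to-go J_hat."""
    return sum(p(i, u, y, j) * (g(i, y, u, j) + alpha * J_hat[j]) for j in I)


# Reduced value iteration: J_hat <- T_hat J_hat
J_hat = [0.0, 0.0]
for _ in range(400):
    J_hat = [sum(p_y[i][y] * min(q_value(i, y, u, J_hat) for u in U) for y in Y)
             for i in I]

# Full value iteration over (i, y), for comparison
J_full = {(i, y): 0.0 for i in I for y in Y}
for _ in range(400):
    avg = [sum(p_y[j][z] * J_full[(j, z)] for z in Y) for j in I]
    J_full = {(i, y): min(q_value(i, y, u, avg) for u in U) for i in I for y in Y}

J_hat_from_full = [sum(p_y[i][y] * J_full[(i, y)] for y in Y) for i in I]
print(J_hat, J_hat_from_full)
```

The reduced iteration stores one number per value of $i$ rather than one per pair $(i, y)$, and its limit agrees with the average of the full cost-to-go over $y$.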


Problems with Post-Decision States

In some stochastic problems, the transition probabilities and stage costs have the special form

$$p_{ij}(u) = q\big(j \mid f(i, u)\big), \tag{6.10}$$

where $f$ is some function and $q\big(\cdot \mid f(i, u)\big)$ is a given probability distribution for each value of $f(i, u)$. In words, the dependence of the transitions on $(i, u)$ comes through the function $f(i, u)$. We may exploit this structure by viewing $f(i, u)$ as a form of state: a post-decision state that determines the probabilistic evolution to the next state. An example where condition (6.10) is satisfied is the class of inventory control problems of the type considered in Section 4.2 of Vol. I. There the post-decision state at time $k$ is $x_k + u_k$, i.e., the post-purchase inventory, before any demand at time $k$ has been filled.

Post-decision states can be exploited when the stage cost has no dependence on $j$,† i.e., when we have (with some notation abuse)

$$g(i, u, j) = g(i, u).$$

Then the optimal cost-to-go within an α-discounted context at state $i$ is given by

$$J^*(i) = \min_{u \in U(i)} \Big[g(i, u) + \alpha V^*\big(f(i, u)\big)\Big],$$

while the optimal cost-to-go at post-decision state $m$ (optimal sum of costs of future stages) is given by

$$V^*(m) = \sum_{j=1}^n q(j \mid m) J^*(j).$$

In effect, we consider a modified problem where the state space is enlarged to include post-decision states, with transitions between ordinary states and post-decision states specified by $f$ and $q\big(\cdot \mid f(i, u)\big)$ (see Fig. 6.1.7). The preceding two equations represent Bellman’s equation for this modified problem.

Combining these equations, we have

$$V^*(m) = \sum_{j=1}^n q(j \mid m) \min_{u \in U(j)} \Big[g(j, u) + \alpha V^*\big(f(j, u)\big)\Big], \qquad \forall\ m, \tag{6.11}$$

which can be viewed as Bellman’s equation over the space of post-decision states $m$. This equation is similar to Q-factor equations, but is defined over the space of post-decision states rather than the larger space of state-control pairs. The advantage of this equation is that once the function $V^*$ is calculated (or approximated), the optimal policy can be computed as

$$\mu^*(i) = \arg\min_{u \in U(i)} \Big[g(i, u) + \alpha V^*\big(f(i, u)\big)\Big],$$

which does not require the knowledge of transition probabilities and computation of an expected value. It involves a deterministic optimization, and it can be used in a model-free context (as long as the functions $g$ and $f$ are known). This is important if the calculation of the optimal policy is done on-line.

† If there is dependence on $j$, one may consider computing, possibly by simulation, (an approximation to) $g(i, u) = \sum_{j=1}^n p_{ij}(u) g(i, u, j)$, and using it in place of $g(i, u, j)$.

[Figure 6.1.7 diagram: a state-control pair $(i, u)$ leads to the post-decision state $m = f(i, u)$ with cost $g(i, u, m)$; from $m$, with no control involved, the next state $j$ is generated with probability $q(j \mid m)$, where a control $v$ is chosen.]

Figure 6.1.7 Modified problem where the post-decision states are viewed as additional states.

It is straightforward to construct a policy iteration algorithm that is defined over the space of post-decision states. The cost-to-go function $V_\mu$ of a stationary policy $\mu$ is the unique solution of the corresponding Bellman equation

$$V_\mu(m) = \sum_{j=1}^n q(j \mid m)\Big(g\big(j, \mu(j)\big) + \alpha V_\mu\big(f(j, \mu(j))\big)\Big), \qquad \forall\ m.$$

Given $V_\mu$, the improved policy is obtained as

$$\bar\mu(i) = \arg\min_{u \in U(i)} \Big[g(i, u) + \alpha V_\mu\big(f(i, u)\big)\Big], \qquad i = 1, \ldots, n.$$

There are also corresponding approximate policy iteration methods with cost function approximation.

An advantage of this method when implemented by simulation is that the computation of the improved policy does not require the calculation of expected values. Moreover, with a simulator, the policy evaluation of $V_\mu$ can be done in model-free fashion, without explicit knowledge of the probabilities $q(j \mid m)$. These advantages are shared with policy iteration algorithms based on Q-factors. However, when function approximation is used in policy iteration, the methods using post-decision states may have a significant advantage over Q-factor-based methods: they use cost function approximation in the space of post-decision states, rather than the larger space of state-control pairs, and they are less susceptible to difficulties due to inadequate exploration.
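As a concrete illustration of Bellman's equation (6.11) over post-decision states and of the deterministic policy computation, the following minimal sketch uses a hypothetical inventory-like problem with stock levels 0 to 2, post-decision state $m = i + u$, and a demand of 0 or 1 units with equal probability. The cost function `g` and all numbers are assumptions made for the example. The result is compared with ordinary value iteration over the states $i$.

```python
# Sketch: value iteration over post-decision states m = f(i, u), Eq. (6.11),
# for a hypothetical inventory problem, checked against ordinary value iteration.

alpha = 0.9
states = [0, 1, 2]                    # stock levels; also the possible values of m


def controls(i):
    """Admissible order quantities at stock i (capacity is 2 units)."""
    return range(0, 3 - i)


def f(i, u):
    """Post-decision state: stock after ordering, before demand is filled."""
    return i + u


def g(i, u):
    """Stage cost, with no dependence on the next state j (hypothetical numbers)."""
    return 1.0 * u + 0.3 * i + (2.0 if i == 0 else 0.0)


def q(j, m):
    """Probability q(j | m): demand is 0 or 1 with equal probability."""
    return sum(0.5 for w in (0, 1) if max(m - w, 0) == j)


# Value iteration on V(m) = sum_j q(j|m) min_u [g(j,u) + alpha V(f(j,u))]
V = [0.0, 0.0, 0.0]
for _ in range(500):
    V = [sum(q(j, m) * min(g(j, u) + alpha * V[f(j, u)] for u in controls(j))
             for j in states) for m in states]

# Costs and policy recovered by a deterministic minimization (no expectation)
J_post = [min(g(i, u) + alpha * V[f(i, u)] for u in controls(i)) for i in states]
policy = [min(controls(i), key=lambda u: g(i, u) + alpha * V[f(i, u)])
          for i in states]

# Ordinary value iteration over states, using p_ij(u) = q(j | f(i, u))
J = [0.0, 0.0, 0.0]
for _ in range(500):
    J = [min(g(i, u) + alpha * sum(q(j, f(i, u)) * J[j] for j in states)
             for u in controls(i)) for i in states]

print("J from post-decision states:", J_post)
print("J from ordinary value iteration:", J)
print("policy:", policy)
```

Once `V` is available, the policy is obtained without any expected value computation; only the functions `g` and `f` are evaluated.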

We note that there is a similar simplification with post-decision states when $g$ is of the form

$$g(i, u, j) = h\big(f(i, u), j\big),$$

for some function $h$. Then we have

$$J^*(i) = \min_{u \in U(i)} V^*\big(f(i, u)\big),$$

where $V^*$ is the unique solution of the equation

$$V^*(m) = \sum_{j=1}^n q(j \mid m)\Big(h(m, j) + \alpha \min_{u \in U(j)} V^*\big(f(j, u)\big)\Big), \qquad \forall\ m.$$

Here $V^*(m)$ should be interpreted as the optimal cost-to-go from post-decision state $m$, including the cost $h(m, j)$ incurred within the stage when $m$ was generated. When $h$ does not depend on $j$, the equation takes the simpler form

$$V^*(m) = h(m) + \alpha \sum_{j=1}^n q(j \mid m) \min_{u \in U(j)} V^*\big(f(j, u)\big), \qquad \forall\ m. \tag{6.12}$$

Example 6.1.4 (Tetris)

Let us revisit the game of tetris, which was discussed in Example 1.4.1 of Vol. I in the context of problems with an uncontrollable state component. We will show that it also admits a post-decision state. Assuming that the game terminates with probability 1 for every policy (a proof of this has been given by Burgiel [Bur97]), we can model the problem of finding an optimal tetris playing strategy as a stochastic shortest path problem.

The state consists of two components:

(1) The board position, i.e., a binary description of the full/empty status of each square, denoted by x.

(2) The shape of the current falling block, denoted by y (this is the uncontrollable component).

The control, denoted by u, is the horizontal positioning and rotation applied to the falling block.


Bellman’s equation over the space of the controllable state component takes the form

$$\hat J(x) = \sum_y p(y) \max_u \Big[g(x, y, u) + \hat J\big(f(x, y, u)\big)\Big], \qquad \text{for all } x,$$

where g(x, y, u) and f (x, y, u) are the number of points scored (rows removed), and the board position when the state is (x, y) and control u is applied, respectively [cf. Eq. (6.9)].

This problem also admits a post-decision state. Once u is applied at state (x, y), a new board position m is obtained, and the new state component x is obtained from m after removing a number of rows. Thus we have

m = f (x, y, u)

for some function $f$, and $m$ also determines the reward of the stage, which has the form $h(m)$ for some function $h$ [$h(m)$ is the number of complete rows that can be removed from $m$]. Thus, $m$ may serve as a post-decision state, and the corresponding Bellman’s equation takes the form (6.12), i.e.,

$$V^*(m) = h(m) + \sum_{(x, y)} q(m, x, y) \max_{u \in U(x, y)} V^*\big(f(x, y, u)\big), \qquad \forall\ m,$$

where (x, y) is the state that follows m, and q(m, x, y) are the corresponding transition probabilities. Note that both of the simplified Bellman’s equations share the same characteristic: they involve a deterministic optimization.

Trading off Complexity of Control Space with Complexity of State Space

Suboptimal control using cost function approximation deals fairly well with large state spaces, but still encounters serious difficulties when the number of controls available at each state is large. In particular, the minimization

$$\min_{u \in U(i)} \sum_{j=1}^n p_{ij}(u)\big(g(i, u, j) + \tilde J(j, r)\big)$$

using an approximate cost-to-go function $\tilde J(j, r)$ may be very time-consuming. For multistep lookahead schemes, the difficulty is exacerbated, since the required computation grows exponentially with the size of the lookahead horizon. It is thus useful to know that by reformulating the problem, it may be possible to reduce the complexity of the control space by increasing the complexity of the state space. The potential advantage is that the extra state space complexity may still be dealt with by using function approximation and/or rollout.


In particular, suppose that the control $u$ consists of $m$ components,

$$u = (u_1, \ldots, u_m).$$

Then, at a given state $i$, we can break down $u$ into the sequence of the $m$ controls $u_1, u_2, \ldots, u_m$, and introduce artificial intermediate “states” $(i, u_1), (i, u_1, u_2), \ldots, (i, u_1, \ldots, u_{m-1})$, and corresponding transitions to model the effect of these controls. The choice of the last control component $u_m$ at “state” $(i, u_1, \ldots, u_{m-1})$ marks the transition to state $j$ according to the given transition probabilities $p_{ij}(u)$. In this way the control space is simplified at the expense of introducing $m - 1$ additional layers of states, and $m - 1$ additional cost-to-go functions

$$J_1(i, u_1),\ J_2(i, u_1, u_2),\ \ldots,\ J_{m-1}(i, u_1, \ldots, u_{m-1}).$$

To deal with the increase in size of the state space we may use rollout, i.e., when at “state” $(i, u_1, \ldots, u_k)$, assume that future controls $u_{k+1}, \ldots, u_m$ will be chosen by a base heuristic. Alternatively, we may use function approximation, that is, introduce cost-to-go approximations

$$\tilde J_1(i, u_1, r_1),\ \tilde J_2(i, u_1, u_2, r_2),\ \ldots,\ \tilde J_{m-1}(i, u_1, \ldots, u_{m-1}, r_{m-1}),$$

in addition to $\tilde J(i, r)$. We refer to [BeT96], Section 6.1.4, for further discussion.

A potential complication in the preceding schemes arises when the controls $u_1, \ldots, u_m$ are coupled through a constraint of the form

$$u = (u_1, \ldots, u_m) \in U(i). \tag{6.13}$$

Then, when choosing a control $u_k$, care must be exercised to ensure that the future controls $u_{k+1}, \ldots, u_m$ can be chosen together with the already chosen controls $u_1, \ldots, u_k$ to satisfy the feasibility constraint (6.13). This requires a variant of the rollout algorithm that works with constrained DP problems; see Exercise 6.19 of Vol. I, and also references [Ber05a], [Ber05b].

6.1.5 Monte Carlo Simulation

In this subsection and the next, we will try to provide some orientation into the mathematical content of this chapter. The reader may wish to skip these subsections at first, but return to them later for a higher level view of some of the subsequent technical material.

The methods of this chapter rely to a large extent on simulation in conjunction with cost function approximation in order to deal with large state spaces. The advantage that simulation holds in this regard can be traced to its ability to compute (approximately) sums with a very large number of terms. These sums arise in a number of contexts: inner product and matrix-vector product calculations, the solution of linear systems of equations and policy evaluation, linear least squares problems, etc.


Example 6.1.5 (Approximate Policy Evaluation)

Consider the approximate solution of the Bellman equation that corresponds to a given policy of an n-state discounted problem:

$$J = g + \alpha P J,$$

where $P$ is the transition probability matrix and $\alpha$ is the discount factor. Let us adopt a hard aggregation approach (cf. Section 6.3.4 of Vol. I; see also Section 6.4 later in this chapter), whereby we divide the $n$ states in two disjoint subsets $I_1$ and $I_2$ with $I_1 \cup I_2 = \{1, \ldots, n\}$, and we use the piecewise constant approximation

$$J(i) = \begin{cases} r_1 & \text{if } i \in I_1, \\ r_2 & \text{if } i \in I_2. \end{cases}$$

This corresponds to the linear feature-based architecture $J \approx \Phi r$, where $\Phi$ is the $n \times 2$ matrix with column components equal to 1 or 0, depending on whether the component corresponds to $I_1$ or $I_2$.

We obtain the approximate equations

$$J(i) \approx g(i) + \alpha \Big(\sum_{j \in I_1} p_{ij}\Big) r_1 + \alpha \Big(\sum_{j \in I_2} p_{ij}\Big) r_2, \qquad i = 1, \ldots, n,$$

which we can reduce to just two equations by forming two weighted sums (with equal weights) of the equations corresponding to the states in $I_1$ and $I_2$, respectively:

$$r_1 \approx \frac{1}{n_1} \sum_{i \in I_1} J(i), \qquad r_2 \approx \frac{1}{n_2} \sum_{i \in I_2} J(i),$$

where $n_1$ and $n_2$ are numbers of states in $I_1$ and $I_2$, respectively. We thus obtain the aggregate system of the following two equations in $r_1$ and $r_2$:

$$r_1 = \frac{1}{n_1} \sum_{i \in I_1} g(i) + \frac{\alpha}{n_1} \Big(\sum_{i \in I_1} \sum_{j \in I_1} p_{ij}\Big) r_1 + \frac{\alpha}{n_1} \Big(\sum_{i \in I_1} \sum_{j \in I_2} p_{ij}\Big) r_2,$$

$$r_2 = \frac{1}{n_2} \sum_{i \in I_2} g(i) + \frac{\alpha}{n_2} \Big(\sum_{i \in I_2} \sum_{j \in I_1} p_{ij}\Big) r_1 + \frac{\alpha}{n_2} \Big(\sum_{i \in I_2} \sum_{j \in I_2} p_{ij}\Big) r_2.$$

Here the challenge, when the number of states $n$ is very large, is the calculation of the large sums in the right-hand side, which can be of order $O(n^2)$.

Simulation allows the approximate calculation of these sums with complexity that is independent of n. This is similar to the advantage that Monte-Carlo integration holds over numerical integration, as discussed in standard texts on Monte-Carlo methods.
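The aggregate system of this example can be written out for a small chain. The following is a minimal sketch, assuming a hypothetical 4-state model in which the sums are computed exactly (for large $n$ these are the sums that simulation would estimate). The model is deliberately chosen so that the costs and the group-to-group transition probabilities are the same for all states in a group; in that special case the piecewise constant approximation happens to be exact, which gives a simple check.

```python
# Sketch: the two-equation aggregate system of Example 6.1.5 for a hypothetical
# 4-state chain, with I1 = {0, 1} and I2 = {2, 3} (0-based state indices).

alpha = 0.9
P = [[0.5, 0.2, 0.2, 0.1],
     [0.3, 0.4, 0.1, 0.2],
     [0.1, 0.1, 0.4, 0.4],
     [0.2, 0.0, 0.3, 0.5]]
g = [1.0, 1.0, 3.0, 3.0]
groups = [[0, 1], [2, 3]]

# b[k]: average of g over group k; A[k][l]: average over i in group k of the
# probability of moving into group l
b = [sum(g[i] for i in Ik) / len(Ik) for Ik in groups]
A = [[sum(P[i][j] for i in Ik for j in Il) / len(Ik) for Il in groups]
     for Ik in groups]

# Solve r = b + alpha * A r, i.e. (I - alpha A) r = b, by Cramer's rule
M = [[1.0 - alpha * A[0][0], -alpha * A[0][1]],
     [-alpha * A[1][0], 1.0 - alpha * A[1][1]]]
det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
r = [(b[0] * M[1][1] - M[0][1] * b[1]) / det,
     (M[0][0] * b[1] - M[1][0] * b[0]) / det]

# Piecewise constant approximation J(i) = r1 on I1, r2 on I2
J_approx = [r[0], r[0], r[1], r[1]]
bellman_residual = [J_approx[i] - g[i] - alpha * sum(P[i][j] * J_approx[j]
                                                     for j in range(4))
                    for i in range(4)]
print("r =", r)
```

For a general model the Bellman residual of the piecewise constant approximation would not vanish; the zero residual here is a consequence of the special structure assumed for the example.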


To see how simulation can be used with advantage, let us consider the problem of estimating a scalar sum of the form

$$z = \sum_{\omega \in \Omega} v(\omega),$$

where $\Omega$ is a finite set and $v : \Omega \mapsto \Re$ is a function of $\omega$. We introduce a distribution $\xi$ that assigns positive probability $\xi(\omega)$ to every element $\omega \in \Omega$ (but is otherwise arbitrary), and we generate a sequence

$$\{\omega_1, \ldots, \omega_T\}$$

of samples from $\Omega$, with each sample $\omega_t$ taking values from $\Omega$ according to $\xi$. We then estimate $z$ with

$$\hat z_T = \frac{1}{T} \sum_{t=1}^T \frac{v(\omega_t)}{\xi(\omega_t)}. \tag{6.14}$$

Clearly $\hat z_T$ is unbiased:

$$E[\hat z_T] = \frac{1}{T} \sum_{t=1}^T E\left[\frac{v(\omega_t)}{\xi(\omega_t)}\right] = \frac{1}{T} \sum_{t=1}^T \sum_{\omega \in \Omega} \xi(\omega) \frac{v(\omega)}{\xi(\omega)} = \sum_{\omega \in \Omega} v(\omega) = z.$$

Suppose now that the samples are generated in a way that the long-term frequency of each $\omega \in \Omega$ is equal to $\xi(\omega)$, i.e.,

$$\lim_{T \to \infty} \sum_{t=1}^T \frac{\delta(\omega_t = \omega)}{T} = \xi(\omega), \qquad \forall\ \omega \in \Omega, \tag{6.15}$$

where $\delta(\cdot)$ denotes the indicator function [$\delta(E) = 1$ if the event $E$ has occurred and $\delta(E) = 0$ otherwise]. Then from Eq. (6.14), we have

$$\hat z_T = \sum_{\omega \in \Omega} \sum_{t=1}^T \frac{\delta(\omega_t = \omega)}{T} \cdot \frac{v(\omega)}{\xi(\omega)},$$

and by taking limit as $T \to \infty$ and using Eq. (6.15),

$$\lim_{T \to \infty} \hat z_T = \sum_{\omega \in \Omega} \lim_{T \to \infty} \sum_{t=1}^T \frac{\delta(\omega_t = \omega)}{T} \cdot \frac{v(\omega)}{\xi(\omega)} = \sum_{\omega \in \Omega} v(\omega) = z.$$

Thus in the limit, as the number of samples increases, we obtain the desired sum $z$. An important case, of particular relevance to the methods of this chapter, is when $\Omega$ is the set of states of an irreducible Markov chain. Then, if we generate an infinitely long trajectory $\{\omega_1, \omega_2, \ldots\}$ starting from any initial state $\omega_1$, the condition (6.15) will hold with probability 1, with $\xi(\omega)$ being the steady-state probability of state $\omega$.
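The estimator of Eq. (6.14) is simple to try out. The following is a minimal sketch, assuming independent samples, a hypothetical function `v` on a set of 1000 elements, and a uniform sampling distribution; the helper name `estimate_sum` is illustrative.

```python
import random

# Sketch: estimating z = sum of v(w) over a finite set by sampling, as in
# Eq. (6.14). The function v and the sampling distribution are hypothetical.


def estimate_sum(v, xi, T, rng):
    """Return (1/T) * sum_t v(w_t) / xi(w_t) for T independent samples w_t ~ xi."""
    samples = rng.choices(range(len(v)), weights=xi, k=T)
    return sum(v[w] / xi[w] for w in samples) / T


rng = random.Random(0)                       # fixed seed for reproducibility
N = 1000                                     # number of terms in the sum
v = [1.0 + (w % 7) for w in range(N)]        # terms v(w), w = 0, ..., N-1
z = sum(v)                                   # exact sum, for comparison only
xi = [1.0 / N] * N                           # uniform sampling distribution

z_hat = estimate_sum(v, xi, T=2000, rng=rng)
print("exact:", z, "estimate:", z_hat)
```

The exact sum is computed here only to judge the estimate; in the intended applications the number of terms is too large for that, while the sampling cost depends only on the number of samples $T$.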

The samples $\omega_t$ need not be independent for the preceding properties to hold, but if they are, then the variance of $\hat z_T$ is the sum of the variances of the independent components in the sum of Eq. (6.14), and is given by

$$\mathrm{var}(\hat z_T) = \frac{1}{T^2} \sum_{t=1}^T \sum_{\omega \in \Omega} \xi(\omega) \left(\frac{v(\omega)}{\xi(\omega)} - z\right)^2 = \frac{1}{T} \sum_{\omega \in \Omega} \xi(\omega) \left(\frac{v(\omega)}{\xi(\omega)} - z\right)^2. \tag{6.16}$$

An important observation from this formula is that the accuracy of the approximation does not depend on the number of terms in the sum $z$ (the number of elements in $\Omega$), but rather depends on the variance of the random variable that takes values $v(\omega)/\xi(\omega)$, $\omega \in \Omega$, with probabilities $\xi(\omega)$.† Thus, it is possible to execute approximately linear algebra operations of very large size through Monte Carlo sampling (with whatever distributions may be convenient in a given context), and this is a principal idea underlying the methods of this chapter.

In the case where the samples are dependent, the variance formula (6.16) does not hold, but similar qualitative conclusions can be drawn under various assumptions, which ensure that the dependencies between samples become sufficiently weak over time (see the specialized literature).
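For independent samples, the variance formula (6.16) can be evaluated directly to compare sampling distributions. The following minimal sketch uses a hypothetical four-element set and compares uniform sampling with the distribution proportional to $v$, which is the zero-variance choice $\xi^* = v/z$ mentioned in the footnote on importance sampling. The function name `estimator_variance` is illustrative.

```python
# Sketch: evaluating the variance formula (6.16) for two sampling distributions.
# The values v(w) below are hypothetical.


def estimator_variance(v, xi, T):
    """Variance of the estimator of Eq. (6.14) for T independent samples from xi."""
    z = sum(v)
    return sum(x * (vw / x - z) ** 2 for vw, x in zip(v, xi)) / T


v = [1.0, 2.0, 3.0, 4.0]
z = sum(v)

uniform = [0.25] * 4                 # uniform sampling distribution
ideal = [vw / z for vw in v]         # xi* = v / z (needs z, so not usable in practice)

var_uniform = estimator_variance(v, uniform, T=100)
var_ideal = estimator_variance(v, ideal, T=100)
print("variance with uniform sampling:", var_uniform)
print("variance with xi proportional to v:", var_ideal)
```

With $\xi$ proportional to $v$, every sample ratio $v(\omega)/\xi(\omega)$ equals $z$, so the variance vanishes up to rounding; a distribution that only approximates $v$ would reduce the variance without eliminating it.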

Monte Carlo simulation is also important in the context of this chapter for an additional reason. In addition to its ability to compute efficiently sums of very large numbers of terms, it can often do so in model-free fashion (i.e., by using a simulator, rather than an explicit model of the terms in the sum).

6.1.6 Contraction Mappings and Simulation

Most of the chapter (Sections 6.3-6.8) deals with the approximate computation of a fixed point of a (linear or nonlinear) mapping $T$ within a subspace

$$S = \{\Phi r \mid r \in \Re^s\}.$$

† The selection of the distribution $\{\xi(\omega) \mid \omega \in \Omega\}$ can be optimized (at least approximately), and methods for doing this are the subject of the technique of importance sampling. In particular, assuming that samples are independent and that $v(\omega) \ge 0$ for all $\omega \in \Omega$, we have from Eq. (6.16) that the optimal distribution is $\xi^* = v/z$ and the corresponding minimum variance value is 0. However, $\xi^*$ cannot be computed without knowledge of $z$. Instead, $\xi$ is usually chosen to be an approximation to $v$, normalized so that its components add to 1. Note that we may assume that $v(\omega) \ge 0$ for all $\omega \in \Omega$ without loss of generality: when $v$ takes negative values, we may decompose $v$ as

$$v = v^+ - v^-,$$

so that both $v^+$ and $v^-$ are positive functions, and then estimate separately $z^+ = \sum_{\omega \in \Omega} v^+(\omega)$ and $z^- = \sum_{\omega \in \Omega} v^-(\omega)$.

We will discuss a variety of approaches with distinct characteristics, but at an abstract mathematical level, these approaches fall into two categories:

(a) A projected equation approach, based on the equation

Φr = ΠT (Φr), (6.17)

where Π is a projection operation with respect to a Euclidean norm (see Section 6.3 for discounted problems, and Sections 7.1-7.3 for other types of problems).

(b) An aggregation approach, based on an equation of the form

Φr = ΦDT (Φr), (6.18)

where $D$ is an $s \times n$ matrix whose rows are probability distributions, and $\Phi$ and $D$ are matrices that satisfy certain restrictions.

When iterative methods are used for solution of Eqs. (6.17) and (6.18), it is important that ΠT and ΦDT be contractions over the subspace S. Note here that even if T is a contraction mapping (as is ordinarily the case in DP), it does not follow that ΠT and ΦDT are contractions. In our analysis, this is resolved by requiring that T be a contraction with respect to a norm such that Π or ΦD, respectively, is a nonexpansive mapping. As a result, we need various assumptions on T , Φ, and D, which guide the algorithmic development. We postpone further discussion of these issues, but for the moment we note that the projection approach revolves mostly around Euclidean norm contractions and cases where T is linear, while the aggregation/Q-learning approach revolves mostly around sup-norm contractions.

If $T$ is linear, both equations (6.17) and (6.18) may be written as square systems of linear equations of the form $Cr = d$, whose solution can be approximated by simulation. The approach here is very simple: we approximate $C$ and $d$ with simulation-generated approximations $\hat C$ and $\hat d$, and we solve the resulting (approximate) linear system $\hat C r = \hat d$ by matrix inversion, thereby obtaining the solution estimate $\hat r = \hat C^{-1} \hat d$. A primary example is the LSTD methods of Section 6.3.4. We may also try to solve the linear system $\hat C r = \hat d$ iteratively, which leads to the LSPE type of methods, some of which produce estimates of $r$ simultaneously with the generation of the simulation samples of $w$ (see Section 6.3.4).
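The idea of replacing $C$ and $d$ by simulation-generated estimates can be sketched for the projected equation with $T J = g + \alpha P J$. The estimates below have the LSTD(0) form, averaging $\phi(i_t)\big(\phi(i_t) - \alpha \phi(i_{t+1})\big)'$ and $\phi(i_t) g(i_t)$ along a single trajectory; this specific form, the 3-state model, and the two features are assumptions of the sketch rather than something stated in the text above.

```python
import random

# Sketch: forming simulation-based estimates C_hat, d_hat of a 2x2 system
# C r = d from a single trajectory, then solving C_hat r = d_hat.
# The LSTD(0)-type estimates and the model are assumptions of this sketch.


def solve_2x2(A, b):
    """Solve the 2x2 linear system A r = b by Cramer's rule."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [(b[0] * A[1][1] - A[0][1] * b[1]) / det,
            (A[0][0] * b[1] - A[1][0] * b[0]) / det]


alpha = 0.9
P = [[0.5, 0.3, 0.2], [0.2, 0.5, 0.3], [0.3, 0.2, 0.5]]   # hypothetical chain
g = [1.0, 2.0, 0.5]                            # cost g(i) incurred at state i
phi = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]     # feature vector phi(i)

rng = random.Random(0)
T = 50000                                      # trajectory length
C_hat = [[0.0, 0.0], [0.0, 0.0]]
d_hat = [0.0, 0.0]
i = 0
for _ in range(T):
    j = rng.choices((0, 1, 2), weights=P[i])[0]   # simulate one transition i -> j
    for a in range(2):
        d_hat[a] += phi[i][a] * g[i] / T
        for b in range(2):
            C_hat[a][b] += phi[i][a] * (phi[i][b] - alpha * phi[j][b]) / T
    i = j

r_hat = solve_2x2(C_hat, d_hat)
print("simulation-based solution estimate:", r_hat)
```

Only sampled transitions of the chain enter the computation; the matrix `P` is used solely to simulate them, so a simulator could take its place.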

Stochastic Approximation Methods

Let us also mention some stochastic iterative algorithms that are based on a somewhat different simulation idea, and fall within the framework of stochastic approximation methods. The TD(λ) and Q-learning algorithms fall in this category. For an informal orientation, let us consider the computation of the fixed point of a general mapping $F : \Re^n \mapsto \Re^n$ that is a contraction mapping with respect to some norm, and involves an expected value: it has the form

$$F(x) = E\big[f(x, w)\big], \tag{6.19}$$

where $x \in \Re^n$ is a generic argument of $F$, $w$ is a random variable and $f(\cdot, w)$ is a given function. Assume for simplicity that $w$ takes values in a finite set $W$ with probabilities $p(w)$, so that the fixed point equation $x = F(x)$ has the form

$$x = \sum_{w \in W} p(w) f(x, w).$$

We generate a sequence of samples {w1, w2, . . .} such that the empirical frequency of each value w ∈ W is equal to its probability p(w), i.e.,

$$\lim_{k \to \infty} \frac{n_k(w)}{k} = p(w), \qquad w \in W,$$

where nk(w) denotes the number of times that w appears in the first k samples w1, . . . , wk. This is a reasonable assumption that may be verified by application of various laws of large numbers to the sampling method at hand.

Given the samples, we may consider approximating the fixed point of F by the (approximate) fixed point iteration

$$x_{k+1} = \sum_{w \in W} \frac{n_k(w)}{k} f(x_k, w), \tag{6.20}$$

which can also be equivalently written as

$$x_{k+1} = \frac{1}{k} \sum_{i=1}^k f(x_k, w_i). \tag{6.21}$$

We may view Eq. (6.20) as a simulation-based version of the convergent fixed point iteration

$$x_{k+1} = F(x_k) = \sum_{w \in W} p(w) f(x_k, w),$$

where the probabilities $p(w)$ have been replaced by the empirical frequencies $n_k(w)/k$. Thus we expect that the simulation-based iteration (6.21) converges to the fixed point of $F$.

On the other hand, the iteration (6.21) has a major flaw: it requires, for each $k$, the computation of $f(x_k, w_i)$ for all sample values $w_i$, $i = 1, \ldots, k$. An algorithm that requires much less computation than iteration (6.21) is

$$x_{k+1} = \frac{1}{k} \sum_{i=1}^{k} f(x_i, w_i), \qquad k = 1, 2, \ldots, \tag{6.22}$$

where only one value of $f$ per sample $w_i$ is computed. This iteration can also be written in the simple recursive form

$$x_{k+1} = (1 - \gamma_k) x_k + \gamma_k f(x_k, w_k), \qquad k = 1, 2, \ldots, \tag{6.23}$$

with the stepsize $\gamma_k$ having the form $\gamma_k = 1/k$. As an indication of its validity, we note that if it converges to some limit then this limit must be the fixed point of $F$, since for large $k$ the iteration (6.22) becomes essentially identical to the iteration $x_{k+1} = F(x_k)$. Other stepsize rules, which satisfy $\gamma_k \to 0$ and $\sum_{k=1}^{\infty} \gamma_k = \infty$, may also be used. However, a rigorous analysis of the convergence of iteration (6.23) is nontrivial and is beyond our scope. The book by Bertsekas and Tsitsiklis [BeT96] contains a fairly detailed development, which is tailored to DP. Other more general references are Benveniste, Metivier, and Priouret [BMP90], Borkar [Bor08], Kushner and Yin [KuY03], and Meyn [Mey07].
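The equivalence of (6.22) and (6.23) for $\gamma_k = 1/k$ follows by splitting off the last term of the sum: for $k \ge 2$, Eq. (6.22) gives $x_k = \frac{1}{k-1} \sum_{i=1}^{k-1} f(x_i, w_i)$, so $x_{k+1} = \frac{k-1}{k} x_k + \frac{1}{k} f(x_k, w_k)$. The Python sketch below checks this numerically on a hypothetical scalar example that is not from the text ($f(x, w) = 0.5x + w$, a three-point set $W$, i.i.d. sampling); all function and variable names are illustrative assumptions.

```python
import random

# Hypothetical scalar example (an assumption, not from the text):
# f(x, w) = 0.5 * x + w, so F(x) = 0.5 * x + E{w} with fixed point 2 * E{w}.
W = [0.0, 1.0, 2.0]
p = [0.2, 0.5, 0.3]


def f(x, w):
    return 0.5 * x + w


def averaged_iteration(samples, x1=0.0):
    """Iteration (6.22): x_{k+1} = (1/k) * sum_{i=1}^k f(x_i, w_i)."""
    x, total = x1, 0.0
    for k, w in enumerate(samples, start=1):
        total += f(x, w)  # only one new evaluation of f per sample
        x = total / k
    return x


def stochastic_approximation(samples, x1=0.0):
    """Iteration (6.23) with stepsize gamma_k = 1/k."""
    x = x1
    for k, w in enumerate(samples, start=1):
        gamma = 1.0 / k
        x = (1.0 - gamma) * x + gamma * f(x, w)
    return x


rng = random.Random(0)
samples = rng.choices(W, weights=p, k=20000)
x_exact = 2.0 * sum(pw * w for pw, w in zip(p, W))
x_avg = averaged_iteration(samples)
x_sa = stochastic_approximation(samples)
print(x_exact, x_avg, x_sa)
```

The recursive form needs no stored history, which is what makes other stepsize sequences easy to substitute for $1/k$.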

6.2 DIRECT POLICY EVALUATION - GRADIENT METHODS

We will now consider the direct approach for policy evaluation.† In particular, suppose that the current policy is $\mu$, and for a given $r$, $\tilde J(i, r)$ is an approximation of $J_\mu(i)$. We generate an “improved” policy $\bar\mu$ using the formula

$$\bar\mu(i) = \arg\min_{u \in U(i)} \sum_{j=1}^{n} p_{ij}(u) \big( g(i, u, j) + \alpha \tilde J(j, r) \big), \qquad \text{for all } i. \tag{6.24}$$
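The policy improvement step of Eq. (6.24), which picks at each state $i$ a control minimizing the expected one-stage cost plus the discounted approximate cost of the next state, might be sketched in Python as follows. This is a minimal illustration under assumptions not in the text: a small finite problem stored in dense NumPy arrays, the same control set at every state, and a hypothetical function name `improved_policy`.

```python
import numpy as np


def improved_policy(P, g, J_tilde, alpha):
    """Policy improvement step in the spirit of Eq. (6.24).

    P[u, i, j]: transition probability p_ij(u)  (hypothetical array layout)
    g[u, i, j]: stage cost g(i, u, j)
    J_tilde[j]: approximate cost J~(j, r) under the current policy
    Assumes every control is admissible at every state.
    """
    # Q[u, i] = sum_j p_ij(u) * (g(i, u, j) + alpha * J_tilde[j])
    Q = np.einsum("uij,uij->ui", P, g + alpha * J_tilde[None, None, :])
    return np.argmin(Q, axis=0)


# Two states, two controls: control 0 always leads to state 0,
# control 1 always leads to state 1; every transition costs 1.
P = np.array([[[1.0, 0.0], [1.0, 0.0]],
              [[0.0, 1.0], [0.0, 1.0]]])
g = np.ones((2, 2, 2))
J_tilde = np.array([0.0, 10.0])  # state 1 looks expensive
mu_bar = improved_policy(P, g, J_tilde, alpha=0.9)
print(mu_bar)  # control 0 is chosen at both states
```

When the control constraint set $U(i)$ depends on $i$, the inadmissible entries of `Q` could be set to infinity before taking the minimum.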

To evaluate approximately $J_{\bar\mu}$, we select a subset of “representative” states $\tilde S$ (perhaps obtained by some form of simulation), and for each $i \in \tilde S$, we obtain $M(i)$ samples of the cost $J_{\bar\mu}(i)$. The $m$th such sample is denoted by $c(i, m)$, and mathematically, it can be viewed as being $J_{\bar\mu}(i)$ plus some simulation error/noise.‡ Then we obtain the corresponding parameter vector $\bar r$ by solving the following least squares problem

$$\min_{r} \sum_{i \in \tilde S} \sum_{m=1}^{M(i)} \big( \tilde J(i, r) - c(i, m) \big)^2, \tag{6.25}$$

and we repeat the process with $\bar\mu$ and $\bar r$ replacing $\mu$ and $r$, respectively (see Fig. 6.1.1).

† Direct policy evaluation methods have been historically important, and provide an interesting contrast with indirect methods. However, they are currently less popular than the projected equation methods to be considered in the next section, despite some generic advantages (the option to use nonlinear approximation architectures, and the capability of more accurate approximation). The material of this section will not be substantially used later, so the reader may read this section lightly without loss of continuity.

The least squares problem (6.25) can be solved exactly if a linear approximation architecture is used, i.e., if

$$\tilde J(i, r) = \phi(i)' r,$$

where $\phi(i)'$ is a row vector of features corresponding to state $i$. In this case $r$ is obtained by solving the linear system of equations

$$\sum_{i \in \tilde S} \sum_{m=1}^{M(i)} \phi(i) \big( \phi(i)' r - c(i, m) \big) = 0,$$

which is obtained by setting to 0 the gradient with respect to r of the quadratic cost in the minimization (6.25). When a nonlinear architecture is used, we may use gradient-like methods for solving the least squares problem (6.25), as we will now discuss.
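For the linear architecture, the linear system above can be assembled and solved directly. The following Python sketch is one possible way to do so; the dictionary-based containers `Phi` and `costs` and the function name `fit_linear_architecture` are hypothetical conveniences, not notation from the text, and the cost samples in the example are noise-free by construction.

```python
import numpy as np


def fit_linear_architecture(Phi, costs):
    """Solve sum_i sum_m phi(i) * (phi(i)' r - c(i, m)) = 0 for r.

    Phi:   dict mapping state i -> feature vector phi(i)
    costs: dict mapping state i -> list of cost samples c(i, 1), ..., c(i, M(i))
    """
    s = len(next(iter(Phi.values())))
    A = np.zeros((s, s))
    b = np.zeros(s)
    for i, samples in costs.items():
        phi = np.asarray(Phi[i], dtype=float)
        A += len(samples) * np.outer(phi, phi)  # M(i) * phi(i) phi(i)'
        b += phi * np.sum(samples)              # phi(i) * sum_m c(i, m)
    # lstsq also handles a singular A by returning a minimum-norm solution.
    r, *_ = np.linalg.lstsq(A, b, rcond=None)
    return r


# Three representative states with features (1, i); the samples are generated
# exactly from r = (2, 3), so the fit should recover that vector.
Phi = {0: [1.0, 0.0], 1: [1.0, 1.0], 2: [1.0, 2.0]}
costs = {0: [2.0, 2.0], 1: [5.0], 2: [8.0, 8.0, 8.0]}
r_fit = fit_linear_architecture(Phi, costs)
print(r_fit)
```

States with more samples receive proportionally more weight in the fit, since each sample contributes one term to the sum.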

Batch Gradient Methods for Policy Evaluation

Let us focus on an $N$-transition portion $(i_0, \ldots, i_N)$ of a simulated trajectory, also called a batch. We view the numbers

$$\sum_{t=k}^{N-1} \alpha^{t-k} g\big( i_t, \mu(i_t), i_{t+1} \big), \qquad k = 0, \ldots, N-1,$$

as cost samples, one per initial state $i_0, \ldots, i_{N-1}$, which can be used for least squares approximation of the parametric architecture $\tilde J(i, r)$ [cf. Eq. (6.25)]:

$$\min_{r} \sum_{k=0}^{N-1} \frac{1}{2} \left( \tilde J(i_k, r) - \sum_{t=k}^{N-1} \alpha^{t-k} g\big( i_t, \mu(i_t), i_{t+1} \big) \right)^2. \tag{6.26}$$

‡ The manner in which the samples $c(i, m)$ are collected is immaterial for the purposes of the subsequent discussion. Thus one may generate these samples through a single very long trajectory of the Markov chain corresponding to $\bar\mu$, or one may use multiple trajectories, with different starting points, to ensure that enough cost samples are generated for a “representative” subset of states. In either case, the samples $c(i, m)$ corresponding to any one state $i$ will generally be correlated as well as “noisy.” Still the average $\frac{1}{M(i)} \sum_{m=1}^{M(i)} c(i, m)$ will ordinarily converge to $J_{\bar\mu}(i)$ as $M(i) \to \infty$ by a law of large numbers argument [see Exercise 6.2 and the discussion in [BeT96], Sections 5.1, 5.2, regarding the behavior of the average when $M(i)$ is finite and random].

One way to solve this least squares problem is to use a gradient method, whereby the parameter $r$ associated with $\mu$ is updated at time $N$ by

$$r := r - \gamma \sum_{k=0}^{N-1} \nabla \tilde J(i_k, r) \left( \tilde J(i_k, r) - \sum_{t=k}^{N-1} \alpha^{t-k} g\big( i_t, \mu(i_t), i_{t+1} \big) \right). \tag{6.27}$$

Here, $\nabla \tilde J$ denotes gradient with respect to $r$ and $\gamma$ is a positive stepsize, which is usually diminishing over time (we leave its precise choice open for the moment). Each of the $N$ terms in the summation in the right-hand side above is the gradient of a corresponding term in the least squares summation of problem (6.26). Note that the update of $r$ is done after processing the entire batch, and that the gradients $\nabla \tilde J(i_k, r)$ are evaluated at the preexisting value of $r$, i.e., the one before the update.
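A single batch update of the form (6.27) might look as follows in Python. This is a sketch only, specialized to the linear architecture $\tilde J(i, r) = \phi(i)' r$ (so that $\nabla \tilde J(i_k, r) = \phi(i_k)$); the function names, the array layout, and the numbers in the example are assumptions made for illustration. The cost samples are computed by a backward recursion over the batch.

```python
import numpy as np


def discounted_tail_costs(stage_costs, alpha):
    """Cost samples of (6.26): c_k = sum_{t=k}^{N-1} alpha^(t-k) * g_t."""
    c = np.zeros(len(stage_costs))
    tail = 0.0
    for k in reversed(range(len(stage_costs))):
        tail = stage_costs[k] + alpha * tail  # c_k = g_k + alpha * c_{k+1}
        c[k] = tail
    return c


def batch_gradient_step(r, features, stage_costs, alpha, gamma):
    """One update (6.27) for the linear architecture J~(i, r) = phi(i)' r.

    features[k] is phi(i_k), k = 0, ..., N-1 (hypothetical array layout);
    stage_costs[t] is g(i_t, mu(i_t), i_{t+1}).
    """
    c = discounted_tail_costs(stage_costs, alpha)
    residuals = features @ r - c  # J~(i_k, r) - cost sample, at the old r
    return r - gamma * (features.T @ residuals)


# Batch visiting states 0, 1, 0 with one indicator feature per state.
features = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
stage_costs = np.array([1.0, 2.0, 3.0])
c = discounted_tail_costs(stage_costs, alpha=0.5)
r_new = batch_gradient_step(np.zeros(2), features, stage_costs,
                            alpha=0.5, gamma=0.1)
print(c, r_new)
```

All residuals are formed with the value of `r` passed in, which mirrors the remark that the gradients are evaluated before the update.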

In a traditional gradient method, the gradient iteration (6.27) is repeated, until convergence to the solution of the least squares problem (6.26), i.e., a single $N$-transition batch is used. However, there is an important tradeoff relating to the size $N$ of the batch: in order to reduce simulation error and generate multiple cost samples for a representatively large subset of states, it is necessary to use a large $N$, yet to keep the work per gradient iteration small it is necessary to use a small $N$.

To address the issue of size of $N$, an expanded view of the gradient method is preferable in practice, whereby batches may be changed after one or more iterations. Thus, in this more general method, the $N$-transition batch used in a given gradient iteration comes from a potentially longer simulated trajectory, or from one of many simulated trajectories. A sequence of gradient iterations is performed, with each iteration using cost samples formed from batches collected in a variety of different ways and whose length $N$ may vary. Batches may also overlap to a substantial degree. We leave the method for generating simulated trajectories and forming batches open for the moment, but we note that it influences strongly the result of the corresponding least squares optimization (6.25), providing better approximations for the states that arise most frequently in the batches used. This is related to the issue of ensuring that the state space is adequately “explored,” with an adequately broad selection of states being represented in the least squares optimization, cf. our earlier discussion on the exploration issue.

The gradient method (6.27) is simple, widely known, and easily understood. There are extensive convergence analyses of this method and