We will now consider the direct approach for policy evaluation.‡ In particular, suppose that the current policy is $\mu$, and for a given $r$, $\tilde J(i, r)$ is an approximation of $J_\mu(i)$. We generate an “improved” policy $\bar\mu$ using the
† The selection of the distribution $\{\xi(\omega) \mid \omega \in \Omega\}$ can be optimized (at least approximately), and methods for doing this are the subject of the technique of importance sampling. In particular, assuming that samples are independent and that $v(\omega) \ge 0$ for all $\omega \in \Omega$, we have
$$\mathrm{var}(\hat z_T) = \frac{z^2}{T}\left(\sum_{\omega\in\Omega} \frac{\big(v(\omega)/z\big)^2}{\xi(\omega)} - 1\right),$$
and the optimal distribution is $\xi^* = v/z$, for which the corresponding minimum variance value is 0. However, $\xi^*$ cannot be computed without knowledge of $z$. Instead, $\xi$ is usually chosen to be an approximation to $v$, normalized so that its components add to 1. Note that we may assume that $v(\omega) \ge 0$ for all $\omega \in \Omega$ without loss of generality: when $v$ takes negative values, we may decompose $v$ as
$$v = v^+ - v^-,$$
so that both $v^+$ and $v^-$ are positive functions, and then estimate separately $z^+ = \sum_{\omega\in\Omega} v^+(\omega)$ and $z^- = \sum_{\omega\in\Omega} v^-(\omega)$.
‡ Direct policy evaluation methods have been historically important, and provide an interesting contrast with indirect methods. However, they are currently less popular than the projected equation methods to be considered in the next section, despite some generic advantages (the option to use nonlinear approximation architectures, and the capability of more accurate approximation).
The material of this section will not be substantially used later, so the reader may read this section lightly without loss of continuity.
formula
$$\bar\mu(i) = \arg\min_{u\in U(i)} \sum_{j=1}^{n} p_{ij}(u)\big(g(i, u, j) + \alpha \tilde J(j, r)\big), \qquad \text{for all } i. \tag{6.17}$$
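The minimization in Eq. (6.17) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the nested-list layout `p[i][u][j]`, `g[i][u][j]`, the name `improved_policy`, and the two-state data are all hypothetical, and the approximate costs $\tilde J(j, r)$ are assumed to be precomputed and passed in as a list.

```python
def improved_policy(p, g, controls, J_tilde, alpha):
    """One-step lookahead policy improvement, cf. Eq. (6.17).

    p[i][u][j]  : transition probability from i to j under control u
    g[i][u][j]  : cost of the transition (i, u, j)
    controls[i] : admissible controls U(i) at state i
    J_tilde[j]  : approximate cost J~(j, r) of state j
    alpha       : discount factor
    """
    n = len(J_tilde)
    policy = []
    for i in range(n):
        def expected_cost(u, i=i):
            # sum_j p_ij(u) * (g(i, u, j) + alpha * J~(j, r))
            return sum(p[i][u][j] * (g[i][u][j] + alpha * J_tilde[j])
                       for j in range(n))
        policy.append(min(controls[i], key=expected_cost))
    return policy


# Hypothetical two-state, two-control problem (illustrative numbers only).
p = [
    [[0.9, 0.1], [0.2, 0.8]],  # state 0: rows are controls 0 and 1
    [[0.5, 0.5], [0.1, 0.9]],  # state 1
]
g = [
    [[1.0, 1.0], [2.0, 2.0]],
    [[0.0, 0.0], [3.0, 3.0]],
]
controls = [[0, 1], [0, 1]]
J_tilde = [10.0, 0.0]
mu_bar = improved_policy(p, g, controls, J_tilde, alpha=0.9)
print(mu_bar)
```

In a simulation-based setting the transition probabilities may not be available explicitly; the sketch assumes they are, purely to make the formula concrete.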
To evaluate approximately $J_{\bar\mu}$, we select a subset of “representative” states $\tilde S$ (perhaps obtained by some form of simulation), and for each $i \in \tilde S$, we obtain $M(i)$ samples of the cost $J_{\bar\mu}(i)$. The $m$th such sample is denoted by $c(i, m)$, and mathematically, it can be viewed as being $J_{\bar\mu}(i)$ plus some simulation error/noise.† Then we obtain the corresponding parameter vector $\bar r$ by solving the following least squares problem
$$\min_r \sum_{i\in\tilde S} \sum_{m=1}^{M(i)} \big(\tilde J(i, r) - c(i, m)\big)^2, \tag{6.18}$$
and we repeat the process with $\bar\mu$ and $\bar r$ replacing $\mu$ and $r$, respectively (see Fig. 6.1.1).
The least squares problem (6.18) can be solved exactly if a linear approximation architecture is used, i.e., if
$$\tilde J(i, r) = \phi(i)'r,$$
where $\phi(i)'$ is a row vector of features corresponding to state $i$. In this case $r$ is obtained by solving the linear system of equations
$$\sum_{i\in\tilde S} \sum_{m=1}^{M(i)} \phi(i)\big(\phi(i)'r - c(i, m)\big) = 0,$$
which is obtained by setting to 0 the gradient with respect to r of the quadratic cost in the minimization (6.18). When a nonlinear architecture is used, we may use gradient-like methods for solving the least squares problem (6.18), as we will now discuss.
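For the linear case, the system above can be assembled and solved directly. The following is a minimal sketch, not a reference implementation: the function names, the dictionary layout for features and samples, and the numerical data are assumptions made for illustration, and the sketch assumes the assembled matrix is invertible (the features of the sampled states span $\Re^s$).

```python
def solve_linear_system(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[k]] for k, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda row: abs(M[row][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for row in range(col + 1, n):
            factor = M[row][col] / M[col][col]
            for c in range(col, n + 1):
                M[row][c] -= factor * M[col][c]
    x = [0.0] * n
    for row in reversed(range(n)):
        partial = sum(M[row][c] * x[c] for c in range(row + 1, n))
        x[row] = (M[row][n] - partial) / M[row][row]
    return x


def fit_linear_architecture(phi, samples):
    """Solve sum_i sum_m phi(i) (phi(i)' r - c(i, m)) = 0 for r.

    phi[i]     : feature vector of state i (list of length s)
    samples[i] : list of cost samples c(i, 1), ..., c(i, M(i))
    """
    s = len(next(iter(phi.values())))
    A = [[0.0] * s for _ in range(s)]
    b = [0.0] * s
    for i, costs in samples.items():
        f = phi[i]
        for c in costs:
            for a in range(s):
                b[a] += f[a] * c              # right-hand side: phi(i) c(i, m)
                for d in range(s):
                    A[a][d] += f[a] * f[d]    # matrix: phi(i) phi(i)'
    return solve_linear_system(A, b)


# Hypothetical data: three representative states, features (1, x).
phi = {0: [1.0, 0.0], 1: [1.0, 1.0], 2: [1.0, 2.0]}
samples = {0: [0.9, 1.1], 1: [3.0], 2: [4.8, 5.2]}
r = fit_linear_architecture(phi, samples)
print(r)
```

States with more samples contribute more terms to both sides of the system, so they carry more weight in the fit.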
† The manner in which the samples $c(i, m)$ are collected is immaterial for the purposes of the subsequent discussion. Thus one may generate these samples through a single very long trajectory of the Markov chain corresponding to $\mu$, or one may use multiple trajectories, with different starting points, to ensure that enough cost samples are generated for a “representative” subset of states. In either case, the samples $c(i, m)$ corresponding to any one state $i$ will generally be correlated as well as “noisy.” Still the average $\frac{1}{M(i)}\sum_{m=1}^{M(i)} c(i, m)$ will ordinarily converge to $J_\mu(i)$ as $M(i) \to \infty$ by a law of large numbers argument [see Exercise 6.2 and the discussion in [BeT96], Sections 5.1, 5.2, regarding the behavior of the average when $M(i)$ is finite and random].
Batch Gradient Methods for Policy Evaluation
Let us focus on an $N$-transition portion $(i_0, \ldots, i_N)$ of a simulated trajectory, also called a batch. We view the numbers
$$\sum_{t=k}^{N-1} \alpha^{t-k} g\big(i_t, \mu(i_t), i_{t+1}\big), \qquad k = 0, \ldots, N-1,$$
as cost samples, one per initial state $i_0, \ldots, i_{N-1}$, which can be used for least squares approximation of the parametric architecture $\tilde J(i, r)$ [cf. Eq. (6.18)]:
$$\min_r \sum_{k=0}^{N-1} \frac{1}{2}\left(\tilde J(i_k, r) - \sum_{t=k}^{N-1} \alpha^{t-k} g\big(i_t, \mu(i_t), i_{t+1}\big)\right)^2. \tag{6.19}$$
One way to solve this least squares problem is to use a gradient method, whereby the parameter $r$ associated with $\mu$ is updated at time $N$ by
$$r := r - \gamma \sum_{k=0}^{N-1} \nabla\tilde J(i_k, r)\left(\tilde J(i_k, r) - \sum_{t=k}^{N-1} \alpha^{t-k} g\big(i_t, \mu(i_t), i_{t+1}\big)\right). \tag{6.20}$$
Here, $\nabla\tilde J$ denotes gradient with respect to $r$ and $\gamma$ is a positive stepsize, which is usually diminishing over time (we leave its precise choice open for the moment). Each of the $N$ terms in the summation in the right-hand side above is the gradient of a corresponding term in the least squares summation of problem (6.19). Note that the update of $r$ is done after processing the entire batch, and that the gradients $\nabla\tilde J(i_k, r)$ are evaluated at the preexisting value of $r$, i.e., the one before the update.
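As an illustration of iteration (6.20), here is a sketch of a single batch gradient step for a linear architecture $\tilde J(i, r) = \phi(i)'r$, whose gradient with respect to $r$ is $\phi(i)$. The function name, the argument layout, and the numerical data are hypothetical; the discounted cost samples are computed by a backward recursion, which is just one convenient way to form the inner sums.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def batch_gradient_step(r, states, costs, phi, alpha, gamma):
    """One batch gradient iteration, cf. Eq. (6.20), linear architecture.

    states : [i_0, ..., i_N], the visited states of the batch
    costs  : [g_0, ..., g_{N-1}], with g_t = g(i_t, mu(i_t), i_{t+1})
    phi[i] : feature vector of state i, so that J~(i, r) = phi(i)' r
    """
    N = len(costs)
    # Cost samples sum_{t=k}^{N-1} alpha^(t-k) g_t, computed backward.
    cost_to_go = [0.0] * N
    tail = 0.0
    for k in reversed(range(N)):
        tail = costs[k] + alpha * tail
        cost_to_go[k] = tail
    # Accumulate the full gradient using the pre-update value of r.
    grad = [0.0] * len(r)
    for k in range(N):
        f = phi[states[k]]
        error = dot(f, r) - cost_to_go[k]
        grad = [ga + fa * error for ga, fa in zip(grad, f)]
    # r changes only once, after the whole batch has been processed.
    return [ra - gamma * ga for ra, ga in zip(r, grad)]


# Hypothetical batch with N = 2 transitions and lookup-table features.
phi = {0: [1.0, 0.0], 1: [0.0, 1.0]}
r_new = batch_gradient_step(
    r=[0.0, 0.0], states=[0, 1, 0], costs=[1.0, 4.0],
    phi=phi, alpha=0.5, gamma=0.1,
)
print(r_new)
```

The work per call grows linearly with $N$, which is the per-iteration cost that the choice of batch size has to balance against simulation error.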
In a traditional gradient method, the gradient iteration (6.20) is repeated until convergence to the solution of the least squares problem (6.19), i.e., a single $N$-transition batch is used. However, there is an important tradeoff relating to the size $N$ of the batch: in order to reduce simulation error and generate multiple cost samples for a representatively large subset of states, it is necessary to use a large $N$, yet to keep the work per gradient iteration small it is necessary to use a small $N$.
To address the issue of size of $N$, an expanded view of the gradient method is preferable in practice, whereby batches may be changed after one or more iterations. Thus, in this more general method, the $N$-transition batch used in a given gradient iteration comes from a potentially longer simulated trajectory, or from one of many simulated trajectories. A sequence of gradient iterations is performed, with each iteration using cost samples formed from batches collected in a variety of different ways and whose length $N$ may vary. Batches may also overlap to a substantial degree.
We leave the method for generating simulated trajectories and forming batches open for the moment, but we note that it influences strongly the result of the corresponding least squares optimization (6.18), providing better approximations for the states that arise most frequently in the batches used. This is related to the issue of ensuring that the state space is adequately “explored,” with an adequately broad selection of states being represented in the least squares optimization, cf. our earlier discussion on the exploration issue.
The gradient method (6.20) is simple, widely known, and easily understood. There are extensive convergence analyses of this method and its variations, for which we refer to the literature cited at the end of the chapter. These analyses often involve considerable mathematical sophistication, particularly when multiple batches are involved, because of the stochastic nature of the simulation and the complex correlations between the cost samples. However, qualitatively, the conclusions of these analyses are consistent among themselves as well as with practical experience, and indicate that:
(1) Under some reasonable technical assumptions, convergence to a limiting value of $r$ that is a local minimum of the associated optimization problem is expected.
(2) For convergence, it is essential to gradually reduce the stepsize to 0, the most popular choice being to use a stepsize proportional to $1/m$, while processing the $m$th batch. In practice, considerable trial and error may be needed to settle on an effective stepsize choice method. Sometimes it is possible to improve performance by using a different stepsize (or scaling factor) for each component of the gradient.
(3) The rate of convergence is often very slow, and depends among other things on the initial choice of $r$, the number of states and the dynamics of the associated Markov chain, the level of simulation error, and the method for stepsize choice. In fact, the rate of convergence is sometimes so slow that practical convergence is infeasible, even if theoretical convergence is guaranteed.
Incremental Gradient Methods for Policy Evaluation
We will now consider a variant of the gradient method called incremental. This method can also be described through the use of $N$-transition batches, but we will see that (contrary to the batch version discussed earlier) the method is suitable for use with very long batches, including the possibility of a single very long simulated trajectory, viewed as a single batch.
For a given $N$-transition batch $(i_0, \ldots, i_N)$, the batch gradient method processes the $N$ transitions all at once, and updates $r$ using Eq. (6.20). The incremental method updates $r$ a total of $N$ times, once after each transition. Each time it adds to $r$ the corresponding portion of the gradient in the right-hand side of Eq. (6.20) that can be calculated using the newly available simulation data. Thus, after each transition $(i_k, i_{k+1})$:
(1) We evaluate the gradient $\nabla\tilde J(i_k, r)$ at the current value of $r$.
(2) We sum all the terms in the right-hand side of Eq. (6.20) that involve the transition $(i_k, i_{k+1})$, and we update $r$ by making a correction along their sum:
$$r := r - \gamma\left(\nabla\tilde J(i_k, r)\,\tilde J(i_k, r) - \left(\sum_{t=0}^{k} \alpha^{k-t} \nabla\tilde J(i_t, r)\right) g\big(i_k, \mu(i_k), i_{k+1}\big)\right). \tag{6.21}$$
By adding the parenthesized “incremental” correction terms in the above iteration, we see that after $N$ transitions, all the terms of the batch iteration (6.20) will have been accumulated, but there is a difference: in the incremental version, $r$ is changed during the processing of the batch, and the gradient $\nabla\tilde J(i_t, r)$ is evaluated at the most recent value of $r$ [after the transition $(i_t, i_{t+1})$]. By contrast, in the batch version these gradients are evaluated at the value of $r$ prevailing at the beginning of the batch. Note that the gradient sum in the right-hand side of Eq. (6.21) can be conveniently updated following each transition, thereby resulting in an efficient implementation.
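The remark about updating the gradient sum after each transition can be made concrete with a short sketch of iteration (6.21). It assumes a linear architecture $\tilde J(i, r) = \phi(i)'r$, so that each gradient is simply $\phi(i_t)$ and does not depend on $r$; the running sum `z` then obeys the recursion `z = alpha * z + phi(i_k)`. The function name, the variable `z`, and the data are illustrative assumptions, not part of the original text.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def incremental_gradient_pass(r, states, costs, phi, alpha, gamma):
    """Process one batch incrementally, cf. Eq. (6.21), linear architecture.

    states : [i_0, ..., i_N]
    costs  : [g_0, ..., g_{N-1}], with g_k = g(i_k, mu(i_k), i_{k+1})
    """
    r = list(r)
    # z holds sum_{t=0}^{k} alpha^(k-t) * grad J~(i_t, r); for a linear
    # architecture every gradient is the fixed feature vector phi(i_t).
    z = [0.0] * len(r)
    for k, g_k in enumerate(costs):
        f = phi[states[k]]
        z = [alpha * za + fa for za, fa in zip(z, f)]
        J_k = dot(f, r)  # J~(i_k, r) at the current, already updated r
        r = [ra - gamma * (fa * J_k - za * g_k)
             for ra, fa, za in zip(r, f, z)]
    return r


# Hypothetical batch: two transitions, overlapping features.
phi = {0: [1.0, 0.0], 1: [1.0, 1.0]}
r_inc = incremental_gradient_pass(
    r=[0.0, 0.0], states=[0, 1, 0], costs=[1.0, 4.0],
    phi=phi, alpha=0.5, gamma=0.1,
)
print(r_inc)
```

Each transition costs work proportional to the number of parameters, independent of the batch length. With a nonlinear architecture, the stored sum would combine gradients evaluated at different past values of $r$, as described in the text.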
It can now be seen that because $r$ is updated at intermediate transitions within a batch (rather than at the end of the batch), the location of the end of the batch becomes less relevant. It is thus possible to have very long batches, and indeed the algorithm can be operated with a single very long simulated trajectory and a single batch. In this case, for each state $i$, we will have one cost sample for every time when state $i$ is encountered in the simulation. Accordingly state $i$ will be weighted in the least squares optimization in proportion to the frequency of its occurrence within the simulated trajectory.
Generally, within the least squares/policy evaluation context of this section, the incremental versions of the gradient methods can be implemented more flexibly and tend to converge faster than their batch counterparts, so they will be adopted as the default in our discussion. The book by Bertsekas and Tsitsiklis [BeT96] contains an extensive analysis of the theoretical convergence properties of incremental gradient methods (they are fairly similar to those of batch methods), and provides some insight into the reasons for their superior performance relative to the batch versions; see also the author’s nonlinear programming book [Ber99] (Section 1.5.2), and the paper by Bertsekas and Tsitsiklis [BeT00]. Even so, the rate of convergence can be very slow.
Implementation Using Temporal Differences – TD(1)
We now introduce an alternative, mathematically equivalent, implementation of the batch and incremental gradient iterations (6.20) and (6.21), which is described with cleaner formulas. It uses the notion of temporal difference (TD for short) given by
$$q_k = \tilde J(i_k, r) - \alpha\tilde J(i_{k+1}, r) - g\big(i_k, \mu(i_k), i_{k+1}\big), \qquad k = 0, \ldots, N-2, \tag{6.22}$$
$$q_{N-1} = \tilde J(i_{N-1}, r) - g\big(i_{N-1}, \mu(i_{N-1}), i_N\big). \tag{6.23}$$
In particular, by noting that the parenthesized term multiplying $\nabla\tilde J(i_k, r)$ in Eq. (6.20) is equal to
$$q_k + \alpha q_{k+1} + \cdots + \alpha^{N-1-k} q_{N-1},$$
we can verify by adding the equations below that iteration (6.20) can also be implemented as follows:
After the state transition $(i_0, i_1)$, set
$$r := r - \gamma q_0 \nabla\tilde J(i_0, r).$$
After the state transition $(i_1, i_2)$, set
$$r := r - \gamma q_1 \big(\alpha\nabla\tilde J(i_0, r) + \nabla\tilde J(i_1, r)\big).$$
Proceeding similarly, after the state transition $(i_{N-1}, i_N)$, set
$$r := r - \gamma q_{N-1} \big(\alpha^{N-1}\nabla\tilde J(i_0, r) + \alpha^{N-2}\nabla\tilde J(i_1, r) + \cdots + \nabla\tilde J(i_{N-1}, r)\big).$$
The batch version (6.20) is obtained if the gradients $\nabla\tilde J(i_k, r)$ are all evaluated at the value of $r$ that prevails at the beginning of the batch.
The incremental version (6.21) is obtained if each gradient $\nabla\tilde J(i_k, r)$ is evaluated at the value of $r$ that prevails when the transition $(i_k, i_{k+1})$ is processed.
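For the batch case, where every quantity is evaluated at the value of $r$ prevailing at the beginning of the batch, the equivalence with iteration (6.20) can be checked numerically. The sketch below does this for a linear architecture $\tilde J(i, r) = \phi(i)'r$; the function names and the test data are assumptions chosen for illustration, and the comparison is a sanity check rather than a proof.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def batch_update_direct(r, states, costs, phi, alpha, gamma):
    """Batch iteration written as in Eq. (6.20)."""
    N = len(costs)
    grad = [0.0] * len(r)
    for k in range(N):
        sample = sum(alpha ** (t - k) * costs[t] for t in range(k, N))
        f = phi[states[k]]
        error = dot(f, r) - sample
        grad = [ga + fa * error for ga, fa in zip(grad, f)]
    return [ra - gamma * ga for ra, ga in zip(r, grad)]


def batch_update_td(r, states, costs, phi, alpha, gamma):
    """Same batch update via temporal differences, cf. Eqs. (6.22)-(6.23).

    All temporal differences use the beginning-of-batch value of r.
    """
    N = len(costs)
    J = [dot(phi[i], r) for i in states]
    z = [0.0] * len(r)           # sum_{t<=k} alpha^(k-t) phi(i_t)
    correction = [0.0] * len(r)
    for k in range(N):
        # Eq. (6.23): the last transition has no alpha * J~(i_{k+1}, r) term.
        next_term = alpha * J[k + 1] if k < N - 1 else 0.0
        q_k = J[k] - next_term - costs[k]
        z = [alpha * za + fa for za, fa in zip(z, phi[states[k]])]
        correction = [c + q_k * za for c, za in zip(correction, z)]
    return [ra - gamma * c for ra, c in zip(r, correction)]


# Hypothetical batch with N = 3 transitions.
phi = {0: [1.0, 0.0], 1: [1.0, 1.0]}
args = dict(r=[0.5, -1.0], states=[0, 1, 1, 0], costs=[1.0, 4.0, 2.0],
            phi=phi, alpha=0.9, gamma=0.1)
r_direct = batch_update_direct(**args)
r_td = batch_update_td(**args)
print(r_direct, r_td)
```

The two results agree up to floating-point rounding because the discounted sum of temporal differences telescopes: the intermediate $\tilde J$ terms cancel and only $\tilde J(i_k, r)$ and the discounted costs remain.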
In particular, for the incremental version, we start with some vector $r_0$, and following the transition $(i_k, i_{k+1})$, $k = 0, \ldots, N-1$, we set
$$r_{k+1} = r_k - \gamma_k q_k \sum_{t=0}^{k} \alpha^{k-t} \nabla\tilde J(i_t, r_t), \tag{6.24}$$
where the stepsize $\gamma_k$ may vary from one transition to the next. In the important case of a linear approximation architecture of the form
$$\tilde J(i, r) = \phi(i)'r, \qquad i = 1, \ldots, n,$$
where $\phi(i) \in \Re^s$ are some fixed vectors, it takes the form
$$r_{k+1} = r_k - \gamma_k q_k \sum_{t=0}^{k} \alpha^{k-t} \phi(i_t). \tag{6.25}$$
This algorithm is known as TD(1), and we will see in Section 6.3.6 that it is a limiting version (as $\lambda \to 1$) of the TD($\lambda$) method discussed there.
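A minimal sketch of iteration (6.25) follows. It assumes that the temporal difference $q_k$ is computed with the current vector $r_k$, that the last transition of the batch uses the form (6.23), and that the stepsize is supplied as a function of $k$; the names `td1_linear` and `stepsize`, and the data, are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def td1_linear(r0, states, costs, phi, alpha, stepsize):
    """TD(1) with a linear architecture, cf. Eq. (6.25).

    states   : [i_0, ..., i_N]
    costs    : [g_0, ..., g_{N-1}], with g_k = g(i_k, mu(i_k), i_{k+1})
    stepsize : function k -> gamma_k (assumed interface)
    """
    r = list(r0)
    z = [0.0] * len(r)  # running sum of alpha^(k-t) * phi(i_t), t = 0..k
    N = len(costs)
    for k in range(N):
        f_k = phi[states[k]]
        # Temporal difference at the current r; the final transition
        # drops the alpha * J~(i_{k+1}, r) term, as in Eq. (6.23).
        J_next = dot(phi[states[k + 1]], r) if k < N - 1 else 0.0
        q_k = dot(f_k, r) - alpha * J_next - costs[k]
        z = [alpha * za + fa for za, fa in zip(z, f_k)]
        gamma_k = stepsize(k)
        r = [ra - gamma_k * q_k * za for ra, za in zip(r, z)]
    return r


# Hypothetical batch with a constant stepsize, for illustration only.
phi = {0: [1.0, 0.0], 1: [1.0, 1.0]}
r_td1 = td1_linear(
    r0=[0.0, 0.0], states=[0, 1, 0], costs=[1.0, 4.0],
    phi=phi, alpha=0.5, stepsize=lambda k: 0.1,
)
print(r_td1)
```

Only the vector `z` and the current `r` need to be stored between transitions, which is what makes this form convenient for a single very long trajectory.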