In this section we consider policy evaluation for finite-state stochastic shortest path (SSP) problems (cf. Chapter 2). We assume that there is no discounting ($\alpha = 1$), and that the states are $0, 1, \ldots, n$, where state 0 is a special cost-free termination state. We focus on a fixed proper policy $\mu$, under which all the states $1, \ldots, n$ are transient.
There are natural extensions of the LSTD($\lambda$) and LSPE($\lambda$) algorithms. We introduce a linear approximation architecture of the form
$$\tilde J(i, r) = \phi(i)'r, \qquad i = 0, 1, \ldots, n,$$
and the subspace
$$S = \{\Phi r \mid r \in \Re^s\},$$
where, as in Section 6.3, $\Phi$ is the $n \times s$ matrix whose rows are $\phi(i)'$, $i = 1, \ldots, n$. We assume that $\Phi$ has rank $s$. Also, for notational convenience in the subsequent formulas, we define $\phi(0) = 0$.
The algorithms use a sequence of simulated trajectories, each of the form $(i_0, i_1, \ldots, i_N)$, where $i_N = 0$, and $i_t \ne 0$ for $t < N$. Once a trajectory is completed, an initial state $i_0$ for the next trajectory is chosen according to a fixed probability distribution $q_0 = \big(q_0(1), \ldots, q_0(n)\big)$, where
$$q_0(i) = P(i_0 = i), \qquad i = 1, \ldots, n, \tag{6.157}$$
and the process is repeated.
For a trajectory $i_0, i_1, \ldots$ of the SSP problem, consider the probabilities
$$q_t(i) = P(i_t = i), \qquad i = 1, \ldots, n, \quad t = 0, 1, \ldots$$
Note that $q_t(i)$ diminishes to 0 as $t \to \infty$ at the rate of a geometric progression (cf. Section 2.1), so the limits
$$q(i) = \sum_{t=0}^{\infty} q_t(i), \qquad i = 1, \ldots, n,$$
are finite. Let $q$ be the vector with components $q(1), \ldots, q(n)$. We assume that $q_0(i)$ are chosen so that $q(i) > 0$ for all $i$ [a stronger assumption is that $q_0(i) > 0$ for all $i$]. We introduce the norm
$$\|J\|_q = \sqrt{\sum_{i=1}^{n} q(i)\, J(i)^2},$$
and we denote by $\Pi$ the projection onto the subspace $S$ with respect to this norm. In the context of the SSP problem, the projection norm $\|\cdot\|_q$ plays a role similar to the one played by the steady-state distribution norm $\|\cdot\|_\xi$ for discounted problems (cf. Section 6.3).
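As a numerical illustration of these definitions, the following Python sketch computes $q$, the weighted norm, and the projection matrix for a small example. The transition matrix `P`, the restart distribution `q0`, and the feature matrix `Phi` are hypothetical data chosen for illustration, and the closed form $q = (I - P')^{-1} q_0$ used here follows from $q_t' = q_0' P^t$ (with $P$ the matrix of transition probabilities among states $1, \ldots, n$).

```python
import numpy as np

# Hypothetical example data (not from the text): transition probabilities
# among the transient states 1..3. Each row sums to less than 1; the
# remainder is the probability of moving to the termination state 0.
P = np.array([[0.5, 0.3, 0.0],
              [0.2, 0.4, 0.3],
              [0.0, 0.3, 0.5]])
q0 = np.array([1.0, 0.0, 0.0])  # every trajectory starts at state 1

# Since q_t' = q_0' P^t, we get q' = q_0' (I - P)^{-1}, i.e. (I - P') q = q_0.
q = np.linalg.solve(np.eye(3) - P.T, q0)

# Cross-check against a truncated version of the defining series.
q_series, qt = np.zeros(3), q0.copy()
for _ in range(500):
    q_series += qt
    qt = P.T @ qt

def norm_q(J, q):
    """Weighted Euclidean norm sqrt(sum_i q(i) J(i)^2)."""
    return np.sqrt(np.sum(q * J ** 2))

def projection_matrix(Phi, q):
    """Matrix of the projection onto S = {Phi r} with respect to norm_q."""
    Q = np.diag(q)
    return Phi @ np.linalg.solve(Phi.T @ Q @ Phi, Phi.T @ Q)

Phi = np.array([[1.0, 0.0],
                [1.0, 1.0],
                [1.0, 2.0]])  # hypothetical rank-2 feature matrix
Pi = projection_matrix(Phi, q)
```

In this example $q = (2.8, 2.0, 1.2)$, so $q(i) > 0$ for all $i$ even though $q_0(2) = q_0(3) = 0$, which is the weaker of the two assumptions mentioned above.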
Let $P$ be the $n \times n$ matrix with components $p_{ij}$, $i, j = 1, \ldots, n$. Consider also the mapping $T : \Re^n \mapsto \Re^n$ given by
$$TJ = g + PJ,$$
where $g$ is the vector with components $\sum_{j=0}^{n} p_{ij}\, g(i, j)$, $i = 1, \ldots, n$. For $\lambda \in [0, 1)$, define the mapping
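$$T^{(\lambda)} = (1 - \lambda) \sum_{t=0}^{\infty} \lambda^t T^{t+1}$$
[cf. Eq. (6.71)]. Similar to Section 6.3, we have
$$T^{(\lambda)} J = P^{(\lambda)} J + (I - \lambda P)^{-1} g,$$
where
$$P^{(\lambda)} = (1 - \lambda) \sum_{t=0}^{\infty} \lambda^t P^{t+1} \tag{6.158}$$
[cf. Eq. (6.72)].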
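The series defining $P^{(\lambda)}$ can be summed in closed form as $P^{(\lambda)} = (1 - \lambda) P (I - \lambda P)^{-1}$, which gives a direct way to evaluate $T^{(\lambda)}$. The sketch below, in Python with hypothetical data `P`, `g`, `J`, and `lam`, compares this closed form with a truncated version of the series; it is meant only as a sanity check of the formulas above.

```python
import numpy as np

def P_lambda(P, lam):
    """Closed form of (1 - lam) * sum_t lam^t P^(t+1)."""
    n = P.shape[0]
    return (1.0 - lam) * P @ np.linalg.inv(np.eye(n) - lam * P)

def T_lambda(J, P, g, lam):
    """T^(lam) J = P^(lam) J + (I - lam P)^{-1} g."""
    n = P.shape[0]
    return P_lambda(P, lam) @ J + np.linalg.solve(np.eye(n) - lam * P, g)

def T_lambda_series(J, P, g, lam, terms=2000):
    """Truncated series (1 - lam) * sum_t lam^t T^(t+1) J, with T J = g + P J."""
    out = np.zeros_like(J)
    TJ = J.copy()
    for t in range(terms):
        TJ = g + P @ TJ                 # TJ now equals T^(t+1) J
        out += (1.0 - lam) * lam ** t * TJ
    return out

# Hypothetical example data (not from the text).
P = np.array([[0.5, 0.3, 0.0],
              [0.2, 0.4, 0.3],
              [0.0, 0.3, 0.5]])
g = np.array([1.0, 2.0, 0.5])
J = np.array([3.0, -1.0, 2.0])
lam = 0.7

closed = T_lambda(J, P, g, lam)
series = T_lambda_series(J, P, g, lam)

# The cost vector of the policy, J_mu = (I - P)^{-1} g, is a fixed point of T^(lam).
J_mu = np.linalg.solve(np.eye(3) - P, g)
```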
We will now show that $\Pi T^{(\lambda)}$ is a contraction, so that it has a unique fixed point.
Proposition 6.6.1: For all $\lambda \in [0, 1)$, $\Pi T^{(\lambda)}$ is a contraction with respect to some norm.
Proof: Let $\lambda > 0$. We will show that $T^{(\lambda)}$ is a contraction with respect to the projection norm $\|\cdot\|_q$, so the same is true for $\Pi T^{(\lambda)}$, since $\Pi$ is nonexpansive. Let us first note that with an argument like the one in the proof of Lemma 6.3.1, we have
$$\|PJ\|_q \le \|J\|_q, \qquad J \in \Re^n.$$
Indeed, we have $q = \sum_{t=0}^{\infty} q_t$ and $q_{t+1}' = q_t' P$, so
$$q'P = \sum_{t=0}^{\infty} q_t' P = \sum_{t=1}^{\infty} q_t' = q' - q_0',$$
or
$$\sum_{i=1}^{n} q(i)\, p_{ij} = q(j) - q_0(j), \qquad \forall\ j.$$
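Using this relation, we have for all $J \in \Re^n$,
$$\begin{aligned}
\|PJ\|_q^2 &= \sum_{i=1}^{n} q(i) \Big( \sum_{j=1}^{n} p_{ij} J(j) \Big)^2 \\
&\le \sum_{i=1}^{n} q(i) \sum_{j=1}^{n} p_{ij} J(j)^2 \\
&= \sum_{j=1}^{n} J(j)^2 \sum_{i=1}^{n} q(i)\, p_{ij} \\
&= \sum_{j=1}^{n} \big( q(j) - q_0(j) \big) J(j)^2 \\
&\le \|J\|_q^2,
\end{aligned} \tag{6.159}$$
where the first inequality follows from the convexity of the quadratic function (together with $\sum_{j=1}^{n} p_{ij} \le 1$), and the last from $q_0(j) \ge 0$.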
From the relation $\|PJ\|_q \le \|J\|_q$ it follows that
$$\|P^t J\|_q \le \|J\|_q, \qquad J \in \Re^n, \quad t = 0, 1, \ldots$$
Thus, by using the definition (6.158) of $P^{(\lambda)}$, we also have
$$\|P^{(\lambda)} J\|_q \le \|J\|_q, \qquad J \in \Re^n.$$
Since $\lim_{t \to \infty} P^t J = 0$ for any $J \in \Re^n$, it follows that $\|P^t J\|_q < \|J\|_q$ for all $J \ne 0$ and $t$ sufficiently large. Therefore,
$$\|P^{(\lambda)} J\|_q < \|J\|_q, \qquad \text{for all } J \ne 0. \tag{6.160}$$
We now define
$$\beta = \max \big\{ \|P^{(\lambda)} J\|_q \mid \|J\|_q = 1 \big\}$$
and note that since the maximum in the definition of $\beta$ is attained by the Weierstrass Theorem (a continuous function attains a maximum over a compact set), we have $\beta < 1$ in view of Eq. (6.160). Since
$$\|P^{(\lambda)} J\|_q \le \beta \|J\|_q, \qquad J \in \Re^n,$$
it follows that $P^{(\lambda)}$ is a contraction of modulus $\beta$ with respect to $\|\cdot\|_q$.
Let $\lambda = 0$. We use a different argument because $T$ is not necessarily a contraction with respect to $\|\cdot\|_q$. [An example is given following Prop. 6.8.1. Note also that if $q_0(i) > 0$ for all $i$, from the calculation of Eq. (6.159) it follows that $P$ and hence $T$ is a contraction with respect to $\|\cdot\|_q$.] We show that $\Pi T$ is a contraction with respect to a different norm by showing that the eigenvalues of $\Pi P$ lie strictly within the unit circle.†
Indeed, with an argument like the one used to prove Lemma 6.3.1, we have $\|PJ\|_q \le \|J\|_q$ for all $J$, which implies that $\|\Pi P J\|_q \le \|J\|_q$, so the eigenvalues of $\Pi P$ cannot be outside the unit circle. Assume, to arrive at a contradiction, that $\nu$ is an eigenvalue of $\Pi P$ with $|\nu| = 1$, and let $\zeta$ be a corresponding eigenvector. We claim that $P\zeta$ must have both real and imaginary components in the subspace $S$. If this were not so, we would have $P\zeta \ne \Pi P \zeta$, so that
$$\|P\zeta\|_q > \|\Pi P \zeta\|_q = \|\nu \zeta\|_q = |\nu|\, \|\zeta\|_q = \|\zeta\|_q,$$
which contradicts the fact $\|PJ\|_q \le \|J\|_q$ for all $J$. Thus, the real and imaginary components of $P\zeta$ are in $S$, which implies that $P\zeta = \Pi P \zeta = \nu \zeta$, so that $\nu$ is an eigenvalue of $P$. This is a contradiction because $|\nu| = 1$ while the eigenvalues of $P$ are strictly within the unit circle, since the policy being evaluated is proper. Q.E.D.
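The quantities appearing in the proof can be checked numerically on a small example. The following Python sketch uses hypothetical data (`P`, `q0`, `Phi`, and `lam` are not from the text) and computes the operator norm induced by $\|\cdot\|_q$ as the spectral norm of $Q^{1/2} A Q^{-1/2}$, where $Q$ is the diagonal matrix with the components of $q$ on the diagonal. It illustrates, but of course does not prove, the statements of the proposition.

```python
import numpy as np

# Hypothetical example data (not from the text).
P = np.array([[0.5, 0.3, 0.0],
              [0.2, 0.4, 0.3],
              [0.0, 0.3, 0.5]])
q0 = np.array([1.0, 0.0, 0.0])
Phi = np.array([[1.0, 0.0],
                [1.0, 1.0],
                [1.0, 2.0]])
lam = 0.5

n = P.shape[0]
I = np.eye(n)
q = np.linalg.solve(I - P.T, q0)          # q' = q0' (I - P)^{-1}
Q = np.diag(q)
Pi = Phi @ np.linalg.solve(Phi.T @ Q @ Phi, Phi.T @ Q)
P_lam = (1.0 - lam) * P @ np.linalg.inv(I - lam * P)

def induced_norm_q(A, q):
    """Operator norm of A induced by the q-weighted Euclidean norm."""
    s = np.sqrt(q)
    return np.linalg.norm(s[:, None] * A / s[None, :], 2)

# Relation q'P = q' - q0' used in the proof (should vanish).
residual = q @ P - (q - q0)

norm_P = induced_norm_q(P, q)             # at most 1
norm_P_lam = induced_norm_q(P_lam, q)     # strictly less than 1 (case lam > 0)
rho_PiP = np.max(np.abs(np.linalg.eigvals(Pi @ P)))  # spectral radius (case lam = 0)

print(norm_P, norm_P_lam, rho_PiP)
```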
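The preceding proof has shown that $\Pi T^{(\lambda)}$ is a contraction with respect to $\|\cdot\|_q$ when $\lambda > 0$. As a result, similar to Prop. 6.3.5, we can obtain the error bound
$$\|J_\mu - \Phi r_\lambda^*\|_q \le \frac{1}{\sqrt{1 - \alpha_\lambda^2}}\, \|J_\mu - \Pi J_\mu\|_q, \qquad \lambda > 0,$$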
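where $\Phi r_\lambda^*$ and $\alpha_\lambda$ are the fixed point and contraction modulus of $\Pi T^{(\lambda)}$, respectively. When $\lambda = 0$, we have
$$\begin{aligned}
\|J_\mu - \Phi r_0^*\| &\le \|J_\mu - \Pi J_\mu\| + \|\Pi J_\mu - \Phi r_0^*\| \\
&= \|J_\mu - \Pi J_\mu\| + \|\Pi T J_\mu - \Pi T(\Phi r_0^*)\| \\
&\le \|J_\mu - \Pi J_\mu\| + \alpha_0 \|J_\mu - \Phi r_0^*\|,
\end{aligned}$$
where the equality uses $J_\mu = T J_\mu$ and $\Phi r_0^* = \Pi T(\Phi r_0^*)$, $\|\cdot\|$ is the norm with respect to which $\Pi T$ is a contraction (cf. Prop. 6.7.1), and $\Phi r_0^*$ and $\alpha_0$ are the fixed point and contraction modulus of $\Pi T$. We thus have the error bound
$$\|J_\mu - \Phi r_0^*\| \le \frac{1}{1 - \alpha_0}\, \|J_\mu - \Pi J_\mu\|.$$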
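For $\lambda > 0$ the bound can be checked numerically. The Python sketch below uses hypothetical data (`P`, `g`, `q0`, `Phi`, `lam` are not from the text), solves the projected equation $\Phi r = \Pi T^{(\lambda)}(\Phi r)$ through the equivalent linear system $\Phi' Q (I - P^{(\lambda)}) \Phi\, r = \Phi' Q (I - \lambda P)^{-1} g$, where $Q$ is the diagonal matrix formed from $q$, and takes as $\alpha_\lambda$ the operator norm of $\Pi P^{(\lambda)}$ induced by $\|\cdot\|_q$, which is one valid choice of contraction modulus.

```python
import numpy as np

# Hypothetical example data (not from the text).
P = np.array([[0.5, 0.3, 0.0],
              [0.2, 0.4, 0.3],
              [0.0, 0.3, 0.5]])
g = np.array([1.0, 2.0, 0.5])
q0 = np.array([1.0, 0.0, 0.0])
Phi = np.array([[1.0, 0.0],
                [1.0, 1.0],
                [1.0, 2.0]])
lam = 0.5

I = np.eye(3)
q = np.linalg.solve(I - P.T, q0)
Q = np.diag(q)
Pi = Phi @ np.linalg.solve(Phi.T @ Q @ Phi, Phi.T @ Q)
P_lam = (1.0 - lam) * P @ np.linalg.inv(I - lam * P)
g_lam = np.linalg.solve(I - lam * P, g)

# Exact cost vector of the policy: J_mu = (I - P)^{-1} g.
J_mu = np.linalg.solve(I - P, g)

# Fixed point of Pi T^(lam): Phi' Q (Phi r - P_lam Phi r - g_lam) = 0.
r_star = np.linalg.solve(Phi.T @ Q @ (I - P_lam) @ Phi, Phi.T @ Q @ g_lam)

def norm_q(J):
    return np.sqrt(np.sum(q * J ** 2))

# Contraction modulus: q-induced operator norm of Pi P_lam.
s = np.sqrt(q)
alpha = np.linalg.norm(s[:, None] * (Pi @ P_lam) / s[None, :], 2)

lhs = norm_q(J_mu - Phi @ r_star)
rhs = norm_q(J_mu - Pi @ J_mu) / np.sqrt(1.0 - alpha ** 2)
print(lhs, rhs)
```

In this example $J_\mu = (7.4, 9.0, 6.4)$, and the printed left-hand side should not exceed the right-hand side.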
† We use here the fact that if a square matrix has eigenvalues strictly within the unit circle, then there exists a norm with respect to which the linear mapping defined by the matrix is a contraction. Also in the following argument, the projection $\Pi z$ of a complex vector $z$ is obtained by separately projecting the real and the imaginary components of $z$ on $S$. The projection norm for a complex vector $x + iy$ is defined by
$$\|x + iy\|_q = \sqrt{\|x\|_q^2 + \|y\|_q^2}.$$
Similar to the discounted problem case, the projected equation can be written as a linear equation of the form $Cr = d$. The corresponding LSTD and LSPE algorithms use simulation-based approximations $C_k$ and $d_k$. This simulation generates a sequence of trajectories of the form $(i_0, i_1, \ldots, i_N)$, where $i_N = 0$, and $i_t \ne 0$ for $t < N$. Once a trajectory is completed, an initial state $i_0$ for the next trajectory is chosen according to a fixed probability distribution $q_0 = \big(q_0(1), \ldots, q_0(n)\big)$. The LSTD method approximates the solution $C^{-1}d$ of the projected equation by $C_k^{-1} d_k$, where $C_k$ and $d_k$ are simulation-based approximations to $C$ and $d$, respectively.
The LSPE algorithm and its scaled versions are defined by
$$r_{k+1} = r_k - \gamma G_k (C_k r_k - d_k),$$
where $\gamma$ is a sufficiently small stepsize and $G_k$ is a scaling matrix. The derivation of the detailed equations is straightforward but somewhat tedious, and will not be given (see also the discussion in Section 6.8).
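Since the detailed equations are not given here, the following Python sketch should be read only as one plausible instance for $\lambda = 0$, obtained by analogy with the discounted case. It assumes $C = \Phi' Q (I - P) \Phi$ and $d = \Phi' Q g$ (with $Q$ the diagonal matrix formed from $q$), estimated by the sample sums $\sum_t \phi(i_t)\big(\phi(i_t) - \phi(i_{t+1})\big)'$ and $\sum_t \phi(i_t)\, g(i_t, i_{t+1})$ over all simulated transitions; the common normalization factor cancels in both methods. The example data, the cost structure `cost`, the choice $G_k = \big(\sum_t \phi(i_t)\phi(i_t)'\big)^{-1}$, and $\gamma = 1$ are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical example data (not from the text); state 0 is termination.
P = np.array([[0.5, 0.3, 0.0],
              [0.2, 0.4, 0.3],
              [0.0, 0.3, 0.5]])
n = P.shape[0]
# Row i-1 holds the probabilities of moving from state i to states 0..n.
P_full = np.hstack([(1.0 - P.sum(axis=1))[:, None], P])
cost = np.array([1.0, 2.0, 0.5])      # assumed: g(i, j) = cost[i-1] for every j
q0 = np.array([1.0, 0.0, 0.0])        # restart distribution
Phi = np.array([[1.0, 0.0],
                [1.0, 1.0],
                [1.0, 2.0]])
phi = np.vstack([np.zeros(Phi.shape[1]), Phi])   # phi[0] = 0 by convention

def simulate(num_trajectories):
    """Accumulate sample sums for C, d, and the feature Gram matrix."""
    s = Phi.shape[1]
    C, d, M = np.zeros((s, s)), np.zeros(s), np.zeros((s, s))
    for _ in range(num_trajectories):
        i = 1 + rng.choice(n, p=q0)               # initial state drawn from q0
        while i != 0:                             # run until termination
            j = rng.choice(n + 1, p=P_full[i - 1])
            C += np.outer(phi[i], phi[i] - phi[j])
            d += phi[i] * cost[i - 1]
            M += np.outer(phi[i], phi[i])
            i = j
    return C, d, M

C_k, d_k, M_k = simulate(10_000)

# LSTD: solve the sampled projected equation C_k r = d_k.
r_lstd = np.linalg.solve(C_k, d_k)

# Scaled LSPE with gamma = 1 and G_k = M_k^{-1}, iterated with C_k, d_k held fixed.
r_lspe = np.zeros(Phi.shape[1])
for _ in range(300):
    r_lspe = r_lspe - np.linalg.solve(M_k, C_k @ r_lspe - d_k)

# Exact solution of C r = d for comparison.
q = np.linalg.solve(np.eye(n) - P.T, q0)
Q = np.diag(q)
r_exact = np.linalg.solve(Phi.T @ Q @ (np.eye(n) - P) @ Phi, Phi.T @ Q @ cost)
print(r_lstd, r_lspe, r_exact)
```

For simplicity the LSPE iteration here is run after the simulation with fixed $C_k$, $d_k$, and $G_k$; in practice the iterate would typically be updated as new samples arrive.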