STOCHASTIC SHORTEST PATH PROBLEMS - Dynamic Programming and Optimal Control

In this section we consider policy evaluation for finite-state stochastic shortest path (SSP) problems (cf. Chapter 2). We assume that there is no discounting (α = 1), and that the states are 0, 1, ..., n, where state 0 is a special cost-free termination state. We focus on a fixed proper policy µ, under which all the states 1, ..., n are transient.

There are natural extensions of the LSTD(λ) and LSPE(λ) algorithms. We introduce a linear approximation architecture of the form

$$\tilde J(i, r) = \phi(i)' r, \qquad i = 0, 1, \ldots, n,$$

and the subspace

$$S = \{\Phi r \mid r \in \Re^s\},$$

where, as in Section 6.3, $\Phi$ is the $n \times s$ matrix whose rows are $\phi(i)'$, $i = 1, \ldots, n$. We assume that $\Phi$ has rank $s$. Also, for notational convenience in the subsequent formulas, we define $\phi(0) = 0$.

The algorithms use a sequence of simulated trajectories, each of the form $(i_0, i_1, \ldots, i_N)$, where $i_N = 0$ and $i_t \neq 0$ for $t < N$. Once a trajectory is completed, an initial state $i_0$ for the next trajectory is chosen according to a fixed probability distribution $q_0 = \big(q_0(1), \ldots, q_0(n)\big)$, where

$$q_0(i) = P(i_0 = i), \qquad i = 1, \ldots, n, \tag{6.157}$$

and the process is repeated.

For a trajectory $i_0, i_1, \ldots$ of the SSP problem, consider the probabilities

$$q_t(i) = P(i_t = i), \qquad i = 1, \ldots, n,\ t = 0, 1, \ldots$$

Note that $q_t(i)$ diminishes to 0 as $t \to \infty$ at the rate of a geometric progression (cf. Section 2.1), so the limits

$$q(i) = \sum_{t=0}^{\infty} q_t(i), \qquad i = 1, \ldots, n,$$

are finite. Let $q$ be the vector with components $q(1), \ldots, q(n)$. We assume that $q_0(i)$ are chosen so that $q(i) > 0$ for all $i$ [a stronger assumption is that $q_0(i) > 0$ for all $i$]. We introduce the norm

$$\|J\|_q = \sqrt{\sum_{i=1}^{n} q(i)\, J(i)^2},$$

and we denote by $\Pi$ the projection onto the subspace $S$ with respect to this norm. In the context of the SSP problem, the projection norm $\|\cdot\|_q$ plays a role similar to the one played by the steady-state distribution norm $\|\cdot\|_\xi$ for discounted problems (cf. Section 6.3).
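As a concrete illustration, the following Python sketch computes $q$, the norm $\|\cdot\|_q$, and the projection $\Pi$ for a small example. The two-state chain, the start distribution, and the single feature (so $s = 1$) are made-up values chosen for illustration, and the function names are hypothetical; the series defining $q$ is simply truncated once $q_t$ is negligible.

```python
import math


def visit_frequencies(P, q0, tol=1e-12, max_iter=100_000):
    """Return q = sum_t q_t, using q_{t+1}' = q_t' P, truncated when q_t is tiny."""
    n = len(q0)
    q_t, q = list(q0), [0.0] * n
    for _ in range(max_iter):
        q = [q[i] + q_t[i] for i in range(n)]
        q_t = [sum(q_t[i] * P[i][j] for i in range(n)) for j in range(n)]
        if sum(q_t) < tol:
            break
    return q


def q_norm(J, q):
    """Weighted Euclidean norm ||J||_q = sqrt(sum_i q(i) J(i)^2)."""
    return math.sqrt(sum(qi * Ji * Ji for qi, Ji in zip(q, J)))


def project(J, phi, q):
    """Projection of J onto S = {phi * r} w.r.t. ||.||_q (single feature, s = 1)."""
    num = sum(qi * pi * Ji for qi, pi, Ji in zip(q, phi, J))
    den = sum(qi * pi * pi for qi, pi in zip(q, phi))
    r = num / den
    return [r * pi for pi in phi]


# Illustrative data (assumptions): states 1 and 2; termination state 0 is implicit.
P = [[0.0, 0.5],   # from state 1: to state 2 w.p. 0.5, terminate w.p. 0.5
     [0.4, 0.0]]   # from state 2: to state 1 w.p. 0.4, terminate w.p. 0.6
q0 = [1.0, 0.0]    # every trajectory starts at state 1
phi = [1.0, 2.0]   # phi(1), phi(2)
J = [1.875, 1.75]  # vector to project (cost vector of this chain with unit cost per step)

q = visit_frequencies(P, q0)
PiJ = project(J, phi, q)
residual = [a - b for a, b in zip(J, PiJ)]
print("q =", q)
print("||J||_q =", q_norm(J, q), " ||J - Pi J||_q =", q_norm(residual, q))
```

In this example $q$ comes out close to $(1.25, 0.625)$, so $q(2) > 0$ even though $q_0(2) = 0$, because state 2 is reachable from state 1. This is the situation covered by the weaker of the two assumptions on $q_0$.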

Let $P$ be the $n \times n$ matrix with components $p_{ij}$, $i, j = 1, \ldots, n$. Consider also the mapping $T : \Re^n \mapsto \Re^n$ given by

$$TJ = g + PJ,$$

where $g$ is the vector with components $\sum_{j=0}^{n} p_{ij}\, g(i, j)$, $i = 1, \ldots, n$. For $\lambda \in [0, 1)$, define the mapping

$$T^{(\lambda)} = (1 - \lambda) \sum_{t=0}^{\infty} \lambda^t T^{t+1}$$

[cf. Eq. (6.71)]. Similar to Section 6.3, we have

$$T^{(\lambda)} J = P^{(\lambda)} J + (I - \lambda P)^{-1} g,$$

where

$$P^{(\lambda)} = (1 - \lambda) \sum_{t=0}^{\infty} \lambda^t P^{t+1} \tag{6.158}$$

[cf. Eq. (6.72)].
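The expression for $T^{(\lambda)}J$ can be verified directly from the definitions. The following derivation is a sketch added for completeness rather than taken from the text; it uses only $TJ = g + PJ$ and the fact that $\lambda < 1$ while the powers $P^k$ remain bounded, so that the double series may be rearranged.

```latex
% Step 1: iterating TJ = g + PJ gives, for t = 0, 1, ...,
T^{t+1} J = P^{t+1} J + \sum_{k=0}^{t} P^k g .

% Step 2: substitute into the definition of T^{(\lambda)}.
T^{(\lambda)} J
  = (1-\lambda) \sum_{t=0}^{\infty} \lambda^t P^{t+1} J
  + (1-\lambda) \sum_{t=0}^{\infty} \lambda^t \sum_{k=0}^{t} P^k g .

% Step 3: the first term is P^{(\lambda)} J. In the second term, exchange the
% order of summation and sum the geometric series in t.
(1-\lambda) \sum_{k=0}^{\infty} P^k g \sum_{t=k}^{\infty} \lambda^t
  = (1-\lambda) \sum_{k=0}^{\infty} \frac{\lambda^k}{1-\lambda}\, P^k g
  = \sum_{k=0}^{\infty} (\lambda P)^k g
  = (I - \lambda P)^{-1} g .
```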

We will now show that ΠT^(λ) is a contraction, so that it has a unique fixed point.

Proposition 6.6.1: For all λ ∈ [0, 1), ΠT^(λ) is a contraction with respect to some norm.

Proof: Let $\lambda > 0$. We will show that $T^{(\lambda)}$ is a contraction with respect to the projection norm $\|\cdot\|_q$, so the same is true for $\Pi T^{(\lambda)}$, since $\Pi$ is nonexpansive. Let us first note that with an argument like the one in the proof of Lemma 6.3.1, we have

$$\|PJ\|_q \le \|J\|_q, \qquad J \in \Re^n.$$

Indeed, we have $q = \sum_{t=0}^{\infty} q_t$ and $q_{t+1}' = q_t' P$, so

$$q'P = \sum_{t=0}^{\infty} q_t' P = \sum_{t=1}^{\infty} q_t' = q' - q_0',$$

or

$$\sum_{i=1}^{n} q(i)\, p_{ij} = q(j) - q_0(j), \qquad \forall\ j.$$

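Using this relation, we have for all $J \in \Re^n$,

$$\|PJ\|_q^2 = \sum_{i=1}^{n} q(i) \Big( \sum_{j=1}^{n} p_{ij} J(j) \Big)^2 \le \sum_{i=1}^{n} q(i) \sum_{j=1}^{n} p_{ij} J(j)^2 = \sum_{j=1}^{n} \big( q(j) - q_0(j) \big) J(j)^2 \le \|J\|_q^2. \tag{6.159}$$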

From the relation $\|PJ\|_q \le \|J\|_q$ it follows that

$$\|P^t J\|_q \le \|J\|_q, \qquad J \in \Re^n,\ t = 0, 1, \ldots$$

Thus, by using the definition (6.158) of $P^{(\lambda)}$, we also have

$$\|P^{(\lambda)} J\|_q \le \|J\|_q, \qquad J \in \Re^n.$$

Since $\lim_{t \to \infty} P^t J = 0$ for any $J \in \Re^n$, it follows that $\|P^t J\|_q < \|J\|_q$ for all $J \neq 0$ and $t$ sufficiently large. Therefore,

$$\|P^{(\lambda)} J\|_q < \|J\|_q, \qquad \text{for all } J \neq 0. \tag{6.160}$$

We now define

$$\beta = \max \big\{ \|P^{(\lambda)} J\|_q \;\big|\; \|J\|_q = 1 \big\}$$

and note that since the maximum in the definition of $\beta$ is attained by the Weierstrass Theorem (a continuous function attains a maximum over a compact set), we have $\beta < 1$ in view of Eq. (6.160). Since

$$\|P^{(\lambda)} J\|_q \le \beta \|J\|_q, \qquad J \in \Re^n,$$

it follows that $P^{(\lambda)}$ is a contraction of modulus $\beta$ with respect to $\|\cdot\|_q$. Since $T^{(\lambda)} J - T^{(\lambda)} \bar J = P^{(\lambda)} (J - \bar J)$ for all $J, \bar J \in \Re^n$, the same is true for $T^{(\lambda)}$.

Let $\lambda = 0$. We use a different argument because $T$ is not necessarily a contraction with respect to $\|\cdot\|_q$. [An example is given following Prop. 6.8.1. Note also that if $q_0(i) > 0$ for all $i$, from the calculation of Eq. (6.159) it follows that $P$ and hence $T$ is a contraction with respect to $\|\cdot\|_q$.] We show that $\Pi T$ is a contraction with respect to a different norm by showing that the eigenvalues of $\Pi P$ lie strictly within the unit circle.†

Indeed, with an argument like the one used to prove Lemma 6.3.1, we have $\|PJ\|_q \le \|J\|_q$ for all $J$, which implies that $\|\Pi PJ\|_q \le \|J\|_q$, so the eigenvalues of $\Pi P$ cannot be outside the unit circle. Assume, to arrive at a contradiction, that $\nu$ is an eigenvalue of $\Pi P$ with $|\nu| = 1$, and let $\zeta$ be a corresponding eigenvector. We claim that $P\zeta$ must have both real and imaginary components in the subspace $S$. If this were not so, we would have $P\zeta \neq \Pi P\zeta$, so that

$$\|P\zeta\|_q > \|\Pi P\zeta\|_q = \|\nu\zeta\|_q = |\nu|\, \|\zeta\|_q = \|\zeta\|_q,$$

which contradicts the fact $\|PJ\|_q \le \|J\|_q$ for all $J$. Thus, the real and imaginary components of $P\zeta$ are in $S$, which implies that $P\zeta = \Pi P\zeta = \nu\zeta$, so that $\nu$ is an eigenvalue of $P$. This is a contradiction because $|\nu| = 1$ while the eigenvalues of $P$ are strictly within the unit circle, since the policy being evaluated is proper. Q.E.D.
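The two cases of the proof can be seen on a tiny example. The Python sketch below uses a hypothetical deterministic chain $1 \to 2 \to 0$ with all trajectories started at state 1 (an illustration only, not the example referred to in the proof), for which $q = (1, 1)$. It checks the relation $q'P = q' - q_0'$, exhibits a vector $J$ with $\|PJ\|_q = \|J\|_q$, so that $P$ is not a contraction with respect to $\|\cdot\|_q$, and shows that $P^{(\lambda)}$ strictly shrinks the same $J$ when $\lambda > 0$. The helper names and the truncation of the series for $P^{(\lambda)}$ are assumptions of the sketch.

```python
import math


def mat_vec(A, x):
    """Matrix-vector product A x."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]


def vec_mat(x, A):
    """Row-vector times matrix, x' A."""
    return [sum(x[i] * A[i][j] for i in range(len(x))) for j in range(len(A[0]))]


def q_norm(J, q):
    """Weighted norm ||J||_q = sqrt(sum_i q(i) J(i)^2)."""
    return math.sqrt(sum(qi * Ji * Ji for qi, Ji in zip(q, J)))


def p_lambda_times(P, J, lam, terms=200):
    """Approximate P^(lam) J = (1 - lam) * sum_t lam^t P^(t+1) J by a truncated sum."""
    out = [0.0] * len(J)
    power = mat_vec(P, J)          # holds P^(t+1) J, starting with t = 0
    weight = 1.0 - lam             # holds (1 - lam) * lam^t
    for _ in range(terms):
        out = [o + weight * v for o, v in zip(out, power)]
        power = mat_vec(P, power)
        weight *= lam
    return out


# Hypothetical chain: state 1 -> state 2 -> termination, both with probability 1.
P = [[0.0, 1.0],
     [0.0, 0.0]]
q0 = [1.0, 0.0]                    # always start at state 1
q = [1.0, 1.0]                     # q = q_0 + q_1 with q_1 = (0, 1); q_t = 0 for t >= 2
J = [0.0, 1.0]

print("q'P      =", vec_mat(q, P))
print("q' - q0' =", [a - b for a, b in zip(q, q0)])
print("||J||_q =", q_norm(J, q), " ||PJ||_q =", q_norm(mat_vec(P, J), q))
print("||P^(0.5) J||_q =", q_norm(p_lambda_times(P, J, 0.5), q))
```

Here $q_0(2) = 0$, which is why the last inequality in the calculation of $\|PJ\|_q^2$ can hold with equality. The eigenvalues of $P$ are nevertheless both zero, consistent with the policy being proper.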

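The preceding proof has shown that $\Pi T^{(\lambda)}$ is a contraction with respect to $\|\cdot\|_q$ when $\lambda > 0$. As a result, similar to Prop. 6.3.5, we can obtain the error bound

$$\|J_\mu - \Phi r_\lambda^*\|_q \le \frac{1}{\sqrt{1 - \alpha_\lambda^2}}\, \|J_\mu - \Pi J_\mu\|_q, \qquad \lambda > 0,$$

where $\Phi r_\lambda^*$ and $\alpha_\lambda$ are the fixed point and contraction modulus of $\Pi T^{(\lambda)}$, respectively. When $\lambda = 0$, we have

$$\begin{aligned}
\|J_\mu - \Phi r_0^*\| &\le \|J_\mu - \Pi J_\mu\| + \|\Pi J_\mu - \Phi r_0^*\| \\
&= \|J_\mu - \Pi J_\mu\| + \|\Pi T J_\mu - \Pi T(\Phi r_0^*)\| \\
&\le \|J_\mu - \Pi J_\mu\| + \alpha_0 \|J_\mu - \Phi r_0^*\|,
\end{aligned}$$

where $\|\cdot\|$ is the norm with respect to which $\Pi T$ is a contraction (cf. Prop. 6.7.1), and $\Phi r_0^*$ and $\alpha_0$ are the fixed point and contraction modulus of $\Pi T$. We thus have the error bound

$$\|J_\mu - \Phi r_0^*\| \le \frac{1}{1 - \alpha_0}\, \|J_\mu - \Pi J_\mu\|.$$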
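For $\lambda > 0$, the bound stated earlier can be derived in a few lines. The following is a sketch of the standard Pythagorean argument, presumably the one intended by the reference to Prop. 6.3.5. It uses that $J_\mu$ is a fixed point of $T^{(\lambda)}$, that $\Phi r_\lambda^*$ is the fixed point of $\Pi T^{(\lambda)}$, and that $J_\mu - \Pi J_\mu$ is orthogonal to every vector of $S$ in the inner product that defines $\|\cdot\|_q$.

```latex
\begin{aligned}
\|J_\mu - \Phi r_\lambda^*\|_q^2
  &= \|J_\mu - \Pi J_\mu\|_q^2 + \|\Pi J_\mu - \Phi r_\lambda^*\|_q^2 \\
  &= \|J_\mu - \Pi J_\mu\|_q^2
     + \big\|\Pi T^{(\lambda)} J_\mu - \Pi T^{(\lambda)} (\Phi r_\lambda^*)\big\|_q^2 \\
  &\le \|J_\mu - \Pi J_\mu\|_q^2 + \alpha_\lambda^2\, \|J_\mu - \Phi r_\lambda^*\|_q^2 ,
\end{aligned}
\qquad\text{so}\qquad
\|J_\mu - \Phi r_\lambda^*\|_q \le \frac{1}{\sqrt{1 - \alpha_\lambda^2}}\, \|J_\mu - \Pi J_\mu\|_q .
```

The orthogonality step is not available for $\lambda = 0$, where the contraction norm is not the projection norm, which is why the triangle inequality is used in that case and the resulting constant is $1/(1 - \alpha_0)$.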

† We use here the fact that if a square matrix has eigenvalues strictly within the unit circle, then there exists a norm with respect to which the linear mapping defined by the matrix is a contraction. Also in the following argument, the projection $\Pi z$ of a complex vector $z$ is obtained by separately projecting the real and the imaginary components of $z$ on $S$. The projection norm for a complex vector $x + iy$ is defined by

$$\|x + iy\|_q = \sqrt{\|x\|_q^2 + \|y\|_q^2}.$$

Similar to the discounted problem case, the projected equation can be written as a linear equation of the form $Cr = d$. The corresponding LSTD and LSPE algorithms use simulation-based approximations $C_k$ and $d_k$. This simulation generates a sequence of trajectories of the form $(i_0, i_1, \ldots, i_N)$, where $i_N = 0$ and $i_t \neq 0$ for $t < N$. Once a trajectory is completed, an initial state $i_0$ for the next trajectory is chosen according to a fixed probability distribution $q_0 = \big(q_0(1), \ldots, q_0(n)\big)$. The LSTD method approximates the solution $C^{-1}d$ of the projected equation by $C_k^{-1} d_k$, where $C_k$ and $d_k$ are simulation-based approximations to $C$ and $d$, respectively.

The LSPE algorithm and its scaled versions are defined by

$$r_{k+1} = r_k - \gamma\, G_k (C_k r_k - d_k),$$

where $\gamma$ is a sufficiently small stepsize and $G_k$ is a scaling matrix. The derivation of the detailed equations is straightforward but somewhat tedious, and will not be given (see also the discussion in Section 6.8).
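Although the detailed formulas are not given, the case $\lambda = 0$ is easy to sketch. The Python code below is an illustration under stated assumptions, not the text's algorithm. It takes $C = \Phi' Q (I - P) \Phi$ and $d = \Phi' Q g$, with $Q$ the diagonal matrix of the $q(i)$, which is what the projected equation $\Phi r = \Pi T(\Phi r)$ reduces to, and it estimates them by the usual temporal-difference sums along simulated trajectories. The two-state chain, the unit transition costs, the scalar feature, the scaling $G_k = \big(\tfrac{1}{k}\sum_t \phi(i_t)^2\big)^{-1}$, and the stepsize $\gamma = 1$ are all hypothetical choices.

```python
import random

# Hypothetical 2-state SSP under a fixed proper policy; state 0 is termination.
P = {1: {2: 0.5, 0: 0.5},        # transition probabilities p_ij
     2: {1: 0.4, 0: 0.6}}
COST = 1.0                       # g(i, j) = 1 for every transition (assumption)
PHI = {0: 0.0, 1: 1.0, 2: 2.0}   # scalar feature (s = 1), with phi(0) = 0
START = 1                        # q_0 puts all of its mass on state 1


def next_state(i, rng):
    """Sample the successor j of state i with probability p_ij."""
    u, acc = rng.random(), 0.0
    for j, p in P[i].items():
        acc += p
        if u < acc:
            return j
    return j  # guard against floating-point rounding in the cumulative sum


def lstd_lspe(num_trajectories, gamma=1.0, seed=0):
    """Return (LSTD estimate, LSPE iterate) after simulating the trajectories."""
    rng = random.Random(seed)
    c_sum = d_sum = phi2_sum = 0.0   # unnormalized C_k, d_k, and sum of phi^2
    r_lspe = 0.0
    for _ in range(num_trajectories):
        i = START
        while i != 0:                # run until the termination state is reached
            j = next_state(i, rng)
            c_sum += PHI[i] * (PHI[i] - PHI[j])
            d_sum += PHI[i] * COST
            phi2_sum += PHI[i] ** 2
            i = j
        # r <- r - gamma * G_k (C_k r - d_k); the common 1/k factors cancel.
        r_lspe -= gamma * (c_sum * r_lspe - d_sum) / phi2_sum
    return d_sum / c_sum, r_lspe


r_lstd, r_lspe = lstd_lspe(20_000)
print("LSTD(0) estimate:", r_lstd)
print("LSPE(0) iterate: ", r_lspe)
```

For this chain $q = (1.25, 0.625)$, so the exact quantities are $C = 2$ and $d = 2.5$, and the solution of the projected equation is $r^* = 1.25$. Both estimates should approach this value as the number of trajectories grows. Updating $r$ once per trajectory rather than once per transition is a simplification made for brevity.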
