A policy iteration algorithm and finite-approximation

We have shown the existence of an optimal stationary policy above. In this section, we focus on the computational approach for finding optimal stationary policies.

Under Conditions 3.1-3.3 and 3.4, for each f ∈ F , by taking A(i) := {f (i)} for all i ∈ S, it follows from Theorem 3.5 that J (i, f ) is independent of states i (i.e., a constant denoted by ρf), which together with the function ψf(i) := E_ifheR0τi0(c(ξt,f (ξt))−ρf)dt

equation

ρψ(i) = c(i, f (i))ψ(i) +X

j∈S

ψ(j)q(j|i, f (i)) ∀i ∈ S, with ψ(i0) = 1. (3.28)

To establish the uniqueness of a solution to the Poisson equation (3.28) and the RS-AOE (3.27), we introduce the following condition.

Condition 3.5. sup_i∈Sψf_{(i) < ∞ for each f ∈ F ,}_ψf_{(i) := E}f i

eR0τi0(c(ξt,f (ξt))−ρf)dt

i .

Remark 3.6. Since ψf_{(i) ≥ ψ}∗ _{for any f ∈ F , Conditions 3.4 and 3.5 together}

with Theorem 3.5(c) implies that the ψ∗ in the solution (ρ∗, ψ∗) to the RS-AOE needs to be a bounded, positive function which is uniformly bounded away from zero. As mentioned in Remark 5.3 in [43], it is unsolved to show the existence of such a solution. Obviously, Conditions 3.4 and 3.5 are satisfied when S is finite. For the case of infinitely denumerable states we next give suitable conditions and examples for the verifications of Conditions 3.4 and 3.5.

Proposition 3.2. Suppose that q∗ := infa∈A(i),i6=i0q(i, a) > 0, and there exist a nonnegative bounded function V3 on S and a constant ˆδ > 0 such that

δ < q∗, V3(i0) = 0, and

j6=i0

q(j|i, a)V3(j) ≤ −ˆδV3(i) − 1 ∀a ∈ A(i), for all i 6= i0.

Then, the following assertions hold.

(a) ψf(i) ≥ e−L(i0) supi∈SV3(i) _{for all i ∈ S and f ∈ F , which implies Conditions} 3.4.

(b) If in addition c(i, a) ≤ ˆ_{δ for all (i, a) ∈ K, then Conditions 3.5 and 3.4 are} satisfied.

Proof. Let M := sup_i∈SV3(i). Then,for any f ∈ F , it follows from the proof of

Lemma 6.1.5 in [3] that E_if(τi0) ≤ V3(i) ≤ M and E f i h eδτˆi0 i

Thus, by ψf(i) = E_ifheR0τi0(c(ξt,f (ξt))−ρf)dt i

≥ e−ρf_Ef

i(τi0) _{≥ e}−L(i0)M_{, we see that} (a) is true. Obviously, (b) follows from that ψf_{(i) ≤ E}f

i h eR0τi0c(ξt,f (ξt))dt i ≤ E_ifheˆδτi0 i ≤ 1 + ˆδM .

We next prove the uniqueness of a solution to (3.28) or (3.27).

Proposition 3.3. Under Conditions 3.1, 3.2, 3.3, 3.4 and 3.5, the followings hold.

(a) The solution (ρ∗, ψ∗) to the RS-AOE (3.27) is unique in [0, L(i0)] × B1+(S).

(b) For each f ∈ F , the solution (ρf_{, ψ}f_{) to the multiplicative Poisson equation}

(3.28) is unique in [0, L(i0)] × B1+(S).

Proof. (a) In Theorem 3.5, we have shown that (ρ∗, ψ∗) is a solution in [0, L(i0)]×

B_V+₀(S) to the RS-AOE (3.27) and (ρ∗, ψ∗) = (ρf∗, ψf∗) for some f∗ ∈ F . Under Condition 3.5, it is obvious that ψ∗ ∈ B+

1 (S). Hence, it remains to show that

such a solution is unique to the RS-AOE (3.27) in [0, L(i0)] × B1+(S). To do

so, suppose that (ρ, ψ) is an arbitrary solution in [0, L(i0)] × B1+(S) to the RS-

AOE. Since (ρ, ψ) ∈ [0, L(i0)] × B1+(S), using a similar argument as in proving

Theorem 3.3(b,c) and Theorem 3.5(b) yields that ρ = infπ∈ΠJ (i, π), and ψ(i) =

infπ∈Πd mE π i h eR0τi0(c(ξt,πt)−ρ)dt i

for all i ∈ S. It then follows that ρ = ρ∗ and ψ(i) = ψ∗(i) for all i ∈ S.

(b) The proof is similar to those of part (a) above.

Basing on the uniqueness of a solution to (3.28), we next provide a policy iteration algorithm for computing optimal stationary policies.

The policy iteration algorithm:

1. Pick an arbitrary f ∈ F . Let n = 0, and take fn := f .

2. Policy evaluation (by Proposition 3.3(b)): Compute ρfn _{and ψ}fn _{by solving} the following multiplicative Poisson equation

ρψ(i) = c(i, fn(i))ψ(i) +

j∈S

3. Policy improvement: Obtain a policy fn+1∈ F such that, for each i ∈ S, fn+1(i) :=     

fn(i) when Bfn(i) = ∅

a with any a ∈ Bfn(i) 6= ∅,

(3.29)

where

Bfn(i) := {a ∈ A(i)| c(i, a)ψ

fn_{(i) +}P

j∈Sq(j|i, a)ψ

fn_{(j) < ρ}fn_ψfn_(i)}.

if the set Bfn(i) contains more than one action, then we choose any one of them to be fn+1(i).

4. If fn+1 = fn (i.e., Bfn(i) ≡ ∅),then stop because fn+1 is optimal (by Theo- rems 3.3 and 3.5(c) above). Otherwise, increase n by 1 and return to step 2.

To establish the convergence of this algorithm, M := {S, A(i), c(i, a), q(j|i, a)}

needs to be irreducible, which means that {ξt, t ≥ 0} is irreducible under each

f ∈ F . Then, we have the following.

Lemma 3.4. Under the conditions in Theorem 3.3, suppose that S and A(i)(i ∈ S) are finite and M is irreducible. Let {fn} be a sequence obtained by the policy

iteration algorithm, then the following assertions hold. (a) ρfn+1 _{≤ ρ}fn _{for all n ≥ 1.}

(b) If fn+1 6= fn for some n ≥ 0, then ρfn+1 < ρfn, and

ρfn+1_{− ρ}fn _{= c(i} 0, fn+1(i0))ψfn+1(i0) + X j∈S ψfn+1_(j)q(j|i 0, fn+1(i0)) −c(i0, fn(i0))ψfn(i0) − X j∈S ψfn_(j)q(j|i 0, fn(i0)).

Proof. By Theorem 5.1 in [43] and Proposition 3.3, we see that (a)-(c) are true.

In order to get optimal policies for the case of infinite states by finite-approximation, we construct models Mn := {Sn, An(i), cn(i, a), qn(j|i, a)}(n ≥ 1) with finite s-

tates, where

Sn := {0, . . . , n}; An(i) := A(i); cn(i, a) := c(i, a); and

for each i ∈ Sn and a ∈ An(i).

Obviously, the transition rates qn(j|i, a)(n ≥ 1) are also conservative and

stable. Moreover, if the function V0 in Condition 3.1 is nondecreasing on S, then

Condition 3.1 holds for each model Mn, which is verifies as follows: For each

n ≥ 1, i ∈ Sn, a ∈ An(i), X j∈Sn qn(j|i, a)V0(j) = X j∈Sn,j6=i [q(j|i, a) + 1 n X k>n

q(k|i, a)]V0(j) + q(i|i, a)V0(i)

= X j∈Sn q(j|i, a)V0(j) + 1 n X k>n q(k|i, a)[ X j∈Sn,j6=i V0(j)] ≤ X j≤n q(j|i, a)V0(j) + X k>n q(k|i, a)V0(k)

≤ −δ(i)V0(i) + b0I{i0} (by Condition 3.1). (3.31)

Thus, in summary, we have the fact below.

Theorem 3.6. Under Conditions 3.1, 3.2 and 3.3, if the functions V0 and V1 are

nondecreasing on S, then the following assertions hold for each Mn(n ≥ 1).

B₁+(S):

ρ∗_nψ∗_n(i) = inf

a∈A(i)

{cn(i, a)ψn∗(i) +

j∈Sn

ψ_n∗(j)qn(j|i, a)} ∀ i ∈ Sn, (3.32)

ψ_n∗(i0) = 1, ρ∗n≤ L0(i0), ψn∗(i) ≤ V0(i), for all i ∈ Sn and n ≥ 1.

(b) There exists a f_n∗ ∈ F achieving the minimum in (3.32). Moreover, a policy f in F is optimal for Mnif and only if f (i) achieves the minimum in (3.32)

for all i ∈ Sn.

(c) If, in addition, M is irreducible and A(i) is finite for each i ∈ S, then an optimal stationary policy f_n∗ for Mncan be obtained by the policy iteration

algorithm in a finite number of steps.

Proof. (a)and (b). Since Sn are finite, it follows from (3.31) that the Conditions

3.1, 3.2, 3.3, 3.4 and 3.5 are satisfied for each Mn. Thus, (a) follows from Theorem

3.5 and Proposition 3.3(a) as well as (3.31), (b) from Proposition 3.3.

any stationary policy fn ∈ F for model Mn and i, j ∈ Sn, i 6= j, we extend

fn to f ∈ F by letting f (i) := fn(i) for all i ∈ Sn and f (i) := ai for any

i 6∈ Sn, where ai ∈ A(i) is any fixed action. Then, since M is irreducible, there

exists K ≥ 0 states ik ∈ S \ {j}(k = 0, . . . K, with i0 = i, iK+1 = j) such that

q(ik+1|ik, f (ik)) > 0 for all k = 0, . . . , K. For the K + 2 states ik, if ik ∈ Sn

for all k = 0, · · · , K + 1, then q(ik+1|ik, fn(ik)) > 0 for all k = 0, · · · , K + 1,

which implies that i can reach j under fn. Otherwise, let k∗ := min{k : ik6∈ Sn}.

Since i0, iK+1 ∈ Sn, we have 1 ≤ k∗ ≤ K and ik∗₊₁ ∈ S/ _n. Then we have {i0, . . . , ik∗₋₁} ⊂ S_n, but i_k∗ 6∈ S_n. Thus, by the definition of the transition rates qn(·|·, ·) in (3.30) and ik∗₋₁ 6= j ∈ S_n, we have

qn(j|ik∗₋₁, f_n(i_k∗₋₁)) = q_n(j|i_k∗₋₁, f (i_k∗₋₁)) ≥ 1

nq(ik∗|ik∗−1, f (ik∗−1)) > 0, which, together with q(ik+1|ik, f (ik)) > 0 for k = 0, . . . , k∗ − 1, implies that i

can reach to j under fn for the model Mn. Thus, Mn is irreducible, and so (c)

follows from Lemma 3.4.

Since the sets A(i) are compact, and so is F . Thus, any sequence {f_n∗} in Theorem 3.6(c) has a limit policy (say ˆf∗) in F , that is, there is a subsequence {f∗

nk} of {f ∗

n} such that limk→∞fn∗k(i) = ˆf

∗_{(i) for each i ∈ S.}

Theorem 3.7. (Finite-approximation.) Suppose that Conditions 3.1, 3.2 and 3.3 are satisfied, and the V0 is nondecreasing on S. Then, the followings hold.

(a) A sequence {f_n∗} of optimal stationary policies f∗

n exists for Mn, and it has

a limit policy ˆf∗ such that

ρ ˆψ(i) = c(i, ˆf∗(i)) ˆψ(i) +X

j∈S

q(j|i, ˆf∗(i)) ˆψ(j)

= inf

a∈A(i){c(i, a) ˆψ(i) +

j∈S

q(j|i, a) ˆψ(j)} ∀i ∈ S. (3.33)

for some function ˆψ on S such that ˆψ ≤ V0.

(b) If, in addition, the condition in Proposition 3.2 holds with a nondecreasing V3 on S and a constant ˆδ ≥ c(i, a) on K, then the policy ˆf∗ in (a) is optimal

for the model M.

Proof. (a) Suppose that ˆf∗(i) = limk→∞fn∗k(i) for all i ∈ S. Then, by Theorem 3.5, we have 0 ≤ ρ∗_n

k ≤ L0(i0) and ψ ∗

nk(i) ≤ V0(i) for all k ≥ 1. Thus, the diagonalization arguments ensure the existence of a subsequence {nkl, l ≥ 1} of {nk, n ≥ 1} and ( ˆρ, ˆψ) ∈ [0, L0(i0)] × BV0(S) such that ˆψ(i0) = 1 and:

lim l→∞f ∗ n_kl(i) = ˆf ∗ (i), lim l→∞ρ ∗ n_kl = ˆρ, lim l→∞ψ ∗

n_kl(i) = ˆψ(i) ≤ V0(i) ∀i ∈ S (3.34)

3.6(a,b) gives ∀ l ≥ ˜n ρ∗_n klψ ∗_n kl = cn_kl(i, f ∗ n_kl(i))ψ ∗ n_kl(i) + X j∈S_nk l ψ∗_n kl(j)qnkl(j|i, f ∗ n_kl(i)) (3.35) = inf

a∈A(i){cnkl(i, a)ψ ∗ n_kl(i) + X j∈Sn ψ_n∗ kl(j)qnkl(j|i, a)} (3.36) ≤ cn_kl(i, a)ψ∗n_kl(i) +

j∈Sn ψ_n∗

kl(j)qnkl(j|i, a) ∀ a ∈ A(i).

On the other hand, since ψ∗_n≤ V0 for all n ≥ 1, by (3.30), we have

X j∈Sn ψ∗_n(j)qn(j|i, fn∗(i)) (3.37) = X j∈Sn ψ∗_n(j)q(j|i, f_n∗(i)) + 1 n X k>n q(k|i, f_n∗(i)) X 0≤j≤n,j6=i ψ_n∗(j)

which, together with the monotonicity of V0 and the following

0 ≤ 1 n X k>n q(k|i, f_n∗(i)) X 0≤j≤n,j6=i ψ_n∗(j) ≤X k>n q(k|i, f_n∗(i))V0(k) → 0 as n → ∞, implies that lim n→∞ " X j∈Sn ψ∗_n(j)qn(j|i, fn∗(i)) # =X j∈S ˆ ψ∗(j)q(j|i, ˆf∗(i)). (3.38) Thus, by (3.34)-(3.38), we get (3.33). (b) As the proof of (3.31), we have

δ < qn(i, a), and

j6=i0

qn(j|i, a)V3(j) ≤ −ˆδV3(i)−1 ∀a ∈ A(i), for every i 6= i0, n ≥ 1,

which, together with the proofs of Proposition 3.2 and Theorem 3.5(c), gives e−L(i0)M ≤ ψ∗

n(i) ≤ 1 + ˆδM , where M = supi∈SV3(i) < ∞, and so e−L(i0)M ≤

ψ∗(j) ≤ 1 + ˆδM for all i ∈ S. Hence, by (3.33) and Theorem 3.5, we see that (b) is also true.

In document Risk-sensitive Markov Decision Processes (Page 52-60)