
Chapter 5 Greedy Strategy Improvement For Markov Decision Processes

5.4 The Reset Phase

5.4.4 The Final Reset Strategy

The purpose of this section is to prove the following proposition.

Proposition 5.30. Greedy strategy improvement moves from the strategy $\sigma^{B}_{R3}$ to the strategy $\sigma^{B'}_{0}$.

We begin by stating an analogue of Proposition 5.27. However, since the strategy $\sigma^{B}_{R3}$ is closer to the strategy $\sigma^{B'}_{0}$, more vertices can be included in this proposition.

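Proposition 5.31. We have $\mathrm{Val}^{\sigma^{B}_{R3}}(v) = \mathrm{Val}^{\sigma^{B'}_{0}}(v)$ for every $v \in \{c_j : j \ge i\} \cup \{r_j : 1 \le j \le n\} \cup \{b_j, f_j : j \in B'\} \cup \{x, y\}$.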

Proof. The proof of this proposition uses exactly the same reasoning as the proof of Proposition 5.27. This is because $\sigma^{B}_{R3}$ is closed on the set $\{c_j : j \ge i\} \cup \{r_j : 1 \le j \le n\} \cup \{b_j, f_j : j \in B'\} \cup \{x, y\}$.

The arguments that we used to show that a vertex $r_j$ with $j \ge i$ will not be switched away from the strategy $\sigma^{B}_{R2}$ can also be applied to the strategy $\sigma^{B}_{R3}$.

Since every outgoing action from the vertex $r_j$ is of the form $(r_j, v)$, where $v \in \{c_j : j > i\}$, Proposition 5.31 implies that $\mathrm{Appeal}^{\sigma^{B'}_{0}}(r_j, v) = \mathrm{Appeal}^{\sigma^{B}_{R3}}(r_j, v)$ for every outgoing action from $r_j$. The claim then follows from Proposition 5.16, which implies that if greedy strategy improvement is applied to $\sigma^{B'}_{0}$, then it will not switch away from $\sigma^{B'}_{0}$.

The same argument can also be used to prove that a vertex $c_j$ with $j \in B'$ will not be switched away from the action $\sigma^{B'}_{0}(c_j)$. Once again, this is because every outgoing action from the vertex $c_j$ is of the form $(c_j, v)$ where $v \in \{c_j, r_j : j \ge i\} \cup \{b_j, f_j : j \in B'\}$. Therefore, we can apply Propositions 5.31 and 5.16 in the same way in order to prove the claim.

We will now consider the vertices $r_j$ with $j < i$, where we must show that $(r_j, c_i)$ is the most appealing action. We will begin by showing that the actions $(r_j, c_k)$ with $k < i$ are not switchable. Since we have $\sigma^{B}_{R3}(c_k) = f_k$, $\sigma^{B}_{R3}(b_k) = x$, and $\sigma^{B}_{R3}(x) = f_i$, we have:

$$\mathrm{Appeal}^{\sigma^{B}_{R3}}(r_j, c_k) = -(10n+4)2^{k-1} + \mathrm{Val}^{\sigma^{B}_{R3}}(f_i).$$

On the other hand, since $\sigma^{B}_{R3}(r_j) = c_i$ and $\sigma^{B}_{R3}(c_i) = f_i$, we have:

$$\mathrm{Val}^{\sigma^{B}_{R3}}(r_j) = 4n + \mathrm{Val}^{\sigma^{B}_{R3}}(f_i).$$

Therefore, the actions $(r_j, c_k)$ with $k < i$ are not switchable in $\sigma^{B}_{R3}$. Proposition 5.31 and Proposition 5.2 then imply that $\mathrm{Appeal}^{\sigma^{B}_{R3}}(r_j, c_i) > \mathrm{Appeal}^{\sigma^{B}_{R3}}(r_j, c_k)$ for every $k > i$. Therefore, greedy strategy improvement will not switch away from the action $(r_j, c_i)$ in the strategy $\sigma^{B}_{R3}$.

We will now consider the vertices $c_j$ with $j \notin B'$ and $j < i$. At these vertices we must argue that greedy strategy improvement switches the action $(c_j, r_j)$. Since $\sigma^{B}_{R3}(c_j) = f_j$, $\sigma^{B}_{R3}(b_j) = x$, and $\sigma^{B}_{R3}(x) = f_i$, we have:

$$\mathrm{Val}^{\sigma^{B}_{R3}}(c_j) = 1 - (10n+4)2^{j-1} + \mathrm{Val}^{\sigma^{B}_{R3}}(f_i).$$

On the other hand, since $\sigma^{B}_{R3}(r_j) = c_i$ and $\sigma^{B}_{R3}(c_i) = f_i$, we have:

$$\mathrm{Appeal}^{\sigma^{B}_{R3}}(c_j, r_j) = -1 + 4n + 1 + \mathrm{Val}^{\sigma^{B}_{R3}}(f_i).$$

Therefore, we have $\mathrm{Appeal}^{\sigma^{B}_{R3}}(c_j, r_j) > \mathrm{Val}^{\sigma^{B}_{R3}}(c_j)$, which implies that greedy strategy improvement will switch the action $(c_j, r_j)$ at the vertex $c_j$ in the strategy $\sigma^{B}_{R3}$.
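The comparison above can be sketched numerically. The following is a minimal illustration, not the thesis's own code: the function names `greedy_switch` and `c_vertex_case` are hypothetical, the baseline `val_f_i` stands in for the value of $f_i$ (it cancels in the comparison), and the switching rule assumed here is the one the text relies on, namely that an action is switchable when its appeal exceeds the current value of the vertex.

```python
def greedy_switch(current_value, appeals):
    """Return the most appealing action if it is switchable, else None.

    current_value: value of the vertex under the current strategy.
    appeals: dict mapping each candidate action to its appeal.
    Assumption: an action is switchable when its appeal exceeds current_value.
    """
    best_action = max(appeals, key=appeals.get)
    if appeals[best_action] > current_value:
        return best_action
    return None


def c_vertex_case(n, j, val_f_i=0):
    """Val(c_j) and Appeal(c_j, r_j) for j not in B' and j < i, as given in the text.

    val_f_i is a placeholder for Val(f_i); it appears in both expressions.
    """
    val_c_j = 1 - (10 * n + 4) * 2 ** (j - 1) + val_f_i
    appeal_c_j_r_j = -1 + 4 * n + 1 + val_f_i
    return val_c_j, appeal_c_j_r_j


# Only the single action discussed in the text is compared here.
val_c, appeal_r = c_vertex_case(n=3, j=2)
print(val_c, appeal_r)
print(greedy_switch(val_c, {("c_2", "r_2"): appeal_r}))
```

Because the appeal is $4n$ above the baseline while the value is below it by roughly $(10n+4)2^{j-1}$, the action is switchable for every $n \ge 1$ and $j \ge 1$.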

Now we will consider the vertices $c_j$ with $j \notin B'$ and $j > i$. At these vertices we must argue that greedy strategy improvement does not switch away from the action $(c_j, r_j)$. We will do this by arguing that the action $(c_j, f_j)$ is not switchable. As we have argued previously, we have:

$$\mathrm{Appeal}^{\sigma^{B}_{R3}}(c_j, f_j) = 1 - (10n+4)2^{j-1} + \mathrm{Val}^{\sigma^{B}_{R3}}(f_i) = (10n+4)(2^{i} - 2^{i-1} - 2^{j-1}) + \mathrm{Val}^{\sigma^{B}_{R3}}(c_{\min(B'^{>i} \cup \{n+1\})}).$$

We can then apply Proposition 5.31 and Proposition 5.4 to show:

$$\mathrm{Appeal}^{\sigma^{B}_{R3}}(c_j, f_j) \le (10n+4)(2^{i} - 2^{i-1} - 2^{j-1} - 2^{i} + 2^{j-1}) + \mathrm{Val}^{\sigma^{B}_{R3}}(c_j) < \mathrm{Val}^{\sigma^{B}_{R3}}(c_j).$$

Therefore, the action $(c_j, f_j)$ is not switchable in the strategy $\sigma^{B}_{R3}$.
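The strict inequality in the bound on the appeal of $(c_j, f_j)$ rests on the power-of-two coefficient collapsing to $-2^{i-1}$. A small check of that simplification, with the hypothetical helper name `coefficient` and an arbitrary index range chosen only for illustration:

```python
def coefficient(i, j):
    """Coefficient of (10n + 4) in the upper bound on Appeal(c_j, f_j)."""
    return 2 ** i - 2 ** (i - 1) - 2 ** (j - 1) - 2 ** i + 2 ** (j - 1)


# The 2^i and 2^(j-1) terms cancel, leaving -2^(i-1), which is negative.
simplifies = all(
    coefficient(i, j) == -(2 ** (i - 1))
    for i in range(1, 12)
    for j in range(i + 1, 13)
)
print(simplifies)
```

Since $10n + 4 > 0$, a negative coefficient is what places the appeal strictly below the value of $c_j$.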

We will now consider the vertex $x$, where we must show that the most appealing action is $(x, f_i)$. We will begin by showing that the actions $(x, f_j)$ where $j \notin B'$ are not switchable in $\sigma^{B}_{R3}$. Since $\sigma^{B}_{R3}(b_j) = x$, we have:

$$\mathrm{Appeal}^{\sigma^{B}_{R3}}(x, f_j) = -(10n+4)2^{j-1} - 4n + \mathrm{Val}^{\sigma^{B}_{R3}}(x) < \mathrm{Val}^{\sigma^{B}_{R3}}(x).$$

Therefore, the actions $(x, f_j)$ where $j \notin B'$ are not switchable in $\sigma^{B}_{R3}$. For the actions $(x, f_j)$ where $j \in B'$ and $j \ne i$, Proposition 5.31 and Proposition 5.19 imply that $\mathrm{Appeal}^{\sigma^{B}_{R3}}(x, f_i) > \mathrm{Appeal}^{\sigma^{B}_{R3}}(x, f_j)$, which implies that greedy strategy improvement will not switch away from the action $(x, f_i)$.

We can apply the same reasoning for the vertex $y$, where we must show that $(y, c_i)$ is the most appealing action. For the actions $(y, c_j)$ with $j \notin B'$ we have:

$$\mathrm{Appeal}^{\sigma^{B}_{R3}}(y, c_j) = -(10n+4)2^{j-1} + 1 + \mathrm{Val}^{\sigma^{B}_{R3}}(x) < \mathrm{Val}^{\sigma^{B}_{R3}}(x) + 4n + 1.$$

Since $\sigma^{B}_{R3}(y) = c_i$ and $\sigma^{B}_{R3}(x) = f_i$, we must have $\mathrm{Val}^{\sigma^{B}_{R3}}(y) = \mathrm{Val}^{\sigma^{B}_{R3}}(x) + 4n + 1$.

Therefore we have shown that the actions $(y, c_j)$ with $j \notin B'$ are not switchable in the strategy $\sigma^{B}_{R3}$. Proposition 5.31 and Proposition 5.2 imply that $\mathrm{Appeal}^{\sigma^{B}_{R3}}(y, c_i) > \mathrm{Appeal}^{\sigma^{B}_{R3}}(y, c_j)$ for every $j \in B'$ such that $j \ne i$. Therefore greedy strategy improvement will not switch away from the action $(y, c_i)$ in the strategy $\sigma^{B}_{R3}$.

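We now consider the vertices $d_k$ for all $k$. For every vertex $d_k$, we must show that $(d_k, y)$ is the most appealing action. For the action $(d_k, d_{k-1})$ we have:

$$\mathrm{Appeal}^{\sigma^{B}_{R3}}(d_k, d_{k-1}) = \mathrm{Val}^{\sigma^{B}_{R3}}(x) - 1 < \mathrm{Val}^{\sigma^{B}_{R3}}(x) = \mathrm{Appeal}^{\sigma^{B}_{R3}}(d_k, x).$$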

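For the action $(d_k, x)$, the fact that $r(d_k, y) = r(d_k, x)$ implies:

$$\mathrm{Appeal}^{\sigma^{B}_{R3}}(d_k, x) = r(d_k, x) + \mathrm{Val}^{\sigma^{B}_{R3}}(f_i) < r(d_k, y) + 4n + 1 + \mathrm{Val}^{\sigma^{B}_{R3}}(f_i) = \mathrm{Appeal}^{\sigma^{B}_{R3}}(d_k, y).$$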

Finally, we consider the vertices $b_j$. We begin with the case when $j \notin B'$. In this case we must show that $(b_j, y)$ is the most appealing action at the vertex $b_j$. We will first argue that the action $(b_j, x)$ is not switchable. Since $\sigma^{B}_{R3}(x) = f_i$, $\sigma^{B}_{R3}(y) = c_i$ and $\sigma^{B}_{R3}(c_i) = f_i$, we have the following two equalities:

$$\mathrm{Val}^{\sigma^{B}_{R3}}(x) = \mathrm{Val}^{\sigma^{B}_{R3}}(f_i)$$

$$\mathrm{Val}^{\sigma^{B}_{R3}}(y) = \mathrm{Val}^{\sigma^{B}_{R3}}(f_i) + 4n + 1$$

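Therefore, we must have $\mathrm{Appeal}^{\sigma^{B}_{R3}}(b_j, x) < \mathrm{Appeal}^{\sigma^{B}_{R3}}(b_j, y)$. These two equalities can also be used to prove that the actions of the form $(b_j, d_k)$ are not switchable. This is because we have $\mathrm{Appeal}^{\sigma^{B}_{R3}}(b_j, d_k) \le 4n + \mathrm{Val}^{\sigma^{B}_{R3}}(x)$ and we have $\mathrm{Appeal}^{\sigma^{B}_{R3}}(b_j, y) = 4n + 2 + \mathrm{Val}^{\sigma^{B}_{R3}}(x)$.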

We now consider the actions of the form $(b_j, f_k)$. We will first prove that the actions $(b_j, f_k)$ where $k \notin B'$ are not switched by greedy strategy improvement. Since $k \notin B'$, we have:

$$\mathrm{Appeal}^{\sigma^{B}_{R3}}(b_j, f_k) = -(10n+4)2^{k-1} + 1 + \mathrm{Val}^{\sigma^{B}_{R3}}(x) < \mathrm{Val}^{\sigma^{B}_{R3}}(x) + 4n + 2 = \mathrm{Appeal}^{\sigma^{B}_{R3}}(b_j, y).$$

Therefore, these actions will not be switched by greedy strategy improvement in the strategy $\sigma^{B}_{R3}$. We now consider the actions $(b_j, f_k)$ where $k \in B'$. We will prove that these actions are not switchable in $\sigma^{B}_{R3}$. The appeal of the action $(b_j, f_k)$ is:

$$\mathrm{Appeal}^{\sigma^{B}_{R3}}(b_j, f_k) = 4n + 1 + \mathrm{Val}^{\sigma^{B}_{R3}}(f_k).$$

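On the other hand, the appeal of the action $(b_j, y)$ can be expressed as:

$$\mathrm{Appeal}^{\sigma^{B}_{R3}}(b_j, y) = (10n+4)(2^{i} - 2^{i-1}) - 4n + \mathrm{Val}^{\sigma^{B}_{R3}}(c_{\min(B'^{>i} \cup \{n+1\})}).$$

We can then use Proposition 5.31 and Proposition 5.4 to conclude:

$$\begin{aligned}
\mathrm{Appeal}^{\sigma^{B}_{R3}}(b_j, y) &= (10n+4)(2^{i} - 2^{i-1} - 2^{i} + 2^{k-1}) - 4n + \mathrm{Val}^{\sigma^{B}_{R3}}(c_k) \\
&= (10n+4)(2^{k-1} - 2^{i-1}) - 4n + \mathrm{Val}^{\sigma^{B}_{R3}}(c_k) \\
&= (10n+4)(2^{k-1} - 2^{i-1}) + 1 + \mathrm{Val}^{\sigma^{B}_{R3}}(f_k) \\
&> 4n + 1 + \mathrm{Val}^{\sigma^{B}_{R3}}(f_k).
\end{aligned}$$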
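The last step of this chain compares $(10n+4)(2^{k-1} - 2^{i-1}) + 1$ with $4n + 1$ once the common value of $f_k$ is cancelled. The sketch below checks that gap numerically under the assumption that $k > i$, which the text does not state explicitly at this point; the helper name `appeal_gap` and the tested ranges are illustrative only.

```python
def appeal_gap(n, i, k):
    """Appeal(b_j, y) minus Appeal(b_j, f_k), with the common Val(f_k) cancelled.

    Uses the last two expressions of the chain:
    (10n + 4)(2^(k-1) - 2^(i-1)) + 1  versus  4n + 1.
    """
    return (10 * n + 4) * (2 ** (k - 1) - 2 ** (i - 1)) + 1 - (4 * n + 1)


# Assumption: k > i, so 2^(k-1) - 2^(i-1) >= 1 and the gap is at least 6n + 4.
gaps_positive = all(
    appeal_gap(n, i, k) > 0
    for n in range(1, 10)
    for i in range(1, 10)
    for k in range(i + 1, 11)
)
print(gaps_positive)
```

Under that assumption the difference of powers of two is at least $1$, so the gap is at least $(10n+4) - 4n = 6n + 4$.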

Therefore, the action $(b_j, f_k)$ will not be switched by greedy strategy improvement.

Finally, we consider the action $a_j$. We will first argue that $\mathrm{Appeal}^{\sigma^{B}_{R3}}(b_j, y) > \mathrm{Val}^{\sigma^{B}_{R3}}(b_j) + 1$. This holds because we have $\mathrm{Val}^{\sigma^{B}_{R3}}(b_j) = \mathrm{Val}^{\sigma^{B}_{R3}}(f_i)$, and we have $\mathrm{Appeal}^{\sigma^{B}_{R3}}(b_j, y) = 4n + 2 + \mathrm{Val}^{\sigma^{B}_{R3}}(f_i)$. On the other hand, Proposition 5.31 and Proposition 5.10 imply that $\mathrm{Appeal}^{\sigma^{B}_{R3}}(b_j, a_j) < \mathrm{Val}^{\sigma^{B}_{R3}}(b_j) + 1$. Therefore, greedy strategy improvement will not switch the action $a_j$.

To complete the proof of Proposition 5.30, we will show that greedy strategy improvement does not switch away from the action $a_j$ at every vertex $b_j$ with $j \in B'$. We can use exactly the same arguments as we used for the vertices $b_j$ with $j \notin B'$ to argue that $\mathrm{Appeal}^{\sigma^{B}_{R3}}(b_j, y) > \mathrm{Appeal}^{\sigma^{B}_{R3}}(a)$ for every action $a \in \{(b_j, d_k) : 1 \le k \le n\} \cup \{(b_j, f_k) : j < k \le n\} \cup \{(b_j, x)\}$. Therefore, we can prove the claim by showing that $\mathrm{Appeal}^{\sigma^{B}_{R3}}(b_j, y) < \mathrm{Val}^{\sigma^{B}_{R3}}(b_j)$. As we have done previously, we can use Proposition 5.31 and Proposition 5.4 to obtain the following characterisation for the appeal of the action $(b_j, y)$:

$$\begin{aligned}
\mathrm{Appeal}^{\sigma^{B}_{R3}}(b_j, y) &= (10n+4)(2^{i} - 2^{i-1} - 2^{i} + 2^{j-1}) - 4n + \mathrm{Val}^{\sigma^{B}_{R3}}(c_j) \\
&= (10n+4)(2^{j-1} - 2^{i-1}) - 4n + \mathrm{Val}^{\sigma^{B}_{R3}}(c_j) \\
&= -(10n+4)2^{i-1} - 4n + 1 + \mathrm{Val}^{\sigma^{B}_{R3}}(b_j) \\
&< \mathrm{Val}^{\sigma^{B}_{R3}}(b_j).
\end{aligned}$$

Therefore, greedy strategy improvement will not switch away from the action $a_j$ at the vertex $b_j$ in the strategy $\sigma^{B}_{R3}$.

5.5 Exponential Lower Bounds For The Average Re-

In document Strategy iteration algorithms for games and Markov decision processes (Page 176-182)