Chapter 5 Greedy Strategy Improvement For Markov Decision Pro-
5.4 The Reset Phase
5.4.4 The Final Reset Strategy
The purpose of this section is to provide a proof for the following proposition.
Proposition 5.30. Greedy strategy improvement moves from the strategy $\sigma^{B}_{R3}$ to the strategy $\sigma^{B'}_{0}$.
We begin by stating an analogue of Proposition 5.27. However, since the strategy $\sigma^{B}_{R3}$ is closer to the strategy $\sigma^{B'}_{0}$, more vertices can be included in this proposition.
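Proposition 5.31. We have $\mathrm{Val}^{\sigma^{B}_{R3}}(v) = \mathrm{Val}^{\sigma^{B'}_{0}}(v)$ for every $v \in \{c_j : j \geq i\} \cup \{r_j : 1 \leq j \leq n\} \cup \{b_j, f_j : j \in B'\} \cup \{x, y\}$.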
Proof. The proof of this proposition uses exactly the same reasoning as the proof of Proposition 5.27. This is because $\sigma^{B}_{R3}$ is closed on the set $\{c_j : j \geq i\} \cup \{r_j : 1 \leq j \leq n\} \cup \{b_j, f_j : j \in B'\} \cup \{x, y\}$.
The arguments that we used to show that a vertex $r_j$ with $j \geq i$ will not be switched away from the strategy $\sigma^{B}_{R2}$ can also be applied to the strategy $\sigma^{B}_{R3}$.
Since every outgoing action from the vertex $r_j$ is of the form $(r_j, v)$, where $v \in \{c_j : j > i\}$, Proposition 5.31 implies that $\mathrm{Appeal}^{\sigma^{B'}_{0}}(r_j, v) = \mathrm{Appeal}^{\sigma^{B}_{R3}}(r_j, v)$ for every outgoing action from $r_j$. The claim then follows from Proposition 5.16, which implies that if greedy strategy improvement is applied to $\sigma^{B'}_{0}$, then it will not switch away from $\sigma^{B'}_{0}$.
The same argument can also be used to prove that a vertex $c_j$ with $j \in B'$ will not be switched away from the action $\sigma^{B'}_{0}(c_j)$. Once again, this is because every outgoing action from the vertex $c_j$ is of the form $(c_j, v)$ where $v \in \{c_j, r_j : j \geq i\} \cup \{b_j, f_j : j \in B'\}$. Therefore, we can apply Propositions 5.31 and 5.16 in the same way in order to prove the claim.
We will now consider the vertices $r_j$ with $j < i$, where we must show that $(r_j, c_i)$ is the most appealing action. We will begin by showing that the actions $(r_j, c_k)$ with $k < i$ are not switchable. Since we have $\sigma^{B}_{R3}(c_k) = f_k$, $\sigma^{B}_{R3}(b_k) = x$, and $\sigma^{B}_{R3}(x) = f_i$, we have:
\[ \mathrm{Appeal}^{\sigma^{B}_{R3}}(r_j, c_k) = -(10n+4)2^{k-1} + \mathrm{Val}^{\sigma^{B}_{R3}}(f_i). \]
On the other hand, since $\sigma^{B}_{R3}(r_j) = c_i$ and $\sigma^{B}_{R3}(c_i) = f_i$, we have:
\[ \mathrm{Val}^{\sigma^{B}_{R3}}(r_j) = 4n + \mathrm{Val}^{\sigma^{B}_{R3}}(f_i). \]
Therefore, the actions $(r_j, c_k)$ with $k < i$ are not switchable in $\sigma^{B}_{R3}$. Proposition 5.31 and Proposition 5.2 then imply that $\mathrm{Appeal}^{\sigma^{B}_{R3}}(r_j, c_i) > \mathrm{Appeal}^{\sigma^{B}_{R3}}(r_j, c_k)$ for every $k > i$. Therefore, greedy strategy improvement will not switch away from the action $(r_j, c_i)$ in the strategy $\sigma^{B}_{R3}$.
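As a worked step for the case $k < i$, the two displayed expressions can be subtracted directly. This uses only the formulas stated above, together with the convention used throughout this section that an action is switchable exactly when its appeal exceeds the current value of the vertex:
\[ \mathrm{Val}^{\sigma^{B}_{R3}}(r_j) - \mathrm{Appeal}^{\sigma^{B}_{R3}}(r_j, c_k) = \bigl(4n + \mathrm{Val}^{\sigma^{B}_{R3}}(f_i)\bigr) - \bigl(-(10n+4)2^{k-1} + \mathrm{Val}^{\sigma^{B}_{R3}}(f_i)\bigr) = 4n + (10n+4)2^{k-1} > 0. \]
The common term $\mathrm{Val}^{\sigma^{B}_{R3}}(f_i)$ cancels, so the appeal of $(r_j, c_k)$ is strictly below the current value of $r_j$.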
We will now consider the vertices $c_j$ with $j \notin B'$ and $j < i$. At these vertices we must argue that greedy strategy improvement switches the action $(c_j, r_j)$. Since $\sigma^{B}_{R3}(c_j) = f_j$, $\sigma^{B}_{R3}(b_j) = x$, and $\sigma^{B}_{R3}(x) = f_i$, we have:
\[ \mathrm{Val}^{\sigma^{B}_{R3}}(c_j) = 1 - (10n+4)2^{j-1} + \mathrm{Val}^{\sigma^{B}_{R3}}(f_i). \]
On the other hand, since $\sigma^{B}_{R3}(r_j) = c_i$ and $\sigma^{B}_{R3}(c_i) = f_i$, we have:
\[ \mathrm{Appeal}^{\sigma^{B}_{R3}}(c_j, r_j) = -1 + 4n + 1 + \mathrm{Val}^{\sigma^{B}_{R3}}(f_i). \]
Therefore, we have $\mathrm{Appeal}^{\sigma^{B}_{R3}}(c_j, r_j) > \mathrm{Val}^{\sigma^{B}_{R3}}(c_j)$, which implies that greedy strategy improvement will switch the action $(c_j, r_j)$ at the vertex $c_j$ in the strategy $\sigma^{B}_{R3}$.
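The comparison above can also be checked numerically. The following is a minimal sketch, not part of the construction itself: it encodes the two displayed expressions as offsets from the common term Val(f_i), which cancels, and tests the inequality for small parameters. All function names are hypothetical and chosen only for this illustration.

```python
# Sanity check of Appeal(c_j, r_j) > Val(c_j) for j not in B' and j < i.
# Both quantities are written as offsets from Val(f_i), so that term cancels.
# Function names are hypothetical; only the two formulas come from the text.


def val_c_offset(n: int, j: int) -> int:
    """Val(c_j) - Val(f_i) = 1 - (10n + 4) * 2^(j-1)."""
    return 1 - (10 * n + 4) * 2 ** (j - 1)


def appeal_c_r_offset(n: int) -> int:
    """Appeal(c_j, r_j) - Val(f_i) = -1 + 4n + 1."""
    return -1 + 4 * n + 1


def is_switchable(appeal_offset: int, val_offset: int) -> bool:
    """An action is switchable when its appeal exceeds the current value."""
    return appeal_offset > val_offset


def check_all(max_n: int) -> bool:
    """Check the inequality for every 1 <= j < n with 2 <= n <= max_n."""
    for n in range(2, max_n + 1):
        for j in range(1, n):  # j < i <= n, so j ranges over 1..n-1
            if not is_switchable(appeal_c_r_offset(n), val_c_offset(n, j)):
                return False
    return True


if __name__ == "__main__":
    print(check_all(10))
```

The check succeeds for every tested pair because the appeal offset is $4n > 0$ while the value offset is negative; the loop confirms the arithmetic but does not replace the proof.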
Now we will consider the vertices $c_j$ with $j \notin B'$ and $j > i$. At these vertices we must argue that greedy strategy improvement does not switch away from the action $(c_j, r_j)$. We will do this by arguing that the action $(c_j, f_j)$ is not switchable. As we have argued previously, we have:
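\[ \mathrm{Appeal}^{\sigma^{B}_{R3}}(c_j, f_j) = 1 - (10n+4)2^{j-1} + \mathrm{Val}^{\sigma^{B}_{R3}}(f_i) = (10n+4)(2^{i} - 2^{i-1} - 2^{j-1}) + \mathrm{Val}^{\sigma^{B}_{R3}}(c_{\min(B'^{>i} \cup \{n+1\})}). \]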
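We can then apply Proposition 5.31 and Proposition 5.4 to show:
\[ \mathrm{Appeal}^{\sigma^{B}_{R3}}(c_j, f_j) \leq (10n+4)(2^{i} - 2^{i-1} - 2^{j-1} - 2^{i} + 2^{j-1}) + \mathrm{Val}^{\sigma^{B}_{R3}}(c_j) < \mathrm{Val}^{\sigma^{B}_{R3}}(c_j). \]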
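Therefore, the action $(c_j, f_j)$ is not switchable in the strategy $\sigma^{B}_{R3}$, and greedy strategy improvement does not switch away from the action $(c_j, r_j)$.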
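The strict inequality in the last bound follows from a cancellation in the exponent terms, which can be spelled out as a worked step:
\[ 2^{i} - 2^{i-1} - 2^{j-1} - 2^{i} + 2^{j-1} = -2^{i-1}. \]
The bound therefore reads $\mathrm{Appeal}^{\sigma^{B}_{R3}}(c_j, f_j) \leq -(10n+4)2^{i-1} + \mathrm{Val}^{\sigma^{B}_{R3}}(c_j)$, and since $(10n+4)2^{i-1} > 0$ the right-hand side is strictly smaller than $\mathrm{Val}^{\sigma^{B}_{R3}}(c_j)$.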
We will now consider the vertex $x$, where we must show that the most appealing action is $(x, f_i)$. We will begin by showing that the actions $(x, f_j)$ where $j \notin B'$ are not switchable in $\sigma^{B}_{R3}$. Since $\sigma^{B}_{R3}(b_j) = x$, we have:
\[ \mathrm{Appeal}^{\sigma^{B}_{R3}}(x, f_j) = -(10n+4)2^{j-1} - 4n + \mathrm{Val}^{\sigma^{B}_{R3}}(x) < \mathrm{Val}^{\sigma^{B}_{R3}}(x). \]
Therefore, the actions $(x, f_j)$ where $j \notin B'$ are not switchable in $\sigma^{B}_{R3}$. For the actions $(x, f_j)$ where $j \in B'$ and $j \neq i$, Proposition 5.31 and Proposition 5.19 imply that $\mathrm{Appeal}^{\sigma^{B}_{R3}}(x, f_i) > \mathrm{Appeal}^{\sigma^{B}_{R3}}(x, f_j)$, which implies that greedy strategy improvement will not switch away from the action $(x, f_i)$.
We can apply the same reasoning for the vertex $y$, where we must show that $(y, c_i)$ is the most appealing action. For the actions $(y, c_j)$ with $j \notin B'$ we have:
\[ \mathrm{Appeal}^{\sigma^{B}_{R3}}(y, c_j) = -(10n+4)2^{j-1} + 1 + \mathrm{Val}^{\sigma^{B}_{R3}}(x) < \mathrm{Val}^{\sigma^{B}_{R3}}(x) + 4n + 1. \]
Since $\sigma^{B}_{R3}(y) = c_i$ and $\sigma^{B}_{R3}(x) = f_i$, we must have $\mathrm{Val}^{\sigma^{B}_{R3}}(y) = \mathrm{Val}^{\sigma^{B}_{R3}}(x) + 4n + 1$.
Therefore we have shown that the actions $(y, c_j)$ with $j \notin B'$ are not switchable in the strategy $\sigma^{B}_{R3}$. Proposition 5.31 and Proposition 5.2 imply that $\mathrm{Appeal}^{\sigma^{B}_{R3}}(y, c_i) > \mathrm{Appeal}^{\sigma^{B}_{R3}}(y, c_j)$ for every $j \in B'$ such that $j \neq i$. Therefore greedy strategy improvement will not switch away from the action $(y, c_i)$ in the strategy $\sigma^{B}_{R3}$.
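For the case $j \notin B'$, the gap between the current value of $y$ and the appeal of $(y, c_j)$ can be written out explicitly. This worked step uses only the two expressions stated above:
\[ \mathrm{Val}^{\sigma^{B}_{R3}}(y) - \mathrm{Appeal}^{\sigma^{B}_{R3}}(y, c_j) = \bigl(\mathrm{Val}^{\sigma^{B}_{R3}}(x) + 4n + 1\bigr) - \bigl(-(10n+4)2^{j-1} + 1 + \mathrm{Val}^{\sigma^{B}_{R3}}(x)\bigr) = 4n + (10n+4)2^{j-1} > 0. \]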
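We now consider the vertices $d_k$ for all $k$. For every vertex $d_k$, we must show that $(d_k, y)$ is the most appealing action. For the action $(d_k, d_{k-1})$ we have:
\[ \mathrm{Appeal}^{\sigma^{B}_{R3}}(d_k, d_{k-1}) = \mathrm{Val}^{\sigma^{B}_{R3}}(x) - 1 < \mathrm{Val}^{\sigma^{B}_{R3}}(x) = \mathrm{Appeal}^{\sigma^{B}_{R3}}(d_k, x). \]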
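For the action $(d_k, x)$, the fact that $r(d_k, y) = r(d_k, x)$ implies:
\[ \mathrm{Appeal}^{\sigma^{B}_{R3}}(d_k, x) = r(d_k, x) + \mathrm{Val}^{\sigma^{B}_{R3}}(f_i) < r(d_k, y) + 4n + 1 + \mathrm{Val}^{\sigma^{B}_{R3}}(f_i) = \mathrm{Appeal}^{\sigma^{B}_{R3}}(d_k, y). \]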
Finally, we consider the vertices $b_j$. We begin with the case when $j \notin B'$. In this case we must show that $(b_j, y)$ is the most appealing action at the vertex $b_j$. We will first argue that the action $(b_j, x)$ is not switchable. Since $\sigma^{B}_{R3}(x) = f_i$, $\sigma^{B}_{R3}(y) = c_i$, and $\sigma^{B}_{R3}(c_i) = f_i$, we have the following two equalities:
\[ \mathrm{Val}^{\sigma^{B}_{R3}}(x) = \mathrm{Val}^{\sigma^{B}_{R3}}(f_i) \]
\[ \mathrm{Val}^{\sigma^{B}_{R3}}(y) = \mathrm{Val}^{\sigma^{B}_{R3}}(f_i) + 4n + 1 \]
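Therefore, we must have $\mathrm{Appeal}^{\sigma^{B}_{R3}}(b_j, x) < \mathrm{Appeal}^{\sigma^{B}_{R3}}(b_j, y)$. These two equalities can also be used to prove that the actions of the form $(b_j, d_k)$ are not switchable. This is because we have $\mathrm{Appeal}^{\sigma^{B}_{R3}}(b_j, d_k) \leq 4n + \mathrm{Val}^{\sigma^{B}_{R3}}(x)$ and we have $\mathrm{Appeal}^{\sigma^{B}_{R3}}(b_j, y) = 4n + 2 + \mathrm{Val}^{\sigma^{B}_{R3}}(x)$.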
We now consider the actions of the form $(b_j, f_k)$. We will first prove that the actions $(b_j, f_k)$ where $k \notin B'$ are not switched by greedy strategy improvement. Since $k \notin B'$, we have:
\[ \mathrm{Appeal}^{\sigma^{B}_{R3}}(b_j, f_k) = -(10n+4)2^{k-1} + 1 + \mathrm{Val}^{\sigma^{B}_{R3}}(x) < \mathrm{Val}^{\sigma^{B}_{R3}}(x) + 4n + 2 = \mathrm{Appeal}^{\sigma^{B}_{R3}}(b_j, y). \]
Therefore, these actions will not be switched by greedy strategy improvement in the strategy $\sigma^{B}_{R3}$. We now consider the actions $(b_j, f_k)$ where $k \in B'$. We will prove that these actions are not switchable in $\sigma^{B}_{R3}$. The appeal of the action $(b_j, f_k)$ is:
\[ \mathrm{Appeal}^{\sigma^{B}_{R3}}(b_j, f_k) = 4n + 1 + \mathrm{Val}^{\sigma^{B}_{R3}}(f_k). \]
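On the other hand, the appeal of the action $(b_j, y)$ can be expressed as:
\[ \mathrm{Appeal}^{\sigma^{B}_{R3}}(b_j, y) = (10n+4)(2^{i} - 2^{i-1}) - 4n + \mathrm{Val}^{\sigma^{B}_{R3}}(c_{\min(B'^{>i} \cup \{n+1\})}). \]
We can then use Proposition 5.31 and Proposition 5.4 to conclude:
\[ \begin{aligned} \mathrm{Appeal}^{\sigma^{B}_{R3}}(b_j, y) &= (10n+4)(2^{i} - 2^{i-1} - 2^{i} + 2^{k-1}) - 4n + \mathrm{Val}^{\sigma^{B}_{R3}}(c_k) \\ &= (10n+4)(2^{k-1} - 2^{i-1}) - 4n + \mathrm{Val}^{\sigma^{B}_{R3}}(c_k) \\ &= (10n+4)(2^{k-1} - 2^{i-1}) + 1 + \mathrm{Val}^{\sigma^{B}_{R3}}(f_k) \\ &> 4n + 1 + \mathrm{Val}^{\sigma^{B}_{R3}}(f_k). \end{aligned} \]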
Therefore, the action $(b_j, f_k)$ will not be switched by greedy strategy improvement. Finally, we consider the action $a_j$. We will first argue that $\mathrm{Appeal}^{\sigma^{B}_{R3}}(b_j, y) > \mathrm{Val}^{\sigma^{B}_{R3}}(b_j) + 1$. This holds because we have $\mathrm{Val}^{\sigma^{B}_{R3}}(b_j) = \mathrm{Val}^{\sigma^{B}_{R3}}(f_i)$, and we have $\mathrm{Appeal}^{\sigma^{B}_{R3}}(b_j, y) = 4n + 2 + \mathrm{Val}^{\sigma^{B}_{R3}}(f_i)$. On the other hand, Proposition 5.31 and Proposition 5.10 imply that $\mathrm{Appeal}^{\sigma^{B}_{R3}}(b_j, a_j) < \mathrm{Val}^{\sigma^{B}_{R3}}(b_j) + 1$. Therefore, greedy strategy improvement will not switch the action $a_j$.
To complete the proof of Proposition 5.30, we will show that greedy strategy improvement does not switch away from the action $a_j$ at any vertex $b_j$ with $j \in B'$. We can use exactly the same arguments as we used for the vertices $b_j$ with $j \notin B'$ to argue that $\mathrm{Appeal}^{\sigma^{B}_{R3}}(b_j, y) > \mathrm{Appeal}^{\sigma^{B}_{R3}}(b_j, a)$ for every action $a \in \{(b_j, d_k) : 1 \leq k \leq n\} \cup \{(b_j, f_k) : j < k \leq n\} \cup \{(b_j, x)\}$. Therefore, we can prove the claim by showing that $\mathrm{Appeal}^{\sigma^{B}_{R3}}(b_j, y) < \mathrm{Val}^{\sigma^{B}_{R3}}(b_j)$. As we have done previously, we can use Proposition 5.31 and Proposition 5.4 to obtain the following characterisation for the appeal of the action $(b_j, y)$:
\[ \begin{aligned} \mathrm{Appeal}^{\sigma^{B}_{R3}}(b_j, y) &= (10n+4)(2^{i} - 2^{i-1} - 2^{i} + 2^{j-1}) - 4n + \mathrm{Val}^{\sigma^{B}_{R3}}(c_j) \\ &= (10n+4)(2^{j-1} - 2^{i-1}) - 4n + \mathrm{Val}^{\sigma^{B}_{R3}}(c_j) \\ &= -(10n+4)2^{i-1} - 4n + 1 + \mathrm{Val}^{\sigma^{B}_{R3}}(b_j) \\ &< \mathrm{Val}^{\sigma^{B}_{R3}}(b_j). \end{aligned} \]
Therefore, greedy strategy improvement will not switch away from the action $a_j$ at the vertex $b_j$ in the strategy $\sigma^{B}_{R3}$.
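The final strict inequality can be made explicit as a worked step. Assuming $n \geq 1$ and $i \geq 1$ (the ranges implicit in the construction), the last line of the characterisation gives:
\[ \mathrm{Val}^{\sigma^{B}_{R3}}(b_j) - \mathrm{Appeal}^{\sigma^{B}_{R3}}(b_j, y) = (10n+4)2^{i-1} + 4n - 1 > 0, \]
since under these assumptions $(10n+4)2^{i-1} \geq 14$ and $4n - 1 \geq 3$.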