Breaking The Example Down - The Flip Phase

Chapter 5 Greedy Strategy Improvement For Markov Decision Pro-

5.3 The Flip Phase

5.3.1 Breaking The Example Down

In order to simplify our exposition, we will provide proofs for each gadget separately. To do this, we first need to provide some preliminary proofs that will be used to break the example down into pieces. These proofs concern the valuation of the vertices $c_i$. These vertices are important, because they are on the border between the gadgets in the example. For example, we know that the vertex $y$ in the deceleration lane always chooses an action of the form $(y, c_i)$, and by knowing bounds on the valuation of $c_i$ we will be able to prove properties of the deceleration lane irrespective of the strategy that is being played on the vertices in the reset structure.

Recall that for each improvable configuration $B$, the strategy $\sigma^B_j$ chooses the action $(c_i, f_i)$ if and only if $i \in B$. This implies that if we follow the strategy $\sigma^B_j$ from some vertex $c_i$ where $i \in B$, then we will pass through every vertex $c_k$ where $k \in B^{>i}$. On the other hand, if we follow the strategy $\sigma^B_j$ from some vertex $c_i$ where $i \notin B$, then we will move to the vertex $r_i$, and then to the vertex $c_k$ where $k = \min(B^{>i} \cup \{n+1\})$. The next proposition uses these facts to give a formula for the valuation of each vertex $c_i$ in terms of the configuration $B$.

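Proposition 5.2. Let $B$ be an improvable configuration and $\sigma$ be a member of $\mathrm{Sequence}(B)$. For every $i$ we have:

\[
\mathrm{Val}^{\sigma}(c_i) =
\begin{cases}
\sum_{j \in B^{\geq i}} (10n+4)(2^j - 2^{j-1}) & \text{if } i \in B, \\
\sum_{j \in B^{\geq i}} (10n+4)(2^j - 2^{j-1}) - 1 & \text{otherwise.}
\end{cases}
\]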

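Proof. We first consider the case where $i \in B$. If $k = \min(B^{>i} \cup \{n+1\})$ then the definition of $\sigma$ gives:

\[
\begin{aligned}
\mathrm{Val}^{\sigma}(c_i) &= r(c_i, f_i) + r(f_i, b_i) + \mathrm{Val}^{\sigma}(b_i) \\
&= r(c_i, f_i) + r(f_i, b_i) + r(g_i, r_i) + r(r_i, c_k) + \mathrm{Val}^{\sigma}(c_k) \\
&= (4n+1) - \bigl((10n+4)2^{i-1} + 4n\bigr) + (10n+4)2^{i} - 1 + \mathrm{Val}^{\sigma}(c_k) \\
&= (10n+4)(2^i - 2^{i-1}) + \mathrm{Val}^{\sigma}(c_k).
\end{aligned}
\]

Repeated substitution of the above expression for $\mathrm{Val}^{\sigma}(c_k)$ gives:

\[
\mathrm{Val}^{\sigma}(c_i) = \sum_{j \in B^{\geq i}} (10n+4)(2^j - 2^{j-1}) + \mathrm{Val}^{\sigma}(c_{n+1}) = \sum_{j \in B^{\geq i}} (10n+4)(2^j - 2^{j-1}).
\]

We now consider the case where $i \notin B$. If $k = \min(B^{>i} \cup \{n+1\})$, then the definition of $\sigma$ gives:

\[
\mathrm{Val}^{\sigma}(c_i) = r(c_i, r_i) + r(r_i, c_k) + \mathrm{Val}^{\sigma}(c_k) = \mathrm{Val}^{\sigma}(c_k) - 1 = \sum_{j \in B^{\geq i}} (10n+4)(2^j - 2^{j-1}) - 1.
\]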
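The two cases of the proof can be cross-checked numerically. The following is a minimal sketch, not part of the original text: it compares the closed form of Proposition 5.2 against a direct unrolling of the two recurrences used in the proof. The function names are hypothetical, the valuation of $c_{n+1}$ is assumed to be 0 (as the final equality of the first case requires), and the check ranges over all subsets $B \subseteq \{1, \dots, n\}$ rather than only the improvable configurations.

```python
from itertools import combinations


def val_c(i, B, n):
    """Closed form of Proposition 5.2 for Val(c_i), with 1 <= i <= n."""
    total = sum((10 * n + 4) * (2**j - 2**(j - 1)) for j in B if j >= i)
    return total if i in B else total - 1


def val_c_by_path(i, B, n):
    """Val(c_i) obtained by unrolling the recurrences from the proof.

    Assumption: the walk ends at c_{n+1}, whose valuation is taken to be 0.
    """
    if i == n + 1:
        return 0
    k = min([j for j in B if j > i] + [n + 1])  # k = min(B^{>i} ∪ {n+1})
    if i in B:
        # Case i in B: the pass through the bit gadget adds (10n+4)(2^i - 2^{i-1}).
        return (10 * n + 4) * (2**i - 2**(i - 1)) + val_c_by_path(k, B, n)
    # Case i not in B: the detour through r_i costs 1.
    return val_c_by_path(k, B, n) - 1


def closed_form_matches(n):
    """Compare both computations for every subset B of {1..n} and every i."""
    for size in range(n + 1):
        for B in map(set, combinations(range(1, n + 1), size)):
            for i in range(1, n + 1):
                if val_c(i, B, n) != val_c_by_path(i, B, n):
                    return False
    return True
```

For instance, with $n = 3$ and $B = \{1, 3\}$ the closed form gives $170$, $135$ and $136$ for $c_1$, $c_2$ and $c_3$ respectively.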

The characterisation given by Proposition 5.2 gives some important properties about the valuation of the vertex $c_i$. One obvious property is that passing through a vertex $b_i$, where $i \in B$, provides a positive reward. Therefore, if $i$ and $j$ are both indices in $B$ and $i < j$, then the valuation of $c_i$ will be larger than the valuation of $c_j$.

Proposition 5.3. Let $B$ be an improvable configuration and $\sigma$ be a member of $\mathrm{Sequence}(B)$. For every $i \in B$ and $j \in B$ such that $i < j$, we have $\mathrm{Val}^{\sigma}(c_i) > \mathrm{Val}^{\sigma}(c_j)$.

Proof. Let $C = B^{\geq i} \cap B^{<j}$ be the members of $B$ that lie between indices $i$ and $j-1$. Proposition 5.2, the fact that $i \in C$, and the fact that $2^k - 2^{k-1}$ is positive for every $k$ imply that:

\[
\mathrm{Val}^{\sigma}(c_i) = \sum_{k \in C} (10n+4)(2^k - 2^{k-1}) + \mathrm{Val}^{\sigma}(c_j) > \mathrm{Val}^{\sigma}(c_j).
\]

Proposition 5.3 gives a lower bound on the valuation of a vertex $c_i$ in terms of a vertex $c_j$ with $j > i$. Some of our proofs will also require a corresponding upper bound. The next proposition provides such a bound.

Proposition 5.4. Let $B$ be an improvable configuration and $\sigma$ be a member of $\mathrm{Sequence}(B)$. For every $i \in B$ and $j \in B$ such that $j > i$, we have:

\[
\mathrm{Val}^{\sigma}(c_i) \leq \mathrm{Val}^{\sigma}(c_j) + (10n+4)(2^{j-1} - 2^{i-1}).
\]

Proof. Let $C = B^{\geq i} \cap B^{<j}$ be the members of $B$ that lie between indices $i$ and $j-1$. Using Proposition 5.2 gives:

\[
\mathrm{Val}^{\sigma}(c_i) = \sum_{k \in B^{\geq i}} (10n+4)(2^k - 2^{k-1}) = \sum_{k \in C} (10n+4)(2^k - 2^{k-1}) + \mathrm{Val}^{\sigma}(c_j).
\]

We use the fact that $(10n+4)(2^k - 2^{k-1}) > 0$ for all $k$ and the fact that $i > 0$ to obtain:

\[
\sum_{k \in C} (10n+4)(2^k - 2^{k-1}) \leq \sum_{k=i}^{j-1} (10n+4)(2^k - 2^{k-1}) = (10n+4)(2^{j-1} - 2^{i-1}).
\]

Therefore, we have:

\[
\mathrm{Val}^{\sigma}(c_i) \leq \mathrm{Val}^{\sigma}(c_j) + (10n+4)(2^{j-1} - 2^{i-1}).
\]
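The lower bound of Proposition 5.3 and the upper bound of Proposition 5.4 can be checked exhaustively for small $n$. The sketch below is illustrative only: the function names are hypothetical, the valuation of $c_i$ is taken directly from the closed form of Proposition 5.2, and every subset $B \subseteq \{1, \dots, n\}$ is tested, which is an assumption that ignores the restriction to improvable configurations.

```python
from itertools import combinations


def val_c(i, B, n):
    """Closed form of Proposition 5.2 for Val(c_i)."""
    total = sum((10 * n + 4) * (2**j - 2**(j - 1)) for j in B if j >= i)
    return total if i in B else total - 1


def bounds_hold(n):
    """Check Val(c_j) < Val(c_i) <= Val(c_j) + (10n+4)(2^{j-1} - 2^{i-1})
    for every subset B of {1..n} and every pair i < j with i, j in B."""
    for size in range(n + 1):
        for B in map(set, combinations(range(1, n + 1), size)):
            for i in B:
                for j in B:
                    if i >= j:
                        continue
                    lower = val_c(j, B, n)  # Proposition 5.3 (strict)
                    upper = lower + (10 * n + 4) * (2**(j - 1) - 2**(i - 1))  # Proposition 5.4
                    if not (lower < val_c(i, B, n) <= upper):
                        return False
    return True
```

The upper bound is tight exactly when every index between $i$ and $j-1$ belongs to $B$, since then the sum over $C$ coincides with the telescoping sum over $k = i, \dots, j-1$.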

5.3.2 The Deceleration Lane

In this section we will prove that the deceleration lane behaves as we have described. In particular, we will provide a proof of the following proposition.

Proposition 5.5. [...] then applying greedy strategy improvement to $\sigma_0$ produces the sequence of strategies $\langle \sigma_0, \sigma_1, \dots, \sigma_{2n} \rangle$.

This proposition may seem strange at first sight, because each strategy $\sigma_i$ is a partial strategy that is defined only for the vertices of the deceleration lane, and strategy improvement works only with full strategies. However, since the vertices $x$ and $y$ are the only vertices at which it is possible to leave the deceleration lane, placing an assumption on the valuations of these vertices will allow us to prove how greedy strategy improvement behaves on the deceleration lane. These proofs will hold irrespective of the decisions that are made outside the deceleration lane. In other words, if greedy strategy improvement arrives at a strategy $\sigma$ that is consistent with $\sigma_0$ on the vertices in the deceleration lane, and if $\mathrm{Val}^{\sigma}(y) > \mathrm{Val}^{\sigma}(x)$, then Proposition 5.5 implies that greedy strategy improvement will move to a strategy $\sigma'$ that is consistent with $\sigma_1$ on the vertices in the deceleration lane. This approach allows us to prove properties of the deceleration lane without having to worry about the behaviour of greedy strategy improvement elsewhere in the example.

Of course, this approach will only work if we can prove that $\mathrm{Val}^{\sigma}(y) > \mathrm{Val}^{\sigma}(x)$. The next proposition confirms that this property holds for every strategy in $\mathrm{Sequence}(B)$.

Proposition 5.6. For every improvable configuration $B$ and every strategy $\sigma$ in $\mathrm{Sequence}(B)$ we have $\mathrm{Val}^{\sigma}(y) > \mathrm{Val}^{\sigma}(x)$.

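Proof. By the definition of $\sigma$, there is some index $i \in B$ such that $\sigma(y) = c_i$ and $\sigma(x) = f_i$. Moreover, since $i \in B$ we have that $\sigma(c_i) = f_i$. We therefore have the following two equalities:

\[
\begin{aligned}
\mathrm{Val}^{\sigma}(y) &= r(y, c_i) + r(c_i, f_i) + \mathrm{Val}^{\sigma}(f_i) = \mathrm{Val}^{\sigma}(f_i) + 4n + 1, \\
\mathrm{Val}^{\sigma}(x) &= r(x, f_i) + \mathrm{Val}^{\sigma}(f_i) = \mathrm{Val}^{\sigma}(f_i).
\end{aligned}
\]

Since $4n + 1 > 0$, these equalities give $\mathrm{Val}^{\sigma}(y) = \mathrm{Val}^{\sigma}(x) + 4n + 1 > \mathrm{Val}^{\sigma}(x)$.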

We now turn our attention to proving that greedy strategy improvement moves from the strategy $\sigma_i$ to the strategy $\sigma_{i+1}$. To prove properties of greedy strategy improvement, we must know the appeal of every action leaving the vertices $d_j$. The next proposition gives a characterisation for the appeal of the actions $(d_j, d_{j-1})$ for each strategy $\sigma_i$.

Proposition 5.7. For each strategy $\sigma_i$ we have:

\[
\mathrm{Appeal}^{\sigma_i}(d_j, d_{j-1}) =
\begin{cases}
\mathrm{Val}^{\sigma_i}(y) + 4n - j + 1 & \text{if } 1 \leq j \leq i+1, \\
\mathrm{Val}^{\sigma_i}(y) - 1 & \text{if } i+1 < j \leq 2n.
\end{cases}
\]

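Proof. We begin by considering the case where $1 \leq j \leq i+1$. By the definition of $\sigma_i$ we have that $\sigma_i(d_j) = d_{j-1}$ for all vertices $d_j$ with $1 \leq j \leq i$, and we have that $\sigma_i(d_0) = y$. Using the definition of appeal, and applying the optimality equation (5.1) repeatedly gives, for every action $(d_j, d_{j-1})$ with $1 \leq j \leq i+1$:

\[
\begin{aligned}
\mathrm{Appeal}^{\sigma_i}(d_j, d_{j-1}) &= r(d_j, d_{j-1}) + \mathrm{Val}^{\sigma_i}(d_{j-1}) \\
&= \sum_{k=1}^{j} r(d_k, d_{k-1}) + r(d_0, y) + \mathrm{Val}^{\sigma_i}(y) \\
&= -j + (4n+1) + \mathrm{Val}^{\sigma_i}(y).
\end{aligned}
\]

We now consider the case where $i+1 < j \leq 2n$. By definition we have that $\sigma_i(d_{j-1}) = y$. Using the definition of appeal and the optimality equation gives:

\[
\begin{aligned}
\mathrm{Appeal}^{\sigma_i}(d_j, d_{j-1}) &= r(d_j, d_{j-1}) + \mathrm{Val}^{\sigma_i}(d_{j-1}) \\
&= r(d_j, d_{j-1}) + r(d_{j-1}, y) + \mathrm{Val}^{\sigma_i}(y) \\
&= \mathrm{Val}^{\sigma_i}(y) - 1.
\end{aligned}
\]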
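The closed form of Proposition 5.7 can be compared with a direct evaluation of the lane. The sketch below is an illustration under stated assumptions rather than the construction itself: the rewards $r(d_k, d_{k-1}) = -1$, $r(d_0, y) = 4n+1$ and $r(d_j, y) = 0$ for $j \geq 1$ are read off from the arithmetic in the proof above, $\mathrm{Val}^{\sigma_i}(y)$ is treated as a given number, and all function names are hypothetical.

```python
def lane_values(i, n, val_y):
    """Val(d_j) for j = 0..2n under sigma_i, with Val(y) given.

    sigma_i sends d_0 to y, d_j to d_{j-1} for 1 <= j <= i, and d_j to y otherwise.
    """
    vals = [val_y + 4 * n + 1]  # d_0 -> y with reward 4n+1 (inferred from the proof)
    for j in range(1, 2 * n + 1):
        if j <= i:
            vals.append(vals[j - 1] - 1)  # d_j -> d_{j-1} with reward -1
        else:
            vals.append(val_y)  # d_j -> y with reward 0
    return vals


def appeal_down(j, i, n, val_y):
    """Appeal of (d_j, d_{j-1}): r(d_j, d_{j-1}) + Val(d_{j-1})."""
    return -1 + lane_values(i, n, val_y)[j - 1]


def appeal_closed_form(j, i, n, val_y):
    """Right-hand side of Proposition 5.7."""
    if 1 <= j <= i + 1:
        return val_y + 4 * n - j + 1
    return val_y - 1


def proposition_5_7_holds(n, val_y):
    """Compare both computations for every sigma_i and every j in 1..2n."""
    return all(
        appeal_down(j, i, n, val_y) == appeal_closed_form(j, i, n, val_y)
        for i in range(2 * n + 1)
        for j in range(1, 2 * n + 1)
    )
```

Because the appeal depends on the rest of the example only through $\mathrm{Val}^{\sigma_i}(y)$, the check can be run with any value for `val_y`.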

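Proposition 5.7 confirms that an action $(d_j, d_{j-1})$ is switchable only after [...]. It can clearly be seen that the only action of this form that is switchable in the strategy $\sigma_i$ is the action $(d_{i+1}, d_i)$. Every other action of this form has either already been switched, or has appeal $\mathrm{Val}^{\sigma_i}(y) - 1$, which is lower than the valuation $\mathrm{Val}^{\sigma_i}(y)$ that $d_j$ obtains from its current action $(d_j, y)$.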

We must also consider the actions $(d_j, x)$. The next proposition confirms that these actions will not be switchable in the strategy $\sigma_i$.

Proposition 5.8. If $\mathrm{Val}^{\sigma_i}(x) < \mathrm{Val}^{\sigma_i}(y)$ then $\mathrm{Appeal}^{\sigma_i}(d_j, x) < \mathrm{Appeal}^{\sigma_i}(d_j, y)$ for all $j$.

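Proof. Using the definition of appeal for the vertex $d_j$ gives two equalities:

\[
\begin{aligned}
\mathrm{Appeal}^{\sigma_i}(d_j, y) &= r(d_j, y) + \mathrm{Val}^{\sigma_i}(y), \\
\mathrm{Appeal}^{\sigma_i}(d_j, x) &= r(d_j, x) + \mathrm{Val}^{\sigma_i}(x).
\end{aligned}
\]

Observe that for all $j$ we have $r(d_j, y) = r(d_j, x)$. Therefore we can conclude that when $\mathrm{Val}^{\sigma_i}(x) < \mathrm{Val}^{\sigma_i}(y)$ we have $\mathrm{Appeal}^{\sigma_i}(d_j, x) < \mathrm{Appeal}^{\sigma_i}(d_j, y)$.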

In summary, Proposition 5.7 has shown that there is exactly one action of the form $(d_j, d_{j-1})$ that is switchable in $\sigma_i$, and that this action is $(d_{i+1}, d_i)$. Proposition 5.8 has shown that no action of the form $(d_j, x)$ is switchable in $\sigma_i$. It is obvious that no action of the form $(d_j, y)$ is switchable in $\sigma_i$. Therefore, greedy strategy improvement will switch the action $(d_{i+1}, d_i)$ in the strategy $\sigma_i$, which creates the strategy $\sigma_{i+1}$. Therefore, we have shown Proposition 5.5.
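The argument summarised above can be replayed as a small simulation. This is a sketch under several assumptions, none of which come from the original text verbatim: the valuations of $y$ and $x$ are held fixed at arbitrary numbers with $\mathrm{Val}(y) > \mathrm{Val}(x)$, the rewards are those inferred from the proofs of Propositions 5.7 and 5.8 ($-1$ on each $(d_j, d_{j-1})$, $4n+1$ on both exits of $d_0$, and $0$ on the exits of every other $d_j$), and the greedy rule is taken to switch, at every vertex that has a switchable action, to an action of maximal appeal. The names `action_appeal` and `greedy_lane_run` are hypothetical.

```python
def action_appeal(j, target, val, n, val_y, val_x):
    """Appeal of the action from d_j to `target`, given lane valuations `val`."""
    exit_reward = 4 * n + 1 if j == 0 else 0  # rewards inferred from the proofs
    if target == "y":
        return exit_reward + val_y
    if target == "x":
        return exit_reward + val_x
    return -1 + val[j - 1]  # target == "down", i.e. the action (d_j, d_{j-1})


def greedy_lane_run(n, val_y, val_x):
    """Run greedy switching on d_0..d_{2n} and return the strategies visited."""
    m = 2 * n
    sigma = ["y"] * (m + 1)  # sigma_0: every lane vertex moves to y
    history = [tuple(sigma)]
    while True:
        # Valuations under the current strategy, computed from d_0 upwards.
        val = []
        for j in range(m + 1):
            val.append(action_appeal(j, sigma[j], val, n, val_y, val_x))
        new_sigma = list(sigma)
        for j in range(m + 1):
            targets = ["y", "x"] + (["down"] if j > 0 else [])
            best = max(targets, key=lambda t: action_appeal(j, t, val, n, val_y, val_x))
            # An action is switchable only if its appeal beats the current valuation.
            if action_appeal(j, best, val, n, val_y, val_x) > val[j]:
                new_sigma[j] = best
        if new_sigma == sigma:
            return history
        sigma = new_sigma
        history.append(tuple(sigma))
```

Under these assumptions the run visits $2n+1$ strategies, and the $i$-th one sends $d_1, \dots, d_i$ down the lane and every other vertex to $y$, matching $\langle \sigma_0, \sigma_1, \dots, \sigma_{2n} \rangle$. In the full example the valuations of $x$ and $y$ change as the other gadgets evolve, so this only illustrates the behaviour inside the lane.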

In document Strategy iteration algorithms for games and Markov decision processes (Page 139-145)