
Chapter 5 Greedy Strategy Improvement For Markov Decision Pro-

5.3 The Flip Phase

5.3.1 Breaking The Example Down

In order to simplify our exposition, we will provide proofs for each gadget separately. To do this, we first need to provide some preliminary proofs that will be used to break the example down into pieces. These proofs concern the valuation of the vertices $c_i$. These vertices are important because they lie on the border between the gadgets in the example. For example, we know that the vertex $y$ in the deceleration lane always chooses an action of the form $(y, c_i)$, and by knowing bounds on the valuation of $c_i$ we will be able to prove properties of the deceleration lane irrespective of the strategy that is being played on the vertices in the reset structure.

Recall that for each improvable configuration $B$, the strategy $\sigma^B_j$ chooses the action $(c_i, f_i)$ if and only if $i \in B$. This implies that if we follow the strategy $\sigma^B_j$ from some vertex $c_i$ where $i \in B$, then we will pass through every vertex $c_k$ where $k \in B^{>i}$. On the other hand, if we follow the strategy $\sigma^B_j$ from some vertex $c_i$ where $i \notin B$, then we will move to the vertex $r_i$, and then to the vertex $c_k$ where $k = \min(B^{>i} \cup \{n+1\})$. The next proposition uses these facts to give a formula for the valuation of each vertex $c_i$ in terms of the configuration $B$.

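Proposition 5.2. Let $B$ be an improvable configuration and $\sigma$ be a member of $\mathrm{Sequence}(B)$. For every $i$ we have:

\[
\mathrm{Val}^{\sigma}(c_i) =
\begin{cases}
\sum_{j \in B^{\geq i}} (10n+4)(2^{j} - 2^{j-1}) & \text{if } i \in B, \\
\sum_{j \in B^{\geq i}} (10n+4)(2^{j} - 2^{j-1}) - 1 & \text{otherwise.}
\end{cases}
\]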

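Proof. We first consider the case where $i \in B$. If $k = \min(B^{>i} \cup \{n+1\})$ then the definition of $\sigma$ gives:

\[
\begin{aligned}
\mathrm{Val}^{\sigma}(c_i) &= r(c_i, f_i) + r(f_i, b_i) + \mathrm{Val}^{\sigma}(b_i) \\
&= r(c_i, f_i) + r(f_i, b_i) + r(g_i, r_i) + r(r_i, c_k) + \mathrm{Val}^{\sigma}(c_k) \\
&= (4n+1) - \bigl((10n+4)2^{i-1} + 4n\bigr) + (10n+4)2^{i} - 1 + \mathrm{Val}^{\sigma}(c_k) \\
&= (10n+4)(2^{i} - 2^{i-1}) + \mathrm{Val}^{\sigma}(c_k).
\end{aligned}
\]

Repeated substitution of the above expression for $\mathrm{Val}^{\sigma}(c_k)$ gives:

\[
\mathrm{Val}^{\sigma}(c_i) = \sum_{j \in B^{\geq i}} (10n+4)(2^{j} - 2^{j-1}) + \mathrm{Val}^{\sigma}(c_{n+1}) = \sum_{j \in B^{\geq i}} (10n+4)(2^{j} - 2^{j-1}).
\]

We now consider the case where $i \notin B$. If $k = \min(B^{>i} \cup \{n+1\})$, then the definition of $\sigma$ gives:

\[
\mathrm{Val}^{\sigma}(c_i) = r(c_i, r_i) + r(r_i, c_k) + \mathrm{Val}^{\sigma}(c_k) = \mathrm{Val}^{\sigma}(c_k) - 1 = \sum_{j \in B^{\geq i}} (10n+4)(2^{j} - 2^{j-1}) - 1.
\]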
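The two cases of this proof can be checked numerically. The following is a minimal sketch, not part of the original construction: it assumes the edge rewards that the derivation above implies, namely $r(c_i, f_i) = 4n+1$, $r(f_i, b_i) = -(10n+4)2^{i-1} - 4n$, $r(g_i, r_i) = (10n+4)2^{i}$, a combined reward of $-1$ for leaving through $r_i$, and a valuation of $0$ at $c_{n+1}$. The function names are hypothetical, and $B$ is treated as an arbitrary subset of $\{1, \dots, n\}$ without checking that it is an improvable configuration.

```python
def val_by_path(i, B, n):
    """Sum the assumed rewards along the path from c_i to the sink c_{n+1}."""
    total = 0
    while i <= n:
        # Next index of B strictly above i, or n + 1 if there is none.
        k = min([j for j in B if j > i] + [n + 1])
        if i in B:
            # Path c_i -> f_i -> b_i -> g_i -> r_i -> c_k.
            total += 4 * n + 1                              # r(c_i, f_i)
            total += -(10 * n + 4) * 2 ** (i - 1) - 4 * n   # r(f_i, b_i), assumed
            total += (10 * n + 4) * 2 ** i                  # r(g_i, r_i)
            total += -1                                     # r(r_i, c_k)
        else:
            # Path c_i -> r_i -> c_k, with total reward -1.
            total += -1
        i = k
    return total


def val_closed_form(i, B, n):
    """Closed form stated in Proposition 5.2."""
    s = sum((10 * n + 4) * (2 ** j - 2 ** (j - 1)) for j in B if j >= i)
    return s if i in B else s - 1


print(val_by_path(1, {1, 3}, 3))      # 170
print(val_closed_form(1, {1, 3}, 3))  # 170
```

Under these assumptions the two functions agree for every index, which mirrors the telescoping argument used in the proof.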

The characterisation given by Proposition 5.2 gives some important properties about the valuation of the vertex $c_i$. One obvious property is that passing through a vertex $b_i$, where $i \in B$, provides a positive reward. Therefore, if $i$ and $j$ are both indices in $B$ and $i < j$, then the valuation of $c_i$ will be larger than the valuation of $c_j$.

Proposition 5.3. Let $B$ be an improvable configuration and $\sigma$ be a member of $\mathrm{Sequence}(B)$. For every $i \in B$ and $j \in B$ such that $i < j$, we have $\mathrm{Val}^{\sigma}(c_i) > \mathrm{Val}^{\sigma}(c_j)$.

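Proof. Let $C = B^{\geq i} \cap B^{<j}$ be the members of $B$ that lie between indices $i$ and $j-1$. Proposition 5.2, the fact that $i \in C$, and the fact that $2^{k} - 2^{k-1}$ is positive for every $k$ imply that:

\[
\mathrm{Val}^{\sigma}(c_i) = \sum_{k \in C} (10n+4)(2^{k} - 2^{k-1}) + \mathrm{Val}^{\sigma}(c_j) > \mathrm{Val}^{\sigma}(c_j).
\]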
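As a small illustration of this inequality, consider the values $n = 3$ and $B = \{1, 3\}$, which are chosen only for this example and are assumed to form an improvable configuration. With $i = 1$ and $j = 3$ we have $C = \{1\}$ and $10n + 4 = 34$, so Proposition 5.2 gives:

\[
\begin{aligned}
\mathrm{Val}^{\sigma}(c_3) &= 34\,(2^{3} - 2^{2}) = 136, \\
\mathrm{Val}^{\sigma}(c_1) &= 34\,(2^{1} - 2^{0}) + \mathrm{Val}^{\sigma}(c_3) = 34 + 136 = 170 > 136.
\end{aligned}
\]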

Proposition 5.3 provides a lower bound on the valuation of a vertex $c_i$ in terms of a vertex $c_j$ with $j > i$. Some of our proofs will also require a corresponding upper bound. The next proposition provides such a bound.

Proposition 5.4. Let $B$ be an improvable configuration and $\sigma$ be a member of $\mathrm{Sequence}(B)$. For every $i \in B$ and $j \in B$ such that $j > i$, we have:

\[
\mathrm{Val}^{\sigma}(c_i) \leq \mathrm{Val}^{\sigma}(c_j) + (10n+4)(2^{j-1} - 2^{i-1}).
\]

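Proof. Let $C = B^{\geq i} \cap B^{<j}$ be the members of $B$ that lie between indices $i$ and $j-1$. Using Proposition 5.2 gives:

\[
\mathrm{Val}^{\sigma}(c_i) = \sum_{k \in B^{\geq i}} (10n+4)(2^{k} - 2^{k-1}) = \sum_{k \in C} (10n+4)(2^{k} - 2^{k-1}) + \mathrm{Val}^{\sigma}(c_j).
\]

We use the fact that $(10n+4)(2^{k} - 2^{k-1}) > 0$ for all $k$ and the fact that $i > 0$ to obtain:

\[
\sum_{k \in C} (10n+4)(2^{k} - 2^{k-1}) \leq \sum_{k=i}^{j-1} (10n+4)(2^{k} - 2^{k-1}) = (10n+4)(2^{j-1} - 2^{i-1}).
\]

Therefore, we have:

\[
\mathrm{Val}^{\sigma}(c_i) \leq \mathrm{Val}^{\sigma}(c_j) + (10n+4)(2^{j-1} - 2^{i-1}).
\]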
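The lower bound of Proposition 5.3 and the upper bound of Proposition 5.4 can be checked by brute force for small $n$. The sketch below is only an illustration under stated assumptions: it takes the closed form of Proposition 5.2 as given, uses hypothetical function names, and ranges over every nonempty subset $B$ of $\{1, \dots, n\}$ without testing whether $B$ is an improvable configuration.

```python
from itertools import combinations


def val(i, B, n):
    """Valuation of c_i according to the closed form of Proposition 5.2."""
    s = sum((10 * n + 4) * (2 ** k - 2 ** (k - 1)) for k in B if k >= i)
    return s if i in B else s - 1


def check_bounds(n):
    """Return True if both bounds hold for every pair i < j drawn from every B."""
    indices = range(1, n + 1)
    for size in range(1, n + 1):
        for B in map(set, combinations(indices, size)):
            # combinations over the sorted members yields pairs with i < j.
            for i, j in combinations(sorted(B), 2):
                gap = (10 * n + 4) * (2 ** (j - 1) - 2 ** (i - 1))
                lower_ok = val(i, B, n) > val(j, B, n)          # Proposition 5.3
                upper_ok = val(i, B, n) <= val(j, B, n) + gap   # Proposition 5.4
                if not (lower_ok and upper_ok):
                    return False
    return True


print(check_bounds(5))  # True
```

The upper bound is tight exactly when every index between $i$ and $j-1$ belongs to $B$, which is the case in which the sum over $C$ equals the full telescoping sum.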

5.3.2 The Deceleration Lane

In this section we will prove that the deceleration lane behaves as we have described. In particular, we will provide a proof of the following proposition.

then applying greedy strategy improvement to $\sigma_0$ produces the sequence of strategies $\langle \sigma_0, \sigma_1, \ldots, \sigma_{2n} \rangle$.

This proposition may seem strange at first sight, because each strategy $\sigma_i$ is a partial strategy that is defined only for the vertices of the deceleration lane, and strategy improvement works only with full strategies. However, since the vertices $x$ and $y$ are the only vertices at which it is possible to leave the deceleration lane, placing an assumption on the valuations of these vertices will allow us to prove how greedy strategy improvement behaves on the deceleration lane. These proofs will hold irrespective of the decisions that are made outside the deceleration lane. In other words, if greedy strategy improvement arrives at a strategy $\sigma$ that is consistent with $\sigma_0$ on the vertices in the deceleration lane, and if $\mathrm{Val}^{\sigma}(y) > \mathrm{Val}^{\sigma}(x)$, then Proposition 5.5 implies that greedy strategy improvement will move to a strategy $\sigma'$ that is consistent with $\sigma_1$ on the vertices in the deceleration lane. This approach allows us to prove properties of the deceleration lane without having to worry about the behaviour of greedy strategy improvement elsewhere in the example.

Of course, this approach will only work if we can prove that $\mathrm{Val}^{\sigma}(y) > \mathrm{Val}^{\sigma}(x)$. The next proposition confirms that this property holds for every strategy in $\mathrm{Sequence}(B)$.

Proposition 5.6. For every improvable configuration $B$ and every strategy $\sigma$ in $\mathrm{Sequence}(B)$ we have $\mathrm{Val}^{\sigma}(y) > \mathrm{Val}^{\sigma}(x)$.

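Proof. By the definition of $\sigma$, there is some index $i \in B$ such that $\sigma(y) = c_i$ and $\sigma(x) = f_i$. Moreover, since $i \in B$ we have that $\sigma(c_i) = f_i$. We therefore have the following two equalities:

\[
\begin{aligned}
\mathrm{Val}^{\sigma}(y) &= r(y, c_i) + r(c_i, f_i) + \mathrm{Val}^{\sigma}(f_i) = \mathrm{Val}^{\sigma}(f_i) + 4n + 1, \\
\mathrm{Val}^{\sigma}(x) &= r(x, f_i) + \mathrm{Val}^{\sigma}(f_i) = \mathrm{Val}^{\sigma}(f_i).
\end{aligned}
\]

Since $4n + 1 > 0$, it follows that $\mathrm{Val}^{\sigma}(y) > \mathrm{Val}^{\sigma}(x)$.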

We now turn our attention to proving that greedy strategy improvement moves from the strategy $\sigma_i$ to the strategy $\sigma_{i+1}$. To prove properties of greedy strategy improvement, we must know the appeal of every action leaving the vertices $d_j$. The next proposition gives a characterisation for the appeal of the actions $(d_j, d_{j-1})$ for each strategy $\sigma_i$.

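Proposition 5.7. For each strategy $\sigma_i$ we have:

\[
\mathrm{Appeal}^{\sigma_i}(d_j, d_{j-1}) =
\begin{cases}
\mathrm{Val}^{\sigma_i}(y) + 4n - j + 1 & \text{if } 1 \leq j \leq i+1, \\
\mathrm{Val}^{\sigma_i}(y) - 1 & \text{if } i+1 < j \leq 2n.
\end{cases}
\]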

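Proof. We begin by considering the case where $1 \leq j \leq i+1$. By the definition of $\sigma_i$ we have that $\sigma_i(d_j) = d_{j-1}$ for all vertices $d_j$ with $1 \leq j \leq i$, and we have that $\sigma_i(d_0) = y$. Using the definition of appeal, and applying the optimality equation (5.1) repeatedly gives, for every action $(d_j, d_{j-1})$ with $1 \leq j \leq i+1$:

\[
\begin{aligned}
\mathrm{Appeal}^{\sigma_i}(d_j, d_{j-1}) &= r(d_j, d_{j-1}) + \mathrm{Val}^{\sigma_i}(d_{j-1}) \\
&= \sum_{k=1}^{j} r(d_k, d_{k-1}) + r(d_0, y) + \mathrm{Val}^{\sigma_i}(y) \\
&= -j + (4n+1) + \mathrm{Val}^{\sigma_i}(y).
\end{aligned}
\]

We now consider the case where $i+1 < j \leq 2n$. By definition we have that $\sigma_i(d_{j-1}) = y$. Using the definition of appeal and the optimality equation gives:

\[
\begin{aligned}
\mathrm{Appeal}^{\sigma_i}(d_j, d_{j-1}) &= r(d_j, d_{j-1}) + \mathrm{Val}^{\sigma_i}(d_{j-1}) \\
&= r(d_j, d_{j-1}) + r(d_{j-1}, y) + \mathrm{Val}^{\sigma_i}(y) \\
&= \mathrm{Val}^{\sigma_i}(y) - 1.
\end{aligned}
\]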
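The two cases of this proof can be reproduced by evaluating the lane directly. The following sketch is illustrative only and rests on assumptions that are inferred from the proof rather than quoted from the definition of the example: each action $(d_j, d_{j-1})$ has reward $-1$, the action $(d_0, y)$ has reward $4n+1$, each action $(d_j, y)$ with $j \geq 1$ has reward $0$, and the valuation of $y$ is supplied as a fixed number. The function names are hypothetical.

```python
def lane_values(i, n, val_y):
    """Valuations of d_0, ..., d_{2n} under sigma_i, for a fixed valuation of y."""
    values = [val_y + 4 * n + 1]               # d_0 -> y, assumed reward 4n + 1
    for j in range(1, 2 * n + 1):
        if j <= i:
            values.append(values[j - 1] - 1)   # d_j -> d_{j-1}, assumed reward -1
        else:
            values.append(val_y)               # d_j -> y, assumed reward 0
    return values


def appeal_step_down(j, i, n, val_y):
    """Appeal of the action (d_j, d_{j-1}) under sigma_i, computed from valuations."""
    return -1 + lane_values(i, n, val_y)[j - 1]


def appeal_closed_form(j, i, n, val_y):
    """Appeal of (d_j, d_{j-1}) as stated in Proposition 5.7."""
    if 1 <= j <= i + 1:
        return val_y + 4 * n - j + 1
    return val_y - 1


n, val_y = 3, 100
for i in range(2 * n + 1):
    for j in range(1, 2 * n + 1):
        assert appeal_step_down(j, i, n, val_y) == appeal_closed_form(j, i, n, val_y)
```

Under these assumptions, the appeal of $(d_j, d_{j-1})$ exceeds the valuation of $y$ when $j \leq i+1$ and falls below it when $j > i+1$.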

Proposition 5.7 confirms that an action $(d_j, d_{j-1})$ is switchable only after [...] It can clearly be seen that the only action of this form that is switchable in the strategy $\sigma_i$ is the action $(d_{i+1}, d_i)$. Every other action of this form has either already been switched, or is obviously not switchable.

We must also consider the actions $(d_j, x)$. The next proposition confirms that these actions will not be switchable in the strategy $\sigma_i$.

Proposition 5.8. If $\mathrm{Val}^{\sigma_i}(x) < \mathrm{Val}^{\sigma_i}(y)$ then $\mathrm{Appeal}^{\sigma_i}(d_j, x) < \mathrm{Appeal}^{\sigma_i}(d_j, y)$ for all $j$.

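Proof. Using the definition of appeal for the vertex $d_j$ gives two equalities:

\[
\begin{aligned}
\mathrm{Appeal}^{\sigma_i}(d_j, y) &= r(d_j, y) + \mathrm{Val}^{\sigma_i}(y), \\
\mathrm{Appeal}^{\sigma_i}(d_j, x) &= r(d_j, x) + \mathrm{Val}^{\sigma_i}(x).
\end{aligned}
\]

Observe that for all $j$ we have $r(d_j, y) = r(d_j, x)$. Therefore we can conclude that when $\mathrm{Val}^{\sigma_i}(x) < \mathrm{Val}^{\sigma_i}(y)$ we have $\mathrm{Appeal}^{\sigma_i}(d_j, x) < \mathrm{Appeal}^{\sigma_i}(d_j, y)$.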

In summary, Proposition 5.7 has shown that there is exactly one action of the form $(d_j, d_{j-1})$ that is switchable in $\sigma_i$, and that this action is $(d_{i+1}, d_i)$. Proposition 5.8 has shown that no action of the form $(d_j, x)$ is switchable in $\sigma_i$. It is obvious that no action of the form $(d_j, y)$ is switchable in $\sigma_i$. Therefore, greedy strategy improvement will switch the action $(d_{i+1}, d_i)$ in the strategy $\sigma_i$, which creates the strategy $\sigma_{i+1}$. Therefore, we have shown Proposition 5.5.
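The behaviour summarised above can be simulated on the deceleration lane in isolation. The sketch below is a simplified illustration, not the full example: it assumes that the valuations of $x$ and $y$ are held fixed with the valuation of $y$ the larger, that each action $(d_j, d_{j-1})$ has reward $-1$, that both exits from $d_0$ have reward $4n+1$, and that both exits from every other $d_j$ have reward $0$. The names `greedy_lane_run`, `'down'`, `'x'` and `'y'` are hypothetical labels introduced for this sketch.

```python
def greedy_lane_run(n, val_y, val_x):
    """Apply greedy switching to the lane until no action is switchable."""
    m = 2 * n

    def exit_reward(j):
        return 4 * n + 1 if j == 0 else 0      # assumed rewards of (d_j, y) and (d_j, x)

    # A strategy assigns 'y', 'x' or 'down' (move to d_{j-1}) to each index j.
    strategy = ['y'] * (m + 1)                 # sigma_0: every d_j moves to y
    history = [tuple(strategy)]
    while True:
        # Valuations under the current strategy, computed from d_0 upwards.
        values = []
        for j, choice in enumerate(strategy):
            if choice == 'down':
                values.append(values[j - 1] - 1)
            else:
                values.append(exit_reward(j) + (val_y if choice == 'y' else val_x))
        # Greedy rule: every vertex with a switchable action takes its most appealing one.
        new_strategy = list(strategy)
        for j in range(m + 1):
            appeals = {'y': exit_reward(j) + val_y, 'x': exit_reward(j) + val_x}
            if j >= 1:
                appeals['down'] = values[j - 1] - 1
            best = max(appeals, key=appeals.get)
            if appeals[best] > values[j]:
                new_strategy[j] = best
        if new_strategy == strategy:
            return history
        strategy = new_strategy
        history.append(tuple(strategy))


history = greedy_lane_run(n=2, val_y=10, val_x=3)
print(len(history))  # 5
```

With $n = 2$ the run visits five strategies, and the $i$-th of them moves down the lane at exactly the vertices $d_1, \dots, d_i$, which matches the sequence $\langle \sigma_0, \sigma_1, \ldots, \sigma_{2n} \rangle$ under the assumptions listed above.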