Policy iteration for stochastic zero-sum games

(1)

Policy iteration for stochastic zero-sum games

Marianne Akian

To cite this version:

Marianne Akian. Policy iteration for stochastic zero-sum games. NETCO 2014, 2014, Tours, France. <hal-01024097>

HAL Id: hal-01024097

https://hal.inria.fr/hal-01024097

Submitted on 15 Jul 2014

HAL is a multi-disciplinary open access archive for the deposit and dissemination of sci-entific research documents, whether they are pub-lished or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

L’archive ouverte pluridisciplinaire HAL, est destinée au dépôt et à la diffusion de documents scientifiques de niveau recherche, publiés ou non, émanant des établissements d’enseignement et de recherche fran¸cais ou étrangers, des laboratoires publics ou privés.

(2)

games

Marianne Akian

INRIA Saclay - ˆIle-de-France and CMAP, ´Ecole Polytechnique

NETCO 2014 23-27 June 2014, Tours

(3)

Hamilton-Jacobi-Bellman-Isaacs equations

The stationary equation:

−H(x, Dv (x)) = 0, x ∈ X ⊂ Rd +a boundary condition, H(x, p) = min

a∈A(x)bmax∈B(x)[f (x, a, b) · p + g(x, a, b)], x ∈ X , andp ∈ R

d _,

is the dynamic programming equation satisfied by the (upper) value function of the zero-sum game problem:

v(x) = inf (αt)t≥0 sup (βt)t≥0 Z ∞ 0 g(xt, αt, βt)dt ,

where ˙xt = f (xt, αt, βt), for allt≥ 0, andinfandsupare taken over

nonanticipating strategies of the first and second player (where the second player knows the current action of the first player).

(4)

⇒

v = F (v ), the fixed point equation of the dynamic programming or Shapley operatorF of a discrete time zero-sum two player stochastic game problem with finite state space.

Same for:Discounted problems, Optimal stopping time problems, Stochastic games.

(5)

Discrete time and state zero-sum stochastic games

LetF : Rn→ Rn be defined by:

[F (v )]i := min a∈Ai max b∈Bi   X j∈[n] M_ijabvj + riab  , i ∈ [n],

withM_ijab ≥ 0for alli, j ∈ [n], a ∈ Ai, b ∈ Bi.

The map F is thedynamic programming or Shapley operator of a dis-crete time zero-sum two player game problem with perfect information on the finite state spaceX := [n] := {1, . . . , n}, with:

Ai, Bi sets of actions of the 1st, 2nd player MIN, MAX, when in statei r_iab reward paid by MIN to MAX, at each time

M_ijab := γ_iabP_ijab ≥ 0 γ_iab :=P

j∈[n]Mijab ≥ 0discount factor (< 1or≤ 1or= 1) Pab

ij transition probability fromitoj( P

(6)

LetF : R → R be defined by: [F (v )]i := min a∈Ai max b∈Bi   X j∈[n] M_ijabvj + riab  , i ∈ [n],

withM_ijab ≥ 0for alli, j ∈ [n], a ∈ Ai, b ∈ Bi.

Denoteγab_i :=P

j∈[n]Mijab.

Then

• F is order preserving:u ≤ v ⇒ F (u) ≤ F (v ), for allu, v ∈ Rn_;

• ifγ_iab ≤ 1for alli∈ [n]anda∈ A, b ∈ B, thenF is additively

sub-homogeneous:F(λ + u) ≤ λ + F (u), for allλ ≥ 0andu ∈ Rn

• thusF is sup-norm nonexpansive.

• Ifγ_iab= 1for alli ∈ [n]anda∈ A, b ∈ B, thenF is additively

(7)

Let the value function of the gamewith infinite horizonbe given by: vx = inf (αk)k ≥0 sup (βk)k ≥0 E "_∞ X k=0 ( k−1 Y ℓ=0 γαℓ,βℓ Xℓ )r αk,βk xk | X0= x # ,

whereαk andβk are possible strategies of both players of the game (at

timek), andXk ∈ [n]is the state process of the game satisfying P(Xk+1= j|Xk = i, αk = a, βk = b) = P_ijab.

Ifγxa,b ≤ ¯γ < 1, thenF is a sup-norm contraction:

kF (v ) − F (w)k∞≤ ¯γkv − wk∞ ,

andv is the unique solution of

v = F (v ).

Moreover the optimal actions inF(v )give the optimal stationary strategies of the game.

(8)

Problem: compute v ∈ Rn_{such that F}_{(v ) = v , when such a solution is}

unique, and bound the complexity of this computation.

Whenγab_i ≤ ¯γ < 1for alli∈ [n]anda∈ A, b ∈ B, then

• Then, thevalue iterationscoincide with fixed point iterations:

vk+1= F (vk), and with the finite horizon approximations with

T = k andϕ = v0_{. They converge geometrically towards}_v _with

factorγ¯:

lim k→∞kv

k_{− v k}1/k _{≤ ¯}_{γ .}

• However, the value iteration algorithm is only pseudopolynomial.

• Also the existence of a polynomial algorithm is an open problem.

(9)

Policy iterations for discounted games

Assume:Ai andBi are finite sets, and

Denote byΣ := {σ : i ∈ [n] 7→ σi ∈ Ai}and∆ := {δ : i ∈ [n] 7→ δi ∈ Bi}

the sets of policies,

and forσ ∈ Σandδ ∈ ∆, define the matrices and vectors:

M(σδ)= (Mσiδi

ij )ij=1,...,n, andr

(σδ)_{= (r}σiδi

i )i=1,...,n ,

and the affine maps

F(σδ)(v ) = M(σδ)v+ r(σδ), v ∈ Rn .

Then,F can be written as:

F(v ) = min

σ∈ΣF

(σ)_{(v ) ,} _with_F(σ)_{(v ) := max} δ∈∆F

(σδ)_{(v ) ,} _v _{∈ R}n _,

where minima and maxima are for the partial order ofRn.

The mapsF(σδ),F(σ)andF are all order preserving and contracting for the sup-norm with contraction factorγ¯.

Important: the infimum and supremum are attained because the sets {F(σ)(v ) | σ ∈ Σ} and {F(σδ)(v ) | δ ∈ ∆} are rectangular.

(10)

(Howard, 1960)for 1-player games,(Denardo, 1967)for 2-player games.

Using operators:

Given an initial policyσ0_{∈ Σ}_{, apply successively the two following steps}

fors≥ 0untilσs+1_{= σ}s_:

1 Compute the fixed pointvs ofF(σs);

2 Improve the policy: choose an optimal policy forvs, that is σs+1∈ Σsuch thatF(vs) = F(σs+1)(vs)

withσs+1= σs as soon as this is possible.

Step 1 is solved by using Policy iteration for the (one-player) game with fixed policyσs_{, which constructs}_vs,l _and_δs,l _from_δs,0_.

(11)

Policy iterations for discounted games

With control terminology:

1 Compute the valuevsof the game with fixed policyσs, that is the

solution ofv = F(σs)(v );

2 Improve the policy: choose an optimal policy forvs, that is σs+1∈ Σsuch thatF(vs) = F(σs+1)(vs)or equivalently:

σ_is+1 ∈ argmin a∈Ai    max b∈Bi   X j∈[n] M_ijabv_js + r_iab      , i ∈ [n],

(12)

Simplex algorithm for 1-player games with Dantzig pivoting:

1 Compute the valuevsof the game with fixed policyσs, that is the

solution ofv = F(σs)(v );

2 Improve the policy: choose a policyσs+1∈ Σsuch that

σ_is+1 ∈ argmin a∈Ai    max b∈Bi   X j∈[n] M_ijabv_js + r_iab      , i ∈ [n],

(13)

Policy iterations for discounted games

❄ ✛ ✻ ❆ ❆ ❆ ❆ ❆ ❆ ❆ ❆ ❆ ❆ ❆ ❆ ❆❆❯ . . . . . . PI external . . . PI internal . . . σ0 δs,0 σs+1 σs σs−1 δs,l−1 δs,l

(14)

❄ ✛ ✻ ❆ ❆ ❆ ❆ ❆ ❆ ❆ ❆ ❆ ❆ ❆ ❆ ❆❆❯ . . . . . . PI external . . . PI internal . . . (Linear Equations) (Nonlinear Equations) vs vs−1 vs−2 v vs,l−1 vs,l vs

(15)

Policy iterations for discounted games: monotone

convergence

• The sequence(vs)s≥0is nonincreasing;

• Hence, the sequence(σs)s≥0does not visit the same policy two

times, until it becomes stationary;

• So the sequence(vs)sis stationary after a finite time (at most♯Σ),

(16)

convergence

• The sequence(vs)s≥0is nonincreasing;

• So the sequence(vs)sis stationary after a finite time (at most♯Σ),

and converges towards the solutionv ofv = F (v ).

• Whensis fixed, the sequence(vs,l)l is nondecreasing;

• Hence, the sequence(δs,l)l does not visit the same policy two

• So the sequence(vs,l)l is stationary after a finite time (at most♯∆),

and converges towards the solutionvs _of_v _{= F}(σs₎ (v ).

(17)

Policy iterations for discounted games: well known

properties

• The Policy iterations converge faster than the value iterations: for alls ≥ 0,v ≤ vs+1_{≤ F (v}s_{) ≤ v}s_{, so}_v _{≤ v}s _{≤ F}s_(v0_{) ≤ v}0_.

• If the discount factor is uniformely bounded by some constant

¯

γ < 1, then for the sup-norm, we have:

kvs+1− v k ≤ kF (vs) − v k ≤ ¯γkvs− v k .

• For 1-player games with an infinite number of actions and under regularity conditions, Policy iterations coincide with the Newton algorithm, and have asuper-linear convergence.

• However, in general, the number of (external) iterations is bounded by♯Σ ≥ 2n_if_♯A

(18)

• (Friedmann, 2009)showed a 2-player deterministic game problem withγ ≃ 1and an exponential number of iterations.

• (Fearnley, 2010)and(Andersson, 2009)showed the same for a 1-player stochastic game.

(19)

Policy iterations for discounted games: recent results

(Ye, 2011)showed that Policy iteration algorithm and Simplex algorithm solve 1-player discounted games with fixed discount factorγ < 1in

strongly polynomialtime.

(Hansen, Miltersen and Zwick, 2011)extended and improved this result to Policy iteration algorithm for 2-player games. They show that the number of iterationssmax(to obtain stationarity) satisfies:

smax≤ (m + 1)(1 + log(n2_{/(1 − γ))} − log(γ) ) = O( m 1− γlog n 1− γ),

withm=thetotal number of actions: the number of(i, a, b)withi∈ [n],

a∈ Ai andb∈ Bi.

(Feinberg, Huang, 2013): Same for a one-player game with

mean-payoff, and a statei0such thatP_ia_,i₀ ≥ 1 − γ, for alli ∈ [n], a ∈ Ai.

Question: What remains true when the discount factorsγ_iab are not uniformely bounded by a constant< 1?

(20)

2-player games satisfying

r(M(σδ)) ≤ λ ∀σ ∈ Σ, δ ∈ ∆

is strongly polynomial. More precisely, the number of external iterations

smaxsatisfies: smax≤ (m1− n)(1 + ⌊ log(1 − λ) log(λ) ⌋) = O( m1− n 1− λ log 1 1− λ),

withm1=thetotal number of actions of the first player: the number of

(21)

Proof. • Adapt the proof of(Hansen, Miltersen and Zwick, 2011)by using sup-norms instead ofℓ1norms and the nonlinear mapsF(δ)

to obtain the above bound when the discount factors are≤ λ. A similar bound is obtained by (Scherrer, 2013)in the one-player case with fixed discount factor.

• Using nonlinear spectral theory, show that for allλ < µ < 1, there existsϕ ∈ Rnsuch thatϕi > 0,i ∈ [n], andM(σδ)ϕ ≤ µϕ.

• LetG(v ) = ϕ−1F(ϕv )withϕv = (ϕivi)i∈[n]. ThenGis the dynamic

programming operator of a game with discount factors≤ µ, and the sequence of policiesσs forF andGare the same, so issmax.

• Equivalently,F is contracting onRnwith contraction factorµ, for the weighted sup-normk · kϕdefined by:

kv kϕ := max

i∈[n]|

vi ϕi

| ∀v ∈ Rn .

(22)

to obtain the above bound when the discount factors are≤ λ. A similar bound is obtained by (Scherrer, 2013)in the one-player case with fixed discount factor.

• Using nonlinear spectral theory, show that for allλ < µ < 1, there existsϕ ∈ Rnsuch thatϕi > 0,i ∈ [n], andM(σδ)ϕ ≤ µϕ. ← details

• LetG(v ) = ϕ−1F(ϕv )withϕv = (ϕivi)i∈[n]. ThenGis the dynamic

programming operator of a game with discount factors≤ µ, and the sequence of policiesσs forF andGare the same, so issmax.

• Equivalently,F is contracting onRnwith contraction factorµ, for the weighted sup-normk · kϕdefined by:

kv kϕ := max

i∈[n]|

vi ϕi

| ∀v ∈ Rn .

(23)

Definition (Nonlinear spectral radii(Nussbaum,Mallet-Paret, 1998))

Lethbe a nonlinear continuous positively homogenous map on a closed convex coneCofRn(h(λv ) = λh(v )for allλ > 0andv ∈ C):

• Thecone eigenvalue spectral radiusofh,ˆrC(h), is the maximal

modulus of an eigenvalue ofhinC, whereλis an eigenvalue associated tov ∈ C\{0}ifh(v ) = λv.

• TheCollatz-Wielandt numbercwC(h)is the infimum of the

super-eigenvalues ofh, whereλ > 0is a super-eigenvalue if there existsv in the interior ofC such thath(v ) ≤ λv.

• TheBonsall’s spectral radiusofhis defined as:

rC(h) := inf k≥1kh k_k1/k C , with khkC := sup x∈C, kxk=1 kh(x)k ,

(24)

C= Rn

+, all the above spectral radius notions ofhcoincide:

r(h) = infk≥1khkk1/k_Rn

+

= max{λ ∈ R | ∃v ∈ Rn+\{0}, h(v ) = λv }

= inf{λ > 0 | ∃v ∈ (R∗

+)n, h(v ) ≤ λv }

Proposition (A. Gaubert, Nussbaum, arXiv 2011)

Assume thathandhπ are continuous, positively homogenous, order

preserving selfmaps ofRn+, for allπ ∈ Π, and thath(v ) = maxπ∈Πhπ(v )

for allv ∈ Rn+, then

r(h) = max

π∈Πr(hπ) .

Applying the proposition toh(v ) := maxσ∈Σmaxδ∈∆(M(σδ)v), we get

thatr(h) ≤ λ < µand so by the theorem, there existsϕ ∈ (R∗+)nsuch

(25)

Consider the value function of the gamewith mean-payoff: ηx = inf (αk)k ≥0 sup (βk)k ≥0 lim sup T→∞ 1 TE "_T₋₁ X k=0 rαk,βk x_k | X0= x # .

LetF be the dynamic programming operator such thatγab i ≡ 1. F is additively homogeneous. We say thatv ∈ Rn_{is an}_(nonlinear

additive) eigenvectororbiaisofF witheigenvalueρ ∈ RifF(v ) = ρ + v.

• Ifρexists, thenηx = ρfor allx ∈ [n].

• If all the matricesM(σδ)are irreducible, thenρexists and the eigenvectorv is unique up to an additive constant.

• Other existence results ofρ:Bather, 1973,Gaubert, Gunawardena, 2001.

(26)

(Hoffman and Karp, 1966)We have to solveρ + v = F (v ). Using operators:

fors≥ 0untilσs+1= σs:

1 Compute the additive eigenvalue and eigenvectorρs andvs of F(σs), that is the solution ofρ + v = F(σs)(v );

2 Improve the policy: choose an optimal policy forvs, that is σs+1_{∈ Σ}_{such that}_F_(vs_{) = F}(σs+1₎

(vs₎

withσs+1= σs as soon as this is possible.

Step 1 is solved by using Policy iteration for the (one-player) game with fixed policyσs_{, which constructs}_ρs,l_,_vs,l _and_δs,l _from_δs,0_.

(27)

Policy iterations for “irreducible” mean-payoff games

(Hoffman and Karp, 1966)We have to solveρ + v = F (v ). With control terminology:

fors≥ 0untilσs+1= σs:

1 Compute the valueρs and the biaisvs of the game with fixed policy σs_{, that is the solution of}_{ρ + v = F}(σs₎

(v );

2 Improve the policy: choose an optimal policy forvs, that is σs+1_{∈ Σ}_{such that}_F_(vs_{) = F}(σs+1₎ (vs₎_{or equivalently:} σ_is+1 ∈ argmin a∈A    max b∈B   X j∈[n] M_ijabv_js + r_iab      , i ∈ [n],

(28)

monotone convergence

• The sequence(ρs₎

s≥0is nonincreasing;

• Ifρs= ρs+1, thenvs− vs+1is constant andvs = v.

• So the sequence(ρs, vs)s is stationary after a finite time (at most ♯Σ), up to an additive constant, and converges towards the solution

(29)

Policy iterations for “irreducible” mean-payoff games:

monotone convergence

• The sequence(ρs₎

s≥0is nonincreasing;

• Ifρs= ρs+1, thenvs− vs+1is constant andvs = v.

• So the sequence(ρs, vs)s is stationary after a finite time (at most ♯Σ), up to an additive constant, and converges towards the solution

(ρ, v )ofρ + v = F (v ).

• Whensis fixed, the sequence(ρs,l₎

l is nondecreasing;

• Ifρs,l _{= ρ}s,l+1_{, then}_vs,l_{− v}s,l+1_{is constant and}_vs,l _{= v}s_.

• Hence, the sequence(δs,l)l does not visit the same policy two

• So the sequence(ρs,l_{, v}s,l₎

l is stationary after a finite time (at most ♯∆), and converges towards the solution(ρs_{, v}s₎_of

(30)

Tij(M) = E[inf{k ≥ 1 | Xk = j} | X0= i] ,

the expected first return (or hitting) time in statej, starting fromi. Note thatTii0(M) < +∞for alli∈ [n]if and only ifMhas a unique

recurrent (final) class andi0belongs to it.

Theorem (A., Gaubert, arXiv:1310.4953)

Let us fixK > 0and a statei0. The policy iteration algorithm for the

class of 2-player mean-payoff games such that

Tii0(M

(σδ)_{) ≤ K} _{∀σ ∈ Σ, δ ∈ ∆, i ∈ [n]}

smaxsatisfies:

smax≤ (m1− n)(1 + ⌊

log(K )

log(K /(K − 1)⌋) = O((m1− n)K log K ),

(31)

Sketch of the proof. • Letϕ ∈ (R∗

+)n be defined by:

ϕi = maxσ∈Σmaxδ∈∆Tii₀(M(σδ)).

• LetQ(σδ)_{be obtained from}_M(σδ)_{by putting its}_i

0th column to zero.

Thenϕ = 1 + maxσ∈Σmaxδ∈∆(Q(σδ)ϕ).

• LetN(σδ)be obtained fromM(σδ)by replacing itsi0th column by the

nonnegative vector(ϕ − 1 − Q(σδ)_ϕ)/ϕ

i0.

• N(σδ)has nonnegative entries and satisfies:

N(σδ)ϕ = ϕ − 1 ≤ λϕ withλ = 1 − 1/K ⇒ r (N(σδ)) ≤ λ .

• Then the map

G(v ) = min

σ∈Σmaxδ∈∆(N

(σδ)_v_{+ r}(σδ)_{) ,} _v _{∈ R}n

satisfies the assumptions of the theorem for discounted games.

• Ifvi0 = 0, thenρ + v = F (v ) ⇔ ρϕ + v = G(ρϕ + v ).

• Hence, the sequences of policiesσsandδs,l forF andGare the same.

(32)

11 10 13 12 20 21 17 16 19 18 15 1 3 2 5 4 7 6 9 8 14

Nodes = web pages Arcs = hyperlinks

21 _{: spammer page} 1

: non controlled page.

Associated Markov matrix

S:Sij = 1/Ni if(i, j)is an

hyperlink,Sij = 0

otherwise;Ni =number of

hyperlinks fromi.

The PageRank is the invariant measureπofS.

(33)

• Letv be the preference probability vector of the Web search engine

• Letαbe a damping factor: the probability for a Web surfer to use the Web seach engine.

• Usually, one replacesSbyαS + (1 − α)✶v,✶ = (1 · · · 1)T.

• Similar to consider the Markov matrix of the Web with the Web search engine: M =h_α0_{✶ (1−α)S}v i.

• Ifr is an instantaneous reward such thatri = 1fori = sand0

otherwise, then the mean-payoff is the PageRank (frequency of visit)πs of the spammer sites.

• Optimizing the spammer site is a 1-player game with mean-payoff

(see for instance (Fercoq, A., Bouhtou, Gaubert, IEEE TAC 2013).

A zero-sum game problem:

• σ ∈ Σis the policy of the Web search engine, it controlsv and wants to minimize the PageRank of the spammer site;

• δ ∈ ∆is the policy of the spammer, it controls the rows ofS with index in his site, and wants to maximize its PageRank.

(34)

(35)

Related recent results for 1-player discounted games

• (Post, Ye, 2012)show that the simplex algorithm for deterministic MDP (1-player games) is strongly polynomial independently of the discount factor: it stops afterO(n5_m2_log2_n)_{iterations, where}_m_is

the number of possible actions by state (thusm1= nm).

• (Scherrer, 2013)generalizes this result to stochastic MDP which satisfy a bound which may be seen (and is equal when the discount factorγ tends to1) as a boundτr on the expected first

return time to recurrent states and a boundτt on the expected exit

time from transient states. Under these conditions the simplex algorithm stops after O(n3_m2_τ

rτtlog2(nτrτt)).

• (Scherrer, 2013)shows a similar result for Policy Iteration algorithm for stochastic MDP (1-player games), when the set of transient states is independent of the strategy. Under these conditions the Policy iteration algorithm stops after

n(m − 1)(⌈τrlog(nτr)⌉ + ⌈τtlog(nτt)⌉)iterations.

• However, this assumption implies that the recurrent classes are independent of the strategy.

(36)

class of 2-player discounted games with fixed discount factor,

M(σδ)_{= γP}(σδ)_with_{γ < 1}_{, such that}

Tii₀(P(σδ)) ≤ K ∀σ ∈ Σ, δ ∈ ∆, i ∈ [n]

smaxsatisfies:

smax≤ (m1− n)(1 + ⌊

log(K )

log(K /(K − 1)⌋) = O((m1− n)K log K ),

withm1=thetotal number of actions of the first player.

(37)

For a Markov matrixM, a statei and setCof states, denote:

TiC(M) = E[inf{k ≥ 1 | Xk ∈ C} | X0= i] ,

the expected first return (or hitting) time in setC, starting fromi.

Theorem (A., Gaubert, 2014)

Let us fixK > 0and a subsetCof states with cardinalitys. The policy iteration algorithm for the class of 2-player multichain mean-payoff games such that for allσ ∈ Σ, δ ∈ ∆, each final class ofM(σδ)contains exactly one element ofCand

TiC(M(σδ)) ≤ K ∀i ∈ [n]

smaxsatisfies:

smax≤ (m1− n)(1 + ⌊

log(sK )

log(sK /(sK − 1)⌋) = O((m1− n)sK log(sK )),

(38)

• In general,F may not have additive eigenvalue and eigenvector, that isρandv such thatρ + v = F (v ).

• If the action spacesAi andBi are finite for alli ∈ [n], thenF is

polyhedral, and since it is also nonexpansive, by theKohlberg

(1980)theorem, there existηandv inRnsuch that

F(tη + v ) = (t + 1)η + v , for t large enough.

• (η, v )is called aninvariant half-line.

• Thenη is the value of the game with mean-payoff.

• Moreover, there existFˆ andF´·such that(η, v )is an invariant

half-line if and only if it satisfies the system:

(

η = ˆF(η) , η + v = ´Fη(v ) .

(39)

Policy iterations for multichain mean-payoff games

Construct a sequence of policiesσs_{, values}_ηs _{and biais}_vs_.

They were introduced and proved to converge by

• (Howard, 1960)and(Denardo and Fox, 1968)for 1-player multichain mean-payoff games,

• (V ¨oge and Jurdzi ´nski, 2000)for parity games,

• (Cochet-Terrasson, Gaubert, Gunawardena, 1998 and 1999), (Bjorklund, Sandberg, Vorobyov, 2004), (Jurdzi ´nski,Paterson, Zwick, 2006)for 2-player deterministic games,

• (Cochet-Terrasson and Gaubert, 2006), (A., Cochet-Terrasson, Detournay, and Gaubert, arXiv:1208.0446, and CDC 2013),

(Detournay,PIGAMES library, 2012), (Bourque, Raghavan, preprint, 2012)for general multichain 2-player stochastic games.

(Detournay, 2012).

To avoid cycling, one need to add some constraints onvs, for instance:

• fix the valuev_is= 0at one pointi of each final class ofM(σδ)

(Howard, and Denardo and Fox, for one-player games);

• by a nonlinear projection (Cochet-Terrasson and Gaubert); and to choose optimal policies in a conservative way.

(40)

polynomial when restricted to the class of games such thatthe spectral radii of allM(σδ)are bounded byλ < 1. This result is invariant by diagonal scaling.

• The policy iteration algorithm for ergodic mean-payoff games is strongly polynomial when restricted to the class of ergodic games such thatthe expected first return (or hitting) time in some fixed

statei0of the Markov chain associated to anyM(σδ)and initial state

is bounded byK < ∞.

• Same result fordiscounted games.

• Same result for multichain mean-payoff games, wheni0is replaced

by a set of statesC, and each recurrence class contains exactly one element ofC.

Open:

• Is the policy iteration algorithm for multichain stochastic games strongly polynomial, under some more general constraints on the