• No results found

Policy iteration for stochastic zero-sum games

N/A
N/A
Protected

Academic year: 2021

Share "Policy iteration for stochastic zero-sum games"

Copied!
40
0
0

Loading.... (view fulltext now)

Full text

(1)

Policy iteration for stochastic zero-sum games

Marianne Akian

To cite this version:

Marianne Akian. Policy iteration for stochastic zero-sum games. NETCO 2014, 2014, Tours, France. <hal-01024097>

HAL Id: hal-01024097

https://hal.inria.fr/hal-01024097

Submitted on 15 Jul 2014

HAL is a multi-disciplinary open access archive for the deposit and dissemination of sci-entific research documents, whether they are pub-lished or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

L’archive ouverte pluridisciplinaire HAL, est destin´ee au d´epˆot et `a la diffusion de documents scientifiques de niveau recherche, publi´es ou non, ´emanant des ´etablissements d’enseignement et de recherche fran¸cais ou ´etrangers, des laboratoires publics ou priv´es.

(2)

games

Marianne Akian

INRIA Saclay - ˆIle-de-France and CMAP, ´Ecole Polytechnique

NETCO 2014 23-27 June 2014, Tours

(3)

Hamilton-Jacobi-Bellman-Isaacs equations

The stationary equation:

−H(x, Dv (x)) = 0, x ∈ X ⊂ Rd +a boundary condition, H(x, p) = min

a∈A(x)bmax∈B(x)[f (x, a, b) · p + g(x, a, b)], x ∈ X , andp ∈ R

d ,

is the dynamic programming equation satisfied by the (upper) value function of the zero-sum game problem:

v(x) = inft)t≥0 sup (βt)t≥0 Z ∞ 0 g(xt, αt, βt)dt ,

where ˙xt = f (xt, αt, βt), for allt≥ 0, andinfandsupare taken over

nonanticipating strategies of the first and second player (where the second player knows the current action of the first player).

(4)

v = F (v ), the fixed point equation of the dynamic programming or Shapley operatorF of a discrete time zero-sum two player stochastic game problem with finite state space.

Same for:Discounted problems, Optimal stopping time problems, Stochastic games.

(5)

Discrete time and state zero-sum stochastic games

LetF : Rn→ Rn be defined by:

[F (v )]i := min a∈Ai max b∈Bi   X j∈[n] Mijabvj + riab  , i ∈ [n],

withMijab ≥ 0for alli, j ∈ [n], a ∈ Ai, b ∈ Bi.

The map F is thedynamic programming or Shapley operator of a dis-crete time zero-sum two player game problem with perfect information on the finite state spaceX := [n] := {1, . . . , n}, with:

Ai, Bi sets of actions of the 1st, 2nd player MIN, MAX, when in statei riab reward paid by MIN to MAX, at each time

Mijab := γiabPijab ≥ 0 γiab :=P

j∈[n]Mijab ≥ 0discount factor (< 1or≤ 1or= 1) Pab

ij transition probability fromitoj( P

(6)

LetF : R → R be defined by: [F (v )]i := min a∈Ai max b∈Bi   X j∈[n] Mijabvj + riab  , i ∈ [n],

withMijab ≥ 0for alli, j ∈ [n], a ∈ Ai, b ∈ Bi.

Denoteγabi :=P

j∈[n]Mijab.

Then

F is order preserving:u ≤ v ⇒ F (u) ≤ F (v ), for allu, v ∈ Rn;

• ifγiab ≤ 1for alli∈ [n]anda∈ A, b ∈ B, thenF is additively

sub-homogeneous:F(λ + u) ≤ λ + F (u), for allλ ≥ 0andu ∈ Rn

• thusF is sup-norm nonexpansive.

• Ifγiab= 1for alli ∈ [n]anda∈ A, b ∈ B, thenF is additively

(7)

Let the value function of the gamewith infinite horizonbe given by: vx = inf (αk)k ≥0 sup (βk)k ≥0 E " X k=0 ( k−1 Y ℓ=0 γαℓ,βℓ X)r αkk xk | X0= x # ,

whereαk andβk are possible strategies of both players of the game (at

timek), andXk ∈ [n]is the state process of the game satisfying P(Xk+1= j|Xk = i, αk = a, βk = b) = Pijab.

Ifγxa,b ≤ ¯γ < 1, thenF is a sup-norm contraction:

kF (v ) − F (w)k∞≤ ¯γkv − wk∞ ,

andv is the unique solution of

v = F (v ).

Moreover the optimal actions inF(v )give the optimal stationary strategies of the game.

(8)

Problem: compute v ∈ Rnsuch that F(v ) = v , when such a solution is

unique, and bound the complexity of this computation.

Whenγabi ≤ ¯γ < 1for alli∈ [n]anda∈ A, b ∈ B, then

• Then, thevalue iterationscoincide with fixed point iterations:

vk+1= F (vk), and with the finite horizon approximations with

T = k andϕ = v0. They converge geometrically towardsv with

factorγ¯:

lim k→∞kv

k− v k1/k ≤ ¯γ .

• However, the value iteration algorithm is only pseudopolynomial.

• Also the existence of a polynomial algorithm is an open problem.

(9)

Policy iterations for discounted games

Assume:Ai andBi are finite sets, and

Denote byΣ := {σ : i ∈ [n] 7→ σi ∈ Ai}and∆ := {δ : i ∈ [n] 7→ δi ∈ Bi}

the sets of policies,

and forσ ∈ Σandδ ∈ ∆, define the matrices and vectors:

M(σδ)= (Mσiδi

ij )ij=1,...,n, andr

(σδ)= (rσiδi

i )i=1,...,n ,

and the affine maps

F(σδ)(v ) = M(σδ)v+ r(σδ), v ∈ Rn .

Then,F can be written as:

F(v ) = min

σ∈ΣF

(σ)(v ) , withF(σ)(v ) := max δ∈∆F

(σδ)(v ) , v ∈ Rn ,

where minima and maxima are for the partial order ofRn.

The mapsF(σδ),F(σ)andF are all order preserving and contracting for the sup-norm with contraction factorγ¯.

Important: the infimum and supremum are attained because the sets {F(σ)(v ) | σ ∈ Σ} and {F(σδ)(v ) | δ ∈ ∆} are rectangular.

(10)

(Howard, 1960)for 1-player games,(Denardo, 1967)for 2-player games.

Using operators:

Given an initial policyσ0∈ Σ, apply successively the two following steps

fors≥ 0untilσs+1= σs:

1 Compute the fixed pointvs ofFs);

2 Improve the policy: choose an optimal policy forvs, that is σs+1∈ Σsuch thatF(vs) = Fs+1)(vs)

withσs+1= σs as soon as this is possible.

Step 1 is solved by using Policy iteration for the (one-player) game with fixed policyσs, which constructsvs,l andδs,l fromδs,0.

(11)

Policy iterations for discounted games

(Howard, 1960)for 1-player games,(Denardo, 1967)for 2-player games.

With control terminology:

Given an initial policyσ0∈ Σ, apply successively the two following steps

fors≥ 0untilσs+1= σs:

1 Compute the valuevsof the game with fixed policyσs, that is the

solution ofv = Fs)(v );

2 Improve the policy: choose an optimal policy forvs, that is σs+1∈ Σsuch thatF(vs) = Fs+1)(vs)or equivalently:

σis+1 ∈ argmin a∈Ai    max b∈Bi   X j∈[n] Mijabvjs + riab      , i ∈ [n],

(12)

(Howard, 1960)for 1-player games,(Denardo, 1967)for 2-player games.

Simplex algorithm for 1-player games with Dantzig pivoting:

Given an initial policyσ0∈ Σ, apply successively the two following steps

fors≥ 0untilσs+1= σs:

1 Compute the valuevsof the game with fixed policyσs, that is the

solution ofv = Fs)(v );

2 Improve the policy: choose a policyσs+1∈ Σsuch that

σis+1 ∈ argmin a∈Ai    max b∈Bi   X j∈[n] Mijabvjs + riab      , i ∈ [n],

(13)

Policy iterations for discounted games

❄ ✛ ✻ ❆ ❆ ❆ ❆ ❆ ❆ ❆ ❆ ❆ ❆ ❆ ❆ ❆❆❯ . . . . . . PI external . . . PI internal . . . σ0 δs,0 σs+1 σs σs−1 δs,l−1 δs,l

(14)

❄ ✛ ✻ ❆ ❆ ❆ ❆ ❆ ❆ ❆ ❆ ❆ ❆ ❆ ❆ ❆❆❯ . . . . . . PI external . . . PI internal . . . (Linear Equations) (Nonlinear Equations) vs vs−1 vs−2 v vs,l−1 vs,l vs

(15)

Policy iterations for discounted games: monotone

convergence

• The sequence(vs)s≥0is nonincreasing;

• Hence, the sequence(σs)s≥0does not visit the same policy two

times, until it becomes stationary;

• So the sequence(vs)sis stationary after a finite time (at most♯Σ),

(16)

convergence

• The sequence(vs)s≥0is nonincreasing;

• Hence, the sequence(σs)s≥0does not visit the same policy two

times, until it becomes stationary;

• So the sequence(vs)sis stationary after a finite time (at most♯Σ),

and converges towards the solutionv ofv = F (v ).

• Whensis fixed, the sequence(vs,l)l is nondecreasing;

• Hence, the sequence(δs,l)l does not visit the same policy two

times, until it becomes stationary;

• So the sequence(vs,l)l is stationary after a finite time (at most♯∆),

and converges towards the solutionvs ofv = Fs) (v ).

(17)

Policy iterations for discounted games: well known

properties

• The Policy iterations converge faster than the value iterations: for alls ≥ 0,v ≤ vs+1≤ F (vs) ≤ vs, sov ≤ vs ≤ Fs(v0) ≤ v0.

• If the discount factor is uniformely bounded by some constant

¯

γ < 1, then for the sup-norm, we have:

kvs+1− v k ≤ kF (vs) − v k ≤ ¯γkvs− v k .

• For 1-player games with an infinite number of actions and under regularity conditions, Policy iterations coincide with the Newton algorithm, and have asuper-linear convergence.

• However, in general, the number of (external) iterations is bounded by♯Σ ≥ 2nif♯A

(18)

• (Friedmann, 2009)showed a 2-player deterministic game problem withγ ≃ 1and an exponential number of iterations.

• (Fearnley, 2010)and(Andersson, 2009)showed the same for a 1-player stochastic game.

(19)

Policy iterations for discounted games: recent results

(Ye, 2011)showed that Policy iteration algorithm and Simplex algorithm solve 1-player discounted games with fixed discount factorγ < 1in

strongly polynomialtime.

(Hansen, Miltersen and Zwick, 2011)extended and improved this result to Policy iteration algorithm for 2-player games. They show that the number of iterationssmax(to obtain stationarity) satisfies:

smax≤ (m + 1)(1 + log(n2/(1 − γ)) − log(γ) ) = O( m 1− γlog n 1− γ),

withm=thetotal number of actions: the number of(i, a, b)withi∈ [n],

a∈ Ai andb∈ Bi.

(Feinberg, Huang, 2013): Same for a one-player game with

mean-payoff, and a statei0such thatPia,i0 ≥ 1 − γ, for alli ∈ [n], a ∈ Ai.

Question: What remains true when the discount factorsγiab are not uniformely bounded by a constant< 1?

(20)

2-player games satisfying

r(M(σδ)) ≤ λ ∀σ ∈ Σ, δ ∈ ∆

is strongly polynomial. More precisely, the number of external iterations

smaxsatisfies: smax≤ (m1− n)(1 + ⌊ log(1 − λ) log(λ) ⌋) = O( m1− n 1− λ log 1 1− λ),

withm1=thetotal number of actions of the first player: the number of

(21)

Proof. • Adapt the proof of(Hansen, Miltersen and Zwick, 2011)by using sup-norms instead ofℓ1norms and the nonlinear mapsF(δ)

to obtain the above bound when the discount factors are≤ λ. A similar bound is obtained by (Scherrer, 2013)in the one-player case with fixed discount factor.

• Using nonlinear spectral theory, show that for allλ < µ < 1, there existsϕ ∈ Rnsuch thatϕi > 0,i ∈ [n], andM(σδ)ϕ ≤ µϕ.

• LetG(v ) = ϕ−1F(ϕv )withϕv = (ϕivi)i∈[n]. ThenGis the dynamic

programming operator of a game with discount factors≤ µ, and the sequence of policiesσs forF andGare the same, so issmax.

• Equivalently,F is contracting onRnwith contraction factorµ, for the weighted sup-normk · kϕdefined by:

kv kϕ := max

i∈[n]|

vi ϕi

| ∀v ∈ Rn .

(22)

to obtain the above bound when the discount factors are≤ λ. A similar bound is obtained by (Scherrer, 2013)in the one-player case with fixed discount factor.

• Using nonlinear spectral theory, show that for allλ < µ < 1, there existsϕ ∈ Rnsuch thatϕi > 0,i ∈ [n], andM(σδ)ϕ ≤ µϕ. ← details

• LetG(v ) = ϕ−1F(ϕv )withϕv = (ϕivi)i∈[n]. ThenGis the dynamic

programming operator of a game with discount factors≤ µ, and the sequence of policiesσs forF andGare the same, so issmax.

• Equivalently,F is contracting onRnwith contraction factorµ, for the weighted sup-normk · kϕdefined by:

kv kϕ := max

i∈[n]|

vi ϕi

| ∀v ∈ Rn .

(23)

Definition (Nonlinear spectral radii(Nussbaum,Mallet-Paret, 1998))

Lethbe a nonlinear continuous positively homogenous map on a closed convex coneCofRn(h(λv ) = λh(v )for allλ > 0andv ∈ C):

• Thecone eigenvalue spectral radiusofh,ˆrC(h), is the maximal

modulus of an eigenvalue ofhinC, whereλis an eigenvalue associated tov ∈ C\{0}ifh(v ) = λv.

• TheCollatz-Wielandt numbercwC(h)is the infimum of the

super-eigenvalues ofh, whereλ > 0is a super-eigenvalue if there existsv in the interior ofC such thath(v ) ≤ λv.

• TheBonsall’s spectral radiusofhis defined as:

rC(h) := inf k≥1kh kk1/k C , with khkC := sup x∈C, kxk=1 kh(x)k ,

(24)

C= Rn

+, all the above spectral radius notions ofhcoincide:

r(h) = infk≥1khkk1/kRn

+

= max{λ ∈ R | ∃v ∈ Rn+\{0}, h(v ) = λv }

= inf{λ > 0 | ∃v ∈ (R

+)n, h(v ) ≤ λv }

Proposition (A. Gaubert, Nussbaum, arXiv 2011)

Assume thathandhπ are continuous, positively homogenous, order

preserving selfmaps ofRn+, for allπ ∈ Π, and thath(v ) = maxπ∈Πhπ(v )

for allv ∈ Rn+, then

r(h) = max

π∈Πr(hπ) .

Applying the proposition toh(v ) := maxσ∈Σmaxδ∈∆(M(σδ)v), we get

thatr(h) ≤ λ < µand so by the theorem, there existsϕ ∈ (R∗+)nsuch

(25)

Consider the value function of the gamewith mean-payoff: ηx = inf (αk)k ≥0 sup (βk)k ≥0 lim sup T→∞ 1 TE "T−1 X k=0 rαkk xk | X0= x # .

LetF be the dynamic programming operator such thatγab i ≡ 1. F is additively homogeneous. We say thatv ∈ Rnis an(nonlinear

additive) eigenvectororbiaisofF witheigenvalueρ ∈ RifF(v ) = ρ + v.

• Ifρexists, thenηx = ρfor allx ∈ [n].

• If all the matricesM(σδ)are irreducible, thenρexists and the eigenvectorv is unique up to an additive constant.

• Other existence results ofρ:Bather, 1973,Gaubert, Gunawardena, 2001.

(26)

(Hoffman and Karp, 1966)We have to solveρ + v = F (v ). Using operators:

Given an initial policyσ0∈ Σ, apply successively the two following steps

fors≥ 0untilσs+1= σs:

1 Compute the additive eigenvalue and eigenvectorρs andvs of Fs), that is the solution ofρ + v = Fs)(v );

2 Improve the policy: choose an optimal policy forvs, that is σs+1∈ Σsuch thatF(vs) = Fs+1)

(vs)

withσs+1= σs as soon as this is possible.

Step 1 is solved by using Policy iteration for the (one-player) game with fixed policyσs, which constructsρs,l,vs,l andδs,l fromδs,0.

(27)

Policy iterations for “irreducible” mean-payoff games

(Hoffman and Karp, 1966)We have to solveρ + v = F (v ). With control terminology:

Given an initial policyσ0∈ Σ, apply successively the two following steps

fors≥ 0untilσs+1= σs:

1 Compute the valueρs and the biaisvs of the game with fixed policy σs, that is the solution ofρ + v = Fs)

(v );

2 Improve the policy: choose an optimal policy forvs, that is σs+1∈ Σsuch thatF(vs) = Fs+1) (vs)or equivalently: σis+1 ∈ argmin a∈A    max b∈B   X j∈[n] Mijabvjs + riab      , i ∈ [n],

(28)

monotone convergence

• The sequence(ρs)

s≥0is nonincreasing;

• Ifρs= ρs+1, thenvs− vs+1is constant andvs = v.

• Hence, the sequence(σs)s≥0does not visit the same policy two

times, until it becomes stationary;

• So the sequence(ρs, vs)s is stationary after a finite time (at most ♯Σ), up to an additive constant, and converges towards the solution

(29)

Policy iterations for “irreducible” mean-payoff games:

monotone convergence

• The sequence(ρs)

s≥0is nonincreasing;

• Ifρs= ρs+1, thenvs− vs+1is constant andvs = v.

• Hence, the sequence(σs)s≥0does not visit the same policy two

times, until it becomes stationary;

• So the sequence(ρs, vs)s is stationary after a finite time (at most ♯Σ), up to an additive constant, and converges towards the solution

(ρ, v )ofρ + v = F (v ).

• Whensis fixed, the sequence(ρs,l)

l is nondecreasing;

• Ifρs,l = ρs,l+1, thenvs,l− vs,l+1is constant andvs,l = vs.

• Hence, the sequence(δs,l)l does not visit the same policy two

times, until it becomes stationary;

• So the sequence(ρs,l, vs,l)

l is stationary after a finite time (at most ♯∆), and converges towards the solution(ρs, vs)of

(30)

Tij(M) = E[inf{k ≥ 1 | Xk = j} | X0= i] ,

the expected first return (or hitting) time in statej, starting fromi. Note thatTii0(M) < +∞for alli∈ [n]if and only ifMhas a unique

recurrent (final) class andi0belongs to it.

Theorem (A., Gaubert, arXiv:1310.4953)

Let us fixK > 0and a statei0. The policy iteration algorithm for the

class of 2-player mean-payoff games such that

Tii0(M

(σδ)) ≤ K ∀σ ∈ Σ, δ ∈ ∆, i ∈ [n]

is strongly polynomial. More precisely, the number of external iterations

smaxsatisfies:

smax≤ (m1− n)(1 + ⌊

log(K )

log(K /(K − 1)⌋) = O((m1− n)K log K ),

(31)

Sketch of the proof. • Letϕ ∈ (R∗

+)n be defined by:

ϕi = maxσ∈Σmaxδ∈∆Tii0(M(σδ)).

• LetQ(σδ)be obtained fromM(σδ)by putting itsi

0th column to zero.

Thenϕ = 1 + maxσ∈Σmaxδ∈∆(Q(σδ)ϕ).

• LetN(σδ)be obtained fromM(σδ)by replacing itsi0th column by the

nonnegative vector(ϕ − 1 − Q(σδ)ϕ)/ϕ

i0.

N(σδ)has nonnegative entries and satisfies:

N(σδ)ϕ = ϕ − 1 ≤ λϕ withλ = 1 − 1/K ⇒ r (N(σδ)) ≤ λ .

• Then the map

G(v ) = min

σ∈Σmaxδ∈∆(N

(σδ)v+ r(σδ)) , v ∈ Rn

satisfies the assumptions of the theorem for discounted games.

• Ifvi0 = 0, thenρ + v = F (v ) ⇔ ρϕ + v = G(ρϕ + v ).

• Hence, the sequences of policiesσsandδs,l forF andGare the same.

(32)

11 10 13 12 20 21 17 16 19 18 15 1 3 2 5 4 7 6 9 8 14

Nodes = web pages Arcs = hyperlinks

21 : spammer page 1

: non controlled page.

Associated Markov matrix

S:Sij = 1/Ni if(i, j)is an

hyperlink,Sij = 0

otherwise;Ni =number of

hyperlinks fromi.

The PageRank is the invariant measureπofS.

(33)

• Letv be the preference probability vector of the Web search engine

• Letαbe a damping factor: the probability for a Web surfer to use the Web seach engine.

• Usually, one replacesSbyαS + (1 − α)✶v,✶ = (1 · · · 1)T.

• Similar to consider the Markov matrix of the Web with the Web search engine: M =hα0✶ (1−α)Sv i.

• Ifr is an instantaneous reward such thatri = 1fori = sand0

otherwise, then the mean-payoff is the PageRank (frequency of visit)πs of the spammer sites.

• Optimizing the spammer site is a 1-player game with mean-payoff

(see for instance (Fercoq, A., Bouhtou, Gaubert, IEEE TAC 2013).

A zero-sum game problem:

• σ ∈ Σis the policy of the Web search engine, it controlsv and wants to minimize the PageRank of the spammer site;

• δ ∈ ∆is the policy of the spammer, it controls the rows ofS with index in his site, and wants to maximize its PageRank.

(34)
(35)

Related recent results for 1-player discounted games

• (Post, Ye, 2012)show that the simplex algorithm for deterministic MDP (1-player games) is strongly polynomial independently of the discount factor: it stops afterO(n5m2log2n)iterations, wheremis

the number of possible actions by state (thusm1= nm).

• (Scherrer, 2013)generalizes this result to stochastic MDP which satisfy a bound which may be seen (and is equal when the discount factorγ tends to1) as a boundτr on the expected first

return time to recurrent states and a boundτt on the expected exit

time from transient states. Under these conditions the simplex algorithm stops after O(n3m2τ

rτtlog2(nτrτt)).

• (Scherrer, 2013)shows a similar result for Policy Iteration algorithm for stochastic MDP (1-player games), when the set of transient states is independent of the strategy. Under these conditions the Policy iteration algorithm stops after

n(m − 1)(⌈τrlog(nτr)⌉ + ⌈τtlog(nτt)⌉)iterations.

• However, this assumption implies that the recurrent classes are independent of the strategy.

(36)

class of 2-player discounted games with fixed discount factor,

M(σδ)= γP(σδ)withγ < 1, such that

Tii0(P(σδ)) ≤ K ∀σ ∈ Σ, δ ∈ ∆, i ∈ [n]

is strongly polynomial. More precisely, the number of external iterations

smaxsatisfies:

smax≤ (m1− n)(1 + ⌊

log(K )

log(K /(K − 1)⌋) = O((m1− n)K log K ),

withm1=thetotal number of actions of the first player.

(37)

For a Markov matrixM, a statei and setCof states, denote:

TiC(M) = E[inf{k ≥ 1 | Xk ∈ C} | X0= i] ,

the expected first return (or hitting) time in setC, starting fromi.

Theorem (A., Gaubert, 2014)

Let us fixK > 0and a subsetCof states with cardinalitys. The policy iteration algorithm for the class of 2-player multichain mean-payoff games such that for allσ ∈ Σ, δ ∈ ∆, each final class ofM(σδ)contains exactly one element ofCand

TiC(M(σδ)) ≤ K ∀i ∈ [n]

is strongly polynomial. More precisely, the number of external iterations

smaxsatisfies:

smax≤ (m1− n)(1 + ⌊

log(sK )

log(sK /(sK − 1)⌋) = O((m1− n)sK log(sK )),

(38)

• In general,F may not have additive eigenvalue and eigenvector, that isρandv such thatρ + v = F (v ).

• If the action spacesAi andBi are finite for alli ∈ [n], thenF is

polyhedral, and since it is also nonexpansive, by theKohlberg

(1980)theorem, there existηandv inRnsuch that

F(tη + v ) = (t + 1)η + v , for t large enough.

(η, v )is called aninvariant half-line.

• Thenη is the value of the game with mean-payoff.

• Moreover, there existFˆ andF´·such that(η, v )is an invariant

half-line if and only if it satisfies the system:

(

η = ˆF(η) , η + v = ´Fη(v ) .

(39)

Policy iterations for multichain mean-payoff games

Construct a sequence of policiesσs, valuesηs and biaisvs.

They were introduced and proved to converge by

• (Howard, 1960)and(Denardo and Fox, 1968)for 1-player multichain mean-payoff games,

• (V ¨oge and Jurdzi ´nski, 2000)for parity games,

• (Cochet-Terrasson, Gaubert, Gunawardena, 1998 and 1999), (Bjorklund, Sandberg, Vorobyov, 2004), (Jurdzi ´nski,Paterson, Zwick, 2006)for 2-player deterministic games,

• (Cochet-Terrasson and Gaubert, 2006), (A., Cochet-Terrasson, Detournay, and Gaubert, arXiv:1208.0446, and CDC 2013),

(Detournay,PIGAMES library, 2012), (Bourque, Raghavan, preprint, 2012)for general multichain 2-player stochastic games.

(Detournay, 2012).

To avoid cycling, one need to add some constraints onvs, for instance:

• fix the valuevis= 0at one pointi of each final class ofM(σδ)

(Howard, and Denardo and Fox, for one-player games);

• by a nonlinear projection (Cochet-Terrasson and Gaubert); and to choose optimal policies in a conservative way.

(40)

polynomial when restricted to the class of games such thatthe spectral radii of allM(σδ)are bounded byλ < 1. This result is invariant by diagonal scaling.

• The policy iteration algorithm for ergodic mean-payoff games is strongly polynomial when restricted to the class of ergodic games such thatthe expected first return (or hitting) time in some fixed

statei0of the Markov chain associated to anyM(σδ)and initial state

is bounded byK < ∞.

• Same result fordiscounted games.

• Same result for multichain mean-payoff games, wheni0is replaced

by a set of statesC, and each recurrence class contains exactly one element ofC.

Open:

• Is the policy iteration algorithm for multichain stochastic games strongly polynomial, under some more general constraints on the

References

Related documents

If you have received sickness benefits for the maximum number of days and do not have any sickness benefit qualifying income or it is less than SEK 80,300, you can receive

Send your Holy Spirit to equip them with the gifts that are needed to serve you: For our Archbishop Michael and all ministers of God’s word and sacraments, that they may serve

How: You can join the Recreation GoTo Meeting by following the link or phone number provided above during the scheduled meeting

My data demonstrate that Caspase-2 functions as a tumor suppressor in a mouse model of oncogenic Kras -driven lung cancer, negatively impacting chemotherapeutic response to

Slechts onder bijzondere omstandigheden, bijvoorbeeld indien de benoeming slechts voor zeer korte duur is (waarvan naar het oordeel van kantonrechter in het onderhavige geval

In this paper, we identify several useful, general- purpose exception handling patterns and demonstrate their applicability in business process and software

Most of the previous study took the research in the classroom, video movie or in social media while the researcher tried to take the English meeting club as