Switching Policies - Strategy iteration algorithms for games and Markov decision processes

Chapter 3 Algorithms

3.2.4 Switching Policies

This section describes the final component of a strategy improvement algorithm. Strategy improvement allows any subset of the switchable edges to be switched in each iteration. To have a complete algorithm, we therefore need a procedure that picks which edges should be switched in each iteration. We will call this procedure a switching policy. A simple switching policy can be thought of as a function that picks a subset of the edges or actions that are switchable in the current strategy. In the average-reward MDP setting, we will have one switching policy for picking gain switchable actions, and another for picking bias switchable actions.

The switching policies that have been studied so far work for both two player games and Markov decision processes, and the upper bounds that have been found for the running time of these switching policies are usually the same across the two models. On the other hand, the lower bounds that have been found are usually specific to a particular type of model. We will indicate the scope of the lower and upper bound results as we present them. When we give formal definitions of these switching policies, we will use the game formulation. These definitions can easily be adapted for the MDP setting.

We begin by stating a trivial upper bound on the number of iterations that any strategy improvement algorithm can take to terminate. Since strategy improvement algorithms cannot consider the same strategy twice, the number of iterations is bounded by the total number of positional strategies that can be considered by the algorithm. Let Degree(v) denote the number of outgoing edges from the vertex v in a two player game, or the number of outgoing actions from the vertex v in an MDP. Strategy improvement for two player games must terminate after at most $\prod_{v \in V_{\text{Max}}} \text{Degree}(v)$ iterations, and strategy improvement for MDPs must terminate after at most $\prod_{v \in V} \text{Degree}(v)$ iterations. These bounds hold no matter what choices the switching policy makes.
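The trivial bound is simply a product of out-degrees, which the following minimal sketch computes. The function name, the dictionary representation of Degree(v), and the example vertex names are assumptions made for illustration only.

```python
from math import prod

def trivial_iteration_bound(out_degree, max_vertices=None):
    """Product of Degree(v) over the vertices whose choices the algorithm varies.

    out_degree maps each vertex to Degree(v). For an MDP every vertex is
    included; for a two player game pass the set of Max vertices
    (hypothetical parameter name).
    """
    vertices = out_degree.keys() if max_vertices is None else max_vertices
    return prod(out_degree[v] for v in vertices)

# Hypothetical four-vertex instance.
degrees = {"a": 2, "b": 3, "c": 2, "d": 4}
mdp_bound = trivial_iteration_bound(degrees)                # 2 * 3 * 2 * 4
game_bound = trivial_iteration_bound(degrees, {"a", "b"})   # 2 * 3
```

Even on this small instance the bound grows multiplicatively with each extra vertex, which is why it is of little practical use.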

The simplest possible switching policy is to arbitrarily pick a single switchable edge in each iteration. This switching policy can be defined as $\text{Single}(F) = \{(v, u)\}$, where (v, u) is some edge contained in F. It has long been known that strategy improvement equipped with the single-switch policy can take exponential time. For games, the examples were originally found by Lebedev, and Björklund and Vorobyov adapted them to show an exponential lower bound on the number of iterations taken by their strategy improvement algorithm when it is equipped with the single-switch policy [BV07]. For MDPs, Melekopoglou and Condon have shown an exponential lower bound for a single-switch strategy improvement algorithm using a very similar family of examples [MC94].
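As an illustration, the single-switch policy can be sketched as a function on a set of switchable edges, each represented as a pair (v, u). The definition allows any edge of F to be chosen; taking the smallest edge in sorted order is an assumption made here only so that the sketch is deterministic.

```python
def single_switch(switchable):
    """Return a set holding one edge of F, or the empty set if F is empty.

    Single(F) may pick any edge of F. Choosing min(F) is an arbitrary,
    assumed tie-breaking rule used only for reproducibility.
    """
    if not switchable:
        return set()
    return {min(switchable)}

# Hypothetical set of switchable edges.
F = {("v", "u"), ("a", "b")}
chosen = single_switch(F)
```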

The most natural class of switching policies is the class of all-switches policies. The idea here is that the strategy should be switched at every vertex that has a switchable edge. This defines a class of switching policies, because a vertex may have more than one switchable edge, and different switching policies may pick different edges to switch at each vertex. The most natural all-switches policy is the greedy switching policy, which always picks the edge with maximum appeal.

We formally define the greedy switching policy. We must be aware that there may be more than one edge that maximizes the appeal at a vertex. For the sake of simplicity, we will use an index function to break ties: we will assume that each vertex v is given a unique index in the range {1, 2, ..., |V|}, which we will denote as Index(v). The set of edges that the greedy policy will pick for the strategy σ can then be defined as follows:

$$\text{Greedy}^{\sigma}(F) = \{(v, u) : \text{there is no edge } (v, w) \in F \text{ with } \text{Appeal}^{\sigma}(v, u) < \text{Appeal}^{\sigma}(v, w), \text{ or with } \text{Appeal}^{\sigma}(v, u) = \text{Appeal}^{\sigma}(v, w) \text{ and } \text{Index}(u) < \text{Index}(w)\}.$$
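The greedy policy can be sketched as follows. We assume that the appeal of each switchable edge and the index of each vertex are supplied as dictionaries, and that only edges of F are candidates for selection; these representations and the function name are hypothetical. As in the definition, ties in appeal go to the edge whose target has the larger index.

```python
def greedy_switch(switchable, appeal, index):
    """Pick, at every vertex with a switchable edge, the edge of maximum appeal.

    switchable: set of edges (v, u), playing the role of F.
    appeal:     dict mapping each edge (v, u) to its appeal under sigma.
    index:      dict mapping each vertex to its unique index.
    """
    best = {}  # source vertex -> (comparison key, edge)
    for (v, u) in switchable:
        key = (appeal[(v, u)], index[u])  # appeal first, then target index
        if v not in best or key > best[v][0]:
            best[v] = (key, (v, u))
    return {edge for _, edge in best.values()}

# Hypothetical instance: two switchable edges at v tie on appeal.
F = {("v", "a"), ("v", "b"), ("w", "a")}
appeal = {("v", "a"): 3, ("v", "b"): 3, ("w", "a"): 1}
index = {"a": 1, "b": 2, "v": 3, "w": 4}
picked = greedy_switch(F, appeal, index)
```

Because the indices are unique, exactly one edge is selected at each vertex that has a switchable edge, so the sketch is an all-switches policy.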

The best upper bound that has been shown for strategy improvement algorithms equipped with the greedy policy is $O(2^n/n)$ iterations [MS99]. This upper bound holds for games and for MDPs. For many years, people were unable to find examples upon which strategy improvement equipped with the greedy policy took significantly more than a linear number of iterations to terminate. It was for this reason that greedy was conjectured to always terminate after a polynomial number of steps. However, in a breakthrough result, Friedmann found a family of parity games upon which the strategy improvement algorithm of Vöge and Jurdziński [VJ00] equipped with the greedy policy takes an exponential number of steps [Fri09, Fri10a]. It was later shown that this result can be generalised to prove an exponential lower bound for the strategy improvement algorithm for discounted games [And09]. It is also not difficult to adapt Friedmann's examples to produce a set of input instances upon which the Björklund-Vorobyov strategy improvement algorithm equipped with the greedy policy takes an exponential number of steps.

Perhaps the most intriguing type of switching policies are optimal switching policies. A switching policy is optimal if for every strategy σ it selects a subset of switchable edges F that satisfies $\text{Val}^{\sigma[H]}(v) \le \text{Val}^{\sigma[F]}(v)$ for every set H that is a subset of the switchable edges in σ, and for every vertex v. It is not difficult to show that such a set of edges must always exist, but at first glance it would seem unlikely that such a set could be efficiently computed. Nevertheless, Schewe has given an algorithm that computes such a set in polynomial time for the Björklund-Vorobyov strategy improvement algorithm [Sch08]. Therefore, optimal switching policies can realistically be implemented for solving parity and mean-payoff games. No analogue of this result is known for discounted games or for MDPs.
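To make the definition concrete, the sketch below finds an optimal set by exhaustive search over all subsets of the switchable edges. This is not Schewe's polynomial time algorithm; it takes exponential time and only illustrates the domination condition. The callable `valuation`, which returns the valuation of every vertex after switching a given set of edges, is a hypothetical stand-in, and the toy valuation in the example is invented purely for demonstration.

```python
from itertools import combinations

def brute_force_optimal_switch(switchable, valuation):
    """Return a set F whose valuation dominates that of every other subset H.

    valuation(H) is an assumed callable returning {vertex: value} for the
    strategy obtained by switching the edges in H. Exponential time.
    """
    edges = sorted(switchable)
    candidates = []
    for size in range(len(edges) + 1):
        for combo in combinations(edges, size):
            sources = [v for v, _ in combo]
            if len(sources) == len(set(sources)):  # one switch per vertex
                candidates.append(frozenset(combo))
    values = {H: valuation(H) for H in candidates}
    for F in candidates:
        if all(values[H][v] <= values[F][v]
               for H in candidates for v in values[F]):
            return set(F)
    return None  # only reachable if the toy valuation admits no dominating set

# Toy valuation (an assumption): more switches give a larger value at "x".
toy = lambda H: {"x": len(H)}
best = brute_force_optimal_switch({("a", "b"), ("c", "d")}, toy)
```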

Although using an optimal switching policy would seem likely to produce better results than the greedy policy, Friedmann has shown that his exponential time examples for the greedy policy can be adapted to provide a family of examples upon which an optimal switching policy will take an exponential number of steps [Fri10a]. Therefore, optimal switching policies perform no better than greedy switching policies in the worst case. It should be noted that the word optimal is used to describe the set of edges that an optimal switching policy chooses to switch. It does not imply that a strategy improvement algorithm equipped with an optimal policy will have an optimal running time. There may be switching policies that do not always make the maximal increase in valuations in every iteration, but still perform better in terms of worst case complexity.

Randomized switching policies have been shown to have better worst case complexity bounds. Mansour and Singh considered a switching policy that selects a subset of switchable edges uniformly at random [MS99]. They showed that this switching policy will terminate after an expected $O(2^{0.78n})$ number of iterations for binary games, and an expected $O(((1 + 2/\log k) \cdot k/2)^n)$ number of iterations for games with out-degree at most k. These upper bounds hold for both games and MDPs.
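A random-subset policy of this kind can be sketched as below. We assume the binary setting, in which each vertex has at most one switchable edge, and we assume that the empty subset is excluded so that every iteration makes progress; both points are assumptions of the sketch rather than details taken from [MS99]. Including each edge independently with probability 1/2 and rejecting the empty outcome yields a uniform distribution over the non-empty subsets.

```python
import random

def random_subset_switch(switchable, rng=random):
    """Return a non-empty subset of F chosen uniformly at random.

    rng is any object with a random() method, such as the random module
    or a random.Random instance (parameter name is hypothetical).
    """
    edges = sorted(switchable)
    if not edges:
        return set()
    while True:
        # Each edge is kept with probability 1/2; retry on the empty set.
        chosen = {e for e in edges if rng.random() < 0.5}
        if chosen:
            return chosen

# Hypothetical switchable edges, with a seeded generator for repeatability.
F = {("a", "b"), ("c", "d"), ("e", "f")}
sample = random_subset_switch(F, random.Random(0))
```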

The switching policy with the best currently-known worst case complexity bound is the random-facet switching policy. This switching policy is based on the randomized simplex methods for linear programming given by Kalai [Kal92] and Matoušek, Sharir, and Welzl [MSW96]. The first switching policy based on this method was given by Ludwig [Lud95], and it terminates after an expected $2^{O(\sqrt{n})}$ number of iterations. However, his switching policy only works for binary games. This shortcoming has been rectified by the switching policy described by Björklund and Vorobyov [BV05], which terminates after an expected $2^{O(\sqrt{n \log n})}$ number of iterations for both games and MDPs.

In a recent result, Friedmann, Hansen, and Zwick have found a family of parity games upon which this bound is almost tight: the random-facet switching policy will take an expected $2^{\Omega(\sqrt{n}/\log n)}$ iterations to terminate [FHZ10]. This lower bound can be extended to cover the strategy improvement algorithm for discounted games, and the Björklund-Vorobyov strategy improvement algorithm for the zero mean partition. This lower bound is not known to hold for strategy improvement for MDPs.
