
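$$v_{k+1} = R^i\big(BR^i(\Pi^{-i}_{k+1}), \Pi^{-i}_{k+1}\big) \le (1-\alpha)\, R^i\big(BR^i(\Pi^{-i}_k), \Pi^{-i}_k\big) + \alpha \hat{R} = (1-\alpha)\, v_k + \alpha \hat{R}.$$

Then,

$$\begin{aligned}
R^i(B^i_{k+1}, \Pi^{-i}_{k+1}) &= (1-\alpha)\, R^i(B^i_{k+1}, \Pi^{-i}_k) + \alpha\, R^i(B^i_{k+1}, B^{-i}_{k+1}) \\
&\ge (1-\alpha)(v_k - \varepsilon) - \alpha \hat{R} \\
&\ge v_{k+1} - (1-\alpha)\varepsilon - 2\alpha \hat{R} \\
&\ge v_{k+1} - \big(\varepsilon + 2\alpha \hat{R}\big).
\end{aligned}$$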

This bounds the absolute amount by which reinforcement learning needs to improve the best response profile to achieve a monotonic decay of the optimality gap $\varepsilon_k$. However, $\varepsilon_k$ only needs to decay asymptotically. Given $\alpha_k \to 0$ as $k \to \infty$, the bound suggests that in practice a finite amount of learning per iteration might be sufficient to achieve asymptotic improvement of best responses. In Section 5.6.1, we empirically analyse whether this is indeed the case. Furthermore, in our experiments on the robustness of (full-width) XFP (see Section 4.5.3), XFP robustly approached strategies of correspondingly lower quality when given noisy, low-quality best responses.

5.4 Average Strategy Learning

This section develops supervised learning of approximate average strategies from sampled experience. We first translate fictitious players’ averaging of strategies into an equivalent problem of agents learning to model themselves. Then, we describe how appropriate experience for such models may be sampled. Next, we present suitable memory architectures for off-policy learning from such experience. Finally, we discuss the approximation effects of machine learning on the fictitious play process.

5.4.1 Modeling Oneself

Consider the point of view of a particular player $i$ who wants to learn a behavioural strategy $\pi$ that is realization-equivalent to a convex combination of their own normal-form strategies, $\Pi = \sum_{j=1}^{k} w_j B_j$, $\sum_{j=1}^{k} w_j = 1$. This task is equivalent to learning a model of the player’s behaviour when it is sampled from $\Pi$. In addition to describing the behavioural strategy $\pi$ explicitly, Proposition 4.3.2 prescribes a way to sample data sets of such convex combinations of strategies.

Corollary 5.4.1. Let $\{B_k\}_{1 \le k \le n}$ be mixed strategies of player $i$, $\Pi = \sum_{k=1}^{n} w_k B_k$, $\sum_{k=1}^{n} w_k = 1$, a convex combination of these mixed strategies and $\mu^{-i}$ a completely mixed sampling strategy profile that defines the behaviour of player $i$’s fellow agents. Then the expected behaviour of player $i$, when sampling from the strategy profile $(\Pi, \mu^{-i})$, defines a behavioural strategy,

$$\pi(a \mid u) = P\left(A^i_t = a \mid U^i_t = u,\; A^i_t \sim \Pi\right) \quad \forall u \in \mathcal{U}^i,\ \forall a \in \mathcal{A}(u), \tag{5.1}$$

that is realization equivalent to $\Pi$.

Proof. This is a direct consequence of realization equivalence. In particular, consider player $i$ sampling from $\Pi = \sum_{k=1}^{n} w_k B_k$. For $k = 1, \dots, n$, let $\beta_k$ be a behavioural strategy that is realization equivalent to $B_k$. Choose $u \in \mathcal{U}^i$ and $a \in \mathcal{A}(u)$. Then, when sampling from $(\Pi, \mu^{-i})$, the probability of player $i$ sampling action $a$ in information state $u$ is

$$\begin{aligned}
P(u, a \mid \Pi, \mu^{-i}) &= \sum_{k=1}^{n} w_k\, x_{\beta_k}(\sigma_u)\, \beta_k(a \mid u) \sum_{s \in I^{-1}(u)} x_{\mu^{-i}}(\sigma_s) \\
&= \left( \sum_{k=1}^{n} w_k\, x_{\beta_k}(\sigma_u)\, \beta_k(a \mid u) \right) \left( \sum_{s \in I^{-1}(u)} x_{\mu^{-i}}(\sigma_s) \right),
\end{aligned}$$

where $x_{\mu^{-i}}(\sigma_s)$ is the product of fellow agents’ and chance’s realization probabilities. Conditioning on reaching $u$ yields

$$P(a \mid u, \Pi, \mu^{-i}) \propto \sum_{k=1}^{n} w_k\, x_{\beta_k}(\sigma_u)\, \beta_k(a \mid u),$$

which is equivalent to the explicit update in Equation 4.1.

Hence, the behavioural strategy $\pi^i$ can be learned approximately from a data set consisting of trajectories sampled from $(\Pi^i, \mu^{-i})$. Recall that we can sample from $\Pi = \sum_{k=1}^{n} w_k B_k$ by sampling whole episodes from each constituent $\beta^i_k \equiv B^i_k$ with probability $w_k$.
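The following is a minimal sketch of this sampling scheme, not the author's implementation. It assumes a hypothetical single-player decision tree (information state `u0` with actions `L`/`R`, and `u1` reached only after `L`), so the opponents' factor cancels and is omitted. All names and numbers are illustrative. It compares the realization-weighted strategy at `u1` with an estimate obtained by committing to one constituent strategy per episode.

```python
import random

# Hypothetical toy tree for player i (an assumption, not from the source):
# at "u0" the player picks "L" or "R"; only "L" leads on to "u1",
# where the player picks "a" or "b".
BETA_1 = {"u0": {"L": 0.9, "R": 0.1}, "u1": {"a": 1.0, "b": 0.0}}
BETA_2 = {"u0": {"L": 0.1, "R": 0.9}, "u1": {"a": 0.0, "b": 1.0}}
STRATEGIES = [BETA_1, BETA_2]
WEIGHTS = [0.5, 0.5]


def realization_weighted(action, strategies, weights):
    """pi(action | u1), weighting each constituent by w_k * x_k(sigma_u1)."""
    reach = [w * beta["u0"]["L"] for beta, w in zip(strategies, weights)]
    numer = sum(r * beta["u1"][action] for beta, r in zip(strategies, reach))
    return numer / sum(reach)


def sampled_estimate(action, strategies, weights, episodes, rng):
    """Estimate pi(action | u1) by following one constituent per episode."""
    visits = hits = 0
    for _ in range(episodes):
        # Commit to a single constituent strategy for the whole episode.
        beta = rng.choices(strategies, weights=weights)[0]
        first = rng.choices(list(beta["u0"]), weights=list(beta["u0"].values()))[0]
        if first != "L":
            continue  # u1 is not reached in this episode
        visits += 1
        second = rng.choices(list(beta["u1"]), weights=list(beta["u1"].values()))[0]
        hits += second == action
    return hits / visits
```

In this toy example a naive per-state average of the two constituents would assign probability 0.5 to `a` at `u1`, whereas the realization-weighted strategy assigns 0.9, because the first constituent reaches `u1` far more often.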

5.4.2 Sampling Experience

In fictitious play, our goal is to keep track of the average mixed strategy profile

$$\Pi_{k+1} = \frac{1}{k+1} \sum_{j=1}^{k+1} B_j = \frac{k}{k+1} \Pi_k + \frac{1}{k+1} B_{k+1}.$$

Both $\Pi_k$ and $B_{k+1}$ are available at iteration $k$ and we can therefore apply Corollary 5.4.1 to sample experience of a behavioural strategy $\pi^i_{k+1}$ that is realization-equivalent to $\Pi^i_{k+1}$. In particular, for a player $i$ we would repeatedly sample episodes of him following his strategy $\pi^i_{k+1}$ against a fixed, completely mixed strategy profile of his opponents, $\mu^{-i}$. Note that player $i$ can follow $\pi^i_{k+1}$ by choosing between $\pi^i_k$ and $\beta^i_{k+1}$ at the root of the game with probabilities $\frac{k}{k+1}$ and $\frac{1}{k+1}$ respectively and committing to his choice for an entire episode. The player would experience sequences of state-action pairs, $(s_t, a_t)$, where $a_t$ is the action he chose at information state $s_t$ following his strategy. This experience constitutes data that approximates the desired strategy that we want to learn, e.g. by fitting the data set with a function approximator. Alternatively, players can collect state-policy pairs, $(s_t, \rho_t)$, where $\rho_t = \mathbb{E}_t\left[\mathbf{1}_{\{A_t = 1\}}, \dots, \mathbf{1}_{\{A_t = n\}}\right]$ is the player’s policy at state $S_t$.
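The root-level mixing rule can be sketched as follows. This is an illustrative sketch under stated assumptions: mixed strategies are represented as plain probability vectors over pure strategies, and the function names are hypothetical. `incremental_average` applies the update $\Pi_{k+1} = \frac{k}{k+1}\Pi_k + \frac{1}{k+1}B_{k+1}$, and `sample_constituent` unrolls the recursive choice between the previous average and the newest best response, which should select each of $B_1, \dots, B_{k+1}$ with equal probability.

```python
import random


def incremental_average(best_responses):
    """Running average of mixed strategies given as probability vectors."""
    avg = [0.0] * len(best_responses[0])
    for k, b in enumerate(best_responses):  # k strategies averaged so far
        avg = [k / (k + 1) * p + 1 / (k + 1) * q for p, q in zip(avg, b)]
    return avg


def sample_constituent(num_strategies, rng):
    """Pick the index j (1-based) of the best response to follow this episode.

    At level m, follow the newest best response B_m with probability 1/m;
    otherwise defer to the previous average, i.e. recurse on level m - 1.
    """
    m = num_strategies
    while m > 1:
        if rng.random() < 1.0 / m:
            return m
        m -= 1
    return 1
```

Committing to the sampled index for the entire episode is what keeps the resulting behaviour realization-equivalent to the average; re-drawing the index at every decision point would generally produce a different strategy.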

Instead of resampling a whole new data set of experience at each iteration, we can incrementally update our data set from a stream of best responses, $\beta^i_j$, $j = 1, \dots, k$. In order to constitute an unbiased approximation of an average of best responses, $\frac{1}{k}\sum_{j=1}^{k} B^i_j$, we need to accumulate the same number of sampled episodes from each $B^i_j$, and these need to be sampled against the same fixed opponent strategy profile $\mu^{-i}$. However, we suggest using the average strategy profile $\pi_k$ as the (now varying) sampling distribution $\mu_k$. Sampling against $\pi_k$ has the benefit of focusing the updates on states that are more likely in the current strategy profile. When collecting samples incrementally, the use of a changing sampling distribution $\pi_k$ can introduce bias. However, in fictitious play $\pi_k$ changes more slowly over time, and thus it is conceivable for this bias to decay over time as well.

5.4.3 Memorizing Experience

In principle, collecting sampled data from an infinite stream of best responses would require an unbounded memory. To address this issue, we propose a table-lookup counting model and the use of reservoir sampling (Vitter, 1985) for general supervised learning.

Table-Lookup. A simple, but possibly expensive, approach to memorizing experience of past strategies is to count the number of times each action has been taken. For each sampled tuple, $(s^i_t, a^i_t)$, we update the count of the chosen action:

$$N(s^i_t, a^i_t) \leftarrow N(s^i_t, a^i_t) + 1.$$

Alternatively, we can accumulate local policies $\rho^i_t = \beta^i_t(s^i_t)$, where $s^i_t$ is agent $i$’s information state and $\beta^i_t$ is the strategy that the agent pursued at this state when this experience was sampled:

$$\forall a \in \mathcal{A}(s_t): \quad N(s_t, a) \leftarrow N(s_t, a) + \rho^i_t(a).$$

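The average strategy can be readily extracted by normalising the counts, with $N(s_t) = \sum_{a \in \mathcal{A}(s_t)} N(s_t, a)$:

$$\forall a \in \mathcal{A}(s_t): \quad \pi(s_t, a) \leftarrow \frac{N(s_t, a)}{N(s_t)}.$$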

As each sampled tuple, $(s^i_t, a^i_t)$ or $(s^i_t, \rho^i_t)$, only needs to be counted once, we can parse the experience of average strategies in an online fashion and thus avoid infinite memory requirements.
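The counting model described above might be sketched as follows. The class and method names are hypothetical, and information states and actions are assumed to be hashable keys. The sketch supports both update variants (sampled actions and local policies) and extracts the average strategy by normalising the counts.

```python
from collections import defaultdict


class CountingAverageStrategy:
    """Table-lookup average strategy built from action counts N(s, a)."""

    def __init__(self):
        # counts[state][action] holds N(state, action)
        self.counts = defaultdict(lambda: defaultdict(float))

    def add_action(self, state, action):
        """Count one sampled state-action pair (s_t, a_t)."""
        self.counts[state][action] += 1.0

    def add_policy(self, state, policy):
        """Accumulate a local policy rho_t, given as {action: probability}."""
        for action, prob in policy.items():
            self.counts[state][action] += prob

    def average_strategy(self, state):
        """Return pi(state, .) = N(state, .) / N(state); empty if unvisited."""
        table = self.counts.get(state)
        if not table:
            return {}
        total = sum(table.values())
        return {action: n / total for action, n in table.items()}
```

Each tuple is folded into the table once and can then be discarded, so memory grows with the number of distinct information states rather than with the number of samples.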

Reservoir. In large domains, it is infeasible to keep track of action counts at all information states. We propose to track a random sample of past best response behaviour with a finite-capacity memory by applying reservoir sampling to the stream of state-action pairs sampled from the stream of best responses.

Reservoir sampling (Vitter, 1985) is a class of algorithms for keeping track of a finite random sample from a possibly large or infinite stream of items, $\{x_k\}_{k \ge 1}$. Assume a reservoir (memory), $\mathcal{M}$, of size $n$. For the first $n$ iterations, all items, $x_1, \dots, x_n$, are added to the reservoir. At iteration $k > n$, the algorithm replaces a randomly chosen item in the reservoir with $x_k$ with probability $\frac{n}{k}$, and discards $x_k$ otherwise. Thus, at any iteration $k$, the reservoir contains a uniform random sample of the items that have been seen so far, $x_1, \dots, x_k$. Exponential reservoir sampling (Osborne et al., 2014) replaces items in the reservoir with a probability that is bounded from below, i.e. $\max\left(p, \frac{n}{k}\right)$, where $p$ is a chosen minimum probability. Thus, once $k = \frac{n}{p}$ items have been sampled, the items’ likelihood of remaining in the reservoir decays exponentially.
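As a minimal sketch of the replacement rule described above (class and parameter names are hypothetical, and this is the simple per-item variant rather than any optimised algorithm), the following memory admits the $k$-th item with probability $\max(p, n/k)$ and overwrites a uniformly chosen slot. Setting `min_prob=0.0` corresponds to vanilla reservoir sampling, and a positive `min_prob` to the exponential variant.

```python
import random


class Reservoir:
    """Finite-capacity memory holding a random sample of a stream."""

    def __init__(self, capacity, min_prob=0.0, rng=None):
        self.capacity = capacity      # n: reservoir size
        self.min_prob = min_prob      # p: lower bound on replacement probability
        self.rng = rng or random.Random()
        self.items = []
        self.seen = 0                 # k: number of stream items observed

    def add(self, item):
        self.seen += 1
        if len(self.items) < self.capacity:
            # The first n items are always stored.
            self.items.append(item)
            return
        replace_prob = max(self.min_prob, self.capacity / self.seen)
        if self.rng.random() < replace_prob:
            # Overwrite a uniformly chosen slot; otherwise the item is discarded.
            self.items[self.rng.randrange(self.capacity)] = item
```

A typical use would be to call `add` on every state-action pair produced while following a best response, and to draw supervised-learning minibatches from `items`.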

Adding agents’ experience of their (approximate) best responses to their respective supervised learning memories (reservoirs), $\mathcal{M}_{SL}$, with vanilla reservoir sampling approximates a $\frac{1}{T}$ step size of fictitious play, where $T$ abstracts the iteration counter. Using exponential reservoir sampling results in an approximate step size of $\max\left(\alpha, \frac{1}{T}\right)$, where $\alpha$ depends on the chosen minimum probability and capacity of the reservoir.

5.4.4 Average Strategy Approximation

Let $\tilde{\Pi}^i_k$ be an approximation of the average strategy profile $\Pi_k = \frac{1}{k}\sum_{j=1}^{k} B_j$, learned from a perfect fit of samples from the stream of best responses, $B_k$, $k \ge 1$, e.g. with the counting model. As we collect more samples from this stream, the approximation error, $\tilde{\Pi}^i_k - \Pi^i_k$, decays with increasing $k$. However, these approximation errors can bias the sequence of learned best responses, $B_{k+1}$, as they are trained with respect to the approximate average strategies, $\tilde{\Pi}_k$. While generalised weakened fictitious play only requires asymptotically-perfect best responses, this could slow down convergence to an impractical level. In Section 5.6.1, we empirically analyse the approximation error, $\tilde{\Pi}^i_k - \Pi^i_k$.