Average Strategy Learning - Reinforcement Learning from Self-Play in Imperfect-Information Game

v_k+1= Ri(BRi(Π−i_k+1), Π−i_k+1) ≤ (1 − α)Ri(BRi(Πk), Πk) + α ˆR = (1 − α)vk+ α ˆR Then, Ri(Bi_k+1, Πk+1) = (1 − α)Ri(Bik+1, Π−ik ) + αR i_(Bi k+1, B−ik+1) ≥ (1 − α)(vk− ε) − α ˆR ≥ vk+1− (1 − α)ε − 2α ˆR ≥ vk+1− ε + 2α ˆR

This bounds the absolute amount by which reinforcement learning needs to improve the best response profile to achieve a monotonic decay of the optimality gap ε_k. However, ε_k only needs to decay asymptotically. Given α_k → 0 as k → ∞, the bound suggests that in practice a finite amount of learning per iteration might be sufficient to achieve asymptotic improvement of best responses. In Section 5.6.1, we empirically analyse whether this is indeed the case in practice. Furthermore, in our experiments on the robustness of (full-width) XFP (see Section 4.5.3), it robustly approached lower-quality strategies for noisy, low-quality best responses.

5.4 Average Strategy Learning

This section develops supervised learning of approximate average strategies from sampled experience. We first translate fictitious players’ averaging of strategies into an equivalent problem of agents learning to model themselves. Then, we describe how appropriate experience for such models may be sampled. Next, we present suit- able memory architectures for off-policy learning from such experience. Finally, we discuss the approximation effects of machine learning on the fictitious play process.

5.4.1 Modeling Oneself

Consider the point of view of a particular player i who wants to learn a behavioural strategy π that is realization-equivalent to a convex combination of their own normal-form strategies, Π = ∑kj=1wjBj, ∑kj=1wj= 1. This task is equivalent to learning a model of the player’s behaviour when it is sampled from Π. In addition to describing the behavioural strategy π explicitly, Proposition 4.3.2 prescribes a way to sample data sets of such convex combinations of strategies.

Corollary 5.4.1. Let {Bk}1≤k≤n be mixed strategies of player i, Π = ∑nk=1wkBk, ∑n_k=1wk= 1 a convex combination of these mixed strategies and µ−i a completely mixed sampling strategy profile that defines the behaviour of player i’s fellow agents. Then the expected behaviour of player i, when sampling from the strategy profile(Π, µ−i), defines a behavioural strategy,

π (a | u) = P Ait= a |Uti= u, Ait∼ Π

∀u ∈Ui∀a ∈A (u), (5.1)

that is realization equivalent to Π.

Proof. This is a direct consequence of realization equivalence. In particular, consider player i sampling from Π = ∑nk=1wkBk. For k = 1, ..., n, let βkbe a behavioural strategy that is realization equivalent to B_k. Choose u ∈Ui and a ∈A (u). Then, when sampling from (Π, µ−i), the probability of player i sampling action a in information state u is P(u, a | Π, µ−i) = n

∑

k=1 w_kx_β_k(σu)βk(a | u)

∑

s∈I−1_(u) x_µ−i(σ_s), = n

∑

k=1 w_kx_β k(σu)βk(a | u) ! 

∑

s∈I−1_(u) x_µ−i(σ_s)  ,

where x_µ−i(σ_s) is the product of fellow agents’ and chance’s realization probabili-

5.4. Average Strategy Learning 82 on reaching u yields P(a | u, Π, µ−i) ∝ n

∑

k=1 w_kx_β k(σu)βk(a | u),

which is equivalent to the explicit update in Equation 4.1.

Hence, the behavioural strategy πi can be learned approximately from a data set consisting of trajectories sampled from (Πi, µ−i). Recall that we can sample from Π = ∑n_k=1w_kB_k by sampling whole episodes from each constituent β_ki≡ Bi

k with probability wk.

5.4.2 Sampling Experience

In fictitious play, our goal is to keep track of the average mixed strategy profile

Πk+1= 1 k+ 1 k+1

∑

j=1 B_j= k k+ 1Πk+ 1 k+ 1Bk+1.

Both Πk and Bk+1 are available at iteration k and we can therefore apply Proposi- tion 5.4.1 to sample experience of a behavioural strategy π_k+1i that is realization- equivalent to Πi_k+1. In particular, for a player i we would repeatedly sample episodes of him following his strategy π_k+1i against a fixed, completely mixed strategy profile of his opponents, µ−i. Note, that player i can follow π_k+1i by choosing between π_ki and β_k+1i at the root of the game with probabilities _k+1k and 1_k respec- tively and committing to his choice for an entire episode. The player would experience sequences of state-action pairs, (st, at), where at is the action he chose at information state st following his strategy. This experience constitutes data that approximates the desired strategy that we want to learn, e.g. by fitting the data set with a function approximator. Alternatively, players can collect state-policy pairs, (s_t, ρt), where ρt = Et

1_{A_t=1}, ..., 11{At=n} is the player’s policy at state St.

Instead of resampling a whole new data set of experience at each iteration, we can incrementally update our data set from a stream of best responses, βi_j, j = 1, ... , k. In order to constitute an unbiased approximation of an average of best

responses, 1_k∑k_j=1Bi_j, we need to accumulate the same number of sampled episodes from each Bi_jand these need to be sampled against the same fixed opponent strategy profile µ−i. However, we suggest using the average strategy profile πk as the (now varying) sampling distribution µk. Sampling against πk has the benefit of focusing the updates on states that are more likely in the current strategy profile. When collecting samples incrementally, the use of a changing sampling distribution πk can introduce bias. However, in fictitious play πk is changing more slowly over time and thus it is conceivable for this bias to decay over time as well.

5.4.3 Memorizing Experience

In principle, collecting sampled data from an infinite stream of best responses would require an unbounded memory. To address this issue, we propose a table-lookup counting model and the use of reservoir sampling (Vitter, 1985) for general supervised learning.

Table-Lookup A simple, but possibly expensive, approach to memorizing experience of past strategies, is to count the number of times each action has been taken. For each sampled tuple, (si_t, ai_t), we update the count of the chosen action.

N(s_ti, a_ti) ← N(si_t, ai_t) + 1

Alternatively, we can accumulate local policies ρ_ti = β_ti(si_t), where s_ti is agent i’s information state and β_tiis the strategy that the agent pursued at this state when this experience was sampled.

∀a ∈A (st) : N(st, a) ← N(st, a) + ρti(a)

The average strategy can be readily extracted.

∀a ∈A (st) : π(st, a) ←

N(st, a) N(st)

As each sampled tuple, (si_t, ai_t) or (si_t, ρ_ti), only needs to be counted once, we can parse the experience of average strategies in an online fashion and thus avoid infinite

5.4. Average Strategy Learning 84 memory requirements.

Reservoir In large domains, it is unfeasible to keep track of action counts at all information states. We propose to track a random sample of past best response behaviour with a finite-capacity memory by applying reservoir sampling to the stream of state-action pairs sampled from the stream of best responses.

Reservoir sampling (Vitter, 1985) is a class of algorithms for keeping track of a finite random sample from a possibly large or infinite stream of items, {x_k}_k≥1. Assume a reservoir (memory), M , of size n. For the first n iterations, all items, x1, ..., xn, are added to the reservoir. At iteration k > n, the algorithm randomly replaces an item in the reservoir with xk with probability n_k, and discards xk other- wise. Thus, at any iteration k, the reservoir contains a uniform random sample of the items that have been seen so far, x1, ..., xk. Exponential reservoir sampling (Os- borne et al., 2014) replaces items in the reservoir with a probability that is bounded from below, i.e. max(p,n_k), where p is a chosen minimum probability. Thus, once k= n_p items have been sampled, the items’ likelihood of remaining in the reservoir decays exponentially.

Adding agents’ experience of their (approximate) best responses to their re- spective supervised learning memories (reservoirs), MSL, with vanilla reservoir sampling approximates a _T1 step size of fictitious play, where T abstracts the iteration counter. Using exponential reservoir sampling results in an approximate step size of max(α,_T1), where α depends on the chosen minimum probabiltiy and capacity of the reservoir.

5.4.4 Average Strategy Approximation

Let ˜Πi_kbe an approximation of the average strategy profile Πk=1_k∑kj=1Bj, learned from a perfect fit of samples from the stream of best responses, Bk, k ≥ 1, e.g. with the counting model. As we collect more samples from this stream, the approximation error, ˜Πi_k− Πi_k, decays with increasing k. However, these approximation errors can bias the sequence of learned best responses, Bk+1, as they are trained with respect to the approximate average strategies, ˜Πk. While generalised weak- ened fictitious play only requires asymptotically-perfect best responses, this could

slow down convergence to an impractical level. In Section 5.6.1, we empirically analyse the approximation error, ˜Πi_k− Πi_k.

In document Reinforcement Learning from Self-Play in Imperfect-Information Games (Page 80-85)