Proof. From Theorem 7.25 and Theorem 7.7.
Example 7.28 (Nash Equilibrium in Matching Pennies). Consider the matching pennies game from Example 7.23. The only pair of optimal policies is the pair of uniformly random policies that play $\alpha$ and $\beta$ with equal probability in every time step: if one of the agents picks a policy that plays one of the actions with probability $> 1/2$, then the other agent's best response is to play a single action with probability 1 (the favored action if it is rewarded for matching, the other action if it is rewarded for mismatching). But now the first agent's policy is no longer a best response. ◇
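The best-response argument can be checked numerically on the one-shot game. The following is a minimal sketch, assuming that agent 1 receives reward 1 when both actions match and agent 2 receives reward 1 when they differ (a convention inferred from the rewards in Example 7.29); the function names are hypothetical and not part of the source.

```python
from fractions import Fraction

ACTIONS = ("alpha", "beta")


def matcher_value(p_alpha_matcher, p_alpha_mismatcher):
    """Expected one-step reward of the agent that wins when actions match."""
    p, q = p_alpha_matcher, p_alpha_mismatcher
    return p * q + (1 - p) * (1 - q)


def best_responses_mismatcher(p_alpha_matcher):
    """Pure actions that maximize the mismatching agent's expected reward."""
    # Playing alpha wins exactly when the matcher plays beta, and vice versa.
    value = {"alpha": 1 - p_alpha_matcher, "beta": p_alpha_matcher}
    best = max(value.values())
    return [a for a in ACTIONS if value[a] == best]


def best_responses_matcher(p_alpha_mismatcher):
    """Pure actions that maximize the matching agent's expected reward."""
    value = {"alpha": p_alpha_mismatcher, "beta": 1 - p_alpha_mismatcher}
    best = max(value.values())
    return [a for a in ACTIONS if value[a] == best]


# The matcher favors alpha, so the mismatcher answers with beta for sure.
biased = best_responses_mismatcher(Fraction(3, 5))
# Against a mismatcher that always plays beta, the matcher should play beta,
# so the biased policy (alpha with probability 3/5) is not a best response.
reply = best_responses_matcher(Fraction(0))
# Against a uniform opponent every action is a best response.
uniform = best_responses_mismatcher(Fraction(1, 2))
```

Only at probability 1/2 are both actions best responses, which is why the uniform pair is the unique fixed point of this argument.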
7.5 Learning Reflective Agents
Since our class $\mathcal{M}^O_{\mathrm{refl}}$ solves the grain of truth problem, the result by Kalai and Lehrer (1993) immediately implies that for any Bayesian agents $\pi_1, \ldots, \pi_n$ interacting in an infinitely repeated game and for all $\varepsilon > 0$ and all $i \in \{1, \ldots, n\}$ there is almost surely a $t_0 \in \mathbb{N}$ such that for all $t \geq t_0$ the policy $\pi_i$ is an $\varepsilon$-best response. However, this hinges on the important fact that every agent has to know the game and also that all other agents are Bayesian agents. Otherwise the convergence to an $\varepsilon$-Nash equilibrium may fail, as illustrated by the following example.
At the core of the construction is a dogmatic prior (Section 5.2.2). A dogmatic prior assigns very high probability to going to hell (reward 0 forever) if the agent deviates from a given computable policy $\pi$. For a Bayesian agent it is thus only worth deviating from the policy $\pi$ if the agent thinks that the prospects of following $\pi$ are very poor already. This implies that for general multi-agent environments and without additional assumptions on the prior, we cannot prove any meaningful convergence result about Bayesian agents acting in an unknown multi-agent environment.
Example 7.29 (Reflective Bayesians Playing Matching Pennies). Consider the multi-agent environment matching pennies from Example 7.23. Let $\pi_1$ be the policy that takes the action sequence $(\alpha\alpha\beta)^\infty$ and let $\pi_2 := \pi_\alpha$ be the policy that always takes action $\alpha$. The average reward of policy $\pi_1$ is $2/3$ and the average reward of policy $\pi_2$ is $1/3$. Let $\xi$ be a universal mixture (7.2). By on-policy value convergence (7.4), $V^{\pi_1}_\xi \to c_1 \approx 2/3$ and $V^{\pi_2}_\xi \to c_2 \approx 1/3$ almost surely when following policies $(\pi_1, \pi_2)$. Therefore there is an $\varepsilon > 0$ such that $V^{\pi_1}_\xi > \varepsilon$ and $V^{\pi_2}_\xi > \varepsilon$ for all time steps. Now we can apply Theorem 5.5 to conclude that there are (dogmatic) mixtures $\xi'_1$ and $\xi'_2$ such that $\pi^*_{\xi'_1}$ always follows policy $\pi_1$ and $\pi^*_{\xi'_2}$ always follows policy $\pi_2$. This does not converge to a ($\varepsilon$-)Nash equilibrium. ◇
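The average rewards in this example, and the fact that neither policy is a best response to the other, can be reproduced with a short simulation. This is a sketch under two assumptions: agent 1 is rewarded for matching actions and agent 2 for mismatching ones, and undiscounted average reward is used as a rough proxy for the values $V$ in the text. The letters `a` and `b` stand for $\alpha$ and $\beta$; all names are hypothetical.

```python
from fractions import Fraction
from itertools import cycle, islice


def average_rewards(actions_1, actions_2):
    """Average reward of (agent 1, agent 2) over a finite joint action history.

    Assumed convention: agent 1 gets reward 1 on a match, agent 2 on a mismatch.
    """
    steps = len(actions_1)
    matches = sum(a == b for a, b in zip(actions_1, actions_2))
    return Fraction(matches, steps), Fraction(steps - matches, steps)


STEPS = 300  # a multiple of 3, so the averages are exact

pi_1 = list(islice(cycle("aab"), STEPS))  # action sequence (a a b)^infinity
pi_2 = ["a"] * STEPS                      # always a

r1, r2 = average_rewards(pi_1, pi_2)

# Agent 2 could exploit the predictable sequence by always mismatching it.
deviation_2 = ["b" if a == "a" else "a" for a in pi_1]
_, r2_deviating = average_rewards(pi_1, deviation_2)

# Agent 1 could exploit the constant policy by always matching it.
deviation_1 = ["a"] * STEPS
r1_deviating, _ = average_rewards(deviation_1, pi_2)
```

Both agents could raise their average reward to 1 by deviating, so for small $\varepsilon$ neither policy is an $\varepsilon$-best response, yet the dogmatic priors keep both agents on their prescribed policies.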
An important property required for the construction in Example 7.29 is that the environment class contains environments that threaten the agent with going to hell, which is outside of the class of matching pennies environments. In other words, since the agent does not know a priori that it is playing a matching pennies game, it might behave more conservatively than appropriate for the game.
The following theorem is our main convergence result. It states that for asymptotically optimal agents we get convergence to $\varepsilon$-Nash equilibria in any reflective-oracle-computable multi-agent environment.
Theorem 7.30 (Convergence to Equilibrium). Let $\sigma$ be a reflective-oracle-computable multi-agent environment and let $\pi_1, \ldots, \pi_n$ be reflective-oracle-computable policies that are asymptotically optimal in mean in the class $\mathcal{M}^O_{\mathrm{refl}}$. Then for all $\varepsilon > 0$ and all $i \in \{1, \ldots, n\}$ the $\sigma^{\pi_{1:n}}$-probability that the policy $\pi_i$ is an $\varepsilon$-best response converges to 1 as $t \to \infty$.
Proof. Let $i \in \{1, \ldots, n\}$. By Proposition 7.26, the subjective environment $\sigma_i$ is reflective-oracle-computable, therefore $\sigma_i \in \mathcal{M}^O_{\mathrm{refl}}$. Since $\pi_i$ is asymptotically optimal in mean in the class $\mathcal{M}^O_{\mathrm{refl}}$, we get that $E[V^*_{\sigma_i}(æ_{<t}) - V^{\pi_i}_{\sigma_i}(æ_{<t})] \to 0$. Convergence in mean implies convergence in probability for bounded random variables, hence for all $\varepsilon > 0$ we have $\sigma_i^{\pi_i}[V^*_{\sigma_i}(æ^i_{<t}) - V^{\pi_i}_{\sigma_i}(æ^i_{<t}) \geq \varepsilon] \to 0$ as $t \to \infty$. Therefore the probability that the policy $\pi_i$ plays an $\varepsilon$-best response converges to 1 as $t \to \infty$.
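The step from convergence in mean to convergence in probability can be spelled out with Markov's inequality. Here $X_t$ is shorthand introduced for this sketch (not the source's notation) for the value gap between the optimal policy and $\pi_i$ in $\sigma_i$ at time step $t$; it is nonnegative because no policy exceeds the optimal value. Probability and expectation are taken under the same measure.

```latex
\[
  X_t \ge 0
  \quad\Longrightarrow\quad
  \Pr[X_t \ge \varepsilon]
  \;\le\; \frac{\mathbb{E}[X_t]}{\varepsilon}
  \;\xrightarrow{\;t \to \infty\;}\; 0
  \qquad \text{for every fixed } \varepsilon > 0 .
\]
```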
Theorem 7.25 yields policies that play a subgame perfect equilibrium. Theorem 7.30 does not: the agents typically do not learn to predict off-policy and thus will generally not play $\varepsilon$-best responses in the counterfactual histories that they never see. This weaker form of equilibrium is unavoidable if the agents do not know the environment because it is impossible to learn the parts that they do not interact with.
Corollary 7.31 (Convergence to Equilibrium). There are limit computable policies $\pi_1, \ldots, \pi_n$ such that for any computable multi-agent environment $\sigma$ and for all $\varepsilon > 0$ and all $i \in \{1, \ldots, n\}$ the $\sigma^{\pi_{1:n}}$-probability that the policy $\pi_i$ is an $\varepsilon$-best response converges to 1 as $t \to \infty$.
Proof. Pick $\pi_1, \ldots, \pi_n$ to be the Thompson sampling policy $\pi_T$ defined in Algorithm 2 over the countable class $\mathcal{M}^O_{\mathrm{refl}}$. By Theorem 5.25 these policies are asymptotically optimal in mean. By Theorem 7.32 below they are reflective-oracle-computable and by Theorem 7.7 they are also limit computable. The statement now follows from Theorem 7.30.
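On a resampling step, the Thompson sampling policy draws an environment from the posterior and follows that environment's optimal policy, so its action probabilities are the posterior-weighted mixture of the optimal policies. The following toy sketch illustrates this over a finite class of three hypothetical environments with made-up posterior weights; the actual class $\mathcal{M}^O_{\mathrm{refl}}$ is countably infinite and needs a reflective oracle, which this example does not model.

```python
from fractions import Fraction


def thompson_action_probs(posterior, optimal_policies):
    """Posterior-weighted mixture of per-environment optimal action distributions.

    posterior: maps environment name -> posterior weight (weights sum to 1).
    optimal_policies: maps environment name -> {action: probability} for the
    optimal policy of that environment at the current history.
    """
    probs = {}
    for env, weight in posterior.items():
        for action, p in optimal_policies[env].items():
            probs[action] = probs.get(action, 0) + weight * p
    return probs


# Hypothetical posterior over three illustrative environments.
posterior = {
    "nu_1": Fraction(1, 2),
    "nu_2": Fraction(3, 10),
    "nu_3": Fraction(1, 5),
}
# Hypothetical optimal policies, each deterministic at the current history.
optimal_policies = {
    "nu_1": {"alpha": 1},
    "nu_2": {"beta": 1},
    "nu_3": {"alpha": 1},
}

probs = thompson_action_probs(posterior, optimal_policies)
```

Here the mixture plays `alpha` with probability 7/10 and `beta` with probability 3/10, the total posterior weight of the environments whose optimal policy picks each action.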
Theorem 7.32 (Thompson Sampling is Reflective-Oracle-Computable). The policy $\pi_T$ defined in Algorithm 2 over the class $\mathcal{M}^O_{\mathrm{refl}}$ is reflective-oracle-computable.
Proof. The posterior $w(\,\cdot \mid æ_{<t})$ is reflective-oracle-computable by the definition (7.3) and according to Theorem 7.19 the optimal policies $\pi^*_\nu$ are reflective-oracle-computable. On resampling steps we can compute the action probabilities of $\pi_T$ by enumerating all $\nu \in \mathcal{M}^O_{\mathrm{refl}}$ and computing $\pi^*_\nu$ weighted by the posterior $w(\nu \mid æ_{<t})$. Between resampling steps we need to condition the policy $\pi_T$ computed above by the actions it has already taken since the last resampling step (compare Example 5.28). Because the posterior $w(\,\cdot \mid æ_{<t})$ is a $\xi^\pi$-martingale when acting according to the policy $\pi$, it converges $\xi^\pi$-almost surely according to the martingale convergence theorem (Theorem 2.8). Since $\xi$ dominates the subjective environment $\sigma_i$, it also converges