Learning Reflective Agents - Nonparametric General Reinforcement Learning

Proof. From Theorem 7.25 and Theorem 7.7.

Example 7.28 (Nash Equilibrium in Matching Pennies). Consider the matching pennies game from Example 7.23. The only pair of optimal policies is the pair of two uniformly random policies that play $\alpha$ and $\beta$ with equal probability in every time step: if one of the agents picks a policy that plays one of the actions with probability $> 1/2$, then the other agent's best response is to play a single action with probability 1 (the same action or the other one, depending on which agent it is). But now the first agent's policy is no longer a best response. ◊
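The argument of Example 7.28 can be checked numerically for a single round of the game. The sketch below is illustrative only: the payoff convention (agent 1 is rewarded for a match, agent 2 for a mismatch) is an assumption, since Example 7.23 is not reproduced here, and the names `value1`, `value2`, and `is_nash` are hypothetical helpers.

```python
from fractions import Fraction

# Assumed convention: agent 1 gets reward 1 if both actions match,
# agent 2 gets reward 1 if they differ.
# p = probability that agent 1 plays alpha, q = same for agent 2.

def value1(p, q):
    """Expected one-step reward of agent 1 (probability of a match)."""
    return p * q + (1 - p) * (1 - q)

def value2(p, q):
    """Expected one-step reward of agent 2 (probability of a mismatch)."""
    return 1 - value1(p, q)

def is_nash(p, q):
    """Both mixed strategies are best responses to each other."""
    best1 = max(q, 1 - q)  # agent 1 copies agent 2's more likely action
    best2 = max(p, 1 - p)  # agent 2 avoids agent 1's more likely action
    return value1(p, q) == best1 and value2(p, q) == best2

# Exact search over a grid of mixed strategies.
grid = [Fraction(k, 10) for k in range(11)]
equilibria = [(p, q) for p in grid for q in grid if is_nash(p, q)]
print(equilibria)
```

On this grid the only pair that passes the check is the uniform one, $p = q = 1/2$, in line with the example.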

7.5 Learning Reflective Agents

Since our class $\mathcal{M}^O_{\mathrm{refl}}$ solves the grain of truth problem, the result by Kalai and Lehrer (1993) immediately implies that for any Bayesian agents $\pi_1, \ldots, \pi_n$ interacting in an infinitely repeated game and for all $\varepsilon > 0$ and all $i \in \{1, \ldots, n\}$ there is almost surely a $t_0 \in \mathbb{N}$ such that for all $t \geq t_0$ the policy $\pi_i$ is an $\varepsilon$-best response. However, this hinges on the important fact that every agent has to know the game and also that all other agents are Bayesian agents. Otherwise the convergence to an $\varepsilon$-Nash equilibrium may fail, as illustrated by the following example.

At the core of the construction is a dogmatic prior (Section 5.2.2). A dogmatic prior assigns very high probability to going to hell (reward 0 forever) if the agent deviates from a given computable policy $\pi$. For a Bayesian agent it is thus only worth deviating from the policy $\pi$ if the agent thinks that the prospects of following $\pi$ are very poor already. This implies that for general multi-agent environments and without additional assumptions on the prior, we cannot prove any meaningful convergence result about Bayesian agents acting in an unknown multi-agent environment.

Example 7.29 (Reflective Bayesians Playing Matching Pennies). Consider the multi-agent environment matching pennies from Example 7.23. Let $\pi_1$ be the policy that takes the action sequence $(\alpha\alpha\beta)^\infty$ and let $\pi_2 := \pi_\alpha$ be the policy that always takes action $\alpha$. The average reward of policy $\pi_1$ is $2/3$ and the average reward of policy $\pi_2$ is $1/3$. Let $\xi$ be a universal mixture (7.2). By on-policy value convergence (7.4), $V^{\pi_1}_\xi \to c_1 \approx 2/3$ and $V^{\pi_2}_\xi \to c_2 \approx 1/3$ almost surely when following policies $(\pi_1, \pi_2)$. Therefore there is an $\varepsilon > 0$ such that $V^{\pi_1}_\xi > \varepsilon$ and $V^{\pi_2}_\xi > \varepsilon$ for all time steps. Now we can apply Theorem 5.5 to conclude that there are (dogmatic) mixtures $\xi'_1$ and $\xi'_2$ such that $\pi^*_{\xi'_1}$ always follows policy $\pi_1$ and $\pi^*_{\xi'_2}$ always follows policy $\pi_2$. This does not converge to a ($\varepsilon$-)Nash equilibrium. ◊
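The average rewards quoted in Example 7.29 can be reproduced with a short simulation. This is a minimal sketch under an assumed payoff convention (agent 1 is rewarded for a match, agent 2 for a mismatch; Example 7.23 is not reproduced here), and it uses plain average reward rather than the discounted values $V^{\pi}_\xi$. All function names are hypothetical.

```python
from fractions import Fraction
from itertools import cycle, islice

ALPHA, BETA = "alpha", "beta"

def rewards(a1, a2):
    """Assumed payoffs: agent 1 wins on a match, agent 2 on a mismatch."""
    return (1, 0) if a1 == a2 else (0, 1)

def other(action):
    """The action that differs from the given one."""
    return BETA if action == ALPHA else ALPHA

def pi1():
    """The periodic policy (alpha alpha beta)^infinity."""
    return cycle([ALPHA, ALPHA, BETA])

def pi2():
    """The constant policy that always plays alpha."""
    return cycle([ALPHA])

def average_rewards(actions1, actions2, steps):
    """Exact average reward of both agents over the first `steps` rounds."""
    total1 = total2 = 0
    for a1, a2 in islice(zip(actions1, actions2), steps):
        r1, r2 = rewards(a1, a2)
        total1 += r1
        total2 += r2
    return Fraction(total1, steps), Fraction(total2, steps)

steps = 3000
avg1, avg2 = average_rewards(pi1(), pi2(), steps)

# Profitable deviations: agent 1 copies the constant action of agent 2,
# agent 2 plays the opposite of agent 1's (predictable) periodic action.
dev1, _ = average_rewards(pi2(), pi2(), steps)
_, dev2 = average_rewards(pi1(), (other(a) for a in pi1()), steps)

print(avg1, avg2, dev1, dev2)
```

Under these assumptions both agents could raise their average reward to 1 by deviating, so neither $\pi_1$ nor $\pi_2$ is an $\varepsilon$-best response for small $\varepsilon$, even though the dogmatic mixtures keep the agents on these policies.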

An important property required for the construction in Example 7.29 is that the environment class contains environments that threaten the agent with going to hell, which is outside of the class of matching pennies environments. In other words, since the agent does not know a priori that it is playing a matching pennies game, it might behave more conservatively than appropriate for the game.

The following theorem is our main convergence result. It states that for asymptotically optimal agents we get convergence to $\varepsilon$-Nash equilibria in any reflective-oracle-computable multi-agent environment.

Theorem 7.30 (Convergence to Equilibrium). Let $\sigma$ be a reflective-oracle-computable multi-agent environment and let $\pi_1, \ldots, \pi_n$ be reflective-oracle-computable policies that are asymptotically optimal in mean in the class $\mathcal{M}^O_{\mathrm{refl}}$. Then for all $\varepsilon > 0$ and all $i \in \{1, \ldots, n\}$ the $\sigma^{\pi_{1:n}}$-probability that the policy $\pi_i$ is an $\varepsilon$-best response converges to 1 as $t \to \infty$.

Proof. Let $i \in \{1, \ldots, n\}$. By Proposition 7.26, the subjective environment $\sigma_i$ is reflective-oracle-computable, therefore $\sigma_i \in \mathcal{M}^O_{\mathrm{refl}}$. Since $\pi_i$ is asymptotically optimal in mean in the class $\mathcal{M}^O_{\mathrm{refl}}$, we get that $\mathbb{E}[V^*_{\sigma_i}(æ_{<t}) - V^{\pi_i}_{\sigma_i}(æ_{<t})] \to 0$. Convergence in mean implies convergence in probability for bounded random variables, hence for all $\varepsilon > 0$ we have $\sigma_i^{\pi_i}\big[V^*_{\sigma_i}(æ^i_{<t}) - V^{\pi_i}_{\sigma_i}(æ^i_{<t}) \geq \varepsilon\big] \to 0$ as $t \to \infty$. Therefore the probability that the policy $\pi_i$ plays an $\varepsilon$-best response converges to 1 as $t \to \infty$.
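The step from convergence in mean to convergence in probability can be spelled out with Markov's inequality. The shorthand $X_t$ for the value gap is introduced here for illustration and does not appear in the source; the gap is nonnegative because $V^*_{\sigma_i}$ is the optimal value.

```latex
X_t := V^*_{\sigma_i}(æ_{<t}) - V^{\pi_i}_{\sigma_i}(æ_{<t}) \;\geq\; 0,
\qquad
\sigma_i^{\pi_i}\big[\, X_t \geq \varepsilon \,\big]
\;\leq\; \frac{\mathbb{E}[X_t]}{\varepsilon}
\;\longrightarrow\; 0
\quad \text{as } t \to \infty .
```

Here the expectation is taken with respect to $\sigma_i^{\pi_i}$, and the right-hand side vanishes for every fixed $\varepsilon > 0$ because $\mathbb{E}[X_t] \to 0$.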

Unlike Theorem 7.25, which yields policies that play a subgame perfect equilibrium, Theorem 7.30 does not guarantee subgame perfection: the agents typically do not learn to predict off-policy and thus will generally not play $\varepsilon$-best responses in the counterfactual histories that they never see. This weaker form of equilibrium is unavoidable if the agents do not know the environment because it is impossible to learn the parts that they do not interact with.

Corollary 7.31 (Convergence to Equilibrium). There are limit computable policies $\pi_1, \ldots, \pi_n$ such that for any computable multi-agent environment $\sigma$ and for all $\varepsilon > 0$ and all $i \in \{1, \ldots, n\}$ the $\sigma^{\pi_{1:n}}$-probability that the policy $\pi_i$ is an $\varepsilon$-best response converges to 1 as $t \to \infty$.

Proof. Pick $\pi_1, \ldots, \pi_n$ to be the Thompson sampling policy $\pi_T$ defined in Algorithm 2 over the countable class $\mathcal{M}^O_{\mathrm{refl}}$. By Theorem 5.25 these policies are asymptotically optimal in mean. By Theorem 7.32 below they are reflective-oracle-computable and by Theorem 7.7 they are also limit computable. The statement now follows from Theorem 7.30.
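At a resampling step, a Thompson sampling policy draws an environment $\nu$ from the posterior and then follows the optimal policy $\pi^*_\nu$, so its action probabilities are the posterior-weighted mixture of the optimal policies. The following sketch illustrates this on a finite toy class. It is not Algorithm 2: the three-element class, the posterior weights, and the function names are all hypothetical, whereas the class $\mathcal{M}^O_{\mathrm{refl}}$ is countably infinite.

```python
import random

def resampling_action_probs(posterior, optimal_policies, actions):
    """Mixture sum_nu w(nu | history) * pi*_nu(action) at a resampling step."""
    probs = {a: 0.0 for a in actions}
    for nu, weight in posterior.items():
        for a in actions:
            probs[a] += weight * optimal_policies[nu].get(a, 0.0)
    return probs

def thompson_resample(posterior, rng):
    """Draw one environment name according to the posterior weights."""
    names = list(posterior)
    weights = [posterior[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]

# Hypothetical toy class: posterior weights and deterministic optimal policies.
actions = ["alpha", "beta"]
posterior = {"nu1": 0.5, "nu2": 0.25, "nu3": 0.25}
optimal_policies = {
    "nu1": {"alpha": 1.0},
    "nu2": {"beta": 1.0},
    "nu3": {"alpha": 1.0},
}

probs = resampling_action_probs(posterior, optimal_policies, actions)
sampled = thompson_resample(posterior, random.Random(0))
print(probs, sampled)
```

In this toy instance the mixture puts probability 0.75 on `alpha` and 0.25 on `beta`. Between resampling steps the agent keeps following the sampled environment's optimal policy, which this sketch does not model.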

Theorem 7.32 (Thompson Sampling is Reflective-Oracle-Computable). The policy $\pi_T$ defined in Algorithm 2 over the class $\mathcal{M}^O_{\mathrm{refl}}$ is reflective-oracle-computable.

Proof. The posterior $w(\,\cdot \mid æ_{<t})$ is reflective-oracle-computable by the definition (7.3) and according to Theorem 7.19 the optimal policies $\pi^*_\nu$ are reflective-oracle-computable. On resampling steps we can compute the action probabilities of $\pi_T$ by enumerating all $\nu \in \mathcal{M}^O_{\mathrm{refl}}$ and computing $\pi^*_\nu$ weighted by the posterior $w(\nu \mid æ_{<t})$. Between resampling steps we need to condition the policy $\pi_T$ computed above by the actions it has already taken since the last resampling step (compare Example 5.28). Because the posterior $w(\,\cdot \mid æ_{<t})$ is a $\xi$-martingale when acting according to the policy $\pi$, it converges $\xi^\pi$-almost surely according to the martingale convergence theorem (Theorem 2.8). Since $\xi$ dominates the subjective environment $\sigma_i$, it also converges
