Bayes - Asymptotic Optimality - Nonparametric General Reinforcement Learning

5.4 Asymptotic Optimality

5.4.1 Bayes

In this section, we list two results from the literature regarding the asymptotic optimality of the Bayes optimal policy. The following negative result is due to Orseau (2010, 2013).

Theorem 5.22 (Bayes is not Asymptotically Optimal in General Environments; Orseau, 2013, Thm. 4). For any class $\mathcal{M} \supseteq \mathcal{M}^{CCM}_{comp}$, no Bayes optimal policy $\pi^*_\xi$ is asymptotically optimal: there is an environment $\mu \in \mathcal{M}$ and a time step $t_0 \in \mathbb{N}$ such that $\mu^{\pi^*_\xi}$-almost surely for all time steps $t \geq t_0$,

$$V^*_\mu(æ_{<t}) - V^{\pi^*_\xi}_\mu(æ_{<t}) = \frac{1}{2}.$$

Orseau calls this result the good enough effect: A Bayesian agent eventually decides that the current strategy is good enough and that any additional exploration is not worth its expected payoff. However, if the environment changes afterwards, the Bayes agent is acting suboptimally.

Proof. Without loss of generality assume $\mathcal{A} := \{\alpha, \beta\}$ and $\mathcal{E} := \{0, 1/2, 1\}$ (observations are vacuous). We consider the following environment $\mu$ (transitions are labeled with action, reward).

[Figure: environment $\mu$ with states $s_0, s_1, \ldots, s_n$. Transition labels: $\beta, 1/2$ and $\alpha, 0$ at $s_0$; $*, 0$ on all remaining transitions.]

In state $s_0$ the action $\beta$ is the exploitation action and the action $\alpha$ is the exploration action. The length of the state sequence is defined as a $(1/t)$-effective horizon, $n := H_t(1/t)$, where $t$ is the time step in which the agent leaves state $s_0$. Since the discount function $\gamma$ is computable by Assumption 4.6a, $\mu \in \mathcal{M}^{CCS}_{LSC}$.

Assume that when acting in $\mu$, the Bayes agent explores infinitely often. Let $æ_{<t}$ be a history in which the agent is in state $s_0$ and takes action $\alpha$. Then $V^{\pi^*_\xi}_\mu \leq 1/t$. By on-policy value convergence (Corollary 4.20), $V^*_\xi(æ_{<t}) - V^{\pi^*_\xi}_\mu(æ_{<t}) \to 0$ $\mu^{\pi^*_\xi}$-almost surely. Hence there is a time step $t_0$ such that for all $t \geq t_0$ we have $V^*_\xi < w(\mu)/2$. Since $\mu$ is deterministic, $w(\mu \mid æ_{<t}) \geq w(\mu)$. Now we get a contradiction from (4.7):

$$V^*_\xi(æ_{<t}) \geq w(\mu \mid æ_{<t}) \, V^*_\mu(æ_{<t}) \geq w(\mu) \, V^*_\mu(æ_{<t}) = \frac{w(\mu)}{2} > V^*_\xi(æ_{<t}).$$

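Therefore the Bayes agent stops taking the exploration action $\alpha$ after time step $t_0$, and so it is not optimal in any $\nu \in \mathcal{M}^{CCS}_{LSC}$ that behaves like $\mu$ until time step $t_0$ and then changes:

[Figure: environment $\nu$ with states $s_0, s_1, \ldots, s_n$. Transition labels at $s_0$: $\beta, 1/2$; $\alpha, 0$ for $t \leq t_0$; $\alpha, 1$ for $t > t_0$. All remaining transitions are labeled $*, 0$.]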
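The bound of $1/t$ on the value of exploring, used in the proof above, can be checked numerically. The following is a minimal sketch and not part of the original proof. It assumes geometric discounting $\gamma_k = \gamma^k$, and it assumes the effective horizon is defined as $H_t(\varepsilon) = \min\{k : \Gamma_{t+k}/\Gamma_t \leq \varepsilon\}$ with $\Gamma_t = \sum_{k \geq t} \gamma_k$; neither is stated in this section. Under these assumptions $\Gamma_{t+k}/\Gamma_t = \gamma^k$. The function names are hypothetical.

```python
import math


def effective_horizon(gamma: float, eps: float) -> int:
    """Smallest k with gamma**k <= eps.

    Assumption: geometric discounting, so Gamma_{t+k} / Gamma_t = gamma**k
    and the effective horizon does not depend on t.
    """
    if not (0.0 < gamma < 1.0 and 0.0 < eps <= 1.0):
        raise ValueError("need 0 < gamma < 1 and 0 < eps <= 1")
    k = max(0, math.ceil(math.log(eps) / math.log(gamma)))
    # Correct for floating-point rounding in the logarithms.
    while gamma ** k > eps:
        k += 1
    while k > 0 and gamma ** (k - 1) <= eps:
        k -= 1
    return k


def explore_value_bound(gamma: float, t: int) -> float:
    """Upper bound on the normalized value of exploring at time step t.

    Assumption: the agent receives reward 0 for the next n = H_t(1/t) steps
    and at most reward 1 afterwards, so the value is at most gamma**n.
    """
    n = effective_horizon(gamma, 1.0 / t)
    return gamma ** n


if __name__ == "__main__":
    for t in (2, 10, 100):
        print(t, effective_horizon(0.9, 1.0 / t), explore_value_bound(0.9, t))
```

With $\gamma = 0.9$ and $t = 10$ this gives $n = 22$ and a bound of about 0.098, which is below $1/t = 0.1$ and far below the value $1/2$ obtained by always taking $\beta$.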

The following theorem is also known as the self-optimizing theorem. This theorem has been a source of great confusion because its statement in Hutter (2005, Thm. 5.34) is not very explicit about how the histories are generated. The formulation of Lattimore (2013, Thm. 5.2) is explicit, but less general.

Theorem 5.23 (Sufficient Condition for Strong Asymptotic Optimality of Bayes; Hutter, 2005, Thm. 5.34). Let $\mu$ be some environment. If there is a policy $\pi$ and a sequence of policies $\pi_1, \pi_2, \ldots$ such that for all $\nu \in \mathcal{M}$

$$V^*_\nu(æ_{<t}) - V^{\pi_t}_\nu(æ_{<t}) \to 0 \text{ as } t \to \infty \quad \mu^\pi\text{-almost surely}, \tag{5.4}$$

then

$$V^*_\mu(æ_{<t}) - V^{\pi^*_\xi}_\mu(æ_{<t}) \to 0 \text{ as } t \to \infty \quad \mu^\pi\text{-almost surely}.$$

If $\pi = \pi^*_\xi$ and (5.4) holds for all $\mu \in \mathcal{M}$, then $\pi^*_\xi$ is strongly asymptotically optimal in the class $\mathcal{M}$.

It is important to emphasize that the policies $\pi_1, \pi_2, \ldots$ need to converge to the optimal value on the history generated by $\mu$ and $\pi$, and not (as one might think) $\nu$ and $\pi_t$. Intuitively, the policy $\pi$ is an 'exploration policy' that ensures that the environment class is explored sufficiently. Typically, a policy is asymptotically optimal on its own history. So if $\pi = \pi_1 = \pi_2 = \ldots$, then we get that Bayes is asymptotically optimal on the history generated by the policy $\pi$, not its own history. In light of Theorem 5.5 and Theorem 5.22 this is not too surprising; Bayesian reinforcement learning agents might not explore enough to be asymptotically optimal, but given a policy that does explore enough, Bayes learns enough to be asymptotically optimal.

This invites us to define the following policies $\pi_t$: follow the information-seeking policy $\pi^*_{IG}$ until time step $t$, and then follow $\pi^*_\xi$ (explore until $t$, then exploit). Since the information-seeking policy explores enough to prove off-policy prediction (Orseau et al., 2013, Thm. 7), we get $V^\pi_\xi - V^\pi_\mu \to 0$ for every policy $\pi$ uniformly. Hence $\arg\max_\pi V^\pi_\xi \to \arg\max_\pi V^\pi_\mu$ and thus $V^*_\mu - V^{\pi^*_\xi}_\mu \to 0$ and (5.4) is satisfied. From Theorem 5.23 we get $V^*_\mu - V^{\pi^*_\xi}_\mu \to 0$, which we already knew. In order to get strong asymptotic optimality, all we need to do is choose the switching time step $t$ appropriately, i.e., wait until $V^*_\mu$ and $V^{\pi^*_\xi}_\mu$ are close enough. Unfortunately, this is an invalid strategy: the agent does not know the true environment $\mu$ and hence cannot check this condition.
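The 'explore until $t$, then exploit' construction of $\pi_t$ can be sketched as a wrapper around two given policies. This is an illustrative sketch only: the `History` and `Policy` types, the function name, and the convention that the length of the history counts the elapsed time steps are all assumptions, and the two policies passed in below are trivial stand-ins rather than the information-seeking or Bayes optimal policy.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical representation: a history is a sequence of (action, reward) pairs.
History = Sequence[Tuple[str, float]]
Policy = Callable[[History], str]


def switching_policy(explore: Policy, exploit: Policy, t: int) -> Policy:
    """Follow `explore` while fewer than t steps have elapsed, then `exploit`."""

    def policy(history: History) -> str:
        if len(history) < t:
            return explore(history)
        return exploit(history)

    return policy


if __name__ == "__main__":
    # Stand-in policies that ignore the history (assumptions for illustration).
    always_alpha: Policy = lambda history: "alpha"
    always_beta: Policy = lambda history: "beta"

    pi_3 = switching_policy(always_alpha, always_beta, 3)
    history = []
    for _ in range(5):
        action = pi_3(history)
        print(len(history), action)
        history.append((action, 0.0))
```

The wrapper makes the difficulty in the text concrete: the switching time `t` is fixed in advance, and nothing in the agent's observable history tells it when $V^*_\mu$ and $V^{\pi^*_\xi}_\mu$ have become close.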

Hutter (2005, Sec. 5.6) uses Theorem 5.23 to show that the Bayes optimal policy is strongly asymptotically optimal in the class of ergodic finite-state MDPs if the effective horizon is growing, i.e., $H_t(\varepsilon) \to \infty$ for all $\varepsilon > 0$. This relies on the fact that in ergodic finite-state MDPs we need a fixed number of steps to explore the entire environment up to $\varepsilon$-confidence. Therefore we can define a sequence of policies $\pi_1, \pi_2, \ldots$ that completely disregard the history and start exploring everything from scratch. Since the effective horizon is growing, this exploration phase takes a vanishing fraction of the effective horizon and most of the value is retained. Therefore the sequence of policies $\pi_1, \pi_2, \ldots$ satisfies the condition of Theorem 5.23 regardless of the history, thus in particular for the history generated by $\pi = \pi^*_\xi$ and any $\mu \in \mathcal{M}$. Note that the condition on the horizon is important: If the effective horizon is bounded, then Bayes is not asymptotically optimal in the class of ergodic finite-state MDPs because it can be locked into a dogmatic prior similarly to Theorem 5.5.

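Proof of Theorem 5.23. From (4.6) we get for any history $æ_{<t}$

$$
\begin{aligned}
w(\mu \mid æ_{<t}) \left( V^*_\mu(æ_{<t}) - V^{\pi^*_\xi}_\mu(æ_{<t}) \right)
&\leq \sum_{\nu \in \mathcal{M}} w(\nu \mid æ_{<t}) \left( V^*_\nu(æ_{<t}) - V^{\pi^*_\xi}_\nu(æ_{<t}) \right) \\
&= \left( \sum_{\nu \in \mathcal{M}} w(\nu \mid æ_{<t}) \, V^*_\nu(æ_{<t}) \right) - V^{\pi^*_\xi}_\xi(æ_{<t}) \\
&\leq \left( \sum_{\nu \in \mathcal{M}} w(\nu \mid æ_{<t}) \, V^*_\nu(æ_{<t}) \right) - V^{\pi_t}_\xi(æ_{<t}) \\
&= \sum_{\nu \in \mathcal{M}} w(\nu \mid æ_{<t}) \left( V^*_\nu(æ_{<t}) - V^{\pi_t}_\nu(æ_{<t}) \right).
\end{aligned}
\tag{5.5}
$$

From (5.4) follows that $V^*_\nu - V^{\pi_t}_\nu \to 0$ $\mu^\pi$-almost surely for all $\nu \in \mathcal{M}$, so (5.5) converges to 0 $\mu^\pi$-almost surely (Hutter, 2005, Lem. 5.28ii). Similar to Example 3.20, $1/w(\mu \mid æ_{<t})$ is a nonnegative $\mu^\pi$-martingale and thus converges (to a finite value) $\mu^\pi$-almost surely by Theorem 2.8. Therefore $V^*_\mu(æ_{<t}) - V^{\pi^*_\xi}_\mu(æ_{<t}) \to 0$ $\mu^\pi$-almost surely. If this is true for all $\mu \in \mathcal{M}$, the strong asymptotic optimality of $\pi^*_\xi$ follows from $\pi = \pi^*_\xi$ by definition.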
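The chain of inequalities in (5.5) can be checked on a small numerical example. The sketch below is a toy under stated assumptions, not the general setting of the theorem: there are finitely many environments and policies, the values at one fixed history are given as a table `V[nu][pi]`, and the value in the mixture $\xi$ is taken to be the posterior-weighted average of the values in the individual environments. The function name and the numbers are hypothetical.

```python
from typing import List, Sequence, Tuple


def bayes_gap_chain(
    w: Sequence[float], V: Sequence[Sequence[float]], mu: int, pi_t: int
) -> Tuple[float, float, float, float, float]:
    """Return the five expressions of the chain (5.5), in order.

    w[nu]     -- posterior weight of environment nu (assumed to sum to 1)
    V[nu][pi] -- value of policy pi in environment nu at a fixed history
    mu        -- index of the true environment
    pi_t      -- index of the comparison policy
    """
    envs = range(len(w))
    policies = range(len(V[0]))
    # Value of each policy in the mixture (assumed linear in the weights).
    v_xi: List[float] = [sum(w[nu] * V[nu][pi] for nu in envs) for pi in policies]
    # Bayes optimal policy: maximizes the mixture value.
    pi_star = max(policies, key=lambda pi: v_xi[pi])
    # Optimal value in each environment.
    v_opt = [max(V[nu]) for nu in envs]
    weighted_opt = sum(w[nu] * v_opt[nu] for nu in envs)

    lhs = w[mu] * (v_opt[mu] - V[mu][pi_star])
    step1 = sum(w[nu] * (v_opt[nu] - V[nu][pi_star]) for nu in envs)
    step2 = weighted_opt - v_xi[pi_star]
    step3 = weighted_opt - v_xi[pi_t]
    rhs = sum(w[nu] * (v_opt[nu] - V[nu][pi_t]) for nu in envs)
    return lhs, step1, step2, step3, rhs


if __name__ == "__main__":
    w = [0.5, 0.3, 0.2]
    V = [[1.0, 0.0, 0.6], [0.0, 1.0, 0.6], [0.2, 0.2, 0.9]]
    print(bayes_gap_chain(w, V, mu=0, pi_t=0))
```

In this toy, the first inequality holds because every summand is nonnegative, and the third step holds because the Bayes optimal policy maximizes the mixture value, so replacing it by any other policy can only increase the gap.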
