5.4 Asymptotic Optimality
5.4.1 Bayes
In this section, we list two results from the literature regarding the asymptotic optimality of the Bayes optimal policy. The following negative result is due to Orseau (2010, 2013).
Theorem 5.22 (Bayes is not Asymptotically Optimal in General Environments; Orseau, 2013, Thm. 4). For any class $\mathcal{M} \supseteq \mathcal{M}^{\mathrm{CCM}}_{\mathrm{comp}}$ no Bayes optimal policy $\pi^*_\xi$ is asymptotically optimal: there is an environment $\mu \in \mathcal{M}$ and a time step $t_0 \in \mathbb{N}$ such that $\mu^{\pi^*_\xi}$-almost surely for all time steps $t \geq t_0$
$$V^*_\mu(æ_{<t}) - V^{\pi^*_\xi}_\mu(æ_{<t}) = \frac{1}{2}.$$
Orseau calls this result the good enough effect: A Bayesian agent eventually decides that the current strategy is good enough and that any additional exploration is not worth its expected payoff. However, if the environment changes afterwards, the Bayes agent is acting suboptimally.
Proof. Without loss of generality assume $\mathcal{A} := \{\alpha, \beta\}$ and $\mathcal{E} := \{0, 1/2, 1\}$ (observations are vacuous). We consider the following environment $\mu$ (transitions are labeled with action, reward).
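[Figure: environment $\mu$ with states $s_0, s_1, \ldots, s_n$. In $s_0$, the transition labeled $\beta, 1/2$ stays in $s_0$ and the transition labeled $\alpha, 0$ leads to $s_1$. The transitions along the chain $s_1, \ldots, s_n$ and from $s_n$ back to $s_0$ are all labeled $\ast, 0$.]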
In state $s_0$ the action $\beta$ is the exploitation action and the action $\alpha$ the exploration action. The length of the state sequence is defined as a $1/t$-effective horizon, $n := H_t(1/t)$, where $t$ is the time step in which the agent leaves state $s_0$. Since the discount function $\gamma$ is computable by Assumption 4.6a, $\mu \in \mathcal{M}^{\mathrm{CCS}}_{\mathrm{LSC}}$.
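Assume that when acting in $\mu$, the Bayes agent explores infinitely often. Let $æ_{<t}$ be a history in which the agent is in state $s_0$ and takes action $\alpha$. Then $V^{\pi^*_\xi}_\mu(æ_{<t}) \leq 1/t$, because the rewards on the way through $s_1, \ldots, s_n$ are all 0 and at most a $1/t$ fraction of the discounted reward mass lies beyond the $1/t$-effective horizon. By on-policy value convergence (Corollary 4.20), $V^*_\xi(æ_{<t}) - V^{\pi^*_\xi}_\mu(æ_{<t}) \to 0$ $\mu^{\pi^*_\xi}$-almost surely. Hence there is a time step $t_0$ such that for all $t \geq t_0$ we have $V^*_\xi(æ_{<t}) < w(\mu)/2$.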
Since $\mu$ is deterministic, $w(\mu \mid æ_{<t}) \geq w(\mu)$. Now we get a contradiction from (4.7):
$$V^*_\xi(æ_{<t}) \geq w(\mu \mid æ_{<t})\, V^*_\mu(æ_{<t}) \geq w(\mu)\, V^*_\mu(æ_{<t}) = \frac{w(\mu)}{2} > V^*_\xi(æ_{<t}).$$
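Therefore the Bayes agent stops taking the exploration action $\alpha$ after time step $t_0$, and so it is not optimal in any $\nu \in \mathcal{M}^{\mathrm{CCS}}_{\mathrm{LSC}}$ that behaves like $\mu$ until time step $t_0$ and then changes:

[Figure: environment $\nu$ with states $s_0, s_1, \ldots, s_n$. Same transitions as $\mu$ ($\beta, 1/2$ in $s_0$ and $\ast, 0$ along the chain), except that the exploration transition out of $s_0$ is labeled $\alpha, 0$ for $t \leq t_0$ and $\alpha, 1$ for $t > t_0$.]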
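The bound $V^{\pi^*_\xi}_\mu \leq 1/t$ on exploration steps can be checked numerically for one concrete discount function. The following is a minimal sketch under several assumptions that are not taken from the proof: the discount function is $\gamma_k = 1/(k(k+1))$ (so the tail sum is $\Gamma_t = 1/t$), the effective horizon is $H_t(\varepsilon) = \min\{k : \Gamma_{t+k}/\Gamma_t \leq \varepsilon\}$, values are normalized by $\Gamma_t$, and the chain is read as taking $n + 1$ zero-reward steps before the agent is back in $s_0$. The function names are hypothetical.

```python
from fractions import Fraction

def tail(t):
    """Gamma_t = sum_{k>=t} 1/(k(k+1)) = 1/t (telescoping sum), for t >= 1."""
    return Fraction(1, t)

def chain_length(t):
    """n = H_t(1/t): least k with Gamma_{t+k} / Gamma_t <= 1/t (assumed definition)."""
    k = 0
    while tail(t + k) / tail(t) > Fraction(1, t):
        k += 1
    return k

def explore_value(t):
    """Normalized value in mu of taking alpha at step t, then beta forever.

    This is the best case after exploring: reward 0 until the agent is
    back in s0, then reward 1/2 at every step.
    """
    n = chain_length(t)
    back = t + n + 1  # first time step at which the agent is in s0 again
    return Fraction(1, 2) * tail(back) / tail(t)

for t in (2, 5, 10, 100):
    v = explore_value(t)
    # columns: t, chain length n, value after exploring, whether v <= 1/t
    print(t, chain_length(t), v, v <= Fraction(1, t))
```

Under these assumptions the value after exploring is always below the exploitation value 1/2 and shrinks as $t$ grows, which is the quantity the proof compares against $w(\mu)/2$.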
The following theorem is also known as the self-optimizing theorem. This theorem has been a source of great confusion because its statement in Hutter (2005, Thm. 5.34) is not very explicit about how the histories are generated. The formulation of Lattimore (2013, Thm. 5.2) is explicit, but less general.
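Theorem 5.23 (Sufficient Condition for Strong Asymptotic Optimality of Bayes; Hutter, 2005, Thm. 5.34). Let $\mu$ be some environment. If there is a policy $\pi$ and a sequence of policies $\pi_1, \pi_2, \ldots$ such that for all $\nu \in \mathcal{M}$
$$V^*_\nu(æ_{<t}) - V^{\pi_t}_\nu(æ_{<t}) \to 0 \text{ as } t \to \infty \quad \mu^\pi\text{-almost surely}, \tag{5.4}$$
then
$$V^*_\mu(æ_{<t}) - V^{\pi^*_\xi}_\mu(æ_{<t}) \to 0 \text{ as } t \to \infty \quad \mu^\pi\text{-almost surely}.$$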
If $\pi = \pi^*_\xi$ and (5.4) holds for all $\mu \in \mathcal{M}$, then $\pi^*_\xi$ is strongly asymptotically optimal in the class $\mathcal{M}$.
It is important to emphasize that the policies $\pi_1, \pi_2, \ldots$ need to converge to the optimal value on the history generated by $\mu$ and $\pi$, and not (as one might think) $\nu$ and $\pi_t$. Intuitively, the policy $\pi$ is an ‘exploration policy’ that ensures that the environment class is explored sufficiently. Typically, a policy is asymptotically optimal on its own history. So if $\pi = \pi_1 = \pi_2 = \ldots$, then we get that Bayes is asymptotically optimal on the history generated by the policy $\pi$, not its own history. In light of Theorem 5.5 and Theorem 5.22 this is not too surprising; Bayesian reinforcement learning agents might not explore enough to be asymptotically optimal, but given a policy that does explore enough, Bayes learns enough to be asymptotically optimal.
This invites us to define the following policies $\pi_t$: follow the information-seeking policy $\pi^*_{IG}$ until time step $t$, and then follow $\pi^*_\xi$ (explore until $t$, then exploit). Since the information-seeking policy explores enough to prove off-policy prediction (Orseau et al., 2013, Thm. 7), we get $V^\pi_\xi - V^\pi_\mu \to 0$ for every policy $\pi$ uniformly. Hence $\arg\max_\pi V^\pi_\xi \to \arg\max_\pi V^\pi_\mu$ and thus $V^*_\mu - V^{\pi^*_\xi}_\mu \to 0$ and (5.4) is satisfied. From Theorem 5.23 we get $V^*_\mu - V^{\pi^*_\xi}_\mu \to 0$, which we already knew. In order to get strong asymptotic optimality, all we need to do is choose the switching time step $t$ appropriately, i.e., wait until $V^*_\mu$ and $V^{\pi^*_\xi}_\mu$ are close enough. Unfortunately, this is an invalid strategy: the agent does not know the true environment $\mu$ and hence cannot check this condition.
Hutter (2005, Sec. 5.6) uses Theorem 5.23 to show that the Bayes optimal policy is strongly asymptotically optimal in the class of ergodic finite-state MDPs if the effective horizon is growing, i.e., $H_t(\varepsilon) \to \infty$ for all $\varepsilon > 0$. This relies on the fact that in ergodic finite-state MDPs we need a fixed number of steps to explore the entire environment up to $\varepsilon$-confidence. Therefore we can define a sequence of policies $\pi_1, \pi_2, \ldots$ that completely disregard the history and start exploring everything from scratch. Since the effective horizon is growing, this exploration phase takes a vanishing fraction of the effective horizon and most of the value is retained. Therefore the sequence of policies $\pi_1, \pi_2, \ldots$ satisfies the condition of Theorem 5.23 regardless of the history, thus in particular for the history generated by $\pi = \pi^*_\xi$ and any $\mu \in \mathcal{M}$. Note that the condition on the horizon is important: If the effective horizon is bounded, then Bayes is not asymptotically optimal in the class of ergodic finite-state MDPs because it can be locked into a dogmatic prior similarly to Theorem 5.5.
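To make the distinction between a bounded and a growing effective horizon concrete, here is a small sketch that computes $H_t(\varepsilon)$ for two discount functions. It assumes the definition $H_t(\varepsilon) = \min\{k : \Gamma_{t+k}/\Gamma_t \leq \varepsilon\}$ with $\Gamma_t = \sum_{k \geq t} \gamma_k$; the two discount functions and all function names are illustrative choices, not taken from the text above.

```python
from fractions import Fraction

def effective_horizon(tail, t, eps):
    """Least k with tail(t + k) / tail(t) <= eps, where tail(t) = sum_{k>=t} gamma_k."""
    k = 0
    while tail(t + k) / tail(t) > eps:
        k += 1
    return k

def geometric_tail(t, g=Fraction(1, 2)):
    # gamma_k = g**k  =>  Gamma_t = g**t / (1 - g)
    return g**t / (1 - g)

def telescoping_tail(t):
    # gamma_k = 1/(k(k+1))  =>  Gamma_t = 1/t
    return Fraction(1, t)

eps = Fraction(1, 4)
for t in (1, 10, 100):
    # columns: t, horizon under geometric discounting, horizon under 1/(k(k+1))
    print(t,
          effective_horizon(geometric_tail, t, eps),
          effective_horizon(telescoping_tail, t, eps))
```

With these choices the geometric horizon stays at 2 for every $t$, while the second discount function gives $H_t(1/4) = 3t$. Only the second satisfies the growing-horizon condition, so a fixed-length exploration phase eventually becomes a vanishing fraction of the horizon.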
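Proof of Theorem 5.23. From (4.6) we get for any history $æ_{<t}$
$$\begin{aligned}
w(\mu \mid æ_{<t}) \left( V^*_\mu(æ_{<t}) - V^{\pi^*_\xi}_\mu(æ_{<t}) \right)
&\leq \sum_{\nu \in \mathcal{M}} w(\nu \mid æ_{<t}) \left( V^*_\nu(æ_{<t}) - V^{\pi^*_\xi}_\nu(æ_{<t}) \right) \\
&= \left( \sum_{\nu \in \mathcal{M}} w(\nu \mid æ_{<t})\, V^*_\nu(æ_{<t}) \right) - V^{\pi^*_\xi}_\xi(æ_{<t}) \\
&\leq \left( \sum_{\nu \in \mathcal{M}} w(\nu \mid æ_{<t})\, V^*_\nu(æ_{<t}) \right) - V^{\pi_t}_\xi(æ_{<t}) \\
&= \sum_{\nu \in \mathcal{M}} w(\nu \mid æ_{<t}) \left( V^*_\nu(æ_{<t}) - V^{\pi_t}_\nu(æ_{<t}) \right).
\end{aligned} \tag{5.5}$$
From (5.4) it follows that $V^*_\nu - V^{\pi_t}_\nu \to 0$ $\mu^\pi$-almost surely for all $\nu \in \mathcal{M}$, so (5.5) converges to 0 $\mu^\pi$-almost surely (Hutter, 2005, Lem. 5.28ii). Similar to Example 3.20, $1/w(\mu \mid æ_{<t})$ is a nonnegative $\mu^\pi$-martingale and thus converges (to a finite value) $\mu^\pi$-almost surely by Theorem 2.8. Multiplying (5.5) by $1/w(\mu \mid æ_{<t})$, which therefore stays bounded along the history, we get $V^*_\mu(æ_{<t}) - V^{\pi^*_\xi}_\mu(æ_{<t}) \to 0$ $\mu^\pi$-almost surely. If this is true for all $\mu \in \mathcal{M}$, the strong asymptotic optimality of $\pi^*_\xi$ follows from $\pi = \pi^*_\xi$ by definition.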
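The martingale step can be illustrated in a much simpler setting than the one in the proof. The sketch below assumes a passive sequence-prediction problem (no actions) with a hypothetical class of three i.i.d. Bernoulli environments and a uniform prior. It checks exactly one consequence of the martingale property: the $\mu$-expectation of $1/w(\mu \mid x_{1:t})$ equals $1/w(\mu)$ for every $t$. All names and parameters are illustrative.

```python
from fractions import Fraction
from itertools import product

PARAMS = [Fraction(1, 4), Fraction(1, 2), Fraction(3, 4)]  # hypothetical class M
PRIOR = [Fraction(1, 3)] * 3                                # uniform prior w

def likelihood(p, xs):
    """Probability of the bit string xs under i.i.d. Bernoulli(p)."""
    out = Fraction(1)
    for x in xs:
        out *= p if x == 1 else 1 - p
    return out

def posterior(i, xs):
    """Posterior weight w(nu_i | xs) by Bayes' rule."""
    mix = sum(w * likelihood(p, xs) for w, p in zip(PRIOR, PARAMS))
    return PRIOR[i] * likelihood(PARAMS[i], xs) / mix

def expected_inverse_posterior(true_i, t):
    """E_mu[1 / w(mu | x_{1:t})], summed exactly over all 2^t bit strings."""
    mu = PARAMS[true_i]
    return sum(likelihood(mu, xs) / posterior(true_i, xs)
               for xs in product((0, 1), repeat=t))

for t in range(6):
    # the expectation stays at 1 / w(mu) = 3 for every t
    print(t, expected_inverse_posterior(2, t))
```

A constant expectation is necessary for a martingale but weaker than the full property; the sketch is meant only to show why $1/w(\mu \mid \cdot)$ cannot blow up on average, which is what keeps the posterior weight of the true environment away from zero.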