THE ANALYSIS - ON THE SAMPLE COMPLEXITY OF EXPLORATION


and for states not in $K$, $\hat{P}_K$ is absorbing. Clearly, the value that is chosen for $m$ determines the quality of this approximation. However, let us ignore this for now and present the algorithm in terms of the parameter $m$.

Algorithm 14 Rmax

(1) Set $K = \emptyset$

(2) Act: Is $s \in K$?

(a) Yes, execute the action $\hat{\pi}(s, t \bmod T)$. Goto 2.

(b) No, perform balanced wandering for one timestep.

(i) If a state becomes m-known, goto 3.

(ii) Else, goto 2.

(3) Compute:

(a) Update $K$ and $\hat{M}_K$

(b) Compute an optimal policy $\hat{\pi}$ for $\hat{M}_K$. Goto 2.

The slightly modified Rmax algorithm is shown in algorithm 14, where $s$ is the current state and $t$ is the current time (so $t \bmod T$ is the current cycle time). Initially $K$ is empty, and so the algorithm first engages in balanced wandering, where the agent tries the action which has been taken the least number of times at the current state. This is continued until a state becomes m-known, which must occur by the pigeonhole principle. Upon obtaining a state that is m-known, the algorithm then updates $K$ and $\hat{M}_K$ and an optimal policy $\hat{\pi}$ is computed for $\hat{M}_K$.

If the algorithm is at a state $s \in K$, then $\hat{\pi}$ is executed with respect to the T-step cycle (recall section 8.2.2), i.e. the action $\hat{\pi}(s, t \bmod T)$ is taken. At all times, if the agent ever reaches a state $s \notin K$, balanced wandering is resumed. If the known set $K$ ever changes, then MDP $\hat{M}_K$ is updated and the policy $\hat{\pi}$ is recomputed. Note that computations are performed only when the known set $K$ changes, else the algorithm just performs “table lookups”.
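The bookkeeping behind balanced wandering can be sketched as follows. This is a minimal illustration, not the text's implementation: the class `KnownSetTracker` and the helper `wander_until_known` are hypothetical names, and the sketch assumes that a state counts as m-known once every action has been tried at least $m$ times there.

```python
from collections import defaultdict


class KnownSetTracker:
    """Tracks visit counts and the known set K (hypothetical helper)."""

    def __init__(self, num_actions, m):
        self.num_actions = num_actions
        self.m = m
        # counts[s][a] = number of times action a was tried in state s
        self.counts = defaultdict(lambda: [0] * num_actions)
        self.known = set()

    def balanced_wandering_action(self, state):
        """Pick the action tried least often at `state` (ties go to the lowest index)."""
        c = self.counts[state]
        return min(range(self.num_actions), key=lambda a: c[a])

    def record(self, state, action):
        """Record one try; return True if `state` just became m-known.

        Assumption: m-known means every action was tried at least m times.
        """
        self.counts[state][action] += 1
        if state not in self.known and min(self.counts[state]) >= self.m:
            self.known.add(state)
            return True
        return False


def wander_until_known(tracker, state):
    """Wander in a single state until it becomes m-known; return the steps taken."""
    steps = 0
    while state not in tracker.known:
        action = tracker.balanced_wandering_action(state)
        tracker.record(state, action)
        steps += 1
    return steps
```

With $A$ actions, a state visited $mA$ times under this rule has had every action tried $m$ times, which is the pigeonhole argument in miniature.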

We hope that $\hat{M}_K$ is a good approximation to $M_K$ and that this implies the agent is either exploiting or efficiently exploring. By the Pigeonhole Principle, successful exploration can only occur $mNA$ times. Hence, as long as the escape probability is “large”, exploration must “quickly” cease and exploitation must occur (as suggested by the explore or exploit corollary, 8.4.5).

8.5. The Analysis

This section provides the analysis of the upper bounds. In the first subsection, we assume that a value of $m$ is chosen large enough to obtain sufficiently accurate $\hat{M}_K$'s and a sample complexity bound is stated in terms of $m$. It turns out that $m$ is essentially the gap between the lower and upper bounds, so we desire a tight analysis to determine an appropriate value of $m$. The following subsection provides an improved $\ell_1$ accuracy condition for determining if $\hat{M}_K$ is accurate. It turns out that a straightforward Chernoff bound analysis is not sufficient to obtain a tight bound and a more involved analysis is used to determine $m$. The final subsection completes the proofs of the upper bounds for both general and deterministic MDPs.

An important point to note is that the accuracy condition we state in the first subsection is, informally, the weakest condition needed for our analysis to go through. This accuracy condition does not necessarily imply that we need to obtain an accurate transition model for $\hat{M}_K$, only that optimal policies derived from $\hat{M}_K$ are accurate. However, when determining $m$ in the subsection thereafter, we actually ensure that the transition model for $\hat{M}_K$ is an accurate estimate to that in $M_K$ (though this is done in a weaker $\ell_1$ sense as compared to the original $E^3$ and Rmax algorithms). This issue of “accurate model building” lies at the heart of the gap between our lower and upper bound and is discussed in the next chapter, where we examine “model based” approaches more carefully.

8.5.1. The Sample Complexity in terms of m. For now let us just assume a value of $m$ is chosen such that $\hat{M}_K$ is accurate in the following sense.

Condition 8.5.1. (Approximation Condition) If Rmax uses the set of states $K$ and an MDP $\hat{M}_K$, then for the optimal policy $\hat{\pi}$ for $\hat{M}_K$ assume that for all states $s$ and times $t \le T$

\[ U_{\hat{\pi},t,M_K}(s) \ge U^*_{t,M_K}(s) - \varepsilon . \]

The assumption states that the policy $\hat{\pi}$ that our algorithm derives from $\hat{M}_K$ is near-optimal in $M_K$. Informally, this is the weakest condition needed in order for the analysis to go through. Note that this condition does not state we require an accurate transition model of $\hat{M}_K$. In the next subsection, we determine a value of $m$ such that this condition holds with high probability.

The following lemma bounds the sample complexity of exploration in terms of m.

Lemma 8.5.2. Let $M$ be an L-epoch MDP and $s_0$ be a state for $M$. If $c$ is an L-path sampled from $\Pr(\cdot \mid R_{\max}, M, s_0)$ and if condition 8.5.1 holds, then with probability greater than $1 - \delta$, the statement

\[ U_{R_{\max},M}(c_t) \ge U^*_M(c_t) - 2\varepsilon \]

is true for all but $O\left(\frac{mNAT}{\varepsilon}\log\frac{1}{\delta}\right)$ timesteps $t \le L$.

The high-level idea of the proof is as follows. Under the induced inequality lemma (8.4.4) and the previous condition, we can show that either $\hat{\pi}$ escapes $K$ in $M$ with probability greater than $\varepsilon$ or $\hat{\pi}$ is a $2\varepsilon$ near-optimal policy in $M$ (where one factor of $\varepsilon$ is due to the accuracy assumption and the other factor is due to having an escape probability less than $\varepsilon$). By the Pigeonhole Principle, successful exploration can only occur $mNA$ times, so eventually exploitation must occur.

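Proof. Let $\hat{\pi}$ be an optimal policy with respect to some $\hat{M}_K$ that is used by Rmax. Let $\pi^*$ be an optimal policy in $M$. The induced inequality lemma (8.4.4) and condition 8.5.1 imply that for all $t \le T$ and states $s$

\[
\begin{aligned}
U_{\hat{\pi},t,M}(s) &\ge U_{\hat{\pi},t,M_K}(s) - \Pr(\text{escape from } K \mid \hat{\pi}, M, s_t = s) \\
&\ge U^*_{t,M_K}(s) - \varepsilon - \Pr(\text{escape from } K \mid \hat{\pi}, M, s_t = s) \\
&\ge U_{\pi^*,t,M_K}(s) - \varepsilon - \Pr(\text{escape from } K \mid \hat{\pi}, M, s_t = s) \\
&\ge U^*_{t,M}(s) - \varepsilon - \Pr(\text{escape from } K \mid \hat{\pi}, M, s_t = s)
\end{aligned}
\]

where we have used both inequalities in the induced inequality lemma and have used the optimality of $\pi^*$ in $M$.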

Recall that Rmax executes the policy $\hat{\pi}$ in sync with the T-step cycle time as long as $s \in K$. The definition of $U_{\mathcal{A}}$ and the previous inequality imply that either

\[ U_{R_{\max},M}(c_t) \ge U^*_M(c_t) - 2\varepsilon \]

or the probability that $\hat{\pi}$ escapes from $K$ before the T-end time for $t$ must be greater than $\varepsilon$.

Now, we address how many timesteps the T-step escape probability can be greater than $\varepsilon$ for any $K$. Each attempted exploration can be viewed as a Bernoulli trial with chance of success greater than $\varepsilon$. There can be at most $mNA$ successful exploration attempts, until all state-actions are known and the escape probability is 0. Note that in $\frac{mNA}{\varepsilon}$ steps the mean number of successful exploration attempts is $mNA$. Hoeffding's bound states that we need $O\left(\frac{1}{\beta^2}\log\frac{1}{\delta}\right)$ samples to obtain a $\beta$ fractionally accurate estimate of the mean, where each “sample” is $\frac{mNA}{\varepsilon}$ attempts. Hence, we can choose a $\beta$ such that $O\left(\frac{mNA}{\varepsilon}\log\frac{1}{\delta}\right)$ attempts are sufficient for all $mNA$ exploration attempts to succeed, with probability greater than $1 - \delta$. Since each attempted exploration takes at most $T$ steps, the number of exploration steps is bounded by $O\left(\frac{mNAT}{\varepsilon}\log\frac{1}{\delta}\right)$. □
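The counting argument can be illustrated with a small simulation. This is a sketch under stated assumptions rather than part of the proof: the function names are hypothetical, each attempt is modelled as succeeding with probability exactly $\varepsilon$ (the worst case allowed by the argument), and `exploration_step_budget` returns only the leading term $mNAT/\varepsilon$, without the log factor or the constants hidden by the $O(\cdot)$.

```python
import random


def attempts_until_all_succeed(num_required, success_prob, rng):
    """Count Bernoulli trials until `num_required` successes have occurred."""
    attempts = 0
    successes = 0
    while successes < num_required:
        attempts += 1
        if rng.random() < success_prob:
            successes += 1
    return attempts


def exploration_step_budget(m, N, A, T, eps):
    """Leading term m*N*A*T/eps of the bound (no log factor, no constants)."""
    return m * N * A * T / eps
```

For example, with `num_required = 200` standing in for $mNA$ and `success_prob = 0.1` standing in for $\varepsilon$, the simulated number of attempts concentrates around $mNA/\varepsilon = 2000$; multiplying by $T$ converts attempts into timesteps.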

Note that our lower bound is $\Omega\left(\frac{NAT}{\varepsilon}\log\frac{1}{\delta}\right)$, and so the gap between our lower and upper bound is essentially $m$. The issue we now address is what is the sample size $m$. Clearly, a tight bound is desired here to minimize this gap.

8.5.2. What is the appropriate value of m? First, an approximation condition for $\hat{M}$ is provided, such that the approximation condition 8.5.1 holds. We use the following definition for our less stringent $\ell_1$ accuracy condition.

Definition 8.5.3. ($\ell_1$ accuracy) We say that a transition model $\hat{P}$ is an $\varepsilon$-approximation to a transition model $P$ if for all states $s$ and actions $a$

\[ \sum_{s'} \left| \hat{P}(s' \mid s, a) - P(s' \mid s, a) \right| \le \varepsilon . \]
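The definition can be checked directly on tabular models. The following is a minimal sketch, assuming both transition models are stored as nested lists indexed as `P[s][a][s_next]`; the function names are illustrative and not from the text.

```python
def l1_model_error(P_hat, P):
    """Largest l1 distance between P_hat(.|s,a) and P(.|s,a) over all (s, a).

    Both models are nested lists indexed as P[s][a][s_next] (an assumed layout).
    """
    worst = 0.0
    for rows_hat, rows in zip(P_hat, P):
        for dist_hat, dist in zip(rows_hat, rows):
            err = sum(abs(q - p) for q, p in zip(dist_hat, dist))
            worst = max(worst, err)
    return worst


def is_eps_approximation(P_hat, P, eps):
    """True if P_hat is an eps-approximation to P in the l1 sense."""
    return l1_model_error(P_hat, P) <= eps
```

Note that the maximum is taken over state-action pairs while the sum runs over next states, so a single badly estimated row is enough to violate the condition.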

The following lemma addresses our accuracy needs.

Lemma 8.5.4. ($\varepsilon$-Approximation Condition) Let $M$ and $\hat{M}$ be two MDPs with the same reward functions and the same state-action space. If the transition model of $\hat{M}$ is an $\varepsilon$-approximation to that of $M$, then for all T-step policies $\pi$, states $s$, and times $t \le T$,

\[ \left| U_{\pi,t,M}(s) - U_{\pi,t,\hat{M}}(s) \right| \le \varepsilon T . \]

As mentioned earlier, this guarantee is actually stronger than what is needed for our accuracy condition, since it holds for all policies. We address this issue in the next chapter.

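Proof. Let $S_{T-1} = (s, s_{t+1}, \ldots, s_{T-1})$ be a $T - t$ length sequence of states starting with a fixed state $s$ and let $S_\tau$ be the subsequence $S_\tau = (s, s_{t+1}, \ldots, s_\tau)$. Let $\Pr(S_\tau)$ and $\widehat{\Pr}(S_\tau)$ be the probability of $S_\tau$ in $M$ and $\hat{M}$, respectively, under policy $\pi$ starting from $s_t = s$ at time $t$. Let $R(S_{T-1}) = \frac{1}{T}\sum_{\tau=t}^{T-1} R(s_\tau, \pi(s_\tau))$ and so

\[
\begin{aligned}
\left| U_{\pi,t,M}(s) - U_{\pi,t,\hat{M}}(s) \right|
&= \Big| \sum_{S_{T-1}} \big( \Pr(S_{T-1}) - \widehat{\Pr}(S_{T-1}) \big) R(S_{T-1}) \Big| \\
&\le \sum_{S_{T-1}} \big| \Pr(S_{T-1}) - \widehat{\Pr}(S_{T-1}) \big|
\end{aligned}
\]

since $R$ is bounded between 0 and 1.

We now show that the error between $\Pr(S_{T-1})$ and $\widehat{\Pr}(S_{T-1})$ is bounded as follows

\[ \sum_{S_{T-1}} \big| \Pr(S_{T-1}) - \widehat{\Pr}(S_{T-1}) \big| \le \varepsilon T \]

where the sum is over all $T - t$ length sequences $S_{T-1}$ that start with state $s_t = s$. Let $P$ and $\hat{P}$ be the transition models in $M$ and $\hat{M}$ respectively. Slightly abusing notation, let $P(s' \mid S_\tau) = P(s' \mid s_\tau, \pi(s_\tau))$ and $\hat{P}(s' \mid S_\tau) = \hat{P}(s' \mid s_\tau, \pi(s_\tau))$. The $\varepsilon$-approximation condition implies that $\sum_{s'} | P(s' \mid S_\tau) - \hat{P}(s' \mid S_\tau) | \le \varepsilon$. For any $\tau < T - 1$, it follows that

\[
\begin{aligned}
\sum_{S_{\tau+1}} \big| \Pr(S_{\tau+1}) - \widehat{\Pr}(S_{\tau+1}) \big|
&= \sum_{S_\tau, s'} \big| \Pr(S_\tau) P(s' \mid S_\tau) - \widehat{\Pr}(S_\tau) \hat{P}(s' \mid S_\tau) \big| \\
&\le \sum_{S_\tau, s'} \Big( \big| \Pr(S_\tau) P(s' \mid S_\tau) - \widehat{\Pr}(S_\tau) P(s' \mid S_\tau) \big| + \big| \widehat{\Pr}(S_\tau) P(s' \mid S_\tau) - \widehat{\Pr}(S_\tau) \hat{P}(s' \mid S_\tau) \big| \Big) \\
&= \sum_{S_\tau} \big| \Pr(S_\tau) - \widehat{\Pr}(S_\tau) \big| \sum_{s'} P(s' \mid S_\tau) + \sum_{S_\tau} \widehat{\Pr}(S_\tau) \sum_{s'} \big| P(s' \mid S_\tau) - \hat{P}(s' \mid S_\tau) \big| \\
&\le \sum_{S_\tau} \big| \Pr(S_\tau) - \widehat{\Pr}(S_\tau) \big| + \varepsilon .
\end{aligned}
\]

Recursing on this equation leads to the result. □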
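The bound of lemma 8.5.4 can be sanity-checked numerically on a small random MDP. The sketch below is illustrative only: all names are hypothetical, the perturbed model is built by mixing each transition row with a random distribution (an arbitrary choice), and values are the normalized T-step returns computed by backward induction for a fixed, randomly drawn T-step policy.

```python
import random


def random_dist(n, rng):
    """A random probability vector with strictly positive entries."""
    w = [rng.random() + 1e-3 for _ in range(n)]
    total = sum(w)
    return [x / total for x in w]


def l1_model_error(P_hat, P):
    """Max over (s, a) of the l1 distance between the two transition rows."""
    return max(
        sum(abs(q - p) for q, p in zip(P_hat[s][a], P[s][a]))
        for s in range(len(P)) for a in range(len(P[s]))
    )


def policy_values(P, R, policy, T):
    """Normalized T-step values U[t][s] of a T-step policy, by backward induction.

    policy[t][s] is the action at time t in state s; R[s][a] lies in [0, 1].
    """
    n = len(P)
    U = [[0.0] * n for _ in range(T + 1)]  # U[T][s] = 0: no reward after time T-1
    for t in range(T - 1, -1, -1):
        for s in range(n):
            a = policy[t][s]
            future = sum(P[s][a][s2] * U[t + 1][s2] for s2 in range(n))
            U[t][s] = R[s][a] / T + future
    return U


def max_value_gap(N, A, T, noise, seed):
    """Return (largest value gap over t and s, measured l1 model error)."""
    rng = random.Random(seed)
    P = [[random_dist(N, rng) for _ in range(A)] for _ in range(N)]
    # Perturb each row by mixing it with another random distribution.
    P_hat = [[[(1 - noise) * p + noise * q
               for p, q in zip(P[s][a], random_dist(N, rng))]
              for a in range(A)] for s in range(N)]
    R = [[rng.random() for _ in range(A)] for _ in range(N)]
    policy = [[rng.randrange(A) for _ in range(N)] for _ in range(T)]
    U = policy_values(P, R, policy, T)
    U_hat = policy_values(P_hat, R, policy, T)
    gap = max(abs(U[t][s] - U_hat[t][s]) for t in range(T) for s in range(N))
    return gap, l1_model_error(P_hat, P)
```

Calling `max_value_gap(4, 2, 5, 0.1, 0)` returns the observed gap together with the measured $\varepsilon$; the lemma guarantees the gap is at most $\varepsilon T$, and in practice it is often much smaller.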

The following technical lemma addresses how large $m$ needs to be such that $\hat{P}$ is an $\varepsilon$-approximation to $P$. The following lemma is for an arbitrary distribution $p$ over $N$ elements.

Lemma 8.5.5. Assume that $m$ samples are obtained from a distribution $p$, which is over a set of $N$ elements. Let $\hat{p}$ be the empirical distribution, i.e. $\hat{p}(i) = \frac{\#\text{ of } i\text{'s observed}}{m}$. If $m = O\left(\frac{N}{\varepsilon^2}\log\frac{N}{\delta}\right)$, then with probability greater than $1 - \delta$

\[ \sum_i \left| p(i) - \hat{p}(i) \right| \le \varepsilon . \]

A straightforward Chernoff bound analysis is not sufficient to prove this result (as this would lead to an $O(N^2)$ result). The proof demands different fractional accuracies for different values of $p(i)$.
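The quantity bounded in lemma 8.5.5 is easy to measure empirically. The following is a minimal sketch with hypothetical function names; it draws $m$ samples from a given distribution over $N$ elements with Python's standard `random` module and reports the $\ell_1$ distance between the true and empirical distributions. It illustrates the statement and plays no role in the proof.

```python
import random
from collections import Counter


def empirical_distribution(samples, N):
    """p_hat(i) = (number of times i was observed) / m, for i in range(N)."""
    m = len(samples)
    counts = Counter(samples)
    return [counts[i] / m for i in range(N)]


def l1_error(p, p_hat):
    """Sum over i of |p(i) - p_hat(i)|."""
    return sum(abs(a - b) for a, b in zip(p, p_hat))


def sampled_l1_error(p, m, seed):
    """Draw m samples from p and return the l1 error of the empirical estimate."""
    rng = random.Random(seed)
    samples = rng.choices(range(len(p)), weights=p, k=m)
    return l1_error(p, empirical_distribution(samples, len(p)))
```

Running `sampled_l1_error` for increasing `m` on a fixed distribution shows the error shrinking roughly like $\sqrt{N/m}$, consistent with the $N/\varepsilon^2$ dependence in the lemma.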

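Proof. For simplicity, we write $p_i$ for $p(i)$ and $\hat{p}_i$ for $\hat{p}(i)$. The form of the Chernoff bound that we use is

\[ P\left( |p_i - \hat{p}_i| > \alpha p_i \right) < 2 \exp\left( -\alpha^2 p_i m / 2 \right) . \]

Let us define the values $\alpha_i$ as follows, based on the values $p_i$:

\[
\alpha_i =
\begin{cases}
\dfrac{\varepsilon}{2} & \text{if } p_i > \dfrac{1}{N} \\[2ex]
\dfrac{\varepsilon}{2 N p_i} & \text{else.}
\end{cases}
\]

Let us assume that for all $i$ that

\[ |p_i - \hat{p}_i| \le \alpha_i p_i . \qquad (8.5.1) \]

This implies that

\[
\sum_i \left| p(i) - \hat{p}(i) \right| \le \sum_i \alpha_i p_i
= \frac{\varepsilon}{2} \sum_{i : p_i > \frac{1}{N}} p_i + \frac{\varepsilon}{2} \sum_{i : p_i \le \frac{1}{N}} \frac{1}{N}
\le \frac{\varepsilon}{2} + \frac{\varepsilon}{2}
\]

and so it is sufficient to show equation 8.5.1 holds with probability greater than $1 - \delta$ for $m = O\left(\frac{N}{\varepsilon^2}\log\frac{N}{\delta}\right)$.

By the union bound and the Chernoff bound,

\[
\begin{aligned}
P\left( \exists i \text{ s.t. } |p_i - \hat{p}_i| > \alpha_i p_i \right)
&\le \sum_i 2 \exp\left( -\alpha_i^2 p_i m / 2 \right) \\
&= 2 \sum_{i : p_i > \frac{1}{N}} \exp\left( -\alpha_i^2 p_i m / 2 \right) + 2 \sum_{i : p_i \le \frac{1}{N}} \exp\left( -\alpha_i^2 p_i m / 2 \right) .
\end{aligned}
\]

Using the value of $\alpha_i$ stated above, it follows that for $i$ such that $p_i > \frac{1}{N}$, $\exp(-\alpha_i^2 p_i m / 2) = \exp\left(-\frac{\varepsilon^2 p_i m}{8}\right) < \exp\left(-\frac{\varepsilon^2 m}{8N}\right)$. Similarly, for $i$ such that $p_i \le \frac{1}{N}$, $\exp(-\alpha_i^2 p_i m / 2) = \exp\left(-\frac{\varepsilon^2 m}{8 N^2 p_i}\right) \le \exp\left(-\frac{\varepsilon^2 m}{8N}\right)$. Therefore,

\[
P\left( \exists i \text{ s.t. } |p_i - \hat{p}_i| > \alpha_i p_i \right) \le 2 \sum_i \exp\left( -\frac{\varepsilon^2 m}{8N} \right) = 2N \exp\left( -\frac{\varepsilon^2 m}{8N} \right)
\]

and the result follows if we demand $2N \exp\left(-\frac{\varepsilon^2 m}{8N}\right) < \delta$. □

The previous two lemmas directly imply the following tightened result for $m$ over the $E^3$ and Rmax algorithm.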

Lemma 8.5.6. (Setting m) In order for the approximation condition to hold (equation 8.5.1) with probability of error less than $\delta$, it is sufficient that $m = O\left(\frac{N T^2}{\varepsilon^2}\log\frac{NA}{\delta}\right)$.

Proof. By lemma 8.5.4, we need to set the $\ell_1$ error of $\hat{P}(\cdot \mid s, a)$ to be less than $\frac{\varepsilon}{2T}$ for all $s$ and $a$. This implies that an optimal policy $\hat{\pi}$ in $\hat{M}_K$ has value in $M_K$ that is $\varepsilon$ close to the optimal value in $M_K$. There are $NA$ of these transition probabilities, so if we allocate $\frac{\delta}{NA}$ error probability to each one, then the total error probability is less than $\delta$. The result follows from the previous lemma with $\varepsilon \leftarrow \frac{\varepsilon}{2T}$ and $\delta \leftarrow \frac{\delta}{NA}$. □
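The substitution in this proof can be written out explicitly. The following worked step is a sketch that assumes the per-row accuracy $\varepsilon'$ is of order $\varepsilon / T$ and the per-row failure probability is $\delta' = \delta / (NA)$; the symbols $\varepsilon'$ and $\delta'$ are introduced here only for illustration, and constants are absorbed by the $O(\cdot)$ notation.

```latex
\begin{align*}
m &= O\!\left(\frac{N}{\varepsilon'^2}\log\frac{N}{\delta'}\right),
  \qquad \varepsilon' = \Theta\!\left(\frac{\varepsilon}{T}\right),\quad
  \delta' = \frac{\delta}{NA} \\
  &= O\!\left(\frac{N T^2}{\varepsilon^2}\log\frac{N^2 A}{\delta}\right)
   = O\!\left(\frac{N T^2}{\varepsilon^2}\log\frac{NA}{\delta}\right),
\end{align*}
```

where the last equality uses $\log\frac{N^2 A}{\delta} \le 2\log\frac{NA}{\delta}$ for $\delta \le 1$.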

8.5.3. Putting the lemmas together. The proof of the main sample complexity upper bound for the T-case follows.

In document On the Sample Complexity of Reinforcement Learning (Page 116-122)