Chapter 5: Monte Carlo Methods


(1) Chapter 5: Monte Carlo Methods

❐ Monte Carlo methods are learning methods: experience → values, policy
❐ Monte Carlo methods can be used in two ways:
  ! model-free: no model necessary and still attains optimality
  ! simulated: needs only a simulation, not a full model
❐ Monte Carlo methods learn from complete sample returns
  ! only defined for episodic tasks (in this book)
❐ Like an associative version of a bandit method

R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction

(2) DP vs MC

(3) Backup diagram for Monte Carlo

❐ Entire rest of episode included, down to the terminal state
❐ Only one choice considered at each state (unlike DP)
  ! thus, there will be an explore/exploit dilemma
❐ Does not bootstrap from successor states' values (unlike DP)
❐ Time required to estimate one state does not depend on the total number of states

(4) The Power of Monte Carlo

e.g., Elastic Membrane (Dirichlet Problem)
How do we compute the shape of the membrane or bubble?

(5) Two Approaches

❐ Relaxation
❐ Kakutani's algorithm, 1945

(6) Monte Carlo Policy Evaluation

❐ Goal: learn vπ(s)
❐ Given: some number of episodes under π which contain s
❐ Idea: average returns observed after visits to s
❐ Every-visit MC: average returns for every time s is visited in an episode
❐ First-visit MC: average returns only for the first time s is visited in an episode
❐ Both converge asymptotically

(7) First-visit Monte Carlo policy evaluation

Every-visit MC extends more naturally to function approximation and eligibility traces, as discussed in Chapters 7 and 9. First-visit MC is shown in Figure 5.1:

  Initialize:
    π ← policy to be evaluated
    V ← an arbitrary state-value function
    Returns(s) ← an empty list, for all s ∈ S

  Repeat forever:
    Generate an episode using π
    For each state s appearing in the episode:
      G ← return following the first occurrence of s
      Append G to Returns(s)
      V(s) ← average(Returns(s))

Figure 5.1: The first-visit MC method for estimating vπ.
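To make the box concrete, here is a minimal Python sketch of first-visit MC prediction. The `generate_episode(policy)` helper is hypothetical (not from the book): it plays one episode under π and returns a list of (state, action, reward) triples.

```python
from collections import defaultdict

def first_visit_mc_prediction(policy, generate_episode, num_episodes, gamma=1.0):
    returns = defaultdict(list)   # Returns(s): returns observed after first visits to s
    V = defaultdict(float)        # V(s): current value estimate

    for _ in range(num_episodes):
        episode = generate_episode(policy)      # [(S0, A0, R1), (S1, A1, R2), ...]
        states = [s for (s, _, _) in episode]
        G = 0.0
        # Work backwards so G accumulates the return following each time step.
        for t in reversed(range(len(episode))):
            s, _, r = episode[t]
            G = gamma * G + r
            if s not in states[:t]:             # first visit to s in this episode
                returns[s].append(G)
                V[s] = sum(returns[s]) / len(returns[s])
    return V
```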

(8) MC vs supervised regression

(9) Blackjack example

❐ Object: have your card sum be greater than the dealer's without exceeding 21
❐ States (200 of them):
  ! current sum (12-21)
  ! dealer's showing card (ace-10)
  ! do I have a usable ace?
❐ Reward: +1 for winning, 0 for a draw, -1 for losing
❐ Actions: stick (stop receiving cards), hit (receive another card)
❐ Policy: stick if my sum is 20 or 21, else hit
❐ No discounting (γ = 1)

(10) Learned blackjack state-value functions

(11) Monte Carlo Estimation of Action Values (Q)

❐ Monte Carlo is most useful when a model is not available
  ! we want to learn q*
❐ qπ(s,a): average return starting from state s and action a, then following π
❐ Converges asymptotically if every state-action pair is visited
❐ Exploring starts: every state-action pair has a non-zero probability of being the starting pair

(12) A bit on theoretical analysis of MC

(13) Monte Carlo Control

[Generalized policy iteration diagram: evaluation Q → qπ; improvement π → greedy(Q)]

❐ MC policy iteration: policy evaluation using MC methods followed by policy improvement
❐ Policy improvement step: greedify with respect to the value (or action-value) function

(14) Convergence of MC Control

❐ Greedified policy meets the conditions for policy improvement; for all s ∈ S:

  q_{π_k}(s, π_{k+1}(s)) = q_{π_k}(s, argmax_a q_{π_k}(s, a))
                         = max_a q_{π_k}(s, a)
                         ≥ q_{π_k}(s, π_k(s))
                         = v_{π_k}(s)

❐ And thus π_{k+1} must be ≥ π_k by the policy improvement theorem (as discussed in the previous chapter)
❐ This assumes exploring starts and an infinite number of episodes for MC policy evaluation
❐ To solve the latter:
  ! update only to a given level of performance
  ! alternate between evaluation and improvement per episode

(15) Monte Carlo Exploring Starts

  Initialize, for all s ∈ S, a ∈ A(s):
    Q(s, a) ← arbitrary
    π(s) ← arbitrary
    Returns(s, a) ← empty list

  Repeat forever:
    Choose S_0 ∈ S and A_0 ∈ A(S_0) s.t. all pairs have probability > 0
    Generate an episode starting from S_0, A_0, following π
    For each pair s, a appearing in the episode:
      G ← return following the first occurrence of s, a
      Append G to Returns(s, a)
      Q(s, a) ← average(Returns(s, a))
    For each s in the episode:
      π(s) ← argmax_a Q(s, a)

Figure 5.4: Monte Carlo ES: a Monte Carlo control algorithm assuming exploring starts.

❐ Fixed point is the optimal policy π*
❐ Now proven (almost)
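A rough Python sketch of Monte Carlo ES under stated assumptions: a hypothetical environment object offering `random_start()` (a state-action pair, every pair having probability > 0) and `step(state, action)` returning (next_state, reward, done).

```python
import random
from collections import defaultdict

def mc_exploring_starts(env, actions, num_episodes, gamma=1.0):
    Q = defaultdict(float)          # Q(s, a), arbitrary initial values (0.0 here)
    returns = defaultdict(list)     # Returns(s, a)
    policy = {}                     # pi(s): greedy action learned so far

    for _ in range(num_episodes):
        s, a = env.random_start()   # exploring start: any (state, action) pair
        episode, done = [], False
        while not done:
            s_next, r, done = env.step(s, a)
            episode.append((s, a, r))
            s = s_next
            if not done:
                a = policy.get(s, random.choice(actions))  # follow pi thereafter

        G = 0.0
        for t in reversed(range(len(episode))):
            st, at, r = episode[t]
            G = gamma * G + r
            if (st, at) not in [(x[0], x[1]) for x in episode[:t]]:  # first visit
                returns[(st, at)].append(G)
                Q[(st, at)] = sum(returns[(st, at)]) / len(returns[(st, at)])
                policy[st] = max(actions, key=lambda act: Q[(st, act)])
    return Q, policy
```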

(16) Blackjack example continued

❐ Exploring starts
❐ Initial policy as described before

[Figure: the optimal policy π* (stick/hit regions) and state-value function v* found by Monte Carlo ES, shown separately for usable-ace and no-usable-ace hands, as functions of player sum (11-21) and dealer showing card (A-10)]

(17) On-policy Monte Carlo Control

❐ On-policy: learn about the policy currently executing
❐ How do we get rid of exploring starts?
  ! the policy must be eternally soft:
    – π(a|s) > 0 for all s and a
  ! e.g., ε-soft policy:
    – probability of the max (greedy) action = 1 − ε + ε/|A(s)|
    – probability of each non-max action = ε/|A(s)|
❐ Similar to GPI: move the policy towards the greedy policy (e.g., ε-greedy)
❐ Converges to the best ε-soft policy

(18) On-policy MC Control

  Initialize, for all s ∈ S, a ∈ A(s):
    Q(s, a) ← arbitrary
    Returns(s, a) ← empty list
    π(a|s) ← an arbitrary ε-soft policy

  Repeat forever:
    (a) Generate an episode using π
    (b) For each pair s, a appearing in the episode:
          G ← return following the first occurrence of s, a
          Append G to Returns(s, a)
          Q(s, a) ← average(Returns(s, a))
    (c) For each s in the episode:
          A* ← argmax_a Q(s, a)
          For all a ∈ A(s):
            π(a|s) ← 1 − ε + ε/|A(s)|   if a = A*
            π(a|s) ← ε/|A(s)|           if a ≠ A*

Figure 5.6: An on-policy first-visit MC control algorithm for ε-soft policies.
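A minimal Python sketch of the boxed algorithm, assuming a hypothetical `generate_episode(policy_fn)` that samples each action from `policy_fn(s)` and returns (state, action, reward) triples. The ε-soft policy is realized as ε-greedy over the current Q, which gives exactly the action probabilities in the box.

```python
import random
from collections import defaultdict

def on_policy_mc_control(generate_episode, actions, num_episodes, eps=0.1, gamma=1.0):
    Q = defaultdict(float)
    returns = defaultdict(list)

    def policy_fn(s):
        # eps-greedy: non-max actions get eps/|A(s)| each,
        # the greedy action gets 1 - eps + eps/|A(s)|.
        if random.random() < eps:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    for _ in range(num_episodes):
        episode = generate_episode(policy_fn)
        G = 0.0
        for t in reversed(range(len(episode))):
            s, a, r = episode[t]
            G = gamma * G + r
            if (s, a) not in [(x[0], x[1]) for x in episode[:t]]:  # first visit
                returns[(s, a)].append(G)
                Q[(s, a)] = sum(returns[(s, a)]) / len(returns[(s, a)])
    return Q, policy_fn
```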

(19) What we've learned about Monte Carlo so far

❐ MC has several advantages over DP:
  ! can learn directly from interaction with the environment
  ! no need for full models
  ! no need to learn about ALL states (no bootstrapping)
  ! less harmed by violating the Markov property (later in book)
❐ MC methods provide an alternate policy evaluation process
❐ One issue to watch for: maintaining sufficient exploration
  ! exploring starts, soft policies

(20) Off-policy methods

❐ Learn the value of the target policy π from experience due to a behavior policy b
❐ For example, π is the greedy policy (and ultimately the optimal policy) while b is exploratory (e.g., ε-soft)
❐ In general, we only require coverage, i.e., that b generates behavior that covers, or includes, π:
  ! π(a|s) > 0 implies b(a|s) > 0 for every s, a
  ! so b must be stochastic in states where it is not identical to π; π itself may be deterministic, e.g., the greedy policy with respect to the current action-value estimate
❐ Idea: importance sampling
  ! weight each return by the ratio of the probabilities of the trajectory under the two policies

(21) Importance Sampling Ratio

❐ Probability of the rest of the trajectory, after S_t, under π:

  Pr{A_t, S_{t+1}, A_{t+1}, ..., S_T | S_t, A_{t:T−1} ~ π}
    = π(A_t|S_t) p(S_{t+1}|S_t, A_t) π(A_{t+1}|S_{t+1}) ··· p(S_T|S_{T−1}, A_{T−1})
    = ∏_{k=t}^{T−1} π(A_k|S_k) p(S_{k+1}|S_k, A_k)

  where p is the state-transition probability function of the MDP

❐ In importance sampling, each return is weighted by the relative probability of its trajectory under the target and behavior policies, the importance-sampling ratio:

  ρ_{t:T−1} = [ ∏_{k=t}^{T−1} π(A_k|S_k) p(S_{k+1}|S_k, A_k) ] / [ ∏_{k=t}^{T−1} b(A_k|S_k) p(S_{k+1}|S_k, A_k) ]
            = ∏_{k=t}^{T−1} π(A_k|S_k) / b(A_k|S_k)

❐ Although the trajectory probabilities depend on the MDP's transition probabilities, these appear identically in the numerator and denominator and cancel: the ratio depends only on the two policies and the observed sequence, not on the MDP
❐ All importance-sampling ratios have expected value 1:

  E_b[ π(A_k|S_k) / b(A_k|S_k) ] = Σ_a b(a|S_k) π(a|S_k)/b(a|S_k) = Σ_a π(a|S_k) = 1
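A small Python sketch of the ratio, assuming hypothetical functions `pi(a, s)` and `b(a, s)` that return action probabilities under the target and behavior policies, and an episode given as (state, action, reward) triples.

```python
def importance_sampling_ratio(episode, pi, b, t=0):
    """rho_{t:T-1}: product of pi(A_k|S_k) / b(A_k|S_k) for k = t, ..., T-1.

    The state-transition probabilities cancel, so the ratio depends only on
    the two policies and the observed state-action sequence, not on the MDP.
    """
    rho = 1.0
    for s, a, _ in episode[t:]:
        rho *= pi(a, s) / b(a, s)
    return rho
```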

(22) Importance Sampling

❐ New notation: time steps increase across episode boundaries
  ! e.g., t = 1, 2, ..., 27, with episodes separated by terminal states (▨)
  ! T(s): the set of all time steps at which s is visited (every-visit method; for first-visit, only the first visits to s within each episode)
  ! T(t): the first time of termination following time t
  ! G_t: the return after t up through T(t)
  ! e.g., T(s) = {4, 20}, T(4) = 9, T(20) = 25
  ! {G_t}_{t∈T(s)} are the returns that pertain to state s, and {ρ_{t:T(t)−1}} are the corresponding importance-sampling ratios

❐ Ordinary importance sampling scales the returns by the ratios and forms the estimate

  V(s) = Σ_{t∈T(s)} ρ_{t:T(t)−1} G_t / |T(s)|

❐ Whereas weighted importance sampling forms the estimate

  V(s) = Σ_{t∈T(s)} ρ_{t:T(t)−1} G_t / Σ_{t∈T(s)} ρ_{t:T(t)−1}

  (or zero if the denominator is zero)
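A minimal sketch of the two estimators, assuming `samples` is a list of (rho, G) pairs, one per visit to s.

```python
def ordinary_is(samples):
    # Simple average of the ratio-weighted returns: unbiased,
    # but the variance can be large (even infinite).
    return sum(rho * G for rho, G in samples) / len(samples)

def weighted_is(samples):
    # Weighted average: biased (the bias vanishes asymptotically),
    # but typically much lower variance.
    total_weight = sum(rho for rho, _ in samples)
    if total_weight == 0.0:
        return 0.0
    return sum(rho * G for rho, G in samples) / total_weight
```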

(23) Example of infinite variance under ordinary importance sampling

❐ One nonterminal state s, two actions, γ = 1:
  ! left: with probability 0.9 returns to s with reward 0; with probability 0.1 terminates with reward +1
  ! right: terminates immediately with reward 0
❐ Target policy: π(left|s) = 1, so v_π(s) = 1
❐ Behavior policy: b(left|s) = b(right|s) = 1/2
❐ Per-step ratios: π(right|s)/b(right|s) = 0, π(left|s)/b(left|s) = 2
❐ Example trajectories:
  ! s, left, 0; s, left, 0; s, left, 0; s, right, 0; ▨ : ρ_{0:T−1} = 0, G_0 = 0
  ! s, left, 0; s, left, 0; s, left, 0; s, left, +1; ▨ : ρ_{0:T−1} = 16, G_0 = 1

[Plot: Monte Carlo estimate of v_π(s) with ordinary importance sampling (ten runs), shown against episodes on a log scale from 1 to 100,000,000; the estimates remain erratic]

(24) Example: Off-policy Estimation of the Value of a Single Blackjack State

❐ State is player-sum 13, dealer-showing 2, usable ace
❐ Target policy is stick only on 20 or 21
❐ Behavior policy is equiprobable
❐ True value ≈ −0.27726

[Plot: mean squared error (average over 100 runs) of ordinary vs. weighted importance sampling as a function of episodes (log scale, 10 to 10,000)]

(25) Off-policy MC prediction, for estimating Q ≈ qπ

  Input: an arbitrary target policy π
  Initialize, for all s ∈ S, a ∈ A(s):
    Q(s, a) ← arbitrary
    C(s, a) ← 0

  Repeat forever:
    b ← any policy with coverage of π
    Generate an episode using b: S_0, A_0, R_1, ..., S_{T−1}, A_{T−1}, R_T, S_T
    G ← 0
    W ← 1
    For t = T−1, T−2, ... down to 0:
      G ← γG + R_{t+1}
      C(S_t, A_t) ← C(S_t, A_t) + W
      Q(S_t, A_t) ← Q(S_t, A_t) + (W / C(S_t, A_t)) [G − Q(S_t, A_t)]
      W ← W · π(A_t|S_t) / b(A_t|S_t)
      If W = 0 then exit the For loop
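A Python sketch of this incremental weighted-IS prediction algorithm, assuming a hypothetical `generate_episode(behavior)` that follows the behavior policy and returns (state, action, reward) triples, plus action-probability functions `pi(a, s)` and `behavior(a, s)`.

```python
from collections import defaultdict

def off_policy_mc_prediction(generate_episode, pi, behavior, num_episodes, gamma=1.0):
    Q = defaultdict(float)   # Q(s, a): approaches q_pi(s, a)
    C = defaultdict(float)   # cumulative sum of the weights applied to each (s, a)

    for _ in range(num_episodes):
        episode = generate_episode(behavior)   # follows the behavior policy
        G, W = 0.0, 1.0
        for t in reversed(range(len(episode))):
            s, a, r = episode[t]
            G = gamma * G + r
            C[(s, a)] += W
            Q[(s, a)] += (W / C[(s, a)]) * (G - Q[(s, a)])
            W *= pi(a, s) / behavior(a, s)
            if W == 0.0:     # earlier time steps would all receive zero weight
                break
    return Q
```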

(26) Off-policy MC control, for estimating π ≈ π*

  Initialize, for all s ∈ S, a ∈ A(s):
    Q(s, a) ← arbitrary
    C(s, a) ← 0
    π(s) ← argmax_a Q(s, a)   (with ties broken consistently)

  Repeat forever:
    b ← any soft policy
    Generate an episode using b: S_0, A_0, R_1, ..., S_{T−1}, A_{T−1}, R_T, S_T
    G ← 0
    W ← 1
    For t = T−1, T−2, ... down to 0:
      G ← γG + R_{t+1}
      C(S_t, A_t) ← C(S_t, A_t) + W
      Q(S_t, A_t) ← Q(S_t, A_t) + (W / C(S_t, A_t)) [G − Q(S_t, A_t)]
      π(S_t) ← argmax_a Q(S_t, a)   (with ties broken consistently)
      If A_t ≠ π(S_t) then exit the For loop
      W ← W / b(A_t|S_t)

❐ Target policy is greedy and deterministic
❐ Behavior policy is soft, typically ε-greedy
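A sketch of the control variant, under the same assumptions as the prediction sketch, except that the target policy is the deterministic greedy policy with respect to Q, so π(A_t|S_t) is either 1 or 0. Each step of the hypothetical `behavior_episode()` is assumed to carry the behavior probability b(A_t|S_t) alongside the reward.

```python
from collections import defaultdict

def off_policy_mc_control(behavior_episode, actions, num_episodes, gamma=1.0):
    Q = defaultdict(float)
    C = defaultdict(float)
    greedy = {}                      # pi(s) = argmax_a Q(s, a)

    for _ in range(num_episodes):
        # Hypothetical generator: list of (state, action, reward, b_prob) tuples,
        # where b_prob = b(A_t|S_t) under the soft behavior policy.
        episode = behavior_episode()
        G, W = 0.0, 1.0
        for t in reversed(range(len(episode))):
            s, a, r, b_prob = episode[t]
            G = gamma * G + r
            C[(s, a)] += W
            Q[(s, a)] += (W / C[(s, a)]) * (G - Q[(s, a)])
            greedy[s] = max(actions, key=lambda act: Q[(s, act)])
            if a != greedy[s]:       # pi(a|s) = 0: earlier steps get zero weight
                break
            W *= 1.0 / b_prob        # pi(a|s) = 1 for the greedy action
    return Q, greedy
```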

(27) Discounting-aware Importance Sampling (motivation)

❐ So far we have weighted returns without taking into account that they are a discounted sum
❐ This can't be the best one can do!
❐ For example, suppose γ = 0
  ! then G_0 will be weighted by

    ρ_{0:T−1} = π(A_0|S_0)/b(A_0|S_0) · π(A_1|S_1)/b(A_1|S_1) ··· π(A_{T−1}|S_{T−1})/b(A_{T−1}|S_{T−1})

  ! but it really need only be weighted by

    ρ_{0:0} = π(A_0|S_0)/b(A_0|S_0)

  ! which would have much smaller variance

(28) Discounting-aware Importance Sampling

❐ Define the flat partial return:

  Ḡ_{t:h} = R_{t+1} + R_{t+2} + ··· + R_h,   0 ≤ t < h ≤ T

  where "flat" denotes the absence of discounting, and "partial" denotes that these returns do not extend all the way to termination but stop at the horizon h (T is the time of termination of the episode)

❐ Then the conventional return G_t can be viewed as a sum of flat partial returns:

  G_t = R_{t+1} + γR_{t+2} + γ²R_{t+3} + ··· + γ^{T−t−1}R_T
      = (1−γ) R_{t+1}
      + (1−γ) γ (R_{t+1} + R_{t+2})
      + (1−γ) γ² (R_{t+1} + R_{t+2} + R_{t+3})
      ⋮
      + (1−γ) γ^{T−t−2} (R_{t+1} + R_{t+2} + ··· + R_{T−1})
      + γ^{T−t−1} (R_{t+1} + R_{t+2} + ··· + R_T)

(29) Discounting-aware Importance Sampling (continued)

❐ With the flat partial return

  Ḡ_{t:h} = R_{t+1} + R_{t+2} + ··· + R_h,   0 ≤ t < h ≤ T

❐ The decomposition on the previous slide can be written compactly as

  G_t = (1−γ) Σ_{h=t+1}^{T−1} γ^{h−t−1} Ḡ_{t:h} + γ^{T−t−1} Ḡ_{t:T}

(30) Discounting-aware Importance Sampling (continued)

❐ Now we scale the flat partial returns by similarly truncated importance-sampling ratios: since Ḡ_{t:h} only involves rewards up to horizon h, we only need the ratio of probabilities up to h
❐ Ordinary discounting-aware IS estimator, analogous to (5.4):

  V(s) = Σ_{t∈T(s)} [ (1−γ) Σ_{h=t+1}^{T(t)−1} γ^{h−t−1} ρ_{t:h−1} Ḡ_{t:h} + γ^{T(t)−t−1} ρ_{t:T(t)−1} Ḡ_{t:T(t)} ] / |T(s)|     (5.8)

❐ Weighted discounting-aware IS estimator, analogous to (5.5):

  V(s) = Σ_{t∈T(s)} [ (1−γ) Σ_{h=t+1}^{T(t)−1} γ^{h−t−1} ρ_{t:h−1} Ḡ_{t:h} + γ^{T(t)−t−1} ρ_{t:T(t)−1} Ḡ_{t:T(t)} ]
         / Σ_{t∈T(s)} [ (1−γ) Σ_{h=t+1}^{T(t)−1} γ^{h−t−1} ρ_{t:h−1} + γ^{T(t)−t−1} ρ_{t:T(t)−1} ]                              (5.9)
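A sketch of the single-visit term inside the ordinary estimator (5.8), under stated assumptions: `rewards[i]` holds R_{t+1+i} and `rho_step[i]` holds the per-step ratio π(A_{t+i}|S_{t+i})/b(A_{t+i}|S_{t+i}). The full estimate is the average of these terms over all visits in T(s).

```python
def discounting_aware_term(rewards, rho_step, gamma):
    """One visit's contribution to the ordinary discounting-aware estimate."""
    T = len(rewards)                              # steps remaining until termination
    # Flat partial returns: flat[h-1] = G_bar_{t:t+h} = R_{t+1} + ... + R_{t+h}
    flat = [sum(rewards[:h]) for h in range(1, T + 1)]
    # Truncated ratios: rho[h-1] = rho_{t:t+h-1}
    rho, p = [], 1.0
    for ratio in rho_step:
        p *= ratio
        rho.append(p)
    # gamma^{T-t-1} * rho_{t:T-1} * G_bar_{t:T}
    term = gamma ** (T - 1) * rho[T - 1] * flat[T - 1]
    # (1 - gamma) * sum over the intermediate horizons h = t+1, ..., T-1
    for h in range(1, T):
        term += (1 - gamma) * gamma ** (h - 1) * rho[h - 1] * flat[h - 1]
    return term
```

Note that with γ = 1 the term reduces to ρ_{t:T−1} G_t (ordinary IS), and with γ = 0 it reduces to ρ_{t:t} R_{t+1}, exactly the low-variance weighting motivated on slide (27).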

(31) Per-reward Importance Sampling

❐ Another way of reducing variance, even if γ = 1
❐ Uses the fact that the return is a sum of rewards:

  ρ_{t:T−1} G_t = ρ_{t:T−1} R_{t+1} + γ ρ_{t:T−1} R_{t+2} + ··· + γ^{T−t−1} ρ_{t:T−1} R_T

❐ In each term, the ratios for steps after the reward are independent of it and have expected value 1, so they drop out in expectation:

  E[ρ_{t:T−1} R_{t+1}] = E[ρ_{t:t} R_{t+1}],   and in general   E[ρ_{t:T−1} R_{t+k}] = E[ρ_{t:t+k−1} R_{t+k}]

❐ It follows that E[ρ_{t:T−1} G_t] = E[G̃_t], where

  G̃_t = ρ_{t:t} R_{t+1} + γ ρ_{t:t+1} R_{t+2} + γ² ρ_{t:t+2} R_{t+3} + ··· + γ^{T−t−1} ρ_{t:T−1} R_T

❐ Per-reward ordinary IS estimator, with the same expected value as the ordinary IS estimator (5.4):

  V(s) = Σ_{t∈T(s)} G̃_t / |T(s)|
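A sketch of the per-reward return G̃_t, using the same assumed inputs as the previous sketch: `rewards[i]` = R_{t+1+i} and `rho_step[i]` = the per-step ratio at time t+i.

```python
def per_reward_is_return(rewards, rho_step, gamma):
    """G_tilde_t = sum_k gamma^k * rho_{t:t+k} * R_{t+1+k}."""
    g_tilde, rho = 0.0, 1.0
    for k, (r, ratio) in enumerate(zip(rewards, rho_step)):
        rho *= ratio                     # rho now equals rho_{t:t+k}
        g_tilde += (gamma ** k) * rho * r
    return g_tilde

# The per-reward ordinary estimate is then the simple average of G_tilde
# over all first visits to s:  V(s) = mean(G_tilde_t for t in T(s)).
```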


(33) Summary

❐ MC has several advantages over DP:
  ! can learn directly from interaction with the environment
  ! no need for full models
  ! less harmed by violating the Markov property (later in book)
❐ MC methods provide an alternate policy evaluation process
❐ One issue to watch for: maintaining sufficient exploration
  ! exploring starts, soft policies
❐ Introduced the distinction between on-policy and off-policy methods
❐ Introduced importance sampling for off-policy learning
❐ Introduced the distinction between ordinary and weighted IS
❐ Introduced two return-specific ideas for reducing IS variance
  ! discounting-aware and per-reward IS
