Chapter 3: The Reinforcement Learning Problem (Markov Decision Processes, or MDPs)

Full text

(1)Chapter 3: The Reinforcement Learning Problem (Markov Decision Processes, or MDPs) Objectives of this chapter: ❐ present Markov decision processes—an idealized form of the AI problem for which we have precise theoretical results ❐ introduce key components of the mathematics: value functions and Bellman equations. R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction.

(2) action Rt+1 St+1. At. t. t. Rt+1 SUMMARY OF NOTATION Agent St+1 SUMMARY OF NOTATION state reward. The Agent-Environment Environment. Interface S R of Notation Summary R agent–enviro Figure 3.1: The Summary ofEnvironmen Notat SUMMARY OF NOTATION S t. t. t+1. t+1 environment interaction in reinforcement learning. Agent Capital letters are used for random Lower case letters are used for the Capital letters are used for ra gives rise to rewards, special nu Figure 3.1: The agent–environment interact Summary of Notation statethatreward functions. Quantities that requ Lower caseaction letters are are used for cial numerical values the agent tries to maximize over time. A complete specifi St Rt functions. that in bold and in case (even ifare ra tQuantities specification of an environment defines a task ,gives one instance ofAlower the reinforcement le rise to rewards, special values Rt+1 Capital letters are usednumerical for random v in bold and in lower case (even ment learning problem. over time. A complete specification of an e St+1 Environment More specifically, Lower case letters are usedthe foragent the va instance of the reinforcement learning problem s of discrete state agent and environment interact at each of a sequence time steps, t = requir 0, 1, 2 functions. Quantities that are the agent and environme s specifically, state a More action 0, 1, 2, 3, . . ..2 At each time step t, the agent receives some representation of the in bold and in lower case (even ran 2 if env Agent and environment interact atofdiscrete time steps: t = 0, 1, 2, K discrete time steps, 2, 3, . . .. At ea a Sof possible setstates, ofaction all nonterminal sta he environment’s state, St 2 S, where S is the set and on that ba some of the environment’s sta +∈representation Agent observes state at step t: S S set ofstates, all nontermina S set of all including t hat basis selects an action, At 2 A(St ), where A(S thestate set actions possible states, andofon that basisavailable selects an ia s t ) is + allaction, states, produces actionlater, at stepint :part At ∈isasA(s) (S )S of actions set ofset actions possible set available in state the Sinclu lable in state St . One time step a tconsequence ofofits a t .inO athe action A(s) set oftheactions possibl of its R action, agent receives a resultingreward reward:, RRt+1 , the agent receives agets numerical 2 consequence RS ⇢ R, where is the set of possible R t +1 ∈ set of all nonterminal states finds itself in adiscrete new state, St+1step .3 Figure 3.1 d 3 t time ossible rewards, and and findsresulting itself innext a new state, St+1 agent–t state: St +1 ∈ S+. Figureset3.1 of diagrams all states, the including interaction. discrete time step , sof T! =t s0 , aset time of an episo agent–environment interaction. 0final 1, a 1 , . . .step A(s) actions possible in sta At each time step,step theofagen At each time step, thetime agent implements T final an St state at t abilities of selecting each possible action. Th e agent implements a mapping from states to probabilities of state selecting each possib R R R t+1 t+2 t+3 ... . . . The other random variables are a functi S at t t St St+1 St+2 S action at t step policy and ist+3 denoted ⇡t⇡ (a|s) is the p tAAtt+2policy discrete time At is calledAt+1 At+3 is⇡denoted t , where possible action. This mapping thetarget agent’s and , where t r is a function of s , a , A action at t t+1t tspecify t and R reward at t, dependent, Reinforcement learning methods howli t T s. Reinforcement final time step of anmethod episod where ⇡t (a|s) is the probability that At = a ifterminal St = learning target zt , and prediction ydiscou reward atagent’s t, depende 2r t , are aG resultR oft its return experience. The goal, R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction (cumulative.

(3) A particular finite MDP is defined by its state and action sets and b A reinforcement learning task Given that satisfies theand Markov dynamics of the environment. any state actionproperty s and a,ist decision process, or MDP. If state the state and action are finite, th of each possible pair of next andProcesses reward, s0 , r,spaces is denoted Markov Decision finite Markov decision process (finite MDP). Finite MDPs are particu . 0 , r|s, a) Pr St+1 = s0 ,learning. Rt+1 = r We | St = s, Atthem = a .extensively to thep(s theory of = reinforcement treat ❐ If a they reinforcement has the Markov it is book; are all youlearning need totask understand 90% of Property, modern reinforcem These quantities completely specify the dynamics of a finite MDP. Mos basically a Markov Decision Process (MDP). particular MDP defined by itsassumes state and sets andis weApresent in thefinite rest of this is book implicitly theaction environment 58 3. THE REINFORCEMENT PRO ❐ If stateCHAPTER and action sets are finite, it isany a finite dynamics of the environment. Given stateMDP. andLEARNING action s and a, Given the dynamics as specified by (3.6), one can compute anything of To each possible pairMDP, of next state and reward, s0 , r, is denoted ❐ define a finite you need to give: want to know about the environment, such as the expected rewards fo A particular finite MDP is defined by its state and action sets and . ! state 0 and action 0 pairs, sets p(s , r|s, a) = Pr S = s , Rt+1 =Given r | St = s, A . action s t+1 t = aand one-step dynamics of the environment. any state X X ! one-step . “dynamics” 0 reward, s0 , r, is d the probability of each possible pair of next state and a) = E[R At = a]the = dynamics r p(s a), MDP. Mo t+1 | St = s, Theser(s, quantities completely specify of ,ar|s, finite r2R s0 2S 0 in the rest of this book 0 we present implicitly the p(s , r|s, a) = Pr{St+1 = s , Rt+1 = r | Sassumes a}.environment t = s, At = theGiven state-transition probabilities, theisdynamics as specified by (3.6), one can compute anythin ! there also: These quantities completely specify the dynamicsX of a finite MDP. Mos want to know about the environment, such as the rewards . in the rest 0of this book implicitly expected 0 theoryp(s we0 |s, present assumes the envir a) = Pr S = s | S = s, A = a = p(s , r|s, a), t+1 t t pairs, is a finite MDP. r2R X X . 0 r(s, a) = E[R | S = s, A = a] = r p(s ,compute r|s, a), anyth Given the dynamics as specified by (3.6), one can t+1 t t and the expected rewards for state–action–next-state triples, r2R s0 2S one might want to know about the environment, such as the expected3 r P 0 R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction.

(4) The Agent Learns a Policy Policy at step t = π t = a mapping from states to action probabilities π t (a | s) = probability that At = a when St = s Special case - deterministic policies: πt (s) = the action taken with prob=1 when St = s ❐ Reinforcement learning methods specify how the agent changes its policy as a result of experience. ❐ Roughly, the agent’s goal is to get as much reward as it can over the long run. R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction. 4.

(5) The Markov Property. ⇤ ❐ By “the state” at step t, the book whatever information is 3.5. means THE MARKOV PROPERTY available to the agent at step t about its environment. ⇤ 3.5.defined THE MARKOV PROPERTY only by specifying the complete probabil ❐ The state can include immediate “sensations,” highly processed 0 by specifying the=complete = r, Sfrom ssequences | S0 , Aprobability . . . , Sdi sensations, and structuresdefined built only upPr{R over t+1time t+1 0 , R1 ,of t sensations. 0 r, St+1 = s0 | S0 , A0 , R1 , . . . , St 1 , At Pr{R t+1s= for all r, , and all possible values of the past ❐ Ideally, a state should summarize At 1 , R , St ,sensations At . If the state has the M past so assignal to retain 0 t for all r, s , and all possible values of the past event ⇤ hand, then the environment’s response at t + 1 d 3.5. all THE MARKOV PROPERTY 55 “essential” information, it should have the Markov At i.e., , R , S , A . If the state signal has the Markov 1 t t t action representations at t, in which case the en hand, then the environment’s response at t + 1 depend Property: defined by specifying only. defined only by specifying the complete probability distribution: action representations at t, in which case the environm by ,specifying only Pr{Rt+1 = r, St+1 = s0 | S0 , A0 , Rdefined A0t, r|s, (3.4) Pr{R r, St+1 = s0 | St , A 1 , . . . , St 1p(s 1 , Ra) t , S= t, A t }, = t+1 =. p(s0 , events: r|s, a) =..., r, SSt+1 ,= s0 | St , At }, t+1 0 =SPr{R for all r, s0 , and all possible values of thefor past , A , R , t 1words, a state sig all r, s , St , 0and 0At . 1In other At 1 , Rt , St , At . If the state signal has the Markov property, on iftheand other and is0 , aStMarkov state, only if (3.5) is eq for all r, s , and A t . In other words, a state signal ha hand, then the 0environment’s response at thistories, + 1 depends only on ,the and , R , S , A . I S0 , state, A0 , R ..., state Sonly ❐ for all histories s 2 S+ , r 2 R, and all 1 and t 1, A t 1 t t t and is a Markov if action representations at t, in which case the environment’s dynamics can if be(3.5) is equal to and task as0 ,aRwhole are also said to have the Ma histories, S0 , A 1 , ..., St 1 , At 1 , Rt , St , At . In thi defined by specifying only 5. R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction.

(6) The Meaning of Life (goals, rewards, and returns). R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction. 6.

(7) Rewards and returns •. The objective in RL is to maximize long-term future reward. •. That is, to choose At so as to maximize Rt+1 , Rt+2 , Rt+3 , . . .. •. But what exactly should be maximized?. •. The discounted return at time t:. Gt = Rt+1 + Rt+2 +. 2. Rt+3 +. the discount rate 3. Rt+4 + · · ·. 2 [0, 1). Reward sequence. Return. 0.5(or any). 1 0 0 0…. 1. 0.5. 0 0 2 0 0 0…. 0.5. 0.9. 0 0 2 0 0 0…. 1.62. 0.5. -1 2 6 3 2 0 0 0…. 2.

(8) 4 value functions state values. action values. prediction. v⇡. q⇡. control. v⇤. q⇤. •. All theoretical objects, mathematical ideals (expected values). •. Distinct from their estimates:. Vt (s). Qt (s, a).

(9) Values are expected returns •. The value of a state, given a policy:. v⇡ (s) = E{Gt | St = s, At:1 ⇠ ⇡} •. v⇡ : S ! <. The value of a state-action pair, given a policy:. q⇡ (s, a) = E{Gt | St = s, At = a, At+1:1 ⇠ ⇡} •. The optimal value of a state:. v⇤ (s) = max v⇡ (s) ⇡. •. v⇤ : S ! <. The optimal value of a state-action pair:. q⇤ (s, a) = max q⇡ (s, a) ⇡. •. q⇡ : S ⇥ A ! <. q⇤ : S ⇥ A ! <. Optimal policy: ⇡⇤ is an optimal policy if and only if. ⇡⇤ (a|s) > 0 only where q⇤ (s, a) = max q⇤ (s, b) b. •. 8s 2 S. in other words, ⇡⇤ is optimal iff it is greedy wrt q⇤.

(10) optimal policy example. R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction. 10.

(11) What we learned last time ❐ Finite Markov decision processes! ! States, actions, and rewards ! And returns ! And time, discrete time ! They capture essential elements of life — state, causality ❐ The goal is to optimize expected returns ! returns are discounted sums of future rewards ❐ Thus we are interested in values — expected returns ❐ There are four value functions ! state vs state-action values ! values for a policy vs values for the optimal policy R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction. 11.

(12) 4 value functions state values. action values. prediction. v⇡. q⇡. control. v⇤. q⇤. •. All theoretical objects, mathematical ideals (expected values). •. Distinct from their estimates:. Vt (s). Qt (s, a).

(13) Gridworld ❐ Actions: north, south, east, west; deterministic. ❐ If would take agent off the grid: no move but reward = –1 ❐ Other actions produce reward = 0, except actions that move agent out of special states A and B as shown.. State-value function for equiprobable random policy; γ = 0.9. R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction. 13.

(14) Golf ❐ State is ball location ❐ Reward of –1 for each stroke until the ball is in the hole ❐ Value of a state? ❐ Actions: ! putt (use putter) ! driver (use driver) ❐ putt succeeds anywhere on the green. putt vVputt. !3. !4. sand. !3. !1. !2. !2 !1. 0. !4 s. !5 !6. !". a. green n d. !4 !" !2. !3. *(s, driver) Q q*(s, driver) sand. 0. !2 R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction. !3. sa. n. !1 green. !2. 14.

(15) SUMMARY O Summary of a in bold SUMMARY OF NOT Optimal Value Functions Summary s are us Capital letters ❐ For finite MDPs, policies can be partially ordered: Summary a of Lower case letters ar. Capital letters if and only if vπ (s) ≥ vπfunctions. (s) for all s∈ S Quantities # Lower case let + Sorlower ❐ There are always one or more policies that inare bold and in better than Capital letters areQu us functions. A(s) ar equal to all the others. These are the optimal policies. We Lower case letters in bold and in denote them all π*. functions. Quantities s state t lower ❐ Optimal policies share the same optimalin function: bold and in astate-value action s state Tof all no v* (s) = max vπ (s) for all s ∈ S set a actio π + t all set Sof sS action-value ❐ Optimal policies also share the same optimal S state setsto set Aoft actio function: set o aA(s) S+action R t all setno o q* (s, a) = max qπ (s, a) for all s ∈ S and a ∈ A(s) (s) set of π G t+ discrete S set oft(n)alltim st This is the expected return for taking action a in state s G time t discrs T final A(s) set oft actio and thereafter following an optimal policy. Gt atfinal T state St t 15 St action at state At t. π ≥ π#. R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction.

(16) Why Optimal State-Value Functions are Useful Any policy that is greedy with respect to v* is an optimal policy. Therefore, given v*, one-step-ahead search produces the long-term optimal actions. E.g., back to the gridworld: A. B +5 +10. B'. 22.0 24.4 22.0 19.4 17.5 19.8 22.0 19.8 17.8 16.0 17.8 19.8 17.8 16.0 14.4 16.0 17.8 16.0 14.4 13.0. A'. a) gridworld R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction. 14.4 16.0 14.4 13.0 11.7. V* b) v *. π* c) !* 16.

(17) sand. !". Optimal Value Function !1 for Golf !2 !2 !3. !1. 0. !4. s green a ❐ We can hit the!5ball farther with driver than with putter, n !6 d !4 !" but with less accuracy using !2 driver first, then ❐ q* (s,driver) gives the value or !3 using whichever actions are best. *(s, driver) Q q*(s,driver) sand. 0. !2 !3. R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction. sa. n d. !1. !2. green. 17.

(18) What About Optimal Action-Value Functions? Given q* , the agent does not even have to do a one-step-ahead search:. π * (s) = arg max q* (s, a) a. R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction. 18.

(19) Value Functions x4. R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction. 19.

(20) Bellman Equations x4. R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction. 20.

(21) Bellman Equation for a Policy π The basic idea:. Gt = Rt +1 + γ Rt +2 + γ 2 Rt + 3 + γ 3 Rt + 4 L + .... (. ). = Rt +1 + γ Rt +2 + γ Rt + 3 + γ 2 Rt + 4 +L... = Rt +1 + γ Gt +1 So:. vπ (s) = Eπ {Gt St = s} = Eπ { Rt +1 + γ vπ ( St +1 ) St = s}. Or, without the expectation operator: v⇡ (s) =. X. ⇡(a|s). a. R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction. X s0 ,r. h. p(s0 , r|s, a) r + v⇡ (s0 ). i 21.

(22) More on the Bellman Equation v⇡ (s) =. X. ⇡(a|s). 0. h. i. This is a set of equations (in fact, linear), one for each state. The ⇥ value function for π is its unique solution. ⇤ 2. =. E⇡ Rt+1 + Rt+2 +. =. EBackup v⇡ (St+1 ) | St = s] ⇡[Rt+1 +diagrams: h i X X ⇡(a|s) p(s0 , r|s, a) r + v⇡ (s0 ) ,. =. 0. p(s , r|s, a) r + v⇡ (s ). s0 ,r. a. s). X. a. Rt+3 + · · ·. St = s. s0 ,r. for vπ R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction. for qπ 22.

(23) Gridworld ❐ Actions: north, south, east, west; deterministic. ❐ If would take agent off the grid: no move but reward = –1 ❐ Other actions produce reward = 0, except actions that move agent out of special states A and B as shown.. State-value function for equiprobable random policy; γ = 0.9. R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction. 23.

(24) v⇡ (s). ⇥. 2. =. E⇡ Rt+1 + Rt+2 +. =. E⇡[Rt+1 + v⇡ (St+1 ) | St = s] h i X X 0 0 ⇡(a|s) p(s , r|s, a) r + v⇡ (s ) ,. Rt+3 + · · ·. St = s. ⇤. Bellman Optimality Equation for v* =. s0 ,r. a. The value of a state under an optimal policy must equal the expected return for the best action from that state: v⇤ (s). =. max q⇡⇤ (s, a). =. max E[Rt+1 + v⇤ (St+1 ) | St = s, At = a] a X ⇥ ⇤ 0 0 max p(s , r|s, a) r + v⇤ (s ) .. =. a. a. s0 ,r. s. (a). (b). max. The relevant backup diagram:. a r. max. s'. v* is the unique solution of this system of nonlinear equations. R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction. 24.

(25) =. max. a2A(s). X s0 ,r. ⇥. 0. 0. ⇤. p(s , r|s, a) r + v⇤ (s ) .. Bellman Optimality Equation for q*. The last two equations are two forms of the Bellman optimality equati v⇤ . The Bellman optimality equation for q⇤ is h i 0 q⇤ (s, a) = E Rt+1 + max q (S , a ) St = s, At = a ⇤ t+1 a0 h i X 0 0 = p(s0 , r|s, a) r + max q (s ,a ) . ⇤ 0 a. s0 ,r. (a). s. (b). r. max The relevant backup diagram: a. r. s,a s'. max. s'. a'. q* is the unique solution of this system of nonlinear equations. R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction. 25.

(26) Solving the Bellman Optimality Equation ❐ Finding an optimal policy by solving the Bellman Optimality Equation requires the following: ! accurate knowledge of environment dynamics; ! we have enough space and time to do the computation; ! the Markov Property. ❐ How much space and time do we need? ! polynomial in number of states (via dynamic programming methods; Chapter 4), ! BUT, number of states is often huge (e.g., backgammon has about 1020 states). ❐ We usually have to settle for approximations. ❐ Many RL methods can be understood as approximately solving the Bellman Optimality Equation. R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction. 26.

(27) Summary ❐ Agent-environment interaction ! States ! Actions ! Rewards ❐ Policy: stochastic rule for selecting actions ❐ Return: the function of future rewards agent tries to maximize ❐ Episodic and continuing tasks ❐ Markov Property ❐ Markov Decision Process ! Transition probabilities ! Expected rewards. R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction. ❐ Value functions ! State-value function for a policy ! Action-value function for a policy ! Optimal state-value function ! Optimal action-value function ❐ Optimal value functions ❐ Optimal policies ❐ Bellman Equations ❐ The need for approximation. 27.

(28)

No results found