Temporal abstraction in reinforcement learning
38
0
0
Full text
(2) What is temporal abstraction? • Consider an activity such as cooking dinner. – High-level steps: choose a recipe, make a grocery list, get groceries, cook,... – Medium-level steps: get a pot, put ingredients in the pot, stir until smooth, check the recipe ... – Low-level steps: wrist and arm movement while driving the car, stirring, ... • All have to be seamlessly integrated! • Cf. macro actions in classical AI, controllers in robotics COMP 767 - Temporal abstraction. 1.
(3) Formalization of temporal abstraction • • • • •. Hierarchical abstract machines (Parr, 1998) MAXQ (Dietterich, 1998) Dynamic motion primitives (Schaal et al. 2004) Skills (Konidaris et al, 2009) Options (Sutton, Precup & Singh, 1999; Precup, 2000). COMP 767 - Temporal abstraction. 2.
(4) Options framework • Suppose we have an MDP hS, A, r, P, γi • An option ω consists of 3 components – An initiation set of states Iω ⊆ S (aka precondition) – A policy πω : S × A → [0, 1] πω (a|s) is the probability of taking a in s when following option ω – A termination condition βω : S → [0, 1]: βω (s) is the probability of terminating the option ω upon entering s • Eg., robot navigation: if there is no obstacle in front (Iω ), go forward (πω ) until you get too close to another object (βω ) Cf. Sutton, Precup & Singh, 1999; Precup, 2000. COMP 767 - Temporal abstraction. 3.
(5) Options as behavioral programs • Call-and-return execution – Option is a subroutine which gets called by a policy over options πΩ – When called, ω is pushed onto the execution stack – During the option execution, the program looks at certain variables (aka state) and executes an instruction (aka action) until a termination condition is reached – The option can keep track of additional local variables, eg counting number of steps, saturation in certain features (e.g. Comanici, 2010) – Options can invoke other options • Interruption – At each step, one can check if a better alternative has become available – If so, the option currently executing is interrupted (special form of concurrency) • The option identity is also a form of memory: what is the agent currently trying to achieve? Cf. Shaul et al, 2014, Kulkarni et al, 2016 COMP 767 - Temporal abstraction. 4.
(6) DoDelivery. Work(). DoDelivery() set z2 = z1 M. Nav(M,{s,f}). Nav(mdest,{s,f}). Alternative formalisms. getMail. delMail. C. D. Nav(z2,{s,f}). (b). set z1 = z2. (a). • MAXQ (Dietterich, 2000): specify a graph of sub-tasks Figure 1: (a) The Deliver–Patrol world. Mail appears at and must be delivered to the appro, , and . 1997): The robot’s battery priate location. Additional rewards appear sporadically • Hierarchies of Abstract Machines (Parrat &, Russell, abstractions may be recharged at . The robot is penalized for colliding with walls and “furniture” (small cirgiven by cles).automata (b) Three of the PHAMs in the partial specification for the Deliver–Patrol world. Right-facing half-circles are start states, left-facing half-circles are stop states, hexagons are call states, ovals are and (ALISP) are memory variables. When argumentsspecifying to primitive actions,programming and squares are choice points. • Andrecall (2003): language which allows states are in braces, then the choice is over the arguments to pass to the subroutine. The specifies anlearning interrupt to clean the camera lens whenever it gets dirty; the PHAM HAM PHAM programs, parts of them interrupts its patrolling whenever there is mail to be delivered.. ToDoor(dest,sp). Move(dir) dir=N. p() N. E S W fN fE fS fW. (a). dir=E. else. dir=S. noop. dir=W. N E S W. ~At(dest). (b). Figure 2: (a) A room in the Deliver–Patrol domain. The arrows in the drawing of the room inditransition function in ToDoor(dest,sp). Two arrows indicate cate the behavior specified by the a “fast” move (fN,fS,fE.fW), whereas a single arrow indicates a slow move (N, S, E, W). (b) The ToDoor(dest,sp) and Move(dir) PHAMs.. Nav(dest,sp). COMP 767 - Temporal abstraction. Nav( {A,B,C,D},sp) else. 5. ToDoor({N, E, S, W},sp). ~InRoom(dest) Patrol(). Nav( {A,B,C,D},{s,f}). NavRoom(dest,speed). DoAll(). work.
(7) Option models • Option model has two parts: 1. Expected reward rω (s): the expected return during ω’s execution from s – Needed because it is used to update the agent’s internal representations 2. Transition model Pω (s0|s): a sub-probability distribution over next states (reflecting the discount factor γ and the option duration) given that ω executes from s – P specifies where the agent will end up after the option/program execution and when termination will happen • Models are predictions about the future, conditioned on the option being executed. COMP 767 - Temporal abstraction. 6.
(8) Option models provide semantics • Programming languages: preconditions (initiation set) and postconditions • Models of options represent (probabilistic) post-conditions • Models that are compositional, can be used to reason about the policy over options • Sequencing rω1ω2. = rω1 + Pω1 rω2. Pω1ω2. = Pω1 Pω2. Cf. Sutton et al, 1999, Sorg & Singh, 2010 • Stochastic choice: can take expectations of reward and transition models • These are sufficient conditions to allow Bellman equations to hold • Silver & Ciosek (2012): re-write model in one matrix, compose models to construct programs Eg. good generalization in Towers of Hanoi COMP 767 - Temporal abstraction. 7.
(9) MDP + Options = Semi-Markov Decision Precess Time. MDP. State. SMDP. Options over MDP Fig. 1. The state trajectory of an MDP is made up of small, discrete-time transitions,. • Introducing in ancomprises MDP induces a related semi-MDP whereas thatoptions of an SMDP larger, continuous-time transitions. Options enableall anplanning MDP trajectory to be analyzed in either way. • Hence and learning algorithms from classical MDPs transfer directly options Sutton, Precup combining & Singh,a1999; Precup, 2000) tion 4 to considers the(Cf. problem of effectively given set of options a single overall policy. For example, a robot have pre-designed con• Butinto planning and learning with options can may be much faster!. trollers for servoing joints to positions, picking up objects, and visual search, but still face a difficult problem of how to coordinate and switch between these behaviors [17,22,38,48,50,65–67]. Sections 5 and 6 concern intra-option COMP 767 - Temporal abstraction learning—looking inside options to learn simultaneously about all options consistent with each fragment of experience. Finally, in Section 7 we illustrate a notion of subgoal that can be used to improve existing options and learn new ones.. 8.
(10) Ro o m s Ex a m p l e Illustration: Navigation wit h c e ll-t o -c e ll primit ive ac t io ns V (goal )=1. Iteration #0. Iteration #1. Iteration #2. Iteration #1. Iteration #2. wit h ro o m-t o -ro o m o pt io ns V (goal )=1. Iteration #0. COMP 767 - Temporal abstraction. 9.
(11) Illustration: Random landmarks • Generate a lot of options, then worry about which are useful! • Large set of landmarks, i.e. states in the environment, chosen at random (Mann, Mannor & Precup, 2015) • Rough planner which can get to a landmark from its vicinity, by solving a deterministic relaxation of the MDP. I. 6. PFVI OFVI. 1.0 1.0. 0.8. 1.0. OFVI LAVI. 1.0. 1.0. 0.8 0.8. 0.8. 0.8. 0.6 0.6. 0.6 0.6. 0.6. 0.4 0.4 0.2 0.2. 0.4 0.4 0.2 0.2. 0.4. 0.0 0.0 0.0 0.0. 0.2 0.2. 0.4 0.4. 0.6 0.6. 0.8 0.8. 1.0 1.0. 0.0 0.0 0.0 0.0. 0.2 0.2 0.2. 0.4 0.4. 0.6 0.6. 0.8 0.8. 1.0 1.0. 0.0 0.0. Landmark-based approximate value iteration gets a good solution much faster!. 0.2. 0. Figure 6: Example trajectories from the last (KOFVI, = 30)and iter ries for policies derived from the for lastpolicies (K = derived 30) iteration of PFVI, LAVICOMP on the continuous domain. For LAVI,are thedrawn landmark hyperspher o rooms domain. Forabstraction LAVI,two therooms landmark hyperspheres as black 767 - Temporal 10 ovals. 30 2.5. 30.
(12) The anatomy of the reward option model • Primitive action model: ra(s) = E[rt|st = s, at = a] • Option model: rω (s) = E[rt + γrt+1 + . . . |st = s, ωt = ω] • This expectation indicates a Markov-style property, as it depends only on the identity of the state and the option, not on the time step • Notice the model is basically a value function so we can write Bellman equations for the model: X X rω (s) = πω (s, a)[r(s, a) + γ(1 − βω )rω (s0)] a. s0. • This means that we can use RL methods to learn the models of options! • Very similar equations hold for the transition model COMP 767 - Temporal abstraction. 11.
(13) Intra-option algorithms • • • • •. Learning about one option at a time is very inefficient In fact, we may not want to execute options at all! Instead, learn about all options consistent with the behaviour In some sense, a form of attention E.g. action-value function, tabular case On single-step transition hs, a, r, s0i, for all ω that could have been executing in s and taken a: QΩ(s, ω) = QΩ(s, ω) + α[ra(s) + γ(1 − βω (s0))QΩ(s0, ω) + X 0 0 0 + βω (s )γ max Q (s , ω ) − QΩ(s, ω)] Ω 0 s0. ω. • In general function approximation, importance sampling will need to be used (several papers on this) COMP 767 - Temporal abstraction. 12.
(14) Frontier: Option Discovery • Options can be given by a system designer • If subgoals / secondary reward structure is given, the option policy can be obtained, by solving a smaller planning or learning problem (cf. Precup, 2000) • What is a good set of subgoals / options? • This is a representation discovery problem • Studied a lot over the last 15 years • Bottleneck states and change point detection currently the most successful methods. COMP 767 - Temporal abstraction. 13.
(15) Bottleneck states. • Perhaps the most explored idea in options construction • A bottleneck is a special state, which is visited more often than others, allows “circulating on the graph” • Lots of different approaches! – Frequency of states (McGovern et al, 2001, Stolle & Precup, 2002) – Graph partitioning / state graph analysis (Simsek et al, 2004, Menache et al, 2004, Bacon & Precup, 2013) – Information-theoretic ideas (Peters et al., 2010) • People seem quite good at generating these (cf. Botvinick, 2001, Solway et al, 2014) • Main drawback: expensive both in terms of sample size and computation COMP 767 - Temporal abstraction. 14.
(16) Option-critic • Explicitly state an optimization objective and then solve it to find a set of options • Handle both discrete and continuous set of state and actions • Learning options should be continual (avoid combinatorially-flavored computations) • Options should provide improvement within one task (or at least not cause slow-down...). COMP 767 - Temporal abstraction. 15.
(17) Actor-critic architecture Actor Policy. Gradient. st. Critic Value function. TD error at+1. rt. Environment. • Clear optimization objective: average or discounted return • Continual learning • Handles both discrete and continuous states and actions COMP 767 - Temporal abstraction. 16.
(18) 458 using the po tary material. 459 options has t 460 continuous) s 4. Option-critic architecture 461 Option-critic architecture smaller set of 462 options can b Behavior policy 463 ! tions models ⇡ 464 Options 465 5. Experim ⇡ , 466 467 In order to il Gradients 468 liminary exp TD error 469 Critic s a al., 1999). W Q ,A 470 ner and defin 471 r A penalty of 472 action taken Environment 473 elastic collisi 474 upon taking Figure 1: The option-critic architecture consists of a set of 475 actions were • Parameterize internal policies and termination conditions options, a policy over them and a critic. Gradients can be 476 cell in each o • Policy over options is computed by a separate process (planning, RL, ...) derived from the critic for both the intra-option policies and 477 south. Any a termination functions. The execution model is suggested 478 case17 the agen COMP 767 - Temporal abstraction pictorially by a switch ? over the contacts (. Switching 479 discount fact can only take place when a termination event is encoun480 We chose to tered. 481 t. ⌦. !0. t. !0. U. t. ⌦. t.
(19) Formulation • The option-value function of a policy over options πΩ is given by X QπΩ (s, ω) = πω (a|s)QU (s, ω, a) a. where QU (s, ω, a) = ra(s) + γ. X s0. Pa(s0|s)U (ω, s0). • The last quantity is the utility from s0 onwards, given that we arrive in s0 using ω U (ω, s0) = (1 − βω (s0))QπΩ (s0, ω) + βω (s0)VπΩ (s0) • We parameterize the internal policies by θ, as πω,θ , and the termination conditions by ν, as βω,ν • Note that θ and ν can be shared over the options! COMP 767 - Temporal abstraction. 18.
(20) Main result: Gradient updates . • Suppose we want to optimize the expected return: E QπΩ (s, ω) • The gradient wrt the internal policy parameters θ is given by: ∂ log πω,θ (a|s) E QU (s, ω, a) ∂θ. This has the usual interpretation: take better primitives more often inside the option • The gradient wrt the termination parameters ν is given by: 0 ∂βω,ν (s ) AπΩ (s0, ω) E − ∂ν where AπΩ = QπΩ − VπΩ is the advantage function This means that we want to lengthen options that have a large advantage COMP 767 - Temporal abstraction. 19.
(21) Results: Options transfer 600. SARSA(0) AC-PG OC 4 options OC 8 options. 500. Steps. 400 300 200 100 0. 0. 500. 1000 Episodes. 1500. 2000. • 4-rooms domain, tabular representations, value functions learned by Sarsa • Learning in the first task no slower than using primitives • Learning once the goal is moved faster with the options COMP 767 - Temporal abstraction. 20.
(22) 64 3 ⇥ 3 filters with a stride of 1 in the third layer. We then fed the output of the third layer into a dense shared layer of 512 neurons, as depicted in Figure 6. We fixed the learning rate for the intra-option policies and termination gradient to 0.00025 and Nonlinear used RMSProp for the critic. approximation Results: function. Avg. Score. arto 2009), a ball rily shaped polytate space is conthe ball in the xhoose among five aster or slower, in ⇡⌦ (·|s) e the null action. { ! (s)} an be used to the drag coefficient of {⇡! (·|s)} fter a finite numn repeatedly. Each Figure 6: DQN Deep neural network architecture. concatenation e taking no •action Atari simulator, to learn value functionAover options, actor as above of the last 4 images is fed through the convolutional layers, 0000 reward when producing a dense representation shared across intra-option d any episode tak2500 Testing policies, termination functions and over options.8000 unt factor to 0.99. 10000 10000policy Moving avg.10 2000 DQN 6000 8000 8000 critic with linear 1500 6000 4000 s (Konidaris et 6000 al. We represented the intra-option policies as linear-softmax 4000 4000 1000 2000 0 0. 50. Testing Moving avg.10 DQN 100 150 Epoch. 500 200. (a) Asterix. 0. 50. Testing Moving avg.10 DQN 100 150 Epoch. 2000. 2000 200. (b) Ms. Pacman. 0 0. 0 50. 100 Epoch. 150. 200. 0. (c) Seaquest. 50. Testing Moving avg.10 DQN 100 150 Epoch. 200. (d) Zaxxon. Figure 8: Learning curves in the Arcade Learning Environment. The same set of parameters was used across all four games: 8 • Performance matching better than and DQN options, 0.01 termination regularization, 0.01or entropy regularization, a baseline for the intra-option policy gradients. Time. COMP 767 - Temporal abstraction. Option 0. 21. Option 1.
(23) Results: Learned options are intuitive • In rooms environment, terminations are more likely near hallways (although there are no pseudo-rewards provided) 2500. Avg. Score. 10000. 6000 4000 2000 0 0. 10000. 2000. 8000. 50. Testing Moving avg.10 DQN 100 150 Epoch. (a) Asterix. 8000. 1500. 6000. 1000. 4000. 500 200. 0. 50. Testing Moving avg.10 DQN 100 150 Epoch. (b) Ms. Pacman. 6000 4000 2000. 2000 200. 0 0. 2011) of order 3. We experi We used Boltzmann policies linear-sigmoid functions for learning rates were set to 0 both the intra and termination Testing Movinggreedy avg.10 policy over options w DQN. 8000. Testing Moving avg.10 DQN. 0 50. 100 Epoch. 150. 200. 0. (c) Seaquest. 50. 100 Epoch. 150. 200. (d) Zaxxon. Figure 3: Termination probabilities • In Seaquest, separate options are learned for to the go option-critic up and down Time agent learning with 4 options. The darkest color represents the walls in the environment while lighter colors encode higher termination probabilities.. Undiscounted Return. 8500 7500 Figure 8: Learning curves in the Arcade Learning Environment. The same set of parameters was used across all four games: 8 options, 0.01 termination regularization, 0.01 entropy regularization, and a baseline for the intra-option policy gradients. 5000. 0. -5000. In the two temporally extended settings, with 4 options and 8 options, termination events are more likely to occur near doorways (Figure 3), agreeing with the intuition Optionthe 0 Option 1 0 25 50 1 that they would be good subgoals. As opposed to (Sutton, Figure 9: Up/down specialization in the by we option-critic learning 2 options in Seaquest. The top bar Precup, andsolution Singhfound 1999), did not when encode this with knowledge shows a trajectory in the game, with “white” representing a segment during which option 1 was active and “black” for option 2. Figure 5:22Learning cur ourselves but simply let the agents find options that would COMP 767 - Temporal abstraction maximize the expected discounted return. use the DQN framework to implement a gradient-based optheorem, and as discussed in (Thomas 2014),Inthe(Konidaris gradient and Barto tion learner, which uses intrinsic rewards to learn the internal estimators can be biased in the discounted case. By introPinball Domain used and updated after a ges Qt policies of options, and extrinsic rewards to learn the polducing factors of the form t i=1 (1 i ) in our updates learning is fully integrated in icy over options. As opposed to our framework, descriptions (Thomas 2014, eq (3)), it would be possible to obtain un-.
(24) What are beneficial options • Successful simultaneous learning of terminations and option policies • But, as expected, options shrink over time unless a margin is required for the advantage Cf. time-regularized options, Mann et al, (2014) • Intuitively, using longer options increase the speed of learning and planning (but may lead to a worse result in call-and-return execution) • What is the right tool to formalize this intuition?. COMP 767 - Temporal abstraction. 23.
(25) Regularization • Entropy-based regularization is useful to encourage diversity in the options • Length collapse: options “dissolve” into primitive action over time • Assumption: executing a policy is cheap, deciding what to do is expensive – Many choices may need to be evaluated (branching factor over actions) – In planning, many next states may need to be considered (branching factor over states) – Evaluating the function approximator might be expensive (e.g. if it is a deep net) • Deliberation is also expensive in animals: – Energy consumption (to engage higher-level brain function) – Missed opportunity cost: thinking too long means action is delayed. COMP 767 - Temporal abstraction. 24.
(26) Bacon, Harb & Precup. Deliberation/Switching Cost Time Base MDP + Options Deliberation Costs Figure 1: The switching cost is incurred upon entering SMDP decision points, represented. • Let byc(s, < 0Thebeaverage the decision immediate ofstep switching ω in s by openω) circles. cost per cost primitive (filled circle)to is represented intensityfunction of the subtrajectory. • Newthevalue expresses total deliberation cost: (s0 , o). Furthermore, if cω) Qc(s, ✓ Qc✓ (s, o) =. X. (s0 , o). X. 0. 0. X. 0. 0. ω Xs ⇥ 0 0 ⇡ (a | s, o) r(s, a) + P s s, a Q✓ (s , o). 0. ✓ (s. 0. , o) Ac✓ (s0 , o) + ⌘. a • New objective: maximize reward with reasonable effort. •. 0. 0. – which P we(s call|s, a switching cost we have µ(ωfunction |s )Q–c(s , ω ): ω) ==c(s,✓ ω) + s0. ⇤. !. ,. (19). c 0 where Ac✓ (s0 , o)=Q ˙ c✓ (s0 , o) Vmax The of the cost to the base MDP [Qintroduction τ Qswitching Ω (s, ω) + c (s, ω)] ✓ (s ). E Ω reward therefore leads to a di↵erent form for the intra-option Bellman equations (5) where a scalar ⌘ is now added to the advantage function. This suggests that the e↵ect of using a τ≥ 0 controls the between valueis believed and computation switching cost ⌘ is to set atrade-off baseline on how good an option to be compared to effort v✓ . increasing ⌘, we e↵ectively express that persisting with an option might be preferable CanBy be shown equivalent to requiring that advantage exceeds a threshold to reconsidering the current course of actions immediately. This preference for committing to theswitching same option might be motivated by computational or metabolic limitations (Simon, before 1957), or by the inherent approximation error (due to finite predictive capacity) or to the uncertainty in the value estimates.. COMP 767 - Temporal abstraction. 5.3 Di↵erent Horizons for Cost and Reward The generality of the regularized objective (18) allows a decoupling of the internal horizon on the expected discounted cost with the discount factor of the external environment. In. 25.
(27) Interesting properties • Immediate cost of deliberation is computed based on properties of the environment • We do not need pseudo-rewards for sub-goals! • Value function over options is still obtained accurately • If ξ = 0 we simply optimize returns as before. COMP 767 - Temporal abstraction. 26.
(28) Example: Immediate cost = number of actions • With primitive actions: average cost of |A| per time step • With options only: average cost of |Ω| incurred only when re-deciding what to do • If we re-decide on average every kth step, and if |Ω| < k|A|, deliberation with options is cheaper • Even if we use both options and primitive actions, average deliberation is cheaper if |Ω| < (k − 1)|A|. COMP 767 - Temporal abstraction. 27.
(29) Effect of Deliberation Cost Regularization Training curves Amidar. Log termination rate Amidar Breakout. Asterix. 800 600. 6000. 400. 4000. 200. 2000. 0. 0. 1 300 10. 101. 101 20000. 100. 200 100. 10000 100. 100. 10. Breakout 102. 30000. 400. 8000. Asterix Hero 102. 102. 10. 1. 10. 2. 1. 0. 10. 0 1 2 3 4 5 6 7 8. 0 1 2 3 4 5 6 7 8. 0 0 11 22 33 44 55 66 77 88. MsPacman. Pong. MsPacman Seaquest. 20. 2000. 10. 1500. 0. 1000. 1000. 10. 500 101. 0 0 1 1 22 33 44 55 66 77 88. 0 1 2 3 4 5. Pong Zaxxon. Seaquest. 6000 2 10. 102. 2500. 10. 102. 1500 1 10 4000. 101 100 2000. 10. 20. 500 0 1 2 3 4 5 6 7 8. 1. 100 0. 0 1 2 3 4 5 6 7 8. 0 0 11 22 33 44 55 66 77 88. 0 0 1 1 22 33 44 55 66 77 88. 0. 0.01 0.015 0.02 0.025 0.005 0.03 0.01 0. Yellow: no0.005regularization; red: most regularization. COMP 767 - Temporal abstraction. 0 1 2 3 4 5. 0.015. 0.02. 28.
(30) Example: Amidar. (a) Without a deliberation cost, options terminate instantly and are used in any scenario without specialization.. (b) Options are used for extended periods and in specific scenarios through a trajectory, when using a deliberation cost.. (c) Termination is sparse when using the deliberation cost. The agent terminates options at intersections requiring high level decisions.. 2: We show the effects of using deliberation costs on both the option termination and policies. In figures (a) and (b), •Figure Deliberation costs represents prevent options becoming too isshort every color in the agent trajectory a different option from being executed. This environment the game Amidar, of the Atari 2600 suite. • Terminations are intuitive. of deliberation cost with previous notions of regularization from (Mann et al. 2014) and (Bacon et al. 2017). The deliberation cost goes beyond only the idea of penalizing for lengthy computation. It can also be used to incorporate other forms of bounds intrinsic to an agent in its environment. One interesting direction for future work is to COMP 767 -ofTemporal abstraction also think deliberation cost in terms of missed opportunity and opening the way for an implicit form of regularization when interacting asynchronously with an environment. Another interesting form of limitation inherent to reinforcement learning agents has to do with their representational capacities when estimating action values. Preliminary work seems. neural foundations: A reinforcement learning perspective. Cognition, 113(3):262 – 280, 2009. [Branavan et al. 2012] S. R. K. Branavan, Nate Kushman, Tao Lei, and Regina Barzilay. Learning high-level planning from text. In ACL, pages 126–135, 2012. [Daniel et al. 2016] C. Daniel, H. van Hoof, J. Peters, and 29 G. Neumann. Probabilistic inference for determining options in reinforcement learning. Machine Learning, Special Issue, 104(2):337–357, 2016. [Dayan and Hinton 1992] Peter Dayan and Geoffrey E. Hinton. Feudal reinforcement learning. In NIPS, pages 271–.
(31) Relationship to other ideas • Human problem solving: Solway et al (2014) proposed a Bayesian model selection framework to explain subgoal learning by humans trying to navigate an unknown map • Humans found subgoals “around” bottleneck states • Inspection of their criterion (Bayesian fit to the data) shows strong similarity with using a cost equal to the branching factor between options, plus the branching factor within the active option • Deliberation costs can also explain the value of options for exploration – Travelling quickly around the environment means values will become accurate more quickly – The best action becomes clear earlier, which would make it easier to choose. COMP 767 - Temporal abstraction. 30.
(32) Knowledge representation: Generalized Value Functions (GVFs) • Given a cumulant function c, state-dependent continuation function γ and policy π, the Generalized Value Function vπ,γ,c is defined as: vπ,γ,c(s) = E. ". ∞ X k=t. Ck+1. k Y. i=t+1. γ(Si)|St = s, At:∞ ∼ π. #. • Cumulant c is a function that can depend on the St, or (St, At) or (St, At, St+1) and can output a vector (even a matrix) Eg Ck+1 = c(Sk , Ak , Sk+1) • Continuation function γ maps states to [0,1] • Cf. Horde architecture (Sutton et al, 2011); Adam White’s thesis; inspiration from Pandemonium architecture COMP 767 - Temporal abstraction. 31.
(33) GV. s¯, a ¯. GVFs will be the building blocks of knowledge r✓ log ⇡(✓). ⇥. ⇡(✓). R. s. GVF. v • Note that one can take the output of a GVF and make it an input to another Figure 1: GVF GVFs for Policy Gradient. On the left, we illustra • Or,within the output ofgeneral a GVF could become part of framework. the “state” for another ⇡(✓) the value function On the righ GVF. as a general value function whose cumulant is a function of t on an initial state-action pair. R. COMP 767 - Temporal abstraction. Q(·, w0 ). 32. C(!). (1. (!)). ⇡(!). R. W.
(34) Usual value functions are GVFs • Cumulant is reward, γ is constant ∞ X vπ (s) = E[ Rk+1γ k−t|St = s, At:∞ ∼ π] k=t. ∞ k X Y = E[ Rt+1 γ|St = s, At:∞ ∼ π] t=0. i=t+1. ∗ • A special case is when a policy is optimal wrt c, γ, vc,γ - Universal Value Function approximation (UVFA) (Schaul et al, 2015). COMP 767 - Temporal abstraction. 33.
(35) Successor states and successor features are GVFs • Successor features (Barreto et al, 2017, 2018) are a natural extension of successor states (Dayan, 1992) • Successor states give the expected occupancy of future states • If states are defined by a feature vector φ(s), successor features give the expected, discounted sum of future feature vectors from a state. • In GVF terms, the cumulant is c = φ, and there is a fixed policy and discount • Interesting property highlighted in Barreto et al: vwT c,π,γ (s) = wT vc,π,γ (s) which leads to very efficient use of existing GVFs to do one-shot computation of new GVFs. COMP 767 - Temporal abstraction. 34.
(36) The anatomy of the reward option model • Option reward model is defined as: rω (s) = E[Rt+1 + γRt+2 + . . . |St = s, ωt = ω] We can write Bellman equations for the model: rω (s) =. X. πω (s, a)[r(s, a) +. a. X s0. Pω (s0|s)γ(1 − βω (s0))rω (s0)]. • This means the option reward model is a GVF, vω,r : cumulant is the environment reward r, policy is πω ,. continuation function is γ(1 − βω ) • Note that we could use γ = 1 for the overall task and these GVFs would still be well defined • Option transition model can similarly be shown to be a GVF COMP 767 - Temporal abstraction. 35.
(37) Option-value function • The option-value function of a policy over options πΩ is defined as: qπΩ (s, ω) = EπΩ Rt+1 + γβω (St+1)qπΩ (St+1, ωt+1)) + γ((1 − βω (St+1))qπΩ (St+1, ω)|St = s • Option-value function is a GVF but the cumulant includes the value of the next options • Intuitively, models are more self-contained than option-value functions. COMP 767 - Temporal abstraction. 36.
(38) To do list • Maybe: improve off-policy learning • How do we generate behavior from GVFs ? – Option call-and-return execution gives a hint, but is too limiting – Ideally, we will use GVFs to plan (done for option models so far) – Exploration: how to generate data for many GVFs? • Discovery: how do we get initial conditions, policies and continuation/termination functions?. COMP 767 - Temporal abstraction. 37.
(39)
Related documents