Outlook - Learning the Structure of Continuous Markov Decision Processes

This thesis has focused on basic research in hierarchical RL within a well-defined theoretical framework, namely MDPs. This has allowed studying the proposed methods in isolation, performing reproducible experiments, and systematically varying properties of agent and environment. The thesis concludes by outlining some possible routes for future work that build upon the presented approaches. The outlined topics are considered to be essential for open-ended, lifelong learning of behavior in autonomous agents such as robots. See also Barto et al. (2013) for a related discussion.

Complex Behavioral Hierarchies This thesis has focused on hierarchical architectures consisting of three layers. These architectures contain the primitive actions on the lowest layer, the acquired skills on the middle layer, and a high-level policy on the top layer, which selects among the skills. Skill discovery changes the number of skills on the middle layer. The number of layers, however, remains constant. The same is true for most related work. Future work on learning deep architectures, in which skills can invoke other skills, is desirable since the ability to form such deep architectures is commonly considered to be a prerequisite for actual open-ended lifelong learning. The dendrogram generated by the hierarchical clustering of the transition graph (see Section 2.4.3.3) could be an interesting starting point for this.

Intrinsic Motivation Lifelong learning does not only require an agent to know how to learn but also to decide when and what to learn. Knowing when to learn is important since learning of behavior involves exploration, which may not be feasible in all situations, e.g., when an agent is in a dangerous situation or has to fulfill a time-critical task. Deciding what to learn is important since complex and dynamic environments offer such a multitude of experience that it becomes impossible for an agent to learn about every aspect of the world. Thus, the agent must decide which aspects of the world are relevant and which behavior is worth the effort of being learned. An intrinsic motivation system can address these issues. However, a lifelong learning agent will require a sophisticated motivation system that need not be redesigned anew for each problem. The intrinsic motivation systems proposed in this thesis as well as in related works are typically problem-specific or focus solely on specific settings like the developmental one. Future work on more general intrinsic motivation systems will thus be important for lifelong learning in robotics.

Parametrized Options This work has modeled skills as options, which are essentially closed-loop policies. However, as pointed out by Barto et al. (2013), what is commonly denoted as a skill is actually more flexible than an option: for instance, the skill “throwing an object” would correspond to many different options whose policies depend on “contextual information” such as the target position, the type of object, or the desired trajectory of the throw. As proposed by da Silva et al. (2012), skills can be considered as a family of options which are parameterized by the context. Transfer learning can be used to generalize from contexts for which option policies have been learned to novel but related contexts. Ongoing work by the author not covered in this thesis extends these parameterized options to skill templates, which take the uncertainty of the generalization into account during transfer, and aims at applying hierarchical RL in robotic manipulation tasks (Metzen and Fabisch, 2013; Metzen et al., 2014).

Related to this is contextual policy search (Deisenroth et al., 2013), a multi-task learning approach in which several options for different contexts are learned concurrently and a high-level policy generalizes the experience of these low-level options over different contexts. Thus, there is some recent work on learning such “contextual” options. However, we are not aware of any work on the discovery of such options. The work of Daniel et al. (2012) could be an interesting starting point.
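To make the notion of a skill as a family of options parameterized by the context more concrete, the following minimal sketch stores the policy parameters learned for a few contexts and generalizes to a novel context. All names (`ParameterizedSkill`, `add_option`, `policy_params_for`) are hypothetical, and the inverse-distance weighting is merely a simple stand-in chosen for illustration; it is not the method of da Silva et al. (2012) and it ignores the uncertainty of the generalization.

```python
import math
from dataclasses import dataclass, field
from typing import List, Sequence, Tuple

Vector = Tuple[float, ...]


@dataclass
class ParameterizedSkill:
    """A family of options indexed by a context vector (illustrative only)."""

    # Each entry pairs a context with the policy parameters learned for it.
    library: List[Tuple[Vector, Vector]] = field(default_factory=list)

    def add_option(self, context: Sequence[float], policy_params: Sequence[float]) -> None:
        """Store the policy parameters of an option learned for one context."""
        self.library.append((tuple(context), tuple(policy_params)))

    def policy_params_for(self, context: Sequence[float]) -> Vector:
        """Generalize to a context by inverse-distance weighting (an assumption)."""
        if not self.library:
            raise ValueError("no option has been learned yet")
        query = tuple(context)
        weights = []
        for known_context, params in self.library:
            dist = math.dist(known_context, query)
            if dist == 0.0:
                return params  # the context has been seen before
            weights.append(1.0 / dist)
        total = sum(weights)
        n_params = len(self.library[0][1])
        return tuple(
            sum(w * params[i] for w, (_, params) in zip(weights, self.library)) / total
            for i in range(n_params)
        )


# Hypothetical "throwing" skill: context = target distance, parameter = throw velocity.
throwing = ParameterizedSkill()
throwing.add_option(context=[1.0], policy_params=[0.2])
throwing.add_option(context=[3.0], policy_params=[0.6])
novel_params = throwing.policy_params_for([2.0])  # interpolates between both options
```

In this sketch, a novel context that lies halfway between two known contexts receives the average of their policy parameters. A more realistic implementation would replace the weighting scheme with a learned regression model over contexts.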

Undirected Behavior This thesis has focused on goal-directed skills, i.e., skills which terminate once a specific region of the state space is reached. While such skills are useful behavioral building blocks, for instance in the area of object manipulation, they are not suited for all kinds of behavior. For instance, they are not applicable to rhythmic and repetitive behavior such as walking, running, swimming, swinging, or stirring. For such behavior, rather than identifying goal regions, it is important to identify specific patterns in the dynamic behavior such as desired limit cycles. Future work on acquiring reusable building blocks for this kind of behavior would be an interesting complement to the works discussed and proposed in this thesis.
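The difference between goal-directed and rhythmic behavior can be illustrated with a small sketch of an option whose termination condition is membership in a goal region. The one-dimensional toy dynamics, the class `GoalDirectedOption`, and the helper `run_option` are assumptions made purely for illustration and are not taken from the methods of this thesis.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class GoalDirectedOption:
    """An option that terminates once the state lies in [goal_low, goal_high]."""

    policy: Callable[[float], float]  # maps a state to an action
    goal_low: float
    goal_high: float

    def terminates(self, state: float) -> bool:
        return self.goal_low <= state <= self.goal_high


def run_option(
    option: GoalDirectedOption,
    step: Callable[[float, float], float],
    state: float,
    max_steps: int = 100,
) -> Tuple[float, int]:
    """Follow the option's policy until it terminates or max_steps is reached."""
    steps = 0
    while not option.terminates(state) and steps < max_steps:
        state = step(state, option.policy(state))
        steps += 1
    return state, steps


def toy_step(state: float, action: float) -> float:
    """Hypothetical deterministic dynamics: the action is added to the state."""
    return state + action


# Goal-directed behavior: move towards 5 and stop once the goal region is reached.
reach = GoalDirectedOption(lambda s: 1.0 if s < 5.0 else -1.0, 4.5, 5.5)
reach_result = run_option(reach, toy_step, state=0.0)

# Rhythmic behavior: oscillate between 0 and 1; the goal region is never entered,
# so only the step limit ends the execution.
swing = GoalDirectedOption(lambda s: 1.0 if s < 0.5 else -1.0, 4.5, 5.5)
swing_result = run_option(swing, toy_step, state=0.0)
```

The first option ends after a few steps because the state enters the goal region, whereas the oscillating policy has no natural termination region and is only stopped by the step limit. This is the sense in which a limit cycle, rather than a goal region, is the relevant structure for rhythmic behavior.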

Real-world Robotic Applications The most important future work will be to show the potential of hierarchical RL in real-world problems, e.g., in robotic applications. In this thesis, as in nearly all related work, the empirical evaluation has been conducted in simulated environments. This has the advantage that strengths and weaknesses of methods can be systematically explored, for instance by varying the stochasticity of a domain or the explorative behavior of the agent. Some of the problems considered in this thesis, e.g., the Octopus arm problem, should not be considered as “toy” problems since their complex dynamics make them challenging for both humans and conventional control approaches. Nevertheless, future work that shows the potential of hierarchical RL in important real-world tasks is much needed (Barto et al., 2013). The following list gives some of the most severe challenges and how they could be addressed:

• Real-world problems in robotics are noisy, partially observable, and potentially non-Markovian. Furthermore, learning must not impair the robot. One way of addressing these challenges is to integrate hierarchical RL into a robotic control architecture such as the one shown in Figure 1.1. In this way, the learning component could be provided with a more abstract and curated view of the problem that is more amenable to RL. For instance, adding low-level reflexes and behavior supervision modules can reduce the risk that explorative behavior of the agent impairs the system. Perception modules can reduce noise in the sensors and estimate unobserved components of the environment’s state based on the sensory input. In this way, the level of noise and partial observability may be reduced.

• Real-world problems in robotics require learning in high-dimensional state and action spaces. Traditional RL approaches like temporal difference learning suffer from the curse of dimensionality and do not scale easily to this kind of problem. Direct policy search approaches with problem-specific policy representations are considered to be more promising (Deisenroth et al., 2013). However, the combination of direct policy search with hierarchical RL is an area where further research is required and which promises considerable progress for robot learning.

• Real-world problems require that agents learn novel knowledge and adapt existing knowledge with few trials. This implies that behavior is not learned from scratch but that reusable, modular building blocks of procedural knowledge are acquired which simplify the learning of novel behavior and the adaptation of existing behavior. This thesis has focused on and contributed to the discovery and learning of these building blocks. Future work will require developing means which allow the efficient utilization of the building blocks in robotic tasks.
