
5.4 Discussion

5.4.3 Limitations and future work

The provided examples generate a path plan in Euclidean space without considering the configuration space of the real system. As a result, the performance of the robot depends largely on how closely it is able to track the given path. The method itself, however, can handle planning in higher-dimensional configuration spaces. Currently, we assume linear dynamics, which simplifies the derivation of the CBF constraints. However, one of the strengths of the CBF is its ability to incorporate general nonlinear control-affine dynamics. One direction of future work is to develop a motion-planning variant of the proposed method that takes into account the robot kinematics/dynamics and learns a policy that directly outputs joint-level controls. The effectiveness of our method may also be limited by FSPAs with cycles (loops between automaton states that are not self-loops). This issue can be resolved by replacing our current (greedy) reward design with potential-based rewards (Camacho et al., 2017).
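To illustrate why a potential-based reward is insensitive to cycles between automaton states, the following is a minimal Python sketch. It is not the reward design used in this work or in Camacho et al. (2017); the choice of potential (negative graph distance to the nearest accepting state), the function names, and the toy automaton are all assumptions made for illustration.

```python
from collections import deque


def distances_to_accepting(edges, accepting):
    """Breadth-first search on reversed edges.

    Returns, for every automaton state that can reach an accepting state,
    the minimum number of transitions needed to get there.
    """
    reverse = {}
    for src, dst in edges:
        reverse.setdefault(dst, set()).add(src)
    dist = {q: 0 for q in accepting}
    queue = deque(accepting)
    while queue:
        q = queue.popleft()
        for p in reverse.get(q, ()):
            if p not in dist:
                dist[p] = dist[q] + 1
                queue.append(p)
    return dist


def shaping_reward(dist, q, q_next, gamma=0.99):
    """Potential-based term gamma * Phi(q_next) - Phi(q).

    Assumption: Phi(q) = -distance(q). States that cannot reach an
    accepting state are absent from `dist` and raise KeyError here.
    """
    return gamma * (-float(dist[q_next])) - (-float(dist[q]))


# Hypothetical automaton with a cycle between q1 and q2.
edges = [("q0", "q1"), ("q1", "q2"), ("q2", "q1"), ("q2", "qf")]
dist = distances_to_accepting(edges, ["qf"])

# With gamma = 1 the shaping terms collected around the cycle cancel,
# so looping between q1 and q2 earns nothing.
cycle_gain = (shaping_reward(dist, "q1", "q2", gamma=1.0)
              + shaping_reward(dist, "q2", "q1", gamma=1.0))
```

With a discount factor below one, the terms around a cycle no longer cancel exactly, so the undiscounted case above is only the simplest way to see the effect.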

In our formulation, even though we specify the task hierarchically using template formulas, the resultant FSPA is non-hierarchical. The sizes of the FSPAs in our tasks are manageable (22 nodes and 43 edges for the cooking task; 8 nodes and 22 edges for the serving task). In general, however, the size of an FSPA can grow rapidly with the complexity of the TL formula (and knowledge bases can be large), which in turn increases the complexity of the reward. This can adversely affect learning. One approach is to maintain multiple simpler FSPAs (for example, one for each template formula) instead of a single complex one. This approach adds discrete dimensions to the state space (one for each FSPA) but can significantly reduce the complexity of each FSPA. Although not fully developed in this work, we believe that the incorporation of the knowledge base and template formulas presents opportunities to extend our framework to a wider set of capabilities such as high-level (symbolic) task planning/validation, hierarchical learning and skill composition.
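The idea of maintaining several small automata side by side can be sketched as follows. This is only an illustration under simplifying assumptions: guards are plain Boolean predicates over a one-dimensional state rather than TLTL predicates, a state with no enabled edge is assumed to self-loop, and all names are hypothetical.

```python
def step_automaton(transitions, q, s):
    """Advance one automaton from state q given MDP state s.

    `transitions` maps an automaton state to a list of (guard, next_state)
    pairs; the first enabled guard wins, otherwise the automaton stays put.
    """
    for guard, q_next in transitions.get(q, []):
        if guard(s):
            return q_next
    return q


def step_all(automata, qs, s):
    """Advance every automaton independently; `qs` holds one state each."""
    return tuple(step_automaton(t, q, s) for t, q in zip(automata, qs))


# Two hypothetical single-edge automata, one per template formula.
reach_a = {0: [(lambda s: s > 1.0, 1)]}
reach_b = {0: [(lambda s: s < -1.0, 1)]}
automata = [reach_a, reach_b]

# The augmented state is (s, q_a, q_b): one discrete dimension per automaton.
qs = (0, 0)
for s in [0.0, 1.5, -2.0]:
    qs = step_all(automata, qs, s)
```

Each automaton here has two states, so the policy sees two binary dimensions instead of one automaton whose state set encodes every combination.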

Chapter 6 Temporal Logic Guided Skill Composition

6.1 Overview

Policies learned using reinforcement learning aim to maximize the given reward function and are often difficult to transfer to other problem domains (tasks with different rewards). Skill composition is the process of constructing new skills out of existing ones (policies) with little to no additional exploration. In stochastic optimal control, this idea has been adopted by (Todorov, 2009) and (Da Silva et al., 2009) to construct provably optimal control laws based on linearly solvable Markov decision processes.

In this chapter, we present a skill composition technique for policies that are learned under a variation of the FSPA-augmented MDP framework. We build on the results of (van Niekerk et al., 2018) and prove that the composed policy is optimal in both −AND− (conjunctive) and −OR− (disjunctive) skill compositions. We show that incorporating temporal logic allows us to compose tasks of greater logical complexity. We evaluate our method in simulation (discrete state and action spaces) and experimentally on a Baxter robot (continuous state and action spaces).

6.2 Related Work

Recent efforts in skill composition have mainly adopted the approach of combining value functions learned using different rewards. (Peng et al., 2018) constructs a composite policy by combining the value functions of individual policies using the Boltzmann distribution. With a similar goal, (Zhu et al., 2017) achieves task space transfer using deep successor representations (Kulkarni et al., 2016). However, this approach requires the reward function to be represented as a linear combination of state-action features. The authors of (Andreas et al., 2017) use policy sketches for composition. However, only sequential sub-task execution is supported.

The authors of (Haarnoja et al., 2018a) have shown that when using energy-based models (Haarnoja et al., 2017), an approximately optimal composite policy can result from taking the average of the Q-functions of existing policies. The resulting composite policy achieves the −AND− task composition, i.e. the composite policy maximizes the average reward of the individual tasks. In (van Niekerk et al., 2018), the authors took this idea a step further and showed that by combining individual Q-functions using the log-sum-exponential function, the −OR− task composition (the composite policy maximizes the (soft) maximum of the rewards of the constituent tasks) can be achieved optimally.
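The two value-function combinations described above can be sketched for a single state with tabular Q-values. This is a rough illustration rather than the cited authors' implementations: the temperature parameter, the unweighted log-sum-exp, the greedy action choice, and the toy Q-values are all assumptions.

```python
import math


def compose_and(q_values):
    """-AND- style composition: average the Q-values action by action."""
    n = len(q_values)
    return {a: sum(q[a] for q in q_values) / n for a in q_values[0]}


def compose_or(q_values, temperature=1.0):
    """-OR- style composition: temperature-scaled log-sum-exp per action.

    The maximum is subtracted before exponentiating for numerical stability.
    """
    composed = {}
    for a in q_values[0]:
        scaled = [q[a] / temperature for q in q_values]
        m = max(scaled)
        composed[a] = temperature * (m + math.log(sum(math.exp(x - m) for x in scaled)))
    return composed


def greedy(q):
    """Pick the action with the highest composed Q-value."""
    return max(q, key=q.get)


# Hypothetical Q-values in one state for two base tasks.
q_task1 = {"left": 1.0, "right": 0.0, "stay": 0.6}
q_task2 = {"left": 0.0, "right": 1.0, "stay": 0.6}

q_and = compose_and([q_task1, q_task2])
q_or = compose_or([q_task1, q_task2], temperature=0.1)
and_action = greedy(q_and)  # favors the action that is decent for both tasks
or_action = greedy(q_or)    # favors an action that is best for either task
```

As the temperature decreases, the log-sum-exp approaches the plain maximum over the base Q-values, which is why it acts as a soft maximum.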

Multi-task learning (Andreas et al., 2017) and meta-learning (Finn et al., 2017) are often used to achieve few-shot/zero-shot task generalization. Here we make a distinction between skill composition (our focus) and multi-task learning/meta-learning: the former constructs new policies from a library of learned policies, whereas the latter often learns and generalizes from a predefined set of tasks/task distributions (meaning that the difference among the tasks in the task distribution is often controlled by the part of the state space that the agent needs to generalize over). A comparison with multi-task/meta-learning is beyond the scope of this work.

In our framework, skill composition is accomplished by taking the product of the finite state predicate automata. Instead of interpolating/extrapolating among learned skills/latent features (Peng et al., 2018; Zhu et al., 2017), our method is based on graph manipulation of the FSPAs.
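A product of two automata for a conjunctive specification can be sketched as below. This is a simplified illustration, not the construction used in this work: guards are Boolean predicates rather than TLTL predicates, every automaton is assumed to list its self-loops explicitly, and the dictionary layout and helper names are hypothetical.

```python
from itertools import product


def fspa_product(a, b):
    """Synchronous product: both automata step on the same input, and an
    edge is enabled only when both component guards hold."""
    edges = []
    for (s1, g1, d1), (s2, g2, d2) in product(a["edges"], b["edges"]):
        # Default arguments bind the current guards to this lambda.
        guard = lambda x, g1=g1, g2=g2: g1(x) and g2(x)
        edges.append(((s1, s2), guard, (d1, d2)))
    return {
        "initial": (a["initial"], b["initial"]),
        # Accept only when both components accept (conjunction).
        "accepting": set(product(a["accepting"], b["accepting"])),
        "edges": edges,
    }


def run(automaton, trace):
    """Follow the first enabled edge at each step and return the final state."""
    q = automaton["initial"]
    for x in trace:
        for src, guard, dst in automaton["edges"]:
            if src == q and guard(x):
                q = dst
                break
    return q


def eventually(pred):
    """Two-state automaton for 'eventually pred', with explicit self-loops."""
    return {
        "initial": 0,
        "accepting": {1},
        "edges": [(0, pred, 1), (0, lambda x: not pred(x), 0), (1, lambda x: True, 1)],
    }


# Hypothetical base specifications over a 1-D state.
visit_a = eventually(lambda x: x > 1.0)
visit_b = eventually(lambda x: x < -1.0)
both = fspa_product(visit_a, visit_b)

final_state = run(both, [0.0, 1.5, -2.0])
satisfied = final_state in both["accepting"]
```

The product has one edge per pair of component edges, which also shows how quickly a single composed automaton can grow compared with its parts.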

6.3 Problem Formulation And Approach

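Problem 6.3.1. Given a set of TLTL formulas $\phi = \{\phi_1, \dots, \phi_n\}$ and their optimal policies $\pi^\star_{\phi} = \{\pi^\star_{\phi_1}, \dots, \pi^\star_{\phi_n}\}$, obtain the optimal policy $\pi^\star_{\phi_\wedge}$ that satisfies $\phi_\wedge = \bigwedge_{i=1}^{n} \phi_i$.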

In the following sections, we refer to $\phi$ as the set of base specifications, $\pi$ as the set of base policies, $\phi_\wedge$ as the composed specification, and $\pi_\wedge$ as the composed policy.

Problem 6.3.1 defines the problem of skill composition: given a set of policies, each satisfying a TLTL specification, construct the policy that satisfies the conjunction (−AND−) of the given specifications. Solving this problem is useful when we want to break a complex task into simple and manageable components, learn a policy that satisfies each component, and "stitch" all the components together so that the original task is satisfied. It can also be the case that, as the scope of the task grows with time, the original task specification is amended with new items. Instead of having to relearn the task from scratch, we can learn only the policies that satisfy the new items and combine them with the old policy.

In document A formal methods approach to interpretability, safety and composability for reinforcement learning (Page 110-114)