2.3 Trajectory Generation

2.3.4 Learning-based Trajectory Generation

Here we present a brief overview of methods for motion generation based on observing and repeating sample motion trajectories. Unlike the planning and control methods presented so far, imitation learning methods do not assume that a cost function specifying the desired motions is available. Rather, the challenge is to learn an approximate model of this cost (e.g. with machine learning) from the available data, and then to use that model to generate motion.

2.3.4.1 Direct Policy Learning

Direct Policy Learning (DPL) is one of the fundamental approaches to imitation learning, see [Pomerleau 91, Argall 09]. It is sometimes also called “behavior cloning”, because it is a straightforward approach that tries to repeat observed motions.

A standard way to describe a robot trajectory is $\{x_t, u_t\}_{t=0}^{T}$, where $x_t$ represents the robot state at time $t$ and $u_t$ is the control signal, e.g. the rate of change $\dot{x}_t$. DPL tries to find a policy $\pi : x_t \mapsto u_t$ from these observations. Different assumptions can be made for the choice of $x$, $u$ and $\pi$ [Calinon 07], with refinements like data transformations and active learning. Given a parameterization of the policy, DPL essentially corresponds to a regression problem, e.g. with loss:

$$E_{dpl} = \sum_{t=0}^{T} \|\pi(x_t) - u_t\|^2 \tag{2.4}$$

where $\|\cdot\|^2$ denotes the squared $L_2$ norm. Minimizing $E_{dpl}$ finds a policy that is close to the demonstrations in the least squares sense. The above loss can be extended to multiple demonstration trajectories by averaging over them.
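As an illustration of the regression view, the following is a minimal sketch that fits a policy by minimizing the loss (2.4). It assumes a linear policy with an offset term, which is only one possible parameterization; the function names and the toy demonstration data are hypothetical and not taken from the cited works.

```python
import numpy as np

def fit_linear_policy(X, U):
    """Least-squares fit of a linear policy pi(x) = [x, 1] @ W (an assumed model)."""
    # Append a constant 1 to each state so the policy can learn an offset.
    X_aug = np.hstack([X, np.ones((X.shape[0], 1))])
    # lstsq minimizes sum_t ||X_aug[t] @ W - U[t]||^2, i.e. the loss (2.4).
    W, *_ = np.linalg.lstsq(X_aug, U, rcond=None)
    return W

def policy(W, x):
    """Evaluate the fitted policy at a single state x."""
    return np.append(x, 1.0) @ W

def dpl_loss(W, X, U):
    """Sum of squared L2 errors between policy outputs and demonstrated controls."""
    X_aug = np.hstack([X, np.ones((X.shape[0], 1))])
    return float(np.sum((X_aug @ W - U) ** 2))

# Hypothetical demonstration: a point attracted to the origin, u_t = -0.5 * x_t.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 2))
U_demo = -0.5 * X_demo
W_fit = fit_linear_policy(X_demo, U_demo)
```

Because the toy demonstrations are exactly linear, the fitted policy reproduces them with near-zero loss. With real demonstrations a richer policy class would usually be needed.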

Howard et al. [Howard 09] introduce an interesting alternative loss for DPL:

$$E_{dpl} = \sum_{t=0}^{T} \left( \|u_t\| - \frac{\pi(x_t)^T u_t}{\|u_t\|} \right)^2 \tag{2.5}$$

This loss penalizes the discrepancy between the projection of the policy $\pi(x_t)$ on $u_t$ and the true control $u_t$. [Howard 09] shows that in some problem domains this loss leads to better behavior than the standard least squares loss.
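To make the difference between the two losses concrete, here is a small sketch that evaluates (2.4) and (2.5) on the same predictions. The function names and the toy data are hypothetical, and the code assumes that every demonstrated control $u_t$ is nonzero so that the division by its norm is defined.

```python
import numpy as np

def least_squares_loss(P, U):
    """Loss (2.4): squared L2 distance between predictions P and controls U."""
    return float(np.sum(np.linalg.norm(P - U, axis=1) ** 2))

def projection_loss(P, U):
    """Loss (2.5): squared gap between ||u_t|| and the projection of pi(x_t) on u_t."""
    norms = np.linalg.norm(U, axis=1)          # assumes no zero-length control
    # Scalar projection of each prediction onto the direction of the true control.
    proj = np.sum(P * U, axis=1) / norms
    return float(np.sum((norms - proj) ** 2))

# Toy data: the prediction errors are orthogonal to the true controls.
U = np.array([[1.0, 0.0], [0.0, 2.0]])
P = U + np.array([[0.0, 3.0], [-1.0, 0.0]])

ls = least_squares_loss(P, U)    # counts the orthogonal errors
pr = projection_loss(P, U)       # does not see them
```

In this example the least squares loss is positive while the projection loss is zero, since (2.5) only measures the component of $\pi(x_t)$ along $u_t$.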

When the state and control spaces are high dimensional, DPL has a disadvantage: it generalizes poorly, and good performance would essentially require the data to cover all possible situations.

2.3.4.2 Markov Decision Process and Reinforcement Learning

A Markov Decision Process (MDP) is a graphical model involving world states $s$ (e.g. robot position) and actions $a$ (e.g. go left). It is a popular formalism for many learning problems, including Reinforcement Learning (RL), see [Russell 09]. The MDP is defined by the following probabilities, taken for reference from [Jetchev 11]:

• world’s initial state distribution $P(s_0)$
• world’s transition probabilities $P(s_{t+1} \mid a_t, s_t)$
• world’s reward probabilities $P(r_t \mid a_t, s_t)$ and $R(a, s) := E\{r \mid a, s\}$
• agent’s policy $\pi(a_t \mid s_t) = P(a_0 \mid s_0; \pi)$

The value (expected discounted return) of a policy $\pi$ when started in state $s$ with discounting factor $\gamma \in [0, 1]$ is defined as:

$$V^\pi(s) = E_\pi\{r_0 + \gamma r_1 + \gamma^2 r_2 + \dots \mid s_0 = s\} \tag{2.6}$$
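The quantity inside the expectation in (2.6) is the discounted return of a single rollout. A minimal sketch of how it could be computed for a finite reward sequence is given below; the function name is hypothetical, and averaging such returns over many rollouts that start in $s$ would give a sample estimate of the value.

```python
def discounted_return(rewards, gamma):
    """Return r_0 + gamma * r_1 + gamma^2 * r_2 + ... for a finite reward list."""
    total = 0.0
    # Accumulate from the last reward backwards: G_t = r_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        total = r + gamma * total
    return total

g = discounted_return([1.0, 1.0, 1.0], gamma=0.5)   # 1 + 0.5 + 0.25 = 1.75
```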

One way to do reinforcement learning in an MDP is to iterate the Bellman optimality equation until the value function converges, which is the Value Iteration algorithm [Bellman 03]:

$$V(s) = \max_a \Big[ R(a, s) + \gamma \sum_{s'} P(s' \mid a, s)\, V(s') \Big] \tag{2.7}$$

The optimal policy given the optimal value function is simply the policy maximizing the immediate reward and expected future rewards:

$$\pi^*(s) = \arg\max_a \Big[ R(a, s) + \gamma \sum_{s'} P(s' \mid a, s)\, V(s') \Big] \tag{2.8}$$

The value of a state $V(s)$ is a more global indicator of desired states than the immediate reward of an action $R(a, s)$. The values $V(s)$ provide a gradient towards the desired states: moving in the direction of increasing $V(s)$ is the desired behavior of the robot system.
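Value Iteration and the greedy policy extraction can be sketched for a small tabular MDP as follows. The array layout `P[a, s, s2]` and `R[a, s]`, the function name, and the four-state chain world are assumptions made for this illustration only.

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8, max_iter=10_000):
    """P[a, s, s2] = P(s2 | a, s), R[a, s] = expected reward. Returns (V, policy)."""
    V = np.zeros(P.shape[1])
    for _ in range(max_iter):
        # Q[a, s] = R(a, s) + gamma * sum_s2 P(s2 | a, s) * V(s2)
        Q = R + gamma * (P @ V)
        V_new = Q.max(axis=0)                 # Bellman optimality backup
        converged = np.max(np.abs(V_new - V)) < tol
        V = V_new
        if converged:
            break
    # Greedy policy with respect to the final value function.
    Q = R + gamma * (P @ V)
    return V, Q.argmax(axis=0)

# Hypothetical chain world: states 0..3, actions 0 = left, 1 = right.
n = 4
P = np.zeros((2, n, n))
for s in range(n):
    if s == n - 1:
        P[:, s, s] = 1.0                      # the goal state is absorbing
    else:
        P[0, s, max(s - 1, 0)] = 1.0          # move left (stay at the wall)
        P[1, s, s + 1] = 1.0                  # move right
R = np.zeros((2, n))
R[:, n - 1] = 1.0                             # reward 1 per step in the goal

V, pi = value_iteration(P, R)
# V is approximately [7.29, 8.1, 9.0, 10.0]; pi chooses "right" in states 0 to 2.
```

The values increase monotonically towards the goal state, so following the direction of increasing value leads the agent to the goal.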

2.3.4.3 Inverse Optimal Control

Reinforcement learning, or planning in general (e.g. within an MDP framework), tries to generate motions maximizing some reward. Learning (e.g. with Value Iteration) requires constant feedback in the form of rewards for the states and actions of the agent. However, in many tasks the reward is not analytically defined and cannot be accessed accurately from the environment. It is then up to a human expert to design a reward that leads the robot to the desired behavior. An alternative is to learn the desired behavior, and a policy for it, just by observing example motions. Inverse Optimal Control (IOC), also known as Inverse Reinforcement Learning (IRL), aims to limit both the reward feedback and the human expertise required to design behavior for a task. IOC learns a proper reward function only on the basis of data, see [Ratliff 06].

Assume that policies $\pi$ give rise to expected feature counts $\varpi(\pi)$ of feature vectors $\phi$, i.e. which features we expect to see if this policy is executed. A weight vector $\omega$ such that the behavior demonstrated by the expert $\pi^*$ has higher expected reward (negative cost) than any other policy is learned by minimizing a loss:

$$\min_{\omega} \|\omega\|^2 \tag{2.9}$$

$$\text{s.t.} \quad \forall \pi: \; \omega^T \varpi(\pi^*) > \omega^T \varpi(\pi) + L(\pi^*, \pi) \tag{2.10}$$

The term $\omega^T \varpi(\pi)$ defines an expected reward that is linear in the features. The scalable margin $L$ penalizes policies more strongly the further they deviate from the optimal behavior of $\pi^*$. The above loss can be minimized with a max margin formulation. Efficient methods are required to find the $\pi$ that violates the constraints the most and add it as a new constraint. Once the reward model is learned, another module is required to generate motions maximizing the reward, e.g. [Ratliff 06] uses an A* planner to find a path to a target with minimal cost.
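The max margin idea can be sketched on a toy problem where the candidate policies are few enough to enumerate. The code below is not the method of [Ratliff 06]: it assumes a soft, regularized version of (2.9) and (2.10) and minimizes it by plain subgradient descent, and the feature counts, margins, function name and step sizes are all hypothetical. In a realistic setting the `argmax` over candidates would be replaced by a planner that searches for the most violating policy.

```python
import numpy as np

def max_margin_weights(mu_expert, mu_candidates, margins,
                       lam=0.1, lr=0.1, n_iter=200):
    """Subgradient descent on
    0.5 * lam * ||w||^2 + max_k [w . mu_k + L_k] - w . mu_expert."""
    w = np.zeros_like(mu_expert, dtype=float)
    for _ in range(n_iter):
        # Most violating candidate: highest reward plus margin under the current w.
        k = int(np.argmax(mu_candidates @ w + margins))
        # Subgradient of the objective at w.
        grad = lam * w + mu_candidates[k] - mu_expert
        w -= lr * grad
    return w

# Hypothetical feature counts: the expert only ever produces feature 0.
mu_expert = np.array([1.0, 0.0])
mu_candidates = np.array([[1.0, 0.0],    # the expert policy itself
                          [0.0, 1.0],    # a very different policy
                          [0.5, 0.5]])   # a policy in between
margins = np.array([0.0, 1.0, 0.5])      # L(pi*, pi) grows with the deviation

w = max_margin_weights(mu_expert, mu_candidates, margins)
expert_reward = w @ mu_expert
other_rewards = mu_candidates[1:] @ w
```

With these toy numbers the learned weights assign the expert a higher linear reward than both alternative policies, rewarding feature 0 and penalizing feature 1.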

Learning a policy based on estimated costs is much more flexible than DPL, and a simple cost function can lead to complex optimal policies. IOC can often generalize well to new situations, because states with low costs create a task manifold, a whole space of desired robot positions good for the task. In some domains it is much easier to learn a mapping from state to cost than to learn a mapping from state to action. The latter is a more complex and higher dimensional problem, especially when considering actions in high dimensional continuous spaces such as robot control.

2.3.4.4 Discriminative Learning

Discriminative learning provides a common framework for many learning problems, including structured output regression. Popular approaches include large margin models [Tsochantaridis 05] and energy based models using neural networks [Lecun 06]. Data is given in the form of pairs of input and output values $\{x_i, y_i\}$. As in standard discriminative approaches (e.g., structured output learning), the energy or cost $f(x_i, y_i; \omega)$ provides a discriminative function such that the true output should get the lowest energy from the model $f$:

$$y_i = \arg\min_{y \in \mathcal{Y}} f(x_i, y) \tag{2.11}$$

Training the parameter vector $\omega$ of the model $f$ is done by minimizing a loss over the dataset. The loss should have the property that $f$ is penalized whenever the true answer $y_i$ has higher energy than the false answer $\bar{y}_i$ with the lowest energy that is at least distance $r$ away:

$$\bar{y}_i = \arg\min_{y \in \mathcal{Y},\, \|y - y_i\| > r} f(x_i, y) \tag{2.12}$$

Finding the most offending answer $\bar{y}_i$ is very often a complicated inference problem in itself.
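For a small, finite output set the most offending answer can be found by enumeration, which the following sketch illustrates. The hinge form of the loss, the quadratic toy energy and all names are assumptions made for this example; the text above only states the property the loss should have, not a specific loss.

```python
import numpy as np

def most_offending(energy, x, y_true, candidates, r):
    """Lowest-energy candidate that is more than distance r away from y_true."""
    far = [y for y in candidates
           if np.linalg.norm(np.atleast_1d(y - y_true)) > r]
    return min(far, key=lambda y: energy(x, y))

def hinge_loss(energy, x, y_true, candidates, r, margin=1.0):
    """Positive when the true answer is not at least `margin` below the offender."""
    y_bar = most_offending(energy, x, y_true, candidates, r)
    return max(0.0, margin + energy(x, y_true) - energy(x, y_bar))

def make_energy(w):
    """Hypothetical scalar energy model f(x, y; w) = (y - w * x)^2."""
    return lambda x, y: float((y - w * x) ** 2)

candidates = np.arange(-3.0, 4.0)        # output set {-3, -2, ..., 3}
x, y_true = 1.0, 1.0

good = hinge_loss(make_energy(1.0), x, y_true, candidates, r=0.5)  # w matches data
bad = hinge_loss(make_energy(0.0), x, y_true, candidates, r=0.5)   # w ignores x
```

The model with `w = 1.0` gives the true answer the lowest energy by a sufficient margin and incurs no loss, while the model with `w = 0.0` prefers the wrong answer `y = 0` and is penalized. For structured or continuous outputs the enumeration would have to be replaced by a dedicated inference procedure.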
