
In this section, we evaluate the presented reactive templates for representing, learning and executing forehands in table tennis. We first evaluate our representation for striking movements on hitting a hanging ball (Section 3.3.1) and, subsequently, on returning a ball served by a ball launcher (Section 3.3.2).

When hitting a ping-pong ball that is hanging from the ceiling, the task consists of hitting the ball with an appropriate desired Cartesian velocity and orientation of the paddle. Hitting a ping-pong ball shot by a ball launcher additionally requires predicting the ball’s future positions and velocities in order to choose an interception point. This prediction is only sufficiently accurate after the ball has hit the table for the last time. The resulting short reaction time underlines that the movement templates can be adapted during the trajectory under strict time limitations, in a setting that leaves no room to recover from a bad generalization, long replanning or inaccurate movements.

3.3.1 Generalizing Forehands on Static Targets

As a first experiment, we evaluated how well this new formulation of hitting primitives generalizes forehand movements learned from imitation as shown in Figure 3.6 (a). First, we collected arm, racket and ball trajectories for imitation learning using the 7 DoF Barrett WAM robot as a haptic input device for kinesthetic teach-in where all inertial forces and gravity were compensated. In the second step, we employ this data to automatically extract the duration of the striking movement, the duration of the individual phases as well as the Cartesian target velocity and orientation of the racket when hitting the ball. We employ a model (as shown in Section 3.2) that has phases for swinging back, hitting and going to a rest posture. Both the swing-back and the return-to-home phases lead into intermediary still phases, while the hitting phase goes through a target point with a pre-specified target velocity. All phases can only be safely executed due to the “safer dynamics” which we introduced in Section 3.2.3.

Figure 3.5: This figure demonstrates the generalization of an imitated behavior to a different target that is 15 cm away from the original target. Note that this trajectory is for a static target, hence the slow motion. The depicted degree of freedom (DoF) is shoulder adduction-abduction (i.e., the second DoF). The solid gray bars indicate the time before and after the main movement, the gray dashed lines indicate the phase borders also depicted in Figure 3.1, and the target is hit at the second border.

Figure 3.6: This figure presents a hitting sequence from (a) the demonstration by a human instructor, (b) a reproduction on the robot hitting a stationary ball attached by a string, and (c) an application returning balls shot by a ping-pong ball launcher. The demonstration and the flying-ball generalization are captured by a 25 Hz video camera; the generalization with the attached ball is captured at 200 Hz through our vision system. From left to right, the stills represent: rest posture, swing-back posture, hitting point, swing-through and rest posture. The postures (1-4) are the same as in Figure 3.2.

In this experiment, the ball is a stationary target and is detected by a stereo camera setup. Subsequently, the supervisory level proposed in [Mülling and Peters, 2009] determines the hitting point and the striking velocity in configuration space. The motor primitives are adjusted accordingly and executed on the robot in joint-space using an inverse dynamics control law. The robot successfully hits the ball at different positions within an area of approximately 1.2 m in diameter, provided the target is kinematically feasible. The adaptation for striking movements achieves the desired velocities, and the safer dynamics allow generalization to a much larger area while successfully removing the possibly large accelerations at the transitions between motor primitives. See Figure 3.5 for a comparison of the training example and the generalized motion for one degree of freedom and Figure 3.6 (b) for a few frames from a hit of a static ball.

3.3.2 Playing against a Ball Launcher

This evaluation adds a layer of complexity: the hitting point and the hitting time have to be estimated from the trajectory of the ball and continuously adapted, as the hitting point cannot be reliably determined until the ball has bounced off the table for the last time. In this setting, the ball is tracked by two overlapping high-speed stereo vision setups with 200 Hz cameras. In order to obtain better estimates of the current position and to calculate the velocities, the raw 3D positions are filtered by a specialized Kalman filter [Kalman, 1960] that takes contacts of the ball with the table and the racket into account [Mülling and Peters, 2009]. Using this filter as a Kalman predictor, we can again determine the target point for the primitive with a pre-specified target velocity with the method described in [Mülling and Peters, 2009]. The results obtained for the static ball generalize well to balls launched by a ball launcher at 3 m/s, which are returned at speeds of up to 8 m/s. A sequence of frames from the attached video is shown in Figure 3.6 (c). The plane of possible virtual hitting points again has a diameter of roughly 1 m, as shown in Figure 3.7. The modified motor primitives generated movements with the desired hitting position and velocity. The robot hit the ball in the air in approximately 95% of the trials. However, due to a simplistic ball model and execution inaccuracies, the ball was often not properly returned onto the table. Please see the videos accompanying this chapter at http://www.robot-learning.de/Research/HittingMPs.

Figure 3.7: Generalization to various targets: five different forehands are shown at posture 3, approximately at the moment of hitting the ball.
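The prediction step described above can be illustrated with a minimal sketch. It is not the specialized Kalman filter of [Mülling and Peters, 2009]; it merely forward-simulates a drag-free point mass under gravity, reflects the vertical velocity at the table, and reports where the ball crosses a virtual hitting plane. The coordinate frame, the restitution coefficient, the unbounded table and all names (`predict_interception`, `plane_x`, `RESTITUTION`) are assumptions made for illustration.

```python
from typing import NamedTuple, Optional, Tuple

GRAVITY = 9.81      # m/s^2, acting along -z
TABLE_Z = 0.0       # height of the table surface in this frame (assumed)
RESTITUTION = 0.9   # assumed fraction of vertical speed kept after a bounce

Vec3 = Tuple[float, float, float]


class Interception(NamedTuple):
    time: float      # seconds from now until the plane is reached
    position: Vec3   # predicted ball position on the hitting plane
    velocity: Vec3   # predicted ball velocity at that moment
    bounces: int     # number of table contacts before interception


def predict_interception(pos: Vec3, vel: Vec3, plane_x: float,
                         dt: float = 0.001,
                         horizon: float = 2.0) -> Optional[Interception]:
    """Forward-simulate the ball until it crosses the plane x = plane_x.

    Returns None if the plane is not reached within `horizon` seconds.
    The table is treated as unbounded (a simplification).
    """
    x, y, z = pos
    vx, vy, vz = vel
    bounces = 0
    for k in range(1, int(round(horizon / dt)) + 1):
        vz -= GRAVITY * dt          # semi-implicit Euler step
        x += vx * dt
        y += vy * dt
        z += vz * dt
        if z < TABLE_Z and vz < 0.0:
            z = TABLE_Z             # contact with the table: reflect vz
            vz = -RESTITUTION * vz
            bounces += 1
        if x >= plane_x:
            return Interception(k * dt, (x, y, z), (vx, vy, vz), bounces)
    return None


# Ball 0.3 m above the table, moving at 3 m/s towards a plane 1.5 m away.
hit = predict_interception((0.0, 0.0, 0.3), (3.0, 0.0, 0.0), plane_x=1.5)
```

In a real system, such a prediction would be recomputed at every vision update, which is why the interception point only stabilizes after the last bounce.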

Note that our results differ significantly from previous approaches, as we use a framework that allows us to learn striking movements from human demonstrations, unlike previous work in batting [Senoo et al., 2006] and table tennis [Andersson, 1988]. Unlike baseball, which only requires four degrees of freedom (as, e.g., in [Senoo et al., 2006], who used a 4 DoF WAM arm in a manually coded high-speed setting), and unlike previous work in table tennis (which used robots that had low inertia, were overpowered and had mostly prismatic joints [Andersson, 1988, Fässler et al., 1990, Matsushima et al., 2005]), we use a robot with a full seven degrees of freedom and revolute joints and, thus, have to deal with larger inertia, as the wrist adds roughly 2.5 kg of weight at the elbow. Hence, it was essential to train trajectories by imitation learning that distribute the torques well over the redundant joints, as the human teacher was subject to the same constraints.

3.4 Conclusion

In this chapter, we rethink previous work on dynamical system motor primitives [Ijspeert et al., 2002a, Schaal et al., 2003, 2007] in order to obtain movement templates that can be used reactively in batting and hitting sports. This reformulation allows changing the target velocity of the movement while maintaining the overall duration and shape. Furthermore, we present a modification that overcomes the problem of an initial acceleration step, which is particularly important for safe generalization of learned movements. Our adaptations retain the advantages of the original formulation and perform well in practice. We evaluate this novel motor primitive formulation first in hitting a stationary table tennis ball and, subsequently, in returning balls served by a ping-pong ball launcher. In both cases, the novel motor primitives manage to generalize well while maintaining the features of the demonstration. This new formulation of the motor primitives can hopefully be used together with meta-parameter learning (Chapter 5) in a mixture of motor primitives [Mülling et al., 2010] in order to create a complete framework for learning tasks like table tennis autonomously.

4 Policy Search for Motor Primitives in Robotics

Many motor skills in humanoid robotics can be learned using parametrized motor primitives. While successful applications to date have been achieved with imitation learning, most of the interesting motor learning problems are high-dimensional reinforcement learning problems. These problems are often beyond the reach of current reinforcement learning methods. In this chapter, we study parametrized policy search methods and apply these to benchmark problems of motor primitive learning in robotics. We show that many well-known parametrized policy search methods can be derived from a general, common framework. This framework yields both policy gradient methods and expectation-maximization (EM) inspired algorithms. We introduce a novel EM-inspired algorithm for policy learning that is particularly well-suited for dynamical system motor primitives. We compare this algorithm, both in simulation and on a real robot, to several well-known parametrized policy search methods such as episodic REINFORCE, ‘Vanilla’ Policy Gradients with optimal baselines, episodic Natural Actor Critic, and episodic Reward-Weighted Regression. We show that the proposed method outperforms them on an empirical benchmark of learning dynamical system motor primitives both in simulation and on a real robot. We apply it in the context of motor learning and show that it can learn a complex Ball-in-a-Cup task on a real Barrett WAM robot arm.

4.1 Introduction

To date, most robots are still taught by a skilled human operator either via direct programming or a teach-in. Learning approaches for automatic task acquisition and refinement would be a key step for making robots progress towards autonomous behavior. Although imitation learning can make this task more straightforward, it will always be limited by the observed demonstrations. For many motor learning tasks, skill transfer by imitation learning is prohibitively hard given that the human teacher is not capable of conveying sufficient task knowledge in the demonstration. In such cases, reinforcement learning is often an alternative to a teacher’s presentation, or a means of improving upon it. In the high-dimensional domain of anthropomorphic robotics with its continuous states and actions, reinforcement learning suffers particularly from the curse of dimensionality. However, by using a task-appropriate policy representation and encoding prior knowledge into the system by imitation learning, local reinforcement learning approaches are capable of dealing with the problems of this domain. Policy search (also known as policy learning) is particularly well-suited in this context, as it allows the usage of domain-appropriate pre-structured policies [Toussaint and Goerick, 2007], the straightforward integration of a teacher’s presentation [Guenter et al., 2007, Peters and Schaal, 2006] as well as fast online learning [Bagnell et al., 2004, Ng and Jordan, 2000, Hoffman et al., 2007]. Recently, policy search has become an accepted alternative to value-function-based reinforcement learning [Bagnell et al., 2004, Strens and Moore, 2001, Kwee et al., 2001, Peshkin, 2001, El-Fakdi et al., 2006, Taylor et al., 2007] due to many of these advantages.

In this chapter, we will introduce a policy search framework for episodic reinforcement learning and show how it relates to policy gradient methods [Williams, 1992, Sutton et al., 1999, Lawrence et al., 2003, Tedrake et al., 2004, Peters and Schaal, 2006] as well as expectation-maximization (EM) inspired algorithms [Dayan and Hinton, 1997, Peters and Schaal, 2007]. This framework allows us to re-derive or to generalize well-known approaches such as episodic REINFORCE [Williams, 1992], the policy gradient theorem [Sutton et al., 1999, Peters and Schaal, 2006], the episodic Natural Actor Critic [Peters et al., 2003, 2005], and an episodic generalization of the Reward-Weighted Regression [Peters and Schaal, 2007]. We derive a new algorithm called Policy Learning by Weighting Exploration with the Returns (PoWER), which is particularly well-suited for the learning of trial-based tasks in motor control.
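To give a flavor of what an EM-inspired episodic update can look like, the following minimal sketch perturbs the policy parameters once per rollout and then moves them by the return-weighted average of the perturbations. It is a deliberately simplified illustration and should not be read as the PoWER algorithm derived in this chapter. The toy return function, the exploration noise `sigma`, the requirement of non-negative returns and all names (`reward_weighted_update`, `toy_return`) are assumptions.

```python
import math
import random
from typing import Callable, List, Sequence


def reward_weighted_update(theta: Sequence[float],
                           episode_return: Callable[[Sequence[float]], float],
                           sigma: float = 0.3,
                           n_rollouts: int = 50,
                           rng=random) -> List[float]:
    """One episodic update: theta + sum(R_k * eps_k) / sum(R_k).

    Assumes returns are non-negative so they can act as weights.
    """
    numer = [0.0] * len(theta)
    denom = 0.0
    for _ in range(n_rollouts):
        # Explore directly in parameter space, once per episode.
        eps = [rng.gauss(0.0, sigma) for _ in theta]
        ret = episode_return([t + e for t, e in zip(theta, eps)])
        for i, e in enumerate(eps):
            numer[i] += ret * e
        denom += ret
    if denom <= 0.0:
        return list(theta)  # no informative rollouts: keep the parameters
    return [t + n / denom for t, n in zip(theta, numer)]


# Toy problem (hypothetical): the return peaks at a fixed target parameter.
TARGET = [1.0, -1.0]


def toy_return(theta: Sequence[float]) -> float:
    return math.exp(-sum((t - g) ** 2 for t, g in zip(theta, TARGET)))


rng = random.Random(0)
theta = [0.0, 0.0]
for _ in range(50):
    theta = reward_weighted_update(theta, toy_return, rng=rng)
```

Unlike a policy gradient step, this update needs no learning rate: the step size follows from the exploration noise and the spread of the returns.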

We evaluate the algorithms derived from this framework to determine how they can be used for refining parametrized policies in robot skill learning. To address this problem, we follow a methodology suitable for robotics where the policy is first initialized by imitation learning and, subsequently, the policy search algorithm is used for self-improvement. As a result, we need a suitable representation in order to apply this approach in anthropomorphic robot systems. In imitation learning, a particular kind of motor control policy has been very successful, which is known as dynamical system motor primitives [Ijspeert et al., 2002a,b, Schaal et al., 2003, 2007]. In this approach, dynamical systems are used to encode a control policy suitable for motor tasks. The representation is linear in the parameters; hence, it can be learned straightforwardly from demonstrations. Such dynamical system motor primitives can represent both point-to-point and rhythmic behaviors. We focus on the point-to-point variant which is suitable for representing single-stroke, episodic behaviors. As a result, they are particularly well-suited for episodic policy search.
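The claim that the representation is linear in its parameters, and can therefore be fitted directly from a demonstration, can be illustrated with a minimal sketch for a single degree of freedom. It uses a common textbook variant of the point-to-point dynamical system motor primitive, which may differ in detail from the formulation used in this thesis, and fits each weight in closed form by locally weighted regression. The gains, the basis-function placement, the minimum-jerk demonstration and all function names are assumptions.

```python
import math
from typing import List, Sequence, Tuple

ALPHA_Y = 25.0           # spring gain (assumed)
BETA_Y = ALPHA_Y / 4.0   # makes the unforced system critically damped
ALPHA_X = 4.0            # decay rate of the phase variable x (assumed)


def make_basis(n: int) -> Tuple[List[float], List[float]]:
    """Gaussian centers equally spaced in time, widths from center gaps."""
    centers = [math.exp(-ALPHA_X * i / (n - 1)) for i in range(n)]
    widths = []
    for i in range(n):
        j = min(i, n - 2)  # the last basis reuses the previous gap
        widths.append(1.0 / (centers[j] - centers[j + 1]) ** 2)
    return centers, widths


def basis(x: float, centers, widths) -> List[float]:
    return [math.exp(-h * (x - c) ** 2) for c, h in zip(centers, widths)]


def fit_weights(y: Sequence[float], yd: Sequence[float],
                ydd: Sequence[float], dt: float, n_basis: int = 15):
    """Closed-form fit; the forcing term is linear in the weights."""
    tau, y0, g = dt * (len(y) - 1), y[0], y[-1]
    centers, widths = make_basis(n_basis)
    num, den = [0.0] * n_basis, [1e-10] * n_basis
    for k in range(len(y)):
        x = math.exp(-ALPHA_X * k * dt / tau)
        s = x * (g - y0)
        # Forcing term the demonstration would have required.
        f_target = tau ** 2 * ydd[k] - ALPHA_Y * (BETA_Y * (g - y[k]) - tau * yd[k])
        for i, psi in enumerate(basis(x, centers, widths)):
            num[i] += psi * s * f_target
            den[i] += psi * s * s
    return [n / d for n, d in zip(num, den)]


def rollout(weights, y0, g, tau, dt, duration) -> List[float]:
    """Euler-integrate the primitive and return the positions."""
    centers, widths = make_basis(len(weights))
    y, z, x = y0, 0.0, 1.0
    traj = [y]
    for _ in range(int(round(duration / dt))):
        psi = basis(x, centers, widths)
        f = sum(p * w for p, w in zip(psi, weights)) / (sum(psi) + 1e-10) * x * (g - y0)
        zd = (ALPHA_Y * (BETA_Y * (g - y) - z) + f) / tau
        y += (z / tau) * dt
        z += zd * dt
        x += (-ALPHA_X * x / tau) * dt
        traj.append(y)
    return traj


def minimum_jerk_demo(duration: float, dt: float):
    """Synthetic stand-in for a kinesthetic demonstration from 0 to 1."""
    n = int(round(duration / dt)) + 1
    s = [k * dt / duration for k in range(n)]
    y = [10 * u ** 3 - 15 * u ** 4 + 6 * u ** 5 for u in s]
    yd = [(30 * u ** 2 - 60 * u ** 3 + 30 * u ** 4) / duration for u in s]
    ydd = [(60 * u - 180 * u ** 2 + 120 * u ** 3) / duration ** 2 for u in s]
    return y, yd, ydd


demo_y, demo_yd, demo_ydd = minimum_jerk_demo(duration=1.0, dt=0.001)
weights = fit_weights(demo_y, demo_yd, demo_ydd, dt=0.001)
reproduction = rollout(weights, y0=0.0, g=1.0, tau=1.0, dt=0.001, duration=2.0)
```

Because each weight has its own closed-form solution, the fitted weights can serve directly as the initial parameters that a policy search method subsequently refines.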

We show that all presented algorithms work sufficiently well when employed in the context of learning dynamical system motor primitives in different benchmark and application settings. We compare these methods on the two benchmark problems from [Peters and Schaal, 2006] for dynamical system motor primitive learning, the Underactuated Swing-Up [Atkeson, 1994] robotic benchmark problem, and the Casting task. Using entirely different parametrizations, we evaluate policy search methods on the mountain-car benchmark [Sutton and Barto, 1998] and the Tetherball Target Hitting task. On the mountain-car benchmark, we additionally compare to a value-function-based approach. The method with the best performance, PoWER, is evaluated on the complex task of Ball-in-a-Cup [Sumners, 1997]. Both the Underactuated Swing-Up and Ball-in-a-Cup are achieved on a real Barrett WAM robot arm. Please also refer to the videos at http://www.robot-learning.de/Research/ReinforcementLearning. For all real robot experiments, the presented movement is learned by imitation from a kinesthetic demonstration, and the Barrett WAM robot arm subsequently improves its behavior by reinforcement learning.