Self-Paced Deep Reinforcement Learning

(1)

Self-Paced Deep Reinforcement Learning

Pascal Klink1_{, Carlo D’Eramo}1_{, Jan Peters}1_{, Joni Pajarinen}1,2 1_{Intelligent Autonomous Systems, Technische Universität Darmstadt, Germany} 2_{Department of Electrical Engineering and Automation, Aalto University, Finland}

Correspondence to: [email protected]

Abstract

Curriculum reinforcement learning (CRL) improves the learning speed and stability of an agent by exposing it to a tailored series of tasks throughout learning. Despite empirical successes, an open question in CRL is how to automatically generate a curriculum for a given reinforcement learning (RL) agent, avoiding manual design. In this paper, we propose an answer by interpreting the curriculum generation as an inference problem, where distributions over tasks are progressively learned to approach the target task. This approach leads to an automatic curriculum generation, whose pace is controlled by the agent, with solid theoretical motivation and easily integrated with deep RL algorithms. In the conducted experiments, the curricula generated with the proposed algorithm significantly improve learning performance across several environments and deep RL algorithms, matching or outperforming state-of-the-art existing CRL algorithms.

1 Introduction

Reinforcement learning (RL) [1] enables agents to learn sophisticated behaviors from interaction with an environment. Combinations of RL paradigms with powerful function approximators, commonly referred to as deep RL (DRL), have resulted in the acquisition of superhuman performance in various simulated domains [2, 3]. Despite these impressive results, DRL algorithms suffer from high sample complexity. Hence, a large body of research aims to reduce sample complexity by improving the explorative behavior of RL agents in a single task [4, 5, 6, 7].

Orthogonal to exploration methods, curriculum learning (CL) [8] for RL investigates the design of task sequences that maximally benefit the learning progress of an RL agent, by promoting the transfer of successful behavior between tasks in the sequence. To create a curriculum for a given problem, it is both necessary to define a set of tasks from which it can be generated and, based on that, specify howit is generated, i.e. how a task is selected given the current performance of the agent. This paper addresses the curriculum generation problem, assuming access to a set of parameterized tasks. Recently, an increasing number of algorithms for curriculum generation have been proposed, em-pirically demonstrating that CL is an appropriate tool to improve the sample efficiency of DRL algorithms [9, 10]. However, these algorithms are based on heuristics and concepts that are, as of now, theoretically not well understood, preventing the establishment of rigorous improvements. In contrast, we propose to generate the curriculum based on a principled inference view on RL. Our approach generates the curriculum based on two quantities: The value function of the agent and the KL divergence to a target distribution of tasks. The resulting curriculum trades off task complexity (reflected in the value function) and the incorporation of desired tasks (reflected by the KL divergence). Our approach is conceptually similar to the self-paced learning (SPL) paradigm in supervised learning [11], which has only found application to RL in limited settings [12, 13]. Contribution We propose a new CRL algorithm, whose behavior is well explained as performing approximate inference on the common latent variable model (LVM) for RL [14, 15] (Section 4).

(2)

This enables principled improvements through the incorporation of advanced inference techniques. Combined with the well-known DRL algorithms TRPO, PPO and SAC [16, 17, 18], our method matches or surpasses the performance of state-of-the-art CRL methods in environments of different complexity and with sparse and dense rewards (Section 5).

2 Related Work

Simultaneously evolving the learning task with the learner has been investigated in a variety of fields ranging from behavioral psychology [19] to evolutionary robotics [20] and RL [21]. For supervised learning (SL), this principle was given the name curriculum learning [8]. This name has by now also been established in the RL community, where a variety of algorithms, aiming to generate curricula that maximally benefit the learner, have been proposed. Narvekar and Stone [22] showed that learning to create the optimal curriculum can be computationally harder than learning the entire task from scratch, motivating research on tractable approximations.

Keeping the agent’s success rate within a certain range allowed to create curricula that drastically improve sample efficiency in tasks with binary reward functions or success indicators [23, 10, 24]. Many CRL methods [25, 26, 27, 28] have been proposed inspired by the idea of ‘curiosity’ or ‘intrinsic motivation’ [29, 30] – terms that refer to the way humans organize autonomous learning even in the absence of a task to be accomplished. Despite the empirical success, no theoretical foundation has been developed for the aforementioned methods, preventing principled improvements. Another approach to curriculum generation has been explored under the name self-paced learning (SPL) for SL [11, 31, 32], proposing to generate a curriculum by optimizing the trade-off between exposing the learner to all available training samples and selecting samples in which it currently performs well. Despite its widespread application and empirical success in SL tasks, SPL has only been applied in a limited way to RL problems, restricting its use to the regression of the value function from an experience buffer [13] or to a strictly episodic RL setting [12]. Our method connects to this line of research, formulating the curriculum generation as a trade-off optimization of similar fashion. While the work by Ren et al. [13] is orthogonal to ours, we identify the result of Klink et al. [12] as a special case of our inference view. Besides allowing the combination of SPL and modern DRL algorithms to solve more complex tasks, the inference view presents a unified theory of using the self-paced learning paradigm for RL.

As we interpret RL from an inference perspective over the course of this paper, we wish to briefly point to several works employing this perspective [33, 34, 35, 15, 14]. Taking an inference perspective is beneficial when dealing with inverse problems or problems that require tractable approximations [36, 37]. For RL, it motivates regularization techniques such as the concept of maximum- or relative entropy [38, 18, 39] and stimulates the development of new, and interpretation of, existing algorithms from a common view [40, 41]. Due to a common language, algorithmic improvements on approximate inference [42, 43, 44] can be shared across domains.

3 Preliminaries

We formulate our approach in the domain of reinforcement learning (RL) for contextual Markov decision processes (CMDPs) [45, 46]. A CMDP is a tuple <C, S, A, M>, where M(c) is a function that maps a context c ∈ C to a Markov decision process (MDP) M(c)=<S, A, pc, rc, p0,c>. An

MDP is an abstract environment with states s ∈ S, actions a ∈ A, transition probabilities pc(s0|s, a),

reward function rc : S × A 7→ R and initial state distribution p0,c(s). Typically S, A and C are

discrete spaces or subsets of Rn. We can think of a CMDP as a parametric family of MDPs, which share the same state-action space. Such a parametric family of MDPs allows to share policies and representations between them [47], both being especially useful for CRL. RL for CMDPs encompasses approaches that aim to find a policy π(a|s, c) which maximizes the expected return over trajectories τ = {(st, at)|t ≥ 0} J (µ, π)=Eµ(c),pπ(τ |c) " X t≥0 γtrc(st, at) # , pπ(τ |c)=p0,c(s0) Y t≥0 pc(st+1|st, at)π(at|st, c), (1)

(3)

as the policy represents the behavior of a (possibly virtual) agent. RL algorithms parametrize the policy π with parameters ω ∈ Rn_{. We will refer to this parametric policy as π}

ω, sometimes replacing

πωby ω in function arguments or subscripts, e.g. writing J (µ, ω) or pω(τ |c). The so-called value

function encodes the expected long-term reward when following a policy πωstarting in state s

Vω(s, c)=Epω(τ |s,c) " X t≥0 γtrc(st, at) # , pω(τ |s, c)=δss0 Y t≥0 pc(st+1|st, at)πω(at|st, c), (2) where δs

s0 is the delta-distribution. The above value function for CMDPs has been introduced as a

general or universal value function [48, 47]. We will, however, just refer to it as a value function, since a CMDP can be expressed as an MDP with extended state space. We see that the value function relates to Eq. 1 via J (µ, ω) = Eµ(c),p0,c(s0)[Vω(s0, c)].

4 Self-Paced Deep Reinforcement Learning

Having established the necessary notation, we now introduce a curriculum to the contextual RL objective (Eq. 1) by allowing the agent to choose a distribution of tasks pν(c), parameterized by

ν ∈ Rm_{, to train on. Put differently, we allow the RL agent to maximize J (p}

ν, π) under a chosen pν,

only ultimately requiring it to match the “desired” task distribution µ(c) to ensure that the policy is indeed a local maximizer of J (µ, π). We achieve this by reformulating the RL objective as

max

ν,ω J (ν, ω) − αDKL(pν(c)kµ(c)) , α ≥ 0. (3)

In the above objective, DKL(·k·) is the KL divergence between two probability distributions. The

parameter α controls the aforementioned trade-off between freely choosing pν(c) and matching

µ(c). When only optimizing Objective (3) w.r.t. ω for a given ν, we simply optimize the contextual RL objective (Eq. 1) over the context distribution pν(c). On the contrary, if Objective (3) is only

optimized w.r.t. ν for a given policy πω, then α controls the trade-off between incorporating tasks

in which the policy obtains high reward and matching µ(c). So if we optimize Objective (3) in a block-coordinate ascent manner, we may use standard RL algorithms to train the policy under fixed pν(c) and then adjust pν(c) according to the obtained policy. If we keep increasing α during

this procedure, pν(c) will ultimately match µ(c) due to the KL divergence penalty and we train

on the true objective. The benefit of such an interpolation between task distributions under which the agent initially performs well and µ(c) is that the agent may be able to adapt well-performing behavior as the environments gradually transform. This can, in turn, avoid learning poor behavior and increase learning speed. The outlined idea resembles the paradigm of self-paced learning for supervised learning [11], where a regression- or classification model as well as its training set are alternatingly optimized. The training set for a given model is chosen by trading-off favoring samples under which the model has low prediction error and incorporating all samples in the dataset. Indeed, the idea of generating curricula for RL using the self-paced learning paradigm has previously been investigated by Klink et al. [12]. However, they investigate the curriculum generation only in the episodic RL setting and jointly update the policy and context distribution. This ties the curriculum generation to a specific (episodic) RL algorithm, that, as we will see in the experiments, is not suited for high-dimensional policy parameterizations. Our formulation is not limited to such a specific setting, allowing to use the resulting algorithm for curriculum generation with any RL algorithm. Indeed, we will now relate the maximization of Objective (3) w.r.t. ν to an inference perspective, showing that our formulation explains the results obtained by Klink et al. [12].

Interpretation as Inference Objective (3) can be motivated by taking an inference perspective on RL [14]. In this inference perspective, we introduce an ‘optimality’ event O, whose probability of occurring is defined via a monotonic transformation f : R 7→ R≥0of the cumulative reward

R(τ, c) =P

t≥0rc(st, at), yielding the following latent variable model (LVM)

pν,ω(O) =

Z

pν,ω(O, τ, c)dτ dc ∝

Z

f (R(τ, c))pω(τ |c)pν(c)dτ dc. (4)

(4)

approximate applications of the expectation maximization (EM) algorithm to LVM (4) to maximize pν,ω(O) w.r.t ω [40]. Our idea of maximizing Objective (3) in a block-coordinate ascent manner is

readily supported by the EM algorithm, since its steps can be executed alternatingly w.r.t. ν and ω. Consequently, we now investigate the case when maximizing pν,ω(O) w.r.t. ν, showing that known

regularization techniques for approximate inference motivate Objective (3). For brevity, we only state the main results here and refer to Appendix A for detailed proofs and explanations of EM.

When maximizing pν,ω(O) w.r.t. ν using EM, we introduce a variational distribution q(c) and

alternate between the so called E-Step, in which q(c) is found by minimizing DKL(q(c)kpν,ω(c|O)),

and the M-Step, in which ν is found by minimizing the KL divergence to the previously obtained variational distribution DKL(q(c)kpν(c)). Typically q(c) is not restricted to a parametric form and

hence matches pν,ω(c|O) after the E-Step. We now state our main theoretical insight, showing

exactly what approximations and modifications to the regular EM algorithm are required to retrieve Objective 3.

Theorem 1 Choosing f (·)= exp (·), maximizing Objective (3) minus a KL divergence term DKL(pν(c)kpν˜(c)) is equal to executing E- and M-Step while restricting q(c) to be of the same

para-metric form aspν(c) and introducing a regularized E-Step DKL

q(c) 1 Zpν,ω˜ (c|O) 1 1+α_µ(c) α 1+α . Theorem 1 is interesting for many reasons. Firstly, the extra term in Objective (3) can be identified as a regularization term, which penalizes a large deviation of pν(c) from pν˜(c). In the algorithm

we propose, we replace this penalty term by a constraint on the KL divergence between successive context distributions, granting explicit control over their dissimilarity. This is beneficial when estimating expectations in Objective (3) by a finite amount of samples. Next, restricting the variational distribution to a parametric form is a known concept in RL. Abdolmaleki et al. [40] have shown that it yields an explanation for the well-known on-policy algorithms TRPO and PPO [16, 17]. Finally, the regularized E-Step fits q(c) to a distribution that is referred to as a tempered posterior. Tempering, or deterministic annealing, is used in variational inference to improve the approximation of posterior distributions by gradually moving from the prior (in our case µ(c)) to the true posterior (here pν,ω(c|O)) [44, 49, 50], which in above equation corresponds to gradually decreasing α to zero.

We, however, increasingly enforce pν(c) to match µ(c) by gradually increasing α. To understand

this “inverse” behavior, we need to remember that the maximization w.r.t. ν solely aims to generate context distributions pν(c) that facilitate the maximization of J (µ, πω) w.r.t. ω. This means to

initially encode contexts in which the event O is most likely, i.e. pω(O|c) is highest, and only

gradually match µ(c). To conclude this theoretical section, we note that the update rule proposed by Klink et al. [12] can be recovered from our formulation.

Theorem 2 Choosing f (·)= exp (·/η), the (unrestricted) variational distribution after the regu-larized E-Step is given byq(c)∝pν(c) exp

_V

ω(c)+ηα(log(µ(c))−log(pν(c)))

η+ηα

, where Vω(c) is the

‘episodic value function’ as defined in [34].

The variational distribution in Theorem 2 resembles the results in [12] with the only difference that α is scaled by η. Hence, for a given schedule of α, we simply need to scale every value in this schedule by 1/η to match the results from Klink et al. [12].

Algorithmic Realization As previously mentioned, we maximize Objective (3) in a block-coordinate ascent manner, i.e. use standard RL algorithms to optimize J (pνi, πω) w.r.t. πωunder

the current context distribution pνi. Consequently, we only need to develop ways to optimize

Ob-jective (3) w.r.t. pν for a given policy πωi. We can run any RL algorithm to generate a set of

trajectories Di=(ck, τk)

k ∈ [1, K] , ck ∼ pνi(c), τk ∼ πωi(τ |ck) alongside an improved policy

πωi+1. Furthermore, most state-of-the-art RL algorithms fit a value function Vωi+1(s, c) while

generating the policy πωi+1. Even if the employed RL algorithm does not generate an estimate of

Vωi+1(s, c), it is easy to compute one using standard techniques. We can exploit the connection

between value function and RL objective J (p, ωi+1) = Ep(c),p0,c(s0)Vωi+1(s0, c) to optimize

(5)

Algorithm 1 Self-Paced Deep Reinforcement Learning

Input: Initial context distribution- and policy parameters ν0and ω0, Target context distribution

µ(c), KL penalty proportion ζ, offset Nα, number of iterations N , Rollouts per policy update K

for i = 1 to N do Agent Improvement:

Sample contexts: ck ∼ pνi(c), k = 1, . . . , K

Rollout trajectories: τk∼ πωi(τ |ck), k = 1, . . . , K

Obtain πωi+1from RL algorithm of choice using Di=(c

k_{, τ}k₎

k = 1, . . . , K Estimate Vωi+1(s

k

0, ck) for contexts ck(using the employed RL algorithm, if possible)

Context Distribution Update:

Obtain pνi+1optimizing (Eq. 5), using αi = 0, if i ≤ Nα, else B(νi, Di) (Eq. 6)

end for

instead of Objective (3) to obtain νi+1. The first term in Objective (5) is an importance-weighted

approximation of J (νi+1, ωi+1). Motivated by Theorem 1, the KL divergence constraint between

subsequent context distributions pνi(c) and pνi+1(c) avoids large jumps in the context distribution.

Above objective can be solved using any constrained optimization algorithm. In our implementation, we use the trust-region algorithm implemented in the SciPy library [51]. In each iteration, the parameter αiis chosen such that the KL divergence penalty w.r.t. the current context distribution is in

constant proportion ζ to the average reward obtained during the last iteration of policy optimization αi= B(νi, Di) = ζ 1 K PK k=1R τk, ck DKL(pνi(c)kµ(c)) , R τk, ck = P_t≥0γt_r ck sk_t, ak_t , (6)

as proposed by Klink et al. [12]. We further adopt their strategy of setting α to zero for the first Nαiterations. This allows to taylor the context distribution to the learner in early iterations, if the

initial context distribution is uninformative, i.e. covers large parts of the context space. Note that this is a naive choice, that nonetheless worked sufficiently well in our experiments. At this point, the connection to tempered inference allows for principled future improvements by using more advanced methods to choose α [44]. For the experiments, we restrict pν(c) to be Gaussian. Consequently,

Objective (5) is optimized w.r.t. the mean µ and covariance Σ of the context distribution. Again, the inference view readily motivates future improvements by using advanced sampling techniques [43, 42]. These techniques allow to directly sample from the variational distribution q(c) in Theorem 1, bypassing the need to fit a parametric distribution and allowing to represent multi-modal distributions. The outlined method is summarized in Algorithm 1.

5 Experiments

(a) Point-Mass (b) Ant (c) Ball-Catching Figure 1: Environments used for experimental evaluation. For the point mass environment (a), the upper plot shows the target task. The shaded areas in picture (c) visualize the target distribution of ball positions (green) as well as the ball positions for which the initial policy succeeds (blue). The aim of this section is to

investi-gate the performance and versatility of the proposed curriculum reinforcement learning algorithm (SPDL). To accom-plish this, we evaluate SPDL in three different environments with different DRL algorithms to test the proposi-tion that the learning scheme benefits the performance of various RL algo-rithms. We evaluate the performance using TRPO [16], PPO [17] and SAC [18]. For all DRL algorithms, we use the implementations provided in the Stable Baselines library [52].1 The first two environments aim at in-vestigating the benefit of SPDL when the purpose of the generated curricu-lum is solely to facilitate the learning

1

(6)

0 200 400 600 800 1000 Iteration 0 5 10 Rew ard Point Mass 0 200 400 600 800 1000 Iteration 0 5 10 Point Mass (2D) -4 -2 0 2 4 Position 0.5 3 5.5 8 Width -4 -2 0 2 4 Position -4 -2 0 2 4 Position -4 -2 0 2 4 Position -4 -2 0 2 4 Position -4 -2 0 2 4 Position

SPDL ALP-GMM Random Default GoalGAN SPRL

Figure 2: Reward of different curricula in the Point Mass (2D and 3D) environment for TRPO. Mean (thick line) and two times standard error (shaded area) is computed from 20 algorithm runs. The lower plots show samples from the context distributions p(c) in the Point-Mass 2D environment at iterations 0, 50, 80, 100, 120 and 200 (from left to right). Different colors and shapes of samples indicate different algorithm runs. The black cross marks the mean of the target distribution µ(c).

of a hard target task, which the agent is not able to solve without a curriculum. For this purpose, we create two environments that are conceptually similar to the point mass experiment considered by SPRL [12]. The first one is a copy of the original experiment, but with an additional parameter to the context space, as we will detail in the corresponding section. The second environment extends the original experiment by replacing the point mass with a torque-controlled quadruped ‘ant’. This increases the complexity of the underlying control problem, requiring the capacity of deep neural network function approximators used in DRL algorithms. The final environment is a robotic ball-catching environment. This environment constitutes a shift in curriculum paradigm as well as reward function. Instead of guiding learning towards a specific target task, this third environment requires to learn a ball-catching policy over a wide range of initial states (ball position and velocity). The reward function is sparse compared to the dense ones employed in the first two environments.

To judge the performance of SPDL, we compare the obtained results to state-of-the-art CRL algo-rithms ALP-GMM [27], which is based on the concept of Intrinsic Motivation, GoalGAN [23], which relies on the notion of a success indicator to define a curriculum, and SPRL [12], the episodic coun-terpart of our algorithm. Furthermore, we compare to curricula consisting of tasks uniformly sampled from the context space (referred to as ‘Random’ in the plots) and learning without a curriculum (referred to as ‘Default’). Additional details on the experiments as well as qualitative evaluations of them can be found in Appendix B.

5.1 Point Mass Environment

In this environment, the agent controls a point mass that needs to be navigated through a gate of given size and position to reach a desired target in a two-dimensional world. If the point mass crashes into the wall, the experiment is stopped. The agent moves the point mass by applying forces and the reward decays in a squared exponential manner with increasing distance to the goal. In our version of the experiment, the contextual variable c ∈ R3 changes the width and position of the gate as well as the dynamic friction coefficient of the ground on which the point mass slides. The target context distribution µ(c) is a narrow Gaussian with negligible noise that encodes a small gate at a specific position and a dynamic friction coefficient of 0. Figure 1 shows two different instances of the environment, one of them being the target task.

(7)

Table 1: Average final reward and standard error of different curricula and RL algorithms in the two Point Mass environments with three (P3D) and two (P2D) context dimensions as well as the Ball-Catching environment (BC). The data is computed from 20 algorithm runs. Significantly better results according to a t-test with p < 1% are highlighted in bold. The asterisks mark runs of SPDL/GoalGAN with an initialized context distribution and runs of Default learning without policy initialization.

PPO (P3D) SAC (P3D) PPO (P2D) SAC (P2D) TRPO (BC) PPO (BC) ALP-GMM 2.34 ± 0.2 0.96 ± 0.3 5.24 ± 0.4 1.15 ± 0.4 39.8 ± 1.1 46.5 ± 0.7 GoalGAN 0.50 ± 0.0 1.08 ± 0.4 1.39 ± 0.5 0.72 ± 0.2 42.5 ± 1.6 42.6 ± 2.7 GoalGAN* - - - - 45.8 ± 1.0 45.9 ± 1.0 SPDL 9.35 ± 0.1 4.43 ± 0.7 9.02 ± 0.4 4.69 ± 0.7 47.0 ± 2.0 53.9 ± 0.4 SPDL* - - - - 43.3 ± 2.0 49.3 ± 1.4 Random 0.53 ± 0.0 0.60 ± 0.1 1.34 ± 0.3 0.93 ± 0.3 - -Default 2.46 ± 0.0 2.26 ± 0.0 2.47 ± 0.0 2.23 ± 0.0 21.0 ± 0.3 22.1 ± 0.3 Default* - - - - 21.2 ± 0.3 23.0 ± 0.7

given context c, also leads to a good performance. Increasing the dimension of the context space has a stronger negative impact on the performance of the other CL algorithms than on both SPDL and SPRL, where it only negligibly decreases the performance. We suspect that this effect arises because both ALP-GMM and GoalGAN have no notion of a target distribution. Consequently, for a context distribution µ(c) with negligible variance, a higher context dimension decreases the average proximity of sampled tasks to the target one. By having a notion of a target distribution, SPDL ultimately samples contexts that are close to the desired ones, regardless of the dimension. The context distributions p(c) visualized in Figure 2 show that the agent focuses on wide gates in a variety of positions in early iterations. Subsequently, the size of the gate is decreased and the position of the gate is shifted to match the target one. This process is carried out at different pace and in different ways, sometimes preferring to first shrink the width of the gate before moving its position while sometimes doing both simultaneously.

5.2 Ant Environment

We replace the point mass in the previous environment with a four-legged ant similar to the one in the OpenAI Gym simulation environment [53].2_{The goal is to reach the other side of a wall by passing}

through a gate, whose width and position is determined by the contextual variable c ∈ R2(Figure 1). Opposed to the previous environment, an application of SPRL is not straightforward in this environ-ment, since the episodic policy needs to choose weights for a policy network with 6464 parameters. In such high-dimensional spaces, fitting the new episodic policy (i.e. a 6464-dimensional Gaussian) to the generated samples requires significantly more computation time than an update of a step-based policy, taking up to 25 minutes per update on our hardware. Furthermore, this step is prone to numerical instabilities due to the large covariance matrix that needs to be estimated. This observation stresses the benefit of our CRL approach, as it unifies the curriculum generation for episodic and step-based RL algorithms, allowing to choose the most beneficial one for the task at hand.

In this environment, we were only able to evaluate the CL algorithms using PPO. This is because the implementations of TRPO and SAC in the Stable-Baselines library do not allow to make use of the parallelization capabilities of the Isaac Gym simulator, leading to prohibitive running times (details in Appendix B).

Looking at Figure 3, we see that SPDL allows the learning agent to escape the local optimum which results from the agent not finding the gate to pass through. ALP-GMM and a random curriculum do not improve the reward over directly learning on the target task. However, as we show in Appendix B, both ALP-GMM and a random curriculum improve the qualitative performance, as they sometimes allow to move the ant through the gate. Nonetheless, this behavior is unreliable and inefficient, causing the action penalties in combination with the discount factor to prevent this better behavior from being reflected in the reward.

2

(8)

0 50 100 150 200 250 Iteration 0 1000 Rew ard Ant 0 100 200 300 400 500 Iteration 0 20 40 Ball Catching -1 -0.5 0 X-Position 1 1.4 1.8 Y-P osition -1 -0.5 0 X-Position -1 -0.5 0 X-Position -1 -0.5 0 X-Position -1 -0.5 0 X-Position -1 -0.5 0 X-Position SPDL ALP-GMM Random Default GoalGAN SPDL* GoalGAN* Default*

Figure 3: Mean (thick line) and two times standard error (shaded area) of the reward achieved with different curricula in the Ant environment for PPO and in the Ball-Catching environment for SAC (upper plots). The statistics are computed from 20 seeds. For Ball-Catching, runs of SPDL/GoalGAN with an initialized context distribution and runs of Default learning without policy initialization are indicated by asterisks. The lower plots show ball positions in the ‘catching’ plane sampled from the context distributions p(c) in the Ball-Catching environment at iterations 0, 50, 80, 110, 150 and 200 (from left to right). Different sample colors and shapes indicate different algorithm runs. Given that p(c) is initialized with µ(c), the samples in iteration 0 visualize the target distribution.

5.3 Ball-Catching Environment

Due to a sparse reward function and a broad target task distribution, this final environment is drastically different from the previous ones. In this environment, the agent needs to control a Barrett WAM robot to catch a ball thrown towards it. The reward function is sparse, only rewarding the robot when it catches the ball and penalizing it for excessive movements. In the simulated environment, the ball is said to be caught if it is in contact with the end effector that is attached to the robot. The context c ∈ R3parameterizes the distance to the robot from which the ball is thrown as well as its target position in a plane that intersects the base of the robot. Figure 1 shows the robot as well as the target distribution over the ball positions in the aforementioned ‘catching’ plane. In this environment, the context c is not visible to the policy, as it only changes the initial state distribution p(s0) via the

(9)

6 Conclusion

We proposed self-paced deep reinforcement learning, an inference-derived curriculum reinforcement learning algorithm. The resulting method is easy to use, allows to draw connections to established regularization techniques for inference, and generalizes previous results in the domain of CRL. In our experiments, the method matched or surpassed performance of other CRL algorithms, especially excelling in tasks where learning is aimed at a single target task.

As discussed, the inference view provides many possibilities for future improvements of the proposed algorithm, such as using more elaborate methods for choosing the hyperparameter α or approximating the variational distribution q(c) using more advanced methods. Such algorithmic improvements are expected to further improve the efficiency of the algorithm. Furthermore, a re-interpretation of the self-paced learning algorithm for supervised learning tasks using the presented inference perspective may allow for a unifying view across the boundary of both supervised- and reinforcement learning, allowing to share algorithmic advances.

Broader Impact

This work proposed a method to speed up and stabilize the learning of autonomous agents via curriculum reinforcement learning. In a practical scenario, such methods can reduce the amount of time, energy, or manual labor required to create autonomous agents for a given task, allowing for economic benefits. Given the inherent goal of RL to create versatile learning algorithms, free of ties to a specific domain, RL algorithms can be used in a variety of fields, ranging from automating aspects of elderly care over autonomous vehicles to military uses. Given the abstract nature of our work, it is, however, hard to estimate the immediate consequences of our work on society, since the algorithmic benefits arising from our work apply equally to all of the aforementioned examples.

Acknowledgments and Disclosure of Funding

This project has received funding from the DFG project PA3179/1-1 (ROBOLEAP). During parts of the work on this paper, Joni Pajarinen was affiliated with Tampere University, Finland.

References

[1] Richard S Sutton and Andrew G Barto. Introduction to reinforcement learning, volume 135. MIT Press Cambridge, 1998.

[2] David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of go without human knowledge. Nature, 550(7676):354–359, 2017.

[3] Oriol Vinyals, Igor Babuschkin, Wojciech M Czarnecki, Michaël Mathieu, Andrew Dudzik, Jun-young Chung, David H Choi, Richard Powell, Timo Ewalds, Petko Georgiev, et al. Grandmaster level in starcraft ii using multi-agent reinforcement learning. Nature, 575(7782):350–354, 2019. [4] Marlos C Machado, Marc G Bellemare, and Michael Bowling. Count-based exploration with

the successor representation. In AAAI, 2020.

[5] Haoran Tang, Rein Houthooft, Davis Foote, Adam Stooke, OpenAI Xi Chen, Yan Duan, John Schulman, Filip DeTurck, and Pieter Abbeel. # exploration: A study of count-based exploration for deep reinforcement learning. In NIPS, 2017.

[6] Marc Bellemare, Sriram Srinivasan, Georg Ostrovski, Tom Schaul, David Saxton, and Remi Munos. Unifying count-based exploration and intrinsic motivation. In NIPS, 2016.

[7] Rein Houthooft, Xi Chen, Yan Duan, John Schulman, Filip De Turck, and Pieter Abbeel. Vime: Variational information maximizing exploration. In NIPS, 2016.

(10)

[9] Martin Riedmiller, Roland Hafner, Thomas Lampe, Michael Neunert, Jonas Degrave, Tom Van de Wiele, Volodymyr Mnih, Nicolas Heess, and Jost Tobias Springenberg. Learning by playing-solving sparse reward tasks from scratch. ICML, 2018.

[10] Carlos Florensa, David Held, Markus Wulfmeier, Michael Zhang, and Pieter Abbeel. Reverse curriculum generation for reinforcement learning. In CoRL, 2017.

[11] M Pawan Kumar, Benjamin Packer, and Daphne Koller. Self-paced learning for latent variable models. In NIPS, 2010.

[12] Pascal Klink, Hany Abdulsamad, Boris Belousov, and Jan Peters. Self-paced contextual reinforcement learning. In CoRL, 2019.

[13] Zhipeng Ren, Daoyi Dong, Huaxiong Li, and Chunlin Chen. Self-paced prioritized curriculum learning with coverage penalty in deep reinforcement learning. IEEE transactions on neural networks and learning systems, 29(6):2216–2226, 2018.

[14] Marc Toussaint and Amos Storkey. Probabilistic inference for solving discrete and continuous state markov decision processes. In ICML, 2006.

[15] Sergey Levine. Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:1805.00909, 2018.

[16] John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In ICML, 2015.

[17] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

[18] Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In ICML, 2018. [19] Burrhus Frederic Skinner. The behavior of organisms: An experimental analysis. BF Skinner

Foundation, 2019.

[20] J Bongard and Hod Lipson. Once more unto the breach: Co-evolving a robot and its simulator. In ALIFE, 2004.

[21] Tom Erez and William D Smart. What does shaping mean for computational reinforcement learning? In ICDL, 2008.

[22] Sanmit Narvekar and Peter Stone. Learning curriculum policies for reinforcement learning. In AAMAS, 2019.

[23] Carlos Florensa, David Held, Xinyang Geng, and Pieter Abbeel. Automatic goal generation for reinforcement learning agents. In ICML, 2018.

[24] Marcin Andrychowicz, Filip Wolski, Alex Ray, Jonas Schneider, Rachel Fong, Peter Welinder, Bob McGrew, Josh Tobin, OpenAI Pieter Abbeel, and Wojciech Zaremba. Hindsight experience replay. In NIPS, 2017.

[25] Jürgen Schmidhuber. Curious model-building control systems. In IJCNN, 1991.

[26] Adrien Baranes and Pierre-Yves Oudeyer. Intrinsically motivated goal exploration for active motor learning in robots: A case study. In IROS, 2010.

[27] Rémy Portelas, Cédric Colas, Katja Hofmann, and Pierre-Yves Oudeyer. Teacher algorithms for curriculum learning of deep rl in continuously parameterized environments. In CoRL, 2019. [28] Pierre Fournier, Olivier Sigaud, Mohamed Chetouani, and Pierre-Yves Oudeyer. Accuracy-based curriculum learning in deep reinforcement learning. arXiv preprint arXiv:1806.09614, 2018.

(11)

[30] Douglas Blank, Deepak Kumar, Lisa Meeden, and James B Marshall. Bringing up robot: Fun-damental mechanisms for creating a self-motivated, self-organizing architecture. Cybernetics and Systems: An International Journal, 36(2):125–150, 2005.

[31] Lu Jiang, Deyu Meng, Qian Zhao, Shiguang Shan, and Alexander G Hauptmann. Self-paced curriculum learning. In AAAI, 2015.

[32] Lu Jiang, Deyu Meng, Shoou-I Yu, Zhenzhong Lan, Shiguang Shan, and Alexander Hauptmann. Self-paced learning with diversity. In NIPS, 2014.

[33] Peter Dayan and Geoffrey E Hinton. Using expectation-maximization for reinforcement learning. Neural Computation, 9(2):271–278, 1997.

[34] Marc Peter Deisenroth, Gerhard Neumann, Jan Peters, et al. A survey on policy search for robotics. Foundations and Trends in Robotics, 2(1–2):1–142, 2013.R

[35] Konrad Rawlik, Marc Toussaint, and Sethu Vijayakumar. On stochastic optimal control and reinforcement learning by approximate inference. In IJCAI, 2013.

[36] Philipp Hennig, Michael A Osborne, and Mark Girolami. Probabilistic numerics and uncertainty in computations. Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences, 471(2179):20150142, 2015.

[37] Simon JD Prince. Computer vision: models, learning, and inference. Cambridge University Press, 2012.

[38] Brian D Ziebart, Andrew L Maas, J Andrew Bagnell, and Anind K Dey. Maximum entropy inverse reinforcement learning. In AAAI, 2008.

[39] Jan Peters, Katharina Mulling, and Yasemin Altun. Relative entropy policy search. In AAAI, 2010.

[40] Abbas Abdolmaleki, Jost Tobias Springenberg, Yuval Tassa, Remi Munos, Nicolas Heess, and Martin Riedmiller. Maximum a posteriori policy optimisation. In ICLR, 2018.

[41] Matthew Fellows, Anuj Mahajan, Tim GJ Rudner, and Shimon Whiteson. Virel: A variational inference framework for reinforcement learning. In NIPS, 2019.

[42] Chang Liu, Jingwei Zhuo, Pengyu Cheng, Ruiyi Zhang, Jun Zhu, and Lawrence Carin. Under-standing and accelerating particle-based variational inference. ICML, 2019.

[43] Andre Wibisono. Sampling as optimization in the space of measures: The langevin dynamics as a composite optimization problem. COLT, 2018.

[44] Stephan Mandt, James McInerney, Farhan Abrol, Rajesh Ranganath, and David Blei. Variational tempering. In AISTATS, 2016.

[45] Assaf Hallak, Dotan Di Castro, and Shie Mannor. Contextual markov decision processes. arXiv preprint arXiv:1502.02259, 2015.

[46] Aditya Modi, Nan Jiang, Satinder Singh, and Ambuj Tewari. Markov decision processes with continuous side information. In ALT, 2018.

[47] Tom Schaul, Daniel Horgan, Karol Gregor, and David Silver. Universal value function approxi-mators. In ICML, 2015.

[48] Richard S Sutton, Joseph Modayil, Michael Delp, Thomas Degris, Patrick M Pilarski, Adam White, and Doina Precup. Horde: a scalable real-time architecture for learning knowledge from unsupervised sensorimotor interaction. In AAMAS, 2011.

(12)

[50] Naonori Ueda and Ryohei Nakano. Deterministic annealing variant of the em algorithm. In NIPS, 1995.

[51] Pauli Virtanen, Ralf Gommers, Travis E. Oliphant, Matt Haberland, Tyler Reddy, David Cournapeau, Evgeni Burovski, Pearu Peterson, Warren Weckesser, Jonathan Bright, Stéfan J. van der Walt, Matthew Brett, Joshua Wilson, K. Jarrod Millman, Nikolay Mayorov, Andrew R. J. Nelson, Eric Jones, Robert Kern, Eric Larson, C J Carey, ˙Ilhan Polat, Yu Feng, Eric W. Moore, Jake VanderPlas, Denis Laxalde, Josef Perktold, Robert Cimrman, Ian Henriksen, E. A. Quintero, Charles R. Harris, Anne M. Archibald, Antônio H. Ribeiro, Fabian Pedregosa, Paul van Mulbregt, and SciPy 1.0 Contributors. SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python. Nature Methods, 17:261–272, 2020.

[52] Ashley Hill, Antonin Raffin, Maximilian Ernestus, Adam Gleave, Anssi Kanervisto, Rene Traore, Prafulla Dhariwal, Christopher Hesse, Oleg Klimov, Alex Nichol, Matthias Plappert, Alec Radford, John Schulman, Szymon Sidor, and Yuhuai Wu. Stable baselines. https: //github.com/hill-a/stable-baselines, 2018.

[53] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. Openai gym. arXiv preprint arXiv:1606.01540, 2016.