Adaptive Smoothing Path Integral Control

(1)

Adaptive Smoothing Path Integral Control

Dominik Thalmeier Radboud University Nijmegen

Nijmegen, The Netherlands [email protected]

Hilbert J. Kappen Radboud University Nijmegen

Nijmegen, the Netherlands [email protected]

Simone Totaro Universitat Pompeu Fabra

Barcelona, Spain [email protected]

Vicenç Gómez Universitat Pompeu Fabra

Barcelona, Spain [email protected]

Abstract

In Path Integral control problems a representation of an optimally controlled dy-namical system can be formally computed and serve as a guidepost to learn a parametrized policy. The Path Integral Cross-Entropy (PICE) method tries to exploit this, but is hampered by poor sample efficiency. We propose a model-free algorithm called ASPIC (Adaptive Smoothing of Path Integral Control) that applies an inf-convolution to the cost function to speedup convergence of policy optimiza-tion. We identify PICE as the infinite smoothing limit of such technique and show that the sample efficiency problems that PICE suffers disappear for finite levels of smoothing. For zero smoothing this method becomes a greedy optimization of the cost, which is the standard approach in current reinforcement learning. We show analytically and empirically that intermediate levels of smoothing are optimal, which renders the new method superior to both PICE and direct cost-optimization.

1 Introduction

Optimal control of non-linear dynamical systems that are continuous in time and space is hard. Methods that have proven to work well introduce a parametrized policy like a neural network [16, 5] and directly optimize the expected cost using gradient descent [27, 19, 22, 11]. To achieve a robust decrease of the expected cost, it is important to ensure that at each step the policy stays in the proximity of the old policy [5]. This can be achieved by enforcing a trust region constraint [18, 22] or using adaptive regularization [11]. However the applicability of these methods is limited, as in each iteration of the algorithm, samples from the controlled system have to be computed. We want to increase the convergence rate of policy optimization to reduce the number of simulations needed. To this end we consider Path Integral control problems [12?], that offer an alternative approach to direct cost optimization and explore if this allows to speed up policy optimization. This class of control problems permits arbitrary non-linear dynamics and state cost, but requires a linear dependence of the control on the dynamics and a quadratic control cost [12, 2, 25]. These restrictions allow to obtain an explicit expression for the probability-density of optimally controlled system trajectories. Through this, an information-theoretical measure of the deviation of the current control policy from the optimal control can be calculated. The Path Integral Cross-Entropy (PICE) method [13] proposes to use this measure as a pseudo-objective for policy optimization.

In this work we analyze a new kind of smoothing technique for the cost function based on recently proposed smoothing techniques to speed up convergence in deep neural networks [4]. We adapt this technique to Path Integral control problems and show that(i), in contrast to [4], smoothing in Path

(2)

Integral control can be solved analytically, providing an expression of the gradient that can directly be computed from Monte Carlo samples and(ii), we can interpolate between direct cost optimization and the PICE objective. Remarkably, the parameter governing the smoothing can be determined independently of the number of samples.

Based on these results, we introduce the ASPIC (Adaptive Smoothing of Path Integral Control) algorithm, a model-free algorithm that uses cost smoothing to speed up policy optimization. ASPIC adjusts the smoothing parameter in each step to keep the variance of the gradient estimator at a predefined level.

2 Path Integral Control Problems

Consider the (multivariate) dynamical system ˙

xt=f(xt, t) +g(xt, t) (u(xt, t) +ξt), (1)

with initial conditionx0. The control policy is implemented in the control functionu(x, t), which

is additive to the white noise ξt which has variance _dtν. Given a control functionuand a time

horizonT, this dynamical system induces a probability distributionpu(τ)over state trajectories

τ={xt|∀t: 0< t≤T}with initial conditionx0.

We define the regularized expected cost

C(pu) =hV(τ)ipu+γKL(pu||p0), (2)

withV(τ) =RT

0 V(xt, t)dt, where the strength of the regularizationKL(pu||p0)is controlled by

the parameterγ.

The Kullback-Leibler divergenceKL(pu||p0)puts high cost to controlsuthat bring the probability

distributionpufar away from the uncontrolled dynamicsp0whereu(xt, t) = 0. We can also rewrite

the regularizerKL(pu||p0)directly in terms of the control functionuby using the Girsanov theorem,

c.f., [25]:logpu(τ) p0(τ) = 1 ν RT 0 1 2u(xt, t) T_u₍_x t, t) +u(xt, t)Tξt

dt. The regularization then takes the form of a quadratic control cost

KL(pu||p0)= * 1 ν Z T 0 ₁ 2u(xt, t) T_u₍_x t, t) +u(xt, t)Tξt dt + pu = * 1 ν Z T 0 1 2u(xt, t) T_u₍_x t, t)dt + pu ,

where we used thatu(xt, t)Tξt

pu = 0. This shows that the regularizationKL(pu||p0)puts higher

cost for large values of the controlleru.

The Path Integral control problem is to find the optimal control functionu∗that minimizes

u∗= arg min

u

C(pu). (3)

For a more complete introduction to Path Integral control problems, see [25, 13].

−Direct cost optimization using gradient descent: A standard approach to find an optimal control function is to introduce a parametrized controlleruθ(xt, t)[11, 27, 22]. This parametrizes the path

probabilitiespuθ and allows to optimize the expected costC(puθ)(2) using stochastic gradient

descent on the cost function:

∇θC(puθ) = D S_pγ uθ(τ)∇θlogpuθ(τ) E p_uθ , (4)

with the stochastic costSγ

p_uθ(τ) :=V(τ) +γlog p_uθ(τ)

p0(τ) (see App. A for details).

−The Path Integral Cross-Entropy method: An alternative approach to direct cost-optimization was introduced in [13], and takes advantage of the analytical expression forpu∗, the probability density

of state trajectories induced by a system with the optimal controlleru∗,pu∗= arg min_p

uC(pu)with

C(pu)given by Eq. (2). Findingpu∗is an optimization problem over probability distributionsp_uthat

are induced by the controlled dynamical system (1). It has been shown [2, 25] that we can solve this by replacing the minimization overpuwith a minimization over all path probability distributionsp:

pu∗≡p∗:= arg min p C(p) = arg min p hV(τ)i_p+γKL(p||p0) = 1 Zp0(τ) exp −1 γV(τ) . (5)

(3)

with the normalization constantZ = Dexp−1 γV(τ)

E

p0

. Note that the above is not a trivial statement, as we now take the minimum also over non-Markovian processes with non-Gaussian noise. The PICE algorithm [13], instead of directly optimizing the expected cost, it minimizes the KL-divergenceKL(p∗||puθ)which measures the deviation of a parametrized distributionpuθ from the

optimal onep∗. Although direct cost optimization and PICE are different methods, their global minimum is the same if the parametrization ofuθcan express the optimal controlu∗ =uθ∗. The

parametersθ∗of the optimal controller are found using gradient descent:

∇θKL(p∗||puθ) = 1 Zp_uθ exp −1 γS γ p_uθ(τ) ∇θlogpuθ(τ) p_uθ , (6) whereZp_uθ := D exp−1 γS γ p_uθ(τ) E p_uθ.

That PICE uses the optimal density as a guidepost for the policy optimization might give it an advantage compared to direct cost-optimization. In practice however, this method only works properly if the initial guess of the controlleruθdoes not deviate too much from the optimal control,

as a high value ofKL(p∗||puθ)leads to a high variance of the gradient estimator and results in

bootstrapping problems of the algorithm [21, 24]. In the next section, we introduce a method that interpolates between direct cost-optimization and the PICE method, allowing us to take advantage of the analytical optimal density without being hampered by the same bootstrapping problems as PICE.

3 Interpolating Between Methods: Smoothing Stochastic Control Problems

Cost function smoothing was recently introduced as a way to speed up optimization of neural networks [4]: Optimization of a general cost functionf(θ)can be speeded up by smoothingf(θ) using an inf-convolution with a distance kerneld(θ0, θ). The smoothed function

Jα(θ) = inf

θ0 αd(θ

0_{, θ}_{) +}_f₍_θ0₎ ₍₇₎

preserves the global minima of the functionf(θ). To apply gradient descent based optimization on

Jα₍_θ₎_{instead of}_f₍_θ₎_{may significantly speed up convergence [4].}

We want to use this accelerative effect to find the optimal parametrization of the controlleruθ.

Therefore, we smooth the cost functionC(puθ)as a function of the parametersθ. AsC(puθ) = hV(τ)i_p

uθ +γKL(puθ||p0)is a functional on the space of probability distributionspuθ, the natural

“distance” is the KL-divergenceKL(pu_θ0||puθ). So we replace

f(θ)→C(puθ)

d(θ0, θ)→KL(pu_θ0||puθ),

and obtain the smoothed costJα(θ)as

Jα(θ) = inf

θ0 αKL(puθ0||puθ) +C(puθ0) = inf_θ0 αKL(puθ0||puθ) +γKL(puθ0||p0) +hV(τ)ipu_θ0.

(8) Note the different roles ofαandγ: the parameterαpenalizes the deviation ofpu_θ0 frompuθ, while

the parameterγpenalizes the deviation ofpu_θ0 from the uncontrolled dynamicsp0.

− Computing the smoothed cost and its gradient: The smoothed cost Jα is expressed as a minimization problem that has to be solved. Here we show that for Path Integral control problems this can be done analytically. To do this we first show that we can replaceinfθ0 →inf_p0 and then

solve the minimization overp0analytically. We replace the minimization overθ0by a minimization overp0in two steps: first we state an assumption that allows us to replaceinfθ0 →inf_u0 and then

proof that for Path Integral control problems we can replaceinfu0 →inf_p0.

We assume that for everyuθand anyα >0, the minimizerθ∗α,θover the parameter space

θ∗_α,θ:= arg min

θ0

(4)

is the parametrization of the minimizeru∗_α,θover the function space u∗_α,θ:= arg min u0 αKL(pu0||p_u θ) +C(pu0), such thatu∗_α,θ≡uθ∗

α,θ. We call this assumptionfull parametrization. Naturally it is sufficient for full

parametrization ifuθ(x, t)is a universal function approximator with a fully observable state spacex

and the timetas input, although this may be difficult to achieve in practice. With this assumption we can replaceinfθ0 →inf_u0. Analogously, we replaceinf_u0 →inf_p0: in App. B.1 we proof that

for Path Integral control problems the minimizeru∗_α,θover the function space induces the minimizer

p∗_α,θover the space of probability distributions

p∗_α,θ:= arg min

p0

αKL(p0||puθ) +C(p

0₎_, ₍₁₀₎

such thatp∗_α,θ≡pu∗

α,θ. This step is similar to the derivation of of Eq. (5) in Section 2, but now we

have added an additional termαKL(pu0||p_u

θ).

Hence, given a Path Integral control problem and a controlleruθthat satisfies full parametrization,

we can replaceinfθ0 →inf_p0and Eq. (8) becomes

Jα(θ) = inf p0 αKL(p 0_||_p uθ) +γKL(p 0_||_p 0) +hV(τ)i_p0. (11)

This can be solved directly: first we compute the minimizer (see App. B.2)

p∗_α,θ(τ) = 1 Zα p_uθ puθ(τ) exp − 1 γ+αS γ p_uθ(τ) , Z_pα uθ = exp − 1 γ+αS γ p_uθ(τ) p_uθ . (12) We plug this back in Eq. (11) and get the smoothed cost and its gradient (see App. B.3)

Jα(θ) =−(γ+α) log exp − 1 γ+αS γ p_uθ(τ) p_uθ (13) ∇θJα(θ) =− α Zα p_uθ exp − 1 γ+αS γ p_uθ(τ) ∇θlogpuθ(τ) p_uθ . (14)

Both can be estimated by samples from the distributionpuθ.

4 The ASPIC Algorithm

In this section, we derive an iterative algorithm that takes a parametrized control functionuθand

performs smooth parameter updates starting from initial parametersθ0. We focus on the effect that

a finiteα > 0 has on the iterative optimization of the controluθfor a fixed value ofγ. For our

theoretical results, we refer the reader to App. C, where we identify several existing settings as limiting cases of the parametersαandγ, and to App. D, where we proof that smooth updates are optimal in two-step sequential decision problems.

To simplify notation, we overloadpuθ →θso that we getC(puθ)→C(θ)andKL(puθ0||puθ)→

KL(θ0||θ). We use a trust region constraint to robustly optimize the policy, c.f., [18, 22, 10]. We define the smoothed update with stepsizeEas an updateθ→θ0withθ0 = ΘJα

E (θ)and

ΘJ_Eα(θ) := arg min

θ0

s.t.KL(θ0||θ)≤E

Jα(θ0). (15)

−Smoothed and direct updates using natural gradients: We first express the constraint optimiza-tion (15) as an unconstrained optimizaoptimiza-tion problem introducing a Lagrange multiplierβ

θn+1= arg min θ0

Jα(θ0) +βKL(θ0||θn). (16)

Following [22] we assume that the trust region sizeEis small. ForE 1we getβ 1and can expandJα₍_θ0₎_{to first and}_KL₍_θ0_||_θ

n)to second order (see App. E.1 for the details). This gives

θn+1=θn−β−1F−1∇θ0Jα(θ0)|_θ0_=θ

(5)

a natural gradient update with the Fisher-matrixF = ∇θ∇TθKL(θ0||θn)

_θ0_=θ

n(we use the conjugate

gradient method to approximately compute the natural gradient for high dimensional parameter spaces. See App. E.2 or [22] for details). Parameterβis determined using a line search such that

KL(θn||θn+1) =E. (18)

Note that for direct updates this derivation is the same, just replaceJαbyC.

−Reliable gradient estimation using adaptive smoothing: To compute smoothed updates using Eq. (17) we need the gradient of the smoothed cost. We assume full parametrization and use Eq. (14), which can be estimated usingN weighted samples drawn from the distributionpuθ:

∇θJα(θ)≈α N X i=1 wilogpuθ(τ i₎_. ₍₁₉₎

The weights are given by

wi = 1 ˜ Z exp − 1 γ+αS γ p_uθ(τ i ) , Z˜= N X i=1 exp − 1 γ+αS γ p_uθ(τ i ) .

The variance of this estimator depends sensitively on the entropy of the weights HN(w) =

−PN

i=1w

i_log(_wi_{). If the entropy is low, the total weight is concentrated on a few particles. This}

results in a poor gradient estimator where only a few of the particles actually contribute. This concen-tration is dependent on the smoothing parameterα: for smallα, the weights are very concentrated in a few samples, resulting in a large weight-entropy and thus a high variance of the gradient estimator. As smallαcorresponds to strong smoothing, we wantαto be as small as possible, but large enough to allow a reliable gradient estimation. Therefore, we set a bound to the weight entropyHN(w). To

get a bound that is independent of the number of samplesN, we use that in the limit ofN → ∞the weight entropy is monotonically related to the KL-DivergenceKL(p∗_α,u

θ||puθ)

KL(p∗_α,u

θ||puθ) = lim

N→∞logN−HN(w)

(see App. E.3). This provides a method for choosingαindependently of the number of samples: we set the constraintKL(p∗_α,u_θ||puθ)≤∆and determine the smallestαthat satisfies this condition by

using a line search. Large values of∆correspond to small values ofα(see App. E.4) and therefore strong smoothing, we thus call parameter∆thesmoothing strength.

−A model-free algorithm: We can compute the gradient (19) and the KL-divergence while treating the dynamical system as a black-box. For this we write the probability distributionpuθ over

trajec-toriesτas a Markov processpuθ(τ) =

Q

0<t<Tpuθ(xt+dt|xt, t), where the product runs over the

timet, which is discretized with time stepdt. We define the noisy actionat =u(xt, t) +ξtand

formulate the transitionspuθ(xt+dt|xt)for the dynamical system (1) as

puθ(xt+dt|xt) =δ(xt+dt− F(xt, at, t))·πθ(at|t, xt),

withδ(·)the Dirac delta function. This splits the transitions up into the deterministic dynamical systemF(xt, at, t)and a Gaussian policyπθ(at|t, xt) =N at|uθ(xt, t),_dtν

with meanuθ(xt, t)

and variance_dtν. Using this we get a simplified expression for the gradient of the smoothed cost (19) that is independent of the system dynamics, given the samples drawn from the controlled systempuθ:

∇θJα(θ)≈α N X i=1 X 0<t<T wi∇θlogπθ(ait|t, x i t).

Similarly we obtain an expression for the estimator of the KL divergence KL(θn||θn+1) ≈ 1 N PN i=1 P 0<t<Tlog πθn(ait|t,x i t) π_θn₊₁(ai t|t,xit)

. With this we formulate ASPIC (Algorithm 1) which optimizes the parametrized policyπθby iteratively drawing samples from the controlled system.

5 Numerical Experiments

We compare experimentally the convergence speed of policy optimization with and without smoothing. For the optimization with smoothing, we use ASPIC and for the optimization without smoothing, we

(6)

0 50 100 150 200 250 300 350 400 Iterations 100 0 100 200 300 400 500 Cost A Pendulum task

direct cost optimization smoothed cost optimization

0 100 200 300 400 500 600 700 800 Iterations 1000 500 0 500 1000 1500 Cost B Acrobot task

0 50 100 150 200 250 300 Iterations 250 200 150 100 50 0 50 100 150 Costs 2D Walker Task

smoothed cost optimization direct cost optimization

Figure 1: Smoothed cost-optimization exhibits faster convergence than direct cost-optimization in a variety of tasks. Plots show mean and standard deviation of the cost per iteration for10runs of the algorithm. For pendulum and acrobot tasks, we used∆ = 0.5andE= 0.1whereas for the walker, we used∆ = 0.05 logNandE= 0.01. See App. F for more details.

use a version of ASPIC where we replaced the gradient of the smoothed cost with the gradient of the cost itself. We consider three non-linear control problems, which violate the full parametrization assumption (pendulum swing-up task, Acrobot, and 2D walker). The latter was simulated using OpenAI gym [3]. For pendulum swing-up and the Acrobot tasks we used time-varying linear feedback controllers, whereas for the 2D walker task we parametrized the controluθusing a neural network.

We provide more details about the experimental settings and additional results in App. F.

−Convergence rate of policy optimization: Fig. 1 shows the comparison of ASPIC algorithm with smoothing against direct-cost optimization. In all three tasks, smoothing improves the convergence rate of policy optimization. Smoothed cost optimization requires less iterations to achieve the same cost reduction as direct cost-optimization, with only a negligible amount of additional computational steps that do not depend on the complexity of the simulation runs.

We can thus conclude that even in cases when the parametrized controller does not strictly meet the requirement of full parametrization needed to derive the gradient of the smoothed cost, a strong performance boost can also be achieved.

6 Discussion

Many policy optimization algorithms update the control policy based on a direct optimization of the cost; examples are Trust Region Policy Optimization (TRPO) [22] or the Path-Integral Relative Entropy Policy Search (PIREPS) [10], where the later is particularly developed for Path Integral control problems. The main novelty of this work is the application of the idea of smoothing as introduced in [4] to Path Integral control problems. This allows to outperform direct cost-optimization and achieve faster convergence rates with only a negligible amount of computational overhead. This procedure bears similarities to an adaptive annealing scheme, with the smoothing parameter playing the role of an artificial temperature. However in contrast to classical annealing schemes, such as simulated annealing, changing the smoothing parameter does not change the optimization target: the minimum of the smoothed cost remains the optimal control solution for all levels of smoothing. In the weak smoothing limit, ASPIC directly optimizes the cost using trust region constrained updates, similar to the TRPO algorithm [22]. TRPO differs from ASPIC’s weak smoothing limit by additionally using certain variance reduction techniques for the gradient estimator: They replace the stochastic cost in the gradient estimator by the easier-to-estimate advantage function, which has a state dependent baseline and only takes into account future expected cost. Since this depends on the linearity of the gradient in the stochastic cost and this dependence is non-linear for the gradient of the smoothed cost, we cannot directly incorporate these variance reduction techniques in ASPIC. In the strong smoothing limit ASPIC becomes a version of PICE [13] that—unlike the plain PICE algorithm—uses a trust region constraint to achieve robust updates. The gradient estimation problem that appears in the PICE algorithm was previously addressed in [21]: they proposed a heuristic that allows to reduce the variance of the gradient estimator by adjusting the particle weights used to compute the policy gradient. In [21] this heuristic is introduced as an ad hoc fix of the sampling problem and the adjustment of the weights introduces a bias with possible unknown side effects. Our study sheds a new light on this, as adjusting the particle weights corresponds to a change of the smoothing parameter in our case.

(7)

Acknowledgments

The research leading to these results has received funding from “La Caixa” Banking Foundation. Vi-cenç Gómez is supported by the Ramon y Cajal program RYC-2015-18878 (AEI/MINEICO/FSE,UE).

References

[1] O. Arenz, M. Zhong, and G. Neumann. Trust-region variational inference with Gaussian mixture models. arXiv preprint arXiv:1907.04710, 2019.

[2] J. Bierkens and H. J. Kappen. Explicit solution of relative entropy weighted control. Systems & Control Letters, 72:36–43, 2014.

[3] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba. Openai gym, 2016.

[4] P. Chaudhari, A. Oberman, S. Osher, S. Soatto, and G. Carlier. Deep relaxation: partial differential equations for optimizing deep neural networks. Research in the Mathematical Sciences, 5(3):30, 2018.

[5] Y. Duan, X. Chen, R. Houthooft, J. Schulman, and P. Abbeel. Benchmarking deep reinforcement learning for continuous control. InProceedings of the 33rd International Conference on Machine Learning, 2016.

[6] W. Fleming and S. Sheu. Risk-sensitive control and an optimal investment model ii.Annals of Applied Probability, pages 730–767, 2002.

[7] W. H. Fleming and W. M. McEneaney. Risk-sensitive control on an infinite time horizon. SIAM Journal on Control and Optimization, 33(6):1881–1915, 1995.

[8] P. Glasserman.Monte Carlo methods in financial engineering, volume 53. Springer Science & Business Media, 2013.

[9] X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. InThirteenth International Conference on Artificial Intelligence and Statistics, pages 249–256, 2010.

[10] V. Gómez, H. J. Kappen, J. Peters, and G. Neumann. Policy search for path integral control. InJoint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 482–497. Springer, 2014.

[11] N. Heess, S. Sriram, J. Lemmon, J. Merel, G. Wayne, Y. Tassa, T. Erez, Z. Wang, A. Eslami, M. Riedmiller, et al. Emergence of locomotion behaviours in rich environments. arXiv preprint arXiv:1707.02286, 2017.

[12] H. J. Kappen. Linear theory for control of nonlinear stochastic systems.Physical Review Letters, 95(20):200201, 2005.

[13] H. J. Kappen and H. C. Ruiz. Adaptive importance sampling for control and inference. Journal of Statistical Physics, 162(5):1244–1266, 2016.

[14] H. Kwakernaak and R. Sivan. Linear optimal control systems, volume 1. Wiley-Interscience New York, 1972.

[15] J. Martens. New insights and perspectives on the natural gradient method. arXiv preprint arXiv:1412.1193, 2014.

[16] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. Human-level control through deep rein-forcement learning. Nature, 518(7540):529–533, 2015.

[17] M. Papini, D. Binaghi, G. Canonaco, M. Pirotta, and M. Restelli. Stochastic variance-reduced policy gradient. InProceedings of the 35th International Conference on Machine Learning, volume 80, pages 4026–4035. PMLR, 2018.

(8)

[18] J. Peters, K. Mülling, and Y. Altün. Relative entropy policy search. InProceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence, pages 1607–1612. AAAI Press, 2010. [19] J. Peters and S. Schaal. Reinforcement learning of motor skills with policy gradients.Neural

networks, 21(4):682–697, 2008.

[20] R. Ranganath, S. Gerrish, and D. Blei. Black Box Variational Inference. InProceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics, volume 33, pages 814–822. PMLR, 2014.

[21] H.-C. Ruiz and H. J. Kappen. Particle smoothing for hidden diffusion processes: Adaptive path integral smoother.IEEE Transactions on Signal Processing, 65(12):3191–3203, 2017. [22] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz. Trust region policy optimization.

InProceedings of the 32nd International Conference on Machine Learning, volume 37, pages 1889–1897. PMLR, 2015.

[23] M. W. Spong. The swing up control problem for the acrobot.IEEE control systems, 15(1):49–55, 1995.

[24] D. Thalmeier, V. Gómez, and H. J. Kappen. Action selection in growing state spaces: Control of network structure growth.Journal of Physics A: Mathematical and Theoretical, 50(3):034006, 2016.

[25] S. Thijssen and H. J. Kappen. Path integral control and state-dependent feedback. Physical Review E, 91(3):032104, 2015.

[26] B. van den Broek, W. Wiegerinck, and H. Kappen. Risk sensitive path integral control. In Proceedings of the Twenty-Sixth Conference on Uncertainty in Artificial Intelligence, pages 615–622. AUAI Press, 2010.

[27] R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning.Machine learning, 8(3-4):229–256, 1992.

A

Derivation of the policy gradient

Here we derive Eq. (4). We writeC(puθ) =

Sγ uθ(τ) p_uθ, withS γ uθ(τ) :=V(τ) +γlog p_uθ(τ) p0(τ) and

take the derivative of Eq. (2):

∇θ Sγ_u_θ(τ)_p uθ =∇θ V(τ) +γlogpuθ(τ) p0(τ) p_uθ (20)

Now we introduce the importance samplerpu_θ0 and correct for it.

∇θ Suγθ(τ) p_uθ =∇θ _p uθ(τ) pu_θ0(τ) V(τ) +γlogpuθ(τ) p0(τ) pu_θ0 (21)

This is true for allθ0as long aspuθ(τ)andpuθ0(τ)are absolutely continuous to each other. Taking

the derivative we get:

= _∇ θpuθ(τ) pu_θ0(τ) V(τ) +γlogpuθ(τ) p0(τ) pu_θ0 + _p uθ(τ) pu_θ0(τ) γ 1 puθ(τ) ∇θpuθ(τ) pu_θ0 (22) = (∇θlogpuθ(τ)) V(τ) +γlogpuθ(τ) p0(τ) p_uθ +γ∇θ ₁ pu_θ0(τ) puθ(τ) pu_θ0 (23) = S_uγ θ(τ)∇θlogpuθ(τ) p_uθ +γ∇θh1ip_uθ (24) = S_uγ_θ(τ)∇θlogpuθ(τ) p_uθ. (25)

(9)

B

Smoothing Stochastic Control Problems

B.1 Replacing Minimization overuby Minimization overp0

Here we show that for

Jα(θ) = inf

u0 αKL(pu 0||p_u

θ) +γKL(pu0||p0) +hV(τ)ip0 (26)

we can replace the minimization overuby a minimization overp0to obtain Eq. (11). For this, we need to show that the minimizerp∗_α,θof Eq. (11) is induced byu∗_α,θ, the minimizer of Eq. (26):

p∗_α,θ≡pu∗

α,θ.

The solution to (11) is given by (see App. B.2)

p∗α,θ = 1 Zpuθ(τ) exp − 1 γ+αS γ p_uθ(τ) = 1 Zpuθ(τ) _p 0(τ) puθ(τ) γ+γα exp − 1 γ+αV(τ) . (27) We rewrite p0(τ) _p uθ(τ) p0(τ) 1−γ+γα =p0(τ) exp 1− γ γ+α Z T 0 ₁ 2uθ(xt, t) T_u θ(xt, t) +uθ(xt, t)Tξt dt ! ,

where we used the Girsanov theorem [2, 25] (and setν = 1for simpler notation). Withu˜θ(xt, t) :=

1− γ γ+α uθ(xt, t)this gives p0(τ) _p uθ(τ) p0(τ) 1−γ+γα =p0(τ) exp Z T 0 ₁ 2u˜θ(xt, t) T_u_˜ θ(xt, t) + ˜uθ(xt, t)Tξt dt ! · ·exp Z T 0 ₁ 2 γ αu˜θ(xt, t) T_u_˜ θ(xt, t) dt ! =pu˜θ(τ) exp Z T 0 ₁ 2 γ αu˜θ(xt, t) T_u_˜ θ(xt, t) dt ! . So we get p∗_α,θ= 1 Zpu˜θ(τ) exp Z T 0 ₁ 2 γ α˜uθ(xt, t) T_u_˜ θ(xt, t) dt ! exp − 1 γ+αV(τ) . (28)

This has the form of an optimally controlled distribution with dynamics

˙ xt=f(xt, t) +g(xt, t) (˜uθ(xt, t) + ˆu(xt, t) +ξt) (29) and cost * Z T 0 1 γ+αV(xt, t)− 1 2 γ αu˜θ(xt, t) T_u_˜ θ(xt, t)dt+ Z T 0 ₁ 2uˆ(xt, t) T_u_ˆ₍_x t, t) + ˆu(xt, t)Tξt dt + puˆ . (30)

This is a Path Integral control problem with state costR₀T _γ+α1 V(xt, t)−1₂_αγu˜θ(xt, t)Tu˜θ(xt, t)dt

which is well defined withu˜θ(xt, t) =

1−_γ+αγ uθ(xt, t).

Letuˆ∗be the optimal control of this Path Integral control problem. Thenp∗_α,θis induced by Eq. (29) withuˆ= ˆu∗_{. This is equivalent to say that}_p∗

α,θis induced by Eq. (1) Asp∗α,θ is the density that

(10)

B.2 Minimizer of smoothed cost Here we want to proof Eq. (12):

p∗_α,θ(τ) := arg min p0 αKL(p0||puθ) + D S_pγ uθ(τ) E p0 (31) = arg min p0 αlog p 0₍_τ₎ puθ(τ) +V(τ) +γlog p 0₍_τ₎ p0(τ) p0 . (32)

For this we take the variational derivative and set it to zero:

0 = δ δp0₍_τ₎ αlog p 0₍_τ₎ puθ(τ) +V(τ) +γlog p 0₍_τ₎ p0(τ) +κ p0 p0_=p∗ α,θ , (33)

where we added a Lagrange multiplierκto ensure normalization. We get

0 = αlog p 0₍_τ₎ puθ(τ) +V(τ) +γlog p 0₍_τ₎ p0(τ) +κ p0_=p∗ α,θ , (34)

from which follows

p∗_α,θ(τ) = exp _κ α+γ puθ(τ) α α+γ_p 0(τ) γ α+γ_exp − 1 γ+αV(τ) (35) = exp _κ α+γ puθ(τ) exp − 1 γ+αV(τ)− γ α+γlog puθ(τ) p0(τ) (36) = exp _κ α+γ puθ(τ) exp − 1 γ+αS γ p_uθ(τ) , (37)

whereκis chosen such that the distribution is normalized.

B.3 Derivation of the gradient of the smoothed cost function Here we derive Eq. (14) by taking the derivative of Eq. (13):

∇θJα(θ) =−(γ+α)∇θlog exp − 1 γ+α V(τ) +γlogpuθ(τ) p0(τ) p_uθ (38) =−γ+α Zα p_uθ ∇θ exp − 1 γ+α V(τ) +γlogpuθ(τ) p0(τ) p_uθ . (39)

Now we introduce the importance samplerpu_θ0 and correct for it.

∇θJα(θ) =− γ+α Zα p_uθ ∇θ _p uθ(τ) puθ0(τ) exp − 1 γ+α V(τ) +γlogpuθ(τ) p0(τ) pu_θ0 (40) =−γ+α Zα p_uθ ∇θ * p0(τ) γ γ+α pu_θ0(τ) (puθ(τ)) α γ+α_exp − 1 γ+αV(τ) + pu_θ0 (41) =− α Zα p_uθ * 1 pu_θ0(τ) _p uθ(τ) p0(τ) −γ+γα exp − 1 γ+αV(τ) ∇θpuθ + pu_θ0 (42) =− α Zα p_uθ exp − 1 γ+αS γ p_uθ(τ) ∇θlogpuθ(τ) p_uθ . (43)

C

PICE, Direct Cost-Optimization and Risk Sensitivity as Limiting Cases of

Smoothed Cost Optimization

The smoothed cost and its gradient depend on the two parametersαandγ, which come from the smoothing Eq. (7) and the definition of the control problem (2) respectively. Although at first glance

(11)

the two parameters seem to play a similar role, they change different properties of the smoothed cost

Jα₍_θ₎_{when they are varied.}

In the expression for the smoothed cost (13), the parameterαonly appears in the sumγ+α. Varying it changes the effect of the smoothing but leaves the optimumθ∗= arg min_θJα₍_θ₎_{of the smoothed}

cost invariant. Here we show that smoothing leaves the global optimum of the costC(puθ)invariant.

AsKL(pu_θ0||puθ)≥0we have that

Jα(θ) = inf

θ0 C(puθ0) +αKL(puθ0||puθ)≥inf

θ0 C(puθ0) =C(puθ∗).

To show that the global minimumθ∗ofCis also the global minimum ofJα_{, it is thus sufficient to}

show that

Jα(θ∗)≤C(puθ∗).

We have

Jα(θ∗) = inf

θ0 C(puθ0) +αKL(puθ0||puθ∗).

Using that the minimum of a sum of terms is never larger than the sum of the minimum of terms, we get Jα(θ∗)≤ inf θ0 C(puθ0) + inf θ0 αKL(puθ0||puθ∗) =C(pu_θ∗) +αKL(pu_θ∗||pu_θ∗) =C(puθ∗).

We also expect local maxima to be also preserved for large-enough smoothing parameterα. This would correspond to small time smoothing by the associated Hamilton-Jacobi partial differential equation [4].

We therefore callαthesmoothing parameter. The largerα, the weaker the smoothing; in the limiting caseα→ ∞, smoothing is turned off as we can see from Eq. (13): for very largeα, the exponential and the logarithmic function linearise,Jα₍_θ₎_→_C₍_p

uθ)and we recover direct cost-optimization.

For the limiting caseα→0, we recover the PICE method: the optimizerp∗_α,θbecomes equal to the optimal densityp∗and the gradient on the smoothed cost (14) becomes proportional to the PICE gradient (6): lim α→0 1 α∇θJ α₍_θ_{) =}_∇ θKL(p∗||puθ).

Varyingγchanges the control problem and thus its optimal solution. Forγ→0, the control cost becomes zero. In this case the cost only consists of the state cost and arbitrary large controls are allowed. We get Jα(θ) =−αlog exp −1 αV(τ) p_uθ .

This expression is identical to the risk sensitive control cost proposed in [6, 7, 26]. Thus, forγ= 0, the smoothing parameterαcontrols the risk-sensitivity, resulting in risk seeking objectives forα >0 and risk avoiding objectives forα <0. In the limiting caseγ→ ∞, the problem becomes trivial; the optimal controlled dynamics becomes equal to the uncontrolled dynamics:p∗→p0, c.f., Eq. (5),

andu∗→0.

If both parametersαandγare small, the problem is hard (see [21, 24]) as many samples are needed to estimate the smoothed cost. The problem becomes feasible if eitherαorγis increased. Increasing

γhowever, changes the control problem, while increasingαweakens the effect of smoothing.

D

The effect of cost function smoothing on policy optimization

We introduced smoothing as a way to speed up policy optimization compared to a direct optimization of the cost. In this section we analyse policy optimization with and without smoothing and show

(12)

Cost C(

θ)

low high

A

B

Θ’

θ

θ’

_θ’’

θ’

θ’’

Θ’

Θ’’

Figure 2: Illustration of optimal two-step updates compared with two consecutive direct updates. Illustrated is a two-dimensional cost landscape C(θ)parametrized byθ. Dark colors represent low cost, while light colors represent high cost. Green dots indicate the optimal two-step update

θ → Θ0 → Θ00 while red dots indicates two consecutive direct updates θ → θ0 → θ00 with

θ0 = ΘC

E(θ)andθ00 = ΘCE0(θ0). The dashed circles indicate trust regions. θ0,θ00andΘ00are the

minimizers of the cost in the trust regions aroundθ,θ0_and_Θ0_{respectively.}_Θ0_{is chosen such that}

the costC(Θ00)after the subsequent direct update is minimized. In both panels, the final cost after an optimal two-step updateC(Θ00)is smaller than the final cost after two direct updatesC(θ00). (A) Equal sizes of the update steps,E=E0_{. (B) When the size of the second step becomes small}_E0_E_,

the smoothed updateθ→Θ0becomes more similar to the direct updateθ→θ0.

analytically how smoothing can speed up policy optimization. To simplify notation, we overload

puθ →θso that we getC(puθ)→C(θ)andKL(puθ0||puθ)→KL(θ

0_||_θ_).

We use a trust region constraint to robustly optimize the policy, c.f., [18, 22, 10]. There are two options. On the one hand, we can directly optimize the costC:

Definition 1. We define the direct update with stepsizeEas an updateθ→θ0withθ0= ΘC E(θ)and

ΘC_E(θ) := arg min

θ0

s.t.KL(θ0||θ)≤E

C(θ0). (44)

The direct update results in the minimal cost that can be achieved after one single update. We define the optimal one-step cost

C_E∗(θ) := min

θ0

s.t.KL(θ0||θ)≤E

C(θ0).

On the other hand we can optimize the smoothed costJα_:

Definition 2. We define the smoothed update with stepsizeEas an updateθ→θ0withθ0= ΘJα E (θ) and ΘJ_Eα(θ) := arg min θ0 s.t.KL(θ0||θ)≤E Jα(θ0). (45)

While a direct update achieves the minimal cost that can be achieved after a single update, we show below that a smoothed update can result in a faster cost reduction if more than one update step is performed.

Definition 3. We define the optimal two-step updateθ→Θ0 →Θ00as an update that results in the lowest cost that can be achieved with a two-step updateθ→θ0→θ00with fixed stepsizesEandE0

respectively: Θ0,Θ00:= arg min θ0,θ00 s.t.KL(θ00||θ0)≤E0 KL(θ0||θ)≤E C(θ00)

(13)

and the corresponding optimal two-step cost C_E∗_,_E0(θ) := min θ0 s.t.KL(θ0||θ)≤E min θ00 s.t.KL(θ00||θ0)≤E0 C(θ00) = min θ0 s.t.KL(θ0||θ)≤E C ΘC_E0(θ0) . (46)

In Fig. 2 we illustrate how such an optimal two-step update leads to a faster decrease of the cost than two consecutive direct updates.

Theorem 1. Statement 1:For allE,αthere exists anE0_{, such that a smoothed update with stepsize} Efollowed by a direct update with stepsizeE0_{is an optimal two-step update:}

Θ0= ΘJ_Eα(θ) Θ00= ΘC_E0(Θ0) ⇒C(Θ00) =C_E∗_,_E0(θ)

The size of the second stepE0_{is a function of}_θ_and_α_.

Statement 2:E0_{is monotonically decreasing in}_α_.

While it is evident from Eq. (46) that the second step of the optimal two-step update must be a direct update, the statement that the first step is a smoothed update is non-trivial.

We split the proof into three subsections: in the first subsection, we state and proof a lemma that we need to proof statement 1. In the second subsection, we proof statement 1 and in the third subsection, we proof statement 2.

D.1 Lemma

Lemma 1. Withθ∗_α,θdefined as in Eq.(9)andEα(θ) =KL

θ∗_α,θ||θwe can rewriteJα₍_θ_): Jα(θ) = C ΘC_E0(θ) _E0₌_E α(θ)+αEα(θ). (47)

Proof. With the definition ofθ∗_α,θas the minimizer ofC(θ0) +αKL(θ0||θ)(see (9)) we have

Jα(θ) =C θ∗_α,θ

+αKL(θ∗_α,θ||θ) =C θ∗_α,θ

+αEα(θ).

What is left to show is that

θ∗_α,θ≡ΘC_E

α(θ)(θ).

As ΘC_E

α(θ)(θ) is the minimizer of the cost C within the trust region defined by {θ0:KL(θ0||θ)≤ Eα(θ)}we have to show that

1. θ∗_α,θlies within this trust region,

2. Cθ∗ α,θ

is a minimizer of the costCwithin this trust region.

The first point is trivially true asKL(θ_α,θ∗ ||θ) =Eα(θ)by definition. Henceθ∗α,θlies at the boundary

of this trust region and therefore in it, as the boundary belongs to the trust region. The second point we proof by contradiction: Givenθ_α,θ∗ is not minimizing the cost within the trust region, then there exists aθˆwithC(ˆθ)< C(θ∗

α,θ)andKL(ˆθ||θ)≤ Eα(θ) =KL(θα,θ∗ ||θ). Therefore it must hold that

C(ˆθ) +αKL(ˆθ||θ)< C(θ∗_α,θ) +αKL(θ_α,θ∗ , θ) which is a contradiction, asθ_α,θ∗ is the minimizer ofC(θ0) +αKL(θ0||θ).

(14)

D.2 Proof of Statement 1

Here we show that for everyαandθthere exists anE0₌_E∗

α(θ)such that CΘC_E0 ΘJ_Eα(θ) E0₌_E∗ α(θ) = C_E∗_,_E0 E0₌_E∗ α(θ) . (48)

Proof. AsJα₍_θ₎_{is the infimum of}_C₍_θ0_{) +}_αKL₍_θ0_||_θ_{), we have for any}_E0_>₀

Jα(θ)≤C ΘC_E0(θ) +αKL ΘC_E0(θ)||θ . Further, asΘC

E0(θ)lies in the trust region{θ0:KL(θ0||θ)≤ E0}we have thatKL ΘC_E0(θ)||θ

≤ E0, so we can write C ΘC_E0(θ) +αKL ΘC_E0(θ)||θ ≤C ΘC_E0(θ) +αE0 and thus Jα(θ)≤C ΘC_E0(θ) +αE0_.

Next we minimize both sides of this inequality within the trust region{θ0:KL(θ0||θ)≤ E}. We use that JαΘJ_Eα(θ)= min θ0 s.t.KL(θ0||θ)≤E Jα(θ0) and get JαΘJ_Eα(θ)≤ min θ0 s.t.KL(θ0||θ)≤E C ΘC_E0(θ0) +αE0 . (49)

Now we use Lemma 1 and rewrite the left hand side of this inequality.

JαΘJ_Eα(θ)= CΘC_E0 ΘJ_Eα(θ) _E0₌_E∗ α(θ) +αE∗ α(θ) withE∗ α(θ) :=Eα(ΘJ α

E (θ)). Plugging this back to (49) we get

CΘC_E0 ΘJ_Eα(θ) _E0₌_E∗ α(θ) +αE_α∗(θ)≤ min θ0 s.t.KL(θ0||θ)≤E C ΘC_E0(θ0) +αE0 .

As this inequality holds for anyE0_>₀_{we can plug in}_E∗

α(θ)on the right hand side of this inequality

and obtain CΘC_E0 ΘJ_Eα(θ) _E0₌_E∗ α(θ) +αEα∗(θ)≤ min θ0 s.t.KL(θ0||θ)≤E C ΘC_E0(θ0) E0₌_E∗ α(θ) +αEα∗(θ). We subtractαE∗ α(θ)on both sides CΘC_E0 ΘJ_Eα(θ) E0₌_E∗ α(θ) ≤ min θ0 s.t.KL(θ0_||_θ)_≤E C ΘC_E0(θ0)_E0₌_E∗ α(θ) .

Using Eq. (46) gives

CΘC_E0 ΘJ_Eα(θ) E0₌_E∗ α(θ) ≤C_E∗_,_E0(θ) E0₌_E∗ α(θ) ,

(15)

D.3 Proof of Statement 2 Here we show thatE0 ₌_E∗

α(θ)is a monotonically decreasing function ofα.Eα∗(θ)is given by

E_α∗(θ) =Eα ΘJ_Eα(θ)= KL(θ_α,θ∗ 0||θ0) θ0_=RJ α E (θ) . We have αKL(θ_α,θ∗ 0||θ0) +C θ_α,θ∗ 0 θ0_=RJ α E (θ) = inf θ00αKL(θ 00_||_θ0_{) +}_C₍_θ00₎ _θ0_=RJ α E (θ) = min θ0 s.t.KL(θ0||θ)≤E inf θ00αKL(θ 00_||_θ0_{) +}_C₍_θ00₎_.

For convenience we introduce a shorthand notation for the minimizers

θα:= ΘJ

α

E (θ)

θ_α0 :=θ∗_α,θ0|_θ0_=ΘJ α

E (θ).

We compareα1≥0withEα1∗ (θ) :=KL(θα10 ||θα1)andα2≥0withE ∗

α2(θ) :=KL(θα20 ||θα2)and

assume thatE∗

α1(θ)<Eα2∗ (θ). We show that from this it follows thatα1> α2.

Proof. Asθ0_α1,θα1minimizeα1KL(θ 0_||_θ_{) +}_C₍_θ0₎_{we have} α1KL(θα01||θα1) +C(θ 0 α1)≤α1KL(θ 0 α2||θα2) +C(θ 0 α2) ⇒α1Eα1(θ) +C(θ 0 α1)≤α1Eα2(θ) +C(θ 0 α2)

and analogous forα2

α2KL(θα01||θα1) +C(θ 0 α1)≥α2KL(θ 0 α2||θα2) +C(θ 0 α2) ⇒α2Eα1(θ) +C(θ 0 α1)≥α2Eα2(θ) +C(θ 0 α2) WithEα1(θ)<Eα2(θ)we get α1≥ C(θ0_α1)−C(θ0_α2) Eα2(θ)− Eα1(θ) ≥α2.

We showed that from Eα1(θ) < Eα2(θ) it follows that α1 ≥ α2 which proofs that Eα(θ) is

monotonously decreasing inα.

Direct updates are myopic and do not take into account successive steps and are thus suboptimal when more than one update is needed. Smoothed updates on the other hand, as we see on theorem 1, anticipate a subsequent step and minimize the cost that results from this this two-step update. Hence smoothed updates favour a greater cost reduction in the future over maximal cost reduction in the current step. The strength of this anticipatory effect depends on the smoothing strength, which is controlled by the smoothing parameterα: For largeα, smoothing is weak and the sizeE0 _{of this}

anticipated second step becomes small. Fig. 2 B illustrates that for this case, whenE0_{becomes small,}

smoothed updates become more similar to direct updates. In the limiting caseα→ ∞the difference between smoothed and direct updates vanishes completely, asJα₍_θ₎_→_C₍_θ₎_{(see section C).}

We expect that also with multiple update steps due to this anticipatory effect, iterating smoothed updates leads to a faster decrease of the cost than iterating direct updates. We will confirm this by numerical studies. Furthermore, we expect that this accelerating effect of smoothing is stronger for smaller values ofα. On the other hand, as we will discuss in the next section, for smaller values ofα

it is harder to accurately perform the smoothed updates. Therefore we expect an optimal performance for an intermediate value ofα. Based on this we build an algorithm in the next section that aims to accelerate policy optimization by cost function smoothing.

(16)

E

Additional Theoretical Results for Section 4

E.1 Smoothed Updates for Small Update StepsE

We want to compute Eq. (16) for small E which corresponds to large β. Assuming a smooth dependence ofpuθ on θ, bounding KL(θ||θn) to a very small value allows us to do a Taylor

expansion which we truncate at second order: arg min θ0 Jα(θ0) +βKL(θ0||θn)≈ (50) ≈arg min θ0 (θ0−θn)T∇θ0Jα(θ0) + 1 2(θ 0₋_θ n)T(H+βF) (θ0−θn) (51) =θn−β−1F−1∇θ0Jα(θ0)|_θ0_=θ n+O(β −2₎ ₍₅₂₎ with H = ∇θ0∇T_θ0Jα(θ0) _θ0_=θ n F = ∇θ0∇T_θ0KL(θ0||θn) _θ0_=θ n.

See also [15]. We used thatE 1⇔β1. With this the Fisher informationF dominates over the HessianHand thus the Hessian does not appear anymore in the update equation. This defines a natural gradient update with stepsizeβ−1.

E.2 Inversion of the Fisher matrix

We compute an approximation to the natural gradientgf=F−1gby approximately solving the linear

equationF gf =gusing truncated conjugate gradient. With the normal gradientgand the Fisher

matrixF =∇θ∇TθKL(puθ||puθn)(see App. E.1).

We use an efficient way to compute the Fisher vector productF y[22] using an automated differentia-tion package: First for each rolloutiand timepointtthe symbolic expression for the gradient on the KL multiplied by a vectoryis computed:

ai,t(θn+1) = ∇T θn+1log πθn(a i t|t, xit) πθn+1(a i t|t, xit) y.

Then we take the second derivative on this scalar quantity, sum over all times and average over the samples. This gives then the Fisher vector

F y= 1 N N X i=1 X 0<t<T ∇θn+1ai,t(θn+1).

For practical reasons, we reverse the arguments of the KL, since it is easier to estimate it from samples drawn from the first argument. For very small values, theKLis approximately symmetric in its arguments. Also, the equality in (18) differs from [22], which optimizes a value function within the trust region, e.g.,KL(θn||θn+1)≤ E.

E.3 Proof for equivalence of weight entropy and KL-divergence We want to show that

lim

N→∞logN−HN(w) = limN→∞logN+ N

X

i=1

wilog(wi) =KL(p∗_α,θ||puθ).

Where the samplesiare drawn frompuθand thew

i_{are given by} wi= 1 PN i exp − 1 γ+αSpuθ(τ i₎ exp − 1 γ+αSpuθ(τ i₎ ,

(17)

We get lim N→∞logN+ N X i=1 wilog(wi) = = lim N→∞logN+ N X i=1 1 PN i exp − 1 γ+αS γ p_uθ(τi) exp − 1 γ+αS γ p_uθ(τ i₎ · ·log   1 PN i exp − 1 γ+αS γ p_uθ(τi) exp − 1 γ+αS γ p_uθ(τ i₎   = lim N→∞logN+ 1 N N X i=1 1 1 N PN i exp − 1 γ+αS γ p_uθ(τi) exp − 1 γ+αS γ p_uθ(τ i₎ · ·log   1 N 1 N PN i exp − 1 γ+αS γ p_uθ(τi) exp − 1 γ+αS γ p_uθ(τ i₎   = lim N→∞ 1 N N X i=1 1 1 N PN i exp − 1 γ+αS γ p_uθ(τi) exp − 1 γ+αS γ p_uθ(τ i₎ · ·log   1 1 N PN i exp − 1 γ+αS γ p_uθ(τi) exp − 1 γ+αS γ p_uθ(τi)  

Now we replace in the limitN → ∞,_N1 PN

i → hip_uθ: = * 1 D exp− 1 γ+αS γ p_uθ(τ) E p_uθ exp − 1 γ+αS γ p_uθ(τ) · ·log    1 D exp− 1 γ+αS γ p_uθ(τ) E p_uθ exp − 1 γ+αS γ p_uθ(τ)    + p_uθ

Using Eq. (12) this gives

= * log    1 D exp− 1 γ+αS γ p_uθ(τ) E p_uθ exp − 1 γ+αS γ p_uθ(τ)    + p∗ α,θ = * log    1 D exp− 1 γ+αS γ p_uθ(τ) E p_uθ exp − 1 γ+αS γ p_uθ(τ) puθ(τ) puθ(τ)    + p∗ α,θ = logp ∗ α,θ(τ) puθ(τ) p∗ α,θ =KL(p∗_α,θ||puθ).

E.4 The Smoothness Parameter∆is monotonic inα

Now we show that

(18)

is a monotonic function ofα. ∂ ∂αKL(p ∗ α,θ||puθ) = ∂ ∂α lnp ∗ α,θ puθ p∗ α,θ = ∂ ∂α _p∗ α,θ puθ lnp ∗ α,θ puθ p_uθ = _∂ ∂α p∗ α,θ puθ lnp ∗ α,θ puθ p_uθ + _p∗ α,θ puθ ∂ ∂αln p∗ α,θ puθ p_uθ = _∂ ∂α p∗_α,θ puθ lnp ∗ α,θ puθ p_uθ + ₁ puθ ∂ ∂αp ∗ α,θ p_uθ = _∂ ∂α p∗_α,θ puθ lnp ∗ α,θ puθ p_uθ + ∂ ∂αh1ip∗α,θ = ∂ ∂α p∗_α,θ puθ lnp ∗ α,θ puθ p_uθ .

Now let us look at

∂ ∂α p∗_α,θ puθ = ∂ ∂α 1 Zα p_uθ exp − 1 γ+αS γ p_uθ(τ) ! Z_pα uθ = exp − 1 γ+αS γ p_uθ(τ) p_uθ . we get ∂ ∂α p∗_α,θ puθ = 1 (γ+α)2S γ p_uθ(τ) p∗_α,θ puθ −p ∗ α,θ puθ 1 Zα p_uθ ∂ ∂αZ α p_uθ ∂ ∂αZ α p_uθ = * 1 (γ+α)2S γ p_uθ exp − 1 γ+αS γ p_uθ(τ) + p_uθ . and thus ∂ ∂α p∗_α,θ puθ = 1 (γ+α)2S γ p_uθ(τ) p∗_α,θ puθ −p ∗ α,θ puθ 1 (γ+α)2 D S_pγ uθ E p∗ α,θ = 1 (γ+α)2 p∗_α,θ puθ Spγ_uθ(τ)− D Spγ_uθ E p∗ α,θ . So finally we get ∂ ∂αKL(p ∗ α,θ||puθ) = 1 (γ+α)2 _p∗ α,θ puθ S_pγ uθ(τ)− D S_pγ uθ E p∗ α,θ lnp ∗ α,θ puθ p_uθ = 1 (γ+α)2 _p∗ α,θ puθ S_pγ uθ(τ)− D S_pγ uθ E p∗ α,θ − 1 γ+αS γ p_uθ(τ)−logZ α p_uθ p_uθ = 1 (γ+α)2 S_pγ uθ(τ)− D S_pγ uθ E p∗ α,θ − 1 γ+αS γ p_uθ(τ)−logZ α p_uθ p∗ α,θ =− 1 (γ+α)3 S_pγ uθ 2 p∗ α,θ −DS_pγ uθ E2 p∗ α,θ ! =− 1 (γ+α)3Var S_pγ uθ ≤0.

(19)

Algorithm 1ASPIC - Adaptive Smoothing of Path Integral Control Require: State cost functionV(x, t)

control cost parameterγ

base policy that defines uncontrolled dynamicsπ0

simulator of system dynamics with a parametrized policyπθ

trust region sizesE

smoothing strength∆ number of samplesN

initializeθ0

n= 0 repeat

draw samplesτi, withi= 1, . . . , N, from simulator controlled by parametrized policyπθn

for each sampleicomputeSγ

p_uθn(τi) = P 0<t<TV(x i t, t) +γlog πθn(ait|t,x i t) π0(ait|t,xit)

{Find minimalαsuch thatKL≤∆}

α←0 repeat

increaseα

Sαi ←Spγ_uθn(τi)·γ+α1

compute weightswi←exp(−Siα)

normalize weightswi← Pwi

i(wi)

compute sample size independent weight entropyKL←logN+P

iwilog(wi)

untilKL≤∆ {whiten the weigths}

ˆ

wi← wi−_stdmean_(w(wi)

i)

{compute the gradient on the smoothed cost}

g←P i P twˆi ∂θ∂ logπθ(ait|t, xit) _θ=θ n

{compute Fisher matrix}

use conjugate gradient descent to compute an approximate solution to the natural gradient

gF =F−1g(see App. E.2)

do line search to compute step sizeηsuchKL(θn||θn+1) =E.

update parametersθn+1←θn+η·gF

n=n+ 1 untilconvergence

Therefore

∆ =KL(p∗_α,θ||puθ)

is a monotonically decreasing function ofα.

F

Experimental Details and Additional Results

Algorithm 1 summarizes ASPIC. We first analyze the behavior of ASPIC in a simple linear-quadratic control problem, F.1,F.2. We then look at the dependence on the number of rollouts per iterationN

in F.3 and the interplay between smoothing strength∆and trust region sizeEin F.4. Finally, we describe the parameter settings for all tasks in F.5.

F.1 A Simple Linear-Quadratic Control Problem: Brownian Viapoints

We analyse the convergence speed for different values of the smoothing strength∆in the task of controlling a one-dimensional Brownian particle

˙

x=u(x, t) +ξ. (53)

We define the state cost as a quadratic penalty for deviating from the viapointsxiat the different

timesti:V(x, t) =Piδ(t−ti) (x−xi)2

(20)

0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 ∆ 0 200 400 600 800 1000 #Iterations A N=100 0 200 400 600 800 1000 Iterations 0 1 2 3 4 5 6 7 8 Costs 1e6 B

Figure 3: LQ control problem: Brownian viapoints. For each iteration we usedN = 100rollouts to compute the gradient. A) Number of iterations needed for the cost to cross a thresholdC≤2·104 versus the smoothing strength∆. For∆ = 0there is no smoothing. Increasing the smoothing strength results in a faster decrease of the cost; when∆is increased further the performance decreases again. Errorbars denote mean and standard deviation over10runs of the algorithm. B) Cost versus the iterations of the algorithm. Direct optimization of the cost exhibits a slower convergence rate than optimization of the smoothed cost with∆ = 0.2 log 100.

varying linear feedback controller, i.e.,uθ(x, t) =θ1,tx+θ0,t. This controller fulfils the requirement

of full parametrization for this task (see App. F.2). For further details of the numerical experiment see appendix F.5.

We apply ASPIC to this control problem and compare its performance for different sizes of the smoothing strength∆(see Fig. 3). The results confirm our expectations from our theoretical analysis. As predicted by theory we observe an acceleration of the policy optimization when smoothing is switched on. This acceleration becomes more pronounced when∆is increased, which we attribute to an increase of the anticipatory effect of the smoothed updates as smoothing becomes stronger (see section D). When∆is too large the performance of the algorithm deteriorates again, which is in line with our discussion of gradient estimation problems that arise for strong smoothing.

F.2 Full parametrization in LQ problem

Here we discuss why for a linear quadratic problem a time varying linear controller is a full parametrization. We want to show that for every

p∗α,θ0 = 1 Zpu0(τ) exp − 1 γ+αS γ p_uθ 0 (τ) (54)

there is a time varying linear controlleruθ∗

α,θ₀ such thatpuθ∗_α,θ

0

= p∗_α,θ0. We assume thatuθ0 is

a time varying linear controller. In App. B.1 we have shown thatu∗_α,θ

0 is the solution to the Path

Integral control problem with dynamics ˙ xt=f(xt, t) +g(xt, t) (˜u(xt, t) + ˆu(xt, t) +ξt) and cost * Z T 0 1 γV(xt, t)− 1 2 γ α˜u(xt, t) T_u_˜₍_x t, t)dt+ Z T 0 1 2uˆ(xt, t) T_u_ˆ₍_x t, t) + ˆu(xt, t)Tξt dt + puˆ , (55) withu˜=1−_γ+αγ uθ0(xt, t).

(21)

0 100 200 300 400 500 600 700 800 Iterations 100 0 100 200 300 400 500 Cost

DDD Performance of ASPIC in the Pendulum task

N=50 N=100 N=500

Figure 4: Performance as a function of the number of iterations for different values of N ∈ {50,100,500}in the Pendulum swing-up task. Dashed lines denote the solution for a total fixed budget of25K rollouts, i.e.,500,250, and50iterations, respectively. In this case,N= 50achieves near optimal performance whereas using larger values ofNleads to worse solutions.

It is now easy to see that ifuθ0is a time varying linear controller, thus a linear function of the state,

the cost is a quadratic function of the statex(note thatV(xt, t)is quadratic in the LQ case). Thus for

all values ofα,u∗

α,θ0is the solution to a linear quadratic control problem and thus a time varying

linear controller (see e.g. [14]). Therefore a time varying linear controller is a full parametrization.

F.3 Dependence on the Number of Rollouts per IterationN

We now analyze the dependence of the performance of ASPIC on the number of rollouts per iterationN. In general, using larger values ofN allows for more reliable gradient estimates and achieves convergence in fewer iterations. However, too largeN may be inefficient and lead to suboptimal solutions in the presence of a fixed budget of rollouts.

Figure 4 illustrates this trade-off in the Pendulum swing-up task for three values ofN. For a total budget of25K rollouts (dashed lines), the lowest value ofN = 50achieves near optimal performance and is preferable to the other choices, despite resulting in higher variance estimates and requiring more iterations until convergence. In particular, the solutions achieved usingN = 500have cost>350, while forN= 50, all solutions have cost<−50.

F.4 Interplay Between Smoothing Strength∆and Trust Region SizeE

To understand better the relation between the smoothing strength and the trust region sizes, we analyze empirically the performance of ASPIC as a function of both∆andEparameters. We focus on the Acrobot task and in the setting ofN = 500and intermediate smoothing strength, when smoothing is most beneficial.

Figure 5 shows the cost as a function of∆ and E averaged over the first 500 iterations of the algorithm, and for10different runs. Larger (averaged) costs correspond runs where the algorithm fails to converge. Conversely, the lower cost, the fastest the convergence. In general, larger values of

Elead to faster convergence. However, the convergence is less stable for smaller values of∆. For stronger smoothing, the algorithm is more sensitive toE.

F.5 Details of Numerical Experiments Linear-Quadratic control Task

Dynamics: The dynamics are ODEs integrated by an Euler scheme (see section F.1). The differential equation is initialized atx= 0.dt= 0.1

Control problem: γ = 1. Time-Horizon T = 10s. State-Cost function: see section F.1. (x0, t0) = (−10,1), (x1, t1) = (10,2),(x2, t2) = (−10,3), (x3, t3) = (−20,4),

(x4, t4) = (−100,5), (x5, t5) = (−50,6), (x6, t6) = (10,7), (x7, t7) = (20,8),

(22)

1.0e-03 1.7e-03 2.8e-03 4.6e-03 7.7e-03 1.3e-02 2.2e-02 3.6e-02 6.0e-02 1.0e-01 1.0e+00 6.0e-01 3.6e-01 2.2e-01 1.3e-01 7.7e-02 4.6e-02 2.8e-02 1.7e-02 1.0e-02

ASPIC in the Acrobot task

800 400 0 400 800

Figure 5: Solution cost as a function of the smoothing strength∆and the trust region sizeεin the Acrobot task. Shown is the cost averaged over the first500iterations of the algorithm, and for10 different runs. Blue indicates failure to convergence. White indicates the solutions which converged fastest.

Algorithm: Batchsize:N = 100.E = 0.1.∆ = 0.2 log 100. Conjugate gradient iterations: 2 (for each time step separately). The parametrized controller was initialized atθ= 0.

Pendulum Task

Dynamics: The differential equation for the pendulum is: ¨ x+cω0x˙+ω02sin(x) =λ(u+ξ) with • cω0= 0.1[s−1] • ω₀2= 10.[s−2] • λ= 0.2

We implemented this differential equation as a first order differential equation and integrated it with an Euler scheme withdt= 0.01. The pendulum is initialized resting at the bottom:

˙

x= 0, x= 0.

As a parametrized controller we use a time varying linear feedback controller:

uθ(x,x, t˙ ) =θ3,tcos(x) +θ2,tsin(x) +θ1,tx˙+θ0,t.

The parametrized controller was initialized atθ= 0.

Control-problem: γ= 1..T = 3.0s. The State-Cost function has End-Cost only:

V(x,x, t˙ ) =δ(t−T) −500Y + 10 ˙x2

withY =−cos(x)(height of tip). Variance of uncontrolled dynamicsν = 1

Algorithm: Batchsize:N = 500.E= 0.1.∆ = 0.5. The Fisher-matrix was inverted for each time step separately using the scipy pseudo-inverse with rcond=1e-4.

Acrobot Task

Dynamics: We use the definition of the acrobot as in [23]. The differential equations for the acrobot are:

d11(x)¨x1+d12(x)¨x2+h1(x,x˙) +φ1(x) = 0

(23)

with d11=m1l2c1+m2 l12+l2c2+ 2l1lc2cos(x2) +I1+I2 d12=m2 lc22 +l1lc2cos(x2) +I2 d21=d12 d22=m2l2c2+I2 h1=−m2l1lc2sin(x2) ˙x22+ 2 ˙x1x˙2 h2=m2l1lc2sin(x2) ˙x21 φ2=m2lc2Gcos (x1+x2) φ1= (m1lc1+m2l1)gcos (x1) +φ2

with the parameter values

• G= 9.8

• l1= 1.[m]

• l2= 2.[m]

• m1= 1.[kg] mass of link 1

• m2= 1.[kg] mass of link 2

• lc1= 0.5[m] position of the center of mass of link 1

• lc2= 1.0[m] position of the center of mass of link 2

• I1= 0.083moments of inertia for both links

• I2= 0.33moments of inertia for both links

• λ= 0.2

We implemented this differential equation as a first order differential equation and integrated it with an Euler scheme withdt= 0.01. The acrobot is initialized resting at the bottom:

˙

x1= 0,x˙2= 0, x1=−

1

2π, x2= 0.

As a parametrized controller we use a time varying linear feedback controller:

uθ(x,x, t˙ ) =θ8,tcos(x1) +θ7,tsin(x2) +θ6,tcos(x2) +θ5,tsin(x2)+

+θ4,tsin(x1+x2) +θ3,tcos(x1+x2) +θ2,tx˙1+θ1,tx˙2+θ0,t.

The parametrized controller was initialized atθ= 0.

Control-problem: γ= 1.. Time-Horizon:T = 3.0s. The State-Cost function has End-Cost only:

V(x,x, t˙ ) =δ(t−T) −500Y + 10( ˙x12+ ˙x22)

withY =−l1cos(x1)−l2cos(x1+x2)(height of tip). Variance of uncontrolled dynamics

ν= 1.

Algorithm: Batchsize:N = 500.E= 0.1.∆ = 0.5. The Fisher-matrix was inverted for each time step separately using the scipy pseudo-inverse with rcond=1e-4.

Walker

Dynamics: For dynamics and the state cost function we used "BipedalWalker-v2" from the OpenAI gym [3]. The policy was a Gaussian policy, with static varianceσ= 1. The state dependent mean of the Gaussian policy was a neural network controller with two hidden layers with 32 neurons, each. The activation function is a tanh. For the initialization we used Glorot Uniform (see [9]). The inputs to the neural network was the observation space provided by OpenAI gym task "BipedalWalker-v2": State consists of hull angle speed, angular velocity, horizontal speed, vertical speed, position of joints and joints angular speed, legs contact with ground, and 10 lidar rangefinder measurements.

Control-problem: γ= 0. Time-Horizon: defined by OpenAI gym task “BipedalWalker-v2”. State-Cost function defined by OpenAI gym task "BipedalWalker-v2": Reward is given for moving forward, total 300+ points up to the far end. If the robot falls, it gets -100. Applying motor torque costs a small amount of points, more optimal agent will get better score.