Directed Exploration - Bayesian Learning for Data-Efficient Control

4.2.1 Bayesian Reinforcement Learning

The total loss in (4.3) comprises E-many episodes, between which we have the opportunity to update the controller’s parameterisation according to data collected from the preceding episode. We wish to optimise J w.r.t. our subjective belief p(f), noting that p(f) changes given new data. To capture how to optimise J given changing p(f), an RL solution proposed by Duff (2002) is to define partially-observed hyperstates h, which concatenate the original state x with partially-observed transition model parameters. Doing so frames the reinforcement learning problem in x-space as a POMDP (see § 2.2.2) planning problem in h-space. The resultant ‘Bayes-optimal’ controller optimises (4.3) w.r.t. both the present model p(f) and, importantly, knowledge of how the model updates given new data. Given knowledge of how the model updates, the POMDP solution can estimate the expected value of information of any type of exploratory behaviour, naturally trading off exploration and exploitation in the original x state space. The Bayes-optimal solution is ‘optimal’ in the sense of optimising the average total loss, averaged over many experiments that re-sample the dynamics parameters from the belief prior p(f) at e = 0. Unfortunately, since POMDPs are generally intractable to solve exactly (Mundhenk et al., 1997; Papadimitriou and Tsitsiklis, 1987), so too are Bayes-optimal controllers. This is especially true in continuous state-action-observation spaces and, in PILCO’s case, when using a nonparametric dynamics model. Approximations exist, however, which we now discuss.
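The hyperstate construction is easiest to see in a problem small enough to solve exactly. The sketch below is an illustrative toy, not part of the method described here: it assumes a two-armed bandit whose arms incur a unit loss with unknown probability, a Beta belief over each probability, and a hyperstate made of the Beta counts (there is no physical state x). The function names and the prior are hypothetical choices for illustration.

```python
from fractions import Fraction
from functools import lru_cache


@lru_cache(maxsize=None)
def bayes_optimal_loss(hyperstate, steps_left):
    """Minimum expected total loss from `hyperstate` with `steps_left` pulls left.

    `hyperstate` is a tuple of (a, b) Beta counts, one pair per arm: a - 1 unit
    losses and b - 1 zero losses have been observed on that arm so far.
    """
    if steps_left == 0:
        return Fraction(0)
    best = None
    for arm, (a, b) in enumerate(hyperstate):
        p_loss = Fraction(a, a + b)  # posterior predictive probability of a unit loss
        # Beliefs after each possible observation: the planner uses its
        # knowledge of how the model will update given new data.
        after_loss = hyperstate[:arm] + ((a + 1, b),) + hyperstate[arm + 1:]
        after_zero = hyperstate[:arm] + ((a, b + 1),) + hyperstate[arm + 1:]
        expected = (p_loss * (1 + bayes_optimal_loss(after_loss, steps_left - 1))
                    + (1 - p_loss) * bayes_optimal_loss(after_zero, steps_left - 1))
        if best is None or expected < best:
            best = expected
    return best


def fixed_belief_loss(hyperstate, steps):
    """Expected total loss of committing to the arm that looks best right now."""
    return steps * min(Fraction(a, a + b) for a, b in hyperstate)


prior = ((1, 1), (1, 1))  # uniform Beta(1, 1) belief on each arm (assumed prior)
print(bayes_optimal_loss(prior, 2))  # plans through future belief updates
print(fixed_belief_loss(prior, 2))   # treats the current belief as fixed
```

With two pulls and a uniform prior, planning in hyperstate space gives an expected loss of 11/12, against 1 when the belief is treated as fixed. The gap is the value of the information gained from the first pull. The recursion visits every reachable belief, which is why the same approach becomes intractable for continuous states and nonparametric models.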

Bayesian Reinforcement Learning (BRL) algorithms model the dynamics and/or the rewards, or value function, to approximate the (intractable) Bayes-optimal controller. Many approximate approaches exist, including myopic belief lookahead (which considers how the belief updates over at most one episode into the future) and deeper lookahead. Myopic lookahead methods apply a heuristic function (or ‘exploration bonus’) to the robot’s current uncertainties, or to its predicted uncertainties by the next episode, an improvement on simpler methods such as Boltzmann exploration (Meuleau and Bourgine, 1999). Such heuristics include sums of cost standard deviations (Deisenroth, 2009), Shannon entropy (Deisenroth et al., 2008), and variants of the expected-improvement heuristic (Dearden et al., 1998; Delage and Mannor, 2007). For simplicity, myopic exploration often ignores loss correlations between different control decisions (or in our case between choosing controllers for the next episode), which are in fact correlated via the Bellman equation (2.8). Another method is to model loss distributions using GPs (Engel et al., 2003, 2005) and explore according to upper confidence bounds. This is suitable for small state spaces and smooth loss functions, and is unfortunately inapplicable to PILCO.

Deeper (non-myopic) lookahead using sparsely sampled trees is also possible (Kearns et al., 2002; Ross et al., 2008; Walsh et al., 2010; Wang et al., 2005). Alternatives to sampling include discretisation of the belief space (Wang et al., 2012). However, this is only feasible up to 3–4 dimensions. The discretisation must be carefully tuned: if the cells are too large the robot will not learn anything, and if they are too small the learning will be slow. If the dynamics model is easily sampled, then following an optimal controller of a sample from the model for several timesteps, called Thompson sampling (Thompson, 1933), is an effective solution still popular in RL (Asmuth et al., 2009; Gal and Ghahramani, 2015; Osband et al., 2016; Stadie et al., 2015; Strens, 2000). Thompson sampling generally outperforms ε-greedy and Boltzmann exploration since sampling from the uncertain model posterior corresponds to uncertainty-directed exploration.
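As a minimal sketch of Thompson sampling, consider a simplified setting that is an assumption of this example rather than anything used in the text: a bandit whose arms incur a unit loss with unknown probability, with a Beta posterior per arm. Here the ‘model’ is the vector of loss probabilities and the ‘optimal controller of a sample’ is simply the arm with the lowest sampled loss probability. The function name and the true probabilities are hypothetical.

```python
import numpy as np


def thompson_sampling(true_loss_probs, n_rounds, seed=0):
    """Run Beta-Bernoulli Thompson sampling and return the pull count per arm."""
    rng = np.random.default_rng(seed)
    n_arms = len(true_loss_probs)
    a = np.ones(n_arms)  # Beta counts for observed unit losses (plus prior of 1)
    b = np.ones(n_arms)  # Beta counts for observed zero losses (plus prior of 1)
    pulls = np.zeros(n_arms, dtype=int)
    for _ in range(n_rounds):
        sampled_model = rng.beta(a, b)       # one plausible model from the posterior
        arm = int(np.argmin(sampled_model))  # act optimally for that sampled model
        loss = float(rng.random() < true_loss_probs[arm])  # observe the real system
        a[arm] += loss                       # update the posterior with the new datum
        b[arm] += 1.0 - loss
        pulls[arm] += 1
    return pulls


pulls = thompson_sampling([0.9, 0.1], n_rounds=500)
print(pulls)  # pull counts per arm
```

Arms with wide posteriors are occasionally sampled as the best arm and therefore tried, while arms known to be poor are rarely selected. This is the sense in which sampling from the posterior directs exploration by uncertainty, although the exploration remains stochastic.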

We can think of PILCO (§ 2.5) as another BRL algorithm, which assumes fully observable states x and a partially observable transition model p(f). PILCO approximates Bayes-optimal control by assuming no observation function associated with the transition model (i.e. assuming its current uncertainty over transition models is fixed) and is a pure exploitation RL algorithm. However, even though PILCO does not intentionally explore the probabilistic trajectories, the saturating cost function has the indirect effect of favouring more uncertain policies when the expected cumulative cost is poor (Deisenroth et al., 2015). This effect, together with system randomness (observation noise $\epsilon^y_t$ and process noise $\epsilon^x_t$), usually ensures PILCO visits enough of the state space ‘accidentally’ to learn enough dynamics to optimise a reasonable controller. Despite exploring only incidentally, PILCO achieved unprecedented data efficiency in the cart double-pole swing-up problem. As mentioned before, the key to PILCO’s success is the use of its probabilistic nonparametric dynamics model. The probabilistic nonparametric model enables predictions that capture model uncertainty (distinct from state uncertainty) under arbitrarily complex models, which is helpful for uncertainty-directed exploration. Earlier work by Deisenroth (2009, section 3.7.1) optimises the sum of the individual cost means and (weighted) cost standard deviations, as an approximate myopic BRL extension of PILCO. However, as that work states, the sum of individual cost standard deviations is an approximation to using the full cumulative-cost variance, which should include many cross terms, discussed in § 4.4.

We build on this BRL extension of PILCO to take PILCO from a pure exploitation algorithm to one that balances exploration and exploitation to achieve even greater data efficiency than before. PILCO greedily optimises an expected cumulative cost of the states, using a probabilistic dynamics model to predict distributions of future states. Since Gaussian process models can be approximately sampled, a Thompson sampling approach could be employed for stochastic directed exploration. However, given that PILCO’s probabilistic dynamics model can compute full cumulative-cost distributions (approximated as Gaussian) of controllers analytically, deterministic directed exploration is possible. Deterministic exploration is preferred for data-efficient learning in single-agent tasks since the specific actions which optimise the expected value of information gained are not random. Unlike Deisenroth (2009), who computes the marginal cost distributions per timestep, we compute the full joint covariance of all timesteps’ costs in an episode, required for computing the loss variance (cumulative-cost variance). The cross terms in the joint cost covariance matrix typically contribute between 40% and 85% of the cumulative-cost variance, and thus must be included to avoid significantly underestimating it. We use the additional cumulative-cost variance information for uncertainty-directed exploration. Unlike PILCO, and as other myopic BRL algorithms do, we direct exploration by evaluating the value of information from the uncertainty in the cumulative-cost function. Later, in § 4.5, we discuss a second and better BRL extension of PILCO, also myopic, which considers how the dynamics model might change in response to future data we might see, giving us a more accurate estimate of the value of information.
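The role of the cross terms follows from the identity Var(Σ_t c_t) = Σ_i Σ_j Cov(c_i, c_j): the variance of a sum is the sum of every entry of the joint covariance matrix, not only its diagonal. The sketch below illustrates this with a made-up covariance (unit marginal variances and geometrically decaying correlation between timesteps). The matrix, horizon, and function names are assumptions for illustration and are not taken from PILCO’s actual cost predictions.

```python
import numpy as np


def cumulative_cost_variance(cost_cov):
    """Variance of the summed per-timestep costs, given their joint covariance."""
    ones = np.ones(cost_cov.shape[0])
    return float(ones @ cost_cov @ ones)  # sums every entry, cross terms included


def cross_term_fraction(cost_cov):
    """Share of the cumulative-cost variance contributed by off-diagonal terms."""
    total = cumulative_cost_variance(cost_cov)
    marginal_only = float(np.trace(cost_cov))  # what per-timestep marginals capture
    return 1.0 - marginal_only / total


# Hypothetical joint covariance: unit variances, correlation rho ** |i - j|.
T, rho = 10, 0.5
t = np.arange(T)
cost_cov = rho ** np.abs(t[:, None] - t[None, :])

print(cumulative_cost_variance(cost_cov))  # full variance of the cumulative cost
print(np.trace(cost_cov))                  # variance if cross terms are dropped
print(cross_term_fraction(cost_cov))       # fraction owed to the cross terms
```

When costs at nearby timesteps are positively correlated, as they tend to be along a smooth trajectory, dropping the off-diagonal entries underestimates the cumulative-cost variance, which is the effect described above.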

4.2.2 Probably Approximately Correct (PAC) Learning

An alternative definition of ‘data efficiency’ to minimising cumulative regret (4.3) is minimising the number of episodes in which the robot’s expected performance fails to be within a specified tolerance of the optimal loss. Algorithms that do so are called PAC-MDP (Probably Approximately Correct in Markov Decision Processes), discovering ‘near-optimal’ controllers with high probability within time polynomial in the number of states and controls (Even-Dar et al., 2002). Example PAC-MDP algorithms include R-max (Brafman and Tennenholtz, 2003), E³ (Kearns and Singh, 2002), and Delayed Q-Learning (Strehl et al., 2006). PAC-MDP methods provide powerful probabilistic guarantees on the data complexity required before asymptotic convergence to an optimal controller. However, such strong guarantees come at the price of over-exploration (Delage and Mannor, 2007; Kolter and Ng, 2009). PAC-MDP algorithms either systematically explore the complete state space, or follow the principle of optimism under uncertainty (e.g. upper confidence bounds).

To be consistently optimistic under uncertainty is to overvalue exploration, and thus over-explore the state space. For example, when priors over different transitions are artificially biased (‘optimistic’), a controller tends to explore less-visited states even when Bayes-optimal controllers under the same prior would deem the cost of doing so too high. The PAC-MDP formulation disregards costs incurred in the short term, concentrating instead on discovery of near-optimal controllers in the long term. In doing so, PAC-MDP methods solve a different formulation of data efficiency, ensuring long-term controller near-optimality instead of minimising the expected cumulative cost. Indeed, Bayes-optimal controllers are not PAC-MDP (Kolter and Ng, 2009). As such, the PAC-MDP formulation is undesirable when the system always incurs real-world cost to interact with (regardless of whether the robot is exploring or exploiting). Indeed, many authors are willing to trade the probabilistic guarantees that PAC-MDP methods provide for practical performance gains (Jung and Stone, 2010). Similarly, any other methods that direct exploration by reducing dynamics uncertainty also incur data inefficiencies. The robot should not waste time modelling aspects of the system that are unlikely to help it achieve its goal. Since the goal of RL is optimisation of the expected cumulative cost (or reward), exploratory actions should seek to reduce uncertainty of the objective (cumulative cost) only. By evaluating exploration according to the objective distribution, we evaluate exploration in the same ‘units’ as we evaluate exploitation, making the trade-off between the two more straightforward.

4.3 Bayesian Optimisation

By simulating our system using our probabilistic dynamics model, it is possible to compute the cumulative-cost distribution given data up until the current episode e (exclusive): $C^\psi_e \sim \mathcal{N}(\mu^C_e, \Sigma^C_e)$. We have not yet discussed how to compute the distribution of $C^\psi$, which we leave for § 4.4, but for the moment assume it is computable. In this section, we discuss how to direct exploration using the full distribution of $C^\psi$ with Bayesian Optimisation (BO) methods. We change the controller evaluation step (see Algorithm 1, line 6). No longer do we optimise the mean cumulative cost $\mu^C_e$ as PILCO did, but instead optimise a function of the mean and variance: $\mathrm{BO}(\mu^C_e, \Sigma^C_e)$.
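To make the change in the evaluation step concrete, the sketch below compares controllers by a function of the predicted mean and variance rather than by the mean alone. It assumes one common BO heuristic, a lower confidence bound on the loss, purely as an example; the text up to this point does not say which function BO denotes, and the weight `beta`, the candidate values, and the function name are hypothetical.

```python
import numpy as np


def lower_confidence_bound(mean, variance, beta=1.0):
    """Optimistic loss estimate: the mean minus `beta` standard deviations."""
    return mean - beta * np.sqrt(variance)


# Hypothetical predicted cumulative-cost means and variances for three
# candidate controllers (illustrative numbers only).
means = np.array([10.0, 11.0, 12.0])
variances = np.array([0.25, 9.0, 1.0])

# Pure exploitation: pick the lowest predicted mean cumulative cost.
greedy_choice = int(np.argmin(means))

# Directed exploration: pick the lowest value of a mean-and-variance function.
scores = lower_confidence_bound(means, variances, beta=1.0)
directed_choice = int(np.argmin(scores))

print(greedy_choice, directed_choice)
```

In this made-up case the mean-only criterion selects the first controller, whereas the confidence-bound criterion prefers the second, whose cumulative cost is uncertain enough that it may turn out to be lower. Setting `beta` to zero recovers the mean-only criterion.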

Bayesian optimisation is the problem of optimising an unknown function, often expensive to evaluate and without gradient information, through frugal successive samples of data. Good introductions are Brochu et al. (2010), Shahriari et al.
