Discussion - Bayesian Learning for Data-Efficient Control

In the following chapters, we make additional assumptions on the system to be controlled. Like PILCO, we only consider continuous-state, continuous-control, discrete-time, finite-time-horizon control tasks. We additionally consider continuous observations y. We also use a belief state b in (2.60), as found in the POMDP literature: a sufficient statistic of the probability of the system being in some state x (from the robot’s or an observer’s point of view) given all previous observations and control outputs. Beliefs thus exist in the space of Gaussian probability distributions over X-dimensional spaces: N^X. The belief’s compact representation is only possible given the Markov property of the state x. We therefore use this sufficient statistic as input to a controller. However, we simplify by inputting only the expected value of the belief distribution, an approximation which we discuss in the next chapter.

Compared to the general set of control equations above, we only consider a restricted set of equations to describe our system and controller:

Restricted equations of control we consider in Chapters 3–5:

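\begin{align}
x_0 &= \epsilon^x_0 \in \mathbb{R}^X, & \epsilon^x_0 &\overset{iid}{\sim} \mathcal{N}(\mu^x_0, \Sigma^x_0), \tag{2.57} \\
x_{t+1} &= f(x_t, u_t) + \epsilon^x_t \in \mathbb{R}^X, & \epsilon^x_t &\overset{iid}{\sim} \mathcal{N}(0, \Sigma^\epsilon_x), \tag{2.58} \\
y_t &= x_t + \epsilon^y_t \in \mathbb{R}^X, & \epsilon^y_t &\overset{iid}{\sim} \mathcal{N}(0, \Sigma^\epsilon_y), \tag{2.59} \\
b_{t|t} &= p(x_t \mid y_{0:t}, u_{0:t-1}) \in \mathcal{N}^X, \tag{2.60} \\
u_t &= \pi\big(\mathbb{E}_{b_{t|t}}[b_{t|t}]\big) \in \mathbb{R}^U, \tag{2.61} \\
c_t &= \mathrm{cost}(x_t) \in [0, 1], \tag{2.62} \\
J_t &= \sum_{\tau=t}^{T} \gamma^{\tau-t}\, \mathbb{E}[c_\tau] \in [0, T-t+1]. \tag{2.63}
\end{align}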
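As a small illustration of (2.62) and (2.63), the sketch below evaluates a bounded cost and the discounted loss J_t from a sequence of expected costs. The saturating cost form, its `target` and `width` parameters, and both function names are assumptions made for this example only; the equations above require just that the cost lies in [0, 1].

```python
import numpy as np

def saturating_cost(x, target, width=1.0):
    """Hypothetical cost in [0, 1]: zero at the target, approaching 1 far from it."""
    diff = np.asarray(x, dtype=float) - np.asarray(target, dtype=float)
    return float(1.0 - np.exp(-0.5 * np.sum(diff ** 2) / width ** 2))

def discounted_loss(expected_costs, gamma=1.0):
    """J_t = sum_{tau=t}^{T} gamma^(tau - t) * E[c_tau], given E[c_t], ..., E[c_T]."""
    costs = np.asarray(expected_costs, dtype=float)
    discounts = gamma ** np.arange(len(costs))  # gamma^0, gamma^1, ...
    return float(np.sum(discounts * costs))
```

With costs in [0, 1] and a discount γ ≤ 1, each of the T − t + 1 terms is at most 1, which is where the bound J_t ∈ [0, T − t + 1] in (2.63) comes from.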

We now discuss the assumptions behind the restricted set of control equations (2.57)–(2.63), compared to the general control equations (2.51)–(2.56). We list the assumptions below in order of decreasing severity.

1. Our most severe and limiting assumption is perhaps the restrictive form of our observation function in (2.59): simply the latent state x with additive Gaussian observation noise ε^y_t, compared to (2.53). The only unknown is the observation noise variance Σ^ε_y, which must be learned from data. By assuming this simple and known structure in the observation function, the burden is placed on the user to 1) define the state variables within vector x that the robot needs to track, and 2) preprocess sensory data to measure each state variable. Also, the assumption that the observation noise ε^y_t is independent of state x and control u does not always hold. For example, a camera’s precision in measuring an object’s position could depend largely on the object’s velocity, due to blurring effects.

2. A medium-level limitation is that the controller is a function of the belief mean only, (2.61). It could also condition on the belief variance, or on other statistics of the belief distribution. Adapting a controller to condition on variance is not difficult, but not trivial either: it involves additional chain-rule terms in the controller optimisation process. How detrimental such a restriction is remains unclear to the authors. However, as seen later in § 3.3.6, the optimal control of an inverted pendulum does depend on how certain the robot is about the state. Lower state uncertainty allows the robot to be aggressive, applying high controller gains to stabilise the system quickly, whilst higher state uncertainty warrants more cautious controllers that are slower to react.

3. A weaker assumption concerns our dynamics model in (2.58). We focus on the fairly general setting of unknown dynamics f that are nonlinear w.r.t. both inputs x and u, as does (2.52), except with additive process noise ε^x_t. As discussed, nonlinear dynamics make controller design more difficult, and we make no prior assumptions in our modelling of the latent dynamics f except for function smoothness (on some unknown scale) and time-invariance. Additive Gaussian process noise ε^x_t that is independent of the state and control, whilst not general, will suit any weakly stochastic system and will arguably approximate most stochastic systems well.

4. We also make some weak assumptions for the sake of simplicity, which are not difficult to avoid. For example, our time-invariant controller (2.61) is easily changed into a time-variant controller (2.54) since ‘time is known in advance’. Such a controller would be implemented with a set of time-dependent controller parameters to optimise, which only requires a slightly more complex chain rule expression than (2.50)–(2.50).

5. Another weak assumption is that the cost function we consider, (2.62), is a function of the state x only, but it is trivial to generalise it to (2.55), a function of the control u, time t and new state x_{t+1} as well. Note that, as previously discussed, optimal control is invariant to affine transformations of the cost function, so a cost function bounded in [0, 1] is still very general.
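The invariance claim in item 5 can be checked in a few lines. The derivation below is a sketch in the notation of (2.63); the scaling a and offset b are symbols introduced here, and the restriction a > 0 is an assumption added so that minimisation is preserved.

```latex
\begin{align}
c'_\tau &= a\,c_\tau + b, \qquad a > 0, \\
J'_t &= \sum_{\tau=t}^{T} \gamma^{\tau-t}\,\mathbb{E}[a\,c_\tau + b]
      = a\,J_t + b \sum_{\tau=t}^{T} \gamma^{\tau-t}, \\
\arg\min_\pi J'_t &= \arg\min_\pi J_t .
\end{align}
```

The second term of J'_t does not depend on the controller π, and a positive factor a does not change which controller minimises the loss. Any cost bounded in some interval [c_min, c_max] can therefore be rescaled to [0, 1] using a = 1/(c_max − c_min) and b = −c_min/(c_max − c_min) without changing the optimal controller.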

CHAPTER 3 LEARNING CONTROL WITH A FILTER

PILCO is an RL algorithm (discussed in § 2.5) which uses GPs to learn a model of the system dynamics over continuous states. The method has been shown to be highly data-efficient, in the sense that it can learn with only very few interactions with the real system. However, a serious limitation of PILCO is that it assumes the observation noise level is small. Two main reasons make this assumption necessary. Firstly, the dynamics are learnt from the noisy observations (i.e. incorrectly modelling a non-Markov process f : y_t × u_t → y_{t+1} as if it were Markov). Learning the dynamics model in this way does not correctly account for the noise in the observations (the true dynamics are f : x_t × u_t → x_{t+1}). Only if the observation noise ε^y_t is small are the observations y_t = x_t + ε^y_t ≈ x_t good approximations of the input to the real dynamics function. Secondly, PILCO uses the noisy observation directly to calculate the control, u_t = π(y_t) = π(x_t + ε^y_t), which is problematic if the observation noise ε^y_t is substantial. Imagine a controller π controlling an unstable system, where high-gain feedback is necessary for good performance. Observation noise is amplified when the noisy input is fed directly to the high-gain controller, which in turn injects noise back into the state, creating cycles of increasing variance and instability.
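The first problem can be illustrated on a toy system. The sketch below is not the GP model used by PILCO: it assumes a hypothetical scalar linear system x_{t+1} = a x_t + ε^x_t with no control input, and fits the slope a by least squares, once from the latent states and once from noisy observations. All parameter values and the function name are chosen for illustration only.

```python
import numpy as np

def fitted_slopes(a=0.9, q=0.1, r=0.5, steps=20000, seed=0):
    """Least-squares slope of next-vs-current value, from states and from observations.

    a: true dynamics slope, q: process noise std, r: observation noise std
    (all hypothetical values for this toy example).
    """
    rng = np.random.default_rng(seed)
    x = np.zeros(steps)
    for t in range(steps - 1):
        x[t + 1] = a * x[t] + rng.normal(0.0, q)   # latent dynamics, cf. (2.58)
    y = x + rng.normal(0.0, r, size=steps)         # noisy observations, cf. (2.59)

    def slope(inp, out):
        # Zero-intercept least squares: sum(inp * out) / sum(inp^2).
        return float(inp @ out / (inp @ inp))

    return slope(x[:-1], x[1:]), slope(y[:-1], y[1:])

clean_slope, noisy_slope = fitted_slopes()
```

Here `clean_slope` recovers a value close to a, whereas `noisy_slope` is shrunk towards zero, because noise on the regression input inflates the input variance without adding any predictive signal. The model fitted to observations is thus a biased estimate of the latent dynamics, and the bias grows with the observation noise.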

In this chapter we extend PILCO to address these two shortcomings, enabling PILCO to be used in situations with substantial observation noise. The first issue is addressed using the so-called direct method for training the dynamics model, explained in § 3.3.2. The second problem can be tackled by filtering the observations. One way to look at this is that PILCO does planning in observation space, rather than in belief space. We extend PILCO to allow filtering of the observations, combining the previous belief-state distribution with the dynamics model and the observation using Bayes’ rule, in order to plan in belief space. Note that this is easily done when the controller is being applied, but to gain the full benefit of a filter we also have to take the filter into account when simulating and evaluating the controller.
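To make the filtering idea concrete, the sketch below compares a high-gain controller fed with raw observations against the same controller fed with a filtered belief mean. It is a minimal illustration under strong assumptions that differ from the chapter's setting: the dynamics are scalar, linear and known (x_{t+1} = a x_t + u_t + ε^x_t), so the Bayes update reduces to a standard Kalman filter. The parameter values and the function name are hypothetical.

```python
import numpy as np

def closed_loop_variance(use_filter, a=1.2, k=1.1, q=0.05, r=0.5,
                         steps=5000, seed=0):
    """Empirical state variance under u_t = -k * (belief mean or raw observation)."""
    rng = np.random.default_rng(seed)
    x = rng.normal(0.0, 1.0)      # initial state drawn from the prior N(0, 1)
    m, v = 0.0, 1.0               # belief N(m, v) before seeing an observation
    states = []
    for _ in range(steps):
        y = x + rng.normal(0.0, r)            # noisy observation, cf. (2.59)
        # Bayes update: combine the predicted belief with the observation.
        gain = v / (v + r ** 2)
        m = m + gain * (y - m)
        v = (1.0 - gain) * v
        u = -k * (m if use_filter else y)     # controller input: mean or raw y
        states.append(x)
        x = a * x + u + rng.normal(0.0, q)    # true (unstable) dynamics
        # Prediction step: push the belief through the known dynamics.
        m = a * m + u
        v = a ** 2 * v + q ** 2
    return float(np.var(states[100:]))        # discard the initial transient

raw_variance = closed_loop_variance(use_filter=False)
filtered_variance = closed_loop_variance(use_filter=True)
```

In this setup the filtered controller yields a lower state variance than the unfiltered one, since the gain k multiplies the smaller belief error rather than the full observation noise. This covers only applying the filter during control; anticipating the filter while simulating and evaluating the controller is a separate step.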

PILCO trains its controller by minimising the predicted loss when simulating the system and controller. Since the dynamics are not known exactly, the simulation in PILCO had to simulate distributions of possible trajectories of the physical state of the system. This was achieved using an analytical approximation based on moment-matching and Gaussian state distributions. In this chapter we augment the simulation over physical states to also include the state of the filter, an information state or belief state. This is complicated by the fact that our belief state is itself a probability distribution, so we now have to simulate distributions over distributions. Doing so allows the algorithm both to apply filtering during control and to anticipate the effect of filtering during training, thereby learning a better controller.
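Moment-matching can be illustrated on a function whose output moments are available in closed form. The sketch below assumes a hypothetical scalar map sin(x) in place of a learnt GP dynamics model: for x ~ N(μ, σ²) it computes the exact mean and variance of sin(x), which define the Gaussian approximation of the output, and compares them with a Monte Carlo estimate. The function names and test values are illustrative only.

```python
import numpy as np

def moment_match_sin(mu, var):
    """Exact mean and variance of sin(x) for x ~ N(mu, var)."""
    mean = np.exp(-0.5 * var) * np.sin(mu)
    # E[sin^2 x] = (1 - E[cos 2x]) / 2, with E[cos 2x] = exp(-2 var) * cos(2 mu).
    second_moment = 0.5 * (1.0 - np.exp(-2.0 * var) * np.cos(2.0 * mu))
    return float(mean), float(second_moment - mean ** 2)

def monte_carlo_sin(mu, var, n=200_000, seed=0):
    """Sampling-based estimate of the same two moments, for comparison."""
    rng = np.random.default_rng(seed)
    samples = np.sin(rng.normal(mu, np.sqrt(var), size=n))
    return float(samples.mean()), float(samples.var())

analytic = moment_match_sin(0.5, 0.3)
sampled = monte_carlo_sin(0.5, 0.3)
```

The true output distribution is not Gaussian, but replacing it with the Gaussian that has these two moments keeps every step of a simulated trajectory in closed form. That is the role the approximation plays in the paragraph above.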

We first explore the undesirable effects of noisy observations in § 3.1, before discussing how filtering helps mitigate such effects in § 3.2. In § 3.3 we discuss how we extend the PILCO framework to a restricted form of POMDPs, planning in belief space so as to include filtering. We show experimental results in which the proposed algorithm handles observation noise better than competing algorithms. One assumption is that we observe noisy versions of the state variables. We do not handle more general cases in which other unobserved states are also learnt, nor do we learn any mapping from the state space to observations other than additive Gaussian noise. We also introduce the Direct method (§ 3.3.2) to train a dynamics model from noisy observations. Thereafter, in § 3.4, we discuss a more general version of learning in the presence of observation noise, closing with some additional topics and conclusions.

3.1 CONSEQUENCES OF UNFILTERED OBSERVATIONS

Previously, we saw PILCO model a noiseless control process using a Probabilistic Graphical Model (PGM), Fig. 2.1 on page 37. A PGM depicts conditional-dependency relationships between random control variables (distinguished as capitals), which is useful for simulating the system forwards in time during controller evaluation, since the values of future control variables are currently uncertain (due to system stochasticities and subjective uncertainty about the dynamics). A PGM also makes clear which variables are observed by the robot (highlighted grey) and which are unobserved by the robot (termed hidden or latent, highlighted white).

