Gaussian Process Training with Input Noise
Andrew McHutchon Department of Engineering
Cambridge University Cambridge, CB2 1PZ [email protected]
Carl Edward Rasmussen Department of Engineering
Cambridge University Cambridge, CB2 1PZ [email protected]
Abstract
In standard Gaussian Process regression input locations are assumed to be noise free. We present a simple yet effective GP model for training on input points cor- rupted by i.i.d. Gaussian noise. To make computations tractable we use a local linear expansion about each input point. This allows the input noise to be recast as output noise proportional to the squared gradient of the GP posterior mean.
The input noise variances are inferred from the data as extra hyperparameters.
They are trained alongside other hyperparameters by the usual method of max- imisation of the marginal likelihood. Training uses an iterative scheme, which alternates between optimising the hyperparameters and calculating the posterior gradient. Analytic predictive moments can then be found for Gaussian distributed test points. We compare our model to others over a range of different regression problems and show that it improves over current methods.
1 Introduction
Over the last decade the use of Gaussian Processes (GPs) as non-parametric regression models has grown significantly. They have been successfully used to learn mappings between inputs and outputs in a wide variety of tasks. However, many authors have highlighted a limitation in the way GPs handle noisy measurements. Standard GP regression [1] makes two assumptions about the noise in datasets: firstly that measurements of input points, x, are noise-free, and, secondly, that output points, y, are corrupted by constant-variance Gaussian noise. For some datasets this makes intuitive sense: for example, an application in Rasmussen and Williams (2006) [1] is that of modelling CO 2
concentration in the atmosphere over the last forty years. One can viably assume that the date is available noise-free and the CO 2 sensors are affected by signal-independent sensor noise.
However, in many datasets, either or both of these assumptions are not valid and lead to poor mod- elling performance. In this paper we look at datasets where the input measurements, as well as the output, are corrupted by noise. Unfortunately, in the GP framework, considering each input location to be a distribution is intractable. If, as an approximation, we treat the input measurements as if they were deterministic, and inflate the corresponding output variance to compensate, this leads to the output noise variance varying across the input space, a feature often called heteroscedasticity. One method for modelling datasets with input noise is, therefore, to hold the input measurements to be deterministic and then use a heteroscedastic GP model. This approach has been strengthened by the breadth of research published recently on extending GPs to heteroscedastic data.
However, referring the input noise to the output in this way results in heteroscedasticity with a very particular structure. This structure can be exploited to improve upon current heteroscedastic GP models for datasets with input noise. One can imagine that in regions where a process is changing its output value rapidly, corrupted input measurements will have a much greater effect than in regions
Pre-conference version
where the output is almost constant. In other words, the effect of the input noise is related to the gradient of the function mapping input to output. This is the intuition behind the model we propose in this paper.
We fit a local linear model to the GP posterior mean about each training point. The input noise vari- ance can then be referred to the output, proportional to the square of the posterior mean function’s gradient. This approach is particularly powerful in the case of time-series data where the output at time t becomes the input at time t + 1. In this situation, input measurements are clearly not noise-free: the noise on a particular measurement is the same whether it is considered an input or output. By also assuming the inputs are noisy, our model is better able to fit datasets of this type.
Furthermore, we can estimate the noise variance on each input dimension, which is often very useful for analysis.
Related work lies in the field of heteroscedastic GPs. A common approach to modelling changing variance with a GP, as proposed by Goldberg et al. [2], is to make the noise variance a random variable and attempt to estimate its form at the same time as estimating the posterior mean. Goldberg et al. suggested using a second GP to model the noise level as a function of the input location.
Kersting et al. [3] improved upon Goldberg et al.’s Monte Carlo training method with a “most likely”
training scheme and demonstrated its effectiveness; related work includes Yuan and Wahba [4], and Le at al. [5] who proposed a scheme to find the variance via a maximum-a-posteriori estimate set in the exponential family. Snelson and Ghahramani [6] suggest a different approach whereby the importance of points in a pseudo-training set can be varied, allowing the posterior variance to vary as well. Recently Wilson and Ghahramani broadened the scope still further and proposed Copula and Wishart Process methods [7, 8].
Although all of these methods could be applied to datasets with input noise, they are designed for a more general class of heteroscedastic problems and so none of them exploits the structure inherent in input noise datasets. Our model also has a further advantage in that training is by marginal likelihood maximisation rather than by an approximate inference method, or one such as maximum likelihood, which is more susceptible to overfitting. Dallaire et al. [9] train on Gaussian distributed input points by calculating the expected the covariance matrix. However, their method requires prior knowledge of the noise variance, rather than inferring it as we do in this paper.
2 The Model
In this section we formally derive our model, which we refer to as NIGP (noisy input GP).
Let x and y be a pair of measurements from a process, where x is a D dimensional input to the process and y is the corresponding scalar output. In standard GP regression we assume that y is a noisy measurement of the actual output of the process ˜ y,
y = ˜ y + y (1)
where, y ∼ N 0, σ y 2 . In our model, we further assume that the inputs are also noisy measurements of the actual input ˜ x,
x = ˜ x + x (2)
where x ∼ N (0, Σ x ). We assume that each input dimension is independently corrupted by noise, thus Σ x is diagonal. Under a model f (.), we can write the output as a function of the input in the following form,
y = f (˜ x + x ) + y (3)
For a GP model the posterior distribution based on equation 3 is intractable. We therefore consider a Taylor expansion about the latent state ˜ x,
f (˜ x + x ) = f (˜ x) + T x ∂f (˜ x)
∂ ˜ x + . . . ' f (x) + T x ∂f (x)
∂x + . . . (4)
We don’t have access to the latent variable ˜ x so we approximate it with the noisy measurements.
Now the derivative of a Gaussian Process is another Gaussian Process [10]. Thus, the exact treatment
would require the consideration of a distribution over Taylor expansions. Although the resulting dis-
tribution is not Gaussian, its first and second moments can be calculated analytically. However, these
calculations carry a high computational load and previous experiments showed this exact treatment
provided no significant improvement over the much quicker approximate method we now describe.
Instead we take the derivative of the mean of the GP function, which we will denote ∂ f ¯ , a D- dimensional vector, for the derivative of one GP function value w.r.t. the D-dimensional input, and
∆ f ¯ , an N by D matrix, for the derivative of N function values. Differentiating the mean function corresponds to ignoring the uncertainty about the derivative. If we expand up to the first order terms we get a linear model for the input noise,
y = f (x) + T x ∂ f ¯ + y (5)
The probability of an observation y is therefore,
P (y | f ) = N (f, σ y 2 + ∂ T f ¯ Σ x ∂ f ¯ ) (6) We keep the usual Gaussian Process prior, P (f | X) = N (0, K(X, X)), where K(X, X) is the N by N training data covariance matrix and X is an N by D matrix of input observations. Combining these probabilities gives the predictive posterior mean and variance as,
E [f ∗ | X, y, x ∗ ] = k(x ∗ , X)K(X, X) + σ 2 y I + diag{∆ f ¯ Σ x ∆ T f ¯ } −1
y
V [f ∗ | X, y, x ∗ ] = k(x ∗ , x ∗ ) − k(x ∗ , X)K(X, X) + σ y 2 I + diag{∆ f ¯ Σ x ∆ T f ¯ } −1
k(X, x ∗ ) (7) This is equivalent to treating the inputs as deterministic and adding a corrective term, diag{∆ f ¯ Σ x ∆ T f ¯ }, to the output noise. The notation “diag{.}” results in a diagonal matrix, the elements of which are the diagonal elements of its matrix argument. Note that if the posterior mean gradient is constant across the input space the heteroscedasticity is removed and our model is essen- tially identical to a standard GP.
An advantage of our approach can be seen in the case of multiple output dimensions. As the input noise levels are the same for each of the output dimensions, our model can use data from all of the outputs when learning the input noise variances. Not only does this give more information about the noise variances without needing further input measurements but it also reduces over-fitting as the learnt noise variances must agree with all E output dimensions.
For time-series datasets (where the model has to predict the next state given the current), each dimension’s input and output noise variance can be constrained to be the same since the noise level on a measurement is independent of whether it is an input or output. This further constraint increases the ability of the model to recover the actual noise variances. The model is thus ideally suited to the common task of multivariate time series modelling.
3 Training
Our model introduces an extra D hyperparameters compared to the standard GP - one noise variance hyperparameter per input dimension. A major advantage of our model is that these hyperparameters can be trained alongside any others by maximisation of the marginal likelihood. This approach automatically includes regularisation of the noise parameters and reduces the effect of over-fitting.
In order to calculate the marginal likelihood of the training data we need the posterior distribution, and the slope of its mean, at each of the training points. However, evaluating the posterior mean from equation 7 with x ∗ ∈ X, results in an analytically unsolvable differential equation: ¯ f is a complicated function of ∆ f ¯ , its own derivative. Therefore, we define a two-step approach: first we evaluate a standard GP with the training data, using our initial hyperparameter settings and ignoring the input noise. We then find the slope of the posterior mean of this GP at each of the training points and use it to add in the corrective variance term, diag{∆ f ¯ Σ x ∆ T f ¯ }. This process is summarised in figures 1a and 1b.
The marginal likelihood of the GP with the corrected variance is then computed, along with its derivatives with respect to the initial hyperparameters, which include the input noise variances. This step involves chaining the derivatives of the marginal likelihood back through the slope calculation.
Gradient descent can then be used to improve the hyperparameters. Figure 1c shows the GP posterior
for the trained hyperparameters and shows how NIGP can reduce output noise level estimates by
taking input noise into account. Figure 1d shows the NIGP fit for the trained hyperparameters.
−1 0 1 2 3 4 5 6
Target
a) Initial hyperparameters & training data define a GP fit
b) Extra variance added proportional to squared slope
0 1 2 3 4 5 6
−1 0 1 2 3 4 5 6
Input
Target
c) Standard GP with NIGP trained hyperparameters
0 1 2 3 4 5 6
Input d) The NIGP fit including variance
from input noise
Figure 1: Training with NIGP. (a) A standard GP posterior distribution can be computed from an initial set of hyperparameters and a training data set, shown by the blue crosses. The gradients of the posterior mean at each training point can then be found analytically. (b) The NIGP method increases the posterior variance by the square of the posterior mean slope multiplied by the current setting of the input noise variance hyperparameter. The marginal likelihood of this fit is then calculated along with its derivatives w.r.t. initial hyperparameter settings. Gradient descent is used to train the hyper- parameters. (c) This plot shows the standard GP posterior using the newly trained hyperparameters.
Comparing to plot (a) shows that the output noise hyperparameter has been greatly reduced. (d) This plot shows the NIGP fit - plot(c) with the input noise corrective variance term, diag{∆ f ¯ Σ x ∆ T f ¯ }.
Plot (d) is related to plot (c) in the same way that plot (b) is related to plot (a).
To improve the fit further we can iterate this procedure: we use the slopes of the current trained NIGP, instead of a standard GP, to calculate the effect of the input noise, i.e. replace the fit in figure 1a with the fit from figure 1d and re-train.
4 Prediction
We turn now to the task of making predictions at noisy input locations with our model. To be true to our model we must use the same process in making predictions as we did in training. We therefore use the trained hyperparameters and the training data to define a GP posterior mean, which we differentiate at each test point and each training point. The calculated gradients are then used to add in the corrective variance terms. The posterior mean slope at the test points is only used to calculate the variance over observations, where we increase the predictive variance by the noise variances.
There is an alternative option, however. If a single test point is considered to have a Gaussian distribution and all the training points are certain then, although the GP posterior is unknown, its mean and variance can be calculated exactly [11]. As our model estimates the input noise variance Σ x during training, we can consider a test point to be Gaussian distributed: x 0 ∗ ∼ N (x ∗ , Σ x ).
[11] then gives the mean and variance of the posterior distribution, for a squared exponential kernel (equation 12), to be,
f ¯ ∗ =
K + σ y 2 I + Σ x ∂ ¯ f 2 −1
y T
q (8)
where,
q i = σ 2 f
Σ x Λ −1 + I
−
12exp − 1
2 (x i − x ∗ ) T (Σ x + Λ) −1 (x i − x ∗ )
(9) where Λ is a diagonal matrix of the squared lengthscale hyperparameters.
V [f ∗ ] = σ 2 f − tr
K + σ 2 y I + Σ x ∂ ¯ f 2 −1
Q
+ α T Qα − ¯ f ∗ 2 (10) with,
Q ij = k(x i , x ∗ )k(x j , x ∗ )
|2Σ x Λ −1 + I|
12exp
(z − x ∗ ) T Λ + 1
2 ΛΣ −1 x Λ −1
(z − x ∗ )
(11) with z = 1 2 (x i +x j ). This method is computationally slower than using equation 7 and is vulnerable to worse results if the learnt input noise variance Σ x is very different from the true value. However, it gives proper consideration to the uncertainty surrounding the test point and exactly computes the moments of the correct posterior distribution. This often leads it to outperform predictions based on equation 7.
5 Results
We tested our model on a variety of functions and datasets, comparing its performance to stan- dard GP regression as well as Kersting et al.’s ‘most likely heteroscedastic GP’ (MLHGP) model, a state-of-the-art heteroscedastic GP model. We used the squared exponential kernel with Automatic Relevance Determination,
k(x i , x j ) = σ f 2 exp − 1
2 (x i − x j ) T Λ −1 (x i − x j )
(12) where Λ is a diagonal matrix of the squared lengthscale hyperparameters and σ 2 f is a signal variance hyperparameter. Code to run NIGP is available on the author’s website.
Standard GP Kersting et al. This paper
−10 −5 0 5 10
−1.5
−1
−0.5 0 0.5 1 1.5
−10 −5 0 5 10
−1.5
−1
−0.5 0 0.5 1 1.5
−10 −5 0 5 10
−1.5
−1
−0.5 0 0.5 1 1.5
Figure 2: Posterior distribution for a near-square wave with σ y = 0.05, σ x = 0.3, and 60 data points.
The solid line represents the predictive mean and the dashed lines are two standard deviations either side. Also shown are the training points and the underlying function. The left image is for standard GP regression, the middle uses Kersting et al.’s MLHGP algorithm, the right image shows our model.
While the predictive means are similar, both our model and MLHGP pinch in the variance around the low noise areas. Our model correctly expands the variance around all steep areas whereas MLHGP can only do so where high noise is observed (see areas around x= -6 and x = 1).
Figure 2 shows an example comparison between standard GP regression, Kersting et al.’s MLHGP,
and our model for a simple near-square wave function. This function was chosen as it has areas
of steep gradient and near flat gradient and thus suffers from the heteroscedastic problems we are trying to solve. The posterior means are very similar for the three models, however the variances are quite different. The standard GP model has to take into account the large noise seen around the steep sloped areas by assuming large noise everywhere, which leads to the much larger error bars.
Our model can recover the actual noise levels by taking the input noise into account. Both our model and MLHGP pinch the variance in around the flat regions of the function and expand it around the steep areas. For the example shown in figure 2 the standard GP estimated an output noise standard deviation of 0.16 (much too large) compared to our estimate of 0.052, which is very close to the correct value of 0.050. Our model also learnt an input noise standard deviation of 0.305, very close to the real value of 0.300. MLHGP does not produce a single estimate of noise levels.
Predictions for 1000 noisy measurements were made using each of the models and the log proba- bility of the test set was calculated. The standard GP model had a log probability per data point of 0.419, MLHGP 0.740, and our model 0.885, a significant improvement. Part of the reason for our improvement over MLHGP can be seen around x = 1: our model has near-symmetric ‘horns’ in the variance around the corners of the square wave, whereas MLHGP only has one ‘horn’. This is because in our model, the amount of noise expected is proportional to the derivative of the mean squared, which is the same for both sides of the square wave. In Kersting et al.’s model the noise is estimated from the training points themselves. In this example the training points around x = 1 happen to have low noise and so the learnt variance is smaller. The same problem can be seen around x = −6 where MLHGP has much too small variance. This illustrates an important aspect of our model: the accuracy in plotting the varying effect of noise is only dependent on the accuracy of the mean posterior function and not on an extra, learnt noise model. This means that our model typically requires fewer data points to achieve the same accuracy as MLHGP on input noise datasets. To test the models further, we trained them on a suite of six functions. The functions were again chosen to have varying gradients across the input space. The training set consisted of twenty five points in the interval [-10, 10] and the test set one thousand points in the same interval. Trials were run for different levels of input noise. For each trial, ten different initialisations of the hyperparameters were tried. In order to remove initialisation effects the best initialisations for each model were chosen at each step. The entire experiment was run on twenty different random seeds. For our model, NIGP, we trained both a single model for all output dimensions, as well as separate models for each of the outputs, to see what the effect of using the cross-dimension information was.
Figure 3 shows the results for this experiment. The figure shows that NIGP performs very well on all the functions, always outperforming the standard GP when there is input noise and nearly always MLHGP; wherever there is a significant difference our model is favoured. Training on all the outputs at once only gives an improvement for some of the functions, which suggests that, for the others, the input noise levels could be estimated from the individual functions alone. The predictions using stochastic test-points, equations 8 and 10, generally outperformed the predictions made using deter- ministic test-points, equation 7. The RMSEs are quite similar to each other for most of the functions as the posterior means are very similar, although where they do differ significantly, again, it is to favour our model. These results show our model consistently calculates a more accurate predictive posterior variance than either a standard GP or a state-of-the-art heteroscedastic GP model.
As previously mentioned, our model can be adapted to work more effectively with time-series data, where the outputs become subsequent inputs. In this situation the input and output noise variance will be the same. We therefore combine these two parameters into one. We tested NIGP on a time- series dataset and compared the two modes (with separate input and output noise hyperparameters and with combined) and also to standard GP regression (MLHGP was not available for multiple input dimensions). The dataset is a simulated pendulum without friction and with added noise.
There are two variables: pendulum angle and angular velocity. The choice of time interval between observations is important: for very small time intervals, and hence small changes in the angle, the dynamics are approximately linear, as sin θ ≈ θ. As discussed before, our model will not bring any benefit to linear dynamics, so in order to see the difference in performance a much longer time interval was chosen. The range of initial angular velocities was chosen to allow the pendulum to spin multiple times at the extremes, which adds extra non-linearity. Ten different initialisations were tried, with the one achieving the highest training set marginal likelihood chosen, and the whole experiment was repeated fifty times with different random seeds.
The plots show the difference in log probability of the test set between four versions of NIGP and a
standard GP model trained on the same data. All four versions of our model perform better than the
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
−2
−1.5
−1
−0.5 0 0.5 1
Negative log predictive posterior
sin(x)
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
−1.5
−1
−0.5 0 0.5
Near−square wave
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
−1.5
−1
−0.5 0 0.5 1 1.5
2 exp(−0.2*x)*sin(x)
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
−1
−0.5 0 0.5 1 1.5 2
Input noise standard deviation
Negative log predictive posterior
tan(0.15*(x))*sin(x)
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
0 0.5 1 1.5 2 2.5 3 3.5 4
Input noise standard deviation 0.2*x2*tanh(cos(x))
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
−1
−0.8
−0.6
−0.4
−0.2 0 0.2 0.4 0.6
Input noise standard deviation 0.5*log(x2*(sin(2*x)+2)+1) NIGP DTP all o/p
NIGP DTP indiv. o/p NIGP STP indiv. o/p NIGP STP all o/p Kersting et al.
Standard GP
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
Normalised test set RMSE
sin(x)
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 0.5 0.55
Near−square wave
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
exp(−0.2*x)*sin(x)
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 1.1
Input noise standard deviation
Normalised test set RMSE
tan(0.15*(x))*sin(x)
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8
Input noise standard deviation 0.2*x2*tanh(cos(x))
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
0.02 0.04 0.06 0.08 0.1 0.12 0.14 0.16 0.18 0.2 0.22
Input noise standard deviation 0.5*log(x2*(sin(2*x)+2)+1)