3.1 Gaussian Process Regression
3.1.1 Kernel Functions
A Gaussian process explicitly depends on the kernel function to compute the covariance between all the random variables of the latent function. The type of kernel chosen has an impact on the types of functions that are drawn from both the prior and the posterior.
Kernel functions are typically conditional on some hyper-parameters, $\theta_K$, that further narrow down properties of the latent function. A maximum a posteriori (MAP) solution is typically found for these parameters, though it should be noted that this point estimate is an approximation that is only accurate when the posterior is strongly peaked. In Section 6.2 we consider a situation where this does not hold, the effect it can have, and some existing approaches to handling this issue.
An example of a very popular choice of kernel within the literature is the radial basis function (RBF) kernel, also known as the exponentiated quadratic kernel, or Gaussian kernel. The RBF kernel requires two hyper-parameters, $\theta_K = \{\ell, \sigma^2_{\mathrm{rbf}}\}$, and is given by the following equation:
\[
k(x_{i,:}, x_{j,:}) = \sigma^2_{\mathrm{rbf}} \exp\left( -\frac{1}{2\ell^2} \sum_{k=1}^{q} (x_{i,k} - x_{j,k})^2 \right).
\]
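The RBF kernel can be sketched in a few lines of NumPy. This is a minimal illustration rather than the implementation used in this work: the function name `rbf_kernel`, its argument names and the example inputs are assumptions made here for clarity.

```python
import numpy as np

def rbf_kernel(X1, X2, lengthscale=1.0, variance=1.0):
    """RBF covariance between the rows of X1 (n x q) and X2 (m x q)."""
    X1 = np.atleast_2d(X1)
    X2 = np.atleast_2d(X2)
    # Squared Euclidean distance between every pair of rows,
    # i.e. the sum over the q input dimensions in the equation above.
    sq_dist = np.sum((X1[:, None, :] - X2[None, :, :]) ** 2, axis=-1)
    return variance * np.exp(-0.5 * sq_dist / lengthscale**2)

# Three one-dimensional inputs (illustrative values only).
X = np.array([[0.0], [1.0], [3.0]])
K = rbf_kernel(X, X, lengthscale=1.0, variance=2.0)
```

The diagonal of `K` equals the variance hyper-parameter, and the off-diagonal entries decay towards zero as the inputs move further apart relative to the lengthscale.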
A useful way of visualising the assumptions made by using a Gaussian process prior with a particular kernel is to sample from the prior, $f \sim \mathcal{N}(0, K_{ff})$, and plot the sample against the input, $x$. Figure 3.2 shows the effect that varying the lengthscale hyper-parameter, $\ell$, has on samples from Gaussian process priors with RBF kernels.
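Sampling from the prior can be sketched as follows. This is a hedged example, not the code used to produce Figure 3.2: the helper names, the input grid on [0, 1], the three lengthscales and the jitter term added for numerical stability are all assumptions made here.

```python
import numpy as np

def rbf_kernel(x, lengthscale, variance=1.0):
    """RBF covariance matrix for a one-dimensional input vector x."""
    diff = x[:, None] - x[None, :]
    return variance * np.exp(-0.5 * diff**2 / lengthscale**2)

def sample_prior(x, lengthscale, n_samples=3, jitter=1e-6, seed=0):
    """Draw n_samples functions f ~ N(0, K_ff) evaluated at x."""
    rng = np.random.default_rng(seed)
    # A small jitter on the diagonal keeps the Cholesky factorisation stable.
    K = rbf_kernel(x, lengthscale) + jitter * np.eye(len(x))
    L = np.linalg.cholesky(K)
    z = rng.standard_normal((len(x), n_samples))
    return L @ z  # each column is one draw with covariance K

x = np.linspace(0.0, 1.0, 50)
draws = {ell: sample_prior(x, ell) for ell in (0.05, 0.5, 5.0)}

# Mean squared difference between neighbouring function values,
# used here as a crude measure of how quickly each draw varies.
roughness = {ell: float(np.mean(np.diff(f, axis=0) ** 2)) for ell, f in draws.items()}
```

Plotting each column of `draws[ell]` against `x` gives a picture of the kind described above: short lengthscales produce rapidly varying functions and long lengthscales produce nearly straight lines over the input range.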
When the lengthscale is very large, draws from the Gaussian process prior are essentially linear within the input range. When the lengthscale is very small, the function value at each input, $x_{i,:}$, is essentially independent of the value at every other input, $x_{j,:}$, and draws appear like white noise. The RBF kernel, however, is infinitely differentiable (Rasmussen and Williams, 2006), and so functions drawn from a Gaussian process using the RBF kernel are themselves infinitely differentiable. If one were to zoom into a function drawn using $\ell = 0.05$, it would remain smooth.
Allowing the lengthscale to vary depending on the input dimension,
\[
k(x_{i,:}, x_{j,:}) = \sigma^2_{\mathrm{rbf}} \exp\left( -\frac{1}{2} \sum_{k=1}^{q} \frac{1}{\ell_k^2} (x_{i,k} - x_{j,k})^2 \right),
\]
gives the RBF automatic relevance determination (ARD) kernel (MacKay, 1996). In this case, when the kernel hyper-parameters are optimised, the model is able to learn that for some input dimensions $k \in \{1, \dots, q\}$, the lengthscale $\ell_k$ can be very long compared to the scale of the data $x_{:,k}$. It is often stated that this suggests the input dimension is irrelevant, since the covariance is essentially constant with respect to changes in either $x_{i,k}$ or $x_{j,k}$. In certain cases, however, it actually suggests that the function is linear with respect to this input (Piironen and Vehtari, 2015), and the input is not necessarily irrelevant. If the lengthscale is short, on the other hand, the covariance varies drastically with the input values, and these inputs are deemed relevant. Allowing the lengthscales to vary depending on the input dimension therefore adds flexibility to the model.
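The effect of a long lengthscale in one dimension can be illustrated with a small sketch. The function name `rbf_ard_kernel` and the particular lengthscales (0.5 and 100) are assumptions chosen for illustration, not values taken from any experiment in this work.

```python
import numpy as np

def rbf_ard_kernel(X1, X2, lengthscales, variance=1.0):
    """RBF ARD covariance with one lengthscale per input dimension."""
    lengthscales = np.asarray(lengthscales, dtype=float)
    # Dividing each dimension by its own lengthscale is equivalent to
    # weighting the squared differences by 1 / ell_k^2.
    Z1 = np.atleast_2d(X1) / lengthscales
    Z2 = np.atleast_2d(X2) / lengthscales
    sq_dist = np.sum((Z1[:, None, :] - Z2[None, :, :]) ** 2, axis=-1)
    return variance * np.exp(-0.5 * sq_dist)

origin = np.array([[0.0, 0.0]])
step_dim0 = np.array([[1.0, 0.0]])  # move one unit along dimension 0
step_dim1 = np.array([[0.0, 1.0]])  # move one unit along dimension 1
ell = [0.5, 100.0]                  # short in dimension 0, very long in dimension 1

k_dim0 = rbf_ard_kernel(origin, step_dim0, ell)[0, 0]
k_dim1 = rbf_ard_kernel(origin, step_dim1, ell)[0, 0]
```

Here `k_dim0` drops to `exp(-2)` of the prior variance after a unit step, whereas `k_dim1` stays very close to the prior variance, which is the sense in which the covariance is essentially constant along the long-lengthscale dimension.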
The white noise kernel, which assumes zero correlation between input locations, can be written with the Kronecker delta function $\delta$, a function that produces 1 when its two inputs are equal, and 0 otherwise,
\[
k(x_{i,:}, x_{j,:}) = \sigma^2_{\mathrm{white}} \, \delta(i, j).
\]
Another commonly used kernel function is the linear kernel, which also does not depend on the Euclidean distance between inputs, $\lVert x_{i,:} - x_{j,:} \rVert$, and is defined by the inner product,
\[
k(x_{i,:}, x_{j,:}) = \sigma^2_{\mathrm{lin}} \, x_{i,:}^{\top} x_{j,:}.
\]
Gaussian process regression with a linear kernel transpires to be equivalent to Bayesian linear regression (Rasmussen and Williams, 2006).
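This equivalence can be checked numerically. The sketch below assumes Gaussian observation noise with variance `sigma2_noise` and a zero-mean isotropic Gaussian prior on the weights with variance `sigma2_lin`; the synthetic data, the variable names and the hyper-parameter values are all illustrative assumptions.

```python
import numpy as np

def linear_kernel(X1, X2, variance=1.0):
    """Linear kernel: scaled inner product between rows of X1 and X2."""
    return variance * X1 @ X2.T

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 3))        # training inputs
X_test = rng.standard_normal((5, 3))    # test inputs
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(20)

sigma2_lin, sigma2_noise = 2.0, 0.01

# Gaussian process view: posterior mean K_* (K + sigma_n^2 I)^{-1} y.
K = linear_kernel(X, X, sigma2_lin)
K_star = linear_kernel(X_test, X, sigma2_lin)
gp_mean = K_star @ np.linalg.solve(K + sigma2_noise * np.eye(len(X)), y)

# Bayesian linear regression view: posterior mean of the weights
# under the prior w ~ N(0, sigma2_lin * I), then predict with X_test.
A = X.T @ X / sigma2_noise + np.eye(X.shape[1]) / sigma2_lin
w_mean = np.linalg.solve(A, X.T @ y / sigma2_noise)
blr_mean = X_test @ w_mean
```

The two predictive means agree up to floating point error. The Gaussian process form solves an n x n system, while the weight-space form solves a q x q system, so the latter is cheaper when there are many more data points than input dimensions.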
Figure 3.3 shows draws from the prior of a range of different kernels encoding different assumptions about smoothness, linearity, differentiability, bias and even periodicity. Kernel functions themselves can be combined in a multitude of ways, such as summation, multiplication and transformation of the input x.

[Figure 3.3: eight panels plotting Y against X: (a) Squared exponential (RBF), (b) Linear, (c) Polynomial order 3, (d) Matern32, (e) Matern52, (f) Brownian, (g) Bias, (h) Periodic Matern52.]
Fig. 3.3 Draws from zero mean GPs with a range of different kernel functions showing different assumptions about similarity between X locations. Equations for each kernel can be found in Appendix B or Section 3.1.1.

The summation of two kernels results in the summation of the corresponding latent functions: the sum of two independent functions drawn from Gaussian processes with kernels $k_1$ and $k_2$ is itself a draw from a Gaussian process with kernel $k_1 + k_2$. This convenient property, amongst others, allows complex prior distributions over functions to be defined trivially. For example, the RBF kernel assumes the true function is infinitely differentiable. In real data the function of interest is rarely infinitely differentiable, and we may wish to encode this understanding within our prior over functions. It is common to relax this assumption by the addition of the white noise kernel, $k = k_{\mathrm{RBF}} + k_{\mathrm{white}}$. As can be seen in Figure 3.2d, the use of this summation kernel produces functions that are not smooth, but contain a smooth non-linear trend. This can be used as a tool to avoid problems of overfitting when the infinite differentiability assumption is likely to be violated in practice (Damianou, 2015).
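A summation kernel of this kind can be sketched as below. The lengthscale, the two variances, the input grid and the roughness measure are assumptions chosen for illustration, and the code is not the one used to produce Figure 3.2d.

```python
import numpy as np

def rbf_kernel(x, lengthscale=0.2, variance=1.0):
    """RBF covariance matrix for a one-dimensional input vector x."""
    diff = x[:, None] - x[None, :]
    return variance * np.exp(-0.5 * diff**2 / lengthscale**2)

def white_kernel(x, variance=0.05):
    """White noise covariance: Kronecker delta on the input indices."""
    return variance * np.eye(len(x))

def roughness(f):
    """Mean squared difference between neighbouring function values."""
    return float(np.mean(np.diff(f) ** 2))

rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 100)

K_rbf = rbf_kernel(x)
K_sum = K_rbf + white_kernel(x)  # k = k_RBF + k_white

# One draw from the summation kernel.
f_sum = np.linalg.cholesky(K_sum) @ rng.standard_normal(len(x))

# One draw from the RBF kernel alone (jitter added for numerical stability).
L_rbf = np.linalg.cholesky(K_rbf + 1e-6 * np.eye(len(x)))
f_smooth = L_rbf @ rng.standard_normal(len(x))
```

Comparing `roughness(f_sum)` with `roughness(f_smooth)` shows that the draw from the summation kernel varies far more between neighbouring inputs, while a plot of `f_sum` still follows a smooth underlying trend.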
The reader is directed towards Duvenaud (2014) for an intuitive visualisation and explanation of the implications of other methods of combining kernels.
3.2 Approximations