Chapter 2 Kernel Methods, Stochastic Processes and Bayesian Non-
2.3 Bayesian Nonparametric Models
2.3.3 Practical Issues with Gaussian Processes
GPs will be used extensively throughout the remainder of this thesis, and we therefore pause to discuss issues relating to their practical implementation. This includes how to select a particular type of GP prior for Bayesian inference, the stability of the numerical systems underlying conditional distributions, and issues relating to their scalability in high-dimensional or large-data settings.
Prior Specification
Prior specification (also called model selection) is an important consideration for working with GPs [Stein, 1999; Xu and Stein, 2017]. It consists of selecting the mean function m : X → R and the covariance function c : X × X → R of the GP prior; see Oakley [2002] for elicitation of priors in the area of computer experiments. Since prior models are “infinitely informative” in the nonparametric case, this choice is of prime importance as it will significantly influence the result of the Bayesian analysis. Care is therefore required.
The choice of mean function for GPs has received relatively little attention. This is mainly due to the fact that an appropriate choice of prior mean should be guided by problem-specific knowledge. A common practice is to set the mean function to m = 0, then let the data influence the posterior. In cases where the number of data points n is large and the dimension of the domain X is low, this may be an acceptable approach. However, when this is not the case, the data will not be informative about the function on the entire domain and, as a result, the posterior will revert to the prior in areas which are unexplored. An arbitrary choice of prior such as m = 0 can therefore have severe consequences in these cases. To avoid this problem, it is also possible to use a parametric model as prior mean, for example using a linear combination of basis functions [Kennedy and Hagan, 2001], or to use meta-learning; see for example Fortuin and Ratsch [2019].
Figure 2.2: Importance of model selection for Gaussian processes. Left: Draws from a Gaussian process prior with mean zero and covariance a Gaussian RBF kernel with lengthscale σ = 0.1 (red), σ = 1 (blue) and σ = 5 (green). Right:
covariance function. This is because the covariance function will determine essential properties of the realisations and mean of the posterior. Some popular covariance functions have already been introduced in Section 2.1.3 and other examples can also be found in [Duvenaud, 2014]. It is common to base this choice on smoothness, periodicity and tail properties.
As was seen in these examples, covariance functions also tend to have several hyperparameters (jointly denoted by the vector γ) which need to be selected, and which will have a significant influence on the prior obtained. This is for example illustrated in Figure 2.2 (left), where realisations from a GP with Gaussian RBF covariance function (see Equation 2.4) are plotted for various values of the lengthscale σ but fixed amplitude λ = 1. Similarly, Figure 2.2 (right) contains realisations from a GP with Matérn covariance (see Equation 2.3) with lengthscale σ = 1 and amplitude λ = 1 but varying smoothness hyperparameter ν. In both cases, the hyperparameters have a significant impact on the realisations obtained.
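The effect of the lengthscale on prior realisations can be reproduced with a few lines of code. The following is a minimal sketch rather than the code used for Figure 2.2. It assumes that the Gaussian RBF kernel takes the form c(x, x′) = λ² exp(−(x − x′)² / (2σ²)) (Equation 2.4 is not reproduced here, so this parametrisation is an assumption), and that the GP is evaluated on a grid of 100 points in [0, 10]. The names rbf_kernel, sample_gp_prior and roughness are illustrative only.

```python
import numpy as np

def rbf_kernel(x, y, lengthscale=1.0, amplitude=1.0):
    """Gaussian RBF kernel on 1-D inputs (assumed parametrisation)."""
    sq_dists = (x[:, None] - y[None, :]) ** 2
    return amplitude**2 * np.exp(-sq_dists / (2.0 * lengthscale**2))

def sample_gp_prior(x, lengthscale, n_draws=3, jitter=1e-6, seed=0):
    """Draw realisations of a zero-mean GP prior evaluated at the points x."""
    rng = np.random.default_rng(seed)
    # A small diagonal jitter (an assumption, not part of the model) keeps
    # the Cholesky factorisation feasible for large lengthscales.
    C = rbf_kernel(x, x, lengthscale) + jitter * np.eye(len(x))
    L = np.linalg.cholesky(C)
    z = rng.standard_normal((len(x), n_draws))
    return (L @ z).T  # shape (n_draws, len(x))

x = np.linspace(0.0, 10.0, 100)
draws = {ls: sample_gp_prior(x, ls) for ls in (0.1, 1.0, 5.0)}

# Mean absolute increment between neighbouring grid points: a crude
# summary of how quickly the realisations vary.
roughness = {ls: np.mean(np.abs(np.diff(d, axis=1))) for ls, d in draws.items()}
```

Plotting each entry of draws against x gives a picture qualitatively similar to the left panel described above: short lengthscales produce rapidly varying realisations, long lengthscales produce slowly varying ones.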
Consider a parametric covariance function c(x, x′; γ_l, γ_s), with a distinction drawn here between scale hyperparameters γ_l and smoothness hyperparameters γ_s. The former are defined as parameterising the norm of the associated RKHS, whereas the latter affect the corresponding RKHS itself. Selection of γ_l, γ_s based on data can only be successful in the absence of acute sensitivity to these hyperparameters. For scale hyperparameters, a wide body of evidence demonstrates that this is usually not a concern [Stein, 1999]. We now outline several approaches, which are described in more detail by Rasmussen and Williams [2006]:
• Marginalisation: A natural approach, from a Bayesian perspective, is to set a prior on the hyperparameters γ and then to marginalise over the posterior distribution on these parameters. Recent results for certain infinitely differentiable covariance functions establish minimax optimal rates for this approach, including in the practically relevant setting where π is supported on a low-dimensional sub-manifold of the ambient space X [Yang and Dunson, 2016]. However, the act of marginalisation itself involves an intractable integral which will usually break the conjugacy property of GPs. It is therefore important to keep in mind the additional computational resources required when assessing the advantages provided by marginalisation.
• Cross-Validation: Another approach to the choice of covariance function is cross-validation. It consists of separating the data into M ∈ N subsets then, for a given hyperparameter value, conditioning the GP on M − 1 subsets and assessing its predictive performance using the data points in the remaining subset. The procedure is then repeated over all choices of M − 1 subsets, to obtain an indication of how good the hyperparameter value is for prediction. This procedure can then be repeated for several hyperparameter values, and the best performing hyperparameter is retained.
This method is a robust approach to selecting hyperparameters since it is less sensitive to outliers. However, it can be considered to be less principled than marginalisation from a Bayesian point of view since it selects a prior using the data. Another issue is that it can perform poorly when the number n of data points is small, since the data needs to be further split into M subsets. The performance estimates are known to have large variance in those cases.
• Empirical Bayes: An alternative to the above approaches is empirical Bayes. This consists in selecting hyperparameters γ to maximise the log-marginal likelihood of the data {f(x_i)}_{i=1}^n:
l(\gamma) = -\frac{1}{2} f^\top C^{-1} f - \frac{1}{2} \log|C| - \frac{n}{2} \log 2\pi,
where f denotes the vector of function values at the data points, C is the corresponding Gram matrix (which depends on γ) and |C| denotes the determinant of the matrix C. In practice, this objective can be maximised using any numerical optimisation routine. Empirical Bayes has the advantage of providing an objective function that is easier to optimise than that of cross-validation, but it is not fully Bayesian since it also makes use of the data to select the hyperparameters. Empirical Bayes can lead to over-confidence when n is very small, since the full irregularity of the function has yet to be uncovered [Szabó et al., 2015]. In addition, it can be shown that empirical Bayes estimates need not converge as n → ∞. This is for example the case when the GP is supported on infinitely differentiable functions [Xu and Stein, 2017].
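To make the empirical Bayes objective concrete, the sketch below evaluates the log-marginal likelihood for a zero-mean GP and maximises it over a grid of lengthscales. It is illustrative only: the Gaussian RBF parametrisation, the toy data f(x) = sin(x), the grid of candidate values and the diagonal jitter are all assumptions, and a gradient-based optimiser would normally replace the grid search.

```python
import numpy as np

def rbf_kernel(x, y, lengthscale=1.0, amplitude=1.0):
    """Gaussian RBF kernel on 1-D inputs (assumed parametrisation)."""
    sq_dists = (x[:, None] - y[None, :]) ** 2
    return amplitude**2 * np.exp(-sq_dists / (2.0 * lengthscale**2))

def log_marginal_likelihood(x, f, lengthscale, amplitude=1.0, jitter=1e-6):
    """Log-marginal likelihood of noise-free data f under a zero-mean GP."""
    n = len(x)
    # The jitter is a numerical safeguard (an assumption); it acts like a
    # very small observation noise and slightly perturbs the objective.
    C = rbf_kernel(x, x, lengthscale, amplitude) + jitter * np.eye(n)
    L = np.linalg.cholesky(C)                 # C = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, f))  # C^{-1} f
    log_det = 2.0 * np.sum(np.log(np.diag(L)))           # log|C|
    return -0.5 * f @ alpha - 0.5 * log_det - 0.5 * n * np.log(2.0 * np.pi)

# Hypothetical toy data set.
x = np.linspace(0.0, 10.0, 30)
f = np.sin(x)

# Grid search over candidate lengthscales.
grid = np.logspace(-1, 1, 25)
scores = np.array([log_marginal_likelihood(x, f, ls) for ls in grid])
best = grid[int(np.argmax(scores))]
```

The same function could be reused inside a cross-validation loop by replacing the objective with a held-out predictive score, at the cost of one factorisation per fold and per hyperparameter value.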
Selection of smoothness hyperparameters is a much harder problem and an active area of theoretical research; see Szabó et al. [2015]. In some cases it is possible to elicit a smoothness hyperparameter from physical or mathematical considerations, such as a known number of derivatives of the function. Alternatively, the three methods highlighted above can also be used for smoothness hyperparameters but are much less well understood in this case.
Stability of the Numerical System
The main computational challenge associated with the use of GPs is inverting the n × n Gram matrix C. This is required in order to obtain the posterior mean and variance in Equations 2.7 and 2.8. When n is large, or in unfavourable hyperparameter regimes, inversion of the covariance matrix can become numerically unstable. Understanding when this may happen is of great practical importance, and we refer the reader to Chapter 12 of Wendland [2005] for a detailed discussion.
Consider Figure 2.3, where we highlight this problem for the simple case of GP regression with Gaussian RBF covariance function where the function is evaluated at 100 equidistant points on [0, 10]. When the covariance function has a large lengthscale σ, the matrix is ill-conditioned since neighbouring rows or columns are very similar to one another. This may not be an issue from a theoretical viewpoint, but it is likely that the matrix will become numerically singular. Schaback and Wendland [2006] point out that this behaviour occurs for a large class of radial kernels. Another observation in that paper is that the conditioning of the Gram matrix will worsen with the smoothness of the covariance function.
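The ill-conditioning described above can be checked numerically through the condition number of the Gram matrix. The sketch below uses the same grid of 100 equidistant points on [0, 10] and the three lengthscales of Figure 2.3; the kernel parametrisation is an assumption, and the exact values printed will depend on the floating-point arithmetic used.

```python
import numpy as np

def rbf_kernel(x, y, lengthscale=1.0, amplitude=1.0):
    """Gaussian RBF kernel on 1-D inputs (assumed parametrisation)."""
    sq_dists = (x[:, None] - y[None, :]) ** 2
    return amplitude**2 * np.exp(-sq_dists / (2.0 * lengthscale**2))

x = np.linspace(0.0, 10.0, 100)

# 2-norm condition number (ratio of largest to smallest singular value).
cond_numbers = {
    ls: np.linalg.cond(rbf_kernel(x, x, ls)) for ls in (0.1, 1.0, 5.0)
}

for ls, kappa in cond_numbers.items():
    print(f"lengthscale = {ls}: condition number ~ {kappa:.2e}")
```

In double precision, a condition number approaching the reciprocal of machine epsilon (roughly 10^16) indicates that the matrix is numerically singular, even though it is positive definite in exact arithmetic.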
Often it is the case that we need to compute the product C^{-1}b, where b is a vector of length n. In this case, solving the linear system Ca = b for a, and then using a in any subsequent matrix-vector products, tends to be more numerically stable than computing the matrix inverse explicitly.
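The two routes to C^{-1}b can be compared as follows. This is a sketch under stated assumptions: the Gram matrix uses the assumed Gaussian RBF parametrisation with a diagonal jitter of 10^{-6}, the right-hand side b is built from a known vector a_true so that the error can be measured, and the Cholesky-based solve is one possible choice of linear solver rather than the method prescribed by the text.

```python
import numpy as np

def rbf_kernel(x, y, lengthscale=1.0, amplitude=1.0):
    """Gaussian RBF kernel on 1-D inputs (assumed parametrisation)."""
    sq_dists = (x[:, None] - y[None, :]) ** 2
    return amplitude**2 * np.exp(-sq_dists / (2.0 * lengthscale**2))

rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 100)
C = rbf_kernel(x, x, lengthscale=0.5) + 1e-6 * np.eye(len(x))

# Build b from a known solution so that both routes can be checked.
a_true = rng.standard_normal(len(x))
b = C @ a_true

# Route 1: form the inverse explicitly, then multiply.
a_inv = np.linalg.inv(C) @ b

# Route 2: factorise C = L L^T and solve two triangular systems.
# (np.linalg.solve is a general solver; a dedicated triangular solver
# would exploit the structure of L and be cheaper.)
L = np.linalg.cholesky(C)
a_chol = np.linalg.solve(L.T, np.linalg.solve(L, b))

def relative_residual(a):
    """Size of C a - b relative to b."""
    return np.linalg.norm(C @ a - b) / np.linalg.norm(b)

res_inv = relative_residual(a_inv)
res_chol = relative_residual(a_chol)
print(f"explicit inverse: {res_inv:.2e}, Cholesky solve: {res_chol:.2e}")
```

A further practical benefit of the factorisation is that L can be stored and reused for several right-hand sides, and that log|C| is available from its diagonal at no extra cost.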
Approaches to further improve stability include multipole expansions [Greengard and Strain, 1991], domain decomposition methods [Beatson et al., 2001], partition of unity methods [Babuska and Melenk, 1997], compactly supported kernels [Floater and Iske, 1996; Wendland, 2005] and preconditioning of the covariance matrix [Mouat, 2001].
Figure 2.3: Ill-conditioning of the Gram matrix in Gaussian process regression. We continue the example in Figure 2.2 and plot the Gram matrices corresponding to 100 equidistant points in [0, 10] for a GP with Gaussian RBF kernel with amplitude hyperparameter λ = 1 and lengthscale hyperparameter σ = 0.1 (left), σ = 1 (middle) and σ = 5 (right).
Scalability
In situations where obtaining data is cheap, the naive O(n^3) computational cost associated with inverting the covariance matrix renders GP regression slow. It is then natural to ask whether the uncertainty quantification provided by GPs is worth the increased off-line computational overhead. Below, several approaches to reducing the computational overhead of GPs are highlighted.
Exact inversion can be achieved at low cost through exploiting structure in the kernel matrix. Examples include: tensor product kernels [O’Hagan, 1991], circulant embeddings [Davies and Bryant, 2013] and low-rank kernels such as polynomial kernels. In addition there are many approximate inversion techniques. We highlight a few here: reduced rank approximations [Quinonero-Candela and Rasmussen, 2005; Bach, 2013; El Alaoui and Mahoney, 2015], explicit feature maps designed for additive kernels [Vedaldi and Zisserman, 2012], local approximations [Gramacy and Apley, 2015], multi-scale approximations [Iske, 2004; Katzfuss, 2017], random approximations of the kernel itself, such as random Fourier features [Rahimi and Recht, 2007], spectral methods [Lazaro-Gredilla et al., 2010; Bach, 2017], hash kernels [Shi et al., 2009], parallel programming [Dai et al., 2014] and efficient use of data structures [Wendland, 2005, Section 14].
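As an illustration of one of the approximations listed above, the following sketch builds random Fourier features for a Gaussian RBF kernel and compares the resulting low-rank Gram matrix with the exact one. It assumes the kernel form c(x, x′) = exp(−(x − x′)² / (2σ²)), for which the frequencies are drawn from a normal distribution with standard deviation 1/σ; the function names, the number of features and the grid are illustrative choices. The number of features is deliberately large here so that the approximation error is small; a computational saving only arises when it is much smaller than n.

```python
import numpy as np

def rbf_kernel(x, y, lengthscale=1.0):
    """Gaussian RBF kernel on 1-D inputs (assumed parametrisation)."""
    sq_dists = (x[:, None] - y[None, :]) ** 2
    return np.exp(-sq_dists / (2.0 * lengthscale**2))

def random_fourier_features(x, n_features, lengthscale=1.0, seed=0):
    """Feature matrix Phi such that Phi @ Phi.T approximates the Gram matrix."""
    rng = np.random.default_rng(seed)
    # Frequencies sampled from the spectral density of the assumed kernel.
    omega = rng.standard_normal(n_features) / lengthscale
    # Random phases, uniform on [0, 2*pi).
    phase = rng.uniform(0.0, 2.0 * np.pi, n_features)
    return np.sqrt(2.0 / n_features) * np.cos(x[:, None] * omega[None, :] + phase[None, :])

x = np.linspace(0.0, 10.0, 100)
Phi = random_fourier_features(x, n_features=5000, lengthscale=1.0)

C_exact = rbf_kernel(x, x, lengthscale=1.0)
C_approx = Phi @ Phi.T  # Monte Carlo approximation of the Gram matrix

max_abs_error = np.max(np.abs(C_exact - C_approx))
print(f"maximum entrywise error: {max_abs_error:.3f}")
```

The error of this Monte Carlo approximation is itself random, and the sketch makes no attempt to model it.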
Furthermore, several approaches to improving the conditioning of the linear system discussed in the previous section also reduce the computational cost as a by-product. These include fast multipole methods and compactly supported covariance functions.
This, of course, does not represent an exhaustive list of the (growing) literature on kernel matrix methods. Note that the majority of approximate kernel methods do not come with probability models for the additional source of numerical error introduced by the approximation.