
4.3 Sparse Gaussian Process Emulators

4.3.1 Model Approximations

Quiñonero-Candela and Rasmussen present a unified framework for model approximations [122]. These approaches modify the joint prior $p(\eta_*, \eta)$ of the GP in Eq. (4.14) so that the costly inversion of $K_{\eta,\eta}$ is replaced by a less expensive one. This is achieved by incorporating inducing points $\{Z, u\}$ (where $Z$ is a set of inducing inputs and $u$ are the corresponding latent function evaluations) into the joint prior $p(\eta_*, \eta, u)$ and marginalising the inducing variables $u$ out of the posterior (although $Z$ still affects the final solution). The key assumption of these sparse methods is that $\eta_*$ and $\eta$ are conditionally independent given $u$, so that $\eta_*$ and $\eta$ are linked only through $u$, as shown in Eq. (4.45).

$$
p(\eta_*, \eta) \simeq q(\eta_*, \eta) = \int p(\eta_* \mid u)\, q(\eta \mid u)\, p(u)\, \mathrm{d}u \tag{4.45}
$$

where $p(u) = \mathcal{N}(0, K_{u,u})$ is the prior² for the latent variables $u$, and the test conditional $p(\eta_* \mid u)$ is defined in Eq. (4.46).

$$
p(\eta_* \mid u) = \mathcal{N}\left(K_{*,u} K_{u,u}^{-1} u,\; K_{*,*} - Q_{*,*}\right) \tag{4.46}
$$

The notation $Q_{a,b} = K_{a,u} K_{u,u}^{-1} K_{u,b}$ is used throughout. The two model approximation methods detailed here differ in their assumption about the training conditional $q(\eta \mid u)$, whilst assuming the same prior for the inducing variables and the same likelihood.
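As a minimal sketch of the $Q_{a,b}$ notation, the snippet below builds the low-rank matrix with NumPy. The squared-exponential kernel, its hyperparameter values, the nugget size and the helper names `se_kernel` and `q_matrix` are assumptions made for illustration only; they are not taken from the thesis.

```python
import numpy as np

def se_kernel(A, B, sigma_f2=1.0, lengthscale=0.2):
    """Squared-exponential covariance between row-wise inputs A (n, d) and B (m, d).
    The hyperparameter values are placeholders, not values from the text."""
    sq = (A[:, None, :] - B[None, :, :]) ** 2
    return sigma_f2 * np.exp(-0.5 * sq.sum(axis=-1) / lengthscale**2)

def q_matrix(A, B, Z, nugget=1e-8):
    """Q_{a,b} = K_{a,u} (K_{u,u} + nugget * I)^{-1} K_{u,b}, computed via a Cholesky factor."""
    K_uu = se_kernel(Z, Z) + nugget * np.eye(Z.shape[0])
    L = np.linalg.cholesky(K_uu)
    V_a = np.linalg.solve(L, se_kernel(Z, A))  # L^{-1} K_{u,a}
    V_b = np.linalg.solve(L, se_kernel(Z, B))  # L^{-1} K_{u,b}
    return V_a.T @ V_b

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(30, 1))   # 30 hypothetical training inputs
Z = np.linspace(0.0, 1.0, 6)[:, None]     # 6 hypothetical inducing inputs
Q_xx = q_matrix(X, X, Z)

# Q_xx is N x N but its rank cannot exceed the number of inducing points.
print(Q_xx.shape, np.linalg.matrix_rank(Q_xx, tol=1e-6))
```

The rank bound is what the sparse approximations exploit: every $N \times N$ quantity built from $Q$ can be manipulated through $M \times M$ factors.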

² It is common for a nugget, $\varepsilon I$, to be incorporated here [126] for the same reason as outlined for emulators previously, i.e. it increases the stability of the inversion of the covariance matrix. A nugget is implemented in this thesis, meaning $p(u) = \mathcal{N}(0, K_{u,u} + \varepsilon I)$.


The assumptions for the training conditional $q(\eta \mid u)$ and the resulting marginalised joint prior $p(\eta_*, \eta)$, for both the Deterministic Training Conditional (DTC) and the Fully Independent Training Conditional (FITC) approximations, are shown in Table 4.3. The main difference between DTC and FITC is visible in the joint prior: FITC modifies the top-left block of the covariance so that the approximation retains the exact covariance on the diagonal. This changes the training conditional from deterministic (zero covariance) to fully independent (diagonal covariance).

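$$
\begin{array}{lll}
\hline
\text{Method} & \text{DTC} & \text{FITC} \\
\hline
q(\eta \mid u) & \mathcal{N}\left(K_{\eta,u} K_{u,u}^{-1} u,\; 0\right) & \mathcal{N}\left(K_{\eta,u} K_{u,u}^{-1} u,\; \mathrm{diag}(K_{\eta,\eta} - Q_{\eta,\eta})\right) \\
p(\eta_*, \eta) & \mathcal{N}\left(0,\; \begin{bmatrix} Q_{\eta,\eta} & Q_{\eta,*} \\ Q_{*,\eta} & K_{*,*} \end{bmatrix}\right) & \mathcal{N}\left(0,\; \begin{bmatrix} Q_{\eta,\eta} - \mathrm{diag}(Q_{\eta,\eta} - K_{\eta,\eta}) & Q_{\eta,*} \\ Q_{*,\eta} & K_{*,*} \end{bmatrix}\right) \\
\hline
\end{array}
$$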

Table 4.3: DTC and FITC assumptions for the training conditional $q(\eta \mid u)$ and the joint prior $p(\eta_*, \eta)$. The joint prior is obtained by substituting the training conditional $q(\eta \mid u)$ into Eq. (4.45) and solving the integral, which can be done in closed form.
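The two joint priors in Table 4.3 can be assembled numerically to make the difference concrete. The following is a sketch under stated assumptions: the squared-exponential kernel, its hyperparameters, the nugget and the function names are illustrative choices, and the covariance is ordered with the training block first, matching the block layout in the table.

```python
import numpy as np

def se_kernel(A, B, sigma_f2=1.0, lengthscale=0.2):
    """Squared-exponential covariance; hyperparameter values are placeholders."""
    sq = (A[:, None, :] - B[None, :, :]) ** 2
    return sigma_f2 * np.exp(-0.5 * sq.sum(axis=-1) / lengthscale**2)

def q_matrix(A, B, Z, nugget=1e-8):
    """Q_{a,b} = K_{a,u} (K_{u,u} + nugget * I)^{-1} K_{u,b}."""
    K_uu = se_kernel(Z, Z) + nugget * np.eye(Z.shape[0])
    return se_kernel(A, Z) @ np.linalg.solve(K_uu, se_kernel(Z, B))

def joint_prior_cov(X, X_star, Z, method):
    """Covariance of the approximate joint prior, training block in the top-left corner."""
    Q_ff = q_matrix(X, X, Z)
    Q_fs = q_matrix(X, X_star, Z)
    K_ss = se_kernel(X_star, X_star)
    top_left = Q_ff.copy()
    if method == "FITC":
        # Q - diag(Q - K): keep Q off the diagonal, restore exact K on the diagonal.
        np.fill_diagonal(top_left, np.diag(se_kernel(X, X)))
    elif method != "DTC":
        raise ValueError("method must be 'DTC' or 'FITC'")
    return np.block([[top_left, Q_fs], [Q_fs.T, K_ss]])

X = np.linspace(0.0, 1.0, 20)[:, None]      # hypothetical training inputs
X_star = np.array([[0.25], [0.5], [0.75]])  # hypothetical test inputs
Z = np.linspace(0.0, 1.0, 5)[:, None]       # hypothetical inducing inputs

C_dtc = joint_prior_cov(X, X_star, Z, "DTC")
C_fitc = joint_prior_cov(X, X_star, Z, "FITC")

# The two covariances differ only on the diagonal of the training block.
diff = C_fitc - C_dtc
print(np.count_nonzero(diff - np.diag(np.diag(diff))))
```

In this sketch the DTC prior variances at the training inputs fall below the exact value wherever the inducing points do not fully explain the function, whereas FITC restores the exact prior variance there.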

The predictive posteriors $q(\eta_* \mid y)$ and log marginal likelihoods $\log p(y \mid X)$ for the DTC and FITC approximations can be unified into the analytical forms in Eq. (4.47) and Eq. (4.48) [124]. These are obtained by substituting the assumptions from Table 4.3 into Eq. (4.45) and solving the integral (using the standard Gaussian conditionals in Appendix A.2).

$$
q(\eta_* \mid y) = \mathcal{N}\left(Q_{*,\eta} \bar{K}_{\eta,\eta}^{-1} y,\; K_{*,*} - Q_{*,\eta} \bar{K}_{\eta,\eta}^{-1} Q_{\eta,*}\right) \tag{4.47}
$$

$$
\log p(y \mid X) = -\frac{1}{2} \log \left|\bar{K}_{\eta,\eta}\right| - \frac{1}{2} y^{T} \bar{K}_{\eta,\eta}^{-1} y - \frac{N}{2} \log 2\pi \tag{4.48}
$$

where $\bar{K}_{\eta,\eta} = Q_{\eta,\eta} + \mathrm{diag}(\alpha (K_{\eta,\eta} - Q_{\eta,\eta})) + \nu I$. Setting $\alpha = 0$ gives the DTC marginal likelihood and posterior, and setting $\alpha = 1$ gives those of FITC. After setting $\alpha$, the low-rank-plus-diagonal structure of $\bar{K}_{\eta,\eta}$ should be exploited using the Woodbury inversion and matrix determinant lemmas to improve computational efficiency (see Appendix A.3). With $N$ training points and $M$ inducing points, this reduces the computational complexity of training to $O(NM^2)$, and of prediction to $O(M)$ for the mean and $O(M^2)$ for the variance [122–124].
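To illustrate how the low-rank structure can be exploited, the sketch below evaluates Eq. (4.47) and Eq. (4.48) twice: once directly with the $N \times N$ matrix, and once through the Woodbury identity and matrix determinant lemma using only $M \times M$ factorisations. It is not the implementation from Appendix A.3; the squared-exponential kernel, hyperparameter values, toy data and function names are all assumptions for illustration.

```python
import numpy as np

def se_kernel(A, B, sigma_f2=1.0, lengthscale=0.2):
    """Squared-exponential covariance; hyperparameter values are placeholders."""
    sq = (A[:, None, :] - B[None, :, :]) ** 2
    return sigma_f2 * np.exp(-0.5 * sq.sum(axis=-1) / lengthscale**2)

def sparse_gp_naive(X, y, X_star, Z, alpha, noise_var, nugget=1e-8):
    """Eqs. (4.47)-(4.48) evaluated directly with the N x N matrix K_bar: O(N^3)."""
    N = X.shape[0]
    K_uu = se_kernel(Z, Z) + nugget * np.eye(Z.shape[0])
    K_uf, K_us = se_kernel(Z, X), se_kernel(Z, X_star)
    Kinv_Kuf = np.linalg.solve(K_uu, K_uf)
    Q_ff, Q_sf = K_uf.T @ Kinv_Kuf, K_us.T @ Kinv_Kuf
    K_bar = (Q_ff + np.diag(alpha * (np.diag(se_kernel(X, X)) - np.diag(Q_ff)))
             + noise_var * np.eye(N))
    Kbar_inv_y = np.linalg.solve(K_bar, y)
    mean = Q_sf @ Kbar_inv_y
    var = (np.diag(se_kernel(X_star, X_star))
           - np.sum(Q_sf * np.linalg.solve(K_bar, Q_sf.T).T, axis=1))
    _, logdet = np.linalg.slogdet(K_bar)
    lml = -0.5 * logdet - 0.5 * y @ Kbar_inv_y - 0.5 * N * np.log(2 * np.pi)
    return mean, var, lml

def sparse_gp_woodbury(X, y, X_star, Z, alpha, noise_var, nugget=1e-8, sigma_f2=1.0):
    """Same quantities using only M x M factorisations: O(N M^2).
    sigma_f2 must match the kernel, since K_ii = sigma_f2 for the SE kernel."""
    N = X.shape[0]
    K_uu = se_kernel(Z, Z) + nugget * np.eye(Z.shape[0])
    K_uf, K_us = se_kernel(Z, X), se_kernel(Z, X_star)
    # np.linalg.solve keeps the sketch NumPy-only; a triangular solver is the usual choice.
    L_uu = np.linalg.cholesky(K_uu)
    V_f = np.linalg.solve(L_uu, K_uf)   # diag(Q_ff) = column sums of V_f**2
    V_s = np.linalg.solve(L_uu, K_us)   # diag(Q_**) = column sums of V_s**2
    lam = alpha * (sigma_f2 - np.sum(V_f**2, axis=0)) + noise_var  # diagonal part of K_bar
    A = K_uu + (K_uf / lam) @ K_uf.T    # M x M matrix from the Woodbury identity
    L_A = np.linalg.cholesky(A)
    b = np.linalg.solve(L_A, K_uf @ (y / lam))
    W_s = np.linalg.solve(L_A, K_us)
    mean = W_s.T @ b
    var = sigma_f2 - np.sum(V_s**2, axis=0) + np.sum(W_s**2, axis=0)
    # Matrix determinant lemma: |K_bar| = |Lambda| |A| / |K_uu|
    logdet = (2 * np.sum(np.log(np.diag(L_A))) - 2 * np.sum(np.log(np.diag(L_uu)))
              + np.sum(np.log(lam)))
    quad = y @ (y / lam) - b @ b
    lml = -0.5 * logdet - 0.5 * quad - 0.5 * N * np.log(2 * np.pi)
    return mean, var, lml

rng = np.random.default_rng(1)
X = np.sort(rng.uniform(0.0, 1.0, size=(40, 1)), axis=0)   # toy training inputs
y = np.sin(6.0 * X[:, 0]) + 0.1 * rng.standard_normal(40)  # toy noisy outputs
X_star = np.linspace(0.0, 1.0, 25)[:, None]
Z = np.linspace(0.0, 1.0, 8)[:, None]

results = {}
for name, alpha in [("DTC", 0.0), ("FITC", 1.0)]:
    results[name] = (sparse_gp_naive(X, y, X_star, Z, alpha, 0.01),
                     sparse_gp_woodbury(X, y, X_star, Z, alpha, 0.01))
    naive, fast = results[name]
    print(name, "max |mean difference|:", np.max(np.abs(naive[0] - fast[0])))
```

The only $N$-dependent operations in the second function are products of $M \times N$ matrices with vectors or diagonal scalings, which is where the $O(NM^2)$ training cost comes from.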

The inducing inputs can either be a subset of the training inputs or any set of points on the real line. Restricting them to a subset of the inputs poses challenges when global prediction quality is required, as selecting inducing inputs from a discrete set of data involves some form of greedy or combinatorial optimisation. In contrast, treating the inducing inputs as arbitrary points on the real line leads to a continuous optimisation problem [125], which allows the inducing inputs to be inferred by optimising the log marginal likelihood. When the inducing inputs are equal to the training inputs, the marginal likelihood and the posterior are the same as those of the full GP for both DTC and FITC. A key drawback of model approximation methods is that optimising the approximate marginal likelihood means treating the inducing inputs as parameters of the model, which introduces the overfitting and optimisation problems that are evident in parametric models [124, 127]. This view of the inducing points means that the assumptions about the data and the inference approximations are coupled. Learning via the exact marginal likelihood of the approximate model also means that the hyperparameters will be optimal for the approximate model and not necessarily for the full GP.
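The statement that both approximations coincide with the full GP when the inducing inputs equal the training inputs can be checked numerically. The sketch below assumes a squared-exponential kernel with placeholder hyperparameters, toy data, and a full GP with the same noise term $\nu I$; the function names are hypothetical. Agreement holds only up to the small nugget added to $K_{u,u}$.

```python
import numpy as np

def se_kernel(A, B, sigma_f2=1.0, lengthscale=0.2):
    """Squared-exponential covariance; hyperparameter values are placeholders."""
    sq = (A[:, None, :] - B[None, :, :]) ** 2
    return sigma_f2 * np.exp(-0.5 * sq.sum(axis=-1) / lengthscale**2)

def gaussian_posterior(K_bar, C_sf, k_ss_diag, y):
    """Mean, variance and log marginal likelihood for a given training covariance."""
    N = y.shape[0]
    Kinv_y = np.linalg.solve(K_bar, y)
    mean = C_sf @ Kinv_y
    var = k_ss_diag - np.sum(C_sf * np.linalg.solve(K_bar, C_sf.T).T, axis=1)
    _, logdet = np.linalg.slogdet(K_bar)
    lml = -0.5 * logdet - 0.5 * y @ Kinv_y - 0.5 * N * np.log(2 * np.pi)
    return mean, var, lml

def sparse_gp(X, y, X_star, Z, alpha, noise_var, nugget=1e-8):
    """Unified DTC (alpha=0) / FITC (alpha=1) form, evaluated directly."""
    K_uu = se_kernel(Z, Z) + nugget * np.eye(Z.shape[0])
    Kinv_Kuf = np.linalg.solve(K_uu, se_kernel(Z, X))
    Q_ff = se_kernel(X, Z) @ Kinv_Kuf
    Q_sf = se_kernel(X_star, Z) @ Kinv_Kuf
    K_bar = (Q_ff + np.diag(alpha * (np.diag(se_kernel(X, X)) - np.diag(Q_ff)))
             + noise_var * np.eye(X.shape[0]))
    return gaussian_posterior(K_bar, Q_sf, np.diag(se_kernel(X_star, X_star)), y)

def full_gp(X, y, X_star, noise_var):
    """Standard GP regression with the exact covariance."""
    K = se_kernel(X, X) + noise_var * np.eye(X.shape[0])
    return gaussian_posterior(K, se_kernel(X_star, X),
                              np.diag(se_kernel(X_star, X_star)), y)

X = np.linspace(0.0, 1.0, 10)[:, None]   # toy training inputs
y = np.sin(6.0 * X[:, 0])                # toy outputs
X_star = np.linspace(0.05, 0.95, 7)[:, None]

exact = full_gp(X, y, X_star, noise_var=0.01)
dtc = sparse_gp(X, y, X_star, Z=X, alpha=0.0, noise_var=0.01)   # inducing inputs = training inputs
fitc = sparse_gp(X, y, X_star, Z=X, alpha=1.0, noise_var=0.01)

for name, approx in [("DTC", dtc), ("FITC", fitc)]:
    print(name, "max |mean difference| vs full GP:", np.max(np.abs(approx[0] - exact[0])))
```

With $Z = X$, $Q_{\eta,\eta}$ reduces to $K_{\eta,\eta}$, so the diagonal correction vanishes and the choice of $\alpha$ no longer matters.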

Figures 4.8 and 4.9 present univariate numerical examples in which the simulator output is a sample from a zero-mean GP with an SE covariance, $\sigma_f^2 = 1$ and $\omega = 8$. The examples demonstrate the difference between the two approaches when the hyperparameters $\phi$ and inducing inputs $Z$ are learnt by optimising the log marginal likelihood in Eq. (4.48). Each figure compares one sparse GP method (DTC or FITC) with the full GP solution and the training data, with the mean and ±3σ confidence intervals displayed for the full and sparse GPs. FITC gives a better approximation of the variance than DTC, which tends to overestimate it (due to the deterministic assumption). Signs of overfitting are present in both methods. The DTC variance at X ≈ 0.9 reduces almost to zero, displaying overconfidence in the prediction, whereas it would be expected to increase beyond the last training point, as it does in the full GP solution. Figure 4.9 shows that FITC fits the middle section of the training data well; however, the variance starts to increase before the training data boundary. This indicates that the inducing points have been placed in locations that overfit the middle section of the training data, leading to poor generalisation.
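A simulator output of the kind described above could be generated as in the sketch below. How $\omega$ enters the SE covariance is defined elsewhere in the thesis; the form $k(x, x') = \sigma_f^2 \exp(-\tfrac{1}{2}\omega (x - x')^2)$ used here is an assumption and may differ from the author's parameterisation. The input grid, random seed, jitter and function names are likewise illustrative.

```python
import numpy as np

def se_kernel_omega(A, B, sigma_f2=1.0, omega=8.0):
    """ASSUMED SE form: sigma_f2 * exp(-0.5 * omega * ||x - x'||^2)."""
    sq = (A[:, None, :] - B[None, :, :]) ** 2
    return sigma_f2 * np.exp(-0.5 * omega * sq.sum(axis=-1))

def sample_gp_prior(X, rng, jitter=1e-6):
    """Draw one function sample from the zero-mean GP prior at the inputs X."""
    K = se_kernel_omega(X, X)
    # The jitter keeps the Cholesky factorisation stable for closely spaced inputs.
    L = np.linalg.cholesky(K + jitter * np.eye(X.shape[0]))
    return L @ rng.standard_normal(X.shape[0])

rng = np.random.default_rng(42)
X_grid = np.linspace(0.0, 1.0, 50)[:, None]  # hypothetical univariate input grid
f_sample = sample_gp_prior(X_grid, rng)
print(f_sample.shape)
```

A subset of such a sample, evaluated at chosen training inputs, would then play the role of the training simulator data against which the full and sparse GPs are compared.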


Figure 4.8: Predictions from a sparse DTC GP with 10 inducing points, against a full GP and training simulator data for a numerical example. Shaded regions indicate ±3σ confidence levels.

Figure 4.9: Predictions from a sparse FITC GP with 10 inducing points, against a full GP and training simulator data for a numerical example. Shaded regions indicate ±3σ confidence levels.