• No results found

4.2 Gaussian Process Emulators

4.2.2 Validation and Diagnostics

Discussed in Chapter 3, validation is an important process in the construction of any model type, with GP emulators being no exception. A GP can provide poor emulation of a simulator for two main reasons. Firstly, the model form assumed by the initial GP prior is not appropriate for the simulator’s functional form. This can occur if any component of the prior is ill-suited to the functional structure of the simulator. For example, the mean function could be inappropriate, e.g. if a polynomial mean is used for a periodic simulator output. The covariance function could also impose incorrect assumptions, for example a stationary kernel is employed when the correlation is input dependent (where some regions have a faster functional transition that others) or if a SE kernel is utilised when the simulator output is non-smooth. These problems are often solved with better model selection. On the other hand, if joint normality is an unreasonable assumption for the simulator outputs, and no transform of the output distribution possible, or if the outputs are

4.2. GAUSSIAN PROCESS EMULATORS 71

uncorrelated, then a GP may be an invalid assumption. A second reason is when the training data used to estimate the hyperparameters does not fully reflect, or is a restricted case of the general space in which the emulator is expected to generalise. This may lead to poor estimates of the hyperparameters that do not generalise. In addition, if MLEs of the hyperparameters are employed (as stated in Section 4.2) rather than integrating them out of the full posterior, and these estimates are in the tails of the hyperparameter distribution due to inappropriate training data, the posterior prediction will be poor as it is conditioned on these hyperparameters. As a result diagnostics are required in order to improve predictions and generate a valid emulator for a specific task.

In the following section diagnostic tools, validation metrics and their specific applica- tion to GP emulators are presented. These tools are implemented on a numerical example which is displayed in Fig. 4.2. The simulator equation is shown in Eq. (4.30), where a grid of ten (N = 10) evenly spaced evaluations are used as training data D = {x, y} for the emulator. Validation data is constructed from the inputs x∗ = {−1, 0.99, . . . , 1} and their corresponding outputs y∗. The GP prior is formed from a linear mean function H = 1 of dimension p = 1 and a SE kernel. The MLE estimates of the hyperparameters are: ˆω = 50.83, ˆσf2 = 0.84 and β = 1.27 (with a fixed nugget ν = 1 × 10−8).

y = η(x) = 2 cos(2π × 2.5x) − 2.5 cos(2π × 2.2x) + 0.15 cos(2π × 6x) + 5x − 2 (4.30)

Individual Prediction Error (IPE) (or standardised residuals) allow an input depen- dant assessment of the emulator predictions and are formulated via Eq. (4.31).

DIP E(y∗) =

y− E (η) pdiag(V (η∗))

(4.31)

When the posterior is constructed from Eq. (4.23), and the emulator represents the simulator well, these residuals should be distributed as a standard Student t distribution of N − p degree of freedoms (conditioned on the training data D and hyperparameters ψ). As the number of data points and degrees of freedom increase, (or the posterior formulation in Eq. (4.15) is implemented) then the standardised residuals will tend to a Gaussian distribution. The degrees of freedom (N − p) equal

Figure 4.2: Posterior emulator prediction for the diagnostics numerical example.

9 from the training data in Fig. 4.2 meaning that the standardised residuals should tend to a Student’s t. Visually a QQ-Plot, where the quantiles of two distributions are compared, can indicate whether the residuals are Student’s t distributed. Fig. 4.3 presents two graphical interpretations of IPE, a comparison with the input index (Fig. 4.3a), and a QQ-Plot (Fig. 4.3b).

IPE should remain low, with large values indicating problems, and clusters of high standardised residuals indicating a systematic failure. To diagnose the cause several avenues must be explored. When values are large and clustered close to validation points the problem may be that the roughness parameters Ω are too small and have been poorly inferred due to an inadequate training data set. If the values are large but no systematic patterns can be determined, the estimate of the signal variance σf2 is likely to be the problem. When a large number of high IPEs are shown and they are of the same sign, the mean function and β coefficient have been inappropriately specified or a non-stationary kernel should be used. A heuristic definition of ‘large’ can be |DIP E| > 2 [115] (the same definition is used for all residual diagnostics). Figure 4.3a therefore indicates that the residuals are appropriate. It can be seen that the locations of worst performance are near the ends of the function, with a clear patterns visible. This indicates that although the emulator is adequate in terms of IPE residuals, the functional form has not been fully captured, potentially due to the roughness parameter in the SE kernel. More training points around the ends of

4.2. GAUSSIAN PROCESS EMULATORS 73 (a) -4 -3 -2 -1 0 1 2 3 4 -1.5 -1 -0.5 0 0.5 1 1.5 (b)

Figure 4.3: IPE diagnostic. Panel (a) displays the diagnostic against the input index with ±2 thresholds and panel (b) a QQ-Plot of the residuals.

the function, or a change in covariance function to a Mat´ern kernel (that captures less smooth functions better, see Table 4.1 — when ν = ∞, Mat´ern = SE) could be possible remedies. The QQ-Plot indicates that the IPEs are approximately standard Student’s t distributed, however the area of steeper gradient indicates a possible underestimation of the signal variance σ2f.

Another method for generating standardised residuals, that removes correlation unlike IPEs, are variance decomposition approaches. Here a standard deviation matrix G, capturing the cross terms, is generated from a decomposition of the posterior covariance matrix, i.e V (η∗) = GGT. The standardised errors, with uncorrelated elements and unit variances are formed from Eq. (4.32).

DV D(y∗) = G−1(y∗− E (η∗)) (4.32)

These residuals, in the same manner as IPEs, indicate if the normality assumption is invalid, being standard Student’s t distributed with N − p degrees of freedom when Eq. (4.23) is used. This means that a QQ-Plot can be used as a graphical test. Likewise as with IPEs, large values and systematic patterns diagnose problems with the emulator, however their interpretation will change based on the decomposition. Various decompositions can be used such as an eigenvalue, Cholesky or a pivoted Cholesky decomposition. Figure 4.4 presents a demonstration of the diagnostic using a pivoted Cholesky decomposition. By sorting the validation data by largest variance, conditioned on the previous element, the pivoted approach produces the permutation

50 100 150 200 -2 -1.5 -1 -0.5 0 0.5 1 1.5 2 (a) -4 -3 -2 -1 0 1 2 3 4 -1.5 -1 -0.5 0 0.5 1 1.5 (b)

Figure 4.4: Standard residuals via a pivoted Cholesky decomposition. Panel (a) displays the diagnostic against the pivoting order from the decomposition with ±2 thresholds and panel (b) a QQ-Plot of the residuals.

matrix P and the upper triangular matrix R where the standard deviation becomes G = P RT. If the standardised residuals are high/low in the initial part, this indicates that σf2 is inappropriate or that the function is heterogeneous, whereas the end of the standardised residuals will indicate issues with the covariance function structure and/or roughness parameters Ω. Figure 4.4a, although presenting residuals within the thresholds, evidences that the variance of the function could be better captured. This is stated by the large DP C at the beginning of the pivot order and may infer that either σ2

f is underestimated or that the roughness parameter ω is too long. This is confirmed by the QQ-Plot where although the majority of data lies on the reference line, heavy tails indicate larger variability than estimate by the emulator. When compared with IPEs, it can be seen that the cause of this seems to be because the estimated emulator function smooths through the first and last simulator points. Mahalanobis distances can be implemented as a summary statistic or diagnostic and is a measure of the distance between a point and an ellipse. The metric can be formulated from the variance decomposition, DM D(y∗) = DV D(y∗)TDV D(y∗) or from Eq. (4.33).

DM D(y∗) = (y∗− E (η∗))TV (η∗) −1

(y− E (η)) (4.33)

If the emulator outputs are fully independent then the Mahalanobis distance with the full posterior covariance V (η∗) and variance diag(V (η∗)) matrices will be equal.

4.2. GAUSSIAN PROCESS EMULATORS 75

(a) (b)

Figure 4.5: Panel (a) presents the individual Mahalanobis distances DM D, when independence of the outputs is assumed. Panel (b) displays the individual log posterior likelihoods p (y| ν, η) (where ν = N − p, the degrees of freedom), when independence of the outputs is assumed.

A comparison of these two values can help diagnose the degree of correlation in the posterior; 11.47 and 42.28 respectively for the example indicate a degree of dependence in the outputs. A large Mahalanobis distance would indicate poor emulation of the simulator.

Individual Mahalanobis distances can be calculated by assuming the posterior at each test point is independent. A visualisation of this metric presents a scale of the distance between posterior predictions and test points. Figure 4.5a demonstrates the application of this diagnostic. It can clearly be seen that predictions are worst at the edges of the function, again stating that the variation of the simulator has not been fully captured, indicating the roughness parameter may be too long.

The posterior density is a scaled version of the Mahalanobis distance. Here the posterior PDF is assessed for the predictive point; interpreted as a posterior likelihood — the likelihood of the test point being drawn from the model. The PDF will either be Gaussian or Student’s t depending on the posterior equations. When Eq. (4.23) is implemented the diagnostic is formulated as shown in Eqs. (4.34) and (4.35).

p (y∗| N − p, η∗) = Zt  1 + 1 N − p(y∗− E (η∗)) T V (η∗) −1 (y∗− E (η∗)) −N −p+n2 (4.34)

Zt=

Γ N −p+n2 

Γ N −p2  (N − p)

−n(π)−n/2

|V (η∗)|−1/2 (4.35)

Where n is the number of test points in y, Zt is the normalising constant and Γ(·) is a gamma function. For numerical reasons the diagnostic is often used in log form, called the log posterior likelihood (i.e. log p (y∗| N − p, η∗)). Both dependent and independent forms can be calculated (in the same fashion to the Mahalanobis distance), however the log posterior likelihood is generally less interpretable than the Mahalanobis distance. For the numerical example the log posterior likelihood is 1366 and −117 when V (η∗) and diag(V (η∗)) are used respectively. This would further indicate that the test data is most likely to have come from the emulator with dependence between the outputs, meaning the independence assumption is not maintained. Furthermore, the log posterior likelihood can be calculated individually, assuming independent outputs (i.e. using diag(V (η∗))), as shown in Fig. 4.5b. The log posterior likelihood is high at the training points and decreases away from these locations, with global minimums around the start and end of the function. This reinstates that the emulator performs poorly in these regions.

Model criticism via MMD is also achievable (see Section 3.4 for mathematical details) [49], however it is excluded from this discussion as it is best suited for situations where the model is compared to stochastic outputs.

Standard regression diagnostics that treat the output as deterministic can also be implemented. These measures will fail to fully explain the performance of a GP emulator due to its probabilistic formulation and therefore should never solely be used to assess the emulator validity. These scores are primarily useful for comparing GPs with other deterministic regression approaches and assessing the posterior mean. Two diagnostics are presented: NMSE and R2 score.

The NMSE formulation, presented in Eq. (4.36), is a highly interpretable diagnostic. A score of zero indicates mean predictions without any error. Conversely, a score of 100 represents a scenario where the prediction is no better than taking the mean of the true values (¯y = E (y)).

N M SE = 100

N σ2 y∗

X

4.2. GAUSSIAN PROCESS EMULATORS 77

Where σ2y

∗ is the variance of test outputs. The mean prediction from the emulator has a NMSE of 6.43, which implies a relatively good fit.

The R2 score (or coefficient of determination) is a measure of how well the mean prediction explains the test points variation (a ratio of the model explained variation over the total variation). The R2 score can be calculated using Eq. (4.37), where a score of zero indicates that 0% of the variation is explained by the model and 1 represents that 100% of the variation is captured.

R2 = 1 − P(y∗− η∗) 2

P(y− ¯y)2 (4.37)

An issue arises that as more basis functions are added to H, the R2 score will always increase, meaning that the score will improve with overfitting. Instead the adjusted R2 should be used to take into account the degrees of freedom of the data and the model, shown in Eq. (4.38).

aR2 = 1 − (1 − R2) N − 1

N − p − 1 (4.38)

This addition means that the score is penalised as more basis functions are added, meaning that it will favour minimally complex models. The adjusted R2 score for the example is 0.93, and would indicate that the emulator mean captures the variation in test outputs well.

By considering all the diagnostics presented, the emulator in Fig. 4.2 can be shown to be functionally appropriate, with the simulator test data lying comfortably within the predicted probability mass. Improvements could be made in order to better capture the beginning and end of the function. These improvements could be to change the covariance function to a Mat´ern class, to increase the training data with evaluations near the beginning and end of the function or to improve the estimates of σ2

f and ψ.