Common design selection methods - Gaussian Process Emulators in coastal wave modelling

In general, for a GPE to give robust predictions, the design events should be well spread out, preferably as far from each other as possible, and should cover the entire input variable space (Iooss et al., 2010). There are many different ways of selecting design points.

The regular grid method, used in the LUT approach explored in Section 3.6, does not work well with GPEs because of its collapsing property: when the design is projected onto a single variable axis, multiple points share the same coordinate value (Camus et al., 2011b). Moreover, with a regular grid the number of events in the design grows rapidly as the number of dimensions increases.
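Both drawbacks can be seen in a small sketch. The function name `regular_grid` and the choice of five levels per axis on the unit interval are assumptions made purely for illustration; they are not taken from the LUT approach itself.

```python
from itertools import product


def regular_grid(levels_per_dim, n_dims):
    """Full-factorial grid on [0, 1]^n_dims with the same levels on every axis."""
    levels = [i / (levels_per_dim - 1) for i in range(levels_per_dim)]
    return list(product(levels, repeat=n_dims))


for n_dims in (1, 2, 3, 4):
    grid = regular_grid(5, n_dims)
    # Projecting onto the first axis keeps only the distinct first coordinates.
    distinct_on_first_axis = len({point[0] for point in grid})
    print(n_dims, len(grid), distinct_on_first_axis)
```

The design size is the number of levels raised to the power of the number of dimensions, while the projection onto any one axis never contains more distinct values than there are levels.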

A popular design selection method for GPEs is Latin Hypercube Sampling (LHS); see, for example, Urban and Fricker (2010). A set of design points is selected subject to an input probability density constraint that ensures values are evenly spread across each input dimension. LHS is similar to the regular grid method in that a single value is selected within each defined grid cell (or hypercube); unlike the regular grid method, however, no two design events have the same value for a single parameter. Urban and Fricker (2010) compare Latin hypercube and grid ensemble designs for a multivariate GPE and discuss the advantages and disadvantages of each. They recommend that, although a grid design may be appropriate in some cases for sensitivity analysis, Latin hypercube designs should be preferred over grid ensemble designs when the primary concern is to use the GPE for prediction. Iooss et al. (2010) note that, because LHS is merely a form of stratified random sampling and is not tied to a particular optimality criterion, the resulting GPE predictions may have poor accuracy. Enhancements to LHS therefore adopt an optimality criterion such as entropy or the maximin and minimax distances. More information about these designs can be found in Morris and Mitchell (1995), Johnson et al. (1990), Jones and Johnson (2009) and Oakley (1999).
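As an illustration of the stratification idea, a basic LHS on the unit hypercube can be sketched as follows. This is a minimal example under the assumption of independent uniform inputs; the function name `latin_hypercube` is hypothetical, and none of the optimality criteria mentioned above (entropy, maximin, minimax) is applied. Other marginal densities would be obtained by passing each coordinate through the corresponding inverse distribution function.

```python
import random


def latin_hypercube(n_points, n_dims, seed=None):
    """Basic Latin hypercube sample of n_points in [0, 1)^n_dims."""
    rng = random.Random(seed)
    columns = []
    for _ in range(n_dims):
        # One random draw inside each of the n_points equal-width strata.
        strata = [(k + rng.random()) / n_points for k in range(n_points)]
        # Shuffling pairs the strata at random across dimensions.
        rng.shuffle(strata)
        columns.append(strata)
    return [tuple(column[i] for column in columns) for i in range(n_points)]


design = latin_hypercube(5, 2, seed=0)
for event in design:
    print(event)
```

Each of the five strata of every input variable contains exactly one design event, so no two events collapse onto the same value when projected onto an axis.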

Other approaches that have been considered are sequential designs. Here, a number of events are selected to run initially and a GPE is fitted to them. The GPE is then used to predict the simulator output, and further events are selected based on the predicted response so as to minimise (or maximise) a chosen criterion. Typically, the events chosen in the iterative steps are those with the highest prediction variance. Such designs can improve the performance of the GPE very efficiently (Iooss et al., 2010), but they have to be combined with another design selection method to make the initial choice of design events. Sobol sequences are a common technique of this kind and have the advantage over Latin hypercube designs that they can be built sequentially (Caflisch and Morokoff, 1994). However, they do not guarantee that each dimension will have a uniform selection of events.
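The variance-based sequential step can be sketched in a few lines. The example below is only illustrative and makes several assumptions that are not part of the text: a zero-mean GPE with a squared-exponential covariance, unit prior variance, a fixed length scale of 0.3 and a small nugget for numerical stability. All function names are hypothetical. With fixed hyperparameters the posterior variance does not depend on the simulator output, so no simulator runs appear here; in practice the GPE would be refitted after each new run.

```python
import math


def sq_exp(a, b, length_scale=0.3):
    """Squared-exponential covariance with unit prior variance (assumed)."""
    return math.exp(-math.dist(a, b) ** 2 / (2.0 * length_scale ** 2))


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x


def posterior_variance(x, design, nugget=1e-8):
    """GP posterior variance at x given the inputs already in the design."""
    K = [[sq_exp(a, b) + (nugget if i == j else 0.0)
          for j, b in enumerate(design)] for i, a in enumerate(design)]
    k_star = [sq_exp(x, d) for d in design]
    alpha = solve(K, k_star)
    return sq_exp(x, x) - sum(k * a for k, a in zip(k_star, alpha))


def sequential_design(candidates, initial, n_add):
    """Repeatedly add the candidate with the highest posterior variance."""
    design = list(initial)
    for _ in range(n_add):
        remaining = [c for c in candidates if c not in design]
        design.append(max(remaining, key=lambda c: posterior_variance(c, design)))
    return design


candidates = [(i / 10,) for i in range(11)]
print(sequential_design(candidates, [(0.0,)], 2))
```

Starting from the single event at 0, the sketch first adds the opposite end of the interval and then the midpoint, which is where the emulator is least certain at each step.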

In this thesis, we assume that the data cannot be described by closed-form expressions of probability density. In other words, we assume that we already have a set of predetermined events at which we would like to run the simulator; other combinations of parameter values may not represent plausible scenarios to model. The Latin hypercube approach is therefore less appropriate for our application.

As an alternative, in this thesis we explore the maximum dissimilarity algorithm (MDA) as described by Camus et al. (2011a) and as applied in the context of coastal analysis by Camus et al. (2011b) and Gouldby et al. (2014). This algorithm analyses the events using a measure of the distance between points in the multidimensional space. Having normalised the input variables, and given an initial event (the starting point), the MDA selects as the next point the event that is furthest away in Euclidean distance in the multidimensional space. The method outputs a set of design points that efficiently represents all the events in X.
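A minimal sketch of this selection is given below. It assumes min-max normalisation of each input variable and, for the third and subsequent points, the max-min rule in which the dissimilarity between a candidate and the current design is its distance to the nearest selected event. Both choices, and the function names, are assumptions made for illustration rather than a statement of the exact implementation used here.

```python
import math


def normalise(events):
    """Min-max scale every input variable to [0, 1]."""
    lows = [min(column) for column in zip(*events)]
    highs = [max(column) for column in zip(*events)]
    return [
        tuple((v - lo) / (hi - lo) if hi > lo else 0.0
              for v, lo, hi in zip(event, lows, highs))
        for event in events
    ]


def mda(events, n_design, start=0):
    """Return the indices of n_design events chosen by greedy max-min distance."""
    points = normalise(events)
    selected = [start]
    # Distance from every event to its nearest already-selected event.
    nearest = [math.dist(p, points[start]) for p in points]
    while len(selected) < min(n_design, len(points)):
        nxt = max(range(len(points)), key=nearest.__getitem__)
        selected.append(nxt)
        nearest = [min(d, math.dist(p, points[nxt]))
                   for d, p in zip(nearest, points)]
    return selected


events = [(0, 0), (1, 0), (0, 1), (1, 1), (0.5, 0.5), (0.1, 0.1)]
print(mda(events, 3))
```

Starting from the first event, the sketch picks the opposite corner of the unit square and then one of the two remaining corners, leaving the interior events until the extremes of the space have been covered.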

Selecting the “best” design

Selecting the “best” design depends on how the user defines best. Would they like to minimise the error over the entire input space? Would they like to minimise the posterior variance over the input space? Would they like simply to classify events, so that the actual value of the output is less important? Selecting the best design is therefore quite subjective and depends on what the user would like to use the GPE predictions for. It is important that these aims are clear before choosing a design selection method and, ultimately, a design.

In most applications, it may be of interest to minimise the root mean square prediction error of all the events in X. Mathematically, the criterion we are trying to minimise can be written as:

$$\sqrt{\frac{\sum_{i=1}^{N} \left( \eta(x_i) - \mathrm{E}[\eta(x_i) \mid D] \right)^2}{N}},$$

where $\eta(x)$ represents the simulator output and $\mathrm{E}[\eta(x_i) \mid D]$ denotes the posterior mean prediction of an event derived from a GPE fitted using the design, $D$. In this case, the design we select would include events that are well spread out and efficiently cover the entire event parameter space. The design selection method that we propose for this criterion is the MDA technique.
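For a given design, the criterion can be evaluated directly once the simulator outputs and the GPE posterior means are available. The short sketch below assumes these are held in two equal-length lists; the function name and the numbers are made up purely for illustration.

```python
import math


def rmse(simulated, predicted):
    """Root mean square error between simulator outputs and GPE posterior means."""
    if not simulated or len(simulated) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = ((s - p) ** 2 for s, p in zip(simulated, predicted))
    return math.sqrt(sum(squared_errors) / len(simulated))


# Made-up values: only the third event is predicted with any error.
print(rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
```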

In other applications, it may be of interest to minimise the RMSE on a predetermined selection of events. In this case the design may include more events that fall into the predetermined selection and may be more clustered than a design selected using MDA. The design selection method that we propose for this criterion is a weighted MDA. This is similar to MDA but has a weight associated with each event. The Euclidean distance between two events is multiplied by the weights associated with those events, so that, for a fixed Euclidean distance between them, the weighted distance between two higher-weighted events is larger than that between two lower-weighted events.
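One way to read this description is that the distance between two events is scaled by the product of their two weights. The sketch below adopts that reading as an assumption, takes the inputs as already normalised, and reuses the greedy max-min selection; the function name `weighted_mda` and the example weights are hypothetical.

```python
import math


def weighted_mda(points, weights, n_design, start=0):
    """Greedy max-min selection using weight-scaled Euclidean distances."""

    def weighted_distance(i, j):
        # Assumed form: Euclidean distance multiplied by both event weights.
        return weights[i] * weights[j] * math.dist(points[i], points[j])

    n = len(points)
    selected = [start]
    nearest = [weighted_distance(i, start) for i in range(n)]
    while len(selected) < min(n_design, n):
        remaining = (i for i in range(n) if i not in selected)
        nxt = max(remaining, key=nearest.__getitem__)
        selected.append(nxt)
        nearest = [min(nearest[i], weighted_distance(i, nxt)) for i in range(n)]
    return selected


points = [(0.0,), (0.4,), (1.0,)]
print(weighted_mda(points, [1.0, 1.0, 1.0], 2))  # equal weights: plain MDA
print(weighted_mda(points, [1.0, 3.0, 1.0], 2))  # middle event up-weighted
```

With equal weights the second design event is the one at the far end of the interval; tripling the weight of the middle event makes its weighted distance from the starting point the larger of the two, so it is selected instead.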

Finally, we look at an application where the only interest is in the correct classification of events. This applies to scenarios where the accuracy of the predicted simulator output is not important, but the classification is. For this case, we propose a sequential design in which we first select a small number of events (using MDA, for example) and fit a GPE to them. We then use the GPE to predict the simulator output for all other events, and select further design events based on the uncertainty around their classification (using both the mean and the standard deviation of the GPE predictions).
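The selection step can be illustrated with a simple case. The sketch below assumes a two-class problem in which an event is classified by whether its output exceeds a threshold, and a Gaussian predictive distribution for each event; under those assumptions the most uncertain events are those whose exceedance probability is closest to 0.5. The threshold, function names and numbers are hypothetical and are not taken from the case study.

```python
import math


def exceedance_probability(mean, std, threshold):
    """P(output > threshold) under a Gaussian predictive distribution."""
    if std <= 0.0:
        return 1.0 if mean > threshold else 0.0
    z = (threshold - mean) / std
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def most_ambiguous(means, stds, threshold, n_select):
    """Indices of the events whose classification is least certain."""
    probs = [exceedance_probability(m, s, threshold)
             for m, s in zip(means, stds)]
    ranked = sorted(range(len(probs)), key=lambda i: abs(probs[i] - 0.5))
    return ranked[:n_select]


# Made-up GPE posterior means and standard deviations for four events.
means = [1.0, 2.9, 5.0, 3.5]
stds = [0.2, 0.5, 0.3, 2.0]
print(most_ambiguous(means, stds, threshold=3.0, n_select=2))
```

The second event is selected because its mean lies close to the threshold, and the fourth because its large standard deviation leaves its class uncertain even though its mean is further away. This is why both the mean and the standard deviation are needed.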

In the sections that follow, we consider these three applications and explore the associated design selection methods in more depth. We also present a case study comparing the proposed design selection methods with randomly selected designs.

In document Gaussian Process Emulators in coastal wave modelling (Page 127-130)