Choice of correlation function - Gaussian process emulators with derivatives

Emulating function output

4.2 Gaussian process emulators with derivatives

4.2.2 Choice of correlation function

To calculate the correlation between two derivatives we differentiate the correla-tion funccorrela-tion twice, thus we should take the differentiability of the funccorrela-tion into consideration when selecting a correlation function. The most common approach, and the one we use regularly throughout this thesis, is to define the correlation function to have the Gaussian form: c(x_i, x_j) = exp{−(x_i − x_j)^TΘ(x_i − x_j)}, where Θ is a diagonal matrix of positive smoothness parameters, θ{k} with k ∈ {1, . . . , p} and p is the number of inputs. Note that alternative notation also popular in the literature is to replace the ‘smoothness’ parameter, θ{k}, with a correlation length, l{k}, such that l{k} = √¹

θ{k}. Throughout this thesis though we adopt the ‘smoothness’ parameter, θ{k}, notation. The correlation then between a point, x_i, and a derivative w.r.t input k at point j, x^(k)_j , is:

the correlation between two derivatives w.r.t input k but at points i and j is:

and finally the correlation between two derivatives w.r.t inputs k and l, where k 6= l, at points i and j is:

How the correlations vary as the distance between points changes is shown for the Gaussian covariance function in Figure 4.1. Killeya (2004) look further at the covariances involving derivatives and these are shown, for the Gaussian covariance function with varying levels of smoothness controlled by the parameter θ, in Figure 4.1 as well.

The Gaussian correlation function, described above, belongs to a class of correlation functions known as the exponential power form, used by Sacks et al.

(1989), and given by: make use of this flexibility though and must set ρ = 2 for all k, which results in the Gaussian form. This is because if we set ρ = 1 we are left with the exponential covariance function which is not differentiable, and neither are the resulting functions when ρ < 1. Finally, for 1 < ρ < 2, the resulting function is only once differentiable.

The Mat´ern class of correlation functions are an alternative to the exponential power functions. Stein (1999) recommends the use of this form of correlation function due to its flexibility, yet still with a manageable number of parameters.

The form of a Mat´ern correlation function is:

c(x_i, x_j) = 2^1−ν where K_ν is a modified Bessel function. The parameter ν must be positive and is often half-integer, resulting in a much simpler form of (4.1). The value assigned to ν controls the differentiability of the function and if we let ν → ∞ the result

-3 -2 -1 0 1 2 3

-0.50.00.51.0

|x_i− xj|

Covariance

(a) θ = 0.5

-3 -2 -1 0 1 2 3

-1.0-0.50.00.51.01.52.0

|x_i− xj|

Covariance

(b) θ = 1

-3 -2 -1 0 1 2 3

-20246

|x_i− xj|

Covariance

Covariance between points

Covariance between derivatives and points Covariance between derivatives

Figure 4.1: Gaussian covariance function shown as the distance between points varies.

is the Gaussian form, while setting ν = ¹₂ gives rise to the exponential correlation function. Rasmussen and Williams (2006) suggest that ν = ³₂ and ν = ⁵₂ are most interesting in the subject of machine learning but again for the purpose of this thesis we must ensure twice differentiability. The Mat´ern correlation function with ν = ³₂ is only once differentiable but if we set ν = ⁵₂ the resulting correlation function is twice differentiable and therefore is a valid alternative to the Gaussian form in this work. See Rasmussen and Williams (2006) for further discussion of correlation functions.

The correlation functions discussed in this section are summarised in Table 4.1, where we adopt the notation Q = (xi− x_j)^TΘ(x_i− x_j). Table 4.1: A summary of some correlations functions.

4.2.3 Design

We now need to create a design which consists of a set of points in the in-put space at which the simulator or adjoint is to be run to create the training sample. A design for the standard case results in an ordered set of n points:

D = {x₁, x₂, . . . , x_n} where each x is a location in the input space. Here though, we need a design which in addition to specifying the location of the inputs, also determines at which points we require function output and at which points we re-quire first derivatives. We arrange this information in the design ˜D = {(x_k, d_k)}, where k = {1, . . . , ˜n} and d_k ∈ {0, 1, . . . , p}. We have x_k which refers to the

location in the design and d_k determines whether at point x_k we require function output or a first derivative w.r.t one of the inputs. Each x_k is not distinct as we may have a derivative and the function output at point x_k, or we may require a derivative w.r.t several inputs at point x_k. The simulator, η(·), or the adjoint or derivative, of the simulator, ˜η(·), (depending on the value of each d), is then run at each of the input configurations. This results in our training data: ˜y = ˜η( ˜D), a vector of length ˜n.

Morris et al. (1993) look at optimal design when derivatives are observed in addition to the model response at all the locations x_i. Through an example they compare 4 designs, each of size n = 10. The example model has 8 input dimensions so ˜n = 90. The designs are as follows:

1. Latin hypercube design. For more detail on Latin hypercube samples see Chapter 3.

2. Maximin design. Maximin designs tend to favour the corners of the design space and for this reason an initial design, Dinitial, is created where each input dimension is assigned the value 0 or 1 with equal probability. The measure φ_initial is then calculated for this design where,

φ = 1

We have d_ij which is the distance between sites i and j in the design, D, and d(D) is the minimum distance between any two sites in D. All distances are Euclidean. This measure is chosen instead of simply calculating d(D) because in addition to maximising the minimum distance, this design aims to minimise the number of pairs of input sites which are separated by the minimum distance. An input site, xi, is then randomly selected from Dinitial

and each input dimension of x_i changed from 0 to 1 (or vice versa, depending on its initial value) with probability 0.3, to produce x_cand. The measure φ is recalculated and if φ_cand < φ_initial then the candidate input is accepted.

The algorithm continues for a specified number of iterations.

3. Maximin Latin hypercube design. This design is produced by generating a random Latin hypercube as an initial design and φ_initial, as defined in design (2.), calculated. A candidate new design, D_cand is generated by exchanging

2 entries of a randomly chosen column and φ_cand is then calculated. As in design (2.), the candidate design is accepted if φ_cand < φ_initial and the algorithm continues in this way.

Note that this method differs to how we generate maximin Latin hypercube designs later in this chapter and also in Chapters 5 and 6. The method we adopt is to generate a number of random Latin hypercube samples and then select the design which has the maximum minimum distance.

4. Modified maximin design. This design is generated by take the resulting maximin design of (2.), which has five 0s and five 1s in each column.

The 0’s within each of column are then randomly replaced with the val-ues 0,¹₉,²₉,³₉,⁴₉. Similarly the 1’s are replaced with ⁵₉,⁶₉,⁷₉,⁸₉, 1.

Morris et al. (1993) compare the designs described above via the prediction error of an emulator resulting from each design and find that the two ‘compromise’

designs, numbers (3.) and (4.), appear to be superior.

The choice of the sample size, n, for the standard case is discussed in Loeppky et al. (2009) and as a rule of thumb, n = 10p, where p is the number of inputs is recommended. There is not, however, such a guide for what ˜n might be. If we choose to obtain function output and the first derivatives w.r.t to all inputs at every location in the design, then we would expect that fewer than 10p locations would be required; how many fewer though, is difficult to estimate. This is discussed further in Section 4.7 but without any formal results.

In document Using derivative information in the statistical analysis of computer models (Page 59-64)