Combined design for validation - Design for validating Gaussian process emulators

5.3 Design for validating Gaussian process emulators


We are interested in a design that maximises this criterion, because it yields validation input points whose minimum distance to the training data points varies from small to large, and hence input points at which the emulator uncertainty varies from low to high. The proposed validation design is

$$X^{(v)}_{\mathrm{VarDist}} = \max_{X^{(v)} \subset \mathcal{X}} \Upsilon_{\mathrm{VarDist}}\left(X^{(v)}\right). \tag{5.4}$$

Solving (5.4) requires a high-dimensional optimisation algorithm, which increases the computational cost considerably. Alternatively, we could choose distance-based designs for validation within the class of Latin hypercube designs.

In practice, we generate a large number of Latin hypercube designs, which is computationally cheap, and select the design that optimises the criterion (5.3). This procedure gives an approximation to the optimal validation design within the class of Latin hypercube designs. Hence, a distance-based Latin hypercube design for validation is obtained by sampling many Latin hypercube designs and choosing the one with the largest value of the criterion (5.3). Note that this procedure does not guarantee validation points arbitrarily close to training points.
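The selection procedure can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used for the results in this chapter. Criterion (5.3) is not restated in this section, so the function `var_dist` below is an assumed stand-in that takes the sample variance of the minimum distances to the training inputs. The helper names (`latin_hypercube`, `min_distances`, `best_lhs_for_validation`) and the unit-hypercube input space are also assumptions.

```python
import numpy as np

def latin_hypercube(m, p, rng):
    """Return an m x p Latin hypercube sample on [0, 1]^p."""
    # One point per stratum in each dimension, with strata randomly permuted.
    strata = np.column_stack([rng.permutation(m) for _ in range(p)])
    return (strata + rng.random((m, p))) / m

def min_distances(X_val, X_train):
    """Euclidean distance from each validation point to its nearest training point."""
    diff = X_val[:, None, :] - X_train[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2)).min(axis=1)

def var_dist(X_val, X_train):
    # ASSUMPTION: stand-in for criterion (5.3), taken here as the sample
    # variance of the minimum distances to the training inputs.
    return min_distances(X_val, X_train).var(ddof=1)

def best_lhs_for_validation(X_train, m, n_candidates=1000, seed=0):
    """Sample many Latin hypercube designs and keep the one with the largest criterion."""
    rng = np.random.default_rng(seed)
    p = X_train.shape[1]
    best, best_score = None, -np.inf
    for _ in range(n_candidates):
        cand = latin_hypercube(m, p, rng)
        score = var_dist(cand, X_train)
        if score > best_score:
            best, best_score = cand, score
    return best, best_score

# Usage: 7 training inputs in two dimensions, 6 validation inputs.
rng = np.random.default_rng(1)
X_train = latin_hypercube(7, 2, rng)
X_val, score = best_lhs_for_validation(X_train, m=6, n_candidates=500)
```

Because the search is restricted to Latin hypercube designs, each candidate keeps its one-point-per-stratum structure, which is why the selected design cannot place validation points arbitrarily close to the training inputs.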

5.3.2 Combined design for validation

The distance-based validation design selects points in different uncertainty regions according to the optimisation in (5.4). As an alternative, we propose a two-part sampling procedure. In the first part, the data are chosen according to any independent design described in Section 5.2. In the second part, the input points are randomly chosen from a low emulation uncertainty region. These low emulation uncertainty regions are defined as regions near the training input points. Such a design is called a combined validation design.

Let $m$ be the sample size for the validation data. Under this combined design, $m_1$ validation points are chosen in the first part of the sampling procedure, and $m_2 = m - m_1$ validation points are chosen in such a way that the correlation between each validation point and the closest training data point is high. The combined design for validation of size $m$ is defined by

$$X^{(v)}_C = \begin{bmatrix} X^{(v)}_{C1} \\ X^{(v)}_{C2} \end{bmatrix},$$

where $X^{(v)}_{C1}$ contains the inputs from an independent design, and $X^{(v)}_{C2}$ contains the inputs from a region whose points are highly correlated with some training data.

For sampling $X^{(v)}_{C1}$, we can use any space-filling design described in Section 5.2. However, we must guarantee that all validation inputs are different from the training inputs; otherwise the effort would be wasted, since the simulator is a deterministic function and rerunning it at a training input returns an output that is already known. An independent design for validation can have input points in regions with uncertainty varying from low to high. However, the training data size is generally small, since the simulator is generally computationally expensive. Therefore, it is likely that the validation inputs from an independent design lie in regions with medium to high uncertainty.

To sample $X^{(v)}_{C2}$, we use only the training inputs and not the outputs. We need to fix a maximum distance, $d_0$, considered close in the input space. This guarantees that the validation inputs are chosen from regions with low uncertainty. Centred on a training input, $x$, a low uncertainty region would be the ellipsoid

$$R_x = \left\{ (z_1, \ldots, z_p) \in \mathcal{X} : \sum_k (z_k - x_k)^2 < d_0 \right\}. \tag{5.5}$$

The validation input points, $X^{(v)}_{C2}$, are chosen using the following strategy:

1. Fix a maximum distance, $d_0$, considered close;
2. Randomly choose a training input, $x$;
3. Define the ellipsoid region $R_x$, equation (5.5);
4. Randomly choose a validation point from a uniform distribution in region $R_x$;
5. Repeat 2-4 to generate $m_2$ validation points.
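The strategy above can be sketched as follows. This is an illustrative sketch under stated assumptions rather than a definitive implementation: the input space is taken to be the unit hypercube, the uniform draw in $R_x$ is obtained by simple rejection sampling from the bounding cube, and the function name `sample_near_training` and its arguments are hypothetical.

```python
import numpy as np

def sample_near_training(X_train, m2, d0, lower=0.0, upper=1.0, seed=0):
    """Draw m2 validation inputs, each uniform in a region R_x of (5.5)."""
    rng = np.random.default_rng(seed)
    n, p = X_train.shape
    radius = np.sqrt(d0)  # (5.5) bounds the *squared* distance by d0
    points = []
    for _ in range(m2):
        # Pick a training input at random (with replacement across repeats).
        x = X_train[rng.integers(n)]
        while True:
            # Propose uniformly in the cube that encloses the region R_x.
            z = x + rng.uniform(-radius, radius, size=p)
            inside_region = np.sum((z - x) ** 2) < d0
            # ASSUMPTION: the input space is the box [lower, upper]^p.
            inside_space = np.all((z >= lower) & (z <= upper))
            if inside_region and inside_space:
                points.append(z)
                break
    return np.array(points)

# Usage with three hypothetical training inputs in two dimensions.
X_train = np.array([[0.2, 0.3], [0.7, 0.8], [0.5, 0.1]])
X_c2 = sample_near_training(X_train, m2=5, d0=0.01)
```

Rejection from the bounding cube is adequate in low dimensions, but its acceptance rate falls quickly as the number of inputs grows, so a direct sampler for the region would be preferable for large $p$.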

It is not simple to fix a distance considered close in the input space, particularly in high dimensions. As an alternative, we could set a value, say $\rho$, for the minimum correlation between two points in the input space that is still considered high. The strategy for choosing the validation points $X^{(v)}_{C2}$ remains the same, but the ellipsoid region is now given by

$$R_x = \left\{ z = (z_1, \ldots, z_p) \in \mathcal{X} : C_\delta(z, x) > \rho \right\}, \tag{5.6}$$

where $C_\delta(\cdot, \cdot)$ is a correlation function with correlation parameter $\delta$. It is necessary to choose a correlation function and also to fix a value for the correlation parameter $\delta$. If the training outputs are available, $\delta$ can be estimated from its posterior distribution (2.19). Note that the ellipsoid region (5.5) is just a particular case of the ellipsoid region (5.6).
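To see why (5.5) is a particular case of (5.6), it helps to write the region out for one concrete correlation function. The derivation below assumes a Gaussian correlation function with one length parameter $\delta_k$ per input; this form is an assumption made for illustration, since this section does not fix $C_\delta$.

```latex
% ASSUMPTION: Gaussian correlation function with one length parameter per input.
C_\delta(z, x) = \exp\left\{ -\sum_{k=1}^{p} \left( \frac{z_k - x_k}{\delta_k} \right)^2 \right\}

% Taking logarithms of C_\delta(z, x) > \rho, with 0 < \rho < 1:
C_\delta(z, x) > \rho
\iff \sum_{k=1}^{p} \left( \frac{z_k - x_k}{\delta_k} \right)^2 < -\ln \rho

% Equal lengths \delta_1 = \dots = \delta_p = \delta recover (5.5):
\sum_{k=1}^{p} (z_k - x_k)^2 < -\delta^2 \ln \rho = d_0
```

Under this assumption the region is an ellipsoid centred on $x$ with semi-axes $\delta_k \sqrt{-\ln \rho}$. As an illustrative calculation, $\delta = 0.5$ and $\rho = 0.8$ give $d_0 = -0.25 \ln 0.8 \approx 0.056$, that is, a semi-axis of about $0.24$ in each direction.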

Figure 5.2 illustrates the proposed combined design for validation in a two-dimensional example. In Figure 5.2 (a), $n = 7$ training inputs were selected from a Latin hypercube design. In Figure 5.2 (b), $m_1 = 4$ validation inputs, $X^{(v)}_{C1}$, were chosen from an independent Latin hypercube design. Finally, in Figure 5.2 (c), $m_2 = 2$ training inputs were randomly chosen. For each chosen input, a region $R_x$, equation (5.6), is defined. The validation points are randomly sampled from a uniform distribution in each region. The correlation lengths used are $\psi_1 = \psi_2 = 0.5$, and the minimum correlation considered high was $\rho = 0.80$.
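A design with the same sizes as this example can be generated with the short sketch below. It is not the code behind Figure 5.2 and will not reproduce the plotted points. Several choices are assumptions: the correlation function is taken to be Gaussian, the reported correlation lengths are used directly as its length parameters `delta`, the input space is the unit square, and all function names are hypothetical.

```python
import numpy as np

def latin_hypercube(m, p, rng):
    """Return an m x p Latin hypercube sample on [0, 1]^p."""
    strata = np.column_stack([rng.permutation(m) for _ in range(p)])
    return (strata + rng.random((m, p))) / m

def gaussian_corr(z, x, delta):
    # ASSUMPTION: Gaussian correlation function; the text does not fix C_delta.
    return np.exp(-np.sum(((z - x) / delta) ** 2))

def sample_high_corr(x, delta, rho, rng):
    """Uniform draw from {z in [0,1]^p : C_delta(z, x) > rho} by rejection."""
    half_width = delta * np.sqrt(-np.log(rho))  # semi-axes of the ellipsoid
    while True:
        # Propose uniformly in the box that encloses the ellipsoid.
        z = x + rng.uniform(-1.0, 1.0, size=x.size) * half_width
        if gaussian_corr(z, x, delta) > rho and np.all((z >= 0) & (z <= 1)):
            return z

def combined_design(X_train, m1, m2, delta, rho, seed=0):
    """Stack an independent Latin hypercube part and a high-correlation part."""
    rng = np.random.default_rng(seed)
    p = X_train.shape[1]
    X_c1 = latin_hypercube(m1, p, rng)  # independent part
    # Training inputs used as centres, drawn with replacement
    # (the text does not say which; this is an assumption).
    centres = X_train[rng.integers(len(X_train), size=m2)]
    X_c2 = np.array([sample_high_corr(x, delta, rho, rng) for x in centres])
    return np.vstack([X_c1, X_c2]), X_c1, X_c2

# Sizes taken from the example: n = 7, m1 = 4, m2 = 2, rho = 0.80.
rng = np.random.default_rng(42)
X_train = latin_hypercube(7, 2, rng)
delta = np.array([0.5, 0.5])
X_v, X_c1, X_c2 = combined_design(X_train, m1=4, m2=2, delta=delta, rho=0.80)
```

Only the training inputs enter the construction, so the validation design can be fixed before any simulator output is available, provided values for the correlation parameters are chosen in advance.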

In document Validating Gaussian Process Models in Computer Experiments (Page 91-93)