
3.5 Verification and validation

3.5.1 A method for generating large LHDs

For validation studies, particularly in computer experiments, a large number of input designs can be very useful. Ideally, each of these should have good space-filling and orthogonality properties, and the combined design should possess them too. However, as designs increase in size, particularly if they are being made to fit some optimality criterion, both generation and storage can become difficult. Moreover, while a large design as a whole may have desirable properties, these properties are not necessarily retained when it is divided into c designs of size m.

In this section, a novel method is introduced for constructing large Latin hypercubes in such a way that neither generation nor storage need become an issue. The method is compared to several alternatives, and is used to perform a validation study with one million input points in Chapter 5. The method turns out to be similar in its goal to that of Qian (2012).

[Figure 3.2: plot with axes x1 and x2, each running from 0.0 to 1.0]

Figure 3.2: A 20-point LHD with k = 2 produced using c = 4 and m = 5, arranged using a 4-point index LHD with 2 dimensions, whose columns are the vectors (3, 4, 2, 1) and (1, 3, 2, 4).

Staggered Latin Hypercube Designs

We will assume that the designs for the validation study are each of size m and dimension k, and that there are c of them in total. The product m × c, the total size of the combined design, will be denoted by N.

The staggered Latin hypercube method uses multiple smaller LHDs to build a larger design which is itself an LHD. To build each column of the design, we split the interval [0, 1] into m sub-intervals, each of which is then divided into c pieces. Altogether this divides [0, 1] into N parts. Then for i = 1, ..., c an m-point LHD is built whose co-ordinates can only be in a particular piece of each of the m sub-intervals.

To avoid having a regimented structure in this design, rather than assign every point for sub-LHD i into the ith piece of the sub-intervals in every dimension, a further Latin hypercube, the index LHD, is built, containing c points in k dimensions. Then, the piece of each sub-interval to which the points in the ith sub-LHD are assigned in each dimension is determined by the ith row of the c × k index LHD. A staggered LHD is shown in Figure 3.2.

While this method ensures that both the N-point and m-point designs are Latin hypercubes, at no point does information about the entire design have to be kept together. This is a great advantage, since the memory required for a large design can severely limit the design sizes available. The sub-LHDs need not be stored or generated together, since the index LHD (which is usually relatively small) can be stored and used to generate each part of the design separately. Ensuring that these designs are LHDs allows us to claim the advantageous properties mentioned by Stein (1987), and the staggered method enables this to be done at relatively low cost.
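The construction can be illustrated with a minimal Python sketch. It is not the implementation used for this work (which was carried out in R); the function names, the zero-based indexing, and the choice to place each point uniformly at random within its cell are all assumptions made for illustration. The point to note is that `staggered_chunk` needs only one row of the index LHD, so each sub-LHD can be generated independently of the others.

```python
import random

def random_permutation(n, rng):
    """Return a random permutation of 0, ..., n-1."""
    perm = list(range(n))
    rng.shuffle(perm)
    return perm

def make_index_lhd(c, k, rng):
    """Build a c x k index LHD: every column is a permutation of 0, ..., c-1."""
    cols = [random_permutation(c, rng) for _ in range(k)]
    return [[cols[d][i] for d in range(k)] for i in range(c)]

def staggered_chunk(i, index_lhd, m, rng):
    """Generate the i-th m-point sub-LHD using only row i of the index LHD.

    In dimension d, sub-interval j of [0, 1) is [j/m, (j+1)/m) and its
    piece p is [(j*c + p)/N, (j*c + p + 1)/N), with N = m*c.
    Assumption: points are placed uniformly at random inside their piece.
    """
    c = len(index_lhd)
    k = len(index_lhd[0])
    n_total = m * c
    cols = []
    for d in range(k):
        piece = index_lhd[i][d]             # piece assigned to chunk i in dimension d
        order = random_permutation(m, rng)  # one point per sub-interval, in random order
        cols.append([(j * c + piece + rng.random()) / n_total for j in order])
    return [[cols[d][p] for d in range(k)] for p in range(m)]

# Usage: the configuration of Figure 3.2 (c = 4, m = 5, k = 2), with the
# index columns (3, 4, 2, 1) and (1, 3, 2, 4) shifted to zero-based values.
rng = random.Random(2012)
figure_index = [[2, 0], [3, 2], [1, 1], [0, 3]]
design = []
for i in range(len(figure_index)):
    # Each chunk could equally be generated on a different machine.
    design.extend(staggered_chunk(i, figure_index, 5, rng))
```

Because every chunk occupies a different piece of each sub-interval in every dimension, the 20 points together fill each of the N = 20 cells exactly once per dimension, while each 5-point chunk fills each of the m = 5 sub-intervals exactly once.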

We found that an effective strategy was to use distributed computing to generate the design, and a table in a database to store it, using the R package ‘RMySQL’ (James and DebRoy, 2010). Parts of the design can then easily be accessed and used in R (R Development Core Team, 2011).

Comparison Study

Table 3.1 compares summaries for designs built using the staggered LHD method with summaries for some intuitive alternative methods for building large designs. In this study, m = 100, c = 100 and therefore N = 10^4. The final two options both use the staggered LHD method, one by building unconstrained LHDs and one using the maximin algorithm described in Section 3.3.1. We were interested to see whether imposing the maximin criterion on each sub-design had any positive or detrimental effect on the overall design, and also to see how the properties of the sub-designs compared.

The methods, as numbered in Table 3.1, are:

1. Generate c unconstrained LHDs, each of size m (imposing no constraint),

2. Generate c maximin LHDs, each of size m, using the algorithm from Grosso et al. (2009) (explained in Section 3.3.1),

3. Generate one N-point unconstrained LHD and, using random sampling, split it into c chunks of size m,

4. Generate one N-point maximin LHD and, using random sampling, split it into c chunks of size m,

5. Generate a staggered design with c chunks of size m, where each chunk is an unconstrained LHD,

6. Generate a staggered design with c chunks of size m, where each chunk is maximin.

          Maximum correlation                     Minimum pairwise distance
Method    Over N points      Over sets of m       Over N points      Over sets of m
                             points                                  points
1         0.0251 (0.0398)    0.374 (0.465)        0.156 (0.102)      0.258 (0.157)
2         0.0234 (0.0354)    0.357 (0.472)        0.153 (0.0824)     0.565 (0.523)
3         0.0247 (0.0405)    0.374 (0.445)        0.153 (0.0957)     0.254 (0.169)
5         0.0246 (0.0365)    0.376 (0.462)        0.151 (0.0889)     0.255 (0.161)
6         0.0233 (0.0338)    0.361 (0.410)        0.153 (0.0998)     0.561 (0.507)

Table 3.1: Summaries of designs built by the methods listed above, where each design contains c = 100 chunks of size m = 100, and has dimension k = 10. The figures shown are the means over 100 repetitions of the experiment, with the worst figure given in brackets.

Of these, the overall designs created by the first two methods are not LHDs, and nor are the sub-designs created by the third and fourth. The fourth method, which involves the generation of a maximin LHD of size N , is unrealistic for large designs, and so is not shown in the results table.
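A small Python sketch can make the Latin hypercube claim concrete. The helper `is_lhd` and the tiny hand-picked designs below are hypothetical illustrations, not part of the study: the check treats a design on [0, 1) as an LHD if, in every dimension, each of the n equal cells contains exactly one point. The split shown is a fixed one chosen to fail; a random split, as used in the third and fourth methods, could produce an LHD by chance, but nothing guarantees it.

```python
import math

def is_lhd(design):
    """True if every column has exactly one point in each of n equal cells of [0, 1)."""
    n = len(design)
    k = len(design[0])
    for d in range(k):
        cells = sorted(math.floor(row[d] * n) for row in design)
        if cells != list(range(n)):
            return False
    return True

# Two independent 2-point LHDs, stacked as in the first two methods.
chunk_a = [[0.25, 0.75], [0.75, 0.25]]
chunk_b = [[0.25, 0.25], [0.75, 0.75]]
stacked = chunk_a + chunk_b

# One 4-point LHD, split into two chunks as in the third and fourth methods.
big = [[0.125, 0.625], [0.375, 0.125], [0.625, 0.875], [0.875, 0.375]]
first_half, second_half = big[:2], big[2:]

print(is_lhd(chunk_a), is_lhd(chunk_b), is_lhd(stacked))  # True True False
print(is_lhd(big), is_lhd(first_half), is_lhd(second_half))  # True False False
```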

It seems that the staggered LHD method, shown in the bottom two rows of Table 3.1, produces overall designs with similar orthogonality and space-filling properties to those created by the other methods mentioned. Whether the sub-LHDs were created using the maximin algorithm seems to make little difference to the overall N-point design in terms of correlation or minimum pairwise distance.
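For reference, the two summaries used in the comparison can be computed along the following lines. This is a sketch under assumptions: the text does not state which correlation or distance measure was used, so the largest absolute Pearson correlation between pairs of columns and the smallest Euclidean distance between pairs of points are assumed here, and the function names are hypothetical.

```python
import math

def max_abs_correlation(design):
    """Largest absolute Pearson correlation between any two columns (assumed measure)."""
    n = len(design)
    k = len(design[0])
    cols = [[row[d] for row in design] for d in range(k)]
    centred = []
    for col in cols:
        mean = sum(col) / n
        centred.append([x - mean for x in col])
    norms = [math.sqrt(sum(x * x for x in col)) for col in centred]
    worst = 0.0
    for a in range(k):
        for b in range(a + 1, k):
            dot = sum(x * y for x, y in zip(centred[a], centred[b]))
            worst = max(worst, abs(dot / (norms[a] * norms[b])))
    return worst

def min_pairwise_distance(design):
    """Smallest Euclidean distance between any two points (assumed measure)."""
    best = math.inf
    for i in range(len(design)):
        for j in range(i + 1, len(design)):
            best = min(best, math.dist(design[i], design[j]))
    return best

# Usage on the corners of the unit square: uncorrelated columns, unit spacing.
square = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(max_abs_correlation(square))    # 0.0
print(min_pairwise_distance(square))  # 1.0
```

The pairwise loop is quadratic in the number of points, so for a design of N = 10^4 points it involves roughly 5 × 10^7 distance evaluations; a vectorised or tree-based search would be preferable at larger scales.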

In document Comparing multiple simulators using Bayesian emulators (Page 48-52)