4.7.3 Replicate Sampling
4.7.3.2 Balanced Half-Sample Replication
As noted above, taking different samples of pseudoreplicates will introduce variability into the estimates of half-sample variance because of the covariance between the half-samples. These covariances are represented by the cross-product terms involving $d_h d_k$ in equation (4.58). These cross-product terms cancel one another over the entire set of $2^L$ half-samples, or when one uses an "infinite" number of half-sample replications. The question then arises as to whether one can choose a relatively small subset of half-samples for which these terms will also disappear. If this can be done, then the corresponding half-sample estimates of variance will contain all the information available from the total sample.
A simple example will show that it is possible to select a subset of half-sample replications that will have the desired property. Consider a three-strata situation with observations $(y_{11}, y_{12})$, $(y_{21}, y_{22})$ and $(y_{31}, y_{32})$. There are $2^3 = 8$ possible half-sample replicates. Now consider the following subset of four replicates:
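Table 4.6 Sample Data for Balanced Half-Sample Replication
Replicate   Stratum 1   Stratum 2   Stratum 3   Deviation from Mean $(y_{hs,i} - y_{st})$
1           $y_{11}$    $y_{21}$    $y_{31}$    $\tfrac{1}{2}(+W_1 d_1 + W_2 d_2 + W_3 d_3)$
2           $y_{11}$    $y_{22}$    $y_{32}$    $\tfrac{1}{2}(+W_1 d_1 - W_2 d_2 - W_3 d_3)$
3           $y_{12}$    $y_{22}$    $y_{31}$    $\tfrac{1}{2}(-W_1 d_1 - W_2 d_2 + W_3 d_3)$
4           $y_{12}$    $y_{21}$    $y_{32}$    $\tfrac{1}{2}(-W_1 d_1 + W_2 d_2 - W_3 d_3)$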
The signs of the separate terms in the deviations are determined by the definition of $d_h = (y_{h1} - y_{h2})$. It is, of course, immaterial how the two observations within a stratum are numbered originally. Once the numbering is set, however, as in the first replicate, it is maintained in determining the remaining replicates. If these deviations are squared, the first part of each expression is $W_1^2 d_1^2/4 + W_2^2 d_2^2/4 + W_3^2 d_3^2/4$, which is the desired estimate of variance. The second part of each expression contains the cross-product terms, and it can easily be checked that all these cross-product terms cancel when the squared deviations are added over the four replicates. This follows from the fact that the columns of the matrix of signs in the deviations are orthogonal to one another. Thus this set of balanced half-samples can be identified as:
+ + +
+ - -
- - +
- + -
where a plus sign indicates $y_{h1}$, while a minus sign denotes $y_{h2}$. Notice that each element of each stratum appears in half the samples. Thus, the mean of the replicates is an unbiased estimate of the mean of the population, and because of the nature of the "cross-product balance" the variance estimate is also unbiased and unaffected by the correlations inherent in the composition of the individual half-samples.
If one wishes to obtain a set of half-samples that will have this feature of "cross-product balance" for any fixed number of strata, then it becomes necessary to have a method of generating matrices of + and - signs whose columns are orthogonal to one another. A method is described by Plackett and Burman (1943-46, p. 323) for obtaining k x k orthogonal matrices, where k is a multiple of 4. Suppose, for example, that we have 5, 6, 7 or 8 strata. The Plackett-Burman method produces the following 8 x 8 matrix, which is the smallest that can be used for these cases because of the multiple-of-4 restriction. Each row identifies a half-sample, while the columns refer to strata.
+ - - + - + + -
+ + - - + - + -
+ + + - - + - -
- + + + - - + -
+ - + + + - - -
- + - + + + - -
- - + - + + + -
- - - - - - - -
Any set of 5 columns for the 5 strata case (or in general, n columns for the n strata case) defines a set of eight half-sample replicates which will have the property of "cross-product balance". If it is necessary to use the eighth column, the resulting set of half-samples will not have each element appearing an equal number of times. This will not destroy the variance estimating characteristics of the set of half-samples, but it does mean that the average of the eight half-sample means will not necessarily be equal to the overall sample mean. When the number of strata is a multiple of four, it may then be wise to use the next highest multiple of four as the number of half-samples.
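The cross-product balance described above can be checked numerically. The following is a minimal sketch in Python for the three-strata example; the stratum weights and within-stratum differences are hypothetical values chosen only for illustration, and the function names are not taken from the text.

```python
from itertools import combinations

# Signs of the four balanced half-samples (rows) for three strata (columns):
# +1 selects y_h1 and -1 selects y_h2.
SIGNS = [
    (+1, +1, +1),
    (+1, -1, -1),
    (-1, -1, +1),
    (-1, +1, -1),
]


def columns_orthogonal(signs):
    """Return True if every pair of columns has a zero inner product."""
    n_cols = len(signs[0])
    return all(
        sum(row[h] * row[k] for row in signs) == 0
        for h, k in combinations(range(n_cols), 2)
    )


def mean_squared_deviation(signs, weights, diffs):
    """Average over replicates of the squared deviation (1/2) * sum_h s_h W_h d_h."""
    deviations = [
        0.5 * sum(s * w * d for s, w, d in zip(row, weights, diffs))
        for row in signs
    ]
    return sum(dev ** 2 for dev in deviations) / len(deviations)


# Hypothetical stratum weights W_h and differences d_h = y_h1 - y_h2.
weights = [0.5, 0.3, 0.2]
diffs = [4.0, -2.0, 7.0]

balanced = mean_squared_deviation(SIGNS, weights, diffs)
# Desired estimate: sum over strata of W_h^2 d_h^2 / 4, with no cross-products.
target = sum((w * d) ** 2 for w, d in zip(weights, diffs)) / 4

print(columns_orthogonal(SIGNS))
print(abs(balanced - target) < 1e-12)
```

Because the columns of signs are orthogonal, the cross-product terms vanish in the average and the two quantities agree for any choice of weights and differences, not only the values assumed here.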
Since orthogonal matrices of plus and minus ones can be obtained whenever the order of the matrix is a multiple of four, it is always possible to find a set of half-sample replicates having cross-product balance. It follows that the number of half-samples required will be at most four more than the number of strata. Suppose, for example, that the survey design is based on 21 stratification cells. In such a case, it would be necessary to use 24 half-sample pseudoreplicates. A possible set of balanced half-samples for this particular case is shown below. This design was obtained by using the first 21 columns of the construction given in the Plackett-Burman paper. Any two columns of this design are orthogonal, and each element appears in 12 of the 24 replicates. The entire pattern is determined by the first column. The 2nd column is obtained from the 1st by moving each sign down one position and placing the 23rd sign at the top of the second column. This rotation is applied repeatedly to obtain the remaining columns. The 24th position is always '-' and is not involved in the rotation.
Half Stratum Sample 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 1 + - - - - + - + - - + + - - + + - + - + + 2 + + - - - - + - + - - + + - - + + - + - + 3 + + + - - - - + - + - - + + - - + + - + - 4 + + + + - - - - + - + - - + + - - + + - + 5 + + + + + - - - - + - + - - + + - - + + - 6 - + + + + + - - - - + - + - - + + - - + + 7 + - + + + + + - - - - + - + - - + + - - + 8 - + - + + + + + - - - - + - + - - + + - - 9 + - + - + + + + + - - - - + - + - - + + - 10 + + - + - + + + + + - - - - + - + - - + + 11 - + + - + - + + + + + - - - - + - + - - + 12 - - + + - + - + + + + + - - - - + - + - - 13 + - - + + - + - + + + + + - - - - + - + - 14 + + - - + + - + - + + + + + - - - - + - + 15 - + + - - + + - + - + + + + + - - - - + - 16 - - + + - - + + - + - + + + + + - - - - + 17 + - - + + - - + + - + - + + + + + - - - - 18 - + - - + + - - + + - + - + + + + + - - - 19 + - + - - + + - - + + - + - + + + + + - - 20 - + - + - - + + - - + + - + - + + + + + - 21 - - + - + - - + + - - + + - + - + + + + + 22 - - - + - + - - + + - - + + - + - + + + + 23 - - - - + - + - - + + - - + + - + - + + + 24 - - - -
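The rotation rule described above lends itself to a short program. The sketch below, in Python, rebuilds the 24 half-sample design from the first 23 signs of its first column and checks that the columns are orthogonal and that each element is used equally often. The function names are illustrative, and the starting strings are simply the first columns as read from the two sign matrices in this section.

```python
from itertools import combinations


def cyclic_sign_design(first_column, n_strata):
    """Build half-sample sign rows from the rotating part of the first column.

    first_column holds the signs of column 1 for every row except the last.
    Each further column is the previous one moved down one position, with the
    displaced sign placed at the top; the final row is always '-'.
    """
    base = [1 if ch == "+" else -1 for ch in first_column]
    n = len(base)
    rows = [[base[(i - j) % n] for j in range(n_strata)] for i in range(n)]
    rows.append([-1] * n_strata)
    return rows


def is_balanced(rows):
    """Check column orthogonality and equal numbers of + and - in each column."""
    n_cols = len(rows[0])
    orthogonal = all(
        sum(r[h] * r[k] for r in rows) == 0
        for h, k in combinations(range(n_cols), 2)
    )
    equal_use = all(sum(r[h] for r in rows) == 0 for h in range(n_cols))
    return orthogonal and equal_use


# First 23 signs of column 1, read from the table of 24 half-samples.
FIRST_COLUMN_24 = "+++++-+-++--++--+-+----"

design = cyclic_sign_design(FIRST_COLUMN_24, n_strata=21)
first_row = "".join("+" if s > 0 else "-" for s in design[0])
print(len(design), len(design[0]), is_balanced(design))
print(first_row)
```

The same function reproduces the first seven columns of the 8 x 8 matrix when called as `cyclic_sign_design("+++-+--", 7)`.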
The use of replication, random half-sample replication, or balanced half-sample replication provides a means whereby the variability in a sample estimate can be obtained from samples of any degree of complexity. It should be noted, however, that this estimate of variability can be obtained only after the sample has been drawn and the data has been collected. Therefore, replication methods cannot be used to assist in the determination of the sample size. However, replication methods can be used on existing samples of various designs to enable the calculation of design effects which can then be used in sample design as outlined in the previous section.
There are many other ways of estimating variance directly, including various forms of "jack-knifing" techniques (Brillinger, 1966; Efron, 1981, 1983). The essence of jack-knifing is to remove one observation from the sample and recalculate the mean of the resultant sample. By choosing which observation to remove in a systematic manner, akin to the orthogonal matrix technique described for replication, the mean and variance of the jack-knife samples can be made to provide unbiased estimates of the population parameters.
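As an illustration of the idea only, the following Python sketch applies the basic delete-one jack-knife to the mean of a simple random sample. This is a simplifying assumption: it does not implement the systematic deletion schemes for stratified designs referred to above, and the data values are hypothetical.

```python
def jackknife_mean_variance(values):
    """Delete-one jack-knife estimates of the mean and of the variance of the mean."""
    n = len(values)
    total = sum(values)
    # Mean of the sample with observation i removed, for each i in turn.
    leave_one_out = [(total - v) / (n - 1) for v in values]
    jack_mean = sum(leave_one_out) / n
    jack_var = (n - 1) / n * sum((m - jack_mean) ** 2 for m in leave_one_out)
    return jack_mean, jack_var


sample = [3.0, 5.0, 8.0, 12.0, 7.0]  # hypothetical observations
mean, variance = jackknife_mean_variance(sample)
print(mean, variance)
```

For the sample mean, this estimate coincides with the familiar $s^2/n$, which gives a convenient check on the calculation.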
4.8 DRAWING THE SAMPLE
The final stage in the sampling process is the actual drawing of a sample from the sampling frame. In some cases (for example, with systematic sampling), the procedure is very simple and can be easily automated, either in the office or in the field (although care should be taken to ensure that the sampling procedure is adhered to in the field). In most situations, however, the sample should be drawn by reference to a random process.
Ideally, the process should be truly random, i.e. a sequence of independent events with each outcome being equally probable. However, the only true random processes are strictly physical events such as tossing a coin or rolling a die. In most situations, such random processes are too time consuming to be useful in sample selection. We must, therefore, resort to some form of "pseudo-random" process which can quickly and easily generate a set of random numbers for use in sampling.
Two forms of such random number generation processes are commonly used: look-up tables and recursive mathematical equations. The use of tables of random numbers is widespread, and several publications have compiled such tables. The best known of these is "Rand's One Million Random Numbers" (The Rand Corporation, 1955), although there are several other compilations (Owen, 1962; Kendall and Smith, 1939). In addition, most statistics textbooks contain a reproduction of part of one of the more complete compilations. A table of random numbers is included as Appendix B in this book.
In using random number tables to select a sample, the first step is to number all sampling units in the sampling frame. The order of this numbering is immaterial, and should be done for maximum convenience. The next step is to pick a starting point in the table of random numbers to be used (anywhere will do) and then systematically work through the table until the required number of random numbers has been selected. In sampling without replacement, the final sample of random numbers must contain no replications. In using tables of random numbers, any selection procedure is acceptable so long as it is systematic. For example, numbers may be read down or across the page, from left to right or right to left. In addition, numbers may be truncated in any way to obtain
numbers in the desired range, e.g. if the tabulated numbers are in five digit groups, as in Appendix B, and the user wants random numbers between 0 and 99 then either the first two or the last two (or any other two) digits of the five digit group may be used. The numbers may also be modified systematically to obtain numbers in the desired range, provided that no bias is introduced. For example, if random numbers in the range 0 to 39 are required, the user could sample numbers between 0 and 99, discarding numbers greater than 39. This procedure, while not introducing bias, is, however, wasteful of random numbers. A more efficient procedure is to subtract 40 from those numbers in the range 40 to 79 and then use the result as a valid random number in the range 0 to 39. Those numbers from 80 to 99 would still need to be discarded because they do not have the same range as the required numbers and hence, if included (by subtracting 80), would bias the selection towards numbers in the range 0 to 19.
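The range-reduction rule in this example can be written as a short Python sketch. The function names and the list of two-digit numbers are hypothetical; in practice the numbers would be read systematically from a table such as Appendix B.

```python
def fold_to_0_39(two_digit):
    """Map a two-digit random number (0-99) to 0-39, or None if it must be discarded."""
    if two_digit < 40:
        return two_digit
    if two_digit < 80:
        return two_digit - 40
    return None  # 80-99 would over-represent 0-19 if folded back


def draw_without_replacement(random_numbers, sample_size):
    """Work through the numbers in order, skipping discards and repeats."""
    selected = []
    for number in random_numbers:
        unit = fold_to_0_39(number)
        if unit is not None and unit not in selected:
            selected.append(unit)
        if len(selected) == sample_size:
            break
    return selected


# Hypothetical two-digit numbers as they might be read from a table.
table_numbers = [87, 12, 52, 40, 12, 95, 79, 3]
print(draw_without_replacement(table_numbers, sample_size=4))  # [12, 0, 39, 3]
```

Each value from 0 to 39 is reached by exactly two of the hundred possible two-digit numbers, so the folding introduces no bias while discarding only one number in five rather than three in five.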
Sometimes, tables of random numbers are not readily available for use in the field. In such situations, it is useful to be aware of other, more readily available sources of random numbers. One source which is almost universally available is a telephone directory. Whilst not recommended for large scale surveys, a telephone directory does provide a useful source of random numbers if used with due care. Thus, for example, random numbers could be selected by taking digits in the right-hand columns of the telephone number. For most telephone directories a maximum of four digits should be selected from any one telephone number (otherwise the biasing effect of a limited number of telephone exchange codes can become significant). Within these restrictions, the principle is again to choose numbers systematically. For example, choose a page and column at random, and then read down the list of numbers in that column.
The major difference in using a telephone directory to obtain random numbers, as opposed to using a specific table of random numbers, is that whereas the table of random numbers has been checked for randomness before publication, the numbers obtained from the telephone directory carry no such guarantee. It is therefore necessary to ensure that the numbers obtained are indeed (approximately) random, by means of a series of simple checks. As a minimum, three tests should be conducted: first, plot the frequency distribution of the listed numbers and perform a goodness-of-fit test (e.g. the Kolmogorov-Smirnov test); second, calculate the mean and standard deviation of these numbers and compare them with the values expected for uniformly distributed numbers; and third, perform a runs test to check whether the listed numbers are ordered non-randomly. The use of these tests is described in Fishman (1973) and Knuth (1969).
Whilst the above methods of using "look-up" tables are convenient when one is drawing a relatively small sample, they become rather cumbersome when one is drawing a large sample, or drawing repeated samples, especially when using a computer. The storage space required to store tables of random numbers can be quite large, especially when large independent repeated samples are required. For that reason, use is often made of "pseudo-random" numbers which are generated by a recursive mathematical equation. The use of an equation to generate random numbers may at first appear to be in direct contradiction of the concept of random numbers because each pseudo-random number so generated is completely determined by its predecessor and, consequently, all numbers are determined by the initial "seed" number. While this is true, the critical point is that the random numbers so generated can pass the statistical tests for uniformity and independence required of truly random numbers. For sampling purposes they are therefore indistinguishable from real random numbers.
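The simple checks mentioned above for digits taken from a telephone directory apply equally to the output of a pseudo-random generator. The Python sketch below is one possible form of them, under stated assumptions: it works on single digits 0 to 9, it uses a chi-square goodness-of-fit statistic in place of the Kolmogorov-Smirnov test because it is simpler to code, and it uses the usual normal approximation for the runs test. The function names and the test sequence are hypothetical.

```python
import math


def chi_square_uniform_digits(digits):
    """Chi-square statistic comparing digit counts with equal expected counts."""
    expected = len(digits) / 10
    return sum((digits.count(k) - expected) ** 2 / expected for k in range(10))


def mean_and_sd(digits):
    """Sample mean and standard deviation (uniform digits give about 4.5 and 2.87)."""
    n = len(digits)
    mean = sum(digits) / n
    variance = sum((x - mean) ** 2 for x in digits) / (n - 1)
    return mean, math.sqrt(variance)


def runs_test_z(digits, cut=4.5):
    """Normal-approximation z statistic for the number of runs above/below cut."""
    above = [x > cut for x in digits]
    n1 = sum(above)
    n2 = len(above) - n1
    runs = 1 + sum(1 for a, b in zip(above, above[1:]) if a != b)
    expected = 2 * n1 * n2 / (n1 + n2) + 1
    variance = (2 * n1 * n2 * (2 * n1 * n2 - n1 - n2)
                / ((n1 + n2) ** 2 * (n1 + n2 - 1)))
    return (runs - expected) / math.sqrt(variance)


# Hypothetical, deliberately ordered digits: equal frequencies but long runs.
ordered = list(range(10)) * 5
print(chi_square_uniform_digits(ordered))
print(mean_and_sd(ordered))
print(runs_test_z(ordered))
```

The ordered sequence shows why all three checks are needed: its frequencies and mean are exactly as expected for uniform digits, yet the runs statistic is far outside the range of about -1.96 to +1.96 that would be consistent with random ordering at the 5% level. For the frequency check with ten digit classes, a chi-square value above roughly 16.9 (9 degrees of freedom, 5% level) would suggest non-uniformity.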
While there exist a number of different random number generator algorithms (see Knuth, 1969) the most common type is the linear congruential method. In this method, numbers are generated by an equation of the form:
$x_i = (a x_{i-1} + c) \bmod m$ (4.61)
where $x_0$ = an initially specified seed number
Expanding equation (4.61) such that the modulo notation (mod m) is removed, we obtain:
$x_i = a x_{i-1} + c - \left[ \frac{a x_{i-1} + c}{m} \right] m$ (4.62)
where [ ] = integer portion of value inside brackets.
To convert the integer number obtained from equation (4.62) to a random number within a smaller specified range (A,B), the following transformation is applied:
$x_i(A, B) = \frac{x_i}{m} (B - A) + A$ (4.63)
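Equations (4.61) to (4.63) translate directly into code. The following Python sketch is illustrative only: the constants a = 5, c = 3, m = 16 and the seed 7 are hypothetical values chosen so that the arithmetic can be followed by hand, and they are far too small for practical sampling.

```python
def lcg(seed, a, c, m):
    """Yield successive integers x_i = (a * x_{i-1} + c) mod m, as in (4.61)."""
    x = seed
    while True:
        x = (a * x + c) % m
        yield x


def lcg_step_expanded(x, a, c, m):
    """One step written without the modulo operator, as in (4.62)."""
    value = a * x + c
    return value - (value // m) * m


def scale(x, m, low, high):
    """Transform x in [0, m) to the range from low up to (not including) high, as in (4.63)."""
    return x / m * (high - low) + low


# Illustrative constants only (a small generator, not for real use).
A, C, M, SEED = 5, 3, 16, 7

generator = lcg(SEED, A, C, M)
first_five = [next(generator) for _ in range(5)]
print(first_five)  # [6, 1, 8, 11, 10]
print([scale(x, M, 10, 20) for x in first_five])
```

Since $x_i$ is always less than m, the scaled value can equal A but never quite reaches B.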
The critical factor in the use of such random number generators is to choose appropriate values of a, c and m. The selection of these values will depend on the computer being used to perform the calculations. Generally, each computer will have a random number generator routine with appropriate values of a, c and m which have been found to give satisfactory statistical results on that computer. One problem often noted with pseudo-random number generators is that, because they are deterministically calculated, then once a number is repeated an entire sequence of repetitions must follow. Obviously no more than m different random numbers can be generated, although the period p can be substantially less than m with inappropriate choice of a and c. Specific rules apply for maximising p (see Fishman, 1973). However, this problem of periodicity is more of a concern when using pseudo-random numbers in discrete event simulation modelling than it is in survey sampling procedures. Only when the sample size is
very large might periodicity become a problem (and even then it can be avoided by suitable choice of a, c and m).
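The dependence of the period on the choice of a and c can be seen with a small experiment. The Python sketch below measures the length of the cycle that a generator of the form (4.61) eventually enters; the modulus 16, the seed 7 and the four pairs of constants are hypothetical values picked to show a full period alongside several shorter ones.

```python
def cycle_length(a, c, m, seed):
    """Length of the cycle that x_i = (a * x_{i-1} + c) mod m eventually enters."""
    first_seen = {}
    x = seed % m
    step = 0
    while x not in first_seen:
        first_seen[x] = step
        x = (a * x + c) % m
        step += 1
    # The repeated value marks the start of the cycle; earlier values are a lead-in.
    return step - first_seen[x]


# Hypothetical small generators with m = 16 and seed 7.
for a, c in [(5, 3), (5, 2), (3, 0), (4, 2)]:
    print(a, c, cycle_length(a, c, 16, 7))
```

With these values only a = 5, c = 3 attains the maximum period of 16; the other choices give periods of 8, 4 and 1 respectively, the last because the sequence quickly becomes stuck on a single value.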
Following selection, and checking, of a set of random numbers by one of the above methods, it then remains to select those units on the sampling frame with the corresponding number and then include them on the sample list for use in the survey.