Summary. A different approach to statistical inference is outlined based not on a probabilistic model of the data-generating process but on the randomization used in study design. The implications of this are developed in simple cases, first for sampling and then for the design of experiments.
9.1 General remarks
The discussion throughout the book so far rests centrally on the notion of a probability model for the data under analysis. Such a model represents, often in considerably idealized form, the data-generating process. The parameters of interest are intended to capture important and interpretable features of that generating process, separated from the accidental features of the particular data.
That is, the probability model is a model of physically generated variability, of course using the word ‘physical’ in some broad sense. This whole approach may be called model-based.
In some contexts of sampling existing populations and of experimental design there is a different approach in which the probability calculations are based on the randomization used by the investigator in the planning phases of the investigation. We call this a design-based formulation.
Fortunately there is a close similarity between the methods of analysis emerging from the two approaches. The more important differences between them concern interpretation of the conclusions. Despite the close similarities it seems not to be possible to merge a theory of the purely design-based approach seamlessly into the theory developed earlier in the book. This is essentially because of the absence of a viable notion of a likelihood for the haphazard component of the variability within a purely design-based development.
These points will be illustrated in terms of two very simple examples, one of sampling and one of experimental design.
9.2 Sampling a finite population
9.2.1 First and second moment theory
Suppose that we have a well-defined population of $N$ labelled individuals and that the $s$th member of the population has a real-valued property $\eta_s$. The quantity of interest is assumed to be the finite population mean $m_\eta = \sum \eta_s/N$. The initial objective is the estimation of $m_\eta$ together with some assessment of the precision of the resulting estimate. More analytical and comparative aspects will not be considered here.
To estimate $m_\eta$ suppose that a sample of $n$ individuals is chosen at random without replacement from the population on the basis of the labels. That is, some impersonal sampling device is used that ensures that all $N!/\{n!(N - n)!\}$ distinct possible samples have equal probability of selection. Such simple random sampling without replacement would rarely be used directly in applications, but is the basis of many more elaborate and realistic procedures. For convenience of exposition we suppose that the order of the observations $y_1, \ldots, y_n$ is also randomized.
Define an indicator variable $I_{ks}$ to be 1 if the $k$th member of the sample is the $s$th member of the population. Then in virtue of the sampling method, not in virtue of an assumption about the structure of the population, the distribution of the $I_{ks}$ is such that
$$E_R(I_{ks}) = P_R(I_{ks} = 1) = 1/N, \tag{9.1}$$
and, because the sampling is without replacement,
$$E_R(I_{ks} I_{lt}) = (1 - \delta_{kl})(1 - \delta_{st})/\{N(N - 1)\} + \delta_{kl}\delta_{st}/N, \tag{9.2}$$
where $\delta_{kl}$ is the Kronecker delta symbol, equal to 1 if $k = l$ and 0 otherwise. The suffix $R$ is to stress that the probability measure is derived from the sampling randomization. More complicated methods of sampling would be specified by the properties of these indicator random variables.
It is now possible, by direct if sometimes tedious calculation from these and similar higher-order specifications of the distribution, to derive the moments of the sample mean and indeed any polynomial function of the sample values. For this we note that, for example, the sample mean can be written
$$\bar y = \sum_{k,s} I_{ks}\eta_s/n. \tag{9.3}$$
It follows immediately from (9.1) that
$$E_R(\bar y) = m_\eta, \tag{9.4}$$
so that the sample mean is unbiased in its randomization distribution.
Similarly
$$\mathrm{var}_R(\bar y) = \sum_{k,s} \mathrm{var}(I_{ks})\eta_s^2/n^2 + 2\sum_{k>l,\,s>t} \mathrm{cov}(I_{ks}, I_{lt})\eta_s\eta_t/n^2. \tag{9.5}$$
It follows from (9.2) that
$$\mathrm{var}_R(\bar y) = (1 - f)v_{\eta\eta}/n, \tag{9.6}$$
where the second-moment variability of the finite population is represented by
$$v_{\eta\eta} = \sum (\eta_s - m_\eta)^2/(N - 1), \tag{9.7}$$
sometimes inadvisably called the finite population variance, and $f = n/N$ is the sampling fraction.
Thus we have a simple generalization of the formula for the variance of a sample mean of independent and identically distributed random variables. Quite often the proportion $f$ of individuals sampled is small and then the factor $1 - f$, called the finite population correction, can be omitted.
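The results (9.4) and (9.6) can be checked numerically for a small population by enumerating every possible sample. The following is a minimal Python sketch, not part of the original development; the population values and the function names are hypothetical and chosen only for illustration.

```python
from itertools import combinations
from statistics import mean


def randomization_moments(eta, n):
    """Exact mean and variance of the sample mean over all equally
    likely samples of size n drawn without replacement."""
    sample_means = [mean(sample) for sample in combinations(eta, n)]
    e_r = mean(sample_means)
    var_r = mean([(m - e_r) ** 2 for m in sample_means])
    return e_r, var_r


def formula_variance(eta, n):
    """Variance of the sample mean from (9.6): (1 - f) * v_eta_eta / n."""
    N = len(eta)
    m_eta = mean(eta)
    v_eta_eta = sum((x - m_eta) ** 2 for x in eta) / (N - 1)  # (9.7)
    f = n / N  # sampling fraction
    return (1 - f) * v_eta_eta / n


# Hypothetical population of N = 6 values, sampled with n = 3.
eta = [3.0, 7.0, 1.0, 12.0, 5.0, 8.0]
e_r, var_r = randomization_moments(eta, n=3)
print(e_r, mean(eta))                    # the two means should agree
print(var_r, formula_variance(eta, 3))   # the two variances should agree
```

Because the enumeration uses only the equal probability of each sample, the agreement does not depend on any assumption about how the population values arose.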
A similar argument shows that if $s^2 = \sum (y_k - \bar y)^2/(n - 1)$, then
$$E_R(s^2) = v_{\eta\eta}, \tag{9.8}$$
so that the pivot
$$\frac{m_\eta - \bar y}{\{s^2(1 - f)/n\}^{1/2}} \tag{9.9}$$
has the form of a random variable of zero mean divided by an estimate of its standard deviation. A version of the Central Limit Theorem is available for this situation, so that asymptotically confidence limits are available for $m_\eta$ by pivotal inversion.
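Inverting the pivot (9.9) under its asymptotic standard normal distribution gives approximate confidence limits for the finite population mean. A minimal Python sketch under that approximation follows; the function name, the default level and the data are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def design_based_interval(y, N, level=0.95):
    """Approximate confidence limits for the finite population mean,
    obtained by treating the pivot (9.9) as standard normal."""
    n = len(y)
    f = n / N                      # sampling fraction
    ybar = mean(y)
    s2 = variance(y)               # divisor n - 1, as in the text
    se = sqrt(s2 * (1 - f) / n)    # estimated standard error of ybar
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return ybar - z * se, ybar + z * se


# Hypothetical sample of n = 5 from a population of N = 50.
lower, upper = design_based_interval([4.0, 6.0, 5.0, 7.0, 3.0], N=50)
print(lower, upper)
```

When $n = N$ the finite population correction makes the standard error zero and the interval collapses to the sample mean, as it should for a complete enumeration.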
A special case where the discussion can be taken further is when the population values $\eta_s$ are binary, say 0 or 1. Then the number of sampled individuals having value 1 has a hypergeometric distribution, and the target population value is the number of 1s in the population, a defining parameter of that distribution. In principle, therefore, design-based inference in this special case is equivalent, formally at least, to parametric inference.
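To make the binary case concrete, the randomization distribution of the number of 1s in the sample can be written down directly and then read as a likelihood for the unknown number of 1s in the population. The Python sketch below is an illustration only; the names `hypergeom_pmf` and `likelihood`, and the numbers used, are hypothetical.

```python
from math import comb


def hypergeom_pmf(r, N, M, n):
    """Probability of r ones in a simple random sample of size n,
    drawn without replacement from N individuals of whom M are ones."""
    if r < max(0, n - (N - M)) or r > min(n, M):
        return 0.0
    return comb(M, r) * comb(N - M, n - r) / comb(N, n)


def likelihood(r_obs, N, n):
    """Likelihood of each possible population count M = 0, ..., N,
    given r_obs ones observed in the sample."""
    return [hypergeom_pmf(r_obs, N, M, n) for M in range(N + 1)]


# Hypothetical case: N = 20, n = 5, and 2 ones observed.
lik = likelihood(r_obs=2, N=20, n=5)
best_M = max(range(len(lik)), key=lik.__getitem__)
print(best_M)  # population count with the largest likelihood
```

The probabilities are generated purely by the sampling randomization, yet the population count enters exactly as a parameter would in a parametric model.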
9.2.2 Discussion
Design-based analysis leads to conclusions about the finite population mean totally free of assumptions about the structure of the variation in the population
and subject, for interval estimation, only to the usually mild approximation of normality for the distribution of the pivot. Of course, in practical sampling problems there are many complications; we have ignored the possibility that supplementary information about the population might point to conditioning the sampling distribution on some features or would have indicated a more efficient mode of sampling.
We have already noted that for binary features assumed to vary randomly between individuals, essentially identical conclusions emerge with no special assumption about the sampling procedure but with the extremely strong assumption that the population features correspond to independent and identically distributed random variables. We call such an approach model-based or, equivalently, based on a superpopulation model. That is, the finite population under study is regarded as itself a sample from a larger universe, usually hypothetical.
A very similar conclusion emerges in the Gaussian case. For this, we may assume $\eta_1, \ldots, \eta_N$ are independently normally distributed with mean $\mu$ and variance $\sigma^2$. Estimating the finite population mean is essentially equivalent to estimating the mean $\bar y^*$ of the unobserved individuals and this is a prediction problem of the type discussed in Section 8.1. That is, the target parameter is an average of a fraction $f$ known exactly, the sample, and an unobserved part, a fraction $1 - f$. The predictive pivot is, with $\sigma^2$ known, the normally distributed quantity
$$(\bar y^* - \bar y)/\left(\sigma\sqrt{1/n + 1/(N - n)}\right) \tag{9.10}$$
and when $\sigma^2$ is unknown an estimated variance is used instead. The pivotal distributions are respectively the standard normal and the Student $t$ distribution with $n - 1$ degrees of freedom. Except for the sharper character of the distributional result, the exact distributional result contrasted with asymptotic normality, this is the same as the design-based inference.
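The agreement with the design-based pivot can be verified in a few lines. The following check is a sketch that uses only the definitions already given, with $\bar y^*$ the mean of the $N - n$ unsampled values and $f = n/N$:

```latex
% The target is a weighted average of the sampled and unsampled parts:
m_\eta - \bar y = f\bar y + (1 - f)\bar y^* - \bar y = (1 - f)(\bar y^* - \bar y).

% The variance factor in the predictive pivot simplifies:
\sigma^2\left\{\frac{1}{n} + \frac{1}{N - n}\right\}
  = \frac{\sigma^2 N}{n(N - n)}
  = \frac{\sigma^2}{n(1 - f)}.

% Hence the predictive pivot can be rewritten as
\frac{\bar y^* - \bar y}{\sigma\sqrt{1/n + 1/(N - n)}}
  = \frac{m_\eta - \bar y}{\{\sigma^2(1 - f)/n\}^{1/2}},
```

which has the same form as the design-based pivot with $\sigma^2$ in place of $s^2$.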
One difference between the two approaches is that in the model-based formulation the choice of statistics arises directly from considerations of sufficiency.
In the design-based method the justification for the use of $\bar y$ is partly general plausibility and partly that only linear functions of the observed values have a randomization-based mean that involves $m_\eta$. The unweighted average $\bar y$ has the smallest variance among all such linear unbiased estimates of $m_\eta$.
While the model-based approach is more in line with the discussion in the earlier part of this book, the relative freedom from what in many sampling applications might seem very contrived assumptions has meant that the design-based approach has been the more favoured in most discussions of sampling in the social field, but perhaps rather less so in the natural sciences.
Particularly in more complicated problems, but also to aim for greater theoretical unity, it is natural to try to apply likelihood ideas to the design-based approach. It is, however, unclear how to do this. One approach is to regard $(\eta_1, \ldots, \eta_N)$ as the unknown parameter vector. The likelihood then has the following form. For those $s$ that are observed, the likelihood is constant when the $\eta_s$ in question equals the corresponding $y$ and zero otherwise, and the likelihood does not depend on the unobserved $\eta_s$. That is, the likelihood summarizes that the observations are what they are and that there is no information about the unobserved individuals. In a sense this is correct and inevitable. If there is no information whatever either about how the sample was chosen or about the structure of the population, no secure conclusion can be drawn beyond the individuals actually sampled; the sample might have been chosen in a highly biased way. Information about the unsampled items can come only from an assumption of population form or from specification of the sampling procedure.
9.2.3 A development
Real sampling problems have many complicating features which we do not address here. To illustrate further aspects of the interplay between design- and model-based analyses it is, however, useful to consider the following extension.
Suppose that for each individual there is a further variable $z$ and that these are known for all individuals. The finite population mean of $z$ is denoted by $m_z$ and is thus known. The information about $z$ might indeed be used to set up a modified and more efficient sampling scheme, but we continue to consider random sampling without replacement.
Suppose further that it is reasonable to expect approximate proportionality between the quantity of interest $\eta$ and $z$. Most commonly $z$ is some measure of size expected to influence the target variable roughly proportionally. After the sample is taken $\bar y$ and the sample mean of $z$, say $\bar z$, are known. If there were exact proportionality between $\eta$ and $z$ the finite population mean of $\eta$ would be
$$\tilde m_\eta = \bar y m_z/\bar z \tag{9.11}$$
and it is sensible to consider this as a possible estimate of the finite population mean $m_\eta$ in which any discrepancy between $\bar z$ and $m_z$ is used as a base for a proportional adjustment to $\bar y$.
A simple model-based theory of this can be set out in outline as follows.
Suppose that the individual values, $\eta_k$, and therefore if observed $y_k$, are random variables of the form
$$\eta_k = \beta z_k + \zeta_k\sqrt{z_k}, \tag{9.12}$$
where $\zeta_1, \ldots, \zeta_N$ are independently normally distributed with zero mean and variance $\sigma_\zeta^2$. Conformity with this representation can to some extent be tested from the data. When $z_k$ is a measure of size and $\eta_k$ is an aggregated effect over the individual unit, the square root dependence and approximate normality of the error terms have some theoretical justification via a Central Limit-like effect operating within individuals.
Analysis of the corresponding sample values is now possible by the method of weighted least squares or more directly by ordinary least squares applied to the representation
$$y_k/\sqrt{z_k} = \beta\sqrt{z_k} + \zeta_k, \tag{9.13}$$
leading to the estimate $\hat\beta = \bar y/\bar z$ and to estimates of $\sigma_\zeta^2$ and $\mathrm{var}(\hat\beta) = \sigma_\zeta^2/(n\bar z)$.
Moreover, $\sigma_\zeta^2$ is estimated by
$$s_\zeta^2 = \frac{\sum (y_k - \hat\beta z_k)^2/z_k}{n - 1}. \tag{9.14}$$
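The least squares calculation can be sketched in a few lines of Python. This is an illustration under the stated model, not a definitive implementation; the function names and the data are hypothetical. It computes the slope estimate both as the ratio of sample means and by ordinary least squares through the origin on the square-root-scaled variables, so that the two routes can be compared.

```python
from math import sqrt
from statistics import mean


def ratio_fit(y, z):
    """Slope estimate, residual variance estimate and estimated
    variance of the slope under the model with error sd
    proportional to sqrt(z)."""
    n = len(y)
    beta_hat = mean(y) / mean(z)
    s2_zeta = sum((yk - beta_hat * zk) ** 2 / zk
                  for yk, zk in zip(y, z)) / (n - 1)
    var_beta_hat = s2_zeta / (n * mean(z))
    return beta_hat, s2_zeta, var_beta_hat


def beta_ols_transformed(y, z):
    """Ordinary least squares through the origin applied to
    y / sqrt(z) regressed on sqrt(z)."""
    u = [yk / sqrt(zk) for yk, zk in zip(y, z)]
    x = [sqrt(zk) for zk in z]
    return sum(uk * xk for uk, xk in zip(u, x)) / sum(xk * xk for xk in x)


def ratio_estimate(y, z, m_z):
    """Ratio estimate of the finite population mean: ybar * m_z / zbar."""
    return mean(y) * m_z / mean(z)


# Hypothetical sample; m_z is assumed known for the whole population.
y = [2.5, 7.0, 19.0]
z = [1.0, 4.0, 9.0]
beta_hat, s2_zeta, var_beta_hat = ratio_fit(y, z)
print(beta_hat, beta_ols_transformed(y, z))   # the two slopes should agree
print(ratio_estimate(y, z, m_z=5.0))
```

The ratio estimate is simply the fitted slope multiplied by the known population mean of $z$.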
The finite population mean of interest, $m_\eta$, can be written in the form
$$f\bar y + (1 - f)\sum\nolimits^* \eta_l/(N - n) = f\bar y + (1 - f)\Big\{\sum\nolimits^* (\beta z_l + \zeta_l\sqrt{z_l})\Big\}/(N - n), \tag{9.15}$$
where $\sum^*$ denotes summation over the individuals not sampled. Because
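$$\Big\{n\bar z + \sum\nolimits^* z_l\Big\}/N = m_z, \tag{9.16}$$
it follows, on replacing $\beta$ by $\hat\beta$, that the appropriate estimate of $m_\eta$ is $\tilde m_\eta$ and that its variance is
$$\frac{(m_z - f\bar z)m_z}{n\bar z}\sigma_\zeta^2. \tag{9.17}$$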
This can be estimated by replacing $\sigma_\zeta^2$ by $s_\zeta^2$ and hence an exact pivot formed for estimation of $m_\eta$.
The design-based analysis is less simple. First, for the choice of $\tilde m_\eta$ we rely either on the informal argument given when introducing the estimate above or on the use of an estimate that is optimal under the very special circumstances of the superpopulation model but whose properties are studied purely in a design-based formulation.
For this, the previous discussion of sampling without replacement from a finite population shows that $(\bar y, \bar z)$ has mean $(m_\eta, m_z)$ and has covariance matrix, in a notation directly extending that used for $v_{\eta\eta}$,
$$\begin{pmatrix} v_{\eta\eta} & v_{\eta z} \\ v_{z\eta} & v_{zz} \end{pmatrix}(1 - f)/n. \tag{9.18}$$