3 Statistical effects of sampling and weighting
3.4 Missing data
3.4.2 Effect on variance estimates
Having done the imputation, the next question is: how will it affect variance formulae for mean or proportion estimates? The question is important because some sort of imputation is done for many surveys due to the fact that missing values occur very
frequently, even when the greatest effort is made to ensure high quality of both sampling and data recording. It is also clear that one cannot simply ignore that question, saying that there is no effect. Nevertheless, the statistical literature does not pay much attention to this problem, with most discussions being limited, as usual, to simple random sampling.
Here we briefly consider the imputation effects, concentrating our efforts on getting a workable formula rather than on general theory. Notice that imputation, like sample design, will have different effects for different variables. This means that it is not very useful to talk about an ‘average’ effect and we should really deduce a new variance formula.
One particular problem with data imputation is that there are so many different imputation techniques, and even more situations where these techniques can be applied. Hence, when a variance formula is presented, the danger is that it could be too specific and not suitable for many surveys. On the other hand, a general formula may also not be very beneficial if it does not give a practical algorithm for calculating variance. The approach we have chosen is first to present a general tool to derive a variance formula that could be employed for almost any survey. Next, we choose our path, making several reasonable assumptions, and obtain a specific formula that, we believe, can still be applied in many cases. Any reader can, of course, use the same tool and choose another path by making different assumptions. One assumption that has to be made, of course, is that imputed values can be readily identified in the data set.
Let $x$ be a variable of interest for a survey with $n$ respondents. Assume that the first $m$ respondents have data (i.e. we know the values $x_1, \dots, x_m$) while the values $x_{m+1}, \dots, x_n$ are ‘missing’. Denote by $f$ the variable of imputed values, so that for missing respondents we will use the imputed values $f_{m+1}, \dots, f_n$ instead of the actual ones. Let $\tilde{x}$ be the weighted mean estimate with imputed values:

$$\tilde{x} = \frac{\sum_{i=1}^{m} w_i x_i + \sum_{i=m+1}^{n} w_i f_i}{\sum_{i=1}^{n} w_i}, \tag{3.29}$$

where $w_i$ is the weight of respondent $i$. This is the estimate (in general, biased) for which we will actually derive a variance formula. Notice that the estimate includes proportions as well: both $x$ and $f$ would be binary variables (with values 0 or 1) in this case. We deliberately distinguish the estimate above from $\bar{x}$, which is still reserved for the mean without imputed values:

$$\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}.$$
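One more variable we need is the imputation error $\varepsilon_i = f_i - x_i$ for each respondent $i$. It is now simple to express $\tilde{x}$ in terms of $\bar{x}$ and an error term (see calculations in Appendix E):

$$\tilde{x} = \bar{x} + \alpha \tilde{\varepsilon}, \tag{3.30}$$

where

$$\tilde{\varepsilon} = \frac{\sum_{i=m+1}^{n} w_i \varepsilon_i}{\sum_{i=m+1}^{n} w_i}$$

is the average error among missing respondents and

$$\alpha = \frac{\sum_{i=m+1}^{n} w_i}{\sum_{i=1}^{n} w_i}$$

is the item non-response rate.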
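The decomposition of the imputed mean into the mean without imputation plus the non-response rate times the average error can be checked numerically. The following Python sketch is only an illustration: the function names and the toy data are our own assumptions, and the true values of the missing respondents are taken as known purely so that the identity can be verified (in a real survey they are, of course, unavailable).

```python
def weighted_mean(values, weights):
    """Weighted mean: sum(w_i * v_i) / sum(w_i)."""
    return sum(w * v for v, w in zip(values, weights)) / sum(weights)


def decompose_imputed_mean(x, f, w, m):
    """Return (x_tilde, x_bar, alpha, eps_tilde).

    x: values for all n respondents (hypothetically known, for checking only)
    f: imputed values; only f[m:] is used
    w: weights
    m: number of respondents with data (they come first)
    """
    n = len(x)
    total_w = sum(w)
    # Mean with imputed values: observed x for i < m, imputed f for i >= m.
    x_tilde = (sum(w[i] * x[i] for i in range(m))
               + sum(w[i] * f[i] for i in range(m, n))) / total_w
    # Mean without imputed values.
    x_bar = weighted_mean(x, w)
    # Item non-response rate: weighted share of missing respondents.
    alpha = sum(w[m:]) / total_w
    # Average imputation error among missing respondents.
    errors = [f[i] - x[i] for i in range(m, n)]
    eps_tilde = weighted_mean(errors, w[m:])
    return x_tilde, x_bar, alpha, eps_tilde


# Toy data (made up): 5 respondents, the last 2 are 'missing'.
x = [2.0, 4.0, 6.0, 8.0, 10.0]
f = [0.0, 0.0, 0.0, 7.0, 12.0]   # entries for i < m are placeholders
w = [1.0, 2.0, 1.0, 1.5, 0.5]
x_tilde, x_bar, alpha, eps_tilde = decompose_imputed_mean(x, f, w, m=3)

# The two sides of the decomposition agree up to rounding.
assert abs(x_tilde - (x_bar + alpha * eps_tilde)) < 1e-12
```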
This equation is our main tool for computing the variance. To start, we make the first assumption:
• For a given variable, the non-response rate $\alpha$ is presumed to be constant if the survey is repeated.
If the survey is conducted only once, this assumption is perhaps unavoidable (unless there is an alternative to equation (3.30) to derive a variance formula). But for many surveys it is also justified by practice. It is assumed that non-response to individual questions resulting from poor questionnaire design has been minimised by testing. Most changes in non-response rate occur when the sampling procedure is changed, or the questionnaire is changed, but the assumption above is, of course, conditional on there being no changes in sampling methodology. However, for readers who do have a chance to measure the variation of $\alpha$ and are not happy with the assumption above, we will give an alternative expression for variance (see Remark 3.1 at the end of this subsection).
Now the standard formula for the variance of the sum of two variables can be expressed as:

$$\operatorname{var}(\tilde{x}) = \operatorname{var}(\bar{x}) + \alpha^2 \operatorname{var}(\tilde{\varepsilon}) + 2\alpha \operatorname{cov}(\bar{x}, \tilde{\varepsilon}). \tag{3.31}$$

This is a very general formula that can in fact be used for most surveys. However, it is perhaps too general and we must now state the second assumption:
• The average imputation error $\tilde{\varepsilon}$ does not depend on the sampling estimate $\bar{x}$.

This assumption, we believe, is very reasonable because the imputation procedure is usually independent of the sampling procedure. Notice that we do not assume that the error has any particular distribution. The independence of the summands in equation (3.30) allows us, therefore, to get rid of the covariance and to state that

$$\operatorname{var}(\tilde{x}) = \operatorname{var}(\bar{x}) + \alpha^2 \operatorname{var}(\tilde{\varepsilon}). \tag{3.32}$$
If a reader feels that the independence assumption is not true in a particular survey then, of course, the covariance should still be calculated. However, it is not obvious how to compute it in general and any estimate of the covariance would have to rest on a number of explicit assumptions.
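The effect of dropping the covariance term can be explored on simulated data. The sketch below is an illustration under assumptions of our own, none of which come from the text: equal weights, fixed numbers of respondents with and without data (so that the non-response rate is constant), normally distributed values and uniformly distributed errors drawn independently of the values. Under these conditions the variance of the imputed mean over repeated 'surveys' should be close to the sum of the two remaining terms.

```python
import random
import statistics


def simulate_imputed_means(n=50, m=40, reps=5000, seed=1):
    """Repeat a toy equal-weight survey `reps` times.

    Returns the sample variances of x_tilde, x_bar and eps_tilde over the
    repetitions, together with the (fixed) non-response rate alpha.
    """
    rng = random.Random(seed)
    alpha = (n - m) / n  # constant across repetitions (first assumption)
    x_tildes, x_bars, eps_tildes = [], [], []
    for _ in range(reps):
        x = [rng.gauss(10.0, 1.0) for _ in range(n)]
        # Errors are drawn independently of x (second assumption); a uniform
        # distribution is used to stress that normality is not required.
        eps = [rng.uniform(-3.0, 3.0) for _ in range(n - m)]
        # Imputed values for the missing respondents: f_i = x_i + eps_i.
        f = [x[m + j] + eps[j] for j in range(n - m)]
        x_tildes.append((sum(x[:m]) + sum(f)) / n)
        x_bars.append(sum(x) / n)
        eps_tildes.append(sum(eps) / (n - m))
    return (statistics.variance(x_tildes),
            statistics.variance(x_bars),
            statistics.variance(eps_tildes),
            alpha)


var_tilde, var_bar, var_eps, alpha = simulate_imputed_means()
# The two printed numbers should be close; their difference is the
# (sample) covariance term, which is small when the independence holds.
print(var_tilde, var_bar + alpha ** 2 * var_eps)
```

If the errors were instead generated so that they depend on the values (for example, imputation that systematically pulls large values down), the two printed numbers would drift apart, which is the case where the covariance has to be retained.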
The next step is to calculate the two summands in equation (3.32). The variance of $\bar{x}$ is estimated in the usual way, the only difference being that now imputed values should be used for the missing ones. The second summand is obviously more difficult to compute. In principle, $\tilde{\varepsilon}$ is itself a ratio estimate and so its variance ought to be estimated by the same formula as the variance of $\bar{x}$. However, we do not actually know the error values for missing respondents (otherwise they would not be missing) so that it is extremely difficult, if not impossible, to calculate, in general, the effective sample size for them. This is where we make the last assumption:
• Suppose that, for missing respondents, the calibrated sample size can replace the effective sample size, to calculate the variance of the average error $\tilde{\varepsilon}$.
This assumption is clearly more restrictive than the first two but, on the other hand, we are attempting to solve a more difficult problem. If the sample is clustered, it is also recommended to take into account the clustering effect, to get a conservative variance estimate. In this situation that can be a good thing because we do not really know how big the error is among missing respondents; we can only estimate it using respondents with data. However, the clustering effect should not be large in this case because missing respondents usually constitute a relatively small subsample.³
Of course, a reader can make any other assumption that might be more appropriate for a particular survey and that would lead to a calculation of the effective sample size. However, even strong assumptions are not always very useful. For instance, it is commonly assumed in many research papers (but not here) that the error is normally distributed. But that still does not allow us to obtain the variance of $\tilde{\varepsilon}$ because weights interact with errors (it is not a simple random sample) and they are not constant from survey to survey, so that one would need more assumptions anyway to obtain a practical formula.
To summarise our discussion, we obtain the following final formula for the variance.
Proposition 3.3 Under the three assumptions above, the variance of $\tilde{x}$ can be estimated as

$$\operatorname{var}(\tilde{x}) = \frac{\operatorname{var}(x)}{n_e} + \alpha^2 \frac{\operatorname{var}(\varepsilon)}{(n-m)_c}, \tag{3.33}$$

where $n_e$ is the total effective sample size and $(n-m)_c$ is the calibrated sample size for missing respondents. The variance of $x$ can be estimated by the usual formula

$$\operatorname{var}(x) = \frac{\sum_{i=1}^{m} w_i x_i^2 + \sum_{i=m+1}^{n} w_i f_i^2}{\sum_{i=1}^{n} w_i} - \left( \frac{\sum_{i=1}^{m} w_i x_i + \sum_{i=m+1}^{n} w_i f_i}{\sum_{i=1}^{n} w_i} \right)^2 \tag{3.34}$$

(alternatively, it can be estimated using only respondents with data) while the variance of $\varepsilon$ must be based on respondents with data:

$$\operatorname{var}(\varepsilon) = \frac{\sum_{i=1}^{m} w_i \varepsilon_i^2}{\sum_{i=1}^{m} w_i} - \left( \frac{\sum_{i=1}^{m} w_i \varepsilon_i}{\sum_{i=1}^{m} w_i} \right)^2. \tag{3.35}$$

³ For a subsample, the average cluster size $b$ is not greater than the total average cluster size because each cluster will have fewer respondents. Thus, the clustering effect formula $1 + (b - 1)\rho$ implies that the clustering effect should, in general, become smaller for a subsample.
It must be emphasised once again that in the formula used to calculate the effective sample size $n_e$ (or design effect) the imputed values should be used for missing respondents. In the cases where design effect calculations are not possible, the alternative is, as we know, to replace $n_e$ by the calibrated sample size $n_c$, which is much easier to compute.
As we see, the final variance for the imputed estimate is actually greater than the variance of $\bar{x}$, which is what one would expect. The increase will vanish when $\alpha = 0.0$, that is, when there are no missing respondents. On the other hand, if $\alpha = 1.0$, all respondents in the sample are missing. But it is impossible, of course, to impute data for the whole sample – and dangerous to impute a substantial proportion.
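To make the calculation in Proposition 3.3 concrete, the following Python sketch evaluates formula (3.33) from raw data. It rests on several assumptions of our own that should be checked against the survey at hand. The function names and argument layout are hypothetical. The calibrated sample size is computed here as the squared sum of weights divided by the sum of squared weights, which is an assumption of this sketch. The errors for respondents with data are obtained by applying the imputation procedure to them as if their values were missing (the argument `f_obs`). Finally, when no effective sample size is supplied, the calibrated sample size of the whole sample is used in its place.

```python
def weighted_variance(values, weights):
    """Weighted mean of squares minus squared weighted mean, as in (3.34)/(3.35)."""
    total = sum(weights)
    mean = sum(w * v for v, w in zip(values, weights)) / total
    mean_sq = sum(w * v * v for v, w in zip(values, weights)) / total
    return mean_sq - mean * mean


def calibrated_size(weights):
    """ASSUMPTION of this sketch: calibrated size = (sum w)^2 / sum(w^2)."""
    return sum(weights) ** 2 / sum(w * w for w in weights)


def imputed_mean_variance(x_obs, f_obs, f_missing, w_obs, w_missing, n_e=None):
    """Estimate var(x_tilde) following formula (3.33).

    x_obs:     observed values for the m respondents with data
    f_obs:     values the imputation procedure would assign to those same
               respondents (hypothetical input, used only to measure errors)
    f_missing: imputed values for the n - m missing respondents
    w_obs, w_missing: the corresponding weights
    n_e:       total effective sample size; if None, the calibrated sample
               size of the full sample is used instead
    """
    w_all = list(w_obs) + list(w_missing)
    values_all = list(x_obs) + list(f_missing)  # imputed values fill the gaps
    if n_e is None:
        n_e = calibrated_size(w_all)
    var_x = weighted_variance(values_all, w_all)            # formula (3.34)
    if not w_missing:
        return var_x / n_e                                  # alpha = 0
    alpha = sum(w_missing) / sum(w_all)
    errors = [f - x for f, x in zip(f_obs, x_obs)]
    var_eps = weighted_variance(errors, w_obs)              # formula (3.35)
    return var_x / n_e + alpha ** 2 * var_eps / calibrated_size(w_missing)


# Toy example (made-up numbers): 4 respondents with data, 2 imputed.
estimate = imputed_mean_variance(
    x_obs=[1, 2, 3, 4], f_obs=[2, 2, 3, 3], f_missing=[2, 3],
    w_obs=[1, 1, 1, 1], w_missing=[1, 1],
)
print(estimate)
```

In practice one would pass an effective sample size `n_e` obtained from a design effect calculation whenever that is available; the fallback to the calibrated sample size is only the easier alternative.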
It is also worthwhile to give a separate formula for proportion estimates.