These methods again replace each missing observation by a single value, leading to a ‘completed’ data set. The originally intended complete data analysis is then used. As above, we call these replaced values imputed values.
Marginal mean imputation, as its name suggests, ignores other variables. Missing values are imputed by the average of the observed values for that variable. It is also sometimes referred to as simple mean imputation or just mean imputation.
Clearly, marginal mean imputation is problematic for categorical variables, where the ‘average category’ has no meaning. However, the problems go far beyond this. As marginal mean imputation ignores all the other variables in the data set, using it reduces the associations in the data set. Also, imputing all the missing observations to the same value is clearly wrong, and will underestimate the variability in the unseen data. It further goes against the principles of Chapter 1, where we saw the best we could hope for was a good estimate of the distribution of the missing observations.
6In other words baseline may be MAR or NMAR, but in neither case, given baseline data, must the missingness
2.5 Marginal and conditional mean imputation
Figure 2.4: Isolde trial, placebo arm: plot of baseline FEV1 against 6 month FEV1 with missing 6 month FEV1’s imputed by the marginal mean. Axes: baseline FEV (litres) against 6 month FEV (litres). Legend: observed 6 month FEV; marginal mean imputation of 6 month FEV.
EXAMPLE 2.1 Isolde trial (ctd)
Consider the FEV1 response 6 months after randomisation for the 375 patients in the placebo group. Eighty-seven have a missing response. The mean FEV1 of the remaining 288 is 1.36 litres. Marginal mean imputation sets each of the missing values equal to 1.36.
Figure 2.4 shows, for the 375 placebo patients, a plot of baseline FEV1 against 6 month FEV1. The 87 patients with marginal mean imputed values are shown with a ‘△’. The shortcomings of marginal mean imputation are immediately obvious. Unless a patient’s baseline FEV1 is close to the mean baseline FEV1, the marginal mean is very unlikely to be close to the unobserved value. □
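The following is a minimal sketch of marginal mean imputation, not code from the Isolde analysis. The data are simulated, and the names `marginal_mean_impute`, `baseline` and `six_month` are illustrative assumptions. It shows the two effects described above: the completed data are less variable than the observed data, and the association with baseline is weakened.

```python
import numpy as np

def marginal_mean_impute(y):
    """Replace each NaN in y by the mean of the observed (non-NaN) values."""
    y = np.asarray(y, dtype=float)
    completed = y.copy()
    completed[np.isnan(y)] = np.nanmean(y)
    return completed

# Hypothetical data (NOT the Isolde data): response depends on baseline.
rng = np.random.default_rng(0)
baseline = rng.normal(1.4, 0.5, size=375)
six_month = 0.95 * baseline + rng.normal(0.0, 0.2, size=375)
six_month_missing = six_month.copy()
six_month_missing[:87] = np.nan  # make 87 responses missing

completed = marginal_mean_impute(six_month_missing)

# Every imputed value is the same number, so the spread shrinks and the
# baseline/response correlation is diluted relative to the observed pairs.
sd_observed = np.nanstd(six_month_missing, ddof=1)
sd_completed = np.std(completed, ddof=1)
corr_observed = np.corrcoef(baseline[87:], six_month_missing[87:])[0, 1]
corr_completed = np.corrcoef(baseline, completed)[0, 1]
print(sd_observed, sd_completed, corr_observed, corr_completed)
```

In this sketch the standard deviation and the correlation computed from the completed data are both smaller than those computed from the observed pairs alone.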
We now consider conditional mean imputation. In the simplest case, suppose we have one fully observed variable, $x$, linearly related to the variable with missing data, $y$. Using the observed pairs $(x_i, y_i)$, $i \in (1, \dots, n_1)$, fit the regression of $y$ on $x$:
$$\text{average value of } y_i = \alpha + \beta x_i, \qquad (2.5)$$
obtaining estimates $(\hat\alpha, \hat\beta)$ of $(\alpha, \beta)$. Then, for the missing $y_i$’s, $i \in (n_1 + 1, \dots, n)$, impute them as $y_i = \hat\alpha + \hat\beta x_i$.
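The regression-then-predict recipe can be sketched as below. This is an illustration only: the function name `conditional_mean_impute` and the toy data are assumptions, and ordinary least squares via `numpy.polyfit` stands in for whatever fitting routine is actually used.

```python
import numpy as np

def conditional_mean_impute(x, y):
    """Impute NaNs in y by the fitted value alpha_hat + beta_hat * x.

    x is assumed fully observed; (alpha_hat, beta_hat) are least-squares
    estimates computed from the pairs with y observed.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    observed = ~np.isnan(y)
    # polyfit returns coefficients from highest degree down: (slope, intercept)
    beta_hat, alpha_hat = np.polyfit(x[observed], y[observed], deg=1)
    completed = y.copy()
    completed[~observed] = alpha_hat + beta_hat * x[~observed]
    return completed, alpha_hat, beta_hat

# Made-up example: the observed pairs lie exactly on y = 1 + 2x.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 3.0, np.nan, 7.0, np.nan])
completed, alpha_hat, beta_hat = conditional_mean_impute(x, y)
print(completed, alpha_hat, beta_hat)
```

Because the observed pairs here lie exactly on a line, the imputed values fall on that same line; with real data they fall on the fitted line, with no scatter around it.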
EXAMPLE 2.1 Isolde study (ctd)
Consider again the baseline (denoted $x$) and 6 month (denoted $y$) FEV1 measurements for the 375 placebo patients. Fitting (2.5) to the 288 patients with both values observed gives
$$\text{average value of } y_i = 0.024 + 0.947 \times x_i. \qquad (2.6)$$
Figure 2.5: Isolde trial, placebo arm: plots of baseline FEV1 against 6 month FEV1 with missing 6 month FEV1’s imputed by the conditional mean (2.6). Left panel: observed and imputed data; right panel: imputed data only. Axes in both panels: baseline FEV (litres) against 6 month FEV (litres).
For example, a patient with baseline 0.645 litres is imputed a 6 month value of 0.024 + 0.947 × 0.645 = 0.635 litres.
Figure 2.5 shows the results of using (2.6) for the 87 placebo patients with missing 6 month FEV1. Conditional mean imputed values are shown with a ‘△’. It is clear that the conditional imputations are much more plausible than the marginal imputations. However, as the right panel indicates, they are much less variable than the observed data. Thus, regarding the conditional mean imputations as ‘observed data’ and using them in an analysis will generally lead to underestimated standard errors and p-values. □
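One way to see why the imputed values are less variable is sketched below. It assumes the linear model behind (2.5) with a residual term $e_i$ of variance $\sigma^2$, independent of $x_i$; the symbols $e_i$ and $\sigma^2$ are introduced here for illustration. The estimates are treated as fixed.

```latex
\begin{align*}
y_i &= \alpha + \beta x_i + e_i, \qquad \operatorname{Var}(e_i) = \sigma^2, \\
\operatorname{Var}(y_i) &= \beta^2 \operatorname{Var}(x_i) + \sigma^2, \\
\operatorname{Var}(\hat\alpha + \hat\beta x_i) &= \hat\beta^2 \operatorname{Var}(x_i).
\end{align*}
```

The imputed values carry only the part of the variability explained by $x$. The residual part $\sigma^2$ is missing, which is the reduced scatter visible in the right panel of Figure 2.5.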
One setting where the underestimation of the variance with conditional mean imputation may not be such a problem is when we have a quantitative response and missing baseline values. As with the missing indicator method, this is because randomisation ensures that baseline is not a confounder. Again as with that method, we may need to weight, because the variance of the response given baseline will differ in the group whose missing baselines have been replaced by conditional mean imputations. To estimate the weights:
1. Using data from patients with both baseline and response observed, regress response on baseline and treatment. Note the residual standard error; call this $\hat r_b$.
2. Using data from patients with observed response but missing baseline, whose missing baselines are replaced by their conditional mean imputations, regress response on baseline and treatment. Note the residual standard error; call this $\hat r_m$.
3. Weights for patients with baseline and response observed are $\hat r_m^2$, and those for patients with missing baseline replaced by the conditional mean imputation are $\hat r_b^2$. Note that $\hat r_m^2$ is used in the weights for those with both baseline and response observed, and vice versa.
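The three steps can be sketched as follows. This is a hedged illustration on simulated data, not the analysis reported in the text. The helper names `residual_se` and `imputation_weights`, the auxiliary variable `aux` and all simulated quantities are assumptions made for the example.

```python
import numpy as np

def residual_se(response, baseline, treatment):
    """Residual standard error from regressing response on baseline and treatment."""
    X = np.column_stack([np.ones_like(baseline), baseline, treatment])
    coef, *_ = np.linalg.lstsq(X, response, rcond=None)
    resid = response - X @ coef
    return np.sqrt(resid @ resid / (len(response) - X.shape[1]))

def imputation_weights(r_b, r_m):
    """Step 3: r_m**2 for complete pairs, r_b**2 for imputed baselines."""
    return r_m ** 2, r_b ** 2

# Hypothetical trial data (illustrative only).
rng = np.random.default_rng(1)
n = 200
treatment = rng.integers(0, 2, size=n).astype(float)
aux = rng.normal(25.0, 4.0, size=n)  # fully observed covariate
baseline = 1.0 + 0.015 * aux + rng.normal(0.0, 0.5, size=n)
response = 0.1 * treatment + 0.9 * baseline + rng.normal(0.0, 0.15, size=n)
imputed = np.zeros(n, dtype=bool)
imputed[:60] = True  # baselines treated as missing

# Conditional mean imputation of the missing baselines from aux.
slope, intercept = np.polyfit(aux[~imputed], baseline[~imputed], 1)
baseline_used = np.where(imputed, intercept + slope * aux, baseline)

# Steps 1 and 2: residual standard errors in the two groups.
r_b = residual_se(response[~imputed], baseline_used[~imputed], treatment[~imputed])
r_m = residual_se(response[imputed], baseline_used[imputed], treatment[imputed])

# Step 3: each group is weighted by the other group's squared residual SE.
w_complete, w_imputed = imputation_weights(r_b, r_m)
weights = np.where(imputed, w_imputed, w_complete)
print(r_b, r_m, w_complete, w_imputed)
```

The resulting `weights` vector would then be supplied to a weighted regression of response on baseline and treatment. Swapping the squared residual standard errors gives weights proportional to the inverse residual variance in each group.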
As with the missing indicator method, weighting is probably advisable, if not always necessary.
EXAMPLE 2.4 Missing baseline values (ctd)
We revisit Example 2.4, where we artificially made some baseline FEV1 missing. There we considered the missing indicator method. Now though, we use baseline BMI (which drives the missingness mechanism (2.4)) to conditionally impute missing baseline FEV1.
The conditional mean imputation model for baseline FEV1 is
$$\text{Expected baseline FEV1}_i = \alpha + \beta \times \text{baseline BMI}_i, \qquad (2.7)$$
which we fit to the $i \in (1, \dots, 506)$ patients with baseline FEV1 and BMI observed. (BMI is observed on all 750 patients.) This gives estimates $(\hat\alpha, \hat\beta) = (1.2268, 0.007542)$. For the 186 patients with only 6-month FEV1 observed we then impute their baseline FEV1 values as
$$1.2268 + 0.007542 \times \text{baseline BMI}_i.$$
For these data, $\hat r_b = 0.1638$ and $\hat r_m = 0.5007$, giving weights of 0.2507 for patients with both baseline and response observed and 0.02683 for those with missing baseline.
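As a check on the arithmetic, the two weights are the squared residual standard errors, swapped between the groups as in step 3. The ratio in the last term is not quoted in the text and is added here only to indicate the relative weighting.

```latex
\[
\hat r_m^2 = 0.5007^2 \approx 0.2507, \qquad
\hat r_b^2 = 0.1638^2 \approx 0.02683, \qquad
\hat r_m^2 / \hat r_b^2 \approx 9.34 .
\]
```

So a patient with an observed baseline counts roughly nine times as much in the weighted analysis as a patient whose baseline was imputed.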
We then perform three analyses: (a) ANCOVA with conditional mean imputation for missing baseline values; (b) weighted ANCOVA with conditional mean imputation for missing baseline values; and (c) maximum likelihood analysis. Analysis (b) uses the weights calculated above. However, as the variance of the conditional mean imputations of baseline FEV1’s is very small compared to the variance of the observed baseline FEV1’s, normalised weights are virtually identical to those used in the weighted missing indicator method (analysis (iiib) in Table 2.9). Analysis (c) includes baseline BMI, but does not condition the treatment estimates on it. Effectively, it assumes that baseline and 6-month FEV1 are MAR given fully observed BMI. Such maximum likelihood analyses are discussed in detail in Chapter 3; this example uses the data arrangement in Table 3.12 and the code for Example 3.6.
Table 2.10 shows the results. The big differences are in the standard errors; because of the high correlation between baseline and 6-month FEV1, weighting is essential. The weighted analysis (b) is very similar to the weighted analysis (iiib) in Table 2.9, but the point estimate is fractionally closer to the original data analysis (i) and the standard error is slightly smaller, possibly indicative of a little gain through conditional imputation with missing baselines. Analysis (c) has similar efficiency but a slightly different point estimate. This is probably because it makes the slightly different assumption that both 6-month and baseline FEV1 are MAR given BMI.
The results suggest that if we wish to avoid a maximum likelihood analysis, the weighted missing indicator method, which gives estimates from a single model fit, is likely to be sufficient in practice. □
Analysis                              Treatment   Standard   d.f.   t      p-value
                                      estimate    error
(a) Conditional imputation            0.0641      0.0261     583    2.46   0.0143
    (n=586 observations)
(b) Weighted conditional imputation   0.0689      0.0160     583    4.30   2.0×10⁻⁵
    (n=586 observations)
(c) Maximum likelihood                0.0680      0.0161     434    4.24   2.7×10⁻⁵
    (n=186 6-month only + n=106 baseline only + n=400 with both)

Table 2.10: Estimated 6 month treatment effect, adjusted for baseline. Row 1: missing baselines (made missing according to (2.4)) imputed using conditional imputation; row 2: weighted conditional imputation; and row 3: maximum likelihood analysis using SAS PROC MIXED (same code as Example 3.6). Note degrees of freedom for the maximum likelihood analysis are from option ddfm=kr in SAS PROC MIXED.

The conditional imputation above just used one variable. In general, we can use as many variables as we like, and form complicated, possibly non-linear imputation models. These can improve the accuracy of the prediction. This is particularly so if the data is assumed MAR and we include in our imputation model all the variables, conditional on which the response is MCAR. In this case, the mean of the imputed data will be sensible. However, we are still imputing single values for the missing data when, as we have seen, what we need to do is estimate the distribution of the missing data.
We therefore need an additional step to correctly estimate the variability of quantities estimated from a ‘completed’ data set obtained using conditional mean imputation. It is possible, but often non-trivial, to do this on a case-by-case basis. Alternatively, the attraction of Multiple Imputation (MI) (Rubin, 1987) is that it provides a simple, yet both general and sufficient, approach for accounting for the variability of the estimated distribution of the missing data given the observed data.
To do this, MI does not treat any one set of imputations as the true ‘unobserved’ values of the missing data. Rather, taking into account the uncertainty in estimating both (i) the relationship between y and x variables (i.e. ˆα, ˆβ in (2.5)), and (ii) the residual variability, several ‘complete’ data sets are imputed. These then provide a convenient representation of the distribution of the missing data given the observed. Each is analysed using the method intended had there been no missing data. Then, in a key second stage, the results are combined in order to give sensible results, which are unbiased and have approximately the correct standard error. Rubin derived rules for doing this, and it is the generality and simplicity of these rules that has placed multiple imputation at the centre of methods for handling missing data.
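The combination stage can be sketched as below, using Rubin's rules in their commonly stated form: the pooled estimate is the average of the completed-data estimates, and its variance is the average within-imputation variance plus (1 + 1/M) times the between-imputation variance. The function name `pool_rubin` and the numbers are made up for illustration; they are not results from the Isolde trial.

```python
import numpy as np

def pool_rubin(estimates, variances):
    """Combine M completed-data estimates and their squared standard errors."""
    estimates = np.asarray(estimates, dtype=float)
    variances = np.asarray(variances, dtype=float)
    m = len(estimates)
    pooled = estimates.mean()        # combined point estimate
    within = variances.mean()        # average within-imputation variance
    between = estimates.var(ddof=1)  # between-imputation variance
    total = within + (1.0 + 1.0 / m) * between
    return pooled, np.sqrt(total)

# Made-up estimates and standard errors from M = 5 imputed data sets.
est = [0.066, 0.070, 0.068, 0.071, 0.065]
se = [0.016, 0.017, 0.016, 0.016, 0.017]
pooled, pooled_se = pool_rubin(est, np.square(se))
print(pooled, pooled_se)
```

The pooled standard error exceeds the typical single-data-set standard error whenever the estimates vary across imputations. That excess is the allowance for uncertainty about the missing data that single imputation omits.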
EXAMPLE 2.1 Isolde study (ctd)
We now refine the conditional mean imputations above, to reflect (i) the variability in our estimates (0.024, 0.947) of $(\alpha, \beta)$ and (ii) the variability of 6 month FEV1 given baseline FEV1.
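One common way to generate imputations reflecting both sources of variability is sketched below. It assumes a normal linear regression with a noninformative prior, an assumption not stated above. The function name `draw_imputations` and the simulated data are illustrative, and this is not necessarily the procedure used for the Isolde data.

```python
import numpy as np

def draw_imputations(x_obs, y_obs, x_mis, rng):
    """One set of stochastic imputations for y at x_mis under y = a + b*x + e."""
    X = np.column_stack([np.ones_like(x_obs), x_obs])
    n, p = X.shape
    xtx_inv = np.linalg.inv(X.T @ X)
    coef_hat = xtx_inv @ X.T @ y_obs
    resid = y_obs - X @ coef_hat
    # (ii) draw the residual variance: SSE divided by a chi-square on n - p d.f.
    sigma2 = resid @ resid / rng.chisquare(n - p)
    # (i) draw (alpha, beta) around the estimates, given that variance
    coef = rng.multivariate_normal(coef_hat, sigma2 * xtx_inv)
    # imputed value = drawn regression line + freshly drawn residual noise
    noise = rng.normal(0.0, np.sqrt(sigma2), size=len(x_mis))
    return coef[0] + coef[1] * x_mis + noise

# Hypothetical data (NOT the Isolde data).
rng = np.random.default_rng(2)
x = rng.normal(1.4, 0.5, size=300)
y = 0.02 + 0.95 * x + rng.normal(0.0, 0.2, size=300)
x_obs, y_obs, x_mis = x[:230], y[:230], x[230:]

# M = 5 imputed versions of the 70 missing responses.
imputations = [draw_imputations(x_obs, y_obs, x_mis, rng) for _ in range(5)]
print(np.std(imputations[0], ddof=1), np.std(y_obs, ddof=1))
```

Each call gives a different set of imputed values, scattered around a slightly different regression line, so the imputations no longer lie exactly on the fitted line as conditional mean imputations do.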
2.6 Conclusions