Multiple imputation for handling missing data

7.1 Missing data

7.1.4 Multiple imputation for handling missing data

Multiple imputation has been developed and further refined as a more appropriate method for dealing with missing data. It uses the distribution of the observed data to estimate a range of plausible values for the missing data, and adds random components to the model to allow for the uncertainty in the imputed data.

This is repeated several times to create multiple data sets. The estimates are then combined in a way that takes into account the variability of the imputations to obtain the overall estimates, variances, and confidence intervals.


Multiple imputation is a three-stage process. First, m values are imputed for each missing value. Missing values are replaced by imputed values that have been sampled from their predictive distribution based on the observed data (Sterne et al., 2009).

Imputations are generally undertaken using Markov Chain Monte Carlo or multiple imputation through chained equation techniques. Secondly, each of the m completed data sets is analysed as though it were a real complete data set. The data sets differ from one another because the missing values have been replaced by different imputations. Thirdly, the results of the m analyses are combined using Rubin’s rules to create a single inference about the parameter of interest that includes a measure of uncertainty from the missing data (Rubin, 1987; Zhou et al., 2001).

Multiple imputation has been shown to perform well when the proportion of the overall missing data is less than 61% (Barzi & Woodward, 2004). The underlying assumption of multiple imputation is that missing data are MAR, and therefore missing values may depend on the observed data but not on the unobserved data.

Even if the assumption that the data are MAR does not hold, multiple imputation is generally less biased than methods such as complete case analysis.

7.1.4.1 Methods for imputing values for the missing data

Markov Chain Monte Carlo technique

Multiple imputations can be created from a multivariate normal model using Markov Chain Monte Carlo (MCMC) techniques. The MCMC method involves simulating draws from a multivariate normal distribution of all of the variables in the imputation model. This method generates predicted values based on the linear regressions and then random draws are made from the simulated error distribution for each regression equation. The imputed values are created through the addition of the random errors to the predicted values for each individual (Allison, 2009).
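The final step described here, adding a random error to a regression prediction, can be illustrated with a short Python sketch. It is not the full MCMC procedure: it assumes a single predictor, fits one simple linear regression to the complete cases, and does not draw the regression parameters from their posterior distribution as a proper multiple imputation would. The function name and the toy data are hypothetical and serve only as illustration.

```python
import random
import statistics

def stochastic_regression_impute(x, y, rng):
    """Fill missing y values (None) with a prediction plus a random error."""
    # Fit y = a + b*x on the complete cases only.
    obs = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    xs = [xi for xi, _ in obs]
    ys = [yi for _, yi in obs]
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((xi - mx) ** 2 for xi in xs)
    b = sum((xi - mx) * (yi - my) for xi, yi in obs) / sxx
    a = my - b * mx
    # Residual standard deviation (n - 2 degrees of freedom).
    rss = sum((yi - (a + b * xi)) ** 2 for xi, yi in obs)
    sigma = (rss / (len(obs) - 2)) ** 0.5
    # Imputed value = predicted value + random draw from the error distribution.
    return [yi if yi is not None else a + b * xi + rng.gauss(0.0, sigma)
            for xi, yi in zip(x, y)]

# Toy data (illustrative only): two of the six y values are missing.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, None, 8.2, None, 12.1]

# Three completed data sets, each using a different random stream.
completed = [stochastic_regression_impute(x, y, random.Random(seed))
             for seed in range(3)]
```

Because each completed data set uses different random draws, the imputed values differ between data sets while the observed values stay fixed.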

Multiple imputations by chained equations

Multiple imputations can also be created through multiple imputation by chained equations (MICE). MICE is a recently developed sequential method that, instead of assuming a single multivariate model for all of the data, uses a separate regression model to impute each variable with missing data. MICE cycles through all of the variables, and models each variable conditional on the others (Stuart et al., 2009).

Logistic regression is used for incomplete binary variables and linear regression for continuous data.

Unlike MCMC techniques, where values imputed for one variable are never used as predictors to impute other variables, MICE methods use a sequential process so that the values that were imputed in the previous round are then used as predictors for imputing other variables (Allison, 2009). The variable with the least missingness is imputed first, followed by the variable with the second lowest amount of missing data, and so on. Variables with the same amount of missingness are processed in an arbitrary order, but that order is kept the same throughout (Royston, 2005b). One iteration is complete after all of the variables have been cycled through (Stuart et al., 2009). The process then repeats, imputing missing values until it reaches convergence (i.e. more iterations will not produce significant changes in the parameter estimates) (Horton & Kleinman, 2007).
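The cycling described above can be sketched in Python for the simplest possible case. This is a deliberately reduced illustration rather than the algorithm as implemented in any package: it assumes exactly two continuous variables, uses simple linear regression for both, starts from mean-filled values, and omits the draw of regression parameters that full implementations perform. The function names, column names and toy data are hypothetical.

```python
import random
import statistics

def fit_line(xs, ys):
    """Least-squares intercept, slope and residual SD for y = a + b*x."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    a = my - b * mx
    rss = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return a, b, (rss / (len(xs) - 2)) ** 0.5

def chained_equations(data, n_iter=10, seed=0):
    """data: dict of two column names -> lists, with None marking missing."""
    rng = random.Random(seed)
    missing = {k: [i for i, v in enumerate(col) if v is None]
               for k, col in data.items()}
    # Visit variables from least to most missingness.
    order = sorted(data, key=lambda k: len(missing[k]))
    # Starting values: fill each gap with the observed mean of its column.
    filled = {}
    for k, col in data.items():
        mean_k = statistics.fmean(v for v in col if v is not None)
        filled[k] = [mean_k if v is None else v for v in col]
    for _ in range(n_iter):
        for target in order:
            (pred,) = [k for k in data if k != target]
            rows = [i for i in range(len(data[target]))
                    if i not in missing[target]]
            # Fit on rows where the target was observed, using the predictor's
            # current values (observed, or imputed in an earlier step).
            a, b, sd = fit_line([filled[pred][i] for i in rows],
                                [filled[target][i] for i in rows])
            for i in missing[target]:
                filled[target][i] = a + b * filled[pred][i] + rng.gauss(0.0, sd)
    return filled, order

# Toy data (illustrative only).
data = {
    "sbp": [120.0, None, 135.0, 150.0, None, 128.0, 142.0, None],
    "age": [45.0, 52.0, 60.0, None, 48.0, 50.0, 63.0, 58.0],
}
completed, order = chained_equations(data)
```

Here `age` has fewer missing values than `sbp`, so it is imputed first in every iteration. Running the function with several different seeds would give the m completed data sets.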


MICE techniques may be preferable in certain data sets because MCMC methods assume normality and linearity, and therefore are not well suited for the imputation of categorical variables. MCMC methods are often slow to converge and it is difficult in practice to assess convergence. However, the MCMC approach has stronger statistical underpinnings and there is no theoretical guarantee that the MICE method will converge to the correct distribution for the missing values (Allison, 2009). Recent work by Lee and Carlin (2010) demonstrated that both methods are less biased than complete case analyses and that the results obtained from the two methods are similar.

7.1.4.2 Determining the number of imputations

Some simulation studies have demonstrated that three imputed data sets are sufficient for data where less than 20% are missing (van Buuren et al., 1999). Unless rates of missing data are unusually high, more than five to ten imputations tend to have little or no practical benefit (Schafer, 1999). However, more recent research has questioned the claim that more than ten imputed data sets are seldom needed. Work by Bodner (2008) has suggested that precision is improved through the use of increasing numbers of imputations. A recent publication has conservatively suggested that the number of imputations should be greater than or equal to the percentage of incomplete cases (i.e. if a data set had 13% incomplete cases, an appropriate number of imputations would be approximately 15) (White et al., 2010).
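The rule of thumb attributed to White et al. (2010) can be expressed as a small helper. This is only a sketch: the function name and the toy rows are hypothetical, and the function returns the lower bound implied by the rule, which in practice may be rounded up to a convenient number (as in the example of 15 imputations for 13% incomplete cases).

```python
import math

def suggested_imputations(rows):
    """Lower bound on m: the percentage of cases with any missing value."""
    incomplete = sum(1 for row in rows if any(v is None for v in row))
    pct = 100.0 * incomplete / len(rows)
    return math.ceil(pct)

# Toy data (illustrative only): 87 complete cases and 13 incomplete cases.
rows = [(1.0, 2.0)] * 87 + [(1.0, None)] * 13
m_min = suggested_imputations(rows)  # at least 13 imputations
```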

7.1.4.3 Methods for combining the complete data sets (Rubin’s rules)

Once the completed data sets have been created and analysed, the results are combined using a set of rules in order to obtain the overall estimates, variances, and confidence intervals. This incorporates both within-imputation variability (uncertainty about the results from one imputed data set) and between-imputation variability (the uncertainty due to the missing data) (White et al., 2010).

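The overall multiple imputation point estimate for the parameter of interest is the average of the m estimates of the variable Q from the imputed data sets:

$$\bar{Q} = \frac{1}{m} \sum_{i=1}^{m} \hat{Q}_i$$

where $\hat{Q}_i$ is the estimate from the i-th imputed data set.

The variability of variable Q has two components: (1) the estimated within-imputation variance (Ū),

$$\bar{U} = \frac{1}{m} \sum_{i=1}^{m} U_i$$

where $U_i$ is the variance estimate from the i-th imputed data set, and (2) the between-imputation variance (B).

$$B = \frac{1}{m-1} \sum_{i=1}^{m} \left(\hat{Q}_i - \bar{Q}\right)^2$$

The between-imputation variance is the additional variance created by the uncertainty around the missing values. The total variance for the overall multiple imputation estimate is defined as T.

$$T = \bar{U} + \left(1 + \frac{1}{m}\right) B \tag{7.4}$$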
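As a worked illustration of these rules, the sketch below pools m estimates and their variances. The pooled point estimate is the mean of the m estimates, Ū is the mean of the m variance estimates, B is the sample variance of the m estimates (divisor m - 1), and T = Ū + (1 + 1/m)B. The function name and the input numbers are hypothetical; confidence intervals and degrees of freedom are not covered.

```python
import statistics

def pool_rubin(estimates, variances):
    """Combine m estimates and their variances using Rubin's rules."""
    m = len(estimates)
    q_bar = statistics.fmean(estimates)   # pooled point estimate
    u_bar = statistics.fmean(variances)   # within-imputation variance
    b = statistics.variance(estimates)    # between-imputation variance (m - 1)
    t = u_bar + (1 + 1 / m) * b           # total variance
    return q_bar, u_bar, b, t

# Toy input (illustrative only): three imputed data sets.
q_bar, u_bar, b, t = pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
```

With these inputs the pooled estimate is 2.0, Ū is 0.5, B is 1.0, and T is 0.5 + (4/3) × 1.0, or approximately 1.83. The between-imputation term dominates here because the three estimates disagree substantially.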

7.1.4.4 Selection of variables to include in the imputation model

In general, the selection of all available covariates produces multiple imputations with minimal bias and maximal certainty (van Buuren et al., 1999). However, due to computational limitations or problems with multicollinearity, it is often neither feasible nor appropriate to use all variables. A stepwise process for the selection of variables for inclusion in a multiple imputation model using either MCMC or MICE has been suggested by van Buuren and colleagues (1999). First, one should include all of the patient variables with missing data, the outcome variables and important observed covariates. Failure to include the outcome variable in the imputation of a missing covariate increases the risk of bias when determining the association between covariates and the outcome (Moons et al., 2006). Specifically, there is an increased risk of underestimating the covariate-outcome association because there is no covariate-outcome association in the imputed data. Secondly, the variables that are associated with the missingness of the data should be included in the model. Thirdly, variables that are highly correlated with the variables with missing data should be included. Finally, variables that have a very high proportion of missing data should be removed from the imputation model if MCMC methods are being used. Because MICE imputes data variable by variable, a different set of predictors can be used for each variable, including variables that might otherwise have been excluded because of a very high proportion of missing data. In this respect MICE can be more advantageous than MCMC methods. The selection of appropriate variables is crucial to providing accurate imputed values.

7.1.4.5 Monitoring convergence

Attempts should be made to determine whether the MICE algorithm has reached convergence (that is, whether the chain has reached equilibrium), although no definitive method exists. The goal is to have a sufficient number of iterations to stabilize the distribution of the parameters. In general MICE requires fewer iterations than the MCMC methods, which can often require thousands of iterations (van Buuren & Oudshoorn, 1999). Simulation studies have demonstrated that the imputations using a MICE algorithm have stabilized after 10 to 20 iterations, and sometimes after as few as five (Brand, 1999), such that the order in which the variables are imputed is no longer an issue (Stuart et al., 2009). One method for assessing convergence of the MICE algorithm is to increase the number of iterations/cycles and examine the data for any noticeable differences. This can be carried out in Stata® by plotting the mean value of each imputed variable against the iteration number (Royston, 2005b). The model has converged when no trend (just random jumps up and down) is apparent in each plot.
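A rough numerical companion to such a plot is the least-squares slope of the mean imputed value against the iteration number: a slope close to zero is consistent with the absence of a trend. This is an assumption made here for illustration, not a formal convergence test and not part of the cited method. The function name and the two toy traces are hypothetical.

```python
import statistics

def trace_slope(trace):
    """Least-squares slope of mean imputed value against iteration number."""
    iters = range(1, len(trace) + 1)
    mi = statistics.fmean(iters)
    mt = statistics.fmean(trace)
    sxx = sum((i - mi) ** 2 for i in iters)
    return sum((i - mi) * (t - mt) for i, t in zip(iters, trace)) / sxx

# Toy traces (illustrative only): mean of the imputed values at each iteration.
drifting = [10.0, 11.0, 12.0, 13.0, 14.0]   # steady upward trend
stable = [12.1, 11.9, 12.0, 12.2, 11.8]     # random jumps, no trend

slope_drifting = trace_slope(drifting)
slope_stable = trace_slope(stable)
```

The drifting trace has a slope of 1.0 per iteration, whereas the stable trace has a slope near zero. Visual inspection of the plot remains the approach described in the text.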

7.1.4.6 Imputation diagnostics

After imputed data have been created, one should check to see whether data from the imputations are plausible and whether they differ from the observed data. Differences can arise from the model used to generate the imputations or may indicate that the missingness assumption has been violated, which is a more serious concern (Abayomi et al., 2008). Although there are no agreed tests, statistical and visual diagnostic tests can be used to identify potential problems with the imputed data. A simple graphical method is to plot the density distribution of observed and imputed values (i.e. only those values actually imputed and not all values in the imputed data sets) (Royston, 2005a; Abayomi et al., 2008). These plots are useful for detecting important differences between the observed and imputed data. Another graphical method is to use bivariate scatter plots, which compare the internal consistency of the imputed and observed values with respect to a continuous variable. Finally, a significant result from a Kolmogorov–Smirnov test may signal potential differences between observed and imputed values.
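The statistic behind the Kolmogorov–Smirnov comparison can be sketched as the largest vertical gap between the empirical distribution functions of the observed and the imputed values. The sketch below computes only that statistic and no p-value; for a full test a library routine such as `scipy.stats.ks_2samp` could be used instead. The function name and toy values are hypothetical.

```python
import bisect

def ks_statistic(observed, imputed):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between ECDFs."""
    obs, imp = sorted(observed), sorted(imputed)
    d = 0.0
    # The largest gap between two step functions occurs at a data point.
    for v in obs + imp:
        f_obs = bisect.bisect_right(obs, v) / len(obs)
        f_imp = bisect.bisect_right(imp, v) / len(imp)
        d = max(d, abs(f_obs - f_imp))
    return d

# Toy values (illustrative only).
d_same = ks_statistic([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 4.0])
d_apart = ks_statistic([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
```

Identical samples give a statistic of 0, and completely separated samples give 1. A large value indicates that the imputed values are distributed differently from the observed ones, which under MAR is not necessarily a problem but is worth investigating.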


7.1.4.7 Assessing the impact of multiple imputation

Finally, one needs to assess the impact of the multiple imputation. Often comparisons are made between the results obtained using complete case analysis and those from analyses that have used multiply imputed data.

7.1.4.8 The use of multiple imputation in the literature

The number of publications that have used multiple imputation has increased significantly, although the details of the imputation procedures are often severely lacking (Sterne et al., 2009). Guidelines on the reporting of information on missing data and the implementation of multiple imputation have been suggested by Sterne and colleagues (2009) (Table 7.1).

Table 7.1 – Guidelines for reporting any analysis potentially affected by missing data (from Sterne and colleagues 2009)

• Report the number of missing values for each variable of interest, or the number of cases with complete data for each important component of the analysis

• Clarify whether there are important differences between individuals with complete and incomplete data

• Describe the type of analysis used to account for missing data and the assumptions that were made

For analyses based on multiple imputation

• Provide details of the imputation modelling:

- Report details of the software used and of key settings for the imputation modelling

- Report the number of imputed data sets that were created

- What variables were included in the imputation procedure?

- How were non-normally distributed and binary/categorical variables dealt with?

- If statistical interactions were included in the final analyses, were they also included in imputation models?

• If a large fraction of the data is imputed, compare observed and imputed values

• Where possible, provide results from analyses restricted to complete cases, for comparison with results based on multiple imputation.

• Discuss whether the variables included in the imputation model make the MAR assumption plausible

• It is also desirable to investigate the robustness of key inferences to possible departures from the missing at random assumption, by assuming a range of MNAR mechanisms in sensitivity analyses.


7.1.4.9 Software for the application of multiple imputation procedures

Several widely used statistical packages provide methods for the development and analysis of imputed data sets (Harel & Zhou, 2007; Horton & Kleinman, 2007).

• SAS – the PROC MI procedure generates imputed data sets through different methods including MCMC techniques, regression, and propensity score methods. The MIANALYZE procedure combines the results of analyses of imputations.

• Stata® – the ice command implements multiple imputation by chained equations. The newest version, Stata® 11.0, now has a built-in mi command that does not implement MI through chained equations. Instead, imputations can be generated through various methods, including MCMC techniques.

• R and S-Plus – implement multiple imputation through chained equations

Freely available, stand-alone programmes also exist for undertaking multiple imputation, including NORM and MLwiN, which use MCMC techniques for the generation of imputed values.
