Chapter 3: A simulation study to evaluate three key strategies to handle missing data
3.1 Review of the methods for handling missing data
3.2.5 Multivariate Normal imputation (MVNI)
Multiple imputation under the normal model assumes a joint multivariate normal distribution for all variables. Under the multivariate normal model, missing data are imputed using simultaneous linear regression models in which each variable potentially depends on all other variables (Schafer and Olsen, 1998). Various methods can be used to fit the joint distribution and make Bayesian draws from it. The method of choice depends on the type of missing data pattern, i.e. monotone or arbitrary. A data set is said to have a monotone missing pattern when the fact that a variable Yj is missing for individual i implies that all subsequent variables Yk, k>j, are also missing for individual i. For data sets with arbitrary or non-monotone missing patterns, a Markov Chain Monte Carlo
(MCMC) method (Schafer 1997) can be used. A Markov chain is a sequence of random variables in which the distribution of each element depends only on the value of the previous one. MCMC creates multiple imputations by drawing simulations from a Bayesian predictive distribution for normal data. A regression model is fitted for each variable with missing values, with the other variables as covariates. Based on the fitted regression coefficients, a new regression model is simulated from the posterior distribution of the parameters and is used to impute the missing values for each variable (Rubin 1987). The process is repeated sequentially for variables with missing values.
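The distinction between monotone and arbitrary missing patterns defined above can be checked mechanically. The following is a minimal sketch in Python, for illustration only and not part of the analysis in this thesis; the function name is_monotone and the example data are hypothetical, missing values are coded as None, and the columns are assumed to be already ordered Y1, Y2, ..., since monotonicity depends on the variable ordering.

```python
from typing import Optional, Sequence


def is_monotone(rows: Sequence[Sequence[Optional[float]]]) -> bool:
    """Return True if, in every row, a missing value (None) is never
    followed by an observed value in a later column."""
    for row in rows:
        seen_missing = False
        for value in row:
            if value is None:
                seen_missing = True
            elif seen_missing:
                # An observed value after a missing one: arbitrary pattern.
                return False
    return True


# Hypothetical data with columns ordered Y1, Y2, Y3.
dropout = [[1.0, 2.0, 3.0], [1.5, 2.5, None], [0.7, None, None]]
intermittent = [[1.0, None, 3.0], [1.5, 2.5, None]]

print(is_monotone(dropout))       # True
print(is_monotone(intermittent))  # False
```

A pattern such as the second one, where an individual misses one wave and returns at the next, is the arbitrary case for which the MCMC method is used.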
MVNI in SAS
For each of the 1,000 replicates, 5 imputed data sets were generated using PROC MI in SAS (SAS OnlineDoc™: Version 8; Vargas-Chanes et al., 2003). The missing values were imputed using a Markov Chain Monte Carlo (MCMC) method, which is suitable for arbitrary missing data patterns and assumes multivariate normality. In MCMC, one constructs a Markov chain long enough for the distribution of its elements to stabilize to a common, stationary distribution. Repeatedly simulating steps of the chain then yields draws from the distribution of interest.
In Bayesian inference, information about unknown parameters is expressed in the form of a posterior distribution. MCMC has been applied as a method for exploring posterior distributions in Bayesian inference. That is, through MCMC, one can simulate the entire joint distribution of the unknown quantities and obtain simulation-based estimates of the posterior parameters of interest. Assuming that the data are from a multivariate normal distribution, data augmentation is applied to Bayesian inference with missing data by repeating a series of imputation and posterior steps. In the Imputation (I) step the missing data are imputed by drawing values from their conditional distribution, given the observed values and the current parameters; in the Posterior (P) step new values for the parameters are simulated by drawing them from a Bayesian posterior distribution given the observed data and the most recent imputed values (from the I step) for the missing data (Vargas-Chanes et al., 2003). These two steps are iterated long enough for the results to be reliable for a multiply imputed data set (Schafer 1997).
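The alternation of I and P steps can be illustrated with a deliberately simplified sketch. The Python code below is not the PROC MI algorithm: it assumes a single normally distributed variable rather than a multivariate normal model, a non-informative prior proportional to 1/sigma2, and missing values coded as None. The function name data_augmentation and the example data are hypothetical.

```python
import math
import random
from statistics import mean, variance


def data_augmentation(y, n_iter=200, seed=1):
    """Toy data augmentation for a univariate normal sample with missing
    values (None), under the prior p(mu, sigma2) proportional to 1/sigma2."""
    rng = random.Random(seed)
    observed = [v for v in y if v is not None]
    n = len(y)
    # Starting values taken from the observed cases (a choice of this sketch).
    mu, sigma2 = mean(observed), variance(observed)
    completed = list(y)
    for _ in range(n_iter):
        # I-step: draw each missing value given the current parameters.
        completed = [v if v is not None else rng.gauss(mu, math.sqrt(sigma2))
                     for v in y]
        # P-step: draw the parameters from their posterior distribution
        # given the completed data.
        ybar, s2 = mean(completed), variance(completed)
        chi2 = rng.gammavariate((n - 1) / 2.0, 2.0)  # chi-square, n-1 df
        sigma2 = (n - 1) * s2 / chi2
        mu = rng.gauss(ybar, math.sqrt(sigma2 / n))
    return completed, mu, sigma2


# Hypothetical sample with two missing values.
y = [2.1, None, 3.4, 2.8, None, 3.0]
completed, mu, sigma2 = data_augmentation(y, n_iter=200, seed=42)
print(completed)
```

The completed data returned after the final I-step play the role of one imputed data set; repeating the run with a separate chain, or continuing the same chain, would give further imputations.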
By default, the SAS procedure uses the MCMC method with a single chain to create five imputations. I specified multiple chains, meaning that a separate chain is used for each imputation (data set), because using multiple chains may be computationally more efficient than a single long chain. The posterior mode, the highest observed-data posterior density, with a non-informative prior, is computed from the Expectation Maximization (EM) algorithm and is used as the starting value for the chain. The EM algorithm starts by randomly assigning values to all the parameters to be estimated. It then iteratively alternates between two steps: the expectation step (E-step), in which it computes the expected log-likelihood for the complete data, and the maximization step (M-step), in which it re-estimates all the parameters by maximizing the likelihood function for the complete data (Little and Rubin, 2002). The MI procedure uses 200 burn-in iterations before the first imputation and 100 iterations between imputations. In a Markov chain, the information in the current iteration influences the state of the next iteration. The burn-in iterations are iterations at the beginning of each chain that are used to eliminate the dependence of the series on the starting value of the chain and to achieve a stationary distribution.
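To make the E-step and M-step concrete, the following Python sketch runs EM for a bivariate normal model in which the first variable is fully observed and the second has missing values (coded as None). It is an illustration under simplifying assumptions, not the routine used by PROC MI: it returns maximum likelihood estimates rather than a posterior mode, it starts from complete-case moments rather than random values, and the function name em_bivariate and the data are hypothetical.

```python
def em_bivariate(y1, y2, n_iter=500, tol=1e-10):
    """EM estimates (mu1, mu2, s11, s12, s22) for a bivariate normal
    where y1 is fully observed and y2 may contain None."""
    n = len(y1)
    pairs = [(a, b) for a, b in zip(y1, y2) if b is not None]
    # y1 is complete, so its mean and variance need no iteration.
    mu1 = sum(y1) / n
    s11 = sum((a - mu1) ** 2 for a in y1) / n
    # Starting values from the complete cases (an assumption of this sketch).
    mu2 = sum(b for _, b in pairs) / len(pairs)
    s22 = sum((b - mu2) ** 2 for _, b in pairs) / len(pairs)
    s12 = 0.0
    for _ in range(n_iter):
        # E-step: expected sufficient statistics involving y2,
        # given y1 and the current parameter values.
        beta = s12 / s11
        resid_var = s22 - s12 * beta
        t2 = t22 = t12 = 0.0
        for a, b in zip(y1, y2):
            if b is None:
                e_b = mu2 + beta * (a - mu1)   # E[y2 | y1]
                e_bb = e_b ** 2 + resid_var    # E[y2^2 | y1]
            else:
                e_b, e_bb = b, b ** 2
            t2 += e_b
            t22 += e_bb
            t12 += a * e_b
        # M-step: re-estimate the parameters from the expected statistics.
        new_mu2 = t2 / n
        new_s22 = t22 / n - new_mu2 ** 2
        new_s12 = t12 / n - mu1 * new_mu2
        change = (abs(new_mu2 - mu2) + abs(new_s22 - s22)
                  + abs(new_s12 - s12))
        mu2, s22, s12 = new_mu2, new_s22, new_s12
        if change < tol:
            break
    return mu1, mu2, s11, s12, s22


# Hypothetical data: the last individual has y2 missing.
print(em_bivariate([1.0, 2.0, 3.0, 4.0], [2.0, 4.1, 5.9, None]))
```

In the MCMC setting described above, estimates of this kind would only supply the starting point of each chain; the burn-in iterations then remove the dependence on that starting point.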
To monitor convergence of the MCMC and assess whether the number of iterations was sufficient, I examined the time-series and autocorrelation function plots for the means of the independent variables. For quality of life and depressive symptoms at each wave, I requested the time plot of the mean against the iterations, and the autocorrelations (with 95% confidence limits) for the means at various lags in the sequence of iterations. The time-series plots showed that for both variables the series of iterations had converged, as each resembled a horizontal band without long upward or downward trends. Similarly, the autocorrelation plots showed no significant negative or positive autocorrelations.
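The quantity shown in an autocorrelation plot can be sketched as follows. This Python example is for illustration only: the function name autocorrelations and the input series are hypothetical, and the 95% limits use the common white-noise approximation of plus or minus 1.96 divided by the square root of the series length, which may differ from the limits SAS computes.

```python
import math


def autocorrelations(series, max_lag):
    """Sample autocorrelations at lags 1..max_lag, together with an
    approximate 95% limit under the white-noise assumption."""
    n = len(series)
    m = sum(series) / n
    c0 = sum((x - m) ** 2 for x in series)
    acf = []
    for lag in range(1, max_lag + 1):
        ck = sum((series[t] - m) * (series[t + lag] - m)
                 for t in range(n - lag))
        acf.append(ck / c0)
    limit = 1.96 / math.sqrt(n)
    return acf, limit


# Hypothetical series of simulated means across iterations.
means = [24.1, 24.3, 23.9, 24.0, 24.2, 24.1, 23.8, 24.4, 24.0, 24.1]
acf, limit = autocorrelations(means, max_lag=3)
for lag, r in enumerate(acf, start=1):
    flag = "outside" if abs(r) > limit else "within"
    print(f"lag {lag}: r = {r:.3f} ({flag} the approximate 95% limits)")
```

Autocorrelations that stay within the limits at all lags are consistent with successive iterations being approximately independent.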
The imputation model included the same variables as the substantive models (including the interaction term between CHD and gender). Categorical and binary variables were imputed under the normal model, and the imputed values were rounded to the nearest category. The variable for depressive symptoms was imputed under the normal model and then transformed into a dummy variable for use as an outcome variable.
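Rounding an imputed continuous value to the nearest category can be sketched in a few lines of Python. This is an illustration only; the function name round_to_category and the category codes are hypothetical and do not reproduce the coding of any variable in this study.

```python
def round_to_category(value, categories):
    """Return the valid category code closest to an imputed value."""
    return min(categories, key=lambda c: abs(c - value))


# Binary variable coded 0/1.
print(round_to_category(0.37, [0, 1]))    # 0
print(round_to_category(-0.40, [0, 1]))   # 0
# Three-category variable coded 1, 2, 3.
print(round_to_category(1.80, [1, 2, 3])) # 2
print(round_to_category(3.60, [1, 2, 3])) # 3
```

Choosing the closest valid code also handles imputed values that fall outside the range of the categories, which can occur under the normal model.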
Although the regression and MCMC methods assume multivariate normality, inferences based on multiple imputation can be robust to departures from the multivariate normality assumption if the amount of missing information is not large. It often makes
sense to use a normal model to create multiple imputations even when the observed data are somewhat non-normal, as supported by simulation studies described in Schafer (1997) and the original references therein. The imputation of the quality of life measure (CASP-19) and of the depressive symptoms measure (CESD-8) was performed at the level of each summed index and not for the individual items that constitute the two measures.
Imputation of the 1,000 replicates was performed in blocks of 100 (SAS coding for imputation is available in Appendix 3.3). Imputed data sets were then saved and transferred to Stata for estimation of the random intercept models, with the estimates from the imputed data sets combined using Rubin’s rules. Because of system limitations, the analysis of each random intercept model was also run in blocks of 100 imputed replicates and the estimates stored.
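Rubin’s rules combine the estimates from the imputed data sets into a single estimate and standard error. The Python sketch below illustrates the calculation for one coefficient; it is not the Stata code used for the analysis, and the function name pool_rubin and the input values are hypothetical.

```python
import math


def pool_rubin(estimates, variances):
    """Pool one parameter across m imputed data sets using Rubin's rules.

    estimates: the m point estimates; variances: their squared standard
    errors. Returns the pooled estimate and its standard error.
    """
    m = len(estimates)
    q_bar = sum(estimates) / m                      # pooled estimate
    within = sum(variances) / m                     # within-imputation variance
    between = sum((q - q_bar) ** 2 for q in estimates) / (m - 1)
    total = within + (1 + 1 / m) * between          # total variance
    return q_bar, math.sqrt(total)


# Hypothetical estimates of one coefficient from 5 imputed data sets.
est = [0.52, 0.47, 0.55, 0.50, 0.49]
var = [0.010, 0.011, 0.009, 0.010, 0.012]
pooled, se = pool_rubin(est, var)
print(f"pooled estimate = {pooled:.3f}, standard error = {se:.3f}")
```

The between-imputation term is what distinguishes this standard error from one obtained by analysing a single imputed data set as if it were complete.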