Chapter 3 Statistical methods
3.6 Handling missing data
Multiple imputation was used to address missing data and attrition bias (discussed more in Chapter 7). The objective of any application of multiple imputation is to use the observed data to estimate “plausible values” to replace missing data (White et al., 2011, Royston and White, 2011, Royston, 2004). A random draw from the posterior predictive distribution of the missing data, conditional on the observed data, is used for this purpose. Take the example used earlier where average units of alcohol consumption in 100 adolescents was the outcome, and the exposure was whether or not mothers drank over 14 units of alcohol in the average week. Assume both were fully observed (N=100). Multiple imputation could be used if a third variable with missing data was added, for example if we wanted to adjust for maternal age in years at the time of the adolescent’s birth, and the variable was only partially observed (N=80). The multiple imputation approach would first involve using (linear) regression to model maternal age on maternal drinking and adolescent drinking for the 80 participants with observed data on maternal age. A ‘single imputation’ approach could terminate at this stage by using the regression parameters to predict each of the 20 missing values. However, the imputed data would have no error terms (i.e. uncertainty from the imputation process cannot be quantified). Instead, multiple imputation uses resampling to make repeated random draws from the posterior predictive distribution to replace the missing data (hence multiple imputation). Additionally, a single imputed dataset is not generally regarded as sufficient to capture the error, and so the process is repeated m times to produce m imputed datasets. Thus, multiple imputation would produce m datasets in which the 20 missing values for maternal age vary across the m datasets.
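The worked example above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure used in this thesis: the data are simulated, all variable names are hypothetical, and only the residual error is drawn at random. A full implementation would also draw the regression coefficients from their posterior distribution before predicting.

```python
import random

def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def fit_ols(rows, y):
    """Least-squares coefficients and residual SD; each row is [1, x1, x2]."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    beta = solve(xtx, xty)
    resid = [yi - sum(b * x for b, x in zip(beta, r)) for r, yi in zip(rows, y)]
    sd = (sum(e * e for e in resid) / (len(y) - k)) ** 0.5
    return beta, sd

rng = random.Random(1)
# Hypothetical simulated data for 100 mother-adolescent pairs.
mat_drink = [rng.random() < 0.3 for _ in range(100)]   # mother drinks over 14 units/week
adol_units = [max(0.0, rng.gauss(6 + 2 * d, 3)) for d in mat_drink]
true_age = [rng.gauss(28 - 1.5 * d, 4) for d in mat_drink]
missing = set(rng.sample(range(100), 20))              # 20 cases lack maternal age
mat_age = [None if i in missing else a for i, a in enumerate(true_age)]

# Regress maternal age on both fully observed variables, using the 80 observed cases.
design = [[1.0, float(d), u] for d, u in zip(mat_drink, adol_units)]
obs = [i for i in range(100) if i not in missing]
beta, sd = fit_ols([design[i] for i in obs], [mat_age[i] for i in obs])

m = 5  # number of imputed datasets in this sketch
imputed_sets = []
for _ in range(m):
    completed = list(mat_age)
    for i in sorted(missing):
        pred = sum(b * x for b, x in zip(beta, design[i]))
        completed[i] = pred + rng.gauss(0, sd)  # prediction plus a random error term
    imputed_sets.append(completed)
```

Each list in `imputed_sets` holds the same 80 observed ages, while the 20 imputed ages differ from one dataset to the next, which is the between-imputation variability that the pooling step later exploits.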
A very common multiple imputation approach used in practice is Multiple Imputation by Chained Equations, or MICE (White et al., 2011). It is popular for several reasons, but one of the most notable is that it can reliably impute missing data for several incomplete variables while quantifying uncertainty in the same way as above. It applies multiple imputation to each variable with missing data in sequence and over repeated iterations. Suppose that, in our example, we also want to adjust for a binary measure of maternal education (has or does not have a university degree). It also has missing data on 20 cases, although only 15 of these cases are also missing on maternal age. Thus each variable has 5 cases that are missing on it alone. As a result, 5 of the 80 cases that are observed for maternal age will be missing data on maternal education, and vice versa. MICE would then perform multiple imputation for maternal age and maternal education sequentially. All 20 missing cases would have imputed values for maternal age. A total of 5 of these would be imputed without any input from maternal education. Maternal education would then be imputed, but the 5 cases that were missing exclusively on maternal age would now be replaced by their imputation, thus all 80 of the observed cases that are used to predict the imputed value for education would have input from maternal age. The process is then repeated on maternal age to use the imputed values for education (hence ‘chained equations’). This cycle is iterated multiple times for each of the m imputed datasets. Chaining the imputation allows the observed and the unobserved cases to contribute to the imputed values. This is a second reason for the popularity of MICE: it accommodates missing data with more complex patterns (referred to as non-monotone in the literature). An additional reason for the popularity of MICE is that each variable is imputed using its own model. For example, the normally distributed maternal age variable would use linear regression to estimate its plausible values, while the binary maternal education variable would use logistic regression.
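The case counts in this example can be checked with simple set arithmetic. The sketch below assumes arbitrary, hypothetical case IDs; only the counts (20 missing on each variable, 15 missing on both) come from the example.

```python
# Hypothetical case IDs 0..99; which IDs are missing is arbitrary.
all_cases = set(range(100))
missing_age = set(range(0, 20))   # 20 cases missing maternal age
missing_edu = set(range(5, 25))   # 20 cases missing education, 15 shared with age

both_missing = missing_age & missing_edu
only_age = missing_age - missing_edu
only_edu = missing_edu - missing_age
complete = all_cases - (missing_age | missing_edu)
observed_age = all_cases - missing_age

print(len(both_missing), len(only_age), len(only_edu), len(complete))  # 15 5 5 75
print(len(observed_age & missing_edu))  # 5 of the 80 with observed age lack education

# With two variables, the pattern is monotone only if one set of missing
# cases is contained in the other; here neither is, so it is non-monotone.
is_monotone = missing_age <= missing_edu or missing_edu <= missing_age
print(is_monotone)  # False
```

Only 75 of the 100 cases are complete on both variables, which is why neither variable can simply be imputed once from fully observed predictors.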
Multiple imputation is a vast topic and a complicated technique in its own right. Some of the more pertinent technical issues are considered below. See Appendices A, B, and C for a detailed
description of the specific imputation process used in this thesis, diagnostics on the imputation models, and descriptives of the imputed data.
When is it appropriate to impute?
There is no real rule of thumb for when MICE should be used over complete case analysis (White et al., 2011). However, for this thesis, MICE was selected over complete case analysis for three reasons. First, complete case analysis was excluded because the large number of variables used in many of the models left them with vastly reduced power or unidentified. Second, the MICE approach is appropriate for the ALSPAC data. MICE is primarily used on data that are ‘missing at random’ (MAR), meaning that the probability of data being missing depends on the observed data and not, once the observed data are taken into account, on the missing values themselves. Thus, as most surveys including ALSPAC have missing data that can be predicted by the observed data, they can be assumed to be MAR. Indeed, ALSPAC data are frequently analysed after MICE has been performed (Mars et al., 2019, Khouja et al., 2019), including in studies focusing on adolescent alcohol harm (Mahedy et al., 2018, Lassi et al., 2019, Kendler et al., 2018). Third, the degree of missing data in ALSPAC would hamper the external validity of findings from complete case analysis (see Chapter 7).
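The MAR assumption is often written formally as follows. The notation is an assumption introduced here for illustration and is not taken from the cited sources: R denotes the indicator of which values are missing, and Y_obs and Y_mis denote the observed and missing parts of the data.

```latex
\Pr(R \mid Y_{\mathrm{obs}}, Y_{\mathrm{mis}}) = \Pr(R \mid Y_{\mathrm{obs}})
```

In words, once the observed data are conditioned on, the missing values carry no further information about which values are missing.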
What variables should be imputed?
Any variable that is to feature in the analysis should be imputed, including the exposure(s) and the outcome. Including the outcome can appear counterintuitive at first, especially in the context of a thesis which is interested in causal inference, rather than prediction. Nonetheless an association with the outcome and the exposures (and indeed confounders) can and should be exploited to predict missing values in the explanatory variables. Thus, including the outcome is entirely in keeping with the aim to replace missing data with plausible values. For the same reason, variables that predict missingness, but that may not be intended to be included in the analytical model, have also been shown to improve imputation (Royston, 2004). Interaction terms and other non-linear variables can
also be included. Historically, the literature is divided on whether interactions should be calculated prior to imputation (referred to as the ‘just another variable’ technique or JAV) or calculated during the imputation (passive imputation). The approach taken here was to use JAV, as the most recent literature suggests (Tilling et al., 2016).
What observations should be imputed?
In longitudinal data, it may be unrealistic to expect that participants who have not contributed data for a long time will improve the plausibility of the imputed values more than they add to uncertainty. For example, an ALSPAC participant who dropped out in the first year offers much less predictive power when imputing data on an outcome 16 years later, thus widening the range of the plausible values (uncertainty). The approach taken in this thesis was to restrict the sample prior to imputation to those participants who had some observed data from the age of 8 years.
How are different variables modelled?
As alluded to above, a wide range of common regression methods can be employed, including linear, logit, ordinal-logit, multinomial logit, Poisson, and more. This flexibility is another reason for MICE’s popularity. However, because imputation under linear regression adds normally distributed random error to each prediction, non-normal continuous measures should not be imputed using linear regression, as a skewed variable will start to normalise under this routine. MICE can instead use an alternative called predictive mean matching (PMM). This works by first modelling the continuous skewed variable on covariates, and then identifying, for each missing case, a pre-defined number of ‘nearest neighbours’: observed cases whose predicted values are closest to its own. The imputed value is then randomly drawn from the observed values of these neighbours in each iteration. This has the additional advantage of ensuring that the imputations respect the bounds of the observed data (imputed maternal age at the birth of offspring would not fall outside the range of ages observed in the sample, for example).
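The matching step in PMM can be sketched as follows. This is a simplified illustration under stated assumptions rather than the routine used in this thesis: the data are simulated, there is a single hypothetical covariate `x`, the number of donors `k` is arbitrary, and the regression coefficients are not perturbed between imputations as they typically would be in a full implementation.

```python
import random

rng = random.Random(7)
# Hypothetical skewed outcome y with one covariate x (both simulated).
x = [rng.uniform(0, 10) for _ in range(60)]
y = [rng.expovariate(1 / (1 + 0.5 * xi)) for xi in x]
missing = set(rng.sample(range(60), 12))
obs = [i for i in range(60) if i not in missing]

# Step 1: simple linear regression of y on x using the observed cases only.
mx = sum(x[i] for i in obs) / len(obs)
my = sum(y[i] for i in obs) / len(obs)
slope = (sum((x[i] - mx) * (y[i] - my) for i in obs)
         / sum((x[i] - mx) ** 2 for i in obs))
intercept = my - slope * mx
pred = [intercept + slope * xi for xi in x]

def pmm_draw(i, k=5):
    """Impute case i with the observed y of one of its k nearest donors."""
    # Step 2: donors are the observed cases with the closest predicted values.
    donors = sorted(obs, key=lambda j: abs(pred[j] - pred[i]))[:k]
    # Step 3: borrow the observed value of a randomly chosen donor.
    return y[rng.choice(donors)]

imputed = {i: pmm_draw(i) for i in sorted(missing)}
```

Because every imputed value is borrowed from an observed case, the imputations cannot leave the observed range and retain the skew of the observed distribution.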
Estimation
When there are m datasets, there will be m estimates for the models of interest. In the above example, if m=20 then there would be 20 different estimates of the effect of maternal drinking on adolescent drinking, adjusting for maternal age and education. These estimates can then be ‘pooled’ to produce a single coefficient, a confidence interval, and a measure of the Monte Carlo simulation error. Rubin’s rules are used for this purpose. They are based on asymptotic theory in a Bayesian framework. Explanation is beyond the scope of this thesis, but it should be noted that each regression and IPW model in Chapter 8 that used the imputed data combined its estimates across m using Rubin’s rules as encoded in the Stata teffects package. A total of 50 datasets were imputed for the analyses in this thesis (i.e. m=50). This number was selected as it was large enough to exploit the law of large numbers, while still being manageable. However, it was not possible to pool the estimates for
gformula in this way, due to its nature as a user-written Stata routine. Instead, the average is provided for each estimate including the confidence intervals, and the number of models for which the confidence intervals crossed the null is also provided as a measure of the variability across the