Handling Missing Data with Multiple Imputation (MI)

6.1.1 Handling Missing Data with Multiple Imputation (MI)

Multiple Imputation (MI), proposed by Rubin (1987), accounts for the uncertainty in missing values by generating several different plausible datasets, analysing each separately, and combining the results. The MI procedure assumes that the missing data is missing at random (MAR), i.e. the probability that a value is missing depends only on observed values and can be predicted using them. It handles missing data in three stages:

Stage 1: Imputation. M copies of the dataset are created, with missing values in each dataset replaced by plausible imputed values. These plausible values are predicted using variables correlated with the missing data. The differences between the imputed values across datasets reflect the uncertainty surrounding the missing data.

Stage 2: Analysis. Standard statistical methods can be used to model each imputed dataset. Estimates in each dataset will be different due to the variation introduced in the imputation process.

Stage 3: Pooling Results. The M estimates for each parameter are averaged together to produce one overall unbiased estimate

$$\bar{\theta} = \frac{1}{M} \sum_{m=1}^{M} \hat{\theta}_m .$$

The standard error, $\sigma$, is pooled from the M imputed datasets using Rubin’s Rules (1987), which account for the variability between imputed datasets. For each pooled parameter estimate $\theta$, the standard error is

$$\bar{\sigma}_{\theta} = \sqrt{\frac{1}{M} \sum_{m=1}^{M} \sigma_m^2 + \left(1 + \frac{1}{M}\right) \left(\frac{1}{M-1}\right) \sum_{m=1}^{M} (\hat{\theta}_m - \bar{\theta})^2},$$

where $\frac{1}{M} \sum_{m=1}^{M} \sigma_m^2$ is the average variance across imputations (within-imputation variance), $\left(\frac{1}{M-1}\right) \sum_{m=1}^{M} (\hat{\theta}_m - \bar{\theta})^2$ is the variance of parameter estimates across imputations (between-imputation variance), and $\left(1 + \frac{1}{M}\right)$ is a correction factor that converges to 1 as the number of imputations, M, increases. In this way the pooled standard error $\bar{\sigma}_{\theta}$ incorporates the between-imputation uncertainty with the within-imputation uncertainty that would occur in any estimation method.
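The pooling stage can be sketched in a few lines of Python. This is a minimal illustration of Rubin's Rules as stated above, not the code used in the analysis; the function name `pool_rubin` and the five estimates and standard errors are made up for the example.

```python
import math
from statistics import mean, variance


def pool_rubin(estimates, std_errors):
    """Pool M per-imputation estimates and standard errors with Rubin's Rules."""
    m = len(estimates)
    if m < 2 or m != len(std_errors):
        raise ValueError("need M >= 2 estimates with matching standard errors")
    theta_bar = mean(estimates)                     # pooled point estimate
    within = mean([se ** 2 for se in std_errors])   # within-imputation variance
    between = variance(estimates)                   # between-imputation variance (divisor M - 1)
    total = within + (1 + 1 / m) * between          # total variance of the pooled estimate
    return theta_bar, math.sqrt(total)


# Illustrative (invented) log-odds estimates from M = 5 imputed datasets.
estimates = [0.40, 0.50, 0.45, 0.55, 0.60]
std_errors = [0.10, 0.10, 0.10, 0.10, 0.10]
theta_bar, pooled_se = pool_rubin(estimates, std_errors)
```

With these invented inputs the pooled standard error is larger than any single-dataset standard error, because the between-imputation term is added to the within-imputation term.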

As long as the data is MAR, the pooled parameter estimates and standard errors are unbiased. With list-wise deletion, the SEs are larger due to the smaller sample size after discarding data from partial respondents. SEs under single imputation (in which only one imputed dataset is created) are too small because they do not account for the between-imputation uncertainty; MI minimises this bias (Newman, 2014).

Moreover, MI is robust to violations of normality assumptions and yields suitable results even with small sample sizes or reasonably high proportions of missing data (Kang, 2013). For missingness of 50% or more in a variable, however, MI risks becoming unstable (Wulff and Ejlskov, 2017).

It is advised to use techniques such as MI when the percentage of missing data is 10% or higher (Newman, 2014). While the total amount of missing data in the LMND sample is low (4%), a few variables have much more; the variable with the most missing data is mother’s ethnicity (30%). MI, therefore, seems appropriate for this analysis. To implement MI, we assume that the LMND data is missing at random.

Multivariate Imputation via Chained Equations (MICE)

Multivariate imputation via chained equations (MICE) is a popular approach to multiple imputation. MICE imputes missing values by taking each variable in turn and predicting values based on regression with the other variables.

Imputation follows an iterative process, sampling from the observed data. Consider a dataset with three variables, X, Y and Z, in which X contains missing values that depend on the other two variables. If Y and Z are fully observed, the missing values of X can be predicted from them and imputed. In this case, no iterations are necessary. However, if Z also has missing values that depend on X, a problem arises. The two variables are interdependent; imputing X values from Z and then Z from the now-complete X will produce different results than if the order of imputation were reversed. Therefore, the imputations need to follow an iterative procedure.

In the first round or “iteration”, imputations for the variable with the least missing data are estimated using only complete data. Then, the variable with the second least missing values is imputed using the complete data and the previously imputed values. After each variable has been through this cycle (one iteration), the process is repeated using the data from the last iteration (Raghunathan et al., 2001). In the case of X and Z above, each variable is repeatedly re-imputed from the other until the imputations stabilise. The number of iterations must be large enough that, by the end of the process, the distributions of imputed values have stabilised and the order in which the variables were considered no longer matters. This process can be computationally intensive when dealing with large datasets containing many variables. In such circumstances, the MICE algorithm can be amended to restrict the predictors in a given regression to those that exceed a specified minimum correlation with the dependent variable, thus saving computation time without sacrificing information that could affect the missing values (van Buuren and Groothuis-Oudshoorn, 2011).
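The iterative cycle can be illustrated with a deliberately simplified Python sketch for two interdependent variables. It is not the MICE implementation. It assumes simple linear regression with normally distributed residual noise. It starts from mean-filled placeholders, rather than fitting the first round to complete data only. It does not draw the regression parameters from their posterior distribution, as a proper imputation routine would. The function names and the toy data are hypothetical.

```python
import random
from statistics import mean


def fit_line(xs, ys):
    """Least-squares intercept, slope and residual SD for y ~ x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    intercept = y_bar - slope * x_bar
    resid = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    sd = (sum(r * r for r in resid) / max(len(xs) - 2, 1)) ** 0.5
    return intercept, slope, sd


def chained_imputation(data, n_iter=10, seed=0):
    """Return one completed copy of `data` (dict of two equal-length lists, None = missing)."""
    rng = random.Random(seed)
    missing = {v: [i for i, val in enumerate(col) if val is None] for v, col in data.items()}
    # Placeholder start (an assumption of this sketch): fill gaps with observed means.
    filled = {}
    for v, col in data.items():
        obs_mean = mean([val for val in col if val is not None])
        filled[v] = [obs_mean if val is None else val for val in col]
    # Visit incomplete variables from least to most missing.
    order = sorted((v for v in data if missing[v]), key=lambda v: len(missing[v]))
    for _ in range(n_iter):
        for target in order:
            predictor = next(v for v in data if v != target)  # two-variable sketch only
            observed = [i for i, val in enumerate(data[target]) if val is not None]
            a, b, sd = fit_line([filled[predictor][i] for i in observed],
                                [filled[target][i] for i in observed])
            for i in missing[target]:
                # Prediction plus random noise, so imputations differ between copies.
                filled[target][i] = a + b * filled[predictor][i] + rng.gauss(0, sd)
    return filled


# Toy data: x and z each have two missing values.
x = [1.0, 2.0, None, 4.0, 5.0, 6.0, None, 8.0]
z = [2.1, None, 6.2, 7.9, None, 12.1, 13.8, 16.2]
imputed_sets = [chained_imputation({"x": x, "z": z}, seed=s) for s in range(5)]
```

Running the function with different seeds gives the M completed datasets for the analysis stage; observed values are left unchanged and only the gaps vary between copies.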

Perhaps the greatest benefit of this sophisticated method is its flexibility to handle different variable types by assigning each to a suitable distribution. Linear regression is used to impute incomplete continuous variables, while logistic regression is used for binary data, Poisson regression for count data, and polytomous regression for categorical variables (Raghunathan et al., 2001).
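As a small illustration, the assignment of an imputation model to each variable type can be written as a lookup. The mapping follows the sentence above; the variable names other than mother's ethnicity, and the types given to them, are hypothetical and not taken from the LMND data.

```python
# Imputation model per variable type, as listed in the text.
MODEL_FOR_TYPE = {
    "continuous": "linear regression",
    "binary": "logistic regression",
    "count": "Poisson regression",
    "categorical": "polytomous regression",
}

# Hypothetical variables and declared types (illustration only).
variable_types = {
    "birth_weight": "continuous",
    "smoker": "binary",
    "parity": "count",
    "mother_ethnicity": "categorical",
}

imputation_plan = {name: MODEL_FOR_TYPE[kind] for name, kind in variable_types.items()}
```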

Standard Logistic Regression and Model Selection with Multiply-imputed Data

Before model selection by backwards elimination, univariate logistic regression was performed on each of the twenty predictor variables.

Consider the observed birth outcome $y_i$ as a realisation of a Bernoulli random variable $Y_i$ taking values 0 for live births and 1 for stillbirths, with probability $\pi_i$ of stillbirth. The probability distribution is given by

$$\Pr\{Y_i = y_i\} = \pi_i^{y_i} (1 - \pi_i)^{1 - y_i}.$$

The probability $\pi_i$ of stillbirth for individual $i$ can be related to a linear function of $K$ observed covariates $x_k$ and parameters $\beta_k$ through the logit transformation, as described in Equations 5.1 and 5.2 in the previous chapter.
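A short sketch of these two ingredients, assuming the usual logit link (Equations 5.1 and 5.2 are not reproduced here, so the exact form is an assumption). The function names are illustrative.

```python
import math


def inv_logit(eta):
    """Map a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))


def bernoulli_pmf(y, pi):
    """Probability of outcome y (0 = live birth, 1 = stillbirth) given probability pi."""
    return pi ** y * (1 - pi) ** (1 - y)


def log_likelihood(betas, rows, outcomes):
    """Sum of log Bernoulli probabilities; each row holds one individual's covariates."""
    total = 0.0
    for x, y in zip(rows, outcomes):
        eta = sum(b * xk for b, xk in zip(betas, x))  # linear predictor
        total += math.log(bernoulli_pmf(y, inv_logit(eta)))
    return total
```

Logistic regression chooses the parameters that maximise this log-likelihood.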

The Wald chi-squared test. The significance of each variable was calculated with a Wald test. While the likelihood ratio test is popular for non-imputed data, the easier-to-obtain Wald test statistic has been favoured (and found to be more accurate) with imputed data, as it can be calculated from Rubin’s aforementioned pooled statistics (Wulff and Ejlskov, 2017; Eekhout et al., 2017).

The Wald test works by testing the null hypothesis that some or all of the parameters $\theta$ in a given model are equal to their “true” values $\theta_0$, which are usually set to zero. In this case it tests whether the variables corresponding to these parameters have an influence on the dependent variable.

Consider the set of $k$ variable coefficient estimates $\hat{\theta} = (\hat{\theta}_1, \ldots, \hat{\theta}_k)^t$, and the variance-covariance matrix $U = \operatorname{var}(\hat{\theta})$ of $\hat{\theta}$. For the null hypothesis $H_0: \theta_1 = \ldots = \theta_k = 0$, or in general $H_0: \theta = \theta_0$, the Wald test takes the form

$$W = (\hat{\theta} - \theta_0)^t U^{-1} (\hat{\theta} - \theta_0) \sim \chi^2_k$$

(Rubin, 1987). In the univariate analyses, the significance of each variable was calculated with a Wald test, compared against the null (empty) model. At the preliminary stage, each variable’s Wald statistic was compared with a $\chi^2$ distribution with 1 degree of freedom, and any variable with a p-value of less than 0.2 was retained for multivariate analysis. This relatively large p-value threshold was used to prevent potentially important variables from being removed before the multivariate stage (Mickey and Greenland, 1989).
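For a single parameter the Wald statistic reduces to the squared ratio of the pooled estimate to its pooled standard error, and the upper tail of a chi-squared distribution with 1 degree of freedom can be written with the complementary error function. The sketch below applies the 0.2 screening threshold described above; the variable names and pooled values are invented for illustration.

```python
import math


def wald_test_1df(estimate, std_error, null_value=0.0):
    """Wald statistic and p-value against a chi-squared distribution with 1 df."""
    w = ((estimate - null_value) / std_error) ** 2
    p_value = math.erfc(math.sqrt(w / 2.0))  # P(chi2_1 > w) = P(|Z| > sqrt(w))
    return w, p_value


# Hypothetical pooled (estimate, standard error) pairs from univariate models.
pooled = {"maternal_age": (0.30, 0.12), "smoker": (0.05, 0.10)}

# Keep variables whose univariate Wald p-value is below 0.2.
retained = [name for name, (est, se) in pooled.items()
            if wald_test_1df(est, se)[1] < 0.2]
```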

Multivariate variable selection. The iterative backwards selection process began with a full logistic regression model containing all of the predictors retained at the preliminary stage. Each possible way of removing one variable from the model was considered, and each of these slightly smaller models was compared with the full model through a Wald test; the variable that was least significant (i.e. had the largest p-value) was then dropped from the full model. This drop-the-least-significant process was repeated until no more variables could be omitted without weakening the model, i.e. until all of the Wald test p-values were below 0.05.
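The elimination loop itself is simple once the pooled p-values are available. In the sketch below, `pooled_p_values` is a hypothetical callback that would, in a real analysis, fit the model to every imputed dataset, pool the estimates and return one Wald p-value per variable. Here it is replaced by a toy function with invented numbers so that the loop can be run.

```python
def backward_eliminate(variables, pooled_p_values, alpha=0.05):
    """Drop the least significant variable until every p-value is below alpha."""
    current = list(variables)
    while current:
        p_values = pooled_p_values(current)          # refit with the current variables
        worst = max(current, key=lambda v: p_values[v])
        if p_values[worst] < alpha:
            break                                    # everything left is significant
        current.remove(worst)
    return current


def toy_p_values(current):
    """Stand-in for model fitting and pooling (invented numbers)."""
    p = {"a": 0.001, "b": 0.30, "c": 0.08, "d": 0.04}
    if "b" not in current:
        p["c"] = 0.03  # pretend c becomes significant once b is dropped
    return {v: p[v] for v in current}


selected = backward_eliminate(["a", "b", "c", "d"], toy_p_values)
```

The toy function changes one p-value after a variable is removed, which is why the p-values are recomputed on every pass rather than taken once from the full model.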

This selection process could be carried out manually with multiply-imputed data within Rubin’s framework, although it can be computationally infeasible with very large datasets or a large number of imputations. At each stage of model selection, the models were fitted to all imputed datasets and their estimates pooled, so that Wald tests could be applied to the pooled values (Wood et al., 2008).

6.1.2 Geographic Variation with Random Effects

As with the analysis for England and Wales in the previous chapter (Section 5.1.2), we wished to determine whether there is any additional area-level variation in stillbirth risk across areas of Scotland after accounting for the individual-level covariates. As in the England and Wales analysis, the LMND data contains a variable denoting the Local Authority district of residence for each mother. From this, the larger Health Board Area for each mother could also be derived.

To investigate the existence of such geographic variation, random effects were added to the final model from the backwards selection process in a similar fashion to Equation 5.6 from the previous chapter, remembering that the Bernoulli model in this analysis is a special case of the Binomial model depicted in this equation.
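For orientation, one common way of writing a logistic model with an area-level random intercept is given below. Equation 5.6 is not reproduced in this section, so this form, the index $j$ for areas and the symbols $u_j$ and $\sigma_u^2$ are assumptions rather than the thesis's own notation.

```latex
\operatorname{logit}(\pi_{ij}) = \sum_{k=1}^{K} \beta_k x_{kij} + u_j,
\qquad u_j \sim N(0, \sigma_u^2)
```

Here $\pi_{ij}$ would be the probability of stillbirth for individual $i$ living in area $j$, and a value of $\sigma_u^2$ clearly above zero would indicate area-level variation not explained by the individual-level covariates.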

In document Investigating the association between socio economic position and stillbirth in Brazil and the UK (Page 166-171)