Application: Building an Imputation Model

This section discusses how the fully Bayesian framework and the multiple imputation methods in Section 4.5.3 can be applied to our data set from the Year 2010 Survey. We used two types of data set: (1) a data set with covariates, excluding nested variables and derived variables, and (2) a data set with the 15 drug-trying response variables only. Sections 4.6.1 and 4.6.2 below describe how we applied the fully Bayesian framework and multiple imputation by chained equations to these two types of data set.

4.6.1 Fully Bayesian Framework

To impute missing data under the fully Bayesian framework for the data set with the 15 drug-trying response variables only, we used the OpenBUGS program. Further details about OpenBUGS program code can be found in Ntzoufras (2009). We specified a statistical model with parameters and equations for the missing responses, linked these parameters with the observed covariates, and specified priors for these parameters. We loaded two Markov chains and compiled the data set and the statistical model. After specifying the initial values for the parameters, we updated the model parameters and missing data for 17,000 cycles, of which the first 1,000 cycles were discarded as burn-in, providing 16,000 usable cycles for statistical inference. We also inspected the trace plots to diagnose the convergence of both Markov chains.
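The updating scheme described above can be illustrated with a minimal sketch. It is not the OpenBUGS model used in this study: it assumes a toy model with a single binary item, a Beta(1, 1) prior and made-up responses, and only shows how missing data and parameters are updated in turn, how two chains are started from dispersed initial values, and how the burn-in cycles are discarded. The function name `run_chain` and all data values are hypothetical.

```python
import random

def run_chain(observed, n_missing, n_iter, burn_in, p_init, seed):
    """Toy data-augmentation sampler for one binary item with missing responses.

    Assumed model (illustrative only): y_i ~ Bernoulli(p), p ~ Beta(1, 1).
    """
    rng = random.Random(seed)
    p = p_init
    n_obs, s_obs = len(observed), sum(observed)
    draws = []
    for _ in range(n_iter):
        # Step 1: impute each missing response given the current value of p.
        s_mis = sum(1 for _ in range(n_missing) if rng.random() < p)
        # Step 2: update p from its Beta full conditional given the completed data.
        successes = s_obs + s_mis
        failures = (n_obs + n_missing) - successes
        p = rng.betavariate(1 + successes, 1 + failures)
        draws.append(p)
    # Discard the burn-in cycles; only the remaining cycles are used for inference.
    return draws[burn_in:]

observed = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]  # made-up observed responses
chains = [
    run_chain(observed, n_missing=4, n_iter=17_000, burn_in=1_000, p_init=p0, seed=s)
    for p0, s in [(0.1, 1), (0.9, 2)]       # two chains with dispersed initial values
]
kept = [len(c) for c in chains]             # usable cycles retained in each chain
means = [sum(c) / len(c) for c in chains]   # chain means should agree after convergence
```

In this sketch each chain retains 17,000 − 1,000 = 16,000 cycles, and close agreement between the two chain means plays the role that the trace plots play in the diagnosis above.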

The fully Bayesian framework was also applied to item response theory in Chapter 6. Details of this application can be found in Sections 6.2.2 and 6.3.

4.6.2 Multiple Imputation by Chained Equations

For the multiple imputation by chained equations, we used the mice package in R (Buuren and Groothuis-Oudshoorn, 2011) on the two types of data set. Two MICE imputation schemes were involved: scheme 1, based on the 15 drug-trying response variables only, and scheme 2, based on the full data frame. We produced W = 10 imputed data sets through 200 imputation cycles. For binary variables, we adopted the logistic regression method (logreg); for categorical variables with more than two levels, we adopted the multinomial (polytomous) logit regression model (polyreg); for continuous variables, we adopted the normal linear regression model (norm). All of these are Bayesian imputation methods, following Rubin (1987) and Brand (1999).

CHAPTER 4. MISSING DATA THEORY, METHODOLOGY AND APPLICATION
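The assignment of imputation methods to variable types described above can be sketched as follows. This is only an illustration in Python, not the R code used in this study. The helper `choose_method`, the variable names `TriedCannabis` and `Region`, and the numbers of levels are hypothetical; only the three method labels and the variable Cg7Num come from the text.

```python
def choose_method(kind, n_levels=None):
    """Return the imputation method label for a variable.

    kind: "categorical" or "continuous" (hypothetical type tags for this sketch).
    n_levels: number of categories, required for categorical variables.
    """
    if kind == "continuous":
        return "norm"          # normal linear regression
    if kind == "categorical":
        if n_levels == 2:
            return "logreg"    # logistic regression for binary variables
        if n_levels is not None and n_levels > 2:
            return "polyreg"   # polytomous logit regression for more than two levels
    raise ValueError(f"cannot choose a method for kind={kind!r}, n_levels={n_levels!r}")

# Hypothetical variable descriptions: name -> (kind, number of levels).
variables = {
    "TriedCannabis": ("categorical", 2),
    "Region": ("categorical", 9),
    "Cg7Num": ("continuous", None),
}
methods = {name: choose_method(kind, levels) for name, (kind, levels) in variables.items()}
```

In mice, such labels are passed per column through the `method` argument, and W = 10 with 200 cycles would presumably correspond to `m = 10` and `maxit = 200`; that mapping is our reading rather than something stated in the text.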

For continuous variables with lower limits, upper limits, or both, we transformed them to approximate normality before imputation by the following method. Suppose a continuous variable $Y_P$ has a lower limit of zero, and we wish to transform $Y_P$ into $Y'_P$ for imputation. Then, for each value of $Y_P$ corresponding to respondent $i$, $Y_{i,P}$, $i = 1, \ldots, N$, we adopted a transformation function $f : (0, \infty) \to \mathbb{R}$ to transform each $Y_{i,P}$ to $Y'_{i,P}$. The transformation function for each $Y_{i,P}$ was defined as below:

$$f(Y_{i,P}) = Y'_{i,P} = \log(Y_{i,P}). \tag{4.16}$$

For any $Y_{i,P} = 0$, we added a small number, i.e. $1 \times 10^{-6}$, onto $Y_{i,P}$ before applying the transformation function. After imputation, we used the inverse function $f^{-1}(Y'_{i,P})$ to transform $Y'_{i,P}$ back to $Y_{i,P}$. Values of $Y_{i,P}$ between 0 and $1 \times 10^{-6}$ were treated as 0 and, for a variable with an upper limit $u$, values of $Y_{i,P}$ between $u - 1 \times 10^{-6}$ and $u$ were treated as $u$.
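The transformation in Equation 4.16 and its inverse can be sketched in a few lines. This is an illustrative Python version under stated assumptions, not the code used in this study: the function names `forward` and `backward` and the small floating-point tolerance are our own.

```python
import math

EPS = 1e-6  # small number added to zero values, as in the text

def forward(y):
    """Equation 4.16 with the zero augmentation: log(y), using EPS in place of 0."""
    return math.log(EPS if y == 0 else y)

def backward(y_prime, upper=None):
    """Inverse transformation exp(y'), snapping values within EPS of a limit to it."""
    y = math.exp(y_prime)
    # The relative tolerance guards against floating-point round-off at exactly EPS.
    if y <= EPS * (1 + 1e-9):
        return 0.0
    if upper is not None and upper - EPS <= y <= upper:
        return float(upper)
    return y

zero_transformed = forward(0)        # log(1e-6), about -13.81551
zero_recovered = backward(forward(0))  # snapped back to 0
imputed_low = backward(-20)          # exp(-20) is below EPS, so treated as 0
imputed_mid = backward(0.5)          # exp(0.5), about 1.648721
```

The values produced here agree with those reported for Cg7Num in Table 4.6.1: an observed 0 is transformed to about −13.81551, and imputed values of −20 and 0.5 on the log scale are transformed back to 0 and about 1.648721.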

The above log-transformation method was implemented on any count data, as well as on any variable that spans the range $[0, \infty)$, for mapping and [...]. The log-transformation is not a variance-stabilising function, whereas the square root transformation is a variance-stabilising function. However, the log-transformation was chosen over the square root transformation in this study for the following reasons: (1) the square root transformation only maps variables that span the range $[0, \infty)$ monotonically onto $[0, \infty)$, so that the order of the values is maintained but the values are not mapped onto the whole real line; (2) a square-rooted value can be either negative or positive, so that values can be mapped from $[0, \infty)$ to $(-\infty, \infty)$, but in that case the mapping is no longer monotone, whereas the log-transformation is monotone whilst mapping values from $[0, \infty)$ to $(-\infty, \infty)$, so that the order of the values is maintained during mapping; and (3) we used the mice package in R for multiple imputation by chained equations since, at the time of data analysis, mice was the only R package that offered this method. However, mice did not offer a Poisson or negative binomial regression option for count variables, nor a Gamma regression option for variables that span the range $[0, \infty)$.

Because the log-transformation is not a variance-stabilising function, as mentioned above, there might be a risk of heteroscedasticity in the regression analysis, compromising the likelihood-based estimation of standard errors. However, only three variables used this log-transformation method for imputing missing values within the MICE imputation in this study, and iteratively reweighted least squares, the method used in the regression analysis, is robust against heteroscedasticity (Mak, 1992). The log-transformation might therefore not pose a serious problem in the regression analysis.

For example, for the variable recording the number of cigarettes that a student had smoked during the week prior to the survey (Cg7Num), the values before and after imputation under Equation 4.16 are listed in the following table:


Table 4.6.1: Values of Cg7Num Variable during Imputation

          Before imputation                          After imputation
Y_{i,P}   augmented Y_{i,P}   Y'_{i,P}      Y'_{i,P}      augmented Y_{i,P}   Y_{i,P}
0         0.000001            -13.81551     -13.81551     0.000001            0
0.5       0.5                 -0.693147     -0.693147     0.5                 0.5
1         1                   0             0             1                   1
5         5                   1.609438      1.609438      5                   5
10        10                  2.302585      2.302585      10                  10
missing   missing             missing       -20           0.000000            0
missing   missing             missing       0.5           1.648721            1.648721

We ordered the variables in both data sets in ascending order of the percentage of missingness (from the smallest to the largest). We did so because multiple imputation by chained equations is carried out on each covariate in turn according to its order, and imputing the covariates with the fewest missing data first makes more observed data available during the imputation process, thus improving the prediction of missing values. After imputation, we checked the plots of the mean and standard deviation of all variables involved in the imputation to diagnose whether the imputations had converged for all of these variables.
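The ordering step can be sketched as below. This is a minimal Python illustration with a made-up data set, in which missing values are coded as `None`. The function name `order_by_missingness` and the column names are hypothetical, and how the resulting order was supplied to mice (for instance through its `visitSequence` argument) is our assumption rather than something stated in the text.

```python
def order_by_missingness(data):
    """Return column names sorted by ascending fraction of missing values (None)."""
    def fraction_missing(name):
        column = data[name]
        return sum(value is None for value in column) / len(column)
    return sorted(data, key=fraction_missing)

# Made-up data set: column name -> list of values, None marks a missing value.
data = {
    "A": [1, None, None, 0],   # 50% missing
    "B": [1, 0, 1, 0],         # 0% missing
    "C": [None, 2, 3, 1],      # 25% missing
}
order = order_by_missingness(data)  # ['B', 'C', 'A']
```

Columns with equal percentages of missingness keep their original relative order, since Python's `sorted` is stable.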

In document: A study into drug trying behaviour among young people in England: categorical analysis models in the presence of missing data (pages 134-137)