Method - CART used in Multiple Imputation and Data Augmentation

2.3 CART used in Multiple Imputation and Data Augmentation

3.2.2 Method

Unit nonresponse was handled as item nonresponse, manifested in the values of the variable 'participation status', which served as the binary response indicator. The analysis was performed using a binary probit model. A binary probit model was chosen over a binary logit model because its distributional assumptions were more suitable for the analyzed model. First, the missing values in the data and the coefficients of the probit model were initialized. CART was then applied to the completed data to replace the starting values with new draws from the donor values (final nodes). The parameters of the binary probit model were updated as well, based on those new values. Both approaches, CART and the binary probit analysis, were conducted alternately within a Gibbs-based sampler.

The variables relevant for the analysis were identified from a large list. These were the participation status, sex, the marks of all students for the last four semesters (marks from chosen subjects), the marks of the final exams, and the final mean mark.

For the analysis, the individual marks were aggregated as arithmetic means within subject fields, as using the single subject marks would have produced too many structural missings. Table C.1 gives an overview of the aggregation for all relevant subjects.
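The aggregation step can be illustrated with a minimal sketch. The subject names and the mapping to subject fields below are hypothetical placeholders; the actual aggregation is the one given in table C.1.

```python
from statistics import mean

# Hypothetical mapping from single subjects to subject fields; the real
# aggregation is given in table C.1 and is not reproduced here.
FIELD_OF_SUBJECT = {
    "german": "languages",
    "english": "languages",
    "mathematics": "math_science",
    "physics": "math_science",
}


def aggregate_marks(marks):
    """Average one student's marks within each subject field.

    `marks` maps subject -> mark, with None for a subject the student did
    not take (a structural missing). A field without any observed mark is
    returned as None.
    """
    by_field = {field: [] for field in set(FIELD_OF_SUBJECT.values())}
    for subject, mark in marks.items():
        if mark is not None:
            by_field[FIELD_OF_SUBJECT[subject]].append(mark)
    return {f: (mean(v) if v else None) for f, v in by_field.items()}


# Physics was not chosen, but the field mean is still defined.
student = {"german": 2.0, "english": 3.0, "mathematics": 1.0, "physics": None}
aggregated = aggregate_marks(student)
```

Averaging within fields leaves a value missing only if a student has no mark in any subject of that field, which is why the aggregated variables contain far fewer structural missings.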

ANALYSIS OF UNIT NONRESPONSE COMBINING CART AND

DATA AUGMENTATION

From 2010 to 2011 the rules of subject choice changed, so the mean marks of the students referred to a different calculation basis. Because the calculation rule changed only slightly, and in order to maintain comparability between the estimations, both mean marks were calculated with the 2010 calculation rule for the analysis.

The students were clustered in schools, so the cluster structure has to be taken into account when CART is used. Only the mean school mark was used as the clustered variable, that is, the mean of all marks of all 12th-grade students of each school. The mean school mark (as an initialized or updated value) was used as additional information in a first (level 1) CART process. It was the same for all students within a school. The aggregated data for all students within a school was then used in a second (level 2) CART model, and the mean school mark value was updated.

Four missing value situations relevant for the estimation can be distinguished, as shown in figure 3.3, with Y as the dependent variable, that is, the participation status.


1) Complete cases: no missing values in either the participation status or the explanatory variables

2) Complete explanatory variables, but missing values in the participation indicator (measurement error)

3) Complete participation indicator, but missing values in the explanatory variables

4) Missing values in the participation indicator (measurement error) and the explanatory variables
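As a small illustration, the four situations can be told apart per student with a helper like the one below. The function name and the use of `None` as the missing value marker are assumptions made for this sketch.

```python
def missing_situation(y, x):
    """Return 1-4 for the four missing value situations listed above.

    `y` is the participation status (None if missing) and `x` is a list of
    explanatory values (None marks a missing entry).
    """
    y_missing = y is None
    x_missing = any(value is None for value in x)
    if not y_missing and not x_missing:
        return 1  # complete case
    if y_missing and not x_missing:
        return 2  # only the participation indicator is missing
    if not y_missing and x_missing:
        return 3  # only explanatory variables are missing
    return 4      # both are missing
```

For example, `missing_situation(1, [2.3, None])` falls into situation 3.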

When all variables are complete, as in situation 1), a standard probit regression could be used. For situations 2) and 4), a so-called Probit Forecast Draw based on a Metropolis-Hastings algorithm was used for the participation status, see Chib & Greenberg (1995). This approach used the information on the maximum number of participants in each class of students, conditional on sex. The participants were thus drawn from all students within a class, using the information of all explanatory variables (which have to be augmented beforehand in situation 4)) and the information on the maximum number of participants.

For situations 3) and 4), the missing explanatory variables were augmented by CART.
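The donor idea behind the CART augmentation can be sketched as follows. The sketch only covers the final draw: it assumes that a tree has already been grown on the observed cases and that the terminal node (leaf) of every case is known. The function name and the `leaf_ids` input are hypothetical and not taken from the original routine.

```python
import random


def draw_from_leaf_donors(values, leaf_ids, rng):
    """Replace missing entries by a random observed value from the same leaf.

    values   : list of numbers, None where the variable is missing
    leaf_ids : terminal node label of every case (assumed to come from a
               tree grown on the observed cases; hypothetical input)
    rng      : random.Random instance used for the draws
    """
    # Collect the observed values (donors) of each terminal node.
    donors = {}
    for value, leaf in zip(values, leaf_ids):
        if value is not None:
            donors.setdefault(leaf, []).append(value)
    # Observed values are kept; missing ones are drawn from the donors.
    return [
        value if value is not None else rng.choice(donors[leaf])
        for value, leaf in zip(values, leaf_ids)
    ]


rng = random.Random(1)
marks = [1.7, None, 2.0, 3.3, None, 3.7]
leaves = ["a", "a", "a", "b", "b", "b"]
completed = draw_from_leaf_donors(marks, leaves, rng)
```

The sketch raises a `KeyError` if a leaf contains no observed value; a tree grown on observed cases only would not produce such a leaf.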

In the following, the Bayesian Probit model is described. A more detailed description can be found in Aßmann et al. (2014a). $y_{ij}$ were values of a dichotomous dependent variable, with $i = 1, \ldots, N_j$ as an index for the students within a school $j = 1, \ldots, J$, with $N_j$ denoting the total number of students of a school and $J$ the number of schools. Whereas the observed variable is binary, a latent variable $z_{ij}$ is assumed which works as a link between the explanatory factors $X_{ij}$ and $y_{ij}$:

$$y_{ij} = \begin{cases} 1, & \text{if } z_{ij} \geq 0, \\ 0, & \text{if } z_{ij} < 0, \end{cases}$$

where $z_{ij} = X_{ij}\beta + u_j + e_{ij}$, $e_{ij}$ is an independent and identically normally distributed error term with unit variance, and $u_j$ is a cluster-specific random error term distributed as $N(0, \sigma_u^2)$.
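The link between the latent variable and the observed indicator can be made concrete with a short simulation. The numeric values (the linear predictor 0.3 and the standard deviation 0.5 of the random effect) are assumptions chosen only for illustration.

```python
import random


def simulate_participation(x_beta, u_j, rng):
    """Draw z = x_beta + u_j + e with e ~ N(0, 1) and return (z, y)."""
    z = x_beta + u_j + rng.gauss(0.0, 1.0)
    y = 1 if z >= 0 else 0  # observed indicator follows the sign of z
    return z, y


rng = random.Random(0)
sigma_u = 0.5                       # assumed value for illustration
u_school = rng.gauss(0.0, sigma_u)  # one random effect shared by a school
draws = [simulate_participation(0.3, u_school, rng) for _ in range(5)]
```

All students of one school share the same draw of the random effect, which is what induces the within-school dependence.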


Pooling hence yields the complete likelihood

$$L_P(Y \mid \beta, X, u_j) = \prod_{j=1}^{J} \prod_{i=1}^{N_j} \Phi\left[(2y_{ij} - 1)(X_{ij}\beta + u_j)\right],$$

where $\Phi(\cdot)$ denotes the cumulative distribution function of a standard normal distribution.
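A direct evaluation of this likelihood on the log scale might look like the following sketch. The nested list layout of the data is an assumption of the example, not the data structure used in the original analysis.

```python
from math import log
from statistics import NormalDist

PHI = NormalDist().cdf  # standard normal cumulative distribution function


def pooled_log_likelihood(y, x, beta, u):
    """Log of prod_j prod_i Phi[(2*y_ij - 1) * (x_ij . beta + u_j)].

    y[j][i] : 0/1 participation status of student i in school j
    x[j][i] : list of explanatory values for that student
    beta    : list of coefficients
    u       : list with one random effect per school
    """
    total = 0.0
    for j, school in enumerate(y):
        for i, y_ij in enumerate(school):
            linear = sum(b * v for b, v in zip(beta, x[j][i])) + u[j]
            # (2*y - 1) is +1 for participants and -1 for nonparticipants.
            total += log(PHI((2 * y_ij - 1) * linear))
    return total


# Two schools with two students and one student, respectively.
y = [[1, 0], [1]]
x = [[[1.0, 0.5], [1.0, -1.0]], [[1.0, 0.0]]]
value = pooled_log_likelihood(y, x, beta=[0.0, 0.0], u=[0.0, 0.0])
```

With all coefficients and random effects set to zero, every factor equals $\Phi(0) = 0.5$, so the result is three times $\log 0.5$.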

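The covariance matrix $\sigma_u^2$ of the random coefficients is sampled from independent inverse gamma distributions $IG(\alpha_{\sigma_u^2}, \beta_{\sigma_u^2})$ with parameters

$$\alpha_{\sigma_u^2} = \frac{J}{2} + \alpha^0_{\sigma_u^2} \quad \text{and} \quad \beta_{\sigma_u^2} = \frac{1}{2} \sum_{j=1}^{J} u_j^2 + \beta^0_{\sigma_u^2},$$

where the parameters of the conjugate inverse gamma prior distribution $IG(\alpha^0_{\sigma_u^2}, \beta^0_{\sigma_u^2})$ are $\alpha^0_{\sigma_u^2} = 1$ and $\beta^0_{\sigma_u^2} = 1$.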
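The draw of the variance component can be sketched with the standard library alone. The sketch assumes the shape and scale parameterization of the inverse gamma distribution stated above; the function name and the example random effects are hypothetical.

```python
import random


def draw_sigma2_u(u, rng, alpha0=1.0, beta0=1.0):
    """Draw sigma_u^2 from IG(J/2 + alpha0, sum(u_j^2)/2 + beta0)."""
    shape = len(u) / 2 + alpha0
    scale = sum(u_j ** 2 for u_j in u) / 2 + beta0
    # If G ~ Gamma(shape, rate=scale), then 1/G ~ IG(shape, scale).
    # random.gammavariate expects a scale parameter, so the rate is inverted.
    return 1.0 / rng.gammavariate(shape, 1.0 / scale)


rng = random.Random(42)
u = [0.2, -0.4, 0.1, 0.3]  # hypothetical random effects of J = 4 schools
draws = [draw_sigma2_u(u, rng) for _ in range(20000)]
mean_draw = sum(draws) / len(draws)
```

For these inputs the posterior is $IG(3, 1.15)$, whose mean is $1.15 / (3 - 1) = 0.575$, so the average of many draws should lie close to that value.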

As mentioned above, there were four data situations which were relevant for the estimations. All four were handled by an initialization step and a Gibbs Sampler step including the presented Bayesian Probit model. The whole estimation routine can then be described by the following, with $X_{mis}$ and $X_{obs}$ representing the missing and observed values of the explanatory variables, and $Y_{mis}$ and $Y_{obs}$ representing the missing and observed values of the participation status.

Initialization:

1. Unconditionally draw new values for $X_{mis}$ from $X_{obs}$ (with replacement).

2. Use the maximum likelihood estimation results based on complete cases as starting values for the $\beta$ coefficients (informative prior for $\beta$).

3. Generate one run of the Metropolis-Hastings sequence to draw new values for $Y_{mis}$ (measurement error) based on the complete values from the
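Step 1 of the initialization can be sketched as a simple draw with replacement from the observed values of a variable. The function name and the `None` marker for missing values are assumptions of this example.

```python
import random


def initialize_missing(column, rng):
    """Fill None entries with draws (with replacement) from observed values."""
    observed = [v for v in column if v is not None]
    # Each missing entry gets its own independent draw, so the same
    # observed value may be used more than once.
    return [v if v is not None else rng.choice(observed) for v in column]


rng = random.Random(7)
mean_marks = [2.3, None, 1.7, None, 3.0]
started = initialize_missing(mean_marks, rng)
```

The draws ignore all other variables, which is acceptable here because they only serve as starting values that are replaced in the first Gibbs iteration.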


Gibbs Sampler:

1. Generate new values for $X_{mis}$ for level 1 and level 2 from the full conditional distributions provided by the CART analysis.

2. Generate one run of the Metropolis-Hastings sequence to draw values for $Y_{mis}$ (measurement error) based on the complete values from step 2 of the initialization step for $m = 1$ and from step 4 of the preceding iteration for $m > 1$.

3. Generate new random effects variance components $\sigma_u^2$ and $u_j$.

4. Calculate new $\beta$ coefficients based on the conducted steps of the Gibbs Sampler.

5. Repeat the whole Gibbs procedure $M$ times with iterations $m = 1, \ldots, L, \ldots, M$, with $L$ as the last iteration of the burn-in phase.

The initialization differed from Burgette & Reiter (2010), where the initialization equaled the imputation step with a limited variable range: only completely observed variables were used at first, and imputed variables were added stepwise. As there was no completely observed variable in the application data, unconditional draws with replacement were sampled from the observed values.

Following the practical advice of Cowles & Carlin (1996) and Raftery & Lewis (1992), multiple long chains of length $M = 20{,}000$ with various starting values were run. The burn-in phase, ending at iteration $L$, had to be discarded to obtain more accurate estimates. The values from the remaining iterations after the burn-in phase were then combined. The Bayes posterior mean vector of the unknown parameters $\hat{\Theta}_m = \{\hat{\beta}, \hat{\sigma}_u^2\}$ was calculated as the mean of the remaining iterations,

$$\hat{\Theta} = \frac{1}{M - L} \sum_{m=L+1}^{M} \hat{\Theta}_m.$$
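The posterior mean after discarding the burn-in phase can be computed as in the following sketch. The chain is represented as a list of parameter vectors, which is an assumption of the example; the toy values are not results from the analysis.

```python
def posterior_mean(chain, burn_in):
    """Average the draws of iterations L+1, ..., M (burn_in = L)."""
    kept = chain[burn_in:]  # drops the first L draws
    dim = len(kept[0])
    return [sum(draw[k] for draw in kept) / len(kept) for k in range(dim)]


# Toy chain with M = 4 draws of a two-dimensional parameter and L = 1.
chain = [[5.0, 9.0], [1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
theta_hat = posterior_mean(chain, burn_in=1)
```

Here the first draw is discarded and the remaining three are averaged component by component, giving `[2.0, 4.0]`.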


In document The application of nonparametric data augmentation and imputation using classification and regression trees within a large-scale panel study (Page 54-59)