4.1.2 Imputation Model

Here, we describe the imputation model for the missing subgroup variable $S_i$ to be used

subgroup variable (Carpenter and Kenward, 2013). Sequential imputation involves modeling and imputing each missing variable separately. Rather than using a global multivariate distribution for all variables, each missing variable is modeled on its own, typically using a regression analysis with one response variable. In our setting there is one incomplete variable, so only one imputation model is necessary.

In terms of selecting an appropriate imputation model for the subgroup variable, the true conditional distribution of $S_i$ for subject $i$ given all observed data is

\[
P(S_i \mid Y_i, X_i, Z_{1i}, Z_{2i}) = \frac{P(Y_i \mid X_i, S_i, Z_{1i}; \vartheta)\, P(S_i \mid Z_{1i}, Z_{2i}; \xi_2)}{\sum_{s=0}^{1} P(Y_i \mid X_i, S_i = s, Z_{1i}; \vartheta)\, P(S_i = s \mid Z_{1i}, Z_{2i}; \xi_2)} . \tag{4.4}
\]

In practice, to obtain an estimate of the relationship between $S_i$ and the fully observed variables, we use data from subjects with $R_i = 1$ to fit a logistic regression model to approximate (4.4), where $S_i$ is the response variable, and we condition on all observed variables. The logistic regression model to impute the missing subgroup variable has the following general structure:

\[
P(S_i = 1 \mid Y_i, X_i, Z_{1i}, Z_{2i}; \xi_3) = \operatorname{expit}(A_i \xi_3) , \tag{4.5}
\]

where $\xi_3$ is the column vector of imputation model regression parameters, and $A_i$ is a row vector built from the observed variables $Y_i, X_i, Z_{1i}, Z_{2i}$, which can include main effects and interaction terms.

A congenial imputation model is one that is compatible with the substantive response model (Meng, 1994). This means that any variable and any interaction term included in the response model is also included in the imputation model. Rubin (1996) offers the following guidelines for choosing imputation covariates: (i) variables that are included in the model that generates the missing variable (in this setting, these variables are $Z_1$ and $Z_2$; see (4.3)); and (ii) all of the variables that are included in the substantive response model (in our setting, these variables are $Y$, $X$, $S$ and $Z_1$). Barnard and Meng (1999) also recommend that the imputation model include information about the missing data process, while avoiding over-fitting.

Given the above recommendations, we propose a formal method that can be used to choose the two-way interaction terms necessary in the imputation model. To approximate the true conditional distribution of $S$ using a logistic regression model, we examine the odds ratio of $S$ for a particular variable at different levels of another variable, using the true conditional distribution (4.4). For example, to assess whether we need a $Z_1 Z_2$ interaction term, we look at the odds of $S$ comparing $Z_2 = 1$ to $Z_2 = 0$, for both levels of $Z_1$ (i.e. when $Z_1 = 1$ and when $Z_1 = 0$). The odds ratio of $S = 1$ versus $S = 0$ as a function of $Z_2$ when $Z_1 = 0$ is

\[
\frac{P(S = 1 \mid Y, X, Z_1 = 0, Z_2 = 1) / P(S = 0 \mid Y, X, Z_1 = 0, Z_2 = 1)}{P(S = 1 \mid Y, X, Z_1 = 0, Z_2 = 0) / P(S = 0 \mid Y, X, Z_1 = 0, Z_2 = 0)} , \tag{4.6}
\]

but

\[
P(S = 1 \mid Y, X, Z_1, Z_2 = 1) = \frac{P(Y \mid X, S = 1, Z_1)\, P(S = 1 \mid Z_1, Z_2 = 1)}{P(Y \mid X, Z_1, Z_2 = 1)} ,
\]

so the odds in the numerator are

\[
\frac{P(S = 1 \mid Y, X, Z_1, Z_2 = 1)}{P(S = 0 \mid Y, X, Z_1, Z_2 = 1)} = \frac{P(Y \mid X, S = 1, Z_1)\, P(S = 1 \mid Z_1, Z_2 = 1)}{P(Y \mid X, S = 0, Z_1)\, P(S = 0 \mid Z_1, Z_2 = 1)} .
\]

The factor $P(Y \mid X, S = 1, Z_1) / P(Y \mid X, S = 0, Z_1)$ does not involve $Z_2$ and cancels between the numerator and denominator of (4.6). As a result, (4.6) is equal to

\[
\frac{P(S = 1 \mid Z_1 = 0, Z_2 = 1) / P(S = 0 \mid Z_1 = 0, Z_2 = 1)}{P(S = 1 \mid Z_1 = 0, Z_2 = 0) / P(S = 0 \mid Z_1 = 0, Z_2 = 0)} ,
\]

which does not depend on $Z_1$ since there is no $Z_1 Z_2$ interaction term in the model for $S$. A $Z_1 Z_2$ interaction term is therefore not needed in the imputation model.
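The cancellation argument can be checked numerically. The following is a minimal sketch, not part of the original analysis: the parameter values are hypothetical, the response model is taken exactly as written in this section, and the subgroup model (4.3) is assumed to be a logistic regression with main effects for $Z_1$ and $Z_2$ only. Under these assumptions, the odds ratio computed from (4.4) should be the same at both levels of $Z_1$.

```python
import math

def expit(t):
    return 1.0 / (1.0 + math.exp(-t))

# Hypothetical parameter values, chosen only for illustration.
THETA = (0.2, 0.1, 0.15, 0.2, 0.1)  # response model P(Y=1 | X, S, Z1)
XI2 = (-0.5, 0.8, 1.2)              # ASSUMED form of (4.3): logit P(S=1 | Z1, Z2), main effects only

def p_y(y, x, s, z1):
    """P(Y = y | X, S, Z1) under the response model as written in the text."""
    p1 = THETA[0] + THETA[1] * x + THETA[2] * s + THETA[3] * x * s + THETA[4] * x * z1
    return p1 if y == 1 else 1.0 - p1

def p_s(s, z1, z2):
    """P(S = s | Z1, Z2) under the assumed subgroup model."""
    p1 = expit(XI2[0] + XI2[1] * z1 + XI2[2] * z2)
    return p1 if s == 1 else 1.0 - p1

def p_s_given_all(s, y, x, z1, z2):
    """True conditional distribution of S, equation (4.4)."""
    num = p_y(y, x, s, z1) * p_s(s, z1, z2)
    den = sum(p_y(y, x, k, z1) * p_s(k, z1, z2) for k in (0, 1))
    return num / den

def odds_ratio_z2(y, x, z1):
    """Odds ratio of S comparing Z2 = 1 to Z2 = 0, as in (4.6), at a given Z1."""
    def odds(z2):
        return p_s_given_all(1, y, x, z1, z2) / p_s_given_all(0, y, x, z1, z2)
    return odds(1) / odds(0)

or_z1_0 = odds_ratio_z2(y=1, x=1, z1=0)
or_z1_1 = odds_ratio_z2(y=1, x=1, z1=1)
print(or_z1_0, or_z1_1)  # the two values should agree up to rounding
```

With a main-effects logistic model for $S$, both odds ratios reduce to the exponential of the $Z_2$ coefficient, whatever values of $Y$ and $X$ are used.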

To show that a $YX$ interaction term is needed in the imputation model for $S$, we write the odds ratio of $S$ comparing $Y = 1$ to $Y = 0$ in terms of the true model (4.4) for two settings: (i) when $X = 0$ and (ii) when $X = 1$. When $X = 0$,

\[
\frac{P(S = 1 \mid Y = 1, X = 0, Z_1, Z_2) / P(S = 0 \mid Y = 1, X = 0, Z_1, Z_2)}{P(S = 1 \mid Y = 0, X = 0, Z_1, Z_2) / P(S = 0 \mid Y = 0, X = 0, Z_1, Z_2)} = \frac{P(Y = 1 \mid X = 0, S = 1, Z_1) / P(Y = 1 \mid X = 0, S = 0, Z_1)}{P(Y = 0 \mid X = 0, S = 1, Z_1) / P(Y = 0 \mid X = 0, S = 0, Z_1)} .
\]

This ratio of odds is different when $X = 1$ because there is an $XS$ interaction term in the conditional probability for $Y$: $P(Y = 1 \mid X, S, Z_1) = \vartheta_0 + \vartheta_1 X + \vartheta_2 S + \vartheta_3 XS + \vartheta_4 X Z_1$.
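As a small numerical illustration of this comparison, the sketch below evaluates the right-hand side of the identity at $X = 0$ and at $X = 1$. The coefficient values are hypothetical, chosen only so that every probability lies strictly between 0 and 1, and the response model is used exactly as written above.

```python
# Hypothetical coefficients (theta_0, ..., theta_4) for the response model as written:
# P(Y=1 | X, S, Z1) = theta_0 + theta_1*X + theta_2*S + theta_3*X*S + theta_4*X*Z1
THETA = (0.2, 0.1, 0.15, 0.2, 0.1)

def p_y1(x, s, z1):
    """P(Y = 1 | X, S, Z1) under the response model."""
    return THETA[0] + THETA[1] * x + THETA[2] * s + THETA[3] * x * s + THETA[4] * x * z1

def odds_ratio_y(x, z1):
    """Odds ratio of S comparing Y = 1 to Y = 0, written in terms of the Y model."""
    num = p_y1(x, 1, z1) / p_y1(x, 0, z1)
    den = (1.0 - p_y1(x, 1, z1)) / (1.0 - p_y1(x, 0, z1))
    return num / den

or_x0 = odds_ratio_y(x=0, z1=0)
or_x1 = odds_ratio_y(x=1, z1=0)
print(or_x0, or_x1)  # the two odds ratios differ for these coefficients
```

Because the odds ratio of $S$ for $Y$ changes with $X$, a logistic model for $S$ with only main effects for $Y$ and $X$ could not reproduce it for these values.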

Applying this approach to each pair of variables shows that three two-way interaction terms are required in the logistic regression model for imputing $S$: $YX$, $YZ_1$, and $XZ_1$. Therefore, an imputation model that adequately describes the relationship between $S$ and $(Y, X, Z_1, Z_2)$ is one for which

\[
A_i = (1, Y_i, X_i, Z_{1i}, Z_{2i}, Y_i X_i, Y_i Z_{1i}, X_i Z_{1i}) \tag{4.7}
\]

in equation (4.5). In practice, the true conditional distribution of $S$ (equation (4.4)) is unknown, and investigators must make assumptions about the true distribution in order to select an appropriate imputation model.
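To make the structure of (4.5) with (4.7) concrete, the following sketch builds the design rows $A_i$ and fits the logistic imputation model to subjects with $R_i = 1$. It is only an illustration under stated assumptions: the data are simulated, the coefficient vector `xi_true` and the missingness rate are hypothetical, missingness is generated completely at random for simplicity, and the Newton-Raphson routine `fit_logistic` is a generic stand-in for whatever software would be used in practice.

```python
import numpy as np

def expit(t):
    return 1.0 / (1.0 + np.exp(-t))

def design_matrix(y, x, z1, z2):
    """Rows A_i = (1, Y, X, Z1, Z2, YX, YZ1, XZ1), as in (4.7)."""
    y, x, z1, z2 = (np.asarray(v, dtype=float) for v in (y, x, z1, z2))
    return np.column_stack([np.ones_like(y), y, x, z1, z2, y * x, y * z1, x * z1])

def fit_logistic(A, s, n_iter=50, tol=1e-10):
    """Newton-Raphson maximum likelihood fit of P(S=1 | A) = expit(A @ xi).

    Returns the estimate and the inverse of the observed information matrix.
    """
    xi = np.zeros(A.shape[1])
    for _ in range(n_iter):
        p = expit(A @ xi)
        score = A.T @ (s - p)
        info = A.T @ (A * (p * (1.0 - p))[:, None])
        step = np.linalg.solve(info, score)
        xi = xi + step
        if np.max(np.abs(step)) < tol:
            break
    p = expit(A @ xi)
    info = A.T @ (A * (p * (1.0 - p))[:, None])
    return xi, np.linalg.inv(info)

# Simulated stand-in data (hypothetical values, for illustration only).
rng = np.random.default_rng(0)
n = 2000
y, x, z1, z2 = (rng.integers(0, 2, n) for _ in range(4))
A = design_matrix(y, x, z1, z2)
xi_true = np.array([-0.5, 0.6, 0.3, 0.4, 1.0, 0.5, -0.3, 0.2])
s = rng.binomial(1, expit(A @ xi_true))
r = rng.binomial(1, 0.7, n)  # R_i = 1 means the subgroup variable is observed

# Fit the imputation model using only subjects with R_i = 1.
xi_hat, cov = fit_logistic(A[r == 1], s[r == 1])
print(np.round(xi_hat, 2))
```

The returned `cov` plays the role of the estimated covariance matrix of the imputation model parameters.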

From imputation model (4.5) with $A_i$ given by (4.7), we obtain the maximum likelihood estimate $\hat{\xi}_3$ with estimated covariance matrix $\Sigma(\hat{\xi}_3)$. The estimated conditional probability of $S_i = 1$ given $A_i$ is

\[
\pi_i(\hat{\xi}_3) = \operatorname{expit}(A_i \hat{\xi}_3) .
\]

By $\pi(\hat{\xi}_3)$, we mean the probability evaluated at the estimated parameters, and we will use this convention throughout the chapter. Having fitted the imputation model, the next step is to draw from the (estimated) posterior distribution $N(\hat{\xi}_3, \Sigma(\hat{\xi}_3))$ (Carpenter and Kenward, 2013); for the first imputation we let $\tilde{\xi}_3^1$ denote the drawn sample. For each subject $i$ with $R_i = 0$ we then draw $S_i$ from the Bernoulli distribution with probability

\[
\pi_i(\tilde{\xi}_3^1) = \operatorname{expit}(A_i \tilde{\xi}_3^1) ,
\]

and let $S_{i1}$ denote the realization for $i \in R^c$. This process is repeated $K$ times, starting each time with a draw from the posterior distribution to get a new $\tilde{\xi}_3^k$ at the $k$th repetition, to form $K$ independent imputed datasets. With the sampled values of $S_i$ filled in for the $N_{\text{mis}}$ individuals with $i \in R^c$, each of the $K$ imputed datasets is 'complete'.
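The two-stage draw described above can be sketched as follows. This is a minimal illustration, not the implementation used in the analysis: the function name `impute_subgroup`, the values of `xi_hat` and `cov`, and the three design rows are all hypothetical, and the design rows follow the column order of (4.7).

```python
import numpy as np

def expit(t):
    return 1.0 / (1.0 + np.exp(-t))

def impute_subgroup(A_mis, xi_hat, cov, K, rng):
    """Return a (K, N_mis) array whose k-th row holds imputed S_i for subjects with R_i = 0."""
    draws = np.empty((K, A_mis.shape[0]), dtype=int)
    for k in range(K):
        # Step 1: draw the parameter vector from the approximate posterior N(xi_hat, cov).
        xi_tilde = rng.multivariate_normal(xi_hat, cov)
        # Step 2: draw each missing S_i from Bernoulli(expit(A_i @ xi_tilde)).
        draws[k] = rng.binomial(1, expit(A_mis @ xi_tilde))
    return draws

# Hypothetical fitted values and design rows, for illustration only.
rng = np.random.default_rng(1)
xi_hat = np.array([-0.5, 0.6, 0.3, 0.4, 1.0, 0.5, -0.3, 0.2])
cov = 0.01 * np.eye(8)
# Columns: 1, Y, X, Z1, Z2, YX, YZ1, XZ1
A_mis = np.array([[1, 1, 0, 1, 0, 0, 1, 0],
                  [1, 0, 1, 1, 1, 0, 0, 1],
                  [1, 1, 1, 0, 0, 1, 0, 0]], dtype=float)

S_imp = impute_subgroup(A_mis, xi_hat, cov, K=5, rng=rng)
print(S_imp.shape)  # one row per imputed dataset, one column per subject with R_i = 0
```

Redrawing the parameter vector at the start of each repetition, rather than reusing $\hat{\xi}_3$, is what carries the uncertainty in the imputation model parameters into the $K$ imputed datasets.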

In the following section, we discuss methods to estimate the conditional causal parameters using the multiply imputed datasets.