3.4 Estimation of Marginal Causal Effects in an Observational Setting
4.1.2 Imputation Model
Here, we describe the imputation model for the missing subgroup variable Si to be used
subgroup variable (Carpenter and Kenward, 2013). Sequential imputation involves modeling and imputing each missing variable separately. Rather than specifying a global multivariate distribution for all variables, each missing variable is typically modeled by a regression analysis with one response variable. In our setting, there is one incomplete variable, so only one imputation model is necessary.
In terms of selecting an appropriate imputation model for the subgroup variable, the true conditional distribution of Si for subject i given all observed data is
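\[
P(S_i \mid Y_i, X_i, Z_{1i}, Z_{2i}) = \frac{P(Y_i \mid X_i, S_i, Z_{1i}; \vartheta)\, P(S_i \mid Z_{1i}, Z_{2i}; \xi_2)}{\sum_{s=0}^{1} P(Y_i \mid X_i, S_i = s, Z_{1i}; \vartheta)\, P(S_i = s \mid Z_{1i}, Z_{2i}; \xi_2)} . \tag{4.4}
\]
In practice, to obtain an estimate of the relationship between Si and the fully observed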
variables, we use data from subjects with Ri = 1 to fit a logistic regression model to
approximate (4.4), where Si is the response variable, and we condition on all observed
variables. The logistic regression model to impute the missing subgroup variable has the following general structure:
\[
P(S_i = 1 \mid Y_i, X_i, Z_{1i}, Z_{2i}; \xi_3) = \operatorname{expit}(A_i \xi_3) , \tag{4.5}
\]
where ξ3 is the column vector of imputation model regression parameters, and Ai is a row
vector of variables which are a combination of the observed variables Yi, Xi, Z1i, Z2i, which
can include main effects and interaction terms.
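As an illustration of the true conditional distribution (4.4), the following minimal sketch computes P(Si = 1 | Yi, Xi, Z1i, Z2i) by Bayes' theorem for a binary subgroup variable. The function name and the numerical inputs are hypothetical and chosen only for illustration; in practice the component probabilities are unknown, which is why the logistic approximation (4.5) is used.

```python
def prob_s1_given_observed(p_y_s1, p_y_s0, p_s1):
    """P(S = 1 | Y, X, Z1, Z2) from equation (4.4) for a binary S.

    p_y_s1: P(Y = y | X, S = 1, Z1), evaluated at the subject's observed y
    p_y_s0: P(Y = y | X, S = 0, Z1), evaluated at the subject's observed y
    p_s1:   P(S = 1 | Z1, Z2)
    """
    numerator = p_y_s1 * p_s1
    # The denominator sums the same product over s = 1 and s = 0.
    denominator = numerator + p_y_s0 * (1.0 - p_s1)
    return numerator / denominator


# Hypothetical values, not taken from the text.
print(prob_s1_given_observed(p_y_s1=0.6, p_y_s0=0.3, p_s1=0.4))
```

When the two outcome probabilities are equal, the outcome carries no information about S and the function returns p_s1 unchanged.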
A congenial imputation model is one that is compatible with the substantive response model (Meng, 1994). This means that any variable and any interaction term included in the response model is also included in the imputation model. Rubin (1996) offers the following guidelines for choosing imputation covariates: include (i) the variables in the model that generates the missing variable (in this setting, Z1 and Z2; see (4.3)); and (ii) all of the variables in the substantive response model (in our setting, Y, X, S and Z1). Barnard and Meng (1999) also recommend that the imputation model include information about the missing data process, while avoiding over-fitting.
Given the above recommendations, we propose a formal method that can be used to choose the two-way interaction terms necessary in the imputation model. To approximate the true conditional distribution of $S$ using a logistic regression model, we examine the odds ratio of $S$ for a particular variable at different levels of another variable, using the true conditional distribution (4.4). For example, to assess whether we need a $Z_1 Z_2$ interaction term, we look at the odds ratio of $S$ comparing $Z_2 = 1$ to $Z_2 = 0$, for both levels of $Z_1$ (i.e. when $Z_1 = 1$ and when $Z_1 = 0$). The odds ratio of $S = 1$ versus $S = 0$ as a function of $Z_2$ when $Z_1 = 0$ is
\[
\frac{P(S = 1 \mid Y, X, Z_1 = 0, Z_2 = 1) / P(S = 0 \mid Y, X, Z_1 = 0, Z_2 = 1)}{P(S = 1 \mid Y, X, Z_1 = 0, Z_2 = 0) / P(S = 0 \mid Y, X, Z_1 = 0, Z_2 = 0)} , \tag{4.6}
\]
but, by (4.4),
\[
P(S = 1 \mid Y, X, Z_1, Z_2 = 1) = \frac{P(Y \mid X, S = 1, Z_1)\, P(S = 1 \mid Z_1, Z_2 = 1)}{P(Y \mid X, Z_1, Z_2 = 1)} ,
\]
so the odds in the numerator are
\[
\frac{P(S = 1 \mid Y, X, Z_1, Z_2 = 1)}{P(S = 0 \mid Y, X, Z_1, Z_2 = 1)} = \frac{P(Y \mid X, S = 1, Z_1)\, P(S = 1 \mid Z_1, Z_2 = 1)}{P(Y \mid X, S = 0, Z_1)\, P(S = 0 \mid Z_1, Z_2 = 1)} .
\]
The factors involving $Y$ are the same for $Z_2 = 1$ and $Z_2 = 0$ and cancel in the ratio. As a result, (4.6) is equal to
\[
\frac{P(S = 1 \mid Z_1 = 0, Z_2 = 1) / P(S = 0 \mid Z_1 = 0, Z_2 = 1)}{P(S = 1 \mid Z_1 = 0, Z_2 = 0) / P(S = 0 \mid Z_1 = 0, Z_2 = 0)} ,
\]
which does not depend on $Z_1$ since there is no $Z_1 Z_2$ interaction term in the model for $S$.
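The cancellation argument above can be checked numerically. The sketch below assumes, purely for illustration, a logistic model for S with main effects of Z1 and Z2 only, together with the linear form for P(Y = 1 | X, S, Z1) that appears later in this section; all parameter values and function names are hypothetical.

```python
import math


def expit(u):
    return 1.0 / (1.0 + math.exp(-u))


# Hypothetical parameter values (not from the text).
XI2 = (-0.5, 0.8, 1.2)               # intercept, Z1, Z2 in the assumed model for S
THETA = (0.2, 0.1, 0.15, 0.2, 0.1)   # theta_0, ..., theta_4 in the model for Y


def p_s1(z1, z2):
    """Assumed P(S = 1 | Z1, Z2): logistic, with no Z1*Z2 interaction."""
    return expit(XI2[0] + XI2[1] * z1 + XI2[2] * z2)


def p_y(y, x, s, z1):
    """P(Y = y | X, S, Z1) under the linear probability form."""
    p1 = THETA[0] + THETA[1] * x + THETA[2] * s + THETA[3] * x * s + THETA[4] * x * z1
    return p1 if y == 1 else 1.0 - p1


def odds_s(y, x, z1, z2):
    """Odds of S = 1 versus S = 0 given (Y, X, Z1, Z2), from (4.4)."""
    num = p_y(y, x, 1, z1) * p_s1(z1, z2)
    den = p_y(y, x, 0, z1) * (1.0 - p_s1(z1, z2))
    return num / den


def odds_ratio_z2(y, x, z1):
    """Odds ratio of S comparing Z2 = 1 to Z2 = 0, as in (4.6)."""
    return odds_s(y, x, z1, 1) / odds_s(y, x, z1, 0)


for z1 in (0, 1):
    print(z1, odds_ratio_z2(y=1, x=1, z1=z1))
```

Under these assumptions the two printed odds ratios agree and equal exp(XI2[2]), whatever values of Y and X are supplied.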
To show that a $YX$ interaction term is needed in the imputation model for $S$, we write the odds ratio of $S$ comparing $Y = 1$ to $Y = 0$ in terms of the true model (4.4) for two settings: (i) when $X = 0$ and (ii) when $X = 1$. When $X = 0$,
\[
\frac{P(S = 1 \mid Y = 1, X = 0, Z_1, Z_2) / P(S = 0 \mid Y = 1, X = 0, Z_1, Z_2)}{P(S = 1 \mid Y = 0, X = 0, Z_1, Z_2) / P(S = 0 \mid Y = 0, X = 0, Z_1, Z_2)} = \frac{P(Y = 1 \mid X = 0, S = 1, Z_1) / P(Y = 1 \mid X = 0, S = 0, Z_1)}{P(Y = 0 \mid X = 0, S = 1, Z_1) / P(Y = 0 \mid X = 0, S = 0, Z_1)} ,
\]
because the factors $P(S \mid Z_1, Z_2)$ cancel. This ratio of odds is different when $X = 1$ because there is an $XS$ interaction term in the conditional probability for $Y$:
\[
P(Y = 1 \mid X, S, Z_1) = \vartheta_0 + \vartheta_1 X + \vartheta_2 S + \vartheta_3 XS + \vartheta_4 X Z_1 .
\]
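A small numerical sketch of this comparison follows. It evaluates the right-hand side of the odds ratio above at X = 0 and X = 1 using the stated form of P(Y = 1 | X, S, Z1); the parameter values and function names are hypothetical, chosen only so that all probabilities lie strictly between 0 and 1.

```python
# Hypothetical values for (theta_0, ..., theta_4); not taken from the text.
THETA = (0.2, 0.1, 0.15, 0.2, 0.1)


def p_y1(x, s, z1, theta):
    """P(Y = 1 | X, S, Z1) under the linear form given above."""
    t0, t1, t2, t3, t4 = theta
    return t0 + t1 * x + t2 * s + t3 * x * s + t4 * x * z1


def odds_ratio_s_by_y(x, z1, theta):
    """Odds ratio of S comparing Y = 1 to Y = 0 at fixed X and Z1.

    The P(S | Z1, Z2) factors cancel, so only the model for Y is needed.
    """
    p1 = p_y1(x, 1, z1, theta)   # P(Y = 1 | X, S = 1, Z1)
    p0 = p_y1(x, 0, z1, theta)   # P(Y = 1 | X, S = 0, Z1)
    return (p1 / p0) / ((1.0 - p1) / (1.0 - p0))


for x in (0, 1):
    print(x, odds_ratio_s_by_y(x, z1=0, theta=THETA))
```

With these illustrative values the odds ratio at X = 1 is roughly twice the odds ratio at X = 0, so a logistic model for S without a YX term could not reproduce both.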
By the same type of argument, the following two-way interaction terms are required in the logistic regression model for imputing S: YX, YZ1, and XZ1. Therefore an
imputation model that adequately describes the relationship between S and (Y, X, Z1, Z2)
is one for which
\[
A_i = (1, Y_i, X_i, Z_{1i}, Z_{2i}, Y_i X_i, Y_i Z_{1i}, X_i Z_{1i}) \tag{4.7}
\]
in equation (4.5). In practice, the true conditional distribution of S (equation (4.4)) is unknown, and investigators must make assumptions about the true distribution in order to select an appropriate imputation model.
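The following sketch shows one way the imputation model (4.5) with the covariate vector (4.7) could be fitted to subjects with R = 1. It uses a plain Newton-Raphson iteration in NumPy rather than any particular statistical package, and it simulates S directly from a logistic model with hypothetical coefficients, which is a simplification of (4.4). The function names, sample size, and parameter values are all assumptions made for illustration.

```python
import numpy as np


def expit(u):
    return 1.0 / (1.0 + np.exp(-u))


def design_matrix(y, x, z1, z2):
    """Rows A_i = (1, Y, X, Z1, Z2, YX, YZ1, XZ1), as in (4.7)."""
    y, x, z1, z2 = (np.asarray(v, dtype=float) for v in (y, x, z1, z2))
    return np.column_stack([np.ones_like(y), y, x, z1, z2, y * x, y * z1, x * z1])


def fit_logistic(A, s, n_iter=50, tol=1e-10):
    """Maximum likelihood fit of P(S = 1 | A) = expit(A @ xi) by Newton-Raphson.

    Returns the estimate and the inverse observed information matrix,
    used here as the estimated covariance matrix.
    """
    xi = np.zeros(A.shape[1])
    for _ in range(n_iter):
        p = expit(A @ xi)
        score = A.T @ (s - p)
        info = (A * (p * (1.0 - p))[:, None]).T @ A
        step = np.linalg.solve(info, score)
        xi = xi + step
        if np.max(np.abs(step)) < tol:
            break
    p = expit(A @ xi)
    info = (A * (p * (1.0 - p))[:, None]).T @ A
    return xi, np.linalg.inv(info)


# Simulated data with hypothetical coefficients (illustration only).
rng = np.random.default_rng(0)
n = 2000
y, x, z1, z2 = (rng.integers(0, 2, n) for _ in range(4))
A = design_matrix(y, x, z1, z2)
xi_true = np.array([-0.5, 0.6, 0.3, 0.4, 1.0, 0.5, -0.3, 0.2])
s = rng.binomial(1, expit(A @ xi_true))
r = rng.binomial(1, 0.7, n)          # R = 1 means S is observed

# Fit the imputation model using only subjects with R = 1.
xi_hat, cov_hat = fit_logistic(A[r == 1], s[r == 1])
print(xi_hat.round(2))
```

The returned pair corresponds to the maximum likelihood estimate and its estimated covariance matrix, which are the two inputs needed for the posterior draws in the imputation step.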
From imputation model (4.5) with $A_i$ given by (4.7), we obtain the maximum likelihood estimate $\hat\xi_3$ with estimated covariance matrix $\Sigma(\hat\xi_3)$. The estimated conditional probability of $S_i = 1$ given $A_i$ is
\[
\pi_i(\hat\xi_3) = \operatorname{expit}(A_i \hat\xi_3) .
\]
By $\pi(\hat\xi_3)$, we mean the estimated probability using the estimated parameters, and we will use this convention throughout the chapter. Having fitted the imputation model, the next step is to draw from the (estimated) posterior distribution $N(\hat\xi_3, \Sigma(\hat\xi_3))$ (Carpenter and Kenward, 2013); for the first imputation we let $\tilde\xi_3^{1}$ denote the drawn sample. For each subject $i$ with $R_i = 0$ we then draw $S_i$ from the Bernoulli distribution with probability
\[
\pi_i(\tilde\xi_3^{1}) = \operatorname{expit}(A_i \tilde\xi_3^{1}) ,
\]
and let $S_{i1}$ denote the realization for $i \in R^{c}$. This process is repeated $K$ times, starting each time with a new draw $\tilde\xi_3^{k}$ from the posterior distribution at the $k$th repetition, to form $K$ independent imputed datasets. With the sampled values of $S_i$ filled in for the $N_{\mathrm{mis}}$ individuals with $i \in R^{c}$, each of the $K$ imputed datasets is 'complete'.
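The imputation step described above can be sketched as follows, assuming the estimate and its covariance matrix are already available. The function name and the toy inputs are hypothetical; in particular, the toy design matrix has two columns rather than the eight in (4.7), purely to keep the example short.

```python
import numpy as np


def expit(u):
    return 1.0 / (1.0 + np.exp(-u))


def impute_subgroup(A_mis, xi_hat, cov_hat, K, rng):
    """Return a (K, N_mis) integer array of imputed subgroup values.

    For each k = 1, ..., K: draw xi_tilde from N(xi_hat, cov_hat), then draw
    S_i from Bernoulli(expit(A_i @ xi_tilde)) for every row A_i of A_mis.
    """
    draws = np.empty((K, A_mis.shape[0]), dtype=int)
    for k in range(K):
        xi_tilde = rng.multivariate_normal(xi_hat, cov_hat)   # new posterior draw
        draws[k] = rng.binomial(1, expit(A_mis @ xi_tilde))   # one draw per subject
    return draws


# Toy inputs (hypothetical): three subjects with R = 0, intercept plus one covariate.
rng = np.random.default_rng(1)
A_mis = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 1.0]])
xi_hat = np.array([-0.2, 0.9])
cov_hat = np.array([[0.04, -0.01], [-0.01, 0.09]])

S_imp = impute_subgroup(A_mis, xi_hat, cov_hat, K=5, rng=rng)
print(S_imp)
```

Row k of the result holds the imputed values that complete the kth dataset. Drawing a fresh parameter vector for each k, rather than reusing the maximum likelihood estimate, is what carries the uncertainty in the imputation model parameters into the imputed values.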
In the following section, we discuss methods to estimate the conditional causal parameters using the multiply imputed datasets.