Non-Bayesian Multiple Imputation

(1)

Non-Bayesian Multiple Imputation

Jan F. Bjørnstad¹

Multiple imputation is a method specifically designed for variance estimation in the presence of missing data. Rubin’s combination formula requires that the imputation method is

“proper,” which essentially means that the imputations are random draws from a posterior distribution in a Bayesian framework. In national statistical institutes (NSI’s) like Statistics Norway, the methods used for imputing for nonresponse are typically non-Bayesian, e.g., some kind of stratified hot-deck. Hence, Rubin’s method of multiple imputation is not valid and cannot be applied in NSI’s. This article deals with the problem of deriving an alternative combination formula that can be applied for imputation methods typically used in NSI’s and suggests an approach for studying this problem. Alternative combination formulas are derived for certain response mechanisms and hot-deck type imputation methods.

Key words: Variance estimation; survey sampling; stratified sampling; logistic regression;

nonresponse; hot-deck imputation.

1. Introduction

Multiple imputation is a method specifically designed for variance estimation in the presence of missing data, developed by Rubin (1987). Two more recent references with further discussions and studies are Rubin (1996) and Schafer (1997). The basic idea is to create m imputed values for each missing value and combine the m completed data sets by Rubin’s combination formula for variance estimation. For the estimator to be valid, the imputations must display an appropriate level of variability. In Rubin’s term, the imputation method is required to be “proper.” In national statistical institutes (NSI’s) the methods used for imputing for nonresponse very seldom if ever satisfy the requirement of being “proper.” However, the idea of creating multiple imputations to measure the imputation uncertainty and use it for variance estimation and for computing confidence intervals is still of interest. The problem is then that Rubin’s combination formula is no longer valid with the usual nonproper imputations used by NSI’s. The reason is that the variability in nonproper imputations is too small and the between-imputation component must be given a larger weight in the variance estimate. The problem is then to determine what this weight should be to give valid statistical inference, and also for what kind of nonresponse mechanisms and estimation problems it is possible to determine a simple

qStatistics Sweden

1Statistics Norway, Division for Statistical Methods and Standards, P.O. Box 8131 Dep., N-0033 Oslo, Norway.

Email: [email protected]

Acknowledgment:The problem of deriving a non-Bayesian multiple imputation method was studied in a Master’s thesis in 1999 by Tonje Braaten with the author as her adviser. The present research began within the DACSEIS research project, and started originally when the author was contacted by Tonje Braaten regarding this issue in her doctoral studies in epidemiology.

(2)

combination formula not dependent on unknown parameters. This article suggests an approach for studying this problem.

In Section 2 an approach for determining the combination of the imputed completed data sets is suggested. Section 3 has three applications with random nonresponse:

(i) estimating a population average from simple random samples using hot-deck imputation, (ii) estimating the regression coefficient in the ratio model using residual regression imputation and (iii) estimating the regression coefficient in simple linear regression with residual regression imputation. Section 4 deals with the general problem of multiple imputation for stratified samples. In Section 5 we apply the theory in Section 4 to stratified samples with random nonresponse within strata, covering (i) estimation of population average using stratified hot-deck imputation and (ii) estimation of log (odds ratios) in logistic regression with missingness both for the dependent variable and the explanatory variable. Section 6 takes up the problem of using the same combination rule for all estimation problems with a given imputation method and data and response model. A general result for hot-deck imputation and linear estimates is presented.

2. An Approach for Determining an Alternative Combination Formula for Variance Estimation in Multiple Imputation

Let s ¼ ð1; : : : ; nÞ denote the full sample, with y ¼ ð y1; : : : ; ynÞ denoting the full sample data, values of random variable Y1; : : : :; Yn. In the case of sampling from a finite population under a design model, a renumbering of the selected units has been performed, of course, and the stochastic nature of y is determined by the sampling plan. The objective is to estimate some parameter u. The observed data is denoted by yobs ¼ {ð yi: i [ srÞ; sr}, being the observed part of y and the response sample s_r of size n_r.

Let ^ube the estimator based on the full sample data y, with Varð ^uÞ estimated by ^VðyÞ.

For i [ s 2 sr we impute by some method y_i^* and let y^* denote the complete data ð yi: i [ sr; y_i^*: i [ s 2 srÞ. Based on y*, we have ^u^*¼ ^uð y^*Þ and ^V^*¼ ^Vð y^*Þ.

Multiple imputation of m repeated imputations leads to m completed data-sets with m estimates u^_i^*; i ¼ 1; : : : ; m; and related variance estimates V^_i^*; i ¼ 1; : : : ; m.

The combined estimate is given by u^*¼Pm

i¼1u^_i^*=m. The within-imputation variance is defined as V^*¼Pm

i¼1V^_i^*=m and the between-imputation component is B^*¼ Pm

i¼1ð ^u_i^*2 u^*Þ²=ðm 2 1Þ: The total estimated variance of u^* is then proposed to be W ¼ V^*þ k þ1

m

B^* ð1Þ

That is, we need to determine k such that

EðWÞ ¼ Varð u^*Þ ð2Þ

Rubin (1987) has shown that k ¼ 1 can be used with proper imputations, which essentially means drawing imputed values from a posterior distribution in a Bayesian framework.

(3)

In general, one has to determine the terms in (2). One way to try and do this is to use double expectation, conditioning on y_obs, that is, EðWÞ ¼ E{EðWjY_obsÞ} and Varð u^*Þ ¼ E{Varð u^*jYobsÞ} þ Var{Eð u^*jYobsÞ}. Typically,

Eð V^*Þ < Varð ^uÞ ð3Þ

and EðB^*jyobsÞ ¼ Varð ^u^*jyobsÞ. Hence, approximately

EðWÞ ¼ Varð ^uÞ þ EðkÞ þ 1 m

EVarð ^u^*jYobsÞ ð4Þ

Moreover, Varð u^*jyobsÞ ¼ Varð û^*jyobsÞ=m and Eð u^*jyobsÞ ¼ Eð û^*jyobsÞ. This implies that Varð u^*Þ ¼ m²¹E{Varð û^*jYobsÞ} þ Var{Eð û^*jYobsÞ}. From (3) and (4), Equation (2) becomes Varð ûÞ þ EðkÞEVarð û^*jYobsÞ ¼ Var{Eð û^*jYobsÞ}, which gives the following general expression:

EðkÞ ¼VarEð ^u^*jYobsÞ 2 Varð ^uÞ

EVarð ^u^*jYobsÞ ð5Þ

For this to be of interest, k must be, at least approximately, determined independently of unknown parameters. In addition, one needs to check that (3) holds. To illustrate how (5) can be used we shall in the next section consider three special cases with random nonresponse.

3. Three Applications for Random Nonresponse

3.1. Estimating Population Average with Hot-deck Imputation

Consider a simple random sample from a finite population of size N, where the aim is to estimate the population averagemof some variable y. We shall assume completely random nonresponse. In the terminology of Rubin (1987) and Little and Rubin (2002), the missingness mechanism is said to be MCAR (missing completely at random). We note that MCAR means that the response indicators R1; : : : ; RN are independent with the same response probability pr¼ PðRi¼ 1Þ. The imputation method is the hot-deck method, where y_i^*is drawn at random from yobswith replacement, and the estimate is the sample mean. Let yr be the observed sample mean and ^s_r²¼_n¹

r21

P

i[s_rð yi2 yrÞ² the observed sample variance. Then Y^*is the imputation-based sample mean for the completed sample, and the combined estimator is given by Y^*¼Pm

i¼1Y_i^*=m. Let Ysdenote the sample mean based on a full sample. Then, Varð Y_sÞ ¼s²ð¹_n2_N¹Þ, withs²¼ ðN 2 1Þ²¹PN

i¼1ð yi2mÞ² being the population variance. We have further that Eð Y^*jyobsÞ ¼ yr and Varð Y^*jyobsÞ ¼ {ðn 2 nrÞ=n²}{ðnr2 1Þ=nr} ^s_r² using that EðY_i^*jyobsÞ ¼ yr and VarðY_i^*jyobsÞ ¼ ^s²_rðnr2 1Þ=nr.

In this case, ^V^*¼ ^s²_*ð¹_n2_N¹Þ where ^s²_*¼_n21¹ P

s_rð yi2 y^*Þ²þP

s2s_rðy_i^*2 y^*Þ²

. It can be shown that Eð ^s²_*jyobsÞ ¼ ^s_r² 1 2_n¹

r

1 þ_nðn21Þⁿ^r

< ^s²_r and (3) holds.

(4)

We find, from (5),

EðkÞ ¼

Var Yr

2s² ¹ n21

N

E n 2 nr

n² nr2 1 n_r

Eðs^_r²jnrÞ

¼

s² E 1 nr

21

N

2s² ¹ n21

N

E n 2 n_r

n² n_r2 1 nr

s²

<ð1 2 prÞ=pr

1 2 pr

¼ 1 pr

which is satisfied approximately, with f ¼ ðn 2 nrÞ=n being the rate of nonresponse, by letting k ¼ 1=ð1 2 f Þ:

3.2. Estimating the Regression Coefficient in the Ratio Model with Residual Imputation We shall assume completely random nonresponse as in Section 3.1. We consider a ratio model, i.e., regression through the origin: Yi¼bxiþ 1i, with Varð1iÞ ¼s²xi; i ¼ 1, : : : ,n. It is assumed that all x_i’s are known, also in the nonresponse sample. The full data estimator of b is given by ^b¼Pn

i¼1Yi=Pn

i¼1xi. The unbiased estimator ofs²is given by ^s²¼Pn

i¼1 1

x_ið yi2 ^bx_iÞ²=ðn 2 1Þ.

We shall consider residual regression imputation. Let ^br be the ^b - estimate based on observed sample s_r. Define the standardized residuals e_i¼ ð yi2 ^brx_iÞ= ffiffiffiffipx_i

, for i [ sr. For i [ s 2 sr: draw the value of e^*_i at random, with replacement, from the set of observed residuals ei; i [ sr. The imputed y-value is given by y_i^*¼ ^brxiþ e_i^* ffiffiffiffixi

p . Let X ¼Pn

i¼1xi; Xr¼P

i[s_rxi and Xnr¼P

i[s2s_rxi¼ X 2 Xr. All considerations from now on are conditional on n_rand X_r, and we aim to determine k directly from (5).

The proportion of the x-total in the nonresponse group is denoted as fX ¼ Xnr=X.

We now have b^^*¼ P

s_ryiþP

s2s_ry_i^*

=X and s^²_*¼_n21¹ P

s_r 1

xið yi2 ^b^*xiÞ² þP

s2s_r1

x_iðy_i^*2 ^b^*xiÞ² .

In order to determine k from (5) we need to check the validity of (3) and derive EVarð ^b^*jyobsÞ; VarEð ^b^*jyobsÞ and Varð ^bÞ. We note that Varð ^bÞ ¼s²=X. In Appendix A.1 it is shown that condition (3) holds for moderate and large n_r, and that

VarEð ^b^*jyobsÞ ¼s² Xr

þð1 2 d1Þd2nnrXnr

X² s² nr

ð6Þ

EVarð ^b^*jyobsÞ ¼Xnr

X²s²

n_r ðnrþ d12 2Þ ð7Þ

Here, 0 # d1; d2# 1. From (5), using (6) and (7), we find k ¼nrX²2 nrXXrþ ð1 2 d1Þd2nnrXnrXr

X_rX_nrðnrþ d12 2Þ < X

X_rþ ð1 2 d1Þ d2

nnr

n_r

(5)

We note that if all x_i ¼ 1, then d1¼ d2 ¼ 1. Now, with fX ¼ Xnr=X being the proportion of the x-total in the nonresponse group and f ¼ n_nr=n the rate of nonresponse, we finally get, since typically ð1 2 d1Þd2< 0,

k < 1 1 2 fX

þ ð1 2 d1Þd2

f

1 2 f < 1 1 2 fX

for usual x-values and nonresponse rates.

3.3. Estimating the Regression Coefficient in Simple Linear Regression with Residual Imputation

As in Sections 3.1 and 3.2 the nonresponse mechanism is assumed to be MCAR with pr¼ PðRi¼ 1Þ. The simple linear regression model is assumed: Yi¼aþbxiþ 1i; with Varð1iÞ ¼s²; i ¼ 1; : : : ; n: All x_i’s are assumed to be known. We may assume, that x ¼Pn

i¼1xi=n ¼ 0. Then the full data estimates are given by ^b¼Pn

i¼1xiyi=SS_x, where SSx¼Pn

i¼1x_i²; and ^a¼ y ¼Pn

i¼1yi=n: The unbiased estimator ofs²is given by s^²¼_n22¹ Pn

i¼1ð yi2 ^a2 ^bx_iÞ². Let ^ar; ^brbe the estimates based on the response sample, a^r¼ yr2 ^brxrand ^br¼P

i[s_rðxi2 xrÞyi=SSx;r. Here, yr¼P

i[s_ryi=nr, xr ¼P

i[s_rxi=nr

and SSx;r¼P

i[srðxi2 xrÞ².

Simple residual imputation is defined as follows: The observed residuals are ej¼ ð yj2 ^ar2 ^brxjÞ; for j [ sr. For i [ s 2 sr: draw e_i^*at random, with replacement from ðej; j [ s_r). The imputed y- value is given by y_i^*¼ ^arþ ^brxiþ e_i^*:

The imputation based estimates are ^b^* ¼ P

i[s_rxiyiþP

i[s2s_rxiy_i^*

=SSx, ^a^*¼ ðnryrþ ðn 2 nrÞy_nr^*Þ=n where y_nr^* ¼P

s2sry_i^*=ðn 2 n_rÞ and s^²_* ¼_n22¹ P

srð yi2 n

a^^*2 ^b^*xiÞ²þP

s2s_rðy_i^*2 ^a^*2 ^b^*xiÞ²o

: It can be shown (see Appendix A.2 for a summary proof) that Eð ^s²_*Þ ¼s²Eðⁿ^r_n²²

r ^n22f_n22Þ <s² where, as in Section 3.1, f ¼ ðn 2 nrÞ=n. Since Varð ^bÞ ¼s²=SSx, (3) holds. It is readily seen that Eð ^b^*jyobsÞ ¼ b^r and Varð ^b^*jyobsÞ ¼ s_e²cr=SSx, where s_e²¼P

s_re_i²=nr and cr ¼P

s2s_rx_i²=SSx

[ k0; 1l. It can be shown that Eðs_e²jsrÞ ¼ⁿ^r_n²²

r s². Moreover, clearly Eðc_rÞ ¼ 1 2 pr and Varð ^brjsrÞ ¼s²=SSx;r: It follows, from (5), that

EðkÞ ¼ Eð1=SSx;rÞ 2 1=SSx

ð1 2 prÞE{ðnr2 2Þ=nr}=SSx

<1=EðSSx;rÞ 2 1=SSx

ð1 2 prÞ=SSx

Using the fact that conditional on n_r, s_r is a simple random sample such that the response indicators are correlated with Cov(R_i, R_j) ¼ 2 f (1 2 f)/(n 2 1), we find that EðSSx;rÞ ¼ ð pr2^12p_n21^rÞSSx. It follows that, approximately, E(k) ¼_p¹

r2¹_n< 1=prand we can use k ¼ 1/(1 2 f ).

4. Multiple Imputation for Stratified Samples 4.1. Separate Combinations

One way to combine the m completed data sets is to do it separately for each stratum, i.e., determine a separate k for each stratum. The general setup is then as follows:

(6)

The sample s is divided into H sample strata, s₁, : : : , s_H. Let y_h be the planned full data from subsample s_hof size n_h. It is assumed that y₁, : : : ,y_H are independent. The observed part of y_his denoted by y_h,obs with s_hrbeing the response sample from s_hof size n_hr. The estimator based on the full sample data is the sum of independent terms, u^¼PH

h¼1u^h where ^uh is based on the yh. Varð ^uÞ ¼PH

h¼1Varð ^uhÞ is estimated by Vð ^^ uÞ ¼PH

h¼1V^hð yhÞ where ^Vhð yhÞ is the variance estimate of ûh based on y_h. For i [ s_h2 s_hr we impute by some method y_i^* based on y_h,obs and let y_h^* denote the complete data ð yh;obs; y_i^*; i [ sh2 shrÞ. Based on y_h^*, we have û_h^*¼ ûhðy_h^*Þ and ^V_h^*¼ V^_hð y_h^*Þ: Then the imputation based estimator is given by u^^*¼PH

h¼1u^_h^* and V^^*¼PH

h¼1V^_h^*. Multiple imputation of m repeated imputations leads to m completed data sets with m estimates for each stratum h, ^uh;i; i ¼ 1; : : : ; m and related variance estimates ^V_h;i^*; i ¼ 1; : : : ; m: The total estimates and related variances are u^_i^*¼ PH

h¼1u^_h;i^* and ^V_i^*¼PH

h¼1V^_h;i^*; for i ¼ 1, : : : , m. The combined estimate for stratum h is given by u_h^*¼Pm

i¼1u^_h;i^*=m. The within-imputation variance for stratum h is V_h^*¼ Pm

i¼1V^_h;i^*=m and the between-imputation component is given by B_h^*¼Pm

i¼1ð ^u_h;i^* 2 u_h^*Þ²=ðm 2 1Þ. Following the same idea as in Section 2, Formula (1), the total estimated variance of u_h^* is then proposed to be Wh¼ V_h^*þ ðkhþ_m¹ÞB_h^*. The combined total estimate is given by u^* ¼Pm

i¼1u^_i^*=m ¼PH

h¼1u_h^*. It follows that the total estimated variance of u^* can be expressed as

Wsep¼X^H

h¼1

Wh¼ V^*þX^H

h¼1

khþ1 m

B_h^* ð8Þ

where V^*¼Pm

i¼1V^_i^*=m ¼PH

h¼1V_h^*. Provided (3) holds for each stratum h,

Eð V_h^*Þ < Varð ^uhÞ ð9Þ

we have from (5) that k_hmust satisfy

Eðk_hÞ ¼VarEð ^u_h^*jY_h;obsÞ 2 Varð ^uhÞ

EVarð ^u^*_hjYh;obsÞ ð10Þ

The combination Formula (8) is an alternative to the usual combination Formula (1), especially useful when we get simple expressions for k_hbut not for k. The next section develops an expression for k in this situation.

4.2. An Overall Combination Formula

Now let W be given by (1). We shall determine the between-imputation factor k. Since EðWÞ ¼ EðW_sepÞ we have

E X^H

h¼1

k_hþ1 m

B_h^*

( )

¼ E k þ 1 m

B^* ð11Þ

Here, B^*¼_m21¹ Pm

i¼1ð ^u_i^*2 u^*Þ²¼_m21¹ Pm i¼1

P

hð ^u_h;i^* 2 u_h^*Þ

n o2

. Note that EðB^*jyobsÞ ¼ E PH

h¼1B_h^*jyobs

, since EðB^*jyobsÞ ¼ Varð ^u^*jyobsÞ ¼PH

h¼1Varð ^u_h^*jyobsÞ and EðB_h^*jyobsÞ ¼ Varð ^u_h^*jyobsÞ.

(7)

Hence, the identity (11) becomes E{PH

h¼1khEðB_h^*jYobsÞ} ¼ E{kEðB^*jYobsÞ}: This gives us the solution k ¼PH

h¼1k_hEðB_h^*jyobsÞ=EðB^*jyobsÞ if we want to use the usual combination Formula (1) and hence

k ¼ X^H

h¼1

khVarð ^u_h^*jyobsÞ Varð ^u^*jyobsÞ ¼X^H

h¼1

khVarð ^u_h^*jyobsÞ

Varð ^u^*jyobsÞ ð12Þ

a weighted average of k_h. We get a simple expression for k only when all k_hare equal, say k_h¼ k₀. Then k ¼ k₀.

5. Four Applications to Stratified Samples and Random Nonresponse within Strata 5.1. Estimating Population Average from Stratified Sample with Stratified Hot-deck

Imputation

Consider stratified simple random samples from a finite population of size N, with H strata of sizes N_h, h ¼ 1, : : : ,H. The aim is to estimate the population average m of some variable y. We assume completely random nonresponse within each stratum, denoted as MAR (missing at random) by Rubin (1987) and Little and Rubin (2002). This means that the response indicators in stratum h, R_h;1; : : : ; R_h;N_h are independent with p_hr¼ PðRh;i¼ 1Þ. The imputation method is stratified hot-deck. Let y_h,obsbe the observed part from the response sample s_hrof size n_hrfrom stratum h, yh;obs¼ ð yi: i [ s_hrÞ. Then an imputed value y_i^* in stratum h is drawn at random from y_h,obs. The estimator based on the full sample data is the usual stratified weighted average Y_strat¼PH

h¼1N_hyh=N ¼PH

h¼1v_hyh. Here, v_h¼ Nh=N and yh¼P

shy_i=n_h, where s_h is the sample from stratum h and nh¼ jshj. Then Varð YstratÞ ¼PH

h¼1v_h²s_h²ð_n¹

h2_N¹

hÞ, with s_h² ¼

i[U_h

Pð yi2mhÞ²=ðN_h2 1Þ being the population variance in stratum h. Here Uh

is stratum population h andm_his the average in U_h.

Let yhrbe the observed sample mean from stratum h and ^s_hr² ¼_n¹

hr21

P

i[s_hrð yi2 yhrÞ² the observed sample variance. The imputation-based estimator is given by Y_strat^* ¼ PH

h¼1Nhy_h^*=N where y_h^*¼ P

s_hryiþP

s_h2s_hry_i^*

=nh¼ nhryhrþP

s_h2s_hry_i^*

=nh. Let the m imputation replicates of Y_strat^* be denoted by Y_strat;i^* for i ¼ 1, : : : , m. The combined estimator is given by Y_strat^* ¼Pm

i¼1Y_strat;i^* =m:

5.1.1. Separate Strata Combinations

It follows from Section 3.1 that kh¼ 1=ð1 2 fhÞ, where fh¼ ðnh2 n_hrÞ=nh is the rate of nonresponse in stratum h. The combination formula for the variance estimate of Y_strat^* becomes, from (8),

Wsep¼ V^*þX^H

h¼1

1 1 2 fh

þ1 m

B_h^* Here, V^*¼PH

h¼1V_h^*and V_h^*is the average of the m values of the imputation-based variance estimate ^V_h^* ¼v_h²s^_h*² _n¹

h2_N¹

h

where ^s_h*² ¼_n¹

h21

P

s_hrð yi2 y_h^*Þ²þP

s_h2s_hrðy_i^*2 y_h^*Þ²

.

(8)

5.1.2. Overall Combination Formula. Determination of k in (1)

From (12) we need to determine VarðvhY_h^*jyo b sÞ and Varð Y_{s t r a t}^* jyo b sÞ ¼ PH

h¼1Var ðv_hY_h^*jyo b sÞ. Then k ¼X^H

h¼1

1

1 2 f_hVarðvhY_h^*jyobsÞ Varð Y_strat^* jyobsÞ

Now, Eð Y_h^*jyh;obsÞ ¼ yhr and Varð Y_h^*jyh;obsÞ ¼ {ðnh2 nhrÞ=n_h²}{ðnhr2 1Þ=nhr} ^s_hr² <

f_hs^_hr²=n_h. Hence we can determine k as k ¼X^H

h¼1

1 1 2 fh

f_hv_h²s^_hr²=n_h X^H

k¼1

fkv_k²s^_kr²=nh

If the stratum sizes N_h are large then we can let ^Vðv_hY_hÞ ¼ v_h²s^_hr²=n_h. Let also bh ¼ fhVðv^ hYhÞ=PH

k¼1fkVðv^ kYkÞ. Then

k ¼ X^H

h¼1

Vðv^ hYhÞfh

1 1 2 f_h X^H

h¼1

Vðv^ hYhÞfh

¼X^H

h¼1

bh 1 1 2 fh

ð13Þ

Since PH

h¼1bh ¼ 1, we see that k is a weighted average of the inverse of the response rates. If all f_h¼ f, the overall nonresponse rate, we get as for simple random sample that k ¼ 1/(1 2 f). Otherwise, a stratum response rate 1 2 f_hhas large weight if either the nonresponse rate is large and/or the estimated variance of vhYh is large.

5.1.3. An Alternative Expression for k in (1)

By directly applying (5) we can get an alternative expression for k. Given y_obs, the imputed sample means Y_h^* are independent, which implies that Eð Y_strat^* jyobsÞ ¼PH

h¼1Nhyhr=N ¼

ystrat;r and Varð Y_strat^* jyobsÞ <PH

h¼1v_h²fhs^_hr²=nh: Just like in Section 3.1, (3) holds. From (5) we get

EðkÞ <Varð Ystrat;rÞ 2 Varð YstratÞ E

h

Xv_h²fhs^_hr²=n_h

!

¼ X^H

h¼1

v_h²s_h² E 1 n_hr

2 1 N_h

2X^H

h¼1

v_h²s_h² ¹ n_h2 1

N_h

X^H

h¼1

v_h²E f_h nh

Eðs^_hr²jnhrÞ

<

X^H

h¼1

v_h²s_h²^{1 2 p}^hr nh

1 phr

X^H

h¼1

v_h²s_h²^{1 2 p}^hr n_h

¼ X^H

h¼1

v_h²s_h² nhr

Eð fhÞ 1 2 fh

Eð1 2 fhÞ X^H

h¼1

v_h²s_h²

n_hrEð fhÞð1 2 fhÞ

ð14Þ

(9)

Now, Varð YhrÞ ¼ EVarð YhrjnhrÞ ¼s_h²Eð1=nhrÞ. Let ^VðvhYhrÞ ¼ v_h²s^_hr²=nhr. Then we see that the expression for E(k) is satisfied approximately, if the stratum sizes N_hare large, by letting

1 k¼

X^H

h21

ð1 2 fhÞfhVðv^ hYhrÞ X^H

h21

f_hVðv^ _hY_hrÞ

¼X^H

h21

a_hð1 2 fhÞ ð15Þ

where the weights a_h¼ fhVðv^ _hY_hrÞ=PH

k¼1f_kVðv^ _kY_krÞ. Since PH

h¼1a_h¼ 1, we see that 1/k is a weighted average of the response rates. If all f_h ¼ f, the overall nonresponse rate, we have, as shown in Section 5.1.2, that k ¼ 1/(1 2 f ). As seen in Section 5.1.2, we note also in Expression (15) that a stratum response rate 1 2 fh has large weight if either the nonresponse rate is large and/or the estimated variance of vhYhr is large. The estimate of the total based on the response sample is given by Ystrat;r¼P

hvhYhr: We obtain Formula (13) for k by noting from (14) that we have EðkÞ <PH

h¼1 Varðv_hY_hÞEð fhÞ_Eð12f¹

hÞ=PH

h¼1Varðv_hY_hÞEð fhÞ. Then we see that the expression for E(k) is satisfied approximately, if the stratum sizes N_h are large, by letting k be given by (13).

5.2. Logistic Regression with Binary Explanatory Variable. Estimating Log(Odds Ratio) The variables Y1; : : : ; Yn are independent 0/1 -variables, and we have explanatory 0/1-variable x with fixed known values x1; : : : ; x_n. The class probabilities are given by p1¼ PðYi¼ 1jxi¼ 1Þ and p0¼ PðYi¼ 1jxi¼ 0Þ. We assume a MAR(missing at random) model for the response variables R₁; : : : ; R_n, with PðR_i¼ 1jxi¼ 1Þ ¼ p1r

and PðRi¼ 1jxi¼ 0Þ ¼ p0r. We can reparametrize the model in a logit version, log {PðY ¼ 1jxÞ=PðY ¼ 0jxÞ} ¼aþbx, where a¼ log {p0=ð1 2p0Þ} and b¼ log^p_p¹^=ð12p¹^Þ

0=ð12p0Þ¼ log(odds ratio). The aim is to estimateb: Let s ¼ (1, : : : , n) denote the full sample with strata s1¼ {i [ s : xi¼ 1} and s0¼ {i [ s : xi¼ 0}. The sizes of s1and s0are denoted by n1and n0. We note that n1¼Pn

i¼1xi¼ X and n0 ¼ n – X. The response samples in the strata are s1r ¼ {i [ s1: Ri¼ 1} and s0r¼ {i [ s0: Ri¼ 1} with total response sample being s_r of size n_r. Let also n_1r ¼ js1rj and n0r ¼ js0rj. We see that n1r¼P

s_rxi¼ Xrand n0r ¼ nr2 Xr. The data from s_rcan be represented as follows where n_ijrdenotes the number of observations with x ¼ i and y ¼ j: see (Table 1).

We then have the maximum likelihood estimates (MLE) ^p1r¼ n11r=n1r and ^p0r ¼ n01r=n0rand MLE ofbequals ^br¼ log^p_p^{^}_^^1r^{=ð12 ^}^p^1r^Þ

0r=ð12 ^p0rÞ¼ log ðn11rn00r=n10rn01rÞ. Similarly, the

Table 1. The observed data and nonresponse totals for the two classes

x\y y ¼ 0 y ¼ 1 Totals Nonresponse

x ¼ 0 n_00r n_01r n_0r n₀2 n_0r

x ¼ 1 n_10r n_11r n_1r n12 n1r

(10)

estimator based on the full sample is given by ^b¼ log^p_p^{^}_^¹^{=ð12 ^}^p¹^Þ

0=ð12 ^p0Þ¼ log ðn11n00=n10n01Þ with obvious analogue notation. We can express this estimate as b^¼ log { ^p1=ð1 2 ^p1Þ} 2 log { ^p0=ð1 2 ^p0Þ} ¼ ^b12 ^b0, of the same form as in Section 4.1.

We also have that ^b1and ^b0are independent based on the separate sample strata s₁and s₀. For large n0, n1, b^ is approximately Nðb;s²_^

bÞ where s²_^

b ¼ {n1p1ð1 2p1Þ}²¹þ {n0p0ð1 2p0Þ}²¹. So, approximately, Varð ^b1Þ ¼ 1={n1p1ð1 2p1Þ} and Varð ^b0Þ ¼ 1={n₀p0ð1 2p0Þ} and an estimate of Varð ^bÞ is given by

Vð ^^ bÞ ¼ 1

n₁p^1ð1 2 ^p1Þþ 1

n₀p^0ð1 2 ^p0Þ¼ 1 n₁₁þ 1

n₁₀

þ 1

n₀₁þ 1 n₀₀

such that ^Vð ^bÞ ¼ ^V₁þ ^V₀, where ^V₁¼ _n¹

11þ_n¹

10

and ^V₀ ¼ _n¹

01þ_n¹

00

are the variance estimates of ^b1 and ^b0, respectively.

We shall consider the following imputation method: For each missing value in s₁ – s_1r, the imputed value y* is drawn at random from the estimated distribution of Y given x ¼ 1:

y* ¼ 1 with probability ^p1r ¼ n11r=n1r and y* ¼ 0 with probability 1 2 ^p1r: The same imputation method is used for s₀ – s_0r with y* drawn at random from the estimated distribution of Y given x ¼ 0. This is the same as stratified hot-deck imputation, imputed values are drawn at random, with replacement, from y1;obs¼ ð yi: i [ s1rÞ and y0;obs¼ ð yi: i [ s0rÞ.

The imputed values in s – s_rcan be represented in the same form as the original data where now n_ij^*denotes the number of imputed values with x ¼ i and y ¼ j: see (Table 2).

The imputation-based estimate of p₁ is given by p^₁^*¼ ðn11rþ n₁₁^*Þ=n1 such that the imputation-based estimate b^₁^*¼ log { ^p₁^*=ð1 2 ^p₁^*Þ} ¼ log {ðn11rþ n₁₁^*Þ=

ðn12 n11r2 n₁₁^*Þ}. Similarly, the imputation-based estimates forb0 andbare given by b^₀^*¼ log {ðn01rþ n₀₁^*Þ=ðn02 n01r2 n₀₁^*Þ} and ^b^*¼ ^b₁^*2 ^b₀^*.

The m repeated imputations lead to m estimates ^b_1;i^*; ^b_0;i^*; ^b_i^*, for i ¼ 1, : : : , m. The combined estimate is given by b^*¼Pm

i¼1b^_i^*=m ¼Pm

i¼1b^_1;i^*=m 2Pm

i¼1b^_0;i^*=m ¼ b₁^*2 b₀^*. The imputed variance estimate ^V^*for ^bis given by

V^^* ¼ 1

n_11rþ n₁₁^* þ 1

n_10rþ n₁₀^* þ 1

n_01rþ n₀₁^* þ 1

n_00rþ n₀₀^* ð16Þ

We see that Eð ^V^*jyobsÞ <_n ¹

1p^1rð12 ^p1rÞþ_n ¹

0p^0rð12 ^p0rÞand (3) hold. We also note that (9) holds separately for each class.

Table 2. The imputed totals for the two classes

x\y y ¼ 0 y ¼ 1 Totals

x ¼ 0 n^*₀₀ n^*₀₁ n02 n0r

x ¼ 1 n^*₁₀ n^*₁₁ n₁2 n_1r

(11)

5.2.1. Separate Classes Combination

Let us first use the approach in Section 4.1 and determine separate k₁, k₀for the two classes. Consider first stratum s₁¼ {i [ s : xi¼ 1}. In Appendix A.3 it is shown that Eð ^b₁^*jy1;obsÞ< ^b1r and Varð ^b₁^*jy1;obsÞ < f1ð1 2 f1Þ ^Vð ^b1rÞ. From (10), we find approximately:

Eðk1Þ ¼ Varð ^b1rÞ 2 Varð ^b1Þ

E{ f1ð1 2 f1Þ ^Vð ^b1rÞ}¼ E Varð ^b1rjn1rÞ 2 Varð ^b1Þ E{ f1ð1 2 f1ÞE ½ ^Vð ^b1rÞjn1r}

<

1

p1ð1 2p1Þ E 1 n1r

2 1 n1

E f1ð1 2 f1Þ 1 n1rp1ð1 2p1Þ

<ð1 2 p1rÞ=p1r

1 2 p1r

¼ 1 p1r

which is satisfied approximately by letting k1¼ 1=ð1 2 f1Þ. In exactly the same way, we find that k₀ ¼ 1=ð1 2 f0Þ where f0 ¼ ðn02 n_0rÞ=n0is the rate of nonresponse in stratum s₀. The between-imputation component for ^b₁^*is given by B₁^*¼_m21¹ Pm

i¼1ð ^b_1;i^* 2 b₁^*Þ²and likewise B₀^*is the between-imputation component for ^b₀^*. Then an estimated variance of the combined imputation-based estimate b^* forbis given by, from (8),

W_sep¼ V^*þX1 x¼0

1 1 2 f_xþ1

m

B_x^*

where V^*is the average of m replicates of the imputed variance estimate ^V^*given by (16).

5.2.2. Overall Combination Formula. Determination of k in (1)

Since Varð ^b₁^*jy1;obsÞ ¼ f1ð1 2 f1Þ ^Vð ^b1rÞ and Varð ^b₀^*jy0;obsÞ ¼ f0ð1 2 f0Þ ^Vð ^b0rÞ, we have from (12)

k ¼ 1 1 2 f1

f1ð1 2 f1Þ ^Vð ^b1rÞ X¹

x¼0

fxð1 2 fxÞ ^Vð ^bxrÞ

þ 1

1 2 f0

f0ð1 2 f0Þ ^Vð ^b0rÞ X¹

x¼0

fxð1 2 fxÞ ^Vð ^bxrÞ

ð17Þ

Var ð ^b1Þ < ðn1r=n₁Þ Varð ^b1r j n1rÞ ¼ ð1 2 f1Þ Varð ^b1rjn1rÞ. Similarly, Varð ^b0Þ <

ð1 2 f0Þ Varð ^b0rjn0rÞ. We can therefore estimate the variance of the full sample estimates b^1and ^b0by ^Vð ^b1Þ ¼ ð1 2 f1Þ ^Vð ^b1rÞ and ^Vð ^b0Þ ¼ ð1 2 f0Þ ^Vð ^b0rÞ, respectively. Then

k ¼ 1

1 2 f₁ f1Vð ^^ b1Þ X¹

x¼0

fxVð ^^ bxÞ

þ 1

1 2 f₀ f0Vð ^^ b0Þ X¹

x¼0

fxVð ^^ bxrÞ

¼ 1

1 2 f₁b1þ 1

1 2 f₀ð1 2 b1Þ

Just like in Section 5.1.2 we see that k is a weighted average of the inverse of the response rates. If all f_h ¼ f, the overall nonresponse rate, we get that k ¼ 1/(1 2 f). Otherwise, a stratum response rate 1 – f_xhas large weight if either the nonresponse rate is large and/or the estimated variance of ^bxis large.

(12)

Alternatively, from (17), 1=k ¼P1

x¼0ð1 2 fxÞ fxVð ^^ bxrÞ=P1

x¼0fxVð ^^ bxrÞ ¼P1 x¼0

a_xð1 2 fxÞ, where the weights are ax¼ fxVð ^^ bxrÞ={ f1Vð ^^ b1rÞ þ f0Vð ^^ b0rÞ}. So we can alternatively express 1/k as a weighted average of the response rates.

If the aim is to estimate p1 andp0 we obtain, of course, k ¼ 1/(1 2 f₁) for p1 and k ¼ 1= (1 2 f0) forp0.

5.3. Logistic Regression with Categorical Explanatory Variable. Estimating Log(Odds Ratios)

If the explanatory x is categorical defining, say, H classes, we can generalize the results as follows:

Let ph¼ PðY ¼ 1jx ¼ hÞ, h ¼ 0, : : : , H – 1. Logistic regression defining the categories is done by introducing H 2 1 binary explanatory variables x₁, : : : , x_H-1 where x_h ¼ 1 if observation belongs to Class h, and 0 otherwise for h ¼ 1, : : : , H – 1. Then an observation belongs to Class 0 if x₁¼ x2¼ : : : ¼ xH21¼ 0. The logit version of the model becomes, with x ¼ ðx1; x2; : : : ; xH21Þ : log {PðY ¼ 1jxÞ=

PðY ¼ 0jxÞ} ¼aþb1x1þb2x2þ : : : þ xH21bH21. We see that a¼ log_12p^p⁰

0 and

bh¼ log^p_p^h^=ð12p^h^Þ

0=ð12p0Þ¼ log (odds ratio) for Class h versus Class 0. Estimating bh by multiple imputation is done in exactly the same manner as for binary x, with Class h replacing Class 1.

5.4. Logistic Regression with Missing Values in a Binary Explanatory Variable The situation is as in Section 5.2, except that y is fully observed in s, y ¼ ð y1; : : : ; y_nÞ, and we have missing values for the x-variable. Y1; : : : ; Ynare independent 0/1-variables and we have an explanatory 0/1-variable x with fixed values x₁; : : : ; x_n, some of which are missing. The response variables indicate missingness of the x_i’s but now with MAR model PðR_i¼ 1jyi¼ 1Þ ¼ q1r and PðR_i¼ 1jyi¼ 0Þ ¼ q0r.

Otherwise, the model is the same as in Section 5.2 with class probabilities: p1¼ PðYi¼ 1jxi¼ 1Þ and p0¼ PðYi¼ 1jxi¼ 0Þ, and the logit version

log {PðY ¼ 1jxÞ= PðY ¼ 0jxÞ} ¼aþbx with b¼ log^p_p¹^=ð12p¹^Þ

0=ð12p0Þ. The aim is still to estimate b.

Let now s¹ ¼ {i [ s : yi¼ 1} and s⁰ ¼ {i [ s : yi¼ 0} with sizes n⁺₁ and n⁺₀. The response samples in the strata are s¹_r ¼ {i [ s¹: Ri¼ 1} and s⁰_r ¼ {i [ s⁰ : Ri¼ 1} with total response sample being s_r¼ {i [ s : Ri¼ 1} ¼ s¹_r < s⁰_r. The data can now be represented as before, except that nonresponse totals are for each y-stratum. See Table 3.

The MLE ^p1r; ^p0r; ^br, based on s_r are the same as before, as is the full sample estimate ^b. The imputation method is stratified hot-deck for the y-strata. For each

Table 3. The observed data and nonresponse totals for the y-strata

x\y y ¼ 0 y ¼ 1

x ¼ 0 n_00r n_01r

x ¼ 1 n_10r n_11r

Totals n⁺_0r n⁺_1r

Nonresponse n⁺₀2 n⁺_0r n⁺₁2 n⁺_1r

(13)

missing value of x in s¹2 s¹_r, the imputed value x* is drawn at random from x_1;obs¼ ðxi: i [ s¹_rÞ. Similarly, imputed values in s⁰2 s⁰_r are drawn at random from x0;obs¼ ðxi: i [ s⁰_rÞ. The imputed values in s – s_r can be represented in the same form as the original data where now n_ij^* denotes the number of imputed values with x ¼ i and y ¼ j. See Table 4.

We need to find an approximate expression for the expectation and variance of ^b^*, now denoted ^b_*, conditional on the observed data. We defer to Appendix A.4 to show that Varð ^b_*jy; xobsÞ < f¹ð1 2 f¹Þð_n¹

11rþ_n¹

01rÞ þ f⁰ð1 2 f⁰Þð_n¹

10rþ_n¹

00rÞ and Eð ^b_*jy; xobsÞ < ^br. Here f¹¼ ðn⁺₁2 n⁺_1rÞ=n⁺₁ is the nonresponse rate in Stratum s¹and f⁰¼ ðn⁺₀2 n⁺_0rÞ=n⁺₀ the nonresponse rate in s⁰. We note that ^q1r ¼ n⁺_1r=n⁺₁¼ 1 2 f¹and ^q0r¼ n⁺_0r=n⁺₀. So the denominator in (5) becomes

E f¹1 2 f¹ 1 n11r

þ 1 n01r

þ f⁰1 2 f⁰ 1 n10r

þ 1 n00r

ð18Þ

The numerator in (5) equals, as before, Varð ^brÞ 2 Varð ^bÞ, and we have approximately

Varð ^brÞ 2 Varð ^bÞ ¼ 1

n1p1ð1 2p1Þ1 2 p_1r p1r

þ 1

n0p0ð1 2p0Þ1 2 p_0r p0r

ð19Þ

where, as before, p1r¼ PðRi¼ 1jxi¼ 1Þ and p0r ¼ PðRi¼ 1jxi¼ 0Þ. We need alternative estimates of p_1r and p_0r. Since p_1r¼p1q_1rþ ð1 2p1Þq0r; we have

^p1r¼ ^p1ð1 2 f¹Þ þ ð1 2 ^p1Þð1 2 f⁰Þ. Similarly, ^p0r ¼ ^p0ð1 2 f¹Þ þ ð1 2 ^p0Þð1 2 f⁰Þ.

We can also use that n₁^p1r < n_1rand n₀^p0r< n_0r. From (18) and (19) it follows that we can use

k ¼ 1 n_11rþ 1

n_10r

p^1rf¹þ 1 2 ^ð p1rÞf⁰

þ 1

n_01rþ 1 n_00r

p^0rf¹þ 1 2 ^ð p0rÞf⁰

f¹1 2 f¹ 1 n_11rþ 1

n_01r

þ f⁰1 2 f⁰ 1 n_01rþ 1

n_00r

¼

f¹ 1 n10r

þ 1 n00r

þ f⁰ 1 n11r

þ 1 n01r

f¹1 2 f¹ 1 n11r

þ 1 n01r

þ f⁰1 2 f⁰ 1 n01r

þ 1 n00r

We note that if f ¹ ¼ f⁰¼ f, then k ¼ 1/(1 – f). Otherwise, we can express 1/k as a linear combination of the response rates (1 – f¹, 1 – f⁰). Let w1¼_n¹

11rþ_n¹

01r and

w0 ¼_n¹

10rþ_n¹

00r. Then

Table 4. The imputed totals for the y-strata

x\y y ¼ 0 y ¼ 1

x ¼ 0 n^*₀₀ n^*₀₁

x ¼ 1 n^*₁₀ n^*₁₁

Totals n⁺₀2 n⁺_0r n⁺₁2 n⁺_1r