An extension of the factoring likelihood approach for non-monotone missing data


Jae Kwang Kim∗ and Dong Wan Shin†

January 14, 2010

ABSTRACT

We address the problem of parameter estimation in multivariate distributions under ignorable non-monotone missing data. The factoring likelihood method for monotone missing data, termed by Rubin (1974), is extended to a more general case of non-monotone missing data. The proposed method is equivalent to the Newton-Raphson method for the observed likelihood, but avoids the burden of computing the first and the second partial derivatives of the observed likelihood. Instead, the maximum likelihood estimates and their information matrices for each partition of the data set are computed separately and combined naturally using the generalized least squares method. A numerical example is presented to illustrate the method. A Monte-Carlo experiment compares the proposed method with the EM method.

KEY WORDS. EM algorithm, Gauss-Newton method, Generalized least squares, Maximum likelihood estimator, Missing at random.

∗Department of Statistics, Iowa State University, Ames, IA, 50014, U.S.A., [email protected], phone: 1-515-294-3225, fax: 1-515-294-4040

†Department of Statistics, Ewha University, Seoul, 120-750, Korea,

1 Introduction

Missing data are common in practice. Statistical analysis of data with missing values is an important practical problem because missing data are often non-ignorable. When we simply ignore missing data, the resulting estimates will have nonresponse bias if the responding part of the sample is systematically different from the nonresponding part. Also, we may lose some information observed in the partially missing data. Little and Rubin (2002) and Molenberghs and Kenward (2007) provide comprehensive overviews of the missing data problem.

We consider statistical inference for data with missing values using the maximum likelihood method. Specifically, we propose a computational tool for obtaining the maximum likelihood estimator (MLE) under multivariate missing data. To explain the basic idea, we use an example of bivariate normal data. Later, in Section 2.3, an extension to general multivariate data is proposed. Let $y_i = (y_{1i}, y_{2i})'$ be bivariate normal random vectors distributed as

$$\begin{pmatrix} y_{1i} \\ y_{2i} \end{pmatrix} \overset{iid}{\sim} N\left[ \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \begin{pmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{12} & \sigma_{22} \end{pmatrix} \right], \tag{1}$$

where $\overset{iid}{\sim}$ abbreviates "independently and identically distributed". Note that five parameters, $\mu_1$, $\mu_2$, $\sigma_{11}$, $\sigma_{12}$, and $\sigma_{22}$, are needed to identify the bivariate normal distribution. We assume that the observations are missing at random (MAR) as defined in Rubin (1976) so that the relevant likelihood is the observed likelihood, or the marginal likelihood of the observed data. Under MAR, we can ignore the response mechanism when estimating the population parameters.

If the missing data pattern is monotone in the sense that the set of respondents for one variable is a subset of the respondent set of the other variable, the observed likelihood can be factored into the marginal likelihood for one variable and the conditional likelihood for the second variable given the first, so that the maximum likelihood estimates can be obtained separately for each likelihood. For example, assume that $y_1$ is fully observed with $n$ observations and $y_2$ is observed with $r$ observations. Anderson (1957) first considered maximum likelihood parameter estimation under this setup by using an alternative representation of the bivariate normal distribution as

$$y_{1i} \overset{iid}{\sim} N(\mu_1, \sigma_{11}), \qquad y_{2i} \mid y_{1i} \overset{iid}{\sim} N(\beta_{20\cdot1} + \beta_{21\cdot1} y_{1i}, \sigma_{22\cdot1}), \tag{2}$$

where $\beta_{20\cdot1} = \mu_2 - \beta_{21\cdot1}\mu_1$, $\beta_{21\cdot1} = \sigma_{11}^{-1}\sigma_{12}$ and $\sigma_{22\cdot1} = \sigma_{22} - \beta_{21\cdot1}^2\sigma_{11}$. The observed likelihood is then written as a product of the marginal likelihood of the fully observed variable $y_1$ and the conditional likelihood of $y_2$ given $y_1$. Thus, the parameters $\mu_1$ and $\sigma_{11}$ for the marginal distribution of $y_1$ can be estimated with $n$ observations and the other regression parameters, $\beta_{20\cdot1}$, $\beta_{21\cdot1}$, and $\sigma_{22\cdot1}$, can be estimated from the conditional distribution with $r$ observations.
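The separate maximization of the two factors in (2) can be sketched in Python as follows. This is a minimal illustration, assuming NumPy arrays in which `np.nan` marks a missing $y_2$; the function name `factored_mle_monotone` and the dictionary keys are illustrative choices and do not come from the paper.

```python
import numpy as np

def factored_mle_monotone(y1, y2):
    """MLE for bivariate normal data with y1 fully observed and
    y2 missing (np.nan) for some units (monotone pattern)."""
    y1 = np.asarray(y1, dtype=float)
    y2 = np.asarray(y2, dtype=float)
    obs = ~np.isnan(y2)  # the r units with y2 observed

    # Marginal likelihood of y1: uses all n observations.
    mu1 = y1.mean()
    s11 = np.mean((y1 - mu1) ** 2)  # the MLE divides by n

    # Conditional likelihood of y2 given y1: uses the r complete pairs.
    x, z = y1[obs], y2[obs]
    b21 = np.sum((x - x.mean()) * (z - z.mean())) / np.sum((x - x.mean()) ** 2)
    b20 = z.mean() - b21 * x.mean()
    s22_1 = np.mean((z - b20 - b21 * x) ** 2)

    # Back-transform to the parametrization of model (1).
    return {
        "mu1": mu1, "sigma11": s11,
        "beta20_1": b20, "beta21_1": b21, "sigma22_1": s22_1,
        "mu2": b20 + b21 * mu1,
        "sigma12": b21 * s11,
        "sigma22": s22_1 + b21 ** 2 * s11,
    }
```

The last three entries invert the relations stated after (2), so the estimates of $\mu_2$, $\sigma_{12}$ and $\sigma_{22}$ borrow information from the units with only $y_1$ observed.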

The factoring likelihood (FL) method, termed by Rubin (1974), expresses the observed likelihood as a product of the marginal likelihood and the conditional likelihood so that the maximum likelihood estimates can be obtained separately for each likelihood. Note that the FL approach consists of two steps. In the first step, the likelihood is factored, and in the second step the MLE for each likelihood is computed separately. In many cases, the MLE’s are easily computed in the FL approach because the marginal and the conditional likelihoods are known so that we can directly use the known solutions of the likelihood equations for each likelihood. For the monotone missing data, the MLE’s for the conditional distribution are independent of those for the marginal distribution. This is because the two sets of parameters, the parameters for the marginal likelihood and those for the conditional likelihood, are orthogonal (Cox and Reid, 1997) and as a result the MLE’s for the conditional likelihood are not affected by the MLE’s for the marginal likelihood. Rubin (1974) recommended the FL approach as a general framework in the analysis of missing data with a monotone missing pattern. The main advantage of the FL is its computational simplicity.

Under non-monotone missing data patterns, however, the FL approach is not directly applicable. The EM algorithm, proposed by Dempster et al. (1977), successfully provides MLE’s under a general missing pattern. Using the EM algorithm also avoids the calculation of the observed likelihood function and uses only the complete likelihood function. Despite its popularity, there are several shortcomings of using the EM algorithm. First, the computation is performed iteratively and the convergence is notoriously slow (Liu and Rubin, 1994). Second, the covariance matrix of the estimated parameters is not provided directly (Louis, 1982, and Meng and Rubin, 1991). The focus of this paper is to propose an alternative method that will resolve these two issues at the same time.

In this paper, we consider an extension of the FL method to non-monotone missing data. To apply the FL method to non-monotone missing data, in addition to the two steps in the original FL approach, we need another step that combines the separate MLE’s computed for each likelihood to produce the final MLE’s. The proposed method turns out to be essentially the same as the direct maximum likelihood method using the Newton-Raphson algorithm, which converges much faster than the EM algorithm. Furthermore, the proposed method provides the asymptotic variance-covariance matrix of the MLE’s directly as a by-product of the computation. Using the variance-covariance expression, the asymptotic variances are compared with those of other estimators obtained by ignoring some part of the partially observed data. A related work is Chen et al. (2008), who compared variances of estimators for regression models with missing responses and covariates.

The proposed method is an extension of the preliminary work of Kim (2004), who considered the case of bivariate missing data. In Section 2, some of the results of Kim (2004) are reviewed and extended to a more general class of multivariate missing data. Efficiency comparisons based on the asymptotic variance-covariance matrix obtained from the proposed method are discussed in Section 3. The proposed method is applied to a categorical data example in Section 4. Results from a limited simulation study are presented in Section 5. Concluding remarks are made in Section 6.

2 Proposed method

The proposed method can be described in the following three steps:

[Step 1] Partition the original sample into several disjoint sets according to the missing pattern.

[Step 2] Compute the MLE’s for the identified parameters separately in each partition of the sample.


[Step 3] Combine the estimators to get a set of final estimates using a generalized least squares (GLS) form.

Kim (2004) discusses the procedures in detail for the bivariate case. We review the result of Kim (2004) for the bivariate normal case in Section 2.1. In Section 2.2, we consider a general class of bivariate distributions. In Section 2.3, the proposed method is extended to multivariate distributions.

2.1 Bivariate normal case

To simplify the presentation, we describe the proposed method in the bivariate normal setup with a non-monotone missing pattern. The joint distribution of $y = (y_1, y_2)'$ is parameterized by the five parameters using model (1) or (2). For the convenience of the factoring method described in Section 1, we use the parametrization in (2) and let $\theta = (\beta_{20\cdot1}, \beta_{21\cdot1}, \sigma_{22\cdot1}, \mu_1, \sigma_{11})'$.

In Step 1, we partition the sample into several disjoint sets according to the pattern of missingness. In the case of a non-monotone missing pattern with two variables, we have $3 = 2^2 - 1$ types of respondents that contain information about the parameters. The first set $H$ has both $y_1$ and $y_2$ observed, the second set $K$ has $y_1$ observed but $y_2$ missing, and the third set $L$ has $y_2$ observed but $y_1$ missing. See Table 1. Let $n_H$, $n_K$, $n_L$ be the sample sizes of the sets $H$, $K$, $L$, respectively. The cases with both $y_1$ and $y_2$ missing can be safely removed from the sample.
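Step 1 can be sketched in Python as below. The sketch assumes the sample is stored as an $(n, 2)$ NumPy array with `np.nan` for missing entries; the function name `partition_by_pattern` is hypothetical and used only for illustration.

```python
import numpy as np

def partition_by_pattern(y):
    """Split an (n, 2) array with np.nan for missing values into
    H (both observed), K (only y1 observed) and L (only y2 observed).
    Rows with both values missing are dropped."""
    y = np.asarray(y, dtype=float)
    obs1 = ~np.isnan(y[:, 0])
    obs2 = ~np.isnan(y[:, 1])
    return {
        "H": y[obs1 & obs2],       # complete pairs (y1, y2)
        "K": y[obs1 & ~obs2, 0],   # y1 values only
        "L": y[~obs1 & obs2, 1],   # y2 values only
    }
```

The sample sizes $n_H$, $n_K$, $n_L$ are then the lengths of the three returned arrays.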

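In Step 2, we obtain the parameter estimators in each set. For set $H$, we have the five parameters $\eta_H = (\beta_{20\cdot1}, \beta_{21\cdot1}, \sigma_{22\cdot1}, \mu_1, \sigma_{11})'$ of the conditional distribution of $y_2$ given $y_1$ and the marginal distribution of $y_1$, with MLE’s $\hat\eta_H = (\hat\beta_{20\cdot1,H}, \hat\beta_{21\cdot1,H}, \hat\sigma_{22\cdot1,H}, \hat\mu_{1,H}, \hat\sigma_{11,H})'$. For set $K$, the MLE’s $\hat\eta_K = (\hat\mu_{1,K}, \hat\sigma_{11,K})'$ are obtained for $\eta_K = (\mu_1, \sigma_{11})'$, the parameters of the marginal distribution of $y_1$. For set $L$, the MLE’s $\hat\eta_L = (\hat\mu_{2,L}, \hat\sigma_{22,L})'$ are obtained for $\eta_L = (\mu_2, \sigma_{22})'$, where $\mu_2 = \beta_{20\cdot1} + \beta_{21\cdot1}\mu_1$ and $\sigma_{22} = \sigma_{22\cdot1} + \beta_{21\cdot1}^2\sigma_{11}$.

Table 1. An illustration of the missing data structure under bivariate normal distribution

Set   y1         y2         Sample size   Estimable parameters
H     Observed   Observed   nH            µ1, µ2, σ11, σ12, σ22
K     Observed   Missing    nK            µ1, σ11
L     Missing    Observed   nL            µ2, σ22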

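In Step 3, we use the GLS method to combine the three estimators $\hat\eta_H$, $\hat\eta_K$, $\hat\eta_L$ to get a final estimator for the parameter $\theta$. Let $\hat\eta = (\hat\eta_H', \hat\eta_K', \hat\eta_L')'$. Then

$$\hat\eta = \left(\hat\beta_{20\cdot1,H}, \hat\beta_{21\cdot1,H}, \hat\sigma_{22\cdot1,H}, \hat\mu_{1,H}, \hat\sigma_{11,H}, \hat\mu_{1,K}, \hat\sigma_{11,K}, \hat\mu_{2,L}, \hat\sigma_{22,L}\right)'. \tag{3}$$

The expected value of this estimator is

$$\eta(\theta) = \left(\beta_{20\cdot1}, \beta_{21\cdot1}, \sigma_{22\cdot1}, \mu_1, \sigma_{11}, \mu_1, \sigma_{11}, \beta_{20\cdot1} + \beta_{21\cdot1}\mu_1, \sigma_{22\cdot1} + \beta_{21\cdot1}^2\sigma_{11}\right)' \tag{4}$$

and the asymptotic covariance matrix is

$$V = \mathrm{diag}\left\{\frac{\Sigma_{22\cdot1}}{n_H}, \frac{2\sigma_{22\cdot1}^2}{n_H}, \frac{\sigma_{11}}{n_H}, \frac{2\sigma_{11}^2}{n_H}, \frac{\sigma_{11}}{n_K}, \frac{2\sigma_{11}^2}{n_K}, \frac{\sigma_{22}}{n_L}, \frac{2\sigma_{22}^2}{n_L}\right\}, \tag{5}$$

where

$$\Sigma_{22\cdot1} = \begin{pmatrix} \sigma_{22\cdot1}\left(1 + \sigma_{11}^{-1}\mu_1^2\right) & -\sigma_{11}^{-1}\sigma_{22\cdot1}\mu_1 \\ -\sigma_{11}^{-1}\sigma_{22\cdot1}\mu_1 & \sigma_{11}^{-1}\sigma_{22\cdot1} \end{pmatrix}.$$

Note that

$$\Sigma_{22\cdot1} = \left\{E\left[(1, y_1)'(1, y_1)\right]\right\}^{-1}\sigma_{22\cdot1} = \begin{pmatrix} 1 & \mu_1 \\ \mu_1 & \sigma_{11} + \mu_1^2 \end{pmatrix}^{-1}\sigma_{22\cdot1}.$$

Derivation of the asymptotic covariance matrix of the first five estimates in (3) is straightforward and can be found, for example, in Subsection 7.2.2 of Little and Rubin (2002). We have a block-diagonal structure of $V$ in (5) because $\hat\mu_{1,K}$ and $\hat\sigma_{11,K}$ are independent due to normality and observations in different sets are independent due to the iid assumption.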

Note that the nine elements in $\eta(\theta)$ are related to each other because they are all functions of the five elements of the vector $\theta$. The information contained in the extra four equations has not yet been utilized in constructing the estimators $\hat\eta_H$, $\hat\eta_K$, $\hat\eta_L$. The information can be employed to construct a fully efficient estimator of $\theta$ by combining $\hat\eta_H$, $\hat\eta_K$, $\hat\eta_L$ through a GLS (generalized least squares) regression of $\hat\eta = (\hat\eta_H', \hat\eta_K', \hat\eta_L')'$ on $\theta$ as follows:

$$\hat\eta - \eta(\hat\theta_S) = (\partial\eta/\partial\theta')(\theta - \hat\theta_S) + \text{error},$$

where $\hat\theta_S$ is an initial estimator.

The expected value and variance of $\hat\eta$ in (4) and (5) can be viewed as a nonlinear model of the five parameters in $\theta$. Using a Taylor series expansion of the nonlinear model, a step of the Gauss-Newton method can be formulated as

$$\tilde\eta = X\left(\theta - \hat\theta_S\right) + u, \tag{6}$$

where $\tilde\eta = \hat\eta - \eta(\hat\theta_S)$, $\eta(\hat\theta_S)$ is the vector (4) evaluated at $\hat\theta_S$,

$$X = \begin{pmatrix}
1 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 \\
0 & 1 & 0 & 0 & 0 & 0 & 0 & \mu_1 & 2\beta_{21\cdot1}\sigma_{11} \\
0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 1 \\
0 & 0 & 0 & 1 & 0 & 1 & 0 & \beta_{21\cdot1} & 0 \\
0 & 0 & 0 & 0 & 1 & 0 & 1 & 0 & \beta_{21\cdot1}^2
\end{pmatrix}', \tag{7}$$

and, approximately, $u \sim (0, V)$, where $V$ is the covariance matrix defined in (5). The Gauss-Newton method for the estimation of nonlinear models can be found in Seber and Wild (1989). Relations among the parameters $\eta$, $\theta$, $X$, and $V$ are summarized in Table 2.

Table 2. Summary for bivariate normal case.

y1   y2   Data set   Size   Estimable parameters   Asymptotic variance
O    M    K          nK     ηK = θ1                WK = diag(σ11, 2σ11²)
O    O    H          nH     ηH = (θ1′, θ2′)′       WH = diag(WK, Σ22.1, 2σ22.1²)
M    O    L          nL     ηL = (µ2, σ22)′        WL = diag(σ22, 2σ22²)

O: observed, M: missing, θ1 = (µ1, σ11)′, θ2 = (β20·1, β21·1, σ22·1)′, η = (ηH′, ηK′, ηL′)′, θ = ηH, V = diag(WH/nH, WK/nK, WL/nL), Σ22.1 = {E[(1, y1)′(1, y1)]}⁻¹σ22.1, X = ∂η/∂θ′, µ2 = β20.1 + β21.1µ1, σ22 = σ22.1 + β21.1²σ11
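The three ingredients of the Gauss-Newton step, $\eta(\theta)$ in (4), $X$ in (7) and $V$ in (5), can be coded directly. The Python sketch below is illustrative only: the function names `eta`, `jacobian_X` and `cov_V` are hypothetical, and $\theta$ is assumed to be ordered as $(\beta_{20\cdot1}, \beta_{21\cdot1}, \sigma_{22\cdot1}, \mu_1, \sigma_{11})$.

```python
import numpy as np

def eta(theta):
    """The nine-element vector eta(theta) of equation (4)."""
    b20, b21, s22_1, mu1, s11 = theta
    mu2 = b20 + b21 * mu1
    s22 = s22_1 + b21 ** 2 * s11
    return np.array([b20, b21, s22_1, mu1, s11, mu1, s11, mu2, s22])

def jacobian_X(theta):
    """X = d eta / d theta' (9 x 5), the transpose of the array in (7)."""
    b20, b21, s22_1, mu1, s11 = theta
    X = np.zeros((9, 5))
    X[:5, :] = np.eye(5)                                  # set H: eta_H = theta
    X[5, 3] = 1.0                                         # set K: mu1
    X[6, 4] = 1.0                                         # set K: sigma11
    X[7, :] = [1.0, mu1, 0.0, b21, 0.0]                   # set L: mu2
    X[8, :] = [0.0, 2.0 * b21 * s11, 1.0, 0.0, b21 ** 2]  # set L: sigma22
    return X

def cov_V(theta, nH, nK, nL):
    """Block-diagonal asymptotic covariance matrix V of equation (5)."""
    b20, b21, s22_1, mu1, s11 = theta
    s22 = s22_1 + b21 ** 2 * s11
    V = np.zeros((9, 9))
    # Sigma_{22.1} / nH for the regression coefficients (beta20.1, beta21.1)
    V[:2, :2] = s22_1 / s11 * np.array([[s11 + mu1 ** 2, -mu1],
                                        [-mu1, 1.0]]) / nH
    V[2, 2] = 2 * s22_1 ** 2 / nH
    V[3, 3] = s11 / nH
    V[4, 4] = 2 * s11 ** 2 / nH
    V[5, 5] = s11 / nK
    V[6, 6] = 2 * s11 ** 2 / nK
    V[7, 7] = s22 / nL
    V[8, 8] = 2 * s22 ** 2 / nL
    return V
```

A finite-difference comparison of `eta` against `jacobian_X` is a cheap way to catch transcription errors in the entries of $X$.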

A simple initial estimator is the weighted average of the available estimators from the data sets, defined as

$$\hat\theta_S = \left(\hat\beta_{20\cdot1,H}, \hat\beta_{21\cdot1,H}, \hat\sigma_{22\cdot1,H}, \hat\mu_{1,HK}, \hat\sigma_{11,HK}\right)', \tag{8}$$

where $\hat\mu_{1,HK} = (1 - p_K)\hat\mu_{1,H} + p_K\hat\mu_{1,K}$, $\hat\sigma_{11,HK} = (1 - p_K)\hat\sigma_{11,H} + p_K\hat\sigma_{11,K}$, and $p_K = n_K/(n_H + n_K)$. This initial value is a $\sqrt{n}$-consistent estimate of $\theta$ and guarantees the consistency of the one-step estimators.

The procedure can be carried out iteratively until convergence. Given the current value $\hat\theta_{(t)}$, the solution of the Gauss-Newton method can be obtained iteratively as

$$\hat\theta_{(t+1)} = \hat\theta_{(t)} + \left(X_{(t)}'\hat V_{(t)}^{-1}X_{(t)}\right)^{-1}X_{(t)}'\hat V_{(t)}^{-1}\left\{\hat\eta - \eta\left(\hat\theta_{(t)}\right)\right\}, \tag{9}$$

where $X_{(t)}$ and $\hat V_{(t)}$ are evaluated from $X$ in (7) and $V$ in (5), respectively, using the current value $\hat\theta_{(t)}$. The covariance matrix of the estimator in (9) can be estimated by

$$C = \left(X_{(t)}'\hat V_{(t)}^{-1}X_{(t)}\right)^{-1}. \tag{10}$$
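The iteration (9) and the covariance estimate (10) can be sketched generically in Python. The sketch assumes the user supplies three callables returning $\eta(\theta)$, $X(\theta)$ and $V(\theta)$; the names `gauss_newton_gls`, `eta_fn`, `X_fn`, `V_fn`, and the stopping rule are illustrative assumptions rather than part of the paper.

```python
import numpy as np

def gauss_newton_gls(eta_hat, theta0, eta_fn, X_fn, V_fn, max_iter=50, tol=1e-10):
    """Iterate the GLS Gauss-Newton update (9) from theta0.

    eta_fn, X_fn, V_fn are user-supplied callables (hypothetical names)
    returning eta(theta), X(theta) and V(theta).
    Returns the final theta and the covariance estimate C of (10)."""
    theta = np.asarray(theta0, dtype=float)
    eta_hat = np.asarray(eta_hat, dtype=float)
    for _ in range(max_iter):
        X = X_fn(theta)
        V_inv = np.linalg.inv(V_fn(theta))
        info = X.T @ V_inv @ X
        # GLS regression of the residual eta_hat - eta(theta) on X
        step = np.linalg.solve(info, X.T @ V_inv @ (eta_hat - eta_fn(theta)))
        theta = theta + step
        if np.max(np.abs(step)) < tol:  # assumed convergence criterion
            break
    X = X_fn(theta)
    C = np.linalg.inv(X.T @ np.linalg.inv(V_fn(theta)) @ X)
    return theta, C
```

Calling the function with `max_iter=1` gives a one-step estimator started at `theta0`.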


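Remark 1 Bivariate normal monotone case. In the case of monotone missing data, which consists of $H$ and $K$, the iteration (9) produces the estimator obtained by the factoring likelihood, given in Little and Rubin (2002). In order to see this, note that (9) reduces to

$$\hat\theta_{(t+1)} = \hat\theta_{(t)} + \left(X_{HK}'\hat V_{HK(t)}^{-1}X_{HK}\right)^{-1}X_{HK}'\hat V_{HK(t)}^{-1}\left\{\hat\eta_{HK} - \eta\left(\hat\theta_{(t)}\right)\right\}, \tag{11}$$

where

$$X_{HK} = \begin{pmatrix}
1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 & 1
\end{pmatrix}',$$

$$\hat V_{HK(t)} = \mathrm{diag}\left\{\frac{\hat\Sigma_{22\cdot1(t)}}{n_H}, \frac{2\hat\sigma_{22\cdot1(t)}^2}{n_H}, \frac{\hat\sigma_{11(t)}}{n_H}, \frac{2\hat\sigma_{11(t)}^2}{n_H}, \frac{\hat\sigma_{11(t)}}{n_K}, \frac{2\hat\sigma_{11(t)}^2}{n_K}\right\},$$

and

$$\hat\eta_{HK} = \left(\hat\beta_{20\cdot1,H}, \hat\beta_{21\cdot1,H}, \hat\sigma_{22\cdot1,H}, \hat\mu_{1,H}, \hat\sigma_{11,H}, \hat\mu_{1,K}, \hat\sigma_{11,K}\right)'.$$

Starting with the initial estimator $\hat\theta_{(0)} = (\hat\beta_{20\cdot1,H}, \hat\beta_{21\cdot1,H}, \hat\sigma_{22\cdot1,H}, \hat\mu_{1,H}, \hat\sigma_{11,H})'$, the estimator constructed from data set $H$, a one-step application of (9) leads to the one-step estimate

$$\hat\theta_{(1)} = \left(\hat\beta_{20\cdot1,H}, \hat\beta_{21\cdot1,H}, \hat\sigma_{22\cdot1,H}, \hat\mu_{1,HK}, \hat\sigma_{11,HK}\right)' \tag{12}$$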

where $\hat\mu_{1,HK}$ and $\hat\sigma_{11,HK}$ are defined after (8). This is the same as the MLE of Anderson (1957) from the original factoring likelihood method. When only $y_2$ is subject to missingness, the estimated regression coefficient for the regression of $y_2$ on $y_1$ using the set $H$ only is fully efficient, but the estimated regression coefficient for the regression of $y_1$ on $y_2$ based on the set $H$ only is not fully efficient.

2.2 General bivariate case

We now consider the general bivariate case when the joint distribution is not necessarily normal. Assume that the joint distribution of $y = (y_1, y_2)'$ is parameterized by $\theta$. For set $H$, let $\eta_H = \eta_H(\theta)$ be a parametrization for the joint distribution of $(y_1, y_2)$ such that the information matrix, say $I_H(\eta_H)$, is easy to compute. One such parametrization is $\eta_H = (\eta_{H1}', \eta_{H2}')'$, where $\eta_{H1}$ is the parameter vector for the conditional distribution of $y_1$ given $y_2$ and $\eta_{H2}$ is the parameter vector for the marginal distribution of $y_2$. Since the parameters for the conditional distribution are orthogonal to those for the marginal distribution, the parametrization $\eta_H = (\eta_{H1}', \eta_{H2}')'$ results in a block diagonal $I_H(\eta_H)$. For set $K$, let $\eta_K = \eta_K(\theta)$ be a parametrization for the marginal distribution of $y_1$ such that the information matrix, say $I_K(\eta_K)$, is easy to compute. Define $\eta_L$ similarly. The parametrization for the set $H$ need not be the same as that for set $K$ or for set $L$, which provides more flexibility in choosing the parametrization. A separate orthogonal parametrization in each set leads to computational advantages over the direct maximum likelihood method.

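Let $\hat\eta_H$ be the MLE of $\eta_H$ constructed using sample set $H$. Define $\hat\eta_K$ and $\hat\eta_L$ similarly. Let $V_H$, $V_K$, and $V_L$ be the estimated covariance matrices of the MLE’s $\hat\eta_H$, $\hat\eta_K$, and $\hat\eta_L$, respectively, and let $V = \mathrm{diag}(V_H, V_K, V_L)$. Note that $V^{-1} = \mathrm{diag}\{I_H(\eta_H), I_K(\eta_K), I_L(\eta_L)\}$. Because $V$ is a function of $\theta$, we write $V = V(\theta)$. Also, define

$$X(\theta) = \left[\frac{\partial\eta_H'}{\partial\theta}, \frac{\partial\eta_K'}{\partial\theta}, \frac{\partial\eta_L'}{\partial\theta}\right]'$$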


different from $\eta(\hat\theta)$ in general. Using the above notation, the maximum likelihood estimator can be computed iteratively from (9) with $X_{(t)} = X(\hat\theta_{(t)})$ and $\hat V_{(t)} = V(\hat\theta_{(t)})$.

We now show that the procedure in (9) produces a fully efficient estimator of $\theta$ in that it is equivalent to the Newton-Raphson procedure for ML estimation based on the observed likelihood. Let the score function of the observed likelihood be defined as

$$S_{obs}(y; \theta) = \partial\log l_{obs}(\theta)/\partial\theta,$$

where

$$l_{obs}(\theta) = \prod_i \int f(y_i; \theta)\, dy_{i(mis)},$$

with $y_{i(mis)}$ defined to be the missing part of $y_i$, is the observed likelihood function of the parameter $\theta$. The Newton-Raphson method for maximum likelihood estimation can be defined as

$$\hat\theta_{(t+1)} = \hat\theta_{(t)} + \left[I_{obs}\left(\hat\theta_{(t)}\right)\right]^{-1}S_{obs}\left(y; \hat\theta_{(t)}\right), \tag{13}$$

where $I_{obs}(\theta) = E\left[-\partial^2\log l_{obs}(\theta)/\partial\theta\,\partial\theta'\right]$ is the expected information matrix for $\theta$.

We show in Theorem 1 below that the iterations (9) and (13) are identical in that, starting from $\hat\theta_{(t)}$, the two iterations produce identical values for $\hat\theta_{(t+1)}$ for all $t$. Therefore, our procedure gives us a fully efficient estimator of $\theta$. Equivalence between the Gauss-Newton estimator and the maximum likelihood estimator will be established in a more general multivariate situation in Section 2.3. Note that evaluation of the likelihood $l_{obs}(\theta)$ and hence


due to the complexities in evaluating the integral in the observed likelihood. On the other hand, our procedure is easy to implement because it involves evaluation of likelihoods corresponding to only observed parts.

The following theorem establishes equivalence between the Gauss-Newton estimator in (9) and the maximum likelihood estimator in (13).

Theorem 1 The Gauss-Newton estimator (9) is equivalent to the maximum likelihood estimator (13) in that, starting from $\hat\theta_{(t)}$, (9) and (13) give the same value for $\hat\theta_{(t+1)}$.

Proof. See Appendix A.

Note that because of the nature of the Newton-Raphson algorithm, the iteration (9) converges much faster than the usual EM algorithm. Moreover, our procedure directly produces a simple estimator $C$ of the variance of the MLE $\hat\theta$, while the EM algorithm does not give a direct estimate of the variance of $\hat\theta$.

Remark 2 - One-step estimator. Given a suitable choice of the initial estimate $\hat\theta_S$, the one-step estimator

$$\hat\theta = \hat\theta_S + \left(X_S'\hat V_S^{-1}X_S\right)^{-1}X_S'\hat V_S^{-1}\tilde\eta, \tag{14}$$

with $\tilde\eta = \hat\eta - \eta(\hat\theta_S)$ as in (6), can be a very good approximation to the maximum likelihood estimator, where $X_S$ and $\hat V_S$ are evaluated from $X$ and $V$, respectively, using the initial estimator $\hat\theta_S$. The one-step Newton-Raphson estimator (13) using $\sqrt{n}$-consistent initial estimates is asymptotically equivalent to the MLE (Lehmann, 1983, Theorem 3.1). By Theorem 1, the one-step Gauss-Newton estimator (14) is also asymptotically equivalent to the MLE.


2.3 General multivariate case

One advantage of the GLS Gauss-Newton procedure (9) is that it extends easily to a general multivariate case with $p$ variables $y = (y_1, \ldots, y_p)'$. Any general missing data set can be partitioned into, say, $H_1, \ldots, H_q$, mutually disjoint and exhaustive data sets such that, for each data set $H_j$, $j = 1, \ldots, q$, all the elements share the same missing pattern. Therefore, each set $H_j$ can be considered to be a complete data set if only the observed variables are considered.

Let $\theta$ be a parameter vector for which the joint distribution of $y$ is fully indexed and is identified. We choose a parameter vector $\eta_j$ for the joint distribution of the observed variables corresponding to $H_j$ such that the joint distribution is identified and the information matrix, say $I_j(\theta)$, is easy to compute. Let $\hat\eta_j$ be the MLE of $\eta_j$ computed from the data set $H_j$, which can be easily computed because $H_j$ is complete. We have $V_j = \mathrm{var}(\hat\eta_j) = I_j^{-1}(\theta)$. Let $\eta = (\eta_1', \ldots, \eta_q')'$ and let $\hat\eta = (\hat\eta_1', \ldots, \hat\eta_q')'$. Then $V = \mathrm{var}(\hat\eta) = \mathrm{diag}(I_1^{-1}(\theta), \ldots, I_q^{-1}(\theta))$. Letting $X = \partial\eta/\partial\theta'$, with an initial consistent estimator $\hat\theta_S$, the following iteration

$$\hat\theta = \hat\theta_S + \left(X'V^{-1}X\right)^{-1}X'V^{-1}\left(\hat\eta - \eta(\hat\theta_S)\right), \tag{15}$$

defines a one-step Gauss-Newton procedure for ML estimation. The following theorem establishes equivalence of the proposed estimator and the MLE. The proof is a straightforward extension of that of Theorem 1 and thus is skipped for brevity.

Theorem 2 The one-step estimator (15) is equivalent to the maximum


In order to implement our procedure (15), we need to specify $(\theta, \eta, X, V)$ as well as an initial estimator $\hat\theta_S$. Specification of $(\theta, \eta, X, V)$ depends on the data distribution and the missing type. In the following remarks, explicit expressions for $(\theta, \eta, X, V)$ are given for some important cases. These remarks demonstrate that our procedure (15) can be easily implemented for multivariate normal cases and hence multiple regressions with any missing type.

Remark 3 - 3-variate general non-monotone missing case. We give a detailed implementation of the Gauss-Newton procedure (15) for the 3-variate case. Assume that $y = (y_1, y_2, y_3)'$ is jointly normal $N_3(\mu, \Sigma)$, $\mu = (\mu_1, \mu_2, \mu_3)'$, $\Sigma = (\sigma_{ij})$. All possible $7 = 2^3 - 1$ missing cases are displayed in Table 3.

Expressions for $\theta$, $\eta$, $X$ and $V$ are given in Table 3. Parameters $\theta_1$, $\theta_2$, $\theta_3$ correspond to the parameters of the distribution of $y_1$, the conditional distribution of $y_2$ given $y_1$, and the conditional distribution of $y_3$ given $(y_1, y_2)$, respectively. The parameter $\theta = (\theta_1', \theta_2', \theta_3')'$ fully parameterizes the joint distribution of $(y_1, y_2, y_3)'$.

The conditional parameters can be written in the following regression equations

$$\begin{aligned}
y_1 &= \mu_1 + e_1, & e_1 &\sim N(0, \sigma_{11}), \\
y_2 &= \beta_{20\cdot1} + \beta_{21\cdot1}y_1 + e_{2\cdot1}, & e_{2\cdot1} &\sim N(0, \sigma_{22\cdot1}), \\
y_3 &= \beta_{30\cdot12} + \beta_{31\cdot12}y_1 + \beta_{32\cdot12}y_2 + e_{3\cdot12}, & e_{3\cdot12} &\sim N(0, \sigma_{33\cdot12}), \\
y_3 &= \beta_{30\cdot1} + \beta_{31\cdot1}y_1 + e_{3\cdot1}, & e_{3\cdot1} &\sim N(0, \sigma_{33\cdot1}), \\
y_3 &= \beta_{30\cdot2} + \beta_{31\cdot2}y_2 + e_{3\cdot2}, & e_{3\cdot2} &\sim N(0, \sigma_{33\cdot2}),
\end{aligned}$$

in which the regression errors are independent of the regressors.


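definition of parameter $\eta_j$. For example, $\hat\eta_1 = (\hat\mu_{1,1}, \hat\sigma_{11,1})'$ is estimated from $H_1$; $\hat\eta_2 = (\hat\theta_{1,2}', \hat\theta_{2,2}')'$ is estimated from data set $H_2$, where $\hat\theta_{1,2} = (\hat\mu_{1,2}, \hat\sigma_{11,2})'$ is constructed from variable $y_1$ and $\hat\theta_{2,2} = (\hat\beta_{20\cdot1,2}, \hat\beta_{21\cdot1,2}, \hat\sigma_{22\cdot1,2})'$ is constructed from the regression of $y_2$ on $y_1$; $\hat\eta_7 = (\hat\mu_{2,7}, \hat\sigma_{22,7}, \hat\beta_{30\cdot2,7}, \hat\beta_{31\cdot2,7}, \hat\sigma_{33\cdot2,7})'$ is estimated from $H_7$, where $(\hat\mu_{2,7}, \hat\sigma_{22,7})$ is constructed from variable $y_2$ and $(\hat\beta_{30\cdot2,7}, \hat\beta_{31\cdot2,7}, \hat\sigma_{33\cdot2,7})$ is constructed from the regression of $y_3$ on $y_2$. We then have $\hat\eta = (\hat\eta_7', \ldots, \hat\eta_1')'$.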

A simple initial estimator $\hat\theta_S = (\hat\theta_{1S}', \hat\theta_{2S}', \hat\theta_{3S}')'$ can be constructed by averaging the available estimators from data sets $H_1, \ldots, H_7$ as given by

$$\begin{aligned}
\hat\theta_{1S} &= (n_1\hat\theta_{1,1} + n_2\hat\theta_{1,2} + n_3\hat\theta_{1,3} + n_6\hat\theta_{1,6})/(n_1 + n_2 + n_3 + n_6), \\
\hat\theta_{2S} &= (n_2\hat\theta_{2,2} + n_3\hat\theta_{2,3})/(n_2 + n_3), \\
\hat\theta_{3S} &= \hat\theta_{3,3}.
\end{aligned}$$

For evaluation of $\eta$ and $X = \partial\eta/\partial\theta'$, we need expressions for $\eta$ in terms of $\theta$ and their derivatives. This issue is addressed in Remark 4 below. We now have all the materials for implementing (15). Observe that some elements of $\eta_4, \ldots, \eta_7$ are nonlinear functions of $\theta$. Therefore, the $X$ matrix has elements other than 0 or 1, as occurred in the last two columns of the array displayed in (7). For a monotone missing pattern, sets $H_4$ through $H_7$ are empty and the $X$ matrix consists of elements 0 or 1, as in Remark 1.

Remark 4 - Evaluation of η and X. Consider the general $p$-dimensional normal case $N_p(\mu, \Sigma)$, $\mu = (\mu_1, \ldots, \mu_p)'$, $\Sigma = (\sigma_{ij})$. As shown in Remark 3, elements of $\eta$ take one of the following three forms: $\{\theta_j, j = 1, \ldots, p\}$, $\{\mu, \Sigma\}$, or {parameters, say $(\beta_{j0\cdot J}, \beta_{jJ\cdot J}', \sigma_{jj\cdot J})'$, of the regression of $y_j$ on a vector, say $y_J$, a subvector of $(y_1, y_2, \ldots, y_{j-1})'$ such that $y_j = \beta_{j0\cdot J} + \beta_{jJ\cdot J}'y_J + e_{j\cdot J}$, j


Table 3. 3-dimensional normal case: non-monotone missing.

y1   y2   y3   Data set   Size   Estimable parameters                       Asymptotic variance
O    M    M    H1         n1     η1 = θ1                                    W1 = diag(σ11, 2σ11²)
O    O    M    H2         n2     η2 = (θ1′, θ2′)′                           W2 = diag(W1, Σ22.1, 2σ22.1²)
O    O    O    H3         n3     η3 = (θ1′, θ2′, θ3′)′                      W3 = diag(W2, Σ33.12, 2σ33.12²)
M    M    O    H4         n4     η4 = (µ3, σ33)′                            W4 = diag(σ33, 2σ33²)
M    O    M    H5         n5     η5 = (µ2, σ22)′                            W5 = diag(σ22, 2σ22²)
O    M    O    H6         n6     η6 = (θ1′, β30.1, β31.1, σ33.1)′           W6 = diag(W1, Σ33.1, 2σ33.1²)
M    O    O    H7         n7     η7 = (µ2, σ22, β30.2, β31.2, σ33.2)′       W7 = diag(σ22, 2σ22², Σ33.2, 2σ33.2²)

O: observed, M: missing, θ1 = (µ1, σ11)′, θ2 = (β20·1, β21·1, σ22·1)′, θ3 = (β30·12, β31·12, β32·12, σ33·12)′,
µ = (µ1, µ2, µ3)′ and Σ = (σij) are computed from θ1, θ2, θ3 using the recursion in Remark 4 below,
β30.1 = µ3 − β31.1µ1, β31.1 = σ13/σ11, σ33.1 = σ33 − β31.1²σ11,
β30.2 = µ3 − β31.2µ2, β31.2 = σ23/σ22, σ33.2 = σ33 − β31.2²σ22,
Σ22.1 = {E[(1, y1)′(1, y1)]}⁻¹σ22.1, Σ33.12 = {E[(1, y1, y2)′(1, y1, y2)]}⁻¹σ33.12,
Σ33.1 = {E[(1, y1)′(1, y1)]}⁻¹σ33.1, Σ33.2 = {E[(1, y2)′(1, y2)]}⁻¹σ33.2,
η = (η7′, η6′, ..., η1′)′, θ = η3, V = diag(W7/n7, W6/n6, ..., W1/n1), X = ∂η/∂θ′.

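for the parameters $\mu$, $\Sigma$ and $(\beta_{j0\cdot J}, \beta_{jJ\cdot J}', \sigma_{jj\cdot J})'$ in terms of the conditional parameter $\theta = (\theta_1', \ldots, \theta_p')'$. Using a regression expression

$$y_{j+1} = \beta_{(j+1)0\cdot12\ldots j} + \beta_{(j+1)1\cdot12\ldots j}\,y_1 + \cdots + \beta_{(j+1)j\cdot12\ldots j}\,y_j + e_{(j+1)\cdot12\ldots j},$$

we get, for $j = 0, 1, 2, \ldots, p-1$,

$$\mu_{j+1} = E(y_{j+1}) = \beta_{(j+1)0\cdot12\ldots j} + \beta_{(j+1)1\cdot12\ldots j}\,\mu_1 + \cdots + \beta_{(j+1)j\cdot12\ldots j}\,\mu_j,$$

$$\sigma_{i,j+1} = \mathrm{cov}(y_i, y_{j+1}) = \beta_{(j+1)1\cdot12\ldots j}\,\sigma_{i1} + \cdots + \beta_{(j+1)j\cdot12\ldots j}\,\sigma_{ij}, \quad i = 1, 2, \ldots, j,$$

and

$$\sigma_{j+1,j+1} = \mathrm{var}(y_{j+1}) = \sigma_{(j+1)(j+1)\cdot12\ldots j} + \sum_{k=1}^{j}\sum_{\ell=1}^{j}\beta_{(j+1)k\cdot12\ldots j}\,\beta_{(j+1)\ell\cdot12\ldots j}\,\sigma_{k\ell}.$$

Note that, in the above three equations, $(\mu_{j+1}, \sigma_{1(j+1)}, \sigma_{2(j+1)}, \ldots, \sigma_{(j+1)(j+1)})$ is expressed in terms of the conditional parameter $\theta_{j+1} = (\beta_{(j+1)0\cdot12\ldots j}, \beta_{(j+1)1\cdot12\ldots j}, \ldots, \beta_{(j+1)j\cdot12\ldots j}, \sigma_{(j+1)(j+1)\cdot12\ldots j})'$ and $\theta_j, \theta_{j-1}, \ldots, \theta_1$. Therefore, recursive evaluation of these three equations for $j = 0, 1, \ldots, p-1$ with initial values $\beta_{10\cdot} = \mu_1$ and $\sigma_{11\cdot} = \sigma_{11}$ gives the required expression for the marginal parameters $(\mu_j, \sigma_{1j}, \ldots, \sigma_{(j-1)j}, \sigma_{jj})$, $j = 1, \ldots, p$, in terms of the conditional parameters $\theta_j, \theta_{j-1}, \ldots, \theta_1$.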
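The recursion for $\mu$ and $\Sigma$ is short to code. The following Python sketch assumes each conditional parameter $\theta_{j+1}$ is supplied as a sequence (intercept, slopes on $y_1, \ldots, y_j$, residual variance), with $\theta_1 = (\mu_1, \sigma_{11})$; the function name `marginal_from_conditional` is a hypothetical choice for illustration.

```python
import numpy as np

def marginal_from_conditional(thetas):
    """Recover (mu, Sigma) from the conditional parameters.

    thetas[j] = (intercept, slopes on y_1..y_j, residual variance) for the
    regression of y_{j+1} on (y_1, ..., y_j); thetas[0] = (mu_1, sigma_11)."""
    p = len(thetas)
    mu = np.zeros(p)
    Sigma = np.zeros((p, p))
    for j, th in enumerate(thetas):
        th = np.asarray(th, dtype=float)
        b0, b, s_res = th[0], th[1:-1], th[-1]
        mu[j] = b0 + b @ mu[:j]            # E(y_{j+1})
        cov = Sigma[:j, :j] @ b            # cov(y_i, y_{j+1}), i = 1..j
        Sigma[:j, j] = cov
        Sigma[j, :j] = cov
        Sigma[j, j] = s_res + b @ Sigma[:j, :j] @ b  # var(y_{j+1})
    return mu, Sigma
```

For $p = 2$ the loop reproduces $\mu_2 = \beta_{20\cdot1} + \beta_{21\cdot1}\mu_1$, $\sigma_{12} = \beta_{21\cdot1}\sigma_{11}$ and $\sigma_{22} = \sigma_{22\cdot1} + \beta_{21\cdot1}^2\sigma_{11}$.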

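Partial derivatives are recursively computed as follows: for $j = 0, 1, \ldots, p-1$,

$$\frac{\partial\mu_{j+1}}{\partial\theta_t} = \begin{cases}
\sum_{\ell=1}^{j}\beta_{(j+1)\ell\cdot12\ldots j}\,\partial\mu_\ell/\partial\theta_t & \text{if } t = 1, \ldots, j, \\
(1, \mu_1, \ldots, \mu_j, 0)' & \text{if } t = j+1, \\
0 & \text{if } t = j+2, j+3, \ldots, p,
\end{cases}$$

$$\frac{\partial\sigma_{i,j+1}}{\partial\theta_t} = \beta_{(j+1)1\cdot12\ldots j}\frac{\partial\sigma_{i1}}{\partial\theta_t} + \cdots + \beta_{(j+1)j\cdot12\ldots j}\frac{\partial\sigma_{ij}}{\partial\theta_t}, \quad i = 1, 2, \ldots, j,\ t = 1, 2, \ldots, j,$$

$$\frac{\partial\sigma_{j+1,j+1}}{\partial\theta_t} = \sum_{k=1}^{j}\sum_{\ell=1}^{j}\beta_{(j+1)k\cdot12\ldots j}\,\beta_{(j+1)\ell\cdot12\ldots j}\frac{\partial\sigma_{k\ell}}{\partial\theta_t}, \quad t = 1, 2, \ldots, j,$$

$$\frac{\partial\sigma_{i,j+1}}{\partial\theta_{j+1}} = (0, \sigma_{i1}, \ldots, \sigma_{ij}, 0)', \quad i = 1, \ldots, j,$$

$$\frac{\partial\sigma_{j+1,j+1}}{\partial\theta_{j+1}} = \left(0,\ 2\sum_{k=1}^{j}\beta_{(j+1)k\cdot12\ldots j}\sigma_{k1},\ 2\sum_{k=1}^{j}\beta_{(j+1)k\cdot12\ldots j}\sigma_{k2},\ \ldots,\ 2\sum_{k=1}^{j}\beta_{(j+1)k\cdot12\ldots j}\sigma_{kj},\ 1\right)',$$

$$\frac{\partial\sigma_{i,j+1}}{\partial\theta_t} = 0, \quad i = 1, 2, \ldots, j+1,\ t = j+2, j+3, \ldots, p.$$

The residual variance $\sigma_{(j+1)(j+1)\cdot12\ldots j}$ is an element of $\theta_{j+1}$ only, so it does not contribute to $\partial\sigma_{j+1,j+1}/\partial\theta_t$ for $t \le j$.

Given these expressions for $\mu$, $\Sigma$ and their partial derivatives in terms of $\theta$, it is straightforward to compute $(\beta_{j0\cdot J}, \beta_{jJ\cdot J}', \sigma_{jj\cdot J})'$ and the corresponding derivatives because the regression parameters are simple functions of $\mu$ and $\Sigma$.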

Remark 5 - A 4-variate non-monotone case. In Table 4, expressions

for θ, η, X and V are given for a 4-dimensional case with a specific missing pattern.


Table 4. A 4-dimensional normal case - non-monotone missing.

y1   y2   y3   y4   Data set   Size   Estimable parameters          Asymptotic variance
O    M    M    M    H1         n1     η1 = (µ1, σ11)′               W1 = diag(σ11, 2σ11²)
M    O    M    M    H2         n2     η2 = (µ2, σ22)′               W2 = diag(σ22, 2σ22²)
M    M    O    M    H3         n3     η3 = (µ3, σ33)′               W3 = diag(σ33, 2σ33²)
M    M    M    O    H4         n4     η4 = (µ4, σ44)′               W4 = diag(σ44, 2σ44²)
O    O    O    O    H5         n5     η5 = (θ1′, θ2′, θ3′, θ4′)′    W5, see below

θ1 = (µ1, σ11)′, θ2 = (β20·1, β21·1, σ22·1)′, θ3 = (β30·12, β31·12, β32·12, σ33·12)′, θ4 = (β40·123, β41·123, β42·123, β43·123, σ44·123)′,
µ = (µ1, ..., µ4)′ and Σ = (σij) are computed from θ1, ..., θ4 using the recursion in Remark 4,
W5 = diag(σ11, 2σ11², Σ22.1, 2σ22.1², Σ33.12, 2σ33.12², Σ44.123, 2σ44.123²),
Σ22.1 = {E[(1, y1)′(1, y1)]}⁻¹σ22.1, Σ33.12 = {E[(1, y1, y2)′(1, y1, y2)]}⁻¹σ33.12, Σ44.123 = {E[(1, y1, y2, y3)′(1, y1, y2, y3)]}⁻¹σ44.123,
η = (η5′, η4′, ..., η1′)′, θ = η5, V = diag(W5/n5, W4/n4, ..., W1/n1), X = ∂η/∂θ′

3 Efficiency comparison

We compare efficiencies of estimators constructed from different combinations of data sets H, K, L for the bivariate normal case. Under the non-monotone missing pattern, we can compute the following four types of the estimates.

1. $\hat\theta_H$: the maximum likelihood estimator using the samples in set $H$.

2. $\hat\theta_{HK}$: the maximum likelihood estimator using the samples in $H \cup K$.

3. $\hat\theta_{HL}$: the maximum likelihood estimator using the samples in $H \cup L$.

4. $\hat\theta_{HKL}$: the maximum likelihood estimator using the whole sample.

By Theorem 1, the Gauss-Newton estimator (9) is asymptotically equal to $\hat\theta_{HKL}$. Write

$$X' = [X_H' \ X_K' \ X_L'],$$

where $X_H'$ is the left $5 \times 5$ submatrix of $X'$, $X_K'$ is the $5 \times 2$ submatrix in the middle of $X'$, and $X_L'$ is the $5 \times 2$ submatrix on the right side of $X'$. Similarly,


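Using the arguments in the proof of Theorem 1, the asymptotic variance of $\hat\theta_H$ is $\left(X_H'\hat V_H^{-1}X_H\right)^{-1}$. Similarly, we have

$$\begin{aligned}
Var\left(\hat\theta_{HL}\right) &= \left(X_H'\hat V_H^{-1}X_H + X_L'\hat V_L^{-1}X_L\right)^{-1}, \\
Var\left(\hat\theta_{HK}\right) &= \left(X_H'\hat V_H^{-1}X_H + X_K'\hat V_K^{-1}X_K\right)^{-1}, \\
Var\left(\hat\theta_{HKL}\right) &= \left(X_H'\hat V_H^{-1}X_H + X_K'\hat V_K^{-1}X_K + X_L'\hat V_L^{-1}X_L\right)^{-1}.
\end{aligned}$$

Using matrix algebra such as

$$\left(X_H'\hat V_H^{-1}X_H + X_L'\hat V_L^{-1}X_L\right)^{-1} = \left(X_H'\hat V_H^{-1}X_H\right)^{-1} - \left(X_H'\hat V_H^{-1}X_H\right)^{-1}X_L'\left[\hat V_L + X_L\left(X_H'\hat V_H^{-1}X_H\right)^{-1}X_L'\right]^{-1}X_L\left(X_H'\hat V_H^{-1}X_H\right)^{-1},$$

we can derive expressions for the variances of the estimators.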
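The information-sum expressions above can also be evaluated numerically for any combination of the sets. The Python sketch below does so for the bivariate normal case; the function names `set_information` and `asymptotic_cov`, the dictionary `n` of sample sizes, and the ordering $\theta = (\beta_{20\cdot1}, \beta_{21\cdot1}, \sigma_{22\cdot1}, \mu_1, \sigma_{11})$ are assumptions made for illustration.

```python
import numpy as np

def set_information(theta, n):
    """Information X_s' V_s^{-1} X_s contributed by each of the sets H, K, L.

    n is a dict with the sample sizes, e.g. {"H": 40, "K": 10, "L": 20}."""
    b20, b21, s22_1, mu1, s11 = theta
    s22 = s22_1 + b21 ** 2 * s11
    # Set H: X_H is the identity and V_H = W_H / n_H is block diagonal.
    WH = np.zeros((5, 5))
    WH[:2, :2] = s22_1 / s11 * np.array([[s11 + mu1 ** 2, -mu1], [-mu1, 1.0]])
    WH[2, 2], WH[3, 3], WH[4, 4] = 2 * s22_1 ** 2, s11, 2 * s11 ** 2
    # Set K observes (mu1, sigma11); set L observes (mu2, sigma22).
    XK = np.array([[0, 0, 0, 1.0, 0], [0, 0, 0, 0, 1.0]])
    WK = np.diag([s11, 2 * s11 ** 2])
    XL = np.array([[1.0, mu1, 0, b21, 0],
                   [0, 2 * b21 * s11, 1.0, 0, b21 ** 2]])
    WL = np.diag([s22, 2 * s22 ** 2])
    return {
        "H": n["H"] * np.linalg.inv(WH),
        "K": n["K"] * XK.T @ np.linalg.inv(WK) @ XK,
        "L": n["L"] * XL.T @ np.linalg.inv(WL) @ XL,
    }

def asymptotic_cov(theta, n, sets):
    """Asymptotic covariance of the MLE that uses the listed sets, e.g. 'HKL'."""
    info = set_information(theta, n)
    return np.linalg.inv(sum(info[s] for s in sets))
```

For example, `asymptotic_cov(theta, n, "HL")[1, 1]` is the asymptotic variance of the slope estimator based on $H \cup L$.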

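For estimates of the slope parameter, the asymptotic variances are

$$\begin{aligned}
Var(\hat\beta_{21\cdot1,HK}) &= \frac{\sigma_{22\cdot1}}{\sigma_{11}n_H} = Var(\hat\beta_{21\cdot1,H}), \\
Var(\hat\beta_{21\cdot1,HL}) &= \frac{\sigma_{22\cdot1}}{\sigma_{11}n_H}\left\{1 - 2p_L\rho^2\left(1 - \rho^2\right)\right\}, \\
Var(\hat\beta_{21\cdot1,HKL}) &= \frac{\sigma_{22\cdot1}}{\sigma_{11}n_H}\left\{1 - \frac{2p_L\rho^2(1 - \rho^2)}{1 - p_Lp_K\rho^4}\right\},
\end{aligned}$$

where $\rho^2 = \sigma_{12}^2/(\sigma_{11}\sigma_{22})$, $p_K = n_K/(n_H + n_K)$ and $p_L = n_L/(n_H + n_L)$. See Appendix B for derivations of these variances and other variances below. Thus, we have

$$Var(\hat\beta_{21\cdot1,H}) = Var(\hat\beta_{21\cdot1,HK}) \ge Var(\hat\beta_{21\cdot1,HL}) \ge Var(\hat\beta_{21\cdot1,HKL}). \tag{16}$$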

Here strict inequalities generally hold except for special trivial cases. Note that the asymptotic variance of $\hat\beta_{21\cdot1,HK}$ is the same as the variance of $\hat\beta_{21\cdot1,H}$, which implies that there is no gain of efficiency by adding set $K$ (missing $y_2$) to $H$. On the other hand, by comparing $Var(\hat\beta_{21\cdot1,HL})$ with $Var(\hat\beta_{21\cdot1,H})$, we observe an efficiency gain by adding set $L$ (missing $y_1$) to $H$. This analysis is similar to the results of Little (1992), who summarized statistical results for regression with missing X’s whose data sets are $H$ and $L$. Little (1992) did not include the cases of missing $y_2$’s because the data set $K$ with missing $y_2$ does not contain additional information for estimating the regression parameter. It is interesting to observe that even though adding $K$ (the data set with missing $y_2$) to $H$ does not improve the efficiency of the regression parameter estimate, i.e., $Var(\hat\beta_{21\cdot1,H}) = Var(\hat\beta_{21\cdot1,HK})$, adding $K$ to $(H, L)$ does improve the efficiency, i.e., $Var(\hat\beta_{21\cdot1,HL}) > Var(\hat\beta_{21\cdot1,HKL})$.

Using these expressions, we can investigate the variance reduction of $\hat\beta_{21\cdot1,HKL}$ over $\hat\beta_{21\cdot1,HL}$. For example, we can say that the relative efficiency of $\hat\beta_{21\cdot1,HKL}$ over $\hat\beta_{21\cdot1,HL}$, defined as $Var(\hat\beta_{21\cdot1,HL})/Var(\hat\beta_{21\cdot1,HKL})$, is larger for larger values of $p_K$, $p_L$, or $\rho$. If $p_L = p_K = 0.5$ and $\rho = 0.5$, the relative efficiency value is 1.0037. If $p_L = p_K = 0.9$ and $\rho = 0.9$, the relative efficiency value is 1.768. For the other parameters of the conditional distribution, relationships similar to (16) hold.
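The slope variance formulas can be evaluated with a few lines of Python. In the sketch below the function names `slope_variance_factors` and `relative_efficiency` are hypothetical; each factor multiplies the common term $\sigma_{22\cdot1}/(\sigma_{11}n_H)$. Under these formulas, the two values 1.0037 and 1.768 quoted above are reproduced (to the quoted precision) by the ratio $Var(\hat\beta_{21\cdot1,HL})/Var(\hat\beta_{21\cdot1,HKL})$.

```python
def slope_variance_factors(pK, pL, rho):
    """Factors multiplying sigma22.1 / (sigma11 * nH) in the asymptotic
    variances of the slope estimators based on H, HK, HL and HKL."""
    r2 = rho ** 2
    gain = 2.0 * pL * r2 * (1.0 - r2)
    return {
        "H": 1.0,
        "HK": 1.0,
        "HL": 1.0 - gain,
        "HKL": 1.0 - gain / (1.0 - pL * pK * r2 ** 2),
    }

def relative_efficiency(pK, pL, rho, base="HL", full="HKL"):
    """Variance of the `base` estimator divided by that of the `full` one."""
    f = slope_variance_factors(pK, pL, rho)
    return f[base] / f[full]

print(relative_efficiency(0.5, 0.5, 0.5))
print(relative_efficiency(0.9, 0.9, 0.9))
```

Passing `base="HK"` instead gives the gain of the full-sample estimator over the one that ignores set $L$.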

For the marginal parameters, we have

V ar(ˆµ1,HK) = σ11 nH (1 − pK) , V ar(ˆµ1,HL) = σ11 nH ¡ 1 − pLρ2 ¢ , V ar(ˆµ1,HKL) = σ11 nH ( (1 − pK) −pLρ 2_{(1 − p} K)2 (1 − pLpKρ2) ) ,

(22)

and V ar(ˆσ11,HK) = 2σ 2 11 nH (1 − pK) , V ar(ˆσ11,HL) = 2σ2 11 nH ¡ 1 − pLρ4 ¢ , V ar(ˆσ11,HKL) = 2σ2 11 nH ( (1 − pK) − pLρ4(1 − pK)2 1 − pLpKρ4 ) .

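Note that the efficiency of the marginal parameters $(\mu_1, \sigma_{11})$ of $y_1$ improves if additional data for $y_2$ with $y_1$ missing are provided. In particular, if $n_K = n_L$, then

$$
Var(\hat{\mu}_{1,H}) \geq Var(\hat{\mu}_{1,HL}) \geq Var(\hat{\mu}_{1,HK}) \geq Var(\hat{\mu}_{1,HKL})
$$

and

$$
Var(\hat{\sigma}_{11,H}) \geq Var(\hat{\sigma}_{11,HL}) \geq Var(\hat{\sigma}_{11,HK}) \geq Var(\hat{\sigma}_{11,HKL}).
$$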
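The ordering above can be checked numerically. The sketch below is an illustration only, with hypothetical function names. It evaluates the variance factors relative to the complete-case set H, taking $r = \rho^2$ for $\mu_1$ and $r = \rho^4$ for $\sigma_{11}$, and it assumes the factor for H alone is 1 (that is, $Var(\hat{\mu}_{1,H}) = \sigma_{11}/n_H$ and $Var(\hat{\sigma}_{11,H}) = 2\sigma_{11}^2/n_H$).

```python
def marginal_var_factors(p_K, p_L, r):
    """Variance factors relative to set H for a marginal parameter of y1.

    Use r = rho**2 for mu_1 and r = rho**4 for sigma_11.
    """
    return {
        "H": 1.0,
        "HL": 1.0 - p_L * r,
        "HK": 1.0 - p_K,
        "HKL": (1.0 - p_K) - p_L * r * (1.0 - p_K) ** 2 / (1.0 - p_L * p_K * r),
    }


def is_ordered(factors):
    """True when Var(H) >= Var(HL) >= Var(HK) >= Var(HKL)."""
    return factors["H"] >= factors["HL"] >= factors["HK"] >= factors["HKL"]


if __name__ == "__main__":
    # n_K = n_L implies p_K = p_L; scan a small grid of settings.
    for p in (0.1, 0.5, 0.9):
        for rho in (0.1, 0.5, 0.9):
            mu_ok = is_ordered(marginal_var_factors(p, p, rho ** 2))
            sig_ok = is_ordered(marginal_var_factors(p, p, rho ** 4))
            print(p, rho, mu_ok, sig_ok)
```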

4 A Numerical Example

For a numerical example, we consider the data set adapted from Bishop, Fienberg and Holland (1975, Table 1.4-2). Table 5 gives the data for a $2^3$ table of three categorical variables ($Y_1$ = Clinic, $Y_2$ = Prenatal care, $Y_3$ = Survival) with one supplemental margin for $Y_2$ and $Y_3$ and another supplemental margin for $Y_1$ and $Y_3$. In this setup, the $Y_i$ are all dichotomous, taking the value 0 or 1, and 8 parameters can be defined as $\pi_{ijk} = Pr(Y_1 = i, Y_2 = j, Y_3 = k)$, $i = 0, 1$; $j = 0, 1$; $k = 0, 1$.

For the orthogonal parametrization, we use

$$
\eta_H = \left( \pi_{1|11}, \pi_{1|10}, \pi_{1|01}, \pi_{1|00}, \pi_{+1|1}, \pi_{+1|0}, \pi_{++1} \right)',
$$

where $\pi_{i|jk} = Pr(y_1 = i \mid y_2 = j, y_3 = k)$, $\pi_{+j|k} = Pr(y_2 = j \mid y_3 = k)$, and $\pi_{++k} = Pr(y_3 = k)$. We also set $\theta \equiv (\theta_1, \theta_2, \theta_3, \theta_4, \theta_5, \theta_6, \theta_7) = \eta_H$. Note that the validity of the proposed method does not depend on the choice of the parametrization. A suitable parametrization will make the computation of the information matrix simple.

Table 5. A $2^3$ table with supplemental margins ("-" marks a variable that is not observed in the set)

Set   y1   y2   y3   Count
H     1    1    1    293
H     1    0    1    176
H     0    1    1     23
H     0    0    1    197
H     1    1    0      4
H     1    0    0      3
H     0    1    0      2
H     0    0    0     17
K     1    -    1    100
K     0    -    1     82
K     1    -    0      5
K     0    -    0      6
L     -    1    1     90
L     -    0    1    150
L     -    1    0      5
L     -    0    0     10

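From the data in Table 5, we can obtain 13 observations for 7 parameters. The observation vector can be written $\hat{\eta} = (\hat{\eta}_H', \hat{\eta}_K', \hat{\eta}_L')'$, where

$$
\begin{aligned}
\hat{\eta}_H &= (293/316, \; 4/6, \; 176/373, \; 3/20, \; 316/689, \; 6/26, \; 689/715)', \\
\hat{\eta}_K &= \left( \hat{\pi}_{1|+1,K}, \hat{\pi}_{1|+0,K}, \hat{\pi}_{++1,K} \right)' = (100/182, \; 5/11, \; 182/193)', \\
\hat{\eta}_L &= \left( \hat{\pi}_{+1|1,L}, \hat{\pi}_{+1|0,L}, \hat{\pi}_{++1,L} \right)' = (90/240, \; 5/15, \; 240/255)',
\end{aligned}
$$

with the expectations $\eta(\theta) = (\eta_H', \eta_K', \eta_L')'$, where $\eta_H = \theta$,

$$
\begin{aligned}
\eta_K &= \left( \pi_{1|11}\pi_{+1|1} + \pi_{1|01}\pi_{+0|1}, \; \pi_{1|10}\pi_{+1|0} + \pi_{1|00}\pi_{+0|0}, \; \pi_{++1} \right)' \\
&= \left( \theta_1\theta_5 + \theta_3(1 - \theta_5), \; \theta_2\theta_6 + \theta_4(1 - \theta_6), \; \theta_7 \right)',
\end{aligned}
$$

and $\eta_L = (\pi_{+1|1}, \pi_{+1|0}, \pi_{++1})' = (\theta_5, \theta_6, \theta_7)'$, and the variance-covariance matrix $V = diag\{V_H/n_H, V_K/n_K, V_L/n_L\}$, where

$$
\begin{aligned}
V_H &= diag\{\theta_1(1 - \theta_1), \theta_2(1 - \theta_2), \ldots, \theta_7(1 - \theta_7)\}, \\
V_K &= diag\{\pi_{1|+1}(1 - \pi_{1|+1}), \; \pi_{1|+0}(1 - \pi_{1|+0}), \; \pi_{++1}(1 - \pi_{++1})\}, \\
V_L &= diag\{\theta_5(1 - \theta_5), \theta_6(1 - \theta_6), \theta_7(1 - \theta_7)\}.
\end{aligned}
$$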
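The 13 observed proportions can be recomputed from the cell counts in Table 5. The sketch below is illustrative: the dictionary layout and the helper `prop` are hypothetical choices, and exact fractions are used so that the results can be compared with the vectors $\hat{\eta}_H$, $\hat{\eta}_K$ and $\hat{\eta}_L$ without rounding.

```python
from fractions import Fraction

# Cell counts from Table 5.
# H is keyed by (y1, y2, y3); K by (y1, y3), y2 missing; L by (y2, y3), y1 missing.
H = {(1, 1, 1): 293, (1, 0, 1): 176, (0, 1, 1): 23, (0, 0, 1): 197,
     (1, 1, 0): 4, (1, 0, 0): 3, (0, 1, 0): 2, (0, 0, 0): 17}
K = {(1, 1): 100, (0, 1): 82, (1, 0): 5, (0, 0): 6}
L = {(1, 1): 90, (0, 1): 150, (1, 0): 5, (0, 0): 10}


def prop(table, event, given):
    """Conditional proportion: count(event and given) / count(given)."""
    den = sum(c for cell, c in table.items() if given(cell))
    num = sum(c for cell, c in table.items() if given(cell) and event(cell))
    return Fraction(num, den)


# eta_H = (pi_{1|11}, pi_{1|10}, pi_{1|01}, pi_{1|00}, pi_{+1|1}, pi_{+1|0}, pi_{++1})
eta_H = (
    [prop(H, lambda c: c[0] == 1, lambda c: (c[1], c[2]) == (j, k))
     for j, k in [(1, 1), (1, 0), (0, 1), (0, 0)]]
    + [prop(H, lambda c: c[1] == 1, lambda c: c[2] == k) for k in (1, 0)]
    + [prop(H, lambda c: c[2] == 1, lambda c: True)]
)

# eta_K = (pi_{1|+1}, pi_{1|+0}, pi_{++1}) estimated from set K.
eta_K = [prop(K, lambda c: c[0] == 1, lambda c: c[1] == k) for k in (1, 0)]
eta_K.append(prop(K, lambda c: c[1] == 1, lambda c: True))

# eta_L = (pi_{+1|1}, pi_{+1|0}, pi_{++1}) estimated from set L.
eta_L = [prop(L, lambda c: c[0] == 1, lambda c: c[1] == k) for k in (1, 0)]
eta_L.append(prop(L, lambda c: c[1] == 1, lambda c: True))

if __name__ == "__main__":
    for name, vec in [("eta_H", eta_H), ("eta_K", eta_K), ("eta_L", eta_L)]:
        print(name, [float(v) for v in vec])
```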


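The Gauss-Newton method as in (9) can be used to solve the nonlinear model of three parameters, where the initial estimator of $\theta$ is

$$
\hat{\theta}_S = (293/316, \; 4/6, \; 176/373, \; 3/20, \; 406/929, \; 11/41, \; 1111/1163)'
$$

and the X matrix is

$$
X = \left( \begin{array}{ccccccccccccc}
1 & 0 & 0 & 0 & 0 & 0 & 0 & \theta_5 & 0 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & \theta_6 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 & 0 & 1 - \theta_5 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 1 - \theta_6 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 & 0 & \theta_1 - \theta_3 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & \theta_2 - \theta_4 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 1 & 0 & 0 & 1
\end{array} \right)'.
$$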

The resulting one-step estimates are

$$
\hat{\theta}_1 = (0.923, \; 0.678, \; 0.454, \; 0.168, \; 0.426, \; 0.272, \; 0.955).
$$

Standard errors of the estimated values are computed from (10) and are

$$
\hat{V}^{1/2}(\hat{\theta}_1) = diag(0.0098, \; 0.0173, \; 0.0178, \; 0.0134, \; 0.0155, \; 0.0140, \; 0.00606).
$$

On the other hand, the standard errors of the initial parameter estimates are

$$
\hat{V}^{1/2}(\hat{\theta}_S) = diag(0.0097, \; 0.0176, \; 0.0187, \; 0.0134, \; 0.0159, \; 0.0142, \; 0.00606).
$$

Note that there is no efficiency gain for $\hat{\pi}_{++1}$ because $y_3$ is fully observed throughout the sample.
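The one-step computation for this example can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' code: it assumes that the update in (9) has the generalized least squares form $\hat{\theta}_S + (X'\hat{V}^{-1}X)^{-1}X'\hat{V}^{-1}\{\hat{\eta} - \eta(\hat{\theta}_S)\}$ suggested by Appendix A, that the standard errors come from the diagonal of $(X'\hat{V}^{-1}X)^{-1}$, and that $X$ and $\hat{V}$ are evaluated at $\hat{\theta}_S$. All function names are hypothetical, and the output need not agree with the reported figures in every digit.

```python
from math import sqrt

n_H, n_K, n_L = 715, 193, 255  # set sizes from Table 5

# Observed proportions (eta_hat) and the initial estimate theta_S from the text.
eta_hat = [293/316, 4/6, 176/373, 3/20, 316/689, 6/26, 689/715,
           100/182, 5/11, 182/193,
           90/240, 5/15, 240/255]
theta_S = [293/316, 4/6, 176/373, 3/20, 406/929, 11/41, 1111/1163]


def eta(t):
    """Expectation eta(theta) of the 13 observations."""
    t1, t2, t3, t4, t5, t6, t7 = t
    return list(t) + [t1*t5 + t3*(1 - t5), t2*t6 + t4*(1 - t6), t7, t5, t6, t7]


def jacobian(t):
    """The 13 x 7 matrix X of derivatives of eta with respect to theta."""
    t1, t2, t3, t4, t5, t6, t7 = t
    X = [[float(i == j) for j in range(7)] for i in range(7)]
    X += [[t5, 0, 1 - t5, 0, t1 - t3, 0, 0],
          [0, t6, 0, 1 - t6, 0, t2 - t4, 0],
          [0, 0, 0, 0, 0, 0, 1],
          [0, 0, 0, 0, 1, 0, 0],
          [0, 0, 0, 0, 0, 1, 0],
          [0, 0, 0, 0, 0, 0, 1]]
    return X


def weights(t):
    """Diagonal of V^{-1}, with V = diag{V_H/n_H, V_K/n_K, V_L/n_L}."""
    sizes = [n_H] * 7 + [n_K] * 3 + [n_L] * 3
    return [m / (p * (1 - p)) for p, m in zip(eta(t), sizes)]


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x


def one_step(t0):
    """One Gauss-Newton (GLS) step from t0, with standard errors."""
    X, w, e0 = jacobian(t0), weights(t0), eta(t0)
    rows, cols = range(13), range(7)
    A = [[sum(w[i] * X[i][a] * X[i][c] for i in rows) for c in cols] for a in cols]
    g = [sum(w[i] * X[i][a] * (eta_hat[i] - e0[i]) for i in rows) for a in cols]
    t1 = [a + d for a, d in zip(t0, solve(A, g))]
    # Standard errors: square roots of the diagonal of (X' V^{-1} X)^{-1}.
    se = [sqrt(solve(A, [float(a == c) for c in cols])[a]) for a in cols]
    return t1, se


if __name__ == "__main__":
    theta_1, se_1 = one_step(theta_S)
    print([round(v, 3) for v in theta_1])
    print([round(v, 5) for v in se_1])
```

Because $\theta_7$ enters only the three observations of $\pi_{++1}$, its update decouples from the other parameters and stays at the pooled value 1111/1163, which is consistent with the remark that there is no efficiency gain for $\hat{\pi}_{++1}$.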

5 Simulation Study

To test our theory with finite sample sizes, we perform a limited simulation. For the simulated data set, we generate B = 10,000 samples of size n from the population

$$
x \overset{iid}{\sim} N(1, 1), \qquad y_1 \overset{iid}{\sim} N(0, 1), \qquad y_2 = -0.1 + 0.7 y_1 + e,
$$

where $e \sim N(0, 0.7^2)$. We use two sample sizes, n = 100 and n = 500. Variable x is always observed and variables $y_1$ and $y_2$ are subject to missingness. The response probability for $y_1$ follows a logistic regression model such that $logit\{Pr(y_1 \text{ is observed})\} = x$, the response probability for $y_2$ follows a logistic regression model such that $logit\{Pr(y_2 \text{ is observed})\} = 0.7x$, and the two responses are independent. The one-step Gauss-Newton (GN) estimation and the EM estimation are compared. The estimates from the EM algorithm are computed after 10 iterations with the same initial values as for the one-step GN estimator.

Table 6. Monte Carlo variance of the point estimators under two different estimation schemes, based on 10,000 Monte Carlo samples.

Sample size   Parameter   EM estimation   GN estimation   True variance
100           µ1          .0124           .0124           .0122
100           µ2          .0132           .0132           .0130
100           σ11         .0271           .0244           .0256
100           σ22         .0277           .0275           .0276
100           σ12         .0193           .0188           .0196
500           µ1          .00242          .00242          .00244
500           µ2          .00262          .00262          .00260
500           σ11         .00544          .00515          .00511
500           σ22         .00551          .00550          .00551
500           σ12         .00400          .00396          .00392
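The data-generating mechanism described above can be sketched as follows. This is an illustrative reading rather than the authors' simulation code: it takes the second response model to refer to $y_2$, treats 0.7 as the standard deviation of $e$, and uses hypothetical function names. The label `"M"` for units with neither variable observed is also introduced here for illustration.

```python
import math
import random


def expit(z):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-z))


def simulate_sample(n, rng, error_sd=0.7):
    """Draw one sample of size n; a missing value is stored as None."""
    rows = []
    for _ in range(n):
        x = rng.gauss(1.0, 1.0)
        y1 = rng.gauss(0.0, 1.0)
        y2 = -0.1 + 0.7 * y1 + rng.gauss(0.0, error_sd)
        # Independent response indicators that depend only on the observed x.
        r1 = rng.random() < expit(x)
        r2 = rng.random() < expit(0.7 * x)
        rows.append((x, y1 if r1 else None, y2 if r2 else None))
    return rows


def partition(rows):
    """Split a sample into H (both observed), K (y2 missing), L (y1 missing), M (both missing)."""
    parts = {"H": [], "K": [], "L": [], "M": []}
    for row in rows:
        _, y1, y2 = row
        if y1 is not None and y2 is not None:
            parts["H"].append(row)
        elif y1 is not None:
            parts["K"].append(row)
        elif y2 is not None:
            parts["L"].append(row)
        else:
            parts["M"].append(row)
    return parts


if __name__ == "__main__":
    sample = simulate_sample(100, random.Random(2010))
    print({name: len(units) for name, units in partition(sample).items()})
```

Because the response indicators depend only on the always-observed $x$, the mechanism in this sketch is ignorable, in line with the assumption of the proposed method.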

Table 7. Monte Carlo result for the estimated variance of the proposed method based on samples of 10,000 trials.

Sample size   Parameter   Mean Est. Var.   Rel. Bias   t-statistic
100           µ1          .01186           -.05        -3.21
100           µ2          .01268           -.04        -2.88
100           σ11         .02474            .01         0.93
100           σ22         .02678           -.03        -1.76
100           σ12         .01884            .00         0.00
500           µ1          .002429           .01         0.39
500           µ2          .002589          -.01        -0.88
500           σ11         .005092          -.01        -0.82
500           σ22         .005483          -.00        -0.15
500           σ12         .003899          -.02        -1.10

The means and variances of the point estimators and the mean of the estimated variances are calculated. Because the point estimators are all unbiased in the simulation, their simulation means are not listed here. Table 6 displays the Monte Carlo variances of the point estimators in each estimation method. The theoretical asymptotic variances of the MLE are also computed and presented in the last column of Table 6. The simulation results in Table 6 are generally consistent with the theoretical results. The simulation variances are slightly larger than the theoretical variances because the estimators were not computed until convergence.

Table 7 displays the mean, relative bias, and the t-statistic of the estimated variance of the one-step GN method. The relative bias is the Monte Carlo bias of the variance estimator divided by the Monte Carlo variance, where the variance is given in Table 6. The t-statistic for testing the hypothesis of zero bias is the Monte Carlo estimated bias divided by the Monte Carlo standard error of the estimated bias. The t-values as well as the values of the relative biases indicate that the estimated variances of our estimators computed using (10) are close to their theoretical values.
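The two diagnostics reported in Table 7 can be computed from the Monte Carlo replicates as sketched below. The function name is hypothetical, and the standard error of the estimated bias is an assumption made here: the bias is written as the mean over replicates of the variance estimate minus the squared deviation of the point estimate, and its standard error is taken from the spread of those differences. The source does not state how the standard error was obtained.

```python
from math import sqrt
from statistics import mean, stdev


def variance_estimator_diagnostics(point_estimates, variance_estimates):
    """Relative bias and t-statistic of a variance estimator over B replicates."""
    B = len(point_estimates)
    center = mean(point_estimates)
    sq_dev = [(t - center) ** 2 for t in point_estimates]
    mc_var = mean(sq_dev)                    # Monte Carlo variance of the point estimator
    mean_est_var = mean(variance_estimates)  # "Mean Est. Var." column
    # Per-replicate contribution to the bias; its mean is mean_est_var - mc_var.
    diffs = [v - d for v, d in zip(variance_estimates, sq_dev)]
    bias = mean(diffs)
    se_bias = stdev(diffs) / sqrt(B)         # assumed Monte Carlo standard error
    return {
        "mean_est_var": mean_est_var,
        "mc_var": mc_var,
        "rel_bias": bias / mc_var,
        "t": bias / se_bias,
    }


if __name__ == "__main__":
    # Tiny artificial input, only to show the calling convention.
    print(variance_estimator_diagnostics([1.0, 2.0, 3.0, 4.0],
                                         [2.25, 1.25, 1.25, 2.25]))
```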

The simulation results in Table 6 show that the two procedures have similar performance in terms of point estimation. The efficiency is slightly better for the one-step GN method because the EM algorithm was terminated after 10 iterations. The efficiency improvement is larger for the variance parameters than for the mean parameters, which suggests that convergence of the EM algorithm for the mean parameters is faster than convergence for the variance parameters. Also, as can be seen in Table 7, the one-step GN method provides consistent variance estimates for all parameters. The performance is better for a larger sample size because the consistency of the variance estimator is justified from the asymptotic theory.

6 Concluding remarks

We have proposed a Gauss-Newton algorithm for obtaining the maximum likelihood estimator under a general non-monotone missing data scenario. The proposed method is shown to be algebraically equivalent to the Newton-Raphson method but avoids the burden of obtaining the observed likelihood. Instead, the MLEs separately computed from each partition of the marginal likelihoods and the full likelihoods are combined in a natural way. The way we combine the information takes the form of GLS estimation and thus can be easily implemented using existing software. The estimated covariance matrix is computed automatically and shows good finite sample performance in the simulation study. The proposed method is not restricted to the multivariate normal distribution. It can be applied to any parametric multivariate distribution as long as the computations for the marginal likelihood and the full likelihood are easier than those for the observed likelihood.

The proposed method assumes an ignorable response mechanism. A more realistic situation would be the case when the probability of $y_2$ missing depends on the value of $y_1$. In this case, the assumption of missing at random no longer holds and we have to take the response mechanism into account. Further investigation in this direction is a topic for future research.

Appendix

A. Proof of Theorem 1

Note that the observed log-likelihood can be written as a sum of the log-likelihoods in each set:

$$
\log l_{obs}(\theta) = \log l_H(\theta) + \log l_K(\theta) + \log l_L(\theta), \tag{A.1}
$$

where $l_H = \prod_{i \in H} f(y_i; \theta)$ is the likelihood function defined in set H, and $l_K$ and $l_L$ are defined similarly. Under MAR, $l_H$ is the likelihood for the joint distribution of $y_1$ and $y_2$, $l_K$ is the likelihood for the marginal distribution of $y_1$, and $l_L$ is the likelihood for the marginal distribution of $y_2$. By (A.1), the score function for the likelihood can be written as

$$
S_{obs}(y; \theta) = S_H(y; \theta) + S_K(y; \theta) + S_L(y; \theta) \tag{A.2}
$$

and the expected information matrix also satisfies the additive decomposition

$$
I_{obs}(\theta) = I_H(\theta) + I_K(\theta) + I_L(\theta), \tag{A.3}
$$

where $I_H(\theta) = E\left[ -\partial^2 \log l_H(\theta) / \partial\theta \partial\theta' \right]$, and $I_K(\theta)$ and $I_L(\theta)$ are defined similarly.

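The equation in (A.3) can be written as

$$
\begin{aligned}
I_{obs}(\theta) &= \left( \frac{\partial \eta_H'}{\partial\theta} \right) I_H(\eta_H) \left( \frac{\partial \eta_H}{\partial\theta'} \right) + \left( \frac{\partial \eta_K'}{\partial\theta} \right) I_K(\eta_K) \left( \frac{\partial \eta_K}{\partial\theta'} \right) + \left( \frac{\partial \eta_L'}{\partial\theta} \right) I_L(\eta_L) \left( \frac{\partial \eta_L}{\partial\theta'} \right) \\
&= X'\hat{V}^{-1}X,
\end{aligned} \tag{A.4}
$$

where $X = \left( \partial\eta_H'/\partial\theta, \; \partial\eta_K'/\partial\theta, \; \partial\eta_L'/\partial\theta \right)'$ and $\hat{V}^{-1} = diag\{I_H(\eta_H), I_K(\eta_K), I_L(\eta_L)\}$. Now, consider the score function in (A.2). Using the chain rule, the score function can be written as

$$
S_{obs}(y; \theta) = \left( \frac{\partial \eta_H'}{\partial\theta} \right) S_H(y; \eta_H) + \left( \frac{\partial \eta_K'}{\partial\theta} \right) S_K(y; \eta_K) + \left( \frac{\partial \eta_L'}{\partial\theta} \right) S_L(y; \eta_L). \tag{A.5}
$$

Let $\hat{\eta}_H$ be the MLE of the likelihood $l_H$. Taking a Taylor expansion of $S_H(y; \eta_H)$ about $\hat{\eta}_H$ leads to

$$
S_H(y; \eta_H) \doteq S_H(y; \hat{\eta}_H) - \mathcal{I}_H(\hat{\eta}_H)(\eta_H - \hat{\eta}_H),
$$

where $\mathcal{I}_H(\eta_H) = -\partial^2 \log l_H(\eta_H) / \partial\eta_H \partial\eta_H'$ is the observed information matrix. Using $S_H(y; \hat{\eta}_H) = 0$ and the convergence of the observed information matrix to the expected information matrix, we have

$$
S_H(y; \eta_H) \doteq -I_H(\hat{\eta}_H)(\eta_H - \hat{\eta}_H).
$$

Similar results hold for the sets K and L. Thus, (A.5) becomes

$$
\begin{aligned}
S_{obs}(y; \theta) &\doteq \left( \frac{\partial \eta_H'}{\partial\theta} \right) I_H(\hat{\eta}_H)(\hat{\eta}_H - \eta_H) + \left( \frac{\partial \eta_K'}{\partial\theta} \right) I_K(\hat{\eta}_K)(\hat{\eta}_K - \eta_K) + \left( \frac{\partial \eta_L'}{\partial\theta} \right) I_L(\hat{\eta}_L)(\hat{\eta}_L - \eta_L) \\
&= X'\hat{V}^{-1}(\hat{\eta} - \eta).
\end{aligned} \tag{A.6}
$$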
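Combining (A.4) and (A.6) makes the equivalence with a Gauss-Newton step explicit. The display below is a sketch of that final step, assuming the Newton-Raphson iteration is taken in its scoring form with the expected information matrix; the iteration index $(t)$ is notation introduced here for illustration only.

```latex
\begin{aligned}
\hat{\theta}^{(t+1)}
  &= \hat{\theta}^{(t)}
     + \left\{ I_{obs}(\hat{\theta}^{(t)}) \right\}^{-1}
       S_{obs}(y; \hat{\theta}^{(t)}) \\
  &\doteq \hat{\theta}^{(t)}
     + \left( X'\hat{V}^{-1}X \right)^{-1}
       X'\hat{V}^{-1} \left\{ \hat{\eta} - \eta(\hat{\theta}^{(t)}) \right\}
\end{aligned}
```

The second line has the form of a generalized least squares regression of $\hat{\eta} - \eta(\hat{\theta}^{(t)})$ on $X$ with weight matrix $\hat{V}^{-1}$, with $X$ and $\hat{V}$ evaluated at $\hat{\theta}^{(t)}$.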


B. Computations for variance formula

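Using

$$
\left( X_H'\hat{V}_H^{-1}X_H + X_K'\hat{V}_K^{-1}X_K \right)^{-1} = \left( X_H'\hat{V}_H^{-1}X_H \right)^{-1} - \left( X_H'\hat{V}_H^{-1}X_H \right)^{-1} X_K' \left[ \hat{V}_K + X_K \left( X_H'\hat{V}_H^{-1}X_H \right)^{-1} X_K' \right]^{-1} X_K \left( X_H'\hat{V}_H^{-1}X_H \right)^{-1},
$$

it can be easily shown that

$$
\left( X_H'\hat{V}_H^{-1}X_H + X_K'\hat{V}_K^{-1}X_K \right)^{-1} = diag\left\{ \frac{\Sigma_{bb}}{n_H}, \; \frac{2\sigma_{22\cdot 1}^2}{n_H}, \; \frac{\sigma_{11}}{n_H + n_K}, \; \frac{2\sigma_{11}^2}{n_H + n_K} \right\}.
$$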

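Now, to use the formula

$$
\left( X_H'\hat{V}_H^{-1}X_H + X_L'\hat{V}_L^{-1}X_L \right)^{-1} = \left( X_H'\hat{V}_H^{-1}X_H \right)^{-1} - \left( X_H'\hat{V}_H^{-1}X_H \right)^{-1} X_L' \left[ \hat{V}_L + X_L \left( X_H'\hat{V}_H^{-1}X_H \right)^{-1} X_L' \right]^{-1} X_L \left( X_H'\hat{V}_H^{-1}X_H \right)^{-1},
$$

note that

$$
\hat{V}_L + X_L \left( X_H'\hat{V}_H^{-1}X_H \right)^{-1} X_L' = diag\left\{ \frac{\sigma_{22}}{n_{HL}}, \; \frac{2\sigma_{22}^2}{n_{HL}} \right\},
$$

where $n_{HL}^{-1} = n_H^{-1} + n_L^{-1}$, and

$$
X_L \left( X_H'\hat{V}_H^{-1}X_H \right)^{-1} = \frac{1}{n_H} \left( \begin{array}{ccccc}
\sigma_{22\cdot 1} & 0 & 0 & \sigma_{12} & 0 \\
-2\beta_{21\cdot 1}\sigma_{22\cdot 1}\mu_1 & 2\beta_{21\cdot 1}\sigma_{22\cdot 1} & 2\sigma_{22\cdot 1}^2 & 0 & 2\sigma_{12}^2
\end{array} \right).
$$

Thus, we have

$$
\left( X_H'\hat{V}_H^{-1}X_H + X_L'\hat{V}_L^{-1}X_L \right)^{-1} = \hat{V}_H - \frac{n_{HL}}{n_H^2} \left( \begin{array}{cc}
\sigma_{22\cdot 1}/\sigma_{22} & -\beta_{21\cdot 1}\sigma_{22\cdot 1}\mu_1/\sigma_{22}^2 \\
0 & \beta_{21\cdot 1}\sigma_{22\cdot 1}/\sigma_{22}^2 \\
0 & \sigma_{22\cdot 1}^2/\sigma_{22}^2 \\
\sigma_{12}/\sigma_{22} & 0 \\
0 & \sigma_{12}^2/\sigma_{22}^2
\end{array} \right) \left( \begin{array}{cc}
\sigma_{22\cdot 1} & -2\beta_{21\cdot 1}\sigma_{22\cdot 1}\mu_1 \\
0 & 2\beta_{21\cdot 1}\sigma_{22\cdot 1} \\
0 & 2\sigma_{22\cdot 1}^2 \\
\sigma_{12} & 0 \\
0 & 2\sigma_{12}^2
\end{array} \right)',
$$

which gives the variances of $\hat{\theta}_{HL}$.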

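To compute the variances of $\hat{\theta}_{HKL}$, use

$$
\left( X_{HK}'\hat{V}_{HK}^{-1}X_{HK} + X_L'\hat{V}_L^{-1}X_L \right)^{-1} = \left( X_{HK}'\hat{V}_{HK}^{-1}X_{HK} \right)^{-1} - \left( X_{HK}'\hat{V}_{HK}^{-1}X_{HK} \right)^{-1} X_L' \left[ \hat{V}_L + X_L \left( X_{HK}'\hat{V}_{HK}^{-1}X_{HK} \right)^{-1} X_L' \right]^{-1} X_L \left( X_{HK}'\hat{V}_{HK}^{-1}X_{HK} \right)^{-1},
$$

where

$$
\left( X_{HK}'\hat{V}_{HK}^{-1}X_{HK} \right)^{-1} = \left( X_H'\hat{V}_H^{-1}X_H + X_K'\hat{V}_K^{-1}X_K \right)^{-1}.
$$

Writing

$$
D \equiv \hat{V}_L + X_L \left( X_{HK}'\hat{V}_{HK}^{-1}X_{HK} \right)^{-1} X_L' = diag\left\{ \frac{\sigma_{22}}{n_{HL}} \left( 1 - p_L p_K \rho^2 \right), \; \frac{2\sigma_{22}^2}{n_{HL}} \left( 1 - p_L p_K \rho^4 \right) \right\},
$$

where $n_{HL}^{-1} = n_H^{-1} + n_L^{-1}$, $p_L = n_L/(n_H + n_L)$, and $p_K = n_K/(n_H + n_K)$, and

$$
X_L \left( X_{HK}'\hat{V}_{HK}^{-1}X_{HK} \right)^{-1} = \frac{1}{n_H} \left( \begin{array}{ccccc}
\sigma_{22\cdot 1} & 0 & 0 & \sigma_{12}(1 - p_K) & 0 \\
-2\beta_{21\cdot 1}\sigma_{22\cdot 1}\mu_1 & 2\beta_{21\cdot 1}\sigma_{22\cdot 1} & 2\sigma_{22\cdot 1}^2 & 0 & 2\sigma_{12}^2(1 - p_K)
\end{array} \right),
$$

the variance of $\hat{\theta}_{HKL}$ can be obtained by

$$
\left( X_{HK}'\hat{V}_{HK}^{-1}X_{HK} + X_L'\hat{V}_L^{-1}X_L \right)^{-1} = \left( X_{HK}'\hat{V}_{HK}^{-1}X_{HK} \right)^{-1} - \frac{1}{n_H^2} \left( \begin{array}{cc}
\sigma_{22\cdot 1} & -2\beta_{21\cdot 1}\sigma_{22\cdot 1}\mu_1 \\
0 & 2\beta_{21\cdot 1}\sigma_{22\cdot 1} \\
0 & 2\sigma_{22\cdot 1}^2 \\
\sigma_{12}(1 - p_K) & 0 \\
0 & 2\sigma_{12}^2(1 - p_K)
\end{array} \right) D^{-1} \left( \begin{array}{cc}
\sigma_{22\cdot 1} & -2\beta_{21\cdot 1}\sigma_{22\cdot 1}\mu_1 \\
0 & 2\beta_{21\cdot 1}\sigma_{22\cdot 1} \\
0 & 2\sigma_{22\cdot 1}^2 \\
\sigma_{12}(1 - p_K) & 0 \\
0 & 2\sigma_{12}^2(1 - p_K)
\end{array} \right)'.
$$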


References

Anderson, T.W. (1957). Maximum likelihood estimates for the multivariate normal distribution when some observations are missing, Journal of the American Statistical Association 52, 200-203.

Chen, Q., Ibrahim, J. G., Chen, M-H, and Senchaudhuri, P. (2008). Theory and inference for regression models with missing responses and covariates, Journal of Multivariate Analysis 99, 1302-1331.

Cox, D.R. and Reid, N. (1987). Parameter orthogonality and approximate conditional inference (with discussion), Journal of the Royal Statistical Society: Series B 49, 1-39.

Dempster, A.P., Laird, N.M., and Rubin, D.B. (1977). Maximum likelihood from incomplete data via the EM algorithm (with discussion), Journal of the Royal Statistical Society: Series B 39, 1-38.

Kim, J.K. (2004). Extension of factoring likelihood approach to non-monotone missing data, Journal of the Korean Statistical Society 33, 401-410.

Lehmann, E.L. (1983). Theory of Point Estimation. Wiley, New York.

Little, R.J.A. (1982). Models for nonresponse in sample surveys, Journal of the American Statistical Association 77, 237-250.

Little, R.J.A. (1992). Regression with missing X's: A review, Journal of the

Little, R.J.A. and Rubin, D.B. (2002). Statistical Analysis with Missing Data. Wiley, New York.

Liu, C. and Rubin, D. B. (1994). The ECME algorithm: A simple extension of EM and ECM with faster monotone convergence, Biometrika 81, 633-648.

Louis, T. A. (1982). Finding the observed information matrix when using the EM algorithm, Journal of the Royal Statistical Society: Series B 44, 226-233.

Meng, X.-L. and Rubin, D.B. (1991). Using EM to obtain asymptotic variance-covariance matrices: The SEM algorithm, Journal of the American Statistical Association 86, 899-909.

Molenberghs, G. and Kenward, M. (2007). Missing Data in Clinical Studies. Wiley, New York.

Rubin, D.B. (1974). Characterizing the estimation of parameters in incomplete data problems, Journal of the American Statistical Association 69, 467-474.

Rubin, D.B. (1976). Inference and missing data, Biometrika 63, 581-590.

Seber, G.A.F. and Wild, C.J. (1989). Nonlinear Regression. Wiley, New