Note on the EM Algorithm in Linear Regression Model


Ji-Xia Wang and Yu Miao

College of Mathematics and Information Science Henan Normal University

Henan Province, 453007, China [email protected]

Abstract

The linear regression model has been used extensively in the fields of information processing and data analysis. In the present paper, we consider the linear model with missing data. Using the EM (Expectation and Maximization) algorithm, the asymptotic variances and the standard errors for the MLE of the unknown parameters are established.

Mathematics Subject Classification: 93C05; 93C41

Keywords: Conditional expectation; maximum likelihood estimator; EM algorithm; Newton-Raphson iteration

1 Introduction

As a typical statistical model, the linear regression model has been widely used in the fields of information processing and data analysis. Several statistical methods exist for learning or fitting it (e.g., the expectation-maximization (EM) algorithm [2] for maximum likelihood and the self-organizing network with hyper-ellipsoidal clustering [5]). Generally, the parameters of a linear regression model can be estimated via the EM algorithm under the maximum likelihood framework, since the EM algorithm has good convergence behavior in certain situations. However, in some applications many data sets include missing observations [9], which causes problems if the missingness is related to the values of the missing item [8]; for instance, Little and Rubin [4] showed that this can cause bias and inefficiency in some estimators. So a new algorithm for estimating the unknown parameters, based on the likelihood function, is proposed. In [1], Baker and Laird used the EM algorithm to obtain maximum likelihood estimates (MLE) of the unknown parameters in a model with incomplete data. Ibrahim and Lipsitz [3] established Bayesian methods for estimation in generalized linear models.

In the present paper, we discuss the linear regression model with missing data and propose a method for estimating the parameters by using Newton-Raphson iteration to solve the score equation. Moreover, the standard errors of these estimators are calculated from the observed Fisher information matrix.

2 Linear regression model with missing data

Suppose that $y_1, y_2, \ldots, y_n$ are independent, identically distributed normal random variables with unit variances. Let $X_i = (X_{1i}, X_{2i})^T$ be a $2 \times 1$ random vector of covariates, where $X_{1i}$ and $X_{2i}$ are independent observations that follow normal distributions with means $\mu_1, \mu_2$ and variances $\sigma_1^2, \sigma_2^2$, respectively. For notational convenience, let $X_i = (1, X_{1i}, X_{2i})^T$ and assume that $\beta^T = (\beta_0, \beta_1, \beta_2)$ are the regression coefficients. It is also supposed that

$$p(y_i \mid X_i, \beta) = \frac{1}{\sqrt{2\pi}} \exp\left\{ -\frac{(y_i - X_i^T \beta)^2}{2} \right\}. \tag{1}$$

We assume that $X_{1i}$ is completely observed for every $i$ and that $X_{2i}$ is partially missing. Our objective is to estimate $\beta, \mu_1, \mu_2, \sigma_1^2, \sigma_2^2$ and their standard errors from the known data with missing values.
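To make the model concrete, the following is a minimal simulation sketch of the complete data under Eq. (1), before any values are removed. The function name `simulate_complete_data`, its argument layout, and the seeding are illustrative assumptions, not part of the paper.

```python
import random


def simulate_complete_data(n, beta, mu, sigma2, seed=0):
    """Draw n triples (x1, x2, y) from the model of Section 2.

    beta   = (beta0, beta1, beta2)   regression coefficients
    mu     = (mu1, mu2)              covariate means
    sigma2 = (sigma1^2, sigma2^2)    covariate variances
    """
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        # X_1i and X_2i are independent normal covariates.
        x1 = rng.gauss(mu[0], sigma2[0] ** 0.5)
        x2 = rng.gauss(mu[1], sigma2[1] ** 0.5)
        # y_i is normal with mean X_i^T beta and unit variance, as in Eq. (1).
        mean_y = beta[0] + beta[1] * x1 + beta[2] * x2
        y = rng.gauss(mean_y, 1.0)
        data.append((x1, x2, y))
    return data
```

A call such as `simulate_complete_data(200, (1.0, 2.0, -1.0), (0.5, -0.5), (1.0, 4.0))` returns a list of 200 fully observed cases to which a missing-data mechanism can then be applied.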

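Missing value indicators are introduced in [6] as

$$r_i = \begin{cases} 0, & \text{if } y_i \text{ is observed}, \\ 1, & \text{if } y_i \text{ is missing}, \end{cases} \qquad s_i = \begin{cases} 0, & \text{if } x_{2i} \text{ is observed}, \\ 1, & \text{if } x_{2i} \text{ is missing}, \end{cases} \tag{2}$$

with probabilities $p(r_i = 1) = \psi_i$ and $p(s_i = 1) = \phi_i$. Following the reference [8], for any $i = 1, 2, \ldots, n$, the missing-data mechanism is defined as

$$\operatorname{logit}(\psi_i) = \log \frac{\psi_i}{1 - \psi_i} = \delta_1 X_{1i} + \delta_2 X_{2i} + y_i \omega \tag{3}$$

and

$$\operatorname{logit}(\phi_i) = \log \frac{\phi_i}{1 - \phi_i} = \alpha_1 X_{1i} + \alpha_2 X_{2i} + y_i \tau, \tag{4}$$

where $\delta = (\delta_1, \delta_2)^T$, $\alpha = (\alpha_1, \alpha_2)^T$, $\omega$ and $\tau$ are parameters determining the missing mechanism. Then the conditional probability functions for $r_i$ and $s_i$ are derived from Eqs. (2)-(4) as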

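$$p(r_i \mid X_i, y_i, \delta, \omega) = \frac{\exp\{r_i (X_i^T \delta + y_i \omega)\}}{1 + \exp\{X_i^T \delta + y_i \omega\}}, \qquad p(s_i \mid X_i, y_i, \alpha, \tau) = \frac{\exp\{s_i (X_i^T \alpha + y_i \tau)\}}{1 + \exp\{X_i^T \alpha + y_i \tau\}}.$$

Now we derive the joint probability function of $y_i, x_{2i}, r_i, s_i$ as

$$\begin{aligned}
p(y_i, x_{2i}, r_i, s_i \mid x_{1i})
&= p(r_i \mid X_i, y_i, \delta, \omega)\, p(s_i \mid X_i, y_i, \alpha, \tau)\, p(y_i \mid X_i, \beta)\, p(x_{2i} \mid X_{1i}) \\
&\propto \frac{\exp\{r_i (X_i^T \delta + y_i \omega)\}}{1 + \exp\{X_i^T \delta + y_i \omega\}} \times \frac{\exp\{s_i (X_i^T \alpha + y_i \tau)\}}{1 + \exp\{X_i^T \alpha + y_i \tau\}} \\
&\quad \times (2\pi)^{-1/2} \exp\left\{ -\frac{(y_i - X_i^T \beta)^2}{2} \right\} \times (2\pi\sigma_2^2)^{-1/2} \exp\left\{ -\frac{(x_{2i} - \mu_2)^2}{2\sigma_2^2} \right\}.
\end{aligned}$$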
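The logistic missing-data mechanism can be sketched in a few lines. The helper names `logistic`, `indicator_prob` and `draw_indicators` are hypothetical, and the linear predictor is taken in the two-covariate form written out in Eqs. (3)-(4); this is an illustration under those assumptions rather than the authors' implementation.

```python
import math
import random


def logistic(t):
    """Inverse logit, arranged so math.exp never overflows."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def indicator_prob(ind, x1, x2, y, coef, slope):
    """p(ind | X_i, y_i) for ind in {0, 1}.

    coef = (delta1, delta2) and slope = omega for r_i,
    coef = (alpha1, alpha2) and slope = tau for s_i.
    """
    p_missing = logistic(coef[0] * x1 + coef[1] * x2 + slope * y)
    return p_missing if ind == 1 else 1.0 - p_missing


def draw_indicators(x1, x2, y, delta, omega, alpha, tau, rng):
    """Draw (r_i, s_i); 1 means missing and 0 means observed, as in Eq. (2)."""
    r = 1 if rng.random() < indicator_prob(1, x1, x2, y, delta, omega) else 0
    s = 1 if rng.random() < indicator_prob(1, x1, x2, y, alpha, tau) else 0
    return r, s
```

Because the probability of missingness depends on $y_i$ and $x_{2i}$ themselves whenever $\omega$, $\tau$, $\delta_2$ or $\alpha_2$ is nonzero, the indicators carry information about the values they hide, which is why they enter the joint density above.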

Therefore, we can write down the complete-data log-likelihood $l(\theta)$ as

$$\begin{aligned}
l(\theta) &= \log L(\theta \mid y_i, X_i, r_i, s_i) \\
&= \sum_{i=1}^{n} \log \frac{\exp\{r_i (X_i^T \delta + y_i \omega)\}}{1 + \exp\{X_i^T \delta + y_i \omega\}} + \sum_{i=1}^{n} \log \frac{\exp\{s_i (X_i^T \alpha + y_i \tau)\}}{1 + \exp\{X_i^T \alpha + y_i \tau\}} \\
&\quad - \frac{n}{2} \log(2\pi) - \sum_{i=1}^{n} \frac{(y_i - X_i^T \beta)^2}{2} - \frac{n}{2} \log(2\pi\sigma_2^2) - \sum_{i=1}^{n} \frac{(x_{2i} - \mu_2)^2}{2\sigma_2^2},
\end{aligned}$$

where $\theta = (\beta, \delta, \omega, \alpha, \tau, \mu_2, \sigma_2^2)$ is the parameter vector used in developing the EM algorithm. The complete-data log-likelihood specifies a model for the joint characterization of the observed data and the associated missing-data mechanism.
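As an illustration, the complete-data log-likelihood can be evaluated directly once every value is filled in. The sketch below assumes each case is a tuple `(x1, x2, y, r, s)` and uses the two-covariate linear predictors of Eqs. (3)-(4); the function names are hypothetical.

```python
import math


def softplus(t):
    """log(1 + exp(t)) computed without overflow."""
    return max(t, 0.0) + math.log1p(math.exp(-abs(t)))


def complete_loglik(data, beta, delta, omega, alpha, tau, mu2, sigma2_sq):
    """Complete-data log-likelihood l(theta) for rows (x1, x2, y, r, s)."""
    total = 0.0
    for x1, x2, y, r, s in data:
        eta_r = delta[0] * x1 + delta[1] * x2 + omega * y
        eta_s = alpha[0] * x1 + alpha[1] * x2 + tau * y
        # log of exp{r * eta} / (1 + exp{eta}) for each indicator
        total += r * eta_r - softplus(eta_r)
        total += s * eta_s - softplus(eta_s)
        # normal density of y_i given X_i, unit variance
        resid = y - (beta[0] + beta[1] * x1 + beta[2] * x2)
        total += -0.5 * math.log(2.0 * math.pi) - 0.5 * resid ** 2
        # normal density of x_2i with mean mu2 and variance sigma2_sq
        total += -0.5 * math.log(2.0 * math.pi * sigma2_sq)
        total += -((x2 - mu2) ** 2) / (2.0 * sigma2_sq)
    return total
```

In the E-step this function would be evaluated on completed cases, that is, on the observed values together with simulated values for the missing ones.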

3 E-step of EM algorithm

The MLE of $\theta$ is a point that maximizes the observed-data likelihood function $L(\theta \mid (y, X)_{obs}, r_i, s_i)$, where $(y, X)_{obs}$ denotes the observed components of $(y, X)$. Let $\theta^{(r)}$ be the $r$-th iteration estimate of $\theta$ and define the conditional expectation of $l(\theta)$ with respect to the conditional distribution of the missing data $(y, X)_{mis}$, given the observed data $y_i, X_i, r_i, s_i$ and the value $\theta^{(r)}$, as follows:

$$Q(\theta \mid \theta^{(r)}) = E\left[ l(\theta) \mid (y, X)_{obs}, r, s, \theta^{(r)} \right]. \tag{5}$$

The EM algorithm is composed of E-step and M-step iterations. For the expectation of the complete-data log-likelihood in the E-step, we consider four possible cases: the response variable $y_i$ is missing, the covariate $x_{2i}$ is missing, both of them are missing, and no values are missing. Then the expected log-likelihood function can be written as

(6) =


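where $x_{2i,mis}$ denotes the missing components of $x_{2i}$. Eqs. (3.1) and (3.2) lead to the conditional expectation of $l(\theta)$, which is our target quantity:

$$\begin{aligned}
Q(\theta \mid \theta^{(r)}) &= \sum_{i=1}^{n_1} l(\theta) + \sum_{i=n_1+1}^{n_2} \int l(\theta)\, p\left(y_{i,mis} \mid X_i, r_i, s_i, \theta^{(r)}\right) dy_{i,mis} \\
&\quad + \sum_{i=n_2+1}^{n_3} \int l(\theta)\, p\left(x_{2i,mis} \mid X_{i,obs}, y_i, r_i, s_i, \theta^{(r)}\right) dx_{2i,mis} \\
&\quad + \sum_{i=n_3+1}^{n} \iint l(\theta)\, p\left(y_{i,mis}, x_{2i,mis} \mid X_{i,obs}, r_i, s_i, \theta^{(r)}\right) dy_{i,mis}\, dx_{2i,mis},
\end{aligned}$$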

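where $n_1, n_2, n_3$ are the corresponding sample sizes, $y_{i,mis}$ is the missing component of $y_i$, $X_{i,obs}$ is the observed component of $X_i$, and $p(y_{i,mis}, x_{2i,mis} \mid X_{i,obs}, r_i, s_i)$, $p(y_{i,mis} \mid X_i, r_i, s_i)$ and $p(x_{2i,mis} \mid X_{i,obs}, y_i, r_i, s_i)$ are the conditional probabilities of the missing data given the observed data. These conditional probabilities are regarded as the weights in $Q(\theta \mid \theta^{(r)})$. The weights have the following form:

$$\begin{aligned}
p\left(y_{i,mis}, x_{2i,mis} \mid X_{i,obs}, r_i, s_i, \theta^{(r)}\right)
&= \frac{p(y_i \mid X_i, \theta^{(r)})\, p(x_{2i} \mid x_{1i})\, p(r_i \mid y_i, X_i, \theta^{(r)})\, p(s_i \mid y_i, X_i, \theta^{(r)})}{\iint p(y_i \mid X_i, \theta^{(r)})\, p(x_{2i} \mid x_{1i})\, p(r_i \mid y_i, X_i, \theta^{(r)})\, p(s_i \mid y_i, X_i, \theta^{(r)})\, dy_i\, dx_{2i}} \\
&\propto p\left(y_i, x_{2i}, r_i, s_i \mid x_{1i}, \theta^{(r)}\right),
\end{aligned}$$

$$\begin{aligned}
p\left(x_{2i,mis} \mid X_{i,obs}, y_i, r_i, s_i, \theta^{(r)}\right)
&= \frac{p(x_{2i} \mid x_{1i}, \theta^{(r)})\, p(s_i \mid y_i, X_i, \theta^{(r)})}{\int p(x_{2i} \mid x_{1i}, \theta^{(r)})\, p(s_i \mid y_i, X_i, \theta^{(r)})\, dx_{2i}} \\
&\propto \frac{\exp\{s_i (X_i^T \alpha + y_i \tau)\}}{1 + \exp\{X_i^T \alpha + y_i \tau\}} \times (2\pi\sigma_2^2)^{-1/2} \exp\left\{ -\frac{(x_{2i} - \mu_2)^2}{2\sigma_2^2} \right\},
\end{aligned}$$

and

$$\begin{aligned}
p\left(y_{i,mis} \mid X_i, r_i, s_i, \theta^{(r)}\right)
&= \frac{p(y_i \mid X_i, \theta^{(r)})\, p(r_i \mid y_i, X_i, \theta^{(r)})}{\int p(y_i \mid X_i, \theta^{(r)})\, p(r_i \mid y_i, X_i, \theta^{(r)})\, dy_i} \\
&\propto p\left(y_i \mid X_i, \theta^{(r)}\right) p\left(r_i \mid y_i, X_i, \theta^{(r)}\right).
\end{aligned}$$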
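Since these weights are known only up to a normalizing constant, one way to use them is to simulate from them. The sketch below draws a missing response from the last weight, $p(y_i \mid X_i, \theta^{(r)})\, p(r_i = 1 \mid y_i, X_i, \theta^{(r)})$, with a Metropolis-Hastings independence sampler whose proposal is $p(y_i \mid X_i, \theta^{(r)})$. The choice of sampler, the function name `sample_missing_y` and the burn-in length are assumptions made for illustration; the text above does not fix them.

```python
import math
import random


def sample_missing_y(x1, x2, beta, delta, omega, n_draws, rng, burn_in=200):
    """Draw y_i,mis from a density proportional to
    p(y | X, beta) * p(r = 1 | y, X, delta, omega)."""
    mean_y = beta[0] + beta[1] * x1 + beta[2] * x2
    base = delta[0] * x1 + delta[1] * x2

    def log_p_missing(y):
        # log of exp(t) / (1 + exp(t)), i.e. log p(r_i = 1 | y, X)
        t = base + omega * y
        return t - (max(t, 0.0) + math.log1p(math.exp(-abs(t))))

    y = mean_y
    draws = []
    for it in range(burn_in + n_draws):
        proposal = rng.gauss(mean_y, 1.0)
        # The proposal equals p(y | X, beta), so that factor cancels and the
        # acceptance ratio reduces to the ratio of the p(r | .) terms.
        log_ratio = log_p_missing(proposal) - log_p_missing(y)
        if rng.random() < math.exp(min(0.0, log_ratio)):
            y = proposal
        if it >= burn_in:
            draws.append(y)
    return draws
```

Averaging $l(\theta)$ over such draws gives a Monte Carlo approximation of the corresponding integral in $Q(\theta \mid \theta^{(r)})$. When $\omega = 0$ every proposal is accepted and the draws are simply samples from $p(y_i \mid X_i, \theta^{(r)})$.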

Metropolis-

4 M-step of EM algorithm and convergence

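Now we need to find a value of $\theta$, say $\theta^{(r+1)}$, at which $Q(\theta \mid \theta^{(r)})$ attains its maximum. The Newton-Raphson method is used to solve the score equation. The parameters $\theta^{(r+1)}$ in the M-step at the $(r+1)$-st EM iteration and the $(r+1)$-st Newton-Raphson iteration take the following form (taking $\beta$ as an example):

$$\beta^{(r+1)} = \beta^{(r)} + \left[ -\frac{\partial^2 Q(\theta \mid \theta^{(r)})}{\partial\beta\, \partial\beta^T} \right]^{-1}_{\beta = \beta^{(r)}} \times \left. \frac{\partial Q(\theta \mid \theta^{(r)})}{\partial\beta} \right|_{\beta = \beta^{(r)}}.$$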

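The derivatives with respect to the parameter $\beta$ used in the iteration are given as follows:

$$\begin{aligned}
\frac{\partial Q(\theta \mid \theta^{(r)})}{\partial\beta}
&= \sum_{i=1}^{n_1} X_i \left( y_i - X_i^T \beta \right) + \sum_{i=n_1+1}^{n_2} E\left[ X_i \left( y_i - X_i^T \beta \right) \mid X_i, \theta^{(r)} \right] \\
&\quad + \sum_{i=n_2+1}^{n_3} E\left[ X_i \left( y_i - X_i^T \beta \right) \mid X_{obs}, y_i, \theta^{(r)} \right] + \sum_{i=n_3+1}^{n} E\left[ X_i \left( y_i - X_i^T \beta \right) \mid X_{obs}, \theta^{(r)} \right],
\end{aligned}$$

and

$$\begin{aligned}
\frac{\partial^2 Q(\theta \mid \theta^{(r)})}{\partial\beta\, \partial\beta^T}
&= -\sum_{i=1}^{n_1} X_i X_i^T - \sum_{i=n_1+1}^{n_2} E\left[ X_i X_i^T \mid X_i, \theta^{(r)} \right] \\
&\quad - \sum_{i=n_2+1}^{n_3} E\left[ X_i X_i^T \mid X_{obs}, y_i, \theta^{(r)} \right] - \sum_{i=n_3+1}^{n} E\left[ X_i X_i^T \mid X_{obs}, \theta^{(r)} \right].
\end{aligned}$$

The derivatives with respect to the other components of $\theta$ used in the iteration are given in the reference [6].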
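The Newton-Raphson update for $\beta$ can be sketched on completed data, with the conditional expectations replaced by weighted averages over simulated values. The representation of the data as `(x1, x2, y, weight)` rows and the helper names `solve_linear` and `newton_step_beta` are assumptions made for this illustration.

```python
def solve_linear(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [v - f * w for v, w in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def newton_step_beta(beta, rows):
    """One Newton-Raphson update of beta.

    rows holds (x1, x2, y, weight) tuples: a fully observed case has
    weight 1, and each of the m simulated completions of an incomplete
    case has weight 1/m, so weighted sums approximate the expectations.
    """
    grad = [0.0, 0.0, 0.0]
    neg_hess = [[0.0] * 3 for _ in range(3)]
    for x1, x2, y, w in rows:
        x = (1.0, x1, x2)
        resid = y - sum(xj * bj for xj, bj in zip(x, beta))
        for j in range(3):
            grad[j] += w * x[j] * resid          # dQ/dbeta
            for k in range(3):
                neg_hess[j][k] += w * x[j] * x[k]  # -d2Q/dbeta dbeta^T
    step = solve_linear(neg_hess, grad)
    return [bj + sj for bj, sj in zip(beta, step)]
```

Because $Q$ is quadratic in $\beta$, a single step from any starting value already reaches the maximizer in $\beta$ for the current set of simulated values; iteration is still needed across EM steps, since the simulated values change with $\theta^{(r)}$.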

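The $(r+1)$-st estimates of $\mu_2, \sigma_2^2$ are obtained by solving the score equations:

$$\frac{\partial Q(\theta \mid \theta^{(r)})}{\partial\mu_2} = \sum_{i=1}^{n} E\left( x_{2i} \mid x_{1i}, y_i, r_i, s_i \right) - n\mu_2 = 0,$$

$$\frac{\partial Q(\theta \mid \theta^{(r)})}{\partial\sigma_2^2} = \sum_{i=1}^{n} E\left[ (x_{2i} - \mu_2)^2 \mid x_{1i}, y_i, r_i, s_i \right] - n\sigma_2^2 = 0.$$

Therefore, we can take $\mu_2^{(r+1)}, \sigma_2^{2(r+1)}$ as

$$\mu_2^{(r+1)} = \frac{1}{n} \sum_{i=1}^{n} E\left( x_{2i} \mid x_{1i}, y_i, r_i, s_i \right), \qquad \sigma_2^{2(r+1)} = \frac{1}{n} \sum_{i=1}^{n} E\left[ \left( x_{2i} - \mu_2^{(r+1)} \right)^2 \mid x_{1i}, y_i, r_i, s_i \right],$$

which are approximated by the sample averages of simulated and given observations.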
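A minimal sketch of these two updates is given below, assuming the observed values of $x_{2i}$ are held in one list and the simulated values for each case with missing $x_{2i}$ in a list of lists. The function name `update_mu2_sigma2` and this data layout are hypothetical.

```python
def update_mu2_sigma2(x2_observed, x2_draws):
    """Closed-form M-step update of (mu_2, sigma_2^2).

    x2_observed: observed x_2i values.
    x2_draws:    one list of simulated values per case with missing x_2i.
    """
    n = len(x2_observed) + len(x2_draws)
    # E(x_2i | ...) is the value itself when observed, and the average
    # of the simulated values when missing.
    first = sum(x2_observed) + sum(sum(d) / len(d) for d in x2_draws)
    mu2 = first / n
    # E[(x_2i - mu_2)^2 | ...] is handled the same way.
    second = sum((x - mu2) ** 2 for x in x2_observed)
    second += sum(sum((x - mu2) ** 2 for x in d) / len(d) for d in x2_draws)
    return mu2, second / n
```

For example, `update_mu2_sigma2([1.0, 3.0], [[2.0, 2.0]])` treats the third case as missing with two simulated values equal to 2, so the updated mean is 2 and the updated variance is 2/3.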

The sequence $\{Q(\theta \mid \theta^{(r)})\}$ often exhibits an increasing trend and then fluctuates around a limiting value of $Q(\theta \mid \theta^{(r)})$ once $r$ becomes large enough. The sequence $\{\theta^{(r)}\}$ also fluctuates around the MLE $\hat\theta$ when $r$ is sufficiently large. To monitor the convergence of the EM algorithm we can plot $\{Q(\theta \mid \theta^{(r)})\}$ as well as $\{\theta^{(r)}\}$ against the iteration number. We terminate the algorithm when the sequence $\{Q(\theta \mid \theta^{(r)})\}$ becomes stationary. Otherwise, we continue by increasing the Monte Carlo precision in the E-step, provided the calculation is computationally feasible.
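The stopping rule is described only qualitatively, so any numerical check is a design choice. One simple possibility, sketched here under the assumption that the values of $Q$ are stored in a list, is to declare stationarity when the last few values vary by less than a relative tolerance. The name `is_stationary`, the window length and the tolerance are illustrative.

```python
def is_stationary(q_values, window=5, tol=1e-3):
    """True when the last `window` values of Q differ by at most `tol`
    relative to the size of their mean (with a floor of 1)."""
    if len(q_values) < window:
        return False
    recent = q_values[-window:]
    centre = sum(recent) / window
    spread = max(recent) - min(recent)
    return spread <= tol * max(abs(centre), 1.0)
```

Because the sequence fluctuates under Monte Carlo error, the tolerance should not be set below the size of that error; if the check keeps failing, increasing the number of simulated values per case is the natural next step.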

5 Standard errors of estimates

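It is well known that the distribution of the maximum likelihood estimate $\hat\theta$ tends asymptotically to a normal distribution $MVN(\theta, V(\theta))$ under some regularity conditions. The expected Fisher information matrix $I(\hat\theta)$, which gives the inverse of the variance matrix of $\hat\theta$, is approximated by the observed information matrix $J_{\hat\theta}(Y)$:

$$V(\hat\theta)^{-1} = n E\left[ -\frac{\partial^2 \log L(\theta)}{\partial\theta^2} \right]_{\theta = \hat\theta} \propto n \int -\frac{\partial^2 \log L(\theta)}{\partial\theta^2}\, dx \approx \sum_{i=1}^{n} \left[ -\frac{\partial^2 \log L(\theta)}{\partial\theta^2} \right]_{\theta = \hat\theta} \approx n J(\hat\theta).$$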

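By using the following relation, which is obtained in [9]: observed information = complete information - missing information, we have

$$I(\hat\theta) \approx J_{\hat\theta}(Y) = -\frac{\partial^2 \log L(\theta)}{\partial\theta^2} = \left[ -\frac{\partial^2 Q(\theta \mid \theta^{(r)})}{\partial\theta^2} - \mathrm{Var}_\theta \left\{ \sum_{i=1}^{n} \frac{\partial \log L(\theta)}{\partial\theta} \right\} \right]_{\theta = \hat\theta},$$

where $\mathrm{Var}(\cdot)$ is the conditional variance given $(y, X)_{obs}, r, s$, and $\theta^{(r)}$. The details are to be provided in the reference [6].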
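To illustrate the relation in the simplest setting, the sketch below computes a standard error for the single parameter $\mu_2$, treating $\sigma_2^2$ as known and the cases as independent. The complete information is $n/\sigma_2^2$, and the missing information is the conditional variance of the score $\sum_i (x_{2i} - \mu_2)/\sigma_2^2$, estimated from simulated values of the missing $x_{2i}$. The function name `se_mu2` and its inputs are hypothetical; the full matrix version for $\theta$ follows the same pattern.

```python
import math


def se_mu2(n_observed, x2_draws, sigma2_sq):
    """Standard error of mu_2 from complete minus missing information.

    n_observed: number of cases with x_2i observed.
    x2_draws:   one list of simulated values per case with missing x_2i.
    sigma2_sq:  value of sigma_2^2, treated as known in this sketch.
    """
    n = n_observed + len(x2_draws)
    complete_info = n / sigma2_sq  # -d2Q / d mu_2^2
    missing_info = 0.0
    for d in x2_draws:
        mean_d = sum(d) / len(d)
        var_d = sum((x - mean_d) ** 2 for x in d) / len(d)
        # The score term is (x_2i - mu_2) / sigma_2^2, so its conditional
        # variance is Var(x_2i | observed data) / sigma_2^4.
        missing_info += var_d / sigma2_sq ** 2
    observed_info = complete_info - missing_info
    return 1.0 / math.sqrt(observed_info)
```

With no missing values the result reduces to the familiar $\sigma_2/\sqrt{n}$, and each missing case inflates the standard error according to how uncertain its simulated values are.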

ACKNOWLEDGEMENTS.

The authors acknowledge the financial support of the Foundation for Distinguished Young Scholars of Henan Province (084100510013).

References

[1] S. G. Baker and N. M. Laird, Regression analysis for categorical variables with outcome subject to nonignorable nonresponse, J. Am. Stat. Assoc,

[2] A. P. Dempster, N. M. Laird and D. B. Rubin, Maximum likelihood from incomplete data via the EM algorithm, J. Royal Stat. Soc. B, 1977, 39: 1-38.

[3] J. G. Ibrahim and S. R. Lipsitz, Missing covariates in generalized linear models when the missing data mechanism is non-ignorable, J. Royal Stat. Soc. B, 1999, 61: 173-190.

[4] R. J. A. Little and D. B. Rubin, Statistical Analysis with Missing Data, New York, Wiley, 2002.

[5] J. Mao and A. K. Jain, A self-organizing network for hyperellipsoidal clustering, IEEE Trans. Neural Networks, 1996, 7(1): 16-29.

[6] J. S. Park, G. Q. Qian and Y. Jun, Monte Carlo EM algorithm in logistic linear models involving non-ignorable missing data, Appl. Math. Comput., 2008, 197: 440-450.

[7] C. P. Robert and G. Casella, Monte Carlo Statistical Methods, Berlin: Springer, 1999.

[8] M. M. Rueda, S. Gonzalez and A. Arcos, Indirect methods of imputation of missing data based on available units, Appl. Math. Comput., 2005, 164: 249-261.

[9] Y. G. Smirlis and E. K. Despotis, Data envelopment analysis with missing values: An interval DEA approach, Appl. Math. Comput., 2006, 177: 1-10.