Model Selection and Estimation in Additive Regression Models


HUIPING, MIAO. Model Selection and Estimation in Additive Regression Models. (Under the direction of Dr. Daowen Zhang).

We propose a method of simultaneous model selection and estimation in additive regression models (ARMs) for independent normal data. We use the mixed model representation of the smoothing spline estimators of the nonparametric functions in ARMs, where the importance of these functions is controlled by treating the inverse of the smoothing parameters as extra variance components. The selection of important nonparametric functions is achieved by maximizing the penalized likelihood with an adaptive LASSO. A unified EM algorithm is provided to obtain the maximum penalized likelihood estimates of the nonparametric functions and the residual variance. In the same framework, we also consider forward selection based on score tests, and a two-stage approach that imposes an early stage screening using an individual score test on each induced variance component of the smoothing parameter.

For longitudinal data, we propose to extend the adaptive LASSO and the two-stage selection with score test screening to additive mixed models (AMMs) by introducing subject-specific random effects into the additive models to accommodate the correlation in responses. We use the eigenvalue-eigenvector decomposition approach to approximate the working random effects in the linear mixed model representation of the AMMs, so as to reduce the dimensions of the matrices involved in the algorithm while keeping most of the data information, and thereby tackle the computational problems caused by large sample sizes in longitudinal data.


Huiping Miao

A dissertation submitted to the Graduate Faculty of North Carolina State University

in partial fulfillment of the requirements for the Degree of

Doctor of Philosophy

Statistics

Raleigh, North Carolina 2009

APPROVED BY:

Dr. Marie Davidian Dr. Hao Zhang

Dr. Daowen Zhang Dr. Dennis Boos


DEDICATION


BIOGRAPHY

Huiping Miao grew up in Beijing, China. She graduated from Peking University in 2005 with a BS in statistics. That summer, she moved to Raleigh, North Carolina. At North Carolina State University (NCSU), Huiping completed the Master of Statistics in 2007. During her studies as a PhD student, she worked as a statistician in Clinical Operations at Talecris Therapeutics for a year. She taught an introductory statistics course to undergraduate students at NCSU. She also participated in a collaborative research project between NCSU and Qingdao Meteorological Bureau in supporting the sailing events of the 2008 Beijing Olympics. In 2009, she completed the requirements of a PhD in the Department of Statistics at North Carolina State University.


ACKNOWLEDGMENTS

Warmest thanks to Dr. Daowen Zhang, my advisor, for his guidance, inspiration and patience. To my committee members for their constructive comments. Also to Dr. Sujit Ghosh, whose attitude towards work and life I admire.

Special thanks to my good friends Ani Eloyan, Funda Gunes, Wanying Li and Liping Li for their encouragement and enthusiasm, and for sharing the ups and downs of all these years in graduate school.

My grateful thanks to my wonderful parents, Tong Miao and Ye Di, who always believe in me and give me the courage to pursue my dreams.

Thank you to my husband, Shuang Du, for his eternal love and support. Without him, this would never have come true.


TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
1 Introduction
1.1 A Review of Variable Selection Methods
1.1.1 Linear Models and Variable Selection
1.1.2 Nonparametric Models and Model Selection
1.2 Estimating a Nonparametric Function Using a Smoothing Spline
1.3 The Linear Mixed Model Representation
1.4 Dissertation Outline
2 Model Selection and Estimation in Additive Regression Models
2.1 Introduction
2.2 The Additive Regression Models and The Linear Mixed Model Representation
2.2.1 The Additive Regression Models
2.2.2 The Linear Mixed Model Representation
2.3 The Adaptive LASSO for Additive Regression Models
2.3.1 Methodology
2.3.2 Algorithm
2.4 Forward Selection Based on Score Tests
2.4.1 The Forward Selection Idea
2.4.2 The Test Statistic and Its Distribution
2.4.3 The Cutoff Value α
2.5 Two-Stage Selection: The Adaptive LASSO with Score Test Screening
2.6 Simulation Studies
2.7 The Boston Housing Example
2.8 Summary
3 The Adaptive LASSO and Two-Stage Selection with Score Test Screening for Longitudinal Data
3.1 Introduction
3.2 The Additive Mixed Models for Longitudinal Data
3.2.1 The Model and Its Linear Mixed Model Representation
3.2.2 The Computational Challenge
3.3 Eigenvalue-Eigenvector Decomposition
3.4 The Adaptive LASSO in Additive Mixed Models
3.5 Two-Stage Selection with Score Test Screening
3.6 Simulation Studies
3.7 The PTHRP Example
3.8 Summary
4 Conclusion
Bibliography


LIST OF TABLES

Table 2.1 The frequency of appearance of the covariates in selected models. The covariates are independent, and $\varepsilon \sim N(0, \sigma_e^2)$. Covariates $x_1$ to $x_4$ are informative.

Table 2.2 The average size of selected models over 100 runs. The covariates are independent. The true model is of size 4. The standard error of the average size is in the range (0.04, 0.11).

Table 2.3 The average model error over 100 runs using quadratic ($h = 1$) and cubic ($h = 2$) smoothing splines. The covariates are independent. The standard error of the average model error is in the range (0.01, 0.09).

Table 2.4 The frequency of appearance of the covariates in selected models. The covariates are correlated with (trimmed) AR(1) covariance, and $\varepsilon \sim N(0, \sigma_e^2)$. Covariates $x_1$, $x_3$, $x_8$ and $x_9$ are informative.

Table 2.5 The average size of selected models over 100 runs. The covariates are correlated with (trimmed) AR(1) covariance, and $\varepsilon \sim N(0, \sigma_e^2)$. The true model is of size 4. The standard error of the average size is in the range (0.08, 0.16).

Table 2.6 The average model error over 100 runs using quadratic ($h = 1$) and cubic ($h = 2$) smoothing splines. The covariates are correlated with (trimmed) AR(1) covariance, and $\varepsilon \sim N(0, \sigma_e^2)$. The standard error of the average model error is in the range (0.01, 0.26).

Table 3.1 The frequency of appearance of the covariates in selected models using different proportions ($p_0$) of data information. The covariates are independent, with $\varepsilon_{ij} \sim N(0, \sigma_e^2)$ and $b_i \sim N(0, \sigma_b^2)$. Covariates $x_1$ to $x_4$ are informative.

Table 3.2 The average size of selected models over 100 runs using different proportions ($p_0$) of data information. The covariates are independent, with $\varepsilon_{ij} \sim N(0, \sigma_e^2)$ and $b_i \sim N(0, \sigma_b^2)$. The true model is of size 4. The standard error of the average model

Table 3.3 The average model error over 100 runs using different proportions ($p_0$) of data information. The covariates are independent, with $\varepsilon_{ij} \sim N(0, \sigma_e^2)$ and $b_i \sim N(0, \sigma_b^2)$. The standard error of the average model error is in the range (0.02, 0.03).

Table 3.4 The frequency of appearance of the covariates in selected models using different proportions ($p_0$) of data information. The covariates are (trimmed) AR(1), with $\varepsilon_{ij} \sim N(0, \sigma_e^2)$ and $b_i \sim N(0, \sigma_b^2)$. Covariates $x_1$, $x_3$, $x_8$ and $x_9$ are informative.

Table 3.5 The average size of selected models over 100 runs using different proportions ($p_0$) of data information. The covariates are (trimmed) AR(1), with $\varepsilon_{ij} \sim N(0, \sigma_e^2)$ and $b_i \sim N(0, \sigma_b^2)$. The true model is of size 4. The standard error of the average model size is in the range (0.04, 0.07).

Table 3.6 The average model error over 100 runs using different proportions ($p_0$) of data information. The covariates are (trimmed) AR(1), with $\varepsilon_{ij} \sim N(0, \sigma_e^2)$ and $b_i \sim N(0, \sigma_b^2)$.


LIST OF FIGURES

Figure 2.1 The estimated nonparametric functions using the adaptive LASSO, averaged over 100 simulation runs. The covariates are independent, and $\varepsilon \sim N(0, 1.74)$. The solid (blue) lines are the true underlying functions. The dot-dashed (black) lines are estimates when $h = 1$. The dashed (red) lines are estimates when $h = 2$.

over 100 simulation runs. The covariates are independent, and $\varepsilon \sim N(0, 3.48)$. The

over 100 simulation runs. The covariates are independent, and $\varepsilon \sim N(0, 8)$. The

Figure 2.4 The estimated nonparametric functions using the adaptive LASSO, averaged over 100 simulation runs. The covariates are correlated with (trimmed) AR(1) covariance, and $\varepsilon \sim N(0, 1.74)$. The solid (blue) lines are the true underlying functions. The dot-dashed (black) lines are estimates when $h = 1$. The dashed (red) lines are estimates when $h = 2$.

Figure 2.5 The estimated nonparametric functions using the adaptive LASSO, averaged over 100 simulation runs. The covariates are correlated with (trimmed) AR(1) covariance, and $\varepsilon \sim N(0, 3.48)$. The solid (blue) lines are the true underlying

Figure 2.6 The estimated functions of selected covariates in the Boston housing example. The dashed (blue) lines represent the estimates using the quadratic smoothing splines ($h = 1$). The solid (red) lines represent the estimates using the cubic smoothing splines ($h = 2$).

over 100 simulation runs. The covariates are independent, with $\varepsilon_{ij} \sim N(0, 1.74)$, $b_i \sim N(0, 3)$. The solid (blue) lines are the true underlying functions. The dashed (red) lines represent estimates when $p_0 = 0.98$. The dotted (dark green) lines represent estimates when $p_0 = 0.95$. The dot-dashed (black) lines represent estimates

over 100 simulation runs. The covariates are independent, with $\varepsilon_{ij} \sim N(0, 3.48)$,

when $p_0 = 0.90$.

Figure 3.3 The estimated nonparametric functions using the adaptive LASSO, averaged over 100 simulation runs. The covariates are (trimmed) AR(1), with $\varepsilon_{ij} \sim N(0, 1.74)$, $b_i \sim N(0, 3)$. The solid (blue) lines are the true underlying functions. The dashed (red) lines represent estimates when $p_0 = 0.98$. The dotted (dark green) lines represent estimates when $p_0 = 0.95$. The dot-dashed (black) lines represent estimates when $p_0 = 0.90$.

over 100 simulation runs. The covariates are trimmed AR(1), with $\varepsilon_{ij} \sim N(0, 3.48)$,

when $p_0 = 0.90$.

Figure 3.5 The estimated nonparametric functions associated with the selected covariates for the PTHRP data. The adaptive LASSO with $p_0 = 0.95$ is used.

Figure .1 The estimated nonparametric functions using forward selection with $\alpha = 0.05$, averaged over 100 simulation runs. The covariates are independent, and $\varepsilon \sim N(0, 1.74)$. The solid (blue) lines are the true underlying functions. The dot-dashed (black) lines are estimates when $h = 1$. The dashed (red) lines are estimates when $h = 2$.

0.1, averaged over 100 simulation runs. The covariates are independent, and $\varepsilon \sim$

$h = 2$.

Figure .3 The estimated nonparametric functions using the two-stage selection with score test screening ($\alpha = 0.5$), averaged over 100 simulation runs. The covariates are independent, and $\varepsilon \sim N(0, 1.74)$. The solid (blue) lines are the true underlying

$h = 2$.

0.1, averaged over 100 simulation runs. The covariates are independent, and $\varepsilon \sim$

$h = 2$.

score test screening ($\alpha = 0.5$). The covariates are independent, and $\varepsilon \sim N(0, 3.48)$. The solid (blue) lines are the true underlying functions. The dot-dashed (black) lines are estimates when $h = 1$. The dashed (red) lines are estimates when $h = 2$.

0.05, averaged over 100 simulation runs. The covariates are independent, and $\varepsilon \sim N(0, 8)$. The solid (blue) lines are the true underlying functions. The dot-dashed

$h = 2$.

Figure .8 The estimated nonparametric functions using forward selection with $\alpha = 0.1$, averaged over 100 simulation runs. The covariates are independent, and $\varepsilon \sim N(0, 8)$.

score test screening ($\alpha = 0.5$). The covariates are independent, and $\varepsilon \sim N(0, 8)$.

0.05, averaged over 100 simulation runs. The covariates are correlated with (trimmed) AR(1) covariance, and $\varepsilon \sim N(0, 1.74)$. The solid (blue) lines are the true underlying functions. The dot-dashed (black) lines are estimates when $h = 1$. The dashed

averaged over 100 simulation runs. The covariates are correlated with (trimmed) AR(1) covariance, and $\varepsilon \sim N(0, 1.74)$. The solid (blue) lines are the true underlying functions. The dot-dashed (black) lines are estimates when $h = 1$. The dashed (red) lines are estimates when $h = 2$.

are correlated with (trimmed) AR(1) covariance, and $\varepsilon \sim N(0, 1.74)$. The solid (blue) lines are the true underlying functions. The dot-dashed (black) lines are estimates when $h = 1$. The dashed (red) lines are estimates when $h = 2$.

are correlated with (trimmed) AR(1) covariance, and $\varepsilon \sim N(0, 1.74)$. The solid (blue) lines are the true underlying functions. The dot-dashed (black) lines are

0.05, averaged over 100 simulation runs. The covariates are correlated with (trimmed) AR(1) covariance, and $\varepsilon \sim N(0, 3.48)$. The solid (blue) lines are the true underlying functions. The dot-dashed (black) lines are estimates when $h = 1$. The dashed

averaged over 100 simulation runs. The covariates are correlated with (trimmed) AR(1) covariance, and $\varepsilon \sim N(0, 3.48)$. The solid (blue) lines are the true underlying functions. The dot-dashed (black) lines are estimates when $h = 1$. The dashed (red) lines are estimates when $h = 2$.

score test screening ($\alpha = 0.75$), averaged over 100 simulation runs. The covariates are independent, with $\varepsilon_{ij} \sim N(0, 1.74)$, $b_i \sim N(0, 3)$. The solid (blue) lines are the true underlying functions. The dashed (red) lines represent estimates when $p_0 = 0.98$. The dotted (dark green) lines represent estimates when $p_0 = 0.95$. The dot-dashed (black) lines represent estimates when $p_0 = 0.90$.

are independent, with $\varepsilon_{ij} \sim N(0, 3.48)$, $b_i \sim N(0, 7)$. The solid (blue) lines are

are (trimmed) AR(1), with $\varepsilon_{ij} \sim N(0, 1.74)$, $b_i \sim N(0, 3)$. The solid (blue) lines are the true underlying functions. The dashed (red) lines represent estimates when

are trimmed AR(1), with $\varepsilon_{ij} \sim N(0, 3.48)$, $b_i \sim N(0, 7)$. The solid (blue) lines are the true underlying functions. The dashed (red) lines represent estimates when


Chapter 1

Introduction

With the rapid development of technology and massive amounts of information becoming accessible, the ability to handle high-dimensional data is increasingly desired, especially in genetics, environmental sciences and medical studies. Usually, researchers build regression models to study the relationship between a response variable and a number of covariates. However, not all the covariates contain useful information about the response variable. It is important to identify the most effective subset of informative covariates, which helps researchers determine the active factors in their models. Moreover, regression problems involve a trade-off between bias and variance: a smaller estimation bias comes at the cost of larger prediction variability. It is well known that the more coefficients are included and estimated in a model, the larger the variance of the predicted value will be. From this point of view, it is important to eliminate the covariates that contribute little to the prediction of the variable of interest. Finally, models that are easier to interpret are always preferred. When a model has many variables, it may be so hard to interpret that its usefulness becomes doubtful both theoretically and practically. Hence, efficient model selection methods are of great necessity.

1.1 A Review of Variable Selection Methods

1.1.1 Linear Models and Variable Selection

Consider the regression model

$$y_i = f(T_i) + \varepsilon_i, \quad i = 1, \dots, n, \tag{1.1}$$

where $y_i$ is the observed response, $T_i = (t_{i1}, \dots, t_{ip})^T$ is a $p$-dimensional vector containing the explanatory variables, and $n$ is the total number of observations. The error terms $\varepsilon_i$ are usually assumed to be independent and identically distributed from the normal distribution $N(0, \sigma^2)$.

Linear regression modeling is one of the most widely used approaches to fitting a regression model. Linear regression models assume a linear relationship between the response variable and the explanatory variables. In particular, the function $f(T_i)$ is approximated by a linear expression

$$f(T_i) = \beta_0 + \sum_{j=1}^{p} \beta_j t_{ij}.$$

Linear regression models are easy to fit and have nice interpretation. With these two advantages, they are commonly adopted in practice.

In the literature, many variable selection methods for linear models have been proposed. The classical variable selection procedures that are widely used in practice include the best subset selection methods using the Mallows' $C_p$ (Mallows, 1973), AIC (Akaike, 1973) and BIC (Schwarz, 1978) criteria, and the sequential selection techniques known as forward selection, backward elimination and stepwise selection. However, these methods suffer from a lack of stability and accuracy (Breiman, 1995). These discrete processes either drop a covariate from the model or retain it. As a result, a small perturbation in the data can result in very different models and can reduce prediction accuracy (Tibshirani, 1996).

A family of shrinkage approaches has emerged in the literature. They are relatively stable and continuous processes that shrink negligible coefficients to zero and retain the ones of relatively large magnitude. Hence, the covariates corresponding to the nonzero coefficients are selected into the model and considered important. Well-known members of this family are the bridge regression (Frank and Friedman, 1993), the nonnegative garrote (Breiman, 1995), the LASSO (Tibshirani, 1996), the SCAD (Fan and Li, 2001), the elastic net (Zou and Hastie, 2005) and the adaptive LASSO (Zou, 2006). Zhang and Lu (2007) extended the adaptive LASSO to Cox's proportional hazards models. Among these methods, the adaptive LASSO is closely related to our new proposals, and it is briefly reviewed in the following.

The adaptive LASSO (Zou, 2006) modifies the LASSO by applying adaptive weights for penalizing different coefficients. Consider the following linear regression model

$$y_i = \beta_0 + \sum_{j=1}^{p} \beta_j t_{ij} + \varepsilon_i, \quad i = 1, \dots, n.$$

Denote $y = (y_1, \dots, y_n)^T$, $t_0 = (1, \dots, 1)^T_{n \times 1}$, and $t_j = (t_{1j}, \dots, t_{nj})^T$, $j = 1, \dots, p$. Let $\hat{\beta}$ denote a root-$n$-consistent estimator of $\beta$; for example, the ordinary least squares estimator. The adaptive LASSO estimates $\beta$ by

$$\hat{\beta}^{*(n)} = \arg\min_{\beta} \Big\| y - \sum_{j=0}^{p} t_j \beta_j \Big\|^2 + \lambda_n \sum_{j=1}^{p} \hat{w}_j |\beta_j|, \tag{1.2}$$

where $\lambda_n$ is a tuning parameter. For given $\gamma > 0$, $\hat{w} = (\hat{w}_1, \dots, \hat{w}_p)^T = 1/|\hat{\beta}|^{\gamma}$ is the weight vector, with the operations taken componentwise. Zou (2006) pointed out that the above is in fact a convex optimization problem, thus the global minimizer can be obtained easily. Also, it is shown that the adaptive LASSO enjoys the oracle properties.
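To make criterion (1.2) concrete, the following is a minimal sketch of one way to minimize it numerically. It assumes the ordinary least squares estimator as the initial $\hat{\beta}$ and uses cyclic coordinate descent with soft-thresholding, which is a convenient stand-in chosen for illustration rather than the algorithm used in Zou (2006). The function names `soft_threshold` and `adaptive_lasso` are hypothetical.

```python
import numpy as np


def soft_threshold(z, t):
    """Soft-thresholding operator: shrink z toward zero by t."""
    return np.sign(z) * max(abs(z) - t, 0.0)


def adaptive_lasso(X, y, lam, gamma=1.0, n_iter=200):
    """Minimize ||y - b0 - X b||^2 + lam * sum_j w_j |b_j| by coordinate descent.

    Assumption: the weights w_j = 1 / |b_ols_j|^gamma come from an OLS fit,
    and the intercept b0 is left unpenalized, as in (1.2).
    """
    # Centering removes the unpenalized intercept from the optimization.
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()

    # Root-n-consistent initial estimator: ordinary least squares.
    beta_init, *_ = np.linalg.lstsq(Xc, yc, rcond=None)
    w = 1.0 / np.abs(beta_init) ** gamma

    beta = beta_init.copy()
    col_ss = (Xc ** 2).sum(axis=0)  # x_j' x_j for each column
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            # Partial residual that excludes the contribution of covariate j.
            r_j = yc - Xc @ beta + Xc[:, j] * beta[j]
            rho = Xc[:, j] @ r_j
            # The squared error carries no factor 1/2, so the threshold is lam * w_j / 2.
            beta[j] = soft_threshold(rho, lam * w[j] / 2.0) / col_ss[j]

    beta0 = y.mean() - X.mean(axis=0) @ beta
    return beta0, beta
```

With `lam=0` the sketch returns the OLS fit, and as `lam` grows the coefficients with small initial estimates receive large weights and are set exactly to zero first, which is the selection behavior described above.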

In addition, a relatively new family of variable selection methods is the False Selection Rate (FSR) approaches introduced by Wu, Boos, and Stefanski (2007). Their strategy is to specify a target false selection rate, $\gamma = \gamma_0$, and to adjust the tuning parameters of the selection method so that $\gamma_0$ is the achieved false selection rate.

1.1.2 Nonparametric Models and Model Selection

Linear models have the advantage of fast computation and easy interpretation. However, they are restricted by the assumption of linearity. It is assumed that there is a linear relationship between the response variable and the explanatory variables, and the mean of the response variable $y_i$ is modeled as a linear function $\sum_{j=1}^{p} \beta_j t_{ij}$ of a set of covariates $\{t_{i1}, \dots, t_{ip}\}$. But linearity is not always a proper assumption. If the response variable and the covariates are not linearly related, linear modeling will suffer from model misspecification problems, hence the model fitting procedure and prediction results will be misleading.

In the past few decades, the trend has been to relax the assumption of linearity and build models in a more nonparametric fashion. For $p$ covariates $T_i = (t_{i1}, \dots, t_{ip})^T$, a nonparametric regression model takes the form

$$y_i = f(T_i) + \varepsilon_i, \quad i = 1, \dots, n, \tag{1.3}$$

where $f(\cdot)$ is an unspecified smooth function. To estimate this function, any scatterplot smoother may be used, for example a running least squares line, a running mean, a running median, a kernel estimate, or a spline. The details of scatterplot smoothers are thoroughly described in the literature.

Hastie and Tibshirani (1986) introduced the class of additive models, where the linear form $\sum \beta_j t_{ij}$ is replaced by a sum of unspecified smooth functions $\sum f_j(t_{ij})$. In the context of regression models, the smooth extension has the form

$$y_i = f_0 + \sum_{j=1}^{p} f_j(t_{ij}) + \varepsilon_i, \quad i = 1, \dots, n, \tag{1.4}$$

where the smooth functions $f_j(\cdot)$ are standardized so that $E\{f_j(t_{ij})\} = 0$. Denote $y = (y_1, \dots, y_n)^T$ and $t_j = (t_{1j}, \dots, t_{nj})^T$. The estimation of the nonparametric smooth functions can be achieved via the local scoring algorithm. It uses scatterplot smoothers, such as the running lines smoother, to generalize the usual Fisher scoring procedure for computing maximum likelihood estimates.

The additive models are more flexible than linear models, since they only assume an additive relation between the response and the covariates, relaxing the linearity restriction. Yet they keep the advantage of easy interpretation of the contribution of each covariate. It is well known that the curse of dimensionality is a main issue in fitting nonparametric models with multiple covariates. In order to keep the variance of the estimate under control, one has to reach further to find enough points in a neighborhood in high dimensions, which can result in severe bias. Additive models, by contrast, are able to avoid this problem. Under the additivity assumption, smoothing for each function $f_j(\cdot)$ is on a single coordinate. Thus, one can include enough points locally in each coordinate and keep the variance of the estimates down (Hastie and Tibshirani, 1986). The linear models and the additive models are often combined and used together in practice.
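The point that each $f_j(\cdot)$ is smoothed on a single coordinate can be illustrated with a short sketch. It fits model (1.4) by backfitting, a standard iterative scheme for Gaussian additive models that is used here as a simple stand-in for the local scoring algorithm, with a Gaussian kernel smoother as the one-dimensional scatterplot smoother. The function names `kernel_smooth` and `backfit`, the bandwidth value and the fixed number of sweeps are all assumptions made for illustration.

```python
import numpy as np


def kernel_smooth(x, r, bandwidth):
    """Nadaraya-Watson smoother of r against the single coordinate x."""
    d = (x[:, None] - x[None, :]) / bandwidth
    W = np.exp(-0.5 * d ** 2)          # Gaussian kernel weights
    return (W @ r) / W.sum(axis=1)     # weighted local averages


def backfit(X, y, bandwidth=0.1, n_iter=20):
    """Backfitting for y_i = f_0 + sum_j f_j(t_ij) + e_i.

    Returns the intercept f_0 and an n x p matrix whose jth column holds
    the fitted values of f_j at the observed covariate values.
    """
    n, p = X.shape
    f0 = y.mean()
    F = np.zeros((n, p))
    for _ in range(n_iter):
        for j in range(p):
            # Partial residual: remove the intercept and all other components.
            partial = y - f0 - F.sum(axis=1) + F[:, j]
            # Smooth on coordinate j only, so each fit is one-dimensional.
            F[:, j] = kernel_smooth(X[:, j], partial, bandwidth)
            # Center the component, matching the constraint E{f_j} = 0.
            F[:, j] -= F[:, j].mean()
    return f0, F
```

Every call to the smoother involves a single covariate, so the number of points available in a local neighborhood does not shrink as more covariates are added.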

Model selection is more challenging for nonparametric models than for linear models. In the context of linear models, linear forms are assumed so that the importance of each covariate $t_j$ is fully reflected by the absolute magnitude of its corresponding regression coefficient $\beta_j$. A covariate can therefore be excluded from the model when its estimated coefficient $\hat{\beta}_j$ is negligible. However, this is not the case in nonparametric models. Since the functional form of each covariate is left completely unspecified, we need to estimate $f_j(\cdot)$ as a whole functional component, and only when $\hat{f}_j(\cdot)$ is set to be a zero function can we exclude covariate $t_j$ from the model.

So far, few model selection methods for nonparametric models are available in the literature. Several of them have been developed in the smoothing spline analysis of variance (SS-ANOVA) framework (Wahba et al., 1995). The SS-ANOVA model has the following form

$$f(T_i) = b + \sum_{j=1}^{p} f_j(t_{ij}) + \sum_{j<k} f_{jk}(t_{ij}, t_{ik}) + \cdots, \tag{1.5}$$

where $b$ is a constant, the $f_j(\cdot)$'s are main effects, the $f_{jk}(\cdot, \cdot)$'s are two-way interactions, and so on.

The multivariate adaptive regression spline (MARS; Friedman, 1991) builds up functional ANOVA models using forward selection and pruning techniques. It also works as a variable selection method, and is one of the best known in the nonparametric setting. Gu (1992) illustrated a set of model-checking techniques for additive models using cosine diagnostics. Gunn and Kandola (2002) introduced a sparse kernel non-linear modeling approach. Since sparsity of the coefficients in SS-ANOVA models is not guaranteed, a separate model selection step has to be applied after model fitting (Lin and Zhang, 2006). In recent years, several new methods for the SS-ANOVA models or the additive models have been proposed.

Lin and Zhang (2006) proposed the component selection and smoothing operator (COSSO), which has become a popular model selection and model fitting method in nonparametric regression models. It is a regularization method with the penalty being the sum of component norms. Suppose the data contain $n$ observations. Consider the nonparametric regression model (1.3). Under the SS-ANOVA model framework (1.5), the COSSO finds $f \in \mathcal{F}$ (where $\mathcal{F}$ is a reproducing kernel Hilbert space (RKHS) corresponding to decomposition (1.5)) that minimizes

$$\frac{1}{n} \sum_{i=1}^{n} \{y_i - f(T_i)\}^2 + \tau_n^2 J(f), \quad \text{with } J(f) = \sum_{\alpha=1}^{q} \|P^{\alpha} f\|, \tag{1.6}$$

where $\tau_n$ is a smoothing parameter. Let $\mathcal{F}^1, \dots, \mathcal{F}^q$ denote the $q$ orthogonal subspaces of $\mathcal{F}$, and let $P^{\alpha} f$ denote the orthogonal projection of $f$ onto $\mathcal{F}^{\alpha}$. In the additive model, $q = p$ and the $\mathcal{F}^{\alpha}$'s are the main effect spaces. In a two-way interaction model, there are $p$ main effect spaces and $p(p-1)/2$ two-way interaction spaces, thus $q = p(p+1)/2$.

A new class of sparse additive models (SpAMs) (Ravikumar et al., 2008) was introduced for high-dimensional nonparametric regression and classification. Based on the additive model (1.4), SpAMs impose constraints on the functions $f_j(\cdot)$, which result in simultaneous smoothness of each component function as well as sparsity across components, so as to effectively fit models with a large number of covariates. Denote by $\mathcal{H}_j$ the Hilbert space of measurable functions $f_j(t_j)$ such that $E\{f_j(t_j)\} = 0$ and $E\{f_j^2(t_j)\} < \infty$, with the inner product $\langle f_j, f_j^* \rangle = E\{f_j(t_j) f_j^*(t_j)\}$. Also, let $f_j(\cdot) = \beta_j g_j(\cdot)$, where $g_j(\cdot)$ is a function scaled by $\beta_j$. The SpAMs have the form

$$\begin{aligned}
\min_{\beta \in \mathbb{R}^p,\, g_j \in \mathcal{H}_j} \quad & E\Big\{ y - \sum_{j=1}^{p} \beta_j g_j(t_j) \Big\}^2 \\
\text{subject to} \quad & \sum_{j=1}^{p} |\beta_j| \le L, \\
& E(g_j^2) = 1, \quad j = 1, \dots, p, \\
& E(g_j) = 0, \quad j = 1, \dots, p.
\end{aligned}$$

The constraint on the $\beta_j$'s encourages sparsity of $\beta$, as for the parametric LASSO. Hence many of the component functions $f_j(\cdot)$ will be set to zero.

Following the SpAMs, the sparsity-smoothness penalty approach (Meier, van de Geer and Buhlmann, 2009) was proposed for high-dimensional additive models to tackle the problem that function estimates are often too wiggly when the true underlying functions are very smooth. In model (1.4), besides controlling sparsity, this approach places restrictions on smoothness by penalizing the least squares with the sparsity-smoothness penalty (with some abuse of notation, $f_j = f_j(t_j)$)

$$J(f_j) = \lambda_1 \sqrt{\|f_j\|_n^2 + \lambda_2 I^2(f_j)},$$

with

$$I^2(f_j) = \int \{f_j''(t)\}^2 \, dt,$$

where $\lambda_1$ and $\lambda_2$ are tuning parameters. The norm $\|f_j\|_n^2$ encourages sparsity at the function level, while $I^2(f_j)$ measures the smoothness of $f_j$. The estimators $\hat{f}_1, \dots, \hat{f}_p$ are the solutions of

$$\min_{f_1, \dots, f_p \in \mathcal{F}} \Big\| y - \sum_{j=1}^{p} f_j \Big\|_n^2 + \sum_{j=1}^{p} J(f_j), \tag{1.7}$$

where $\mathcal{F}$ denotes a suitable class of functions.

1.2 Estimating a Nonparametric Function Using a Smoothing Spline

In this section, our focus is on the estimation of a one dimensional nonparametric function using the smoothing spline method. A smoothing spline is a regularization method with the model complexity controlled by a smoothing parameter.

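We briefly introduce the reproducing kernel Hilbert space (RKHS). More discussion of RKHSs and their applications in the smoothing spline framework can be found in Wahba (1990) and Gu (2002). First of all, we give several basic definitions. A functional in a linear space $\mathcal{L}$ is a mapping of an element in $\mathcal{L}$ to a real number in the real line $\mathbb{R}$. For any $f, g \in \mathcal{L}$ and $\alpha \in \mathbb{R}$, a linear functional $L$ in $\mathcal{L}$ satisfies $L(f + g) = L(f) + L(g)$ and $L(\alpha f) = \alpha L(f)$. A bilinear form $J\langle \cdot, \cdot \rangle : \mathcal{L} \times \mathcal{L} \to \mathbb{R}$ satisfies $J\langle \alpha f + \beta g, h \rangle = \alpha J\langle f, h \rangle + \beta J\langle g, h \rangle$ and $J\langle f, \alpha g + \beta h \rangle = \alpha J\langle f, g \rangle + \beta J\langle f, h \rangle$, where $f, g, h \in \mathcal{L}$ and $\alpha, \beta \in \mathbb{R}$. An inner product is a positive definite bilinear form, denoted by $\langle \cdot, \cdot \rangle$. It defines a norm on the linear space $\mathcal{L}$ by $\|f\| = \langle f, f \rangle^{1/2}$. A space $\mathcal{L}$ equipped with an inner product $\langle \cdot, \cdot \rangle_{\mathcal{L}}$ is called an inner product space.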

A Hilbert space $\mathcal{C}$ is a complete inner product space. An important property of Hilbert spaces is the Riesz representation theorem, which states that for every continuous linear functional $L$ in a Hilbert space $\mathcal{C}$, there exists a unique $g_L \in \mathcal{C}$ such that for any $f \in \mathcal{C}$, $L(f) = \langle g_L, f \rangle_{\mathcal{C}}$. In this case, $g_L$ is called a representer of the linear functional $L$.

Let $\mathcal{H}$ be a Hilbert space of real-valued functions on a domain $\mathcal{T}$. An evaluation functional $L_t$ maps $f$ to $f(t)$, namely $L_t(f) = f(t)$, for any $t \in \mathcal{T}$. A functional $L$ in $\mathcal{H}$ is bounded if there exists a constant $M$ such that $|L(f)| \le M \|f\|_{\mathcal{H}}$ for all $f \in \mathcal{H}$. A Hilbert space $\mathcal{H}$ is said to be a reproducing kernel Hilbert space (RKHS) if the evaluation functional is bounded.

Consider the nonparametric regression model

$$y_i = f(t_i) + \varepsilon_i, \quad i = 1, \dots, n, \tag{1.8}$$

where $y_i$ is the observation of the $i$th of $n$ units and $t_i$ is a one-dimensional covariate corresponding to the $i$th unit. Without loss of generality, we assume the $t_i$'s are distinct and $0 \le t_1 < \cdots < t_n \le 1$. The error terms $\varepsilon_i$ are assumed to be independent and identically distributed from the normal distribution $N(0, \sigma^2)$. Let $f^{(h)}(t)$ denote the $h$th derivative of $f(t)$. Suppose $f(\cdot)$ is a smooth function from the function space

$$W_h = \Big\{ g(t) \,\Big|\, g(t), g'(t), \dots, g^{(h-1)}(t) \text{ absolutely continuous}, \int_0^1 \{g^{(h)}(t)\}^2 \, dt < \infty \Big\}.$$

Being equipped with the inner product

$$\langle f, g \rangle = \sum_{v=0}^{h-1} f^{(v)}(0) g^{(v)}(0) + \int_0^1 f^{(h)}(t) g^{(h)}(t) \, dt,$$

the space $W_h$ is an RKHS.

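Let $\mathcal{H} = W_h$. We decompose $\mathcal{H} = \mathcal{H}_0 \oplus \mathcal{H}_1$, where $\mathcal{H}_0$ consists of the constant functions in $\mathcal{H}$, and $\mathcal{H}_1$ is the complement subspace of $\mathcal{H}_0$. The smoothing spline method obtains the estimate of the nonparametric function $f(\cdot)$ in model (1.8) by solving

$$\min_{f \in \mathcal{H}} \left[ \frac{1}{n} \sum_{i=1}^{n} \{y_i - f(t_i)\}^2 + \lambda \|P_1 f\|_{\mathcal{H}}^2 \right], \tag{1.9}$$

where $\lambda > 0$ is the smoothing parameter, and $P_1$ projects $f$ onto $\mathcal{H}_1$. One can show that an equivalent expression for (1.9) is

$$\min_{f \in \mathcal{H}} \left[ \frac{1}{n} \sum_{i=1}^{n} \{y_i - f(t_i)\}^2 + \lambda \int_0^1 \{f^{(h)}(t)\}^2 \, dt \right]. \tag{1.10}$$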

In the penalized least squares, the smoothing parameter $\lambda$ controls the trade-off between the goodness of fit of the model and the roughness of $f(\cdot)$. When $\lambda = 0$, the smoothing spline fit would be a data interpolation; when $\lambda$ goes to infinity, it would be a polynomial with order no greater than $(h - 1)$. So, choosing a good smoothing parameter is important. The minimizer of the given penalized least squares is the smoothing spline estimator.

The form of the minimizer was provided by Kimeldorf and Wahba (1971). The $h$th-order smoothing spline estimator has the following form

$$f(t) = \sum_{j=1}^{h} \delta_j \phi_j(t) + \sum_{i=1}^{n} a_i R_h(t, t_i), \tag{1.11}$$

where $\{\phi_j(t)\}_{j=1}^{h}$ is a basis for the space of polynomials of order $(h - 1)$ (e.g., $\phi_j(t) = t^{j-1}/(j-1)!$, $j = 1, \dots, h$) and $R_h(t, s)$ is defined by

$$R_h(t, s) = \frac{1}{[(h-1)!]^2} \int_0^1 (s - u)_+^{h-1} (t - u)_+^{h-1} \, du,$$

where $(s - u)_+ = s - u$ if $s \ge u$ and 0 otherwise. When $h = 1$, the estimator is a quadratic smoothing spline with $R_1(t, s) = \min(t, s)$. When $h = 2$, it is a cubic smoothing spline with $R_2(t, s) = \frac{t^2}{2}(s - t) + \frac{t^3}{3}$ (assuming $s > t$).
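As a check on the two closed forms, the integral defining $R_h(t, s)$ can be evaluated directly. The short derivation below assumes $s > t$, so that both truncated terms are nonzero only for $u \le t$, and it adopts the convention that $(s - u)_+^0$ equals 1 when $u \le s$ and 0 otherwise.

```latex
\begin{align*}
R_1(t, s) &= \int_0^1 \mathbf{1}\{u \le s\}\, \mathbf{1}\{u \le t\} \, du
           = \int_0^{t} du = t = \min(t, s), \\
R_2(t, s) &= \int_0^{t} (s - u)(t - u) \, du
           = \int_0^{t} \{ st - (s + t) u + u^2 \} \, du \\
          &= s t^2 - \frac{(s + t)\, t^2}{2} + \frac{t^3}{3}
           = \frac{t^2}{2}(s - t) + \frac{t^3}{3}.
\end{align*}
```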

1.3 The Linear Mixed Model Representation

We introduce the connection between the smoothing spline estimators and the linear mixed models. Taking advantage of the linear mixed model representation, we can adapt many well studied theories and methods in mixed models.

Under the smoothing spline estimate of $f(t)$ in (1.11), let $\delta = (\delta_1, \dots, \delta_h)^T$ and $a = (a_1, \dots, a_n)^T$. Denote by $f$ the vector of $f(t)$ evaluated at $t_1, \dots, t_n$. Then, $f$ can be written as

$$f = T\delta + \Sigma a,$$

where $T$ is an $n \times h$ matrix with the $(k, l)$th element equal to $\phi_l(t_k)$ and $\Sigma$ is a positive definite matrix with the $(k, l)$th element equal to $R_h(t_k, t_l)$. Furthermore, the penalty term in the penalized least squares (1.10) has the expression

$$\lambda \int_0^1 \{f^{(h)}(t)\}^2 \, dt = \lambda a^T \Sigma a,$$

which suggests that we can treat $a$ as a random effect distributed as $N(0, \tau \Sigma^{-1})$ with $\tau = 1/\lambda$.

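Therefore, model (1.8) can be written in matrix form as

$$\begin{aligned}
y &= f + \varepsilon \\
  &= T\delta + \Sigma a + \varepsilon,
\end{aligned} \tag{1.12}$$

where $y = (y_1, \dots, y_n)^T$, $\delta$ is the fixed effect, and $a$ can be treated as the random effect with variance-covariance matrix $\tau \Sigma^{-1}$. The estimate $\hat{f}$ of $f$ can be obtained by the best linear unbiased predictor (BLUP) from the working linear mixed model (1.12).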
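The representation $f = T\delta + \Sigma a$ can be tried out numerically. The sketch below is an illustration under stated assumptions rather than the estimation procedure developed later: it takes $h = 2$, treats $\lambda$ as known, and minimizes the penalized least squares (1.10) directly. Setting the gradients with respect to $a$ and $\delta$ to zero gives $(\Sigma + n\lambda I)a = y - T\delta$ and $T^T a = 0$, where the factor $n$ comes from the $1/n$ in (1.10). The function names `r2_kernel` and `fit_cubic_smoothing_spline` are hypothetical.

```python
import numpy as np


def r2_kernel(t, s):
    """R_2(t, s) = lo^2 (hi - lo) / 2 + lo^3 / 3, with lo = min(t, s), hi = max(t, s)."""
    lo = np.minimum(t, s)
    hi = np.maximum(t, s)
    return lo ** 2 * (hi - lo) / 2.0 + lo ** 3 / 3.0


def fit_cubic_smoothing_spline(t, y, lam):
    """Solve (1.10) for h = 2 through f = T delta + Sigma a with lam fixed."""
    n = len(t)
    # Polynomial basis phi_1(t) = 1, phi_2(t) = t evaluated at the design points.
    T = np.column_stack([np.ones(n), t])
    # Sigma has (k, l)th element R_2(t_k, t_l).
    Sigma = r2_kernel(t[:, None], t[None, :])

    # Stationarity conditions: (Sigma + n lam I) a = y - T delta and T' a = 0.
    M = Sigma + n * lam * np.eye(n)
    Minv_T = np.linalg.solve(M, T)
    Minv_y = np.linalg.solve(M, y)
    delta = np.linalg.solve(T.T @ Minv_T, T.T @ Minv_y)
    a = np.linalg.solve(M, y - T @ delta)

    fitted = T @ delta + Sigma @ a
    return delta, a, fitted
```

For data lying exactly on a straight line the sketch returns $a = 0$ for any $\lambda > 0$, and for very large $\lambda$ the fit approaches the least squares line, in agreement with the limiting behavior described in Section 1.2.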

1.4 Dissertation Outline

Our research work focuses on simultaneous model selection and estimation in the additive regression model framework. First of all, we consider the additive regression models with independent normal responses. We use the linear mixed model representation of the smoothing spline estimators of the nonparametric functions, treating the inverse of the smoothing parameters as extra variance components. The importance of the nonparametric functions is controlled by these working variance components. We propose a method of model selection and estimation based on penalized log-likelihood with the adaptive LASSO. Along with the proposal, a unified EM algorithm is provided to obtain the maximum penalized likelihood estimates of the nonparametric functions as well as the variance components. In the same model framework, we also conduct forward selection based on score tests. Furthermore, we consider a two-stage selection approach which imposes early stage screening using an individual score test on each induced variance component.

Secondly, in additive regression models, we take into account the possible correlation among the responses, such as longitudinal observations from repeated measures, by introducing subject-specific random effects to the additive models. The adaptive LASSO approach and the two-stage selection with score test screening method are applied in cases with correlated responses. To tackle the computational problem when sample size is very large, we propose to sacrifice a small proportion of information and reduce the dimension of the matrices in the algorithm via the eigenvalue-eigenvector decomposition approach.

We study the empirical performances of our proposals in different simulation settings. In addition, illustrations with data applications are provided.


The proposals regarding the additive regression models with independent responses are presented in Chapter 2. Section 2.1 gives a brief introduction to this chapter. To start with, in Section 2.2, we derive the linear mixed model representation of the smoothing spline estimators in the model framework of interest. The penalized log-likelihood with the adaptive LASSO, along with the unified EM algorithm, is presented in Section 2.3. Section 2.4 discusses forward selection based on score tests. Section 2.5 introduces the two-stage selection with score test screening. Empirical study results are provided in Section 2.6. We apply the adaptive LASSO and the two-stage selection method to a data application in Section 2.7. In the end, a brief summary wraps up this chapter.

In Chapter 3, our discussion focuses on additive regression models with correlated responses. A short introduction is given in Section 3.1. Section 3.2 introduces the additive mixed models for longitudinal data, provides the derivation of the linear mixed model representation, and describes the computational challenge in data with correlated responses. The eigenvalue-eigenvector decomposition approach is proposed in Section 3.3. We present the formulation of both the adaptive LASSO and the two-stage selection with score test screening under the new model representation in Sections 3.4 and 3.5. Section 3.6 gives the simulation studies. We illustrate the application of the proposed methods in Section 3.7. Section 3.8 contains a short summary.


Chapter 2

Model Selection and Estimation in Additive Regression Models

2.1 Introduction

In this chapter, we focus on the development of model selection and estimation methods in additive regression models (ARMs). Rather than linearity, additive models assume only an additive structure of nonparametric smooth functions of the covariates of interest, which makes this model family more general and more appropriate in many practical situations. Recently, several methods have been proposed in the additive model framework, such as the component selection and smoothing operator (COSSO; Lin and Zhang, 2006), the sparse additive models (SpAM; Ravikumar et al., 2008), and the sparsity-smoothness penalty (Meier et al., 2008), as reviewed in Chapter 1.

Taking advantage of the linear mixed model representation of the ARMs, we develop our model selection and estimation methods by applying the well-studied theories and methods in mixed models. We propose three new approaches: the adaptive LASSO in additive regression models, forward selection based on score tests, and the two-stage selection with score test screening.


devoted to the two-stage selection with score test screening approach. Simulation studies and a data application are provided in Section 2.6 and Section 2.7. Finally, we wrap up the chapter by a brief summary.

2.2 The Additive Regression Models and The Linear Mixed Model Representation

2.2.1 The Additive Regression Models

We consider the additive regression models (ARMs) (Hastie and Tibshirani, 1986). Let the response variable $y_i$ be the observation on the $i$th of $n$ observation units ($i = 1, \dots, n$). We assume the $y_i$'s are independent and continuous. The data also consist of $p$ covariates $\{t_{i1}, \dots, t_{ip}\}$. The additive regression models have the following form

$$y_i = \beta_0 + f_1(t_{i1}) + f_2(t_{i2}) + \cdots + f_p(t_{ip}) + \varepsilon_i, \tag{2.1}$$

where $\beta_0$ represents a scalar intercept and the $f_j(t)$'s ($j = 1, \dots, p$) are arbitrary smooth functions associated with the covariates $\{t_{ij}\}$, respectively. The error terms $\varepsilon_i$ are usually assumed to be independent and identically distributed as $N(0, \sigma_e^2)$. The above expression provides a general modeling formulation, where additivity is the only assumption made for the relationship between the response variable and the covariates.

Lin and Zhang (1999) discussed in detail the approach for estimating nonparametric functions in the generalized additive mixed models (GAMMs). Model (2.1) can be viewed as a special case of GAMMs for independent Gaussian data. We adapt Lin and Zhang's proposal for estimating the functions $f_j(\cdot)$. For any positive integer $h$, let $f_j^{(h)}(t)$ denote the $h$th derivative of $f_j(t)$. Suppose $f_j(t) \in W_2^h$, where

$$W_2^h = \Big\{ g(t) \;\Big|\; g(t), g'(t), \dots, g^{(h-1)}(t) \text{ absolutely continuous}, \int \{g^{(h)}(t)\}^2\,dt < \infty \Big\}.$$

Let $y = (y_1, \dots, y_n)^T$ and denote by $l\{\beta_0, f_1(\cdot), \dots, f_p(\cdot); y\}$ the log-likelihood function of $\{\beta_0, f_1(\cdot), \dots, f_p(\cdot)\}$. The penalized log-likelihood function with respect to $\{\beta_0, f_1(\cdot), \dots, f_p(\cdot)\}$ can be written as

$$l\{\beta_0, f_1(\cdot), \dots, f_p(\cdot); y\} - \frac{1}{2}\sum_{j=1}^{p} \lambda_j \int \{f_j^{(h)}(t)\}^2\,dt, \tag{2.2}$$

where the $\lambda_j$'s are smoothing parameters that control the goodness of fit of the model and the roughness of the functions $f_j(\cdot)$. Large values of the $\lambda_j$'s correspond to oversmoothing. Given the $\lambda_j$'s, the estimates of the $f_j(\cdot)$'s can be obtained by maximizing function (2.2). It can be shown that such estimates are smoothing splines of order $h$ (Kimeldorf and Wahba, 1971; Wahba, 1990; Zhang and Lin, 2003). We consider the following smoothing spline representation by Kimeldorf and Wahba (1971). Let $t_j^0 = (t_{1j}^0, \dots, t_{r_j j}^0)^T$ be a vector of the $r_j$ ordered distinct $t_{ij}$'s, and without loss of generality, assume $0 < t_{1j}^0 < \cdots < t_{r_j j}^0 < 1$. The above estimate of $f_j(\cdot)$ can be expressed in the form of an $h$th-order smoothing spline

$$f_j(t) = \sum_{k=1}^{h} \delta_{kj}\phi_{kj}(t) + \sum_{l=1}^{r_j} a_{lj} R_h(t, t_{lj}^0), \tag{2.3}$$

where $\{\phi_{kj}(t)\}_{k=1}^{h}$ is a basis for the space of polynomials of order $(h-1)$ (e.g., $\phi_{kj}(t) = t^{k-1}/(k-1)!$, $k = 1, \dots, h$) and $R_h(t, s)$ is defined by

$$R_h(t, s) = \frac{1}{[(h-1)!]^2} \int_0^1 (s-u)_+^{h-1}(t-u)_+^{h-1}\,du,$$

where $(s-u)_+ = s-u$ if $s \ge u$ and 0 otherwise. When $h = 1$, the estimate is a linear smoothing spline with $R_h(t, s) = \min(t, s)$. When $h = 2$, it is a cubic smoothing spline with $R_h(t, s) = t^2(s-t)/2 + t^3/3$ (assuming $s > t$).
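The two closed forms can be checked numerically against the defining integral. The following is a minimal sketch; the function names are hypothetical, and the $h = 1$ case reads $(x-u)_+^{0}$ as the indicator of $x > u$, which is an assumption about the convention intended.

```python
import math
import numpy as np

def spline_kernel(t, s, h=1):
    """Closed-form R_h(t, s) on [0, 1] for h = 1 or h = 2."""
    lo, hi = min(t, s), max(t, s)
    if h == 1:
        return lo
    if h == 2:
        return lo**2 * (hi - lo) / 2.0 + lo**3 / 3.0
    raise ValueError("closed form only coded for h = 1 or h = 2")

def spline_kernel_numeric(t, s, h, m=200_000):
    """Midpoint-rule approximation of the defining integral over [0, 1]."""
    u = (np.arange(m) + 0.5) / m

    def plus(x):
        # (x - u)_+^{h-1}; for h = 1 this is the indicator of x > u
        return np.where(x > u, (x - u) ** (h - 1), 0.0)

    return float(np.mean(plus(s) * plus(t))) / math.factorial(h - 1) ** 2

def gram_matrix(knots, h=1):
    """Matrix with (k, l) entry R_h(t_k, t_l), i.e. Sigma_j for given knots."""
    return np.array([[spline_kernel(a, b, h) for b in knots] for a in knots])
```

For distinct knots in $(0, 1)$, `gram_matrix(knots, 1)` is the matrix with entries $\min(t_k, t_l)$, which is positive definite.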

2.2.2 The Linear Mixed Model Representation

Under the smoothing spline representation (2.3), let $\delta_j = (\delta_{1j}, \dots, \delta_{hj})^T$ and $a_j = (a_{1j}, \dots, a_{r_j j})^T$. Denote by $f_j$ the vector of $f_j(t)$ evaluated at $t_j^0$. Then $f_j$ can be written as

$$f_j = T_j\delta_j + \Sigma_j a_j,$$

where $T_j$ is an $r_j \times h$ matrix with the $(k, l)$th element equal to $\phi_{lj}(t_{kj}^0)$ and $\Sigma_j$ is a positive definite matrix with the $(k, l)$th element equal to $R_h(t_{kj}^0, t_{lj}^0)$. Furthermore, the penalty term in the penalized log-likelihood function (2.2) associated with $f_j(t)$ has the expression

$$\frac{\lambda_j}{2}\int \{f_j^{(h)}(t)\}^2\,dt = \frac{\lambda_j}{2}\, a_j^T \Sigma_j a_j,$$

which suggests that we can treat $a_j$ as a random effect distributed as $N(0, \tau_j\Sigma_j^{-1})$ with $\tau_j = 1/\lambda_j$.

Denote by $N_j$ an $n \times r_j$ incidence matrix that maps $t_j^0$ to the original data $\{t_{ij}\}$. We can then express the additive regression model (2.1) as

$$y = 1\beta_0 + N_1 f_1 + N_2 f_2 + \cdots + N_p f_p + \varepsilon,$$

where $1$ is an $n \times 1$ vector of ones and $\varepsilon = (\varepsilon_1, \dots, \varepsilon_n)^T$. Define the matrices $X = (1, N_1T_1, \dots, N_pT_p)$ and $N = (N_1, \dots, N_p)$, and the vectors $\beta = (\beta_0, \delta_1^T, \dots, \delta_p^T)^T$ and $a = (a_1^T\Sigma_1, \dots, a_p^T\Sigma_p)^T$. The additive regression model can be written in matrix notation as the following working linear mixed model

$$y = X\beta + Na + \varepsilon, \tag{2.4}$$

where $\beta$ is the fixed effect and $a \sim N(0, \Sigma)$ is the random effect with variance-covariance matrix $\Sigma = \mathrm{diag}\{\tau_j\Sigma_j\}$.

In particular, when $h = 1$, the fixed effect reduces to a scalar $\beta = \beta_0 + \sum_{j=1}^{p}\delta_{1j}$, $X$ becomes an $n \times 1$ vector of ones, and the working linear mixed model has the form

$$y = 1\beta + Na + \varepsilon. \tag{2.5}$$

The additive regression model has now been rewritten in the form of a working linear mixed model. As a result, estimates of the functions $f_j(\cdot)$ can be obtained from the best linear unbiased predictors (BLUPs) by fitting the working linear mixed model (2.5) using the maximum likelihood or the restricted maximum likelihood approach. The well-studied theories and methods for mixed models can also be applied.
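As an illustration of how the pieces of the working model (2.5) fit together, the sketch below builds $X$, the incidence matrices $N_j$, and the kernels $\Sigma_j$ for $h = 1$, and assembles the marginal covariance $\sigma_e^2 I + \sum_j \tau_j N_j\Sigma_jN_j^T$ of $y$. The function names are hypothetical, and covariates are assumed to be already scaled into $(0, 1)$.

```python
import numpy as np

def working_matrices(t):
    """For an (n, p) covariate array t with entries in (0, 1), return
    X, the list of N_j, and the list of Sigma_j for h = 1."""
    t = np.asarray(t, dtype=float)
    n, p = t.shape
    X = np.ones((n, 1))                          # model (2.5): a column of ones
    Ns, Sigmas = [], []
    for j in range(p):
        # ordered distinct values of covariate j, and each row's knot index
        knots, idx = np.unique(t[:, j], return_inverse=True)
        idx = idx.reshape(-1)
        N = np.zeros((n, knots.size))
        N[np.arange(n), idx] = 1.0               # row i selects the knot equal to t_ij
        Ns.append(N)
        Sigmas.append(np.minimum.outer(knots, knots))   # R_1(t, s) = min(t, s)
    return X, Ns, Sigmas

def marginal_covariance(tau, sigma2, Ns, Sigmas):
    """sigma_e^2 I + sum_j tau_j N_j Sigma_j N_j^T."""
    n = Ns[0].shape[0]
    V = sigma2 * np.eye(n)
    for tj, N, S in zip(tau, Ns, Sigmas):
        V += tj * (N @ S @ N.T)
    return V
```

With tied covariate values, $r_j < n$ and $N_j$ has repeated rows; without ties, $N_j$ is a permutation matrix.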

2.3 The Adaptive LASSO for Additive Regression Models

2.3.1 Methodology

We propose a simultaneous model selection and estimation method for the additive regression model (2.1). The goal is to choose the important covariates $\{t_{ij}\}$ and to estimate the corresponding nonparametric smooth functions $f_j(t)$ through which the covariates


It is shown in Section 2.2.2 that each nonparametric function $f_j(t)$ is associated with a random effect $a_j$ distributed as $N(0, \tau_j\Sigma_j^{-1})$. We consider the first-order smoothing spline. When $h = 1$, from expression (2.3), we have

$$f_j(t) = \delta_{1j} + \sum_{l=1}^{r_j} a_{lj} R_1(t, t_{lj}^0),$$

with $\delta_{1j}$ being a scalar and $a_j \sim N(0, \tau_j\Sigma_j^{-1})$. Notice that $f_j(t)$ is a constant function, $f_j(t) = \delta_{1j}$, if and only if the variance component $\tau_j$ is zero. In other words, $\tau_j = 0$ is equivalent to the function $f_j(\cdot)$ being constant, which in turn means that covariate $\{t_{ij}\}$ makes no contribution to the response. Therefore, if the variance component $\tau_j$ is zero, we conclude that the corresponding covariate $\{t_{ij}\}$ has no significant effect and should be eliminated from the model.

Our proposal is to estimate the variance components $\tau_j$ using a penalized likelihood based approach. It shrinks the estimates of the $\tau_j$'s, with the intent of shrinking those associated with unimportant covariates to zero. For model selection, only covariates with nonzero estimated variance components are selected as significant effects. The corresponding nonparametric functions $f_j(\cdot)$ are estimated by their best linear unbiased predictors (BLUPs).

In the working model (2.5), let $\theta = (\tau_1, \dots, \tau_p, \sigma_e^2)^T$ denote the vector of variance components. The variance of $y$ is $V(\theta) = \sigma_e^2 I + \tau_1 N_1\Sigma_1N_1^T + \cdots + \tau_p N_p\Sigma_pN_p^T$. The log-likelihood function with respect to $\{\beta, \theta\}$ has the following form

$$l(\beta, \theta; y) = -\frac{1}{2}\log|V(\theta)| - \frac{1}{2}(y - 1\beta)^T V(\theta)^{-1}(y - 1\beta). \tag{2.6}$$

Following Zou (2006) and Zhang and Lu (2007), we propose the adaptive LASSO for additive regression models, which estimates the new intercept $\beta$ and the variance components $\theta$ by maximizing the following penalized log-likelihood with respect to $\beta$ and $\theta$

$$l_p(\beta, \theta; y, \lambda) = l(\beta, \theta; y) - n\lambda\sum_{j=1}^{p}\frac{\tau_j}{\tilde\tau_j + \delta}, \quad \text{subject to } \tau_j \ge 0,\ j = 1, \dots, p, \tag{2.7}$$


situation of zero denominators, and the $\tilde\tau_j$'s are “good” estimators of the $\tau_j$'s. We choose to use their maximum likelihood estimates (MLEs), obtained by maximizing (2.7) with $\lambda = 0$.

Empirical experience shows that the true variance component $\tau_j$ is more likely to be zero when its MLE is small, whereas a positive true variance component often leads to a larger MLE. The proposed penalized likelihood based shrinkage approach places bigger penalties on variance components with smaller MLEs and smaller penalties on those with bigger MLEs. In this way, the $\hat\tau_j$'s associated with unimportant covariates will be shrunk to zero quickly, whereas the $\hat\tau_j$'s associated with significant effects will remain relatively large and the corresponding covariates will be selected as important ones.
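The penalized objective (2.7) and its adaptive weights can be sketched as follows for the $h = 1$ working model. This is an illustrative sketch only: the function names are hypothetical, the default `delta=1e-4` is an assumed value for the small constant in the denominator, and each `kernels[j]` is taken to hold the $n \times n$ matrix $N_j\Sigma_jN_j^T$.

```python
import numpy as np

def log_likelihood(beta, tau, sigma2, y, kernels):
    """Expression (2.6); kernels[j] holds N_j Sigma_j N_j^T (n x n)."""
    n = y.size
    V = sigma2 * np.eye(n) + sum(tj * K for tj, K in zip(tau, kernels))
    resid = y - beta
    _, logdet = np.linalg.slogdet(V)
    return -0.5 * logdet - 0.5 * resid @ np.linalg.solve(V, resid)

def adaptive_weights(tau_mle, delta=1e-4):
    """1 / (tau_tilde_j + delta): small MLEs receive large penalties."""
    return 1.0 / (np.asarray(tau_mle, dtype=float) + delta)

def penalized_log_likelihood(beta, tau, sigma2, y, kernels, lam, tau_mle,
                             delta=1e-4):
    """Expression (2.7), evaluated on the feasible set tau_j >= 0."""
    tau = np.asarray(tau, dtype=float)
    if np.any(tau < 0):
        raise ValueError("variance components must be non-negative")
    penalty = y.size * lam * float(np.sum(adaptive_weights(tau_mle, delta) * tau))
    return log_likelihood(beta, tau, sigma2, y, kernels) - penalty
```

For example, `adaptive_weights([0.01, 1.0])` gives a much larger weight to the first component, so the same value of $\tau_j$ is penalized far more heavily when its MLE is near zero.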

The tuning parameter $\lambda$ is chosen by the Bayesian information criterion (BIC)

$$\mathrm{BIC} = -2\,l(\beta, \theta; y) + d\log(n),$$

where $d$ denotes the total number of parameters and variance components in the selected model.
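A small sketch of how $\lambda$ might be chosen over a grid with this criterion. The way $d$ is counted here (fixed-effect parameters, plus the nonzero $\hat\tau_j$'s, plus one for $\sigma_e^2$) is an assumption made for illustration, as are the function names and the `fits` dictionary layout.

```python
import math

def bic(log_lik, n, tau_hat, n_fixed=1, zero_tol=0.0):
    """BIC = -2 l + d log(n), using the unpenalized log-likelihood l."""
    n_active = sum(1 for tj in tau_hat if tj > zero_tol)
    d = n_fixed + n_active + 1          # assumed count; + 1 for sigma_e^2
    return -2.0 * log_lik + d * math.log(n)

def select_lambda(fits, n):
    """fits maps lambda -> (log_lik, tau_hat); returns the BIC-minimizing lambda."""
    return min(fits, key=lambda lam: bic(fits[lam][0], n, fits[lam][1]))
```

In this sketch a fit that zeroes out a variance component is preferred whenever the drop in log-likelihood is smaller than $\log(n)/2$.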

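We use the BLUPs to estimate the nonparametric functions $f_j(\cdot)$. The centered estimated values of $f_j(\cdot)$ evaluated at $t_j^0$ are obtained as follows,

$$\hat f_j = \Sigma_j\hat a_j = \hat\tau_j\Sigma_jN_j^T V(\hat\theta)^{-1}(y - 1\hat\beta), \tag{2.8}$$

where $\hat\theta = (\hat\tau_1, \dots, \hat\tau_p, \hat\sigma_e^2)^T$ and $\hat\beta = (1^T1)^{-1}1^T\big(y - \sum_{j=1}^{p} N_j\Sigma_j\hat a_j\big)$. For an arbitrary data point $t$, when $h = 1$, the centered BLUP of $f_j(t)$ is given by

$$\hat f_j(t) = \sum_{l=1}^{r_j}\hat a_{lj}R_1(t, t_{lj}^0), \tag{2.9}$$

where $\hat a_j = (\hat a_{1j}, \dots, \hat a_{r_j j})^T$.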
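Expressions (2.8) and (2.9) translate directly into a few lines of linear algebra. The sketch below assumes $h = 1$, covariates scaled into $(0, 1)$, and estimates $\hat\beta$, $\hat\tau_j$, $\hat\sigma_e^2$ supplied by the caller; the function names are hypothetical.

```python
import numpy as np

def blup_components(y, t, beta_hat, tau_hat, sigma2_hat):
    """Return, for each covariate, (knots, a_hat_j, f_hat_j) following (2.8)."""
    t = np.asarray(t, dtype=float)
    n, p = t.shape
    V = sigma2_hat * np.eye(n)
    for j in range(p):
        # N_j Sigma_j N_j^T has (i, k) entry min(t_ij, t_kj) when h = 1
        V += tau_hat[j] * np.minimum.outer(t[:, j], t[:, j])
    w = np.linalg.solve(V, y - beta_hat)            # V^{-1}(y - 1 beta_hat)
    out = []
    for j in range(p):
        knots, idx = np.unique(t[:, j], return_inverse=True)
        # N_j^T w: sum the entries of w that share the same knot
        NTw = np.bincount(idx.reshape(-1), weights=w, minlength=knots.size)
        a_hat = tau_hat[j] * NTw
        f_hat = np.minimum.outer(knots, knots) @ a_hat   # Sigma_j a_hat_j
        out.append((knots, a_hat, f_hat))
    return out

def predict_component(t_new, knots, a_hat):
    """Expression (2.9): sum over l of a_hat_lj * min(t_new, t0_lj)."""
    return float(np.sum(a_hat * np.minimum(t_new, knots)))
```

Evaluating `predict_component` at a knot reproduces the corresponding entry of $\hat f_j$, and a component with $\hat\tau_j = 0$ is identically zero, matching the selection rule described above.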

2.3.2 Algorithm

We provide a unified algorithm to obtain the maximum penalized likelihood estimate (MPLE) of the variance components discussed in the previous section, adapting the EM algorithm (Dempster et al., 1977).


that it often produced negative updates for the induced variance components, sometimes even for those corresponding to important covariates, and it was hard to control. For the EM algorithm, by contrast, equations (2.10) below show that the updating formulae of $\hat\tau_j^{(t+1)}$ and $\hat\sigma_e^{2(t+1)}$ are conditional expectations of quadratic forms. Given that the $\Sigma_j$'s are positive definite matrices, the updated values will always be non-negative. Hence, the modified EM algorithm suits the task of estimating variance components.

In this section, we present the general formulation of the unified EM algorithm for situations where $h$th-order smoothing splines are applied, as in expression (2.4). Working with model (2.4), the response $y$ is the observed data, and the random effect $a$ can be viewed as missing. The maximum likelihood estimates of the fixed effect $\beta$ and the variance components $\{\tau_1, \dots, \tau_p, \sigma_e^2\}$ can be obtained by the EM method.

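Following Green (1990), we impose the same penalty term as in (2.7) at the E-stage of the algorithm to find the maximizer of the penalized likelihood function (2.7). At step $t$, denote $\gamma = (\beta^T, \theta^T)^T$ and

$$Q_p(\gamma \mid \gamma^{(t)}, y, \lambda) = Q(\gamma \mid \gamma^{(t)}, y) - n\lambda\sum_{j=1}^{p}\frac{\tau_j}{\tilde\tau_j + \delta},$$

where $Q(\gamma \mid \gamma^{(t)}, y)$ is defined as the expected log-likelihood function without penalty, specifically,

$$\begin{aligned}
Q(\gamma \mid \gamma^{(t)}, y) &= -\frac{n}{2}\log\sigma_e^2 - \frac{1}{2\sigma_e^2}E\{(y - X\beta - Na)^T(y - X\beta - Na) \mid \gamma^{(t)}, y\} \\
&\quad - \frac{1}{2}\log|\Sigma| - \frac{1}{2}E\{a^T\Sigma^{-1}a \mid \gamma^{(t)}, y\} \\
&= -\frac{n}{2}\log\sigma_e^2 - \frac{1}{2\sigma_e^2}E\{(y - X\beta - Na)^T(y - X\beta - Na) \mid \gamma^{(t)}, y\} \\
&\quad - \frac{1}{2}\sum_{j=1}^{p}\Big[r_j\log\tau_j + \log|\Sigma_j| + \frac{1}{\tau_j}E\{a_j^T\Sigma_j^{-1}a_j \mid \gamma^{(t)}, y\}\Big],
\end{aligned}$$

where $r_j = \dim(\Sigma_j)$. Notice that the function $Q_p(\gamma \mid \gamma^{(t)}, y, \lambda)$ is equal to $Q(\gamma \mid \gamma^{(t)}, y)$ when the tuning parameter $\lambda = 0$. The MLE of $\gamma$ can be obtained by maximizing $Q(\gamma \mid \gamma^{(t)}, y)$, namely $Q_p(\gamma \mid \gamma^{(t)}, y, \lambda)$ with $\lambda$ fixed at zero; the MPLE for a given $\lambda > 0$ can be obtained by maximizing $Q_p(\gamma \mid \gamma^{(t)}, y, \lambda)$ at that $\lambda$.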

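At the M-stage, we obtain the updating formulae for the fixed effect $\beta$ and the variance components $\theta$ by maximizing $Q_p(\gamma \mid \gamma^{(t)}, y, \lambda)$:

$$\begin{aligned}
\hat\beta^{(t+1)} &= (X^TX)^{-1}X^T(y - N\hat a^{(t)}), \\
\hat\tau_j^{(t+1)} &= \frac{2E\{a_j^T\Sigma_j^{-1}a_j \mid \hat\gamma^{(t)}, y\}}{\sqrt{r_j^2 + 4d_jE\{a_j^T\Sigma_j^{-1}a_j \mid \hat\gamma^{(t)}, y\}} + r_j}, \\
\hat\sigma_e^{2(t+1)} &= \frac{1}{n}E\{(y - X\hat\beta^{(t+1)} - Na)^T(y - X\hat\beta^{(t+1)} - Na) \mid \hat\gamma^{(t)}, y\},
\end{aligned} \tag{2.10}$$

where $\hat a^{(t)} = E(a \mid \hat\gamma^{(t)}, y)$ and $a_j$ denotes the $j$th block of $a$ in model (2.4).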

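Let $u = y - \sum_{j=1}^{p}\hat\tau_j^{(t)}N_j\Sigma_jN_j^TV^{-1}(\hat\theta^{(t)})(y - X\hat\beta^{(t)})$. After some calculation, $\beta$ and $\theta$ are updated by

$$\begin{aligned}
\hat\beta^{(t+1)} &= (X^TX)^{-1}X^Tu, \\
\hat\tau_j^{(t+1)} &= \frac{2c_j}{\sqrt{r_j^2 + 4d_jc_j} + r_j}, \\
\hat\sigma_e^{2(t+1)} &= \hat\sigma_e^{2(t)} + \frac{1}{n}\Big[(u - X\hat\beta^{(t+1)})^T(u - X\hat\beta^{(t+1)}) - (\hat\sigma_e^{2(t)})^2\,\mathrm{tr}\{V^{-1}(\hat\theta^{(t)})\}\Big],
\end{aligned}$$

where

$$\begin{aligned}
d_j &= \frac{2\lambda n}{\tilde\tau_j + \delta}, \\
c_j &= (\hat\tau_j^{(t)})^2(y - X\hat\beta^{(t)})^TV^{-1}(\hat\theta^{(t)})N_j\Sigma_jN_j^TV^{-1}(\hat\theta^{(t)})(y - X\hat\beta^{(t)}) \\
&\quad - (\hat\tau_j^{(t)})^2\,\mathrm{tr}\{\Sigma_jN_j^TV^{-1}(\hat\theta^{(t)})N_j\} + r_j\hat\tau_j^{(t)}.
\end{aligned} \tag{2.11}$$

The nonparametric functions $f_j(\cdot)$ are estimated by their BLUPs, as stated in Section 2.3.1, at convergence.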
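The closed form of the $\tau_j$ update can be traced to a one-dimensional quadratic equation. The derivation below is a sketch that assumes the penalty in $Q_p$ acts on $\tau_j$ exactly as in (2.7), and writes $c_j$ for the conditional expectation of the quadratic form appearing in the $\tau_j$ update of (2.10).

```latex
\frac{\partial Q_p}{\partial \tau_j}
  = -\frac{r_j}{2\tau_j} + \frac{c_j}{2\tau_j^{2}} - \frac{n\lambda}{\tilde\tau_j + \delta} = 0
\;\Longleftrightarrow\;
d_j\tau_j^{2} + r_j\tau_j - c_j = 0,
\qquad d_j = \frac{2\lambda n}{\tilde\tau_j + \delta},
\\[6pt]
\tau_j = \frac{-r_j + \sqrt{r_j^{2} + 4 d_j c_j}}{2 d_j}
       = \frac{2 c_j}{\sqrt{r_j^{2} + 4 d_j c_j} + r_j}.
```

The second form of the root is the one used in the update. It stays well defined as $d_j \to 0$, where it reduces to $c_j / r_j$, and it is non-negative whenever $c_j \ge 0$.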

The proposed algorithm enjoys two attractive features. On one hand, it is a unified algorithm for both the MLE and the MPLE. In equations (2.11), when $\lambda = 0$, $\hat\tau_j^{(t+1)} = c_j/r_j$, which is exactly the update for the MLE of $\tau_j$. When $\lambda > 0$, it gives the update for the MPLE.

A potential deficiency of the EM algorithm is that it converges very slowly after the first several iterations. Empirical results show that the change between two iteration steps becomes quite small after 200 iterations, but still not small enough to meet the convergence criteria. We therefore add an evaluation step to each iteration after the first 200 iterations to speed up convergence. If the updated value $\hat\tau_j^{(t+1)}$ for some $j$ is smaller than a previously chosen threshold $\delta_0$ (for example, $\delta_0 = 0.1$), it is quite likely that the updates are converging towards zero. First, we set $\hat\tau_j^{(t+1)}$ to zero, keeping the other updates unchanged. Second, we evaluate whether zero is a reasonable update for $\tau_j$ by comparing the values of the penalized log-likelihood (2.7), namely the target function, at the original updated value and at zero, since the goal is to maximize the penalized log-likelihood function. If the former is larger, we keep the original updated value as the update in this iteration; if the latter is larger, the update is $\hat\tau_j^{(t+1)} = 0$. The penalized EM algorithm will eventually converge without this additional step. However, this secondary evaluation step noticeably shortens the time to convergence. It is implemented in all of our simulation studies involving the proposed unified EM algorithm.
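The evaluation step can be sketched as a small helper that takes the freshly updated variance components and a callable returning the penalized log-likelihood (2.7). The function name and interface are hypothetical. When several components fall below the threshold, this sketch tests them one after another, carrying forward any component already set to zero; the text does not specify this ordering, so it is an assumption.

```python
def zero_check(tau_new, penalized_loglik, threshold=0.1):
    """Set small updates to zero when doing so increases the target function.

    tau_new          : sequence of updated variance components
    penalized_loglik : callable mapping a list of tau values to (2.7),
                       with beta and sigma_e^2 held at their current updates
    threshold        : the cut-off delta_0 (0.1 is the example value in the text)
    """
    tau = list(tau_new)
    for j in range(len(tau)):
        if 0.0 < tau[j] < threshold:
            trial = list(tau)
            trial[j] = 0.0                  # candidate: component j removed
            if penalized_loglik(trial) > penalized_loglik(tau):
                tau = trial                 # zero gives the larger objective
    return tau
```

In the procedure described above, this helper would be called once per iteration, and only after the first 200 iterations.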

Overall, the algorithm can be summarized as follows.

Step 1: Set initial values for $\hat\beta^{(0)}$, $\hat\tau_1^{(0)}, \dots, \hat\tau_p^{(0)}$, and $\hat\sigma_e^{2(0)}$.

Step 2 (getting the MLE): Let $\lambda = 0$. At the $t$th iteration, update $\hat\beta^{(t+1)}$, $\hat\tau_1^{(t+1)}, \dots, \hat\tau_p^{(t+1)}$, and $\hat\sigma_e^{2(t+1)}$ until convergence.

Step 3: Initialize $\hat\beta^{(0)}$, $\hat\tau_1^{(0)}, \dots, \hat\tau_p^{(0)}$, and $\hat\sigma_e^{2(0)}$ using their MLEs obtained from Step 2. When the MLE $\hat\tau_k$ is zero, substitute a small positive value as the initial value.

Step 4 (getting the MPLE): For any $\lambda > 0$, at the $t$th iteration, update $\hat\beta^{(t+1)}$, $\hat\tau_1^{(t+1)}, \dots, \hat\tau_p^{(t+1)}$, and $\hat\sigma_e^{2(t+1)}$ until convergence.

Step 5: Compute the $\hat f_j(t)$'s, for $j = 1, \dots, p$, at convergence.
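Steps 1 to 4 can be sketched in a few dozen lines for the $h = 1$ model (2.5), where $X$ is a column of ones. This is an illustrative sketch, not the implementation used for the simulations: the function names, starting values, convergence tolerance, `delta`, and the `floor` used in Step 3 are all assumptions, and the additional zero-check step is omitted. Each `kernels[j]` holds $N_j\Sigma_jN_j^T$, whose $(i, k)$ entry is $\min(t_{ij}, t_{kj})$ when $h = 1$.

```python
import numpy as np

def em_fit(y, kernels, ranks, lam=0.0, tau_tilde=None, delta=1e-4,
           init=None, max_iter=2000, tol=1e-8):
    """Iterate the updates (2.11); returns (beta, tau, sigma2)."""
    n, p = y.size, len(kernels)
    r = np.asarray(ranks, dtype=float)
    if init is None:                        # Step 1: arbitrary starting values
        beta, tau, sigma2 = float(y.mean()), np.ones(p), float(y.var())
    else:
        beta, tau, sigma2 = init[0], np.array(init[1], dtype=float), init[2]
    if lam > 0:
        d = 2.0 * lam * n / (np.asarray(tau_tilde, dtype=float) + delta)
    else:
        d = np.zeros(p)                     # lam = 0 gives the plain MLE update
    for _ in range(max_iter):
        V = sigma2 * np.eye(n) + sum(tj * K for tj, K in zip(tau, kernels))
        Vinv = np.linalg.inv(V)
        w = Vinv @ (y - beta)               # V^{-1}(y - X beta)
        u = y - sum(tj * (K @ w) for tj, K in zip(tau, kernels))
        beta_new = float(u.mean())          # (X^T X)^{-1} X^T u with X = 1
        # c_j from (2.11); tr{Sigma_j N_j^T V^{-1} N_j} = sum(Vinv * K_j)
        c = np.array([tj**2 * (w @ K @ w) - tj**2 * np.sum(Vinv * K) + rj * tj
                      for tj, K, rj in zip(tau, kernels, r)])
        c = np.maximum(c, 0.0)              # guard against round-off only
        tau_new = 2.0 * c / (np.sqrt(r**2 + 4.0 * d * c) + r)
        e = u - beta_new
        sigma2_new = sigma2 + (e @ e - sigma2**2 * np.trace(Vinv)) / n
        change = max(abs(beta_new - beta), float(np.max(np.abs(tau_new - tau))),
                     abs(sigma2_new - sigma2))
        beta, tau, sigma2 = beta_new, tau_new, float(sigma2_new)
        if change < tol:
            break
    return beta, tau, sigma2

def fit_adaptive_lasso(y, t, lam, floor=1e-6):
    """Steps 1-4 for an (n, p) covariate array t with entries in (0, 1)."""
    t = np.asarray(t, dtype=float)
    p = t.shape[1]
    kernels = [np.minimum.outer(t[:, j], t[:, j]) for j in range(p)]
    ranks = [np.unique(t[:, j]).size for j in range(p)]
    mle = em_fit(y, kernels, ranks)                          # Step 2
    start = (mle[0], np.maximum(mle[1], floor), mle[2])      # Step 3
    mple = em_fit(y, kernels, ranks, lam=lam, tau_tilde=mle[1], init=start)  # Step 4
    return mle, mple
```

A component that starts at exactly zero gives $c_j = 0$ and therefore never leaves zero under these updates, which is why Step 3 replaces zero MLEs with a small positive starting value.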