Variable selection of linear regression model

2.4. Order determination

2.5.1. Variable selection of linear regression model

The consistent order estimators in Section 2.4 can significantly reduce the complexity of the autoregressive model and thus lead to more efficient estimation of the model coefficients. However, even when the order of a time series is correctly identified, some of the coefficients $\phi_j^0$ may still be zero, and including those zero coefficients results in an unnecessarily complex model, which degrades the efficiency of the coefficient estimators and leads to less accurate predictions. This is especially true for long-memory autoregressive models, whose order can increase as $n$ increases. In addition, a model with a sparse representation reveals the underlying structure of the observed process. Therefore, variable selection is an important aspect of fitting autoregressive models.

The idea of using penalized methods for variable selection was pioneered by Tibshirani (1996) in the linear regression setting. Consider the linear regression model:

$$y_i = x_i^T \beta + \epsilon_i, \quad i = 1, \dots, n. \tag{2.14}$$

To obtain an estimate of $\beta$, the Lasso method minimizes

$$\mathrm{Lasso}(\beta) = \sum_{i=1}^{n} (y_i - x_i^T \beta)^2 + \lambda_n \sum_{j=1}^{p} |\beta_j|, \tag{2.15}$$

where $\lambda_n > 0$ is a tuning parameter that balances model fit against model complexity. When $\lambda_n$ is sufficiently large, some components of $\beta$ are shrunk to exactly 0, which means that the corresponding covariates are excluded from the model. The primary advantage of the Lasso method is that it performs variable selection and model estimation simultaneously, which is more stable than subset selection in the sense that small changes in the data do not result in large changes in the selected model. Another advantage is that, as in ridge regression, the shrinkage of the coefficients helps improve the prediction accuracy of the fitted model.
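To make the role of the tuning parameter in (2.15) concrete, the following is a minimal sketch of cyclic coordinate descent for the Lasso objective, written in plain Python. The function names, the simulated data, and the choice of coordinate descent itself are assumptions made for illustration; they are not taken from Tibshirani (1996). Each coordinate update is a soft-thresholding step with threshold $\lambda_n/2$, so a larger $\lambda_n$ sets more coefficients to exactly zero.

```python
import random


def soft_threshold(z, t):
    """Soft-thresholding operator: sign(z) * max(|z| - t, 0)."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_cd(X, y, lam, n_sweeps=200):
    """Minimize sum_i (y_i - x_i^T beta)^2 + lam * sum_j |beta_j|
    by cyclic coordinate descent, starting from beta = 0."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # residuals y - X beta for the current beta
    col_ss = [sum(X[i][j] ** 2 for i in range(n)) for j in range(p)]
    for _ in range(n_sweeps):
        for j in range(p):
            # Inner product of column j with the partial residual
            # (the residual with coordinate j's contribution added back).
            rho = sum(X[i][j] * resid[i] for i in range(n)) + col_ss[j] * beta[j]
            new_bj = soft_threshold(rho, lam / 2.0) / col_ss[j]
            delta = new_bj - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new_bj
    return beta


# Hypothetical simulated data: two truly nonzero and two truly zero coefficients.
random.seed(0)
n, beta_true = 100, [2.0, 0.0, -1.5, 0.0]
X = [[random.gauss(0, 1) for _ in beta_true] for _ in range(n)]
y = [sum(x * b for x, b in zip(row, beta_true)) + random.gauss(0, 1) for row in X]

# For lam >= 2 * max_j |sum_i x_ij y_i| the zero vector minimizes (2.15).
lam_max = 2.0 * max(abs(sum(X[i][j] * y[i] for i in range(n)))
                    for j in range(len(beta_true)))
for lam in (0.0, 0.25 * lam_max, lam_max):
    print(lam, lasso_cd(X, y, lam))
```

With `lam = 0.0` the routine reduces to least squares, while `lam = lam_max` returns the all-zero vector; intermediate values trace out the path between the two.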

As appealing as the Lasso method is, Zou (2006), along with several other researchers, pointed out that Lasso variable selection is not consistent under certain conditions. Denote by $\beta^0 = (\beta_1^0, \dots, \beta_p^0)^T$ the true value of $\beta$, by $S = \{j : \beta_j^0 \neq 0,\ j = 1, \dots, p\}$ the index set of the truly nonzero coefficients, and by $S_n^{lasso} = \{j : \hat{\beta}_j^{lasso} \neq 0,\ j = 1, \dots, p\}$ the index set of the nonzero coefficients estimated via the Lasso method. By inconsistency, we mean that

$$\lim_{n \to \infty} P(S_n^{lasso} = S) < 1.$$

In other words, under certain conditions, no matter how large the sample size is, there is a positive probability that the Lasso method selects an incorrect model. To solve this problem, Zou (2006) proposed a modification of the Lasso method, named the Adaptive Lasso method, which estimates $\beta$ by minimizing

$$\mathrm{aLasso}(\beta) = \sum_{i=1}^{n} (y_i - x_i^T \beta)^2 + \lambda_n \sum_{j=1}^{p} w_j |\beta_j|, \tag{2.16}$$

where $w_j = 1/|\hat{\beta}_j|^{\gamma}$, with $\gamma > 0$ and $\hat{\beta}$ being a $\sqrt{n}$-consistent estimator of $\beta^0$. Again, define $S_n^{alasso} = \{j : \hat{\beta}_j^{alasso} \neq 0,\ j = 1, \dots, p\}$ as the index set of the nonzero coefficients estimated via the Adaptive Lasso method. Zou (2006) showed that if $\lambda_n/\sqrt{n} \to 0$ and $\lambda_n n^{(\gamma-1)/2} \to \infty$, then the Adaptive Lasso estimator enjoys the so-called “Oracle property” (Fan and Li, 2001), which includes:

1. Consistency in variable selection: $\lim_{n \to \infty} P(S_n^{alasso} = S) = 1$,

2. Asymptotic normality: $\sqrt{n}\,(\hat{\beta}_S^{alasso} - \beta_S^0) \xrightarrow{d} N(0, \sigma^2 C_S^{-1})$,

where $C_S = \lim_{n \to \infty} \frac{1}{n} X_S^T X_S$, with $X_S$ being the design matrix that uses only the covariates indexed by $S$. The “Oracle property” means that we can simultaneously do variable selection and model estimation as if the true model were known.
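The two-stage structure of the Adaptive Lasso (an initial $\sqrt{n}$-consistent fit, then a weighted Lasso with weights $w_j = 1/|\hat{\beta}_j|^{\gamma}$ as in Zou (2006)) can be sketched as follows. This is an illustration under stated assumptions, not the algorithm used in the cited papers: the coordinate descent solver, the function names, the simulated data, and the choices $\gamma = 1$ and $\lambda_n = n^{1/4}$ are all hypothetical. The last choice is simply one sequence satisfying $\lambda_n/\sqrt{n} \to 0$ and $\lambda_n n^{(\gamma-1)/2} \to \infty$ when $\gamma = 1$.

```python
import random


def soft_threshold(z, t):
    """Soft-thresholding operator: sign(z) * max(|z| - t, 0)."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def weighted_lasso_cd(X, y, lam, weights, n_sweeps=200):
    """Minimize sum_i (y_i - x_i^T beta)^2 + lam * sum_j weights[j] * |beta_j|
    by cyclic coordinate descent, starting from beta = 0."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)
    col_ss = [sum(X[i][j] ** 2 for i in range(n)) for j in range(p)]
    for _ in range(n_sweeps):
        for j in range(p):
            rho = sum(X[i][j] * resid[i] for i in range(n)) + col_ss[j] * beta[j]
            # Coordinate-specific threshold: larger weight, stronger shrinkage.
            new_bj = soft_threshold(rho, lam * weights[j] / 2.0) / col_ss[j]
            delta = new_bj - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new_bj
    return beta


def adaptive_weights(beta_init, gamma=1.0):
    """Weights w_j = 1 / |beta_init_j|^gamma.
    Assumes every initial estimate is nonzero (true almost surely for
    least squares with continuous errors)."""
    return [1.0 / abs(b) ** gamma for b in beta_init]


# Hypothetical simulated data with true index set S = {0, 2} (0-based).
random.seed(1)
n, beta_true = 200, [2.0, 0.0, -1.5, 0.0]
p = len(beta_true)
X = [[random.gauss(0, 1) for _ in range(p)] for _ in range(n)]
y = [sum(x * b for x, b in zip(row, beta_true)) + random.gauss(0, 1) for row in X]

# Stage 1: least squares as the initial estimator (lam = 0 removes the penalty).
beta_init = weighted_lasso_cd(X, y, 0.0, [1.0] * p)

# Stage 2: weighted Lasso with illustrative gamma = 1 and lam = n^(1/4).
gamma, lam = 1.0, n ** 0.25
w = adaptive_weights(beta_init, gamma)
beta_alasso = weighted_lasso_cd(X, y, lam, w)

print("initial :", beta_init)
print("weights :", w)
print("adaptive:", beta_alasso)
```

Coefficients whose initial estimates are close to zero receive large weights and are therefore penalized heavily, whereas coefficients with large initial estimates are barely shrunk. This is the mechanism behind the selection consistency stated above.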

The Adaptive Lasso method is not the only penalized method that enjoys this “Oracle property”. Another famous example is the smoothly clipped absolute deviation (SCAD) penalty function proposed in Fan and Li (2001). Zou and Li (2008) further proposed to modify the penalty term in (2.16) by replacing each $\lambda_n w_j$ term with $p'_{\lambda_n}(|\tilde{\phi}_{1j}|)$ for some general penalty function $p_{\lambda}(\cdot)$, for example the SCAD penalty function, which maintains the “Oracle property”.
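For reference, the derivative of the SCAD penalty given in Fan and Li (2001) is $p'_{\lambda}(\theta) = \lambda \{ I(\theta \le \lambda) + \frac{(a\lambda - \theta)_+}{(a-1)\lambda} I(\theta > \lambda) \}$ for $\theta > 0$ and some $a > 2$. The formula is not spelled out in this section, so the short sketch below is added only to show how such a derivative behaves when it replaces $\lambda_n w_j$. The function name is hypothetical, and $a = 3.7$ is the value suggested by Fan and Li (2001).

```python
def scad_derivative(theta, lam, a=3.7):
    """Derivative p'_lam(theta) of the SCAD penalty for theta >= 0, with a > 2.

    Equals lam for theta <= lam, decreases linearly on (lam, a * lam],
    and is zero beyond a * lam.
    """
    if theta <= lam:
        return lam
    # lam * (a*lam - theta)_+ / ((a - 1) * lam) simplifies to the line below.
    return max(a * lam - theta, 0.0) / (a - 1.0)


lam = 1.0
for theta in (0.1, 1.0, 2.0, 3.7, 5.0):
    print(theta, scad_derivative(theta, lam))
```

Small initial estimates therefore receive the full penalty level $\lambda$, while sufficiently large ones receive no penalty at all, which plays the same role as the data-driven weights $w_j$ in (2.16).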

Wang et al. (2007b) considered model (2.14) with the error term $\epsilon_i$ following a heavy-tailed distribution, and proposed to perform model estimation and variable selection using the Lad-Lasso method, which minimizes

$$\mathrm{LadLasso}(\beta) = \sum_{i=1}^{n} |y_i - x_i^T \beta| + \sum_{j=1}^{p} \lambda_j |\beta_j|, \tag{2.17}$$

where the tuning parameters can be chosen as

$$\lambda_j = \lambda_n \frac{\log n}{n |\tilde{\beta}_j|}, \quad j = 1, \dots, p,$$

with $\tilde{\beta}_j$, $j = 1, \dots, p$, being initial estimators of the components of $\beta$. The use of the least absolute deviation loss function in (2.17) instead of the least squares loss function handles residuals from heavy-tailed distributions, including those with infinite variance, by assigning smaller weights to large deviations. Assuming that the error $\epsilon_i$ has a continuous density function $f(\cdot)$ such that $f(0) > 0$, then under certain conditions, Wang et al. (2007b) showed that, as $n \to \infty$,

$$P(\hat{\beta}_{S^c} = 0) \to 1 \quad \text{and} \quad \sqrt{n}\,(\hat{\beta}_S - \beta_S^0) \xrightarrow{d} N\!\left(0, \frac{1}{4 f^2(0)} C_S^{-1}\right),$$

which implies that the Lad-Lasso method also enjoys the “Oracle property”. This motivates us to consider applying the Lad-Lasso method to infinite-variance autoregressive models.
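The robustness of the objective in (2.17) is easiest to see with a single covariate. In that special case, $\sum_i |y_i - x_i \beta| + \lambda |\beta|$ can be rewritten as $\sum_i |x_i|\,|y_i/x_i - \beta| + \lambda\,|0 - \beta|$, so a minimizer is a weighted median of the ratios $y_i/x_i$ (with weights $|x_i|$) together with one extra pseudo-observation at 0 (with weight $\lambda$). The sketch below implements only this scalar case. The function names and the toy data are hypothetical, and it is not meant to represent the computational method of Wang et al. (2007b).

```python
def weighted_median(points, weights):
    """Return a minimizer of sum_k weights[k] * |points[k] - b| over b
    (the lower weighted median). Weights must be nonnegative."""
    pairs = sorted(zip(points, weights))
    half = 0.5 * sum(w for _, w in pairs)
    running = 0.0
    for value, w in pairs:
        running += w
        if running >= half:
            return value
    return pairs[-1][0]  # safeguard against rounding; not reached for valid input


def lad_lasso_1d(x, y, lam):
    """Minimize sum_i |y_i - x_i * b| + lam * |b| for a single covariate."""
    points, weights = [], []
    for xi, yi in zip(x, y):
        if xi != 0.0:  # terms with x_i = 0 do not depend on b
            points.append(yi / xi)
            weights.append(abs(xi))
    # The penalty acts like one pseudo-observation at 0 with weight lam.
    points.append(0.0)
    weights.append(lam)
    return weighted_median(points, weights)


# Toy data with one gross outlier (y = 100).
x = [1.0, 1.0, 1.0]
y = [1.0, 2.0, 100.0]
for lam in (0.0, 1.0, 6.0):
    print(lam, lad_lasso_1d(x, y, lam))
```

The outlier influences the fit only through its rank, not its magnitude, and once $\lambda$ is large relative to $\sum_i |x_i|$ the pseudo-observation at 0 dominates and the coefficient is set to exactly zero.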

In document Variable Selection and Function Estimation Using Penalized Methods (Page 32-35)