Approaches for variable selection - Belitz, Christiane (2007): Model selection in genera


3.1.1 Approaches for variable selection

An overview of methods for subset selection in (generalised) linear models can be found in Miller (2002) or Kadane & Lazar (2004), for instance. The best-known approaches are forward selection and backward elimination. Forward selection starts with the empty model containing the intercept term only. In each step, the variable that is best according to a selection criterion (compare subsection 3.2) or a certain test statistic is added to the model, chosen among those that have not been added previously. The algorithm stops when adding any of the remaining variables no longer improves the model.
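The forward selection loop can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in any of the cited works: it assumes a Gaussian linear model, uses AIC as the selection criterion (one possible choice among those the text allows), and the helper names `solve`, `aic` and `forward_selection` as well as the toy data are hypothetical.

```python
import math

def solve(A, b):
    """Solve the linear system A x = b by Gaussian elimination with pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda i: abs(M[i][c]))
        M[c], M[p] = M[p], M[c]
        for i in range(c + 1, n):
            f = M[i][c] / M[c][c]
            for k in range(c, n + 1):
                M[i][k] -= f * M[c][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][k] * x[k] for k in range(i + 1, n))) / M[i][i]
    return x

def aic(y, columns):
    """AIC (up to an additive constant) of a Gaussian linear model with intercept."""
    n = len(y)
    Z = [[1.0] * n] + columns                      # design columns incl. intercept
    gram = [[sum(a * b for a, b in zip(u, v)) for v in Z] for u in Z]
    rhs = [sum(a * b for a, b in zip(u, y)) for u in Z]
    beta = solve(gram, rhs)                        # least squares via normal equations
    rss = sum((y[i] - sum(b * z[i] for b, z in zip(beta, Z))) ** 2 for i in range(n))
    return n * math.log(rss / n) + 2 * len(Z)

def forward_selection(y, X):
    """X maps variable names to value lists; returns the names in order of entry."""
    selected, best = [], aic(y, [])                # start with the intercept-only model
    while True:
        candidates = [v for v in X if v not in selected]
        scores = {v: aic(y, [X[u] for u in selected + [v]]) for v in candidates}
        if not scores or min(scores.values()) >= best:
            return selected                        # no remaining variable improves the model
        v = min(scores, key=scores.get)
        selected.append(v)
        best = scores[v]

# Toy data (assumption): y depends on x1 and x2 only; x3 is irrelevant.
X = {
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
    "x2": [1.0, -1.0, -1.0, 1.0, 1.0, -1.0, -1.0, 1.0],
    "x3": [1.0, 1.0, -1.0, -1.0, -1.0, -1.0, 1.0, 1.0],
}
noise = [0.1, -0.1, 0.2, -0.2, -0.15, 0.15, 0.05, -0.05]
y = [2.0 * a + b + e for a, b, e in zip(X["x1"], X["x2"], noise)]
selected = forward_selection(y, X)
print(selected)
```

Replacing `aic` by another criterion or by a test statistic changes only the scoring function; the loop itself stays the same.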

Unlike forward selection, backward elimination starts with the full model containing all variables. At each step, it removes the least important variable from the model, again basing the decision either on a selection criterion or on a test statistic. The process stops when removing any of the remaining variables no longer improves the model.

These two approaches can be combined, leading to stepwise regression (see e.g. Miller (2002)).

Alternative approaches for subset selection in linear models, which are closely related to each other, are Lasso, forward stagewise regression and LARS (compare Efron, Hastie, Johnstone & Tibshirani (2004)). For all three approaches we assume that the response variable and all covariates are centred around zero and that the covariates are additionally standardised. Lasso was introduced by Tibshirani (1996) and estimates the regression coefficients by minimising the residual sum of squares subject to the condition that the sum of the absolute coefficient values does not exceed a certain threshold value, i.e.

$$\sum_{j=1}^{p} |\beta_j| \le t.$$

This threshold value $t$ serves as a tuning parameter and has to be determined appropriately, e.g. using cross-validation. If the threshold value is large enough, the estimated coefficients are identical to the usual least squares estimates. In contrast, if the threshold value is small, the parameter estimates are shrunk towards zero. Often some of the coefficients are even exactly equal to zero, so that the respective covariates can be considered as having no effect on the response.
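The shrinkage behaviour can be illustrated with a small coordinate descent sketch. Note the swapped-in formulation: instead of the constrained problem with threshold $t$, the code minimises the equivalent penalised criterion $\tfrac{1}{2}\,\mathrm{RSS} + \lambda \sum_j |\beta_j|$, where a larger penalty $\lambda$ corresponds to a smaller threshold $t$. The function names, the toy data and the choice of coordinate descent are assumptions made for illustration; they are not taken from Tibshirani (1996).

```python
def centre(v):
    """Subtract the mean from a list of values."""
    m = sum(v) / len(v)
    return [a - m for a in v]

def standardise(v):
    """Centre a covariate and scale it to unit Euclidean norm."""
    c = centre(v)
    norm = sum(a * a for a in c) ** 0.5
    return [a / norm for a in c]

def soft_threshold(z, lam):
    """Shrink z towards zero by lam; return exactly 0 when |z| <= lam."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

def lasso_cd(X, y, lam, sweeps=200):
    """Coordinate descent for 0.5 * RSS + lam * sum(|beta_j|).

    X: list of standardised covariate columns, y: centred response.
    """
    beta = [0.0] * len(X)
    r = y[:]                                       # residual for beta = 0
    for _ in range(sweeps):
        for j, xj in enumerate(X):
            norm2 = sum(v * v for v in xj)
            # inner product of x_j with the residual after adding x_j's own effect back
            z = sum(v * ri for v, ri in zip(xj, r)) + norm2 * beta[j]
            new = soft_threshold(z, lam) / norm2
            r = [ri - v * (new - beta[j]) for ri, v in zip(r, xj)]
            beta[j] = new
    return beta

# Toy data (assumption): orthogonal covariates, x3 has no effect on y.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
x2 = [1.0, -1.0, -1.0, 1.0, 1.0, -1.0, -1.0, 1.0]
x3 = [1.0, 1.0, -1.0, -1.0, -1.0, -1.0, 1.0, 1.0]
noise = [0.1, -0.1, 0.2, -0.2, -0.15, 0.15, 0.05, -0.05]
y = centre([2.0 * a + b + e for a, b, e in zip(x1, x2, noise)])
X = [standardise(c) for c in (x1, x2, x3)]

beta_ls = lasso_cd(X, y, lam=0.0)       # no penalty: least squares estimates
beta_shrunk = lasso_cd(X, y, lam=3.0)   # strong penalty: shrinkage and exact zeros
print(beta_ls, beta_shrunk)
```

With `lam=0.0` the constraint is not binding and the least squares estimates are returned; with `lam=3.0` the first coefficient is shrunk and the other two are set exactly to zero.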

Forward stagewise regression is an iterative method that chooses in each step the covariate $x_j$ with the highest absolute correlation to the current residual vector $r = y - \hat{\mu}$. Then, the current linear predictor $\hat{\mu}$ is adjusted and replaced by

$$\hat{\mu} + \varepsilon \cdot \operatorname{sign}(\operatorname{cor}(x_j, r)) \cdot x_j$$

using a small value for the constant $\varepsilon$. For $\varepsilon = |\operatorname{cor}(x_j, r)|$ this approach is equivalent to simple forward selection. The starting values for the parameter estimates are zero. Variable selection is included implicitly by not choosing certain covariates during the entire process.
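The update rule of forward stagewise regression translates almost directly into code. The following sketch assumes centred covariates scaled to unit norm, so that the inner product of a covariate with the residual is proportional to their correlation and identifies the same covariate. The stopping rule (stop once no inner product reaches `eps`) and all names are assumptions for illustration; the text above does not prescribe a stopping rule.

```python
def centre(v):
    """Subtract the mean from a list of values."""
    m = sum(v) / len(v)
    return [a - m for a in v]

def standardise(v):
    """Centre a covariate and scale it to unit Euclidean norm."""
    c = centre(v)
    norm = sum(a * a for a in c) ** 0.5
    return [a / norm for a in c]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def forward_stagewise(X, y, eps=0.01, max_steps=5000):
    """X: list of standardised covariate columns, y: centred response."""
    beta = [0.0] * len(X)                          # starting values are zero
    mu = [0.0] * len(y)                            # current linear predictor
    for _ in range(max_steps):
        r = [yi - mi for yi, mi in zip(y, mu)]     # current residual vector
        c = [dot(xj, r) for xj in X]               # proportional to cor(x_j, r)
        j = max(range(len(X)), key=lambda i: abs(c[i]))
        if abs(c[j]) < eps:                        # assumed stopping rule
            break
        s = 1.0 if c[j] > 0 else -1.0
        beta[j] += eps * s                         # small step for covariate j only
        mu = [mi + eps * s * v for mi, v in zip(mu, X[j])]
    return beta

# Toy data (assumption): orthogonal covariates, x3 has no effect on y.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
x2 = [1.0, -1.0, -1.0, 1.0, 1.0, -1.0, -1.0, 1.0]
x3 = [1.0, 1.0, -1.0, -1.0, -1.0, -1.0, 1.0, 1.0]
noise = [0.1, -0.1, 0.2, -0.2, -0.15, 0.15, 0.05, -0.05]
y = centre([2.0 * a + b + e for a, b, e in zip(x1, x2, noise)])
X = [standardise(c) for c in (x1, x2, x3)]

beta = forward_stagewise(X, y)
print(beta)
```

In this toy example the third covariate is never chosen, so its coefficient stays at its starting value of zero, which is the implicit variable selection described above.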

Least Angle Regression (LARS), introduced by Efron, Hastie, Johnstone & Tibshirani (2004), is a modified version of forward stagewise regression. Similar to the formula for stagewise regression above, the linear predictor is adjusted in each step using the variable with the largest absolute correlation to the current residual vector $r$. There are two differences to forward stagewise regression. First, the value $\varepsilon$ is not fixed but is chosen in each step such that the absolute correlation between the newly adjusted residual vector and the currently chosen variable $x_j$ equals the absolute correlation between that residual vector and the next best covariate $x_k$, i.e.

$$\left|\operatorname{cor}\left[y - \left(\hat{\mu} + \varepsilon \cdot \operatorname{sign}(\operatorname{cor}(x_j, r)) \cdot x_j\right),\, x_j\right]\right| = \left|\operatorname{cor}\left[y - \left(\hat{\mu} + \varepsilon \cdot \operatorname{sign}(\operatorname{cor}(x_j, r)) \cdot x_j\right),\, x_k\right]\right|$$

must hold. Second, out of these two variables a new variable $x_{k'}$ is built such that it bisects the angle between the variable vectors $x_j$ and $x_k$. The algorithm continues using this artificial variable. Variable selection is again included implicitly by not choosing certain covariates during the entire process. The LARS algorithm can also be modified to provide solutions for Lasso.
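The choice of the step length in the first LARS step can be made concrete with a short sketch. It covers only this first step, starting from a linear predictor of zero, and assumes centred covariates of unit norm; under that assumption equal absolute inner products with the residual are equivalent to equal absolute correlations. The closed-form step length follows from solving the tie condition for each competing covariate and taking the smallest positive solution. The function name and the toy data are hypothetical.

```python
def centre(v):
    """Subtract the mean from a list of values."""
    m = sum(v) / len(v)
    return [a - m for a in v]

def standardise(v):
    """Centre a covariate and scale it to unit Euclidean norm."""
    c = centre(v)
    norm = sum(a * a for a in c) ** 0.5
    return [a / norm for a in c]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def first_lars_step(X, y):
    """Return (j, k, eps): chosen covariate, next best covariate, step length."""
    c = [dot(x, y) for x in X]                     # residual equals y at the start
    j = max(range(len(X)), key=lambda i: abs(c[i]))
    s = 1.0 if c[j] > 0 else -1.0
    c_max = abs(c[j])
    best = None
    for k in range(len(X)):
        if k == j:
            continue
        a = s * dot(X[j], X[k])                    # change of c[k] per unit step
        # the two candidates solve c_max - eps = +/-(c[k] - eps * a)
        for num, den in ((c_max - c[k], 1.0 - a), (c_max + c[k], 1.0 + a)):
            if den > 1e-12 and num / den > 0:
                if best is None or num / den < best[1]:
                    best = (k, num / den)
    return j, best[0], best[1]

# Toy data (assumption): x1 and x2 are correlated with each other.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
x2 = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0]
x3 = [1.0, 1.0, -1.0, -1.0, -1.0, -1.0, 1.0, 1.0]
noise = [0.1, -0.1, 0.2, -0.2, -0.15, 0.15, 0.05, -0.05]
y = centre([2.0 * a + b + e for a, b, e in zip(x1, x2, noise)])
X = [standardise(c) for c in (x1, x2, x3)]

j, k, eps = first_lars_step(X, y)
s = 1.0 if dot(X[j], y) > 0 else -1.0
r_new = [yi - eps * s * v for yi, v in zip(y, X[j])]
tie = (abs(dot(X[j], r_new)), abs(dot(X[k], r_new)))
print(j, k, eps, tie)
```

After the step, the two entries of `tie` agree up to rounding, which is the equality required above. The construction of the bisecting variable and the later steps are not part of this sketch.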

Bayesian approaches for model selection can be based on Bayes factors, which compare different models (compare Kass & Raftery (1995) or section 3.2.3 of this chapter). Other Bayesian approaches for subset selection of variables in linear models can be based on indicator variables $\gamma_j$ for each of the covariates $x_j$, leading to the predictor

$$\eta = \beta_0 + \gamma_1 \beta_1 x_1 + \ldots + \gamma_p \beta_p x_p.$$

An example is the approach presented by George & McCulloch (1997). They use hierarchical Bayes mixture models in combination with MCMC methods like the Gibbs sampler or the Metropolis–Hastings algorithm (compare Green (2001)) to perform the selection. The lowest level of the hierarchy is represented by the indicator variables $\gamma_j$. These are assigned, independently of each other, prior probabilities $\pi_j = P(\gamma_j = 1)$ indicating the probability that the $j$-th covariate has an influence on the response. The next level consists of the prior distributions for the regression parameters conditional on the indicator variables. Here, it is possible to use a normal mixture of the form

$$\beta_j \mid \gamma_j \sim (1 - \gamma_j)\, N(0, \tau_{j0}^2) + \gamma_j\, N(0, \tau_{j1}^2),$$

with a small value for $\tau_{j0}^2$ and a large one for $\tau_{j1}^2$. How to choose the values for the variances is described in George & McCulloch (1997). The parameter $\tau_{j0}^2$ can also be set to zero, leading to a point mass at $\beta_j = 0$. This was considered in Geweke (1996). The decision which model to use can be based on the posterior probabilities of the different models. Alternatively, these approaches also allow a kind of model averaging (compare chapter 5 of this thesis).
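To see how the normal mixture prior separates relevant from irrelevant covariates, consider the probability that an indicator equals one given a value of its regression coefficient. By Bayes' rule applied to the two levels of the prior above, this is the prior-weighted density of the wide component divided by the weighted sum of both component densities. The sketch below computes only this quantity, a building block of the kind a Gibbs sampler would draw from; it is not the full sampler of George & McCulloch (1997), where further terms may enter depending on the prior specification. Function and argument names are hypothetical, and both variances are assumed to be strictly positive.

```python
import math

def normal_pdf(x, var):
    """Density of N(0, var) evaluated at x."""
    return math.exp(-0.5 * x * x / var) / math.sqrt(2.0 * math.pi * var)

def inclusion_probability(beta_j, pi_j, tau0_sq, tau1_sq):
    """P(gamma_j = 1 | beta_j) under the two-component normal mixture prior."""
    wide = pi_j * normal_pdf(beta_j, tau1_sq)            # gamma_j = 1, large variance
    narrow = (1.0 - pi_j) * normal_pdf(beta_j, tau0_sq)  # gamma_j = 0, small variance
    return wide / (wide + narrow)

# Illustrative values (assumptions): pi_j = 0.5, tau_j0^2 = 0.01, tau_j1^2 = 100.
p_at_zero = inclusion_probability(0.0, 0.5, 0.01, 100.0)
p_at_two = inclusion_probability(2.0, 0.5, 0.01, 100.0)
print(p_at_zero, p_at_two)
```

A coefficient value at zero is attributed almost entirely to the narrow component, whereas a value far from zero relative to $\tau_{j0}$ is attributed to the wide component, so the corresponding covariate is treated as influential.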

The earlier approach of Mitchell & Beauchamp (1988) works similarly. As prior distribution for each regression parameter they choose what they call a spike and slab distribution: a mixture prior with a point mass at zero and a diffuse uniform distribution elsewhere. This prior depends on the ratio of the probability assigned to zero to the probability assigned to all other values. This ratio has to be chosen by the user, e.g. by using a kind of Bayesian cross-validation.

In document Belitz, Christiane (2007): Model selection in generalised structured additive regression models. Dissertation, LMU München: Fakultät für Mathematik, Informatik und Statistik (Page 64-66)