Least Angle Regression - Outlier Detection and Multicollinearity in Sequential Variable Selecti

Consider the usual regression data structure with an $n \times p$ predictor matrix $X$ whose columns are standardized to have mean 0 and variance 1, and a response vector $y$ centered at 0. The true underlying model is assumed to follow the sparse regression form

$$y = X\beta + \sigma\epsilon$$

where $\epsilon_i \overset{iid}{\sim} N(0,1)$ and most elements of $\beta$ are zero. Let $A$ be the active set, the set of indices of all non-zero elements of $\beta$, with cardinality $|A|$. There are three primary possible purposes for fitting the model: identifying the variables that are members of the active set, interpreting the coefficient estimate $\hat{\beta}$, and obtaining the predicted values $\hat{y} = X\hat{\beta}$. The model is to be fitted according to the Lasso model discussed in Section 1.4.
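As a concrete illustration of this data structure, the following minimal sketch simulates data from the sparse model. The function name `simulate_sparse_regression`, the sizes, the coefficient value 2.0, and the chosen active indices are all hypothetical and serve only as an example.

```python
import numpy as np

def simulate_sparse_regression(n=50, p=8, active=(0, 3), sigma=0.5, seed=0):
    """Draw y = X beta + sigma * eps with a sparse beta (illustrative sizes)."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, p))
    # Standardize columns to mean 0 and variance 1 (population variance).
    X = (X - X.mean(axis=0)) / X.std(axis=0)
    beta = np.zeros(p)
    beta[list(active)] = 2.0        # the non-zero entries define the active set
    eps = rng.normal(size=n)        # eps_i ~ iid N(0, 1)
    y = X @ beta + sigma * eps
    y = y - y.mean()                # center the response at 0
    return X, y, beta

X, y, beta = simulate_sparse_regression()
active_set = np.flatnonzero(beta)   # indices of the non-zero elements of beta
```

Here `active_set` recovers the indices that were set to non-zero values, which is the set a variable selection procedure would aim to identify.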

The LARS algorithm described in [20] for estimating the Lasso is a (near) homotopy which introduces variables sequentially into the model. The underlying principle is that the estimated coefficient vector is updated through an adaptive piecewise-linear function (therefore differentiable almost everywhere) on a gradually increasing subset of predictors and responses, with nodes occurring whenever new variables enter the model.

As penalty parameter values are generally unknown a priori, the LARS algorithm (as with many other Lasso estimation methods) operates in a similar way to standard tree-building methods like CART [8]: begin with a null model and gradually add in variables until the model is “full”. This model path may then be “trimmed” to an optimal size by setting the penalty parameter to an appropriate value.

2.3.1 Geometric Motivation

The geometric idea underlying the LARS algorithm is that only those variables most correlated with the residuals should be included in the model. Imagine that the coefficient estimate vector traces out a piecewise linear path through the coordinate space, parametrically indexed on the penalty fraction $s \in [0,1]$ described in Section 2.2, where $s = 0$ corresponds to the null model where $A = \emptyset$ and $s = 1$ corresponds to the full regression model. The regression model is considered “full” when either $A$ contains all variable indices or there are no more available degrees of freedom (i.e. $|A| = \min(n-1, p)$ if the intercept is estimated).

At $s = 0$, the path starts at the origin and begins moving along the axis of the coefficient associated with the variable which is most correlated with the response (which acts as the initial residuals, since $y$ may be assumed to be centered without loss of generality). At a certain point, another predictor variable yields a correlation with the corresponding residuals equal to that of the initial variable. At this point the path experiences a node (or joint, or elbow) in order to introduce the new variable into the model. The path continues in a linear fashion along the new “equiangular” direction (that is, the vector direction which bisects the angle between the previous coefficient trajectory and the new axis). When another variable becomes equally correlated with the residuals corresponding to the point along the coefficient path, the path again experiences a node and changes direction. This continues until the model is full. For instances where $p < n$, the coefficient vector when $s = 1$ is the OLS solution.
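The end point of the path can be checked numerically. The sketch below, using hypothetical simulated data with $p < n$, computes the OLS fit with `numpy.linalg.lstsq` and confirms that every predictor has zero inner product with the OLS residuals. This is the normal-equations property, and it is consistent with the path stopping once no predictor remains correlated with the residuals.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 30, 4                               # illustrative sizes with p < n
X = rng.normal(size=(n, p))
X = (X - X.mean(axis=0)) / X.std(axis=0)   # standardized predictors
y = rng.normal(size=n)
y = y - y.mean()                           # centered response

# OLS coefficients, i.e. the claimed end point of the path at s = 1.
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ beta_ols

# Inner products of each predictor with the residuals (numerically zero).
correlations = X.T @ residuals
```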

2.3.2 Algorithm Details

This description of the LARS algorithm relies heavily on the paper by Efron et al. [20]. Begin by centering the response vector $y$, and centering and scaling the predictor matrix columns to have mean 0 and variance 1. To calculate the path, begin with a null model such that $A = \emptyset$, i.e. $\hat{\beta} = 0$ and the residuals $e = y$. Determine correlations between the residuals and each of the predictors using $\hat{C} = e^T X$. Let $C_{\max} = \max_j |\hat{c}_j|$ be the largest correlation in absolute value and $\hat{j} = \arg\max_{j \notin A} |\hat{c}_j|$ be the index of the variable(s) $X_{\hat{j}}$ most correlated with $e$. To update the coefficient vector estimate, it is necessary to determine the new direction of the trajectory and also the distance along this vector to travel before the next variable should enter the model. Assume that at the current node we have obtained a coefficient estimate $\hat{\beta}_0$ with corresponding fitted values $\hat{y}_0$, residuals $e_0$, and correlation vector $\hat{C}_0$ through initialization or completion of the previous step.
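The initialization step can be sketched on a tiny hand-checkable data set. The helper name `most_correlated` and the data are hypothetical. With standardized columns, the inner products $e^T X$ are proportional to the correlations, so they rank the predictors in the same order.

```python
import numpy as np

def most_correlated(X, e):
    """Return the vector e^T X, its largest absolute entry, and that entry's index."""
    c_hat = e @ X                          # same entries as e^T X
    c_max = np.max(np.abs(c_hat))
    j_hat = int(np.argmax(np.abs(c_hat)))
    return c_hat, c_max, j_hat

# Columns have mean 0 and variance 1; y has mean 0.
X = np.array([[ 1.0,  1.0],
              [-1.0,  1.0],
              [ 1.0, -1.0],
              [-1.0, -1.0]])
y = np.array([3.0, -1.0, 1.0, -3.0])

# At the null model the residuals are e = y.
c_hat, c_max, j_hat = most_correlated(X, y)
```

For this data the inner products are 8 and 4, so the first column enters the model first.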

Algorithm 1: LARS Algorithm for Lasso fit

Data: centered $y$; centered and scaled $X_{n \times p}$
Result: LARS path of coefficient vector $\beta_{\lambda_0}$ indexed by penalty parameter $\lambda_0$ (as step or fraction)
Initialize:
    Coefficient vector $\beta = 0$
    Active set $A = \emptyset$
    Residuals $e = y$
    Penalty $\lambda_0 = 0$
while $|A| < \min(n, p)$ do
    $C = e^T X$
    $C_{\max} = \max_j |c_j|$
    $A = \{j : |c_j| = C_{\max}\}$
    Determine new coefficient direction:
        $G_A = X_A^T X_A$
        $w_A = (1_A^T G_A^{-1} 1_A)^{-1/2} G_A^{-1} 1_A$
    Determine distance in new direction to next node:
        $a = X^T X_A w_A$
        $a_A = (1_A^T G_A^{-1} 1_A)^{-1/2}$
        $\hat{\gamma} = \min^{+}_{j \notin A} \left\{ \frac{C_{\max} - c_j}{a_A - a_j}, \frac{C_{\max} + c_j}{a_A + a_j} \right\}$
    Update: $\beta$, $e$, $\lambda_0$

The active set $A_0 = \{j : \hat{\beta}_{0,j} \neq 0\}$ consists of the indices of all variables that are included in the model corresponding to the current node. Begin by updating $A$ so that $A = \{j : |\hat{c}_j| = C_{\max}\}$. Let $X_A$ be the design matrix including only those columns with indices in $A$, let $G_A = X_A^T X_A$, and let $1_A$ be a vector of ones with order equal to $|A|$. The new coefficient direction is calculated by

$$w_A = (1_A^T G_A^{-1} 1_A)^{-1/2} G_A^{-1} 1_A.$$
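The direction calculation can be sketched as follows. The function name `equiangular_direction`, the simulated data, and the chosen active indices are hypothetical. The sketch follows the unsigned form of the formula given in the text and uses `numpy.linalg.solve` rather than forming $G_A^{-1}$ explicitly, which is a numerical convenience and not something the text prescribes.

```python
import numpy as np

def equiangular_direction(X, active):
    """Compute w_A = (1' G^{-1} 1)^{-1/2} G^{-1} 1 and the scalar a_A."""
    X_A = X[:, active]                        # columns with indices in A
    G_A = X_A.T @ X_A
    ones = np.ones(len(active))               # the vector 1_A of order |A|
    g_inv_ones = np.linalg.solve(G_A, ones)   # G_A^{-1} 1_A
    a_A = 1.0 / np.sqrt(ones @ g_inv_ones)    # scalar (1' G^{-1} 1)^{-1/2}
    w_A = a_A * g_inv_ones
    return w_A, a_A

rng = np.random.default_rng(2)
X = rng.normal(size=(20, 5))
X = (X - X.mean(axis=0)) / X.std(axis=0)
active = [0, 2, 3]                            # illustrative active set
w_A, a_A = equiangular_direction(X, active)

# Direction of the fitted values implied by w_A.
u_A = X[:, active] @ w_A
```

Under these formulas, `u_A` has unit length and the same inner product `a_A` with every active column, which is the sense in which the direction is equiangular.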

The new coefficient estimates are updated using the equation

$$\hat{\beta}(\gamma) = \hat{\beta}_0 + \gamma w_A$$

where $\gamma$ is a scalar multiple. Now that the direction has been determined, the next step is to determine the distance (represented by $\gamma$) to the next node.
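Because $w_A$ has one entry per active variable while the coefficient vector has $p$ entries, the sketch below assumes that the update acts only on the active coordinates and leaves the inactive coefficients unchanged. The helper name `update_coefficients` and the numbers are hypothetical.

```python
import numpy as np

def update_coefficients(beta_0, active, w_A, gamma):
    """Move the active coefficients a distance gamma along the direction w_A."""
    beta = beta_0.copy()             # leave the previous node's estimate intact
    beta[active] += gamma * w_A      # inactive coefficients are not changed
    return beta

beta_0 = np.zeros(2)                 # null model at the start of the path
beta = update_coefficients(beta_0, active=[0], w_A=np.array([0.5]), gamma=2.0)
```

With these illustrative values the first coefficient moves from 0 to 1 while the second stays at 0.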

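To determine this, let $a = X^T X_A w_A$. Note that the elements of $a$ belonging to $A$ are $a_A = (1_A^T G_A^{-1} 1_A)^{-1/2} 1_A$; however, for convenience I will use $a_A$ to represent the scalar $(1_A^T G_A^{-1} 1_A)^{-1/2}$ rather than the vector form previously given. Parametric vector functions of $\gamma$ can be used to represent the new fitted values

$$\hat{y}(\gamma) = \hat{y}_0 + \gamma X_A w_A$$

and the correlation vector

$$C(\gamma) = X^T(y - \hat{y}(\gamma)) = \hat{C}_0 - \gamma a$$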

as the coefficient vector $\hat{\beta}(\gamma)$ progresses along the path defined above. Note that the elements of $C(\gamma)$ belonging to $A$ will remain equal to one another and largest (in absolute value) as $\gamma$ changes. Let us call this value $C_A(\gamma) = C_{\max} - \gamma a_A$. A new variable will enter the model when some $|c_j(\gamma)| = C_A(\gamma)$ with $j \notin A$. Setting the two values equal to each other and solving for $\gamma$ yields the estimate

$$\hat{\gamma} = \min^{+}_{j \notin A} \left\{ \frac{C_{\max} - c_j}{a_A - a_j}, \frac{C_{\max} + c_j}{a_A + a_j} \right\}$$

where $\min^{+}$ indicates that the minimum is only taken over positive arguments.
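The step-length rule can be sketched on a small hand-checkable example. The function name `lars_step_length` and the data are hypothetical. The sketch applies the formulas exactly as written above, so it assumes that the correlations of the active variables all equal $+C_{\max}$ (no sign adjustment of the active columns is made).

```python
import numpy as np

def lars_step_length(X, e, active):
    """Distance gamma_hat to the next node, assuming active correlations are +C_max."""
    c = X.T @ e                               # current correlation vector
    c_max = np.max(np.abs(c))
    X_A = X[:, active]
    ones = np.ones(len(active))
    g_inv_ones = np.linalg.solve(X_A.T @ X_A, ones)
    a_A = 1.0 / np.sqrt(ones @ g_inv_ones)    # scalar a_A
    w_A = a_A * g_inv_ones
    a = X.T @ (X_A @ w_A)                     # a = X' X_A w_A
    candidates = []
    for j in range(X.shape[1]):
        if j in active:
            continue
        pairs = ((c_max - c[j], a_A - a[j]), (c_max + c[j], a_A + a[j]))
        for num, den in pairs:
            # min+ : keep only strictly positive candidate distances
            if den != 0 and num / den > 0:
                candidates.append(num / den)
    gamma_hat = min(candidates)
    return gamma_hat, a, a_A, c, c_max

X = np.array([[ 1.0,  1.0],
              [-1.0,  1.0],
              [ 1.0, -1.0],
              [-1.0, -1.0]])
y = np.array([3.0, -1.0, 1.0, -3.0])

gamma_hat, a, a_A, c, c_max = lars_step_length(X, y, active=[0])
c_new = c - gamma_hat * a                     # C(gamma_hat) = C_0 - gamma_hat * a
```

In this example the two candidate distances for the inactive variable are 2 and 6, so the step length is 2, at which point both entries of `c_new` equal 4 and the second variable enters the model.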

In document Outlier Detection and Multicollinearity in Sequential Variable Selection: A Least Angle Regression-Based Approach (Page 43-47)