
Consider the usual regression data structure with $n \times p$ predictor matrix $X$, with columns standardized to have mean 0 and variance 1, and response vector $y$ centered at 0. The true underlying model is assumed to follow the sparse regression form

$$y = X\beta + \sigma\epsilon$$

where $\epsilon_i \stackrel{\text{iid}}{\sim} N(0,1)$ and most elements of $\beta$ are zero. Let $A$ be the active set, the set of indices of all non-zero elements of $\beta$, with cardinality $|A|$. There are three primary possible purposes for fitting the model: identifying the variables that are members of the active set, interpreting the coefficient estimate $\hat{\beta}$, and obtaining the predicted values $\hat{y} = X\hat{\beta}$. The model is to be fitted according to the Lasso model discussed in Section 1.4.
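To make the data structure concrete, the following is a minimal sketch that simulates data from the sparse regression form above. The function name `simulate_sparse_regression`, the effect size of 2.0, the chosen active indices, and the use of the population variance (`ddof=0`) for standardization are all assumptions made for illustration and are not taken from the text.

```python
import numpy as np

def simulate_sparse_regression(n=100, p=10, active=(0, 3, 7), sigma=1.0, seed=0):
    """Draw (X, y, beta) from y = X beta + sigma * eps with a sparse beta."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, p))
    # Standardize columns to mean 0 and variance 1 (population variance, an assumption).
    X = (X - X.mean(axis=0)) / X.std(axis=0)
    beta = np.zeros(p)
    beta[list(active)] = 2.0      # hypothetical non-zero effect size
    eps = rng.normal(size=n)      # eps_i ~ iid N(0, 1)
    y = X @ beta + sigma * eps
    y = y - y.mean()              # center the response at 0
    return X, y, beta

X, y, beta = simulate_sparse_regression()
active_set = np.flatnonzero(beta)  # A: indices of the non-zero elements of beta
```

Here `active_set` plays the role of $A$ and `len(active_set)` the role of $|A|$.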

The LARS algorithm described in [20] for estimating the Lasso is a (near) homotopy which introduces variables sequentially into the model. The underlying principle is that the estimated coefficient vector is updated through an adaptive piecewise-linear function (and is therefore differentiable almost everywhere) on a gradually increasing subset of predictors and responses, with nodes occurring whenever new variables enter the model.

As penalty parameter values are generally unknown a priori, the LARS algorithm (as with many other Lasso estimation methods) operates in a similar way to standard tree-building methods like CART [8]: begin with a null model and gradually add in variables until the model is “full”. This model path may then be “trimmed” to an optimal size by setting the penalty parameter to an appropriate value.

2.3.1 Geometric Motivation

The geometric idea underlying the LARS algorithm is that only those variables most correlated with the residuals should be included in the model. Imagine that the coefficient estimate vector traces out a piecewise linear path through the coordinate space, parametrically indexed on the penalty fraction $s \in [0,1]$ described in Section 2.2, where $s = 0$ corresponds to the null model where $A = \emptyset$ and $s = 1$ corresponds to the full regression model. The regression model is considered “full” when $A$ contains all variable indices or when there are no more available degrees of freedom (i.e. $|A| = \min(n-1, p)$ if the intercept is estimated).

At $s = 0$, the path starts at the origin and begins moving along the axis of the coefficient associated with the variable which is most correlated with the response (which acts as the initial residuals, since $y$ may be assumed to be centered without loss of generality). At a certain point, another predictor variable yields a correlation with the corresponding residuals equal to that of the initial variable. At this point the path experiences a node (or joint, or elbow) in order to introduce the new variable into the model. The path continues in a linear fashion along the new “equiangular” direction (that is, the vector direction which bisects the angle between the previous coefficient trajectory and the new axis). When another variable becomes equally correlated with the residuals corresponding to the current point along the coefficient path, the path experiences another node and changes direction again. This continues until the model is full. For instances where $p < n$, the coefficient vector at $s = 1$ is the OLS solution.
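The path described above can be traced numerically. The following is a rough sketch rather than a definitive implementation: it uses the direction and step-length formulas of Section 2.3.2, adds the sign adjustment of the active columns used by Efron et al. [20] (which the description here leaves implicit), and omits any Lasso-specific modification such as removing variables from the active set. The function name `lar_path`, the simulated data, and the tolerance `1e-12` are illustrative assumptions.

```python
import numpy as np

def lar_path(X, y):
    """Trace a least-angle path; returns one coefficient vector per node."""
    n, p = X.shape
    beta = np.zeros(p)
    path = [beta.copy()]
    active = []
    for _ in range(min(n - 1, p)):
        c = X.T @ (y - X @ beta)                 # correlations with current residuals
        C_max = np.max(np.abs(c))
        if not active:                           # first variable: most correlated with y
            active.append(int(np.argmax(np.abs(c))))
        s = np.sign(c[active])
        X_A = X[:, active] * s                   # sign-adjusted active columns (assumption)
        G_inv_1 = np.linalg.solve(X_A.T @ X_A, np.ones(len(active)))
        a_A = 1.0 / np.sqrt(G_inv_1.sum())
        w_A = a_A * G_inv_1                      # equiangular direction
        a = X.T @ (X_A @ w_A)
        gamma, j_new = C_max / a_A, None         # default: full step to the active-set OLS fit
        for j in range(p):
            if j in active:
                continue
            for num, den in ((C_max - c[j], a_A - a[j]), (C_max + c[j], a_A + a[j])):
                if den > 1e-12 and 0 < num / den < gamma:
                    gamma, j_new = num / den, j  # smallest positive step so far
        beta[active] += gamma * s * w_A
        path.append(beta.copy())
        if j_new is None:                        # no variable left to enter
            break
        active.append(j_new)
    return np.array(path)

rng = np.random.default_rng(0)
n, p = 60, 5
X = rng.normal(size=(n, p))
X = (X - X.mean(axis=0)) / X.std(axis=0)
y = X @ np.array([3.0, 0.0, -2.0, 0.0, 1.0]) + rng.normal(size=n)
y = y - y.mean()

path = lar_path(X, y)
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
print(np.allclose(path[-1], beta_ols))           # compares the final node with OLS
```

Each row of `path` is a node of the piecewise linear trajectory. With $p < n$, the last row can be compared against the least-squares fit as in the final line.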

2.3.2 Algorithm Details

This description of the LARS algorithm relies heavily on the paper by Efron et al. [20]. Begin by centering the response vector $y$, and centering and scaling the predictor matrix columns to have mean 0 and variance 1. To calculate the path, begin with a null model such that $A = \emptyset$, i.e. $\hat{\beta} = 0$ and the residuals $e = y$. Determine the correlations between the residuals and each of the predictors using $\hat{C} = e^T X$. Let $C_{\max} = \max_j |\hat{c}_j|$ be the largest correlation in absolute value and $\hat{j} = \arg\max_{j \notin A} |\hat{c}_j|$ be the index of the variable(s) $X_{\hat{j}}$ most correlated with $e$. To update the coefficient vector estimate, it is necessary to determine the new direction of the trajectory and also the distance along this vector to travel before the next variable should enter the model. Assume that at the current node we have obtained a coefficient estimate $\hat{\beta}_0$ with corresponding fitted values $\hat{y}_0$, residuals $e_0$, and correlation vector $\hat{C}_0$ through initialization or completion of the previous step.
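The initialization step can be sketched in a few lines of NumPy. The toy data (a response driven mainly by column 2) and all variable names are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 5
X = rng.normal(size=(n, p))
X = (X - X.mean(axis=0)) / X.std(axis=0)   # mean 0, variance 1 (population variance)
y = 3.0 * X[:, 2] + rng.normal(size=n)     # hypothetical response driven by column 2
y = y - y.mean()

beta_hat = np.zeros(p)                     # null model: A is empty
e = y - X @ beta_hat                       # residuals equal y at the start
C_hat = e @ X                              # e^T X, one entry per predictor
C_max = np.max(np.abs(C_hat))              # largest correlation in absolute value
j_hat = int(np.argmax(np.abs(C_hat)))      # index of the most correlated predictor
```

Because the columns are standardized, the entries of `C_hat` are proportional to the sample correlations, so ranking them by absolute value ranks the predictors by correlation with the residuals.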

The active set $A_0 = \{ j : \hat{\beta}_{0,j} \neq 0 \}$ consists of the indices of all variables that are included in the model corresponding to the current node. Begin by updating $A$ so that $A = \{ j : |\hat{c}_j| = C_{\max} \}$. Let $X_A$ be the design matrix including only those columns with indices in $A$, let $G_A = X_A^T X_A$, and let $1_A$ be a vector of ones with order equal to $|A|$. The new coefficient direction is calculated by

$$w_A = \left( 1_A^T G_A^{-1} 1_A \right)^{-1/2} G_A^{-1} 1_A.$$

Algorithm 1: LARS Algorithm for Lasso fit
Data: centered $y$; centered and scaled $X_{n \times p}$
Result: LARS path of coefficient vector $\beta_{\lambda_0}$ indexed by penalty parameter $\lambda_0$ (as step or fraction)
Initialize:
    Coefficient vector $\beta = 0$
    Active set $A = \emptyset$
    Residuals $e = y$
    Penalty $\lambda_0 = 0$
while $|A| < \min(n, p)$ do
    $C = e^T X$
    $C_{\max} = \max_j |c_j|$
    $A = \{ j : |c_j| = C_{\max} \}$
    Determine new coefficient direction:
        $G_A = X_A^T X_A$
        $w_A = \left( 1_A^T G_A^{-1} 1_A \right)^{-1/2} G_A^{-1} 1_A$
    Determine distance in new direction to next node:
        $a = X^T X_A w_A$
        $a_A = \left( 1_A^T G_A^{-1} 1_A \right)^{-1/2}$
        $\hat{\gamma} = \min^{+}_{j \notin A} \left\{ \frac{C_{\max} - c_j}{a_A - a_j}, \frac{C_{\max} + c_j}{a_A + a_j} \right\}$
    Update: $\beta$, $e$, $\lambda_0$
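The direction formula can be sketched as follows. The function name `equiangular_direction` is hypothetical, and the sketch follows the formula exactly as written here, so it treats all active correlations as having the same sign. Handling mixed signs (by flipping the sign of the relevant active columns, as in Efron et al. [20]) is left out.

```python
import numpy as np

def equiangular_direction(X, active):
    """Return (w_A, a_A) for the columns of X indexed by `active`."""
    X_A = X[:, active]
    G_A = X_A.T @ X_A
    ones = np.ones(len(active))                 # 1_A, a vector of ones of order |A|
    G_inv_ones = np.linalg.solve(G_A, ones)     # G_A^{-1} 1_A without forming the inverse
    a_A = 1.0 / np.sqrt(ones @ G_inv_ones)      # scalar (1_A^T G_A^{-1} 1_A)^{-1/2}
    w_A = a_A * G_inv_ones
    return w_A, a_A

rng = np.random.default_rng(2)
X = rng.normal(size=(40, 4))
X = (X - X.mean(axis=0)) / X.std(axis=0)
active = [0, 2]
w_A, a_A = equiangular_direction(X, active)
u = X[:, active] @ w_A                          # change in fitted values per unit step
```

Two properties follow from the formula and are easy to check on `u`: it has unit length, and its inner product with every active column equals `a_A`, which is why the active correlations decrease at the same rate.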

The new coefficient estimates are updated using the equation

$$\hat{\beta}(\gamma) = \hat{\beta}_0 + \gamma w_A$$

where $\gamma$ is a scalar multiple. Now that the direction has been determined, the next step is to determine the distance (represented by $\gamma$) to the next node.

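To determine this, let $a = X^T X_A w_A$. Note that $a_A = \left( 1_A^T G_A^{-1} 1_A \right)^{-1/2} 1_A$; however, for convenience I will use $a_A$ to represent the scalar $\left( 1_A^T G_A^{-1} 1_A \right)^{-1/2}$ rather than the vector form previously given. Parametric vector functions on $\gamma$ can be used to represent the new fitted values

$$\hat{y}(\gamma) = \hat{y}_0 + \gamma X_A w_A$$

and the correlation vector

$$C(\gamma) = X^T (y - \hat{y}(\gamma)) = \hat{C}_0 - \gamma a$$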

as the coefficient vector $\hat{\beta}(\gamma)$ progresses along the path defined above. Note that the elements of $C(\gamma)$ belonging to $A$ will remain equal to one another and largest (in absolute value) as $\gamma$ changes. Let us call this value $C_A(\gamma) = C_{\max} - \gamma a_A$. A new variable will enter the model when some $|c_j(\gamma)| = C_A(\gamma)$ with $j \notin A$. Setting the two values equal to each other and solving for $\gamma$ yields the estimate

$$\hat{\gamma} = \min^{+}_{j \notin A} \left\{ \frac{C_{\max} - c_j}{a_A - a_j}, \frac{C_{\max} + c_j}{a_A + a_j} \right\}$$

where $\min^{+}$ indicates that the minimum is only taken over positive arguments.
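As a hedged illustration of the step-length rule, the sketch below evaluates both candidate ratios for every inactive variable and keeps the smallest positive one. The function name `lars_step_length` and the toy numbers are hypothetical. Returning `(None, None)` when no positive candidate exists is a design choice of this sketch, not something specified in the text.

```python
import numpy as np

def lars_step_length(C_hat, a, a_A, active):
    """Smallest positive gamma at which an inactive |c_j(gamma)| reaches C_A(gamma)."""
    C_max = np.max(np.abs(C_hat))
    inactive = np.setdiff1d(np.arange(len(C_hat)), active)
    candidates = []
    for j in inactive:
        for num, den in ((C_max - C_hat[j], a_A - a[j]),
                         (C_max + C_hat[j], a_A + a[j])):
            if den != 0 and num / den > 0:      # min+ keeps only positive arguments
                candidates.append((num / den, int(j)))
    if not candidates:
        return None, None
    return min(candidates)                      # (gamma_hat, index of entering variable)

# Toy example: variable 0 is active with C_max = 4 and a_A = 1.
C_hat = np.array([4.0, 2.0, -1.0])
a = np.array([1.0, 0.2, 0.5])
gamma_hat, j_enter = lars_step_length(C_hat, a, a_A=1.0, active=[0])
C_new = C_hat - gamma_hat * a                   # C(gamma) = C_hat_0 - gamma * a
```

In this toy example the entering variable has a negative correlation, so it is the second ratio, $(C_{\max} + c_j)/(a_A + a_j)$, that attains the minimum. After the step, `abs(C_new[j_enter])` matches the common active value $C_{\max} - \hat{\gamma} a_A$.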
