Data Resampling Methods - Other Variable Selection Schemes

2.2 Other Variable Selection Schemes

2.2.2 Data Resampling Methods

Resampling methods attack variable selection from a different perspective by sampling repeatedly from the data to mimic what would happen with new data.

Cross Validation

The fundamental idea behind cross-validation (CV) is to divide the data into two parts, using the first part to build the model and the second part to evaluate the fit. We generally focus on K-fold cross-validation, which divides the data set randomly into K equal (or nearly equal) parts. We denote these parts by $\Gamma_1, \ldots, \Gamma_K$ and let $\Gamma_{(k)}$ be the data with $\Gamma_k$ deleted, $k = 1, \ldots, K$. Suppose we have a sequence of forward selection models $M_0, M_1, \ldots, M_p$, with $M_j$ the model with $j$ variables. Then for each $k$, we carry out forward selection on $\Gamma_{(k)}$, generating models $M_0^k, M_1^k, \ldots, M_p^k$ with respective sizes $0, 1, \ldots, p$. Note that there is no reason for $M_j^k$ to have the same $j$ variables as $M_j$, since forward selection may select different variables on different subsets. Then for each model size $j$ we evaluate its cross-validated error $\widehat{CV}(j)$ as

$$\widehat{CV}(j) = \frac{1}{n} \sum_{k=1}^{K} \sum_{(y_i, x_i) \in \Gamma_k} \left( y_i - \hat{\mu}^k(x_i, M_j^k) \right)^2$$

where $\hat{\mu}^k(x_i, M_j^k)$ is the prediction from model $M_j^k$, fit on $\Gamma_{(k)}$, evaluated on the left-out subset $\Gamma_k$.

We select the model size $k = \arg\min_j \widehat{CV}(j)$ (here $k$ denotes the chosen size, not a fold index). Stone (1977) showed that leave-one-out cross-validation (K = n) is asymptotically equivalent to AIC. However, based on simulation results, Breiman and Spector (1992) recommend five-fold cross-validation for variable selection, because leave-one-out CV too often selects the same model as the fit to the entire data set.
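The K-fold procedure above can be sketched in Python with NumPy. This is a minimal illustration rather than a reference implementation: the function names `forward_selection` and `cv_forward_selection` are hypothetical, forward selection is taken to be greedy on the residual sum of squares, and every model (including the size-0 model) is assumed to contain an intercept.

```python
import numpy as np

def forward_selection(X, y, max_size):
    """Greedy forward selection; returns column indices in order of entry."""
    n, p = X.shape
    chosen = []
    for _ in range(max_size):
        best_rss, best_j = np.inf, None
        for j in range(p):
            if j in chosen:
                continue
            # Assumption: every candidate model includes an intercept column.
            A = np.column_stack([np.ones(n), X[:, chosen + [j]]])
            beta, *_ = np.linalg.lstsq(A, y, rcond=None)
            rss = np.sum((y - A @ beta) ** 2)
            if rss < best_rss:
                best_rss, best_j = rss, j
        chosen.append(best_j)
    return chosen

def cv_forward_selection(X, y, K=5, seed=0):
    """Return (CV error for sizes 0..p, size with the smallest CV error)."""
    n, p = X.shape
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), K)  # Gamma_1, ..., Gamma_K
    sq_err = np.zeros(p + 1)
    for test in folds:
        train = np.setdiff1d(np.arange(n), test)   # data with this fold deleted
        # Forward selection is rerun on each training part, so the selected
        # variables may differ from fold to fold.
        order = forward_selection(X[train], y[train], p)
        for j in range(p + 1):
            cols = np.asarray(order[:j], dtype=int)
            A_tr = np.column_stack([np.ones(train.size), X[train][:, cols]])
            A_te = np.column_stack([np.ones(test.size), X[test][:, cols]])
            beta, *_ = np.linalg.lstsq(A_tr, y[train], rcond=None)
            sq_err[j] += np.sum((y[test] - A_te @ beta) ** 2)
    cv = sq_err / n
    return cv, int(np.argmin(cv))
```

Calling `cv, best = cv_forward_selection(X, y, K=5)` returns the cross-validated error for each model size and the size that minimizes it. Setting `K` equal to the number of observations gives leave-one-out CV.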

Both leave-one-out and five-fold CV are inconsistent for model selection. Shao (1993) showed that they tend to include too many variables. He proved that we can ensure model selection consistency by letting the number of observations left out, $n_v$, satisfy $n_v/n \to 1$ as $n \to \infty$.

Bootstrap

The bootstrap attacks the variable selection problem by sampling with replacement from the empirical distribution. The bootstrap is typically applied in one of two ways. First, we can bootstrap the residuals. We start by fitting the full model, estimating the regression coefficients $\hat{\beta}$ and the residuals $e_i$, $i = 1, \ldots, n$. We then studentize the residuals as

$$e_i^* = \frac{e_i}{\sqrt{1 - h_i}}$$

where $h_i$ is the $i$th diagonal element of the hat matrix $H = X(X^T X)^{-1} X^T$. We then generate, for each observation, a new response $y_j^* = x_j \hat{\beta} + e_j^*$, where $e_j^*$ is sampled with replacement from the studentized residuals. Bootstrapping residuals relies crucially on the errors having constant variance. This method is suitable if we treat $X$ as fixed.
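The residual bootstrap just described might look as follows in Python with NumPy. This is a sketch under stated assumptions: the function name `residual_bootstrap` is hypothetical, `X` is assumed to have full column rank (and to contain an intercept column if one is wanted), and the leverages $h_i$ are computed from a thin QR decomposition instead of forming the hat matrix explicitly.

```python
import numpy as np

def residual_bootstrap(X, y, B=200, seed=0):
    """Return a (B, n) array of bootstrapped responses with X held fixed."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # Fit the full model and compute fitted values and raw residuals.
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    fitted = X @ beta_hat
    resid = y - fitted
    # Leverages h_i are the diagonal of H = X (X^T X)^{-1} X^T.
    # With X = QR (thin QR), H = Q Q^T, so h_i is the squared norm of row i of Q.
    Q, _ = np.linalg.qr(X)
    h = np.sum(Q ** 2, axis=1)
    # Rescale the residuals: e_i / sqrt(1 - h_i).
    e_star = resid / np.sqrt(1.0 - h)
    # Draw residuals with replacement and add them to the fitted values.
    idx = rng.integers(0, n, size=(B, n))
    return fitted + e_star[idx]
```

Each row of the returned array is one bootstrapped response vector, which can be paired with the original `X` and passed to whatever selection procedure is being studied.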

Second, we can bootstrap by sampling with replacement from the observed $(x_i, y_i)$ pairs. This method is appealing if we are working in the random-$X$ case. It also makes no assumptions about the model, unlike the homoscedastic error assumption above. Consequently, we can view bootstrapping $(x, y)$ pairs as “more nonparametric” than bootstrapping residuals. One problem that may arise in the $p > n$ case is that each bootstrapped data set has rank, on average, only about 63% as large as that of the original data set, since a bootstrap sample of size $n$ contains on average only about 63% of the distinct original observations.
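The 63% figure can be checked numerically. The sketch below, in Python with NumPy, draws index sets for the pairs bootstrap and measures the average fraction of distinct observations in each resample; the function names are hypothetical. For a sample of size $n$ the expected fraction is $1 - (1 - 1/n)^n$, which approaches $1 - e^{-1} \approx 0.632$.

```python
import numpy as np

def pairs_bootstrap_indices(n, B=1000, seed=0):
    """Row indices for B bootstrap resamples of n (x_i, y_i) pairs."""
    rng = np.random.default_rng(seed)
    return rng.integers(0, n, size=(B, n))

def mean_distinct_fraction(idx):
    """Average fraction of distinct original observations per resample."""
    n = idx.shape[1]
    return float(np.mean([np.unique(row).size / n for row in idx]))

def bootstrap_rank(X, rows):
    """Rank of the resampled design matrix X[rows]."""
    return int(np.linalg.matrix_rank(X[rows]))
```

A resample is formed as `X[idx[b]], y[idx[b]]`. When $p > n$ and the rows of `X` are in general position, `bootstrap_rank(X, idx[b])` equals the number of distinct rows drawn, which is why the rank shrinks by roughly the same factor.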

Shao (1996) showed that the bootstrap, applied as above, is not consistent for model selection. If we bootstrap residuals, we can ensure consistency by increasing the variability of the residuals. If we bootstrap pairs, we can ensure consistency by constructing smaller bootstrap samples of size $n_b$ with $n_b/n \to 0$ as $n \to \infty$. For a thorough treatment of the bootstrap see

Little Bootstrap

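Breiman (1992) introduced the little bootstrap as an alternative to $C_p$ that does not suffer from such severe selection bias. Suppose we fit a model with $j$ coefficients, denoted $\hat{\mu}_j$. The model error can be written as

$$ME(\hat{\mu}_j) = \|\mu - \hat{\mu}_j\|^2 = RSS_j - RSS_p + \|\mu - \hat{\mu}_p\|^2 - 2(\epsilon, \hat{\mu}_p - \hat{\mu}_j)$$

where the subscript $p$ denotes the full model with all predictors included, $\epsilon = y - \mu$ is the vector of errors, and $(\cdot, \cdot)$ is the inner product. We also estimate the residual variance $\hat{\sigma}^2$ from the full model fit $\hat{\mu}_p$. The first two RSS terms are directly computed. We can estimate $\|\mu - \hat{\mu}_p\|^2$ as $p\hat{\sigma}^2$, assuming the full model has no bias. We still need an estimate for the last term, and Breiman proposes the little bootstrap for this purpose. We generate bootstrapped responses as

$$\tilde{y}_i = y_i + \tilde{\epsilon}_i, \quad \tilde{\epsilon}_i \sim N(0, t^2 \hat{\sigma}^2).$$

Note that we add the error term to the original response $y$ and not to the predicted response $\hat{y}$, as the residual bootstrap of the previous section does. We then fit the model using forward selection to the $(\tilde{y}_i, x_i)$ pairs with $j$ variables and with all $p$ variables as above. Call these fits $\tilde{\mu}_j$ and $\tilde{\mu}_p$. We then calculate

$$\frac{1}{t^2} (\tilde{\epsilon}, \tilde{\mu}_p - \tilde{\mu}_j).$$

Repeat this $B$ times and average. Breiman showed that

$$\frac{1}{t^2} E(\tilde{\epsilon}, \tilde{\mu}_p - \tilde{\mu}_j) \approx E(\epsilon, \hat{\mu}_p - \hat{\mu}_j).$$

Substituting this average into the expression above gives the estimate

$$\widehat{ME}(\hat{\mu}_j) = RSS_j - RSS_p + p\hat{\sigma}^2 - \frac{2}{t^2 B} \sum_{b=1}^{B} (\tilde{\epsilon}^b, \tilde{\mu}_p^b - \tilde{\mu}_j^b)$$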

where the superscript $b$ corresponds to a particular bootstrap sample. Based on simulations, Breiman recommends t = 0.6 and B = 40. The best model is then chosen with respect to this model error estimate.
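A rough Python sketch of the little bootstrap estimate, using NumPy, is given below. It rests on several assumptions that go beyond the text: the function names are hypothetical, forward selection is greedy on the residual sum of squares, no intercept is fitted (the data are taken to be centred, so the full model has exactly $p$ coefficients), $\hat{\sigma}^2$ is the full-model RSS divided by $n - p$, and the bootstrap inner product is divided by $t^2$ so that it is on the scale of the original errors.

```python
import numpy as np

def fitted(X, y, cols):
    """Least-squares fitted values using only the columns in `cols`."""
    if len(cols) == 0:
        return np.zeros_like(y, dtype=float)
    A = X[:, cols]
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return A @ beta

def forward_select(X, y, j):
    """Greedy forward selection of j columns by residual sum of squares."""
    chosen = []
    for _ in range(j):
        rest = [c for c in range(X.shape[1]) if c not in chosen]
        rss = [np.sum((y - fitted(X, y, chosen + [c])) ** 2) for c in rest]
        chosen.append(rest[int(np.argmin(rss))])
    return chosen

def little_bootstrap_me(X, y, j, t=0.6, B=40, seed=0):
    """Little bootstrap estimate of the model error for a size-j model."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    full = list(range(p))
    rss_p = np.sum((y - fitted(X, y, full)) ** 2)
    sigma2 = rss_p / (n - p)  # residual variance from the full model
    rss_j = np.sum((y - fitted(X, y, forward_select(X, y, j))) ** 2)
    total = 0.0
    for _ in range(B):
        # Perturb the original responses, not the fitted values.
        eps = rng.normal(0.0, t * np.sqrt(sigma2), size=n)
        y_tilde = y + eps
        mu_p = fitted(X, y_tilde, full)
        # Selection is repeated on each perturbed response vector.
        mu_j = fitted(X, y_tilde, forward_select(X, y_tilde, j))
        total += eps @ (mu_p - mu_j)
    # Average over B replicates, rescaled by 1 / t^2.
    inner = total / (B * t ** 2)
    return rss_j - rss_p + p * sigma2 - 2.0 * inner
```

Evaluating `little_bootstrap_me(X, y, j)` for `j = 0, ..., p` and taking the smallest value gives the selected model size. When `j` equals `p` the two fits coincide and the estimate reduces to $p\hat{\sigma}^2$, which is a convenient sanity check.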

In document Permuted Inclusion Criterion: A Variable Selection Technique (Page 38-42)