
Stepwise and best subset regression

Stepwise variable selection is the most commonly used model selection technique (Harrell, 2001; Hastie et al., 2009). As a selection procedure, stepwise is implemented in three versions. Forward selection begins with an empty model consisting of the intercept only. Variables are then added to the model sequentially until a predefined stopping rule is satisfied. At each step of the selection process, we add the variable whose inclusion results in the best fit (i.e. the greatest improvement in the summary measure). Some of the most commonly used summary measures include R², adjusted R², residual sum of squares and deviance (Draper and Smith, 1998). A predefined significance level is typically used as the stopping criterion so that only statistically significant variables are added to the model. Backward elimination procedures, on the other hand, begin with a saturated (full) model composed of all candidate predictor variables. Using a predefined stopping rule, the procedure sequentially removes the variables that contribute least to the model fit. Stepwise selection is a variation combining both forward and backward selection algorithms: at each step of the variable selection process, after a variable has been added to the model, variables already in the model are allowed to be eliminated. For instance, if the p-value of a given predictor is above a specified threshold, it is eliminated from the model. The iterative process ends when a pre-specified stopping criterion is satisfied.
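The forward selection loop described above can be sketched in a few lines. This is a minimal illustration and not a reference implementation: it uses only the Python standard library, it stops when Akaike's information criterion (computed for a Gaussian linear model as n log(RSS/n) + 2k, up to a constant) no longer improves rather than using a significance test, and the function names `rss`, `aic` and `forward_select` as well as the toy data are hypothetical. It also assumes the candidate predictors are not perfectly collinear and that no model fits the data exactly.

```python
import math


def rss(X, y, cols):
    """Residual sum of squares of an OLS fit of y on an intercept plus columns `cols` of X."""
    rows = [[1.0] + [row[j] for j in cols] for row in X]
    k = len(cols) + 1
    # Augmented normal equations [A'A | A'y].
    M = [[sum(r[a] * r[b] for r in rows) for b in range(k)]
         + [sum(r[a] * yi for r, yi in zip(rows, y))] for a in range(k)]
    # Gauss-Jordan elimination with partial pivoting (assumes A'A is non-singular).
    for i in range(k):
        pivot = max(range(i, k), key=lambda r: abs(M[r][i]))
        M[i], M[pivot] = M[pivot], M[i]
        for r in range(k):
            if r != i:
                f = M[r][i] / M[i][i]
                M[r] = [u - f * v for u, v in zip(M[r], M[i])]
    beta = [M[i][k] / M[i][i] for i in range(k)]
    return sum((yi - sum(b * v for b, v in zip(beta, r))) ** 2
               for r, yi in zip(rows, y))


def aic(rss_value, n, n_params):
    """Gaussian AIC up to an additive constant (requires rss_value > 0)."""
    return n * math.log(rss_value / n) + 2 * n_params


def forward_select(X, y):
    """Greedy forward selection: add the predictor that lowers the RSS most,
    and stop as soon as the AIC of the enlarged model fails to improve."""
    n, p = len(y), len(X[0])
    selected = []
    best_aic = aic(rss(X, y, selected), n, 1)  # intercept-only model
    while len(selected) < p:
        remaining = [j for j in range(p) if j not in selected]
        cand = min(remaining, key=lambda j: rss(X, y, selected + [j]))
        cand_aic = aic(rss(X, y, selected + [cand]), n, len(selected) + 2)
        if cand_aic >= best_aic:
            break
        selected.append(cand)
        best_aic = cand_aic
    return selected


# Toy data (made up for illustration): y is driven almost entirely by column 0.
x0 = [1, 2, 3, 4, 5, 6, 7, 8]
x1 = [1, -1, 1, -1, 1, -1, 1, -1]
x2 = [3, 1, 4, 1, 5, 9, 2, 6]
y = [2.1, 3.8, 6.15, 7.95, 10.2, 11.9, 13.85, 16.05]
X = [list(row) for row in zip(x0, x1, x2)]

selected = forward_select(X, y)
print(selected)  # column indices in order of entry
```

Backward elimination follows the same pattern in reverse: start from all columns and repeatedly drop the one whose removal increases the RSS least, until the chosen criterion stops improving.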

Stepwise variable selection procedures produce nested sequences of models. Collinearity among the predictors can cause them to compete, making the selection of ‘important’ predictors arbitrary. This competition and the accompanying (potential) arbitrariness are aggravated by the greedy nature of stepwise algorithms (Hastie et al., 2009; Hesterberg et al., 2008): the selection process makes the best change at each individual step without regard to its future effects. This can produce unstable models in which relatively small changes in the data are likely to cause one variable to be selected instead of another, after which subsequent choices may be completely different. Best-subset selection is an attempt to address this limitation by considering all subsets of variables of each size, limiting itself only to a specified maximum number of best predictor subsets (Furnival and Wilson, 1974). Given p variables, best-subset selection finds, for each size k ∈ {0, 1, 2, . . . , p}, the subset of size k that yields the smallest residual sum of squares. A distinct advantage over stepwise procedures is that, with best-subset regression, the best set of two predictors need not include the predictor that was best when considered in isolation. However, because it considers a much greater number of possible models, biases in inference are even larger (Draper and Smith, 1998).
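To make the search concrete, the following sketch enumerates every subset by brute force and keeps the one with the smallest residual sum of squares for each size k. It is an illustration only: it does not use the branch-and-bound strategy of Furnival and Wilson (1974), the names `rss` and `best_subsets` are hypothetical, and the toy data were constructed by hand so that the best single predictor is absent from the best pair.

```python
from itertools import combinations


def rss(X, y, cols):
    """Residual sum of squares of an OLS fit of y on an intercept plus columns `cols` of X."""
    rows = [[1.0] + [row[j] for j in cols] for row in X]
    k = len(cols) + 1
    # Augmented normal equations [A'A | A'y].
    M = [[sum(r[a] * r[b] for r in rows) for b in range(k)]
         + [sum(r[a] * yi for r, yi in zip(rows, y))] for a in range(k)]
    # Gauss-Jordan elimination with partial pivoting (assumes A'A is non-singular).
    for i in range(k):
        pivot = max(range(i, k), key=lambda r: abs(M[r][i]))
        M[i], M[pivot] = M[pivot], M[i]
        for r in range(k):
            if r != i:
                f = M[r][i] / M[i][i]
                M[r] = [u - f * v for u, v in zip(M[r], M[i])]
    beta = [M[i][k] / M[i][i] for i in range(k)]
    return sum((yi - sum(b * v for b, v in zip(beta, r))) ** 2
               for r, yi in zip(rows, y))


def best_subsets(X, y):
    """For each size k = 0..p, return (subset, RSS) with the smallest RSS.
    Exhaustive search over all 2^p subsets, so only feasible for small p."""
    p = len(X[0])
    return {
        k: min(((c, rss(X, y, c)) for c in combinations(range(p), k)),
               key=lambda t: t[1])
        for k in range(p + 1)
    }


# Toy data (made up): y = x1 + x2 exactly, and x3 is a noisy copy of y.
x1 = [1, 1, -1, -1, 0, 0]
x2 = [1, -1, 1, -1, 0, 0]
x3 = [2.1, 0.2, -0.1, -1.9, 0.1, -0.2]
y = [2, 0, 0, -2, 0, 0]
X = [list(row) for row in zip(x1, x2, x3)]

best = best_subsets(X, y)
for k, (subset, value) in best.items():
    print(k, subset, round(value, 4))
```

In this example the best single predictor is the third column, while the best pair consists of the first two columns; a forward stepwise search that starts from the third column could never reach that pair without a removal step.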

Although overfitting often results from too many predictors, using very few variables may fail to reveal the true underlying structure of a prediction model because of inadequate information. In general, a saturated (full) model often outperforms reduced models. When using stepwise selection procedures, Harrell (2001) suggested the need for less stringent stopping rules, such as Akaike’s information criterion (AIC), to decide which candidate variables to retain or discard. When using backward elimination, Steyerberg et al. (2000) proposed using a p-value threshold of 0.5 to allow deletion of some variables. In general, backward elimination performs better than forward stepwise selection procedures in the presence of multicollinearity (Mantel, 1970). Moreover, backward elimination initially allows examination of the full model, which has the correct standard errors and p-values. Later we consider the use of backward elimination procedures to obtain reduced models. Stepwise selection procedures are implemented in most statistical software packages. Best-subset regression may be implemented using the leaps package in R software (Ihaka and Gentleman, 1996).

3.3.1 Limitations of stepwise selection procedures

Although considering subsets rather than individual candidate predictors may be more objective, the best-subset selection method, unlike stepwise methods, tends to select more predictors and therefore fails to reduce dimension. Moreover, the best-subset selection method can compare only models with the same number of predictors (Draper and Smith, 1998), which restricts the number of models that can be compared. The discrete process (variables are either retained or discarded) inherent in best-subset selection often exhibits high variance (Hastie et al., 2009), resulting in no reduction of the prediction error relative to the full model.

Despite their wide application in practice, stepwise selection procedures are associated with many limitations (Austin and Tu, 2004; Wang et al., 2004). The principal drawbacks of stepwise multiple regression include bias in parameter estimation, inconsistencies in results from different model selection algorithms, and multiple hypothesis testing before estimation. Although parsimony may be a desirable statistical practice, reliance on a single best model may lead to loss of information from excluded predictors. Even a small model with few predictors is no guarantee that noise variables have been excluded. Excluding important predictors can be very costly; for example, Steyerberg et al. (1999) demonstrated that excluding a true predictor is worse than including a noise variable.

Stepwise selection procedures are based on hypothesis tests of individual parameters. The process that produces the ‘final’ model fails to account for the inherent multiple testing. In addition to ‘testimation’ bias, the resulting standard errors are also invalid because stepwise procedures fail to fully account for the search process (Harrell, 2001). Analysis of the ‘final’ model from stepwise selection procedures assumes that the selected predictors were pre-specified, which is not true since the predictors were selected adaptively according to the selection algorithm. Consequently, comparison of any two models produces biased results because the analysis erroneously assumes that the models were fixed in advance. Specifically, the variance of the regression coefficients calculated as if the selection were pre-specified will underestimate standard errors and p-values in the resulting model (Harrell, 2001). The problem is even more acute for small data sets, where stepwise procedures have limited power to select prognostically important predictors, which in turn leads to lower predictive ability (Chatfield, 1995).

The final model derived from stepwise selection procedures is dependent upon the correlation between individual predictors (Draper and Smith, 1998). This single model is not guaranteed to be the best among candidate models, and interpretation using such a model includes only those predictors entered in that final model while ignoring the predictors not selected. Besides correlation, the final model depends on the order of entry/exit of predictors into the model. Such a dependency is likely to result in inconsistencies among different model selection algorithms. In general, excluding potentially important predictors on account of little or no correlation may lead to suboptimal decisions and limited predictability. The next section reviews multivariate regression techniques which attempt to address the problem of many predictors with a high degree of correlation.

3.3.2 Principal component and partial least squares regressions

Ordinary least squares (OLS) estimation of regression coefficients performs poorly when there are more predictors than observations and/or a high degree of near collinearity among the predictors, producing very unstable estimates and very poor prediction accuracy (Draper and Smith, 1998). Principal component regression (PCR) and partial least squares regression (PLSR) provide a solution to both challenges by using linear combinations of predictors instead of individual predictors (Vigneau et al., 1997). Strictly speaking, PLSR is a generalization of PCR.

The difference between PCR and PLSR lies in the different ways they construct new predictor variables (components) as linear combinations of the original predictors. While PCR creates components to explain the observed variability in the predictor X variables only, PLSR creates components to explain variability in both the predictor X and response Y variables. As a result, PLSR becomes PCR if it ignores the response during the process of creating components to explain observed variability. In constructing the components of X, the PLSR algorithm iteratively maximizes the strength of the relation of successive pairs of X and Y component scores by maximizing the covariance of each X-score with the Y variables. Because of its general strategy, PLSR is sometimes called Projection to Latent Structures in the natural sciences (Abdi, 2010). The distinct advantage of both methods lies in their use of linear combinations, which leads to models that are able to fit the response variable with fewer components. We note, however, that whether or not this reduction ultimately translates into a more parsimonious model, in terms of its practical use, depends on the context.

While PCR is a popular technique among social scientists, PLSR enjoys great popularity in the natural sciences, particularly in chemometrics (computational chemistry), where it has been heavily used in chemical analysis following developments in spectroscopy (many highly correlated predictors) since the 1970s (Mevik and Wehrens, 2007). PLSR is a generalized multivariate statistical technique with the ability to model multiple predictors as well as multiple responses, handle multicollinearity among predictors and help make stronger predictions by creating independent components/latent variables directly on the basis of cross-products involving the response variable(s). Its limitations include greater difficulty in interpreting the loadings of the independent latent variables, which are based on X−Y cross-product relations and not, as in common factor analysis, on covariances among the manifest predictors. This mix of advantages and disadvantages makes PLSR more appealing as a predictive technique than as an interpretive tool.
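The contrast between the two ways of constructing components can be stated compactly for a single response. The display below is a sketch in assumed notation that does not appear in the text above: X is the centred n × p predictor matrix, y the centred response, S the sample covariance matrix of the predictors, and w_m the weight vector defining the m-th component score X w_m. It follows the commonly stated formulation of the two criteria (see, e.g., Hastie et al., 2009).

```latex
% Assumed notation (not from the text): X centred predictors, y centred response,
% S = X^{\top} X / n, w_m the m-th direction; requires the amsmath package.
\begin{align*}
\text{PCR:}\quad  w_m &= \arg\max_{w}\ \operatorname{Var}(Xw)
  && \text{subject to } \|w\| = 1,\ w^{\top} S\, w_l = 0,\ l = 1,\dots,m-1, \\
\text{PLSR:}\quad w_m &= \arg\max_{w}\ \operatorname{Cov}^2(y, Xw)
  && \text{subject to } \|w\| = 1,\ w^{\top} S\, w_l = 0,\ l = 1,\dots,m-1.
\end{align*}
```

Because Cov²(y, Xw) = Corr²(y, Xw) Var(Xw) Var(y), the PLSR criterion rewards directions that have high variance in X and are also strongly correlated with the response, whereas the PCR criterion involves X alone; dropping the correlation factor from the PLSR criterion recovers the PCR criterion.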

Theoretically, PLSR should have an advantage over PCR, but in most practical situations both methods achieve similar prediction accuracies (Wold et al., 2001). PLSR usually needs fewer latent variables than PCR; with the same number of latent variables, PLSR will cover more of the variation in the response Y, whereas PCR will cover more of the variance in the predictors X. Frank and Friedman (1993) showed that both PCR and PLSR behave very similarly to ridge regression (next section). Although Hastie et al. (2009) showed that both PCR and PLSR behave as shrinkage methods (next section), in some cases PLSR seems to increase the variance of individual regression coefficients, an observation which may mean that PLSR is not always better than PCR. PLSR can be implemented as a regression model to predict one or more responses from a set of one or more predictors using the pls package in R software (Mevik and Wehrens, 2007).