2. An Introduction to Boosting

2.5. Boosting as a Method for Sparse Modeling

A feature directly related to component-wise fitting and the subsequent selection of base-learners is that boosting works especially well in settings where the number of covariates is large and where it is desirable to select a relatively small subset of predictors. In these situations boosting usually outperforms standard regression models with subset selection methods (Schmid and Hothorn 2008a). As each base-learner depends only on a small subset of the predictors, and as only one base-learner is fitted to the negative gradient vector in each iteration, component-wise boosting can even be applied in settings where the number of predictors is much larger than the number of observations (n ≪ p). Many classical variable selection techniques fail in this case. Others can add at most n variables to the model. In contrast, due to regularization, the final boosting model could, in theory, depend on all p predictors. However, this usually does not happen, nor is it desirable, as sparse models are easier to handle and interpret. The optimal model depends on only a few predictors while retaining very good prediction capability. As discussed in the introduction (Chapter 1), there are two competing goals when we try to learn (i.e., extract information) from data: the first goal is prediction, the second is interpretation. Prediction is the main target in machine learning, while interpretation is the focus of the statistical community. Even though these goals compete, they are not always mutually exclusive. Sparse, regularized models often lead to better prediction accuracy and at the same time are easier to interpret and understand. Both goals are served by sparse models, as long as the level of sparsity is chosen with care. First, sparse models are easier to interpret, as they contain fewer variables (and corresponding effects) and yield less complicated (interaction) structures. Second, sparse models tend to be superior with regard to prediction: models that are too rich tend to fit the learning data very well but perform poorly on new data. A very rich model usually does not generalize. Hence, prediction fails and, at the same time, the estimated effects should be doubted. On the other hand, models that are too sparse are counterproductive as well. They miss important parts of the structure in the data and hence are easy to interpret but highly misleading in their results. Their prediction performance usually decreases notably. Therefore, with respect to both goals (interpretation of the results and prediction modeling), it is desirable to find a fair amount of sparsity, which corresponds to a fair number of informative variables.
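The way component-wise fitting produces sparse models even when the number of predictors exceeds the number of observations can be illustrated with a minimal sketch. It assumes the L2 loss, one simple linear base-learner per centered covariate, and simulated data in which only the first two covariates are informative. The function name `componentwise_l2_boost`, the helper `center`, the step length `nu = 0.1` and all simulation settings are illustrative choices, not taken from the text.

```python
import random

def componentwise_l2_boost(cols, y, n_iter=100, nu=0.1):
    """Component-wise L2 boosting, one linear base-learner per covariate.

    cols: list of p centered covariate columns (each a list of length n).
    Returns (intercept, coefficients).
    """
    n = len(y)
    intercept = sum(y) / n
    resid = [yi - intercept for yi in y]  # negative gradient of the L2 loss
    coef = [0.0] * len(cols)
    norms = [sum(v * v for v in c) for c in cols]
    for _ in range(n_iter):
        best_j, best_beta, best_gain = -1, 0.0, 0.0
        for j, c in enumerate(cols):
            if norms[j] == 0.0:
                continue
            xr = sum(v * r for v, r in zip(c, resid))
            gain = xr * xr / norms[j]  # reduction in RSS if base-learner j is fitted
            if gain > best_gain:
                best_j, best_beta, best_gain = j, xr / norms[j], gain
        if best_j < 0:
            break
        # Update only the single best-fitting base-learner, shrunken by nu.
        coef[best_j] += nu * best_beta
        resid = [r - nu * best_beta * v for r, v in zip(resid, cols[best_j])]
    return intercept, coef

def center(col):
    m = sum(col) / len(col)
    return [v - m for v in col]

# Simulated data (assumption): n = 40 observations, p = 200 covariates,
# only covariates 0 and 1 carry signal.
rng = random.Random(1)
n, p = 40, 200
cols = [center([rng.gauss(0, 1) for _ in range(n)]) for _ in range(p)]
y = [2.0 * cols[0][i] - 1.5 * cols[1][i] + rng.gauss(0, 0.5) for i in range(n)]

intercept, coef = componentwise_l2_boost(cols, y, n_iter=100, nu=0.1)
selected = [j for j, b in enumerate(coef) if b != 0.0]
print(len(selected), "of", p, "covariates selected")
```

Because each iteration updates a single coefficient, the number of selected covariates can never exceed the number of iterations; stopping early therefore acts as both regularization and variable selection.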

To further enhance the sparsity of the model, several approaches exist in the boosting literature. We will briefly sketch three. The first approach, sparse L2Boosting (Bühlmann and Yu 2006), changes the criterion that is used to select the best-fitting base-learner (Alg. 2, step (b2)). The residual sum of squares (RSS) is replaced by a penalized criterion such as the AIC, the BIC or the generalized minimum description length (gMDL; see above for details), where the degrees of freedom are obtained from an approximation of the model degrees of freedom. Bühlmann and Yu show that this leads to sparser models in many situations, while the prediction performance does not suffer substantially. The second approach, twin boosting (Bühlmann and Hothorn 2010), uses two successive runs of the boosting algorithm. In the first run a standard boosting estimate is obtained. In the second run, the selection of the base-learners is not based on the RSS but on the RSS rescaled by the importance of the base-learner in the first boosting run. Hence, base-learners that were not selected in the first pass are not subject to selection in the second pass. Base-learners that made only a small contribution (e.g., had an effect close to zero) are less likely to be selected in the second round. The third approach is called stability selection (Meinshausen and Bühlmann 2010). Stability selection can be applied to a wide range of methods including boosting. It allows influential variables (or base-learners in the boosting context) to be extracted with error control. Stability selection is especially useful in settings with many potential predictors. To achieve variable selection, the empirical probability that a base-learner enters the model is investigated via subsampling (i.e., the model is trained on random subsets of the original data of size ⌊n/2⌋; see, for example, Bühlmann and Yu 2002). Only base-learners that enter the model with a probability higher than a specific cut-off value are considered to be influential. At the same time, stability selection controls the family-wise error rate. Hence, non-influential variables are selected by stability selection only with a controllable, small probability. The final model is usually sparser and hence easier to interpret than the model without stability selection.
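The two-pass idea behind twin boosting described above can be sketched as follows. This is a simplified illustration, not the algorithm of the cited paper: it assumes linear base-learners under the L2 loss and uses the squared first-pass coefficient as the importance weight that rescales the RSS reduction. That weighting, the function names `boost` and `twin_boost`, and the simulated data are assumptions made for this example.

```python
import random

def boost(cols, y, n_iter, nu=0.1, weights=None):
    """Component-wise L2 boosting; `weights` rescale the selection criterion."""
    p = len(cols)
    w = weights if weights is not None else [1.0] * p
    mean_y = sum(y) / len(y)
    resid = [yi - mean_y for yi in y]
    coef = [0.0] * p
    norms = [sum(v * v for v in c) for c in cols]
    for _ in range(n_iter):
        best_j, best_beta, best_score = -1, 0.0, 0.0
        for j, c in enumerate(cols):
            if norms[j] == 0.0 or w[j] == 0.0:
                continue  # zero weight: base-learner is excluded from selection
            xr = sum(v * r for v, r in zip(c, resid))
            score = w[j] * xr * xr / norms[j]  # rescaled RSS reduction
            if score > best_score:
                best_j, best_beta, best_score = j, xr / norms[j], score
        if best_j < 0:
            break
        coef[best_j] += nu * best_beta
        resid = [r - nu * best_beta * v for r, v in zip(resid, cols[best_j])]
    return coef

def twin_boost(cols, y, n_iter_first, n_iter_second, nu=0.1):
    """First pass: standard boosting. Second pass: importance-weighted selection."""
    first = boost(cols, y, n_iter_first, nu)
    weights = [b * b for b in first]  # assumed importance measure (hypothetical)
    second = boost(cols, y, n_iter_second, nu, weights)
    return first, second

def center(col):
    m = sum(col) / len(col)
    return [v - m for v in col]

# Simulated data (assumption): only covariates 0 and 1 are informative.
rng = random.Random(7)
n, p = 40, 100
cols = [center([rng.gauss(0, 1) for _ in range(n)]) for _ in range(p)]
y = [2.0 * cols[0][i] - 1.5 * cols[1][i] + rng.gauss(0, 0.5) for i in range(n)]

first, second = twin_boost(cols, y, n_iter_first=100, n_iter_second=100)
support_first = {j for j, b in enumerate(first) if b != 0.0}
support_second = {j for j, b in enumerate(second) if b != 0.0}
print(len(support_first), "base-learners after pass 1;",
      len(support_second), "after pass 2")
```

By construction, a base-learner with a zero coefficient in the first pass receives zero weight and cannot enter the second pass, so the second-pass model is never larger than the first-pass model.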

Altogether, the third approach seems to be the most promising, as it requires neither changes to the boosting algorithm nor estimates of the model degrees of freedom. Stability selection can simply be applied to the final model (almost) no matter how the underlying estimation and selection algorithm works. Moreover, stability selection controls the probability of falsely selecting base-learners.
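Because stability selection only wraps a selection procedure, it can be sketched independently of the boosting details. The example below is a minimal illustration under stated assumptions: the selector `first_q_selected` runs component-wise L2 boosting until `q` distinct base-learners have entered, subsamples have size ⌊n/2⌋, and base-learners whose empirical selection frequency exceeds `cutoff` are kept. The values of `q`, `cutoff` and `n_subsamples`, all function names, and the simulated data are hypothetical, and the sketch does not compute any error bound.

```python
import random

def center(col):
    m = sum(col) / len(col)
    return [v - m for v in col]

def first_q_selected(cols, y, q, nu=0.1, max_iter=200):
    """Run component-wise L2 boosting until q distinct base-learners entered."""
    mean_y = sum(y) / len(y)
    resid = [yi - mean_y for yi in y]
    norms = [sum(v * v for v in c) for c in cols]
    chosen = set()
    for _ in range(max_iter):
        best_j, best_beta, best_gain = -1, 0.0, 0.0
        for j, c in enumerate(cols):
            if norms[j] == 0.0:
                continue
            xr = sum(v * r for v, r in zip(c, resid))
            gain = xr * xr / norms[j]
            if gain > best_gain:
                best_j, best_beta, best_gain = j, xr / norms[j], gain
        if best_j < 0:
            break
        chosen.add(best_j)
        if len(chosen) >= q:
            break
        resid = [r - nu * best_beta * v for r, v in zip(resid, cols[best_j])]
    return chosen

def stability_selection(cols, y, q, cutoff, n_subsamples=50, seed=0):
    """Return base-learners whose selection frequency exceeds `cutoff`."""
    rng = random.Random(seed)
    n, p = len(y), len(cols)
    counts = [0] * p
    for _ in range(n_subsamples):
        idx = rng.sample(range(n), n // 2)  # random subset of size floor(n/2)
        sub_cols = [center([c[i] for i in idx]) for c in cols]
        sub_y = [y[i] for i in idx]
        for j in first_q_selected(sub_cols, sub_y, q):
            counts[j] += 1
    freq = [c / n_subsamples for c in counts]
    stable = [j for j, f in enumerate(freq) if f > cutoff]
    return stable, freq

# Simulated data (assumption): only covariates 0 and 1 are informative.
rng = random.Random(3)
n, p = 60, 50
cols = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(p)]
y = [2.0 * cols[0][i] - 2.0 * cols[1][i] + rng.gauss(0, 0.5) for i in range(n)]

stable, freq = stability_selection(cols, y, q=3, cutoff=0.8)
print("stable base-learners:", stable)
```

Swapping `first_q_selected` for any other procedure that returns a set of selected indices leaves `stability_selection` unchanged, which reflects the point that the underlying estimation algorithm hardly matters.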