Variable selection in the context of regression and classification has received increasing attention in recent years due to the huge growth in data collected across many fields of science and engineering [17, 55, 56]. While the goal is often to develop models for accurate output prediction, with these large datasets the estimation of model structure and the identification of the few variables driving the output variation are also essential for understanding and interpreting the underlying processes that generated the data [57, 58]. The numerical identification of such models is often challenging, either because subsets of predictors are highly correlated or because the number of measured variables is larger than the number of available data points [59]. For this reason, several techniques have recently been developed for sparse model identification, i.e. where only a few important variables are selected.
A first possibility is to look for the best subset of variables by considering all possible combinations. However, this strategy becomes infeasible as the number of candidate variables increases, since p candidate variables give 2^p possible subsets. Rather than searching through all possible combinations, we can seek a path through them. For example, Forward-Stepwise selection starts with no variables and adds one variable at a time to the model, in an order that depends on the improvement in fit (or in another criterion such as the Akaike information criterion [60]). An alternative strategy is the so-called Backward-Stepwise selection, which starts with all variables in the model and iteratively removes those that contribute least to explaining the output. Combinations of these methods can also be used to give improved performance, for example sequential floating forward selection as proposed in [61] and the two-stage algorithm incorporating a backward refinement step proposed in [62].
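The forward-stepwise idea described above can be sketched in a few lines. This is a minimal illustration rather than the procedure of any cited reference: it assumes the residual sum of squares (RSS) as the fit criterion and a fixed number of variables as the stopping rule, and the function name `forward_stepwise` and the synthetic data are hypothetical.

```python
import numpy as np

def forward_stepwise(X, y, max_vars):
    """Greedy forward selection: at each step, add the variable that
    gives the lowest residual sum of squares (RSS) when refitting."""
    n, p = X.shape
    selected = []
    remaining = list(range(p))
    while remaining and len(selected) < max_vars:
        best_rss, best_j = None, None
        for j in remaining:
            cols = selected + [j]
            # Least-squares fit with an intercept and the candidate columns
            A = np.column_stack([np.ones(n), X[:, cols]])
            coef, *_ = np.linalg.lstsq(A, y, rcond=None)
            rss = float(np.sum((y - A @ coef) ** 2))
            if best_rss is None or rss < best_rss:
                best_rss, best_j = rss, j
        selected.append(best_j)
        remaining.remove(best_j)
    return selected

# Synthetic data (assumption for illustration): only columns 2 and 7 drive y
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))
y = 3.0 * X[:, 2] - 2.0 * X[:, 7] + 0.1 * rng.standard_normal(200)

order = forward_stepwise(X, y, max_vars=2)
print("variables in order of inclusion:", order)
```

With a signal this strong, the two driving columns should be the ones picked. In practice the stopping point would be chosen with a criterion such as the AIC or cross-validation rather than fixed in advance.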
Regularization-based methods represent a different class of variable selection algorithms. These involve the estimation of the regression model by minimizing a cost function consisting of two terms, the first representing the model fit and the second the complexity of the model. A classical approach is to consider the sum of squares of the model coefficients as a measure of complexity, leading to the so-called ridge regression model. The resulting model suppresses the influence of irrelevant variables by forcing them to have small coefficients relative to those assigned to the important variables. Tibshirani and co-workers [63] showed that employing the sum of the absolute values of the model coefficients as the regularization penalty has the desirable effect of shrinking the coefficients of unimportant variables to exactly zero, thereby implicitly performing variable selection. This technique is the well-known Least Absolute Shrinkage and Selection Operator (Lasso), also known as basis pursuit in the signal processing community [64]. In recent years, several modifications of the Lasso have been proposed, such as the elastic-net [65] and the group Lasso [66, 67], and new penalized regression methods have been proposed, such as the Dantzig selector [68] and the Non-Negative Garrote Estimator [69]. These methods have been designed to improve different aspects of the Lasso estimator, such as its prediction performance and its handling of groups of similar variables.
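The difference between the two penalties is easiest to see in the special case of an orthonormal design (X^T X = I), where both estimators have closed forms in terms of the least-squares coefficients. The sketch below assumes that special case and a particular scaling of the cost function (stated in the comments); the function names and the example coefficients are hypothetical.

```python
import numpy as np

def ridge_orthonormal(b_ols, lam):
    # Minimizer of 0.5*||y - Xb||^2 + 0.5*lam*||b||_2^2 when X^T X = I:
    # every coefficient is scaled by the same factor and none becomes zero.
    return b_ols / (1.0 + lam)

def lasso_orthonormal(b_ols, lam):
    # Minimizer of 0.5*||y - Xb||^2 + lam*||b||_1 when X^T X = I:
    # soft thresholding, which sets coefficients with |b| <= lam to zero.
    return np.sign(b_ols) * np.maximum(np.abs(b_ols) - lam, 0.0)

# Hypothetical least-squares coefficients: two large, two small
b_ols = np.array([3.0, -2.0, 0.4, -0.1])
lam = 0.5

b_ridge = ridge_orthonormal(b_ols, lam)
b_lasso = lasso_orthonormal(b_ols, lam)

print("ridge:", b_ridge)
print("lasso:", b_lasso)
print("variables kept by the lasso:", np.flatnonzero(b_lasso))
```

Ridge keeps all four coefficients but shrinks them proportionally, whereas the soft-thresholding step removes the two small ones entirely. For a general, correlated design no such closed form exists and the Lasso must be solved numerically.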
In general, sparsity and algorithmic stability are two desired properties of learning algorithms. This means that the output of the model should be a function of only a small subset of the input variables, and that this subset should not change with small variations in the training data. Unfortunately, these two properties are at odds: a sparse algorithm cannot be stable, and vice versa [70]. While this is a general problem of all sparse algorithms, this chapter focuses on the Lasso estimator. The stability of the Lasso is investigated and algorithms for detecting a stable set of variables are proposed. Four new algorithms are proposed: High Frequency Lasso (HF); High Mean (HM); Monte Carlo High Frequency (MCHF); and Monte Carlo High Mean (MCHM). The aim of these algorithms is to stabilize the Lasso solution under cross-validation (CV) variability, taking into account both K-fold Cross-Validation and Monte Carlo Cross-Validation (bootstrap). These algorithms are easy to use and automate, and at the same time provide results that are competitive with existing approaches in the literature. In particular, we compare our algorithms with Stability Selection [71], Kappa Selection Criterion [72] and Lasso Percentile [73] on a range of simulated and real datasets in order to highlight their strengths and drawbacks, both in terms of prediction accuracy and in terms of their ability to recover the true model structure.
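The instability that motivates this chapter can be observed with a small experiment: refit the Lasso on random subsets of the training data and record how often each variable receives a non-zero coefficient. The sketch below only illustrates that variability and is not an implementation of HF, HM, MCHF or MCHM. It assumes a plain coordinate-descent solver without intercept, a fixed regularization parameter, and synthetic data with two correlated predictors; all names and settings are hypothetical.

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate descent for (1/(2n))*||y - Xb||^2 + lam*||b||_1 (no intercept)."""
    n, p = X.shape
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    r = y.astype(float).copy()               # residual y - X b (b starts at zero)
    for _ in range(n_iter):
        for j in range(p):
            r += X[:, j] * b[j]              # remove variable j from the fit
            rho = X[:, j] @ r / n
            b[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
            r -= X[:, j] * b[j]              # add the updated contribution back
    return b

def selection_frequency(X, y, lam, n_rep=50, frac=0.8, seed=0):
    """Fraction of random subsamples in which each coefficient is non-zero."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    counts = np.zeros(p)
    m = int(frac * n)
    for _ in range(n_rep):
        idx = rng.choice(n, size=m, replace=False)   # subsample without replacement
        counts += lasso_cd(X[idx], y[idx], lam) != 0
    return counts / n_rep

# Synthetic data (assumption): columns 0 and 4 drive y; column 1 is correlated with column 0
rng = np.random.default_rng(1)
X = rng.standard_normal((100, 8))
X[:, 1] = 0.9 * X[:, 0] + np.sqrt(1 - 0.9 ** 2) * X[:, 1]
y = 2.0 * X[:, 0] + 1.0 * X[:, 4] + 0.5 * rng.standard_normal(100)

freq = selection_frequency(X, y, lam=0.1)
print("selection frequency per variable:", np.round(freq, 2))
```

Variables whose frequency lies strictly between 0 and 1 are those whose selection depends on which observations happen to be in the training subset.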
The outline of the chapter is as follows. Section 3.2 provides an introduction to linear models and penalized linear models. Section 3.3 reviews existing methods that attempt to obtain stable estimates from the Lasso. Section 3.4 then introduces the four new algorithms: HF, HM, MCHF and MCHM. Section 3.5 describes the datasets used to benchmark the different methods and reports the results obtained for both the simulated and real datasets. Section 3.6 describes the computational complexity of the various algorithms, and finally Section 3.7 concludes the chapter with some final remarks and suggestions.