Figure 2.8.
AIC Mn.M Mn.MM RAIC.M RAIC.MM RAICS'.MM
7 8 9 10 11 12
Model selection criteria
TMSPE
Figure 2.8: Trimmed mean square prediction error (TMSPE) for Boston housing data
2.6
Conclusion
In this thesis, we propose a robust AIC for MM-estimation and an adjusted robust scale based AIC for M and MM-estimation. Our proposed model selection criteria can maintain their robust properties in the presence of a high proportion of outliers and the outliers in the covariates. We compare our proposed criteria with other robust model selection criteria discussed in previous literature. Our simulation studies demonstrate a significant outperformance of robust AIC based on MM-estimation with outliers in the covariates and show a better performance of the adjusted robust scale based AIC for MM-estimation when the proportion of outliers in the response is relatively high. Real data examples also show a
better performance of robust AIC based on MM-estimation. More robust model selection criteria with different penalty terms (e.g. robust BIC) will require future research. Overall, the traditional AIC is very sensitive to outliers, while the robust versions of model selection criteria are resistant to outliers and should receive more attention in model selection studies.
Chapter 3
Robust Lasso Regression Using
Tukey’s Biweight Criterion
∗
3.1
Introduction
In multiple linear regression models, the ordinary least squares (OLS) estimates can give inaccurate predictions when there are a large number of predictors or multicollinearity is present. One way to improve the predictions is to reduce the number of variables in the model by model selection. Tibshirani (1996) proposed a new technique for model selection termed the “LASSO” for “Least Absolute Shrinkage and Selection Operator”, which incorporates an L1 penalty into the OLS loss function. The lasso shrinks some coefficients to exactly zero. This property of the lasso means that it provides parsimonious solutions that are easy to interpret.
Over the past decade, the lasso has become a very popular technique for simul- taneous estimation and variable selection. Many authors (Zou, 2006; Knight and Fu, 2000; Zou et al., 2007; Zhao and Yu, 2006) have investigated the properties of the lasso and developed different variants of the lasso. Zou (2006) demonstrated
∗The core contribution of Chapter 3 was submitted to Technometrics in April 2016 and
has been accepted for publication in February 2017. This chapter was presented at the CM- Statistics Conference held in London, the UK in December 2015.
that there exist certain scenarios where the lasso is inconsistent for variable se- lection. Thus, he suggested the adaptive lasso, where adaptive weights are used for penalizing coefficients differently in the L1 penalty. The adaptive lasso enjoys the oracle property, that is, asymptotically it performs as well as if the true un- derlying model were known. Moreover, the adaptive lasso can be computed using the same efficient algorithms that are used to compute the lasso, for example, least angle regression (LARS) of Efron et al. (2004). Zou and Hastie (2005) em- phasized the inappropriateness of using the lasso as a variable selection method if a group of variables are very highly correlated. In such a situation, the lasso tends to select only one variable from the group and does not care which variable is selected. To overcome this drawback of the lasso, Zou and Hastie (2005) de- veloped the elastic net, a new regularization and variable selection method that combines anL1 and L2 penalty. The elastic net performs variable selection, con- tinuous shrinkage, and more importantly, it selects groups of strongly correlated variables. Fan and Li (2001) argued that a good penalty function should have the properties of sparsity and unbiasedness. They proposed a special non-concave penalty function named the Smoothing Clipped Absolute Deviation (SCAD) that can produce sparse solutions and unbiased estimates for large parameters.
Datasets with outliers are commonly encountered in statistical analysis. These outliers may appear in the response and/or the predictors. It is well known that OLS estimates in linear regression are very sensitive to outliers. The lasso estimates, which utilize OLS, also suffer from the effect of outliers. Some authors have considered robust versions of the lasso, generally utilizing penalized versions of M-estimators, as in Owen (2007), Wang et al. (2007), Li et al. (2011), and Lambert-Lacroix and Zwald (2011). Wang et al. (2007) proposed to overcome the presence of outliers by combining the least absolute deviation (LAD) loss and the lasso penalty. Unfortunately, it is well known that the LAD loss is not adaptable for small errors because it penalizes strongly for small residuals (Owen, 2007; Lambert-Lacroix and Zwald, 2011). That is, the LAD-Lasso has
3.1. INTRODUCTION 39
lower efficiency than OLS estimates when there are no outliers in the response. Owen (2007) and Lambert-Lacroix and Zwald (2011) preferred to replace the squared loss with Huber’s loss, a hybrid of the squared error and absolute error loss functions. Extensive simulation studies in Lambert-Lacroix and Zwald (2011) have demonstrated the superior performance of the lasso with Huber’s criterion over the traditional lasso and the LAD-Lasso.
Although the robust lasso with the Huber’s loss is resistant to outliers in the response and achieves high asymptotic efficiency, it is not robust against high leverage points or outliers in the covariates. As discussed above, the literature on the robust lasso (Owen, 2007; Wang et al., 2007; Li et al., 2011; Lambert-Lacroix and Zwald, 2011) considers only outliers in the response. However, outliers in the covariates also appear frequently and they generally have a greater effect on the accuracy of the regression estimates than outliers in the response. Only a few papers have considered robust methods with respect to contamination in the co- variates. Maronna (2011) proposed S-Ridge and MM-Ridge regressions that add anL2 penalty to traditional unpenalized S-regression and MM-regression. These methods are shown to be robust against outliers in the covariates. However, simi- lar to ridge regression, S-Ridge and MM-Ridge do not perform variable selection. Khan et al. (2007) introduced a robust version of LARS (Efron et al., 2004), named Rlars, by replacing the classical correlation in the LARS algorithm with robust correlation estimates. Rlars orders the importance of variables robustly. However, the Rlars procedure does not optimize a clearly defined objective func- tion, which means its asymptotic properties cannot be explored theoretically. Alfons et al. (2013) proposed another approach that is robust with respect to high leverage points, by adding the L1 penalty to the well-known least trimmed squares (LTS) criterion, naming this approach the Sparse-LTS. They derived the breakdown point of this Sparse-LTS estimator and demonstrated that it can be robust to outliers in both the covariates and the response. However, simulation studies in Alfons et al. (2013) and in our work, find that Sparse-LTS loses effi-
ciency when there are no outliers. Moreover, Alfons et al. (2013) do not provide any asymptotic theory for their Sparse-LTS estimator.
To be robust against outliers in both covariates and the response, the deriva- tive of the loss function needs to be redescending (Rousseeuw and Yohai, 1984; Yohai, 1987). A commonly used loss function with this property is Tukey’s bi- weight (Tukey, 1960). In this paper, we propose replacing the squared loss in the lasso with Tukey’s biweight criterion, and name the method the Tukey-lasso, to handle outliers in the response and covariates. Contemporary works based on a similar idea can be found in Smucler and Yohai (2015, 2017). In our simulation study, we show that the Tukey-lasso outperforms the adaptive lasso and other robust implementations of the lasso, particularly in the presence of outliers in both the response and the predictors.
We further propose an accelerated proximal gradient (APG) algorithm to compute the Tukey-lasso. The APG computes the lasso minimization problem and guarantees a global minimizer for a convex objective function. Although the objective function for the robust lasso with Tukey’s biweight is non-convex, the APG algorithm still achieves very reliable results (a local minimizer) when the starting value of the algorithm is carefully selected. We further demonstrate that the computation time for the Tukey-lasso through the APG algorithm is substantially lower than that of its competitors, including Rlars and Sparse-LTS.
In Section 3.2, we describe the traditional lasso, other robust versions of the lasso and introduce the robust lasso with Tukey’s biweight loss. In Section 3.3, we introduce the accelerated proximal gradient algorithm and its implementations to lasso problems. In Section 3.4, we discuss the method of selecting the tuning parameter. In Section 3.5, we present our simulation settings, show our simulation results and compare the prediction and variable selection accuracy of various forms of the lasso. Computation times for these methods are also reported and compared. We analyse three real examples in Section 3.6. Finally, we present brief conclusions in Section 3.7.