Chapter 3 Lasso-type Robust Variable Selection for Time-Course Microarray Data
3.3 Robust Lasso using adjusted penalty
Consider the case where ε is not normally distributed but has a bell-shaped distribution with heavier tails. Rosset and Zhu (2004) showed through simulation that the Lasso fails to identify the correct model in such cases, while a robust loss does better with an appropriate choice of the regularization parameter. They proposed the “Huberized Lasso” for a univariate response. In this section, we propose a robust Lasso for a multivariate time-varying response, which we call the R Lasso 1 and which can be implemented via the original group Lasso algorithm. As in Section 3.2, the idea is to group the responses by time by preprocessing the initial datasets so that they are in the form required by the group Lasso algorithm. To make the Lasso more robust, we use Huber’s loss function in the first term of the objective function instead of the squared norm used in the original group Lasso. The minimization is carried out by iteratively reweighted least squares. This procedure updates the response and the covariates using weights at each iteration, which raises the issue that the penalty term must be adjusted to account for weights that change at every iteration. Therefore, we introduce a modified objective function based on Huber’s loss function and an adjusted penalty.
Suppose we have already preprocessed the original datasets by going through steps 1, 2 and 3 in Section 3.2.1, except that step 2 is carried out differently. In step 2 of the R Lasso 1, we regress each column of Y[1] on X[0] using a robust method. The method we used for the robust regression is discussed in detail in Section 3.3.1. The residuals from the robust fit are used to compute the minimum covariance determinant estimator of Σ.
3.3.1 Determining c in Huber’s loss function
Step 2 in the data preprocessing for the R Lasso 1 involves robust regression. Although any regression method that is robust against large residuals suffices for step 2, we choose M-estimation with Huber’s loss function to be consistent with the R Lasso. Huber’s loss function in (3.24) requires a threshold value c. It controls the proportion of residuals that are downweighted, and is therefore closely related to P, the percentage of downweighting we wish to have. To achieve P% downweighting, we locate c at the (100 − P/2)th percentile of the residuals.
This sounds somewhat contradictory, since we need c to compute the residuals, and we need the residuals to determine c. To overcome this problem, we unveil most of the outliers by initially setting c to a very small value near 0 and examining the residuals from that fit. That is, we start from a robust regression with a small c and save the residuals to determine the appropriate threshold from them.
To implement robust regression, we use the R function rlm in the MASS package, which fits a linear model by Huber’s M-estimator. We need to determine two threshold values in Huber’s loss function: one for estimating Σ in step 2, and the other for the R Lasso 1 loss function. We denote them by $c_s$ and $c_\ell$, respectively.
Determining $c_s$ to estimate Σ in step 2
1. Regress each column of Y[1] on X[0] by Huber’s M-estimator with c near 0; in particular, we use c = 0.001.
2. Extract the scaled residuals and find their (P/2)th and (100 − P/2)th percentiles.
3. Set $c_s$ to be the maximum of the absolute values of these two percentiles.
4. Refit each column of Y[1] on X[0] by Huber’s M-estimator with c = $c_s$, and extract the unscaled residuals.
The saved residuals are used to compute the minimum covariance determinant estimator of Σ.
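The four steps above can be sketched in Python for a single response column. This is only a minimal sketch under stated assumptions, not the rlm implementation: the helper names `huber_fit` and `threshold_from_residuals` are hypothetical, the Huber fit is a plain IRLS with weights min(1, c/|r/s|), and the scale s is re-estimated from the MAD at each iteration. The simulated data are placeholders.

```python
import numpy as np

def huber_fit(X, y, c, n_iter=50):
    """Illustrative Huber M-estimate of y on X (with intercept) via IRLS."""
    Xd = np.column_stack([np.ones(len(y)), X])
    beta = np.linalg.lstsq(Xd, y, rcond=None)[0]   # least-squares start
    for _ in range(n_iter):
        r = y - Xd @ beta
        # MAD-based scale estimate (an assumption of this sketch)
        s = max(np.median(np.abs(r - np.median(r))) / 0.6745, 1e-12)
        u = np.maximum(np.abs(r / s), 1e-12)
        sw = np.sqrt(np.minimum(1.0, c / u))       # square roots of Huber weights
        beta = np.linalg.lstsq(Xd * sw[:, None], y * sw, rcond=None)[0]
    r = y - Xd @ beta
    s = max(np.median(np.abs(r - np.median(r))) / 0.6745, 1e-12)
    return beta, r, r / s                          # coefficients, unscaled, scaled

def threshold_from_residuals(scaled_resid, P):
    """Largest absolute value of the (P/2)th and (100 - P/2)th percentiles."""
    lo, hi = np.percentile(scaled_resid, [P / 2.0, 100.0 - P / 2.0])
    return max(abs(lo), abs(hi))

# Placeholder data: one covariate, 10 contaminated observations out of 200.
rng = np.random.default_rng(0)
n = 200
x = rng.uniform(size=n)
y = 2.0 * x + rng.normal(size=n)
y[:10] += rng.normal(scale=100.0, size=10)

_, _, scaled = huber_fit(x, y, c=0.001)            # step 1: fit with c near 0
c_s = threshold_from_residuals(scaled, P=10)       # steps 2-3: percentile threshold
beta_s, resid_s, _ = huber_fit(x, y, c=c_s)        # step 4: refit, keep unscaled residuals
```

By construction, at most about P% of the scaled residuals from the first fit exceed `c_s` in absolute value, which is the sense in which P controls the amount of downweighting.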
Determining $c_\ell$ in the R Lasso 1
Once we obtain the preprocessed data Y and X, we need to determine the threshold $c_\ell$ of Huber’s loss. In addition, we need an initial value of β, since the R Lasso 1 is an iterative process. Both are obtained through the following steps.
1. Regress each column of Y on X by Huber’s M-estimator with a small c = 0.001.
2. Extract the scaled residuals from the fit, and find their (P/2)th and (100 − P/2)th percentiles.
3. Set $c_{s0}$ to be the maximum of the absolute values of these two percentiles.
4. Regress each column of Y on X by Huber’s M-estimator with c = $c_{s0}$.
(1) Extract the unscaled residuals and form an n × d residual matrix R.
(2) Extract the coefficients from each regression and stack them column by column to form a p × d coefficient matrix $\beta_I$. This $\beta_I$ is used as the initial value of β for every λ in the Lasso.
5. Since $c_\ell$ is applied to the norms of the residuals in the R Lasso 1, we compute the norm of each row of R and take the (100 − P)th percentile of those n values as $c_\ell$.
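Step 5 can be illustrated with a short sketch. The function name `c_ell_from_residuals` is hypothetical, and the residual matrix below is a random placeholder rather than output from an actual robust fit.

```python
import numpy as np

def c_ell_from_residuals(R, P):
    """(100 - P)th percentile of the Euclidean norms of the rows of R."""
    row_norms = np.linalg.norm(R, axis=1)
    return np.percentile(row_norms, 100.0 - P)

# Placeholder n x d residual matrix with a few inflated rows.
rng = np.random.default_rng(1)
R = rng.normal(size=(200, 2))
R[:10] *= 100.0

c_ell = c_ell_from_residuals(R, P=10)
# Share of observations whose residual norm exceeds the threshold.
share_downweighted = np.mean(np.linalg.norm(R, axis=1) > c_ell)
```

Because the row norms are nonnegative, the threshold is one-sided here, which is why the (100 − P)th percentile is used rather than the (100 − P/2)th.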
3.3.2 Algorithm of the R Lasso 1
Suppose the initial data are preprocessed as described in Section 3.3. Let Y be an n × d response matrix with the ith row denoted as $y_i$, and X be an n × p design matrix of p covariates, with the ith row denoted as $x_i$. The columns of Y are uncorrelated, and each column of Y and of X is centered. The coefficient β is a p × d matrix with the jth row denoted as $\beta_j$, $j = 1, \dots, p$. Let $A_j$ be a d × d matrix for $j = 1, \dots, p$, which will be defined later. The objective function of the R Lasso 1 is given by
$$R = \sum_{i=1}^{n} \rho\left(\|y_i - x_i\beta\|\right) + \lambda \sum_{j=1}^{p} \|A_j\beta_j'\|. \tag{3.15}$$
The use of $A_j$ will enable us to solve (3.15) using the group Lasso algorithm of Yuan and Lin (2006).
The objective function (3.15) is minimized by iteratively reweighted least squares (IRLS) using the weights
$$w_i = \frac{\rho\left(\|y_i - x_i\beta\|\right)}{\|y_i - x_i\beta\|^2}.$$
At each iteration of the IRLS, we solve
$$R(w) = \sum_{i=1}^{n} w_i\|y_i - x_i\beta\|^2 + \lambda \sum_{j=1}^{p} \|A_j^w\beta_j'\|, \tag{3.16}$$
instead of (3.15), where $w_i$ is obtained from the previous estimate of β in each step of the iteration.
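The weights can be computed as in the following sketch. The exact form of Huber’s loss in (3.24) is not reproduced in this section, so the common form ρ(r) = r²/2 for |r| ≤ c and c|r| − c²/2 otherwise is assumed here; the function names and the random data are hypothetical.

```python
import numpy as np

def huber_rho(r, c):
    """Huber loss in an assumed standard form: r^2/2 if |r| <= c, else c|r| - c^2/2."""
    r = np.abs(r)
    return np.where(r <= c, 0.5 * r**2, c * r - 0.5 * c**2)

def irls_weights(Y, X, beta, c, eps=1e-12):
    """w_i = rho(||y_i - x_i beta||) / ||y_i - x_i beta||^2 for each row i."""
    norms = np.linalg.norm(Y - X @ beta, axis=1)
    safe = np.maximum(norms, eps)
    # For a zero residual the ratio tends to 1/2 under the assumed form of rho.
    return np.where(norms > eps, huber_rho(norms, c) / safe**2, 0.5)

rng = np.random.default_rng(2)
n, p, d = 20, 3, 2
X = rng.normal(size=(n, p))
Y = rng.normal(size=(n, d)) * 3.0
beta0 = np.zeros((p, d))            # placeholder for the previous estimate
c = 1.0

w = irls_weights(Y, X, beta0, c)
norms = np.linalg.norm(Y - X @ beta0, axis=1)
```

With this choice, $w_i\|y_i - x_i\beta\|^2$ reproduces $\rho(\|y_i - x_i\beta\|)$ at the current estimate, so the first term of (3.16) matches that of (3.15) there, and observations with residual norm above c receive smaller weights.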
Note that (3.16) is equivalent to
$$\sum_{i=1}^{n} \|y_i^w - x_i^w\beta\|^2 + \lambda \sum_{j=1}^{p} \|A_j^w\beta_j'\|, \tag{3.17}$$
where $y_i^w = \sqrt{w_i}\,y_i$ and $x_i^w = \sqrt{w_i}\,x_i$. Each column of the updated matrices $Y^w = (y_1^{w\prime}, \dots, y_n^{w\prime})'$ and $X^w = (x_1^{w\prime}, \dots, x_n^{w\prime})'$ needs to be centered again in order to satisfy the requirements of the group Lasso algorithm.
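The equivalence of the loss terms in (3.16) and (3.17), followed by the re-centering step, can be checked numerically. The sketch below uses random placeholder data and arbitrary positive weights; all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, d = 6, 3, 2
Y = rng.normal(size=(n, d))
X = rng.normal(size=(n, p))
beta = rng.normal(size=(p, d))
w = rng.uniform(0.1, 1.0, size=n)          # placeholder IRLS weights

# Scale each row by the square root of its weight.
sw = np.sqrt(w)[:, None]
Yw, Xw = sw * Y, sw * X

# Loss term of (3.16) versus loss term of (3.17).
loss_316 = np.sum(w * np.sum((Y - X @ beta) ** 2, axis=1))
loss_317 = np.sum((Yw - Xw @ beta) ** 2)

# Weighting generally destroys centering, so the columns are centered again.
Yw_c = Yw - Yw.mean(axis=0)
Xw_c = Xw - Xw.mean(axis=0)
```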
Let Z be the vectorized form of $Y^w$, D be the adjusted design matrix from $X^w$, and δ be the vectorized coefficient. In other words,
$$\begin{aligned}
Z &= \left(y_1^w(1), \dots, y_1^w(d), \dots, y_n^w(1), \dots, y_n^w(d)\right)'_{nd \times 1}, \\
D &= X^w \otimes I_d = \left(X_1^w \otimes I_d, \dots, X_p^w \otimes I_d\right)_{nd \times pd}, \\
\delta &= \left(\beta_1, \dots, \beta_p\right)'_{pd \times 1}.
\end{aligned} \tag{3.18}$$
This process produces a univariate response variable and groups the columns of $Y^w$. Each nd × d matrix $D_j = X_j^w \otimes I_d$ corresponds to the jth factor (the jth group of covariates) for $j = 1, \dots, p$.
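The construction of Z, D and δ can be sketched with NumPy as follows. Row-major flattening is used so that the entries of Z follow the ordering given in the text (all d responses of observation 1, then observation 2, and so on); the data are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, d = 5, 3, 2
Yw = rng.normal(size=(n, d))       # placeholder weighted, centered response
Xw = rng.normal(size=(n, p))       # placeholder weighted, centered design
beta = rng.normal(size=(p, d))

Z = Yw.reshape(-1)                 # (y_1(1),...,y_1(d),...,y_n(1),...,y_n(d))'
D = np.kron(Xw, np.eye(d))         # nd x pd Kronecker product with I_d
delta = beta.reshape(-1)           # rows of beta stacked: (beta_1,...,beta_p)'

# The j-th group of d columns of D is X_j kron I_d.
D_groups = [D[:, j * d:(j + 1) * d] for j in range(p)]
```

With this ordering, `D @ delta` is the row-major flattening of `Xw @ beta`, so the squared error of the vectorized problem equals that of the matrix problem.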
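With Z, D and δ, we can write (3.17) as
$$\|Z - D\delta\|^2 + \lambda \sum_{j=1}^{p} \|A_j^w\beta_j'\| = \sum_{\ell=1}^{nd} \Big(Z_\ell - \sum_{j=1}^{p} D_{\ell,j}\beta_j'\Big)^2 + \lambda \sum_{j=1}^{p} \|A_j^w\beta_j'\|, \tag{3.19}$$
where $D_{\ell,j}$ denotes the ℓth row of $D_j$. Each $D_j$ can be factorized as $Q_jR_j$ by QR decomposition, where $Q_j$ is an nd × d matrix with orthonormal columns and $R_j$ is a d × d upper triangular matrix. Define $A_j^w = R_j$, and let $\gamma_j = R_j\delta_j' = A_j^w\beta_j'$. Then (3.19) becomes
$$\sum_{\ell=1}^{nd} \Big(Z_\ell - \sum_{j=1}^{p} Q_{\ell,j}R_j\beta_j'\Big)^2 + \lambda \sum_{j=1}^{p} \|A_j^w\beta_j'\| = \sum_{\ell=1}^{nd} \Big(Z_\ell - \sum_{j=1}^{p} Q_{\ell,j}\gamma_j\Big)^2 + \lambda \sum_{j=1}^{p} \|\gamma_j\|. \tag{3.20}$$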
As noted in Section 3.2.1, the columns of Q are still centered. Since $X_j^w$ is centered, all columns of $D_j$ are centered, or equivalently, $1_{nd}'D_j = 0$. Since $D_j$ is decomposed as $Q_jR_j$, we have
$$1_{nd}'D_j = 1_{nd}'Q_jR_j = 0. \tag{3.21}$$
$R_j$ is invertible because the columns of $D_j$ are orthogonal and therefore linearly independent. This implies that $1_{nd}'Q_j = 0$, which means $Q_j$ is still centered. Therefore, the pseudo-data Z and Q in (3.20) satisfy all the requirements for the original group Lasso. We can solve (3.20) using the group Lasso algorithm on Z and Q to estimate $\gamma_j$ for $j = 1, \dots, p$.
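The QR step and the centering argument can be checked numerically for a single group. The sketch below uses a random centered column as a placeholder for $X_j^w$ and relies on the reduced QR decomposition returned by `numpy.linalg.qr`.

```python
import numpy as np

rng = np.random.default_rng(5)
n, d = 8, 3
xj = rng.normal(size=n)
xj = xj - xj.mean()                        # centered placeholder for X_j^w
Dj = np.kron(xj[:, None], np.eye(d))       # nd x d group block D_j

Qj, Rj = np.linalg.qr(Dj)                  # Q_j: nd x d, R_j: d x d upper triangular

beta_j = rng.normal(size=d)                # placeholder coefficient row beta_j
gamma_j = Rj @ beta_j                      # gamma_j = R_j beta_j'

col_sums_Q = np.ones(n * d) @ Qj           # 1' Q_j, expected to vanish
beta_back = np.linalg.solve(Rj, gamma_j)   # recover beta_j from gamma_j
```

Since $R_j$ is invertible, the coefficient row $\beta_j$ can be recovered from $\gamma_j$ after the group Lasso fit, and $\gamma_j = 0$ exactly when $\beta_j = 0$, so variable selection is unaffected by the reparametrization.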
Note that $A_j^w$ is different from the $A_j$ in (3.15). $A_j$ is an unknown matrix employed in the objective function (3.15) to enable us to use the group Lasso algorithm. We minimize R in (3.15) by minimizing R(w) at each iteration of the IRLS, using the $A_j^w$’s, which vary with the weights.
With pre-determined threshold values $c_s$ and $c_\ell$ in Huber’s loss function, the R Lasso 1 on the pseudo-data proceeds as summarized in Table 3.1.
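Table 3.1: Algorithm of the R Lasso 1.
For each λ {
    Set $\beta^{(0)} = \beta_I$.
    While diff > 0.00001, for k ≥ 1 { (IRLS)
        1. Compute the weights $w_i = \rho(\|y_i - x_i\beta^{(k-1)}\|) / \|y_i - x_i\beta^{(k-1)}\|^2$ for $i = 1, \dots, n$.
        2. Compute $Y^w_{n \times d}$ and $X^w_{n \times p}$.
        3. Center each column of $Y^w$ and $X^w$.
        4. Vectorize $Y^w$ into $Z_{nd \times 1}$, and adjust $X^w$ into $D_{nd \times pd}$.
        5. Apply the QR decomposition to each $D_j$ for $j = 1, \dots, p$.
        6. Run the group Lasso algorithm on Z and Q to obtain the $\gamma_j$’s, and recover $\beta^{(k)}_j$ from $\gamma_j = R_j\beta^{(k)\prime}_j$.
        7. Compute diff $= \max_{j=1,\dots,p} \big|\,\|\beta^{(k)}_j\| - \|\beta^{(k-1)}_j\|\,\big|$.
    } loop for IRLS.
    Select the $X_j$’s for which $\|\beta_j\|$ is larger than 0.
} loop for λ.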
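Step 6 of Table 3.1 could be sketched as a blockwise coordinate descent in the spirit of Yuan and Lin (2006). This is a hedged illustration, not their code: the function name `group_lasso_orthonormal` is hypothetical, and the shrinkage factor $(1 - \lambda/(2\|S_j\|))_+$ is derived for the unscaled objective in (3.20), which differs from other common scalings of the group Lasso criterion. The data are random placeholders.

```python
import numpy as np

def group_lasso_orthonormal(Z, Q_list, lam, n_sweeps=200, tol=1e-10):
    """Minimize ||Z - sum_j Q_j g_j||^2 + lam * sum_j ||g_j||,
    assuming Q_j' Q_j = I for every group (blockwise coordinate descent)."""
    gammas = [np.zeros(Q.shape[1]) for Q in Q_list]
    fitted = np.zeros_like(Z)
    for _ in range(n_sweeps):
        max_change = 0.0
        for j, Q in enumerate(Q_list):
            # Partial residual with group j removed from the fit.
            S = Q.T @ (Z - fitted + Q @ gammas[j])
            norm_S = np.linalg.norm(S)
            shrink = max(0.0, 1.0 - lam / (2.0 * norm_S)) if norm_S > 0 else 0.0
            new = shrink * S                       # group-wise soft thresholding
            max_change = max(max_change, np.linalg.norm(new - gammas[j]))
            fitted += Q @ (new - gammas[j])
            gammas[j] = new
        if max_change < tol:
            break
    return gammas

# Placeholder example: only the first covariate is active.
rng = np.random.default_rng(6)
n, p, d = 50, 2, 2
X = rng.normal(size=(n, p)); X -= X.mean(axis=0)
B = np.array([[1.0, 2.0], [0.0, 0.0]])
Y = X @ B + 0.1 * rng.normal(size=(n, d)); Y -= Y.mean(axis=0)

Z = Y.reshape(-1)
QR = [np.linalg.qr(np.kron(X[:, [j]], np.eye(d))) for j in range(p)]
Q_list = [q for q, _ in QR]

gammas = group_lasso_orthonormal(Z, Q_list, lam=1.0)
betas = [np.linalg.solve(r, g) for (_, r), g in zip(QR, gammas)]  # back to beta_j
```

A group is dropped exactly when the norm of its partial-residual projection does not exceed λ/2, which is how the selection rule in the last line of Table 3.1 arises.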
3.3.3 Simulation study for the R Lasso 1
To compare the performance of the R Lasso 1 with the original group Lasso, we conducted a small simulation study. We generated 100 Monte Carlo datasets with n = 200. For each Monte Carlo sample, we generated $X_1$ and $X_2$ independently from the Uniform distribution on (0, 1), and generated an n × 2 response matrix Y from the model
$$Y_i(t) = X_{1i}\,t + e_i(t), \tag{3.22}$$
for t = 1 and 2, where the independent error $e_i(t)$ is from $N(0, 100^2)$ for 5% of the observations, $i = 1, \dots, 10$, and from $N(0, 1)$ for the remaining 95% of the observations, $i = 11, \dots, 200$. Although the percentage of large errors is only 5%, we downweighted 10% of the large residuals in the R Lasso 1. Among the 100 Monte Carlo datasets, the R Lasso 1 chose $X_1$ before it chose $X_2$ on 95 datasets, while the original group Lasso chose $X_1$ before $X_2$ on only 54 datasets. This suggests that the R Lasso 1 selects the true covariate better than the original group Lasso when outliers are present.
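The data-generating mechanism of this simulation can be sketched as follows. Only the generation of the datasets is shown; the function name `simulate_dataset` and the random seed are hypothetical, and $N(0, 100^2)$ is read as a normal distribution with standard deviation 100.

```python
import numpy as np

def simulate_dataset(rng, n=200, n_outliers=10):
    """One Monte Carlo dataset from model (3.22) with t = 1, 2."""
    X1 = rng.uniform(0.0, 1.0, size=n)
    X2 = rng.uniform(0.0, 1.0, size=n)           # noise covariate, not in the model
    t = np.array([1.0, 2.0])
    # Standard deviation 100 for the first n_outliers observations, 1 otherwise.
    sd = np.where(np.arange(n) < n_outliers, 100.0, 1.0)
    e = rng.normal(size=(n, 2)) * sd[:, None]
    Y = np.outer(X1, t) + e                      # Y_i(t) = X_1i * t + e_i(t)
    return np.column_stack([X1, X2]), Y

rng = np.random.default_rng(2024)                # arbitrary seed for reproducibility
datasets = [simulate_dataset(rng) for _ in range(100)]
X, Y = datasets[0]
```

Each dataset would then be passed through the preprocessing of Section 3.3.1 and the algorithm of Table 3.1 to record which covariate enters the model first along the λ path.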