We recall the penalised regression model with an observed univariate response vector $y$ and multiple predictors $x_1, \ldots, x_p$ on a random sample of $n$ individuals:
$$\hat{\beta} = \arg\min_{\beta} \; \|y - X\beta\|_2^2 + \lambda P(\beta),$$
where $P(\beta)$ is the penalty function that promotes sparsity in the model and $\lambda$ is the regularisation parameter that determines the amount of sparsity. For each value of the regularisation parameter $\lambda$, the penalised regression model yields a set $G(\lambda) = \{j : \hat{\beta}_j(\lambda) \neq 0\}$, consisting of the indices of the variables that have been estimated to have a non-zero regression coefficient. The objective is to uncover the true positive set, $S$, containing all the predictors that are truly associated with the response. This is defined as $S = \{k : \tilde{\beta}_k \neq 0\}$, where the $\tilde{\beta}_k$'s are the true coefficients of the model underlying the association between the response and the predictors. A good variable selection technique then aims to identify as many variables as possible from the set $S$, while also controlling the corresponding number of false detections, i.e. the number of variables belonging to the negative set,
5.2 Stability selection in multiple regression 98
$N = \{k : \tilde{\beta}_k = 0\}$, that are falsely declared as positives.
Rather than tuning the regularisation parameter $\lambda$ to estimate the best $G(\lambda)$, stability selection is used, which, as explained above, seeks to find a stable set of variables over a range of values $[\lambda_{\min}, \lambda_{\max}]$ of the parameter, where $\lambda_{\max}$ corresponds to the null model and $\lambda_{\min} \in (0, \lambda_{\max})$ corresponds to a Lasso solution. In particular, for a given $\lambda \in [\lambda_{\min}, \lambda_{\max}]$, the stability selection approach consists of repeatedly drawing random sub-samples from the $n$ subjects, typically of size $\lfloor n/2 \rfloor$, without replacement, and fitting the penalised regression model on each random sub-sample. Each one of the $B$ random sub-samples provides a sparse estimate $\hat{\beta}^{(b)}$, revealing the set of selected variables
$$G^{(b)}(\lambda) = \{j : \hat{\beta}^{(b)}_j(\lambda) \neq 0\} \quad \text{for } b = 1, \ldots, B.$$
In what follows, we drop the $(b)$ superscript and let $G(\lambda)$ represent the selected set of variables from a random sub-sample. The probability of selection of each variable is then given by
$$\Pi_k(\lambda) = P(k \in G(\lambda)),$$
with the probability being with respect to the random sub-sampling. The final set of variables is then obtained by thresholding the maximum selection probabilities across the range of the regularisation parameter. That is, for a probability cut-off $\pi \in (0, 1)$, the estimated set of selected variables is given by
$$\hat{S} = \{k : \hat{\Pi}_k \geq \pi\},$$
where $\hat{\Pi}_k = \max_{\lambda \in \Lambda} \Pi_k(\lambda)$ is the maximum selection probability of variable $x_k$ over the range of $\lambda$, which is defined as $\Lambda = [\lambda_{\min}, \lambda_{\max}]$. Using this approach, Meinshausen and Bühlmann (2010) provide theoretical properties both asymptotically and for finite sample sizes. A detailed discussion of these theoretical results is provided in the following sections.
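The sub-sampling procedure described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used by the original authors: the Lasso solver is a plain cyclic coordinate descent for the objective $\|y - X\beta\|_2^2 + \lambda \sum_k |\beta_k|$, the range $\Lambda$ is replaced by a finite grid `lambdas`, and the function names (`lasso_cd`, `stability_selection`, `make_toy_data`) and the toy data are hypothetical.

```python
import random

def soft_threshold(z, t):
    """Soft-thresholding operator: sign(z) * max(|z| - t, 0)."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def lasso_cd(X, y, lam, n_iter=200, tol=1e-10):
    """Minimise ||y - X b||_2^2 + lam * sum_k |b_k| by cyclic coordinate descent.
    X is a list of n rows, each a list of p floats (illustrative solver only)."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # residual y - X beta, starting from beta = 0
    col_sq = [sum(row[j] ** 2 for row in X) for j in range(p)]
    for _ in range(n_iter):
        max_change = 0.0
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # inner product of column j with the residual that excludes column j
            rho = sum(X[i][j] * resid[i] for i in range(n)) + col_sq[j] * beta[j]
            new = soft_threshold(rho, lam / 2.0) / col_sq[j]
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new
                max_change = max(max_change, abs(delta))
        if max_change < tol:
            break
    return beta

def stability_selection(X, y, lambdas, B=50, cutoff=0.8, seed=0):
    """Return (selected set, maximum selection frequency per variable)."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    counts = [[0] * p for _ in lambdas]  # counts[l][k]: times k selected at lambdas[l]
    for _ in range(B):
        idx = rng.sample(range(n), n // 2)  # sub-sample of size floor(n/2), no replacement
        Xs = [X[i] for i in idx]
        ys = [y[i] for i in idx]
        for l, lam in enumerate(lambdas):
            beta = lasso_cd(Xs, ys, lam)
            for k in range(p):
                if beta[k] != 0.0:
                    counts[l][k] += 1
    # empirical analogue of the maximum selection probability over the grid
    max_prob = [max(counts[l][k] / B for l in range(len(lambdas))) for k in range(p)]
    selected = [k for k in range(p) if max_prob[k] >= cutoff]
    return selected, max_prob

def make_toy_data(n=40, p=5, seed=1):
    """Hypothetical toy data: y = 3*x_0 - 2*x_1 + noise; remaining columns are noise."""
    rng = random.Random(seed)
    X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
    y = [3.0 * row[0] - 2.0 * row[1] + rng.gauss(0.0, 0.5) for row in X]
    return X, y
```

For instance, `stability_selection(*make_toy_data(), lambdas=[5.0, 10.0, 20.0], B=20)` returns the indices whose maximum selection frequency over the grid reaches the cut-off, together with those frequencies. The selection frequencies are Monte Carlo estimates of $\Pi_k(\lambda)$, so their accuracy depends on $B$.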
5.2.1 Finite sample error control
In this section we elaborate on the theoretical results on error control that Meinshausen and Bühlmann (2010) have demonstrated in the case of a finite sample size. To do this, we first introduce some additional notation. We define the number of falsely selected variables as the size of the intersection between the set of negative variables and the set of selected variables from the sub-sampling procedure,
$$F = |N \cap \hat{S}|.$$
We further define the set of variables selected at least once across $\Lambda$ to be
$$G(\Lambda) = \bigcup_{\lambda \in \Lambda} G(\lambda)$$
and the expected number of uniquely selected variables across $\Lambda$ by
$$u(\Lambda) = E(|G(\Lambda)|).$$
The authors have shown that the expected number of false positives can be bounded as
$$E(F) \leq \frac{u(\Lambda)^2}{p} \, \frac{1}{2\pi - 1},$$
which depends on the threshold $\pi$ on the selection probabilities, on $u(\Lambda)$ and on $p$. The bound is informative only for $\pi > 1/2$, where the factor $1/(2\pi - 1)$ is positive. The result above is based on two assumptions and two lemmas, which we detail below along with a sketch of the proof.
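As a numerical illustration of the bound on $E(F)$, the short sketch below evaluates it for given values of $u(\Lambda)$, $p$ and $\pi$, and inverts it to find the largest $u(\Lambda)$ compatible with a target number of expected false positives. The function names and the example values are hypothetical and are chosen only for illustration.

```python
import math

def false_positive_bound(u, p, cutoff):
    """Upper bound u^2 / (p * (2*cutoff - 1)) on the expected number of false positives.
    Assumes a cut-off strictly above 1/2 so that the denominator is positive."""
    if not 0.5 < cutoff <= 1.0:
        raise ValueError("the bound requires a cut-off in (1/2, 1]")
    return u ** 2 / (p * (2.0 * cutoff - 1.0))

def max_u_for_target(p, cutoff, target):
    """Largest u for which the bound does not exceed `target`:
    u = sqrt(target * p * (2*cutoff - 1))."""
    if not 0.5 < cutoff <= 1.0:
        raise ValueError("the bound requires a cut-off in (1/2, 1]")
    return math.sqrt(target * p * (2.0 * cutoff - 1.0))

# Example with hypothetical values: p = 1000 predictors, cut-off 0.9, u = 20.
bound = false_positive_bound(20, 1000, 0.9)  # 400 / (1000 * 0.8)
```

With these example values the bound is approximately 0.5, i.e. on average at most half a false positive. The bound grows quadratically in $u(\Lambda)$, so restricting $\Lambda$ so that few variables are selected on each sub-sample tightens the error control considerably.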
Assumptions:
The distribution of $\{1\{k \in G(\lambda)\}, k \in N\}$ is exchangeable. (A1)
The sparse model does better than random guessing, i.e.
$$\frac{E(|S \cap G(\lambda)|)}{E(|N \cap G(\lambda)|)} \geq \frac{|S|}{|N|}. \tag{A2}$$
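Lemmas:
$$\Pi^{\mathrm{SIM}}_k(\lambda) \geq 2\Pi_k(\lambda) - 1 \tag{L1}$$
$$\text{If } P(k \in G(\Lambda)) \leq \varepsilon \text{ then } P\Big(\max_{\lambda \in \Lambda} \Pi^{\mathrm{SIM}}_k(\lambda) \geq \xi\Big) \leq \frac{\varepsilon^2}{\xi} \tag{L2}$$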
where we use the following definition
$$\Pi^{\mathrm{SIM}}_k(\lambda) = P\big(k \in G^{(b_1)}(\lambda) \cap G^{(b_2)}(\lambda)\big)$$
to represent the simultaneous selection probability for each variable $x_k$ from two disjoint random sub-samples ($b_1$ and $b_2$), each of size $\lfloor n/2 \rfloor$.
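To make the simultaneous selection probability concrete, the sketch below computes empirical versions of the marginal and simultaneous selection frequencies from pairs of selected sets, one pair per split of the data into two disjoint halves. The selected sets are hard-coded, hypothetical values rather than the output of a fitted model, and the function name is illustrative. For these empirical frequencies, inequality (L1) follows from the pointwise fact that $1\{A \cap B\} \geq 1\{A\} + 1\{B\} - 1$.

```python
def selection_frequencies(pairs, p):
    """pairs: list of (G1, G2), the sets selected on two disjoint half-samples.
    Returns (marginal, simultaneous) selection frequencies for variables 0..p-1."""
    n_pairs = len(pairs)
    marginal = [0.0] * p
    simultaneous = [0.0] * p
    for g1, g2 in pairs:
        for k in range(p):
            in1, in2 = k in g1, k in g2
            # marginal frequency averages over all 2 * n_pairs half-samples
            marginal[k] += (in1 + in2) / (2.0 * n_pairs)
            # simultaneous frequency counts pairs where both halves select k
            simultaneous[k] += (in1 and in2) / n_pairs
    return marginal, simultaneous

# Hypothetical selected sets from four complementary splits, with p = 3 variables.
pairs = [({0, 1}, {0}), ({0}, {0, 2}), ({0, 1}, {0, 1}), ({0}, {1})]
marginal, simultaneous = selection_frequencies(pairs, p=3)

# Empirical check of (L1): simultaneous >= 2 * marginal - 1 for every variable.
l1_holds = all(s >= 2.0 * m - 1.0 - 1e-12 for m, s in zip(marginal, simultaneous))
```

In this example variable 0 has marginal frequency 7/8 and simultaneous frequency 3/4, so (L1) holds with equality, while for the rarely selected variables the lower bound $2\Pi_k(\lambda) - 1$ is negative and therefore trivially satisfied.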
Sketch of the proof:
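1. Using the assumptions (A1) and (A2) it can be shown that $P(k \in G(\Lambda)) \leq \frac{u(\Lambda)}{p}$ for all $k \in N$.
2. Using the result from Step 1 and lemmas (L1) and (L2) it can be shown that $P(\hat{\Pi}_k \geq \pi) \leq \left(\frac{u(\Lambda)}{p}\right)^2 \frac{1}{2\pi - 1}$ for all $k \in N$.
3. Using the result from Step 2 and the fact that $|N| \leq p$, the bound on the expected number of false positives follows as $E(F) = \sum_{k \in N} P(\hat{\Pi}_k \geq \pi) \leq |N| \left(\frac{u(\Lambda)}{p}\right)^2 \frac{1}{2\pi - 1} \leq \frac{u(\Lambda)^2}{p} \, \frac{1}{2\pi - 1}$.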
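Step 2 of the sketch can be spelled out as follows. This is a reconstruction under the notation of this section, with $\varepsilon$ denoting the bound on $P(k \in G(\Lambda))$ assumed in (L2) (the symbol is our labelling), and with the choices $\varepsilon = u(\Lambda)/p$ from Step 1 and $\xi = 2\pi - 1$, which requires $\pi > 1/2$.

```latex
\begin{align*}
\hat{\Pi}_k \geq \pi
  &\;\Longrightarrow\;
  \max_{\lambda \in \Lambda} \Pi^{\mathrm{SIM}}_k(\lambda)
  \;\geq\; \max_{\lambda \in \Lambda} \bigl\{ 2\Pi_k(\lambda) - 1 \bigr\}
  \;=\; 2\hat{\Pi}_k - 1
  \;\geq\; 2\pi - 1
  && \text{by (L1)}, \\[4pt]
P\bigl(\hat{\Pi}_k \geq \pi\bigr)
  &\;\leq\;
  P\Bigl( \max_{\lambda \in \Lambda} \Pi^{\mathrm{SIM}}_k(\lambda) \geq 2\pi - 1 \Bigr)
  \;\leq\; \frac{\varepsilon^2}{\xi}
  \;=\; \left( \frac{u(\Lambda)}{p} \right)^{2} \frac{1}{2\pi - 1}
  && \text{by (L2)}.
\end{align*}
```

The first line shows that the event $\{\hat{\Pi}_k \geq \pi\}$ is contained in the event that the maximum simultaneous selection probability is at least $2\pi - 1$; the second line then applies (L2) to that larger event.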
5.2.2 Randomised Lasso
Within the same framework of stability selection, Meinshausen and Bühlmann (2010) have also introduced an extension to Lasso regression that adds an extra source of randomness to the model, which they refer to as the randomised Lasso. As the authors have shown, such an extension leads to improved asymptotic variable selection properties compared to the Lasso model. Variable selection consistency, i.e. selecting the true sparsity pattern as the sample size increases towards infinity, has been established for the original Lasso model under certain assumptions on the design matrix. The strongest such assumption requires that the design satisfies a condition known as the irrepresentable condition (Zhao and Yu, 2006). The randomised Lasso approach has been proposed in an attempt to weaken these assumptions.
This approach works by repeatedly fitting the Lasso model with random weights, $W_k$, as scale factors for the regularisation parameter of each variable $k$. In each step of the sub-sampling procedure, the Lasso estimates are thus obtained as
$$\hat{\beta} = \arg\min_{\beta} \; \|y - X\beta\|_2^2 + \lambda \sum_{k=1}^{p} \frac{|\beta_k|}{W_k}.$$
It is suggested that the random weights, $W_k$, are selected using the following technique: with probability $p_w \in (0, 1)$ set $W_k = \alpha$, with $\alpha \in (0, 1]$, otherwise set $W_k = 1$. The latter corresponds to no randomisation for the $k$th variable. It is suggested that reasonable values of the parameter $\alpha$ lie in $(0.2, 0.8)$. This randomisation technique can be easily accommodated within the algorithms used to solve the original Lasso problem, by re-weighting the corresponding variables based on their randomised weights. In particular, this corresponds to replacing $x_k$ by $x_k W_k$, for $k = 1, \ldots, p$, prior to applying the algorithm. The fitted coefficient of the rescaled variable then equals $\beta_k / W_k$, so the set of selected variables can be read off directly from the rescaled fit.
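The re-weighting described above can be sketched as follows: draw the weights, rescale the columns of the design matrix, solve an ordinary Lasso problem, and map the coefficients back to the original scale. This is a minimal illustration under stated assumptions; the coordinate descent solver and the names `lasso_cd`, `sample_weights` and `randomised_lasso` are hypothetical and are not taken from the original paper.

```python
import random

def soft_threshold(z, t):
    """Soft-thresholding operator: sign(z) * max(|z| - t, 0)."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def lasso_cd(X, y, lam, n_iter=200, tol=1e-10):
    """Minimise ||y - X b||_2^2 + lam * sum_k |b_k| by cyclic coordinate descent
    (illustrative solver; X is a list of n rows of p floats)."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)
    col_sq = [sum(row[j] ** 2 for row in X) for j in range(p)]
    for _ in range(n_iter):
        max_change = 0.0
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            rho = sum(X[i][j] * resid[i] for i in range(n)) + col_sq[j] * beta[j]
            new = soft_threshold(rho, lam / 2.0) / col_sq[j]
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new
                max_change = max(max_change, abs(delta))
        if max_change < tol:
            break
    return beta

def sample_weights(p, alpha, p_w, rng):
    """W_k = alpha with probability p_w, otherwise W_k = 1 (no randomisation)."""
    return [alpha if rng.random() < p_w else 1.0 for _ in range(p)]

def randomised_lasso(X, y, lam, alpha=0.5, p_w=0.5, rng=None):
    """One randomised Lasso fit: penalty lam * sum_k |beta_k| / W_k."""
    if rng is None:
        rng = random.Random(0)
    p = len(X[0])
    W = sample_weights(p, alpha, p_w, rng)
    # replace x_k by x_k * W_k, then solve a standard Lasso problem
    Xw = [[row[k] * W[k] for k in range(p)] for row in X]
    gamma = lasso_cd(Xw, y, lam)
    # gamma_k = beta_k / W_k, so map back to the original scale
    beta = [W[k] * gamma[k] for k in range(p)]
    return beta, W
```

Setting `alpha=1.0` gives $W_k = 1$ for all $k$ and recovers the ordinary Lasso fit. Within stability selection, a fresh set of weights would be drawn for each sub-sample, so `randomised_lasso` would take the place of the plain Lasso fit inside the sub-sampling loop.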
Meinshausen and Bühlmann (2010) have shown that the randomised Lasso can achieve variable selection consistency in the high-dimensional case under much weaker assumptions on the design matrix. As suggested by previous works, false variables tend to be included in the model when the irrepresentable condition does not hold. The idea behind the random weighting scheme is that randomly re-scaling the variables can make the false variables less sensitive to this condition, and thus decrease their frequency of selection.
Note that when the randomised Lasso is combined with stability selection, the final selection of variables becomes even more conservative. As suggested by the asymptotic results derived in the original paper, although the method guarantees that no noise variables will be selected, it also implies that variables with small effects, i.e. small coefficients in the true underlying model, will be missed by this variable selection procedure.