The linear regression model is defined as:
$$ y = X\beta + \varepsilon \tag{3.1} $$
for a set of n observations of p variables, where $y$ is an (n × 1) vector, $X$ is an (n × p) matrix collecting the data, $\beta$ is a (p × 1) vector containing the linear model coefficients, and $\varepsilon$ is the (n × 1) vector describing the portion of the data not described by the linear model. An important quantity associated with the linear model is the Signal-to-Noise Ratio (SNR, [74]), which measures the strength of the signal relative to the noise:
$$ \mathrm{SNR} = \frac{\mathrm{Var}(y)}{\mathrm{Var}(\varepsilon)} \tag{3.2} $$
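As an illustration of equations 3.1 and 3.2, the following minimal sketch simulates data from the linear model and computes the sample SNR. The sizes, coefficient values, noise level and seed are arbitrary assumptions made for the example and do not come from the chapter.

```python
import numpy as np

rng = np.random.default_rng(0)      # arbitrary seed, for reproducibility only

n, p = 200, 5                       # hypothetical number of observations and variables
X = rng.standard_normal((n, p))     # (n x p) data matrix
beta = np.array([2.0, -1.0, 0.0, 0.0, 0.5])   # (p x 1) coefficients, chosen for illustration
eps = 0.5 * rng.standard_normal(n)  # (n x 1) part not described by the model
y = X @ beta + eps                  # equation 3.1

# Equation 3.2, with the variances replaced by their sample estimates.
snr = np.var(y) / np.var(eps)
print(f"SNR = {snr:.2f}")
```

On real data the noise vector is not observed, so the denominator would have to be estimated, for example from the residuals of a fitted model.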
The ordinary least squares (OLS) estimator of $\beta$ in equation 3.1 is obtained by minimizing the least squares cost function $\sum_{i=1}^{n} \| y_i - \hat{y}_i \|_2^2$, given by the difference between the true output $y$ and the one estimated by the model:
$$ \hat{y} = X\hat{\beta} \tag{3.3} $$
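The OLS estimate and the fitted output of equation 3.3 can be sketched as follows. The helper name `ols` and the simulated data are assumptions introduced for the example; `np.linalg.lstsq` is used as one convenient way to solve the least squares problem.

```python
import numpy as np

def ols(X, y):
    """Return the coefficients minimising sum_i (y_i - yhat_i)^2."""
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta_hat

rng = np.random.default_rng(1)                   # arbitrary seed
X = rng.standard_normal((100, 3))                # hypothetical well-conditioned design
beta = np.array([1.0, -2.0, 0.5])                # hypothetical true coefficients
y = X @ beta + 0.1 * rng.standard_normal(100)

beta_hat = ols(X, y)
y_hat = X @ beta_hat                             # equation 3.3
rss = np.sum((y - y_hat) ** 2)                   # value of the least squares cost
print("beta_hat =", beta_hat, "RSS =", rss)
```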
Estimating β by OLS is in general an ill-conditioned problem, unless the predictors are orthogonal. Usually, these numerical problems arise when dealing with high-dimensional datasets, i.e. where $p \gg n$, and when there is high correlation between subsets of the input variables (i.e. columns of $X$). We work under the assumption that the vector β is sparse, in the sense that only $s \ll p$ coefficients $\beta_j$ are non-zero, and that $s < n$. We denote the index set of the non-zero coefficients as $S_0 = \{ j : \beta_j \neq 0 \}$; our goal is to obtain a good estimate of $S_0$ that allows us to build a sparse model with good prediction performance using the Lasso estimator.
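The two sources of ill-conditioning mentioned above can be made concrete with a small numerical sketch. The matrices below are synthetic and their sizes are assumptions chosen for illustration; the condition number of the Gram matrix is used here as a simple diagnostic.

```python
import numpy as np

rng = np.random.default_rng(2)      # arbitrary seed
n = 50
z = rng.standard_normal(n)

# Two nearly collinear predictors versus two independent ones.
X_corr = np.column_stack([z, z + 1e-3 * rng.standard_normal(n)])
X_indep = rng.standard_normal((n, 2))

cond_corr = np.linalg.cond(X_corr.T @ X_corr)     # very large: near-singular Gram matrix
cond_indep = np.linalg.cond(X_indep.T @ X_indep)  # moderate

# High-dimensional case p > n: rank(X^T X) = rank(X) <= n < p,
# so the (p x p) Gram matrix cannot be inverted.
X_wide = rng.standard_normal((10, 30))
rank_wide = np.linalg.matrix_rank(X_wide)

print(cond_corr, cond_indep, rank_wide)
```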
To describe the structure of the data, some additional notation is introduced. Define $D = \{1, \dots, n\}$ as the index set of the rows of $X$ and $y$, and let $\phi \subset D$, so that $X_\phi$ and $y_\phi$ are the submatrix of $X$ and the subvector of $y$, respectively, obtained using only the rows with indices in $\phi$.
Moreover, given $A \subset D$, the complementary set is defined as $\bar{A} = D \setminus A$.
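In code, this notation corresponds to row selection with an index array. The sketch below uses toy values and a hypothetical subset; note that NumPy indices are 0-based, whereas D starts at 1 in the text.

```python
import numpy as np

n = 6
X = np.arange(n * 2, dtype=float).reshape(n, 2)   # toy (n x 2) matrix
y = np.arange(n, dtype=float)                     # toy (n x 1) vector

D = np.arange(n)                  # index set of the rows (0-based here)
phi = np.array([0, 2, 5])         # hypothetical subset of D
phi_bar = np.setdiff1d(D, phi)    # complementary set D \ phi

X_phi, y_phi = X[phi], y[phi]             # rows with indices in phi
X_bar, y_bar = X[phi_bar], y[phi_bar]     # rows with indices in the complement
print(phi_bar, y_phi, y_bar)
```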
The Lasso estimator is formally defined as the solution of a convex optimization problem:
$$ \hat{\beta}(\alpha) = \underset{\beta \in \mathbb{R}^p}{\operatorname{argmin}} \; \| y - X\beta \|_2^2 + \alpha \| \beta \|_1 \tag{3.4} $$
where α is the regularization gain that controls the degree of sparsity in the model. For a given α the lasso estimator can be computed iteratively using various algorithms, e.g. LARS [75] or coordinate descent [76]. In this chapter the LARS algorithm is used.
The optimum value of α can be estimated using a Cross-Validation (CV) procedure [77], [78]. Among the CV variants, the most widely used is K-fold Cross-Validation, as it provides a good balance between computational complexity and accuracy of the prediction error estimate. To perform K-fold Cross-Validation, the data is randomly divided into K roughly equal-sized folds, with the index set of the kth fold denoted as $f_k$. Then, for a sequence of values of the tuning parameter α, penalized models are estimated using all but one of the folds as training data, and the predictive performance of each model is tested on the omitted "left-out" fold. This process is repeated until each fold has been left out. Thus
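$$ \mathrm{CV}(\alpha) = \frac{1}{K} \sum_{k=1}^{K} \| y_{f_k} - X_{f_k} \hat{\beta}_{\bar{f}_k}(\alpha) \|_2^2 \tag{3.5} $$
where $\hat{\beta}_{\bar{f}_k}(\alpha)$ are the lasso coefficients estimated with regularization gain α using as training data all the samples except those in fold $f_k$ (that is, the samples indexed by the complement $\bar{f}_k = D \setminus f_k$). The optimum value of α is then chosen as the value in the sequence that minimizes the CV error, that is:
$$ \alpha^* = \underset{\alpha}{\operatorname{argmin}} \, \{ \mathrm{CV}(\alpha) \} \tag{3.6} $$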
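The procedure of equations 3.4 to 3.6 can be sketched in a few lines. This chapter uses LARS to compute the lasso; to keep the example self-contained, the sketch below swaps in a plain coordinate descent solver [76] instead. The function names (`lasso_cd`, `cv_curve`), the grid of α values, the fixed number of sweeps and the simulated data are all assumptions made for illustration.

```python
import numpy as np

def soft_threshold(v, t):
    """Soft-thresholding operator: sign(v) * max(|v| - t, 0)."""
    return np.sign(v) * max(abs(v) - t, 0.0)

def lasso_cd(X, y, alpha, n_iter=100):
    """Coordinate descent for ||y - X b||_2^2 + alpha * ||b||_1 (eq. 3.4)."""
    p = X.shape[1]
    b = np.zeros(p)
    col_sq = np.sum(X ** 2, axis=0)       # x_j^T x_j for every column
    r = y.astype(float).copy()            # running residual y - X b
    for _ in range(n_iter):               # fixed number of sweeps, no stopping rule
        for j in range(p):
            if col_sq[j] == 0.0:
                continue                  # skip all-zero columns
            rho = X[:, j] @ r + col_sq[j] * b[j]   # x_j^T (partial residual)
            b_new = soft_threshold(rho, alpha / 2.0) / col_sq[j]
            r += X[:, j] * (b[j] - b_new)          # keep the residual in sync
            b[j] = b_new
    return b

def cv_curve(X, y, alphas, K=10, seed=0):
    """K-fold CV error of eq. 3.5 for every alpha in the sequence."""
    n = X.shape[0]
    folds = np.array_split(np.random.default_rng(seed).permutation(n), K)
    cv = np.zeros(len(alphas))
    for f_k in folds:                                 # index set of the k-th fold
        train = np.setdiff1d(np.arange(n), f_k)       # complement of f_k
        for i, alpha in enumerate(alphas):
            b = lasso_cd(X[train], y[train], alpha)
            cv[i] += np.sum((y[f_k] - X[f_k] @ b) ** 2)
    return cv / K

# Hypothetical sparse problem: s = 3 non-zero coefficients out of p = 20.
rng = np.random.default_rng(4)
n, p = 100, 20
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[[0, 5, 9]] = [3.0, -2.0, 1.5]
y = X @ beta + 0.5 * rng.standard_normal(n)

alphas = np.logspace(-1, 3, 9)            # assumed grid of regularization gains
cv = cv_curve(X, y, alphas)
alpha_star = alphas[np.argmin(cv)]        # eq. 3.6
beta_star = lasso_cd(X, y, alpha_star)
print("alpha* =", alpha_star, "selected:", np.flatnonzero(beta_star))
```

The `seed` argument controls the random split into folds, so calling `cv_curve` with different seeds gives a direct way to inspect how the CV curve depends on the split.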
In the literature the number of folds K is often chosen as 5 or 10 [60]. In this chapter we always use K = 10. The CV procedure is very popular because it is intuitively appealing, easy to implement, and can provide a good estimate of the expected prediction error [60]. However, the CV procedure is highly sensitive to several factors, such as the noise in the data and the way the dataset is split into smaller subsets [73]. Problems arise when we are interested in identifying the underlying model structure $S_0$. In particular, different (random) data splits typically result in different $\alpha^*$ values and different subsets of variables being selected. Indeed, even for a fixed α the selected variables will vary substantially with different data permutations, especially if the candidate variables are highly correlated, as described in the next section.
3.2.1 Oracle Property and Irrepresentable Condition
In general it is important to be aware of the prediction capabilities of a penalized model. This can be expressed through the Oracle Property:
Definition 3.2.1. A penalized estimator is said to have the Oracle Property if it is asymptotically equivalent to the oracle estimator, which is defined as an ideal estimator obtained when using only the true variables, without penalization. More formally, let $X$ and $y$ be the input matrix and the output vector, and let $S_0$ be the set of true variables. The oracle estimator is then defined as:
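$$ \hat{y} = X_{S_0} \hat{\beta}_{S_0} \tag{3.7} $$
where $X_{S_0}$ is the set of columns of $X$ whose indices are contained in $S_0$ and
$$ \hat{\beta}_{S_0} = \underset{\beta_{S_0}}{\operatorname{argmin}} \, \| y - X_{S_0} \beta_{S_0} \|_2 \tag{3.8} $$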
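A penalized estimator $\hat{\beta}$ is said to have the Oracle Property if there is a sequence $\lambda_n$ of values of the regularization parameter λ (denoted α in equation 3.4) such that, with $\lambda = \lambda_n$,
$$ P(\hat{\beta} = \hat{\beta}_{S_0}) \to 1 \tag{3.9} $$
A slightly weaker definition is
$$ P(\hat{S} = S_0) \to 1 \tag{3.10} $$
where $\hat{S}$ is the set of variables selected by the penalized estimator.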
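The oracle estimator of Definition 3.2.1 can be sketched as an unpenalized least squares fit restricted to the true variables. The data, the support and the helper `same_support` (which checks the event inside the probability of the weaker definition for a single estimate) are hypothetical and introduced only for this example; in practice $S_0$ is unknown, which is why the oracle is an ideal reference rather than a usable estimator.

```python
import numpy as np

rng = np.random.default_rng(3)            # arbitrary seed
n, p = 80, 10
X = rng.standard_normal((n, p))
S0 = np.array([0, 4, 7])                  # hypothetical set of true variables
beta = np.zeros(p)
beta[S0] = [2.0, -1.5, 1.0]
y = X @ beta + 0.3 * rng.standard_normal(n)

# Oracle estimator: least squares on the columns in S0, no penalty.
X_S0 = X[:, S0]
beta_S0_hat, *_ = np.linalg.lstsq(X_S0, y, rcond=None)
y_hat_oracle = X_S0 @ beta_S0_hat

def same_support(beta_est, S0, tol=1e-10):
    """True if the non-zero entries of beta_est are exactly the indices in S0."""
    selected = np.flatnonzero(np.abs(beta_est) > tol).tolist()
    return set(selected) == set(S0.tolist())

# Embed the oracle coefficients in a p-vector to compare supports.
beta_oracle = np.zeros(p)
beta_oracle[S0] = beta_S0_hat
print(beta_S0_hat, same_support(beta_oracle, S0))
```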
In [79], [80] and [81] it is proven that, under certain conditions, the least squares estimator with Smoothly Clipped Absolute Deviation (SCAD), Ridge or Lasso penalty has the oracle property. In this context, for the Lasso estimator the Irrepresentable Condition, presented in the next definition, plays an important role.
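Definition 3.2.2. The neighborhood stability condition, also known as the irrepresentable condition [82], [83], [84], is defined as:
$$ \max_{k \in S_0^c} \left| \operatorname{sign}(\beta_{S_0})^T \left( X_{S_0}^T X_{S_0} \right)^{-1} X_{S_0}^T X_k \right| < 1 \tag{3.11} $$
where $X_{S_0}$ collects the columns of $X$ corresponding to the true variables, $X_k$ is the kth column of $X$, and $S_0^c$ is the complement of $S_0$, i.e. the indices of the variables that are not in the true model.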
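For a known coefficient vector, the left-hand side of the irrepresentable condition can be evaluated directly. The sketch below assumes the absolute-value form of the condition; the function name `irrepresentable_value` and the two synthetic designs (one orthonormal, one with a noise variable built from the true ones) are assumptions made for the example. The function expects at least one zero and one non-zero coefficient.

```python
import numpy as np

def irrepresentable_value(X, beta):
    """max over k outside S0 of |sign(b_S0)^T (X_S0^T X_S0)^{-1} X_S0^T X_k|."""
    S0 = np.flatnonzero(beta)                         # true variables
    Sc = np.setdiff1d(np.arange(X.shape[1]), S0)      # remaining variables
    X_S0 = X[:, S0]
    # w = X_S0 (X_S0^T X_S0)^{-1} sign(beta_S0), so that X_k^T w is the k-th term.
    w = X_S0 @ np.linalg.solve(X_S0.T @ X_S0, np.sign(beta[S0]))
    return np.max(np.abs(X[:, Sc].T @ w))

rng = np.random.default_rng(6)                        # arbitrary seed
Q, _ = np.linalg.qr(rng.standard_normal((50, 4)))     # four orthonormal columns

# Orthogonal design: the noise variables are unrelated to the true ones.
beta_ok = np.array([1.0, -1.0, 0.0, 0.0])
value_ok = irrepresentable_value(Q, beta_ok)          # close to 0, condition holds

# Third variable strongly correlated with both true variables.
X_bad = np.column_stack([Q[:, 0], Q[:, 1],
                         0.7 * Q[:, 0] + 0.7 * Q[:, 1] + 0.1 * Q[:, 2]])
beta_bad = np.array([1.0, 1.0, 0.0])
value_bad = irrepresentable_value(X_bad, beta_bad)    # 1.4, condition violated

print(value_ok, value_bad)
```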
In [83] the authors prove that the Irrepresentable Condition (equation 3.11) is, except for a minor technicality (according to which a model is consistent if $\operatorname{sign}(\hat{\beta}_i) = \operatorname{sign}(\beta_i)$ for all $i$), 'almost' necessary and sufficient for consistency of the lasso estimator (the word 'almost' refers to the fact that the necessary condition uses ≤ instead of <). Consistency implies that, for each random realization, there exists a correct amount of regularization that selects the true model. If this condition is violated, all that we can hope for is recovery of the regression vector β in an L2 sense of convergence, by achieving
$$ \| \hat{\beta} - \beta_S \|_2 \to 0 \quad \text{for } n \to \infty \tag{3.12} $$
This type of L2-convergence can be used to achieve consistent variable selection in a two-stage procedure, by thresholding or, preferably, by employing the adaptive lasso [84]. The disadvantage of such a two-step procedure is the need to choose several tuning parameters without proper guidance on how these parameters can be chosen in practice. In conclusion, if the Irrepresentable Condition is violated, the true β cannot be recovered unless a two-stage procedure based on thresholding or the adaptive lasso is used [84], [85].
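The thresholding variant of the two-stage procedure can be sketched as follows. This is only an illustration under strong assumptions: the first-stage estimate `beta_init` is a stand-in built by perturbing the true coefficients (in practice it would come from a lasso fit), and the threshold `tau` is a hypothetical tuning parameter that, as noted above, has to be chosen by the user. The function name `threshold_and_refit` is likewise introduced here.

```python
import numpy as np

def threshold_and_refit(X, y, beta_init, tau):
    """Keep variables with |beta_init_j| > tau, then refit them by least squares."""
    S_hat = np.flatnonzero(np.abs(beta_init) > tau)   # stage 1: thresholding
    beta_out = np.zeros(X.shape[1])
    if S_hat.size > 0:                                # stage 2: unpenalized refit
        coef, *_ = np.linalg.lstsq(X[:, S_hat], y, rcond=None)
        beta_out[S_hat] = coef
    return S_hat, beta_out

rng = np.random.default_rng(5)                        # arbitrary seed
n, p = 100, 8
X = rng.standard_normal((n, p))
beta = np.array([1.5, 0.0, 0.0, -2.0, 0.0, 0.0, 0.0, 1.0])   # hypothetical truth
y = X @ beta + 0.2 * rng.standard_normal(n)

# Stand-in for an estimate that is close to beta in the L2 sense
# but has no exact zeros, so it does not identify S0 by itself.
beta_init = beta + 0.05 * rng.standard_normal(p)

S_hat, beta_two_stage = threshold_and_refit(X, y, beta_init, tau=0.5)
print("selected:", S_hat, "refit:", beta_two_stage[S_hat])
```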