Chapter 2. Regularised Estimation
2.4 Theory for Regularised Estimation
2.4.4 Bounds for the lasso
In the lasso problem (2.1.7), one can either assume that the regression coefficients $\theta_0$ are exactly sparse, referred to as the strong sparsity setting, or that they can simply be approximated well by a sparse vector, referred to as the weakly sparse setting (Bunea et al. 2007). Both cases are studied in Negahban et al. (2012); however, only results for the strong sparsity setting are presented here. As further reference, there is a rich variety of work on the theoretical properties of lasso-like estimators. These include results for exact recovery of the active predictors given noiseless observations (Candes et al. 2005; D. Donoho 2006), prediction error consistency and consistency in various $\ell_p$ norms (Bunea et al. 2007; Geer et al. 2009; C. Zhang et al. 2008), and variable selection consistency (Meinshausen and Bühlmann 2006; P. Zhao et al. 2006).
The first-order Taylor error of the quadratic loss function that underlies the lasso is $\delta L(\Delta, \theta_0) = \langle \Delta, N^{-1} X^\top X \Delta \rangle = N^{-1}\|X\Delta\|_2^2$ and is thus independent of $\theta_0$. In this much simplified case, to maintain the RSC it suffices to establish a lower bound on $N^{-1}\|X\Delta\|_2^2$ that holds across an appropriately restricted subset of $\Delta \in \mathbb{R}^p$. When $\theta_0$ is exactly sparse, it is intuitive to select the subspace defined by the support set $S = \{i \mid (\theta_0)_i \neq 0\}$ (recall that the $\ell_1$ regulariser is decomposable, Prop. A.1). One can view this as setting the model subspace $M(S)$ to look at the components of $\theta$ that relate to the non-zero components of $\theta_0$. We can thus obtain error vectors for the allowed non-zero elements, $\Delta_S = \hat\theta_S - \theta_{0,S}$, corresponding to $M(S)$, and perturbation terms $\Delta_{S^c} = \hat\theta_{S^c} - \theta_{0,S^c}$ that correspond to $\bar{M}^\perp(S)$. Given that $\theta_0 \in M$, we can consider the restricted set $\hat\Delta \in C = \{\Delta \in \mathbb{R}^p \mid \|\Delta_{S^c}\|_1 \le 3\|\Delta_S\|_1\}$. In the lasso setting, the RSC condition translates into the well-known restricted eigenvalue conditions (Geer et al. 2009; Raskutti et al. 2010):
Corollary 2.1. Restricted Eigenvalue Condition
The RSC (Definition 2.9) requires that the design matrix $X$ satisfies a restricted eigenvalue (RE) condition
\[ \frac{\|X\theta\|_2^2}{N} \ge \kappa_L \|\theta\|_2^2 \quad \text{for all } \theta \in C(S), \tag{2.4.11} \]
or similarly $N^{-1}\|X\theta\|_2^2 \ge \kappa'_L \|\theta\|_1^2 / |S|$.
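As a rough numerical illustration of the RE condition (2.4.11), the sketch below draws random vectors from the cone $C(S)$ and records the smallest observed ratio $N^{-1}\|X\theta\|_2^2 / \|\theta\|_2^2$. The function names and the sampling scheme are illustrative assumptions, not part of the cited results, and the support is assumed to be non-empty. Because only finitely many cone vectors are tried, the returned value can only overestimate the true constant $\kappa_L$.

```python
import random


def in_cone(theta, support):
    """Check the cone constraint ||theta_{S^c}||_1 <= 3 * ||theta_S||_1."""
    on = sum(abs(theta[j]) for j in support)
    off = sum(abs(v) for j, v in enumerate(theta) if j not in support)
    # Small relative slack guards against floating-point round-off.
    return off <= 3.0 * on * (1.0 + 1e-9)


def sample_cone_vector(p, support, rng):
    """Draw a random vector in C(S); the sampling scheme is an assumption."""
    theta = [0.0] * p
    for j in support:
        theta[j] = rng.gauss(0.0, 1.0)
    on = sum(abs(theta[j]) for j in support)
    off_idx = [j for j in range(p) if j not in support]
    raw = [rng.gauss(0.0, 1.0) for _ in off_idx]
    raw_l1 = sum(abs(v) for v in raw)
    # Rescale the off-support part to use a random fraction of the 3 * on budget.
    budget = 3.0 * on * rng.random()
    scale = budget / raw_l1 if raw_l1 > 0 else 0.0
    for j, v in zip(off_idx, raw):
        theta[j] = scale * v
    return theta


def re_ratio(X, theta):
    """Return (||X theta||_2^2 / N) / ||theta||_2^2 for X given as a list of rows."""
    n = len(X)
    fitted = [sum(x * t for x, t in zip(row, theta)) for row in X]
    return (sum(f * f for f in fitted) / n) / sum(t * t for t in theta)


def empirical_re_constant(X, support, draws=200, seed=0):
    """Smallest ratio over random cone vectors (an optimistic estimate of kappa_L)."""
    rng = random.Random(seed)
    p = len(X[0])
    return min(re_ratio(X, sample_cone_vector(p, support, rng)) for _ in range(draws))
```

A strictly positive value is necessary but not sufficient for (2.4.11), since the infimum over the whole cone may be smaller than any sampled ratio.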
In many settings it is possible to prove that, with high probability, the first-order expansion of the loss function satisfies a lower bound. For example, in the lasso case with Gaussian design $X_{i,:} \overset{\text{iid}}{\sim} N(0, \Sigma)$, Raskutti et al. (2010, 2011) prove that a bound of the form
\[ \frac{\|X\theta\|_2^2}{N} \ge \kappa_1\|\theta\|_2^2 - \kappa_2 \frac{\log p}{N}\|\theta\|_1^2, \tag{2.4.12} \]
holds with high probability (greater than $1 - c_1\exp(-c_2 N)$). Recall that the RSC condition is only required to hold over the set $C$. For $\theta \in C$ (2.4.5) with $\theta_0 \in M(S)$, the subspace compatibility condition on $R$ gives
\[ \|\theta\|_1 \equiv R(\theta) \le 4\Psi(\bar{M})\|\theta\|_2 = 4\sqrt{s}\,\|\theta\|_2, \]
where $s = |S|$. Thus, given bounds of the form (2.4.12), we can show that $N^{-1}\|X\theta\|_2^2 \ge (\kappa_1 - 16\kappa_2 s \log p / N)\|\theta\|_2^2 \ge \kappa_L\|\theta\|_2^2$, where $\kappa_L = \kappa_1/2$ and the last bound holds when $N \ge 32(\kappa_2/\kappa_1) s \log p$. This form of analysis enables us to state that, with high probability, the RSC condition will be met once a certain amount of data has been collected.
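The sample-size argument above can be sketched numerically. The snippet below evaluates the curvature constant $\kappa_1 - 16\kappa_2 s \log p / N$ obtained by combining (2.4.12) with the cone inequality, and searches for the smallest $N$ at which it reaches $\kappa_L = \kappa_1/2$. The function names and the values of $\kappa_1$, $\kappa_2$ in the usage lines are hypothetical; in practice these constants depend on the covariance $\Sigma$ and are not given here.

```python
import math


def rsc_constant(kappa1, kappa2, s, p, n):
    """Curvature constant kappa1 - 16 * kappa2 * s * log(p) / n on the cone."""
    return kappa1 - 16.0 * kappa2 * s * math.log(p) / n


def smallest_sample_size(kappa1, kappa2, s, p):
    """Smallest integer n with rsc_constant(...) >= kappa1 / 2, found by bisection."""
    target = kappa1 / 2.0
    hi = 1
    # Double n until the target curvature is reached (the constant increases with n).
    while rsc_constant(kappa1, kappa2, s, p, hi) < target:
        hi *= 2
    lo = hi // 2  # either fails the target, or is 0 when n = 1 already suffices
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if rsc_constant(kappa1, kappa2, s, p, mid) >= target:
            hi = mid
        else:
            lo = mid
    return hi


# Hypothetical constants, chosen only to show the s * log(p) scaling.
for s in (5, 10, 20):
    print(s, smallest_sample_size(kappa1=1.0, kappa2=1.0, s=s, p=100))
```

The required sample size grows linearly in the sparsity $s$ and only logarithmically in the dimension $p$, which is what makes the high-dimensional regime $p \gg N$ tractable.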
Proposition 2.6. Error bound for Lasso
Consider the linear regression model $y = X\theta_0 + \epsilon$. Assume that the columns of the design matrix are normalised, i.e. $N^{-1/2}\|X_{\cdot,j}\|_2 \le 1$ for all $j = 1, \dots, p$, and that the noise terms possess sub-Gaussian tails such that, for a given scale factor $\zeta < \infty$, $\mathbb{E}[\exp(t\epsilon_i)] \le \exp(\zeta^2 t^2/2)$ for all $t \in \mathbb{R}$. For a suitably chosen $\lambda_N = 4\zeta\sqrt{(\log p)/N}$, with high probability we recover the bound
\[ \|\hat\theta_{\lambda_N} - \theta_0\|_2^2 \le \frac{64\zeta^2}{\kappa_L^2}\,\frac{s\log p}{N}. \tag{2.4.13} \]
Proof. The RSC condition can be shown to hold with high probability using results similar to those of Eq. 2.4.12. One is also required to check that the regularisation parameter is appropriately set. Specifically, it should satisfy $\lambda_N \ge 2R^*(\nabla L(\theta_0))$. In the lasso case we obtain $2R^*(\nabla L(\theta_0)) = 2\|N^{-1}X^\top\epsilon\|_\infty$, which can be bounded by considering the sub-Gaussian error structure. For a full proof see Negahban et al. (2012).
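The two ingredients of Prop. 2.6 can be illustrated with a small simulation. The sketch below reads the regularisation level as $\lambda_N = 4\zeta\sqrt{(\log p)/N}$, evaluates the right-hand side of (2.4.13), and estimates how often $\lambda_N$ dominates $2\|N^{-1}X^\top\epsilon\|_\infty$. It assumes Gaussian noise with standard deviation $\zeta$ (one particular sub-Gaussian case) and a Gaussian design with columns rescaled to $N^{-1/2}\|X_{\cdot,j}\|_2 = 1$; all function names and default values are hypothetical.

```python
import math
import random


def lasso_lambda(zeta, p, n):
    """Regularisation level lambda_N = 4 * zeta * sqrt(log(p) / N)."""
    return 4.0 * zeta * math.sqrt(math.log(p) / n)


def lasso_error_bound(zeta, kappa_l, s, p, n):
    """Right-hand side of (2.4.13), a bound on the squared l2 error."""
    return 64.0 * zeta ** 2 / kappa_l ** 2 * s * math.log(p) / n


def dual_norm_term(X, eps):
    """Return 2 * ||X^T eps / N||_inf for a design X (list of rows) and noise eps."""
    n, p = len(X), len(X[0])
    return 2.0 * max(
        abs(sum(X[i][j] * eps[i] for i in range(n))) / n for j in range(p)
    )


def normalise_columns(X):
    """Rescale every column so that N^{-1/2} * ||X_{.,j}||_2 = 1."""
    n, p = len(X), len(X[0])
    norms = [math.sqrt(sum(X[i][j] ** 2 for i in range(n)) / n) for j in range(p)]
    return [[X[i][j] / norms[j] for j in range(p)] for i in range(n)]


def coverage(n=50, p=20, zeta=1.0, trials=50, seed=0):
    """Fraction of simulated data sets with lambda_N >= 2 * ||X^T eps / N||_inf."""
    rng = random.Random(seed)
    lam = lasso_lambda(zeta, p, n)
    hits = 0
    for _ in range(trials):
        X = normalise_columns(
            [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
        )
        eps = [rng.gauss(0.0, zeta) for _ in range(n)]  # Gaussian noise (assumed)
        if dual_norm_term(X, eps) <= lam:
            hits += 1
    return hits / trials
```

A coverage close to one is what the high-probability statement predicts; the bound itself shrinks at rate $s \log p / N$ for a fixed $\kappa_L$.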
As with the lasso, one can also obtain similar bounds for regularised covariance/precision estimation (Bühlmann et al. 2011; Lam et al. 2009; Ravikumar, Wainwright, and J.D. Lafferty 2010; Ravikumar, Wainwright, Raskutti, et al. 2011; Rothman et al. 2008; Saegusa et al. 2016). The following bound can be considered analogous to that of Prop. 2.6, but for the $\ell_1$-penalised log-det problem (2.3.4), cf. the graphical lasso:
Proposition 2.7. Bound for $\ell_1$ Log-Det Estimation (Ravikumar, Wainwright, Raskutti, et al. 2011)
If the rescaled $X_i/\sqrt{\Sigma_{0,ii}}$ are sub-Gaussian and the true precision matrix has degree $s$, then under suitable regularisation conditions the estimation error of the precision matrix is bounded as
\[ \|\hat\Theta - \Theta_0\|_F = O\!\left(\sqrt{\frac{(s+p)\log p}{N}}\right), \tag{2.4.14} \]
with probability $1 - 1/p^{\tau-2} \to 1$, where $\tau > 2$.
Proof. The above result is a summarised version of Theorem 1 in Ravikumar, Wainwright, Raskutti, et al. (2011), specific to sub-Gaussian sampling for $X_i$.
The parameter $\tau$ reflects the rate of convergence in probability; it affects the appropriate setting of both the regularisation constant $\lambda$ and the sample size required for the claims. A high $\tau$ results in high-probability claims, but also an increased lower bound on the sample size. Again, as in the lasso case, one needs to check both that the glasso estimator meets an RSC condition and that there is sufficient regularisation. The specifics of such sample size and regularisation requirements are omitted here for readability.
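To make the role of $\tau$ concrete, the sketch below evaluates the probability level $1 - 1/p^{\tau-2}$ and the rate $\sqrt{(s+p)\log p/N}$ from Prop. 2.7. The function names are hypothetical, the constant hidden in the $O(\cdot)$ notation is ignored, and the $\tau$-dependent sample-size requirement (omitted above) is not modelled.

```python
import math


def success_probability(p, tau):
    """Probability level 1 - 1 / p**(tau - 2) attached to (2.4.14); needs tau > 2."""
    if tau <= 2:
        raise ValueError("tau must exceed 2")
    return 1.0 - p ** (-(tau - 2.0))


def frobenius_rate(s, p, n):
    """Rate sqrt((s + p) * log(p) / N) from (2.4.14), up to an unknown constant."""
    return math.sqrt((s + p) * math.log(p) / n)


# Larger tau pushes the probability level towards one for a fixed dimension p.
for tau in (2.5, 3.0, 4.0):
    print(tau, success_probability(100, tau))
```

For a fixed $\tau > 2$ the probability level also tends to one as $p$ grows, which is the sense of the limit stated in the proposition.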
It is worth noting that the above result holds for estimating the precision matrix of sub-Gaussian random variables. In the case of a GGM, the precision matrix elements can be considered as a specification for a graph $G(V, E)$, as discussed in Section 2.3. In the more general case there is no such clear interpretation of the off-diagonal precision matrix structure. The result is typical for high-dimensional graph selection problems. For related results see Rothman et al. (2008), Ravikumar, Wainwright, and J.D. Lafferty (2010), who also consider a binary Ising model, and Lam et al. (2009), who additionally consider non-convex penalties.