TOPICS IN HIGH-DIMENSIONAL REGRESSION AND NONPARAMETRIC MAXIMUM LIKELIHOOD METHODS
BY LONG FENG
A dissertation submitted to the Graduate School—New Brunswick Rutgers, The State University of New Jersey
in partial fulfillment of the requirements for the degree of
Doctor of Philosophy
Graduate Program in Statistics and Biostatistics Written under the direction of
Cun-Hui Zhang & Lee H. Dicker and approved by
New Brunswick, New Jersey October, 2017
Topics in high-dimensional regression and nonparametric
maximum likelihood methods
by LONG FENG
Dissertation Director: Cun-Hui Zhang & Lee H. Dicker
This thesis contains two parts. The first part, in Chapters 2-4, addresses three connected issues in penalized least-square estimation for high-dimensional data. The second part, in Chapter 5, concerns nonparametric maximum likelihood methods for mixture models.
In the first part, we prove the estimation, prediction and selection properties of concave penalized least-square estimation (PLSE) under fully observed and noisy/missing designs, and validate an essential condition for PLSE: the restricted eigenvalue (RE) condition. In Chapter 2, we prove that the concave PLSE matches the oracle inequalities for prediction and coefficient estimation of the Lasso, based only on the restricted eigenvalue condition, one of the mildest conditions imposed on the design matrix. Furthermore, under a uniform signal strength assumption, selection consistency does not require any additional conditions for proper concave penalties such as the SCAD penalty and MCP. A scaled version of the concave PLSE is also proposed to jointly estimate the regression coefficients and noise level. Chapter 3 concerns high-dimensional regression when the design matrix is subject to missingness or noise. We extend the PLSE for fully observed designs to noisy or missing designs and prove that the same scale of coefficient estimation error can be obtained, while requiring no additional condition. Moreover, we show that a linear combination of the $\ell_2$ norm of the regression coefficients and the noise level is large enough as a penalty level when noise or missing data exists. Chapter 4 examines the RE condition required in Chapter 2 and Chapter 3 and considers a more general groupwise version. We prove that the population version of the groupwise RE condition implies its sample version under a low moment condition, given the usual sample size requirement. Our results include the ordinary RE condition as a special case.
In the second part, we consider nonparametric maximum likelihood (NPML) methods for mixture models, a nonparametric empirical Bayes approach. We provide concrete guidance on implementing multivariate NPML methods for mixture models, with theoretical and empirical support; topics covered include identifying the support set of the mixing distribution, and comparing algorithms (across a variety of metrics) for solving the simple convex optimization problem at the core of the approximate NPML problem. In addition, three diverse real data applications are provided to illustrate the performance of nonparametric maximum likelihood methods.
I would like to express my deepest gratitude to my advisors, Prof. Cun-Hui Zhang and Prof. Lee Dicker. I feel extremely fortunate to have had the opportunity to work with them. Prof. Zhang is more than an advisor to me. He is a great researcher, a dedicated educator, a devoted mentor, a trusted friend and a respectable elder. He has provided me with helpful instruction and exceptional research training, as well as unwavering support and constant encouragement. More importantly, his devotion to research and to students has made me interested in exploring a career in academia as a researcher and teacher. Prof. Dicker is the junior professor I admire most. His brilliant ideas and excellent intuition on statistics always deepen my understanding of research questions. He can always convey complex problems in plain language. I have enjoyed working with him very much and have benefited a lot from each of our meetings.
Secondly, I would like to extend my gratitude to my dissertation committee, Prof. Pierre Bellec and Prof. Eitan Greenshtein, for the time they dedicated to reviewing my thesis and for their comments on the manuscripts. Special thanks go to Prof. Greenshtein for his helpful discussion on the topics of nonparametric maximum likelihood estimation in Chapter 5 of this thesis.
In addition, I want to thank Professor John Kolassa for his support over the past five years and Prof. Minge Xie for his advice and encouragement during my study. I also want to thank the fellow students in our department and my friends at Rutgers for their suggestions and help. I feel very lucky to have met all these people in my graduate life and to have had such a happy and unforgettable journey.
To my family
Abstract
Acknowledgements
Dedication
List of Tables
List of Figures
1. Introduction
1.1. High-dimensional regression
1.2. Nonparametric maximum likelihood methods
2. Oracle properties of concave PLSE and its scaled version
2.1. Introduction
2.2. Statistical Properties of Concave PLSE methods
2.2.1. Concave penalties
2.2.2. The restricted eigenvalue condition
2.2.3. Properties of concave PLSE
2.3. Smaller penalty levels
2.3.1. Smaller penalty levels
2.3.2. RE-type conditions for smaller penalty levels
2.3.3. Prediction and estimation errors bounds at smaller penalty levels
2.4. Scaled concave PLSE
2.4.1. Description of the scaled concave PLSE
2.4.2. Performance guarantees of scaled concave PLSE at universal penalty levels
2.5.1. No signal case: $\beta^* = 0$
2.5.2. Effect of correlation: ranging over different $\rho$
2.5.3. Effect of signal-to-noise ratio: ranging over different snr
2.5.4. Effect of sparsity: ranging over different $\alpha$
2.6. Discussion
3. Penalized least-square estimation with noisy and missing data
3.1. Introduction
3.2. Theoretical Analysis of PLSE
3.2.1. Restricted eigenvalue conditions
3.2.2. Main results
3.3. Theoretical penalty levels for missing/noisy data
3.4. Scaled PLSE and Variance Estimation
3.5. Conclusions
4. Group Lasso under Low-Moment Conditions on Random Designs
4.1. Introduction
4.2. A review of restricted eigenvalue type conditions
4.3. The group transfer principle
4.4. Groupwise compatibility condition
4.5. Groupwise restricted eigenvalue condition
4.6. Convergence of the restricted eigenvalue
4.7. Lemmas
4.8. Discussion
5. Nonparametric Maximum Likelihood for Mixture Models: A Convex Optimization Approach to Fitting Arbitrary Multivariate Mixing Distributions
5.2.1. NPMLEs
5.2.2. A simple finite-dimensional convex approximation
5.3. Choosing $\Lambda$
5.4. Connections with finite mixtures
5.5. Implementation overview
5.6. Simulation studies
5.6.1. Comparing NPMLE algorithms
5.6.2. Gaussian location scale mixtures: Other methods for estimating a normal mean vector
5.7. Baseball data
5.8. Two-dimensional NPMLE for cancer microarray classification
5.9. Continuous glucose monitoring
5.9.1. Linear model
5.9.2. Kalman filter
5.9.3. Comments on results
5.10. Discussion
List of Tables
2.1. Median bias of standard deviation estimates. No signal, $\sigma = 1$, $\rho = 0$, sample size $n = 100$. Minimum error besides the oracle is in bold for each analysis.
5.1. Comparison of different NPMLE algorithms. Mean values (standard deviation in parentheses) reported from 100 independent datasets; $p = 1000$ throughout simulations. Mixing distribution 1 has constant $\sigma_j$; mixing distribution 2 has correlated $\mu_j$ and $\sigma_j$.
5.2. Mean TSE for various estimators of $\mu \in \mathbb{R}^p$ based on 100 simulated datasets; $p = 1000$. $(q_1, q_2)$ indicates the grid points used to fit $\hat{G}_\Lambda$.
5.3. Baseball data. TSE relative to the naive estimator. Minimum error is in bold for each analysis.
5.4. Microarray data. Number of misclassification errors on test data.
5.5. Blood glucose data. MSE relative to CGM.
List of Figures
2.1. Median standard deviation estimates over different levels of predictor correlation. $\sigma = 1$, $\alpha = 0.5$, snr $= 1$, sample size $n = 100$, predictors $p = 100, 200, 500, 1000$ moving from left to right along rows. Plot numbers refer to CV L (1), CV SCAD (2), SZ L (3), SZ L2 (4), SZ MCP (5), SZ MCP2 (6), SZ MCP3 (7).
2.2. Median standard deviation estimates over different levels of signal-to-noise ratio. $\sigma = 1$, $\alpha = 0.5$, $\rho = 0$, sample size $n = 100$, predictors $p = 100, 200, 500, 1000$ moving from left to right along rows. Plot numbers refer to CV L (1), CV SCAD (2), SZ L (3), SZ L2 (4), SZ MCP (5), SZ MCP2 (6), SZ MCP3 (7).
2.3. Median standard deviation estimates over different levels of sparsity. $\sigma = 1$, snr $= 1$, $\rho = 0$, sample size $n = 100$, predictors $p = 100, 200, 500, 1000$ moving from left to right along rows. Plot numbers refer to CV L (1), CV SCAD (2), SZ L (3), SZ L2 (4), SZ MCP (5), SZ MCP2 (6), SZ MCP3 (7).
2.4. Five $\lambda_0$s as functions of $k$, $n = 100$, $p = 1000$. Line numbers refer to (1) $\lambda_0(k) = \{(2/n)\log(p)\}^{1/2}$, (2) $\lambda_0(k) = \{(2/n)\log(p/k)\}^{1/2}$, (3) $\lambda_0(k) = (2/n)^{1/2} L_1(k/p)$, (4) adaptive $\lambda_0$ described in Section 2.5 with various $k$, assuming that the correlation between columns of $X$ is 0, (5) same as (4) except assuming that the correlation between columns of $X$ is 0.8. The $k_1$ is the solution to (2.37), $k_2$ is the solution to $2k = L_1^4(k/p) + 2L_1^2(k/p)$.
baseball dataset; (b) histogram of non-pitcher data from the baseball dataset; (c) histogram of pitcher data from the baseball dataset.
Chapter 1
Introduction
1.1 High-dimensional regression
The first part of this thesis addresses three issues in parameter estimation, prediction and variable selection for high-dimensional regression: concave penalized least-square regression, high-dimensional regression with noisy and missing data, and restricted eigenvalue-type conditions for high-dimensional regression.
As modern technology generates ever larger volumes of data, high-dimensional data have been studied intensively in both statistics and computer science. In linear regression, a widely used approach to analyzing high-dimensional data is penalized least-square estimation (PLSE). The Lasso, or $\ell_1$ penalization [71], and concave penalization, such as the SCAD [21] and MCP [84], are the two mainstream methods in penalized least-square estimation. It has been shown that the concave PLSE guarantees variable selection consistency under significantly weaker conditions than the Lasso; for example, the strong irrepresentable condition on the design matrix required by the Lasso can be replaced by a sparse Riesz condition. Moreover, the concave PLSE also enjoys rate-optimal error bounds in prediction and coefficient estimation. However, the error bounds for prediction and coefficient estimation in the literature still require significantly stronger conditions than the Lasso requires, for example, knowledge of the $\ell_1$ norm of the true coefficient vector or the upper sparse eigenvalue condition. Ideally, selection, prediction and estimation properties should depend only on a lower sparse eigenvalue or restricted eigenvalue condition. Is that achievable? In the second chapter, we give an affirmative answer to this question.
In Chapter 2, we prove that the concave PLSE matches the oracle inequalities for prediction and $\ell_q$ coefficient estimation of the Lasso, with $1 \le q \le 2$, based only on the restricted eigenvalue condition, one of the mildest conditions on the design matrix. Furthermore, under a uniform signal strength assumption, selection consistency does not require any additional conditions for proper concave penalties such as the SCAD penalty and MCP. Our theorem applies to all the local solutions that are computable by path following algorithms starting from the origin. We also develop a scaled version of the concave PLSE that jointly estimates the regression coefficients and noise level. The scaled concave PLSE is not an easy extension of the scaled Lasso because the joint minimization over the regression coefficients and noise level is non-convex for the former. The computational cost of the scaled concave PLSE is negligible beyond computing a continuous solution path. All our consistency results apply to cases where the number of predictors $p$ is much larger than the sample size $n$.
In Chapter 3, we consider high-dimensional regression when the design matrices are not fully observable. Two specifications are discussed: missing design and noisy design. We extend the PLSE to noisy or missing designs and prove that the same scale of coefficient estimation error can be obtained as with the fully observed design, while requiring no additional condition. Moreover, we prove that a linear combination of the noise level and the $\ell_2$ norm of the coefficients is large enough as a penalty level when noise or missing data exists. This sharpens the existing results where an $\ell_1$ norm of the coefficients is required. We further extend the scaled version of PLSE to the missing and noisy data case. Since cross-validation based techniques are time consuming and may be misleading for missing or noisy data, the proposed scaled solution is of great use.
As discussed before, restricted eigenvalue (RE) type conditions can be viewed as nearly the weakest available conditions on the design matrix to guarantee prediction and estimation performance of the Lasso, concave penalized least-square estimators and groupwise estimators in high-dimensional regression. In Chapter 4, we prove that the population version of the groupwise RE condition implies its sample version under: (i) a second moment uniform integrability assumption on the linear combinations of the design variables and (ii) a fourth moment uniform boundedness assumption on the individual design variables and an $m$-th moment assumption on the linear combinations of the within-group design variables for $m > 2$, provided the usual sample size requirement holds. Moreover, the fourth and $m$-th moment assumptions can be removed given a slightly larger sample size. In addition, the low moment condition is also sufficient to guarantee the groupwise compatibility condition, an $\ell_1$-version of the RE condition. Our results include the ordinary RE condition as a special case. This study demonstrates a benefit of standardizing the design variables in penalized least squares estimation for heavy-tailed random designs. It also indicates that the RE condition of a bootstrapped sample can be guaranteed given the corresponding sample RE condition.
1.2 Nonparametric maximum likelihood methods
The second part of this thesis considers two types of models using nonparametric maximum likelihood (NPML) methods, a nonparametric empirical Bayes approach: NPML methods for mixture models and NPML methods for linear models.
Nonparametric maximum likelihood (NPML) for mixture models is a technique for estimating mixing distributions that has a long and rich history in statistics going back to the 1950s, and is closely related to empirical Bayes methods. Historically, NPML-based methods have been considered to be relatively impractical because of computational and theoretical obstacles. However, recent work focusing on approximate NPML methods suggests that these methods may have great promise for a variety of modern applications. Building on this recent work, we study a class of flexible, scalable, and easy to implement approximate NPML methods for problems with multivariate mixing distributions. In Chapter 5, we provide concrete guidance on implementing these methods, with theoretical and empirical support; topics covered include identifying the support set of the mixing distribution, and comparing algorithms (across a variety of metrics) for solving the simple convex optimization problem at the core of the approximate NPML problem. Additionally, we illustrate the methods’ performance in three diverse real data applications: (i) A baseball data analysis (a classical example for empirical Bayes methods, originally inspired by Efron & Morris), (ii) high-dimensional microarray classification, and (iii) online prediction of blood-glucose density for diabetes patients. Among other things, our empirical results clearly demonstrate the relative effectiveness of using multivariate (as opposed to univariate) mixing distributions for NPML-based approaches.
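To make the "simple convex optimization problem" concrete, the following is a minimal sketch, not the implementation studied in Chapter 5. It assumes a univariate Gaussian location model $x_i \mid \mu_i \sim N(\mu_i, 1)$ with the mixing distribution restricted to a fixed grid, and it uses a plain EM iteration to maximize the log-likelihood over the grid weights. The function names `npmle_em` and `posterior_mean`, the grid size and the simulated two-point prior are all illustrative assumptions.

```python
import numpy as np

def npmle_em(x, grid, n_iter=200):
    """Approximate NPMLE of a mixing distribution supported on a fixed grid.

    Assumed model (for illustration only): x_i | mu_i ~ N(mu_i, 1), mu_i ~ G.
    Returns the grid weights and the log-likelihood before each EM update.
    """
    # L[i, k] is the N(grid_k, 1) density evaluated at x_i.
    L = np.exp(-0.5 * (x[:, None] - grid[None, :]) ** 2) / np.sqrt(2.0 * np.pi)
    w = np.full(grid.size, 1.0 / grid.size)
    loglik = []
    for _ in range(n_iter):
        mix = L @ w                            # marginal density at each x_i
        loglik.append(np.log(mix).sum())
        w = w * (L.T @ (1.0 / mix)) / x.size   # EM update of the weights
    return w, np.array(loglik)

def posterior_mean(x, grid, w):
    """Empirical Bayes posterior mean of mu_i under the fitted weights."""
    L = np.exp(-0.5 * (x[:, None] - grid[None, :]) ** 2)
    return (L @ (w * grid)) / (L @ w)

# Simulated example: a two-point prior on {0, 3} (hypothetical data).
rng = np.random.default_rng(0)
mu = rng.choice([0.0, 3.0], size=500)
x = mu + rng.standard_normal(500)
grid = np.linspace(x.min(), x.max(), 100)
w, loglik = npmle_em(x, grid)
mu_hat = posterior_mean(x, grid, w)
```

With the support fixed, the objective is concave in the weights, so any convex solver could replace the EM loop; EM is used here only because it needs no extra dependencies.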
Chapter 2
Oracle properties of concave PLSE and its scaled version
2.1 Introduction
The purpose of this chapter is to study the prediction, coefficient estimation, and variable selection properties of the concave penalized least squares estimator (PLSE) in linear regression under the restricted eigenvalue (RE) condition on the design matrix.
Consider the linear model
$$ y = X\beta^* + \varepsilon, \tag{2.1} $$
where $X = (x_1, \ldots, x_p) \in \mathbb{R}^{n \times p}$ is a design matrix, $y \in \mathbb{R}^n$ is a response vector, $\varepsilon \in \mathbb{R}^n$ is a noise vector, and $\beta^* \in \mathbb{R}^p$ is an unknown coefficient vector. For simplicity, we assume throughout the chapter that the design matrix is column normalized with $\|x_j\|_2^2 = n$.
We shall focus on penalized loss functions of the form
$$ L(\beta; \lambda) = \frac{1}{2n}\|y - X\beta\|_2^2 + \sum_{j=1}^p \rho(|\beta_j|; \lambda), \tag{2.2} $$
where the penalty function $\rho(t; \lambda)$, indexed by $\lambda \ge 0$, is concave in $t > 0$ with $\rho(0+; \lambda) = \rho(0; \lambda) = 0$, and the index $\lambda$ is taken as the penalty level $\lim_{t \to 0+} \rho(t; \lambda)/t$. Additional regularity conditions on $\rho(\cdot; \cdot)$ will be described in Section 2.2. The PLSE can be defined as a statistical choice among local minimizers of the penalized loss.
Among PLSE methods, the Lasso [71] with the absolute penalty ρpt; λq “ λ|t| is the most widely used and extensively studied. The Lasso is relatively easy to compute as it is a convex minimization problem, but it is well known that the Lasso is biased. A consequence of this bias is the requirement of a neighborhood stability/strong irrepresentable condition
on the design matrix X for the selection consistency of the Lasso [51,88,72,79]. Fan and Li [21] proposed a concave penalty to remove the bias of the Lasso and proved an oracle property for one of the local minimizers of the resulting penalized loss. Zhang [84] proposed a path finding algorithm PLUS for concave PLSE and proved the selection consistency of the PLUS-computed local minimizer under a rate optimal signal strength condition on the coefficients and the sparse Riesz condition (SRC) [85] on the design. The SRC, which requires bounds on both the lower and upper sparse eigenvalues of the Gram matrix and is closely related to the restricted isometry property (RIP) [12], is substantially weaker than the strong irrepresentable condition. This advantage of concave PLSE over the Lasso has since become well understood.
For prediction and coefficient estimation, the existing literature presents a somewhat opposite story. Consider hard sparse coefficient vectors satisfying $|\mathrm{supp}(\beta^*)| \le s$ with small $(s/n)\log p$. Although rate minimax error bounds were proved under the RIP and SRC respectively for the Dantzig selector and Lasso in [11] and [85], Bickel et al. [6] sharpened their results by weakening the RIP and SRC to the RE condition, and van de Geer and Bühlmann [77] proved comparable prediction and $\ell_1$ estimation error bounds under an even weaker compatibility or $\ell_1$-RE condition. Meanwhile, rate minimax error bounds for concave PLSE still require two-sided sparse eigenvalue conditions like the SRC [84,87,80,22] or a proper known upper bound for the $\ell_1$ norm of the true coefficient vector [46]. It turns out that the difference between the SRC and RE conditions is quite significant, as Rudelson and Zhou [66] proved that the RE condition is a consequence of a lower sparse eigenvalue condition alone. This seems to suggest a theoretical advantage of the Lasso, in addition to its computational simplicity, compared with concave PLSE.
An interesting question is whether the RE condition alone on the design matrix is also sufficient for the above discussed results for concave penalized prediction, coefficient estimation and variable selection, provided proper conditions on the coefficient and noise vectors. An affirmative answer to this question, which we provide in this chapter, amounts to the removal of the upper sparse eigenvalue condition on the design matrix and also a relaxation of the lower sparse eigenvalue condition or the restricted strong convexity (RSC) condition [56] imposed in [46]. We also extend the prediction and estimation error bounds to smaller penalty levels $\lambda$, which are more practical and provide rate minimaxity in prediction and coefficient estimation when $(s/n)\log(p/s)$ is small.
The Lasso still enjoys computational advantages over concave PLSE. However, this advantage may not be so drastic in many applications in view of the literature on statistical and computational properties of iterative and path finding algorithms for concave penalization [25,89,84,87,9,1,32,56,80,46,22]. In this chapter, we focus on statistical properties of local solutions of concave PLSE computable by path finding algorithms, as we are also interested in the adaptive choice of the penalty level $\lambda$ in the solution path and the estimation of the noise level. Exact solution paths of the PLSE can be computed by the PLUS algorithm [84], while approximate solution paths can be computed by the gradient descent algorithm of Wang et al. [80] with a computational complexity guarantee.
Once a local solution path of the concave penalization problem is obtained, one still needs to choose an estimator in the solution path or, equivalently, a proper penalty level. This problem, which we also study in this chapter, is equivalent to consistent estimation of the noise level due to scale invariance.
Substantial effort has been made in scale free estimation under the $\ell_1$ penalty. The idea is to make the penalty level proportional to the noise level $\sigma$. Städler et al. [67] proposed to estimate $\beta$ and $\sigma$ by maximizing their joint log-likelihood with an $\ell_1$ penalty on $\beta/\sigma$ through reparametrization. In the discussion of [67], Antoniadis [2] proposed to minimize Huber's [34] concomitant joint loss function with the $\ell_1$ penalty on $\beta$ without reparametrization, and Sun and Zhang [68] considered a "naive" iteration between the estimation of $\beta$ and $\sigma$ and proved the bias reduction property of one iteration from the joint estimator of [67]. Belloni et al. [5] introduced and studied a square-root Lasso for the estimation of $\beta$. It turns out that for the $\ell_1$ penalty, Huber's concomitant joint loss, the equilibrium of the iterative algorithm, and the square-root Lasso all produce the same estimator. Sun and Zhang [69] proposed the iterative algorithm as scaled PLSE for joint estimation of $\beta$ and $\sigma$ under both the $\ell_1$ and concave penalties and studied the scaled Lasso with the joint penalized loss of [2], especially the consistency and asymptotic normality of the resulting noise level estimator. However, a theoretical study of the scaled concave PLSE is noticeably missing.
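The iteration between the estimation of $\beta$ and $\sigma$ can be sketched in a few lines for the $\ell_1$ penalty. The code below is only an illustration under stated assumptions: columns are normalized to $\|x_j\|_2^2 = n$, the inner Lasso is solved by a basic coordinate descent written here for self-containment, the penalty is $\sigma\lambda_0$ with $\lambda_0 = \{(2/n)\log p\}^{1/2}$, and the noise level is updated as $\|y - X\beta\|_2/\sqrt{n}$. The function names and the simulated data are hypothetical and are not taken from [68] or [69].

```python
import numpy as np

def soft_threshold(z, lam):
    """Scalar soft-thresholding operator."""
    return np.sign(z) * max(abs(z) - lam, 0.0)

def lasso_cd(X, y, lam, beta0, n_sweeps=200, tol=1e-8):
    """Coordinate descent for (1/2n)||y - Xb||_2^2 + lam * ||b||_1.

    Assumes the columns of X are normalized so that ||x_j||_2^2 = n.
    """
    n, p = X.shape
    beta = beta0.copy()
    resid = y - X @ beta
    for _ in range(n_sweeps):
        max_change = 0.0
        for j in range(p):
            old = beta[j]
            new = soft_threshold(X[:, j] @ resid / n + old, lam)
            if new != old:
                resid -= X[:, j] * (new - old)
                beta[j] = new
                max_change = max(max_change, abs(new - old))
        if max_change < tol:
            break
    return beta

def scaled_lasso(X, y, lam0, n_iter=100, tol=1e-6):
    """Alternate between a Lasso fit at penalty sigma * lam0 and a noise level update."""
    n, p = X.shape
    beta = np.zeros(p)
    sigma = np.linalg.norm(y) / np.sqrt(n)
    for _ in range(n_iter):
        beta = lasso_cd(X, y, sigma * lam0, beta)
        sigma_new = np.linalg.norm(y - X @ beta) / np.sqrt(n)
        converged = abs(sigma_new - sigma) < tol
        sigma = sigma_new
        if converged:
            break
    return beta, sigma

# Simulated example with three nonzero coefficients and unit noise level.
rng = np.random.default_rng(1)
n, p = 100, 200
X = rng.standard_normal((n, p))
X *= np.sqrt(n) / np.linalg.norm(X, axis=0)   # column normalization
beta_true = np.zeros(p)
beta_true[:3] = 2.0
y = X @ beta_true + rng.standard_normal(n)
lam0 = np.sqrt(2.0 * np.log(p) / n)
beta_hat, sigma_hat = scaled_lasso(X, y, lam0)
```

Replacing `lasso_cd` with a concave-penalty solver keeps the iteration scale free, but the equivalence to a joint convex minimization is then lost, which is the difficulty discussed next.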
A main reason for this absence of a theoretical study of scale free concave PLSE is the loss of the scale free property. In the joint likelihood, the concomitant loss and the square-root formulations, it is not proper to use scale free concave penalty functions as they are not proportional to the penalty level. While the iterative approach is still scale free with concave penalties, concave regularization is more difficult to study than the Lasso due to the loss of its equivalence to joint convex minimization.
In this chapter, we find a much weaker condition under which local solutions of concave PLSE enjoy the desired properties in prediction, coefficient estimation, and variable selection. Specifically, we prove that the concave PLSE achieves rate minimaxity in prediction and coefficient estimation under the $\ell_0$ sparsity condition on $\beta$ and the RE condition on $X$. Furthermore, selection consistency can also be guaranteed under an additional uniform signal strength condition on the nonzero coefficients. In addition, we prove that the same properties hold for the scaled concave PLSE in the iterative algorithm formulation.
The rest of this chapter is organized as follows. In Section 2.2, we study concave PLSE under the RE condition on the design. In Section 2.3, we study concave PLSE with smaller penalty/threshold levels. In Section 2.4, we study theoretical properties of the scaled concave PLSE. Section 2.5 presents results of an extensive simulation study for variance estimation. Section 2.6 contains some discussion.
Notation: We denote by $\beta^*$ the true regression coefficient vector, $\Sigma = X^T X/n$ the sample Gram matrix, $S = \mathrm{supp}(\beta^*)$ the support set of the coefficient vector, $s = |S|$ the size of the support, and $\Phi(\cdot)$ the standard Gaussian cumulative distribution function. For vectors $v = (v_1, \ldots, v_p)$, we denote by $\|v\|_q = \big(\sum_j |v_j|^q\big)^{1/q}$ the $\ell_q$ norm, with $\|v\|_\infty = \max_j |v_j|$ and $\|v\|_0 = \#\{j : v_j \ne 0\}$. Moreover, $x_+ = \max(x, 0)$.
2.2 Statistical Properties of Concave PLSE methods
In this section, we present our results for concave PLSE at a sufficiently high penalty level to allow selection consistency. We first need to describe our assumptions on the penalty function and design matrix.
2.2.1 Concave penalties
We study the class of concave penalties $\rho(t; \lambda)$ satisfying the following properties:
(i) $\rho(t; \lambda)$ is symmetric, $\rho(t; \lambda) = \rho(-t; \lambda)$;
(ii) $\rho(t; \lambda)$ is monotone, $\rho(t_1; \lambda) \le \rho(t_2; \lambda)$ for all $0 \le t_1 < t_2$;
(iii) $\rho(t; \lambda)$ is left- and right-differentiable in $t$ for all $t$;
(iv) $\rho(t; \lambda)$ has the selection property, $\dot\rho(0+; \lambda) = \lambda$;
(v) $|\dot\rho(t-; \lambda)| \vee |\dot\rho(t+; \lambda)| \le \lambda$ for all real $t$.
We write $\dot\rho(t; \lambda) = x$ when $x$ is between the left- and right-derivatives of $\rho(t; \lambda)$ at $t$, including $t = 0$ where $\dot\rho(0; \lambda) = x$ means $|x| \le \lambda$. We use the following quantities to measure the concavity of penalty functions. For a given penalty function $\rho(\cdot; \lambda)$, define the maximum concavity at $t$ as
$$ \kappa(t; \rho, \lambda) = \sup_{t' > 0} \frac{\dot\rho(t'; \lambda) - \dot\rho(t; \lambda)}{t - t'}, \tag{2.3} $$
where the supremum is taken over all possible choices of $\dot\rho(t; \lambda)$ and $\dot\rho(t'; \lambda)$ between the left- and right-derivatives. Further, define the overall maximum concavity of $\rho(\cdot; \lambda)$ as
$$ \kappa(\rho) = \kappa(\rho, \lambda) = \max_{t \ge 0} \kappa(t; \rho, \lambda). \tag{2.4} $$
Many popular penalties satisfy conditions (i) to (v). We illustrate the SCAD (smoothly clipped absolute deviation) penalty and MCP (minimax concave penalty) as examples. The SCAD penalty [21] is defined as
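$$ \rho(t; \lambda) = \lambda \int_0^{|t|} \left\{ I(x \le \lambda) + \frac{(\gamma\lambda - x)_+}{(\gamma - 1)\lambda}\, I(x > \lambda) \right\} dx \tag{2.5} $$
with a fixed parameter $\gamma > 2$. A straightforward calculation yields $\kappa(0; \rho, \lambda) = 1/\gamma$ and $\kappa(\rho, \lambda) = 1/(\gamma - 1)$ for the SCAD penalty. The MCP [84] is defined as
$$ \rho(t; \lambda) = \lambda \int_0^{|t|} \left(1 - \frac{x}{\gamma\lambda}\right)_+ dx \tag{2.6} $$
with $\gamma > 0$ and $\kappa(\rho, \lambda) = \kappa(0; \rho, \lambda) = 1/\gamma$.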
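As a numerical companion to these definitions, the sketch below evaluates the SCAD and MCP penalties in closed form, together with their derivatives on $t \ge 0$, and approximates the overall maximum concavity by the steepest downward slope of the derivative on a fine grid. The closed forms are obtained by integrating (2.5) and the MCP integrand; the function names, the default values of $\gamma$ and the grid resolution are illustrative assumptions.

```python
import numpy as np

def scad_derivative(t, lam, gamma=3.7):
    """Derivative of the SCAD penalty with respect to |t|."""
    t = np.abs(np.asarray(t, dtype=float))
    return np.where(t <= lam, lam, np.maximum(gamma * lam - t, 0.0) / (gamma - 1.0))

def mcp_derivative(t, lam, gamma=3.0):
    """Derivative of the MCP with respect to |t|."""
    t = np.abs(np.asarray(t, dtype=float))
    return np.maximum(lam - t / gamma, 0.0)

def scad_penalty(t, lam, gamma=3.7):
    """Closed form of (2.5): linear, then quadratic, then constant."""
    t = np.abs(np.asarray(t, dtype=float))
    quad = (2.0 * gamma * lam * t - t ** 2 - lam ** 2) / (2.0 * (gamma - 1.0))
    flat = lam ** 2 * (gamma + 1.0) / 2.0
    return np.where(t <= lam, lam * t, np.where(t <= gamma * lam, quad, flat))

def mcp_penalty(t, lam, gamma=3.0):
    """Closed form of the MCP: quadratic up to gamma * lam, then constant."""
    t = np.abs(np.asarray(t, dtype=float))
    return np.where(t <= gamma * lam, lam * t - t ** 2 / (2.0 * gamma), gamma * lam ** 2 / 2.0)

def max_concavity(derivative, lam, gamma, num=20001):
    """Grid approximation of the overall maximum concavity in (2.4)."""
    grid = np.linspace(0.0, 2.0 * gamma * lam, num)
    d = derivative(grid, lam, gamma)
    return float(np.max(-np.diff(d) / np.diff(grid)))

lam = 0.5
kappa_scad = max_concavity(scad_derivative, lam, gamma=3.7)  # compare with 1 / (gamma - 1)
kappa_mcp = max_concavity(mcp_derivative, lam, gamma=3.0)    # compare with 1 / gamma
```

Both derivatives equal $\lambda$ at $t = 0+$ and never exceed $\lambda$, which is what properties (iv) and (v) require.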
2.2.2 The restricted eigenvalue condition
We now consider conditions on the design matrix. The restricted eigenvalue (RE) condition, proposed in [6], can be viewed as nearly the weakest available condition on the design to guarantee rate optimal prediction and coefficient estimation performance of the Lasso. The RE coefficient $\mathrm{RE}_2(S; \eta, \delta_*)$ for the $\ell_2$ estimation loss can be defined as follows: for $\eta \in [0, 1)$ and $\delta_* \in [0, 1]$,
$$ \mathrm{RE}_2^2(S; \eta, \delta_*) = \inf\left\{ \frac{u^T \Sigma u}{\|u\|_2^2} : (1 - \eta)\|u_{S^c}\|_1 \le (1 + \delta_*\eta)\|u_S\|_1 \right\}. \tag{2.7} $$
The RE condition refers to the property that $\mathrm{RE}_2(S; \eta, \delta_*)$ is no smaller than a certain positive constant for all $n$ and $p$. For prediction and $\ell_1$ estimation, an $\ell_1$-version of the RE can be employed. The following compatibility or $\ell_1$-RE coefficient [77] can be used,
$$ \mathrm{RE}_1^2(S; \eta, \delta_*) = \inf\left\{ \frac{u^T \Sigma u\, |S|}{\|u_S\|_1^2} : (1 - \eta)\|u_{S^c}\|_1 \le (1 + \delta_*\eta)\|u_S\|_1 \right\}. \tag{2.8} $$
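The infimum in (2.7) is not easy to compute exactly, but any vector in the cone yields an upper bound on it. The following sketch draws random cone vectors and keeps the smallest ratio $u^T\Sigma u/\|u\|_2^2$ found. This is only a crude Monte Carlo upper bound for building intuition, not a way to verify the RE condition; the sampling scheme and the names `sample_cone` and `re2_sq_upper_bound` are assumptions made for this illustration.

```python
import numpy as np

def sample_cone(rng, p, S, eta, delta):
    """Draw u with (1 - eta) * ||u_{S^c}||_1 <= (1 + delta * eta) * ||u_S||_1."""
    u = rng.standard_normal(p)
    mask = np.zeros(p, dtype=bool)
    mask[S] = True
    budget = (1.0 + delta * eta) / (1.0 - eta) * np.abs(u[mask]).sum()
    # Rescale the off-support part to a random fraction of the allowed l1 budget.
    u[~mask] *= rng.uniform() * budget / np.abs(u[~mask]).sum()
    return u

def re2_sq_upper_bound(X, S, eta, delta, n_draws=2000, seed=0):
    """Smallest sampled value of u' Sigma u / ||u||_2^2 over the cone in (2.7)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    Sigma = X.T @ X / n
    best = np.inf
    for _ in range(n_draws):
        u = sample_cone(rng, p, S, eta, delta)
        best = min(best, u @ Sigma @ u / (u @ u))
    return best

# Hypothetical design with normalized columns and support S = {0, 1, 2}.
rng = np.random.default_rng(3)
n, p = 200, 20
X = rng.standard_normal((n, p))
X *= np.sqrt(n) / np.linalg.norm(X, axis=0)
S = [0, 1, 2]
upper = re2_sq_upper_bound(X, S, eta=0.5, delta=1.0)
```

Since every sampled ratio is a Rayleigh quotient of $\Sigma$, the bound always lies between the smallest and largest eigenvalues of $\Sigma$.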
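We introduce a relaxed cone invertibility factor (RCIF) for prediction as
$$ \mathrm{RCIF}_{\mathrm{pred}}(S; \eta, \omega) = \inf\left\{ \frac{\|\Sigma u\|_\infty^2\, |S|}{u^T \Sigma u} : (1 - \eta)\|u_{S^c}\|_1 \le -\omega_S^T u_S \right\}, \tag{2.9} $$
where $\omega \in \mathbb{R}^p$, and an RCIF for the $\ell_q$ estimation, $1 \le q \le 2$, as
$$ \mathrm{RCIF}_{\mathrm{est},q}(S; \eta, \omega) = \inf\left\{ \frac{\|\Sigma u\|_\infty\, |S|^{1/q}}{\|u\|_q} : (1 - \eta)\|u_{S^c}\|_1 \le -\omega_S^T u_S \right\}. \tag{2.10} $$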
The choices of $\delta_*$ and $\omega$ depend on the problem under consideration in the analysis, but typically we have $\|\omega\|_\infty \le 1 + \delta_*\eta$ so that the minimization in (2.9) and (2.10) is taken over a smaller cone. For example, one may take $\omega_S = 0$ for studying selection consistency. We will use an RE condition to prove cone membership of the estimation error of the concave PLSE and the RCIF to bound the prediction and coefficient estimation errors. The following proposition shows that the RCIF may provide sharper bounds than the RE does.
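Proposition 2.1. Let RE and RCIF be as in (2.7)-(2.10), $\eta \in (0, 1)$, and $\xi = (1 + \delta_*\eta)/(1 - \eta)$. If $\|\omega_S\|_\infty \le 1 + \delta_*\eta$, then
$$ \begin{aligned} \mathrm{RCIF}_{\mathrm{pred}}(S; \eta, \omega) &\ge \mathrm{RE}_1^2(S; \eta, \delta_*)/(1 + \xi)^2, \\ \mathrm{RCIF}_{\mathrm{est},1}(S; \eta, \omega) &\ge \mathrm{RE}_1^2(S; \eta, \delta_*)/(1 + \xi)^2, \\ \mathrm{RCIF}_{\mathrm{est},2}(S; \eta, \omega) &\ge \mathrm{RE}_1(S; \eta, \delta_*)\,\mathrm{RE}_2(S; \eta, \delta_*)/(1 + \xi). \end{aligned} \tag{2.11} $$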
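Proof of Proposition 2.1. Since $\|\omega_S\|_\infty \le 1 + \delta_*\eta$, we have
$$ (1 - \eta)\|u_{S^c}\|_1 \le -\omega_S^T u_S \le (1 + \delta_*\eta)\|u_S\|_1. $$
It then follows that
$$ \frac{\|\Sigma u\|_\infty^2\, |S|}{u^T \Sigma u} \ge \frac{u^T \Sigma u\, |S|}{\|u\|_1^2} \ge \frac{u^T \Sigma u\, |S|}{(1 + \xi)^2 \|u_S\|_1^2}. $$
The first inequality of (2.11) is obtained by taking the infimum over the cone $\mathscr{C}(S; \eta, \delta_*) = \{u : (1 - \eta)\|u_{S^c}\|_1 \le (1 + \delta_*\eta)\|u_S\|_1\}$. Similarly,
$$ \frac{\|\Sigma u\|_\infty\, |S|}{\|u\|_1} \ge \frac{u^T \Sigma u\, |S|}{\|u\|_1^2} \ge \frac{u^T \Sigma u\, |S|}{(1 + \xi)^2 \|u_S\|_1^2}, \qquad \frac{\|\Sigma u\|_\infty\, |S|^{1/2}}{\|u\|_2} \ge \frac{u^T \Sigma u\, |S|^{1/2}}{\|u\|_1 \|u\|_2} \ge \frac{u^T \Sigma u\, |S|^{1/2}}{(1 + \xi)\|u_S\|_1 \|u\|_2}. $$
The second and third inequalities of (2.11) are obtained by taking the infimum over the cone $\mathscr{C}(S; \eta, \delta_*)$ in the above inequalities. $\square$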
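The two elementary facts behind each chain of inequalities in the proof can be written out explicitly. The block below is a short restatement using only the notation of this section, with $\xi = (1 + \delta_*\eta)/(1 - \eta)$ as in Proposition 2.1.

```latex
\begin{align*}
% Holder's inequality between the l1 and l-infinity norms:
u^T \Sigma u = \sum_{j=1}^p u_j (\Sigma u)_j
  &\le \|u\|_1 \|\Sigma u\|_\infty
  \quad\Longrightarrow\quad
  \|\Sigma u\|_\infty \ge \frac{u^T \Sigma u}{\|u\|_1}, \\
% Cone membership bounds the full l1 norm by the l1 norm on S:
\|u\|_1 = \|u_S\|_1 + \|u_{S^c}\|_1
  &\le \Big(1 + \frac{1 + \delta_*\eta}{1 - \eta}\Big)\|u_S\|_1
  = (1 + \xi)\|u_S\|_1 .
\end{align*}
```

Squaring the first bound and dividing by $u^T\Sigma u$ gives the first inequality in each chain; substituting the second bound for $\|u\|_1$ gives the last.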
2.2.3 Properties of concave PLSE
As our analysis directly allows the penalty to depend on the index $j$, we consider the following generalization of the penalized loss (2.2),
$$ L(\beta; \lambda) = \frac{1}{2n}\|y - X\beta\|_2^2 + \sum_{j=1}^p \rho_j(\beta_j; \lambda). $$
Given penalty functions $\rho_j(\cdot; \cdot)$ and a penalty level $\lambda$, a vector $\hat\beta \in \mathbb{R}^p$ is a critical point of the penalized loss (2.2) if the following local Karush-Kuhn-Tucker (KKT) condition is satisfied:
$$ x_j^T (y - X\hat\beta)/n = \dot\rho_j(\hat\beta_j; \lambda) \tag{2.12} $$
for a certain version of $\dot\rho_j(\hat\beta_j; \lambda)$ (between the left and right derivatives as in our convention) for every $j = 1, \ldots, p$. By property (v) of the penalty, (2.12) is well defined and $|\dot\rho_j(\hat\beta_j; \lambda)| \le \lambda$. When the penalized loss is convex in $\beta$, the local KKT condition (2.12) is necessary and sufficient for the global minimization of the penalized loss $L(\cdot; \lambda)$. In general, solutions of (2.12) include all local minimizers of $L(\cdot; \lambda)$.
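The local KKT condition (2.12) is easy to check numerically. The sketch below measures the largest violation of (2.12) for a candidate vector and applies it in the special case of an orthonormal design, $X^T X/n = I$, where the Lasso solution is soft thresholding of $z = X^T y/n$ and an MCP solution with $\gamma > 1$ is given by firm thresholding. The orthonormal design, the function names and the simulated data are assumptions made for illustration; the general theory in this section does not require them.

```python
import numpy as np

def lasso_derivative(t, lam):
    """Derivative of lam * |t| for t > 0."""
    return np.full_like(np.asarray(t, dtype=float), lam)

def mcp_derivative(t, lam, gamma=3.0):
    """Derivative of the MCP with respect to |t|."""
    return np.maximum(lam - np.abs(t) / gamma, 0.0)

def kkt_violation(X, y, beta, lam, derivative):
    """Largest violation of the local KKT condition (2.12) over j."""
    n = X.shape[0]
    grad = X.T @ (y - X @ beta) / n
    active = beta != 0
    # beta_j = 0: any value in [-lam, lam] is allowed on the right-hand side.
    viol = np.maximum(np.abs(grad) - lam, 0.0)
    # beta_j != 0: the penalty derivative carries the sign of beta_j.
    viol[active] = np.abs(
        grad[active] - np.sign(beta[active]) * derivative(beta[active], lam)
    )
    return viol.max()

def soft_threshold(z, lam):
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def firm_threshold(z, lam, gamma=3.0):
    """MCP solution under an orthonormal design, assuming gamma > 1."""
    a = np.abs(z)
    middle = np.sign(z) * (a - lam) / (1.0 - 1.0 / gamma)
    return np.where(a <= lam, 0.0, np.where(a <= gamma * lam, middle, z))

rng = np.random.default_rng(2)
n, p, lam = 50, 5, 0.5
Q, _ = np.linalg.qr(rng.standard_normal((n, p)))
X = np.sqrt(n) * Q                      # X^T X / n = I and ||x_j||_2^2 = n
beta_true = np.array([3.0, 1.0, 0.0, 0.0, 0.0])
y = X @ beta_true + 0.5 * rng.standard_normal(n)
z = X.T @ y / n
beta_lasso = soft_threshold(z, lam)
beta_mcp = firm_threshold(z, lam)
```

Calling `kkt_violation(X, y, beta_mcp, lam, mcp_derivative)` should return a value at the level of rounding error, while a perturbed vector such as `beta_mcp + 0.1` produces a clearly positive violation.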
For positive $\lambda_*$ and $\kappa_*$, consider the class of all penalty functions $\rho_j(\cdot; \lambda)$ with no smaller penalty level than $\lambda_*$ and no greater concavity than $\kappa_*$,
$$ \mathscr{P}(\lambda_*, \kappa_*) = \big\{ \rho_j(\cdot; \lambda) : \lambda \ge \lambda_*,\ \kappa(\rho_j, \lambda) \le \kappa_* \big\}. \tag{2.13} $$
Among all local solutions for all such penalties $\rho_j(\cdot; \lambda)$ in $\mathscr{P}(\lambda_*, \kappa_*)$, we shall focus on the subclass $\mathscr{B}_0(\lambda_*, \kappa_*)$ of those connected to the origin through a continuous path of such solutions. Formally, let
$$ \mathscr{B} = \mathscr{B}(\lambda_*, \kappa_*) = \big\{ \hat\beta : \text{(2.12) holds with some } \rho_j(\cdot; \lambda) \in \mathscr{P}(\lambda_*, \kappa_*) \big\}. $$
The class $\mathscr{B}_0(\lambda_*, \kappa_*)$ can be written as
$$ \mathscr{B}_0(\lambda_*, \kappa_*) = \big\{ \hat\beta : \hat\beta \text{ and } 0 \text{ are connected in } \mathscr{B}(\lambda_*, \kappa_*) \big\}. \tag{2.14} $$
As $\hat\beta = 0$ is the sparsest solution, $\mathscr{B}_0$ can be viewed as the sparse branch of the solution space $\mathscr{B}$.
By definition, $\mathscr{B}_0(\lambda_*, \kappa_*)$ is the set of all local solutions computable by path following algorithms starting from the origin, with the constraints $\lambda \ge \lambda_*$ and $\kappa(\rho_j, \lambda) \le \kappa_*$ on the local solutions connected to the origin. This holds regardless of the specific algorithm used to compute the solution, and different types of penalties can be used in a single solution path. For example, the Lasso estimator belongs to the class as it is connected to the origin through the LARS algorithm [58,59,20]. The SCAD and MCP solutions belong to the class if they are computed by the PLUS algorithm [84] or by a path following algorithm from the Lasso solution.
The following theorem studies the difference between solutions $\hat\beta \in \mathscr{B}_0(\lambda_*, \kappa_*)$ and an oracle coefficient vector $\beta^o$ satisfying $\mathrm{supp}(\beta^o) \subseteq S$ under the RE condition on the design matrix. The vector $\beta^o \in \mathbb{R}^p$ can be taken as the true regression coefficient vector $\beta^*$ so that Theorem 2.1 directly yields prediction and estimation error bounds under the RE condition. Alternatively, $\beta^o$ can be taken as the oracle LSE $\hat\beta^o$ given by
$$ \hat\beta^o_S = (X_S^T X_S)^{-1} X_S^T y, \qquad \hat\beta^o_{S^c} = 0, \tag{2.15} $$
with $S = \mathrm{supp}(\beta^*)$, so that Theorem 2.1 directly yields sufficient conditions for selection consistency and indirectly sharper prediction and estimation error bounds, still under the RE condition.
We consider here penalty levels no smaller than a certain $\lambda_*$ satisfying
$$\|X_{S^c}^T(y - X\beta^o)/n\|_\infty < \eta\lambda_*, \qquad \|X_S^T(y - X\beta^o)/n\|_\infty \le \eta\delta_*\lambda_*, \tag{2.16}$$
where $\eta < 1$ and $\delta_* \le 1$. When $\varepsilon = y - X\beta^* \sim N(0, \sigma^2 I_{n\times n})$ and
$$\lambda_* = (\sigma/\eta)\sqrt{(2/n)\log p},$$
(2.16) holds with probability at least $1 - \sqrt{2/(\pi\log p)}$, as the $x_j$ are normalized to $\|x_j\|_2 = \sqrt{n}$, provided that $\beta^o$ is either the true $\beta^*$ with $\delta_* = 1$ or the oracle LSE in (2.15) with $\delta_* = 0$. Smaller penalty levels will be considered in Section 2.3.
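coefficients vector $\beta^o$ via a random vector $\omega = \omega(\beta^o, \lambda)$ with elements
$$\omega_j = \dot\rho_j(\beta^o_j;\lambda)/\lambda - x_j^T(y - X\beta^o)/(\lambda n). \tag{2.17}$$
The relevance of $\omega$ can be clearly seen from the definition of $\hat\beta$ in (2.12) as
$$x_j^TX(\beta^o - \hat\beta)/n = \lambda\omega_j + \dot\rho_j(\hat\beta_j;\lambda) - \dot\rho_j(\beta^o_j;\lambda). \tag{2.18}$$
We may choose $\omega$ to satisfy $\mathrm{supp}(\omega) \subseteq S$ in our convention as $\dot\rho_j(\beta^o_j;\lambda)$ is allowed to take any value in $[-\lambda,\lambda]$ for $\beta^o_j = 0$. However, this choice is not used in our analysis. Let $\phi_{\min}(M)$ denote the minimum eigenvalue of a symmetric matrix $M$.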
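Theorem 2.1. Let $\hat\beta$ be a solution of (2.12) in $\mathscr{B}_0(\lambda_*,\kappa_*)$ with penalties $\rho_j(\cdot;\lambda)$ in the class $\mathscr{P}(\lambda_*,\kappa_*)$. Suppose $\mathrm{RE}_2^2(S;\eta,\delta_*) \ge \kappa_*$ and (2.16) holds for certain $\beta^o \in \mathbb{R}^p$ and $S \supseteq \mathrm{supp}(\beta^o)$. Let $\omega$ be as in (2.17).
(i) With $\xi = (1+\delta_*\eta)/(1-\eta)$,
$$\|X\hat\beta - X\beta^o\|_2^2/n \le \frac{(1+\eta)^2\lambda^2|S|}{\mathrm{RCIF}_{\mathrm{pred}}(S;\eta,\omega)} \le \frac{(1+\xi)^2(1+\eta)^2\lambda^2|S|}{\mathrm{RE}_1^2(S;\eta,\delta_*)}, \tag{2.19}$$
$$\|\hat\beta - \beta^o\|_q \le \begin{cases} \dfrac{(1+\eta)\lambda|S|}{\mathrm{RCIF}_{\mathrm{est},1}(S;\eta,\omega)} \le \dfrac{(1+\xi)^2(1+\eta)\lambda|S|}{\mathrm{RE}_1^2(S;\eta,\delta_*)}, & q = 1, \\[2ex] \dfrac{(1+\eta)\lambda|S|^{1/2}}{\mathrm{RCIF}_{\mathrm{est},2}(S;\eta,\omega)} \le \dfrac{(1+\xi)(1+\eta)\lambda|S|^{1/2}}{\mathrm{RE}_1(S;\eta,\delta_*)\mathrm{RE}_2(S;\eta,\delta_*)}, & q = 2, \\[2ex] \dfrac{(1+\eta)\lambda|S|^{1/q}}{\mathrm{RCIF}_{\mathrm{est},q}(S;\eta,\omega)}, & q \ge 1. \end{cases} \tag{2.20}$$
(ii) Suppose $\max_{j\le p}\kappa(\beta^o_j;\rho_j,\lambda) \le (1 - 1/C_0)\mathrm{RE}_2^2(S;\eta,\delta_*)$. Then,
$$\|X\hat\beta - X\beta^o\|_2^2/n \le (C_0\lambda)^2 \sup_{u\neq 0} \frac{\big[\omega_S^Tu_S - (1-\eta)\|u_{S^c}\|_1\big]_+^2}{u^T\Sigma u} \tag{2.21}$$
and for any seminorm $\|\cdot\|$ as a loss function
$$\|\hat\beta - \beta^o\| \le C_0\lambda \sup_{u\neq 0} \frac{\|u\|\big[\omega_S^Tu_S - (1-\eta)\|u_{S^c}\|_1\big]}{u^T\Sigma u}. \tag{2.22}$$
(iii) Suppose $\beta^o$ is a solution of (2.12) or equivalently $\omega_S = 0$. Then,
$$\hat\beta_{S^c} = 0 \quad\text{and}\quad \mathrm{sgn}(\hat\beta_j)\mathrm{sgn}(\beta^o_j) \ge 0 \ \ \forall\, j \in S. \tag{2.23}$$
If $\kappa(0;\rho_j,\lambda) < \phi_{\min}(X_S^TX_S/n)$, then
$$\mathrm{sgn}(\hat\beta) = \mathrm{sgn}(\beta^o). \tag{2.24}$$
If $\max_{j\in S}\kappa(\beta^o_j;\rho_j,\lambda) < \phi_{\min}(X_S^TX_S/n)$, then
$$\hat\beta = \beta^o. \tag{2.25}$$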
Remark 2.1. In the above theorem, one may also use a relaxed version of $\mathrm{RE}_1$ and $\mathrm{RE}_2$ with the constraint replaced by $(1-\eta)\|u_{S^c}\|_1 \le -\omega_S^Tu_S$.
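Corollary 2.1. Suppose $\dot\rho_j(t;\lambda) = 0$ for $|t| > \lambda\gamma$ and the conditions of Theorem 2.1 (ii) hold with $\beta^o = \hat\beta^o$ being the oracle estimator in (2.15). Then, (2.21) and (2.22) hold with $\|\omega\|_2^2 \le (1+\delta_*\eta)^2|S_1|$ where $S_1 = \{j \in S : |\hat\beta^o_j| \le \lambda\gamma\}$. Consequently, when $C_0^2/\mathrm{RE}_2^2(S;\eta,\delta_*) = O_P(1)$ and $\lambda \lesssim \sigma\sqrt{(\log p)/n}$,
$$\|X\hat\beta - X\hat\beta^o\|_2^2/n + \|\hat\beta - \hat\beta^o\|_2^2 = O_P(\sigma^2/n)\,|S_1|\log p,$$
implying $\hat\beta = \hat\beta^o$ when $|S_1| = 0$, and
$$\|X\hat\beta - X\beta^*\|_2^2/n + \|\hat\beta - \beta^*\|_2^2 = O_P(\sigma^2/n)\big(|S_1|\log p + |S|\big).$$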
Theorem 2.1 gives a unified treatment of penalized least squares methods, including the $\ell_1$ and concave penalties, under the RE condition on the design matrix and natural
and (2.20) match those of the state of the art for the Lasso in both the convergence rate and the regularity condition on the design, while (2.21), (2.22) and Corollary 2.1 demonstrate the advantages of concave penalization when $|S_1|$ is of smaller order than $|S|$. Moreover, the prediction and estimation error bounds in Theorem 2.1 (ii) and Corollary 2.1 directly and naturally provide selection consistency when $\omega_S = 0$ or $|S_1| = 0$. More precisely, for selection consistency, Theorem 2.1 (iii) requires only the RE condition for (2.23) and mild additional eigenvalue conditions for (2.24) and (2.25), provided the existence of an oracular solution $\beta^o$ with $\mathrm{supp}(\beta^o) \subseteq S$, or equivalently $\omega_S = 0$. Note that $\kappa(0;\rho_j,\lambda) \le \kappa_*$ and $\mathrm{RE}_2^2(S;0) \le \phi_{\min}(X_S^TX_S/n)$ by definition. For concave penalties, the condition $\omega_S = 0$ can be fulfilled by the rate-optimal signal strength condition $\min_{j\in S}|\hat\beta^o_j| > \gamma\lambda$ as in Corollary 2.1. However, the condition $\omega_S = 0$ for the Lasso requires more restrictive $\ell_\infty$-type conditions such as the irrepresentable condition on the design. These RE-based results are new and significant as the existing theory for concave penalization, which requires substantially stronger conditions on the design such as the sparse Riesz condition in [84], leaves a false impression that the Lasso has a technical advantage in prediction and parameter estimation under the RE condition on the design. Moreover, compared with existing analysis, the proof of Theorem 2.1 is much simpler.
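For the Lasso, $\kappa_* = 0 \le \mathrm{RE}_2(S;\eta,\delta_*)$ always holds and $C_0 = 1$ in Theorem 2.1 (ii), which implies the following corollary due to $\|\omega_S\|_\infty \le 1 + \eta$.
Corollary 2.2. Let $\hat\beta$ be the Lasso estimator. If (2.16) holds for a coefficient vector $\beta^o \in \mathbb{R}^p$ with $S \supseteq \mathrm{supp}(\beta^o)$, then
$$\frac{\|X\hat\beta - X\beta^o\|_2^2}{n(1+\eta)^2\lambda^2} \le \sup_{u\neq 0}\frac{\psi^2(u)}{u^T\Sigma u} \le \max\left\{\frac{|S|(1-1/\xi)^2}{\mathrm{RE}_1^2(S;\eta,\delta_*)},\ \frac{|S|}{\mathrm{RE}_1^2(S;0)}\right\}$$
with $\psi(u) = \big[\|u_S\|_1 - \|u_{S^c}\|_1/\xi\big]_+$ and $\xi = (1+\eta)/(1-\eta)$, and
$$\frac{\|\hat\beta - \beta^o\|_2}{(1+\eta)\lambda} \le \sup_{u\neq 0}\frac{\|u\|_2\,\psi(u)}{u^T\Sigma u} \le \max\left\{\frac{|S|^{1/2}(1-1/\xi)}{\mathrm{RE}_{1,2}^2(S;\eta,\delta_*)},\ \frac{|S|^{1/2}}{\mathrm{RE}_{1,2}^2(S;0)}\right\},$$
where $\mathrm{RE}_{1,2}(S;\eta,\delta_*) = \{\mathrm{RE}_2(S;\eta,\delta_*)\mathrm{RE}_1(S;\eta,\delta_*)\}^{1/2} \ge \mathrm{RE}_2^2(S;\eta,\delta_*)$.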
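The bounds in Corollary 2.2 are the sharpest possible based on the basic inequality $u^T\Sigma u \le \psi(u)$ for $u = (\hat\beta - \beta^o)/\lambda$. For example, the prediction bound is strictly sharper than the familiar prediction error bound in [77],
$$\|X\hat\beta - X\beta^o\|_2^2/n \le (1+\eta)^2\lambda^2|S|/\mathrm{RE}_1^2(S;\eta,\delta_*),$$
when $\mathrm{RE}_1^2(S;\eta,\delta_*) < \mathrm{RE}_1^2(S;0)$.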
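To prove Theorem 2.1, we first present the following lemma.
Lemma 2.1. Let $S \subset \{1,\ldots,p\}$, $\lambda > 0$, $\hat\beta$ be a solution of (2.12), and $\beta^o$ a coefficient vector satisfying $\mathrm{supp}(\beta^o) \subseteq S$. Let $h = \hat\beta - \beta^o$, $\omega$ be as in (2.17) and $z = X^T(y - X\beta^o)/n$. Then,
$$h^T\Sigma h = \sum_j h_j\big\{\dot\rho_j(\beta^o_j;\lambda) - \dot\rho_j(\hat\beta_j;\lambda) - \lambda\omega_j\big\} \le \sum_{j\in S^c}\big(z_jh_j - \lambda|h_j|\big) - \lambda\omega_S^Th_S + \sum_j \kappa(\beta^o_j;\rho_j,\lambda)h_j^2 \tag{2.26}$$
and $|\omega_j| \le 1 + |z_j|/\lambda$ for $j \in S$, where $\kappa(\beta^o_j;\rho_j,\lambda)$ is as in (2.3).
Proof of Lemma 2.1. Recall that $h = \hat\beta - \beta^o$ and $z = X^T(y - X\beta^o)/n$ with a $\beta^o$ satisfying $\mathrm{supp}(\beta^o) \subseteq S$. For $j \in S^c$, $h_j = \hat\beta_j$, so that by (2.17)
$$\begin{aligned} h_j\big\{\dot\rho_j(\beta^o_j;\lambda) - \dot\rho_j(\hat\beta_j;\lambda) - \lambda\omega_j\big\} &= \hat\beta_j\big\{z_j - \dot\rho_j(\hat\beta_j;\lambda)\big\} \le \hat\beta_jz_j - |\hat\beta_j|\,\dot\rho_j(|\hat\beta_j|;\lambda) \\ &\le \hat\beta_jz_j - \lambda|\hat\beta_j| + (|\hat\beta_j| - 0)\big\{\dot\rho_j(0+;\lambda) - \dot\rho_j(|\hat\beta_j|;\lambda)\big\} \\ &\le z_jh_j - \lambda|h_j| + \kappa(0;\rho_j,\lambda)h_j^2. \end{aligned} \tag{2.27}$$
For $j \in S$,
$$h_j\big\{\dot\rho_j(\beta^o_j;\lambda) - \dot\rho_j(\hat\beta_j;\lambda) - \lambda\omega_j\big\} \le -\lambda\omega_jh_j + \kappa(\beta^o_j;\rho_j,\lambda)h_j^2. \tag{2.28}$$
Summing the above inequalities over $j$, we find via (2.18) that (2.26) holds. Moreover, by the definition of $\omega$ in (2.17), $|\omega_j| \le 1 + |z_j|/\lambda$ for $j \in S$. $\square$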
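Proposition 2.2. Let $S \supseteq \mathrm{supp}(\beta^o)$, $\{\eta, \delta_*\}$ be as in (2.16),
$$\mathscr{C}(S;\eta,\delta_*) = \big\{u : (1-\eta)\|u_{S^c}\|_1 \le (1+\delta_*\eta)\|u_S\|_1\big\},$$
and $\mathscr{B}_0^*(\lambda_*,\kappa_*) = \mathscr{B}(\lambda_*,\kappa_*) \cap \{\beta^o + \mathscr{C}(S;\eta,\delta_*)\}$ be the set of all solutions $\hat\beta$ of (2.12) with penalties in $\mathscr{P}(\lambda_*,\kappa_*)$ and estimation error $\hat\beta - \beta^o$ in the cone $\mathscr{C}(S;\eta,\delta_*)$. Let $\hat\beta \in \mathscr{B}_0^*(\lambda_*,\kappa_*)$ with penalty level $\lambda$ and $\tilde\beta \in \mathscr{B}(\lambda_*,\kappa_*)$ with penalty level $\tilde\lambda$. Suppose $\mathrm{RE}_2^2(S;\eta,\delta_*) \ge \kappa_*$ and (2.16) holds. Let $\epsilon_1 = (\eta - \|z_{S^c}\|_\infty/\lambda_*)/2$, $\epsilon_2 = \epsilon_1/(2\kappa_*)$ and $\epsilon_0 = \min\{\epsilon_2,\ \epsilon_1\epsilon_2/(1+\eta)\}$. Then,
$$\big\|(\tilde\beta - \beta^o)/\tilde\lambda - (\hat\beta - \beta^o)/\lambda\big\|_1 \le \epsilon_0 \ \Rightarrow\ \tilde\beta \in \mathscr{B}_0^*(\lambda_*,\kappa_*).$$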
Proposition 2.2 asserts that among general solutions $\hat\beta$ of (2.12) in $\mathscr{B}(\lambda_*,\kappa_*)$, those with the normalized error $(\hat\beta - \beta^o)/\lambda$ inside the cone $\mathscr{C}(S;\eta,\delta_*)$ and those outside the cone are separated by $\epsilon_0$ in the $\ell_1$ distance of the normalized error. Thus, if $\hat\beta^{(t)}$ is a sequence of such solutions with penalty levels $\lambda^{(t)}$ such that the normalized errors $u^{(t)} = (\hat\beta^{(t)} - \beta^o)/\lambda^{(t)}$ have small $\ell_1$ increments, $\|u^{(t)} - u^{(t-1)}\|_1 \le \epsilon_0$, then the $u^{(t)}$ are either all in the cone $\mathscr{C}(S;\eta,\delta_*)$ or all outside the cone. In particular, Proposition 2.2 implies that the solutions $\hat\beta \in \mathscr{B}_0(\lambda_*,\kappa_*)$ have the cone property $\hat\beta - \beta^o \in \mathscr{C}(S;\eta,\delta_*)$, or equivalently $\mathscr{B}_0(\lambda_*,\kappa_*) \subseteq \beta^o + \mathscr{C}(S;\eta,\delta_*)$, as $\lambda \oplus \hat\beta$ is connected to $\lambda^{(0)} \oplus 0$ through a continuous path and the origin $0$ has the cone property.
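Proof of Proposition 2.2. Let $u = (\hat\beta - \beta^o)/\lambda$ and $v = (\tilde\beta - \beta^o)/\tilde\lambda$. We want to prove that
$$\|u - v\|_1 \le \epsilon_0 \ \text{ and } \ u \in \mathscr{C}(S;\eta,\delta_*) \ \text{ imply } \ v \in \mathscr{C}(S;\eta,\delta_*). \tag{2.29}$$
By the definition of $\epsilon_1$ and condition (2.16), we have $\epsilon_1 > 0$. As $\kappa(\beta^o_j;\rho_j,\lambda) \le \kappa_*$ and $\|z_{S^c}\|_\infty/\lambda \le \|z_{S^c}\|_\infty/\lambda_* \le \eta - 2\epsilon_1$, Lemma 2.1 implies that
$$\begin{aligned} u^T\Sigma u + (1 - \eta + 2\epsilon_1)\|u_{S^c}\|_1 &\le -\omega_S^Tu_S + \max_j\big\{\kappa(\beta^o_j;\rho_j,\lambda)\big\}\|u\|_2^2 \\ &\le (1 + \delta_*\eta)\|u_S\|_1 + \kappa_*\|u\|_2^2 \end{aligned} \tag{2.30}$$
and that the same inequalities also hold for $v$. Recall that $\epsilon_2 = \epsilon_1/(2\kappa_*)$. If $\|u\|_1 \le \epsilon_2$ and $\|v - u\|_1 \le \epsilon_2$, then $\|v\|_1 \le \epsilon_1/\kappa_*$, so that the $v$-version of (2.30) implies
$$(1 - \eta + 2\epsilon_1)\|v_{S^c}\|_1 \le (1 + \delta_*\eta)\|v_S\|_1 + \kappa_*\|v\|_1^2 \le (1 + \delta_*\eta)\|v_S\|_1 + \epsilon_1\|v\|_1,$$
or equivalently
$$(1 - \eta + \epsilon_1)\|v_{S^c}\|_1 \le (1 + \delta_*\eta + \epsilon_1)\|v_S\|_1,$$
which then implies $v \in \mathscr{C}(S;\eta,\delta_*)$. Because $u \in \mathscr{C}(S;\eta,\delta_*)$, we have $\kappa_*\|u\|_2^2 \le u^T\Sigma u$ by the RE condition, so that (2.30) implies
$$(1 + 2\epsilon_1 - \eta)\|u_{S^c}\|_1 \le (1 + \delta_*\eta)\|u_S\|_1.$$
Due to $(1 + \epsilon_1 - \eta)/(1 - \epsilon_1 + \delta_*\eta) \le (1 + 2\epsilon_1 - \eta)/(1 + \delta_*\eta)$, we have
$$(1 + \epsilon_1 - \eta)\|u_{S^c}\|_1 \le (1 - \epsilon_1 + \delta_*\eta)\|u_S\|_1.$$
If $\|u\|_1 > \epsilon_2$ and $\|v - u\|_1 \le \epsilon_1\epsilon_2/(1 + \delta_*\eta)$, then $v \in \mathscr{C}(S;\eta,\delta_*)$ follows from
$$\begin{aligned} (1-\eta)\|v_{S^c}\|_1 - (1+\delta_*\eta)\|v_S\|_1 &\le (1-\eta)\|u_{S^c}\|_1 - (1+\delta_*\eta)\|u_S\|_1 + (1+\delta_*\eta)\|v - u\|_1 \\ &\le (1-\eta)\|u_{S^c}\|_1 - (1+\delta_*\eta)\|u_S\|_1 + \epsilon_1\|u\|_1 \le 0. \end{aligned}$$
Thus, (2.29) holds with $\epsilon_0 = \min\{\epsilon_2,\ \epsilon_1\epsilon_2/(1+\eta)\}$. $\square$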
Proof of Theorem 2.1. Let $h = \hat\beta - \beta^o$ and $u = h/\lambda$. It follows from Proposition 2.2 that $u \in \mathscr{C}(S;\eta,\delta_*)$ as $\lambda \oplus \hat\beta$ is connected to $\lambda^{(0)} \oplus 0$ through a continuous path and the
$(1 - 1/C_0)\mathrm{RE}(S,\eta)$. As $u \in \mathscr{C}(S;\eta,\delta_*)$, $u^T\Sigma u \ge \mathrm{RE}(S,\eta)\|u\|_2^2$, so that by (2.30)
$$C_0^{-1}u^T\Sigma u + (1-\eta)\|u_{S^c}\|_1 \le -\omega_S^Tu_S \le \|\omega_S\|_2\|u\|_2. \tag{2.31}$$
This immediately implies (2.21) and (2.22) with $u = (\hat\beta - \beta^o)/\lambda$. For (2.19) and (2.20), we set $C_0 = \infty$. However, by the definition of RCIF,
$$\mathrm{RCIF}_{\mathrm{pred}}(S;\eta,\omega)\,u^T\Sigma u \le \|\Sigma u\|_\infty^2|S|.$$
Consequently, the first inequality in (2.19) follows from the fact that
$$\|\Sigma u\|_\infty \le \|X^T(y - X\hat\beta)/n\|_\infty/\lambda + \|X^T(y - X\beta^o)/n\|_\infty/\lambda \le 1 + \eta,$$
and the second inequality follows from the first inequality in (2.11). Similarly, the first inequality in (2.20) follows from
$$\mathrm{RCIF}_{\mathrm{est},q}(S;\eta,\omega)\|u\|_q \le \|\Sigma u\|_\infty|S|^{1/q} \le (1+\eta)|S|^{1/q},$$
and the second follows from the second and third inequalities in (2.11).
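Finally we consider selection consistency under the assumption $\omega_S = 0$. In this case, $\beta^o$ is a solution of (2.12), and $\|h_{S^c}\|_1 = 0$ by (2.31). Moreover, because both $\beta^o$ and $\hat\beta$ are solutions of (2.12) with support in $S$,
$$\kappa_*\|h_S\|_2^2 \le h_S^T\Sigma_Sh_S = -\sum_{j\in S}h_j\big\{\dot\rho_j(\hat\beta_j;\lambda) - \dot\rho_j(\beta^o_j;\lambda)\big\} \le \sum_{j\in S}\kappa(\beta^o_j;\rho_j,\lambda)h_j^2.$$
As $\kappa(\beta^o_j;\rho_j,\lambda) \le \kappa_*$, the maximum concavity is attained above at every $j \in S$ in the sense of $-\dot\rho_j(\hat\beta_j;\lambda) + \dot\rho_j(\beta^o_j;\lambda) = \kappa_*(\hat\beta_j - \beta^o_j)$ for all $j \in S$ with $h_j \neq 0$. This is possible only when $\mathrm{sgn}(\hat\beta_j)\mathrm{sgn}(\beta^o_j) \ge 0$ for all $j \in S$. Furthermore, $\mathrm{sgn}(\hat\beta_j)\mathrm{sgn}(\beta^o_j) > 0$ for all $j \in S$ when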
2.3 Smaller penalty levels
We have studied in Section 2.2 exact solutions of (2.12) for penalty levels $\lambda \ge \lambda_*$ in the event where $\lambda_*$ is a strict upper bound of the supremum norm of the random vector $z = X^T(y - X\beta^o)/n$ as in (2.16). Such penalty or threshold levels are commonly used in the literature to study regularized methods in high-dimensional regression. However, this choice is quite conservative and often yields poor numerical results. In this section, we consider smaller penalty levels under somewhat stronger RE conditions on the design.
2.3.1 Smaller penalty levels
We consider penalty levels $\lambda$ which control a sparse $\ell_2$ norm of a truncated $z = X^T(y - X\beta^o)/n$, instead of the larger $\ell_\infty$ norm of $z$. For $q \in [1,\infty]$ and $t > 0$, the sparse $\ell_q$ norm is defined as
$$\|v\|_{(q,t)} = \max_{J\subset\{1,\ldots,p\},\,|J|<t+1}\|v_J\|_q.$$
To control the effect of the noise, we consider penalty levels $\lambda \ge \lambda_*$ with a minimum penalty level $\lambda_*$ such that
$$\big\|(|z| - \eta_0\lambda_*)_+\big\|_{(2,m)} = \sup_{|J|=m}\sqrt{\sum_{j\in J}\big(|z_j| - \eta_0\lambda_*\big)_+^2} < \eta_1m^{1/2}\lambda_* \tag{2.32}$$
happens with high probability for certain positive numbers $\eta_0$ and $\eta_1$ satisfying $\eta_0 + \eta_1 < 1$ and a positive integer $m$. It is clear that (2.16) implies (2.32) with $\eta = \eta_0 + \eta_1$ and $m = 1$.
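As a small illustration of the definitions above, the sparse $\ell_q$ norm reduces to the $\ell_q$ norm of the largest entries in absolute value, so condition (2.32) can be checked directly for a given vector $z$. The sketch below is only illustrative: the function names `sparse_norm` and `satisfies_2_32` and the numerical inputs are hypothetical.

```python
import math

def sparse_norm(v, q, t):
    """Sparse l_q norm: the largest l_q norm of v over index sets J with |J| < t + 1."""
    size = min(len(v), math.ceil(t))  # largest integer cardinality strictly below t + 1
    top = sorted((abs(x) for x in v), reverse=True)[:size]
    if not top:
        return 0.0
    if q == math.inf:
        return top[0]
    return sum(x ** q for x in top) ** (1.0 / q)

def satisfies_2_32(z, lam_star, eta0, eta1, m):
    """Check the inequality in (2.32) for a given vector z and penalty level lam_star."""
    excess = [max(abs(x) - eta0 * lam_star, 0.0) for x in z]
    return sparse_norm(excess, 2, m) < eta1 * math.sqrt(m) * lam_star

# Hypothetical inputs, chosen only to exercise the functions.
z_example = [0.5, -0.2, 0.1]
ok_small = satisfies_2_32(z_example, lam_star=1.0, eta0=0.4, eta1=0.3, m=1)
ok_large = satisfies_2_32([2.0], lam_star=1.0, eta0=0.4, eta1=0.3, m=1)
```

With $m = 1$ the check only involves the largest $|z_j|$, which is the sense in which (2.16) implies (2.32).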
As properties of the Lasso have been considered in [70] under penalty levels $\lambda > \lambda_*$ with the smaller $\lambda_*$ in (2.32), the results in this subsection can be viewed as an extension of their results to general solutions of (2.12) in the set $\mathscr{B}_0(\lambda_*,\kappa_*)$ in (2.14).
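With $\eta = \eta_0 + \eta_1$ and $z = X^T(y - X\beta^o)/n$, define
$$\tilde S = \big\{j : |z_j| \ge \eta\lambda_*\big\} \tag{2.33}$$
be the set of indices of large $|z_j|$. Main consequences of (2.32) are
$$|\tilde S| < m, \qquad \sum_{j\in\tilde S}\big(|z_j| - \eta_0\lambda_*\big)_+^2 < m(\eta_1\lambda_*)^2, \qquad \|z_{\tilde S^c}\|_\infty < \eta\lambda_*. \tag{2.34}$$
These properties can be used to prove
$$\|X\hat\beta - X\beta^o\|_2^2/n \lesssim \big(\|\omega_S\|_2^2 + m\big)\lambda^2, \tag{2.35}$$
with $\|\omega_S\|_2^2 \lesssim |S|$ in the worst-case scenario, and parallel estimation error bounds under a certain RE-type condition. See [70] and Subsection 3.3.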
Consider Gaussian noise $\varepsilon = y - X\beta^* \sim N(0, \sigma^2I_{n\times n})$. Let $L_1(t) = \Phi^{-1}(1 - t)$ be the standard normal negative quantile function. Sun and Zhang [70] proved that when $\beta^o$ is the true coefficient vector, $\beta^o = \beta^*$, (2.32) holds with probability at least $1 - \epsilon$ under the conditions
$$\eta_0\lambda_* = (\sigma/n^{1/2})L_1(k/p), \tag{2.36}$$
$$\frac{\eta_1}{\eta_0} > \left(\frac{4k/m}{L_1^4(k/p) + 2L_1^2(k/p)}\right)^{1/2} + \frac{L_1(\epsilon/p)}{L_1(k/p)}\left(\frac{\kappa_+(m)}{m}\right)^{1/2},$$
where $\kappa_+(m) = \max\{u^T\Sigma u : \|u\|_0 = m, \|u\|_2 = 1\}$ is the upper sparse eigenvalue of $\Sigma$. A conservative choice of $k$ is to take
$$k = L_1^4(k/p) + 2L_1^2(k/p) \tag{2.37}$$
as in [70], giving $m = O(1)$ in prediction and estimation error bounds. However, by (2.35), larger $k$ can be taken without changing the order of error bounds as long as $m \lesssim \|\omega_S\|_2^2$.
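Equation (2.37) defines $k$ only implicitly, so it has to be solved numerically. The sketch below does this by bisection, under the assumption that $L_1(t)$ is the upper quantile $\Phi^{-1}(1-t)$ of the standard normal distribution. The function names, the bracketing interval and the value $p = 1000$ are illustrative choices, not taken from [70].

```python
import math
from statistics import NormalDist

def L1(t):
    """Negative quantile function, taken here as Phi^{-1}(1 - t) (an assumption)."""
    return NormalDist().inv_cdf(1.0 - t)

def conservative_k(p, tol=1e-10):
    """Solve k = L1(k/p)^4 + 2 * L1(k/p)^2 for k by bisection on (0, p/2)."""
    def gap(k):
        q = L1(k / p)
        return k - (q ** 4 + 2.0 * q ** 2)

    # gap is increasing in k, negative near 0 and equal to p/2 > 0 at k = p/2.
    lo, hi = 1e-9, p / 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if gap(mid) < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

p = 1000
k = conservative_k(p)
# Ratio of the smaller threshold L1(k/p) to the universal threshold sqrt(2 log p).
ratio = L1(k / p) / math.sqrt(2.0 * math.log(p))
```

The ratio compares the normalized penalty level implied by (2.36) with the universal level $\sqrt{2\log p}$. A value below one reflects the sense in which the penalty level of this section is smaller.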
2.3.2 RE-type conditions for smaller penalty levels
When a smaller penalty level is taken, a lower level of regularization is imposed on the estimator $\hat\beta$, so that the estimation error $h = \hat\beta - \beta^o$ may fail the condition $(1-\eta)\|h_{S^c}\| \le$
(2.32), we can still prove the membership of the error $h$ in the following larger cone,
$$\mathscr{U}(S,\eta_0,\eta_1,m) = \Big\{u : (1-\eta)\|u_{S^c}\|_1 \le (1+\eta)\|u_S\|_1 + \eta_1\big(m^{1/2}\|u_{\tilde S}\|_2 - \|u_{\tilde S}\|_1\big)\Big\}$$
with $\eta = \eta_0 + \eta_1 < 1$ and the set $\tilde S$ in (2.33). This will be verified in the proof of Theorem 2.2 but can also be vaguely seen from (2.34). Consequently, the restricted eigenvalue is defined in the larger cone as
$$\overline{\mathrm{RE}}_2(S;\eta_0,\eta_1,m) = \inf\left\{\frac{(u^T\Sigma u)^{1/2}}{\|u\|_2} : 0 \neq u \in \mathscr{U}(S,\eta_0,\eta_1,m)\right\}. \tag{2.38}$$
When $m = 1$, $\tilde S = \emptyset$ and the restricted eigenvalue (2.38) coincides with the original RE as defined in (2.7). Although (2.38) is a random variable due to its dependence on $\tilde S$ (even for deterministic designs), it is no smaller than
$$\overline{\mathrm{RE}}_{*,2}(S;\eta,m) = \min_{|T\setminus S|<m}\inf\left\{\frac{(u^T\Sigma u)^{1/2}}{\|u\|_2} : \|u_{T^c}\|_1 < \xi|T|^{1/2}\|u_T\|_2\right\} \tag{2.39}$$
due to $|\tilde S| < m$ in (2.34), where $\xi = (1+\eta)/(1-\eta)$.
Similarly, we extend the relaxed cone invertibility factor (RCIF) as
$$\overline{\mathrm{RCIF}}_{\mathrm{pred}}(S;\eta_0,\eta_1,m) = \inf\left\{\frac{\|\Sigma u\|_c^2\,s^*}{u^T\Sigma u} : u \in \mathscr{U}(S;\eta_0,\eta_1,m)\right\},$$
$$\overline{\mathrm{RCIF}}_{\mathrm{est},q}(S;\eta_0,\eta_1,m) = \inf\left\{\frac{\|\Sigma u\|_c\,(s^*)^{1/q}}{\|u\|_q} : u \in \mathscr{U}(S,\eta_0,\eta_1,m)\right\}, \tag{2.40}$$
where $s^* = \max\{|S|, |\tilde S|\}$ represents a potentially lower level of sparsity due to possible selection of variables outside $S$ and $\|\cdot\|_c$ is a combination of the $\ell_2$ norm on $\tilde S$ and the $\ell_\infty$ norm on $\tilde S^c$ defined as
$$\|v\|_c = \max\big\{\|v_{\tilde S}\|_2/m^{1/2},\ \|v_{\tilde S^c}\|_\infty\big\}.$$
These are the new RCIFs for prediction and estimation, respectively. When $m = 1$, the combination norm coincides with the $\ell_\infty$ norm and the modified RCIFs coincide with those in (2.7) and (2.10) respectively.
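The combination norm is simple to evaluate once the index set of large $|z_j|$ is known. A minimal sketch follows, with the hypothetical name `combination_norm` and the index set passed as a Python set of zero-based positions:

```python
import math

def combination_norm(v, S_tilde, m):
    """max of ||v restricted to S_tilde||_2 / sqrt(m) and ||v restricted to the complement||_inf."""
    inside = math.sqrt(sum(v[j] ** 2 for j in S_tilde)) / math.sqrt(m)
    outside = max((abs(x) for j, x in enumerate(v) if j not in S_tilde), default=0.0)
    return max(inside, outside)

# With an empty index set and m = 1, the value is the plain sup norm.
sup_case = combination_norm([3.0, -4.0, 1.0], set(), 1)
mixed_case = combination_norm([3.0, -4.0, 1.0], {0, 1}, 4)
```

The first call illustrates the remark that the combination norm coincides with the $\ell_\infty$ norm when $m = 1$.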
2.3.3 Prediction and estimation error bounds at smaller penalty levels
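Theorem 2.2. Let $\hat\beta$ be a solution of (2.12) in $\mathscr{B}_0(\lambda_*,\kappa_*)$ with penalties $\rho_j(\cdot;\lambda) \in \mathscr{P}(\lambda_*,\kappa_*)$. Let $\eta = \eta_0 + \eta_1 < 1$ with positive $\eta_0$ and $\eta_1$, $m$ be a positive integer, $\tilde S$ as in (2.33) with a certain $\beta^o \in \mathbb{R}^p$, $S \supseteq \mathrm{supp}(\beta^o)$, and $s^* = \max\{|S|, |\tilde S|\}$. Suppose $\overline{\mathrm{RE}}_2^2(S;\eta_0,\eta_1,m) \ge \kappa_*$ and (2.32) holds. Then,
$$\|X\hat\beta - X\beta^*\|_2^2/n \le \frac{\{(1+\eta)\lambda\}^2s^*}{\overline{\mathrm{RCIF}}_{\mathrm{pred}}(S;\eta_0,\eta_1,m)} \le \frac{\{(1+\eta)\xi_1\lambda\}^2s^*}{\overline{\mathrm{RE}}_2^2(S;\eta_0,\eta_1,m)} \tag{2.41}$$
with $\xi_1 = \big[2(|S|/s^*)^{1/2} + (1-\eta_0)(m/s^*)^{1/2}\big]/(1-\eta)$, and
$$\|\hat\beta - \beta^*\|_q \le \begin{cases} \dfrac{(1+\eta)\lambda(s^*)^{1/2}}{\overline{\mathrm{RCIF}}_{\mathrm{est},2}(S;\eta_0,\eta_1,m)} \le \dfrac{(1+\eta)\xi_1\lambda(s^*)^{1/2}}{\overline{\mathrm{RE}}_2^2(S;\eta_0,\eta_1,m)}, & q = 2, \\[2ex] \dfrac{(1+\eta)\lambda(s^*)^{1/q}}{\overline{\mathrm{RCIF}}_{\mathrm{est},q}(S;\eta_0,\eta_1,m)}, & \forall\, q \in [1,2]. \end{cases} \tag{2.42}$$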
Remark 2.2. When $m \asymp s$, $k$ is of the same order, $k \asymp m$. The penalty level $\lambda_*$ in (2.37) is of the order $\sqrt{(2/n)\log(p/k)}$. Theorem 2.2 guarantees that the prediction and $\ell_2$ estimation errors are of the order
$$\|X\hat\beta - X\beta^*\|_2^2/n + \|\hat\beta - \beta^*\|_2^2 \asymp (m/n)\log(p/m) \asymp (s/n)\log(p/s).$$
This matches the minimax prediction and $\ell_2$ estimation rate of the Slope in Bellec et al. [4].
As an extension of Theorem 2.1 (i), Theorem 2.2 provides prediction and estimation error bounds in the same form for smaller penalty levels with somewhat smaller RCIF and RE. However, the approach does not provide a full extension of Theorem 2.1 in several aspects. Due to the use of the $\ell_2$ norm in condition (2.32), the $\ell_q$ estimation error bound can be extended only for $1 \le q \le 2$, and the compatibility coefficient cannot be used to bound the prediction and $\ell_1$ errors. In addition, solutions of (2.12) are not selection consistent at
We have considered so far solutions of (2.12) in the main branch of the solution space $\mathscr{B}_0(\lambda_*,\kappa_*)$ in (2.14). Such solutions are computable by path finding algorithms. In fact, as discussed below Proposition 2.2, our analysis is also applicable if $(\hat\beta - \beta^o)/\lambda$ is connected to a cone through a discrete sequence of such normalized errors in small $\ell_1$ increments. Statistical and computational properties of iterative discrete solution paths have been studied in [80] among others. However, compared with Theorems 2.1 and 2.2, [80] requires upper sparse eigenvalue conditions on $X$ and larger penalty levels satisfying (2.16).
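Proof of Theorem 2.2. Let $h = \hat\beta - \beta^o$. Recall that Lemma 2.1 gives
$$\begin{aligned} h^T\Sigma h &\le \sum_{j\in S^c}\big(z_jh_j - \lambda|h_j|\big) - \lambda\omega_S^Th_S + \sum_j\kappa(\beta^o_j;\rho_j,\lambda)h_j^2 \\ &= h^Tz - \lambda\|h_{S^c}\|_1 - \sum_{j\in S}h_j\dot\rho_j(\beta^o_j;\lambda) + \sum_j\kappa(\beta^o_j;\rho_j,\lambda)h_j^2, \end{aligned} \tag{2.43}$$
where $z = X^T(y - X\beta^o)/n$, $\omega_j = \dot\rho_j(\beta^o_j;\lambda)/\lambda - z_j/\lambda$, and $\kappa(t;\rho_j,\lambda)$ is as in (2.3). Let $\epsilon_1 = \min\big\{\eta - \|z_{\tilde S^c}\|_\infty/\lambda_*,\ \eta_1 - \|(|z_{\tilde S}| - \eta_0\lambda_*)_+\|_2/(m^{1/2}\lambda_*)\big\}$ and $T \supseteq \mathrm{supp}(z)$. By (2.34), $\epsilon_1 > 0$ and
$$\begin{aligned} |h^Tz| &\le (\eta - \epsilon_1)\lambda_*\|h_{T\setminus\tilde S}\|_1 + (\eta_0 - \epsilon_1)\lambda_*\|h_{\tilde S}\|_1 + \sum_{j\in\tilde S}|h_j|\big(|z_j| - (\eta_0 - \epsilon_1)\lambda_*\big)_+ \\ &\le \eta\lambda_*\|h_{T\setminus\tilde S}\|_1 + \eta_0\lambda_*\|h_{\tilde S}\|_1 - \epsilon_1\lambda_*\|h_T\|_1 + \Big(\big\|(|z_{\tilde S}| - \eta_0\lambda_*)_+\big\|_2 + \epsilon_1m^{1/2}\lambda_*\Big)\|h_{\tilde S}\|_2 \\ &\le (\eta - \epsilon_1)\lambda_*\|h_T\|_1 + \eta_1\lambda_*\big(m^{1/2}\|h_{\tilde S}\|_2 - \|h_{\tilde S}\|_1\big). \end{aligned} \tag{2.44}$$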
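Let $u = h/\lambda$ with $\lambda \ge \lambda_*$. Combining (2.43) and (2.44), we have
$$u^T\Sigma u + \epsilon_1\|u\|_1 + (1-\eta)\|u_{S^c}\|_1 \le (1+\eta)\|u_S\|_1 + \eta_1\big(m^{1/2}\|u_{\tilde S}\|_2 - \|u_{\tilde S}\|_1\big) + \kappa_*\|u\|_2^2. \tag{2.45}$$
The above inequality holds for all $u = (\hat\beta - \beta^o)/\lambda$ as long as $\hat\beta \in \mathscr{B}(\lambda_*,\kappa_*)$. As in the proof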
that for all such $u$,
$$(1-\eta)\|u_{S^c}\|_1 \le (1+\eta)\|u_S\|_1 + \eta_1\big(m^{1/2}\|u_{\tilde S}\|_2 - \|u_{\tilde S}\|_1\big),$$
so that $u \in \mathscr{U}(S,\eta_0,\eta_1,m)$.
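By the definition of $\overline{\mathrm{RCIF}}$ in (2.40), we have
$$\overline{\mathrm{RCIF}}_{\mathrm{pred}}(S;\eta_0,\eta_1,m)\,h^T\Sigma h \le \|\Sigma h\|_c^2\,s^*, \tag{2.46}$$
and
$$\overline{\mathrm{RCIF}}_{\mathrm{est},q}(S;\eta_0,\eta_1,m)\|h\|_q \le \|\Sigma h\|_c\,(s^*)^{1/q}. \tag{2.47}$$
Moreover, we have
$$\|(\Sigma h)_{\tilde S^c}\|_\infty \le \|X_{\tilde S^c}^T(y - X\hat\beta)/n\|_\infty + \|X_{\tilde S^c}^T(y - X\beta^o)/n\|_\infty \le (1+\eta)\lambda,$$
and
$$\|(\Sigma h)_{\tilde S}\|_2 \le \|X_{\tilde S}^T(y - X\hat\beta)/n\|_2 + \|X_{\tilde S}^T(y - X\beta^o)/n\|_2 \le \lambda m^{1/2} + \eta_0\lambda_*m^{1/2} + \eta_1m^{1/2}\lambda_* \le (1+\eta)m^{1/2}\lambda.$$
Thus,
$$\|\Sigma h\|_c = \max\big\{\|(\Sigma h)_{\tilde S^c}\|_\infty,\ \|(\Sigma h)_{\tilde S}\|_2/m^{1/2}\big\} \le (1+\eta)\lambda. \tag{2.48}$$
We establish the RCIF error bounds in (2.41) and (2.42) by inserting the above inequality into (2.46) and (2.47) respectively.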
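To compare the RCIF and RE, we note that
for $u \in \mathscr{U}(S;\eta_0,\eta_1,m)$, so that for $\eta < 1$
$$\begin{aligned} u^T\Sigma u &= u_{\tilde S^c}^T(\Sigma u)_{\tilde S^c} + u_{\tilde S}^T(\Sigma u)_{\tilde S} \\ &\le \|u_{\tilde S^c}\|_1\|\Sigma u\|_c + m^{1/2}\|u_{\tilde S}\|_2\|\Sigma u\|_c \\ &\le \left(\frac{2|S|^{1/2}\|u_S\|_2 + \eta_1m^{1/2}\|u_{\tilde S}\|_2}{1-\eta} + m^{1/2}\|u_{\tilde S}\|_2\right)\|\Sigma u\|_c \\ &\le \frac{2|S|^{1/2} + (1-\eta_0)m^{1/2}}{1-\eta}\,\|u\|_2\|\Sigma u\|_c. \end{aligned}$$
It follows that
$$\frac{u^T\Sigma u}{\|u\|_2^2} \le \left[\frac{2(|S|/s^*)^{1/2} + (1-\eta_0)(m/s^*)^{1/2}}{1-\eta}\right]^2\frac{\|\Sigma u\|_c^2\,s^*}{u^T\Sigma u},$$
and
$$\frac{u^T\Sigma u}{\|u\|_2^2} \le \left[\frac{2(|S|/s^*)^{1/2} + (1-\eta_0)(m/s^*)^{1/2}}{1-\eta}\right]\frac{\|\Sigma u\|_c\,(s^*)^{1/2}}{\|u\|_2}.$$
Taking the infimum over the cone $\mathscr{U}(S;\eta_0,\eta_1,m)$ on both sides and noting that $\xi_1 = \big[2(|S|/s^*)^{1/2} + (1-\eta_0)(m/s^*)^{1/2}\big]/(1-\eta)$, we obtain
$$\overline{\mathrm{RCIF}}_{\mathrm{pred}}(S;\eta_0,\eta_1,m) \ge \overline{\mathrm{RE}}_2^2(S;\eta_0,\eta_1,m)/\xi_1^2, \quad\text{and}\quad \overline{\mathrm{RCIF}}_{\mathrm{est},2}(S;\eta_0,\eta_1,m) \ge \overline{\mathrm{RE}}_2^2(S;\eta_0,\eta_1,m)/\xi_1.$$
This completes the proof. $\square$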
2.4 Scaled concave PLSE
We have studied in previous sections the properties of all the local solutions in $\mathscr{B}_0(\lambda_*,\kappa_*)$. Once the local solution set $\mathscr{B}_0(\lambda_*,\kappa_*)$ is obtained, one still needs to choose an appropriate solution in the set or a proper penalty level. This problem, which will be studied in this section, is essentially to estimate the noise level $\sigma$ due to scale invariance.
Numerous efforts have been devoted to scale-free estimation under the $\ell_1$ penalty. Städler et al. [67] proposed the minimizer of the joint log-likelihood of regression coefficients and noise level with an $\ell_1$ penalty. The comment on this paper by Antoniadis [2] pointed out that their estimator is equivalent to the joint minimization of Huber's concomitant loss
$$(\hat\beta, \hat\sigma) = \arg\min_{\beta,\sigma}\ \frac{\|y - X\beta\|_2^2}{2n\sigma} + \frac{\sigma}{2} + \lambda_0\|\beta\|_1. \tag{2.49}$$
It turns out that (2.49) coincides with many other works on scale-free estimation under the $\ell_1$ penalty. For example, the square-root Lasso solution [5] and the equilibrium of the iterative algorithm [69] are both equivalent to (2.49). However, all of these studies of scale-free estimation are limited to the $\ell_1$ penalty. The scaled concave PLSE is not an easy extension of the scaled $\ell_1$ penalization due to the loss of the scale-free property.
In fact, the concomitant loss and the square-root formulation fail for concave penalties. To illustrate this, we take the MCP as an example. Denote by $\sigma^* = \|y - X\beta^*\|_2/n^{1/2}$ the oracle noise level estimator given the true coefficients $\beta^*$. Under the Gaussian assumption, this is the maximum likelihood estimator for $\sigma$ when $\beta^*$ is known and thus a natural estimation target. For the minimax concave penalty $\rho(t,\lambda) = \lambda\int_0^{|t|}(1 - x/(\lambda\gamma))_+\,dx$,
$$\hat\sigma^2 = \arg\min_{\sigma^2}\ \frac{\|y - X\beta^*\|_2^2}{2n\sigma} + \frac{\sigma}{2} + \frac{1}{\sigma}\sum_{j=1}^p\rho(|\beta^*_j|;\lambda_0\sigma) = \{\sigma^*\}^2 - (1/\gamma)\sum_{j:\,|\beta^*_j|<\lambda_0\hat\sigma\gamma}\{\beta^*_j\}^2,$$
so that $\hat\sigma$ is expected to underestimate $\sigma^*$ unless there is no small $\beta^*_j$ satisfying $|\beta^*_j| < \lambda_0\hat\sigma\gamma$. This validates the argument that the concomitant loss formulation fails for concave penalties. In addition, the iterative algorithm becomes extremely difficult to analyze due to the loss of its equivalence to joint convex minimization, compared with the Lasso.
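The downward bias can be seen numerically by minimizing the displayed concomitant-type objective over $\sigma$ on a grid with $\beta^*$ held fixed. This is only a sketch under illustrative assumptions: the values of $\sigma^*$, $\beta^*$, $\lambda_0$ and $\gamma$, the grid, and the function names are all hypothetical. The MCP is evaluated through its standard closed form $\lambda|t| - t^2/(2\gamma)$ for $|t| \le \gamma\lambda$ and $\gamma\lambda^2/2$ otherwise.

```python
import math

def mcp(t, lam, gamma):
    """Closed form of lam * integral_0^{|t|} (1 - x / (lam * gamma))_+ dx."""
    a = abs(t)
    if a <= gamma * lam:
        return lam * a - a * a / (2.0 * gamma)
    return 0.5 * gamma * lam * lam

def concomitant_mcp(sigma, sigma_star, beta_star, lam0, gamma):
    """Objective in sigma with beta fixed at beta_star, using ||y - X beta*||^2 / n = sigma_star^2."""
    fit = sigma_star ** 2 / (2.0 * sigma) + sigma / 2.0
    pen = sum(mcp(b, lam0 * sigma, gamma) for b in beta_star) / sigma
    return fit + pen

# Hypothetical configuration: two small coefficients and one large coefficient.
sigma_star = 1.0
beta_star = [0.05, 0.1, 3.0]
lam0, gamma = 0.3, 3.0

grid = [0.01 * k for k in range(1, 201)]  # candidate sigma values in (0, 2]
sigma_hat = min(grid, key=lambda s: concomitant_mcp(s, sigma_star, beta_star, lam0, gamma))
```

In this configuration the grid minimizer falls below $\sigma^*$, in line with the underestimation described above. Note that the code minimizes the full objective, so coefficients above the threshold $\lambda_0\sigma\gamma$ also contribute to the gap.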
2.4.1 Description of the scaled concave PLSE
Given a coefficient vector $\hat\beta$, define the noise level estimator as
p
where $d$ is a parameter that provides an option to adjust the degrees of freedom. Typically, we let $d = p$ when $p < n$ and $d = 0$ otherwise. Within the local solution set $\mathscr{B}_0(\lambda_*,\kappa_*)$, we search for a subclass of scaled concave penalized least-square estimators $\mathscr{B}_{0,\mathrm{scal}}(\lambda_0;\lambda_*,\kappa_*)$, defined as
$$\mathscr{B}_{0,\mathrm{scal}}(\lambda_0;\lambda_*,\kappa_*) = \big\{\hat\beta \in \mathscr{B}_0(\lambda_*,\kappa_*) : \lambda_0\hat\sigma(\hat\beta) = \lambda\big\}. \tag{2.51}$$
Here, $\lambda_0$ is a prefixed penalty level, independent of $\sigma$. For example, one may choose $\lambda_0 = A\{(2/n)\log p\}^{1/2}$ for the universal penalty and $\lambda_0 = An^{-1/2}L_1(k/p)$ for the smaller penalty, with appropriate $k$ and $A$. We derive the consistency results for noise level estimation for different $\lambda_0$ separately in the following analysis.
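On a discrete grid of penalty levels, the defining relation $\lambda_0\hat\sigma(\hat\beta) = \lambda$ in (2.51) can only be matched approximately. A simple stand-in is to scan the path from the largest penalty level down and stop at the first point where $\lambda_0\hat\sigma(\hat\beta) \ge \lambda$. The sketch below assumes the noise level estimator with $d = 0$, that is $\hat\sigma(\beta) = \|y - X\beta\|_2/n^{1/2}$. The path entries are hypothetical placeholders rather than actual solutions of (2.12), and the function names are illustrative.

```python
import math

def sigma_hat(cols, y, beta):
    """Noise level estimate ||y - X beta||_2 / sqrt(n), i.e. the d = 0 case (an assumption)."""
    resid = list(y)
    for col, b in zip(cols, beta):
        if b != 0.0:
            resid = [r - b * a for r, a in zip(resid, col)]
    return math.sqrt(sum(r * r for r in resid) / len(y))

def scaled_choice(cols, y, path, lam0):
    """First (lam, beta) along a path with decreasing lam such that lam0 * sigma_hat(beta) >= lam."""
    for lam, beta in path:
        s = sigma_hat(cols, y, beta)
        if lam0 * s >= lam:
            return lam, beta, s
    return None  # the grid does not reach the crossing point

# Tiny hypothetical example: one normalized column and a hand-made path.
cols = [[1.0, -1.0, 1.0, -1.0]]
y = [2.0, -1.0, 1.0, -2.0]
path = [(1.0, [0.0]), (0.5, [1.0]), (0.25, [1.4])]
chosen = scaled_choice(cols, y, path, lam0=0.5)
```

A finer grid brings the selected point closer to an exact solution of $\lambda_0\hat\sigma(\hat\beta) = \lambda$.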
As discussed in Section 2.2, $\mathscr{B}_0(\lambda_*,\kappa_*)$ is a large class of estimators that includes all local solutions connected to the origin regardless of the specific algorithms used to compute the solution. We here use the PLUS algorithm as an example to illustrate the computation of the estimators in $\mathscr{B}_{0,\mathrm{scal}}(\lambda_0;\lambda_*,\kappa_*)$. The PLUS path, indexed by $x$, is defined as
$$\lambda^{(x)} \oplus \hat\beta^{(x)} \equiv \Big\{\text{a continuous path of solutions of (2.12) in } \mathbb{R}^{1+p} \text{ with } \hat\beta^{(0)} = 0 \text{ and } \lim_{x\to\infty}\lambda^{(x)} = 0\Big\}. \tag{2.52}$$
Given a PLUS solution path, the scaled estimator can be defined as
$$\hat\beta^{\mathrm{scal}} = \hat\beta^{(\hat x)}, \qquad \hat x = \min\big\{x : \lambda_0\hat\sigma(\hat\beta^{(x)}) \ge \lambda^{(x)}\big\}. \tag{2.53}$$
The "$\ge$" in defining $\hat x$ in (2.53) can be changed to "$=$" by the continuity of the PLUS path. Under mild regularity conditions, we will prove that $\hat\beta^{\mathrm{scal}} \in \mathscr{B}_{0,\mathrm{scal}}(\lambda_0;\lambda_*,\kappa_*)$. See the next subsections for the proof. This also guarantees the non-emptiness of $\mathscr{B}_{0,\mathrm{scal}}(\lambda_0;\lambda_*,\kappa_*)$.
2.4.2 Performance guarantees of scaled concave PLSE at universal penalty levels
In this subsection, we derive the consistency results for noise level estimation with sufficiently large $\lambda_0$. Since $\sigma^* = \|y - X\beta^*\|_2/n^{1/2}$ is a natural target of noise level estimation, we aim to derive the convergence results of $\hat\sigma(\hat\beta)/\sigma^*$ with $\hat\beta \in \mathscr{B}_{0,\mathrm{scal}}(\lambda_0;\lambda_*,\kappa_*)$ in the following theorem.
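Theorem 2.3. Let $\beta^*$ be the true regression coefficients, $\hat\beta^{\mathrm{scal}}$ be as in (2.53) and $\sigma^* = \|y - X\beta^*\|_2/n^{1/2}$ be the oracle noise level estimator. Let $0 < \eta < 1$ and $\xi = (1+\eta)/(1-\eta)$. Suppose $\kappa(\rho) \le \kappa_*$ and $\mathrm{RE}_2^2(S;\eta,1) \ge \kappa_*$.
(i) Let $\tau_0 = (1+\xi)(1+\eta)\lambda_0s^{1/2}/\mathrm{RE}_1(S;\eta,1)$. When (2.16) holds with $\lambda_* = \lambda_0\sigma^*(1-\tau_0)$ and $\delta_* = 1$, we have $\hat\beta^{\mathrm{scal}} \in \mathscr{B}_{0,\mathrm{scal}}(\lambda_0;\lambda_*,\kappa_*)$. Moreover, for any $\hat\beta \in \mathscr{B}_{0,\mathrm{scal}}(\lambda_0;\lambda_*,\kappa_*)$,
$$\max\left(1 - \frac{\hat\sigma(\hat\beta)}{\sigma^*},\ 1 - \frac{\sigma^*}{\hat\sigma(\hat\beta)}\right) \le \tau_0, \qquad \frac{\|X\hat\beta - X\beta^*\|_2}{n^{1/2}\sigma^*} \le \frac{\tau_0}{1-\tau_0}. \tag{2.54}$$
In particular, if we take $\lambda_0 = A\{(2/n)\log p\}^{1/2}$ with $A > 1/\eta$ and $\tau_0 \to 0$, then for all $\epsilon > 0$
$$P_{\beta^*,\sigma}\big(|\hat\sigma(\hat\beta)/\sigma - 1| > \epsilon\big) \to 0. \tag{2.55}$$
(ii) Let $\tau_*^2 = \eta(1+\eta)(1+\xi)^2\lambda_0^2s/\mathrm{RE}_1(S;\eta,1)$. When (2.16) holds with $\lambda_* = \lambda_0\sigma^*(1-\tau_*^2)$ and $\delta_* = 1$, we have $\hat\beta^{\mathrm{scal}} \in \mathscr{B}_{0,\mathrm{scal}}(\lambda_0;\lambda_*,\kappa_*)$. Moreover, for any $\hat\beta \in \mathscr{B}_{0,\mathrm{scal}}(\lambda_0;\lambda_*,\kappa_*)$,
$$\max\left(1 - \frac{\hat\sigma(\hat\beta)}{\sigma^*},\ 1 - \frac{\sigma^*}{\hat\sigma(\hat\beta)}\right) \le 3\tau_*^2. \tag{2.56}$$
If we take $\lambda_0 = A\{(2/n)\log p\}^{1/2}$ with $A > 1/\eta$ and $\tau_*^2 \ll n^{-1/2}$, then
$$n^{1/2}\big(\hat\sigma(\hat\beta)/\sigma - 1\big) \to N(0, 1/2) \tag{2.57}$$
in distribution under $P_{\beta^*,\sigma}$.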
By proving $\hat\beta^{\mathrm{scal}} \in \mathscr{B}_{0,\mathrm{scal}}(\lambda_0;\lambda_*,\kappa_*)$, Theorem 2.3 guarantees the non-emptiness of $\mathscr{B}_{0,\mathrm{scal}}(\lambda_0;\lambda_*,\kappa_*)$ with appropriate $\lambda_*$ and $\lambda_0$. Moreover, it provides the convergence and asymptotic normality for the scaled concave estimation of the noise level under only the restricted eigenvalue conditions. In part (i), we achieve an error rate $\tau_0 \asymp \sqrt{(s/n)\log p}$ for noise level estimation with $\hat\beta \in \mathscr{B}_{0,\mathrm{scal}}(\lambda_0;\lambda_*,\kappa_*)$. This matches the rate of the $\ell_1$ penalized maximum likelihood estimator in [67]. In part (ii), we provide a sharper convergence rate and the asymptotic normality results. The sharper rate $\tau_*^2$ is of the order $(s/n)\log p$, which is essentially the square of the order in part (i). The asymptotic normality then follows from the sharper rate under mild assumptions. The convergence rate in part (ii) matches the rate of the iterative algorithm formulation in Sun and Zhang [69].
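Proof of Theorem 2.3. We first prove (i). Denote $z = X^T(y - X\beta^*)/n$ and $h^{(x)} = \hat\beta^{(x)} - \beta^*$. Consider the penalty level $\lambda^{(x_0)} = \lambda_* = \lambda_0\sigma^*(1-\tau_0)$ for certain $x_0$ in the PLUS path. Since $\lambda^{(x_0)} = \lambda_*$ satisfies (2.16), it follows from Theorem 2.1 and the definition of $\tau_0$ that $\|Xh^{(x_0)}\|_2/n^{1/2} \le \sigma^*\tau_0(1-\tau_0) \le \sigma^*\tau_0$. Then we have
$$\lambda_0\hat\sigma(\hat\beta^{(x_0)}) = \lambda_0\|y - X\hat\beta^{(x_0)}\|_2/n^{1/2} \ge \lambda_0\big|\sigma^* - \|Xh^{(x_0)}\|_2/n^{1/2}\big| \ge \lambda_0\sigma^*(1-\tau_0) = \lambda^{(x_0)}. \tag{2.58}$$
By the definition of $\hat x$, $\hat x \le x_0$. Since $\hat\beta^{(x)}$ at any penalty level $\lambda^{(x)} \ge \lambda_*$ is a local solution of (2.12) in the PLUS path, $\lambda^{(x)}$ is a non-increasing function of $x$ for $\lambda^{(x)} \ge \lambda_*$. Thus, $\lambda^{(\hat x)} \ge \lambda^{(x_0)} = \lambda_*$. It follows that $\hat\beta^{\mathrm{scal}} \in \mathscr{B}_{0,\mathrm{scal}}(\lambda_0;\lambda_*,\kappa_*)$.
Moreover, for any $\hat\beta \in \mathscr{B}_{0,\mathrm{scal}}(\lambda_0;\lambda_*,\kappa_*)$ with penalty $\lambda \ge \lambda_*$, we have
$$\hat\sigma(\hat\beta) = \lambda/\lambda_0 \ge \lambda_*/\lambda_0 = \sigma^*(1-\tau_0). \tag{2.59}$$
Furthermore, by Theorem 2.1 we have
$$\big|\|y - X\hat\beta\|_2/n^{1/2} - \sigma^*\big| \le \|X(\hat\beta - \beta^*)\|_2/n^{1/2} \le \tau_0\hat\sigma(\hat\beta). \tag{2.60}$$
Thus,
$$\frac{\hat\sigma(\hat\beta)}{\sigma^*} = \frac{\|y - X\hat\beta\|_2}{n^{1/2}\sigma^*} \le \frac{\tau_0\hat\sigma(\hat\beta) + \sigma^*}{\sigma^*} = 1 + \tau_0\frac{\hat\sigma(\hat\beta)}{\sigma^*}. \tag{2.61}$$
This implies $\hat\sigma(\hat\beta) \le \sigma^*/(1-\tau_0)$. Combining with (2.59), the first part of (2.54) holds. In addition,
$$\|X\hat\beta - X\beta^*\|_2/n^{1/2} \le \tau_0\hat\sigma(\hat\beta) \le \sigma^*\tau_0/(1-\tau_0), \tag{2.62}$$
so the second part of (2.54) holds. To prove (2.55), note that for certain $A$,
$$P_{\beta,\sigma}\big[\|z\|_\infty \le A\sigma\{(2/n)\log p\}^{1/2}\big] \to 1,$$
so that (2.55) follows from (2.54).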
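Now we prove (ii). By the KKT condition,
$$\begin{aligned} -(\|z\|_\infty + \lambda)\|h\|_1 &\le (Xh)^T\big\{y - X\beta^* + y - X\hat\beta\big\}/n \\ &\le (\sigma^*)^2 - \|y - X\hat\beta\|_2^2/n \\ &\le (Xh)^T\big\{2(y - X\beta^*) - Xh\big\}/n \le 2\|z\|_\infty\|h\|_1. \end{aligned} \tag{2.63}$$
We use the above inequalities as lower and upper bounds for $(\sigma^*)^2 - \|y - X\hat\beta\|_2^2/n$.
Consider $\lambda^{(x_1)} = \lambda_* = \lambda_0\sigma^*(1-\tau_*^2)$ in the PLUS path. Since $\lambda^{(x_1)} = \lambda_*$ satisfies (2.16), it follows from Theorem 2.1 that $\|h^{(x_1)}\|_1 \le (1+\xi)^2(1+\eta)\lambda^{(x_1)}s/\mathrm{RE}_1(S;\eta,1)$. Combining with $\|z\|_\infty < \lambda_0\eta\sigma^*(1-\tau_*^2)$, we have
$$\begin{aligned} \lambda_0^2\hat\sigma^2(\hat\beta^{(x_1)}) = \lambda_0^2\|y - X\hat\beta^{(x_1)}\|_2^2/n &\ge \lambda_0^2\big(\{\sigma^*\}^2 - 2\|z\|_\infty\|h^{(x_1)}\|_1\big) \\ &\ge \lambda_0^2\{\sigma^*\}^2\big(1 - 2\tau_*^2(1-\tau_*^2)^2\big) \\ &\ge \lambda_0^2\{\sigma^*\}^2(1-\tau_*^2)^2 = (\lambda^{(x_1)})^2. \end{aligned} \tag{2.64}$$
The last inequality holds since $\tau_*^2 \le 1$. As in part (i), we find $\lambda^{(\hat x)} \ge \lambda^{(x_1)} = \lambda_*$ and $\hat\beta^{\mathrm{scal}} \in \mathscr{B}_{0,\mathrm{scal}}(\lambda_0;\lambda_*,\kappa_*)$.
Similarly, for any $\hat\beta \in \mathscr{B}_{0,\mathrm{scal}}(\lambda_0;\lambda_*,\kappa_*)$ with penalty $\lambda \ge \lambda_* = \lambda_0\sigma^*(1-\tau_*^2)$, we have $\hat\sigma(\hat\beta) = \lambda/\lambda_0 \ge \lambda_*/\lambda_0 = \sigma^*(1-\tau_*^2)$. On the other hand, since $\|z\|_\infty < \lambda_0\eta\sigma^*(1-\tau_*^2)$ and $\|\hat\beta - \beta^*\|_1 \le (1+\xi)^2(1+\eta)\lambda_0\hat\sigma(\hat\beta)s/\mathrm{RE}_1(S;\eta,1)$, we have
$$\frac{\hat\sigma(\hat\beta)^2}{\{\sigma^*\}^2} = \frac{\|y - X\hat\beta\|_2^2}{n\{\sigma^*\}^2} \le \frac{\{\sigma^*\}^2 + \big(\|z\|_\infty + \lambda_0\hat\sigma(\hat\beta)\big)\|\hat\beta - \beta^*\|_1}{\{\sigma^*\}^2} \le \frac{\{\sigma^*\}^2 + \tau_*^2(1-\tau_*^2)\hat\sigma(\hat\beta)\sigma^* + (1/\eta)\tau_*^2\hat\sigma^2(\hat\beta)}{\{\sigma^*\}^2}. \tag{2.65}$$
Solving the above inequality with respect to $\hat\sigma(\hat\beta)/\sigma^*$, we obtain $\hat\sigma(\hat\beta)/\sigma^* \le (1+\tau_*^2)/(1-\tau_*^2)$. Thus $1 - \sigma^*/\hat\sigma(\hat\beta) \le 3\tau_*^2$. This proves (2.56). Given (2.56), the proof of (2.57) follows the proof of Theorem 2 (ii) in Sun and Zhang [69]. $\square$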
2.4.3 Performance bounds of scaled concave PLSE at smaller penalty levels
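In this subsection, we derive the consistency results for noise level estimation with smaller $\lambda_0$.
Theorem 2.4. Let $\beta^*$, $\hat\beta^{\mathrm{scal}}$ and $\sigma^*$ be as in Theorem 2.3 and $\overline{\mathrm{RE}}_{*,2}(S;\cdot,\cdot)$ be as in (2.39). Let $m$ be a positive integer, $\eta = \eta_0 + \eta_1$ with positive $\eta_0$, $\eta_1$ and $\xi_2 = [2 + (1-\eta_0)(m/s)^{1/2}]/(1-\eta)$. Define $\tau_1 = (1+\eta)\lambda_0\xi_2(s\vee m)^{1/2}/\overline{\mathrm{RE}}_{*,2}(S;\eta,m)$. Suppose $\kappa(\rho) \le \kappa_*$ and $\mathrm{RE}_2^2(S;\eta,1) \ge \kappa_*$. When (2.32) holds with $\lambda_* = \lambda_0\sigma^*(1-\tau_1)$, we have $\hat\beta^{\mathrm{scal}} \in \mathscr{B}_{0,\mathrm{scal}}(\lambda_0;\lambda_*,\kappa_*)$. Moreover, for any $\hat\beta \in \mathscr{B}_{0,\mathrm{scal}}(\lambda_0;\lambda_*,\kappa_*)$,
$$\max\left(1 - \frac{\hat\sigma(\hat\beta)}{\sigma^*},\ 1 - \frac{\sigma^*}{\hat\sigma(\hat\beta)}\right) \le \tau_1, \qquad \frac{\|X\hat\beta - X\beta^*\|_2}{n^{1/2}\sigma^*} \le \frac{\tau_1}{1-\tau_1}. \tag{2.66}$$
If we take $\lambda_0 = An^{-1/2}L_1(k/p)$ with $k$ in (2.37), $A > 1/\eta_0$ and $\tau_1 \to 0$, then for all $\epsilon > 0$
$$P_{\beta^*,\sigma}\big(|\hat\sigma(\hat\beta)/\sigma - 1| > \epsilon\big) \to 0. \tag{2.67}$$
Similarly to Theorem 2.3, Theorem 2.4 first guarantees the non-emptiness of $\mathscr{B}_{0,\mathrm{scal}}(\lambda_0;\lambda_*,\kappa_*)$, but with smaller $\lambda_0$. Furthermore, it provides the convergence results for noise level estimation at smaller penalties under nearly identical conditions as in Theorem 2.3. Compared with the existing literature, Theorem 2.4 can be viewed as a generalization of the scaled Lasso with smaller penalties in Sun and Zhang [70].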
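Proof of Theorem 2.4. Consider the penalty level $\lambda^{(x_1)} = \lambda_* = \lambda_0\sigma^*(1-\tau_1)$ for certain $x_1 < \infty$ in the PLUS path. Since (2.32) holds for $\lambda^{(x_1)}$, by Theorem 2.2, we have
$$\|Xh^{(x_1)}\|_2/n^{1/2} \le \frac{(1+\eta)\xi_1\lambda(s^*)^{1/2}}{\overline{\mathrm{RE}}_2(S;\eta_0,\eta_1,m)} \le \frac{(1+\eta)\xi_2\lambda^{(x_1)}(s\vee m)^{1/2}}{\overline{\mathrm{RE}}_{*,2}(S;\eta,m)} \le \sigma^*\tau_1.$$
Similarly to (2.58),
$$\lambda_0\hat\sigma(\hat\beta^{(x_1)}) = \lambda_0\|y - X\hat\beta^{(x_1)}\|_2/n^{1/2} \ge \lambda_0\sigma^*(1-\tau_1) = \lambda^{(x_1)}.$$
As in the proof of Theorem 2.3, we find $\lambda^{(\hat x)} \ge \lambda^{(x_1)}$ and $\hat\beta^{\mathrm{scal}} \in \mathscr{B}_{0,\mathrm{scal}}(\lambda_0;\lambda_*,\kappa_*)$. Moreover, (2.66) and (2.67) can be proved in the same way as in Theorem 2.3. $\square$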
2.5 Simulation Study
In this section, we report the noise level estimation results of the scaled concave PLSE and compare them with several competing methods in a comprehensive simulation study. The experimental settings follow Reid et al. [61] and are described in our notation below.
The simulation aims to estimate the noise level $\sigma$ in a variety of settings. All simulations are run at a sample size of $n = 100$, and the number of predictors takes four values: $p = 100, 200, 500, 1000$. Elements of the design matrix $X$ are generated randomly as $X_{ij} \sim N(0,1)$. The correlation between columns of $X$ is set to be $\rho$. The true parameter $\beta^*$ is generated as follows. The number of nonzero elements is set to be $p_{nz} = \lceil n^\alpha\rceil$, so that $\alpha$ controls the sparsity of $\beta^*$: the higher the $\alpha$, the less sparse $\beta^*$ is. It ranges between 0 and 1. The indices corresponding to nonzero elements of $\beta^*$ are selected randomly. Their values are set to be random samples from a Laplace$(0,1)$ distribution. The elements of the resulting $\beta^*$ are scaled such that the signal-to-noise ratio, defined as $\{\beta^*\}^T\Sigma\beta^*/\sigma^2$, is some predetermined value, snr. Simulations were run over a grid of values for each of the parameters described above. In particular,
• $\rho = 0, 0.2, 0.4, 0.6, 0.8$
• $\alpha = 0.1, 0.3, 0.5, 0.7, 0.9$
• snr $= 0.5, 1, 2, 5, 10, 20$.
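The data-generating steps above can be sketched in a few lines. Several details are not fixed by the description and are assumptions of this sketch: the equicorrelation between columns is produced by a shared Gaussian factor, $\Sigma$ in the signal-to-noise ratio is taken to be the sample Gram matrix $X^TX/n$, and the noise level defaults to $\sigma = 1$. The function name `simulate` and its arguments are hypothetical.

```python
import math
import random

def simulate(n, p, rho, alpha, snr, sigma=1.0, seed=0):
    """Generate one dataset (X, y, beta) following the described design (see stated assumptions)."""
    rng = random.Random(seed)
    # Shared factor: sqrt(1 - rho) * N(0,1) + sqrt(rho) * w_i gives unit variance
    # and pairwise correlation rho between columns (an assumed mechanism).
    shared = [rng.gauss(0, 1) for _ in range(n)]
    X = [[math.sqrt(1.0 - rho) * rng.gauss(0, 1) + math.sqrt(rho) * shared[i]
          for _ in range(p)] for i in range(n)]
    # Number of nonzero coefficients: ceil(n ** alpha), capped at p.
    p_nz = min(p, math.ceil(n ** alpha))
    support = rng.sample(range(p), p_nz)
    beta = [0.0] * p
    for j in support:
        # Laplace(0, 1) draw: a random sign times a unit-rate exponential.
        beta[j] = rng.choice((-1.0, 1.0)) * rng.expovariate(1.0)
    # Rescale so that beta^T (X^T X / n) beta / sigma^2 equals snr.
    fitted = [sum(row[j] * beta[j] for j in support) for row in X]
    signal = sum(f * f for f in fitted) / n
    scale = math.sqrt(snr * sigma ** 2 / signal)
    beta = [b * scale for b in beta]
    y = [scale * f + sigma * rng.gauss(0, 1) for f in fitted]
    return X, y, beta

X, y, beta = simulate(n=100, p=50, rho=0.4, alpha=0.5, snr=2.0)
```

Looping this function over the grid of $\rho$, $\alpha$, snr and $p$ with different seeds would reproduce the structure of the study. The exact settings of Reid et al. [61] may differ in the details assumed here.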
We simulate B “ 200 independent datasets for each set of parameters. The competing methods considered include: