TOPICS IN HIGH-DIMENSIONAL REGRESSION AND NONPARAMETRIC MAXIMUM LIKELIHOOD METHODS


BY LONG FENG

A dissertation submitted to the Graduate School—New Brunswick Rutgers, The State University of New Jersey

in partial fulfillment of the requirements for the degree of

Doctor of Philosophy

Graduate Program in Statistics and Biostatistics

Written under the direction of

Cun-Hui Zhang & Lee H. Dicker

and approved by

New Brunswick, New Jersey

October, 2017


Topics in high-dimensional regression and nonparametric

maximum likelihood methods

by LONG FENG

Dissertation Director: Cun-Hui Zhang & Lee H. Dicker

This thesis contains two parts. The first part, in Chapters 2-4, addresses three connected issues in penalized least-square estimation for high-dimensional data. The second part, in Chapter 5, concerns nonparametric maximum likelihood methods for mixture models.

In the first part, we prove the estimation, prediction and selection properties of concave penalized least-square estimation (PLSE) under fully observed and noisy/missing designs, and validate an essential condition for PLSE: the restricted eigenvalue condition. In Chapter 2, we prove that the concave PLSE matches the oracle inequalities for prediction and coefficient estimation of the Lasso, based only on the restricted eigenvalue condition, one of the mildest conditions imposed on the design matrix. Furthermore, under a uniform signal strength assumption, selection consistency does not require any additional conditions for proper concave penalties such as the SCAD penalty and MCP. A scaled version of the concave PLSE is also proposed to jointly estimate the regression coefficients and noise level. Chapter 3 concerns high-dimensional regression when the design matrix is subject to missingness or noise. We extend the PLSE for fully observed designs to noisy or missing designs and prove that the same scale of coefficient estimation error can be obtained, while requiring no additional condition. Moreover, we show that a linear combination of the $\ell_2$ norm of the regression coefficients and the noise level is large enough as a penalty level when noise


required in Chapter 2 and Chapter 3 and considers a more general groupwise version. We prove that the population version of the groupwise RE condition implies its sample version under a low moment condition, given the usual sample size requirement. Our results include the ordinary RE condition as a special case.

In the second part, we consider nonparametric maximum likelihood (NPML) methods for mixture models, a nonparametric empirical Bayes approach. We provide concrete guidance on implementing multivariate NPML methods for mixture models, with theoretical and empirical support; topics covered include identifying the support set of the mixing distribution, and comparing algorithms (across a variety of metrics) for solving the simple convex optimization problem at the core of the approximate NPML problem. In addition, three diverse real data applications are provided to illustrate the performance of nonparametric maximum likelihood methods.


I would like to express my deepest gratitude to my advisors, Prof. Cun-Hui Zhang and Prof. Lee Dicker. I feel extremely fortunate to have had the opportunity to work with them. Prof. Zhang is more than an advisor to me. He is a great researcher, a dedicated educator, a devoted mentor, a trusted friend and a respectable elder. He has provided me with helpful instruction and exceptional research training, as well as unwavering support and constant encouragement. More importantly, his devotion to research and to students has made me interested in pursuing a career in academia as a researcher and teacher. Prof. Dicker is the junior professor I admire most. His brilliant ideas and excellent intuition for statistics always deepen my understanding of research questions. He can always convey complex problems in plain language. I have greatly enjoyed working with him and have benefited a lot from each of our meetings.

Secondly, I would like to extend my gratitude to my dissertation committee, Prof. Pierre Bellec and Prof. Eitan Greenshtein, for the time they dedicated to reviewing my thesis and for their comments on the manuscripts. Special thanks go to Prof. Greenshtein for his helpful discussions on nonparametric maximum likelihood estimation in Chapter 5 of this thesis.

In addition, I want to thank Prof. John Kolassa for his support over the past five years and Prof. Minge Xie for his advice and encouragement during my studies. I also want to thank the fellow students in our department and my friends at Rutgers for their suggestions and help. I feel very lucky to have met all these people in my graduate life and to have had such a happy and unforgettable journey.


To my family


Abstract
Acknowledgements
Dedication
List of Tables
List of Figures
1. Introduction
1.1. High-dimensional regression
1.2. Nonparametric maximum likelihood methods
2. Oracle properties of concave PLSE and its scaled version
2.1. Introduction
2.2. Statistical Properties of Concave PLSE methods
2.2.1. Concave penalties
2.2.2. The restricted eigenvalue condition
2.2.3. Properties of concave PLSE
2.3. Smaller penalty levels
2.3.1. Smaller penalty levels
2.3.2. RE-type conditions for smaller penalty levels
2.3.3. Prediction and estimation error bounds at smaller penalty levels
2.4. Scaled concave PLSE
2.4.1. Description of the scaled concave PLSE
2.4.2. Performance guarantees of scaled concave PLSE at universal penalty levels
2.5.1. No signal case: $\beta^* = 0$
2.5.2. Effect of correlation: ranging over different $\rho$
2.5.3. Effect of signal-to-noise ratio: ranging over different snr
2.5.4. Effect of sparsity: ranging over different $\alpha$
2.6. Discussion
3. Penalized least-square estimation with noisy and missing data
3.2. Theoretical Analysis of PLSE
3.2.1. Restricted eigenvalue conditions
3.2.2. Main results
3.3. Theoretical penalty levels for missing/noisy data
3.4. Scaled PLSE and Variance Estimation
3.5. Conclusions
4. Group Lasso under Low-Moment Conditions on Random Designs
4.2. A review of restricted eigenvalue type conditions
4.3. The group transfer principle
4.4. Groupwise compatibility condition
4.5. Groupwise restricted eigenvalue condition
4.6. Convergence of the restricted eigenvalue
4.7. Lemmas
4.8. Discussion
5. Nonparametric Maximum Likelihood for Mixture Models: A Convex Optimization Approach to Fitting Arbitrary Multivariate Mixing Distributions
5.2.1. NPMLEs
5.2.2. A simple finite-dimensional convex approximation
5.3. Choosing $\Lambda$
5.4. Connections with finite mixtures
5.5. Implementation overview
5.6. Simulation studies
5.6.1. Comparing NPMLE algorithms
5.6.2. Gaussian location scale mixtures: Other methods for estimating a normal mean vector
5.7. Baseball data
5.8. Two-dimensional NPMLE for cancer microarray classification
5.9. Continuous glucose monitoring
5.9.1. Linear model
5.9.2. Kalman filter
5.9.3. Comments on results
5.10. Discussion


2.1. Median bias of standard deviation estimates. No signal, $\sigma = 1$, $\rho = 0$, sample size $n = 100$. Minimum error besides the oracle is in bold for each analysis.

5.1. Comparison of different NPMLE algorithms. Mean values (standard deviation in parentheses) reported from 100 independent datasets; $p = 1000$ throughout simulations. Mixing distribution 1 has constant $\sigma_j$; mixing distribution 2 has correlated $\mu_j$ and $\sigma_j$.

5.2. Mean TSE for various estimators of $\mu \in \mathbb{R}^p$ based on 100 simulated datasets; $p = 1000$. $(q_1, q_2)$ indicates the grid points used to fit $\hat{G}_\Lambda$.

5.3. Baseball data. TSE relative to the naive estimator. Minimum error is in bold for each analysis.

5.4. Microarray data. Number of misclassification errors on test data.

5.5. Blood glucose data. MSE relative to CGM.


2.1. Median standard deviation estimates over different levels of predictor correlation. $\sigma = 1$, $\alpha = 0.5$, snr $= 1$, sample size $n = 100$, predictors $p = 100, 200, 500, 1000$ moving from left to right along rows. Plot numbers refer to CV L (1), CV SCAD (2), SZ L (3), SZ L2 (4), SZ MCP (5), SZ MCP2 (6), SZ MCP3 (7).

2.2. Median standard deviation estimates over different levels of signal-to-noise ratio. $\sigma = 1$, $\alpha = 0.5$, $\rho = 0$, sample size $n = 100$, predictors $p = 100, 200, 500, 1000$ moving from left to right along rows. Plot numbers refer to CV L (1), CV SCAD (2), SZ L (3), SZ L2 (4), SZ MCP (5), SZ MCP2 (6), SZ MCP3 (7).

2.3. Median standard deviation estimates over different levels of sparsity. $\sigma = 1$, snr $= 1$, $\rho = 0$, sample size $n = 100$, predictors $p = 100, 200, 500, 1000$ moving from left to right along rows. Plot numbers refer to CV L (1), CV SCAD (2), SZ L (3), SZ L2 (4), SZ MCP (5), SZ MCP2 (6), SZ MCP3 (7).

2.4. Five $\lambda_0$s as functions of $k$, with $n = 100$, $p = 1000$. Line numbers refer to (1) $\lambda_0(k) = \{(2/n)\log(p)\}^{1/2}$, (2) $\lambda_0(k) = \{(2/n)\log(p/k)\}^{1/2}$, (3) $\lambda_0(k) = (2/n)^{1/2} L_1(k/p)$, (4) the adaptive $\lambda_0$ described in Section 2.5 with various $k$, assuming that the correlation between columns of $X$ is 0, (5) the same as (4) except assuming that the correlation between columns of $X$ is 0.8. Here $k_1$ is the solution to (2.37) and $k_2$ is the solution to $2k = L_1^4(k/p) + 2L_1^2(k/p)$.

baseball dataset; (b) histogram of non-pitcher data from the baseball dataset; (c) histogram of pitcher data from the baseball dataset.


Chapter 1 Introduction

1.1 High-dimensional regression

The first part of this thesis addresses three issues in parameter estimation, prediction and variable selection for high-dimensional regression: concave penalized least-square regression, high-dimensional regression with noisy and missing data, and restricted eigenvalue-type conditions for high-dimensional regression.

As modern technology generates vast amounts of data, high-dimensional data have been studied intensively in both statistics and computer science. In linear regression, a widely used approach to analyzing high-dimensional data is penalized least-square estimation (PLSE). The Lasso, or $\ell_1$ penalization [71], and concave penalization, such as the SCAD [21] and MCP [84], are the two mainstream methods in penalized least-square estimation. It has been shown that the concave PLSE guarantees variable selection consistency under significantly weaker conditions than the Lasso; for example, the strong irrepresentable condition on the design matrix required by the Lasso can be replaced by a sparse Riesz condition. Moreover, the concave PLSE also enjoys rate-optimal error bounds in prediction and coefficient estimation. However, the error bounds for prediction and coefficient estimation in the literature still require significantly stronger conditions than the Lasso requires, for example, knowledge of the $\ell_1$ norm of the true coefficient vector or the upper sparse eigenvalue condition. Ideally, selection, prediction and estimation properties should depend only on a lower sparse eigenvalue or restricted eigenvalue condition. Is that achievable? In the second chapter, we give an affirmative answer to this question.

In Chapter 2, we prove that the concave PLSE matches the oracle inequalities for prediction and $\ell_q$ coefficient estimation of the Lasso, with $1 \le q \le 2$, based only on the restricted eigenvalue condition, one of the mildest conditions on the design matrix. Furthermore, under a uniform signal strength assumption, selection consistency does not require any additional conditions for proper concave penalties such as the SCAD penalty and MCP. Our theorem applies to all the local solutions that are computable by path following algorithms starting from the origin. We also develop a scaled version of the concave PLSE that jointly estimates the regression coefficients and noise level. The scaled concave PLSE is not an easy extension of the scaled Lasso because the joint distribution of regression coefficients and noise level of the former is non-convex. The computational cost of the scaled concave PLSE is negligible beyond computing a continuous solution path. All our consistency results apply to cases where the number of predictors $p$ is much larger than the sample size $n$.

In Chapter 3, we consider high-dimensional regression when the design matrices are not fully observable. Two specifications are discussed: missing design and noisy design. We extend the PLSE to noisy or missing designs and prove that the same scale of coefficient estimation error can be obtained as with a fully observed design, while requiring no additional condition. Moreover, we prove that a linear combination of the noise level and the $\ell_2$ norm of the coefficients is large enough as a penalty level when noise or missing data exists. This sharpens the existing results, where the $\ell_1$ norm of the coefficients is required. We further extend the scaled version of PLSE to the missing and noisy data case. Since cross-validation based techniques are time consuming and may be misleading for missing or noisy data, the proposed scaled solution is of great use.

As discussed before, restricted eigenvalue (RE) type conditions can be viewed as nearly the weakest available conditions on the design matrix to guarantee prediction and estimation performance of the Lasso, the concave penalized least-square estimator and groupwise estimators in high-dimensional regression. In Chapter 4, we prove that the population version of the groupwise RE condition implies its sample version under: (i) a second moment uniform integrability assumption on the linear combinations of the design variables and (ii) a fourth moment uniform boundedness assumption on the individual design variables and an $m$-th moment assumption on the linear combinations of the within-group design variables for $m > 2$, provided the usual sample size requirement holds. Moreover, the fourth and $m$-th moment assumptions can be removed given a slightly larger sample size. In addition, the low moment condition is also sufficient to guarantee the groupwise compatibility condition, an $\ell_1$-version of the RE condition. Our results include the ordinary RE condition as a special case. This study demonstrates a benefit of standardizing the design variables in penalized least squares estimation for heavy-tailed random designs. It also indicates that the RE condition of a bootstrapped sample can be guaranteed given the corresponding sample RE condition.

1.2 Nonparametric maximum likelihood methods

The second part of this thesis considers two types of models using nonparametric maximum likelihood (NPML) methods, a nonparametric empirical Bayes approach: NPML methods for mixture models and NPML methods for linear models.

Nonparametric maximum likelihood (NPML) for mixture models is a technique for estimating mixing distributions that has a long and rich history in statistics going back to the 1950s, and is closely related to empirical Bayes methods. Historically, NPML-based methods have been considered to be relatively impractical because of computational and theoretical obstacles. However, recent work focusing on approximate NPML methods suggests that these methods may have great promise for a variety of modern applications. Building on this recent work, we study a class of flexible, scalable, and easy to implement approximate NPML methods for problems with multivariate mixing distributions. In Chapter 5, we provide concrete guidance on implementing these methods, with theoretical and empirical support; topics covered include identifying the support set of the mixing distribution, and comparing algorithms (across a variety of metrics) for solving the simple convex optimization problem at the core of the approximate NPML problem. Additionally, we illustrate the methods’ performance in three diverse real data applications: (i) A baseball data analysis (a classical example for empirical Bayes methods, originally inspired by Efron & Morris), (ii) high-dimensional microarray classification, and (iii) online prediction of blood-glucose density for diabetes patients. Among other things, our empirical results clearly demonstrate the relative effectiveness of using multivariate (as opposed to univariate) mixing distributions for NPML-based approaches.


Chapter 2 Oracle properties of concave PLSE and its scaled version

2.1 Introduction

The purpose of this chapter is to study prediction, coefficient estimation, and variable selection properties of the concave penalized least squares estimator (PLSE) in linear regression under the restricted eigenvalue (RE) condition on the design matrix.

Consider the linear model

$$y = X\beta^* + \varepsilon, \tag{2.1}$$

where $X = (x_1, \ldots, x_p) \in \mathbb{R}^{n \times p}$ is a design matrix, $y \in \mathbb{R}^n$ is a response vector, $\varepsilon \in \mathbb{R}^n$ is a noise vector, and $\beta^* \in \mathbb{R}^p$ is an unknown coefficient vector. For simplicity, we assume throughout the chapter that the design matrix is column normalized with $\|x_j\|_2^2 = n$.

We shall focus on penalized loss functions of the form

$$L(\beta; \lambda) = \frac{1}{2n}\|y - X\beta\|_2^2 + \sum_{j=1}^{p} \rho(|\beta_j|; \lambda), \tag{2.2}$$

where the penalty function $\rho(t; \lambda)$, indexed by $\lambda \ge 0$, is concave in $t > 0$ with $\rho(0+; \lambda) = \rho(0; \lambda) = 0$, and the index $\lambda$ is taken as the penalty level $\lim_{t \to 0+} \rho(t; \lambda)/t$. Additional regularity conditions on $\rho(\cdot\,; \cdot)$ will be described in Section 2.2. The PLSE can be defined as a statistical choice among local minimizers of the penalized loss.
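As a rough illustration of the penalized loss (2.2) and the column normalization $\|x_j\|_2^2 = n$, the following Python sketch evaluates the loss on simulated data. The function names, the toy dimensions, and the use of the absolute penalty $\lambda|t|$ as a default are assumptions made for this example only, not part of the thesis.

```python
import numpy as np

def lasso_penalty(t, lam):
    """Absolute penalty rho(t; lam) = lam * |t| (an illustrative default choice)."""
    return lam * np.abs(t)

def penalized_loss(beta, X, y, lam, penalty=lasso_penalty):
    """Penalized loss (2.2): ||y - X beta||_2^2 / (2n) + sum_j rho(|beta_j|; lam)."""
    n = X.shape[0]
    resid = y - X @ beta
    return float(resid @ resid / (2 * n) + np.sum(penalty(np.abs(beta), lam)))

def normalize_columns(X):
    """Rescale columns so that ||x_j||_2^2 = n, as assumed throughout the chapter."""
    n = X.shape[0]
    return X * (np.sqrt(n) / np.linalg.norm(X, axis=0))

# Hypothetical toy data (not from the thesis): n = 50, p = 200, three nonzero coefficients.
rng = np.random.default_rng(0)
n, p, lam = 50, 200, 0.5
X = normalize_columns(rng.standard_normal((n, p)))
beta_star = np.zeros(p)
beta_star[:3] = 2.0
y = X @ beta_star + rng.standard_normal(n)

loss_at_zero = penalized_loss(np.zeros(p), X, y, lam)     # no penalty is paid at beta = 0
loss_at_truth = penalized_loss(beta_star, X, y, lam)      # noise term plus lam * ||beta*||_1
print(loss_at_zero, loss_at_truth)
```

Passing a different `penalty` callable (for instance a concave penalty) changes only the second term of the loss.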

Among PLSE methods, the Lasso [71] with the absolute penalty $\rho(t; \lambda) = \lambda|t|$ is the most widely used and extensively studied. The Lasso is relatively easy to compute as it is a convex minimization problem, but it is well known that the Lasso is biased. A consequence of this bias is the requirement of a neighborhood stability/strong irrepresentable condition on the design matrix $X$ for the selection consistency of the Lasso [51,88,72,79]. Fan and Li [21] proposed a concave penalty to remove the bias of the Lasso and proved an oracle property for one of the local minimizers of the resulting penalized loss. Zhang [84] proposed a path finding algorithm PLUS for concave PLSE and proved the selection consistency of the PLUS-computed local minimizer under a rate optimal signal strength condition on the coefficients and the sparse Riesz condition (SRC) [85] on the design. The SRC, which requires bounds on both the lower and upper sparse eigenvalues of the Gram matrix and is closely related to the restricted isometry property (RIP) [12], is substantially weaker than the strong irrepresentable condition. This advantage of concave PLSE over the Lasso has since become well understood.

For prediction and coefficient estimation, the existing literature presents a somewhat opposite picture. Consider hard sparse coefficient vectors satisfying $|\mathrm{supp}(\beta^*)| \le s$ with small $(s/n)\log p$. Although rate minimax error bounds were proved under the RIP and SRC respectively for the Dantzig selector and Lasso in [11] and [85], Bickel et al. [6] sharpened their results by weakening the RIP and SRC to the RE condition, and van de Geer and Bühlmann [77] proved comparable prediction and $\ell_1$ estimation error bounds under an even weaker compatibility or $\ell_1$ RE condition. Meanwhile, rate minimax error bounds for concave PLSE still require two-sided sparse eigenvalue conditions like the SRC [84,87,80,22] or a proper known upper bound for the $\ell_1$ norm of the true coefficient vector [46]. It turns out that the difference between the SRC and RE conditions is quite significant, as Rudelson and Zhou [66] proved that the RE condition is a consequence of a lower sparse eigenvalue condition alone. This seems to suggest a theoretical advantage of the Lasso, in addition to its computational simplicity, compared with concave PLSE.

An interesting question is whether the RE condition alone on the design matrix is also sufficient for the above discussed results for concave penalized prediction, coefficient estimation and variable selection, provided proper conditions on the coefficient and noise vectors. An affirmative answer to this question, which we provide in this chapter, amounts to the removal of the upper sparse eigenvalue condition on the design matrix and also a relaxation of the lower sparse eigenvalue condition or the restricted strong convexity (RSC) condition [56] imposed in [46]. We also extend the prediction and estimation error bounds to smaller penalty levels $\lambda$, which are more practical and provide rate minimaxity in prediction and coefficient estimation when $(s/n)\log(p/s)$ is small.

The Lasso still enjoys computational advantages over concave PLSE. However, this advantage may not be so drastic in many applications in view of the literature on statistical and computational properties of iterative and path finding algorithms for concave penalization [25,89,84,87,9,1,32,56,80,46,22]. In this chapter, we focus on statistical properties of local solutions of concave PLSE computable by path finding algorithms, as we are also interested in the adaptive choice of the penalty level $\lambda$ in the solution path and the estimation of the noise level. Exact solution paths of the PLSE can be computed by the PLUS algorithm [84], while approximate solution paths can be computed by the gradient descent algorithm of Wang et al. [80] with a computational complexity guarantee.

Even when a local solution path of the concave penalization problem has been obtained, one still needs to choose an appropriate estimator in the solution path, or equivalently a proper penalty level. This problem, which we also study in this chapter, is equivalent to consistent estimation of the noise level due to scale invariance.

Substantial effort has been made in scale-free estimation under the $\ell_1$ penalty. The idea is to make the penalty level proportional to the noise level $\sigma$. Städler et al. [67] proposed to estimate $\beta$ and $\sigma$ by maximizing their joint log-likelihood with an $\ell_1$ penalty on $\beta/\sigma$ through reparametrization. In the discussion of [67], Antoniadis [2] proposed to minimize Huber’s [34] concomitant joint loss function with the $\ell_1$ penalty on $\beta$ without reparametrization, and Sun and Zhang [68] considered a “naive” iteration between the estimation of $\beta$ and $\sigma$ and proved the bias reduction property of one iteration from the joint estimator of [67]. Belloni et al. [5] introduced and studied a square-root Lasso for the estimation of $\beta$. It turns out that for the $\ell_1$ penalty, Huber’s concomitant joint loss, the equilibrium of the iterative algorithm, and the square-root Lasso all produce the same estimator. Sun and Zhang [69] proposed the iterative algorithm as scaled PLSE for joint estimation of $\beta$ and $\sigma$ under both the $\ell_1$ and concave penalties and studied the scaled Lasso with the joint penalized loss of [2], especially the consistency and asymptotic normality of the resulting noise level estimator. However, a theoretical study of the scaled concave PLSE is noticeably missing.
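The iteration between the estimation of $\beta$ and $\sigma$ described above can be sketched for the $\ell_1$ penalty as follows. This is a minimal illustration, not the implementation used in the thesis: the coordinate descent solver, the function names, the starting value for $\sigma$, the choice $\lambda_0 = \{(2/n)\log p\}^{1/2}$, and the simulated data are all assumptions made for the example.

```python
import numpy as np

def soft_threshold(z, lam):
    """Scalar soft-thresholding operator."""
    return np.sign(z) * max(abs(z) - lam, 0.0)

def lasso_cd(X, y, lam, beta_init, max_sweeps=500, tol=1e-8):
    """Coordinate descent for ||y - Xb||^2/(2n) + lam*||b||_1, assuming ||x_j||_2^2 = n."""
    n, p = X.shape
    beta = beta_init.copy()
    resid = y - X @ beta
    for _ in range(max_sweeps):
        max_change = 0.0
        for j in range(p):
            z = X[:, j] @ resid / n + beta[j]
            delta = soft_threshold(z, lam) - beta[j]
            if delta != 0.0:
                resid -= X[:, j] * delta
                beta[j] += delta
                max_change = max(max_change, abs(delta))
        if max_change < tol:
            break
    return beta

def scaled_lasso(X, y, lam0, max_iter=50, tol=1e-6):
    """Alternate between a Lasso fit at level sigma*lam0 and a residual-based update of sigma."""
    n, p = X.shape
    beta = np.zeros(p)
    sigma = np.linalg.norm(y) / np.sqrt(n)   # starting value: treat all of y as noise
    for _ in range(max_iter):
        beta = lasso_cd(X, y, sigma * lam0, beta)
        sigma_new = np.linalg.norm(y - X @ beta) / np.sqrt(n)
        converged = abs(sigma_new - sigma) < tol
        sigma = sigma_new
        if converged:
            break
    return beta, sigma

# Hypothetical simulated data with true noise level 1.
rng = np.random.default_rng(0)
n, p = 100, 150
X = rng.standard_normal((n, p))
X *= np.sqrt(n) / np.linalg.norm(X, axis=0)
beta_star = np.zeros(p)
beta_star[:3] = 2.0
y = X @ beta_star + rng.standard_normal(n)

beta_hat, sigma_hat = scaled_lasso(X, y, lam0=np.sqrt(2 * np.log(p) / n))
print(sigma_hat)
```

At a fixed point of the iteration the penalty level is proportional to the estimated noise level, which is what makes the procedure scale free.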


A main reason for this absence of a theoretical study of scale-free concave PLSE is the loss of the scale-free property: in the joint likelihood, the concomitant loss and the square-root formulations, it is not proper to use scale-free concave penalty functions as they are not proportional to the penalty level. While the iterative approach is still scale free with concave penalties, concave regularization is more difficult to study than the Lasso due to the loss of its equivalence to joint convex minimization.

In this chapter, we find a much weaker condition under which local solutions of concave PLSE enjoy the desired properties in prediction, coefficient estimation, and variable selection. Specifically, we prove that the concave PLSE achieves rate minimaxity in prediction and coefficient estimation under the $\ell_0$ sparsity condition on $\beta$ and the RE condition on $X$. Furthermore, selection consistency can also be guaranteed under an additional uniform signal strength condition on the nonzero coefficients. In addition, we prove that the same properties hold for the scaled concave PLSE in the iterative algorithm formulation.

The rest of this chapter is organized as follows. In Section 2.2, we study concave PLSE under the RE condition on the design. In Section 2.3, we study concave PLSE with smaller penalty/threshold levels. In Section 2.4, we study theoretical properties of the scaled concave PLSE. Section 2.5 presents results of an extensive simulation study for variance estimation. Section 2.6 contains some discussion.

Notation: We denote by $\beta^*$ the true regression coefficient vector, $\Sigma = X^T X/n$ the sample Gram matrix, $S = \mathrm{supp}(\beta^*)$ the support set of the coefficient vector, $s = |S|$ the size of the support, and $\Phi(\cdot)$ the standard Gaussian cumulative distribution function. For vectors $v = (v_1, \ldots, v_p)$, we denote by $\|v\|_q = \big(\sum_j |v_j|^q\big)^{1/q}$ the $\ell_q$ norm, with $\|v\|_\infty = \max_j |v_j|$ and $\|v\|_0 = \#\{j : v_j \neq 0\}$. Moreover, $x_+ = \max(x, 0)$.

2.2 Statistical Properties of Concave PLSE methods

In this section, we present our results for concave PLSE at a sufficiently high penalty level to allow selection consistency. We first need to describe our assumptions on the penalty function and design matrix.


2.2.1 Concave penalties

We study the class of concave penalties $\rho(t; \lambda)$ satisfying the following properties:

(i) $\rho(t; \lambda)$ is symmetric, $\rho(t; \lambda) = \rho(-t; \lambda)$;

(ii) $\rho(t; \lambda)$ is monotone, $\rho(t_1; \lambda) \le \rho(t_2; \lambda)$ for all $0 \le t_1 < t_2$;

(iii) $\rho(t; \lambda)$ is left- and right-differentiable in $t$ for all $t$;

(iv) $\rho(t; \lambda)$ has the selection property, $\dot\rho(0+; \lambda) = \lambda$;

(v) $|\dot\rho(t-; \lambda)| \vee |\dot\rho(t+; \lambda)| \le \lambda$ for all real $t$.

We write $\dot\rho(t; \lambda) = x$ when $x$ is between the left- and right-derivatives of $\rho(t; \lambda)$ at $t$, including $t = 0$ where $\dot\rho(0; \lambda) = x$ means $|x| \le \lambda$. We use the following quantities to measure the concavity of penalty functions. For a given penalty function $\rho(\cdot\,; \lambda)$, define the maximum concavity at $t$ as

$$\kappa(t; \rho, \lambda) = \sup_{t' > 0} \frac{\dot\rho(t'; \lambda) - \dot\rho(t; \lambda)}{t - t'}, \tag{2.3}$$

where the supremum is taken over all possible choices of $\dot\rho(t; \lambda)$ and $\dot\rho(t'; \lambda)$ between the left- and right-derivatives. Further, define the overall maximum concavity of $\rho(\cdot\,; \lambda)$ as

$$\kappa(\rho) = \kappa(\rho, \lambda) = \max_{t \ge 0} \kappa(t; \rho, \lambda). \tag{2.4}$$

Many popular penalties satisfy conditions (i) to (v). We illustrate the SCAD (smoothly clipped absolute deviation) penalty and MCP (minimax concave penalty) as examples. The SCAD penalty [21] is defined as

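$$\rho(t, \lambda) = \lambda \int_0^{|t|} \left\{ I(x \le \lambda) + \frac{(\gamma\lambda - x)_+}{(\gamma - 1)\lambda}\, I(x > \lambda) \right\} dx \tag{2.5}$$

with a fixed parameter $\gamma > 2$. A straightforward calculation yields $\kappa(0; \rho, \lambda) = 1/\gamma$ and $\kappa(\rho, \lambda) = 1/(\gamma - 1)$ for the SCAD penalty. The MCP [84] is defined as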

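$$\rho(t, \lambda) = \lambda \int_0^{|t|} \left(1 - \frac{x}{\lambda\gamma}\right)_+ dx \tag{2.6}$$

with $\gamma > 0$ and $\kappa(\rho, \lambda) = \kappa(0; \rho, \lambda) = 1/\gamma$.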
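The concavity values quoted for the SCAD penalty and MCP can be checked numerically. The sketch below evaluates the derivatives implied by (2.5) and by the standard form of the MCP, whose derivative is $(\lambda - t/\gamma)_+$ for $t > 0$, and approximates (2.3) and (2.4) on a grid. The function names, the grid-based approximation, and the parameter values $\lambda = 0.5$, $\gamma = 3.7$ (SCAD) and $\gamma = 3$ (MCP) are choices made for this example only.

```python
import numpy as np

def scad_deriv(t, lam, gamma):
    """Derivative of the SCAD penalty (2.5) at |t| > 0; requires gamma > 2."""
    t = np.abs(np.asarray(t, dtype=float))
    return np.where(t <= lam, lam, np.maximum(gamma * lam - t, 0.0) / (gamma - 1.0))

def mcp_deriv(t, lam, gamma):
    """Derivative of the MCP at |t| > 0: (lam - |t|/gamma)_+."""
    t = np.abs(np.asarray(t, dtype=float))
    return np.maximum(lam - t / gamma, 0.0)

def penalty_value(t, lam, gamma, deriv, n_grid=20001):
    """rho(t; lam) as the trapezoidal integral of the derivative over [0, |t|]."""
    x = np.linspace(0.0, abs(t), n_grid)
    d = deriv(x, lam, gamma)
    return float(np.sum(0.5 * (d[1:] + d[:-1]) * np.diff(x)))

def concavity_at_zero(deriv, lam, gamma, n_grid=4001):
    """Grid approximation of kappa(0; rho, lam) in (2.3), using the right-derivative lam at 0."""
    x = np.linspace(0.0, 2.0 * gamma * lam, n_grid)[1:]
    return float(np.max((lam - deriv(x, lam, gamma)) / x))

def overall_concavity(deriv, lam, gamma, n_grid=4001):
    """Grid approximation of kappa(rho, lam) in (2.4): steepest decrease of the derivative."""
    x = np.linspace(0.0, 2.0 * gamma * lam, n_grid)
    d = deriv(x, lam, gamma)
    return float(np.max(-np.diff(d) / np.diff(x)))

lam = 0.5
print(concavity_at_zero(scad_deriv, lam, 3.7), overall_concavity(scad_deriv, lam, 3.7))
print(concavity_at_zero(mcp_deriv, lam, 3.0), overall_concavity(mcp_deriv, lam, 3.0))
```

Both penalties are constant beyond $\gamma\lambda$, which is the source of their reduced bias for large coefficients relative to the absolute penalty.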

2.2.2 The restricted eigenvalue condition

We now consider conditions on the design matrix. The restricted eigenvalue (RE) condition, proposed in [6], can be viewed as nearly the weakest available condition on the design to guarantee rate optimal prediction and coefficient estimation performance of the Lasso. The RE coefficient $\mathrm{RE}_2(S; \eta, \delta_*)$ for the $\ell_2$ estimation loss can be defined as follows: for $\eta \in [0, 1)$ and $\delta_* \in [0, 1]$,

$$\mathrm{RE}_2^2(S; \eta, \delta_*) = \inf\left\{ \frac{u^T \Sigma u}{\|u\|_2^2} : (1 - \eta)\|u_{S^c}\|_1 \le (1 + \delta_*\eta)\|u_S\|_1 \right\}. \tag{2.7}$$

The RE condition refers to the property that $\mathrm{RE}_2(S; \eta, \delta_*)$ is no smaller than a certain positive constant for all $n$ and $p$. For prediction and $\ell_1$ estimation, an $\ell_1$-version of the RE can be employed. The following compatibility or $\ell_1$-RE coefficient [77] can be used:

$$\mathrm{RE}_1^2(S; \eta, \delta_*) = \inf\left\{ \frac{u^T \Sigma u\, |S|}{\|u_S\|_1^2} : (1 - \eta)\|u_{S^c}\|_1 \le (1 + \delta_*\eta)\|u_S\|_1 \right\}. \tag{2.8}$$

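We introduce a relaxed cone invertibility factor (RCIF) for prediction as

$$\mathrm{RCIF}_{\mathrm{pred}}(S; \eta, \omega) = \inf\left\{ \frac{\|\Sigma u\|_\infty^2 |S|}{u^T \Sigma u} : (1 - \eta)\|u_{S^c}\|_1 \le -\omega_S^T u_S \right\}, \tag{2.9}$$

where $\omega \in \mathbb{R}^p$, and an RCIF for the $\ell_q$ estimation, $1 \le q \le 2$, as

$$\mathrm{RCIF}_{\mathrm{est},q}(S; \eta, \omega) = \inf\left\{ \frac{\|\Sigma u\|_\infty |S|^{1/q}}{\|u\|_q} : (1 - \eta)\|u_{S^c}\|_1 \le -\omega_S^T u_S \right\}. \tag{2.10}$$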

The choices of $\delta_*$ and $\omega$ depend on the problem under consideration in the analysis, but typically we have $\|\omega\|_\infty \le 1 + \delta_*\eta$, so that the minimization in (2.9) and (2.10) is taken over a smaller cone. For example, one may take $\omega_S = 0$ for studying selection consistency. We will use an RE condition to prove cone membership of the estimation error of the concave PLSE and the RCIF to bound the prediction and coefficient estimation errors. The following proposition shows that the RCIF may provide sharper bounds than the RE does.


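Proposition 2.1. Let RE, RCIF be as in (2.7)-(2.10), $\eta \in (0, 1)$, and $\xi = (1 + \delta_*\eta)/(1 - \eta)$. If $\|\omega_S\|_\infty \le 1 + \delta_*\eta$, then

$$\begin{aligned}
\mathrm{RCIF}_{\mathrm{pred}}(S; \eta, \omega) &\ge \mathrm{RE}_1^2(S; \eta, \delta_*)/(1 + \xi)^2, \\
\mathrm{RCIF}_{\mathrm{est},1}(S; \eta, \omega) &\ge \mathrm{RE}_1^2(S; \eta, \delta_*)/(1 + \xi)^2, \\
\mathrm{RCIF}_{\mathrm{est},2}(S; \eta, \omega) &\ge \mathrm{RE}_1(S; \eta, \delta_*)\,\mathrm{RE}_2(S; \eta, \delta_*)/(1 + \xi).
\end{aligned} \tag{2.11}$$

Proof of Proposition 2.1. Since $\|\omega_S\|_\infty \le 1 + \delta_*\eta$, we have

$$(1 - \eta)\|u_{S^c}\|_1 \le -\omega_S^T u_S \le (1 + \delta_*\eta)\|u_S\|_1.$$

It then follows from $u^T \Sigma u \le \|u\|_1 \|\Sigma u\|_\infty$ and $\|u\|_1 \le (1 + \xi)\|u_S\|_1$ that

$$\frac{\|\Sigma u\|_\infty^2 |S|}{u^T \Sigma u} \ge \frac{u^T \Sigma u\, |S|}{\|u\|_1^2} \ge \frac{u^T \Sigma u\, |S|}{(1 + \xi)^2 \|u_S\|_1^2}.$$

The first inequality of (2.11) is obtained by taking the infimum over the cone $\mathscr{C}(S; \eta, \delta_*) = \{u : (1 - \eta)\|u_{S^c}\|_1 \le (1 + \delta_*\eta)\|u_S\|_1\}$. Similarly,

$$\frac{\|\Sigma u\|_\infty |S|}{\|u\|_1} \ge \frac{u^T \Sigma u\, |S|}{\|u\|_1^2} \ge \frac{u^T \Sigma u\, |S|}{(1 + \xi)^2 \|u_S\|_1^2}, \qquad \frac{\|\Sigma u\|_\infty |S|^{1/2}}{\|u\|_2} \ge \frac{u^T \Sigma u\, |S|^{1/2}}{\|u\|_1 \|u\|_2} \ge \frac{u^T \Sigma u\, |S|^{1/2}}{(1 + \xi)\|u_S\|_1 \|u\|_2}.$$

The second and third inequalities of (2.11) are obtained by taking the infimum over the cone $\mathscr{C}(S; \eta, \delta_*)$ in the above inequalities. $\square$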

2.2.3 Properties of concave PLSE

As our analysis directly allows the penalty to depend on the index $j$, we consider the following generalization of the penalized loss (2.2),

$$L(\beta; \lambda) = \frac{1}{2n}\|y - X\beta\|_2^2 + \sum_{j=1}^{p} \rho_j(\beta_j; \lambda).$$


Given penalty functions $\rho_j(\cdot\,; \cdot)$ and a penalty level $\lambda$, a vector $\hat\beta \in \mathbb{R}^p$ is a critical point of the penalized loss (2.2) if the following local Karush-Kuhn-Tucker (KKT) condition is satisfied:

$$x_j^T (y - X\hat\beta)/n = \dot\rho_j(\hat\beta_j; \lambda) \tag{2.12}$$

for a certain version of $\dot\rho_j(\hat\beta_j; \lambda)$ (between the left and right derivatives as in our convention) for every $j = 1, \ldots, p$. By property (v) of the penalty, (2.12) is well defined and $|\dot\rho_j(\hat\beta_j; \lambda)| \le \lambda$. When the penalized loss is convex in $\beta$, the local KKT condition (2.12) is necessary and sufficient for the global minimization of the penalized loss $L(\cdot\,; \lambda)$. In general, solutions of (2.12) include all local minimizers of $L(\cdot\,; \lambda)$.
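The local KKT condition (2.12) can be verified numerically for a computed solution. The sketch below fits the MCP by a plain coordinate descent, using the well-known closed-form coordinate update for the MCP under the normalization $\|x_j\|_2^2 = n$ and $\gamma > 1$, and then measures how far the result is from satisfying (2.12). Coordinate descent started at the origin is used here only for convenience; it is not the PLUS algorithm or a path following algorithm, and the function names, the data, and the values of $\lambda$ and $\gamma$ are assumptions for illustration.

```python
import numpy as np

def mcp_deriv(t, lam, gamma):
    """Signed MCP derivative sign(t) * (lam - |t|/gamma)_+ for t != 0."""
    return np.sign(t) * np.maximum(lam - np.abs(t) / gamma, 0.0)

def mcp_cd(X, y, lam, gamma, max_sweeps=2000, tol=1e-12):
    """Coordinate descent for the MCP-penalized loss, assuming ||x_j||_2^2 = n and gamma > 1."""
    n, p = X.shape
    beta = np.zeros(p)
    resid = y.copy()
    for _ in range(max_sweeps):
        max_change = 0.0
        for j in range(p):
            z = X[:, j] @ resid / n + beta[j]
            if abs(z) <= gamma * lam:
                new = np.sign(z) * max(abs(z) - lam, 0.0) / (1.0 - 1.0 / gamma)
            else:
                new = z                      # the penalty is flat beyond gamma * lam
            delta = new - beta[j]
            if delta != 0.0:
                resid -= X[:, j] * delta
                beta[j] = new
                max_change = max(max_change, abs(delta))
        if max_change < tol:
            break
    return beta

def kkt_violation(beta, X, y, lam, gamma):
    """Largest violation of (2.12): equality on nonzero coordinates, |gradient| <= lam at zeros."""
    n = X.shape[0]
    grad = X.T @ (y - X @ beta) / n
    active = beta != 0
    viol_active = np.abs(grad[active] - mcp_deriv(beta[active], lam, gamma))
    viol_zero = np.maximum(np.abs(grad[~active]) - lam, 0.0)
    return float(max(viol_active.max(initial=0.0), viol_zero.max(initial=0.0)))

# Hypothetical simulated data.
rng = np.random.default_rng(0)
n, p, gamma = 100, 200, 3.0
X = rng.standard_normal((n, p))
X *= np.sqrt(n) / np.linalg.norm(X, axis=0)
beta_star = np.zeros(p)
beta_star[:3] = 2.0
y = X @ beta_star + rng.standard_normal(n)
lam = np.sqrt(2 * np.log(p) / n)

beta_hat = mcp_cd(X, y, lam, gamma)
print(kkt_violation(beta_hat, X, y, lam, gamma), kkt_violation(np.zeros(p), X, y, lam, gamma))
```

A violation near zero indicates a critical point in the sense of (2.12); it does not by itself show that the point is a global minimizer when the penalty is concave.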

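For positive $\lambda_*$ and $\kappa_*$, consider the class of all penalty functions $\rho_j(\cdot\,; \lambda)$ with no smaller penalty level than $\lambda_*$ and no greater concavity than $\kappa_*$,

$$\mathscr{P}(\lambda_*, \kappa_*) = \Big\{ \rho_j(\cdot\,; \lambda) : \lambda \ge \lambda_*,\ \kappa(\rho_j, \lambda) \le \kappa_* \Big\}. \tag{2.13}$$

Among all local solutions for all such penalties $\rho_j(\cdot\,; \lambda)$ in $\mathscr{P}(\lambda_*, \kappa_*)$, we shall focus on the subclass $\mathscr{B}_0(\lambda_*, \kappa_*)$ of those connected to the origin through a continuous path of such solutions. Formally, let

$$\mathscr{B} = \mathscr{B}(\lambda_*, \kappa_*) = \Big\{ \hat\beta : \text{(2.12) holds with some } \rho_j(\cdot\,; \lambda) \in \mathscr{P}(\lambda_*, \kappa_*) \Big\}.$$

The class $\mathscr{B}_0(\lambda_*, \kappa_*)$ can be written as

$$\mathscr{B}_0(\lambda_*, \kappa_*) = \Big\{ \hat\beta : \hat\beta \text{ and } 0 \text{ are connected in } \mathscr{B}(\lambda_*, \kappa_*) \Big\}. \tag{2.14}$$

As $\hat\beta = 0$ is the sparsest solution, $\mathscr{B}_0$ can be viewed as the sparse branch of the solution space $\mathscr{B}$.

By definition, $\mathscr{B}_0(\lambda_*, \kappa_*)$ is the set of all local solutions computable by path following algorithms starting from the origin, with constraints $\lambda \ge \lambda_*$ and $\kappa(\rho_j, \lambda) \le \kappa_*$ on the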


local solutions connected to the origin regardless of the specific algorithms used to compute the solution and different types of penalties can be used in a single solution path. For example, the Lasso estimator belongs to the class as it is connected to the origin through the LARS algorithm [58,59,20]. The SCAD and MCP solutions belong to the class if they are computed by the PLUS algorithm [84] or by a path following algorithm from the Lasso solution.

The following theorem studies the difference between solutions $\hat\beta \in \mathscr{B}_0(\lambda_*, \kappa_*)$ and an oracle coefficient vector $\beta^o$ satisfying $\mathrm{supp}(\beta^o) \subseteq S$ under the RE condition on the design matrix. The vector $\beta^o \in \mathbb{R}^p$ can be taken as the true regression coefficient vector $\beta^*$, so that Theorem 2.1 directly yields prediction and estimation error bounds under the RE condition. Alternatively, $\beta^o$ can be taken as the oracle LSE $\hat\beta^o$ given by

$$\hat\beta^o_S = (X_S^T X_S)^{-1} X_S^T y, \qquad \hat\beta^o_{S^c} = 0, \tag{2.15}$$

with $S = \mathrm{supp}(\beta^*)$, so that Theorem 2.1 directly yields sufficient conditions for selection consistency and indirectly sharper prediction and estimation error bounds, still under the RE condition.

We consider here penalty levels no smaller than a certain $\lambda_*$ satisfying

$$\|X_{S^c}^T(y - X\beta^o)/n\|_\infty < \eta\lambda_*, \qquad \|X_S^T(y - X\beta^o)/n\|_\infty \le \eta\delta_*\lambda_*, \tag{2.16}$$

where $\eta < 1$ and $\delta_* \le 1$. When $\varepsilon = y - X\beta^* \sim N(0,\sigma^2I_{n\times n})$ and

$$\lambda_* = (\sigma/\eta)\sqrt{(2/n)\log p},$$

(2.16) holds with at least probability $1 - \sqrt{2/(\pi\log p)}$, as $x_j$ are normalized to $\|x_j\|_2 = \sqrt{n}$, provided that $\beta^o$ is either the true $\beta^*$ with $\delta_* = 1$ or the oracle LSE in (2.15) with $\delta_* = 0$. Smaller penalty levels will be considered in Section 2.3.
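The universal penalty level and the stated probability bound are simple closed-form quantities. The sketch below evaluates them; the function names are illustrative and $p > 1$ is assumed so that $\log p > 0$.

```python
import math


def universal_penalty_level(sigma, eta, n, p):
    """lambda_* = (sigma / eta) * sqrt((2 / n) * log p)."""
    if not 0.0 < eta < 1.0:
        raise ValueError("eta must lie in (0, 1)")
    return (sigma / eta) * math.sqrt(2.0 * math.log(p) / n)


def probability_lower_bound(p):
    """Stated lower bound 1 - sqrt(2 / (pi * log p)) on the probability of (2.16)."""
    return 1.0 - math.sqrt(2.0 / (math.pi * math.log(p)))
```

For example, `universal_penalty_level(1.0, 0.5, 100, 1000)` gives the penalty level for $n = 100$, $p = 1000$, unit noise level and $\eta = 1/2$.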


coefficients vector $\beta^o$ via a random vector $\omega = \omega(\beta^o,\lambda)$ with elements

$$w_j = \dot\rho_j(\beta^o_j;\lambda)/\lambda - x_j^T(y - X\beta^o)/(\lambda n). \tag{2.17}$$

The relevance of $\omega$ can be clearly seen from the definition of $\hat\beta$ in (2.12) as

$$x_j^TX(\beta^o - \hat\beta)/n = \lambda w_j + \dot\rho_j(\hat\beta_j;\lambda) - \dot\rho_j(\beta^o_j;\lambda). \tag{2.18}$$

We may choose $\omega$ to satisfy $\mathrm{supp}(\omega) \subseteq S$ in our convention as $\dot\rho_j(\beta^o_j;\lambda)$ is allowed to take any value in $[-\lambda,\lambda]$ for $\beta^o_j = 0$. However, this choice is not used in our analysis. Let $\phi_{\min}(M)$ denote the minimum eigenvalue of symmetric matrices $M$.

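Theorem 2.1. Let $\hat\beta$ be a solution of (2.12) in $\mathscr{B}_0(\lambda_*,\kappa_*)$ with penalties $\rho_j(\cdot;\lambda)$ in the class $\mathscr{P}(\lambda_*,\kappa_*)$. Suppose $\mathrm{RE}_2^2(S;\eta,\delta_*) \ge \kappa_*$ and (2.16) holds for certain $\beta^o \in \mathbb{R}^p$ and $S \supseteq \mathrm{supp}(\beta^o)$. Let $\omega$ be as in (2.17).

(i) With $\xi = (1 + \delta_*\eta)/(1 - \eta)$,

$$\|X\hat\beta - X\beta^o\|_2^2/n \le \frac{(1+\eta)^2\lambda^2|S|}{\mathrm{RCIF}_{\mathrm{pred}}(S;\eta,\omega)} \le \frac{(1+\xi)^2(1+\eta)^2\lambda^2|S|}{\mathrm{RE}_1^2(S;\eta,\delta_*)}, \tag{2.19}$$

$$\|\hat\beta - \beta^o\|_q \le \begin{cases} \dfrac{(1+\eta)\lambda|S|}{\mathrm{RCIF}_{\mathrm{est},1}(S;\eta,\omega)} \le \dfrac{(1+\xi)^2(1+\eta)\lambda|S|}{\mathrm{RE}_1^2(S;\eta,\delta_*)}, & q = 1, \\[2ex] \dfrac{(1+\eta)\lambda|S|^{1/2}}{\mathrm{RCIF}_{\mathrm{est},2}(S;\eta,\omega)} \le \dfrac{(1+\xi)(1+\eta)\lambda|S|^{1/2}}{\mathrm{RE}_1(S;\eta,\delta_*)\,\mathrm{RE}_2(S;\eta,\delta_*)}, & q = 2, \\[2ex] \dfrac{(1+\eta)\lambda|S|^{1/q}}{\mathrm{RCIF}_{\mathrm{est},q}(S;\eta,\omega)}, & q \ge 1. \end{cases} \tag{2.20}$$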

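(ii) Suppose $\max_{j\le p}\kappa(\beta^o_j;\rho_j,\lambda) \le (1 - 1/C_0)\mathrm{RE}_2^2(S;\eta,\delta_*)$. Then,

$$\|X\hat\beta - X\beta^o\|_2^2/n \le (C_0\lambda)^2\sup_{u\ne 0}\frac{\big[\omega_S^Tu_S - (1-\eta)\|u_{S^c}\|_1\big]_+^2}{u^T\Sigma u} \tag{2.21}$$

and for any seminorm $\|\cdot\|$ as a loss function

$$\|\hat\beta - \beta^o\| \le C_0\lambda\sup_{u\ne 0}\frac{\|u\|\big[\omega_S^Tu_S - (1-\eta)\|u_{S^c}\|_1\big]_+}{u^T\Sigma u}. \tag{2.22}$$

(iii) Suppose $\beta^o$ is a solution of (2.12) or equivalently $\omega_S = 0$. Then,

$$\hat\beta_{S^c} = 0 \ \text{ and } \ \mathrm{sgn}(\hat\beta_j)\mathrm{sgn}(\beta^o_j) \ge 0 \quad \forall\, j \in S. \tag{2.23}$$

If $\kappa(0;\rho_j,\lambda) < \phi_{\min}(X_S^TX_S/n)$, then

$$\mathrm{sgn}(\hat\beta) = \mathrm{sgn}(\beta^o). \tag{2.24}$$

If $\max_{j\in S}\kappa(\beta^o_j;\rho_j,\lambda) < \phi_{\min}(X_S^TX_S/n)$, then

$$\hat\beta = \beta^o. \tag{2.25}$$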

Remark 2.1. In the above theorem, one may also use a relaxed version of $\mathrm{RE}_1$ and $\mathrm{RE}_2$ with the constraint replaced by $(1-\eta)\|u_{S^c}\|_1 \le -\omega_S^Tu_S$.

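Corollary 2.1. Suppose $\dot\rho_j(t;\lambda) = 0$ for $|t| > \lambda\gamma$ and conditions of Theorem 2.1 (ii) hold with $\beta^o = \hat\beta^o$ being the oracle estimator in (2.15). Then, (2.21) and (2.22) hold with $\|\omega\|_2^2 \le (1 + \delta_*\eta)^2|S_1|$ where $S_1 = \{j \in S : |\hat\beta^o_j| \le \lambda\gamma\}$. Consequently, when $C_0^2/\mathrm{RE}_2^2(S;\eta,\delta_*) = O_P(1)$ and $\lambda \lesssim \sigma\sqrt{(\log p)/n}$,

$$\|X\hat\beta - X\hat\beta^o\|_2^2/n + \|\hat\beta - \hat\beta^o\|_2^2 = O_P(\sigma^2/n)\,|S_1|\log p,$$

implying $\hat\beta = \hat\beta^o$ when $|S_1| = 0$, and

$$\|X\hat\beta - X\beta^*\|_2^2/n + \|\hat\beta - \beta^*\|_2^2 = O_P(\sigma^2/n)\big(|S_1|\log p + |S|\big).$$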

Theorem 2.1 gives a unified treatment of penalized least squares methods, including the $\ell_1$ and concave penalties, under the RE condition on the design matrix and natural

and (2.20) match those of the state of the art for the Lasso in both the convergence rate and the regularity condition on the design, while (2.21), (2.22) and Corollary 2.1 demonstrate the advantages of concave penalization when $|S_1|$ is of smaller order than $|S|$. Moreover, the prediction and estimation error bounds in Theorem 2.1 (ii) and Corollary 2.1 directly and naturally provide selection consistency when $\omega_S = 0$ or $|S_1| = 0$. More precisely, for selection consistency, Theorem 2.1 (iii) requires only the RE condition for (2.23) and mild additional eigenvalue conditions for (2.24) and (2.25), provided the existence of an oracular solution $\beta^o$ with $\mathrm{supp}(\beta^o) \subseteq S$ or equivalently $\omega_S = 0$. Note that $\kappa(0;\rho_j,\lambda) \le \kappa_*$ and $\mathrm{RE}_2^2(S;0) \le \phi_{\min}(X_S^TX_S/n)$ by definition. For concave penalties, this condition $\omega_S = 0$ can be fulfilled by the rate-optimal signal strength condition $\min_{j\in S}|\hat\beta^o_j| > \gamma\lambda$ as in Corollary 2.1.

However, the condition $\omega_S = 0$ for the Lasso requires more restrictive $\ell_\infty$-type conditions such as the irrepresentable condition on the design. These RE-based results are new and significant as the existing theory for concave penalization, which requires substantially stronger conditions on the design such as the sparse Riesz condition in [84], leaves a false impression that the Lasso has a technical advantage in prediction and parameter estimation under the RE condition on the design. Moreover, compared with existing analysis, the proof of Theorem 2.1 is much simpler.

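For the Lasso, $\kappa_* = 0 \le \mathrm{RE}_2(S;\eta,\delta_*)$ always holds and $C_0 = 1$ in Theorem 2.1 (ii), which implies the following corollary due to $\|\omega_S\|_\infty \le 1 + \eta$.

Corollary 2.2. Let $\hat\beta$ be the Lasso estimator. If (2.16) holds for a coefficient vector $\beta^o \in \mathbb{R}^p$ with $S \supseteq \mathrm{supp}(\beta^o)$, then

$$\frac{\|X\hat\beta - X\beta^o\|_2^2}{n(1+\eta)^2\lambda^2} \le \sup_{u\ne 0}\frac{\psi^2(u)}{u^T\Sigma u} \le \max\left\{\frac{|S|(1 - 1/\xi)^2}{\mathrm{RE}_1^2(S;\eta,\delta_*)},\ \frac{|S|}{\mathrm{RE}_1^2(S;0)}\right\}$$

with $\psi(u) = \big[\|u_S\|_1 - \|u_{S^c}\|_1/\xi\big]_+$ and $\xi = (1+\eta)/(1-\eta)$, and

$$\frac{\|\hat\beta - \beta^o\|_2}{(1+\eta)\lambda} \le \sup_{u\ne 0}\frac{\|u\|_2\,\psi(u)}{u^T\Sigma u} \le \max\left\{\frac{|S|^{1/2}(1 - 1/\xi)}{\mathrm{RE}_{1,2}^2(S;\eta,\delta_*)},\ \frac{|S|^{1/2}}{\mathrm{RE}_{1,2}^2(S;0)}\right\},$$

where $\mathrm{RE}_{1,2}(S;\eta,\delta_*) = \{\mathrm{RE}_2(S;\eta,\delta_*)\mathrm{RE}_1(S;\eta,\delta_*)\}^{1/2} \ge \mathrm{RE}_2^2(S;\eta,\delta_*)$.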


Corollary 2.2 are the sharpest possible based on the basic inequality $u^T\Sigma u \le \psi(u)$ for $u = (\hat\beta - \beta^o)/\lambda$. For example, it is strictly sharper than the familiar prediction error bound in [77],

$$\|X\hat\beta - X\beta^o\|_2^2/n \le (1+\eta)^2\lambda^2|S|/\mathrm{RE}_1^2(S;\eta,\delta_*),$$

when $\mathrm{RE}_1^2(S;\eta,\delta_*) < \mathrm{RE}_1^2(S;0)$.

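To prove Theorem 2.1, we first present the following lemma.

Lemma 2.1. Let $S \subset \{1,\ldots,p\}$, $\lambda > 0$, $\hat\beta$ be a solution of (2.12), and $\beta^o$ a coefficient vector satisfying $\mathrm{supp}(\beta^o) \subseteq S$. Let $h = \hat\beta - \beta^o$, $\omega$ be as in (2.17) and $z = X^T(y - X\beta^o)/n$. Then,

$$h^T\Sigma h = \sum_j h_j\big\{\dot\rho_j(\beta^o_j;\lambda) - \dot\rho_j(\hat\beta_j;\lambda) - \lambda w_j\big\} \le \sum_{j\in S^c}\big(z_jh_j - \lambda|h_j|\big) - \lambda\omega_S^Th_S + \sum_j\kappa(\beta^o_j;\rho_j,\lambda)h_j^2 \tag{2.26}$$

and $|w_j| \le 1 + |z_j|/\lambda$ for $j \in S$, where $\kappa(\beta^o_j;\rho_j,\lambda)$ is as in (2.3).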

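Proof of Lemma 2.1. Recall that $h = \hat\beta - \beta^o$ and $z = X^T(y - X\beta^o)/n$ with a $\beta^o$ satisfying $\mathrm{supp}(\beta^o) \subseteq S$. For $j \in S^c$, $h_j = \hat\beta_j$, so that by (2.17)

$$\begin{aligned} h_j\big\{\dot\rho_j(\beta^o_j;\lambda) - \dot\rho_j(\hat\beta_j;\lambda) - \lambda w_j\big\} &= \hat\beta_j\big\{z_j - \dot\rho_j(\hat\beta_j;\lambda)\big\} \\ &\le \hat\beta_jz_j - |\hat\beta_j|\,\dot\rho_j(|\hat\beta_j|;\lambda) \\ &\le \hat\beta_jz_j - \lambda|\hat\beta_j| + \big(|\hat\beta_j| - 0\big)\big\{\dot\rho_j(0+;\lambda) - \dot\rho_j(|\hat\beta_j|;\lambda)\big\} \\ &\le z_jh_j - \lambda|h_j| + \kappa(0;\rho_j,\lambda)h_j^2. \end{aligned} \tag{2.27}$$

For $j \in S$,

$$h_j\big\{\dot\rho_j(\beta^o_j;\lambda) - \dot\rho_j(\hat\beta_j;\lambda) - \lambda w_j\big\} \le -\lambda w_jh_j + \kappa(\beta^o_j;\rho_j,\lambda)h_j^2. \tag{2.28}$$

Summing the above inequalities over $j$, we find via (2.18) that (2.26) holds. Moreover, by the definition of $\omega$ in (2.17), $|w_j| \le 1 + |z_j|/\lambda$ for $j \in S$. $\square$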


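Proposition 2.2. Let $S \supseteq \mathrm{supp}(\beta^o)$, $\{\eta,\delta_*\}$ be as in (2.16),

$$\mathscr{C}(S;\eta,\delta_*) = \big\{u : (1-\eta)\|u_{S^c}\|_1 \le (1+\delta_*\eta)\|u_S\|_1\big\},$$

and $\mathscr{B}^*_0(\lambda_*,\kappa_*) = \mathscr{B}(\lambda_*,\kappa_*) \cap \{\beta^o + \mathscr{C}(S;\eta,\delta_*)\}$ be the set of all solutions $\hat\beta$ of (2.12) with penalties in $\mathscr{P}(\lambda_*,\kappa_*)$ and estimation error $\hat\beta - \beta^o$ in the cone $\mathscr{C}(S;\eta,\delta_*)$. Let $\hat\beta \in \mathscr{B}^*_0(\lambda_*,\kappa_*)$ with penalty level $\lambda$ and $\tilde\beta \in \mathscr{B}(\lambda_*,\kappa_*)$ with penalty level $\tilde\lambda$. Suppose $\mathrm{RE}_2^2(S;\eta,\delta_*) \ge \kappa_*$ and (2.16) holds. Let $\epsilon_1 = (\eta - \|z_{S^c}\|_\infty/\lambda_*)/2$, $\epsilon_2 = \epsilon_1/(2\kappa_*)$ and $\epsilon_0 = \min\{\epsilon_2, \epsilon_1\epsilon_2/(1+\eta)\}$. Then,

$$\big\|(\tilde\beta - \beta^o)/\tilde\lambda - (\hat\beta - \beta^o)/\lambda\big\|_1 \le \epsilon_0 \ \Rightarrow\ \tilde\beta \in \mathscr{B}^*_0(\lambda_*,\kappa_*).$$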
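Membership in the cone used in Proposition 2.2 is a one-line inequality between $\ell_1$ norms on $S$ and its complement. The following sketch checks it for a given vector; the function name and the representation of $S$ as a collection of zero-based indices are illustrative assumptions.

```python
def in_cone(u, S, eta, delta_star):
    """Check (1 - eta) * ||u_{S^c}||_1 <= (1 + delta_star * eta) * ||u_S||_1."""
    S = set(S)
    l1_on_S = sum(abs(x) for j, x in enumerate(u) if j in S)
    l1_off_S = sum(abs(x) for j, x in enumerate(u) if j not in S)
    return (1.0 - eta) * l1_off_S <= (1.0 + delta_star * eta) * l1_on_S
```

The zero vector always passes the check, which is the fact used when a solution path starts from the origin.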

Proposition 2.2 asserts that among general solutions $\hat\beta$ of (2.12) in $\mathscr{B}(\lambda_*,\kappa_*)$, those with the normalized error $(\hat\beta - \beta^o)/\lambda$ inside the cone $\mathscr{C}(S;\eta,\delta_*)$ and outside the cone are separated by $\epsilon_0$ in the $\ell_1$ distance of the normalized error. Thus, if $\hat\beta^{(t)}$ is a sequence of such solutions with penalty levels $\lambda^{(t)}$ such that the normalized errors $u^{(t)} = (\hat\beta^{(t)} - \beta^o)/\lambda^{(t)}$ have small $\ell_1$ increments, $\|u^{(t)} - u^{(t-1)}\|_1 \le \epsilon_0$, then $u^{(t)}$ are either all in the cone $\mathscr{C}(S;\eta,\delta_*)$ or all outside the cone. In particular, Proposition 2.2 implies that the solutions $\hat\beta \in \mathscr{B}_0(\lambda_*,\kappa_*)$ have the cone property $\hat\beta - \beta^o \in \mathscr{C}(S;\eta,\delta_*)$, or equivalently $\mathscr{B}_0(\lambda_*,\kappa_*) \subseteq \beta^o + \mathscr{C}(S;\eta,\delta_*)$, as $\lambda \oplus \hat\beta$ is connected to $\lambda^{(0)} \oplus 0$ through a continuous path and the origin $0$ has the cone property.

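Proof of Proposition 2.2. Let $u = (\hat\beta - \beta^o)/\lambda$ and $v = (\tilde\beta - \beta^o)/\tilde\lambda$. We want to prove that

$$\|u - v\|_1 \le \epsilon_0 \ \text{ and } \ u \in \mathscr{C}(S;\eta,\delta_*) \ \text{ imply } \ v \in \mathscr{C}(S;\eta,\delta_*). \tag{2.29}$$

By the definition of $\epsilon_1$ and condition (2.16), we have $\epsilon_1 > 0$. As $\kappa(\beta^o_j;\rho_j,\lambda) \le \kappa_*$ and $\|z_{S^c}\|_\infty/\lambda \le \|z_{S^c}\|_\infty/\lambda_* \le \eta - 2\epsilon_1$, Lemma 2.1 implies that

$$\begin{aligned} u^T\Sigma u + \big(1 - \eta + 2\epsilon_1\big)\|u_{S^c}\|_1 &\le -\omega_S^Tu_S + \big\{\max_j\kappa(\beta^o_j;\rho_j,\lambda)\big\}\|u\|_2^2 \\ &\le \big(1 + \delta_*\eta\big)\|u_S\|_1 + \kappa_*\|u\|_2^2 \end{aligned} \tag{2.30}$$

and that the same inequalities also hold for $v$. Recall that $\epsilon_2 = \epsilon_1/(2\kappa_*)$. If $\|u\|_1 \le \epsilon_2$ and $\|v - u\|_1 \le \epsilon_2$, then $\|v\|_1 \le \epsilon_1/\kappa_*$, so that the $v$-version of (2.30) implies

$$\big(1 - \eta + 2\epsilon_1\big)\|v_{S^c}\|_1 \le \big(1 + \delta_*\eta\big)\|v_S\|_1 + \kappa_*\|v\|_1^2 \le \big(1 + \delta_*\eta\big)\|v_S\|_1 + \epsilon_1\|v\|_1,$$

or equivalently

$$\big(1 - \eta + \epsilon_1\big)\|v_{S^c}\|_1 \le \big(1 + \delta_*\eta + \epsilon_1\big)\|v_S\|_1,$$

which then implies $v \in \mathscr{C}(S;\eta,\delta_*)$. Because $u \in \mathscr{C}(S;\eta,\delta_*)$, we have $\kappa_*\|u\|_2^2 \le u^T\Sigma u$ by the RE condition, so that (2.30) implies

$$\big(1 + 2\epsilon_1 - \eta\big)\|u_{S^c}\|_1 \le \big(1 + \delta_*\eta\big)\|u_S\|_1.$$

Due to $(1 + \epsilon_1 - \eta)/(1 - \epsilon_1 + \delta_*\eta) \le (1 + 2\epsilon_1 - \eta)/(1 + \delta_*\eta)$, we have

$$\big(1 + \epsilon_1 - \eta\big)\|u_{S^c}\|_1 \le \big(1 - \epsilon_1 + \delta_*\eta\big)\|u_S\|_1.$$

If $\|u\|_1 > \epsilon_2$ and $\|v - u\|_1 \le \epsilon_1\epsilon_2/(1 + \delta_*\eta)$, then $v \in \mathscr{C}(S;\eta,\delta_*)$ follows from

$$\begin{aligned} \big(1 - \eta\big)\|v_{S^c}\|_1 - \big(1 + \delta_*\eta\big)\|v_S\|_1 &\le \big(1 - \eta\big)\|u_{S^c}\|_1 - \big(1 + \delta_*\eta\big)\|u_S\|_1 + (1 + \delta_*\eta)\|v - u\|_1 \\ &\le \big(1 - \eta\big)\|u_{S^c}\|_1 - \big(1 + \delta_*\eta\big)\|u_S\|_1 + \epsilon_1\|u\|_1 \\ &\le 0. \end{aligned}$$

Thus, (2.29) holds with $\epsilon_0 = \min\{\epsilon_2, \epsilon_1\epsilon_2/(1+\eta)\}$. $\square$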

Proof of Theorem 2.1. Let $h = \hat\beta - \beta^o$ and $u = h/\lambda$. It follows from Proposition 2.2 that $u \in \mathscr{C}(S;\eta,\delta_*)$ as $\lambda \oplus \hat\beta$ is connected to $\lambda^{(0)} \oplus 0$ through a continuous path and the

$(1 - 1/C_0)\mathrm{RE}(S,\eta)$. As $u \in \mathscr{C}(S;\eta,\delta_*)$, $u^T\Sigma u \ge \mathrm{RE}(S,\eta)\|u\|_2^2$, so that by (2.30)

$$C_0^{-1}u^T\Sigma u + (1-\eta)\|u_{S^c}\|_1 \le -\omega_S^Tu_S \le \|\omega_S\|_2\|u\|_2. \tag{2.31}$$

This immediately implies (2.21) and (2.22) with $u = (\hat\beta - \beta^o)/\lambda$. For (2.19) and (2.20), we set $C_0 = \infty$. However, by the definition of RCIF,

$$\mathrm{RCIF}_{\mathrm{pred}}(S;\eta,\omega)\,u^T\Sigma u \le \|\Sigma u\|_\infty^2|S|.$$

Consequently, the first inequality in (2.19) follows from the fact that

$$\|\Sigma u\|_\infty \le \|X^T(y - X\hat\beta)/n\|_\infty/\lambda + \|X^T(y - X\beta^o)/n\|_\infty/\lambda \le 1 + \eta,$$

and the second inequality follows from the first inequality in (2.11). Similarly, the first inequality in (2.20) follows from

$$\mathrm{RCIF}_{\mathrm{est},q}(S;\eta,\omega)\|u\|_q \le \|\Sigma u\|_\infty|S|^{1/q} \le (1+\eta)|S|^{1/q},$$

and the second follows from the second and third inequalities in (2.11).

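Finally we consider selection consistency under the assumption $\omega_S = 0$. In this case, $\beta^o$ is a solution of (2.12), and $\|h_{S^c}\|_1 = 0$ by (2.31). Moreover, because both $\beta^o$ and $\hat\beta$ are solutions of (2.12) with support in $S$,

$$\kappa_*\|h_S\|_2^2 \le h_S^T\Sigma_Sh_S = -\sum_{j\in S}h_j\big\{\dot\rho_j(\hat\beta_j;\lambda) - \dot\rho_j(\beta^o_j;\lambda)\big\} \le \sum_{j\in S}\kappa(\beta^o_j;\rho_j,\lambda)h_j^2.$$

As $\kappa(\beta^o_j;\rho_j,\lambda) \le \kappa_*$, the maximum concavity is attained above at every $j \in S$ in the sense of $-\dot\rho_j(\hat\beta_j;\lambda) + \dot\rho_j(\beta^o_j;\lambda) = \kappa_*(\hat\beta_j - \beta^o_j)$ for all $j \in S$ with $h_j \ne 0$. This is possible only when $\mathrm{sgn}(\hat\beta_j)\mathrm{sgn}(\beta^o_j) \ge 0$ for all $j \in S$. Furthermore, $\mathrm{sgn}(\hat\beta_j)\mathrm{sgn}(\beta^o_j) > 0$ for all $j \in S$ when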

2.3 Smaller penalty levels

We have studied in Section 2.2 exact solutions of (2.12) for penalty levels $\lambda \ge \lambda_*$ in the event where $\lambda_*$ is a strict upper bound of the supremum norm of the random vector $z = X^T(y - X\beta^o)/n$ as in (2.16). Such penalty or threshold levels are commonly used in the literature to study regularized methods in high-dimensional regression. However, this choice is quite conservative and often yields poor numerical results. In this section, we consider smaller penalty levels under somewhat stronger RE conditions on the design.

2.3.1 Smaller penalty levels

We consider penalty levels $\lambda$ which control a sparse $\ell_2$ norm of a truncated $z = X^T(y - X\beta^o)/n$, instead of the larger $\ell_\infty$ norm of $z$. For $q \in [1,\infty]$ and $t > 0$, the sparse $\ell_q$ norm is defined as

$$\|v\|_{(q,t)} = \max_{J\subset\{1,\ldots,p\},\ |J| < t+1}\|v_J\|_q.$$
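Because the $\ell_q$ norm of $v_J$ can only grow when $J$ is enlarged or an entry is replaced by one of larger magnitude, the maximum is attained at the largest entries of $|v|$. A minimal sketch of this computation follows; the function name is illustrative.

```python
import math


def sparse_norm(v, q, t):
    """Sparse l_q norm: the largest l_q norm of v_J over index sets with |J| < t + 1."""
    if t <= 0:
        raise ValueError("t must be positive")
    # ceil(t) is the largest integer strictly below t + 1.
    max_size = min(len(v), math.ceil(t))
    top = sorted((abs(x) for x in v), reverse=True)[:max_size]
    if not top:
        return 0.0
    if math.isinf(q):
        return top[0]
    return sum(a ** q for a in top) ** (1.0 / q)
```

For instance, with `v = [3, -4, 1]`, `q = 2` and `t = 2`, only the entries 3 and -4 enter the norm.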

To control the effect of the noise, we consider penalty levels $\lambda \ge \lambda_*$ with a minimum penalty level $\lambda_*$ such that

$$\big\|(|z| - \eta_0\lambda_*)_+\big\|_{(2,m)} = \sup_{|J|=m}\Big\{\sum_{j\in J}\big(|z_j| - \eta_0\lambda_*\big)_+^2\Big\}^{1/2} < \eta_1m^{1/2}\lambda_* \tag{2.32}$$

happens with high probability for certain positive numbers $\eta_0$ and $\eta_1$ satisfying $\eta_0 + \eta_1 < 1$ and a positive integer $m$. It is clear that (2.16) implies (2.32) with $\eta = \eta_0 + \eta_1$ and $m = 1$.
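For a given realization of $z$, condition (2.32) can be evaluated directly: threshold $|z|$ at $\eta_0\lambda_*$, keep the $m$ largest excesses, and compare their $\ell_2$ norm with $\eta_1m^{1/2}\lambda_*$. The sketch below does this; the function name and argument names are illustrative.

```python
import math


def satisfies_condition_2_32(z, lam_star, eta0, eta1, m):
    """Check the sparse l_2 norm condition on the excesses (|z_j| - eta0 * lam_star)_+."""
    excess = sorted((max(abs(zj) - eta0 * lam_star, 0.0) for zj in z), reverse=True)
    # The supremum over |J| = m is attained at the m largest excesses.
    lhs = math.sqrt(sum(e * e for e in excess[:m]))
    return lhs < eta1 * math.sqrt(m) * lam_star
```

Increasing $m$ relaxes the right-hand side by a factor $m^{1/2}$, which is how a few large coordinates of $z$ can be tolerated at a penalty level below the universal one.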

As properties of the Lasso have been considered in [70] under penalty levels $\lambda > \lambda_*$ with the smaller $\lambda_*$ in (2.32), the results in this subsection can be viewed as an extension of their results to general solutions of (2.12) in the set $\mathscr{B}_0(\lambda_*,\kappa_*)$ in (2.14).

With $\eta = \eta_0 + \eta_1$ and $z = X^T(y - X\beta^o)/n$, define

˜


be the set of indexes of large $|z_j|$. Main consequences of (2.32) are

$$|\tilde S| < m, \qquad \sum_{j\in\tilde S}\big(|z_j| - \eta_0\lambda_*\big)_+^2 < m(\eta_1\lambda_*)^2, \qquad \|z_{\tilde S^c}\|_\infty < \eta\lambda_*. \tag{2.34}$$

These properties can be used to prove

$$\|X\hat\beta - X\beta^o\|_2^2/n \lesssim \big(\|\omega_S\|_2^2 + m\big)\lambda^2, \tag{2.35}$$

with $\|\omega_S\|_2^2 \lesssim |S|$ in the worst case scenario, and parallel estimation error bounds under a certain RE-type condition. See [70] and Subsection 3.3.

Consider Gaussian noise $\varepsilon = y - X\beta^* \sim N(0,\sigma^2I_{n\times n})$. Let $L_1(t) = \Phi^{-1}(1 - t)$ be the standard normal negative quantile function. Sun and Zhang [70] proved that when $\beta^o$ is the true coefficient vector, $\beta^o = \beta^*$, (2.32) holds with at least probability $1 - \epsilon$ under the conditions

$$\eta_0\lambda_* = (\sigma/n^{1/2})L_1(k/p), \tag{2.36}$$

$$\frac{\eta_1}{\eta_0} > \left(\frac{4k/m}{L_1^4(k/p) + 2L_1^2(k/p)}\right)^{1/2} + \frac{L_1(\epsilon/p)}{L_1(k/p)}\left(\frac{\kappa_+(m)}{m}\right)^{1/2},$$

where $\kappa_+(m) = \max\{u^T\Sigma u : \|u\|_0 = m, \|u\|_2 = 1\}$ is the upper sparse eigenvalue of $\Sigma$. A conservative choice of $k$ is to take

$$k = L_1^4(k/p) + 2L_1^2(k/p) \tag{2.37}$$

as in [70], giving $m = O(1)$ in prediction and estimation error bounds. However, by (2.35), larger $k$ can be taken without changing the order of error bounds as long as $m \lesssim \|\omega_S\|_2^2$.
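Equation (2.37) defines $k$ only implicitly. As a hedged sketch, it can be solved by bisection, since $k - L_1^4(k/p) - 2L_1^2(k/p)$ is increasing in $k$ on $(0, p/2)$. The snippet uses `statistics.NormalDist` from the Python standard library for the normal quantile; the function names, the bracketing interval and the iteration count are illustrative choices, not taken from [70].

```python
import math
from statistics import NormalDist


def neg_quantile(t):
    """L_1(t): the x with P(N(0,1) > x) = t, computed as -Phi^{-1}(t)."""
    return -NormalDist().inv_cdf(t)


def conservative_k(p, iterations=200):
    """Solve k = L_1(k/p)^4 + 2 * L_1(k/p)^2 by bisection on (0, p/2)."""
    def gap(k):
        quantile = neg_quantile(k / p)
        return k - (quantile ** 4 + 2.0 * quantile ** 2)

    lo, hi = 1e-6, p / 2.0  # gap(lo) < 0 < gap(hi)
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if gap(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def smaller_penalty_level(sigma, n, p, eta0, k):
    """lambda_* from (2.36): eta0 * lambda_* = (sigma / sqrt(n)) * L_1(k / p)."""
    return sigma * neg_quantile(k / p) / (eta0 * math.sqrt(n))
```

The resulting level can be compared with the universal level $(\sigma/\eta)\sqrt{(2/n)\log p}$ to see how much smaller the penalty becomes for a given $p$.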

2.3.2 RE-type conditions for smaller penalty levels

When a smaller penalty level is taken, a lower level of regularization is imposed on the estimator $\hat\beta$, so that the estimation error $h = \hat\beta - \beta^o$ may fail the condition $(1-\eta)\|h_{S^c}\|_1 \le$

(2.32), we can still prove the membership of the error $h$ in the following larger cone,

$$\mathscr{U}(S,\eta_0,\eta_1,m) = \Big\{u : \big(1-\eta\big)\|u_{S^c}\|_1 \le (1+\eta)\|u_S\|_1 + \eta_1\big(m^{1/2}\|u_{\tilde S}\|_2 - \|u_{\tilde S}\|_1\big)\Big\}$$

with $\eta = \eta_0 + \eta_1 < 1$ and the set $\tilde S$ in (2.33). This will be verified in the proof of Theorem 2.2 but can be also vaguely seen from (2.34). Consequently, the restricted eigenvalue is defined in the larger cone as

$$\overline{\mathrm{RE}}_2(S;\eta_0,\eta_1,m) = \inf\left\{\frac{(u^T\Sigma u)^{1/2}}{\|u\|_2} : 0 \ne u \in \mathscr{U}(S,\eta_0,\eta_1,m)\right\}. \tag{2.38}$$

When $m = 1$, $\tilde S = \emptyset$ and the restricted eigenvalue (2.38) coincides with the original RE as defined in (2.7). Although (2.38) is a random variable due to its dependence on $\tilde S$ (even for deterministic designs), it is no smaller than

$$\overline{\mathrm{RE}}_{*,2}(S;\eta,m) = \min_{|T\setminus S| < m}\inf\left\{\frac{(u^T\Sigma u)^{1/2}}{\|u\|_2} : \|u_{T^c}\|_1 < \xi|T|^{1/2}\|u_T\|_2\right\} \tag{2.39}$$

due to $|\tilde S| < m$ in (2.34), where $\xi = (1+\eta)/(1-\eta)$.

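Similarly, we extend the relaxed cone invertibility factor (RCIF) as

$$\overline{\mathrm{RCIF}}_{\mathrm{pred}}(S;\eta_0,\eta_1,m) = \inf\left\{\frac{\|\Sigma u\|_c^2\,s^*}{u^T\Sigma u} : u \in \mathscr{U}(S;\eta_0,\eta_1,m)\right\},$$

$$\overline{\mathrm{RCIF}}_{\mathrm{est},q}(S;\eta_0,\eta_1,m) = \inf\left\{\frac{\|\Sigma u\|_c\,(s^*)^{1/q}}{\|u\|_q} : u \in \mathscr{U}(S,\eta_0,\eta_1,m)\right\}, \tag{2.40}$$

where $s^* = \max\{|S|, |\tilde S|\}$ represents a potentially lower level of sparsity due to possible selection of variables outside $S$ and $\|\cdot\|_c$ is a combination of the $\ell_2$ norm on $\tilde S$ and the $\ell_\infty$ norm on $\tilde S^c$ defined as

$$\|v\|_c = \max\big\{\|v_{\tilde S}\|_2/m^{1/2},\ \|v_{\tilde S^c}\|_\infty\big\}.$$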
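The combination norm is straightforward to compute once $\tilde S$ is known. A minimal sketch, with an illustrative function name and $\tilde S$ passed as a collection of zero-based indices:

```python
import math


def combination_norm(v, S_tilde, m):
    """max{ ||v_{S_tilde}||_2 / sqrt(m), ||v_{S_tilde^c}||_inf }."""
    S_tilde = set(S_tilde)
    l2_part = math.sqrt(sum(x * x for j, x in enumerate(v) if j in S_tilde)) / math.sqrt(m)
    sup_part = max((abs(x) for j, x in enumerate(v) if j not in S_tilde), default=0.0)
    return max(l2_part, sup_part)
```

With an empty $\tilde S$ and $m = 1$ the function returns the ordinary $\ell_\infty$ norm of $v$.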

When $m = 1$, the combination norm coincides with the $\ell_\infty$ norm and the new RCIFs for prediction and estimation coincide with those in (2.7) and (2.10) respectively.

2.3.3 Prediction and estimation errors bounds at smaller penalty levels

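Theorem 2.2. Let $\hat\beta$ be a solution of (2.12) in $\mathscr{B}_0(\lambda_*,\kappa_*)$ with penalties $\rho_j(\cdot;\lambda) \in \mathscr{P}(\lambda_*,\kappa_*)$. Let $\eta = \eta_0 + \eta_1 < 1$ with positive $\eta_0$ and $\eta_1$, $m$ be a positive integer, $\tilde S$ as in (2.33) with a certain $\beta^o \in \mathbb{R}^p$, $S \supseteq \mathrm{supp}(\beta^o)$, and $s^* = \max\{|S|, |\tilde S|\}$. Suppose $\overline{\mathrm{RE}}_2^2(S;\eta_0,\eta_1,m) \ge \kappa_*$ and (2.32) holds. Then,

$$\|X\hat\beta - X\beta^*\|_2^2/n \le \frac{\{(1+\eta)\lambda\}^2s^*}{\overline{\mathrm{RCIF}}_{\mathrm{pred}}(S;\eta_0,\eta_1,m)} \le \frac{\{(1+\eta)\xi_1\lambda\}^2s^*}{\overline{\mathrm{RE}}_2^2(S;\eta_0,\eta_1,m)} \tag{2.41}$$

with $\xi_1 = \big[2(|S|/s^*)^{1/2} + (1-\eta_0)(m/s^*)^{1/2}\big]/\big(1-\eta\big)$, and

$$\|\hat\beta - \beta^*\|_q \le \begin{cases} \dfrac{(1+\eta)\lambda(s^*)^{1/2}}{\overline{\mathrm{RCIF}}_{\mathrm{est},2}(S;\eta_0,\eta_1,m)} \le \dfrac{(1+\eta)\xi_1\lambda(s^*)^{1/2}}{\overline{\mathrm{RE}}_2^2(S;\eta_0,\eta_1,m)}, & q = 2, \\[2ex] \dfrac{(1+\eta)\lambda(s^*)^{1/q}}{\overline{\mathrm{RCIF}}_{\mathrm{est},q}(S;\eta_0,\eta_1,m)}, & \forall q \in [1,2]. \end{cases} \tag{2.42}$$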

Remark 2.2. When $m \asymp s$, $k$ is on the same order $k \asymp m$. The penalty level $\lambda_*$ in (2.37) is on the order $\sqrt{(2/n)\log(p/k)}$. Theorem 2.2 guarantees that the prediction and $\ell_2$ estimation errors are on the order

$$\|X\hat\beta - X\beta^*\|_2^2/n + \|\hat\beta - \beta^*\|_2^2 \asymp (m/n)\log(p/m) \asymp (s/n)\log(p/s).$$

This matches the minimax prediction and $\ell_2$ estimation rate of the Slope in Bellec et al. [4].

As an extension of Theorem 2.1 (i), Theorem 2.2 provides prediction and estimation error bounds in the same form for smaller penalty levels with somewhat smaller RCIF and RE. However, the approach does not provide a full extension of Theorem 2.1 in several aspects. Due to the use of the $\ell_2$ norm in condition (2.32), the $\ell_q$ estimation error bound can be extended only for $1 \le q \le 2$, and the compatibility coefficient cannot be used to bound the prediction and $\ell_1$ errors. In addition, solutions of (2.12) are not selection consistent at

We have considered so far solutions of (2.12) in the main branch of the solution space $\mathscr{B}_0(\lambda_*,\kappa_*)$ in (2.14). Such solutions are computable by path finding algorithms. In fact, as discussed below Proposition 2.2, our analysis is also applicable if $(\hat\beta - \beta^o)/\lambda$ is connected to a cone through a discrete sequence of such normalized errors in small $\ell_1$ increments. Statistical and computational properties of iterative discrete solution paths have been studied in [80] among others. However, compared with Theorems 2.1 and 2.2, [80] requires upper sparse eigenvalue conditions on $X$ and larger penalty levels satisfying (2.16).

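Proof of Theorem 2.2. Let $h = \hat\beta - \beta^o$. Recall that Lemma 2.1 gives

$$\begin{aligned} h^T\Sigma h &\le \sum_{j\in S^c}\big(z_jh_j - \lambda|h_j|\big) - \lambda\omega_S^Th_S + \sum_j\kappa(\beta^o_j;\rho_j,\lambda)h_j^2 \\ &= h^Tz - \lambda\|h_{S^c}\|_1 - \sum_{j\in S}h_j\dot\rho_j(\beta^o_j;\lambda) + \sum_j\kappa(\beta^o_j;\rho_j,\lambda)h_j^2, \end{aligned} \tag{2.43}$$

where $z = X^T(y - X\beta^o)/n$, $w_j = \dot\rho_j(\beta^o_j;\lambda)/\lambda - z_j/\lambda$, and $\kappa(t;\rho_j,\lambda)$ is as in (2.3). Let $\epsilon_1 = \min\big\{\eta - \|z_{\tilde S^c}\|_\infty/\lambda_*,\ \eta_1 - \|(|z_{\tilde S}| - \eta_0\lambda_*)_+\|_2/(m^{1/2}\lambda_*)\big\}$ and $T \supseteq \mathrm{supp}(z)$. By (2.34), $\epsilon_1 > 0$ and

$$\begin{aligned} |h^Tz| &\le (\eta - \epsilon_1)\lambda_*\|h_{T\setminus\tilde S}\|_1 + (\eta_0 - \epsilon_1)\lambda_*\|h_{\tilde S}\|_1 + \sum_{j\in\tilde S}|h_j|\big(|z_j| - (\eta_0 - \epsilon_1)\lambda_*\big)_+ \\ &\le \eta\lambda_*\|h_{T\setminus\tilde S}\|_1 + \eta_0\lambda_*\|h_{\tilde S}\|_1 - \epsilon_1\lambda_*\|h_T\|_1 + \Big(\|(|z_{\tilde S}| - \eta_0\lambda_*)_+\|_2 + \epsilon_1m^{1/2}\lambda_*\Big)\|h_{\tilde S}\|_2 \\ &\le (\eta - \epsilon_1)\lambda_*\|h_T\|_1 + \eta_1\lambda_*\big(m^{1/2}\|h_{\tilde S}\|_2 - \|h_{\tilde S}\|_1\big). \end{aligned} \tag{2.44}$$

Let $u = h/\lambda$ with $\lambda \ge \lambda_*$. Combining (2.43) and (2.44), we have

$$u^T\Sigma u + \epsilon_1\|u\|_1 + (1-\eta)\|u_{S^c}\|_1 \le (1+\eta)\|u_S\|_1 + \eta_1\big(m^{1/2}\|u_{\tilde S}\|_2 - \|u_{\tilde S}\|_1\big) + \kappa_*\|u\|_2^2. \tag{2.45}$$

The above inequality holds for all $u = (\hat\beta - \beta^o)/\lambda$ as long as $\hat\beta \in \mathscr{B}(\lambda_*,\kappa_*)$. As in the proof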

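that for all such $u$,

$$(1-\eta)\|u_{S^c}\|_1 \le (1+\eta)\|u_S\|_1 + \eta_1\big(m^{1/2}\|u_{\tilde S}\|_2 - \|u_{\tilde S}\|_1\big),$$

so that $u \in \mathscr{U}(S,\eta_0,\eta_1,m)$.

By the definition of $\overline{\mathrm{RCIF}}$, we have

$$\overline{\mathrm{RCIF}}_{\mathrm{pred}}(S;\eta_0,\eta_1,m)\,h^T\Sigma h \le \|\Sigma h\|_c^2|S|, \tag{2.46}$$

and

$$\overline{\mathrm{RCIF}}_{\mathrm{est},q}(S;\eta_0,\eta_1,m)\|h\|_q \le \|\Sigma h\|_c|S|^{1/q}. \tag{2.47}$$

Moreover, we have

$$\|(\Sigma h)_{\tilde S^c}\|_\infty \le \|X_{\tilde S^c}^T(y - X\hat\beta)/n\|_\infty + \|X_{\tilde S^c}^T(y - X\beta^o)/n\|_\infty \le (1+\eta)\lambda,$$

and

$$\|(\Sigma h)_{\tilde S}\|_2 \le \|X_{\tilde S}^T(y - X\hat\beta)/n\|_2 + \|X_{\tilde S}^T(y - X\beta^o)/n\|_2 \le \lambda m^{1/2} + \eta_0\lambda_*m^{1/2} + \eta_1m^{1/2}\lambda_* \le (1+\eta)m^{1/2}\lambda.$$

Thus,

$$\|\Sigma h\|_c = \max\big\{\|(\Sigma h)_{\tilde S^c}\|_\infty,\ \|(\Sigma h)_{\tilde S}\|_2/m^{1/2}\big\} \le (1+\eta)\lambda. \tag{2.48}$$

We establish the RCIF error bounds in (2.41) and (2.42) by inserting the above inequality into (2.46) and (2.47) respectively.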

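To compare the RCIF and RE, we note that

for $u \in \mathscr{U}(S;\eta_0,\eta_1,m)$, so that for $\eta < 1$

$$\begin{aligned} u^T\Sigma u &= u_{\tilde S^c}^T(\Sigma u)_{\tilde S^c} + u_{\tilde S}^T(\Sigma u)_{\tilde S} \\ &\le \|u_{\tilde S^c}\|_1\|\Sigma u\|_c + m^{1/2}\|u_{\tilde S}\|_2\|\Sigma u\|_c \\ &\le \left(\frac{2|S|^{1/2}\|u_S\|_2 + \eta_1m^{1/2}\|u_{\tilde S}\|_2}{1-\eta} + m^{1/2}\|u_{\tilde S}\|_2\right)\|\Sigma u\|_c \\ &\le \frac{2|S|^{1/2} + (1-\eta_0)m^{1/2}}{1-\eta}\,\|u\|_2\|\Sigma u\|_c. \end{aligned}$$

It follows that

$$\frac{u^T\Sigma u}{\|u\|_2^2} \le \left[\frac{2(|S|/s^*)^{1/2} + (1-\eta_0)(m/s^*)^{1/2}}{1-\eta}\right]^2\frac{\|\Sigma u\|_c^2\,s^*}{u^T\Sigma u},$$

and

$$\frac{u^T\Sigma u}{\|u\|_2^2} \le \left[\frac{2(|S|/s^*)^{1/2} + (1-\eta_0)(m/s^*)^{1/2}}{1-\eta}\right]\frac{\|\Sigma u\|_c\,(s^*)^{1/2}}{\|u\|_2}.$$

Taking infimum in the cone $\mathscr{U}(S;\eta_0,\eta_1,m)$ on both sides and noting that $\xi_1 = \big[2(|S|/s^*)^{1/2} + (1-\eta_0)(m/s^*)^{1/2}\big]/\big(1-\eta\big)$, we obtain

$$\overline{\mathrm{RCIF}}_{\mathrm{pred}}(S;\eta_0,\eta_1,m) \ge \overline{\mathrm{RE}}_2^2(S;\eta_0,\eta_1,m)/\xi_1^2, \quad\text{and}\quad \overline{\mathrm{RCIF}}_{\mathrm{est},2}(S;\eta_0,\eta_1,m) \ge \overline{\mathrm{RE}}_2^2(S;\eta_0,\eta_1,m)/\xi_1.$$

This completes the proof. $\square$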

2.4 Scaled concave PLSE

We have studied in previous sections the properties of all the local solutions in $\mathscr{B}_0(\lambda_*,\kappa_*)$. Even when the local solution set $\mathscr{B}_0(\lambda_*,\kappa_*)$ is obtained, one still needs to choose an appropriate solution in the set or a proper penalty level. This problem, which will be studied in this section, is essentially to estimate the noise level $\sigma$ due to scale invariance.

Numerous efforts have been devoted to scale free estimation under the $\ell_1$ penalty. Städler et al. [67] proposed the minimizer of joint log-likelihood of regression coefficients and noise level with an $\ell_1$ penalty. The comment on this paper by Antoniadis [2] pointed out that their estimator is equivalent to the joint minimization of Huber's concomitant loss

$$(\hat\beta,\hat\sigma) = \mathop{\arg\min}_{\beta,\sigma}\left\{\frac{\|y - X\beta\|_2^2}{2n\sigma} + \frac{\sigma}{2} + \lambda_0\|\beta\|_1\right\}. \tag{2.49}$$

It turns out that (2.49) coincides with many other works on the scale free estimation under the $\ell_1$ penalty. For example, the square-root Lasso solution [5] and the equilibrium of the iterative algorithm [69] are both equivalent to (2.49). However, all of these studies of the scale free estimation are limited to the $\ell_1$ penalty. The scaled concave PLSE is not an easy extension of the scaled $\ell_1$ penalization due to the loss of scale free property.

In fact, the concomitant loss or the square-root formulation fails for concave penalties. To illustrate this, we take the MCP as an example. Denote $\sigma^* = \|y - X\beta^*\|_2/n^{1/2}$ as the oracle noise level estimator given the true coefficients $\beta^*$. Under the Gaussian assumption, this is the maximum likelihood estimator for $\sigma$ when $\beta^*$ is known and thus a natural estimation target. For the minimax concave penalty $\rho(t;\lambda) = \lambda\int_0^{|t|}(1 - x/(\lambda\gamma))_+dx$,

$$\hat\sigma^2 = \mathop{\arg\min}_{\sigma^2}\left\{\frac{\|y - X\beta^*\|_2^2}{2n\sigma} + \frac{\sigma}{2} + \frac{1}{\sigma}\sum_{j=1}^p\rho(|\beta^*_j|;\lambda_0\sigma)\right\} = \{\sigma^*\}^2 - (1/\gamma)\sum_{j:\,|\beta^*_j| < \lambda_0\hat\sigma\gamma}\{\beta^*_j\}^2,$$

so $\hat\sigma$ is expected to underestimate $\sigma^*$ unless there is no small $\beta^*_j$ such that $|\beta^*_j| < \lambda_0\hat\sigma\gamma$. This validates the argument that the concomitant loss formulation fails for concave penalties. In addition, the iterative algorithm becomes extremely difficult to analyze due to the loss of its equivalence to joint convex minimization, compared with the Lasso.

2.4.1 Description of the scaled concave PLSE

Given a coefficient vector $\hat\beta$, define the noise level estimator as

p


where $d$ is a parameter that provides an option to adjust degrees-of-freedom. Typically, we let $d = p$ when $p < n$ and $d = 0$ otherwise. Within the local solution set $\mathscr{B}_0(\lambda_*,\kappa_*)$, we search for a subclass of scaled concave penalized least-square estimators $\mathscr{B}_{0,\mathrm{scal}}(\lambda_0;\lambda_*,\kappa_*)$, defined as

$$\mathscr{B}_{0,\mathrm{scal}}(\lambda_0;\lambda_*,\kappa_*) = \big\{\hat\beta \in \mathscr{B}_0(\lambda_*,\kappa_*) : \lambda_0\hat\sigma(\hat\beta) = \lambda\big\}. \tag{2.51}$$

Here, $\lambda_0$ is a prefixed penalty level and independent of $\sigma$. For example, one may choose $\lambda_0 = A\{(2/n)\log p\}^{1/2}$ for universal penalty and $\lambda_0 = An^{-1/2}L_1(k/p)$ for smaller penalty, with appropriate $k$ and $A$. We derive the consistency results for noise level estimation for different $\lambda_0$ separately in the following analysis.

As discussed in Section 2.2, $\mathscr{B}_0(\lambda_*,\kappa_*)$ is a large class of estimators that includes all local solutions connected to the origin regardless of the specific algorithms used to compute the solution. We here use the PLUS algorithm as an example to illustrate the computation of the estimators in $\mathscr{B}_{0,\mathrm{scal}}(\lambda_0;\lambda_*,\kappa_*)$. The PLUS, indexed by $x$, is defined as

$$\lambda^{(x)} \oplus \hat\beta^{(x)} \equiv \begin{cases} \text{a continuous path of solutions of (2.12) in } \mathbb{R}^{1+p} \\ \text{with } \hat\beta^{(0)} = 0 \text{ and } \lim_{x\to\infty}\lambda^{(x)} = 0. \end{cases} \tag{2.52}$$

Given a PLUS solution path, the scaled estimator can be defined as

$$\hat\beta^{\mathrm{scal}} = \hat\beta^{(\hat x)}, \qquad \hat x = \min\big\{x : \lambda_0\hat\sigma(\hat\beta^{(x)}) \ge \lambda^{(x)}\big\}. \tag{2.53}$$
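The selection rule in (2.53) can be sketched on a discretized path. The snippet below is only an illustration under stated assumptions: the path is supplied as a list of `(lam, beta)` pairs ordered from the origin toward smaller penalty levels (a grid stand-in for the continuous PLUS path), the noise level is taken as $\|y - X\hat\beta\|_2/n^{1/2}$ without the degrees-of-freedom adjustment $d$, and the function names are hypothetical.

```python
import math


def noise_level(X, y, beta):
    """sigma_hat(beta) = ||y - X beta||_2 / sqrt(n), with no degrees-of-freedom adjustment."""
    n = len(y)
    rss = sum(
        (y[i] - sum(x_ij * b_j for x_ij, b_j in zip(X[i], beta))) ** 2
        for i in range(n)
    )
    return math.sqrt(rss / n)


def select_scaled_solution(path, X, y, lam0):
    """Return the first (lam, beta, sigma_hat) on the path with lam0 * sigma_hat >= lam.

    Returns None if no grid point satisfies the rule.
    """
    for lam, beta in path:
        sigma_hat = noise_level(X, y, beta)
        if lam0 * sigma_hat >= lam:
            return lam, beta, sigma_hat
    return None
```

On a grid the rule holds with inequality at the selected point; a finer grid, or interpolation along the path, would be needed to approach the equality $\lambda_0\hat\sigma(\hat\beta) = \lambda$ in (2.51).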

The "$\ge$" in defining $\hat x$ in (2.53) can be changed to "$=$" by the continuity of the PLUS path. Under mild regularity conditions, we will prove that $\hat\beta^{\mathrm{scal}} \in \mathscr{B}_{0,\mathrm{scal}}(\lambda_0;\lambda_*,\kappa_*)$. See next subsections for the proof. This also guarantees the non-emptiness of $\mathscr{B}_{0,\mathrm{scal}}(\lambda_0;\lambda_*,\kappa_*)$.

2.4.2 Performance guarantees of scaled concave PLSE at universal penalty levels

In this subsection, we derive the consistency results for noise level estimation with sufficiently large $\lambda_0$. Since $\sigma^* = \|y - X\beta^*\|_2/n^{1/2}$ is a natural target of noise level estimation, we aim to derive the convergence results of $\hat\sigma(\hat\beta)/\sigma^*$ with $\hat\beta \in \mathscr{B}_{0,\mathrm{scal}}(\lambda_0;\lambda_*,\kappa_*)$ in the following theorem.

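Theorem 2.3. Let $\beta^*$ be the true regression coefficients, $\hat\beta^{\mathrm{scal}}$ be in (2.53) and $\sigma^* = \|y - X\beta^*\|_2/n^{1/2}$ be the oracle noise level estimator. Let $0 < \eta < 1$ and $\xi = (1+\eta)/(1-\eta)$. Suppose $\kappa(\rho) \le \kappa_*$ and $\mathrm{RE}_2^2(S;\eta,1) \ge \kappa_*$.

(i) Let $\tau_0 = (1+\xi)(1+\eta)\lambda_0s^{1/2}/\mathrm{RE}_1(S;\eta,1)$. When (2.16) holds with $\lambda_* = \lambda_0\sigma^*(1-\tau_0)$ and $\delta_* = 1$, we have $\hat\beta^{\mathrm{scal}} \in \mathscr{B}_{0,\mathrm{scal}}(\lambda_0;\lambda_*,\kappa_*)$. Moreover, for any $\hat\beta \in \mathscr{B}_{0,\mathrm{scal}}(\lambda_0;\lambda_*,\kappa_*)$,

$$\max\left(1 - \frac{\hat\sigma(\hat\beta)}{\sigma^*},\ 1 - \frac{\sigma^*}{\hat\sigma(\hat\beta)}\right) \le \tau_0, \qquad \frac{\|X\hat\beta - X\beta^*\|_2}{n^{1/2}\sigma^*} \le \frac{\tau_0}{1-\tau_0}. \tag{2.54}$$

In particular, if we take $\lambda_0 = A\{(2/n)\log p\}^{1/2}$ with $A > 1/\eta$ and $\tau_0 \to 0$, then for all $\epsilon > 0$

$$P_{\beta^*,\sigma}\big(|\hat\sigma(\hat\beta)/\sigma - 1| > \epsilon\big) \to 0. \tag{2.55}$$

(ii) Let $\tau_*^2 = \eta(1+\eta)(1+\xi)^2\lambda_0^2s/\mathrm{RE}_1(S;\eta,1)$. When (2.16) holds with $\lambda_* = \lambda_0\sigma^*(1-\tau_*^2)$ and $\delta_* = 1$, we have $\hat\beta^{\mathrm{scal}} \in \mathscr{B}_{0,\mathrm{scal}}(\lambda_0;\lambda_*,\kappa_*)$. Moreover, for any $\hat\beta \in \mathscr{B}_{0,\mathrm{scal}}(\lambda_0;\lambda_*,\kappa_*)$,

$$\max\left(1 - \frac{\hat\sigma(\hat\beta)}{\sigma^*},\ 1 - \frac{\sigma^*}{\hat\sigma(\hat\beta)}\right) \le 3\tau_*^2. \tag{2.56}$$

If we take $\lambda_0 = A\{(2/n)\log p\}^{1/2}$ with $A > 1/\eta$ and $\tau_*^2 \ll n^{-1/2}$, then

$$n^{1/2}\big(\hat\sigma(\hat\beta)/\sigma - 1\big) \to N(0, 1/2) \tag{2.57}$$

in distribution under $P_{\beta^*,\sigma}$.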

By proving $\hat\beta^{\mathrm{scal}} \in \mathscr{B}_{0,\mathrm{scal}}(\lambda_0;\lambda_*,\kappa_*)$, Theorem 2.3 guarantees the non-emptiness of $\mathscr{B}_{0,\mathrm{scal}}(\lambda_0;\lambda_*,\kappa_*)$ with appropriate $\lambda_*$ and $\lambda_0$. Moreover, it provides the convergence and asymptotic normality for the scaled concave estimation of noise level under only the restricted eigenvalue conditions. In part (i), we achieve an error rate $\tau_0 \asymp \sqrt{(s/n)\log p}$ for noise level estimation with $\hat\beta \in \mathscr{B}_{0,\mathrm{scal}}(\lambda_0;\lambda_*,\kappa_*)$. This matches the $\ell_1$ penalized maximum likelihood estimator in [67]. In part (ii), we provide a sharper convergence rate and the asymptotic normality results. The sharper rate $\tau_*^2$ is on the order of $(s/n)\log p$, which is essentially the square of the order in part (i). The asymptotic normality then follows from the sharper rate under mild assumptions. The convergence rate in part (ii) matches the rate of the iterative algorithm formulation in Sun and Zhang [69].

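Proof of Theorem 2.3. First prove (i). Denote $z = X^T(y - X\beta^*)/n$ and $h^{(x)} = \hat\beta^{(x)} - \beta^*$. Consider penalty level $\lambda^{(x_0)} = \lambda_* = \lambda_0\sigma^*(1-\tau_0)$ for certain $x_0$ in the PLUS path. Since $\lambda^{(x_0)} = \lambda_*$ satisfies (2.16), it follows from Theorem 2.1 and the definition of $\tau_0$ that $\|Xh^{(x_0)}\|_2/n^{1/2} \le \sigma^*\tau_0(1-\tau_0) \le \sigma^*\tau_0$. Then we have

$$\lambda_0\hat\sigma(\hat\beta^{(x_0)}) = \lambda_0\|y - X\hat\beta^{(x_0)}\|_2/n^{1/2} \ge \lambda_0\Big|\sigma^* - \|Xh^{(x_0)}\|_2/n^{1/2}\Big| \ge \lambda_0\sigma^*(1-\tau_0) = \lambda^{(x_0)}. \tag{2.58}$$

By the definition of $\hat x$, $\hat x \le x_0$. Since any penalty level $\lambda^{(x)} \ge \lambda_*$ is a local solution of (2.12) in the PLUS path, $\lambda^{(x)}$ is a non-increasing function of $x$ for $\lambda^{(x)} \ge \lambda_*$. Thus, $\lambda^{(\hat x)} \ge \lambda^{(x_0)} = \lambda_*$. It follows that $\hat\beta^{\mathrm{scal}} \in \mathscr{B}_{0,\mathrm{scal}}(\lambda_0;\lambda_*,\kappa_*)$.

Moreover, for any $\hat\beta \in \mathscr{B}_{0,\mathrm{scal}}(\lambda_0;\lambda_*,\kappa_*)$ with penalty $\lambda \ge \lambda_*$, we have

$$\hat\sigma(\hat\beta) = \lambda/\lambda_0 \ge \lambda_*/\lambda_0 = \sigma^*(1-\tau_0). \tag{2.59}$$

Furthermore, by Theorem 2.1 we have

$$\Big|\|y - X\hat\beta\|_2/n^{1/2} - \sigma^*\Big| \le \|X(\hat\beta - \beta^*)\|_2/n^{1/2} \le \tau_0\hat\sigma(\hat\beta). \tag{2.60}$$

Thus,

$$\frac{\hat\sigma(\hat\beta)}{\sigma^*} = \frac{\|y - X\hat\beta\|_2}{n^{1/2}\sigma^*} \le \frac{\tau_0\hat\sigma(\hat\beta) + \sigma^*}{\sigma^*} = 1 + \tau_0\frac{\hat\sigma(\hat\beta)}{\sigma^*}. \tag{2.61}$$

This implies $\hat\sigma(\hat\beta) \le \sigma^*/(1-\tau_0)$. Combining with (2.59), the first part of (2.54) holds. In addition,

$$\|X\hat\beta - X\beta^*\|_2/n^{1/2} \le \tau_0\hat\sigma(\hat\beta) \le \sigma^*\tau_0/(1-\tau_0). \tag{2.62}$$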

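The second part of (2.54) holds. To prove (2.55), since for certain $A$,

$$P_{\beta,\sigma}\Big[\|z\|_\infty \le A\sigma\{(2/n)\log p\}^{1/2}\Big] \to 1,$$

(2.55) follows from (2.54).

Now we prove (ii). By the KKT condition,

$$\begin{aligned} -(\|z\|_\infty + \lambda)\|h\|_1 &\le (Xh)^T\big\{y - X\beta^* + y - X\hat\beta\big\}/n \\ &\le (\sigma^*)^2 - \|y - X\hat\beta\|_2^2/n \\ &\le (Xh)^T\{2(y - X\beta^*) - Xh\}/n \\ &\le 2\|z\|_\infty\|h\|_1. \end{aligned} \tag{2.63}$$

We use the above inequalities as lower and upper bounds for $(\sigma^*)^2 - \|y - X\hat\beta\|_2^2/n$.

Consider $\lambda^{(x_1)} = \lambda_* = \lambda_0\sigma^*(1-\tau_*^2)$ in the PLUS path. Since $\lambda^{(x_1)} = \lambda_*$ satisfies (2.16), it follows from Theorem 2.1 that $\|h^{(x_1)}\|_1 \le (1+\xi)^2(1+\eta)\lambda^{(x_1)}s/\mathrm{RE}_1(S;\eta,1)$. Combining with $\|z\|_\infty < \lambda_0\eta\sigma^*(1-\tau_*^2)$, we have

$$\begin{aligned} \lambda_0^2\hat\sigma^2(\hat\beta^{(x_1)}) = \lambda_0^2\|y - X\hat\beta^{(x_1)}\|_2^2/n &\ge \lambda_0^2\Big(\{\sigma^*\}^2 - 2\|z\|_\infty\|h^{(x_1)}\|_1\Big) \\ &\ge \lambda_0^2\{\sigma^*\}^2\big(1 - 2\tau_*^2(1-\tau_*^2)^2\big) \\ &\ge \lambda_0^2\{\sigma^*\}^2(1-\tau_*^2)^2 = (\lambda^{(x_1)})^2. \end{aligned} \tag{2.64}$$

The last inequality holds since $\tau_*^2 \le 1$. As in part (i), we find $\lambda^{(\hat x)} \ge \lambda^{(x_1)} = \lambda_*$ and $\hat\beta^{\mathrm{scal}} \in \mathscr{B}_{0,\mathrm{scal}}(\lambda_0;\lambda_*,\kappa_*)$.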

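Similarly, for any $\hat\beta \in \mathscr{B}_{0,\mathrm{scal}}(\lambda_0;\lambda_*,\kappa_*)$ with penalty $\lambda \ge \lambda_* = \lambda_0\sigma^*(1-\tau_*^2)$, we have $\hat\sigma(\hat\beta) = \lambda/\lambda_0 \ge \lambda_*/\lambda_0 = \sigma^*(1-\tau_*^2)$. On the other hand, recalling that $\|z\|_\infty < \lambda_0\eta\sigma^*(1-\tau_*^2)$ and $\|\hat\beta - \beta^*\|_1 \le (1+\xi)^2(1+\eta)\lambda_0\hat\sigma(\hat\beta)s/\mathrm{RE}_1(S;\eta,1)$, we have

$$\begin{aligned} \frac{\hat\sigma(\hat\beta)^2}{\{\sigma^*\}^2} = \frac{\|y - X\hat\beta\|_2^2}{n\{\sigma^*\}^2} &\le \frac{\{\sigma^*\}^2 + \big(\|z\|_\infty + \lambda_0\hat\sigma(\hat\beta)\big)\|\hat\beta - \beta^*\|_1}{\{\sigma^*\}^2} \\ &\le \frac{\{\sigma^*\}^2 + \tau_*^2(1-\tau_*^2)\hat\sigma(\hat\beta)\sigma^* + (1/\eta)\tau_*^2\hat\sigma^2(\hat\beta)}{\{\sigma^*\}^2}. \end{aligned} \tag{2.65}$$

Solving the above inequality with respect to $\hat\sigma(\hat\beta)/\sigma^*$, we obtain $\hat\sigma(\hat\beta)/\sigma^* \le (1+\tau_*^2)/(1-\tau_*^2)$. Thus $1 - \sigma^*/\hat\sigma(\hat\beta) \le 3\tau_*^2$. This proves (2.56). Given (2.56), the proof of (2.57) follows the proof of Theorem 2 (ii) in Sun and Zhang [69]. $\square$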

2.4.3 Performance bounds of scaled concave PLSE at smaller penalty levels

In this subsection, we derive the consistency results for noise level estimation with smaller λ0.

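Theorem 2.4. Let $\beta^*$, $\hat\beta^{\mathrm{scal}}$ and $\sigma^*$ be as in Theorem 2.3 and $\overline{\mathrm{RE}}_{*,2}(S;\cdot,\cdot)$ be as in (2.39). Let $m$ be a positive integer, $\eta = \eta_0 + \eta_1$ with positive $\eta_0$, $\eta_1$ and $\xi_2 = [2 + (1-\eta_0)(m/s)^{1/2}]/(1-\eta)$. Define $\tau_1 = (1+\eta)\lambda_0\xi_2(s\vee m)^{1/2}/\overline{\mathrm{RE}}_{*,2}(S;\eta,m)$. Suppose $\kappa(\rho) \le \kappa_*$ and $\mathrm{RE}_2^2(S;\eta,1) \ge \kappa_*$. When (2.32) holds with $\lambda_* = \lambda_0\sigma^*(1-\tau_1)$, we have $\hat\beta^{\mathrm{scal}} \in \mathscr{B}_{0,\mathrm{scal}}(\lambda_0;\lambda_*,\kappa_*)$. Moreover, for any $\hat\beta \in \mathscr{B}_{0,\mathrm{scal}}(\lambda_0;\lambda_*,\kappa_*)$,

$$\max\left(1 - \frac{\hat\sigma(\hat\beta)}{\sigma^*},\ 1 - \frac{\sigma^*}{\hat\sigma(\hat\beta)}\right) \le \tau_1, \qquad \frac{\|X\hat\beta - X\beta^*\|_2}{n^{1/2}\sigma^*} \le \frac{\tau_1}{1-\tau_1}. \tag{2.66}$$

If we take $\lambda_0 = An^{-1/2}L_1(k/p)$ with $k$ in (2.37), $A > 1/\eta_0$ and $\tau_1 \to 0$, then for all $\epsilon > 0$

$$P_{\beta^*,\sigma}\big(|\hat\sigma(\hat\beta)/\sigma - 1| > \epsilon\big) \to 0. \tag{2.67}$$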

Similar to Theorem 2.3, Theorem 2.4 first guarantees the non-emptiness of $\mathscr{B}_{0,\mathrm{scal}}(\lambda_0;\lambda_*,\kappa_*)$ but with smaller $\lambda_0$. Furthermore, it provides the convergence results for noise level estimation at smaller penalties with nearly identical conditions as in Theorem 2.3. Compared with the existing literature, Theorem 2.4 could be viewed as a generalization of the scaled Lasso with smaller penalties in Sun and Zhang [70].

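Proof of Theorem 2.4. Consider penalty level $\lambda^{(x_1)} = \lambda_* = \lambda_0\sigma^*(1-\tau_1)$ for certain $x_1 < \infty$ in the PLUS path. Since (2.32) holds for $\lambda^{(x_1)}$, by Theorem 2.2, we have

$$\|Xh^{(x_1)}\|_2/n^{1/2} \le \frac{(1+\eta)\xi_1\lambda^{(x_1)}(s^*)^{1/2}}{\overline{\mathrm{RE}}_2(S;\eta_0,\eta_1,m)} \le \frac{(1+\eta)\xi_2\lambda^{(x_1)}(s\vee m)^{1/2}}{\overline{\mathrm{RE}}_{*,2}(S;\eta,m)} \le \sigma^*\tau_1.$$

Similar to (2.58),

$$\lambda_0\hat\sigma(\hat\beta^{(x_1)}) = \lambda_0\|y - X\hat\beta^{(x_1)}\|_2/n^{1/2} \ge \lambda_0\sigma^*(1-\tau_1) = \lambda^{(x_1)}.$$

As in the proof of Theorem 2.3, we find $\lambda^{(\hat x)} \ge \lambda^{(x_1)}$ and $\hat\beta^{\mathrm{scal}} \in \mathscr{B}_{0,\mathrm{scal}}(\lambda_0;\lambda_*,\kappa_*)$. Moreover, (2.66) and (2.67) can be proved in the same way as Theorem 2.3. $\square$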

2.5 Simulation Study

In this section, we report the noise level estimation results of the scaled concave PLSE and compare with several competing methods in a comprehensive simulation study. The experimental settings follow Reid et al. [61] and are described with our notation as below.

The simulation aims to estimate the noise level $\sigma$ in a variety of settings. All simulations are run at a sample size of $n = 100$, and the number of predictors takes four different values: $p = 100, 200, 500, 1000$. Elements of the design matrix $X$ are generated randomly as $X_{ij} \sim N(0,1)$. Correlation between columns of $X$ is set to be $\rho$. The true parameter $\beta^*$ is generated as follows: the number of nonzero elements is set to be $p_{nz} = \lceil n^\alpha\rceil$, i.e., $\alpha$ controls the sparsity of $\beta^*$: the higher the $\alpha$, the less sparse $\beta^*$ is. It ranges between 0 and 1. The indices corresponding to nonzero $\beta^*$ are selected randomly. Their values are set to be random samples from a Laplace$(0,1)$ distribution. The elements of the resulting $\beta^*$ are scaled such that the signal-to-noise ratio, defined as $\{\beta^*\}^T\Sigma\beta^*/\sigma^2$, is some predetermined value, snr. Simulations were run over a grid of values for each of the parameters described above. In particular,

• $\rho = 0, 0.2, 0.4, 0.6, 0.8$
• $\alpha = 0.1, 0.3, 0.5, 0.7, 0.9$
• snr $= 0.5, 1, 2, 5, 10, 20$.
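The data-generating design described above can be sketched in a few lines of Python. This is an illustration under assumptions that the text does not pin down: all pairs of columns share the same correlation $\rho$ (generated through one shared Gaussian factor per row), $\Sigma$ in the signal-to-noise ratio is taken as the population matrix $(1-\rho)I + \rho 11^T$, and the function name and defaults are hypothetical.

```python
import math
import random


def simulate_dataset(n, p, rho, alpha, snr, sigma=1.0, seed=0):
    """Generate (X, y, beta) for one simulation replicate."""
    rng = random.Random(seed)
    # Equicorrelated N(0, 1) columns: sqrt(rho) * shared factor + sqrt(1 - rho) * own noise.
    X = []
    for _ in range(n):
        shared = rng.gauss(0.0, 1.0)
        X.append([
            math.sqrt(rho) * shared + math.sqrt(1.0 - rho) * rng.gauss(0.0, 1.0)
            for _ in range(p)
        ])
    # Sparse coefficients: ceil(n^alpha) nonzeros at random positions.
    n_nonzero = min(p, math.ceil(n ** alpha))
    beta = [0.0] * p
    for j in rng.sample(range(p), n_nonzero):
        # Laplace(0, 1) as the difference of two independent Exp(1) variables.
        beta[j] = rng.expovariate(1.0) - rng.expovariate(1.0)
    # Rescale so that beta^T Sigma beta / sigma^2 equals snr (population Sigma assumed).
    quad = (1.0 - rho) * sum(b * b for b in beta) + rho * sum(beta) ** 2
    scale = math.sqrt(snr * sigma ** 2 / quad)
    beta = [scale * b for b in beta]
    y = [
        sum(x_ij * b_j for x_ij, b_j in zip(row, beta)) + sigma * rng.gauss(0.0, 1.0)
        for row in X
    ]
    return X, y, beta
```

Looping this function over the grid of $(p, \rho, \alpha, \text{snr})$ values with distinct seeds would produce the replicated datasets on which the noise level estimators are compared.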

We simulate $B = 200$ independent datasets for each set of parameters. The competing methods considered include: