High Dimensional Inference for Semiparametric Models


Purdue e-Pubs

Open Access Dissertations Theses and Dissertations

January 2016


Zhuqing Yu

Purdue University

This document has been made available through Purdue e-Pubs, a service of the Purdue University Libraries.

Recommended Citation

Yu, Zhuqing, "High Dimensional Inference for Semiparametric Models" (2016).Open Access Dissertations. 1401.


PURDUE UNIVERSITY GRADUATE SCHOOL Thesis/Dissertation Acceptance

This is to certify that the thesis/dissertation prepared

By Zhuqing Yu

Entitled High Dimensional Inference for Semiparametric Models

For the degree of Doctor of Philosophy

Is approved by the final examining committee: Guang Cheng (Chair), Thomas Sellke, Anirban DasGupta, Chong Gu

To the best of my knowledge and as understood by the student in the Thesis/Dissertation Agreement, Publication Delay, and Certification Disclaimer (Graduate School Form 32), this thesis/dissertation adheres to the provisions of Purdue University’s “Policy of Integrity in Research” and the use of copyright material.

Approved by Major Professor(s): Guang Cheng

Approved by: Jun Xie, Head of the Departmental Graduate Program

Date: 6/16/2016


A Dissertation Submitted to the Faculty of Purdue University

by Zhuqing Yu

In Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

August 2016

Purdue University, West Lafayette, Indiana


ACKNOWLEDGMENTS

I would first like to thank my advisor, Prof. Guang Cheng, for his continuous guidance and support of my Ph.D. study. He has generously shared his insights and expertise, which greatly helped me throughout the writing of this thesis. Prof. Cheng has also provided me with great research resources over the past five years, such as opportunities to join exciting research projects and to attend professional conferences and activities.

Besides my advisor, I would like to thank the rest of my thesis committee, Prof. Anirban DasGupta, Prof. Chong Gu and Prof. Thomas Sellke for their time, interest, and invaluable comments.

I would also like to express my great gratitude to my collaborators. I am especially indebted to my previous supervisor, Prof. Stephen Lee from The University of Hong Kong, who has led me into the field of statistics. I thank Prof. Jianhua Huang for his great suggestions on paper writing. Prof. Shengchun Kong has been a great friend with whom I can discuss research, as well as career building. It was a great pleasure to work with Prof. Michael Levine who has offered many helpful discussions.

I deeply appreciate the guidance I have received from professors at Purdue University. Many thanks go to Prof. Fabrice Baudoin, Prof. William Cleveland, Prof. Jose Figueroa-Lopez, Prof. Jayanta Ghosh, Prof. Chong Gu, Prof. Chuanhai Liu and Prof. Thomas Sellke for interesting lectures on related topics that helped me improve my knowledge in the area. Special thanks go to Prof. Anirban DasGupta for his very inspirational lectures, seminars and fruitful discussions on mathematical statistics.

I would like to thank Prof. Bruce Craig and Ms. Ce-Ce Furtner who offered me the opportunity to work as a statistical consultant in the department’s Statistical Consulting Service. I greatly value the discussions with Prof. Anindya Bhadra, Prof. Bruce Craig, Prof. Chong Gu, Prof. Arman Sabbaghi, Prof. Jun Xie and Prof. Michael Zhu on various kinds of consulting projects. Thanks also go to my fellow colleagues who have kindly shared their experience.

I would also like to thank members of Prof. Guang Cheng’s Big Data Theory Research Group, including Prof. Shengchun Kong, Prof. Qifan Song, Dr. Shih-Kang Chao, Dr. Zuofeng Shang, Dr. Wei Sun, Ching-Wei Cheng, Meimei Liu, Botao Hao, Jingcheng Bai, Jiexin Duan, Hilda Ibriga, Yang Yu and Jiapeng Liu, for many valuable discussions on research problems over the past five years.

I gratefully acknowledge the funding sources that made my Ph.D. work possible. I was funded by the Ross Fellowship of Purdue University for my first four years and was supported by a Purdue Research Foundation Grant for the fifth year. I also thank the department for teaching assistantships and Prof. Guang Cheng for research assistantships. I greatly appreciate the travel funding from the previous department Head, Prof. Rebecca Doerge, and Purdue’s Women in Science Programs.

My time at Purdue was made enjoyable in large part due to the many friends who became a part of my life. I am grateful for time spent with fellow graduate students, roommates, neighbors and friends, and for many other people and memories.

Finally, I would like to express my heartfelt gratitude to my family, especially my parents, who have been a constant source of love, concern, support and strength all these years. Thanks to my lovely daughter Chloe Chen, whose bright smiles are the warmest encouragement. I deeply thank my loving, supportive, encouraging and patient husband Xianghong Chen, who has supported me in every possible way to see the completion of this work.


TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
ABBREVIATIONS
ABSTRACT
1 Introduction
  1.1 Minimax Optimal Estimation
  1.2 High Dimensional Inference
2 Minimax Optimal Estimation
  2.1 Partial Linear Models
  2.2 Partial Linear Additive Models
  2.3 Appendix
    2.3.1 Proof for Section 2.1
    2.3.2 Proof for Section 2.2
    2.3.3 Results from Empirical Process Theory
3 High Dimensional Inference
  3.1 Semiparametric Version of Debiased LASSO
    3.1.1 Construction of Debiased Estimator
    3.1.2 Asymptotic Distribution
    3.1.3 Semiparametric Efficiency
  3.2 Statistical Inference
    3.2.1 Component-Wise Confidence Interval
    3.2.2 Support Recovery
    3.2.3 Testing with FWER Control
  3.3 Appendix
    3.3.1 Preliminary Lemmas
    3.3.2 Proof for Section 3.1.2
REFERENCES

LIST OF TABLES

Table

2.1 Estimation Interference Results for Model (2.1).

2.2 Estimation Interference Results for Model (2.6).

3.1 Average coverage probabilities and lengths of confidence intervals at 95% nominal level based on 1000 replications; n = 100, p = 500, s0 = 3, Normal Error

3.2 Average coverage probabilities and lengths of confidence intervals at 95% nominal level based on 1000 replications; n = 100, p = 500, s0 = 3, t5 Error

3.3 Average coverage probabilities and lengths of confidence intervals at 95% nominal level based on 1000 replications; n = 100, p = 500, s0 = 15, Normal Error

3.4 Average coverage probabilities and lengths of confidence intervals at 95% nominal level based on 1000 replications; n = 100, p = 500, s0 = 15, t5 Error

3.5 Mean and Standard deviation of $d(\hat S_0, S_0)$ based on 1000 replications; n = 100, p = 500

3.6 FWER and Power of Multiple Testing based on 1000 replications; n = 100, p = 500, s0 = 3

3.7 FWER and Power of Multiple Testing based on 1000 replications; n =

LIST OF FIGURES

Figure

1.1 Minimax Rate Phase Transition

ABBREVIATIONS

i.i.d. independent and identically distributed

LASSO Least Absolute Shrinkage and Selection Operator

FWER Family Wise Error Rate

ABSTRACT

Yu, Zhuqing, PhD, Purdue University, August 2016. High Dimensional Inference for Semiparametric Models. Major Professor: Guang Cheng.

In the literature, high dimensional inference refers to statistical inference when the number of unknown parameters is much greater than the sample size. Semiparametric models are models that include parametric and nonparametric components, such as partial linear models and partial additive models. Due to the high dimensionality of the parameter of interest and the presence of a nuisance function, it is very challenging to carry out estimation and inference for the parametric component in high dimensional semiparametric settings, for instance, the construction of confidence intervals and hypothesis tests. In this thesis, I will present two sets of estimation and inference results under high dimensional semiparametric setups.

The first one is minimax optimal estimation in high dimensional semiparametric models. Our particular focus is on partially linear additive models with high dimensional sparse vectors and smooth nonparametric functions. The minimax lower bound for the parametric component depends merely on the dimensionality and sparsity, while the minimax lower bound for each nonparametric component is established as an interplay among dimensionality, sparsity and smoothness. Indeed, the minimax risk for parametric estimation cannot be affected by the roughness of the nonparametric functions. However, the minimax risk for smooth nonparametric estimation can be slowed down to the classical parametric rate by the existence of a high dimensional sparse vector, given sufficiently large smoothness or dimensionality. Such a rate-switching phenomenon differs significantly from low dimensional models, where the estimation rate for each component depends only on itself. In the above setting, a general class of penalized least squares estimators is constructed to nearly achieve the minimax lower bounds.

The second one is high dimensional inference for partial spline models, where the dimension of the parametric component is allowed to grow exponentially with the sample size. We propose a semiparametric version of the de-biased Lasso estimator. In the high dimensional regime, this new estimator is shown to be asymptotically normal. Based on this distributional result, we further conduct simultaneous hypothesis testing, with applications to support recovery and multiple testing with strong family wise error rate control.

1. INTRODUCTION

High dimensional inference refers to statistical inference when the number of unknown parameters is much greater than the sample size. Semiparametric models are models that include parametric and nonparametric components, such as partial linear models and partial additive models. Due to the high dimensionality of the parameter of interest and the presence of a nuisance function, it is very challenging to carry out estimation and inference for the parametric component in high dimensional semiparametric settings, for instance, the construction of confidence intervals and hypothesis tests. As far as we are aware, the existing literature mostly focuses on high dimensional parametric or nonparametric estimation, and penalization is a commonly used technique in dealing with these estimation problems; see [1]. For instance, the LASSO estimator is obtained from the $\ell_1$ penalized least squares method, which performs both shrinkage and variable selection; see [2]. For nonparametric regression, smoothing splines are well known to provide nice curves which smooth discrete, noisy data, and the roughness penalty based on the second derivative is the most common in the modern literature; see [3], [4]. In this thesis, I will investigate a class of penalized estimators which incorporates both an $\ell_1$ penalty for the parametric component and a roughness penalty for the nonparametric functions, and provide new theoretical insights on statistical inference for the parameters under high dimensional semiparametric settings.

1.1 Minimax Optimal Estimation

In Section 2, I will introduce penalized estimation for high dimensional semiparametric models, which contain two different types of model components: sparse Euclidean parameters and smooth nonparametric functions. By imposing the more refined semiparametric structure, we aim to obtain new theoretical insights, in particular, on the interfering effect between sparse parametric estimation and smooth nonparametric estimation, and further construct (nearly) minimax optimal estimators.

We illustrate our theory in an important class of semiparametric models, partially linear additive models:

$$Y = X^T\beta_0 + \sum_{j=1}^{J} f_j(Z_j) + \varepsilon, \tag{1.1}$$

where $\beta_0 \in \mathbb{R}^p$ is sparse with $p > n$ and $f_j : \mathbb{R} \mapsto \mathbb{R}$ are nonparametric functions with possibly different smoothness. The additive components $f_j$ are not assumed to be sparse and $J$ is fixed. In contrast with the literature on sparse parametric or nonparametric estimation such as [5], [6], [7], [8], [9] and [10], we are not interested in estimating the conditional mean function $E(Y|X, Z_1, \ldots)$ as a whole, but rather in the separate minimax risk for each model component: $\beta_0, f_1, \ldots, f_J$. Note that our results are not directly implied by the results in the literature, where additive components are always assumed to share the same linear or nonlinear structure with the same smoothness.

To better illustrate our idea, we start from a simpler partially linear model:

$$Y = X^T\beta_0 + f_0(Z) + \varepsilon,$$

where $\beta_0 \in \mathbb{R}^p$ has at most $s_0$ non-zero elements and $f_0$ belongs to the $\alpha$-th order Sobolev space (with $\alpha > 1/2$). When $\beta_0$ is of fixed or low dimension ($p < n$), the above model has been extensively studied in the semiparametric literature; see the references cited in [11] and also recent work by [12], [13]. [14] shows that the minimax risk for estimating $\beta_0$ is bounded below by

$$\frac{s_0}{n}\log\frac{p}{s_0}, \tag{1.2}$$

and that the minimax risk for estimating $f_0$ is bounded below by

$$\max\left\{ n^{-2\alpha/(2\alpha+1)},\ \frac{s_0}{n}\log\frac{p}{s_0} \right\} \tag{1.3}$$

up to a universal constant, based on i.i.d. observations $\{Y_i, X_i, Z_i\}_{i=1}^n$.

It is surprising to see that the lower bound (1.2) for estimating $\beta_0$ does not involve the nonparametric function, while the bound (1.3) depends on both the parametric and nonparametric components. As depicted in Figure 1.1, the rate (1.3) exhibits an interesting two regime dichotomy. In the sparse regime, where $f_0$ is sufficiently smooth or $p$ is sufficiently high, the minimax risk (1.3) becomes $s_0\log(p/s_0)/n$. In other words, the best possible estimation of $f_0$ is slowed down to the well known sparse parametric estimation rate [6,7,15]. On the other hand, in the smooth regime, where $f_0$ is very rough or $p$ is low, the minimax risk becomes the classical nonparametric rate $n^{-2\alpha/(2\alpha+1)}$ [16,17], even for the sparse estimation of $\beta_0$. We call this the rate-switching phenomenon. Interestingly, Figure 1.1 happens to coincide with the phase transition phenomenon discovered in [5,8,9] for high dimensional additive nonparametric models, which is further proven to hold even under approximate sparsity by [10]. Our contribution is to demonstrate that the doubly penalized estimators proposed in [18] for $(\beta_0, f_0)$ almost achieve these minimax lower bounds. This requires us to develop (a stronger version of) oracle inequalities that hold under expectation.
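To make the two regimes concrete, the following Python sketch evaluates the lower bounds (1.2) and (1.3) for given $(n, p, s_0, \alpha)$ and reports which term dominates. It is only an illustration: the universal constants are set to one, and the function name `minimax_lower_bounds` and the numerical inputs are hypothetical choices, not taken from the thesis.

```python
import math


def minimax_lower_bounds(n, p, s0, alpha):
    """Evaluate (1.2) and (1.3) with the universal constant set to 1 (assumption)."""
    sparse_rate = s0 * math.log(p / s0) / n              # bound (1.2) for beta_0
    smooth_rate = n ** (-2 * alpha / (2 * alpha + 1))    # classical nonparametric rate
    f_rate = max(smooth_rate, sparse_rate)               # bound (1.3) for f_0
    regime = "sparse" if sparse_rate >= smooth_rate else "smooth"
    return sparse_rate, f_rate, regime


# Same design (n, p, s0), two smoothness levels: the bound for beta_0 does not
# move, while the bound for f_0 switches between the two terms of the maximum.
for alpha in (0.6, 2.0):
    beta_rate, f_rate, regime = minimax_lower_bounds(n=1000, p=2000, s0=2, alpha=alpha)
    print(f"alpha={alpha}: beta bound={beta_rate:.4f}, f bound={f_rate:.4f}, regime={regime}")
```

With these illustrative inputs the rougher function falls in the smooth regime and the smoother one in the sparse regime, which is the rate-switching behaviour described above.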

We next move to the partially linear additive models (1.1) and assume $J = 2$ for simplicity. Hence, we now have two nonparametric functions with possibly different smoothness:

$$Y = X^T\beta_0 + f_0(Z) + g_0(U) + \varepsilon,$$

where $g_0$ belongs to the $\gamma$-th order Sobolev space (with $\gamma > 1/2$). The minimax lower bounds for estimating $\beta_0$ and $f_0$ are exactly the same as (1.2) and (1.3), and hence do not depend on the smoothness of $g_0$ at all (no matter whether $\alpha > \gamma$ or $\gamma > \alpha$). The same bound applies to $g_0$ as well by replacing $\alpha$ with $\gamma$ in (1.3). The latter result essentially generalizes [19], who showed that, in an additive nonparametric regression model, each component can be estimated (up to first order asymptotics) as well as if all the other components were known. In the end, inspired by [12] and [20], we propose penalized estimators for $(\beta_0, f_0, g_0)$ that can almost achieve these lower bounds.

[Figure: "Phase Transition: Optimal Rate". Axes: smoothness index $\alpha$ and $\log(s_0\log(p/s_0))/\log(n)$; regions labeled Smooth Regime and Sparse Regime.]

Figure 1.1.: Minimax Rate Phase Transition. When the smoothness index $\alpha$, and the dimensionality and sparsity measured by $\log(s_0\log(p/s_0))/\log n$, fall in the smooth regime, the optimal rate is given by $n^{-2\alpha/(2\alpha+1)}$, which is determined solely by the smoothness of $f_0$. On the other hand, if they fall into the sparse regime, then the optimal rate is given by $s_0\log(p/s_0)/n$, which is determined entirely by the sparsity index $s_0$ and the dimensionality $p$.

Our main technical tools are a set of oracle inequalities implying that the parametric estimator can achieve the oracle rate and each nonparametric function can be estimated with the rate of convergence as if the others were known. These are developed based on some recent advances in empirical process theory [21]. To derive the risk upper bounds, we further strengthen these oracle inequalities to their moment versions.

1.2 High Dimensional Inference

In Section 3, I will introduce a debiased LASSO estimator for the high dimensional parametric vector in partial spline models (1.4), in the presence of a nuisance nonparametric function. Based on this estimator, I will further develop statistical inference procedures, including confidence intervals and hypothesis tests, together with their applications.

Consider a high dimensional partial smoothing spline model:

$$Y = X^T\beta_0 + g_0(Z) + \varepsilon, \tag{1.4}$$

where $\beta_0 \in \mathbb{R}^p$ is an unknown vector and $g_0$ is an unknown smooth function belonging to the $m$-th order Sobolev space $\mathcal{G}^m$. Here $(X, Z) \in \mathbb{R}^{p+1}$ are covariates, $Y \in \mathbb{R}$ is the response variable and $\varepsilon$ is the error term. In particular, we consider the case where the dimension $p$ is greater than the sample size $n$. Our interest is in statistical inference for the high dimensional parameter $\beta_0$, for instance, confidence intervals and hypothesis tests, in the presence of the nuisance function $g_0$.

In the low dimensional case where the number of covariates $p$ in the linear part is smaller than the sample size $n$, the estimation of $\beta_0$ and the asymptotic inference have been extensively studied; see [11,22,23] and references therein. The estimation of the high dimensional model has also been widely studied; see [12,18,24]. However, to the best of our knowledge, high dimensional statistical inference for $\beta_0$ has not been established in the literature, due to the high dimensionality and the intractable limiting distribution of LASSO type estimators. Recently, [25–27] have proposed a debiased version of the LASSO estimator for high dimensional linear and generalized linear models, which is non-sparse and has a limiting normal distribution. Inspired by this debiasing idea, we propose a debiased LASSO estimator, denoted as $\hat b$, for the partial smoothing spline model (1.4). Our proposed estimator is shown to be asymptotically unbiased for $\beta_0$, and each of its components has a limiting normal distribution. This distributional result naturally generalizes to linear contrasts of $\beta_0$ by using the Cramér-Wold device. As a byproduct, we have also calculated the variance of our estimator $\hat b$.

Based on this, we further conduct simultaneous hypothesis testing and propose a test statistic together with its multiplier bootstrap counterpart. In particular, this simultaneous testing method automatically takes into account the dependence structure within $\hat b$, and is also adaptive to the number of tests, which is allowed to be exponentially larger than the sample size. Our procedure is motivated by [28], who have recently proposed a statistic and its multiplier bootstrap version for high dimensional linear models. Our theoretical results are also numerically investigated in three applications, including component-wise confidence intervals, support recovery for sparse vectors and multiple testing with strong family wise error rate control.

To prove our results, we first show an oracle type inequality in Section 3.1.2, which has also been strengthened to a version in expectation. Then we give the explicit asymptotic order for the accuracy of the approximate inverse information matrix constructed from the nodewise LASSO method. The major technical tools we have used are a Bernstein type inequality, a weighted projection inequality from [29] and a central limit theorem for maxima from [30].

The rest of this chapter is organized as follows. Section 3.1 discusses the construction of the debiased LASSO estimator for $\beta_0$ and its asymptotic inference. Section 3.2 studies three applications of our theoretical results with their numerical performance. All technical details are deferred to Section 3.3.

2. MINIMAX OPTIMAL ESTIMATION

In this chapter I will discuss estimation of two classes of models, partial linear models (2.1) and partial linear additive models (2.6), as detailed later. Recently, [14] have derived minimax optimal estimation rates for both parametric and nonparametric components in these two models. I will construct estimators for both parameters and show that their risks achieve the minimax lower bounds in [14].

Before presenting any theoretical results, we introduce the following notation for convenience. For any vector $v \in \mathbb{R}^n$, we write its $\ell_1$, Euclidean and $\ell_\infty$ norms as $\|v\|_1 = \sum_{i=1}^n |v_i|$, $\|v\| = \sqrt{\sum_{i=1}^n v_i^2}$ and $\|v\|_\infty = \max_{1\le i\le n}|v_i|$, respectively, and also $\|v\|_n^2 := v^Tv/n$. With a slight abuse of notation, we define for any function $f : \mathcal{Z} \mapsto \mathbb{R}$ that $\|f\| = \sqrt{E f^2(Z)}$, $\|f\|_\infty = \sup_{z\in\mathcal{Z}}|f(z)|$ and $\|f\|_n^2 = \sum_{i=1}^n f^2(Z_i)/n$. Let $S_0$ be the set of all non-zero components of $\beta_0$ and $s_0 = |S_0|$. Define $\beta_{S_0}$ such that $(\beta_{S_0})_j = \beta_j 1\{\beta_{0j} \neq 0\}$ and $\beta_{S_0^c} = \beta - \beta_{S_0}$, for any $\beta \in \mathbb{R}^p$. Thus, $\|\beta\|_1 = \|\beta_{S_0}\|_1 + \|\beta_{S_0^c}\|_1$. The $\alpha$-th order Sobolev space over $[0,1]$, denoted as $W^{\alpha,2}(L)$, is defined as $\{f : [0,1] \to \mathbb{R} : \int_0^1 (f^{(\alpha)}(x))^2 dx \le L^2\}$ for a constant $L > 0$. For real sequences $a_n, b_n$, if $a_n \lesssim b_n$ ($a_n \gtrsim b_n$), then $\limsup a_n/b_n \le C$ ($c \le \limsup a_n/b_n$), for some constant $C$ (constant $c$). If $a_n \asymp b_n$, then $c \le \liminf a_n/b_n \le \limsup a_n/b_n \le C$ for some constants $c, C$. Also, we write $a_n = O(b_n)$ if $|a_n| \le C|b_n|$ for some constant $C > 0$. In the sequel, $c, c_0, C, C_0, \ldots$ denote generic constants which may differ at each appearance.
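As a quick check of the notation above, the following NumPy sketch computes the vector norms, the empirical function norm and the support split $\beta = \beta_{S_0} + \beta_{S_0^c}$. The helper names (`vector_norms`, `empirical_norm_sq`, `split_by_support`) are hypothetical and introduced only for illustration.

```python
import numpy as np


def vector_norms(v):
    """Return ||v||_1, ||v||, ||v||_inf and ||v||_n^2 = v^T v / n."""
    v = np.asarray(v, dtype=float)
    return {
        "l1": np.sum(np.abs(v)),
        "l2": np.sqrt(np.sum(v ** 2)),
        "linf": np.max(np.abs(v)),
        "n_sq": v @ v / v.size,
    }


def empirical_norm_sq(f, z):
    """Empirical norm ||f||_n^2 = sum_i f(Z_i)^2 / n for a vectorized function f."""
    z = np.asarray(z, dtype=float)
    return np.mean(f(z) ** 2)


def split_by_support(beta, beta0):
    """Split beta into beta_{S0} (coordinates where beta0 != 0) and beta_{S0^c}."""
    beta = np.asarray(beta, dtype=float)
    on_support = np.asarray(beta0) != 0
    beta_S0 = np.where(on_support, beta, 0.0)
    return beta_S0, beta - beta_S0


norms = vector_norms([3.0, -4.0])
b_S0, b_S0c = split_by_support([1.0, -2.0, 0.5], [0.0, 3.0, 0.0])
print(norms, b_S0, b_S0c)
```

Because the two pieces have disjoint supports, their $\ell_1$ norms add up to $\|\beta\|_1$, which is the identity stated in the text.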

2.1 Partial Linear Models

Let us consider partial linear models as follows:

$$Y_i = X_i^T\beta_0 + f_0(Z_i) + \varepsilon_i, \quad 1 \le i \le n, \tag{2.1}$$

where $(X_i, Z_i)_{i=1}^n \in \mathbb{R}^p \times [0,1]$ are i.i.d. copies of $(X, Z)$. We assume $X$ is a mean zero Gaussian vector with covariance matrix $\Sigma$, and the errors $\{\varepsilon_i\}_{i=1}^n$ are i.i.d. standard Gaussian random variables independent of $\{X_i, Z_i\}_{i=1}^n$. For simplicity, we standardize $X$ such that the diagonal of $\Sigma$ consists of 1's. In this chapter, we restrict our attention to the Gaussian design and noise since, even in high dimensional linear models, deriving sharp minimax bounds under non-Gaussian settings remains an open problem; see [15].

Let $B[s_0, p]$ be the set of $p$-dimensional vectors with at most $s_0$ non-zero coordinates and $S_p$ be the set of $p \times p$ matrices with 1's on the diagonal. Define

$$R_{\beta_0}(s_0, \Sigma, \alpha) := \inf_{\hat\beta} \sup_{\beta_0 \in B[s_0,p],\, f_0 \in W^{\alpha,2}(L)} E\big[\|\beta_0 - \hat\beta\|^2\big] \tag{2.2}$$

and

$$R_{f_0}(s_0, \Sigma, \alpha) = \inf_{\hat f} \sup_{\beta_0 \in B[s_0,p],\, f_0 \in W^{\alpha,2}(L)} E \int_0^1 |\hat f(z) - f_0(z)|^2 dz.$$

The minimax risks with respect to random designs with covariance matrices $\Sigma$ are defined as

$$R_{\beta_0}(s_0, \alpha) := \inf_{\Sigma \in S_p} R_{\beta_0}(s_0, \Sigma, \alpha) \quad\text{and}\quad R_{f_0}(s_0, \alpha) := \inf_{\Sigma \in S_p} R_{f_0}(s_0, \Sigma, \alpha),$$

respectively.

It is known from [14] that, given $n$ i.i.d. samples from the high dimensional partial linear model (2.1), the minimax risk for estimating $\beta_0$ can be bounded from below as

$$R_{\beta_0}(s_0, \alpha) \gtrsim \frac{s_0}{n}\log\frac{p}{s_0}, \tag{2.3}$$

and the minimax risk for estimating $f_0$ can be bounded from below as

$$R_{f_0}(s_0, \alpha) \gtrsim \max\left\{ n^{-2\alpha/(2\alpha+1)},\ \frac{s_0}{n}\log\frac{p}{s_0} \right\}. \tag{2.4}$$

Now, we demonstrate that the doubly penalized estimators of $(\beta_0, f_0)$ proposed in [12] and [18] nearly achieve the lower bounds in (2.3) and (2.4); see Table 2.1. In particular, when $s_0 \sim p^r$, the matching is exact up to some constant.

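We define the penalized estimators $(\hat\beta, \hat f)$ as follows:

$$(\hat\beta, \hat f) := \operatorname*{argmin}_{(\beta, f) \in \mathbb{R}^p \times W^{\alpha,2}(L)} \left\{ \|Y - X^T\beta - f\|_n^2 + \lambda\|\beta\|_1 + \rho^2 J^2(f) \right\}, \tag{2.5}$$

where $\|\beta\|_1$ is the $\ell_1$ penalty and $J^2(f) = \int_0^1 (f^{(\alpha)}(z))^2 dz$ is the smoothness penalty. Here, $\lambda > 0$ and $\rho^2$ control the level of shrinkage for $\hat\beta$ and the roughness of $\hat f$, respectively.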
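The criterion (2.5) can be explored numerically. The sketch below is one possible way to approximate its minimizer and is not the algorithm of [12] or [18]: it assumes that $f$ is expanded in a truncated cosine basis (so that $J^2(f)$ becomes a weighted sum of squared coefficients for integer $\alpha$), ignores the ball constraint $J(f) \le L$, and alternates an exact ridge-type update for $f$ with proximal gradient (soft-thresholding) steps for $\beta$. All function names, the basis size `K` and the iteration counts are hypothetical choices.

```python
import numpy as np


def cosine_basis(z, K):
    """Columns sqrt(2) * cos(k * pi * z), k = 1..K, orthonormal in L2[0, 1]."""
    k = np.arange(1, K + 1)
    return np.sqrt(2.0) * np.cos(np.pi * np.outer(z, k))


def objective(Y, X, Phi, beta, theta, lam, rho2, pen):
    """Criterion (2.5) with f = Phi @ theta and J^2(f) = sum_k pen_k * theta_k^2."""
    resid = Y - X @ beta - Phi @ theta
    return resid @ resid / len(Y) + lam * np.sum(np.abs(beta)) + rho2 * np.sum(pen * theta ** 2)


def fit_doubly_penalized(Y, X, z, lam, rho2, alpha=2, K=20, n_outer=50, n_ista=20):
    n, p = X.shape
    Phi = cosine_basis(z, K)
    # For integer alpha, the alpha-th derivative of the k-th basis function
    # has squared L2 norm (k * pi)^(2 * alpha).
    pen = (np.pi * np.arange(1, K + 1)) ** (2 * alpha)
    beta, theta = np.zeros(p), np.zeros(K)
    # Step size 1 / L, where L = 2 * sigma_max(X)^2 / n is the Lipschitz
    # constant of the gradient of the least squares term in beta.
    step = n / (2.0 * np.linalg.norm(X, 2) ** 2)
    A = Phi.T @ Phi / n + rho2 * np.diag(pen)
    for _ in range(n_outer):
        # f-step: exact minimizer over theta for the current beta.
        theta = np.linalg.solve(A, Phi.T @ (Y - X @ beta) / n)
        r = Y - Phi @ theta
        # beta-step: a few proximal gradient iterations on the l1 problem.
        for _ in range(n_ista):
            grad = -2.0 * X.T @ (r - X @ beta) / n
            u = beta - step * grad
            beta = np.sign(u) * np.maximum(np.abs(u) - step * lam, 0.0)
    return beta, theta
```

Each update can only decrease the criterion, so the returned pair never has a larger criterion value than the starting point $(0, 0)$. The choices $\lambda \asymp \sqrt{\log p/n}$ and $\rho^2 \asymp n^{-2\alpha/(2\alpha+1)}$ from the text can be passed in as `lam` and `rho2`.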

Before calculating risk upper bounds, we need the following assumptions, adopted from [18]. In particular, Assumption 2.2 avoids functions with high and very steep peaks. Let $h_1(Z) := E[X|Z]$.

Assumption 2.1. The smallest eigenvalue of $E(X - h_1(Z))^T(X - h_1(Z))$ is positive, and the largest eigenvalue of $E h_1^T h_1$ is finite.

Assumption 2.2. For some constant $K_f$, it holds that $\sup_{\{\|f\| \le 1,\, J(f) \le 1\}} \|f\|_\infty \le K_f$.

Assumption 2.3. $J(h_1)$ is bounded.

Theorem 2.1. Suppose Assumptions 2.1–2.3 hold. If $\lambda \asymp \sqrt{\log p/n}$ and $\rho^2 \asymp n^{-2\alpha/(2\alpha+1)}$, then we have

$$E\|\hat\beta - \beta_0\|^2 \lesssim \frac{s_0\log p}{n}$$

and

$$E\int_0^1 |\hat f(z) - f_0(z)|^2 dz \lesssim \max\left\{ n^{-2\alpha/(2\alpha+1)},\ \frac{s_0\log p}{n} \right\}.$$

These risk upper bounds are almost optimal compared with (2.3) and (2.4). This generalizes the claim by [15] that LASSO achieves the (almost optimal) risk bound $s_0\log(p)/n$ in high dimensional linear models to partial linear models.

In the end, we remark that the oracle inequalities given in Theorems 1 and 2 of [18] only imply the estimation rates of $\hat\beta$ and $\hat f$ (in terms of the $\ell_2$-norm). Rather, our theorem

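Table 2.1.: Estimation Interference Results for Model (2.1).

High Dimensional $\beta_0$

| Sparsity Parameters $s_0, p, n$ | Lower Bound $R_{\beta_0}(s_0, \alpha)$ | Penalized Estimator $E\lVert\hat\beta - \beta_0\rVert^2$ |
|---|---|---|
| $\frac{1}{2\alpha+1} < a+b < 1$ | $s_0\log(p/s_0)/n$ | $s_0\log p/n$ |
| $a+b < \frac{1}{2\alpha+1}$ | $s_0\log(p/s_0)/n$ | $s_0\log p/n$ |

Smooth $f_0$

| Sparsity Parameters $s_0, p, n$ | Lower Bound $R_{f_0}(s_0, \alpha)$ | Penalized Estimator $E\int_0^1 \lvert\hat f(z) - f_0(z)\rvert^2 dz$ |
|---|---|---|
| $\frac{1}{2\alpha+1} < a+b < 1$ | $s_0\log(p/s_0)/n$ | $s_0\log p/n$ |
| $a+b < \frac{1}{2\alpha+1}$ | $n^{-2\alpha/(2\alpha+1)}$ | $n^{-2\alpha/(2\alpha+1)}$ |

We set $s_0 = n^b$, $p = \exp(n^a)$.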
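The threshold $1/(2\alpha+1)$ in Table 2.1 can be traced back to the parametrization in its note. The following short derivation is a sketch that assumes only $s_0 = n^b$, $p = \exp(n^a)$ and the two rates appearing in Theorem 2.1:

```latex
\begin{align*}
\frac{s_0 \log p}{n} &= \frac{n^{b}\, n^{a}}{n} = n^{a+b-1}, \\
n^{a+b-1} \ge n^{-2\alpha/(2\alpha+1)}
  &\iff a+b-1 \ge -\frac{2\alpha}{2\alpha+1}
  \iff a+b \ge \frac{1}{2\alpha+1}.
\end{align*}
```

Hence the sparse term dominates the bound for $f_0$ when $a+b > 1/(2\alpha+1)$, the smooth term dominates when $a+b < 1/(2\alpha+1)$, and the condition $a+b < 1$ is what makes $s_0\log p/n$ tend to zero.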


2.2 Partial Linear Additive Models

We are now ready to consider the partially linear additive models with two additive components (for simplicity):

$$Y_i = X_i^T\beta_0 + f_0(Z_i) + g_0(U_i) + \varepsilon_i, \quad 1 \le i \le n, \tag{2.6}$$

where $(X_i, Z_i, U_i) \in \mathbb{R}^p \times [0,1] \times [0,1]$ are i.i.d. copies of $(X, Z, U)$ with a joint density $p_{XZU}$, $\beta_0 \in B[s_0, p]$, $f_0 \in W^{\alpha,2}(L_1)$ and $g_0 \in W^{\gamma,2}(L_2)$. As before, we assume that $X_i \sim N_p(0, \Sigma)$ and $\varepsilon_i \sim N(0,1)$, independent of the design. For identifiability purposes, we assume $E g_0(U) = 0$.

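Define the minimax risk for estimating $\beta_0$ as

$$R_{\beta_0}(s_0, \alpha, \gamma) := \inf_{\Sigma \in S_p} R_{\beta_0}(s_0, \alpha, \gamma, \Sigma),$$

where

$$R_{\beta_0}(s_0, \alpha, \gamma, \Sigma) := \inf_{\hat\beta} \sup_{\beta_0 \in B[s_0,p],\, f \in W^{\alpha,2}(L_1),\, g \in W^{\gamma,2}(L_2)} E\big[\|\beta_0 - \hat\beta\|^2\big].$$

It has been proved in [14] that, given $n$ i.i.d. samples from the high dimensional partial linear additive model (2.6), the minimax risk for estimating $\beta_0$ can be bounded from below as

$$R_{\beta_0}(s_0, \alpha, \gamma) \gtrsim \frac{s_0}{n}\log\frac{p}{s_0}, \tag{2.7}$$

which is only affected by the least smooth function. Next, define the minimax risk of estimating $f_0$ as

$$R_{f_0}(s_0, \alpha, \gamma) := \inf_{\Sigma \in S_p} R_{f_0}(s_0, \Sigma, \alpha, \gamma),$$

where

$$R_{f_0}(s_0, \Sigma, \alpha, \gamma) = \inf_{\hat f} \sup_{\beta_0 \in B[s_0,p],\, g \in W^{\gamma,2}(L_2)}\ \sup_{f_0 \in W^{\alpha,2}(L_1)} E\int_0^1 |\hat f(z) - f_0(z)|^2 dz.$$

[14] shows that, given $n$ i.i.d. samples from the high dimensional partial linear additive model (2.6), the minimax risk for estimating $f_0$ can be bounded from below as

$$R_{f_0}(s_0, \alpha, \gamma) \gtrsim \max\left\{ n^{-2\alpha/(2\alpha+1)},\ \frac{s_0}{n}\log\frac{p}{s_0} \right\}. \tag{2.8}$$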

Now, we construct estimators of $(\beta_0, f_0, g_0)$ that almost achieve the lower bounds shown in [14]. Our construction is inspired by [12] and [20], and holds in a general setup as follows. We assume that $f$ and $g$ belong to more general classes of functions, Hilbert spaces $\mathcal{F}$ and $\mathcal{G}$ of continuous functions $f \in \mathcal{F}$ and $g \in \mathcal{G}$ on $[0,1]$. In particular, $W^{\alpha,2}(L_1) \subset \mathcal{F}$ and $W^{\gamma,2}(L_2) \subset \mathcal{G}$. Let $I(\cdot,\cdot)$ and $J(\cdot,\cdot)$ be semi-inner products on $\mathcal{F}$ and $\mathcal{G}$, and $I(\cdot)$, $J(\cdot)$ be the corresponding semi-norms. A special case of $\mathcal{F}, \mathcal{G}$ and $I, J$ is $\mathcal{F} = W^{\alpha,2}(L_1)$, $\mathcal{G} = W^{\gamma,2}(L_2)$ and $I^2(f) = \int_0^1 (f^{(\alpha)}(z))^2 dz$, $J^2(g) = \int_0^1 (g^{(\gamma)}(z))^2 dz$. Also, we can allow $\varepsilon$ to be sub-Gaussian (not necessarily $N(0,1)$) in this section.

The penalized least squares estimators of $(\beta_0, f_0, g_0)$ can be obtained as

$$(\hat\beta, \hat f, \hat g) = \operatorname*{argmin}_{\beta \in \mathbb{R}^p,\, f \in \mathcal{F},\, g \in \mathcal{G}} \left\{ \|Y - X^T\beta - f - g\|_n^2 + \lambda\|\beta\|_1 + \rho^2 I^2(f) + \mu^2 J^q(g) \right\}, \tag{2.9}$$

where $1 \le q \le 2$ is some fixed constant. Without loss of generality, we assume that functions in $\mathcal{F}$ are smoother than those in $\mathcal{G}$ in a sense defined in Assumption 2.6. For $W^{\alpha,2}(L_1)$ and $W^{\gamma,2}(L_2)$, this simply means $\gamma < \alpha$.

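Assumption 2.4 assumes sub-Gaussian error. Note that if $\varepsilon_1$ is bounded such that $\|\varepsilon_1\| \le M$, then $\|\varepsilon_1\|_\Psi \le M$.

Assumption 2.4. The error term $\varepsilon$ is independent of $(X, Z, U)$ and satisfies, for some constant $K_\varepsilon \ge 1$,

$$\|\varepsilon\|_\Psi \le K_\varepsilon,$$

where $\|\cdot\|_\Psi$ is an Orlicz norm with $\Psi(t) = \exp(t^2) - 1$.

Let $\mathcal{H} = \mathcal{F} \oplus \mathcal{G}$ be a Hilbert space of additive functions with the $\ell_2$ norm $\|\cdot\|$ and $\tilde X = X - \Pi(X|\mathcal{H})$, with $\Pi(X|\mathcal{H})$ being the projection of $X$ onto $\mathcal{H}$ defined as $\operatorname{argmin}_{h^* \in \mathcal{H}} E\|X - h^*\|^2$. By the definition of $\tilde X$, we have

$$\|X^T\beta + f + g\|^2 = \|\tilde X^T\beta\|^2 + \|\Pi(X|\mathcal{H})^T\beta + f + g\|^2. \tag{2.10}$$

Also by the definition of $\mathcal{H}$, $\Pi(X|\mathcal{H}) = (\Pi(X_1|\mathcal{H}), \cdots, \Pi(X_p|\mathcal{H}))^T$ can be written as a sum $f_X + g_X$, where $f_{X_j} \in \mathcal{F}$ and $g_{X_j} \in \mathcal{G}$ for $1 \le j \le p$.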

Assumption 2.5 is widely used in the semiparametric literature [31,32], ensuring sufficient information in estimating $\beta_0$.

Assumption 2.5. The smallest eigenvalue $\Lambda_{\min}^2$ of $E\tilde X^T\tilde X$ is positive, and the largest eigenvalue $\Lambda_{\max}^2$ of $E\,\Pi(X|\mathcal{H})^T\Pi(X|\mathcal{H})$ is finite.

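Assumption 2.6 implies that $f$ is “smoother” than $g$ in terms of the following complexity measure. Let $d$ be a metric on the space $\mathcal{F}$. For any $t > 0$, define $N(t, \mathcal{F}, d)$ as the covering number of $\mathcal{F}$ and $H(t, \mathcal{F}, d) = \log N(t, \mathcal{F}, d)$ as the entropy of $\mathcal{F}$. Let $\mathcal{A}_n$ be the set of all configurations $A_n$ of $n$ points within the support of $P_{XZU}$. For $A_n \in \mathcal{A}_n$, we have $\|f\|_{A_n,\infty} = \max_{Z \in A_n}|f(Z)|$. Let $H_\infty(t, \mathcal{F}) = \sup_{A_n \in \mathcal{A}_n} H(t, \mathcal{F}, \|\cdot\|_{A_n,\infty})$. Further, we write

$$J_\infty(u, \mathcal{F}) = C_0 \inf_{\delta > 0} \left\{ u\int_{\delta/4}^{1} \sqrt{H_\infty(tu/2, \mathcal{F})}\, dt + \sqrt{n}\,\delta u \right\}.$$

For arbitrary constants $R_0 > 0$ and $M_0 > 0$, we denote $\mathcal{F}(R_0, M_0) = \{f \in \mathcal{F} : \|f\| \le R_0,\ I(f) \le M_0\}$ and $\mathcal{G}(R_0, M_0) = \{g \in \mathcal{G} : \|g\| \le R_0,\ J(g) \le M_0\}$. Define $f_\beta(x) = x^T\beta$ and $\mathcal{F}_\beta(R_0, M_0) = \{f_\beta : \|f_\beta\| \le R_0,\ \|\beta\|_1 \le M_0\}$.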

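Assumption 2.6. Let $0 < k < m < 1$. For $R_0 \le M_0$ and some constants $A_I \ge 1$ and $A_J \ge 1$, it holds that

$$J_\infty(z, \mathcal{F}(R_0, M_0)) \le A_I M_0^k z^{1-k} \quad\text{and}\quad J_\infty(z, \mathcal{G}(R_0, M_0)) \le A_J M_0^m z^{1-m}.$$

In Assumption 2.6, if we take $I^2(f) = \int (f^{(\alpha)}(x))^2 dx$, then $J_\infty(z, \mathcal{F}(1,1)) \le A_I z^{1-\frac{1}{2\alpha}}$, i.e. $k = 1/(2\alpha)$, for some constant $A_I > 1$.

Assumption 2.7. For some constant $B \ge 1$, all $M_0 > 0$ and any $R_0 \le M_0/B$, it holds that

$$\sup_{f_\beta \in \mathcal{F}_\beta(R_0, M_0)} \|f_\beta\|_\infty \le M_0, \quad \sup_{f \in \mathcal{F}(R_0, M_0)} \|f\|_\infty \le M_0, \quad \sup_{g \in \mathcal{G}(R_0, M_0)} \|g\|_\infty \le M_0.$$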

Assumption 2.8 implies separate rates for $f$ and $g$ from that for $f + g$. This is due to the inequality $\|f + g\|^2 \ge (1 - \gamma)(\|f\| + \|g\|)^2$, as shown in Lemma 5.1 of [20] given $E f_0(Z) = 0$. Here, $\gamma$ is related to the minimal angle between the two Hilbert spaces $\mathcal{F}$ and $\mathcal{G}$, see A.4 of [31], and is formally defined as follows:

$$\gamma^2 = \int (r - 1)^2 p_Z p_U\, d\nu,$$

where $p = dP_{ZU}/d\nu$ is the density of $P_{ZU}$ w.r.t. $\nu = \nu_Z \times \nu_U$ with marginal densities $p_Z$ and $p_U$, and $r(z, u) = p(z, u)/(p_Z(z)p_U(u))$.

Assumption 2.8. It holds that $\gamma < 1$.

We assume the projection $f_P(U) = E(f(Z)|U)$ to be smooth.

Assumption 2.9. For some constant $\Gamma > 0$, it holds that, for any function $f \in \mathcal{F}$,

$$J(f_P) \le \Gamma\|f\|,$$

and for some constants $I_1, J_2$, it holds that

$$\max_{1 \le j \le p}|I(f_{X_j})| \le I_1, \quad \max_{1 \le j \le p}|J(g_{X_j})| \le J_2.$$

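Before presenting our main theorem, we need a set of oracle inequalities that hold in probability. Define the norms

$$\tau(\beta, f, g; R) = \frac{\lambda\|\beta\|_1}{\delta_0 R} + \|X^T\beta + f + g\| + \rho I(f) + \left(\frac{\mu}{R}\right)^{\frac{2-q}{q}} \mu J(g),$$

$$\tau_I(\beta, f; R_I) = \frac{\lambda\|\beta\|_1}{\delta_0 R_I} + \|\tilde X\beta\| + \|f_X^T\beta + f\| + \rho I(f),$$

for some constant $\delta_0 > 0$.

Lemma 2.1. Suppose Assumptions 2.4–2.9 hold. Also assume that for some $0 < \delta < 1$, $\max\{A_I^2, A_J^2\}/n \le n^{-\delta}$ and $A_J^2/n \le (A_I^2/n)^{\frac{1}{1+k}} \le (A_J^2/n)^{\frac{1}{1+m}} n^{-\delta}$. Let

$$\lambda \asymp \sqrt{\frac{\log p}{n}}, \quad \rho^2 \asymp A_I^{\frac{2}{1+k}} n^{-\frac{1}{1+k}} \quad\text{and}\quad \mu^2 \asymp A_J^{\frac{2}{1+m}} n^{-\frac{1}{1+m}}.$$

If there exist $R$ and $R_I$ satisfying $R^2 \le \lambda \le 1$, $R^2 \asymp \mu^2 + \lambda^2 s_0$ and $R_I^2 \asymp \rho^2 + \lambda^2 s_0$, then

$$P\left( \tau(\hat\beta - \beta_0, \hat f - f_0, \hat g - g_0; R) \le R \right) \ge 1 - c\exp(-n\mu^2/c),$$

$$P\left( \tau(\hat\beta - \beta_0, \hat f - f_0, \hat g - g_0; R) \le R,\ \tau_I(\hat\beta - \beta_0, \hat f - f_0; R_I) \le R_I \right) \ge 1 - C\exp(-n\rho^2/C)$$

for some constants $C, c > 0$.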

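A noteworthy case is that $I^2(f) = \int (f^{(\alpha)}(z))^2 dz$, $J^2(g) = \int (g^{(\gamma)}(u))^2 du$ with $\gamma < \alpha$ and $q = 2$. Set $\rho \asymp n^{-\alpha/(2\alpha+1)}$ and $\mu \asymp n^{-\gamma/(2\gamma+1)}$. Given that $s_0\log p/n = o(n^{-2\alpha/(2\alpha+1)})$, Lemma 2.1 implies that $\|\hat\beta - \beta_0\|_2 = O_P(n^{-\gamma/(2\gamma+1)})$, $\|\hat f - f_0\| = O_P(n^{-\alpha/(2\alpha+1)})$ and $\|\hat g - g_0\| = O_P(n^{-\gamma/(2\gamma+1)})$; otherwise, $\|\hat\beta - \beta_0\|_2 = O_P(s_0\log p/n)$, $\|\hat f - f_0\| = O_P(s_0\log p/n)$ and $\|\hat g - g_0\| = O_P(s_0\log p/n)$. The upper bounds exhibit an interesting two regime dichotomy depending on the relation between $s_0\log p/n$ and $n^{-2\alpha/(2\alpha+1)}$.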

Lemma 2.2. Assume the conditions of Lemma 2.1. Then there exist constants $C_0, c_0 > 0$ such that, with probability at least $1 - 7/(2p) - C_0\exp(-c_0 n\mu^2)$,

$$\|\tilde X^T(\hat\beta-\beta_0)\|_n^2 + (\lambda/2)\|\hat\beta-\beta_0\|_1 \le \frac{4 s_0\lambda^2}{\Lambda^2_{\tilde X,\min}}.$$

Lemma 2.2 has two important implications: (i) prediction error: $\|\tilde X^T(\hat\beta-\beta_0)\|_n^2 \le 4 s_0\lambda^2/\Lambda^2_{\tilde X,\min}$; (ii) $\ell_1$ error: $\|\hat\beta-\beta_0\|_1 \le 8 s_0\lambda/\Lambda^2_{\tilde X,\min}$. We note that these two rates are of the same order as the standard lasso rates (as if $f_0$ and $g_0$ were known); see [1]. However, the probability with which these rates hold is smaller, as reflected by the additional term $\exp(-c_0 n\mu^2)$. This is the price to pay for estimating two unknown nonparametric functions in the model.
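The two implications of Lemma 2.2 each follow by dropping one of the two nonnegative terms on the left-hand side of its bound. A short worked version, written with the same symbols as the lemma:

```latex
% (i) drop the nonnegative term (\lambda/2)\|\hat\beta-\beta_0\|_1:
\|\tilde X^T(\hat\beta-\beta_0)\|_n^2
  \le \|\tilde X^T(\hat\beta-\beta_0)\|_n^2 + \tfrac{\lambda}{2}\|\hat\beta-\beta_0\|_1
  \le \frac{4 s_0 \lambda^2}{\Lambda^2_{\tilde X,\min}} .

% (ii) drop the nonnegative term \|\tilde X^T(\hat\beta-\beta_0)\|_n^2
%      and multiply both sides by 2/\lambda:
\tfrac{\lambda}{2}\|\hat\beta-\beta_0\|_1
  \le \frac{4 s_0 \lambda^2}{\Lambda^2_{\tilde X,\min}}
\quad\Longrightarrow\quad
\|\hat\beta-\beta_0\|_1 \le \frac{8 s_0 \lambda}{\Lambda^2_{\tilde X,\min}} .
```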

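We are now ready to prove that $(\hat\beta, \hat f, \hat g)$ achieve the minimax lower bounds established in (2.7) and (2.8) in the case $q = 2$, $\mathcal{F} = W^{\alpha,2}(L_1)$, $\mathcal{G} = W^{\gamma,2}(L_2)$ and $I^2(f) = \int_0^1 (f^{(\alpha)}(z))^2\,dz$, $J^2(g) = \int_0^1 (g^{(\gamma)}(z))^2\,dz$.

Theorem 2.2. Suppose Assumptions 2.5, 2.7, 2.8 and 2.9 hold. Set $\lambda \asymp \sqrt{\log p/n}$, $\rho^2 \asymp n^{-2\alpha/(2\alpha+1)}$ and $\mu^2 \asymp n^{-2\gamma/(2\gamma+1)}$. Then $E\|\hat\beta-\beta_0\|^2 \lesssim s_0\log p$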


Table 2.2.: Estimation Interference Results for Model (2.6).

                                   High Dimensional $\beta_0$
Sparsity Parameters $s_0, p, n$    $R_{\beta_0}(s_0, \alpha)$    $E\|\hat\beta-\beta_0\|^2$
$\frac{1}{2\eta+1} < a+b < 1$      $s_0\log(p/s_0)/n$            $s_0\log p/n$
$a+b < \frac{1}{2\eta+1}$          $s_0\log p/n$                 $s_0\log p/n$

Sparsity Parameters                Smooth $f_0$

2.3.1 Proof for Section 2.1

Proof of Theorem 2.1

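Proof. Let $\rho^2 = n^{-2\alpha/(2\alpha+1)}$ and $\lambda = \sqrt{\log p/n}$ in (2.5). By definition, we have

$$\|Y - X^T\hat\beta - \hat f\|_n^2 + \lambda\|\hat\beta\|_1 + \rho^2 I^2(\hat f) \le \|Y - X^T\beta_0 - f_0\|_n^2 + \lambda\|\beta_0\|_1 + \rho^2 I^2(f_0). \tag{2.11}$$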

Then, by triangle inequality, it holds that


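which further implies, for any $k \ge 1$,

$$E\|\hat\beta-\beta_0\|^k \le E\|\hat\beta-\beta_0\|_1^k \le E\big(\|\varepsilon\|_n^2/\lambda + \|\beta_0\|_1 + \rho^2 I^2(f_0)/\lambda\big)^k.$$

Note that $n\|\varepsilon\|_n^2$ follows a chi-squared distribution with $n$ degrees of freedom. Thus we have $E\|\varepsilon\|_n^k = O(1)$. Also, we have $\|\beta_0\|_1 = O(\sqrt{s_0})$. Therefore, it follows that

$$E\|\hat\beta-\beta_0\|^k = O(1/\lambda + \sqrt{s_0})^k.$$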
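The boundedness of the noise moments can be checked with the closed-form moments of the chi-squared distribution. The sketch below assumes standard normal errors (unit variance), so that $\|\varepsilon\|_n^2 = \chi^2_n/n$; the helper name `scaled_chi2_moment` is illustrative.

```python
def scaled_chi2_moment(n, k):
    """Exact value of E[(chi2_n / n)^k] for a positive integer k.

    Uses E[(chi2_n)^k] = n (n + 2) ... (n + 2(k - 1)), so the scaled moment
    is the product of (n + 2j)/n over j = 0, ..., k - 1.
    """
    value = 1.0
    for j in range(k):
        value *= (n + 2 * j) / n
    return value


for n in (10, 100, 10_000):
    print(n, scaled_chi2_moment(n, 1), scaled_chi2_moment(n, 2), scaled_chi2_moment(n, 4))
```

Each factor $(n+2j)/n$ is at most $1 + 2(k-1)$ and tends to 1 as $n$ grows, so for fixed $k$ these moments stay bounded uniformly in $n$, which is the $O(1)$ claim used in the proof.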

Define the event $T = \{\|X(\hat\beta-\beta_0)\|^2 \le \lambda^2 s_0\}$. It is known from the proof of Theorem 3.1 of [18] that

$$P(T) \ge 1 - 2/p - 3\exp(-nc\rho^2).$$

Note that if we choose $t = 2\log 2p/n$ in the proof of Theorem 3.1 of [18], then the above probability inequality holds in the form

$$P(T) \ge 1 - 2/p^4 - 3\exp(-nc\rho^2)$$

for some constant $c > 0$. Now we have

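$$\begin{aligned}
E\|\hat\beta-\beta_0\|^2 &= E\|\hat\beta-\beta_0\|^2 1_T + E\|\hat\beta-\beta_0\|^2 1_{T^c} \\
&\le O(\lambda^2 s_0) + \sqrt{E\|\hat\beta-\beta_0\|^4}\,\sqrt{P(T^c)} \\
&\le O(\lambda^2 s_0) + O(1/\lambda^2 + s_0)\sqrt{\exp(-nc\rho^2) + 2/p^4} \\
&\le O(\lambda^2 s_0).
\end{aligned} \tag{2.12}$$

The last inequality holds by the following arguments. Substituting $\rho^2 = n^{-2\alpha/(2\alpha+1)}$ and $\lambda = \sqrt{\log p/n}$ into the last inequality, we have

$(1/\lambda^2)\sqrt{\exp(-n^{1/(2\alpha+1)})} = O(\lambda^2 s_0)$, since $n^2\exp(-n^{1/(2\alpha+1)}) = O(s_0\log^2 p)$;

$(1/\lambda^2)(1/p^2) = O(\lambda^2 s_0)$, since $1 = O\big(p^2 s_0\log^2 p/n^2\big)$;

$s_0\sqrt{\exp(-n^{1/(2\alpha+1)})} = O(\lambda^2 s_0)$, since $n\exp(-n^{1/(2\alpha+1)}) = O(\log p)$;

$s_0(1/p^2) = O(\lambda^2 s_0)$, since $1 = O\big(p^2\log p/n\big)$.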

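Now we prove the second part of the theorem. Note that $\hat f \in W^{2,\alpha}(L)$; together with (2.11), this implies that, for some constant $C > 0$,

$$\sup_{z\in[0,1]}|\hat f(z) - f_0(z)|^2 \le C I^2(\hat f - f_0) = C\int_0^1 \big(\hat f^{(\alpha)}(z) - f_0^{(\alpha)}(z)\big)^2 dz \le C\big(\|\varepsilon\|_n^2/\rho^2 + 2\lambda\|\beta_0\|_1/\rho^2 + I^2(f_0)\big).$$

Therefore we have

$$E\|\hat f - f_0\|_n^{2k} \le \Big(E\int_0^1 \big(\hat f^{(\alpha)}(z) - f_0^{(\alpha)}(z)\big)^2 dz\Big)^k \le O\big(1/\rho^2 + \lambda\sqrt{s_0}/\rho^2\big)^k = O(1/\rho^{2k})$$

for any $k \ge 1$. Hence

$$\begin{aligned}
E\|\hat f - f_0\|_n^2 &= E\|\hat f - f_0\|_n^2 1_T + E\|\hat f - f_0\|_n^2 1_{T^c} \\
&\le O(\rho^2 + \lambda^2 s_0) + O(1/\rho^2)\sqrt{\exp(-nc\rho^2)} \\
&\le O(\rho^2 + \lambda^2 s_0).
\end{aligned} \tag{2.13}$$

Finally, it follows from Lemma 4.1 of [33] that $E\int_0^1 |\hat f(z) - f_0(z)|^2 dz = O(\rho^2 + \lambda^2 s_0)$.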

2.3.2 Proof for Section 2.2

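In this section, we first define a set $\mathcal{T}(R)$ and show in Lemma 2.3 that $\tau(\hat\beta-\beta_0, \hat f-f_0, \hat g-g_0; R) \le R$ on $\mathcal{T}(R)$. The probability of $\mathcal{T}(R)$ is approximated in Lemma 2.4. We next show $\tau_I(\hat\beta-\beta_0, \hat f-f_0; R_I) \le R_I$ on the set $\mathcal{T}(R)\cap\mathcal{T}_I(R_I)$, whereas the probability of $\mathcal{T}_I(R_I)$ is approximated in Lemma 2.6. Lemma 2.1 is then proved following Lemmas 2.3-2.6.

For some $\delta_0 > 0$ small enough, define

$$\mathcal{M}(R) = \{(\beta, f, g) : \tau(\beta, f, g; R) \le R\},$$

$$\mathcal{T}_1(R) = \Big\{ \sup_{\mathcal{M}(R)} \big| \|X^T\beta + f + g\|_n^2 - \|X^T\beta + f + g\|^2 \big| \le \delta_0^2 R^2 \Big\},$$

$$\mathcal{T}_2(R) = \Big\{ \sup_{\mathcal{M}(R)} \big| P_n\,\varepsilon(X^T\beta + f + g) \big| \le \delta_0^2 R^2 \Big\},$$

$$\mathcal{T}(R) = \mathcal{T}_1(R)\cap\mathcal{T}_2(R).$$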


Lemma 2.3. Under the conditions of Lemma 2.1, we have, on $\mathcal{T}(R)$,

$$\tau(\hat\beta-\beta_0, \hat f-f_0, \hat g-g_0; R) \le R.$$

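Proof. Take $\delta_0 \le 1/30$. Under the conditions of Lemma 2.1, we can find $\rho$ and $\mu$ such that

$$\rho^2 I^2(f_0) + \mu^2 J^q(g_0) \le \delta_0^2 R^2, \tag{2.14}$$

and $4\lambda^2 s_0/\Lambda_{\min}^2 \le R_I^2 \le R^2$. Define

$$t = \frac{R}{R + \tau(\hat\beta-\beta_0, \hat f-f_0, \hat g-g_0; R)}.$$

Let $\tilde\beta = t\hat\beta + (1-t)\beta_0$, $\tilde f = t\hat f + (1-t)f_0$, $\tilde g = t\hat g + (1-t)g_0$. Notice that $\tau(\tilde\beta-\beta_0, \tilde f-f_0, \tilde g-g_0; R) = t\,\tau(\hat\beta-\beta_0, \hat f-f_0, \hat g-g_0; R) \le R$, which implies $(\tilde\beta-\beta_0, \tilde f-f_0, \tilde g-g_0) \in \mathcal{M}(R)$. In order to show $\tau(\hat\beta-\beta_0, \hat f-f_0, \hat g-g_0; R) \le R$, it suffices to prove $\tau(\tilde\beta-\beta_0, \tilde f-f_0, \tilde g-g_0; R) \le R/2$. By convexity, we have

$$\begin{aligned}
&\|Y - X^T\tilde\beta - \tilde f - \tilde g\|_n^2 + \lambda\|\tilde\beta\|_1 + \rho^2 I^2(\tilde f) + \mu^2 J^q(\tilde g) \\
&\quad\le \|Y - X^T\beta_0 - f_0 - g_0\|_n^2 + \lambda\|\beta_0\|_1 + \rho^2 I^2(f_0) + \mu^2 J^q(g_0) \\
&\quad\le \|Y - X^T\beta_0 - f_0 - g_0\|_n^2 + \lambda\|\beta_0\|_1 + \delta_0^2 R^2,
\end{aligned}$$

where the last inequality follows from equation (2.14). This implies

$$\begin{aligned}
&\|X^T(\tilde\beta-\beta_0) + (\tilde f-f_0) + (\tilde g-g_0)\|_n^2 + \lambda\|\tilde\beta\|_1 + \rho^2 I^2(\tilde f) + \mu^2 J^q(\tilde g) \\
&\quad\le 2P_n\,\varepsilon\big(X^T(\tilde\beta-\beta_0) + (\tilde f-f_0) + (\tilde g-g_0)\big) + \lambda\|\beta_0\|_1 + \delta_0^2 R^2.
\end{aligned} \tag{2.15}$$

Therefore, by the definition of $\mathcal{T}_1(R)$ and $\mathcal{T}_2(R)$,

$$\begin{aligned}
&\|X^T(\tilde\beta-\beta_0) + (\tilde f-f_0) + (\tilde g-g_0)\|^2 + \lambda\|\tilde\beta_{S_0^c}\|_1 + \rho^2 I^2(\tilde f) + \mu^2 J^q(\tilde g) \\
&\quad\le \delta_0^2 R^2 + \delta_0^2 R^2 + 2\delta_0^2 R^2 + \lambda\|\beta_0\|_1 - \lambda\|\tilde\beta_{S_0}\|_1 \\
&\quad\le 4\delta_0^2 R^2 + \lambda\|\beta_{0,S_0} - \tilde\beta_{S_0}\|_1.
\end{aligned} \tag{2.16}$$

Note that

$$\begin{aligned}
\lambda\|\beta_{0,S_0} - \tilde\beta_{S_0}\|_1 &\le \lambda\sqrt{s_0}\,\|\beta_{0,S_0} - \tilde\beta_{S_0}\| \le \lambda\sqrt{s_0}\,\|\tilde\beta-\beta_0\| \\
&\le \lambda\sqrt{s_0}\,\|\tilde X^T(\tilde\beta-\beta_0)\|/\Lambda_{\tilde X,\min} \\
&\le \lambda^2 s_0/\Lambda^2_{\tilde X,\min} + \|\tilde X^T(\tilde\beta-\beta_0)\|^2/4 \\
&\le \delta_0^2 R^2/4 + \|\tilde X^T(\tilde\beta-\beta_0)\|^2/4,
\end{aligned}$$

where the third and last inequalities hold by assumption and the fourth inequality follows from $uv \le u^2 + v^2/4$. Thus, substituting it into (2.16), we obtain

(a) $(3/4)\|X^T(\tilde\beta-\beta_0) + (\tilde f-f_0) + (\tilde g-g_0)\|^2 \le (17/4)\delta_0^2 R^2$;

(b) $\rho^2 I^2(\tilde f) \le (17/4)\delta_0^2 R^2$;

(c) $\mu^2 J^q(\tilde g) \le (17/4)\delta_0^2 R^2$.

Now it follows from (a) that $\|X^T(\tilde\beta-\beta_0)\| \le (\sqrt{17}/\sqrt{3})\delta_0 R$. In addition, (b), (c) and (2.14) imply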

ρI(fe−f₀)≤ρI(fe) +ρI(f₀)≤

√ 17 2 δ0R+ 2δ0R≤ √ 17δ0R and µ R 2−_qq Jq(_eg−g0)≤ √ 17 2 δ0R+ 2δ0R ≤ √ 17δ0R. Adding λkβ0S0 −βeS0k1 on both sides of (2.16), we get

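$$\begin{aligned}
&\|X^T(\tilde\beta-\beta_0) + (\tilde f-f_0) + (\tilde g-g_0)\|^2 + \lambda\|\tilde\beta-\beta_0\|_1 + \rho^2 I^2(\tilde f) + \mu^2 J^q(\tilde g) \\
&\quad\le 4\delta_0^2 R^2 + 2\lambda\|\beta_{0,S_0} - \tilde\beta_{S_0}\|_1 \\
&\quad\le 4\delta_0^2 R^2 + \|\tilde X^T(\tilde\beta-\beta_0)\|^2 + \tfrac{1}{4}\delta_0^2 R^2,
\end{aligned}$$

which implies $\lambda\|\tilde\beta-\beta_0\|_1 \le \frac{17}{4}\delta_0^2 R^2$.

Invoking the definition of $\tau(\tilde\beta-\beta_0, \tilde f-f_0, \tilde g-g_0; R)$, we finally get

$$\tau(\tilde\beta-\beta_0, \tilde f-f_0, \tilde g-g_0; R) \le \big(\sqrt{17}/\sqrt{3} + 2\sqrt{17} + 17/4\big)\delta_0 R \le 15\delta_0 R \le \tfrac{1}{2}R$$

by letting $\delta_0 = 1/30$.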
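The elementary inequality $uv \le u^2 + v^2/4$ used for the fourth step of the chain bounding $\lambda\|\beta_{0,S_0} - \tilde\beta_{S_0}\|_1$ in the proof above is a completed square. A short worked version, where the identification of $u$ and $v$ is read off from that chain:

```latex
% Completing the square:
0 \le \Big(u - \frac{v}{2}\Big)^2 = u^2 - uv + \frac{v^2}{4}
\quad\Longrightarrow\quad
uv \le u^2 + \frac{v^2}{4}.

% Applied with
u = \frac{\lambda\sqrt{s_0}}{\Lambda_{\tilde X,\min}}, \qquad
v = \|\tilde X^T(\tilde\beta-\beta_0)\| :

\frac{\lambda\sqrt{s_0}\,\|\tilde X^T(\tilde\beta-\beta_0)\|}{\Lambda_{\tilde X,\min}}
\le \frac{\lambda^2 s_0}{\Lambda^2_{\tilde X,\min}}
  + \frac{\|\tilde X^T(\tilde\beta-\beta_0)\|^2}{4}.
```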

Lemma 2.4. Under the conditions of Lemma 2.1, we have, for some constants $\tilde C > 0$, $\tilde c > 0$,

$$P(\mathcal{T}(R)) \ge 1 - \tilde C\exp(-\tilde c\, n\rho^2).$$

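Proof. Under the conditions of Lemma 2.1, we can find $\rho^2 \le (1-\gamma)/B^2$ and $\mu^2 \le R^{2-q}(1-\gamma)^{q/2}/B^q$. Further, we find a constant $L > 0$ such that the following hold:

$$\sqrt{n}\rho^{1+k} \ge L A_I, \qquad \sqrt{n}\mu^{1+m} \ge L A_J, \tag{2.17}$$

$$R \ge L L_J A_J/\sqrt{n}, \qquad R \ge K_\varepsilon\rho, \qquad R \ge L_J\rho, \qquad R \ge K_\varepsilon^{\frac{q}{q-(2-q)m}}\mu \tag{2.18}$$

and

$$\rho^k \le 1/L, \tag{2.19}$$

where $L_J = (R/\mu)^{2/q}$. By arguments similar to those of [20], such an $L$ exists. Note that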

τ(β, f, g;R) ≤ R implies that kXT_β ₊_f ₊_g_k2 _≤ _R2 _and _k_β_k

1 ≤ δ0R2/λ, I(f) ≤

R/ρ, J(g) ≤ LJ. By orthogonal decomposition (2.10), we have kXeTβk ≤ R and

kfT

Xβ+f +gXTβ+gk ≤R. Then Assumption 2.5 implies

kXTβk ≤ kXeTβk+kΠ(X|H)Tβk ≤R+ (Λ_max/Λ_min)kXβe k ≤(1 + Λ_max/Λ_min)R.

Similar arguments and assumption 2.8 implies that

kfk ≤(1 + Λmax/Λmin)R/(1−γ),kgk ≤(1 + Λmax/Λmin)R/(1−γ).

For simplicity, we take R1 = R2 = R3 = (1 + Λmax/Λmin)R/(1− γ) , R2/(1−

γ) and M1 = δ0R2/λ, M2 = R2/ρ2, M3 = LJ. In addition, Assumption 2.7 and

ρ2 _≤ ₍₁₋ _γ₎_/B2_, _µ2 _≤ _R2−q₍₁ ₋_γ₎q/2_/Bq _{yield that sup}

fβ∈Fβ(R1,M1)kfβk ≤ M1,

sup_f_∈F₍_R₂_,M₂₎kfk∞ ≤M2 and supg∈F(R3,M3)kgk∞ ≤M3. Thus, we takeKl =Ml,1≤ l ≤3. Let t=nρ2_/L2_{. Further, we choose} _δ

1, δ10 small enough such that

R q

log3(2n)≤δ1,

p


Without loss of generality, we assume $C_1 = 1$ in Theorem 2.4. Otherwise, we can replace $L$ by $LC_1$ in the proof. With (2.20) and the fact that $R^2 \le \lambda$, it holds that

M1 r logp n q log3n ≤δ0δ1δ10R ≤ R L, M1 ρ L = δ0R2 λ ρ L ≤ δ0ρ L ≤ R L. (2.21)

Moreover, by Theorem2.3, (2.17) and (2.17), we obtain

J∞(K2,Fβ(R1, M1)) √ n ≤M1 r log(2p) n q log3(2n) + 2√K2 n ≤ R L + 2R ρ√n ≤ 3R L , (2.22) J∞(K3,Fβ(R1, M1)) √ n ≤M1 r log(2p) n q log3(2n) + 2√K3 n ≤ R L + 2LJ √ n ≤ 3R L . (2.23)

Further, Assumption 2.6 and equations (2.17), (2.18) and (2.19) show

J∞(K2,F(R2, M2)) √ n ≤ AI(R/ρ)k(R/ρ)1−k √ n ≤ AIR √ nρ ≤ R L, (2.24) J∞(K3,G(R3, M3)) √ n ≤ AJLkJL (1−k) J √ n ≤ AJLJ √ n ≤ R L, K3 ρ L ≤ ρLJ L ≤ R L, (2.25) J∞(K3,F(R2, M2)) √ n ≤ AI(R/ρ_√)kL1J−k n ≤ LAI √ nρ1+k ρ1−k_L1−k J R1−k ρ kR L ≤ R L2. (2.26) Now for any (β, f, g), it holds

kXTβ+f+gk_n2 − kXTβ+f +g k2 ≤kXTβk2_n− kXTβk2 + kfk2_n− kfk2 + kgk2_n− kgk2 + 2(Pn−P)XTβf +2(Pn−P)XTβg+ 2(Pn−P)f g ,A+B+C+D+E+F.

We bound each of the terms as follows.

A. Replace R∗ and M∗ by R1 and M1 in Theorem 2.3. Then we get from equa-tion (2.21) that A≤R1M1 r logp n q log3n+ ρ L ! +M₁2 logp n log 3_n₊ ρ2 L2 (2.27) ≤ 2R 2 Lp(1−γ2) + R 2 L2 ≤ 4R 2 Lp(1−γ2) .


B. Replace R∗ and K∗ by R2 and K2 in Theorem 2.4. Then we have from equa-tion (2.24) B ≤2R2J∞(K√2,F(R2, M2)) n +R2K2 ρ L + 4J2 ∞(K2,F(R2, M2)) n +K 2 2 ρ2 L2 (2.28) ≤ 2R 2 L√1−γ2 + R 2_ρ ρL√1−γ2 +4R 2 L2 + R2_ρ2 ρ2_L ≤ 8 Lp(1−γ2) R2.

C. Replace R∗ and K∗ by R3 and K3 in Theorem 2.4 and apply (2.25). Then we have C≤2R3J∞(K√3,G(R3, M3)) n +R3K3 ρ L+ 4J_∞2 (K3,F(R3, M3)) n +K 2 3 ρ2 L2 ≤ 2R 2 L√1−γ2 + R 2 L√1−γ2 +4R 2 L2 + R2 L2 ≤ 8 Lp(1−γ2) R2.

D. ReplaceR∗₁, R₂∗, K₁∗, K₂∗in Theorem2.5byR1, R2, K1, K2andM∗in Lemma2.11 byM1. Then with the application of (2.22), we get the following:

D≤R1J∞(K2√,F(R2, M2)) n + R2J∞(R1K2/R2,Fβ) √ n + R1K2ρ L + K1K2ρ2 L2 (2.29) ≤ R 2 Lp(1−γ2) + 3R 2 Lp(1−γ2) + R 2_ρ ρLp(1−γ2) + δ0R 2 λ Rρ2 ρL2 ≤ 6 Lp(1−γ2) R2.

E. ReplaceR∗₁, R₂∗, K₁∗, K₂∗in Theorem2.5byR1, R3, K1, K3andM∗in Lemma2.11 byM1. Then with the application of (2.22), we get the following:

E ≤R1J∞(K3√,G(R3, M3)) n + R3J∞(R1K3/R3,Fβ) √ n + R1K3ρ L + K1K3ρ2 L2 (2.30) ≤ R 2 Lp(1−γ2) + 3R 2 Lp(1−γ2) + RLJρ Lp(1−γ2) + δ0R 2 λ LJρ2 L2 ≤ 6 Lp(1−γ2) R2,


where ρLJ ≤R follows from equation (2.18).

F. Replace R∗₁, R∗₂, K₁∗, K₂∗ in Theorem 2.5 by R2, R3, K2, K3. Then with the ap-plication of (2.22), we get the following:

F ≤R2J∞(K3√,G(R3, M3)) n + R3J∞(R2K3/R3,F(R2, M2)) √ n + R2K3ρ L + K2K3ρ2 L2 (2.31) ≤ R 2 Lp(1−γ2) + R 2 L2p₍₁₋_γ 2) + RLJρ L√1−γ2 + RLJρ 2 ρL2 ≤ 4R 2 L√1−γ2 .

Combining A to F, we get for any (β, f, g)∈ M(R), sup (β,f,g)∈M(R) kXTβ+f +gk_n2 − kXTβ+f+g k2 ≤ 36R2 L√1−γ2 .

with probability at least 1−6 exp(−nρ2/L).

Look at the set T2(R) now. Note that

|_Pnε(XTβ+f +g)| ≤ P_nε(XTβ) + P_nεf + P_nεg .

Lemma 2.7 shows that

Pnε(XTβ) ≤δ0R2/10≤ R2 √ 1−γ2L , (2.32) where the last step follows by choosing δ0 ≤ 10. Then it follows from Theorem 5.2 of [20], Assumption 2.6 and equation (2.17) that

Pnεf ≤ KεJ∞(R2,F(R2, M2)) +KεR2 √ t √ n ≤ √ KεAIR n(1−γ2)(1−k)/2ρk + R 2 √ 1−γ2L ≤ R 2 L(1−γ2)(1−k)/2 + R 2 L√1−γ2 ≤ 2R 2 L√1−γ2 ,


and P_nεg ≤ KεJ∞(R3,G(R3, M3)) +KεR3 √ t √ n ≤ KεAJM m 3 R1−m √ n(1−γ2)(1−m)/2 + R 2 √ 1−γ2L ≤ R 2 L(1−γ2)(1−m)/2 + R 2 L√1−γ2 ≤ 2R 2 L√1−γ2 . Therefore, sup (β,f,g)∈M(R) |_Pnε(XTβ+f+g)| ≤ 5R2 L√1−γ2 with probability at least 1−3 exp(−nρ2/L).

By Letting 5/L√1−γ2 ≤δ20, we have shown that for some constantsC >e 0, e c >0, P(T(R))≥1−Ceexp(−_ecnρ2).

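Let

$$\mathcal{M}_I(R_I) = \{(\beta, f) : \tau_I(\beta, f; R_I) \le R_I\}.$$

For $\delta_I$ sufficiently small, we define

$$\mathcal{T}_{I,1}(R_I) = \Big\{ \sup_{(\beta,f)\in\mathcal{M}_I(R_I)} \big| \|\tilde X^T\beta + f_{XA}^T\beta + f_A\|_n^2 - \|\tilde X^T\beta + f_{XA}^T\beta + f_A\|^2 \big| \le \delta_I^2 R_I^2 \Big\},$$

$$\mathcal{T}_{I,2}(R_I) = \Big\{ \sup_{(\beta,f)\in\mathcal{M}_I(R_I)} \big| P_n\,\varepsilon(\tilde X^T\beta + f_{XA}^T\beta + f_A) \big| \le \delta_I^2 R_I^2 \Big\},$$

$$\mathcal{T}_{I,3}(R_I) = \Big\{ \sup_{(\beta,f,g)\in\mathcal{F}(R),\,(\beta,f)\in\mathcal{M}_I(R_I)} \big| P_n(\tilde X^T\beta + f_{XA}^T\beta + f_A)(f_{XP}^T\beta + g_X^T\beta + f_P + g) \big| \le \delta_I^2 R_I^2 \Big\},$$

and let $\mathcal{T}_I(R_I) = \mathcal{T}_{I,1}(R_I)\cap\mathcal{T}_{I,2}(R_I)\cap\mathcal{T}_{I,3}(R_I)$.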

Lemma 2.5. Under the conditions of Lemma 2.1, it holds that on T(R)∩ T1(RI),


Proof. Under the conditions of Lemma 2.1, we can find some ρ and µ such that ρ2I2(f0) +µ2Jq(g0)≤δ20R 2_{, ρ}2_I2₍_f 0)≤δI2R 2 I, (2.33) 2µ2(Γ +J2δ0RI/λ)(2δ0R/µ) 2(q−1) q ≤_δ IRI2, 2µ2(Γ +J2δ0RI/λ)q/R2 −q I ≤δ 2 I (2.34) for some δ0, δI >0, which will be taken small enough later.

kY −XTβb−fb− b gk2 n+λkβbk₁+ρ2I2(fb) +Jq( b g) ≤kY −XTβ0 −f0−(bg+fXP(βb−β0) +gX(βb−β0) +fbP −f 0 P)k 2 n+λkβ0k1+ ρ2I2(f0) +µ2Jq(bg+fXP(βb−β0) +gX(βb−β0) +fbP −f 0 P), which implies kX(βb−β₀) +f_XAT (βb−β₀) +fb_A−f_A0k2_n+ρ2I2(fb) ≤ −2_Pn (fXP +gX)T(βb−β₀) +fb_P −f_P0 + b g−g0 (Xe +f_XA)T(βb−β₀) +fb_A−f_A0 + 2_Pn εXe+f_XAT (βb−β₀) +fb_A−f_A0 +λkβ0k1+ρ2I(f0)−µ2Jq(bg) +µ2Jq(_bg+fXP(βb−β0) +gX(βb−β0) +fbP −fP0). Let t = RI RI +τI(βb−β₀,fb−f₀;R_I) .

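Define $\tilde\beta = t\hat\beta + (1-t)\beta_0$, $\tilde f = t\hat f + (1-t)f_0$. Note that $(\tilde\beta, \tilde f) \in \mathcal{T}_1(R_I)$. As in the proof of Lemma 2.3, it suffices to show that $\tau_I(\tilde\beta-\beta_0, \tilde f-f_0; R_I) \le R_I/2$. By convexity and the definition of $\mathcal{T}_I(R_I)$, we have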

kXe(βe−β₀) +f_XAT (βe−β₀) +fe_A−f_A0k2+λkβek₁+ρ2I2(fe) ≤5R2_I+λkβ0k1+ρ2I2(f0) +µ2Jq(bg+fXP(βb−β0) +gX(βb−β0) +fbP −f 0 P)−µ 2_Jq₍ b g).

Using the fact that for a, b >0 and 1< q <2,


we obtain Jq(_bg+fXP(βb−β₀) +g_X(βb−β₀) +fb_P −f_P0)−Jq( b g) ≤2Jq−1(_bg)J(f_XPT (βe−β₀) +g_XT(βe−β₀) +fe_P −f_P0) + 2Jq(f_XPT (βe−β₀) +g_XT(βe−β₀) +fe_P −f_P0) ≤2Jq−1(_bg)[J(g_XT(βe−β₀)) +J(f_XPT (βe−β₀) +fe_P −f_P0)] + 2[J(gT_X(βe−β₀)) +J(f_XPT (βe−β₀) +fe_P −f_P0)]q ≤2Jq−1(_bg)(kJ(gX)k∞kβe−β₀k₁+ Γkf_X(βe−β₀) +fe−f₀k) + 2(kJ(gX)k∞kβe−β₀k₁+ Γkf_X(βe−β₀) +fe−f₀k)q ≤2 2δ0R µ 2(q_q−1) J2 δ0R2I λ + ΓRI + 2 J2 δ0R2I λ + ΓRI q ≤2R2_I/µ2.

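where the fourth inequality follows from $J(\hat g) \le (2\delta_0 R/\mu)^{2/q}$ on $\mathcal{T}(R)$, Assumptions 2.8 and 2.9, and the fact that $\|f_X(\tilde\beta-\beta_0) + \tilde f - f_0\| \le R_I$ on $\mathcal{T}_I(R_I)$. The last step follows from (2.34). Hence, we have

$$\|\tilde X^T(\tilde\beta-\beta_0)\|^2 + \|f_{XA}^T(\tilde\beta-\beta_0) + \tilde f_A - f_A^0\|^2 + \lambda\|\tilde\beta\|_1 + \rho^2 I^2(\tilde f) \le 8\delta_I^2 R_I^2 + \lambda\|\beta_0\|_1.$$

Subtracting $\lambda\|\tilde\beta_{S_0}\|_1$ on both sides of the above inequality, we get

$$\|\tilde X^T(\tilde\beta-\beta_0)\|^2 + \|f_{XA}^T(\tilde\beta-\beta_0) + \tilde f_A - f_A^0\|^2 + \lambda\|\tilde\beta_{S_0^c}\|_1 + \rho^2 I^2(\tilde f) \le 8\delta_I^2 R_I^2 + \lambda\|\tilde\beta_{S_0} - \beta_{0,S_0}\|_1, \tag{2.35}$$

where

$$\begin{aligned}
\lambda\|\tilde\beta_{S_0} - \beta_{0,S_0}\|_1 &\le \lambda\sqrt{s_0}\,\|\beta_{0,S_0} - \tilde\beta_{S_0}\| \le \lambda\sqrt{s_0}\,\|\tilde\beta-\beta_0\| \\
&\le \lambda\sqrt{s_0}\,\|\tilde X^T(\tilde\beta-\beta_0)\|/\Lambda_{\min} \\
&\le \lambda^2 s_0/(4\Lambda^2_{\min}) + \|\tilde X^T(\tilde\beta-\beta_0)\|^2 \\
&\le \delta_I^2 R_I^2 + \|\tilde X^T(\tilde\beta-\beta_0)\|^2.
\end{aligned}$$

Therefore,

$$\|f_{XA}^T(\tilde\beta-\beta_0) + \tilde f_A - f_A^0\|^2 + \lambda\|\tilde\beta_{S_0^c}\|_1 + \rho^2 I^2(\tilde f) \le 9\delta_I^2 R_I^2.$$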


Then it holds (a0) kfT

XA(βe−β₀) +fe_A−f_A0k ≤3δ_IR_I which further implies kf_XT(βe−β₀) +fe−f₀k ≤

3δIRI/

p

(1−γ2_);

(b0) ρI(fe−f₀)≤ρI(fe)+ρI(f₀)≤(3+1)δ_IR_I ≤4δ_IR_I together with equation (2.33).

Note that by using λkβeS0 −β0S0k1 ≤ λ

2_s

0/Λ2min/2 +kXT(βe−β0)k2/2, we can also obtain

(c0) kXeT(βe−β₀)k ≤

√

18δIRI Now, adding λkβe₀_S

0 −β0S0k1 on both sides of (2.35), we get

kXT(βe−β₀)k2+kf_XAT (βe−β₀) +fe_A−f_A0k2 +λkβe−β₀k₁+ρ2I2(fe) ≤8δ_I2R2_I+ 2λkβe_S₀ −β₀_S 0k1 ≤8δ 2 IR 2 I+λ 2_s 0/Λ2min+kX T₍ e β−β0)k2, which implies that

(d0) λkβe−β₀k₁ ≤9δ_I2R2_I.

Combine the results (a)−(c) and recall the form ofτI(βe−β₀,fe−f₀;R_I). We get τI(βe−β₀,fe−f₀;R_I)≤(( √ 18 + 16)/p1−γ2₎_δ IRI ≤ 1 2RI, given that δI ≤(( √

18 + 16)/2p1−γ2_{). This completes the proof of the lemma.}

Lemma 2.6. Under the conditions of Lemma 2.1, there exist constants $C_I$ and $c_I$ such that

$$P(\mathcal{T}_I(R_I)) \ge 1 - C_I\exp(-c_I n\rho^2).$$

Proof. Note that on the set $\mathcal{T}_I(R_I)$, we have

$$\|f\|^2 \le (1 + \Lambda_{\max}/\Lambda'_{\min})R_I^2, \qquad I(f) \le R_I/\rho,$$

and


where Λ_min0 represent the smallest eigenvalue of _E(fXfXT). Also we have kgk2 ≤ (1 + Λmax/Λmin)R2/(1−γ) andJ(g)≤LJ. Now, we take

R0₁ =R0₂2 = (1 + Λmax/Λ0min)R 2 I ,R 2 I/(1−γ1), R0₃2 = (1 + Λmax/Λmin)R2/(1−γ) = R2/(1−γ2), and M₁0 =δIR2I/λ, M 0 2 =RI/ρand M30 =LJ.

Assumption 2.7 and ρ2 _≤₍₁₋_γ₎_/B2_, _µ2 _≤_R2−q₍₁₋_γ₎q/2_/Bq _{yield that} sup f∈Fβ(R01,M 0 1) kfβk∞≤M₁0, sup f∈F(R0 2,M20) kfk∞≤M20 and sup g∈F(R0 3,M30) kgk∞ ≤M30.

Thus, we takeK_l0 =M_l0,1≤l ≤3.LetLbe the constant as in the proof of Lemma2.4. We further restrict it such that

RI ≥LLJAJ/ √ n, RI ≥Kερ, RI ≥LJρ, R ≥K q q−(2−q)m ε µ. (2.36)

This can be achieved due to the assumptions that ρ2

. R2

I ≤ R2 and µ2 . R2. We take t=nρ2_/L2 _{and look at} _T

I,1(RI) first. We have kXβe +f_XAT β+f_Ak_n2 − kXβe +f_XAT β+f_Ak2 ≤kXβe +f_XAT βk2_n− kXβe +f_XAT βk2 + kf_Ak2_n− kf_Ak2 + (P_n−P)(Xβe +f_XAT β)f_A ,A0 +B0+C0.

We boundA0, B0, C0 as follows, respectively.

A0. Note that fXP(·) = E(fX(Z)|U = ·) ∈ G and fXA = fX − fXP. We have

fX, fXA, fP are bounded. Without loss of generality, we assume the upper bound as 1. Note that by Assumption 2.7, kXβe +f_XAT βk∞ ≤ kXβe k∞+kf_XAT βk∞ ≤


2M₁0. Replace R∗ and M∗ by R₁0 and 2M₁0 in Theorem 2.3. Then similarly as (2.27), we get A0 ≤2R0₁M₁0   s logplog3n n + ρ L  + 4M 0₂ 1 logplog3n n + ρ2 L2 ≤ 4R 2 I √ 1−γ1L + 8R 2 I L2 ≤ 12R 2 I L√1−γ1 . B0. Note that sup f∈F(R02,M20) kfAk∞≤ sup f∈F(R02,M20) kfPk∞+ sup f∈F(R02,M20) kfk∞ ≤2K20 and sup f∈F(R0 2,M20) kfAk ≤R20.

ReplaceR∗ andK∗ byR₂0 and 2K₂0 in Theorem2.4. Similarly as (2.28), we then have B0 ≤2R 0 2J∞(2K 0 2,{fA:f ∈ F(R02, M 0 2)}) √ n + 2R 0 2K 0 2 ρ L (2.37) + 4J 2 ∞(2K20,{fA :f ∈ F(R02, M 0 2)}) n + 4K 0₂ 2 ρ2 L2 ≤4R 0 2J∞(2K_√20,F(R02, M20)) n + 2R2_Iρ ρL√1−γ1 + 16J 2 ∞(2K20,F(R02, M20)) n + 4R_I2 L ≤ 8R 2 I Lp(1−γ1) + 8R 2 I Lp(1−γ1) + 64R 2 I L2 + 4R_I2 L (2.38) ≤ 84 Lp(1−γ1) R2.


C0. Similarly as (2.29), it holds that C0 ≤R 0 1J∞(2K 0 2,{fA:f ∈ F(R02, M 0 2)}) √ n + R0₂J∞(2R0₁K₂0/R0₂,Fβ(R10,2M 0 1)) √ n +2R 0 1K 0 2ρ L + 4K₁0K₂0ρ2 L2 ≤2R 0 1J∞(2K_√20,F(R02, M20)) n + R0₂J∞(2K₂0,_√Fβ(R01,2M10)) n + 2R2_I L√1−γ1 +4δ0R 2 I λ R_I2 L2 ≤ 4R 2 I Lp(1−γ1) + 6R 2 I L√1−γ1 + 2R 2 I L√1−γ1 + 4R 2 I L2 ≤ 16R 2 I L√1−γ1 . Combining A0 to C0, we get sup (β,f)∈F(RI) kXˇTβ+fAk2n− kXˇ T β+fAk2 ≤ 112R2 I Lp(1−γ2) with probability at least 1−2 exp(−nρ2_/L_).

Next, look at TI,2(RI). We have

P_n(ε(Xβe +f_XAT β+f_A)) ≤ P_nε(Xβe +f_XAT β) + P_nεf_A ,where P_nε( ˇXTβ) ≤2R_I2/ √

1−γ1Lfollows from Lemma 2.7and similar ar-guments as (2.32). Further, Theorem 5.2 of [20], Assumption2.6and equation (2.36) shows PnεfA ≤ KεJ∞(R02,{fA:f ∈ F(R02, M 0 2)}) +KεR02 √ t √ n ≤2KεJ∞(R 0 2,F(R 0 2, M 0 2)) +KεR20 √ t √ n ≤√ 2KεAIRI n(1−γ1)(1−k)/2ρk + R 2 I √ 1−γ1L ≤ 2R 2 I L(1−γ1)(1−k)/2 + R 2 I √ 1−γ1L ≤ 3R 2 I √ 1−γ1L . Thus, we have sup (β,f)∈F(RI) P_nε(Xβe +f_XAT β+f_A) ≤ 5R2_I L√1−γ1


with probability at least 1−2C0exp(−nρ2/L).

Finally, we consider TI,3(RI). Notice thatP(Xβe +f_XAT β+fA)(fXPT β+gXTβ+fP+

g) = 0. Then we get P_n(Xβe +f_XAT β+f_A)(f_XPT β+g_XTβ+f_P +g) ≤(P_n−P) (Xβe +f_XAT β)(f_XPT β+gT_Xβ) + (P_n−P)(Xβe +f_XAT β)(f_P +g) (Pn−P) (fXPT β+g T Xβ)fA + (Pn−P) (fA(fP +g)) ,A00+B00+C00+D00. It is noted that kXβe +f_XAT βk ≤R0₁,kXβe +f_XAT βk∞ ≤2M₁0, kf_XPT β+g_XTβk ≤R₁0,kfP +gk ≤ kfPk+kgk ≤R02+R 0 3 ≤2R 0 3, J(fP +g)≤J(fP) +J(g)≤Γkfk+LJ ≤ΓRI+LJ ≤4K30. Then we apply Theorem 2.5 for A00, B00, C00, D00, respectively.

A00. Similar to the proof of equation (2.21) and (2.22),

A00≤ R 0 1J∞(K 0 1,Fβ(R01, M 0 1)) √ n + R0₁J∞(K₁0,Fβ(R01,2M 0 1)) √ n + R0₁K₁0ρ L + 2K₁02ρ2 L2 ≤ √ RI 1−γ1 RI L + 2K₁0 √ n + √RI 1−γ1 2RI L + 2K₁0 √ n + RIρ L√1−γ1 +2ρ 2 L2 ≤ 10R 2 I L√1−γ1 .

B00. Similar to the proof of (2.25), (2.23) and (2.30),

B00 ≤ R 0 1J∞(4K30_√,G(2R03,4K30)) n + 2R0₃J∞(R₁0(4K₃0)_√/2R0₃,Fβ(R10,2M10)) n + R 0 1(4K30)ρ L + 2K₁0(4K₃0)ρ2 L2 ≤ √ RI 1−γ1 4RI L + RI √ 1−γ1 2RI L + 2RI √ 1−γ1 4RI L + RI √ 1−γ1 4RI L + 2δ0RI2 λ 4LJρ2 L2 ≤ 26R 2 I L√1−γ1 .


C00. Similar to the proof of (2.22), (2.37), and (2.29), C00≤ R 0 1J∞(2K 0 2,{fA :f ∈ F(R02, M 0 2)}) √ n + R0₂J∞(R0₁(2K₂0)/R0₂,Fβ(R01,2M 0 1)) √ n +R 0 1(2K 0 2)ρ L + K₁0(2K₂0)ρ2 L2 ≤ √ RI 1−γ1 2RI L + RI √ 1−γ1 6RI L + RI √ 1−γ1 2RIρ ρL + δ0R2I λ 2RIρ2 ρL2 ≤ 12R2_I L√1−γ1 .

D00. Similar to the proof of (2.26) and (2.31),

D00 ≤ R 0 2J∞(4K30_√,G(2R03,4K30)) n + 2R₃0J∞(R₂0(4K₃0)/2R₃0_√,{fA :f ∈ F(R20, M20)}) n +R 0 2(4K30)ρ L + (2K₂0)(4K₃0)ρ2 L2 ≤ √ RI 1−γ1 4RI L + √ 1−γ2 √ 1−γ1 1−k 8Rk √ 1−γ2 AIL1J−kRIρ √ nρ1+k + RI √ 1−γ1 4LJρ L + 2RI ρ 4LJρ2 L2 ≤ 18R 2 I √ 1−γ1( √ 1−γ2)kL .

Therefore, we have with probability at least 1−4 exp(−nρ2_/L_), sup (β,f,g)∈M(R),(β,f)∈F(RI) Pn( ˇXTβ+fA)(h2β+fP +g) ≤ 66R_I2 √ 1−γ1( √ 1−γ2)aL . By letting 65/(√1−γ1( √

1−γ2)aL)≤δI2, we conclude that there exists constantCI and cI, such that

P(TI(RI))≥1−CIexp(−cInρ2).

Proof of Lemma 2.1

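Proof. Take $c = \max(\tilde C, \tilde c)$ and $C = \max(\tilde C, C_I, \tilde c, c_I)$. Then the lemma follows as below:

and

$$\begin{aligned}
P(\mathcal{T}(R), \mathcal{T}_I(R_I)) &\ge 1 - P(\mathcal{T}(R)^c) - P(\mathcal{T}_I(R_I)^c) \\
&\ge 1 - \tilde C\exp(-\tilde c\, n\rho^2) - C_I\exp(-c_I n\rho^2) \\
&\ge 1 - C\exp(-Cn\rho^2).
\end{aligned}$$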

Proof of Corollary 2.2

In this section, we prove Corollary 2.2. We start from the following preliminary lemmas.

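Lemma 2.7. With probability at least $1 - 1/p$,

$$\big|2P_n\,\varepsilon(\tilde X^T(\hat\beta-\beta_0))\big| \le 2K_X\sqrt{6}K_\varepsilon\sqrt{\frac{2\log p}{n}}\,\|\hat\beta-\beta_0\|_1 \le \frac{\lambda}{10}\|\hat\beta-\beta_0\|_1.$$

Proof. First we have

$$\big|P_n\,\varepsilon(\tilde X^T(\hat\beta-\beta_0))\big| \le \|P_n\,\varepsilon\tilde X^T\|_\infty\,\|\hat\beta-\beta_0\|_1.$$

Assumption 2.4, namely $E\exp(\varepsilon_i^2/K_\varepsilon^2) \le 2$, implies $E\exp(t\varepsilon_i) \le \exp(3K_\varepsilon^2 t^2/2)$; see [34]. Then we get

$$E\exp\Big(t\Big(\frac{1}{n}\sum_{i=1}^n \varepsilon_i\tilde X_{ij}\Big)\Big) = \prod_{i=1}^n E\exp\Big(\frac{t}{n}\tilde X_{ij}\varepsilon_i\Big) \le \prod_{i=1}^n \exp\Big(\frac{3}{2}K_\varepsilon^2\frac{t^2}{n^2}\tilde X_{ij}^2\Big) = \exp\Big(\frac{3}{2}K_\varepsilon^2\frac{t^2}{n}\|\tilde X_j\|_n^2\Big),$$

which implies that, given $\tilde X$ fixed, for $t > 0$ and all $j$,

$$P\Big\{\Big|\frac{1}{n}\sum_{i=1}^n \varepsilon_i\tilde X_{ij}\Big| > \sqrt{\frac{t}{n}}\,2\|\tilde X_j\|_n\sqrt{\frac{3}{2}}K_\varepsilon\Big\} \le \exp(-t),$$

see [34]. Hence

$$P\Big\{\max_{1\le j\le p}\Big|\frac{1}{n}\sum_{i=1}^n \varepsilon_i\tilde X_{ij}\Big| > \sqrt{\frac{t+\log p}{n}}\,2\|\tilde X_j\|_n\sqrt{\frac{3}{2}}K_\varepsilon\Big\} \le \exp(-t).$$

Then by Lemma 14.16 of [1], we have for some constant $K_X \ge 1$ that

$$P\Big\{\max_{1\le j\le p}\big|\|\tilde X_j\|_n^2 - E\|\tilde X_j\|_n^2\big| \ge K_X\sqrt{\frac{\log(2p)}{n}}\Big\} \le 1/(2p),$$

which further implies

$$P\Big(\max_{1\le j\le p}\|\tilde X_j\|_n^2 \ge 2K_X\Big) \le P\Big\{\max_{1\le j\le p}\|\tilde X_j\|_n^2 \ge 1 + K_X\sqrt{\frac{\log(2p)}{n}}\Big\} \le 1/(2p). \tag{2.39}$$

Now take $t = \log(2p)$. With probability at least $1 - 1/p$,

$$\|P_n\,\varepsilon\tilde X^T\|_\infty \le 2\sqrt{6}K_XK_\varepsilon\sqrt{\frac{2\log(2p)}{n}}.$$

Noting that $\lambda \gtrsim \sqrt{\log p/n}$, we can have $2\|P_n\,\varepsilon\tilde X^T\|_\infty \le 4\sqrt{6}K_XK_\varepsilon\sqrt{2\log(2p)/n} \le \lambda/10$.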
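The union-bound level appearing in the proof of Lemma 2.7 can be compared with a simulated maximum. The sketch below is a toy check under stated assumptions: entries of the design are drawn uniformly on $[-1, 1]$, errors are standard normal (for which $K_\varepsilon = \sqrt{8/3}$ gives $E\exp(\varepsilon^2/K_\varepsilon^2) = 2$), and the function names are hypothetical.

```python
import math
import random


def noise_threshold(n, p, t, k_eps, col_norm):
    """Level sqrt((t + log p)/n) * 2 * col_norm * sqrt(3/2) * k_eps from the union bound."""
    return math.sqrt((t + math.log(p)) / n) * 2.0 * col_norm * math.sqrt(1.5) * k_eps


def simulate_max_noise(n, p, seed=0):
    """Return (max_j |n^-1 sum_i eps_i X_ij|, max_j ||X_j||_n) for one simulated data set."""
    rng = random.Random(seed)
    x = [[rng.uniform(-1.0, 1.0) for _ in range(p)] for _ in range(n)]
    eps = [rng.gauss(0.0, 1.0) for _ in range(n)]
    max_abs = max(abs(sum(eps[i] * x[i][j] for i in range(n))) / n for j in range(p))
    max_norm = max(math.sqrt(sum(x[i][j] ** 2 for i in range(n)) / n) for j in range(p))
    return max_abs, max_norm


n, p = 200, 50
max_abs, max_norm = simulate_max_noise(n, p, seed=1)
level = noise_threshold(n, p, math.log(2 * p), math.sqrt(8.0 / 3.0), max_norm)
print(max_abs, level, max_abs < level)
```

The bound is conservative by design: it has to hold simultaneously over all $p$ columns with probability at least $1 - \exp(-t)$, so in a single simulated data set the observed maximum typically sits well below the level.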

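Lemma 2.8. With probability at least $1 - 5/(2p) - C\exp(-n\rho^2/c)$ for some constants $c, C > 0$,

$$2\big|P_n(\hat f-f_0+\hat g-g_0)\tilde X^T(\hat\beta-\beta_0)\big| + 2\big|P_n\big(f_X^T(\hat\beta-\beta_0) + g_X^T(\hat\beta-\beta_0)\big)\tilde X^T(\hat\beta-\beta_0)\big| \le \frac{\lambda}{10}\|\hat\beta-\beta_0\|_1,$$

$$\big|\|\tilde X(\hat\beta-\beta_0)\|_n^2 - \|\tilde X(\hat\beta-\beta_0)\|^2\big| \le \frac{\lambda}{2}\|\hat\beta-\beta_0\|_1.$$

Proof. On the set $\mathcal{T}(R)\cap\mathcal{T}_I(R_I)$, we have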

P_n(fb−f₀+_bg−g₀)XeT(βb−β₀) ≤ k_Pn(fb−f₀ +_bg−g₀)XeTk∞ kβb−β₀k₁ ≤ max 1≤j≤p 1 n n X i=1 (fb−f₀+g_b−g₀)_iXe_ij ! kβb−β₀k₁.

Note that given Xe, we have for each 1≤j ≤p,

|(fb−f0+bg−g0)iXeij| ≤2R/ p


By Lemma 14.15 of [1], we have P    max 1≤j≤p 1 n n X i=1 (fb−f₀+g_b−g₀)_iXe_ij ≥ max 1≤j≤p s (2R/√1−γ2)2 Pn i=1Xe_ij2 n s 2 t2₊ log(2p) n    ≤exp(−nt2).

Again by (2.39) and choosing t2 = log(2p)/n, we have

P ( max 1≤j≤p 1 n n X i=1 (fb−f₀+ b g−g0)iXe_ij ≥2K_X(2R/ p 1−γ2) r 4 log(2p) n ) ≤1/p. By choosing λ≥40KX p

log(2p)/n, we have with probability at least 1−1/p, max 1≤j≤p 1 n n X i=1 (fb−f₀+ b g−g0)iXe_ij ≤λ/20.

Finally, note that

1 n n X i=1 ((fX +gX)T(βb−β₀))_iXeij ≤ 1 n n X i=1 p X k=1 (fX +gX)ik(βb−β₀)_kXeij ≤ p X k=1 (βb−β₀)_k( 1 n n X i=1 (fX +gX)ikXeij) ≤kβb−β₀k₁ max 1≤k≤p 1 n n X i=1 (fX +gX)ikXe_ij ≤δ0 R2 λ 1max≤k≤p 1 n n X i=1 (fX +gX)ikXeij ,

where _E(fX +gX)ikXe_ij = 0 and |(f_X +g_X)_ikXe_ij| ≤ M₀|Xe_ij| given Xe known. By

Lemma 14.15 in [1], we obtain that given Xe,

P  max 1≤j≤p1max≤k≤p 1 n n X i=1 (fX +gX)ikXe_ij ≥ max 1≤j≤p s M2 0 Pn i=1Xe_ij2 n s 2 t2₊2 log 2p n   ≤exp(−nt2).

Similarly, letting $t^2 = \log(2p)/n$ and invoking (2.39) gives

P max 1≤j≤p1max≤k≤p 1 n n X i=1 (fX +gX)ikXeij >2KXM0 r log 2p n ! ≤1/p.


Choose λ >2KXM0

p

log(2p)/n. We finally get with probability at least 1−1/p, 2P_n(f_XT(βb−β₀) +g_XT(βb−β₀))XeT(βb−β₀)

≤δ₀R2kβb−β₀k₁

which can be smaller than ₂₀λkβb−β0k1 by taking suitable choices of δ0. Similarly, we can get

|kXe(βb−β₀)k2_n− kXe(βb−β₀)k2| ≤δ₀ R2 λ 1≤maxk,j≤p 1 n n X i=1 (Xe_ikXe_jk−EXe_ikXe_jk) kβb−β₀k₁,

whereXe_ikXe_jk−EXe_ikXe_jk is sub-exponential. By Lemma 14.16 of [1] we have for some

constant K e X that P max 1≤j,k≤p 1 n n X i=1 (Xe_ikXe_jk −EXe_ikXe_jk) > K_X_e r log 2p n ! ≤1/(2p). Therefore, by choosing λ >2δ0KXe p log 2p/n, we have kXe(βb−β₀)k_n2 − kXe(βb−β₀)k2 ≤λkβb−β₀k₁/2,

with probability at least 1−1/(2p). Recalling the probability ofT(R)∩ TI(RI) from Lemma 2.4, this lemma is proved.

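Lemma 2.9. Assume

$$\rho^2 \le \frac{\delta_0^2 R^2}{2(I_1 + I(f_0))I_1}.$$

Then on the set $\mathcal{T}(R)\cap\mathcal{T}_I(R_I)$,

$$\rho^2 I^2\big(\hat f + f_X^T(\hat\beta-\beta_0)\big) - \rho^2 I^2(\hat f) \le \frac{\lambda}{10}\|\hat\beta-\beta_0\|_1.$$

Proof.

$$\begin{aligned}
&\rho^2 I^2\big(\hat f + f_X^T(\hat\beta-\beta_0)\big) - \rho^2 I^2(\hat f) \\
&\quad= \rho^2\big[I^2\big(f_X^T(\hat\beta-\beta_0)\big) + 2I\big(\hat f, f_X^T(\hat\beta-\beta_0)\big)\big] \\
&\quad\le \rho^2\Big[\frac{\delta_0 R^2}{\lambda}I_1^2\|\hat\beta-\beta_0\|_1 + 2I(\hat f)\,I\big(f_X^T(\hat\beta-\beta_0)\big)\Big] \\
&\quad\le \Big[\delta_0\rho^2 I_1^2 + 2\rho^2\Big(\frac{R}{\rho} + I(f_0)\Big)I_1\Big]\|\hat\beta-\beta_0\|_1 \\
&\quad\le \Big[\frac{1}{2}\delta_0^3 R^2 + \frac{\sqrt{2}}{2}\delta_0 R^2 + \delta_0^2 R^2\Big]\|\hat\beta-\beta_0\|_1 \\
&\quad\le 3\delta_0 R^2\|\hat\beta-\beta_0\|_1,
\end{aligned}$$

where the first equality follows from the definition of $I(\cdot)$, the second inequality follows from Assumption 2.9, and the third is true due to the triangle inequality. Choosing $\delta_0$ such that $3\delta_0 R^2 \le \lambda/10$, we get the desired result.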

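Lemma 2.10. Assume

$$\mu^2 \le \frac{\delta_0^2 R^2}{\big(J_2^{q-1} + J^{q-1}(g_0)\big)J_2}. \tag{2.40}$$

Then on the set $\mathcal{T}(R)\cap\mathcal{T}_I(R_I)$,

$$\mu^2 J^q\big(\hat g + g_X^T(\hat\beta-\beta_0)\big) - \mu^2 J^q(\hat g) \le \frac{\lambda}{10}\|\hat\beta-\beta_0\|_1.$$

Proof.

$$\mu^2 J^q\big(\hat g + g_X^T(\hat\beta-\beta_0)\big) - \mu^2 J^q(\hat g) \le \mu^2\Big[2J^{q-1}(\hat g)\,J\big(g_X^T(\hat\beta-\beta_0)\big) + 2J^q\big(g_X^T(\hat\beta-\beta_0)\big)\Big].$$

Note that $J(g_X^T(\hat\beta-\beta_0)) \le J_2\|\hat\beta-\beta_0\|_1$ by Assumption 2.9 and $\|\hat\beta-\beta_0\|_1 \le \delta_0 R^2/\lambda$. We have

$$\begin{aligned}
&\mu^2 J^q\big(\hat g + g_X^T(\hat\beta-\beta_0)\big) - \mu^2 J^q(\hat g) \\
&\quad\le \mu^2\Big(2\Big(\Big(\frac{R}{\mu}\Big)^{2/q} + J(g_0)\Big)^{q-1}J_2\|\hat\beta-\beta_0\|_1 + 2J_2^q\|\hat\beta-\beta_0\|_1^q\Big) \\
&\quad\le \Big(2\delta_0^{2/q}R^2 + 2\delta_0^2 R^2 + 2\delta_0^2 R^2\Big(\frac{\delta_0 R^2}{\lambda}\Big)^{q-1}\Big)\|\hat\beta-\beta_0\|_1 \\
&\quad\le 6\delta_0^{2/q}R^2\|\hat\beta-\beta_0\|_1,
\end{aligned}$$

where the first inequality follows from the definition and the second one follows from condition (2.40). Choosing $\delta_0$ such that $6\delta_0^{2/q}R^2 \le \lambda/10$, we get the desired result.

Based on Lemmas 2.7-2.10, we are now ready to prove Corollary 2.2.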

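Proof of Corollary 2.2. Recall that $\Pi(X\mid H) = f_X + g_X$. By definition, we have

$$\begin{aligned}
&\|Y - X^T\hat\beta - \hat f - \hat g\|_n^2 + \lambda\|\hat\beta\|_1 + \rho^2 I^2(\hat f) + \mu^2 J^q(\hat g) \\
&\quad\le \big\|Y - X^T\beta_0 - \big(\hat f + f_X^T(\hat\beta-\beta_0)\big) - \big(\hat g + g_X^T(\hat\beta-\beta_0)\big)\big\|_n^2 + \lambda\|\beta_0\|_1 \\
&\qquad+ \rho^2 I^2\big(\hat f + f_X^T(\hat\beta-\beta_0)\big) + \mu^2 J^q\big(\hat g + g_X^T(\hat\beta-\beta_0)\big).
\end{aligned}$$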