High Dimensional Inference for Semiparametric Models


Purdue e-Pubs

Open Access Dissertations Theses and Dissertations

January 2016

High Dimensional Inference for Semiparametric Models

Zhuqing Yu

Purdue University

Follow this and additional works at: https://docs.lib.purdue.edu/open_access_dissertations

This document has been made available through Purdue e-Pubs, a service of the Purdue University Libraries. Please contact [email protected] for additional information.

Recommended Citation

Yu, Zhuqing, "High Dimensional Inference for Semiparametric Models" (2016). Open Access Dissertations. 1401.


PURDUE UNIVERSITY GRADUATE SCHOOL
Thesis/Dissertation Acceptance

This is to certify that the thesis/dissertation prepared

By: Zhuqing Yu

Entitled: High Dimensional Inference for Semiparametric Models

For the degree of: Doctor of Philosophy

Is approved by the final examining committee: Guang Cheng (Chair), Thomas Sellke, Anirban DasGupta, Chong Gu

To the best of my knowledge and as understood by the student in the Thesis/Dissertation Agreement, Publication Delay, and Certification Disclaimer (Graduate School Form 32), this thesis/dissertation adheres to the provisions of Purdue University’s “Policy of Integrity in Research” and the use of copyright material.

Approved by Major Professor(s): Guang Cheng

Approved by: Jun Xie, Head of the Departmental Graduate Program

Date: 6/16/2016


A Dissertation Submitted to the Faculty

of

Purdue University by

Zhuqing Yu

In Partial Fulfillment of the Requirements for the Degree

of

Doctor of Philosophy

August 2016 Purdue University West Lafayette, Indiana


ACKNOWLEDGMENTS

I would first like to thank my advisor, Prof. Guang Cheng, for his continuous guidance and support of my Ph.D. study. He has generously shared his insights and expertise, which greatly helped me throughout the writing of this thesis. Prof. Cheng has also provided me with great research resources over the past five years, such as opportunities to join exciting research projects and to attend professional conferences and activities.

Besides my advisor, I would like to thank the rest of my thesis committee, Prof. Anirban DasGupta, Prof. Chong Gu and Prof. Thomas Sellke for their time, interest, and invaluable comments.

I would also like to express my great gratitude to my collaborators. I am especially indebted to my previous supervisor, Prof. Stephen Lee from The University of Hong Kong, who has led me into the field of statistics. I thank Prof. Jianhua Huang for his great suggestions on paper writing. Prof. Shengchun Kong has been a great friend with whom I can discuss research, as well as career building. It was a great pleasure to work with Prof. Michael Levine who has offered many helpful discussions.

I deeply appreciate the guidance I have received from professors at Purdue University. Many thanks go to Prof. Fabrice Baudoin, Prof. William Cleveland, Prof. Jose Figueroa-Lopez, Prof. Jayanta Ghosh, Prof. Chong Gu, Prof. Chuanhai Liu and Prof. Thomas Sellke for interesting lectures on related topics that helped me improve my knowledge in the area. Special thanks go to Prof. Anirban DasGupta for his very inspirational lectures, seminars and fruitful discussions on mathematical statistics.

I would like to thank Prof. Bruce Craig and Ms. Ce-Ce Furtner who offered me the opportunity to work as a statistical consultant in the department’s Statistical Consulting Service. I greatly value the discussions with Prof. Anindya Bhadra, Prof. Bruce Craig, Prof. Chong Gu, Prof. Arman Sabbaghi, Prof. Jun Xie and Prof. Michael Zhu on various kinds of consulting projects. Thanks also go to my fellow colleagues who have kindly shared their experience.

I would also like to thank members of Prof. Guang Cheng’s Big Data Theory Research Group, including Prof. Shengchun Kong, Prof. Qifan Song, Dr. Shih-Kang Chao, Dr. Zuofeng Shang, Dr. Wei Sun, Ching-Wei Cheng, Meimei Liu, Botao Hao, Jingcheng Bai, Jiexin Duan, Hilda Ibriga, Yang Yu and Jiapeng Liu, for many valuable discussions on research problems over the past five years.

I gratefully acknowledge the funding sources that made my Ph.D. work possible. I was funded by the Ross Fellowship of Purdue University for my first four years and was supported by a Purdue Research Foundation Grant for the fifth year. I also thank the department for teaching assistantships and Prof. Guang Cheng for research assistantships. I greatly appreciate the travel funding from the previous department Head Prof. Rebecca Doerge and Purdue’s Women in Science Programs.

My time at Purdue was made enjoyable in large part due to many friends that became a part of my life. I am grateful for time spent with fellow graduate students, roommates, neighbors and friends, and for many other people and memories.

Finally, I would like to express my heartfelt gratitude to my family, especially my parents, who have been a constant source of love, concern, support and strength all these years. Thanks to my lovely daughter Chloe Chen whose bright smiles are the warmest encouragement. I deeply thank my loving, supportive, encouraging and patient husband Xianghong Chen, who has supported me in every possible way to see the completion of this work.


TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
ABBREVIATIONS
ABSTRACT
1 Introduction
1.1 Minimax Optimal Estimation
1.2 High Dimensional Inference
2 Minimax Optimal Estimation
2.1 Partial Linear Models
2.2 Partial Linear Additive Models
2.3 Appendix
2.3.1 Proof for Section 2.1
2.3.2 Proof for Section 2.2
2.3.3 Results from Empirical Process Theory
3 High Dimensional Inference
3.1 Semiparametric Version of Debiased LASSO
3.1.1 Construction of Debiased Estimator
3.1.2 Asymptotic Distribution
3.1.3 Semiparametric Efficiency
3.2 Statistical Inference
3.2.1 Component-Wise Confidence Interval
3.2.2 Support Recovery
3.2.3 Testing with FWER Control
3.3 Appendix
3.3.1 Preliminary Lemmas
3.3.2 Proof for Section 3.1.2
REFERENCES


LIST OF TABLES

2.1 Estimation Interference Results for Model (2.1)
2.2 Estimation Interference Results for Model (2.6)
3.1 Average coverage probabilities and lengths of confidence intervals at 95% nominal level based on 1000 replications; $n = 100$, $p = 500$, $s_0 = 3$, Normal Error
3.2 Average coverage probabilities and lengths of confidence intervals at 95% nominal level based on 1000 replications; $n = 100$, $p = 500$, $s_0 = 3$, $t_5$ Error
3.3 Average coverage probabilities and lengths of confidence intervals at 95% nominal level based on 1000 replications; $n = 100$, $p = 500$, $s_0 = 15$, Normal Error
3.4 Average coverage probabilities and lengths of confidence intervals at 95% nominal level based on 1000 replications; $n = 100$, $p = 500$, $s_0 = 15$, $t_5$ Error
3.5 Mean and Standard deviation of $d(\hat S_0, S_0)$ based on 1000 replications; $n = 100$, $p = 500$
3.6 FWER and Power of Multiple Testing based on 1000 replications; $n = 100$, $p = 500$, $s_0 = 3$
3.7 FWER and Power of Multiple Testing based on 1000 replications; n =


LIST OF FIGURES

Figure Page


ABBREVIATIONS

i.i.d. independent and identically distributed

LASSO Least Absolute Shrinkage and Selection Operator

FWER Family Wise Error Rate


ABSTRACT

Yu, Zhuqing, PhD, Purdue University, August 2016. High Dimensional Inference for Semiparametric Models. Major Professor: Guang Cheng.

In the literature, high dimensional inference refers to statistical inference when the number of unknown parameters is much greater than the sample size. Semiparametric models are models that include both parametric and nonparametric components, such as partial linear models and partial additive models. Due to the high dimensionality of the parameter of interest and the presence of a nuisance function, estimation and inference for the parametric component, for instance the construction of confidence intervals and hypothesis tests, are very challenging in high dimensional semiparametric settings. In this thesis, I will present two sets of estimation and inference results under high dimensional semiparametric setups.

The first one is minimax optimal estimation in high dimensional semiparametric models. Our particular focus is on partially linear additive models with high dimensional sparse vectors and smooth nonparametric functions. The minimax lower bound for the parametric component depends merely on the dimensionality and sparsity, while the minimax lower bound for each nonparametric component is established as an interplay among dimensionality, sparsity and smoothness. Indeed, the minimax risk for parametric estimation cannot be affected by the roughness of the nonparametric functions. However, the minimax risk for smooth nonparametric estimation can be slowed down to the classical parametric rate by the existence of a high dimensional sparse vector, given sufficiently large smoothness or dimensionality. Such a rate-switching phenomenon differs significantly from low dimensional models, where the estimation rate for each component depends only on that component. In the above setting, a general class of penalized least squares estimators is constructed to nearly achieve the minimax lower bounds.

The second one is high dimensional inference for partial spline models, where the dimension of the parametric component is allowed to grow exponentially with the sample size. We propose a semiparametric version of the debiased LASSO estimator. In the high dimensional regime, this new estimator is shown to be asymptotically normal. Based on this distributional result, we further conduct simultaneous hypothesis testing, with applications to support recovery and multiple testing with strong family wise error rate control.


1. INTRODUCTION

High dimensional inference refers to statistical inference when the number of unknown parameters is much greater than the sample size. Semiparametric models are models that include both parametric and nonparametric components, such as partial linear models and partial additive models. Due to the high dimensionality of the parameter of interest and the presence of a nuisance function, estimation and inference for the parametric component, for instance the construction of confidence intervals and hypothesis tests, are very challenging in high dimensional semiparametric settings. As far as we are aware, the existing literature mostly focuses on high dimensional parametric or nonparametric estimation, and penalization is a commonly used technique in dealing with these estimation problems; see [1]. For instance, the LASSO estimator is obtained from the $\ell_1$ penalized least squares method, which performs both shrinkage and variable selection; see [2]. For nonparametric regression, smoothing splines are well known to provide nice curves that smooth discrete, noisy data, and the roughness penalty based on the second derivative is the most common in the modern literature; see [3], [4]. In this thesis, I will investigate a class of penalized estimators that incorporates both an $\ell_1$ penalty for the parametric component and a roughness penalty for the nonparametric functions, and provide new theoretical insights on statistical inference for the parameters in high dimensional semiparametric settings.
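To make the "shrinkage and variable selection" remark concrete, here is a minimal sketch of the scalar $\ell_1$ penalized least squares problem, whose solution is the soft-thresholding rule. The function name `soft_threshold` and the scaling of the objective, $(z-b)^2 + \lambda|b|$, are assumptions chosen for illustration; they are not taken from the thesis.

```python
def soft_threshold(z: float, lam: float) -> float:
    """Minimizer over real b of (z - b)**2 + lam * abs(b).

    Setting the subgradient to zero gives b = z - (lam / 2) * sign(b),
    so small inputs are set exactly to zero (selection) and large inputs
    are pulled toward zero by lam / 2 (shrinkage).
    """
    shrink = lam / 2.0
    if z > shrink:
        return z - shrink
    if z < -shrink:
        return z + shrink
    return 0.0


# Hypothetical unpenalized coefficients, chosen only for illustration.
coefs = [3.0, -0.4, 0.1, -2.5]
print([soft_threshold(z, lam=1.0) for z in coefs])  # [2.5, 0.0, 0.0, -2.0]
```

The two small coefficients are removed entirely, while the two large ones are shrunk by $\lambda/2$.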

1.1 Minimax Optimal Estimation

In Chapter 2, I will introduce penalized estimation for high dimensional semiparametric models, which contain two different types of model components: sparse Euclidean parameters and smooth nonparametric functions. By imposing the more refined semiparametric structure, we aim to obtain new theoretical insights, in particular on the interfering effect between sparse parametric estimation and smooth nonparametric estimation, and further construct (nearly) minimax optimal estimators.

We illustrate our theory in an important class of semiparametric models: partially linear additive models

$$Y = X^T\beta_0 + \sum_{j=1}^{J} f_j(Z_j) + \varepsilon, \tag{1.1}$$

where $\beta_0 \in \mathbb{R}^p$ is sparse with $p > n$ and $f_j : \mathbb{R} \mapsto \mathbb{R}$ are nonparametric functions with possibly different smoothness. The additive components $f_j$ are not assumed to be sparse and $J$ is fixed. In contrast with the literature on sparse parametric or nonparametric estimation such as [5], [6], [7], [8], [9] and [10], we are not interested in estimating the conditional mean function $E(Y \mid X, Z_1, \ldots)$ as a whole, but rather in the separate minimax risk for each model component: $\beta_0, f_1, \ldots, f_J$. Note that our results are not directly implied by the results in the literature, where additive components are always assumed to share the same linear or nonlinear structure with the same smoothness.

To better illustrate our idea, we start from a simpler partially linear model:

$$Y = X^T\beta_0 + f_0(Z) + \varepsilon,$$

where $\beta_0 \in \mathbb{R}^p$ has at most $s_0$ non-zero elements and $f_0$ belongs to the $\alpha$-th order Sobolev space (with $\alpha > 1/2$). When $\beta_0$ is of fixed or low dimension ($p < n$), the above model has been extensively studied in the semiparametric literature; see the references cited in [11] and also the recent work [12], [13]. In [14], the minimax risk for estimating $\beta_0$ is shown to be bounded below by

$$\frac{s_0}{n}\log\frac{p}{s_0}, \tag{1.2}$$

and the minimax risk for estimating $f_0$ is shown to be bounded below by

$$\max\left\{ n^{-2\alpha/(2\alpha+1)},\ \frac{s_0}{n}\log\frac{p}{s_0} \right\} \tag{1.3}$$

up to a universal constant, based on i.i.d. observations $\{Y_i, X_i, Z_i\}_{i=1}^n$.

It is surprising to see that the lower bound (1.2) for estimating $\beta_0$ does not involve the nonparametric function, while the bound (1.3) depends on both the parametric and nonparametric components. As depicted in Figure 1.1, the rate (1.3) exhibits an interesting two-regime dichotomy. In the sparse regime, where $f_0$ is sufficiently smooth or $p$ is sufficiently high, the minimax risk (1.3) becomes $s_0\log(p/s_0)/n$. In other words, the best possible estimation of $f_0$ is slowed down to the well known sparse parametric estimation rate [6,7,15]. On the other hand, in the smooth regime, where $f_0$ is very rough or $p$ is low, the minimax risk becomes the classical nonparametric rate $n^{-2\alpha/(2\alpha+1)}$ [16,17], even for the sparse estimation of $\beta_0$. We call this the rate-switching phenomenon. Interestingly, Figure 1.1 happens to coincide with the phase transition phenomenon discovered in [5,8,9] for high dimensional additive nonparametric models, which is further proven to hold even under approximate sparsity by [10]. Our contribution is to demonstrate that the doubly penalized estimators proposed in [18] for $(\beta_0, f_0)$ almost achieve these minimax lower bounds. This requires us to develop (a stronger version of) oracle inequalities that hold in expectation.
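The rate switch in (1.3) can be checked numerically. The following is a minimal sketch that evaluates the two competing terms and reports which regime dominates; the function name `minimax_rate_f0`, the example values, and the choice to drop all universal constants are assumptions made for illustration.

```python
import math


def minimax_rate_f0(n: int, p: int, s0: int, alpha: float):
    """Evaluate the lower bound (1.3) for f0, ignoring universal constants.

    Returns the larger of the nonparametric rate n^(-2a/(2a+1)) and the
    sparse parametric rate (s0 / n) * log(p / s0), plus the regime name.
    Assumes 1 <= s0 < p so that log(p / s0) is positive.
    """
    smooth_rate = n ** (-2.0 * alpha / (2.0 * alpha + 1.0))
    sparse_rate = s0 * math.log(p / s0) / n
    regime = "sparse" if sparse_rate >= smooth_rate else "smooth"
    return max(smooth_rate, sparse_rate), regime


# Smooth f0 with many active covariates: the sparse term dominates.
print(minimax_rate_f0(n=1000, p=5000, s0=50, alpha=2.0))
# Rough f0 with a single active covariate: the nonparametric term dominates.
print(minimax_rate_f0(n=1000, p=2000, s0=1, alpha=0.6))
```

In the first call the bound equals $s_0\log(p/s_0)/n$; in the second it equals $n^{-2\alpha/(2\alpha+1)}$.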

We next move to the partially linear additive models (1.1) and assume J = 2 for simplicity. Hence, we now have two nonparametric functions with possibly different smoothness:

$$Y = X^T\beta_0 + f_0(Z) + g_0(U) + \varepsilon,$$

where $g_0$ belongs to the $\gamma$-th order Sobolev space (with $\gamma > 1/2$). The minimax lower bounds for estimating $\beta_0$ and $f_0$ are exactly the same as (1.2) and (1.3), and hence do not depend on the smoothness of $g_0$ at all (no matter whether $\alpha > \gamma$ or $\gamma > \alpha$). The same bound applies to $g_0$ as well, with $\alpha$ replaced by $\gamma$ in (1.3). The latter result essentially generalizes [19], who showed that, in an additive nonparametric regression model, each component can be estimated (up to first order asymptotics) as well as if all the other components were known. In the end, inspired by [12] and [20], we propose penalized estimators for $(\beta_0, f_0, g_0)$ that can almost achieve these lower bounds.

[Figure 1.1: phase diagram titled "Phase Transition: Optimal Rate", with axes $\alpha$ and $\log(s_0\log(p/s_0))/\log(n)$ and two regions labeled "Smooth Regime" and "Sparse Regime".]

Figure 1.1.: Minimax Rate Phase Transition. When the smoothness index $\alpha$, and the dimensionality and sparsity measured by $\log(s_0\log(p/s_0))/\log n$, fall in the smooth regime, the optimal rate is given by $n^{-2\alpha/(2\alpha+1)}$, which is determined solely by the smoothness of $f_0$. On the other hand, if they fall into the sparse regime, then the optimal rate is given by $s_0\log(p/s_0)/n$, which is determined entirely by the sparsity index $s_0$ and the dimensionality $p$.

Our main technical tools are a set of oracle inequalities implying that the parametric estimator can achieve the oracle rate and that each nonparametric function can be estimated at the rate of convergence attainable if the others were known. These are developed based on some recent advances in empirical process theory [21]. To derive the risk upper bounds, we further strengthen these oracle inequalities to their moment versions.

1.2 High Dimensional Inference

In Chapter 3, I will introduce a debiased LASSO estimator for the high dimensional parametric vector in partial spline models (1.4), in the presence of a nuisance nonparametric function. Based on this estimator, I will further develop statistical inference procedures, including confidence intervals and hypothesis tests, together with their applications.

Consider a high dimensional partial smoothing spline model:

$$Y = X^T\beta_0 + g_0(Z) + \varepsilon, \tag{1.4}$$

where $\beta_0 \in \mathbb{R}^p$ is an unknown vector and $g_0$ is an unknown smooth function belonging to the $m$-th order Sobolev space $\mathcal{G}_m$. Here $(X, Z) \in \mathbb{R}^{p+1}$ are covariates, $Y \in \mathbb{R}$ is the response variable and $\varepsilon$ is the error term. In particular, we consider the case where the dimension $p$ is greater than the sample size $n$. Our interest is in statistical inference for the high dimensional parameter $\beta_0$, for instance confidence intervals and hypothesis tests, in the presence of the nuisance function $g_0$.

In the low dimensional case where the number of covariates $p$ in the linear part is smaller than the sample size $n$, the estimation of $\beta_0$ and the associated asymptotic inference have been extensively studied; see [11,22,23] and references therein. The estimation of this model in high dimensions has also been widely studied; see [12,18,24]. However, to the best of our knowledge, high dimensional statistical inference for $\beta_0$ has not been established in the literature, due to the high dimensionality and the intractable limiting distribution of LASSO type estimators. Recently, [25–27] have proposed a debiased version of the LASSO estimator for high dimensional linear and generalized linear models, which is non-sparse and has a limiting normal distribution. Inspired by this debiasing idea, we propose a debiased LASSO estimator, denoted as $\hat b$, for the partial smoothing spline model (1.4). Our proposed estimator is shown to be asymptotically unbiased for $\beta_0$, and each of its components has a limiting normal distribution. This distributional result naturally generalizes to linear contrasts of $\beta_0$ by using the Cramér-Wold device. As a byproduct, we have also calculated the variance of our estimator $\hat b$.
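For orientation, the debiasing idea can be written down for the plain linear model $Y = \mathbb{X}\beta_0 + \varepsilon$, where it takes the following commonly stated form. This is only a sketch of the linear-model version, not the semiparametric construction developed in Chapter 3; the symbols $\mathbb{X}$ (the $n \times p$ design matrix), $\hat\Sigma$ and $\hat\Theta$ are notation assumed here for illustration.

```latex
% Debiased (desparsified) LASSO in the linear model, stated as a sketch.
% \hat\beta : LASSO estimator
% \hat\Sigma = \mathbb{X}^T \mathbb{X} / n : sample Gram matrix
% \hat\Theta : an approximate inverse of \hat\Sigma (e.g. from nodewise LASSO)
\hat b \;=\; \hat\beta \;+\; \frac{1}{n}\,\hat\Theta\,\mathbb{X}^T\bigl(Y - \mathbb{X}\hat\beta\bigr)
% Decomposition that motivates the correction term:
\sqrt{n}\,(\hat b - \beta_0)
  \;=\; \frac{1}{\sqrt{n}}\,\hat\Theta\,\mathbb{X}^T\varepsilon
  \;+\; \sqrt{n}\,\bigl(I - \hat\Theta\hat\Sigma\bigr)(\hat\beta - \beta_0)
```

The first term is a linear function of the noise and drives the normal limit; the second is a remainder that is small when $\hat\Theta\hat\Sigma$ is close to the identity and $\hat\beta$ is $\ell_1$-consistent.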

Based on this, we further conduct simultaneous hypothesis testing and propose a test statistic together with its multiplier bootstrap counterpart. In particular, this simultaneous testing method automatically takes into account the dependence structure within $\hat b$, and is also adaptive to the number of tests, which is allowed to be exponentially larger than the sample size. Our procedure is motivated by [28], who have recently proposed a test statistic and its multiplier bootstrap version for high dimensional linear models. Our theoretical results are also numerically investigated in three applications, including component-wise confidence intervals, support recovery for sparse vectors and multiple testing with strong family wise error rate control.

To prove our results, we first show an oracle type inequality in Section 3.1.2, which has also been strengthened to a version in expectation. Then we give the explicit asymptotic order for the accuracy of the approximate inverse information matrix constructed from the nodewise LASSO method. The major technical tools we have used are a Bernstein type inequality, a weighted projection inequality from [29] and a central limit theorem for maxima from [30].

The rest of this chapter is organized as follows. Section 3.1 discusses the construction of the debiased LASSO estimator for $\beta_0$ and its asymptotic inference. Section 3.2 studies three applications of our theoretical results together with their numerical performance. All technical details are deferred to Section 3.3.


2. MINIMAX OPTIMAL ESTIMATION

In this chapter I will discuss estimation of two classes of models, partial linear models (2.1) and partial linear additive models (2.6), as detailed later. Recently, [14] have derived minimax optimal estimation rates for both parametric and nonparametric components in these two models. I will construct estimators for both components and show that their risks achieve the minimax lower bounds in [14].

Before presenting any theoretical results, we introduce the following notation for convenience. For any vector $v \in \mathbb{R}^n$, we write its $\ell_1$, Euclidean and $\ell_\infty$ norms as $\|v\|_1 = \sum_{i=1}^n |v_i|$, $\|v\| = \sqrt{\sum_{i=1}^n v_i^2}$ and $\|v\|_\infty = \max_{1\le i\le n}|v_i|$, respectively, and also $\|v\|_n^2 := v^Tv/n$. With a slight abuse of notation, we define for any function $f : \mathcal{Z} \mapsto \mathbb{R}$ that $\|f\| = \sqrt{Ef^2(Z)}$, $\|f\|_\infty = \sup_{z\in\mathcal{Z}}|f(z)|$ and $\|f\|_n^2 = \sum_{i=1}^n f^2(Z_i)/n$. Let $S_0$ be the index set of the non-zero components of $\beta_0$ and $s_0 = |S_0|$. Define $\beta_{S_0}$ such that $(\beta_{S_0})_j = \beta_j 1\{\beta_{0j} \ne 0\}$ and $\beta_{S_0^c} = \beta - \beta_{S_0}$, for any $\beta \in \mathbb{R}^p$. Thus, $\|\beta\|_1 = \|\beta_{S_0}\|_1 + \|\beta_{S_0^c}\|_1$. The $\alpha$-th order Sobolev space over $[0,1]$, denoted as $W^{\alpha,2}(L)$, is defined as $\{f : [0,1] \to \mathbb{R} : \int_0^1 (f^{(\alpha)}(x))^2\,dx \le L^2\}$ for a constant $L > 0$. For real sequences $a_n, b_n$, if $a_n \lesssim b_n$ ($a_n \gtrsim b_n$), then $\limsup a_n/b_n \le C$ ($c \le \limsup a_n/b_n$), for some constant $C$ (constant $c$). If $a_n \asymp b_n$, then $c \le \liminf a_n/b_n \le \limsup a_n/b_n \le C$ for some constants $c, C$. Also, we write $a_n = O(b_n)$ if $|a_n| \le C|b_n|$ for some constant $C > 0$. In the sequel, $c, c_0, C, C_0, \ldots$ denote generic constants which may differ at each appearance.
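The vector norms and the support split defined above translate directly into code. The following is a minimal sketch using only the standard library; all function names are assumptions introduced for illustration.

```python
import math


def l1_norm(v):
    """||v||_1 = sum_i |v_i|."""
    return sum(abs(x) for x in v)


def euclidean_norm(v):
    """||v|| = sqrt(sum_i v_i^2)."""
    return math.sqrt(sum(x * x for x in v))


def sup_norm(v):
    """||v||_inf = max_i |v_i|."""
    return max(abs(x) for x in v)


def empirical_sq_norm(v):
    """||v||_n^2 = v^T v / n."""
    return sum(x * x for x in v) / len(v)


def split_on_support(beta, beta0):
    """Return (beta_{S0}, beta_{S0^c}), where S0 is the support of beta0."""
    on = [b if b0 != 0 else 0.0 for b, b0 in zip(beta, beta0)]
    off = [b - a for b, a in zip(beta, on)]
    return on, off


# Hypothetical vectors, chosen only for illustration.
beta0 = [2.0, 0.0, 0.0]
beta = [1.0, -2.0, 0.5]
on, off = split_on_support(beta, beta0)
# The l1 norm decomposes over S0 and its complement.
print(l1_norm(beta), l1_norm(on) + l1_norm(off))
```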

2.1 Partial Linear Models

Let us consider partial linear models as follows:

$$Y_i = X_i^T\beta_0 + f_0(Z_i) + \varepsilon_i, \quad 1 \le i \le n, \tag{2.1}$$

where $(X_i, Z_i)_{i=1}^n \in \mathbb{R}^p \times [0,1]$ are i.i.d. copies of $(X, Z)$. We assume $X$ is a mean zero Gaussian vector with variance matrix $\Sigma$, and the errors $\{\varepsilon_i\}_{i=1}^n$ are i.i.d. standard Gaussian random variables independent of $\{X_i, Z_i\}_{i=1}^n$. For simplicity, we standardize $X$ such that the diagonal of $\Sigma$ consists of 1's. In this chapter, we restrict our attention to the Gaussian design and noise, since even in high dimensional linear models, deriving sharp minimax bounds under non-Gaussian settings remains an open problem; see [15].

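Let $B[s_0, p]$ be the set of $p$-dimensional vectors with at most $s_0$ non-zero coordinates and $\mathcal{S}_p$ be the set of $p \times p$ matrices with 1's on the diagonal. Define

$$R_{\beta_0}(s_0, \Sigma, \alpha) := \inf_{\hat\beta}\ \sup_{\beta_0 \in B[s_0,p],\ f_0 \in W^{\alpha,2}(L)} E\bigl[\|\beta_0 - \hat\beta\|^2\bigr] \tag{2.2}$$

and

$$R_{f_0}(s_0, \Sigma, \alpha) = \inf_{\hat f}\ \sup_{\beta_0 \in B[s_0,p],\ f_0 \in W^{\alpha,2}(L)} E\int_0^1 |\hat f(z) - f_0(z)|^2\,dz.$$

The minimax risks with respect to random designs with covariance matrices $\Sigma$ are defined as

$$R_{\beta_0}(s_0, \alpha) := \inf_{\Sigma \in \mathcal{S}_p} R_{\beta_0}(s_0, \Sigma, \alpha) \quad\text{and}\quad R_{f_0}(s_0, \alpha) := \inf_{\Sigma \in \mathcal{S}_p} R_{f_0}(s_0, \Sigma, \alpha),$$

respectively.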

It is known from [14] that, given $n$ i.i.d. samples from the high dimensional partial linear model (2.1), the minimax risk for estimating $\beta_0$ can be bounded from below as

$$R_{\beta_0}(s_0, \alpha) \gtrsim \frac{s_0}{n}\log\frac{p}{s_0}, \tag{2.3}$$

and the minimax risk for estimating $f_0$ can be bounded from below as

$$R_{f_0}(s_0, \alpha) \gtrsim \max\left\{ n^{-2\alpha/(2\alpha+1)},\ \frac{s_0}{n}\log\frac{p}{s_0} \right\}. \tag{2.4}$$

Now, we demonstrate that the doubly penalized estimators of $(\beta_0, f_0)$ proposed in [12] and [18] nearly achieve the lower bounds derived in (2.3) and (2.4); see Table 2.1. In particular, when $s_0 \sim p^r$, the matching is exact up to some constant.


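We define the penalized estimators $(\hat\beta, \hat f)$ as follows:

$$(\hat\beta, \hat f) := \arg\min_{(\beta, f) \in \mathbb{R}^p \times W^{\alpha,2}(L)} \Bigl\{ \|Y - X^T\beta - f\|_n^2 + \lambda\|\beta\|_1 + \rho^2 J^2(f) \Bigr\}, \tag{2.5}$$

where $\|\beta\|_1$ is the $\ell_1$ penalty and $J^2(f) = \int_0^1 (f^{(\alpha)}(z))^2\,dz$ is the smoothness penalty. Here, $\lambda > 0$ and $\rho^2$ control the level of shrinkage for $\hat\beta$ and the roughness of $\hat f$, respectively.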
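To see how the three terms of the criterion (2.5) fit together, the following minimal sketch evaluates the objective for a candidate $(\beta, f)$. It assumes, purely for illustration, that $f$ is represented by a finite cosine expansion $f(z) = \theta_0 + \sum_{k \ge 1} \theta_k \sqrt{2}\cos(\pi k z)$ and that $\alpha$ is a positive integer, in which case $J^2(f) = \sum_{k\ge1} (\pi k)^{2\alpha}\theta_k^2$. The basis choice and all function names are assumptions, not the thesis's construction, and the code only evaluates the objective rather than minimizing it.

```python
import math


def cosine_basis_f(theta, z):
    """f(z) = theta[0] + sum_{k>=1} theta[k] * sqrt(2) * cos(pi * k * z)."""
    return theta[0] + sum(
        t * math.sqrt(2.0) * math.cos(math.pi * k * z)
        for k, t in enumerate(theta)
        if k >= 1
    )


def roughness_penalty(theta, alpha):
    """J^2(f) = int_0^1 (f^(alpha)(z))^2 dz for integer alpha >= 1.

    Uses orthonormality of sqrt(2)cos(pi k z) and sqrt(2)sin(pi k z) on [0, 1].
    """
    return sum(
        (math.pi * k) ** (2 * alpha) * t * t
        for k, t in enumerate(theta)
        if k >= 1
    )


def objective(Y, X, Z, beta, theta, lam, rho2, alpha):
    """||Y - X^T beta - f||_n^2 + lam * ||beta||_1 + rho2 * J^2(f)."""
    n = len(Y)
    rss = 0.0
    for y_i, x_i, z_i in zip(Y, X, Z):
        fit = sum(xj * bj for xj, bj in zip(x_i, beta)) + cosine_basis_f(theta, z_i)
        rss += (y_i - fit) ** 2
    l1 = sum(abs(b) for b in beta)
    return rss / n + lam * l1 + rho2 * roughness_penalty(theta, alpha)


# Tiny hypothetical data set, chosen only for illustration.
Y = [1.0, 2.0]
X = [[1.0, 0.0], [0.0, 1.0]]
Z = [0.0, 0.5]
print(objective(Y, X, Z, beta=[1.0, 0.0], theta=[0.0], lam=0.5, rho2=0.1, alpha=2))
```

Increasing `lam` penalizes non-zero entries of `beta`, while increasing `rho2` penalizes high-frequency coefficients of `theta` much more heavily than low-frequency ones.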

Before calculating risk upper bounds, we need the following assumptions, adopted from [18]. In particular, Assumption 2.2 avoids functions with high and very steep peaks. Let h1(Z) := E[X|Z].

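Assumption 2.1. The smallest eigenvalue of $E(X - h_1(Z))^T(X - h_1(Z))$ is positive, and the largest eigenvalue of $Eh_1^Th_1$ is finite.

Assumption 2.2. For some constant $K_f$, it holds that $\sup_{\{\|f\|\le 1,\ J(f)\le 1\}} \|f\|_\infty \le K_f$.

Assumption 2.3. $J(h_1)$ is bounded.

Theorem 2.1. Suppose Assumptions 2.1–2.3 hold. If $\lambda \asymp \sqrt{\log p/n}$ and $\rho^2 \asymp n^{-2\alpha/(2\alpha+1)}$, then we have

$$E\|\hat\beta - \beta_0\|^2 \lesssim \frac{s_0\log p}{n}, \quad\text{and}\quad E\int_0^1 |\hat f(z) - f_0(z)|^2\,dz \lesssim \max\left\{ n^{-2\alpha/(2\alpha+1)},\ \frac{s_0\log p}{n} \right\}.$$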

These risk upper bounds are almost optimal when compared with (2.3) and (2.4). This generalizes the claim by [15] that the LASSO achieves the (almost optimal) risk bound $s_0\log(p)/n$ in high dimensional linear models to partial linear models.
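The sense in which the upper bounds are "almost" optimal can be quantified: the upper bound for $\hat\beta$ carries $\log p$ where the lower bound (2.3) carries $\log(p/s_0)$. The following minimal sketch computes the tuning orders of Theorem 2.1 and this logarithmic gap. Setting all proportionality constants to 1 and the function name `theorem_2_1_orders` are assumptions made for illustration.

```python
import math


def theorem_2_1_orders(n: int, p: int, s0: int, alpha: float) -> dict:
    """Orders of the tuning parameters and beta risk bounds, constants set to 1.

    Assumes 1 <= s0 < p so that log(p / s0) is positive.
    """
    lam = math.sqrt(math.log(p) / n)                  # lambda ~ sqrt(log p / n)
    rho2 = n ** (-2.0 * alpha / (2.0 * alpha + 1.0))  # rho^2 ~ n^(-2a/(2a+1))
    beta_upper = s0 * math.log(p) / n                 # upper bound in Theorem 2.1
    beta_lower = s0 * math.log(p / s0) / n            # lower bound (2.3)
    return {
        "lambda": lam,
        "rho2": rho2,
        "beta_upper": beta_upper,
        "beta_lower": beta_lower,
        "gap": beta_upper / beta_lower,               # log p / log(p / s0)
    }


# With s0 = p^r and r = 1/2, the gap equals 1 / (1 - r) = 2.
print(theorem_2_1_orders(n=32, p=10000, s0=100, alpha=2.0))
```

When $s_0 = p^r$ for a fixed $0 < r < 1$ (an assumption on the range of $r$), $\log(p/s_0) = (1-r)\log p$, so the gap is the constant $1/(1-r)$ and the bounds match up to that constant.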

In the end, we remark that the oracle inequalities given in Theorems 1 and 2 of [18] only imply the estimation rates of $\hat\beta$ and $\hat f$ (in terms of the $\ell_2$-norm). Rather, our theorem

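Table 2.1.: Estimation Interference Results for Model (2.1).

High dimensional $\beta_0$:

| Sparsity parameters $s_0, p, n$ | Lower bound $R_{\beta_0}(s_0, \alpha)$ | Penalized estimator $E\lVert\hat\beta - \beta_0\rVert^2$ |
|---|---|---|
| $\frac{1}{2\alpha+1} < a+b < 1$ | $s_0\log(p/s_0)/n$ | $s_0\log p/n$ |
| $a+b < \frac{1}{2\alpha+1}$ | $s_0\log(p/s_0)/n$ | $s_0\log p/n$ |

Smooth $f_0$:

| Sparsity parameters $s_0, p, n$ | Lower bound $R_{f_0}(s_0, \alpha)$ | Penalized estimator $E\int_0^1 \lvert\hat f(z) - f_0(z)\rvert^2\,dz$ |
|---|---|---|
| $\frac{1}{2\alpha+1} < a+b < 1$ | $s_0\log(p/s_0)/n$ | $s_0\log p/n$ |
| $a+b < \frac{1}{2\alpha+1}$ | $n^{-2\alpha/(2\alpha+1)}$ | $n^{-2\alpha/(2\alpha+1)}$ |

We set $s_0 = n^b$, $p = \exp(n^a)$.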
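The threshold $a + b = 1/(2\alpha+1)$ in the table follows from a short calculation under the parametrization $s_0 = n^b$, $p = \exp(n^a)$. The worked steps below are a sketch that ignores constants and compares only exponents of $n$.

```latex
% Sparse rate under s_0 = n^b, p = exp(n^a):
\frac{s_0 \log p}{n} \;=\; \frac{n^{b}\, n^{a}}{n} \;=\; n^{\,a+b-1}
% Compare with the nonparametric rate n^{-2\alpha/(2\alpha+1)}:
n^{\,a+b-1} \;\ge\; n^{-2\alpha/(2\alpha+1)}
\iff a+b-1 \;\ge\; -\frac{2\alpha}{2\alpha+1}
\iff a+b \;\ge\; \frac{1}{2\alpha+1}
% The condition a + b < 1 ensures n^{a+b-1} -> 0, i.e. consistency is possible.
```

So the sparse term dominates exactly when $a + b$ exceeds $1/(2\alpha+1)$, and the nonparametric term dominates below that threshold.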


2.2 Partial Linear Additive Models

We are now ready to consider the partially linear additive models with two additive components (for simplicity):

$$Y_i = X_i^T\beta_0 + f_0(Z_i) + g_0(U_i) + \varepsilon_i, \quad 1 \le i \le n, \tag{2.6}$$

where $(X_i, Z_i, U_i) \in \mathbb{R}^p \times [0,1] \times [0,1]$ are i.i.d. copies of $(X, Z, U)$ with a joint density $p_{XZU}$, $\beta_0 \in B[s_0, p]$, $f_0 \in W^{\alpha,2}(L_1)$ and $g_0 \in W^{\gamma,2}(L_2)$. As before, we assume that $X_i \sim N_p(0, \Sigma)$ and $\varepsilon_i \sim N(0,1)$ independent of the design. For identifiability, we assume $Eg_0(U) = 0$.

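Define the minimax risk for estimating $\beta_0$ as

$$R_{\beta_0}(s_0, \alpha, \gamma) := \inf_{\Sigma \in \mathcal{S}_p} R_{\beta_0}(s_0, \alpha, \gamma, \Sigma),$$

where

$$R_{\beta_0}(s_0, \alpha, \gamma, \Sigma) := \inf_{\hat\beta}\ \sup_{\beta_0 \in B[s_0,p],\ f \in W^{\alpha,2}(L_1),\ g \in W^{\gamma,2}(L_2)} E\bigl[\|\beta_0 - \hat\beta\|^2\bigr].$$

It has been proved in [14] that, given $n$ i.i.d. samples from the high dimensional partial linear additive model (2.6), the minimax risk for estimating $\beta_0$ can be bounded from below as

$$R_{\beta_0}(s_0, \alpha, \gamma) \gtrsim \frac{s_0}{n}\log\frac{p}{s_0}, \tag{2.7}$$

which is only affected by the least smooth function. Next, define the minimax risk of estimating $f_0$ as

$$R_{f_0}(s_0, \alpha, \gamma) := \inf_{\Sigma \in \mathcal{S}_p} R_{f_0}(s_0, \Sigma, \alpha, \gamma),$$

where

$$R_{f_0}(s_0, \Sigma, \alpha, \gamma) = \inf_{\hat f}\ \sup_{\beta_0 \in B[s_0,p],\ g \in W^{\gamma,2}(L_2)}\ \sup_{f_0 \in W^{\alpha,2}(L_1)} E\int_0^1 |\hat f(z) - f_0(z)|^2\,dz.$$

[14] shows that, given $n$ i.i.d. samples from the high dimensional partial linear additive model (2.6), the minimax risk for estimating $f_0$ can be bounded from below as

$$R_{f_0}(s_0, \alpha, \gamma) \gtrsim \max\left\{ n^{-2\alpha/(2\alpha+1)},\ \frac{s_0}{n}\log\frac{p}{s_0} \right\}. \tag{2.8}$$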


Now, we construct estimators of $(\beta_0, f_0, g_0)$ that almost achieve the lower bounds shown in [14]. Our construction is inspired by [12] and [20], and holds in the following general setup. We assume that $f$ and $g$ belong to more general classes of functions: Hilbert spaces $\mathcal{F}$ and $\mathcal{G}$ of continuous functions $f \in \mathcal{F}$ and $g \in \mathcal{G}$ on $[0,1]$. In particular, $W^{\alpha,2}(L_1) \subset \mathcal{F}$ and $W^{\gamma,2}(L_2) \subset \mathcal{G}$. Let $I(\cdot,\cdot)$ and $J(\cdot,\cdot)$ be semi-inner products on $\mathcal{F}$ and $\mathcal{G}$, and $I(\cdot)$, $J(\cdot)$ be the corresponding semi-norms. A special case of $\mathcal{F}, \mathcal{G}$ and $I, J$ is $\mathcal{F} = W^{\alpha,2}(L_1)$, $\mathcal{G} = W^{\gamma,2}(L_2)$ and $I^2(f) = \int_0^1 (f^{(\alpha)}(z))^2\,dz$, $J^2(g) = \int_0^1 (g^{(\gamma)}(z))^2\,dz$. Also, we can allow $\varepsilon$ to be sub-Gaussian (not necessarily $N(0,1)$) in this section.

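The penalized least squares estimators of $(\beta_0, f_0, g_0)$ can be obtained as

$$(\hat\beta, \hat f, \hat g) = \arg\min_{\beta \in \mathbb{R}^p,\ f \in \mathcal{F},\ g \in \mathcal{G}} \Bigl\{ \|Y - X^T\beta - f - g\|_n^2 + \lambda\|\beta\|_1 + \rho^2 I^2(f) + \mu^2 J^q(g) \Bigr\}, \tag{2.9}$$

where $1 \le q \le 2$ is some fixed constant. Without loss of generality, we assume that functions in $\mathcal{F}$ are smoother than those in $\mathcal{G}$ in a sense defined in Assumption 2.6. For $W^{\alpha,2}(L_1)$ and $W^{\gamma,2}(L_2)$, this simply means $\gamma < \alpha$.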

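Assumption 2.4 assumes sub-Gaussian errors. Note that if $\varepsilon_1$ is bounded such that $\|\varepsilon_1\| \le M$, then $\|\varepsilon_1\|_\Psi \le M$.

Assumption 2.4. The error term $\varepsilon$ is independent of $(X, Z, U)$ and satisfies, for some constant $K_\varepsilon \ge 1$,

$$\|\varepsilon\|_\Psi \le K_\varepsilon,$$

where $\|\cdot\|_\Psi$ is an Orlicz norm$^1$ with $\Psi(t) = \exp(t^2) - 1$.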
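As a worked example of the sub-Gaussian condition, the Orlicz norm of a standard Gaussian error can be computed in closed form. The sketch below assumes the usual definition $\|\varepsilon\|_\Psi = \inf\{c > 0 : E\,\Psi(|\varepsilon|/c) \le 1\}$, which is a standard convention and may differ by constants from the one intended in the thesis.

```latex
% For \varepsilon \sim N(0,1) and \Psi(t) = \exp(t^2) - 1, the condition
% E\,\Psi(|\varepsilon|/c) \le 1 reads E\exp(\varepsilon^2/c^2) \le 2.
% Using E\exp(t\varepsilon^2) = (1 - 2t)^{-1/2} for t < 1/2 with t = 1/c^2:
\Bigl(1 - \frac{2}{c^2}\Bigr)^{-1/2} \le 2
\iff 1 - \frac{2}{c^2} \ge \frac{1}{4}
\iff c^2 \ge \frac{8}{3}
% Hence, under this definition,
\|\varepsilon\|_\Psi = \sqrt{8/3} \approx 1.63
```

Under this convention, the Gaussian errors of Section 2.1 satisfy Assumption 2.4 with any $K_\varepsilon \ge \sqrt{8/3}$.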

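Let $\mathcal{H} = \mathcal{F} \oplus \mathcal{G}$ be a Hilbert space of additive functions with the $\ell_2$ norm $\|\cdot\|$, and let $\tilde X = X - \Pi(X|\mathcal{H})$, with $\Pi(X|\mathcal{H})$ being the projection of $X$ onto $\mathcal{H}$ defined as $\arg\min_{h^* \in \mathcal{H}} E\|X - h^*\|^2$. By the definition of $\tilde X$, we have

$$\|X^T\beta + f + g\|^2 = \|\tilde X^T\beta\|^2 + \|\Pi(X|\mathcal{H})^T\beta + f + g\|^2. \tag{2.10}$$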

$^1$ $\|\varepsilon\|_\Psi := \inf\{c > 0 : E\,\Psi(|\varepsilon|/c) \le 1\}$.

Also by the definition of $\mathcal{H}$, $\Pi(X|\mathcal{H}) = (\Pi(X_1|\mathcal{H}), \cdots, \Pi(X_p|\mathcal{H}))^T$ can be written as a sum $f_X + g_X$, where $f_{X_j} \in \mathcal{F}$ and $g_{X_j} \in \mathcal{G}$ for $1 \le j \le p$.

Assumption 2.5 is widely used in the semiparametric literature [31,32], ensuring sufficient information for estimating $\beta_0$.

Assumption 2.5. The smallest eigenvalue $\Lambda^2_{\min}$ of $E\tilde X^T\tilde X$ is positive, and the largest eigenvalue $\Lambda^2_{\max}$ of $E\,\Pi(X|\mathcal{H})^T\Pi(X|\mathcal{H})$ is finite.

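Assumption 2.6 implies that $f$ is “smoother” than $g$ in terms of the following complexity measure. Let $d$ be a metric on the space $\mathcal{F}$. For any $t > 0$, define $N(t, \mathcal{F}, d)$ as the covering number of $\mathcal{F}$ and $H(t, \mathcal{F}, d) = \log N(t, \mathcal{F}, d)$ as the entropy of $\mathcal{F}$. Let $\mathcal{A}_n$ be the set of all configurations $A_n$ of $n$ points within the support of $P_{XZU}$. For $A_n \in \mathcal{A}_n$, we have $\|f\|_{A_n,\infty} = \max_{Z \in A_n}|f(Z)|$. Let $H_\infty(t, \mathcal{F}) = \sup_{A_n \in \mathcal{A}_n} H(t, \mathcal{F}, \|\cdot\|_{A_n,\infty})$. Further, we write

$$J_\infty(u, \mathcal{F}) = C_0 \inf_{\delta > 0}\left\{ u\int_{\delta/4}^{1} \sqrt{H_\infty(tu/2, \mathcal{F})}\,dt + \sqrt{n}\,\delta u \right\}.$$

For arbitrary constants $R_0 > 0$ and $M_0 > 0$, we denote $\mathcal{F}(R_0, M_0) = \{f \in \mathcal{F} : \|f\| \le R_0,\ I(f) \le M_0\}$ and $\mathcal{G}(R_0, M_0) = \{g \in \mathcal{G} : \|g\| \le R_0,\ J(g) \le M_0\}$. Define $f_\beta(x) = x^T\beta$ and $\mathcal{F}_\beta(R_0, M_0) = \{f_\beta : \|f_\beta\| \le R_0,\ \|\beta\|_1 \le M_0\}$.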

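Assumption 2.6. Let $0 < k < m < 1$. For $R_0 \le M_0$ and some constants $A_I \ge 1$ and $A_J \ge 1$, it holds that

$$J_\infty(z, \mathcal{F}(R_0, M_0)) \le A_I M_0^{k} z^{1-k}, \quad\text{and}\quad J_\infty(z, \mathcal{G}(R_0, M_0)) \le A_J M_0^{m} z^{1-m}.$$

In Assumption 2.6, if we take $I^2(f) = \int (f^{(\alpha)}(x))^2\,dx$, then $J_\infty(z, \mathcal{F}(1,1)) \le A_I z^{1 - \frac{1}{2\alpha}}$, i.e. $k = 1/(2\alpha)$, for some constant $A_I > 1$.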

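Assumption 2.7. For some constant $B \ge 1$, all $M_0 > 0$ and any $R_0 \le M_0/B$, it holds that

$$\sup_{f_\beta \in \mathcal{F}_\beta(R_0, M_0)} \|f_\beta\|_\infty \le M_0, \quad \sup_{f \in \mathcal{F}(R_0, M_0)} \|f\|_\infty \le M_0, \quad \sup_{g \in \mathcal{G}(R_0, M_0)} \|g\|_\infty \le M_0.$$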


Assumption 2.8 implies separate rates for $f$ and $g$ from that for $f + g$. This is due to the inequality $\|f + g\|^2 \ge (1 - \gamma)(\|f\| + \|g\|)^2$, as shown in Lemma 5.1 of [20], given $Ef_0(Z) = 0$. Here, $\gamma$ is related to the minimal angle between the two Hilbert spaces $\mathcal{F}$ and $\mathcal{G}$ (see A.4 of [31]), and is formally defined as follows:

$$\gamma^2 = \int (r - 1)^2\, p_Z\, p_U\, d\nu,$$

where $p = dP_{ZU}/d\nu$ is the density of $P_{ZU}$ w.r.t. $\nu = \nu_Z \times \nu_U$ with marginal densities $p_Z$ and $p_U$, and $r(z, u) = p(z, u)/(p_Z(z)p_U(u))$.

Assumption 2.8. It holds that $\gamma < 1$.

We assume the projection $f^P(U) = E(f(Z)|U)$ to be smooth.

Assumption 2.9. For some constant $\Gamma > 0$, it holds that, for any function $f \in \mathcal{F}$,

$$J(f^P) \le \Gamma\|f\|,$$

and for some constants $I_1, J_2$, it holds that

$$\max_{1 \le j \le p} |I(f_{X_j})| \le I_1, \quad \max_{1 \le j \le p} |J(g_{X_j})| \le J_2.$$

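Before presenting our main theorem, we need a set of oracle inequalities that hold in probability. Define the norms

$$\tau(\beta, f, g; R) = \frac{\lambda\|\beta\|_1}{\delta_0 R} + \|X^T\beta + f + g\| + \rho I(f) + \mu\left(\frac{R}{\mu}\right)^{\frac{2-q}{q}} J(g),$$

$$\tau_I(\beta, f; R_I) = \frac{\lambda\|\beta\|_1}{\delta_0 R_I} + \|\tilde X^T\beta\| + \|f_X^T\beta + f\| + \rho I(f),$$

for some constant $\delta_0 > 0$.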

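Lemma 2.1. Suppose Assumptions 2.4–2.9 hold. Also assume that for some $0 < \delta < 1$, $\max\{A_I^2, A_J^2\}/n \le n^{-\delta}$ and $A_J^2/n \le (A_I^2/n)^{\frac{1}{1+k}} \le (A_J^2/n)^{\frac{1}{1+m}} n^{-\delta}$. Let

$$\lambda \asymp \sqrt{\frac{\log p}{n}}, \quad \rho^2 \asymp A_I^{\frac{2}{1+k}} n^{-\frac{1}{1+k}} \quad\text{and}\quad \mu^2 \asymp A_J^{\frac{2}{1+m}} n^{-\frac{1}{1+m}}.$$

If there exist $R$ and $R_I$ satisfying $R^2 \le \lambda \le 1$, $R^2 \asymp \mu^2 + \lambda^2 s_0$ and $R_I^2 \asymp \rho^2 + \lambda^2 s_0$, then

$$P\bigl(\tau(\hat\beta - \beta_0, \hat f - f_0, \hat g - g_0; R) \le R\bigr) \ge 1 - c\exp(-n\mu^2/c),$$

$$P\bigl(\tau(\hat\beta - \beta_0, \hat f - f_0, \hat g - g_0; R) \le R,\ \tau_I(\hat\beta - \beta_0, \hat f - f_0; R_I) \le R_I\bigr) \ge 1 - C\exp(-n\rho^2/C)$$

for some constants $C, c > 0$.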

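A noteworthy case is that $I^2(f) = \int (f^{(\alpha)}(z))^2\,dz$, $J^2(g) = \int (g^{(\gamma)}(u))^2\,du$ with $\gamma < \alpha$ and $q = 2$. Set $\rho \asymp n^{-\alpha/(2\alpha+1)}$ and $\mu \asymp n^{-\gamma/(2\gamma+1)}$. Given that $s_0\log p/n = o(n^{-2\alpha/(2\alpha+1)})$, Lemma 2.1 implies that $\|\hat\beta - \beta_0\|^2 = O_P(n^{-\gamma/(2\gamma+1)})$, $\|\hat f - f_0\| = O_P(n^{-\alpha/(2\alpha+1)})$ and $\|\hat g - g_0\| = O_P(n^{-\gamma/(2\gamma+1)})$; otherwise, $\|\hat\beta - \beta_0\|^2 = O_P(s_0\log p/n)$, $\|\hat f - f_0\| = O_P(s_0\log p/n)$ and $\|\hat g - g_0\| = O_P(s_0\log p/n)$. The upper bounds exhibit an interesting two-regime dichotomy depending on the relation between $s_0\log p/n$ and $n^{-2\alpha/(2\alpha+1)}$.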

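Lemma 2.2. Assume the conditions of Lemma 2.1. Then there exist constants $C_0, c_0 > 0$ such that with probability at least $1 - 7/(2p) - C_0\exp(-c_0 n\mu^2)$,

$$\|\tilde X^T(\hat\beta - \beta_0)\|_n^2 + (\lambda/2)\|\hat\beta - \beta_0\|_1 \le \frac{4 s_0\lambda^2}{\Lambda^2_{\tilde X,\min}}.$$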

Lemma2.2has two important implications: (i) prediction error: kXeT(βb−β0)k2n

4s0λ2/Λ2 e

X,min; (ii) `1 error: kβb−β0k1 ≤ 8s0λ/Λ 2

e

X,min. We note that these two rates are the same order as those standard lasso rates (as iff0 and g0 were known); see [1]. However, the probability that these rates hold is relatively smaller as reflected by an additional term exp(−c0nµ2). This is the price to pay for estimating two unknown nonparametric functions in the model.
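For completeness, the $\ell_1$ bound in (ii) follows from the inequality of Lemma 2.2 by dropping the nonnegative prediction term and dividing by $\lambda/2$. The short derivation below is a reading aid using the notation of the lemma.

```latex
\begin{align*}
\frac{\lambda}{2}\|\hat\beta-\beta_0\|_1
  &\le \|\tilde X^T(\hat\beta-\beta_0)\|_n^2+\frac{\lambda}{2}\|\hat\beta-\beta_0\|_1
   \le \frac{4s_0\lambda^2}{\Lambda^2_{\tilde X,\min}},\\
\|\hat\beta-\beta_0\|_1
  &\le \frac{2}{\lambda}\cdot\frac{4s_0\lambda^2}{\Lambda^2_{\tilde X,\min}}
   = \frac{8s_0\lambda}{\Lambda^2_{\tilde X,\min}}.
\end{align*}
```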

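We are now ready to prove that $(\hat\beta,\hat f,\hat g)$ achieve the minimax lower bounds established in (2.7) and (2.8): $q=2$, $\mathcal{F}=W^{\alpha,2}(L_1)$, $\mathcal{G}=W^{\gamma,2}(L_2)$ and $I^2(f)=\int_0^1(f^{(\alpha)}(z))^2dz$, $J^2(g)=\int_0^1(g^{(\gamma)}(z))^2dz$.

Theorem 2.2. Suppose Assumptions 2.5, 2.7, 2.8 and 2.9 hold. Set $\lambda\asymp\sqrt{\log p/n}$, $\rho^2\asymp n^{-2\alpha/(2\alpha+1)}$ and $\mu^2\asymp n^{-2\gamma/(2\gamma+1)}$. Then

\[
E\|\hat\beta-\beta_0\|^2\lesssim \frac{s_0\log p}{n},
\]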

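\[
E\int_0^1|\hat f(z)-f_0(z)|^2dz\lesssim\max\Big\{n^{-2\alpha/(2\alpha+1)},\frac{s_0\log p}{n}\Big\},
\]

and

\[
E\int_0^1|\hat g(u)-g_0(u)|^2du\lesssim\max\Big\{n^{-2\gamma/(2\gamma+1)},\frac{s_0\log p}{n}\Big\}.
\]

Table 2.2. Estimation Interference Results for Model (2.6). We set $s_0=n^b$, $p=\exp(n^a)$.

High-dimensional $\beta_0$:

| Sparsity parameters $s_0, p, n$ | Lower bound $R_{\beta_0}(s_0,\alpha)$ | Penalized estimator $E\|\hat\beta-\beta_0\|^2$ |
|---|---|---|
| $\frac{1}{2\eta+1}<a+b<1$ | $s_0\log(p/s_0)/n$ | $s_0\log p/n$ |
| $a+b<\frac{1}{2\eta+1}$ | $s_0\log p/n$ | $s_0\log p/n$ |

Smooth $f_0$:

| Sparsity parameters $s_0, p, n$ | Lower bound $R_{f_0}(s_0,\alpha)$ | Penalized estimator $E\int_0^1|\hat f(z)-f_0(z)|^2dz$ |
|---|---|---|
| $\frac{1}{2\alpha+1}<a+b<1$ | $s_0\log(p/s_0)/n$ | $s_0\log p/n$ |
| $a+b<\frac{1}{2\alpha+1}$ | $n^{-2\alpha/(2\alpha+1)}$ | $n^{-2\alpha/(2\alpha+1)}$ |

2.3 Appendix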

2.3.1 Proof for Section 2.1

Proof of Theorem 2.1

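Proof. Let $\rho^2=n^{-2\alpha/(2\alpha+1)}$ and $\lambda=\sqrt{\log p/n}$ in (2.5). By definition, we have

\[
\|Y-X^T\hat\beta-\hat f\|_n^2+\lambda\|\hat\beta\|_1+\rho^2I^2(\hat f)\le\|Y-X^T\beta_0-f_0\|_n^2+\lambda\|\beta_0\|_1+\rho^2I^2(f_0). \tag{2.11}
\]

Then, by triangle inequality, it holds that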

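which further implies for any $k\ge1$,

\[
E\|\hat\beta-\beta_0\|^k\le E\|\hat\beta-\beta_0\|_1^k\le E\big(\|\varepsilon\|_n^2/\lambda+\|\beta_0\|_1+\rho^2I^2(f_0)/\lambda\big)^k.
\]

Note that $n\|\varepsilon\|_n^2$ follows a chi-squared distribution with $n$ degrees of freedom. Thus we have $E\|\varepsilon\|_n^k=O(1)$. Also we have that $\|\beta_0\|_1=O(\sqrt{s_0})$. Therefore, it follows that

\[
E\|\hat\beta-\beta_0\|^k=O(1/\lambda+\sqrt{s_0})^k.
\]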
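The claim that the moments of $\|\varepsilon\|_n$ stay bounded in $n$ can be checked by simulation. The sketch below assumes i.i.d. standard normal errors (so that $n\|\varepsilon\|_n^2$ is chi-squared with $n$ degrees of freedom); the function name `mean_noise_moment`, the seed and the sample sizes are illustrative choices.

```python
import random


def mean_noise_moment(n, k, reps=300, seed=0):
    """Monte Carlo estimate of E ||eps||_n^k, where ||eps||_n^2 = (1/n) * sum(eps_i^2).

    Assumes eps_i are i.i.d. N(0, 1); this is an assumption of the illustration.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        sq_norm = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(n)) / n
        total += sq_norm ** (k / 2.0)
    return total / reps


for n in (50, 200, 800):
    # Second and fourth moments of ||eps||_n for increasing n.
    print(n, round(mean_noise_moment(n, k=2), 3), round(mean_noise_moment(n, k=4), 3))
```

Under the normality assumption $E\|\varepsilon\|_n^2=1$ and $E\|\varepsilon\|_n^4=1+2/n$, so the printed estimates should stay close to 1 as $n$ grows.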

Define the event $\mathcal{T}=\{\|X(\hat\beta-\beta_0)\|^2\le\lambda^2s_0\}$. It is known from the proof of Theorem 3.1 of [18] that

\[
P(\mathcal{T})\ge1-2/p-3\exp(-nc\rho^2).
\]

Note that if we choose $t=2\log(2p)/n$ in the proof of Theorem 3.1 of [18], then the above probability inequality holds in the form

\[
P(\mathcal{T})\ge1-2/p^4-3\exp(-nc\rho^2)
\]

for some constant $c>0$. Now we have

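\begin{align}
E\|\hat\beta-\beta_0\|^2&=E\|\hat\beta-\beta_0\|^2 1_{\mathcal{T}}+E\|\hat\beta-\beta_0\|^2 1_{\mathcal{T}^c}\nonumber\\
&\le O(\lambda^2s_0)+\sqrt{E\|\hat\beta-\beta_0\|^4}\sqrt{P(\mathcal{T}^c)}\nonumber\\
&\le O(\lambda^2s_0)+O(1/\lambda^2+s_0)\sqrt{\exp(-nc\rho^2)+2/p^4}\nonumber\\
&\le O(\lambda^2s_0). \tag{2.12}
\end{align}

The last inequality holds due to the following arguments. Substituting $\rho^2=n^{-2\alpha/(2\alpha+1)}$ and $\lambda=\sqrt{\log p/n}$ into the last inequality, we have

\begin{align*}
(1/\lambda^2)\sqrt{\exp(-n^{1/(2\alpha+1)})}&=O(\lambda^2s_0),\quad\text{since } n^2\exp(-n^{1/(2\alpha+1)})=O(s_0\log^2p);\\
(1/\lambda^2)(1/p^2)&=O(\lambda^2s_0),\quad\text{since } 1=O\Big(\frac{p^2s_0\log^2p}{n^2}\Big);\\
s_0\sqrt{\exp(-n^{1/(2\alpha+1)})}&=O(\lambda^2s_0),\quad\text{since } n\exp(-n^{1/(2\alpha+1)})=O(\log p);\\
s_0(1/p^2)&=O(\lambda^2s_0),\quad\text{since } 1=O\Big(\frac{p^2\log p}{n}\Big).
\end{align*}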

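Now we prove the second part of the theorem. Note that $\hat f\in W^{2,\alpha}(L)$; together with (2.11), this implies that, for some constant $C>0$,

\[
\sup_{z\in[0,1]}|\hat f(z)-f_0(z)|^2\le CI^2(\hat f-f_0)=C\int_0^1\big(\hat f^{(\alpha)}(z)-f_0^{(\alpha)}(z)\big)^2dz\le C\big(\|\varepsilon\|_n^2/\rho^2+2\lambda\|\beta_0\|_1/\rho^2+I^2(f_0)\big).
\]

Therefore we have

\[
E\|\hat f-f_0\|_n^{2k}\le\Big(E\int_0^1\big(\hat f^{(\alpha)}(z)-f_0^{(\alpha)}(z)\big)^2dz\Big)^k\le O(1/\rho^2+\lambda s_0/\rho^2)^k=O(1/\rho^{2k})
\]

for any $k\ge1$. Hence

\begin{align}
E\|\hat f-f_0\|_n^2&=E\|\hat f-f_0\|_n^2 1_{\mathcal{T}}+E\|\hat f-f_0\|_n^2 1_{\mathcal{T}^c} \tag{2.13}\\
&\le O(\rho^2+\lambda^2s_0)+O(1/\rho^2)\sqrt{\exp(-nc\rho^2)}\nonumber\\
&\le O(\rho^2+\lambda^2s_0).\nonumber
\end{align}

Finally, it follows from Lemma 4.1 of [33] that $E\int_0^1|\hat f(z)-f_0(z)|^2dz=O(\rho^2+\lambda^2s_0)$.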

2.3.2 Proof for Section 2.2

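In this section, we first define a set $\mathcal{T}(R)$ and show in Lemma 2.3 that $\tau(\hat\beta-\beta_0,\hat f-f_0,\hat g-g_0;R)\le R$ on $\mathcal{T}(R)$. The probability of $\mathcal{T}(R)$ is approximated in Lemma 2.4. We next show $\tau_I(\hat\beta-\beta_0,\hat f-f_0;R_I)\le R_I$ on the set $\mathcal{T}(R)\cap\mathcal{T}_I(R_I)$, whereas the probability of $\mathcal{T}_I(R_I)$ is approximated in Lemma 2.6. Lemma 2.1 is then proved following Lemmas 2.3-2.6.

For some $\delta_0>0$ small enough, define

\begin{align*}
\mathcal{M}(R)&=\{(\beta,f,g):\tau(\beta,f,g;R)\le R\},\\
\mathcal{T}_1(R)&=\Big\{\sup_{\mathcal{M}(R)}\big|\|X^T\beta+f+g\|_n^2-\|X^T\beta+f+g\|^2\big|\le\delta_0^2R^2\Big\},\\
\mathcal{T}_2(R)&=\Big\{\sup_{\mathcal{M}(R)}\big|P_n\,\varepsilon(X^T\beta+f+g)\big|\le\delta_0^2R^2\Big\},\\
\mathcal{T}(R)&=\mathcal{T}_1(R)\cap\mathcal{T}_2(R).
\end{align*}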

Lemma 2.3. Under the conditions of Lemma 2.1, we have, on $\mathcal{T}(R)$,

\[
\tau(\hat\beta-\beta_0,\hat f-f_0,\hat g-g_0;R)\le R.
\]

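Proof. Take $\delta_0\le1/30$. Under the conditions of Lemma 2.1, we can find $\rho$ and $\mu$ such that

\[
\rho^2I^2(f_0)+\mu^2J^q(g_0)\le\delta_0^2R^2, \tag{2.14}
\]

and $4\lambda^2s_0/\Lambda^2_{\min}\le R_I^2\le R^2$. Define

\[
t=\frac{R}{R+\tau(\hat\beta-\beta_0,\hat f-f_0,\hat g-g_0;R)}.
\]

Let $\tilde\beta=t\hat\beta+(1-t)\beta_0$, $\tilde f=t\hat f+(1-t)f_0$, $\tilde g=t\hat g+(1-t)g_0$. Notice that $\tau(\tilde\beta-\beta_0,\tilde f-f_0,\tilde g-g_0;R)=t\,\tau(\hat\beta-\beta_0,\hat f-f_0,\hat g-g_0;R)\le R$, which implies $(\tilde\beta-\beta_0,\tilde f-f_0,\tilde g-g_0)\in\mathcal{M}(R)$. In order to show $\tau(\hat\beta-\beta_0,\hat f-f_0,\hat g-g_0;R)\le R$, it suffices to prove $\tau(\tilde\beta-\beta_0,\tilde f-f_0,\tilde g-g_0;R)\le R/2$. By convexity, we have

\begin{align*}
&\|Y-X^T\tilde\beta-\tilde f-\tilde g\|_n^2+\lambda\|\tilde\beta\|_1+\rho^2I^2(\tilde f)+\mu^2J^q(\tilde g)\\
&\quad\le\|Y-X^T\beta_0-f_0-g_0\|_n^2+\lambda\|\beta_0\|_1+\rho^2I^2(f_0)+\mu^2J^q(g_0)\\
&\quad\le\|Y-X^T\beta_0-f_0-g_0\|_n^2+\lambda\|\beta_0\|_1+\delta_0^2R^2,
\end{align*}

where the last inequality follows from equation (2.14). This implies

\begin{align}
&\|X^T(\tilde\beta-\beta_0)+(\tilde f-f_0)+(\tilde g-g_0)\|_n^2+\lambda\|\tilde\beta\|_1+\rho^2I^2(\tilde f)+\mu^2J^q(\tilde g)\nonumber\\
&\quad\le2P_n\,\varepsilon\big(X^T(\tilde\beta-\beta_0)+(\tilde f-f_0)+(\tilde g-g_0)\big)+\lambda\|\beta_0\|_1+\delta_0^2R^2. \tag{2.15}
\end{align}

Therefore, by the definition of $\mathcal{T}_1(R)$ and $\mathcal{T}_2(R)$,

\begin{align}
&\|X^T(\tilde\beta-\beta_0)+(\tilde f-f_0)+(\tilde g-g_0)\|^2+\lambda\|\tilde\beta_{S_0^c}\|_1+\rho^2I^2(\tilde f)+\mu^2J^q(\tilde g)\nonumber\\
&\quad\le\delta_0^2R^2+\delta_0^2R^2+2\delta_0^2R^2+\lambda\|\beta_0\|_1-\lambda\|\tilde\beta_{S_0}\|_1\nonumber\\
&\quad\le4\delta_0^2R^2+\lambda\|\beta_{0S_0}-\tilde\beta_{S_0}\|_1. \tag{2.16}
\end{align}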

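Note that

\begin{align*}
\lambda\|\beta_{0S_0}-\tilde\beta_{S_0}\|_1&\le\lambda\sqrt{s_0}\|\beta_{0S_0}-\tilde\beta_{S_0}\|\le\lambda\sqrt{s_0}\|\tilde\beta-\beta_0\|\le\lambda\sqrt{s_0}\|\tilde X^T(\tilde\beta-\beta_0)\|/\Lambda_{\tilde X,\min}\\
&\le\lambda^2s_0/\Lambda^2_{\tilde X,\min}+\|\tilde X^T(\tilde\beta-\beta_0)\|^2/4\le\delta_0^2R^2/4+\|\tilde X^T(\tilde\beta-\beta_0)\|^2/4,
\end{align*}

where the third and last inequalities hold by assumption and the fourth inequality follows from $uv\le u^2+v^2/4$. Thus, substituting it into (2.16), we obtain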

(a) (3/4)kXT(βe−β0) + (fe−f0) + (eg−g0)k2 ≤(17/4)δ02R2; (b) ρ2I2( e f)≤(17/4)δ2 0R2; (c) µ2Jq(eg)≤(17/4)δ02R2. Now it follows from (a) that kXT(

e

β−β0)k ≤(17/3)δ

0R. In addition, (b) (c) and (2.14) implies

ρI(fe−f0)≤ρI(fe) +ρI(f0)≤

√ 17 2 δ0R+ 2δ0R≤ √ 17δ0R and µ R 2−qq Jq(eg−g0)≤ √ 17 2 δ0R+ 2δ0R ≤ √ 17δ0R. Adding λkβ0S0 −βeS0k1 on both sides of (2.16), we get

kXT(βe−β0) + (fe−f0) + (eg−g0)k 2 +λkβe−β0k1+ρ2I2(fe) +µ2Jq(eg) ≤4δ02R2+ 2λkβ0S0 −βeS0k1 ≤4δ02R2+kXeT(βb−β0)k2+ 1 4δ 2 0R 2, which implies λkβe−β0k1 ≤ 17 4 δ 2 0R2.

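Invoking the definition of $\tau(\hat\beta-\beta_0,\hat f-f_0,\hat g-g_0;R)$, we finally get

\[
\tau(\hat\beta-\beta_0,\hat f-f_0,\hat g-g_0;R)\le\big(\sqrt{17}/\sqrt{3}+2\sqrt{17}+17/4\big)\delta_0R\le15\delta_0R\le\frac{1}{2}R
\]

by letting $\delta_0=1/30$.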
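Two arithmetic facts used in the proof of Lemma 2.3 can be verified directly: the rescaling $t=R/(R+\hat\tau)$ always maps into $\mathcal{M}(R)$, with $t\hat\tau\le R/2$ exactly when $\hat\tau\le R$, and the numerical constant in the last display is below 15. The sketch assumes that $\tau$ is positively homogeneous (as used in the proof) and that the constant groups as $\sqrt{17}/\sqrt{3}+2\sqrt{17}+17/4$; the helper name `rescaled_tau` is hypothetical.

```python
import math


def rescaled_tau(tau_hat, R):
    """Value of tau at the interpolated point: t * tau_hat with t = R / (R + tau_hat)."""
    t = R / (R + tau_hat)
    return t * tau_hat


R = 1.0
for tau_hat in (0.1, 1.0, 10.0, 1e6):
    tau_tilde = rescaled_tau(tau_hat, R)
    # The interpolated point always stays inside M(R) ...
    assert tau_tilde <= R
    # ... and tau_tilde <= R/2 holds exactly when tau_hat <= R.
    assert (tau_tilde <= R / 2) == (tau_hat <= R)

# Constant from the final display (grouping assumed as stated in the lead-in).
constant = math.sqrt(17) / math.sqrt(3) + 2 * math.sqrt(17) + 17 / 4
delta0 = 1 / 30
print(constant, 15 * delta0)
```

The first printed value is slightly below 15, and $15\delta_0$ equals $1/2$ (up to floating-point rounding) for $\delta_0=1/30$, which is what the last two inequalities require.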

Lemma 2.4. Under the conditions of Lemma 2.1, we have for some constants $\tilde C>0$, $\tilde c>0$,

\[
P(\mathcal{T}(R))\ge1-\tilde C\exp(-\tilde cn\rho^2).
\]

Proof. Under the conditions of Lemma 2.1, we can find ρ2 (1γ)/B2 and µ2

R2−q(1−γ)q/2/Bq. Further, we find a constant L >0 such that the followings hold:

√ nρ1+k ≥LAI, √ nµ1+m ≥LAJ, (2.17) R ≥LLJAJ/ √ n, R≥Kερ, R ≥LJρ, R ≥K q q−(2−q)m ε µ (2.18) and ρk≤1/L, (2.19) where LJ = (R/µ) 2/q

. By similar arguments of [20], such L exists. Note that

τ(β, f, g;R) ≤ R implies that kXTβ +f +gk2 R2 and kβk

1 ≤ δ0R2/λ, I(f) ≤

R/ρ, J(g) ≤ LJ. By orthogonal decomposition (2.10), we have kXeTβk ≤ R and

kfT

Xβ+f +gXTβ+gk ≤R. Then Assumption 2.5 implies

kXTβk ≤ kXeTβk+kΠ(X|H)Tβk ≤R+ (Λmaxmin)kXβe k ≤(1 + Λmaxmin)R.

Similar arguments and assumption 2.8 implies that

kfk ≤(1 + Λmax/Λmin)R/(1−γ),kgk ≤(1 + Λmax/Λmin)R/(1−γ).

For simplicity, we take R1 = R2 = R3 = (1 + Λmax/Λmin)R/(1− γ) , R2/(1−

γ) and M1 = δ0R2/λ, M2 = R2/ρ2, M3 = LJ. In addition, Assumption 2.7 and

ρ2 (1 γ)/B2, µ2 R2−q(1 γ)q/2/Bq yield that sup

fβ∈Fβ(R1,M1)kfβk ≤ M1,

supf∈F(R2,M2)kfk∞ ≤M2 and supg∈F(R3,M3)kgk∞ ≤M3. Thus, we takeKl =Ml,1≤ l ≤3. Let t=nρ2/L2. Further, we choose δ

1, δ10 small enough such that

R q

log3(2n)≤δ1,

p


Without loss of generality, we assume C1 = 1 in Theorem 2.4. Otherwise, we can replace in L=LC1 in the proof. With (2.20) and the fact thatR2 ≤λ, it holds that

M1 r logp n q log3n ≤δ0δ1δ10R ≤ R L, M1 ρ L = δ0R2 λ ρ L ≤ δ0ρ L ≤ R L. (2.21)

Moreover, by Theorem2.3, (2.17) and (2.17), we obtain

J∞(K2,Fβ(R1, M1)) √ n ≤M1 r log(2p) n q log3(2n) + 2√K2 n ≤ R L + 2R ρ√n ≤ 3R L , (2.22) J∞(K3,Fβ(R1, M1)) √ n ≤M1 r log(2p) n q log3(2n) + 2√K3 n ≤ R L + 2LJ √ n ≤ 3R L . (2.23)

Further, Assumption 2.6 and equations (2.17), (2.18) and (2.19) show

J∞(K2,F(R2, M2)) √ n ≤ AI(R/ρ)k(R/ρ)1−k √ n ≤ AIR √ nρ ≤ R L, (2.24) J∞(K3,G(R3, M3)) √ n ≤ AJLkJL (1−k) J √ n ≤ AJLJ √ n ≤ R L, K3 ρ L ≤ ρLJ L ≤ R L, (2.25) J∞(K3,F(R2, M2)) √ n ≤ AI(R/ρ)kL1J−k n ≤ LAI √ nρ1+k ρ1−kL1−k J R1−k ρ kR L ≤ R L2. (2.26) Now for any (β, f, g), it holds

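\begin{align*}
\big|\|X^T\beta+f+g\|_n^2-\|X^T\beta+f+g\|^2\big|&\le\big|\|X^T\beta\|_n^2-\|X^T\beta\|^2\big|+\big|\|f\|_n^2-\|f\|^2\big|+\big|\|g\|_n^2-\|g\|^2\big|\\
&\quad+2\big|(P_n-P)X^T\beta f\big|+2\big|(P_n-P)X^T\beta g\big|+2\big|(P_n-P)fg\big|\\
&=:A+B+C+D+E+F.
\end{align*}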
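The six-term decomposition is an expansion of a square followed by the triangle inequality. The sketch below writes $h=X^T\beta+f+g$ and assumes the usual conventions $\|h\|_n^2=P_nh^2$ and $\|h\|^2=Ph^2$.

```latex
\begin{align*}
\|h\|_n^2-\|h\|^2 &= (P_n-P)h^2, \qquad h = X^T\beta+f+g,\\
h^2 &= (X^T\beta)^2+f^2+g^2+2X^T\beta f+2X^T\beta g+2fg,\\
\bigl|(P_n-P)h^2\bigr| &\le \bigl|(P_n-P)(X^T\beta)^2\bigr|+\bigl|(P_n-P)f^2\bigr|+\bigl|(P_n-P)g^2\bigr|\\
&\quad+2\bigl|(P_n-P)X^T\beta f\bigr|+2\bigl|(P_n-P)X^T\beta g\bigr|+2\bigl|(P_n-P)fg\bigr|.
\end{align*}
```

The three squared terms correspond to A, B, C and the three cross terms to D, E, F.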

We bound each of the terms as follows.

A. Replace R∗ and M∗ by R1 and M1 in Theorem 2.3. Then we get from equa-tion (2.21) that A≤R1M1 r logp n q log3n+ ρ L ! +M12 logp n log 3n+ ρ2 L2 (2.27) ≤ 2R 2 Lp(1−γ2) + R 2 L2 ≤ 4R 2 Lp(1−γ2) .


B. Replace R∗ and K∗ by R2 and K2 in Theorem 2.4. Then we have from equa-tion (2.24) B ≤2R2J∞(K√2,F(R2, M2)) n +R2K2 ρ L + 4J2 ∞(K2,F(R2, M2)) n +K 2 2 ρ2 L2 (2.28) ≤ 2R 2 L√1−γ2 + R 2ρ ρL√1−γ2 +4R 2 L2 + R2ρ2 ρ2L ≤ 8 Lp(1−γ2) R2.

C. Replace R∗ and K∗ by R3 and K3 in Theorem 2.4 and apply (2.25). Then we have C≤2R3J∞(K√3,G(R3, M3)) n +R3K3 ρ L+ 4J2 (K3,F(R3, M3)) n +K 2 3 ρ2 L2 ≤ 2R 2 L√1−γ2 + R 2 L√1−γ2 +4R 2 L2 + R2 L2 ≤ 8 Lp(1−γ2) R2.

D. ReplaceR∗1, R2∗, K1∗, K2∗in Theorem2.5byR1, R2, K1, K2andM∗in Lemma2.11 byM1. Then with the application of (2.22), we get the following:

D≤R1J∞(K2√,F(R2, M2)) n + R2J∞(R1K2/R2,Fβ) √ n + R1K2ρ L + K1K2ρ2 L2 (2.29) ≤ R 2 Lp(1−γ2) + 3R 2 Lp(1−γ2) + R 2ρ ρLp(1−γ2) + δ0R 2 λ Rρ2 ρL2 ≤ 6 Lp(1−γ2) R2.

E. ReplaceR∗1, R2∗, K1∗, K2∗in Theorem2.5byR1, R3, K1, K3andM∗in Lemma2.11 byM1. Then with the application of (2.22), we get the following:

E ≤R1J∞(K3√,G(R3, M3)) n + R3J∞(R1K3/R3,Fβ) √ n + R1K3ρ L + K1K3ρ2 L2 (2.30) ≤ R 2 Lp(1−γ2) + 3R 2 Lp(1−γ2) + RLJρ Lp(1−γ2) + δ0R 2 λ LJρ2 L2 ≤ 6 Lp(1−γ2) R2,


where ρLJ ≤R follows from equation (2.18).

F. Replace R∗1, R∗2, K1∗, K2∗ in Theorem 2.5 by R2, R3, K2, K3. Then with the ap-plication of (2.22), we get the following:

F ≤R2J∞(K3√,G(R3, M3)) n + R3J∞(R2K3/R3,F(R2, M2)) √ n + R2K3ρ L + K2K3ρ2 L2 (2.31) ≤ R 2 Lp(1−γ2) + R 2 L2p(1γ 2) + RLJρ L√1−γ2 + RLJρ 2 ρL2 ≤ 4R 2 L√1−γ2 .

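Combining A to F, we get

\[
\sup_{(\beta,f,g)\in\mathcal{M}(R)}\big|\|X^T\beta+f+g\|_n^2-\|X^T\beta+f+g\|^2\big|\le\frac{36R^2}{L\sqrt{1-\gamma_2}}
\]

with probability at least $1-6\exp(-n\rho^2/L)$.

We now turn to the set $\mathcal{T}_2(R)$. Note that

\[
|P_n\varepsilon(X^T\beta+f+g)|\le|P_n\varepsilon(X^T\beta)|+|P_n\varepsilon f|+|P_n\varepsilon g|.
\]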

Lemma 2.7 shows that

Pnε(XTβ) ≤δ0R2/10≤ R2 √ 1−γ2L , (2.32) where the last step follows by choosing δ0 ≤ 10. Then it follows from Theorem 5.2 of [20], Assumption 2.6 and equation (2.17) that

Pnεf ≤ KεJ∞(R2,F(R2, M2)) +KεR2 √ t √ n ≤ √ KεAIR n(1−γ2)(1−k)/2ρk + R 2 √ 1−γ2L ≤ R 2 L(1−γ2)(1−k)/2 + R 2 L√1−γ2 ≤ 2R 2 L√1−γ2 ,


and Pnεg ≤ KεJ∞(R3,G(R3, M3)) +KεR3 √ t √ n ≤ KεAJM m 3 R1−m √ n(1−γ2)(1−m)/2 + R 2 √ 1−γ2L ≤ R 2 L(1−γ2)(1−m)/2 + R 2 L√1−γ2 ≤ 2R 2 L√1−γ2 . Therefore, sup (β,f,g)∈M(R) |Pnε(XTβ+f+g)| ≤ 5R2 L√1−γ2 with probability at least 1−3 exp(−nρ2/L).

By letting $5/(L\sqrt{1-\gamma_2})\le\delta_0^2$, we have shown that for some constants $\tilde C>0$, $\tilde c>0$,

\[
P(\mathcal{T}(R))\ge1-\tilde C\exp(-\tilde cn\rho^2).
\]

Let

MI(RI) ={(β, f) :τI(β, f;RI)≤RI}. ForδI sufficiently small, we define

TI,1(RI) = ( sup (β,f)∈MI(RI) kXβe +fXAT β+fAkn2 − kXβe +fXTβ+fAk2 ≤δI2R2I ) , TI,2(RI) = ( sup (β,f)∈MI(RI) Pn ε(Xβe +fXAT β+fA) ≤δ 2 IR 2 I ) , TI,3(RI) = ( sup (β,f,g)∈F(R),(β,f)∈MI(RI) Pn(Xβe +f T XAβ+fA)(fXPT β+g T Xβ+fP +g) ≤δ 2 IR 2 I ) and let TI(RI) =TI,1(RI)∩ TI,2(RI)∩ TI,3(RI).

Lemma 2.5. Under the conditions of Lemma 2.1, it holds that on T(R)∩ T1(RI),


Proof. Under the conditions of Lemma 2.1, we can find some ρ and µ such that ρ2I2(f0) +µ2Jq(g0)≤δ20R 2, ρ2I2(f 0)≤δI2R 2 I, (2.33) 2µ2(Γ +J2δ0RI/λ)(2δ0R/µ) 2(q−1) q ≤δ IRI2, 2µ2(Γ +J2δ0RI/λ)q/R2 −q I ≤δ 2 I (2.34) for some δ0, δI >0, which will be taken small enough later.

kY −XTβb−fb− b gk2 n+λkβbk1+ρ2I2(fb) +Jq( b g) ≤kY −XTβ0 −f0−(bg+fXP(βb−β0) +gX(βb−β0) +fbP −f 0 P)k 2 n+λkβ0k1+ ρ2I2(f0) +µ2Jq(bg+fXP(βb−β0) +gX(βb−β0) +fbP −f 0 P), which implies kX(βb−β0) +fXAT (βb−β0) +fbA−fA0k2n+ρ2I2(fb) ≤ −2Pn (fXP +gX)T(βb−β0) +fbP −fP0 + b g−g0 (Xe +fXA)T(βb−β0) +fbA−fA0 + 2Pn εXe+fXAT (βb−β0) +fbA−fA0 +λkβ0k1+ρ2I(f0)−µ2Jq(bg) +µ2Jq(bg+fXP(βb−β0) +gX(βb−β0) +fbP −fP0). Let t = RI RI +τI(βb−β0,fb−f0;RI) .

Define βe = tβb+ (1−t)β0,fe= tfb+ (1−t)f0. Note that (β,e fe) ∈ T1(RI) Similarly as the proof of Lemma 2.3, it suffices to show that τI(βe−β0,fe−f0;RI)≤RI/2.By

convexity and the definition ofTI(RI), we have

kXe(βe−β0) +fXAT (βe−β0) +feA−fA0k2+λkβek1+ρ2I2(fe) ≤5R2I+λkβ0k1+ρ2I2(f0) +µ2Jq(bg+fXP(βb−β0) +gX(βb−β0) +fbP −f 0 P)−µ 2Jq( b g).

Using the fact that for a, b >0 and 1< q <2,


we obtain Jq(bg+fXP(βb−β0) +gX(βb−β0) +fbP −fP0)−Jq( b g) ≤2Jq−1(bg)J(fXPT (βe−β0) +gXT(βe−β0) +feP −fP0) + 2Jq(fXPT (βe−β0) +gXT(βe−β0) +feP −fP0) ≤2Jq−1(bg)[J(gXT(βe−β0)) +J(fXPT (βe−β0) +feP −fP0)] + 2[J(gTX(βe−β0)) +J(fXPT (βe−β0) +feP −fP0)]q ≤2Jq−1(bg)(kJ(gX)k∞kβe−β0k1+ ΓkfX(βe−β0) +fe−f0k) + 2(kJ(gX)k∞kβe−β0k1+ ΓkfX(βe−β0) +fe−f0k)q ≤2 2δ0R µ 2(qq−1) J2 δ0R2I λ + ΓRI + 2 J2 δ0R2I λ + ΓRI q ≤2R2I/µ2.

where the fourth inequality follows from J(bg) ≤(2δ0R/µ)2/q onT(R) and Assump-tion 2.8,2.9 and the factkfX(βe−β0) +fe−f0k ≤RI onTI(RI). The last step follows

from (2.34). Hence, we have

kXeT(βe−β0)k2+kfXAT (βe−β0) +feA−fA0k2+λkβek1+ρ2I2(fe)≤8δ2IR2I+λkβ0k1.

Subtracting λkβeS0k1 on both sides of above equation, we get

kXeT(βe−β0)k2+kfXAT (βe−β0) +feA−fA0k2+λkβeSc 0k1+ρ 2I2( e f) ≤8δI2R2I+λkβeS0 −β0S 0k1, (2.35) where λkβeS0 −β0S 0k1 ≤λ √ s0kβ0S0 −βeS0k ≤λ√s0kβe−β0k ≤λ √ s0kXeT(βe−β0)k/Λmin ≤λ2s0/4Λ2min+kXeT(βe−β0)k2 ≤δI2R2I+kXeT(βe−β0)k2. Therefore, kfXAT (βe−β0) +feA−fA0k2+λkβeSc 0k1+ρ 2I2( e f)≤9δ2IR2I.


Then it holds (a0) kfT

XA(βe−β0) +feA−fA0k ≤3δIRI which further implies kfXT(βe−β0) +fe−f0k ≤

3δIRI/

p

(1−γ2);

(b0) ρI(fe−f0)≤ρI(fe)+ρI(f0)≤(3+1)δIRI ≤4δIRI together with equation (2.33).

Note that by using λkβeS0 −β0S0k1 ≤ λ

2s

0/Λ2min/2 +kXT(βe−β0)k2/2, we can also obtain

(c0) kXeT(βe−β0)k ≤

18δIRI Now, adding λkβe0S

0 −β0S0k1 on both sides of (2.35), we get

kXT(βe−β0)k2+kfXAT (βe−β0) +feA−fA0k2 +λkβe−β0k1+ρ2I2(fe) ≤8δI2R2I+ 2λkβeS0 −β0S 0k1 ≤8δ 2 IR 2 I+λ 2s 0/Λ2min+kX T( e β−β0)k2, which implies that

(d0) λkβe−β0k1 ≤9δI2R2I.

Combine the results (a)−(c) and recall the form ofτI(βe−β0,fe−f0;RI). We get τI(βe−β0,fe−f0;RI)≤(( √ 18 + 16)/p1−γ2)δ IRI ≤ 1 2RI, given that δI ≤(( √

18 + 16)/2p1−γ2). This completes the proof of the lemma.

Lemma 2.6. Under the conditions of Lemma 2.1, there exist constants $C_I$ and $c_I$ such that

\[
P(\mathcal{T}_I(R_I))\ge1-C_I\exp(-c_In\rho^2).
\]

Proof. Note that on the set $\mathcal{T}_I(R_I)$, we have

kfk2 (1 + Λ

max/Λ0min)RI2, I(f)≤RI/ρ, and


where Λmin0 represent the smallest eigenvalue of E(fXfXT). Also we have kgk2 ≤ (1 + Λmax/Λmin)R2/(1−γ) andJ(g)≤LJ. Now, we take

R01 =R022 = (1 + Λmax/Λ0min)R 2 I ,R 2 I/(1−γ1), R032 = (1 + Λmax/Λmin)R2/(1−γ) = R2/(1−γ2), and M10 =δIR2I/λ, M 0 2 =RI/ρand M30 =LJ.

Assumption 2.7 and ρ2 (1γ)/B2, µ2 R2−q(1γ)q/2/Bq yield that sup f∈Fβ(R01,M 0 1) kfβk∞≤M10, sup f∈F(R0 2,M20) kfk∞≤M20 and sup g∈F(R0 3,M30) kgk∞ ≤M30.

Thus, we takeKl0 =Ml0,1≤l ≤3.LetLbe the constant as in the proof of Lemma2.4. We further restrict it such that

RI ≥LLJAJ/ √ n, RI ≥Kερ, RI ≥LJρ, R ≥K q q−(2−q)m ε µ. (2.36)

This can be achieved due to the assumptions that ρ2

. R2

I ≤ R2 and µ2 . R2. We take t=nρ2/L2 and look at T

I,1(RI) first. We have kXβe +fXAT β+fAkn2 − kXβe +fXAT β+fAk2 ≤kXβe +fXAT βk2n− kXβe +fXAT βk2 + kfAk2n− kfAk2 + (Pn−P)(Xβe +fXAT β)fA ,A0 +B0+C0.

We boundA0, B0, C0 as follows, respectively.

A0. Note that fXP(·) = E(fX(Z)|U = ·) ∈ G and fXA = fX − fXP. We have

fX, fXA, fP are bounded. Without loss of generality, we assume the upper bound as 1. Note that by Assumption 2.7, kXβe +fXAT βk∞ ≤ kXβe k∞+kfXAT βk∞ ≤


2M10. Replace R∗ and M∗ by R10 and 2M10 in Theorem 2.3. Then similarly as (2.27), we get A0 ≤2R01M10   s logplog3n n + ρ L  + 4M 02 1 logplog3n n + ρ2 L2 ≤ 4R 2 I √ 1−γ1L + 8R 2 I L2 ≤ 12R 2 I L√1−γ1 . B0. Note that sup f∈F(R02,M20) kfAk∞≤ sup f∈F(R02,M20) kfPk∞+ sup f∈F(R02,M20) kfk∞ ≤2K20 and sup f∈F(R0 2,M20) kfAk ≤R20.

ReplaceR∗ andK∗ byR20 and 2K20 in Theorem2.4. Similarly as (2.28), we then have B0 ≤2R 0 2J∞(2K 0 2,{fA:f ∈ F(R02, M 0 2)}) √ n + 2R 0 2K 0 2 ρ L (2.37) + 4J 2 ∞(2K20,{fA :f ∈ F(R02, M 0 2)}) n + 4K 02 2 ρ2 L2 ≤4R 0 2J∞(2K20,F(R02, M20)) n + 2R2Iρ ρL√1−γ1 + 16J 2 ∞(2K20,F(R02, M20)) n + 4RI2 L ≤ 8R 2 I Lp(1−γ1) + 8R 2 I Lp(1−γ1) + 64R 2 I L2 + 4RI2 L (2.38) ≤ 84 Lp(1−γ1) R2.


C0. Similarly as (2.29), it holds that C0 ≤R 0 1J∞(2K 0 2,{fA:f ∈ F(R02, M 0 2)}) √ n + R02J∞(2R01K20/R02,Fβ(R10,2M 0 1)) √ n +2R 0 1K 0 2ρ L + 4K10K20ρ2 L2 ≤2R 0 1J∞(2K20,F(R02, M20)) n + R02J∞(2K20,Fβ(R01,2M10)) n + 2R2I L√1−γ1 +4δ0R 2 I λ RI2 L2 ≤ 4R 2 I Lp(1−γ1) + 6R 2 I L√1−γ1 + 2R 2 I L√1−γ1 + 4R 2 I L2 ≤ 16R 2 I L√1−γ1 . Combining A0 to C0, we get sup (β,f)∈F(RI) kXˇTβ+fAk2n− kXˇ T β+fAk2 ≤ 112R2 I Lp(1−γ2) with probability at least 1−2 exp(−nρ2/L).

Next, look at TI,2(RI). We have

Pn(ε(Xβe +fXAT β+fA)) ≤ Pnε(Xβe +fXAT β) + PnεfA ,where Pnε( ˇXTβ) ≤2RI2/ √

1−γ1Lfollows from Lemma 2.7and similar ar-guments as (2.32). Further, Theorem 5.2 of [20], Assumption2.6and equation (2.36) shows PnεfA ≤ KεJ∞(R02,{fA:f ∈ F(R02, M 0 2)}) +KεR02 √ t √ n ≤2KεJ∞(R 0 2,F(R 0 2, M 0 2)) +KεR20 √ t √ n ≤√ 2KεAIRI n(1−γ1)(1−k)/2ρk + R 2 I √ 1−γ1L ≤ 2R 2 I L(1−γ1)(1−k)/2 + R 2 I √ 1−γ1L ≤ 3R 2 I √ 1−γ1L . Thus, we have sup (β,f)∈F(RI) Pnε(Xβe +fXAT β+fA) ≤ 5R2I L√1−γ1


with probability at least 1−2C0exp(−nρ2/L).

Finally, we consider TI,3(RI). Notice thatP(Xβe +fXAT β+fA)(fXPT β+gXTβ+fP+

g) = 0. Then we get Pn(Xβe +fXAT β+fA)(fXPT β+gXTβ+fP +g) ≤(Pn−P) (Xβe +fXAT β)(fXPT β+gTXβ) + (Pn−P)(Xβe +fXAT β)(fP +g) (Pn−P) (fXPT β+g T Xβ)fA + (Pn−P) (fA(fP +g)) ,A00+B00+C00+D00. It is noted that kXβe +fXAT βk ≤R01,kXβe +fXAT βk∞ ≤2M10, kfXPT β+gXTβk ≤R10,kfP +gk ≤ kfPk+kgk ≤R02+R 0 3 ≤2R 0 3, J(fP +g)≤J(fP) +J(g)≤Γkfk+LJ ≤ΓRI+LJ ≤4K30. Then we apply Theorem 2.5 for A00, B00, C00, D00, respectively.

A00. Similar to the proof of equation (2.21) and (2.22),

A00≤ R 0 1J∞(K 0 1,Fβ(R01, M 0 1)) √ n + R01J∞(K10,Fβ(R01,2M 0 1)) √ n + R01K10ρ L + 2K102ρ2 L2 ≤ √ RI 1−γ1 RI L + 2K10 √ n + √RI 1−γ1 2RI L + 2K10 √ n + RIρ L√1−γ1 +2ρ 2 L2 ≤ 10R 2 I L√1−γ1 .

B00. Similar to the proof of (2.25), (2.23) and (2.30),

B00 ≤ R 0 1J∞(4K30,G(2R03,4K30)) n + 2R03J∞(R10(4K30)/2R03,Fβ(R10,2M10)) n + R 0 1(4K30)ρ L + 2K10(4K30)ρ2 L2 ≤ √ RI 1−γ1 4RI L + RI √ 1−γ1 2RI L + 2RI √ 1−γ1 4RI L + RI √ 1−γ1 4RI L + 2δ0RI2 λ 4LJρ2 L2 ≤ 26R 2 I L√1−γ1 .


C00. Similar to the proof of (2.22), (2.37), and (2.29), C00≤ R 0 1J∞(2K 0 2,{fA :f ∈ F(R02, M 0 2)}) √ n + R02J∞(R01(2K20)/R02,Fβ(R01,2M 0 1)) √ n +R 0 1(2K 0 2)ρ L + K10(2K20)ρ2 L2 ≤ √ RI 1−γ1 2RI L + RI √ 1−γ1 6RI L + RI √ 1−γ1 2RIρ ρL + δ0R2I λ 2RIρ2 ρL2 ≤ 12R2I L√1−γ1 .

D00. Similar to the proof of (2.26) and (2.31),

D00 ≤ R 0 2J∞(4K30,G(2R03,4K30)) n + 2R30J∞(R20(4K30)/2R30,{fA :f ∈ F(R20, M20)}) n +R 0 2(4K30)ρ L + (2K20)(4K30)ρ2 L2 ≤ √ RI 1−γ1 4RI L + √ 1−γ2 √ 1−γ1 1−k 8Rk √ 1−γ2 AIL1J−kRIρ √ nρ1+k + RI √ 1−γ1 4LJρ L + 2RI ρ 4LJρ2 L2 ≤ 18R 2 I √ 1−γ1( √ 1−γ2)kL .

Therefore, we have with probability at least 1−4 exp(−nρ2/L), sup (β,f,g)∈M(R),(β,f)∈F(RI) Pn( ˇXTβ+fA)(h2β+fP +g) ≤ 66RI2 √ 1−γ1( √ 1−γ2)aL . By letting 65/(√1−γ1( √

1−γ2)aL)≤δI2, we conclude that there exists constantCI and cI, such that

P(TI(RI))≥1−CIexp(−cInρ2).

Proof of Lemma 2.1

Proof. Take C = max(C,e ec) and C = max(C, Ce I,ec, cI). Then the theorem follows as

below:

and

\[
P(\mathcal{T}(R)\cap\mathcal{T}_I(R_I))\ge1-P(\mathcal{T}(R)^c)-P(\mathcal{T}_I(R_I)^c)\ge1-\tilde C\exp(-\tilde cn\rho^2)-C_I\exp(-c_In\rho^2)\ge1-C\exp(-Cn\rho^2).
\]

Proof of Corollary 2.2

In this section, we prove Corollary 2.2. We start from the following preliminary lemmas.

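Lemma 2.7. With probability at least $1-1/p$,

\[
2\big|P_n\varepsilon(\tilde X^T(\hat\beta-\beta_0))\big|\le2K_X\sqrt{6}K_\varepsilon\sqrt{\frac{2\log p}{n}}\,\|\hat\beta-\beta_0\|_1\le\frac{\lambda}{10}\|\hat\beta-\beta_0\|_1.
\]

Proof. First we have

\[
\big|P_n\varepsilon(\tilde X^T(\hat\beta-\beta_0))\big|\le\|P_n\varepsilon\tilde X^T\|_\infty\|\hat\beta-\beta_0\|_1.
\]

The condition $E\exp(\varepsilon_i^2/K_\varepsilon^2)\le2$ in Assumption 2.4 implies $E\exp(t\varepsilon_i)\le\exp(3K_\varepsilon^2t^2/2)$; see [34]. Then we get

\[
E\exp\Big(t\Big(\frac{1}{n}\sum_{i=1}^n\varepsilon_i\tilde X_{ij}\Big)\Big)=\prod_{i=1}^nE\exp\Big(\frac{t}{n}\tilde X_{ij}\varepsilon_i\Big)\le\prod_{i=1}^n\exp\Big(\frac{3}{2}K_\varepsilon^2\frac{t^2}{n^2}\tilde X_{ij}^2\Big)=\exp\Big(\frac{3}{2}K_\varepsilon^2\frac{t^2}{n}\|\tilde X_j\|_n^2\Big),
\]

which implies, given $\tilde X$ fixed, for $t>0$ and all $j$,

\[
P\Big(\Big|\frac{1}{n}\sum_{i=1}^n\varepsilon_i\tilde X_{ij}\Big|>\sqrt{\frac{t}{n}}\,2\|\tilde X_j\|_n\sqrt{\frac{3}{2}}K_\varepsilon\Big)\le\exp(-t);
\]

see [34]. Hence

\[
P\Big(\max_{1\le j\le p}\Big|\frac{1}{n}\sum_{i=1}^n\varepsilon_i\tilde X_{ij}\Big|>\sqrt{\frac{t+\log p}{n}}\,2\|\tilde X_j\|_n\sqrt{\frac{3}{2}}K_\varepsilon\Big)\le\exp(-t).
\]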
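The maximal inequality above can be illustrated by simulation. The sketch below assumes standard normal errors, for which $E\exp(\varepsilon^2/K_\varepsilon^2)=2$ at $K_\varepsilon^2=8/3$, a fixed design drawn once from a uniform distribution, and a single threshold built from $\max_j\|\tilde X_j\|_n$ (a conservative simplification of the display). The function name `exceed_frequency` and all numerical settings are hypothetical.

```python
import math
import random


def exceed_frequency(n=50, p=20, t=1.0, reps=300, seed=1):
    """Empirical frequency with which max_j |(1/n) sum_i eps_i x_ij| exceeds the bound.

    Threshold: sqrt((t + log p)/n) * 2 * max_j ||x_j||_n * sqrt(3/2) * K_eps.
    """
    rng = random.Random(seed)
    k_eps = math.sqrt(8.0 / 3.0)  # E exp(eps^2 / K^2) = 2 for eps ~ N(0, 1)
    design = [[rng.uniform(-1.0, 1.0) for _ in range(p)] for _ in range(n)]
    col_norms = [
        math.sqrt(sum(design[i][j] ** 2 for i in range(n)) / n) for j in range(p)
    ]
    threshold = (
        math.sqrt((t + math.log(p)) / n) * 2.0 * max(col_norms) * math.sqrt(1.5) * k_eps
    )
    exceed = 0
    for _ in range(reps):
        eps = [rng.gauss(0.0, 1.0) for _ in range(n)]
        stat = max(
            abs(sum(eps[i] * design[i][j] for i in range(n)) / n) for j in range(p)
        )
        exceed += stat > threshold
    return exceed / reps


print(exceed_frequency(), math.exp(-1.0))
```

The empirical exceedance frequency should fall well below $\exp(-t)$, reflecting that the sub-Gaussian bound is conservative for Gaussian noise.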

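Note that $\Pi(X|\mathcal{H})=f_X+g_X$ with $f_X\in\mathcal{F}$ and $g_X\in\mathcal{G}$. We have $|\Pi(X|\mathcal{G})|\le M_0$ for some constant $M_0>0$, which further implies that $\tilde X=X-\Pi(X|\mathcal{H})$ is sub-Gaussian.

Then by Lemma 14.16 of [1], we have for some constant $K_X\ge1$ that

\[
P\Big(\max_{1\le j\le p}\big|\|\tilde X_j\|_n^2-E\|\tilde X_j\|_n^2\big|\ge K_X\sqrt{\frac{\log(2p)}{n}}\Big)\le1/(2p),
\]

which further implies

\[
P\Big(\max_{1\le j\le p}\|\tilde X_j\|_n^2\ge2K_X\Big)\le P\Big(\max_{1\le j\le p}\|\tilde X_j\|_n^2\ge1+K_X\sqrt{\frac{\log(2p)}{n}}\Big)\le1/(2p). \tag{2.39}
\]

Now take $t=\log(2p)$. With probability at least $1-1/p$,

\[
\|P_n\varepsilon\tilde X^T\|_\infty\le2\sqrt{6}K_XK_\varepsilon\sqrt{\frac{2\log(2p)}{n}}.
\]

Noting that $\lambda\gtrsim\sqrt{\log p/n}$, we can have $2\|P_n\varepsilon\tilde X^T\|_\infty\le4\sqrt{6}K_XK_\varepsilon\sqrt{2\log(2p)/n}\le\lambda/10$.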

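Lemma 2.8. With probability at least $1-5/(2p)-C\exp(-n\rho^2/c)$ for some constants $c, C>0$,

\begin{align*}
2\big|P_n(\hat f-f_0+\hat g-g_0)\tilde X^T(\hat\beta-\beta_0)\big|+2\big|P_n\big(f_X^T(\hat\beta-\beta_0)+g_X^T(\hat\beta-\beta_0)\big)\tilde X^T(\hat\beta-\beta_0)\big|&\le\frac{\lambda}{10}\|\hat\beta-\beta_0\|_1,\\
\big|\|\tilde X(\hat\beta-\beta_0)\|_n^2-\|\tilde X(\hat\beta-\beta_0)\|^2\big|&\le\frac{\lambda}{2}\|\hat\beta-\beta_0\|_1.
\end{align*}

Proof. On the set $\mathcal{T}(R)\cap\mathcal{T}_I(R_I)$, we have

\begin{align*}
\big|P_n(\hat f-f_0+\hat g-g_0)\tilde X^T(\hat\beta-\beta_0)\big|&\le\|P_n(\hat f-f_0+\hat g-g_0)\tilde X^T\|_\infty\|\hat\beta-\beta_0\|_1\\
&\le\Big(\max_{1\le j\le p}\Big|\frac{1}{n}\sum_{i=1}^n(\hat f-f_0+\hat g-g_0)_i\tilde X_{ij}\Big|\Big)\|\hat\beta-\beta_0\|_1.
\end{align*}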

Note that given Xe, we have for each 1≤j ≤p,

|(fb−f0+bg−g0)iXeij| ≤2R/ p


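By Lemma 14.15 of [1], we have

\[
P\Bigg(\max_{1\le j\le p}\Big|\frac{1}{n}\sum_{i=1}^n(\hat f-f_0+\hat g-g_0)_i\tilde X_{ij}\Big|\ge\max_{1\le j\le p}\sqrt{\frac{(2R/\sqrt{1-\gamma_2})^2\sum_{i=1}^n\tilde X_{ij}^2}{n}}\sqrt{2\Big(t^2+\frac{\log(2p)}{n}\Big)}\Bigg)\le\exp(-nt^2).
\]

Again by (2.39) and choosing $t^2=\log(2p)/n$, we have

\[
P\Bigg(\max_{1\le j\le p}\Big|\frac{1}{n}\sum_{i=1}^n(\hat f-f_0+\hat g-g_0)_i\tilde X_{ij}\Big|\ge2K_X\big(2R/\sqrt{1-\gamma_2}\big)\sqrt{\frac{4\log(2p)}{n}}\Bigg)\le1/p.
\]

By choosing $\lambda\ge40K_X\sqrt{\log(2p)/n}$, we have, with probability at least $1-1/p$,

\[
\max_{1\le j\le p}\Big|\frac{1}{n}\sum_{i=1}^n(\hat f-f_0+\hat g-g_0)_i\tilde X_{ij}\Big|\le\lambda/20.
\]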

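Finally, note that

\begin{align*}
\Big|\frac{1}{n}\sum_{i=1}^n\big((f_X+g_X)^T(\hat\beta-\beta_0)\big)_i\tilde X_{ij}\Big|&\le\Big|\frac{1}{n}\sum_{i=1}^n\sum_{k=1}^p(f_X+g_X)_{ik}(\hat\beta-\beta_0)_k\tilde X_{ij}\Big|\\
&\le\Big|\sum_{k=1}^p(\hat\beta-\beta_0)_k\Big(\frac{1}{n}\sum_{i=1}^n(f_X+g_X)_{ik}\tilde X_{ij}\Big)\Big|\\
&\le\|\hat\beta-\beta_0\|_1\max_{1\le k\le p}\Big|\frac{1}{n}\sum_{i=1}^n(f_X+g_X)_{ik}\tilde X_{ij}\Big|\\
&\le\delta_0\frac{R^2}{\lambda}\max_{1\le k\le p}\Big|\frac{1}{n}\sum_{i=1}^n(f_X+g_X)_{ik}\tilde X_{ij}\Big|,
\end{align*}

where $E(f_X+g_X)_{ik}\tilde X_{ij}=0$ and $|(f_X+g_X)_{ik}\tilde X_{ij}|\le M_0|\tilde X_{ij}|$ given $\tilde X$ known. By Lemma 14.15 in [1], we obtain that, given $\tilde X$,

\[
P\Bigg(\max_{1\le j\le p}\max_{1\le k\le p}\Big|\frac{1}{n}\sum_{i=1}^n(f_X+g_X)_{ik}\tilde X_{ij}\Big|\ge\max_{1\le j\le p}\sqrt{\frac{M_0^2\sum_{i=1}^n\tilde X_{ij}^2}{n}}\sqrt{2\Big(t^2+\frac{2\log(2p)}{n}\Big)}\Bigg)\le\exp(-nt^2).
\]

Similarly, letting $t^2=\log(2p)/n$ and invoking (2.39) gives

\[
P\Bigg(\max_{1\le j\le p}\max_{1\le k\le p}\Big|\frac{1}{n}\sum_{i=1}^n(f_X+g_X)_{ik}\tilde X_{ij}\Big|>2K_XM_0\sqrt{\frac{\log(2p)}{n}}\Bigg)\le1/p.
\]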

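Choose $\lambda>2K_XM_0\sqrt{\log(2p)/n}$. We finally get, with probability at least $1-1/p$,

\[
2\big|P_n\big(f_X^T(\hat\beta-\beta_0)+g_X^T(\hat\beta-\beta_0)\big)\tilde X^T(\hat\beta-\beta_0)\big|\le\delta_0R^2\|\hat\beta-\beta_0\|_1,
\]

which can be made smaller than $\frac{\lambda}{20}\|\hat\beta-\beta_0\|_1$ by a suitable choice of $\delta_0$. Similarly, we can get

\[
\big|\|\tilde X(\hat\beta-\beta_0)\|_n^2-\|\tilde X(\hat\beta-\beta_0)\|^2\big|\le\delta_0\frac{R^2}{\lambda}\max_{1\le k,j\le p}\Big|\frac{1}{n}\sum_{i=1}^n\big(\tilde X_{ik}\tilde X_{jk}-E\tilde X_{ik}\tilde X_{jk}\big)\Big|\,\|\hat\beta-\beta_0\|_1,
\]

where $\tilde X_{ik}\tilde X_{jk}-E\tilde X_{ik}\tilde X_{jk}$ is sub-exponential. By Lemma 14.16 of [1], we have for some constant $K_{\tilde X}$ that

\[
P\Bigg(\max_{1\le j,k\le p}\Big|\frac{1}{n}\sum_{i=1}^n\big(\tilde X_{ik}\tilde X_{jk}-E\tilde X_{ik}\tilde X_{jk}\big)\Big|>K_{\tilde X}\sqrt{\frac{\log(2p)}{n}}\Bigg)\le1/(2p).
\]

Therefore, by choosing $\lambda>2\delta_0K_{\tilde X}\sqrt{\log(2p)/n}$, we have

\[
\big|\|\tilde X(\hat\beta-\beta_0)\|_n^2-\|\tilde X(\hat\beta-\beta_0)\|^2\big|\le\lambda\|\hat\beta-\beta_0\|_1/2
\]

with probability at least $1-1/(2p)$. Recalling the probability of $\mathcal{T}(R)\cap\mathcal{T}_I(R_I)$ from Lemma 2.4, this lemma is proved.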
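The term $5/(2p)$ in the probability statement of Lemma 2.8 can be read as a union-bound tally of the three failure probabilities appearing in the proof. The accounting below is a sketch that assumes the three bounds are simply added.

```latex
\begin{align*}
\underbrace{\frac{1}{p}}_{\text{term in } \hat f-f_0+\hat g-g_0}
+\underbrace{\frac{1}{p}}_{\text{term in } f_X+g_X}
+\underbrace{\frac{1}{2p}}_{\text{quadratic form in } \tilde X}
=\frac{5}{2p}.
\end{align*}
```

The remaining term $C\exp(-n\rho^2/c)$ accounts for the complement of $\mathcal{T}(R)\cap\mathcal{T}_I(R_I)$.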

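Lemma 2.9. Assume

\[
\rho^2\le\frac{\delta_0^2R^2}{2(I_1+I(f_0))I_1}.
\]

Then on the set $\mathcal{T}(R)\cap\mathcal{T}_I(R_I)$,

\[
\rho^2I^2\big(\hat f+f_X^T(\hat\beta-\beta_0)\big)-\rho^2I^2(\hat f)\le\frac{\lambda}{10}\|\hat\beta-\beta_0\|_1.
\]

Proof.

\begin{align*}
\rho^2I^2\big(\hat f+f_X^T(\hat\beta-\beta_0)\big)-\rho^2I^2(\hat f)&=\rho^2\big[I^2(f_X^T(\hat\beta-\beta_0))+2I(\hat f,f_X^T(\hat\beta-\beta_0))\big]\\
&\le\rho^2\Big(\frac{\delta_0R^2}{\lambda}I_1^2\|\hat\beta-\beta_0\|_1+2I(\hat f)I(f_X^T(\hat\beta-\beta_0))\Big)\\
&\le\Big(\delta_0\rho^2I_1^2+2\rho^2\Big(\frac{R}{\rho}+I(f_0)\Big)I_1\Big)\|\hat\beta-\beta_0\|_1\\
&\le\Big(\frac{1}{2}\delta_0^3R^2+\frac{\sqrt{2}}{2}\delta_0R^2+\delta_0^2R^2\Big)\|\hat\beta-\beta_0\|_1\\
&\le3\delta_0R^2\|\hat\beta-\beta_0\|_1,
\end{align*}

where the first equality follows from the definition of $I(\cdot)$, the second inequality follows from Assumption 2.9, and the third is true due to the triangle inequality. Choosing $\delta_0$ such that $3\delta_0R^2\le\lambda/10$, we get the desired result.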

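Lemma 2.10. Assume

\[
\mu^2\le\frac{\delta_0^2R^2}{(J_2^{q-1}+J^{q-1}(g_0))J_2}. \tag{2.40}
\]

Then on the set $\mathcal{T}(R)\cap\mathcal{T}_I(R_I)$,

\[
\mu^2J^q\big(\hat g+g_X^T(\hat\beta-\beta_0)\big)-\mu^2J^q(\hat g)\le\frac{\lambda}{10}\|\hat\beta-\beta_0\|_1.
\]

Proof.

\[
\mu^2J^q\big(\hat g+g_X^T(\hat\beta-\beta_0)\big)-\mu^2J^q(\hat g)\le\mu^2\Big(2J^{q-1}(\hat g)J\big(g_X^T(\hat\beta-\beta_0)\big)+2J^q\big(g_X^T(\hat\beta-\beta_0)\big)\Big).
\]

Note that $J(g_X^T(\hat\beta-\beta_0))\le J_2\|\hat\beta-\beta_0\|_1$ by Assumption 2.9 and $\|\hat\beta-\beta_0\|_1\le\delta_0R^2/\lambda$. We have

\begin{align*}
\mu^2J^q\big(\hat g+g_X^T(\hat\beta-\beta_0)\big)-\mu^2J^q(\hat g)&\le\mu^2\Big(2\Big(\Big(\frac{R}{\mu}\Big)^{2/q}+J(g_0)\Big)^{q-1}J_2\|\hat\beta-\beta_0\|_1+2J_2^q\|\hat\beta-\beta_0\|_1^q\Big)\\
&\le\Big(2\delta_0^{2/q}R^2+2\delta_0^2R^2+2\delta_0^2R^2\Big(\frac{\delta_0R^2}{\lambda}\Big)^{q-1}\Big)\|\hat\beta-\beta_0\|_1\\
&\le6\delta_0^{2/q}R^2\|\hat\beta-\beta_0\|_1,
\end{align*}

where the first inequality follows from the definition and the second one follows from condition (2.40). Choosing $\delta_0$ such that $6\delta_0^{2/q}R^2\le\lambda/10$, we get the desired result.

Based on Lemmas 2.7-2.10, we are now ready to prove Corollary 2.2.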

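Proof of Corollary 2.2. Recall that $\Pi(X|\mathcal{H})=f_X+g_X$. By definition, we have

\begin{align*}
&\|Y-X^T\hat\beta-\hat f-\hat g\|_n^2+\lambda\|\hat\beta\|_1+\rho^2I^2(\hat f)+\mu^2J^q(\hat g)\\
&\quad\le\big\|Y-X^T\beta_0-\big(\hat f+f_X^T(\hat\beta-\beta_0)\big)-\big(\hat g+g_X^T(\hat\beta-\beta_0)\big)\big\|_n^2+\lambda\|\beta_0\|_1\\
&\qquad+\rho^2I^2\big(\hat f+f_X^T(\hat\beta-\beta_0)\big)+\mu^2J^q\big(\hat g+g_X^T(\hat\beta-\beta_0)\big).
\end{align*}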
