Adaptive group bridge estimation for high dimensional partially linear models

RESEARCH

Open Access

Xiuli Wang and Mingqiu Wang*

*Correspondence: [email protected]
School of Statistics, Qufu Normal University, Jingxuan West Road, Qufu, 273165, P.R. China

Abstract

This paper studies group selection for the partially linear model with a diverging number of parameters. We propose an adaptive group bridge method and study the consistency, convergence rate and asymptotic distribution of the global adaptive group bridge estimator under regularity conditions. Simulation studies and a real example show the ﬁnite sample performance of our method.

MSC: 62E20; 62J07; 62F12

Keywords: adaptive group bridge; high dimension; partially linear model

1 Introduction

Consider the following model:

$$Y = x^T\beta + f(U) + \varepsilon, \tag{1}$$

where $x = (x_1^T, x_2^T, \ldots, x_{p_n}^T)^T$ is a covariate vector with $x_j = (X_{jk}, k = 1, \ldots, d_j)^T$ being a $d_j \times 1$ vector corresponding to the $j$th group in the linear part, $\beta = (\beta_j^T, j = 1, \ldots, p_n)^T$ with $\beta_j$ being the $d_j \times 1$ vector of regression coefficients, $f$ is an unknown function of $U$, and $\varepsilon$ is the random error with mean zero. Without loss of generality, $U$ is scaled to $[0, 1]$. Furthermore, $(x, U)$ and $\varepsilon$ are independent.

Variable selection for high-dimensional data is an active and important topic. Penalized regression methods have been widely used in the literature, such as [–], and so on. Among these methods, bridge regression, which includes the lasso and ridge as two well-known special cases, has been studied by many authors (e.g., [–]). [11] studied adaptive bridge estimation for high-dimensional linear models. In addition, group structure among variables arises in many contemporary statistical modeling problems. [12] proposed a group bridge method which not only effectively removes unimportant groups, but also maintains the flexibility of selecting variables within identified groups. [13] investigated an adaptive choice of the penalty order in group bridge regression.

The aforementioned model (1) is the partially linear model that originated from [14]. The partially linear model is a common semiparametric model that enjoys both interpretability and flexibility. Our contributions in this paper include: (1) we propose an adaptive group bridge method to achieve group selection for a high-dimensional partially linear model; (2) we consider the choice of the index $\gamma$ in the adaptive group bridge and use leave-one-observation-out cross-validation (CV) to implement this choice. It can significantly reduce the computational burden; (3) we give the consistency, convergence rate and asymptotic distribution of the adaptive group bridge estimator, which is the global minimizer of the objective function.

The rest of the article is organized as follows. Section 2 gives the adaptive group bridge method. In Section 3, we show the assumptions and asymptotic results for the global adaptive group bridge estimator. Section 4 presents the computational algorithm and the selection of tuning parameters. Simulation studies and a real data application are presented in Section 5. Section 6 gives a short discussion. Technical proofs are relegated to the Appendix.

2 Adaptive group bridge in the partially linear model

Suppose that we have a collection of independent observations $\{(x_i, U_i, Y_i), 1 \le i \le n\}$ from model (1). That is,

$$Y_i = x_i^T\beta + f(U_i) + \varepsilon_i, \quad i = 1, \ldots, n, \tag{2}$$

where $\varepsilon_1, \ldots, \varepsilon_n$ are i.i.d. random errors with mean zero and finite variance $\sigma^2 < \infty$. To obtain an estimate of the function $f(\cdot)$, we employ a B-spline basis. Denote by $S_n$ the space of polynomial splines of degree $m \ge 1$. Let $\{B_k(u), 1 \le k \le q_n\}$ be a normalized B-spline basis with $\|B_k\|_\infty \le 1$, where $\|\cdot\|_\infty$ is the sup norm. Then, for any $f_n \in S_n$, we have

$$f_n(u) = \sum_{j=1}^{q_n} B_j(u)\alpha_j \triangleq B(u)^T\alpha.$$

Under some smoothness conditions, the nonparametric function $f$ can be well approximated by functions in $S_n$.
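As an illustration of the spline approximation, the following is a minimal sketch of how a normalized B-spline basis on $[0, 1]$ could be evaluated with the Cox-de Boor recursion. The function name `bspline_basis`, the clamped knot vector and the equally spaced interior knots are assumptions of this sketch and are not specified in the paper.

```python
import numpy as np


def bspline_basis(u, n_basis, degree=3):
    """Evaluate a clamped, normalized B-spline basis on [0, 1].

    Returns a (len(u), n_basis) matrix whose k-th column is B_k(u).
    Equally spaced interior knots are an assumption of this sketch.
    """
    u = np.asarray(u, dtype=float)
    n_interior = n_basis - degree - 1
    if n_interior < 0:
        raise ValueError("n_basis must be at least degree + 1")
    interior = np.linspace(0.0, 1.0, n_interior + 2)[1:-1]
    t = np.concatenate([np.zeros(degree + 1), interior, np.ones(degree + 1)])

    # Degree-0 splines: indicators of the half-open knot intervals.
    B = np.zeros((u.size, t.size - 1))
    for i in range(t.size - 1):
        B[:, i] = (t[i] <= u) & (u < t[i + 1])
    # Close the last non-empty interval so that u = 1 is covered.
    B[u == 1.0, t.size - degree - 2] = 1.0

    # Cox-de Boor recursion, raising the degree one step at a time.
    for d in range(1, degree + 1):
        B_next = np.zeros((u.size, t.size - d - 1))
        for i in range(t.size - d - 1):
            left_den = t[i + d] - t[i]
            right_den = t[i + d + 1] - t[i + 1]
            if left_den > 0:
                B_next[:, i] += (u - t[i]) / left_den * B[:, i]
            if right_den > 0:
                B_next[:, i] += (t[i + d + 1] - u) / right_den * B[:, i + 1]
        B = B_next
    return B
```

Stacking the rows $B(U_i)^T$ produced this way gives a design matrix for the nonparametric part. The basis functions are nonnegative and sum to one at every point, so each satisfies the sup-norm bound stated above.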

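Consider the following adaptive group bridge penalized objective function:

$$\sum_{i=1}^{n} \bigl(Y_i - x_i^T\beta - B(U_i)^T\alpha\bigr)^2 + \sum_{j=1}^{p_n} \lambda_j \|\beta_j\|_2^{\gamma}, \tag{3}$$

where $\lambda_j$, $j = 1, \ldots, p_n$, are the tuning parameters, and $\|\cdot\|_2$ denotes the $L_2$ norm on the Euclidean space. Let $Y = (Y_1, \ldots, Y_n)^T$, $X = (X_{ijk}, 1 \le i \le n, 1 \le j \le p_n, 1 \le k \le d_j) = (x_1, \ldots, x_n)^T$ and $Z = (B(U_1), \ldots, B(U_n))^T$. Then (3) can be written as

$$L_n(\beta, \alpha) = \|Y - X\beta - Z\alpha\|_2^2 + \sum_{j=1}^{p_n} \lambda_j \|\beta_j\|_2^{\gamma}. \tag{4}$$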

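For a given $\beta$, the optimal $\alpha$ minimizing $L_n(\beta, \cdot)$ satisfies the first-order condition

$$\partial L_n(\beta, \alpha)/\partial\alpha = 0,$$

namely,

$$\alpha = (Z^TZ)^{-1}Z^T(Y - X\beta).$$

Let $H = Z(Z^TZ)^{-1}Z^T$; note that $H$ is a projection matrix. We can rewrite the expression (4) as follows:

$$Q_n(\beta) = \|(I - H)(Y - X\beta)\|_2^2 + \sum_{j=1}^{p_n} \lambda_j \|\beta_j\|_2^{\gamma}. \tag{5}$$

For some fixed $\gamma > 0$, define $\hat\beta = \arg\min Q_n(\beta)$; then $\hat\beta$ is called the adaptive group bridge estimator. Once $\hat\beta$ is obtained, the estimator $\hat\alpha$ follows from the first-order condition. Thus we can get the estimator of the nonparametric part, namely, $\hat f_n(u) = B(u)^T\hat\alpha$.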
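The profiling step can be checked numerically. The sketch below assumes a full-column-rank spline design matrix `Z` and represents each group as a list of column indices. All function names are illustrative and not from the paper. It evaluates the joint objective $L_n(\beta, \alpha)$ and the profiled objective $Q_n(\beta)$, so that the two can be compared at the least squares value of $\alpha$.

```python
import numpy as np


def projection_matrix(Z):
    """H = Z (Z^T Z)^{-1} Z^T, computed through the pseudo-inverse of Z."""
    return Z @ np.linalg.pinv(Z)


def group_bridge_penalty(beta, groups, lam, gamma):
    """sum_j lam_j * ||beta_j||_2^gamma over the index lists in `groups`."""
    return sum(l * np.linalg.norm(beta[g]) ** gamma for g, l in zip(groups, lam))


def joint_objective(beta, alpha, X, Y, Z, groups, lam, gamma):
    """L_n(beta, alpha) = ||Y - X beta - Z alpha||^2 + penalty."""
    resid = Y - X @ beta - Z @ alpha
    return resid @ resid + group_bridge_penalty(beta, groups, lam, gamma)


def profiled_objective(beta, X, Y, Z, groups, lam, gamma):
    """Q_n(beta) = ||(I - H)(Y - X beta)||^2 + penalty."""
    partial = Y - X @ beta
    resid = partial - projection_matrix(Z) @ partial
    return resid @ resid + group_bridge_penalty(beta, groups, lam, gamma)
```

For any fixed `beta`, plugging the least squares coefficient of `Y - X @ beta` on `Z` into `joint_objective` should reproduce `profiled_objective`, and any other `alpha` should give a larger value.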

3 Asymptotic properties

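In this section, we show the oracle property of the parametric part. For convenience of the statement, we first give some notation. Define $g(u) = E(x \mid U = u)$ and $\tilde x = x - E(x \mid U)$. Let $\Sigma(u)$ be the conditional covariance matrix of $\tilde x$, i.e., $\Sigma(u) = \operatorname{cov}(\tilde x \mid U = u)$. Denote by $\Sigma$ the unconditional covariance matrix of $\tilde x$, i.e., $\Sigma = E[\Sigma(U)]$. The corresponding sample versions are $G = (g(U_1), \ldots, g(U_n))^T$ with $g(U_i) = E(x_i \mid U_i)$ and $\tilde X = (\tilde x_1, \ldots, \tilde x_n)^T$ with $\tilde x_i = x_i - E(x_i \mid U_i)$.

Let the true parameter be $\beta_0 = (\beta_{01}^T, \ldots, \beta_{0p_n}^T)^T \triangleq (\beta_{0(1)}^T, \beta_{0(2)}^T)^T$. Let $A = \{1 \le j \le p_n : \|\beta_{0j}\|_2 \ne 0\}$ be the index set of the nonzero groups. Without loss of generality, we assume that the coefficients of the first $k_n$ groups are nonzero, i.e., $A = \{1, 2, \ldots, k_n\}$. Let $|A| = k_n$ be the cardinality of the set $A$, which is allowed to increase with $n$. For $j \notin A$, $\|\beta_{0j}\|_2 = 0$. Define $\beta_{0(1)} = (\beta_{0j}^T, j \in A)^T$ and $\beta_{0(2)} = (\beta_{0j}^T, j \notin A)^T$. Let $d^* = \max_{1 \le j \le p_n} d_j$, $\varphi_{n1} = \max\{\lambda_j, j \in A\}$ and $\varphi_{n2} = \min\{\lambda_j, j \notin A\}$.

Corresponding to the partition of $\beta_0$, denote $\hat\beta = (\hat\beta_{(1)}^T, \hat\beta_{(2)}^T)^T$ and decompose

$$X = (X_1, X_2), \quad G = (G_1, G_2), \quad \tilde X = (\tilde X_1, \tilde X_2), \quad \Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}.$$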

The following conditions are required for the B-spline approximation of the function $f$.

(C1) The distribution of $U$ is absolutely continuous, and its density is bounded away from $0$ and $\infty$.

(C2) (Hölder conditions of $f(\cdot)$ and $g_j(\cdot)$, where $g_j$ is the $j$th component of $g$) Let $l$, $\delta$ and $M$ be real constants such that $0 < \delta \le 1$ and $M > 0$. $f(\cdot)$ and $g_j(\cdot)$ belong to a class of functions $\mathcal{H}$,

$$\mathcal{H} = \bigl\{h : |h^{(l)}(u_1) - h^{(l)}(u_2)| \le M|u_1 - u_2|^{\delta}, \text{ for } 0 \le u_1, u_2 \le 1\bigr\},$$

where $0 < l \le m - 1$ and $r = l + \delta$.

The following conditions are needed to obtain the asymptotic results.

(A1) Let $\lambda_{\max}(\Sigma)$ and $\lambda_{\min}(\Sigma)$ be the largest and smallest eigenvalues of $\Sigma$, respectively. There exist constants $\tau_1$ and $\tau_2$ such that

$$0 < \tau_1 \le \lambda_{\min}(\Sigma) \le \lambda_{\max}(\Sigma) \le \tau_2 < \infty.$$

(A2) There exist constants $0 < b_0 < b_1 < \infty$ such that

$$b_0 \le \min\{\|\beta_{0j}\|_2, 1 \le j \le k_n\} \le \max\{\|\beta_{0j}\|_2, 1 \le j \le k_n\} \le b_1.$$

(A3) $\|n^{-1}X^T(I - H)X - \Sigma\| \to_P 0$; $E[\operatorname{tr}(X^T(I - H)X)] = O(np_n)$.

(A4) $d^* = O(1)$, $p_n/n \to 0$ and $n^{-1}\varphi_{n1}k_n \to 0$.

(A5) (a) $\varphi_{n1}k_n^{1/2}/(\sqrt{np_n} + n\sqrt{p_n}\,q_n^{-r}) \to 0$; (b) $\varphi_{n2}\bigl(\sqrt{n^{-1}p_n} + \sqrt{p_n}\,q_n^{-r}\bigr)^{\gamma-2}/n \to \infty$.

(A6) For every $1 \le j \le p_n$ and $1 \le k \le d_j$, $E[X_{jk} - E(X_{jk} \mid U)]^4$ is bounded. Furthermore, $E(\varepsilon^4)$ is bounded.

Conditions (A1) and (A2) are commonly used. Condition (A3) holds under some conditions; the proof can be found in the lemmas of [15]. Condition (A4) is used to obtain the consistency of the estimator. Condition (A5) is needed in the proof of the convergence rate. Condition (A6) is necessary to attain the asymptotic distribution.

Theorem .(Consistency) Suppose thatγ > and conditions(A)-(A)hold,then

ˆβ–β_=OP

n–d∗pn+qn–r+n–ϕnkn

,

namely, ˆβ–β_→P .

Theorem . implies that under some conditions the estimators converge to the true values of parameters.

Theorem .(Convergence rate) Suppose that conditions(A)-(A)hold,then

ˆβ–β_=OP

n–_p

n+√pnq–nr

.

This theorem shows that the adaptive group bridge can give the optimal convergence rate withpn→ ∞.

Theorem .(Oracle property) Suppose that <γ < ,n–_k

nqn→and nq–n r→.If conditions(A)-(A)are satisﬁed,then we have

(i) Pr(βˆ_()= )→,n→ ∞; (ii) Letu

n=nωTn(XT(I–H)X)– (XT(I–H)X)–ωnwithωnbeing some kn

j=dj-vector withωn= ,then

n/u–_n ωT_n(βˆ_()–β_)→D N,σ.

This theorem states that the adaptive group bridge performs as well as the oracle [].

4 Computational algorithm and selection of tuning parameters

4.1 Computational algorithm

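We take the initial value $\beta^{(0)}$; here the ordinary least squares estimate is chosen. The penalty term $p_{\lambda_j}(\|\beta_j\|_2) = \lambda_j\|\beta_j\|_2^{\gamma}$ can be approximated as

$$p_{\lambda_j}(\|\beta_j\|_2) \approx p_{\lambda_j}(\|\beta_j^{(0)}\|_2) + \frac{1}{2}\Bigl\{p'_{\lambda_j}(\|\beta_j^{(0)}\|_2)/\|\beta_j^{(0)}\|_2\Bigr\}\bigl(\|\beta_j\|_2^2 - \|\beta_j^{(0)}\|_2^2\bigr),$$

when $\|\beta_j^{(0)}\|_2 > 0$. The following iterative expression of $\beta$ can be obtained:

$$\beta^{(1)} = \bigl[X^T(I - H)X + n\Sigma_{\lambda,\gamma}(\beta^{(0)})\bigr]^{-1}X^T(I - H)Y, \tag{6}$$

where

$$\Sigma_{\lambda,\gamma}(\beta^{(0)}) = \operatorname{diag}\Bigl\{\frac{p'_{\lambda_j}(\|\beta_j^{(0)}\|_2)}{\|\beta_j^{(0)}\|_2}I_{d_j},\ j = 1, \ldots, p_n\Bigr\},$$

with $I_{d_j}$ being the $d_j \times d_j$ identity matrix. If some $\|\beta_j^{(1)}\|_2$ is smaller than a small threshold, then we set $\beta_j^{(1)} = 0$. The final estimate is obtained by iterating formula (6) until convergence is achieved.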

4.2 Selection of the tuning parameters

For our method, $q_n$, $\gamma$, and $\lambda_j$ ($j = 1, \ldots, p_n$) should be chosen. For convenience, a cubic spline basis ($m = 3$) is used. We fix $q_n$ at a preset value; simulation results demonstrate that this choice performs quite well. The $\lambda_j$ constitute many tuning parameters, but in fact we only need to select one tuning parameter by setting $\lambda_j = \lambda/\|\beta_j^{(0)}\|_2$. We use 'leave-one-observation-out' cross-validation (CV) to select $\lambda$ and $\gamma$. Due to the convergence of the algorithm, we have

$$\hat\beta = \bigl[X^T(I - H)X + n\Sigma_{\lambda,\gamma}(\hat\beta)\bigr]^{-1}X^T(I - H)Y,$$

where $\hat\beta$ is obtained based on the whole data set. Note that it is the solution of the ridge regression

$$\|Y^* - X^*\beta\|_2^2 + n\beta^T\Sigma_{\lambda,\gamma}(\hat\beta)\beta, \tag{7}$$

where $Y^* = (I - H)Y$ and $X^* = (I - H)X$. Let $Y^* = (y_1^*, \ldots, y_n^*)^T$ and $X^* = (x_1^*, \ldots, x_n^*)^T$. The CV error is

$$CV(\lambda, \gamma) = \frac{1}{n}\sum_{i=1}^{n}\bigl(y_i^* - x_i^{*T}\hat\beta^{-i}\bigr)^2,$$

where $\hat\beta^{-i}$ is obtained by solving (7) without the $i$th observation. The computation of the CV error is intensive, so we will use the following formula, which can be proved similarly to [17]:

$$CV(\lambda, \gamma) = \frac{1}{n}\sum_{i=1}^{n}\frac{\bigl(y_i^* - x_i^{*T}\hat\beta\bigr)^2}{(1 - D_{ii})^2},$$

where $D_{ii}$ is the $(i, i)$th diagonal element of $(I - H)X\bigl[X^T(I - H)X + n\Sigma_{\lambda,\gamma}(\hat\beta)\bigr]^{-1}X^T(I - H)$. This formula avoids refitting the model $n$ times and therefore significantly reduces the computational burden.
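The leave-one-out shortcut for a ridge-type fit can be verified numerically. The sketch below compares a brute-force leave-one-out loop with the closed-form expression based on the diagonal of the hat matrix. It assumes the penalty matrix `P` (playing the role of $n\Sigma_{\lambda,\gamma}(\hat\beta)$) is held fixed across folds. The function names are illustrative.

```python
import numpy as np


def ridge_fit(X, y, P):
    """Solve (X^T X + P) beta = X^T y for a fixed penalty matrix P."""
    return np.linalg.solve(X.T @ X + P, X.T @ y)


def loo_cv_brute_force(X, y, P):
    """Refit without each observation in turn and average the squared errors."""
    n = X.shape[0]
    errors = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        beta_minus_i = ridge_fit(X[keep], y[keep], P)
        errors[i] = (y[i] - X[i] @ beta_minus_i) ** 2
    return errors.mean()


def loo_cv_shortcut(X, y, P):
    """Same quantity from a single fit, using the hat-matrix diagonal D_ii."""
    A = np.linalg.solve(X.T @ X + P, X.T)   # (p, n) matrix (X^T X + P)^{-1} X^T
    D = np.einsum("ij,ji->i", X, A)         # diagonal of X (X^T X + P)^{-1} X^T
    resid = y - X @ (A @ y)
    return np.mean((resid / (1.0 - D)) ** 2)
```

For a fixed penalty matrix the two functions agree up to rounding error, while the shortcut needs one linear solve instead of $n$.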

5 Simulation studies and application

In this section, we investigate the ﬁnite sample performance of the adaptive group bridge method through simulations and a real data application.

5.1 Monte Carlo simulations

We simulate datasets consisting of $n$ observations from the following partially linear model:

$$Y_i = \sum_{j=1}^{p_n} x_{ij}^T\beta_j + \cos(\pi U_i) + \varepsilon_i, \quad i = 1, \ldots, n,$$

where the error $\varepsilon_i \sim N(0, \sigma^2)$ with $\sigma = 0.5, 1, 4$. We consider $p_n$ groups with $p_n = 10, 30, 50$, and each group consists of three variables. The first three groups $\beta_{01}, \beta_{02}, \beta_{03}$ are nonzero, and $\beta_{04}^T = \cdots = \beta_{0p_n}^T = (0, 0, 0)$. $U_i$ follows the uniform distribution on $[0, 1]$. To generate the covariate $x = (x_1^T, x_2^T, \ldots, x_{p_n}^T)^T$ with $x_j = (X_{jk}, k = 1, 2, 3)^T$, we first simulate $R_1, \ldots, R_{3p_n}$ independently from the standard normal distribution. Next, we simulate $Z_j$, $j = 1, \ldots, p_n$, from a multivariate normal distribution with mean zero and a covariance $\operatorname{Cov}(Z_j, Z_l)$ that decays geometrically in $|j - l|$. Then the covariates are generated as $X_{jk} = (Z_j + R_{3(j-1)+k})/\sqrt{2}$, $j = 1, \ldots, p_n$, $k = 1, 2, 3$.
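The covariate design can be sketched as follows. The correlation base `rho` of the latent variables $Z_j$ and the divisor $\sqrt{2}$ (which gives each covariate unit variance) are assumptions of this sketch, since the exact constants are not legible in this copy of the text. The function name is hypothetical.

```python
import numpy as np


def simulate_covariates(n, p_groups, rho, rng):
    """Generate n rows of 3 * p_groups covariates with a shared group factor.

    Column 3*(j-1)+k (1-based) is (Z_j + R_{3(j-1)+k}) / sqrt(2), where
    Cov(Z_j, Z_l) = rho ** |j - l| and the R's are independent N(0, 1).
    """
    idx = np.arange(p_groups)
    cov_z = rho ** np.abs(idx[:, None] - idx[None, :])
    Z = rng.multivariate_normal(np.zeros(p_groups), cov_z, size=n)
    R = rng.standard_normal((n, 3 * p_groups))
    # Repeat each latent Z_j three times so that it is shared within its group.
    return (np.repeat(Z, 3, axis=1) + R) / np.sqrt(2.0)
```

Under these assumptions, two covariates in the same group have correlation $1/2$, while covariates in different groups are correlated only through the latent $Z_j$.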

We compare the adaptive group bridge (AGB) with the group lasso (GL) and the group bridge (GB). The following three performance measures are calculated:

1. $L_2$ loss of the parametric estimate, which is defined as $\|\hat\beta - \beta_0\|_2$.

2. Average number of nonzero groups identified by the method (NN).

3. Average number of nonzero groups identified by the method that are truly nonzero (NNT).
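For a single simulated dataset, the three measures could be computed as in the following sketch, with groups given as lists of coefficient indices. The function names are illustrative. Averaging the two counts over the simulation runs gives NN and NNT.

```python
import numpy as np


def l2_loss(beta_hat, beta_true):
    """Euclidean distance between the estimate and the true coefficients."""
    return float(np.linalg.norm(np.asarray(beta_hat) - np.asarray(beta_true)))


def nonzero_groups(beta, groups):
    """Indices of the groups that contain at least one nonzero coefficient."""
    beta = np.asarray(beta)
    return {j for j, g in enumerate(groups) if np.any(beta[g] != 0)}


def selection_counts(beta_hat, beta_true, groups):
    """Return (selected groups, selected groups that are truly nonzero)."""
    selected = nonzero_groups(beta_hat, groups)
    truth = nonzero_groups(beta_true, groups)
    return len(selected), len(selected & truth)
```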

Group selection results are reported in Table 1. The numbers in parentheses in the columns labeled 'NN' and 'NNT' are the corresponding sample standard deviations based on the simulation runs. Boxplots of the $L_2$ losses under different settings are given in Figures 1-3.

From Table 1, we can make the following observations:

(1) Both GB and AGB perform better than GL for all settings. All three methods retain all the true nonzero groups, but GL always keeps more redundant groups that are unrelated to the response than GB and AGB do.

() AGB performs much better for largerσandpn. Whenpn= for AGB, groups

selected for the caseσ= are about .% lower than that for the caseσ= .. While groups selected for GB decrease by .% in the same situation.

() Forpn= , GB performs better than AGB, but the stability of GB is bad forσ= . Figures - presentLlosses with varyingσ andpn. We can see that the performances

of estimates are similar for GB and AGB. Forpn=  and , both GB and AGB perform better than GL. However, when pn= , the median ofLlosses for all these three are


Table 1 Group selection results (sample standard deviations in parentheses)

pn   Method   σ = 0.5             σ = 1               σ = 4
              NN        NNT       NN        NNT       NN        NNT
10   GL       7.80      3         7.59      3         6.02      3
              (1.231)   (0)       (2.396)   (0)       (1.255)   (0)
     GB       4.64      3         4.80      3         4.55      3
              (1.259)   (0)       (1.356)   (0)       (1.258)   (0)
     AGB      5.22      3         5.00      3         4.56      3
              (1.605)   (0)       (1.735)   (0)       (1.131)   (0)
30   GL       20.47     3         11.49     3         14.46     3
              (2.504)   (0)       (4.464)   (0)       (2.022)   (0)
     GB       11.04     3         10.96     3         10.18     3
              (3.643)   (0)       (3.784)   (0)       (2.350)   (0)
     AGB      13.06     3         10.64     3         9.57      3
              (5.510)   (0)       (5.921)   (0)       (2.046)   (0)
50   GL       33.17     3         16.48     3         22.37     3
              (3.223)   (0)       (5.018)   (0)       (3.308)   (0)
     GB       17.23     3         17.79     3         15.96     3
              (4.608)   (0)       (3.952)   (0)       (3.284)   (0)
     AGB      19.25     3         15.67     3         15.68     3
              (8.437)   (0)       (8.385)   (0)       (3.684)   (0)

Figure 2 Boxplots of $L_2$ loss for $p_n = 30$.

Table 2 Estimates of the wage data

Variable   Description                         GL        GB        AGB
edu        Number of years of education        0.0694    0.0668    0.0635
south      1 = southern region, 0 = other      –0.0723   –0.0679   –0.0490
sex        1 = Female, 0 = Male                –0.1999   –0.1983   –0.2031
union      1 = union member, 0 = nonmember     0.1951    0.1934    0.2030
race       1 = other, 0 = White                –0.0559   –0.0585   –0.0582
           1 = Hispanic, 0 = White             –0.0537   –0.0615   –0.0614
occup      1 = management, 0 = other           0.1874    0.2173    0.2516
           1 = sales, 0 = other                –0.0797   –0.0809   –0.0721
           1 = clerical, 0 = other             0.0166    0.0262    0.0430
           1 = service, 0 = other              –0.1171   –0.1173   –0.1104
           1 = professional, 0 = other         0.1533    0.1768    0.2061
sector     1 = manufacturing, 0 = other        0.0848    0.0912    0.0994
           1 = construction, 0 = other         0.0546    0.0622    0.0674
marr       1 = married, 0 = other              0.0000    0.0000    0.0000

5.2 Wage data analysis

The workers' wage data from Berndt [] contains a random sample of 534 observations on 11 variables sampled from the Current Population Survey of 1985. It provides information on wages and other characteristics of the workers, including continuous variables (the number of years of education, years of work experience, and age) and nominal variables (race, sex, region of residence, occupational status, sector, marital status and union membership). Our goal is to identify the important factors for the wage, so it is reasonable to apply our proposed method to these data.

From the residual plot, we can easily see that the variance of wages is not constant, so the log transformation is used to stabilize the variance of wages. Due to the multicollinearity between age and experience, we need to drop either age or experience; here we remove the age variable from the model. Xie and Huang [15] analyzed these data without considering the transformation of $Y$. Furthermore, they did not consider group selection of factors. Similar to Xie and Huang [15], we fit these data using a partially linear model with $U$ being 'years of work experience'.

Table 2 reports the estimated regression coefficients of GL, GB and AGB. All three methods exclude marital status. We use the first part of the observations as a training dataset to select and fit the model, and use the remaining observations as a testing dataset to evaluate the prediction ability of the selected model. The prediction performance is measured by the median of $|y_i - \hat y_i|$ over the testing data for GL, GB and AGB, respectively. Here the $y_i$'s are the observations in the testing dataset and the $\hat y_i$'s are the corresponding predicted values. Comparing the median absolute prediction errors of GL, GB and AGB, we conclude that AGB gives the smallest prediction error, so it is an attractive technique for group selection.

6 Discussion


Appendix

Proof of Theorem. By the deﬁnition ofβ, it is easy to getˆ

(I–H)(Y –Xβ)ˆ + pn

j=

λj ˆβjγ≤(I–H)(Y –Xβ) 

+ pn

j=

λjβjγ,

that is,

(I–H)(Y –Xβ)ˆ –(I–H)(Y –Xβ_)≤ pn

j=

λjβjγ.

As Y =Xβ_+f(U) +εwithf(U) = (f(U), . . . ,f(Un))Tandε= (ε, . . . ,εn)T, we can rewrite

the upper inequality as follows:

(I–H)X(βˆ–β_)– f(U) +εT(I–H)X(βˆ–β_)≤ pn

j=

λjβj

γ

.

Let

an=n–/

XT₍_I_–_H₎_X/₍_βˆ_–_β

),

bn=n–/

XT₍_I_–_H₎_X–/_XT₍_I_–_H₎_f_{(U) +}_ε_.

Then we have

an≤

an–bn+bn

≤ n

pn

j=

λjβjγ+ bn.

Since|A|=kn, under condition (A),

 n

pn

j=

λjβjγ =O

ϕnkn n

.

While

bn=  n

f(U) +εT(I–H)XXT₍_I_–_H₎_X–_XT₍_I_–_H₎_f_{(U) +}_ε

≤  nε

T_A_ε₊ nf(U)

T_Af_(U), _()

where

A= (I–H)XXT(I–H)X–XT(I–H).

For the ﬁrst term on the right-hand side of (),

E

 nε

T_A_ε

=σ



n tr

(11)

Thus

n–εTAε=OP

n–d∗pn

. ()

For the second term on the right-hand side of (), by conditions (C) and (C),

E

 nf(U)

T_Af_(U)

≤ 

nE λmax (I–H)X

XT₍_I_–_H₎_X–

XT₍_I_–_H₎

×trf(U)T(I–H)f(U)

=  nE

f(U)T(I–H)f(U)=Oq_n–r. ()

Combining ()-(),

bn=OP

n–d∗pn+q–n r

.

By conditions (A) and (A),

Ean=  nE

(βˆ–β_)TXT(I–H)X(βˆ–β_)

=E

(βˆ–β_)T

 nX

T₍_I_–_H₎_X_–

(βˆ–β_)

+E(βˆ–β_)T (βˆ–β_)

≥ τ

E ˆβ–β

_.

Therefore

ˆβ–β_=OP

n–d∗pn+qn–r+n–ϕnkn

.

Under condition (A), we have

ˆβ–β_→P .

Proof of Theorem . Letμn=n–_p

n+q–nr +

n–_ϕn

kn, we can choose a sequence

{rn,rn> }which satisﬁesrn→. PartitionR

pn

j=dj_\{__}_{into shells}_{_S

nj:j= , , . . .}, where Snj={β: j–rn≤ β–β< jrn}. For an arbitrary ﬁxed constantL∈R+, if ˆβ–βis

larger than Lrn,βˆis in one of the shells withj≥L, we have

Pr ˆβ–β_ ≥Lrn

=

l>L,l_r n>Lμn

Pr(βˆ∈Snl)

+

l>L,l_r_n_≤_Lμn

Pr(βˆ ∈Snl) (Lis an arbitrary constant),

where

l>L,l_r_n_>L__μ n

Pr(βˆ∈Snl)≤Pr ˆβ–β_ ≥L–_μ

n

(12)

and

l>L,l_r n≤Lμn

Pr(βˆ∈Snl)

=

Pr

ˆ

β∈Snl,n ≤ τ



+

Pr

ˆ

β∈Snl,n> τ



,

wheren=n–XT(I–H)X– . By condition (A),

Pr

ˆ

β∈Snl,n>τ 

≤Pr

n>τ 

=o().

Therefore,

Pr ˆβ–β_ ≥Lrn

=o() +

Pr

inf

β∈Snl

Qn(β) –Qn(β)

< ,n ≤ τ



.

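Note that

$$\begin{aligned}
Q_n(\beta) - Q_n(\beta_0) &= \|(I - H)X(\beta - \beta_0)\|_2^2 - 2\bigl(f(U) + \varepsilon\bigr)^T(I - H)X(\beta - \beta_0) + \sum_{j=1}^{p_n}\lambda_j\bigl(\|\beta_j\|_2^{\gamma} - \|\beta_{0j}\|_2^{\gamma}\bigr) \\
&\ge \|(I - H)X(\beta - \beta_0)\|_2^2 - 2\bigl(f(U) + \varepsilon\bigr)^T(I - H)X(\beta - \beta_0) + \sum_{j=1}^{k_n}\lambda_j\bigl(\|\beta_j\|_2^{\gamma} - \|\beta_{0j}\|_2^{\gamma}\bigr) \\
&= I_{n1} + I_{n2} + I_{n3}.
\end{aligned}$$

For $I_{n1}$,

$$I_{n1} \ge \inf_{\beta \in S_{nl}}\frac{n\tau_1}{2}\|\beta - \beta_0\|_2^2,$$

and for all $\beta \in S_{nl}$ we have $\|\beta - \beta_0\|_2 \ge 2^{l-1}r_n$; therefore $I_{n1} \ge n\tau_12^{2l-3}r_n^2$. For $I_{n3}$, we have

$$|I_{n3}| = \Bigl|\sum_{j=1}^{k_n}\lambda_j\gamma\|\beta_j^*\|_2^{\gamma-1}\bigl(\|\beta_j\|_2 - \|\beta_{0j}\|_2\bigr)\Bigr| \le \varphi_{n1}\gamma\sum_{j=1}^{k_n}\|\beta_j^*\|_2^{\gamma-1}\bigl|\|\beta_j\|_2 - \|\beta_{0j}\|_2\bigr|,$$

where $\|\beta_j^*\|_2$ is between $\|\beta_j\|_2$ and $\|\beta_{0j}\|_2$. By condition (A2), and since we only need to consider $\beta$ with $\beta \in S_{nl}$, $2^lr_n \le 2^{L_1}\mu_n$, there exists a constant $C > 0$ such that

$$|I_{n3}| \le C\varphi_{n1}\gamma\sum_{j=1}^{k_n}\|\beta_j - \beta_{0j}\|_2 \le C\varphi_{n1}k_n^{1/2}\gamma\|\beta - \beta_0\|_2.$$

So for all $\beta \in S_{nl}$ we have $|I_{n3}| \le C\varphi_{n1}k_n^{1/2}\gamma2^lr_n$, and by the Markov inequality,

$$\Pr\Bigl(\inf_{\beta \in S_{nl}}\bigl(Q_n(\beta) - Q_n(\beta_0)\bigr) \le 0\Bigr) \le \Pr\Bigl(\sup_{\beta \in S_{nl}}|I_{n2}| \ge n\tau_12^{2l-3}r_n^2 - C\varphi_{n1}k_n^{1/2}\gamma2^lr_n\Bigr) \le \frac{E\bigl(\sup_{\beta \in S_{nl}}|I_{n2}|\bigr)}{n\tau_12^{2l-3}r_n^2 - C\varphi_{n1}k_n^{1/2}\gamma2^lr_n}.$$

Using the Cauchy-Schwarz inequality, we have

$$\begin{aligned}
E\Bigl(\sup_{\beta \in S_{nl}}|I_{n2}|\Bigr) &\le 2\Bigl\{E\bigl[\bigl(f(U) + \varepsilon\bigr)^T(I - H)XX^T(I - H)\bigl(f(U) + \varepsilon\bigr)\bigr]\Bigr\}^{1/2} \times \Bigl\{E\Bigl[\sup_{\beta \in S_{nl}}\|\beta - \beta_0\|_2^2\Bigr]\Bigr\}^{1/2} \\
&\le 2^{l+3/2}r_n\Bigl\{E\bigl[\varepsilon^T(I - H)XX^T(I - H)\varepsilon\bigr] + E\bigl[f(U)^T(I - H)XX^T(I - H)f(U)\bigr]\Bigr\}^{1/2},
\end{aligned}$$

where

$$E\bigl[\varepsilon^T(I - H)XX^T(I - H)\varepsilon\bigr] = \sigma^2E\bigl[\operatorname{tr}\bigl((I - H)XX^T(I - H)\bigr)\bigr] = O(np_n)$$

and

$$E\bigl[f(U)^T(I - H)XX^T(I - H)f(U)\bigr] \le E\bigl[\operatorname{tr}\bigl(X^T(I - H)X\bigr)\operatorname{tr}\bigl(f(U)^T(I - H)f(U)\bigr)\bigr] = O(np_n)\,O\bigl(nq_n^{-2r}\bigr) = O\bigl(n^2p_nq_n^{-2r}\bigr).$$

Accordingly,

$$E\Bigl(\sup_{\beta \in S_{nl}}|I_{n2}|\Bigr) \le C2^lr_n\bigl(\sqrt{np_n} + n\sqrt{p_n}\,q_n^{-r}\bigr).$$

Then we can get

$$\sum_{l > L,\ 2^lr_n \le 2^{L_1}\mu_n}\Pr(\hat\beta \in S_{nl}) \le \sum_{l > L}\frac{C2^lr_n\bigl(\sqrt{np_n} + n\sqrt{p_n}\,q_n^{-r}\bigr)}{n\tau_12^{2l-3}r_n^2 - C\varphi_{n1}k_n^{1/2}\gamma2^lr_n}.$$

Choosing $r_n = \sqrt{p_n/n} + \sqrt{p_n}\,q_n^{-r}$, we have

$$\sum_{l > L,\ 2^lr_n \le 2^{L_1}\mu_n}\Pr(\hat\beta \in S_{nl}) \le \sum_{l > L}\frac{C}{\tau_12^{l-3} - C\varphi_{n1}k_n^{1/2}\gamma/\bigl(\sqrt{np_n} + n\sqrt{p_n}\,q_n^{-r}\bigr)}.$$

By condition (A5)(a), $\varphi_{n1}k_n^{1/2}/(\sqrt{np_n} + n\sqrt{p_n}\,q_n^{-r}) \to 0$, so for sufficiently large $n$,

$$2^{l-3} - C\tau_1^{-1}\varphi_{n1}k_n^{1/2}\gamma/\bigl(\sqrt{np_n} + n\sqrt{p_n}\,q_n^{-r}\bigr) \ge 2^{l-4}.$$

Thus

$$\sum_{l > L,\ 2^lr_n \le 2^{L_1}\mu_n}\Pr(\hat\beta \in S_{nl}) \le \sum_{l > L}\frac{C}{2^{l-4}} \le C2^{-(L-4)}.$$

Let $L \to \infty$, then

$$\sum_{l > L,\ 2^lr_n \le 2^{L_1}\mu_n}\Pr(\hat\beta \in S_{nl}) \to 0.$$

Hence

$$\|\hat\beta - \beta_0\|_2 = O_P\bigl(\sqrt{n^{-1}p_n} + \sqrt{p_n}\,q_n^{-r}\bigr).$$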

Proof of Theorem. (i) By Theorem ., for suﬃciently largeC,βˆ lies in the ball{β:

β–β_ ≤vnC}with probability converging to , wherevn=

n–_p

n+√pnq–nr. Letβ()=

β_+vnνandβ()=β+vnν=vnνwithν=ν+ν≤C. Let

Vn(ν,ν) =Qn(β(),β()) –Qn(β, ) =Qn(β+vnν,vnν) –Qn(β, ).

Then βˆ_ andβˆ_ can be attained by minimizingVn(ν,ν) over ν ≤C, except on an

event with probability converging to zero. We only need to show that, for someνandν

withν ≤C, ifν> ,

PrVn(ν,ν) –Vn(ν, ) > 

→, n→ ∞.

Some simple calculations show that

Vn(ν,ν) –Vn(ν, ) =vn(I–H)Xν 

+ v_n(Xν)T(I–H)(Xν)

– vn

f(U) +εT(I–H)(Xν) +

j∈/A

λjvnνjγ

= IIn+ IIn+ IIn+ IIn.

For the ﬁrst two terms IInand IIn,

IIn+ IIn≥–vn(I–H)Xν 

= –nv_nC_oP() +τ

(15)

For IIn, we have

Ef(U) +εT(I–H)Xν



≤ Ef(U)T(I–H)XννTXT(I–H)f(U)

+εT(I–H)XννTXT(I–H)ε

≤C E

trXT_(I–H)X

trf(U)T(I–H)f(U)

+σEtrXT_(I–H)X

=Onpnq–nr+npn

.

Thus we have

IIn=vn

np/_n q–_nr+n/p/_n OP().

For IIn, by  <γ< ,

j∈/A

vnνjγ /γ

≥

j∈/A

vnνj /

=vnν.

Accordingly,

IIn≥ϕnvγnνγ.

By condition (A)(b), for someν> , we have

PrVn(ν,ν) –Vn(ν, ) > 

→.

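(ii) Let $\omega_n$ be some $\sum_{j=1}^{k_n}d_j$-vector with $\|\omega_n\|_2 = 1$. By Theorem 3.3(i), with probability tending to 1, we have the following result:

$$\frac{\partial Q_n(\beta_{(1)})}{\partial\beta_{(1)}}\Big|_{\beta_{(1)} = \hat\beta_{(1)}} = 2X_1^T(I - H)X_1(\hat\beta_{(1)} - \beta_{0(1)}) - 2X_1^T(I - H)\bigl(f(U) + \varepsilon\bigr) + \xi_n = 0,$$

where $\xi_n = \bigl(\lambda_1\gamma\|\hat\beta_1\|_2^{\gamma-2}\hat\beta_1^T, \ldots, \lambda_{k_n}\gamma\|\hat\beta_{k_n}\|_2^{\gamma-2}\hat\beta_{k_n}^T\bigr)^T$. We consider the limit distribution of

$$\begin{aligned}
n^{-1/2}\omega_n^T\Sigma_{11}^{-1/2}X_1^T(I - H)X_1(\hat\beta_{(1)} - \beta_{0(1)}) &= n^{-1/2}\omega_n^T\Sigma_{11}^{-1/2}X_1^T(I - H)f(U) + n^{-1/2}\omega_n^T\Sigma_{11}^{-1/2}X_1^T(I - H)\varepsilon \\
&\quad - \tfrac{1}{2}n^{-1/2}\omega_n^T\Sigma_{11}^{-1/2}\xi_n \\
&= J_{n1} + J_{n2} + J_{n3}.
\end{aligned}$$

For $J_{n1}$,

$$J_{n1}^2 = n^{-1}\bigl\|\omega_n^T\Sigma_{11}^{-1/2}X_1^T(I - H)f(U)\bigr\|_2^2 = O_P(\ldots).$$

For $J_{n3}$, by conditions (A1) and (A2), we have

$$EJ_{n3}^2 \le n^{-1}\tau_1^{-1}\varphi_{n1}^2\gamma^2\sum_{j=1}^{k_n}E\|\hat\beta_j\|_2^{2(\gamma-1)} = O\bigl(n^{-1}\varphi_{n1}^2k_n\bigr).$$

For $J_{n2}$,

$$J_{n2} = n^{-1/2}\omega_n^T\Sigma_{11}^{-1/2}G_1^T(I - H)\varepsilon + n^{-1/2}\omega_n^T\Sigma_{11}^{-1/2}\tilde X_1^T\varepsilon - n^{-1/2}\omega_n^T\Sigma_{11}^{-1/2}\tilde X_1^TH\varepsilon = K_{n1} + K_{n2} + K_{n3}.$$

Under conditions (C1) and (C2),

$$EK_{n1}^2 = n^{-1}\omega_n^T\Sigma_{11}^{-1/2}E\bigl[G_1^T(I - H)\varepsilon\varepsilon^T(I - H)G_1\bigr]\Sigma_{11}^{-1/2}\omega_n = O\bigl(k_nq_n^{-2r}\bigr).$$

By condition (A1), we have

$$EK_{n3}^2 = n^{-1}\omega_n^T\Sigma_{11}^{-1/2}E\bigl[\tilde X_1^TH\varepsilon\varepsilon^TH\tilde X_1\bigr]\Sigma_{11}^{-1/2}\omega_n = O\bigl(n^{-1}k_nq_n\bigr).$$

Now we focus on $K_{n2}$:

$$K_{n2} = n^{-1/2}\omega_n^T\Sigma_{11}^{-1/2}\tilde X_1^T\varepsilon = \sum_{i=1}^{n}s_{ni}\varepsilon_i,$$

where $s_{ni} = n^{-1/2}\omega_n^T\Sigma_{11}^{-1/2}\tilde x_{i(1)}$ and $\tilde x_{i(1)}$ is the subvector of $\tilde x_i$ corresponding to the first $k_n$ groups. First,

$$E(s_{ni}\varepsilon_i) = 0; \qquad \operatorname{Var}\Bigl(\sum_{i=1}^{n}s_{ni}\varepsilon_i\Bigr) = \sum_{i=1}^{n}\operatorname{Var}(s_{ni}\varepsilon_i) = \sigma^2.$$

Next we verify the conditions of the Lindeberg-Feller central limit theorem. For any $\epsilon > 0$,

$$\sum_{i=1}^{n}E\bigl[s_{ni}^2\varepsilon_i^2\mathbf{1}\{|s_{ni}\varepsilon_i| > \epsilon\}\bigr] = nE\bigl[s_{n1}^2\varepsilon_1^2\mathbf{1}\{|s_{n1}\varepsilon_1| > \epsilon\}\bigr] \le n\bigl\{E\bigl[s_{n1}^4\varepsilon_1^4\bigr]\bigr\}^{1/2}\bigl\{\Pr\bigl(|s_{n1}\varepsilon_1| > \epsilon\bigr)\bigr\}^{1/2}.$$

By condition (A6), writing $\tilde x_{(1)} = x_{(1)} - E(x_{(1)} \mid U)$,

$$\begin{aligned}
E\bigl[s_{n1}^4\varepsilon_1^4\bigr] &= n^{-2}E\Bigl[\bigl(\omega_n^T\Sigma_{11}^{-1/2}\tilde x_{(1)}\tilde x_{(1)}^T\Sigma_{11}^{-1/2}\omega_n\bigr)^2\Bigr]E\bigl(\varepsilon^4\bigr) \\
&\le n^{-2}\rho_{\max}^2\bigl(\omega_n\omega_n^T\bigr)\rho_{\max}^2\bigl(\Sigma_{11}^{-1}\bigr)E\Bigl[\bigl(\tilde x_{(1)}^T\tilde x_{(1)}\bigr)^2\Bigr]E\bigl(\varepsilon^4\bigr) \\
&\le n^{-2}\rho_{\max}^2\bigl(\omega_n\omega_n^T\bigr)\rho_{\max}^2\bigl(\Sigma_{11}^{-1}\bigr)E\bigl(\varepsilon^4\bigr)k_nd^*\sum_{j=1}^{k_n}\sum_{k=1}^{d_j}E\bigl[X_{jk} - E(X_{jk} \mid U)\bigr]^4 \\
&= O\bigl(k_n^2n^{-2}\bigr)
\end{aligned}$$

and

$$\Pr\bigl(|s_{n1}\varepsilon_1| > \epsilon\bigr) \le \frac{E(s_{n1}\varepsilon_1)^2}{\epsilon^2} = \frac{\sigma^2}{\epsilon^2}n^{-1}\omega_n^T\Sigma_{11}^{-1/2}E\bigl[\tilde x_{(1)}\tilde x_{(1)}^T\bigr]\Sigma_{11}^{-1/2}\omega_n = \frac{\sigma^2}{\epsilon^2}n^{-1} = O\bigl(n^{-1}\bigr).$$

Thus we have

$$\sum_{i=1}^{n}E\bigl[s_{ni}^2\varepsilon_i^2\mathbf{1}\{|s_{ni}\varepsilon_i| > \epsilon\}\bigr] = O\bigl(n\,k_nn^{-1}\,n^{-1/2}\bigr) = o(1).$$

This means that $K_{n2} \to_D N(0, \sigma^2)$. Using Slutsky's theorem, we have

$$n^{-1/2}\omega_n^T\Sigma_{11}^{-1/2}X_1^T(I - H)X_1(\hat\beta_{(1)} - \beta_{0(1)}) \to_D N(0, \sigma^2).$$

Let $u_n^2 = n^2\omega_n^T\bigl(X_1^T(I - H)X_1\bigr)^{-1}\Sigma_{11}\bigl(X_1^T(I - H)X_1\bigr)^{-1}\omega_n$, then

$$n^{1/2}u_n^{-1}\omega_n^T(\hat\beta_{(1)} - \beta_{0(1)}) \to_D N(0, \sigma^2).$$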

Acknowledgements

This research was supported by the National Natural Science Foundation of China (Grant No. 11401340).

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

All authors contributed equally to the writing of this paper. All authors read and approved the ﬁnal manuscript.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional aﬃliations.

Received: 5 January 2017 Accepted: 16 June 2017

References

1. Tibshirani, R: Regression shrinkage and selection via the lasso. J. R. Stat. Soc. B 58, 267-288 (1996)

2. Frank, I, Friedman, J: A statistical view of some chemometrics regression tools. Technometrics 35, 109-148 (1993)

3. Fan, J, Li, R: Variable selection via nonconcave penalized likelihood and its oracle properties. J. Am. Stat. Assoc. 96, 1348-1360 (2001)

4. Zou, H: The adaptive lasso and its oracle properties. J. Am. Stat. Assoc. 101, 1418-1429 (2006)

5. Zou, H, Hastie, T: Regularization and variable selection via the elastic net. J. R. Stat. Soc. B 67, 301-320 (2005)

6. Fu, W: Penalized regressions: the bridge versus the lasso. J. Comput. Graph. Stat. 7, 397-416 (1998)

7. Knight, K, Fu, W: Asymptotics for lasso-type estimators. Ann. Stat. 28, 1356-1378 (2000)

8. Liu, Y, Zhang, H, Park, C, Ahn, J: Support vector machines with adaptive $L_q$ penalty. Comput. Stat. Data Anal. 51, 6380-6394 (2007)

9. Huang, J, Horowitz, J, Ma, S: Asymptotic properties of bridge estimators in sparse high-dimensional regression models. Ann. Stat. 36, 587-613 (2008)

11. Chen, Z, Zhu, Y, Zhu, C: Adaptive bridge estimation for high-dimensional regression models. J. Inequal. Appl. 2016, 258 (2016)

12. Huang, J, Ma, S, Xie, H, Zhang, C: A group bridge approach for variable selection. Biometrika 96, 339-355 (2009)

13. Park, C, Yoon, Y: Bridge regression: adaptivity and group selection. J. Stat. Plan. Inference 141, 3506-3519 (2011)

14. Engle, R, Granger, C, Rice, J, Weiss, A: Semiparametric estimates of the relation between weather and electricity sales. J. Am. Stat. Assoc. 81, 310-320 (1986)

15. Xie, H, Huang, J: SCAD-penalized regression in high-dimensional partially linear models. Ann. Stat. 37, 673-696 (2009)

16. Donoho, D, Johnstone, I: Ideal spatial adaptation by wavelet shrinkage. Biometrika 81, 425-455 (1994)

17. Wang, L, Li, H, Huang, J: Variable selection in nonparametric varying-coefficient models for analysis of repeated measurements. J. Am. Stat. Assoc. 103, 1556-1569 (2008)