Optimal Smoothing for a Computationally and Statistically Efficient Single Index Estimator

(1)

SFB 649 Discussion Paper 2009-028

Optimal Smoothing for a

Computationally and

Statistically Efficient Single

Index Estimator

Yingcun Xia*

Wolfgang Härdle**

Oliver Linton***

* National University of Singapore, Singapore ** Humboldt-Universität zu Berlin, Germany *** London School of Economics, United Kingdom

This research was supported by the Deutsche

Forschungsgemeinschaft through the SFB 649 "Economic Risk". http://sfb649.wiwi.hu-berlin.de

ISSN 1860-5664

SFB 649, Humboldt-Universität zu Berlin Spandauer Straße 1, D-10178 Berlin

S

FB

6

4

9 E

C

O

N

O

M

I

C

R

I

S

K

B

E

R

L

I

N

(2)

Optimal Smoothing for a Computationally and Statistically

Efficient Single Index Estimator

∗

Yingcun Xia

Department of Statistics and Applied Probability National University of Singapore

Wolfgang H¨ardle

CASE - Center for Applied Statistics & Economics Institut f¨ur Statistik und ¨Okonometrie

Wirtschaftswissenschaftliche Fakult¨at Humboldt-Universit¨at zu Berlin

D - 10178 Berlin, Germany

Oliver Linton

Department of Economics,

London School of Economics, Houghton Street, London WC2A 2AE, United Kingdom.

May 7, 2009

Abstract

In semiparametric models it is a common approach to under-smooth the nonparametric functions in order that estimators of the finite dimensional parameters can achieve root-n consistency. The requirement of under-smoothing may result as we show from inefficient estimation methods or technical difficulties. Based on local linear kernel smoother, we propose an estimation method to estimate the single-index model without under-smoothing. Under some conditions, our estimator of the single-index is asymptotically normal and most efficient in the semi-parametric sense. Moreover, we derive higher expansions for our estimator and use them to define an optimal bandwidth for the purposes of index estimation. As a result we obtain a practically more relevant method and we show its superior performance in a variety of applications.

∗_{The first author is most grateful to Professor V. Spokoiny for helpful discussions and NUS FRG R-155-000-048-112 and the}

Alexander von Humboldt Foundation for financial support. The second author thanks the Deutsche Forschungsgemeinschaft SFB 649 “ ¨Okouomisches Risiko” for financial support. The third author thanks the ESRC for financial support.

(3)

Key words and phrases: ADE; Asymptotics; Bandwidth; MAVE method; Semi-parametric efficiency. JEL classification: C00; C13; C14

1 Introduction

Single index models (SIMs) are widely used in the applied quantitative sciences. Although the context of applications for SIMs almost never prescribes the functional or distributional form of the involved statistical error, the SIM is commonly fitted with (low dimensional) likelihood principles. Both from a theoretical and practical point of view such fitting approach has been criticized and has led to semiparametric modelling. This approach involves high dimensional parameters (nonparametric functions) and a finite dimensional index parameter. Consider the following single-index model,

Y = g(θ>₀X) + ε, (1)

where E(ε|X) = 0 almost surely, g is an unknown link function, and θ0 is a single-index parameter with

length one and first element positive for identification. In this model there is a single linear combination of covariates X that can capture most information about the relation between response variable Y and covariates X, thereby avoiding the “curse of dimensionality”. Estimation of the single-index model is very attractive both in theory and in practice. In the last decade a series of papers has considered estimation of the parametric index and the nonparametric part with focus on root-n estimability and efficiency issues, see Carroll, Fan, Gijbels and Wand (1997) for an overview. There are numerous methods proposed or can be used for the estimation of the model. Amongst them, the most popular ones are the average derivative estimation (ADE) method investigated by H¨ardle and Stoker (1989), the sliced inverse regression (SIR) method proposed by Li (1989); the semiparametric least squares (SLS) method of Ichimura (1993) and the simultaneous minimization method of H¨ardle, Hall and Ichimura (1993).

The existing estimation methods are all subject to some or other of the following four critiques: (1)

Heavy computational burden: see, for example, H¨ardle, Hall and Ichimura (1993), Delecroix, H¨ardle and

Hristache (2003), Xia and Li (1999) and Xia et al. (1999). These methods include complicated optimization techniques (iteration between bandwidth choice and parameter estimation) for which no simple and effective algorithm is available up to now. (2) Strong restrictions on link functions or design of covariates X: Li (1991) required the covariate to have a symmetric distribution; H¨ardle and Stoker (1989) and Hristache

(4)

et al. (2001) needed a non-symmetric structure for the link function, i.e., |Eg0(θ>₀X)| is bounded away

from 0. If these conditions are violated, the corresponding methods are inconsistent. (3) Inefficiency: The ADE method of H¨ardle and Stoker (1989) or the improved ADE method of Hristache et al. (2001) is not asymptotically efficient in the semi-parametric sense, Bickel et al. (1993). Nishiyama and Robinson (2000, 2005) considered the Edgeworth correction to the ADE methods. H¨ardle and Tsybakov (1993) discussed the sensitivity of the ADE. Since this method involves high dimensional smoothing and derivative estimation,

its higher order properties are poor. (4) Under-smoothing: Let hoptg be the optimal bandwidth in the sense of

MISE for the estimation of link function g and let h_θ be the bandwidth used for the estimation of θ0. Most

of the methods mentioned above require the bandwidth h_θ to be much smaller than the bandwidth hoptg , i.e.

h_θ/hoptg → 0 as n → ∞, in order that estimators of θ0 can achieve root-n consistency, see, H¨ardle, and Stoker (1989) and Hristache et al. (2002), Robinson (1988), Hall (1989) and Carroll et al. (1997) among others.

Due to technical complexities, there are few investigations about how to select the bandwidth hθ for the

estimation of the single-index. Thus it could be the case that even if hθ = hoptg allows for root-n consistent

estimation of θ, that hopt_θ /hoptg → 0 or hoptg /hopt_θ → 0, where hopt_θ is the optimal bandwidth for estimation

of θ. This would mean that using a single bandwidth hoptg would result in suboptimal performance for the

estimator of θ. Higher order properties of other semiparametric procedures have been studied in Linton (1995) inter alia.

Because the estimation of θ0 is based on the estimation of the link function g, we might expect that a

good bandwidth for the link function should be a good bandwidth for the single-index, i.e., under-smoothing should be unnecessary. Unfortunately, most of the existing estimation methods involve for technical reason

“under-smoothing” the link function in order to obtain a root-n consistent estimator of θ₀. See, for example,

Härdle and Stoker (1989), Hristache et al. (2001, 2002), Carroll et al. (1997) and Xia and Li (1999). Härdle, Hall and Ichimura (1993) investigated this problem for the first time and proved that the optimal bandwidth for the estimation of the link function in the sense of MISE can be used for the estimation of the single-index to achieve root-n consistency. As mentioned above, for its computational complexity the method of Härdle, Hall and Ichimura (1993) is hard to implement in practice.

This paper presents a method of joint estimation of the parametric and nonparametric parts. It avoids undersmoothing and the computational complexity of former procedures and achieves the semiparametric efficiency bound. It is based on the MAVE method of Xia et al (2002), which we outline in the next section. Using local linear approximation and global minimization, we give a very simple iterative algorithm. The

(5)

proposed method has the following advantages: (i) the algorithm only involves one-dimensional smoothing

and is proved to converge at a geometric rate; (ii) with normal errors in the model, the estimator of θ0

is asymptotically normal and efficient in the semiparametric sense; (iii) the optimal bandwidth for the

estimation of the link function in the sense of MISE can be used to estimate θ0 with root-n consistency;

(iv) by a second order expansion, we further show that the optimal bandwidth for the estimation of the

single-index θ0, hopt_θ , is of the same magnitude as hoptg .

Therefore, the commonly used “under-smoothing” approach is inefficient in the sense of second order approximation. Powell and Stoker (1996) investigated bandwidth selection for the ADE methods. We also propose an automatic bandwidth selection method for our estimator of θ. Xia (2006) has recently shown the first order asymptotic properties of this method. Our theoretical results are proven under weak moment conditions.

In section 3 we present our main results. We show the speed of convergence, give the asymptotic estimation and derive a smoothing parameter selection procedure. In the following section we investigate the proposed estimator in simulation and application. Technical details are deferred to the appendix.

2 The MAVE method

Suppose that {Xi, Yi : i = 1, 2, . . . , n} is a random sample from model (1). The basic idea of our estimation

method is to linearly approximate the smooth link function g and to estimate θ0 by minimizing the overall

approximation errors. Xia et al (2002) proposed a procedure via the so called minimum average conditional variance estimation (MAVE). The single index model (1) is a special case of what they considered, and we

can estimate it as follows. Assuming function g and parameter θ0 are known, then the Taylor expansion of

g(θ>

0Xi) at g(θ>0x) is

g(θ>₀Xi) ≈ a + dθ>0(Xi− x),

where a = g(θ>

0x) and d = g0(θ>0x). With fixed θ, the local estimator of the conditional variance is then

σ_n2(x|θ) = min a,d{n ˆfθ(x)} −1 n X i=1 [Yi− {a + dθ>(Xi− x)}]2Kh{θ>(Xi− x)}, where ˆf_θ(x) = n−1Pn

i=1Kh{θ>(Xi− x)}, where K is a univariate density function, h is the bandwidth and

Kh(u) = K(u/h)/h; see Fan et al (1996). The value σn2(x|θ) can also be understood as the local departure

(6)

should minimize the overall departure at all x = Xj, j = 1, · · · , n. Thus, our estimator of θ0 is to minimize Qn(θ) = n X j=1 σ_n2(Xj|θ) (2)

with respect to θ : |θ| = 1. This is the so-called minimum average conditional variance estimation (MAVE) in Xia et al (2002). In practice it is necessary to include some trimming in covariate regions where density

is low, so we weight σ2

n(Xj|θ) by a sequence ˆρθj, where ˆρθj = ρn{ ˆfθ(Xj)}, that is discussed further below.

The corresponding algorithm can be stated as follows. Suppose θ1 is an initial estimate of θ0. Set the

number iteration τ = 1 and bandwidth h1. We also set a final bandwidth h. Let Xij = Xi− Xj.

Step 1: With bandwidth hτ, calculate ˆfθ(Xj) = n−1

P_n

i=1Khτ(θ>Xij) and the solutions of aj and dj to the

inner problem in (2) Ã aθ j dθ jhτ ! = n_Xn i=1 Khτ(θ>Xij) Ã 1 θ>_X ij/hτ ! Ã 1 θ>_X ij/hτ !>_o −1Xn i=1 Khτ(θ>Xij) Ã 1 θ>_X ij/hτ ! Yi.

Step 2: Fix the weight K_h_τ(θ>_X

ij), fθ(Xj), aθj and dθj. Calculate the solution of θ to (2)

θ = { n X i,j=1 K_h_τ(θ>Xij)ˆρθj{dθ(Xj)}2XijXij>fˆθ(θ>Xj)}−1 n X i,j=1 K_h_τ(θ>Xij)ˆρθjdθ(Xj)Xij(yi−aθj)/ ˆfθ(θ>Xj), where ˆρθ j = ρn{ ˆfθ(Xj)}.

Step 3: Set τ = τ + 1, θ := θ/|θ| and hτ := max{h, hτ/

√

2}, go to Step 1. Repeat steps 1 and 2 until convergence.

The iteration can be stopped by the common rule. For example, if the calculated θ’s are stable at a

certain direction, we can stop the iteration. The final vector θ := θ/|θ| is the MAVE estimator of θ0, denoted

by ˆθ. Note that these steps are an explicit algorithm of the Xia et al (2002) method for the single-index

model with some version of what the called ‘refined kernel weighting’ and boundary trimming. Similar to the other direct estimation methods, the calculation above is easy to implement. See Horowitz and H¨ardle (1996) for more discussions. After θ is estimated, the link function can be then estimated by the local linear

smoother as gθˆ(v), where ˆ gθ(v) = [n{sθ₂(v)sθ₀(v) − (sθ₁(v))2}]−1 n X i=1 {sθ₂(v) − sθ₁(v)(θ>Xi− v)/hτ}Khτ(θ>Xi− v)Yi, (3)

(7)

and sθ_k(v) = n−1Pn_i=1Khτ(θ>Xi− v){(θ>Xi− v)/hτ}k for k = 0, 1, 2. Actually, ˆg

ˆ

θ_{(v) is the final value of}

aθ_j in Step 1 with θ>Xj replaced by v.

In the algorithm, ρn(.) is a trimming function employed to handle the boundary points. There are many

choices for the estimator to achieve the root-n consistency; see e.g. H¨ardle and Stocker (1989) and HHI

(1993). However, to achieve the efficiency bound, ρn(v) must tend to 1 for all v. In this paper, we take

ρn(v) as a bounded function with third order derivatives on R such that ρn(v) = 1 if v > 2c0n−ς; ρn(v) = 0

if v ≤ c0n−ς for some constants ς > 0 and c0> 0. As an example, we can take

ρ_n(v) =        1, if v ≥ 2c0n−ς, exp{(2c0n−ς−v)−1} exp{(2c0n−ς−v)−1}+exp{(v−c0n−ς)−1}, if 2c0n −ς _{> v > c} 0n−ς, 0, if v ≤ c0n−ς. (4)

The choice of ς will be given below.

3 Main Results

We impose the following conditions to obtain the asymptotics of the estimators.

[(C1)] [Initial estimator] The initial estimator is in Θn= {θ : |θ − θ0| ≤ n−α} for some 0 < α ≤ 1/2.

[(C2)] [Design] The density function f_θ(v) of θ>_{X and its derivatives up to 6th order are bounded on R}

for all θ ∈ Θ_n, E|X|6 _{< ∞ and E|Y |}3 _{< ∞. Furthermore, sup}

v∈R,θ∈Θn|fθ(v) − fθ0(v)| ≤ c|θ − θ0| for

some constant c > 0.

[(C3)] [Link function] The conditional mean gθ(v) = E(Y |θ>X = v), E(X|θ>X = v), E(XX>|θ>X = v)

and their derivatives up to 6th order are bounded for all θ : |θ − θ0| < δ where δ > 0.

[(C4)] [Kernel function] K(v) is a symmetric density function with finite moments of all orders.

[(C5)] [Bandwidth and trimming parameter] Trimming parameter ς ≤ 1/20 and bandwidth h ∝ n−ρ _for

some ρ with 1/5 − ² ≤ ρ ≤ 1/5 + ² for some ² > 0.

Assumption (C1) is feasible because such an initial estimate is obtainable using existing methods, such as Härdle and Stoker (1989), Powell et al. (1989) and Horowitz and Härdle (1996). Actually, Härdle, Hall and

Ichimura (1993) even assumed that the initial value is in a root-n neighborhood of θ₀, {θ : |θ−θ₀| ≤ C₀n−1/2_}.

(8)

small neighborhood of θ0; see also Ichimura (1993). The moment requirement on X is not strong. H¨ardle,

Hall and Ichimura (1993) obtained their estimator in a bounded area of Rp, which is equivalent to assume

that X is bounded; see also H¨ardle and Stoker (1989). We impose slightly higher order moment requirement than second moment for Y to ensure the optimal bandwidth in (C5) can be used in applying Lemma 6.1 in section 6. The smoothness requirements on the link function in (C3) can be relaxed to the existence of a bounded second order derivative at the cost of more complicated proofs and smaller bandwidth. Assumption (C4) includes the Gaussian kernel and the quadratic kernel. Assumption (C5) includes the commonly used

optimal bandwidth in both the estimation of the link function and the estimation of the index θ0. Actually,

imposing these constraints on the bandwidth is for ease of exposition in the proofs.

Let µ_θ(x) = E(X|θ>_{X = θ}>_{x), ν}

θ(x) = µθ(x)−x, wθ(x) = E(XX>|θ>X = θ>x), W0(x) = νθ0(x)ν_θ0(x).

Let A+ _{denote the Moore-Penrose inverse of a symmetric matrix A. Recall that K is a symmetric density}

function. Thus, R K(v)dv = 1 and R vK(v)dv = 0. For ease of exposition, we further assume that µ2 =

R

v2K(v)dv = 1. Otherwise, we can redefine K(v) := µ1/2₂ K(µ1/2₂ v).

We have the following asymptotic results for the estimators.

Theorem 3.1 (Speed of algorithm) Let θτ be the value calculated in Step 3 after τ iterations. Suppose

assumptions (C1)-(C5) hold. If hτ → 0 and |θτ− θ0|/h2τ → 0, we have

θ_{τ +1}− θ₀ = 1 2{(I − θ0θ > 0) + o(1)}(θτ− θ0) + 1 2√nNn+ O(n 2ς_h4 τ)

almost surely, where Nn= [E{g0(θ>0X)2W0(X)}]+n−1/2

P_n

i=1g0(θ>0Xi)νθ0(Xi)εi = Op(n−1/2).

Theorem 3.1 indicates that the algorithm converges at a geometric rate, i.e. after each iteration, the estimation error reduces by half approximately. By Theorem 3.1 and the bandwidth requirement in the algorithm, we have

|θτ +1− θ0| = {1₂ + o(1)}|θτ +1− θ0| + O(n−1/2+ n2ςh4τ).

Starting with |θ1−θ0| = Cn−α, in order to achieve root-n consistency, say |θk−θ0| ≤ cn−1/2 i.e. 2−kCn−α≤

cn−1/2_{, the number of iterations k can be calculated roughly by}

k = {(1

2− α) log n + log(C/c)}/ log 2. (5) Based on Theorem 3.1, we immediately have the following limiting distribution.

(9)

Theorem 3.2 (Efficiency of estimator) Under the conditions (C1)-(C5), we have

√

n(ˆθ − θ₀)→ N (0, ΣL ₀),

where Σ0= [E{g0(θ>0X)2W0(X)}]+E{g0(θ>0X)2W0(X)ε2}[E{g0(θ0>X)2W0(X)}]+.

By choosing a similar trimming function, the estimators in H¨ardle, Hall and Ichimura (1993) and Ichimura (1993) have the same asymptotic covariance matrix as Theorem 3.2. If we further assume that the conditional distribution of Y given X belongs to a canonical exponential family

f_{Y |X}(y|x) = exp{yη(x) − B(η(x)) + C(y)}

for some known functions B, C and η, then Σ0 is the lower information bound in the semiparametric sense

(Bickel, Klaassen, Ritov and Wellner, 1993). See also the proofs in Carroll, Fan, Gijbels and Wand (1997) and H¨ardle, Hall and Ichimura (1993). In other words, our estimator is the most efficient in the semiparametric sense.

For the estimation of the single-index model, it was generally believed that undersmoothing the link function must be employed in order to allow the estimator of the parameters to achieve root-n consistency. However, H¨ardle, Hall and Ichimura (1993) established that undersmoothing the link function is not neces-sary. They derived an asymptotic expansion of the sum of squared residuals. We also derive an asymptotic

expansion but of the estimator bθ itself. This allows us to measure the higher order cost of estimating the

link function. We use the expansion to propose an automatic bandwidth selection procedure for the index.

Let f_θ₀(.) be the density function of θ>

0X.

Theorem 3.3 (Higher Order Expansion) Under conditions (C1)-(C5) and εi is independent of Xi, we

have almost surely

ˆ θ − θ0= En+ c_nh1,n+ c2,nh4+ Hn+ O{n2ςγn3}, where γ_n= h2_{+ (nh/ log n)}−1/2_, En= (Wn)+ n X i=1 ρn{fθ0(Xj)}g 0_(θ> 0Xi)νθ0(θ > 0Xi)εi, with W_n= n−1Pn j=1ρn{fθ0(Xj)}(g0(θ>0Xi))2νθ0(Xj)νθ>0(Xj), Hn= O{n −1/2_γ n+ n−1h−1/2} with E{HnEn} = o{(nh)−2+ h8} and c1,n= Z K2(v)v2dvσ2(nWn)−1 n X j=1 ρn{fθ(Xj)}{νθ00(Xj) + f 0 0(Xj)νθ0(Xj)/fθ0(Xj)},

(10)

c_2,n= 1 4( Z K(v)v4dv − 1)(nW_n)−1 n X j=1 ρ_n{f_θ(X_j)}g0(θ₀>X_j)g00(θ₀>X_j)ν_θ00₀(X_j).

Because K(v) is a density function and we constrain thatR v2_{K(v) = 1, it follows that µ}

4 = R

K(v)v4_{dv >}

1. In the expansion of ˆθ − θ0, the first term En does not depend on h. The second and third terms are the

leading term among the remainders. The higher order properties of this estimator are better than those of the AD method, see Nishiyama and Robinson (2000), and indeed do not reflect a curse of dimensionality.

To minimize the stochastic expansion, it is easy to see that the bandwidth should be proportional to

n−1/5_{. Moreover, by Theorem 3.2 we consider the Mahalanobis distance}

(ˆθ − θ0)>Σ+0(ˆθ − θ0) = Tn+ o{h8+ (nh)−2}, where Tn= (En+ c_1,n nh + c2,nh 4_{+ H} n)>Σ+0(En+ c_1,n nh + c2,nh 4_{+ H} n)

is the leading term. We have by Theorem 3.3 that

ETn= E(En>Σ+0En) + (_nhc1 + c2h4)>Σ+0(

c1

nh + c2h

4_{) + o{h}8_{+ (nh)}−2_},

where c₁ =RK2_(v)v2_dvσ2_W+

0 E{ν00(X) + f−1(X)f0(X)ν0(X)}, W0 = E{(g0(θ0>X))2νθ0(X)νθ>0(X)} and

c2 = 1₄( Z

K(v)v4dv − 1)W₀+E[g0(θ>₀X)g00(θ>₀X)ν_θ00₀(X)].

Note that E(E_n>Σ+₀En) does not depend on h. By minimizing ETn with respective to h, the optimal

bandwidth should be hθ= ( (9r2 2 + 16r1)1/2− 3r2 8 )_1/5 n−1/5,

where r1 = c>1Σ+0c1/(c>2Σ+0c2) and r2 = c>1Σ+0c2/c>2Σ+0c2. As a comparison, we consider the optimal

bandwidth for the estimation of the link function g. By Lemma 5.1 and Theorem 3.2, if fθ0(v) > 0 we have

ˆ g(v) = g(v) + 1 2g 00_(v)2_h2₊ 1 nfθ0(v) n X i=1 K_h(θ₀>X_i− v)ε_i+ O_P(n−1/2+ h2γ_n). (6) In other words, the link function can be estimated with the efficiency as if the index parameter vector is known. A brief proof for (6) is given in section 5. It follows that

(11)

where the leading term is Sn(v) = [1₂g00(v)2+ {nfθ0(v)}−1

P_n

i=1Kh(θ0>Xi− v)εi]2. Suppose we are interested

in constant bandwidth in region [a, b] with weight w(v). Minimizing R_[a,b]ESn(v)w(v)dv with respect to h,

we have the optimal bandwidth for the estimation of the link function is

hg = " R K2(v)dvR_[a,b]f_θ−1₀ (v)σ2_θ₀(v)w(v)dv R [a,b]g00(v)2w(v)dv #_1/5 n−1/5.

It is noticeable that the optimal bandwidth for the estimation of the parameter vector θ0 is of the same

order as that for the estimation of the link function. In other words, under-smoothing may lose efficiency for

the estimation of θ0 in the higher order sense. These optimal bandwidth hopt_θ and hoptg can be consistently

estimated by plug-in methods; see Ruppert et al (1995).

Although the optimal bandwidth for the estimation of θ is different from that for the link function, its estimation such as the plug-in method may be very unstable because of the estimation of second order derivatives. Moreover, its estimation needs another pilot parameter which is again hard to choose. In

practice it is convenient to apply hoptg for hopt_θ directly, and since hoptg and hopt_θ have the same order, the loss

of efficiency in doing so should be small. For the former, there are a number of estimation methods such as CV and GCV methods. If CV methods is used, in each iteration with the latest estimator θ, the bandwidth is selected by minimizing ˆhg = argmin h n−1 n X j=1 {Yj − ˆgθj(θ>Xj)}2

where ˆgθ_j(v) is the delete-one-observation estimator of the link function, i.e. the estimator of ˆgθ(v) in (3)

using data {(Xi, Yi), i 6= j}. Another advantage for this approach is that we can also obtain the estimator

for the link function.

4 Numerical Results

In the following calculation, the Gaussian kernel function and the trimming function (4) with ς = 1/20 and

c0= 0.01 are used. A MATLAB code rMAVE.m for the calculations below is available at

http://www.stat.nus.edu.sg/%7Estaxyc

In the first example, we check the behavior of bandwidths hg and hθ. We consider two sets of simulations

(12)

estimation of the link function g and the single-index θ0. Our models are

model A: y = (θ>₀X)2+ 0.2ε, model B: y = cos(θ₀>X) + 0.2ε,

where θ₀ = (3, 2, 2, 1, 0, 0, −1, −2, −2, −3)>_{/6, X ∼ N}

10(0, I), and ε ∼ N (0, 1) is independent of X. The

ADE method was used to choose the initial value of θ. With different sample size n and bandwidth h, we estimate the model and calculate estimation errors

errθ = {1 − |θ>0θ|}ˆ 1/2, errg= _n1 n X j=1 ρn{ ˆf_θˆ(ˆθ>Xj)}|ˆgθˆ(ˆθ>Xj) − g(θ>0Xj)|, where ˆgθˆ_(ˆ_θ>_X

j) is defined in (3). With 200 replications, we calculate the mean errors mean(errθ) and

mean(errg). The results are shown in Figure 1.

We have the following observations. (1) Notice that n1/2_mean(err

θ) tends to decrease as n increases,

which means the estimation error err_θ enjoys a root-n consistency (and slightly faster for finite sample size).

(2) Notice that the U-shape curves of err_θ has a wider bottom than those of errg. Thus, the estimation

of θ₀ is more robust to the bandwidth than the estimation of g. (3) Let hopt_θ = arg min_hmean(err_θ)

and hoptg = arg minhmean(errg). Then hopt_θ and hoptg represent the best bandwidths respectively for the

estimation of the link function g and the single-index θ0. Notice that hopt_θ /hoptg tends to increase as n

increases, which means the optimal bandwidth for the estimation of θ0 tends to zero not faster than that

for the estimation of link function. Thus the under-smoothing bandwidth is not optimal.

Next, we compare our method with some of the existing estimation methods including ADE in H¨ardle and Stocker (1993), MAVE, the method in Hristache et al (2001), called HJS hereafter, the SIR and pHd methods in Li (1991, 1992) and SLS in Ichimura (1993). For SLS, we use the algorithm in Friedman (1984) in the calculation. The algorithm has best performance among those proposed for the minimization of SLS, such as Weisberg and Welsh (1994) and Fan and Yao (2003). We consider the following model used in Hristache et al (2001),

Y = (θ>₀X)2exp(aθ₀>X) + σε, (7)

where X = (x1, · · · , x10)>, θ0 = (1, 2, 0, ..., 0)>/

√

5, x1, · · · , x10, ε are independent and ε ∼ N (0, 1). For the

covariates X: (x_k+ 1)/2 ∼ Beta(τ, 1) for k = 1, · · · , p. Parameter a is introduced to control the shape of

function. If a = 0, the structure is symmetric; the bigger it is, the more monotonic the function is.

Following Hristache et al (2001), we use the absolute deviationPp_j=1|ˆθ_j− θ_j| to measure the estimation

(13)

Figure 1: The wide solid lines are the values of log{n1/2_mean(err

θ)} and the narrow lines are the values of log{n1/2_mean(err

g)} (re-scaled for easier visualisation). The dotted vertical lines correspond to the bandwidths hθ

and hg respectively. 0 0.5 1 −6 −4 −2 0 h_θ/h g=1.34 h θ h g model A, n=50 0 0.5 1 −6 −4 −2 0 h_θ/h g=1.62 h θ h g model A, n=100 0 0.5 1 −6 −4 −2 0 h θ h g h_θ/h g=2.33 model A, n=200 0 0.5 1 −6 −4 −2 0 h θ h g h_θ/h g=2.33 model A, n=400 0 0.5 1 −6 −4 −2 0 h θ h g h_θ/h g=2.36 model A, n=800 0 0.5 1 −6 −4 −2 0 h_θ/h g=1.1 h_θ h g model B, n=100 0 0.5 1 −6 −4 −2 0 h_θ/h g=1.26 h_θ h g model B, n=200 0 0.5 1 −6 −4 −2 0 h_θ/h g=1.37 h_θ h g model B, n=400 0 0.5 1 −6 −4 −2 0 h_θ/h g=1.64 h_θ h g model B, n=800

Table 1. Average estimation errorsPp_j=1|ˆθj − θj|

and their standard deviations (in square bracket) for model (7).

a = 1 a = 0

n σ τ ADE∗ _HJS∗ _SIR/pHd _SLS _MAVE _SIR/pHd _SLS _MAVE 200 0.1 1 0.6094 0.1397 0.6521 0.0645 0.0514 0.7500 0.6910 0.0936 [0.1569] [0.0258] [0.0152] [0.1524] [1.2491] [0.0255] 200 0.2 1 0.6729 0.2773 0.6976 0.1070 0.0934 0.7833 0.8937 0.1809 [0.1759] [0.0375] [0.0294] [0.1666] [1.3192] [0.0483] 400 0.1 0.75 0.7670 0.1447 0.3778 0.1151 0.0701 0.6037 0.0742 0.0562 [0.0835] [0.0410] [0.0197] [0.1134] [0.0193] [0.0146] 400 0.1 1 0.4186 0.0822 0.4868 0.0384 0.0295 0.5820 0.5056 0.0613 [0.1149] [0.0125] [0.0096] [0.1084] [1.0831] [0.0167] 400 0.1 1.5 0.2482 0.0412 0.5670 0.0208 0.0197 0.5760 0.0923 0.0669 [0.1524] [0.0063] [0.0056] [0.1215] [0.0257] [0.0175] 400 0.2 1 0.4665 0.1659 0.5249 0.0654 0.0607 0.6084 0.7467 0.1229 [0.1353] [0.0207] [0.0178] [0.1064] [1.2655] [0.0357] 400 0.4 1 0.5016 0.3287 0.6328 0.1262 0.1120 0.6994 0.9977 0.2648 [0.1386] [0.0406] [0.0339] [0.1370] [1.2991] [0.1880]

(14)

the following observations from Table 1. Our methods has much better performance than ADE and the method of Hristache et al (2001). For each simulation, the better one of SIR and pHd is reported in Table 1, suggesting that these methods are not so competitive. Actually the main application of SIR and pHd is not in the estimation of single-index models. See Li (1991, 1992). For SLS, its performance depends much on the data and the model. If the model is easy to estimate (such as monotone and having big signal/noise ratio), it performance quite well. But overall SLS is still not so good as MAVE. The proposed method has the best performance in all the simulations we have done.

5 Proof of Theorems

Let fθ(v) be the density function of θ>X and Λn = {x : |x| < nc, fθ(x) > n−2ς, θ ∈ Θn} where c > 1/3

and ς > 0 is defined in (C5). Suppose An is a random matrix depending on x and θ. By An = O(an) (or

An=O(a_n)) we mean that all elements in A_n are O_a.s.(a_n) (or o_a.s.(a_n)) uniformly for θ ∈ Θ_n and x ∈ Λ_n.

Let δ_n = (nh/ log n)−1/2_{, γ}

n = h2+ δn and δθ = |θ − θ0|. For any vector V (v) of functions of v, we define

(V (v))0_{= dV (v)/dv.}

Suppose (X_i, Z_i), i = 1, 2, . . . , n, are i.i.d. samples from (X, Z). Let X_ix= X_i− x,

sθ_k(x) = n−1 n X i=1 Kh(θ>Xix){θ>Xix/h}k, tθk(x) = n−1 n X i=1 Kh(θ>Xix){θ>Xix/h}kXi, wθ_k(x) = n−1 n X i=1 K_h(θ>Xix){θ>Xix/h}kXiXi>, eθk(x) = n−1 n X i=1 K_h(θ>Xix){θ>Xix/h}kεi, ²θ k = sθk(x) − Esθk(x), ξθk = tθk(x) − Etθk(x), Dn,kθ (x) = sθ2(x)sθk(x) − sθ1(x)sθk+1(x), En,kθ = sθ0(x)sθk+1(x) − sθ

1(x)sθk(x) for k = 1, 2, . . .. For any random variable Z and its random observations Zi, i = 1, ..., n, let

T_n,kθ (Z|x) = sθ₂(x)n−1 n X i=1 K_hθ(Xix)(θ>Xix/h)kZi− sθ1(x)n−1 n X i=1 K_hθ(Xix)(θ>Xix/h)k+1Zi, S_n,kθ (Z|x) = sθ₀(x)n−1 n X i=1 K_hθ(Xix)(θ>Xix/h)k+1Zi− sθ1(x)n−1 n X i=1 K_hθ(Xix)(θ>Xix/h)kZi.

By the Taylor expansion of g(θ>

0Xi) at θ0>x, we have g(θ₀>Xi) = g(θ>0x) + 5 X k=1 1 k!g (k)_(θ> 0x){θ>Xix+ (θ0− θ)>Xix}k+ O({θ>Xix+ (θ0− θ)>Xix}6) = g(θ>₀x) + Aθ(x, X_i) + Bθ(x, X_i)(θ₀− θ) + O{(θ>X_ix)6+ δ3_θ(|X_i|6+ |x|6)}, (8)

(15)

where Aθ(x, Xi) = P₅ `=1(k!)−1g(k)(θ>0x)(θ>Xix)k and Bθ(x, Xi) = 5 X k=1 1 (k − 1)!g (k)_(θ> 0x)(θ>Xix)k−1Xix>+ 1 2g 00_(θ> 0x)(θ − θ0)>XixXix>.

For ease of exposition, we simplify the notation and abbreviate g for g(θ>

0x) and g0, g00, g000 for g0(θ0>x),

g00_(θ>

0x), g000(θ0>x) respectively. Without causing confusion, we write fθ(θ>x) as fθ, fθ(θ>Xj) as fθ(Xj) and

K_h(θ>_X

ij) as K_hθ(Xij). Similar notations are used for the other functions.

Lemma 5.1 (Link function) Let

Σθ_n(x) = n−1 n X i=1 K_hθ(Xix) Ã 1 θ>Xix/h ! Ã 1 θ>Xix/h !> and _Ã aθ(x) d_θ(x)h ! = {nΣθ_n(x)}−1 n X i=1 K_hθ(Xix) Ã 1 θ>_X ix/h ! Yi.

Under assumptions (C2)–(C5), we have

a_θ(x) = g(θ>₀x) + Aθ_n(x)h2+ B_nθ(x)(θ₀− θ) + V_nθ(x) + O(h2γ_n2+ δ_θ3)(1 + |x|6), dθ(x)h = g0(θ0>x)h + ˜Aθn(x)h2+ ˜Bnθ(x)(θ0− θ)h + ˜Vnθ(x) + O(h2γn2+ δ3θ)(1 + |x|6), where Aθ_n(x) = 1 2g 00₊1 4{(µ4− 1)g 00_f−2 θ (fθfθ00− 2(fθ0)2) + 1 24µ4g (4)_}h2_{+ H}θ 1,n(x), ˜ Aθ_n(x) = 1 2g 00_(µ 4− 1)f_θ−1fθ0h + 1 6g (3)_µ 4h + 1₂g00f_θ−1(²θ3− ²θ1) + O(hγn), B_nθ(x) = g0νθ+ O(γn+ δθ), B˜nθ(x) = g0(θ0>x)fθ−1{fθνθ(x)}0+ O(γn), where Hθ 1,n(x) = 12g00(θ>0x){fθ−1(²2θ− ²θ0) + (2 − µ4)f_θ−2f_θ0h²θ1− fθ−2fθ0h²θ3} +16fθ−1g000h²θ3 and Vnθ(x) = fθ−1eθ0− f_θ−2f0 θheθ1 + µ4f_θ−2f_θ00h2eθ0/2 + fθ−2(eθ0²θ2 − eθ1²θ1) − µ4f_θ−2f_θ000h3eθ1 + {fθ−2(fθ0)2− (µ4 + 1)fθ−1fθ00}{fθ−1h2eθ0 − f_θ−2f0 θh3eθ1}−fθ−1(²θ0+²θ1){fθ−1eθ0−fθ−2fθ0eθ1}+2fθ−2fθ0h²1θfθ−1eθ0and ˜Vnθ(x) = fθ−1eθ1+fθ−2fθ00h2eθ1/2+fθ−2(²θ0eθ1− ²θ 1eθ0) − fθ−2fθ0he0θ+ fθ−1²θ0[−(µ4+ 1)fθ−1fθ00h2/2 − fθ−1(²θ0+ ²1θ) + fθ−2(fθ0)2h2].

Lemma 5.2 (Summations) Let ηθ

n(x)=n−1

P_n

i=1Khθ(Xix)Xixεi. Under conditions (C1)-(C5), we have

Aθ_ndef= n−1

n

X

j=1

(16)

B_nθ def= (nh)−1 n X j=1 ρ_n{sθ₀(X_j)}eθ_k(X_j)η_nθ(X_j)/sθ₀(X_j) = ˜ck,n nh + R θ n+ O(n2ςγn3), C_nθ def= n−1 n X j=1 ρ_n(sθ₀(X_j))²θ_k(X_j)η_nθ(X_j)/sθ₀(X_j) = M_nθ+ O(n2ςγ_n3), where Eθ n= P_n i=1ρn{fθ(Xj)}g0(θ>Xi)νθ(θ>Xi)εi, rθn,0=O(1),

E_nθ = O{(n/ log n)−1/2}, Qθ_n= O{(n/ log n)−1/2γn}, Rθn= O{n−1/2δn}, Mnθ= O{n−1/2δn},

with E{Eθ

nQθn} = o(h8 + (nh)−2), E{EnθRθn} = o(h8 + (nh)−2), E{EnθMnθ} = o(h8 + (nh)−2), and ˜ck,n =

R

vk+1_K2_(v)dvE[ρ

n(fθ(Xj))f_θ−1(Xj)(νθ(Xj)fθ(Xj))0(Xj)] if k is odd, 0 otherwise.

Lemma 5.3 (Denominator) Let Dθ

n = n−2

P_n

i,j=1ρn(sθ0(Xj))d2_θ(Xj)K_hθ(Xij)XijXij>/sθ0(Xj) in the

algo-rithm. Suppose (θ, B) : p × p is an orthogonal matrix. Then under (C1)-(C5), we have almost surely

(D_nθ)−1 = θθ>dθ₁₁h−2− θdθ₁₂B>h−1− B(dθ₁₂)>θ>h−1+ Bdθ₂₂B>, where dθ₁₁= (Gθ_n)−1+O(1), dθ 12= Hnθh + O(γn), dθ22= 1 2(B >_Wθ nB)−1+ O(γn), with Gθ n= n−1 P_n j=1ρn(fθ(Xj))fθ−1(Xj)(g0(θ0Xj))2and Hnθ = 12n−1 P_n j=1ρn(fθ(Xj))fθ−1(Xj){(fθνθ)0(Xj)}> (Gθ_n)−1(g0(θ>₀Xj))2B(B>WnθB)−1 and Wnθ= n−1 P_n j=1ρn{fθ(Xj)}(g0(θ>Xi))2νθ(Xj)νθ>(Xj).

Proof of Lemma 5.3 Let (θ, B) be an orthogonal matrix. It is easy to see that

n−1 n X i=1 K_hθ(X_ix)θ>X_ixX>_ixθ = sθ₂(x)h2, n−1 n X i=1 K_hθ(X_ix)θ>X_ixX>_ixB = {tθ₁(x) − sθ₁(x)x}>Bh, n−1 n X i=1 K_hθ(Xix)B>XixX>ixB = B>{w0θ(x) − tθ0(x)x>− x(t0θ(x))>+ xx>sθ0(x)}B. Thus (Dθ_n)−1 = (θ, B) Ã Dθ 11h2 (D12θ )>Bh B>_Dθ 12h B>D22θ B !−1 (θ, B)>, where Dθ₁₁= n−1 n X j=1 ρn(sθ0(Xj)){dθ(Xj)}2sθ2(Xj)/sθ0(Xj), D₁₂θ = n−1 n X j=1 ρn(sθ0(Xj)){dθ(Xj)}2{tθ1(Xj) − sθ1(Xj)Xj}>/sθ0(Xj),

(17)

Dθ₂₂= n−1

n

X

j=1

ρ_n(sθ₀(X_j))(d_θ(X_j))2{wθ₀(X_j) − tθ₀(X_j)X_j>− X_jtθ₀(X_j) + X_jX_j>sθ₀(X_j)}/sθ₀(X_j).

By the matrix inversion formula in blocks (Schott, 1997), we have the equation in Lemma 5.3 with

d11 = {D11θ − (Dθ12)>BB>(D22θ )−1BB>Dθ12}−1, d12 = dθ11(Dθ12)>B(B>D22θ B)−1, dθ22= {B>Dθ22B}−1+

d₁₁{B>_Dθ

22B}−1B>D12θ (D12θ )>B{B>Dθ22B}−1. By Lemma 6.1, we have

Dθ₁₁= G−1_n +O(1), Dθ

12= Hnh + O(γn), Dθ22= 2Wn+ O(γn).

Thus, Lemma 5.3 follows. ¥

Lemma 5.4 (Numerator) Let N_nθ= n−2 Pn

i,j=1

ρn(sθ0(Xj))Khθ(Xij)Xij{Yi−aθ(Xj)−dθ(Xj)θ0>Xij}/sθ0(Xj).

Under assumptions (C1)–(C5), we have almost surely N_nθ = E_nθ+˜c1,n

nh + ˜c2,nh

4_{+ R}θ

n+ Bnθ(θ − θ0) + O{n2ς(γn3+ δθ3)},

where Rθ_n = O{n−1(log n/h)1/2 + (log n/n)−1/2h2}, θ>Rθ_n = O{hn−1(log n/h)1/2 + (log n/n)−1/2h3} and E{Rθ_nE₀θ} = O{(nh)−2+ h8}, B_nθ = W_nθ+O(1) with W_nθ defined in Lemma 5.3, ˜c_1,n and Eθ

0 are defined in Lemma 5.2 and ˜c2,n= 1₄(µ4− 1) n X j=1 ρn{fθ(Xj)}g0(θ0>Xj)g00(θ0>Xj)νθ00(Xj).

Proof of Theorem 3.1 By assumption (C2), we have

∞ X n=1 P ( n [ i=1 {Xi ∈ Λ/ n}) ≤ ∞ X n=1 nP (Xi∈ Λ/ n) ≤ ∞ X n=1 nP (|Xi| > nc) < ∞ X n=1 nn−6cE|X|6 < ∞

for any c > 1/3. It follows from the Borel-Cantelli lemma that

P ( ∞ \ n=1 n [ i=1 {Xi ∈ Λ/ n}) = 0. (9)

Let ˜Λn= {x : fθ(θ>x) > 2n−²}. Similarly, we have

P ( ∞ \ n=1 n [ i=1 {Xi ∈ ˜/Λn}) = 0. (10)

Thus, we can exchange summations over {Xj : j = 1, · · · , n}, {Xj : Xj ∈ Λn, j = 1, · · · , n} and {Xj : Xj ∈

˜

Λn, j = 1, · · · , n} in the sense of almost surely consistency. On the other hand, we have by (C2)

n−1 X

|Xj|<nc

(18)

By the notation in Lemmas 5.3 and 5.4, after one iteration of Steps 1-3, the new θ is ˜

θ = θ0+ (Dnθ)−1Nnθ. (11)

Note that θ>E_nθ = 0, θ>cθ_1,n= 0, θ>cθ_2,n= 0, θ>W_nθ = 0, W_nθ(W_nθ)+= I − θθ> and δθ/h2→ 0. We have

˜ θ =θ0+ θ[θ>dθ11h−2{Rθn+ Bnθ(θ − θ0) + O{n2ς(γn3 + δθ3)}} − d12θ B>h−1Nnθ] − B(dθ₁₂)>θ>h−1[Rθ_n+ Bθ_n(θ − θ0) + O{n2ς(γn3+ δ3θ)}] + Bdθ22B>Nnθ =(1 + a_n)θ₀+ {1 2(I − θ0θ > 0) + bn}(θ − θ0) + 1 2{W θ n}+Enθ+ O(h4), where an=O(1) and b_n=O(1). By (25) below, we have sθ

0(x) = fθ(θ>x) + O(γn). Thus by the smoothness of ρn(.) and (10), we have

ρ_n(sθ₀(x)) = ρ_n(f_θ(θ>x)) + O(nςγ_n) = 1 + O(nςγ_n). (12)

Since ρn(.) is bounded, we have E{ρn( ˆfθ(θ>x)) − 1}2 =O(1). By (C3) and Lemma 6.1, we have

E_nθ = n−1

n

X

i=1

g0(θ>₀Xi)ν_θ0(Xj)εi+O(n−1/2).

Note that Wn= W0+O(δ_θ). It is easy to check that |˜θ| = 1 + a_n+ b_n+ O(h4) = 1 +O(1). Thus

˜ θ/|˜θ| = θ0+ {1₂(I − θ0θ0>) +O(1)}(θ − θ0) +1₂n−1W0+ n X i=1 g0(θ>₀Xi)ν_θ0(Xi)εi+O(h3+ n−1/2).

Let θ(k) _{be the value of θ after k iteration. Because h}

k+1= max{hk/ch, h}. Therefore,

|θk+1− θ0|/h2k+1→ 0,

for all k > 1. We have

θ(k+1)= θ0+ {1₂(I − θ0θ0>) +O(1)}(θ(k)− θ0) +1₂n−1W0+ n X i=1 g0(θ>₀Xi)ν_θ0(Xi)εi+O(h3 k+ n−1/2).

Recursing the above equation, we have

θ(k+1)= θ0+ {₂1_k(I − θ0θ>0) +O(1) k X ι=1 1 2ι}(θ(1)− θ0) + { k X ι=1 1 2ι}n−1W0+ n X i=1 g0(θ>₀Xi)ν_θ0(Xi)εi +O( k X ι=1 1 2ιh 3 k−ι+ n−1/2).

(19)

Thus as the number of iterations k → ∞, Theorem 3.1 follows immediately from the above equation and

the central limit theorem. ¥

Proof of Theorem 3.3 Based on Theorem 3.2, we can assume δθ = (log n/n)1/2. Note that θ>{Enθ+

c1,n(nh)−1+ c2,nh4} = 0. We consider the product of each term in (Dθn)−1 with Nnθ. We have

θθ>dθ₁₁h−2N_nθ = θθ>dθ₁₁h−2[Rθ_n+ B_nθ(θ − θ₀) + O{n2ς(γ_n3+ δ_θ3)}] = aθ_nθ₀+ aθ_n(θ − θ₀),

θdθ₁₂B>h−1N_nθ = bθ_nθ0+ bθn(θ − θ0), B(dθ12)>θ>h−1

(D_nθ)−1Nn= θ{Sn>(θ − θ0) + HnEnθ+ O(n2ςγ4n)}

= θ0{Sn>(θ − θ0) + HnE0θ+ O(n2ςγn4)} + cn(θ − θ0),

where Sn = O(1) and cn = O(γn/h). It is easy to see that cn =O(1) providing that |θ − θ₀|/h2 → 0. By

Lemma 5.3 and 5.4, we have ˜ θ =θ0{1 + Sn>(θ − θ0) + HnE0θ+ O(n2ςγn4)} + 1 2W θ n{E0θ+ c0 1,n nh + c 0 2,nh4+ Rθn+ Qθn} + {1 2(I − θθ >_{) + c} n}(θ − θ0) + O{n2ς(γn3+ h log n/n)}.

It is easy to see that |˜θ| = 1 + S>

n(θ − θ0) + HnE0θ+ O(n2ςγn4). Thus ˜ θ/|˜θ| = θ0+1₂Wnθ{E0θ+ c0 1,n nh + c 0 2,nh4+ Rθn+ E0θHn>E0θ} + { 1 2(I − θθ >_{) + c}0 n}(θ − θ0) + O{n2ς(γn3+ h log n/n)}, where c0

n =O(1). Similar to the proof of Theorem 3.1, we complete the proof with c1,n = Wn−1c01,n and

c2,n= Wn−1c02,n. ¥

6 Proofs of the Lemmas

In this section, we first give some results about the uniform consistency. Based on these results, the Lemmas are proved.

Lemma 6.1 Suppose Gn,i(χ) is a martingale with respect to Fi = σ{Gn,`(χ), ` ≤ i} with χ ∈ X and X is a

compact region in a multidimensional space such that (I) |Gn,i(χ)| < ξi, where ξi are IID and sup Eξ2r1 < ∞

for some r > 2; (II) EG2_n,k(χ) < ans(χ) with inf s(χ) positive, and (III) |Gn,i(χ) − Gn,i( ˜χ)| < nα1|χ − ˜χ|Mi,

where Mi, i = 1, 2, ... are IID with EM12 < ∞. If and an = cn−δ with 0 ≤ δ < 1 − 2/r, then for any α10 > 0

we have sup |χ|≤nα01 ¯ ¯ ¯n−1s−1/2(χ) n X i=1 G_n,i(χ) ¯ ¯ ¯ = O{(n−1a_nlog n)1/2}

(20)

almost surely. Suppose for any fixed n and k, Gn,i,k(θ) is a martingale with respect to Fi,k = σ{Gn,`,k(θ), ` ≤

i} such that (I) |Gn,i,k(θ)| ≤ ξi, (II) EG2n,i,k(θ) < an and (III) |Gn,i,k(θ) − Gn,i,k(˜θ)| < nα2|θ − ˜θ|Mi, where

ξi, an and Mi are defined above. If E|εk|2r < ∞ and E{εk|Gn,i,j(θ), i < j, j = 1, ..., k − 1} = 0, then

sup θ∈Θ ¯ ¯ ¯n−2 n X k=2 nk−1_X i=1 Gn,i,k(θ) o εk ¯ ¯ ¯ = O{(anlog n)1/2/n} almost surely.

Proof of Lemma 6.1 We give the details for the second part of the Lemma. The first part is easier

and can be proved similarly. Let ∆n(θ) be the expression between the absolute symbols in the equation.

By (III) and the strong low of large numbers, it is easy to see that there are n1 = nα3 balls centered at

θι : Bι = {θ : |θ − θι| < n−α4} with α4 > α2+ 2, such that

S_n₁

ι=1Bι ⊃ Θ. By the strong law of large numbers,

we have max 1≤ι≤n1 sup θ∈Bι |∆n(θ) − ∆n(θι)| ≤ nα2 max 1≤ι≤n1 sup θ∈Bι |θ − θι|n−2 n X k=1 |εk| n X i=1 Mi= O{(anlog n)1/2/n}

almost surely. Let ∆_n,k(θ_ι) =Pk−1_i=1G_n,i,k(θ_ι). Next, we show that there is a constant c₁ such that

pndef= P ³_\∞ `=1 ∞ [ n=` { max 1<k≤n1<ι≤nmax1 |∆n,k(θι)| > c1(nanlog n)1/2} ´ = 0. (13)

Let T_n = {na_nlog(n)}1/2_{, G}I

n,i,k(θι) = Gn,i,k(θι)I(|Gn,i,k(θι)| ≤ Tn) and GOn,i,k(θι) = Gn,i,k(θι) − GIn,i,k(θι).

Write ∆_n,k(θι) = k−1 X i=1 {GI_n,i,k(θι) − EGIn,i,k(θι)} + k−1 X i=1 {GO_n,i,k(θι) − EGOn,i,k(θι)}. (14)

Note that E|GO_n,i,k(θι)| ≤ Tn−r+1E|ξ1|r= E|ξ1|r{nanlog(n)}−(r−1)/2. If an= cn−δwith 0 ≤ δ < 1 − 2/r and

k ≤ n, we have |

k−1

X

i=1

EGO_n,i,k(θ_ι)| ≤ E|ξ₁|r(k − 1){na_nlog(n)}−(r−1)/2≤ CE|ξ₁|r{na_nlog(n)}1/2. (15) Note that n X i=1 |GO_n,i,k(θι)| ≤ n X i=1 |ξi|I(|ξi| > Tn) ≤ Tn−r+1 n X i=1 |ξi|rI(|ξi| > Tn)

For fixed T , by the strong law of large numbers, we have

n−1

n

X

i=1

(21)

almost surely. The right hand side above is dominated by E{|ξ1|r} and → 0 as T → ∞. Note that Tn

increase to ∞ with n. For large n such that Tn> T , we have

n−1 n X i=1 |ξi|rI(|ξi| > Tn) ≤ n−1 n X i=1 |ξi|rI(|ξi| > T ) → 0

almost surely as T → ∞. It follows

n

X

i=1

|GO_n,i,k(θι)| = o(nTn−r+1) = o{(nanlog n)1/2} (16)

almost surely. Thus by (15) and (16), if c0

1 > CE|ξ1|r we have p0_n def= P ³_\∞ `=1 ∞ [ n=` { max 1<k≤n1<ι≤nmax1 | k−1 X i=1

{GO_n,i,k(θι) − EGOn,i,k(θι)}| > c01(nanlog n)1/2}

´ ≤ P ³_\∞ `=1 ∞ [ n=` { n X i=1 |ξi|(|ξi| ≥ Tn) > c01(nanlog n)1/2} ´ +P ³_\∞ `=1 ∞ [ n=` { max 1<k≤n1<ι≤nmax1 | k−1 X i=1

EGO_n,i,k(θι)| > c01(nanlog n)1/2}

´

= 0. (17)

By condition (II), if k ≤ n we have max 1≤ι≤n1 Var k−1 X i=1

{GI_n,i,k(θι) − EGIn,i,k(θι)} ≤ c2nandef= N1, (18)

where c2 is a constant. By the condition on an and the definition of GIn,i,k(θι), we have constants c3 and c4

such that

max 1≤ι≤nα|{G

I

n,i,k(θι) − EGIn,i,k(θι)}| ≤ c3Tn

= c₃{na_n/ log n}1/2{a−r_n logr+1n/nr−2}1/(2(r−1))

≤ c4{nan/ log n}1/2 def= N2. (19)

Let N3 = c5{nanlog n}1/2 with c52 > 2(α3+ 3)(c2+ c4c5). By the Bernstein’s inequality (cf. DE LA Pe˜na,

1999), we have from (18) and (19) that for any k ≤ n,

P (|

k−1

X

i=1

{GI_n,i,k(θι) − EGIn,i,k(θι)}| > N3) ≤ 2 exp µ −N2 3 2(N1+ N2N3) ¶ ≤ 2 exp{−c2₅log n/(2c2+ 2c4c5)} ≤ c6n−α3−3.

(22)

Let c1 > max{c5, c01}. We have ∞ X n=1 P n max 1<k≤n1<ι≤nmax1 | k−1 X i=1

[GI_n,i,k(θι) − EGIn,i,k(θι)]| > c1(nanlog n)1/2

o ≤ ∞ X n=1 n X k=2 n1 X ι=1 P n | k−1 X i=1

[GI_n,i,k(θ_ι) − EGI_n,i,k(θ_ι)]| > c₁(na_nlog n)1/2 o ≤ ∞ X n=1 c6n−α3−3n1+α3 < ∞. (20)

By (14), (17) and (20) and the Borel-Cantelli lemma, we have

pn≤ P n_\∞ `=1 ∞ [ n=` max 1<k≤n1<ι≤nmax1 | k−1 X i=1

[GI_n,i,k(θι) − EGIn,i,k(θι)]| > c1(nanlog n)1/2

o

+ p0_n= 0.

Therefore (13) follows.

Let ∆I

n,k(θι) = ∆n,k(θι)I{|∆n,k(θι)| ≤ c1(nanlog n)1/2} and U`(θι) =

P_` k=2∆In,k(θι)εk. Write ∆n(θι) = Un(θι) + n X k=2 ∆O_n,k(θι)εk, where ∆O

n,k(θι) = ∆n,k(θι) − ∆In,k(θι). It is easy to see from (13) that for the second part on the right hand

side above, max 1<ι≤n1 | n X k=2 ∆O_n,k(θ_ι)ε_k| = O{n(a_nlog n)1/2} (21) almost surely, since for any constant c > 0,

∞ X n=1 P { max 1<ι≤n1 | n X k=2 ∆O_n,k(θι)εk| > cn(anlog n)1/2} ≤ ∞ X n=1 P ( max 1<ι≤n1 max 1<k≤n|∆ O n,k(θι)| > 0) ≤ ∞ X n=1 P { max 1<ι≤n1 max 1<k≤n|∆n,k(θι)| > c1(nanlog n) 1/2_} < ∞.

Now consider the first term. Let Tn01/2/ log n,

U_`I(θι) = ` X k=2 ∆_n,kI (θι){εk(|εk| ≤ Tn0) − E[εk(|εk| ≤ Tn0)]} and UO

` (θι) = U`(θι) − U`I(θι). Similar to the proof of (15) and (16), we have almost surely

|

`

X

k=2

∆O_n,k(θι)E{εk(|εk| > Tn0)}| = O{n(anlog n)1/2}, (22)

|

`

X

k=2

(23)

Note that

|∆I_n,k(θι){εk(|εk| ≤ Tn0) − E[εk(|εk| ≤ Tn0)]}| < 2c1(nanlog n)1/2Tn0 = 2c1n(an/ log n)1/2 def= N4

and by (II), Var{U_`I(θι)} = c022an def= N5, where c20 is a constant. Let N6 = c03n(anlog n)1/2 with c032 >

2(α3+ 3)(2c1c03+ c02). By the Berenstein’s inequality, we have

P (|U_nI(θι)| ≥ N6) ≤ 2 exp{− N 2 6 2(N₆N₄+ N₅)} ≤ 2n −α3−3_. Therefore n X n=1 P { max 1≤ι≤n1 |U_nI(θι)| ≥ N6} < n X n=1 n1P {|UnI(θι)| ≥ N6} < ∞. By the Borel-Cantelli lemma, we have

max 1≤ι≤n1

|U_nI(θι)| = O(N6) (24)

almost surely. Lemma 6.1 follows from (21), (22), (23) and (24). ¥

Proof of Lemma 5.1 Write sθ

k(x) = ²θk(x) + Esθk(x). By Taylor expansion, we have

sθ_k(x) =

3 X

τ =0

µ_k+τf_θ(τ )(x)hτ+ ²θ_k(x) + O(h4). (25)

Because V ar{²θ_k(x)} = O{(nh)−1}, it follows from Lemma 6.1 that ²θ_k(x) = O(δn). It is easy to check that

D_n,0θ (x) = f_θ2+1

2(µ4+ 1)fθf

00

θh2− (fθ0)2h2+ f (²θ0+ ²θ2) − 2fθ0h²θ1+ O(γn2).

Dθ_n,2(x) = f_θ2+ µ4(fθfθ00− (fθ0)2)h2+ 2fθ²θ2− fθ0h²θ3− µ4fθ0h²θ1+ O(γn2).

D_n,3θ (x) = fθ²θ3+ O(hγn), Dn,4θ (x) = µ4fθ2+ O(γn), Dn,5θ (x) = O(h).

T_n,0θ (X|x) = f_θ2ν_θ(x) + O(γ_n), S_n,0θ (X|x) = O(h), T_n,kθ (X|x) = O(1), S_n,kθ (X|x) = O(1), for k ≥ 1,

T_n,0θ (|θ>Xix|6|x) = O(h6), Sn,0θ (|θ>Xix|6|x) = O(h6), Tn,0θ (XX>|x) = O(1), Sθn,0(XX>|x) = O(h),

En,2(x) = (µ4− 1)fθfθ0h + f (²θ3− ²θ1) + O(hγn), En,3(x) = µ4fθ2+ O(γn), En,4(x) = O(h).

Note that

(24)

and Aθ_n(x) = 5 X k=2 1 k!g (k)_(θ> 0x) Dθ n,k(x) Dθ n,0(x) hk−2, B_nθ(x) = 4 X k=0 1 k!g (k+1)_(θ> 0x) Tn,k(X|x) − Dn,k(x)x Dθ n,0(x) hk, Cn(x, θ) = 1₂g00(θ>0x){Tn,0(XX>|x) − Tn,0(X|x)x>− xTn,0(X>|x) + xx>Dn,0θ (x)}{Dθn,0(x)}−1, ˜ Aθ_n(x) = 4 X k=2 1 k!g (k)_(θ> 0x) Eθ n,k(x) Dθ n,0(x) hk−2, B˜_nθ(x) = 4 X k=1 k k!g (k)_(θ> 0x) S_n,k(X|x) − E_n,k(x)x Dθ n,0(x) hk, ˜ Cn(x, θ) = 1₂g00(θ>0x){Sn,0(XX>|x) − Sn,0(X|x)x>− xSn,0(X>|x) + xx>En,0θ (x)}{Dθn,0(x)}−1.

Lemma 5.1 follows from simple calculations based on the above equations. ¥

Proof of Lemma 5.2 It follows from Lemma 6.1 that ηθ

n(x) = O(δn)(1 + |x|) and sθ0 = fθ+ ˜²θ0 where

˜²θ

0 = ²θk+ (Esθk− fθ) = O(γn). Because |ρ00n(.)| < n2ς, we have

ρn(sθ0(Xj)) = ρn(fθ(Xj)) + ρ0n(fθ(Xj))˜²θ0(Xj) + O(n2ςγn2). (26) Thus Aθ_n= ˜E_nθ+ Qθ_n,1+ O(n2ςγ_n3), where ˜Eθ n= n−2 P_n i=1 P_n j=1ρn(fθ(Xj))fθ−1(Xj)g0(θ>Xj)Khθ(Xij)Xijεi, and Qθn,1= n−1 P_n

i=1Gθn,i with

Gθ_n,i= n−1 n X j=1 h 1 2f 00 θ(Xj){ρ0n(fθ(Xj)) − ρn(fθ(Xj))f_θ−1(Xj)}h2+ {1 − ρn(fθ(Xj)f_θ−1(Xj)}˜²θ0(Xj) i × f_θ−1(Xj)g0(θ>Xj)Khθ(Xij)Xijεi.

Simple calculations lead to E ˜Eθ

n = 0, E( ˜Enθ)2 = O(n−1), E(Gn,iθ ) = 0 and E(Gθn,i)2 = O{h4+ (nh)−1}. By

the first part of Lemma 6.1, we have ˜

E_nθ= O{(log n/n)1/2}, Qθ_n,1= O{h2(log n/n)1/2+ n−1(log n/h)1/2}.

By Taylor expansion, g0_(θ>

0x) = g0(θ>x) + g00(v∗)(θ0 − θ)>x, where v∗ is a value between θ>x and θ>0x. Write

˜

E_nθ= E_nθ+ Qθ_n,2+ rn,0(θ − θ0),

where Qθ_n,2 = n−2Pn_i=1Pn_j=1{ρn(fθ(Xj))f_θ−1(Xj)g0(θ>Xj)Khθ(Xij)Xij − ρn(fθ(Xi))g0(θ>Xi)νθ(Xi)}εi and

rn,0= O(γn/h). By Lemma 6.1 and that V ar(Qθn,2) = O{h4+ (nh)−1}, we have

(25)

Let Qθ_n = Qθ_n,1+ Qθ_n,2. It is easy to check that E{Qθ_nE_nθ} = o(h8 + (nh)−2). Therefore, the first part of Lemma 5.2 follows.

Similarly, we have from (26) that

Bθ_n= (nh)−1 n X j=1 {ρn(fθ(Xj)) + ρ0n(fθ(Xj))˜²θ0(Xj)}eθk(Xj)ηθn(Xj)/fθ(Xj) + O(n2ςγn4/h). Let ˜Rθ

n be the first term on the right hand side above. Then

˜ Rθ_n= n−3 n X j=1 {ρn(fθ(Xj)) + ρ0n(fθ(Xj))˜²θ0(Xj)} n X i=1 K_h2(θ>Xij)(θ>Xij/h)kXijε2i/fθ(Xj) + n−3 n X j=1 {ρn(fθ(Xj)) + ρ0n(fθ(Xj))˜²θ0(Xj)} n X i6=` Kh(θ>Xij)(θ>Xij/h)kKh(θ>X`j)X`jεiε`/fθ(Xj) def = ˜R_n,1θ + ˜Rθ_n,2+ ˜Rθ_n,3+ ˜Rθ_n,4. If ε is independent of X, then Eθ(x)def= E{Kh2(θ>Xix)(θ>Xij/h)kXixε2i} = h−1 2 X `=0 1 `!µ˜k+`{fθ(x)νθ(x)} (`)_h`_σ2_{+ O(h}2_), where ˜µk= R K2(v)vkdv. By Lemma 6.1, we have n−1 n X i=1 K_h2(θ>X_ix)(θ>X_ix/h)kX_ixε2_i − E_θ(x) = O(h−1δ_n). Thus Rθ_n,0def= (n2h)−1 n X j=1 ρn(fθ(Xj)) h n−1 n X i=1 K_h2(θ>Xij)(θ>Xij/h)kXijε2i − Eθ(Xj) i = O{(nh2)−1δn}. (27)

It is easy to check that E{E_nθRθ_n,0} = 0. Write

(n2h)−1

n

X

j=1

ρn(fθ(Xj))Eθ(Xj) = (nh)−1E{ρn(fθ(Xj))Eθ(Xj)} + Rθn,1,

where E{Rθ n,1Enθ} = 0 and Rθ_n,1= (n2h)−1 n X j=1

[ρ_n(f_θ(X_j))E_θ(X_j) − E{ρ_n(f_θ(X_j))E_θ(X_j)}] = O{(nh2)−1(n/ log n)−1/2}. (28)

Note that E{ρn(fθ(X))νθ(X)} = 0. We have

(nh)−1E{ρ_n(f_θ(X_j))E_θ(X_j)} = ˜ck,n

nh + R

θ

(26)

where Rθ_n,2= O(n−1) and E{Rθ_n,2E₀θ} = 0. By (27)-(29) and the fact that (n/ log n)−1/2= o(γn), we have ˜ Rθ_n,1= ˜ck nh + R θ n,1+ Rθn,2. (30) Similarly ˜ Rθ_n,2= O{(nh)−1γ_n}. (31) Let Gθ n,i,`= n−1 P_n j=1ρn(fθ(Xj))Kh(θ>Xij)(θ>Xij/h)kKh(θ>X`j)X`j/fθ(Xj). Write ˜Rn,3θ as ˜ Rθ_n,3= n−2 n X i6=` 1 2(G θ n,i,`+ Gθn,`,i)εiε`= n−2 n X `=1 n X i<` 1 2(G θ n,i,`+ Gθn,`,i)εi o ε`.

By the second part of Lemma 6.1, we have ˜

Rθ_n,3= O{n−1/2δn}. (32)

Similarly, we have

˜

Rθ_n,4= O{n−1/2δn}. (33)

Thus the second part of Lemma 5.2 follows from (30) and (31).

The third part of Lemma 5.2 can be proved similarly as the proof of the second part. ¥

Proof of Lemma 5.4 By (8), Lemma 5.1 and θ0 = θ + (θ0− θ), simple calculations lead to

Yi− aθ(x) − dθ(x)θ>0Xix= εi+ { ˜Aθ(x, Xi) − Aθn(x)h2} + { ˜Bθ(x, Xi) − Bnθ(x)}>(θ0− θ)

− V_nθ(x) + O{h2γ_n2+ δ3_θ},

where ˜Aθ_{(x, X}

i) = Aθ(x, Xi)−dθ(x)θ>Xix and ˜Bθ(x, Xi) = Bθ(x, Xi)−dθ(x)Xix. It follows from the Taylor

expansion that C_n,kθ (x)def= n−1 n X i=1 K_hθ(Xix)(θ>Xix/h)kXix = 5 X `=0 1 `!µk+`(f µθ) (`)_h`_{+ ˜}_ξθ k+ O(h6), where ˜ξθ k= n−1 P_n i=1{Khθ(Xix)(θ>Xix/h)kXix− EKhθ(Xix)(θ>Xix/h)kXix} = ξkθ− x²θk. We have n−1 n X i=1 K_hθ(Xix)XixA˜θ(x, Xi) = {g0(θ>0x) − dθ(x)}Cn,1(x)h + 5 X k=2 1 k!g (k)_(θ> 0x)Cn,kθ (x)hk = −1 2g 00_(µ 4− 1)fθ0fθ−1(νθfθ)0h4+ 1 2g 00_(ν θfθ)0(²θ3− ²θ1)h2+ ˜Vnθ{(νθfθ)0h + 1 6µ4(νθfθ) 00_h3_{+ ˜}_ξθ 1} +1 2g 00_h2_{f θνθ+ 1 2µ4(fθνθ) 00_h2_{} +} 1 24g (4)_µ 4fθνθh4+ 1 2g 00_h2_ξ_˜θ 2 + 1 6g 000_h3_ξ_˜θ 3+ O(h2γn2).