SFB 649 Discussion Paper 2009-028
Optimal Smoothing for a
Computationally and
Statistically Efficient Single
Index Estimator
Yingcun Xia*
Wolfgang Härdle**
Oliver Linton***
* National University of Singapore, Singapore ** Humboldt-Universität zu Berlin, Germany *** London School of Economics, United Kingdom
This research was supported by the Deutsche
Forschungsgemeinschaft through the SFB 649 "Economic Risk". http://sfb649.wiwi.hu-berlin.de
ISSN 1860-5664
SFB 649, Humboldt-Universität zu Berlin Spandauer Straße 1, D-10178 Berlin
Optimal Smoothing for a Computationally and Statistically
Efficient Single Index Estimator
∗Yingcun Xia
Department of Statistics and Applied Probability National University of Singapore
Wolfgang Härdle
CASE - Center for Applied Statistics & Economics, Institut für Statistik und Ökonometrie,
Wirtschaftswissenschaftliche Fakultät, Humboldt-Universität zu Berlin,
D - 10178 Berlin, Germany
Oliver Linton
Department of Economics,
London School of Economics, Houghton Street, London WC2A 2AE, United Kingdom.
May 7, 2009
Abstract
In semiparametric models it is a common approach to under-smooth the nonparametric functions in order that estimators of the finite dimensional parameters can achieve root-n consistency. The requirement of under-smoothing may result, as we show, from inefficient estimation methods or technical difficulties. Based on a local linear kernel smoother, we propose an estimation method to estimate the single-index model without under-smoothing. Under some conditions, our estimator of the single-index is asymptotically normal and most efficient in the semi-parametric sense. Moreover, we derive higher order expansions for our estimator and use them to define an optimal bandwidth for the purposes of index estimation. As a result we obtain a practically more relevant method and we show its superior performance in a variety of applications.
∗The first author is most grateful to Professor V. Spokoiny for helpful discussions and to NUS FRG R-155-000-048-112 and the Alexander von Humboldt Foundation for financial support. The second author thanks the Deutsche Forschungsgemeinschaft SFB 649 “Ökonomisches Risiko” for financial support. The third author thanks the ESRC for financial support.
Key words and phrases: ADE; Asymptotics; Bandwidth; MAVE method; Semi-parametric efficiency. JEL classification: C00; C13; C14
1 Introduction
Single index models (SIMs) are widely used in the applied quantitative sciences. Although the context of applications for SIMs almost never prescribes the functional or distributional form of the involved statistical error, the SIM is commonly fitted with (low dimensional) likelihood principles. Both from a theoretical and from a practical point of view, this fitting approach has been criticized, and the criticism has led to semiparametric modelling. This approach involves high dimensional parameters (nonparametric functions) and a finite dimensional index parameter. Consider the following single-index model,
$$Y = g(\theta_0^\top X) + \varepsilon, \qquad (1)$$
where $E(\varepsilon|X) = 0$ almost surely, $g$ is an unknown link function, and $\theta_0$ is a single-index parameter with length one and first element positive for identification. In this model there is a single linear combination of covariates $X$ that can capture most information about the relation between response variable $Y$ and covariates $X$, thereby avoiding the “curse of dimensionality”. Estimation of the single-index model is very attractive both in theory and in practice. In the last decade a series of papers has considered estimation of the parametric index and the nonparametric part with focus on root-n estimability and efficiency issues, see Carroll, Fan, Gijbels and Wand (1997) for an overview. Numerous methods have been proposed or can be used for the estimation of the model. Amongst them, the most popular ones are the average derivative estimation (ADE) method investigated by Härdle and Stoker (1989), the sliced inverse regression (SIR) method proposed by Li (1991), the semiparametric least squares (SLS) method of Ichimura (1993) and the simultaneous minimization method of Härdle, Hall and Ichimura (1993).
The existing estimation methods are all subject to one or more of the following four critiques. (1) Heavy computational burden: see, for example, Härdle, Hall and Ichimura (1993), Delecroix, Härdle and Hristache (2003), Xia and Li (1999) and Xia et al. (1999). These methods involve complicated optimization techniques (iteration between bandwidth choice and parameter estimation) for which no simple and effective algorithm is available up to now. (2) Strong restrictions on link functions or design of covariates $X$: Li (1991) required the covariate to have a symmetric distribution; Härdle and Stoker (1989) and Hristache et al. (2001) needed a non-symmetric structure for the link function, i.e., $|E g'(\theta_0^\top X)|$ is bounded away from 0. If these conditions are violated, the corresponding methods are inconsistent. (3) Inefficiency: The ADE method of Härdle and Stoker (1989) or the improved ADE method of Hristache et al. (2001) is not asymptotically efficient in the semi-parametric sense, Bickel et al. (1993). Nishiyama and Robinson (2000, 2005) considered the Edgeworth correction to the ADE methods. Härdle and Tsybakov (1993) discussed the sensitivity of the ADE. Since this method involves high dimensional smoothing and derivative estimation, its higher order properties are poor. (4) Under-smoothing: Let $h_g^{\rm opt}$ be the optimal bandwidth in the sense of MISE for the estimation of the link function $g$ and let $h_\theta$ be the bandwidth used for the estimation of $\theta_0$. Most of the methods mentioned above require the bandwidth $h_\theta$ to be much smaller than the bandwidth $h_g^{\rm opt}$, i.e. $h_\theta/h_g^{\rm opt} \to 0$ as $n \to \infty$, in order that estimators of $\theta_0$ can achieve root-n consistency; see Härdle and Stoker (1989), Hristache et al. (2002), Robinson (1988), Hall (1989) and Carroll et al. (1997) among others.
Due to technical complexities, there are few investigations about how to select the bandwidth $h_\theta$ for the estimation of the single-index. Thus it could be the case that, even if $h_\theta = h_g^{\rm opt}$ allows for root-n consistent estimation of $\theta$, we have $h_\theta^{\rm opt}/h_g^{\rm opt} \to 0$ or $h_g^{\rm opt}/h_\theta^{\rm opt} \to 0$, where $h_\theta^{\rm opt}$ is the optimal bandwidth for estimation of $\theta$. This would mean that using a single bandwidth $h_g^{\rm opt}$ would result in suboptimal performance for the estimator of $\theta$. Higher order properties of other semiparametric procedures have been studied in Linton (1995) inter alia.
Because the estimation of $\theta_0$ is based on the estimation of the link function $g$, we might expect that a good bandwidth for the link function should be a good bandwidth for the single-index, i.e., under-smoothing should be unnecessary. Unfortunately, most of the existing estimation methods involve, for technical reasons, “under-smoothing” the link function in order to obtain a root-n consistent estimator of $\theta_0$. See, for example, Härdle and Stoker (1989), Hristache et al. (2001, 2002), Carroll et al. (1997) and Xia and Li (1999). Härdle, Hall and Ichimura (1993) investigated this problem for the first time and proved that the optimal bandwidth for the estimation of the link function in the sense of MISE can be used for the estimation of the single-index to achieve root-n consistency. As mentioned above, because of its computational complexity the method of Härdle, Hall and Ichimura (1993) is hard to implement in practice.
This paper presents a method of joint estimation of the parametric and nonparametric parts. It avoids undersmoothing and the computational complexity of former procedures and achieves the semiparametric efficiency bound. It is based on the MAVE method of Xia et al. (2002), which we outline in the next section. Using local linear approximation and global minimization, we give a very simple iterative algorithm. The proposed method has the following advantages: (i) the algorithm only involves one-dimensional smoothing and is proved to converge at a geometric rate; (ii) with normal errors in the model, the estimator of $\theta_0$ is asymptotically normal and efficient in the semiparametric sense; (iii) the optimal bandwidth for the estimation of the link function in the sense of MISE can be used to estimate $\theta_0$ with root-n consistency; (iv) by a second order expansion, we further show that the optimal bandwidth for the estimation of the single-index $\theta_0$, $h_\theta^{\rm opt}$, is of the same magnitude as $h_g^{\rm opt}$.
Therefore, the commonly used “under-smoothing” approach is inefficient in the sense of second order approximation. Powell and Stoker (1996) investigated bandwidth selection for the ADE methods. We also propose an automatic bandwidth selection method for our estimator of $\theta$. Xia (2006) has recently shown the first order asymptotic properties of this method. Our theoretical results are proven under weak moment conditions.
In section 3 we present our main results. We show the speed of convergence, give the asymptotic distribution and derive a smoothing parameter selection procedure. In the following section we investigate the proposed estimator in simulation and application. Technical details are deferred to the appendix.
2 The MAVE method
Suppose that $\{(X_i, Y_i) : i = 1, 2, \ldots, n\}$ is a random sample from model (1). The basic idea of our estimation method is to linearly approximate the smooth link function $g$ and to estimate $\theta_0$ by minimizing the overall approximation errors. Xia et al. (2002) proposed a procedure via the so-called minimum average conditional variance estimation (MAVE). The single index model (1) is a special case of what they considered, and we can estimate it as follows. Assuming that the function $g$ and the parameter $\theta_0$ are known, the Taylor expansion of $g(\theta_0^\top X_i)$ at $\theta_0^\top x$ is
$$g(\theta_0^\top X_i) \approx a + d\,\theta_0^\top (X_i - x),$$
where $a = g(\theta_0^\top x)$ and $d = g'(\theta_0^\top x)$. With fixed $\theta$, the local estimator of the conditional variance is then
$$\sigma_n^2(x|\theta) = \min_{a,d}\ \{n \hat f_\theta(x)\}^{-1} \sum_{i=1}^n [Y_i - \{a + d\,\theta^\top (X_i - x)\}]^2 K_h\{\theta^\top (X_i - x)\},$$
where $\hat f_\theta(x) = n^{-1}\sum_{i=1}^n K_h\{\theta^\top (X_i - x)\}$, $K$ is a univariate density function, $h$ is the bandwidth and $K_h(u) = K(u/h)/h$; see Fan et al. (1996). The value $\sigma_n^2(x|\theta)$ can also be understood as the local departure of the data near $x$ from a single-index structure with direction $\theta$, so a good estimate of $\theta_0$ should minimize the overall departure at all $x = X_j$, $j = 1, \cdots, n$. Thus, our estimator of $\theta_0$ is the minimizer of
$$Q_n(\theta) = \sum_{j=1}^n \sigma_n^2(X_j|\theta) \qquad (2)$$
with respect to $\theta : |\theta| = 1$. This is the so-called minimum average conditional variance estimation (MAVE) in Xia et al. (2002). In practice it is necessary to include some trimming in covariate regions where the density is low, so we weight $\sigma_n^2(X_j|\theta)$ by a sequence $\hat\rho_j^\theta$, where $\hat\rho_j^\theta = \rho_n\{\hat f_\theta(X_j)\}$, that is discussed further below.
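The corresponding algorithm can be stated as follows. Suppose $\theta_1$ is an initial estimate of $\theta_0$. Set the iteration number $\tau = 1$ and the bandwidth $h_1$. We also set a final bandwidth $h$. Let $X_{ij} = X_i - X_j$.
Step 1: With bandwidth $h_\tau$, calculate $\hat f_\theta(X_j) = n^{-1}\sum_{i=1}^n K_{h_\tau}(\theta^\top X_{ij})$ and the solutions $a_j^\theta$ and $d_j^\theta$ of the inner problem in (2),
$$\begin{pmatrix} a_j^\theta \\ d_j^\theta h_\tau \end{pmatrix} = \Big\{ \sum_{i=1}^n K_{h_\tau}(\theta^\top X_{ij}) \begin{pmatrix} 1 \\ \theta^\top X_{ij}/h_\tau \end{pmatrix} \begin{pmatrix} 1 \\ \theta^\top X_{ij}/h_\tau \end{pmatrix}^{\top} \Big\}^{-1} \sum_{i=1}^n K_{h_\tau}(\theta^\top X_{ij}) \begin{pmatrix} 1 \\ \theta^\top X_{ij}/h_\tau \end{pmatrix} Y_i.$$
Step 2: Fix the weights $K_{h_\tau}(\theta^\top X_{ij})$, $\hat f_\theta(X_j)$, $a_j^\theta$ and $d_j^\theta$. Calculate the solution of $\theta$ to (2),
$$\theta = \Big\{ \sum_{i,j=1}^n K_{h_\tau}(\theta^\top X_{ij})\, \hat\rho_j^\theta\, (d_j^\theta)^2 X_{ij} X_{ij}^\top / \hat f_\theta(X_j) \Big\}^{-1} \sum_{i,j=1}^n K_{h_\tau}(\theta^\top X_{ij})\, \hat\rho_j^\theta\, d_j^\theta X_{ij} (Y_i - a_j^\theta) / \hat f_\theta(X_j),$$
where $\hat\rho_j^\theta = \rho_n\{\hat f_\theta(X_j)\}$.
Step 3: Set $\tau := \tau + 1$, $\theta := \theta/|\theta|$ and $h_\tau := \max\{h, h_\tau/\sqrt{2}\}$, and go to Step 1. Repeat Steps 1 and 2 until convergence.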
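Steps 1 to 3 can be sketched in a few lines of NumPy. The following is only an illustrative sketch under stated assumptions, not the authors' rMAVE.m code: the function name `mave_single_index` and all argument names are hypothetical, a Gaussian kernel is assumed, and the smooth trimming function $\rho_n$ is replaced by a simple hard threshold at $c_0 n^{-\varsigma}$ to keep the example short.

```python
import numpy as np

def mave_single_index(X, y, theta_init, h_final, h_start=1.0,
                      c0=0.01, varsigma=0.05, max_iter=50, tol=1e-6):
    """Illustrative MAVE iteration for Y = g(theta^T X) + eps (Gaussian kernel)."""
    n, p = X.shape
    theta = theta_init / np.linalg.norm(theta_init)
    h = max(h_start, h_final)
    diff = X[:, None, :] - X[None, :, :]        # diff[i, j] = X_i - X_j
    D = diff.reshape(n * n, p)                  # same pairs, flattened row-major
    for _ in range(max_iter):
        u = diff @ theta                        # u[i, j] = theta^T X_ij
        w = np.exp(-0.5 * (u / h) ** 2) / (h * np.sqrt(2.0 * np.pi))  # K_h
        f_hat = w.mean(axis=0)                  # density estimate at each X_j
        # Assumption: hard trimming instead of the smooth rho_n of the text.
        rho = (f_hat > c0 * n ** (-varsigma)).astype(float)
        # Step 1: closed-form local linear fit (a_j, d_j) at every X_j.
        s0, s1, s2 = w.sum(0), (w * u).sum(0), (w * u * u).sum(0)
        t0, t1 = (w * y[:, None]).sum(0), (w * u * y[:, None]).sum(0)
        det = s0 * s2 - s1 ** 2
        a = (s2 * t0 - s1 * t1) / det
        d = (s0 * t1 - s1 * t0) / det
        # Step 2: weighted least squares update of theta with (a_j, d_j) fixed.
        base = w * (rho / f_hat)                # K_h * rho_j / f_hat_j
        A = (D * (base * d ** 2).reshape(-1, 1)).T @ D
        b = D.T @ (base * d * (y[:, None] - a)).ravel()
        new = np.linalg.solve(A, b)
        # Step 3: normalise, fix the sign, shrink the bandwidth.
        new /= np.linalg.norm(new)
        if new[0] < 0:
            new = -new
        done = h <= h_final and np.linalg.norm(new - theta) < tol
        theta = new
        h = max(h_final, h / np.sqrt(2.0))
        if done:
            break
    return theta
```

The sketch stores all $n^2$ pairwise differences, so it is only meant for moderate sample sizes; the sign convention (first element positive) follows the identification condition stated for model (1).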
The iteration can be stopped by a common rule. For example, if the calculated $\theta$'s are stable in a certain direction, we can stop the iteration. The final vector $\theta := \theta/|\theta|$ is the MAVE estimator of $\theta_0$, denoted by $\hat\theta$. Note that these steps are an explicit algorithm of the Xia et al. (2002) method for the single-index model with some version of what they called ‘refined kernel weighting’ and boundary trimming. Similar to the other direct estimation methods, the calculation above is easy to implement. See Horowitz and Härdle (1996) for more discussion. After $\theta$ is estimated, the link function can then be estimated by the local linear smoother as $\hat g^{\hat\theta}(v)$, where
$$\hat g^\theta(v) = [n\{s_2^\theta(v) s_0^\theta(v) - (s_1^\theta(v))^2\}]^{-1} \sum_{i=1}^n \{s_2^\theta(v) - s_1^\theta(v)(\theta^\top X_i - v)/h_\tau\} K_{h_\tau}(\theta^\top X_i - v) Y_i, \qquad (3)$$
and $s_k^\theta(v) = n^{-1}\sum_{i=1}^n K_{h_\tau}(\theta^\top X_i - v)\{(\theta^\top X_i - v)/h_\tau\}^k$ for $k = 0, 1, 2$. Actually, $\hat g^{\hat\theta}(v)$ is the final value of $a_j^\theta$ in Step 1 with $\theta^\top X_j$ replaced by $v$.
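In the algorithm, $\rho_n(\cdot)$ is a trimming function employed to handle the boundary points. There are many choices for the estimator to achieve root-n consistency; see e.g. Härdle and Stoker (1989) and Härdle, Hall and Ichimura (1993). However, to achieve the efficiency bound, $\rho_n(v)$ must tend to 1 for all $v$. In this paper, we take $\rho_n(v)$ as a bounded function with third order derivatives on $\mathbb{R}$ such that $\rho_n(v) = 1$ if $v > 2c_0 n^{-\varsigma}$ and $\rho_n(v) = 0$ if $v \le c_0 n^{-\varsigma}$ for some constants $\varsigma > 0$ and $c_0 > 0$. As an example, we can take
$$\rho_n(v) = \begin{cases} 1, & \text{if } v \ge 2c_0 n^{-\varsigma}, \\[4pt] \dfrac{\exp\{(2c_0 n^{-\varsigma} - v)^{-1}\}}{\exp\{(2c_0 n^{-\varsigma} - v)^{-1}\} + \exp\{(v - c_0 n^{-\varsigma})^{-1}\}}, & \text{if } 2c_0 n^{-\varsigma} > v > c_0 n^{-\varsigma}, \\[4pt] 0, & \text{if } v \le c_0 n^{-\varsigma}. \end{cases} \qquad (4)$$
The choice of $\varsigma$ will be given below.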
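The trimming function in (4) is easy to evaluate, but a direct implementation overflows because the exponents are of order $n^{\varsigma}/c_0$. The sketch below (the function name `rho_n` and the default arguments are choices made here for illustration) uses the algebraically equivalent form $1/\{1 + \exp(b - a)\}$ with $a = (2c_0 n^{-\varsigma} - v)^{-1}$ and $b = (v - c_0 n^{-\varsigma})^{-1}$.

```python
import numpy as np

def rho_n(v, n, c0=0.01, varsigma=1.0 / 20.0):
    """Smooth trimming weight of eq. (4): 0 below c0*n^-varsigma, 1 above twice that."""
    v = np.atleast_1d(np.asarray(v, dtype=float))
    lo = c0 * n ** (-varsigma)
    hi = 2.0 * lo
    out = np.zeros_like(v)
    out[v >= hi] = 1.0
    mid = (v > lo) & (v < hi)
    a = 1.0 / (hi - v[mid])      # exponent of the numerator in (4)
    b = 1.0 / (v[mid] - lo)      # exponent of the second denominator term
    # exp(a) / (exp(a) + exp(b)) = 1 / (1 + exp(b - a)); clip to avoid overflow.
    out[mid] = 1.0 / (1.0 + np.exp(np.clip(b - a, -700.0, 700.0)))
    return out
```

The weight rises monotonically from 0 to 1 across the interval $(c_0 n^{-\varsigma}, 2c_0 n^{-\varsigma})$ and equals 1/2 at its midpoint.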
3 Main Results
We impose the following conditions to obtain the asymptotics of the estimators.
(C1) [Initial estimator] The initial estimator is in $\Theta_n = \{\theta : |\theta - \theta_0| \le n^{-\alpha}\}$ for some $0 < \alpha \le 1/2$.
(C2) [Design] The density function $f_\theta(v)$ of $\theta^\top X$ and its derivatives up to 6th order are bounded on $\mathbb{R}$ for all $\theta \in \Theta_n$, $E|X|^6 < \infty$ and $E|Y|^3 < \infty$. Furthermore, $\sup_{v \in \mathbb{R}, \theta \in \Theta_n} |f_\theta(v) - f_{\theta_0}(v)| \le c|\theta - \theta_0|$ for some constant $c > 0$.
(C3) [Link function] The conditional mean $g_\theta(v) = E(Y|\theta^\top X = v)$, $E(X|\theta^\top X = v)$, $E(XX^\top|\theta^\top X = v)$ and their derivatives up to 6th order are bounded for all $\theta : |\theta - \theta_0| < \delta$ where $\delta > 0$.
(C4) [Kernel function] $K(v)$ is a symmetric density function with finite moments of all orders.
(C5) [Bandwidth and trimming parameter] Trimming parameter $\varsigma \le 1/20$ and bandwidth $h \propto n^{-\rho}$ for some $\rho$ with $1/5 - \epsilon \le \rho \le 1/5 + \epsilon$ for some $\epsilon > 0$.
Assumption (C1) is feasible because such an initial estimate is obtainable using existing methods, such as Härdle and Stoker (1989), Powell et al. (1989) and Horowitz and Härdle (1996). Actually, Härdle, Hall and Ichimura (1993) even assumed that the initial value is in a root-n neighborhood of $\theta_0$, $\{\theta : |\theta - \theta_0| \le C_0 n^{-1/2}\}$.
[...] small neighborhood of $\theta_0$; see also Ichimura (1993). The moment requirement on $X$ is not strong. Härdle, Hall and Ichimura (1993) obtained their estimator in a bounded area of $\mathbb{R}^p$, which is equivalent to assuming that $X$ is bounded; see also Härdle and Stoker (1989). We impose a slightly higher order moment requirement than the second moment for $Y$ to ensure that the optimal bandwidth in (C5) can be used in applying Lemma 6.1 in section 6. The smoothness requirements on the link function in (C3) can be relaxed to the existence of a bounded second order derivative at the cost of more complicated proofs and a smaller bandwidth. Assumption (C4) includes the Gaussian kernel and the quadratic kernel. Assumption (C5) includes the commonly used optimal bandwidth in both the estimation of the link function and the estimation of the index $\theta_0$. Actually, imposing these constraints on the bandwidth is for ease of exposition in the proofs.
Let $\mu_\theta(x) = E(X|\theta^\top X = \theta^\top x)$, $\nu_\theta(x) = \mu_\theta(x) - x$, $w_\theta(x) = E(XX^\top|\theta^\top X = \theta^\top x)$ and $W_0(x) = \nu_{\theta_0}(x)\nu_{\theta_0}^\top(x)$. Let $A^+$ denote the Moore-Penrose inverse of a symmetric matrix $A$. Recall that $K$ is a symmetric density function. Thus, $\int K(v)\,dv = 1$ and $\int vK(v)\,dv = 0$. For ease of exposition, we further assume that $\mu_2 = \int v^2K(v)\,dv = 1$. Otherwise, we can redefine $K(v) := \mu_2^{1/2}K(\mu_2^{1/2}v)$.
We have the following asymptotic results for the estimators.
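Theorem 3.1 (Speed of algorithm) Let $\theta_\tau$ be the value calculated in Step 3 after $\tau$ iterations. Suppose assumptions (C1)-(C5) hold. If $h_\tau \to 0$ and $|\theta_\tau - \theta_0|/h_\tau^2 \to 0$, we have
$$\theta_{\tau+1} - \theta_0 = \frac{1}{2}\{(I - \theta_0\theta_0^\top) + o(1)\}(\theta_\tau - \theta_0) + \frac{1}{2\sqrt{n}}N_n + O(n^{2\varsigma}h_\tau^4)$$
almost surely, where $N_n = [E\{g'(\theta_0^\top X)^2 W_0(X)\}]^+ n^{-1/2}\sum_{i=1}^n g'(\theta_0^\top X_i)\nu_{\theta_0}(X_i)\varepsilon_i = O_p(n^{-1/2})$.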
Theorem 3.1 indicates that the algorithm converges at a geometric rate, i.e. after each iteration, the estimation error is reduced approximately by half. By Theorem 3.1 and the bandwidth requirement in the algorithm, we have
$$|\theta_{\tau+1} - \theta_0| = \{\tfrac{1}{2} + o(1)\}|\theta_\tau - \theta_0| + O(n^{-1/2} + n^{2\varsigma}h_\tau^4).$$
Starting with $|\theta_1 - \theta_0| = Cn^{-\alpha}$, in order to achieve root-n consistency, say $|\theta_k - \theta_0| \le cn^{-1/2}$, i.e. $2^{-k}Cn^{-\alpha} \le cn^{-1/2}$, the number of iterations $k$ can be calculated roughly by
$$k = \{(\tfrac{1}{2} - \alpha)\log n + \log(C/c)\}/\log 2. \qquad (5)$$
Based on Theorem 3.1, we immediately have the following limiting distribution.
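As a brief numerical aside on (5), the rough iteration count can be evaluated directly. The helper below is a hypothetical convenience function (its name and the default values $C = c = 1$ are assumptions made here); it rounds (5) up to the next integer and never returns a negative count.

```python
import math

def iterations_needed(n, alpha, C=1.0, c=1.0):
    """Rough number of halving steps from eq. (5), rounded up."""
    k = ((0.5 - alpha) * math.log(n) + math.log(C / c)) / math.log(2.0)
    return max(0, math.ceil(k))
```

For instance, with $n = 1000$ and an initial estimator of rate $n^{-1/4}$, (5) gives $0.25 \log_2 1000 \approx 2.49$, so three iterations suffice under this rough count.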
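Theorem 3.2 (Efficiency of estimator) Under the conditions (C1)-(C5), we have
$$\sqrt{n}(\hat\theta - \theta_0) \stackrel{L}{\to} N(0, \Sigma_0),$$
where $\Sigma_0 = [E\{g'(\theta_0^\top X)^2 W_0(X)\}]^+ E\{g'(\theta_0^\top X)^2 W_0(X)\varepsilon^2\} [E\{g'(\theta_0^\top X)^2 W_0(X)\}]^+$.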
By choosing a similar trimming function, the estimators in Härdle, Hall and Ichimura (1993) and Ichimura (1993) have the same asymptotic covariance matrix as in Theorem 3.2. If we further assume that the conditional distribution of $Y$ given $X$ belongs to a canonical exponential family
$$f_{Y|X}(y|x) = \exp\{y\eta(x) - B(\eta(x)) + C(y)\}$$
for some known functions $B$, $C$ and $\eta$, then $\Sigma_0$ is the lower information bound in the semiparametric sense (Bickel, Klaassen, Ritov and Wellner, 1993). See also the proofs in Carroll, Fan, Gijbels and Wand (1997) and Härdle, Hall and Ichimura (1993). In other words, our estimator is the most efficient in the semiparametric sense.
For the estimation of the single-index model, it was generally believed that undersmoothing the link function must be employed in order to allow the estimator of the parameters to achieve root-n consistency. However, Härdle, Hall and Ichimura (1993) established that undersmoothing the link function is not necessary. They derived an asymptotic expansion of the sum of squared residuals. We also derive an asymptotic expansion, but of the estimator $\hat\theta$ itself. This allows us to measure the higher order cost of estimating the link function. We use the expansion to propose an automatic bandwidth selection procedure for the index. Let $f_{\theta_0}(\cdot)$ be the density function of $\theta_0^\top X$.
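Theorem 3.3 (Higher Order Expansion) Under conditions (C1)-(C5), and if $\varepsilon_i$ is independent of $X_i$, we have almost surely
$$\hat\theta - \theta_0 = E_n + \frac{c_{1,n}}{nh} + c_{2,n}h^4 + H_n + O\{n^{2\varsigma}\gamma_n^3\},$$
where $\gamma_n = h^2 + (nh/\log n)^{-1/2}$,
$$E_n = (W_n)^+ \sum_{i=1}^n \rho_n\{f_{\theta_0}(X_i)\}\, g'(\theta_0^\top X_i)\, \nu_{\theta_0}(X_i)\, \varepsilon_i,$$
with $W_n = n^{-1}\sum_{j=1}^n \rho_n\{f_{\theta_0}(X_j)\}(g'(\theta_0^\top X_j))^2 \nu_{\theta_0}(X_j)\nu_{\theta_0}^\top(X_j)$, $H_n = O\{n^{-1/2}\gamma_n + n^{-1}h^{-1/2}\}$ with $E\{H_nE_n\} = o\{(nh)^{-2} + h^8\}$, and
$$c_{1,n} = \int K^2(v)v^2\,dv\ \sigma^2 (nW_n)^{-1} \sum_{j=1}^n \rho_n\{f_\theta(X_j)\}\{\nu'_{\theta_0}(X_j) + f'_{\theta_0}(X_j)\nu_{\theta_0}(X_j)/f_{\theta_0}(X_j)\},$$
$$c_{2,n} = \frac{1}{4}\Big(\int K(v)v^4\,dv - 1\Big)(nW_n)^{-1} \sum_{j=1}^n \rho_n\{f_\theta(X_j)\}\, g'(\theta_0^\top X_j)\, g''(\theta_0^\top X_j)\, \nu''_{\theta_0}(X_j).$$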
Because $K(v)$ is a density function and we impose the constraint $\int v^2K(v)\,dv = 1$, it follows that $\mu_4 = \int K(v)v^4\,dv > 1$. In the expansion of $\hat\theta - \theta_0$, the first term $E_n$ does not depend on $h$. The second and third terms are the leading terms among the remainders. The higher order properties of this estimator are better than those of the ADE method, see Nishiyama and Robinson (2000), and indeed do not reflect a curse of dimensionality.
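To minimize the stochastic expansion, it is easy to see that the bandwidth should be proportional to $n^{-1/5}$. Moreover, by Theorem 3.2 we consider the Mahalanobis distance
$$(\hat\theta - \theta_0)^\top \Sigma_0^+ (\hat\theta - \theta_0) = T_n + o\{h^8 + (nh)^{-2}\},$$
where
$$T_n = \Big(E_n + \frac{c_{1,n}}{nh} + c_{2,n}h^4 + H_n\Big)^\top \Sigma_0^+ \Big(E_n + \frac{c_{1,n}}{nh} + c_{2,n}h^4 + H_n\Big)$$
is the leading term. We have by Theorem 3.3 that
$$ET_n = E(E_n^\top \Sigma_0^+ E_n) + \Big(\frac{c_1}{nh} + c_2h^4\Big)^\top \Sigma_0^+ \Big(\frac{c_1}{nh} + c_2h^4\Big) + o\{h^8 + (nh)^{-2}\},$$
where $c_1 = \int K^2(v)v^2\,dv\ \sigma^2 W_0^+ E\{\nu'_0(X) + f^{-1}(X)f'(X)\nu_0(X)\}$, $W_0 = E\{(g'(\theta_0^\top X))^2\nu_{\theta_0}(X)\nu_{\theta_0}^\top(X)\}$ and
$$c_2 = \frac{1}{4}\Big(\int K(v)v^4\,dv - 1\Big) W_0^+ E[g'(\theta_0^\top X)g''(\theta_0^\top X)\nu''_{\theta_0}(X)].$$
Note that $E(E_n^\top \Sigma_0^+ E_n)$ does not depend on $h$. By minimizing $ET_n$ with respect to $h$, the optimal bandwidth should be
$$h_\theta = \Big\{\frac{(9r_2^2 + 16r_1)^{1/2} - 3r_2}{8}\Big\}^{1/5} n^{-1/5},$$
where $r_1 = c_1^\top\Sigma_0^+c_1/(c_2^\top\Sigma_0^+c_2)$ and $r_2 = c_1^\top\Sigma_0^+c_2/(c_2^\top\Sigma_0^+c_2)$. As a comparison, we consider the optimal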
bandwidth for the estimation of the link function $g$. By Lemma 5.1 and Theorem 3.2, if $f_{\theta_0}(v) > 0$ we have
$$\hat g(v) = g(v) + \frac{1}{2}g''(v)h^2 + \frac{1}{nf_{\theta_0}(v)}\sum_{i=1}^n K_h(\theta_0^\top X_i - v)\varepsilon_i + O_P(n^{-1/2} + h^2\gamma_n). \qquad (6)$$
In other words, the link function can be estimated with the same efficiency as if the index parameter vector were known. A brief proof of (6) is given in section 5. It follows that the squared error $\{\hat g(v) - g(v)\}^2$ has the leading term $S_n(v) = [\frac{1}{2}g''(v)h^2 + \{nf_{\theta_0}(v)\}^{-1}\sum_{i=1}^n K_h(\theta_0^\top X_i - v)\varepsilon_i]^2$. Suppose we are interested in a constant bandwidth in the region $[a, b]$ with weight $w(v)$. Minimizing $\int_{[a,b]} ES_n(v)w(v)\,dv$ with respect to $h$, we find that the optimal bandwidth for the estimation of the link function is
$$h_g = \Bigg[\frac{\int K^2(v)\,dv \int_{[a,b]} f_{\theta_0}^{-1}(v)\sigma_{\theta_0}^2(v)w(v)\,dv}{\int_{[a,b]} g''(v)^2w(v)\,dv}\Bigg]^{1/5} n^{-1/5}.$$
It is noticeable that the optimal bandwidth for the estimation of the parameter vector $\theta_0$ is of the same order as that for the estimation of the link function. In other words, under-smoothing may lose efficiency for the estimation of $\theta_0$ in the higher order sense. These optimal bandwidths $h_\theta^{\rm opt}$ and $h_g^{\rm opt}$ can be consistently estimated by plug-in methods; see Ruppert et al. (1995).
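The closed form for $h_\theta$ can be checked numerically. Dividing the $h$-dependent part of $ET_n$ by $c_2^\top\Sigma_0^+c_2$ gives the objective $q(h) = r_1/(nh)^2 + 2r_2h^3/n + h^8$, and setting $q'(h) = 0$ with $u = nh^5$ yields $4u^2 + 3r_2u - r_1 = 0$, whose positive root is the expression in braces. The sketch below (function names are hypothetical, and $r_1$, $r_2$ are treated as known inputs rather than estimated) evaluates both.

```python
import numpy as np

def h_theta_opt(r1, r2, n):
    """Closed-form minimiser of q(h); u = n * h**5 solves 4u^2 + 3*r2*u - r1 = 0."""
    u = (np.sqrt(9.0 * r2 ** 2 + 16.0 * r1) - 3.0 * r2) / 8.0
    return (u / n) ** 0.2

def objective(h, r1, r2, n):
    """h-dependent part of E T_n, scaled by c2' Sigma0^+ c2."""
    return r1 / (n * h) ** 2 + 2.0 * r2 * h ** 3 / n + h ** 8
```

A grid search over `objective` should land on the value returned by `h_theta_opt`; the inputs `r1 > 0` and `r2` would in practice have to be replaced by plug-in estimates.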
Although the optimal bandwidth for the estimation of $\theta$ is different from that for the link function, its estimation by, e.g., the plug-in method may be very unstable because of the estimation of second order derivatives. Moreover, its estimation needs another pilot parameter which is again hard to choose. In practice it is convenient to use $h_g^{\rm opt}$ in place of $h_\theta^{\rm opt}$ directly, and since $h_g^{\rm opt}$ and $h_\theta^{\rm opt}$ have the same order, the loss of efficiency in doing so should be small. For the former, there are a number of estimation methods such as the CV and GCV methods. If the CV method is used, in each iteration with the latest estimator $\theta$, the bandwidth is selected by
$$\hat h_g = \arg\min_h\ n^{-1}\sum_{j=1}^n \{Y_j - \hat g_j^\theta(\theta^\top X_j)\}^2,$$
where $\hat g_j^\theta(v)$ is the delete-one-observation estimator of the link function, i.e. the estimator $\hat g^\theta(v)$ in (3) using data $\{(X_i, Y_i), i \ne j\}$. Another advantage of this approach is that we also obtain the estimator of the link function.
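The delete-one criterion can be sketched as follows for a fixed index direction. This is a minimal illustration under assumptions, not the authors' implementation: a Gaussian kernel is used, the bandwidth is searched over a user-supplied grid, trimming is omitted, and the function names `loo_local_linear` and `cv_bandwidth` are hypothetical.

```python
import numpy as np

def loo_local_linear(v, y, h):
    """Delete-one local linear fit of y on the scalar index v, evaluated at each v_j."""
    u = v[:, None] - v[None, :]                  # u[i, j] = v_i - v_j
    w = np.exp(-0.5 * (u / h) ** 2)              # Gaussian kernel; constants cancel
    np.fill_diagonal(w, 0.0)                     # drop observation j at point v_j
    s0, s1, s2 = w.sum(0), (w * u).sum(0), (w * u ** 2).sum(0)
    t0, t1 = (w * y[:, None]).sum(0), (w * u * y[:, None]).sum(0)
    with np.errstate(divide="ignore", invalid="ignore"):
        return (s2 * t0 - s1 * t1) / (s0 * s2 - s1 ** 2)

def cv_bandwidth(X, y, theta, grid):
    """Return the grid bandwidth minimising the delete-one squared prediction error."""
    v = X @ theta
    scores = []
    for h in grid:
        score = np.mean((y - loo_local_linear(v, y, h)) ** 2)
        scores.append(score if np.isfinite(score) else np.inf)
    scores = np.array(scores)
    return grid[int(np.argmin(scores))], scores
```

Bandwidths for which the local fit degenerates at some point (for example, an isolated observation with a very small `h`) receive an infinite score, so they are never selected.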
4 Numerical Results
In the following calculations, the Gaussian kernel function and the trimming function (4) with $\varsigma = 1/20$ and $c_0 = 0.01$ are used. A MATLAB code rMAVE.m for the calculations below is available at
http://www.stat.nus.edu.sg/%7Estaxyc
In the first example, we check the behavior of the bandwidths $h_g$ and $h_\theta$. We consider two sets of simulations for the estimation of the link function $g$ and the single-index $\theta_0$. Our models are
$$\text{model A: } y = (\theta_0^\top X)^2 + 0.2\varepsilon, \qquad \text{model B: } y = \cos(\theta_0^\top X) + 0.2\varepsilon,$$
where $\theta_0 = (3, 2, 2, 1, 0, 0, -1, -2, -2, -3)^\top/6$, $X \sim N_{10}(0, I)$, and $\varepsilon \sim N(0, 1)$ is independent of $X$. The ADE method was used to choose the initial value of $\theta$. With different sample sizes $n$ and bandwidths $h$, we estimate the model and calculate the estimation errors
$$\mathrm{err}_\theta = \{1 - |\theta_0^\top\hat\theta|\}^{1/2}, \qquad \mathrm{err}_g = \frac{1}{n}\sum_{j=1}^n \rho_n\{\hat f_{\hat\theta}(\hat\theta^\top X_j)\}\,|\hat g_{\hat\theta}(\hat\theta^\top X_j) - g(\theta_0^\top X_j)|,$$
where $\hat g_{\hat\theta}(\hat\theta^\top X_j)$ is defined in (3). With 200 replications, we calculate the mean errors $\mathrm{mean}(\mathrm{err}_\theta)$ and $\mathrm{mean}(\mathrm{err}_g)$. The results are shown in Figure 1.
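The simulation design for models A and B and the index error $\mathrm{err}_\theta$ can be written down directly. The sketch below only generates data and evaluates the error measure; the names `THETA0`, `simulate` and `err_theta` are illustrative choices, and the estimation step itself is not included.

```python
import numpy as np

# Index vector of models A and B; its Euclidean norm is sqrt(36)/6 = 1.
THETA0 = np.array([3, 2, 2, 1, 0, 0, -1, -2, -2, -3], dtype=float) / 6.0

def simulate(model, n, rng):
    """Draw (X, y) from model A (quadratic link) or model B (cosine link)."""
    X = rng.standard_normal((n, 10))
    index = X @ THETA0
    g = index ** 2 if model == "A" else np.cos(index)
    y = g + 0.2 * rng.standard_normal(n)
    return X, y

def err_theta(theta_hat, theta0=THETA0):
    """err_theta = {1 - |theta0' theta_hat|}^(1/2) after normalising theta_hat."""
    theta_hat = theta_hat / np.linalg.norm(theta_hat)
    return np.sqrt(max(0.0, 1.0 - abs(theta0 @ theta_hat)))
```

Because of the absolute value, `err_theta` does not distinguish $\hat\theta$ from $-\hat\theta$, which is appropriate for the symmetric links of both models.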
We have the following observations. (1) Notice that $n^{1/2}\mathrm{mean}(\mathrm{err}_\theta)$ tends to decrease as $n$ increases, which means the estimation error $\mathrm{err}_\theta$ enjoys root-n consistency (and a slightly faster rate for finite sample sizes). (2) Notice that the U-shaped curves of $\mathrm{err}_\theta$ have a wider bottom than those of $\mathrm{err}_g$. Thus, the estimation of $\theta_0$ is more robust to the bandwidth than the estimation of $g$. (3) Let $h_\theta^{\rm opt} = \arg\min_h \mathrm{mean}(\mathrm{err}_\theta)$ and $h_g^{\rm opt} = \arg\min_h \mathrm{mean}(\mathrm{err}_g)$. Then $h_\theta^{\rm opt}$ and $h_g^{\rm opt}$ represent the best bandwidths for the estimation of the single-index $\theta_0$ and of the link function $g$, respectively. Notice that $h_\theta^{\rm opt}/h_g^{\rm opt}$ tends to increase as $n$ increases, which means the optimal bandwidth for the estimation of $\theta_0$ tends to zero no faster than that for the estimation of the link function. Thus the under-smoothing bandwidth is not optimal.
Next, we compare our method with some of the existing estimation methods, including ADE in Härdle and Stoker (1989), MAVE, the method in Hristache et al. (2001), called HJS hereafter, the SIR and pHd methods in Li (1991, 1992) and SLS in Ichimura (1993). For SLS, we use the algorithm in Friedman (1984) in the calculation. This algorithm has the best performance among those proposed for the minimization of SLS, such as Weisberg and Welsh (1994) and Fan and Yao (2003). We consider the following model used in Hristache et al. (2001),
$$Y = (\theta_0^\top X)^2\exp(a\,\theta_0^\top X) + \sigma\varepsilon, \qquad (7)$$
where $X = (x_1, \cdots, x_{10})^\top$, $\theta_0 = (1, 2, 0, \ldots, 0)^\top/\sqrt{5}$, $x_1, \cdots, x_{10}, \varepsilon$ are independent and $\varepsilon \sim N(0, 1)$. For the covariates $X$: $(x_k + 1)/2 \sim \mathrm{Beta}(\tau, 1)$ for $k = 1, \cdots, p$. The parameter $a$ is introduced to control the shape of the function. If $a = 0$, the structure is symmetric; the bigger it is, the more monotonic the function is.
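Data from model (7) can be generated as in the following sketch. The function names `simulate_model7` and `abs_deviation` are hypothetical, and only the data-generating step and the absolute-deviation error measure are shown, not any of the compared estimators.

```python
import numpy as np

def simulate_model7(n, sigma, tau, a, rng, p=10):
    """Draw (X, y, theta0) from model (7) with (x_k + 1)/2 ~ Beta(tau, 1)."""
    theta0 = np.zeros(p)
    theta0[:2] = [1.0, 2.0]
    theta0 /= np.sqrt(5.0)
    X = 2.0 * rng.beta(tau, 1.0, size=(n, p)) - 1.0   # covariates lie in [-1, 1]
    index = X @ theta0
    y = index ** 2 * np.exp(a * index) + sigma * rng.standard_normal(n)
    return X, y, theta0

def abs_deviation(theta_hat, theta0):
    """Sum of absolute coordinate-wise deviations between two index vectors."""
    return float(np.sum(np.abs(theta_hat - theta0)))
```

A call such as `simulate_model7(200, 0.1, 1.0, 1.0, np.random.default_rng(0))` corresponds to the setting $n = 200$, $\sigma = 0.1$, $\tau = 1$, $a = 1$.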
Following Hristache et al. (2001), we use the absolute deviation $\sum_{j=1}^p |\hat\theta_j - \theta_j|$ to measure the estimation errors.
Figure 1: The wide solid lines are the values of $\log\{n^{1/2}\mathrm{mean}(\mathrm{err}_\theta)\}$ and the narrow lines are the values of $\log\{n^{1/2}\mathrm{mean}(\mathrm{err}_g)\}$ (re-scaled for easier visualisation). The dotted vertical lines correspond to the bandwidths $h_\theta$ and $h_g$ respectively. Panels, with the reported ratio $h_\theta/h_g$ in parentheses: model A, n = 50 (1.34), n = 100 (1.62), n = 200 (2.33), n = 400 (2.33), n = 800 (2.36); model B, n = 100 (1.1), n = 200 (1.26), n = 400 (1.37), n = 800 (1.64).
Table 1. Average estimation errors $\sum_{j=1}^p |\hat\theta_j - \theta_j|$ and their standard deviations (in square brackets) for model (7).

| n | σ | τ | ADE∗ (a = 1) | HJS∗ (a = 1) | SIR/pHd (a = 1) | SLS (a = 1) | MAVE (a = 1) | SIR/pHd (a = 0) | SLS (a = 0) | MAVE (a = 0) |
|---|---|---|---|---|---|---|---|---|---|---|
| 200 | 0.1 | 1 | 0.6094 | 0.1397 | 0.6521 [0.1569] | 0.0645 [0.0258] | 0.0514 [0.0152] | 0.7500 [0.1524] | 0.6910 [1.2491] | 0.0936 [0.0255] |
| 200 | 0.2 | 1 | 0.6729 | 0.2773 | 0.6976 [0.1759] | 0.1070 [0.0375] | 0.0934 [0.0294] | 0.7833 [0.1666] | 0.8937 [1.3192] | 0.1809 [0.0483] |
| 400 | 0.1 | 0.75 | 0.7670 | 0.1447 | 0.3778 [0.0835] | 0.1151 [0.0410] | 0.0701 [0.0197] | 0.6037 [0.1134] | 0.0742 [0.0193] | 0.0562 [0.0146] |
| 400 | 0.1 | 1 | 0.4186 | 0.0822 | 0.4868 [0.1149] | 0.0384 [0.0125] | 0.0295 [0.0096] | 0.5820 [0.1084] | 0.5056 [1.0831] | 0.0613 [0.0167] |
| 400 | 0.1 | 1.5 | 0.2482 | 0.0412 | 0.5670 [0.1524] | 0.0208 [0.0063] | 0.0197 [0.0056] | 0.5760 [0.1215] | 0.0923 [0.0257] | 0.0669 [0.0175] |
| 400 | 0.2 | 1 | 0.4665 | 0.1659 | 0.5249 [0.1353] | 0.0654 [0.0207] | 0.0607 [0.0178] | 0.6084 [0.1064] | 0.7467 [1.2655] | 0.1229 [0.0357] |
| 400 | 0.4 | 1 | 0.5016 | 0.3287 | 0.6328 [0.1386] | 0.1262 [0.0406] | 0.1120 [0.0339] | 0.6994 [0.1370] | 0.9977 [1.2991] | 0.2648 [0.1880] |
We make the following observations from Table 1. Our method has much better performance than ADE and the method of Hristache et al. (2001). For each simulation, the better one of SIR and pHd is reported in Table 1, suggesting that these methods are not so competitive. Actually the main application of SIR and pHd is not in the estimation of single-index models; see Li (1991, 1992). For SLS, the performance depends strongly on the data and the model. If the model is easy to estimate (such as a monotone link with a large signal/noise ratio), it performs quite well. But overall SLS is still not as good as MAVE. The proposed method has the best performance in all the simulations we have done.
5 Proof of Theorems
Let $f_\theta(v)$ be the density function of $\theta^\top X$ and $\Lambda_n = \{x : |x| < n^c, f_\theta(x) > n^{-2\varsigma}, \theta \in \Theta_n\}$ where $c > 1/3$ and $\varsigma > 0$ is defined in (C5). Suppose $A_n$ is a random matrix depending on $x$ and $\theta$. By $A_n = O(a_n)$ (or $A_n = \mathcal{O}(a_n)$) we mean that all elements in $A_n$ are $O_{a.s.}(a_n)$ (or $o_{a.s.}(a_n)$) uniformly for $\theta \in \Theta_n$ and $x \in \Lambda_n$. Let $\delta_n = (nh/\log n)^{-1/2}$, $\gamma_n = h^2 + \delta_n$ and $\delta_\theta = |\theta - \theta_0|$. For any vector $V(v)$ of functions of $v$, we define $(V(v))' = dV(v)/dv$.
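Suppose $(X_i, Z_i)$, $i = 1, 2, \ldots, n$, are i.i.d. samples from $(X, Z)$. Let $X_{ix} = X_i - x$,
$$s_k^\theta(x) = n^{-1}\sum_{i=1}^n K_h(\theta^\top X_{ix})\{\theta^\top X_{ix}/h\}^k, \qquad t_k^\theta(x) = n^{-1}\sum_{i=1}^n K_h(\theta^\top X_{ix})\{\theta^\top X_{ix}/h\}^k X_i,$$
$$w_k^\theta(x) = n^{-1}\sum_{i=1}^n K_h(\theta^\top X_{ix})\{\theta^\top X_{ix}/h\}^k X_iX_i^\top, \qquad e_k^\theta(x) = n^{-1}\sum_{i=1}^n K_h(\theta^\top X_{ix})\{\theta^\top X_{ix}/h\}^k \varepsilon_i,$$
$$\epsilon_k^\theta = s_k^\theta(x) - Es_k^\theta(x), \qquad \xi_k^\theta = t_k^\theta(x) - Et_k^\theta(x),$$
$$D_{n,k}^\theta(x) = s_2^\theta(x)s_k^\theta(x) - s_1^\theta(x)s_{k+1}^\theta(x), \qquad E_{n,k}^\theta = s_0^\theta(x)s_{k+1}^\theta(x) - s_1^\theta(x)s_k^\theta(x)$$
for $k = 1, 2, \ldots$. For any random variable $Z$ and its random observations $Z_i$, $i = 1, \ldots, n$, let
$$T_{n,k}^\theta(Z|x) = s_2^\theta(x)\,n^{-1}\sum_{i=1}^n K_h^\theta(X_{ix})(\theta^\top X_{ix}/h)^k Z_i - s_1^\theta(x)\,n^{-1}\sum_{i=1}^n K_h^\theta(X_{ix})(\theta^\top X_{ix}/h)^{k+1} Z_i,$$
$$S_{n,k}^\theta(Z|x) = s_0^\theta(x)\,n^{-1}\sum_{i=1}^n K_h^\theta(X_{ix})(\theta^\top X_{ix}/h)^{k+1} Z_i - s_1^\theta(x)\,n^{-1}\sum_{i=1}^n K_h^\theta(X_{ix})(\theta^\top X_{ix}/h)^k Z_i.$$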
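By the Taylor expansion of $g(\theta_0^\top X_i)$ at $\theta_0^\top x$, we have
$$g(\theta_0^\top X_i) = g(\theta_0^\top x) + \sum_{k=1}^5 \frac{1}{k!}g^{(k)}(\theta_0^\top x)\{\theta^\top X_{ix} + (\theta_0 - \theta)^\top X_{ix}\}^k + O(\{\theta^\top X_{ix} + (\theta_0 - \theta)^\top X_{ix}\}^6)$$
$$= g(\theta_0^\top x) + A_\theta(x, X_i) + B_\theta(x, X_i)(\theta_0 - \theta) + O\{(\theta^\top X_{ix})^6 + \delta_\theta^3(|X_i|^6 + |x|^6)\}, \qquad (8)$$
where $A_\theta(x, X_i) = \sum_{k=1}^5 (k!)^{-1}g^{(k)}(\theta_0^\top x)(\theta^\top X_{ix})^k$ and
$$B_\theta(x, X_i) = \sum_{k=1}^5 \frac{1}{(k-1)!}g^{(k)}(\theta_0^\top x)(\theta^\top X_{ix})^{k-1}X_{ix}^\top + \frac{1}{2}g''(\theta_0^\top x)(\theta - \theta_0)^\top X_{ix}X_{ix}^\top.$$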
For ease of exposition, we simplify the notation and abbreviate $g$ for $g(\theta_0^\top x)$ and $g'$, $g''$, $g'''$ for $g'(\theta_0^\top x)$, $g''(\theta_0^\top x)$, $g'''(\theta_0^\top x)$ respectively. Without causing confusion, we write $f_\theta(\theta^\top x)$ as $f_\theta$, $f_\theta(\theta^\top X_j)$ as $f_\theta(X_j)$ and $K_h(\theta^\top X_{ij})$ as $K_h^\theta(X_{ij})$. Similar notations are used for the other functions.
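Lemma 5.1 (Link function) Let
$$\Sigma_n^\theta(x) = n^{-1}\sum_{i=1}^n K_h^\theta(X_{ix})\begin{pmatrix} 1 \\ \theta^\top X_{ix}/h \end{pmatrix}\begin{pmatrix} 1 \\ \theta^\top X_{ix}/h \end{pmatrix}^{\top} \quad\text{and}\quad \begin{pmatrix} a^\theta(x) \\ d^\theta(x)h \end{pmatrix} = \{n\Sigma_n^\theta(x)\}^{-1}\sum_{i=1}^n K_h^\theta(X_{ix})\begin{pmatrix} 1 \\ \theta^\top X_{ix}/h \end{pmatrix}Y_i.$$
Under assumptions (C2)–(C5), we have
$$a^\theta(x) = g(\theta_0^\top x) + A_n^\theta(x)h^2 + B_n^\theta(x)(\theta_0 - \theta) + V_n^\theta(x) + O(h^2\gamma_n^2 + \delta_\theta^3)(1 + |x|^6),$$
$$d^\theta(x)h = g'(\theta_0^\top x)h + \tilde A_n^\theta(x)h^2 + \tilde B_n^\theta(x)(\theta_0 - \theta)h + \tilde V_n^\theta(x) + O(h^2\gamma_n^2 + \delta_\theta^3)(1 + |x|^6),$$
where
$$A_n^\theta(x) = \frac{1}{2}g'' + \frac{1}{4}\Big\{(\mu_4 - 1)g''f_\theta^{-2}(f_\theta f_\theta'' - 2(f_\theta')^2) + \frac{1}{24}\mu_4g^{(4)}\Big\}h^2 + H_{1,n}^\theta(x),$$
$$\tilde A_n^\theta(x) = \frac{1}{2}g''(\mu_4 - 1)f_\theta^{-1}f_\theta'h + \frac{1}{6}g^{(3)}\mu_4h + \frac{1}{2}g''f_\theta^{-1}(\epsilon_3^\theta - \epsilon_1^\theta) + O(h\gamma_n),$$
$$B_n^\theta(x) = g'\nu_\theta + O(\gamma_n + \delta_\theta), \qquad \tilde B_n^\theta(x) = g'(\theta_0^\top x)f_\theta^{-1}\{f_\theta\nu_\theta(x)\}' + O(\gamma_n),$$
where
$$H_{1,n}^\theta(x) = \frac{1}{2}g''(\theta_0^\top x)\{f_\theta^{-1}(\epsilon_2^\theta - \epsilon_0^\theta) + (2 - \mu_4)f_\theta^{-2}f_\theta'h\epsilon_1^\theta - f_\theta^{-2}f_\theta'h\epsilon_3^\theta\} + \frac{1}{6}f_\theta^{-1}g'''h\epsilon_3^\theta,$$
$$V_n^\theta(x) = f_\theta^{-1}e_0^\theta - f_\theta^{-2}f_\theta'he_1^\theta + \mu_4f_\theta^{-2}f_\theta''h^2e_0^\theta/2 + f_\theta^{-2}(e_0^\theta\epsilon_2^\theta - e_1^\theta\epsilon_1^\theta) - \mu_4f_\theta^{-2}f_\theta'''h^3e_1^\theta$$
$$\qquad + \{f_\theta^{-2}(f_\theta')^2 - (\mu_4 + 1)f_\theta^{-1}f_\theta''\}\{f_\theta^{-1}h^2e_0^\theta - f_\theta^{-2}f_\theta'h^3e_1^\theta\} - f_\theta^{-1}(\epsilon_0^\theta + \epsilon_1^\theta)\{f_\theta^{-1}e_0^\theta - f_\theta^{-2}f_\theta'e_1^\theta\} + 2f_\theta^{-2}f_\theta'h\epsilon_1^\theta f_\theta^{-1}e_0^\theta$$
and
$$\tilde V_n^\theta(x) = f_\theta^{-1}e_1^\theta + f_\theta^{-2}f_\theta''h^2e_1^\theta/2 + f_\theta^{-2}(\epsilon_0^\theta e_1^\theta - \epsilon_1^\theta e_0^\theta) - f_\theta^{-2}f_\theta'he_0^\theta + f_\theta^{-1}\epsilon_0^\theta[-(\mu_4 + 1)f_\theta^{-1}f_\theta''h^2/2 - f_\theta^{-1}(\epsilon_0^\theta + \epsilon_1^\theta) + f_\theta^{-2}(f_\theta')^2h^2].$$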
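Lemma 5.2 (Summations) Let $\eta_n^\theta(x) = n^{-1}\sum_{i=1}^n K_h^\theta(X_{ix})X_{ix}\varepsilon_i$. Under conditions (C1)-(C5), we have
$$A_n^\theta \stackrel{\rm def}{=} n^{-1}\sum_{j=1}^n [\ldots]$$
$$B_n^\theta \stackrel{\rm def}{=} (nh)^{-1}\sum_{j=1}^n \rho_n\{s_0^\theta(X_j)\}e_k^\theta(X_j)\eta_n^\theta(X_j)/s_0^\theta(X_j) = \frac{\tilde c_{k,n}}{nh} + R_n^\theta + O(n^{2\varsigma}\gamma_n^3),$$
$$C_n^\theta \stackrel{\rm def}{=} n^{-1}\sum_{j=1}^n \rho_n(s_0^\theta(X_j))\epsilon_k^\theta(X_j)\eta_n^\theta(X_j)/s_0^\theta(X_j) = M_n^\theta + O(n^{2\varsigma}\gamma_n^3),$$
where $E_n^\theta = \sum_{i=1}^n \rho_n\{f_\theta(X_i)\}g'(\theta^\top X_i)\nu_\theta(X_i)\varepsilon_i$, $r_{n,0}^\theta = \mathcal{O}(1)$,
$$E_n^\theta = O\{(n/\log n)^{-1/2}\}, \quad Q_n^\theta = O\{(n/\log n)^{-1/2}\gamma_n\}, \quad R_n^\theta = O\{n^{-1/2}\delta_n\}, \quad M_n^\theta = O\{n^{-1/2}\delta_n\},$$
with $E\{E_n^\theta Q_n^\theta\} = o(h^8 + (nh)^{-2})$, $E\{E_n^\theta R_n^\theta\} = o(h^8 + (nh)^{-2})$, $E\{E_n^\theta M_n^\theta\} = o(h^8 + (nh)^{-2})$, and $\tilde c_{k,n} = \int v^{k+1}K^2(v)\,dv\ E[\rho_n(f_\theta(X_j))f_\theta^{-1}(X_j)(\nu_\theta f_\theta)'(X_j)]$ if $k$ is odd, 0 otherwise.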
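Lemma 5.3 (Denominator) Let $D_n^\theta = n^{-2}\sum_{i,j=1}^n \rho_n(s_0^\theta(X_j))d_\theta^2(X_j)K_h^\theta(X_{ij})X_{ij}X_{ij}^\top/s_0^\theta(X_j)$ in the algorithm. Suppose $(\theta, B) : p \times p$ is an orthogonal matrix. Then under (C1)-(C5), we have almost surely
$$(D_n^\theta)^{-1} = \theta\theta^\top d_{11}^\theta h^{-2} - \theta d_{12}^\theta B^\top h^{-1} - B(d_{12}^\theta)^\top\theta^\top h^{-1} + Bd_{22}^\theta B^\top,$$
where $d_{11}^\theta = (G_n^\theta)^{-1} + \mathcal{O}(1)$, $d_{12}^\theta = H_n^\theta h + O(\gamma_n)$, $d_{22}^\theta = \frac{1}{2}(B^\top W_n^\theta B)^{-1} + O(\gamma_n)$, with
$$G_n^\theta = n^{-1}\sum_{j=1}^n \rho_n(f_\theta(X_j))f_\theta^{-1}(X_j)(g'(\theta_0^\top X_j))^2,$$
$$H_n^\theta = \frac{1}{2}n^{-1}\sum_{j=1}^n \rho_n(f_\theta(X_j))f_\theta^{-1}(X_j)\{(f_\theta\nu_\theta)'(X_j)\}^\top(G_n^\theta)^{-1}(g'(\theta_0^\top X_j))^2B(B^\top W_n^\theta B)^{-1}$$
and $W_n^\theta = n^{-1}\sum_{j=1}^n \rho_n\{f_\theta(X_j)\}(g'(\theta^\top X_j))^2\nu_\theta(X_j)\nu_\theta^\top(X_j)$.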
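Proof of Lemma 5.3 Let $(\theta, B)$ be an orthogonal matrix. It is easy to see that
$$n^{-1}\sum_{i=1}^n K_h^\theta(X_{ix})\theta^\top X_{ix}X_{ix}^\top\theta = s_2^\theta(x)h^2, \qquad n^{-1}\sum_{i=1}^n K_h^\theta(X_{ix})\theta^\top X_{ix}X_{ix}^\top B = \{t_1^\theta(x) - s_1^\theta(x)x\}^\top Bh,$$
$$n^{-1}\sum_{i=1}^n K_h^\theta(X_{ix})B^\top X_{ix}X_{ix}^\top B = B^\top\{w_0^\theta(x) - t_0^\theta(x)x^\top - x(t_0^\theta(x))^\top + xx^\top s_0^\theta(x)\}B.$$
Thus
$$(D_n^\theta)^{-1} = (\theta, B)\begin{pmatrix} D_{11}^\theta h^2 & (D_{12}^\theta)^\top Bh \\ B^\top D_{12}^\theta h & B^\top D_{22}^\theta B \end{pmatrix}^{-1}(\theta, B)^\top,$$
where
$$D_{11}^\theta = n^{-1}\sum_{j=1}^n \rho_n(s_0^\theta(X_j))\{d_\theta(X_j)\}^2s_2^\theta(X_j)/s_0^\theta(X_j),$$
$$D_{12}^\theta = n^{-1}\sum_{j=1}^n \rho_n(s_0^\theta(X_j))\{d_\theta(X_j)\}^2\{t_1^\theta(X_j) - s_1^\theta(X_j)X_j\}^\top/s_0^\theta(X_j),$$
$$D_{22}^\theta = n^{-1}\sum_{j=1}^n \rho_n(s_0^\theta(X_j))(d_\theta(X_j))^2\{w_0^\theta(X_j) - t_0^\theta(X_j)X_j^\top - X_j(t_0^\theta(X_j))^\top + X_jX_j^\top s_0^\theta(X_j)\}/s_0^\theta(X_j).$$
By the matrix inversion formula in blocks (Schott, 1997), we have the equation in Lemma 5.3 with $d_{11}^\theta = \{D_{11}^\theta - (D_{12}^\theta)^\top BB^\top(D_{22}^\theta)^{-1}BB^\top D_{12}^\theta\}^{-1}$, $d_{12}^\theta = d_{11}^\theta(D_{12}^\theta)^\top B(B^\top D_{22}^\theta B)^{-1}$ and $d_{22}^\theta = \{B^\top D_{22}^\theta B\}^{-1} + d_{11}^\theta\{B^\top D_{22}^\theta B\}^{-1}B^\top D_{12}^\theta(D_{12}^\theta)^\top B\{B^\top D_{22}^\theta B\}^{-1}$. By Lemma 6.1, we have
$$D_{11}^\theta = G_n^{-1} + \mathcal{O}(1), \qquad D_{12}^\theta = H_nh + O(\gamma_n), \qquad D_{22}^\theta = 2W_n + O(\gamma_n).$$
Thus, Lemma 5.3 follows. □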
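Lemma 5.4 (Numerator) Let $N_n^\theta = n^{-2}\sum_{i,j=1}^n \rho_n(s_0^\theta(X_j))K_h^\theta(X_{ij})X_{ij}\{Y_i - a^\theta(X_j) - d^\theta(X_j)\theta_0^\top X_{ij}\}/s_0^\theta(X_j)$. Under assumptions (C1)–(C5), we have almost surely
$$N_n^\theta = E_n^\theta + \frac{\tilde c_{1,n}}{nh} + \tilde c_{2,n}h^4 + R_n^\theta + B_n^\theta(\theta - \theta_0) + O\{n^{2\varsigma}(\gamma_n^3 + \delta_\theta^3)\},$$
where $R_n^\theta = O\{n^{-1}(\log n/h)^{1/2} + (\log n/n)^{-1/2}h^2\}$, $\theta^\top R_n^\theta = O\{hn^{-1}(\log n/h)^{1/2} + (\log n/n)^{-1/2}h^3\}$ and $E\{R_n^\theta E_0^\theta\} = O\{(nh)^{-2} + h^8\}$, $B_n^\theta = W_n^\theta + \mathcal{O}(1)$ with $W_n^\theta$ defined in Lemma 5.3, $\tilde c_{1,n}$ and $E_0^\theta$ are defined in Lemma 5.2 and
$$\tilde c_{2,n} = \frac{1}{4}(\mu_4 - 1)\sum_{j=1}^n \rho_n\{f_\theta(X_j)\}g'(\theta_0^\top X_j)g''(\theta_0^\top X_j)\nu_\theta''(X_j).$$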
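Proof of Theorem 3.1 By assumption (C2), we have
$$\sum_{n=1}^\infty P\Big(\bigcup_{i=1}^n\{X_i \notin \Lambda_n\}\Big) \le \sum_{n=1}^\infty nP(X_i \notin \Lambda_n) \le \sum_{n=1}^\infty nP(|X_i| > n^c) < \sum_{n=1}^\infty n\,n^{-6c}E|X|^6 < \infty$$
for any $c > 1/3$. It follows from the Borel-Cantelli lemma that
$$P\Big(\bigcap_{n=1}^\infty\bigcup_{i=1}^n\{X_i \notin \Lambda_n\}\Big) = 0. \qquad (9)$$
Let $\tilde\Lambda_n = \{x : f_\theta(\theta^\top x) > 2n^{-\epsilon}\}$. Similarly, we have
$$P\Big(\bigcap_{n=1}^\infty\bigcup_{i=1}^n\{X_i \notin \tilde\Lambda_n\}\Big) = 0. \qquad (10)$$
Thus, we can exchange summations over $\{X_j : j = 1, \cdots, n\}$, $\{X_j : X_j \in \Lambda_n, j = 1, \cdots, n\}$ and $\{X_j : X_j \in \tilde\Lambda_n, j = 1, \cdots, n\}$ in the sense of almost sure consistency. On the other hand, we have by (C2)
$n^{-1}\sum_{|X_j| < n^c}$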
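By the notation in Lemmas 5.3 and 5.4, after one iteration of Steps 1-3, the new $\theta$ is
$$\tilde\theta = \theta_0 + (D_n^\theta)^{-1}N_n^\theta. \qquad (11)$$
Note that $\theta^\top E_n^\theta = 0$, $\theta^\top c_{1,n}^\theta = 0$, $\theta^\top c_{2,n}^\theta = 0$, $\theta^\top W_n^\theta = 0$, $W_n^\theta(W_n^\theta)^+ = I - \theta\theta^\top$ and $\delta_\theta/h^2 \to 0$. We have
$$\tilde\theta = \theta_0 + \theta\big[\theta^\top d_{11}^\theta h^{-2}\{R_n^\theta + B_n^\theta(\theta - \theta_0) + O\{n^{2\varsigma}(\gamma_n^3 + \delta_\theta^3)\}\} - d_{12}^\theta B^\top h^{-1}N_n^\theta\big]$$
$$\qquad - B(d_{12}^\theta)^\top\theta^\top h^{-1}\big[R_n^\theta + B_n^\theta(\theta - \theta_0) + O\{n^{2\varsigma}(\gamma_n^3 + \delta_\theta^3)\}\big] + B d_{22}^\theta B^\top N_n^\theta$$
$$= (1 + a_n)\theta_0 + \{\tfrac12(I - \theta_0\theta_0^\top) + b_n\}(\theta - \theta_0) + \tfrac12\{W_n^\theta\}^+E_n^\theta + O(h^4),$$
where $a_n = O(1)$ and $b_n = O(1)$. By (25) below, we have $s_0^\theta(x) = f_\theta(\theta^\top x) + O(\gamma_n)$. Thus by the smoothness of $\rho_n(\cdot)$ and (10), we have
$$\rho_n(s_0^\theta(x)) = \rho_n(f_\theta(\theta^\top x)) + O(n^\varsigma\gamma_n) = 1 + O(n^\varsigma\gamma_n). \qquad (12)$$
Since $\rho_n(\cdot)$ is bounded, we have $E\{\rho_n(\hat f_\theta(\theta^\top x)) - 1\}^2 = O(1)$. By (C3) and Lemma 6.1, we have
$$E_n^\theta = n^{-1}\sum_{i=1}^n g'(\theta_0^\top X_i)\,\nu_{\theta_0}(X_i)\,\varepsilon_i + O(n^{-1/2}).$$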
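Note that $W_n = W_0 + O(\delta_\theta)$. It is easy to check that $|\tilde\theta| = 1 + a_n + b_n + O(h^4) = 1 + O(1)$. Thus
$$\tilde\theta/|\tilde\theta| = \theta_0 + \{\tfrac12(I - \theta_0\theta_0^\top) + O(1)\}(\theta - \theta_0) + \tfrac12 n^{-1}W_0^+\sum_{i=1}^n g'(\theta_0^\top X_i)\,\nu_{\theta_0}(X_i)\,\varepsilon_i + O(h^3 + n^{-1/2}).$$
Let $\theta^{(k)}$ be the value of $\theta$ after $k$ iterations. Because $h_{k+1} = \max\{h_k/c_h, h\}$, we have
$$|\theta_{k+1} - \theta_0|/h_{k+1}^2 \to 0$$
for all $k > 1$. We have
$$\theta^{(k+1)} = \theta_0 + \{\tfrac12(I - \theta_0\theta_0^\top) + O(1)\}(\theta^{(k)} - \theta_0) + \tfrac12 n^{-1}W_0^+\sum_{i=1}^n g'(\theta_0^\top X_i)\,\nu_{\theta_0}(X_i)\,\varepsilon_i + O(h_k^3 + n^{-1/2}).$$
Applying the above equation recursively, we have
$$\theta^{(k+1)} = \theta_0 + \Big\{\frac{1}{2^k}(I - \theta_0\theta_0^\top) + O(1)\sum_{\iota=1}^k\frac{1}{2^\iota}\Big\}(\theta^{(1)} - \theta_0) + \Big\{\sum_{\iota=1}^k\frac{1}{2^\iota}\Big\}n^{-1}W_0^+\sum_{i=1}^n g'(\theta_0^\top X_i)\,\nu_{\theta_0}(X_i)\,\varepsilon_i + O\Big(\sum_{\iota=1}^k\frac{1}{2^\iota}h_{k-\iota}^3 + n^{-1/2}\Big).$$
Thus as the number of iterations $k \to \infty$, Theorem 3.1 follows immediately from the above equation and the central limit theorem. ■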
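The geometric contraction behind the recursion in the proof of Theorem 3.1 can be illustrated with a small numerical sketch. It keeps only the leading terms: the error vector is mapped to $\frac12(I - \theta_0\theta_0^\top)v + \frac12 e$, where `e` is a hypothetical stand-in for the stochastic term $n^{-1}W_0^+\sum_i g'(\theta_0^\top X_i)\nu_{\theta_0}(X_i)\varepsilon_i$, projected to be orthogonal to $\theta_0$. The $O(1)$ perturbations and the bandwidth remainders are dropped, and the function names are illustrative, not taken from the paper.

```python
import numpy as np


def projector(theta0):
    """Return I - theta0 theta0^T for the normalised direction theta0."""
    t = np.asarray(theta0, dtype=float)
    t = t / np.linalg.norm(t)
    return np.eye(t.size) - np.outer(t, t)


def iterate_recursion(theta0, v0, e, k):
    """Apply v <- 0.5 * P v + 0.5 * P e for k steps and return v."""
    P = projector(theta0)
    e_perp = P @ np.asarray(e, dtype=float)   # stochastic term, orthogonal to theta0
    v = np.asarray(v0, dtype=float)
    for _ in range(k):
        v = 0.5 * (P @ v) + 0.5 * e_perp
    return v


def closed_form(theta0, v0, e, k):
    """2^{-k} P v0 + (sum_{i=1}^k 2^{-i}) P e, the unrolled recursion (k >= 1)."""
    P = projector(theta0)
    weight = 1.0 - 0.5 ** k                    # equals sum_{i=1}^k 2^{-i}
    return 0.5 ** k * (P @ np.asarray(v0, dtype=float)) + weight * (P @ np.asarray(e, dtype=float))
```

In this simplified setting the influence of the starting error decays like $2^{-k}$ while the weight on the stochastic term tends to one, which is the mechanism that makes the limit free of the initial estimator.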
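Proof of Theorem 3.3 Based on Theorem 3.2, we can assume $\delta_\theta = (\log n/n)^{1/2}$. Note that $\theta^\top\{E_n^\theta + c_{1,n}(nh)^{-1} + c_{2,n}h^4\} = 0$. We consider the product of each term in $(D_n^\theta)^{-1}$ with $N_n^\theta$. We have
$$\theta\theta^\top d_{11}^\theta h^{-2}N_n^\theta = \theta\theta^\top d_{11}^\theta h^{-2}\big[R_n^\theta + B_n^\theta(\theta - \theta_0) + O\{n^{2\varsigma}(\gamma_n^3 + \delta_\theta^3)\}\big] = a_n^\theta\theta_0 + a_n^\theta(\theta - \theta_0),$$
$\theta d_{12}^\theta B^\top h^{-1}N_n^\theta = b_n^\theta\theta_0 + b_n^\theta(\theta - \theta_0)$, $B(d_{12}^\theta)^\top\theta^\top h^{-1}$
$$(D_n^\theta)^{-1}N_n = \theta\{S_n^\top(\theta - \theta_0) + H_nE_n^\theta + O(n^{2\varsigma}\gamma_n^4)\}$$
$$= \theta_0\{S_n^\top(\theta - \theta_0) + H_nE_0^\theta + O(n^{2\varsigma}\gamma_n^4)\} + c_n(\theta - \theta_0),$$
where $S_n = O(1)$ and $c_n = O(\gamma_n/h)$. It is easy to see that $c_n = O(1)$ provided that $|\theta - \theta_0|/h^2 \to 0$. By Lemmas 5.3 and 5.4, we have
$$\tilde\theta = \theta_0\{1 + S_n^\top(\theta - \theta_0) + H_nE_0^\theta + O(n^{2\varsigma}\gamma_n^4)\} + \tfrac12W_n^\theta\Big\{E_0^\theta + \frac{c'_{1,n}}{nh} + c'_{2,n}h^4 + R_n^\theta + Q_n^\theta\Big\} + \{\tfrac12(I - \theta\theta^\top) + c_n\}(\theta - \theta_0) + O\{n^{2\varsigma}(\gamma_n^3 + h\log n/n)\}.$$
It is easy to see that $|\tilde\theta| = 1 + S_n^\top(\theta - \theta_0) + H_nE_0^\theta + O(n^{2\varsigma}\gamma_n^4)$. Thus
$$\tilde\theta/|\tilde\theta| = \theta_0 + \tfrac12W_n^\theta\Big\{E_0^\theta + \frac{c'_{1,n}}{nh} + c'_{2,n}h^4 + R_n^\theta + E_0^\theta H_n^\top E_0^\theta\Big\} + \{\tfrac12(I - \theta\theta^\top) + c'_n\}(\theta - \theta_0) + O\{n^{2\varsigma}(\gamma_n^3 + h\log n/n)\},$$
where $c'_n = O(1)$. Similar to the proof of Theorem 3.1, we complete the proof with $c_{1,n} = W_n^{-1}c'_{1,n}$ and $c_{2,n} = W_n^{-1}c'_{2,n}$. ■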
6 Proofs of the Lemmas
In this section, we first give some results on uniform consistency. The Lemmas are then proved on the basis of these results.
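Lemma 6.1 Suppose $G_{n,i}(\chi)$ is a martingale with respect to $\mathcal{F}_i = \sigma\{G_{n,\ell}(\chi), \ell \le i\}$ with $\chi \in \mathcal{X}$, where $\mathcal{X}$ is a compact region in a multidimensional space, such that (I) $|G_{n,i}(\chi)| < \xi_i$, where $\xi_i$ are IID and $\sup E\xi_1^{2r} < \infty$ for some $r > 2$; (II) $EG_{n,k}^2(\chi) < a_ns(\chi)$ with $\inf s(\chi)$ positive; and (III) $|G_{n,i}(\chi) - G_{n,i}(\tilde\chi)| < n^{\alpha_1}|\chi - \tilde\chi|M_i$, where $M_i$, $i = 1, 2, \ldots$ are IID with $EM_1^2 < \infty$. If $a_n = cn^{-\delta}$ with $0 \le \delta < 1 - 2/r$, then for any $\alpha_1' > 0$ we have
$$\sup_{|\chi| \le n^{\alpha_1'}}\Big|n^{-1}s^{-1/2}(\chi)\sum_{i=1}^n G_{n,i}(\chi)\Big| = O\{(n^{-1}a_n\log n)^{1/2}\}$$
almost surely. Suppose for any fixed $n$ and $k$, $G_{n,i,k}(\theta)$ is a martingale with respect to $\mathcal{F}_{i,k} = \sigma\{G_{n,\ell,k}(\theta), \ell \le i\}$ such that (I) $|G_{n,i,k}(\theta)| \le \xi_i$, (II) $EG_{n,i,k}^2(\theta) < a_n$ and (III) $|G_{n,i,k}(\theta) - G_{n,i,k}(\tilde\theta)| < n^{\alpha_2}|\theta - \tilde\theta|M_i$, where $\xi_i$, $a_n$ and $M_i$ are defined above. If $E|\varepsilon_k|^{2r} < \infty$ and $E\{\varepsilon_k \mid G_{n,i,j}(\theta), i < j, j = 1, \ldots, k - 1\} = 0$, then
$$\sup_{\theta \in \Theta}\Big|n^{-2}\sum_{k=2}^n\Big\{\sum_{i=1}^{k-1}G_{n,i,k}(\theta)\Big\}\varepsilon_k\Big| = O\{(a_n\log n)^{1/2}/n\}$$
almost surely.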
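Proof of Lemma 6.1 We give the details for the second part of the Lemma. The first part is easier and can be proved similarly. Let $\Delta_n(\theta)$ be the expression inside the absolute value signs in the equation. By (III) and the strong law of large numbers, it is easy to see that there are $n_1 = n^{\alpha_3}$ balls centered at $\theta_\iota$: $B_\iota = \{\theta : |\theta - \theta_\iota| < n^{-\alpha_4}\}$ with $\alpha_4 > \alpha_2 + 2$, such that $\bigcup_{\iota=1}^{n_1}B_\iota \supset \Theta$. By the strong law of large numbers, we have
$$\max_{1 \le \iota \le n_1}\sup_{\theta \in B_\iota}|\Delta_n(\theta) - \Delta_n(\theta_\iota)| \le n^{\alpha_2}\max_{1 \le \iota \le n_1}\sup_{\theta \in B_\iota}|\theta - \theta_\iota|\,n^{-2}\sum_{k=1}^n|\varepsilon_k|\sum_{i=1}^n M_i = O\{(a_n\log n)^{1/2}/n\}$$
almost surely. Let $\Delta_{n,k}(\theta_\iota) = \sum_{i=1}^{k-1}G_{n,i,k}(\theta_\iota)$. Next, we show that there is a constant $c_1$ such that
$$p_n \stackrel{\rm def}{=} P\Big(\bigcap_{\ell=1}^\infty\bigcup_{n=\ell}^\infty\Big\{\max_{1 < k \le n}\max_{1 < \iota \le n_1}|\Delta_{n,k}(\theta_\iota)| > c_1(na_n\log n)^{1/2}\Big\}\Big) = 0. \qquad (13)$$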
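Let $T_n = \{na_n\log(n)\}^{1/2}$, $G_{n,i,k}^I(\theta_\iota) = G_{n,i,k}(\theta_\iota)I(|G_{n,i,k}(\theta_\iota)| \le T_n)$ and $G_{n,i,k}^O(\theta_\iota) = G_{n,i,k}(\theta_\iota) - G_{n,i,k}^I(\theta_\iota)$. Write
$$\Delta_{n,k}(\theta_\iota) = \sum_{i=1}^{k-1}\{G_{n,i,k}^I(\theta_\iota) - EG_{n,i,k}^I(\theta_\iota)\} + \sum_{i=1}^{k-1}\{G_{n,i,k}^O(\theta_\iota) - EG_{n,i,k}^O(\theta_\iota)\}. \qquad (14)$$
Note that $E|G_{n,i,k}^O(\theta_\iota)| \le T_n^{-r+1}E|\xi_1|^r = E|\xi_1|^r\{na_n\log(n)\}^{-(r-1)/2}$. If $a_n = cn^{-\delta}$ with $0 \le \delta < 1 - 2/r$ and $k \le n$, we have
$$\Big|\sum_{i=1}^{k-1}EG_{n,i,k}^O(\theta_\iota)\Big| \le E|\xi_1|^r(k - 1)\{na_n\log(n)\}^{-(r-1)/2} \le CE|\xi_1|^r\{na_n\log(n)\}^{1/2}. \qquad (15)$$
Note that
$$\sum_{i=1}^n|G_{n,i,k}^O(\theta_\iota)| \le \sum_{i=1}^n|\xi_i|I(|\xi_i| > T_n) \le T_n^{-r+1}\sum_{i=1}^n|\xi_i|^rI(|\xi_i| > T_n).$$
For fixed $T$, by the strong law of large numbers, we have
$$n^{-1}\sum_{i=1}^n|\xi_i|^rI(|\xi_i| > T) \to E\{|\xi_1|^rI(|\xi_1| > T)\}$$
almost surely. The right hand side above is dominated by $E\{|\xi_1|^r\}$ and tends to 0 as $T \to \infty$. Note that $T_n$ increases to $\infty$ with $n$. For large $n$ such that $T_n > T$, we have
$$n^{-1}\sum_{i=1}^n|\xi_i|^rI(|\xi_i| > T_n) \le n^{-1}\sum_{i=1}^n|\xi_i|^rI(|\xi_i| > T) \to 0$$
almost surely as $T \to \infty$. It follows that
$$\sum_{i=1}^n|G_{n,i,k}^O(\theta_\iota)| = o(nT_n^{-r+1}) = o\{(na_n\log n)^{1/2}\} \qquad (16)$$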
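almost surely. Thus by (15) and (16), if $c_1' > CE|\xi_1|^r$ we have
$$p_n' \stackrel{\rm def}{=} P\Big(\bigcap_{\ell=1}^\infty\bigcup_{n=\ell}^\infty\Big\{\max_{1 < k \le n}\max_{1 < \iota \le n_1}\Big|\sum_{i=1}^{k-1}\{G_{n,i,k}^O(\theta_\iota) - EG_{n,i,k}^O(\theta_\iota)\}\Big| > c_1'(na_n\log n)^{1/2}\Big\}\Big)$$
$$\le P\Big(\bigcap_{\ell=1}^\infty\bigcup_{n=\ell}^\infty\Big\{\sum_{i=1}^n|\xi_i|(|\xi_i| \ge T_n) > c_1'(na_n\log n)^{1/2}\Big\}\Big) + P\Big(\bigcap_{\ell=1}^\infty\bigcup_{n=\ell}^\infty\Big\{\max_{1 < k \le n}\max_{1 < \iota \le n_1}\Big|\sum_{i=1}^{k-1}EG_{n,i,k}^O(\theta_\iota)\Big| > c_1'(na_n\log n)^{1/2}\Big\}\Big)$$
$$= 0. \qquad (17)$$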
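By condition (II), if $k \le n$ we have
$$\max_{1 \le \iota \le n_1}{\rm Var}\sum_{i=1}^{k-1}\{G_{n,i,k}^I(\theta_\iota) - EG_{n,i,k}^I(\theta_\iota)\} \le c_2na_n \stackrel{\rm def}{=} N_1, \qquad (18)$$
where $c_2$ is a constant. By the condition on $a_n$ and the definition of $G_{n,i,k}^I(\theta_\iota)$, we have constants $c_3$ and $c_4$ such that
$$\max_{1 \le \iota \le n^\alpha}|\{G_{n,i,k}^I(\theta_\iota) - EG_{n,i,k}^I(\theta_\iota)\}| \le c_3T_n = c_3\{na_n/\log n\}^{1/2}\{a_n^{-r}\log^{r+1}n/n^{r-2}\}^{1/(2(r-1))} \le c_4\{na_n/\log n\}^{1/2} \stackrel{\rm def}{=} N_2. \qquad (19)$$
Let $N_3 = c_5\{na_n\log n\}^{1/2}$ with $c_5^2 > 2(\alpha_3 + 3)(c_2 + c_4c_5)$. By Bernstein's inequality (cf. de la Peña, 1999), we have from (18) and (19) that for any $k \le n$,
$$P\Big(\Big|\sum_{i=1}^{k-1}\{G_{n,i,k}^I(\theta_\iota) - EG_{n,i,k}^I(\theta_\iota)\}\Big| > N_3\Big) \le 2\exp\Big(\frac{-N_3^2}{2(N_1 + N_2N_3)}\Big) \le 2\exp\{-c_5^2\log n/(2c_2 + 2c_4c_5)\} \le c_6n^{-\alpha_3-3}.$$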
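Let $c_1 > \max\{c_5, c_1'\}$. We have
$$\sum_{n=1}^\infty P\Big\{\max_{1 < k \le n}\max_{1 < \iota \le n_1}\Big|\sum_{i=1}^{k-1}[G_{n,i,k}^I(\theta_\iota) - EG_{n,i,k}^I(\theta_\iota)]\Big| > c_1(na_n\log n)^{1/2}\Big\}$$
$$\le \sum_{n=1}^\infty\sum_{k=2}^n\sum_{\iota=1}^{n_1}P\Big\{\Big|\sum_{i=1}^{k-1}[G_{n,i,k}^I(\theta_\iota) - EG_{n,i,k}^I(\theta_\iota)]\Big| > c_1(na_n\log n)^{1/2}\Big\} \le \sum_{n=1}^\infty c_6n^{-\alpha_3-3}n^{1+\alpha_3} < \infty. \qquad (20)$$
By (14), (17) and (20) and the Borel-Cantelli lemma, we have
$$p_n \le P\Big\{\bigcap_{\ell=1}^\infty\bigcup_{n=\ell}^\infty\max_{1 < k \le n}\max_{1 < \iota \le n_1}\Big|\sum_{i=1}^{k-1}[G_{n,i,k}^I(\theta_\iota) - EG_{n,i,k}^I(\theta_\iota)]\Big| > c_1(na_n\log n)^{1/2}\Big\} + p_n' = 0.$$
Therefore (13) follows.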
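Let $\Delta_{n,k}^I(\theta_\iota) = \Delta_{n,k}(\theta_\iota)I\{|\Delta_{n,k}(\theta_\iota)| \le c_1(na_n\log n)^{1/2}\}$ and $U_\ell(\theta_\iota) = \sum_{k=2}^\ell\Delta_{n,k}^I(\theta_\iota)\varepsilon_k$. Write
$$\Delta_n(\theta_\iota) = U_n(\theta_\iota) + \sum_{k=2}^n\Delta_{n,k}^O(\theta_\iota)\varepsilon_k,$$
where $\Delta_{n,k}^O(\theta_\iota) = \Delta_{n,k}(\theta_\iota) - \Delta_{n,k}^I(\theta_\iota)$. It is easy to see from (13) that for the second term on the right hand side above,
$$\max_{1 < \iota \le n_1}\Big|\sum_{k=2}^n\Delta_{n,k}^O(\theta_\iota)\varepsilon_k\Big| = O\{n(a_n\log n)^{1/2}\} \qquad (21)$$
almost surely, since for any constant $c > 0$,
$$\sum_{n=1}^\infty P\Big\{\max_{1 < \iota \le n_1}\Big|\sum_{k=2}^n\Delta_{n,k}^O(\theta_\iota)\varepsilon_k\Big| > cn(a_n\log n)^{1/2}\Big\} \le \sum_{n=1}^\infty P\Big(\max_{1 < \iota \le n_1}\max_{1 < k \le n}|\Delta_{n,k}^O(\theta_\iota)| > 0\Big) \le \sum_{n=1}^\infty P\Big\{\max_{1 < \iota \le n_1}\max_{1 < k \le n}|\Delta_{n,k}(\theta_\iota)| > c_1(na_n\log n)^{1/2}\Big\} < \infty.$$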
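Now consider the first term. Let $T_n' = n^{1/2}/\log n$,
$$U_\ell^I(\theta_\iota) = \sum_{k=2}^\ell\Delta_{n,k}^I(\theta_\iota)\{\varepsilon_k(|\varepsilon_k| \le T_n') - E[\varepsilon_k(|\varepsilon_k| \le T_n')]\}$$
and $U_\ell^O(\theta_\iota) = U_\ell(\theta_\iota) - U_\ell^I(\theta_\iota)$. Similar to the proof of (15) and (16), we have almost surely
$$\Big|\sum_{k=2}^\ell\Delta_{n,k}^O(\theta_\iota)E\{\varepsilon_k(|\varepsilon_k| > T_n')\}\Big| = O\{n(a_n\log n)^{1/2}\}, \qquad (22)$$
$|\sum_{k=2}^\ell$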
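Note that
$$|\Delta_{n,k}^I(\theta_\iota)\{\varepsilon_k(|\varepsilon_k| \le T_n') - E[\varepsilon_k(|\varepsilon_k| \le T_n')]\}| < 2c_1(na_n\log n)^{1/2}T_n' = 2c_1n(a_n/\log n)^{1/2} \stackrel{\rm def}{=} N_4$$
and by (II), ${\rm Var}\{U_\ell^I(\theta_\iota)\} = c_2'n^2a_n \stackrel{\rm def}{=} N_5$, where $c_2'$ is a constant. Let $N_6 = c_3'n(a_n\log n)^{1/2}$ with $c_3'^2 > 2(\alpha_3 + 3)(2c_1c_3' + c_2')$. By Bernstein's inequality, we have
$$P(|U_n^I(\theta_\iota)| \ge N_6) \le 2\exp\Big\{-\frac{N_6^2}{2(N_6N_4 + N_5)}\Big\} \le 2n^{-\alpha_3-3}.$$
Therefore
$$\sum_{n=1}^\infty P\Big\{\max_{1 \le \iota \le n_1}|U_n^I(\theta_\iota)| \ge N_6\Big\} < \sum_{n=1}^\infty n_1P\{|U_n^I(\theta_\iota)| \ge N_6\} < \infty.$$
By the Borel-Cantelli lemma, we have
$$\max_{1 \le \iota \le n_1}|U_n^I(\theta_\iota)| = O(N_6) \qquad (24)$$
almost surely. Lemma 6.1 follows from (21), (22), (23) and (24). ■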
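The two applications of Bernstein's inequality in the proof of Lemma 6.1 rest on the same piece of arithmetic: with the chosen thresholds, the exponent $t^2/\{2(V + Mt)\}$ is an exact multiple of $\log n$, so the bound is a power of $n$. The sketch below checks this numerically. It assumes the bound has the form $2\exp[-t^2/\{2(V + Mt)\}]$ used in the proof, and for the second application it reads the variance bound as $N_5 = c_2'n^2a_n$, which is the value that reproduces the stated exponent. The function names and the sample constants are hypothetical.

```python
import math


def bernstein_bound(var_bound, sup_bound, t):
    """2 * exp(-t^2 / (2 * (var_bound + sup_bound * t)))."""
    return 2.0 * math.exp(-t * t / (2.0 * (var_bound + sup_bound * t)))


def first_application(n, a_n, c2, c4, c5):
    """Bound with N1 = c2 n a_n, N2 = c4 (n a_n / log n)^(1/2), N3 = c5 (n a_n log n)^(1/2)."""
    N1 = c2 * n * a_n
    N2 = c4 * math.sqrt(n * a_n / math.log(n))
    N3 = c5 * math.sqrt(n * a_n * math.log(n))
    return bernstein_bound(N1, N2, N3)


def second_application(n, a_n, c1, c2p, c3p):
    """Bound with N4 = 2 c1 n (a_n / log n)^(1/2), N5 = c2' n^2 a_n, N6 = c3' n (a_n log n)^(1/2)."""
    N4 = 2.0 * c1 * n * math.sqrt(a_n / math.log(n))
    N5 = c2p * n * n * a_n
    N6 = c3p * n * math.sqrt(a_n * math.log(n))
    return bernstein_bound(N5, N4, N6)


def power_of_n(n, exponent):
    """2 * n^(-exponent), the form the bound should reduce to."""
    return 2.0 * n ** (-exponent)
```

For example, `first_application(n, a_n, c2, c4, c5)` should agree with `power_of_n(n, c5**2 / (2*c2 + 2*c4*c5))`, so the condition $c_5^2 > 2(\alpha_3 + 3)(c_2 + c_4c_5)$ is what pushes the exponent above $\alpha_3 + 3$ and makes the series in (20) summable.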
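Proof of Lemma 5.1 Write $s_k^\theta(x) = \epsilon_k^\theta(x) + Es_k^\theta(x)$. By Taylor expansion, we have
$$s_k^\theta(x) = \sum_{\tau=0}^3\mu_{k+\tau}f_\theta^{(\tau)}(x)h^\tau + \epsilon_k^\theta(x) + O(h^4). \qquad (25)$$
Because ${\rm Var}\{\epsilon_k^\theta(x)\} = O\{(nh)^{-1}\}$, it follows from Lemma 6.1 that $\epsilon_k^\theta(x) = O(\delta_n)$. It is easy to check that
$$D_{n,0}^\theta(x) = f_\theta^2 + \tfrac12(\mu_4 + 1)f_\theta f_\theta''h^2 - (f_\theta')^2h^2 + f(\epsilon_0^\theta + \epsilon_2^\theta) - 2f_\theta'h\epsilon_1^\theta + O(\gamma_n^2),$$
$$D_{n,2}^\theta(x) = f_\theta^2 + \mu_4(f_\theta f_\theta'' - (f_\theta')^2)h^2 + 2f_\theta\epsilon_2^\theta - f_\theta'h\epsilon_3^\theta - \mu_4f_\theta'h\epsilon_1^\theta + O(\gamma_n^2),$$
$$D_{n,3}^\theta(x) = f_\theta\epsilon_3^\theta + O(h\gamma_n),\quad D_{n,4}^\theta(x) = \mu_4f_\theta^2 + O(\gamma_n),\quad D_{n,5}^\theta(x) = O(h),$$
$$T_{n,0}^\theta(X|x) = f_\theta^2\nu_\theta(x) + O(\gamma_n),\quad S_{n,0}^\theta(X|x) = O(h),\quad T_{n,k}^\theta(X|x) = O(1),\quad S_{n,k}^\theta(X|x) = O(1),\ \text{for } k \ge 1,$$
$$T_{n,0}^\theta(|\theta^\top X_{ix}|^6\,|\,x) = O(h^6),\quad S_{n,0}^\theta(|\theta^\top X_{ix}|^6\,|\,x) = O(h^6),\quad T_{n,0}^\theta(XX^\top|x) = O(1),\quad S_{n,0}^\theta(XX^\top|x) = O(h),$$
$$E_{n,2}(x) = (\mu_4 - 1)f_\theta f_\theta'h + f(\epsilon_3^\theta - \epsilon_1^\theta) + O(h\gamma_n),\quad E_{n,3}(x) = \mu_4f_\theta^2 + O(\gamma_n),\quad E_{n,4}(x) = O(h).$$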
Note that
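and
$$A_n^\theta(x) = \sum_{k=2}^5\frac{1}{k!}g^{(k)}(\theta_0^\top x)\frac{D_{n,k}^\theta(x)}{D_{n,0}^\theta(x)}h^{k-2},\qquad B_n^\theta(x) = \sum_{k=0}^4\frac{1}{k!}g^{(k+1)}(\theta_0^\top x)\frac{T_{n,k}(X|x) - D_{n,k}(x)x}{D_{n,0}^\theta(x)}h^k,$$
$$C_n(x, \theta) = \tfrac12g''(\theta_0^\top x)\{T_{n,0}(XX^\top|x) - T_{n,0}(X|x)x^\top - xT_{n,0}(X^\top|x) + xx^\top D_{n,0}^\theta(x)\}\{D_{n,0}^\theta(x)\}^{-1},$$
$$\tilde A_n^\theta(x) = \sum_{k=2}^4\frac{1}{k!}g^{(k)}(\theta_0^\top x)\frac{E_{n,k}^\theta(x)}{D_{n,0}^\theta(x)}h^{k-2},\qquad \tilde B_n^\theta(x) = \sum_{k=1}^4\frac{k}{k!}g^{(k)}(\theta_0^\top x)\frac{S_{n,k}(X|x) - E_{n,k}(x)x}{D_{n,0}^\theta(x)}h^k,$$
$$\tilde C_n(x, \theta) = \tfrac12g''(\theta_0^\top x)\{S_{n,0}(XX^\top|x) - S_{n,0}(X|x)x^\top - xS_{n,0}(X^\top|x) + xx^\top E_{n,0}^\theta(x)\}\{D_{n,0}^\theta(x)\}^{-1}.$$
Lemma 5.1 follows from simple calculations based on the above equations. ■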
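Proof of Lemma 5.2 It follows from Lemma 6.1 that $\eta_n^\theta(x) = O(\delta_n)(1 + |x|)$ and $s_0^\theta = f_\theta + \tilde\epsilon_0^\theta$, where $\tilde\epsilon_0^\theta = \epsilon_k^\theta + (Es_k^\theta - f_\theta) = O(\gamma_n)$. Because $|\rho_n''(\cdot)| < n^{2\varsigma}$, we have
$$\rho_n(s_0^\theta(X_j)) = \rho_n(f_\theta(X_j)) + \rho_n'(f_\theta(X_j))\tilde\epsilon_0^\theta(X_j) + O(n^{2\varsigma}\gamma_n^2). \qquad (26)$$
Thus $A_n^\theta = \tilde E_n^\theta + Q_{n,1}^\theta + O(n^{2\varsigma}\gamma_n^3)$, where $\tilde E_n^\theta = n^{-2}\sum_{i=1}^n\sum_{j=1}^n\rho_n(f_\theta(X_j))f_\theta^{-1}(X_j)g'(\theta^\top X_j)K_h^\theta(X_{ij})X_{ij}\varepsilon_i$, and $Q_{n,1}^\theta = n^{-1}\sum_{i=1}^nG_{n,i}^\theta$ with
$$G_{n,i}^\theta = n^{-1}\sum_{j=1}^n\Big[\tfrac12f_\theta''(X_j)\{\rho_n'(f_\theta(X_j)) - \rho_n(f_\theta(X_j))f_\theta^{-1}(X_j)\}h^2 + \{1 - \rho_n(f_\theta(X_j))f_\theta^{-1}(X_j)\}\tilde\epsilon_0^\theta(X_j)\Big] \times f_\theta^{-1}(X_j)g'(\theta^\top X_j)K_h^\theta(X_{ij})X_{ij}\varepsilon_i.$$
Simple calculations lead to $E\tilde E_n^\theta = 0$, $E(\tilde E_n^\theta)^2 = O(n^{-1})$, $E(G_{n,i}^\theta) = 0$ and $E(G_{n,i}^\theta)^2 = O\{h^4 + (nh)^{-1}\}$. By the first part of Lemma 6.1, we have
$$\tilde E_n^\theta = O\{(\log n/n)^{1/2}\},\qquad Q_{n,1}^\theta = O\{h^2(\log n/n)^{1/2} + n^{-1}(\log n/h)^{1/2}\}.$$
By Taylor expansion, $g'(\theta_0^\top x) = g'(\theta^\top x) + g''(v^*)(\theta_0 - \theta)^\top x$, where $v^*$ is a value between $\theta^\top x$ and $\theta_0^\top x$. Write
$$\tilde E_n^\theta = E_n^\theta + Q_{n,2}^\theta + r_{n,0}(\theta - \theta_0),$$
where $Q_{n,2}^\theta = n^{-2}\sum_{i=1}^n\sum_{j=1}^n\{\rho_n(f_\theta(X_j))f_\theta^{-1}(X_j)g'(\theta^\top X_j)K_h^\theta(X_{ij})X_{ij} - \rho_n(f_\theta(X_i))g'(\theta^\top X_i)\nu_\theta(X_i)\}\varepsilon_i$ and $r_{n,0} = O(\gamma_n/h)$. By Lemma 6.1 and that ${\rm Var}(Q_{n,2}^\theta) = O\{h^4 + (nh)^{-1}\}$, we have
Let $Q_n^\theta = Q_{n,1}^\theta + Q_{n,2}^\theta$. It is easy to check that $E\{Q_n^\theta E_n^\theta\} = o(h^8 + (nh)^{-2})$. Therefore, the first part of Lemma 5.2 follows.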
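Similarly, we have from (26) that
$$B_n^\theta = (nh)^{-1}\sum_{j=1}^n\{\rho_n(f_\theta(X_j)) + \rho_n'(f_\theta(X_j))\tilde\epsilon_0^\theta(X_j)\}e_k^\theta(X_j)\eta_n^\theta(X_j)/f_\theta(X_j) + O(n^{2\varsigma}\gamma_n^4/h).$$
Let $\tilde R_n^\theta$ be the first term on the right hand side above. Then
$$\tilde R_n^\theta = n^{-3}\sum_{j=1}^n\{\rho_n(f_\theta(X_j)) + \rho_n'(f_\theta(X_j))\tilde\epsilon_0^\theta(X_j)\}\sum_{i=1}^nK_h^2(\theta^\top X_{ij})(\theta^\top X_{ij}/h)^kX_{ij}\varepsilon_i^2/f_\theta(X_j)$$
$$\qquad + n^{-3}\sum_{j=1}^n\{\rho_n(f_\theta(X_j)) + \rho_n'(f_\theta(X_j))\tilde\epsilon_0^\theta(X_j)\}\sum_{i \ne \ell}^nK_h(\theta^\top X_{ij})(\theta^\top X_{ij}/h)^kK_h(\theta^\top X_{\ell j})X_{\ell j}\varepsilon_i\varepsilon_\ell/f_\theta(X_j)$$
$$\stackrel{\rm def}{=} \tilde R_{n,1}^\theta + \tilde R_{n,2}^\theta + \tilde R_{n,3}^\theta + \tilde R_{n,4}^\theta.$$
If $\varepsilon$ is independent of $X$, then
$$E_\theta(x) \stackrel{\rm def}{=} E\{K_h^2(\theta^\top X_{ix})(\theta^\top X_{ix}/h)^kX_{ix}\varepsilon_i^2\} = h^{-1}\sum_{\ell=0}^2\frac{1}{\ell!}\tilde\mu_{k+\ell}\{f_\theta(x)\nu_\theta(x)\}^{(\ell)}h^\ell\sigma^2 + O(h^2),$$
where $\tilde\mu_k = \int K^2(v)v^k\,dv$. By Lemma 6.1, we have
$$n^{-1}\sum_{i=1}^nK_h^2(\theta^\top X_{ix})(\theta^\top X_{ix}/h)^kX_{ix}\varepsilon_i^2 - E_\theta(x) = O(h^{-1}\delta_n).$$
Thus
$$R_{n,0}^\theta \stackrel{\rm def}{=} (n^2h)^{-1}\sum_{j=1}^n\rho_n(f_\theta(X_j))\Big[n^{-1}\sum_{i=1}^nK_h^2(\theta^\top X_{ij})(\theta^\top X_{ij}/h)^kX_{ij}\varepsilon_i^2 - E_\theta(X_j)\Big] = O\{(nh^2)^{-1}\delta_n\}. \qquad (27)$$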
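It is easy to check that $E\{E_n^\theta R_{n,0}^\theta\} = 0$. Write
$$(n^2h)^{-1}\sum_{j=1}^n\rho_n(f_\theta(X_j))E_\theta(X_j) = (nh)^{-1}E\{\rho_n(f_\theta(X_j))E_\theta(X_j)\} + R_{n,1}^\theta,$$
where $E\{R_{n,1}^\theta E_n^\theta\} = 0$ and
$$R_{n,1}^\theta = (n^2h)^{-1}\sum_{j=1}^n\big[\rho_n(f_\theta(X_j))E_\theta(X_j) - E\{\rho_n(f_\theta(X_j))E_\theta(X_j)\}\big] = O\{(nh^2)^{-1}(n/\log n)^{-1/2}\}. \qquad (28)$$
Note that $E\{\rho_n(f_\theta(X))\nu_\theta(X)\} = 0$. We have
$$(nh)^{-1}E\{\rho_n(f_\theta(X_j))E_\theta(X_j)\} = \frac{\tilde c_{k,n}}{nh} + R_{n,2}^\theta, \qquad (29)$$
where $R_{n,2}^\theta = O(n^{-1})$ and $E\{R_{n,2}^\theta E_0^\theta\} = 0$. By (27)-(29) and the fact that $(n/\log n)^{-1/2} = o(\gamma_n)$, we have
$$\tilde R_{n,1}^\theta = \frac{\tilde c_{k,n}}{nh} + R_{n,1}^\theta + R_{n,2}^\theta. \qquad (30)$$
Similarly
$$\tilde R_{n,2}^\theta = O\{(nh)^{-1}\gamma_n\}. \qquad (31)$$
Let $G_{n,i,\ell}^\theta = n^{-1}\sum_{j=1}^n\rho_n(f_\theta(X_j))K_h(\theta^\top X_{ij})(\theta^\top X_{ij}/h)^kK_h(\theta^\top X_{\ell j})X_{\ell j}/f_\theta(X_j)$. Write $\tilde R_{n,3}^\theta$ as
$$\tilde R_{n,3}^\theta = n^{-2}\sum_{i \ne \ell}^n\tfrac12(G_{n,i,\ell}^\theta + G_{n,\ell,i}^\theta)\varepsilon_i\varepsilon_\ell = n^{-2}\sum_{\ell=1}^n\Big\{\sum_{i < \ell}\tfrac12(G_{n,i,\ell}^\theta + G_{n,\ell,i}^\theta)\varepsilon_i\Big\}\varepsilon_\ell.$$
By the second part of Lemma 6.1, we have
$$\tilde R_{n,3}^\theta = O\{n^{-1/2}\delta_n\}. \qquad (32)$$
Similarly, we have
$$\tilde R_{n,4}^\theta = O\{n^{-1/2}\delta_n\}. \qquad (33)$$
Thus the second part of Lemma 5.2 follows from (30) and (31).
The third part of Lemma 5.2 can be proved in the same way as the second part. ■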
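Proof of Lemma 5.4 By (8), Lemma 5.1 and $\theta_0 = \theta + (\theta_0 - \theta)$, simple calculations lead to
$$Y_i - a_\theta(x) - d_\theta(x)\theta_0^\top X_{ix} = \varepsilon_i + \{\tilde A_\theta(x, X_i) - A_n^\theta(x)h^2\} + \{\tilde B_\theta(x, X_i) - B_n^\theta(x)\}^\top(\theta_0 - \theta) - V_n^\theta(x) + O\{h^2\gamma_n^2 + \delta_\theta^3\},$$
where $\tilde A_\theta(x, X_i) = A_\theta(x, X_i) - d_\theta(x)\theta^\top X_{ix}$ and $\tilde B_\theta(x, X_i) = B_\theta(x, X_i) - d_\theta(x)X_{ix}$. It follows from the Taylor expansion that
$$C_{n,k}^\theta(x) \stackrel{\rm def}{=} n^{-1}\sum_{i=1}^nK_h^\theta(X_{ix})(\theta^\top X_{ix}/h)^kX_{ix} = \sum_{\ell=0}^5\frac{1}{\ell!}\mu_{k+\ell}(f\mu_\theta)^{(\ell)}h^\ell + \tilde\xi_k^\theta + O(h^6),$$
where $\tilde\xi_k^\theta = n^{-1}\sum_{i=1}^n\{K_h^\theta(X_{ix})(\theta^\top X_{ix}/h)^kX_{ix} - EK_h^\theta(X_{ix})(\theta^\top X_{ix}/h)^kX_{ix}\} = \xi_k^\theta - x\epsilon_k^\theta$. We have
$$n^{-1}\sum_{i=1}^nK_h^\theta(X_{ix})X_{ix}\tilde A_\theta(x, X_i) = \{g'(\theta_0^\top x) - d_\theta(x)\}C_{n,1}(x)h + \sum_{k=2}^5\frac{1}{k!}g^{(k)}(\theta_0^\top x)C_{n,k}^\theta(x)h^k$$
$$= -\tfrac12g''(\mu_4 - 1)f_\theta'f_\theta^{-1}(\nu_\theta f_\theta)'h^4 + \tfrac12g''(\nu_\theta f_\theta)'(\epsilon_3^\theta - \epsilon_1^\theta)h^2 + \tilde V_n^\theta\{(\nu_\theta f_\theta)'h + \tfrac16\mu_4(\nu_\theta f_\theta)''h^3 + \tilde\xi_1^\theta\}$$
$$\qquad + \tfrac12g''h^2\{f_\theta\nu_\theta + \tfrac12\mu_4(f_\theta\nu_\theta)''h^2\} + \tfrac{1}{24}g^{(4)}\mu_4f_\theta\nu_\theta h^4 + \tfrac12g''h^2\tilde\xi_2^\theta + \tfrac16g'''h^3\tilde\xi_3^\theta + O(h^2\gamma_n^2).$$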