Reducing bias in nonparametric density estimation via bandwidth dependent kernels: L1 view

(1)

Munich Personal RePEc Archive

Reducing bias in nonparametric density

estimation via bandwidth dependent

kernels: L1 view

Mynbaev, Kairat and Martins-Filho, Carlos

KazakhBritish Technical University, University of Colorado

-Boulder

2016

Online at

https://mpra.ub.uni-muenchen.de/75902/

(2)

Reducing bias in nonparametric density estimation via bandwidth

dependent kernels:

_L1

view

1

Kairat Mynbaev

International School of Economics Kazakh-British Technical University Tolebi 59

Almaty 050000, Kazakhstan

email: kairat [email protected]

and

Carlos Martins-Filho

Department of Economics IFPRI

University of Colorado 2033 K Street NW

Boulder, CO 80309-0256, USA & Washington, DC 20006-1002, USA email: [email protected] email: [email protected]

November, 2016

Abstract. We define a new bandwidth-dependent kernel density estimator that improves existing conver-gence rates for the bias, and preserves that of the variation, when the error is measured inL1. No additional

assumptions are imposed to the extant literature.

Keywords and phrases. Kernel density estimation, higher order kernels, bias reduction.

AMS-MS Classification. 62G07, 62G10, 62G20.

(3)

1 Introduction

Given a sequence ofn∈N_{independent realizations}_{_X_j_}n

j=1 of the random variableX, having densityf on

R_{, the Rosenblatt-Parzen kernel estimator (Rosenblatt (1956), Parzen (1962)) of}_f _{is given by}

fn(x) =

1

n

X

j=1

(ShnK)(x−Xj), (1.1)

whereShn is an operator defined by

(ShnK)(x) = 1

hn

K

x hn

, (1.2)

K is a kernel, i.e., a function onR_{such that}R

K(x)dx= 1 andhn>0 is a non-stochastic bandwidth such

thathn→0 asn→ ∞.1

One of the most natural and mathematically sound (Devroye and Gy¨orfi (1985), Devroye (1987)) criteria

to measure the performance offn as an estimator off is theL1distanceR|fn−f|. In particular, given that

this distance is a random variable (measurable function of{Xj}nj=1) it is convenient to focus onE

R

|fn−f|

,

whereE denotes the expectation taken using f. For this criterion, there is a simple bound (Devroye, 1987,

p. 31)

E

Z

|fn−f|

≤

Z

|(f∗ShnK)−f|+E

Z

|fn−f∗ShnK|

,

where for arbitrary f, g ∈ L1, (f ∗g)(x) = R g(y)f(x−y)dy is the convolution of f and g. The term

R

|f∗ShnK−f|is called bias overRandE R|fn−f∗ShnK|is called the variation overR. There exists

a large literature devoted to establishing conditions on f and K that assure suitable rates of convergence

of the bias to zero as n→ ∞(see, inter alia, Silverman (1986), Devroye (1987) and Tsybakov (2009)). In

particular, ifK is of orders, i.e.,αj(K) = 0 forj = 1, ..., s−1 andαs(K)6= 0, where αj(K) =RtjK(t)dt

is thejth moment of K, and f has an integrable derivativef(s)_{, then} R

|f ∗ShnK−f| is of order O(hsn)

and this order cannot be improved, see, e.g., (Devroye, 1987, Theorem 7.2). In this note, we show that if in

(1.2) the kernel is allowed to depend onn, then the orderO(hs

n) can be replaced by the ordero(hsn), without

increasing the order of the kernel or the smoothness of the density. In addition, another result from Devroye

1_{Throughout this note, integrals are over}R_{, unless otherwise specified.}

(4)

(1987) states that ifKis a kernel of order greater thansand the derivativef(s)_is_a_{-Lipschitz then the bias}

is of orderO(hs+a

n ). We achieve the same rate of convergence with kernels of orders.

2 Main results

LetL1andC denote the spaces of integrable and (bounded) continuous functions onRwith normskfk₁=

R

|f|andkfkC= sup|f|, andβs(K) =R |t|s|K(t)|dt. Let{Kn}be a sequence of kernels and define

ˆ

fn(x) =

1

n

X

j=1

(ShnKn)(x−Xj).

In the following Theorem 1, the densityf has the same degree of smoothness and the kernelsKn are of the

same order as in (Devroye, 1987, Theorem 7.2), but the bias is of ordero(hs

n) instead ofO(hsn). This results

because the kernels depend onnand have “disappearing” moments of orders.

Theorem 1. Let {Kn} be a sequence of kernels of order s such that: 1. αs(Kn) → 0; 2. {usKn(u)} is

uniformly integrable. For allf with absolutely continuousf(s−1)_and_f(s)_∈_L

1, we havekf∗ShnKn−fk1=

o(hs n).

Proof. Note that sinceKn is a kernel

f ∗ShnKn(x)−f(x) =

Z

Kn(t)[f(x−hnt)−f(x)]dt. (2.1)

Sincef iss-times differentiable, by Taylor’s Theorem,

f(x−hnt)−f(x) = s−1

X

j=1

f(j)₍_x₎

j! (−hnt)

j₊Z x−hnt

x

(x−hnt−u)s−1

(s−1)! f

(s)₍_u₎_du.

Furthermore, given thatKn is of orders,

f∗ShnKn(x)−f(x) =

1 (s−1)!

Z Z x−hnt

x

(x−hnt−u)s−1f(s)(u)duKn(t)dt. (2.2)

Lettingλ=−u−x

hnt we have

Z x−hnt

x

(x−hnt−u)s−1f(s)(u)du= (−hnt)s

Z 1

0

(5)

Substituting (2.3) into (2.2) we obtain

(−hn)s

s!

Z Z 1

0

f(s)(x−hnλt)s(1−λ)s−1dλtsKn(t)dt. (2.4)

SinceR1

0(1−λ)s−1dλ= 1

s, we have that

(−hn)s

(s−1)!

Z Z 1

0

f(s)(x)(1−λ)s−1dλtsKn(t)dt=

(−hn)s

s! f

(s)₍_x₎Z _ts_K

n(t)dt. (2.5)

Then, adding and subtracting (2.5) to the right-hand side of (2.4) gives

(−hn)s

s!

f(s)(x)αs(Kn) +

Z Z 1

0

[f(s)(x−hnλt)−f(s)(x)]s(1−λ)s−1dλtsKn(t)dt

.

Since f(s) _∈ _L

1 we write its continuity modulus as ω(δ) = sup_|t|≤δ

R

f(s)(x−t)−f(s)(x)

dx. It is

well-known (see properties M.2, M.6 and M.7 in Zhuk and Natanson (2003)) that

ω(δ)≤2 f

(s)

1, ωis nondecreasing and limδ→0ω(δ) = 0. (2.6)

Then,

kf ∗ShnKn−fk1≤ h s n s! f (s)

₁|αs(Kn)|+ Z Z 1

0

Z f

(s)₍_x

−hnλt)−f(s)(x)

dx s(1−λ)

s−1_dλ

|tsKn(t)|dt

≤ h s n s! f (s)

1|αs(Kn)|+

Z 1

0

Z

ω(λhn|t|)|tsKn(t)|dt s(1−λ)s−1dλ

= h s n s!   f (s)

1|αs(Kn)|+

Z 1

0

  Z

|t|≤_√1

hn

+

Z

|t|>_√1

hn



ω(λhn|t|)|tsKn(t)|dt s(1−λ)s−1dλ

  (2.7) ≤ h s n s!   f (s)

1|αs(Kn)|+ω

p

hn

βs(Kn) + 2

f (s) 1 Z

|t|>_√1

hn

|tsKn(t)|dt



. (2.8)

Given thatαs(Kn)→0 as n→ ∞,{tsKn(t)} is uniformly integrable, which implies sup

n βs(Kn)<∞, and

using (2.6) and (2.8) we have

kf∗ShnKn−fk1=o(hsn). (2.9)

Remark 1. Kernel sequences {Kn} that satisfy the restrictions imposed by Theorem 1 can be easily

constructed. To this end, denote byBsthe space of functions with bounded normkKk_Bs=β0(K) +βs(K).

(6)

Take functionsK(0), K(s)∈ Bs such that

α0(K(0)) = 1, αj(K(0)) = 0 forj = 1, ..., s;αj(K(s)) = 0 forj= 0,1, ..., s−1, αs(K(s)) = 1. (2.10)

We define then-dependent kernel Kn=K(0)+hnK(s) with 0< hn ≤1. Note thatKn is a kernel of order

swithαs(Kn) =hn which tends to zero asn→ ∞. It is clear that any kernelK of order scan be written

as K = K(0)+αs(K)K(s), so that the conventional s-order kernels obtain from ours with αs(K) = hn.

Furthermore, it follows from (2.10) that{ts₍_K

(0)(t) +hnK(s)(t))}is uniformly integrable.

Now, to obtain K(0) andK(s), assume that for a nonnegative kernel Kwe haveβ2s(K)<∞. Then, we

can associate withK a symmetric matrix

As=

   

α0(K) α1(K) ... αs(K)

α1(K) α2(K) ... αs+1(K)

... ... ... ... αs(K) αs+1(K) ... α2s(K)

   

,

such that detAs 6= 0 (see Mynbaev et al. (2014)). For an arbitrary vector b ∈ Rs+1 let a = A−s1b and

define a polynomial transformation of K by (TaK)(t) = Psi=0aitiK(t). Then, we put K(0)=TaK with

b = (1,0, ...,0)′ _and _K

(s) = TaK with b = (0, ...,0,1)′, which satisfy (2.10). Thus, we have the following

corollary to Theorem 1.

Corollary 1. Let Kn =K(0)+hnK(s) where K(0) and K(s) are as defined in Remark 1. Then, for all f

with absolutely continuousf(s−1)_and_f(s)_∈_L

1, we have kf∗ShnKn−fk1=o(hsn).

Remark 2. IfKnare supported on [−M, M] for someM >0 and for alln, then in (2.7), instead of splitting

R₌

|t| ≤1/√hn ∪|t|>1/√hn we can useR={|t| ≤M} ∪ {|t|> M}and then instead of (2.8) we get

kf∗ShnKn−fk1≤

hs n

s!

h f

(m)

1|αs(Kn)|+ω(hnM)βs(Kn)

i

.

Hence, selecting {Kn} in such a way that αs(Kn) = O(ω(hn)), sup

n βs(Kn) < ∞ and using the fact that

ω(hnM)≤(M + 1)ω(hn)2 we get a result that is more precise than (2.9), i.e.,

kf∗ShnKn−fk1=O(hsnω(hn)).

(7)

Remark 3. By Young’s inequality, the variation of ˆfn using Kn =K(0)+hnK(s)is such that

E

Z

|fˆn−f∗ShnKn| ≤ E

Z

|fˆn−f∗ShnK(0)|+hn

Z

|f ∗ShnK(s)|

≤ E

Z

|fˆn−f∗ShnK(0)|+hn

Z

|f|

Z

|K(s)|.

Letting fn(0) be the estimator in (1.1) with K = K(0), we have ER|fˆn−f ∗ShnK(0)| ≤ ER|fn(0)−f ∗

ShnK(0)|+hnR|f|R|K(s)|. Hence,

E

Z

|fˆn−f ∗ShnKn| ≤E

Z

|f(0)

n −f∗ShnK(0)|+ 2hn

Z

|f|

Z

|K(s)|.

Since,hn→0 asn→ ∞, the variation of ˆfn is asymptotically bounded by the variation of the conventional

estimatorfn usingK(0), i.e.,ER|fn(0)−f ∗ShnK(0)|.

Under the assumptions that f has a variance,R

(1 +t2₎₍_K

(0)(t))2dt <∞, (Devroye, 1987, Theorem 7.4)

showed thatER

|fn(0)−f∗ShnK(0)|=O((nhn)−1/2). Thus,

p

nhnE

Z

|fˆn−f ∗ShnKn|= (1 + (nh3n)1/2)O(1) =O(1),

where the last equality follows ifnh3

n ≤c <∞.

We now provide an analog for Theorem 7.1 in Devroye (1987). There, the bias orderO(hs+a_{) is achieved}

for kernels with orders greater thans, while in the following theorems we obtain the same order of bias for

kernels of orders.

Theorem 2. Let{Kn}be a sequence of kernels of orderssuch that: 1. αs(Kn) =O(han); 2. sup

n βs+a(Kn)<

∞, for some a ∈ (0,1]. For f with absolutely continuous f(s−1) _and _f(s) _∈ _L

1 assume that for some

0< c <∞

ω(δ)≤c|δ|a. (2.11)

Then, kf∗ShnKn−fk1=O(hs+an ).

Proof. As in the proof of Theorem 1, we have

kf∗ShnKn−fk1≤

hs n

s!

f

(s)

₁|αs(Kn)|+ Z 1

0

Z

ω(λhn|t|)s(1−λ)s−1|tsKn(t)|dtdλ

(2.12)

(8)

By (2.11)

kf∗ShnKn−fk1≤

hs n

s!

f

(s)

₁|αs(Kn)|+ch

a n

Z Z 1

0

s(1−λ)s−1_dλ

ts+aKn(t)

dt

.

Hence, under conditions 1. and 2. in the statement of the theorem we havekf∗ShnKn−fk1=O(h s+a n ).

Remark 4. Practitioners may find condition (2.11) too general, preferring more primitive conditions on

f(s)_{. To this end, we say that a function}_g_{defined on}_R_{satisfies a}_{global Lipschitz condition}_{of order}_a_∈₍₀_,_1]

if there exist positive functionsl(x), r(x) such that

|g(x−h)−g(x)| ≤l(x)|h|a for|h| ≤r(x), x∈R_. _(2.13)

The function l is called a Lipschitz constant and the function r is called a Lipschitz radius. The class

Lip(a, δ), forδ >1, is defined as the set of functionsg which satisfy (2.13) withl andrsuch that

Z

(l(x) +r(x)−δ)dx <∞. (2.14)

In the next lemma we give two sufficient sets of conditions forg∈Lip(a, δ). In the first casegis compactly

supported, and in the second it is not.

Lemma 1. a) Supposeg has compact support,suppg⊆[−N, N]for someN >0, andg satisfies the usual

Lipschitz condition |g(x−h)−g(x)| ≤ c|h|a _{for any} _{x, h} _{and some} _a_∈ ₍₀_,_1]_{. Set} _l₍_x_{) =} _{c, r}₍_x_{) = 1}_for

|x|< N andl(x) = 0, r(x) =|x| −N for|x| ≥N. Then,g∈Lip(a, δ)with anyδ >1.

b) Suppose that|g(1)₍_t₎_{| ≤}_ce−|t|_{, t}_∈R_{. Let} _l₍_x_{) =}_c_exp(_−|_x_|_/_{2 + 1}_/₂₎_{, r}₍_x_{) = (1 +}_|_x_|₎_/₂_{, x}_∈R_{. Then,}

g∈Lip(1, δ)with anyδ >1.

Proof. a) If|x|< N, then|g(x−h)−g(x)| ≤ |h|a_l₍_x_{) for all}_h_{(and not only for}_|_h_{| ≤}_r₍_x_{)). If}_|_x_{| ≥}_N_{, then}

|h| ≤r(x) =|x| −N implies|x−h| ≥ |x| − |h| ≥N, so that|g(x−h)−g(x)|= 0 =|h|a_l₍_x_{) for}_|_h_{| ≤}_r₍_x_).

b) Let|x| ≥1. We have

|g(x−h)−g(x)| ≤ |h| sup

|t−x|≤|h||

g(1)₍_t₎_{| ≤ |}_h_|_c _sup

|t−x|≤|h|

e−|t|_. _(2.15)

|t−x| ≤ |h| ≤r(x) = (1 +|x|)/2 implies|t|=|x+t−x| ≥ |x| − |t−x| ≥ |x|/2−1/2≥0 and (2.15) gives

(9)

that by (2.15),|g(x−h)−g(x)| ≤ |h|c≤ |h|cexp(1/2− |x|/2) =|h|l(x) for allhand not only for|h| ≤r(x).

Condition (2.14) is obviously satisfied in both cases.

By part a) of Lemma 1, compactly supported densities with derivative f(s) _{that satisfies the usual} _a

-Lipschitz condition are such that f(s) _∈ _Lip₍_{a, δ}_{) for any} _{δ >}_{1. This corresponds to the case treated in}

Theorem 7.1 of Devroye (1987). Part b) shows that for densities with unbounded domains, not covered by

Theorem 7.1, iff(s)₍_x_{) has derivative that decays exponentially as} _|_x_{| → ∞}_{, then} _f(s) _∈_Lip₍₁_{, δ}_{) for any}

δ >1. Next we provide a version of Theorem 2 for densities with derivativef(s)_∈_Lip₍_{a, δ}_).

Theorem 3. Suppose that the density f is such that its derivative f(s) _{belongs to} _L

1, C (the respective

norms are finite) andLip(a, δ). Let{Kn}be a sequence of kernels of orderssuch that: 1. αs(Kn) =O(han);

2. sup

n max{βs+a(Kn), βs+δ(Kn)}<∞. Then,kf∗ShnKn−fk1=O(h s+a n ).

Proof. As in the proof of Theorem 1, we have

kf∗ShnKn−fk1≤

hs n s! f (s)

₁|αs(Kn)|+ Z 1

0

Z Z f

(s)₍_x

−hnλt)−f(s)(x)

|t

s_K

n(t)|dt dx s(1−λ)s−1dλ

.

LetI(x) =R

f(s)(x−hnλt)−f(s)(x)

|tsKn(t)|dtand note that sincef(s)∈Lip(a, δ)

I(x) =

  

Z

λhn|t|≤r(x)

+

Z

λhn|t|>r(x)

   f

(s)₍_x

−hnλt)−f(s)(x)

|t

s_K n(t)|dt

≤

Z

λhn|t|≤r(x)

l(x)λahan|t|a+s|Kn(t)|dt+

Z

λhn|t|>r(x)

|f(s)(x−hnλt)−f(s)(x)||tsKn(t)|dt

≤hanβs+a(Kn)l(x) +

Z

λhn|t|>r(x)

|f(s)(x−hnλt)−f(s)(x)||tsKn(t)|dt.

LettingI1(x) =

R

λhn|t|>r(x)

|f(s)₍_x₋_h

nλt)−f(s)(x)||tsKn(t)|dtwe have

I1(x)≤

Z

λhn|t|>r(x)

1

|t|δ

|f(s)(x−hnλt)|+|f(s)(x)|

|t|s+δ|Kn(t)|dt.

Noting that|t|−δ _{< λ}δ_hδ

nr(x)−δ and given thatkf(s)kC <∞, we obtain I1(x)≤ 2kf(s)kCλ δ

hδ n

r(x)δβs+δ(Kn).

Consequently,

I(x)≤hanmax{βs+a(Kn), βs+δ(Kn)}

l(x) + 2hδn−akf(s)kC

1

r(x)δ

. (2.16)

(10)

SinceR1

0 s(1−λ)

s−1_ds₌1

s and given (2.14)

kf∗ShnKn−fk1≤h s n

s!

f

(s)

1|αs(Kn)|+

1

sh

a

nmax{βs+a(Kn), βs+δ(Kn)}

Z

l(x) + 2kf(s)kC 1

r(x)δ

dx

.

(2.17)

Thus, using conditions 1. and 2. in the statement of the theorem, we havekf∗ShnKn−fk1=O(hs+an ).

Remark 5. As in the case of Theorem 1, Theorems 2 and 3 do not address the construction of the kernel

sequence{Kn}. The following corollary to Theorem 3 shows thatKn =K(0)+hanK(s) is a suitable kernel

sequence, whereK(0) andK(s)are as defined above.

Corollary 2. Suppose the density f is such that its derivative f(s) _{belongs to} _L

1, C and to Lip(a, δ),

where a ∈ (0,1], δ > 1. Let K(0), K(s) satisfy (2.10) and belong to the intersection Bs+a∩ Bs+δ. Put

Kn=K(0)+hanK(s),0≤hn≤1.ThenKnis a kernel of ordersforhn>0andkf∗ShnKn−fk1=O(hs+an ).

The condition K(0) ∈ Bs+a and the definition Kn = K(0)+hanK(s) can be replaced by K(0) ∈ Bs+1 and

Kn=K(0)+hnK(s), respectively, without affecting the conclusion.

References

Devroye, L., 1987. A Course in density estimation. Birkh¨auser, Boston, MA.

Devroye, L., Gy¨orfi, L., 1985. Nonparametric density estimation: theL1 view. John Wiley and Sons, New

York, NY.

Mynbaev, K., Nadarajah, S., Withers, C., Aipenova, A., 2014. Improving bias in kernel density estimation. Statistics and Probability Letters 94, 106–112.

Parzen, E., 1962. On estimation of a probability density and mode. Annals of Mathematical Statistics 33, 1065–1076.

Rosenblatt, M., 1956. Remarks on some nonparametric estimates of a density function. Annals of Mathe-matical Statistics 27, 832–837.

Silverman, B. W., 1986. Density estimation for statistics and data analysis. Chapman and Hall, London.

Tsybakov, A. B., 2009. Introduction to nonparametric estimation. Springer-Verlag, New York, NY.