Learning Theory Approach to Minimum Error Entropy Criterion


Ting Hu [email protected]

School of Mathematics and Statistics Wuhan University

Wuhan 430072, China

Jun Fan [email protected]

Department of Mathematics City University of Hong Kong 83 Tat Chee Avenue

Kowloon, Hong Kong, China

Qiang Wu [email protected]

Department of Mathematical Sciences Middle Tennessee State University Murfreesboro, TN 37132, USA

Ding-Xuan Zhou [email protected]

Department of Mathematics City University of Hong Kong 83 Tat Chee Avenue

Kowloon, Hong Kong, China

Editor: Gabor Lugosi

Abstract

We consider the minimum error entropy (MEE) criterion and an empirical risk minimization learning algorithm when an approximation of Rényi's entropy (of order 2) by Parzen windowing is minimized. This learning algorithm involves a Parzen windowing scaling parameter. We present a learning theory approach for this MEE algorithm in a regression setting when the scaling parameter is large. Consistency and explicit convergence rates are provided in terms of the approximation ability and capacity of the involved hypothesis space. Novel analysis is carried out for the generalization error associated with Rényi's entropy and a Parzen windowing function, to overcome technical difficulties arising from the essential differences between the classical least squares problems and the MEE setting. An involved symmetrized least squares error is introduced and analyzed, which is related to some ranking algorithms.

Keywords: minimum error entropy, learning theory, Rényi's entropy, empirical risk minimization, approximation error

1. Introduction


A systematic treatment and recent development of this area can be found in Principe (2010) and references therein.

Minimum error entropy (MEE) is a principle of information theoretical learning and provides a family of supervised learning algorithms. It was introduced for adaptive system training in Erdogmus and Principe (2002) and has been applied to blind source separation, maximally informative subspace projections, clustering, feature selection, blind deconvolution, and some other topics (Erdogmus and Principe, 2003; Principe, 2010; Silva et al., 2010). The idea of MEE is to extract from data as much information as possible about the data generating systems by minimizing error entropies in various ways. In information theory, entropies are used to measure average information quantitatively. For a random variable $E$ with probability density function $p_E$, Shannon's entropy of $E$ is defined as

$$H_S(E) = -\mathbb{E}[\log p_E] = -\int p_E(e)\log p_E(e)\,de$$

while Rényi's entropy of order $\alpha$ ($\alpha>0$ but $\alpha\neq 1$) is defined as

$$H_{R,\alpha}(E) = \frac{1}{1-\alpha}\log \mathbb{E}\left[p_E^{\alpha-1}\right] = \frac{1}{1-\alpha}\log\int (p_E(e))^{\alpha}\,de$$

satisfying $\lim_{\alpha\to 1}H_{R,\alpha}(E) = H_S(E)$. In supervised learning our target is to predict the response variable $Y$ from the explanatory variable $X$. Then the random variable $E$ becomes the error variable $E = Y - f(X)$ when a predictor $f(X)$ is used, and the MEE principle aims at searching for a predictor $f(X)$ that contains the most information of the response variable by minimizing information entropies of the error variable $E = Y - f(X)$. This principle is a substitution of the classical least squares method when the noise is non-Gaussian. Note that $\mathbb{E}[Y-f(X)]^2 = \int e^2 p_E(e)\,de$. The least squares method minimizes the variance of the error variable $E$ and is perfect to deal with problems involving Gaussian noise (such as some from linear signal processing). But it only takes the first two moments into consideration, and does not work very well for problems involving heavy tailed non-Gaussian noise. For such problems, MEE might still perform very well in principle since moments of all orders of the error variable are taken into account by entropies. Here we only consider Rényi's entropy of order $\alpha=2$: $H_R(E) = H_{R,2}(E) = -\log\int (p_E(e))^2\,de$. Our analysis does not apply to Rényi's entropy of order $\alpha\neq 2$.

In most real applications, neither the explanatory variable $X$ nor the response variable $Y$ is explicitly known. Instead, in supervised learning, a sample $\mathbf{z} = \{(x_i,y_i)\}_{i=1}^m$ is available which reflects the distribution of the explanatory variable $X$ and the functional relation between $X$ and the response variable $Y$. With this sample, information entropies of the error variable $E = Y - f(X)$ can be approximated by estimating its probability density function $p_E$ by Parzen (1962) windowing $\hat{p}_E(e) = \frac{1}{mh}\sum_{i=1}^m G\left(\frac{(e-e_i)^2}{2h^2}\right)$, where $e_i = y_i - f(x_i)$, $h>0$ is an MEE scaling parameter, and $G$ is a windowing function. A typical choice for the windowing function, $G(t) = \exp\{-t\}$, corresponds to Gaussian windowing. Then approximations of Shannon's entropy and Rényi's entropy of order 2 are given by their empirical versions $-\frac{1}{m}\sum_{i=1}^m \log \hat{p}_E(e_i)$ and $-\log\left(\frac{1}{m}\sum_{i=1}^m \hat{p}_E(e_i)\right)$ as

$$\widehat{H}_S = -\frac{1}{m}\sum_{i=1}^m \log\left[\frac{1}{mh}\sum_{j=1}^m G\left(\frac{(e_i-e_j)^2}{2h^2}\right)\right]$$

and

$$\widehat{H}_R = -\log\left[\frac{1}{m^2h}\sum_{i=1}^m\sum_{j=1}^m G\left(\frac{(e_i-e_j)^2}{2h^2}\right)\right]$$

respectively. The empirical MEE is implemented by minimizing these computable quantities.

Though the MEE principle has been proposed for a decade and MEE algorithms have been shown to be effective in various applications, its theoretical foundation for mathematical error analysis is not well understood yet. There is even no consistency result in the literature. It has been observed in applications that the scaling parameter $h$ should be large enough for MEE algorithms to work well before smaller values are tuned. However, it is well known that the convergence of Parzen windowing requires $h$ to converge to 0. We believe this contradiction imposes difficulty for rigorous mathematical analysis of MEE algorithms. Another technical barrier for mathematical analysis of MEE algorithms for regression is the possibility that the regression function may not be a minimizer of the associated generalization error, as described in detail in Section 3 below. The main contribution of this paper is a consistency result for an MEE algorithm for regression. It does require $h$ to be large and explains the effectiveness of the MEE principle in applications.
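As an illustration of the empirical Rényi entropy of order 2 described above, the following is a minimal sketch in plain Python, assuming the Gaussian window G(t) = exp(-t). The function names `gaussian_window` and `empirical_renyi_entropy` are hypothetical and chosen only for this example; the double loop costs O(m^2) and is not meant as an efficient implementation.

```python
import math


def gaussian_window(t):
    """Windowing function G(t) = exp(-t) (the Gaussian windowing choice)."""
    return math.exp(-t)


def empirical_renyi_entropy(errors, h, G=gaussian_window):
    """Empirical Renyi entropy of order 2 of the errors e_i = y_i - f(x_i):

        -log( (1 / (m^2 h)) * sum_{i,j} G((e_i - e_j)^2 / (2 h^2)) )
    """
    m = len(errors)
    total = 0.0
    for e_i in errors:
        for e_j in errors:
            total += G((e_i - e_j) ** 2 / (2.0 * h * h))
    return -math.log(total / (m * m * h))


# Concentrated errors give a smaller value than spread-out errors.
concentrated = empirical_renyi_entropy([0.0, 0.0, 0.0], h=1.0)
spread = empirical_renyi_entropy([0.0, 1.0, 2.0], h=1.0)
```

Only differences of errors enter the quantity, so adding the same constant to every error leaves it unchanged.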

In the sequel of this paper, we consider an MEE learning algorithm that minimizes the empirical Rényi's entropy $\widehat{H}_R$ and focus on the regression problem. We will take a learning theory approach and analyze this algorithm in an empirical risk minimization (ERM) setting. Assume $\rho$ is a probability measure on $\mathcal{Z} := \mathcal{X}\times\mathcal{Y}$, where $\mathcal{X}$ is a separable metric space (input space for learning) and $\mathcal{Y} = \mathbb{R}$ (output space). Let $\rho_X$ be its marginal distribution on $\mathcal{X}$ (for the explanatory variable $X$) and $\rho(\cdot|x)$ be the conditional distribution of $Y$ for given $X = x$. The sample $\mathbf{z}$ is assumed to be drawn from $\rho$ independently and identically distributed. The aim of the regression problem is to predict the conditional mean of $Y$ for given $X$ by learning the regression function defined by

$$f_\rho(x) = \mathbb{E}(Y|X=x) = \int_{\mathcal{Y}} y\,d\rho(y|x), \qquad x\in\mathcal{X}.$$

The minimization of empirical Rényi's entropy cannot be done over all possible measurable functions, which would lead to overfitting. A suitable hypothesis space should be chosen appropriately in the ERM setting. The ERM framework for MEE learning is defined as follows. Recall $e_i = y_i - f(x_i)$.

Definition 1 Let $G$ be a continuous function defined on $[0,\infty)$ and $h>0$. Let $\mathcal{H}$ be a compact subset of $C(\mathcal{X})$. Then the MEE learning algorithm associated with $\mathcal{H}$ is defined by

$$f_{\mathbf{z}} = \arg\min_{f\in\mathcal{H}}\left\{-\log\left[\frac{1}{m^2h}\sum_{i=1}^m\sum_{j=1}^m G\left(\frac{[(y_i-f(x_i))-(y_j-f(x_j))]^2}{2h^2}\right)\right]\right\}. \qquad (1)$$
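To make Definition 1 concrete, here is a minimal sketch in plain Python that minimizes the objective in (1) over a small finite hypothesis set, assuming the Gaussian window G(t) = exp(-t). The toy data, the slope grid and the names `mee_objective` and `mee_erm` are assumptions made for this example only; the paper works with a compact subset of C(X), not a finite grid.

```python
import math


def mee_objective(f, sample, h):
    """Objective in (1) with the Gaussian window G(t) = exp(-t)."""
    errors = [y - f(x) for x, y in sample]
    m = len(errors)
    total = sum(
        math.exp(-((e_i - e_j) ** 2) / (2.0 * h * h))
        for e_i in errors
        for e_j in errors
    )
    return -math.log(total / (m * m * h))


def mee_erm(hypotheses, sample, h):
    """Return the hypothesis with the smallest empirical objective."""
    return min(hypotheses, key=lambda f: mee_objective(f, sample, h))


# Hypothetical toy data: y = 2x + 1 plus a small deterministic perturbation.
sample = [(k / 10.0, 2.0 * (k / 10.0) + 1.0 + 0.01 * (-1) ** k) for k in range(20)]

# Hypothetical hypothesis set: lines through the origin, f(x) = a * x.
slopes = [0.0, 1.0, 2.0, 3.0]
hypotheses = [(lambda x, a=a: a * x) for a in slopes]

best = mee_erm(hypotheses, sample, h=5.0)
best_slope = best(1.0)  # f(1) = a, so this recovers the selected slope
```

In this toy run the selected slope is 2 even though no hypothesis contains the intercept 1: the objective depends only on differences of errors, so it cannot distinguish f from f + c for a constant c.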


2. Main Results on Consistency and Convergence Rates

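Throughout the paper, we assume $h\geq 1$ and that

$$\mathbb{E}[|Y|^q]<\infty \text{ for some } q>2, \text{ and } f_\rho\in L^\infty_{\rho_X}. \quad \text{Denote } q^* = \min\{q-2,2\}. \qquad (2)$$

We also assume that the windowing function $G$ satisfies

$$G\in C^2[0,\infty),\quad G'_+(0) = -1,\quad\text{and}\quad C_G := \sup_{t\in(0,\infty)}\left\{|(1+t)G'(t)| + |(1+t)G''(t)|\right\}<\infty. \qquad (3)$$

The special example $G(t) = \exp\{-t\}$ for the Gaussian windowing satisfies (3).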

Consistency analysis for regression algorithms is often carried out in the literature under a decay assumption for $Y$ such as uniform boundedness and exponential decays. A recent study (Audibert and Catoni, 2011) was made under the assumption $\mathbb{E}[|Y|^4]<\infty$. Our assumption (2) is weaker since $q$ may be arbitrarily close to 2. Note that (2) obviously holds when $|Y|\leq M$ almost surely for some constant $M>0$, in which case we shall denote $q^*=2$.

Our consistency result, to be proved in Section 5, asserts that when $h$ and $m$ are large enough, the error $\mathrm{var}[f_{\mathbf{z}}(X)-f_\rho(X)]$ of MEE algorithm (1) can be arbitrarily close to the approximation error (Smale and Zhou, 2003) of the hypothesis space $\mathcal{H}$ with respect to the regression function $f_\rho$.

Definition 2 The approximation error of the pair $(\mathcal{H},\rho)$ is defined by

$$\mathcal{D}_{\mathcal{H}}(f_\rho) = \inf_{f\in\mathcal{H}}\mathrm{var}[f(X)-f_\rho(X)].$$

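Theorem 3 Under assumptions (2) and (3), for any $0<\varepsilon\leq 1$ and $0<\delta<1$, there exist $h_{\varepsilon,\delta}\geq 1$ and $m_{\varepsilon,\delta}(h)\geq 1$ both depending on $\mathcal{H},G,\rho,\varepsilon,\delta$ such that for $h\geq h_{\varepsilon,\delta}$ and $m\geq m_{\varepsilon,\delta}(h)$, with confidence $1-\delta$, we have

$$\mathrm{var}[f_{\mathbf{z}}(X)-f_\rho(X)]\leq \mathcal{D}_{\mathcal{H}}(f_\rho)+\varepsilon. \qquad (4)$$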

Our convergence rates will be stated in terms of the approximation error and the capacity of the hypothesis space $\mathcal{H}$ measured by covering numbers in this paper.

Definition 4 For $\varepsilon>0$, the covering number $\mathcal{N}(\mathcal{H},\varepsilon)$ is defined to be the smallest integer $l\in\mathbb{N}$ such that there exist $l$ disks in $C(\mathcal{X})$ with radius $\varepsilon$ and centers in $\mathcal{H}$ covering the set $\mathcal{H}$. We shall assume that for some constants $p>0$ and $A_p>0$, there holds

$$\log\mathcal{N}(\mathcal{H},\varepsilon)\leq A_p\varepsilon^{-p},\qquad\forall\varepsilon>0. \qquad (5)$$

The behavior (5) of the covering numbers is typical in learning theory. It is satisfied by balls of Sobolev spaces on $\mathcal{X}\subset\mathbb{R}^n$ and reproducing kernel Hilbert spaces associated with Sobolev smooth kernels. See Anthony and Bartlett (1999), Zhou (2002), Zhou (2003) and Yao (2010). We remark that empirical covering numbers might be used together with concentration inequalities to provide sharper error estimates. This is however beyond our scope and for simplicity we adopt the covering number in $C(\mathcal{X})$ throughout this paper.

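Theorem 5 Assume (2), (3) and covering number condition (5) for some $p>0$. Then for any $0<\eta\leq 1$ and $0<\delta<1$, with confidence $1-\delta$ we have

$$\mathrm{var}[f_{\mathbf{z}}(X)-f_\rho(X)]\leq \widetilde{C}_{\mathcal{H}}\,\eta^{(2-q)/2}\left(h^{-\min\{q-2,2\}}+hm^{-\frac{1}{1+p}}\right)\log\frac{2}{\delta}+(1+\eta)\mathcal{D}_{\mathcal{H}}(f_\rho). \qquad (6)$$

If $|Y|\leq M$ almost surely for some $M>0$, then with confidence $1-\delta$ we have

$$\mathrm{var}[f_{\mathbf{z}}(X)-f_\rho(X)]\leq \frac{\widetilde{C}_{\mathcal{H}}}{\eta}\left(h^{-2}+m^{-\frac{1}{1+p}}\right)\log\frac{2}{\delta}+(1+\eta)\mathcal{D}_{\mathcal{H}}(f_\rho). \qquad (7)$$

Here $\widetilde{C}_{\mathcal{H}}$ is a constant independent of $m,\delta,\eta$ or $h$ (depending on $\mathcal{H},G,\rho$ and given explicitly in the proof).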

Remark 6 In Theorem 5, we use a parameter $\eta>0$ in error bounds (6) and (7) to show that the bounds consist of two terms, one of which is essentially the approximation error $\mathcal{D}_{\mathcal{H}}(f_\rho)$ since $\eta$ can be arbitrarily small. The reader can simply set $\eta=1$ to get the main ideas of our analysis.

If moment condition (2) with $q\geq 4$ is satisfied and $\eta=1$, then by taking $h=m^{\frac{1}{3(1+p)}}$, (6) becomes

$$\mathrm{var}[f_{\mathbf{z}}(X)-f_\rho(X)]\leq 2\widetilde{C}_{\mathcal{H}}\left(\frac{1}{m}\right)^{\frac{2}{3(1+p)}}\log\frac{2}{\delta}+2\mathcal{D}_{\mathcal{H}}(f_\rho). \qquad (8)$$

If $|Y|\leq M$ almost surely, then by taking $h=m^{\frac{1}{2(1+p)}}$ and $\eta=1$, error bound (7) becomes

$$\mathrm{var}[f_{\mathbf{z}}(X)-f_\rho(X)]\leq 2\widetilde{C}_{\mathcal{H}}\,m^{-\frac{1}{1+p}}\log\frac{2}{\delta}+2\mathcal{D}_{\mathcal{H}}(f_\rho). \qquad (9)$$

Remark 7 When the index $p$ in covering number condition (5) is small enough (the case when $\mathcal{H}$ is a finite ball of a reproducing kernel Hilbert space with a smooth kernel), we see that the power indices for the sample error terms of convergence rates (8) and (9) can be arbitrarily close to $2/3$ and $1$, respectively. There is a gap in the rates between the case of (2) with large $q$ and the uniform bounded case. This gap is caused by the Parzen windowing process for which our method does not lead to better estimates when $q>4$. It would be interesting to know whether the gap can be narrowed.
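The choices of the scaling parameter behind (8) and (9) come from balancing the h-dependent terms of (6) and (7). The sketch below, in plain Python, computes the suggested h and the resulting power index of the sample error term; the function name `mee_scaling` and its arguments are hypothetical, and constants and the log factor are deliberately ignored.

```python
def mee_scaling(m, p, bounded):
    """Return (h, rate), where the sample error term decays like m**(-rate).

    bounded=True : |Y| <= M almost surely, h = m^{1/(2(1+p))}, rate = 1/(1+p).
    bounded=False: moment condition with q >= 4, h = m^{1/(3(1+p))},
                   rate = 2/(3(1+p)).
    """
    if bounded:
        h_exponent = 1.0 / (2.0 * (1.0 + p))
        rate = 1.0 / (1.0 + p)
    else:
        h_exponent = 1.0 / (3.0 * (1.0 + p))
        rate = 2.0 / (3.0 * (1.0 + p))
    return m ** h_exponent, rate


# Moment case: h^{-2} and h * m^{-1/(1+p)} are of the same order.
h_moment, rate_moment = mee_scaling(1000, 0.5, bounded=False)
term_window = h_moment ** -2
term_sample = h_moment * 1000 ** (-1.0 / 1.5)

# Bounded case: h^{-2} matches m^{-1/(1+p)}.
h_bounded, rate_bounded = mee_scaling(1000, 0.5, bounded=True)
```

As p tends to 0 the two rates approach 2/3 and 1, which is the gap discussed in Remark 7.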

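Note that the result in Theorem 5 does not guarantee that $f_{\mathbf{z}}$ itself approximates $f_\rho$ well when the bounds are small. Instead a constant adjustment is required. Theoretically the best constant is $\mathbb{E}[f_{\mathbf{z}}(X)-f_\rho(X)]$. In practice it is usually approximated by the sample mean $\frac{1}{m}\sum_{i=1}^m(f_{\mathbf{z}}(x_i)-y_i)$ in the case of uniformly bounded noise and the approximation can be easily handled. To deal with heavy tailed noise, we project the output values onto the closed interval $[-\sqrt{m},\sqrt{m}]$ by the projection $\pi_{\sqrt{m}}:\mathbb{R}\to\mathbb{R}$ defined by

$$\pi_{\sqrt{m}}(y)=\begin{cases} y, & \text{if } y\in[-\sqrt{m},\sqrt{m}],\\ \sqrt{m}, & \text{if } y>\sqrt{m},\\ -\sqrt{m}, & \text{if } y<-\sqrt{m},\end{cases}$$

and then approximate $\mathbb{E}[f_{\mathbf{z}}(X)-f_\rho(X)]$ by the computable quantity

$$\frac{1}{m}\sum_{i=1}^m\left[f_{\mathbf{z}}(x_i)-\pi_{\sqrt{m}}(y_i)\right]. \qquad (10)$$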

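Theorem 8 Assume $\mathbb{E}[|Y|^2]<\infty$ and covering number condition (5) for some $p>0$. Then for any $0<\delta<1$, with confidence $1-\delta$ we have

$$\sup_{f\in\mathcal{H}}\left|\frac{1}{m}\sum_{i=1}^m\left[f(x_i)-\pi_{\sqrt{m}}(y_i)\right]-\mathbb{E}[f(X)-f_\rho(X)]\right|\leq \widetilde{C}'_{\mathcal{H}}\,m^{-\frac{1}{2+p}}\log\frac{2}{\delta} \qquad (11)$$

which implies in particular that

$$\left|\frac{1}{m}\sum_{i=1}^m\left[f_{\mathbf{z}}(x_i)-\pi_{\sqrt{m}}(y_i)\right]-\mathbb{E}[f_{\mathbf{z}}(X)-f_\rho(X)]\right|\leq \widetilde{C}'_{\mathcal{H}}\,m^{-\frac{1}{2+p}}\log\frac{2}{\delta}, \qquad (12)$$

where $\widetilde{C}'_{\mathcal{H}}$ is the constant given by

$$\widetilde{C}'_{\mathcal{H}}=7\sup_{f\in\mathcal{H}}\|f\|_\infty+4+7\sqrt{\mathbb{E}[|Y|^2]}+\mathbb{E}[|Y|^2]+A_p^{\frac{1}{2+p}}.$$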

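Replacing the mean $\mathbb{E}[f_{\mathbf{z}}(X)-f_\rho(X)]$ by the quantity (10), we define an estimator of $f_\rho$ as

$$\widetilde{f}_{\mathbf{z}}=f_{\mathbf{z}}-\frac{1}{m}\sum_{i=1}^m\left[f_{\mathbf{z}}(x_i)-\pi_{\sqrt{m}}(y_i)\right].$$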
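The constant adjustment can be sketched in a few lines of plain Python: clip the outputs to [-sqrt(m), sqrt(m)], average the differences between the learned function and the clipped outputs over the sample, and subtract that average. The names `project` and `adjusted_estimator` are hypothetical, and `f_z` stands for any already-computed minimizer of the MEE objective.

```python
import math


def project(y, m):
    """Projection of y onto the closed interval [-sqrt(m), sqrt(m)]."""
    r = math.sqrt(m)
    return max(-r, min(r, y))


def adjusted_estimator(f_z, sample):
    """Return x -> f_z(x) - (1/m) * sum_i [f_z(x_i) - project(y_i, m)]."""
    m = len(sample)
    shift = sum(f_z(x) - project(y, m) for x, y in sample) / m
    return lambda x: f_z(x) - shift


# Hypothetical toy sample; with f_z = 0 the adjusted estimator is the
# mean of the (projected) outputs.
toy_sample = [(0.0, 1.0), (1.0, 1.5), (2.0, 2.0), (3.0, 0.5)]
f_tilde = adjusted_estimator(lambda x: 0.0, toy_sample)
value = f_tilde(0.0)
```

The projection only matters for outputs larger than sqrt(m) in absolute value, which become rare as m grows when the second moment of Y is finite.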

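Putting (12) and the bounds from Theorem 5 into the obvious error expression

$$\left\|\widetilde{f}_{\mathbf{z}}-f_\rho\right\|_{L^2_{\rho_X}}\leq\left|\frac{1}{m}\sum_{i=1}^m\left[f_{\mathbf{z}}(x_i)-\pi_{\sqrt{m}}(y_i)\right]-\mathbb{E}[f_{\mathbf{z}}(X)-f_\rho(X)]\right|+\sqrt{\mathrm{var}[f_{\mathbf{z}}(X)-f_\rho(X)]}, \qquad (13)$$

we see that $\widetilde{f}_{\mathbf{z}}$ is a good estimator of $f_\rho$: the power index $\frac{1}{2+p}$ in (12) is greater than $\frac{1}{2(1+p)}$, the power index appearing in the last term of (13) when the variance term is bounded by (9), even in the uniformly bounded case.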

To interpret our main results better we present a corollary and an example below.

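If there is a constant $c_\rho$ such that $f_\rho+c_\rho\in\mathcal{H}$, we have $\mathcal{D}_{\mathcal{H}}(f_\rho)=0$. In this case, the choice $\eta=1$ in Theorem 5 yields the following learning rate. Note that (2) implies $\mathbb{E}[|Y|^2]<\infty$.

Corollary 9 Assume (5) with some $p>0$ and $f_\rho+c_\rho\in\mathcal{H}$ for some constant $c_\rho\in\mathbb{R}$. Under conditions (2) and (3), by taking $h=m^{\frac{1}{(1+p)\min\{q-1,3\}}}$, we have with confidence $1-\delta$,

$$\left\|\widetilde{f}_{\mathbf{z}}-f_\rho\right\|_{L^2_{\rho_X}}\leq\left(\widetilde{C}'_{\mathcal{H}}+\sqrt{2\widetilde{C}_{\mathcal{H}}}\right)m^{-\frac{\min\{q-2,2\}}{2(1+p)\min\{q-1,3\}}}\log\frac{2}{\delta}.$$

If $|Y|\leq M$ almost surely, then by taking $h=m^{\frac{1}{2(1+p)}}$, we have with confidence $1-\delta$,

$$\left\|\widetilde{f}_{\mathbf{z}}-f_\rho\right\|_{L^2_{\rho_X}}\leq\left(\widetilde{C}'_{\mathcal{H}}+\sqrt{2\widetilde{C}_{\mathcal{H}}}\right)m^{-\frac{1}{2(1+p)}}\log\frac{2}{\delta}.$$

This corollary states that $\widetilde{f}_{\mathbf{z}}$ can approximate the regression function very well. Note, however, that this happens when the hypothesis space is chosen appropriately and the parameter $h$ tends to infinity.

A special example of the hypothesis space is a ball of a Sobolev space $H^s(\mathcal{X})$ with index $s>\frac{n}{2}$ on a domain $\mathcal{X}\subset\mathbb{R}^n$, which satisfies (5) with $p=\frac{n}{s}$. When $s$ is large enough, the positive index $\frac{n}{s}$ can be arbitrarily small. Then the power exponent of the following convergence rate can be arbitrarily close to $\frac{1}{3}$ when $\mathbb{E}[|Y|^4]<\infty$, and $\frac{1}{2}$ when $|Y|\leq M$ almost surely.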

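Example 1 Let $\mathcal{X}$ be a bounded domain of $\mathbb{R}^n$ with Lipschitz boundary. Assume $f_\rho\in H^s(\mathcal{X})$ for some $s>\frac{n}{2}$ and take $\mathcal{H}=\{f\in H^s(\mathcal{X}):\|f\|_{H^s(\mathcal{X})}\leq R\}$ with $R\geq\|f_\rho\|_{H^s(\mathcal{X})}$ and $R\geq 1$. If $\mathbb{E}[|Y|^4]<\infty$, then by taking $h=m^{\frac{1}{3(1+n/s)}}$, we have with confidence $1-\delta$,

$$\left\|\widetilde{f}_{\mathbf{z}}-f_\rho\right\|_{L^2_{\rho_X}}\leq C_{s,n}R^{\frac{n}{2(s+n)}}m^{-\frac{1}{3(1+n/s)}}\log\frac{2}{\delta}.$$

If $|Y|\leq M$ almost surely, then by taking $h=m^{\frac{1}{2(1+n/s)}}$, with confidence $1-\delta$,

$$\left\|\widetilde{f}_{\mathbf{z}}-f_\rho\right\|_{L^2_{\rho_X}}\leq C_{s,n}R^{\frac{n}{2(s+n)}}m^{-\frac{1}{2+2n/s}}\log\frac{2}{\delta}.$$

Here the constant $C_{s,n}$ is independent of $R$.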

Compared to the analysis of least squares methods, our consistency results for the MEE algorithm require a weaker condition by allowing heavy tailed noise, while the convergence rates are comparable but slightly worse than the optimal one $O(m^{-\frac{1}{2+n/s}})$. Further investigation of error analysis for the MEE algorithm is required to achieve the optimal rate, which is beyond the scope of this paper.

3. Technical Difficulties in MEE and Novelties

The MEE algorithm (1) involving sample pairs like quadratic forms is different from most classical ERM learning algorithms (Vapnik, 1998; Anthony and Bartlett, 1999) constructed by sums of independent random variables. But as done for some ranking algorithms (Agarwal and Niyogi, 2009; Clemencon et al., 2005), one can still follow the same line to define a functional called generalization error or information error (related to information potential defined on page 88 of Principe, 2010) associated with the windowing function $G$ over the space of measurable functions on $\mathcal{X}$ as

$$\mathcal{E}^{(h)}(f)=\int_{\mathcal{Z}}\int_{\mathcal{Z}}-h^2G\left(\frac{[(y-f(x))-(y'-f(x'))]^2}{2h^2}\right)d\rho(x,y)\,d\rho(x',y').$$

An essential barrier for our consistency analysis is an observation made by numerical simulations (Erdogmus and Principe, 2003; Silva et al., 2010) and verified mathematically for Shannon's entropy in Chen and Principe (2012) that the regression function $f_\rho$ may not be a minimizer of $\mathcal{E}^{(h)}$. It is totally different from the classical least squares generalization error $\mathcal{E}^{ls}(f)=\int_{\mathcal{Z}}(f(x)-y)^2d\rho$ which satisfies a nice identity $\mathcal{E}^{ls}(f)-\mathcal{E}^{ls}(f_\rho)=\|f-f_\rho\|^2_{L^2_{\rho_X}}\geq 0$. This barrier leads to three technical difficulties in our error analysis which will be overcome by our novel approaches making full use of the special feature that the MEE scaling parameter $h$ is large in this paper.

3.1 Approximation of Information Error

The first technical difficulty we meet in our mathematical analysis for MEE algorithm (1) is the varying form depending on the windowing function $G$. Our novel approach here is an approximation of the information error in terms of the variance $\mathrm{var}[f(X)-f_\rho(X)]$ when $h$ is large. This is achieved by showing that $\mathcal{E}^{(h)}$ is closely related to the following symmetrized least squares error, which is related to some ranking algorithms.

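Definition 10 The symmetrized least squares error is defined on the space $L^2_{\rho_X}$ by

$$\mathcal{E}^{sls}(f)=\int_{\mathcal{Z}}\int_{\mathcal{Z}}\left[(y-f(x))-(y'-f(x'))\right]^2d\rho(x,y)\,d\rho(x',y'),\qquad f\in L^2_{\rho_X}.$$

To give the approximation of $\mathcal{E}^{(h)}$, we need a simpler form of $\mathcal{E}^{sls}$.

Lemma 11 If $\mathbb{E}[Y^2]<\infty$, then by denoting $C_\rho=\int_{\mathcal{Z}}[y-f_\rho(x)]^2d\rho$, we have

$$\mathcal{E}^{sls}(f)=2\,\mathrm{var}[f(X)-f_\rho(X)]+2C_\rho,\qquad\forall f\in L^2_{\rho_X}.$$

Proof Recall that for two independent and identically distributed samples $\xi$ and $\xi'$ of a random variable, one has the identity

$$\mathbb{E}[(\xi-\xi')^2]=2\,\mathbb{E}[(\xi-\mathbb{E}\xi)^2]=2\,\mathrm{var}(\xi).$$

Then we have

$$\mathcal{E}^{sls}(f)=\mathbb{E}\left[(y-f(x))-(y'-f(x'))\right]^2=2\,\mathrm{var}[Y-f(X)].$$

By the definition $\mathbb{E}[Y|X]=f_\rho(X)$, it is easy to see that $C_\rho=\mathrm{var}(Y-f_\rho(X))$ and the covariance between $Y-f_\rho(X)$ and $f_\rho(X)-f(X)$ vanishes. So $\mathrm{var}[Y-f(X)]=\mathrm{var}[Y-f_\rho(X)]+\mathrm{var}[f(X)-f_\rho(X)]$. This proves the desired identity.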
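The identity for two independent copies used at the start of the proof can be checked numerically. A minimal sketch in plain Python, assuming both copies are drawn uniformly (with replacement) from a finite list of values, so that the expectation over pairs is the average over all ordered pairs and the identity holds exactly; the function names are hypothetical.

```python
def pairwise_mean_square(values):
    """Average of (a - b)^2 over all ordered pairs (a, b), including a = b."""
    m = len(values)
    return sum((a - b) ** 2 for a in values for b in values) / (m * m)


def population_variance(values):
    """Variance of the uniform distribution on the given values."""
    m = len(values)
    mean = sum(values) / m
    return sum((v - mean) ** 2 for v in values) / m


values = [1.0, 2.0, 4.0, 7.0]
lhs = pairwise_mean_square(values)       # E[(xi - xi')^2]
rhs = 2.0 * population_variance(values)  # 2 var(xi)
```

For the values above both sides equal 10.5.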

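We are in a position to present the approximation of $\mathcal{E}^{(h)}$ for which a large scaling parameter $h$ plays an important role. Since $\mathcal{H}$ is a compact subset of $C(\mathcal{X})$, we know that the number $\sup_{f\in\mathcal{H}}\|f\|_\infty$ is finite.

Lemma 12 Under assumptions (2) and (3), for any essentially bounded measurable function $f$ on $\mathcal{X}$, we have

$$\left|\mathcal{E}^{(h)}(f)+h^2G(0)-C_\rho-\mathrm{var}[f(X)-f_\rho(X)]\right|\leq 5\cdot 2^7C_G\left((\mathbb{E}[|Y|^q])^{\frac{q^*+2}{q}}+\|f\|_\infty^{q^*+2}\right)h^{-q^*}.$$

In particular,

$$\left|\mathcal{E}^{(h)}(f)+h^2G(0)-C_\rho-\mathrm{var}[f(X)-f_\rho(X)]\right|\leq C'_{\mathcal{H}}h^{-q^*},\qquad\forall f\in\mathcal{H},$$

where $C'_{\mathcal{H}}$ is the constant depending on $\rho,G,q$ and $\mathcal{H}$ given by

$$C'_{\mathcal{H}}=5\cdot 2^7C_G\left((\mathbb{E}[|Y|^q])^{(q^*+2)/q}+\Big(\sup_{f\in\mathcal{H}}\|f\|_\infty\Big)^{q^*+2}\right).$$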

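Proof Observe that $q^*+2=\min\{q,4\}\in(2,4]$. By the Taylor expansion and the mean value theorem, we have

$$|G(t)-G(0)-G'_+(0)t|\leq\begin{cases}\frac{\|G''\|_\infty}{2}t^2\leq\frac{\|G''\|_\infty}{2}t^{(q^*+2)/2}, & \text{if } 0\leq t\leq 1,\\ 2\|G'\|_\infty t\leq 2\|G'\|_\infty t^{(q^*+2)/2}, & \text{if } t>1.\end{cases}$$

So $|G(t)-G(0)-G'_+(0)t|\leq\left(\frac{\|G''\|_\infty}{2}+2\|G'\|_\infty\right)t^{(q^*+2)/2}$ for all $t\geq 0$, and by setting $t=\frac{[(y-f(x))-(y'-f(x'))]^2}{2h^2}$, we know that

$$\left|\mathcal{E}^{(h)}(f)+h^2G(0)+\int_{\mathcal{Z}}\int_{\mathcal{Z}}G'_+(0)\frac{[(y-f(x))-(y'-f(x'))]^2}{2}d\rho(x,y)\,d\rho(x',y')\right|$$

$$\leq\left(\frac{\|G''\|_\infty}{2}+2\|G'\|_\infty\right)h^{-q^*}2^{-\frac{q^*+2}{2}}\int_{\mathcal{Z}}\int_{\mathcal{Z}}\left|(y-f(x))-(y'-f(x'))\right|^{q^*+2}d\rho(x,y)\,d\rho(x',y')$$

$$\leq\left(\frac{\|G''\|_\infty}{2}+2\|G'\|_\infty\right)h^{-q^*}2^{8}\left\{\int_{\mathcal{Z}}|y|^{q^*+2}d\rho+\|f\|_\infty^{q^*+2}\right\}.$$

This together with Lemma 11, the normalization assumption $G'_+(0)=-1$ and Hölder's inequality applied when $q>4$ proves the desired bound and hence our conclusion.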

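Applying Lemma 12 to a function $f\in\mathcal{H}$ and $f_\rho\in L^\infty_{\rho_X}$ yields the following fact on the excess generalization error $\mathcal{E}^{(h)}(f)-\mathcal{E}^{(h)}(f_\rho)$.

Theorem 13 Under assumptions (2) and (3), we have

$$\left|\mathcal{E}^{(h)}(f)-\mathcal{E}^{(h)}(f_\rho)-\mathrm{var}[f(X)-f_\rho(X)]\right|\leq C''_{\mathcal{H}}h^{-q^*},\qquad\forall f\in\mathcal{H},$$

where $C''_{\mathcal{H}}$ is the constant depending on $\rho,G,q$ and $\mathcal{H}$ given by

$$C''_{\mathcal{H}}=5\cdot 2^8C_G\left\{(\mathbb{E}[|Y|^q])^{(q^*+2)/q}+\Big(\sup_{f\in\mathcal{H}}\|f\|_\infty\Big)^{q^*+2}+\|f_\rho\|_\infty^{q^*+2}\right\}.$$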

3.2 Functional Minimizer and Best Approximation

As $f_\rho$ may not be a minimizer of $\mathcal{E}^{(h)}$, the second technical difficulty in our error analysis is the diversity of two ways to define a target function in $\mathcal{H}$, one to minimize the information error and the other to minimize the variance $\mathrm{var}[f(X)-f_\rho(X)]$. These possible candidates for the target function are defined as

$$f_{\mathcal{H}}:=\arg\min_{f\in\mathcal{H}}\mathcal{E}^{(h)}(f),\qquad f_{approx}:=\arg\min_{f\in\mathcal{H}}\mathrm{var}[f(X)-f_\rho(X)].$$

Our novelty to overcome the technical difficulty is to show that when the MEE scaling parameter $h$ is large, these two functions are close in the sense made precise below.

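Theorem 14 Under assumptions (2) and (3), we have

$$\mathcal{E}^{(h)}(f_{approx})\leq\mathcal{E}^{(h)}(f_{\mathcal{H}})+2C''_{\mathcal{H}}h^{-q^*}$$

and

$$\mathrm{var}[f_{\mathcal{H}}(X)-f_\rho(X)]\leq\mathrm{var}[f_{approx}(X)-f_\rho(X)]+2C''_{\mathcal{H}}h^{-q^*}.$$

Proof By Theorem 13 and the definitions of $f_{\mathcal{H}}$ and $f_{approx}$, we have

$$\mathcal{E}^{(h)}(f_{\mathcal{H}})-\mathcal{E}^{(h)}(f_\rho)\leq\mathcal{E}^{(h)}(f_{approx})-\mathcal{E}^{(h)}(f_\rho)\leq\mathrm{var}[f_{approx}(X)-f_\rho(X)]+C''_{\mathcal{H}}h^{-q^*}$$

$$\leq\mathrm{var}[f_{\mathcal{H}}(X)-f_\rho(X)]+C''_{\mathcal{H}}h^{-q^*}\leq\mathcal{E}^{(h)}(f_{\mathcal{H}})-\mathcal{E}^{(h)}(f_\rho)+2C''_{\mathcal{H}}h^{-q^*}$$

$$\leq\mathrm{var}[f_{approx}(X)-f_\rho(X)]+3C''_{\mathcal{H}}h^{-q^*}.$$

Then the desired inequalities follow.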

Moreover, Theorem 13 yields the following error decomposition for our algorithm.

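Lemma 15 Under assumptions (2) and (3), we have

$$\mathrm{var}[f_{\mathbf{z}}(X)-f_\rho(X)]\leq\left\{\mathcal{E}^{(h)}(f_{\mathbf{z}})-\mathcal{E}^{(h)}(f_{\mathcal{H}})\right\}+\mathrm{var}[f_{approx}(X)-f_\rho(X)]+2C''_{\mathcal{H}}h^{-q^*}. \qquad (14)$$

Proof By Theorem 13,

$$\mathrm{var}[f_{\mathbf{z}}(X)-f_\rho(X)]\leq\mathcal{E}^{(h)}(f_{\mathbf{z}})-\mathcal{E}^{(h)}(f_\rho)+C''_{\mathcal{H}}h^{-q^*}$$

$$\leq\left\{\mathcal{E}^{(h)}(f_{\mathbf{z}})-\mathcal{E}^{(h)}(f_{\mathcal{H}})\right\}+\mathcal{E}^{(h)}(f_{\mathcal{H}})-\mathcal{E}^{(h)}(f_\rho)+C''_{\mathcal{H}}h^{-q^*}.$$

Since $f_{approx}\in\mathcal{H}$, the definition of $f_{\mathcal{H}}$ tells us that

$$\mathcal{E}^{(h)}(f_{\mathcal{H}})-\mathcal{E}^{(h)}(f_\rho)\leq\mathcal{E}^{(h)}(f_{approx})-\mathcal{E}^{(h)}(f_\rho).$$

Applying Theorem 13 to the above bound implies

$$\mathrm{var}[f_{\mathbf{z}}(X)-f_\rho(X)]\leq\left\{\mathcal{E}^{(h)}(f_{\mathbf{z}})-\mathcal{E}^{(h)}(f_{\mathcal{H}})\right\}+\mathrm{var}[f_{approx}(X)-f_\rho(X)]+2C''_{\mathcal{H}}h^{-q^*}.$$

Then desired error decomposition (14) follows.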

Error decomposition has been a standard technique to analyze least squares ERM regression algorithms (Anthony and Bartlett, 1999; Cucker and Zhou, 2007; Smale and Zhou, 2009; Ying, 2007). In error decomposition (14) for MEE learning algorithm (1), the first term on the right side is the sample error, the second term $\mathrm{var}[f_{approx}(X)-f_\rho(X)]$ is the approximation error, while the last extra term $2C''_{\mathcal{H}}h^{-q^*}$ is caused by the Parzen windowing and is small when $h$ is large. The quantity $\mathcal{E}^{(h)}(f_{\mathbf{z}})-\mathcal{E}^{(h)}(f_{\mathcal{H}})$ in the sample error term is treated in Section 3.3.

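3.3 Error Decomposition by U-statistics and Special Properties

We shall decompose the sample error term $\mathcal{E}^{(h)}(f_{\mathbf{z}})-\mathcal{E}^{(h)}(f_{\mathcal{H}})$ further by means of U-statistics defined for $f\in\mathcal{H}$ and the sample $\mathbf{z}$ as

$$V_f(\mathbf{z})=\frac{1}{m(m-1)}\sum_{i=1}^m\sum_{j\neq i}U_f(z_i,z_j),$$

where $U_f$ is a kernel given with $z=(x,y)$, $z'=(x',y')\in\mathcal{Z}$ by

$$U_f(z,z')=-h^2G\left(\frac{[(y-f(x))-(y'-f(x'))]^2}{2h^2}\right)+h^2G\left(\frac{[(y-f_\rho(x))-(y'-f_\rho(x'))]^2}{2h^2}\right). \qquad (15)$$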
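For intuition, the kernel in (15) and the associated U-statistic can be evaluated in a simulation where the regression function is known. The sketch below is plain Python under that assumption, with the Gaussian window G(t) = exp(-t); the names `u_kernel` and `v_statistic` and the toy data are hypothetical. In a real learning problem the regression function is unknown, so this quantity is a tool of the analysis, not something the algorithm computes.

```python
import math


def u_kernel(f, f_rho, z, z_prime, h):
    """Kernel U_f(z, z') from (15) with the Gaussian window G(t) = exp(-t)."""
    (x, y), (xp, yp) = z, z_prime
    d_f = (y - f(x)) - (yp - f(xp))
    d_rho = (y - f_rho(x)) - (yp - f_rho(xp))
    g_f = math.exp(-(d_f ** 2) / (2.0 * h * h))
    g_rho = math.exp(-(d_rho ** 2) / (2.0 * h * h))
    return -h * h * g_f + h * h * g_rho


def v_statistic(f, f_rho, sample, h):
    """V_f(z) = (1 / (m (m - 1))) * sum over i != j of U_f(z_i, z_j)."""
    m = len(sample)
    total = 0.0
    for i in range(m):
        for j in range(m):
            if i != j:
                total += u_kernel(f, f_rho, sample[i], sample[j], h)
    return total / (m * (m - 1))


# Hypothetical simulation: known regression function and a poor competitor.
f_rho = lambda x: 2.0 * x
f_bad = lambda x: -x
sample = [(k / 5.0, 2.0 * (k / 5.0) + 0.05 * (-1) ** k) for k in range(10)]
v_rho = v_statistic(f_rho, f_rho, sample, h=4.0)
v_bad = v_statistic(f_bad, f_rho, sample, h=4.0)
```

Because the kernel depends on f only through differences f(x) - f(x'), adding a constant to f does not change the value.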

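It is easy to see that $\mathbb{E}[V_f]=\mathcal{E}^{(h)}(f)-\mathcal{E}^{(h)}(f_\rho)$ and $U_f(z,z)=0$. Then

$$\mathcal{E}^{(h)}(f_{\mathbf{z}})-\mathcal{E}^{(h)}(f_{\mathcal{H}})=\mathbb{E}[V_{f_{\mathbf{z}}}]-\mathbb{E}[V_{f_{\mathcal{H}}}]=\mathbb{E}[V_{f_{\mathbf{z}}}]-V_{f_{\mathbf{z}}}+V_{f_{\mathbf{z}}}-V_{f_{\mathcal{H}}}+V_{f_{\mathcal{H}}}-\mathbb{E}[V_{f_{\mathcal{H}}}].$$

By the definition of $f_{\mathbf{z}}$, we have $V_{f_{\mathbf{z}}}-V_{f_{\mathcal{H}}}\leq 0$. Hence

$$\mathcal{E}^{(h)}(f_{\mathbf{z}})-\mathcal{E}^{(h)}(f_{\mathcal{H}})\leq\mathbb{E}[V_{f_{\mathbf{z}}}]-V_{f_{\mathbf{z}}}+V_{f_{\mathcal{H}}}-\mathbb{E}[V_{f_{\mathcal{H}}}]. \qquad (16)$$

The above bound will be estimated by a uniform ratio probability inequality. A technical difficulty we meet here is the possibility that $\mathbb{E}[V_f]=\mathcal{E}^{(h)}(f)-\mathcal{E}^{(h)}(f_\rho)$ might be negative since $f_\rho$ may not be a minimizer of $\mathcal{E}^{(h)}$. It is overcome by the following novel observation which is an immediate consequence of Theorem 13.

Lemma 16 Under assumptions (2) and (3), if $\varepsilon\geq C''_{\mathcal{H}}h^{-q^*}$, then

$$\mathbb{E}[V_f]+2\varepsilon\geq\mathbb{E}[V_f]+C''_{\mathcal{H}}h^{-q^*}+\varepsilon\geq\mathrm{var}[f(X)-f_\rho(X)]+\varepsilon\geq\varepsilon,\qquad\forall f\in\mathcal{H}. \qquad (17)$$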

4. Sample Error Estimates

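In this section, we follow (16) and estimate the sample error by a uniform ratio probability inequality based on the following Hoeffding's probability inequality for U-statistics (Hoeffding, 1963).

Lemma 17 If $U$ is a symmetric real-valued function on $\mathcal{Z}\times\mathcal{Z}$ satisfying $a\leq U(z,z')\leq b$ almost surely and $\mathrm{var}[U]=\sigma^2$, then for any $\varepsilon>0$,

$$\mathrm{Prob}\left\{\left|\frac{1}{m(m-1)}\sum_{i=1}^m\sum_{j\neq i}U(z_i,z_j)-\mathbb{E}[U]\right|\geq\varepsilon\right\}\leq 2\exp\left\{-\frac{(m-1)\varepsilon^2}{4\sigma^2+(4/3)(b-a)\varepsilon}\right\}.$$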
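To get a feeling for the sizes involved, the right-hand side of Lemma 17, 2 exp{-(m-1) eps^2 / (4 sigma^2 + (4/3)(b-a) eps)}, can be evaluated directly and inverted for the sample size. This is a plain Python sketch; the function names and the numerical values are hypothetical and serve only as an illustration.

```python
import math


def u_statistic_tail_bound(m, eps, sigma2, a, b):
    """Right-hand side of Lemma 17 for sample size m and deviation eps."""
    denom = 4.0 * sigma2 + (4.0 / 3.0) * (b - a) * eps
    return 2.0 * math.exp(-(m - 1) * eps ** 2 / denom)


def sample_size_for(delta, eps, sigma2, a, b):
    """Smallest integer m for which the bound is at most delta."""
    denom = 4.0 * sigma2 + (4.0 / 3.0) * (b - a) * eps
    return math.ceil(1.0 + denom * math.log(2.0 / delta) / eps ** 2)


# Hypothetical numbers: kernel values in [-1, 1] with variance 1.
m_needed = sample_size_for(delta=0.05, eps=0.1, sigma2=1.0, a=-1.0, b=1.0)
bound_at_m = u_statistic_tail_bound(m_needed, 0.1, 1.0, -1.0, 1.0)
```

The effective sample size in the exponent is m - 1 rather than the number m(m-1) of pairs, reflecting the dependence between pairs that share a sample point.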

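To apply Lemma 17 we need to bound $\sigma^2$ and $b-a$ for the kernel $U_f$ defined by (15). Our novelty for getting sharp bounds is to use a Taylor expansion involving a $C^2$ function $\widetilde{G}$ on $\mathbb{R}$:

$$\widetilde{G}(w)=\widetilde{G}(0)+\widetilde{G}'(0)w+\int_0^w(w-t)\widetilde{G}''(t)\,dt,\qquad\forall w\in\mathbb{R}. \qquad (18)$$

Denote a constant $A_{\mathcal{H}}$ depending on $\rho,G,q$ and $\mathcal{H}$ as

$$A_{\mathcal{H}}=9\cdot 2^8C_G^2\sup_{f\in\mathcal{H}}\|f-f_\rho\|_\infty^{\frac{4}{q}}\left((\mathbb{E}[|Y|^q])^{\frac{2}{q}}+\|f_\rho\|_\infty^2+\sup_{f\in\mathcal{H}}\|f-f_\rho\|_\infty^2\right).$$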

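Lemma 18 Assume (2) and (3).

(a) For any $f,g\in\mathcal{H}$, we have

$$|U_f|\leq 4C_G\|f-f_\rho\|_\infty h\quad\text{and}\quad|U_f-U_g|\leq 4C_G\|f-g\|_\infty h$$

and

$$\mathrm{var}[U_f]\leq A_{\mathcal{H}}\left(\mathrm{var}[f(X)-f_\rho(X)]\right)^{(q-2)/q}.$$

(b) If $|Y|\leq M$ almost surely for some constant $M>0$, then we have almost surely

$$|U_f|\leq A'_{\mathcal{H}}\left|(f(x)-f_\rho(x))-(f(x')-f_\rho(x'))\right|,\qquad\forall f\in\mathcal{H} \qquad (19)$$

and

$$|U_f-U_g|\leq A'_{\mathcal{H}}\left|(f(x)-g(x))-(f(x')-g(x'))\right|,\qquad\forall f,g\in\mathcal{H}, \qquad (20)$$

where $A'_{\mathcal{H}}$ is a constant depending on $\rho,G$ and $\mathcal{H}$ given by

$$A'_{\mathcal{H}}=36C_G\left(M+\sup_{f\in\mathcal{H}}\|f\|_\infty\right).$$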

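Proof Define a function $\widetilde{G}$ on $\mathbb{R}$ by

$$\widetilde{G}(t)=G(t^2/2),\qquad t\in\mathbb{R}.$$

We see that $\widetilde{G}\in C^2(\mathbb{R})$, $\widetilde{G}(0)=G(0)$, $\widetilde{G}'(0)=0$, $\widetilde{G}'(t)=tG'(t^2/2)$ and $\widetilde{G}''(t)=G'(t^2/2)+t^2G''(t^2/2)$. Moreover,

$$U_f(z,z')=-h^2\widetilde{G}\left(\frac{(y-f(x))-(y'-f(x'))}{h}\right)+h^2\widetilde{G}\left(\frac{(y-f_\rho(x))-(y'-f_\rho(x'))}{h}\right).$$

(a) We apply the mean value theorem and see that $|U_f(z,z')|\leq 2h\|\widetilde{G}'\|_\infty\|f-f_\rho\|_\infty$. The inequality for $|U_f-U_g|$ is obtained when $f_\rho$ is replaced by $g$. Note that $\|\widetilde{G}'\|_\infty=\|tG'(t^2/2)\|_\infty$. Then the bounds for $U_f$ and $U_f-U_g$ are verified by noting $\|tG'(t^2/2)\|_\infty\leq 2C_G$.

To bound the variance, we apply (18) to the two points $w_1=\frac{(y-f(x))-(y'-f(x'))}{h}$ and $w_2=\frac{(y-f_\rho(x))-(y'-f_\rho(x'))}{h}$. Writing $w_2-t$ as $w_2-w_1+w_1-t$, we see from $\widetilde{G}'(0)=0$ that

$$U_f(z,z')=h^2\left\{\widetilde{G}(w_2)-\widetilde{G}(w_1)\right\}=h^2\widetilde{G}'(0)(w_2-w_1)+h^2\int_0^{w_2}(w_2-t)\widetilde{G}''(t)\,dt-h^2\int_0^{w_1}(w_1-t)\widetilde{G}''(t)\,dt$$

$$=h^2\int_0^{w_2}(w_2-w_1)\widetilde{G}''(t)\,dt+h^2\int_{w_1}^{w_2}(w_1-t)\widetilde{G}''(t)\,dt.$$

It follows that

$$|U_f(z,z')|\leq\|\widetilde{G}''\|_\infty\left|(y-f_\rho(x))-(y'-f_\rho(x'))\right|\left|(f(x)-f_\rho(x))-(f(x')-f_\rho(x'))\right|+\|\widetilde{G}''\|_\infty\left|(f(x)-f_\rho(x))-(f(x')-f_\rho(x'))\right|^2. \qquad (21)$$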

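Since $\mathbb{E}[|Y|^q]<\infty$, we apply Hölder's inequality and see that

$$\int_{\mathcal{Z}}\int_{\mathcal{Z}}\left|(y-f_\rho(x))-(y'-f_\rho(x'))\right|^2\left|(f(x)-f_\rho(x))-(f(x')-f_\rho(x'))\right|^2d\rho(z)\,d\rho(z')$$

$$\leq\left\{\int_{\mathcal{Z}}\int_{\mathcal{Z}}\left|(y-f_\rho(x))-(y'-f_\rho(x'))\right|^qd\rho(z)\,d\rho(z')\right\}^{2/q}\left\{\int_{\mathcal{Z}}\int_{\mathcal{Z}}\left|(f(x)-f_\rho(x))-(f(x')-f_\rho(x'))\right|^{2q/(q-2)}d\rho(z)\,d\rho(z')\right\}^{1-2/q}$$

$$\leq\left\{4^{q+1}\left(\mathbb{E}[|Y|^q]+\|f_\rho\|_\infty^q\right)\right\}^{2/q}\left\{\|f-f_\rho\|_\infty^{4/(q-2)}\,2\,\mathrm{var}[f(X)-f_\rho(X)]\right\}^{(q-2)/q}.$$

Here we have separated the power index $2q/(q-2)$ into the sum of $4/(q-2)$ and $2$. Then

$$\mathrm{var}[U_f]\leq\mathbb{E}[U_f^2]\leq 2\|\widetilde{G}''\|_\infty^2\,2^{\frac{5q+3}{q}}\left(\mathbb{E}[|Y|^q]+\|f_\rho\|_\infty^q\right)^{\frac{2}{q}}\|f-f_\rho\|_\infty^{\frac{4}{q}}\left(\mathrm{var}[f(X)-f_\rho(X)]\right)^{\frac{q-2}{q}}+2\|\widetilde{G}''\|_\infty^2\,4\|f-f_\rho\|_\infty^2\,2\,\mathrm{var}[f(X)-f_\rho(X)].$$

Hence the desired inequality holds true since $\|\widetilde{G}''\|_\infty\leq\|G'\|_\infty+\|t^2G''(t^2/2)\|_\infty\leq 3C_G$ and $\mathrm{var}[f(X)-f_\rho(X)]\leq\|f-f_\rho\|_\infty^2$.

(b) If $|Y|\leq M$ almost surely for some constant $M>0$, then we see from (21) that almost surely $|U_f(z,z')|\leq 4\|\widetilde{G}''\|_\infty\left(M+\|f_\rho\|_\infty+\|f-f_\rho\|_\infty\right)\left|(f(x)-f_\rho(x))-(f(x')-f_\rho(x'))\right|$. Hence (19) holds true almost surely. Replacing $f_\rho$ by $g$ in (21), we see immediately inequality (20). The proof of Lemma 18 is complete.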

With the above preparation, we can now give the uniform ratio probability inequality for U-statistics to estimate the sample error, following methods in the learning theory literature (Haussler et al., 1994; Koltchinskii, 2006; Cucker and Zhou, 2007).

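Lemma 19 Assume (2), (3) and $\varepsilon\geq C''_{\mathcal{H}}h^{-q^*}$. Then we have

$$\mathrm{Prob}\left\{\sup_{f\in\mathcal{H}}\frac{|V_f-\mathbb{E}[V_f]|}{(\mathbb{E}[V_f]+2\varepsilon)^{(q-2)/q}}>4\varepsilon^{2/q}\right\}\leq 2\mathcal{N}\left(\mathcal{H},\frac{\varepsilon}{4C_Gh}\right)\exp\left\{-\frac{(m-1)\varepsilon}{A''_{\mathcal{H}}h}\right\},$$

where $A''_{\mathcal{H}}$ is the constant given by

$$A''_{\mathcal{H}}=4A_{\mathcal{H}}(C''_{\mathcal{H}})^{-2/q}+12C_G\sup_{f\in\mathcal{H}}\|f-f_\rho\|_\infty.$$

If $|Y|\leq M$ almost surely for some constant $M>0$, then we have

$$\mathrm{Prob}\left\{\sup_{f\in\mathcal{H}}\frac{|V_f-\mathbb{E}[V_f]|}{\sqrt{\mathbb{E}[V_f]+2\varepsilon}}>4\sqrt{\varepsilon}\right\}\leq 2\mathcal{N}\left(\mathcal{H},\frac{\varepsilon}{2A'_{\mathcal{H}}}\right)\exp\left\{-\frac{(m-1)\varepsilon}{A''_{\mathcal{H}}}\right\},$$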

where A′′H is the constant given by

A′′H =8AH +6AH sup f∈Hk


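Proof If $\|f-f_j\|_\infty\leq\frac{\varepsilon}{4C_Gh}$, Lemma 18 (a) implies $|\mathbb{E}[V_f]-\mathbb{E}[V_{f_j}]|\leq\varepsilon$ and $|V_f-V_{f_j}|\leq\varepsilon$ almost surely. These in connection with Lemma 16 tell us that

$$\frac{|V_f-\mathbb{E}[V_f]|}{(\mathbb{E}[V_f]+2\varepsilon)^{(q-2)/q}}>4\varepsilon^{2/q}\quad\Longrightarrow\quad\frac{|V_{f_j}-\mathbb{E}[V_{f_j}]|}{(\mathbb{E}[V_{f_j}]+2\varepsilon)^{(q-2)/q}}>\varepsilon^{2/q}.$$

Thus by taking $\{f_j\}_{j=1}^N$ to be an $\frac{\varepsilon}{4C_Gh}$ net of the set $\mathcal{H}$ with $N$ being the covering number $\mathcal{N}\left(\mathcal{H},\frac{\varepsilon}{4C_Gh}\right)$, we find

$$\mathrm{Prob}\left\{\sup_{f\in\mathcal{H}}\frac{|V_f-\mathbb{E}[V_f]|}{(\mathbb{E}[V_f]+2\varepsilon)^{(q-2)/q}}>4\varepsilon^{2/q}\right\}\leq\mathrm{Prob}\left\{\sup_{j=1,\ldots,N}\frac{|V_{f_j}-\mathbb{E}[V_{f_j}]|}{(\mathbb{E}[V_{f_j}]+2\varepsilon)^{(q-2)/q}}>\varepsilon^{2/q}\right\}$$

$$\leq\sum_{j=1,\ldots,N}\mathrm{Prob}\left\{\frac{|V_{f_j}-\mathbb{E}[V_{f_j}]|}{(\mathbb{E}[V_{f_j}]+2\varepsilon)^{(q-2)/q}}>\varepsilon^{2/q}\right\}.$$

Fix $j\in\{1,\ldots,N\}$. Apply Lemma 17 to $U=U_{f_j}$ satisfying $\frac{1}{m(m-1)}\sum_{i=1}^m\sum_{k\neq i}U(z_i,z_k)-\mathbb{E}[U]=V_{f_j}-\mathbb{E}[V_{f_j}]$. By the bounds for $|U_{f_j}|$ and $\mathrm{var}[U_{f_j}]$ from Part (a) of Lemma 18, we know by taking $\widetilde{\varepsilon}=\varepsilon^{2/q}(\mathbb{E}[V_{f_j}]+2\varepsilon)^{(q-2)/q}$ that

$$\mathrm{Prob}\left\{\frac{|V_{f_j}-\mathbb{E}[V_{f_j}]|}{(\mathbb{E}[V_{f_j}]+2\varepsilon)^{(q-2)/q}}>\varepsilon^{2/q}\right\}=\mathrm{Prob}\left\{|V_{f_j}-\mathbb{E}[V_{f_j}]|>\widetilde{\varepsilon}\right\}$$

$$\leq 2\exp\left\{-\frac{(m-1)\widetilde{\varepsilon}^2}{4A_{\mathcal{H}}\left(\mathrm{var}[f_j(X)-f_\rho(X)]\right)^{(q-2)/q}+12C_G\|f_j-f_\rho\|_\infty h\widetilde{\varepsilon}}\right\}$$

$$\leq 2\exp\left\{-\frac{(m-1)\varepsilon^{4/q}(\mathbb{E}[V_{f_j}]+2\varepsilon)^{(q-2)/q}}{4A_{\mathcal{H}}+12C_G\|f_j-f_\rho\|_\infty h\varepsilon^{2/q}}\right\},$$

where in the last step we have applied the important relation (17) to the function $f=f_j$ and bounded $\left(\mathrm{var}[f_j(X)-f_\rho(X)]\right)^{(q-2)/q}$ by $(\mathbb{E}[V_{f_j}]+2\varepsilon)^{(q-2)/q}$. This together with the notation $N=\mathcal{N}\left(\mathcal{H},\frac{\varepsilon}{4C_Gh}\right)$ and the inequality $\|f_j-f_\rho\|_\infty\leq\sup_{f\in\mathcal{H}}\|f-f_\rho\|_\infty$ gives the first desired bound, where we have observed that $\varepsilon\geq C''_{\mathcal{H}}h^{-q^*}$ and $h\geq 1$ imply $\varepsilon^{-2/q}\leq(C''_{\mathcal{H}})^{-2/q}h$.

If $|Y|\leq M$ almost surely for some constant $M>0$, then we follow the same line as in our above proof. According to Part (b) of Lemma 18, we should replace $4C_Gh$ by $2A'_{\mathcal{H}}$, $q$ by $4$, and bound the variance $\mathrm{var}[U_{f_j}]$ by $2A'_{\mathcal{H}}\mathrm{var}[f_j(X)-f_\rho(X)]\leq 2A'_{\mathcal{H}}(\mathbb{E}[V_{f_j}]+2\varepsilon)$. Then the desired estimate follows. The proof of Lemma 19 is complete.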

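We are in a position to bound the sample error. To unify the two estimates in Lemma 19, we denote $A'_{\mathcal{H}}=2C_G$ in the general case. For $m\in\mathbb{N}$, $0<\delta<1$, let $\varepsilon_{m,\delta}$ be the smallest positive solution to the inequality

$$\log\mathcal{N}\left(\mathcal{H},\frac{\varepsilon}{2A'_{\mathcal{H}}}\right)-\frac{(m-1)\varepsilon}{A''_{\mathcal{H}}}\leq\log\frac{\delta}{2}. \qquad (22)$$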

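Proposition 20 Let $0<\delta<1$, $0<\eta\leq 1$. Under assumptions (2) and (3), we have with confidence $1-\delta$,

$$\mathrm{var}[f_{\mathbf{z}}(X)-f_\rho(X)]\leq(1+\eta)\mathrm{var}[f_{approx}(X)-f_\rho(X)]+12\left(2+24^{\frac{q-2}{2}}\right)\eta^{\frac{2-q}{2}}\left(h\varepsilon_{m,\delta}+2C''_{\mathcal{H}}h^{-q^*}\right).$$

If $|Y|\leq M$ almost surely for some $M>0$, then with confidence $1-\delta$, we have

$$\mathrm{var}[f_{\mathbf{z}}(X)-f_\rho(X)]\leq(1+\eta)\mathrm{var}[f_{approx}(X)-f_\rho(X)]+\frac{278}{\eta}\left(\varepsilon_{m,\delta}+2C''_{\mathcal{H}}h^{-2}\right).$$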

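Proof Denote $\tau=(q-2)/q$ and $\varepsilon_{m,\delta,h}=\max\{h\varepsilon_{m,\delta},C''_{\mathcal{H}}h^{-q^*}\}$ in the general case with some $q>2$, while $\tau=1/2$ and $\varepsilon_{m,\delta,h}=\max\{\varepsilon_{m,\delta},C''_{\mathcal{H}}h^{-2}\}$ when $|Y|\leq M$ almost surely. Then by Lemma 19, we know that with confidence $1-\delta$, there holds

$$\sup_{f\in\mathcal{H}}\frac{|V_f-\mathbb{E}[V_f]|}{(\mathbb{E}[V_f]+2\varepsilon_{m,\delta,h})^{\tau}}\leq 4\varepsilon_{m,\delta,h}^{1-\tau}$$

which implies

$$\mathbb{E}[V_{f_{\mathbf{z}}}]-V_{f_{\mathbf{z}}}+V_{f_{\mathcal{H}}}-\mathbb{E}[V_{f_{\mathcal{H}}}]\leq 4\varepsilon_{m,\delta,h}^{1-\tau}(\mathbb{E}[V_{f_{\mathbf{z}}}]+2\varepsilon_{m,\delta,h})^{\tau}+4\varepsilon_{m,\delta,h}^{1-\tau}(\mathbb{E}[V_{f_{\mathcal{H}}}]+2\varepsilon_{m,\delta,h})^{\tau}.$$

This together with Lemma 15 and (16) yields

$$\mathrm{var}[f_{\mathbf{z}}(X)-f_\rho(X)]\leq 4\mathcal{S}+16\varepsilon_{m,\delta,h}+\mathrm{var}[f_{approx}(X)-f_\rho(X)]+2C''_{\mathcal{H}}h^{-q^*}, \qquad (23)$$

where

$$\mathcal{S}:=\varepsilon_{m,\delta,h}^{1-\tau}(\mathbb{E}[V_{f_{\mathbf{z}}}])^{\tau}+\varepsilon_{m,\delta,h}^{1-\tau}(\mathbb{E}[V_{f_{\mathcal{H}}}])^{\tau}=\left(\frac{24}{\eta}\right)^{\tau}\varepsilon_{m,\delta,h}^{1-\tau}\left(\frac{\eta}{24}\mathbb{E}[V_{f_{\mathbf{z}}}]\right)^{\tau}+\left(\frac{12}{\eta}\right)^{\tau}\varepsilon_{m,\delta,h}^{1-\tau}\left(\frac{\eta}{12}\mathbb{E}[V_{f_{\mathcal{H}}}]\right)^{\tau}.$$

Now we apply Young's inequality

$$a\cdot b\leq(1-\tau)a^{1/(1-\tau)}+\tau b^{1/\tau},\qquad a,b\geq 0$$

and find

$$\mathcal{S}\leq\left(\frac{24}{\eta}\right)^{\tau/(1-\tau)}\varepsilon_{m,\delta,h}+\frac{\eta}{24}\mathbb{E}[V_{f_{\mathbf{z}}}]+\left(\frac{12}{\eta}\right)^{\tau/(1-\tau)}\varepsilon_{m,\delta,h}+\frac{\eta}{12}\mathbb{E}[V_{f_{\mathcal{H}}}].$$

Combining this with (23), Theorem 13 and the identity $\mathbb{E}[V_f]=\mathcal{E}^{(h)}(f)-\mathcal{E}^{(h)}(f_\rho)$ gives

$$\mathrm{var}[f_{\mathbf{z}}(X)-f_\rho(X)]\leq\frac{\eta}{6}\mathrm{var}[f_{\mathbf{z}}(X)-f_\rho(X)]+\left(1+\frac{\eta}{3}\right)\mathrm{var}[f_{approx}(X)-f_\rho(X)]+\mathcal{S}',$$

where $\mathcal{S}':=\left(16+8(24/\eta)^{\tau/(1-\tau)}\right)\varepsilon_{m,\delta,h}+3C''_{\mathcal{H}}h^{-q^*}$. Since $1/(1-\frac{\eta}{6})\leq 1+\frac{\eta}{3}$ and $(1+\frac{\eta}{3})^2\leq 1+\eta$, we see that

$$\mathrm{var}[f_{\mathbf{z}}(X)-f_\rho(X)]\leq(1+\eta)\mathrm{var}[f_{approx}(X)-f_\rho(X)]+\frac{4}{3}\mathcal{S}'.$$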

5. Proof of Main Results

We are now in a position to prove our main results stated in Section 2.

5.1 Proof of Theorem 3

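Recall $\mathcal{D}_{\mathcal{H}}(f_\rho)=\mathrm{var}[f_{approx}(X)-f_\rho(X)]$. Take $\eta=\min\{\varepsilon/(3\mathcal{D}_{\mathcal{H}}(f_\rho)),1\}$. Then

$$\eta\,\mathrm{var}[f_{approx}(X)-f_\rho(X)]\leq\varepsilon/3.$$

Now we take

$$h_{\varepsilon,\delta}=\left(72\left(2+24^{(q-2)/2}\right)\eta^{(2-q)/2}C''_{\mathcal{H}}/\varepsilon\right)^{1/q^*}.$$

Set $\widetilde{\varepsilon}:=\varepsilon/\left(36\left(2+24^{(q-2)/2}\right)\eta^{(2-q)/2}\right)$. We choose

$$m_{\varepsilon,\delta}(h)=\frac{hA''_{\mathcal{H}}}{\widetilde{\varepsilon}}\left(\log\mathcal{N}\left(\mathcal{H},\frac{\widetilde{\varepsilon}}{2hA'_{\mathcal{H}}}\right)-\log\frac{\delta}{2}\right)+1.$$

With this choice, we know that whenever $m\geq m_{\varepsilon,\delta}(h)$, the solution $\varepsilon_{m,\delta}$ to inequality (22) satisfies $\varepsilon_{m,\delta}\leq\widetilde{\varepsilon}/h$. Combining all the above estimates and Proposition 20, we see that whenever $h\geq h_{\varepsilon,\delta}$ and $m\geq m_{\varepsilon,\delta}(h)$, error bound (4) holds true with confidence $1-\delta$. This proves Theorem 3.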

5.2 Proof of Theorem 5

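We apply Proposition 20. By covering number condition (5), we know that $\varepsilon_{m,\delta}$ is bounded by $\widetilde{\varepsilon}_{m,\delta}$, the smallest positive solution to the inequality

$$A_p\left(\frac{2A'_{\mathcal{H}}}{\varepsilon}\right)^p-\frac{(m-1)\varepsilon}{A''_{\mathcal{H}}}\leq\log\frac{\delta}{2}.$$

This inequality written as $\varepsilon^{1+p}-\frac{A''_{\mathcal{H}}}{m-1}\log\frac{2}{\delta}\,\varepsilon^p-A_p(2A'_{\mathcal{H}})^p\frac{A''_{\mathcal{H}}}{m-1}\geq 0$ is well understood in learning theory (e.g., Cucker and Zhou, 2007) and its solution can be bounded as

$$\widetilde{\varepsilon}_{m,\delta}\leq\max\left\{\frac{2A''_{\mathcal{H}}}{m-1}\log\frac{2}{\delta},\ \left(2A_pA''_{\mathcal{H}}(2A'_{\mathcal{H}})^p\right)^{1/(1+p)}(m-1)^{-\frac{1}{1+p}}\right\}.$$

If $\mathbb{E}[|Y|^q]<\infty$ for some $q>2$, then the first part of Proposition 20 verifies (6) with the constant $\widetilde{C}_{\mathcal{H}}$ given by

$$\widetilde{C}_{\mathcal{H}}=24\left(2+24^{(q-2)/2}\right)\left\{2A''_{\mathcal{H}}+\left(2A_pA''_{\mathcal{H}}(2A'_{\mathcal{H}})^p\right)^{1/(1+p)}+2C''_{\mathcal{H}}\right\}.$$

If $|Y|\leq M$ almost surely for some $M>0$, then the second part of Proposition 20 proves (7) with the constant $\widetilde{C}_{\mathcal{H}}$ given by

$$\widetilde{C}_{\mathcal{H}}=278\left\{2A''_{\mathcal{H}}+\left(2A_pA''_{\mathcal{H}}(2A'_{\mathcal{H}})^p\right)^{1/(1+p)}+2C''_{\mathcal{H}}\right\}.$$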

5.3 Proof of Theorem 8 Note 1 m m

i=1

f(xi)−π√m(yi)

m1

m

i=1

g(xi)−π√m(yi)

≤ kfgk∞

and

E[f(X)πm(Y)]E[g(X)πm(Y)]≤ kfgk.

So by taking{fj}Nj=1to be an ε4 net of the set

H

withN=

N H

4

, we know that for each f

H

there is some j∈ {1, . . . ,N}such thatkffjk∞≤ε4. Hence

1 m m

i=1

f(xi)−π√m(yi)

−E[f(X)πm(Y)] >ε = 1 m m

i=1

fj(xi)−π√m(yi)

−E[fj(X)πm(Y)] > ε 2.

It follows that

Prob (

sup f∈H

1 m m

i=1

f(xi)−π√m(yi)

−E[f(X)πm(Y)] >ε ) ≤ Prob ( sup j=1,...,N

1 m m

i=1

fj(xi)−π√m(yi)

−E[fj(X)πm(Y)] > ε 2 ) ≤ N

j=1 Prob ( 1 m m

i=1

fj(xi)−π√m(yi)

−E[fj(X)πm(Y)] > ε 2 ) .

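For each fixed $j \in \{1, \ldots, N\}$, we apply the classical Bernstein probability inequality to the random variable $\xi = f_j(X) - \pi_{\sqrt{m}}(Y)$ on $(Z, \rho)$ bounded by $\widetilde{M} = \sup_{f \in \mathcal{H}} \|f\|_\infty + \sqrt{m}$ with variance $\sigma^2(\xi) \le E[|f_j(X) - \pi_{\sqrt{m}}(Y)|^2] \le 2 \sup_{f \in \mathcal{H}} \|f\|_\infty^2 + 2E[|Y|^2] =: \sigma_{\mathcal{H}}^2$ and know that

$$\begin{aligned}
\mathrm{Prob}\left\{ \left| \frac{1}{m}\sum_{i=1}^{m} \left(f_j(x_i) - \pi_{\sqrt{m}}(y_i)\right) - E[f_j(X) - \pi_{\sqrt{m}}(Y)] \right| > \frac{\varepsilon}{2} \right\}
&\le 2\exp\left\{ -\frac{m(\varepsilon/2)^2}{\frac{2}{3}\widetilde{M}\varepsilon/2 + 2\sigma^2(\xi)} \right\} \\
&\le 2\exp\left\{ -\frac{m\varepsilon^2}{\frac{4}{3}\widetilde{M}\varepsilon + 8\sigma_{\mathcal{H}}^2} \right\}.
\end{aligned}$$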
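The rescaling between the two exponential bounds in the Bernstein step can be verified numerically. The following sketch evaluates the two-sided Bernstein bound $2\exp\{-m t^2/(2(\sigma^2 + Mt/3))\}$ at $t = \varepsilon/2$ and compares it with the simplified form $2\exp\{-m\varepsilon^2/(\frac{4}{3}\widetilde{M}\varepsilon + 8\sigma_{\mathcal{H}}^2)\}$. The function names and the values assumed for $\sup_{f}\|f\|_\infty$ and $E[|Y|^2]$ are illustrative only.

```python
import math


def bernstein_tail(m, t, M, sigma2):
    """Two-sided Bernstein bound 2*exp(-m*t^2 / (2*(sigma2 + M*t/3)))."""
    return 2.0 * math.exp(-m * t ** 2 / (2.0 * (sigma2 + M * t / 3.0)))


def simplified_tail(m, eps, M_tilde, sigma2_H):
    """The form 2*exp(-m*eps^2 / ((4/3)*M_tilde*eps + 8*sigma2_H))."""
    return 2.0 * math.exp(-m * eps ** 2 / ((4.0 / 3.0) * M_tilde * eps + 8.0 * sigma2_H))


# Illustrative values only (assumed, not taken from the paper).
m = 10_000
sup_norm = 1.0            # stands for sup_{f in H} ||f||_infinity
second_moment = 2.0       # stands for E[|Y|^2]
eps = 0.5

M_tilde = sup_norm + math.sqrt(m)
sigma2_H = 2.0 * sup_norm ** 2 + 2.0 * second_moment

exact = bernstein_tail(m, eps / 2.0, M_tilde, sigma2_H)
simplified = simplified_tail(m, eps, M_tilde, sigma2_H)
```

With $\sigma^2(\xi)$ replaced by its upper bound $\sigma_{\mathcal{H}}^2$ the two expressions agree up to rounding, and any smaller variance only decreases the first one, which is the direction the proof needs.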

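The above argument together with covering number condition (5) yields

$$\begin{aligned}
\mathrm{Prob}\left\{ \sup_{f \in \mathcal{H}} \left| \frac{1}{m}\sum_{i=1}^{m} \left(f(x_i) - \pi_{\sqrt{m}}(y_i)\right) - E[f(X) - \pi_{\sqrt{m}}(Y)] \right| > \varepsilon \right\}
&\le 2N\exp\left\{ -\frac{m\varepsilon^2}{\frac{4}{3}\widetilde{M}\varepsilon + 8\sigma_{\mathcal{H}}^2} \right\} \\
&\le 2\exp\left\{ A_p\left(\frac{4}{\varepsilon}\right)^{p} - \frac{m\varepsilon^2}{\frac{4}{3}\widetilde{M}\varepsilon + 8\sigma_{\mathcal{H}}^2} \right\}.
\end{aligned}$$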

Bounding the right-hand side above by $\delta$ is equivalent to the inequality

ε2+p

34mMelog2 δε

1+p

m8σ2

Hlog

2 δε

p

Ap4

p


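By taking $\widetilde{\varepsilon}_{m,\delta}$ to be the smallest solution to the above inequality, we see from Cucker and Zhou (2007) as in the proof of Theorem 5 that with confidence at least $1-\delta$,

$$\begin{aligned}
\sup_{f \in \mathcal{H}} \left| \frac{1}{m}\sum_{i=1}^{m} \left(f(x_i) - \pi_{\sqrt{m}}(y_i)\right) - E[f(X) - \pi_{\sqrt{m}}(Y)] \right|
&\le \widetilde{\varepsilon}_{m,\delta} \le \max\left\{ \frac{4\widetilde{M}}{m}\log\frac{2}{\delta},\ \sqrt{\frac{24\sigma_{\mathcal{H}}^2}{m}\log\frac{2}{\delta}},\ \left(\frac{A_p 4^{p}}{m}\right)^{\frac{1}{2+p}} \right\} \\
&\le \left( 7\sup_{f \in \mathcal{H}} \|f\|_\infty + 4 + 7\sqrt{E[|Y|^2]} + 4A_p^{\frac{1}{2+p}} \right) m^{-\frac{1}{2+p}} \log\frac{2}{\delta}.
\end{aligned}$$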

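Moreover, since $\pi_{\sqrt{m}}(y) - y = 0$ for $|y| \le \sqrt{m}$ while $|\pi_{\sqrt{m}}(y) - y| \le |y| \le \frac{|y|^2}{\sqrt{m}}$ for $|y| > \sqrt{m}$, we know that

$$\begin{aligned}
\left| E[\pi_{\sqrt{m}}(Y)] - E[f_\rho(X)] \right|
&= \left| \int_X \int_Y \left(\pi_{\sqrt{m}}(y) - y\right) d\rho(y|x)\, d\rho_X(x) \right| \\
&= \left| \int_X \int_{|y| > \sqrt{m}} \left(\pi_{\sqrt{m}}(y) - y\right) d\rho(y|x)\, d\rho_X(x) \right| \\
&\le \int_X \int_{|y| > \sqrt{m}} \frac{|y|^2}{\sqrt{m}}\, d\rho(y|x)\, d\rho_X(x) \le \frac{E[|Y|^2]}{\sqrt{m}}.
\end{aligned}$$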
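The truncation estimate is a pointwise inequality, so it also holds for the empirical distribution of any finite sample. The sketch below illustrates this with a projection onto $[-\sqrt{m}, \sqrt{m}]$ and a comparison of the gap $|E[\pi_{\sqrt{m}}(Y)] - E[Y]|$ with $E[|Y|^2]/\sqrt{m}$, both computed as sample averages. The helper names and the Gaussian sample are assumptions made for the example; the paper does not prescribe a distribution for $Y$.

```python
import math
import random


def project(y, m):
    """Projection of y onto the closed interval [-sqrt(m), sqrt(m)]."""
    r = math.sqrt(m)
    return max(-r, min(r, y))


def truncation_gap_and_bound(ys, m):
    """Empirical |E[pi(Y)] - E[Y]| and the bound E[|Y|^2] / sqrt(m)."""
    n = len(ys)
    gap = abs(sum(project(y, m) - y for y in ys) / n)
    bound = sum(y * y for y in ys) / n / math.sqrt(m)
    return gap, bound


# Hypothetical response sample; any distribution with finite second moment works.
random.seed(0)
ys = [random.gauss(0.5, 3.0) for _ in range(5000)]

# The gap stays below the bound for every truncation level tried.
results = {m: truncation_gap_and_bound(ys, m) for m in (1, 4, 25, 100)}
```

Here `m` plays the role of the truncation level only; in the theorem it is also the sample size, which is why the bias term decays like $m^{-1/2}$.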

Therefore, (11) holds with confidence at least $1-\delta$. The proof of Theorem 8 is complete.

6. Conclusion and Discussion

In this paper we have proved the consistency of an MEE algorithm associated with Rényi's entropy of order 2 by letting the scaling parameter $h$ in the kernel density estimator tend to infinity at an appropriate rate. This result explains the effectiveness of the MEE principle in empirical applications where the parameter $h$ is required to be large enough before smaller values are tuned. However, the motivation of the MEE principle is to minimize error entropies approximately, which requires small $h$ for the kernel density estimator to converge to the true probability density function. Therefore, our consistency result seems surprising.

As far as we know, our result is the first rigorous consistency result for MEE algorithms. There are many open questions in the mathematical analysis of MEE algorithms. For instance, can MEE algorithm (1) be consistent by taking $h \to 0$? Can one carry out error analysis for the MEE algorithm if Shannon's entropy or Rényi's entropy of order $\alpha \ne 2$ is used? How can we establish error analysis for other learning settings such as those with non-identical sampling processes (Smale and Zhou, 2009; Hu, 2011)? These questions require further research and will be our future topics.


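Table 1: NOTATIONS

| notation | meaning | pages |
|---|---|---|
| $p_E$ | probability density function of a random variable $E$ | 378 |
| $H_S(E)$ | Shannon's entropy of a random variable $E$ | 378 |
| $H_{R,\alpha}(E)$ | Rényi's entropy of order $\alpha$ | 378 |
| $X$ | explanatory variable for learning | 378 |
| $Y$ | response variable for learning | 378 |
| $E = Y - f(X)$ | error random variable associated with a predictor $f(X)$ | 378 |
| $H_R(E)$ | Rényi's entropy of order $\alpha = 2$ | 378 |
| $\mathbf{z} = \{(x_i, y_i)\}_{i=1}^{m}$ | a sample for learning | 378 |
| $G$ | windowing function | 378, 379, 380 |
| $h$ | MEE scaling parameter | 378, 379 |
| $\widehat{p}_E$ | Parzen windowing approximation of $p_E$ | 378 |
| $\widehat{H}_S$ | empirical Shannon entropy | 378 |
| $\widehat{H}_R$ | empirical Rényi's entropy of order 2 | 378 |
| $f_\rho$ | the regression function of $\rho$ | 379 |
| $f_{\mathbf{z}}$ | output function of the MEE learning algorithm (1) | 379 |
| $\mathcal{H}$ | the hypothesis space for the ERM algorithm | 379 |
| $\mathrm{var}$ | the variance of a random variable | 379 |
| $q$, $q^* = \min\{q-2, 2\}$ | power indices in condition (2) for $E[\lvert Y\rvert^q] < \infty$ | 380 |
| $C_G$ | constant for decay condition (3) of $G$ | 380 |
| $\mathcal{D}_{\mathcal{H}}(f_\rho)$ | approximation error of the pair $(\mathcal{H}, \rho)$ | 380 |
| $\mathcal{N}(\mathcal{H}, \varepsilon)$ | covering number of the hypothesis space $\mathcal{H}$ | 380 |
| $p$ | power index for covering number condition (5) | 380 |
| $\pi_{\sqrt{m}}$ | projection onto the closed interval $[-\sqrt{m}, \sqrt{m}]$ | 381 |
| $\widetilde{f}_{\mathbf{z}}$ | estimator of $f_\rho$ | 382 |
| $\mathcal{E}^{(h)}(f)$ | generalization error associated with $G$ and $h$ | 383 |
| $\mathcal{E}^{ls}(f)$ | least squares generalization error $\mathcal{E}^{ls}(f) = \int_Z (f(x) - y)^2\, d\rho$ | 383 |
| $C_\rho$ | constant $C_\rho = \int_Z (y - f_\rho(x))^2\, d\rho$ associated with $\rho$ | 384 |
| $f_{\mathcal{H}}$ | minimizer of $\mathcal{E}^{(h)}(f)$ in $\mathcal{H}$ | 385 |
| $f_{approx}$ | minimizer of $\mathrm{var}[f(X) - f_\rho(X)]$ in $\mathcal{H}$ | 385 |
| $U_f$ | kernel for the U statistics $V_f$ | 387 |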
