Learning Theory Approach to Minimum Error Entropy Criterion


Ting Hu [email protected]

School of Mathematics and Statistics Wuhan University

Wuhan 430072, China

Jun Fan [email protected]

Department of Mathematics City University of Hong Kong 83 Tat Chee Avenue

Kowloon, Hong Kong, China

Qiang Wu [email protected]

Department of Mathematical Sciences Middle Tennessee State University Murfreesboro, TN 37132, USA

Ding-Xuan Zhou [email protected]

Department of Mathematics City University of Hong Kong 83 Tat Chee Avenue

Kowloon, Hong Kong, China

Editor: Gabor Lugosi

Abstract

We consider the minimum error entropy (MEE) criterion and an empirical risk minimization learning algorithm when an approximation of Rényi's entropy (of order 2) by Parzen windowing is minimized. This learning algorithm involves a Parzen windowing scaling parameter. We present a learning theory approach for this MEE algorithm in a regression setting when the scaling parameter is large. Consistency and explicit convergence rates are provided in terms of the approximation ability and capacity of the involved hypothesis space. Novel analysis is carried out for the generalization error associated with Rényi's entropy and a Parzen windowing function, to overcome technical difficulties arising from the essential differences between the classical least squares problems and the MEE setting. An involved symmetrized least squares error is introduced and analyzed, which is related to some ranking algorithms.

Keywords: minimum error entropy, learning theory, Rényi's entropy, empirical risk minimization, approximation error

1. Introduction


A systematic treatment and recent development of this area can be found in Principe (2010) and references therein.

Minimum error entropy (MEE) is a principle of information theoretical learning and provides a family of supervised learning algorithms. It was introduced for adaptive system training in Erdogmus and Principe (2002) and has been applied to blind source separation, maximally informative subspace projections, clustering, feature selection, blind deconvolution, and some other topics (Erdogmus and Principe, 2003; Principe, 2010; Silva et al., 2010). The idea of MEE is to extract from data as much information as possible about the data generating systems by minimizing error entropies in various ways. In information theory, entropies are used to measure average information quantitatively. For a random variable $E$ with probability density function $p_E$, Shannon's entropy of $E$ is defined as

$$H_S(E) = -\mathbb{E}[\log p_E] = -\int p_E(e)\log p_E(e)\,de$$

while Rényi's entropy of order $\alpha$ ($\alpha>0$ but $\alpha\neq 1$) is defined as

$$H_{R,\alpha}(E) = \frac{1}{1-\alpha}\log \mathbb{E}\big[p_E^{\alpha-1}\big] = \frac{1}{1-\alpha}\log\left(\int (p_E(e))^{\alpha}\,de\right)$$

satisfying $\lim_{\alpha\to 1} H_{R,\alpha}(E) = H_S(E)$. In supervised learning our target is to predict the response variable $Y$ from the explanatory variable $X$. Then the random variable $E$ becomes the error variable $E = Y - f(X)$ when a predictor $f(X)$ is used, and the MEE principle aims at searching for a predictor $f(X)$ that contains the most information of the response variable by minimizing information entropies of the error variable $E = Y - f(X)$. This principle is a substitution of the classical least squares method when the noise is non-Gaussian. Note that $\mathbb{E}[(Y-f(X))^2] = \int e^2 p_E(e)\,de$. The least squares method minimizes the variance of the error variable $E$ and is well suited to problems involving Gaussian noise (such as some from linear signal processing). But it takes only the first two moments into consideration, and does not work very well for problems involving heavy tailed non-Gaussian noise. For such problems, MEE might still perform very well in principle since moments of all orders of the error variable are taken into account by entropies. Here we only consider Rényi's entropy of order $\alpha=2$: $H_R(E) = H_{R,2}(E) = -\log\int (p_E(e))^2\,de$. Our analysis does not apply to Rényi's entropy of order $\alpha\neq 2$.

In most real applications, neither the explanatory variable $X$ nor the response variable $Y$ is explicitly known. Instead, in supervised learning, a sample $\mathbf{z} = \{(x_i,y_i)\}_{i=1}^m$ is available which reflects the distribution of the explanatory variable $X$ and the functional relation between $X$ and the response variable $Y$. With this sample, information entropies of the error variable $E = Y - f(X)$ can be approximated by estimating its probability density function $p_E$ by Parzen (1962) windowing

$$\widehat{p}_E(e) = \frac{1}{mh}\sum_{i=1}^m G\left(\frac{(e-e_i)^2}{2h^2}\right),$$

where $e_i = y_i - f(x_i)$, $h>0$ is an MEE scaling parameter, and $G$ is a windowing function. A typical choice for the windowing function, $G(t) = \exp\{-t\}$, corresponds to Gaussian windowing. Then approximations of Shannon's entropy and Rényi's entropy of order 2 are given by their empirical versions $-\frac{1}{m}\sum_{i=1}^m \log \widehat{p}_E(e_i)$ and $-\log\big(\frac{1}{m}\sum_{i=1}^m \widehat{p}_E(e_i)\big)$ as

$$\widehat{H}_S = -\frac{1}{m}\sum_{i=1}^m \log\left[\frac{1}{mh}\sum_{j=1}^m G\left(\frac{(e_i-e_j)^2}{2h^2}\right)\right]$$

and

$$\widehat{H}_R = -\log\frac{1}{m^2h}\sum_{i=1}^m\sum_{j=1}^m G\left(\frac{(e_i-e_j)^2}{2h^2}\right)$$

respectively. The empirical MEE is implemented by minimizing these computable quantities.

Though the MEE principle has been proposed for a decade and MEE algorithms have been shown to be effective in various applications, its theoretical foundation for mathematical error analysis is not well understood yet. There is not even a consistency result in the literature. It has been observed in applications that the scaling parameter $h$ should be large enough for MEE algorithms to work well before smaller values are tuned. However, it is well known that the convergence of Parzen windowing requires $h$ to converge to 0. We believe this contradiction imposes difficulty for rigorous mathematical analysis of MEE algorithms. Another technical barrier for mathematical analysis of MEE algorithms for regression is the possibility that the regression function may not be a minimizer of the associated generalization error, as described in detail in Section 3 below. The main contribution of this paper is a consistency result for an MEE algorithm for regression. It does require $h$ to be large and explains the effectiveness of the MEE principle in applications.

In the sequel of this paper, we consider an MEE learning algorithm that minimizes the empirical Rényi's entropy $\widehat{H}_R$ and focus on the regression problem. We will take a learning theory approach and analyze this algorithm in an empirical risk minimization (ERM) setting. Assume $\rho$ is a probability measure on $\mathcal{Z} := \mathcal{X}\times\mathcal{Y}$, where $\mathcal{X}$ is a separable metric space (input space for learning) and $\mathcal{Y} = \mathbb{R}$ (output space). Let $\rho_X$ be its marginal distribution on $\mathcal{X}$ (for the explanatory variable $X$) and $\rho(\cdot|x)$ be the conditional distribution of $Y$ for given $X = x$. The sample $\mathbf{z}$ is assumed to be drawn from $\rho$ independently and identically distributed. The aim of the regression problem is to predict the conditional mean of $Y$ for given $X$ by learning the regression function defined by

$$f_\rho(x) = \mathbb{E}(Y|X=x) = \int_{\mathcal{Y}} y\,d\rho(y|x), \qquad x\in\mathcal{X}.$$

The minimization of empirical Rényi's entropy cannot be done over all possible measurable functions, which would lead to overfitting. A suitable hypothesis space should be chosen appropriately in the ERM setting. The ERM framework for MEE learning is defined as follows. Recall $e_i = y_i - f(x_i)$.

Definition 1 Let $G$ be a continuous function defined on $[0,\infty)$ and $h>0$. Let $\mathcal{H}$ be a compact subset of $C(\mathcal{X})$. Then the MEE learning algorithm associated with $\mathcal{H}$ is defined by

$$f_{\mathbf{z}} = \arg\min_{f\in\mathcal{H}}\left\{-\log\frac{1}{m^2h}\sum_{i=1}^m\sum_{j=1}^m G\left(\frac{[(y_i-f(x_i))-(y_j-f(x_j))]^2}{2h^2}\right)\right\}. \tag{1}$$
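As a rough illustration of the objective in (1), the following sketch evaluates the empirical Rényi entropy of the errors for a given predictor and picks the best one from a finite candidate list. The names `gaussian_window`, `mee_objective` and `mee_select` are hypothetical, the Gaussian window G(t) = exp(-t) is one admissible choice, and the finite list is only a stand-in for the compact hypothesis space; it is not the authors' implementation.

```python
import numpy as np

def gaussian_window(t):
    """Windowing function G(t) = exp(-t) on [0, inf) (Gaussian windowing)."""
    return np.exp(-t)

def mee_objective(predict, x, y, h, window=gaussian_window):
    """Empirical Renyi entropy (order 2) of the errors e_i = y_i - predict(x_i)."""
    e = np.asarray(y, dtype=float) - np.asarray(predict(x), dtype=float)
    m = e.shape[0]
    diff = e[:, None] - e[None, :]            # all pairwise differences e_i - e_j
    pair_sum = window(diff ** 2 / (2.0 * h ** 2)).sum()
    return -np.log(pair_sum / (m ** 2 * h))

def mee_select(candidates, x, y, h):
    """Minimize the objective over a finite candidate list.

    Assumption: the finite list replaces the compact set H of Definition 1.
    """
    values = [mee_objective(f, x, y, h) for f in candidates]
    return candidates[int(np.argmin(values))]
```

The objective depends on the predictor only through differences of errors, so adding a constant to a predictor leaves its value unchanged.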

2. Main Results on Consistency and Convergence Rates

Throughout the paper, we assume $h\geq 1$ and that

$$\mathbb{E}[|Y|^q]<\infty \ \text{for some } q>2, \ \text{and } f_\rho\in L^\infty_{\rho_X}. \ \text{Denote } q^* = \min\{q-2,\,2\}. \tag{2}$$

We also assume that the windowing function $G$ satisfies

$$G\in C^2[0,\infty),\quad G'_+(0) = -1,\quad \text{and}\quad C_G := \sup_{t\in(0,\infty)}\big\{|(1+t)G'(t)| + |(1+t)G''(t)|\big\} < \infty. \tag{3}$$

The special example $G(t) = \exp\{-t\}$ for the Gaussian windowing satisfies (3).
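For the Gaussian window the constant in (3) can be worked out by hand. The short derivation below is a sketch added for illustration and uses only the definition of $C_G$ stated above.

```latex
\begin{align*}
G(t) &= e^{-t}, \qquad G'(t) = -e^{-t}, \qquad G''(t) = e^{-t}, \qquad G'_+(0) = -1, \\
|(1+t)G'(t)| + |(1+t)G''(t)| &= 2(1+t)e^{-t}, \\
\frac{d}{dt}\big[(1+t)e^{-t}\big] &= -t\,e^{-t} \le 0 \quad (t \ge 0), \\
C_G &= \sup_{t\in(0,\infty)} 2(1+t)e^{-t} = 2 .
\end{align*}
```

The supremum is approached as $t\to 0^+$, so $C_G = 2$ is finite and condition (3) holds for this choice of $G$.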

Consistency analysis for regression algorithms is often carried out in the literature under a decay assumption for $Y$ such as uniform boundedness and exponential decays. A recent study (Audibert and Catoni, 2011) was made under the assumption $\mathbb{E}[|Y|^4]<\infty$. Our assumption (2) is weaker since $q$ may be arbitrarily close to 2. Note that (2) obviously holds when $|Y|\leq M$ almost surely for some constant $M>0$, in which case we shall denote $q^*=2$.

Our consistency result, to be proved in Section 5, asserts that when $h$ and $m$ are large enough, the error $\mathrm{var}[f_{\mathbf{z}}(X)-f_\rho(X)]$ of MEE algorithm (1) can be arbitrarily close to the approximation error (Smale and Zhou, 2003) of the hypothesis space $\mathcal{H}$ with respect to the regression function $f_\rho$.

Definition 2 The approximation error of the pair $(\mathcal{H},\rho)$ is defined by

$$\mathcal{D}_{\mathcal{H}}(f_\rho) = \inf_{f\in\mathcal{H}} \mathrm{var}[f(X)-f_\rho(X)].$$

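Theorem 3 Under assumptions (2) and (3), for any $0<\varepsilon\leq 1$ and $0<\delta<1$, there exist $h_{\varepsilon,\delta}\geq 1$ and $m_{\varepsilon,\delta}(h)\geq 1$ both depending on $\mathcal{H}, G, \rho, \varepsilon, \delta$ such that for $h\geq h_{\varepsilon,\delta}$ and $m\geq m_{\varepsilon,\delta}(h)$, with confidence $1-\delta$, we have

$$\mathrm{var}[f_{\mathbf{z}}(X)-f_\rho(X)] \leq \mathcal{D}_{\mathcal{H}}(f_\rho) + \varepsilon. \tag{4}$$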

Our convergence rates will be stated in terms of the approximation error and the capacity of the hypothesis space $\mathcal{H}$ measured by covering numbers in this paper.

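Definition 4 For $\varepsilon>0$, the covering number $\mathcal{N}(\mathcal{H},\varepsilon)$ is defined to be the smallest integer $l\in\mathbb{N}$ such that there exist $l$ disks in $C(\mathcal{X})$ with radius $\varepsilon$ and centers in $\mathcal{H}$ covering the set $\mathcal{H}$. We shall assume that for some constants $p>0$ and $A_p>0$, there holds

$$\log\mathcal{N}(\mathcal{H},\varepsilon) \leq A_p\,\varepsilon^{-p}, \qquad \forall\,\varepsilon>0. \tag{5}$$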

The behavior (5) of the covering numbers is typical in learning theory. It is satisfied by balls of Sobolev spaces on $\mathcal{X}\subset\mathbb{R}^n$ and reproducing kernel Hilbert spaces associated with Sobolev smooth kernels. See Anthony and Bartlett (1999), Zhou (2002), Zhou (2003) and Yao (2010). We remark that empirical covering numbers might be used together with concentration inequalities to provide sharper error estimates. This is however beyond our scope, and for simplicity we adopt the covering number in $C(\mathcal{X})$ throughout this paper.

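Theorem 5 Assume (2), (3) and covering number condition (5) for some $p>0$. Then for any $0<\eta\leq 1$ and $0<\delta<1$, with confidence $1-\delta$ we have

$$\mathrm{var}[f_{\mathbf{z}}(X)-f_\rho(X)] \leq \widetilde{C}_{\mathcal{H}}\,\eta^{(2-q)/2}\left(h^{-\min\{q-2,2\}} + h\,m^{-\frac{1}{1+p}}\right)\log\frac{2}{\delta} + (1+\eta)\,\mathcal{D}_{\mathcal{H}}(f_\rho). \tag{6}$$

If $|Y|\leq M$ almost surely for some $M>0$, then with confidence $1-\delta$ we have

$$\mathrm{var}[f_{\mathbf{z}}(X)-f_\rho(X)] \leq \frac{\widetilde{C}_{\mathcal{H}}}{\eta}\left(h^{-2} + m^{-\frac{1}{1+p}}\right)\log\frac{2}{\delta} + (1+\eta)\,\mathcal{D}_{\mathcal{H}}(f_\rho). \tag{7}$$

Here $\widetilde{C}_{\mathcal{H}}$ is a constant independent of $m$, $\delta$, $\eta$ or $h$ (depending on $\mathcal{H}, G, \rho$, given explicitly in the proof).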

Remark 6 In Theorem 5, we use a parameter $\eta>0$ in error bounds (6) and (7) to show that the bounds consist of two terms, one of which is essentially the approximation error $\mathcal{D}_{\mathcal{H}}(f_\rho)$ since $\eta$ can be arbitrarily small. The reader can simply set $\eta=1$ to get the main ideas of our analysis.

If moment condition (2) with $q\geq 4$ is satisfied and $\eta=1$, then by taking $h = m^{\frac{1}{3(1+p)}}$, (6) becomes

$$\mathrm{var}[f_{\mathbf{z}}(X)-f_\rho(X)] \leq 2\widetilde{C}_{\mathcal{H}}\left(\frac{1}{m}\right)^{\frac{2}{3(1+p)}}\log\frac{2}{\delta} + 2\,\mathcal{D}_{\mathcal{H}}(f_\rho). \tag{8}$$

If $|Y|\leq M$ almost surely, then by taking $h = m^{\frac{1}{2(1+p)}}$ and $\eta=1$, error bound (7) becomes

$$\mathrm{var}[f_{\mathbf{z}}(X)-f_\rho(X)] \leq 2\widetilde{C}_{\mathcal{H}}\,m^{-\frac{1}{1+p}}\log\frac{2}{\delta} + 2\,\mathcal{D}_{\mathcal{H}}(f_\rho). \tag{9}$$

Remark 7 When the index $p$ in covering number condition (5) is small enough (the case when $\mathcal{H}$ is a finite ball of a reproducing kernel Hilbert space with a smooth kernel), we see that the power indices for the sample error terms of convergence rates (8) and (9) can be arbitrarily close to $2/3$ and $1$, respectively. There is a gap in the rates between the case of (2) with large $q$ and the uniform bounded case. This gap is caused by the Parzen windowing process for which our method does not lead to better estimates when $q>4$. It would be interesting to know whether the gap can be narrowed.
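The arithmetic behind the choices of h in (8) and (9) can be checked with exact fractions. The helper `sample_error_exponents` below is a hypothetical name introduced only for this sketch; it returns the exponent a in h = m^a and the resulting power of 1/m in the sample error term, for the moment case q >= 4 (so that min{q-2, 2} = 2) and for the uniformly bounded case.

```python
from fractions import Fraction

def sample_error_exponents(p):
    """Exponents appearing in rates (8) and (9) for covering index p > 0 (or p = 0 as a limit).

    Pass p as an int, a Fraction, or a string such as "0.5" to keep the arithmetic exact.
    """
    p = Fraction(p)
    return {
        # moment case with q >= 4: h = m**(1/(3(1+p))), rate m**(-2/(3(1+p)))
        "h_exponent_moment": 1 / (3 * (1 + p)),
        "rate_moment": 2 / (3 * (1 + p)),
        # bounded case: h = m**(1/(2(1+p))), rate m**(-1/(1+p))
        "h_exponent_bounded": 1 / (2 * (1 + p)),
        "rate_bounded": 1 / (1 + p),
    }
```

In the moment case the choice of h balances the two terms h^{-2} and h m^{-1/(1+p)} of (6): both become m^{-2/(3(1+p))}. As p tends to 0 the two rates tend to 2/3 and 1, which is the gap discussed in Remark 7.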

Note the result in Theorem 5 does not guarantee that $f_{\mathbf{z}}$ itself approximates $f_\rho$ well when the bounds are small. Instead a constant adjustment is required. Theoretically the best constant is $\mathbb{E}[f_{\mathbf{z}}(X)-f_\rho(X)]$. In practice it is usually approximated by the sample mean $\frac{1}{m}\sum_{i=1}^m (f_{\mathbf{z}}(x_i)-y_i)$ in the case of uniformly bounded noise and the approximation can be easily handled. To deal with heavy tailed noise, we project the output values onto the closed interval $[-\sqrt{m},\sqrt{m}]$ by the projection $\pi_{\sqrt{m}}:\mathbb{R}\to\mathbb{R}$ defined by

$$\pi_{\sqrt{m}}(y) = \begin{cases} y, & \text{if } y\in[-\sqrt{m},\sqrt{m}], \\ \sqrt{m}, & \text{if } y>\sqrt{m}, \\ -\sqrt{m}, & \text{if } y<-\sqrt{m}, \end{cases}$$

and then approximate $\mathbb{E}[f_{\mathbf{z}}(X)-f_\rho(X)]$ by the computable quantity

$$\frac{1}{m}\sum_{i=1}^m \big(f_{\mathbf{z}}(x_i)-\pi_{\sqrt{m}}(y_i)\big). \tag{10}$$

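Theorem 8 Assume $\mathbb{E}[|Y|^2]<\infty$ and covering number condition (5) for some $p>0$. Then for any $0<\delta<1$, with confidence $1-\delta$ we have

$$\sup_{f\in\mathcal{H}}\left|\frac{1}{m}\sum_{i=1}^m \big(f(x_i)-\pi_{\sqrt{m}}(y_i)\big) - \mathbb{E}[f(X)-f_\rho(X)]\right| \leq \widetilde{C}'_{\mathcal{H}}\,m^{-\frac{1}{2+p}}\log\frac{2}{\delta} \tag{11}$$

which implies in particular that

$$\left|\frac{1}{m}\sum_{i=1}^m \big(f_{\mathbf{z}}(x_i)-\pi_{\sqrt{m}}(y_i)\big) - \mathbb{E}[f_{\mathbf{z}}(X)-f_\rho(X)]\right| \leq \widetilde{C}'_{\mathcal{H}}\,m^{-\frac{1}{2+p}}\log\frac{2}{\delta}, \tag{12}$$

where $\widetilde{C}'_{\mathcal{H}}$ is the constant given by

$$\widetilde{C}'_{\mathcal{H}} = 7\sup_{f\in\mathcal{H}}\|f\|_\infty + 4 + 7\sqrt{\mathbb{E}[|Y|^2]} + \mathbb{E}[|Y|^2] + A_p^{\frac{1}{2+p}}.$$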

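Replacing the mean $\mathbb{E}[f_{\mathbf{z}}(X)-f_\rho(X)]$ by the quantity (10), we define an estimator of $f_\rho$ as

$$\widetilde{f}_{\mathbf{z}} = f_{\mathbf{z}} - \frac{1}{m}\sum_{i=1}^m \big(f_{\mathbf{z}}(x_i)-\pi_{\sqrt{m}}(y_i)\big).$$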
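The constant adjustment can be sketched in a few lines. The function names `project` and `constant_adjusted` are hypothetical and introduced only for illustration; `predict` stands for an already trained MEE predictor, which is assumed to accept and return NumPy arrays.

```python
import numpy as np

def project(y, m):
    """Projection onto [-sqrt(m), sqrt(m)]: values outside the interval are clipped."""
    r = np.sqrt(m)
    return np.clip(y, -r, r)

def constant_adjusted(predict, x, y):
    """Return t -> predict(t) - (1/m) * sum_i (predict(x_i) - project(y_i)).

    The subtracted offset is the computable quantity (10) evaluated on the sample.
    """
    y = np.asarray(y, dtype=float)
    m = y.shape[0]
    offset = np.mean(np.asarray(predict(x), dtype=float) - project(y, m))
    return lambda t: np.asarray(predict(t), dtype=float) - offset
```

Clipping at the square root of the sample size limits the influence of a few very large responses on the offset, which is the reason given in the text for using the projection under heavy tailed noise.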

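Putting (12) and the bounds from Theorem 5 into the obvious error expression

$$\big\|\widetilde{f}_{\mathbf{z}}-f_\rho\big\|_{L^2_{\rho_X}} \leq \left|\frac{1}{m}\sum_{i=1}^m \big(f_{\mathbf{z}}(x_i)-\pi_{\sqrt{m}}(y_i)\big) - \mathbb{E}[f_{\mathbf{z}}(X)-f_\rho(X)]\right| + \sqrt{\mathrm{var}[f_{\mathbf{z}}(X)-f_\rho(X)]}, \tag{13}$$

we see that $\widetilde{f}_{\mathbf{z}}$ is a good estimator of $f_\rho$: the power index $\frac{1}{2+p}$ in (12) is greater than $\frac{1}{2(1+p)}$, the power index appearing in the last term of (13) when the variance term is bounded by (9), even in the uniformly bounded case.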

To interpret our main results better we present a corollary and an example below.

If there is a constant $c_\rho$ such that $f_\rho + c_\rho\in\mathcal{H}$, we have $\mathcal{D}_{\mathcal{H}}(f_\rho)=0$. In this case, the choice $\eta=1$ in Theorem 5 yields the following learning rate. Note that (2) implies $\mathbb{E}[|Y|^2]<\infty$.

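Corollary 9 Assume (5) with some $p>0$ and $f_\rho + c_\rho\in\mathcal{H}$ for some constant $c_\rho\in\mathbb{R}$. Under conditions (2) and (3), by taking $h = m^{\frac{1}{(1+p)\min\{q-1,3\}}}$, we have with confidence $1-\delta$,

$$\big\|\widetilde{f}_{\mathbf{z}}-f_\rho\big\|_{L^2_{\rho_X}} \leq \left(\widetilde{C}'_{\mathcal{H}} + \sqrt{2\widetilde{C}_{\mathcal{H}}}\right) m^{-\frac{\min\{q-2,2\}}{2(1+p)\min\{q-1,3\}}}\log\frac{2}{\delta}.$$

If $|Y|\leq M$ almost surely, then by taking $h = m^{\frac{1}{2(1+p)}}$, we have with confidence $1-\delta$,

$$\big\|\widetilde{f}_{\mathbf{z}}-f_\rho\big\|_{L^2_{\rho_X}} \leq \left(\widetilde{C}'_{\mathcal{H}} + \sqrt{2\widetilde{C}_{\mathcal{H}}}\right) m^{-\frac{1}{2(1+p)}}\log\frac{2}{\delta}.$$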

This corollary states that $\widetilde{f}_{\mathbf{z}}$ can approximate the regression function very well. Note, however, that this happens when the hypothesis space is chosen appropriately and the parameter $h$ tends to infinity.

A special example of the hypothesis space is a ball of a Sobolev space $H^s(\mathcal{X})$ with index $s>\frac{n}{2}$ on a domain $\mathcal{X}\subset\mathbb{R}^n$ which satisfies (5) with $p=\frac{n}{s}$. When $s$ is large enough, the positive index $\frac{n}{s}$ can be arbitrarily small. Then the power exponent of the following convergence rate can be arbitrarily close to $\frac{1}{3}$ when $\mathbb{E}[|Y|^4]<\infty$, and $\frac{1}{2}$ when $|Y|\leq M$ almost surely.

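Example 1 Let $\mathcal{X}$ be a bounded domain of $\mathbb{R}^n$ with Lipschitz boundary. Assume $f_\rho\in H^s(\mathcal{X})$ for some $s>\frac{n}{2}$ and take $\mathcal{H} = \{f\in H^s(\mathcal{X}): \|f\|_{H^s(\mathcal{X})}\leq R\}$ with $R\geq\|f_\rho\|_{H^s(\mathcal{X})}$ and $R\geq 1$. If $\mathbb{E}[|Y|^4]<\infty$, then by taking $h = m^{\frac{1}{3(1+n/s)}}$, we have with confidence $1-\delta$,

$$\big\|\widetilde{f}_{\mathbf{z}}-f_\rho\big\|_{L^2_{\rho_X}} \leq C_{s,n,\rho}\,R^{\frac{n}{2(s+n)}}\,m^{-\frac{1}{3(1+n/s)}}\log\frac{2}{\delta}.$$

If $|Y|\leq M$ almost surely, then by taking $h = m^{\frac{1}{2(1+n/s)}}$, with confidence $1-\delta$,

$$\big\|\widetilde{f}_{\mathbf{z}}-f_\rho\big\|_{L^2_{\rho_X}} \leq C_{s,n,\rho}\,R^{\frac{n}{2(s+n)}}\,m^{-\frac{1}{2+2n/s}}\log\frac{2}{\delta}.$$

Here the constant $C_{s,n,\rho}$ is independent of $R$.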

Compared to the analysis of least squares methods, our consistency results for the MEE algorithm require a weaker condition by allowing heavy tailed noise, while the convergence rates are comparable but slightly worse than the optimal one $O(m^{-\frac{1}{2+n/s}})$. Further investigation of error analysis for the MEE algorithm is required to achieve the optimal rate, which is beyond the scope of this paper.

3. Technical Difficulties in MEE and Novelties

The MEE algorithm (1) involving sample pairs like quadratic forms is different from most classical ERM learning algorithms (Vapnik, 1998; Anthony and Bartlett, 1999) constructed by sums of independent random variables. But as done for some ranking algorithms (Agarwal and Niyogi, 2009; Clemencon et al., 2005), one can still follow the same line to define a functional called generalization error or information error (related to information potential defined on page 88 of Principe, 2010) associated with the windowing function $G$ over the space of measurable functions on $\mathcal{X}$ as

$$\mathcal{E}^{(h)}(f) = \int_{\mathcal{Z}}\int_{\mathcal{Z}} -h^2\,G\left(\frac{[(y-f(x))-(y'-f(x'))]^2}{2h^2}\right) d\rho(x,y)\,d\rho(x',y').$$

An essential barrier for our consistency analysis is an observation made by numerical simulations (Erdogmus and Principe, 2003; Silva et al., 2010) and verified mathematically for Shannon's entropy in Chen and Principe (2012) that the regression function $f_\rho$ may not be a minimizer of $\mathcal{E}^{(h)}$. It is totally different from the classical least squares generalization error $\mathcal{E}^{ls}(f) = \int_{\mathcal{Z}} (f(x)-y)^2\,d\rho$ which satisfies a nice identity $\mathcal{E}^{ls}(f) - \mathcal{E}^{ls}(f_\rho) = \|f-f_\rho\|^2_{L^2_{\rho_X}} \geq 0$. This barrier leads to three technical difficulties in our error analysis which will be overcome by our novel approaches making full use of the special feature that the MEE scaling parameter $h$ is large in this paper.

3.1 Approximation of Information Error

The first technical difficulty we meet in our mathematical analysis for MEE algorithm (1) is the varying form depending on the windowing function $G$. Our novel approach here is an approximation of the information error in terms of the variance $\mathrm{var}[f(X)-f_\rho(X)]$ when $h$ is large. This is achieved by showing that $\mathcal{E}^{(h)}$ is closely related to the following symmetrized least squares error, which is related to some ranking algorithms.

Definition 10 The symmetrized least squares error is defined on the space $L^2_{\rho_X}$ by

$$\mathcal{E}^{sls}(f) = \int_{\mathcal{Z}}\int_{\mathcal{Z}} \big[(y-f(x))-(y'-f(x'))\big]^2\,d\rho(x,y)\,d\rho(x',y'), \qquad f\in L^2_{\rho_X}.$$

To give the approximation of $\mathcal{E}^{(h)}$, we need a simpler form of $\mathcal{E}^{sls}$.

Lemma 11 If $\mathbb{E}[Y^2]<\infty$, then by denoting $C_\rho = \int_{\mathcal{Z}} [y-f_\rho(x)]^2\,d\rho$, we have

$$\mathcal{E}^{sls}(f) = 2\,\mathrm{var}[f(X)-f_\rho(X)] + 2C_\rho, \qquad \forall f\in L^2_{\rho_X}.$$

Proof Recall that for two independent and identically distributed samples $\xi$ and $\xi'$ of a random variable, one has the identity

$$\mathbb{E}[(\xi-\xi')^2] = 2\,\mathbb{E}[(\xi-\mathbb{E}\xi)^2] = 2\,\mathrm{var}(\xi).$$

Then we have

$$\mathcal{E}^{sls}(f) = \mathbb{E}\Big[\big((y-f(x))-(y'-f(x'))\big)^2\Big] = 2\,\mathrm{var}[Y-f(X)].$$

By the definition $\mathbb{E}[Y|X] = f_\rho(X)$, it is easy to see that $C_\rho = \mathrm{var}(Y-f_\rho(X))$ and the covariance between $Y-f_\rho(X)$ and $f_\rho(X)-f(X)$ vanishes. So $\mathrm{var}[Y-f(X)] = \mathrm{var}[Y-f_\rho(X)] + \mathrm{var}[f(X)-f_\rho(X)]$. This proves the desired identity.
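The identity used at the start of the proof can be checked exactly on a finite sample, by taking the two independent copies to be draws from the empirical distribution of the sample. The sketch below uses hypothetical residual values and the hypothetical helper name `symmetrized_second_moment`.

```python
import numpy as np

def symmetrized_second_moment(values):
    """(1/m^2) * sum_{i,j} (v_i - v_j)^2.

    This is E[(xi - xi')^2] when xi and xi' are independent draws
    from the empirical distribution of `values`.
    """
    v = np.asarray(values, dtype=float)
    return np.mean((v[:, None] - v[None, :]) ** 2)

residuals = np.array([1.0, -2.0, 0.5, 4.0])   # hypothetical values of y_i - f(x_i)
lhs = symmetrized_second_moment(residuals)
rhs = 2.0 * np.var(residuals)                  # np.var defaults to the 1/m normalization
```

Here `lhs` and `rhs` agree up to floating point rounding, mirroring the step from the symmetrized least squares error to twice a variance.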

We are in a position to present the approximation of $\mathcal{E}^{(h)}$ for which a large scaling parameter $h$ plays an important role. Since $\mathcal{H}$ is a compact subset of $C(\mathcal{X})$, we know that the number $\sup_{f\in\mathcal{H}}\|f\|_\infty$ is finite.

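Lemma 12 Under assumptions (2) and (3), for any essentially bounded measurable function $f$ on $\mathcal{X}$, we have

$$\Big|\mathcal{E}^{(h)}(f) + h^2G(0) - C_\rho - \mathrm{var}[f(X)-f_\rho(X)]\Big| \leq 5\cdot 2^7\,C_G\left((\mathbb{E}[|Y|^q])^{\frac{q^*+2}{q}} + \|f\|_\infty^{q^*+2}\right)h^{-q^*}.$$

In particular,

$$\Big|\mathcal{E}^{(h)}(f) + h^2G(0) - C_\rho - \mathrm{var}[f(X)-f_\rho(X)]\Big| \leq C'_{\mathcal{H}}\,h^{-q^*}, \qquad \forall f\in\mathcal{H},$$

where $C'_{\mathcal{H}}$ is the constant depending on $\rho$, $G$, $q$ and $\mathcal{H}$ given by

$$C'_{\mathcal{H}} = 5\cdot 2^7\,C_G\left((\mathbb{E}[|Y|^q])^{(q^*+2)/q} + \Big(\sup_{f\in\mathcal{H}}\|f\|_\infty\Big)^{q^*+2}\right).$$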

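Proof Observe that $q^*+2 = \min\{q,4\}\in(2,4]$. By the Taylor expansion and the mean value theorem, we have

$$|G(t)-G(0)-G'_+(0)t| \leq \begin{cases} \dfrac{\|G''\|_\infty}{2}\,t^2 \leq \dfrac{\|G''\|_\infty}{2}\,t^{(q^*+2)/2}, & \text{if } 0\leq t\leq 1, \\[2mm] 2\|G'\|_\infty\,t \leq 2\|G'\|_\infty\,t^{(q^*+2)/2}, & \text{if } t>1. \end{cases}$$

So $|G(t)-G(0)-G'_+(0)t| \leq \left(\frac{\|G''\|_\infty}{2} + 2\|G'\|_\infty\right)t^{(q^*+2)/2}$ for all $t\geq 0$, and by setting $t = \frac{[(y-f(x))-(y'-f(x'))]^2}{2h^2}$, we know that

$$\left|\mathcal{E}^{(h)}(f) + h^2G(0) + \int_{\mathcal{Z}}\int_{\mathcal{Z}} G'_+(0)\,\frac{[(y-f(x))-(y'-f(x'))]^2}{2}\,d\rho(x,y)\,d\rho(x',y')\right|$$
$$\leq \left(\frac{\|G''\|_\infty}{2} + 2\|G'\|_\infty\right)h^{-q^*}\,2^{-\frac{q^*+2}{2}}\int_{\mathcal{Z}}\int_{\mathcal{Z}} \big|(y-f(x))-(y'-f(x'))\big|^{q^*+2}\,d\rho(x,y)\,d\rho(x',y')$$
$$\leq \left(\frac{\|G''\|_\infty}{2} + 2\|G'\|_\infty\right)h^{-q^*}\,2^8\left\{\int_{\mathcal{Z}}|y|^{q^*+2}\,d\rho + \|f\|_\infty^{q^*+2}\right\}.$$

This together with Lemma 11, the normalization assumption $G'_+(0) = -1$ and Hölder's inequality applied when $q>4$ proves the desired bound and hence our conclusion.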

Applying Lemma 12 to a function $f\in\mathcal{H}$ and $f_\rho\in L^\infty_{\rho_X}$ yields the following fact on the excess generalization error $\mathcal{E}^{(h)}(f) - \mathcal{E}^{(h)}(f_\rho)$.

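Theorem 13 Under assumptions (2) and (3), we have

$$\Big|\mathcal{E}^{(h)}(f) - \mathcal{E}^{(h)}(f_\rho) - \mathrm{var}[f(X)-f_\rho(X)]\Big| \leq C''_{\mathcal{H}}\,h^{-q^*}, \qquad \forall f\in\mathcal{H},$$

where $C''_{\mathcal{H}}$ is the constant depending on $\rho$, $G$, $q$ and $\mathcal{H}$ given by

$$C''_{\mathcal{H}} = 5\cdot 2^8\,C_G\left((\mathbb{E}[|Y|^q])^{(q^*+2)/q} + \Big(\sup_{f\in\mathcal{H}}\|f\|_\infty\Big)^{q^*+2} + \|f_\rho\|_\infty^{q^*+2}\right).$$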

3.2 Functional Minimizer and Best Approximation

As $f_\rho$ may not be a minimizer of $\mathcal{E}^{(h)}$, the second technical difficulty in our error analysis is the diversity of two ways to define a target function in $\mathcal{H}$, one to minimize the information error and the other to minimize the variance $\mathrm{var}[f(X)-f_\rho(X)]$. These possible candidates for the target function are defined as

$$f_{\mathcal{H}} := \arg\min_{f\in\mathcal{H}} \mathcal{E}^{(h)}(f),$$
$$f_{approx} := \arg\min_{f\in\mathcal{H}} \mathrm{var}[f(X)-f_\rho(X)].$$

Our novelty to overcome the technical difficulty is to show that when the MEE scaling parameter $h$

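Theorem 14 Under assumptions (2) and (3), we have

$$\mathcal{E}^{(h)}(f_{approx}) \leq \mathcal{E}^{(h)}(f_{\mathcal{H}}) + 2C''_{\mathcal{H}}\,h^{-q^*}$$

and

$$\mathrm{var}[f_{\mathcal{H}}(X)-f_\rho(X)] \leq \mathrm{var}[f_{approx}(X)-f_\rho(X)] + 2C''_{\mathcal{H}}\,h^{-q^*}.$$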

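Proof By Theorem 13 and the definitions of $f_{\mathcal{H}}$ and $f_{approx}$, we have

$$\begin{aligned}
\mathcal{E}^{(h)}(f_{\mathcal{H}}) - \mathcal{E}^{(h)}(f_\rho) &\leq \mathcal{E}^{(h)}(f_{approx}) - \mathcal{E}^{(h)}(f_\rho) \leq \mathrm{var}[f_{approx}(X)-f_\rho(X)] + C''_{\mathcal{H}}\,h^{-q^*} \\
&\leq \mathrm{var}[f_{\mathcal{H}}(X)-f_\rho(X)] + C''_{\mathcal{H}}\,h^{-q^*} \leq \mathcal{E}^{(h)}(f_{\mathcal{H}}) - \mathcal{E}^{(h)}(f_\rho) + 2C''_{\mathcal{H}}\,h^{-q^*} \\
&\leq \mathrm{var}[f_{approx}(X)-f_\rho(X)] + 3C''_{\mathcal{H}}\,h^{-q^*}.
\end{aligned}$$

Then the desired inequalities follow.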

Moreover, Theorem 13 yields the following error decomposition for our algorithm.

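Lemma 15 Under assumptions (2) and (3), we have

$$\mathrm{var}[f_{\mathbf{z}}(X)-f_\rho(X)] \leq \Big\{\mathcal{E}^{(h)}(f_{\mathbf{z}}) - \mathcal{E}^{(h)}(f_{\mathcal{H}})\Big\} + \mathrm{var}[f_{approx}(X)-f_\rho(X)] + 2C''_{\mathcal{H}}\,h^{-q^*}. \tag{14}$$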

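Proof By Theorem 13,

$$\begin{aligned}
\mathrm{var}[f_{\mathbf{z}}(X)-f_\rho(X)] &\leq \mathcal{E}^{(h)}(f_{\mathbf{z}}) - \mathcal{E}^{(h)}(f_\rho) + C''_{\mathcal{H}}\,h^{-q^*} \\
&\leq \Big\{\mathcal{E}^{(h)}(f_{\mathbf{z}}) - \mathcal{E}^{(h)}(f_{\mathcal{H}})\Big\} + \mathcal{E}^{(h)}(f_{\mathcal{H}}) - \mathcal{E}^{(h)}(f_\rho) + C''_{\mathcal{H}}\,h^{-q^*}.
\end{aligned}$$

Since $f_{approx}\in\mathcal{H}$, the definition of $f_{\mathcal{H}}$ tells us that

$$\mathcal{E}^{(h)}(f_{\mathcal{H}}) - \mathcal{E}^{(h)}(f_\rho) \leq \mathcal{E}^{(h)}(f_{approx}) - \mathcal{E}^{(h)}(f_\rho).$$

Applying Theorem 13 to the above bound implies

$$\mathrm{var}[f_{\mathbf{z}}(X)-f_\rho(X)] \leq \Big\{\mathcal{E}^{(h)}(f_{\mathbf{z}}) - \mathcal{E}^{(h)}(f_{\mathcal{H}})\Big\} + \mathrm{var}[f_{approx}(X)-f_\rho(X)] + 2C''_{\mathcal{H}}\,h^{-q^*}.$$

Then the desired error decomposition (14) follows.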

Error decomposition has been a standard technique to analyze least squares ERM regression algorithms (Anthony and Bartlett, 1999; Cucker and Zhou, 2007; Smale and Zhou, 2009; Ying, 2007). In error decomposition (14) for MEE learning algorithm (1), the first term on the right side is the sample error, the second term $\mathrm{var}[f_{approx}(X)-f_\rho(X)]$ is the approximation error, while the last extra term $2C''_{\mathcal{H}}h^{-q^*}$ is caused by the Parzen windowing and is small when $h$ is large. The quantity

E

(h)₍_f


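3.3 Error Decomposition by U-statistics and Special Properties

We shall decompose the sample error term $\mathcal{E}^{(h)}(f_{\mathbf{z}}) - \mathcal{E}^{(h)}(f_{\mathcal{H}})$ further by means of U-statistics defined for $f\in\mathcal{H}$ and the sample $\mathbf{z}$ as

$$V_f(\mathbf{z}) = \frac{1}{m(m-1)}\sum_{i=1}^m\sum_{j\neq i} U_f(z_i,z_j),$$

where $U_f$ is a kernel given with $z=(x,y)$, $z'=(x',y')\in\mathcal{Z}$ by

$$U_f(z,z') = -h^2G\left(\frac{[(y-f(x))-(y'-f(x'))]^2}{2h^2}\right) + h^2G\left(\frac{[(y-f_\rho(x))-(y'-f_\rho(x'))]^2}{2h^2}\right). \tag{15}$$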
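To make the U-statistic concrete, the sketch below evaluates it from arrays of predictions. The function name `u_statistic` and its arguments are hypothetical; the regression function is unknown in practice, so passing its values at the sample points is only possible in a simulation, and the Gaussian window is an assumed default.

```python
import numpy as np

def u_statistic(f_vals, f_rho_vals, y, h, window=lambda t: np.exp(-t)):
    """V_f(z) = (1 / (m (m - 1))) * sum over i != j of U_f(z_i, z_j), with U_f as in (15).

    f_vals[i] = f(x_i), f_rho_vals[i] = f_rho(x_i) (known only in simulations).
    """
    y = np.asarray(y, dtype=float)
    e_f = y - np.asarray(f_vals, dtype=float)          # errors of the candidate f
    e_rho = y - np.asarray(f_rho_vals, dtype=float)    # errors of the regression function
    m = y.shape[0]

    def scaled_window(e):
        d = e[:, None] - e[None, :]                    # pairwise error differences
        return h ** 2 * window(d ** 2 / (2.0 * h ** 2))

    kernel = -scaled_window(e_f) + scaled_window(e_rho)   # U_f(z_i, z_j) for all pairs
    np.fill_diagonal(kernel, 0.0)                          # drop i == j (already zero)
    return kernel.sum() / (m * (m - 1))
```

The diagonal entries vanish because both window terms equal h^2 G(0) there, so excluding them only changes the normalization from m^2 to m(m-1).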

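It is easy to see that $\mathbb{E}[V_f] = \mathcal{E}^{(h)}(f) - \mathcal{E}^{(h)}(f_\rho)$ and $U_f(z,z) = 0$. Then

$$\mathcal{E}^{(h)}(f_{\mathbf{z}}) - \mathcal{E}^{(h)}(f_{\mathcal{H}}) = \mathbb{E}[V_{f_{\mathbf{z}}}] - \mathbb{E}[V_{f_{\mathcal{H}}}] = \mathbb{E}[V_{f_{\mathbf{z}}}] - V_{f_{\mathbf{z}}} + V_{f_{\mathbf{z}}} - V_{f_{\mathcal{H}}} + V_{f_{\mathcal{H}}} - \mathbb{E}[V_{f_{\mathcal{H}}}].$$

By the definition of $f_{\mathbf{z}}$, we have $V_{f_{\mathbf{z}}} - V_{f_{\mathcal{H}}} \leq 0$. Hence

$$\mathcal{E}^{(h)}(f_{\mathbf{z}}) - \mathcal{E}^{(h)}(f_{\mathcal{H}}) \leq \mathbb{E}[V_{f_{\mathbf{z}}}] - V_{f_{\mathbf{z}}} + V_{f_{\mathcal{H}}} - \mathbb{E}[V_{f_{\mathcal{H}}}]. \tag{16}$$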

The above bound will be estimated by a uniform ratio probability inequality. A technical difficulty we meet here is the possibility that $\mathbb{E}[V_f] = \mathcal{E}^{(h)}(f) - \mathcal{E}^{(h)}(f_\rho)$ might be negative since $f_\rho$ may not be a minimizer of $\mathcal{E}^{(h)}$. It is overcome by the following novel observation which is an immediate consequence of Theorem 13.

Lemma 16 Under assumptions (2) and (3), if $\varepsilon\geq C''_{\mathcal{H}}h^{-q^*}$, then

$$\mathbb{E}[V_f] + 2\varepsilon \geq \mathbb{E}[V_f] + C''_{\mathcal{H}}h^{-q^*} + \varepsilon \geq \mathrm{var}[f(X)-f_\rho(X)] + \varepsilon \geq \varepsilon, \qquad \forall f\in\mathcal{H}. \tag{17}$$

4. Sample Error Estimates

In this section, we follow (16) and estimate the sample error by a uniform ratio probability inequality based on the following Hoeffding’s probability inequality for U-statistics (Hoeffding, 1963).

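Lemma 17 If $U$ is a symmetric real-valued function on $\mathcal{Z}\times\mathcal{Z}$ satisfying $a\leq U(z,z')\leq b$ almost surely and $\mathrm{var}[U] = \sigma^2$, then for any $\varepsilon>0$,

$$\mathrm{Prob}\left\{\left|\frac{1}{m(m-1)}\sum_{i=1}^m\sum_{j\neq i} U(z_i,z_j) - \mathbb{E}[U]\right| \geq \varepsilon\right\} \leq 2\exp\left\{-\frac{(m-1)\varepsilon^2}{4\sigma^2 + (4/3)(b-a)\varepsilon}\right\}.$$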
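The right-hand side of Lemma 17 is easy to evaluate numerically, which helps in seeing how the variance and the range of the kernel trade off against the sample size. The function name `u_statistic_tail_bound` is hypothetical; capping the value at 1 is an addition of this sketch, valid because the quantity bounds a probability.

```python
import math

def u_statistic_tail_bound(m, eps, sigma2, a, b):
    """Evaluate 2 * exp(-(m - 1) * eps^2 / (4 * sigma2 + (4/3) * (b - a) * eps)).

    m: sample size (at least 2); eps: deviation level (> 0);
    sigma2: variance of the kernel U; [a, b]: almost sure range of U.
    """
    if m < 2 or eps <= 0 or b < a or sigma2 < 0:
        raise ValueError("need m >= 2, eps > 0, b >= a and sigma2 >= 0")
    denominator = 4.0 * sigma2 + (4.0 / 3.0) * (b - a) * eps
    exponent = -(m - 1) * eps ** 2 / denominator
    return min(1.0, 2.0 * math.exp(exponent))   # cap at 1: it bounds a probability
```

For small deviation levels the variance term dominates the denominator, which is why the proof that follows works to bound the variance of the kernel in terms of var[f(X) - f_rho(X)].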

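To apply Lemma 17 we need to bound $\sigma^2$ and $b-a$ for the kernel $U_f$ defined by (15). Our novelty for getting sharp bounds is to use a Taylor expansion involving a $C^2$ function $\widetilde{G}$ on $\mathbb{R}$:

$$\widetilde{G}(w) = \widetilde{G}(0) + \widetilde{G}'(0)w + \int_0^w (w-t)\widetilde{G}''(t)\,dt, \qquad \forall w\in\mathbb{R}. \tag{18}$$

Denote a constant $A_{\mathcal{H}}$ depending on $\rho$, $G$, $q$ and $\mathcal{H}$ as

$$A_{\mathcal{H}} = 9\cdot 2^8\,C_G^2\,\sup_{f\in\mathcal{H}}\|f-f_\rho\|_\infty^{\frac{4}{q}}\left((\mathbb{E}[|Y|^q])^{\frac{2}{q}} + \|f_\rho\|_\infty^2 + \sup_{f\in\mathcal{H}}\|f-f_\rho\|_\infty^2\right).$$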

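Lemma 18 Assume (2) and (3).

(a) For any $f,g\in\mathcal{H}$, we have

$$|U_f| \leq 4C_G\|f-f_\rho\|_\infty h \quad\text{and}\quad |U_f-U_g| \leq 4C_G\|f-g\|_\infty h$$

and

$$\mathrm{var}[U_f] \leq A_{\mathcal{H}}\big(\mathrm{var}[f(X)-f_\rho(X)]\big)^{(q-2)/q}.$$

(b) If $|Y|\leq M$ almost surely for some constant $M>0$, then we have almost surely

$$|U_f| \leq A'_{\mathcal{H}}\big|(f(x)-f_\rho(x))-(f(x')-f_\rho(x'))\big|, \qquad \forall f\in\mathcal{H} \tag{19}$$

and

$$|U_f-U_g| \leq A'_{\mathcal{H}}\big|(f(x)-g(x))-(f(x')-g(x'))\big|, \qquad \forall f,g\in\mathcal{H}, \tag{20}$$

where $A'_{\mathcal{H}}$ is a constant depending on $\rho$, $G$ and $\mathcal{H}$ given by

$$A'_{\mathcal{H}} = 36\,C_G\left(M + \sup_{f\in\mathcal{H}}\|f\|_\infty\right).$$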

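Proof Define a function $\widetilde{G}$ on $\mathbb{R}$ by

$$\widetilde{G}(t) = G(t^2/2), \qquad t\in\mathbb{R}.$$

We see that $\widetilde{G}\in C^2(\mathbb{R})$, $\widetilde{G}(0) = G(0)$, $\widetilde{G}'(0) = 0$, $\widetilde{G}'(t) = tG'(t^2/2)$ and $\widetilde{G}''(t) = G'(t^2/2) + t^2G''(t^2/2)$. Moreover,

$$U_f(z,z') = -h^2\widetilde{G}\left(\frac{(y-f(x))-(y'-f(x'))}{h}\right) + h^2\widetilde{G}\left(\frac{(y-f_\rho(x))-(y'-f_\rho(x'))}{h}\right).$$

(a) We apply the mean value theorem and see that $|U_f(z,z')| \leq 2h\|\widetilde{G}'\|_\infty\|f-f_\rho\|_\infty$. The inequality for $|U_f-U_g|$ is obtained when $f_\rho$ is replaced by $g$. Note that $\|\widetilde{G}'\|_\infty = \|tG'(t^2/2)\|_\infty$. Then the bounds for $U_f$ and $U_f-U_g$ are verified by noting $\|tG'(t^2/2)\|_\infty \leq 2C_G$.

To bound the variance, we apply (18) to the two points $w_1 = \frac{(y-f(x))-(y'-f(x'))}{h}$ and $w_2 = \frac{(y-f_\rho(x))-(y'-f_\rho(x'))}{h}$. Writing $w_2-t$ as $w_2-w_1+w_1-t$, we see from $\widetilde{G}'(0)=0$ that

$$\begin{aligned}
U_f(z,z') &= h^2\big\{\widetilde{G}(w_2)-\widetilde{G}(w_1)\big\} = h^2\widetilde{G}'(0)(w_2-w_1) + h^2\int_0^{w_2}(w_2-t)\widetilde{G}''(t)\,dt - h^2\int_0^{w_1}(w_1-t)\widetilde{G}''(t)\,dt \\
&= h^2\int_0^{w_2}(w_2-w_1)\widetilde{G}''(t)\,dt + h^2\int_{w_1}^{w_2}(w_1-t)\widetilde{G}''(t)\,dt.
\end{aligned}$$

It follows that

$$\big|U_f(z,z')\big| \leq \|\widetilde{G}''\|_\infty\big|(y-f_\rho(x))-(y'-f_\rho(x'))\big|\,\big|(f(x)-f_\rho(x))-(f(x')-f_\rho(x'))\big| + \|\widetilde{G}''\|_\infty\big|(f(x)-f_\rho(x))-(f(x')-f_\rho(x'))\big|^2. \tag{21}$$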

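Since $\mathbb{E}[|Y|^q]<\infty$, we apply Hölder's inequality and see that

$$\begin{aligned}
&\int_{\mathcal{Z}}\int_{\mathcal{Z}} \big|(y-f_\rho(x))-(y'-f_\rho(x'))\big|^2\,\big|(f(x)-f_\rho(x))-(f(x')-f_\rho(x'))\big|^2\,d\rho(z)\,d\rho(z') \\
&\quad\leq \left\{\int_{\mathcal{Z}}\int_{\mathcal{Z}} \big|(y-f_\rho(x))-(y'-f_\rho(x'))\big|^q\,d\rho(z)\,d\rho(z')\right\}^{2/q}\left\{\int_{\mathcal{Z}}\int_{\mathcal{Z}} \big|(f(x)-f_\rho(x))-(f(x')-f_\rho(x'))\big|^{2q/(q-2)}\,d\rho(z)\,d\rho(z')\right\}^{1-2/q} \\
&\quad\leq \Big\{4^{q+1}\big(\mathbb{E}[|Y|^q] + \|f_\rho\|_\infty^q\big)\Big\}^{2/q}\Big\{\|f-f_\rho\|_\infty^{4/(q-2)}\,2\,\mathrm{var}[f(X)-f_\rho(X)]\Big\}^{(q-2)/q}.
\end{aligned}$$

Here we have separated the power index $2q/(q-2)$ into the sum of $4/(q-2)$ and $2$. Then

$$\begin{aligned}
\mathrm{var}[U_f] \leq \mathbb{E}[U_f^2] &\leq 2\|\widetilde{G}''\|_\infty^2\,2^{\frac{5q+3}{q}}\big(\mathbb{E}[|Y|^q] + \|f_\rho\|_\infty^q\big)^{\frac{2}{q}}\|f-f_\rho\|_\infty^{\frac{4}{q}}\big(\mathrm{var}[f(X)-f_\rho(X)]\big)^{\frac{q-2}{q}} \\
&\quad + 2\|\widetilde{G}''\|_\infty^2\,4\|f-f_\rho\|_\infty^2\,2\,\mathrm{var}[f(X)-f_\rho(X)].
\end{aligned}$$

Hence the desired inequality holds true since $\|\widetilde{G}''\|_\infty \leq \|G'\|_\infty + \|t^2G''(t^2/2)\|_\infty \leq 3C_G$ and $\mathrm{var}[f(X)-f_\rho(X)] \leq \|f-f_\rho\|_\infty^2$.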

(b) If $|Y|\leq M$ almost surely for some constant $M>0$, then we see from (21) that almost surely $|U_f(z,z')| \leq 4\|\widetilde{G}''\|_\infty\big(M + \|f_\rho\|_\infty + \|f-f_\rho\|_\infty\big)\big|(f(x)-f_\rho(x))-(f(x')-f_\rho(x'))\big|$. Hence (19) holds true almost surely. Replacing $f_\rho$ by $g$ in (21), we see immediately inequality (20). The proof of Lemma 18 is complete.

With the above preparation, we can now give the uniform ratio probability inequality for U-statistics to estimate the sample error, following methods in the learning theory literature (Haussler et al., 1994; Koltchinskii, 2006; Cucker and Zhou, 2007).

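Lemma 19 Assume (2), (3) and $\varepsilon\geq C''_{\mathcal{H}}h^{-q^*}$. Then we have

$$\mathrm{Prob}\left\{\sup_{f\in\mathcal{H}}\frac{|V_f-\mathbb{E}[V_f]|}{(\mathbb{E}[V_f]+2\varepsilon)^{(q-2)/q}} > 4\varepsilon^{2/q}\right\} \leq 2\,\mathcal{N}\left(\mathcal{H},\frac{\varepsilon}{4C_Gh}\right)\exp\left\{-\frac{(m-1)\varepsilon}{A''_{\mathcal{H}}\,h}\right\},$$

where $A''_{\mathcal{H}}$ is the constant given by

$$A''_{\mathcal{H}} = 4A_{\mathcal{H}}(C''_{\mathcal{H}})^{-2/q} + 12\,C_G\sup_{f\in\mathcal{H}}\|f-f_\rho\|_\infty.$$

If $|Y|\leq M$ almost surely for some constant $M>0$, then we have

$$\mathrm{Prob}\left\{\sup_{f\in\mathcal{H}}\frac{|V_f-\mathbb{E}[V_f]|}{\sqrt{\mathbb{E}[V_f]+2\varepsilon}} > 4\sqrt{\varepsilon}\right\} \leq 2\,\mathcal{N}\left(\mathcal{H},\frac{\varepsilon}{2A'_{\mathcal{H}}}\right)\exp\left\{-\frac{(m-1)\varepsilon}{A''_{\mathcal{H}}}\right\},$$

where $A''_{\mathcal{H}}$ is the constant given by

$$A''_{\mathcal{H}} = 8A'_{\mathcal{H}} + 6A'_{\mathcal{H}}\sup_{f\in\mathcal{H}}\|f-f_\rho\|_\infty.$$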

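Proof If $\|f-f_j\|_\infty \leq \frac{\varepsilon}{4C_Gh}$, Lemma 18 (a) implies $|\mathbb{E}[V_f]-\mathbb{E}[V_{f_j}]| \leq \varepsilon$ and $|V_f-V_{f_j}| \leq \varepsilon$ almost surely. These in connection with Lemma 16 tell us that

$$\frac{|V_f-\mathbb{E}[V_f]|}{(\mathbb{E}[V_f]+2\varepsilon)^{(q-2)/q}} > 4\varepsilon^{2/q} \ \Longrightarrow\ \frac{|V_{f_j}-\mathbb{E}[V_{f_j}]|}{(\mathbb{E}[V_{f_j}]+2\varepsilon)^{(q-2)/q}} > \varepsilon^{2/q}.$$

Thus by taking $\{f_j\}_{j=1}^N$ to be an $\frac{\varepsilon}{4C_Gh}$ net of the set $\mathcal{H}$ with $N$ being the covering number $\mathcal{N}\big(\mathcal{H},\frac{\varepsilon}{4C_Gh}\big)$, we find

$$\begin{aligned}
\mathrm{Prob}\left\{\sup_{f\in\mathcal{H}}\frac{|V_f-\mathbb{E}[V_f]|}{(\mathbb{E}[V_f]+2\varepsilon)^{(q-2)/q}} > 4\varepsilon^{2/q}\right\}
&\leq \mathrm{Prob}\left\{\sup_{j=1,\ldots,N}\frac{|V_{f_j}-\mathbb{E}[V_{f_j}]|}{(\mathbb{E}[V_{f_j}]+2\varepsilon)^{(q-2)/q}} > \varepsilon^{2/q}\right\} \\
&\leq \sum_{j=1,\ldots,N}\mathrm{Prob}\left\{\frac{|V_{f_j}-\mathbb{E}[V_{f_j}]|}{(\mathbb{E}[V_{f_j}]+2\varepsilon)^{(q-2)/q}} > \varepsilon^{2/q}\right\}.
\end{aligned}$$

Fix $j\in\{1,\ldots,N\}$. Apply Lemma 17 to $U = U_{f_j}$ satisfying $\frac{1}{m(m-1)}\sum_{i=1}^m\sum_{k\neq i}U(z_i,z_k) - \mathbb{E}[U] = V_{f_j}-\mathbb{E}[V_{f_j}]$. By the bounds for $|U_{f_j}|$ and $\mathrm{var}[U_{f_j}]$ from Part (a) of Lemma 18, we know by taking $\widetilde{\varepsilon} = \varepsilon^{2/q}(\mathbb{E}[V_{f_j}]+2\varepsilon)^{(q-2)/q}$ that

$$\begin{aligned}
\mathrm{Prob}\left\{\frac{|V_{f_j}-\mathbb{E}[V_{f_j}]|}{(\mathbb{E}[V_{f_j}]+2\varepsilon)^{(q-2)/q}} > \varepsilon^{2/q}\right\}
&= \mathrm{Prob}\Big\{|V_{f_j}-\mathbb{E}[V_{f_j}]| > \widetilde{\varepsilon}\Big\} \\
&\leq 2\exp\left\{-\frac{(m-1)\widetilde{\varepsilon}^{\,2}}{4A_{\mathcal{H}}\big(\mathrm{var}[f_j(X)-f_\rho(X)]\big)^{(q-2)/q} + 12C_G\|f_j-f_\rho\|_\infty h\,\widetilde{\varepsilon}}\right\} \\
&\leq 2\exp\left\{-\frac{(m-1)\varepsilon^{4/q}(\mathbb{E}[V_{f_j}]+2\varepsilon)^{(q-2)/q}}{4A_{\mathcal{H}} + 12C_G\|f_j-f_\rho\|_\infty h\,\varepsilon^{2/q}}\right\},
\end{aligned}$$

where in the last step we have used the important relation (17) to the function $f=f_j$ and bounded $\big(\mathrm{var}[f_j(X)-f_\rho(X)]\big)^{(q-2)/q}$ by $(\mathbb{E}[V_{f_j}]+2\varepsilon)^{(q-2)/q}$. This together with the notation $N = \mathcal{N}\big(\mathcal{H},\frac{\varepsilon}{4C_Gh}\big)$ and the inequality $\|f_j-f_\rho\|_\infty \leq \sup_{f\in\mathcal{H}}\|f-f_\rho\|_\infty$ gives the first desired bound, where we have observed that $\varepsilon\geq C''_{\mathcal{H}}h^{-q^*}$ and $h\geq 1$ imply $\varepsilon^{-2/q} \leq (C''_{\mathcal{H}})^{-2/q}h$.

If $|Y|\leq M$ almost surely for some constant $M>0$, then we follow the same line as in our above proof. According to Part (b) of Lemma 18, we should replace $4C_Gh$ by $2A'_{\mathcal{H}}$, $q$ by $4$, and bound the variance $\mathrm{var}[U_{f_j}]$ by $2A'_{\mathcal{H}}\mathrm{var}[f_j(X)-f_\rho(X)] \leq 2A'_{\mathcal{H}}(\mathbb{E}[V_{f_j}]+2\varepsilon)$. Then the desired estimate follows. The proof of Lemma 19 is complete.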

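We are in a position to bound the sample error. To unify the two estimates in Lemma 19, we denote $A'_{\mathcal{H}} = 2C_G$ in the general case. For $m\in\mathbb{N}$, $0<\delta<1$, let $\varepsilon_{m,\delta}$ be the smallest positive solution to the inequality

$$\log\mathcal{N}\left(\mathcal{H},\frac{\varepsilon}{2A'_{\mathcal{H}}}\right) - \frac{(m-1)\varepsilon}{A''_{\mathcal{H}}} \leq \log\frac{\delta}{2}. \tag{22}$$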

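Proposition 20 Let $0<\delta<1$, $0<\eta\leq 1$. Under assumptions (2) and (3), we have with confidence of $1-\delta$,

$$\mathrm{var}[f_{\mathbf{z}}(X)-f_\rho(X)] \leq (1+\eta)\,\mathrm{var}[f_{approx}(X)-f_\rho(X)] + 12\left(2+24^{\frac{q-2}{2}}\right)\eta^{\frac{2-q}{2}}\left(h\,\varepsilon_{m,\delta} + 2C''_{\mathcal{H}}h^{-q^*}\right).$$

If $|Y|\leq M$ almost surely for some $M>0$, then with confidence of $1-\delta$, we have

$$\mathrm{var}[f_{\mathbf{z}}(X)-f_\rho(X)] \leq (1+\eta)\,\mathrm{var}[f_{approx}(X)-f_\rho(X)] + \frac{278}{\eta}\left(\varepsilon_{m,\delta} + 2C''_{\mathcal{H}}h^{-2}\right).$$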

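Proof Denote $\tau = (q-2)/q$ and $\varepsilon_{m,\delta,h} = \max\{h\varepsilon_{m,\delta},\,C''_{\mathcal{H}}h^{-q^*}\}$ in the general case with some $q>2$, while $\tau = 1/2$ and $\varepsilon_{m,\delta,h} = \max\{\varepsilon_{m,\delta},\,C''_{\mathcal{H}}h^{-2}\}$ when $|Y|\leq M$ almost surely. Then by Lemma 19, we know that with confidence $1-\delta$, there holds

$$\sup_{f\in\mathcal{H}}\frac{|V_f-\mathbb{E}[V_f]|}{(\mathbb{E}[V_f]+2\varepsilon_{m,\delta,h})^{\tau}} \leq 4\,\varepsilon_{m,\delta,h}^{1-\tau}$$

which implies

$$\mathbb{E}[V_{f_{\mathbf{z}}}] - V_{f_{\mathbf{z}}} + V_{f_{\mathcal{H}}} - \mathbb{E}[V_{f_{\mathcal{H}}}] \leq 4\,\varepsilon_{m,\delta,h}^{1-\tau}\big(\mathbb{E}[V_{f_{\mathbf{z}}}] + 2\varepsilon_{m,\delta,h}\big)^{\tau} + 4\,\varepsilon_{m,\delta,h}^{1-\tau}\big(\mathbb{E}[V_{f_{\mathcal{H}}}] + 2\varepsilon_{m,\delta,h}\big)^{\tau}.$$

This together with Lemma 15 and (16) yields

$$\mathrm{var}[f_{\mathbf{z}}(X)-f_\rho(X)] \leq 4\mathcal{S} + 16\,\varepsilon_{m,\delta,h} + \mathrm{var}[f_{approx}(X)-f_\rho(X)] + 2C''_{\mathcal{H}}h^{-q^*}, \tag{23}$$

where

$$\mathcal{S} := \varepsilon_{m,\delta,h}^{1-\tau}\big(\mathbb{E}[V_{f_{\mathbf{z}}}]\big)^{\tau} + \varepsilon_{m,\delta,h}^{1-\tau}\big(\mathbb{E}[V_{f_{\mathcal{H}}}]\big)^{\tau} = \Big(\frac{24}{\eta}\Big)^{\tau}\varepsilon_{m,\delta,h}^{1-\tau}\Big(\frac{\eta}{24}\mathbb{E}[V_{f_{\mathbf{z}}}]\Big)^{\tau} + \Big(\frac{12}{\eta}\Big)^{\tau}\varepsilon_{m,\delta,h}^{1-\tau}\Big(\frac{\eta}{12}\mathbb{E}[V_{f_{\mathcal{H}}}]\Big)^{\tau}.$$

Now we apply Young's inequality

$$a\cdot b \leq (1-\tau)\,a^{1/(1-\tau)} + \tau\,b^{1/\tau}, \qquad a,b\geq 0$$

and find

$$\mathcal{S} \leq \Big(\frac{24}{\eta}\Big)^{\tau/(1-\tau)}\varepsilon_{m,\delta,h} + \frac{\eta}{24}\mathbb{E}[V_{f_{\mathbf{z}}}] + \Big(\frac{12}{\eta}\Big)^{\tau/(1-\tau)}\varepsilon_{m,\delta,h} + \frac{\eta}{12}\mathbb{E}[V_{f_{\mathcal{H}}}].$$

Combining this with (23), Theorem 13 and the identity $\mathbb{E}[V_f] = \mathcal{E}^{(h)}(f) - \mathcal{E}^{(h)}(f_\rho)$ gives

$$\mathrm{var}[f_{\mathbf{z}}(X)-f_\rho(X)] \leq \frac{\eta}{6}\,\mathrm{var}[f_{\mathbf{z}}(X)-f_\rho(X)] + \Big(1+\frac{\eta}{3}\Big)\mathrm{var}[f_{approx}(X)-f_\rho(X)] + \mathcal{S}',$$

where $\mathcal{S}' := \big(16 + 8(24/\eta)^{\tau/(1-\tau)}\big)\varepsilon_{m,\delta,h} + 3C''_{\mathcal{H}}h^{-q^*}$. Since $1/(1-\frac{\eta}{6}) \leq 1+\frac{\eta}{3}$ and $(1+\frac{\eta}{3})^2 \leq 1+\eta$, we see that

$$\mathrm{var}[f_{\mathbf{z}}(X)-f_\rho(X)] \leq (1+\eta)\,\mathrm{var}[f_{approx}(X)-f_\rho(X)] + \frac{4}{3}\,\mathcal{S}'.$$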

5. Proof of Main Results

We are now in a position to prove our main results stated in Section 2.

5.1 Proof of Theorem 3

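Recall $\mathcal{D}_{\mathcal{H}}(f_\rho) = \mathrm{var}[f_{approx}(X)-f_\rho(X)]$. Take $\eta = \min\{\varepsilon/(3\mathcal{D}_{\mathcal{H}}(f_\rho)),\,1\}$. Then $\eta\,\mathrm{var}[f_{approx}(X)-f_\rho(X)] \leq \varepsilon/3$.

Now we take

$$h_{\varepsilon,\delta} = \left(72\left(2+24^{(q-2)/2}\right)\eta^{(2-q)/2}\,C''_{\mathcal{H}}/\varepsilon\right)^{1/q^*}.$$

Set $\widetilde{\varepsilon} := \varepsilon/\big(36\,(2+24^{(q-2)/2})\,\eta^{(2-q)/2}\big)$. We choose

$$m_{\varepsilon,\delta}(h) = \frac{hA''_{\mathcal{H}}}{\widetilde{\varepsilon}}\left(\log\mathcal{N}\left(\mathcal{H},\frac{\widetilde{\varepsilon}}{2hA'_{\mathcal{H}}}\right) - \log\frac{\delta}{2}\right) + 1.$$

With this choice, we know that whenever $m\geq m_{\varepsilon,\delta}(h)$, the solution $\varepsilon_{m,\delta}$ to inequality (22) satisfies $\varepsilon_{m,\delta} \leq \widetilde{\varepsilon}/h$. Combining all the above estimates and Proposition 20, we see that whenever $h\geq h_{\varepsilon,\delta}$ and $m\geq m_{\varepsilon,\delta}(h)$, error bound (4) holds true with confidence $1-\delta$. This proves Theorem 3.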

5.2 Proof of Theorem 5

We apply Proposition 20. By covering number condition (5), we know thatε_m_,_δis bounded byeε_m_,_δ, the smallest positive solution to the inequality

Ap ₂_A_′

H

ε p

−(m_A−_′′1)ε

H

≤logδ 2.

This inequality written asε1+p₋ A′′H

m₋1log 2

δεp−Ap 2A′H

p A′′H

m₋1≥0 is well understood in learning theory (e.g., Cucker and Zhou, 2007) and its solution can be bounded as

eε_m_,_δ_≤max

2 A ′′

H

m₋1log 2

δ, 2ApA ′′

H(2A′H)p

1/(1+p)

(m₋1)−1+1p

.

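If $E[|Y|^q] < \infty$ for some $q > 2$, then the first part of Proposition 20 verifies (6) with the constant $\tilde C_{\mathcal H}$ given by

$$\tilde C_{\mathcal H} = 24\left(2 + 24^{(q-2)/2}\right)\left(2A''_{\mathcal H} + \left(2A_pA''_{\mathcal H}\left(2A'_{\mathcal H}\right)^{p}\right)^{1/(1+p)}\right) + 2C''_{\mathcal H}.$$

If $|Y| \le M$ almost surely for some $M > 0$, then the second part of Proposition 20 proves (7) with the constant $\tilde C_{\mathcal H}$ given by

$$\tilde C_{\mathcal H} = 278\left(2A''_{\mathcal H} + \left(2A_pA''_{\mathcal H}\left(2A'_{\mathcal H}\right)^{p}\right)^{1/(1+p)}\right) + 2C''_{\mathcal H}.$$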


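5.3 Proof of Theorem 8

Note that

$$\left|\frac{1}{m}\sum_{i=1}^{m}\left(f(x_i) - \pi_{\sqrt m}(y_i)\right) - \frac{1}{m}\sum_{i=1}^{m}\left(g(x_i) - \pi_{\sqrt m}(y_i)\right)\right| \le \|f - g\|_\infty$$

and

$$\left|E\left[f(X) - \pi_{\sqrt m}(Y)\right] - E\left[g(X) - \pi_{\sqrt m}(Y)\right]\right| \le \|f - g\|_\infty.$$

So by taking $\{f_j\}_{j=1}^{N}$ to be an $\frac{\varepsilon}{4}$ net of the set $\mathcal H$ with $N = \mathcal N\left(\mathcal H, \frac{\varepsilon}{4}\right)$, we know that for each $f \in \mathcal H$ there is some $j \in \{1, \ldots, N\}$ such that $\|f - f_j\|_\infty \le \frac{\varepsilon}{4}$. Hence

$$\left|\frac{1}{m}\sum_{i=1}^{m}\left(f(x_i) - \pi_{\sqrt m}(y_i)\right) - E\left[f(X) - \pi_{\sqrt m}(Y)\right]\right| > \varepsilon \ \Longrightarrow\ \left|\frac{1}{m}\sum_{i=1}^{m}\left(f_j(x_i) - \pi_{\sqrt m}(y_i)\right) - E\left[f_j(X) - \pi_{\sqrt m}(Y)\right]\right| > \frac{\varepsilon}{2}.$$

It follows that

$$\begin{aligned}
&\mathrm{Prob}\left\{\sup_{f\in\mathcal H}\left|\frac{1}{m}\sum_{i=1}^{m}\left(f(x_i) - \pi_{\sqrt m}(y_i)\right) - E\left[f(X) - \pi_{\sqrt m}(Y)\right]\right| > \varepsilon\right\}\\
&\qquad\le \mathrm{Prob}\left\{\sup_{j=1,\ldots,N}\left|\frac{1}{m}\sum_{i=1}^{m}\left(f_j(x_i) - \pi_{\sqrt m}(y_i)\right) - E\left[f_j(X) - \pi_{\sqrt m}(Y)\right]\right| > \frac{\varepsilon}{2}\right\}\\
&\qquad\le \sum_{j=1}^{N}\mathrm{Prob}\left\{\left|\frac{1}{m}\sum_{i=1}^{m}\left(f_j(x_i) - \pi_{\sqrt m}(y_i)\right) - E\left[f_j(X) - \pi_{\sqrt m}(Y)\right]\right| > \frac{\varepsilon}{2}\right\}.
\end{aligned}$$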

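For each fixed $j \in \{1, \ldots, N\}$, we apply the classical Bernstein probability inequality to the random variable $\xi = f_j(X) - \pi_{\sqrt m}(Y)$ on $(Z, \rho)$ bounded by $\tilde M = \sup_{f\in\mathcal H}\|f\|_\infty + \sqrt m$ with variance $\sigma^2(\xi) \le E\left[|f_j(X) - \pi_{\sqrt m}(Y)|^2\right] \le 2\sup_{f\in\mathcal H}\|f\|_\infty^2 + 2E[|Y|^2] =: \sigma^2_{\mathcal H}$ and know that

$$\mathrm{Prob}\left\{\left|\frac{1}{m}\sum_{i=1}^{m}\left(f_j(x_i) - \pi_{\sqrt m}(y_i)\right) - E\left[f_j(X) - \pi_{\sqrt m}(Y)\right]\right| > \frac{\varepsilon}{2}\right\} \le 2\exp\left\{-\frac{m(\varepsilon/2)^2}{\frac{2}{3}\tilde M\varepsilon/2 + 2\sigma^2(\xi)}\right\} \le 2\exp\left\{-\frac{m\varepsilon^2}{\frac{4}{3}\tilde M\varepsilon + 8\sigma^2_{\mathcal H}}\right\}.$$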
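The passage from the Bernstein exponent at threshold $\varepsilon/2$ to the simplified exponent is pure algebra, and a short numerical sketch makes the constants easy to check. The function names below are hypothetical, and `sup_norm` and `second_moment_y` are stand-ins for $\sup_{f\in\mathcal H}\|f\|_\infty$ and $E[|Y|^2]$; the numeric values passed in are assumptions chosen only for illustration.

```python
import math


def bernstein_tail(m: int, t: float, M: float, sigma2: float) -> float:
    """Two-sided Bernstein bound 2*exp(-m t^2 / (2*sigma2 + (2/3)*M*t))."""
    return 2.0 * math.exp(-m * t * t / (2.0 * sigma2 + (2.0 / 3.0) * M * t))


def net_point_tail(m: int, eps: float, sup_norm: float, second_moment_y: float) -> float:
    """Bound for one net element f_j at threshold eps/2, with M~ and sigma_H^2 as in the text."""
    M_tilde = sup_norm + math.sqrt(m)
    sigma2_H = 2.0 * sup_norm ** 2 + 2.0 * second_moment_y
    return bernstein_tail(m, eps / 2.0, M_tilde, sigma2_H)


def simplified_tail(m: int, eps: float, sup_norm: float, second_moment_y: float) -> float:
    """The same bound written as 2*exp(-m eps^2 / ((4/3)*M~*eps + 8*sigma_H^2))."""
    M_tilde = sup_norm + math.sqrt(m)
    sigma2_H = 2.0 * sup_norm ** 2 + 2.0 * second_moment_y
    return 2.0 * math.exp(-m * eps ** 2 / ((4.0 / 3.0) * M_tilde * eps + 8.0 * sigma2_H))
```

Both functions return the same value up to rounding, which confirms that multiplying numerator and denominator of the first exponent by 4 gives the second. Because $\tilde M$ grows like $\sqrt m$, the exponent grows only like $\sqrt m\,\varepsilon$ for fixed $\varepsilon$, not like $m\varepsilon^2$.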

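The above argument together with covering number condition (5) yields

$$\mathrm{Prob}\left\{\sup_{f\in\mathcal H}\left|\frac{1}{m}\sum_{i=1}^{m}\left(f(x_i) - \pi_{\sqrt m}(y_i)\right) - E\left[f(X) - \pi_{\sqrt m}(Y)\right]\right| > \varepsilon\right\} \le 2N\exp\left\{-\frac{m\varepsilon^2}{\frac{4}{3}\tilde M\varepsilon + 8\sigma^2_{\mathcal H}}\right\} \le 2\exp\left\{\frac{A_p4^p}{\varepsilon^p} - \frac{m\varepsilon^2}{\frac{4}{3}\tilde M\varepsilon + 8\sigma^2_{\mathcal H}}\right\}.$$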

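Bounding the right-hand side above by $\delta$ is equivalent to the inequality

$$\varepsilon^{2+p} - \frac{4\tilde M}{3m}\log\frac{2}{\delta}\,\varepsilon^{1+p} - \frac{8\sigma^2_{\mathcal H}}{m}\log\frac{2}{\delta}\,\varepsilon^{p} - \frac{A_p4^p}{m} \ge 0.$$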


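By taking $\tilde\varepsilon_{m,\delta}$ to be the smallest positive solution to the above inequality, we see from Cucker and Zhou (2007) as in the proof of Theorem 5 that with confidence at least $1-\delta$,

$$\begin{aligned}
\sup_{f\in\mathcal H}\left|\frac{1}{m}\sum_{i=1}^{m}\left(f(x_i) - \pi_{\sqrt m}(y_i)\right) - E\left[f(X) - \pi_{\sqrt m}(Y)\right]\right|
&\le \tilde\varepsilon_{m,\delta} \le \max\left\{\frac{4\tilde M}{m}\log\frac{2}{\delta},\ \sqrt{\frac{24\sigma^2_{\mathcal H}}{m}\log\frac{2}{\delta}},\ \left(\frac{A_p4^p}{m}\right)^{\frac{1}{2+p}}\right\}\\
&\le \left(7\sup_{f\in\mathcal H}\|f\|_\infty + 4 + 7\sqrt{E[|Y|^2]} + 4A_p^{\frac{1}{2+p}}\right)m^{-\frac{1}{2+p}}\log\frac{2}{\delta}.
\end{aligned}$$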

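Moreover, since $\pi_{\sqrt m}(y) - y = 0$ for $|y| \le \sqrt m$ while $|\pi_{\sqrt m}(y) - y| \le |y| \le \frac{|y|^2}{\sqrt m}$ for $|y| > \sqrt m$, we know that

$$\begin{aligned}
\left|E[\pi_{\sqrt m}(Y)] - E[f_\rho(X)]\right|
&= \left|\int_X\int_Y \left(\pi_{\sqrt m}(y) - y\right)d\rho(y|x)\,d\rho_X(x)\right|
= \left|\int_X\int_{|y|>\sqrt m}\left(\pi_{\sqrt m}(y) - y\right)d\rho(y|x)\,d\rho_X(x)\right|\\
&\le \int_X\int_{|y|>\sqrt m}\frac{|y|^2}{\sqrt m}\,d\rho(y|x)\,d\rho_X(x) \le \frac{E[|Y|^2]}{\sqrt m}.
\end{aligned}$$

Therefore, (11) holds with confidence at least $1-\delta$. The proof of Theorem 8 is complete.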
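The truncation estimate $|E[\pi_{\sqrt m}(Y)] - E[Y]| \le E[|Y|^2]/\sqrt m$ rests on a pointwise inequality, so it also holds for the empirical distribution of any finite sample. The sketch below illustrates this; the helper names `project`, `truncation_gap` and `demo` are hypothetical, and the Gaussian sample is an arbitrary stand-in for the distribution of $Y$.

```python
import math
import random


def project(y: float, m: int) -> float:
    """Projection pi_{sqrt(m)} of y onto the closed interval [-sqrt(m), sqrt(m)]."""
    r = math.sqrt(m)
    return max(-r, min(r, y))


def truncation_gap(ys, m: int):
    """Return (|mean(pi(y)) - mean(y)|, mean(y^2) / sqrt(m)) for the sample ys."""
    n = len(ys)
    gap = abs(sum(project(y, m) - y for y in ys)) / n
    bound = sum(y * y for y in ys) / (n * math.sqrt(m))
    return gap, bound


def demo(seed: int = 0, n: int = 2000):
    """Compute the gap and its bound for several m on an assumed Gaussian sample."""
    rng = random.Random(seed)
    ys = [rng.gauss(0.0, 5.0) for _ in range(n)]
    return {m: truncation_gap(ys, m) for m in (1, 4, 25, 100)}
```

In each case the first entry of the returned pair does not exceed the second, since $|\pi_{\sqrt m}(y) - y| \le |y|^2/\sqrt m$ for every single $y$.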

6. Conclusion and Discussion

In this paper we have proved the consistency of an MEE algorithm associated with Rényi’s entropy of order 2 by letting the scaling parameter $h$ in the kernel density estimator tend to infinity at an appropriate rate. This result explains the effectiveness of the MEE principle in empirical applications where the parameter $h$ is required to be large enough before smaller values are tuned. However, the motivation of the MEE principle is to minimize error entropies approximately, and requires small $h$ for the kernel density estimator to converge to the true probability density function. Therefore, our consistency result seems surprising.

As far as we know, our result is the first rigorous consistency result for MEE algorithms. There are many open questions in the mathematical analysis of MEE algorithms. For instance, can MEE algorithm (1) be consistent by taking $h \to 0$? Can one carry out error analysis for the MEE algorithm if Shannon’s entropy or Rényi’s entropy of order $\alpha \ne 2$ is used? How can we establish error analysis for other learning settings such as those with non-identical sampling processes (Smale and Zhou, 2009; Hu, 2011)? These questions require further research and will be our future topics.


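Table 1: NOTATIONS

| notation | meaning | pages |
|---|---|---|
| $p_E$ | probability density function of a random variable $E$ | 378 |
| $H_S(E)$ | Shannon’s entropy of a random variable $E$ | 378 |
| $H_{R,\alpha}(E)$ | Rényi’s entropy of order $\alpha$ | 378 |
| $X$ | explanatory variable for learning | 378 |
| $Y$ | response variable for learning | 378 |
| $E = Y - f(X)$ | error random variable associated with a predictor $f(X)$ | 378 |
| $H_R(E)$ | Rényi’s entropy of order $\alpha = 2$ | 378 |
| $\mathbf z = \{(x_i, y_i)\}_{i=1}^{m}$ | a sample for learning | 378 |
| $G$ | windowing function | 378, 379, 380 |
| $h$ | MEE scaling parameter | 378, 379 |
| $\hat p_E$ | Parzen windowing approximation of $p_E$ | 378 |
| $\hat H_S$ | empirical Shannon entropy | 378 |
| $\hat H_R$ | empirical Rényi’s entropy of order 2 | 378 |
| $f_\rho$ | the regression function of $\rho$ | 379 |
| $f_{\mathbf z}$ | output function of the MEE learning algorithm (1) | 379 |
| $\mathcal H$ | the hypothesis space for the ERM algorithm | 379 |
| $\mathrm{var}$ | the variance of a random variable | 379 |
| $q$, $q^* = \min\{q-2, 2\}$ | power indices in condition (2) for $E[\lvert Y\rvert^q] < \infty$ | 380 |
| $C_G$ | constant for decay condition (3) of $G$ | 380 |
| $\mathcal D_{\mathcal H}(f_\rho)$ | approximation error of the pair $(\mathcal H, \rho)$ | 380 |
| $\mathcal N(\mathcal H, \varepsilon)$ | covering number of the hypothesis space $\mathcal H$ | 380 |
| $p$ | power index for covering number condition (5) | 380 |
| $\pi_{\sqrt m}$ | projection onto the closed interval $[-\sqrt m, \sqrt m]$ | 381 |
| $\tilde f_{\mathbf z}$ | estimator of $f_\rho$ | 382 |
| $\mathcal E^{(h)}(f)$ | generalization error associated with $G$ and $h$ | 383 |
| $\mathcal E^{ls}(f)$ | least squares generalization error $\mathcal E^{ls}(f) = \int_Z (f(x) - y)^2\,d\rho$ | 383 |
| $C_\rho$ | constant $C_\rho = \int_Z \left(y - f_\rho(x)\right)^2 d\rho$ associated with $\rho$ | 384 |
| $f_{\mathcal H}$ | minimizer of $\mathcal E^{(h)}(f)$ in $\mathcal H$ | 385 |
| $f_{approx}$ | minimizer of $\mathrm{var}[f(X) - f_\rho(X)]$ in $\mathcal H$ | 385 |
| $U_f$ | kernel for the U statistics $V_f$ | 387 |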

e