Learning Theory Approach to Minimum Error Entropy Criterion
Ting Hu [email protected]
School of Mathematics and Statistics Wuhan University
Wuhan 430072, China
Jun Fan [email protected]
Department of Mathematics City University of Hong Kong 83 Tat Chee Avenue
Kowloon, Hong Kong, China
Qiang Wu [email protected]
Department of Mathematical Sciences Middle Tennessee State University Murfreesboro, TN 37132, USA
Ding-Xuan Zhou [email protected]
Department of Mathematics City University of Hong Kong 83 Tat Chee Avenue
Kowloon, Hong Kong, China
Editor: Gabor Lugosi
Abstract
We consider the minimum error entropy (MEE) criterion and an empirical risk minimization learning algorithm when an approximation of Rényi's entropy (of order 2) by Parzen windowing is minimized. This learning algorithm involves a Parzen windowing scaling parameter. We present a learning theory approach for this MEE algorithm in a regression setting when the scaling parameter is large. Consistency and explicit convergence rates are provided in terms of the approximation ability and capacity of the involved hypothesis space. Novel analysis is carried out for the generalization error associated with Rényi's entropy and a Parzen windowing function, to overcome technical difficulties arising from the essential differences between the classical least squares problems and the MEE setting. An involved symmetrized least squares error is introduced and analyzed, which is related to some ranking algorithms.
Keywords: minimum error entropy, learning theory, Rényi's entropy, empirical risk minimization, approximation error
1. Introduction
A systematic treatment and recent development of this area can be found in Principe (2010) and references therein.
Minimum error entropy (MEE) is a principle of information theoretical learning and provides a family of supervised learning algorithms. It was introduced for adaptive system training in Erdogmus and Principe (2002) and has been applied to blind source separation, maximally informative subspace projections, clustering, feature selection, blind deconvolution, and some other topics (Erdogmus and Principe, 2003; Principe, 2010; Silva et al., 2010). The idea of MEE is to extract from data as much information as possible about the data generating systems by minimizing error entropies in various ways. In information theory, entropies are used to measure average information quantitatively. For a random variable $E$ with probability density function $p_E$, Shannon's entropy of $E$ is defined as
$$H_S(E) = -\mathbb{E}[\log p_E] = -\int p_E(e)\log p_E(e)\,de,$$
while Rényi's entropy of order $\alpha$ ($\alpha>0$ but $\alpha\neq 1$) is defined as
$$H_{R,\alpha}(E) = \frac{1}{1-\alpha}\log \mathbb{E}\big[p_E^{\alpha-1}\big] = \frac{1}{1-\alpha}\log\int (p_E(e))^{\alpha}\,de,$$
satisfying $\lim_{\alpha\to 1}H_{R,\alpha}(E) = H_S(E)$. In supervised learning our target is to predict the response variable $Y$ from the explanatory variable $X$. Then the random variable $E$ becomes the error variable $E=Y-f(X)$ when a predictor $f(X)$ is used, and the MEE principle aims at searching for a predictor $f(X)$ that contains the most information of the response variable by minimizing information entropies of the error variable $E=Y-f(X)$. This principle is a substitution of the classical least squares method when the noise is non-Gaussian. Note that $\mathbb{E}[Y-f(X)]^2=\int e^2p_E(e)\,de$. The least squares method minimizes the variance of the error variable $E$ and is well suited to problems involving Gaussian noise (such as some from linear signal processing). But it takes only the first two moments into account, and does not work very well for problems involving heavy tailed non-Gaussian noise. For such problems, MEE might still perform very well in principle since moments of all orders of the error variable are taken into account by entropies. Here we only consider Rényi's entropy of order $\alpha=2$: $H_R(E)=H_{R,2}(E)=-\log\int (p_E(e))^2\,de$. Our analysis does not apply to Rényi's entropy of order $\alpha\neq 2$.
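As a quick sanity check of the two entropy definitions, the following sketch (an illustration added here, not part of the original analysis) integrates both entropies numerically for a centered Gaussian density. The function names, the trapezoidal rule and the truncation to $[-12\sigma, 12\sigma]$ are assumptions made for this example. For a Gaussian with standard deviation $\sigma$ the closed forms are $H_S=\frac12\log(2\pi e\sigma^2)$ and $H_{R,2}=\log(2\sigma\sqrt{\pi})$.

```python
import math

def gaussian_pdf(e, sigma):
    """Density of a centered Gaussian with standard deviation sigma."""
    return math.exp(-e * e / (2.0 * sigma * sigma)) / (sigma * math.sqrt(2.0 * math.pi))

def integrate(func, lo, hi, n=20000):
    """Composite trapezoidal rule on [lo, hi] with n subintervals."""
    step = (hi - lo) / n
    total = 0.5 * (func(lo) + func(hi))
    for k in range(1, n):
        total += func(lo + k * step)
    return total * step

def shannon_entropy(sigma):
    """H_S(E) = -int p log p, truncated to [-12 sigma, 12 sigma]."""
    def integrand(e):
        p = gaussian_pdf(e, sigma)
        return -p * math.log(p)
    return integrate(integrand, -12.0 * sigma, 12.0 * sigma)

def renyi_entropy(sigma, alpha):
    """H_{R,alpha}(E) = log(int p^alpha) / (1 - alpha), alpha > 0, alpha != 1."""
    integral = integrate(lambda e: gaussian_pdf(e, sigma) ** alpha,
                         -12.0 * sigma, 12.0 * sigma)
    return math.log(integral) / (1.0 - alpha)
```

Evaluating `renyi_entropy(sigma, alpha)` for `alpha` close to 1 gives values close to `shannon_entropy(sigma)`, in line with the limit relation stated above.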
In most real applications, neither the explanatory variable $X$ nor the response variable $Y$ is explicitly known. Instead, in supervised learning, a sample $\mathbf{z}=\{(x_i,y_i)\}_{i=1}^m$ is available which reflects the distribution of the explanatory variable $X$ and the functional relation between $X$ and the response variable $Y$. With this sample, information entropies of the error variable $E=Y-f(X)$ can be approximated by estimating its probability density function $p_E$ by Parzen (1962) windowing
$$\widehat{p}_E(e)=\frac{1}{mh}\sum_{i=1}^m G\Big(\frac{(e-e_i)^2}{2h^2}\Big),$$
where $e_i=y_i-f(x_i)$, $h>0$ is an MEE scaling parameter, and $G$ is a windowing function. A typical choice for the windowing function $G(t)=\exp\{-t\}$ corresponds to Gaussian windowing. Then approximations of Shannon's entropy and Rényi's entropy of order 2 are given by their empirical versions $-\frac{1}{m}\sum_{i=1}^m\log\widehat{p}_E(e_i)$ and $-\log\big(\frac{1}{m}\sum_{i=1}^m\widehat{p}_E(e_i)\big)$ as
$$\widehat{H}_S=-\frac{1}{m}\sum_{i=1}^m\log\Big[\frac{1}{mh}\sum_{j=1}^m G\Big(\frac{(e_i-e_j)^2}{2h^2}\Big)\Big]$$
and
$$\widehat{H}_R=-\log\frac{1}{m^2h}\sum_{i=1}^m\sum_{j=1}^m G\Big(\frac{(e_i-e_j)^2}{2h^2}\Big),$$
respectively. The empirical MEE is implemented by minimizing these computable quantities.
Though the MEE principle has been proposed for a decade and MEE algorithms have been shown to be effective in various applications, its theoretical foundation for mathematical error analysis is not well understood yet. There is not even a consistency result in the literature. It has been observed in applications that the scaling parameter $h$ should be large enough for MEE algorithms to work well before smaller values are tuned. However, it is well known that the convergence of Parzen windowing requires $h$ to converge to 0. We believe this contradiction imposes difficulty for rigorous mathematical analysis of MEE algorithms. Another technical barrier for mathematical analysis of MEE algorithms for regression is the possibility that the regression function may not be a minimizer of the associated generalization error, as described in detail in Section 3 below. The main contribution of this paper is a consistency result for an MEE algorithm for regression. It does require $h$ to be large and explains the effectiveness of the MEE principle in applications.
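The empirical quantities $\widehat{p}_E$, $\widehat{H}_S$ and $\widehat{H}_R$ defined above can be computed directly from a list of errors $e_i$. The following is a minimal sketch under the Gaussian windowing choice $G(t)=\exp\{-t\}$ mentioned in the text; the function names are hypothetical and the quadratic-time double loop is chosen for readability only.

```python
import math

def gaussian_window(t):
    """Windowing function G(t) = exp(-t) from the text (no normalizing constant)."""
    return math.exp(-t)

def parzen_density(e, errors, h, window=gaussian_window):
    """Parzen windowing estimate of p_E at the point e."""
    m = len(errors)
    return sum(window((e - e_i) ** 2 / (2.0 * h * h)) for e_i in errors) / (m * h)

def empirical_shannon_entropy(errors, h, window=gaussian_window):
    """-(1/m) sum_i log p_hat(e_i)."""
    m = len(errors)
    return -sum(math.log(parzen_density(e_i, errors, h, window)) for e_i in errors) / m

def empirical_renyi_entropy(errors, h, window=gaussian_window):
    """-log((1/m) sum_i p_hat(e_i)), i.e. -log of the double sum divided by m^2 h."""
    m = len(errors)
    return -math.log(sum(parzen_density(e_i, errors, h, window) for e_i in errors) / m)
```

Both quantities depend on the errors only through the differences $e_i-e_j$, so adding a constant to every error leaves them unchanged, and they attain the value $\log h$ when all errors coincide.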
In the sequel of this paper, we consider an MEE learning algorithm that minimizes the empirical Rényi's entropy $\widehat{H}_R$ and focus on the regression problem. We will take a learning theory approach and analyze this algorithm in an empirical risk minimization (ERM) setting. Assume $\rho$ is a probability measure on $\mathcal{Z}:=\mathcal{X}\times\mathcal{Y}$, where $\mathcal{X}$ is a separable metric space (input space for learning) and $\mathcal{Y}=\mathbb{R}$ (output space). Let $\rho_X$ be its marginal distribution on $\mathcal{X}$ (for the explanatory variable $X$) and $\rho(\cdot|x)$ be the conditional distribution of $Y$ for given $X=x$. The sample $\mathbf{z}$ is assumed to be drawn from $\rho$ independently and identically distributed. The aim of the regression problem is to predict the conditional mean of $Y$ for given $X$ by learning the regression function defined by
$$f_\rho(x)=\mathbb{E}(Y|X=x)=\int_{\mathcal{Y}} y\,d\rho(y|x),\qquad x\in\mathcal{X}.$$
The minimization of empirical Rényi's entropy cannot be done over all possible measurable functions, since this would lead to overfitting. A suitable hypothesis space should be chosen appropriately in the ERM setting. The ERM framework for MEE learning is defined as follows. Recall $e_i=y_i-f(x_i)$.
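Definition 1 Let $G$ be a continuous function defined on $[0,\infty)$ and $h>0$. Let $\mathcal{H}$ be a compact subset of $C(\mathcal{X})$. Then the MEE learning algorithm associated with $\mathcal{H}$ is defined by
$$f_{\mathbf{z}}=\arg\min_{f\in\mathcal{H}}\left\{-\log\frac{1}{m^2h}\sum_{i=1}^m\sum_{j=1}^m G\left(\frac{[(y_i-f(x_i))-(y_j-f(x_j))]^2}{2h^2}\right)\right\}. \tag{1}$$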
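To make Definition 1 concrete, here is a small sketch of algorithm (1) over a finite hypothesis set, which is trivially a compact subset of $C(\mathcal{X})$. The hypothesis set (linear functions $f(x)=ax$ with slopes from a short list), the noise-free sample and the value of $h$ are assumptions chosen purely for illustration.

```python
import math

def mee_objective(f, sample, h, window=lambda t: math.exp(-t)):
    """Objective inside the arg min of (1): empirical Renyi entropy of the errors."""
    errors = [y - f(x) for x, y in sample]
    m = len(errors)
    total = sum(window((e_i - e_j) ** 2 / (2.0 * h * h))
                for e_i in errors for e_j in errors)
    return -math.log(total / (m * m * h))

def mee_fit_slope(slopes, sample, h):
    """Return the slope a (from a finite list) whose predictor f(x) = a * x minimizes (1)."""
    return min(slopes, key=lambda a: mee_objective(lambda x: a * x, sample, h))

# Hypothetical sample on the line y = 2x + 1 (no noise, for illustration only).
sample = [(k / 10.0, 2.0 * (k / 10.0) + 1.0) for k in range(10)]
best_slope = mee_fit_slope([0.0, 1.0, 2.0, 3.0], sample, h=5.0)
```

In this toy sample the selected slope is 2 although no candidate contains the intercept 1: the objective in (1) only sees differences of errors, so it cannot distinguish $f$ from $f+c$ for a constant $c$.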
2. Main Results on Consistency and Convergence Rates
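Throughout the paper, we assume $h\ge 1$ and that
$$\mathbb{E}[|Y|^q] < \infty \ \text{ for some } q>2, \quad\text{and}\quad f_\rho\in L^\infty_{\rho_X}. \quad\text{Denote } q^*=\min\{q-2,2\}. \tag{2}$$
We also assume that the windowing function $G$ satisfies
$$G\in C^2[0,\infty),\quad G'_+(0)=-1,\quad\text{and}\quad C_G:=\sup_{t\in(0,\infty)}\big\{|(1+t)G'(t)|+|(1+t)G''(t)|\big\} < \infty. \tag{3}$$
The special example $G(t)=\exp\{-t\}$ for the Gaussian windowing satisfies (3).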
Consistency analysis for regression algorithms is often carried out in the literature under a decay assumption for $Y$ such as uniform boundedness and exponential decays. A recent study (Audibert and Catoni, 2011) was made under the assumption $\mathbb{E}[|Y|^4]<\infty$. Our assumption (2) is weaker since $q$ may be arbitrarily close to 2. Note that (2) obviously holds when $|Y|\le M$ almost surely for some constant $M>0$, in which case we shall denote $q^*=2$.
Our consistency result, to be proved in Section 5, asserts that when $h$ and $m$ are large enough, the error $\mathrm{var}[f_{\mathbf{z}}(X)-f_\rho(X)]$ of MEE algorithm (1) can be arbitrarily close to the approximation error (Smale and Zhou, 2003) of the hypothesis space $\mathcal{H}$ with respect to the regression function $f_\rho$.
Definition 2 The approximation error of the pair $(\mathcal{H},\rho)$ is defined by
$$\mathcal{D}_{\mathcal{H}}(f_\rho)=\inf_{f\in\mathcal{H}}\mathrm{var}[f(X)-f_\rho(X)].$$
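Theorem 3 Under assumptions (2) and (3), for any $0<\varepsilon\le 1$ and $0<\delta<1$, there exist $h_{\varepsilon,\delta}\ge 1$ and $m_{\varepsilon,\delta}(h)\ge 1$ both depending on $\mathcal{H},G,\rho,\varepsilon,\delta$ such that for $h\ge h_{\varepsilon,\delta}$ and $m\ge m_{\varepsilon,\delta}(h)$, with confidence $1-\delta$, we have
$$\mathrm{var}[f_{\mathbf{z}}(X)-f_\rho(X)]\le \mathcal{D}_{\mathcal{H}}(f_\rho)+\varepsilon. \tag{4}$$
Our convergence rates will be stated in terms of the approximation error and the capacity of the hypothesis space $\mathcal{H}$ measured by covering numbers in this paper.
Definition 4 For $\varepsilon>0$, the covering number $\mathcal{N}(\mathcal{H},\varepsilon)$ is defined to be the smallest integer $l\in\mathbb{N}$ such that there exist $l$ disks in $C(\mathcal{X})$ with radius $\varepsilon$ and centers in $\mathcal{H}$ covering the set $\mathcal{H}$. We shall assume that for some constants $p>0$ and $A_p>0$, there holds
$$\log\mathcal{N}(\mathcal{H},\varepsilon)\le A_p\varepsilon^{-p},\qquad \forall\,\varepsilon>0. \tag{5}$$
The behavior (5) of the covering numbers is typical in learning theory. It is satisfied by balls of Sobolev spaces on $\mathcal{X}\subset\mathbb{R}^n$ and reproducing kernel Hilbert spaces associated with Sobolev smooth kernels. See Anthony and Bartlett (1999), Zhou (2002), Zhou (2003) and Yao (2010). We remark that empirical covering numbers might be used together with concentration inequalities to provide sharper error estimates. This is however beyond our scope and for simplicity we adopt the covering number in $C(\mathcal{X})$ throughout this paper.
Theorem 5 Assume (2), (3) and covering number condition (5) for some $p>0$. Then for any $0<\eta\le 1$ and $0<\delta<1$, with confidence $1-\delta$ we have
$$\mathrm{var}[f_{\mathbf{z}}(X)-f_\rho(X)]\le \widetilde{C}_{\mathcal{H}}\,\eta^{(2-q)/2}\Big(h^{-\min\{q-2,2\}}+h\,m^{-\frac{1}{1+p}}\Big)\log\frac{2}{\delta}+(1+\eta)\mathcal{D}_{\mathcal{H}}(f_\rho). \tag{6}$$
If $|Y|\le M$ almost surely for some $M>0$, then with confidence $1-\delta$ we have
$$\mathrm{var}[f_{\mathbf{z}}(X)-f_\rho(X)]\le \frac{\widetilde{C}_{\mathcal{H}}}{\eta}\Big(h^{-2}+m^{-\frac{1}{1+p}}\Big)\log\frac{2}{\delta}+(1+\eta)\mathcal{D}_{\mathcal{H}}(f_\rho). \tag{7}$$
Here $\widetilde{C}_{\mathcal{H}}$ is a constant independent of $m,\delta,\eta$ or $h$ (depending on $\mathcal{H},G,\rho$ and given explicitly in the proof).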
Remark 6 In Theorem 5, we use a parameter $\eta>0$ in error bounds (6) and (7) to show that the bounds consist of two terms, one of which is essentially the approximation error $\mathcal{D}_{\mathcal{H}}(f_\rho)$ since $\eta$ can be arbitrarily small. The reader can simply set $\eta=1$ to get the main ideas of our analysis.
If moment condition (2) with $q\ge 4$ is satisfied and $\eta=1$, then by taking $h=m^{\frac{1}{3(1+p)}}$, (6) becomes
$$\mathrm{var}[f_{\mathbf{z}}(X)-f_\rho(X)]\le 2\widetilde{C}_{\mathcal{H}}\Big(\frac{1}{m}\Big)^{\frac{2}{3(1+p)}}\log\frac{2}{\delta}+2\mathcal{D}_{\mathcal{H}}(f_\rho). \tag{8}$$
If $|Y|\le M$ almost surely, then by taking $h=m^{\frac{1}{2(1+p)}}$ and $\eta=1$, error bound (7) becomes
$$\mathrm{var}[f_{\mathbf{z}}(X)-f_\rho(X)]\le 2\widetilde{C}_{\mathcal{H}}\,m^{-\frac{1}{1+p}}\log\frac{2}{\delta}+2\mathcal{D}_{\mathcal{H}}(f_\rho). \tag{9}$$
Remark 7 When the index $p$ in covering number condition (5) is small enough (the case when $\mathcal{H}$ is a finite ball of a reproducing kernel Hilbert space with a smooth kernel), we see that the power indices for the sample error terms of convergence rates (8) and (9) can be arbitrarily close to $2/3$ and $1$, respectively. There is a gap in the rates between the case of (2) with large $q$ and the uniformly bounded case. This gap is caused by the Parzen windowing process for which our method does not lead to better estimates when $q>4$. It would be interesting to know whether the gap can be narrowed.
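The choices of $h$ in (8) and (9) come from balancing the two sample error terms in (6) and (7). The sketch below spells out this arithmetic; the function names are hypothetical, and the general-$q^*$ branch is our own restatement of the balancing condition $h^{-q^*}=h\,m^{-1/(1+p)}$, which reduces to $h=m^{1/(3(1+p))}$ when $q^*=2$.

```python
def balanced_scaling(m, p, q_star=2.0, bounded=False):
    """Scaling parameter h that equalizes the two sample-error terms of Theorem 5.

    General case (6): h^{-q*} = h * m^{-1/(1+p)}  gives  h = m^{1/((1+p)(q*+1))}.
    Bounded case (7): h^{-2} = m^{-1/(1+p)}       gives  h = m^{1/(2(1+p))}.
    """
    if bounded:
        return m ** (1.0 / (2.0 * (1.0 + p)))
    return m ** (1.0 / ((1.0 + p) * (q_star + 1.0)))

def rate_exponent(p, q_star=2.0, bounded=False):
    """Exponent r such that the balanced sample-error term decays like m^{-r}."""
    if bounded:
        return 1.0 / (1.0 + p)
    return q_star / ((1.0 + p) * (q_star + 1.0))
```

For small $p$ the two exponents approach $2/3$ and $1$, which is the gap discussed in Remark 7.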
Note the result in Theorem 5 does not guarantee that $f_{\mathbf{z}}$ itself approximates $f_\rho$ well when the bounds are small. Instead a constant adjustment is required. Theoretically the best constant is $\mathbb{E}[f_{\mathbf{z}}(X)-f_\rho(X)]$. In practice it is usually approximated by the sample mean $\frac{1}{m}\sum_{i=1}^m(f_{\mathbf{z}}(x_i)-y_i)$ in the case of uniformly bounded noise and the approximation can be easily handled. To deal with heavy tailed noise, we project the output values onto the closed interval $[-\sqrt{m},\sqrt{m}]$ by the projection $\pi_{\sqrt{m}}:\mathbb{R}\to\mathbb{R}$ defined by
$$\pi_{\sqrt{m}}(y)=\begin{cases} y, & \text{if } y\in[-\sqrt{m},\sqrt{m}],\\ \sqrt{m}, & \text{if } y>\sqrt{m},\\ -\sqrt{m}, & \text{if } y<-\sqrt{m},\end{cases}$$
and then approximate $\mathbb{E}[f_{\mathbf{z}}(X)-f_\rho(X)]$ by the computable quantity
$$\frac{1}{m}\sum_{i=1}^m\big(f_{\mathbf{z}}(x_i)-\pi_{\sqrt{m}}(y_i)\big). \tag{10}$$
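Theorem 8 Assume $\mathbb{E}[|Y|^2]<\infty$ and covering number condition (5) for some $p>0$. Then for any $0<\delta<1$, with confidence $1-\delta$ we have
$$\sup_{f\in\mathcal{H}}\left|\frac{1}{m}\sum_{i=1}^m\big(f(x_i)-\pi_{\sqrt{m}}(y_i)\big)-\mathbb{E}[f(X)-f_\rho(X)]\right|\le \widetilde{C}'_{\mathcal{H}}\,m^{-\frac{1}{2+p}}\log\frac{2}{\delta} \tag{11}$$
which implies in particular that
$$\left|\frac{1}{m}\sum_{i=1}^m\big(f_{\mathbf{z}}(x_i)-\pi_{\sqrt{m}}(y_i)\big)-\mathbb{E}[f_{\mathbf{z}}(X)-f_\rho(X)]\right|\le \widetilde{C}'_{\mathcal{H}}\,m^{-\frac{1}{2+p}}\log\frac{2}{\delta}, \tag{12}$$
where $\widetilde{C}'_{\mathcal{H}}$ is the constant given by
$$\widetilde{C}'_{\mathcal{H}}=7\sup_{f\in\mathcal{H}}\|f\|_\infty+4+7\sqrt{\mathbb{E}[|Y|^2]}+\mathbb{E}[|Y|^2]+A_p^{\frac{1}{2+p}}.$$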
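Replacing the mean $\mathbb{E}[f_{\mathbf{z}}(X)-f_\rho(X)]$ by the quantity (10), we define an estimator of $f_\rho$ as
$$\widetilde{f}_{\mathbf{z}}=f_{\mathbf{z}}-\frac{1}{m}\sum_{i=1}^m\big(f_{\mathbf{z}}(x_i)-\pi_{\sqrt{m}}(y_i)\big).$$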
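The projection $\pi_{\sqrt{m}}$, the computable quantity (10) and the adjusted estimator can be sketched in a few lines. The function names below are hypothetical; the sample is assumed to be a list of pairs $(x_i,y_i)$ and the predictor any callable.

```python
import math

def project(y, m):
    """Projection of y onto the closed interval [-sqrt(m), sqrt(m)]."""
    bound = math.sqrt(m)
    return max(-bound, min(bound, y))

def constant_adjustment(f, sample):
    """Quantity (10): sample mean of f(x_i) - pi_{sqrt(m)}(y_i)."""
    m = len(sample)
    return sum(f(x) - project(y, m) for x, y in sample) / m

def adjusted_estimator(f, sample):
    """Predictor obtained by subtracting the constant (10) from f."""
    shift = constant_adjustment(f, sample)
    return lambda x: f(x) - shift
```

For instance, if the sample lies on the line $y=2x+1$ with all $|y_i|\le\sqrt{m}$ and $f(x)=2x$, the constant (10) equals $-1$ and the adjusted predictor recovers $2x+1$.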
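Putting (12) and the bounds from Theorem 5 into the obvious error expression
$$\big\|\widetilde{f}_{\mathbf{z}}-f_\rho\big\|_{L^2_{\rho_X}}\le\left|\frac{1}{m}\sum_{i=1}^m\big(f_{\mathbf{z}}(x_i)-\pi_{\sqrt{m}}(y_i)\big)-\mathbb{E}[f_{\mathbf{z}}(X)-f_\rho(X)]\right|+\sqrt{\mathrm{var}[f_{\mathbf{z}}(X)-f_\rho(X)]}, \tag{13}$$
we see that $\widetilde{f}_{\mathbf{z}}$ is a good estimator of $f_\rho$: the power index $\frac{1}{2+p}$ in (12) is greater than $\frac{1}{2(1+p)}$, the power index appearing in the last term of (13) when the variance term is bounded by (9), even in the uniformly bounded case.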
To interpret our main results better we present a corollary and an example below.
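If there is a constant $c_\rho$ such that $f_\rho+c_\rho\in\mathcal{H}$, we have $\mathcal{D}_{\mathcal{H}}(f_\rho)=0$. In this case, the choice $\eta=1$ in Theorem 5 yields the following learning rate. Note that (2) implies $\mathbb{E}[|Y|^2]<\infty$.
Corollary 9 Assume (5) with some $p>0$ and $f_\rho+c_\rho\in\mathcal{H}$ for some constant $c_\rho\in\mathbb{R}$. Under conditions (2) and (3), by taking $h=m^{\frac{1}{(1+p)\min\{q-1,3\}}}$, we have with confidence $1-\delta$,
$$\big\|\widetilde{f}_{\mathbf{z}}-f_\rho\big\|_{L^2_{\rho_X}}\le\Big(\widetilde{C}'_{\mathcal{H}}+\sqrt{2\widetilde{C}_{\mathcal{H}}}\Big)\,m^{-\frac{\min\{q-2,2\}}{2(1+p)\min\{q-1,3\}}}\log\frac{2}{\delta}.$$
If $|Y|\le M$ almost surely, then by taking $h=m^{\frac{1}{2(1+p)}}$, we have with confidence $1-\delta$,
$$\big\|\widetilde{f}_{\mathbf{z}}-f_\rho\big\|_{L^2_{\rho_X}}\le\Big(\widetilde{C}'_{\mathcal{H}}+\sqrt{2\widetilde{C}_{\mathcal{H}}}\Big)\,m^{-\frac{1}{2(1+p)}}\log\frac{2}{\delta}.$$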
This corollary states that $\widetilde{f}_{\mathbf{z}}$ can approximate the regression function very well. Note, however, this happens when the hypothesis space is chosen appropriately and the parameter $h$ tends to infinity.
A special example of the hypothesis space is a ball of a Sobolev space $H^s(\mathcal{X})$ with index $s>\frac{n}{2}$ on a domain $\mathcal{X}\subset\mathbb{R}^n$ which satisfies (5) with $p=\frac{n}{s}$. When $s$ is large enough, the positive index $\frac{n}{s}$ can be arbitrarily small. Then the power exponent of the following convergence rate can be arbitrarily close to $\frac{1}{3}$ when $\mathbb{E}[|Y|^4]<\infty$, and $\frac{1}{2}$ when $|Y|\le M$ almost surely.
Example 1 Let $\mathcal{X}$ be a bounded domain of $\mathbb{R}^n$ with Lipschitz boundary. Assume $f_\rho\in H^s(\mathcal{X})$ for some $s>\frac{n}{2}$ and take $\mathcal{H}=\{f\in H^s(\mathcal{X}):\|f\|_{H^s(\mathcal{X})}\le R\}$ with $R\ge\|f_\rho\|_{H^s(\mathcal{X})}$ and $R\ge 1$. If $\mathbb{E}[|Y|^4]<\infty$, then by taking $h=m^{\frac{1}{3(1+n/s)}}$, we have with confidence $1-\delta$,
$$\big\|\widetilde{f}_{\mathbf{z}}-f_\rho\big\|_{L^2_{\rho_X}}\le C_{s,n,\rho}\,R^{\frac{n}{2(s+n)}}\,m^{-\frac{1}{3(1+n/s)}}\log\frac{2}{\delta}.$$
If $|Y|\le M$ almost surely, then by taking $h=m^{\frac{1}{2(1+n/s)}}$, with confidence $1-\delta$,
$$\big\|\widetilde{f}_{\mathbf{z}}-f_\rho\big\|_{L^2_{\rho_X}}\le C_{s,n,\rho}\,R^{\frac{n}{2(s+n)}}\,m^{-\frac{1}{2+2n/s}}\log\frac{2}{\delta}.$$
Here the constant $C_{s,n,\rho}$ is independent of $R$.
Compared to the analysis of least squares methods, our consistency results for the MEE algorithm require a weaker condition by allowing heavy tailed noise, while the convergence rates are comparable but slightly worse than the optimal one $O(m^{-\frac{1}{2+n/s}})$. Further investigation of error analysis for the MEE algorithm is required to achieve the optimal rate, which is beyond the scope of this paper.
3. Technical Difficulties in MEE and Novelties
The MEE algorithm (1) involving sample pairs like quadratic forms is different from most classical ERM learning algorithms (Vapnik, 1998; Anthony and Bartlett, 1999) constructed by sums of independent random variables. But as done for some ranking algorithms (Agarwal and Niyogi, 2009; Clemencon et al., 2005), one can still follow the same line to define a functional called generalization error or information error (related to information potential defined on page 88 of Principe, 2010) associated with the windowing function $G$ over the space of measurable functions on $\mathcal{X}$ as
$$\mathcal{E}^{(h)}(f)=\int_{\mathcal{Z}}\int_{\mathcal{Z}}-h^2G\left(\frac{[(y-f(x))-(y'-f(x'))]^2}{2h^2}\right)d\rho(x,y)\,d\rho(x',y').$$
An essential barrier for our consistency analysis is an observation made by numerical simulations (Erdogmus and Principe, 2003; Silva et al., 2010) and verified mathematically for Shannon's entropy in Chen and Principe (2012) that the regression function $f_\rho$ may not be a minimizer of $\mathcal{E}^{(h)}$. It is totally different from the classical least squares generalization error $\mathcal{E}^{ls}(f)=\int_{\mathcal{Z}}(f(x)-y)^2d\rho$ which satisfies a nice identity $\mathcal{E}^{ls}(f)-\mathcal{E}^{ls}(f_\rho)=\|f-f_\rho\|^2_{L^2_{\rho_X}}\ge 0$. This barrier leads to three technical difficulties in our error analysis which will be overcome by our novel approaches making full use of the special feature that the MEE scaling parameter $h$ is large in this paper.
3.1 Approximation of Information Error
The first technical difficulty we meet in our mathematical analysis for MEE algorithm (1) is the varying form depending on the windowing function $G$. Our novel approach here is an approximation of the information error in terms of the variance $\mathrm{var}[f(X)-f_\rho(X)]$ when $h$ is large. This is achieved by showing that $\mathcal{E}^{(h)}$ is closely related to the following symmetrized least squares error, which is related to some ranking algorithms.
Definition 10 The symmetrized least squares error is defined on the space $L^2_{\rho_X}$ by
$$\mathcal{E}^{sls}(f)=\int_{\mathcal{Z}}\int_{\mathcal{Z}}\big[(y-f(x))-(y'-f(x'))\big]^2d\rho(x,y)\,d\rho(x',y'),\qquad f\in L^2_{\rho_X}.$$
To give the approximation of $\mathcal{E}^{(h)}$, we need a simpler form of $\mathcal{E}^{sls}$.
Lemma 11 If $\mathbb{E}[Y^2]<\infty$, then by denoting $C_\rho=\int_{\mathcal{Z}}[y-f_\rho(x)]^2d\rho$, we have
$$\mathcal{E}^{sls}(f)=2\,\mathrm{var}[f(X)-f_\rho(X)]+2C_\rho,\qquad\forall f\in L^2_{\rho_X}.$$
Proof Recall that for two independent and identically distributed samples $\xi$ and $\xi'$ of a random variable, one has the identity
$$\mathbb{E}[(\xi-\xi')^2]=2\,\mathbb{E}[(\xi-\mathbb{E}\xi)^2]=2\,\mathrm{var}(\xi).$$
Then we have
$$\mathcal{E}^{sls}(f)=\mathbb{E}\Big[\big((y-f(x))-(y'-f(x'))\big)^2\Big]=2\,\mathrm{var}[Y-f(X)].$$
By the definition $\mathbb{E}[Y|X]=f_\rho(X)$, it is easy to see that $C_\rho=\mathrm{var}(Y-f_\rho(X))$ and the covariance between $Y-f_\rho(X)$ and $f_\rho(X)-f(X)$ vanishes. So $\mathrm{var}[Y-f(X)]=\mathrm{var}[Y-f_\rho(X)]+\mathrm{var}[f(X)-f_\rho(X)]$. This proves the desired identity.
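The identity of Lemma 11 can be checked exactly on a finite distribution. The sketch below does so with rational arithmetic; the distribution `rho`, the test function `f` and all function names are hypothetical choices made for this illustration.

```python
from fractions import Fraction

def regression_function(rho):
    """Map x -> E[Y | X = x] for a finite distribution rho = {(x, y): probability}."""
    mass, moment = {}, {}
    for (x, y), prob in rho.items():
        mass[x] = mass.get(x, 0) + prob
        moment[x] = moment.get(x, 0) + prob * y
    return {x: moment[x] / mass[x] for x in mass}

def symmetrized_ls_error(f, rho):
    """Double sum defining the symmetrized least squares error of f."""
    return sum(p1 * p2 * ((y1 - f(x1)) - (y2 - f(x2))) ** 2
               for (x1, y1), p1 in rho.items()
               for (x2, y2), p2 in rho.items())

def variance_of_difference(f, rho):
    """var[f(X) - f_rho(X)]."""
    f_rho = regression_function(rho)
    mean = sum(p * (f(x) - f_rho[x]) for (x, _), p in rho.items())
    second = sum(p * (f(x) - f_rho[x]) ** 2 for (x, _), p in rho.items())
    return second - mean ** 2

def noise_constant(rho):
    """C_rho = E[(Y - f_rho(X))^2]."""
    f_rho = regression_function(rho)
    return sum(p * (y - f_rho[x]) ** 2 for (x, y), p in rho.items())

rho = {(0, 0): Fraction(1, 4), (0, 2): Fraction(1, 4),
       (1, 1): Fraction(1, 3), (1, 4): Fraction(1, 6)}
f = lambda x: 3 * x - 1
lhs = symmetrized_ls_error(f, rho)
rhs = 2 * variance_of_difference(f, rho) + 2 * noise_constant(rho)
```

Because all quantities are `Fraction` objects, `lhs` and `rhs` agree exactly rather than up to rounding.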
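We are in a position to present the approximation of $\mathcal{E}^{(h)}$ for which a large scaling parameter $h$ plays an important role. Since $\mathcal{H}$ is a compact subset of $C(\mathcal{X})$, we know that the number $\sup_{f\in\mathcal{H}}\|f\|_\infty$ is finite.
Lemma 12 Under assumptions (2) and (3), for any essentially bounded measurable function $f$ on $\mathcal{X}$, we have
$$\Big|\mathcal{E}^{(h)}(f)+h^2G(0)-C_\rho-\mathrm{var}[f(X)-f_\rho(X)]\Big|\le 5\cdot 2^7C_G\Big((\mathbb{E}[|Y|^q])^{\frac{q^*+2}{q}}+\|f\|_\infty^{q^*+2}\Big)h^{-q^*}.$$
In particular,
$$\Big|\mathcal{E}^{(h)}(f)+h^2G(0)-C_\rho-\mathrm{var}[f(X)-f_\rho(X)]\Big|\le C'_{\mathcal{H}}h^{-q^*},\qquad\forall f\in\mathcal{H},$$
where $C'_{\mathcal{H}}$ is the constant depending on $\rho,G,q$ and $\mathcal{H}$ given by
$$C'_{\mathcal{H}}=5\cdot 2^7C_G\bigg((\mathbb{E}[|Y|^q])^{(q^*+2)/q}+\Big(\sup_{f\in\mathcal{H}}\|f\|_\infty\Big)^{q^*+2}\bigg).$$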
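Proof Observe that $q^*+2=\min\{q,4\}\in(2,4]$. By the Taylor expansion and the mean value theorem, we have
$$|G(t)-G(0)-G'_+(0)t|\le\begin{cases}\dfrac{\|G''\|_\infty}{2}t^2\le\dfrac{\|G''\|_\infty}{2}t^{(q^*+2)/2}, & \text{if } 0\le t\le 1,\\[2mm] 2\|G'\|_\infty t\le 2\|G'\|_\infty t^{(q^*+2)/2}, & \text{if } t>1.\end{cases}$$
So $|G(t)-G(0)-G'_+(0)t|\le\Big(\frac{\|G''\|_\infty}{2}+2\|G'\|_\infty\Big)t^{(q^*+2)/2}$ for all $t\ge 0$, and by setting $t=\frac{[(y-f(x))-(y'-f(x'))]^2}{2h^2}$, we know that
$$\begin{aligned}
&\left|\mathcal{E}^{(h)}(f)+h^2G(0)+\int_{\mathcal{Z}}\int_{\mathcal{Z}}G'_+(0)\frac{[(y-f(x))-(y'-f(x'))]^2}{2}\,d\rho(x,y)\,d\rho(x',y')\right|\\
&\quad\le\Big(\frac{\|G''\|_\infty}{2}+2\|G'\|_\infty\Big)h^{-q^*}2^{-\frac{q^*+2}{2}}\int_{\mathcal{Z}}\int_{\mathcal{Z}}\big|(y-f(x))-(y'-f(x'))\big|^{q^*+2}d\rho(x,y)\,d\rho(x',y')\\
&\quad\le\Big(\frac{\|G''\|_\infty}{2}+2\|G'\|_\infty\Big)h^{-q^*}2^8\left\{\int_{\mathcal{Z}}|y|^{q^*+2}d\rho+\|f\|_\infty^{q^*+2}\right\}.
\end{aligned}$$
This together with Lemma 11, the normalization assumption $G'_+(0)=-1$ and Hölder's inequality applied when $q>4$ proves the desired bound and hence our conclusion.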
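The approximation in Lemma 12 can be observed numerically on a finite distribution. The sketch below uses the Gaussian window $G(t)=\exp\{-t\}$, for which $G(0)=1$, and compares $\mathcal{E}^{(h)}(f)+h^2$ with $C_\rho+\mathrm{var}[f(X)-f_\rho(X)]=\mathrm{var}[Y-f(X)]$. The distribution, the function names and the comparison bound are assumptions for illustration: instead of the constant of Lemma 12 we use the elementary inequality $|e^{-t}-1+t|\le t^2/2$ for $t\ge 0$, which gives a gap of at most $\mathbb{E}|\Delta|^4/(8h^2)$ with $\Delta=(y-f(x))-(y'-f(x'))$.

```python
import math

def pair_moment(f, rho, power):
    """E |(y - f(x)) - (y' - f(x'))|^power over independent pairs drawn from rho."""
    return sum(p1 * p2 * abs((y1 - f(x1)) - (y2 - f(x2))) ** power
               for (x1, y1), p1 in rho.items()
               for (x2, y2), p2 in rho.items())

def information_error(f, rho, h):
    """Information error E^{(h)}(f) for the Gaussian window G(t) = exp(-t)."""
    return sum(-h * h * math.exp(-((y1 - f(x1)) - (y2 - f(x2))) ** 2 / (2.0 * h * h)) * p1 * p2
               for (x1, y1), p1 in rho.items()
               for (x2, y2), p2 in rho.items())

def approximation_gap(f, rho, h):
    """|E^{(h)}(f) + h^2 G(0) - var[Y - f(X)]|, using var[Y - f(X)] = E[Delta^2] / 2."""
    limit = 0.5 * pair_moment(f, rho, 2)
    return abs(information_error(f, rho, h) + h * h - limit)

rho_demo = {(0, 0.0): 0.25, (0, 2.0): 0.25, (1, 1.0): 0.3, (1, 4.0): 0.2}
f_demo = lambda x: 3.0 * x - 1.0
gaps = [approximation_gap(f_demo, rho_demo, h) for h in (1.0, 4.0, 16.0)]
```

The list `gaps` decreases as $h$ grows, which is the mechanism that the analysis exploits when the scaling parameter is large.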
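Applying Lemma 12 to a function $f\in\mathcal{H}$ and $f_\rho\in L^\infty_{\rho_X}$ yields the following fact on the excess generalization error $\mathcal{E}^{(h)}(f)-\mathcal{E}^{(h)}(f_\rho)$.
Theorem 13 Under assumptions (2) and (3), we have
$$\Big|\mathcal{E}^{(h)}(f)-\mathcal{E}^{(h)}(f_\rho)-\mathrm{var}[f(X)-f_\rho(X)]\Big|\le C''_{\mathcal{H}}h^{-q^*},\qquad\forall f\in\mathcal{H},$$
where $C''_{\mathcal{H}}$ is the constant depending on $\rho,G,q$ and $\mathcal{H}$ given by
$$C''_{\mathcal{H}}=5\cdot 2^8C_G\bigg((\mathbb{E}[|Y|^q])^{(q^*+2)/q}+\Big(\sup_{f\in\mathcal{H}}\|f\|_\infty\Big)^{q^*+2}+\|f_\rho\|_\infty^{q^*+2}\bigg).$$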
3.2 Functional Minimizer and Best Approximation
As $f_\rho$ may not be a minimizer of $\mathcal{E}^{(h)}$, the second technical difficulty in our error analysis is the diversity of two ways to define a target function in $\mathcal{H}$, one to minimize the information error and the other to minimize the variance $\mathrm{var}[f(X)-f_\rho(X)]$. These possible candidates for the target function are defined as
$$f_{\mathcal{H}}:=\arg\min_{f\in\mathcal{H}}\mathcal{E}^{(h)}(f),\qquad f_{approx}:=\arg\min_{f\in\mathcal{H}}\mathrm{var}[f(X)-f_\rho(X)].$$
Our novelty to overcome the technical difficulty is to show that when the MEE scaling parameter $h$ is large, these two functions are close to each other in the following sense.
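Theorem 14 Under assumptions (2) and (3), we have
$$\mathcal{E}^{(h)}(f_{approx})\le\mathcal{E}^{(h)}(f_{\mathcal{H}})+2C''_{\mathcal{H}}h^{-q^*}$$
and
$$\mathrm{var}[f_{\mathcal{H}}(X)-f_\rho(X)]\le\mathrm{var}[f_{approx}(X)-f_\rho(X)]+2C''_{\mathcal{H}}h^{-q^*}.$$
Proof By Theorem 13 and the definitions of $f_{\mathcal{H}}$ and $f_{approx}$, we have
$$\begin{aligned}
\mathcal{E}^{(h)}(f_{\mathcal{H}})-\mathcal{E}^{(h)}(f_\rho)&\le\mathcal{E}^{(h)}(f_{approx})-\mathcal{E}^{(h)}(f_\rho)\le\mathrm{var}[f_{approx}(X)-f_\rho(X)]+C''_{\mathcal{H}}h^{-q^*}\\
&\le\mathrm{var}[f_{\mathcal{H}}(X)-f_\rho(X)]+C''_{\mathcal{H}}h^{-q^*}\\
&\le\mathcal{E}^{(h)}(f_{\mathcal{H}})-\mathcal{E}^{(h)}(f_\rho)+2C''_{\mathcal{H}}h^{-q^*}\le\mathrm{var}[f_{approx}(X)-f_\rho(X)]+3C''_{\mathcal{H}}h^{-q^*}.
\end{aligned}$$
Then the desired inequalities follow.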
Moreover, Theorem 13 yields the following error decomposition for our algorithm.
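Lemma 15 Under assumptions (2) and (3), we have
$$\mathrm{var}[f_{\mathbf{z}}(X)-f_\rho(X)]\le\Big\{\mathcal{E}^{(h)}(f_{\mathbf{z}})-\mathcal{E}^{(h)}(f_{\mathcal{H}})\Big\}+\mathrm{var}[f_{approx}(X)-f_\rho(X)]+2C''_{\mathcal{H}}h^{-q^*}. \tag{14}$$
Proof By Theorem 13,
$$\begin{aligned}
\mathrm{var}[f_{\mathbf{z}}(X)-f_\rho(X)]&\le\mathcal{E}^{(h)}(f_{\mathbf{z}})-\mathcal{E}^{(h)}(f_\rho)+C''_{\mathcal{H}}h^{-q^*}\\
&\le\Big\{\mathcal{E}^{(h)}(f_{\mathbf{z}})-\mathcal{E}^{(h)}(f_{\mathcal{H}})\Big\}+\mathcal{E}^{(h)}(f_{\mathcal{H}})-\mathcal{E}^{(h)}(f_\rho)+C''_{\mathcal{H}}h^{-q^*}.
\end{aligned}$$
Since $f_{approx}\in\mathcal{H}$, the definition of $f_{\mathcal{H}}$ tells us that
$$\mathcal{E}^{(h)}(f_{\mathcal{H}})-\mathcal{E}^{(h)}(f_\rho)\le\mathcal{E}^{(h)}(f_{approx})-\mathcal{E}^{(h)}(f_\rho).$$
Applying Theorem 13 to the above bound implies
$$\mathrm{var}[f_{\mathbf{z}}(X)-f_\rho(X)]\le\Big\{\mathcal{E}^{(h)}(f_{\mathbf{z}})-\mathcal{E}^{(h)}(f_{\mathcal{H}})\Big\}+\mathrm{var}[f_{approx}(X)-f_\rho(X)]+2C''_{\mathcal{H}}h^{-q^*}.$$
Then desired error decomposition (14) follows.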
Error decomposition has been a standard technique to analyze least squares ERM regression algorithms (Anthony and Bartlett, 1999; Cucker and Zhou, 2007; Smale and Zhou, 2009; Ying, 2007). In error decomposition (14) for MEE learning algorithm (1), the first term on the right side is the sample error, the second term $\mathrm{var}[f_{approx}(X)-f_\rho(X)]$ is the approximation error, while the last extra term $2C''_{\mathcal{H}}h^{-q^*}$ is caused by the Parzen windowing and is small when $h$ is large.
3.3 Error Decomposition by U-statistics and Special Properties
We shall decompose the sample error term $\mathcal{E}^{(h)}(f_{\mathbf{z}})-\mathcal{E}^{(h)}(f_{\mathcal{H}})$ further by means of U-statistics defined for $f\in\mathcal{H}$ and the sample $\mathbf{z}$ as
$$V_f(\mathbf{z})=\frac{1}{m(m-1)}\sum_{i=1}^m\sum_{j\neq i}U_f(z_i,z_j),$$
where $U_f$ is a kernel given with $z=(x,y)$, $z'=(x',y')\in\mathcal{Z}$ by
$$U_f(z,z')=-h^2G\left(\frac{[(y-f(x))-(y'-f(x'))]^2}{2h^2}\right)+h^2G\left(\frac{[(y-f_\rho(x))-(y'-f_\rho(x'))]^2}{2h^2}\right). \tag{15}$$
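It is easy to see that $\mathbb{E}[V_f]=\mathcal{E}^{(h)}(f)-\mathcal{E}^{(h)}(f_\rho)$ and $U_f(z,z)=0$. Then
$$\mathcal{E}^{(h)}(f_{\mathbf{z}})-\mathcal{E}^{(h)}(f_{\mathcal{H}})=\mathbb{E}[V_{f_{\mathbf{z}}}]-\mathbb{E}[V_{f_{\mathcal{H}}}]=\mathbb{E}[V_{f_{\mathbf{z}}}]-V_{f_{\mathbf{z}}}+V_{f_{\mathbf{z}}}-V_{f_{\mathcal{H}}}+V_{f_{\mathcal{H}}}-\mathbb{E}[V_{f_{\mathcal{H}}}].$$
By the definition of $f_{\mathbf{z}}$, we have $V_{f_{\mathbf{z}}}-V_{f_{\mathcal{H}}}\le 0$. Hence
$$\mathcal{E}^{(h)}(f_{\mathbf{z}})-\mathcal{E}^{(h)}(f_{\mathcal{H}})\le\mathbb{E}[V_{f_{\mathbf{z}}}]-V_{f_{\mathbf{z}}}+V_{f_{\mathcal{H}}}-\mathbb{E}[V_{f_{\mathcal{H}}}]. \tag{16}$$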
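The kernel (15) and the U-statistic $V_f$ are simple to evaluate once a reference function playing the role of $f_\rho$ is fixed. In the sketch below the Gaussian window, the sample and the reference function `f_rho` are assumptions for illustration (in practice $f_\rho$ is unknown and $V_f$ serves as an analysis tool rather than a computable quantity).

```python
import math

def kernel_u(f, f_rho, z, z2, h, window=lambda t: math.exp(-t)):
    """Kernel U_f(z, z') of (15) for z = (x, y) and z2 = (x', y')."""
    (x, y), (x2, y2) = z, z2
    d_f = (y - f(x)) - (y2 - f(x2))
    d_rho = (y - f_rho(x)) - (y2 - f_rho(x2))
    return h * h * (window(d_rho ** 2 / (2.0 * h * h)) - window(d_f ** 2 / (2.0 * h * h)))

def u_statistic(f, f_rho, sample, h):
    """V_f(z): average of U_f(z_i, z_j) over all ordered pairs with i != j."""
    m = len(sample)
    total = sum(kernel_u(f, f_rho, sample[i], sample[j], h)
                for i in range(m) for j in range(m) if i != j)
    return total / (m * (m - 1))

demo_sample = [(0.0, 0.1), (0.5, 1.2), (1.0, 1.9), (1.5, 3.3)]
demo_f_rho = lambda x: 2.0 * x   # hypothetical regression function
demo_f = lambda x: x             # hypothetical candidate in the hypothesis space
v_value = u_statistic(demo_f, demo_f_rho, demo_sample, h=3.0)
```

Since the second term of (15) does not depend on $f$, ordering candidates by $V_f(\mathbf{z})$ amounts to ordering them by the double sum in (1) with the diagonal terms removed, which is why the minimizer $f_{\mathbf{z}}$ satisfies $V_{f_{\mathbf{z}}}\le V_{f_{\mathcal{H}}}$.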
The above bound will be estimated by a uniform ratio probability inequality. A technical difficulty we meet here is the possibility that $\mathbb{E}[V_f]=\mathcal{E}^{(h)}(f)-\mathcal{E}^{(h)}(f_\rho)$ might be negative since $f_\rho$ may not be a minimizer of $\mathcal{E}^{(h)}$. It is overcome by the following novel observation which is an immediate consequence of Theorem 13.
Lemma 16 Under assumptions (2) and (3), if $\varepsilon\ge C''_{\mathcal{H}}h^{-q^*}$, then
$$\mathbb{E}[V_f]+2\varepsilon\ge\mathbb{E}[V_f]+C''_{\mathcal{H}}h^{-q^*}+\varepsilon\ge\mathrm{var}[f(X)-f_\rho(X)]+\varepsilon\ge\varepsilon,\qquad\forall f\in\mathcal{H}. \tag{17}$$
4. Sample Error Estimates
In this section, we follow (16) and estimate the sample error by a uniform ratio probability inequality based on the following Hoeffding's probability inequality for U-statistics (Hoeffding, 1963).
Lemma 17 If $U$ is a symmetric real-valued function on $\mathcal{Z}\times\mathcal{Z}$ satisfying $a\le U(z,z')\le b$ almost surely and $\mathrm{var}[U]=\sigma^2$, then for any $\varepsilon>0$,
$$\mathrm{Prob}\left\{\left|\frac{1}{m(m-1)}\sum_{i=1}^m\sum_{j\neq i}U(z_i,z_j)-\mathbb{E}[U]\right|\ge\varepsilon\right\}\le 2\exp\left\{-\frac{(m-1)\varepsilon^2}{4\sigma^2+(4/3)(b-a)\varepsilon}\right\}.$$
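To get a feeling for the inequality of Lemma 17, the following sketch evaluates its right-hand side and inverts it for the sample size. The function names and the numerical values in the usage note are assumptions for illustration only.

```python
import math

def u_statistic_tail_bound(m, eps, sigma2, a, b):
    """Right-hand side of Lemma 17 for sample size m, deviation eps, variance sigma2, range [a, b]."""
    return 2.0 * math.exp(-(m - 1) * eps ** 2 / (4.0 * sigma2 + (4.0 / 3.0) * (b - a) * eps))

def sample_size_for_confidence(eps, delta, sigma2, a, b):
    """Smallest integer m for which the bound of Lemma 17 is at most delta."""
    needed = (4.0 * sigma2 + (4.0 / 3.0) * (b - a) * eps) * math.log(2.0 / delta) / eps ** 2
    return math.ceil(needed) + 1
```

For example, `sample_size_for_confidence(0.1, 0.05, 1.0, -1.0, 1.0)` returns a sample size for which a kernel with variance 1 and values in $[-1,1]$ has its U-statistic within 0.1 of its mean with confidence at least 0.95, according to this bound.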
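To apply Lemma 17 we need to bound $\sigma^2$ and $b-a$ for the kernel $U_f$ defined by (15). Our novelty for getting sharp bounds is to use a Taylor expansion involving a $C^2$ function $\widetilde{G}$ on $\mathbb{R}$:
$$\widetilde{G}(w)=\widetilde{G}(0)+\widetilde{G}'(0)w+\int_0^w(w-t)\widetilde{G}''(t)\,dt,\qquad\forall w\in\mathbb{R}. \tag{18}$$
Denote a constant $A_{\mathcal{H}}$ depending on $\rho,G,q$ and $\mathcal{H}$ as
$$A_{\mathcal{H}}=9\cdot 2^8C_G^2\sup_{f\in\mathcal{H}}\|f-f_\rho\|_\infty^{\frac{4}{q}}\bigg((\mathbb{E}[|Y|^q])^{\frac{2}{q}}+\|f_\rho\|_\infty^2+\sup_{f\in\mathcal{H}}\|f-f_\rho\|_\infty^2\bigg).$$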
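Lemma 18 Assume (2) and (3).
(a) For any $f,g\in\mathcal{H}$, we have
$$|U_f|\le 4C_G\|f-f_\rho\|_\infty h\quad\text{and}\quad|U_f-U_g|\le 4C_G\|f-g\|_\infty h$$
and
$$\mathrm{var}[U_f]\le A_{\mathcal{H}}\big(\mathrm{var}[f(X)-f_\rho(X)]\big)^{(q-2)/q}.$$
(b) If $|Y|\le M$ almost surely for some constant $M>0$, then we have almost surely
$$|U_f|\le A'_{\mathcal{H}}\big|(f(x)-f_\rho(x))-(f(x')-f_\rho(x'))\big|,\qquad\forall f\in\mathcal{H} \tag{19}$$
and
$$|U_f-U_g|\le A'_{\mathcal{H}}\big|(f(x)-g(x))-(f(x')-g(x'))\big|,\qquad\forall f,g\in\mathcal{H}, \tag{20}$$
where $A'_{\mathcal{H}}$ is a constant depending on $\rho,G$ and $\mathcal{H}$ given by
$$A'_{\mathcal{H}}=36C_G\Big(M+\sup_{f\in\mathcal{H}}\|f\|_\infty\Big).$$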
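Proof Define a function $\widetilde{G}$ on $\mathbb{R}$ by
$$\widetilde{G}(t)=G(t^2/2),\qquad t\in\mathbb{R}.$$
We see that $\widetilde{G}\in C^2(\mathbb{R})$, $\widetilde{G}(0)=G(0)$, $\widetilde{G}'(0)=0$, $\widetilde{G}'(t)=tG'(t^2/2)$ and $\widetilde{G}''(t)=G'(t^2/2)+t^2G''(t^2/2)$. Moreover,
$$U_f(z,z')=-h^2\widetilde{G}\left(\frac{(y-f(x))-(y'-f(x'))}{h}\right)+h^2\widetilde{G}\left(\frac{(y-f_\rho(x))-(y'-f_\rho(x'))}{h}\right).$$
(a) We apply the mean value theorem and see that $|U_f(z,z')|\le 2h\|\widetilde{G}'\|_\infty\|f-f_\rho\|_\infty$. The inequality for $|U_f-U_g|$ is obtained when $f_\rho$ is replaced by $g$. Note that $\|\widetilde{G}'\|_\infty=\|tG'(t^2/2)\|_\infty$. Then the bounds for $U_f$ and $U_f-U_g$ are verified by noting $\|tG'(t^2/2)\|_\infty\le 2C_G$.
To bound the variance, we apply (18) to the two points $w_1=\frac{(y-f(x))-(y'-f(x'))}{h}$ and $w_2=\frac{(y-f_\rho(x))-(y'-f_\rho(x'))}{h}$. Writing $w_2-t$ as $w_2-w_1+w_1-t$, we see from $\widetilde{G}'(0)=0$ that
$$\begin{aligned}
U_f(z,z')&=h^2\big(\widetilde{G}(w_2)-\widetilde{G}(w_1)\big)=h^2\widetilde{G}'(0)(w_2-w_1)+h^2\int_0^{w_2}(w_2-t)\widetilde{G}''(t)\,dt-h^2\int_0^{w_1}(w_1-t)\widetilde{G}''(t)\,dt\\
&=h^2\int_0^{w_2}(w_2-w_1)\widetilde{G}''(t)\,dt+h^2\int_{w_1}^{w_2}(w_1-t)\widetilde{G}''(t)\,dt.
\end{aligned}$$
It follows that
$$|U_f(z,z')|\le\|\widetilde{G}''\|_\infty\big|(y-f_\rho(x))-(y'-f_\rho(x'))\big|\,\big|(f(x)-f_\rho(x))-(f(x')-f_\rho(x'))\big|+\|\widetilde{G}''\|_\infty\big|(f(x)-f_\rho(x))-(f(x')-f_\rho(x'))\big|^2. \tag{21}$$
Since $\mathbb{E}[|Y|^q]<\infty$, we apply Hölder's inequality and see that
$$\begin{aligned}
&\int_{\mathcal{Z}}\int_{\mathcal{Z}}\big|(y-f_\rho(x))-(y'-f_\rho(x'))\big|^2\big|(f(x)-f_\rho(x))-(f(x')-f_\rho(x'))\big|^2d\rho(z)\,d\rho(z')\\
&\quad\le\left\{\int_{\mathcal{Z}}\int_{\mathcal{Z}}\big|(y-f_\rho(x))-(y'-f_\rho(x'))\big|^qd\rho(z)\,d\rho(z')\right\}^{2/q}\left\{\int_{\mathcal{Z}}\int_{\mathcal{Z}}\big|(f(x)-f_\rho(x))-(f(x')-f_\rho(x'))\big|^{2q/(q-2)}d\rho(z)\,d\rho(z')\right\}^{1-2/q}\\
&\quad\le\Big\{4^{q+1}\big(\mathbb{E}[|Y|^q]+\|f_\rho\|_\infty^q\big)\Big\}^{2/q}\Big\{\|f-f_\rho\|_\infty^{4/(q-2)}\,2\,\mathrm{var}[f(X)-f_\rho(X)]\Big\}^{(q-2)/q}.
\end{aligned}$$
Here we have separated the power index $2q/(q-2)$ into the sum of $4/(q-2)$ and $2$. Then
$$\begin{aligned}
\mathrm{var}[U_f]\le\mathbb{E}[U_f^2]&\le 2\|\widetilde{G}''\|_\infty^2\,2^{\frac{5q+3}{q}}\big(\mathbb{E}[|Y|^q]+\|f_\rho\|_\infty^q\big)^{\frac{2}{q}}\|f-f_\rho\|_\infty^{\frac{4}{q}}\big(\mathrm{var}[f(X)-f_\rho(X)]\big)^{\frac{q-2}{q}}\\
&\quad+2\|\widetilde{G}''\|_\infty^2\,4\|f-f_\rho\|_\infty^2\,2\,\mathrm{var}[f(X)-f_\rho(X)].
\end{aligned}$$
Hence the desired inequality holds true since $\|\widetilde{G}''\|_\infty\le\|G'\|_\infty+\|t^2G''(t^2/2)\|_\infty\le 3C_G$ and $\mathrm{var}[f(X)-f_\rho(X)]\le\|f-f_\rho\|_\infty^2$.
(b) If $|Y|\le M$ almost surely for some constant $M>0$, then we see from (21) that almost surely $|U_f(z,z')|\le 4\|\widetilde{G}''\|_\infty(M+\|f_\rho\|_\infty+\|f-f_\rho\|_\infty)\big|(f(x)-f_\rho(x))-(f(x')-f_\rho(x'))\big|$. Hence (19) holds true almost surely. Replacing $f_\rho$ by $g$ in (21), we see immediately inequality (20). The proof of Lemma 18 is complete.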
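With the above preparation, we can now give the uniform ratio probability inequality for U-statistics to estimate the sample error, following methods in the learning theory literature (Haussler et al., 1994; Koltchinskii, 2006; Cucker and Zhou, 2007).
Lemma 19 Assume (2), (3) and $\varepsilon\ge C''_{\mathcal{H}}h^{-q^*}$. Then we have
$$\mathrm{Prob}\left\{\sup_{f\in\mathcal{H}}\frac{|V_f-\mathbb{E}[V_f]|}{(\mathbb{E}[V_f]+2\varepsilon)^{(q-2)/q}}>4\varepsilon^{2/q}\right\}\le 2\,\mathcal{N}\Big(\mathcal{H},\frac{\varepsilon}{4C_Gh}\Big)\exp\left\{-\frac{(m-1)\varepsilon}{A''_{\mathcal{H}}h}\right\},$$
where $A''_{\mathcal{H}}$ is the constant given by
$$A''_{\mathcal{H}}=4A_{\mathcal{H}}(C''_{\mathcal{H}})^{-2/q}+12C_G\sup_{f\in\mathcal{H}}\|f-f_\rho\|_\infty.$$
If $|Y|\le M$ almost surely for some constant $M>0$, then we have
$$\mathrm{Prob}\left\{\sup_{f\in\mathcal{H}}\frac{|V_f-\mathbb{E}[V_f]|}{\sqrt{\mathbb{E}[V_f]+2\varepsilon}}>4\sqrt{\varepsilon}\right\}\le 2\,\mathcal{N}\Big(\mathcal{H},\frac{\varepsilon}{2A'_{\mathcal{H}}}\Big)\exp\left\{-\frac{(m-1)\varepsilon}{A''_{\mathcal{H}}}\right\},$$
where $A''_{\mathcal{H}}$ is the constant given by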
$$A''_{\mathcal{H}}=8A'_{\mathcal{H}}+6A'_{\mathcal{H}}\sup_{f\in\mathcal{H}}\|f-f_\rho\|_\infty.$$
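Proof If $\|f-f_j\|_\infty\le\frac{\varepsilon}{4C_Gh}$, Lemma 18 (a) implies $|\mathbb{E}[V_f]-\mathbb{E}[V_{f_j}]|\le\varepsilon$ and $|V_f-V_{f_j}|\le\varepsilon$ almost surely. These in connection with Lemma 16 tell us that
$$\frac{|V_f-\mathbb{E}[V_f]|}{(\mathbb{E}[V_f]+2\varepsilon)^{(q-2)/q}}>4\varepsilon^{2/q}\ \Longrightarrow\ \frac{|V_{f_j}-\mathbb{E}[V_{f_j}]|}{(\mathbb{E}[V_{f_j}]+2\varepsilon)^{(q-2)/q}}>\varepsilon^{2/q}.$$
Thus by taking $\{f_j\}_{j=1}^N$ to be an $\frac{\varepsilon}{4C_Gh}$ net of the set $\mathcal{H}$ with $N$ being the covering number $\mathcal{N}\big(\mathcal{H},\frac{\varepsilon}{4C_Gh}\big)$, we find
$$\begin{aligned}
\mathrm{Prob}\left\{\sup_{f\in\mathcal{H}}\frac{|V_f-\mathbb{E}[V_f]|}{(\mathbb{E}[V_f]+2\varepsilon)^{(q-2)/q}}>4\varepsilon^{2/q}\right\}&\le\mathrm{Prob}\left\{\sup_{j=1,\ldots,N}\frac{|V_{f_j}-\mathbb{E}[V_{f_j}]|}{(\mathbb{E}[V_{f_j}]+2\varepsilon)^{(q-2)/q}}>\varepsilon^{2/q}\right\}\\
&\le\sum_{j=1,\ldots,N}\mathrm{Prob}\left\{\frac{|V_{f_j}-\mathbb{E}[V_{f_j}]|}{(\mathbb{E}[V_{f_j}]+2\varepsilon)^{(q-2)/q}}>\varepsilon^{2/q}\right\}.
\end{aligned}$$
Fix $j\in\{1,\ldots,N\}$. Apply Lemma 17 to $U=U_{f_j}$ satisfying $\frac{1}{m(m-1)}\sum_{i=1}^m\sum_{j\neq i}U(z_i,z_j)-\mathbb{E}[U]=V_{f_j}-\mathbb{E}[V_{f_j}]$. By the bounds for $|U_{f_j}|$ and $\mathrm{var}[U_{f_j}]$ from Part (a) of Lemma 18, we know by taking $\widetilde{\varepsilon}=\varepsilon^{2/q}(\mathbb{E}[V_{f_j}]+2\varepsilon)^{(q-2)/q}$ that
$$\begin{aligned}
\mathrm{Prob}\left\{\frac{|V_{f_j}-\mathbb{E}[V_{f_j}]|}{(\mathbb{E}[V_{f_j}]+2\varepsilon)^{(q-2)/q}}>\varepsilon^{2/q}\right\}&=\mathrm{Prob}\Big\{|V_{f_j}-\mathbb{E}[V_{f_j}]|>\widetilde{\varepsilon}\Big\}\\
&\le 2\exp\left\{-\frac{(m-1)\widetilde{\varepsilon}^2}{4A_{\mathcal{H}}\big(\mathrm{var}[f_j(X)-f_\rho(X)]\big)^{(q-2)/q}+12C_G\|f_j-f_\rho\|_\infty h\widetilde{\varepsilon}}\right\}\\
&\le 2\exp\left\{-\frac{(m-1)\varepsilon^{4/q}(\mathbb{E}[V_{f_j}]+2\varepsilon)^{(q-2)/q}}{4A_{\mathcal{H}}+12C_G\|f_j-f_\rho\|_\infty h\varepsilon^{2/q}}\right\},
\end{aligned}$$
where in the last step we have used the important relation (17) to the function $f=f_j$ and bounded $\big(\mathrm{var}[f_j(X)-f_\rho(X)]\big)^{(q-2)/q}$ by $(\mathbb{E}[V_{f_j}]+2\varepsilon)^{(q-2)/q}$. This together with the notation $N=\mathcal{N}\big(\mathcal{H},\frac{\varepsilon}{4C_Gh}\big)$ and the inequality $\|f_j-f_\rho\|_\infty\le\sup_{f\in\mathcal{H}}\|f-f_\rho\|_\infty$ gives the first desired bound, where we have observed that $\varepsilon\ge C''_{\mathcal{H}}h^{-q^*}$ and $h\ge 1$ imply $\varepsilon^{-2/q}\le(C''_{\mathcal{H}})^{-2/q}h$.
If $|Y|\le M$ almost surely for some constant $M>0$, then we follow the same line as in our above proof. According to Part (b) of Lemma 18, we should replace $4C_Gh$ by $2A'_{\mathcal{H}}$, $q$ by $4$, and bound the variance $\mathrm{var}[U_{f_j}]$ by $2A'_{\mathcal{H}}\mathrm{var}[f_j(X)-f_\rho(X)]\le 2A'_{\mathcal{H}}(\mathbb{E}[V_{f_j}]+2\varepsilon)$. Then the desired estimate follows. The proof of Lemma 19 is complete.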
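We are in a position to bound the sample error. To unify the two estimates in Lemma 19, we denote $A'_{\mathcal{H}}=2C_G$ in the general case. For $m\in\mathbb{N}$, $0<\delta<1$, let $\varepsilon_{m,\delta}$ be the smallest positive solution to the inequality
$$\log\mathcal{N}\Big(\mathcal{H},\frac{\varepsilon}{2A'_{\mathcal{H}}}\Big)-\frac{(m-1)\varepsilon}{A''_{\mathcal{H}}}\le\log\frac{\delta}{2}. \tag{22}$$
Proposition 20 Let $0<\delta<1$, $0<\eta\le 1$. Under assumptions (2) and (3), we have with confidence of $1-\delta$,
$$\mathrm{var}[f_{\mathbf{z}}(X)-f_\rho(X)]\le(1+\eta)\mathrm{var}[f_{approx}(X)-f_\rho(X)]+12\Big(2+24^{\frac{q-2}{2}}\Big)\eta^{\frac{2-q}{2}}\big(h\varepsilon_{m,\delta}+2C''_{\mathcal{H}}h^{-q^*}\big).$$
If $|Y|\le M$ almost surely for some $M>0$, then with confidence of $1-\delta$, we have
$$\mathrm{var}[f_{\mathbf{z}}(X)-f_\rho(X)]\le(1+\eta)\mathrm{var}[f_{approx}(X)-f_\rho(X)]+\frac{278}{\eta}\big(\varepsilon_{m,\delta}+2C''_{\mathcal{H}}h^{-2}\big).$$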
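Proof Denote $\tau=(q-2)/q$ and $\varepsilon_{m,\delta,h}=\max\{h\varepsilon_{m,\delta},C''_{\mathcal{H}}h^{-q^*}\}$ in the general case with some $q>2$, while $\tau=1/2$ and $\varepsilon_{m,\delta,h}=\max\{\varepsilon_{m,\delta},C''_{\mathcal{H}}h^{-2}\}$ when $|Y|\le M$ almost surely. Then by Lemma 19, we know that with confidence $1-\delta$, there holds
$$\sup_{f\in\mathcal{H}}\frac{|V_f-\mathbb{E}[V_f]|}{(\mathbb{E}[V_f]+2\varepsilon_{m,\delta,h})^{\tau}}\le 4\varepsilon_{m,\delta,h}^{1-\tau}$$
which implies
$$\mathbb{E}[V_{f_{\mathbf{z}}}]-V_{f_{\mathbf{z}}}+V_{f_{\mathcal{H}}}-\mathbb{E}[V_{f_{\mathcal{H}}}]\le 4\varepsilon_{m,\delta,h}^{1-\tau}(\mathbb{E}[V_{f_{\mathbf{z}}}]+2\varepsilon_{m,\delta,h})^{\tau}+4\varepsilon_{m,\delta,h}^{1-\tau}(\mathbb{E}[V_{f_{\mathcal{H}}}]+2\varepsilon_{m,\delta,h})^{\tau}.$$
This together with Lemma 15 and (16) yields
$$\mathrm{var}[f_{\mathbf{z}}(X)-f_\rho(X)]\le 4\mathcal{S}+16\varepsilon_{m,\delta,h}+\mathrm{var}[f_{approx}(X)-f_\rho(X)]+2C''_{\mathcal{H}}h^{-q^*}, \tag{23}$$
where
$$\mathcal{S}:=\varepsilon_{m,\delta,h}^{1-\tau}(\mathbb{E}[V_{f_{\mathbf{z}}}])^{\tau}+\varepsilon_{m,\delta,h}^{1-\tau}(\mathbb{E}[V_{f_{\mathcal{H}}}])^{\tau}=\Big(\frac{24}{\eta}\Big)^{\tau}\varepsilon_{m,\delta,h}^{1-\tau}\Big(\frac{\eta}{24}\mathbb{E}[V_{f_{\mathbf{z}}}]\Big)^{\tau}+\Big(\frac{12}{\eta}\Big)^{\tau}\varepsilon_{m,\delta,h}^{1-\tau}\Big(\frac{\eta}{12}\mathbb{E}[V_{f_{\mathcal{H}}}]\Big)^{\tau}.$$
Now we apply Young's inequality
$$a\cdot b\le(1-\tau)a^{1/(1-\tau)}+\tau b^{1/\tau},\qquad a,b\ge 0$$
and find
$$\mathcal{S}\le\Big(\frac{24}{\eta}\Big)^{\tau/(1-\tau)}\varepsilon_{m,\delta,h}+\frac{\eta}{24}\mathbb{E}[V_{f_{\mathbf{z}}}]+\Big(\frac{12}{\eta}\Big)^{\tau/(1-\tau)}\varepsilon_{m,\delta,h}+\frac{\eta}{12}\mathbb{E}[V_{f_{\mathcal{H}}}].$$
Combining this with (23), Theorem 13 and the identity $\mathbb{E}[V_f]=\mathcal{E}^{(h)}(f)-\mathcal{E}^{(h)}(f_\rho)$ gives
$$\mathrm{var}[f_{\mathbf{z}}(X)-f_\rho(X)]\le\frac{\eta}{6}\mathrm{var}[f_{\mathbf{z}}(X)-f_\rho(X)]+\Big(1+\frac{\eta}{3}\Big)\mathrm{var}[f_{approx}(X)-f_\rho(X)]+\mathcal{S}',$$
where $\mathcal{S}':=\big(16+8(24/\eta)^{\tau/(1-\tau)}\big)\varepsilon_{m,\delta,h}+3C''_{\mathcal{H}}h^{-q^*}$. Since $1/(1-\frac{\eta}{6})\le 1+\frac{\eta}{3}$ and $(1+\frac{\eta}{3})^2\le 1+\eta$, we see that
$$\mathrm{var}[f_{\mathbf{z}}(X)-f_\rho(X)]\le(1+\eta)\mathrm{var}[f_{approx}(X)-f_\rho(X)]+\frac{4}{3}\mathcal{S}'.$$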
5. Proof of Main Results
We are now in a position to prove our main results stated in Section 2.
5.1 Proof of Theorem 3
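Recall $\mathcal{D}_{\mathcal{H}}(f_\rho)=\mathrm{var}[f_{approx}(X)-f_\rho(X)]$. Take $\eta=\min\{\varepsilon/(3\mathcal{D}_{\mathcal{H}}(f_\rho)),1\}$. Then $\eta\,\mathrm{var}[f_{approx}(X)-f_\rho(X)]\le\varepsilon/3$.
Now we take
$$h_{\varepsilon,\delta}=\Big(72\big(2+24^{(q-2)/2}\big)\eta^{(2-q)/2}C''_{\mathcal{H}}/\varepsilon\Big)^{1/q^*}.$$
Set $\widetilde{\varepsilon}:=\varepsilon/\Big(36\big(2+24^{(q-2)/2}\big)\eta^{(2-q)/2}\Big)$. We choose
$$m_{\varepsilon,\delta}(h)=\frac{hA''_{\mathcal{H}}}{\widetilde{\varepsilon}}\bigg(\log\mathcal{N}\Big(\mathcal{H},\frac{\widetilde{\varepsilon}}{2hA'_{\mathcal{H}}}\Big)-\log\frac{\delta}{2}\bigg)+1.$$
With this choice, we know that whenever $m\ge m_{\varepsilon,\delta}(h)$, the solution $\varepsilon_{m,\delta}$ to inequality (22) satisfies $\varepsilon_{m,\delta}\le\widetilde{\varepsilon}/h$. Combining all the above estimates and Proposition 20, we see that whenever $h\ge h_{\varepsilon,\delta}$ and $m\ge m_{\varepsilon,\delta}(h)$, error bound (4) holds true with confidence $1-\delta$. This proves Theorem 3.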
5.2 Proof of Theorem 5
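We apply Proposition 20. By covering number condition (5), we know that $\varepsilon_{m,\delta}$ is bounded by $\widetilde{\varepsilon}_{m,\delta}$, the smallest positive solution to the inequality
$$A_p\Big(\frac{2A'_{\mathcal{H}}}{\varepsilon}\Big)^p-\frac{(m-1)\varepsilon}{A''_{\mathcal{H}}}\le\log\frac{\delta}{2}.$$
This inequality written as $\varepsilon^{1+p}-\frac{A''_{\mathcal{H}}}{m-1}\log\frac{2}{\delta}\,\varepsilon^p-A_p(2A'_{\mathcal{H}})^p\frac{A''_{\mathcal{H}}}{m-1}\ge 0$ is well understood in learning theory (e.g., Cucker and Zhou, 2007) and its solution can be bounded as
$$\widetilde{\varepsilon}_{m,\delta}\le\max\left\{\frac{2A''_{\mathcal{H}}}{m-1}\log\frac{2}{\delta},\ \big(2A_pA''_{\mathcal{H}}(2A'_{\mathcal{H}})^p\big)^{1/(1+p)}(m-1)^{-\frac{1}{1+p}}\right\}.$$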
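The bound on the solution of $\varepsilon^{1+p}-c_1\varepsilon^p-c_2\ge 0$ used above has the form $\max\{2c_1,(2c_2)^{1/(1+p)}\}$ with $c_1=\frac{A''_{\mathcal{H}}}{m-1}\log\frac{2}{\delta}$ and $c_2=A_p(2A'_{\mathcal{H}})^p\frac{A''_{\mathcal{H}}}{m-1}$. The sketch below locates the positive root by bisection so that it can be compared with this bound; the function names and the numerical values of $c_1$, $c_2$ and $p$ in the usage note are hypothetical.

```python
def root_upper_bound(c1, c2, p):
    """Upper bound max{2 c1, (2 c2)^{1/(1+p)}} for the positive root."""
    return max(2.0 * c1, (2.0 * c2) ** (1.0 / (1.0 + p)))

def smallest_positive_root(c1, c2, p, tol=1e-14):
    """Bisection for the positive root of eps^{1+p} - c1 * eps^p - c2 (c1, c2, p > 0)."""
    def g(eps):
        return eps ** (1.0 + p) - c1 * eps ** p - c2
    lo, hi = 0.0, root_upper_bound(c1, c2, p)   # g(lo) < 0 <= g(hi)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if g(mid) >= 0.0:
            hi = mid
        else:
            lo = mid
    return hi
```

With, say, `c1 = 0.02`, `c2 = 0.001` and `p = 0.5`, the root returned by `smallest_positive_root` lies below `root_upper_bound(c1, c2, p)`, as the bound predicts.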
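If $\mathbb{E}[|Y|^q]<\infty$ for some $q>2$, then the first part of Proposition 20 verifies (6) with the constant $\widetilde{C}_{\mathcal{H}}$ given by
$$\widetilde{C}_{\mathcal{H}}=24\big(2+24^{(q-2)/2}\big)\Big(2A''_{\mathcal{H}}+\big(2A_pA''_{\mathcal{H}}(2A'_{\mathcal{H}})^p\big)^{1/(1+p)}+2C''_{\mathcal{H}}\Big).$$
If $|Y|\le M$ almost surely for some $M>0$, then the second part of Proposition 20 proves (7) with the constant $\widetilde{C}_{\mathcal{H}}$ given by
$$\widetilde{C}_{\mathcal{H}}=278\Big(2A''_{\mathcal{H}}+\big(2A_pA''_{\mathcal{H}}(2A'_{\mathcal{H}})^p\big)^{1/(1+p)}+2C''_{\mathcal{H}}\Big).$$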
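5.3 Proof of Theorem 8
Note
$$\left|\frac{1}{m}\sum_{i=1}^m\big(f(x_i)-\pi_{\sqrt{m}}(y_i)\big)-\frac{1}{m}\sum_{i=1}^m\big(g(x_i)-\pi_{\sqrt{m}}(y_i)\big)\right|\le\|f-g\|_\infty$$
and
$$\Big|\mathbb{E}[f(X)-\pi_{\sqrt{m}}(Y)]-\mathbb{E}[g(X)-\pi_{\sqrt{m}}(Y)]\Big|\le\|f-g\|_\infty.$$
So by taking $\{f_j\}_{j=1}^N$ to be an $\frac{\varepsilon}{4}$ net of the set $\mathcal{H}$ with $N=\mathcal{N}\big(\mathcal{H},\frac{\varepsilon}{4}\big)$, we know that for each $f\in\mathcal{H}$ there is some $j\in\{1,\ldots,N\}$ such that $\|f-f_j\|_\infty\le\frac{\varepsilon}{4}$. Hence
$$\left|\frac{1}{m}\sum_{i=1}^m\big(f(x_i)-\pi_{\sqrt{m}}(y_i)\big)-\mathbb{E}[f(X)-\pi_{\sqrt{m}}(Y)]\right|>\varepsilon\ \Longrightarrow\ \left|\frac{1}{m}\sum_{i=1}^m\big(f_j(x_i)-\pi_{\sqrt{m}}(y_i)\big)-\mathbb{E}[f_j(X)-\pi_{\sqrt{m}}(Y)]\right|>\frac{\varepsilon}{2}.$$
It follows that
$$\begin{aligned}
&\mathrm{Prob}\left\{\sup_{f\in\mathcal{H}}\left|\frac{1}{m}\sum_{i=1}^m\big(f(x_i)-\pi_{\sqrt{m}}(y_i)\big)-\mathbb{E}[f(X)-\pi_{\sqrt{m}}(Y)]\right|>\varepsilon\right\}\\
&\quad\le\mathrm{Prob}\left\{\sup_{j=1,\ldots,N}\left|\frac{1}{m}\sum_{i=1}^m\big(f_j(x_i)-\pi_{\sqrt{m}}(y_i)\big)-\mathbb{E}[f_j(X)-\pi_{\sqrt{m}}(Y)]\right|>\frac{\varepsilon}{2}\right\}\\
&\quad\le\sum_{j=1}^N\mathrm{Prob}\left\{\left|\frac{1}{m}\sum_{i=1}^m\big(f_j(x_i)-\pi_{\sqrt{m}}(y_i)\big)-\mathbb{E}[f_j(X)-\pi_{\sqrt{m}}(Y)]\right|>\frac{\varepsilon}{2}\right\}.
\end{aligned}$$
For each fixed $j\in\{1,\ldots,N\}$, we apply the classical Bernstein probability inequality to the random variable $\xi=f_j(X)-\pi_{\sqrt{m}}(Y)$ on $(\mathcal{Z},\rho)$ bounded by $\widetilde{M}=\sup_{f\in\mathcal{H}}\|f\|_\infty+\sqrt{m}$ with variance $\sigma^2(\xi)\le\mathbb{E}[|f_j(X)-\pi_{\sqrt{m}}(Y)|^2]\le 2\sup_{f\in\mathcal{H}}\|f\|_\infty^2+2\mathbb{E}[|Y|^2]=:\sigma^2_{\mathcal{H}}$ and know that
$$\mathrm{Prob}\left\{\left|\frac{1}{m}\sum_{i=1}^m\big(f_j(x_i)-\pi_{\sqrt{m}}(y_i)\big)-\mathbb{E}[f_j(X)-\pi_{\sqrt{m}}(Y)]\right|>\frac{\varepsilon}{2}\right\}\le 2\exp\left\{-\frac{m(\varepsilon/2)^2}{\frac{2}{3}\widetilde{M}\varepsilon/2+2\sigma^2(\xi)}\right\}\le 2\exp\left\{-\frac{m\varepsilon^2}{\frac{4}{3}\widetilde{M}\varepsilon+8\sigma^2_{\mathcal{H}}}\right\}.$$
The above argument together with covering number condition (5) yields
$$\begin{aligned}
\mathrm{Prob}\left\{\sup_{f\in\mathcal{H}}\left|\frac{1}{m}\sum_{i=1}^m\big(f(x_i)-\pi_{\sqrt{m}}(y_i)\big)-\mathbb{E}[f(X)-\pi_{\sqrt{m}}(Y)]\right|>\varepsilon\right\}&\le 2N\exp\left\{-\frac{m\varepsilon^2}{\frac{4}{3}\widetilde{M}\varepsilon+8\sigma^2_{\mathcal{H}}}\right\}\\
&\le 2\exp\left\{A_p\Big(\frac{4}{\varepsilon}\Big)^p-\frac{m\varepsilon^2}{\frac{4}{3}\widetilde{M}\varepsilon+8\sigma^2_{\mathcal{H}}}\right\}.
\end{aligned}$$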
Bounding the right-hand side above byδis equivalent to the inequality
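$$\varepsilon^{2+p}-\frac{4\widetilde{M}}{3m}\log\frac{2}{\delta}\,\varepsilon^{1+p}-\frac{8\sigma^2_{\mathcal{H}}}{m}\log\frac{2}{\delta}\,\varepsilon^{p}-A_p4^p\cdots$$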
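By taking $\widetilde{\varepsilon}_{m,\delta}$ to be the smallest solution to the above inequality, we see from Cucker and Zhou (2007) as in the proof of Theorem 5 that with confidence at least $1-\delta$,
$$\begin{aligned}
\sup_{f\in\mathcal{H}}\left|\frac{1}{m}\sum_{i=1}^m\big(f(x_i)-\pi_{\sqrt{m}}(y_i)\big)-\mathbb{E}[f(X)-\pi_{\sqrt{m}}(Y)]\right|&\le\widetilde{\varepsilon}_{m,\delta}\le\max\left\{\frac{4\widetilde{M}}{m}\log\frac{2}{\delta},\ \sqrt{\frac{24\sigma^2_{\mathcal{H}}}{m}\log\frac{2}{\delta}},\ \Big(\frac{A_p4^p}{m}\Big)^{\frac{1}{2+p}}\right\}\\
&\le\left\{7\sup_{f\in\mathcal{H}}\|f\|_\infty+4+7\sqrt{\mathbb{E}[|Y|^2]}+4A_p^{\frac{1}{2+p}}\right\}m^{-\frac{1}{2+p}}\log\frac{2}{\delta}.
\end{aligned}$$
Moreover, since $\pi_{\sqrt{m}}(y)-y=0$ for $|y|\le\sqrt{m}$ while $|\pi_{\sqrt{m}}(y)-y|\le|y|\le\frac{|y|^2}{\sqrt{m}}$ for $|y|>\sqrt{m}$, we know that
$$\begin{aligned}
\Big|\mathbb{E}[\pi_{\sqrt{m}}(Y)]-\mathbb{E}[f_\rho(X)]\Big|&=\left|\int_{\mathcal{X}}\int_{\mathcal{Y}}\big(\pi_{\sqrt{m}}(y)-y\big)d\rho(y|x)\,d\rho_X(x)\right|=\left|\int_{\mathcal{X}}\int_{|y|>\sqrt{m}}\big(\pi_{\sqrt{m}}(y)-y\big)d\rho(y|x)\,d\rho_X(x)\right|\\
&\le\int_{\mathcal{X}}\int_{|y|>\sqrt{m}}\frac{|y|^2}{\sqrt{m}}\,d\rho(y|x)\,d\rho_X(x)\le\frac{\mathbb{E}[|Y|^2]}{\sqrt{m}}.
\end{aligned}$$
Therefore, (11) holds with confidence at least $1-\delta$. The proof of Theorem 8 is complete.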
6. Conclusion and Discussion
In this paper we have proved the consistency of an MEE algorithm associated with Rényi's entropy of order 2 by letting the scaling parameter $h$ in the kernel density estimator tend to infinity at an appropriate rate. This result explains the effectiveness of the MEE principle in empirical applications where the parameter $h$ is required to be large enough before smaller values are tuned. However, the motivation of the MEE principle is to minimize error entropies approximately, and requires small $h$ for the kernel density estimator to converge to the true probability density function. Therefore, our consistency result seems surprising.
As far as we know, our result is the first rigorous consistency result for MEE algorithms. There are many open questions in mathematical analysis of MEE algorithms. For instance, can MEE algorithm (1) be consistent by taking $h\to 0$? Can one carry out error analysis for the MEE algorithm if Shannon's entropy or Rényi's entropy of order $\alpha\neq 2$ is used? How can we establish error analysis for other learning settings such as those with non-identical sampling processes (Smale and Zhou, 2009; Hu, 2011)? These questions require further research and will be our future topics.
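Table 1: NOTATIONS

| notation | meaning | pages |
|---|---|---|
| $p_E$ | probability density function of a random variable $E$ | 378 |
| $H_S(E)$ | Shannon's entropy of a random variable $E$ | 378 |
| $H_{R,\alpha}(E)$ | Rényi's entropy of order $\alpha$ | 378 |
| $X$ | explanatory variable for learning | 378 |
| $Y$ | response variable for learning | 378 |
| $E=Y-f(X)$ | error random variable associated with a predictor $f(X)$ | 378 |
| $H_R(E)$ | Rényi's entropy of order $\alpha=2$ | 378 |
| $\mathbf{z}=\{(x_i,y_i)\}_{i=1}^m$ | a sample for learning | 378 |
| $G$ | windowing function | 378, 379, 380 |
| $h$ | MEE scaling parameter | 378, 379 |
| $\widehat{p}_E$ | Parzen windowing approximation of $p_E$ | 378 |
| $\widehat{H}_S$ | empirical Shannon entropy | 378 |
| $\widehat{H}_R$ | empirical Rényi's entropy of order 2 | 378 |
| $f_\rho$ | the regression function of $\rho$ | 379 |
| $f_{\mathbf{z}}$ | output function of the MEE learning algorithm (1) | 379 |
| $\mathcal{H}$ | the hypothesis space for the ERM algorithm | 379 |
| $\mathrm{var}$ | the variance of a random variable | 379 |
| $q$, $q^*=\min\{q-2,2\}$ | power indices in condition (2) for $\mathbb{E}[\lvert Y\rvert^q]<\infty$ | 380 |
| $C_G$ | constant for decay condition (3) of $G$ | 380 |
| $\mathcal{D}_{\mathcal{H}}(f_\rho)$ | approximation error of the pair $(\mathcal{H},\rho)$ | 380 |
| $\mathcal{N}(\mathcal{H},\varepsilon)$ | covering number of the hypothesis space $\mathcal{H}$ | 380 |
| $p$ | power index for covering number condition (5) | 380 |
| $\pi_{\sqrt{m}}$ | projection onto the closed interval $[-\sqrt{m},\sqrt{m}]$ | 381 |
| $\widetilde{f}_{\mathbf{z}}$ | estimator of $f_\rho$ | 382 |
| $\mathcal{E}^{(h)}(f)$ | generalization error associated with $G$ and $h$ | 383 |
| $\mathcal{E}^{ls}(f)$ | least squares generalization error $\mathcal{E}^{ls}(f)=\int_{\mathcal{Z}}(f(x)-y)^2d\rho$ | 383 |
| $C_\rho$ | constant $C_\rho=\int_{\mathcal{Z}}[y-f_\rho(x)]^2d\rho$ associated with $\rho$ | 384 |
| $f_{\mathcal{H}}$ | minimizer of $\mathcal{E}^{(h)}(f)$ in $\mathcal{H}$ | 385 |
| $f_{approx}$ | minimizer of $\mathrm{var}[f(X)-f_\rho(X)]$ in $\mathcal{H}$ | 385 |
| $U_f$ | kernel for the U statistics $V_f$ | 387 |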
e