Robust Bounds on Generalization
from the Margin Distribution
John Shawe-Taylor
Royal Holloway, University of London
Nello Cristianini
University of Bristol
NeuroCOLT2 Technical Report Series
NC2-TR-1998-029
October, 1998
Produced as part of the ESPRIT Working Group
in Neural and Computational Learning II,
NeuroCOLT2 27150
For more information see the NeuroCOLT website
http://www.neurocolt.com
Abstract
1 Introduction
For classification by thresholding a real valued function, the margin of a training point is the amount by which its output is on the right side of the threshold or, if misclassified, minus the amount by which it fails to be correctly classified. In the case of linear hyperplanes with unit weight vectors, this value can also be seen as the distance of the input point from the hyperplane. The margin of a classifier is the minimum margin over the training set.
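The margin definition above can be illustrated with a minimal sketch. The helper names `margins` and `classifier_margin` and the toy data are assumptions made for illustration only; the weight vector is normalised so that the value is a distance from the hyperplane.

```python
import math

def margins(points, labels, u, b):
    """Signed margin y * (<u, x> - b) / ||u|| of every training point.

    Positive values lie on the correct side of the threshold b; a negative
    value is minus the amount by which the point is misclassified.
    """
    norm = math.sqrt(sum(ui * ui for ui in u))
    return [y * (sum(ui * xi for ui, xi in zip(u, x)) - b) / norm
            for x, y in zip(points, labels)]

def classifier_margin(points, labels, u, b):
    """Margin of the classifier: the minimum margin over the training set."""
    return min(margins(points, labels, u, b))

# Hypothetical toy data: the last point is on the wrong side of the hyperplane.
X = [(2.0, 0.0), (0.5, 1.0), (-1.0, 0.0), (0.2, -3.0)]
y = [1, 1, -1, -1]
u, b = (1.0, 0.0), 0.0

print(margins(X, y, u, b))
print(classifier_margin(X, y, u, b))
```

A single badly placed point fixes the minimum, which is the sensitivity that motivates looking at the whole margin distribution.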
The idea that a large margin classifier might be expected to give good generalization is certainly not new [7, 19]. Despite this insight it was not until comparatively recently [12] that such a conjecture has been placed on a firm footing in the probably approximately correct (pac) model of learning. Learning in this model entails giving a bound on the generalization error which will hold with high confidence over randomly drawn training sets. In this sense it can be said to ensure reliable learning, something that cannot be guaranteed by bounds on the expected error of a classifier.
Despite successes in extending this style of analysis to the agnostic case [2] and applying it to neural networks [2], boosting algorithms [11], perceptron decision trees [13] and Bayesian algorithms [6], there has been concern that the measure of the distribution of margin values attained by the training set is largely ignored in a bound that depends only on its minimal value. Intuitively, there appeared to be something lost with a bound that depended so critically on the positions of possibly a small proportion of the training set, ignoring the margin attained by the majority of the points. Attempts to address this problem have been made, for example, in [11], but they treat points that fail to meet the larger margin as errors and fall back on agnostic bounds for the generalization error. In contrast our results apply to the case where there are training set errors, but have the form of bounds with no training set errors. The question of how to handle the situation of non-linearly separable data has received a lot of attention (see [4] for a review of some of the methods suggested). The problem is that minimising the number of training errors is NP-complete and so the various methods adopted are inherently heuristic relative to the best bounds previously available for bounding the generalization error. By showing that the generalization error can be bounded in terms of a quantity that can be optimized by a polynomial time algorithm, we provide a solution to a long-standing conundrum of perceptron learning.
Algorithms arising from the approach are related to those of Cortes and Vapnik [5] and directly justify the original proposal made to minimise the 2-norm of the slack variables. We generalise the basic result to function classes with bounded fat-shattering dimension and the 1-norm of the slack variables which gives rise to Vapnik's box constraint algorithm. Finally, application to regression is considered, which includes results for standard least squares as a special case.
The paper is structured as follows. In the next section we will summarise the results in $O$ notation to give a flavour of what the paper aims to achieve. In Section 3 relevant background material and definitions are introduced. This is followed by a section describing the results for classification using linear functions. This is the simplest case considered and provides insight into the basic techniques employed. Section 5 describes the algorithmic consequences of these results for Support Vector Machine classification algorithms. We then proceed to generalize the results to non-linear function spaces in Section 6. The penultimate section considers the problem of regression and shows how the results obtained for classification readily generalize to this case.

2 Summary of Results
The results in this section will be given in the $\tilde{O}$ notation, indicating asymptotics ignoring log factors. The aim is to give the flavour of the results obtained, which might otherwise be obscured by the detailed technicalities of the proofs and precise bounds obtained.

The first case considered is that of classification using linear function classes that include the use of kernel functions such as those used in the Support Vector Machine. For this case consider a margin $\gamma$ about the separating hyperplane and set $d = (d((x,y),\gamma))_{(x,y)\in S}$ to be the vector of the amounts by which the points of the training set $S$ fail to achieve the margin $\gamma$. We bound the probability $\epsilon$ of misclassification of a randomly chosen test point by (see Theorem 4.3)
$$\tilde{O}\left(\frac{(R + \|d\|_2)^2}{|S|\,\gamma^2}\right),$$
where $R$ is the radius of a ball about the origin which contains the support of the input probability distribution.

The results are generalized to non-linear function classes using a characterisation of their capacity at scale $\gamma$ known as the fat shattering dimension $\mathrm{fat}(\gamma)$. In this case the bound obtained has the form (see Theorem 6.13)
$$\tilde{O}\left(\frac{\mathrm{fat}(\gamma/16) + \|d\|_2^2/\gamma^2}{|S|}\right).$$
This result is generalized to obtain a bound in terms of the 1-norm of the vector $d$ (see Corollary 6.14)
$$\tilde{O}\left(\frac{\mathrm{fat}(\gamma/16) + \|d\|_1/\gamma^2}{|S|}\right),$$
which could of course also be applied to the linear case using a bound on the fat shattering dimension for this case.
Finally, the problem of estimating errors of regressors is addressed with the techniques developed. We bound the probability $\epsilon$ that for a randomly chosen test point the absolute error is greater than a given value $\theta$. In this case we define a vector $\partial(\gamma) = (\partial(x,y))_{(x,y)\in S}$ of amounts by which the error on the training examples exceeds $\theta - \gamma$. Note that $\|\partial(\theta)\|_2^2$ is simply the least squares error on the training set. We then bound the probability $\epsilon$ by (see Theorem 7.2)
$$\tilde{O}\left(\frac{\mathrm{fat}(\gamma/16) + \|\partial(\gamma)\|_2^2/\gamma^2}{|S|}\right).$$
These results can be used for Support Vector Regression and give a way of choosing the optimal size $\theta - \gamma$ of the tube for the insensitive loss function. In addition they can be applied to standard least squares regression by setting $\gamma = \theta$ to obtain the bound (see Corollary 7.4)
$$\tilde{O}\left(\frac{\mathrm{fat}(\theta/16) + \|\partial(\theta)\|_2^2/\theta^2}{|S|}\right).$$
3 Background Results
We consider learning from examples, initially of a binary classification. We denote the domain of the problem by $X$ and a sequence of inputs by $x = (x_1, \ldots, x_m) \in X^m$. A training sequence is typically denoted by $z = ((x_1, y_1), \ldots, (x_m, y_m)) \in (X \times \{-1, 1\})^m$ and the set of training examples by $S$. By $\mathrm{Er}_z(h)$ we denote the number of classification errors of the function $h$ on the sequence $z$. As we will typically be classifying by thresholding real valued functions we introduce the notation $T_\theta(f)$ to denote the function giving output 1 if $f$ has output greater than or equal to $\theta$ and $-1$ otherwise. For a class $H$ the class $T_\theta(H)$ is the set of derived classification functions. We first give some necessary definitions.
Definition 3.1
Let $H$ be a set of binary valued functions. We say that a set of points $X$ is shattered by $H$ if for all binary vectors $b$ indexed by $X$, there is a function $f_b \in H$ realising $b$ on $X$. The Vapnik-Chervonenkis (VC) dimension, $\mathrm{VCdim}(H)$, of the set $H$ is the size of the largest shattered set, if this is finite, or infinity otherwise.

The following theorem is well known in a number of different forms. We quote the result here as a bound on the generalization error rather than as a required sample size for given generalization.
Theorem 3.2
[12] Let $H_i$, $i = 1, 2, \ldots$, be a sequence of hypothesis classes mapping $X$ to $\{0, 1\}$ such that $\mathrm{VCdim}(H_i) = i$, and let $P$ be a probability distribution on $X$. Let $p_d$ be any set of positive numbers satisfying $\sum_{d=1}^{\infty} p_d = 1$. With probability $1 - \delta$ over $m$ independent examples drawn according to $P$, for any $d$ for which a learner finds a consistent hypothesis $h$ in $H_d$, the generalization error of $h$ is bounded from above by
$$\epsilon(m, d, \delta) = \frac{4}{m}\left(d \ln\frac{2em}{d} + \ln\frac{1}{p_d} + \ln\frac{4}{\delta}\right),$$
provided $d \le m$.

We now introduce the generalization of the VC dimension which makes it possible to generalize Theorem 3.2 to large margin classification.
Definition 3.3
Let $H$ be a set of real valued functions. We say that a set of points $X$ is $\gamma$-shattered by $H$ if there are real numbers $r_x$ indexed by $x \in X$ such that for all binary vectors $b$ indexed by $X$, there is a function $f_b \in H$ satisfying
$$f_b(x) \ge r_x + \gamma \ \text{ if } b_x = 1, \qquad f_b(x) \le r_x - \gamma \ \text{ otherwise.}$$
The fat shattering dimension $\mathrm{fat}_H$ of the set $H$ is a function from the positive real numbers to the integers which maps a value $\gamma$ to the size of the largest $\gamma$-shattered set, if this is finite, or infinity otherwise.
We will make critical use of the following result contained in Shawe-Taylor et al. [12] which involves the fat shattering dimension of the space of functions.

Theorem 3.4
Consider a real valued function class $H$ having fat shattering function bounded above by the function $\mathrm{afat} : \mathbb{R} \to \mathbb{N}$ which is continuous from the right. Fix $\theta \in \mathbb{R}$. Then with probability at least $1 - \delta$ a learner who correctly classifies $m$ independently generated examples $z$ with $h = T_\theta(f) \in T_\theta(H)$ such that $\mathrm{er}_z(h) = 0$ and $\gamma = \min_i |f(x_i) - \theta|$ will have error of $h$ bounded from above by
$$\epsilon(m, k, \delta) = \frac{2}{m}\left(k \log_2\frac{8em}{k}\log_2(32m) + \log_2\frac{8m}{\delta}\right),$$
where $k = \mathrm{afat}(\gamma/8) \le em$.
Theorem 3.5
[3] Consider a Hilbert space and the class of linear functions $L$ of norm less than or equal to $B$ restricted to the sphere of radius $R$ about the origin. Then the fat shattering dimension of $L$ can be bounded by
$$\mathrm{fat}_L(\gamma) \le \left(\frac{BR}{\gamma}\right)^2.$$
In order to apply Theorems 3.4 and 3.5 we need to bound the radius of the sphere containing the points and the norm of the linear functionals involved. Clearly, scaling by these quantities will give the margin appropriate for application of the theorem.
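To see how Theorems 3.4 and 3.5 combine numerically, the following sketch evaluates the hard-margin bound for a linear class. The function names are illustrative assumptions, and taking the integer part of $(BR/(\gamma/8))^2$ as the value of afat is one possible choice of a right-continuous, integer-valued upper bound, not necessarily the one used in [12].

```python
import math

def fat_linear(B, R, gamma):
    """Theorem 3.5: upper bound (B*R/gamma)^2 on the fat shattering dimension."""
    return (B * R / gamma) ** 2

def eps_margin(m, k, delta):
    """Theorem 3.4: (2/m) * (k*log2(8em/k)*log2(32m) + log2(8m/delta))."""
    return (2.0 / m) * (k * math.log2(8 * math.e * m / k) * math.log2(32 * m)
                        + math.log2(8 * m / delta))

def hard_margin_bound(m, R, gamma, delta, B=1.0):
    """Bound for a consistent linear classifier with margin gamma.

    Returns None when the side condition 1 <= k <= e*m fails, in which
    case the theorem makes no statement.
    """
    k = math.floor(fat_linear(B, R, gamma / 8))
    if k < 1 or k > math.e * m:
        return None
    return eps_margin(m, k, delta)

for m in (50, 10**6, 10**7):
    print(m, hard_margin_bound(m, R=1.0, gamma=0.5, delta=0.05))
```

The bound only becomes non-trivial for large samples, which is typical of results of this form.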
4 Linear Function Classes
Let $X$ be a Hilbert space. We define the following Hilbert space derived from $X$.

Definition 4.1
Let $L_f(X)$ be the set of real valued functions $f$ on $X$ with support $\mathrm{supp}(f)$ finite, that is, functions in $L_f(X)$ are non-zero only for finitely many points. We define the inner product of two functions $f, g \in L_f(X)$ by
$$\langle f, g \rangle = \sum_{x \in \mathrm{supp}(f)} f(x) g(x).$$
Note that the sum which defines the inner product is well-defined since the functions have finite support. Clearly the space is closed under addition and multiplication by scalars.
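One way to picture $L_f(X)$ concretely is to store a finitely supported function as a dictionary from points to non-zero values. This representation and the helper names below are assumptions made for illustration only.

```python
def inner(f, g):
    """<f, g> = sum over supp(f) of f(x) * g(x); missing keys count as 0."""
    return sum(value * g.get(x, 0.0) for x, value in f.items())

def add(f, g):
    """Pointwise sum; the support stays finite."""
    h = dict(f)
    for x, value in g.items():
        h[x] = h.get(x, 0.0) + value
    return {x: v for x, v in h.items() if v != 0.0}

def scale(c, f):
    """Multiplication by the scalar c."""
    return {x: c * v for x, v in f.items() if c * v != 0.0}

def indicator(x):
    """The function that is 1 at x and 0 elsewhere."""
    return {x: 1.0}

f = add(scale(2.0, indicator("a")), indicator("b"))
print(inner(f, f))                            # squared norm of f
print(inner(indicator("a"), indicator("b")))  # distinct points are orthogonal
```

Indicators of distinct points are orthonormal, which is what lets each training point receive its own extra coordinate in the construction that follows.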
Now for any fixed $\Delta > 0$ we define an embedding of $X$ into the Hilbert space $X \times L_f(X)$ as follows:
$$\tau_\Delta : x \mapsto (x, \Delta \delta_x),$$
where $\delta_x \in L_f(X)$ is defined by
$$\delta_x(y) = 1 \ \text{ if } y = x, \qquad \delta_x(y) = 0 \ \text{ otherwise.}$$
We begin by considering the case where $\Delta$ is fixed. In practice we wish to choose this parameter in response to the data. In order to obtain a bound over different values of $\Delta$ it will be necessary to apply the following theorem several times.
For a linear classifier $u$ on $X$ and threshold $b \in \mathbb{R}$ we define
$$d((x, y), (u, b), \gamma) = \max\{0, \gamma - y(\langle u, x \rangle - b)\}.$$
This quantity is the amount by which $u$ fails to reach the margin $\gamma$ on the point $(x, y)$, or 0 if its margin is larger than $\gamma$. Similarly for a training set $S$, we define
$$D(S, (u, b), \gamma) = \sqrt{\sum_{(x,y) \in S} d((x, y), (u, b), \gamma)^2}.$$
Theorem 4.2
Fix $\Delta > 0$, $b \in \mathbb{R}$. Consider a fixed but unknown probability distribution on the input space $X$ with support in the ball of radius $R$ about the origin. Then with probability $1 - \delta$ over randomly drawn training sets $S$ of size $m$, for all $\gamma > 0$ the generalization of a linear classifier $u$ on $X$ with $\|u\| = 1$, thresholded at $b$, is bounded by
$$\epsilon(m, k, \delta) = \frac{2}{m}\left(k \log_2\frac{8em}{k}\log_2(32m) + \log_2\frac{720\, m \log_2(1 + mR^2/\Delta^2)}{\delta}\right),$$
where
$$k = \left\lfloor \frac{64.5\,(R^2 + \Delta^2)\left(\|u\|^2 + D(S, (u, b), \gamma)^2/\Delta^2\right)}{\gamma^2} \right\rfloor,$$
provided $m \ge 2/\epsilon$, $k \le em$ and there is no discrete probability on misclassified training points.
Proof: Consider the fixed mapping $\tau_\Delta$ and the augmented linear functional over the space $X \times L_f(X)$,
$$\hat{u} = \left(u, \frac{1}{\Delta}\sum_{(x,y) \in S} d((x, y), (u, b), \gamma)\, y\, \delta_x\right).$$
We claim that
1. for $x \notin S$, $\langle u, x \rangle = \langle \hat{u}, \tau_\Delta(x) \rangle$, and
2. the margin of $\hat{u}$ with threshold $b$ on the training set $\tau_\Delta(S)$ is $\gamma$.
Hence, the off training set behaviour of the linear classifier $(u, b)$ can be characterised by the behaviour of $(\hat{u}, b)$, while $(\hat{u}, b)$ is a large margin classifier in the space $X \times L_f(X)$. Since for $x \in S$, $\|\tau_\Delta(x)\|^2 \le R^2 + \Delta^2$ and $\|\hat{u}\|^2 = \|u\|^2 + D(S, (u, b), \gamma)^2/\Delta^2$, the result will then follow from an application of Theorems 3.4 and 3.5 provided that there are no misclassified training points with discrete probability.

Since Theorem 3.4 can only be applied for a fixed space of functions we must apply the two theorems for a discrete set of values for the bound $B$ on the norm of the linear functions. This corresponds to choosing a discrete set of values for the product $(BR)^2$ of Theorem 3.5. We will choose the geometric series $\mu^i (R^2 + \Delta^2)$, for $i = 1, \ldots, \ell = 90 \log_2(1 + mR^2/\Delta^2)$, where $\mu$ is chosen so that $\mu^\ell (R^2 + \Delta^2) = (R^2 + \Delta^2)(1 + mR^2/\Delta^2)$, which is an upper bound on the product $\|\tau_\Delta(x)\|^2 \|\hat{u}\|^2$, since $D^2 \le mR^2$. Hence, $\mu = 2^{1/90}$ and it is readily verified that $64.005\,\mu < 64.5$. This implies that if we replace the constant 64 of Theorem 3.4 by 64.005 to ensure the continuity from the right, then for the observed value of $\|\tau_\Delta(x)\|^2 \|\hat{u}\|^2$ there is an application of the theorem for a value of $(BR)^2$ within a factor $\mu$ of this value and the required bound holds. Note that for each application of the theorem we must replace the $\delta$ of Theorem 3.4 by $\delta' = \delta/\ell$ in order to ensure that all applications hold [...]
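1. The first claim follows immediately from the observation that for $z \notin S$,
$$\left\langle \sum_{(x,y) \in S} d((x, y), (u, b), \gamma)\, y\, \delta_x,\ \delta_z \right\rangle = 0.$$
2. For $(x', y') \in S$, we have
$$y'(\langle \hat{u}, \tau_\Delta(x') \rangle - b) = y'(\langle u, x' \rangle - b) + y'\left\langle \sum_{(x,y) \in S} d((x, y), (u, b), \gamma)\, y\, \delta_x,\ \delta_{x'} \right\rangle$$
$$\ge \gamma - d((x', y'), (u, b), \gamma) + d((x', y'), (u, b), \gamma) = \gamma.$$
The theorem follows.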
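The two claims in the proof can be checked numerically on a small example. The sketch below is an illustration under stated assumptions: training points are distinct and represented as tuples, the function part of $\hat{u}$ is stored as a dictionary keyed by training point, and all names and data are hypothetical.

```python
def dot(u, x):
    return sum(a * b for a, b in zip(u, x))

def augmented_coeffs(S, u, b, gamma, Delta):
    """Function part of u_hat: (1/Delta) * d((x,y),(u,b),gamma) * y at each x in S."""
    coeffs = {}
    for x, y in S:
        d = max(0.0, gamma - y * (dot(u, x) - b))
        coeffs[x] = d * y / Delta
    return coeffs

def augmented_output(u, coeffs, x, Delta):
    """<u_hat, tau_Delta(x)> = <u, x> + Delta * coeffs[x] (zero off the training set)."""
    return dot(u, x) + Delta * coeffs.get(x, 0.0)

S = [((2.0, 0.0), 1), ((0.5, 1.0), 1), ((-1.0, 0.0), -1), ((0.2, -3.0), -1)]
u, b, gamma, Delta = (1.0, 0.0), 0.0, 1.0, 0.7

coeffs = augmented_coeffs(S, u, b, gamma, Delta)
train_margins = [y * (augmented_output(u, coeffs, x, Delta) - b) for x, y in S]
unseen = (0.3, 0.3)

print(train_margins)                                          # each at least gamma
print(augmented_output(u, coeffs, unseen, Delta), dot(u, unseen))  # identical off S
```

Even the misclassified training point reaches margin $\gamma$ in the augmented space, while the output on unseen points is unchanged.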
We now apply this theorem several times to allow a choice of $\Delta$ which approximately minimises the expression for $k$. Note that the minimum of the expression (ignoring the constant and suppressing the denominator $\gamma^2$) is $(R + D)^2$, attained when $\Delta = \sqrt{RD}$.

Theorem 4.3
Fix $b \in \mathbb{R}$. Consider a fixed but unknown probability distribution on the input space $X$ with support in the ball of radius $R$ about the origin. Then with probability $1 - \delta$ over randomly drawn training sets $S$ of size $m$, for all $\gamma > 0$ such that $d((x, y), (u, b), \gamma) = 0$ for some $(x, y) \in S$, the generalization of a linear classifier $u$ on $X$ satisfying $\|u\| \le 1$ is bounded by
$$\epsilon(m, k, \delta) = \frac{2}{m}\left(k \log_2\frac{8em}{k}\log_2(32m) + \log_2\frac{180\, m\, (21 + \log_2(m))^2}{\delta}\right),$$
where
$$k = \left\lfloor \frac{65\left[(R + D)^2 + 2.25\,RD\right]}{\gamma^2} \right\rfloor$$
for $D = D(S, (u, b), \gamma)$, and provided $m \ge \max\{2/\epsilon, 6\}$, $k \le em$ and there is no discrete probability on misclassified training points.
Proof: Consider a fixed set of values for $\Delta$: $\Delta_1 = R\lfloor 2m^{0.25} - 1 \rfloor$, $\Delta_{i+1} = \Delta_i/2$, for $i = 2, \ldots, t$, where $t$ satisfies $R/32 \ge \Delta_t > R/64$. Hence, $t \le \log_2(128\, m^{0.25}) = 0.25(28 + \log_2(m))$. We apply Theorem 4.2 for each of these values of $\Delta$, using $\delta' = \delta/t$ in each application. Note that we have also loosely upper bounded the expression $(28 + \log_2(m)) \log_2(1 + mR^2/\Delta^2)$ by $(21 + \log_2(m))^2$ in each application. For a given value of $\gamma$ and $D = D(S, u, \gamma)$, it is easy to check that the value of $k$ is minimal for $\Delta = \sqrt{RD}$ and is monotonically decreasing for smaller values of $\Delta$ and monotonically increasing for larger values. Note that $\sqrt{RD} \le R\sqrt{2}\,(m-1)^{0.25}$, as the largest absolute difference in the values of the linear function on two training points is $2R$ and since $d((x, y), (u, b), \gamma) = 0$ for some $(x, y) \in S$, we must have $d((x', y'), (u, b), \gamma) \le 2R$ for all $(x', y') \in S$. Hence, as $2m^{0.25} - 1 > \sqrt{2}\,(m-1)^{0.25}$ for $m \ge 6$, we can always find a value of $\Delta_i$ satisfying $\sqrt{RD}/2 \le \Delta_i \le \sqrt{RD}$, provided $\sqrt{RD} \ge R/32$. The value of the expression
$$(R^2 + \Delta^2)(1 + D(S, u, \gamma)^2/\Delta^2)$$
at the value $\Delta_i$ will be upper bounded by its value at $\Delta = \sqrt{RD}/2$. A routine calculation confirms that for this value of $\Delta$, the expression is equal to $(R + D)^2 + 2.25\,RD$.

Now suppose $\sqrt{RD} < R/32$. In this case we will show that
$$(R^2 + \Delta_t^2)(1 + D^2/\Delta_t^2) \le \frac{130}{129}\left((R + D)^2 + 2.25\,RD\right),$$
so that the application of Theorem 4.2 with $\Delta = \Delta_t$ covers this case once the constant 64.5 is replaced by 65. Recall that $R/32 \ge \Delta_t > R/64$ and note that $\sqrt{D/R} < 1/32$. We therefore have
$$(R^2 + \Delta_t^2)(1 + D^2/\Delta_t^2) \le R^2\left(1 + \frac{1}{32^2}\right)\left(1 + \frac{64^2 D^2}{R^2}\right) \le R^2\left(1 + \frac{1}{1024}\right)\left(1 + \frac{64^2}{32^4}\right)$$
$$\le R^2\left(1 + \frac{1}{1024}\right)\left(1 + \frac{1}{256}\right) < \frac{130}{129} R^2 \le \frac{130}{129}\left((R + D)^2 + 2.25\,RD\right)$$
as required. The result follows.
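The routine calculations used in this section can be confirmed numerically. The sketch below checks the minimiser $\Delta = \sqrt{RD}$, the value at $\Delta = \sqrt{RD}/2$, and the constants appearing in the proofs of Theorems 4.2 and 4.3; the function name and the sample values of $R$ and $D$ are arbitrary illustrative choices.

```python
import math

def capacity_term(R, D, Delta):
    """(R^2 + Delta^2) * (1 + D^2 / Delta^2), the Delta-dependent part of k."""
    return (R**2 + Delta**2) * (1.0 + D**2 / Delta**2)

R, D = 3.0, 0.5
best = math.sqrt(R * D)

# Scan a geometric grid around sqrt(R*D); index 30 is sqrt(R*D) itself.
grid = [best * 2.0 ** (j / 10.0) for j in range(-30, 31)]
values = [capacity_term(R, D, Delta) for Delta in grid]
print(values.index(min(values)))

print(capacity_term(R, D, best), (R + D) ** 2)
print(capacity_term(R, D, best / 2), (R + D) ** 2 + 2.25 * R * D)

# Constants from the proofs.
mu = 2.0 ** (1.0 / 90.0)
print(64.005 * mu < 64.5)
print((1 + 1 / 1024) * (1 + 1 / 256) < 130 / 129)
print(64.5 * 130 / 129)
```

The last line shows why 64.5 becomes 65 in Theorem 4.3: the factor $130/129$ is absorbed into the constant.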
5 Algorithmics
The theory developed in the previous section provides a principled answer to a long standing question: what is the "best" linear separation of a set of points that are not linearly separable? Many heuristics have been proposed (see [4] for a review), mainly aimed at reducing the empirical risk, but most of them suffer from computational problems. The question is of practical interest, especially after the revival of perceptron-like systems due to the success of Support Vector Machines [5, 18]. The inability of the original Support Vector Machines to deal with noise (and tolerate outliers) is a serious practical limitation, especially because, when combined with the use of kernels, it can easily lead to overfitting. The solution developed for Support Vector Machines is a heuristic known as the "Soft Margin", which will be described below.

The bound proven in the previous section implies the following algorithm: minimize $D(S, (u, b), \gamma)$ for a given fixed value of $\gamma$, and subsequently minimize the bound over different choices of $\gamma$. This would ensure that the hyperplane coincides with the minimum of the upper bound on the generalization error. Moreover, as we will see, it can be found in polynomial time. The approach taken by Vapnik [18, Section 5.5.1] for his Soft Margin Classifier is similar, albeit with totally different motivations: in order to minimize the training error of the output hypothesis (an NP-complete task) he approximates it with the quantity $\sum_{j=1}^m d_j^\sigma$, which tends to the training error for $\sigma \to 0$. This gives rise to the following algorithm: for non-negative variables $d_j \ge 0$, minimize the function
$$F_\sigma(d) = \sum_{j=1}^m d_j^\sigma$$
subject to the constraints:
$$y_j[\langle u, x_j \rangle - b] \ge 1 - d_j, \quad j = 1, \ldots, m \qquad (1)$$
$$\langle u, u \rangle \le C, \qquad (2)$$
which can be solved in polynomial time for $\sigma = 1$ or $\sigma = 2$. This constrained optimization problem is then solved by minimizing the following quantity (problem (1)):
$$\lambda \sum_{j=1}^m d_j^\sigma + \frac{1}{2}\langle u, u \rangle$$
for different, fixed, values of $\lambda$. A suitable value of $\lambda$ is then usually chosen by means of a validation set. Once translated into dual variables, this problem turns out to be a quadratic programming problem for each fixed value of $\lambda$, and can be solved efficiently using standard methods.

The algorithm which follows from the theory presented in the previous section can, in contrast, be described as follows (problem (2)): minimize
$$\|d\|^2 = \sum_{j=1}^m d_j^2$$
subject to constraints
$$y_j[\langle u, x_j \rangle - b] \ge 1 - d_j, \quad j = 1, \ldots, m \qquad (3)$$
$$\langle u, u \rangle = C, \qquad (4)$$
which corresponds exactly to minimizing $D(S, (u, b), \gamma)$, where $\gamma = 1/\sqrt{C}$. This follows from considering the hyperplane $(u', b') = (u/\sqrt{C}, b/\sqrt{C})$, which has norm 1 and classifies the point $(x_j, y_j)$ such that $d((x_j, y_j), (u', b'), \gamma) = d_j/\sqrt{C}$, so that $D(S, (u', b'), \gamma) = \sqrt{F_2(d)/C}$. Once translated into dual variables, problem (2) gives rise to a convex maximization problem [14]
$$F(\alpha, \alpha_0) = -\frac{1}{4}\sum_{j=1}^m \alpha_j^2 + \sum_{j=1}^m \alpha_j - \frac{1}{4\alpha_0}\sum_{i,j=1}^m \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle - \alpha_0 C,$$
which must be solved subject to the constraints $\alpha_j \ge 0$, $j = 0, \ldots, m$, and $\sum_{j=1}^m \alpha_j y_j = 0$ for each value of $\gamma = 1/\sqrt{C}$, giving the optimal (according to our bound) hyperplane of fixed norm $\|w\| = 1/\gamma$ [...] polynomial time by applying a gradient based path algorithm following $\mathrm{grad}(F)$ with an appropriate learning rate $\eta$, but this convex optimization problem is more difficult than a standard quadratic programming one. The best $\gamma$ is then chosen again using the bound derived in the previous section, namely:
$$w^* = \arg\min_{\gamma}\ \min_{\|w\| = 1/\gamma} \left(\frac{R + D}{\gamma}\right)^2.$$
We will now show that the same result can be obtained by solving the (simpler) quadratic problem used by Vapnik, with $\sigma = 2$ and $\lambda$ optimised with respect to our bound. The idea is that the class of functions defined by problem (1) for $\lambda \in \mathbb{R}^+$ is identical to the class of functions defined by problem (2) for $\gamma \in \mathbb{R}^+$. The optimal function according to our bound is hence the same in both classes. First we need to prove a technical lemma, and state some definitions.
Lemma 5.1
The hyperplane implicitly defined by the optimization problem (1) depends continuously on the parameter $\lambda$.
Proof
: This follows from the fact that the dual function equivalent to problem (1) (once maximized in the positive quadrant for each value of) 5]W(
) = X i;1 2
X
i j
y
iy
jijK
(x
ix
j);
1
X
2i +
CX
iy
iis continuous both in
and in , and is strictly convex infor any xed value of. Strict convexity follows from the fact that its HessianH
ij =@
@F
jij =
y
jiy
jjK
(x
jix
jj);1
=H
jiij
1is negative de nite.
Definition 5.2
We define $W_\lambda$ to be the set of the solutions of problem (1) for all values of $\lambda$, and $W_\gamma$ to be the set of the solutions of problem (2) for all values of $\gamma$. Formally:
W =f
u
2<kj92< +u
=
argmin
mX
j=1
d
2j + 12h
u
u
igW =f
u
2<kj9 2< +u
=
argmin
jjujj=1=R
+D
2g
Theorem 5.3
The sets of functions $W_\lambda$ and $W_\gamma$ defined above are equivalent.
Proof: Let $w_\lambda$ be the solution of problem (1) for a fixed value of $\lambda$. Then $\|w_\lambda\| \to 0$ if $\lambda \to 0$, and $\|w_\lambda\| \to \infty$ if $\lambda \to \infty$. By Lemma 5.1 we know that [...] of problem (1) ranges through all possible positive values for suitably chosen values of $\lambda$. Since $\|w_\gamma\| = 1/\gamma = \sqrt{C}$, considering the solution for a value of $\lambda$ in problem (1) is equivalent to solving problem (2) with $C = \|w_\lambda\|^2$. This implies that for each function $w_\lambda \in W_\lambda$ there exists a value of $\gamma$ such that the corresponding $w_\gamma \in W_\gamma$, and $w_\lambda = w_\gamma$.

An obvious, but important, consequence of this theorem is the following corollary:
Corollary 5.4
The minima of the bound on $W_\lambda$ and $W_\gamma$ coincide:
$$\min_{w_\lambda \in W_\lambda} \left(\frac{R + D}{\gamma}\right)^2 = \min_{w_\gamma \in W_\gamma} \left(\frac{R + D}{\gamma}\right)^2.$$
This means that the optimal separating hyperplane (according to our bound) can be found by solving the quadratic optimization problem (2) with $\sigma = 2$ for different values of $\lambda$, and choosing the value of $\lambda$ which minimizes the bound itself. This analysis provides a theoretical justification for the Soft Margin heuristics (using the 2-norm of the vector $d$ described in the appendix of [5]), as well as a theoretically sound way to choose the optimal value of $\lambda$ in that case. In the next sections we will generalize the theoretical results given so far, and this will lead to a further description of soft-margin heuristics for Support Vector Machines as well as non-linear function classes.

6 Non-linear Function Spaces
6.1 Further Background Results
Before we can quote the next lemma, we need another definition.

Definition 6.1
Let $(X, d)$ be a (pseudo-) metric space, let $A$ be a subset of $X$ and $\epsilon > 0$. A set $B \subseteq X$ is an $\epsilon$-cover for $A$ if, for every $a \in A$, there exists $b \in B$ such that $d(a, b) \le \epsilon$. The $\epsilon$-covering number of $A$, $N_d(\epsilon, A)$, is the minimal cardinality of an $\epsilon$-cover for $A$ (if there is no such finite cover then it is defined to be $\infty$). We will say the cover is proper if $B \subseteq A$.

Note that we have used less than or equal to in the definition of a cover. This is somewhat unconventional, but will not change the bounds we use. It does, however, prove technically useful in the proofs. The idea is that $B$ should be finite but approximate all of $A$ with respect to the pseudometric $d$. We will use the $l_\infty$ distance over a finite sample $x = (x_1, \ldots, x_m)$ for the pseudo-metric in the space of functions,
$$d_x(f, g) = \max_i |f(x_i) - g(x_i)|.$$
We write $N(\epsilon, F, x) = N_{d_x}(\epsilon, F)$. We will consider the covers to be chosen from [...]
We now quote a lemma from [12] which follows immediately from a result of Alon et al. [1].

Corollary 6.2
[12] Let $F$ be a class of functions $X \to [a, b]$ and $P$ a distribution over $X$. Choose $0 < \epsilon < 1$ and let $d = \mathrm{fat}_F(\epsilon/4)$. Then
$$\sup_{x \in X^m} N(\epsilon, F, x) \le 2\left(\frac{4m(b - a)^2}{\epsilon^2}\right)^{d \log_2(2em(b-a)/(d\epsilon))}.$$
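Because the covering number bound of Corollary 6.2 grows very quickly, it is easier to work with its base-2 logarithm, which is the quantity that enters the error bounds later on. The sketch below evaluates that logarithm; the function name and parameter values are illustrative assumptions.

```python
import math

def log2_cover_bound(m, d, eps, a=0.0, b=1.0):
    """log2 of 2 * (4m(b-a)^2/eps^2) ** (d * log2(2em(b-a)/(d*eps)))."""
    width = b - a
    exponent = d * math.log2(2 * math.e * m * width / (d * eps))
    return 1.0 + exponent * math.log2(4 * m * width**2 / eps**2)

for m in (10**3, 10**4, 10**5):
    print(m, log2_cover_bound(m, d=10, eps=0.5))
```

The logarithm grows only like $d \log^2 m$, in contrast with the $m$ bits needed to describe an arbitrary dichotomy of the sample.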
We define a clipping function
$$\pi_\gamma(\alpha) := \begin{cases} \theta & \text{if } \alpha > \theta, \\ \theta - 2.01\gamma & \text{if } \alpha < \theta - 2.01\gamma, \\ \alpha & \text{otherwise,} \end{cases}$$
and let $\pi_\gamma(F) = \{\pi_\gamma(f) : f \in F\}$. The choice of the threshold $\theta$ is arbitrary but will be fixed before any analysis is made. If the value of $\theta$ needs to be included explicitly we will denote the clipping function by $\pi_\gamma^\theta$. For a monotonic function $f(\gamma)$ we define
$$f(\gamma^-) = \lim_{\alpha \to 0^+} f(\gamma - \alpha),$$
that is the left limit of $f$ at $\gamma$. Note that the minimal cardinality of an $\epsilon$-cover is a monotonically decreasing function of $\epsilon$, as is the fat shattering dimension as a function of $\gamma$.

Definition 6.3
We say that a class of functions $F$ is sturdy if its images under the evaluation maps
$$\tilde{x}_F : F \to \mathbb{R}, \qquad \tilde{x}_F : f \mapsto f(x)$$
are compact subsets of $\mathbb{R}$ for all $x \in X$.

Note that this definition differs slightly from that introduced in [15]. The current definition is more general, but at the same time simplifies the proof of the required properties.
Lemma 6.4
Let $F$ be a sturdy class of functions. Then for each $N \in \mathbb{N}$ and any fixed sequence $x \in X^m$, the infimum $\gamma_N = \inf\{\gamma \mid N(\gamma, F, x) = N\}$ is attained.
Proof: The straightforward proof follows exactly the proof of Lemma 2.6 of [15].

Corollary 6.5
Let $F$ be a sturdy class of functions. Then for each $N \in \mathbb{N}$ and any fixed sequence $x \in X^m$, the infimum $\gamma_N = \inf\{\gamma \mid N(\gamma, \pi_\gamma(F), x) = N\}$ is attained.
Proof: Suppose that the assertion does not hold for some $x \in X^m$ and $N \in \mathbb{N}$. Let $N' = N(\gamma_N, \pi_{\gamma_N}(F), x) > N$. Consider the following supremum
$$\gamma_{N'} = \sup\{\gamma \mid N(\gamma, \pi_{\gamma_N}(F), x) = N'\}.$$
Since the assertion does not hold we have $\gamma_{N'} \ge \gamma_N$. By the lemma we must have $\gamma_{N'} > \gamma_N$, since otherwise the infimum of the $\gamma$ required for the next size of cover will not be attained. Hence, there exists $\gamma' > \gamma_N$ with $N(\gamma', \pi_{\gamma_N}(F), x) = N'$. Let $\gamma = (\gamma' + \gamma_N)/2$. Note that $N(\gamma, \pi_\gamma(F), x) \le N$. Let $B$ be a minimal cover in this case. We claim that $B$ is also a $\gamma'$ cover for $\pi_{\gamma_N}(F)$ in the $d_x$ metric. To show this consider $f \in F$ and let $f_i \in B$ be within $\gamma$ of $\pi_\gamma(f)$ in the $d_x$ metric. Hence, for all $x \in x$, $|f_i(x) - \pi_\gamma(f)(x)| \le \gamma$. But this implies that
$$|f_i(x) - \pi_{\gamma_N}(f)(x)| \le \gamma + (\gamma - \gamma_N) = \gamma'.$$
Hence, we have $N(\gamma', \pi_{\gamma_N}(F), x) \le N$, a contradiction.

We will make use of the following lemma, which in the form below is due to Vapnik [17, page 168].
Lemma 6.6
Let $X$ be a set and $S$ a system of sets on $X$, and $P$ a probability measure on $X$. For $x \in X^m$ and $A \in S$, define $\nu_x(A) := |x \cap A|/m$. If $m > 2/\epsilon$, then
$$P^m\left\{x : \sup_{A \in S} |\nu_x(A) - P(A)| > \epsilon\right\} \le 2 P^{2m}\left\{xy : \sup_{A \in S} |\nu_x(A) - \nu_y(A)| > \epsilon/2\right\}.$$
The following two theorems are essentially quoted from [12] but they have been reformulated here in terms of the covering numbers involved. The difference will be apparent if Theorem 6.8 is compared with Theorem 3.4 quoted in Section 3.
Lemma 6.7
Suppose $F$ is a sturdy set of functions that map from $X$ to $\mathbb{R}$. Then for any distribution $P$ on $X$, any $k \in \mathbb{N}$ and any $\theta \in \mathbb{R}$,
$$P^{2m}\Big\{xy : \exists f \in F,\ r = \max_j\{f(x_j)\},\ 2\gamma < \theta - r,\ \lceil \log_2 N(\gamma, \pi_\gamma(F), xy) \rceil = k,\ \frac{1}{m}|\{i \mid f(y_i) \ge \theta\}| > \epsilon(m, k, \delta)\Big\} < \delta,$$
where $\epsilon(m, k, \delta) = \frac{1}{m}\left(k + \log_2\frac{2}{\delta}\right)$.
Proof: We have omitted the detailed proof since it is essentially the same as the corresponding proof in [12], with the simplification that Corollary 6.2 is not required and the property of sturdiness ensures by Corollary 6.5 that we can find a $\gamma_k$ cover, where
$$\gamma_k = \inf\{\gamma \mid N(\gamma, \pi_\gamma(F), xy) = 2^k\},$$
which can be used for all $\gamma$ satisfying $\lceil \log_2 N(\gamma, \pi_\gamma(F), xy) \rceil = k$. Note also that an inequality is required, $2\gamma < \theta - r$, as we have coverings using closed [...]
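Theorem 6.8
Consider a sturdy real valued function class $F$ having a uniform bound on the covering numbers
$$N(\gamma^-, \pi_{\gamma^-}(F), x) \le B(\ell, \gamma)$$
for all $x \in X^\ell$, for all $\ell$. Fix $\theta \in \mathbb{R}$. If a learner correctly classifies $m$ independently generated examples $z$ with $h = T_\theta(f) \in T_\theta(F)$ such that $\mathrm{er}_z(h) = 0$ and $\gamma = \min_i |f(x_i) - \theta|$, then with confidence $1 - \delta$ the expected error of $h$ is bounded from above by
$$\epsilon(m, k, \delta) = \frac{2}{m}\left(k + \log_2\frac{8m}{\delta}\right),$$
where $k = \lceil \log_2 B(2m, \gamma/2) \rceil$.
Proof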
: Making use of lemma 6.6 we will move to the double sample and stratify byk
. By the union bound, it thus suces to show thatP2m
k=1
P
2m(J
k)
< =
2,where
J
k =fxy
: 9h
=T
(f
)2T
(F)Erx(
h
) = 0k
= dlog2
B(2
m =
2)e= minj
f
(x
i);jEr y(h
)m
(mk
)=
2g:
(The largest value of
k
we need consider is 2m
, since for larger values the bound will in any case be trivial). It is sucient ifP
2m(J
k) 4m =
0. We will in
fact work with the set
J
k( 0) =f
xy
: 9h
=T
(f
)2T
(F)Erx(
h
) = 0k
= dlog2 N(
0
=
2
0= 2(F)
xy
)e 0<
minj
f
(x
i);jEr y(h
)m
(mk
)=
2g:
We will show that for any 0
<
, we haveP
2m(J
k( 0))
0. Hence, consideringthe limit 0
! from below, the result will follow.
Consider ^F = ^F. The probability distribution on ^
X
=X
f01g is givenby
P
onX
with the second component determined by the target value of the rst component. Note that for a pointy
2y
to be misclassi ed, it must have^
f
(^y
)>
maxff
^(^x
): ^x
2^x
g+ , so thatJ
k( 0) n
^
x
y
^ 2(X
f01g) 2m: 9
f
^2F^r
= maxff
^(^x
): ^x
2x
^g 0<
;
r
k
=dlogN( 0=
20= 2(
F)
xy
)e
f
y
^2y
^: ^f
(^y
)g
m
(mk
)=
2 oReplacing by 0
=
2 in Lemma 6.7 and appealing to Lemma 6.6 we obtainP
2m(J
k( 0))
0, for (mk
) = 2m
;k
+ log(2=
0)
6.2 Margin distribution and fat shattering
In this section we will generalise the results of Section 4 to function classes for which a bound on their fat-shattering dimension is known. The basic trick is to bound the covering numbers of the sum of two function classes in terms of the covering numbers of the individual classes. If $F$ and $G$ are real valued function classes defined on a domain $X$ we denote by $F + G$ the function class
$$F + G = \{f + g \mid f \in F,\ g \in G\}.$$
Lemma 6.9
Let $F$ and $G$ be two real valued function classes both defined on a domain $X$. Suppose $G$ has range $[a, b]$. Then we can bound the cardinality of a minimal $\gamma$ cover of $F + G$ by
$$N(\gamma, \pi_\gamma(F + G), x) \le N\left(\gamma/2, \pi^{\theta - a}_{\gamma + (b-a)/2}(F), x\right) N(\gamma/2, G, x).$$
Proof
: Fix 2(0 ) and letB
(respectivelyC
) be a minimal (respectively ;) cover of;a
+(b;a)=2(
F) (respectively G) in the
d
x metric. Consider the
set of functions
B
+C
. For anyf
+g
2 F +G, there is anf
i 2B
withinof
;a+(b;a)=2(
f
) in thed
x metric and a
g
j2
C
within ; ofg
in the samemetric. For
x
2x
we claimj
(f
+g
)(x
);(f
i+g
j)(x
)j:
(5)Hence,
(B
+C
) forms a cover of(F+G). Sincej
B
+C
j N( ;a+(b;a)=2(
F)
x
)N( ;Gx
)the result follows by setting
==
2. To justify the claim, assume rst that ;2 (f
+g
)(x
) . This implies that ;2 ;b
;2 ;g
(x
)f
(x
) ;g
(x
) ;a:
Hence, in this case using the fact that
only reduces distances,j
(f
+g
)(x
);(f
i+g
j)(x
)j j(f
+g
)(x
);(f
i+g
j)(x
)j= j(
;a+(b;a)=2(
f
) +g
)(x
);(
f
i+g
j)(x
)j j;a
+(b;a)=2(
f
)(x
);
f
i(x
)j+jg
(x
);g
j(x
)j + ;=:
If on the other hand (
f
+g
)(x
), we need only show that (f
i+g
j)(x
);in order for (5) to be satis ed. But we have
f
i(x
)minff
(x
);a
g;, whileg
j(x
)g
(x
);( ;). Hence,(
f
i+g
j)(x
) minf(f
+g
)(x
)g
(x
)+;a
g; ;:
Finally, if (
f
+g
)(x
) ;2 , we must show that (f
i +g
j)(x
) ; tosatisfy equation (5). In this case
f
i(x
) maxff
(x
);2 ;b
g+, whileg
j(x
)g
(x
) + ( ;). Hence,(
f
i+g
j)(x
) maxf(f
+g
)(x
)g
(x
)+;2 ;b
g+ ;:
Before proceeding we need a further technical lemma to show that the property of sturdiness is preserved under the addition operator.

Lemma 6.10
Let $F$ and $G$ be sturdy real valued function classes. Then $F + G$ is also sturdy.
Proof: Consider $x \in X$. $\tilde{x}_F(F)$ is a compact subset of $\mathbb{R}$, as is $\tilde{x}_G(G)$. Note that
$$\tilde{x}_{F+G}(F + G) = \tilde{x}_F(F) + \tilde{x}_G(G),$$
where the addition of two sets $A$ and $B$ of real numbers is defined by
$$A + B = \{a + b \mid a \in A,\ b \in B\}.$$
Since $\tilde{x}_F(F) \times \tilde{x}_G(G)$ is a compact set of $\mathbb{R}^2$ and $+$ is a continuous function from $\mathbb{R}^2$ to $\mathbb{R}$, we have that $\tilde{x}_F(F) + \tilde{x}_G(G)$, being the image of a compact set under $+$, is also compact.
Definition 6.11
Fix a threshold $\theta \in \mathbb{R}$. For a function $f$ on $X$ we define
$$d((x, y), f, \gamma) = \max\{0, \gamma - y(f(x) - \theta)\}.$$
This quantity is the amount by which $f$ fails to reach the margin $\gamma$ on the point $(x, y)$, or 0 if its margin is larger than $\gamma$. Let $g_f \in L_f(X)$ be the function
$$g_f = \sum_{(x,y) \in S} d((x, y), f, \gamma)\, y\, \delta_x.$$
Proposition 6.12
Fix 2<. Let F be a sturdy class of real-valued functionswith range
ab
]<having a uniform bound on the covering numbersN(
;
+A 2;
+A(
F)
x
) B(` A
)for all
x
2X
`, for all`
. Let G be a sturdy subset ofL
f(X
) with the uniformbound on the covering numbers,
N( ;
G
x
) A(`
)for
x
2 `, where =fxjx
2X
g. Consider a xed but unknown probabilitydistribution on the input space
X
. Then with probability 1; over randomlydrawn training sets
S
of sizem
for all>
0 the generalization of a functionf
2F thresholded at satisfyingg
f 2G is bounded by (mk
) = 2m k
+ log28
m
where
k
=dlog 2B(2
m =
4A
)+ log 2A(2
m =
4)ewhere
A
supfhg
xijg
2Gx
2X
g, providedm
2=
and there is no discrete probability on misclassified training points.
Proof: Consider the fixed mapping $\tau_1$. We extend the function class $\mathcal{F}$ to act on the space $X \times L_f(X)$ by its action on $X$. We similarly extend the function class $\mathcal{G}$ by composing with a projection. We claim that
1. for $x \notin S$, $f(x) = (f + g_f)(x)$, and
2. the margin of $f + g_f$ with threshold $\theta$ on the training set $\tau_1(S)$ is $\gamma$.
Hence, the off-training-set behaviour of the classifier $f$ can be characterised by the behaviour of $f + g_f$, while $f + g_f$ is a large margin classifier in the space $X \times L_f(X)$. In order to bound the generalization error we will apply Theorem 6.8 for $\mathcal{F}+\mathcal{G}$, which gives a bound in terms of the covering numbers. These we will bound using Lemma 6.9. The space $\mathcal{F}+\mathcal{G}$ is sturdy by Lemma 6.10, since both $\mathcal{F}$ and $\mathcal{G}$ are. Note that the range of $\mathcal{G}$ is contained in $[-A, A]$ on the input domain. In this case we obtain the following bound on the covering numbers, lim
!0 +log
2 ;
N(( ;
)=
2(;)=2(
F+G)
x
)lim
!0 +log
2
N(( ;
)=
4 +A (;)=2+A(F)
x
)+ lim
!0 +log
2(
N(( ;
)=
4Gx
))log2(
B(2
m =
4A
))+ log 2(A(2
m =
4))as required.
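1. The first claim follows immediately from the observation that for $z \notin S$,
$$\Big\langle \sum_{(x,y)\in S} d((x,y),f,\gamma)\, y\, \delta_x,\ \delta_z \Big\rangle = 0.$$
2. For $(x_0,y_0) \in S$, we have
$$y_0\big((f+g_f)(x_0) - \theta\big) = y_0\big(f(x_0) - \theta\big) + y_0 \Big\langle \sum_{(x,y)\in S} d((x,y),f,\gamma)\, y\, \delta_x,\ \delta_{x_0} \Big\rangle \ge \gamma - d((x_0,y_0),f,\gamma) + d((x_0,y_0),f,\gamma) = \gamma.$$
The proposition follows.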
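The two claims in the proof can be checked numerically on a toy example. The sketch below is an illustration under stated assumptions rather than part of the argument: the helper `add_slack_correction` is a hypothetical name, the inner product $\langle g_f, \delta_x\rangle$ is realised as a dictionary lookup, and the training inputs are assumed to be distinct.

```python
from typing import Callable, List, Tuple


def slack(f: Callable[[float], float], x: float, y: int,
          gamma: float, theta: float = 0.0) -> float:
    """Shortfall d((x, y), f, gamma) = max(0, gamma - y * (f(x) - theta))."""
    return max(0.0, gamma - y * (f(x) - theta))


def add_slack_correction(f: Callable[[float], float],
                         sample: List[Tuple[float, int]],
                         gamma: float,
                         theta: float = 0.0) -> Callable[[float], float]:
    """Return x -> (f + g_f)(x), where <g_f, delta_x> is d * y on S and 0 elsewhere."""
    coeffs = {x: slack(f, x, y, gamma, theta) * y for x, y in sample}

    def corrected(x: float) -> float:
        # Off the training set the lookup returns 0, so the value of f is unchanged.
        return f(x) + coeffs.get(x, 0.0)

    return corrected


if __name__ == "__main__":
    f = lambda x: 2.0 * x
    sample = [(1.0, 1), (0.2, 1), (-0.1, -1), (0.3, -1)]
    gamma, theta = 1.0, 0.0
    h = add_slack_correction(f, sample, gamma, theta)
    # Claim 2: every training point has margin at least gamma under f + g_f.
    print(all(y * (h(x) - theta) >= gamma - 1e-12 for x, y in sample))
    # Claim 1: off the training set, f + g_f agrees with f.
    print(h(5.0) == f(5.0))
```

Points that already have margin larger than $\gamma$ keep their original output, since their shortfall is zero.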
For a training set $S$, we define
$$D(S,f,\gamma) = \sqrt{\sum_{(x,y)\in S} d((x,y),f,\gamma)^2}.$$
Theorem 6.13
LetF be a sturdy class of real-valued functions with rangeab
]and fat shattering dimension bounded by fatF( ). Fix
$\theta \in \mathbb{R}$ and a scaling of the output range $\eta \in \mathbb{R}$. Consider a fixed but unknown probability distribution on the input space $X$. Then with probability $1-\delta$ over randomly drawn training sets
S
of sizem
for allb
;a > >
0 the generalization of a functionf
2 Fthresholded at
is bounded by (mk
) = 2m k
log2 65m
1 + ~
D
2
log2
9
em
1 + ~
D
+ log2
64
m
1:5(b
;a
)where
k
=hfatF(
;
=
16) + 64 ~D
2 iand ~
D
= 2(D
(Sf
)+)=
provided
m
2=
and there is no discrete probability on misclassied trainingpoints.
Proof: We define a sequence of function classes $\mathcal{G}_j \subseteq L_f(X)$ to be the linear functionals with norm at most $B_j$ on the space $L_f(X)$. We will apply Proposition 6.12 for each class $\mathcal{G}_j$. Note that the range of $\mathcal{G}_j$ is $[-B_j, B_j]$ on the input domain. Note also that the image of $\mathcal{G}_j$ under the evaluation map is a closed bounded subset of the reals and hence is compact. It follows that $\mathcal{G}_j$ is sturdy. We choose $B_j = j\eta$, for $j = 1, \ldots, \ell = \sqrt{m}(b-a)/\eta$. Hence, $B_\ell = \sqrt{m}(b-a) \ge D(S,f,\gamma)$ for all $f \in \mathcal{F}$ and all $\gamma < b-a$. Hence, for any value of $D = D(S,f,\gamma)$ obtained there is a value of $B_j$ satisfying $D \le B_j < D + \eta$. Substituting the upper bound $D + \eta$ for this $B_j$ will give the result, when we use $\delta' = \delta/\ell$ and bound the covering numbers of the component function classes using Corollary 6.2 and Theorem 3.5. In this case we obtain the following bounds on the covering numbers,
lim
!0 +log
2
N(( ;
)=
4+Bj
;+Bj( F)
x
)
1 +
d
1log 2260
m
(=
2+B
j)2 2
log2
18
em
(=
2+B
j)d
1
=: log2(
B(2
m =
4B
j))where
d
1 = fatF(;
=
16), andlim
!0 +log
2(
N(( ;
)=
4Gjx
)) 1 +d
2log2
260
mB
2j
2 !
log2
18
emB
jd
2=: log2(
A(2
m =
4))where
d
2 = (16B
j=
)2. Hence, in this case we can bound dlog
2
B(2
m =
4B
j)+log2
A(2
m =
4)ebydlog 2
B(2
m =
4B
j) + log 2A(2
m =
4)e 3 + "fatF( ;
=
16) + 16
B
j
2 #
log265
m
(1 + 2B
j=
) 2log29
em
(1 + 2B
j=
)
The theorem can of course be applied to linear function classes, using the bound on the fat shattering dimension given in Theorem 3.5. The bound obtained is comparable to that of Theorem 4.3, though a lot less clean.
For a training set $S$, we define
$$D'(S,f,\gamma) = \sum_{(x,y)\in S} d((x,y),f,\gamma).$$
This is the $l_1$ sum of the slack variables, which is optimised in Vapnik's box constraint maximal margin hyperplane algorithm. The following corollary shows that optimising this quantity does indeed lead to good generalization.
Corollary 6.14
Let F be a sturdy class of real-valued functions with rangeab
] and fat shattering dimension bounded by fatF( ). Fix2<and a scaling
of the output range
2<. Consider a xed but unknown probability distributionon the input space
X
. Then with probability1; over randomly drawn trainingsets
S
of sizem
for allb
;a > >
0 the generalization of a functionf
2 Fthresholded at
is bounded by (mk
) = 2m k
log2 65m
1 + ~
D
2
log2
9
em
1 + ~
D
+ log2
64
m
1:5(b
;a
)where
k
=hfatF(
;
=
16) + 64 ~D
2 iand ~
D
= 2(pD
0(Sf
)(b
;
a
) +)=
provided
m
2=
and there is no discrete probability on misclassied trainingpoints.
Proof: The corollary follows by observing that
$$D(S,f,\gamma) = \sqrt{\sum_{(x,y)\in S} d((x,y),f,\gamma)^2} \le \sqrt{(b-a) \sum_{(x,y)\in S} d((x,y),f,\gamma)} = \sqrt{D'(S,f,\gamma)(b-a)}$$
and applying the theorem.
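The inequality used in this proof is easy to check numerically. The sketch below computes both slack measures for a toy classifier and compares $D$ with $\sqrt{D'(b-a)}$. The function names, the clipped linear `f` and the sample are assumptions made for the example; the comparison relies on each individual slack being at most $b-a$, which holds for the values chosen here.

```python
import math
from typing import Callable, List, Tuple

Sample = List[Tuple[float, int]]


def slacks(f: Callable[[float], float], sample: Sample,
           gamma: float, theta: float = 0.0) -> List[float]:
    """Per-point shortfalls d((x, y), f, gamma)."""
    return [max(0.0, gamma - y * (f(x) - theta)) for x, y in sample]


def l2_slack(f: Callable[[float], float], sample: Sample,
             gamma: float, theta: float = 0.0) -> float:
    """D(S, f, gamma): the 2-norm of the slack vector."""
    return math.sqrt(sum(d * d for d in slacks(f, sample, gamma, theta)))


def l1_slack(f: Callable[[float], float], sample: Sample,
             gamma: float, theta: float = 0.0) -> float:
    """D'(S, f, gamma): the 1-norm of the slack vector."""
    return sum(slacks(f, sample, gamma, theta))


if __name__ == "__main__":
    a, b = -1.0, 1.0                      # assumed output range [a, b]
    f = lambda x: max(a, min(b, x))       # hypothetical classifier clipped to [a, b]
    sample = [(0.9, 1), (0.2, 1), (-0.4, 1), (0.1, -1)]
    gamma = 0.5
    d2 = l2_slack(f, sample, gamma)
    d1 = l1_slack(f, sample, gamma)
    print(d2, math.sqrt(d1 * (b - a)), d2 <= math.sqrt(d1 * (b - a)))
```

The gap between the two sides shows how much is given up by bounding $D$ through $D'$ before applying the theorem.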
If we choose the hyperplane to minimise $D'(S,f,\gamma)$ and apply the Corollary, we will necessarily obtain a weaker bound than we would if we minimised $D(S,f,\gamma)$ and then applied the Theorem. In the case of linear function classes, better bounds for the generalization in terms of $D$ and $D'$ should be obtained using recent results which bound the covering numbers for different norms directly [21]. It is worth noting that we can apply Corollary 6.14 to the case of linear functions with norm 1 and recover a result similar to Theorem 4.3. The bound would involve an expression $R^2 + D^2$ rather than $(R + D)^2$, which appears preferable.
7 Regression
In order to apply the results of the last section to the regression case we formulate the error estimation as a classification problem. Consider a real-valued function class $\mathcal{F}$ and a target real-valued function $t(x)$. For $f \in \mathcal{F}$ we define the function $e(f)$ and the class $e(\mathcal{F})$,
$$e(f)(x) = |f(x) - t(x)|, \qquad e(\mathcal{F}) = \{e(f) \mid f \in \mathcal{F}\}.$$
For a training point $(x,y) \in X \times \mathbb{R}$ we define
$$\partial((x,y),f,\gamma) = \max\{0,\ |f(x) - y| - (\theta - \gamma)\}.$$
This quantity is the amount by which $f$ exceeds the error margin $\theta - \gamma$ on the point $(x,y)$, or 0 if $f$ is within $\theta - \gamma$ of the target value. Hence, this is the $\epsilon$-insensitive loss measure considered by Vapnik with $\epsilon = \theta - \gamma$. Let $g_f \in L_f(X)$ be the function
$$g_f = -\sum_{(x,y)\in S} \partial((x,y),f,\gamma)\, \delta_x.$$
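The regression slack can be sketched in the same way as the classification shortfall. In the Python fragment below, `regression_slack` and `regression_D` are hypothetical helper names, and the predictor and data are made up for the example. The second function returns the 2-norm of the vector of slacks, and the insensitivity $\epsilon = \theta - \gamma$ is computed from the two parameters rather than passed directly.

```python
import math
from typing import Callable, List, Tuple


def regression_slack(f: Callable[[float], float], x: float, y: float,
                     theta: float, gamma: float) -> float:
    """Slack max(0, |f(x) - y| - (theta - gamma)), i.e. epsilon-insensitive loss
    with epsilon = theta - gamma."""
    return max(0.0, abs(f(x) - y) - (theta - gamma))


def regression_D(f: Callable[[float], float],
                 sample: List[Tuple[float, float]],
                 theta: float, gamma: float) -> float:
    """2-norm of the vector of regression slacks over the training set."""
    return math.sqrt(sum(regression_slack(f, x, y, theta, gamma) ** 2
                         for x, y in sample))


if __name__ == "__main__":
    f = lambda x: x                               # hypothetical predictor
    sample = [(0.0, 0.05), (1.0, 1.5), (2.0, 1.0)]
    # Target accuracy theta = 0.5 and margin gamma = 0.3 give epsilon = 0.2.
    print(regression_D(f, sample, theta=0.5, gamma=0.3))
    # With gamma = theta the insensitive zone vanishes and the slacks are |f(x) - y|.
    print(regression_D(f, sample, theta=0.5, gamma=0.5))
```

Errors smaller than $\theta - \gamma$ contribute nothing, so widening the insensitive zone can only decrease the slack norm for a fixed $f$.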
Proposition 7.1
Fix 2 <,>
0. Let F be a sturdy class of real-valuedfunctions with range
ab
]<having a uniform bound on the covering numbersN( ;
F
x
) B(m
)for all
x
2X
m. Let G be a sturdy subset ofL
f(X
) with the uniform bound onthe covering numbers,
N( ;
G
x
) A(m
)for
x
2m, where =fxjx
2X
g. Consider a xed but unknown probabilitydistribution on the input space
X
. Then with probability 1; over randomlydrawn training sets
S
of sizem
for all>
0 the probability that a functionf
2F has error greater than with respect to target functiont
on a randomlychosen input is bounded by
(mk
) = 2m k
+ log28
m
where
k
=dlog 2B(2
m =
4)+ log 2A(2
m =
4)ewhere
A
supfhg
xijg
2 Gx
2X
g, providedm
2=
, there is no discrete probability on training points with error greater than $\theta$, and $g_{e(f)} \in \mathcal{G}$.
Proof: The result follows from an application of Proposition 6.12 to the function class $e(\mathcal{F})$, noting that we treat all training examples as negative, and hence correct classification corresponds to having error less than $\theta$. Finally, we can bound the covering numbers N(
+A 2+A(e
(F))
x
) N( Fx
) B(m
):
For a training set $S$, we define
$$\mathcal{D}(S,f,\gamma) = \sqrt{\sum_{(x,y)\in S} \partial((x,y),f,\gamma)^2}.$$
The above result can be used to obtain a bound in terms of the observed value of $\mathcal{D}(S,f,\gamma)$ and the fat shattering dimension of the function class.
Theorem 7.2
LetF be a sturdy class of real-valued functions with rangeab
]and fat shattering dimension bounded by fatF( ). Fix
2 <,
>
0 and ascaling of the output range
2 <. Consider a xed but unknown probabilitydistribution on the input space
X
. Then with probability 1; over randomlydrawn training sets
S
of sizem
for all with>
0 the probability that afunction
f
2F has error larger than on a randomly chosen input is boundedby
(mk
) = 2m
k
log265
m b
;a
2 !
log2 9
em b
;a
+ log2
64
m
1:5(b
;a
)!
where
k
=hfatF( ;
=
16) + 64 ~D 2
i
and ~D= 2(D(
Sf
)+)=
provided
m
2=
and there is no discrete probability on misclassied trainingpoints.
Proof: The proof follows the same pattern as that of Theorem 6.13, with the exception that the bounds on the covering numbers must use the full range of the function class $\mathcal{F}$ in the log factors.
Corollary 7.3
Let $\mathcal{F}$ be the set of linear functions with norm 1 restricted to inputs in a ball of radius
R
about the origin. Fix 2<,>
0 and a scaling ofthe output range
2<. Consider a xed but unknown probability distributionon the input space
X
. Then with probability1; over randomly drawn trainingsets
S
of sizem
for all , with>
0 the probability that a functionf
2Fhas error larger than
on a randomly chosen input is bounded by (mk
) = 2m
k
log2260
m R
2 !
log2 18
emR
+ log2
128
m
1:5R
!
where
k
=h256
R
2=
2+ 64 ~ D 2i
and ~D= 2(D(
Sf
)+)=
provided
m
2=
and there is no discrete probability on misclassied trainingpoints.
Proof: The range of linear functions with unit weight vectors when restricted to the ball of radius $R$ is $[-R, R]$. Their fat shattering dimension is bounded by
Note that we obtain a generalization bound for standard least squares regression by taking $\gamma = \theta$ in Theorem 7.2. In this case $\mathcal{D}(S,f,\gamma)$ is the least squares error on the training set, while the bound gives the probability of a randomly chosen input having error greater than $\theta$. This is summarised in the following corollary.
Corollary 7.4
LetF be a sturdy class of real-valued functions with rangeab
]and fat shattering dimension bounded by fatF( ). Fix
2 <,
>
0 and ascaling of the output range
2 <. Consider a xed but unknown probabilitydistribution on the input space
X
. Then with probability 1; over randomlydrawn training sets
S
of sizem
the probability that a functionf
2F has errorlarger than
on a randomly chosen input is bounded by (mk
) = 2m
k
log265
m b
;a
2 !
log2 9
em b
;a
+ log264
m
1:5(b
;a
)!
where
k
=hfatF(
;
=
16) + 64 ~ D 2i
and ~D= 2 q
P
(x y)2S(
f
(x
) ;y
)2+
provided
m
2=
and there is no discrete probability on misclassied trainingpoints.
As mentioned in the section dealing with classification, we could bound the generalization in terms of other norms of the vector of slack variables
$$\big(\partial((x,y),f,\gamma)\big)_{(x,y)\in S}.$$
The aim of this paper, however, is not to list all possible results; it is rather to illustrate how such results can be obtained.
Another application of these results is to choose the best $\epsilon$ for the $\epsilon$-insensitive loss function for Support Vector Regression. This problem has usually been solved by using a validation set, but Corollary 7.3 could be used to choose the value of $\epsilon$ which gives the best bound on the generalization. We assume here that a target accuracy $\theta$ has been set and we wish to minimise the probability that the error exceeds this value. The optimum will be the $\epsilon$ which minimises
$$\frac{R^2 + \mathcal{D}(S,f,\theta-\epsilon)^2}{(\theta-\epsilon)^2},$$
where $f$ is the solution obtained when using the $\epsilon$-insensitive loss function.
8 Conclusions
the hyperplane. The bounds obtained can be significantly better than previously obtained bounds, particularly when some of the points are misclassified: a classical analysis would then have to fall back on agnostic bounds, in which the square root of the sample size replaces the sample size in the denominator. The bound is also more robust than that derived for the maximal margin hyperplane, where a single point can have a dramatic effect on the hyperplane produced.
We have gone on to show how optimizing the measure of the margin distribution that appears in the bound corresponds to an algorithm proposed by Cortes and Vapnik [5]. This formulation also allows the problem to be solved in kernel spaces such as those used with the Support Vector Machine.
We believe that this paper presents the first pac style bound for a margin distribution measure that is neither critically dependent on the nearest points to the hyperplane nor an agnostic version of that approach. In addition, we believe it is the first paper to give a provably optimal algorithm for optimizing the generalization performance of agnostic learning with hyperplanes, by showing that the criterion to be minimised should not be the number of training errors, but rather a more flexible criterion which could be termed a `soft margin'. Finding a more informative and theoretically well-founded measure of the margin distribution has been an open problem for some time. This paper suggests one candidate for such a measure, which has the advantage of being robust in the sense that it is not critically sensitive to the behaviour of individual training points.
The results have been further generalized to non-linear function classes with bounded fat-shattering dimensions, to other norms on the vector of shortfalls of individual training points, and to the regression case. For regression, one byproduct is a bound, in terms of the least squares error on the training set, on the probability that a randomly drawn test point will have error greater than a given value.
References
[1] Noga Alon, Shai Ben-David, Nicolò Cesa-Bianchi and David Haussler, "Scale-sensitive Dimensions, Uniform Convergence, and Learnability," Journal of the ACM, 44(4), 615-631, (1997).
[2] Peter L. Bartlett, "The Sample Complexity of Pattern Classification with Neural Networks: The Size of the Weights is More Important than the Size of the Network," IEEE Trans. Inf. Theory, 44(2), 525-536, (1998).
[3] Peter Bartlett and John Shawe-Taylor, Generalization Performance of
[4] C. Campbell, Constructive Learning Techniques for Designing Neural Network Systems, in (ed. C.T. Leondes) Neural Network Systems Technologies and Applications. San Diego: Academic Press, 1997.
[5] C. Cortes and V. Vapnik, Support-Vector Networks, Machine Learning, 20(3):273-297, September 1995.
[6] Nello Cristianini, John Shawe-Taylor, and Peter Sykacek, Bayesian Classifiers are Large Margin Hyperplanes in a Hilbert Space, in Shavlik, J., ed., Machine Learning: Proceedings of the Fifteenth International Conference, Morgan Kaufmann Publishers, San Francisco, CA.
[7] R.O. Duda and P.E. Hart, Pattern Classification and Scene Analysis, New York: Wiley, 1973.
[8] Yoav Freund and Robert E. Schapire, Large Margin Classification Using the Perceptron Algorithm, Proceedings of the Eleventh Annual Conference on Computational Learning Theory, 1998.
[9] Leonid Gurvits, A note on a scale-sensitive dimension of linear bounded functionals in Banach spaces. In Proceedings of Algorithmic Learning Theory, ALT-97, and as NECI Technical Report, 1997.
[10] Norbert Klasner and Hans Ulrich Simon, From Noise-Free to Noise-Tolerant and from On-line to Batch Learning, Proceedings of the Eighth Annual Conference on Computational Learning Theory, COLT'95, 1995, pp. 250-257.
[11] R. Schapire, Y. Freund, P. Bartlett, W. Sun Lee, Boosting the Margin: A New Explanation for the Effectiveness of Voting Methods. In D.H. Fisher, Jr., editor, Proceedings of International Conference on Machine Learning, ICML'97, pages 322-330, Nashville, Tennessee, July 1997. Morgan Kaufmann Publishers.
[12] John Shawe-Taylor, Peter L. Bartlett, Robert C. Williamson, Martin Anthony, Structural Risk Minimization over Data-Dependent Hierarchies, IEEE Trans. on Inf. Theory, 44(5), 1926-1940, (1998), and NeuroCOLT Technical Report NC-TR-96-053, 1996. (ftp://ftp.dcs.rhbnc.ac.uk/pub/neurocolt/tech reports).
[13] John Shawe-Taylor and Nello Cristianini, Data Dependent Structural Risk Minimization for Perceptron Decision Trees, Proceedings of the Eleventh Conference on Neural Information Processing Systems, NIPS'97. Advances in Neural Information Processing Systems 10, Michael I. Jordan, Michael J. Kearns, and Sara A. Solla (eds.), Cambridge, MA: MIT Press (1998), pp. 336-342.
[15] John Shawe-Taylor and Robert C. Williamson, Generalization Performance of Classifiers in Terms of Observed Covering Numbers, Submitted to EuroCOLT'99, 1998.
[16] University of California, Irvine - Machine Learning Repository, http://www.ics.uci.edu/~mlearn/MLRepository.html
[17] Vladimir N. Vapnik, Estimation of Dependences Based on Empirical Data, Springer-Verlag, New York, 1982.
[18] Vladimir N. Vapnik, The Nature of Statistical Learning Theory, Springer-Verlag, New York, 1995.
[19] Vladimir N. Vapnik, Estimation of Dependences Based on Empirical Data, Springer-Verlag, New York, 1982.
[20] Vladimir N. Vapnik, Esther Levin and Yann Le Cun, Measuring the VC-dimension of a learning machine, Neural Computation, 6:851-876, 1994.
[21] Robert C. Williamson, Alex J. Smola and Bernhard Schölkopf, "Entropy Numbers, Operators and Support Vector Kernels," submitted to EuroCOLT'99. See also "Generalization Performance of Regularization Networks and Support Vector Machines via Entropy Numbers of Compact Operators," http://spigot.anu.edu.au/people/williams/papers/P100.ps