• No results found

Robust Bounds on Generalization from the Margin Distribution

N/A
N/A
Protected

Academic year: 2020

Share "Robust Bounds on Generalization from the Margin Distribution"

Copied!
28
0
0

Loading.... (view fulltext now)

Full text

(1)

Robust Bounds on Generalization

from the Margin Distribution

John Shawe-Taylor

Royal Holloway, University of London

[email protected]

Nello Cristianini

University of Bristol

[email protected]

NeuroCOLT2 Technical Report Series

NC2-TR-1998-029

October, 1998

1

Produced as part of the ESPRIT Working Group

in Neural and Computational Learning II,

NeuroCOLT2 27150

For more information see the NeuroCOLT website

http://www.neurocolt.com

(2)
(3)

2

Abstract

(4)

Introduction 3

1 Introduction

For classi cation by thresholding a real valued function the margin of a training point is the amount by which its output is on the right side of the threshold or, if misclassi ed, minus the amount by which it fails to be correctly classi ed. In the case of linear hyperplanes with unit weight vectors, this value can also be seen as the distance of the input point from the hyperplane. The margin of a classi er is the minimum margin over the training set.

The idea that a large margin classi er might be expected to give good gen-eralization is certainly not new 7, 19]. Despite this insight it was not until comparatively recently 12] that such a conjecture has been placed on a rm footing in the probably approximately correct (pac) model of learning. Learn-ing in this model entails givLearn-ing a bound on the generalization error which will hold with high con dence over randomly drawn training sets. In this sense it can be said to ensure reliable learning, something that cannot be guaranteed by bounds on the expected error of a classi er.

Despite successes in extending this style of analysis to the agnostic case 2] and applying it to neural networks 2], boosting algorithms 11], perceptron decision trees 13] and Bayesian algorithms 6], there has been concern that the measure of the distribution of margin values attained by the training set is largely ignored in a bound that depends only on its minimal value. Intuitively, there appeared to be something lost with a bound that depended so critically on the positions of possibly a small proportion of the training set, ignoring the margin attained by the majority of the points. Attempts to address this problem have been made in for example 11], but they treat points that fail to meet the larger margin as errors and fall back on agnostic bounds for the generalization error. In contrast our results apply to the case where there are training set errors, but have the form of bounds with no training set errors. The question of how to handle the situation of non linearly separable data has received a lot of attention (see 4] for a review of some of the methods suggested). The problem is that minimising the number of training errors is NP-complete and so the various methods adopted are inherently heuristic relative to the best bounds previously available for bounding the generalization error. By showing that the generalization error can be bounded in terms of a quantity that can be optimized by a polynomial time algorithm, we provide a solution to a long-standing conundrum of perceptron learning.

(5)

Summary of Results 4

Algorithms arising from the approach are related to those of Cortes and Vap-nik 5] and directly justify the original proposal made to minimise the 2-norm of the slack variables. We generalise the basic result to function classes with bounded fat-shattering dimension and the 1-norm of the slack variables which gives rise to Vapnik's box constraint algorithm. Finally, application to regres-sion is considered, which includes results for standard least squares as a special case.

The paper is structured as follows. In the next section we will summarise the results in

O

notation to give a avour of what the paper aims to achieve. In Section 3 relevant background material and de nitions are introduced. This is followed by a section describing the results for classi cation using linear functions. This is the simplest case considered and provides insight into the basic techniques employed. Section 5 describes the algorithm consequences of these results for Support Vector Machine classi cation algorithms. We then proceed to generalize the results to non-linear function spaces in Section 6. The penultimate section considers the problem of regression and shows how the results obtained for classi cation readily generalize to this case.

2 Summary of Results

The results in this section will be given in the ~

O

notation indicating asymptotics ignoring log factors. The aim is to give the avour of the results obtained which might otherwise be obscured by the detailed technicalities of the proofs and precise bounds obtained.

The rst case considered is that of classi cation using linear function classes that include the use of kernel functions such as those used in the Support Vector Machine. For this case consider a margin about the separating hyperplane and set (

d

( )(x y))(x y)2S) to be the vector for training set

S

to be the vector

of the amounts by which the training points fail to achieve the margin . We bound the probability

of misclassi cation of a randomly chosen test point by (see Theorem 4.3)

O

~ (

R

+k

d

k 2)

2

j

S

j 2

where

R

is the radius of a ball about the origin which contains the support of the input probability distribution.

The results are generalized to non-linear function classes using a characterisa-tion of their capacity at scale known as the fat shattering dimension fat( ). In this case the bound obtained has the form (see Theorem 6.13)

O

~ fat(

=

16)+k

d

k 2

2

=

2

j

S

j

This result is generalized to obtain a bound in terms of the 1-norm of the vector

d

(see Corollary 6.14)

O

~ fat(

=

16)+k

d

k 1

=

2

j

S

j

(6)

Background Results 5

which could of course also be applied to the linear case using a bound on the fat shattering dimension for this case.

Finally, the problem of estimating errors of regressors is addressed with the techniques developed. We bound the probability

that for a randomly chosen test point the absolute error is greater than a given value

. In this case we de ne a vector (

@

(x y))(x y)2S of amounts by which the error on the training

examples exceeds

; . Note thatk

@

(

)k 2

2 is simply the least squares error on

the training set. We then bound the probability

by (see Theorem 7.2)

O

~ fat(

=

16)+k

@

( )k 2

2

=

2

j

S

j

:

These results can be used for Support Vector Regression and give a way of choosing the optimal size

; of the tube for the insensitive loss function.

In addition they can be applied to standard least square regression by setting =

to obtain the bound (see Corollary 7.4)

O

~ fat(

=

16) +k

@

(

)k 2

2

=

2

j

S

j

:

3 Background Results

We consider learning from examples, initially of a binary classi cation. We denote the domain of the problem by

X

and a sequence of inputs by

x

= (

x

1

:::x

m)

2

X

m. A training sequence is typically denoted by

z

= ((

x

1

y

1)

:::

(

x

m

y

m)) 2

(

X

f;1

1g)mand the set of training examples by

S

. By Er

z(

h

) we denote the

number of classi cation errors of the function

h

on the sequence

z

. As we will typically be classifying by thresholding real valued functions we introduce the notation

T

(

f

) to denote the function giving output 1 if

f

has output greater

than or equal to

and ;1 otherwise. For a class H the class

T

(H) is the set

of derived classi cation functions. We rst give some necessary de nitions.

De nition 3.1

Let

H

be a set of binary valued functions. We say that a set of points

X

isshattered by

H

if for all binary vectors

b

indexed by

X

, there is a function

f

b 2

H

realising

b

on

X

. TheVapnik-Chervonenkis (VC) dimension,

VCdim(

H

), of the set

H

is the size of the largest shattered set, if this is nite or innity otherwise.

The following theorem is well known in a number of dierent forms. We quote the result here as a bound on the generalization error rather than as a required sample size for given generalization.

Theorem 3.2

12] Let

H

i,

i

= 1

2

:::

be a sequence of hypothesis classes

map-ping

X

tof0

1gsuch thatVCdim(

H

i) =

i

, and let

P

be a probability distribution

on

X

. Let

p

d be any set of positive numbers satisfyingP 1

d=1

p

d = 1

:

With
(7)

Background Results 6

which a learner nds a consistent hypothesis

h

in

H

d, the generalization error

of

h

is bounded from above by

(

md

) = 4

m d

ln 2

em

d

+ ln 1

p

d

+ ln 4

provided

d m

.

We now introduce the generalization of the VC dimension which makes it pos-sible to generalize Theorem 3.2 to large margin classi cation.

De nition 3.3

Let H be a set of real valued functions. We say that a set of

points

X

is -shattered byHif there are real numbers

r

xindexed by

x

2

X

such

that for all binary vectors

b

indexed by

X

, there is a function

f

b2Hsatisfying

f

b(

x

)

r

x+ if

b

x= 1

r

x; otherwise

:

Thefat shattering dimension fatH of the set

H is a function from the positive

real numbers to the integers which maps a value to the size of the largest -shattered set, if this is nite or innity otherwise.

We will make critical use of the following result contained in Shawe-Taylor et al 12] which involves the fat shattering dimension of the space of functions.

Theorem 3.4

Consider a real valued function class H having fat shattering

function bounded above by the functionafat :<!N which is continuous from

the right. Fix

2<. Then with probability at least1;

a learner who correctly

classies

m

independently generated examples

z

with

h

=

T

(

f

) 2

T

(H) such

thaterz(

h

) = 0 and = min

j

f

(

x

i);

jwill have error of

h

bounded from above

by

(

mk

) = 2

m k

log2

8

em

k

log2(32

m

) + log2

8

m

where

k

= afat(

=

8)

em

.
(8)

Linear Function Classes 7

Theorem 3.5

3] Consider a Hilbert space and the class of linear functions

L

of norm less than or equal to

B

restricted to the sphere of radius

R

about the origin. Then the fat shattering dimension of

L

can be bounded by

fatL( )

BR

2

:

In order to apply Theorems 3.4 and 3.5 we need to bound the radius of the sphere containing the points and the norm of the linear functionals involved. Clearly, scaling by these quantities will give the margin appropriate for appli-cation of the theorem.

4 Linear Function Classes

Let

X

be a Hilbert space. We de ne the following Hilbert space derived from

X

.

De nition 4.1

Let

L

f(

X

) be the set of real valued functions

f

on

X

with

supportsupp(

f

) nite, that is functions in

L

f(

X

) are non-zero only for nitely

many points. We dene the inner product of two functions

fg

2

L

f(

X

), by

h

f

g

i= X

x2supp(f)

f

(

x

)

g

(

x

)

:

Note that the sum which de nes the inner product is well-de ned since the functions have nite support. Clearly the space is closed under addition and multiplication by scalars.

Now for any xed

>

0 we de ne an embedding of

X

into the Hilbert space

X

L

f(

X

) as follows.

:

x

7!

X

= (

x

x)

where

x 2

L

f(

X

) is de ned by

x(

y

) =

1 if

y

=

x

0 otherwise.

We begin by considering the case where is xed. In practice we wish to choose this parameter in response to the data. In order to obtain a bound over dierent values of it will be necessary to apply the following theorem several times.

For a linear classi er

u

on

X

and threshold

b

2< we de ne

d

((

xy

)

(

u

b

)

) = maxf0

;

y

(h

u

x

i;

b

)g

:

This quantity is the amount by which

u

fails to reach the margin on the point (

xy

) or 0 if its margin is larger than . Similarly for a training set

S

, we de ne

D

(

S

(

u

b

)

) =s X

(x y)2S

(9)

Linear Function Classes 8

Theorem 4.2

Fix

>

0,

b

2 <. Consider a xed but unknown probability

distribution on the input space

X

with support in the ball of radius

R

about the origin. Then with probability1;

over randomly drawn training sets

S

of size

m

for all

>

0 the generalization of a linear classier

u

on

X

with k

u

k= 1,

thresholded at

b

is bounded by

(

mk

) = 2

m k

log2

8

em

k

log2(32

m

) + log2

720

m

log2(1 +

mR

2

=

2)

where

k

=

64

:

5(

R

2+ 2)( k

u

k

2+

D

(

S

(

u

b

)

)2

=

2) 2

provided

m

2

=

,

k em

and there is no discrete probability on misclassied

training points.

Proof

: Consider the xed mapping and the augmented linear functional

over the space

X

L

f(

X

),

^

u

=

0

@

u

1

X

(x y)2S

d

((

xy

)

(

u

b

)

)

y

x 1

A

:

We claim that

1. for

x

62

S

,h

u

x

i=h

u

^ (

x

)

i, and

2. the margin of ^

u

with threshold

b

on the training set (

S

) is .

Hence, the o training set behaviour of the linear classi er (

u

b

) can be char-acterised by the behaviour of (^

u

b

), while (^

u

b

) is a large margin classi er in the space

X

L

f(

X

). Since for

x

2

S

, k (

x

)k

2

R

2 + 2 and k

u

^k

2 = k

u

k

2+

D

(

S

(

u

b

)

)2

=

2, the result will then follow from an application of

Theorems 3.4 and 3.5 provided that there are no misclassi ed training points with discrete probability. Since Theorem 3.4 can only be applied for a xed space of functions we must apply the two theorems for a discrete set of values for the bound

B

on the norm of the linear functions. This corresponds to choosing a discrete set of values for the product (

BR

)2 of Theorem 3.5. We will choose

the arithmetic series

i(

R

2+2), for

i

= 1

:::`

= 90log

2(1+

mR

2

=

2), where

is chosen so that

`(

R

2+ 2) = (

R

2+ 2)(1 +

mR

2

=

2) which is an upper

bound on the product k (

x

)k 2

k

u

^k

2, since

D

2

mR

2. Hence,

= 21=90 and

it is readily veri ed that 64

:

005

<

64

:

5. This implies that if we replace

the constant 64 of Theorem 3.4 by 64.005 to ensure the continuity from the right, then for the observed value ofk (

x

)k

2

k

u

^k

2 there is an application of the

theorem for a value of (

BR

)2 within a factor

of this value and the required

bound holds. Note that for each application of the theorem we must replace the

of Theorem 3.4 by

0=

=`

in order to ensure that all applications hold
(10)

Linear Function Classes 9

1. The rst claim follows immediately from the observation that for

z

62

S

, *

X

(x y)2S

d

((

xy

)

(

u

b

)

)

y

x

z

+

= 0

:

2. For (

x

0

y

0)

2

S

, we have

y

0( h

u

^

(

x

0)

i;

b

) =

y

0(

h

u

x

0

i;

b

) +

y

0

*

X

(x y)2S

d

((

xy

)

(

u

b

)

)

y

0

x

x 0

+

;

d

((

x

0

y

0

)

(

u

b

)

)+

d

((

x

0

y

0

)

(

u

b

)

) =

:

The theorem follows.

We now apply this theorem several times to allow a choice of which approxi-mately minimises the expression for

k

. Note that the minimum of the expression (ignoring the constant and suppressing the denominator 2) is (

R

+

D

)2attained

when =p

RD

.

Theorem 4.3

Fix

b

2<. Consider a xed but unknown probability distribution

on the input space

X

with support in the ball of radius

R

about the origin. Then with probability 1;

over randomly drawn training sets

S

of size

m

for all

>

0 such that

d

((

xy

)

(

u

b

)

) = 0, for some (

xy

) 2

S

, the generalization

of a linear classier

u

on

X

satisfyingk

u

k 1 is bounded by

(

mk

) = 2

m k

log2

8

em

k

log2(32

m

)+ log2

180

m

(21 + log2(

m

)) 2

where

k

=

65(

R

+

D

)2+ 2

:

25

RD

] 2

for

D

=

D

(

S

(

u

b

)

), and provided

m

maxf2

=

6g,

k em

and there is no

discrete probability on misclassied training points.

Proof

: Consider a xed set of values for , 1 =

R

b2

m

0:25

;1c, i +1 =

i

=

2, for

i

= 2

:::t

, where

t

satis es,

R=

32 t

> R=

64. Hence,

t

log2(128

m

0:25) = 0

:

25(28 + log

2(

m

)). We apply Theorem 4.2 for each of these

values of , using

0 =

=t

in each application. Note that we have also

loosely upper bounded the expression (28 + log2(

m

))log2(1 +

mR

2

=

2) by

(21+log2(

m

))

2in each application. For a given value of and

D

=

D

(

S

u

),

it is easy to check that the value of

k

is minimal for = p

RD

and is mono-tonically decreasing for smaller values of and monomono-tonically increasing for larger values. Note that p

RD R

p

2p

m

;1, as the largest absolute

dier-ence in the values of the linear function on two training points is 2

R

and since

d

((

xy

)

(

u

b

)

) = 0, for some (

xy

)2

S

, we must have

d

((

x

0

y

0)

(

u

b

)

)

2

R

, for all (

x

0

y

0)

2

S

. Hence, as 2

m

0:25

;1

>

p

2(

m

;1)

0:25for

m

(11)

Algorithmics 10

nd a value of i satisfying p

RD=

2 i p

RD

, provided p

RD

R=

32.

The value of the expression

(

R

2+ 2)(1 +

D

(

S

u

)2

=

2)

at the value i will be upper bounded by its value at =p

RD=

2. A routine calculation con rms that for this value of , the expression is equal to (

R

+

D

)2+ 2

:

25

RD

. Now suppose p

RD < R=

32. In this case we will show that (

R

2

+ 2

t)(1 +

D

2

=

2

t) 130129

(

R

+

D

)2

+ 2

:

25

RD

so that the application of Theorem 4.2 with = t covers this case once the

constant 64

:

5 is replaced by 65. Recall that

R=

32t

> R=

64 and note that p

D=R <

1

=

32. We therefore have (

R

2

+ 2

t)(1 +

D

2

=

2

t)

R

2

(1 + 1

=

322

)(1 + 642

D

2

=R

2

)

R

2 1 + 1

1024

1 + 642

324

R

2

1 + 11024

1 + 1256

<

130129

R

2 130

129

(

R

+

D

)2

+ 2

:

25

RD

as required. The result follows.

5 Algorithmics

The theory developed in the previous section provides a principled answer to a long standing question: what is the "best" linear separation of a set of points that are not linearly separable? Many heuristics have been proposed (see 4] for a review), mainly aimed at reducing the empirical risk, but most of them suer from computational problems. The question is a practically interesting one, expecially after the revival of perceptron-like systems due to the success of Support Vector Machines 5, 18]. The inability of the original Support Vector Machines to deal with noise (and tolerate outliers) is a serious practical lim-itation, expecially because - when combined with the use of kernels - it can easily lead to over tting. The solution developed for Support Vector machines is a heuristic known as the "Soft Margin", which will be described below. The bound proven in the previous section implies the following algorithm: minimize

D

(

S

(

u

b

)

)for a given xed value of , and subsequently minimize the bound over dierent choices of . This would ensure that the hyperplane coincides with the minimum of the upper bound on the generalization error. Moreover, as we will see, it can be found in polynomial time. The approach taken by Vapnik 18, Section 5.5.1] for his Soft Margin Classifer is similar, albeit with totally dierent motivations: in order to minimize the training error of the output hy-pothesis (an NP-complete task) he approximates it with the quantityPm
(12)

Algorithmics 11

which tends to the training error for

! 0. This gives rise to the following

algorithm: for non-negative variables

d

i0, minimize the function

F

(

d

) =Xm

j=1

d

j

subject to the constraints:

y

jh

u

x

ji;

b

] 1;

d

j

j

= 1

:::m

(1)

h

u

u

i

C:

(2)

which can be solved in polynomial time for

= 1 or

= 2. This constrained optimization problem is then solved by minimizing the following quantity (prob-lem (1)):

Xm

j=1

d

j+ 12h

u

u

i

for dierent, xed, values of

. A suitable value of

is then usually chosen by means of a validation set. Once translated into dual variables, this problem turns out to be a quadratic programming problem for each xed value of

, and can be solved eciently using standard methods.

The algorithm which follows from the theory presented in the previous section can, in contrast, be described as follows (problem (2)): minimize

k

d

k 2 =

m

X

j=1

d

2

j

subject to constraints

y

jh

u

x

ji;

b

] 1;

d

j

j

= 1

:::m

(3)

h

u

u

i =

C

(4)

which corresponds exactly to minimizing

D

(

S

(

u

b

)

), where = 1 p

C. This

follows from considering the hyperplane (

u

0

b

0) = (

u

=

p

Cb=

p

C

) which has norm 1 and classi es the point (

x

j

y

j) such that

d

((

x

j

y

j)

(

u

0

b

0)

) =

d

j

=

p

C

so that

D

(

S

(

u

0

b

0)

) = p

F

2(

d

)

=C

. Once translated into dual variables,

problem (2) gives rise to a convex maximization problem 14]

F

(

0

) = ;

1 4

m

X

j=1

2

j + m

X

j=1

j;

1 4

C

m

X

i j=1

i

j

y

i

y

jh

x

i

x

ji;

0

C

which must be solved subject to the constraints,

j 0,

j

= 0

:::m

, and Pm

j=1

j

y

j = 0 for each value of = 1

p

C, giving the optimal (according to

our bound) hyperplane of xed norm jj

w

jj = 1
(13)

Algorithmics 12

polynomial time by applying a gradient based path algorithm following grad(

F

) with an appropriate learning rate

, but this convex optimization problem is more dicult than a standard quadratic programming one. The best is then chosen again using the bound derived in the previous section, namely:

w

= argmin min

kwk=1=

R

+

D

2

We will now show that the same result can be obtained by solving the (simpler) quadratic problem used by Vapnik, with

= 2 and

is optimised with respect to our bound. The idea is that the class of functions de ned by problem (1) for

2<

+ is identical to the class of functions de ned by problem (2) for 2<

+.

The optimal function according to our bound is hence the same in both classes. First we need to prove a technical lemma, and state some de nitions.

Lemma 5.1

The hyperplane implicitly dened by the optimization problem (1) depends continuously on the parameter

.

Proof

: This follows from the fact that the dual function equivalent to problem (1) (once maximized in the positive quadrant for each value of

) 5]

W(

) = X

i;

1 2

X

i j

y

i

y

j

i

j

K

(

x

i

x

j)

;

1

X

2

i +

C

X

i

y

i

is continuous both in

and in

, and is strictly convex in

for any xed value of

. Strict convexity follows from the fact that its Hessian

H

ij =

@

@F

ji

j =

y

ji

y

jj

K

(

x

ji

x

jj);

1

=

H

ji

ij

1

is negative de nite.

De nition 5.2

We deneW to be the set of the solutions of problem (1) for

all values of

, and W to be the set of the solutions of problem (2) for all

values of . Formally:

W =f

u

2<kj9

2< +

u

=

argmin

m

X

j=1

d

2

j + 12h

u

u

ig

W =f

u

2<kj9 2< +

u

=

argmin

jjujj=1=

R

+

D

2

g

Theorem 5.3

The sets of functionsW andW dened above are equivalent.

Proof

: let

w

be the solution of the problem (1) for a xed value of

. Then,

jj

w

jj ! 0 if

! 0, and jj

w

jj! 1 if

! 1. By lemma 5.1 we know that
(14)

Non-linear Function Spaces 13

of problem (1) ranges through all possible positive values for suitably chosen values of

. Sincejj

w

jj=

1

p

C, considering the solution for value of

in problem

(1) is equivalent solving problem (2) with

C

= k

w

k. This implies that for

each function

w

2 W there exists a value of such that the corresponding

w

2W, and

w

=

w

.

An obvious, but important, consequence of this theorem is the following corol-lary:

Corollary 5.4

The minima of the bound onW and W coincide:

min

w 2W

R

+

D

2

=

min

w2W

R

+

D

2

This means that the optimal separating hyperplane (according to our bound) can be found by solving the quadratic optimization problem (2) with

= 2 for dierent values of

, and choosing the value of

which minimizes the bound itself. This analysis provides a theoretical justi cation to the Soft Margin heuristics (using the 2-norm of the vector

d

described in the appendix of 5]), as well as a theoretically sound way to choose the optimal value of

in that case. In the next sections we will generalize the theoretical results given so far, and this will lead to a further description of soft-margin heuristics for Support Vector Machines as well as non-linear function classes.

6 Non-linear Function Spaces

6.1 Further Background Results

Before we can quote the next lemma, we need another de nition.

De nition 6.1

Let(

Xd

) be a (pseudo-) metric space, let

A

be a subset of

X

and

>

0. A set

B

X

is an

-cover for

A

if, for every

a

2

A

, there exists

b

2

B

such that

d

(

ab

)

. The

-covering number of

A

, Nd(

A

), is the

minimal cardinality of an

-cover for

A

(if there is no such nite cover then it is dened to be1). We will say the cover is proper if

B

A

.

Note that we have used less than or equal to in the de nition of a cover. This is somewhat unconventional, but will not change the bounds we use. It does, however, prove technically useful in the proofs. The idea is that

B

should be nite but approximate all of

A

with respect to the pseudometric

d

. we will use the

l

1 distance over a nite sample

x

= (

x

1

:::x

m) for the pseudo-metric in

the space of functions,

d

x(

fg

) = max

i j

f

(

x

i);

g

(

x

i)j

:

We writeN(

F

x

) =Nd x(

F) We will consider the covers to be chosen from

(15)

Non-linear Function Spaces 14

We now quote a lemma from 12] which follows immediately from a result of Alon et al. 1].

Corollary 6.2

12] LetF be a class of functions

X

!

ab

] and

P

a

distribu-tion over

X

. Choose0

< <

1 and let

d

= fatF(

=

4). Then

sup

x2Xm

N(

F

x

) 2 4

m

(

b

;

a

) 2

2

d log

2

(2em(b;a)=(d))

:

We de ne a clipping function

.

(

) :=

8

<

:

if

>

;2

:

01 if

<

;2

:

01

otherwise,

and let

(F) =f

(

f

):

f

2Fg. The choice of the threshold

is arbitrary but

will be xed before any analysis is made. If the value of

needs to be included explicitly we will denote the clipping function by

.

For a monotonic function

f

( ) we de ne

f

( ;

) = lim

!0 +

f

(

;

)

that is the left limit of

f

at . Note that the minimal cardinality of an

-cover is a monotonically decreasing function of

, as is the fat shattering dimension as a function of .

De nition 6.3

We say that a class of functionsF is sturdy its images under

the evaluation maps ~

x

F:

F;!<

x

~ F:

f

7!

f

(

x

)

are compact subsets of < for all

x

2

X

.

Note that this de nition diers slightly from that introduced in 15]. The current de nition is more general, but at the same time simpli es the proof of the required properties.

Lemma 6.4

LetF be a sturdy class of functions. Then for each

N

2 N and

any xed sequence

x

2

X

m, the inmum N = inff jN(

F

x

) =

N

g

is

attained.

Proof

: The straightforward proof follows exactly the proof of Lemma 2.6 of 15].

Corollary 6.5

LetF be a sturdy class of functions. Then for each

N

2Nand

any xed sequence

x

2

X

m, the inmum N = inff jN(

(F)

x

) =

N

g

is
(16)

Non-linear Function Spaces 15

Proof

: Suppose that the assertion does not hold for some

x

2

X

m and

N

2N.

Let

N

0=

N( N

N(F)

x

)

> N

. Consider the following supremum

N0

= supf jN(

N(F)

x

) =

N

0

g

:

Since the assertion does not hold we have N0

N. By the lemma we must

have N0

>

N, since otherwise the in mum of the required for the next size of

cover will not be attained. Hence, there exists 0

>

N withN( 0

N(F)

x

) =

N

0. Let = ( 0+

N)

=

2. Note thatN(

(F)

x

)

N

. Let

B

be a minimal

cover in this case. Claim that

B

is also a 0cover for

N(F) in the

d

x metric.

To show this consider

f

2 F and let

f

i 2

B

be within of

(

f

) in the

d

x

metric. Hence, for all

x

2

x

,j

f

i(

x

);

(

f

)(

x

)j . But this implies that j

f

i(

x

);

N(

f

)(

x

)j + ( ; N)

= 0

:

Hence, we haveN( 0

N(F)

x

)

N

, a contradiction.

We will make use of the following lemma, which in the form below is due to Vapnik 17, page 168].

Lemma 6.6

Let

X

be a set and

S

a system of sets on

X

, and

P

a probability measure on

X

. For

x

2

X

mand

A

2

S

, dene

x(

A

) :=

j

x

\

A

j

=m

. If

m >

2

=

,

then

P

m

x

: supA

2S j

x(

A

)

;

P

(

A

)j

>

2

P

2m

xy

: supA

2S j

x(

A

) ;

y(

A

) j

> =

2

:

The following two theorems are essentially quoted from 12] but they have been reformulated here in terms of the covering numbers involved. The dierence will be apparent if Theorem 6.8 is compared with Theorem 3.4 quoted in Section 3.

Lemma 6.7

Suppose F is a sturdy set of functions that map from

X

to <.

Then for any distribution

P

on

X

, and any

k

2N and any

2<

P

2m (

xy

:9

f

2F

r

= max

j f

f

(

x

j)g

2

<

;

r

dlog 2(

N(

(F)

xy

))e=

k

1

m

jf

i

j

f

(

y

i)

gj

>

(

mk

) )

<

where

(

mk

) = 1

m(

k

+ log2 2).

Proof

: We have omitted the detailed proof since it is essentially the same as the corresponding proof in 12] with the simpli cation that Corollary 6.2 is not required and the property of sturdiness ensures by Corollary 6.5 that we can

nd a k cover where

k = inff jN(

(F)

xy

) = 2kg

which can be used for all satisfying dlog 2(

N(

(F)

xy

))e=

k

. Note also

that an inequality is required 2

<

;

r

, as we have coverings using closed
(17)

Non-linear Function Spaces 16

Theorem 6.8

Consider a sturdy real valued function classF having a uniform

bound on the covering numbers

N( ;

;(F)

x

) B(

`

)

for all

x

2

X

`, for all

`

. Fix

2 <. If a learner correctly classies

m

inde-pendently generated examples

z

with

h

=

T

(

f

) 2

T

(F) such that er

z(

h

) = 0

and = minj

f

(

x

i);

j, then with condence1;

the expected error of

h

is

bounded from above by

(

mk

) = 2

m k

+ log2

8

m

where

k

=dlog 2

B(2

m =

2)e.

Proof

: Making use of lemma 6.6 we will move to the double sample and stratify by

k

. By the union bound, it thus suces to show thatP

2m

k=1

P

2m(

J

k)

< =

2,

where

J

k =f

xy

: 9

h

=

T

(

f

)2

T

(F)

Er

x(

h

) = 0

k

= dlog

2

B(2

m =

2)e

= minj

f

(

x

i);

j

Er y(

h

)

m

(

mk

)

=

2g

:

(The largest value of

k

we need consider is 2

m

, since for larger values the bound will in any case be trivial). It is sucient if

P

2m(

J

k) 4m =

0. We will in

fact work with the set

J

k( 0

) =f

xy

: 9

h

=

T

(

f

)2

T

(F)

Er

x(

h

) = 0

k

= dlog

2 N(

0

=

2

0= 2(

F)

xy

)e

0

<

min

j

f

(

x

i);

j

Er y(

h

)

m

(

mk

)

=

2g

:

We will show that for any 0

<

, we have

P

2m(

J

k( 0))

0. Hence, considering

the limit 0

! from below, the result will follow.

Consider ^F = ^F. The probability distribution on ^

X

=

X

f0

1g is given

by

P

on

X

with the second component determined by the target value of the rst component. Note that for a point

y

2

y

to be misclassi ed, it must have

^

f

(^

y

)

>

maxf

f

^(^

x

): ^

x

2^

x

g+ , so that

J

k( 0

) n

^

x

y

^ 2(

X

f0

1g) 2m

: 9

f

^2F^

r

= maxf

f

^(^

x

): ^

x

2

x

^g

0

<

;

r

k

=dlogN( 0

=

2

0= 2(

F)

xy

)e

f

y

^2

y

^: ^

f

(^

y

)

g

m

(

mk

)

=

2 o

Replacing by 0

=

2 in Lemma 6.7 and appealing to Lemma 6.6 we obtain

P

2m(

J

k( 0))

0, for

(

mk

) = 2

m

;

k

+ log(2

=

0

)

(18)

Non-linear Function Spaces 17

6.2 Margin distribution and fat shattering

In this section we will generalise the results of Section 4 to function classes for which a bound on their fat-shattering dimension is known. The basic trick is to bound the covering numbers of the sum of two function classes in terms of the covering numbers of the individual classes. If F and G a real valued function

classes de ned on a domain

X

we denote by F+G the function class

F+G =f

f

+

g

j

f

2F

g

2Gg

:

Lemma 6.9

LetF and G be two real valued function classes both dened on a

domain

X

. Suppose G has range

ab

]. Then we can bound the cardinality of a

minimal cover ofF+G by

N(

(F+G)

x

) N(

=

2

;a

+(b;a)=2(

F)

x

)N(

=

2

G

x

)

:

Proof

: Fix

2(0

) and let

B

(respectively

C

) be a minimal

(respectively ;

) cover of

;a

+(b;a)=2(

F) (respectively G) in the

d

x metric. Consider the

set of functions

B

+

C

. For any

f

+

g

2 F +G, there is an

f

i 2

B

within

of

;a

+(b;a)=2(

f

) in the

d

x metric and a

g

j

2

C

within ;

of

g

in the same

metric. For

x

2

x

we claim

j

(

f

+

g

)(

x

);

(

f

i+

g

j)(

x

)j

:

(5)

Hence,

(

B

+

C

) forms a cover of

(F+G). Since

j

B

+

C

j N(

;a

+(b;a)=2(

F)

x

)N( ;

G

x

)

the result follows by setting

=

=

2. To justify the claim, assume rst that

;2 (

f

+

g

)(

x

)

. This implies that

;2 ;

b

;2 ;

g

(

x

)

f

(

x

)

;

g

(

x

)

;

a:

Hence, in this case using the fact that

only reduces distances,

j

(

f

+

g

)(

x

);

(

f

i+

g

j)(

x

)j j(

f

+

g

)(

x

);(

f

i+

g

j)(

x

)j

= j(

;a

+(b;a)=2(

f

) +

g

)(

x

)

;(

f

i+

g

j)(

x

)j j

;a

+(b;a)=2(

f

)(

x

)

;

f

i(

x

)j+j

g

(

x

);

g

j(

x

)j

+ ;

=

:

If on the other hand (

f

+

g

)(

x

)

, we need only show that (

f

i+

g

j)(

x

)

;

in order for (5) to be satis ed. But we have

f

i(

x

)minf

f

(

x

)

;

a

g;

, while

g

j(

x

)

g

(

x

);( ;

). Hence,

(

f

i+

g

j)(

x

) minf(

f

+

g

)(

x

)

g

(

x

)+

;

a

g;

;

:

Finally, if (

f

+

g

)(

x

)

;2 , we must show that (

f

i +

g

j)(

x

)

; to

satisfy equation (5). In this case

f

i(

x

) maxf

f

(

x

)

;2 ;

b

g+

, while

g

j(

x

)

g

(

x

) + ( ;

). Hence,

(

f

i+

g

j)(

x

) maxf(

f

+

g

)(

x

)

g

(

x

)+

;2 ;

b

g+

;

:

(19)

Non-linear Function Spaces 18

Before proceeding we need a further technical lemma to show that the property of sturdiness is preserved under the addition operator.

Lemma 6.10

LetF andG be sturdy real valued function classes. ThenF+G

is also sturdy.

Proof

: Consider

x

2

X

. ~

x

F(

F) is a compact subset of < as is ~

x

G(

G). Note

that

~

x

F+G(

F+G) = ~

x

F(

F)+ ~

x

G(

G)

where the addition of two sets

A

and

B

of real numbers is de ned

A

+

B

=f

a

+

b

j

a

2

Ab

2

B

g

:

Since, ~

x

F( F)

x

~

G(

G) is a compact set of <

2 and + is a continuous function

from < 2 to

<, we have that ~

x

F(

F) + ~

x

G(

G) being the image of a compact set

under + is also compact.

De nition 6.11

Fix a threshold

2<. For a function

f

on

X

we dene

d

((

xy

)

f

) = maxf0

;

y

(

f

(

x

);

)g

:

This quantity is the amount by which

f

fails to reach the margin on the point (

xy

) or 0 if its margin is larger than . Let

g

f 2

L

f(

X

) be the function

g

f =

X

(x y)2S

d

((

xy

)

f

)

y

x

:

Proposition 6.12

Fix

2<. Let F be a sturdy class of real-valued functions

with range

ab

]<having a uniform bound on the covering numbers

N(

;

+A 2

;

+A(

F)

x

) B(

` A

)

for all

x

2

X

`, for all

`

. Let G be a sturdy subset of

L

f(

X

) with the uniform

bound on the covering numbers,

N( ;

G

x

) A(

`

)

for

x

2 `, where =f

xj

x

2

X

g. Consider a xed but unknown probability

distribution on the input space

X

. Then with probability 1;

over randomly

drawn training sets

S

of size

m

for all

>

0 the generalization of a function

f

2F thresholded at

satisfying

g

f 2G is bounded by

(

mk

) = 2

m k

+ log2

8

m

where

k

=dlog 2

B(2

m =

4

A

)+ log 2

A(2

m =

4)e

where

A

supfh

g

xij

g

2G

x

2

X

g, provided

m

2

=

and there is no discrete
(20)

Non-linear Function Spaces 19

Proof

: Consider the xed mapping 1. We extend the function class

F to act

on the space

X

L

f(

X

) by its action on

X

. We similarly extend the function

classG by composing with a projection. We claim that

1. for

x

62

S

,

f

(

x

) = (

f

+

g

f)(

x

), and

2. the margin of

f

+

g

f with threshold

on the training set 1(

S

) is .

Hence, the o training set behaviour of the classi er

f

can be characterised by the behaviour of

f

+

g

f, while

f

+

g

f is a large margin classi er in the space

X

L

f(

X

). In order to bound the generalization error we will apply Theorem 6.8

forF+G which gives a bound in terms of the covering numbers. These we will

bound using Lemma 6.9. The spaceF+G is sturdy by Lemma 6.10, since both F and G are. Note that the range of G is contained in ;

AA

] on the input

domain. In this case we obtain the following bound on the covering numbers, lim

!0 +log

2 ;

N(( ;

)

=

2

(;)=2(

F+G)

x

)

lim

!0 +log

2

N(( ;

)

=

4

+A (;)=2+A(

F)

x

)

+ lim

!0 +log

2(

N(( ;

)

=

4

G

x

))

log2(

B(2

m =

4

A

))+ log 2(

A(2

m =

4))

as required.

1. The rst claim follows immediately from the observation that for

z

62

S

, *

X

(x y)2S

d

((

xy

)

f

)

y

x

z

+

= 0

:

2. For (

x

0

y

0)

2

S

, we have

y

0

((

f

+

g

f)(

x

0

);

) =

y

0

(

f

(

x

0

);

) +

y

0

*

X

(x y)2S

d

((

xy

)

f

)

y

0

x

x0 +

;

d

((

x

0

y

0

)

f

)+

d

((

x

0

y

0

)

f

) =

:

The theorem follows.

For a training set

S

, we de ne

D

(

Sf

) =s X

(x y)2S

d

((

xy

)

f

)2

:

Theorem 6.13

LetF be a sturdy class of real-valued functions with range

ab

]

and fat shattering dimension bounded by fatF( ). Fix

2 < and a scaling of

the output range

2<. Consider a xed but unknown probability distribution
(21)

Non-linear Function Spaces 20

sets

S

of size

m

for all

b

;

a > >

0 the generalization of a function

f

2 F

thresholded at

is bounded by

(

mk

) = 2

m k

log2 65

m

1 + ~

D

2

log2

9

em

1 + ~

D

+ log2

64

m

1:5(

b

;

a

)

where

k

=h

fatF(

;

=

16) + 64 ~

D

2 i

and ~

D

= 2(

D

(

Sf

)+

)

=

provided

m

2

=

and there is no discrete probability on misclassied training

points.

Proof

: We de ne a sequence of function classes Gj

L

f(

X

) to be the linear

functionals with norm at most

B

j on the space

L

f(

X

). We will apply

Propo-sition 6.12 for each class

G

j. Note that the range of Gj is ;

B

j

B

j] on the

input domain. Note also that the image of Gj under the evaluation map is

a closed bounded subset of the reals and hence is compact. It follows that

G

j is sturdy. We choose

B

j =

j

, for

j

= 1

:::`

= p

m

(

b

;

a

)

=

. Hence,

B

` = p

m

(

b

;

a

)

D

(

Sf

), for all

f

2 F and all

< b

;

a

. Hence,

for any value of

D

=

D

(

Sf

) obtained there is a value of

B

j satisfying

D B

j

< D

+

. Substituting the upper bound

D

+

for this

B

j will give the

result, when we use

0=

=`

and bound the covering numbers of the component

function classes using Corollary 6.2 and Theorem 3.5. In this case we obtain the following bounds on the covering numbers,

lim

!0 +log

2

N(( ;

)

=

4

+Bj

;+Bj( F)

x

)

1 +

d

1log 2

260

m

(

=

2+

B

j)2 2

log2

18

em

(

=

2+

B

j)

d

1

=: log2(

B(2

m =

4

B

j))

where

d

1 = fatF(

;

=

16), and

lim

!0 +log

2(

N(( ;

)

=

4

Gj

x

)) 1 +

d

2log

2

260

mB

2

j

2 !

log2

18

emB

j

d

2

=: log2(

A(2

m =

4))

where

d

2 = (16

B

j

=

)

2. Hence, in this case we can bound dlog

2

B(2

m =

4

B

j)+

log2

A(2

m =

4)eby

dlog 2

B(2

m =

4

B

j) + log 2

A(2

m =

4)e 3 + "

fatF( ;

=

16) + 16

B

j

2 #

log265

m

(1 + 2

B

j

=

) 2

log29

em

(1 + 2

B

j

=

)
(22)

Non-linear Function Spaces 21

The theorem can of course be applied for linear function classes, using the bound on the fat shattering dimension given in Theorem 3.5. The bound obtained is very comparable, though a lot less clean than Theorem 4.3.

For a training set

S

, we de ne

D

0

(

Sf

) = X (x y)2S

d

((

xy

)

f

)

:

This is the

l

1 sum of the slack variables which is optimised in Vapnik's box

con-straint maximal margin hyperplane algorithm. The following Corollary shows that optimising this quantity does indeed lead to good generalization.

Corollary 6.14

Let F be a sturdy class of real-valued functions with range

ab

] and fat shattering dimension bounded by fatF( ). Fix

2<and a scaling

of the output range

2<. Consider a xed but unknown probability distribution

on the input space

X

. Then with probability1;

over randomly drawn training

sets

S

of size

m

for all

b

;

a > >

0 the generalization of a function

f

2 F

thresholded at

is bounded by

(

mk

) = 2

m k

log2 65

m

1 + ~

D

2

log2

9

em

1 + ~

D

+ log2

64

m

1:5(

b

;

a

)

where

k

=h

fatF(

;

=

16) + 64 ~

D

2 i

and ~

D

= 2(p

D

0(

Sf

)(

b

;

a

) +

)

=

provided

m

2

=

and there is no discrete probability on misclassied training

points.

Proof

: The corollary follows by observing that

D

(

Sf

) = s X

(x y)2S

d

((

xy

)

f

)2

s

(

b

;

a

) X

(x y)2S

d

((

xy

)

f

) = p

D

0(

Sf

)(

b

;

a

)

and applying the theorem.

If we choose the hyperplane to minimise

D

0(

Sf

) and apply the Corollary, we

will necessarily obtain a weaker bound than we would if we minimised

D

(

Sf

) and then applied the Theorem. In the case of linear function classes, better bounds for the generalization in terms of

D

and

D

0should be obtained using

re-cent results which bound the covering numbers for dierent norms directly 21]. It is worth noting that we can apply Corollary 6.14 to the case of linear functions with norm 1 and recover a result similar to Theorem 4.3. The bound would involve an expression

R

2+

D

2 rather than (

R

+

D

)2, which appears preferable.
(23)

Regression 22

7 Regression

In order to apply the results of the last section to the regression case we for-mulate the error estimation as a classi cation problem. Consider a real-valued function class F and a target real-valued function

t

(

x

). For

f

2 F we de ne

the function

e

(

f

) and the class

e

(F),

e

(

f

)(

x

) = j

f

(

x

);

t

(

x

)j

e

(F) = f

e

(

f

)j

f

2Fg

:

For a training point (

xy

)2

X

<we de ne

@

((

xy

)

f

) = maxf0

j

f

(

x

);

y

j;(

; )g

:

This quantity is the amount by which

f

exceeds the error margin

; on the

point (

xy

) or 0 if

f

is within

; of the target value. Hence, this is the

insensitive loss measure considered by Vapnik with

=

; . Let

g

f 2

L

f(

X

)

be the function

g

f =; X

(x y)2S

@

((

xy

)

f

)

x

:

Proposition 7.1

Fix

2 <,

>

0. Let F be a sturdy class of real-valued

functions with range

ab

]<having a uniform bound on the covering numbers

N( ;

F

x

) B(

m

)

for all

x

2

X

m. Let G be a sturdy subset of

L

f(

X

) with the uniform bound on

the covering numbers,

N( ;

G

x

) A(

m

)

for

x

2m, where =f

xj

x

2

X

g. Consider a xed but unknown probability

distribution on the input space

X

. Then with probability 1;

over randomly

drawn training sets

S

of size

m

for all

>

0 the probability that a function

f

2F has error greater than

with respect to target function

t

on a randomly

chosen input is bounded by

(

mk

) = 2

m k

+ log2

8

m

where

k

=dlog 2

B(2

m =

4)+ log 2

A(2

m =

4)e

where

A

supfh

g

xij

g

2 G

x

2

X

g, provided

m

2

=

, there is no discrete

probability on training points with error greater than

and

g

e(f) 2G

Proof

: The result follows from an application of Proposition 6.12 to the func-tion class

e

(F), noting that we treat all training examples as negative, and

hence correct classi cation corresponds to having error less than

. Finally, we can bound the covering numbers

N(

+A 2+A(

e

(

F))

x

) N(

F

x

) B(

m

)

:

(24)

Regression 23

For a training set

S

, we de ne

D(

Sf

) = s

X

(x y)2S

@

((

xy

)

f

)2

:

The above result can be used to obtain a bound in terms of the observed value of

D

(

Sf

) and the fat shattering dimension of the function class.

Theorem 7.2

LetF be a sturdy class of real-valued functions with range

ab

]

and fat shattering dimension bounded by fatF( ). Fix

2 <,

>

0 and a

scaling of the output range

2 <. Consider a xed but unknown probability

distribution on the input space

X

. Then with probability 1;

over randomly

drawn training sets

S

of size

m

for all with

>

0 the probability that a

function

f

2F has error larger than

on a randomly chosen input is bounded

by

(

mk

) = 2

m

k

log2

65

m b

;

a

2 !

log2 9

em b

;

a

+ log2

64

m

1:5(

b

;

a

)

!

where

k

=h

fatF( ;

=

16) + 64 ~D 2

i

and ~D= 2(D(

Sf

)+

)

=

provided

m

2

=

and there is no discrete probability on misclassied training

points.

Proof

: The proof follows the same pattern as that of Theorem 6.13, with the exception that the bounds on the covering numbers must use the full range of the function classF in the log factors.

Corollary 7.3

LetF be a the set of linear functions with norm 1 restricted to

inputs in a ball of radius

R

about the origin. Fix

2<,

>

0 and a scaling of

the output range

2<. Consider a xed but unknown probability distribution

on the input space

X

. Then with probability1;

over randomly drawn training

sets

S

of size

m

for all , with

>

0 the probability that a function

f

2F

has error larger than

on a randomly chosen input is bounded by

(

mk

) = 2

m

k

log2

260

m R

2 !

log2 18

emR

+ log2

128

m

1:5

R

!

where

k

=h

256

R

2

=

2+ 64 ~ D 2

i

and ~D= 2(D(

Sf

)+

)

=

provided

m

2

=

and there is no discrete probability on misclassied training

points.

Proof

: The range of linear functions with unit weight vectors when restricted to the unit ball is ;

RR

]. Their fat shattering dimension is bounded by
(25)

Conclusions 24

Note that we obtain a generalization bound for standard least squares regression by taking =

in Theorem 7.2. In this caseD(

Sf

) is the least squares error

on the training set, while the bound gives the probability of a randomly chosen input having error greater than

. This is summarised in the following corollary.

Corollary 7.4

LetF be a sturdy class of real-valued functions with range

ab

]

and fat shattering dimension bounded by fatF( ). Fix

2 <,

>

0 and a

scaling of the output range

2 <. Consider a xed but unknown probability

distribution on the input space

X

. Then with probability 1;

over randomly

drawn training sets

S

of size

m

the probability that a function

f

2F has error

larger than

on a randomly chosen input is bounded by

(

mk

) = 2

m

k

log2

65

m b

;

a

2 !

log2 9

em b

;

a

+ log2

64

m

1:5(

b

;

a

)

!

where

k

=h

fatF(

;

=

16) + 64 ~ D 2

i

and ~D= 2 q

P

(x y)2S(

f

(

x

) ;

y

)

2+

provided

m

2

=

and there is no discrete probability on misclassied training

points.

As mentioned in the section dealing with classi cation we could bound the generalization in terms of other norms of the vector of slack variables

(

@

((

xy

)

f

))(x y)2S

:

The aim of this paper, however, is not to list all possible results, it is rather to illustrate how such results can be obtained.

Another application of these results is to choose the best

for the

insensitive loss function for Support Vector Regression. This problem has usually been solved by using a validation set, but Corollary 7.3 could be used by choose the value of

which gives the best bound on the generalization. We assume here that a target accuracy

has been set and we wish to minimise the probability that the error exceeds this value. The optimum will be the

which minimises

R

2+

D(

Sf

;

) 2

(

;

)

2

where

f

is the solution obtained when using the

-insensitive loss function.

8 Conclusions

(26)

REFERENCES 25

the hyperplane. The bounds obtained can be signi cantly better than previ-ously obtained bounds, particularly when some of the points are misclassi ed and agnostic bounds would need to be applied were a classical analysis to be adopted in which the square root of the sample size replaces the sample size in the denominator. The bound is also more robust that that derived for the maximal margin hyperplane where a single point can have a dramatic eect on the hyperplane produced.

We have gone on to show how optimizing the measure of the margin distribution that appears in the bound corresponds to an algorithm proposed by Cortes and Vapnik 5]. This formulation also allows the problem to be solved in kernel spaces such as those used with the Support Vector Machine.

We believe that this paper presents the rst pac style bound for a margin distri-bution measure that is neither critically dependent on the nearest points to the hyperplane nor is an agnostic version of that approach. In addition, we believe it is the rst paper to give a provably optimal algorithm for optimizing the generalization performance of agnostic learning with hyperplanes, by showing that the criterion to be minimised should not be the number of training errors, but rather a more exible criterion which could be termed a `soft margin'. The problem of nding a more informative and theoretically well-founded measure of the margin distribution has been an open problem for some time. This paper suggests one candidate for such a measure which has the advantage of being ro-bust in the sense that it is not critically sensitive to the behaviour of individual training points.

The results have been further generalized to non-linear function classes with bounded fat-shattering dimensions, other norms on the vector of shortfalls of individual training points and to the regression case. For regression one byprod-uct is a bound in terms of the least square error on the training set of the prob-ability that a randomly drawn test point will have error greater than a given value.

References

1] Noga Alon, Shai Ben-David, Nicol o Cesa-Bianchi and David Haussler, \Scale-sensitive Dimensions, Uniform Convergence, and Learnability," Journal of the ACM

44

(4), 615{631, (1997)

2] Peter L. Bartlett, \The Sample Complexity of Pattern Classi cation with Neural Networks: The Size of the Weights is More Important than the Size of the Network," IEEE Trans. Inf. Theory,

44

(2), 525{536, (1998). 3] Peter Bartlett and John Shawe-Taylor, Generalization Performance of
(27)

REFERENCES 26

4] C. Campbell, Constructive Learning Techniques for Designing Neural Net-work Systems, in (ed CT Leondes) Neural NetNet-work Systems Technologies and Applications. San Diego: Academic Press. 1997.

5] C. Cortes and V. Vapnik, Support-Vector Networks, Machine Learning, 20(3):273-297, September 1995

6] Nello Cristianini, John Shawe-Taylor, and Peter Sykacek, Bayesian Classi-ers are Large Margin Hyperplanes in a Hilbert Space, in Shavlik, J., ed., Machine Learning: Proceedings of the Fifteenth International Conference, Morgan Kaufmann Publishers, San Francisco, CA.

7] R.O. Duda and P.E. Hart, Pattern Classi cation and Scene Analysis, New York: Wiley, 1973.

8] Yoav Freund and Robert E. Schapire, Large Margin Classi cation Using the Perceptron Algorithm, Proceedings of the Eleventh Annual Conference on Computational Learning Theory, 1998.

9] Leonid Gurvits, A note on a scale-sensitive dimension of linear bounded functionals in Banach spaces. In Proceedings of Algorithm Learning Theory, ALT-97, and as NECI Technical Report, 1997.

10] Norbert Klasner and Hans Ulrich Simon, From Free to Noise-Tolerant and from On-line to Batch Learning, Proceedings of the Eighth Annual Conference on Computational Learning Theory, COLT'95, 1995, pp. 250{257.

11] R. Schapire, Y. Freund, P. Bartlett, W. Sun Lee, Boosting the Margin: A New Explanation for the Eectiveness of Voting Methods. In D.H. Fisher, Jr., editor, Proceedings of International Conference on Machine Learning, ICML'97, pages 322{330, Nashville, Tennessee, July 1997. Morgan Kauf-mann Publishers.

12] John Shawe-Taylor, Peter L. Bartlett, Robert C. Williamson, Martin An-thony, Structural Risk Minimization over Data-Dependent Hierarchies, IEEE Trans. on Inf. Theory,

44

(5), 1926{1940, (1998), and NeuroCOLT Technical Report NC-TR-96-053, 1996.

(ftp://ftp.dcs.rhbnc.ac.uk/pub/neurocolt/tech reports).

13] John Shawe-Taylor and Nello Cristianini, Data Dependent Structural Risk Minimization for Perceptron Decision Trees, Proceedings of the Eleventh Conference on Neural Information Processing Systems, NIPS'97. Advances in Neural Information Processing Systems 10 Michael I. Jordan, Michael J. Kearns, and Sara A. Solla (eds.) Cambridge, MA: MIT Press (1998), pp. 336{342.

(28)

REFERENCES 27

15] John Shawe-Taylor and Robert C. Williamson, Generalization Performance of Classi ers in Terms of Observed Covering Numbers, Submitted to Eu-roCOLT'99, 1998.

16] University of California, Irvine - Machine Learning Repository,

http://www.ics.uci.edu/ mlearn/MLRepository.html

17] Vladimir N. Vapnik, Estimation of Dependences Based on Empirical Data, Springer-Verlag, New York, 1982.

18] Vladimir N. Vapnik, The Nature of Statistical Learning Theory, Springer-Verlag, New York, 1995.

19] Vladimir N. Vapnik, Estimation of Dependences Based on Empirical Data, Springer-Verlag, New York, 1982.

20] Vladimir N. Vapnik, Esther Levin and Yann Le Cunn, Measuring the VC-dimension of a learning machine, Neural Computation, 6:851{876, 1994. 21] Robert C. Williamson, Alex J. Smola and Bernhard Sch#olkopf, \Entropy

Numbers, Operators and Support Vector Kernels," submitted to Euro-COLT'99. See also \Generalization Performance of Regularization Net-works

and Support Vector Machines via Entropy Numbers of Compact Oper-ators," http://spigot.anu.edu.au/people/williams/papers/P100.ps

References

Related documents