The VC Dimension - Perceptron Like Large Margin Classifiers

Until now we have constructed distribution-free bounds on the rate of convergence of $R(a_l)$ to $\inf_{a \in \Lambda} R_{\mathrm{emp}}(a)$ and of $R(a_l)$ to $\inf_{a \in \Lambda} R(a)$ based on the growth function $G^{\Lambda}(l)$.

However, we cannot estimate the value of the growth function given the dataset size and the admissible functions of the hypothesis class. The following theorem will provide us with an upper bound on the growth function [60,61] leading to constructive bounds on the rate of convergence.

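Theorem 2.8. The growth function of a set of indicator functions $Q(z, a)$, $a \in \Lambda$, either satisfies the equality

$$G^{\Lambda}(l) = l \ln 2$$

or is bounded by the inequality

$$
G^{\Lambda}(l)
\begin{cases}
= l \ln 2 & \text{if } l \le h, \\
\le \ln\left(\sum_{i=0}^{h} C_l^i\right) \le \ln\left(\frac{el}{h}\right)^{h} = h\left(1 + \ln\frac{l}{h}\right) & \text{if } l > h,
\end{cases}
$$

where $h$ is the largest integer for which

$$G^{\Lambda}(h) = h \ln 2 .$$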
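To get a feeling for the three quantities in Theorem 2.8, they can be evaluated numerically. The following is a minimal sketch, not part of the original text; the function name `growth_bounds` and the dictionary keys are illustrative assumptions, and $h \ge 1$ is assumed.

```python
import math

def growth_bounds(l: int, h: int) -> dict:
    """Evaluate the quantities of Theorem 2.8 for sample size l and VC dimension h >= 1."""
    linear = l * math.log(2)
    if l <= h:
        # First branch: the growth function equals l ln 2.
        return {"linear": linear, "log_binomial_sum": linear, "closed_form": linear}
    # Second branch: ln(sum_{i=0}^{h} C(l, i)) <= h (1 + ln(l / h)).
    binomial_sum = sum(math.comb(l, i) for i in range(h + 1))
    return {
        "linear": linear,
        "log_binomial_sum": math.log(binomial_sum),
        "closed_form": h * (1 + math.log(l / h)),
    }

if __name__ == "__main__":
    for l in (5, 50, 500, 5000):
        print(l, growth_bounds(l, h=5))
```

For fixed `h` the last two entries grow only logarithmically with `l`, while the first grows linearly.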

The theorem says that the growth function is either linear in $l$ or, as a result of the second branch of the bound on $G^{\Lambda}(l)$, at most logarithmic in $l$. That means that it cannot scale with $l$ slower than linearly but faster than logarithmically. For example, $G^{\Lambda}(l)$ cannot be $l^{p}$ with $0 < p < 1$. The quantity $h$ characterises the ability of the functions in the hypothesis class to explain the data and is called the VC dimension of a set of indicator functions [60, 61]. There exists an alternative definition of the VC dimension which is connected to the procedure of estimating it, thus leading to constructive bounds.

Definition 2.9. The VC dimension of a set of functions $Q(z, a)$, $a \in \Lambda$, is equal to the largest number $h$ of points $z_1, \ldots, z_h$ that can be separated into two different classes in all the $2^{h}$ possible ways using functions from this set. We say then that the VC dimension is the maximum number of points that can be shattered by the set of functions.

One remark that we can add in connection with the above definition is that if for any $l$ there exists a set of $l$ points that can be shattered by the functions in the hypothesis class, then the VC dimension is infinite.
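The notion of shattering can be checked by brute force for a small hypothesis class. The sketch below is an illustration only: it assumes a class of interval indicators on the real line with endpoints on a finite grid (a choice not taken from the text), and the helper names are hypothetical.

```python
from itertools import product

def dichotomies(points, hypotheses):
    """Set of distinct 0/1 labelings of `points` produced by the hypotheses."""
    return {tuple(q(z) for z in points) for q in hypotheses}

def is_shattered(points, hypotheses):
    """True if all 2**len(points) labelings are realised."""
    return len(dichotomies(points, hypotheses)) == 2 ** len(points)

def interval_indicator(a, b):
    """Indicator of the closed interval [a, b] (illustrative hypothesis class)."""
    return lambda z: 1 if a <= z <= b else 0

# Assumption: endpoints restricted to a grid that is fine enough for the points below.
grid = [0.5 * k for k in range(9)]
intervals = [interval_indicator(a, b) for a, b in product(grid, repeat=2) if a <= b]

print(is_shattered([1.0, 2.0], intervals))       # True
print(is_shattered([1.0, 2.0, 3.0], intervals))  # False: labeling (1, 0, 1) is impossible
```

Two points are shattered but three are not, since an interval containing the outer points must contain the middle one; this is consistent with a VC dimension of two for intervals.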

For the case where the VC dimension is finite the growth function grows at most logarithmically with the sample size for l > h. If we use this upper bound in the place of the growth function we end up with a constructive bound on the rate of convergence

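$$
P\left( \sup_{a \in \Lambda} \left| \int Q(z, a)\, dF(z) - \frac{1}{l} \sum_{i=1}^{l} Q(z_i, a) \right| > \epsilon \right) \le 4 \exp\left( \left( \frac{h\left(1 + \ln(2l/h)\right)}{l} - \left(\epsilon - \frac{1}{l}\right)^{2} \right) l \right) . \tag{2.31}
$$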
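Because the right-hand side of (2.31) depends only on $l$, $h$ and $\epsilon$, it can be evaluated directly. The following sketch assumes the form of the bound given above; the function name `vc_bound` is hypothetical, and values above one are capped at one because a probability bound larger than one carries no information.

```python
import math

def vc_bound(l: int, h: int, eps: float) -> float:
    """Right-hand side of (2.31), capped at 1 (a trivial probability bound)."""
    exponent = (h * (1 + math.log(2 * l / h)) / l - (eps - 1 / l) ** 2) * l
    if exponent >= 0:
        return 1.0  # 4 * exp(exponent) >= 4, so the bound is vacuous
    return min(1.0, 4 * math.exp(exponent))

if __name__ == "__main__":
    for l in (1_000, 10_000, 100_000):
        print(l, vc_bound(l, h=10, eps=0.1))
```

For fixed `h` and `eps` the bound is vacuous for small samples and then decays rapidly once the quadratic term in `eps` dominates the logarithmic term.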

It can be easily seen that for a finite VC dimension the growth function increases slower than linearly, resulting in $\lim_{l \to \infty} G^{\Lambda}(l)/l = 0$, which is the condition for distribution-free uniform two-sided convergence. The following theorem [62] states something stronger regarding the role of the VC dimension in the uniform convergence.

Theorem 2.10. The finiteness of the VC dimension is not only a sufficient but also a necessary condition for uniform convergence of the frequencies of events $A_a = \{z : Q(z, a) = 1\}$ to their probabilities for any distribution $F(z)$.

Proof. For the validity of our claim it suffices to disprove uniform convergence for a specific distribution. Given that the VC dimension is infinite, the equality

$$N^{\Lambda}(z_1, \ldots, z_l) = 2^{l}$$

holds for some set $Z_l = \{z_1, \ldots, z_l\}$. Uniform convergence will fail if for any $l$ and $\epsilon < 1$ there exists a distribution $F(z)$ such that

$$\sup_{a \in \Lambda} \left| \int Q(z, a)\, dF(z) - \frac{1}{l} \sum_{i=1}^{l} Q(z_i, a) \right| > 1 - \epsilon$$

is true with probability one. We fix an arbitrary sample $Z_l$ of size $l$, which we expand by a set $Z^{\star} = \{z_{l+1}, \ldots, z_n\}$ consisting of $n - l$ points, where $n$ is chosen such that $n > l/\epsilon$. The sample $Z_l \cup Z^{\star}$ is generated by a uniform distribution concentrated only on the $n$ points, i.e. the probability of any such point is $P(z_i) = 1/n$. Even after expanding the dataset the functions $Q(z, a)$ are still able to shatter the new dataset. This enables us to choose, out of all the possible dichotomies realised by the functions of the class, the one dichotomy which corresponds to $Q(z, a^{\star})$ taking the value zero on the points of the subset $Z_l$ and one on the rest of them, contained in the subset $Z^{\star}$. Formally, this implies that

$$\frac{1}{l} \sum_{i=1}^{l} Q(z_i, a^{\star}) = 0$$

and at the same time that

$$\int Q(z, a^{\star})\, dF(z) = \frac{n - l}{n} = 1 - \frac{l}{n} > 1 - \epsilon ,$$

since $Q(z, a^{\star}) = 1$ only for $z \in Z^{\star}$ and $n > l/\epsilon$. Hence, with probability one the supremum over all functions in the set is greater than $1 - \epsilon$ for any $l$, rendering the finiteness of the VC dimension a necessary condition for uniform convergence.
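The arithmetic behind the construction in the proof can be replayed for concrete numbers. This is only an illustrative sketch: the function name `proof_gap` is hypothetical, and $n$ is taken as the smallest integer exceeding $l/\epsilon$, which is one admissible choice among many.

```python
from fractions import Fraction
import math

def proof_gap(l: int, eps: Fraction):
    """Return (n, gap) for the construction in the proof of Theorem 2.10."""
    n = math.floor(l / eps) + 1              # an integer with n > l / eps
    empirical_frequency = Fraction(0)        # Q(z, a*) = 0 on the l sample points
    probability = Fraction(n - l, n)         # mass of Z* under the uniform distribution
    return n, probability - empirical_frequency

if __name__ == "__main__":
    eps = Fraction(1, 10)
    n, gap = proof_gap(10, eps)
    print(n, gap, gap > 1 - eps)
```

Exact rational arithmetic is used so that the strict inequality is not blurred by rounding.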

It is important to point out that one should not confuse the VC dimension with the number of free parameters appearing in a function, because equating the two can be badly wrong. For example, on the one hand, there can be a class of functions the members of which differ only in one parameter but which, nevertheless, possesses an infinite VC dimension. On the other hand, one can think of a class of functions which, although described by a high number of free parameters, has a low VC dimension.

A case where the VC dimension can be easily estimated by the number of free parameters involves any hypothesis class that contains indicator functions linear in their parameters $a_k$,

$$Q(z, a) = \theta\left( \sum_{k=1}^{n} a_k \phi_k(z) \right), \quad a_k \in \mathbb{R} ,$$

where $a$ is a vector with components $a_k$, $k = 1, \ldots, n$. The terms $\phi_k(z)$ entering the expression of $Q(z, a)$ are linearly independent functions of the sample elements $z$. The VC dimension of this set of functions equals the number $n$ of free parameters. An application of this case is the class of zero-threshold hyperplanes in the $n$-dimensional space implemented by a learning machine in the classification task. If we consider hyperplanes possessing some bias, the number of free parameters is increased by one and so is the VC dimension of the set.
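For $n = 2$ the statement about zero-threshold hyperplanes can be verified by enumerating all dichotomies. The sketch below is an illustration under stated assumptions: $\phi_k(z)$ are taken to be the coordinates of $z$, $\theta(u)$ is taken to be one for $u \ge 0$ and zero otherwise, and the function name is hypothetical. It relies on the observation that the labeling changes only when the weight vector becomes perpendicular to one of the points, so one weight vector per arc between such critical directions suffices.

```python
import math

def origin_line_dichotomies(points):
    """Labelings of 2-D points by theta(w . x) over directions w in general position."""
    two_pi = 2 * math.pi
    # Directions w perpendicular to a point are where that point's label flips.
    critical = sorted(
        (math.atan2(y, x) + s * math.pi / 2) % two_pi
        for x, y in points
        for s in (1, -1)
    )
    critical.append(critical[0] + two_pi)  # close the circle
    labelings = set()
    for start, end in zip(critical, critical[1:]):
        mid = (start + end) / 2  # one representative direction per arc
        w = (math.cos(mid), math.sin(mid))
        labelings.add(tuple(1 if w[0] * x + w[1] * y >= 0 else 0 for x, y in points))
    return labelings

print(len(origin_line_dichotomies([(1.0, 0.0), (0.0, 1.0)])))              # 4
print(len(origin_line_dichotomies([(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)])))  # 6
```

Two points yield all $2^2 = 4$ labelings, whereas $m$ points produce at most $2m$ arcs and hence at most $2m$ labelings, which is fewer than $2^m$ for $m \ge 3$. This agrees with a VC dimension of two for two free parameters.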

For the class of indicator functions non-linear in their parameters the VC dimension can be less than or even exceed the number of parameters. A typical example of the latter case is the following set of indicator functions

$$Q(z, a) = \theta(\sin az), \quad z \in (0, 2\pi), \quad a \in (0, \infty) ,$$

the VC dimension of which is infinite.
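The shattering ability of this one-parameter class can be observed numerically. The sketch below uses a well-known construction that is not spelled out in the text and is therefore an assumption here: the points $z_i = 10^{-i}$, $i = 1, \ldots, l$, and, for a desired labeling $y_i \in \{0, 1\}$, the parameter $a = \pi\left(1 + \sum_{i=1}^{l} (1 - y_i)\, 10^{i}\right)$. The convention $\theta(u) = 1$ for $u \ge 0$ is also assumed, and the function names are hypothetical.

```python
import math
from itertools import product

def sine_parameter(labels):
    """Parameter a > 0 chosen so that theta(sin(a * 10**-(i+1))) equals labels[i]."""
    return math.pi * (1 + sum((1 - y) * 10 ** (i + 1) for i, y in enumerate(labels)))

def sine_indicator(a, z):
    """Q(z, a) = theta(sin(a z)), with theta(u) = 1 for u >= 0 (assumed convention)."""
    return 1 if math.sin(a * z) >= 0 else 0

l = 5
points = [10.0 ** -(i + 1) for i in range(l)]  # all lie in (0, 2*pi)

# Every one of the 2**l labelings is reproduced by a suitable single parameter a.
shattered = all(
    tuple(sine_indicator(sine_parameter(labels), z) for z in points) == labels
    for labels in product((0, 1), repeat=l)
)
print(shattered)  # True
```

The sample size `l = 5` is kept small so that floating-point rounding in `a * z` stays far below the margins of the construction; the argument itself works for any `l`, which is why the VC dimension is infinite.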
