Separating Hyperplanes - Perceptron Like Large Margin Classifiers

The class of functions that SVMs employ is the class of hyperplanes defined by

$$ w \cdot x + b = 0, \qquad w \in X,\ b \in \mathbb{R}, \tag{3.1} $$

which for any test point $x \in X$ induce the corresponding decision function $\hat{y}(x) = \operatorname{sgn}(w \cdot x + b)$ taking values in the set $\{1, -1\}$. The solution is uniquely determined once the learning algorithm specifies the weight vector $w$ and the bias $b$. In the beginning we will consider only datasets that can be split into two classes by a hyperplane and will defer the inseparable case for later. We will assume that all the hyperplanes we refer to achieve at least a mere separation of the data. Among all the possible hyperplanes we can distinguish the single pair $(w, b)$ that induces a solution with maximal distance from the nearest points; this distance is called the maximum geometric margin. Usually we omit the qualification "geometric" and call it just the margin. We can rescale the pair $(w, b)$ by multiplying (3.1) by a constant factor and still describe the same hyperplane. Exploiting this freedom, we rescale $(w, b)$ so that the points lying closest to the hyperplane satisfy the following relationship

$$ \min_{i=1,\dots,l} |w \cdot x_i + b| = 1 . \tag{3.2} $$

Let us for the moment regard $x_i$ not as some training pattern but rather as an arbitrary vector satisfying (3.2). Then we can treat (3.2) as an equation describing two separate hyperplanes. The one defined by $w \cdot x_i + b = 1$ lies in the region of the points $x$ characterised as positive ($\hat{y}(x) = 1$), and the other, defined by $w \cdot x_i + b = -1$, lies in the region of points belonging to the negative class ($\hat{y}(x) = -1$). Each of these two equations defines a hyperplane parallel to the one described by (3.1), at a distance equal to the margin that the training points possess from the solution hyperplane. If the index $i$ is used to indicate those training patterns that are closest to the separating hyperplane, then (3.2) can be rewritten as

$$ y_i (w \cdot x_i + b) = 1 . \tag{3.3} $$

A hyperplane for which the pair $(w, b)$ is normalised such that (3.3) holds is said to be written in canonical form. The decision rule $f(x) = w \cdot x + b$, responsible for the assignment of a label to any data point $x$, measures another kind of margin when multiplied by the label $y$ of that point; this is called the functional margin. Moreover, the sign of $y f(x)$ computed for the example $(x, y)$ signifies correct classification when positive and wrong classification when negative. After bringing the hyperplane into canonical form we divide (3.3) by $\|w\|$, so that in the place of $w$ its direction $u = w / \|w\|$ appears. Then the r.h.s. of the resulting equation represents the geometric margin $\gamma$ of the dataset

$$ \gamma = y_i \left( u \cdot x_i + \frac{b}{\|w\|} \right) = \frac{1}{\|w\|} . \tag{3.4} $$
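The rescaling into canonical form and the resulting margin $1/\|w\|$ can be checked numerically. The following is only a minimal sketch: the helper name `canonical_form` and the four toy points are hypothetical and chosen for illustration, not taken from the text.

```python
import numpy as np

def canonical_form(w, b, X):
    """Rescale (w, b) so that min_i |w . x_i + b| = 1, as required by (3.2)."""
    w = np.asarray(w, dtype=float)
    scale = np.min(np.abs(X @ w + b))
    if scale == 0.0:
        raise ValueError("a training point lies on the hyperplane")
    return w / scale, b / scale

# Hypothetical toy data: two positive and two negative points in the plane.
X = np.array([[2.0, 2.0], [3.0, 1.0], [0.0, 0.0], [-1.0, 1.0]])
y = np.array([1, 1, -1, -1])

# Any separating pair (w, b) will do as a starting point.
w, b = canonical_form([1.0, 1.0], -2.5, X)

# In canonical form the closest points have functional margin 1 (3.3),
# so the geometric margin of the dataset is 1 / ||w|| (3.4).
margin = 1.0 / np.linalg.norm(w)
print(y * (X @ w + b))  # functional margins, the smallest equals 1
print(margin)           # distance of the closest points from the hyperplane
```

For these points the closest patterns lie at distance $1.5/\sqrt{2} \approx 1.06$ from the hyperplane $x_1 + x_2 = 2.5$, which is the value `margin` takes after the rescaling.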

In analogy to the above margin of the dataset, which is the minimum distance of the points from the separating hyperplane (3.1), we can define the margin $\gamma(x, y)$ of any point $(x, y)$ to be equal to the distance of that point from the separating hyperplane. The margin $\gamma(x, y)$ is obtained from the relationship

$$ \gamma(x, y) = y \left( u \cdot x + \frac{b}{\|w\|} \right) . $$

The positivity of $\gamma(x, y)$ indicates that an instance is correctly classified with respect to the separating hyperplane. The margin of a misclassified point coincides with the negative of its distance from the hyperplane. Furthermore, if $\gamma(x, y)$ is positive for every training example $(x, y)$ with respect to some hyperplane, then the training set is linearly separable. From the relation (3.4), giving the margin for the points closest to the hyperplane, it is apparent that the margin is larger if (3.3) is satisfied with lower values of the norm of $w$. Assuming that the functional margin of the points closest to the separating hyperplane is normalised to unity, we can look for solutions possessing larger geometric margin by seeking hyperplanes with weight vectors $w$ of lower norm.
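As an illustration of the per-point margin and of the separability criterion just stated, here is a short sketch. The function names `point_margins` and `is_separated` and the toy data are hypothetical, introduced only for this example.

```python
import numpy as np

def point_margins(w, b, X, y):
    """Geometric margin gamma(x, y) = y * (u . x + b / ||w||) of each example."""
    w = np.asarray(w, dtype=float)
    norm = np.linalg.norm(w)
    return y * (X @ (w / norm) + b / norm)

def is_separated(w, b, X, y):
    """True if every example has positive margin w.r.t. the hyperplane (w, b)."""
    return bool(np.all(point_margins(w, b, X, y) > 0))

# Hypothetical toy data: two positive and two negative points in the plane.
X = np.array([[2.0, 2.0], [3.0, 1.0], [0.0, 0.0], [-1.0, 1.0]])
y = np.array([1, 1, -1, -1])

gammas = point_margins([1.0, 1.0], -2.5, X, y)
dataset_margin = gammas.min()  # margin of the dataset = smallest point margin

# A mislabelled point gets a negative margin, so separation fails.
y_flipped = y.copy()
y_flipped[0] = -1
print(gammas, dataset_margin)
print(is_separated([1.0, 1.0], -2.5, X, y_flipped))
```

Because the geometric margin depends only on the direction $u$ and on the ratio $b/\|w\|$, calling `point_margins` with `(10 * w, 10 * b)` returns the same values; only the functional margin changes under such a rescaling.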

In the linearly separable case with margin of at least $\gamma$, and for the class of $\gamma$-margin hyperplanes, a worst case bound follows directly from Corollary 2.13 by setting $m = 0$ and substituting the VC dimension $h$ by its upper bound of Theorem 2.12. We also assume that the dimensionality of the space is so high that the term depending on the margin prevails in the determination of the upper bound on $h$. Thus, with probability $1 - \eta$, the probability that an unseen pattern gives rise to a mistake because it fails to be classified with margin of at least $\gamma$ satisfies

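$$ P_{\text{error}} < \sqrt{ \frac{ \left( \left[ \frac{R^2}{\gamma^2} \right] + 1 \right) (1 + \ln 2l) - \ln \frac{\eta}{4} }{ l } } + \frac{1}{l} . \tag{3.5} $$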

From a mere inspection of (3.5) it is easily understood that we can improve the predictive ability of our learning machine if we seek hyperplanes possessing margins near the maximum one. It is worth pointing out that a bound like that of (3.5), but without the square root on the r.h.s., could be derived by assuming from the beginning that there are no errors in the training set instead of setting the number of training errors to zero in (2.43).
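To see how the bound reacts to the margin, one can evaluate it numerically. The sketch below assumes that (3.5) reads $P_{\text{error}} < \sqrt{\left(([R^2/\gamma^2] + 1)(1 + \ln 2l) - \ln(\eta/4)\right)/l} + 1/l$ and that the square bracket denotes the integer part; both readings, the function name `vc_margin_bound` and the numerical values are assumptions made for illustration.

```python
import math

def vc_margin_bound(R, gamma, l, eta):
    """Evaluate the r.h.s. of (3.5) under the reading stated above.

    R: radius of the ball containing the data, gamma: margin,
    l: number of training patterns, eta: 1 - confidence.
    """
    h = math.floor(R**2 / gamma**2) + 1  # assumed integer-part bound on the VC dimension
    inside = (h * (1.0 + math.log(2 * l)) - math.log(eta / 4.0)) / l
    return math.sqrt(inside) + 1.0 / l

# Hypothetical values: the same data radius and sample size, two different margins.
small_margin = vc_margin_bound(R=1.0, gamma=0.1, l=10_000, eta=0.05)
large_margin = vc_margin_bound(R=1.0, gamma=0.5, l=10_000, eta=0.05)
print(small_margin, large_margin)  # the larger margin yields the smaller bound
```

A larger margin lowers the bound on the VC dimension and with it the value of the bound, which is the effect the paragraph above describes.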

Generalisation bounds depending on the margin were also derived in [50]. More specifically, the following theorem holds.

Theorem 3.1. Suppose inputs are drawn independently according to a distribution whose support is contained in a ball in $\mathbb{R}^n$ centred at the origin, of radius $R$. If we succeed in correctly classifying $l$ such inputs by a canonical hyperplane with $\|w\| = 1/\gamma$ and with $|b| \le R$, then with confidence $1 - \eta$ the generalisation error will be bounded from above by

$$ \epsilon(m, \gamma) = \frac{2}{l} \left( k \log_2 \frac{8el}{k} \, \log_2 (32 l) + \log_2 \frac{8l}{\eta} \right) , $$

where $k = [577 R^2 / \gamma^2]$.
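The bound of Theorem 3.1 can be evaluated in the same spirit. This is a minimal sketch that reads the square bracket in the definition of $k$ as the integer part; that reading, the function name `theorem_3_1_bound` and the numerical values are assumptions made for illustration.

```python
import math

def theorem_3_1_bound(R, gamma, l, eta):
    """Evaluate (2/l) * (k log2(8el/k) log2(32l) + log2(8l/eta)), k = [577 R^2 / gamma^2]."""
    k = math.floor(577.0 * R**2 / gamma**2)  # assumed integer part
    if k == 0:
        raise ValueError("k = 0: the expression log2(8el/k) is undefined")
    capacity = k * math.log2(8.0 * math.e * l / k) * math.log2(32.0 * l)
    confidence = math.log2(8.0 * l / eta)
    return (2.0 / l) * (capacity + confidence)

# Hypothetical values with R / gamma = 2, so that k = 2308.
few_samples = theorem_3_1_bound(R=1.0, gamma=0.5, l=10**6, eta=0.05)
many_samples = theorem_3_1_bound(R=1.0, gamma=0.5, l=10**8, eta=0.05)
print(few_samples, many_samples)
```

Because of the constant 577, the bound exceeds 1 (and is therefore uninformative) for the smaller sample size used here and only drops below 1 for the larger one. Its practical value lies in the dependence on $R^2/\gamma^2$ rather than in the numbers it produces.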

Both (3.5) and Theorem 3.1 are applicable only if we are somehow able to make sure that the machine considered only functions from the restricted class of $\gamma$-margin hyperplanes as acceptable solutions. As a consequence, the generalisation bounds may have to be evaluated with a value of the margin smaller than the one attained by the solution found. In the special case that we know before running that the classifier will find the solution with maximum margin, this margin may be substituted in the above bounds even if its exact value is not known a priori.

Apart from the theoretical arguments, we can rely on our intuition to find reasons why a maximum margin is indeed a good property of the solution hyperplane. Let us train our machine on a set of points and consider a test set which is generated from the training set by adding some noise bounded in norm by the quantity $r$. This means that the resulting test patterns cannot lie outside the spheres of radius $r$ centred at the training points. If the margin $\gamma$ that separates the closest points from the hyperplane is greater than $r$, then all the test points will be classified correctly. A larger margin therefore tolerates a higher level of noise, as measured by $r$, without incurring any error, and an even higher level if we are willing to accept a low test error.
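The noise argument can be simulated directly. In this sketch the toy data, the hyperplane $(w, b)$, the seed and the choice $r = 0.9\gamma$ are all hypothetical. It draws test points inside spheres of radius $r < \gamma$ around the training points and checks that all of them keep their labels, then shows that a perturbation slightly larger than $\gamma$ can flip the closest point.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data and a separating hyperplane (w, b).
X = np.array([[2.0, 2.0], [3.0, 1.0], [0.0, 0.0], [-1.0, 1.0]])
y = np.array([1, 1, -1, -1])
w, b = np.array([1.0, 1.0]), -2.5

norm = np.linalg.norm(w)
gamma = np.min(y * (X @ w + b)) / norm  # geometric margin of the dataset
r = 0.9 * gamma                         # noise level below the margin

# 1000 noisy copies of the training set: random directions, norms at most r.
noise = rng.normal(size=(1000, *X.shape))
noise /= np.linalg.norm(noise, axis=-1, keepdims=True)
noise *= r * rng.uniform(size=(1000, X.shape[0], 1))
X_test = X + noise

pred = np.sign(X_test @ w + b)          # shape (1000, 4)
all_correct = bool(np.all(pred == y))

# Worst case: push the closest positive point towards the hyperplane by 1.1 * gamma.
x_bad = X[0] - 1.1 * gamma * w / norm
flipped = bool(np.sign(x_bad @ w + b) != y[0])
print(all_correct, flipped)
```

The first check succeeds for any noise of norm below $\gamma$, since $|w \cdot \delta| \le \|w\| \, r < \|w\| \, \gamma \le y_i (w \cdot x_i + b)$ for every perturbation $\delta$ with $\|\delta\| \le r$.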
