The Relaxed Online Maximum Margin Algorithm

Another very well-known algorithm addressing the problem of finding the maximum margin hyperplane is the aggressive variant of the Relaxed Online Maximum Margin Algorithm (ROMMA) [38], which is implemented in an incremental setting. Before presenting it, it is useful to discuss what the ideal online maximum margin algorithm would be. Assuming linear separability of the dataset, the ideal online maximum margin algorithm considers a modified version of the objective function appearing in the SVM formulation. Specifically, instead of minimising the norm $\|a_t\|$ of the weight vector subject to the constraints $a_t \cdot y_k \ge 1$ for all $k$ in the sequence of examples, it would choose to minimise $\|a_t\|$ subject to the constraints $a_t \cdot y_k \ge 1$ only for the patterns $y_k$ that have previously been seen by the algorithm.

In order to employ a simple update rule, ROMMA relaxes the above constraints in a fashion that attempts to preserve the online mode of the ideal maximum margin algorithm. Each time a new training example is presented to the algorithm, the condition $a_t \cdot y_k \le 0$ for incorrect classification of $y_k$ is checked. If a prediction mistake occurs, the algorithm updates the current hypothesis $a_t$. Initially $a_1 = 0$, and after the first trial, which is certainly a mistake, the algorithm sets the hypothesis to the shortest weight vector $a_2$ in $\{a : a \cdot y_{k_1} \ge 1\}$. Here $y_{k_1}$ is the first misclassified example, which coincides with the first example in the sequence. When a second prediction mistake occurs, on the second misclassified example $y_{k_2}$, the new hypothesis again follows the ideal online maximum margin paradigm, and $a_3$ is determined as the shortest $a$ that fulfils the combined constraint

\[
\{a : a \cdot y_{k_1} \ge 1\} \cap \{a : a \cdot y_{k_2} \ge 1\}.
\]

In order to keep at most two constraints at every step, the algorithm proceeds differently from that point on. After the next prediction mistake, $a_4$ is determined to be the shortest $a$ lying in

\[
\{a : a_3 \cdot a \ge \|a_3\|^2\} \cap \{a : a \cdot y_{k_3} \ge 1\}.
\]

Generalising the procedure to any subsequent step $t$, we call $H_t = \{a_{t+1} : a_t \cdot a_{t+1} \ge \|a_t\|^2\}$ the old constraint, whereas $\{a_{t+1} : a_{t+1} \cdot y_k \ge 1\}$ is referred to as the new constraint. Both must be satisfied, together with the requirement for the shortest possible $a_{t+1}$, at the $t$-th update involving the pattern $y_k$ that caused the prediction mistake. The old constraint provides a kind of inertia against changes in the solution, since the feasible weight vectors are preferably chosen in the vicinity of the old one in order to keep their norm small. The old constraint thus represents the tendency towards conservativeness and determines the extent to which the old solution contributes to the new weight vector.
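As a compact restatement of the step just described, the $t$-th update can be summarised as a two-constraint optimisation problem. This is only a sketch: it assumes the old constraint is written with the squared norm $\|a_t\|^2$, and it minimises $\tfrac{1}{2}\|a\|^2$ rather than $\|a\|$, which has the same minimiser.

```latex
\[
a_{t+1} \;=\; \arg\min_{a}\; \tfrac{1}{2}\|a\|^2
\quad \text{subject to} \quad
a_t \cdot a \ge \|a_t\|^2
\quad \text{and} \quad
a \cdot y_k \ge 1 .
\]
```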

From the discussion above it is clear that the algorithm only needs to solve a quadratic programming problem with two constraints. We complement the description of the algorithm with an investigation of how a solution satisfying these constraints can be found without resorting to quadratic optimisation. In fact, this provides an efficient way of implementing the algorithm through a simple update rule. It can be proved that both the new and the old constraint, the latter being in force after the first mistaken trial, are binding constraints. This means that they hold as equalities, $\{a_{t+1} : a_{t+1} \cdot y_k = 1\}$ and $\{a_{t+1} : a_t \cdot a_{t+1} = \|a_t\|^2\}$. Each of these constraints describes a hyperplane, which is the locus of (the endpoints of) all the weight vectors that satisfy the corresponding equality. Only weight vectors ending on the intersection of the two hyperplanes satisfy both constraints simultaneously. The update rule is given by the solution of the system consisting of the two constraints, written compactly as

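\[
A a_{t+1} = b, \quad \text{where} \quad
A = \begin{pmatrix} a_t^T \\ y_k^T \end{pmatrix}
\quad \text{and} \quad
b = \begin{pmatrix} \|a_t\|^2 \\ 1 \end{pmatrix}.
\]

It is presumed in this notation that the vector $a_{t+1}$ multiplies separately each entry of the column vector $A$. Notice that in the general case the system is underdetermined, since $A$ is a $2 \times n$ matrix, with $n$ being the dimensionality of the instance space. In this case the solution that minimises the squared error is also the one with the smallest norm and is given by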

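\[
a_{t+1} = A^T \left( A A^T \right)^{-1} b
= \left( \frac{\|y_k\|^2 \|a_t\|^2 - (a_t \cdot y_k)}{\|y_k\|^2 \|a_t\|^2 - (a_t \cdot y_k)^2} \right) a_t
+ \left( \frac{\|a_t\|^2 \left( 1 - (a_t \cdot y_k) \right)}{\|y_k\|^2 \|a_t\|^2 - (a_t \cdot y_k)^2} \right) y_k .
\]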
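The update rule above can be checked numerically with a minimal pure-Python sketch. The function names `dot`, `romma_update` and `romma_update_min_norm` are hypothetical and chosen for illustration only. The sketch assumes that the label is already absorbed into the pattern $y_k$, that the old constraint uses the squared norm $\|a_t\|^2$, and that $a_t$ and $y_k$ are not collinear, so that $A A^T$ is invertible.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))


def romma_update(a_t, y_k):
    """Closed-form update a_{t+1} = c * a_t + d * y_k."""
    aa = dot(a_t, a_t)   # ||a_t||^2
    yy = dot(y_k, y_k)   # ||y_k||^2
    ay = dot(a_t, y_k)   # a_t . y_k
    denom = yy * aa - ay ** 2
    if denom == 0.0:
        # a_t and y_k collinear (or one of them zero): A A^T is singular.
        raise ValueError("closed form undefined for collinear a_t and y_k")
    c = (yy * aa - ay) / denom
    d = aa * (1.0 - ay) / denom
    return [c * ai + d * yi for ai, yi in zip(a_t, y_k)]


def romma_update_min_norm(a_t, y_k):
    """Same update as A^T (A A^T)^{-1} b, inverting the 2x2 matrix by hand.

    Assumes det != 0, i.e. a_t and y_k are not collinear.
    """
    aa, yy, ay = dot(a_t, a_t), dot(y_k, y_k), dot(a_t, y_k)
    det = aa * yy - ay * ay            # det of A A^T = [[aa, ay], [ay, yy]]
    b1, b2 = aa, 1.0                   # b = (||a_t||^2, 1)
    lam1 = (yy * b1 - ay * b2) / det   # first entry of (A A^T)^{-1} b
    lam2 = (aa * b2 - ay * b1) / det   # second entry of (A A^T)^{-1} b
    return [lam1 * ai + lam2 * yi for ai, yi in zip(a_t, y_k)]


# Illustrative values (not taken from the text): a_t . y_k = -0.2 <= 0,
# so y_k is misclassified and an update is triggered.
a_t = [1.0, 0.5, -0.2]
y_k = [-0.3, 1.0, 2.0]
a_next = romma_update(a_t, y_k)
```

After the update, both `dot(a_next, y_k)` and `dot(a_t, a_next)` can be compared with `1` and `dot(a_t, a_t)` respectively, to confirm that the two constraints hold as equalities up to rounding.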

Apart from the ROMMA algorithm just briefly analysed, there exists a variant of it which claims to achieve a predefined $\delta$-approximation of the maximum margin ($0 < \delta \le 1$). This variant is called aggressive ROMMA. Its name is justified by the fact that an update takes place not only after a prediction mistake but also after any trial in which $a_t \cdot y_k \le 1 - \delta$. In this case, in contrast to simple ROMMA, the old constraint may not be active. This means that there exist trials in which only $a_{t+1} \cdot y_k = 1$ has to be satisfied by the $a_{t+1}$ with the shortest length, and this is ensured if

\[
a_{t+1} = \frac{y_k}{\|y_k\|^2} . \tag{4.7}
\]

The old constraint is not binding provided the inequalities

\[
1 - \delta \ge a_t \cdot y_k \ge \|y_k\|^2 \|a_t\|^2 \tag{4.8}
\]

are satisfied. The right-hand inequality comes from substituting (4.7) into the old constraint, which gives $(a_t \cdot y_k) / \|y_k\|^2 \ge \|a_t\|^2$: if the old constraint is to be satisfied automatically, (4.8) must hold. Otherwise, we apply the same update as in ROMMA.
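A single trial of aggressive ROMMA, as described above, might be sketched as follows. The function name `aggressive_romma_step` and the numerical values are hypothetical. The sketch assumes the label is absorbed into $y_k$, that the old constraint uses $\|a_t\|^2$, and that the pattern is neither zero nor anti-parallel to $a_t$.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))


def aggressive_romma_step(a_t, y_k, delta):
    """One trial of aggressive ROMMA on pattern y_k, with 0 < delta <= 1."""
    aa = dot(a_t, a_t)   # ||a_t||^2
    yy = dot(y_k, y_k)   # ||y_k||^2
    ay = dot(a_t, y_k)   # a_t . y_k
    if ay > 1.0 - delta:
        # Margin on y_k is already large enough: no update.
        return list(a_t)
    if ay >= yy * aa:
        # Condition (4.8): the old constraint is not binding, use (4.7).
        # This branch also covers the initial hypothesis a_1 = 0.
        return [yi / yy for yi in y_k]
    # Otherwise both constraints are binding: plain ROMMA update.
    denom = yy * aa - ay ** 2
    if denom == 0.0:
        raise ValueError("a_t and y_k are collinear; constraints are infeasible")
    c = (yy * aa - ay) / denom
    d = aa * (1.0 - ay) / denom
    return [c * ai + d * yi for ai, yi in zip(a_t, y_k)]


delta = 0.5
a_1 = [0.0, 0.0]
a_2 = aggressive_romma_step(a_1, [1.0, 2.0], delta)    # (4.7) branch
a_3 = aggressive_romma_step(a_2, [2.0, -1.0], delta)   # ROMMA branch
a_4 = aggressive_romma_step(a_3, [2.0, -1.0], delta)   # no update needed
a_5 = aggressive_romma_step([0.1, 0.0], [2.0, 0.0], delta)  # (4.8) holds
```

Checking (4.8) before the general update avoids the division in the closed form whenever the shorter vector (4.7) already satisfies the old constraint. It also handles the initial hypothesis $a_1 = 0$ without a special case.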
