Soft Margin Hyperplanes - Perceptron Like Large Margin Classifiers

Until now we have assumed that the training set is linearly separable and we have sought the separating hyperplane with maximum margin. By studying the bound (2.43) on the probability that an unseen point is wrongly classified by the $\Delta$-margin hyperplane, we discover that maximum margin classification is not always the most favourable approach with regard to generalisation. In particular, we can fix a $\Delta$-margin hyperplane with $\Delta$ exceeding the existing margin $\gamma$ at the expense of a few margin errors, if this reduces the worst-case prediction for the probability of a margin mistake. In the inseparable case we are left with no choice but to consider a zone around a given hyperplane extending a distance $\Delta$ into the regions which characterise an example as either positive or negative. The thickness of this zone serves as a substitute for the margin which, of course, does not exist, and will be called a margin as well. It should be clear from the context whether the margin has its usual meaning or its relaxed one. The success of the choice of $\Delta$ will be assessed by conducting margin queries on test data. The bound (2.43) can also in this case provide some theoretical insight which, combined with the margin errors made on an independent test set, can guide the specification of $\Delta$. In order to treat inseparable datasets we have to modify the formulation of the original optimisation problem.

Ideally we would like to achieve a solution with a margin of at least $\Delta$ but also with the minimum possible number of mistakes. We give the constraints of the primal optimisation problem encountered earlier the opportunity to hold as equalities by relaxing them through the introduction of some non-negative quantities $\xi_i$, $i = 1, \ldots, l$:

$$y_i (w \cdot x_i + b) \ge 1 - \xi_i, \quad i = 1, \ldots, l. \tag{3.12}$$

Figure 3.2: The slack variables for a classification problem.

The variables $\xi_i$, $i = 1, \ldots, l$, denoted compactly as $\xi$, are called slack variables [55, 6] (Fig. 3.2). All we have to do now in order to complete the statement of our optimisation problem is to set the objective $J$ to

$$J = \sum_{i=1}^{l} \xi_i^{\sigma},$$

where $\sigma$ is a positive parameter tending to zero, and to impose the additional constraint on the margin, written formally as $\|w\|^2 \le \Delta^{-2}$. Since $\xi_i^{\sigma}$ tends to 1 for every $\xi_i > 0$ and vanishes for $\xi_i = 0$, the minimisation of the objective corresponds to a minimisation of the number of margin mistakes.
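The limiting behaviour of this objective can be checked numerically. The following is a minimal sketch in plain Python; the slack values and the helper name `slack_power_sum` are made up for illustration and do not come from the text.

```python
def slack_power_sum(xi, sigma):
    """Objective J = sum_i xi_i^sigma for a list of non-negative slacks."""
    return sum(s ** sigma for s in xi)

# Hypothetical slack values: two examples with zero slack, two margin mistakes.
xi = [0.0, 0.5, 0.0, 1.5]

# A margin mistake is taken here to be any example with a strictly positive slack.
margin_mistakes = sum(1 for s in xi if s > 0)

# As sigma shrinks towards zero, J approaches the number of margin mistakes.
for sigma in (1.0, 0.1, 0.001):
    print(f"sigma={sigma}: J={slack_power_sum(xi, sigma):.6f}, "
          f"mistakes={margin_mistakes}")
```

For small $\sigma$ each positive slack contributes almost exactly 1 regardless of its size, so the objective no longer distinguishes a slight violation from a severe one.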

Since the above optimisation problem is computationally intractable we allow for a relaxation of it. Specifically, we do not require the construction of a $\Delta$-margin hyperplane but pursue instead the maximisation of the margin in a criterion which also involves a penalty for the margin mistakes. Thus, in addition to the usual term $\frac{1}{2}\|w\|^2$, a new term enters this criterion which is proportional to the previously mentioned objective with the parameter $\sigma$ fixed to strictly positive values. This new concept of optimal hyperplane is called the soft-margin optimal hyperplane [13] and is determined by the minimisation of the following criterion $J$

$$J = \frac{1}{2}\|w\|^2 + \frac{C}{\sigma} \sum_{i=1}^{l} \xi_i^{\sigma}$$

subject to the constraints (3.12). The free parameter $C$ determines the trade-off between the maximisation of the margin and the minimisation of the sum of the $\xi_i^{\sigma}$. In the sequel we distinguish two commonly encountered cases according to the value of $\sigma$.
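To make the criterion concrete, the sketch below evaluates $J$ for a fixed hyperplane in plain Python. It assumes that each slack takes the smallest non-negative value compatible with (3.12), i.e. $\xi_i = \max(0, 1 - y_i(w \cdot x_i + b))$; the toy data, the hyperplane and the function names are illustrative assumptions only.

```python
def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(ui * vi for ui, vi in zip(u, v))

def slacks(w, b, X, y):
    """Smallest xi_i >= 0 satisfying y_i (w.x_i + b) >= 1 - xi_i."""
    return [max(0.0, 1.0 - yi * (dot(w, xi) + b)) for xi, yi in zip(X, y)]

def soft_margin_objective(w, b, X, y, C, sigma):
    """J = 0.5 * ||w||^2 + (C / sigma) * sum_i xi_i^sigma."""
    xi = slacks(w, b, X, y)
    return 0.5 * dot(w, w) + (C / sigma) * sum(s ** sigma for s in xi)

# Toy, linearly inseparable data (made up for illustration).
X = [[2.0, 0.0], [0.5, 0.0], [-2.0, 0.0], [0.5, 1.0]]
y = [1.0, 1.0, -1.0, -1.0]
w, b = [1.0, 0.0], 0.0          # a fixed candidate hyperplane, not an optimum

xi = slacks(w, b, X, y)                                   # [0.0, 0.5, 0.0, 1.5]
J1 = soft_margin_objective(w, b, X, y, C=1.0, sigma=1)    # 1-norm criterion
J2 = soft_margin_objective(w, b, X, y, C=1.0, sigma=2)    # 2-norm criterion
print(xi, J1, J2)
```

The second example lies inside the margin but on the correct side ($0 < \xi_i < 1$), whereas the fourth is misclassified ($\xi_i > 1$); both are penalised, the latter more heavily.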

First we treat the case $\sigma = 1$, known as the 1-norm optimisation problem. The problem can be solved by applying techniques analogous to those used for finding the maximum margin in the separable case. In particular, we construct the corresponding Lagrangian which, after eliminating the primal variables $(w, b, \xi)$, assumes a dual form $L(\alpha)$ identical with that of (3.9). The problem which we are asked to solve is the maximisation of $L(\alpha)$ under the constraints $\sum_{i=1}^{l} \alpha_i y_i = 0$ and

$$0 \le \alpha_i \le C.$$
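As an illustration, the sketch below evaluates the dual objective and checks the two constraints for a candidate $\alpha$. It assumes that (3.9), which is not reproduced in this section, is the standard dual $L(\alpha) = \sum_i \alpha_i - \frac{1}{2}\sum_{i,j} y_i y_j \alpha_i \alpha_j \, x_i \cdot x_j$. The one-dimensional data set and the values of $\alpha$ are hand-constructed for this example, and the function names are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))

def dual_objective(alpha, X, y):
    """L(alpha) = sum_i alpha_i - 0.5 * sum_ij y_i y_j alpha_i alpha_j x_i.x_j."""
    n = len(alpha)
    quad = sum(y[i] * y[j] * alpha[i] * alpha[j] * dot(X[i], X[j])
               for i in range(n) for j in range(n))
    return sum(alpha) - 0.5 * quad

def is_feasible(alpha, y, C, tol=1e-9):
    """Check sum_i alpha_i y_i = 0 and the box constraint 0 <= alpha_i <= C."""
    balanced = abs(sum(a * yi for a, yi in zip(alpha, y))) <= tol
    in_box = all(-tol <= a <= C + tol for a in alpha)
    return balanced and in_box

# Toy 1-D data: two examples on the margin and two inside it (illustrative).
X = [[1.0], [-1.0], [0.5], [-0.5]]
y = [1.0, -1.0, 1.0, -1.0]
C = 0.5
alpha = [0.25, 0.25, 0.5, 0.5]   # hand-constructed candidate solution

print(is_feasible(alpha, y, C), dual_objective(alpha, X, y))
```

In practice $\alpha$ would come from a quadratic programming solver; the point here is only that the box constraint is the sole difference from the separable case.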

Observe that the constraint $0 \le \alpha_i \le C$ forces the variables $\alpha_i$ to lie between 0 and $C$, which is why it is called the box constraint [13]. The KKT complementarity conditions for this problem are

$$\alpha_i \left[ y_i (w \cdot x_i + b) - 1 + \xi_i \right] = 0, \quad i = 1, \ldots, l, \tag{3.13}$$

$$\xi_i (\alpha_i - C) = 0, \quad i = 1, \ldots, l.$$

The first of them implies that the $\alpha_i$'s are zero for the inactive constraints, whereas from the second we conclude that the slack variables $\xi_i$ have non-zero values only for $\alpha_i = C$. These $\alpha_i$'s correspond to examples which violate the margin requirement $1/\|w\|$. The examples for which $0 < \alpha_i < C$ lie at a distance $1/\|w\|$ from the separating hyperplane and have zero slacks. Furthermore, they satisfy the constraints (3.12) as equalities, enabling us to use them in order to determine the only unknown quantity, namely $b$, given that the slacks vanish.
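The way these conditions are used to recover $b$ can be sketched as follows. The code assumes the usual expansion $w = \sum_i \alpha_i y_i x_i$ from the separable case, averages $b = y_i - w \cdot x_i$ over the examples with $0 < \alpha_i < C$, and then verifies both complementarity conditions. The data, the values of $\alpha$ and the function name are illustrative assumptions, not taken from the text.

```python
def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))

def analyse_dual_solution(alpha, X, y, C, tol=1e-9):
    """Recover (w, b, slacks) and split the examples by the value of alpha_i."""
    dim = len(X[0])
    # Assumed expansion of the weight vector: w = sum_i alpha_i y_i x_i.
    w = [sum(a * yi * xi[k] for a, yi, xi in zip(alpha, y, X))
         for k in range(dim)]
    on_margin = [i for i, a in enumerate(alpha) if tol < a < C - tol]
    at_bound = [i for i, a in enumerate(alpha) if a >= C - tol]
    if not on_margin:
        raise ValueError("no example with 0 < alpha_i < C; b is not determined")
    # For 0 < alpha_i < C the slack vanishes, so y_i (w.x_i + b) = 1 exactly.
    b_values = [y[i] - dot(w, X[i]) for i in on_margin]
    b = sum(b_values) / len(b_values)
    xi = [max(0.0, 1.0 - yi * (dot(w, x) + b)) for x, yi in zip(X, y)]
    return w, b, xi, on_margin, at_bound

# Toy 1-D problem with a hand-constructed dual solution for C = 0.5.
X = [[1.0], [-1.0], [0.5], [-0.5]]
y = [1.0, -1.0, 1.0, -1.0]
C = 0.5
alpha = [0.25, 0.25, 0.5, 0.5]

w, b, xi, on_margin, at_bound = analyse_dual_solution(alpha, X, y, C)

# Residuals of the two complementarity conditions; both lists should vanish.
kkt1 = [a * (yi * (dot(w, x) + b) - 1.0 + s)
        for a, yi, x, s in zip(alpha, y, X, xi)]
kkt2 = [s * (a - C) for s, a in zip(xi, alpha)]
print(w, b, xi, on_margin, at_bound, kkt1, kkt2)
```

Averaging over all on-margin examples rather than using a single one is a common numerical precaution; with an exact solution every such example yields the same $b$.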

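We turn now to the case $\sigma = 2$, known as the 2-norm optimisation problem. Following the same procedure as before we obtain the dual Lagrangian $L(\alpha)$, free from the primal variables,

$$L(\alpha) = \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{l} y_i y_j \alpha_i \alpha_j \, x_i \cdot x_j - \frac{1}{2C} \sum_{i=1}^{l} \alpha_i^2,$$

written equivalently as

$$L(\alpha) = \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{l} y_i y_j \alpha_i \alpha_j \left( x_i \cdot x_j + \frac{1}{C} \delta_{ij} \right), \tag{3.14}$$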

where $\delta_{ij}$ is Kronecker's $\delta$. The KKT complementarity conditions assume the same form as in (3.13). From (3.14) we can see that $L(\alpha)$ remains essentially the same as the dual Lagrangian occurring in the maximisation of the margin in the separable case except for the term $\frac{1}{C}\delta_{ij}$. The term $(x_i \cdot x_j)$, as we have commented before, corresponds to the [...] into the kernel by weighting its diagonal by the quantity $1/C$ [52, 53, 14]

$$K'(x, x') = K(x, x') + \frac{1}{C} \delta_{x x'}$$

and solve the maximum margin problem.
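On a finite training set the modified kernel amounts to adding $1/C$ to the diagonal of the Gram matrix. A minimal sketch in plain Python follows; the linear kernel, the data and the function names are assumptions chosen for illustration.

```python
def linear_kernel(u, v):
    """K(x, x') = x . x' (the inner product in input space)."""
    return sum(a * b for a, b in zip(u, v))

def gram_matrix(X, kernel):
    """Matrix of kernel values K_ij = K(x_i, x_j) over the training set."""
    return [[kernel(xi, xj) for xj in X] for xi in X]

def shift_diagonal(K, C):
    """K'_ij = K_ij + delta_ij / C, the 2-norm soft-margin kernel."""
    n = len(K)
    return [[K[i][j] + (1.0 / C if i == j else 0.0) for j in range(n)]
            for i in range(n)]

# Toy training inputs (made up for illustration).
X = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
K = gram_matrix(X, linear_kernel)
K_soft = shift_diagonal(K, C=2.0)

for row in K_soft:
    print(row)
```

Any hard-margin solver that accepts a precomputed Gram matrix could then be run on `K_soft` unchanged. The shift concerns the training Gram matrix only; new points would normally be evaluated with the original kernel $K$.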

Generalisation bounds involving the 1-norm or the 2-norm of the slack vector ξ were derived in [51,52,53].
