
4.2 Support Vector Machine

4.2.4 Multi-Class Extensions

The basic version of the SVM algorithm handles only binary classification and does not extend directly to the multi-class case, in which examples of $n_\Omega$ classes have to be distinguished. Even though some formulations that solve the multi-class problem with a single SVM-like optimisation problem have been proposed (see e.g. [Hsu and Lin, 2002]), decomposition schemes are still the most frequently used approaches for multi-class SVMs. In these schemes, the multi-class problem is decomposed into a set of binary problems, each solved by a separate binary SVM, and the outcomes of the individual binary classifiers are combined into a final multi-class prediction. This divide-and-conquer strategy is not restricted to the SVM but can be used with any other binary classification algorithm. The most frequently used decomposition schemes are one-versus-all and one-versus-one, but other, more sophisticated approaches have been proposed (e.g. [Platt et al., 2000, Dietterich and Bakiri, 1995]).

One-Versus-All Decomposition

In the one-versus-all approach, $n_\Omega$ binary SVMs are trained. The $k$-th classifier is trained to discriminate examples of class $\omega_k$ from examples of the remaining classes. An unseen example $x$ is then assigned to the class given by

$$\operatorname*{argmax}_{k=1,\dots,n_\Omega} f_{\theta_k}(x) \tag{4.28}$$

with $f_{\theta_k}(x)$ as the signed margin value of the $k$-th binary SVM.
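The decision rule of Eq. (4.28) can be sketched in a few lines of Python. This is a minimal illustration, not a trained model: the helper `linear_decision`, the name `predict_one_vs_all`, the hand-picked weights, the 0-based class indices, and the tie-breaking rule (lowest index wins) are all assumptions made for the example.

```python
from typing import Callable, Sequence

# Hypothetical type: a decision function returning the signed margin of one binary SVM.
DecisionFunction = Callable[[Sequence[float]], float]


def predict_one_vs_all(decision_functions: Sequence[DecisionFunction],
                       x: Sequence[float]) -> int:
    """Return the 0-based index k of the classifier with the largest signed margin."""
    margins = [f(x) for f in decision_functions]
    # argmax over k as in Eq. (4.28); on ties, max() keeps the lowest index (assumption).
    return max(range(len(margins)), key=margins.__getitem__)


def linear_decision(w: Sequence[float], b: float) -> DecisionFunction:
    """Toy linear decision function <w, x> + b, standing in for a trained binary SVM."""
    return lambda x: sum(wi * xi for wi, xi in zip(w, x)) + b


# Three hand-picked (not trained) classifiers for a two-dimensional toy problem.
classifiers = [
    linear_decision([1.0, 0.0], 0.0),
    linear_decision([0.0, 1.0], 0.0),
    linear_decision([-1.0, -1.0], 0.0),
]

predicted = predict_one_vs_all(classifiers, [2.0, 0.5])
```

For the input `[2.0, 0.5]` the three margins are 2.0, 0.5 and -2.5, so the first classifier wins. Note that the rule compares margins of independently trained classifiers, which implicitly assumes that their output scales are comparable.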

One-Versus-One Decomposition

In the one-versus-one scheme, a single binary classifier is trained for each possible pair of classes. Thus, the multi-class problem is decomposed into $\frac{n_\Omega(n_\Omega-1)}{2}$ binary problems. For the classification of an unseen example, the binary response $\operatorname{sgn}[f_{\theta_{kl}}(x)]$ of the classifier distinguishing examples from class $\omega_k$ versus examples from class $\omega_l$ is considered as a vote either for class $\omega_k$ or for $\omega_l$. The example is finally assigned to the class which obtains the highest number of votes (Max-Win strategy).
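The Max-Win voting can be sketched as follows. The sign convention (a positive margin counts as a vote for the first class of the pair), the tie-breaking rule (lowest class index), and the toy pairwise classifiers built by the hypothetical helper `closer_to` are assumptions of this example; they stand in for trained binary SVMs.

```python
from collections import Counter
from itertools import combinations
from typing import Callable, Dict, Sequence, Tuple

# Hypothetical container: one decision function per class pair (k, l) with k < l.
PairwiseClassifiers = Dict[Tuple[int, int], Callable[[Sequence[float]], float]]


def predict_one_vs_one(pairwise: PairwiseClassifiers, n_classes: int,
                       x: Sequence[float]) -> int:
    """Max-Win voting over all n_classes * (n_classes - 1) / 2 pairwise classifiers."""
    votes: Counter = Counter()
    for k, l in combinations(range(n_classes), 2):
        margin = pairwise[(k, l)](x)
        # Assumed convention: positive margin votes for class k, otherwise for class l.
        votes[k if margin > 0 else l] += 1
    # Class with the most votes; ties go to the lowest class index (assumption).
    return max(range(n_classes), key=lambda c: votes[c])


def closer_to(centre_k: float, centre_l: float) -> Callable[[Sequence[float]], float]:
    """Toy pairwise classifier: positive if x[0] is closer to centre_k than to centre_l."""
    return lambda x: abs(x[0] - centre_l) - abs(x[0] - centre_k)


centres = [0.0, 5.0, 10.0]  # one toy class centre per class
pairwise = {(k, l): closer_to(centres[k], centres[l])
            for k, l in combinations(range(len(centres)), 2)}

predicted = predict_one_vs_one(pairwise, len(centres), [4.0])
```

With three classes there are three pairwise classifiers. For the input `[4.0]`, class 1 collects two votes and class 0 one vote, so class 1 is returned.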

Directed Acyclic Graph Support Vector Machine

An alternative approach for recombining the outputs of the $\frac{n_\Omega(n_\Omega-1)}{2}$ binary classifiers is the directed acyclic graph support vector machine (DAGSVM) [Platt et al., 2000]. The DAGSVM is organised as a rooted binary directed acyclic graph with $\frac{n_\Omega(n_\Omega-1)}{2}$ internal nodes and $n_\Omega$ leaves. Each internal node corresponds to one of the classification models discriminating examples of two classes, and each leaf corresponds to a class label. For the classification of an unseen example, the graph is traversed starting from the root node. At each node, the next subgraph is selected according to the output of the classification model corresponding to the current node. Thus, to determine the class label given by the leaf that is finally reached, only $n_\Omega - 1$ of the classifiers have to be applied to the unseen example, which reduces the evaluation time.
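One common way to picture the traversal is as an elimination over a list of candidate classes: each evaluated node discards one class until a single class remains. The sketch below follows this view. The ordering of the candidate list, the sign convention, and the toy classifiers from the hypothetical helper `closer_to` are assumptions of the example rather than details given in the text.

```python
from itertools import combinations
from typing import Callable, Dict, Sequence, Tuple

# Hypothetical container: one decision function per class pair (k, l) with k < l.
PairwiseClassifiers = Dict[Tuple[int, int], Callable[[Sequence[float]], float]]


def predict_dag(pairwise: PairwiseClassifiers, n_classes: int,
                x: Sequence[float]) -> Tuple[int, int]:
    """Traverse the decision graph; return (predicted class, classifiers evaluated)."""
    candidates = list(range(n_classes))  # ascending order, so k < l below
    evaluations = 0
    while len(candidates) > 1:
        k, l = candidates[0], candidates[-1]
        margin = pairwise[(k, l)](x)
        evaluations += 1
        if margin > 0:
            candidates.pop()    # node decided against class l
        else:
            candidates.pop(0)   # node decided against class k
    return candidates[0], evaluations


def closer_to(centre_k: float, centre_l: float) -> Callable[[Sequence[float]], float]:
    """Toy pairwise classifier: positive if x[0] is closer to centre_k than to centre_l."""
    return lambda x: abs(x[0] - centre_l) - abs(x[0] - centre_k)


centres = [0.0, 5.0, 10.0, 15.0]  # one toy class centre per class
pairwise = {(k, l): closer_to(centres[k], centres[l])
            for k, l in combinations(range(len(centres)), 2)}

label, n_evaluated = predict_dag(pairwise, len(centres), [11.0])
```

In this four-class toy problem six pairwise classifiers exist, but only three are evaluated per example, one fewer than the number of classes.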

Error-Correcting-Output-Codes Framework

The one-versus-all and one-versus-one schemes can be regarded as special cases of the more general error-correcting-output-codes (ECOC) framework proposed by Dietterich and Bakiri [1995]. In its early version, the multi-class problem is decomposed by training $n_K$ binary classifiers on different partitions of the training data. Subsequently, the binary responses of the $n_K$ binary classification models for an unseen example $x$ are combined into an $n_K$-dimensional output vector $o$. This vector is evaluated by means of a decoding matrix $D \in \{\pm 1\}^{n_\Omega \times n_K}$ containing a unique code vector for each class. The final class response is determined by calculating the best matching code vector using

$$\operatorname*{argmin}_{k=1,\dots,n_\Omega} d(D_{k\bullet}, o) \tag{4.29}$$

with $d(D_{k\bullet}, o)$ as the Hamming distance between $o$ and the $k$-th row of $D$, which contains the code vector of class $\omega_k$. Later, the ECOC scheme was extended by [Allwein et al., 2000] to take the continuous margin values into account. For this purpose, the Hamming-based decoding was replaced by a decoding which employs a suitable loss function.
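Both decoding variants can be illustrated with a short Python sketch. The code matrix below is a hand-picked example with pairwise Hamming distance 4 between rows, so a single wrong binary response can still be corrected. The function names, the choice of the hinge loss for the loss-based variant, and the example margins are assumptions made for illustration.

```python
from typing import List, Sequence

# Hand-picked code matrix for 4 classes and 7 binary classifiers (illustrative only).
CODE_MATRIX: List[List[int]] = [
    [+1, +1, +1, +1, +1, +1, +1],
    [-1, -1, -1, -1, +1, +1, +1],
    [-1, -1, +1, +1, -1, -1, +1],
    [-1, +1, -1, +1, -1, +1, -1],
]


def hamming_distance(code: Sequence[int], output: Sequence[int]) -> int:
    """Number of positions in which the code vector and the output vector differ."""
    return sum(c != o for c, o in zip(code, output))


def decode_hamming(code_matrix: Sequence[Sequence[int]], output: Sequence[int]) -> int:
    """Eq. (4.29): 0-based index of the row with minimal Hamming distance to the output."""
    return min(range(len(code_matrix)),
               key=lambda k: hamming_distance(code_matrix[k], output))


def decode_loss_based(code_matrix: Sequence[Sequence[int]],
                      margins: Sequence[float]) -> int:
    """Loss-based decoding on continuous margins, here with the hinge loss (assumption)."""
    def total_loss(code: Sequence[int]) -> float:
        return sum(max(0.0, 1.0 - c * m) for c, m in zip(code, margins))
    return min(range(len(code_matrix)), key=lambda k: total_loss(code_matrix[k]))


# Binary responses equal to the code of class 2, except for an error in the last position.
binary_output = [-1, -1, +1, +1, -1, -1, -1]
decoded_hamming = decode_hamming(CODE_MATRIX, binary_output)

# Margins with the same signs; the erroneous last response has a small magnitude.
margin_output = [-0.9, -1.2, 0.8, 1.1, -0.7, -1.0, -0.1]
decoded_loss = decode_loss_based(CODE_MATRIX, margin_output)
```

In this example both decoders recover class 2 despite the wrong last response. The loss-based variant additionally weights each response by its margin, so a low-confidence error contributes less than a high-confidence one.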

In [Hsu and Lin, 2002], the different decomposition schemes for multi-class classification using binary SVMs are compared on standard benchmark data sets. Even though a significantly larger number of binary SVMs has to be trained for the one-versus-one and DAGSVM schemes, both the training time and the evaluation time for unseen examples can be shorter than for the one-versus-all scheme. The computational complexity of SVM training scales roughly quadratically to cubically with the number of training examples [Schölkopf et al., 1999b]. Thus, solving a larger number of smaller quadratic programmes can be computationally less expensive than solving a smaller number of larger quadratic programmes. The computational expense of classifying an unseen example is dominated by the number of kernel evaluations, which depends on the number of support vectors. For nonlinear kernel functions such as the Gaussian kernel, the number of support vectors and, with it, the number of kernel evaluations needed for classifying an example typically increases with the number of training examples. For the multi-class data sets considered by Hsu and Lin [2002], the total number of support vectors of the one-versus-one solution was smaller than that of the one-versus-all solution, resulting in a longer evaluation time for the latter scheme. In terms of accuracy, the experiments indicated comparable performance of the one-versus-one and one-versus-all schemes with a Gaussian kernel, but a slightly superior performance of the former scheme if the problem is solved by a set of SVMs with a linear kernel.
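The training-cost argument can be made concrete with a back-of-the-envelope calculation. The sketch below assumes equally sized classes and a training cost proportional to the number of examples raised to a fixed exponent between 2 and 3. Both are simplifying assumptions, and the function names are hypothetical. The resulting numbers are relative cost units, not measured run times.

```python
def one_vs_all_cost(n_classes: int, n_examples: int, exponent: float) -> float:
    """n_classes problems, each trained on all n_examples examples."""
    return n_classes * n_examples ** exponent


def one_vs_one_cost(n_classes: int, n_examples: int, exponent: float) -> float:
    """n_classes * (n_classes - 1) / 2 problems, each trained on the examples of two classes."""
    n_problems = n_classes * (n_classes - 1) // 2
    # Balanced classes assumed: each pairwise problem sees 2 / n_classes of the data.
    examples_per_problem = 2 * n_examples / n_classes
    return n_problems * examples_per_problem ** exponent


# Example: 10 balanced classes, 1000 training examples, quadratic scaling.
cost_all = one_vs_all_cost(10, 1000, 2)   # 10 problems with 1000 examples each
cost_one = one_vs_one_cost(10, 1000, 2)   # 45 problems with 200 examples each
```

Under these assumptions the one-versus-all scheme costs 10 · 1000² = 10,000,000 units, while the one-versus-one scheme costs 45 · 200² = 1,800,000 units. The gap widens further for cubic scaling.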