4.2 Support Vector Machine
4.2.4 Multi-Class Extensions
The basic version of the SVM algorithm considers only binary classification and does not directly cover the multi-class case, in which examples of nΩ classes have to be distinguished. Even though some formulations that solve the multi-class problem by a single SVM-like optimisation problem have been proposed (see e.g. [Hsu and Lin, 2002]), decomposition schemes are still the most frequently used approaches for multi-class SVMs. In these schemes, the multi-class problem is decomposed into a set of binary problems, each solved by a separate binary SVM, and the outcomes of the individual binary classifiers are combined into a final multi-class prediction. This divide-and-conquer strategy is not restricted to the SVM but can be used with any other binary classification algorithm. The most frequently used decomposition schemes are one-versus-all and one-versus-one, but other, more sophisticated approaches have been proposed (e.g. [Platt et al., 2000, Dietterich and Bakiri, 1995]).
One-Versus-All Decomposition
In the one-versus-all approach, nΩ binary SVMs are trained. The k-th classifier is trained to
discriminate examples of class ωk from examples of the remaining classes. An unseen example x
is then assigned to the class given by
\[
\operatorname*{argmax}_{k=1,\dots,n_\Omega} f_{\theta_k}(x) \tag{4.28}
\]
with fθk(x) as the signed margin value of the k-th binary SVM.
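The decision rule (4.28) can be sketched in a few lines of Python. This is a minimal illustration, not a training procedure: `linear_decision_function` is a hypothetical stand-in for an already trained binary SVM, the weights are made up, and classes are indexed from 0 rather than from 1.

```python
from typing import Callable, Sequence

# A binary decision function maps an example to a signed margin value.
DecisionFunction = Callable[[Sequence[float]], float]


def linear_decision_function(w: Sequence[float], b: float) -> DecisionFunction:
    """Hypothetical stand-in for a trained binary SVM: f(x) = <w, x> + b."""
    def f(x: Sequence[float]) -> float:
        return sum(wi * xi for wi, xi in zip(w, x)) + b
    return f


def predict_one_versus_all(classifiers: Sequence[DecisionFunction],
                           x: Sequence[float]) -> int:
    """Return the (0-based) index of the classifier with the largest margin value."""
    margins = [f(x) for f in classifiers]
    # max() returns the first maximal index, so ties go to the lowest class index.
    return max(range(len(margins)), key=margins.__getitem__)


# Three illustrative classifiers, one per class (weights are assumptions, not trained).
ova_classifiers = [
    linear_decision_function([1.0, 0.0], 0.0),    # class 0 versus the rest
    linear_decision_function([0.0, 1.0], 0.0),    # class 1 versus the rest
    linear_decision_function([-1.0, -1.0], 0.0),  # class 2 versus the rest
]

print(predict_one_versus_all(ova_classifiers, [0.2, 0.9]))
```

For the example `[0.2, 0.9]` the margin values are 0.2, 0.9 and -1.1, so the second classifier wins. Note that comparing raw margin values across classifiers implicitly assumes that their outputs are on comparable scales.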
One-Versus-One Decomposition
In the one-versus-one scheme, a single binary classifier is trained for each possible pair of classes. Thus, the multi-class problem is decomposed into nΩ(nΩ−1)/2 binary problems. For the classification of an unseen example, the binary response sgn[fθkl(x)] of the classifier distinguishing examples from class ωk versus examples from class ωl is considered as a vote either for class ωk or for class ωl. The example is finally assigned to the class which obtains the highest number of votes (Max-Win strategy).
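The voting procedure can be sketched as follows. The pairwise classifiers here are hypothetical nearest-centroid stand-ins for trained binary SVMs, and two conventions are assumptions of this sketch rather than part of the scheme: a margin value of exactly zero counts as a vote for the first class of the pair, and ties in the vote count go to the lowest class index.

```python
from itertools import combinations
from typing import Callable, Dict, Sequence, Tuple

DecisionFunction = Callable[[Sequence[float]], float]


def make_pairwise_classifier(c_k: Sequence[float],
                             c_l: Sequence[float]) -> DecisionFunction:
    """Hypothetical stand-in for a trained SVM separating class k from class l.

    The returned function is positive if x is closer to centroid c_k than to c_l.
    """
    def f(x: Sequence[float]) -> float:
        d_k = sum((xi - ci) ** 2 for xi, ci in zip(x, c_k))
        d_l = sum((xi - ci) ** 2 for xi, ci in zip(x, c_l))
        return d_l - d_k
    return f


def predict_one_versus_one(pairwise: Dict[Tuple[int, int], DecisionFunction],
                           n_classes: int,
                           x: Sequence[float]) -> int:
    """Collect one vote per pairwise classifier and return the class with most votes."""
    votes = [0] * n_classes
    for (k, l), f in pairwise.items():
        votes[k if f(x) >= 0 else l] += 1
    return max(range(n_classes), key=votes.__getitem__)


# Illustrative class centroids (made up); one classifier per pair of classes.
centroids = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0)]
pairwise = {
    (k, l): make_pairwise_classifier(centroids[k], centroids[l])
    for k, l in combinations(range(len(centroids)), 2)
}

print(len(pairwise))                                       # n(n-1)/2 classifiers
print(predict_one_versus_one(pairwise, 3, (3.5, 0.5)))     # class index with most votes
```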
Directed Acyclic Graph Support Vector Machine
An alternative approach for recombining the outputs of the nΩ(nΩ−1)/2 binary classifiers is the directed acyclic graph support vector machine (DAGSVM) [Platt et al., 2000]. The DAGSVM is organised as a rooted binary directed acyclic graph with nΩ(nΩ−1)/2 internal nodes and nΩ leaves. Each internal node corresponds to one of the classification models discriminating examples of two classes. For the classification of an unseen example, the graph is traversed starting from the root node. At each node, the next subgraph is selected according to the output of the classification model corresponding to the current node. Thus, to determine the class label given by the leaf that is finally reached, only nΩ−1 of the classifiers have to be applied to the unseen example, which reduces the evaluation time.
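The graph traversal can be illustrated with an equivalent list-based formulation: start with all classes as candidates, let the classifier for the first and last candidate eliminate one of them, and repeat until a single class remains. The sketch below assumes this elimination view and again uses hypothetical nearest-centroid functions in place of trained binary SVMs.

```python
from itertools import combinations
from typing import Callable, Dict, Sequence, Tuple

DecisionFunction = Callable[[Sequence[float]], float]


def make_pairwise_classifier(c_k: Sequence[float],
                             c_l: Sequence[float]) -> DecisionFunction:
    """Hypothetical stand-in for a trained SVM: positive if x is closer to c_k."""
    def f(x: Sequence[float]) -> float:
        d_k = sum((xi - ci) ** 2 for xi, ci in zip(x, c_k))
        d_l = sum((xi - ci) ** 2 for xi, ci in zip(x, c_l))
        return d_l - d_k
    return f


def predict_dag(pairwise: Dict[Tuple[int, int], DecisionFunction],
                n_classes: int,
                x: Sequence[float]) -> Tuple[int, int]:
    """Return (predicted class, number of classifier evaluations)."""
    candidates = list(range(n_classes))  # kept in ascending order, so k < l below
    n_evaluations = 0
    while len(candidates) > 1:
        k, l = candidates[0], candidates[-1]
        n_evaluations += 1
        if pairwise[(k, l)](x) >= 0:
            candidates.pop()     # classifier prefers k: class l is eliminated
        else:
            candidates.pop(0)    # classifier prefers l: class k is eliminated
    return candidates[0], n_evaluations


# Four illustrative class centroids (made up), giving six pairwise classifiers.
centroids = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0), (4.0, 4.0)]
pairwise = {
    (k, l): make_pairwise_classifier(centroids[k], centroids[l])
    for k, l in combinations(range(len(centroids)), 2)
}

label, n_used = predict_dag(pairwise, len(centroids), (3.6, 3.9))
print(label, n_used, len(pairwise))
```

In this example only three of the six pairwise classifiers are evaluated, whereas the voting strategy of the one-versus-one scheme would evaluate all six.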
Error-Correcting-Output-Codes Framework
The one-versus-all and one-versus-one schemes can be regarded as special cases of the more general error-correcting-output-codes (ECOC) framework proposed by [Dietterich and Bakiri, 1995]. In its early version, the multi-class problem is decomposed by training nK binary classifiers, each on a different partition of the classes into two groups. Subsequently, the binary responses of the nK binary classification models for an unseen example x are combined into an nK-dimensional output vector o. This vector is evaluated by a decoding matrix D ∈ {±1}^(nΩ×nK) containing a unique code vector for each
class. The final class response is determined by calculating the best matching code vector using
\[
\operatorname*{argmin}_{k=1,\dots,n_\Omega} d(D_{k\bullet}, o) \tag{4.29}
\]
with d(Dk•, o) as the Hamming distance between o and the k-th row of D, which contains the code vector of class ωk. Later, the ECOC scheme was extended by [Allwein et al., 2000] to take the continuous margin values into account. For this purpose, the Hamming-based decoding was replaced by a decoding which employs a suitable loss function.
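Both decoding variants can be sketched in Python. The 4×7 code matrix below is an illustrative choice whose rows differ pairwise in four positions, and the output and margin values are made up. For the loss-based variant, the sketch assumes that the loss is applied to the product of code entry and margin value and summed over the classifiers, with the hinge loss as one possible choice.

```python
from typing import Callable, Sequence

# Illustrative code matrix: 4 classes (rows), 7 binary classifiers (columns).
CODE_MATRIX = [
    [+1, +1, +1, +1, +1, +1, +1],
    [-1, -1, -1, -1, +1, +1, +1],
    [-1, -1, +1, +1, -1, -1, +1],
    [-1, +1, -1, +1, -1, +1, -1],
]


def hamming_distance(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of positions in which two code vectors differ."""
    return sum(1 for ai, bi in zip(a, b) if ai != bi)


def decode_hamming(code_matrix: Sequence[Sequence[int]],
                   outputs: Sequence[int]) -> int:
    """Eq. (4.29): index of the row closest to the binary output vector."""
    distances = [hamming_distance(row, outputs) for row in code_matrix]
    return min(range(len(distances)), key=distances.__getitem__)


def hinge_loss(z: float) -> float:
    """Example loss function (an assumption of this sketch)."""
    return max(0.0, 1.0 - z)


def decode_loss_based(code_matrix: Sequence[Sequence[int]],
                      margins: Sequence[float],
                      loss: Callable[[float], float] = hinge_loss) -> int:
    """Index of the row with the smallest total loss over all classifiers."""
    totals = [sum(loss(d * f) for d, f in zip(row, margins))
              for row in code_matrix]
    return min(range(len(totals)), key=totals.__getitem__)


# Code vector of class 2 with the first bit flipped by a classifier error.
outputs = [+1, -1, +1, +1, -1, -1, +1]
# Made-up margin values with the same signs as the binary outputs above.
margins = [0.3, -1.2, 0.8, 1.1, -0.9, -0.4, 0.7]

print(decode_hamming(CODE_MATRIX, outputs))
print(decode_loss_based(CODE_MATRIX, margins))
```

Because the rows of this matrix differ in four positions, a single wrong binary response still leaves the correct row as the unique nearest code vector, which is the error-correcting property the framework is named after.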
In [Hsu and Lin, 2002], the different decomposition schemes for multi-class classification using binary SVMs are compared on standard benchmark data sets. Even though a significantly larger number of binary SVMs has to be trained for the one-versus-one and DAGSVM schemes, both the training time and the evaluation time for unseen examples can be shorter than for the one-versus-all scheme. The computational complexity of SVM training scales roughly quadratically to cubically with the number of training examples [Schölkopf et al., 1999b]. Thus, solving a larger number of smaller quadratic programmes can be computationally less expensive than solving a smaller number of larger quadratic programmes. The computational expense of classifying an unseen example is dominated by the number of kernel evaluations, which depends on the number of support vectors. In the case of nonlinear kernel functions like the Gaussian kernel, the number of support vectors and thus the number of kernel evaluations needed for classifying an example typically increases with the number of training examples. For the multi-class data sets considered by [Hsu and Lin, 2002], the total number of support vectors of the one-versus-one solution was smaller than for the one-versus-all solution, resulting in a longer evaluation time for the latter scheme. In terms of accuracy, the experiments indicated comparable performance of the one-versus-one and one-versus-all schemes with a Gaussian kernel, but a slightly superior performance of the one-versus-one scheme if the problem is solved by a set of SVMs with a linear kernel.
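The training-time argument can be made concrete with a rough back-of-the-envelope calculation. It rests on simplifying assumptions that are not taken from the cited experiments: n training examples spread evenly over the nΩ classes, and a cost of c·m^p for training one binary SVM on m examples, with an exponent p between 2 and 3. The symbols c, p, C_OvA and C_OvO are introduced here only for illustration.

```latex
\[
C_{\mathrm{OvA}} = n_\Omega \, c \, n^{p},
\qquad
C_{\mathrm{OvO}} = \frac{n_\Omega (n_\Omega - 1)}{2} \, c \left( \frac{2n}{n_\Omega} \right)^{p},
\qquad
\frac{C_{\mathrm{OvO}}}{C_{\mathrm{OvA}}} = \frac{n_\Omega - 1}{2} \left( \frac{2}{n_\Omega} \right)^{p}.
\]
```

For nΩ = 10 classes, this ratio is 0.18 for p = 2 and 0.036 for p = 3. Under these assumptions, training the 45 pairwise classifiers would therefore be several times cheaper than training the 10 one-versus-all classifiers.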