
In multi-class classification models, “multi-class” indicates that the number of classes is greater than 2. Typically, these classes are treated equally; in other words, there is no ordering or similarity relationship among them. For example, in a three-class classification model, the class labels may be represented as class 0, class 1 and class 2. This representation may mistakenly suggest that class 2 is closer to class 1 than to class 0. In fact, the three classes are mutually dissimilar and unordered. In the standard setting of multi-class classification, training data consist of examples and corresponding labels (targets), which are given by a teacher (labeler). The goal is to learn a model that can accurately predict the labels of unseen future examples. Formally, given training data $D = \{d_1, d_2, \ldots, d_N\}$, where each $d_i = \langle x_i, y_i \rangle$ is a pair consisting of an input feature vector $x_i$ and a desired categorical output $y_i$ given by a teacher, the objective is to learn a mapping function $f : X \to Y$ such that for a new future example $x_0$, $f(x_0) \approx y_0$. Multi-class classification is also useful in practice: for example, given historical clinical data, one may want to predict which exact disease a (future) patient has.
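The point about unordered labels can be illustrated with a small sketch. One-hot encoding is not discussed in the text; it is used here only as an assumed, illustrative representation in which every pair of distinct classes is equally far apart, whereas the integer codes 0, 1, 2 are not. The helper name `one_hot` is hypothetical.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Map integer class labels to one-hot rows (illustrative encoding, not from the text)."""
    labels = np.asarray(labels)
    encoded = np.zeros((labels.size, num_classes))
    encoded[np.arange(labels.size), labels] = 1.0
    return encoded

# Integer codes suggest that class 2 is farther from class 0 than class 1 is.
int_gap_01 = abs(0 - 1)
int_gap_02 = abs(0 - 2)

# One-hot codes place every pair of distinct classes at the same distance.
codes = one_hot([0, 1, 2], num_classes=3)
dist_01 = np.linalg.norm(codes[0] - codes[1])
dist_02 = np.linalg.norm(codes[0] - codes[2])
dist_12 = np.linalg.norm(codes[1] - codes[2])

print(int_gap_01, int_gap_02)
print(dist_01, dist_02, dist_12)
```

The integer gaps differ, while the three one-hot distances coincide, which matches the assumption that the classes carry no ordering.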

The exact form of the model f : X → Y and the algorithms used to learn it can be extended from the binary classification models in Section 2.1. Some methods for binary classification extend easily. For example, naive Bayes models [Domingos and Pazzani, 1997] and decision trees (classification trees) [Breiman et al., 1984] support multi-class classification without modification, while neural networks [Hastie et al., 2009, Van Der Malsburg, 1986, Rumelhart et al., 1986, Cybenko, 1989] only need a modified output layer. In this section, we describe in more detail the multi-class support vector machine (MCSVM) [Vapnik, 1998, Weston et al., 1999] and the approximate multi-class support vector machine (AMSVM) [He et al., 2012], two popular multi-class extensions of the SVM discussed in Section 2.1.1. Briefly, these two extensions decompose the multi-class classification task into multiple binary classification tasks and apply a binary SVM [Cortes and Vapnik, 1995, Vapnik, 1995] to each task. The kernel trick for binary SVM [Hastie et al., 2009, Joachims, 1998] is also compatible with these two extensions. Some of the new methods presented later in the thesis are based on these methods, so reviewing them should make the following chapters easier to follow.

2.2.1 Multi-class support vector machine (MCSVM)

Our goal is to learn a multi-class classifier $f : X \to Y$, where $X$ is the feature space and $Y = \{1, 2, \ldots, k\}$ is the set of class labels. Hence each labeled data entry $D_i$ consists of two components: $D_i = \langle x_i, y_i \rangle$, an input and a class label.

In the multi-class support vector machine (MCSVM), we learn $k$ binary support vector machines jointly, one for each class. Briefly, MCSVM tries to ensure that, for every training instance, the projection for its assigned class label is higher than the projection for any other class. Therefore, $(k-1)$ constraints are derived for each labeled data instance, one for each class except the assigned class label. The total number of constraints in MCSVM is thus $O(kN)$, where $N$ is the number of labeled data instances. Formally, we would like to obtain $k$ projection mappings $f_1(\cdot), f_2(\cdot), \ldots, f_k(\cdot)$ such that for each data instance $x_i$, the projection $f_{y_i}(x_i)$ is greater than $f_l(x_i)$ for all $l \in \{1, 2, \ldots, k\} \setminus \{y_i\}$. To permit some flexibility, we allow violations of the constraints but penalize them through the loss function. Therefore, the multi-class support vector machine is formulated as follows:

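$$
\begin{aligned}
\min_{W,\,\Xi} \quad & \frac{1}{2} \sum_{l=1}^{k} \|w_l\|_2^2 + C \sum_{i=1}^{N} \sum_{j \neq y_i} \xi_{i,j} \\
\text{s.t.} \quad & (w_{y_i} - w_j)^T \phi(x_i) \geq 1 - \xi_{i,j} \quad \forall i = 1, 2, \ldots, N, \ \forall j \neq y_i \\
& \xi_{i,j} \geq 0 \quad \forall i = 1, 2, \ldots, N, \ \forall j \neq y_i
\end{aligned}
\tag{2.2}
$$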

where $y_i$ is the class label of $x_i$ and $\phi(\cdot)$ is the projection into the kernel space. $W = \{w_1, \ldots, w_k\}$ are the parameters of the $k$ binary one-vs-rest classifiers. $N$ is the number of labeled instances. $\Xi = \{\xi_{i,j}\}$, for $i = 1, \ldots, N$ and $j \neq y_i$, are the slack variables, one for each constraint. For prediction, the class with the highest projection value is selected as the predicted class.
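To make the MCSVM formulation concrete, the following is a minimal NumPy sketch that evaluates the objective for a fixed set of weights and applies the prediction rule. It is not a solver and not the implementation of the cited works. It assumes an explicit feature matrix `Phi` whose rows stand in for $\phi(x_i)$ (no kernel is used), 0-based labels rather than the text's $1, \ldots, k$, and made-up toy values; all function names are hypothetical. For fixed $W$, the smallest slack satisfying both constraints is $\xi_{i,j} = \max(0,\ 1 - (w_{y_i} - w_j)^T \phi(x_i))$, which is what the code computes.

```python
import numpy as np

def mcsvm_slacks(W, Phi, y):
    """Smallest feasible slacks xi[i, j] = max(0, 1 - (w_{y_i} - w_j)^T phi(x_i)).

    W:   (k, d) array, one weight vector per class.
    Phi: (N, d) array, rows play the role of phi(x_i) (assumed explicit features).
    y:   (N,) integer labels in {0, ..., k-1} (0-based, unlike the text).
    """
    idx = np.arange(len(y))
    scores = Phi @ W.T                    # scores[i, l] = w_l^T phi(x_i)
    true_scores = scores[idx, y]          # w_{y_i}^T phi(x_i)
    xi = np.maximum(0.0, 1.0 - (true_scores[:, None] - scores))
    xi[idx, y] = 0.0                      # there is no constraint for j = y_i
    return xi

def mcsvm_objective(W, Phi, y, C=1.0):
    """Regularizer plus C times the sum of all (k-1)*N slacks."""
    return 0.5 * np.sum(W ** 2) + C * mcsvm_slacks(W, Phi, y).sum()

def predict(W, Phi):
    """Select the class with the highest projection value."""
    return np.argmax(Phi @ W.T, axis=1)

# Toy data (made up for illustration): k = 3 classes, d = 2 features, N = 4 instances.
W = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
Phi = np.array([[2.0, 0.0], [0.0, 2.0], [-1.0, -1.0], [0.6, 0.5]])
y = np.array([0, 1, 2, 0])

xi = mcsvm_slacks(W, Phi, y)
num_constraints = len(y) * (W.shape[0] - 1)
print(num_constraints, xi.sum(), mcsvm_objective(W, Phi, y), predict(W, Phi))
```

In this toy example only the last instance violates a margin: its class-0 score exceeds its class-1 score by just 0.1, so the corresponding slack is about 0.9, while all other slacks are zero.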

2.2.2 Approximate multi-class support vector machine (AMSVM)

The approximate multi-class SVM (AMSVM) is an approximation of the standard multi-class SVM (MCSVM) method in Section 2.2.1. In AMSVM, the set of constraints for each data instance is merged and replaced with a single constraint, which requires that the projection for the class label be higher than the average projection over all the other classes. Via such averaging, the number of constraints is significantly reduced: only one constraint is derived for each labeled data instance. Therefore, the total number of constraints in AMSVM is reduced to $O(N)$. Formally, in the AMSVM with $k$ classes, $k$ binary SVMs $f_1(\cdot), f_2(\cdot), \ldots, f_k(\cdot)$ are trained jointly. For every labeled instance $\langle x_i, y_i \rangle$, we try to ensure that the projection $f_{y_i}(x_i)$ of the class label $y_i$ is greater than the average projection $\frac{1}{k-1} \sum_{l \neq y_i} f_l(x_i)$ of all the other classes $l \in \{1, 2, \ldots, k\} \setminus \{y_i\}$. The optimization of AMSVM can be formalized as:

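$$
\begin{aligned}
\min_{W,\,\Xi} \quad & \frac{1}{2} \sum_{l=1}^{k} \|w_l\|_2^2 + C \sum_{i=1}^{N} \xi_i \\
\text{s.t.} \quad & \Big( w_{y_i} - \frac{1}{k-1} \sum_{j \neq y_i} w_j \Big)^T \phi(x_i) \geq 1 - \xi_i \quad \forall i = 1, 2, \ldots, N \\
& \xi_i \geq 0 \quad \forall i = 1, 2, \ldots, N
\end{aligned}
\tag{2.3}
$$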

where $y_i$ is the class label of $x_i$ and $\phi(\cdot)$ is the projection into the kernel space. $W = \{w_1, \ldots, w_k\}$ are the parameters of the $k$ binary one-vs-rest classifiers. $N$ is the number of labeled instances. $\Xi = \{\xi_1, \xi_2, \ldots, \xi_N\}$ are the slack variables, one for each constraint. For prediction, the class with the highest projection value is selected as the predicted class. As shown in [He et al., 2012], the performance of AMSVM is often comparable to that of the standard multi-class SVM (MCSVM).
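The averaged constraint can be sketched in the same way. The code below is a minimal NumPy illustration that evaluates the AMSVM objective for fixed weights; it is not a solver and not the implementation of [He et al., 2012]. It assumes an explicit feature matrix `Phi` in place of $\phi(x_i)$, 0-based labels, and made-up toy values; the function names are hypothetical. For fixed $W$, the smallest feasible slack is $\xi_i = \max\big(0,\ 1 - (w_{y_i} - \frac{1}{k-1}\sum_{j \neq y_i} w_j)^T \phi(x_i)\big)$.

```python
import numpy as np

def amsvm_slacks(W, Phi, y):
    """Smallest feasible slacks, one per instance, for the averaged constraint.

    W:   (k, d) array, one weight vector per class.
    Phi: (N, d) array, rows play the role of phi(x_i) (assumed explicit features).
    y:   (N,) integer labels in {0, ..., k-1} (0-based, unlike the text).
    """
    k = W.shape[0]
    idx = np.arange(len(y))
    scores = Phi @ W.T                                    # scores[i, l] = w_l^T phi(x_i)
    true_scores = scores[idx, y]                          # projection of the class label
    other_mean = (scores.sum(axis=1) - true_scores) / (k - 1)  # average over j != y_i
    return np.maximum(0.0, 1.0 - (true_scores - other_mean))

def amsvm_objective(W, Phi, y, C=1.0):
    """Regularizer plus C times the sum of the N slacks."""
    return 0.5 * np.sum(W ** 2) + C * amsvm_slacks(W, Phi, y).sum()

# Toy data (made up for illustration): k = 3 classes, d = 2 features, N = 4 instances.
W = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
Phi = np.array([[2.0, 0.0], [0.0, 2.0], [-1.0, -1.0], [0.6, 0.5]])
y = np.array([0, 1, 2, 0])

slacks = amsvm_slacks(W, Phi, y)
print(len(slacks), slacks.sum(), amsvm_objective(W, Phi, y))
```

There is one slack per instance, so the toy problem has 4 constraints instead of the 8 pairwise ones. For the last instance, the class-0 score beats class 1 by only 0.1 but beats class 2 by 1.7, so the averaged margin is 0.9 and the slack is about 0.1. This shows how averaging can soften a narrow pairwise margin, which is the sense in which AMSVM approximates MCSVM.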

2.2.3 Summary

In this section, we gave a brief review of two popular multi-class extensions of the support vector machine: the multi-class support vector machine (MCSVM) and the approximate multi-class support vector machine (AMSVM). More details about the theory and analysis of these two extensions can be found in [Vapnik, 1998, Weston et al., 1999] and [He et al., 2012], respectively.