2. Regularized Learning for Classification
3.3 Algorithm Adaptation
Algorithm adaptation directly modifies popular single-label classification algorithms to solve the multilabel classification problems. We will present the algorithms that are modified from ensemble methods and logistic re- gression. There also exists many other algorithms in the algorithm adap- tation category, for example, the method based on label ranking (Cram- mer et al., 2003), and the method based on neural network (Zhang and Zhou, 2006). These methods are not explained in detail due to the diver- gence from the main scope of this thesis.
3.3.1 Ensemble Methods for Flat Multilabel Classification
Ensemble methods have been initially developed for single-label classifi- cation (Breiman, 1996a; Freund and Schapire, 1997) or regression (Breiman, 1996a), as it is straightforward to combine multiple scalar output vari- ables. However, it is not immediately clear how to combine vector valued outputs in flat multilabel classification.
ADABOOST.MH(Schapire and Singer, 1999; Esuli et al., 2008) is a multi- label variant of theADABOOSTalgorithm. The core idea ofADABOOST.MH
is to apply the hamming loss instead of the0/1loss (2.2). In particular, the algorithm reduces a multilabel classification problem into a single-label classification problem by replacing each training example(xi,yi)withk examples {(xi,yi[l])}kl=1. The algorithm is described in Algorithm 2. In particular, it maintains a distribution over all examples and labels. In each iteration, the algorithm takes in the distribution over all training examples, generates a weak learnerft(x)(line 3), computes the hamming loss (line 5), computes the adaptive parameterαt(line 6), and update the distribution (line 8). The predictionH(x)is a weighted combination of the base learnersft(x)weighted by the adapter parametersαt.
Algorithm 2ADABOOST.MH
Input: Training sampleS = {(xi,yi)}mi=1, learning functionW, number of weak learnersT
Output: Boosting ensembleH(x)
1: InitializeD1(i, l) =mk1 ,∀i∈ {1,· · ·, m}, l∈ {1,· · ·, k} 2: fort= 1· · ·T do 3: ft(x)← W(S, Dt) 4: yˆi=ft(xi),∀i∈ {1,· · ·, m} 5: t=k l=1 m i=1Dt(i, l)1{yi[l]=ˆyi[l]} 6: αt=12ln 1−tt 7: Z=mi=1kl=1Dt(i, l) exp(−αtyi[l]ˆyi[l]) 8: Dt+1(i, l) = 1 ZD t(i, l) exp(−αty i[l]ˆyi[l]),∀i,∀l 9: end for 10: return H(x) =sign(Tt=1αtft(x))
Besides ADABOOST.MH, some other ensemble methods for multilabel classification have also been developed that are based on boosting or bag- ging (Wang et al., 2007; Yan et al., 2007; Kocev et al., 2013). In addition, there is a large body of work which aim to apply ensemble methods to solve the real world multilabel classification problems, for example, nat- ural language processing (Collins and Koo, 2005; Zeman and Žabokrtský, 2005; Sagae and Lavie, 2006; Zhang et al., 2009), and text and speech recognition (Fiscus, 1997; Mohri et al., 2008; Petrov, 2010).
3.3.2 Correlated Logistic Regression (CORRLOG)
Correlated logistic regression (CORRLOG) is a model based approach for flat multilabel classification (Bian et al., 2012). CORRLOGis a major step forward ofIBLRby constructing a logistic regression classifier over all mi-
Multilabel Classification
crolabels and by modeling the pairwise correlation of labels with a func- tion defined on the microlabel pairs.
In fact,CORRLOGis derived from independent logistic regressions (ILRS). Given a pair of an arbitrary training example and a label(x,y), we can construct a set ofILRSclassifiers, one for each microlabel. The posterior probability can be computed by
PILRS(y|x) = k j=1 PLR(y[j]|x) = k j=1 exp(y[j]wϕϕϕ(x)) exp(wϕϕϕ(x)) + exp(−wϕϕϕ(x)), (3.1)
wherej is the index that iterates over microlabels anddenotes matrix transpose. The bias term as that appears in Definition 1 is omitted which is equivalent to augmentingxwith a constant term (Bian et al., 2012). Otherwise, (3.1) can be derived into the same form of Definition 1 by re- placingw with w2. ILRShas the problem of ignoring the correlation be- tween labels and overfitting the training data when the number of micro- labels is large. To alleviate the problems,CORRLOGaugments the poste- rior probability (3.1) by a functionQ(y)defined on the pairs of microlabels as Q(y) = exp ⎧ ⎨ ⎩ k<j αk,jy[k]y[j] ⎫ ⎬ ⎭. (3.2)
Putting together (3.1) and (3.2),CORRLOGcan be defined as
PCORRLOG(y|x)∝PILRS(y|x)Q(y) = exp ⎧ ⎨ ⎩ m j=1 y[j]w, ϕϕϕ(x)+ k<j αkjy[k]y[j] ⎫ ⎬ ⎭.
Thus,CORRLOGexamines the pairwise label correlations by augmenting the joint prediction with a quadratic term Q(y) built from the pairs of microlabels.
3.3.3 Multitask Feature Learning (MTL)
Multitask feature learning (MTL) (Argyriou et al., 2007) is another algo- rithm designed for flat multilabel classification. MTL is quite different from the algorithms discussed in the previous part of the section. Specif- ically,MTL is based on the assumption that different microlabels are re- lated such that they share a common underlying feature representation. Similar assumptions are also made in other models (Caruana, 1997; Bax- ter, 2000; Ben-David and Schuller, 2003).
Letft(x)denote a label specific function for thet’th microlabel.ft(x)can be expressed as ft(x) =at, h(x)= d i=1 at[i]h(x)[i],
whereat∈Rdis the feature weight parameter for thet’th microlabel.h(x) is a linear feature mapping function defined as
h(x) =U, ϕϕϕ(x),
whereϕϕϕ(x) ∈ Rd is the input feature map in the original feature space andU ∈Rd×d is a square matrix. We further useAto denote the matrix composed byat. MTL assumes that microlabels share a small set of fea- tures in whichAis assumed to be sparse with many entries being zero. The optimization problem ofMTLis defined as
Definition 10. MTLOptimization Problem in Primal min U∈Rd×d A∈Rd×T T t=1 m i=1 (yi,t,at,U, ϕϕϕ(xi)) +C||A||22,1 ,
where C is a positive parameter that controls the balance of the regu- larization term and the risk minimization term. The optimization prob- lem is an instantiation of the regularized risk minimization (2.7) with the hamming loss and theL2,1-norm regularization. As Definition 10 is non-convex and the second term is non-smooth, the optimization is trans- formed into an equivalent form which is solved by an alternative opti- mization approach (Argyriou et al., 2007).
Argyriou et al. (2008a) developed an extension of MTLthat introduces a nonlinear generalization using kernel methods. In addition, Argyriou et al. (2008b); Jacob et al. (2009) have developed several similar but not identical algorithms based on the assumption that microlabels form clus- ters such that label specific weight vectors should be similar within the clusters. Recently, Romera-Paredes et al. (2012) proposed a method that exploits the information between unrelated microlabels based on a simi- lar assumption that the microlabels of different groups tend not to share any features.