Algorithm Adaptation - Regularized Learning for Classiﬁcation

2. Regularized Learning for Classiﬁcation

3.3 Algorithm Adaptation

Algorithm adaptation directly modifies popular single-label classification algorithms to solve the multilabel classification problems. We will present the algorithms that are modified from ensemble methods and logistic regression. There also exists many other algorithms in the algorithm adaptation category, for example, the method based on label ranking (Cram- mer et al., 2003), and the method based on neural network (Zhang and Zhou, 2006). These methods are not explained in detail due to the diver- gence from the main scope of this thesis.

3.3.1 Ensemble Methods for Flat Multilabel Classiﬁcation

Ensemble methods have been initially developed for single-label classification (Breiman, 1996a; Freund and Schapire, 1997) or regression (Breiman, 1996a), as it is straightforward to combine multiple scalar output vari- ables. However, it is not immediately clear how to combine vector valued outputs in flat multilabel classification.

ADABOOST.MH(Schapire and Singer, 1999; Esuli et al., 2008) is a multilabel variant of theADABOOSTalgorithm. The core idea ofADABOOST.MH

is to apply the hamming loss instead of the0/1loss (2.2). In particular, the algorithm reduces a multilabel classiﬁcation problem into a single-label classiﬁcation problem by replacing each training example(xi,yi)withk examples {(xi,yi[l])}kl=1. The algorithm is described in Algorithm 2. In particular, it maintains a distribution over all examples and labels. In each iteration, the algorithm takes in the distribution over all training examples, generates a weak learnerft₍_x₎_{(line 3), computes the hamming} loss (line 5), computes the adaptive parameterαt_{(line 6), and update the} distribution (line 8). The predictionH(x)is a weighted combination of the base learnersft₍_x₎_{weighted by the adapter parameters}_αt_.

Algorithm 2ADABOOST.MH

Input: Training sampleS = {(xi,yi)}mi=1, learning functionW, number of weak learnersT

Output: Boosting ensembleH(x)

1: InitializeD1(i, l) =_mk1 ,∀i∈ {1,· · ·, m}, l∈ {1,· · ·, k} 2: fort= 1· · ·T do 3: ft₍_x₎_{← W}₍_S_{, D}t₎ 4: yˆi=ft(xi),∀i∈ {1,· · ·, m} 5: t₌k l=1 m i=1Dt(i, l)1{yi[l]=ˆyi[l]} 6: αt=1₂ln 1−tt 7: Z=m_i₌₁k_l₌₁Dt(i, l) exp(−αtyi[l]ˆyi[l]) 8: Dt+1₍_{i, l}_{) =} 1 ZD t₍_{i, l}_{) exp(}₋_αt_y i[l]ˆyi[l]),∀i,∀l 9: end for 10: return H(x) =sign(T_t₌₁αt_ft₍_x₎₎

Besides ADABOOST.MH, some other ensemble methods for multilabel classiﬁcation have also been developed that are based on boosting or bag- ging (Wang et al., 2007; Yan et al., 2007; Kocev et al., 2013). In addition, there is a large body of work which aim to apply ensemble methods to solve the real world multilabel classiﬁcation problems, for example, nat- ural language processing (Collins and Koo, 2005; Zeman and Žabokrtský, 2005; Sagae and Lavie, 2006; Zhang et al., 2009), and text and speech recognition (Fiscus, 1997; Mohri et al., 2008; Petrov, 2010).

3.3.2 Correlated Logistic Regression (CORRLOG)

Correlated logistic regression (CORRLOG) is a model based approach for flat multilabel classification (Bian et al., 2012). CORRLOGis a major step forward ofIBLRby constructing a logistic regression classifier over all mi-

Multilabel Classiﬁcation

crolabels and by modeling the pairwise correlation of labels with a function deﬁned on the microlabel pairs.

In fact,CORRLOGis derived from independent logistic regressions (ILRS). Given a pair of an arbitrary training example and a label(x,y), we can construct a set ofILRSclassiﬁers, one for each microlabel. The posterior probability can be computed by

PILRS(y|x) = k j=1 PLR(y[j]|x) = k j=1 exp(y[j]wϕϕϕ(x)) exp(wϕϕϕ(x)) + exp(−wϕϕϕ(x)), (3.1)

wherej is the index that iterates over microlabels anddenotes matrix transpose. The bias term as that appears in Definition 1 is omitted which is equivalent to augmentingxwith a constant term (Bian et al., 2012). Otherwise, (3.1) can be derived into the same form of Definition 1 by re- placingw with w₂. ILRShas the problem of ignoring the correlation between labels and overfitting the training data when the number of microlabels is large. To alleviate the problems,CORRLOGaugments the posterior probability (3.1) by a functionQ(y)defined on the pairs of microlabels as Q(y) = exp ⎧ ⎨ ⎩ k<j αk,jy[k]y[j] ⎫ ⎬ ⎭. (3.2)

Putting together (3.1) and (3.2),CORRLOGcan be deﬁned as

PCORRLOG(y|x)∝PILRS(y|x)Q(y) = exp ⎧ ⎨ ⎩ m j=1 y[j]w, ϕϕϕ(x)+ k<j αkjy[k]y[j] ⎫ ⎬ ⎭.

Thus,CORRLOGexamines the pairwise label correlations by augmenting the joint prediction with a quadratic term Q(y) built from the pairs of microlabels.

3.3.3 Multitask Feature Learning (MTL)

Multitask feature learning (MTL) (Argyriou et al., 2007) is another algorithm designed for ﬂat multilabel classiﬁcation. MTL is quite different from the algorithms discussed in the previous part of the section. Specif- ically,MTL is based on the assumption that different microlabels are re- lated such that they share a common underlying feature representation. Similar assumptions are also made in other models (Caruana, 1997; Bax- ter, 2000; Ben-David and Schuller, 2003).

Letft(x)denote a label speciﬁc function for thet’th microlabel.ft(x)can be expressed as ft(x) =at, h(x)= d i=1 at[i]h(x)[i],

whereat∈Rdis the feature weight parameter for thet’th microlabel.h(x) is a linear feature mapping function deﬁned as

h(x) =U, ϕϕϕ(x),

whereϕϕϕ(x) ∈ Rd _{is the input feature map in the original feature space} andU ∈Rd×d _{is a square matrix. We further use}_A_{to denote the matrix} composed byat. MTL assumes that microlabels share a small set of features in whichAis assumed to be sparse with many entries being zero. The optimization problem ofMTLis deﬁned as

Deﬁnition 10. MTLOptimization Problem in Primal min U∈Rd×d A∈Rd×T _T t=1 m i=1 (yi,t,at,U, ϕϕϕ(xi)) +C||A||2₂,1 ,

where C is a positive parameter that controls the balance of the regularization term and the risk minimization term. The optimization problem is an instantiation of the regularized risk minimization (2.7) with the hamming loss and theL₂,1-norm regularization. As Deﬁnition 10 is non-convex and the second term is non-smooth, the optimization is trans- formed into an equivalent form which is solved by an alternative optimization approach (Argyriou et al., 2007).

Argyriou et al. (2008a) developed an extension of MTLthat introduces a nonlinear generalization using kernel methods. In addition, Argyriou et al. (2008b); Jacob et al. (2009) have developed several similar but not identical algorithms based on the assumption that microlabels form clusters such that label speciﬁc weight vectors should be similar within the clusters. Recently, Romera-Paredes et al. (2012) proposed a method that exploits the information between unrelated microlabels based on a similar assumption that the microlabels of different groups tend not to share any features.

In document Multilabel Classification through Structured Output Learning - Methods and Applications (Page 42-46)