Training - Related Work - Efficient Pairwise Multilabel Classification

6.5 Related Work

7.1.1 Training

More formally, the algorithm works as follows. We assume at this point that a hierarchy HonLand metalabelsM₌2Lis already given.H_⊂M2_∪M_×Lis defined as a partial order that spans a rooted directed tree with a compositional semantic (cf. Section ), i.e. µu_Hµv⇔(µu,µv)∈H⇔µv ⊂µu [ µu_Hµv µv=µu [ µu,µw_Hµv µu∩µw =; (7.1) We denote the metalabel at root node1byµ₁₌Land the children metalabels at nodes 1.i,1 ≤ i ≤ k0 ≤ k by µ_1._i, where µ₁ _H µ_1._i, and so forth until the original labels are reached at the leaves,λ_i ₌µ₁_···. The ramificationk0is bounded by the maximal number k of (meta)labels that can be contained in one metalabel. Each inner node 1, 2, 3· · · is associated to a multilabel classifierh₁,h₂,h₃· · ·.

A training example₍x,P₎arrives at first at the root node, where it is used to train the classifier h₁ : X → {µ1.1,· · ·,µ1.k}. x serves as positive example for metalabels P1 :=

{µ_1._i P_∩µ_1._i ₆₌_;}. The training example is then passed to all children nodes1.i with positive metalabelsµ_1.i∈P1.

The classifiers at the inner nodes are trained similarly. Formally, the training set for a classifierh_z :X →{µz.1,· · ·,µz.k}⊂2µz at nodez=1..i is given as follows:

T_rainz :₌_〈₍x,Pz₎P_∩µ_z₆₌_;〉 Pz:₌{µ_z_._iP_∩µ_z_._i₆=;} (7.2) Hence, at each nodez the data is filtered so that only examples pass that are annotated with at least on of its own labelµ_z

An example for a hierarchy of eight original labels is illustrated in Figure . Imagine an example xwith P ={λ₁,λ₂,λ₇}. Forh₁ the example would be positive forµ_1.1 and

µ1.2. Hence, h1.1 would see the example as P1.1 ={µ1.1.1,µ1.1.2}andh1.3 as positive for µ1.3.2=λ7. The classifierh1.2would not receive it.

h1:X →{λ_{| {z }}1,λ2 µ1.1 },{λ_{| {z }}3,λ4,λ5 µ1.2 },{λ_{| {z }}6,λ7,λ8 µ1.3 } µ1.1 µ1.2 µ1.3 h_1.1:X → λ1 { µ1.1.1 ,λ2 { µ1.1.2 µ1.1.1 µ1.1.2 h_1.2:X →λ₃ { µ1.2.1 ,λ4 { µ1.2.2 ,λ5 { µ1.2.3 µ1.2.1 µ1.2.2 µ1.2.3 h_1.3:X →λ₆ { µ1.3.1 ,λ7 { µ1.3.2 , λ8 { µ1.3.3 µ1.3.1 µ1.3.2 µ1.3.3 λ1 λ2 λ3 λ4 λ5 λ6 λ7 λ8

Figure 7.1:A sample hierarchy of multilabel classifiers. An example is forwarded to a child node if

the metalabelµzat the edge coincides withP, i.e.µz∩P 6₌;. Similarly, an edgeµa_._zis followed during prediction ifµ_a_._z∈Pˆa,Pˆa₌h_a₍x₎(example based on ).

7.1.2 Predicting

The prediction works similar. A test examplexis classified byh₁ and only the predicted nodesz withµz ∈h1(x)are visited. The remaining levels are stepped trough recursively

until possibly predicting the original labels at the leaves. Formally, the following labelset is predicted ˆ P₌ λi ^ µz_Hλi λi ∈hz(x) (7.3)

Returning to our example above, assuming a consistent base learner, the example would be predicted asµ_1.1 and µ_1.3 byh₁. Hence, only h_1.1 and h_1.3 have to be further consulted, since neitherλ₆,λ₇ orλ₇ are inh₁(x). However, it holds thatλ₁,λ₂∈h_1.2(x)

andλ₇ ∈h_1.3₍x₎and so these would be the predicted classes Pˆ. Note that it could hap- pen that the root h₁ predicts µ_1.1, but that h_1.1 returns the empty set. In this case, the evaluation stops at the inner node and no label from this sub-tree is predicted.

7.1.3 Hierarchy Construction

Before the actual training phase starts, HOMER has to create the hierarchy treeH. This is done recursively in a top-down depth-first fashion starting with the root. At each nodeµ_z, k child nodes are first created using a clustering algorithm (see below). In case|µ_z|<k, the number of children is set tok0₌|µ_z|.

The main issue in the former process is how to distribute the labels of µz to the k

children. We argue that labels should be evenly distributed to k subsets in a way such that labels belonging to the same subset are assimilar as possible. Such a task can be

thought of as clustering with the additional constrain of equal cluster size. It has been considered in the past in the literature, under the name balanced clustering (

). In the work of ( ), a new balanced clustering algorithm named balanced k-means has been proposed for HOMER, which guarantees that the clusters will be of exactly the same size. The base is a recursivek-means on the label columns of the output matrix₍y_i₎_i,1≤i≤m₌|T_rain_|.

We will denote this approach withBfor balanced clustering. The other two approaches considered are random and even distribution (R) of the labels to the children nodes and clustering (C) using the expectation minimization algorithm (as implemented in WEKA,

The justification for preferring similarity-based distribution is that if similar labels of a nodeµ_z are placed in the same subset, then only a few (ideally just one) metalabels

µz.1,µz.2. . . will be predicted and thus the remaining sub-trees will not be activated.

This will lead to reduced costs during the operation and testing of HOMER. Another expected benefit is that each child node will probably contain less training examples. The complexity of HOMER will be discussed in more detail in Section . The justification for preferring an even distribution is that the multilabel classifiers at each node will deal with a more balanced distribution of positive examples for each metalabel. This is expected to lead to improved predictive performance.

We could consider HOMER as the combination of two components: 1) an algorithm that constructs a hierarchy on top of the labels of a multilabel dataset (cf. references on hierarchy extraction in Section ), and 2) a generalization of the well-known Pachinko-machine hierarchical classification algorithm ( ) to the multilabel case.

In the work of ( ), HOMER, when using the well-known binary relevance classifier as the base multilabel classifier in each internal node, has proven to outperform BR in terms of quality of prediction and, especially, classification time. In this study, we particularly compare to the calibrated pairwise label ranking approach (cf. Section ) as multilabel classifiers at the nodes of HOMER.

We will use the following abbreviations throughout this chapter in order to indicate the different possible combinations.H+BRwill denote the combination of HOMER with binary relevance as decomposition strategy for the multilabel subproblems. Correspond- ingly,H+CLRor H+QCLR, respectively, will denote the combination with the (QVoting) calibration label ranking approach. Note that H+CLR and H+QCLR use the same model, hence we will preferably use H+CLR for indicating the training and H+QCLR for the prediction phase.

7.2 Computational Complexity

This section extends the previous analysis by ( ) particularly re- garding the combination of HOMER with the decompositive approaches. We refer to this original analysis for the detailed costs of the clustering approach.

For the sake of simplification of the analysis, we will assume that we have a perfect k-ary tree of N +1 levels and hence k·k· · · · = kN+1 ₌ _n _{leafs. Such a tree has} _ki−1

nodes at thei-th level, the root being the first level, thus in totalPN_i₌₁ki−1 = _kk₋N₁ = n_k−₋1₁

inner nodes. Any other tree withN levels and maximal ramification factork will at most require the costs of using the perfect tree. Further, we will need the averagedof the label cardinalities|P|and refer to previous analysis of BR, LPC and (Q)CLR in Sections and .

Estimations are given in terms of (binary) classifiers for the memory requirements, number of total examples for training and number of (binary) classifier evaluations for prediction. Remind that the actual bytes and seconds, and especially the ratios given, may deviate from the estimations given here particularly if the binary base classifier used has not a linear behavior in number of training examples for training and is not constant for memory and testing time (cf. Section ). Particularly, the following analysis states training (and testing) costs on per example basis and not in terms of the training size. A summary of complexities and relationships between the approaches are given in Landau bigOnotation in Table .

7.2.1 Memory

The number of inner nodes already determine the required number of multilabel classifiers. For BR this amount has to be multiplied withk in order to obtain the number of binary classifierskn_k−₋1₁ = O₍n₎that have to be maintained in memory. This is less then for pairwise decomposition, which will require additional k(k−1)/2_kn−₋1₁ = k(n−1)/2

binary classifiers.

If we compare the requirements to those of the bases BR and CLR, we see a twofold result. For a fixedk, the comparison between H+BR and BR results in a ratio of k_n(n−1)

(k−1) in

favor of BR, but which becomes only nearlyk/₍k−1₎for increasingn. For the additional pairwise classifiers the ratio is k_n(n−1)

(n−1) = k/n, which is a clear and important reduction.

The ratio between conventional CLR and BR is reduced from₍n₊1₎/2to k(n−1) k₋1 / k(k+1)(n−1) 2(k−1) = 2 k+1 (7.4)

for the combination with HOMER. Approximately the same relationship to normal BR applies.

7.2.2 Training

During training the root processes all training examples. A training instance (with a non- empty labelset) is passed to at least one children and at most to min(|P_|,k) children, depending on the clustering algorithm used and fortuity. At the subsequent levels, an example will be reached over to at most min₍|P|,ki₎ children nodes on the ₍i₊1₎-th

Table 7.1:Complexity comparison of BR and CLR and their combinations with HOMER. The row captions indicate the base decompositions BR and CLR and their asymptotic complexities, whereas the column caption denote the respective combinations with HOMER and their costs. In the cells we find the ratio between the HOMER variants and the base decompositive approaches with re- spect to the column and row captions. The costs are given in terms of the metering indicated in the sub-tables caption.

(a) total number of binary base classifiers H+BR H+CLR

O₍n₎ O₍kn₎ BR O₍n₎ O₍1₎ O₍k₎ CLR O₍n2₎ O₍n₎ O₍k/n₎

(b) preferences learned per training example

H+BR H+CLR O₍kdlog_kn₎ O₍kdlog_kn₎ BR O₍n₎ O₍dlogn n ) O( dlogn n ) CLR O₍dn₎ O₍log_nn₎ O₍log_nn₎ (c) binary base classifier evaluations for one prediction

H+BR H+QCLR O₍kdlog_kn₎ O₍kdlogn₎ BR O₍n₎ O₍dlog_n n₎ O₍kdlog_n n₎ QCLR O₍dnlogn₎ O₍_n k logk) O( k n)

level. We can hence assume that an example will be used≤ |P|times at each level except the last one (where there are no classifiers to train).

The multilabel classifiers at the inner nodes of a HOMER ensemble will hence process one single training examples maximally1₊|P|₍N −2_{) =}O₍dlog_kn₎times. The lower bound is also determined by the depth of the tree, since at least N nodes have to be visited (assuming|P| ≥1).

For the combination of HOMER and the binary relevance decomposition, each example at each affected BR classifier is further replicated for each of thekbinary base classifiers, resulting inO₍kdlog_kn₎instances processed at the lowest classifier level. The same paths trough the tree means to use each example|P|₍k− |P|₎/2_{+ (}N−1₎₍k−1₎|P|<|P|k₊ N k_|P_| ₌ O₍kdlog_kn₎ times in the pairwise base classifiers, which is even slightly less than for BR in many cases, albeit we have to add the costs of the BR classifiers itself to CLR. However, this is theoretically not the worst possible strategy for CLR. Certainly, training an individual CLR classifier becomes most expensive when an instance is mapped tok/2children meta-labels, which results ink2/4preferences learned. However, after at mostlog_k_/2|P|levels the label cardinality becomes one. It is hence not obvious how to determine the worst case for the combination with CLR.

For both, in the best case the _|P_|relevant labels are divided at only one inner node at level N (assuming |P| ≤ k). Hence both complexities are bounded from below by

kN ₌ klog_kn (BR) and ₍k −1₎₍N −1_{) +}|P|₍k − |P|₎/2 < k₍log_kn₊|P|₎ (pairwise) preferences learned.

Comparing H+BR to BR, we achieve a ratio of at worstO₍kdlog_k₍n₎/n_{) =}O₍kd_n_loglog_dn₎ in the number of training examples. Assuming a fixed k, the ratio rapidly decreases asymptotically to dlog₍n₎/n with increasing n. The same applies to the comparison of H+CLR and BR. Consequently, the relationship between the pairwise preferences learned by H+CLR and CLR decreases even faster, sinceO₍kdlog_k₍n₎/₍dn_{)) =}O₍log₍n₎/n₎.

Certainly, it seems recommendable to analyze theaveragecase, where we assume that the problems at the nodes have similar characteristics as the original multilabel problem. Basically, this corresponds to analyzing the randomized clustering strategy. We can expect that more sophisticated strategies will ideally perform better and at worst asymptotically approach the average case. Unfortunately, the analysis of this case is not as trivial as it may appear at the first sight, since it requires an elaborated probabilistic and stochastic modeling. Therefore, and because of the generally coarse estimation this would mean for clusterings others than randomized, we refer to the in our eyes in this case more practical empirical comparison of the costs in Section and leave the formal analysis for further research.

7.2.3 Predicting

As was explained in Section , the prediction phase is very similar to the training phase. Assuming that the base multilabel classifiers predict roughly the same number of labels than in the training data, the same amount of inner nodes are visited during testing an instance. For the combination with BR this means absolutely the same estimations as for training.

For the pairwise classifiers however, the costs are substantially reduced due to the de- creased sizek of the subproblems. The ratio between H+CLR’s and CLR’s base classifier evaluations amounts toO₍k2_·dlog_k₍n₎/n2_{) =}O₍dlog₍n₎/n2₎for a fixedk. Our estimation of the training time consisted in assuming a maximal ramification of|P|at the root and hence single-label assignments in the consecutive|P|paths to the leaves. Under these circumstances, we estimateO₍klogk_·dlog_kn₎evaluations performed by using QVoting in HOMER compared to the well knownO₍dnlogn₎. The ratio between H+QCLR and QCLR is hence increased to now O₍k/n₎. The comparison to BR is slightly worse but similar, since H+QCLR requires O₍kdlog₍n₎/n₎times the base classifier evaluations of BR. H+QCLR should only take an order of O₍logk₎times more computations than its counterpart H+BR.

7.3 Evaluation

In this section, after the presentation of the experimental setup, we will discuss the effect of the several parameters of HOMER and then compare it in terms of training time, classification time and predictive performance against its base multilabel classifiers.

Table 7.2:Name, number of examples used for training and testing, number of features and labels, label cardinality and density, and number of distinct labelsets for each dataset used in the experiments.

examples distinct

name train test features labels cardinality density labelsets

hifind 16452 16519 98 632 37.304 6.0% 32734

eccv2002 42379 4686 36 374 3.525 0.9% 3175

jmlr2003 48859 16503 46 153 3.071 2.0% 3115

mediamill 30993 12914 120 101 4.376 4.3% 6555

7.3.1 Setup

We conducted experiments on four large multilabel datasets with at least 100 labels and 10000 training examples. The first one,hifind, contains 32769 music titles annotated on average with 37 from 632 different labels. The two next datasetseccv2002 andjmlr2003

are from the image classification domain, whereas the last one deals with video material. All datasets were already described in more detail in Section . Table summarizes the key characteristics of the datasets.

The experiments were conducted using the Mulan library of algorithms for multilabel learning ( ). As base classifier we used the decision tree learner J48 with standard settings, which is an implementation of C4.5 included in the WEKA

framework ( ).

Note that only BR is able to predict label rankings in the basic versions, so we can only evaluate the bipartition quality. The effectiveness of all algorithms is evaluated with (label and example-based) micro-averaged recall, precision and F1 (cf. Section ). We also evaluate the efficiency of all algorithms based on their run time (for training and classification).

In document Efficient Pairwise Multilabel Classification (Page 170-176)