Comparison to Binary Relevance Decomposition


3.5.7 Comparison to Binary Relevance Decomposition

The debate over whether one-against-all (OAA) or the pairwise approach is preferable is very old. It originates from the fact that pairwise decomposition was the first binarization technique to appear besides OAA, before ECOCs and, of course, long before label powerset, which only came up with multilabel classification. As the only direct competitor, pairwise decomposition was compared to OAA by default. In the following, we review some of the main works comparing OAA and the pairwise approach, along with the most debatable points and criticisms.


The first work in the field of machine learning known to the authors that employed pairwise decomposition was by ( ), who continued their effort in some other early works. They observed, for example, that the classes of a digit recognition task were pairwise linearly separable, while the corresponding one-against-all task was not solvable with their linear neural network. Several other works followed, which mainly showed the same picture but were not dedicated to binarization techniques (cf. references in ). In these works, the binarization technique was considered only as an additional tunable configuration setting. The first methodological studies were presented by ( ) and ( ). The first work compared OAA with conventional pairwise decomposition and an alternative aggregation strategy specific to multiclass problems. LPC consistently outperformed OAA. ( ) presented the first formal analysis of pairwise decomposition and showed that, surprisingly, training a quadratic number of base classifiers was more efficient than training the linear number of OAA models.
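One way to make the efficiency claim above concrete for the multiclass case is a simple counting argument. The sketch below is an illustration under stated assumptions, not a reproduction of the cited analysis: the toy class sizes, the function names, and the polynomial cost model `size ** exponent` are all hypothetical. It relies only on the fact that each OAA model is trained on all examples, whereas the pairwise model for classes i and j is trained only on the examples of those two classes.

```python
from itertools import combinations


def oaa_training_examples(class_sizes):
    """Examples seen by all OAA models: each of the c models sees every example."""
    return len(class_sizes) * sum(class_sizes)


def pairwise_training_examples(class_sizes):
    """Examples seen by all pairwise models: model (i, j) sees classes i and j only."""
    return sum(a + b for a, b in combinations(class_sizes, 2))


def training_cost(problem_sizes, exponent):
    """Hypothetical cost model: each subproblem costs size ** exponent operations."""
    return sum(size ** exponent for size in problem_sizes)


# Assumed toy data: 10 balanced classes with 100 examples each.
class_sizes = [100] * 10
n = sum(class_sizes)

oaa_problems = [n] * len(class_sizes)
pairwise_problems = [a + b for a, b in combinations(class_sizes, 2)]

print(oaa_training_examples(class_sizes))       # 10000, i.e. c * n
print(pairwise_training_examples(class_sizes))  # 9000, i.e. (c - 1) * n
print(training_cost(oaa_problems, 2))           # 10000000
print(training_cost(pairwise_problems, 2))      # 1800000
```

With c classes and n examples in total, the pairwise models together see (c - 1) * n training examples, compared with c * n for OAA. Under the assumed cost model, any base learner whose cost grows faster than linearly benefits further from the smaller pairwise subproblems.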

In a direct response to this study, ( ) firmly argued in favor of OAA. Their main claim was that if the base learners, in their case support vector machines, are appropriately tuned, then neither approach should have an advantage. Their intuition is that when SVMs are tuned (C and kernel parameter, see also Section ), mistakes are made only on examples that “simply for all practical purposes look more like a member of an incorrect class”, and that correctly classifying this type of example is very difficult for any decomposition approach.

( ) themselves point out the cases where LPC can be expected to achieve an advantage. Firstly, optimizing SVM parameters is very costly. Indeed, a greedy approach is used in order to find good global parameters, in contrast to the recommended grid search (cf. ) on every subproblem separately. Secondly, the authors expect that using weak or improperly tuned base learners will have an adverse impact on OAA. Their second point is confirmed to a certain extent later in the present thesis, where the fast but simple perceptron algorithm is shown to be a backbone for the efficient pairwise multilabel classification of large data. Very recently, ( ) presented an extensive study comparing OAA and a dynamically ordered version (see ( ) for statically ordered OAA) to nine different pairwise aggregation strategies. The pairwise approaches clearly outperform OAA for six different base learners. However, none of the base learners was tuned and two were certainly weak (k-NN with $k = 1$ and $k = 3$).

Perhaps the most important statement in the work of ( ) is, however, that their main claim is explicitly valid only if the classes are independent, which is certainly true for multiclass data but rarely for multilabel data. As a consequence, many studies in the context of multilabel classification have appeared that compare pairwise decomposition favorably to binary relevance (BR), including most of those cited in Section (BR is, again, almost always used as a baseline) and the studies on pairwise classification brought together and presented in this thesis.

Recently, however, two new works appeared in defense of BR ( , ). The proposed method of classifier chains (CC) fixes a particular order of the BR base classifiers and successively adds the outputs of the preceding classifiers as new features. Hence, the $i$-th classifier $h_i : X \times Y^{i-1} \to Y^{1}$, with $Y = \{0, 1\}$, is trained on the $j = 1 \ldots m$ examples $((x_j, y_{j,1}, \ldots, y_{j,i-1}), y_{j,i})$ and predicts on a test instance $(x, h_1(x), \ldots, h_{i-1}(x, \ldots))$. The objective of such an approach is clear: to tackle the main drawback of BR, namely that it ignores the dependencies between labels (as stated by ( )). The proposed method clearly satisfies this multilabel-specific demand. It is shown by ( ) that CC takes conditional dependencies into account (cf. Section ) and deterministically approximates the optimal Bayes classifier.
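The chaining scheme described above can be sketched in a few lines. This is a minimal illustration, not the implementation used in the cited works: the class names `Perceptron` and `ClassifierChain`, the choice of a plain perceptron as base learner, the number of epochs, and the toy data are all assumptions made for the example. As in the description, classifier i is trained on the features extended by the true preceding labels and, at prediction time, receives the predictions of the preceding classifiers instead. Labels are indexed from 0 in the code.

```python
class Perceptron:
    """Minimal binary perceptron, used only as a stand-in base learner."""

    def __init__(self, n_features, epochs=20):
        self.w = [0.0] * (n_features + 1)  # last entry is the bias
        self.epochs = epochs

    def predict(self, x):
        # zip stops at len(x), so the bias weight is added separately
        score = self.w[-1] + sum(wi * xi for wi, xi in zip(self.w, x))
        return 1 if score > 0 else 0

    def fit(self, X, y):
        for _ in range(self.epochs):
            for x, target in zip(X, y):
                error = target - self.predict(x)
                if error:
                    for k, xk in enumerate(x):
                        self.w[k] += error * xk
                    self.w[-1] += error
        return self


class ClassifierChain:
    """Chain of binary classifiers in a fixed label order."""

    def __init__(self, n_features, n_labels):
        # classifier i sees the original features plus i preceding labels
        self.models = [Perceptron(n_features + i) for i in range(n_labels)]

    def fit(self, X, Y):
        for i, model in enumerate(self.models):
            # training uses the true preceding labels as additional features
            X_ext = [list(x) + list(y[:i]) for x, y in zip(X, Y)]
            model.fit(X_ext, [y[i] for y in Y])
        return self

    def predict(self, x):
        preds = []
        for model in self.models:
            # prediction uses the outputs of the preceding classifiers
            preds.append(model.predict(list(x) + preds))
        return preds


# Assumed toy data: label 0 equals x[0], label 1 equals (x[0] or x[1]).
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
Y = [[0, 0], [0, 1], [1, 1], [1, 1]]

chain = ClassifierChain(n_features=2, n_labels=2).fit(X, Y)
print([chain.predict(x) for x in X])  # [[0, 0], [0, 1], [1, 1], [1, 1]]
```

Removing the `+ preds` and `+ list(y[:i])` extensions turns the sketch back into plain BR. The chaining therefore acts purely through the extended input space.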

It is, however, the firm opinion of the author that CC cannot be put forward as an argument in defense of the BR approach, since it is conceptually not equivalent to BR. The chaining part has to be considered a novel and intelligent method of stacking which is used on top of BR. But stacking itself is not restricted to BR or any other decomposition approach, since it simply relies on extending or replacing the input space (cf. Section ), and it was already used extensively in the context of multilabel learning (cf. Section ). In the eyes of the author, the main contribution of CC is hence this novel, intelligent and very interesting stacking approach. In conclusion, stacking cannot be used as an argument for a systemic or conceptual advantage of BR, just as the stacked variants of pairwise classifiers recently introduced by ( ) cannot be cited in favor of a general superiority of pairwise decomposition.

The persistent argumentation in favor of BR rests on its emphasized scalability properties, since BR (and also CC) scales linearly with the number of classes (cf. Section ). However, as we have clearly seen in the comparison in Section , LPC is comparable to or, under certain circumstances, even faster than BR in training, which is the phase mainly considered by ( ). Surprisingly and for unknown reasons, CLR does not terminate on relatively small text datasets with 22 and 29 classes in their experiments, whereas we were able to apply SVMs similar to theirs on datasets with up to 159 classes without any problems. Moreover, ( ) use an ensemble of CCs for their comparison to other approaches, which further decreases efficiency and scalability by a predetermined factor, usually 10 or 50. An additional note is that with an increasing number of classes (on the order of thousands), the positive effect of the stacking vanishes and ensemble CC even drops below BR for some losses.

In summary, despite the occasional criticism, we see strong arguments in favor of pairwise learning over binary relevance decomposition. Pairwise learning has been shown to dominate BR in a large number of experimental studies on multiclass problems and, more recently, on multilabel problems. Part of this empirical evidence is provided by the studies of the author and reflected in this thesis. However, also in this work we see circumstances where BR-based approaches can be more advisable, particularly when the highest scalability and efficiency are demanded. The MMP algorithm is a good example of this (cf. Chapter ), although, with certain simplifications, pairwise classification is applicable even under these circumstances (cf. Chapter ).
