Multilabel Pairwise Perceptrons - Efficient Pairwise Multilabel Classification

The decomposition of a multilabel problem using perceptrons as base learners is done exactly as described in Section with the difference that Multilabel Pairwise Perceptrons (MLPP) are trained incrementally. This is reflected in the pseudocode in Figure . Instead of iterating over the pairwise training sets Train_{u,v} and training the perceptrons w_{u,v} serially, the procedure in Figure obtains one multilabel training instance x at a time and passes it to the concerned perceptrons, i.e. those belonging to the pairs in P × N. TRAINPERCEPTRON corresponds to the update in Eq. . Note that the pairwise models are symmetric, i.e.

w_{v,u} = −w_{u,v}.
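The incremental procedure described above can be sketched as follows. This is only a minimal illustration under stated assumptions: the perceptron step is taken to be the standard mistake-driven update with learning rate 1 and no threshold (the exact update of the referenced equation is not reproduced here), labels are numbered 0 to n − 1, and the names `train_perceptron` and `train_mlpp_instance` are hypothetical. Only w_{u,v} with u < v is stored; the symmetric model w_{v,u} = −w_{u,v} is handled by flipping the target sign.

```python
from itertools import product


def dot(w, x):
    """Inner product of two dense vectors of equal length."""
    return sum(wi * xi for wi, xi in zip(w, x))


def train_perceptron(w, x, y):
    """Assumed standard perceptron step: update w in place on a mistake.

    y is +1 or -1. Returns True if an update was performed.
    """
    if y * dot(w, x) <= 0:
        for i, xi in enumerate(x):
            w[i] += y * xi
        return True
    return False


def train_mlpp_instance(weights, x, relevant, n_labels):
    """Pass one multilabel instance to the perceptrons of all pairs in P x N."""
    P = set(relevant)
    N = set(range(n_labels)) - P
    updates = 0
    for u, v in product(P, N):
        # Store only w_{u,v} with u < v; since w_{v,u} = -w_{u,v},
        # the reversed pair is trained with the opposite target sign.
        key, y = ((u, v), +1) if u < v else ((v, u), -1)
        w = weights.setdefault(key, [0.0] * len(x))
        updates += train_perceptron(w, x, y)
    return updates
```

Each call touches exactly |P|(n − |P|) perceptrons, which is the quantity that reappears in the complexity analysis.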

If we add the virtual label for calibration, we train the additional perceptrons w_{0,1}, ..., w_{0,n} as described in Section for the binary relevance version of the perceptron ensemble. We call this version CMLPP.

The prediction phase was already described in detail in Section and . (C)MLPP uses simple voting, so the illustrations in Figure and apply.

In our particular implementation we distribute 2 × 1/2 votes, i.e. half a vote to each of the two classes, if a base classifier ties or was not trained (e.g. if two classes are fully correlated). Moreover, we break ties in rankings randomly, but we use v_v ≥ v_0 for deciding whether we predict a class λ_v (cf. Eq. ).
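A minimal sketch of this voting scheme is given below. It rests on several assumptions that are not taken from the text: the virtual label has index 0 and the real labels have indices 1 to n, a perceptron stored under the key (u, v) with u < v votes for u on a positive score and for v on a negative one, and the function names are hypothetical. Random tie-breaking in the ranking is omitted.

```python
def dot(w, x):
    """Inner product of two dense vectors of equal length."""
    return sum(wi * xi for wi, xi in zip(w, x))


def vote(weights, x, labels):
    """Simple voting over all label pairs.

    A tied or untrained (missing) base classifier distributes
    2 x 1/2 votes, i.e. half a vote to each class of the pair.
    """
    votes = {label: 0.0 for label in labels}
    ordered = sorted(labels)
    for i, u in enumerate(ordered):
        for v in ordered[i + 1:]:
            w = weights.get((u, v))
            score = dot(w, x) if w is not None else 0.0
            if score > 0:
                votes[u] += 1.0
            elif score < 0:
                votes[v] += 1.0
            else:
                votes[u] += 0.5
                votes[v] += 0.5
    return votes


def predict_calibrated(weights, x, n_labels):
    """Predict every real label v whose votes satisfy v_v >= v_0."""
    votes = vote(weights, x, range(n_labels + 1))
    return {v for v in range(1, n_labels + 1) if votes[v] >= votes[0]}
```

With two real labels, a perceptron (0, 1) that favours label 1, a perceptron (0, 2) that favours the virtual label, and an untrained pair (1, 2), the votes are 1, 1.5 and 0.5 for the labels 0, 1 and 2, so only label 1 is predicted.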

4.5 Comparison

The next two subsections discuss and compare the perceptron variants of the BR and LPC decompositions and analyze their computational costs. Most of the advantages and disadvantages already discussed in Section also apply to these variants.

4.5.1 Discussion

In particular, we expect MLPP to be more effective due to the reduction of the sizes of the subproblems (cf. Section ). A simple example illustrates this: imagine two points a and b on a line representing the centers of the positive and negative points, respectively. We now insert points according to an arbitrary distribution around a and b. Let µ(m) denote the margin between the negative and positive points depending on the number of inserted points m. This function is inevitably monotonically non-increasing, since an inserted point can only leave the margin unchanged or shrink it.

Thus it is very likely for a subproblem to have a larger margin than the full BR problem. We have seen in Section that the performance strongly depends on the available margin between the points of the binary classes. Indeed, ( ) observed that the classes of a digit recognition task were pairwise linearly separable, while the corresponding one-against-all task was not solvable with perceptrons. Thus, it can be expected that the MLPP algorithm will also benefit from the pairwise approach.
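The one-dimensional example can be simulated with a few lines of code. The sketch below is an illustration only: the centers a = 1 and b = −1, the Gaussian distribution and its spread are arbitrary assumptions, and the margin is taken as the gap between the lowest positive and the highest negative point (a negative value means that the two classes overlap).

```python
import random


def margin_curve(m, a=1.0, b=-1.0, spread=0.3, seed=0):
    """Return mu(1), ..., mu(m) while inserting points around a and b.

    One positive point (around a) and one negative point (around b)
    are inserted per step; the margin is the gap between the classes.
    """
    rng = random.Random(seed)
    lowest_pos, highest_neg = a, b  # start with the two centers only
    curve = []
    for _ in range(m):
        lowest_pos = min(lowest_pos, rng.gauss(a, spread))
        highest_neg = max(highest_neg, rng.gauss(b, spread))
        curve.append(lowest_pos - highest_neg)
    return curve
```

Whatever distribution is plugged in, the returned sequence never increases, which is why a smaller pairwise subproblem can be expected to have a margin at least as large as that of a larger problem drawn from the same distributions.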

Since the MMP algorithm is based on the binary relevance binarization, the pairwise approach can also be expected to be superior to it. After all, the MMP algorithm has the same problem space as the binary relevance method: the perceptrons have to find hyperplanes that separate one class from the others, with the difference that the algorithm can translate the hyperplanes along the normal vector and scale the inner product in order to fit correctly into the ranking.

Since the revival of the perceptron algorithm with the publication of the voted perceptron approach by ( ), this learner has repeatedly shown its efficiency and effectiveness in a wide range of domains (cf. Footnote ). On the popular small Reuters r21578 text benchmark, for example, an adapted version called CLASSI outperformed Ripper, one of the most effective rule learners ( ). An adapted perceptron algorithm referred to as Hieron ( ) was shown to be clearly more efficient and also more accurate than SVMs on a hierarchical, ontology-based information extraction task, which usually has characteristics similar to those of text classification. However, the non-adapted version of the perceptron with uneven margins ( ) was comparable in absolute numbers but slightly inferior. A similar hierarchical approach was able to outperform binary relevance SVMs on the huge Reuters benchmark rcv1 ( ). Eventually, hierarchical SVMs trained in a very similar way as in Chapter dominated. However, the perceptrons were only trained for one epoch, while the SVMs performed the whole optimization process.

Figure 4.4: MLPP ensemble represented as an artificial neural network. The labels at the arrows indicate the weights of the input connections (multiplication); the full circles denote the sign activation functions. The root node and its arrows represent a input nodes and output connections, respectively, one for each feature of x. All other connections are one-dimensional.

Furthermore, one has to always bear in mind that SVMs are not incrementally trainable while perceptrons certainly are. This is a clear advantage especially for large-scale data and for scenarios with real-time learning and classification demands.

It is interesting to note that the MLPP ensemble of perceptrons can be seen as a feed-forward neural network with one hidden layer and fixed connections between the nodes. The pairwise perceptrons correspond to nodes in the middle layer with the indicator threshold function as activation function, and the voting mechanism corresponds to the output nodes at the bottom layer. An illustration is given in Figure . In the same manner, BR and MMP can be represented as fixed artificial neural networks without a hidden layer.

4.5.2 Computational Complexity

The complexities of BR and LPC were already analyzed and compared in detail in Sections and . However, an analysis of BR, MMP and MLPP with perceptrons is particularly interesting for three reasons: Firstly, the costs were given very abstractly in numbers of training examples or using very abstractly estimated complexities. The analytically simple perceptron algorithm, which is common to all considered approaches, allows for a very concrete analysis. Secondly, the presented approaches are incrementally trainable and hence the following analysis can be considered as an extension. Moreover, perceptrons learn in linear time and predict in constant time, so this analysis shows explicitly the case where p = 1 and q = 0. And thirdly, although MMP is based on BR, it has a different training procedure with possibly different runtime constraints, which deserves consideration in its own right.

Table 4.1: Computational complexity of perceptron ensembles, given in expected number of dot products and additions of vectors per instance (training and prediction time) and in number of vectors (memory). The costs are given in terms of the total number of labels n, the current |P|, the average labelset size d, and the losses from Section .

              training time                    prediction time   memory requirement
perceptron    1 + ERR = O(1)                   1                 1
BR            n(1 + HAMLOSS) = O(n)            n                 n
MMP           n + MARGIN + ISERR = O(n)        n                 n
MLPP          |P|(n − |P|)(1 + ERR) = O(dn)    n(n − 1)/2        n(n − 1)/2
MLPP/MMP      |P| = O(d)                       (n − 1)/2         (n − 1)/2

We use the same notation as in previous analyses. In addition, a denotes the number of attributes and a′ the average number of non-zero features (the size of the sparse representation of an instance).

Except for the last part, we will indicate the runtime dependencies in terms of perceptron prediction and update operations, since a scalar product w·x requires nearly the same number of arithmetic additions and multiplications as an update w + τx. However, if the factor τ is zero, there may indeed be a deviation. We ignore operations that have to be performed by all algorithms, such as sorting or internal real-value operations. Additionally, we will present the complexities per instance, since all algorithms are incrementally trainable.

We explicitly consider only the common variant of the perceptrons with simple weight vectors w. Please refer to Chapter and particularly Section for the dual variant.

4.5.2.1 Memory Requirements

BR and MMP follow a prototype-based approach, so they have to keep one perceptron for each class in memory, leading to n · a = O(na) memory space. In contrast, the pairwise approaches require one perceptron for each of the n(n − 1)/2 pairs of classes, hence we need O(n²a) memory. In addition, the calibrated versions require an overhead of n perceptrons for the comparisons with the artificial label.

Since all perceptron ensembles are online learners, we do not need to store the whole training set in memory, so the requirements are reduced from ma to the a numbers needed for the current training instance.
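The model sizes stated above can be checked with a short calculation. The helper names below are hypothetical, and the example values n = 100 and a = 1000 are chosen arbitrarily for illustration.

```python
def num_perceptrons(n, pairwise, calibrated=False):
    """Number of weight vectors kept in memory for n labels."""
    base = n * (n - 1) // 2 if pairwise else n
    # The calibrated pairwise variant adds n perceptrons for the
    # comparisons with the artificial label.
    return base + (n if pairwise and calibrated else 0)


def memory_numbers(n, a, pairwise, calibrated=False):
    """Number of stored weights, assuming dense vectors with a attributes."""
    return num_perceptrons(n, pairwise, calibrated) * a
```

For n = 100 labels, BR and MMP keep 100 weight vectors, MLPP keeps 4950 and CMLPP keeps 5050, which equals n(n + 1)/2.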

4.5.2.2 Training

For processing one training example, n dot products have to be computed by BR, plus at most the same number of vector additions if there was a prediction error. Following Eq. , the costs are n(1 + HAMLOSS). MMP has to update each of the wrong prototypes w_i, λ_i ∈ F, in addition to the initial prediction. Conveniently, |F| amounts to ISERR + MARGIN (cf. Eq. ), so that exactly n + ISERR + MARGIN operations are required. The MLPPs require |P|(n − |P|) dot products, one for each associated perceptron. Assuming an average prediction error of the base perceptrons of ERR, the costs amount to |P|(n − |P|)(1 + ERR). Unfortunately, it is not possible to obtain a direct relation between ERR and a multilabel metric in Section , as is possible for BR and MMP. The investigation of this relationship remains for the future.

Assuming similar loss rates for all algorithms, i.e. (ISERR + MARGIN)/n ≈ HAMLOSS ≈ ERR ≈ δ, MLPP requires |P|(n − |P|)/n = |P| − |P|²/n ≤ |P| times the number of operations of MMP or BR, hence on average ≤ d, confirming the analysis in Section . Considering the two extreme cases, perfect prediction δ = 0 for the pairwise and the worst case δ = 1 for both prototype-based approaches, and vice versa, and assuming |P| ≤ n/2, the cost ratio r between MLPP and BR/MMP is bounded by

\[
\frac{1}{4}|P| = |P|\,\frac{\tfrac{1}{2}n}{2n} \le |P|\,\frac{n-|P|}{2n} \le r \le |P|\,\frac{2(n-|P|)}{n} < |P|\,\frac{2n}{n} = 2|P| \tag{4.7}
\]

If the calibrated version CMLPP is used, we have to add the BR operations. For the average number of operations per instance on the whole training set, we can substitute |P| by d in the statements.

Thus, assuming similar loss rates, the pairwise training will be only d times slower on average than the BR algorithm (or d + 1, respectively, for the calibrated version) despite training a quadratic number of base classifiers.
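The per-instance training costs and the bounds on the cost ratio r can be verified numerically. The following sketch only restates the cost expressions from the text as functions; the function names are hypothetical and the loss values passed in are free parameters, not measured results.

```python
def cost_br(n, hamloss):
    """Expected operations of BR per training instance."""
    return n * (1 + hamloss)


def cost_mmp(n, iserr, margin):
    """Operations of MMP per training instance."""
    return n + iserr + margin


def cost_mlpp(n, p, err):
    """Expected operations of MLPP per training instance with |P| = p."""
    return p * (n - p) * (1 + err)


def ratio_bounds(n, p):
    """Return (|P|/4, r_min, r_max, 2|P|) for |P| = p <= n/2."""
    # Best case for MLPP: perfect pairwise models, worst prototype models.
    r_min = cost_mlpp(n, p, 0.0) / cost_br(n, 1.0)
    # Worst case for MLPP: worst pairwise models, perfect prototype models.
    r_max = cost_mlpp(n, p, 1.0) / cost_br(n, 0.0)
    return p / 4, r_min, r_max, 2 * p
```

For n = 100 and |P| = 3 with a common loss rate δ = 0.2, the ratio is 3 − 9/100 = 2.91, i.e. slightly below |P|, and the lower bound |P|/4 is attained exactly for |P| = n/2.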

4.5.2.3 Prediction

During prediction, the one-per-class approaches require n computations for one instance, since both use the same model space. For the pairwise approach, all n(n − 1)/2 perceptrons have to be evaluated, leading to O(n²) computations. The ratio between both decomposition philosophies is hence (n − 1)/2. If calibration is used, n(n + 1)/2 perceptrons have to be evaluated and thus (n + 1)/2 times more operations have to be performed.

37 We simplify the notation and write δ instead of δ(P, r) for losses computed on the prediction for a training instance x.

38 This is no restriction, since otherwise we could just use |N|(n − |N|) for the particular estimation, which corresponds to inverting the problem. The maximum number of operations for MLPP is reached with |P| = n/2.

4.5.2.4 Sparsity of Feature Vectors

If the feature vectors of the training and test instances are sparse, i.e. the average number of components x_i ≠ 0 over all x, which we denote by a′, is low, perceptron-based approaches and linear classifiers in general can benefit computationally from an efficient representation of the instances. A sparse data structure only stores the non-zero components and information about the index gaps. This can be implemented e.g. by two vectors ∈ R^{a′} or a linked list.

While a sparse representation can save memory space when the whole training set has to be stored, no reduction can be achieved for the perceptron ensembles, since a high density of the weight vectors is very likely. Hence, na numbers have to be maintained in memory by BR and MMP, and O(n²a) by MLPP and CMLPP.

For training and prediction, in contrast, we obtain the number of arithmetic float operations by multiplying the stated costs in number of perceptron operations by a′.
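The effect of sparsity on the two perceptron operations can be sketched as follows. The example assumes a representation with two parallel lists per instance, one holding the indices of the non-zero components and one holding their values; absolute indices are stored here instead of index gaps for simplicity, and the weight vector stays dense. The function names are hypothetical.

```python
def sparse_dot(w, idx, val):
    """Scalar product w.x for a sparse x given as (indices, values).

    Touches only the non-zero components of x.
    """
    return sum(w[i] * v for i, v in zip(idx, val))


def sparse_update(w, idx, val, tau):
    """Perceptron update w <- w + tau * x, applied in place to dense w."""
    for i, v in zip(idx, val):
        w[i] += tau * v
```

Both operations loop over the stored components only, so their cost is governed by the number of non-zero features of the instance rather than by the number of attributes a.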
