Multiclass Cascade - PSL Multiclass Learning Framework

7.2 PSL Multiclass Learning Framework

7.2.1 Multiclass Cascade

The underlying principle of the proposed multiclass cascade is the separate-and-conquer strategy. As such, it bears conceptual similarities with covering algorithms like RIPPER. The algorithm refines the cascaded classifier in a stepwise fashion by continuously removing most separable class labels, in order to focus on more difficult classes.

Given a training set D = {(x1, y1), ...,(xn, yn)} where xi is a sample vector xi ∈ X

and yi is a class label yi ∈ Y and Y ∈ {0, ..., k}, the multiclass algorithm constructs a

cascade classifier Hk−1 that consists ofk−1 number of layers. Each layeriis specifically

trained to predict a single class label c, as Hc

i(x)∈Y, otherwise the prediction is passed

on to Hc

i+1(x).

The cascaded classifier Hk−1 is also two dimensional, whereby each layer contains

within it a further nested cascade Hc

i,j (Figure 7.1). The nested cascade facilitates the

coarse-to-fine learning of each layer, which enables the convergence to low training error rates. For clarity each layerj of anested cascade, denoted as a pair of (Mj, Bj), is refered

to as nodes. M_jγ(x) denotes a multiclass node whose prediction γ ∈Y and γ 6=Hc

i,j(x),

while Bj denotes a binary predictor whose output isBj(x)∈ {−1,1}.

The cascade training proceeds as follows: in the first layer Hi, the initial multiclass

node Mi,j is trained on all samples using a generic multiclass algorithm until a predefined

Φ number of boosting iterations are completed. Once this criterion is met, node Mi,j

is assessed for accuracy based on individual class error rates. The most separable class label γ is then identified for separation and assigned to node M_i,jγ as its target class for prediction.

All correctly predicted samples (M_i,jγ (xi) =γ)∧(yi =γ) are then trained against all

theincorrectly predicted samples as class (M_i,jγ (xi) =γ)∧(yi 6=γ), using a binary learning

approach. The resulting binary node Bi,j functions as an auxiliary node to the multiclass

node. The auxiliary nodeBi,j is trained until all the false positive samples with respect to

class γ have been correctly learned. All correctly learned samples (M_i,jγ (xi) =γ)∧(yi =

γ)∧(Bi,j(xi) = 1), belonging to class γ are then removed from subsequent training of

layer Hi. The training of node Mi,j+1 proceeds as with the initial node; howbeit, with

k−j class labels to learn.

The coarse-to-fine learning continues until all the most separable class labels and their instances have been removed from layer Hi. If at the end of the layer training, there

Layer 1 Class X ... FB ... FB Layer 2 Class Y ... FB Class Z Layer 3 ... FB Layer n Class N FB Legend:

= final layer’s binary node = multiclass node = binary node B n M1 B k−1 M2 M3 Mk−1 1 B _B2 B₃ 1 B M1 M2 B2 M_k−2 B k−2 M1 1 B M k−3 B − k3 M1 1 B Mn

Figure 7.1: Cascaded multiclass framework.

remain instances which have not been correctly learned and associated with correct class labels, then a final binary nodeF Bi,j is trained, whose output is{−1,1}. In the training

of the F Bi,j node, the incorrectly learned samples are designated as negatives, while

the instances belonging to the last remaining class label c as the positives. The binary training proceeds until a predetermined error rate is achieved. Subsequently, the class label c is assigned as the predictor for layer Hc

i. All instances belonging to class label c

are eliminated from further training of layers Hi+1. In turn, each succeeding Hi+1 layer

with k−i class labels reduces the computational demands and the complexity of class separability. Algorithm 10 outlines the details of the multiclass learning framework.

Key components of separate-and-conquer strategies include strategies for (1) determin- ing the sequence in which class labels are selected for learning and subsequently, (2) the methods of refining the classification accuracy of a chosen class label from the remaining classes.

The approach taken by RIPPER for selecting class labels, involves sorting the list of class labels in an ascending order of size distribution and learning labels by beginning with the smallest class labels first. The multiclass differs in that it makes no decision on which class label to focus on learninga priori. Instead it equally focuses learning all classes and the most separable class labelγ is assigned to the current Mi,j node, while the correctly

7.2. PSL Multiclass Learning Framework ₁₀₅

Algorithm 10: PSL Multiclass Cascade

Given: n= training set size,c = number of class labels,Φ= max weak classifiers,Pn

positive and Nn negative dataset vectors{(x1, y1), ...(xn, yn)} for binary training

where each samplexcorresponds to a class labely ∈ {−1,1}

Input: Dc

n = multiclass dataset vector{(x1, y1), ...(xn, yn)}where each samplexi is a

feature vector∈X corresponding to a total ofkclasses with each class label

denoted asy∈Y andY is the set of all class labels.

Output: Hf inal_{= cascaded classifier comprising of multiclass}_M _{and binary} _B_and _{F B}

nodes. fori = 1 to k do 1 FlagSamplesForTraining(Dn) whereH(xi)6∈yi 2 forj = 1 to k - i do 3

// train new multiclass node

fork = 1 to Φdo 4 M_i,j = MultiClassWeakLearn(Dn) 5 Boost(D_n, HM i,j) 6

M_i,jγ =γ, whereγ= BestClassAccuracyLabel(Dn,Mi,j)

errorγ ₌_Mγ i,j(Dn)

if errorγ _{> target}multiclass_then

Pn= (xi, yi),where (Mi,jγ (xi) =γ)∧(yi=γ) // train supporting binary

10 node Nn = (xi, yi),where (Mi,jγ (xi) =γ)∧(yi 6=γ) 11 repeat 12 Bi,j = BinaryClassWeakLearn(Pn,Nn) 13 Boost(Pn, Nn,Bi,j) 14 errorbinary ₌_B i,j(Pn, Nn) 15

untilerrorbinary _{> target}binary _;

RemoveSamplesFromLayerTraining(Pn,Dn) whereBi,j(xi) = 1∧yi=γ

17 Hc i,j =c,∀M γ i whereγ6=c 18 if Hc

i,j(Dn)> targetf inalthen

Pn= (xi, yi),∀xi,whereyi=c // train final binary node

20 Nn = (xi, yi),where (Hi,jc (xi) =c)∧(yi6=c) 21 repeat 22 F Bi,j = BinaryClassWeakLearn(Pn,Nn) 23 Boost(Pn, Nn,F Bi,j) 24 errorbinary ₌_{F B} i,j(Pn, Nn) 25

untilerrorbinary _{> target}f inal _;

returnHf inal

predicted instances are then removed from further layer training. The justification for choosing the best performing class label is supported by research conducted into training classifiers using binary directed trees. Training classifiers using binary directed trees falls under the umbrella of hierarchical decomposition strategy3 with which the multiclass cascades share much in common. In their work with binary directed trees, Takahashi and Abe [165] propose that initial levels of a tree should divide the most separable classes. Their intuitive argument is that by dividing the most separated classes in the initial levels of a tree first, a higher overall accuracy can be achieved.

The second key aspect of the framework concerns how the accuracy of a node Mi,j

is to be refined in its prediction for class label γ. The accuracy in respect to hit and false positive rates of a class label chosen for removal, have implications for both the training phase and the generalizability of a classifier. The hit rate of the selected class γ determines the proportion of its samples that will be removed from further training of the layer. If this is low then the class separability of the renaming sub-tasks is not likely to be greatly reduced. Also the training will not become less computationally expensive. However, a negligible false positive rate is even more crucial, since class labelcfor layerHi

is not knowna priori. This means that a cumulative error for instances of classc, over all auxiliary nodes in a layer, renders the final accuracy ofHc

i(x) impractical. For this reason

the binary class learning focuses primarily on lowering the false positive rate instead of the true positive rate in respect to class labelγ. RIPPER for example differs in that once a class is selected for learning, it uses the information gain measure of correctness of the induced rules, to increase its accuracy in respect to the true positive rate.

The complexity of the cascade training can be approximated by showing the total numbers of multiclass and binary weak classifiers that are generated. For k classes, the number of multiclass nodesMk is

totalM =

k2₋_k

2 −1 (7.1)

and binary nodesBk

totalB =

2 + 1 (7.2)

Suppose that the size of all Mi nodes is fixed at Φ weak classifiers, while the size of Bi

nodes is flexible but cannot exceed Ψ weak classifiers. We know that the initial B1,1

node is most complex, and if Ψ is set sufficiently high, then the size of Bi,j decreases

logarithmically after each node and layer. The total number of classifiers is therefore total=Φ k2−k 2 −1 +log(Ψ) k2 2 + 1 (7.3) where the number of class labels is the dominating term in the complexity.

The runtime procedure for classifying a sample ensures that only a subset of the cascaded classifier is evaluated as opposed to monolithic ECOC-based, OAO and OAA approaches. As an unseen sample (xi, yi) enters the cascade at H1,1, it is forwarded to a

subsequent layer for classification if any combination of nodes produces a prediction

7.2. PSL Multiclass Learning Framework ₁₀₇

((M_i,jγ(xi) =γ)∧(yi=γ))∧(Bi,j(xi) = 1) (7.4)

which predicts it as matching its target class label and if the associated binary node confirms it as a positive. A sample is forwarded to subsequent layers until it reaches a layer Hc

i, where the condition (7.4) fails in all cases and the final node F Bi,j(xi) = 1

confirms it as a positive for class labelγ. The output of the nodes on previous portions of the cascade is not considered by each node and therefore each component is independent. As a sample instance propagates further into the cascade, the classification accelerates since there are a decreasing number of nodes at each layer to evaluate.

To further differentiate the multiclass cascade from existing separate-and-conquer approaches, the multiclass cascade neither employs pruning nor global optimization steps as do some rule induction algorithms4_{. In addition, since separate-and-conquer approaches}

tend to be rule induction algorithms, they produce classification rules that are human readable while the ensemble based methods like the multiclass cascade generate classifiers that are incomprehensible.

In document Efficient boosted ensemble based machine learning in the context of cascaded frameworks : a thesis presented in partial fulfilment of the requirements for the degree of Doctor of Philosophy in Computer Science at Massey University, Auckland, New Zealand (Page 122-126)