Binary Decomposition Methods for Multipartite Ranking

(1)

Eyke Hüllermeier, Philipps-Universität Marburg

Johannes Fürnkranz, TU Darmstadt

Stijn Vanderlooy, Maastricht University

(2)

Outline

 Multipartite Ranking

 Evaluation measures

 C-Index

 m-AUC / Jonckheere-Terpstra statistic

 Methods for learning multipartite rankings

 Transformation into a Single Binary Problem

 Binary Decompositions

 ordered

 pairwise

 Complexity of these approaches

 Experimental Results

 Conclusions

(3)

Binary Classification

[Figure: example objects, each labeled 0 or 1]

Example: Reviewers divide Papers into Accept / Reject

(4)

Binary Classification with Scores

[Figure: example objects, each labeled 0 or 1]

Scores (shown in different shades of color) indicate the degree of redness or greenness

(5)

Binary Classification with Scores

[Figure: objects labeled 0 0 1 0 1 1 1 0]

(6)

Binary Classification with Scores

[Figure: objects labeled 0 0 1 0 1 1 1 0]

= Bipartite Ranking + Partition

(7)

Ordered Classification

[Figure: example objects, each labeled 0, 1, or 2]

Example: Reviewers divide papers into Accept / Borderline / Reject

(8)

Multipartite Ranking

Example: Reviewers sort papers by quality

(9)

Multipartite Ranking

 Task is essentially the same as in bipartite ranking:

Rank a set of objects in agreement with their class labels

 Main Difference:

Training information is

 not one of two (ordered) classes (binary classification)

 but one of multiple ordered classes (ordinal classification)

→ we need different evaluation metrics

Multipartite Ranking is also known as

 layered ranking (Waegeman et al. 2008)

 k-partite ranking (Rajaram, Agarwal 2005)

(10)

Evaluation of Bipartite Rankings

Area under the ROC curve

 the probability that a randomly chosen positive example is ranked before a randomly chosen negative example.

Computation:

 for each pair (p,n), where class(p) > class(n)

 correct++ if score(p) > score(n)

 $\mathrm{AUC}(P, N) = \frac{\mathit{correct}}{\#P \cdot \#N}$

Available information: binary classification

Prediction: ranking scores
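The counting procedure above can be sketched in a few lines of Python. This is a minimal illustration, not an efficient implementation: the function name `auc` and the example scores are made up here, and ties are counted as incorrect because the slide only counts pairs with score(p) > score(n).

```python
from itertools import product

def auc(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs in which the positive scores higher."""
    correct = sum(1 for p, n in product(pos_scores, neg_scores) if p > n)
    return correct / (len(pos_scores) * len(neg_scores))

pos = [0.9, 0.8, 0.4]   # hypothetical scores of the positive examples
neg = [0.7, 0.3]        # hypothetical scores of the negative examples
print(auc(pos, neg))    # 5 of the 6 pairs are ordered correctly
```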

(11)

Evaluation of Multipartite Rankings

C-Index

 the probability that a randomly chosen example of class J is ranked before a randomly chosen example of class I < J.

Computation:

 for each pair (p,n), where class(p) > class(n)

 correct++ if score(p) > score(n)

 $\text{C-Index} = \frac{\mathit{correct}}{\sum_{I,\,J>I} \#I \cdot \#J}$

AUC is the special case of the C-Index with C = 2 classes.

Available information: ordinal classification

Prediction: ranking scores
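The same pair counting, extended to several ordered classes, might look as follows. This is a sketch with made-up data; the function name `c_index` is hypothetical, and the denominator is the number of example pairs that come from different classes.

```python
def c_index(scores, labels):
    """Fraction of correctly ordered pairs among all pairs from different classes."""
    correct = 0
    total = 0
    for i in range(len(scores)):
        for j in range(len(scores)):
            if labels[i] > labels[j]:          # one order constraint per such pair
                total += 1
                if scores[i] > scores[j]:
                    correct += 1
    return correct / total

labels = [1, 1, 2, 2, 3]                       # hypothetical ordinal classes
scores = [0.1, 0.5, 0.4, 0.6, 0.9]             # hypothetical ranking scores
print(c_index(scores, labels))                 # 7 of the 8 constraints hold
```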

(12)

Evaluation of Multipartite Rankings

C-Index

 the C-index can be rewritten as a weighted sum of pairwise AUCs:

 $\text{C-Index} = \frac{1}{\sum_{I,\,J>I} \#I \cdot \#J} \sum_{I,\,J>I} \#I \cdot \#J \cdot \mathrm{AUC}(I, J)$

Jonckheere-Terpstra statistic

 is an unweighted sum of pairwise AUCs:

 $\text{m-AUC} = \frac{2}{C(C-1)} \sum_{I,\,J>I} \mathrm{AUC}(I, J)$

 equivalent to the well-known multi-class extension of AUC

Note: C-Index and m-AUC can be optimized by optimizing the pairwise AUCs
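To make the relation to pairwise AUCs concrete, the following sketch computes both measures from the AUCs of all class pairs. The helper names and the data are illustrative assumptions; the pairwise AUC counts only strictly correct pairs, as in the counting procedure for the bipartite case.

```python
from itertools import combinations

def pairwise_auc(scores, labels, low, high):
    """AUC restricted to the examples of classes `low` and `high` (high = positive)."""
    hi = [s for s, y in zip(scores, labels) if y == high]
    lo = [s for s, y in zip(scores, labels) if y == low]
    return sum(1 for p in hi for n in lo if p > n) / (len(hi) * len(lo))

def c_index_and_m_auc(scores, labels):
    classes = sorted(set(labels))
    count = {c: labels.count(c) for c in classes}
    pairs = list(combinations(classes, 2))                 # all (I, J) with I < J
    aucs = {(i, j): pairwise_auc(scores, labels, i, j) for i, j in pairs}
    weight_sum = sum(count[i] * count[j] for i, j in pairs)
    # C-index: pairwise AUCs weighted by the number of example pairs #I * #J
    c_idx = sum(count[i] * count[j] * aucs[i, j] for i, j in pairs) / weight_sum
    # m-AUC: plain average over the C(C-1)/2 class pairs
    m_auc = 2 * sum(aucs.values()) / (len(classes) * (len(classes) - 1))
    return c_idx, m_auc

labels = [1, 1, 2, 2, 3]                                   # hypothetical data
scores = [0.1, 0.5, 0.4, 0.6, 0.9]
print(c_index_and_m_auc(scores, labels))
```

With unequal class sizes the two measures differ, because the C-index gives more weight to pairs of large classes.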

(13)

Conventional Approach to Multipartite Ranking

Turn the problem into a single, large binary classification problem

(Herbrich, Graepel, Obermayer, 2001)

Goal:

 learn a (linear) scoring function, s.t. for all pairs (p,n): score(p) > score(n) iff class(p) > class(n)

Approach:

 for each constraint class(p) > class(n)

 construct a new example x = p − n such that

 $\mathrm{score}(x) = \mathrm{score}(p - n) = \mathrm{score}(p) - \mathrm{score}(n) > 0$
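A minimal sketch of this transformation, assuming NumPy feature vectors and a linear scoring function without bias term (only then does score(p − n) = score(p) − score(n) hold). The function name, data, and weight vector are made up for illustration. A real implementation would typically also add the negated differences as negative examples before training a binary classifier.

```python
import numpy as np

def pairwise_transform(X, y):
    """Build one difference vector p - n for every pair with class(p) > class(n)."""
    diffs = [X[i] - X[j]
             for i in range(len(y)) for j in range(len(y))
             if y[i] > y[j]]
    return np.array(diffs)

X = np.array([[1.0, 0.0], [2.0, 1.0], [3.0, 3.0]])   # hypothetical feature vectors
y = np.array([1, 2, 3])                               # their ordinal class labels
w = np.array([0.5, 0.5])                              # hypothetical linear scoring weights

D = pairwise_transform(X, y)
# Every order constraint is satisfied iff the corresponding row of D scores above 0.
print(D @ w)
```

Note that the number of rows of D grows quadratically with the number of examples, which is what makes this approach expensive.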

(14)

Binary Decomposition Methods

Solve a classification / ranking problem by decomposing it into a set of binary classification problems

 Ordered (F&H)

 learn one theory for each possible split point in the order of the classes

{1} vs. {2,3,4}

{1,2} vs. {3,4}

{1,2,3} vs. {4}

 C-1 theories

 each using all examples

 proposed for ordinal classification

 Pairwise (LPC)

 learn one theory for each pair of classes (as for unordered classification)

{1} vs. {2}

{1} vs. {3}

{1} vs. {4}

{2} vs. {3}

{2} vs. {4}

{3} vs. {4}

 C(C-1)/2 theories

 each using only some examples

 is no worse for ordered classification
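The two decompositions can be enumerated mechanically. The sketch below lists the binary problems for four ordered classes; the function names are hypothetical, and each problem is represented as a pair (classes on the low side, classes on the high side).

```python
from itertools import combinations

def ordered_splits(classes):
    """Ordered (F&H): one problem per split point in the class order."""
    return [(classes[:i], classes[i:]) for i in range(1, len(classes))]

def pairwise_problems(classes):
    """Pairwise (LPC): one problem per pair of classes."""
    return [([a], [b]) for a, b in combinations(classes, 2)]

classes = [1, 2, 3, 4]
print(ordered_splits(classes))      # C-1 = 3 problems, each uses all examples
print(pairwise_problems(classes))   # C(C-1)/2 = 6 problems, each uses two classes only
```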

(15)

Complexity

Assume we have a 2-class learner with complexity $O(N^\alpha)$

 single binary problem

 trains 1 model with N² order constraints (one constraint for each pair of examples) → $O(N^{2\alpha})$

 ordered (F&H)

 trains C-1 models with N training examples → $O(C \cdot N^\alpha)$

 pairwise (LPC)

 equal class distribution (worst case): trains C(C-1)/2 models with $2N/C$ training examples → $O(C^{2-\alpha} N^\alpha)$

 the same as ordered decomposition for linear classifiers (α = 1)
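The three bounds follow from multiplying the number of models by the cost of training one model. The derivation below is a sketch: the exponent symbol α is our notation for the base learner's cost on N examples, and equal class sizes are assumed for the pairwise case.

```latex
\begin{align*}
\text{single binary problem:}\quad
  & 1 \cdot O\!\left((N^{2})^{\alpha}\right) = O\!\left(N^{2\alpha}\right) \\
\text{ordered (F\&H):}\quad
  & (C-1) \cdot O\!\left(N^{\alpha}\right) = O\!\left(C \cdot N^{\alpha}\right) \\
\text{pairwise (LPC):}\quad
  & \frac{C(C-1)}{2} \cdot O\!\left(\left(\frac{2N}{C}\right)^{\alpha}\right)
    = O\!\left(C^{2} \cdot \frac{N^{\alpha}}{C^{\alpha}}\right)
    = O\!\left(C^{2-\alpha} N^{\alpha}\right)
\end{align*}
```

For a learner with linear cost (α = 1) the pairwise bound becomes O(C · N), the same as for the ordered decomposition. For α > 1 the pairwise bound grows more slowly in C.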

(16)

Prediction with F&H

 Prediction of binary models

 we have C-1 models M_I

 for an example x, each model predicts $p_I(x) = P(\mathrm{class}(x) > I)$

 Computing a prediction

 derive estimates for the probabilities P(class(x) = I)

 straightforward, but not our concern

 Computing a score: $\mathrm{score}_{FH}(x) = \sum_I p_I(x)$

 intuitive justification:

 high classes have a high probability in all p_I

 low classes have a low probability in all p_I

 medium classes have high probabilities in the low p_I and low probabilities in the high p_I

 theoretical justification:

[Figure: p₁, p₂, p₃]
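The following sketch computes the F&H score for one example, assuming each model M_I outputs an estimate of P(class(x) > I) and that these estimates decrease with I. It also checks a standard identity, namely that the sum of these probabilities equals the expected class index minus 1. Whether this is the theoretical justification intended on the slide is an assumption; the function names and numbers are illustrative.

```python
def fh_score(p):
    """Sum of the C-1 model outputs; p[i] estimates P(class(x) > i+1)."""
    return sum(p)

def class_probabilities(p):
    """Derive P(class(x) = I) for I = 1..C from the estimates P(class(x) > I)."""
    extended = [1.0] + list(p) + [0.0]     # P(class > 0) = 1 and P(class > C) = 0
    return [extended[i] - extended[i + 1] for i in range(len(extended) - 1)]

p = [0.9, 0.6, 0.2]                        # hypothetical outputs of M_1, M_2, M_3 (C = 4)
probs = class_probabilities(p)             # roughly [0.1, 0.3, 0.4, 0.2]
expected_class = sum((i + 1) * q for i, q in enumerate(probs))
print(fh_score(p), expected_class - 1)     # the two values agree up to rounding
```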

(17)

Prediction with LPC

 Prediction of binary models

 we have C(C-1)/2 models M_{I,J} (I < J)

 for an example x, each model predicts $p_{I,J}(x) = P(\mathrm{class}(x) = J \mid I \lor J)$, with $p_{J,I}(x) = 1 - p_{I,J}(x)$

 Computing a prediction

 weighted voting: predict $\arg\max_I \mathrm{score}_I(x)$, where $\mathrm{score}_I(x) = \sum_{J \neq I} p_{J,I}(x)$

 can be shown to maximize Spearman rank correlation

 Computing a score: $\mathrm{score}_{LPC\text{-}U}(x) = \sum_{I,\,J>I} p_{I,J}(x)$

 intuitive justification:

 sum up all predictions “in favor of a higher class”

 examples with low classes will get low probabilities in the models for high classes

 examples with high classes will get high probabilities in the models for high classes

 motivation:

[Figure: classes 1 2 3 4]
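Both uses of the pairwise models can be sketched as follows. The code assumes the convention that the stored value for a pair (I, J) with I < J is the estimated probability in favor of the higher class J, and that the reverse direction is one minus that value. The function names and the numbers are hypothetical.

```python
def lpc_vote(p, num_classes):
    """Weighted voting: each class collects the probability mass in its favor."""
    def in_favor(winner, other):
        lo, hi = min(winner, other), max(winner, other)
        return p[lo, hi] if winner == hi else 1.0 - p[lo, hi]
    votes = {i: sum(in_favor(i, j) for j in range(1, num_classes + 1) if j != i)
             for i in range(1, num_classes + 1)}
    return max(votes, key=votes.get)

def lpc_score(p):
    """Ranking score: sum of all predictions in favor of the higher class."""
    return sum(p.values())

# p[(I, J)] with I < J: hypothetical estimate that x belongs to the higher class J
p = {(1, 2): 0.8, (1, 3): 0.7, (2, 3): 0.4}
print(lpc_vote(p, 3))    # class with the largest vote sum
print(lpc_score(p))      # larger values indicate a higher position in the ranking
```

The vote yields a class label, while the score yields a single number per example that can be used to sort examples.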

(18)

Prediction with LPC

Actually, we tested three variants

 $\mathrm{score}_{LPC\text{-}U}(x) = \sum_{I,\,J>I} p_{I,J}(x)$

 unweighted sum of p_{I,J}

 motivated by m-AUC

 $\mathrm{score}_{LPC\text{-}W}(x) = \frac{1}{N^2} \sum_{I,\,J>I} \#I \cdot \#J \cdot p_{I,J}(x)$

 each p_{I,J} is weighted by the relative frequencies #I/N and #J/N of its classes

 motivated by C-index

 $\mathrm{score}_{LPC\text{-}A}(x) = \frac{1}{N} \sum_{I,\,J>I} (\#I + \#J) \cdot p_{I,J}(x)$
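The three variants differ only in how each pairwise prediction is weighted. The sketch below computes all of them for one example. The dictionary layout, class counts, and function name are illustrative assumptions, and the weight (#I + #J)/N used for LPC-A is our reading of the formula, in which the operator between #I and #J is not legible.

```python
def lpc_scores(p, counts):
    """Return (LPC-U, LPC-W, LPC-A) scores from pairwise predictions p[(I, J)], I < J."""
    n = sum(counts.values())
    # LPC-U: unweighted sum of the pairwise predictions
    u = sum(p.values())
    # LPC-W: weight each prediction by the relative frequencies of both classes
    w = sum(counts[i] * counts[j] * v for (i, j), v in p.items()) / n ** 2
    # LPC-A (assumed reading): weight by the share of examples in the two classes
    a = sum((counts[i] + counts[j]) * v for (i, j), v in p.items()) / n
    return u, w, a

p = {(1, 2): 0.8, (1, 3): 0.7, (2, 3): 0.4}     # hypothetical pairwise predictions
counts = {1: 50, 2: 30, 3: 20}                   # hypothetical class sizes, N = 100
print(lpc_scores(p, counts))
```

With equal class sizes all three scores are proportional to each other and therefore induce the same ranking.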

(19)

Experimental setup

 Datasets

 21 discretized regression datasets

 5 classes each, using equal-frequency discretization

 4 real ordered classification datasets

 Evaluation Procedure

 average of 5 iterations of 10-fold cross-validation on each dataset

 Linear Base Classifiers

 all binary models are trained with logistic regression

 Rank-SVM is used with a linear kernel (default configuration)
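Turning a regression target into five ordered classes of equal frequency can be sketched as below. This is one plausible way to do it with NumPy quantiles, not necessarily the procedure used in the experiments; the function name and the toy target are made up.

```python
import numpy as np

def equal_frequency_classes(targets, num_classes=5):
    """Map numeric targets to ordinal classes 1..num_classes of (nearly) equal size."""
    targets = np.asarray(targets, dtype=float)
    # interior quantile cut points, e.g. 20%, 40%, 60%, 80% for five classes
    cuts = np.quantile(targets, np.linspace(0, 1, num_classes + 1)[1:-1])
    return np.searchsorted(cuts, targets, side="right") + 1

y = np.arange(100)                       # hypothetical regression targets
labels = equal_frequency_classes(y)
print(np.bincount(labels)[1:])           # number of examples per class
```

Many tied target values can make the classes unequal in size, since tied values always fall into the same class.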

(20)

Results for Multipartite Ranking

F&H is significantly better than all LPC variants and Rank-SVM

(Nemenyi test, critical rank difference = 0.88)

(21)

Results for Classification

LPC is somewhat better, certainly no worse (W/T/L = 15/1/9)

(22)

Discussion of Results

 Pairwise (LPC) performs worse than Ordered (F&H)

 this contradicts our expectations (and our results) from classification

 Possible Reason: Non-Competence problem

 each pairwise classifier has to make a prediction even if the example belongs to neither of its two classes

 in those cases, the probabilities are estimated in regions in which the classifier has seen no training examples