
Binary classification and bipartite ranking

We have reviewed techniques that form the foundation of efficient path recommendation. In this section, we describe the problems and related techniques that can help us achieve efficient set recommendation. In particular, we review a few closely related loss functions for the problem of binary classification and bipartite ranking, which will be employed in Chapter 5 to enable efficient recommendation of sets in the context of music playlists.

Given a sample of instances (or examples) with binary labels, binary classification is the problem of learning a binary-valued function that gives $+1$ labels for positive instances and $-1$ labels for negative instances.⁷ A related problem is bipartite ranking, which learns to rank positive instances above negative instances.

Let $\mathcal{X}$ be the instance space, and let $\mathcal{D} = \mathcal{S}^+ \cup \mathcal{S}^-$ denote the binary dataset, where $\mathcal{S}^+ = \{(x^+, +1)\}$ is a set of positive examples and $\mathcal{S}^- = \{(x^-, -1)\}$ is a set of negative examples. The examples satisfy $x^+, x^- \in \mathcal{X}$, e.g., they are $D$-dimensional feature vectors. Further, we use $M^+$ and $M^-$ to denote the number of positive and negative examples, respectively.

2.5.1 Loss functions for binary classification

The binary-valued prediction function of a binary classifier is generally not learned directly. Rather, we first learn a real-valued scoring function $f: \mathcal{X} \to \mathbb{R}$, then compare the score of a particular input with a threshold $t$ to get a binary prediction. The misclassification loss of $f$ on binary dataset $\mathcal{D}$ is the number of incorrectly classified examples. Taking care of ties, we have

$$
R^{\mathrm{bc}}_{0/1} = \sum_{x^+ \in \mathcal{S}^+} \left( [\![ f(x^+) < t ]\!] + \frac{1}{2} [\![ f(x^+) = t ]\!] \right) + \sum_{x^- \in \mathcal{S}^-} \left( [\![ f(x^-) > t ]\!] + \frac{1}{2} [\![ f(x^-) = t ]\!] \right), \tag{2.39}
$$

where $[\![ \cdot ]\!]$ is the indicator function that represents the 0/1 loss.
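As a minimal sketch of Eq. (2.39), the following function counts misclassified examples from lists of scores, charging 1/2 for a score that ties with the threshold. The function name, the argument names and the toy scores are illustrative assumptions, not part of the original text.

```python
def misclassification_loss(pos_scores, neg_scores, t=0.0):
    """0/1 misclassification loss of Eq. (2.39); a tie with the threshold costs 1/2."""
    loss = 0.0
    for s in pos_scores:      # a positive is misclassified when it scores below t
        if s < t:
            loss += 1.0
        elif s == t:
            loss += 0.5
    for s in neg_scores:      # a negative is misclassified when it scores above t
        if s > t:
            loss += 1.0
        elif s == t:
            loss += 0.5
    return loss


pos = [2.0, 0.0, -1.0]   # hypothetical scores f(x+)
neg = [-3.0, 0.5]        # hypothetical scores f(x-)
print(misclassification_loss(pos, neg))  # prints 2.5
```

Here the positive with score 0.0 ties with the threshold (cost 1/2), the positive with score -1.0 and the negative with score 0.5 are misclassified (cost 1 each), giving 2.5.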

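To practically optimise the misclassification loss $R^{\mathrm{bc}}_{0/1}$, we replace the indicator function in (2.39) with one of the convex surrogates $\ell(\cdot)$ of the 0/1 loss to upper bound $R^{\mathrm{bc}}_{0/1}$. In other words, we have

$$
R^{\mathrm{bc}}_{0/1} \le \sum_{x^+ \in \mathcal{S}^+} \ell(f(x^+) - t) + \sum_{x^- \in \mathcal{S}^-} \ell(t - f(x^-)).
$$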
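The bound can be checked numerically on a toy example. The sketch below assumes two common convex surrogates, the hinge loss $\max(0, 1 - z)$ and the exponential loss $e^{-z}$; all function names and scores are made up for illustration.

```python
import math


def zero_one(z):
    """Tie-aware 0/1 loss on a margin z: 1 if z < 0, 1/2 if z == 0, else 0."""
    return 1.0 if z < 0 else (0.5 if z == 0 else 0.0)


def hinge(z):
    """Hinge surrogate: max(0, 1 - z)."""
    return max(0.0, 1.0 - z)


def exponential(z):
    """Exponential surrogate: exp(-z)."""
    return math.exp(-z)


def classification_risk(loss, pos_scores, neg_scores, t=0.0):
    """Sum of loss(f(x+) - t) over positives plus loss(t - f(x-)) over negatives."""
    return (sum(loss(s - t) for s in pos_scores)
            + sum(loss(t - s) for s in neg_scores))


pos = [2.0, 0.0, -1.0]   # hypothetical scores f(x+)
neg = [-3.0, 0.5]        # hypothetical scores f(x-)

print(classification_risk(zero_one, pos, neg))     # the 0/1 loss itself
print(classification_risk(hinge, pos, neg))        # hinge upper bound
print(classification_risk(exponential, pos, neg))  # exponential upper bound
```

Both surrogate risks are at least as large as the 0/1 risk on this data, as the inequality requires.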

For example, letting $t = 0$ and $\ell(z) = \max(0, 1 - z)$ (i.e., the hinge loss) gives the empirical risk of support vector machines for binary classification.⁸

⁷ The labels for binary classification are also (widely) denoted as $\{0, 1\}$ or $\{\text{True}, \text{False}\}$.

⁸ As a remark, loss functions for binary classification can be independent of the threshold, a

Alternatively, letting $t = 0$ and $\ell(z) = e^{-z}$ (i.e., the exponential loss) results in the objective of AdaBoost (Freund and Schapire, 1997), and here we review two generalisations of it. The first one is known as the Cost-Sensitive AdaBoost (Ertekin and Rudin, 2011)

$$
R^{\mathrm{csa}}(f, \mathcal{D}) = \sum_{x^+ \in \mathcal{S}^+} \exp(-f(x^+)) + C \sum_{x^- \in \mathcal{S}^-} \exp(f(x^-)), \tag{2.40}
$$

where $C$ is a weighting parameter. Another generalisation of AdaBoost is the P-Classification (Ertekin and Rudin, 2011)

$$
R^{\mathrm{pc}}(f, \mathcal{D}) = \sum_{x^+ \in \mathcal{S}^+} \exp(-f(x^+)) + \frac{1}{p} \sum_{x^- \in \mathcal{S}^-} \exp(p f(x^-)), \tag{2.41}
$$

where $p \in \mathbb{R}^+$ is a hyper-parameter.
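To make the two objectives concrete, here is a small sketch that evaluates Eq. (2.40) and Eq. (2.41) on lists of scores. The function names and the toy scores are assumptions for illustration. Setting $C = 1$ or $p = 1$ recovers the same exponential-loss objective with $t = 0$.

```python
import math


def cost_sensitive_adaboost(pos_scores, neg_scores, C=1.0):
    """Eq. (2.40): sum of exp(-f(x+)) plus C times the sum of exp(f(x-))."""
    return (sum(math.exp(-s) for s in pos_scores)
            + C * sum(math.exp(s) for s in neg_scores))


def p_classification(pos_scores, neg_scores, p=1.0):
    """Eq. (2.41): sum of exp(-f(x+)) plus (1/p) times the sum of exp(p * f(x-))."""
    return (sum(math.exp(-s) for s in pos_scores)
            + sum(math.exp(p * s) for s in neg_scores) / p)


pos = [2.0, 0.5]     # hypothetical scores f(x+)
neg = [1.0, -1.0]    # hypothetical scores f(x-)

# With C = 1 and p = 1 the two objectives coincide.
print(math.isclose(cost_sensitive_adaboost(pos, neg, C=1.0),
                   p_classification(pos, neg, p=1.0)))  # prints True

# A larger p puts more weight on negatives with high scores.
print(p_classification(pos, neg, p=1.0), p_classification(pos, neg, p=4.0))
```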

It turns out that both the Cost-Sensitive AdaBoost (Eq. 2.40) and the P-Classification (Eq. 2.41) are closely related to the P-Norm Push (Eq. 2.43) (Ertekin and Rudin, 2011). This inspires a more general relation between bipartite ranking and binary classification, which will be described in the next section.

2.5.2 Loss functions for bipartite ranking

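Given an example space $\mathcal{X}$, let $f: \mathcal{X} \to \mathbb{R}$ be a function that can score an example $x \in \mathcal{X}$. The misranking loss of $f$ on binary dataset $\mathcal{D}$ is the number of (positive, negative) pairs in which the positive example is ranked below the negative example. Accounting for ties, we have

$$
R^{\mathrm{br}}_{0/1}(f, \mathcal{D}) = \sum_{x^+ \in \mathcal{S}^+} \sum_{x^- \in \mathcal{S}^-} \left( [\![ f(x^+) < f(x^-) ]\!] + \frac{1}{2} [\![ f(x^+) = f(x^-) ]\!] \right). \tag{2.42}
$$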
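A minimal sketch of Eq. (2.42) follows, with names and toy scores chosen for illustration only. It also uses the standard identity that dividing the misranking loss by the number of pairs $M^+ M^-$ gives one minus the AUC, when ties count as half a correct pair.

```python
def misranking_loss(pos_scores, neg_scores):
    """Eq. (2.42): number of misranked (positive, negative) pairs; a tie costs 1/2."""
    loss = 0.0
    for sp in pos_scores:
        for sn in neg_scores:
            if sp < sn:
                loss += 1.0
            elif sp == sn:
                loss += 0.5
    return loss


def auc(pos_scores, neg_scores):
    """AUC as the fraction of correctly ordered pairs, with ties counted as 1/2."""
    n_pairs = len(pos_scores) * len(neg_scores)
    return 1.0 - misranking_loss(pos_scores, neg_scores) / n_pairs


pos = [2.0, 0.5, -1.0]   # hypothetical scores f(x+)
neg = [0.5, -2.0]        # hypothetical scores f(x-)
print(misranking_loss(pos, neg))  # prints 1.5
print(auc(pos, neg))              # prints 0.75
```

Of the six pairs, one is misranked (-1.0 against 0.5) and one is tied (0.5 against 0.5), which gives a loss of 1.5 and an AUC of 0.75.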

Since the 0/1 loss is a non-differentiable function, to practically optimise the loss of the scoring function $f$ on dataset $\mathcal{D}$, one approach is to upper bound the 0/1 loss with one of its convex surrogates $\ell(z) \ge [\![ z \le 0 ]\!]$ (e.g., the exponential loss $\ell(z) = e^{-z}$, the logistic loss $\ell(z) = \log(1 + e^{-z})$, or the squared hinge loss $\ell(z) = [\max(0, 1 - z)]^2$).

In other words, we can upper bound the misranking loss as

$$
R^{\mathrm{br}}_{0/1}(f, \mathcal{D}) \le \sum_{x^+ \in \mathcal{S}^+} \sum_{x^- \in \mathcal{S}^-} \ell(f(x^+) - f(x^-)).
$$
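The pairwise bound can be illustrated with the three surrogates named above. This is a sketch under stated assumptions: the function names and scores are hypothetical, and the logistic loss is written with a base-2 logarithm so that it equals 1 at $z = 0$ and therefore dominates the 0/1 loss (with the natural logarithm it equals $\log 2 < 1$ there).

```python
import math


def zero_one(z):
    """Tie-aware 0/1 loss on a pairwise margin z."""
    return 1.0 if z < 0 else (0.5 if z == 0 else 0.0)


def exp_loss(z):
    return math.exp(-z)


def logistic_loss(z):
    # Base-2 logarithm (an assumption made here so the surrogate is >= the 0/1 loss).
    return math.log2(1.0 + math.exp(-z))


def squared_hinge(z):
    return max(0.0, 1.0 - z) ** 2


def pairwise_risk(loss, pos_scores, neg_scores):
    """Sum of loss(f(x+) - f(x-)) over all (positive, negative) pairs."""
    return sum(loss(sp - sn) for sp in pos_scores for sn in neg_scores)


pos = [2.0, 0.5, -1.0]   # hypothetical scores f(x+)
neg = [0.5, -2.0]        # hypothetical scores f(x-)

misranking = pairwise_risk(zero_one, pos, neg)
for surrogate in (exp_loss, logistic_loss, squared_hinge):
    # Each surrogate risk should be at least the misranking loss.
    print(surrogate.__name__, pairwise_risk(surrogate, pos, neg) >= misranking)
```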

⁸ (continued) typical example is the loss function for logistic regression (i.e., the log loss or the cross entropy loss), where the prediction function $f: \mathbb{R}^D \to [0, 1]$ outputs a probability and the loss of $f$ on $\mathcal{D}$ is $R^{\mathrm{log}}(f, \mathcal{D}) = \sum \cdots$

If we measure the quality of a ranking function by the area under the ROC curve (AUC), loss functions for bipartite ranking are often variants of the misranking loss, which is related to $1 - \mathrm{AUC}$ (Ertekin and Rudin, 2011). For example, the objective of P-Norm Push is defined as (Rudin, 2009):

$$
R^{\mathrm{pn}}_{\ell}(f, \mathcal{D}) = \sum_{x^- \in \mathcal{S}^-} \left[ \sum_{x^+ \in \mathcal{S}^+} \ell(f(x^+) - f(x^-)) \right]^p, \tag{2.43}
$$

where $p \in \mathbb{R}^+$ is a parameter that acts as a soft maximum for the highest scoring negative example. It reduces to the objective of RankBoost if we employ the exponential surrogate loss $\ell(z) = e^{-z}$ and let $p = 1$ (Freund et al., 2003).
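The effect of $p$ in Eq. (2.43) can be seen on a toy example. The sketch below, with hypothetical names and scores, checks that $p = 1$ with the exponential loss gives the plain pairwise exponential objective, and prints how much of the objective is due to the highest-scoring negative for two values of $p$.

```python
import math


def exp_loss(z):
    """Exponential surrogate: exp(-z)."""
    return math.exp(-z)


def per_negative_terms(pos_scores, neg_scores, p, loss=exp_loss):
    """For each negative, sum the loss over all positives and raise it to the power p."""
    return [sum(loss(sp - sn) for sp in pos_scores) ** p for sn in neg_scores]


def p_norm_push(pos_scores, neg_scores, p=1.0, loss=exp_loss):
    """Eq. (2.43): sum of the per-negative terms."""
    return sum(per_negative_terms(pos_scores, neg_scores, p, loss))


pos = [2.0, 0.5]     # hypothetical scores f(x+)
neg = [1.0, -1.0]    # hypothetical scores f(x-); index 0 is the highest-scoring negative

# With p = 1 the objective is the pairwise exponential loss.
pairwise = sum(exp_loss(sp - sn) for sp in pos for sn in neg)
print(math.isclose(p_norm_push(pos, neg, p=1.0), pairwise))  # prints True

# Share of the objective contributed by the highest-scoring negative.
for p in (1.0, 4.0):
    terms = per_negative_terms(pos, neg, p)
    print(p, terms[0] / sum(terms))
```

The share attributed to the highest-scoring negative grows with $p$, which is the soft-maximum behaviour described above.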

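Recalling the definition of the infinity norm (or maximum norm)

$$
\|\mathbf{z}\|_\infty = \max_i |z_i| = \lim_{p \to +\infty} \left( \sum_i |z_i|^p \right)^{1/p},
$$

we have

$$
R^{\infty}_{\ell}(f, \mathcal{D}) = \lim_{p \to +\infty} \left[ R^{\mathrm{pn}}_{\ell}(f, \mathcal{D}) \right]^{\frac{1}{p}} = \max_{x^- \in \mathcal{S}^-} \left[ \sum_{x^+ \in \mathcal{S}^+} \ell(f(x^+) - f(x^-)) \right], \tag{2.44}
$$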

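which is the objective of Infinite Push (Agarwal, 2011). Further,

$$
R^{\mathrm{tp}}_{\ell}(f, \mathcal{D}) = \sum_{x^+ \in \mathcal{S}^+} \ell\left( f(x^+) - \max_{x^- \in \mathcal{S}^-} f(x^-) \right), \tag{2.45}
$$

which is the objective of Top Push, and $R^{\mathrm{tp}}_{\ell} = R^{\infty}_{\ell}$ if the convex surrogate of the 0/1 loss $\ell(\cdot)$ is non-increasing and differentiable (Li et al., 2014).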
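The relations among Eqs. (2.43) to (2.45) can be checked numerically. The following sketch assumes the exponential surrogate, which is non-increasing and differentiable; the function names and scores are illustrative only.

```python
import math


def exp_loss(z):
    """Exponential surrogate: exp(-z)."""
    return math.exp(-z)


def infinite_push(pos_scores, neg_scores, loss=exp_loss):
    """Eq. (2.44): the largest per-negative sum of losses over the positives."""
    return max(sum(loss(sp - sn) for sp in pos_scores) for sn in neg_scores)


def top_push(pos_scores, neg_scores, loss=exp_loss):
    """Eq. (2.45): loss of each positive against the highest-scoring negative."""
    top_neg = max(neg_scores)
    return sum(loss(sp - top_neg) for sp in pos_scores)


def p_norm_push_root(pos_scores, neg_scores, p, loss=exp_loss):
    """The p-th root of Eq. (2.43), which approaches Eq. (2.44) as p grows."""
    total = sum(sum(loss(sp - sn) for sp in pos_scores) ** p for sn in neg_scores)
    return total ** (1.0 / p)


pos = [2.0, 0.5]     # hypothetical scores f(x+)
neg = [1.0, -1.0]    # hypothetical scores f(x-)

print(math.isclose(infinite_push(pos, neg), top_push(pos, neg)))  # prints True
for p in (1.0, 2.0, 50.0):
    # The gap to the Infinite Push objective shrinks as p increases.
    print(p, p_norm_push_root(pos, neg, p) - infinite_push(pos, neg))
```

Because the loss is non-increasing, the negative with the highest score maximises the loss for every positive at once, so taking the maximum inside or outside the sum gives the same value.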

Intuitively, the Top Push penalises the scenario where any positive example is ranked below the highest-ranked negative example. If we penalise the scenario where any negative example is ranked above the lowest-ranked positive example, this results in the objective of Bottom Push (Rudin, 2009),

$$
R^{\mathrm{bp}}_{\ell}(f, \mathcal{D}) = \sum_{x^- \in \mathcal{S}^-} \ell\left( \min_{x^+ \in \mathcal{S}^+} f(x^+) - f(x^-) \right). \tag{2.46}
$$
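To contrast the two intuitions, the sketch below evaluates the Top Push and Bottom Push objectives with the tie-aware 0/1 loss substituted for the surrogate. This substitution is only for illustration, so that the values are simple counts; names and scores are hypothetical.

```python
def zero_one(z):
    """Tie-aware 0/1 loss on a margin z: 1 if z < 0, 1/2 if z == 0, else 0."""
    return 1.0 if z < 0 else (0.5 if z == 0 else 0.0)


def top_push(pos_scores, neg_scores, loss):
    """Eq. (2.45): compare every positive with the highest-scoring negative."""
    top_neg = max(neg_scores)
    return sum(loss(sp - top_neg) for sp in pos_scores)


def bottom_push(pos_scores, neg_scores, loss):
    """Eq. (2.46): compare every negative with the lowest-scoring positive."""
    bottom_pos = min(pos_scores)
    return sum(loss(bottom_pos - sn) for sn in neg_scores)


pos = [3.0, 2.0, 0.5]     # hypothetical scores f(x+)
neg = [2.5, -1.0, -2.0]   # hypothetical scores f(x-)

print(top_push(pos, neg, zero_one))     # prints 2.0
print(bottom_push(pos, neg, zero_one))  # prints 1.0
```

Two positives (2.0 and 0.5) fall below the top negative (2.5), while only one negative (2.5) sits above the bottom positive (0.5).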

It has been shown that the loss functions of binary classification and those of bipartite ranking are closely related (Ertekin and Rudin, 2011; Menon and Williamson, 2016). In particular, Ertekin and Rudin (2011) showed that the P-Norm Push and P-Classification share the same minimiser(s) when the scoring function is linear and the P-Norm Push employs the exponential surrogate loss $\ell(z) = e^{-z}$. In addition, they showed that RankBoost and Cost-Sensitive AdaBoost share the same minimiser(s). These results can be generalised to a parametric family of bipartite ranking and binary classification losses; see Appendix B for details.
