Risk Minimization for Multi-label Classification

2.2 Multi-label Classification

2.2.2 Risk Minimization for Multi-label Classification

The previous section discussed several evaluation measures commonly used in MLC. Although we would like to build MLC systems that perform well across multiple measures, this is very hard to achieve in general: a system that performs well under one evaluation measure may perform worse under another. In this section we discuss the relationship between models that are each trained to minimize a different loss function.

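The goal of MLC is to find an optimal function $f^*$ that minimizes the expected loss on an unknown sample drawn from $P(\mathbf{X}, \mathbf{Y})$:

$$f^* = \arg\min_f \mathbb{E}_{\mathbf{X}\mathbf{Y}}\left[\ell(\mathbf{Y}, f(\mathbf{X}))\right] = \arg\min_f \mathbb{E}_{\mathbf{X}}\,\mathbb{E}_{\mathbf{Y}|\mathbf{X}}\left[\ell(\mathbf{Y}, f(\mathbf{X}))\right]. \tag{2.25}$$

While the expected risk minimization over $P(\mathbf{X}, \mathbf{Y})$ is intractable, for a given observation $\mathbf{x}$ it can be simplified to

$$f^*(\mathbf{x}) = \arg\min_f \mathbb{E}_{\mathbf{Y}|\mathbf{X}}\left[\ell(\mathbf{Y}, f(\mathbf{x}))\right] = \arg\min_f \int \ell(\mathbf{Y}, f(\mathbf{x}))\, dP(\mathbf{Y}|\mathbf{X}=\mathbf{x}). \tag{2.26}$$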

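Let us consider two evaluation measures: HA and ACC. Whereas HA calculates the prediction accuracy per label independently, ACC rewards only predictions $\hat{\mathbf{y}}$ that match their targets $\mathbf{y}$ exactly. Since we want to minimize risk, let $\ell_h(\mathbf{y},\hat{\mathbf{y}})$ and $\ell_s(\mathbf{y},\hat{\mathbf{y}})$ be the Hamming loss and subset 0/1 loss, respectively, defined as

$$\ell_h(\mathbf{y},\hat{\mathbf{y}}) = \frac{1}{L}\sum_{j=1}^{L} \mathbb{I}\left[y_j \neq \hat{y}_j\right] \tag{2.27}$$

$$\ell_s(\mathbf{y},\hat{\mathbf{y}}) = \mathbb{I}\left[\mathbf{y} \neq \hat{\mathbf{y}}\right] \tag{2.28}$$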

where both $\mathbf{y}$ and $\hat{\mathbf{y}}$ are $L$-dimensional binary vectors. Using these loss functions, we denote the optimal functions in terms of the Hamming loss and subset 0/1 loss by

$$f_h^*(\mathbf{x}) = \arg\min_f \mathbb{E}_{\mathbf{Y}|\mathbf{X}}\left[\ell_h(\mathbf{Y}, f(\mathbf{x}))\right] \tag{2.29}$$

$$f_s^*(\mathbf{x}) = \arg\min_f \mathbb{E}_{\mathbf{Y}|\mathbf{X}}\left[\ell_s(\mathbf{Y}, f(\mathbf{x}))\right] \tag{2.30}$$

where $f_h^*(\mathbf{x})$ and $f_s^*(\mathbf{x})$ denote the Bayes classifiers in terms of the Hamming loss and subset 0/1 loss, respectively.
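The two losses in Eqs. (2.27) and (2.28) can be sketched in a few lines of Python. This is only an illustration: the function names and the example vectors are assumptions made here, not part of the original text.

```python
import numpy as np

def hamming_loss(y, y_hat):
    """Eq. (2.27): fraction of label positions where y and y_hat disagree."""
    y, y_hat = np.asarray(y), np.asarray(y_hat)
    return float(np.mean(y != y_hat))

def subset_zero_one_loss(y, y_hat):
    """Eq. (2.28): 1 if the two vectors differ in any position, else 0."""
    y, y_hat = np.asarray(y), np.asarray(y_hat)
    return float(np.any(y != y_hat))

# Illustrative target and prediction with L = 4 labels (made-up values).
y = [1, 0, 1, 0]
y_hat = [1, 0, 0, 0]

print(hamming_loss(y, y_hat))          # 0.25: one of four labels is wrong
print(subset_zero_one_loss(y, y_hat))  # 1.0: the vectors do not match exactly
```

A single wrong label costs only 1/L under the Hamming loss but the full penalty under the subset 0/1 loss, which is the asymmetry the rest of the section builds on.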

Let us begin with calculating the Bayes classifier with respect to the subset 0/1 loss. Since both targets $\mathbf{y}$ and predictions $\hat{\mathbf{y}}$ are binary (discrete) vectors, we can calculate the expected loss of the prediction $\hat{\mathbf{y}}$ that a function $f$ returns for a given $\mathbf{x}$ as follows

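\begin{align}
\mathbb{E}_{\mathbf{Y}|\mathbf{X}}\left[\ell_s(\mathbf{Y},\hat{\mathbf{y}})\right]
&= \sum_{\mathbf{y}} \ell_s(\mathbf{y},\hat{\mathbf{y}})\, P(\mathbf{Y}=\mathbf{y}|\mathbf{X}=\mathbf{x}) \nonumber\\
&= \sum_{\mathbf{y}} \left(1 - \mathbb{I}[\mathbf{y}=\hat{\mathbf{y}}]\right) P(\mathbf{Y}=\mathbf{y}|\mathbf{X}=\mathbf{x}) \nonumber\\
&= \sum_{\mathbf{y}} P(\mathbf{Y}=\mathbf{y}|\mathbf{X}=\mathbf{x}) - \sum_{\mathbf{y}} \mathbb{I}[\mathbf{y}=\hat{\mathbf{y}}]\, P(\mathbf{Y}=\mathbf{y}|\mathbf{X}=\mathbf{x}). \tag{2.31}
\end{align}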

The second term on the r.h.s. of Eq. (2.31) is a summation over $2^L$ label configurations. Because the function output $\hat{\mathbf{y}}$ is fixed once the function is given, we can split this term into two parts. One is the joint probability of the configuration $\mathbf{y}$ that is equal to $\hat{\mathbf{y}}$. The other is the sum over the remaining label combinations, each of which is multiplied by an indicator that is zero, so this part vanishes. More precisely, we can rewrite the second term on the r.h.s. of Eq. (2.31) as follows

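\begin{align}
\sum_{\mathbf{y}} \mathbb{I}[\mathbf{y}=\hat{\mathbf{y}}]\, P(\mathbf{Y}=\mathbf{y}|\mathbf{X}=\mathbf{x})
&= P(\mathbf{Y}=\hat{\mathbf{y}}|\mathbf{X}=\mathbf{x}) + \underbrace{\sum_{\mathbf{y}\neq\hat{\mathbf{y}}} \mathbb{I}[\mathbf{y}=\hat{\mathbf{y}}]\, P(\mathbf{Y}=\mathbf{y}|\mathbf{X}=\mathbf{x})}_{=0 \text{ because } \mathbb{I}[\mathbf{y}=\hat{\mathbf{y}}]=0,\ \forall \mathbf{y} \text{ in the sum}} \tag{2.32}\\
&= P(\mathbf{Y}=\hat{\mathbf{y}}|\mathbf{X}=\mathbf{x}). \tag{2.33}
\end{align}

Plugging Eq. (2.33) into Eq. (2.31) and using $\sum_{\mathbf{y}} P(\mathbf{Y}=\mathbf{y}|\mathbf{X}=\mathbf{x}) = 1$, we have

$$\mathbb{E}_{\mathbf{Y}|\mathbf{X}}\left[\ell_s(\mathbf{Y},\hat{\mathbf{y}})\right] = 1 - P(\mathbf{Y}=\hat{\mathbf{y}}|\mathbf{X}=\mathbf{x}). \tag{2.34}$$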

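Thus, the expected risk minimization in terms of the subset 0/1 loss is equivalent to finding a mode of the joint probability of labels $\mathbf{Y}$ given an instance $\mathbf{x}$, and the Bayes classifier is given by

$$f_s^*(\mathbf{x}) = \arg\max_f P(\mathbf{Y}=\hat{\mathbf{y}}|\mathbf{X}=\mathbf{x}). \tag{2.35}$$

Similarly, we can calculate the Bayes classifier in terms of the Hamming loss. Let us rewrite the expected risk in terms of the Hamming loss using the definition of the loss function as follows

\begin{align}
\mathbb{E}_{\mathbf{Y}|\mathbf{X}}\left[\ell_h(\mathbf{Y},\hat{\mathbf{y}})\right]
&= \sum_{\mathbf{y}} \ell_h(\mathbf{y},\hat{\mathbf{y}})\, P(\mathbf{Y}=\mathbf{y}|\mathbf{X}=\mathbf{x}) \nonumber\\
&= \frac{1}{L} \sum_{\substack{y_1,y_2,\cdots,y_L\\ y_j\in\{0,1\}}} \left(\ell_h^1(y_1,\hat{y}_1) + \cdots + \ell_h^L(y_L,\hat{y}_L)\right) P(\mathbf{Y}=\mathbf{y}|\mathbf{X}=\mathbf{x}) \tag{2.36}
\end{align}

where $\ell_h^j(y_j,\hat{y}_j) = \mathbb{I}[y_j \neq \hat{y}_j]$. Because the Hamming loss treats each label independently, each term $\ell_h^j$ depends only on $y_j$; summing the joint probability over the remaining labels leaves the marginal $P(Y_j = y_j|\mathbf{X}=\mathbf{x})$, so for this purpose the labels can be treated as if they were conditionally independent. We can therefore decompose the summation on the r.h.s. of Eq. (2.36) as follows

\begin{align}
\sum_{\substack{y_1,y_2,\cdots,y_L\\ y_j\in\{0,1\}}} \left(\ell_h^1(y_1,\hat{y}_1) + \cdots + \ell_h^L(y_L,\hat{y}_L)\right) P(\mathbf{Y}=\mathbf{y}|\mathbf{X}=\mathbf{x})
&= \sum_{y_1\in\{0,1\}} \left(1 - \mathbb{I}[y_1=\hat{y}_1]\right) P(Y_1=y_1|\mathbf{X}=\mathbf{x}) \nonumber\\
&\quad + \sum_{y_2\in\{0,1\}} \left(1 - \mathbb{I}[y_2=\hat{y}_2]\right) P(Y_2=y_2|\mathbf{X}=\mathbf{x}) \nonumber\\
&\quad + \cdots + \sum_{y_L\in\{0,1\}} \left(1 - \mathbb{I}[y_L=\hat{y}_L]\right) P(Y_L=y_L|\mathbf{X}=\mathbf{x}) \nonumber\\
&= L - \sum_{j=1}^{L} P(Y_j=\hat{y}_j|\mathbf{X}=\mathbf{x}). \tag{2.37}
\end{align}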

In turn, substituting Eq. (2.37) into Eq. (2.36) gives

$$\mathbb{E}_{\mathbf{Y}|\mathbf{X}}\left[\ell_h(\mathbf{Y},\hat{\mathbf{y}})\right] = 1 - \frac{1}{L}\sum_{j=1}^{L} P(Y_j=\hat{y}_j|\mathbf{X}=\mathbf{x}). \tag{2.38}$$

The expected risk minimization in terms of the Hamming loss is equivalent to finding the $L$ marginal modes of $Y_j$ given an instance $\mathbf{x}$ independently, and the Bayes classifier is given by

$$f_h^*(\mathbf{x}) = \left\{\arg\max_{\hat{y}_1} P(Y_1=\hat{y}_1|\mathbf{X}=\mathbf{x}),\ \cdots,\ \arg\max_{\hat{y}_L} P(Y_L=\hat{y}_L|\mathbf{X}=\mathbf{x})\right\}. \tag{2.39}$$
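The closed forms derived above can be checked numerically by brute-force enumeration. The sketch below assumes a small hypothetical distribution $P(\mathbf{Y}|\mathbf{x})$ over $L = 2$ labels (the numbers are made up for illustration) and hypothetical helper names. It confirms that the subset 0/1 risk of a prediction equals one minus its joint probability, that the Hamming risk matches Eq. (2.38), and then picks the minimizer of each risk.

```python
import itertools

def hamming(y, y_hat):
    return sum(a != b for a, b in zip(y, y_hat)) / len(y)

def subset01(y, y_hat):
    return float(y != y_hat)

def expected_loss(joint, y_hat, loss):
    """Brute-force expectation over all label vectors y, as in Eq. (2.31)."""
    return sum(p * loss(y, y_hat) for y, p in joint.items())

def marginal(joint, j, value):
    """P(Y_j = value | x), obtained by summing the joint over the other labels."""
    return sum(p for y, p in joint.items() if y[j] == value)

# Hypothetical P(Y | x) for L = 2 labels (illustrative numbers only).
joint = {(0, 0): 0.4, (0, 1): 0.0, (1, 0): 0.3, (1, 1): 0.3}
L = 2
candidates = list(itertools.product([0, 1], repeat=L))

for y_hat in candidates:
    # Subset 0/1 risk collapses to 1 - P(Y = y_hat | x), cf. Eqs. (2.31) and (2.33).
    assert abs(expected_loss(joint, y_hat, subset01) - (1 - joint[y_hat])) < 1e-12
    # Hamming risk collapses to the marginal form of Eq. (2.38).
    closed = 1 - sum(marginal(joint, j, y_hat[j]) for j in range(L)) / L
    assert abs(expected_loss(joint, y_hat, hamming) - closed) < 1e-12

f_s = min(candidates, key=lambda c: expected_loss(joint, c, subset01))
f_h = min(candidates, key=lambda c: expected_loss(joint, c, hamming))
print("subset 0/1 minimizer:", f_s)
print("Hamming minimizer:   ", f_h)
```

For this particular distribution the two minimizers differ: the joint mode is (0, 0), while the marginal modes give (1, 0).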

Whereas the Bayes classifier for the subset 0/1 loss $f_s^*(\mathbf{x})$ requires the joint probability distribution of labels, the Bayes classifier for the Hamming loss can be obtained from the marginal distributions of the individual labels alone, as Eq. (2.39) shows.

Table 2.5: The Bayes classifiers for the Hamming loss and subset 0/1 loss are identical if (a) labels are conditionally independent or (b) the probability of the joint mode of labels is greater than or equal to 0.5.

(a)
P(Y|x)       y1 = 0    y1 = 1    P(Y2|x)
y2 = 0       0.12      0.18      0.3
y2 = 1       0.28      0.42      0.7
P(Y1|x)      0.4       0.6       1

(b)
P(Y|x)       y1 = 0    y1 = 1    P(Y2|x)
y2 = 0       0.6       0.1       0.7
y2 = 1       0.1       0.2       0.3
P(Y1|x)      0.7       0.3       1

As shown already in Table 2.2, the mode of the joint distribution of labels may differ from the set of marginal modes of the individual labels. There are, however, two conditions under which the Bayes classifiers for the subset 0/1 loss and Hamming loss coincide. First, if all labels are conditionally independent given the instance, i.e., $P(Y_1, Y_2, \cdots, Y_L|\mathbf{x}) = \prod_{j=1}^{L} P(Y_j|\mathbf{x})$, then $f_s^*(\mathbf{x})$ and $f_h^*(\mathbf{x})$ are the same. Second, when the probability assigned to a single label configuration is greater than or equal to 0.5, i.e., $P(\mathbf{Y} = f_s^*(\mathbf{x})|\mathbf{X}=\mathbf{x}) \geq 0.5$, $f_s^*(\mathbf{x})$ and $f_h^*(\mathbf{x})$ also return the same output. Table 2.5 shows two examples of such probability distributions of labels.
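As a quick check, the two distributions of Table 2.5 can be evaluated directly. The sketch below assumes the joint entries are laid out with rows indexed by $y_2$ and columns by $y_1$ (the reading consistent with the marginals given in the table); the helper names are hypothetical.

```python
import numpy as np

# Joint probabilities from Table 2.5, stored as table[y2, y1] (assumed layout).
table_a = np.array([[0.12, 0.18],
                    [0.28, 0.42]])
table_b = np.array([[0.6, 0.1],
                    [0.1, 0.2]])

def joint_mode(table):
    """Label vector (y1, y2) with the highest joint probability."""
    y2, y1 = np.unravel_index(np.argmax(table), table.shape)
    return int(y1), int(y2)

def marginal_modes(table):
    """Label vector (y1, y2) built from the per-label marginal modes."""
    p_y1 = table.sum(axis=0)  # marginalize out y2
    p_y2 = table.sum(axis=1)  # marginalize out y1
    return int(np.argmax(p_y1)), int(np.argmax(p_y2))

def is_independent(table):
    """True if the joint equals the product of its marginals."""
    return bool(np.allclose(table, np.outer(table.sum(axis=1), table.sum(axis=0))))

for name, table in [("a", table_a), ("b", table_b)]:
    print(name, joint_mode(table), marginal_modes(table),
          "independent:", is_independent(table),
          "max joint prob:", table.max())
```

In case (a) the joint factorizes into its marginals, and in case (b) it does not but the joint mode carries probability 0.6; in both cases the joint mode and the marginal modes agree.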

We have seen that the Hamming loss and subset 0/1 loss lead to different optimal functions. One can be obtained by ignoring label dependence completely, whereas the other seeks the label configuration with the highest probability over the entire label space. Because of this difference, no single classifier can be expected to perform well across all measures. For example, Dembczyński et al. (2012b) have shown that the regret of the Hamming loss minimizer in terms of the subset 0/1 loss can be quite high, and vice versa.

To be more specific, let us consider the regret of the Bayes classifier for the Hamming loss in terms of the subset 0/1 loss. In other words, we compare the performance of $f_h^*$ and $f_s^*$ using the subset 0/1 loss. The upper bound of the regret is given by

$$\mathbb{E}_{\mathbf{Y}|\mathbf{X}}\left[\ell_s(\mathbf{Y}, f_h^*(\mathbf{x}))\right] - \mathbb{E}_{\mathbf{Y}|\mathbf{X}}\left[\ell_s(\mathbf{Y}, f_s^*(\mathbf{x}))\right] < 0.5. \tag{2.40}$$

Please note that the risks of $f_h^*$ and $f_s^*$ in terms of the subset 0/1 loss are identical when $P(\mathbf{Y} = f_s^*(\mathbf{x})|\mathbf{X}=\mathbf{x}) \geq 0.5$, so that the risk of $f_s^*$ is greater than 0.5 if $f_h^*$ differs from $f_s^*$. It is also worth noting that the risk of any classifier $f$ lies in $[0,1]$. Hence the difference between the two risks must be smaller than 0.5.
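To make this regret concrete, the sketch below uses a hypothetical distribution over $L = 3$ labels (the numbers are assumptions chosen for illustration) in which the Hamming-optimal prediction is a label vector that never occurs, so its subset 0/1 risk is 1.

```python
import numpy as np

# Hypothetical P(Y | x) over L = 3 labels: rows are the label vectors with
# non-zero probability, p holds their probabilities (illustrative values).
Y = np.array([[0, 1, 1],
              [1, 0, 1],
              [1, 1, 0]])
p = np.array([0.35, 0.33, 0.32])

f_s = Y[np.argmax(p)]             # joint mode: Bayes classifier for subset 0/1 loss
marg1 = p @ Y                     # P(Y_j = 1 | x) for each label j
f_h = (marg1 > 0.5).astype(int)   # marginal modes: Bayes classifier for Hamming loss

def prob_of(y_hat):
    """P(Y = y_hat | x); zero for vectors outside the support."""
    match = np.all(Y == y_hat, axis=1)
    return float(p[match].sum())

risk_s = 1.0 - prob_of(f_s)       # subset 0/1 risk of f_s*
risk_h = 1.0 - prob_of(f_h)       # subset 0/1 risk of f_h*
regret = risk_h - risk_s

print("f_s* =", f_s, " f_h* =", f_h)
print("regret in subset 0/1 loss:", regret)
```

Here every label is present in two of the three possible configurations, so each marginal mode is 1, yet the all-ones vector has probability zero. The regret stays below 0.5, as Eq. (2.40) requires.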

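The regret of $f_s^*$ in terms of the Hamming loss has the following upper bound for $L > 3$:

$$\mathbb{E}_{\mathbf{Y}|\mathbf{X}}\left[\ell_h(\mathbf{Y}, f_s^*(\mathbf{x}))\right] - \mathbb{E}_{\mathbf{Y}|\mathbf{X}}\left[\ell_h(\mathbf{Y}, f_h^*(\mathbf{x}))\right] < \frac{L-2}{L+2}. \tag{2.41}$$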

For more details, please refer to (Dembczyński et al., 2012b).
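A rough numerical sanity check of the bound in Eq. (2.41) is sketched below for $L = 4$. It samples random label distributions and compares the Hamming regret of the joint mode against $(L-2)/(L+2)$. The sampling scheme (a Dirichlet prior with a fixed seed) and the function name are assumptions made for this illustration only, and random sampling says nothing about how tight the bound is.

```python
import itertools
import numpy as np

def hamming_regret_of_joint_mode(p, Y):
    """Hamming risk of the joint mode minus Hamming risk of the marginal modes."""
    marg1 = p @ Y                      # P(Y_j = 1 | x) for each label j
    f_h = (marg1 > 0.5).astype(int)    # marginal modes
    f_s = Y[np.argmax(p)]              # joint mode

    def risk(y_hat):
        # Eq. (2.38): 1 - average marginal probability of the predicted values
        agree = np.where(y_hat == 1, marg1, 1.0 - marg1)
        return 1.0 - agree.mean()

    return risk(f_s) - risk(f_h)

L = 4
Y = np.array(list(itertools.product([0, 1], repeat=L)))  # all 2^L label vectors
bound = (L - 2) / (L + 2)

rng = np.random.default_rng(0)  # fixed seed, arbitrary choice
regrets = [
    hamming_regret_of_joint_mode(rng.dirichlet(np.full(len(Y), 0.2)), Y)
    for _ in range(1000)
]
print("largest sampled regret:", max(regrets), " bound:", bound)
```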

In this section, we have shown that a function that is optimal for one evaluation measure may perform worse in terms of another. Hence, it is crucial to decide which evaluation measure is to be optimized and to make sure that the training objective of an MLC system is consistent with the measure of interest.

In document Learning Label Structures with Neural Networks for Multi-label Classification (Page 30-33)