2.2 Multi-label Classification
2.2.2 Risk Minimization for Multi-label Classification
The previous section discussed several evaluation measures commonly used in MLC. Although we would like to build MLC systems that perform well across multiple measures, this is difficult to achieve in general: a system that performs well in terms of one evaluation measure may perform worse in terms of another. In this section, we discuss the relationship between models that are trained to minimize different loss functions.
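The goal of MLC is to find an optimal function $f^*$ that minimizes the expected loss on an unknown sample drawn from $P(\mathbf{X}, \mathbf{Y})$:
$$ f^* = \arg\min_{f} \mathbb{E}_{\mathbf{X}\mathbf{Y}}\left[\ell(\mathbf{Y}, f(\mathbf{X}))\right] = \arg\min_{f} \mathbb{E}_{\mathbf{X}} \mathbb{E}_{\mathbf{Y}|\mathbf{X}}\left[\ell(\mathbf{Y}, f(\mathbf{X}))\right]. \tag{2.25} $$
While expected risk minimization over $P(\mathbf{X}, \mathbf{Y})$ is intractable, for a given observation $\mathbf{x}$ it can be simplified to
$$ f^*(\mathbf{x}) = \arg\min_{f} \mathbb{E}_{\mathbf{Y}|\mathbf{X}}\left[\ell(\mathbf{Y}, f(\mathbf{x}))\right] = \arg\min_{f} \int \ell(\mathbf{Y}, f(\mathbf{x})) \, dP(\mathbf{Y} \mid \mathbf{X} = \mathbf{x}). \tag{2.26} $$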
Let us consider two evaluation measures: HA and ACC. Whereas HA calculates the prediction accuracy per label independently, ACC rewards only predictions $\hat{\mathbf{y}}$ that match their targets $\mathbf{y}$ exactly. Since we want to minimize risk, let $\ell_h(\mathbf{y}, \hat{\mathbf{y}})$ and $\ell_s(\mathbf{y}, \hat{\mathbf{y}})$ be the Hamming loss and subset 0/1 loss, respectively:
$$ \ell_h(\mathbf{y}, \hat{\mathbf{y}}) = \frac{1}{L} \sum_{j=1}^{L} \mathbb{I}[y_j \neq \hat{y}_j] \tag{2.27} $$
$$ \ell_s(\mathbf{y}, \hat{\mathbf{y}}) = \mathbb{I}[\mathbf{y} \neq \hat{\mathbf{y}}] \tag{2.28} $$
where both $\mathbf{y}$ and $\hat{\mathbf{y}}$ are $L$-dimensional binary vectors. Using these loss functions, we denote the optimal functions in terms of the Hamming loss and subset 0/1 loss by
$$ f_h^*(\mathbf{x}) = \arg\min_{f} \mathbb{E}_{\mathbf{Y}|\mathbf{X}}\left[\ell_h(\mathbf{Y}, f(\mathbf{x}))\right] \tag{2.29} $$
$$ f_s^*(\mathbf{x}) = \arg\min_{f} \mathbb{E}_{\mathbf{Y}|\mathbf{X}}\left[\ell_s(\mathbf{Y}, f(\mathbf{x}))\right] \tag{2.30} $$
where $f_h^*(\mathbf{x})$ and $f_s^*(\mathbf{x})$ denote the Bayes classifiers in terms of the Hamming loss and subset 0/1 loss, respectively.
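As a minimal sketch of the two loss functions in Eqs. (2.27) and (2.28), the following Python snippet evaluates both on a single target and prediction. The function names `hamming_loss` and `subset_zero_one_loss` and the example vectors are chosen here for illustration and do not come from the text.

```python
from typing import Sequence


def hamming_loss(y: Sequence[int], y_hat: Sequence[int]) -> float:
    """Fraction of label positions where target and prediction disagree (Eq. 2.27)."""
    if len(y) != len(y_hat):
        raise ValueError("y and y_hat must have the same length")
    return sum(int(a != b) for a, b in zip(y, y_hat)) / len(y)


def subset_zero_one_loss(y: Sequence[int], y_hat: Sequence[int]) -> int:
    """1 if the prediction differs from the target in any position, else 0 (Eq. 2.28)."""
    if len(y) != len(y_hat):
        raise ValueError("y and y_hat must have the same length")
    return int(tuple(y) != tuple(y_hat))


# Illustrative target and prediction with L = 4 labels (made-up values).
y = (1, 0, 1, 1)
y_hat = (1, 0, 0, 1)

print(hamming_loss(y, y_hat))          # 0.25: one of four labels is wrong
print(subset_zero_one_loss(y, y_hat))  # 1: the vectors do not match exactly
```

A single wrong label costs only $1/L$ under the Hamming loss but the full penalty under the subset 0/1 loss, which is the asymmetry the rest of this section analyzes.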
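Let us begin by calculating the Bayes classifier with respect to the subset 0/1 loss. Since both targets $\mathbf{y}$ and predictions $\hat{\mathbf{y}}$ are binary (discrete) vectors, the expected loss of the prediction $\hat{\mathbf{y}}$ that a function $f$ returns for a given $\mathbf{x}$ is
$$ \begin{aligned} \mathbb{E}_{\mathbf{Y}|\mathbf{X}}\left[\ell_s(\mathbf{Y}, \hat{\mathbf{y}})\right] &= \sum_{\mathbf{y}} \ell_s(\mathbf{y}, \hat{\mathbf{y}}) P(\mathbf{Y} = \mathbf{y} \mid \mathbf{X} = \mathbf{x}) \\ &= \sum_{\mathbf{y}} \left(1 - \mathbb{I}[\mathbf{y} = \hat{\mathbf{y}}]\right) P(\mathbf{Y} = \mathbf{y} \mid \mathbf{X} = \mathbf{x}) \\ &= \sum_{\mathbf{y}} P(\mathbf{Y} = \mathbf{y} \mid \mathbf{X} = \mathbf{x}) - \sum_{\mathbf{y}} \mathbb{I}[\mathbf{y} = \hat{\mathbf{y}}] P(\mathbf{Y} = \mathbf{y} \mid \mathbf{X} = \mathbf{x}). \end{aligned} \tag{2.31} $$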
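The second term on the r.h.s. of Eq. (2.31) is a summation over $2^L$ label configurations. Since the output $\hat{\mathbf{y}}$ is fixed for a given function and input, we can split this term into two parts. One is the probability of the single configuration $\mathbf{y}$ that is equal to $\hat{\mathbf{y}}$. The other is the sum over the remaining label configurations, which is equal to zero. More precisely, we can rewrite the second term on the r.h.s. of Eq. (2.31) as follows
\begin{align} \sum_{\mathbf{y}} \mathbb{I}[\mathbf{y} = \hat{\mathbf{y}}] P(\mathbf{Y} = \mathbf{y} \mid \mathbf{X} = \mathbf{x}) &= P(\mathbf{Y} = \hat{\mathbf{y}} \mid \mathbf{X} = \mathbf{x}) + \underbrace{\sum_{\mathbf{y} \neq \hat{\mathbf{y}}} \mathbb{I}[\mathbf{y} = \hat{\mathbf{y}}] P(\mathbf{Y} = \mathbf{y} \mid \mathbf{X} = \mathbf{x})}_{=0 \text{ because } \mathbb{I}[\mathbf{y} = \hat{\mathbf{y}}] = 0,\ \forall \mathbf{y} \text{ in the sum}} \tag{2.32} \\ &= P(\mathbf{Y} = \hat{\mathbf{y}} \mid \mathbf{X} = \mathbf{x}). \tag{2.33} \end{align}
Plugging Eq. (2.33) into Eq. (2.31), and using $\sum_{\mathbf{y}} P(\mathbf{Y} = \mathbf{y} \mid \mathbf{X} = \mathbf{x}) = 1$, we have
$$ \mathbb{E}_{\mathbf{Y}|\mathbf{X}}\left[\ell_s(\mathbf{Y}, \hat{\mathbf{y}})\right] = 1 - P(\mathbf{Y} = \hat{\mathbf{y}} \mid \mathbf{X} = \mathbf{x}). \tag{2.34} $$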
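Thus, expected risk minimization in terms of the subset 0/1 loss is equivalent to finding a mode of the joint probability of labels $\mathbf{Y}$ given an instance $\mathbf{x}$, and the Bayes classifier is given by
$$ f_s^*(\mathbf{x}) = \arg\max_{f} P(\mathbf{Y} = \hat{\mathbf{y}} \mid \mathbf{X} = \mathbf{x}), \tag{2.35} $$
where $\hat{\mathbf{y}} = f(\mathbf{x})$.
Similarly, we can calculate the Bayes classifier in terms of the Hamming loss. Let us rewrite the expected risk in terms of the Hamming loss using the definition of the loss function:
$$ \begin{aligned} \mathbb{E}_{\mathbf{Y}|\mathbf{X}}\left[\ell_h(\mathbf{Y}, \hat{\mathbf{y}})\right] &= \sum_{\mathbf{y}} \ell_h(\mathbf{y}, \hat{\mathbf{y}}) P(\mathbf{Y} = \mathbf{y} \mid \mathbf{X} = \mathbf{x}) \\ &= \frac{1}{L} \sum_{\substack{y_1, y_2, \cdots, y_L \\ y_j \in \{0,1\}}} \left(\ell_h^1(y_1, \hat{y}_1) + \cdots + \ell_h^L(y_L, \hat{y}_L)\right) P(\mathbf{Y} = \mathbf{y} \mid \mathbf{X} = \mathbf{x}) \end{aligned} \tag{2.36} $$
where $\ell_h^j(y_j, \hat{y}_j) = \mathbb{I}[y_j \neq \hat{y}_j]$. Because the Hamming loss is a sum of per-label terms, each term depends on a single label only, so summing over the remaining labels leaves the marginal distribution $P(Y_j \mid \mathbf{X} = \mathbf{x})$. This step does not require the labels to be conditionally independent. We can therefore factorize the summation on the r.h.s. of Eq. (2.36) as follows
$$ \begin{aligned} &\sum_{\substack{y_1, y_2, \cdots, y_L \\ y_j \in \{0,1\}}} \left(\ell_h^1(y_1, \hat{y}_1) + \cdots + \ell_h^L(y_L, \hat{y}_L)\right) P(\mathbf{Y} = \mathbf{y} \mid \mathbf{X} = \mathbf{x}) \\ &\quad = \sum_{y_1 \in \{0,1\}} \left(1 - \mathbb{I}[y_1 = \hat{y}_1]\right) P(Y_1 = y_1 \mid \mathbf{X} = \mathbf{x}) + \sum_{y_2 \in \{0,1\}} \left(1 - \mathbb{I}[y_2 = \hat{y}_2]\right) P(Y_2 = y_2 \mid \mathbf{X} = \mathbf{x}) \\ &\qquad + \cdots + \sum_{y_L \in \{0,1\}} \left(1 - \mathbb{I}[y_L = \hat{y}_L]\right) P(Y_L = y_L \mid \mathbf{X} = \mathbf{x}) \\ &\quad = L - \sum_{j=1}^{L} P(Y_j = \hat{y}_j \mid \mathbf{X} = \mathbf{x}). \end{aligned} \tag{2.37} $$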
Substituting Eq. (2.37) into Eq. (2.36) gives
$$ \mathbb{E}_{\mathbf{Y}|\mathbf{X}}\left[\ell_h(\mathbf{Y}, \hat{\mathbf{y}})\right] = 1 - \frac{1}{L} \sum_{j=1}^{L} P(Y_j = \hat{y}_j \mid \mathbf{X} = \mathbf{x}). \tag{2.38} $$
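The closed form in Eq. (2.38) can be checked numerically against a brute-force expectation over all $2^L$ label configurations. The sketch below does this for a small joint distribution with $L = 3$ whose probabilities are made up for illustration. It also checks the analogous closed form for the subset 0/1 loss, $1 - P(\mathbf{Y} = \hat{\mathbf{y}} \mid \mathbf{X} = \mathbf{x})$, which follows from Eqs. (2.31) and (2.33). All function names are hypothetical.

```python
from itertools import product

# Hypothetical joint distribution P(Y = y | X = x) over L = 3 binary labels.
# The numbers are invented for illustration and sum to 1.
joint = {
    (0, 0, 0): 0.10, (0, 0, 1): 0.05, (0, 1, 0): 0.20, (0, 1, 1): 0.15,
    (1, 0, 0): 0.05, (1, 0, 1): 0.25, (1, 1, 0): 0.10, (1, 1, 1): 0.10,
}
L = 3


def marginal(j, value):
    """Marginal probability P(Y_j = value | X = x), obtained by summing out the other labels."""
    return sum(p for y, p in joint.items() if y[j] == value)


def expected_hamming_brute(y_hat):
    """Expected Hamming loss computed directly from its definition."""
    return sum(p * sum(a != b for a, b in zip(y, y_hat)) / L for y, p in joint.items())


def expected_hamming_closed(y_hat):
    """Expected Hamming loss via the closed form of Eq. (2.38)."""
    return 1.0 - sum(marginal(j, y_hat[j]) for j in range(L)) / L


def expected_subset_brute(y_hat):
    """Expected subset 0/1 loss computed directly from its definition."""
    return sum(p * (y != y_hat) for y, p in joint.items())


def expected_subset_closed(y_hat):
    """Expected subset 0/1 loss as 1 - P(Y = y_hat | X = x)."""
    return 1.0 - joint[y_hat]


# Compare both computations for every possible prediction.
for y_hat in product((0, 1), repeat=L):
    assert abs(expected_hamming_brute(y_hat) - expected_hamming_closed(y_hat)) < 1e-12
    assert abs(expected_subset_brute(y_hat) - expected_subset_closed(y_hat)) < 1e-12

print("closed forms agree with brute-force expectations")
```

The labels in this distribution are not conditionally independent, yet the Hamming identity still holds, since only the marginals enter Eq. (2.38).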
Expected risk minimization in terms of the Hamming loss is equivalent to finding the $L$ marginal modes of $Y_j$ given an instance $\mathbf{x}$ independently, and the Bayes classifier is given by
$$ f_h^*(\mathbf{x}) = \left\{ \arg\max_{f_1} P(Y_1 = \hat{y}_1 \mid \mathbf{X} = \mathbf{x}), \cdots, \arg\max_{f_L} P(Y_L = \hat{y}_L \mid \mathbf{X} = \mathbf{x}) \right\}. \tag{2.39} $$
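Whereas the Bayes classifier for the subset 0/1 loss $f_s^*(\mathbf{x})$ requires the joint probability distribution of labels, the Bayes classifier for the Hamming loss can be obtained from the marginal distributions $P(Y_j \mid \mathbf{X} = \mathbf{x})$ of the individual labels alone.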
Table 2.5: The Bayes classifiers for the Hamming loss and subset 0/1 loss are identical if (a) labels are conditionally independent or (b) the probability of the joint mode of labels is greater than or equal to 0.5.
(a)
P(Y|x)       y1 = 0    y1 = 1    P(Y2|x)
y2 = 0       0.12      0.18      0.3
y2 = 1       0.28      0.42      0.7
P(Y1|x)      0.4       0.6       1

(b)
P(Y|x)       y1 = 0    y1 = 1    P(Y2|x)
y2 = 0       0.6       0.1       0.7
y2 = 1       0.1       0.2       0.3
P(Y1|x)      0.7       0.3       1
As already shown in Table 2.2, the mode of the joint distribution of labels may differ from the set of marginal modes of the individual labels. There are, however, two conditions under which the Bayes classifiers for the subset 0/1 loss and Hamming loss coincide. First, if all labels are conditionally independent given an instance, i.e., $P(Y_1, Y_2, \cdots, Y_L \mid \mathbf{x}) = \prod_{j=1}^{L} P(Y_j \mid \mathbf{x})$, then $f_h^*(\mathbf{x})$ and $f_s^*(\mathbf{x})$ are the same. Second, if the probability assigned to a single label configuration is greater than or equal to 0.5, i.e., $P(\mathbf{Y} = f_s^*(\mathbf{x}) \mid \mathbf{X} = \mathbf{x}) \geq 0.5$, then $f_h^*(\mathbf{x})$ and $f_s^*(\mathbf{x})$ also return the same output. Table 2.5 shows one example of each kind of probability distribution.
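The two cases of Table 2.5 can be verified with a few lines of Python. The sketch below encodes both joint distributions, reading the rows of the table as $y_2$ and the columns as $y_1$, and compares the joint mode with the vector of marginal modes. The helper names `joint_mode` and `marginal_modes` are hypothetical, and breaking ties at exactly 0.5 in favor of 0 is an arbitrary choice made here.

```python
# Joint distributions from Table 2.5, keyed by (y1, y2).
table_a = {(0, 0): 0.12, (1, 0): 0.18, (0, 1): 0.28, (1, 1): 0.42}
table_b = {(0, 0): 0.6, (1, 0): 0.1, (0, 1): 0.1, (1, 1): 0.2}


def joint_mode(joint):
    """Label configuration with the highest joint probability (subset 0/1 minimizer)."""
    return max(joint, key=joint.get)


def marginal_modes(joint):
    """Per-label modes of the marginals (Hamming minimizer); ties at 0.5 go to 0."""
    num_labels = len(next(iter(joint)))
    modes = []
    for j in range(num_labels):
        p_one = sum(p for y, p in joint.items() if y[j] == 1)
        modes.append(1 if p_one > 0.5 else 0)
    return tuple(modes)


print(joint_mode(table_a), marginal_modes(table_a))  # (1, 1) (1, 1)
print(joint_mode(table_b), marginal_modes(table_b))  # (0, 0) (0, 0)
```

In case (a) the joint mode has probability 0.42, below 0.5, so the agreement is due to conditional independence alone. In case (b) the labels are dependent, and the agreement is due to the joint mode having probability 0.6.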
We have seen that the Hamming loss and subset 0/1 loss lead to different optimal functions. The former can be obtained by ignoring label dependence completely, whereas the latter seeks the label configuration with the highest probability over the entire label space. Because of this difference, there is in general no universal classifier that is optimal across all measures. For example, Dembczyński et al. (2012b) have shown that the regret of the Bayes classifier for the Hamming loss in terms of the subset 0/1 loss can be high, and vice versa.
To be more specific, let us consider the regret of the Bayes classifier for the Hamming loss in terms of the subset 0/1 loss. In other words, we compare the performance of $f_h^*$ and $f_s^*$ using the subset 0/1 loss. The upper bound of the regret is given by
$$ \mathbb{E}_{\mathbf{Y}|\mathbf{X}}\left[\ell_s(\mathbf{Y}, f_h^*(\mathbf{x}))\right] - \mathbb{E}_{\mathbf{Y}|\mathbf{X}}\left[\ell_s(\mathbf{Y}, f_s^*(\mathbf{x}))\right] < 0.5. \tag{2.40} $$
Note that the risks of $f_h^*$ and $f_s^*$ in terms of the subset 0/1 loss are identical when $P(\mathbf{Y} = f_s^*(\mathbf{x}) \mid \mathbf{X} = \mathbf{x}) \geq 0.5$, so the risk of $f_s^*$ is greater than 0.5 whenever $f_h^*$ differs from $f_s^*$. It is also worth noting that the risk of any classifier $f$ lies in $[0, 1]$. Together, these two facts imply the bound in Eq. (2.40).
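To make the notion of regret concrete, the following sketch computes both regrets for a small joint distribution over $L = 2$ labels in which the joint mode and the marginal modes disagree. The distribution is invented for illustration and is not taken from the text, and the variable names are hypothetical. The closed-form risks used here are $1 - P(\mathbf{Y} = \hat{\mathbf{y}} \mid \mathbf{X} = \mathbf{x})$ for the subset 0/1 loss and Eq. (2.38) for the Hamming loss.

```python
# Hypothetical joint distribution P(Y = y | X = x), keyed by (y1, y2).
joint = {(0, 0): 0.4, (0, 1): 0.0, (1, 0): 0.3, (1, 1): 0.3}
L = 2


def subset_risk(y_hat):
    """Expected subset 0/1 loss of predicting y_hat."""
    return 1.0 - joint[y_hat]


def hamming_risk(y_hat):
    """Expected Hamming loss of predicting y_hat, following Eq. (2.38)."""
    agree = sum(
        sum(p for y, p in joint.items() if y[j] == y_hat[j]) for j in range(L)
    )
    return 1.0 - agree / L


# Bayes classifier for the subset 0/1 loss: the joint mode.
f_s = max(joint, key=joint.get)
# Bayes classifier for the Hamming loss: the marginal mode of each label.
f_h = tuple(
    int(sum(p for y, p in joint.items() if y[j] == 1) > 0.5) for j in range(L)
)

subset_regret = subset_risk(f_h) - subset_risk(f_s)
hamming_regret = hamming_risk(f_s) - hamming_risk(f_h)

print(f_s, f_h)                  # (0, 0) (1, 0)
print(round(subset_regret, 6))   # 0.1
print(round(hamming_regret, 6))  # 0.1

# The subset 0/1 regret of the Hamming minimizer stays below 0.5.
assert 0.0 <= subset_regret < 0.5
```

Here the joint mode $(0, 0)$ has probability 0.4, below 0.5, which is why the two Bayes classifiers can differ. Each classifier is optimal for its own loss and incurs a positive regret under the other.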
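The regret of $f_s^*$ in terms of the Hamming loss has the following upper bound for $L > 3$:
$$ \mathbb{E}_{\mathbf{Y}|\mathbf{X}}\left[\ell_h(\mathbf{Y}, f_s^*(\mathbf{x}))\right] - \mathbb{E}_{\mathbf{Y}|\mathbf{X}}\left[\ell_h(\mathbf{Y}, f_h^*(\mathbf{x}))\right] < \frac{L-2}{L+2}. \tag{2.41} $$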
For more details, please refer to Dembczyński et al. (2012b).
In this section, we have shown that an optimal function for a certain evaluation measure may perform worse in terms of another measure. Hence, it is crucial to determine which evaluation measure is to be optimized and to make sure that the training objective of an MLC system is consistent with the measure of interest.