4.2 Two Generic Extensions of the ML-CFS Method
4.2.1 ML-CFS Using the Absolute Value of Correlation Coefficient
In the original multi-label ML-CFS method described in Section 4.1 and the original single-label CFS method [8], Pearson’s linear correlation coefficient ($r$) was used to estimate the terms $r_{FF}$ and $r_{FL}$ in Equation (4.1). In general, there are two types of correlation: positive and negative. Both can represent redundancy between a pair of features, or the relevance of a feature for predicting a set of labels. For the purpose of measuring redundancy between two features, what matters is the absolute value of the correlation coefficient, regardless of its sign. For example, both $r = +0.8$ and $r = -0.8$ represent a strong degree of redundancy. However, in the original single-label and multi-label CFS methods, the values of the merit formulas depend on both the magnitude and the sign of $r$. If a feature subset contains, say, one pair of features with $r = +0.8$ and another pair with $r = -0.8$, these two values cancel each other out, giving an average $r$ over those two feature pairs of 0. This value is misleading, since the two $r$ values actually indicate a large degree of redundancy in each of those feature pairs.
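As a quick check of the cancellation argument, using only the two values from the example above, the signed average and the average of absolute values are:

$$\frac{(+0.8) + (-0.8)}{2} = 0, \qquad \frac{|{+0.8}| + |{-0.8}|}{2} = 0.8$$

The second quantity reflects the strong redundancy that is present in both feature pairs.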
To avoid this problem, we use the absolute (unsigned) value of the correlation coefficient in all occurrences of $r$ in Equation (4.1) when calculating the average correlation between features in a feature subset $F$ ($r_{FF}$) and the average correlation between features and labels ($r_{FL}$). Hence, the average correlation between features in a feature subset $F$ is computed by Equation (4.6), where $fp$ is the number of feature pairs in $F$. The average correlation between features and labels is given by Equation (4.7), which uses Equation (4.8) to compute the average correlation between each single feature and all labels. Note that $r_{f_i f_j}$ and $r_{fL}$ return a value in $[0, 1]$.
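$$r_{FF} = \frac{\sum_{f_i=1,\, f_j=1,\, i \neq j}^{|F|} r_{f_i f_j}}{fp} \tag{4.6}$$

$$r_{FL} = \frac{\sum_{f=1}^{|F|} r_{fL}}{|F|} \tag{4.7}$$

$$r_{fL} = \frac{\sum_{i=1}^{|L|} |r_{fL_i}|}{|L|} \tag{4.8}$$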
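The averages in Equations (4.6) to (4.8) can be sketched in plain Python as follows. This is an illustrative sketch rather than the implementation used in this work: the function names and the toy data are hypothetical, each unordered feature pair is assumed to be counted once in Equation (4.6), and a constant column is assumed to have zero correlation.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson's linear correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: a constant column is treated as uncorrelated
    return cov / math.sqrt(var_x * var_y)


def avg_feature_feature(features):
    """Equation (4.6): mean |r| over feature pairs (each unordered pair once)."""
    pairs = list(combinations(features, 2))
    if not pairs:
        return 0.0
    return sum(abs(pearson(a, b)) for a, b in pairs) / len(pairs)


def feature_label(feature, labels):
    """Equation (4.8): mean |r| between one feature and every label."""
    return sum(abs(pearson(feature, lab)) for lab in labels) / len(labels)


def avg_feature_label(features, labels):
    """Equation (4.7): mean of Equation (4.8) over the features in the subset."""
    return sum(feature_label(f, labels) for f in features) / len(features)


# Hypothetical toy data: f2 rises with f1, f3 falls as f1 rises.
f1 = [1.0, 2.0, 3.0, 4.0]
f2 = [2.0, 4.0, 6.0, 8.0]
f3 = [4.0, 3.0, 2.0, 1.0]
l1 = [0, 0, 1, 1]
l2 = [1, 1, 0, 0]

pairs = list(combinations([f1, f2, f3], 2))
signed_mean = sum(pearson(a, b) for a, b in pairs) / len(pairs)
print("signed average r:", signed_mean)
print("absolute average r:", avg_feature_feature([f1, f2, f3]))
print("feature-label average:", avg_feature_label([f1, f3], [l1, l2]))
```

In this toy subset every pair of features is perfectly correlated in magnitude, so the absolute average reports full redundancy while the signed average is pulled towards zero by the mixed signs.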
4.2.2 ML-CFS Using Mutual Information for Class Label Weighting
In the original ML-CFS method, Equation (4.3) computes, for a given feature $f$, the arithmetic average of the correlation between that feature and a class label over all labels, implicitly assuming that all labels are equally relevant and ignoring dependencies between labels. However, in real-world datasets there might be a significant degree of dependence between some labels, where the occurrence of one label increases the probability of another label for a given instance. For example, in multi-label classification of emotions in a music dataset, the class label ‘Sadness’ might be more correlated with the class label ‘Depressing’ than with the class label ‘Cheerful’. The correlation between labels is important in multi-label classification [122]. If the labels were independent of each other, we could simply transform a multi-label problem into a set of single-label problems using the binary relevance method. However, when there are strong dependences among labels in the data, an approach that ignores label correlations, like binary relevance or computing the arithmetic average of correlations across all labels, may not be sufficient to cope well with the label-dependence problem.
To take label dependences into account, we use mutual information (MI) to measure the degree of dependence between each pair of labels. We use MI, rather than Pearson’s correlation coefficient, because labels are nominal rather than numerical, and MI is often used to measure dependencies between nominal variables in feature selection [71][65][26]. If the MI between two variables is near zero, the variables are close to independent.
The mutual information $MI(X;Y)$ between the random variables (class attributes) $X$ and $Y$ is given by Equation (4.9), where $p(x,y)$ denotes the joint probability of class labels $x$ and $y$, $p(x)$ denotes the marginal probability of $x$, the logarithm is in base 2, and the summation is over all values of the variables $X$ and $Y$. To use MI as a measure of label dependence, we first compute the average MI of each label $L_i$, $AvgMI(L_i)$, as defined in Equation (4.10). This is simply the mean of the MI between label $L_i$ and each of the other class labels $L_j$ ($j \neq i$).
$$MI(X;Y) = \sum_{x} \sum_{y} p(x,y) \log \frac{p(x,y)}{p(x)\,p(y)} \tag{4.9}$$

$$AvgMI(L_i) = \frac{\sum_{j=1,\, j \neq i}^{|L|} MI(L_i; L_j)}{|L| - 1} \tag{4.10}$$
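Equations (4.9) and (4.10) can be sketched in plain Python as follows, under the assumption that probabilities are estimated as relative frequencies over the instances. The function names and the three toy label columns (named after the music-emotion example above) are hypothetical and carry no real data.

```python
import math
from collections import Counter


def mutual_information(x, y):
    """Equation (4.9): MI in bits between two nominal variables.

    Probabilities are estimated as relative frequencies (an assumption).
    Value pairs that never co-occur contribute nothing to the sum.
    """
    n = len(x)
    joint = Counter(zip(x, y))
    count_x = Counter(x)
    count_y = Counter(y)
    mi = 0.0
    for (a, b), count in joint.items():
        p_ab = count / n
        p_a = count_x[a] / n
        p_b = count_y[b] / n
        mi += p_ab * math.log2(p_ab / (p_a * p_b))
    return mi


def avg_mi(labels):
    """Equation (4.10): for each label, the mean MI with every other label."""
    k = len(labels)
    if k < 2:
        raise ValueError("at least two labels are needed")
    return [
        sum(mutual_information(labels[i], labels[j]) for j in range(k) if j != i)
        / (k - 1)
        for i in range(k)
    ]


# Hypothetical binary label columns, one entry per instance.
sadness = [1, 1, 0, 0]
depressing = [1, 1, 0, 0]  # always co-occurs with 'sadness'
cheerful = [1, 0, 1, 0]    # independent of the other two in this toy sample

print(avg_mi([sadness, depressing, cheerful]))
```

In this toy sample the two identical labels share one bit of information and the third label shares none with either, so the first two labels obtain an average MI of 0.5 and the third an average MI of 0.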
The $AvgMI(L_i)$ value for each label $L_i$ can then be used to modify the Merit function as follows. When computing the correlation between a feature and a set of labels, Equation (4.3) is extended by assigning a different weight to each feature-label correlation term (one per label $L_i$), where the weights are based on the AvgMI values computed by Equation (4.10). We investigated two opposite approaches to assigning such weights, based on two opposite rationales, as follows.
On the one hand, it could be argued that a greater weight should be assigned to feature-label correlations involving labels with greater AvgMI values. The rationale is that, if a given label $L_i$ is highly correlated with the other labels (i.e., $AvgMI(L_i)$ is large), one should reward features which are strong predictors of that label, because a multi-label classification algorithm exploiting label correlations could use an accurate prediction of that label to improve the accuracy of the prediction of other labels. Hence, one approach investigated in this work is to extend Equation (4.3) with Equation (4.11).

$$r_{fL} = \frac{\sum_{i=1}^{|L|} |r_{fL_i}| \, AvgMI(L_i)}{\sum_{i=1}^{|L|} AvgMI(L_i)} \tag{4.11}$$

On the other hand, it could be argued that a greater weight should be assigned to feature-label correlations involving labels with smaller AvgMI values. The rationale is that, if a given label $L_i$ is weakly correlated with the other labels (i.e., $AvgMI(L_i)$ is small), a multi-label classification algorithm exploiting label correlations would not be able to use an accurate prediction of other labels to improve the accuracy of the prediction of label $L_i$, and therefore features which are strong predictors of that label should be rewarded regardless of their ability to predict other labels. Hence, the other approach investigated in this work is to extend Equation (4.3) with Equation (4.12).

$$r_{fL} = \frac{\sum_{i=1}^{|L|} |r_{fL_i}| \, (1 - AvgMI(L_i))}{\sum_{i=1}^{|L|} (1 - AvgMI(L_i))} \tag{4.12}$$

In Equations (4.11) and (4.12), the denominators normalize the weight values so that the sum of weights across labels is 1.

Table 4.1: Main Characteristics of the Datasets used in the experiments

| Dataset Symbol | Dataset Name | Instances | Features | Labels | Label Cardinality | Label Density | Distinct Labels |
|---|---|---|---|---|---|---|---|
| N1 | CAL500 | 502 | 68 | 174 | 26.044 | 0.150 | 502 |
| N2 | Scene | 2407 | 294 | 6 | 1.074 | 0.179 | 15 |
| N3 | Emotions | 593 | 72 | 6 | 1.869 | 0.311 | 27 |
| N4 | Yeast | 2417 | 103 | 14 | 4.237 | 0.303 | 198 |
| N5 | Business | 11314 | 21924 | 30 | 1.600 | 0.053 | 158 |
| N6 | Art | 7484 | 23146 | 26 | 1.659 | 0.063 | 404 |
| N7 | Education | 12030 | 27534 | 33 | 1.455 | 0.044 | 348 |
| N8 | Recreation | 12828 | 30324 | 22 | 1.428 | 0.065 | 369 |
| N9 | Health | 9205 | 30635 | 32 | 1.635 | 0.051 | 235 |
| N10 | Entertainment | 12730 | 32001 | 21 | 1.405 | 0.067 | 246 |
| N11 | Computer | 12444 | 34096 | 33 | 1.518 | 0.046 | 296 |
| N12 | Science | 6428 | 37187 | 40 | 1.471 | 0.037 | 332 |
| B1 | Enron | 1702 | 1001 | 53 | 3.378 | 0.064 | 753 |
| B2 | Medical | 978 | 1449 | 45 | 1.245 | 0.028 | 94 |
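The two weighting schemes of Equations (4.11) and (4.12) differ only in the weight given to each label, which the following sketch makes explicit. The function name, the `favour_dependent` flag and the example numbers are hypothetical. The sketch assumes binary labels and base-2 logarithms, in which case each AvgMI value is at most 1 and the weights $1 - AvgMI(L_i)$ are non-negative.

```python
def weighted_feature_label(abs_corr, avg_mi, favour_dependent=True):
    """Weighted mean of |r_{f L_i}| over the labels.

    abs_corr[i] is |r_{f L_i}| and avg_mi[i] is AvgMI(L_i).
    favour_dependent=True  uses Equation (4.11): weight_i = AvgMI(L_i)
    favour_dependent=False uses Equation (4.12): weight_i = 1 - AvgMI(L_i)
    """
    if len(abs_corr) != len(avg_mi):
        raise ValueError("one correlation and one AvgMI value per label")
    if favour_dependent:
        weights = list(avg_mi)
    else:
        weights = [1.0 - m for m in avg_mi]
    total = sum(weights)
    if total == 0:
        raise ValueError("all weights are zero, so the weighted mean is undefined")
    # Dividing by the total makes the effective weights sum to 1.
    return sum(r * w for r, w in zip(abs_corr, weights)) / total


# Hypothetical values: the feature predicts label 0 well and label 1 poorly,
# and label 0 is the one that depends more strongly on the other labels.
abs_corr = [0.9, 0.1]
avg_mi_values = [0.75, 0.25]

print(weighted_feature_label(abs_corr, avg_mi_values, favour_dependent=True))
print(weighted_feature_label(abs_corr, avg_mi_values, favour_dependent=False))
```

With these numbers the unweighted mean of the two correlations would be 0.5. The first scheme moves the score towards the correlation with the more dependent label, and the second moves it towards the correlation with the less dependent label.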
4.3 Datasets Used in the Experiments
Table 4.1 shows the main characteristics of all the multi-label datasets used in our experiments. There are two groups of datasets, based on the data type of their features and their application domain: (1) N1-N12, multi-label datasets with continuous (real-valued) features; and (2) B1-B2, multi-label datasets with binary features. The datasets in Table 4.1 were obtained from the MULAN repository [http://mulan.sourceforge.net/datasets.html].
In Table 4.1, the titles of the first five columns are self-explanatory. The meanings of the last three columns are as follows.
Label Cardinality (LCard) is the average number of labels per instance. Label Density (LDen) is the label cardinality divided by the number of labels. Distinct Labels (DistL) is the total number of distinct label combinations observed in the dataset [112]. The formal definitions of Label Cardinality, Label Density and Distinct Labels are as follows.
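$$LCard = \frac{1}{|D|} \sum_{i=1}^{|D|} |Y_i| \tag{4.13}$$

$$LDen = \frac{LCard}{|L|} \tag{4.14}$$

$$DistL = \left| \{\, Y_i \subseteq L \mid \exists (x_i, Y_i) \in D \,\} \right| \tag{4.15}$$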
where $|D|$ is the number of instances in dataset $D$, $Y_i$ is the set of class labels occurring in the $i$-th instance, $L$ is the set of class labels and $(x_i, Y_i)$ denotes the