Several single-label classification algorithms have been modified for multi-label classification. For example, the C4.5 algorithm (a well-known decision tree induction algorithm proposed in [93]) was modified by [16]. In order to extend C4.5 to a multi-label scenario, Clare and King adapted the entropy formula to multi-label classification. In this section we focus only on the two multi-label classification algorithms used in the experiments reported in Chapters 4-6, namely the multi-label extensions of the kNN (k-nearest neighbours) and Backpropagation Neural Network algorithms. Each is described in a separate subsection below.
3.3.1 Multi-Label K-Nearest Neighbours Algorithm
A multi-label classification algorithm called ML-kNN, based on an extension of the traditional single-label k-nearest neighbours (kNN) algorithm, was proposed in [124]. The algorithm works as follows. For each unseen test instance, ML-kNN identifies that instance's k nearest neighbours in the training set and records, for each label, which of those neighbours are positive or negative examples of that label. To transfer class labels from the neighbours to the unseen instance, the approach in essence applies the kNN algorithm independently for each label in the label set. More specifically, it counts the number of neighbours associated with each label and uses the maximum a posteriori principle to determine the label set of the unseen instance.
For an unknown-class instance x, the predicted value (0 or 1) of each class label $Y_j$ is computed by Equation 3.1:

\[
Y_j =
\begin{cases}
1, & \text{if } P(c_j \mid Y_j = 1)\,P(Y_j = 1) \ge P(c_j \mid Y_j = 0)\,P(Y_j = 0) \\
0, & \text{otherwise}
\end{cases}
\tag{3.1}
\]

where $c_j$ is the count of the nearest neighbours of instance x which have the j-th label (i.e., nearest neighbours with $Y_j = 1$), $P(c_j \mid Y_j = 1)$ is the probability of the count $c_j$ conditioned on the event that x has the j-th label, $P(c_j \mid Y_j = 0)$ is the analogous probability conditioned on the event that x does not have the j-th label, and $P(Y_j = 1)$ and $P(Y_j = 0)$ are the prior probabilities of the j-th label taking the value 1 or 0 (estimated from the relative frequencies of $Y_j = 1$ and $Y_j = 0$ in the entire training set).
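The decision rule of Equation 3.1 can be illustrated with a minimal Python sketch. It is not the implementation from [124]: the function names `mlknn_fit` and `mlknn_predict` are hypothetical, the likelihoods are estimated by counting over the training instances (one plausible estimator, assumed here), and the Laplace smoothing constant `s` is an assumption of this sketch. Features are assumed to be already normalized.

```python
import numpy as np

def mlknn_fit(X, Y, k, s=1.0):
    """Estimate the quantities used in Equation 3.1.

    X: (n, d) feature matrix; Y: (n, q) binary label matrix.
    s: Laplace smoothing constant (an assumption of this sketch).
    """
    Y = np.asarray(Y, dtype=int)
    n, q = Y.shape
    # Smoothed prior P(Y_j = 1) from label frequencies in the training set.
    prior1 = (s + Y.sum(axis=0)) / (2 * s + n)
    # Pairwise Euclidean distances; an instance is not its own neighbour.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)
    nn = np.argsort(dist, axis=1)[:, :k]
    counts = Y[nn].sum(axis=1)  # (n, q): c_j for every training instance
    # hist1[c, j]: instances WITH label j whose neighbour count for j is c;
    # hist0[c, j]: the same for instances WITHOUT label j.
    hist1 = np.zeros((k + 1, q))
    hist0 = np.zeros((k + 1, q))
    for c in range(k + 1):
        hist1[c] = ((counts == c) & (Y == 1)).sum(axis=0)
        hist0[c] = ((counts == c) & (Y == 0)).sum(axis=0)
    like1 = (s + hist1) / (s * (k + 1) + hist1.sum(axis=0))  # P(c_j | Y_j = 1)
    like0 = (s + hist0) / (s * (k + 1) + hist0.sum(axis=0))  # P(c_j | Y_j = 0)
    return prior1, like1, like0

def mlknn_predict(x, X, Y, k, prior1, like1, like0):
    """Apply the MAP rule of Equation 3.1 to one unseen instance x."""
    Y = np.asarray(Y, dtype=int)
    nn = np.argsort(np.linalg.norm(X - x, axis=1))[:k]
    c = Y[nn].sum(axis=0)            # c_j: neighbours of x having label j
    j = np.arange(Y.shape[1])
    return (like1[c, j] * prior1 >= like0[c, j] * (1 - prior1)).astype(int)
```

Calling `mlknn_predict(x, X, Y, k, *mlknn_fit(X, Y, k))` returns a 0/1 vector with one entry per label. Each label is decided independently, which mirrors the per-label use of kNN described above.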
The ML-kNN method has been used for multi-label classification of music into emotions [111], multi-label classification for video annotation [23] and multi-label learning with label-specific features [121]. Moreover, [111] pointed out that ML-kNN is a high-performance representative of problem adaptation methods.
One aspect of the original single-label kNN that is inherited by ML-kNN is the distance measure. For distance-based classification methods like kNN that use the Euclidean distance measure, feature normalization is an important step, because it prevents a feature with an initially (before normalization) large range from outweighing a feature with an initially smaller range when computing the distance between two instances. Feature normalization equalizes the range of values of all features [2, 72, 100]. ML-kNN uses the Euclidean distance to measure the distance between instances; therefore, the original features need to be normalized in a pre-processing step before ML-kNN is applied.
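The effect of normalization on the Euclidean distance can be seen in a small Python sketch. The text does not state which normalization is used, so min-max scaling to [0, 1] is an assumption here, and the helper names `min_max_normalize` and `nearest` as well as the toy data are hypothetical.

```python
import numpy as np

def min_max_normalize(X_train, X_test):
    """Rescale every feature to [0, 1] using ranges taken from the training set."""
    lo = X_train.min(axis=0)
    hi = X_train.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # guard against constant features
    return (X_train - lo) / span, (X_test - lo) / span

def nearest(X, x):
    """Index of the training instance closest to x in Euclidean distance."""
    return int(np.argmin(np.linalg.norm(X - x, axis=1)))

# Toy data: feature 0 ranges over [0, 1], feature 1 over [0, 1000].
X_train = np.array([[0.0, 0.0], [1.0, 1000.0], [0.1, 600.0], [0.9, 500.0]])
query = np.array([[0.1, 520.0]])

X_norm, query_norm = min_max_normalize(X_train, query)
print(nearest(X_train, query[0]))      # 3: the wide-range feature dominates
print(nearest(X_norm, query_norm[0]))  # 2: both features now contribute equally
```

Before normalization the nearest neighbour is decided almost entirely by the second feature; after normalization the first feature carries the same weight and the nearest neighbour changes.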
3.3.2 Multi-Label Neural Network Algorithm
An extension of the traditional feed-forward neural network to multi-label classification problems, Backpropagation Multi-Label Learning (BPMLL), was proposed by [123]. A feed-forward neural network has a multi-layer architecture. The first layer is the input layer and the last layer is the output of the algorithm. The layers in the middle, called hidden layers, have no connection with the external world. Each layer has many neurons (nodes), each of which connects to all nodes in the next layer, while there are no connections between nodes in the same layer. Note that the output layer has one node for each of the class labels.
[Diagram: input layer, hidden layer and output layer]
where
Y is the set of class labels
d is the number of input nodes (the dimensionality of the feature vector)
Q is the number of output nodes, each corresponding to one of the possible class labels
M is the number of nodes in the hidden layer
$V_{hs}$ is the weight of the connection between input node h and hidden node s
$W_{sj}$ is the weight of the connection between hidden node s and output node j
$a_0, \dots, a_d$ are the input nodes ($a_0$ is the bias node)
$b_0, \dots, b_M$ are the hidden nodes ($b_0$ is the bias node)
$c_0, \dots, c_Q$ are the output nodes (representing class labels)
Figure 3.1: Backpropagation Multi-Label Learning (BPMLL) architecture (adapted from [123])
This kind of architecture is shown in Figure 3.1. There are d units in the input layer, one for each feature, and Q units in the output layer, one for each class label.
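A forward pass through such a network can be sketched in Python as follows. The text does not specify the activation function, so the use of tanh is an assumption of this sketch, as are the function name `forward` and the layout of the weight matrices `V` (input to hidden) and `W` (hidden to output), which include one extra row for the bias node of the preceding layer.

```python
import numpy as np

def forward(x, V, W):
    """Compute the Q output values for one instance.

    x: (d,) feature vector
    V: (d + 1, M) input-to-hidden weights, row 0 belongs to the bias node a_0
    W: (M + 1, Q) hidden-to-output weights, row 0 belongs to the bias node b_0
    tanh activations are an assumption of this sketch.
    """
    a = np.concatenate(([1.0], x))           # a_0 = 1 is the bias node
    b = np.tanh(a @ V)                       # hidden activations b_1..b_M
    b = np.concatenate(([1.0], b))           # b_0 = 1 is the bias node
    return np.tanh(b @ W)                    # one output per class label

# Example shapes: d = 3 features, M = 5 hidden nodes, Q = 2 labels.
rng = np.random.default_rng(0)
outputs = forward(rng.normal(size=3),
                  rng.normal(size=(4, 5)),
                  rng.normal(size=(6, 2)))
print(outputs.shape)  # (2,)
```

Nodes within a layer never feed each other, so each layer reduces to a single matrix product followed by an element-wise activation.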
The network is trained by backpropagation with an error function. The global error function is shown in Equation 3.2. The error term for the i-th instance is calculated as the accumulated difference between the outputs of each pair of nodes where one node ($c^i_k$) represents a label belonging to instance i and the other node ($c^i_l$) represents a label not belonging to instance i. Note that the bigger the difference ($c^i_k - c^i_l$), the better the predictive performance of the neural network, since the output of $c^i_k$ should be as high as possible (label k occurs in instance i) and the output of $c^i_l$ should be as low as possible (label l does not occur in instance i).
\[
E = \sum_{i=1}^{m} E_i = \sum_{i=1}^{m} \frac{1}{|Y_i|\,|\bar{Y}_i|} \sum_{(k,l) \in Y_i \times \bar{Y}_i} \exp\bigl(-(c^i_k - c^i_l)\bigr)
\tag{3.2}
\]
where
$Y_i$ is the set of labels occurring in instance i
$\bar{Y}_i$ is the complementary set of $Y_i$ (i.e., the set of labels not occurring in instance i)
$c^i_k - c^i_l$ is the difference between the outputs of the nodes for one label belonging to instance i ($k \in Y_i$) and one label not belonging to instance i ($l \in \bar{Y}_i$)
k is the index of a label belonging to label set $Y_i$
l is the index of a label belonging to label set $\bar{Y}_i$
m is the number of instances in the multi-label training set
In Equation 3.2, the larger the value of $c^i_k - c^i_l$, the smaller the value of $\exp(-(c^i_k - c^i_l))$, and so the smaller the error associated with the pair of labels k and l. For each instance i, the sum of these errors over all pairs of labels is normalized by dividing it by the total number of label pairs ($|Y_i||\bar{Y}_i|$), and finally the errors for all instances are added up to give the global error E.
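The computation of the global error in Equation 3.2 can be sketched in Python as follows. The function name `bpmll_error` is hypothetical, and skipping instances that have either no labels or all labels (for which the pairwise term is undefined) is an assumption of this sketch rather than something stated in the text.

```python
import numpy as np

def bpmll_error(C, Y):
    """Global error E of Equation 3.2.

    C: (m, Q) real-valued network outputs, one row per instance
    Y: (m, Q) binary label matrix (1 = label occurs in the instance)
    """
    C = np.asarray(C, dtype=float)
    Y = np.asarray(Y, dtype=int)
    total = 0.0
    for c, y in zip(C, Y):
        pos = c[y == 1]   # outputs c_k for labels in Y_i
        neg = c[y == 0]   # outputs c_l for labels in the complement of Y_i
        if pos.size == 0 or neg.size == 0:
            continue      # assumption: skip instances with no valid label pair
        diffs = pos[:, None] - neg[None, :]          # c_k - c_l for every pair
        total += np.exp(-diffs).sum() / (pos.size * neg.size)
    return total

# A correctly ranked pair gives a small error, a wrongly ranked pair a large one.
print(bpmll_error([[1.0, -1.0]], [[1, 0]]))   # exp(-2), about 0.135
print(bpmll_error([[-1.0, 1.0]], [[1, 0]]))   # exp(2), about 7.389
```

The two calls differ only in which output is larger, which shows how the exponential penalizes label pairs ranked in the wrong order much more heavily than it rewards correctly ranked ones.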