Multi-Label Classification Algorithms

Several single-label classification algorithms have been modified for multi-label classification. For example, the C4.5 algorithm (a well-known decision tree induction algorithm proposed in [93]) was modified by [16]: to extend C4.5 to a multi-label scenario, Clare and King adapted the entropy formula for multi-label classification. In this section we focus only on the two multi-label classification algorithms used in the experiments reported in Chapters 4-6, namely the multi-label extensions of the kNN (k-nearest neighbours) and backpropagation neural network algorithms. Each is described in its own subsection below.

3.3.1 Multi-Label K-Nearest Neighbours Algorithm

ML-kNN, a multi-label classification algorithm that extends the traditional single-label k-nearest neighbours (kNN) algorithm, was proposed in [124]. The algorithm works as follows. For each unseen test instance, ML-kNN identifies that instance's k nearest neighbours in the training set and, for each label, checks which of those neighbours are positive (have the label) or negative (do not have it). To transfer class labels from the neighbours to the unseen instance, the approach in essence applies kNN independently to each label in the label set. More specifically, it counts the number of neighbours associated with each label and uses the maximum a posteriori (MAP) principle to determine the label set of the unseen instance.

For an unknown-class instance $x$, the predicted value (0 or 1) of each class label $Y_j$ is computed by Equation 3.1:

$$
Y_j =
\begin{cases}
1, & \text{if } P(c_j \mid Y_j = 1)\,P(Y_j = 1) \ge P(c_j \mid Y_j = 0)\,P(Y_j = 0) \\
0, & \text{otherwise}
\end{cases}
\tag{3.1}
$$

where:

$c_j$ is the count of the nearest neighbours of instance $x$ which have the $j$-th label (i.e., nearest neighbours with $Y_j = 1$),

$P(c_j \mid Y_j = 1)$ is the probability of the count $c_j$ conditioned on the event that $x$ has the $j$-th label, and $P(c_j \mid Y_j = 0)$ is the analogous probability conditioned on the event that $x$ does not have the $j$-th label,

$P(Y_j = 1)$ and $P(Y_j = 0)$ are the prior probabilities of the $j$-th label taking the value 1 or 0 (estimated from the relative frequencies of $Y_j = 1$ and $Y_j = 0$ in the entire training set).
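The decision rule of Equation 3.1 can be illustrated with a minimal pure-Python sketch. It is not the implementation of [124]: the function names (`fit_mlknn`, `predict_mlknn`), the Laplace smoothing constant `s`, and the toy data are assumptions made for illustration only. The likelihoods P(cj | Yj = 1) and P(cj | Yj = 0) are estimated here by counting, over the training set, how often an instance with (or without) label j has exactly cj neighbours carrying that label.

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def k_nearest(X, x, k, skip=None):
    """Indices of the k training instances closest to x (optionally skipping one index)."""
    order = sorted((i for i in range(len(X)) if i != skip),
                   key=lambda i: euclidean(X[i], x))
    return order[:k]

def fit_mlknn(X, Y, k, s=1.0):
    """Estimate P(Y_j = 1) and P(c_j | Y_j = b) from training data.
    s is a Laplace smoothing constant (an assumption of this sketch)."""
    n, q = len(X), len(Y[0])
    prior1 = [(s + sum(y[j] for y in Y)) / (2 * s + n) for j in range(q)]
    # counts[b][j][c]: training instances with Y_j = b whose k neighbours
    # contain exactly c instances that have label j
    counts = [[[0] * (k + 1) for _ in range(q)] for _ in range(2)]
    for i in range(n):
        nbrs = k_nearest(X, X[i], k, skip=i)
        for j in range(q):
            c = sum(Y[t][j] for t in nbrs)
            counts[Y[i][j]][j][c] += 1
    like = [[[(s + counts[b][j][c]) / (s * (k + 1) + sum(counts[b][j]))
              for c in range(k + 1)] for j in range(q)] for b in range(2)]
    return prior1, like

def predict_mlknn(X, Y, model, x, k):
    """Apply the MAP rule of Equation 3.1 to every label of instance x."""
    prior1, like = model
    nbrs = k_nearest(X, x, k)
    pred = []
    for j in range(len(prior1)):
        c = sum(Y[t][j] for t in nbrs)             # c_j
        pos = like[1][j][c] * prior1[j]            # P(c_j | Y_j = 1) P(Y_j = 1)
        neg = like[0][j][c] * (1.0 - prior1[j])    # P(c_j | Y_j = 0) P(Y_j = 0)
        pred.append(1 if pos >= neg else 0)
    return pred

# Toy data (hypothetical): two clusters, each carrying a different label.
X_train = [[0, 0], [0, 1], [1, 0], [1, 1], [5, 5], [5, 6], [6, 5], [6, 6]]
Y_train = [[1, 0]] * 4 + [[0, 1]] * 4
model = fit_mlknn(X_train, Y_train, k=3)
near_first = predict_mlknn(X_train, Y_train, model, [0.5, 0.5], k=3)
near_second = predict_mlknn(X_train, Y_train, model, [5.5, 5.5], k=3)
```

In this toy setting an instance close to the first cluster receives only the first label and an instance close to the second cluster receives only the second, because the neighbour counts observed for each label match the counts seen for training instances that have that label.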

The ML-kNN method has been used for multi-label classification of music into emotions [111], multi-label classification for video annotation [23] and multi-label learning with label-specific features [121]. Moreover, [111] pointed out that ML-kNN is a high-performance representative of problem adaptation methods.

One aspect of the original single-label kNN that is inherited by ML-kNN is the distance measure. For distance-based classification methods such as kNN with the Euclidean distance measure, feature normalization is an important step, because it prevents a feature with an initially large range (before normalization) from outweighing a feature with an initially smaller range when the distance between two instances is computed. Feature normalization equalizes the range of values of all features [2, 72, 100]. Since ML-kNN uses the Euclidean distance to measure the distance between instances, the original features need to be normalized in a pre-processing step before ML-kNN is applied.
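The effect of normalization on the Euclidean distance can be seen in a small sketch. Min-max scaling to [0, 1] is used here as one common way of equalizing feature ranges; the text does not state which normalization was applied, so this choice, the function names and the toy values are assumptions.

```python
import math

def min_max_normalize(X):
    """Rescale every feature (column) of X to the range [0, 1]."""
    cols = list(zip(*X))
    lows = [min(c) for c in cols]
    highs = [max(c) for c in cols]
    return [[(v - lo) / (hi - lo) if hi > lo else 0.0
             for v, lo, hi in zip(row, lows, highs)] for row in X]

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def nearest_to_first(X):
    """Index of the instance closest to X[0] (excluding X[0] itself)."""
    return min(range(1, len(X)), key=lambda i: euclidean(X[0], X[i]))

# Hypothetical data: feature 1 ranges over [0, 1], feature 2 over [1000, 1100].
X_raw = [[0.0, 1000.0], [1.0, 1010.0], [0.1, 1060.0], [0.5, 1100.0]]
X_norm = min_max_normalize(X_raw)

raw_neighbour = nearest_to_first(X_raw)    # decided almost entirely by feature 2
norm_neighbour = nearest_to_first(X_norm)  # both features contribute comparably
```

On the raw values the nearest neighbour of the first instance is determined by the large-range second feature alone; after scaling, a different instance becomes the nearest neighbour because the difference in the first feature now carries comparable weight.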

3.3.2 Multi-Label Neural Network Algorithm

An extension of the traditional feed-forward neural network to multi-label classification problems, Backpropagation Multi-Label Learning (BPMLL), was proposed in [123]. A feed-forward neural network has a multi-layer architecture. The first layer is the input layer and the last layer produces the output of the algorithm. The layers in the middle, called hidden layers, have no connection with the external world. Each layer has many neurons (nodes), each connected to all nodes in the next layer, while there are no connections between nodes in the same layer. Note that the output layer has one node for each class label.

Figure 3.1: Backpropagation Multi-Label Learning (BPMLL) architecture (adapted from [123]). The figure shows three layers (input layer, hidden layer and output layer), where

Y is the set of class labels

d is the number of input nodes (the dimensionality of the feature vector)

Q is the number of output nodes, each corresponding to one of the possible class labels

M is the number of nodes in the hidden layer

$V_{hs}$ is the weight of the connection between input node h and hidden node s

$W_{sj}$ is the weight of the connection between hidden node s and output node j

$a_0, \ldots, a_d$ are the input nodes ($a_0$ is the bias node)

$b_0, \ldots, b_M$ are the hidden nodes ($b_0$ is the bias node)

$c_j$ are the output nodes (each representing a class label)

This architecture is shown in Figure 3.1. There are d units in the input layer, one per feature, and Q units in the output layer, one per class label.
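A forward pass through a network with this architecture can be sketched as follows. The sketch assumes a single hidden layer, bias nodes fixed at 1, and a tanh activation in both the hidden and output layers; the activation function is not specified in the text above, so it is an assumption, as are the function name and the weight values.

```python
import math

def forward(x, V, W):
    """Compute the Q output values for one instance.
    x: list of d feature values.
    V: (d+1) x M input-to-hidden weights; row 0 belongs to the bias node a_0.
    W: (M+1) x Q hidden-to-output weights; row 0 belongs to the bias node b_0."""
    a = [1.0] + list(x)                        # a_0 = 1 is the bias node
    M = len(V[0])
    b = [1.0] + [math.tanh(sum(a[h] * V[h][s] for h in range(len(a))))
                 for s in range(M)]            # b_0 = 1 is the bias node
    Q = len(W[0])
    return [math.tanh(sum(b[s] * W[s][j] for s in range(len(b))))
            for j in range(Q)]

# Hypothetical weights for d = 2 features, M = 2 hidden nodes, Q = 3 labels.
V = [[0.0, 0.0],    # bias a_0
     [0.0, 0.0],    # input a_1
     [0.0, 0.0]]    # input a_2
W = [[0.5, -0.5, 0.0],   # bias b_0
     [1.0, 1.0, 1.0],    # hidden b_1
     [1.0, 1.0, 1.0]]    # hidden b_2
outputs = forward([0.3, 0.7], V, W)
```

With all input-to-hidden weights set to zero the hidden activations are zero, so each output depends only on the bias weight in the first row of `W`; the list returned has one value per class label.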

The network is trained by backpropagation with an error function. The global error function is shown in Equation 3.2. The error term for the i-th instance is accumulated over the differences between the outputs of each pair of nodes where one node ($c_k^i$) represents a label belonging to instance i and the other node ($c_l^i$) represents a label not belonging to instance i. Note that the bigger the difference ($c_k^i - c_l^i$), the better the predictive performance of the neural network, since the output of $c_k^i$ should be as high as possible (label k occurs in instance i) and the output of $c_l^i$ should be as low as possible (label l does not occur in instance i).

$$
E = \sum_{i=1}^{m} E_i = \sum_{i=1}^{m} \frac{1}{|Y_i|\,|\bar{Y}_i|} \sum_{(k,l) \in Y_i \times \bar{Y}_i} \exp\left(-(c_k^i - c_l^i)\right)
\tag{3.2}
$$

where

$Y_i$ is the set of labels occurring in instance i

$\bar{Y}_i$ is the complementary set of $Y_i$ (i.e., the set of labels not occurring in instance i)

$c_k^i - c_l^i$ is the difference between the output of the node for one label belonging to instance i ($k \in Y_i$) and the output of the node for one label not belonging to instance i ($l \in \bar{Y}_i$)

k is the index of a label belonging to label set $Y_i$

l is the index of a label belonging to label set $\bar{Y}_i$

m is the number of instances in a multi-label training set

In Equation 3.2, the larger the value of $c_k^i - c_l^i$, the smaller the value of $\exp(-(c_k^i - c_l^i))$, and so the smaller the error associated with the pair of labels k and l. For each instance i, the sum of these errors over all label pairs is normalized by dividing it by the total number of label pairs ($|Y_i|\,|\bar{Y}_i|$), and finally the errors of all instances are added up to give the global error E.
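The computation of Equation 3.2 can be traced with a short sketch. The function names and the example outputs are hypothetical; in addition, an instance whose label set or complementary set is empty has no label pairs, so this sketch skips it (contributing zero error), which is an assumption not stated in the text.

```python
import math

def instance_error(outputs, labels):
    """Error E_i for one instance.
    outputs: network outputs, one value per label.
    labels: 0/1 indicators, 1 if the label occurs in the instance."""
    pos = [k for k, y in enumerate(labels) if y == 1]   # indices in Y_i
    neg = [l for l, y in enumerate(labels) if y == 0]   # indices in the complement of Y_i
    if not pos or not neg:
        return 0.0   # assumption: no label pairs, so no error contribution
    total = sum(math.exp(-(outputs[k] - outputs[l])) for k in pos for l in neg)
    return total / (len(pos) * len(neg))

def global_error(all_outputs, all_labels):
    """Global error E: the sum of E_i over all training instances."""
    return sum(instance_error(c, y) for c, y in zip(all_outputs, all_labels))

# Hypothetical outputs for one instance with two labels, the first one relevant.
good = instance_error([1.0, -1.0], [1, 0])    # relevant label ranked higher
tied = instance_error([0.0, 0.0], [1, 0])     # no separation between the labels
bad = instance_error([-1.0, 1.0], [1, 0])     # relevant label ranked lower
total = global_error([[1.0, -1.0], [0.0, 0.0]], [[1, 0], [1, 0]])
```

The three cases show the behaviour described above: the error is exp(-2) when the relevant label's output exceeds the irrelevant one by 2, exactly 1 when the two outputs are equal, and exp(2) when the order is reversed.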

In document New Multi-Label Correlation-Based Feature Selection Methods for Multi-Label Classification and Application in Bioinformatics (Page 82-87)