
There are a small number of published studies on filter-based feature selection methods for multi-label classification following the data preprocessing (rather than the embedded) approach, as follows.

Several works first transform the multi-label problem into a single-label problem and then apply a single-label feature selection method. The authors of [26] proposed using a problem transformation method to convert the data from a multi-label problem to a single-label problem, and used mutual information (MI) as the evaluation function for feature subset selection in the filter approach. The transformation they used, the Pruned Problem Transformation method (PPT), is a variation of the Power Set problem transformation method (PT3) defined in [112]: it treats each distinct label subset in the original data as a single label (as in PT3) and then removes from the dataset the new labels whose number of instances is smaller than a predefined threshold. They then used greedy forward feature selection based on MI to select features. Similarly, in [27] the PPT method was applied to transform the data and multivariate mutual information was used to select features. That paper claims that using multivariate mutual information can deal with redundancy between features in the feature subset. However, these studies cannot deal with multi-label problems directly.
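The PPT step described above can be illustrated with a minimal Python sketch. It covers only what is stated in the text (one class per distinct label subset, then removal of rare classes); the function name and the parameter name min_count are hypothetical, and any further handling of the pruned instances in [26] or [27] is not modelled.

```python
from collections import Counter


def pruned_problem_transformation(label_sets, min_count):
    """Map each distinct label subset to one class and drop rare classes.

    label_sets: one iterable of labels per instance.
    min_count:  hypothetical name for the predefined pruning threshold.
    Returns (kept_indices, kept_classes) for the surviving instances.
    """
    # Each distinct label subset becomes a single-label class (as in PT3).
    classes = [frozenset(labels) for labels in label_sets]
    counts = Counter(classes)
    # Remove the classes that have fewer than min_count instances.
    kept = [i for i, c in enumerate(classes) if counts[c] >= min_count]
    return kept, [classes[i] for i in kept]
```

With a threshold of 2, an instance whose label subset occurs only once in the data is dropped, and a single-label feature selection method can then be run on the remaining instances.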

RF-BR used the binary relevance (BR) transformation technique to transform multi-label data to single-label data and then evaluated each feature subset using ReliefF (RF). This approach also cannot directly deal with multi-label datasets [103].
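As a rough illustration of the BR transformation, the sketch below splits a binary label matrix into one single-label target per label. The function and argument names are hypothetical, and the ReliefF evaluation that RF-BR performs afterwards is not shown.

```python
def binary_relevance(Y, label_names):
    """Split a multi-label target into one binary target per label.

    Y:           list of rows, each a list of 0/1 label indicators.
    label_names: one name per label column (hypothetical argument).
    """
    # Column k of Y becomes the single-label target for label_names[k].
    return {name: [row[k] for row in Y] for k, name in enumerate(label_names)}
```

Each resulting target can be handed to a single-label feature evaluator on its own, which is why label correlations are not visible to the evaluator.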

The main drawback of using a problem transformation method in those studies is that they cannot cope with the correlation between labels. Other multi-label feature selection methods, which avoid using a problem transformation method, were proposed in several studies, as follows.

Multivariate mutual information for multi-label feature selection without problem transformation was proposed by [71]. This approach avoids the information loss incurred during the problem transformation process. However, it requires a user-predefined number of features in the selected feature subset, which was set to three in their paper.

In [68], the authors adapted the idea of the fast correlation-based feature selection (FCFS) method proposed by [119] and applied it to a multi-label scenario. They used a maximum spanning tree (MST) and symmetrical uncertainty (SU) in their filter approach to select features in a multi-label classification task. They built an SU matrix that captures feature-feature and feature-label correlations, using SU as the criterion to measure correlation. However, they assumed that all features were discrete, which is a drawback in datasets where many features are continuous. Continuous features can be discretized in a preprocessing step, but this leads to a loss of relevant information, especially in microarray datasets with more than 20,000 continuous features, such as the data used in our experiments reported in Chapter 5.
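For reference, SU between two discrete variables is commonly defined as SU(X, Y) = 2 I(X; Y) / (H(X) + H(Y)). The sketch below computes it from empirical frequencies; it assumes this standard definition and discrete inputs, and it does not reproduce the SU matrix or the MST construction of [68].

```python
import math
from collections import Counter


def entropy(values):
    """Empirical Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def symmetrical_uncertainty(x, y):
    """SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)), a value in [0, 1]."""
    h_x, h_y = entropy(x), entropy(y)
    if h_x + h_y == 0:
        return 0.0  # both variables are constant; SU is taken as 0 here
    h_xy = entropy(list(zip(x, y)))   # joint entropy H(X, Y)
    mutual_info = h_x + h_y - h_xy    # I(X; Y)
    return 2.0 * mutual_info / (h_x + h_y)
```

Because the entropies are estimated from value counts, a continuous feature would first have to be discretized, which is the limitation noted above.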

A multi-label feature selection method using an MF-statistic and MreliefF based approach was proposed by [65]. These two approaches take label correlation into account by using the multi-label F-statistic and the multi-label ReliefF method to evaluate the correlation between a feature and the labels, but they cannot consider the correlation between features.

Also, [120] performed feature selection for classification with multi-label naive Bayes. First they used Principal Component Analysis (PCA) to remove redundant features, and then they used a Genetic Algorithm (GA) to select a relevant feature subset. In their paper the learning problem was addressed by multi-label naive Bayes (MLNB). Their study performed feature selection in a multi-label scenario because the GA uses the predictive performance of MLNB to guide the search for features, following a wrapper approach. However, note that PCA is an unsupervised method for dimensionality reduction, whereas the datasets used in our experiments are appropriate for supervised learning methods. In addition, PCA creates new features that are difficult for users to interpret, whilst a dimensionality reduction approach based on feature selection has the advantage of preserving the meaning of the original features, facilitating the user’s interpretation of the classifier built with the selected features [97] [76].

Relief for multi-label feature selection (RF-ML) was proposed by Spolaor and others in 2013. This approach searches for the k nearest multi-label instances by using a dissimilarity function. RF-ML considers the effect of feature interaction when computing the dissimilarity between instances. The dissimilarity function used in their paper is the normalized Hamming distance. Another method, IG-ML, proposed by [103], selects feature subsets that have a multi-label information gain (IG) value greater than or equal to a predefined threshold. This method has the drawback of requiring an ad-hoc user-defined threshold value.
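A normalized Hamming distance between two binary vectors can be sketched as follows. This assumes equal-length 0/1 vectors (for example, the label indicator vectors of two instances); exactly which vectors RF-ML compares is not specified in this section, and the function name is hypothetical.

```python
def normalized_hamming(vec_a, vec_b):
    """Fraction of positions at which two equal-length vectors differ."""
    if len(vec_a) != len(vec_b):
        raise ValueError("vectors must have the same length")
    if not vec_a:
        return 0.0  # two empty vectors are treated as identical
    differing = sum(a != b for a, b in zip(vec_a, vec_b))
    return differing / len(vec_a)
```

The result lies in [0, 1]: 0 means the two vectors are identical and 1 means they differ at every position.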

In [74], the authors proposed multi-label feature selection via information gain (IGMF). This approach evaluates the information gain between each feature and the label set and then eliminates irrelevant features, using the average information gain across all features as the threshold. They claim that this approach can deal with the multi-label problem directly. However, a discretization technique was used before calculating information gain, and, as mentioned earlier, this involves information loss, especially in datasets with many continuous features.
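The average-based threshold can be sketched in a few lines, assuming the per-feature information gain scores have already been computed. The function name is hypothetical, and keeping features whose score equals the average is an assumption, since the text does not say how ties with the threshold are handled.

```python
def select_above_average(ig_scores):
    """Keep the features whose score is at least the mean score.

    ig_scores: dict mapping a feature name to its precomputed
               information gain with respect to the label set.
    """
    threshold = sum(ig_scores.values()) / len(ig_scores)
    # Assumption: a score equal to the threshold is kept (>=).
    return [name for name, score in ig_scores.items() if score >= threshold]
```

Unlike a top-k rule, this threshold needs no user-defined parameter, but the number of selected features then depends entirely on the distribution of the scores.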

Also, [91] adopted information gain-based feature selection for the multi-label scenario. This approach computes a multi-label information gain score for every feature and then ranks the features before selecting the top k, where k is a user-defined parameter.

Table 3.8: A Summary of work on Filter-based Multi-Label Feature Selection Methods

Ref.            PT Method   Eval. Function   Disadvantages
[26, 27]        Power set   MI               • Cannot deal with multi-label problem directly
                                             • Loss of information associated with discretized data
                                             • Need user pre-defined number of selected features
[65]            None        F-statistic      • Ignore the correlations between pairs of features
                                             • Need user pre-defined number of selected features
[71]            None        MMI              • Loss of information associated with discretized data
                                             • Need user pre-defined number of selected features
[68]            None        SU               • Loss of information associated with discretized data
                                             • Need user pre-defined number of selected features
[103]           BR          ReliefF          • Need user pre-defined number of selected features
                                             • Cannot deal with multi-label problem directly
[74, 91, 103]   None        IG               • Need user pre-defined number of selected features
                                             • Loss of information associated with discretized data
[105]           None        RFML             • Need user pre-defined number of selected features
[104]           BR          IG               • Need user pre-defined number of selected features
                                             • Loss of information associated with discretized data

Another method was proposed by [104]. The main idea of this approach is to deal with label dependency. The method constructs a new label from a pair of original labels, repeating this q times, where q is a user-predefined number of constructed labels that is smaller than the total number of labels. After the q new labels are generated, BR is applied to a new dataset consisting of the original dataset plus the q constructed labels. The main drawbacks of this approach are that it needs a user-predefined number (q), that there are many ways to generate a new label (using the AND, XOR or XNOR operator), and that the user needs to specify how to select a pair of labels. Moreover, this approach increases the number of labels in the dataset by q.
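The label construction step can be sketched as follows for binary (0/1) labels. The pair selection strategy is left to the user in [104]; drawing pairs at random, as done here, is only an assumption for illustration, and all names in the sketch are hypothetical.

```python
import random

# Logical operators applied to two 0/1 label values.
OPERATORS = {
    "AND": lambda a, b: a & b,
    "XOR": lambda a, b: a ^ b,
    "XNOR": lambda a, b: 1 - (a ^ b),
}


def construct_labels(Y, q, operator="AND", seed=0):
    """Append q labels, each built from a pair of original labels.

    Y: list of rows, each a list of 0/1 label indicators (at least 2 labels).
    Returns (extended_rows, pairs), where pairs lists the label index
    pairs used. Pairs are drawn at random (an assumption, see above).
    """
    n_labels = len(Y[0])
    if not 0 <= q < n_labels:
        raise ValueError("q must be smaller than the number of labels")
    rng = random.Random(seed)
    op = OPERATORS[operator]
    pairs = [tuple(rng.sample(range(n_labels), 2)) for _ in range(q)]
    extended = [row + [op(row[i], row[j]) for i, j in pairs] for row in Y]
    return extended, pairs
```

Every row gains q extra label columns, so a subsequent BR step produces q additional binary problems, which illustrates the growth in the number of labels noted above.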

Table 3.8 shows a summary of the previously discussed feature selection methods based on the filter approach.

3.5

Multi-Label Classification Evaluation Mea-