Datasets Used in the Experiments - New Multi-Label Correlation-Based Feature Selection Methods

In our experiments, we analysed two multi-label microarray gene expression datasets (Table 5.1). Unlike the datasets analysed in the previous Chapter, these two multi-label microarray datasets are not publicly available; they were prepared for data mining by the author of this thesis, using data provided by Prof. Michaelis, School of Bioscience, University of Kent. Both datasets were obtained from the resistant cancer cell line (RCCL) collection [22]. The first one (referred to as dataset M1) consists of 28,536 features (genes), 24 instances (cell lines) and 2 class attributes. More precisely, each feature represents the (real-valued) expression level of a different gene, for each cell line (instance) in the dataset. The two class attributes stand for two drugs which are used to treat neuroblastoma (a type of cancer), namely: ‘Nutlin-3’, which can take two class labels (sensitive and resistant), and ‘RITA’, which can take three class labels (sensitive, resistant and highly resistant) for each cell line. Hence, the goal of the multi-label classification algorithm is to produce a classification model that, given the values of the features (gene expression levels) for a cell line, predicts whether that cell line would be sensitive or resistant to the drug Nutlin-3, and whether it would be sensitive, resistant or highly resistant to the drug RITA.

In order to prepare dataset M1 for the application of a multi-label algorithm, we first decomposed the two class attributes into three binary class labels. The first binary class label (L1) indicates whether a cell line (an instance) is sensitive or resistant to the drug Nutlin-3. The situation is more complicated in the case of the class attribute for the RITA drug, which can take 3 values, since conventional multi-label algorithms can cope only with binary class labels. Hence, we decomposed the 3 class values for RITA into two binary attributes: L2 takes the value yes or no to indicate whether or not a cell line is sensitive to the RITA drug; whilst L3 takes the value yes or no to indicate whether or not a cell line is highly resistant to RITA. Hence, at most one of the labels L2 and L3 can take the value yes for a given cell line. If both L2 and L3 take the value no for a cell line, this means the cell line is resistant to the drug RITA. Also, if L1, L2 and L3 all take the value no for a cell line, this means the cell line is sensitive to Nutlin-3 and resistant to RITA. Note that, because several cell lines have this pattern of three labels with value no, the average label cardinality is smaller than 1, since label cardinality is computed by counting the number of yes values among the labels.

The second multi-label microarray dataset (referred to as dataset M2) also has 28,536 features (genes) and 24 instances (cell lines), but it has 3 binary class attributes (different drugs used to treat neuroblastoma), namely: Cisplatin, Carboplatin and Oxaliplatin.

Moreover, in both datasets M1 and M2, we removed genes with unknown names, because we aimed at selecting genes whose relevance to drug resistance/sensitivity can be interpreted by biologists. After removing the unknown genes, 22,060 features (genes) remained in dataset M1 and 22,058 remained in dataset M2 (in each dataset, about 22.7% of the genes had unknown names).
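As a quick arithmetic check of the percentage quoted above, the proportion of removed genes can be recomputed from the reported gene counts. The variable names are illustrative only.

```python
TOTAL_GENES = 28536                       # genes before filtering (both datasets)
remaining = {"M1": 22060, "M2": 22058}    # genes left after removing unnamed genes

# Percentage of genes removed from each dataset.
removed_pct = {
    name: 100 * (TOTAL_GENES - kept) / TOTAL_GENES
    for name, kept in remaining.items()
}
for name, pct in removed_pct.items():
    print(f"{name}: {pct:.1f}% of genes removed")
```

Both values round to 22.7%, in agreement with the figure reported in the text.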

5.4 Experimental Methodology

The experiments reported in this Chapter are divided into five parts, as follows. First, we ran an experiment for comparing two different versions of ML-CFS: (1) the first version of ML-CFS (described in Section 4.1); and (2) the ML-CFS method using the absolute value of the correlation coefficient (ML-CFSabs), which was described in Section 4.2.1.

Table 5.2: Five different versions of ML-CFS using a weighted formula to combine the merit function and KEGG pathway information

Methods      α      1−α
ML-CFSk55    0.5    0.5
ML-CFSk64    0.6    0.4
ML-CFSk73    0.7    0.3
ML-CFSk82    0.8    0.2
ML-CFSk91    0.9    0.1

Second, we ran an experiment for comparing 5 different parameter (α) settings of ML-CFS using a weighted formula to combine the merit function and KEGG pathway information, as described in Section 5.2. The pre-defined weights (α) and (1−α) used in Equation (5.2) are shown in Table 5.2.
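To illustrate how the five weight settings behave, the sketch below applies each α from Table 5.2 to one made-up pair of values. Equation (5.2) is not reproduced in this section, so the form used here, a convex combination α · merit + (1 − α) · KEGG score, is an assumption, and the function name, argument names and input values are hypothetical.

```python
# Weight settings from Table 5.2: method name -> alpha.
ALPHAS = {
    "ML-CFSk55": 0.5,
    "ML-CFSk64": 0.6,
    "ML-CFSk73": 0.7,
    "ML-CFSk82": 0.8,
    "ML-CFSk91": 0.9,
}

def combined_score(merit, kegg_score, alpha):
    """ASSUMED form of the weighted formula; Equation (5.2) may differ."""
    return alpha * merit + (1 - alpha) * kegg_score

# Made-up values for one candidate feature subset (illustration only).
merit, kegg_score = 0.40, 0.80
scores = {name: combined_score(merit, kegg_score, a) for name, a in ALPHAS.items()}
for name, score in scores.items():
    print(f"{name}: {score:.2f}")
```

Under this assumed form, a larger α moves the combined score towards the merit function and away from the KEGG pathway term.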

Third, we compared the best version of ML-CFS using a weighted formula to combine the merit function and KEGG pathway information against two other versions of ML-CFS: (1) ML-CFSabs; and (2) gmiML-CFS, the ML-CFS version where class labels with greater MI (Mutual Information) are assigned greater weights (described in Section 4.2.2). The idea of this experiment is to evaluate to what extent the use of mutual information and KEGG pathway information improves on the ability of ML-CFSabs to select a high-quality feature subset. It is important to mention that gmiML-CFS also uses the absolute value of the correlation coefficient (like ML-CFSabs).

Fourth, we compared the best version of ML-CFS according to the result of the previous experiment against two other ML-CFS versions using KEGG pathway information: (1) ML-CFS with KEGG pathway information embedded in the merit function (described in Section 5.2.2); and (2) ML-CFS selecting only genes that occur in KEGG pathways (described in Section 5.2.3).

Fifth, we compared the best version of our ML-CFS method in the previous experiment against Relief for Multi-Label feature selection (RFML) and the proposed Correlation-Based Feature Selection with the union operator (CFS-U). These are the same baseline approaches used in the previous Chapter, and the details of each approach are described in Section 4.5.1.

The results of these five experiments are reported in Sections 5.5.1 through 5.5.5, respectively. In each of these five experiments, in order to evaluate the predictive performance of the different versions of ML-CFS, the feature subset selected by each ML-CFS version was given to two different types of multi-label classification algorithm, namely the Multi-Label k-Nearest Neighbour (ML-kNN) classification algorithm proposed in [124] and the Back-Propagation Multi-Label Learning (BPMLL) classification algorithm [123]. These two algorithms were run using their default parameters, as given in their corresponding papers. After that, the predictive accuracy of each classification model was measured, for each ML-CFS version, on the test set, which contains data instances that were not included in the training set, therefore measuring the generalization ability of the classification model. For the two microarray datasets (M1 and M2) we used the well-known leave-one-out cross-validation procedure [116].
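The leave-one-out procedure can be sketched as below for the 24 cell lines of M1 or M2. This is a generic illustration of the split only; the function name is hypothetical, and whether feature selection is repeated inside each fold is not stated in this section, so that step is left out.

```python
def leave_one_out(n_instances):
    """Yield (train_indices, test_index) pairs, one pair per instance."""
    for test_idx in range(n_instances):
        # Every instance except the held-out one forms the training set.
        train = [i for i in range(n_instances) if i != test_idx]
        yield train, test_idx

# Each microarray dataset has 24 cell lines, hence 24 train/test splits.
folds = list(leave_one_out(24))
print(len(folds), len(folds[0][0]))
```

Each cell line is used exactly once as the single test instance, and the model for that fold is trained on the remaining 23 cell lines.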

As in Chapter 4, we measured predictive accuracy using five different accuracy measures, namely: Hamming-loss, Ranking-loss, One-error, Coverage and Average Precision [113], as reviewed in Chapter 2.
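As a reminder of what two of these measures compute, the sketch below implements Hamming-loss and One-error following their commonly used definitions. It is not the evaluation code used in the experiments: the function names, the toy labels and scores, and the 0.5 threshold used to turn scores into predicted labels are all assumptions made for illustration.

```python
def hamming_loss(true_rows, pred_rows):
    """Fraction of instance-label pairs that are predicted incorrectly."""
    n_labels = len(true_rows[0])
    errors = sum(
        t != p
        for t_row, p_row in zip(true_rows, pred_rows)
        for t, p in zip(t_row, p_row)
    )
    return errors / (len(true_rows) * n_labels)

def one_error(true_rows, score_rows):
    """Fraction of instances whose top-ranked label is not a true label."""
    misses = 0
    for t_row, s_row in zip(true_rows, score_rows):
        top = max(range(len(s_row)), key=lambda j: s_row[j])
        if t_row[top] != 1:
            misses += 1
    return misses / len(true_rows)

# Made-up labels and classifier scores for 4 instances and 3 labels.
true = [(1, 0, 0), (0, 1, 0), (0, 0, 1), (1, 1, 0)]
scores = [(0.9, 0.2, 0.1), (0.6, 0.3, 0.1), (0.1, 0.2, 0.7), (0.4, 0.8, 0.3)]
# Assumed decision rule: a label is predicted when its score is at least 0.5.
pred = [tuple(int(s >= 0.5) for s in row) for row in scores]

hl = hamming_loss(true, pred)
oe = one_error(true, scores)
print(hl, oe)
```

For both measures, lower values indicate better predictive performance.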

In document New Multi-Label Correlation-Based Feature Selection Methods for Multi-Label Classification and Application in Bioinformatics (Page 161-165)