4.3 Experiment Setup
4.4.3 Biomarker Identification
We evaluated the biomarker identification performance of the proposed method on the Apple-plus and Apple-minus datasets, because these are the only two datasets in which a set of compounds was spiked in and predefined as the biomarkers.
Figure 4.3 shows an example of the approach used to count the number of identified biomarkers. As shown in Figure 4.3, the intersection between the features selected in the terminal nodes of the tree and the predefined set of biomarkers is used to evaluate the biomarker identification task.
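The counting step described above can be sketched as follows. This is only an illustrative sketch: the function names, the matching tolerance `tol`, and the three example runs are assumptions made here and are not taken from the original experiments; only the five Apple-minus m/z values come from Table 4.4.

```python
def match_biomarkers(selected_mz, biomarker_mz, tol=0.01):
    """Return the predefined biomarkers whose m/z value lies within
    `tol` of at least one feature found in the tree's terminal nodes.
    The tolerance is an assumption; the text only states that
    biomarkers are identified by their m/z values."""
    return {b for b in biomarker_mz
            if any(abs(b - s) <= tol for s in selected_mz)}


def selection_rates(runs, biomarker_mz, tol=0.01):
    """Percentage of runs in which each biomarker is selected."""
    counts = {b: 0 for b in biomarker_mz}
    for selected in runs:
        for b in match_biomarkers(selected, biomarker_mz, tol):
            counts[b] += 1
    return {b: 100.0 * c / len(runs) for b, c in counts.items()}


# Apple-minus biomarkers (m/z values from Table 4.4).
biomarkers = [463.0, 447.09, 273.03, 435.13, 227.07]

# Hypothetical terminal-node features from three runs (made up).
runs = [
    [447.09, 273.03, 120.50],
    [447.09, 500.20],
    [273.03, 447.09, 227.07],
]

rates = selection_rates(runs, biomarkers)
for mz, rate in rates.items():
    print(f"{mz:8.2f}  selected in {rate:6.2f}% of runs")
```

A biomarker counts as identified if its rate is above zero, and the per-biomarker rates correspond to the "% of GP runs" columns of Table 4.4.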
Table 4.4 shows the biomarkers in the Apple-plus and Apple-minus datasets (positive and negative modes of the ions). The table also reports whether each biomarker was identified by the proposed GP method and by Method2, together with the percentage of runs in which it appears.

Table 4.4: Spike-in biomarkers identified by the proposed GP method and Method2 for the Apple datasets. The biomarkers are identified using their m/z values (✓ = selected, ✗ = not selected).

Apple-plus dataset (12 biomarkers)
                                                     New Method                    Method2
m/z values                                           Status   % of GP runs         Status   % of GP runs
331.21                                               ✗        0                    ✓        100.0
471.09                                               ✓        80.00                ✓        50.00
107.05, 169.05, 238.05, 275.09, 456.11, 459.13       ✓        100.0                ✗        0.0
456.62, 475.10                                       ✗        0.0                  ✗        0.0
449.11                                               ✓        66.67                ✓        88.0
229.09                                               ✓        90.00                ✗        0.0

Apple-minus dataset (5 biomarkers)
                                                     New Method                    Method2
m/z values                                           Status   % of GP runs         Status   % of GP runs
463.0                                                ✓        86.67                ✗        0.0
447.09                                               ✓        100.0                ✓        86.67
273.03                                               ✓        100.0                ✓        93.33
435.13                                               ✓        100.0                ✗        0.0
227.07                                               ✓        93.33                ✗        0.0

As shown in Table 4.4, the proposed GP method identified the complete set of biomarkers in the Apple-minus dataset: three biomarkers were detected in all 30 runs and the remaining two in 86.67% and 93.33% of the runs. Method2 detected only two of these biomarkers (m/z 273.03 and 447.09), in 93.33% and 86.67% of the runs, respectively. For the Apple-plus dataset, nine of the twelve biomarkers (75%) were detected by the proposed GP method. Six biomarkers were identified in 100.0% of the runs, and the other three were selected in 66.67%, 80% and 90% of the GP runs. In contrast, Method2 identified only three of the twelve biomarkers. This suggests that the proposed method can be used successfully for biomarker identification, as it constructs a new set of features that achieves better classification accuracy and a higher biomarker detection rate.
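The summary figures quoted for the proposed GP method can be recomputed from the "% of GP runs" column of Table 4.4. The snippet below is a small arithmetic check, not part of the original experiments; the dictionary layout and the helper name `summarise` are chosen here for illustration, and the conversion to run counts assumes the 30 runs mentioned in the text.

```python
# "% of GP runs" for the new method, copied from Table 4.4.
gp_rates = {
    "Apple-plus": {
        331.21: 0.0, 471.09: 80.0,
        107.05: 100.0, 169.05: 100.0, 238.05: 100.0,
        275.09: 100.0, 456.11: 100.0, 459.13: 100.0,
        456.62: 0.0, 475.10: 0.0,
        449.11: 66.67, 229.09: 90.0,
    },
    "Apple-minus": {
        463.0: 86.67, 447.09: 100.0, 273.03: 100.0,
        435.13: 100.0, 227.07: 93.33,
    },
}


def summarise(rates):
    """Return (number detected, total biomarkers, number found in every run)."""
    detected = [mz for mz, r in rates.items() if r > 0.0]
    always = [mz for mz in detected if rates[mz] == 100.0]
    return len(detected), len(rates), len(always)


for name, rates in gp_rates.items():
    n_detected, n_total, n_always = summarise(rates)
    print(f"{name}: {n_detected}/{n_total} detected "
          f"({100.0 * n_detected / n_total:.0f}%), "
          f"{n_always} in every run")

# Converting a percentage back to a run count, assuming 30 runs.
N_RUNS = 30
runs_detected = {mz: round(r * N_RUNS / 100.0)
                 for mz, r in gp_rates["Apple-minus"].items()}
```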
4.5 Chapter Summary
The goal of this chapter was to test the performance of GP in constructing multiple new high-level features and to examine the effect of these new features in terms of dimensionality reduction, classification performance, and biomarker identification. The goal was successfully achieved by developing a new GP method, which takes an embedded approach by maximising the significant discrimination between different classes. The performance of the high-level constructed features was compared to that of the whole original set of features and of the sets of low-level features selected by two methods, using seven different classifiers. The results show that the new features performed better than the original set of features for all the datasets with most of the classifiers. The results also show that these smaller sets of new features achieved significantly better or similar performance compared to the selected low-level features on almost all the datasets. Moreover, the constructed features reduced the dimensionality more than the selected features. The biomarker identification results showed that the new GP method can identify 100.0% of the biomarkers in the Apple-minus LC-MS dataset and 75% of the predefined biomarkers in the Apple-plus dataset. Due to its better classification and biomarker identification performance, the new GP method can be successfully applied to this task.
In the next chapter, multi-objective GP methods for feature selection and construction are proposed. The multi-objective feature construction is an extension of the method proposed in this chapter that aims to keep the trade-off between the number of high-level features constructed and the classification performance.
Chapter 5
Multi-Objective Feature Manipulation
5.1 Introduction
Many feature selection techniques have been proposed to detect potential biomarkers in MS data [26, 102–104, 123]. Despite their promise, none of these methods treats the number of features as an independent objective to optimise. Some studies combined the number of features and the classification accuracy in a single fitness function, weighting each by its relative importance. The major limitation of such approaches is that the relative importance of each objective must be specified in advance. Multi-objective optimisation addresses this by optimising conflicting objectives simultaneously, without the need to fix their relative importance beforehand. Section 5.2 of this chapter presents the first attempt to use GP as a multi-objective approach to biomarker detection.
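The limitation of a single weighted fitness function can be illustrated with a small sketch. The weighted-sum form, the weight `alpha`, and the two candidate solutions below are assumptions made for illustration; they are not the fitness functions used in the cited studies.

```python
def weighted_fitness(error_rate, n_selected, n_total, alpha):
    """Hypothetical single-objective fitness (lower is better).
    `alpha` fixes in advance how much the classification error matters
    relative to the fraction of features selected."""
    return alpha * error_rate + (1.0 - alpha) * (n_selected / n_total)


# Two made-up candidates: (classification error, number of selected features).
candidates = {"A": (0.10, 40), "B": (0.14, 5)}
N_TOTAL = 100  # assumed size of the original feature set


def best(alpha):
    """Name of the candidate preferred under a given weighting."""
    return min(candidates,
               key=lambda name: weighted_fitness(*candidates[name], N_TOTAL, alpha))


for alpha in (0.95, 0.5):
    print(f"alpha={alpha}: preferred candidate is {best(alpha)}")
```

Neither candidate is better in both objectives, so which one wins depends entirely on the weight chosen beforehand. A multi-objective method would instead keep both as trade-off solutions.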
Although the feature construction approach proposed in Chapter 4 has shown the effectiveness of the new features in improving the classification performance, the number of features constructed is still high. Section 5.3 aims to extend the work in Chapter 4 to consider the trade-off between the number of features constructed and the classification accuracy through the use of multi-objective optimisation.
5.1.1 Chapter Goals
The overall goal of this chapter is to develop GP-based multi-objective feature selection and construction approaches to classification of MS data. In feature selection, the proposed GP method uses ideas from NSGA-II [27] and SPEA2 [192] to evolve models that balance the conflicting objectives. We denote these methods NS-GPMOFS and SP-GPMOFS. The main goal here is to evolve a Pareto front of non-dominated solutions, each of which includes a small number of selected original features and achieves better classification accuracy than using the whole set of features.
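The notion of a Pareto front of non-dominated solutions can be sketched as below. This shows only the standard dominance test for two minimised objectives; it is not the NSGA-II or SPEA2 machinery itself, and the example population is made up.

```python
from typing import List, Tuple

# A solution is summarised by two objectives to minimise:
# (number of selected features, classification error).
Solution = Tuple[int, float]


def dominates(a: Solution, b: Solution) -> bool:
    """True if `a` is no worse than `b` in both objectives
    and strictly better in at least one."""
    return (a[0] <= b[0] and a[1] <= b[1]) and (a[0] < b[0] or a[1] < b[1])


def pareto_front(solutions: List[Solution]) -> List[Solution]:
    """Keep the solutions that no other solution dominates."""
    return [s for s in solutions
            if not any(dominates(other, s) for other in solutions)]


# Hypothetical evolved solutions.
population = [(3, 0.20), (5, 0.12), (8, 0.12), (10, 0.08), (4, 0.25), (12, 0.08)]
front = sorted(pareto_front(population))
print(front)
```

Each solution on the front represents a different trade-off; solutions such as (8, 0.12) are discarded because another solution reaches the same error with fewer features.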
In feature construction, a single evolved tree is used to construct multiple features by replacing the original features with the constructed features after combining them using the GP functions. Multi-objective optimisation is used to reduce the number of constructed features while maintaining high classification accuracy. We denote these methods NS-GPMOFC and SP-GPMOFC.
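One way to picture how a single tree can yield several constructed features is sketched below. It rests on a loud assumption: that each function node (subtree) of the tree produces one constructed feature. The tuple-based tree encoding, the function set, and the sample values are all hypothetical and may differ from the representation actually used in this work.

```python
import operator

# Assumed function set; the real GP function set may differ.
FUNCTIONS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def evaluate(node, sample):
    """Evaluate a (sub)tree on one sample.
    An int terminal is the index of an original feature;
    a tuple (op, left, right) is a function node."""
    if isinstance(node, int):
        return sample[node]
    op, left, right = node
    return FUNCTIONS[op](evaluate(left, sample), evaluate(right, sample))


def function_nodes(node):
    """Yield every function node of the tree in post-order."""
    if isinstance(node, tuple):
        _, left, right = node
        yield from function_nodes(left)
        yield from function_nodes(right)
        yield node


def construct_features(tree, sample):
    """Assumption: one constructed feature per function node."""
    return [evaluate(sub, sample) for sub in function_nodes(tree)]


tree = ("+", ("*", 0, 2), ("-", 1, 3))   # hypothetical evolved tree
sample = [2.0, 5.0, 3.0, 1.0]            # made-up original feature values
constructed = construct_features(tree, sample)
print(len(constructed), constructed)
```

Under this assumption, the number of function nodes is the quantity a multi-objective method would try to reduce while keeping the classification accuracy high.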
In both approaches, an embedded approach is used to take advantage of its low computational cost and better classification accuracy.
Specifically, we will investigate the following:
• whether using GP as a multi-objective approach to feature selection can evolve better non-dominated solutions than using the single-objective GP algorithm,
• whether using multi-objective GP feature selection methods can select feature subsets that improve the classification performance and reduce the number of features more effectively than using the traditional multi-objective algorithms,