This thesis focuses on extracting informative features from data sets characterised by high feature dimensionality, feature correlation and sample scarcity.
According to the existing bioinformatics literature, no computational classification model is consistently superior to the others. This is because model performance depends on the often implicit research objective and on each model's ability to extract informative genes. Although many variants of hybrid selection/classification methods have been proposed, the performance of these models is still heavily reliant on the characteristics of the data sets and the nature of the classification methods. In our solution, we analyse differentially expressed microarray genes using genetic algorithms (GAs) and artificial neural networks (ANNs).
GA and ANN were chosen for this research because both are based on analogies with nature and have been widely recognised for delivering promising results in various disciplinary areas, such as medical diagnosis (Dybowski et al., 1996; Khan et al., 2001; Djavan et al., 2002; Zhang et al., 2005; Froese et al., 2006; Heckerling et al., 2007), environmental forecasting (Nunnari, 2004; Fatemi, 2006; Nasseri et al., 2008), hardware utilisation prediction (Barletta et al., 2007; Taheri and Mohebbi, 2008), real-time series prediction (Kim and Han, 2000; Sexton and Gupta, 2000; Arifovic and Gencay, 2001), food lifespan forecasting (Gonia et al., 2008), sonar image reading (Montana and Davis, 1989) and computational problems (Sexton and Dorsey, 2000; Kwon and Moon, 2005; Cheng and Ko, 2006; Hu et al., 2007). The ANN is a universal computation algorithm that can compose complex hypotheses explaining a high degree of correlation between features without any prior information from the data set (Cartwright, 2008a). Meanwhile, the GA is an effective population-based search algorithm designed for large, complex and poorly understood data spaces, owing to its ability to exploit accumulating information about the unknown space and to bias subsequent search into useful subspaces (DeJong, 1988). In addition, the GA is robust against becoming trapped in local minima, i.e. the over-fitting problem (Montana and Davis, 1989).
GA/ANN hybrid systems are not new in microarray classification, but they are novel for gene extraction. Examples of GA/ANN hybrid systems for classification include breast metastasis recurrence (Bevilacqua et al., 2006a,b), multiclass tumour classification (Cho et al., 2003a; Karzynski et al., 2003; Lin et al., 2006) and DNA sequence motif discovery (Beiko and Charlebois, 2005). In these studies, the data sets were typically normalised and partitioned into several smaller sets to ensure better classification performance. The GA acts as a supporting tool to optimise the classification performance of the ANN. This could contribute to gene variability in the selection results. Rather than emphasising classification performance, our research focuses on the extraction ability of the hybrid GA/ANN. Our approach optimises the connection weights of the ANN and, at the same time, evaluates the fitness function of the GA using 3-layered ANNs. The key difference between existing GA/ANN hybrid systems and our approach is that, rather than using the ANN as a classifier to predict cancer classes, the ANN in our approach acts as a fitness score generator that computes the GA fitness function. Figure 1.3 contrasts the typical GA/ANN hybridisation used for classification with our selection approach.
(a) A typical GA/ANN hybrid system for microarray classification.
(b) The proposed GANN hybrid system for microarray gene extraction.
Figure 1.3: A typical GA/ANN hybrid classification model and the proposed GANN feature extraction model. Diagram (a) shows a typical GA/ANN hybrid system used in microarray classification, in which the ANN is used as a classifier to discriminate between cancer classes. Diagram (b) presents the proposed hybrid system, which focuses on extracting informative genes from microarray data. In our system, the ANN acts as a fitness score generator that computes fitness scores for the GA.
The fitness function is the most crucial aspect of a GA, as it determines the GA's effectiveness. Most research concentrates on optimising other aspects of the GA, and only a few studies address improving the fitness function, e.g. by using a penalty function to identify invalid chromosomes or by approximating the fitness evaluation within a given amount of computation time (Beasley et al., 1993). However, these approaches require an additional layer of work within the GA, for instance a set of rules for determining the invalidity of chromosomes, i.e. how poor a chromosome is, and a set of mathematical formulas to compute penalty values when the GA selects invalid chromosomes. Consequently, the optimisation performance of the GA relies heavily on how ‘good’ this additional function is in finding the ‘optimal’ fitness function. Some studies have proposed using effective classifiers for fitness computation. For instance, Li et al. (2001b) used the classification result returned by k-nearest neighbour (KNN) as the GA fitness function for acute leukaemia classification; Cho et al. (2003a) computed the fitness function from neural network prediction results on SRBCTs; and Lin et al. (2006) and Bevilacqua et al. (2006b) employed the error rate returned by neural network classification as the GA fitness function on multiclass microarray data and breast cancer metastasis recurrence, respectively. In our approach, instead of letting the user determine the level of invalidity of chromosomes, we use a simple feedforward ANN to compute fitness values for GA chromosomes. The novelty of our approach lies in the explicit design of the algorithm, which explores the potential of the GA and ANN methods for extracting informative features with minimal structural requirements on GAs and ANNs, following the principle of Ockham’s Razor.
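To make the idea of a classifier-based fitness function concrete, the following is a minimal Python sketch in the spirit of the KNN-based studies cited above. It is not a reproduction of any cited method: the function name `loo_knn_accuracy`, the leave-one-out evaluation scheme, the choice k = 1 and the toy expression values are all assumptions introduced purely for illustration.

```python
import math

def loo_knn_accuracy(samples, labels, genes, k=1):
    """Leave-one-out k-NN accuracy using only the selected gene columns.

    Hypothetical fitness: the fraction of samples whose class is recovered
    from their k nearest neighbours (Euclidean distance) among the others.
    """
    reduced = [[s[g] for g in genes] for s in samples]
    correct = 0
    for i, x in enumerate(reduced):
        # Distances from sample i to every other sample, nearest first.
        neighbours = sorted(
            (math.dist(x, y), labels[j]) for j, y in enumerate(reduced) if j != i
        )[:k]
        votes = [label for _, label in neighbours]
        predicted = max(set(votes), key=votes.count)  # majority vote
        if predicted == labels[i]:
            correct += 1
    return correct / len(samples)

# Toy data (invented): 6 samples x 3 genes; only gene 0 separates the classes.
samples = [
    [0.90, 0.1, 0.50], [0.80, 0.7, 0.40], [0.95, 0.4, 0.60],
    [0.10, 0.2, 0.50], [0.20, 0.8, 0.45], [0.05, 0.5, 0.55],
]
labels = [0, 0, 0, 1, 1, 1]

print(loo_knn_accuracy(samples, labels, genes=[0]))  # 1.0
```

A GA using such a function would score each chromosome (a candidate gene subset) by the returned accuracy, so the quality of the selected genes is tied directly to the behaviour of the chosen classifier.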
Figure 1.3b shows our hybrid approach. To formulate an effective feature extraction method and to circumvent the over-fitting problem, a GA is used to initialise a population of chromosomes, each of whose fitness values is computed using a 3-layered feedforward ANN with the centroid vector principle and Euclidean distance. Once all chromosomes have been assigned fitness values, a set of genetic mechanisms is used to assess the fitness of the chromosomes, and the least fit chromosome is replaced by a new chromosome. Through evolution over many generations, the ANN connection weights and the GA fitness function are optimised, the least fit chromosomes are gradually replaced by new chromosomes produced in each generation, and the optimal set of genes is obtained.
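The loop described above can be illustrated with a short Python sketch. This is a simplified illustration under stated assumptions, not the implementation used in this thesis. The following are assumptions made for the example: each chromosome carries a gene subset together with the ANN connection weights; fitness is the number of samples whose ANN output vector lies nearest (in Euclidean distance) to the centroid of its own class; and new chromosomes are produced by mutation only, with crossover omitted for brevity. All function names, parameter values and the toy data are hypothetical.

```python
import math
import random

def forward(weights, x, n_hidden, n_out):
    """3-layer pass: inputs -> tanh hidden layer -> tanh output vector."""
    n_in = len(x)
    hidden = [math.tanh(sum(weights[h * n_in + i] * x[i] for i in range(n_in)))
              for h in range(n_hidden)]
    off = n_hidden * n_in  # output-layer weights start here
    return [math.tanh(sum(weights[off + o * n_hidden + h] * hidden[h]
                          for h in range(n_hidden)))
            for o in range(n_out)]

def fitness(chrom, samples, labels, n_hidden, n_out):
    """Assumed score: samples whose ANN output is nearest its own class centroid."""
    genes, weights = chrom
    outputs = [forward(weights, [s[g] for g in genes], n_hidden, n_out)
               for s in samples]
    centroids = {}
    for c in sorted(set(labels)):
        members = [o for o, l in zip(outputs, labels) if l == c]
        centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    correct = 0
    for o, l in zip(outputs, labels):
        nearest = min(centroids, key=lambda c: math.dist(o, centroids[c]))
        if nearest == l:
            correct += 1
    return correct

def random_chromosome(n_genes, subset_size, n_hidden, n_out, rng):
    genes = rng.sample(range(n_genes), subset_size)
    n_weights = n_hidden * subset_size + n_out * n_hidden
    return genes, [rng.uniform(-1, 1) for _ in range(n_weights)]

def mutate(chrom, n_genes, rng):
    """Swap one selected gene for an unused one and perturb one weight."""
    genes, weights = list(chrom[0]), list(chrom[1])
    unused = [g for g in range(n_genes) if g not in genes]
    genes[rng.randrange(len(genes))] = rng.choice(unused)
    weights[rng.randrange(len(weights))] += rng.gauss(0, 0.5)
    return genes, weights

def evolve(samples, labels, subset_size=2, n_hidden=3, n_out=2,
           pop_size=20, generations=200, seed=0):
    """Steady-state GA: each generation the least fit chromosome is replaced."""
    rng = random.Random(seed)
    n_genes = len(samples[0])  # requires subset_size < n_genes
    pop = [random_chromosome(n_genes, subset_size, n_hidden, n_out, rng)
           for _ in range(pop_size)]
    scores = [fitness(c, samples, labels, n_hidden, n_out) for c in pop]
    for _ in range(generations):
        a, b = rng.sample(range(pop_size), 2)  # binary tournament
        parent = pop[a] if scores[a] >= scores[b] else pop[b]
        child = mutate(parent, n_genes, rng)
        worst = min(range(pop_size), key=scores.__getitem__)
        pop[worst] = child
        scores[worst] = fitness(child, samples, labels, n_hidden, n_out)
    best = max(range(pop_size), key=scores.__getitem__)
    return sorted(pop[best][0]), scores[best]

# Toy data (invented): 6 samples x 5 genes, two classes.
samples = [
    [0.90, 0.1, 0.50, 0.80, 0.3], [0.80, 0.7, 0.40, 0.90, 0.6],
    [0.95, 0.4, 0.60, 0.85, 0.1], [0.10, 0.2, 0.50, 0.20, 0.4],
    [0.20, 0.8, 0.45, 0.10, 0.7], [0.05, 0.5, 0.55, 0.15, 0.2],
]
labels = [0, 0, 0, 1, 1, 1]
selected_genes, best_score = evolve(samples, labels)
print(selected_genes, best_score)
```

Because the weights travel with the chromosome, the same evolutionary step adjusts both the gene subset and the network that scores it, which mirrors the joint optimisation described in the text without requiring a separate training phase for the ANN.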