The main goal of this thesis was to devise a more effective way of extracting informative features from high-dimensional data using GAs and ANNs. This goal led to three major contributions: the review of related literature, the solution for feature extraction, and the prototype implementation and evaluation. These contributions are summarised as follows:
6.2.1 The review of related literature
Two domains of literature, biology and computing, were reviewed. The existing biology literature shows that there are two dominant types of microarray, oligonucleotide and cDNA, each of which requires different production techniques. The cDNA microarray usually requires two phases of normalisation, i.e. pre-normalisation in the fluorescent labelling process and post-normalisation in the imaging process. The oligonucleotide microarray requires only post-normalisation in the imaging process. Both types of microarray data are high dimensional and contain many noisy genes, which requires the use of computer algorithms to analyse the data.
The existing computer literature shows that there are two dominant ways of analysing microarray data: predicting the class of the samples based on the known cancer groups in the data; and discovering new cancer groups from the known cancer groups in the data. Both ways expose the immaturity of the gene extraction area. Thus, in this thesis, we devised an efficient solution for extracting informative genes based on a hybrid of GAs and ANNs, because previous research has restricted the utility of the selection method in order to retain the effectiveness of the classification method, and gene extraction based on a hybrid GA/ANN is rarely presented. The rarity of this solution in feature extraction is mainly due to the ill-conceived hypotheses in existing works dealing with microarray data. A second reason is that neither GAs nor ANNs are transparent models: they lack the step-by-step logic needed to explain the interaction between genes in the model. Thirdly, the ANN is commonly used as a classifier to classify samples and the GA is usually used to optimise the parameters of the ANN. Promising results based on the incorporation of a GA into an ANN have been reported in many disciplines. Thus, researchers have not been keen to explore the outcomes when an ANN is incorporated into a GA.
Our work incorporates the ANN into a GA using as few parameter settings as possible, to avoid the over-fitting problem that normally arises in ANNs. The ANN in our model is used as a fitness generator to compute the GA fitness function.
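The role of the ANN as a fitness generator can be illustrated with a minimal sketch. It is written in Python for brevity rather than in the C++ of the prototype, and it is not the prototype's code. The function name `ann_fitness`, the tanh activation, the single output node, the hand-set weights and the toy expression values are all assumptions made for illustration. The fitness returned is the number of samples that a 3-layer feedforward pass labels correctly for the genes selected by one GA chromosome.

```python
import math

def ann_fitness(gene_subset, hidden_weights, output_weights, samples, labels):
    """Count the samples labelled correctly by a 3-layer feedforward pass.

    gene_subset:    gene indices selected by one GA chromosome (input layer).
    hidden_weights: one weight list per hidden node, one weight per selected gene.
    output_weights: one weight per hidden node (single output node).
    All names and the tanh activation are illustrative assumptions.
    """
    correct = 0
    for expression, label in zip(samples, labels):
        # Input layer: expression values of the selected genes only.
        inputs = [expression[g] for g in gene_subset]
        # Hidden layer: weighted sum followed by tanh.
        hidden = [math.tanh(sum(w * x for w, x in zip(node, inputs)))
                  for node in hidden_weights]
        # Output layer: a positive output is read as class 1, otherwise class 0.
        output = math.tanh(sum(w * h for w, h in zip(output_weights, hidden)))
        predicted = 1 if output > 0 else 0
        correct += int(predicted == label)
    return correct

# Toy data (hypothetical): gene 0 separates the two classes, genes 1-3 are noise.
samples = [[2.0, 0.3, -0.1, 0.5],
           [1.5, -0.4, 0.2, -0.3],
           [-1.8, 0.1, 0.4, 0.2],
           [-2.2, -0.2, -0.5, 0.1]]
labels = [1, 1, 0, 0]
hidden_weights = [[1.0, 0.0], [0.0, 1.0]]  # two hidden nodes, two inputs
output_weights = [1.0, 0.0]

fitness_informative = ann_fitness([0, 1], hidden_weights, output_weights, samples, labels)
fitness_noisy = ann_fitness([1, 2], hidden_weights, output_weights, samples, labels)
```

With these hand-set weights, the subset containing the informative gene labels all four samples correctly, whereas the subset of noisy genes labels only two correctly, so the GA would prefer the former chromosome.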
The existing literature identifies several problems as follows:
• The lack of understanding of microarray data, which results in ambiguity in the objectives of the study.
• The risk of over-fitting and biased underestimates of the error rate due to the misuse of the validation mechanism and resubstitution estimation in the classification process.
• The lack of supporting evidence in the declaration of new prediction models due to the misuse of the validation mechanism and resubstitution estimation.
• Researchers' lack of awareness of the influence of model complexity on the prediction results, which results in model over-fitting.
• Researchers' lack of awareness of the influence of data preprocessing on finding the relevant information for the problem.
Information on these problems can be found in Sections 1.1 and 1.2 in Chapter 1 and Section 2.2 in Chapter 2.
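The bias of resubstitution estimation mentioned above can be seen in a small sketch. It uses a 1-nearest-neighbour classifier as a stand-in (not the classifier used in this thesis) and synthetic noise data whose labels carry no information, so any honest error estimate should be near chance. The function names, the data sizes and the random seed are assumptions made for illustration; the sketch is in Python rather than in the C++ of the prototype.

```python
import random

def nearest_label(point, train_points, train_labels):
    """Label of the training point closest to `point` (squared Euclidean distance)."""
    best = min(range(len(train_points)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(point, train_points[i])))
    return train_labels[best]

def resubstitution_error(points, labels):
    """Error rate when the classifier is tested on the data it was built from."""
    wrong = sum(nearest_label(p, points, labels) != y for p, y in zip(points, labels))
    return wrong / len(points)

def leave_one_out_error(points, labels):
    """Error rate when each sample is held out before it is classified."""
    wrong = 0
    for i, (p, y) in enumerate(zip(points, labels)):
        rest_points = points[:i] + points[i + 1:]
        rest_labels = labels[:i] + labels[i + 1:]
        wrong += nearest_label(p, rest_points, rest_labels) != y
    return wrong / len(points)

# Synthetic data (hypothetical): 40 samples, 50 noise "genes", labels unrelated to the data.
rng = random.Random(0)
points = [[rng.gauss(0, 1) for _ in range(50)] for _ in range(40)]
labels = [i % 2 for i in range(40)]

resub = resubstitution_error(points, labels)
loo = leave_one_out_error(points, labels)
```

Here every sample is its own nearest neighbour, so the resubstitution error is zero even though the labels are uninformative, whereas the leave-one-out estimate is not subject to this effect. Reporting the former as the error rate of a new prediction model would therefore understate its true error.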
6.2.2 The solution for feature extraction
Identifying a solution for feature extraction is the most important contribution of this research, as a selection method is always needed to reduce the amount of noisy information in microarray data. In our solution, we utilised the universal computation power of ANNs and the evolutionary ability of GAs to extract informative genes from a specific microarray data set. The three main steps in our feature extraction model are: (a) initialising a population of potential members to the problem, i.e. GA chromosomes; (b) computing fitness values for each member in the population using a 3-layer feedforward ANN; and (c) evaluating the fitness of each member in the population using GA crossover and mutation operators.
The main features of the model are as follows:
• The order of the selected genes is calculated from the selection frequency of each gene, i.e. the number of times that the gene is selected, and the fitness accuracy of the gene subset, i.e. the number of times that a sample is correctly labelled in its class for the selected genes.
• The fitness accuracy and the number of fitness evaluations for each GA cycle, i.e. GA generation, are preserved.
• Parameters can be altered flexibly, such as the type of activation function used to compute fitness values, the population size, the fitness evaluation size, the network size, the fitness precision level and the gene list to be displayed for a specific fitness accuracy.
• The model is simple: only the fundamental parameter settings are applied, which reduces the risks arising from a complex model and also provides generalisability in handling multiple types of data structure, e.g. microarray data and bioassay data, in the bioinformatics field.
• All highly fit members are retained for the next generation, except the least fit member, which is replaced by the new member.
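The replacement scheme and the bookkeeping listed above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the C++ prototype: the function `evolve`, the crossover that samples from the union of two parents, the 0.1 mutation rate and the stand-in fitness function (which counts hypothetical informative genes instead of calling an ANN) are all introduced here for illustration. For simplicity, genes are ranked by selection frequency only.

```python
import random
from collections import Counter

def evolve(fitness, n_genes, subset_size, pop_size, n_evaluations, rng):
    """Steady-state GA sketch: each new member replaces the least fit member."""
    # (a) Initialise a population of gene subsets (chromosomes).
    population = [rng.sample(range(n_genes), subset_size) for _ in range(pop_size)]
    # (b) Compute a fitness value for each member.
    scores = [fitness(member) for member in population]
    # Selection frequency: number of times each gene appears in an evaluated member.
    frequency = Counter(g for member in population for g in member)
    # Preserve (number of fitness evaluations so far, best fitness) per cycle.
    history = [(pop_size, max(scores))]
    for step in range(1, n_evaluations + 1):
        # (c) Crossover (assumed form): sample a child from the union of two parents.
        parent_a, parent_b = rng.sample(population, 2)
        pool = sorted(set(parent_a) | set(parent_b))
        child = rng.sample(pool, subset_size)
        # Mutation (assumed rate): swap one gene for a gene not yet in the child.
        if rng.random() < 0.1:
            unused = [g for g in range(n_genes) if g not in child]
            child[rng.randrange(subset_size)] = rng.choice(unused)
        # Keep all members except the least fit one, which the child replaces.
        worst = scores.index(min(scores))
        population[worst] = child
        scores[worst] = fitness(child)
        frequency.update(child)
        history.append((pop_size + step, max(scores)))
    return frequency, history

# Stand-in fitness (hypothetical): number of informative genes in the subset.
informative = {3, 7}

def toy_fitness(subset):
    return len(informative & set(subset))

frequency, history = evolve(toy_fitness, n_genes=20, subset_size=4,
                            pop_size=10, n_evaluations=200, rng=random.Random(1))
ranking = [gene for gene, _ in frequency.most_common()]
```

Because only the least fit member is replaced, the best fitness recorded in `history` never decreases from one cycle to the next, and `ranking` lists the genes from the most to the least frequently selected.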
6.2.3 The prototype implementation and evaluation
The prototype of the feature extraction solution was implemented in C++ in a Linux environment to realise the proposed techniques. This prototype helps to validate our approach and shows the feasibility of using an ANN as a fitness generator for a GA, as well as of extracting informative features from high-dimensional data and from highly imbalanced data with different data representations. The prototype provided the basis for conducting our experimental study.
The performance of the prototype was evaluated via an experimental study, which examined the following:
• The performance of the prototype, with each of several ANN activation functions, in extracting informative genes from microarray data.
• The minimal GA population size and number of fitness evaluations for efficient marker identification.
• The ability of the prototype to handle different platforms of microarray data.
• The ability of the prototype to extract important genes that have low expression in the microarray data set.
• The ability of the prototype to handle highly imbalanced data with multiple data representations.