Summary - Genetic algorithm-neural network: feature extraction for bioinformatics data.

In this chapter, we discussed the design of our feature extraction model. We presented technical perspectives on both GAs and ANNs and showed how ANNs are used to calculate the fitness values of GA chromosomes. The fitness computation applies simple mathematical formulas: the centroid vector principle to calculate the mean of each class, the Euclidean distance to measure the proximity of samples from each class, and network activation functions to determine the potential firing of a node. We also applied a genomic analysis platform, the GenePattern software suite, to present the gene selection results graphically. The NCBI GenBank and the SOURCE search system were used to validate the gene findings obtained by our model.
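As a reminder of the two distance-related ingredients named above, the following is a minimal C++ sketch of a class centroid and a Euclidean distance. It is an illustration only, not the code of the GANN prototype: the names Sample, classCentroid and euclideanDistance are hypothetical, and the way the two quantities are combined into a fitness value is not shown.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical type: one sample is a vector of feature (gene) values.
using Sample = std::vector<double>;

// Centroid (mean vector) of all samples that carry the class label `target`.
Sample classCentroid(const std::vector<Sample>& samples,
                     const std::vector<int>& labels, int target) {
    assert(!samples.empty() && samples.size() == labels.size());
    Sample centroid(samples.front().size(), 0.0);
    std::size_t count = 0;
    for (std::size_t i = 0; i < samples.size(); ++i) {
        if (labels[i] != target) continue;
        for (std::size_t j = 0; j < centroid.size(); ++j) {
            centroid[j] += samples[i][j];
        }
        ++count;
    }
    assert(count > 0);  // the class must contain at least one sample
    for (double& v : centroid) v /= static_cast<double>(count);
    return centroid;
}

// Euclidean distance between two vectors of equal length.
double euclideanDistance(const Sample& a, const Sample& b) {
    assert(a.size() == b.size());
    double sum = 0.0;
    for (std::size_t j = 0; j < a.size(); ++j) {
        const double d = a[j] - b[j];
        sum += d * d;
    }
    return std::sqrt(sum);
}
```

A sample can then be compared with each class by calling euclideanDistance(sample, classCentroid(samples, labels, c)) for every class label c.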

Prototype and Experimental Study

Chapter 3 presented the conceptual design of our feature extraction model to support the theme of this thesis. This chapter demonstrates the prototype of our model, namely the Genetic Algorithm-Neural Network (GANN). The novelty of the GANN model is its simplicity, which follows the principle of Ockham’s Razor and can minimise the potential for gene variability errors incurred by data preprocessing, as discussed in Chapter 2. The construction of the ANN for computing the fitness values of the chromosomes, described in the previous chapter, is explained here. Since a standard GA technique has been used in the pattern evaluation module, except for the fitness computation technique, we do not explain the GA construction steps in detail; instead, we outline the implementation of the GA evaluation.

The objectives of this chapter are to describe the tools, the prototype and the experimental study that support the theme of this thesis. This chapter contains six sections. Section 4.1 presents the software tools used in this thesis: the programming tool for developing the prototype and the synthetic data sets, the data mining tool for validating the findings of the bioassay data sets, and the visualisation tools for presenting the findings and the data sets graphically. Section 4.2 explains the need for the transposition process to preprocess microarray data. The GANN prototype is presented in Section 4.3, and Section 4.4 describes the validation steps conducted using the NCBI GenBank and the SOURCE search system. Section 4.5 describes the research methodology used to test the hypotheses of this thesis and, finally, Section 4.6 concludes the chapter.

4.1 Tools used in the Prototype

Five tools, namely C++, WEKA, GenePattern, Microsoft Excel and R Project, are used to support the theme of this thesis. C++ is an object-oriented programming language that is used for the coding of the GANN prototype. WEKA is a data mining software package that is used to compute the statistical significance of the gene findings and to validate the bioassay findings. GenePattern is a genomic analysis platform that is used to visualise the correlation of the gene findings. Microsoft Office Excel is a spreadsheet application from the Microsoft Office suite that is used to present the findings from the prototype graphically. Lastly, R Project is a language and environment that is used to visualise the data interactions.

4.1.1 Programming language for developing the prototype and the synthetic data

Although the C programming language is more commonly used in the machine learning community, C++ was chosen as the language for developing the prototype since it is less complex in terms of coding commands and offers a rich variety of built-in functions without requiring a sophisticated programming environment. Screen shots of the important functions in the prototype can be found in Appendix A. In addition to the GANN prototype, two specially written C++ programs were coded: one transposes the microarray experimental data sets and the other constructs the synthetic data sets. All C++ code was developed in a Linux environment.
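The transposition program itself is not reproduced in this section. As a rough illustration of what such a step involves, the sketch below transposes a rectangular matrix of expression values in C++. The names Matrix and transpose are hypothetical, and reading and writing of the data files is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical type: a dense matrix stored as a vector of rows.
using Matrix = std::vector<std::vector<double>>;

// Returns the transpose of `in`, so that rows become columns
// (for example, genes-by-samples becomes samples-by-genes).
Matrix transpose(const Matrix& in) {
    if (in.empty()) return {};
    const std::size_t rows = in.size();
    const std::size_t cols = in.front().size();
    Matrix out(cols, std::vector<double>(rows));
    for (std::size_t r = 0; r < rows; ++r) {
        assert(in[r].size() == cols);  // the input must be rectangular
        for (std::size_t c = 0; c < cols; ++c) {
            out[c][r] = in[r][c];
        }
    }
    return out;
}
```

Transposing twice returns the original matrix, which gives a simple check when the step is applied to a real data set.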

For the synthetic data program, the values of the genes were drawn from a standard Gaussian distribution, i.e. with a mean of 0 (µ = 0) and a standard deviation of 1 (δ = 1), with the exception of 30 selected genes in each data set, which were created with different µ values. Table 4.1 presents the settings of the synthetic data sets. In synthetic data set 1, the 30 differentially expressed genes were created with µ = 2.0 and were labelled with the gene indexes 1-15 and 5001-5015. In synthetic data set 2, 10 of the 30 genes were created with µ = 0.5 (gene indexes 1-10) and the remaining 20 genes were created with µ = 2.0 (gene indexes 11-30).

Table 4.1: The description of the synthetic data sets.

| Data set | Description | Significant features |
|---|---|---|
| Synthetic data set 1 | 100 samples equally distributed into 2 classes. Each sample has 10000 features, which were standardised with µ = 0 and δ = 1. | 30 significant features, with the feature indexes 1-15 and 5001-5015, were standardised with µ = 2.0 and δ = 1. |
| Synthetic data set 2 | 67 samples distributed into 3 classes, i.e. 20 samples in class 1, 30 in class 2 and 17 in class 3. Each sample has 5000 features, which were standardised with µ = 0 and δ = 1. | 30 significant features: the feature indexes 1-10 were standardised with µ = 0.5 and δ = 1, and the feature indexes 11-30 were standardised with µ = 2.0 and δ = 1. |
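The settings in Table 4.1 can be sketched in C++ as follows. This is a minimal illustration under stated assumptions rather than the program used in the thesis: the names makeSyntheticData and dataSet1Means, the use of std::mt19937 with a fixed seed, and the 1-based feature indexing are all choices made for this example. The text does not state whether the shifted mean depends on the class of a sample, so class labels are not modelled here and the shift is applied to every sample.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <random>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Draws nSamples x nFeatures values with unit standard deviation.
// `featureMeans` maps a 1-based feature index to its mean; every
// feature that is not listed uses a mean of 0.
Matrix makeSyntheticData(std::size_t nSamples, std::size_t nFeatures,
                         const std::map<std::size_t, double>& featureMeans,
                         unsigned seed) {
    for (const auto& kv : featureMeans) {
        assert(kv.first >= 1 && kv.first <= nFeatures);
    }
    std::mt19937 rng(seed);
    std::normal_distribution<double> standardNormal(0.0, 1.0);
    Matrix data(nSamples, std::vector<double>(nFeatures));
    for (auto& sample : data) {
        for (std::size_t f = 0; f < nFeatures; ++f) {
            const auto it = featureMeans.find(f + 1);
            const double mu = (it == featureMeans.end()) ? 0.0 : it->second;
            sample[f] = mu + standardNormal(rng);  // shift a N(0, 1) draw by mu
        }
    }
    return data;
}

// Means of the 30 significant features of synthetic data set 1 (Table 4.1).
std::map<std::size_t, double> dataSet1Means() {
    std::map<std::size_t, double> means;
    for (std::size_t i = 1; i <= 15; ++i) means[i] = 2.0;
    for (std::size_t i = 5001; i <= 5015; ++i) means[i] = 2.0;
    return means;
}
```

For example, makeSyntheticData(100, 10000, dataSet1Means(), 42u) produces a matrix with the dimensions of synthetic data set 1. The settings of synthetic data set 2 would be expressed in the same way, with a mean of 0.5 for indexes 1-10 and 2.0 for indexes 11-30.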

4.1.2 Tool for evaluating the significance of the findings

WEKA (Waikato Environment for Knowledge Analysis) is a data mining software package developed by the University of Waikato (Hall et al., 2009). It is free software that can be downloaded from its original website (WEKA data mining software). WEKA contains a large collection of machine learning and data mining tools for data preprocessing, classification, regression, clustering, association rules and visualisation. Amongst these tools, the Information Gain (GainRatio) is selected to measure the significance of individual genes extracted from the microarray and synthetic data sets. Four cost-sensitive classifiers, namely naive Bayes (NB), support vector machine (SMO), C4.5 tree (J48) and random forest (RF), are used to validate the significance of the attributes extracted from the bioassay data sets. Additionally, principal component analysis (PCA) is used as a comparative tool on the bioassay data sets.
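To make the GainRatio measure concrete, the sketch below computes a gain ratio in C++ under the usual definition, (H(class) - H(class | attribute)) / H(attribute), where H denotes Shannon entropy. It is an illustration only and not WEKA code: the names entropy and gainRatio are hypothetical, the attribute is assumed to have been discretised into integer values beforehand, and returning 0 for an attribute with zero entropy is a convention chosen for this example.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

// Shannon entropy (in bits) of a table of counts that sum to `total`.
template <typename Key>
double entropy(const std::map<Key, std::size_t>& counts, std::size_t total) {
    double h = 0.0;
    for (const auto& kv : counts) {
        const double p =
            static_cast<double>(kv.second) / static_cast<double>(total);
        if (p > 0.0) h -= p * std::log2(p);
    }
    return h;
}

// Gain ratio of a discretised attribute with respect to the class labels.
double gainRatio(const std::vector<int>& attribute,
                 const std::vector<int>& labels) {
    assert(!attribute.empty() && attribute.size() == labels.size());
    const std::size_t n = labels.size();
    std::map<int, std::size_t> classCounts, valueCounts;
    std::map<std::pair<int, int>, std::size_t> jointCounts;
    for (std::size_t i = 0; i < n; ++i) {
        ++classCounts[labels[i]];
        ++valueCounts[attribute[i]];
        ++jointCounts[{attribute[i], labels[i]}];
    }
    const double hClass = entropy(classCounts, n);
    const double hAttribute = entropy(valueCounts, n);
    // H(class | attribute) = H(class, attribute) - H(attribute)
    const double hClassGivenAttribute = entropy(jointCounts, n) - hAttribute;
    if (hAttribute == 0.0) return 0.0;  // constant attribute: no information
    return (hClass - hClassGivenAttribute) / hAttribute;
}
```

An attribute that separates two balanced classes perfectly obtains a gain ratio of 1, whereas an attribute that is independent of the class obtains 0.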

For GainRatio and PCA, the default parameter settings of the Attribute Selection tool were used. To construct the cost-sensitive classifiers, the Cost Sensitive parameter on the Meta option was selected for NB, SMO and RF, and the MetaCost parameter was selected for the J48 tree. Figure 4.1 shows a screen shot of the construction of a cost-sensitive NB classifier (CSC NB) in the WEKA environment. We used the default WEKA settings for the NB and RF classifiers. For the SMO classifier, we set the build logistic models parameter to “true” and, for the J48 tree, we set the Unpruned tree parameter to “true”.

4.1.3 Tool for visualising the significance of the gene findings

The GenePattern software suite is a freely available software package developed at the Broad Institute of MIT and Harvard (Reich et al., 2006). It provides access to a broad array of computational methods for analysing genomic data via the Broad Institute website (GenePattern software suites). HeatMapViewer, one of the GenePattern tools, is used to support the findings of the GANN prototype. It transforms numeric findings into graphical representations and provides a global view of the feature interactions without requiring any programming. The colour-coding scheme in HeatMapViewer provides a quick, coherent view of feature correlations.

Figure 4.2 presents a screen shot of producing a heat map in the GenePattern environment. In the HeatMapViewer module, the expression values are standardised by the mean with a standard deviation of 1 and range from -3 to 3. These values are presented in two colours, red and blue. Positive values, which indicate high expression, are displayed in red, and negative values, which indicate low expression, are displayed in blue. Intermediate expression values are displayed in different shades of red and blue. We used the default HeatMapViewer settings to generate the heat maps for our findings on microarray data.
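The standardisation described above can be pictured with the following C++ sketch. It is not the GenePattern implementation. It assumes that each gene (row) is standardised separately using the sample standard deviation and that values beyond the displayed range are clipped to the limits of -3 and 3. The function name standardiseRow is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Converts one row of expression values to z-scores (mean 0, standard
// deviation 1) and clips the result to the interval [-limit, limit].
std::vector<double> standardiseRow(const std::vector<double>& row,
                                   double limit = 3.0) {
    assert(row.size() >= 2 && limit > 0.0);
    const double n = static_cast<double>(row.size());
    double mean = 0.0;
    for (double v : row) mean += v;
    mean /= n;
    double sumSquares = 0.0;
    for (double v : row) sumSquares += (v - mean) * (v - mean);
    const double sd = std::sqrt(sumSquares / (n - 1.0));  // sample std. dev.
    std::vector<double> z(row.size(), 0.0);
    if (sd == 0.0) return z;  // constant row: every value maps to 0
    for (std::size_t i = 0; i < row.size(); ++i) {
        z[i] = std::max(-limit, std::min(limit, (row[i] - mean) / sd));
    }
    return z;
}
```

Under these assumptions a value of 3 would be drawn in the strongest red, a value of -3 in the strongest blue, and values near 0 in the palest shades.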

Figure 4.2: Screen shot of generating a heat map using HeatMapViewer.

4.1.4 Tool for visualising the findings

Microsoft Office Excel is a spreadsheet application written and distributed by Microsoft. It features calculation functions, graphing tools, pivot tables and the VBA macro programming language. In this thesis, the calculation functions and graphing tools of Microsoft Office Excel (version 11.0) are used.

4.1.5 Tool for visualising data sets

R Project is a language and environment for statistical computing and graphics, similar to the S language and environment that was developed at Bell Laboratories. It is freely available in source code form under the terms of the Free Software Foundation’s GNU General Public License from the R website (R Development Core Team, 2006). In R, the cmdscale() function has been used to visualise sample patterns within the data. Figure 4.3 shows a screen shot of the cmdscale code in the R environment. The cmdscale function performs classical multidimensional scaling (MDS): it visualises the similarities and dissimilarities between data points (i.e. samples), based on several fitted variable points (i.e. features), by projecting the data points onto a two-dimensional graph. The results are evaluated by comparing the distances between data points in the proximity matrix with the Euclidean distances, together with a measure of goodness-of-fit. In R, we use the default parameters for the Euclidean distance and the goodness-of-fit.
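The first two steps of classical MDS can be sketched in C++ as follows: building the matrix of pairwise Euclidean distances between samples and double centring the squared distances. This is an illustration of the textbook procedure, not of the internals of cmdscale. The names distanceMatrix and doubleCentre are hypothetical, and the final step, an eigendecomposition of the centred matrix that yields the two plotting coordinates, is omitted.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Pairwise Euclidean distances between the samples (rows) of `data`.
Matrix distanceMatrix(const Matrix& data) {
    const std::size_t n = data.size();
    Matrix d(n, std::vector<double>(n, 0.0));
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = i + 1; j < n; ++j) {
            assert(data[i].size() == data[j].size());
            double sum = 0.0;
            for (std::size_t k = 0; k < data[i].size(); ++k) {
                const double diff = data[i][k] - data[j][k];
                sum += diff * diff;
            }
            d[i][j] = d[j][i] = std::sqrt(sum);
        }
    }
    return d;
}

// Double centring of the squared distances:
// b[i][j] = -0.5 * (d[i][j]^2 - rowMean[i] - rowMean[j] + grandMean).
// Because `d` is symmetric, row means and column means coincide.
Matrix doubleCentre(const Matrix& d) {
    const std::size_t n = d.size();
    assert(n > 0);
    std::vector<double> rowMean(n, 0.0);
    double grandMean = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        assert(d[i].size() == n);  // the distance matrix must be square
        for (std::size_t j = 0; j < n; ++j) {
            const double sq = d[i][j] * d[i][j];
            rowMean[i] += sq;
            grandMean += sq;
        }
    }
    for (double& m : rowMean) m /= static_cast<double>(n);
    grandMean /= static_cast<double>(n * n);
    Matrix b(n, std::vector<double>(n, 0.0));
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            b[i][j] = -0.5 * (d[i][j] * d[i][j] - rowMean[i] - rowMean[j] +
                              grandMean);
        }
    }
    return b;
}
```

The eigenvectors of the centred matrix that belong to the two largest eigenvalues, each scaled by the square root of its eigenvalue, give the coordinates of a two-dimensional plot.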

Figure 4.3: Screen shot of visualising data patterns using multidimensional scaling (MDS) in the R environment.
