5.3 The statistical significance of the extracted genes

5.3.3 The ALL/AML microarray data set

With reference to Table 5.3 on page 138 for the ALL/AML data set, not surprisingly, the linear based system has the highest number of extracted genes (63 in total) and the tanh based system has the lowest (53 in total). Both the sigmoid and the threshold based systems have 54 genes in total, extracted with fitness evaluation sizes ranging from 20000 to 40000. Amongst the genes extracted by each system, 39 genes are common to all systems, including the first 17 genes selected by each system.
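The overlap counts above amount to set intersections over the per-system gene lists, plus a comparison of the leading positions of two rankings. A minimal sketch follows, assuming each system's output is available as a ranked list of gene indices; the names `extracted`, `common_genes` and `shared_prefix_length` and the toy gene indices are hypothetical and are not the values in Table 5.3.

```python
from functools import reduce

# Toy ranked gene lists per system (made-up indices, NOT the thesis results).
extracted = {
    "linear": [11, 12, 13, 14, 15],
    "threshold": [11, 12, 16, 14, 15],
    "sigmoid": [15, 12, 11, 13, 17],
    "tanh": [15, 12, 11, 13, 18],
}


def common_genes(systems):
    """Return the set of genes selected by every system."""
    return reduce(set.intersection, (set(genes) for genes in systems.values()))


def shared_prefix_length(ranking_a, ranking_b):
    """Count how many leading rank positions hold the same gene in both lists."""
    count = 0
    for gene_a, gene_b in zip(ranking_a, ranking_b):
        if gene_a != gene_b:
            break
        count += 1
    return count


print(sorted(common_genes(extracted)))
print(shared_prefix_length(extracted["sigmoid"], extracted["tanh"]))
```

Note that `shared_prefix_length` compares rankings position by position; if only the membership of the first k genes matters, comparing `set(ranking_a[:k])` with `set(ranking_b[:k])` would be the appropriate variant.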

Table 5.3: Genes extracted from the ALL/AML data set by each system, based on a population size of 300. Freq. is the number of times that the gene is selected. Genes in boldface are common genes identified by all systems. Genes marked with “*” match the genes reported in the original study.

The sigmoid and the tanh based systems have 49 genes in common and an identical set of first 11 selected genes. This might be because both the sigmoid and the tanh are non-linear functions, which are able to explore the correlation between features within the data. Furthermore, both systems use a logistic curve (i.e. an S-shaped curve, see Figure 3.10 on page 81) to squash the activation value of each set of genes into a specific activation range before the output is generated by the network. A logistic curve relates to growth in the learning process: at the initial stage of learning the growth is exponential; as saturation begins (at the middle stage of learning) the growth slows; and at the final stage of learning (i.e. maturity) the growth stops. This curve provides better discrimination between data classes. The linear and the threshold based systems, in contrast, have 52 genes in common. This is because both systems perform a simple linear computation on the activation value for each set of genes and do not squash the activation results. Both the linear and the threshold functions use a straight line (see Figure 3.10) rather than a logistic curve to discriminate data classes. However, they do not rank genes identically, mainly because the threshold based system activates a network node only when its activation exceeds the threshold value defined in the system.
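The four activation functions compared here can be written down directly. The sketch below uses their textbook definitions, which may differ in detail from the implementation in the GANN systems; in particular the threshold value `theta` (default 0.0) is an assumption, as the value used in the thesis is not stated in this section. It also shows why the sigmoid and tanh behave alike: tanh is a rescaled and shifted logistic curve, tanh(x) = 2·sigmoid(2x) − 1.

```python
import math


def linear(x):
    """Identity activation: the value is passed through unsquashed."""
    return x


def threshold(x, theta=0.0):
    """Step activation: the node fires only when x exceeds theta (assumed value)."""
    return 1.0 if x > theta else 0.0


def sigmoid(x):
    """Logistic curve, squashing x into (0, 1); written to avoid overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def tanh(x):
    """Hyperbolic tangent, squashing x into (-1, 1)."""
    return math.tanh(x)


for x in (-4.0, -1.0, 0.0, 1.0, 4.0):
    print(x, linear(x), threshold(x), round(sigmoid(x), 4), round(tanh(x), 4))
```

Both squashing functions saturate for large |x|, whereas the linear function is unbounded and the threshold function takes only two values.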

The genes extracted by each system were compared with those in the original work by Golub et al. (1999) (see Table 5.3). Amongst the genes selected by each system, both the sigmoid and the tanh based systems have 20 genes, including their top-4 ranked genes, that are consistent with the top-50 genes reported by Golub et al. Meanwhile, the linear and the threshold based systems have 24 genes matching those reported by Golub et al. Amongst these matching genes, 18 are common to all systems. This indicates that our method is effective in extracting informative genes from the ALL/AML data set even though the data set was not normalised, whereas in the work of Golub et al. (1999) the data set had been normalised to zero mean and unit standard deviation. Some relevant works on the ALL/AML data set are presented in Table C.1 in Appendix C.
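For reference, normalisation to zero mean and unit standard deviation is the usual z-score transform. The following is a minimal sketch for a single vector of expression values; the function name `standardise` is hypothetical, and both the use of the population standard deviation and the choice of which vector is standardised (per gene or per sample) are assumptions, since this section does not specify them.

```python
import statistics


def standardise(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        # A constant vector carries no variation; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


# Toy expression values (illustrative only).
z = standardise([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
print(z)
```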

Comparing the ranking order of the genes extracted by the GANN systems and by the IG method, genes 1882, 2288, 4847 and 2354 are the top-4 most significant genes selected by both the sigmoid and the tanh based systems, and these genes are consistent with the top-50 genes reported by Golub et al. (1999). For the linear based system, the top-4 significant genes are 4847, 2288, 2354 and 2121, which are also consistent with the top-50 genes reported by Golub et al. Meanwhile, the threshold based system identified gene 1779 as one of its top-4 important genes, instead of gene 2354, which was highly rated by the sigmoid, the linear and the tanh based systems. For the IG method, the four highest IG rates are held by genes 2288 and 760 (both with an IG rate of 0.747), 1882 (IG rate = 0.742), 4847 and 1834 (both with an IG rate of 0.735) and 3252 (IG rate = 0.718). Amongst the IG selected genes, gene 3252 is ranked between the top-11 and the top-20 significant genes by all the GANN systems, and gene 1834 is the least significant gene in all the GANN systems, as it does not have much correlation with the other selected genes in the systems. Gene 760 is ranked poorly by both the linear and the threshold based systems; however, in both the sigmoid and the tanh based systems it is ranked between the top-13 and the top-17 significant genes. The gene ranking discrepancy between the sigmoid/tanh based systems and the linear/threshold based systems confirms our observation that the logistic curve used by the sigmoid and the tanh based systems provides features which benefit data classification and which, at the same time, possess a certain degree of correlation with the other selected features. The main reason for the gene ranking discrepancy between the GANN systems and the IG method is that the IG method measures the distance between each feature and its nearest class independently, meaning that each of these features can be used as an independent primary feature to categorise data classes, as each feature provides a high classification accuracy in the data set. Unlike the IG method, GANN explores the correlation of features with the data classes. Therefore, a single feature in the GANN system may not provide high classification accuracy in the data set; however, a certain level of classification accuracy might be achieved using a group of features extracted by the GANN system.
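To illustrate what scoring each feature independently means, the sketch below computes the textbook entropy-based information gain of one feature against the class labels, using a single cut point to binarise the feature. This is an assumption for illustration: the discretisation used to obtain the IG rates reported above is not described in this section, and the function names and toy values are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(values, labels, cut):
    """Reduction in class entropy from splitting one feature at `cut`."""
    left = [y for x, y in zip(values, labels) if x <= cut]
    right = [y for x, y in zip(values, labels) if x > cut]
    n = len(labels)
    remainder = sum(len(part) / n * entropy(part) for part in (left, right) if part)
    return entropy(labels) - remainder


# Toy data: 4 ALL and 4 AML samples (illustrative values only).
labels = ["ALL"] * 4 + ["AML"] * 4
separating = [0.1, 0.2, 0.3, 0.4, 0.9, 1.0, 1.1, 1.2]  # splits the classes cleanly
mixed = [0.1, 0.9, 0.2, 1.0, 0.3, 1.1, 0.4, 1.2]       # split says nothing about class

print(information_gain(separating, labels, cut=0.5))
print(information_gain(mixed, labels, cut=0.5))
```

Because each feature is scored on its own, a gene that is weak individually but informative in combination with others receives a low score under this measure, which is consistent with the ranking differences discussed above.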
