4.5 Experimental Study
4.5.3 Experiment Design
The experiments were designed according to the objectives of the experimental study stated in Section 4.5.1 and the hypotheses of this thesis. The design is as follows:
1. Data sets of varying sizes, with varying numbers of samples allocated to each class cluster, were used in the experiments. We performed trial experiments on several data sizes and found that data sets with 2 to 3 classes and a sample-to-feature ratio of 1:100 showed a significant performance difference between the systems. Thus, the experiments were conducted on synthetic data with sample sizes from 67 to 100, feature sizes from 5000 to 10000 and 2 to 3 classes. Similar sets of experiments were also performed on real-world microarray data sets, which have sample sizes from 72 to 83, feature sizes from 2308 to 7129 and 2 to 4 classes. This design serves the first two purposes of the experimental study.
2. Varying population sizes and fitness evaluation sizes of the GA were tested in the experiments. In the GANN prototype, the values of the GA parameters, i.e. the population size, the fitness evaluation size, the crossover factor and the mutation rate, can be changed (see the GANN interface in Figure A.1 in Appendix A). Trial experiments were conducted on these parameters, and it was found that the mutation operator had only a minor influence on the stability of the system when the mutation rate varied from 0.1 to 0.5. When a mutation rate above 0.5 was applied, the system became unstable: repeated trials with similar parameter settings produced different sets of genes. This could be a precursor to the over-fitting problem. Therefore, we retained a small mutation rate, i.e. 0.1, in all systems.
We also conducted trial experiments with population sizes varying from 100 to 700 and fitness evaluation sizes varying from 1000 to 50000. The trial results showed that convergence began in most systems when the population size reached 300 and the fitness evaluation size reached 20000. However, there was little difference in system performance for population sizes between 300 and 700. Therefore, the experiments were conducted with population sizes of 100, 200 and 300. The fitness evaluation size started at 5000 and was increased in steps of 5000 evaluation cycles until the maximum evaluation size of 50000 was reached. This design supports the third purpose of the experimental study.
3. Varying levels of fitness precision in the prototype were examined in the experiments. The default value of the precision parameter is usually equal to the sample size, i.e. a 100% fitness precision score. This value can be changed to other precision levels. Experiments were run at three precision levels ranging from 95% to 100%. This design supports our statement in Section 1.2.4 on page 9 and the objectives of our research theme stated in Section 1.4 on page 12.
4. A comparison study based on the normalised and the raw microarray data sets was performed to support our argument concerning the implications of the data normalisation process, addressed in Section 1.2.2 on page 8. These experiments were conducted on the ALL/AML microarray data set.
5. Experiments based on two bioassay data sets were performed with the tanh-based GANN system. This design supports the fourth purpose of the experimental study.
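The parameter settings described in items 2 and 3 can be sketched as a configuration grid. The following is a minimal illustration rather than the prototype's actual code: the names `experiment_grid` and `required_correct` are hypothetical, the full cross product of settings is an assumption, and so is rounding the precision threshold up to the next whole sample.

```python
from itertools import product

# Values stated in the experiment design.
POPULATION_SIZES = [100, 200, 300]
EVALUATION_SIZES = list(range(5000, 50001, 5000))  # 5000, 10000, ..., 50000
MUTATION_RATE = 0.1

# The four activation-function variants of the GANN system.
ACTIVATIONS = ["sigmoid", "linear", "tanh", "threshold"]


def experiment_grid():
    """Return one configuration per (activation, population, evaluations).

    Assumption: every activation function is run with every population
    size and every evaluation size (a full cross product).
    """
    return [
        {
            "activation": activation,
            "population_size": population,
            "evaluation_size": evaluations,
            "mutation_rate": MUTATION_RATE,
        }
        for activation, population, evaluations in product(
            ACTIVATIONS, POPULATION_SIZES, EVALUATION_SIZES
        )
    ]


def required_correct(sample_size, precision_percent):
    """Hypothetical helper: samples that must be classified correctly.

    A precision of 100 equals the sample size, as in the default setting.
    Rounding up (integer ceiling) is an assumption of this sketch.
    """
    return -(-sample_size * precision_percent // 100)


grid = experiment_grid()
```

Under these assumptions the grid contains 4 × 3 × 10 = 120 configurations, and a precision of 100% on a data set of 72 samples requires all 72 samples to be classified correctly.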
All the above experiments were designed for the last purpose of the experimental study. The first and the second experiments were assessed based on the number of significant genes extracted by the system, the fitness accuracy of the system on the extracted genes and the processing time of the system. The integrity of the findings was evaluated by comparison with studies reported in previous work and from a molecular perspective.
4.6 Summary
In this chapter, we have discussed the tools used to support the theme of this thesis and explained the prototype of the method outlined in Chapter 3. An experimental study was conducted to evaluate the performance of the prototype from several aspects, including the effect of different precision levels, population sizes, fitness evaluation sizes and activation functions.
Despite the vast development of microarrays, there are many ‘grey areas’ that need to be further investigated, and the existing feature selection models are unable to handle these areas. Questions such as ‘which genes trigger the development of specific cancer diseases?’ and ‘which genes show the first sign of the recurrence of mutated cells?’ are yet to be answered. These questions were taken into consideration when we constructed the prototype, which aims to provide an insight into the elementary genes and triggered genes in malignancy development. A fitness precision accuracy parameter is also built into the prototype, which allows users to closely monitor the pattern of disease development from the beginning stage to the final stage.
In the next chapter, we compare our results with those reported in previous studies and show how the hybridisation of GAs and ANNs is suitable for analysing microarray data as well as bioassay data.
Experimental Results and Discussion
The prototype and the experimental study have been explained in the previous chapter. The objective of this chapter is to present the findings of the prototype, including discussions of the experimental results. Four GANN systems, each representing a different activation function, are compared in this chapter based on three population sizes and ten evaluation sizes.
The relevant graphs and tables to support the objectives of the experimental study and the hypotheses of this thesis are presented in this chapter. Additional information on these graphs and tables can be found in Appendix B.
5.1 System performance with different data sets in different population sizes
In this section, we assess the overall system performance based on the synthetic data sets and the microarray data sets. Three figures, each representing one evaluation criterion, were produced. Each figure presents a high-level view of system performance for the three population sizes of 100, 200 and 300: the average number of extracted genes in Figure 5.1, the average fitness accuracy of the extracted genes in Figure 5.2 and the average elapsed time (processing time) in Figure 5.3. The complete lists of the genes extracted by each system from the four data sets are presented in Appendix B.
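The averaging behind the three criteria can be illustrated with a short sketch. The record type `RunResult`, its field names and the function `average_by_system_and_population` are hypothetical and are not taken from the prototype. The sketch assumes one record per run and a plain arithmetic mean over the runs of each system at each population size.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass
class RunResult:
    """One run of one GANN system (hypothetical record layout)."""
    system: str              # "sigmoid", "linear", "tanh" or "threshold"
    population_size: int     # 100, 200 or 300
    n_genes: int             # number of genes extracted in this run
    fitness_accuracy: float  # fitness accuracy of the extracted genes
    elapsed_seconds: float   # processing time of the run


def average_by_system_and_population(results):
    """Average the three criteria per (system, population size) pair."""
    groups = defaultdict(list)
    for result in results:
        groups[(result.system, result.population_size)].append(result)

    return {
        key: {
            "avg_genes": mean(r.n_genes for r in runs),
            "avg_fitness_accuracy": mean(r.fitness_accuracy for r in runs),
            "avg_elapsed_seconds": mean(r.elapsed_seconds for r in runs),
        }
        for key, runs in groups.items()
    }
```

Each entry of the returned dictionary would correspond to one point per system in each of the three figures, assuming the figures plot simple means.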
The four systems shown in the figures represent the four ANN activation functions used in the GANN prototype to compute the fitness values for each subset of genes in the population. These are the sigmoid-, linear-, tanh- and threshold-based systems.
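For reference, the four activation functions can be written in their common textbook forms, as in the sketch below. These definitions are assumptions: the exact forms used in the GANN prototype (for example the slope of the linear function, or the cut-off value and output range of the threshold function) may differ.

```python
import math


def sigmoid(x):
    """Logistic sigmoid, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def linear(x):
    """Identity activation (assumed slope of 1)."""
    return x


def tanh(x):
    """Hyperbolic tangent, with outputs in (-1, 1)."""
    return math.tanh(x)


def threshold(x, theta=0.0):
    """Step function: 1 at or above theta, otherwise 0 (theta is assumed)."""
    return 1.0 if x >= theta else 0.0


# Lookup table so that a system can be selected by name.
ACTIVATION_FUNCTIONS = {
    "sigmoid": sigmoid,
    "linear": linear,
    "tanh": tanh,
    "threshold": threshold,
}
```

The sigmoid and tanh functions are smooth and bounded, the linear function is unbounded, and the threshold function gives a hard binary output, so the fitness values computed for the same subset of genes can differ between the four systems.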