In this section we look in more detail at the stability of the generated decision trees.
It is important to know whether some genes are chosen more often as attributes when the datasets are merged. Moreover, the larger the sample size, the better the population of gene expressions for lung cancer is represented. A gene that is selected repeatedly may therefore be important for the classification of diseases and thus could be of significance for lung cancer.
Figure 6.9: DWD applied on SCAN preprocessed datasets with the following order: GSE10072, GSE19804, GSE7670, GSE19188, GSE31547, GSE18842
Here we use two resources to verify whether a gene may be important, namely GeneCards¹ (Rebhan et al., 1997) and the differentially expressed genes (DEGs) determined by Taminau. GeneCards is a database that references genes and proteins together with their related diseases. The experiment consisted of generating 10 decision trees for classifying diseases and counting the occurrences of every gene. The tables list the genes occurring at least 5 times in the 10 trees for each dataset or combination. If no gene occurred at least 5 times, we list the gene that occurred most often in the experiments. Plots have been generated for every experiment and can be found in the Appendix.
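The counting rule described above can be sketched as follows. This is a minimal illustration, not the code used for the experiments: the function name `stable_genes`, the representation of a tree as a plain list of gene names, and the placeholder gene names are all assumptions. It also assumes that a gene is counted once per tree, even when it splits several nodes of the same tree.

```python
from collections import Counter


def stable_genes(trees, min_count=5):
    """Return {gene: count} for genes used in at least `min_count` trees.

    `trees` is a list with one entry per decision tree; each entry holds the
    genes used as split attributes in that tree (hypothetical representation).
    If no gene reaches the threshold, the most frequent gene(s) are returned.
    """
    # Assumption: count each gene at most once per tree.
    counts = Counter(gene for tree in trees for gene in set(tree))
    if not counts:
        return {}
    selected = {g: n for g, n in counts.items() if n >= min_count}
    if not selected:
        # Fallback: keep the gene(s) with the highest count.
        top = max(counts.values())
        selected = {g: n for g, n in counts.items() if n == top}
    return selected


# Toy input with placeholder gene names: 10 trees in total.
toy_trees = [["GENE_A", "GENE_B"]] * 6 + [["GENE_A", "GENE_C"]] * 4
print(sorted(stable_genes(toy_trees).items()))
# [('GENE_A', 10), ('GENE_B', 6)]
```

In the toy input, GENE_C appears in only 4 of the 10 trees and is therefore dropped; the fallback branch only applies when no gene reaches the threshold.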
Stability of the decision trees
We notice in Table 6.3 that the decision trees generated for individual datasets mostly have only one recurring gene, whereas we clearly note multiple recurring genes per combination in most of the decision trees generated for the merged datasets. The FRMA method seems to be less stable than the other preprocessing methods. This information is important and will be taken into consideration when performing experiments. The other two preprocessing techniques are more stable, but the most stable genes are not always the genes that are biologically relevant (e.g. the A2M gene for the combination of SCAN with GENENORM).
¹ http://www.genecards.org

| Preprocessing | Dataset | Gene | # in tree | Lung cancer related | In DEGs Taminau |
|---|---|---|---|---|---|
| FRMA | GSE10072 | RNF38 | 3 | no | no |
| FRMA | GSE7670 | AHNAK | 9 | no | no |
| FRMA | GSE7670 | AKAP12 | 6 | yes | no |
| FRMA | GSE31547 | RNF38 | 5 | no | no |
| FRMA | GSE19804 | SPOCK2 | 8 | no | yes |
| FRMA | GSE19188 | STX11 | 7 | no | no |
| FRMA | GSE18842 | ASPM | 4 | no | yes |
| SCAN | GSE10072 | VWF | 7 | yes | no |
| SCAN | GSE7670 | CFP | 7 | no | no |
| SCAN | GSE7670 | AGR2 | 5 | yes | no |
| SCAN | GSE31547 | ICA1 | 4 | no | no |
| SCAN | GSE19804 | SPOCK2 | 7 | yes | no |
| SCAN | GSE19188 | RILPL2 | 6 | no | no |
| SCAN | GSE18842 | ASPM | 7 | no | yes |
| UPC | GSE10072 | CAV1 | 8 | yes | yes |
| UPC | GSE7670 | AGER | 8 | yes | yes |
| UPC | GSE31547 | PTRF | 8 | yes | no |
| UPC | GSE19804 | SPOCK2 | 8 | no | yes |
| UPC | GSE19804 | ADIRF-AS1 | 6 | no | no |
| UPC | GSE19188 | ADH1B | 4 | yes | yes |
| UPC | GSE18842 | ASPM | 6 | no | yes |

Table 6.3: Genes that occurred most often when generating the decision trees for individual datasets. These genes are assumed to be the most stable ones.
Batch effect Gene # in tree lung cancer in DEGs
Table 6.4: Genes that occurred most often when generating decision trees after merging the datasets and applying batch effect removal to them. These genes are assumed to be the most stable ones.
For individual datasets, GSE7670 appears to be the most stable dataset, except under UPC. This is surprising, since it is one of the datasets with the fewest samples. The other stable dataset is GSE19804. We observed that the genes occurring most often in the decision trees generated for these datasets were often neither related to lung cancer nor categorized as DEGs by Taminau. Only SPOCK2 occurred at least five times for the same dataset under every preprocessing method. Since SPOCK2 is in Taminau's list of differentially expressed genes, we conclude that it is an important gene in this dataset. This also suggests that the three preprocessing methods behave comparably when it comes to important genes for individual datasets. The same can be seen in Table 6.4, where AGER is used at least once as an attribute in the decision trees for every preprocessing technique.
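The cross-check of recurring genes against the two reference sources amounts to a set-membership test, sketched below. The function name `annotate` and the dictionary keys are assumptions made for illustration. The two sets contain only the yes/no values transcribed from the UPC block of Table 6.3; they are not the full GeneCards or Taminau lists.

```python
def annotate(genes, cancer_related, degs):
    """Flag each gene against the two reference sets (hypothetical helper)."""
    return {
        gene: {
            "lung_cancer_related": gene in cancer_related,
            "in_degs": gene in degs,
        }
        for gene in genes
    }


# Values transcribed from the UPC rows of Table 6.3 only.
upc_genes = ["CAV1", "AGER", "PTRF", "SPOCK2", "ADIRF-AS1", "ADH1B", "ASPM"]
cancer_related = {"CAV1", "AGER", "PTRF", "ADH1B"}
degs = {"CAV1", "AGER", "SPOCK2", "ADH1B", "ASPM"}

flags = annotate(upc_genes, cancer_related, degs)

# Genes supported by both sources.
both = [g for g in upc_genes if all(flags[g].values())]
print(both)  # ['CAV1', 'AGER', 'ADH1B']
```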
Genes related to lung cancer or in DEGs Taminau
Tables 6.3 and 6.4 show clearly that merging different datasets forces the decision tree to find attributes that are more biologically relevant. The genes AGER and CAV1 are selected as the most stable genes in almost every combination of preprocessing and batch effect removal method. Both are known to be related to lung cancer and are in the list of differentially expressed genes discovered by Taminau (2012). Looking at some of the generated decision trees, we see that AGER or CAV1 is the root of the tree. This implies that these genes separate the samples into cancer and control very well. Figures 6.10 and 6.11 show two of the many examples in which these genes split the samples into two groups; they are therefore genes with a possible biological influence on cancer. We chose to show the combination of SCAN and GENENORM because A2M is a very stable gene there, yet it is neither related to lung cancer nor in Taminau's list of differentially expressed genes. A2M occurs 10 times, but never as the splitting attribute of the root node.
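The distinction between a gene that occurs in many trees and a gene that splits the root node can be made explicit with a small sketch. The nested-dictionary tree format, the function names and the leaf labels are assumptions. The toy tree reuses the gene names visible in Figure 6.11, but its shape is made up for illustration and is not the tree from the figure.

```python
from collections import Counter


def split_genes(node):
    """Yield the gene of every internal node; leaves are plain strings."""
    if isinstance(node, dict):
        yield node["gene"]
        for child in node["children"]:
            yield from split_genes(child)


def root_and_total_counts(trees):
    """Count how often each gene is the root split and how often it occurs at all."""
    roots = Counter(t["gene"] for t in trees if isinstance(t, dict))
    # Assumption: a gene is counted once per tree.
    totals = Counter(g for t in trees for g in set(split_genes(t)))
    return roots, totals


# Hypothetical tree shape; only the gene names come from Figure 6.11.
toy_tree = {
    "gene": "AGER",
    "children": [
        {"gene": "ARHGEF10", "children": ["lung cancer", "control"]},
        {
            "gene": "A2M",
            "children": [
                "lung cancer",
                {"gene": "LST1", "children": ["lung cancer", "control"]},
            ],
        },
    ],
}

roots, totals = root_and_total_counts([toy_tree, toy_tree])
print(roots["AGER"], roots["A2M"], totals["A2M"])  # 2 0 2
```

In this toy input A2M is present in every tree but is never the root, which is the pattern described above for the SCAN and GENENORM combination.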
[Figure 6.10, tree image not reproduced. Title: FRMA COMBAT : Disease. Split conditions recoverable from the figure: CAV1 >= 12; IGF2BP3 < 5.8; SOX4 < 9.3. Visible node label: lung cancer.]
Figure 6.10: The decision tree classifies the samples by their diseases for the FRMA preprocessing technique combined with the ComBat batch effect removal method. The attributes splitting the samples into subgroups are shown on top of each node. The label in the node states the class with the most samples in the node. The second line states the number of samples for each study, in the order GSE10072, GSE18842, GSE19188, GSE19804, GSE31547 and GSE7670. The last line in the node denotes the number of samples for each disease, in the order lung cancer, control and unknown.
[Figure 6.11, tree image not reproduced. Title: SCAN GENENORM : Disease. Split conditions recoverable from the figure: AGER >= 0.19; ARHGEF10 >= −1; A2M >= −5.6; LST1 >= 1.8.]
Figure 6.11: The decision tree classifies the samples by their diseases for the SCAN preprocessing technique combined with the GENENORM batch effect removal method. The attributes splitting the samples into subgroups are shown on top of each node. The label in the node states the class with the most samples in the node. The second line states the number of samples for each study, in the order GSE10072, GSE18842, GSE19188, GSE19804, GSE31547 and GSE7670. The last line in the node denotes the number of samples for each disease, in the order lung cancer, control and unknown.