DIMENSION REDUCTION

CHAPTER 5 EMPIRICAL WORK

5.6 CROSS-VALIDATION

5.6.2 Determining the optimal two-sample

Tables 5.8 through 5.12 report the selection frequencies of four pre-specified 𝑝-values in the two-sample 𝑡-test (namely 0.001, 0.01, 0.05 and 0.10) over the 100 random splits of the binary data sets. These frequencies are reported for 𝐾 ∈ {1, 3, 5} in the KNN classifier and for 𝐶 ∈ {1, 1 000, 10 000} in the RBF SVM classifier.
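As a rough illustration of how such selection frequencies could be tallied, the sketch below repeats a random split, runs LOOCV on the training part for each candidate threshold, and counts how often each threshold gives the lowest LOOCV error. It is only a sketch under stated assumptions, not the procedure used in this study: the 2/3 training fraction, the pooled-variance 𝑡-test, the hand-written 1-NN classifier, re-running the filter inside every LOOCV fold, and the function names `loocv_error_1nn` and `selection_frequencies` are all hypothetical choices made for the example.

```python
import numpy as np
from collections import Counter
from scipy import stats

THRESHOLDS = [0.001, 0.01, 0.05, 0.10]  # candidate p-value thresholds, ascending


def loocv_error_1nn(X, y, threshold):
    """LOOCV error of a 1-NN classifier with the t-test filter redone in each fold."""
    n = len(y)
    errors = 0
    for i in range(n):
        train = np.arange(n) != i                 # leave observation i out
        Xtr, ytr = X[train], y[train]
        # Assumption: pooled-variance two-sample t-test, one test per gene (column).
        _, p = stats.ttest_ind(Xtr[ytr == 0], Xtr[ytr == 1], axis=0)
        genes = np.flatnonzero(p < threshold)
        if genes.size == 0:                       # no gene passes: count as an error
            errors += 1
            continue
        dist = np.linalg.norm(Xtr[:, genes] - X[i, genes], axis=1)
        errors += int(ytr[np.argmin(dist)] != y[i])
    return errors / n


def selection_frequencies(X, y, n_splits=100, train_frac=2 / 3, seed=0):
    """Count how often each threshold has the lowest LOOCV error over random splits."""
    rng = np.random.default_rng(seed)
    counts = Counter({t: 0 for t in THRESHOLDS})
    for _ in range(n_splits):
        idx = rng.permutation(len(y))
        train = idx[: int(train_frac * len(y))]   # hypothetical training fraction
        errs = [loocv_error_1nn(X[train], y[train], t) for t in THRESHOLDS]
        # np.argmin returns the first minimum, so ties go to the smallest threshold.
        counts[THRESHOLDS[int(np.argmin(errs))]] += 1
    return counts


# Small synthetic example (not one of the data sets in this chapter).
rng = np.random.default_rng(1)
y = np.repeat([0, 1], 15)
X = rng.normal(size=(30, 50))
X[y == 1, :3] += 2.0                              # three informative genes
counts = selection_frequencies(X, y, n_splits=5)
print(dict(counts))
```

Breaking ties towards the smallest threshold mirrors the preference, stated later in this section, for the threshold that eliminates more genes.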

Selecting a small 𝑝-value threshold, for example 0.001, in the two-sample 𝑡-test procedure makes the VS criterion very strict, since only the genes with a 𝑝-value less than 0.001 are considered significant and selected for inclusion in the model. Note that the smaller the pre-specified 𝑝-value threshold, the greater the difference between the two sample means of a specific gene, relative to its standard error, must be for the gene to be selected as significant and to remain in the model. Therefore, the smaller the 𝑝-value threshold used in the two-sample 𝑡-test thresholding, the stricter the variable selection criterion and the fewer genes remain in the model. Conversely, a larger 𝑝-value threshold gives a less strict variable selection criterion, and fewer genes are eliminated from the model.
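The effect of the threshold on the number of retained genes can be illustrated with a minimal sketch. It assumes a pooled-variance (Student) two-sample 𝑡-test computed per gene with `scipy.stats.ttest_ind`; the text does not state which variant of the test was used, and the function name `ttest_filter` and the synthetic data are purely illustrative.

```python
import numpy as np
from scipy import stats


def ttest_filter(X, y, threshold):
    """Return indices of genes (columns) with a two-sample t-test p-value below threshold.

    X: (n_samples, n_genes) expression matrix; y: binary class labels (0/1).
    Assumption: pooled-variance t-test (equal_var=True), not confirmed by the source.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    _, pvals = stats.ttest_ind(X[y == 0], X[y == 1], axis=0, equal_var=True)
    return np.flatnonzero(pvals < threshold), pvals


# Synthetic data: 20 samples per class, 200 genes, the first five truly different.
rng = np.random.default_rng(0)
n_per_class, n_genes = 20, 200
X = rng.normal(size=(2 * n_per_class, n_genes))
y = np.repeat([0, 1], n_per_class)
X[y == 1, :5] += 3.0

thresholds = [0.001, 0.01, 0.05, 0.10]
kept = {t: ttest_filter(X, y, t)[0] for t in thresholds}
for t in thresholds:
    print(f"threshold {t}: {len(kept[t])} genes retained")
```

Because the same 𝑝-values are compared against each threshold, the retained gene sets are nested: every gene kept at 0.001 is also kept at 0.01, 0.05 and 0.10.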

Table 5.8 below summarises the frequencies with which the four pre-specified 𝑝-value thresholds in the two-sample 𝑡-test selection procedure were selected in 100 random splits of the colon cancer data set.

Before analysing the results reported in Table 5.8, it should be noted that in the colon data set gene 249 had the minimum calculated 𝑝-value (< 0.0001) and is therefore considered the most significant gene in terms of the two-sample 𝑡-test. Gene 249 was also ranked the most important using correlation thresholding. Additionally, further analyses indicated that for the 2 000 genes in the colon data set the median 𝑝-value is 0.1623 and that 1 215 genes are not selected at a pre-specified 𝑝-value threshold of 0.100.

Table 5.8: LOOCV 𝒕-test 𝒑-value frequencies using the KNN & SVM classifier

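| 𝒑-value | KNN, 𝑲 = 1 | KNN, 𝑲 = 3 | KNN, 𝑲 = 5 | SVM, 𝑪 = 1 | SVM, 𝑪 = 1 000 | SVM, 𝑪 = 10 000 |
|---|---|---|---|---|---|---|
| 0.001 | 31 | 32 | 52 | 67 | 17 | 19 |
| 0.010 | 18 | 18 | 12 | 20 | 31 | 32 |
| 0.050 | 20 | 27 | 24 | 9 | 22 | 29 |
| 0.100 | 31 | 23 | 12 | 4 | 30 | 20 |
| Total | 100 | 100 | 100 | 100 | 100 | 100 |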

It is evident from Table 5.8 that for all three values of 𝐾 in the KNN classifier applied to the colon data set the optimal 𝑝-value threshold in the two-sample 𝑡-test is 0.001. For the 1-NN classifier, the thresholds 0.001 and 0.100 are each selected in 31 of the 100 random splits; the smaller of the two is used in further analyses because it results in more genes being eliminated from the model. For the SVM classifier with 𝐶 = 1, the optimal 𝑝-value threshold is also 0.001. However, for 𝐶 = 1 000 and 𝐶 = 10 000 the optimal threshold is 0.010, although it is selected only one and three more times, respectively, than the next most frequent threshold (0.100 for 𝐶 = 1 000 and 0.050 for 𝐶 = 10 000). The small optimal 𝑝-value threshold for both the KNN and SVM classifiers suggests that many of the genes retained in the colon data set differ significantly between the tumour and normal population groups.
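The rule used to read off the optimal threshold (the most frequently selected threshold, with ties resolved in favour of the smaller 𝑝-value) can be written as a short sketch. The counts are copied from Table 5.8; the function name `most_frequent_threshold` and the dictionary keys are illustrative only.

```python
# Selection counts per threshold, copied from Table 5.8 (colon data set).
thresholds = [0.001, 0.010, 0.050, 0.100]  # ascending order matters for tie-breaking
table_5_8 = {
    "KNN K=1": [31, 18, 20, 31],
    "KNN K=3": [32, 18, 27, 23],
    "KNN K=5": [52, 12, 24, 12],
    "SVM C=1": [67, 20, 9, 4],
    "SVM C=1000": [17, 31, 22, 30],
    "SVM C=10000": [19, 32, 29, 20],
}


def most_frequent_threshold(counts, thresholds):
    """Return the most frequently selected threshold; ties go to the smaller one."""
    best = max(counts)
    # list.index returns the first match, i.e. the smallest tied threshold.
    return thresholds[counts.index(best)]


optimal = {name: most_frequent_threshold(c, thresholds) for name, c in table_5_8.items()}
for name, t in optimal.items():
    print(f"{name}: optimal threshold {t}")
```

Applied to Table 5.8, this gives 0.001 for all three KNN settings and for the SVM with 𝐶 = 1, and 0.010 for 𝐶 = 1 000 and 𝐶 = 10 000.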

Table 5.9 summarises the frequencies with which the four pre-specified 𝑝-value thresholds in the two-sample 𝑡-test selection procedure were selected in 100 random splits of the leukemia data set. In the leukemia data set, gene 436 gave the minimum 𝑝-value of approximately 0 (indicating that it is the most significant gene in distinguishing between AML and ALL leukemia patients), while gene 277 has the maximum 𝑝-value of 0.4969. The median 𝑝-value in the two-sample 𝑡-tests is 0.0575; 704 genes have a 𝑝-value ≤ 0.001, while 1 479 genes have a 𝑝-value ≥ 0.1.

From Table 5.9 it can be observed that for all three values of 𝐾 in the KNN classifier and all three values of the 𝐶 parameter in the SVM classifier, the optimal 𝑝-value threshold in the two-sample 𝑡-test VS procedure applied to the leukemia data set is 0.001.

Table 5.9: LOOCV 𝒕-test 𝒑-value frequencies using the KNN & SVM classifier

| 𝒑-value | KNN, 𝑲 = 1 | KNN, 𝑲 = 3 | KNN, 𝑲 = 5 | SVM, 𝑪 = 1 | SVM, 𝑪 = 1 000 | SVM, 𝑪 = 10 000 |
|---|---|---|---|---|---|---|
| 0.001 | 72 | 94 | 87 | 95 | 92 | 92 |
| 0.010 | 4 | 3 | 11 | 4 | 6 | 6 |
| 0.050 | 14 | 2 | 1 | 1 | 1 | 1 |
| 0.100 | 10 | 1 | 1 | 0 | 1 | 1 |
| Total | 100 | 100 | 100 | 100 | 100 | 100 |

Note that additional computations showed that 704 genes (19.71% of the total genes) in the leukemia data set yielded a 𝑝-value less than 0.001. As in Table 5.8, the small optimal 𝑝-value threshold for both the KNN and SVM classifiers suggests that only a small subset of genes that differ significantly between the AML and ALL groups are required to achieve the optimal classification accuracy.

The following three tables summarise the frequencies with which the four pre-specified 𝑝-value thresholds in the two-sample 𝑡-test VS procedure were selected in 100 splits of the first, second and third simulated data sets respectively. Table 5.10 provides the results for the first simulated data set.

Table 5.10: LOOCV 𝒕-test 𝒑-value frequencies for Sim1 using KNN & SVMs

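| 𝒑-value | KNN, 𝑲 = 1 | KNN, 𝑲 = 3 | KNN, 𝑲 = 5 | SVM, 𝑪 = 1 | SVM, 𝑪 = 1 000 | SVM, 𝑪 = 10 000 |
|---|---|---|---|---|---|---|
| 0.001 | 13 | 7 | 9 | 5 | 6 | 6 |
| 0.010 | 49 | 56 | 55 | 36 | 37 | 37 |
| 0.050 | 26 | 26 | 21 | 39 | 39 | 39 |
| 0.100 | 12 | 11 | 15 | 20 | 18 | 18 |
| Total | 100 | 100 | 100 | 100 | 100 | 100 |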

From Table 5.10 it is evident that the optimal 𝑝-value threshold used in the two-sample 𝑡-test on the Sim1 data set is 0.010 for all values of 𝐾 in the KNN classifier. However, for all three values considered for the 𝐶 parameter in the SVM classifier the optimal 𝑝-value threshold is 0.050, although the margin over the threshold of 0.010 is narrow: 39 versus 36 selections for 𝐶 = 1 and 39 versus 37 for 𝐶 = 1 000 and 𝐶 = 10 000, that is, 7.69% and 5.13% of the 39 selections respectively.

Note that in the Sim1 data set, 46 genes have a calculated 𝑝-value less than 0.010 and 143 genes have calculated 𝑝-values less than 0.050. In summary, Table 5.10 indicates that the optimal 𝑝-value used in the two-sample 𝑡-test thresholding for the KNN and SVM classifiers is selected as 0.010 or 0.050 for at least 75% of the random splits.
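The summary figures quoted for Sim1 can be checked directly from the counts in Table 5.10. The short sketch below recomputes the combined share of the 0.010 and 0.050 thresholds for each classifier setting, together with the margin between the two thresholds for the SVM settings, expressed as a percentage of the 39 selections of 0.050 (the base that reproduces the 7.69% and 5.13% quoted above). The variable names are illustrative.

```python
# Selection counts per threshold, copied from Table 5.10 (Sim1 data set).
thresholds = [0.001, 0.010, 0.050, 0.100]
table_5_10 = {
    "KNN K=1": [13, 49, 26, 12],
    "KNN K=3": [7, 56, 26, 11],
    "KNN K=5": [9, 55, 21, 15],
    "SVM C=1": [5, 36, 39, 20],
    "SVM C=1000": [6, 37, 39, 18],
    "SVM C=10000": [6, 37, 39, 18],
}

# Splits (out of 100) in which either 0.010 or 0.050 was selected.
combined = {name: c[1] + c[2] for name, c in table_5_10.items()}

# Margin of 0.050 over 0.010 for the SVM, as a percentage of the 0.050 count.
svm_margin_pct = {
    name: round(100 * (c[2] - c[1]) / c[2], 2)
    for name, c in table_5_10.items()
    if name.startswith("SVM")
}

print(combined)
print(svm_margin_pct)
```

The smallest combined count is 75 (for the 1-NN classifier and for the SVM with 𝐶 = 1), which is where the "at least 75%" figure comes from.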

Table 5.11 provides the results for the second simulated data set.

Table 5.11: LOOCV 𝒕-test 𝒑-value frequencies for Sim2 using KNN & SVMs

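| 𝒑-value | KNN, 𝑲 = 1 | KNN, 𝑲 = 3 | KNN, 𝑲 = 5 | SVM, 𝑪 = 1 | SVM, 𝑪 = 1 000 | SVM, 𝑪 = 10 000 |
|---|---|---|---|---|---|---|
| 0.001 | 76 | 70 | 71 | 56 | 62 | 61 |
| 0.010 | 23 | 29 | 29 | 43 | 38 | 39 |
| 0.050 | 1 | 1 | 0 | 1 | 0 | 0 |
| 0.100 | 0 | 0 | 0 | 0 | 0 | 0 |
| Total | 100 | 100 | 100 | 100 | 100 | 100 |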

From Table 5.11 it is clear that the optimal 𝑝-value threshold in the two-sample 𝑡-test thresholding on the second simulated data set is 0.001 for all values of 𝐾 in the KNN classifier as well as all three values considered for the 𝐶 parameter in the SVM classifier. Note that only three genes (genes with indices 382, 471 and 908) in the Sim2 data set have a 𝑝-value less than 0.001 and would be retained in the model if two-sample 𝑡-test thresholding is implemented with a threshold of 0.001.

Table 5.12 provides the results for the third simulated data set.

Table 5.12: LOOCV 𝒕-test 𝒑-value frequencies for Sim3 using KNN & SVMs

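| 𝒑-value | KNN, 𝑲 = 1 | KNN, 𝑲 = 3 | KNN, 𝑲 = 5 | SVM, 𝑪 = 1 | SVM, 𝑪 = 1 000 | SVM, 𝑪 = 10 000 |
|---|---|---|---|---|---|---|
| 0.001 | 16 | 19 | 18 | 25 | 20 | 20 |
| 0.010 | 35 | 43 | 45 | 43 | 42 | 40 |
| 0.050 | 32 | 26 | 28 | 19 | 21 | 22 |
| 0.100 | 17 | 12 | 9 | 13 | 17 | 18 |
| Total | 100 | 100 | 100 | 100 | 100 | 100 |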

It is evident from the results summarised in Table 5.12 that the optimal 𝑝-value threshold for applying two-sample 𝑡-test thresholding on the third simulated data set is 0.01 for the three values considered for 𝐾 in the KNN classifier and for the three values of the 𝐶 parameter in the RBF SVM classifier. Extra calculations showed that 50 genes in Sim3 have 𝑝-values less than 0.01.

In document Statistical classification in high-dimensional scenarios with a focus on microarray data sets (Page 96-100)