CHAPTER 5 EMPIRICAL WORK

5.6 CROSS-VALIDATION

5.6.1 Determining the optimal correlation threshold value

Five candidate correlation threshold values were considered for each of the binary data sets. These candidate values were calculated for each data set using the following procedure: calculate the absolute correlation of each predictor variable with the response and determine the maximum absolute correlation value, 𝑐𝑜𝑟_𝑚𝑎𝑥. Then the five candidate values are obtained from 𝑐𝑜𝑟_𝑚𝑎𝑥; the values reported in Tables 5.3 to 5.7 are equally spaced, running from 0.1000 up to the highest candidate value.

Note that the highest candidate correlation threshold value considered is 80% of 𝑐𝑜𝑟_𝑚𝑎𝑥. This ensures that the selected threshold is not too large, so that at least one feature is still extracted after the data have been split into training and test parts. Finally, although the set of candidate correlation thresholds remains fixed for each data set, LOOCV is used to determine the optimal correlation threshold on each split, and therefore the optimal correlation threshold value used in feature extraction typically varies from split to split.
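The construction of the candidate values can be sketched as follows. This is a minimal illustration, not the code used in the study: it assumes that the five candidates are equally spaced between 0.1 and 80% of 𝑐𝑜𝑟_𝑚𝑎𝑥 (an assumption that is consistent with the values reported in the tables), and the function names `pearson`, `max_abs_correlation` and `candidate_thresholds` are hypothetical.

```python
import math


def pearson(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # assumption: a constant column is treated as uncorrelated
    return sxy / math.sqrt(sxx * syy)


def max_abs_correlation(columns, y):
    """Largest absolute correlation between any predictor column and y."""
    return max(abs(pearson(col, y)) for col in columns)


def candidate_thresholds(cor_max, lowest=0.1, fraction=0.8, n_values=5):
    """Equally spaced values from `lowest` to `fraction * cor_max` (assumed spacing)."""
    highest = fraction * cor_max
    step = (highest - lowest) / (n_values - 1)
    return [lowest + i * step for i in range(n_values)]


# cor_max values quoted in the text for the colon and leukemia data sets.
print([round(t, 4) for t in candidate_thresholds(0.6318)])
print([round(t, 4) for t in candidate_thresholds(0.8597)])
```

Because the quoted 𝑐𝑜𝑟_𝑚𝑎𝑥 values are themselves rounded to four decimals, the colon candidates produced in this way can differ from the tabulated ones in the fourth decimal place.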

Selecting a large threshold for the absolute correlation between the response and a predictor variable implies that only the genes that are highly correlated with 𝑌 remain in the model, so that a large number of approximately uncorrelated, and hopefully irrelevant, genes are removed. Conversely, selecting a smaller correlation threshold value implies a smaller reduction in the number of genes. This will be appropriate if a large subset of genes is required to achieve accurate classification using the KNN and RBF SVM classifiers.
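The interplay between correlation thresholding and LOOCV can be illustrated with a small sketch. It is not the implementation used in the study: the toy data, the function names, the choice to repeat the thresholding inside every LOOCV fold, and the rule of breaking ties in favour of the smallest threshold are all assumptions made for illustration. Only a KNN classifier is shown.

```python
import math
from collections import Counter


def abs_corr(x, y):
    """Absolute Pearson correlation; 0.0 for a constant column (assumption)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return abs(sxy) / math.sqrt(sxx * syy) if sxx > 0 and syy > 0 else 0.0


def select_features(X, y, threshold):
    """Column indices whose absolute correlation with y exceeds the threshold."""
    return [j for j, col in enumerate(zip(*X)) if abs_corr(col, y) > threshold]


def knn_predict(X_train, y_train, x_new, k):
    """Majority vote among the k nearest training rows (Euclidean distance)."""
    nearest = sorted(
        (math.dist(row, x_new), label) for row, label in zip(X_train, y_train)
    )[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def loocv_accuracy(X, y, threshold, k):
    """LOOCV accuracy of k-NN, re-selecting the features inside every fold."""
    hits = 0
    for i in range(len(X)):
        X_tr, y_tr = X[:i] + X[i + 1:], y[:i] + y[i + 1:]
        keep = select_features(X_tr, y_tr, threshold)
        if not keep:
            continue  # no feature survives: the fold counts as an error
        reduced = [[row[j] for j in keep] for row in X_tr]
        x_new = [X[i][j] for j in keep]
        hits += knn_predict(reduced, y_tr, x_new, k) == y[i]
    return hits / len(X)


def best_threshold(X, y, candidates, k):
    """Candidate with the highest LOOCV accuracy; ties go to the smallest one."""
    return max(sorted(candidates), key=lambda t: loocv_accuracy(X, y, t, k))


# Hypothetical toy data: column 0 separates the classes, column 1 is a
# weakly related, large-scale variable that can mislead the 1-NN classifier.
f0 = [0.0, 0.1, 0.2, 0.3, 1.0, 1.1, 1.2, 1.3]
f1 = [0, 0, 10, 10, 0, 10, 10, 10]
X = [list(row) for row in zip(f0, f1)]
y = [0, 0, 0, 0, 1, 1, 1, 1]

for t in (0.1, 0.6):
    print(t, loocv_accuracy(X, y, t, k=1))
print("selected threshold:", best_threshold(X, y, [0.1, 0.6], k=1))
```

On this toy example the stricter threshold removes the misleading second column and gives the higher LOOCV accuracy, which mirrors the argument that a large threshold is preferable when many of the retained variables are irrelevant.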

The table below summarises the frequencies with which the five candidate correlation threshold values, determined on the colon cancer data, were selected over the 100 random splits. The frequencies are reported when 𝐾 ∈ {1, 3, 5} is used in the KNN classifier and when 𝐶 ∈ {1, 1 000, 10 000} is used in the RBF SVM classifier.

Table 5.3: LOOCV correlation threshold frequencies using KNN & SVMs

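| Correlation threshold value | KNN, 𝑲 = 1 | KNN, 𝑲 = 3 | KNN, 𝑲 = 5 | SVM, 𝑪 = 1 | SVM, 𝑪 = 1 000 | SVM, 𝑪 = 10 000 |
|---|---|---|---|---|---|---|
| 0.1000 | 25 | 24 | 17 | 4 | 33 | 35 |
| 0.2013 | 32 | 35 | 47 | 18 | 43 | 40 |
| 0.3026 | 7 | 12 | 7 | 31 | 11 | 11 |
| 0.4039 | 9 | 10 | 21 | 23 | 9 | 10 |
| 0.5053 | 27 | 19 | 8 | 24 | 4 | 4 |
| Total | 100 | 100 | 100 | 100 | 100 | 100 |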

Before analysing the results reported in Table 5.3 it should be noted that in the colon data set gene 249 has the maximum absolute correlation with 𝑌 at a value of 0.6318, while gene 1 122 has the minimum absolute correlation value of 0.0001 with 𝑌. Additionally, in the full colon data set the median absolute correlation with 𝑌 is 0.1272 and 55 genes have an absolute correlation value greater than 0.5053 with 𝑌.

The results reported in Table 5.3 indicate that for the KNN classifier with 𝐾 = 1, 3 and 5 the optimal correlation threshold value on the colon data set is 0.2013 as it has the maximum selection frequency over the 100 random splits. Therefore, based on the 100 random splits of the colon data set, it appears that the subset of genes which have an absolute correlation above 0.2013 with 𝑌 will achieve the maximum classification accuracy for 𝐾 ∈ {1, 3, 5} in the KNN classifier.

Additionally, Table 5.3 suggests that for the colon data set the optimal correlation threshold value using 𝐶 = 1 in the SVM classifier is 0.3026, while for 𝐶 = 1 000 and 𝐶 = 10 000 the optimal correlation threshold value is 0.2013.

The results summarised in Table 5.3 indicate that, for the large 𝐶 parameter values, the optimal correlation threshold value is either 0.1000 or 0.2013 the majority of the time (76% for 𝐶 = 1 000 and 75% for 𝐶 = 10 000). However, for 𝐶 = 1 in the SVM classifier the optimal correlation threshold value lies between 0.3026 and 0.5053 78% of the time. The results in Table 5.3 therefore suggest that for a small 𝐶 value (𝐶 = 1) the RBF SVM classifier performs better on a highly reduced subset of the genes, obtained by selecting a high correlation threshold. This is what is expected, since a small 𝐶 value in the SVM classifier implies that the variables are not being penalised heavily. Therefore, due to the light regularisation, a larger (stricter) correlation threshold value should be implemented to ensure that fewer variables remain in the model. Conversely, the smaller correlation threshold value of 0.2013 selected for the two larger values of 𝐶 suggests that the RBF SVM classifier is more accurate on a larger subset of the original genes in the colon data set. This is what is intuitively expected, since a larger 𝐶 parameter value implies more regularisation, so that the SVM classifier can afford to include more variables and a high frequency of smaller correlation threshold values is expected.
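The percentages quoted above can be re-derived directly from the frequencies in Table 5.3. The short sketch below does this; the frequencies are copied from the table, while the helper names `modal_threshold` and `share_in_range` are hypothetical.

```python
THRESHOLDS = [0.1000, 0.2013, 0.3026, 0.4039, 0.5053]

# Selection frequencies copied from Table 5.3 (colon data, 100 random splits).
TABLE_5_3 = {
    "KNN K=1": [25, 32, 7, 9, 27],
    "KNN K=3": [24, 35, 12, 10, 19],
    "KNN K=5": [17, 47, 7, 21, 8],
    "SVM C=1": [4, 18, 31, 23, 24],
    "SVM C=1000": [33, 43, 11, 9, 4],
    "SVM C=10000": [35, 40, 11, 10, 4],
}


def modal_threshold(freqs):
    """Threshold that was selected most often (the first one on ties)."""
    return THRESHOLDS[freqs.index(max(freqs))]


def share_in_range(freqs, low, high):
    """Percentage of splits whose selected threshold lies in [low, high]."""
    chosen = sum(f for t, f in zip(THRESHOLDS, freqs) if low <= t <= high)
    return 100 * chosen / sum(freqs)


for name, freqs in TABLE_5_3.items():
    print(name, "most frequent threshold:", modal_threshold(freqs))

print(share_in_range(TABLE_5_3["SVM C=1000"], 0.1000, 0.2013))   # 76.0
print(share_in_range(TABLE_5_3["SVM C=10000"], 0.1000, 0.2013))  # 75.0
print(share_in_range(TABLE_5_3["SVM C=1"], 0.3026, 0.5053))      # 78.0
```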

Table 5.4 summarises the frequencies with which the five candidate correlation threshold values, determined on the leukemia data set, were selected over the 100 random splits. These frequencies are reported for 𝐾 ∈ {1, 3, 5} in the KNN classifier and 𝐶 ∈ {1, 1 000, 10 000} in the RBF SVM classifier.

Before analysing the results reported in Table 5.4, it should be noted that in the leukemia data set gene 1 182 has the maximum absolute correlation with 𝑌 at a value of 0.8597, while gene 277 is the least correlated with 𝑌. The median absolute correlation value in the leukemia data set is 0.1874, and 1 024 genes have a correlation with 𝑌 that is smaller than the minimum threshold in Table 5.4. Additionally, 28 genes in the leukemia data set have an absolute correlation with 𝑌 that is greater than the maximum threshold value (0.6878) in Table 5.4.

Table 5.4: LOOCV correlation threshold frequencies using KNN & SVMs

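| Correlation threshold value | KNN, 𝑲 = 1 | KNN, 𝑲 = 3 | KNN, 𝑲 = 5 | SVM, 𝑪 = 1 | SVM, 𝑪 = 1 000 | SVM, 𝑪 = 10 000 |
|---|---|---|---|---|---|---|
| 0.1000 | 33 | 42 | 14 | 96 | 97 | 97 |
| 0.2469 | 30 | 19 | 44 | 2 | 2 | 2 |
| 0.3939 | 29 | 22 | 27 | 2 | 1 | 1 |
| 0.5408 | 5 | 16 | 14 | 0 | 0 | 0 |
| 0.6878 | 3 | 1 | 1 | 0 | 0 | 0 |
| Total | 100 | 100 | 100 | 100 | 100 | 100 |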

The results reported in Table 5.4 suggest that the optimal correlation threshold value is 0.1000 for the 1-NN and 3-NN classifiers on the leukemia data set, while for the 5-NN classifier the optimal correlation threshold value is 0.2469. The results given in Table 5.4 imply that the optimal correlation threshold value for the SVM classifier on the leukemia data set is 0.1000 for 𝐶 ∈ {1, 1 000, 10 000}, since it is selected at least 96% of the time over the 100 random splits. Unlike for the colon data set, the results for the SVM classifier applied to the leukemia data set are robust with respect to 𝐶: the correlation threshold value of 0.1000 is selected in at least 96 of the 100 splits for every value of 𝐶.

Therefore, from the results summarised in Table 5.4 it appears that a larger subset of genes in the leukemia data set (excluding only the genes that have a very small absolute correlation, less than 0.1000, with 𝑌) achieves maximum classification accuracy using the KNN and SVM classifiers.

The three tables below summarise the frequencies with which the five candidate correlation threshold values were selected for the 100 random splits of the first, second and third simulated data sets respectively. These are reported for 𝐾 ∈ {1, 3, 5} in the KNN classifier and 𝐶 ∈ {1, 1 000, 10 000} in the RBF SVM classifier.

The results for the first simulated data set are:

Table 5.5: LOOCV correlation threshold frequencies for Sim1 using KNN & SVMs

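| Correlation threshold value | KNN, 𝑲 = 1 | KNN, 𝑲 = 3 | KNN, 𝑲 = 5 | SVM, 𝑪 = 1 | SVM, 𝑪 = 1 000 | SVM, 𝑪 = 10 000 |
|---|---|---|---|---|---|---|
| 0.1000 | 5 | 10 | 15 | 2 | 18 | 18 |
| 0.1850 | 13 | 7 | 18 | 26 | 15 | 14 |
| 0.2700 | 13 | 15 | 13 | 14 | 12 | 13 |
| 0.3550 | 46 | 46 | 36 | 47 | 45 | 45 |
| 0.4399 | 23 | 22 | 18 | 11 | 10 | 10 |
| Total | 100 | 100 | 100 | 100 | 100 | 100 |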

It is evident from the results in Table 5.5 that the optimal correlation threshold value is 0.3550 for both the KNN and SVM classifiers. Note that 0.3550 is the second largest threshold value tested in Table 5.5, which suggests that there is a substantial number of irrelevant genes that should be removed by correlation thresholding to achieve maximum classification accuracy over the 100 random splits of Sim1. The results for the SVM classifier on Sim1 do not follow what is intuitively expected: the optimal correlation threshold is robust to the value of 𝐶. Additional investigation of the Sim1 data set showed that 54 genes had an absolute correlation with 𝑌 above 0.3550 and will be retained in the model when implementing correlation thresholding at the value of 0.3550. Since it is known that for the Sim1 data set only 10% (100) of the genes are relevant in distinguishing between 𝑃₁ and 𝑃₂, the results in Table 5.5 confirm that a large reduction in the number of genes retained in the model is necessary.

Table 5.6 summarises the results for the second simulated data set. Note that additional computations showed that only six genes (the genes corresponding to the indices 382, 415, 447, 471, 473 and 408) have an absolute correlation above 0.4296 with the response.

It is clear from Table 5.6 that the optimal correlation threshold value on the second simulated data set is 0.4296 for all values of 𝐾 and 𝐶 in the KNN and RBF SVM classifiers respectively. Furthermore, it should be noted that, unlike in Table 5.5, the optimal correlation threshold value was selected for the majority of the 100 random splits.

Table 5.6: LOOCV correlation threshold frequencies for Sim2 using KNN & SVMs

| Correlation threshold value | KNN, 𝑲 = 1 | KNN, 𝑲 = 3 | KNN, 𝑲 = 5 | SVM, 𝑪 = 1 | SVM, 𝑪 = 1 000 | SVM, 𝑪 = 10 000 |
|---|---|---|---|---|---|---|
| 0.1000 | 19 | 14 | 12 | 0 | 4 | 4 |
| 0.1824 | 4 | 3 | 5 | 4 | 6 | 6 |
| 0.2648 | 1 | 2 | 0 | 4 | 1 | 1 |
| 0.3472 | 11 | 11 | 12 | 15 | 11 | 10 |
| 0.4296 | 65 | 70 | 71 | 77 | 78 | 79 |
| Total | 100 | 100 | 100 | 100 | 100 | 100 |

The selection of the large optimal correlation threshold value suggests that only the six genes that have a high correlation of above 0.4296 with 𝑌 are significant and thus there will be a substantial reduction in the number of genes retained in the model for the Sim2 data set.

The small number of genes selected can be explained by the fact that the correlation coefficient cannot be expected to accurately identify the important variables in cases where the two populations differ with respect to scale. Note that the six genes referred to above do not actually distinguish between the two populations in the Sim2 data set and are therefore most probably identified as important by chance. One would expect any decent thresholding method to select more than six significant genes in a simulated data set that is known to have 100 relevant genes.
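The limitation of the correlation coefficient in the presence of scale differences can be illustrated with a small simulation. The sketch below does not reproduce the Sim2 design: the normal distributions, the sample size, the seeds and the function names are hypothetical choices made only to contrast a difference in location with a difference in scale.

```python
import math
import random


def abs_corr(x, y):
    """Absolute Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return abs(sxy) / math.sqrt(sxx * syy)


def simulate_gene(n_per_class, mean_shift, sd_second, seed):
    """One gene: class 0 is N(0, 1), class 1 is N(mean_shift, sd_second^2)."""
    rng = random.Random(seed)
    x0 = [rng.gauss(0.0, 1.0) for _ in range(n_per_class)]
    x1 = [rng.gauss(mean_shift, sd_second) for _ in range(n_per_class)]
    y = [0] * n_per_class + [1] * n_per_class
    return x0 + x1, y


# A gene whose two populations differ in location only.
x_loc, y_loc = simulate_gene(2000, mean_shift=1.0, sd_second=1.0, seed=1)
# A gene whose two populations differ in scale only.
x_scale, y_scale = simulate_gene(2000, mean_shift=0.0, sd_second=3.0, seed=2)

print("location difference:", round(abs_corr(x_loc, y_loc), 3))
print("scale difference:   ", round(abs_corr(x_scale, y_scale), 3))
```

In this illustration the gene that differs only in scale is clearly relevant for classification, yet its absolute correlation with the class label is close to zero, so it would be discarded by any of the candidate thresholds considered here.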

The results for the third simulated data set are as follows:

Table 5.7: LOOCV correlation threshold frequencies for Sim3 using KNN & SVMs

| Correlation threshold value | KNN, 𝑲 = 1 | KNN, 𝑲 = 3 | KNN, 𝑲 = 5 | SVM, 𝑪 = 1 | SVM, 𝑪 = 1 000 | SVM, 𝑪 = 10 000 |
|---|---|---|---|---|---|---|
| 0.1000 | 10 | 21 | 38 | 30 | 44 | 45 |
| 0.1948 | 22 | 12 | 13 | 13 | 11 | 11 |
| 0.2897 | 25 | 21 | 12 | 15 | 8 | 7 |
| 0.3845 | 31 | 36 | 26 | 26 | 20 | 21 |
| 0.4793 | 12 | 10 | 11 | 16 | 17 | 16 |
| Total | 100 | 100 | 100 | 100 | 100 | 100 |

Table 5.7 indicates that for Sim3 the optimal correlation threshold value is dependent on the value of 𝐾 in the KNN classifier. For the small values of 𝐾 (1 and 3) the optimal correlation threshold value is 0.3845, implying that when only a few nearest neighbours are considered a large reduction in the number of genes is necessary to achieve good classification accuracy (computations show that 44 genes in the Sim3 data set have a correlation above 0.4793 with 𝑌). For the 5-NN classifier a larger subset of approximately 624 genes is required since the optimal correlation threshold value is 0.1000, which is the smallest of the five candidate threshold values considered in Table 5.7.

Using the SVM classifier with 𝐶 ∈ {1, 1 000, 10 000}, it is clear from Table 5.7 that 0.1000 is the optimal correlation threshold value for Sim3.

In document Statistical classification in high-dimensional scenarios with a focus on microarray data sets (Page 90-96)