5.6.4 Determining the optimal threshold values for the SRBCT data set

As in the binary classification applications explained in Section 5.6.1, five candidate correlation thresholds were considered for the SRBCT data set. The absolute correlation of each predictor variable with the response was calculated, and the maximum absolute correlation value, 𝑐𝑜𝑟_𝑚𝑎𝑥, was determined. The five candidate correlation thresholds (the equally spaced values listed in Table 5.18) were then derived from this value: the maximum correlation threshold value considered was 90% of 𝑐𝑜𝑟_𝑚𝑎𝑥 to ensure that at least one gene was extracted using correlation thresholding. Finally, although the range of candidate correlation thresholds remains constant for the SRBCT data set, LOOCV was used to determine the optimal correlation threshold on each split, so the optimal correlation threshold value used in feature extraction could vary from split to split.
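The candidate values listed in Table 5.18 are consistent with five equally spaced thresholds between 0.1 and 90% of 𝑐𝑜𝑟_𝑚𝑎𝑥. The following is a minimal Python sketch under that assumption; the lower bound of 0.1 and the equal spacing are inferred from the table rather than stated, and the function name `candidate_thresholds` is hypothetical.

```python
def candidate_thresholds(cor_max, n=5, lower=0.1, fraction=0.9):
    """Return n equally spaced thresholds from `lower` to `fraction * cor_max`.

    Assumptions (inferred from Table 5.18, not stated explicitly):
    the grid starts at 0.1 and is equally spaced.
    """
    upper = fraction * cor_max
    step = (upper - lower) / (n - 1)
    return [lower + i * step for i in range(n)]


# 0.8059 is the maximum absolute correlation reported for the SRBCT data set.
grid = candidate_thresholds(0.8059)
print([round(t, 4) for t in grid])
```

Because 0.8059 is itself a rounded value, the computed grid should be expected to agree with the tabulated thresholds only to about three decimal places.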

The table below summarises the selection frequencies of the five candidate correlation threshold values for the 100 random splits of the SRBCT data set. This is reported for 𝐾 ∈ {1, 3, 5} in the KNN classifier.

Table 5.18: LOOCV correlation threshold frequencies using the KNN classifier

Correlation threshold value    Frequency
                               𝑲 = 1    𝑲 = 3    𝑲 = 5
0.1000                         0        1        0
0.2563                         0        0        0
0.4126                         0        0        0
0.5690                         0        0        0
0.7253                         100      99       100
Total                          100      100      100

The results given in Table 5.18 indicate that for the SRBCT data set, the optimal correlation threshold value is independent of the value of 𝐾 in the KNN classifier: 0.7253 is selected in at least 99 of the 100 splits for every value of 𝐾. Therefore, from the results in Table 5.18 it can be concluded that only the genes that have an absolute correlation greater than 0.7253 with 𝑌 in the SRBCT data set should be retained in the model. Note that gene 1 194 has the maximum absolute correlation of 0.8059 with 𝑌 out of all of the 2 308 genes in the SRBCT data set. Furthermore, only five genes in the SRBCT data set have an absolute correlation with 𝑌 above the maximum threshold value of 0.7253 in Table 5.18. These are the genes with indices 187, 509, 1 003, 1 194 and 2 046.

Finally, note that the selection frequencies reported in Table 5.18 are less dispersed over the range of candidate correlation thresholds than those for the two binary practical data sets reported in Tables 5.3 through 5.7.
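The LOOCV selection of a correlation threshold on a single split, as described in this section, can be sketched in Python as follows. This is an illustration under stated assumptions rather than the implementation used in the study: the toy data are made up, only a 1-nearest-neighbour classifier is shown, the class labels are coded numerically, the correlations are recomputed inside each LOOCV fold, ties are broken in favour of the larger threshold, and the most correlated feature is retained when no feature passes the threshold. All function names are hypothetical.

```python
import math


def abs_correlation(x, y):
    """Absolute Pearson correlation (0 if either sequence is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return abs(sxy) / math.sqrt(sxx * syy)


def select_features(X, y, threshold):
    """Indices of columns whose absolute correlation with y exceeds threshold."""
    cors = [abs_correlation([row[j] for row in X], y) for j in range(len(X[0]))]
    kept = [j for j, c in enumerate(cors) if c > threshold]
    # Assumption: if nothing passes, keep the single most correlated column.
    return kept or [max(range(len(cors)), key=cors.__getitem__)]


def one_nn_predict(X_train, y_train, x_new, features):
    """Label of the nearest training row (squared Euclidean distance)."""
    def dist(row):
        return sum((row[j] - x_new[j]) ** 2 for j in features)
    nearest = min(range(len(X_train)), key=lambda i: dist(X_train[i]))
    return y_train[nearest]


def loocv_errors(X, y, threshold):
    """Number of LOOCV misclassifications for one candidate threshold."""
    errors = 0
    for i in range(len(X)):
        X_train, y_train = X[:i] + X[i + 1:], y[:i] + y[i + 1:]
        # Assumption: feature selection is repeated inside every fold.
        features = select_features(X_train, y_train, threshold)
        if one_nn_predict(X_train, y_train, X[i], features) != y[i]:
            errors += 1
    return errors


def best_threshold(X, y, candidates):
    """Candidate with the fewest LOOCV errors, plus the error counts."""
    errors = {t: loocv_errors(X, y, t) for t in candidates}
    # Assumption: ties are broken in favour of the larger threshold.
    best = min(candidates, key=lambda t: (errors[t], -t))
    return best, errors


# Toy data: column 0 separates the classes, column 1 is large-scale noise.
X = [[0.0, 5.0], [0.1, -5.0], [0.2, 0.0], [1.0, 0.0], [1.1, 5.0], [1.2, -5.0]]
y = [0, 0, 0, 1, 1, 1]
best, errors = best_threshold(X, y, [0.1, 0.5, 0.9])
print(best, errors)
```

Repeating this selection over many random splits and tallying the chosen value would give selection frequencies of the kind reported in Table 5.18.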

Table 5.19 reports the frequencies for the 15 candidate ∆ parameter values in the NSC extraction procedure for 100 random splits of the SRBCT data set for 𝐾 ∈ {1, 3, 5} in the KNN classifier as well as these frequencies for the NSC classifier.

Table 5.19: LOOCV frequencies for the NSC ∆ threshold value using KNN & NSC

NSC ∆ parameter    Frequency
                   KNN Classifier                NSC Classifier
                   𝑲 = 1    𝑲 = 3    𝑲 = 5
1.9430             0        0        0           0
2.2206             0        0        0           0
2.4982             0        0        0           0
2.7757             0        0        0           0
3.0533             0        0        0           23
3.3309             0        0        0           41
3.6085             0        0        0           31
3.8860             0        0        0           5
4.1636             0        0        0           0
4.4412             0        0        0           0
4.7188             1        0        0           0
4.9963             4        0        2           0
5.2739             12       18       19          0
5.5515             8        25       38          0
5.8290             75       57       41          0
Total              100      100      100         100

As shown in Table 5.19, ∆_𝑜𝑝𝑡 = 5.8290 for 𝐾 ∈ {1, 3, 5} in the KNN classifier on the SRBCT data set. However, for the NSC classifier ∆_𝑜𝑝𝑡 = 3.3309. It is interesting to note that for the KNN classifier in Table 5.19, only the five largest candidate ∆ values are selected over the 100 random splits, with the two largest ∆ values in Table 5.19 being selected the majority of the time. For the NSC classifier the selected ∆ frequencies are spread over only four candidate values ranging from 3.0533 to 3.8860, and values falling outside this range are never selected in the 100 random splits of the SRBCT data set.
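The ∆_𝑜𝑝𝑡 values quoted above correspond to the most frequently selected candidate in each column of Table 5.19. A short Python sketch of this tally is given below; the frequencies are copied from Table 5.19, while the helper name `modal_value` and the dictionary keys are hypothetical.

```python
# Candidate NSC shrinkage values and selection frequencies from Table 5.19.
deltas = [1.9430, 2.2206, 2.4982, 2.7757, 3.0533, 3.3309, 3.6085, 3.8860,
          4.1636, 4.4412, 4.7188, 4.9963, 5.2739, 5.5515, 5.8290]

frequencies = {
    "KNN, K=1": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 4, 12, 8, 75],
    "KNN, K=3": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 18, 25, 57],
    "KNN, K=5": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 19, 38, 41],
    "NSC":      [0, 0, 0, 0, 23, 41, 31, 5, 0, 0, 0, 0, 0, 0, 0],
}


def modal_value(values, counts):
    """Return the value with the highest selection count."""
    return values[max(range(len(counts)), key=counts.__getitem__)]


for name, counts in frequencies.items():
    # Each column should account for all 100 random splits.
    assert sum(counts) == 100
    print(name, modal_value(deltas, counts))
```

Under this rule the mode is 5.8290 for each value of 𝐾 in the KNN classifier and 3.3309 for the NSC classifier, in agreement with the values quoted in the text.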

To ensure comparability across classifiers going forward, ∆_𝑜𝑝𝑡 = 3.3309 will be implemented in the NSC variable selection procedure applied to the SRBCT data set.

Note that Tibshirani et al. (2001) report that 10-fold cross-validation on the SRBCT data set, with the classes distributed proportionally among each of the 10 parts, yielded an optimal value of ∆ = 4.34 for the NSC parameter, which selects 43 active genes.
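The proportional distribution of classes among the 10 parts mentioned above is commonly called stratified cross-validation. A minimal Python sketch of one way to build such folds is given below; it is not taken from Tibshirani et al., and the class labels, class sizes, seed and function name are all hypothetical.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_folds=10, seed=0):
    """Split sample indices into folds with near-proportional class counts."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(n_folds)]
    position = 0  # running counter so that fold sizes stay balanced overall
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        # Deal the members of each class to the folds in round-robin order.
        for idx in members:
            folds[position % n_folds].append(idx)
            position += 1
    return folds


# Hypothetical four-class labels (made-up class sizes, for illustration only).
labels = ["A"] * 20 + ["B"] * 10 + ["C"] * 12 + ["D"] * 21
folds = stratified_folds(labels)
for k, fold in enumerate(folds):
    print(k, sorted(labels[i] for i in fold))
```

With this round-robin scheme the number of samples from any one class differs by at most one between folds, which is one simple way of keeping the class proportions roughly equal across the parts.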

5.7 SUMMARY

This chapter started with a section in which a detailed description of DNA microarray data sets was provided. This was followed by a description of the three practical data sets and three simulated data sets considered in the thesis. The chapter then provided a summary and overview of the classification procedures that were implemented in the analysis of the binary and multi-class data sets. This led to a description of how the base classifiers that were used in the empirical study were implemented in R.

The chapter then introduced and briefly discussed the concept of LOOCV and its use to determine the optimal values for the correlation thresholding, the two-sample 𝑡-test and finally the NSC variable selection procedures. The results of an empirical study investigating the optimal correlation threshold, two-sample 𝑡-test 𝑝-value and NSC ∆ shrinkage parameter values for the base classifiers using LOOCV were reported and discussed. No clear selection pattern emerged across the data sets, and no firm recommendation could be made on the strength of the various selection criteria and the accuracy of the base classifiers. The results reported in Section 5.6 indicate that the optimal threshold values for the three VS procedures for 𝐾 ∈ {1, 3, 5} in the KNN classifier and 𝐶 ∈ {1, 1 000, 10 000} in the SVM classifier depend on both the data set and the classifier rather than following a general rule. It is therefore essential to perform an initial empirical study on every high-dimensional data set to determine optimal thresholding values. Furthermore, the optimal thresholding values should be determined separately for each classifier applied to a data set.

The next chapter reports the results of the empirical study and implements the optimal tuning parameters selected for the classification procedures discussed in Section 5.6.

CHAPTER 6

In document Statistical classification in high-dimensional scenarios with a focus on microarray data sets (Page 105-109)