CLASSIFICATION PROCEDURES - EMPIRICAL WORK

DIMENSION REDUCTION

CHAPTER 5 EMPIRICAL WORK

5.5 CLASSIFICATION PROCEDURES

This section provides a brief description of the implementation details of the classification procedures considered in this thesis.

It is important to note that the values in the investigated data sets were only standardised if it is statistically necessary for a specific dimension reduction and classification

procedure. Therefore, the binary data sets were standardised before applying the SVM classification procedure. Standardisation was carried out by computing means and standard deviations on the training data set and then using these quantities to standardise the values in both the training and test sets.

Hastie et al. (2009:79) state that principal components depend on the scaling of the inputs, implying that data sets should be standardised before performing PCA or SPCA as a dimension reduction technique. Note that standardisation is done automatically in the preProcess() function from the caret package in R which was used to perform PCA and SPCA.

The following diagram provides a summary of the classification procedures applied to the binary data sets:

Figure 5.3: Summary of the binary classification procedures

Figure 5.3 summarises the 21 different classification procedures that were applied to the binary data sets, both practical and simulated. The statistical classification procedures in Figure 5.3 can be divided according to the three main classifiers (and their extensions), namely: KNN (fastKNN), SVM and LDA (DLDA and NSC). Figure 5.3 also displays the three variable selection and dimension reduction techniques which are implemented in the thesis. Note that the binary data sets may undergo both VS and dimension reduction as shown in Figure 5.3.

From Figure 5.3 it is clear that LDA and its extensions are applied in four different versions, namely:

1. Ordinary LDA using the reduced set of inputs computed using NSC as a variable selection method

2. The NSC classification procedure (which entails an extension of DLDA) 3. DLDA on all 𝑝 original variables

4. DLDA using the reduced set of inputs computed using NSC as a VS method.

The following diagram provides a summary of the classification procedures applied to the multi-class SRBCT data set:

Figure 5.4: Summary of the multi-class classification procedures

Figure 5.4 clearly illustrates that only three classification procedures are applied to the multi-class SRBCT data set, namely KNN, fastKNN and the NSC classifier. Figure 5.4 displays only two VS techniques, correlation thresholding and the NSC procedure. Although two-sample 𝑡-test thresholding could be implemented on the multi-class data set by considering all (4₂) pairs of classes, it is not considered in this thesis. Furthermore, the optimal correlation threshold value is only determined for the KNN classifier while the optimal NSC ∆ value is only determined for the NSC classifier and not for the KNN classifier.

The following section describes the manner in which the various feature selection, dimension reduction and classification procedures displayed in Figures 5.3 and 5.4 were implemented using R.

5.5.1 KNN and fastKNN

In this thesis, the knn() function from the class package in R was used to implement the KNN classification procedure. The KNN classification procedure was applied using three different numbers of nearest neighbours, namely 𝐾 ∈ {1, 3, 5}. This was done to investigate the possible effect of this tuning parameter on the classification performance of KNN. For each case in the test set, the knn() function determines the 𝐾 nearest cases (in terms of Euclidean distance) in the training set, and the classification of the test case is decided by majority voting, with ties broken at random. The use.all argument in the

knn() function controls the handling of ties. In this thesis use.all = FALSE was used,

which implies that if there were ties then a random selection of distances equal to the 𝐾𝑡ℎ

is chosen to come to a decision.

Finally, for the binary scenario it is important to note that in the knn package in R, the procedure outputs 𝑌 as a value 1 or 2, instead of a 0 or a 1. Therefore the output has to be transformed to a 0 or a 1. However, in the multi-class scenario this transformation is not necessary.

As mentioned above, fastKNN was applied as both a classification and as a feature extraction procedure. The fastKNN classifier was applied using the fastknn() function in the fastknn package in R and the same three values of 𝐾 as in the KNN classifier were used, i.e. 𝐾 ∈ {1, 3, 5}. As with the knn() function, the fastknn() function outputs 𝑌 ($class) as a 1 or a 2 instead of a 0 or a 1. Therefore the output has to be transformed as before.

In this thesis, feature engineering using fastKNN was implemented using the PfastKNN() and PfastKNNset() functions which are included in the appendix. The PfastKNNset() function performs LOOCV for each new training set to avoid overfitting, and the

PfastKNN() function computes the 𝐾 new features for each group in the training set as

explained in Section 2.3.1. The value of 𝐾 used in the fastKNN feature extraction procedure is the same as the value of 𝐾 used in the KNN classifier. Feature extraction using fastKNN with 𝐾 = 5 is also implemented in the SVM classification procedure in the five binary data sets (therefore, the RBF SVM is applied to 10 (5×2) new features extracted from fastKNN).

The misclassification test error rate for the KNN and fastKNN procedures were calculated on each split of the data set.

5.5.2 LDA and DLDA

The LDA classifier was only applied to reduced versions of the colon cancer, leukemia, Sim1 and Sim3 data sets using the PLDA() programme as in the steps explained in Section 2.4. The PLDA() programme returns the classification for each observation in the test data set and this programme is included in Appendix B. Note that the data sets first underwent VS using the NSC procedure before applying LDA. This was done to overcome the singularity problem arising when LDA is applied to data sets where 𝑝 > 𝑁. The DLDA classifier was also only applied to the binary practical data sets using the

PdiagLDA() programme which is included in Appendix B. The PdiagLDA() programme

performs the necessary steps to calculate the DLDA discriminant score for the two populations in the test data as in Equation (2.3) and it returns the classification predictions for the test cases.

Note that the data sets were not standardised before applying the LDA and DLDA classifiers.

5.5.3 NSC

The NSC variable selection and classification procedures were applied to both the binary and multi-class data sets. The pamr.train() function from the pamr package in R was applied to the training data set to obtain a vector of 30 candidate values for the threshold ∆. In an attempt to reduce the computations in determining the optimal ∆ value in the NSC procedure, the smallest seven and largest eight of these threshold values from the 30 candidate values obtained in the pamr package were omitted and only the remaining 15 candidate values were considered. The NSC procedure is computationally extremely expensive and it was initially thought that an even smaller range of candidate ∆ values, namely ten, should be considered to speed up the process of finding an optimal value of ∆. However, data exploration highlighted the fact that the results of the NSC-KNN and NSC-SVM procedures are both quite sensitive to the choice of ∆, and therefore it can be concluded that it was better to use 15 candidate values from the default range of 30 possibilities for the CV choice of ∆. The optimal ∆ value was then determined using LOOCV on the training set, by fitting the necessary classifier to the selected genes for each of the 15 ∆ candidate values and reporting the training misclassification error rate.

The optimal ∆ value was the maximum ∆ candidate value (smallest number of selected genes) which achieves the lowest training error rate.

The NSC programme in R (which can be found in Appendix B) was thereafter applied to the training data using the optimal ∆ value to obtain the indices of the genes which play a role in classification, disregarding the rest.

The NSC classification procedures were applied using the NSC discriminant score for the classes as in Equation (2.3) in Section 2.5, using the programme Pnsc_class() for the binary data sets and Pnsc_class_mult() for the SRBCT multi-class data set. Both of these programmes are included in Appendix B.

5.5.4 SVM

The RBF SVM classifier was only applied to the binary data sets, and not to the multi- class data set. The Psvm programme included in Appendix B was used for this purpose. The first step in the Psvm programme determines the optimal 𝛾 value in the RBF kernel on each split of the training data using the sigest function in the kernlab package in R. The next step in the Psvm programme is to apply the ksvm() function from the kernlab package in R to the data set using the following arguments. The argument type="C-svc" is given to indicate that the SVM is being used in a classification setting. In order to indicate that the Radial Basis Function kernel (“Gaussian” function) must be used to compute inner products in the feature space, kernel="rbfdot" is used. The argument

kpar=list(sigma=opt.gam) is used to set the single hyper-parameter in the kernel RBF

function, the 𝛾 value, equal to the optimal 𝛾 value obtained from the data set in Step 1. As mentioned in Chapter 2, for the purpose of this thesis, the 𝐶 parameter was set to 1, 1 000 and 10 000 in order to examine the possible influence of this parameter on the classification performance of the SVM. The kernel matrix in Equation (2.13) was computed using the kernelMatrix() function.

Finally, the Psvm programme outputs the real 𝑓̂(𝒙) values in the SVM classifier, as calculated in Equation (2.13).

In document Statistical classification in high-dimensional scenarios with a focus on microarray data sets (Page 83-88)