4.3 Classifier design and evaluation
4.3.3 Classifier assessment
An assessment of a trained classifier can answer two different questions. The first one relates to the separability of two classes and biologically reads as “Are different tumour types distinguishable?” The second question refers to a global feature ranking and reads as “Which aberrations distinguish one tumour type from another?”
Methods to answer the first question are based on an estimation of the classification accuracy and will be discussed in the next part of this section. The second question will be discussed in the third part of this section.
Figure 4.8: Grid parameter search visualised by the SVM implementation libsvm [CL01]. The x-axis depicts the parameter C and the y-axis the parameter γ (both on a logarithmic scale). The classification accuracy is encoded by colour.
Quantitative classifier assessment
Different methods for a numerical model assessment were considered in section 2.3 from a theoretical point of view. But which method should be used as a reliable estimate for the reader of a biological journal? Which method reflects the separability of (two) tumour classes?
Empirical comparisons of different model assessment techniques have produced mixed results.
Kohavi compared 0.632 bootstrap, leave-one-out cross validation (LOO-CV) and ten-fold cross validation (10-fold-CV), amongst others, using the decision tree classifier C4.5 and data sets from the UCI repository [Koh95a]. The real-world data sets from this repository are commonly used to compare machine learning algorithms. He observed a higher variance of the LOO-CV and a larger bias of the bootstrap. Finally, he recommended a stratified ten-fold cross validation.
A study of small-sample microarrays using the classifiers kNN (k-nearest neighbour), lda (linear discriminant analysis) and the decision tree algorithm CART revealed a high variance of both LOO-CV and 10-fold-CV [BND04]. LOO-CV and 10-fold-CV showed a comparable performance. In conclusion, the authors recommended the computationally expensive bootstrap.
Molinaro et al. compared different resampling methods on simulated gene expression data sets [MSP05]. The classification algorithms lda, dda (diagonal linear discriminant analysis), kNN and CART were applied. The authors concluded that LOO-CV “generally performed quite well”, with the exception of unstable classifiers like CART. 10-fold-CV was quite comparable and suggested for larger samples. The authors included a feature selection, and in this setting cross validation (LOO-CV and 10-fold-CV) performed better than the bootstrap.
A comparison of model assessment methods using an SVM classifier [ABR+05] and data sets from the UCI repository revealed that LOO-CV outperforms 10-fold-CV. However, a bootstrap with 100 replicates was better and a bootstrap with 10 replicates worse than LOO-CV.
From a theoretical point of view, LOO-CV sometimes overestimates the prediction error. In a data set without any correlation between the feature values and the class labels, a classifier would predict the majority class of the cases in the learning set. In a balanced design (both class labels have a share of 50%), leaving one case out makes its class the minority in the learning set. A classifier in the LOO setting would therefore always learn the wrong class (the class that is not left out) and yield an estimated accuracy of 0%. Such a failure occurred in the data set shown in section 5.2.4, where the estimated accuracy was clearly below the 50% expected of a random assignment in a two-class problem.
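The failure mode described above can be reproduced with a few lines of Python. This is only an illustrative sketch: the majority-class predictor stands in for a classifier that finds no signal in the features, and the function names and the toy labels are assumptions made for this example.

```python
from collections import Counter


def majority_class(labels):
    """Return the most frequent label of the learning set."""
    return Counter(labels).most_common(1)[0][0]


def loo_accuracy(labels):
    """LOO-CV accuracy of a classifier that always predicts the majority class."""
    hits = 0
    for i, held_out in enumerate(labels):
        # Learning set: all cases except the one that is left out.
        learning_set = labels[:i] + labels[i + 1:]
        if majority_class(learning_set) == held_out:
            hits += 1
    return hits / len(labels)


# Balanced two-class design (hypothetical labels, 10 cases per class).
balanced = ["A"] * 10 + ["B"] * 10
print(loo_accuracy(balanced))
```

In the balanced case every left-out sample belongs to the minority class of its learning set, so each prediction is wrong, although a random assignment would be correct in about half of the cases.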
To summarise, LOO-CV performs badly for unstable classifiers. LOO-CV and 10-fold-CV have comparable results, although 10-fold-CV is characterised by a higher bias and LOO-CV by a higher variance. For small data sets of approx. 10 samples, a 10-fold-CV and a LOO-CV would be the same. Bootstrap has a low variance but sometimes a high bias.
Finally, I decided to use a LOO-CV estimator for most of the experiments with a support vector machine. However, almost all (biological) conclusions were backed up by another classifier (often the decision tree C5.0) implemented in another software package (Clementine). The 10-fold-CV was chosen for the decision tree because of the aforementioned problems of the LOO-CV with unstable classifiers.
Qualitative classifier assessment
The algorithms discussed in this part of the section answer the question “Which aberrations distinguish one tumour type from another?” This is based on a global analysis of the classifier. In section 4.4.1, a case-based analysis of a classifier is introduced. A case-based qualitative analysis answers the question “Why does a given tumour sample belong to tumour type B?”
For separable classes (classification accuracy 100%), I propose an algorithm that identifies feature subsets such that each subset can be used to distinguish both tumour types (classes). Features with a low importance for the classification are recursively discarded and the SVM retrained. An alternative approach would have been an analysis of all possible feature subsets. However, this is an NP-complete problem.
Next, the question arises whether the subsets found represent statistically relevant features. I use permutation tests and calculate a p-value for each discriminating subset found. The underlying test statistic is based on the hyperplane distance.
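The general shape of such a permutation test can be sketched as follows. This is a minimal illustration and not the implementation used here: the test statistic is passed in as a function, and the `mean_gap` statistic (absolute difference of the class means of a single feature) is only a simple stand-in for the hyperplane distance. All names and the toy data are assumptions made for this example.

```python
import random


def permutation_p_value(values, labels, statistic, n_permutations=1000, seed=0):
    """Estimate how often permuted class labels reach the observed statistic."""
    rng = random.Random(seed)
    observed = statistic(values, labels)
    shuffled = list(labels)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # breaks the link between samples and labels
        if statistic(values, shuffled) >= observed:
            at_least_as_extreme += 1
    # The added one counts the observed labelling itself and avoids p = 0.
    return (at_least_as_extreme + 1) / (n_permutations + 1)


def mean_gap(values, labels):
    """Stand-in statistic (assumption): absolute difference of the class means."""
    pos = [v for v, l in zip(values, labels) if l == 1]
    neg = [v for v, l in zip(values, labels) if l == -1]
    return abs(sum(pos) / len(pos) - sum(neg) / len(neg))


# Hypothetical measurements of one feature in two well-separated classes.
values = [5.0, 5.2, 4.9, 5.1, 1.0, 1.2, 0.9, 1.1]
labels = [1, 1, 1, 1, -1, -1, -1, -1]
print(permutation_p_value(values, labels, mean_gap))
```

To follow the approach described above, `statistic` would be replaced by a function that retrains the SVM on the permuted labels and returns the hyperplane distance.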
The QP optimisation problem of the SVM cannot always be solved efficiently. However, a difficult and time-consuming optimisation problem indicates that the separation of both classes is difficult. Therefore, the learning process of the SVM is stopped if solving the QP problem for a feature subset takes much more time than the original problem with all features.

Figure 4.9: Algorithm for the identification of discriminating feature subsets. [Flow chart with the boxes: Train the SVM with all features; Test: Separation possible? (YES/NO); Analyse hyperplane and discard one weak feature; Retrain SVM with remaining features; Save and analyse found feature set; Search for more sets ...]
Taken together, the algorithm reads as (Fig. 4.9):
1 Start with an SVM classifier trained with all features
2 Select the feature with the lowest importance for the classification
3 Discard this feature and retrain the SVM with all remaining features
4 Assess whether a separation of both classes with the remaining features is still possible. If a separation is still possible, then go to step two. Otherwise, a minimal discriminating subset of features was found.
5 Assess the importance of the minimal discriminating subset found. Save an “important” subset.
6 Discard all members of important subsets and redo the analysis (step two).
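The control flow of these steps can be sketched in Python. The sketch makes several assumptions that are not part of the method itself: a simple perceptron replaces the SVM, a fixed epoch limit replaces the time limit on the QP problem, the absolute weight of a feature serves as its importance, and step 5 (the significance assessment) is omitted. All function names are hypothetical.

```python
def train_perceptron(X, y, features, max_epochs=1000):
    """Stand-in for SVM training; returns (weights, separated)."""
    w = {f: 0.0 for f in features}
    b = 0.0
    for _ in range(max_epochs):
        errors = 0
        for x, label in zip(X, y):
            score = sum(w[f] * x[f] for f in features) + b
            if label * score <= 0:  # misclassified: move the hyperplane
                for f in features:
                    w[f] += label * x[f]
                b += label
                errors += 1
        if errors == 0:
            return w, True  # all samples separated
    return w, False  # no separation within the epoch limit


def minimal_discriminating_subset(X, y, features):
    """Steps 1 to 4: discard the weakest feature until the separation fails."""
    current = list(features)
    w, separated = train_perceptron(X, y, current)
    if not separated:
        return None
    while len(current) > 1:
        weakest = min(current, key=lambda f: abs(w[f]))
        candidate = [f for f in current if f != weakest]
        w_new, separated = train_perceptron(X, y, candidate)
        if not separated:
            break  # the last discard was one too many
        current, w = candidate, w_new
    return current


def all_discriminating_subsets(X, y, n_features):
    """Step 6: remove the members of a found subset and search again."""
    remaining = list(range(n_features))
    found = []
    while remaining:
        subset = minimal_discriminating_subset(X, y, remaining)
        if subset is None:
            break
        found.append(subset)
        remaining = [f for f in remaining if f not in subset]
    return found


# Toy data (hypothetical): features 0 and 1 each separate the classes, feature 2 is noise.
X = [[1, 2, 1], [1, 2, -1], [-1, -2, 1], [-1, -2, -1]]
y = [1, 1, -1, -1]
print(all_discriminating_subsets(X, y, 3))
```

On the toy data the search returns the two single-feature subsets and stops once only the noise feature is left. The subsets are minimal only in the greedy sense of the steps above: the search stops when discarding the currently weakest feature breaks the separation.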
Finally, and-or-trees can be used to represent the feature sets found (Fig. 4.10). An and-or-tree describes the composition of an expression in terms of sub-expressions which are combined by “and” or “or” nodes. An and-node is true iff all of its successors are true, whereas an or-node is true iff at least one successor is true.
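The evaluation rule for and-nodes and or-nodes can be written as a short recursive function. The nested-tuple representation and the example tree below are assumptions made for illustration; the tree is hypothetical and does not reproduce the one shown in the figure.

```python
def evaluate(node, true_leaves):
    """Evaluate an and-or-tree; leaves are feature names, inner nodes are (operator, children)."""
    if isinstance(node, str):
        return node in true_leaves
    operator, children = node
    results = (evaluate(child, true_leaves) for child in children)
    # An and-node needs all successors to be true, an or-node at least one.
    return all(results) if operator == "AND" else any(results)


# Hypothetical tree: two alternative feature subsets.
tree = ("OR", [("AND", ["feature 1", "feature 2"]),
               ("AND", ["feature 3", "feature 4"])])

print(evaluate(tree, {"feature 1", "feature 2"}))
print(evaluate(tree, {"feature 1", "feature 3"}))
```

Read in this way, each and-node groups the members of one feature subset and the or-node at the root lists the alternative subsets.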
Figure 4.10: Representation of discriminating feature subsets using an and-or-tree. [Tree with AND and OR nodes over the leaves feature 1 to feature 7]

For non-separable data sets (classification accuracy below 100%), a feature ranking was calculated according to [GWBV02]. Briefly, the most important separating features were identified from the trained SVM classifier according to the absolute value of each component of the hyperplane direction vector. Guyon et al. also proposed a recursive feature elimination and used the classification accuracy as criterion. However, the differences between the classification accuracies of feature subsets of varying sizes were not significant. Therefore, the optimal size of a discriminating feature subset (given the analysed genomic profiles) was very difficult to estimate, and this criterion was finally not used.
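For a linear SVM, the ranking by the components of the hyperplane direction vector can be sketched as follows. The sketch assumes a linear kernel, for which the direction vector is the sum of the support vectors weighted by their signed dual coefficients. The function names and the numbers are hypothetical and do not stem from the analysed data.

```python
def hyperplane_direction(support_vectors, signed_dual_coefs):
    """w = sum_i (alpha_i * y_i) * x_i for a linear kernel."""
    n_features = len(support_vectors[0])
    return [
        sum(c * sv[j] for c, sv in zip(signed_dual_coefs, support_vectors))
        for j in range(n_features)
    ]


def rank_features(w):
    """Feature indices ordered by decreasing absolute weight."""
    return sorted(range(len(w)), key=lambda j: abs(w[j]), reverse=True)


# Hypothetical support vectors and coefficients alpha_i * y_i.
support_vectors = [[1.0, 0.0, 2.0], [0.0, 2.0, -1.0]]
signed_dual_coefs = [0.5, -0.5]

w = hyperplane_direction(support_vectors, signed_dual_coefs)
print(w)
print(rank_features(w))
```

The sign of a component indicates which class a high feature value points to, whereas only its absolute value enters the ranking.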