In this chapter we conducted a detailed empirical study of the ensemble approach to classification of small-sample genomic and proteomic data. The main performance issue is not whether the ensemble scheme improves the classification error of an unstable, overfitting classifier (e.g., CART, NNET), or whether its classification error converges to a fixed limit, but rather whether the ensemble scheme improves the performance of the unstable, overfitting classifier sufficiently to beat the performance of single stable, non-overfitting classifiers (e.g., DLDA, LDA, 3NN). We observed that this was never the case for any of the data sets and experimental conditions considered here, except for the proteomics data set with RELIEF feature selection in acute small-sample cases; even there, the performance of a single unstable, overfitting classifier (in this case, CART) was better than or comparable to that of the corresponding ensemble classifier. We observed that in most cases bagging does a good (sometimes admirable) job of improving the performance of unstable, overfitting classifiers, but that the improvement was not enough to beat the performance of single stable, non-overfitting classifiers.
The main message to be gleaned from this study by practitioners is that the use of bagging in classification of small-sample genomics and proteomics data increases computational cost, but is not likely to improve overall classification accuracy over other, simpler approaches. The solution we recommend is to use simple classification rules and avoid bagging in these scenarios. It is important to stress that we do not give a definitive recommendation on the use of the random forest method for small-sample genomics and proteomics data; however, we do think that this study provides a step in that direction, since the random forest method depends partly, if not significantly, on the effectiveness of bagging for its success. Further research is needed to investigate this question.
CHAPTER VI
SMALL-SAMPLE ERROR ESTIMATION FOR BAGGING CLASSIFICATION RULES∗
Application of ensemble classification rules in gene-expression microarray classification problems has become increasingly common. Among ensemble classification rules, bootstrap aggregating (“bagging”) is the most popular, and has generated a considerable amount of literature. However, the problem of error estimation for these classification rules, particularly under the small-sample settings prevalent in genomics, is not well understood. Breiman proposed a general method, which he called “out-of-bag”, for estimating statistics of bagged classifiers, which was subsequently applied by other authors to estimate the classification error. In this chapter, we give an explicit definition of the out-of-bag estimator that is intended to remove estimator bias, by formulating carefully how the error count is normalized. We conducted an extensive simulation study of bagging of common classification rules, including LDA, 3NN, and CART, applied to both synthetic and real patient data, corresponding to the use of common error estimators such as resubstitution, leave-one-out, cross-validation, basic bootstrap, bootstrap 632, bootstrap 632 plus, bolstering, and semi-bolstering, in addition to the out-of-bag estimator. The results from the numerical experiments indicated that the performance of the out-of-bag estimator is very similar to that of leave-one-out; in particular, the out-of-bag estimator is slightly pessimistically biased. The performance of the other estimators is consistent with their performance with the corresponding single classifiers, as reported in other studies. Bolstered error estimators showed consistently superior performance to the others, in terms of accuracy (RMS) and computational cost.
∗Reprinted with permission from “Small-sample Error Estimation for Bagged Classification Rules,” by T. T. Vu and U. M. Braga-Neto, 2009, volume 2010, 12 pages, Copyright 2010 of EURASIP Journal on Bioinformatics and Systems Biology.
A. Introduction
Ensemble classification methods combine the decisions of multiple classifiers designed on randomly perturbed versions of the available data [147, 148, 149, 150, 151]. The most popular version of this scheme is known as bootstrap aggregating, or “bagging” [150, 151], where the ensemble classifier corresponds to a majority vote among classifiers designed on bootstrap samples [96] from the available training data.
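As an illustration of the scheme just described, the following is a minimal Python sketch of bagging by majority vote. It is not the implementation used in this study: the function names (`knn_rule`, `bagged_classifier`), the default of 25 bootstrap samples, and the fixed random seed are assumptions made for the example, and a simple 3NN rule stands in for the base classification rule.

```python
import random
from collections import Counter


def knn_rule(train, k=3):
    """Hypothetical base rule: design a kNN classifier from (x, y) pairs.

    Each x is a tuple of floats and each y is a class label.
    """
    def classify(x):
        # Sort training points by squared Euclidean distance to x, keep the k nearest.
        nearest = sorted(
            train, key=lambda p: sum((a - b) ** 2 for a, b in zip(p[0], x))
        )[:k]
        # Return the most frequent label among the k nearest neighbours.
        return Counter(y for _, y in nearest).most_common(1)[0][0]
    return classify


def bagged_classifier(data, rule, n_boot=25, seed=0):
    """Design n_boot classifiers on bootstrap samples and combine by majority vote."""
    rng = random.Random(seed)
    n = len(data)
    members = []
    for _ in range(n_boot):
        # A bootstrap sample: n points drawn from the data with replacement.
        sample = [data[rng.randrange(n)] for _ in range(n)]
        members.append(rule(sample))

    def classify(x):
        # Majority vote over the ensemble members (an odd n_boot avoids
        # ties when there are two classes).
        votes = Counter(member(x) for member in members)
        return votes.most_common(1)[0][0]
    return classify


if __name__ == "__main__":
    # Toy two-class data, invented for this example only.
    data = [((0.0, 0.1), 0), ((0.2, -0.1), 0), ((-0.1, 0.3), 0),
            ((0.4, 0.2), 0), ((0.1, 0.5), 0), ((5.0, 5.1), 1),
            ((5.2, 4.9), 1), ((4.9, 5.3), 1), ((5.4, 5.2), 1),
            ((5.1, 5.5), 1)]
    clf = bagged_classifier(data, knn_rule)
    print(clf((0.1, 0.1)), clf((5.0, 5.0)))
```

Passing the classification rule as a function keeps the sketch independent of the base rule, so the same wrapper could in principle be applied to any rule that maps a training sample to a classifier.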
There has been considerable interest recently in the application of bagging in the classification of both gene-expression data [154, 155, 156, 157] and protein-abundance mass spectrometry data [158, 159, 160, 161, 162, 163]. The popularity of bagging is based on the expectation that combining the decisions of several classifiers will regularize and improve the performance of unstable, overfitting classification rules (the so-called “weak learners”). In Chapter V, we investigated this claim in the context of small-sample genomics and proteomics data. A different issue is the performance of error estimators for bagged classifiers. Accurate error estimation is a critical issue in Genomics, as it decisively impacts the scientific validity of hypotheses derived from application of pattern recognition methods to biomedical data [43, 179, 180]. On the topic of error estimation, Breiman proposed a general method, which he called “out-of-bag”, for estimating statistics of bagged classifiers [181], and, subsequently, other authors applied it to the estimation of the classification error [182, 183]. In this chapter, we give an explicit definition of the out-of-bag estimator that is intended to remove estimator bias, which is done by formulating carefully how the error count is normalized. The performance of out-of-bag estimators with general bagged classification rules is not in fact well understood, especially in connection with bagging ensemble classifiers derived from classification rules other than decision trees (which was Breiman’s primary interest). In addition, to our knowledge, no studies have attempted to assess the performance of error estimators for bagged classifiers in the context of Genomics data, particularly in the small-sample setting prevalent in these applications.
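To make the out-of-bag idea concrete, the following Python sketch shows one common reading of it: each training point is classified only by the ensemble members whose bootstrap samples did not contain it, and the resulting errors are averaged. The normalization used here (dividing by the number of points left out of at least one bootstrap sample) is an assumption made for the example and is not necessarily the definition adopted in this chapter; the names `knn_rule` and `oob_error`, the 3NN base rule, and the default parameters are likewise hypothetical.

```python
import random
from collections import Counter


def knn_rule(train, k=3):
    """Hypothetical base rule: kNN classifier designed from (x, y) pairs."""
    def classify(x):
        nearest = sorted(
            train, key=lambda p: sum((a - b) ** 2 for a, b in zip(p[0], x))
        )[:k]
        return Counter(y for _, y in nearest).most_common(1)[0][0]
    return classify


def oob_error(data, rule, n_boot=50, seed=0):
    """Out-of-bag error estimate under the assumed normalization."""
    rng = random.Random(seed)
    n = len(data)
    # votes[i] collects labels assigned to point i by members that did not train on it.
    votes = [Counter() for _ in range(n)]
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap indices
        in_bag = set(idx)
        member = rule([data[i] for i in idx])
        for i in range(n):
            if i not in in_bag:
                votes[i][member(data[i][0])] += 1

    errors = 0
    counted = 0
    for (x, y), v in zip(data, votes):
        if v:  # skip points that were in every bootstrap sample
            counted += 1
            # Majority vote among out-of-bag members; in a tie, the label
            # that received its first vote earliest wins.
            errors += int(v.most_common(1)[0][0] != y)
    if counted == 0:
        raise ValueError("no point was out-of-bag; increase n_boot")
    # Assumed normalization: divide by the number of points actually counted.
    return errors / counted


if __name__ == "__main__":
    # Toy two-class data, invented for this example only.
    data = [((0.0, 0.1), 0), ((0.2, -0.1), 0), ((-0.1, 0.3), 0),
            ((0.4, 0.2), 0), ((0.1, 0.5), 0), ((5.0, 5.1), 1),
            ((5.2, 4.9), 1), ((4.9, 5.3), 1), ((5.4, 5.2), 1),
            ((5.1, 5.5), 1)]
    print(oob_error(data, knn_rule))
```

Because each point is classified only by members that never saw it, no separate test set is needed; how the error count is normalized when some points are rarely or never left out is precisely the kind of detail that can affect estimator bias.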
To investigate these issues, we conducted an extensive simulation study of bagging of common classification rules, including LDA, 3NN, and CART, applied to both synthetic and real patient data, corresponding to the use of common error estimators such as resubstitution, leave-one-out, cross-validation, basic bootstrap, bootstrap 632, bootstrap 632 plus, bolstering, and semi-bolstering, in addition to the out-of-bag estimator itself. We present here selected representative results; the full set of results can be found on the companion website, at http://gsp.tamu.edu/Publications/supplementary/oob. The results from the numerical experiments indicated that the performance of the out-of-bag error estimator is very similar to that of leave-one-out; in particular, the out-of-bag estimator is slightly pessimistically biased. The performance of the other estimators is for the most part consistent with their performance with the corresponding single classification rules assessed in other studies, with the best performance being provided by the bolstered error estimators, in terms of root mean square error.
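For readers less familiar with the estimators named above, the two simplest ones, resubstitution and leave-one-out, can be sketched in Python as follows. This is an illustrative sketch under stated assumptions rather than the code used in the simulation study: the function names and the kNN base rule are hypothetical, and the estimators are written for a generic classification rule instead of a bagged one.

```python
from collections import Counter


def knn_rule(train, k=3):
    """Hypothetical base rule: kNN classifier designed from (x, y) pairs."""
    def classify(x):
        nearest = sorted(
            train, key=lambda p: sum((a - b) ** 2 for a, b in zip(p[0], x))
        )[:k]
        return Counter(y for _, y in nearest).most_common(1)[0][0]
    return classify


def resubstitution_error(data, rule):
    """Fraction of training points misclassified by the classifier designed on all of them."""
    clf = rule(data)
    return sum(clf(x) != y for x, y in data) / len(data)


def leave_one_out_error(data, rule):
    """Fraction of points misclassified when each is held out of the design sample in turn."""
    errors = 0
    for i, (x, y) in enumerate(data):
        # Design the classifier on all points except the i-th, then test on it.
        clf = rule(data[:i] + data[i + 1:])
        errors += int(clf(x) != y)
    return errors / len(data)


if __name__ == "__main__":
    # Toy one-dimensional data with one point of class 1 placed among
    # class 0 points, invented for this example only.
    data = [((0.0,), 0), ((0.1,), 0), ((0.2,), 0), ((0.15,), 1),
            ((5.0,), 1), ((5.1,), 1), ((5.2,), 1)]
    one_nn = lambda train: knn_rule(train, k=1)
    print(resubstitution_error(data, one_nn))
    print(leave_one_out_error(data, one_nn))
```

With a 1NN rule and no duplicated points, every training point is its own nearest neighbour, so resubstitution reports zero error regardless of how the classes overlap, whereas leave-one-out does not; this is a simple instance of the optimistic bias of resubstitution for overfitting rules.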