3. Methods and Protocols
3.4 Microarray data analysis
3.4.5 Classification of samples based on gene expression patterns
For class prediction, the weighted voting procedure (see 3.4.1), multiple-tree models, and support vector machines (SVM) were used.
Multiple decision trees
Multiple-tree models were computed to discriminate between three different AML subclasses in the initial U95Av2 data (n=37 samples). To avoid the overfitting of a single tree model, a multiple-tree model was developed using an iteratively reduced set of genes. Each tree was restricted to contain no more than k-1 nodes to discriminate between k classes. Genes whose expression values were selected for the nodes of a tree were then eliminated from the data set, and a new tree was calculated on the truncated data set. This was iterated until a predetermined number of trees was reached. To determine how many trees should be incorporated in the model, misclassification rates were calculated for models containing 1 to 25 trees. For this purpose, five samples were set aside for final validation and the remaining 32 samples were randomly split into a training set (n=24) and a test set (n=8). Within the range tested, 15 trees were found to be optimal, avoiding both overfitting and a loss of classification accuracy. The final class assignment was made by applying a vote-by-majority rule to the outputs of the 15 single trees. Equal votes for two of the three classes were counted as a misclassification.
The generalization properties of the classifier were assessed by 10-fold cross-validation (CV) and on a test set of 5 samples that were not used for classifier training. Multiple-tree models for classification were developed at the Intelligent Bioinformatics Systems division at the German Cancer Research Center (DKFZ), Heidelberg, and were calculated using the C5.0 algorithm as implemented in SPSS (Quinlan, 1993). A schematic summary is given in Figure 9.
Figure 9. Multiple-tree model computation. The entire data set was normalized and differentially expressed genes were identified (1). A blinded validation set of five samples was excluded from further analysis for final evaluation of the constructed classifier. The remaining samples were then randomly split into training and test sets (2), and the optimal number of trees was determined (3). Then, the final classifier was built using this number through an iterative process (4) to construct the multiple-tree model (5). The independent test set error was calculated on the initially excluded 5 samples (6). Independently, the prediction error has been estimated by 10-fold CV (7).
Figure 9 panel labels:
Input: 37 samples (32 training/testing and 5 validation), 12,625 genes
1. Normalization and pre-selection: 32 training samples, 1,174 genes; 3/4 training, 1/4 test; random subsampling (10x)
2. Construct multiple-tree classifiers with 1 … 25 trees
3. Estimate optimal number of trees n
4. Calculate decision tree with 2 nodes (3 classes)
5. Remove discriminatory genes: 32 training samples, (1,174 – 2i) genes; repeat (n-1) times to obtain n decision trees
6. Predict 5 validation samples: error on blinded test set (0%)
7. Construct 10 multiple-tree classifiers, each containing n trees: 10-fold CV, cross-validation error (0%)
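The iterative construction described for the multiple-tree model (small tree, removal of the genes used at its nodes, repetition, majority vote with ties counted as errors) can be illustrated with a minimal sketch. It is not the C5.0 implementation used in this work: a greedy Gini-impurity split serves as a simple stand-in for the C5.0 splitting criterion, all function names (`fit_tree`, `fit_multiple_trees`, `vote`) are hypothetical, and the six-sample, six-gene data set is invented for illustration only.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 for a pure or empty list)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(X, y, rows, genes):
    """Best single-gene threshold split of `rows`: (impurity, gene, threshold) or None."""
    best = None
    for g in genes:
        values = sorted({X[r][g] for r in rows})
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2
            left = [y[r] for r in rows if X[r][g] <= thr]
            right = [y[r] for r in rows if X[r][g] > thr]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if best is None or score < best[0]:
                best = (score, g, thr)
    return best

def fit_tree(X, y, rows, genes, budget):
    """Greedy tree with at most `budget` decision nodes (assumed stand-in for C5.0).
    Returns (tree, genes used at its nodes, number of decision nodes)."""
    labels = [y[r] for r in rows]
    can_split = budget > 0 and len(set(labels)) > 1
    split = best_split(X, y, rows, genes) if can_split else None
    if split is None:
        return Counter(labels).most_common(1)[0][0], set(), 0  # leaf: majority class
    _, g, thr = split
    parts = {"le": [r for r in rows if X[r][g] <= thr],
             "gt": [r for r in rows if X[r][g] > thr]}
    node, used, n_nodes = {"gene": g, "thr": thr}, {g}, 1
    # Spend the remaining node budget on the more impure child first.
    for side in sorted(parts, key=lambda s: -gini([y[r] for r in parts[s]])):
        sub, sub_used, sub_n = fit_tree(X, y, parts[side], genes, budget - n_nodes)
        node[side] = sub
        used |= sub_used
        n_nodes += sub_n
    return node, used, n_nodes

def predict_tree(tree, x):
    """Follow the decision nodes down to a leaf, which is a class label."""
    while isinstance(tree, dict):
        tree = tree["le"] if x[tree["gene"]] <= tree["thr"] else tree["gt"]
    return tree

def fit_multiple_trees(X, y, n_trees):
    """Fit up to n_trees trees, removing each tree's genes before fitting the next."""
    genes = set(range(len(X[0])))
    k = len(set(y))  # k classes allow at most k-1 decision nodes per tree
    trees, used_per_tree = [], []
    for _ in range(n_trees):
        tree, used, _count = fit_tree(X, y, list(range(len(X))), sorted(genes), k - 1)
        if not used:  # no informative gene left
            break
        trees.append(tree)
        used_per_tree.append(used)
        genes -= used
    return trees, used_per_tree

def vote(trees, x):
    """Majority vote over the trees; a tie returns None (counted as misclassification)."""
    counts = Counter(predict_tree(t, x) for t in trees).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None
    return counts[0][0]

# Invented toy data: rows are samples, columns are genes.
X = [[1, 1, 9, 9, 2, 1], [2, 2, 8, 8, 1, 2],   # class A
     [8, 1, 9, 1, 9, 2], [9, 2, 8, 2, 8, 1],   # class B
     [8, 9, 1, 1, 9, 8], [9, 8, 2, 2, 8, 9]]   # class C
y = ["A", "A", "B", "B", "C", "C"]

trees, used_per_tree = fit_multiple_trees(X, y, n_trees=3)
print(used_per_tree)
print(vote(trees, [1.5, 1.5, 8.5, 8.5, 1.5, 1.5]))
```

Because every tree is built from genes that no earlier tree has used, the votes come from disjoint gene sets, which is what makes the majority vote more robust than a single tree in this scheme.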
SVM-based classification
For classification of the U133 set microarray data, the support vector machine (SVM) algorithm was used. SVMs are learning machines that perform binary classification tasks (Vapnik, 1998; Guyon et al., 2002; Schölkopf and Smola, 2002). In this work, a classification task involves a training set and a testing set of gene expression profiles, each profile being one data instance. Each instance in the training set contains a “target value” (class label, i.e., leukemia class) and several “attributes” (features, i.e., genes). The goal of this approach is to produce a model that predicts the target values of the instances in the testing set, for which only the attributes are given. Applied to gene expression data, an SVM would begin with a set of genes that have a common function, e.g., genes that demonstrate differential expression between distinct leukemia subtypes. After a non-linear mapping of the n-dimensional input space into a high-dimensional feature space, a linear classifier is constructed in this feature space (Figure 10).
Figure 10. Concept of SVM-based classification. The SVM operates by mapping the given training set into a possibly high-dimensional feature space and by attempting to locate in that space a plane that separates positive from negative samples. The hyperplane, a plane in a space with more than 3 dimensions, corresponds to a non-linear decision boundary in the input space.
Using this training set, an SVM would learn to discriminate between the types and subtypes of leukemias based on expression data. Having found such a plane, the SVM can then predict the class of a new, unlabeled sample by mapping it into the feature space and asking on which side of the separating plane the sample lies (Figure 11). A label is then assigned according to the sample's position relative to the decision boundary. In this work, multi-class SVM classifiers were built with linear kernels using the library LIBSVM version 2.36 (www.csie.ntu.edu.tw/~cjlin/libsvm/) (Chang and Lin, 2001).
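The prediction step, deciding on which side of a separating plane a sample lies, and its extension to more than two classes can be sketched as follows. The sketch assumes the one-against-one scheme that LIBSVM documents for multi-class problems; the text above does not state the scheme explicitly. The weight vectors and offsets are hypothetical values chosen by hand, not trained LIBSVM models, and the class names and function names are illustrative only.

```python
from collections import Counter
from itertools import combinations

def decision_value(w, b, x):
    """Signed score w.x + b; its sign tells on which side of the hyperplane x lies."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def predict_one_vs_one(models, classes, x):
    """Let every pairwise linear classifier vote and return the most voted class.
    `models` maps a class pair (pos, neg) to a hypothetical (w, b)."""
    votes = Counter()
    for pos, neg in combinations(classes, 2):
        w, b = models[(pos, neg)]
        votes[pos if decision_value(w, b, x) > 0 else neg] += 1
    # Ties are resolved arbitrarily here (first class that reached the top count).
    return votes.most_common(1)[0][0]

classes = ["type I", "type II", "type III"]

# Hand-picked hyperplanes for two genes (assumed values, for illustration only).
models = {
    ("type I", "type II"): ([-1.0, 0.0], 5.0),    # type I if gene 0 < 5
    ("type I", "type III"): ([0.0, -1.0], 5.0),   # type I if gene 1 < 5
    ("type II", "type III"): ([1.0, -1.0], 0.0),  # type II if gene 0 > gene 1
}

for sample in ([2.0, 3.0], [8.0, 2.0], [3.0, 9.0]):
    print(sample, predict_one_vs_one(models, classes, sample))
```

With a linear kernel, as used in this work, the feature space coincides with the input space, so the decision value is a plain weighted sum of expression values.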
Figure 11. Classification task. The SVM separates a given set of binary labeled training data with a hyperplane that is maximally distant from them (maximal margin). The middle black line is the decision surface defining the borderline between the area of prediction of type I samples (red) and type II samples (blue). The outer lines precisely meet the constraint. Support vectors marked to be critical for the classification task are the points that lie closest to the separating hyperplane (Schölkopf and Smola, 2002).
[Figure 10 labels: gene expression input space; training data; mapping; higher-dimensional feature space; hyperplane. Figure 11 labels: type I; type II; new sample classified as type I; new sample classified as type II.]
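The maximal-margin separation shown in Figure 11 is commonly written as the optimization problem below. This is the standard hard-margin formulation from the SVM literature, given here as a reading aid rather than as an equation from this work; the symbols (weight vector w, offset b, feature map φ, training profiles x_i with labels y_i in {-1, +1}, and m training samples) are notation assumed for this sketch.

```latex
\begin{aligned}
\min_{\mathbf{w},\,b}\quad & \tfrac{1}{2}\,\lVert \mathbf{w} \rVert^{2} \\
\text{subject to}\quad & y_i\bigl(\mathbf{w}^{\top}\phi(\mathbf{x}_i) + b\bigr) \ge 1,
  \qquad i = 1,\dots,m, \\
f(\mathbf{x}) \;=\;& \operatorname{sign}\bigl(\mathbf{w}^{\top}\phi(\mathbf{x}) + b\bigr)
\end{aligned}
```

In this notation the margin has width 2/‖w‖, the two outer lines of Figure 11 are the points where the constraint holds with equality, and the training samples lying on them are the support vectors. A new sample is labeled by the sign of f.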