
2.7 Machine Learning Algorithms

2.7.4 Random Forest Classifier

There has been a lot of interest in "ensemble learning" methods (Breiman, 1996; Liaw and Wiener, 2002; Freund et al., 1996; Schapire, 1990) that generate many classifiers and predict a class by aggregating their results. The two most popular are boosting (Schapire, 1990) and bagging (Breiman, 1996) of classification trees.

Bagging predictors were proposed by Breiman (1996) to improve classification by combining classifiers trained on randomly generated training sets. The training sets are generated by bootstrapping (sampling the original training set with replacement), and the resulting classifications are combined by aggregation. Usually, in bagging, the aggregation is done either by averaging the results or by majority vote (the mode). Breiman (1996) showed that bagged trees outperform traditional classifiers such as a single tree and the Nearest Neighbor classifier. In particular, bagged trees outperformed the Nearest Neighbor classifier on all six datasets used, whereas a single tree outperformed Nearest Neighbor on only three of the six datasets. One major problem with bagging is determining the number of bootstraps to use, which in Breiman's case was 50. However, Breiman showed that misclassifications decrease as the number of bootstraps increases. Another limitation of this method is that when the number of bootstraps is not large enough, some samples are left out of the training set.

Another ensemble method that depends on aggregation is boosting. It was proposed by Schapire (1990) and is based on the concept that a class is weakly learnable if the learner can produce a hypothesis that performs slightly better than random guessing. In boosting, a series of "weak" classifiers is built, each trained on a dataset in which the samples misclassified by the previous classifier are given more weight. The classifiers are then weighted according to their success, and their outputs are combined by a weighted majority vote. The most common boosting classifier is AdaBoost (Freund et al., 1996), which stands for "adaptive boosting". It combines the outputs of many "weak" classifiers to increase classification accuracy. The advantage is that a weak classifier can be very simple to implement and computationally inexpensive (Friedman et al., 2001).
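The two aggregation schemes described above can be illustrated with a minimal sketch. It is not the implementation used in the cited papers: the base learner is a one-feature threshold classifier (a decision stump) rather than a full classification tree, labels are assumed to be -1 or +1, the boosting part follows the usual discrete AdaBoost weight update, and all function names are hypothetical.

```python
import math
import random
from collections import Counter


def stump_predict(stump, x):
    """Predict +1 when sign * (x - threshold) > 0, otherwise -1."""
    threshold, sign = stump
    return 1 if sign * (x - threshold) > 0 else -1


def fit_stump(xs, ys, weights):
    """Weak learner: pick the (threshold, sign) with the lowest weighted error."""
    best = None
    for threshold in [min(xs) - 1] + sorted(set(xs)):
        for sign in (1, -1):
            err = sum(w for x, y, w in zip(xs, ys, weights)
                      if stump_predict((threshold, sign), x) != y)
            if best is None or err < best[0]:
                best = (err, threshold, sign)
    return best[1], best[2]


def bagging_fit(xs, ys, n_bootstraps, rng):
    """Train one stump per bootstrap sample (drawn with replacement)."""
    n = len(xs)
    models = []
    for _ in range(n_bootstraps):
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(fit_stump([xs[i] for i in idx],
                                [ys[i] for i in idx], [1.0] * n))
    return models


def bagging_predict(models, x):
    """Aggregate by unweighted majority vote (the mode)."""
    votes = Counter(stump_predict(m, x) for m in models)
    return votes.most_common(1)[0][0]


def adaboost_fit(xs, ys, n_rounds):
    """Train stumps in sequence, up-weighting previously misclassified samples."""
    n = len(xs)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(n_rounds):
        stump = fit_stump(xs, ys, weights)
        preds = [stump_predict(stump, x) for x in xs]
        err = sum(w for w, p, y in zip(weights, preds, ys) if p != y)
        err = min(max(err, 1e-10), 1 - 1e-10)    # avoid division by zero
        alpha = 0.5 * math.log((1 - err) / err)  # weight of this classifier
        ensemble.append((alpha, stump))
        weights = [w * math.exp(-alpha * y * p)
                   for w, p, y in zip(weights, preds, ys)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def adaboost_predict(ensemble, x):
    """Aggregate by a vote weighted with each classifier's alpha."""
    score = sum(alpha * stump_predict(s, x) for alpha, s in ensemble)
    return 1 if score > 0 else -1


# Toy data that no single stump can separate (positives lie in the middle).
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [-1, -1, -1, 1, 1, 1, -1, -1, -1]
bagged = bagging_fit(xs, ys, n_bootstraps=25, rng=random.Random(0))
boosted = adaboost_fit(xs, ys, n_rounds=3)
print([bagging_predict(bagged, x) for x in xs])
print([adaboost_predict(boosted, x) for x in xs])
```

The contrast is in how the training sets and votes are formed: bagging trains every stump independently on a resampled dataset and counts votes equally, whereas boosting trains on the full dataset with changing sample weights and weights each vote by the classifier's accuracy.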

Random Forest, proposed by Breiman (2001), is emerging as a very popular classifier. It is an extension of the bagging method and was developed to improve classification performance relative to bagging and to boosting classifiers such as AdaBoost. It is based on trees, but each node is split using the best among a randomly chosen subset of predictors. It aims to reduce the variance of an individual tree by growing many trees on random samples of the dataset and aggregating their predictions. This method is based on bagging and random decision trees, discussed in section 2.7.3. The Random Forest classifier (Breiman, 2001) introduces random perturbations into the learning process in order to produce multiple decision trees from a single dataset (thus forming a "forest"). Aggregation techniques, such as majority voting, are then used to combine the predictions from all of the trees. The method thus combines Breiman’s "Bagging" (Breiman, 1996) with random perturbations injected into the feature selection, building a collection of decision trees with a controlled variance. In a Random Forest classification problem, each bootstrap sample serves as the training set for one tree, and the cases left out of that sample, known as the out-of-bag cases, serve as its test set.
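The ingredients described above (a bootstrap sample per tree, a random subset of predictors at a split, majority voting, and out-of-bag cases) can be put together in a short sketch. It is a simplification under stated assumptions: each "tree" has a single split, whereas a full Random Forest grows deep trees and draws a new random subset of predictors at every node. The names `fit_forest`, `mtry`, and `oob_error` are hypothetical and chosen for illustration only.

```python
import random
from collections import Counter


def majority(labels):
    """Most frequent label (the mode)."""
    return Counter(labels).most_common(1)[0][0]


def bootstrap_with_oob(n, rng):
    """Draw n indices with replacement; unused indices are out-of-bag."""
    in_bag = [rng.randrange(n) for _ in range(n)]
    out_of_bag = set(range(n)) - set(in_bag)
    return in_bag, out_of_bag


def best_split(rows, labels, candidate_features):
    """Best single split, searched only over the candidate features."""
    best = None
    for f in candidate_features:
        for threshold in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= threshold]
            right = [y for r, y in zip(rows, labels) if r[f] > threshold]
            if not right:
                continue  # not a real split
            l_lab, r_lab = majority(left), majority(right)
            errors = (sum(y != l_lab for y in left)
                      + sum(y != r_lab for y in right))
            if best is None or errors < best[0]:
                best = (errors, f, threshold, l_lab, r_lab)
    if best is None:  # every candidate feature is constant in this sample
        lab = majority(labels)
        return (candidate_features[0], float("inf"), lab, lab)
    return best[1:]


def predict_tree(tree, row):
    f, threshold, l_lab, r_lab = tree
    return l_lab if row[f] <= threshold else r_lab


def fit_forest(rows, labels, n_trees, mtry, rng):
    """Grow n_trees one-split trees, each on its own bootstrap sample."""
    n, p = len(rows), len(rows[0])
    forest = []
    for _ in range(n_trees):
        in_bag, oob = bootstrap_with_oob(n, rng)
        features = rng.sample(range(p), mtry)  # random subset of predictors
        tree = best_split([rows[i] for i in in_bag],
                          [labels[i] for i in in_bag], features)
        forest.append((tree, oob))
    return forest


def predict_forest(forest, row):
    """Combine the trees by majority vote."""
    return majority(predict_tree(tree, row) for tree, _ in forest)


def oob_error(forest, rows, labels):
    """Error rate when each case is judged only by trees that never saw it."""
    wrong = total = 0
    for i, (row, y) in enumerate(zip(rows, labels)):
        votes = [predict_tree(tree, row) for tree, oob in forest if i in oob]
        if votes:
            total += 1
            wrong += majority(votes) != y
    return wrong / total if total else float("nan")


# Toy data: features 0 and 1 are informative, feature 2 (parity) is noise.
rows = [(x, 2 * x, x % 2) for x in range(1, 11)]
labels = [int(x > 5) for x in range(1, 11)]
forest = fit_forest(rows, labels, n_trees=25, mtry=2, rng=random.Random(0))
print(predict_forest(forest, (2, 4, 0)), predict_forest(forest, (9, 18, 1)))
print("out-of-bag error:", oob_error(forest, rows, labels))
```

Because the out-of-bag cases of a tree play no part in growing it, the out-of-bag error gives an estimate of test error without setting aside a separate hold-out set.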

Random Forest classifiers have been applied in many areas, including bioinformatics (Lee et al., 2005; Díaz-Uriarte and De Andres, 2006; Statnikov et al., 2008) and ecology, for the classification of vascular and non-vascular plants and of vertebrates (Cutler et al., 2007). Lee et al. (2005) extensively analysed the performance of 21 classification methods on seven microarray gene expression datasets and found that Random Forest was the most successful classifier on these datasets. However, a less extensive comparative study by Statnikov et al. (2008) compared SVM and Random Forest for the classification of disease samples, and their results show that the SVM classifier outperforms Random Forest in this case. Apart from applying Random Forest directly to the classification of genes, Díaz-Uriarte and De Andres (2006) looked into gene feature selection prior to classification. They used Random Forest to select genes for classification and found that Random Forest has comparable performance to other classification methods, including linear discriminant analysis (LDA), KNN, and SVM, when feature selection was carried out. In ecology, Cutler et al. (2007) compared the accuracy of Random Forest with other commonly used statistical classifiers, such as LDA, classification trees, and logistic regression, for classifying plant species and bird habitat datasets. They observed that Random Forest had higher classification accuracy.


method, and in addition, it has only two parameters (the number of variables in each random subset and the number of trees). It is also not sensitive to their values (Liaw and Wiener, 2002). Liaw and Wiener (2002) and Díaz-Uriarte and De Andres (2006) have shown that the Random Forest classifier compares quite favourably with SVM and may even outperform it on some datasets. Cutler et al. (2007) also noted that the Random Forest classifier has generally high classification accuracy and is able to model complex interactions among predictor variables. Finally, it retains many of the benefits of Random Decision Trees while achieving better classification accuracy by using random subsets of variables, majority voting, and bagged samples (Qi, 2012).
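Since only two parameters need to be chosen, a small grid of candidate settings is usually enough to check the insensitivity noted above. The sketch below enumerates such a grid. It assumes a starting point of the square root of the number of predictors for the subset size, which is a common convention for classification but is not stated in this section, and the tree counts are arbitrary illustrative values. The function name `candidate_settings` is hypothetical.

```python
import math


def candidate_settings(n_predictors, n_trees_options=(100, 500, 1000)):
    """List (mtry, ntree) pairs around an assumed sqrt(p) starting point.

    mtry  : number of variables in each random subset
    ntree : number of trees in the forest
    """
    default_mtry = max(1, math.isqrt(n_predictors))
    mtry_options = sorted({
        max(1, default_mtry // 2),            # half the starting point
        default_mtry,                         # the starting point itself
        min(n_predictors, default_mtry * 2),  # twice the starting point
    })
    return [(mtry, ntree)
            for mtry in mtry_options
            for ntree in n_trees_options]


for mtry, ntree in candidate_settings(16):
    print(f"mtry={mtry}, ntree={ntree}")
```

Each pair would then be evaluated, for instance by out-of-bag error, and similar error rates across the grid would be consistent with the reported insensitivity.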