2.5 Machine Learning Algorithms
2.5.5 Introduction to Multiple Classifier Systems
Several studies have reported the use of multiple classifier systems. The idea is to enhance classification accuracy by using several different algorithms, by using one classification algorithm on several input data sets, or by using one classifier on one input data set while adapting weights or parameters. The outputs are then combined. In [157] Benediktsson described consensus theory as a research field that deals with finding the consensus among the members of a group of experts. Consensus theory usually treats all available data sources separately and uses all the available data only once. Several methods of combining information from different data sources were proposed, such as the linear opinion pool, the logarithmic opinion pool, and some derived algorithms. The conclusion drawn in [157] was that the statistical multisource classifier, derived from the logarithmic opinion pool, performed well, while the linear opinion pool did not perform well in many cases, especially when the data sources were not in agreement. It was also stated that it is hard to determine optimal weights for both of these algorithms.
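The two opinion pools can be illustrated with a minimal sketch. It assumes the commonly used definitions, which are not spelled out in the text: the linear pool is a weighted arithmetic mean of the per-source class probabilities, and the logarithmic pool is a weighted geometric mean that is renormalized afterwards. The function names, the two sources, and the equal weights are hypothetical.

```python
import math

def linear_opinion_pool(posteriors, weights):
    """Weighted arithmetic mean of the per-source class probabilities."""
    n_classes = len(posteriors[0])
    return [sum(w * p[c] for w, p in zip(weights, posteriors))
            for c in range(n_classes)]

def logarithmic_opinion_pool(posteriors, weights, eps=1e-12):
    """Weighted geometric mean of the per-source class probabilities, renormalized."""
    n_classes = len(posteriors[0])
    # Work in log space; eps avoids log(0) for a source that rules a class out.
    log_scores = [sum(w * math.log(max(p[c], eps)) for w, p in zip(weights, posteriors))
                  for c in range(n_classes)]
    peak = max(log_scores)
    unnormalized = [math.exp(s - peak) for s in log_scores]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]

# Two hypothetical data sources that disagree on a two-class problem.
source_a = [0.9, 0.1]
source_b = [0.2, 0.8]
weights = [0.5, 0.5]  # assumed equal weights; choosing good weights is the hard part

linear = linear_opinion_pool([source_a, source_b], weights)
logarithmic = logarithmic_opinion_pool([source_a, source_b], weights)
```

Both pools need one weight per data source, which is where the difficulty of determining optimal weights enters.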
Most multiple classifiers are used as black-box systems, and it becomes increasingly difficult to comprehend the interaction of the variables that provide the predictive accuracy. This is acceptable for many applications such as speech and letter recognition or remote sensing, but understanding the classifier is critical in applications like the analysis of medical experiments or diagnosis.
2.5.5.1 Bagging
In 1994 Breiman proposed an algorithm called bootstrap aggregating, better known by its acronym bagging, to enhance classification accuracies [158]. The main idea is to generate several training sets, which are bootstrap [159] replicates of the learning set and therefore may overlap but are not identical. A classifier is trained on each of these sets. The outputs of all classifiers are then either averaged in the case of a numerical output, or combined by a plurality vote in the case of a class prediction. Bagging can increase accuracies, but the underlying algorithm needs to be an unstable prediction method. Instability in this context means that changing the training data set can cause significant changes in the constructed classifier. For example, decision trees are considered to be rather unstable classifiers, as the constructed decision tree depends strongly on the particular training data instances, while kNN was shown to be a stable classifier in [160]. With stable classifiers, bagging can slightly degrade the classification accuracy. Several tests were performed in [158] with the conclusion that bagging is a relatively easy way to improve existing methods and works well on unstable classifiers, where it can substantially improve classification accuracy. The tradeoff is the loss of a simple interpretable structure when using a base algorithm like DTs.
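The procedure described above can be sketched in a few lines. The base learner here is a one-feature threshold classifier chosen only to keep the example short; it is a hypothetical stand-in and not the base classifier used in [158]. The data set, the number of replicates, and all function names are likewise assumptions made for illustration.

```python
import random
from collections import Counter

def bootstrap_replicate(data, rng):
    """Draw len(data) samples with replacement from the learning set."""
    return [rng.choice(data) for _ in data]

def train_stump(data):
    """Fit a threshold classifier on (x, label) pairs by minimizing training errors."""
    best = None
    for threshold, _ in data:
        for left, right in ((0, 1), (1, 0)):
            errors = sum((left if x <= threshold else right) != y for x, y in data)
            if best is None or errors < best[0]:
                best = (errors, threshold, left, right)
    _, threshold, left, right = best
    return lambda x: left if x <= threshold else right

def bagging(data, n_replicates, rng):
    """Train one classifier per bootstrap replicate and combine them by plurality vote."""
    models = [train_stump(bootstrap_replicate(data, rng)) for _ in range(n_replicates)]

    def predict(x):
        votes = Counter(model(x) for model in models)
        return votes.most_common(1)[0][0]

    return predict

rng = random.Random(0)
learning_set = [(0.1, 0), (0.2, 0), (0.35, 0), (0.4, 0),
                (0.6, 1), (0.7, 1), (0.8, 1), (0.95, 1)]
predict = bagging(learning_set, n_replicates=25, rng=rng)
```

An odd number of replicates avoids ties in this two-class setting. For a numerical output, the vote in `predict` would be replaced by an average.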
In [161] a classifier system called "BAGFS" was used, which is similar to bagging but, in addition to the bootstrap replicates of the training sets, uses multiple feature subsets. The base classifier in [161] was a C4.5 DT with default parameters and pruning. The outputs were combined using plurality voting, and the results were better than those of a 5-nearest neighbor classifier and a single C4.5 classifier.
2.5.5.2 Boosting
One of the best-known boosting algorithms, and the first one that could adapt, was AdaBoost (short for adaptive boosting), which was introduced by Freund and Schapire in [162].
A brief introduction to boosting was given by Schapire in [163]. AdaBoost takes a weak base classifier that can operate on weighted input data and calls it $T$ times. The first weak learner is trained using equal weights on all training samples and yields a weak hypothesis. The weights of the incorrectly classified samples are then increased. With these adapted weights the next classifier is trained, yielding another hypothesis, and the process is repeated for a total of $T$ rounds. This leads to classifiers that focus on the previously misclassified examples. The final hypothesis $H(x)$ for the sample $x$ is a weighted majority vote of the $T$ weak hypotheses $h_t$ with the weight factors $\alpha_t$, as given in (2.67).
\[ H(x) = \operatorname{sign}\left( \sum_{t=1}^{T} \alpha_t h_t(x) \right) \qquad (2.67) \]
AdaBoost has several interesting theoretical properties. First, if each hypothesis given by one of the weak classifiers is slightly better than random, such that $\gamma_t > 0$, with $\gamma_t$ measuring how much better than random the prediction of the single base classifier $t$ is, then the error on the training set drops exponentially fast. Equation (2.68) describes this relation, which was proven by Freund and Schapire in [162].
\[ e_t \le \exp\left( -2 \sum_{t} \gamma_t^2 \right) \qquad (2.68) \]
Second, the bound on the generalization error suggests that boosting will overfit when run for too many rounds. According to [163], it has been observed empirically that this is most often not the case in practice. Instead, AdaBoost sometimes continues to reduce the generalization error even after the training error has reached zero. In response to these empirical findings, the analysis in [163] investigates the margins of the training examples and concludes that boosting continues to maximize the margins of the training samples even after the training error has reached zero, which corresponds to a drop in the test error. Furthermore, it was reported that the margin theory indicates parallels to the SVMs described in chapter 2.5.4. Schapire also stated that AdaBoost is fast, easy to use, and has only the number of rounds $T$ as a parameter. The results depend on the data and on the weak learner being used, and boosting is susceptible to noise.
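The reweighting loop and the training error bound can be made concrete with a small sketch. It assumes labels in {-1, +1}, a threshold stump as the weak learner, and the standard choices alpha_t = 0.5 ln((1 - eps_t) / eps_t) and gamma_t = 0.5 - eps_t, where eps_t is the weighted error of round t. None of these details are stated in the text above, and the toy data set and function names are hypothetical.

```python
import math

def train_weighted_stump(xs, ys, weights):
    """Pick the threshold stump with the lowest weighted error."""
    best = None
    for threshold in xs:
        for polarity in (1, -1):
            err = sum(w for x, y, w in zip(xs, ys, weights)
                      if (polarity if x <= threshold else -polarity) != y)
            if best is None or err < best[0]:
                best = (err, threshold, polarity)
    err, threshold, polarity = best
    return (lambda x: polarity if x <= threshold else -polarity), err

def adaboost(xs, ys, T):
    n = len(xs)
    weights = [1.0 / n] * n  # equal weights for the first weak learner
    ensemble, gammas = [], []
    for _ in range(T):
        h, err = train_weighted_stump(xs, ys, weights)
        err = min(max(err, 1e-12), 1.0 - 1e-12)  # guard against log(0)
        alpha = 0.5 * math.log((1.0 - err) / err)
        ensemble.append((alpha, h))
        gammas.append(0.5 - err)  # edge over random guessing
        # Misclassified samples (y * h(x) = -1) gain weight, correct ones lose it.
        weights = [w * math.exp(-alpha * y * h(x)) for x, y, w in zip(xs, ys, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]

    def H(x):
        # Weighted majority vote as in (2.67); a zero sum is resolved as +1.
        return 1 if sum(a * h(x) for a, h in ensemble) >= 0 else -1

    return H, gammas

# Toy data that no single stump can separate.
xs = list(range(10))
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
H, gammas = adaboost(xs, ys, T=10)
train_error = sum(H(x) != y for x, y in zip(xs, ys)) / len(xs)
bound = math.exp(-2.0 * sum(g * g for g in gammas))  # right-hand side of (2.68)
```

Comparing `train_error` with `bound` after each additional round shows how the right-hand side of (2.68) shrinks as long as every weak hypothesis keeps an edge over random guessing.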
Quinlan compared bagged and boosted versions of C4.5 in [164] and concluded that both yield significantly more accurate classifiers than the standard C4.5 algorithm. In the tests, boosting appeared to be more effective than bagging, but the performance of bagged C4.5 was more stable than that of the boosted version. However, Quinlan also cited tests by Freund and Schapire on bagged and boosted versions of C4.5, in which bagging was found to be much more competitive with boosting. He assumed that the different test setups were responsible for the diverging results.
2.5.5.3 Random Forests
Breiman introduced random forests in [165]. The name has a double usage: as a general term for ensemble methods using DT-type classifiers and for a specific implementation by Breiman [166]. The basic idea of random forests in general is to generate a large number of trees and combine their outputs by voting for the most popular class. In [166] a margin function was defined, which measures the extent to which the average number of votes for the correct class exceeds the average number of votes for any other class. It was stated that the confidence of the classification grows with the size of the margin. Furthermore, using the Strong Law of Large Numbers, which states that the average of the samples converges almost surely to the expectation value, it was shown that random forests do not overfit as more trees are added, but that the generalization error instead converges to a limiting value. An upper bound on the generalization error is given in (2.69).
\[ PE^* \le \bar{\rho} (1 - s^2) / s^2 \qquad (2.69) \]
Two values are important for the generalization error of random forests: the strength of the individual classifiers $s$ and the correlation between the classifiers $\rho$. A ratio that is a helpful guide in understanding the functioning of random forests is the $c/s^2$ ratio given in (2.70), where $\bar{\rho}$ is the mean value of the correlation.
\[ c/s^2 = \bar{\rho} / s^2 \qquad (2.70) \]
The $c/s^2$ ratio is the correlation divided by the square of the strength. The smaller this ratio, the smaller will be the generalization error. To improve the accuracy, the randomness, which is injected to create the individual trees in the forest, needs to minimize the correlation $\rho$ while preserving the strength $s$.
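As a small worked illustration of (2.69) and (2.70), the following sketch evaluates the bound and the ratio for two hypothetical forests with the same strength but different mean correlation. The numerical values are invented for illustration and are not taken from [166].

```python
def generalization_error_bound(mean_correlation, strength):
    """Upper bound (2.69): rho_bar * (1 - s^2) / s^2."""
    if strength <= 0:
        raise ValueError("the bound requires a positive strength s")
    return mean_correlation * (1 - strength ** 2) / strength ** 2

def c_over_s2(mean_correlation, strength):
    """Ratio (2.70): rho_bar / s^2."""
    return mean_correlation / strength ** 2

# Hypothetical forests with equal strength s = 0.6 and different correlation.
bound_correlated = generalization_error_bound(0.5, 0.6)    # about 0.89
bound_decorrelated = generalization_error_bound(0.2, 0.6)  # about 0.36
ratio_decorrelated = c_over_s2(0.2, 0.6)                   # about 0.56
```

Lowering the mean correlation at constant strength lowers both the ratio and the bound. For weak or highly correlated trees the bound can exceed 1, in which case it carries no information.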
One method of creating a random forest is to grow several classification trees, each trained on a bootstrapped sample of the training data, and to determine the combined classification by a majority vote. Another method randomly selects inputs or combinations of inputs at each node to grow each tree. Two strategies were proposed in [166]. The first one randomly selects $F$ input features. The second one generates $F$ new features, each by randomly selecting $L$ features, multiplying each with a coefficient that is a uniform random number in $[-1, 1]$, and adding the multiplied features to form a linear combination of the original features. In both cases $F$ features are randomly selected or calculated as linear combinations, and a search is performed on these $F$ features to find the best split.
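The two strategies for generating candidate features at a node can be sketched as follows. The sketch only produces the $F$ candidates; the split search itself is omitted. Drawing the $L$ inputs of one combination without replacement is an assumption, as are all function names and the example values.

```python
import random

def random_input_selection(n_features, F, rng):
    """First strategy: pick F distinct input feature indices for one node."""
    return rng.sample(range(n_features), F)

def draw_combinations(n_features, F, L, rng):
    """Second strategy: draw F combinations, each pairing L inputs with
    coefficients that are uniform random numbers in [-1, 1]."""
    return [[(index, rng.uniform(-1.0, 1.0))
             for index in rng.sample(range(n_features), L)]
            for _ in range(F)]

def apply_combinations(sample, combinations):
    """Evaluate every drawn linear combination on one sample."""
    return [sum(coef * sample[index] for index, coef in combination)
            for combination in combinations]

rng = random.Random(42)
selected = random_input_selection(n_features=10, F=3, rng=rng)
combinations = draw_combinations(n_features=10, F=3, L=2, rng=rng)

# The same combinations must be applied to every sample that reaches the node.
sample = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]
new_features = apply_combinations(sample, combinations)
```

Separating the drawing step from the evaluation step keeps the coefficients fixed for all samples at a node, so that the split search compares like with like.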
It was also reported that, in contrast to AdaBoost, random forests yield better results in the presence of noise, as they do not concentrate weight on any subset of the instances. AdaBoost, on the other hand, concentrates increasing weight on instances that are misclassified due to incorrect class labels. These mislabeled samples continue to be misclassified, and their weights increase even further. Breiman also stated that the results of random forests are competitive with boosting and adaptive bagging while not progressively changing the training set. According to his work, random inputs and random features produce good results in classification but inferior results in regression, and the results depend on the way randomness is injected into the process.
In [167] a random forest was applied to a multisource remote sensing classification problem and compared to bagging and boosting methods. It was noted that ensemble methods are in general considered to be black-box classifiers. The random forest was reported to outperform boosting and bagging in general regarding overall accuracy, but was outperformed by boosting with a j4.8 DT as base classifier. Furthermore, the random forest classifier was reported to be advantageous for multisource classification, as it is nonparametric and no statistical model is needed.
2.5.5.4 Other Multiple Classifier Systems
In [20] another multiple classifier approach was presented with an ensemble of SVMs whose outputs were fused by another SVM. The algorithm was compared to several single classifiers and to SVM classifiers combined by majority voting and absolute maximum; the results will be given in chapter 2.5.6. Another approach based on SVMs was presented in [21], where a simple random mechanism, inspired by random forests, was used to create the individual SVMs. The outputs were fused based on a weighted majority vote. Each classifier was based on a random subset of the training data. The unused training samples were used to estimate the average classification accuracy $\rho_i$ of each classifier independently. From the average accuracy the weight $b_i$ for the classifier was calculated as given in (2.71).
\[ b_i(x) = \log \frac{\rho_i}{1 - \rho_i} \qquad (2.71) \]
It was stated that the proposed method generated better results regarding classification accuracy and visual quality of the classification map compared to a similar approach using a CART as base classifier, a random forest approach, a Gaussian maximum a posteriori probability classifier, an SVM, and two versions of Markov random fields. However, the test areas were small, and the performance of the approach was not tested on a larger area.
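The weighting in (2.71) and the resulting weighted majority vote can be sketched as follows. The natural logarithm is assumed, since the base is not stated; it does not affect which class wins. The class labels, accuracies, and function names are hypothetical and not taken from [21].

```python
import math
from collections import defaultdict

def classifier_weight(accuracy):
    """Weight (2.71): log-odds of the estimated accuracy rho_i (0 < rho_i < 1)."""
    return math.log(accuracy / (1.0 - accuracy))

def weighted_majority_vote(predictions, accuracies):
    """Sum the weights of the classifiers voting for each class; return the winner."""
    scores = defaultdict(float)
    for label, accuracy in zip(predictions, accuracies):
        scores[label] += classifier_weight(accuracy)
    return max(scores, key=scores.get)

# Three hypothetical classifiers: one accurate, two barely better than chance.
predictions = ["forest", "water", "water"]
accuracies = [0.95, 0.60, 0.65]
fused = weighted_majority_vote(predictions, accuracies)
```

In this example the single accurate classifier outweighs the two weak ones, so the fused label differs from an unweighted majority vote. A classifier with an accuracy of 0.5 receives a weight of zero, and one below 0.5 receives a negative weight.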
A spatial classifier, a k-nearest neighbor algorithm, and a linear discriminant classifier were combined by a sum rule, a product rule, and two versions of a stacked regression in [168], thereby using an ensemble of different classifiers instead of different versions of the same base classifier. For some individual classes, the classification accuracy of the ensemble classifiers was lower than that of the single classifiers, but the overall accuracies were significantly higher for the ensemble classifiers. This led the authors to the conclusion that the classifiers need to be chosen carefully, taking single-class accuracies into account, to avoid a decrease in class accuracy. The product rule was found to be almost as good as the stacked regression while being much simpler and faster to calculate.
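The sum rule and the product rule can be illustrated with a minimal sketch that assumes each classifier outputs a posterior probability per class and that the fused decision is the class with the highest combined score. The three posterior vectors are invented and do not come from [168]; the stacked regression is not covered here.

```python
def sum_rule(posteriors):
    """Return the index of the class with the highest summed posterior."""
    n_classes = len(posteriors[0])
    scores = [sum(p[c] for p in posteriors) for c in range(n_classes)]
    return max(range(n_classes), key=scores.__getitem__)

def product_rule(posteriors):
    """Return the index of the class with the highest product of posteriors."""
    n_classes = len(posteriors[0])
    scores = [1.0] * n_classes
    for p in posteriors:
        scores = [score * p[c] for c, score in enumerate(scores)]
    return max(range(n_classes), key=scores.__getitem__)

# Two classifiers favor class 0; the third almost rules it out.
posteriors = [
    [0.90, 0.10],
    [0.90, 0.10],
    [0.01, 0.99],
]
decision_sum = sum_rule(posteriors)          # summed scores 1.81 vs 1.19
decision_product = product_rule(posteriors)  # products 0.0081 vs 0.0099
```

Here the two rules disagree: under the product rule a single classifier that assigns a class a probability close to zero acts almost as a veto, whereas the sum rule averages it out.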