Our presentation of the taxonomy of ensemble learning methods is structured according to the various methods for introducing diversity among the ensemble members. The standard way of categorizing ensemble learning methods is by dividing them into two general groups: methods that modify the training data and methods that modify the learning process. We will use the same taxonomy with a different perspective, which we believe gives a more intuitive view with respect to the inductive inference process.
Through the previous chapters, we have viewed the learning process as a search through the hypothesis space. Consequently, when building an ensemble we would like the individual estimators to occupy different points in the hypothesis space, preferably all near the optimal one. We will therefore look at two categories of methods for achieving diversity in the ensemble’s errors: those that obtain diverse trajectories in the hypothesis space for the base-level learning algorithms by diversifying the set of accessible hypotheses, and those that achieve this by diversifying the strategy of traversal:
• Diversifying the Set of Accessible Hypotheses - These methods vary the set of hypotheses which are available to the learning algorithm in two ways, by diversifying the input space or by diversifying the output space for the different ensemble members. It should be noted here that the space of all possible hypotheses H remains the same, although a different subset $H_i \subseteq H$ would be accessible for each of the individual estimators.
• Diversifying the Traversal Strategy - These methods basically alter the mechanism of modeling the inductive decisions, thus explicitly varying the way the algorithm traverses the hypothesis space, thereby leading to different hypotheses. The simplest method for diversification of the traversal strategy is by introducing a random component. Each traversal of the hypothesis space seeded with a different random seed would likely yield a different trajectory, which in the case of tree-structured predictors would correspond to a different decision or regression tree.
Ensembles of Decision and Regression Trees 43
Naturally, there are methods that utilize a combination of these two categories, but we are mainly interested in the strategies for diversification and their effect on the inference process.
Figure 4: An illustration of the process of learning ensembles through diversification of the training data and diversification of the traversal strategy.
A unified view is given by the illustration in Figure 4. The process starts by creating a collection of n training datasets obtained through some method for diversifying the input and/or the output space. A predictive model is further learned on each of the training datasets by using a machine learning algorithm of choice. In this case, since we are interested mainly in ensembles of decision and regression trees, the learning algorithm will construct a decision tree from each training dataset.
Optionally, it is possible to introduce some randomness in the decision tree learning algorithm, which corresponds to diversifying the traversal strategy of the algorithm. In the well-known Random Forest algorithm, Breiman (2001) modified the standard approach for learning a decision tree by restricting the candidate splits examined at each splitting node to those over a small subset of randomly chosen attributes. Each node is therefore split using the best among a random subset of predictors. This somewhat counterintuitive strategy turns out to perform very well as compared to many other classifiers, including discriminant analysis, support vector machines, and neural networks (Breiman, 2001).
Once an ensemble of decision trees has been learned, each on a different sample of the original training data, every tree is applied to a separate test set of instances. Their predictions are finally combined using some form of combination rule. In the classification case, a popular method to combine predictions from individual models is majority voting, i.e., the predictions are combined by using the plurality of votes in favor of a particular class. In the regression case, a simple weighted or non-weighted averaging would in general suffice.
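The two combination rules mentioned above can be illustrated with a minimal sketch. The function names `majority_vote` and `average` are our own illustrative choices, and the tie-breaking behavior (the label seen first wins) is an assumption rather than part of any particular ensemble method.

```python
from collections import Counter
from statistics import fmean


def majority_vote(labels):
    """Return the class with the most votes (plurality).

    Assumption: ties are broken in favor of the label seen first.
    """
    return Counter(labels).most_common(1)[0][0]


def average(predictions, weights=None):
    """Combine numeric predictions by a plain or weighted mean."""
    if weights is None:
        return fmean(predictions)
    total = sum(weights)
    return sum(w * p for w, p in zip(weights, predictions)) / total


# Three classifiers vote on one instance; three regressors predict one value.
combined_class = majority_vote(["spam", "ham", "spam"])
combined_value = average([2.0, 4.0, 6.0])
weighted_value = average([2.0, 4.0], weights=[3.0, 1.0])
```

In the weighted case the weights would typically reflect some estimate of the quality of each ensemble member; how such weights are obtained is left open here.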
In a final note, we would like to mention a third category of ensemble learning methods that basically represents a regularization of the variance of the ensemble by using a meta-learning algorithm. Here, regularization of variance denotes a method for synthesizing and combining information from other models. Stacking or stacked generalization (Wolpert, 1992) is mainly a method for combining heterogeneous base models, such as the k-nearest neighbors method, decision trees, Naive Bayes, etc. The central idea is to learn the biases of the base predictors with respect to a particular training set, and then to filter them out. The procedure has two steps. First, a meta-learning dataset is generated using the predictions of the base models. Second, a meta model is learned from the meta-learning set, which can combine the predictions of the base models into a final prediction. This method is used as some form of a combination rule. As such, it falls outside the scope of this chapter, for it does not belong to the methods that influence the inference process of the individual estimators, but rather concerns the combining of their predictions into the final ensemble prediction.
4.3.1 Diversifying the Set of Accessible Hypotheses
We have identified three ways of diversifying the set of accessible hypotheses, each discussed in more detail in the following sections. We will discuss methods for manipulation or diversification of the training data, including the input space X and the output space Y.
4.3.1.1 Diversification of the Training Data
The most well-known methods for learning ensembles by manipulation of training data are Bagging, introduced by Breiman (1996a), and Boosting, proposed by Freund and Schapire (1997). These ensemble learning algorithms have been empirically shown to be very effective in improving the generalization performance of the individual estimators in a number of studies. In order to promote variance among the predictors, these algorithms rely on altering the training set used when learning different ensemble members.
Bagging is an acronym for Bootstrap Aggregating. It is the simplest ensemble learning algorithm and consists mainly of creating a number of bootstrap replicates of the training dataset. A bootstrap replicate is created by sampling randomly with replacement, i.e., an instance $a_i = \langle x_i, y_i \rangle$ may appear multiple times in the sample. Assume that the algorithm creates M such bootstrapped samples. The obtained bootstrap replicates are used to fit M predictive models $T_m$, each hopefully representing a different hypothesis from H. After the base learners have been fit, the aggregated response is the average over the $T_m$'s when predicting a numerical outcome (regression), and a plurality vote when predicting a categorical outcome (classification). An advantage of Bagging is that it is a parallel algorithm, both in its training and operational phases, which enables parallel training and execution of the M ensemble members.
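The bagging procedure for regression can be sketched as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the base learner is a hypothetical one-split regression stump on a single numeric attribute (`fit_stump`), chosen only to keep the example short, and the names `bagging` and `predict` are our own.

```python
import random
from statistics import fmean


def fit_stump(xs, ys):
    """Fit a one-split regression stump (illustrative base learner)."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # keep the right side non-empty
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        l_mean, r_mean = fmean(left), fmean(right)
        sse = (sum((y - l_mean) ** 2 for y in left)
               + sum((y - r_mean) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, l_mean, r_mean)
    if best is None:  # replicate holds a single distinct x: predict the mean
        mean = fmean(ys)
        return lambda x: mean
    _, threshold, l_mean, r_mean = best
    return lambda x: l_mean if x <= threshold else r_mean


def bagging(xs, ys, M, rng):
    """Fit M base models, each on its own bootstrap replicate."""
    n = len(xs)
    models = []
    for _ in range(M):
        idx = [rng.randrange(n) for _ in range(n)]  # sampling with replacement
        models.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return models


def predict(models, x):
    """Aggregate by averaging (regression case)."""
    return fmean([model(x) for model in models])


rng = random.Random(0)
xs = [float(i) for i in range(10)]
ys = [0.0 if x < 5 else 10.0 for x in xs]
models = bagging(xs, ys, M=25, rng=rng)
low, high = predict(models, 0.0), predict(models, 9.0)
```

Since the M fits are independent of each other, the loop in `bagging` could be distributed over several workers without changing the result, which is the parallelism referred to above.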
The commonly accepted answer to the question "Why does Bagging work?" is that bagging smooths out the estimates of the individual models and reduces the variance of the ensemble. This was empirically supported in our own experimental study, as we shall see in the following chapters. In the case of classification, however, it is less clear why voting works. One generally accepted belief is that bagging transforms the prior distribution of the classifier models into a new distribution of models of higher complexity (Domingos, 1997).
Boosting has its origins in the online learning algorithm called Hedge(β) by Freund and Schapire (1997), developed as part of a decision-theoretic framework for online learning. Within this framework, the authors propose weighting of a set of experts in order to predict the outcome of a certain event. Each expert $s_i$ is assigned a weight which can be interpreted as the probability that $s_i$ is the best expert in the group. These probabilities are updated online, with the arrival of each new "training" event. The experts that predicted correctly are rewarded by increasing their weight, reflecting our increased belief in their expertise, while the experts that predicted incorrectly are penalized by decreasing their weight. The Hedge(β) algorithm evolves the distribution of our beliefs in the experts in order to minimize the cumulative loss of the prediction. Freund and Schapire (1997) proved an upper bound on the loss which is not much worse than the loss of the best expert in the ensemble in hindsight.
The well-known AdaBoost (Adaptive Boosting) algorithm, which has resulted from this study, works in a similar manner. The general boosting idea is to develop a classifier team incrementally, by adding one classifier at a time. In each iteration of the process, a new training set is generated which takes into account the predictive accuracy of the classifier resulting from the previous iteration. The sampling distribution starts from uniform, and progresses towards increasing the likelihood of the "difficult" instances. Basically, the classifier that joins the team at step i is trained on a dataset which is selectively sampled from D, such that the likelihood of instances which were misclassified at step i − 1 is increased. Therefore, each succeeding classifier gets access to a different set of hypotheses, which the previous classifier was not able to explore well enough.
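The evolution of the sampling distribution can be sketched as follows. We assume the exponential form of the update commonly used for binary AdaBoost, with classifier weight α = ½ ln((1 − ε)/ε); after normalization this is equivalent to multiplying the weights of the correctly classified instances by β = ε/(1 − ε). The function names `update_distribution` and `resample` are illustrative.

```python
import math
import random


def update_distribution(weights, misclassified):
    """One AdaBoost-style update of the sampling distribution.

    weights        current distribution over instances (sums to 1)
    misclassified  booleans, True where the latest classifier erred
    Assumption: binary AdaBoost with the exponential update rule.
    """
    eps = sum(w for w, m in zip(weights, misclassified) if m)  # weighted error
    if not 0.0 < eps < 0.5:
        raise ValueError("weighted error must lie strictly between 0 and 0.5")
    alpha = 0.5 * math.log((1.0 - eps) / eps)
    raw = [w * math.exp(alpha if m else -alpha)
           for w, m in zip(weights, misclassified)]
    z = sum(raw)  # normalization constant
    return [r / z for r in raw], alpha


def resample(data, weights, rng):
    """Draw the next training set according to the current distribution."""
    return rng.choices(data, weights=weights, k=len(data))


n = 8
D = [1.0 / n] * n                    # the distribution starts out uniform
errors = [True, True] + [False] * 6  # instances 0 and 1 were misclassified
D_next, alpha = update_distribution(D, errors)
hard_mass = sum(w for w, m in zip(D_next, errors) if m)
next_sample = resample(list(range(n)), D_next, random.Random(1))
```

In this example the two misclassified instances move from a probability of 1/8 each to 1/4 each, so that together they carry half of the total probability mass in the next round.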
The resampling scheme described above is only one of two possible implementations of AdaBoost. The other implementation performs reweighting instead of resampling and is thus deterministic. In this case, AdaBoost generates a sequence of base models $T_1, \ldots, T_M$ using weighted training sets such that the training examples misclassified by model $T_{m-1}$ are given half the total weight when training the model $T_m$, and correctly classified examples are given the remaining half of the weight.
4.3.1.2 Diversification of the Input Space
An early work by Ho (1998) proposes a randomization in the input space in the context of ensemble learning, such that every base model is trained using a randomly chosen subset of attributes. The random subspace method (RSM) is based on random sampling of features instead of data points. Let each learning example $a_i$ in the learning set S be a (p + 1)-dimensional vector $a_i = \langle x_{i1}, x_{i2}, \ldots, x_{ip}, y_i \rangle$. RSM randomly selects d attributes from the first p, where d < p. By this, we obtain a d-dimensional random subspace of the original p-dimensional attribute space over which a new predictor is learned. Typically, d is kept the same for all members of the ensemble. Ho (1998) has reported good results for tree classifiers built from d ≈ p/2 features. The common knowledge is that, when the dataset has many attributes each containing little information on the target, one may obtain better models in random subspaces than in the original attribute space (Breiman, 2001; Ho, 1998). The same idea has been used in the Random Forest algorithm of Breiman (2001); however, its final effect is different, since repeated random selection of features is performed within the inference process.
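The attribute sampling step of RSM can be sketched as follows. The helper names `random_subspace` and `project` are hypothetical, and the learning of a predictor on each projected dataset is omitted.

```python
import random


def random_subspace(p, d, rng):
    """Select d of the p attribute indices, without replacement."""
    if not 0 < d < p:
        raise ValueError("RSM requires 0 < d < p")
    return sorted(rng.sample(range(p), d))


def project(rows, features):
    """Restrict every input vector to the selected attributes."""
    return [[row[j] for j in features] for row in rows]


rng = random.Random(42)
X = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]
p = len(X[0])
d = p // 2  # d is kept the same for all ensemble members
subspaces = [random_subspace(p, d, rng) for _ in range(5)]
views = [project(X, features) for features in subspaces]
```

Each ensemble member has to remember its own list of selected attributes, since the same projection must be applied to a test instance before that member can predict.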
4.3.1.3 Diversification in the Output Space
In a study on feature selection for linear regression, Breiman (2000) found the somewhat surprising result that adding noise to the target attribute values, while leaving the input vectors intact, worked just as well as bagging. He extended those early results to non-linear contexts in both regression and classification tasks, and showed that, given a single training set, introducing extra random variation into the outputs yields predictors which, when averaged or voted, may perform comparably in accuracy to methods such as bagging or boosting. Practically, adding random noise to the output enables us to produce a sequence of perturbed training sets of similar quality to those obtained with bagging.
For the case of regression, adding a random component to the outputs can be done in a straightforward way. Breiman (2000) proposed the Output Smearing procedure which describes a method of adding Gaussian noise to the outputs. The procedure requires a robust estimate of the standard deviation of the outputs in the dataset. This is achieved by using a second pass over the data in which the values which deviate more than 2.5 standard deviations from the original mean are removed and the standard deviation is re-computed. Then, the new outputs are generated as:
\[ y' = y + z \cdot \mathrm{sd}(y), \qquad (19) \]
where sd(y) is the standard deviation estimate and z is an independent unit normal variable. A maximal tree is then grown using the new set of outputs, and the whole procedure is repeated M times. The final prediction is given as the average over the predictions of the M previously grown trees. The procedure is inherently parallel, allowing us to grow M trees simultaneously. Breiman (2000) has shown that adding output noise (mean-zero Gaussian noise added to each of the outputs), combined with random feature selection, compares favorably to bagging in terms of accuracy.
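The generation of the perturbed outputs can be sketched as follows. This is a minimal illustration, assuming the population standard deviation (`statistics.pstdev`) as the estimator; the function names `robust_sd` and `smear_outputs` are our own, and the growing of the trees on each perturbed set is omitted.

```python
import random
from statistics import fmean, pstdev


def robust_sd(ys, cutoff=2.5):
    """Two-pass estimate of the standard deviation of the outputs.

    Values further than `cutoff` standard deviations from the original
    mean are removed before the standard deviation is re-computed.
    Assumption: the population standard deviation is used in both passes.
    """
    mean, sd = fmean(ys), pstdev(ys)
    kept = [y for y in ys if abs(y - mean) <= cutoff * sd]
    return pstdev(kept)


def smear_outputs(ys, rng):
    """Return new outputs y' = y + z * sd(y), with z drawn from N(0, 1)."""
    sd = robust_sd(ys)
    return [y + rng.gauss(0.0, 1.0) * sd for y in ys]


rng = random.Random(7)
ys = [1.0, 2.0, 3.0, 2.0, 1.5, 2.5, 2.0, 1.8, 2.2, 50.0]  # one gross outlier
M = 3
perturbed_sets = [smear_outputs(ys, rng) for _ in range(M)]
```

In this example the single outlier inflates the plain standard deviation to about 14.4, whereas the robust estimate, computed after its removal, is about 0.54. The noise added to the outputs is therefore scaled by the spread of the bulk of the data.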
In a similar study, Geurts (2001) has shown that significant improvements of the accuracy of a single model can be obtained by a simple perturbation of the testing vector at prediction time. The Dual Perturb and Combine algorithm (dual P&C) proposed by Geurts (2001) was shown to produce improvements comparable to those obtained with bagging. Dual P&C produces a number of perturbed versions of the attribute vector of a testing instance by adding Gaussian noise. In the context of decision trees, adding Gaussian noise to the attribute vector is more or less equivalent to adding Gaussian noise to the discretization thresholds applied in the splitting tests, which in some ways amounts to randomizing the discretization thresholds.
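The prediction-time perturbation can be sketched as follows. This is an illustration under stated assumptions: a single noise level `noise_sd` is shared by all attributes, the predictions for the perturbed copies are combined by averaging, and the one-test `stump` stands in for a learned tree. None of these choices is taken from the original algorithm description.

```python
import random
from statistics import fmean


def dual_perturb_and_combine(model, x, noise_sd, n_copies, rng):
    """Average the predictions of one model over noisy copies of x."""
    predictions = []
    for _ in range(n_copies):
        noisy = [v + rng.gauss(0.0, noise_sd) for v in x]
        predictions.append(model(noisy))
    return fmean(predictions)


def stump(x):
    """Hypothetical single-test tree: predicts 1.0 iff x[0] > 0.5."""
    return 1.0 if x[0] > 0.5 else 0.0


rng = random.Random(3)
far = dual_perturb_and_combine(stump, [5.0], noise_sd=0.1, n_copies=200, rng=rng)
near = dual_perturb_and_combine(stump, [0.5], noise_sd=0.1, n_copies=200, rng=rng)
exact = dual_perturb_and_combine(stump, [0.9], noise_sd=0.0, n_copies=10, rng=rng)
```

For an instance far from the threshold the noise has no effect on the prediction, while for an instance close to the threshold the hard decision of the tree is replaced by a smoother, intermediate value.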
4.3.2 Diversification of the Traversal Strategy
In contrast to deterministic information-guided learning, among the most accurate methods for learning predictors we also have methods that incorporate randomness in the learning process. For example, neural networks use randomization to assign initial weights, bagging generates ensembles from randomly sampled bootstrap replicates of the training set, and Boltzmann machines are stochastic in nature. Randomizing the learning process has often been regarded as an efficient way to promote diversity among ensemble models, while reducing the computational complexity and improving the accuracy. The main argument in classification is that the classification margin of an example allows for less precise models, which can be learned more efficiently by replacing the brute-force examination of all attribute-value combinations with a more efficient one (Breiman, 2001). From our point of view, this method enables diversification of the search strategy, resulting in a different traversal trajectory in the space of hypotheses for every random seed.
All of these methods have been shown to achieve remarkable accuracy, despite the counterintuitive action of injecting random bits of information in the learning process. The common belief is that randomization in the tree building process improves accuracy by avoiding overfitting, while bagging achieves that by reducing the instability of decision tree learning algorithms. The most interesting representative of this group of ensemble learning algorithms is the RandomForest algorithm by Breiman (2001). The RandomForest algorithm, as described briefly in the first section, can be thought of as a variant of Bagging. The main point of departure from Bagging is that it allows for a random selection of a subset of attributes considered in the split selection phase at every node of the tree.
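The randomized split selection at a single node can be sketched as follows. This is a minimal illustration for the regression case, assuming that candidate splits are scored by the sum of squared errors of the two resulting partitions; the function names and the parameter `k` (the number of attributes sampled at the node) are our own.

```python
import random
from statistics import fmean


def sse(ys):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not ys:
        return 0.0
    mean = fmean(ys)
    return sum((y - mean) ** 2 for y in ys)


def best_split(rows, ys, features):
    """Best (score, attribute, threshold) among the given attributes."""
    best = None
    for j in features:
        for t in sorted({row[j] for row in rows})[:-1]:
            left = [y for row, y in zip(rows, ys) if row[j] <= t]
            right = [y for row, y in zip(rows, ys) if row[j] > t]
            score = sse(left) + sse(right)
            if best is None or score < best[0]:
                best = (score, j, t)
    return best


def random_forest_split(rows, ys, k, rng):
    """Examine only k randomly chosen attributes at this node."""
    candidates = rng.sample(range(len(rows[0])), k)
    return candidates, best_split(rows, ys, candidates)


rows = [[0, 5, 1], [1, 0, 0], [2, 4, 1], [3, 1, 0], [4, 3, 1], [5, 2, 0]]
ys = [0.0, 0.0, 0.0, 10.0, 10.0, 10.0]
exhaustive = best_split(rows, ys, range(3))
candidates, randomized = random_forest_split(rows, ys, k=2, rng=random.Random(0))
```

Whenever the sampled subset does not contain the most informative attribute (attribute 0 in this example), the node is split on a locally suboptimal test, which is what produces different trees from the same training data.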
The accuracy of a RandomForest depends on the strength of the individual tree classifiers and the correlation among them. Breiman (2001) proved a bound on the generalization error of RandomForest, showing that RandomForest does not overfit as more trees are added. In particular, given a RandomForest of decision trees, the following inequality is proven:
\[ PE^{*} \leq \frac{\rho \, (1 - s^{2})}{s^{2}} \qquad (20) \]
where ρ is the correlation between the residuals (in the case of classification the residuals are computed using a margin function) and s is the strength of the set of classifiers, i.e., their expected margin. Although the bound is likely to be loose, it shows that in order to improve the generalization error of the forest, one has to minimize the correlation while maintaining strength.
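To give a feeling for how the correlation and the strength enter the bound, the right-hand side of the inequality can be evaluated for a few values. The numbers below are arbitrary and serve only as an illustration; the function name `generalization_error_bound` is our own.

```python
def generalization_error_bound(rho, s):
    """Right-hand side of the bound: rho * (1 - s^2) / s^2."""
    if not 0.0 < s <= 1.0:
        raise ValueError("the strength s must lie in (0, 1]")
    return rho * (1.0 - s ** 2) / s ** 2


baseline = generalization_error_bound(rho=0.2, s=0.6)         # about 0.356
less_correlated = generalization_error_bound(rho=0.1, s=0.6)  # about 0.178
stronger = generalization_error_bound(rho=0.2, s=0.8)         # about 0.113
```

Halving the correlation halves the bound, while increasing the strength from 0.6 to 0.8 at a fixed correlation reduces it by roughly a factor of three.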
A similar bound for the mean squared generalization error is derived for the case of regression, which shows that the decrease in error from the individual trees in the forest depends on the correlation between the residuals and the mean squared error of the individual trees. Let $f(\theta)$ denote a tree predictor for a given randomly chosen set of feature vectors $\theta$ employed at each node. Let $PE^{*}(\mathrm{forest})$ and $PE^{*}(\mathrm{tree})$ correspond to the predictive error of the forest and the average predictive error of a single tree, respectively. The requirements for accurate regression forests are low correlation between the residuals and low-error trees, as stated by the following theorem:
If for all random choices of feature vectors $\theta$, $E[Y] = E_X[f(\theta)]$, then $PE^{*}(\mathrm{forest}) \leq \rho \, PE^{*}(\mathrm{tree})$, where $\rho$ is the weighted correlation between the residuals $Y - f(\theta)$ and $Y - f(\theta')$, and $\theta$, $\theta'$ are independent.
An interesting difference between RandomForests for regression and classification is that in the case of regression the correlation increases slowly with the increase of the number of features used. Basically, a relatively large number of features is required to reduce the generalization error of a tree.