3.4 Data Layer Methods and Results
4.1.2 Statistical model description
4.1.2.1 Random Forest Models
Random Forest is an ensemble-learning technique that generates many classification trees
that are aggregated to compute a final classification (Breiman et al. 1984; Breiman 2001).
Random Forests have been found to be among the most accurate prediction methods (Svetnik et al. 2003). Unlike
standard decision trees, which consider all predictor variables at each split, a random forest only
considers a subset of predictors randomly chosen at each node (Breiman 2001; Liaw and Wiener
2002). Each individual classification tree built in the forest is unique because the training data are
resampled with replacement (called bootstrap sampling), and the subset of predictors
considered at each split of the tree is randomly redrawn. One desirable feature of the RF is its
built-in estimate of prediction error, called the out-of-bag (OOB) error estimate, which the
bootstrap sampling makes possible. Additionally, RFs have only two parameters to set when tuning the
model: the number of variables to consider at each node (mtry) and the number of trees in the
forest (ntree) (Liaw and Wiener 2002). Table 4.1 outlines the steps for the RF algorithm as well as
for the error estimation that takes place within the RF.
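The bootstrap resampling described above can be illustrated with a minimal sketch. The function name bootstrap_split, the sample size, and the seed are assumptions made for illustration only. Because rows are drawn with replacement, each row is left out of a given bootstrap sample with probability (1 - 1/n)^n, which approaches e^-1 (about 0.368) for large n; the left-out rows are that tree's OOB data.

```python
import random


def bootstrap_split(n, rng):
    """Draw n row indices with replacement; return (in_bag, oob) index lists."""
    in_bag = [rng.randrange(n) for _ in range(n)]
    chosen = set(in_bag)
    # Rows never drawn form the out-of-bag (OOB) set for this sample.
    oob = [i for i in range(n) if i not in chosen]
    return in_bag, oob


rng = random.Random(1)  # fixed seed so the sketch is repeatable
n = 1000                # hypothetical number of training rows
in_bag, oob = bootstrap_split(n, rng)
oob_fraction = len(oob) / n
print(f"distinct in-bag rows: {len(set(in_bag))}, OOB fraction: {oob_fraction:.3f}")
```

Each tree in the forest would receive its own in_bag sample, so every training row ends up out-of-bag for roughly a third of the trees.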
During model development, two parameters need to be optimized to decrease the error,
mtry and ntree. The ntree parameter needs to be set sufficiently high to allow for convergence of the
OOB error rate. The mtry parameter has an influence on both the strength of the individual trees
in the forest, as well as the correlation between trees in the forest. Reducing mtry reduces both
strength and correlation. Reduction of strength results in increased error, while reduction of
correlation results in decreased error. Therefore, mtry must be optimized to minimize the error
Table 4.4: Description of Random Forest algorithm steps
Random Forest Algorithm Steps:
1. Draw ntree bootstrap samples from the training data.
2. For each bootstrap sample, grow a classification or regression tree to its full size. Instead of considering all possible predictors at each node of the tree, draw a random sample of size mtry of the predictors and choose the best split from only those predictors.
3. Use the fully grown trees to predict new data by aggregating the predictions of all trees (for classification, the majority vote becomes the prediction).
Random Forest Error Estimation Steps:
1. For each tree (and subsequently each bootstrap sample), predict the data not in the bootstrap sample using the tree grown with the bootstrap sample. The data not in the bootstrap sample are called the "out-of-bag" or OOB data (Breiman 2001).
2. Aggregate the OOB predictions to calculate the error rate. This is called the OOB estimate of error rate and has been shown to be both an accurate estimate of the generalization error and similar to the more traditional model error estimate procedure of using cross-validation (Bylander 2002; Liaw and Wiener 2002).
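The algorithm and error-estimation steps listed above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in this study: all function names (gini, best_split, grow_tree, predict_tree, random_forest, predict_forest) are hypothetical, the data are synthetic with two informative predictors and one noise predictor, and splits are scored with Gini impurity.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(X, y, rows, features):
    """Best (feature, threshold) among the candidate features, or None."""
    best, best_score = None, gini([y[i] for i in rows])
    for f in features:
        # Candidate thresholds: every observed value except the largest,
        # so both child nodes are always non-empty.
        for t in sorted({X[i][f] for i in rows})[:-1]:
            left = [y[i] for i in rows if X[i][f] <= t]
            right = [y[i] for i in rows if X[i][f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if score < best_score:
                best, best_score = (f, t), score
    return best


def grow_tree(X, y, rows, mtry, rng):
    """Grow an unpruned tree; leaves are class labels, nodes are tuples."""
    labels = [y[i] for i in rows]
    if len(set(labels)) == 1:
        return labels[0]
    features = rng.sample(range(len(X[0])), mtry)  # random subset at each node
    split = best_split(X, y, rows, features)
    if split is None:
        return Counter(labels).most_common(1)[0][0]
    f, t = split
    left = [i for i in rows if X[i][f] <= t]
    right = [i for i in rows if X[i][f] > t]
    return (f, t, grow_tree(X, y, left, mtry, rng), grow_tree(X, y, right, mtry, rng))


def predict_tree(node, x):
    while isinstance(node, tuple):
        f, t, left, right = node
        node = left if x[f] <= t else right
    return node


def random_forest(X, y, ntree, mtry, seed=0):
    """Return (forest, OOB error rate)."""
    rng = random.Random(seed)
    n = len(X)
    forest, oob_votes = [], [Counter() for _ in range(n)]
    for _ in range(ntree):
        in_bag = [rng.randrange(n) for _ in range(n)]  # bootstrap sample
        tree = grow_tree(X, y, in_bag, mtry, rng)
        forest.append(tree)
        for i in set(range(n)) - set(in_bag):          # rows this tree never saw
            oob_votes[i][predict_tree(tree, X[i])] += 1
    scored = [i for i in range(n) if oob_votes[i]]
    wrong = sum(oob_votes[i].most_common(1)[0][0] != y[i] for i in scored)
    return forest, wrong / len(scored)


def predict_forest(forest, x):
    """Majority vote over all trees."""
    return Counter(predict_tree(t, x) for t in forest).most_common(1)[0][0]


# Synthetic data (an assumption for illustration): the class depends on
# the first two predictors; the third is pure noise.
data_rng = random.Random(42)
X = [[data_rng.random() for _ in range(3)] for _ in range(80)]
y = [int(x[0] + x[1] > 1.0) for x in X]
forest, oob_error = random_forest(X, y, ntree=50, mtry=2)
print(f"OOB error rate: {oob_error:.3f}")
```

Rerunning random_forest over a range of mtry values, or with increasing ntree, and comparing the returned OOB error is one simple way to explore the tuning behaviour described in the text.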
In addition to the above-mentioned parameters, the number of predictor variables to
include must also be set during model development. In general, the simplest model is
desired, with as few predictor variables as needed to appropriately model the response.
Simple models are desired to avoid mathematical artifacts that result from including an
excess number of predictors. In fact, it is always possible to fit the training data better by including
more predictor terms. When the number of predictor terms approaches the number of response
observations, a near perfect fit is possible and represents a mathematical artifact rather than a
modeling success (Mac Nally 2000). Additionally, confidence in the model is lessened when
more predictor terms are included, because prediction error increases (Breiman 1995).
Although these suggestions were originally formulated for the more traditional linear
regression models, they also apply to classification and regression tree (CART) models. CART models are less sensitive to
irrelevant predictors than other methods [linear regression, Generalized Additive
Models (GAMs), etc.], but reducing the number of predictors in CART models is still desirable for
Once the model development is complete, RF models provide additional useful products
for ecological studies. The most widely used of these products are variable-importance plots.
Random forest models estimate the importance of predictor variables by analyzing how much the
OOB error increases when OOB data for that variable are permuted, while all other variables are
left unchanged. This is a complicated task, as the importance of any single variable may be due
to its interaction with one or multiple other variables (Breiman 2001; Liaw and Wiener 2002).
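The permutation idea can be sketched as follows. This is a simplified illustration: a fixed rule, predict, stands in for a fitted forest, the data are synthetic, and the function names are hypothetical. In an actual RF the permutation is applied to each tree's OOB data and the resulting error increases are averaged over trees.

```python
import random


def error_rate(predict, X, y):
    """Fraction of rows that predict() classifies incorrectly."""
    return sum(predict(x) != t for x, t in zip(X, y)) / len(y)


def permutation_importance(predict, X, y, rng):
    """Increase in error after shuffling each predictor column in turn."""
    baseline = error_rate(predict, X, y)
    importances = []
    for j in range(len(X[0])):
        column = [x[j] for x in X]
        rng.shuffle(column)  # break the link between predictor j and the response
        X_perm = [x[:j] + [v] + x[j + 1:] for x, v in zip(X, column)]
        importances.append(error_rate(predict, X_perm, y) - baseline)
    return importances


rng = random.Random(0)
# Synthetic data (assumption): the response depends only on predictor 0.
X = [[rng.random(), rng.random()] for _ in range(500)]
y = [int(x[0] > 0.5) for x in X]


def predict(x):
    """Hypothetical stand-in for a fitted model."""
    return int(x[0] > 0.5)


importances = permutation_importance(predict, X, y, rng)
print(importances)  # large for predictor 0, zero for predictor 1
```

Permuting the informative predictor raises the error substantially, while permuting the predictor the model never uses leaves the error unchanged.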
The RF model provides two measures of variable importance: the mean decrease in
accuracy and the mean decrease in Gini coefficient. The mean decrease in accuracy (referred to as
just mean decrease accuracy in the model) is computed at the same time as the OOB error. The
RF calculates the decrease in the accuracy of the model when the OOB values of a single variable are permuted. The
greater the decrease in accuracy of the model when a particular variable is permuted, the
greater the importance of that variable. The Gini coefficient is a measure of the homogeneity of the
nodes in the trees of the final RF. The mean decrease in Gini coefficient calculated by the RF model represents how much each of
the predictor variables contributes to this homogeneity. This is essentially a measure of how well
each predictor classifies the data. When the goal of the model is prediction, it is recommended to
use the mean decrease in accuracy to determine which variables to include (Breiman 2001).
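As a small worked illustration of the Gini-based measure, the sketch below computes the Gini impurity of a node and the decrease achieved by a single split. The function names and the example labels are assumptions for illustration. In an RF, the mean decrease in Gini for a variable is obtained by summing such decreases over every split made on that variable and averaging over all trees.

```python
from collections import Counter


def gini_impurity(labels):
    """1 - sum of squared class proportions; 0 means a perfectly pure node."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def gini_decrease(parent, left, right):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    weighted = (len(left) * gini_impurity(left) + len(right) * gini_impurity(right)) / n
    return gini_impurity(parent) - weighted


parent = ["a"] * 5 + ["b"] * 5  # balanced two-class node, impurity 0.5

# A split that separates the classes perfectly removes all impurity.
clean = gini_decrease(parent, ["a"] * 5, ["b"] * 5)

# A split that barely changes the class mix removes very little.
poor = gini_decrease(parent, ["a", "a", "a", "b", "b"], ["a", "a", "b", "b", "b"])

print(clean, poor)  # 0.5 and approximately 0.02
```

A predictor that repeatedly produces splits like the first one accumulates a large total decrease and is ranked as important; one that only produces splits like the second is ranked low.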