3.4 Data Layer Methods and Results

4.1.2 Statistical model description

4.1.2.1 Random Forest Models

Random Forest is an ensemble-learning technique that generates many classification trees

that are aggregated to compute a final classification (Breiman et al. 1984; Breiman 2001).

Random Forests have been found to be among the best predictors (Svetnik et al. 2003). Unlike

standard decision trees, which consider all predictor variables at each split, a random forest only

considers a subset of predictors randomly chosen at each node (Breiman 2001; Liaw and Wiener

2002). Each individual classification tree built in the forest is unique because the training data are

resampled with replacement (called bootstrap sampling), and the subset of predictors

considered at each split of the tree is randomly redrawn. One desirable feature of the RF is its

built-in estimate of prediction accuracy, which follows from the bootstrap sampling method and is called the

out-of-bag (OOB) error estimate. Additionally, RFs only have two parameters to set when tuning the

model: the number of variables to consider at each node (mtry) and the number of trees in the

forest (ntree) (Liaw and Wiener 2002). Table 4.1 outlines the steps for the RF algorithm as well as

for the error estimation that takes place within the RF.
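
The bootstrap step described above can be sketched in a few lines of Python. This is a minimal illustration and not the code used in this study; the function names, the sample size of 1000 rows, and the seed are assumptions made for the example.

```python
import random

def bootstrap_indices(n, rng):
    """Draw n row indices with replacement (one bootstrap sample)."""
    return [rng.randrange(n) for _ in range(n)]

def oob_indices(n, in_bag):
    """Rows never drawn into the bootstrap sample are out-of-bag (OOB)."""
    drawn = set(in_bag)
    return [i for i in range(n) if i not in drawn]

rng = random.Random(0)   # fixed seed, chosen only for repeatability
n = 1000                 # hypothetical number of training rows
fractions = []
for _ in range(200):     # 200 bootstrap samples, one per hypothetical tree
    in_bag = bootstrap_indices(n, rng)
    fractions.append(len(oob_indices(n, in_bag)) / n)

mean_oob_fraction = sum(fractions) / len(fractions)
print(round(mean_oob_fraction, 3))
```

Because each row is missed by a bootstrap sample with probability (1 - 1/n)^n, which approaches 1/e, roughly 37% of the rows are out-of-bag for any given tree. Those rows are what make the OOB error estimate possible without a separate test set.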

During model development, two parameters need to be optimized to decrease the error,

mtry and ntree. The ntree parameter needs to be set sufficiently high to allow for convergence of the

OOB error rate. The mtry parameter has an influence on both the strength of the individual trees

in the forest, as well as the correlation between trees in the forest. Reducing mtry reduces both

strength and correlation. Reduction of strength results in increased error, while reduction of

correlation results in decreased error. Therefore, mtry must be optimized to minimize the error

Table 4.4: Description of Random Forest algorithm steps

Random Forest Algorithm Steps:

1. Draw ntree bootstrap samples from the training data.

2. For each bootstrap sample, grow a classification or regression tree to its full size. Instead of considering all possible predictors at each node of the tree, draw a random sample of size mtry of the predictors and choose the best split from only those predictors.

3. Use the fully grown trees to predict new data by aggregating the predictions of all trees (the majority vote becomes the prediction for classification; the average is used for regression).

Random Forest Error Estimation Steps:

1. For each tree (and subsequently each bootstrap sample), predict the data not in the bootstrap sample using the tree grown with the bootstrap sample. The data not in the bootstrap sample are called the “out-of-bag” or OOB data (Breiman 2001).

2. Aggregate the OOB predictions to calculate the error rate. This is called the OOB estimate of error rate and has been shown to be both an accurate estimate of the generalization error and similar to the more traditional model error estimate procedure of using cross-validation (Bylander 2002; Liaw and Wiener 2002).
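
The algorithm and error-estimation steps listed above can be sketched as follows. This is a simplified, classification-only illustration written in plain Python, not the implementation used in this study. The function names, the toy data set (two informative predictors and two noise predictors), the integer class labels, and the settings ntree = 20 and mtry = 2 are all assumptions made for the example. A node stops splitting when it is pure or when none of the mtry sampled predictors reduces the Gini impurity.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(X, y, rows, features):
    """Best (feature, threshold) by weighted child impurity, or None."""
    best, best_score = None, gini([y[i] for i in rows])
    for f in features:
        values = sorted({X[i][f] for i in rows})
        for t in values[:-1]:  # split as x[f] <= t versus x[f] > t
            left = [i for i in rows if X[i][f] <= t]
            right = [i for i in rows if X[i][f] > t]
            score = (len(left) * gini([y[i] for i in left])
                     + len(right) * gini([y[i] for i in right])) / len(rows)
            if score < best_score - 1e-12:
                best, best_score = (f, t, left, right), score
    return best

def grow_tree(X, y, rows, mtry, rng):
    """Grow one tree; a leaf is a class label, an inner node is a tuple."""
    labels = [y[i] for i in rows]
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1:
        return majority
    features = rng.sample(range(len(X[0])), mtry)  # random subset at each node
    split = best_split(X, y, rows, features)
    if split is None:
        return majority
    f, t, left, right = split
    return (f, t, grow_tree(X, y, left, mtry, rng),
            grow_tree(X, y, right, mtry, rng))

def predict_tree(node, x):
    """Follow the splits until a leaf label is reached."""
    while isinstance(node, tuple):
        f, t, left, right = node
        node = left if x[f] <= t else right
    return node

def random_forest(X, y, ntree, mtry, seed=0):
    """Return the fitted trees and the OOB estimate of the error rate."""
    rng = random.Random(seed)
    n = len(X)
    trees, oob_votes = [], [Counter() for _ in range(n)]
    for _ in range(ntree):
        in_bag = [rng.randrange(n) for _ in range(n)]  # bootstrap sample
        tree = grow_tree(X, y, in_bag, mtry, rng)
        trees.append(tree)
        for i in set(range(n)) - set(in_bag):          # OOB rows for this tree
            oob_votes[i][predict_tree(tree, X[i])] += 1
    scored = [i for i in range(n) if oob_votes[i]]     # rows with OOB votes
    wrong = sum(oob_votes[i].most_common(1)[0][0] != y[i] for i in scored)
    return trees, wrong / len(scored)

def predict_forest(trees, x):
    """Majority vote over all trees."""
    return Counter(predict_tree(t, x) for t in trees).most_common(1)[0][0]

# Hypothetical toy data: the class depends only on the first two predictors.
data_rng = random.Random(42)
X = [[data_rng.random() for _ in range(4)] for _ in range(100)]
y = [int(x[0] + x[1] > 1.0) for x in X]

trees, oob_error = random_forest(X, y, ntree=20, mtry=2)
oob_by_mtry = {m: random_forest(X, y, ntree=20, mtry=m)[1] for m in (1, 2, 4)}
print(oob_error, oob_by_mtry)
```

Refitting the forest over a small grid of mtry values and comparing the resulting OOB error rates, as in the last lines, is one simple way to approach the tuning described in the text. A production analysis would normally rely on an established package rather than a hand-written forest like this one.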

In addition to the above-mentioned parameters, the number of predictor variables to

include must also be set during model development. In general, the simplest adequate model is

desired, with as few predictor variables as possible to appropriately model the response.

Simplified models are desired to avoid mathematical artifacts that result from including an

excess number of predictors. In fact, it is always possible to fit the training data more closely by including

more predictor terms. When the number of predictor terms approaches the number of response

observations, a near perfect fit is possible and represents a mathematical artifact rather than a

modeling success (Mac Nally 2000). Additionally, confidence in the model is reduced when

more predictor terms are included, because prediction error increases (Breiman 1995).
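
The near-perfect fit described above can be reproduced with pure noise. The following is a minimal sketch, assuming an ordinary linear model with an intercept and as many coefficients as observations. The small solver, the sample size of 8, and the seed are illustrative choices and are not part of the analysis in this study.

```python
import random

def solve(A, b):
    """Solve a square linear system by Gaussian elimination with pivoting."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):                  # back-substitution
        tail = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x

rng = random.Random(2)
n = 8                                             # number of observations
y = [rng.gauss(0, 1) for _ in range(n)]           # response: pure noise
# Intercept plus n - 1 noise predictors: as many terms as observations.
X = [[1.0] + [rng.gauss(0, 1) for _ in range(n - 1)] for _ in range(n)]

coef = solve(X, y)
fitted = [sum(c * v for c, v in zip(coef, row)) for row in X]
max_residual = max(abs(f - t) for f, t in zip(fitted, y))
print(max_residual)
```

The residuals vanish up to floating-point rounding even though none of the predictors carries any information about the response, which is the kind of artifact the text warns against.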

While these suggestions were originally formulated based on the more traditional linear

regression models, they also apply to CART models. While CART models are less sensitive to

including irrelevant predictors than other methods [linear regression, Generalized Additive

Models (GAMs), etc.], reducing the number of predictors in CART models is still desirable for

Once the model development is complete, RF models provide additional useful products

for ecological studies. The most widely used of these products are variable-importance plots.

Random forest models estimate the importance of predictor variables by analyzing how much the

OOB error increases when OOB data for that variable are permuted, while all other variables are

left unchanged. This is a complicated task, as the importance of any single variable may be due

to its interaction with one or multiple other variables (Breiman 2001; Liaw and Wiener 2002).
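
The permutation idea described above can be sketched as follows. This is a simplified illustration, not the procedure as implemented in any RF package: a single hand-written rule stands in for a fitted forest, and one data set stands in for the per-tree OOB data. The function names, the stand-in model, and the toy data are assumptions made for the example.

```python
import random

def error_rate(predict, X, y):
    """Fraction of rows that the model misclassifies."""
    return sum(predict(x) != t for x, t in zip(X, y)) / len(y)

def permutation_importance(predict, X, y, feature, rng, repeats=20):
    """Mean increase in error after shuffling one predictor column."""
    baseline = error_rate(predict, X, y)
    increases = []
    for _ in range(repeats):
        column = [x[feature] for x in X]
        rng.shuffle(column)                       # break the link to y
        X_perm = [x[:feature] + [v] + x[feature + 1:]
                  for x, v in zip(X, column)]     # other columns unchanged
        increases.append(error_rate(predict, X_perm, y) - baseline)
    return sum(increases) / repeats

def stand_in_model(x):
    """Hypothetical fitted model: uses only the first predictor."""
    return int(x[0] > 0.5)

rng = random.Random(1)
X = [[rng.random(), rng.random()] for _ in range(500)]
y = [int(x[0] > 0.5) for x in X]                  # class depends on x[0] only

imp0 = permutation_importance(stand_in_model, X, y, 0, rng)
imp1 = permutation_importance(stand_in_model, X, y, 1, rng)
print(imp0, imp1)
```

Permuting the informative predictor raises the error substantially, while permuting the predictor that the model ignores leaves the error unchanged. Because only one column is shuffled at a time, this measure does not by itself separate a variable's own effect from its interactions with other variables.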

The RF model provides two measures of variable importance: the mean decrease in

accuracy and the mean decrease in Gini coefficient. The mean decrease in accuracy (referred to as

just mean decrease accuracy in the model) is computed at the same time as the OOB error. The

RF calculates the decrease in the accuracy of the model when the OOB values of a single variable are permuted. The

greater the decrease in accuracy of the model when a particular variable is permuted, the

greater the importance of that variable. The Gini coefficient is a measure of the homogeneity (purity) of the

nodes in the trees of the final RF. The mean decrease Gini coefficient calculated by the RF model represents how each of

the predictor variables contributes to this homogeneity. This is essentially a measure of how well

each predictor classifies the data. When the goal of the model is prediction, it is recommended to

use mean-decrease accuracy measures to determine which variables to include (Breiman 2001).
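
To make the Gini-based measure concrete, the following minimal sketch computes the impurity decrease for a single split. The function names and the six-row example are assumptions made for illustration. In an RF, decreases of this kind are accumulated over every split that uses a given predictor and then averaged over the trees.

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def gini_decrease(parent, left, right):
    """Parent impurity minus the size-weighted impurity of the children."""
    n = len(parent)
    return (gini_impurity(parent)
            - len(left) / n * gini_impurity(left)
            - len(right) / n * gini_impurity(right))

parent = [0, 0, 0, 1, 1, 1]                          # evenly mixed node
clean = gini_decrease(parent, [0, 0, 0], [1, 1, 1])  # perfect separation
poor = gini_decrease(parent, [0, 0, 1], [0, 1, 1])   # children still mixed
print(clean, poor)
```

The clean split removes all of the parent's impurity (a decrease of 0.5), whereas the poor split removes only about 0.056, so a predictor that repeatedly produces clean splits accumulates a larger mean decrease in Gini.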