3 Literature review
Algorithm 2: ANN Prediction Generation Compute:
3.2.4.3 Random Forest
Random forest is an ML modelling technique that improves the predictive performance of tree-based models. Tree-based models are supervised ML models that work by sequentially segmenting the input space of a problem into a number of regions. A classification prediction is then made by determining which region an input belongs to and returning the mode output associated with that region. In the binary output case, segmentation is done by taking an input variable and finding a threshold value for that variable that splits the output data into sets that exhibit the highest possible degree of purity. The threshold is referred to as a decision boundary, and these models are therefore also referred to as decision trees. The splitting process continues sequentially, and for each split a different input variable can be selected to create a decision boundary. The process continues until the regions produced contain a predefined minimum number of observations or reach a predefined level of purity, where purity refers to the degree to which a region contains only a specific class of the output variable [37], [38]. Decision trees get their name from the fact that a visual representation of the sequential decision boundaries and terminal regions, referred to as leaf nodes, resembles a tree-like structure, as seen in Figure 14 [38].
A popular measure of leaf node impurity in the binary classification case is the Gini Impurity score [37], [38]. This score is calculated for the splits made with a given input variable as follows:
\[ G_i = \sum_{m=1}^{M} p_m \sum_{k=1}^{2} p_{m,k} (1 - p_{m,k}) \tag{9} \]
where G_i is the Gini Impurity score for variable i, M is the number of classes (i.e. splits) of variable i, p_m is the proportion of observations that belong to class m of variable i, and p_{m,k} is the proportion of the observations in class m of variable i that belong to class k of the binary output variable. The variable with the lowest Gini Impurity score is selected to make the next split in the decision tree [37], [38].
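The calculation can be illustrated with a minimal Python sketch. It assumes that equation (9) takes the usual weighted form, in which each class m contributes p_m times the sum over k of p_{m,k}(1 − p_{m,k}), with p_{m,k} measured within class m. The function names gini_impurity and best_threshold are hypothetical and are introduced here only for illustration.

```python
from collections import Counter


def gini_impurity(split_classes, outputs):
    """Weighted Gini Impurity of one candidate split (assumed form of eq. 9).

    split_classes[j] is the class m (side of the split) of observation j;
    outputs[j] is the binary output class k of observation j.
    """
    n = len(outputs)
    score = 0.0
    for m, size in Counter(split_classes).items():
        p_m = size / n  # proportion of observations in class m
        # Output classes of the observations that fall in class m.
        in_m = [k for c, k in zip(split_classes, outputs) if c == m]
        for count in Counter(in_m).values():
            p_mk = count / size  # proportion of class m with output k
            score += p_m * p_mk * (1.0 - p_mk)
    return score


def best_threshold(values, outputs):
    """Return (threshold, score) with the lowest score for one variable.

    Returns None when the variable takes a single value and cannot be split.
    """
    best = None
    # Skip the largest value: splitting there would leave one side empty.
    for t in sorted(set(values))[:-1]:
        sides = [v <= t for v in values]
        score = gini_impurity(sides, outputs)
        if best is None or score < best[1]:
            best = (t, score)
    return best
```

A split that separates the two output classes perfectly scores zero, while a split that leaves both sides evenly mixed scores 0.5, the maximum in the binary case.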
Figure 14: Visual representation of a decision tree, adapted from [38]
Decision trees have several advantages. They are simple, which makes them easy to implement and easy to interpret, and once built they allow for an easily interpreted visual representation, as seen in Figure 14. Decision trees are also capable of handling quantitative and qualitative input variables with the same level of ease. However, an important disadvantage of decision trees is that it is often necessary to use greedy splitting rules such as the Gini Impurity score, especially if there is a large number of input variables. Because a greedy rule selects the best split at each step without considering later splits, this could lead to a poorly performing model [38].
Decision trees in general tend to exhibit high variance, which means that they are highly sensitive to changes in the training data. For instance, if one were to divide a training data set in half and build a decision tree for each half, these decision trees could look vastly different. A model with high variance is prone to overtraining, a phenomenon that leads to poor model generalisation. Random forest involves generating a large number of decision trees, each on a randomly resampled subset of the training data. Then, by selecting the mode prediction over all the generated trees as the final prediction, the high variance of a single decision tree is mitigated. This works because, for trees whose predictions are uncorrelated, the variance of the aggregated prediction is inversely proportional to the number of generated decision trees. This process of building a large set of models on resampled versions of the training data and aggregating their predictions to reduce prediction variance is known as bagging. However, a further step needs to be taken for decision trees: if a small set of variables is able to create very pure leaf nodes, those variables will tend to make up the first set of splits in all the generated trees. This results in highly correlated trees, which limits the extent to which bagging can reduce the high variance of a single decision tree. Therefore, with a random forest, each decision tree is not only presented with a randomly sampled subset of the observations, but also with a random subset of input variables. This forces the decision trees to be less correlated, resulting in improved generalisation. A disadvantage of using random forest instead of a decision tree is that the simplicity and interpretability of a decision tree are sacrificed for better predictive performance [38], [42].
The random forest procedure works as follows: Let Z be the N×M matrix consisting of N observations of M input variables. Let Y be the N×1 vector of a corresponding binary output variable. We then generate B sets Ẑ(i), with i ranging from one to B, by randomly sampling K observations, with replacement, of L randomly selected variables from Z. The dimensions of each Ẑ(i) are therefore K×L, where K and L are set to values K ≤ N and L < M.
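The sampling step described above could be sketched as follows. This is a minimal illustration rather than a definitive implementation: the function name sample_subset is hypothetical, the data are held in plain Python lists, and it is assumed that the L variables are distinct (drawn without replacement) while the K observations are drawn with replacement.

```python
import random


def sample_subset(Z, Y, K, L, rng):
    """Draw one K x L sample from the N x M matrix Z.

    Returns the sampled matrix, the matching outputs, the sampled row
    indices and the selected column indices.
    """
    N, M = len(Z), len(Z[0])
    rows = [rng.randrange(N) for _ in range(K)]  # observations, with replacement
    cols = rng.sample(range(M), L)               # variables, assumed distinct
    Z_hat = [[Z[r][c] for c in cols] for r in rows]
    Y_hat = [Y[r] for r in rows]
    return Z_hat, Y_hat, rows, cols


# Example: five observations of four variables, sampled down to 3 x 2.
Z = [[10 * r + c for c in range(4)] for r in range(5)]
Y = [0, 1, 0, 1, 0]
Z_hat, Y_hat, rows, cols = sample_subset(Z, Y, K=3, L=2, rng=random.Random(0))
```

Returning the row indices alongside the sample makes it possible to identify afterwards which observations were left out of each tree.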
For each Ẑ(i) we build a decision tree, using the Gini Impurity score with respect to Y(i), the outputs in Y that correspond to the observations sampled for Ẑ(i), to determine the splitting sequence of the L input variables. The splitting procedure continues until a minimum leaf node size, n, is reached for all leaf nodes. Classification predictions are then made by choosing the mode prediction made by the B decision trees. The values of B, K, L and n are regarded as model tuning parameters. Therefore, random forests are generated for a number of candidate configurations of these parameters, and the best performing configuration is selected [42].
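The full procedure can be sketched in plain Python as shown below. This is a simplified illustration under stated assumptions, not the implementation used in [42]: all function names are hypothetical, the input variables are assumed to be numeric so that splits take the form "value ≤ threshold", and a node is assumed to stop splitting when it is pure or when no split leaves at least min_leaf observations on both sides.

```python
import random
from collections import Counter


def gini(labels):
    """Gini Impurity of one region: sum over classes of p * (1 - p)."""
    n = len(labels)
    return sum((c / n) * (1 - c / n) for c in Counter(labels).values())


def build_tree(rows, labels, min_leaf):
    """Return a leaf (a class label) or a node (col, threshold, left, right)."""
    if len(set(labels)) == 1:
        return labels[0]  # pure region
    best = None
    for col in range(len(rows[0])):
        for t in sorted({r[col] for r in rows})[:-1]:
            left = [i for i, r in enumerate(rows) if r[col] <= t]
            right = [i for i, r in enumerate(rows) if r[col] > t]
            if len(left) < min_leaf or len(right) < min_leaf:
                continue
            # Size-weighted impurity of the two regions the split creates.
            score = (len(left) * gini([labels[i] for i in left])
                     + len(right) * gini([labels[i] for i in right])) / len(rows)
            if best is None or score < best[0]:
                best = (score, col, t, left, right)
    if best is None:
        return Counter(labels).most_common(1)[0][0]  # leaf: mode output
    _, col, t, left, right = best
    return (col, t,
            build_tree([rows[i] for i in left], [labels[i] for i in left], min_leaf),
            build_tree([rows[i] for i in right], [labels[i] for i in right], min_leaf))


def predict_tree(tree, row):
    """Follow the decision boundaries until a leaf is reached."""
    while isinstance(tree, tuple):
        col, t, left, right = tree
        tree = left if row[col] <= t else right
    return tree


def fit_forest(Z, Y, B, K, L, min_leaf, seed=0):
    """Build B trees, each on K resampled observations of L random variables."""
    rng = random.Random(seed)
    forest = []
    for _ in range(B):
        idx = [rng.randrange(len(Z)) for _ in range(K)]  # with replacement
        cols = rng.sample(range(len(Z[0])), L)           # distinct variables
        rows = [[Z[i][c] for c in cols] for i in idx]
        forest.append((cols, build_tree(rows, [Y[i] for i in idx], min_leaf)))
    return forest


def predict_forest(forest, row):
    """Mode prediction over all trees; each tree sees only its own variables."""
    votes = [predict_tree(tree, [row[c] for c in cols]) for cols, tree in forest]
    return Counter(votes).most_common(1)[0][0]
```

In this sketch the parameters B, K, L and min_leaf (n in the text) are passed in explicitly, so that candidate configurations can be compared by calling fit_forest once per configuration.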
One of the attractive aspects of the random forest model is that the risk of overfitting the training data does not increase with an increase in the parameter B. An increase in B only results in a reduction in the variance of the mode predictions made by the generated decision trees. This parameter is therefore easy to configure, since one can start with a small value for B and increase it until the prediction accuracy of the model stabilises, without having to worry about reducing the generalisability of the model. Another attractive aspect of random forest is the concept of Out-Of-Bag (OOB) error. For each decision tree generated by random forest, the OOB observations are those that were not contained in the decision tree’s K sampled observations. For each of the N observations, we can determine for which of the B decision trees in the model it was OOB. These decision trees can then be tasked with classifying this observation as a test. After doing so for each of the N observations, the overall misclassification rate is the OOB error rate. This measurement can be used to compare candidate configurations of the aforementioned model tuning parameters. The advantage lies in the fact that, with OOB error, we do not require a hold-out data set to test the model parameter configurations, which allows random forest to consume data very economically [38], [43].
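The OOB error calculation can be sketched as follows. The sketch assumes that each tree is available as a callable that maps one observation to a class, together with the set of row indices it was trained on. The function name oob_error and the three hand-written stand-in "trees" in the example are hypothetical. Observations that were sampled by every tree receive no OOB votes and are skipped, which is also an assumption of this sketch.

```python
from collections import Counter


def oob_error(predictors, in_bag, Z, Y):
    """Out-of-bag misclassification rate.

    predictors[b] maps one observation to a predicted class;
    in_bag[b] is the set of row indices used to build tree b.
    """
    errors, scored = 0, 0
    for j, (row, target) in enumerate(zip(Z, Y)):
        # Only trees that never saw observation j may vote on it.
        votes = [predict(row)
                 for predict, bag in zip(predictors, in_bag)
                 if j not in bag]
        if not votes:
            continue  # observation j was in-bag for every tree
        scored += 1
        if Counter(votes).most_common(1)[0][0] != target:
            errors += 1
    return errors / scored if scored else float("nan")


# Example with three stand-in trees on four observations of one variable.
Z = [[0], [1], [2], [3]]
Y = [0, 0, 1, 1]
predictors = [
    lambda row: 0,                 # always predicts class 0
    lambda row: int(row[0] >= 2),  # thresholds the variable at 2
    lambda row: int(row[0] >= 2),
]
in_bag = [{0, 1}, {1, 2, 3}, {0, 2, 3}]
rate = oob_error(predictors, in_bag, Z, Y)
```

Because every observation is scored only by trees that did not train on it, the resulting rate plays the role of a test error without setting aside a separate hold-out set.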