Random Forest classifiers - An Automated Pipeline for Variability Detection and Classification

Random Forests are an example of an ensemble classification method in which a collection of weak classifiers is combined to produce a single strong classifier (Breiman, 2001). For the Random Forest ensemble, the individual classifiers are Classification And Regression Trees (CARTs) (Breiman et al., 1984). A CART is constructed by performing binary splits on a variable or set of variables. The resulting branches then undergo additional binary splits until the tree is deep enough that each branch can be ended as a leaf node, where each leaf node contains a subset of the initial dataset of arbitrary size. Figure 4.3 demonstrates a decision tree extracted from a Random Forest model trained on Skycam light curves.

The number of these trees is set by an argument ntrees, which determines the number of decision trees trained by the Random Forest ensemble. These trees are trained through the use of bootstrap aggregating where, similar to the bootstrapping used in the evaluation of GRAPE, the training set is randomly sampled with replacement to train each individual tree. This process is commonly known as Bagging (Breiman, 2001). For each split in each tree, a random sample of variables of size mtry is selected to decide the split.
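The sampling steps described above can be illustrated with a minimal sketch. It only draws the bootstrap sample for each tree and the random subset of candidate variables for one split; it does not grow any trees. The helper names `bootstrap_indices` and `candidate_features`, the toy sizes and the argument values are assumptions made for illustration and are not taken from the pipeline.

```python
import random

def bootstrap_indices(n_samples, rng):
    """Draw n_samples indices with replacement (one bootstrap sample)."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]

def candidate_features(n_features, mtry, rng):
    """Pick mtry distinct variable indices to consider at a single split."""
    return sorted(rng.sample(range(n_features), mtry))

rng = random.Random(0)          # fixed seed, for reproducibility only
n_samples, n_features = 10, 6   # toy sizes, not taken from the text
ntrees, mtry = 3, 2             # illustrative values of the two arguments

for tree in range(ntrees):
    in_bag = bootstrap_indices(n_samples, rng)           # training rows for this tree
    split_vars = candidate_features(n_features, mtry, rng)  # variables tried at one split
    print(tree, in_bag, split_vars)
```

Because the sampling is with replacement, some objects appear several times in a tree's sample and others not at all; the latter are the out-of-bag objects for that tree.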

A third argument named nodesize controls how deep or shallow the trees are grown by setting the maximum amount of training data allowed in a terminal leaf node. Any node with more data than this size must execute a split. Deep trees can describe many variable interactions but readily overfit the training data, whereas shallow trees limit the complexity of the decision tree model, which can result in a high-bias model, i.e. a model that is too simple for the desired problem. The Random Forest can be formally defined as a set of classifiers $h(x|\Theta_1), \ldots, h(x|\Theta_K)$ produced by training on a set of training data $D = [(x_i, y_i)]_{i=1}^{n}$, where $h$ is the non-linear function of the decision tree parameters, $\Theta_j$ are the parameters of the $j$th decision tree, $x_i$ is the vector of variables of the $i$th observation and $y_i$ is the class of the $i$th observation.
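The effect of nodesize on tree depth can be sketched with a toy tree grown on a single variable. This is an illustrative stand-in and not the CART implementation used by the pipeline: the functions `gini`, `grow` and `depth` are hypothetical, the split is chosen by minimising the weighted class impurity of the two children, and a node is also allowed to stop when it is pure or cannot be split (an assumption added so that the recursion terminates).

```python
from collections import Counter

def gini(labels):
    """Class impurity of a node: sum over classes of p * (1 - p)."""
    n = len(labels)
    return sum((c / n) * (1 - c / n) for c in Counter(labels).values())

def grow(points, nodesize):
    """Grow a tree on (x, label) pairs; nodes larger than nodesize try to split."""
    labels = [y for _, y in points]
    xs = sorted({x for x, _ in points})
    # Stop when the node is small enough, pure, or has no distinct x values left.
    if len(points) <= nodesize or len(set(labels)) == 1 or len(xs) == 1:
        return {"leaf": Counter(labels).most_common(1)[0][0], "size": len(points)}
    best = None
    for lo, hi in zip(xs, xs[1:]):
        t = (lo + hi) / 2                       # candidate threshold between neighbours
        left = [p for p in points if p[0] <= t]
        right = [p for p in points if p[0] > t]
        score = (len(left) * gini([y for _, y in left])
                 + len(right) * gini([y for _, y in right])) / len(points)
        if best is None or score < best[0]:
            best = (score, t, left, right)
    _, t, left, right = best
    return {"threshold": t, "left": grow(left, nodesize), "right": grow(right, nodesize)}

def depth(node):
    """Number of splits on the longest path from this node to a leaf."""
    if "leaf" in node:
        return 0
    return 1 + max(depth(node["left"]), depth(node["right"]))

# Toy data with alternating classes, chosen so that small leaves force many splits.
points = [(x, "a" if x % 2 == 0 else "b") for x in range(8)]
for nodesize in (1, 4, 8):
    print(nodesize, depth(grow(points, nodesize)))
```

On this toy data a small nodesize yields a deeper tree than a large one, and a nodesize equal to the full dataset size yields a single leaf.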

Random Forests are not trained in the traditional sense. They do not make use of an iterative update process which improves the performance of the individual decision trees. Rather, they continue to grow new trees using the random splitting of the variables in the training data until a cohort of trees with good performance on the desired problem is produced. This cohort of trees can be identified through the use of an ‘out-of-bag’ error. This error is a measure of the performance of an individual tree on the training data which was not selected as part of the bagging process, and has a lower value for the better performing trees. As the performance of the model is not affected by the low-performance trees, the overall performance of the ensemble improves as trees with good performance are generated. The simplicity of the Random Forest trees compared to traditional decision trees makes them more generalisable, addressing one of the main limitations of CARTs, although they must be used as part of an ensemble method.

Another interesting component of Random Forests is that they automatically incorporate performance and feature importance measures. The out-of-bag error measures the error rate of a trained Random Forest model by computing the performance of each decision tree on the data which was not selected during the bootstrap sampling operations. The Gini criterion is a measure of the diversity, i.e. the proportion of classes of differing types, in each leaf node. The more important features will result in a larger change in this Gini criterion if they are removed from the training. This change is named the Mean Decrease Gini of the feature given a trained Random Forest model. The Gini criterion is defined by equation 4.6 given equation 4.5.
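The out-of-bag evaluation described above can be sketched as follows. To keep the example short, each "tree" is replaced by a toy nearest-class-mean classifier on one variable (`fit_nearest_mean` and `predict` are hypothetical stand-ins, not decision trees), and the ensemble error is formed by a majority vote over the models that did not see each object, which is a common convention assumed here and not confirmed by the text.

```python
import random
from collections import Counter

def fit_nearest_mean(xs, ys):
    """Toy stand-in for a decision tree: store the mean x of each class."""
    groups = {}
    for x, y in zip(xs, ys):
        groups.setdefault(y, []).append(x)
    return {label: sum(vals) / len(vals) for label, vals in groups.items()}

def predict(model, x):
    """Predict the class whose stored mean is closest to x."""
    return min(model, key=lambda label: abs(x - model[label]))

def oob_error(xs, ys, ntrees, rng):
    """Error rate from majority votes cast only by models that did not see each object."""
    n = len(xs)
    votes = [Counter() for _ in range(n)]
    for _ in range(ntrees):
        in_bag = [rng.randrange(n) for _ in range(n)]      # bootstrap sample
        model = fit_nearest_mean([xs[i] for i in in_bag],
                                 [ys[i] for i in in_bag])
        for i in set(range(n)) - set(in_bag):              # out-of-bag objects
            votes[i][predict(model, xs[i])] += 1
    scored = [i for i in range(n) if votes[i]]             # objects with at least one OOB vote
    if not scored:
        return None
    wrong = sum(votes[i].most_common(1)[0][0] != ys[i] for i in scored)
    return wrong / len(scored)

xs = [0.1, 0.3, 0.2, 0.4, 2.1, 2.3, 2.2, 2.4]   # toy 1-D variable
ys = ["a"] * 4 + ["b"] * 4                        # toy class labels
print(oob_error(xs, ys, ntrees=25, rng=random.Random(0)))
```

No separate validation set is needed for this estimate, since every object is scored only by models whose bootstrap sample excluded it.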

$$g(S_j) = \sum_{i=1}^{N} \hat{P}(C_i|S_j)\left(1 - \hat{P}(C_i|S_j)\right) \qquad (4.5)$$

where $S_j$ is the set of data in the $j$th leaf node, $C_i$ is the $i$th class in the data and $g(S_j)$ is the variation within that leaf node, which is zero when the node contains only one class $C_i$. $\hat{P}(C_i|S_j)$ is the proportion of data in leaf node $S_j$ which is of class $C_i$. $N$ is the number of classes in the training dataset. The Gini criterion is then determined by the weighted sum of the variations shown in equation 4.6.

$$G = \sum_{j=1}^{M} \hat{P}(S_j)\, g(S_j) \qquad (4.6)$$

where $G$ is the Gini criterion, $M$ is the number of leaf nodes and $\hat{P}(S_j)$ is the proportion of training data in the $j$th leaf node relative to the total number of training data objects. This criterion, through the Mean Decrease Gini feature importance measurement, can be used in the process of feature selection.

Random Forests determine their final prediction based on ‘votes’ from each of the individual decision tree classifiers, with the probability of an object being of class $C_i$ given by the proportion of trees which classify the object as a member of class $C_i$. Random Forests also allow the similarity between the feature vectors of two objects to be determined as the proportion of the decision trees in which the two objects are placed in the same terminal leaf node. This measure is analogous to a Euclidean distance between the objects in the feature space, weighted by the importance of the features in the Random Forest model.
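The quantities defined in this section can be written out directly. The sketch below computes the leaf variation of equation 4.5, the Gini criterion of equation 4.6, the class probabilities from tree votes and the leaf-sharing similarity between two objects. The function names and the toy inputs are illustrative assumptions; in particular, the leaf labels, tree votes and leaf identifiers are made up and do not come from a trained model.

```python
from collections import Counter

def node_variation(labels):
    """g(S_j) of equation 4.5: sum over classes of p * (1 - p) in one leaf."""
    n = len(labels)
    return sum((c / n) * (1 - c / n) for c in Counter(labels).values())

def gini_criterion(leaves):
    """G of equation 4.6: leaf variations weighted by each leaf's share of the data."""
    total = sum(len(leaf) for leaf in leaves)
    return sum(len(leaf) / total * node_variation(leaf) for leaf in leaves)

def class_probabilities(tree_votes):
    """Proportion of trees voting for each class."""
    n = len(tree_votes)
    return {label: c / n for label, c in Counter(tree_votes).items()}

def proximity(leaf_ids_a, leaf_ids_b):
    """Proportion of trees in which two objects fall in the same terminal leaf."""
    same = sum(a == b for a, b in zip(leaf_ids_a, leaf_ids_b))
    return same / len(leaf_ids_a)

# Toy tree with two leaves: one pure, one evenly mixed.
leaves = [["a", "a", "a", "a"], ["a", "b"]]
print(gini_criterion(leaves))

# Votes of four hypothetical trees for a single object.
print(class_probabilities(["a", "b", "a", "a"]))

# Terminal leaf reached by two objects in each of four hypothetical trees.
print(proximity([3, 1, 4, 2], [3, 1, 0, 2]))
```

In the toy tree the pure leaf contributes nothing to the criterion, so the result is the mixed leaf's variation of 0.5 weighted by its one-third share of the data.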

In document An Automated Pipeline for Variability Detection and Classification for the Small Telescopes Installed at the Liverpool Telescope (Page 158-160)