3.2.2 Supervised learning methods
3.2.2.1 Random forests
We base the discussion in this subsection on the work of Breiman [213] and Breiman and Cutler [214].
The random forests algorithm is an ensemble classification method. Ensemble classification methods, also referred to as meta-classifiers, classify objects by aggregating the results of a collection (ensemble) of independent predictors. The aim of meta-classifiers is to obtain a more accurate classification than the component classifiers alone.
Random forests work by growing an ensemble of decision trees (a supervised classification technique that we briefly describe below). Each tree provides a classification of a new sample. Informally, each tree is said to vote for a class. The sample is assigned to the class that obtains the most votes.
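The voting step can be illustrated with a minimal Python sketch. The function name `majority_vote` and the example votes are hypothetical, and the tie-breaking behaviour (the label seen first wins) is an assumption of this sketch rather than part of the method described here.

```python
from collections import Counter

def majority_vote(tree_votes):
    """Return the class label that received the most votes."""
    counts = Counter(tree_votes)
    # most_common(1) returns a one-element list [(label, count)];
    # on a tie, the label encountered first is returned (an assumption).
    label, _ = counts.most_common(1)[0]
    return label

# Hypothetical votes cast by five trees for one new sample
votes = ["Yes", "No", "Yes", "Yes", "No"]
winner = majority_vote(votes)
```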
3.2.2.1.1 Decision trees Decision trees are a supervised classification method that
works by organising a set of attributes (features) in a rooted tree structure. Each non-leaf node represents one or several attributes being tested. Each branch represents an outcome of the test, while the leaf nodes represent the class labels. An illustration is shown in Figure 3.5. The decision tree determines whether a day is suitable for playing tennis, based on three weather characteristics: outlook, humidity and wind.
In the case of decision trees, a new object is classified by evaluating its attributes using the rules encoded by the tree. The evaluation starts with the root node. At each non-leaf node, one or several attributes are tested. Depending on the value of the attributes being tested at each node, the evaluation continues on one of the branches, until it reaches a leaf node. The leaf node in which the classification stops represents the
3.2 Machine learning 54
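[Decision tree diagram: root node Outlook with branches Sunny, Overcast and Rain. Sunny leads to Humidity (Normal: Yes, High: No); Overcast leads to Yes; Rain leads to Wind (Weak: Yes, Strong: No).]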
Figure 3.5 Example of a decision tree. The elliptic nodes represent the attribute being tested. The square nodes represent the class label. Adapted from Mitchell et al. [215].
class to which the object is assigned. For example, a day characterised by rainy outlook, high humidity and weak wind is classified as suitable for tennis (Figure 3.5). This decision is made by first evaluating the attribute in the root node, i.e. the outlook. In this particular case the outlook is rain, therefore the classification continues on the right branch. Next, the wind attribute is evaluated. It has the value weak, thus the evaluation continues on the left branch. That branch leads to a leaf node labelled Yes, indicating that the day is suitable for tennis.
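The evaluation procedure can be sketched in a few lines of Python. The nested-tuple encoding, the name `TENNIS_TREE` and the lower-case attribute values are choices made for this illustration; the tree itself follows Figure 3.5.

```python
# An internal node is (attribute, {value: subtree}); a leaf is a class label.
TENNIS_TREE = ("outlook", {
    "sunny": ("humidity", {"normal": "Yes", "high": "No"}),
    "overcast": "Yes",
    "rain": ("wind", {"weak": "Yes", "strong": "No"}),
})

def classify(tree, sample):
    """Walk from the root to a leaf, following the branch that
    matches the sample's value for the attribute tested at each node."""
    node = tree
    while not isinstance(node, str):
        attribute, branches = node
        node = branches[sample[attribute]]
    return node

# The example day discussed in the text
day = {"outlook": "rain", "humidity": "high", "wind": "weak"}
label = classify(TENNIS_TREE, day)
```

Note that humidity is never inspected for this day, because the rain branch tests only the wind attribute.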
3.2.2.1.2 Random forests algorithm Given a dataset with N samples, each with M attributes, random forests build each of their decision trees as follows:
1. N samples are selected at random, with replacement, to be used as the training set for growing the tree;
2. at each node of the tree, m ≪ M attributes are selected at random from the list of M attributes;
3. the node is split using the best split among these m attributes, and the tree is grown in this way on the N selected samples.
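The two random draws in these steps can be sketched as follows. This is a minimal illustration that omits the tree construction itself; the function names and the values of N, M and m are hypothetical. In Breiman's formulation the attribute draw is repeated at every split, so `candidate_attributes` would be called once per node.

```python
import random

def bootstrap_indices(n_samples, rng):
    """Draw n_samples indices uniformly at random, with replacement."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]

def candidate_attributes(n_attributes, m, rng):
    """Draw m distinct attribute indices out of n_attributes."""
    return rng.sample(range(n_attributes), m)

rng = random.Random(0)          # fixed seed so the sketch is repeatable
N, M, m = 10, 9, 3              # illustrative sizes only
train_idx = bootstrap_indices(N, rng)   # may contain repeated indices
attrs = candidate_attributes(M, m, rng) # no repeats within one draw
```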
One of the key features of random forests is that the training samples used in the construction of each tree are selected with replacement. When sampling N times with replacement,
about a third of the samples are not selected at all [216]. The collection of samples not selected is referred to as oob (out-of-bag) data and is used by random forests to calculate an unbiased estimate of the classification error. This is important because, by using the oob error rate, there is no need to perform cross-validation, as is the case with most supervised classification algorithms. The oob error is also used to calculate variable importance, i.e. a measure of how important each variable is for the overall classification.
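The "about a third" figure can be checked directly: the probability that a given sample is never drawn in N draws with replacement is (1 - 1/N)^N, which approaches 1/e ≈ 0.368 as N grows. The sketch below, with hypothetical function names, compares this value with a single simulated bootstrap sample.

```python
import math
import random

def expected_oob_fraction(n):
    """Probability that a given sample is never drawn in n draws
    with replacement from n samples."""
    return (1.0 - 1.0 / n) ** n

def simulated_oob_fraction(n, rng):
    """Fraction of samples left out of one bootstrap sample of size n."""
    drawn = {rng.randrange(n) for _ in range(n)}
    return 1.0 - len(drawn) / n

n = 1000
exact = expected_oob_fraction(n)
simulated = simulated_oob_fraction(n, random.Random(0))
limit = math.exp(-1)  # limiting value as n grows
```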
The error rate of random forests depends on two aspects: the correlation between the trees and the strength of the trees. The correlation estimate measures how similar, on average, the classifications yielded by each pair of trees are, across all samples in the dataset. Intuitively, this tells whether the trees output redundant classifications. Strength, on the other hand, tells how accurately each tree classifies. Decreasing the correlation and increasing the strength both lead to a decrease in the error rate.
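The effect of the two quantities can be made concrete with the upper bound on the generalisation error derived in Breiman's analysis of random forests. In the sketch below, \bar{\rho} denotes the mean correlation between trees, s the strength of the ensemble and PE^{*} the generalisation error; the symbol names are chosen for this illustration.

```latex
PE^{*} \;\le\; \frac{\bar{\rho}\,\left(1 - s^{2}\right)}{s^{2}}
```

The bound shrinks when \bar{\rho} decreases or when s increases, which matches the behaviour described above.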
Both correlation and strength increase when the m parameter (also referred to as the mtry parameter, i.e. the number of attributes sampled at random as candidates for each split) increases. Therefore, when using random forests it is important to choose a value for mtry that gives a good trade-off between strength and correlation. The default value for this parameter is √M for classification, and M/3 for regression. Another important parameter is the number of decision trees to grow (the ntree parameter). If too few trees are grown, the model might underfit the data. The default value for ntree is 500.
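As a small worked example, the default mtry values can be computed as below. The function name `default_mtry` is hypothetical, and rounding down to an integer with a minimum of 1 is an assumption of this sketch, since the text gives only √M and M/3.

```python
import math

def default_mtry(n_attributes, task):
    """Default number of candidate attributes for a given task.

    Rounding down and enforcing a minimum of 1 are assumptions."""
    if task == "classification":
        return max(math.isqrt(n_attributes), 1)   # floor(sqrt(M))
    if task == "regression":
        return max(n_attributes // 3, 1)          # floor(M / 3)
    raise ValueError("task must be 'classification' or 'regression'")

mtry_classification = default_mtry(100, "classification")
mtry_regression = default_mtry(100, "regression")
```

For M = 100 attributes this gives 10 candidates for classification and 33 for regression.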
Random forests can be adapted to handle imbalanced datasets, i.e. datasets for which there is a significant difference in the size of the classes. This is an issue for classification algorithms because they usually try to optimise the overall error rate. Most of the time this keeps the error for the larger classes low, while allowing the error for the small classes, which contributes little to the overall error, to be high.
For random forests there are two commonly used techniques for addressing this problem. One is to assign class weights inversely proportional to the class size, which are then used to weigh the contribution of the samples to the overall error. The other approach is to use stratified sampling, i.e. an equal number of samples is drawn from each class, regardless of the class size. This can be achieved by either over-sampling the smaller classes or down-sampling the larger classes.
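Both techniques can be sketched in plain Python. The function names, the normalisation of the weights (chosen so that the average sample weight is 1) and the example labels are assumptions made for this illustration; the down-sampling variant of stratified sampling is shown.

```python
import random
from collections import Counter, defaultdict

def inverse_class_weights(labels):
    """Weight each class inversely proportionally to its size."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    # n / (k * count) keeps the average sample weight at 1
    return {c: n / (k * count) for c, count in counts.items()}

def stratified_downsample(labels, rng):
    """Draw the same number of sample indices from every class,
    limited by the size of the smallest class."""
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    size = min(len(indices) for indices in by_class.values())
    chosen = []
    for indices in by_class.values():
        chosen.extend(rng.sample(indices, size))  # without replacement
    return sorted(chosen)

# Hypothetical imbalanced dataset: 90 samples of class "a", 10 of class "b"
labels = ["a"] * 90 + ["b"] * 10
weights = inverse_class_weights(labels)
balanced = stratified_downsample(labels, random.Random(0))
```

With these labels, each sample of the small class weighs nine times as much as a sample of the large class, and the down-sampled set contains ten indices from each class.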