4 Information-based Learning
4.4 Extensions and Variations
The ID3 decision tree induction algorithm described in the previous section provides the basic approach to decision tree induction: a top-down, recursive, depth-first partitioning of the dataset beginning at the root node and finishing at the leaf nodes. Although this algorithm works quite well as presented, it assumes categorical features and clean data. It is relatively easy, however, to extend the ID3 algorithm to handle continuous descriptive features and continuous target features. A range of techniques can also be used to make a decision tree more robust to noise in the data. In this section we describe the techniques used to address these issues as well as the use of ensemble methods that allow us to combine the predictions made by multiple models. We begin, however, by introducing some of the metrics, other than entropy-based information gain, that can be used to select which feature to split on next as we build the tree.
4.4.1 Alternative Feature Selection and Impurity Metrics
The information gain measure described in Section 4.2.3[128] uses entropy to judge the impurity of the partitions that result from splitting a dataset using a particular feature.
Entropy-based information gain, however, does have some drawbacks. In particular, it favors features with many levels because these features split the data into many small subsets, which tend to be pure irrespective of any correlation between the descriptive feature and the target feature. One way of addressing this issue is to use information gain ratio instead of information gain. The information gain ratio is computed by dividing the information gain of a feature by the amount of information used to determine the value of the feature:

$$GR(d, \mathcal{D}) = \frac{IG(d, \mathcal{D})}{-\sum_{l \in levels(d)} \left( P(d = l) \times \log_2 P(d = l) \right)}$$
where $IG(d, \mathcal{D})$ is the information gain of the feature d for the dataset $\mathcal{D}$ (computed using Equation (4.4)[131] from Section 4.2.3[128]), and the divisor is the entropy of the dataset $\mathcal{D}$ with respect to the feature d (note that levels(d) is the set of levels that the feature d can take). This divisor biases information gain ratio away from features that take on a large number of values and as such counteracts the bias in information gain toward these features.
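The effect of the divisor can be seen in a minimal sketch. The function names and the toy rows below are hypothetical (they are not the vegetation dataset); the ID feature is unique for every instance, so it splits the data into pure singleton partitions even though it carries no generalizable information.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a sequence of categorical values."""
    total = len(values)
    return -sum((n / total) * log2(n / total) for n in Counter(values).values())


def information_gain(rows, feature, target):
    """Entropy of the target minus the weighted entropy after splitting on feature."""
    partitions = {}
    for row in rows:
        partitions.setdefault(row[feature], []).append(row[target])
    remainder = sum(len(p) / len(rows) * entropy(p) for p in partitions.values())
    return entropy([row[target] for row in rows]) - remainder


def gain_ratio(rows, feature, target):
    """Information gain divided by the entropy of the feature's own values."""
    split_info = entropy([row[feature] for row in rows])
    if split_info == 0:  # the feature takes a single value, so it cannot split the data
        return 0.0
    return information_gain(rows, feature, target) / split_info


# Hypothetical toy data: ID is unique per row, WINDY is a genuine two-level feature.
rows = [
    {"ID": 1, "WINDY": "yes", "PLAY": "no"},
    {"ID": 2, "WINDY": "yes", "PLAY": "no"},
    {"ID": 3, "WINDY": "no", "PLAY": "yes"},
    {"ID": 4, "WINDY": "no", "PLAY": "yes"},
]

for feature in ("ID", "WINDY"):
    print(feature,
          information_gain(rows, feature, "PLAY"),
          gain_ratio(rows, feature, "PLAY"))
```

On this toy data both features tie on information gain (1 bit each), but ID has a feature entropy of 2 bits against 1 bit for WINDY, so its gain ratio is halved and WINDY would be preferred.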
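To illustrate how information gain ratio is computed, we will compute the information gain ratio for the descriptive features STREAM, SLOPE, and ELEVATION in the vegetation classification dataset in Table 4.3[138]. We already know the information gain for these features (see Table 4.4[139]):

$$\begin{aligned}
IG(\text{STREAM}, \mathcal{D}) &= 0.3060 \\
IG(\text{SLOPE}, \mathcal{D}) &= 0.5774 \\
IG(\text{ELEVATION}, \mathcal{D}) &= 0.8774
\end{aligned}$$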
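To convert these information gain scores into information gain ratios, we need to compute the entropy of each feature and then divide the information gain scores by the respective entropy values. The entropy calculations for these descriptive features are

$$\begin{aligned}
H(\text{STREAM}, \mathcal{D}) &= -\left( \tfrac{4}{7}\log_2\tfrac{4}{7} + \tfrac{3}{7}\log_2\tfrac{3}{7} \right) = 0.9852 \text{ bits} \\
H(\text{SLOPE}, \mathcal{D}) &= -\left( \tfrac{1}{7}\log_2\tfrac{1}{7} + \tfrac{1}{7}\log_2\tfrac{1}{7} + \tfrac{5}{7}\log_2\tfrac{5}{7} \right) = 1.1488 \text{ bits} \\
H(\text{ELEVATION}, \mathcal{D}) &= -\left( \tfrac{1}{7}\log_2\tfrac{1}{7} + \tfrac{2}{7}\log_2\tfrac{2}{7} + \tfrac{3}{7}\log_2\tfrac{3}{7} + \tfrac{1}{7}\log_2\tfrac{1}{7} \right) = 1.8424 \text{ bits}
\end{aligned}$$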
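Using these results, we can now compute the information gain ratio for each descriptive feature by dividing the feature’s information gain by the entropy for that feature:

$$\begin{aligned}
GR(\text{STREAM}, \mathcal{D}) &= \frac{0.3060}{0.9852} = 0.3106 \\
GR(\text{SLOPE}, \mathcal{D}) &= \frac{0.5774}{1.1488} = 0.5026 \\
GR(\text{ELEVATION}, \mathcal{D}) &= \frac{0.8774}{1.8424} = 0.4762
\end{aligned}$$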
From these calculations we can see that SLOPE has the highest information gain ratio score, even though ELEVATION has the highest information gain. The implication of this is that if we build a decision tree for the dataset in Table 4.3[138] using information gain ratio, then SLOPE (rather than ELEVATION) would be the feature chosen for the root of the tree. Figure 4.12[147] illustrates the tree that would be generated for this dataset using information gain ratio.
Figure 4.12
The vegetation classification decision tree generated using information gain ratio.
Notice that there is a chapparal leaf node at the end of the branch ELEVATION = low even though there are no instances in the dataset where ELEVATION = low and VEGETATION = chapparal. This leaf node is the result of an empty partition being generated when the partition at the ELEVATION node was split. This leaf node was assigned the target level chapparal because this was the majority target level in the partition at the ELEVATION node.
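The handling of empty partitions described here can be sketched as follows. This is an illustrative fragment, not the book's algorithm listing: the function names are hypothetical, and the rows are made-up stand-ins chosen only to match the description (no ELEVATION = low instances, chapparal in the majority).

```python
from collections import Counter


def majority_level(rows, target):
    """Most frequent target level in a partition (ties go to the level seen first)."""
    return Counter(row[target] for row in rows).most_common(1)[0][0]


def split_with_fallback(rows, feature, domain, target):
    """Partition rows on every level in the feature's domain.

    Returns {level: (partition, fallback_label)}. The fallback label is the
    majority target level of the parent partition when the child partition is
    empty, and None otherwise.
    """
    parent_majority = majority_level(rows, target)
    result = {}
    for level in domain:
        partition = [row for row in rows if row[feature] == level]
        result[level] = (partition, None if partition else parent_majority)
    return result


# Hypothetical partition reaching an ELEVATION node: no instance has ELEVATION = low.
partition_at_node = [
    {"ELEVATION": "high", "VEGETATION": "chapparal"},
    {"ELEVATION": "medium", "VEGETATION": "riparian"},
    {"ELEVATION": "medium", "VEGETATION": "chapparal"},
    {"ELEVATION": "highest", "VEGETATION": "conifer"},
    {"ELEVATION": "high", "VEGETATION": "chapparal"},
]
domain = ["low", "medium", "high", "highest"]

split = split_with_fallback(partition_at_node, "ELEVATION", domain, "VEGETATION")
print(split["low"])  # empty partition, labeled with the parent's majority level
```

Iterating over the feature's full domain, rather than only the levels present in the partition, is what creates the branch for ELEVATION = low in the first place.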
If we compare this decision tree to the decision tree generated using information gain (see Figure 4.11[143]), it is obvious that the structure of the two trees is very different. This difference illustrates the effect of the metric used to select which feature to split on during tree construction. Another interesting point of comparison between these two trees is that even though they are both consistent with the dataset in Table 4.3[138], they do not always return the same prediction. For example, given the following query:
STREAM = false, SLOPE = moderate, ELEVATION = highest
the tree generated using information gain ratio (Figure 4.12[147]) will return VEGETATION = riparian, whereas the tree generated using information gain (Figure 4.11[143]) will return VEGETATION = conifer. The combination of features listed in this query does not occur in the dataset. Consequently, both of the trees are attempting to generalize beyond the dataset. This illustrates how two different models that are both consistent with a dataset can make different generalizations.

So, which feature selection metric should be used, information gain or information gain ratio? Information gain has the advantage that it is computationally less expensive than information gain ratio. If there is variation across the number of values in the domain of the descriptive features in a dataset, however, information gain ratio may be a better option. These factors aside, the effectiveness of descriptive feature selection metrics can vary from domain to domain. So we should experiment with different metrics to find which one results in the best models for each dataset.
Another commonly used measure of impurity is the Gini index:

$$Gini(t, \mathcal{D}) = 1 - \sum_{l \in levels(t)} P(t = l)^2$$
where $\mathcal{D}$ is a dataset with a target feature t; levels(t) is the set of levels in the domain of the target feature; and P(t = l) is the probability of an instance of $\mathcal{D}$ having the target level l.
The Gini index can be understood as calculating how often the target levels of instances in a dataset would be misclassified if predictions were made based only on the distribution of the target levels in the dataset. For example, if there were two target levels with equal likelihood in a dataset, then the expected rate of misclassification would be 0.5, and if there were four target levels with equal likelihood, then the expected rate of misclassification would be 0.75. The Gini index is 0 when all the instances in the dataset have the same target level and $1 - \frac{1}{k}$ when there are k possible target levels with equal likelihood. Indeed, a nice feature of the Gini index is that Gini index scores are always between 0 and 1, and in some contexts this may make it easier to compare Gini indexes across features. We can calculate the Gini index for the dataset in Table 4.3[138] as

$$Gini(\text{VEGETATION}, \mathcal{D}) = 1 - \left( \left(\tfrac{3}{7}\right)^2 + \left(\tfrac{2}{7}\right)^2 + \left(\tfrac{2}{7}\right)^2 \right) = 0.6531$$
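The values quoted above for equally likely target levels can be checked with a few lines of code. This is a minimal sketch; the function name `gini_index` and the placeholder level names are illustrative assumptions.

```python
from collections import Counter


def gini_index(levels):
    """Gini index: 1 minus the sum of squared target-level proportions."""
    total = len(levels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(levels).values())


print(gini_index(["a", "b"]))            # two equally likely levels  -> 0.5
print(gini_index(["a", "b", "c", "d"]))  # four equally likely levels -> 0.75
print(gini_index(["a", "a", "a"]))       # a pure set                 -> 0.0
```

With k equally likely levels each proportion is 1/k, so the sum of squares is 1/k and the index is 1 - 1/k, which approaches but never reaches 1 as k grows.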
Table 4.7
Partition sets (Part.), entropy, Gini index, remainder (Rem.), and information gain (Info. Gain) by feature for the dataset in Table 4.3[138].
The information gain for a feature based on the Gini index can be calculated in the same way as it is using entropy: calculate the Gini index for the full dataset and then subtract the sum of the weighted Gini index scores for the partitions created by splitting with the feature. Table 4.7[149] shows the calculation of the information gain using the Gini index for the descriptive features in the vegetation classification dataset. Comparing these results to the information gain calculated using entropy (see Table 4.4[139]), we can see that although the resulting numbers are different, the relative ranking of the features is the same: in both cases ELEVATION has the highest information gain. Indeed, for the vegetation dataset, the decision tree that will be generated using information gain based on the Gini index will be identical to the one generated using information gain based on entropy (see Figure 4.11[143]).
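Because the calculation has the same shape for both impurity measures, it can be written once with the measure passed in as a parameter. The sketch below is one possible arrangement, not a prescribed implementation; the function names and the toy rows (features A and B, target T) are hypothetical and unrelated to the vegetation dataset.

```python
from collections import Counter
from math import log2


def entropy(levels):
    """Shannon entropy (in bits) of a sequence of target levels."""
    total = len(levels)
    return -sum((n / total) * log2(n / total) for n in Counter(levels).values())


def gini_index(levels):
    """Gini index of a sequence of target levels."""
    total = len(levels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(levels).values())


def impurity_gain(rows, feature, target, impurity):
    """Impurity of the full dataset minus the weighted impurity of its partitions."""
    partitions = {}
    for row in rows:
        partitions.setdefault(row[feature], []).append(row[target])
    remainder = sum(len(p) / len(rows) * impurity(p) for p in partitions.values())
    return impurity([row[target] for row in rows]) - remainder


def rank_features(rows, features, target, impurity):
    """Features ordered from highest to lowest gain under the given impurity measure."""
    return sorted(features,
                  key=lambda f: impurity_gain(rows, f, target, impurity),
                  reverse=True)


# Hypothetical toy data: A separates the target perfectly, B barely helps.
rows = [
    {"A": "x", "B": "m", "T": "p"},
    {"A": "x", "B": "n", "T": "p"},
    {"A": "x", "B": "m", "T": "p"},
    {"A": "y", "B": "n", "T": "q"},
    {"A": "y", "B": "m", "T": "q"},
    {"A": "y", "B": "n", "T": "q"},
]

for impurity in (entropy, gini_index):
    print(impurity.__name__, rank_features(rows, ["A", "B"], "T", impurity))
```

On this toy data the two measures produce different gain values but the same ranking, which mirrors the behavior described for the vegetation dataset; on other datasets the rankings may differ, so swapping the `impurity` argument is a cheap way to compare them.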
So, which impurity measure should be used, Gini or entropy? The best advice that we can give is that it is good practice when building decision tree models to try out different impurity metrics and compare the results to see which suits a dataset best.