
Computational data mining

4.5 Tree models

While linear and logistic regression methods produce a score and then possibly a classification according to a discriminant rule, tree models begin by producing a classification of observations into groups and then obtain a score for each group. Tree models are usually divided into regression trees, when the response variable is continuous, and classification trees, when the response variable is quantitative discrete or qualitative (categorical, for short). However, as most concepts apply equally well to both, we do not distinguish between them here unless otherwise specified. Tree models can be defined as a recursive procedure through which a set of n statistical units is progressively divided into groups, according to a division rule that aims to maximise a homogeneity or purity measure of the response variable in each of the obtained groups. At each step of the procedure, a division rule is specified by the choice of an explanatory variable to split on and the choice of a splitting rule for that variable, which establishes how to partition the observations.

The main result of a tree model is a final partition of the observations. To achieve this it is necessary to specify stopping criteria for the division process. Suppose that a final partition has been reached, consisting of $g$ groups ($g < n$). Then for any given observation $y_i$ of the response variable, a regression tree produces a fitted value $\hat{y}_i$ that is equal to the mean response value of the group to which observation $i$ belongs. Let $m$ be such a group; formally we have that

$$\hat{y}_i = \frac{1}{n_m} \sum_{l=1}^{n_m} y_{lm}$$

where $n_m$ is the number of observations in group $m$ and $y_{lm}$ is the response of the $l$-th observation in that group.

For a classification tree, fitted values are given in terms of fitted probabilities of affiliation to a single group. If only two classes are possible (binary classification), the fitted success probability is therefore

$$\pi_i = \frac{1}{n_m} \sum_{l=1}^{n_m} y_{lm}$$

where the observations $y_{lm}$ can take the value 0 or 1, so the fitted probability corresponds to the observed proportion of successes in group $m$. Notice that $\hat{y}_i$ and $\pi_i$ are constant for all the observations in the group.
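As a minimal sketch of the two formulas above, the following Python fragment computes the fitted value of each group as the mean of its responses. The function name `leaf_fitted_values` and the toy data are hypothetical, introduced only for illustration; with a 0/1 response the same mean is the fitted success probability.

```python
from collections import defaultdict


def leaf_fitted_values(leaf_ids, responses):
    """Return {leaf: mean response} for observations already assigned to leaves."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for leaf, y in zip(leaf_ids, responses):
        totals[leaf] += y
        counts[leaf] += 1
    return {leaf: totals[leaf] / counts[leaf] for leaf in counts}


# Hypothetical toy data: six observations in two leaves, binary response.
leaf_ids = ["A", "A", "A", "B", "B", "B"]
responses = [1, 0, 0, 1, 1, 0]

fitted = leaf_fitted_values(leaf_ids, responses)

# Every observation in a leaf receives the same fitted value.
fitted_per_obs = [fitted[leaf] for leaf in leaf_ids]
```

Because the fitted value depends only on the leaf, the tree's predictions form a step function over the partition.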

The output of the analysis is usually represented using a tree; it looks very similar to the dendrogram produced by hierarchical clustering (Section 4.2). This implies that the partition performed at a certain level is influenced by the previous choices. Figure 4.6 shows a classification tree for a credit scoring application described in Chapter 11. For this problem the response variable is binary, and for the time being we use a 1 to indicate the event corresponding to the consumer not being reliable (bad). In Chapter 11 we will follow the opposite convention, where 1 indicates the ‘good’ customers. The choice is irrelevant for the final results, but it affects how the model is interpreted.

Figure 4.6 Example of a decision tree.

Figure 4.6 describes the typical output of a tree model analysis. The analysis proceeds in a divisive way. At the top of the figure there is the root node of the tree model, which contains all 1000 available observations; 70% are credit reliable (good) and 30% are credit unreliable (bad). In the first step of the procedure, the consumers are divided into two groups, according to the levels of the variable ‘Good Account’, estimated to be the most discriminatory. In the group on the right are the 394 consumers with a good account (Good Account=1), of which about 88% are credit reliable; in the group on the left are the 606 consumers with a bad account (Good Account=0), of which about 58% are credit reliable. Subsequently, the bad accounts group is further split according to the variable ‘Previous Repayments’ and the good accounts group is separated according to the variable ‘Concurrent’ (debts). The tree stops at these levels, so there are four terminal nodes (groups not further divisible).

The terminal nodes of a tree are often called the ‘leaves’. The leaves contain the main information conveyed by a tree model analysis; in this example they partition the observations into four groups, ordered by the fitted probability of credit unreliability (response variable): 9.3%, 25%, 38.5% and 67.1%. These fitted probabilities can be compared with those that could be obtained from a logistic regression model.

We can now classify new observations, for which the levels of the response variable are unknown. In Figure 4.6 we can do this by locating such observations in one of the four classes corresponding to the terminal branches, according to the levels assumed by the explanatory variables Good Account, Previous Repayments, and Concurrent, and following the described rules. For example, a consumer with level 0 of the variable Good Account and level 0 of the variable Previous Repayments will be classified as credit unreliable, since the corresponding terminal node (the leftmost leaf) has a very high probability of unreliability (equal to 67.1%). A consumer with level 1 of Good Account and level 0 of Concurrent will be classified as credit reliable, as they have a very low probability of unreliability (9.3%).
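The path-following rule just described can be sketched in Python as nested conditions. This is only an illustration: the function names are hypothetical, and the text states the leaf probabilities for only two of the four paths (67.1% and 9.3%). The assignment of 38.5% and 25% to the remaining two leaves is an assumption made here so that the sketch is complete, as is the 0.5 threshold used to turn a probability into a class.

```python
def unreliability_probability(good_account, previous_repayments, concurrent):
    """Follow the path from the root to a leaf; return its fitted P(bad)."""
    if good_account == 0:
        # The bad-account branch is split on Previous Repayments.
        if previous_repayments == 0:
            return 0.671  # stated in the text
        return 0.385      # ASSUMPTION: this leaf assignment is not stated
    # The good-account branch is split on Concurrent (debts).
    if concurrent == 0:
        return 0.093      # stated in the text
    return 0.25           # ASSUMPTION: this leaf assignment is not stated


def classify(good_account, previous_repayments, concurrent, threshold=0.5):
    """Label a consumer by comparing the leaf probability with a threshold."""
    p_bad = unreliability_probability(good_account, previous_repayments, concurrent)
    return "unreliable" if p_bad > threshold else "reliable"


# The two consumers discussed in the text.
first = classify(good_account=0, previous_repayments=0, concurrent=0)
second = classify(good_account=1, previous_repayments=0, concurrent=0)
```

Note that each path reads only the variables it was split on: Concurrent is ignored on the bad-account branch, and Previous Repayments is ignored on the good-account branch.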

For classification trees, a discriminant rule can be derived at each leaf of the tree. A commonly used rule is to classify all observations belonging to a terminal node in the class corresponding to the most frequent level (mode). This corresponds to the so-called ‘majority rule’. Other ‘voting’ schemes are also possible but, in the absence of other considerations, this rule is the most reasonable. Therefore each of the leaves points out a clear allocation rule of the observations, which is read by going through the path that connects the initial node to each of them. Every path in a tree model thus represents a classification rule. Compared with discriminant models, tree models produce rules that are less explicit analytically but easier to understand graphically.

Tree models can be considered as non-parametric predictive models (Section 5.2), since they do not require assumptions about the probability distribution of the response variable. In fact, this flexibility means that tree models are generally applicable, whatever the nature of the dependent variable and the explanatory variables. But this greater flexibility may have disadvantages, for instance, a higher demand of computational resources. Furthermore, their sequential nature and their algorithmic complexity can make them dependent on the observed data, and even a small change might alter the structure of the tree. It is difficult to take a tree structure designed for one context and generalize it to other contexts.

Despite their graphical similarities, there are important differences between hierarchical cluster analysis and classification trees. Classification trees are predictive rather than descriptive. Cluster analysis performs an unsupervised classification of the observations on the basis of all available variables, whereas classification trees perform a classification of the observations on the basis of all explanatory variables and supervised by the presence of the response (target) variable. A second important difference concerns the partition rule. In classification trees the segmentation is typically carried out using only one explanatory variable at a time (the maximally predictive explanatory variable), whereas in hierarchical clustering the divisive (or agglomerative) rule between groups is established on the basis of considerations on the distance between them, calculated using all the available variables.

We now describe in more detail the operational choices that have to be made before fitting a tree model to the data. It is appropriate to start with an accurate preliminary exploratory analysis. First, it is necessary to verify that the sample size is sufficiently large. This is because subsequent partitions will have fewer observations, for which the fitted values may have a lot of variance. Second, it is prudent to conduct accurate exploratory analysis on the response variable, especially to identify possible anomalous observations that could severely distort the results of the analysis. Pay particular attention to the shape of the response variable’s distribution. For example, if the distribution is strongly asymmetrical, the procedure may lead to isolated groups with few observations from the tail of the distribution. Furthermore, when the dependent variable is qualitative, ideally the number of levels should not be too large. Large numbers of levels should be reduced to improve the stability of the tree and to achieve improved predictive performance.

After the preprocessing stage, choose an appropriate tree model algorithm, paying attention to how it performs. The two main aspects are the division criteria and the methods employed to reduce the dimension of the tree. The most used algorithm in the statistical community is the CART algorithm (Breiman et al., 1984), which stands for ‘classification and regression trees’. Other algorithms include CHAID (Kass, 1980), C4.5 and its later version, C5.0 (Quinlan, 1993). C4.5 and C5.0 are widely used by computer scientists. The first versions of C4.5 and C5.0 were limited to categorical predictors, but the most recent versions are similar to CART. We now look at two key aspects of the CART algorithm: division criteria and pruning, employed to reduce the complexity of a tree.

4.5.1 Division criteria

The main distinctive element of a tree model is how the division rule is chosen for the units belonging to a group, corresponding to a node of the tree. Choosing a division rule means choosing a predictor from those available, and choosing the best partition of its levels. The choice is generally made using a goodness measure of the corresponding division rule. This allows us to determine, at each stage of the procedure, the rule that maximises the goodness measure. A goodness measure $\Phi(t)$ is a measure of the performance gain in subdividing a (parent) node $t$ according to a segmentation into a number of (child) nodes. Let $t_r$, $r = 1, \ldots, s$, indicate the child groups generated by the segmentation ($s = 2$ for a binary segmentation) and let $p_r$ indicate the proportion of observations, among those in $t$, that are allocated to each child node, with $\sum_{r=1}^{s} p_r = 1$. The criterion function is usually expressed as

$$\Phi(s, t) = I(t) - \sum_{r=1}^{s} I(t_r)\, p_r$$

where the symbol $I$ indicates an impurity function. High values of the criterion function imply that the chosen partition is a good one. The concept of impurity refers to a measure of variability of the response values of the observations. In a regression tree, a node will be pure if it has null variance (all observations are equal) and impure if the variance of the observations is high. More precisely, for a regression tree, the impurity at node $m$ can be defined by

$$I_V(m) = \frac{\sum_{l=1}^{n_m} (y_{lm} - \hat{y}_m)^2}{n_m}$$

where $\hat{y}_m$ indicates the fitted mean value for group $m$. For regression trees impurity corresponds to the variance; for classification trees alternative measures should be considered. Here are the usual choices.

Misclassification impurity

$$I_M(m) = \frac{\sum_{l=1}^{n_m} 1(y_{lm}, y_k)}{n_m} = 1 - \pi_k$$

where $y_k$ is the modal category of the node, with fitted probability $\pi_k$; the function $1(\cdot)$ indicates the indicator function, which takes the value 1 if $y_{lm} \neq y_k$ and 0 otherwise, so that the sum counts the observations that fall outside the modal category.

Gini impurity

$$I_G(m) = 1 - \sum_{i=1}^{k(m)} \pi_i^2$$

where $\pi_i$ are the fitted probabilities of the levels present in node $m$, which are at most $k(m)$.

Entropy impurity

$$I_E(m) = -\sum_{i=1}^{k(m)} \pi_i \log \pi_i$$

with $\pi_i$ as above. Notice that the entropy impurity and the Gini impurity correspond to the application of the heterogeneity indexes (Section 3.1) to the observations in node $m$. Compared to the misclassification impurity, both are more sensitive to changes in the fitted probabilities; they decrease faster than the misclassification rate as the tree grows. Therefore, to obtain a parsimonious tree, choose the misclassification impurity.
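To make the difference in sensitivity concrete, the sketch below implements the three classification impurities and evaluates the weighted child impurity (the term subtracted from the parent impurity in the criterion function) for two candidate splits. The function names and the class counts are hypothetical: a parent node with 400 observations in each of two classes is split in two different ways.

```python
import math


def misclassification_impurity(probs):
    """One minus the fitted probability of the modal class."""
    return 1.0 - max(probs)


def gini_impurity(probs):
    """One minus the sum of squared fitted probabilities."""
    return 1.0 - sum(p * p for p in probs)


def entropy_impurity(probs):
    """Entropy with natural logarithms; terms with p == 0 contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def weighted_impurity(impurity, children):
    """Sum of child impurities weighted by the share of observations.

    children: list of class-count tuples, one tuple per child node.
    """
    total = sum(sum(counts) for counts in children)
    result = 0.0
    for counts in children:
        n = sum(counts)
        probs = [c / n for c in counts]
        result += (n / total) * impurity(probs)
    return result


# Hypothetical splits of a parent node with class counts (400, 400).
split_a = [(300, 100), (100, 300)]
split_b = [(200, 400), (200, 0)]  # second child is perfectly pure

for name, impurity in [("misclassification", misclassification_impurity),
                       ("gini", gini_impurity),
                       ("entropy", entropy_impurity)]:
    print(name,
          round(weighted_impurity(impurity, split_a), 4),
          round(weighted_impurity(impurity, split_b), 4))
```

In this example the misclassification impurity cannot tell the two splits apart, whereas the Gini and entropy impurities both prefer the split that produces a pure child node.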

Tree assessments

Besides giving a useful split criterion, an impurity measure can be used to give an overall assessment of a tree. Let $N(T)$ be the number of leaves (terminal nodes) of a tree $T$. The total impurity of $T$ is obtained as

$$I(T) = \sum_{m=1}^{N(T)} I(t_m)\, p_m$$

where $p_m$ are the observed proportions of observations in the final classification.

In particular, the misclassification impurity constitutes a very important assessment of the goodness of fit of a classification tree. Even when the number of leaves coincides with the number of levels of the response variable, it need not be the case that all the observations classified in the same node actually have the same level of the response variable. The percentage of misclassifications, or the percentage of observations classified with a level different from the observed value, is also called misclassification error or misclassification rate; it is another important overall assessment of a classification tree.
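The link between the total impurity and the misclassification rate can be checked numerically. The sketch below is illustrative only: the function names are ours and the three leaves with their class counts are hypothetical. It computes the total impurity of a tree from its leaves and, separately, the share of observations that fall outside the modal class of their leaf.

```python
def misclassification_impurity(probs):
    """One minus the fitted probability of the modal class."""
    return 1.0 - max(probs)


def total_impurity(leaves, impurity):
    """Leaf impurities weighted by the observed proportion in each leaf.

    leaves: list of class-count tuples, one tuple per terminal node.
    """
    n = sum(sum(counts) for counts in leaves)
    total = 0.0
    for counts in leaves:
        n_m = sum(counts)
        probs = [c / n_m for c in counts]
        total += (n_m / n) * impurity(probs)
    return total


def misclassification_rate(leaves):
    """Share of observations not belonging to the modal class of their leaf."""
    n = sum(sum(counts) for counts in leaves)
    errors = sum(sum(counts) - max(counts) for counts in leaves)
    return errors / n


# Hypothetical tree with three leaves and two classes.
leaves = [(90, 10), (30, 20), (5, 45)]

impurity_value = total_impurity(leaves, misclassification_impurity)
error_rate = misclassification_rate(leaves)
```

Under the majority rule the two quantities coincide on the data used to build the tree, which is why the total misclassification impurity can be read directly as an error rate.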

The impurity measure used by CHAID is the distance between the observed and the expected frequencies; the expected frequencies are calculated using the hypothesis of homogeneity for the observations in the considered node. This split criterion function is the Pearson $X^2$ index. If the decrease in $X^2$ is significant (i.e. the p-value is lower than a prespecified level $\alpha$) then a node is split, otherwise it remains unsplit and becomes a leaf.
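A chi-squared split test of this kind can be sketched as follows. This is not the full CHAID procedure, which also merges predictor categories and adjusts the significance level; it only computes the Pearson statistic for a two-by-two table and its p-value with one degree of freedom. The function names are hypothetical, and the counts are approximations reconstructed from the rounded percentages quoted for the first split of Figure 4.6 (394 good accounts with about 88% reliable, 606 bad accounts with about 58% reliable), so they should be read as an assumption rather than as the original data.

```python
import math


def pearson_x2(table):
    """Pearson X^2 for a two-way table (rows: child nodes, columns: classes)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    x2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under homogeneity of the child nodes.
            expected = row_totals[i] * col_totals[j] / n
            x2 += (observed - expected) ** 2 / expected
    return x2


def p_value_df1(x2):
    """Upper-tail probability of a chi-squared variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x2 / 2.0))


# ASSUMED counts (reliable, unreliable), rounded from the quoted percentages.
table = [[347, 47],    # Good Account = 1
         [353, 253]]   # Good Account = 0

alpha = 0.05           # hypothetical significance level
x2 = pearson_x2(table)
p = p_value_df1(x2)
split_node = p < alpha
```

With these approximate counts the statistic is far in the upper tail, so the homogeneity hypothesis is rejected and the node would be split.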

4.5.2 Pruning

In the absence of a stopping criterion, a tree model could grow until each node contains identical observations in terms of values or levels of the dependent variable. This obviously does not constitute a parsimonious segmentation. Therefore it is necessary to stop the growth of the tree at a reasonable dimension. The ideal final tree configuration is both parsimonious and accurate. The first property implies that the tree has a small number of leaves, so that the predictive rule can be easily interpreted. The second property implies a large number of leaves that are maximally pure. The final choice is bound to be a compromise between the two opposing strategies. Some tree algorithms use stopping rules based on thresholds on the number of the leaves, or on the maximum number of steps in the process. Other algorithms introduce probabilistic assumptions on the variables, allowing us to use suitable statistical tests. In the absence of probabilistic assumptions, the growth is stopped when the decrease in impurity is too small. The results of a tree model can be very sensitive to the choice of a stopping rule.

The CART method uses a strategy somewhat different from the stepwise stopping criteria; it is based on the concept of pruning. First the tree is built to its greatest size. This might be the tree with the greatest number of leaves, or the tree in which every node contains only one observation or observations all with the same outcome value or level. Then the tree is ‘trimmed’ or ‘pruned’ according to a cost-complexity criterion. Let $T_0$ indicate the tree of greatest size and let $T$ indicate, in general, a tree. From any tree a subtree can be obtained by collapsing any number of its internal (non-terminal) nodes. The idea of pruning is to find a subtree of $T_0$ in an optimal way, the one that minimises a loss function. The loss function implemented in the CART algorithm depends on the total impurity of the tree $T$ and the tree complexity:

$$C_\alpha(T) = I(T) + \alpha N(T)$$

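where, for a tree $T$, $I(T)$ is the total impurity function calculated at the leaves, and $N(T)$ is the number of leaves; $\alpha$ is a constant that penalises complexity linearly. In a regression tree the impurity is a variance, so the total impurity can be determined as

$$I(T) = \sum_{m=1}^{N(T)} I_V(t_m)\, p_m$$

where $I_V(t_m)$ is the variance impurity of leaf $t_m$ and $p_m$ is the proportion of observations in that leaf.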

We have seen how impurity can be calculated for classification trees. Although any of the three impurity measures can be used, the misclassification impurity is usually chosen in practice. Notice that the minimisation of the loss function leads to a compromise between choosing a complex model (low impurity but high complexity cost) and choosing a simple model (high impurity but low complexity cost). The choice depends on the chosen value of $\alpha$. For each $\alpha$ it can be shown that there is a unique subtree of $T_0$ which minimises $C_\alpha(T)$.
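The trade-off governed by the complexity penalty can be illustrated with a small Python sketch. It does not implement CART's search over subtrees; it simply assumes that a handful of candidate subtrees of the largest tree are already available, each summarised by its number of leaves and its total impurity, and picks the one with the smallest cost-complexity loss. The function names and the candidate values are hypothetical.

```python
def cost_complexity(total_impurity, n_leaves, alpha):
    """Loss of a tree: total impurity plus a linear penalty on the leaves."""
    return total_impurity + alpha * n_leaves


def best_subtree(candidates, alpha):
    """Return the (n_leaves, total_impurity) pair with the smallest loss."""
    return min(candidates, key=lambda c: cost_complexity(c[1], c[0], alpha))


# Hypothetical candidate subtrees: (number of leaves, total impurity).
candidates = [(1, 0.30), (2, 0.30), (4, 0.25), (8, 0.22), (16, 0.21)]

for alpha in (0.0, 0.005, 0.01, 0.05):
    n_leaves, impurity = best_subtree(candidates, alpha)
    print(f"alpha={alpha}: {n_leaves} leaves, impurity {impurity}")
```

With no penalty the largest tree is chosen; as the penalty grows, the selected subtree shrinks until only the root node remains.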

A possible criticism of this loss function is that the performance of each tree configuration is evaluated with the same data used for building the classification rules, which can lead to optimistic estimates of the impurity. This is particularly true for large trees, due to the phenomenon we have already seen for regression models: the goodness of fit to the data increases with the complexity, here represented by the number of leaves. An alternative pruning criterion is based on the predictive misclassification errors, according to a technique known as cross-validation (Section 6.4). The idea is to split the available data set, use one part to train the model (i.e. to build a tree configuration), and use the second part to validate the model (i.e. to compare observed and predicted values for the response variable), thereby measuring the impurity in an unbiased fashion. The loss function is thus evaluated by measuring the complexity of the model fitted on the training data set, whose misclassification errors are measured on the validation data set.
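The train-and-validate idea can be sketched for a tree whose partition is already fixed. In the fragment below the majority class of each leaf is learned from the training part only, and the misclassification rate is then measured on both parts. All names and data are hypothetical, and the sketch assumes that every leaf reached by a validation observation also contains training observations.

```python
from collections import Counter, defaultdict


def fit_leaf_classes(leaf_ids, labels):
    """Majority class per leaf, learned from the training part only."""
    by_leaf = defaultdict(Counter)
    for leaf, label in zip(leaf_ids, labels):
        by_leaf[leaf][label] += 1
    return {leaf: counts.most_common(1)[0][0] for leaf, counts in by_leaf.items()}


def misclassification_rate(leaf_classes, leaf_ids, labels):
    """Share of observations whose label differs from their leaf's class."""
    errors = sum(leaf_classes[leaf] != label
                 for leaf, label in zip(leaf_ids, labels))
    return errors / len(labels)


# Hypothetical training part: leaf membership and observed classes.
train_leaves = ["A", "A", "A", "B", "B", "B"]
train_labels = [0, 0, 1, 1, 1, 1]

# Hypothetical validation part, routed through the same tree.
valid_leaves = ["A", "A", "B", "B"]
valid_labels = [0, 1, 1, 0]

rule = fit_leaf_classes(train_leaves, train_labels)
train_error = misclassification_rate(rule, train_leaves, train_labels)
valid_error = misclassification_rate(rule, valid_leaves, valid_labels)
```

In this toy example the error measured on the validation part is larger than the error on the training part, which is the optimism described above.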

To further explain the fundamental difference between training and validation error, Figure 4.7 takes a classification tree and illustrates the typical behaviour of the misclassification errors on the training and validation data sets, as functions of model complexity. $I(T)$ is always decreasing on the training data. $I(T)$ is non-monotone on the validation data; it usually follows the behaviour described in the figure, which allows us to choose the optimal number of leaves as the value of $N(T)$ such that $I(T)$ is minimum. For simplicity, Figure 4.7 takes $\alpha = 0$. When greater values for the complexity penalty are specified, the optimal number of nodes decreases, reflecting aversion towards complexity.

[Figure 4.7: I(T) plotted against N(T) for the training data and the validation data, with the optimal N(T) marked.]

The misclassification rate is not the only possible performance measure to use during pruning. Since the costs of misclassification can vary from one class to another, the misclassification impurity could be replaced by a simple cost function that multiplies the misclassification impurity by the costs attached to the consequence of such errors. This is further discussed in Section 6.5.
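One possible reading of such a cost function is sketched below: at each leaf, the probability of each kind of error is multiplied by its cost, and the leaf is assigned to the class with the smallest expected cost. The function names, the leaf probabilities and the cost values are all hypothetical; the text does not prescribe this particular form.

```python
def expected_cost(probs, costs, predicted):
    """Expected cost of assigning every observation in a leaf to `predicted`.

    probs: {class: fitted probability in the leaf}
    costs: {(actual, predicted): cost of that misclassification}
    """
    return sum(p * costs.get((actual, predicted), 0.0)
               for actual, p in probs.items() if actual != predicted)


def min_cost_class(probs, costs):
    """Class whose assignment minimises the expected misclassification cost."""
    return min(probs, key=lambda c: expected_cost(probs, costs, c))


# Hypothetical leaf with a 25% fitted probability of unreliability.
leaf = {"good": 0.75, "bad": 0.25}

# Equal costs reproduce the majority rule.
equal_costs = {("good", "bad"): 1.0, ("bad", "good"): 1.0}

# ASSUMED costs: accepting a bad customer is five times as costly.
unequal_costs = {("good", "bad"): 1.0, ("bad", "good"): 5.0}

with_equal = min_cost_class(leaf, equal_costs)
with_unequal = min_cost_class(leaf, unequal_costs)
```

With equal costs the leaf is labelled according to its modal class; once one kind of error is made sufficiently expensive, the same leaf is labelled with the minority class.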

The CHAID algorithm uses chi-squared testing to produce an implicit stopping criterion based on testing the significance of the homogeneity hypothesis; the hypothesis is rejected for large values of $\chi^2$. If homogeneity is rejected for a certain node, then splitting continues, otherwise the node becomes terminal. Unlike the CART algorithm, CHAID prefers to stop the growth of the tree through a stopping criterion based on the significance of the chi-squared test, rather than through a pruning mechanism.

Source: Giudici, P. (2003), Applied Data Mining: Statistical Methods for Business and Industry, pages 114-121.