

4.5 Tree models

While linear and logistic regression methods produce a score and then possibly a classification according to a discriminant rule, tree models begin by producing a classification of observations into groups and then obtain a score for each group. Tree models are usually divided into regression trees, when the response variable is continuous, and classification trees, when the response variable is quantitative discrete or qualitative (categorical). However, as most concepts apply equally well to both, here we do not distinguish between them, unless otherwise specified. Tree models can be defined as a recursive procedure, through which a set of n statistical units are progressively divided into groups, according to a division rule that aims to maximise a homogeneity or purity measure of the response variable in each of the obtained groups. At each step of the procedure, a division rule is specified by the choice of an explanatory variable to split and the choice of a splitting rule for the variable, which establishes how to partition the observations. The main result of a tree model is a final partition of the observations. To achieve this it is necessary to specify stopping criteria for the division process. Suppose that a final partition has been reached, consisting of g groups (g < n). Then for any given response variable observation $y_i$, a regression tree produces a fitted value $\hat{y}_i$ that is equal to the mean response value of the group to which observation i belongs. Let m be such a group; formally we have that

$$\hat{y}_i = \frac{\sum_{l=1}^{n_m} y_{lm}}{n_m}.$$

For a classification tree, fitted values are given in terms of fitted probabilities of affiliation to a single group. Suppose only two classes are possible (binary classification); the fitted success probability is therefore

$$\pi_i = \frac{\sum_{l=1}^{n_m} y_{lm}}{n_m},$$

where the observations $y_{lm}$ can take the value 0 or 1, and therefore the fitted probability corresponds to the observed proportion of successes in group m. Notice that both $\hat{y}_i$ and $\pi_i$ are constant for all the observations in the group.
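To make these fitted values concrete, here is a minimal sketch in Python (with numpy; the data are illustrative and not from the text) that computes the leaf mean for a regression tree and the leaf proportion of successes for a classification tree.

```python
import numpy as np

# Observations falling into a given leaf (group) m; the values are illustrative.
y_regression = np.array([3.2, 2.8, 3.5, 3.1])   # continuous response
y_classification = np.array([1, 0, 1, 1])        # binary response (0/1)

# Regression tree: the fitted value is the mean response in the group.
y_hat = y_regression.mean()

# Classification tree: the fitted probability is the proportion of successes.
pi_hat = y_classification.mean()

print(f"fitted value y_hat = {y_hat:.3f}")
print(f"fitted probability pi = {pi_hat:.3f}")
```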

The output of the analysis is usually represented as a tree; it looks very similar to the dendrogram produced by hierarchical clustering (Section 4.2). This implies that the partition performed at a certain level is influenced by the previous choices. For classification trees, a discriminant rule can be derived at each leaf of the tree. A typical rule is to classify all observations belonging to a terminal node in the class corresponding to the most frequent level (mode). This corresponds to the so-called ‘majority rule’. Other ‘voting’ schemes are also possible but, in the absence of other considerations, this rule is the most reasonable. Therefore each of the leaves points out a clear allocation rule for the observations, which is read by following the path that connects the initial node to that leaf. Every path in a tree model thus represents a classification rule. Compared with discriminant models, tree models produce rules that are less explicit analytically but easier to understand graphically.

Tree models can be considered as non-parametric predictive models, since they do not require assumptions about the probability distribution of the response variable. In fact, this flexibility means that tree models are generally applicable, whatever the nature of the dependent variable and the explanatory variables. But this greater flexibility may have disadvantages, such as a greater demand for computational resources. Furthermore, their sequential nature and their algorithmic complexity can make them dependent on the observed data, and even a small change might alter the structure of the tree. It is difficult to take a tree structure designed for one context and generalise it to other contexts.

Despite their graphical similarities, there are important differences between hierarchical cluster analysis and classification trees. Classification trees are predictive rather than descriptive. Cluster analysis performs an unsupervised classification of the observations on the basis of all available variables, whereas classification trees perform a classification of the observations on the basis of all explanatory variables, supervised by the presence of the response (target) variable. A second important difference concerns the partition rule. In classification trees the segmentation is typically carried out using only one explanatory variable at a time (the maximally predictive explanatory variable), whereas in hierarchical clustering the division (agglomerative) rule between groups is established on the basis of considerations on the distance between them, calculated using all the available variables.

We now describe in more detail the operational choices that have to be made before fitting a tree model to the data. It is appropriate to start with an accurate preliminary exploratory analysis. First, it is necessary to verify that the sample size is sufficiently large. This is because subsequent partitions will have fewer observations, for which the fitted values may have high variance. Second, it is prudent to conduct an accurate exploratory analysis of the response variable, especially to identify possible anomalous observations that could severely distort the results of the analysis. Pay particular attention to the shape of the response variable’s distribution. For example, if the distribution is strongly asymmetrical, the procedure may lead to isolated groups with few observations from the tail of the distribution. Furthermore, when the dependent variable is qualitative, ideally the number of levels should not be too large. Large numbers of levels should be reduced to improve the stability of the tree and to achieve improved predictive performance.

After the preprocessing stage, choose an appropriate tree model algorithm, paying attention to how it performs. The two main aspects are the division criteria and the methods employed to reduce the dimension of the tree. The most popular algorithm in the statistical community is the CART algorithm (Breiman et al., 1984), which stands for ‘classification and regression trees’. Other algorithms include CHAID (Kass, 1980), C4.5 and its later version, C5.0 (Quinlan, 1993).

C4.5 and C5.0 are widely used by computer scientists. The first versions of C4.5 and C5.0 were limited to categorical predictors, but the most recent versions are similar to CART. We now look at two key aspects of the CART algorithm: division criteria and pruning, employed to reduce the complexity of a tree.

4.5.1 Division criteria

The main distinctive element of a tree model is how the division rule is chosen for the units belonging to a group, corresponding to a node of the tree. Choosing a division rule means choosing a predictor from those available, and choosing the best partition of its levels. The choice is generally made using a goodness measure of the corresponding division rule. This allows us to determine, at each stage of the procedure, the rule that maximises the goodness measure. A goodness measure $\Phi(s, t)$ is a measure of the performance gain in subdividing a (parent) node t according to a segmentation into a number of (child) nodes. Let $t_r$, $r = 1, \ldots, s$, denote the child groups generated by the segmentation ($s = 2$ for a binary segmentation) and let $p_r$ denote the proportion of observations, among those in t, that are allocated to each child node, with $\sum_r p_r = 1$. The criterion function is usually expressed as

$$\Phi(s, t) = I(t) - \sum_{r=1}^{s} I(t_r)\, p_r,$$

where the symbol I denotes an impurity function. High values of the criterion function imply that the chosen partition is a good one. The concept of impurity is used to measure the variability of the response values of the observations. In a regression tree, a node will be pure if it has null variance (all observations are equal) and impure if the variance of the observations is high. More precisely, for a regression tree, the impurity at node m is defined by

$$I_V(m) = \frac{\sum_{l=1}^{n_m} (y_{lm} - \hat{y}_m)^2}{n_m},$$

where $\hat{y}_m$ is the fitted mean value for group m. For regression trees impurity corresponds to the variance; for classification trees alternative measures should be considered. Here we present three choices. The misclassification impurity is given by

$$I_M(m) = \frac{\sum_{l=1}^{n_m} 1(y_{lm}, y_k)}{n_m} = 1 - \pi_k,$$

where $y_k$ is the modal category of the node, with fitted probability $\pi_k$, and the function $1(\cdot, \cdot)$ denotes the indicator function, which takes the value 1 if $y_{lm} \neq y_k$ and 0 otherwise. The Gini impurity is

$$I_G(m) = 1 - \sum_{i=1}^{k(m)} \pi_i^2,$$

where the $\pi_i$ are the fitted probabilities of the levels present at node m, which are at most k(m). Finally, the entropy impurity is

$$I_E(m) = -\sum_{i=1}^{k(m)} \pi_i \log \pi_i,$$

with $\pi_i$ as above. Notice that the Gini impurity and the entropy impurity correspond to the application of the indexes of heterogeneity (Section 3.1.3) to the observations at node m. Compared to the misclassification impurity, both are more sensitive to changes in the fitted probabilities; they decrease faster than the misclassification rate as the tree grows. Therefore, to obtain a parsimonious tree, choose the misclassification impurity.
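As an illustration of these quantities, the following sketch (Python with numpy; the helper names and data are illustrative, not from the original text) computes the three classification impurities at a node and evaluates the criterion Φ(s, t) for a candidate binary split.

```python
import numpy as np

def impurities(y):
    """Misclassification, Gini and entropy impurity of a node,
    given the vector of class labels y observed at that node."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()                 # fitted class probabilities pi_i
    i_m = 1.0 - p.max()                       # misclassification impurity
    i_g = 1.0 - np.sum(p ** 2)                # Gini impurity
    i_e = -np.sum(p * np.log(p))              # entropy impurity
    return i_m, i_g, i_e

def split_criterion(y_parent, y_children, impurity=lambda y: impurities(y)[1]):
    """Phi(s, t) = I(t) - sum_r I(t_r) * p_r for a candidate split."""
    n = len(y_parent)
    return impurity(y_parent) - sum(
        impurity(y_r) * len(y_r) / n for y_r in y_children
    )

# Illustrative node with two classes and a candidate binary split.
y_t = np.array([1, 1, 1, 0, 0, 1, 0, 1])
left, right = y_t[:4], y_t[4:]
print(impurities(y_t))                     # impurities at the parent node
print(split_criterion(y_t, [left, right])) # gain of the split (Gini impurity)
```

Passing a different impurity function to split_criterion switches between the Gini, entropy and misclassification versions of the criterion.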

Besides giving a useful split criterion, an impurity measure can be used to give an overall assessment of a tree. Let N(T) be the number of leaves (terminal nodes) of a tree T. The total impurity of T is given by

$$I(T) = \sum_{m=1}^{N(T)} I(t_m)\, p_m,$$

where $p_m$ are the observed proportions of observations in the final classification. In particular, the misclassification impurity constitutes a very important assessment of the goodness of fit of a classification tree. Even when the number of leaves coincides with the number of levels of the response variable, it need not be that all the observations classified in the same node actually have the same level of the response variable. The percentage of misclassifications, or the percentage of observations classified with a level different from the observed value, is also called the misclassification error or misclassification rate; it is another important overall assessment of a classification tree.

The impurity measure used by CHAID is the distance between the observed and expected frequencies, where the expected frequencies are calculated under the hypothesis of homogeneity for the observations in the considered node. The split criterion function is therefore the Pearson $X^2$ index. If the decrease in $X^2$ is significant (i.e. the p-value is lower than a prespecified level α) then a node is split; otherwise it remains unsplit and becomes a leaf.
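As a rough sketch of this kind of decision (Python with scipy; this is not the CHAID implementation, and the contingency table and significance level are illustrative), a node is split only when the Pearson chi-squared test on the table of a candidate predictor against the response is significant:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Contingency table at the current node: rows are the levels of a candidate
# predictor, columns are the levels of the response (counts are illustrative).
table = np.array([[30, 10],
                  [12, 28]])

alpha = 0.05                                  # prespecified significance level
x2, p_value, dof, expected = chi2_contingency(table)

# Split the node only if the test is significant; otherwise it becomes a leaf.
split_node = p_value < alpha
print(f"X^2 = {x2:.2f}, p-value = {p_value:.4f}, split: {split_node}")
```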

4.5.2 Pruning

In the absence of a stopping criterion, a tree model could grow until each node contains identical observations in terms of values or levels of the dependent variable. This obviously does not constitute a parsimonious segmentation. Therefore it is necessary to stop the growth of the tree at a reasonable dimension. The ideal final tree configuration is both parsimonious and accurate. The first property implies that the tree has a small number of leaves, so that the predictive rule can be easily interpreted. The second property implies a large number of leaves that are maximally pure. The final choice is bound to be a compromise between the two opposing strategies. Some tree algorithms use stopping rules based on thresholds on the number of leaves, or on the maximum number of steps in the process. Other algorithms introduce probabilistic assumptions on the variables, allowing us to use suitable statistical tests. In the absence of probabilistic assumptions, the growth is stopped when the decrease in impurity is too small. The results of a tree model can be very sensitive to the choice of a stopping rule. The CART method uses a strategy somewhat different from the stepwise stopping criteria; it is based on the concept of pruning. First the tree is built to its greatest size. This might be the tree with the greatest number of leaves, or the tree in which every node contains only one observation, or observations all with the same outcome value or level. Then the tree is ‘trimmed’ or ‘pruned’ according to a cost-complexity criterion. Let T be a tree, and let $T_0$ denote the tree of greatest size. From any tree a subtree can be obtained by collapsing any number of its internal (non-terminal) nodes. The idea of pruning is to find a subtree of $T_0$ in an optimal way, so as to minimise a loss function. The loss function implemented in the CART algorithm depends on the total impurity of the tree T and the tree complexity:

$$C_\alpha(T) = I(T) + \alpha N(T),$$

where, for a tree T, $I(T)$ is the total impurity function calculated at the leaves and $N(T)$ is the number of leaves, with α a constant that penalises complexity linearly. In a regression tree the impurity is a variance, so the total impurity can be determined as

$$I(T) = \sum_{m=1}^{N(T)} I_V(m)\, n_m.$$

We have seen how impurity can be calculated for classification trees. Although any of the three impurity measures can be used, the misclassification impurity is usually chosen in practice. Notice that the minimisation of the loss function leads to a compromise between choosing a complex model (low impurity but high complexity cost) and choosing a simple model (high impurity but low complexity cost). The choice depends on the chosen value of α. For each α it can be shown that there is a unique subtree of $T_0$ that minimises $C_\alpha(T)$.
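To see the trade-off numerically, consider a toy comparison (the impurities and leaf counts are illustrative, not from the text) of a large and a small subtree under two values of α:

```python
# Toy cost-complexity comparison: C_alpha(T) = I(T) + alpha * N(T).
# The impurities and leaf counts below are purely illustrative.
subtrees = {"large": {"I": 0.10, "N": 8},
            "small": {"I": 0.22, "N": 3}}

for alpha in (0.01, 0.05):
    costs = {name: t["I"] + alpha * t["N"] for name, t in subtrees.items()}
    best = min(costs, key=costs.get)
    print(f"alpha={alpha}: {costs} -> prefer the {best} subtree")
```

With the small penalty the large subtree wins; with the larger penalty the small subtree is preferred, which is exactly the compromise described above.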

A possible criticism of this loss function is that the performance of each tree configuration is evaluated with the same data used for building the classification rules, which can lead to optimistic estimates of the impurity. This is particularly true for large trees, due to the phenomenon we have already seen for regression models: the goodness of fit to the data increases with the complexity, here represented by the number of leaves. An alternative pruning criterion is based on the predictive misclassification errors, according to a technique known as cross-validation (Section 5.4). The idea is to split the available data set, use one part to train the model (i.e. to build a tree configuration), and use the second part to validate the model (i.e. to compare observed and predicted values for the response variable), thereby measuring the impurity in an unbiased fashion. The loss function is thus evaluated by measuring the complexity of the model fitted on the training data set, whose misclassification errors are measured on the validation data set.
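In practice this idea can be sketched with scikit-learn's cost-complexity pruning path, selecting the value of α whose pruned tree has the lowest misclassification rate on a validation set (a minimal sketch on simulated data; scikit-learn's procedure is related to, but not a verbatim implementation of, the CART pruning described here):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Illustrative data standing in for a generic (X, y) data set.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)

# Candidate values of the complexity penalty alpha along the pruning path.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(
    X_train, y_train)

# Fit one pruned tree per alpha and measure the validation misclassification rate.
best_alpha, best_error = None, np.inf
for alpha in path.ccp_alphas:
    tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha)
    tree.fit(X_train, y_train)
    error = 1.0 - tree.score(X_valid, y_valid)   # validation misclassification
    if error < best_error:
        best_alpha, best_error = alpha, error

print(f"chosen alpha = {best_alpha:.5f}, validation error = {best_error:.3f}")
```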

Figure 4.6 Misclassification rates. [Figure: the misclassification rate I(T) on the training data and on the validation data, plotted against the number of leaves N(T), with the optimal tree size marked.]

To further explain the fundamental difference between training and validation error, Figure 4.6 takes a classification tree and illustrates the typical behaviour of the misclassification errors on the training and validation data sets, as functions of model complexity. $I(T)$ is always decreasing on the training data. $I(T)$ is non-monotone on the validation data; it usually follows the behaviour described in the figure, which allows us to choose the optimal number of leaves as the value of $N(T)$ such that $I(T)$ is minimum. For simplicity, Figure 4.6 takes α = 0. When greater values for the complexity penalty are specified, the optimal number of nodes decreases, reflecting aversion to complexity.

The misclassification rate is not the only possible performance measure to use during pruning. Since the costs of misclassification can vary from one class to another, the misclassification impurity could be replaced by a simple cost function that multiplies the misclassification impurity by the costs attached to the consequence of such errors. This is further discussed in Section 5.5.
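A minimal sketch of such a cost function (Python; the cost matrix and node data are illustrative, not from the text): the errors at a node are weighted by the cost of each type of misclassification before averaging.

```python
import numpy as np

# Illustrative 2x2 cost matrix: cost[i, j] is the cost of predicting class j
# when the true class is i (asymmetric: a missed "1" is five times as costly).
cost = np.array([[0.0, 1.0],
                 [5.0, 0.0]])

y_true = np.array([0, 0, 0, 0, 1, 1])   # observed classes at a node (illustrative)

# Average cost of labelling the whole node with class 0 or class 1.
for label in (0, 1):
    node_cost = cost[y_true, label].mean()
    print(f"label node as {label}: cost-weighted impurity = {node_cost:.2f}")
```

Here the majority rule would label the node with class 0, but the asymmetric costs favour class 1; this kind of cost-sensitive assessment is taken up again in Section 5.5.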

The CHAID algorithm uses chi-squared testing to produce an implicit stopping criterion based on testing the significance of the homogeneity hypothesis; the hypothesis is rejected for large values of $\chi^2$. If homogeneity is rejected for a certain node, then splitting continues; otherwise the node becomes terminal. Unlike the CART algorithm, CHAID prefers to stop the growth of the tree through a stopping criterion based on the significance of the chi-squared test, rather than through a pruning mechanism.