Multivariate Analysis: Classification and Regression Tree (CART)


3.8 Tools and Techniques for Analysis

3.8.3 Multivariate Analysis: Classification and Regression Tree (CART)

Classification and regression tree (CART), a data mining technique, is a predictive model that partitions the data into nodes and leaves arranged as a tree. Each branch point of the tree represents a variable used for classification, and the branches are formed according to a splitting algorithm. Decision trees produce rules that are mutually exclusive and collectively exhaustive. The method categorizes the data at each branch point without losing any observations: the total number of observations in a parent node is equal to the sum of the numbers of observations contained in its two child nodes.

In CART, each group, called a ‘node’, can be divided into two sub-groups, a process termed ‘binary partitioning’. The original node is known as the parent node and the two resulting groups are the child nodes. The term “recursive” means that the binary partitioning process can be applied again and again. Therefore, each parent node can produce two child nodes and, in turn, each of these child nodes can be split again to form additional child nodes. Partitioning refers to the process by which the dataset is split into sections or sub-groups.
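
The recursive binary partitioning described above can be illustrated with a minimal sketch. It is not the CART splitting rule: as a stated simplification, each node is split at the midpoint of its range rather than by an impurity criterion, and the names `Node`, `partition` and `counts_are_conserved` are hypothetical. The sketch only shows that splitting is applied again and again and that no observation is lost at any branch point.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    values: List[float]              # observations that reached this node
    left: Optional["Node"] = None    # child holding values <= threshold
    right: Optional["Node"] = None   # child holding values > threshold


def partition(values: List[float], min_size: int = 2) -> Node:
    """Recursively split a node into two children (placeholder midpoint rule)."""
    node = Node(values)
    # Stop when the node is too small or has no variation left to split on.
    if len(values) < min_size or min(values) == max(values):
        return node
    threshold = (min(values) + max(values)) / 2
    left = [v for v in values if v <= threshold]
    right = [v for v in values if v > threshold]
    if not left or not right:        # guard against a degenerate split
        return node
    node.left = partition(left, min_size)
    node.right = partition(right, min_size)
    return node


def counts_are_conserved(node: Node) -> bool:
    """Check that every parent holds exactly the sum of its two children."""
    if node.left is None or node.right is None:
        return True                  # terminal node: nothing to check
    same = len(node.values) == len(node.left.values) + len(node.right.values)
    return same and counts_are_conserved(node.left) and counts_are_conserved(node.right)


def leaves(node: Node) -> List[Node]:
    """Collect the terminal nodes of the tree."""
    if node.left is None or node.right is None:
        return [node]
    return leaves(node.left) + leaves(node.right)


tree = partition([3, 1, 4, 1, 5, 9, 2, 6])   # made-up values
print(counts_are_conserved(tree), len(leaves(tree)))
```

Because each child is itself passed back to `partition`, the same two-way split is reused at every level, which is what “recursive” refers to.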

The root node (node 0) includes all cases of the learning dataset, and the tree-building process starts at this point. The CART algorithm searches for the best predictor with which to divide the root node into two child nodes. To do so, the algorithm checks every possible split of every predictor. For a categorical variable, the number of potential splits increases with the number of levels of the variable. To choose the best splitter, the algorithm seeks to maximize the average “purity” of the two child nodes. The most common splitting criteria (splitting functions) are ‘Gini’ and ‘Twoing’, which give similar results when the outcome variable is categorical. Every node, including the root node (node 0), is assigned a class called the ‘predicted outcome’. The node-splitting process is repeated for each child node and continues recursively until it is impossible to proceed further. Predicted class assignment is essential because it is not known during tree building which nodes will end up being terminal nodes after pruning. The assignment depends on three factors, namely (a) the distribution of classes of the learning dataset in a particular node, (b) the decision loss or cost matrix, and (c) the fraction of subjects that end up in each node. This method of node class assignment ensures that the tree has a minimal expected average decision cost when the probability of each outcome is equal to the assumed prior probabilities.
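
As an illustration of the split search and class assignment described above, the following sketch scores candidate thresholds of a single numeric predictor with the Gini impurity and assigns a node class by minimum expected misclassification cost. It is a simplified sketch under stated assumptions, not the procedure of any particular CART program: the function names (`gini`, `best_split`, `assign_class`), the toy data and the cost values are all hypothetical, and Twoing, categorical predictors and prior probabilities are left out.

```python
from collections import Counter
from typing import Dict, Optional, Sequence, Tuple


def gini(labels: Sequence[str]) -> float:
    """Gini impurity: 1 minus the sum of squared class proportions (0 = pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(x: Sequence[float], y: Sequence[str]) -> Tuple[Optional[float], float]:
    """Return (threshold, weighted child impurity) for the best split x <= t."""
    best_t, best_score = None, float("inf")
    levels = sorted(set(x))
    # Candidate thresholds are midpoints between consecutive distinct values.
    for lo, hi in zip(levels, levels[1:]):
        t = (lo + hi) / 2
        left = [yi for xi, yi in zip(x, y) if xi <= t]
        right = [yi for xi, yi in zip(x, y) if xi > t]
        # Lower weighted impurity means purer child nodes on average.
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if score < best_score:
            best_t, best_score = t, score
    return best_t, best_score


def assign_class(class_probs: Dict[str, float],
                 cost: Dict[Tuple[str, str], float]) -> str:
    """Choose the class with the lowest expected misclassification cost.

    cost[(true, predicted)] is the loss incurred when `predicted` is assigned
    to a case whose true class is `true`; missing entries count as zero.
    """
    classes = list(class_probs)

    def expected_cost(predicted: str) -> float:
        return sum(class_probs[t] * cost.get((t, predicted), 0.0) for t in classes)

    return min(classes, key=expected_cost)


# Made-up toy data: one numeric predictor and a binary outcome.
age = [22, 25, 31, 38, 45, 52]
outcome = ["yes", "yes", "yes", "no", "no", "no"]
threshold, impurity = best_split(age, outcome)

# Hypothetical node with 30% "yes" cases; missing a "yes" costs five times more.
node_probs = {"yes": 0.3, "no": 0.7}
unequal_costs = {("yes", "no"): 5.0, ("no", "yes"): 1.0}
print(threshold, impurity, assign_class(node_probs, unequal_costs))
```

With equal costs the node above would simply take its majority class; the unequal cost matrix is what moves the assignment to the minority class, which is the role factor (b) plays in the text.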

The “primary splitter” for each node is the variable that best splits the node, that is, the one that maximizes the purity of the resulting nodes. When the primary splitter is missing for an individual observation, that observation is not discarded; instead, a “surrogate splitting variable” is used. A surrogate splitter is a variable whose pattern within the dataset, relative to the outcome variable, is similar to that of the primary splitter. The program therefore uses the best available information in the face of missing values, and in datasets of reasonable quality this allows all observations to be used. In handling missing data, CART has an important advantage over traditional multivariate regression modelling, in which observations with missing predictor values are usually discarded.

The tree-building process stops when there is only one observation in each child node, when all observations within a child node have an identical distribution of predictor variables (so that no further split is possible), or when an external limit on tree depth (the depth option) stipulated by the user is reached. A major innovation of CART is the recognition that there is no way to know during the tree-building process when to stop, and that different parts of the tree may require different depths. Therefore, the tree is grown to its maximal size and the method of “cost-complexity” pruning is then used to generate a sequence of progressively simpler trees.
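
The trade-off behind cost-complexity pruning can be sketched as follows, using the standard cost-complexity measure, in which the cost of a subtree is its error plus a penalty `alpha` times its number of terminal nodes. This is an illustrative sketch only: the list `SUBTREES` holds invented (terminal nodes, error) pairs rather than results from any dataset, and the function names are hypothetical.

```python
from typing import List, Tuple

# (number of terminal nodes, re-substitution error) for a nested sequence of
# pruned subtrees. The numbers are invented purely for illustration.
SUBTREES: List[Tuple[int, float]] = [
    (1, 0.40), (2, 0.25), (4, 0.15), (7, 0.12), (12, 0.10),
]


def cost_complexity(error: float, n_leaves: int, alpha: float) -> float:
    """Error of the subtree plus a penalty of alpha per terminal node."""
    return error + alpha * n_leaves


def select_subtree(subtrees: List[Tuple[int, float]], alpha: float) -> Tuple[int, float]:
    """Return the subtree with the smallest cost-complexity for this alpha."""
    return min(subtrees, key=lambda t: cost_complexity(t[1], t[0], alpha))


# Raising alpha favours progressively simpler trees.
for alpha in (0.0, 0.02, 0.2):
    print(alpha, select_subtree(SUBTREES, alpha))
```

With no penalty the largest tree is selected because it has the lowest error on the learning data; as the penalty grows, smaller trees take over, which produces the sequence of simpler trees referred to above.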

The maximal tree fits the learning dataset more accurately than any other tree, but its performance on the learning dataset, termed the “re-substitution cost”, generally greatly overestimates the performance of the tree on an independent set of data. Cross validation is a method for validating a model-building procedure that avoids the need for a separate dataset for validation. In cross validation, the learning dataset is randomly split into K sections, stratified on the outcome variable so that a similar distribution of outcomes is present in each of the K subsets of data. One of the K subsets is reserved for use as an independent test dataset, while the remaining K − 1 subsets are used as the learning dataset. The entire model-building procedure is repeated K times, with a different subset of data reserved as the independent test dataset each time. Thus, K different models are produced, each of which can be tested against an independent subset of the data. The average performance of these K models, as measured by cross validation, is an excellent estimate of the performance of the original model. When cross validation is used in CART, the entire tree-building and pruning sequence is conducted K times, so that K sequences of trees are produced. The trees within the sequences are matched on their number of terminal nodes to produce an estimate of the performance of the tree in predicting outcomes for a new independent dataset as a function of tree complexity. This identifies the tree complexity that results in the best performance with respect to an independent dataset.
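
The stratified K-fold procedure described above can be sketched in a few lines. This is a minimal illustration, not the routine of any CART package: the helpers `stratified_folds` and `cross_validate`, the toy labels and the majority-class "model" are all hypothetical stand-ins, and a real analysis would repeat the full tree-building and pruning sequence inside each fold.

```python
import random
from collections import Counter, defaultdict
from typing import Callable, Dict, List, Sequence


def stratified_folds(labels: Sequence[str], k: int, seed: int = 0) -> List[List[int]]:
    """Assign observation indices to k folds, balancing each outcome class."""
    rng = random.Random(seed)
    by_class: Dict[str, List[int]] = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    folds: List[List[int]] = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices of this class round-robin across the folds.
        for position, i in enumerate(indices):
            folds[position % k].append(i)
    return folds


def cross_validate(labels: Sequence[str], k: int,
                   fit: Callable, score: Callable) -> float:
    """Average the test score over k models, each tested on its held-out fold."""
    scores = []
    for test_idx in stratified_folds(labels, k):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(labels)) if i not in held_out]
        model = fit(train_idx)             # built on the other K - 1 subsets
        scores.append(score(model, test_idx))
    return sum(scores) / k


# Made-up outcome labels and a trivial stand-in for the tree-building step.
labels = ["a"] * 10 + ["b"] * 5


def fit_majority(train_idx: List[int]) -> str:
    """'Model' that always predicts the most common class in the training part."""
    return Counter(labels[i] for i in train_idx).most_common(1)[0][0]


def accuracy(model: str, test_idx: List[int]) -> float:
    return sum(labels[i] == model for i in test_idx) / len(test_idx)


cv_accuracy = cross_validate(labels, 5, fit_majority, accuracy)
print(round(cv_accuracy, 3))
```

Dealing each class separately is what keeps the outcome distribution similar across the K subsets; here every fold receives two "a" cases and one "b" case.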

To develop a CART for classification, each predictor is chosen based on how well it separates the records with different predicted outcomes. The entropy metric is used to determine whether a split point for a given predictor is better than the others. Briefly, the CART algorithm splits the space of the independent variables into two separate hyper-rectangular regions according to a performance measure.
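
To make the comparison of split points concrete, the sketch below computes the Shannon entropy of a node and the reduction in entropy (information gain) achieved by a candidate split. The source does not give the exact form of its entropy metric, so the base-2 Shannon entropy used here is an assumption, and the function names and toy labels are hypothetical.

```python
import math
from collections import Counter
from typing import Sequence


def entropy(labels: Sequence[str]) -> float:
    """Shannon entropy in bits: 0 for a pure node, 1 for a 50/50 binary node."""
    n = len(labels)
    if n == 0:
        return 0.0
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(parent: Sequence[str],
                     left: Sequence[str],
                     right: Sequence[str]) -> float:
    """Parent entropy minus the size-weighted entropy of the two children."""
    n = len(parent)
    children = (len(left) * entropy(left) + len(right) * entropy(right)) / n
    return entropy(parent) - children


# Made-up node with four "yes" and four "no" records.
parent = ["yes"] * 4 + ["no"] * 4

# Split A separates the classes completely; split B does not separate them at all.
gain_a = information_gain(parent, ["yes"] * 4, ["no"] * 4)
gain_b = information_gain(parent, ["yes", "yes", "no", "no"], ["yes", "yes", "no", "no"])
print(gain_a, gain_b)
```

A split point with a larger gain is preferred, so split A would be chosen over split B in this toy case.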

Figure 3.16: A Typical CART Model for Classification

Notes: Ovals are the intermediate nodes and rectangles are terminal nodes; K₁, K₂ and K₃ are the splitting values of the variables Y₁, Y₂ and Y₃ respectively. Each intermediate node branches into Yes and No.

From the algorithmic point of view, CART uses a forward stepwise technique that adds model terms and a backward technique for pruning, selecting in the process the variables that are useful in the model. The output of the model is a hierarchical structure that consists of a series of “if-then” rules to predict the outcome of the dependent variable. For example, at each intermediate node (ovals in Figure 3.16) of the tree, a condition is

In document Tobacco consumption, environmental tobacco smoke exposure and illicit drug use: A study on selected south Asian countries / Mohammad Alamgir Kabir (Page 106-109)