
3.3 Supervised learning techniques

3.3.5 Decision trees

Decision trees are a well-known data mining method because they are easy to interpret and offer high accuracy. The method entails recursively partitioning the data set into discrete subsets, based on the value of a chosen variable (Paramasivam et al., 2014). A decision tree, such as the one displayed in Figure 3.5, comprises decision nodes connected by branches, extending from the root node, which is usually drawn at the top of the tree diagram, towards terminating leaf nodes. Starting at the root node, a variable is tested at each decision node, with each possible outcome represented by a branch that leads to another decision node or to a terminating leaf node. When the tree cannot be split further, no new nodes are grown (Larose, 2005). For example, in Figure 3.5, all the instances at the node where savings are equal to medium are classified as good credit risks, resulting in a 100% pure node with no further splitting options. This is not always the case, and therefore various methods exist to measure purity and decide on a cut-off value. Two leading decision-tree algorithms, namely classification and regression trees (CART) and C4.5, are introduced in this section.
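The notion of node purity used in the credit-risk example can be illustrated with a short Python sketch. The function names and the `threshold` cut-off are hypothetical and chosen only for illustration; here purity is taken to be the share of the majority class in a node, which is one simple measure among several.

```python
from collections import Counter


def node_purity(labels):
    """Return (majority_class, purity) for the class labels in one node.

    Purity is taken here as the share of the majority class (an assumption;
    other purity measures exist).
    """
    counts = Counter(labels)
    majority_class, majority_count = counts.most_common(1)[0]
    return majority_class, majority_count / len(labels)


def should_stop(labels, threshold=1.0):
    """Stop splitting once purity reaches the chosen cut-off value."""
    _, purity = node_purity(labels)
    return purity >= threshold


# A node where every instance is a good credit risk is 100% pure.
medium_savings = ["good", "good", "good", "good"]
print(node_purity(medium_savings), should_stop(medium_savings))
```

With a cut-off below 1.0, a node that is mostly but not entirely one class would also be declared a leaf.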

Problems to be modelled by decision trees have to be supervised learning problems, with pre-classified records that can be used for the learning set. The learning set must also contain a wide range of data points that are representative of those that will require classification in the future.


Figure 3.5: A general example of a decision tree to determine whether a potential customer is a good or bad credit risk. (Adapted from: Larose (2005).)

Decision trees rely heavily on learning from example (Larose, 2005). Tree-based methods, such as CART and C4.5, have a wide application field, especially in applied science, political science, speech recognition, marketing and biomedical and genetic research (Izenman, 2008). Decision trees have been used to predict the chance of survival for a patient suffering from breast cancer and to characterise types of skin conditions in adults and children (Agarwal & Tomar, 2013). In another case, a hybrid tree was developed to classify the activities of a patient suffering from a chronic disease (Agarwal & Tomar, 2013).

3.3.5.1 Classification and regression trees

CART is a nonparametric statistical method that mainly uses a recursive partitioning algorithm. Recursive partitioning is a stepwise process that forms a decision tree by deciding, at each node, whether or not to split it into two child nodes. One of the factors contributing to the popularity of the CART method is that its results can be interpreted and understood easily, because the algorithm asks a sequence of hierarchical Boolean questions. Classification and regression are both supervised data mining techniques, but differ in their output variables. For binary classification models, the dependent variable (Y) has a binary value, whereas with regression problems, Y is continuous (Bramer, 2007).

This method was first developed by Breiman et al. (1993), who described it as a binary tree-structured classifier. The name ‘CART’ came from the computer program that was used to implement the algorithm. An example of a six-class tree can be seen in Figure 3.6. From the figure, it can be seen that G = G2 ∪ G3 and, in the same way, G3 = G6 ∪ G7. Subsets that are not split are referred to as terminal subsets (rectangular nodes). A class label (j) is given to each terminal node, where two or more terminal nodes may have the same class label. The classifiers are determined by combining the terminal nodes that belong to a certain class, e.g. j1 = G15 and j4 = G6 ∪ G17. The splits are determined by conditions on the measurement vector (g); for example, the split into G2 and G3 may be of the form G2 = {g; g4 ≤ 7}, G3 = {g; g4 > 7} (Breiman et al., 1993).

The tree method predicts a class for g by determining into which subset g falls; for example, at split 1, g goes into G2 if g4 is less than or equal to seven, and so forth. When g reaches a terminal subset, it is assigned the class label of that subset.

Figure 3.6: Example of the classification tree developed for a heart attack study. (Adapted from: Breiman et al. (1993).)
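The way a measurement vector g is routed through the splits to a terminal subset can be sketched in Python as follows. This is a minimal illustration and not the implementation of Breiman et al. (1993): the nested-dictionary representation, the second split and its threshold, and the label names are all hypothetical. Only the first split (g4 ≤ 7) is taken from the text.

```python
def predict(node, g):
    """Route measurement vector g down the tree until a terminal node is reached."""
    while "label" not in node:
        # Go left when the tested component is <= the threshold, else right.
        branch = "left" if g[node["feature"]] <= node["threshold"] else "right"
        node = node[branch]
    return node["label"]


# Hypothetical two-level tree; only the root split (g4 <= 7) comes from the text.
tree = {
    "feature": 3,  # g4, using zero-based indexing
    "threshold": 7,
    "left": {"label": "j1"},
    "right": {
        "feature": 0,  # assumed second split on g1
        "threshold": 2.5,
        "left": {"label": "j2"},
        "right": {"label": "j1"},  # two terminal nodes may share a label
    },
}

print(predict(tree, [0.0, 0.0, 0.0, 7.0]))  # g4 <= 7, so g falls into the left subset
```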

Constructing a tree mainly comprises three steps, namely:

1. Selecting the splits;

2. Deciding when to stop splitting and declare a terminal node; and

3. Assigning a terminal node to a class.
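The three steps above can be sketched as a small recursive procedure in Python. This is an illustrative sketch under stated assumptions, not the CART program itself: splits are scored with the Gini impurity, the stopping rules (`max_depth`, `min_size`) are hypothetical choices, and the pruning that CART normally applies is omitted.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(rows, labels):
    """Step 1: pick the (feature, threshold) with the lowest weighted Gini impurity."""
    best = None  # (score, feature, threshold)
    n = len(rows)
    for f in range(len(rows[0])):
        # The largest value is skipped because it would leave the right child empty.
        for t in sorted({r[f] for r in rows})[:-1]:
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[0]:
                best = (score, f, t)
    return best


def build_tree(rows, labels, depth=0, max_depth=3, min_size=2):
    """Grow a binary tree by recursive partitioning."""
    split = None
    # Step 2: only look for a split if the node is impure, large enough and not too deep.
    if gini(labels) > 0 and len(labels) >= min_size and depth < max_depth:
        split = best_split(rows, labels)
    if split is None:
        # Step 3: a terminal node is assigned its majority class.
        return {"label": Counter(labels).most_common(1)[0][0]}
    _, f, t = split
    left = [i for i, r in enumerate(rows) if r[f] <= t]
    right = [i for i, r in enumerate(rows) if r[f] > t]
    return {
        "feature": f,
        "threshold": t,
        "left": build_tree([rows[i] for i in left], [labels[i] for i in left],
                           depth + 1, max_depth, min_size),
        "right": build_tree([rows[i] for i in right], [labels[i] for i in right],
                            depth + 1, max_depth, min_size),
    }


# Toy learning set with one numeric variable and two classes.
rows = [[1], [2], [8], [9]]
labels = ["a", "a", "b", "b"]
grown = build_tree(rows, labels)
print(grown)
```

On this toy learning set a single split at a threshold of 2 separates the two classes, so both child nodes are pure and become terminal nodes.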

Some statistical techniques are designed for homogeneous, small and structured data sets where the variables are of the same type. Data is homogeneous when the relationship between the variables is the same over the measurement space. CART, in turn, is designed for larger, more complex data sets, where complexity is characterised by high dimensionality, mixed data types, a non-standard data structure and, of most concern, non-homogeneity. In high-dimensional data, the data points are sparser and spread further apart. CART is described in more detail in Section 3.7.

3.3.5.2 C4.5

The C4.5 decision-tree algorithm is the successor of the ID3 algorithm, which was first developed for generating decision trees, and the predecessor of the C5 method, although C4.5 is more widely applied. Similar to CART, C4.5 also uses recursive partitioning; however, there are fundamental differences between the two when they are analysed in more detail (Larose, 2005). Firstly, unlike CART, C4.5 is not restricted to binary splits, and for categorical variables it produces a separate branch by default for each value of the categorical variable. This results in a ‘bushier’ tree, which is not always ideal, owing to some groups having a low frequency (Larose, 2005). Secondly, the C4.5 algorithm uses ‘information gain’ (entropy reduction) to determine the splits, whereas CART uses the simpler Gini impurity index.
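The two split criteria mentioned above can be compared with a brief Python sketch. It is an illustration only: the `savings` and `risk` values are made-up toy data, and the function names are hypothetical. The sketch computes the plain information gain of a multiway split (one branch per category, as C4.5 does by default); C4.5 as published by Quinlan additionally normalises this quantity into a gain ratio, which is not shown here.

```python
from collections import Counter, defaultdict
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def information_gain(values, labels):
    """Entropy reduction from a multiway split with one branch per distinct value."""
    groups = defaultdict(list)
    for v, y in zip(values, labels):
        groups[v].append(y)
    n = len(labels)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Made-up toy data: a three-valued categorical variable and a binary class.
savings = ["low", "low", "medium", "medium", "high", "high"]
risk = ["bad", "good", "good", "good", "good", "bad"]

print(round(entropy(risk), 4), round(gini(risk), 4))
print(round(information_gain(savings, risk), 4))
```

Both measures are zero for a pure node and largest when the classes are evenly mixed; the Gini index avoids the logarithm, which is one sense in which it is the simpler of the two.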