2.3 Machine Learning
2.3.1 Supervised Learning
2.3.1.1 Decision Trees
Decision trees are a very commonly used technique for approximating discrete-valued target functions [Qui86, RM05, SL90]. Decision trees classify instances by sorting them down the tree from the root to some leaf node, which provides the classification of the instance. Each node of the tree performs a test on some attribute of the instance, and each branch descending from the node corresponds to a possible value of that attribute. Decision trees can be applied to instances that can be represented as attribute-value pairs. The data set may contain errors and missing data. Figure 2.5 displays an example of a decision tree (related to the data set of Table 2.2).
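[Figure 2.5 (image not reproduced). Node labels: # Friends, Forecast, Temperature, Windy. Branch labels include < 6 and ≥ 6; Sunny, Cloudy and Rain; < 30 °C, < 10 °C and ≥ 10 °C; True and False. Leaves are labeled P or N.]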
Figure 2.5: Decision Tree Example
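The classification procedure described above (sorting an instance from the root down to a leaf) can be sketched in a few lines of Python. This is only an illustrative sketch: the `Leaf` and `Node` classes, the `classify` function and the toy tree are hypothetical names and data introduced here, and the tree is not a transcription of Figure 2.5.

```python
from dataclasses import dataclass, field
from typing import Dict, Union


@dataclass
class Leaf:
    label: str  # class assigned to every instance that reaches this leaf


@dataclass
class Node:
    attribute: str  # attribute tested at this node
    # one subtree per possible value of the tested attribute
    branches: Dict[str, "Tree"] = field(default_factory=dict)


Tree = Union[Leaf, Node]


def classify(tree, instance):
    """Sort the instance down the tree, from the root to a leaf."""
    while isinstance(tree, Node):
        value = instance[tree.attribute]  # value of the tested attribute
        tree = tree.branches[value]       # follow the matching branch
    return tree.label


# Hypothetical tree, used only to exercise the traversal
toy_tree = Node("Forecast", {
    "Sunny": Leaf("P"),
    "Cloudy": Node("Windy", {"True": Leaf("N"), "False": Leaf("P")}),
    "Rain": Leaf("N"),
})

print(classify(toy_tree, {"Forecast": "Cloudy", "Windy": "False"}))  # P
```

Each instance is a set of attribute-value pairs (here a dictionary), which is the representation the text requires for decision trees to be applicable.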
The most common type of decision tree requires the target function to assume only discrete values (distinct classes), thus performing classification. Extensions of decision tree learning that are capable of dealing with real values, known as regression trees, also exist [BFSO84, WW96, OW03, XWVA05]. Broadly speaking, most regression tree techniques differ from decision trees only in having real values rather than classes at the leaves. Nevertheless, decision trees tend to perform better when dealing with categorical features and outputs.
The canonical algorithm in the literature for building decision trees is C4.5 [Qui14], proposed by Quinlan as an extension of his earlier algorithm ID3 [Qui79]. To tackle the issue of continuous outputs, Quinlan also proposed a related algorithm, M5 [Q+92].
While regression trees have values at their leaves, M5 generates "model" trees, whose leaves hold multivariate linear models. Model trees tend to be smaller than regression trees.
Many other approaches rely on the core algorithm defined by ID3 and C4.5, which consists of a top-down, greedy search through the space of decision trees. Starting from the tree root, an attribute is selected and a statistical test is performed to evaluate the impact of the attribute on the classification. For each possible value of the selected attribute a descendant node is created, and the training set instances are directed to the appropriate node (down the branch corresponding to the instance's value of the attribute). The process is repeated using the examples in the training set associated with each descendant node, each time selecting the best attribute at the current tree level. At each node the attribute is selected through a heuristic, and the decision is never reconsidered (no backtracking).
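The top-down greedy search can be sketched as follows. This is a minimal ID3-style sketch, not Quinlan's implementation: it assumes categorical attributes only, uses information gain (Equation 2.14) as the selection heuristic, and does no pruning. The function names (`entropy`, `information_gain`, `build_tree`) and the toy data are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy of a list of class labels (Equation 2.12)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(rows, labels, attribute):
    """Expected reduction in entropy from splitting on `attribute`."""
    n = len(labels)
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attribute], []).append(label)
    remainder = sum(len(p) / n * entropy(p) for p in partitions.values())
    return entropy(labels) - remainder


def build_tree(rows, labels, attributes):
    """Greedy top-down induction; a leaf is a label, a node is a tuple."""
    # Stop when the node is pure or no attribute is left: majority class
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]
    # Greedy choice: best attribute at this level, never revisited
    best = max(attributes, key=lambda a: information_gain(rows, labels, a))
    remaining = [a for a in attributes if a != best]
    children = {}
    for value in {row[best] for row in rows}:
        subset = [(r, l) for r, l in zip(rows, labels) if r[best] == value]
        sub_rows, sub_labels = zip(*subset)
        children[value] = build_tree(list(sub_rows), list(sub_labels), remaining)
    return (best, children)


# Hypothetical training set: the class depends only on Forecast
rows = [
    {"Forecast": "Sunny", "Windy": "False"},
    {"Forecast": "Sunny", "Windy": "True"},
    {"Forecast": "Rain", "Windy": "False"},
    {"Forecast": "Rain", "Windy": "True"},
]
labels = ["P", "P", "N", "N"]

tree = build_tree(rows, labels, ["Forecast", "Windy"])
print(tree)
```

On this toy data the root test is on Forecast, because splitting on it separates the two classes completely, while splitting on Windy leaves both partitions evenly mixed.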
A crucial component of the algorithm is detecting the most "informative" attributes, i.e. the attributes that are most relevant for classifying an instance and that should therefore be tested first, ideally at the root of the decision tree. A statistical property called information gain has been introduced to measure how well a given attribute separates the data set instances according to their target classification. To understand the most widespread metric for information gain, the concept of entropy must first be introduced. Entropy is a concept derived from information theory that measures the average amount of information necessary to identify the class of an example in a data set [Jay57, CT12, And08]. Given a set S of examples and the classes c_j, the entropy is measured as:
$$\mathrm{Entropy}(S) = -\sum_{j=1}^{K} \frac{\mathrm{freq}(c_j, S)}{|S|} \times \log_2 \frac{\mathrm{freq}(c_j, S)}{|S|} \tag{2.12}$$
where the target attribute can assume K different values (classes) and freq(c_j, S) is the number of examples of class c_j in the set S, so that freq(c_j, S)/|S| is the relative frequency of the class (i.e. the number of examples of the class divided by the total number of examples). Entropy can also be seen as a measure of the impurity of a collection of examples: higher entropy corresponds to a more mixed set. In the case of two classes, given the proportions of negative and positive examples p− and p+ (with p− = 1 − p+), the equation can be rewritten as:
$$\mathrm{Entropy}(S) = -p_+ \times \log_2 p_+ - p_- \times \log_2 p_- \tag{2.13}$$
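As a quick check of the two-class formula, the following worked values may help. The class proportions are hypothetical (they are not taken from Table 2.2), and the usual convention $0 \times \log_2 0 = 0$ is assumed.

```latex
% Evenly mixed set: maximum entropy for two classes
p_+ = p_- = \tfrac{1}{2}: \quad
  \mathrm{Entropy}(S) = -\tfrac{1}{2}\log_2\tfrac{1}{2} - \tfrac{1}{2}\log_2\tfrac{1}{2} = 1

% Pure set: a single class, no information needed
p_+ = 1,\; p_- = 0: \quad
  \mathrm{Entropy}(S) = -1 \times \log_2 1 - 0 \times \log_2 0 = 0

% Hypothetical set with 9 positive and 5 negative examples
p_+ = \tfrac{9}{14},\; p_- = \tfrac{5}{14}: \quad
  \mathrm{Entropy}(S) = -\tfrac{9}{14}\log_2\tfrac{9}{14} - \tfrac{5}{14}\log_2\tfrac{5}{14} \approx 0.940
```

The entropy is therefore 0 for a set containing a single class and reaches its two-class maximum of 1 when the classes are equally represented.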
Having introduced entropy, we can define the information gain produced by a test on an attribute A as the expected reduction in entropy caused by partitioning the examples according to this attribute:
$$\mathrm{Gain}(S, A) = \mathrm{Entropy}(S) - \sum_{v \in \mathrm{Vals}(A)} \frac{|S_v|}{|S|} \, \mathrm{Entropy}(S_v) \tag{2.14}$$
where Vals(A) is the set of possible values for attribute A and S_v is the subset of S for which attribute A has value v. The second term is the expected value of the entropy after S is partitioned using attribute A. Information gain is the metric used by ID3 to select the best attribute, giving preference to attributes with higher information gain; C4.5 relies on the same quantity, normalized into a gain ratio.
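A small worked instance of the gain computation, under purely hypothetical counts (not taken from Table 2.2): a set S of 14 examples (9 positive, 5 negative) and an attribute A with two values, where $S_{v_1}$ holds 8 examples (6 positive, 2 negative) and $S_{v_2}$ holds 6 examples (3 positive, 3 negative).

```latex
\mathrm{Entropy}(S)       = -\tfrac{9}{14}\log_2\tfrac{9}{14} - \tfrac{5}{14}\log_2\tfrac{5}{14} \approx 0.940
\mathrm{Entropy}(S_{v_1}) = -\tfrac{6}{8}\log_2\tfrac{6}{8} - \tfrac{2}{8}\log_2\tfrac{2}{8} \approx 0.811
\mathrm{Entropy}(S_{v_2}) = -\tfrac{3}{6}\log_2\tfrac{3}{6} - \tfrac{3}{6}\log_2\tfrac{3}{6} = 1

\mathrm{Gain}(S, A) = \mathrm{Entropy}(S)
    - \tfrac{8}{14}\,\mathrm{Entropy}(S_{v_1})
    - \tfrac{6}{14}\,\mathrm{Entropy}(S_{v_2})
  \approx 0.940 - 0.463 - 0.429 \approx 0.048
```

A gain this close to zero indicates that partitioning on A leaves the subsets almost as mixed as the original set, so such an attribute would be a poor candidate for the root test.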
A decision tree, or any learned hypothesis h, is said to overfit the training data if another hypothesis h′ exists that has a larger error than h when tested on the training data, but a smaller error than h when tested on the entire data set. For example, a hypothesis listing only the positive examples of the training set is equivalent to a rule that memorizes the training sample, and thus has a very small (null) error on the training set. The drawback is that such a rule can be relied upon to predict the class of an example only if the example already appeared in the training set. Consequently, the error on the entire data set would be much greater. More generally, overfitting is a concern because learning algorithms typically optimize over the training sample. The two most common approaches to tackle overfitting are: 1) stopping the training algorithm before the point at which the learned model perfectly fits the data; 2) pruning the induced decision tree [BKK+98]. Several analyses have been carried out to identify the best pruning methods [BA97, Bru00, Elo99].
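The memorizing hypothesis mentioned above can be made concrete with a short sketch. The helper names (`make_memorizer`, `error_rate`) and the two small data sets are hypothetical and serve only to show the gap between the error on the training set and the error on the entire data set.

```python
def make_memorizer(training_set):
    """Hypothesis that only lists the positive training examples."""
    positives = {x for x, label in training_set if label == "P"}
    # Predict "P" only for instances seen as positive during training
    return lambda x: "P" if x in positives else "N"


def error_rate(hypothesis, data):
    """Fraction of examples in `data` that the hypothesis misclassifies."""
    wrong = sum(1 for x, label in data if hypothesis(x) != label)
    return wrong / len(data)


# Hypothetical examples: (Forecast, Temperature) -> class
train = [
    (("Sunny", "Warm"), "P"),
    (("Rain", "Cold"), "N"),
    (("Cloudy", "Warm"), "P"),
]
unseen = [
    (("Sunny", "Cold"), "P"),
    (("Cloudy", "Cold"), "P"),
    (("Rain", "Warm"), "N"),
]

h = make_memorizer(train)
print(error_rate(h, train))           # 0.0: perfect fit on the training set
print(error_rate(h, train + unseen))  # 2 of 6 examples misclassified
```

The rule fits the training sample perfectly but misclassifies every positive example it has not seen, which is the behavior that early stopping and pruning are meant to prevent.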
A commonly appreciated aspect of decision trees is their high human readability: it is possible to look at a decision tree and understand why the learned model classifies a certain instance as belonging to a certain class. Another great advantage of decision trees is their ability to deal with incomplete information, i.e. instances with missing feature values. One simple strategy for dealing with a missing attribute value is to assign it the value that is most common among the training examples at the tree node [Min89]. A more sophisticated strategy consists of assigning a probability to each of the possible values that the attribute can assume. These probabilities are computed using the observed frequencies of the various values among the training set instances. This is the strategy adopted by C4.5.
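The two strategies for missing attribute values can be sketched as follows. This is an illustrative sketch rather than C4.5's actual code: the function names and the toy examples are hypothetical, and a missing value is assumed to be encoded as `None`.

```python
from collections import Counter


def most_common_value(examples, attribute):
    """Strategy 1: most frequent observed value of `attribute` at the node."""
    observed = [e[attribute] for e in examples if e.get(attribute) is not None]
    return Counter(observed).most_common(1)[0][0]


def value_probabilities(examples, attribute):
    """Strategy 2: probability of each value, from observed frequencies."""
    observed = [e[attribute] for e in examples if e.get(attribute) is not None]
    counts = Counter(observed)
    return {value: count / len(observed) for value, count in counts.items()}


# Hypothetical training examples reaching a node that tests "Windy";
# the last example has a missing value
node_examples = [
    {"Windy": "True"},
    {"Windy": "False"},
    {"Windy": "False"},
    {"Windy": "False"},
    {"Windy": None},
]

print(most_common_value(node_examples, "Windy"))    # False
print(value_probabilities(node_examples, "Windy"))  # {'True': 0.25, 'False': 0.75}
```

With the first strategy the incomplete example is treated as if Windy were "False". With the second, it can be sent down both branches with fractional weights 0.25 and 0.75, so that it contributes proportionally to the statistics of each descendant node.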