Decision Tree - Popular Classification Algorithms

2.2 Popular Classification Algorithms

2.2.1 Decision Tree

Decision tree [49] is one of the most commonly used classification algorithms. Based on the training data, the decision tree algorithm builds a top-down tree structure to fit the data. Basically, the tree construction follows recursive steps starting from the root. Each time, the data is split by the current best attribute, until the distribution of examples on each node of the tree becomes pure (belonging to the same class) or satisfies a certain condition. When a new example is coming, its attributes will be tested from the root until the leaf node is reached, where the class of the example can be determined.

For example, suppose we would like to predict whether or not we will enjoy playing tennis today, according to some input attributes such as the outlook, windy or not, humidity, etc. Based on a training set consisting of previous playing records as shown in Table 2.1, we can build a decision tree as shown in Figure 2.3. We can see, each internal (including root) node of the tree is associated with an attribute, while each leaf node is connected to a class. For example, on the root node, according to the value of the attributeoutlook, the whole training data can be divided into three parts (outlook==rainy, outlook==overcast and outlook==sunny). Two of those three parts are further divided, while the other one (outlook==overcast) directly 8_{In the rest of the thesis,}_{positive documents (or examples)}_{mean the documents (or examples) belonging to} positive class, andnegative documents (or examples)mean the documents (or examples) belonging to negative class

2.2. PopularClassificationAlgorithms 25

outlook temperature humidity windy play or not

sunny hot high FALSE no

sunny hot high TRUE no

overcast hot high FALSE yes

rainy mild high FALSE yes

rainy cool normal FALSE yes

rainy cool normal TRUE no

overcast cool normal TRUE yes

sunny mild high FALSE no

sunny cool normal FALSE yes

rainy mild normal FALSE yes

sunny mild normal TRUE yes

overcast mild high TRUE yes

overcast hot normal FALSE yes

rainy mild high TRUE no

Table 2.1: Weather dataset.

leads to a leaf node (the class“YES”). It means, when the outlook is overcast, we always enjoyed playing tennis according to the previous data. After the decision tree is constructed, if today is sunny but too humid, the classifier will output “NO”, meaning that we will probably not enjoy the tennis playing today.

The key step in building a decision tree is to find the best attribute to split the data on each internal node. There are many proposed metrics [46, 41] to accomplish it. The most typical and commonly used metric isentropy, which is used to calculate the uncertainty associated with a random variable. In this case, the class attribute of each example can be the random variable. If the class attribute hasndifferent values, then the entropy of an example setXcan be defined as Entropy(X)= n X i=1 −pi×log2pi,

where piis the proportion ofXbelonging to classi.

Generally, the purer the class on a node, the less uncertain the class is, and the lower the value of entropy. The best attribute is the attribute that produces the greatestinformation gain, which means the reduction in the entropy. Mathematically, the information gain of an attribute

26 Chapter2. Review ofPreviousWork

Figure 2.3: Decision tree structure.

Aon a set of examplesXcan be defined as

Gain(X,A)= Entropy(X)− X

v=v(A) |Xv|

|X|Entropy(Xv),

wherev(A) is the set of all possible values for the attribute Aand Xv is the subset of X such thatA= v. For all the attributes, we simply choose the one with maximumGain(X,A) for each internal node to split the current data on the node.

Another issue with decision tree is how to avoid overfitting in decision tree? Decision tree can always perfectly classify the training examples by building a large (deep) decision tree. However, this approach sometimes overfits the training examples when we have noise in the training data, or when the size of training data is too small. To tackle this problem, usually a pruning step is adopted. Pruning typically removes the small branches in the tree, which are likely to be unreliable. It can make the decision tree more robust to noise or more accurate for unseen examples. More specifically, it uses a post-pruning rule as follows:

1. Build a decision tree following the above algorithm described. 2. Convert the tree into an equivalent set of rules.

3. Prune each rule by removing any preconditions that result in improving its estimated accuracy.

4. Sort the pruned rules according to estimated accuracies. The advantages of decision tree can be listed as follows.

1. White box model. The model can be easily interpreted and gives good intuitions to the decision maker.

2.2. PopularClassificationAlgorithms 27

2. Rules generation. The decision tree can be converted into simple and useful rules, where experts have difficulties to formalize their knowledge.

There are also some disadvantages of decision tree.

1. Hard to model complex concept. Normally, the rule-based decision tree has difficulties to model the complex target concepts with many relevant attributes. It is due to the fact that the reliability of the rules generated by decision tree strongly correlates with the number of observed data. On the deep level of the decision tree, the number of training data significantly decreases, thus the rules become unreliable.

2. High time complexity for large feature set. The decision tree can be very efficient when the attribute set is small. However, when there are millions of attributes, in order to find the best attribute on each node, the linear comparison for all the attributes is very costly. If f is the number of the features andpis the number of the nodes on the tree, the time complexity can reachO(f × p).

To build our search engine SEE, we need to classify the text contents in the webpages, which usually have millions of attributes (words). Thus, as we mentioned above, decision tree is not a good choice.

In document A New Web Search Engine with Learning Hierarchy (Page 35-38)