
CHAPTER 2 Automatic Text Classification: A Review

2.2 Components of an ATC System

2.2.3 Machine Learning algorithms for TC task

2.2.3.4 Decision Tree classifiers

The Decision Tree (DT) is an ML classifier that takes the form of a tree constructed from a collection of training instances. The decision tree is a directed tree made up of nodes. The node that has no incoming edge is called the root node; every other node has exactly one incoming edge. The feature tested at the root node is selected according to the information values of the features, and it splits the training instances into two or more sub-spaces, one for each possible value. Each node in the tree thus holds part of the information needed to classify new instances. These information values are calculated using different FS techniques such as IG (discussed in 2.2.3.1).

A node with outgoing edges is called an internal node. Each internal node divides the training instances that reach it into two or more sub-spaces, one for each possible value of the feature it tests. A node without outgoing edges is called a leaf node, and each leaf node is assigned to one category. To classify an instance using the DT, the classifier tests the instance at each node, starting from the root node, and follows the matching branches down until a leaf node is reached. The category of the instance is indicated by that leaf node [29, 44]. The most well-known algorithm in the literature for building DT is C4.5, which uses an IG-based splitting criterion (gain ratio) [45].


The DT characteristics have been found to fit a number of practical problems, for example classifying diseases based on medical cases, assessing the credit risk of loan applicants by their likelihood of defaulting on payments, diagnosing equipment malfunctions by their cause, detecting advertisements on the web, and identifying spam emails [20, 44].

Take the example of a real-world problem: predicting whether a loan applicant will repay their loan. The training set contains records of previous borrowers, identified by ID numbers 1 to 10, as shown in Table 2.2. Each record contains personal information about the borrower, namely “Home Owner”, “Marital Status”, and “Annual Income”, which form the feature set (F1, F2, and F3) [46].

Table 2.2 Training set for predicting borrowers who will default on loan payments

      F1           F2               F3              Category
ID    Home Owner   Marital Status   Annual Income   Defaulted Borrower
1     Yes          Single           125k            No
2     No           Married          100k            No
3     No           Single           70k             No
4     Yes          Married          120k            No
5     No           Divorced         95k             Yes
6     No           Married          60k             No
7     Yes          Divorced         220k            No
8     No           Single           85k             Yes
9     No           Married          75k             No
10    No           Single           90k             Yes

These records fall into two categories, “Yes” and “No”, which indicate whether the borrower failed to repay their loan and became a “Defaulted Borrower” (“Yes”) or successfully repaid the loan and is not a “Defaulted Borrower” (“No”).

A DT classifier is constructed from the training examples in Table 2.2 in order to classify new test records. First, the best feature for dividing the training examples into different branches is chosen by calculating the information value of each feature using an FS technique (e.g. IG). In this example, the F1 feature “Home Owner” is taken as the root node of the DT, as shown in Figure 2.6.
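The feature-scoring step can be sketched in a few lines of Python. This is a minimal illustration, assuming plain IG (entropy reduction) with one branch per feature value and using only the two categorical features of Table 2.2; the names `RECORDS`, `entropy`, and `information_gain` are hypothetical and not taken from the cited literature.

```python
from collections import Counter
from math import log2

# Records from Table 2.2: (Home Owner, Marital Status, Defaulted Borrower).
RECORDS = [
    ("Yes", "Single", "No"),
    ("No", "Married", "No"),
    ("No", "Single", "No"),
    ("Yes", "Married", "No"),
    ("No", "Divorced", "Yes"),
    ("No", "Married", "No"),
    ("Yes", "Divorced", "No"),
    ("No", "Single", "Yes"),
    ("No", "Married", "No"),
    ("No", "Single", "Yes"),
]


def entropy(labels):
    """Shannon entropy (in bits) of a list of category labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(records, feature_index):
    """Entropy reduction obtained by splitting on one categorical feature."""
    labels = [r[-1] for r in records]
    # Group the category labels by the value of the chosen feature.
    groups = {}
    for r in records:
        groups.setdefault(r[feature_index], []).append(r[-1])
    # Weighted average entropy of the sub-spaces created by the split.
    remainder = sum(len(g) / len(records) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


ig_home_owner = information_gain(RECORDS, 0)
ig_marital_status = information_gain(RECORDS, 1)
print(f"IG(Home Owner)     = {ig_home_owner:.3f}")
print(f"IG(Marital Status) = {ig_marital_status:.3f}")
```

The scores printed by this sketch are specific to plain IG with multiway splits. Other criteria (for instance gain ratio) or binary groupings of feature values can rank the features differently, so the feature placed at the root depends on the algorithm used.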

Home Owner
  Yes: Defaulted = No
  No: Marital Status
    Married: Defaulted = No
    Single, Divorced: Annual Income
      < 80K: Defaulted = No
      >= 80K: Defaulted = Yes

Figure 2.6 DT constructed from the training examples

The root node F1 has two possible values: “Yes” and “No”. The value “Yes”, where the borrowers are home owners, leads to a leaf node with category “No”, which means that they successfully repaid their loans. The value “No”, meaning that the borrowers are not home owners, leads to a child node with feature F2 “Marital Status”. This child node has three possible values: “Single”, “Divorced”, or “Married”. The value “Married” leads to a leaf node with category “No”, since the married borrowers successfully repaid their loans. Both “Single” and “Divorced” lead to the next child node with feature F3 “Annual Income”. The value of feature F3 leads to two leaf nodes: where the annual income of the borrower is less than 80K, the category is “No”; where the income is 80K or more, the category is “Yes”. For example, to classify the borrower record (ID 11) shown in Table 2.3 using the DT classifier shown in Figure 2.6, the DT classifier compares the feature values in the test example with the values at each node in the tree to choose the right branches until it reaches a leaf node, which gives the classification category, as shown in Figure 2.7. The bold red line in Figure 2.7 shows how the test example is categorized using the DT classifier.

Table 2.3 Test example for an applicant with personal information to predict the category of the borrower

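      F1           F2               F3              Category
ID    Home Owner   Marital Status   Annual Income   Defaulted Borrower
11    No           Single           125k            ?

Home Owner
  Yes: Defaulted = No
  No: Marital Status
    Married: Defaulted = No
    Single, Divorced: Annual Income
      < 80K: Defaulted = No
      >= 80K: Defaulted = Yes

(Path of the test example, drawn as a bold red line in the original figure: Home Owner = No, Marital Status = Single, Annual Income >= 80K, Defaulted = Yes)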

Figure 2.7 Illustration of how the DT classifies a test example
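The traversal described above can be sketched in Python. This is only an illustration of the tree in Figure 2.6 applied to the test example of Table 2.3; the nested-dictionary layout and the names `TREE`, `classify`, `marital_group`, and `income_band` are assumptions made for this sketch, and the annual income is expressed in thousands.

```python
def marital_group(status):
    """Map a marital status to the branch label used in Figure 2.6."""
    return "Married" if status == "Married" else "Single, Divorced"


def income_band(income_k):
    """Map an annual income (in thousands) to the branch label of Figure 2.6."""
    return "<80K" if income_k < 80 else ">=80K"


# Internal nodes are dictionaries; leaf nodes are plain category strings.
TREE = {
    "feature": "Home Owner",
    "branches": {
        "Yes": "No",
        "No": {
            "feature": "Marital Status",
            "key": marital_group,
            "branches": {
                "Married": "No",
                "Single, Divorced": {
                    "feature": "Annual Income",
                    "key": income_band,
                    "branches": {"<80K": "No", ">=80K": "Yes"},
                },
            },
        },
    },
}


def classify(node, record):
    """Walk from the root to a leaf; return (category, list of branches taken)."""
    path = []
    while isinstance(node, dict):
        raw_value = record[node["feature"]]
        # Nodes without a "key" function branch directly on the raw value.
        branch = node.get("key", lambda v: v)(raw_value)
        path.append(f'{node["feature"]} = {branch}')
        node = node["branches"][branch]
    return node, path


# Test example from Table 2.3 (ID 11).
applicant_11 = {"Home Owner": "No", "Marital Status": "Single", "Annual Income": 125}
category, path = classify(TREE, applicant_11)
print(" -> ".join(path), "=> Defaulted =", category)
```

The returned path lists the branches taken from the root to the leaf, which is what makes a DT prediction easy to inspect.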

One of the advantages of DT is that the representation of the model is self-explanatory and easy to understand. It is clear why a classified instance belongs to a specific category [47].

The DT uses the divide-and-conquer method to divide the training space when building the classification tree. Even though this method is quick, efficiency can become an important concern when hundreds of thousands of training instances are involved. The most time-consuming step is sorting the instances on a feature to find the best split. In addition, decision trees perform well when only a few relevant features are involved, but they are less effective when many complex relevant features are present. DT is also over-sensitive to the training set and to noise [45, 47].
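The sorting step mentioned above can be illustrated with a short Python sketch that searches for a threshold on a numeric feature. It assumes IG as the scoring criterion and midpoints between consecutive sorted values as candidate thresholds; the function name `best_threshold` is hypothetical. The data are the four records of Table 2.2 (IDs 3, 5, 8, and 10) that reach the “Annual Income” node of Figure 2.6, with income in thousands.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of category labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def best_threshold(values, labels):
    """Return (threshold, gain) maximising IG for the test `value < threshold`."""
    pairs = sorted(zip(values, labels))  # the sort is the O(n log n) step
    base = entropy(labels)
    best = (None, 0.0)
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # equal values cannot be separated by a threshold
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        # For brevity the entropies are recomputed at every candidate;
        # practical implementations update class counts incrementally.
        remainder = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        gain = base - remainder
        if gain > best[1]:
            best = (threshold, gain)
    return best


# Non-home-owner, single or divorced borrowers from Table 2.2 (IDs 3, 5, 8, 10).
incomes = [70, 95, 85, 90]
defaulted = ["No", "Yes", "Yes", "Yes"]
threshold, gain = best_threshold(incomes, defaulted)
print(f"best threshold = {threshold}k, gain = {gain:.3f}")
```

Under these assumptions the sketch picks the midpoint between 70k and 85k. Any threshold in that interval, including the 80K used in Figure 2.6, separates these four records in the same way.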

The DT suffers from the problem of over-fitting, which happens when the ML algorithm continues to develop hypotheses that reduce the training set error at the cost of an increased test set error [20]. Over-fitting in DT can be mitigated either by pruning the tree after it has been fully grown or by stopping its growth before it reaches its maximum size and over-fits the training examples [20, 44].
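As a rough illustration of limiting tree size, the Python sketch below caps the depth of the tree from Figure 2.6 and replaces each cut-off subtree with the majority category of the training records from Table 2.2 that reach it. The depth cap is only one simple stopping rule, chosen here as an assumption; the names `cap_depth`, `majority`, and `training_errors` are hypothetical, and the sketch measures training error only, since no separate test set is available.

```python
from collections import Counter

# Table 2.2: (Home Owner, Marital Status, Annual Income in thousands, Defaulted).
RECORDS = [
    ("Yes", "Single", 125, "No"), ("No", "Married", 100, "No"),
    ("No", "Single", 70, "No"), ("Yes", "Married", 120, "No"),
    ("No", "Divorced", 95, "Yes"), ("No", "Married", 60, "No"),
    ("Yes", "Divorced", 220, "No"), ("No", "Single", 85, "Yes"),
    ("No", "Married", 75, "No"), ("No", "Single", 90, "Yes"),
]


def by_home_owner(record):
    return record[0]


def by_marital_status(record):
    return "Married" if record[1] == "Married" else "Single, Divorced"


def by_income(record):
    return "<80K" if record[2] < 80 else ">=80K"


# Figure 2.6: internal node = (routing function, branches); leaf = category.
TREE = (by_home_owner, {
    "Yes": "No",
    "No": (by_marital_status, {
        "Married": "No",
        "Single, Divorced": (by_income, {"<80K": "No", ">=80K": "Yes"}),
    }),
})


def majority(records):
    """Most frequent category among the given training records."""
    return Counter(r[-1] for r in records).most_common(1)[0][0]


def cap_depth(node, records, max_depth):
    """Cut the tree at max_depth, turning cut-off subtrees into majority leaves."""
    if isinstance(node, str):
        return node
    if max_depth == 0:
        return majority(records)
    route, branches = node
    return (route, {
        branch: cap_depth(child, [r for r in records if route(r) == branch], max_depth - 1)
        for branch, child in branches.items()
    })


def classify(node, record):
    while not isinstance(node, str):
        route, branches = node
        node = branches[route(record)]
    return node


def training_errors(tree):
    """Number of Table 2.2 records that the tree misclassifies."""
    return sum(classify(tree, r) != r[-1] for r in RECORDS)


for depth in range(4):
    print(depth, training_errors(cap_depth(TREE, RECORDS, depth)))
```

Smaller trees make more errors on the training set; whether they generalize better can only be judged on held-out data, which this sketch does not include.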