
2.3 Knowledge Representation and Data Mining

2.3.1 Decision Tree

Decision Trees (DTs) are a widely used classification technique due to their simplicity, interpretability and ability to generate explanations. In addition, if desired, they can easily be converted into a rule format [216]. In the context of classification, a decision tree is a tree structure in which the root and body nodes represent alternatives while the leaf nodes represent individual classifications. More specifically, each root/body node represents an attribute, and the connections to its child nodes represent potential individual attribute values or groups of values. A DT can therefore be said to be a tree-based classifier. In a binary DT there can only be two alternatives at each root/body node; in other forms of DT there may be many alternatives emanating from root and body nodes. When constructing a decision tree, the challenge lies in selecting which attribute is to be represented by which node and how to split the range of potential values that an attribute might have. Once a DT is constructed, classifying a new data item is straightforward: starting from the root, a route is followed through the DT until one of the leaves (classes) is reached. Typically DT construction is top down, following a “greedy” search process with no backtracking, based on a “divide and conquer” strategy in which the training set is partitioned recursively into subsets according to some splitting criterion. Various splitting criteria have been proposed. Popular measures include Information Gain, Gini Index and Gain Ratio (for more information see [55, 94]). A variety of decision tree generation algorithms have also been proposed.
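The root-to-leaf classification procedure described above can be sketched in a few lines of Python. The tree below is a hypothetical toy example written as nested tuples and dictionaries; its attribute names and shape are assumptions made for illustration and are not taken from any figure in this thesis.

```python
# Hypothetical toy tree: an internal node is (attribute, {value: subtree}),
# and a leaf is simply a class label string.
toy_tree = ("Age", {
    "Young": ("Gender", {"Male": "Yes", "Female": "No"}),
    "Middle": "Yes",
    "Senior": ("Status", {"True": "Yes", "False": "No"}),
})

def classify(tree, record):
    """Follow the single root-to-leaf path selected by the record's values."""
    while not isinstance(tree, str):      # stop once a leaf (class label) is reached
        attribute, branches = tree
        tree = branches[record[attribute]]
    return tree

new_customer = {"Age": "Senior", "Status": "False", "Gender": "Male"}
print(classify(toy_tree, new_customer))   # No
```

Only the attributes that lie on the chosen path are inspected, which is why classification with a DT is cheap once the tree has been built.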

With respect to the work described in this thesis the C4.5 algorithm [173] was adopted, as it is widely regarded as a benchmark DT classifier in the data mining community. C4.5 bases its splitting criterion on Information Gain (IG), which in the standard algorithm is normalised to give the Gain Ratio; the description here uses plain IG, whereby the attribute with the highest information gain is selected for the current node. IG is calculated using Equation 2.1:

$$IG(D, X) = \mathrm{Entropy}(D) - \mathrm{Entropy}(D, X) \tag{2.1}$$

where IG(D, X) is the information gain for the data set D with respect to attribute X. Entropy for the data set D is calculated using Equation 2.2.

$$\mathrm{Entropy}(D) = \sum_{i=1}^{|c|} -p_i \log_2 p_i \tag{2.2}$$

where $p_i$ is the probability of class $i \in c$. Normally $p_i = |c_{i,D}| / |D|$, where $|c_{i,D}|$ is the number of records in the data set D that belong to class $i$. Entropy is a measure of the impurity (lack of homogeneity) of a given data set. With base-2 logarithms and two classes, as in the example below, $0 \le \mathrm{Entropy}(D) \le 1$; more generally the upper bound is $\log_2 |c|$. If Entropy(D) = 0, all the records belong to the same class and the outcome is therefore certain. If, on the other hand, Entropy(D) takes its maximum value (1 in the two-class case), the data set is totally heterogeneous and all classes are equally likely.
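As a minimal sketch of Equation 2.2, the function below computes the entropy of a data set from its per-class record counts. The use of base-2 logarithms is an assumption, chosen because it is consistent with the value 0.94 quoted for the 9/5 class split in the worked example; the function name and the convention that a class with no records contributes nothing are likewise illustrative choices.

```python
import math

def entropy(class_counts):
    """Entropy (base 2) of a data set, given the number of records per class."""
    total = sum(class_counts)
    result = 0.0
    for count in class_counts:
        if count > 0:                      # 0 * log(0) is taken to be 0
            p = count / total
            result -= p * math.log2(p)
    return result

print(entropy([14, 0]))    # a single class: no uncertainty
print(entropy([7, 7]))     # two equally likely classes
print(entropy([9, 5]))     # the 9 Yes / 5 No split of the worked example
print(entropy([4, 4, 4]))  # three equally likely classes
```

The first two calls give the extreme values 0 and 1 for a two-class problem, and the third gives approximately 0.94. The last call gives approximately 1.585, that is $\log_2 3$, which shows that the upper bound of 1 applies only when there are two classes.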

IG is thus a measure of the expected reduction in entropy obtained by splitting on a given attribute. In other words, IG indicates the “importance” of a given attribute with respect to the DT construction process. In the context of Equation 2.1, the importance of an attribute is determined by comparing the entropy of the data set before and after splitting on that attribute. The same calculation is made for the complete set of attributes, and the attribute that maximises information gain is selected for the DT node in question.
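The attribute selection step can be illustrated with a short sketch of Equation 2.1. The "after splitting" term is taken to be the size-weighted average of the subset entropies, and the four-record data set, its attribute names and the function names are hypothetical, introduced only to make the behaviour easy to check by hand.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Entropy (base 2) of a list of class labels."""
    total = len(labels)
    return sum(-(n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(records, attribute, target):
    """Entropy before the split minus the size-weighted entropy after it."""
    before = entropy([r[target] for r in records])
    subsets = defaultdict(list)
    for r in records:
        subsets[r[attribute]].append(r[target])
    after = sum(len(s) / len(records) * entropy(s) for s in subsets.values())
    return before - after

# Hypothetical data: "shape" separates the classes perfectly,
# whereas "colour" says nothing about the class.
toy = [
    {"shape": "round",  "colour": "red",  "label": "A"},
    {"shape": "round",  "colour": "blue", "label": "A"},
    {"shape": "square", "colour": "red",  "label": "B"},
    {"shape": "square", "colour": "blue", "label": "B"},
]
best = max(["shape", "colour"], key=lambda a: information_gain(toy, a, "label"))
print(best)   # shape
```

Splitting on "shape" removes all uncertainty (a gain of 1 bit), while splitting on "colour" leaves every subset as mixed as the original data (a gain of 0), so "shape" is the attribute that would be placed at the node.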

Example: The following example illustrates the process of constructing the DT for a given data set (D) where IG is adopted as the splitting criterion. Table 2.1 describes a small hypothetical data set (D) comprising 14 records describing bank customers. Each record consists of five attributes: (i) Status, describing whether the customer is a new customer (True) or an existing customer (False), (ii) Age (Young, Middle or Senior), (iii) Gender (Male or Female), (iv) Income (Low, Medium or High) and (v) Loan (Yes or No). The Loan attribute is the class attribute. The steps required to induce the DT for the given data set (D) are described below.

1. For the root node, the IG value for each attribute is calculated. For the Age attribute, the entropy values for the data set D before and after splitting are calculated. Recall that the data set D is a collection of 14 records with 9 Yes and 5 No records. So, the entropy before splitting is:

$$\mathrm{Entropy}(D) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} = 0.94$$

After splitting with respect to the Age attribute, the data set (D) is divided into three subsets according to the attribute values, as shown in Figure 2.3. The entropy after splitting, Entropy(D, Age), is therefore calculated as follows:

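$$\begin{aligned}
\mathrm{Entropy}(D, Age) &= \sum_{v \in \{young,\, middle,\, senior\}} \frac{|S_v|}{|D|}\,\mathrm{Entropy}(Age_v) \\
&= \frac{5}{14}\,\mathrm{Entropy}(Age_{young}) + \frac{4}{14}\,\mathrm{Entropy}(Age_{middle}) + \frac{5}{14}\,\mathrm{Entropy}(Age_{senior})
\end{aligned}$$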

where $|S_v|$ is the number of records in the subset $S_v$ of D for which Age has the value $v$ ($S_v \subset D$), and $|S_v|/|D|$ is the proportion of the records of D that fall in $S_v$. Entropy($Age_{young}$), Entropy($Age_{middle}$) and Entropy($Age_{senior}$) are calculated as follows.

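$$\begin{aligned}
\mathrm{Entropy}(Age_{young}) &= -\frac{3}{5}\log_2\frac{3}{5} - \frac{2}{5}\log_2\frac{2}{5} = 0.97 \\
\mathrm{Entropy}(Age_{middle}) &= -\frac{4}{4}\log_2\frac{4}{4} - \frac{0}{4}\log_2\frac{0}{4} = 0.0 \\
\mathrm{Entropy}(Age_{senior}) &= -\frac{3}{5}\log_2\frac{3}{5} - \frac{2}{5}\log_2\frac{2}{5} = 0.97
\end{aligned}$$

(The term $0 \log_2 0$ is taken to be 0.)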

Now, the value of Entropy(D, Age) is calculated as:

$$\mathrm{Entropy}(D, Age) = \frac{5}{14}(0.97) + \frac{4}{14}(0) + \frac{5}{14}(0.97) = 0.69$$

2. The value of IG(D, Age) is calculated as follows:

$$IG(D, Age) = \mathrm{Entropy}(D) - \mathrm{Entropy}(D, Age) = 0.94 - 0.69 = 0.25$$

3. Similarly, the IG values for the remaining attributes are calculated in the same manner, giving IG(D, Income) = 0.03, IG(D, Gender) = 0.15 and IG(D, Status) = 0.05.

4. Based on the obtained IG values, the attribute associated with the highest IG value is selected to represent the current node. In this case, the age attribute is selected to represent the root node.

5. Steps 1, 2, 3 and 4 are repeated recursively for each node in the tree as it is constructed, until no more records and/or attributes remain. The final DT obtained is shown in Figure 2.4.
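Steps 1 to 5 can be sketched in Python as follows, using the records of Table 2.1. This is a minimal ID3-style illustration of greedy, IG-driven construction and not an implementation of C4.5 itself (it has no gain ratio, pruning or handling of continuous attributes). The function names, the nested-tuple tree representation and the majority-class rule for leaves are assumptions, and the resulting tree has not been checked against Figure 2.4.

```python
import math
from collections import Counter

ATTRIBUTES = ["Status", "Age", "Gender", "Income"]
# Records of Table 2.1: Status, Age, Gender, Income, Loan (class attribute).
ROWS = [
    ("True", "Young", "Female", "High", "No"),
    ("False", "Young", "Female", "High", "No"),
    ("True", "Middle", "Female", "High", "Yes"),
    ("True", "Senior", "Female", "Medium", "Yes"),
    ("True", "Senior", "Male", "Low", "Yes"),
    ("False", "Senior", "Male", "Low", "No"),
    ("False", "Middle", "Male", "Low", "Yes"),
    ("True", "Young", "Female", "Medium", "No"),
    ("True", "Young", "Male", "Low", "Yes"),
    ("True", "Senior", "Male", "Medium", "Yes"),
    ("False", "Young", "Male", "Medium", "Yes"),
    ("False", "Middle", "Female", "Medium", "Yes"),
    ("True", "Middle", "Male", "High", "Yes"),
    ("False", "Senior", "Female", "Medium", "No"),
]
DATA = [dict(zip(ATTRIBUTES + ["Loan"], row)) for row in ROWS]

def entropy(records):
    """Entropy (base 2) of the Loan labels of a set of records."""
    counts = Counter(r["Loan"] for r in records)
    return sum(-(n / len(records)) * math.log2(n / len(records)) for n in counts.values())

def split(records, attribute):
    """Partition the records by their value for the given attribute."""
    subsets = {}
    for r in records:
        subsets.setdefault(r[attribute], []).append(r)
    return subsets

def information_gain(records, attribute):
    """Equation 2.1: entropy before the split minus weighted entropy after it."""
    subsets = split(records, attribute).values()
    after = sum(len(s) / len(records) * entropy(s) for s in subsets)
    return entropy(records) - after

def build(records, attributes):
    """Greedy top-down construction with no backtracking (steps 1 to 5)."""
    classes = Counter(r["Loan"] for r in records)
    if len(classes) == 1 or not attributes:
        return classes.most_common(1)[0][0]          # leaf: (majority) class
    best = max(attributes, key=lambda a: information_gain(records, a))
    remaining = [a for a in attributes if a != best]
    return (best, {v: build(s, remaining) for v, s in split(records, best).items()})

gains = {a: round(information_gain(DATA, a), 2) for a in ATTRIBUTES}
tree = build(DATA, ATTRIBUTES)
print(gains)
print(tree)
```

Rounded to two decimal places, the gains come out as 0.25 for Age, 0.15 for Gender, 0.05 for Status and 0.03 for Income, so Age is placed at the root. The recursion then splits the Young subset on Gender and the Senior subset on Status, while the Middle subset is already pure.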

| Status (new customer) | Age | Gender | Income | Loan |
|---|---|---|---|---|
| True | Young | Female | High | No |
| False | Young | Female | High | No |
| True | Middle | Female | High | Yes |
| True | Senior | Female | Medium | Yes |
| True | Senior | Male | Low | Yes |
| False | Senior | Male | Low | No |
| False | Middle | Male | Low | Yes |
| True | Young | Female | Medium | No |
| True | Young | Male | Low | Yes |
| True | Senior | Male | Medium | Yes |
| False | Young | Male | Medium | Yes |
| False | Middle | Female | Medium | Yes |
| True | Middle | Male | High | Yes |
| False | Senior | Female | Medium | No |

Table 2.1: A labeled training data set consisting of 14 records. Four attributes are used to describe each record, while the Loan attribute is used to label the record with either a Yes label (coloured in red) or a No label (coloured in green).

Figure 2.3: The three subsets of the data set D obtained after splitting with respect to the different Age attribute values: young, middle and senior.
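The partition depicted in Figure 2.3 can be reproduced with a short sketch that groups the (Age, Loan) pairs of Table 2.1 by Age value and reports the class counts, entropy and weighted contribution of each subset. The variable and function names are illustrative assumptions, and base-2 logarithms are assumed as in the worked example.

```python
import math
from collections import Counter, defaultdict

# (Age, Loan) pairs taken from Table 2.1, in row order.
PAIRS = [
    ("Young", "No"), ("Young", "No"), ("Middle", "Yes"), ("Senior", "Yes"),
    ("Senior", "Yes"), ("Senior", "No"), ("Middle", "Yes"), ("Young", "No"),
    ("Young", "Yes"), ("Senior", "Yes"), ("Young", "Yes"), ("Middle", "Yes"),
    ("Middle", "Yes"), ("Senior", "No"),
]

def entropy(labels):
    """Entropy (base 2) of a list of class labels."""
    total = len(labels)
    return sum(-(n / total) * math.log2(n / total) for n in Counter(labels).values())

# Divide the data set into one subset of Loan labels per Age value.
subsets = defaultdict(list)
for age, loan in PAIRS:
    subsets[age].append(loan)

weighted = 0.0
for age, labels in subsets.items():
    h = entropy(labels)
    weighted += len(labels) / len(PAIRS) * h     # |S_v| / |D| * Entropy(Age_v)
    print(age, dict(Counter(labels)), round(h, 2))
print("Entropy(D, Age) =", round(weighted, 2))
```

The Young and Senior subsets each hold five records with a 3/2 class split, the Middle subset holds four Yes records only, and the weighted sum rounds to the value 0.69 used in the example.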