(4.47)
Here the variable m ranges over all output nodes with arcs from j. We then derive
4.6 RULE-BASED ALGORITHMS
One straightforward way to perform classification is to generate if-then rules that cover all cases. For example, we could have the following rules to determine classification of grades:
If 90 ≤ grade, then class = A
If 80 ≤ grade and grade < 90, then class = B
If 70 ≤ grade and grade < 80, then class = C
If 60 ≤ grade and grade < 70, then class = D
If grade < 60, then class = F
A classification rule, r = (a, c), consists of the if part, or antecedent, a, and the then part, or consequent, c. The antecedent contains a predicate that can be evaluated as true or false against each tuple in the database (and therefore against each tuple in the training data). These rules relate directly to the corresponding DT that could be created. A DT can always be used to generate rules, but the two are not equivalent. There are differences between rules and trees:
• The tree has an implied order in which the splitting is performed. Rules have no order.
• A tree is created based on looking at all classes. When generating rules, only one class must be examined at a time.
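The rule form r = (a, c) can be made concrete with the grade rules above. The following is a minimal sketch, not part of the original text: each antecedent is written as a Python predicate, and the names `rules` and `classify` are illustrative assumptions.

```python
# Each rule r = (a, c): the antecedent a is a predicate on the grade,
# the consequent c is the class label. Names here are illustrative only.
rules = [
    (lambda g: 90 <= g, "A"),
    (lambda g: 80 <= g < 90, "B"),
    (lambda g: 70 <= g < 80, "C"),
    (lambda g: 60 <= g < 70, "D"),
    (lambda g: g < 60, "F"),
]

def classify(grade):
    """Return the consequents of every rule whose antecedent is true."""
    return [c for a, c in rules if a(grade)]

print(classify(85))
```

Because the five antecedents are mutually exclusive and cover every grade, exactly one rule fires regardless of the order in which the rules are listed, which is the sense in which rules have no order.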
There are algorithms that generate rules from trees as well as algorithms that generate rules without first creating DTs.
4.6.1 Generating Rules from a DT
The process to generate a rule from a DT is straightforward and is outlined in Algorithm 4.8. This algorithm will generate a rule for each leaf node in the decision tree. All rules with the same consequent could be combined together by ORing the antecedents of the simpler rules.
ALGORITHM 4.8
Input:
   T   // Decision tree
Output:
   R   // Rules
Gen algorithm:
   // Illustrate simple approach to generating classification rules from a DT
   R = ∅
   for each path from root to a leaf in T do
      a = True
      for each non-leaf node do
         a = a ∧ (label of node combined with label of incident outgoing arc)
      c = label of leaf node
      R = R ∪ {r = (a, c)}
Using this algorithm, the following rules are generated for the DT in Figure 4.13(a):
{ ((Height ≤ 1.6 m), Short)
  (((Height > 1.6 m) ∧ (Height ≤ 1.7 m)), Short)
  (((Height > 1.7 m) ∧ (Height ≤ 1.8 m)), Medium)
  (((Height > 1.8 m) ∧ (Height ≤ 1.9 m)), Medium)
  (((Height > 1.9 m) ∧ (Height ≤ 2 m) ∧ (Height ≤ 1.95 m)), Medium)
  (((Height > 1.9 m) ∧ (Height ≤ 2 m) ∧ (Height > 1.95 m)), Tall)
  ((Height > 2 m), Tall) }
An optimized version of these rules is then:
{ ((Height ≤ 1.7 m), Short)
  (((Height > 1.7 m) ∧ (Height ≤ 1.95 m)), Medium)
  ((Height > 1.95 m), Tall) }
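The path-walking idea of generating one rule per leaf can be sketched in Python. This is only an illustration under stated assumptions: Figure 4.13(a) is not reproduced here, so the nested structure `tree` below is a hypothetical encoding shaped to yield the seven unoptimized rules listed above, and the function name `rules_from_tree` is invented for the example.

```python
def rules_from_tree(node, antecedent=()):
    """Return one (antecedent, consequent) rule per root-to-leaf path."""
    if isinstance(node, str):              # a leaf holds only its class label
        return [(antecedent, node)]
    rules = []
    for arc_label, child in node["arcs"]:
        # Combine the node label with the label of the outgoing arc.
        predicate = f"{node['label']} {arc_label}"
        rules.extend(rules_from_tree(child, antecedent + (predicate,)))
    return rules

# Hypothetical tree: internal nodes are dicts, leaves are class names.
tree = {"label": "Height", "arcs": [
    ("<= 1.6", "Short"),
    ("in (1.6, 1.7]", "Short"),
    ("in (1.7, 1.8]", "Medium"),
    ("in (1.8, 1.9]", "Medium"),
    ("in (1.9, 2.0]", {"label": "Height", "arcs": [
        ("<= 1.95", "Medium"),
        ("> 1.95", "Tall"),
    ]}),
    ("> 2.0", "Tall"),
]}

rules = rules_from_tree(tree)
for antecedent, consequent in rules:
    print(" AND ".join(antecedent), "->", consequent)
```

Each tuple of predicates stands for their conjunction. Merging rules that share a consequent, as in the optimized rule set, would be a separate post-processing step that this sketch does not attempt.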
4.6.2 Generating Rules from a Neural Net
To increase the understanding of an NN, classification rules may be derived from it. While the source NN may still be used for classification, the derived rules can be used to verify or interpret the network. The problem is that the rules do not explicitly exist. They are buried in the structure of the graph itself. In addition, if learning is still occurring, the rules themselves are dynamic. The rules generated tend both to be more concise and to have a lower error rate than rules used with DTs. The basic idea of the RX algorithm is to cluster output values with the associated hidden nodes and input. A major problem with rule extraction is the potential size of the resulting rule set. For example, if you have a node with n inputs each having 5 values, there are 5^n different input combinations to this one node alone. These patterns would all have to be accounted for when constructing rules. To overcome this problem and that of having continuous ranges of output values from nodes, the output values for both the hidden and output layers are first discretized. This is accomplished by clustering the values and dividing continuous values into disjoint ranges. The rule extraction algorithm, RX, shown in Algorithm 4.9 is derived from [LSL95].
ALGORITHM 4.9
Input:
   D   // Training data
   N   // Initial neural network
Output:
   R   // Derived rules
RX algorithm:
   // Rule extraction algorithm to extract rules from NN
   cluster output node activation values;
   cluster hidden node activation values;
   generate rules that describe the output values in terms of the hidden activation values;
   generate rules that describe hidden output values in terms of inputs;
   combine the two sets of rules.
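The discretization step, in which continuous activation values are clustered into disjoint ranges, can be illustrated with a small sketch. The text does not specify the clustering procedure, so the one below is an assumption chosen for simplicity: values are sorted, and a new cluster starts whenever a value lies more than a tolerance `eps` from the first value of the current cluster. The function name, the tolerance, and the sample activations are all hypothetical.

```python
def discretize(values, eps):
    """Group 1-D activation values into disjoint (low, high) ranges.

    Assumed procedure: scan the sorted values and open a new cluster when a
    value is more than eps away from the first value of the current cluster.
    """
    clusters = []
    for v in sorted(values):
        if clusters and v - clusters[-1][0] <= eps:
            clusters[-1].append(v)
        else:
            clusters.append([v])
    return [(c[0], c[-1]) for c in clusters]

# Hypothetical hidden-node activations observed on the training data.
activations = [0.02, 0.05, 0.48, 0.51, 0.97, 0.99, 0.03]
ranges = discretize(activations, eps=0.1)
print(ranges)

# Size of the problem without discretization: n inputs with 5 values each.
n = 3
print(5 ** n)
```

With three ranges per node instead of a continuum, the number of input patterns that the extracted rules must account for becomes finite and small.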
4.6.3 Generating Rules Without a DT or NN
These techniques are sometimes called covering algorithms because they attempt to generate rules that exactly cover a specific class [WF00]. Tree algorithms work in a top-down, divide-and-conquer approach, but this need not be the case for covering algorithms. They generate the best rule possible by optimizing the desired classification probability. Usually the "best" attribute-value pair is chosen, as opposed to the best attribute with the tree-based algorithms. Suppose that we wished to generate a rule to classify persons as tall. The basic format for the rule is then
If ? then class = tall
The objective for the covering algorithms is to replace the "?" in this statement with predicates that can be used to obtain the "best" probability of being tall.
One simple approach is called 1R because it generates a simple set of rules that are equivalent to a DT with only one level. The basic idea is to choose the best attribute to perform the classification based on the training data. "Best" is defined here by counting the number of errors. In Table 4.4 this approach is illustrated using the height example, Output1.

TABLE 4.4: 1R Classification
Option  Attribute  Rules                  Errors  Total Errors
1       Gender     F → Medium             3/9     6/15
                   M → Tall               3/6
2       Height     (0, 1.6] → Short       0/2     1/15
                   (1.6, 1.7] → Short     0/2
                   (1.7, 1.8] → Medium    0/3
                   (1.8, 1.9] → Medium    0/4
                   (1.9, 2.0] → Medium    1/2
                   (2.0, ∞) → Tall        0/2

If we only use the gender attribute, there are a total of 6/15 errors, whereas if we use the height attribute, there are only 1/15. Thus, the height would be chosen and the six rules stated in the table would be used. As with ID3, 1R tends to choose attributes with a large number of values, leading to overfitting. 1R can handle missing data by adding an additional attribute value for the value of missing. Algorithm 4.10, which is adapted from [WF00], shows the outline for this algorithm.
ALGORITHM 4.10
Input:
   D   // Training data
   R   // Attributes to consider for rules
   C   // Classes
Output:
   R   // Rules
1R algorithm:
   // 1R algorithm generates rules based on one attribute
   R = ∅;
   for each A ∈ R do
      R_A = ∅;
      for each possible value, v, of A do
         // v may be a range rather than a specific value
         for each C_j ∈ C find count(C_j);
            // Here count is the number of occurrences of this class for this attribute
         let C_m be the class with the largest count;
         R_A = R_A ∪ ((A = v) → (class = C_m));
      ERR_A = number of tuples incorrectly classified by R_A;
   R = R_A where ERR_A is minimum;
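A minimal Python sketch of the 1R idea follows. It is an illustration under assumptions rather than the algorithm as published: the function name `one_r` and the dictionary layout are invented, and the 15 toy tuples are not copied from Table 4.1 but constructed only so that they reproduce the error counts shown in Table 4.4.

```python
from collections import Counter, defaultdict

def one_r(rows, attributes, target):
    """Return (attribute, rules, errors) for the attribute with fewest errors."""
    best = None
    for attr in attributes:
        counts = defaultdict(Counter)            # value -> class counts
        for row in rows:
            counts[row[attr]][row[target]] += 1
        # One rule per value: predict the most frequent class for that value.
        rules = {v: c.most_common(1)[0][0] for v, c in counts.items()}
        errors = sum(row[target] != rules[row[attr]] for row in rows)
        if best is None or errors < best[2]:
            best = (attr, rules, errors)
    return best

# Toy tuples (gender, height range, class), assumed for illustration only.
data = (
    [("F", "(0,1.6]", "Short")] * 2
    + [("F", "(1.6,1.7]", "Short"), ("M", "(1.6,1.7]", "Short")]
    + [("F", "(1.7,1.8]", "Medium")] * 3
    + [("F", "(1.8,1.9]", "Medium")] * 3
    + [("M", "(1.8,1.9]", "Medium")]
    + [("M", "(1.9,2.0]", "Medium"), ("M", "(1.9,2.0]", "Tall")]
    + [("M", "(2.0,inf)", "Tall")] * 2
)
rows = [dict(gender=g, height=h, cls=c) for g, h, c in data]

best_attr, best_rules, best_errors = one_r(rows, ["gender", "height"], "cls")
gender_errors = one_r(rows, ["gender"], "cls")[2]
print(best_attr, best_errors, gender_errors)
```

Ties between classes for a single value, as in the (1.9, 2.0] range, are broken here by whichever class was seen first. That choice does not affect the error count, which is the only quantity 1R compares across attributes.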
Another approach to generating rules without first having a DT is called PRISM. PRISM generates rules for each class by looking at the training data and adding rules that completely describe all tuples in that class. Its accuracy on the training data is therefore 100 percent. Example 4.12 illustrates the use of PRISM. Algorithm 4.11, which is adapted from [WF00], shows the process. Note that the algorithm refers to attribute-value pairs. The values include an operator, so that in Example 4.12 the first attribute-value pair chosen has attribute height and value > 2.0. As with earlier classification techniques, this must be modified to handle continuous attributes. In the example, we have again used the ranges of height values used in earlier examples.
EXAMPLE 4.12
Using the data in Table 4.1 and the Output1 classification, the following shows the basic probability of putting a tuple in the tall class based on the given attribute-value pair:

Gender = F              0/9
Gender = M              3/6
Height <= 1.6           0/2
1.6 < Height <= 1.7     0/2
1.7 < Height <= 1.8     0/3
1.8 < Height <= 1.9     0/4
1.9 < Height <= 2.0     1/2
2.0 < Height            2/2

Based on this analysis, we would generate the rule
If 2.0 < height, then class = tall
Since all tuples that satisfy this predicate are tall, we do not add any additional predicates to this rule. We now need to generate additional rules for the tall class. We thus look at the remaining 13 tuples in the training set and recalculate the accuracy of the corresponding predicates:

Gender = F              0/9
Gender = M              1/4
Height <= 1.6           0/2
1.6 < Height <= 1.7     0/2
1.7 < Height <= 1.8     0/3
1.8 < Height <= 1.9     0/4
1.9 < Height <= 2.0     1/2
Based on the analysis, we see that the last height range is the most accurate and thus generate the rule:
If 1.9 < height <= 2.0, then class = tall
However, only one of the tuples that satisfies this is actually tall, so we need to add another predicate to it. We then look only at the other predicates affecting these two tuples. We now see a problem in that both of these are males. The problem is actually caused by our "arbitrary" range divisions. We now divide the range into two subranges:
1.9 < Height <= 1.95    0/1
1.95 < Height <= 2.0    1/1
We thus add this second predicate to the rule to obtain
If 1.9 < height <= 2.0 and 1.95 < height <= 2.0, then class = tall
or, once this is combined with the first rule,
If 1.95 < height, then class = tall
This problem does not exist if we look at tuples individually using the attribute-value pairs. However, in that case we would not generate the needed ranges for classifying the actual data. At this point, we have classified all tall tuples. The algorithm would then proceed by classifying the short and medium classes. This is left as an exercise.
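The covering procedure walked through in Example 4.12 can be sketched in Python for a single class. Everything below is an assumption made for illustration: the function name `prism_for_class`, the rule representation as lists of (attribute, value) pairs, and the toy tuples, which are not Table 4.1 but are built to match the counts in the example with the (1.9, 2.0] range already split at 1.95. Ties in accuracy are broken in favor of the pair covering more tuples of the class, which is a common choice but is not stated in the text.

```python
def prism_for_class(rows, attributes, target, cls):
    """Return rules, each a list of (attribute, value) pairs, covering cls."""
    remaining = list(rows)
    rules = []
    while any(r[target] == cls for r in remaining):
        covered, rule, unused = remaining, [], list(attributes)
        # Grow the rule until it covers only cls tuples or attributes run out.
        while unused and any(r[target] != cls for r in covered):
            def score(pair):
                a, v = pair
                match = [r for r in covered if r[a] == v]
                hits = sum(r[target] == cls for r in match)
                return (hits / len(match), hits)   # accuracy, then coverage
            pairs = {(a, r[a]) for a in unused for r in covered}
            best = max(sorted(pairs), key=score)
            rule.append(best)
            unused.remove(best[0])
            covered = [r for r in covered if r[best[0]] == best[1]]
        rules.append(rule)
        remaining = [r for r in remaining if r not in covered]
    return rules

# Toy tuples (gender, height range, class), assumed for illustration only.
data = (
    [("F", "(0,1.6]", "Short")] * 2
    + [("F", "(1.6,1.7]", "Short"), ("M", "(1.6,1.7]", "Short")]
    + [("F", "(1.7,1.8]", "Medium")] * 3
    + [("F", "(1.8,1.9]", "Medium")] * 3
    + [("M", "(1.8,1.9]", "Medium")]
    + [("M", "(1.9,1.95]", "Medium"), ("M", "(1.95,2.0]", "Tall")]
    + [("M", "(2.0,inf)", "Tall")] * 2
)
rows = [dict(gender=g, height=h, cls=c) for g, h, c in data]

tall_rules = prism_for_class(rows, ["gender", "height"], "cls", "Tall")
print(tall_rules)
```

On this data the first rule selects the range above 2.0 (accuracy 2/2), and the second selects the (1.95, 2.0] range (accuracy 1/1 among the remaining 13 tuples), mirroring the two steps of the example.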
ALGORITHM 4.11
Input:
   D   // Training data
   C   // Classes
Output:
   R   // Rules
PRISM algorithm:
   // PRISM algorithm generates rules based on best attribute-value pairs
   R = ∅;
   for each C_j ∈ C do
      repeat
         T = D;   // All instances of class C_j will be systematically removed from T
         p = true;   // Create new rule with empty left-hand side
         r = (If p then C_j);
         repeat
            for each attribute A value v pair found in T do
               calculate |(tuples ∈ T with A = v) ∧ p ∧ (∈ C_j)| / |(tuples ∈ T with A = v) ∧ p|;
            find A = v that maximizes this value;
            p = p ∧ (A = v);
            T = {tuples in T that satisfy A = v};
         until all tuples in T belong to C_j;
         D = D - T;
         R = R ∪ r;
      until there are no tuples in D that belong to C_j;

4.7 COMBINING TECHNIQUES
Given a classification problem, no one classification technique always yields the best results. Therefore, there have been some proposals that look at combining techniques. While discussing C5.0, we briefly introduced one technique for combining classifiers called boosting. Two basic techniques can be used to accomplish this:
• A synthesis of approaches takes multiple techniques and blends them into a new approach. An example of this would be using a prediction technique, such as linear regression, to predict a future value for an attribute that is then used as input to a classification NN. In this way the NN is used to predict a future classification value.
• Multiple independent approaches can be applied to a classification problem, each yielding its own class prediction. The results of these individual techniques can then be combined in some manner. This approach has been referred to as combination of multiple classifiers (CMC).
One approach to combine independent classifiers assumes that there are n independent classifiers and that each generates the posterior probability P_k(C_j | t_i) for each class. The values are combined with a weighted linear combination

\[ \sum_{k=1}^{n} w_k P_k(C_j \mid t_i) \]
[Figure 4.20: two panels, (a) Classifier 1 and (b) Classifier 2, each showing the target tuple X and its 10 nearest neighbors. Legend: tuple in class 1 and correctly classified; tuple in class 1 and incorrectly classified; tuple in class 2 and correctly classified; tuple in class 2 and incorrectly classified.]
FIGURE 4.20: Combination of multiple classifiers.
Here the weights, w_k, can be assigned by a user or learned based on the past accuracy of each classifier. Another technique is to choose the classifier that has the best accuracy in a database sample. This is referred to as dynamic classifier selection (DCS).
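The weighted linear combination of posterior probabilities can be sketched as follows. This is a minimal illustration under assumptions: the posterior values and weights are made up, the function name `combine` is hypothetical, and assigning the tuple to the class with the largest combined score is an assumed decision rule that the text does not state explicitly.

```python
def combine(posteriors, weights):
    """Combine per-classifier posteriors P_k(C_j | t) with weights w_k.

    posteriors[k][c] is the posterior that classifier k gives class c.
    Returns the class with the largest weighted sum and all the sums.
    """
    classes = posteriors[0].keys()
    scores = {
        c: sum(w * p[c] for w, p in zip(weights, posteriors)) for c in classes
    }
    return max(scores, key=scores.get), scores

# Hypothetical posteriors from two classifiers for one tuple.
posteriors = [
    {"Short": 0.1, "Medium": 0.6, "Tall": 0.3},
    {"Short": 0.2, "Medium": 0.3, "Tall": 0.5},
]
weights = [0.7, 0.3]          # assumed, e.g. reflecting past accuracy

label, scores = combine(posteriors, weights)
print(label, scores)
```

Setting one weight to 1 and the others to 0 reduces this scheme to simply trusting a single classifier.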
Example 4.13, which is modified from [LJ98], illustrates the use of DCS. Another variation is simple voting: assign the tuple to the class to which a majority of the classifiers have assigned it. This may have to be modified slightly in case there are many classes and no majority is found.

EXAMPLE 4.13
Two classifiers exist to classify tuples into two classes. A target tuple, X, needs to be classified. Using a nearest neighbor approach, the 10 tuples closest to X are identified. Figure 4.20 shows the 10 tuples closest to X. In Figure 4.20(a) the results for the first classifier are shown, while in Figure 4.20(b) those for the second classifier are shown. The tuples designated with triangles should be in class 1, while those shown as squares should be in class 2. Any shapes that are darkened indicate an incorrect classification by that classifier. To combine the classifiers using DCS, look at the general accuracy of each classifier. With classifier 1, 7 tuples in the neighborhood of X are correctly classified, while with the second classifier, only 6 are correctly classified. Thus, X will be classified according to how it is classified with the first classifier.

Recently, a new CMC technique, adaptive classifier combination (ACC), has been proposed [LJ98]. Given a tuple to classify, the neighborhood around it is first determined, then the tuples in that neighborhood are classified by each classifier, and finally the accuracy for each class is measured. By examining the accuracy across all classifiers for each class, the tuple is placed in the class that has the highest local accuracy. In effect, the class chosen is that to which most of its neighbors are accurately classified independent of classifier. Example 4.14 illustrates the use of ACC.
EXAMPLE 4.14
Using the same data as in Example 4.13, the ACC technique examines how accurate all classifiers are for each class. With the tuples in class 1, classifier 1 accurately classifies 3 tuples, while classifier 2 accurately classifies only 1 tuple. A measure of the accuracy for both classifiers with respect to class 1 is then: 3/4 + 1/4. When looking at class 2, the measure is: 4/6 + 5/6. Thus, X is placed in class 2.
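The arithmetic in Examples 4.13 and 4.14 can be checked with a short sketch. The counts are taken from the two examples (4 neighbors in class 1, 6 in class 2, and the per-class correct counts for each classifier); the variable names and the dictionary layout are assumptions made for the illustration.

```python
from fractions import Fraction

# Neighbors of X by true class, and correct classifications per classifier.
neighbors = {1: 4, 2: 6}
correct = {
    "classifier 1": {1: 3, 2: 4},
    "classifier 2": {1: 1, 2: 5},
}

# DCS: pick the classifier with the most correct neighbors overall (7 vs. 6).
dcs_choice = max(correct, key=lambda k: sum(correct[k].values()))

# ACC: for each class, sum the local accuracy over all classifiers,
# then place X in the class with the highest total.
acc_score = {
    c: sum(Fraction(correct[k][c], n) for k in correct)
    for c, n in neighbors.items()
}
acc_class = max(acc_score, key=acc_score.get)

print(dcs_choice, acc_score, acc_class)
```

The two techniques answer different questions: DCS selects a classifier and defers to its prediction for X, whereas ACC selects a class directly from the per-class local accuracies.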
4.8 SUMMARY
No one classification technique is always superior to the others in terms of classification accuracy. However, there are advantages and disadvantages to the use of each. The regression approaches force the data to fit a predefined model. If a linear model is chosen, then the data are fit into that model even though it might not be linear. It requires that linear data be used. The KNN technique requires only that the data be such that distances can be calculated. This can then be applied even to nonnumeric data. Outliers are handled by looking only at the K nearest neighbors. Bayesian classification assumes that the data attributes are independent with discrete values. Thus, although it is easy to use and understand, results may not be satisfactory. Decision tree techniques are easy to understand, but they may lead to overfitting. To avoid this, pruning techniques may be needed. ID3 is applicable only to categorical data. Improvements on it, C4.5 and C5, allow the use of continuous data and improved techniques for splitting. CART creates binary trees and thus may result in very deep trees.
When looking at the approaches based on complexity analysis, we see that they are all very efficient. This is due to the fact that once the model is created, applying it for classification is relatively straightforward. The statistical techniques, regression and naive Bayes, require constant time to classify a tuple once the models are built. The distance-based approaches, simple and KNN, are also constant but require that each tuple be compared either to a representative for each class or to all items in the training set. Assuming there are q of these, the KNN then requires O(q) time per tuple. DT classification techniques, ID3, C4.5, and CART require a number of comparisons that are (in the worst case) equal to the longest path from a root to a leaf node. Thus, they require O(log q) time per tuple. Since q is a constant, we can view these as being performed in constant time as well. The NN approaches again require that a tuple be propagated through the graph. Since the size of the graph is constant, this can be viewed as being performed in constant time. Thus, all algorithms are O(n) to classify the n items in the database.
4.9 EXERCISES
1. Explain the differences between the definition of the classification problem found in Definition 4.1 and an alternative one with the mapping from C to D.
2. Using the data in Table 4.1, draw OC curves assuming that the Output2 column is the correct classification and Output1 is what is seen. You will need to draw three curves, one for each class.
3. Using the data in Table 4.1, construct a confusion matrix assuming Output2 is the correct assignment and Output1 is what is actually made.
4. Apply the method of least squares technique to determine the division between medium and tall persons using the training data in Table 4.1 and the classification shown in the Output1 column (see Example 4.3). You may use either the division technique or the prediction technique.
5. Apply the method of least squares technique to determine the division between medium and tall persons using the training data in Table 4.1 and the classification shown in the Output2 column. This uses both the height data and the gender data to do the classification. Use the division technique.
6. Redo Exercise 5 using the prediction technique.
7. Use KNN to classify (Jim, M, 2.0) with K = 5 using the height data and assuming that Output2 is correct.
8. Explain the difference between P(t_i | C_j) and P(C_j | t_i).
9. Redo Example 4.5 using Output2 data.
10. Determine the expected number of comparisons for each tree shown in Figure 4.11.
11. Generate a DT for the height example in Table 4.1 using the ID3 algorithm and the training classifications shown in the Output2 column of that table.
12. Repeat Exercise 11 using the GainRatio instead of the Gain.
13. Construct the confusion matrix for the results of Exercises 11 and 12.
14. Using 1R, generate rules for the height example using the Output2 column in