
2.3 Common Classification Techniques

2.3.6 Rule Induction and Covering Approaches

2.3.6.1 Incremental Reduced Error Pruning (IREP)

The Incremental Reduced Error Pruning (IREP) technique was proposed by Furnkranz and Widmer (1994). It combines the separate-and-conquer approach with Reduced Error Pruning (REP). REP is an effective pruning method that yields a simplified set of rules. The error at every node of the tree is estimated on a portion of the training data that is held out as an independent test set. At each node, the misclassification rate on this test data is compared with the error that would result if the node were replaced by a leaf labelled with its majority class. If the replacement reduces the error, the subtree is pruned. This computation and pruning is repeated for every node until no further reduction in error can be obtained.
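The REP procedure described above can be illustrated with a minimal sketch. The tree representation (nested dictionaries with the hypothetical keys `attr`, `majority`, `children` and `label`) and all function names are assumptions made for this example, not part of the original method description. Following the text, a subtree is replaced only when the held-out error is strictly reduced.

```python
def classify(node, x):
    """Walk down to a leaf; unseen attribute values fall back to the majority class."""
    while "label" not in node:
        child = node["children"].get(x[node["attr"]])
        if child is None:
            return node["majority"]
        node = child
    return node["label"]


def count_errors(node, data):
    """Number of held-out (features, label) pairs the subtree misclassifies."""
    return sum(classify(node, x) != y for x, y in data)


def rep_prune(node, data):
    """Prune bottom-up, in place, using only the held-out data that reaches each node."""
    if "label" in node:
        return node  # leaves cannot be pruned further
    for value, child in list(node["children"].items()):
        reached = [(x, y) for x, y in data if x[node["attr"]] == value]
        node["children"][value] = rep_prune(child, reached)
    leaf = {"label": node["majority"]}
    # Replace the subtree only if the majority-class leaf makes strictly fewer errors.
    if count_errors(leaf, data) < count_errors(node, data):
        return leaf
    return node
```

Many REP variants also prune when the error is merely not increased; the strict comparison here simply mirrors the wording of the text.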

IREP generates rules greedily. The training data are first partitioned at random into two sets: a growing set, which contains about two-thirds (66.6%) of the data, and a pruning set, which contains the rest. IREP starts with an empty rule, and conditions on attribute values are added according to the FOIL-gain metric (Quinlan and Cameron-Jones, 1993). The algorithm keeps appending the condition that maximizes the FOIL gain until the rule no longer covers any negative instances from the growing set. Once the rule has been grown, it is pruned backwards: IREP considers deleting one condition at a time and keeps the deletion that most improves the function below:

$$v(\mathit{rule}, r_p, r_m, P, M) = \frac{r_p + (M - r_m)}{P + M} \tag{2.3}$$

where P and M are the total numbers of positive and negative data instances in the pruning set, and r_p and r_m are the numbers of positive and negative pruning instances covered by the pruned rule. The pruning process continues until no deletion of a rule condition improves the value of v.
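The growing and pruning steps can be sketched as follows. This is an illustrative sketch under stated assumptions: a rule is a list of hypothetical `(attribute, value)` conditions, instances are dictionaries, FOIL gain is used in its commonly stated form p1 * (log2(p1 / (p1 + n1)) - log2(p0 / (p0 + n0))), and the pruning value is read as v = (r_p + (M - r_m)) / (P + M). The pruning set is assumed to be non-empty.

```python
import math


def covers(rule, x):
    """A rule is a list of (attribute, value) conditions; all of them must hold."""
    return all(x.get(attr) == val for attr, val in rule)


def foil_gain(p0, n0, p1, n1):
    """FOIL gain; p0, n0 are covered before adding a condition, p1, n1 after."""
    if p0 == 0 or p1 == 0:
        return 0.0  # assumed convention: no positives covered means no gain
    return p1 * (math.log2(p1 / (p1 + n1)) - math.log2(p0 / (p0 + n0)))


def grow_rule(grow_pos, grow_neg, conditions):
    """Add the highest-gain condition until no growing negatives are covered."""
    rule, pos, neg = [], list(grow_pos), list(grow_neg)
    while neg:
        best_gain, best_cond = 0.0, None
        for cond in conditions:
            p1 = sum(covers([cond], x) for x in pos)
            n1 = sum(covers([cond], x) for x in neg)
            gain = foil_gain(len(pos), len(neg), p1, n1)
            if gain > best_gain:
                best_gain, best_cond = gain, cond
        if best_cond is None:
            break  # no condition improves the rule any further
        rule.append(best_cond)
        pos = [x for x in pos if covers([best_cond], x)]
        neg = [x for x in neg if covers([best_cond], x)]
    return rule


def prune_value(rule, prune_pos, prune_neg):
    """v = (r_p + (M - r_m)) / (P + M), computed on the pruning set."""
    P, M = len(prune_pos), len(prune_neg)
    r_p = sum(covers(rule, x) for x in prune_pos)
    r_m = sum(covers(rule, x) for x in prune_neg)
    return (r_p + (M - r_m)) / (P + M)


def prune_rule(rule, prune_pos, prune_neg):
    """Delete one condition at a time for as long as v strictly improves."""
    rule = list(rule)
    best = prune_value(rule, prune_pos, prune_neg)
    while rule:
        candidates = [rule[:i] + rule[i + 1:] for i in range(len(rule))]
        scored = [(prune_value(c, prune_pos, prune_neg), c) for c in candidates]
        value, candidate = max(scored, key=lambda vc: vc[0])
        if value <= best:
            break
        rule, best = candidate, value
    return rule
```

The sketch considers deleting any single condition, as the text describes; implementations that only delete trailing conditions would restrict `candidates` accordingly.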

The pruned rule is then added to the classifier, and the data instances it covers are deleted from both the growing and the pruning set. An empirical study on a number of benchmark problems showed that IREP is faster than REP and that the two are competitive in terms of error rate (Furnkranz and Widmer, 1994). Compared with the C4.5 algorithm, IREP achieved a lower error rate on 16 data sets, whereas C4.5 outperformed IREP on 21 data sets.
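The overall separate-and-conquer loop of IREP might be sketched as below. The function name `irep` and the callables `grow_rule`, `prune_rule` and `covers` are hypothetical and supplied by the caller. Splitting each class separately into two-thirds and one-third is an assumption of this sketch, as is the extra check that stops the loop when a rule covers no remaining positive instance. The loop also stops when a rule's error rate on the pruning data exceeds 50%.

```python
import random


def irep(pos, neg, grow_rule, prune_rule, covers, seed=0):
    """Grow a rule, prune it, keep it, remove the data it covers, and repeat."""
    rng = random.Random(seed)
    pos, neg = list(pos), list(neg)
    rules = []
    while pos:
        rng.shuffle(pos)
        rng.shuffle(neg)
        # About two-thirds of each class for growing, the rest for pruning.
        cut_p = (2 * len(pos) + 2) // 3
        cut_n = (2 * len(neg) + 2) // 3
        grow_pos, prune_pos = pos[:cut_p], pos[cut_p:]
        grow_neg, prune_neg = neg[:cut_n], neg[cut_n:]
        rule = prune_rule(grow_rule(grow_pos, grow_neg), prune_pos, prune_neg)
        p = sum(covers(rule, x) for x in prune_pos)
        n = sum(covers(rule, x) for x in prune_neg)
        makes_progress = any(covers(rule, x) for x in pos)
        # Stop on a useless rule or on more than 50% error on the pruning data.
        if not makes_progress or (p + n > 0 and n / (p + n) > 0.5):
            break
        rules.append(rule)
        pos = [x for x in pos if not covers(rule, x)]
        neg = [x for x in neg if not covers(rule, x)]
    return rules
```

Passing the growing and pruning steps in as functions keeps the sketch independent of any particular rule representation.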

2.3.6.2 Repeated Incremental Pruning to Produce Error Reduction (RIPPER)

RIPPER is a modified version of the IREP algorithm proposed by Cohen (1995). RIPPER builds its classifier, a set of rules, as follows. Like IREP, it first divides the training data into a growing set and a pruning set. It then starts from an empty rule set and heuristically appends one condition at a time to the current rule until the rule makes no errors on the growing set.

The modifications that RIPPER makes to IREP are explained in this section. The first is a revised stopping condition for rule generation. IREP checks the error rate of each learned rule and stops adding rules when that error rate exceeds 50% on the pruning data. This criterion tends to stop too early when the application domain requires a large number of low-coverage rules. RIPPER instead uses the minimum description length (MDL) principle to decide when to stop adding rules (Rissanen, 1985). After each rule is added, the total description length of the rule set and the training data is calculated. If this description length is greater than the shortest description length obtained so far, the algorithm stops adding rules. The MDL principle prefers the rule set that keeps the classifier small while also minimizing the amount of information needed to encode the exceptions to those rules (Witten and Frank, 2000).
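The MDL-based stopping rule can be illustrated with a deliberately simplified sketch. The two-part coding in `description_length` (bits for the conditions plus bits for identifying the exceptions) is a toy assumption and not the exact encoding used by RIPPER. Both function names are hypothetical. Discarding the rule that caused the increase is also an assumption of this sketch.

```python
import math


def description_length(rules, n_conditions, fp, fn, covered, uncovered):
    """Toy two-part description length in bits (assumed coding, not RIPPER's own).

    rules: list of rules, each a list of conditions.
    n_conditions: number of distinct conditions a rule could use.
    fp, fn: false positives among covered and false negatives among uncovered instances.
    """
    model_bits = sum(len(r) for r in rules) * math.log2(n_conditions)
    exception_bits = (math.log2(math.comb(covered, fp))
                      + math.log2(math.comb(uncovered, fn)))
    return model_bits + exception_bits


def rules_to_keep(lengths_after_each_rule):
    """Count how many rules are added before the description length rises
    above the shortest value seen so far."""
    best = float("inf")
    kept = 0
    for dl in lengths_after_each_rule:
        if dl > best:
            break  # this rule lengthens the description: stop adding rules
        best = dl
        kept += 1
    return kept
```

For instance, if the description lengths measured after adding four successive rules were 100, 90, 95 and 80 bits, this check would stop at the third rule and keep only the first two, even though a later rule would have shortened the description again.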

The second modification is an optimization procedure that post-processes the discovered rule set in order to reduce the number of rules. Because it is applied after the first pruning phase, it is known as post-pruning. The classifier produced by IREP is thus passed through an optimization phase that further simplifies the rule set.

RIPPER integrates IREP with this optimization procedure. It works as follows. A pruned rule pr_i is selected from the rule set, and two alternative rules are built for it: a replacement and a revision of pr_i. To create the replacement of pr_i, a new rule is grown from an empty rule and then pruned so as to decrease the error rate of the whole rule set, including the new rule, on the pruning data set. The revision of pr_i is built in the same fashion, except that conditions are added heuristically, one at a time, to the original pr_i instead of to an empty rule. Of these three rules (the original, the replacement and the revision), the one with the minimum error rate on the pruning data is selected.
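The selection among the original rule, its replacement and its revision might look like the sketch below. The names `rule_set_error` and `choose_variant` are hypothetical, and `covers` is a caller-supplied function. How the replacement and revision are grown is left out. The error is measured for the whole rule set on the pruning data, as described above, and ties are resolved in favour of the original rule, which is an assumption of this sketch.

```python
def rule_set_error(rules, prune_pos, prune_neg, covers):
    """Fraction of pruning instances that the rule set as a whole misclassifies."""
    fn = sum(not any(covers(r, x) for r in rules) for x in prune_pos)
    fp = sum(any(covers(r, x) for r in rules) for x in prune_neg)
    return (fn + fp) / (len(prune_pos) + len(prune_neg))


def choose_variant(rules, i, replacement, revision, prune_pos, prune_neg, covers):
    """Return the rule set with rule i set to the lowest-error variant."""
    best_rules, best_err = None, None
    for variant in (rules[i], replacement, revision):
        trial = rules[:i] + [variant] + rules[i + 1:]
        err = rule_set_error(trial, prune_pos, prune_neg, covers)
        # Strict comparison: earlier variants win ties, so the original is preferred.
        if best_err is None or err < best_err:
            best_rules, best_err = trial, err
    return best_rules
```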

A comparative study was conducted on 36 benchmark data sets (Merz and Murphy, 1996) to estimate the prediction accuracy of the C4.5, IREP and RIPPER algorithms. The results showed that C4.5 achieved lower error rates than RIPPER on 15 data sets, while RIPPER outperformed C4.5 on 20 data sets. RIPPER also achieved better results than IREP on 28 data sets.

In document LC an effective classification based association rule mining algorithm (Page 37-39)