2.6 Techniques for Text Classification
2.6.4 Inductive Rule Learning
Inductive Rule Learning (IRL) is a widely used technique for text classification. Rule-based classifiers offer an advantage over “black box” classifiers, such as SVMs and NB, in that they are easily interpretable and simple to apply (an advantage shared by some other methods, for example, decision tree classifiers). This advantage is particularly important in applications where manual human intervention is essential in analyzing the classifiers, for example, in the medical domain. IRL for text classification (without the use of negation) is a mature research field. Numerous algorithms, mostly based on the covering algorithm, have been implemented and used in text classification. The IRL mechanism proposed in this thesis also adopts the covering algorithm. In the covering algorithm, rules are “learned” sequentially based on the documents in the training set. The “covered” documents are then removed and the process is repeated until all the documents are covered or no more rules can be generated. The generated rules are added to the ruleset, which is then used as the classifier. The covering algorithm is shown in Algorithm 2.
Algorithm 2: Covering algorithm
Input: D, a dataset of class-labelled documents; Feature_set_c, the set of features for class c
Output: Ruleset, a set of rules

Ruleset = { }    // initial set of rules learned is empty
for each class c do
    Ruleset_c = { }
    while stopping condition is not met do
        Rule = LearnOneRule(Feature_set_c, D, c)
        Remove documents covered by Rule from D
        Ruleset_c = Ruleset_c + Rule
    end
    Ruleset = Ruleset + Ruleset_c
end
return Ruleset

The goodness of a learned rule is usually measured by its accuracy. To measure how accurate a rule is, the number of documents correctly covered by the rule is taken into consideration. A rule may cover both documents from the class at which the rule is directed (positive documents) and documents from other classes (negative documents). The formula for the accuracy of a rule is given in Equation 2.6:
\[
\text{Accuracy} = \frac{P}{P + N} \tag{2.6}
\]
where P is the number of positive documents covered and N is the number of negative documents covered.
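As a rough illustration, the covering loop of Algorithm 2 can be sketched in Python with rule accuracy P / (P + N) guiding the search. This is a minimal sketch and not the mechanism proposed in this thesis. Several details are assumptions made for illustration only: documents are represented as sets of terms, a rule is a conjunction of terms, `learn_one_rule` is a hypothetical greedy stand-in for LearnOneRule that breaks accuracy ties in favour of larger coverage, covered documents are removed from a per-class working copy of D, and the stopping condition is "no positive documents remain or no rule can be learned".

```python
from typing import Dict, FrozenSet, List, Optional, Set, Tuple

Document = Tuple[FrozenSet[str], str]  # (terms in the document, class label)
Rule = Tuple[FrozenSet[str], str]      # (terms that must all occur, predicted class)


def covers(rule: Rule, doc: Document) -> bool:
    """A rule covers a document when every antecedent term occurs in it."""
    return rule[0] <= doc[0]


def learn_one_rule(features: Set[str], docs: List[Document], label: str) -> Optional[Rule]:
    """Hypothetical LearnOneRule: greedily add the term that most improves
    accuracy P / (P + N), preferring larger P when accuracies tie."""
    antecedent: FrozenSet[str] = frozenset()
    covered = list(docs)
    while any(d[1] != label for d in covered):  # negatives are still covered
        pos = sum(d[1] == label for d in covered)
        best_term, best_key = None, (pos / len(covered), pos)
        for term in sorted(features - antecedent):
            subset = [d for d in covered if term in d[0]]
            p = sum(d[1] == label for d in subset)
            if p and (p / len(subset), p) > best_key:
                best_term, best_key = term, (p / len(subset), p)
        if best_term is None:  # no term improves the rule any further
            break
        antecedent = antecedent | {best_term}
        covered = [d for d in covered if best_term in d[0]]
    if not antecedent:  # nothing useful could be learned
        return None
    return antecedent, label


def covering(docs: List[Document], features_by_class: Dict[str, Set[str]]) -> List[Rule]:
    """Learn rules class by class, removing covered documents after each rule."""
    ruleset: List[Rule] = []
    for label, features in features_by_class.items():
        remaining = list(docs)  # assumption: each class starts from the full dataset
        while any(d[1] == label for d in remaining):
            rule = learn_one_rule(features, remaining, label)
            if rule is None:
                break
            remaining = [d for d in remaining if not covers(rule, d)]
            ruleset.append(rule)
    return ruleset


# Toy data, invented purely for illustration.
docs = [
    (frozenset({"wheat", "farm"}), "grain"),
    (frozenset({"wheat", "export"}), "grain"),
    (frozenset({"corn", "farm"}), "grain"),
    (frozenset({"oil", "export"}), "crude"),
    (frozenset({"oil", "barrel"}), "crude"),
]
features_by_class = {"grain": {"wheat", "corn", "farm"}, "crude": {"oil", "barrel"}}

for antecedent, label in covering(docs, features_by_class):
    print(sorted(antecedent), "->", label)
```

On this toy data the sketch learns two single-term rules for the class grain and one for crude. Each learned rule covers at least one positive document, so the inner loop always terminates.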
The use of accuracy as a measure of rule quality can, however, be misleading. Consider the two rules in Table 2.1. Rule 1 covers only one positive document and no negative documents, giving it 100% accuracy. Rule 2 covers 100 positive documents and two negative documents, giving it 98% accuracy. Although one can argue that Rule 2 is a much better rule because it covers many more positive documents, its accuracy is dragged down by the two negative documents it covers. Comparing the two rules by accuracy alone is therefore misleading. A measure that gives a fairer evaluation of rule quality is accuracy with Laplace estimation, given in Equation 2.7.
Rule    P      N    Accuracy
1       1      0    100%
2       100    2    98%
Table 2.1: Examples of rule accuracy
\[
\text{Accuracy with Laplace estimation} = \frac{P + 1}{P + N + \text{number of classes}} \tag{2.7}
\]
The accuracy of the rules in Table 2.1, using Laplace estimation and assuming that there are 10 classes in the dataset, is shown in Table 2.2. With the inclusion of Laplace estimation in the accuracy measure, the quality of a rule can now be more fairly reflected. Rule 2 now has a higher accuracy than Rule 1.
Rule    P      N    Accuracy with Laplace estimation
1       1      0    18.2%
2       100    2    90.2%
Table 2.2: Examples of rule accuracy with Laplace estimation
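The figures in Tables 2.1 and 2.2 can be checked with a few lines of Python. The sketch below assumes the Laplace-corrected form (P + 1) / (P + N + k), where k is the number of classes (10 here, as stated above). The function names are illustrative only.

```python
def rule_accuracy(p: int, n: int) -> float:
    """Plain rule accuracy: P / (P + N)."""
    return p / (p + n)


def laplace_accuracy(p: int, n: int, num_classes: int) -> float:
    """Rule accuracy with Laplace estimation: (P + 1) / (P + N + k)."""
    return (p + 1) / (p + n + num_classes)


for name, p, n in [("Rule 1", 1, 0), ("Rule 2", 100, 2)]:
    print(f"{name}: accuracy = {rule_accuracy(p, n):.1%}, "
          f"with Laplace estimation = {laplace_accuracy(p, n, 10):.1%}")
# Rule 1: accuracy = 100.0%, with Laplace estimation = 18.2%
# Rule 2: accuracy = 98.0%, with Laplace estimation = 90.2%
```

The correction penalises rules supported by very few documents: Rule 1 drops from 100% to 2/11, while Rule 2 only drops from 100/102 to 101/112.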
The rules in an IRL ruleset are usually ordered so that they can be fired according to their priority. Two rule ordering strategies are described by Han and Kamber [40]: class-based ordering and rule-based ordering. In class-based ordering, the classes are sorted by their prevalence, so the rules for the most frequent class rank at the top, followed by those for the next most frequent class, and so on. Rule-based ordering, on the other hand, sorts the rules according to rule quality, using measures such as rule accuracy, coverage or length as the ordering priority. Ordering by accuracy usually ranks the more accurate rules first, so that they are fired first. Ordering by coverage ranks rules with larger coverage (covering more documents) higher. Ordering by length ranks rules according to the length of their antecedent: the more features in the antecedent, the longer the rule and the higher its ranking, so that more specific rules are ordered first.
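The ordering strategies described above amount to sorting the ruleset under different keys. The following sketch assumes a hypothetical `Rule` record that carries its own accuracy and coverage. The rules, figures and class counts are invented for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Rule:
    antecedent: Tuple[str, ...]  # features that must all be present
    label: str                   # class predicted by the rule
    accuracy: float              # e.g. P / (P + N) on the training set
    coverage: int                # number of documents covered


def class_based_order(rules: List[Rule], class_counts: Dict[str, int]) -> List[Rule]:
    """Rules of the most frequent class first, then the next most frequent, and so on."""
    return sorted(rules, key=lambda r: class_counts[r.label], reverse=True)


def rule_based_order(rules: List[Rule], by: str = "accuracy") -> List[Rule]:
    """Order by rule quality: 'accuracy', 'coverage' or antecedent 'length'."""
    keys = {
        "accuracy": lambda r: r.accuracy,
        "coverage": lambda r: r.coverage,
        "length": lambda r: len(r.antecedent),
    }
    return sorted(rules, key=keys[by], reverse=True)


r1 = Rule(("wheat",), "grain", 0.90, 40)
r2 = Rule(("oil", "barrel"), "crude", 0.95, 25)
r3 = Rule(("profit", "quarter", "net"), "earn", 0.85, 120)
class_counts = {"earn": 500, "crude": 80, "grain": 60}

print([r.label for r in class_based_order([r1, r2, r3], class_counts)])
print([r.label for r in rule_based_order([r1, r2, r3], by="accuracy")])
```

Python's `sorted` is stable, so rules that tie under the chosen key keep their original relative order. A secondary key would be needed if ties should be broken differently.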
Much previous research into IRL has been applied to the text classification task. Perhaps the most popular of the IRL algorithms is the Repeated Incremental Pruning to Produce Error Reduction (RIPPER) [22]. This algorithm evolved from the Reduced Error Pruning (REP) algorithm [9]. An extension to RIPPER was proposed by Vasile et al. [88]. The evolution of REP to RIPPER is described below.
REP is an algorithm for decision tree pruning, which can be adapted to rule learning systems [9]. When REP is used for rule learning, the training data is split into a growing set and a pruning set. An initial ruleset that overfits the growing set is generated and then pruned against the pruning set, at each step applying the pruning operator that yields the greatest reduction in error. Pruning stops when any further pruning would increase the error with respect to the pruning set. Though REP can work rather well for noisy data, Fürnkranz and Widmer [36] outline some problems they found in REP, the main one being its efficiency on large datasets. They proposed a rule-learning algorithm called Incremental Reduced Error Pruning (IREP) to address these problems.
IREP integrates pre-pruning and post-pruning into the learning process. After a rule is learned from the growing set, it is immediately pruned in a greedy fashion until its accuracy no longer improves on the pruning set. The rule is then added to the ruleset and all the examples it covers in both the growing and pruning sets are removed. The remaining examples are then split again into a growing set and a pruning set, and a new rule is learned in the same manner. The process is repeated until the predictive accuracy of the pruned rule is worse than that of the empty rule. IREP was shown to be more efficient than REP with a slight gain in accuracy. However, it was found that IREP did not perform well for domains with a very specific concept description [36].
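The greedy pruning step can be sketched as follows. This illustrates only the pruning of a single rule, not the full IREP loop, and it rests on two assumptions that the description above does not fix: the pruning operator deletes one condition at a time, and "accuracy on the pruning set" treats the rule as a binary classifier for its class (covered positives plus uncovered negatives, divided by the size of the pruning set). All names and data are hypothetical.

```python
from typing import FrozenSet, List, Tuple

Document = Tuple[FrozenSet[str], str]  # (terms in the document, class label)


def pruning_accuracy(antecedent: FrozenSet[str], label: str, prune_set: List[Document]) -> float:
    """Share of pruning-set documents handled correctly: positives that are
    covered plus negatives that are left uncovered (assumed measure)."""
    correct = 0
    for terms, doc_label in prune_set:
        covered = antecedent <= terms
        if covered == (doc_label == label):
            correct += 1
    return correct / len(prune_set)


def prune_rule(antecedent: FrozenSet[str], label: str, prune_set: List[Document]) -> FrozenSet[str]:
    """Greedily delete single conditions while accuracy on the pruning set improves."""
    current = frozenset(antecedent)
    current_acc = pruning_accuracy(current, label, prune_set)
    while len(current) > 1:
        # Evaluate every single-condition deletion and keep the best one.
        candidates = [(pruning_accuracy(current - {t}, label, prune_set), t)
                      for t in sorted(current)]
        best_acc, term = max(candidates, key=lambda c: c[0])
        if best_acc <= current_acc:  # no deletion improves the rule: stop
            break
        current, current_acc = current - {term}, best_acc
    return current


# Toy pruning set, invented for illustration.
prune_set = [
    (frozenset({"wheat", "farm", "tuesday"}), "grain"),
    (frozenset({"wheat", "export"}), "grain"),
    (frozenset({"wheat", "price"}), "grain"),
    (frozenset({"oil", "tuesday"}), "crude"),
    (frozenset({"oil", "price"}), "crude"),
]
overfitted = frozenset({"wheat", "tuesday"})
print(sorted(prune_rule(overfitted, "grain", prune_set)))
```

In this example the overfitted condition on "tuesday" is dropped, because removing it raises the assumed pruning accuracy from 0.6 to 1.0, whereas removing "wheat" would lower it.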
Cohen made some improvements to IREP and came up with an algorithm called Repeated Incremental Pruning to Produce Error Reduction (RIPPER) [22]. RIPPER was the result of three modifications to IREP. The first modification was an alternative rule value metric for pruning. The second was a new MDL-based heuristic for deciding when to stop adding rules to the ruleset. These two modifications improved IREP’s generalization performance, and the resulting algorithm was referred to as the modified IREP (IREP*). The third modification was a post-pass to optimize the ruleset produced by IREP*. These three modifications resulted in the formulation of RIPPER. The author reported that RIPPER was comparable to C4.5Rules [73] in terms of error rates, but was more efficient in dealing with large datasets. RIPPER is thus an IRL system which uses the covering algorithm to learn rules. RIPPER generates rules by greedily adding features to a rule until the rule achieves 100% accuracy. This process tries every possible value of each feature and chooses the one with the highest information gain. Following this rule building phase, a rule pruning phase is applied, whereby the generated rule is pruned using a pruning metric.
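The rule building phase can be illustrated with a small sketch. It assumes that "information gain" means the FOIL-style gain p1 * (log2(p1 / (p1 + n1)) - log2(p0 / (p0 + n0))), where p0 and n0 are the positives and negatives covered before a condition is added and p1 and n1 those covered afterwards. It further assumes binary term-presence features. The function names and data are hypothetical, and the sketch covers neither pruning nor the MDL-based stopping heuristic.

```python
import math
from typing import FrozenSet, List, Set, Tuple

Document = Tuple[FrozenSet[str], str]  # (terms in the document, class label)


def foil_gain(p0: int, n0: int, p1: int, n1: int) -> float:
    """FOIL-style information gain of a refinement (assumed form)."""
    if p1 == 0:
        return 0.0
    return p1 * (math.log2(p1 / (p1 + n1)) - math.log2(p0 / (p0 + n0)))


def grow_rule(docs: List[Document], label: str, vocabulary: Set[str]) -> List[str]:
    """Greedily add the term with the highest gain until no negatives are covered."""
    antecedent: List[str] = []
    covered = list(docs)
    while True:
        p0 = sum(d[1] == label for d in covered)
        n0 = len(covered) - p0
        if n0 == 0 or p0 == 0:  # 100% accurate, or nothing left to describe
            break
        best_term, best_gain = None, 0.0
        for term in sorted(vocabulary - set(antecedent)):
            subset = [d for d in covered if term in d[0]]
            p1 = sum(d[1] == label for d in subset)
            gain = foil_gain(p0, n0, p1, len(subset) - p1)
            if gain > best_gain:
                best_term, best_gain = term, gain
        if best_term is None:  # no term has positive gain
            break
        antecedent.append(best_term)
        covered = [d for d in covered if best_term in d[0]]
    return antecedent


# Toy growing set, invented for illustration.
grow_set = [
    (frozenset({"wheat", "farm"}), "grain"),
    (frozenset({"wheat", "export"}), "grain"),
    (frozenset({"corn", "farm"}), "grain"),
    (frozenset({"oil", "export"}), "crude"),
    (frozenset({"oil", "barrel"}), "crude"),
]
vocabulary = {"wheat", "corn", "farm", "export", "oil", "barrel"}
print(grow_rule(grow_set, "crude", vocabulary))
```

Unlike plain accuracy, the gain is weighted by the number of positives still covered, so a term covering two positive documents is preferred over an equally accurate term covering only one.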
Vasile et al. [88] subsequently extended RIPPER to include external knowledge in the form of taxonomies for IRL. Their work made use of WordNet [66], an online lexical reference system built by the Cognitive Science Laboratory at Princeton University. This extended rule induction algorithm is called Taxonomical RIPPER (TRIPPER). External knowledge was used both in the rule generation and rule pruning stages. In the rule generation stage, a process called feature space augmentation was introduced. The augmented set of features was obtained based on taxonomies defined over the values of the original features. In the rule pruning stage, pruning was replaced with taxonomy-guided abstraction, where different levels of specificity can be chosen for a feature under consideration. The results obtained from experiments on the ten biggest classes in the Reuters-21578 dataset showed that TRIPPER outperformed RIPPER in eight out of ten classes in terms of the precision and recall break-even point. TRIPPER was also able to generate rules which were generally more comprehensible than those generated by RIPPER.
Another IRL algorithm is Swap-1 [91]. Apté et al. [4] made use of Swap-1 to induce rules for text classification. In Swap-1, a covering set of rules is obtained through a heuristic search to find a single best rule that covers only one class. The rule is then added to the ruleset and the examples covered are removed. This process is repeated until there are no more examples to be covered. Swap-1 uses local optimization techniques to dynamically improve the ruleset. Pruning is done progressively to decrease the complexity of the ruleset. A method of pruning, called weakest-link pruning, is used to obtain a series of rulesets in decreasing order of complexity. The best ruleset is the one that results in the lowest observed true error rate with respect to a test dataset.
RIPPER is regarded as one of the most successful IRL algorithms. It has therefore been used by many other researchers as a benchmark for comparison or for testing other elements of text classification [35, 80, 85, 79]. JRip, WEKA’s implementation of RIPPER, will be used in this thesis as a technique for comparison with the proposed IRL mechanism.