Rule Induction Learners - Machine Learning and Data Mining Methods Used for Classification

Chapter 2: Literature Review

2.3 Machine Learning and Data Mining Methods

2.3.3 Common Methods in Machine Learning and Data Mining

2.3.3.1 Machine Learning and Data Mining Methods Used for Classification

2.3.3.1.5 Rule Induction Learners

Rule induction learners induce formal rules from training examples and can handle large datasets; the induced rules can then be used to classify data and support decisions. The rules are represented as if-then classification rules, which are easy to understand (Witten, Frank, and Hall, 2011). Rule induction algorithms employ condition-action rules and perform a heuristic search for the best rule set they can find whose conditions match the instances in the training examples. Two of the most popular rule induction algorithms are CN2 (Clark and Boswell, 1991) and RIPPER (Cohen, 1995).
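
As an illustration of the if-then representation described above, the following minimal Python sketch stores each rule as a set of attribute conditions plus a class label and classifies an instance with the first rule that matches. The `Rule` class, the `classify` function, the attribute names, and the default class are hypothetical choices made for this example and are not taken from the cited works.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    """IF every condition holds THEN predict `label` (illustrative structure)."""
    conditions: dict  # attribute name -> required value
    label: str

    def covers(self, instance: dict) -> bool:
        # A rule covers an instance when all of its conditions match.
        return all(instance.get(a) == v for a, v in self.conditions.items())


def classify(rules, instance, default):
    """Return the label of the first covering rule, else the default class."""
    for rule in rules:
        if rule.covers(instance):
            return rule.label
    return default


# Hypothetical rule set over made-up weather attributes.
RULES = [
    Rule({"outlook": "sunny", "humidity": "high"}, "no"),
    Rule({"outlook": "rainy", "windy": "yes"}, "no"),
]

if __name__ == "__main__":
    day = {"outlook": "sunny", "humidity": "high", "windy": "no"}
    print(classify(RULES, day, default="yes"))
```

Because each rule is a plain conjunction of attribute tests, the rule set can be read directly as a list of if-then statements, which is the property that makes this representation easy to understand.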

The rule induction algorithm CN2 (Clark and Boswell, 1991) is based on concepts from the AQ (Michalski et al., 1986) and ID3 (Quinlan, 1986) algorithms. It performs a general-to-specific search over a given set of training examples in order to generate an unordered list of rules. CN2 searches for candidate rules by adding conditions and testing each candidate against the examples, and it keeps the best rule that covers them. It then repeats the search until no more good rules can be found.
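
The general-to-specific search described above can be sketched as follows. This is a simplified illustration rather than the CN2 algorithm itself: it assumes a purely greedy search (a beam of width one), scores candidate rules with a Laplace accuracy estimate, learns rules for a single target class, and uses made-up toy data. All function names are hypothetical.

```python
def covers(conditions, example):
    """True when the example satisfies every attribute test in the rule."""
    return all(example[a] == v for a, v in conditions.items())


def laplace(conditions, examples, target, n_classes):
    """Laplace accuracy estimate of a rule (assumed evaluation heuristic)."""
    covered = [e for e in examples if covers(conditions, e)]
    hits = sum(1 for e in covered if e["class"] == target)
    return (hits + 1) / (len(covered) + n_classes)


def learn_one_rule(examples, target, attributes, n_classes):
    """Start from the empty (most general) rule and specialise it greedily."""
    conditions = {}
    best = laplace(conditions, examples, target, n_classes)
    while True:
        candidates = []
        for a in attributes:
            if a in conditions:
                continue
            for v in sorted({e[a] for e in examples}):
                trial = {**conditions, a: v}
                # Only keep specialisations that still cover a target example.
                if any(covers(trial, e) and e["class"] == target for e in examples):
                    score = laplace(trial, examples, target, n_classes)
                    candidates.append((score, a, v))
        if not candidates:
            return conditions
        score, a, v = max(candidates, key=lambda c: c[0])
        if score <= best:  # no specialisation improves the estimate
            return conditions
        conditions[a] = v
        best = score


def learn_rules(examples, target, attributes, n_classes):
    """Covering loop: learn a rule, remove what it covers, repeat."""
    rules, remaining = [], list(examples)
    while any(e["class"] == target for e in remaining):
        rule = learn_one_rule(remaining, target, attributes, n_classes)
        if not rule:  # only the empty rule was found, so stop
            break
        rules.append(rule)
        remaining = [e for e in remaining if not covers(rule, e)]
    return rules


# Made-up toy data for illustration only.
DATA = [
    {"outlook": "sunny", "windy": "no", "class": "play"},
    {"outlook": "sunny", "windy": "yes", "class": "play"},
    {"outlook": "rainy", "windy": "no", "class": "play"},
    {"outlook": "rainy", "windy": "yes", "class": "stay"},
    {"outlook": "rainy", "windy": "yes", "class": "stay"},
    {"outlook": "overcast", "windy": "no", "class": "play"},
]

if __name__ == "__main__":
    print(learn_rules(DATA, "stay", ["outlook", "windy"], n_classes=2))
```

Starting from the empty rule and adding one condition at a time is what makes the search general-to-specific: each step narrows the set of covered examples, and the loop stops when no further condition improves the estimate.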

The CN2 rule induction algorithm has been applied in a variety of domains for inducing classification rules. Džeroski and Lavrac (1996) employ CN2 in the medical diagnosis domain. The algorithm is applied to patient records with their corresponding diagnoses in order to induce rules for the early diagnosis of rheumatic diseases in new cases. Džeroski et al. (1997) use a rule induction approach for biological classification in the ecological domain. In their approach, CN2 is applied to data from biological samples to address several classification problems related to river water quality. Furthermore, Samanovic, Cukusic, and Jadric (2011) use a rule induction approach in the business domain. Their study applies the CN2 algorithm to publicly available online databases of business web sites in order to induce rules that predict the amount of foreign direct investment in a country, based on various business indicators.

The well-known rule induction algorithm RIPPER (Repeated Incremental Pruning to Produce Error Reduction) was proposed by William Cohen (Cohen, 1995) as an extension of the IREP algorithm (Furnkranz and Widmer, 1994) for generating classification rules in if-then form. It is a propositional rule learner that creates and tests rules until it arrives at a final rule list. RIPPER builds a rule set by repeatedly adding rules to an initially empty list until there are no more positive examples to be covered. It starts by splitting the training data into two sets: the first is used for growing rules, and the second for pruning weak rules. In the growing phase, RIPPER grows a rule by greedily adding conditions until the rule covers no negative examples. At each step, it examines the attribute values in the dataset and chooses the condition with the highest information gain. The pruning phase then incrementally prunes rules and weak conditions in order to reduce the error of the resulting rule set. The algorithm stops repeating the growing and pruning phases when one of the following occurs: the error rate of the last rule is greater than or equal to 50%, the description length (the number of bits needed to encode the rules) of the rule set is more than 64 bits greater than the smallest description length obtained so far, or there are no more positive examples to be covered. The result is an initial rule set. Afterwards, an optimisation stage revisits the rules of the initial rule set: it grows and prunes two alternative rules on randomised data and keeps the one with the smallest description length in the final rule set (Cohen, 1995).
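
The growing and pruning of a single rule, as described above, can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not Cohen's implementation: the information gain is computed in the FOIL style, the pruning score (p - n) / (p + n) is assumed, the grow and prune sets are fixed by hand rather than split at random, and the description-length test and the optimisation stage are omitted. All names and data are hypothetical.

```python
import math


def covers(rule, example):
    """A rule is a list of (attribute, value) tests that must all hold."""
    return all(example[a] == v for a, v in rule)


def counts(rule, examples, target):
    """Number of positive and negative examples covered by the rule."""
    covered = [e for e in examples if covers(rule, e)]
    p = sum(1 for e in covered if e["class"] == target)
    return p, len(covered) - p


def foil_gain(rule, candidate, examples, target):
    """FOIL-style information gain of adding one condition (assumed metric)."""
    p1, n1 = counts(rule + [candidate], examples, target)
    if p1 == 0:
        return float("-inf")
    p0, n0 = counts(rule, examples, target)  # p0 >= p1 > 0 here
    return p1 * (math.log2(p1 / (p1 + n1)) - math.log2(p0 / (p0 + n0)))


def grow_rule(grow, target, attributes):
    """Greedily add conditions until no negative example is covered."""
    rule = []
    while counts(rule, grow, target)[1] > 0:
        used = {a for a, _ in rule}
        candidates = [(a, v) for a in attributes if a not in used
                      for v in sorted({e[a] for e in grow})]
        if not candidates:
            break
        best = max(candidates, key=lambda c: foil_gain(rule, c, grow, target))
        if foil_gain(rule, best, grow, target) == float("-inf"):
            break
        rule.append(best)
    return rule


def prune_value(rule, prune, target):
    """Assumed pruning score (p - n) / (p + n) on the pruning set."""
    p, n = counts(rule, prune, target)
    return (p - n) / (p + n) if p + n else -1.0  # -1.0 is a chosen convention


def prune_rule(rule, prune, target):
    """Drop trailing conditions while the pruning score does not get worse."""
    best = rule
    for k in range(len(rule) - 1, 0, -1):
        shorter = rule[:k]
        if prune_value(shorter, prune, target) >= prune_value(best, prune, target):
            best = shorter
    return best


def acceptable(rule, prune, target):
    """Reject a rule whose error rate on the pruning set is 50% or more."""
    p, n = counts(rule, prune, target)
    return p + n > 0 and n / (p + n) < 0.5


# Made-up, hand-split toy data for illustration only.
GROW = [
    {"outlook": "rainy", "windy": "yes", "class": "stay"},
    {"outlook": "rainy", "windy": "yes", "class": "stay"},
    {"outlook": "rainy", "windy": "no", "class": "play"},
    {"outlook": "sunny", "windy": "yes", "class": "play"},
    {"outlook": "sunny", "windy": "no", "class": "play"},
    {"outlook": "overcast", "windy": "no", "class": "play"},
]
PRUNE = [
    {"outlook": "rainy", "windy": "yes", "class": "stay"},
    {"outlook": "rainy", "windy": "no", "class": "play"},
    {"outlook": "sunny", "windy": "yes", "class": "play"},
]

if __name__ == "__main__":
    grown = grow_rule(GROW, "stay", ["outlook", "windy"])
    pruned = prune_rule(grown, PRUNE, "stay")
    print(pruned, acceptable(pruned, PRUNE, "stay"))
```

Keeping the growing and pruning data separate means that a condition which only fits noise in the growing set tends to lower the score on the pruning set and is removed, which is the error-reduction idea behind the algorithm's name.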

The RIPPER rule induction algorithm has been applied to real-world problems and applications. For instance, Jovic and Bogunovic (2009) present a study of heart rate analysis using several classification algorithms, e.g. RIPPER, the C4.5 decision tree, Bayesian networks (Friedman, Geiger, and Goldszmidt, 1997), and random forests (Breiman, 2001). RIPPER is applied to a collection of databases of patient records obtained from the PhysioBank website, analysing features of heart rate variability in order to find rules that classify the patient records. Ulutaşdemir and Dağlı (2010) present a study in the medical diagnosis domain on predicting the risk of death from hepatitis. They employ several algorithms, namely JRIP (the WEKA (Hall et al., 2009) implementation of RIPPER), PART [a mixture of the C4.5 and RIPPER algorithms] (Witten et al., 1999), and J48 [an implementation of the C4.5 algorithm], on clinical hepatitis datasets obtained from the UCI Machine Learning Repository (Frank and Asuncion, 2010) in order to find rules for evaluating the risk of death from hepatitis. In the study by Peng et al. (2011), a wide range of classification algorithms (Bayesian network, Naïve Bayes, k-nearest neighbor, C4.5, RIPPER, support vector machine (SVM) (Platt, 1999), linear logistic regression (Le Cessie and Van Houwelingen, 1992) and radial basis function (RBF) network (Bishop, 1995)) are used to detect financial risks that may have occurred in recent years. In their study, RIPPER is applied to real credit risk and fraud risk datasets from six countries in order to find rules for predicting financial risk.
