C5.1.4 Decision rules

Willi Klösgen

German National Research Center for Information Technology (GMD) D-53757 Sankt Augustin, Germany

Abstract: Decision rules are the most prominent subtype of subgroup patterns (C5.3). They are expressed as if-then rules with preconditional left hand side subgroups and conclusive right hand side target groups. The subgroups and target groups are constructed as conjunctions of propositional selectors. More expressive rules based on first order Horn clauses are treated in C5.2.5. In this section, we discuss algorithms that derive rule sets which are going to be used for classification purposes. Specifically, we focus on description languages to construct left hand side preconditional subgroups, evaluation functions measuring the quality of rules, and search strategies deriving rule sets.

Keywords: subgroup mining, decision rules, classification, CN2, RIPPER, PRIM, CWS

C5.1.4.1 Introduction

Decision rules are a special subtype of subgroup patterns (B2.2 and C5.3). A disjoint and exhaustive set of target groups (classes) is given, and a set of rules is searched which can be used to classify a new object, i.e. to predict the target group (class) to which the object belongs. A decision rule thus identifies a subgroup of objects in the studied population that belong with dominant probability to one of the classes. Typically it is assumed that the set of target groups is given by a nominal target variable, so that each target group corresponds to a selector built with a single value of this target variable. The left hand side preconditional part of a rule describes a subgroup of objects (cases of the database) and is constructed as a conjunction of propositional selectors. Thus decision rules represent the upper left cell in table 1 of C5.3.1.2, which summarizes various subtypes of subgroup patterns, i.e. decision rules deal with only one population of objects and a nominal target variable.

We assume in this section a one-relational, propositional description language for constructing left hand side conditional subgroups; more advanced description languages are discussed in B2.2 and C5.2.5. Thus we regard a database consisting of one relation with a schema {A_1, ..., A_{n+1}} and associated domains D_i for the attributes A_i. A rule is then given by:

if A_1 ∈ V_1 ∧ ... ∧ A_n ∈ V_n then A_{n+1} = a, with V_i ⊆ D_i and a ∈ D_{n+1}.

Conjunctive selectors with V_i = D_i can of course be omitted in a left hand side condition.

Without loss of generality we assume the target attribute to be A_{n+1}.

A rule description language is defined in more detail by the type of subsets V_i that can be constructed for a left hand side of a rule. With respect to nominal variables A_i, a basic description language is given by including one-value selectors (A_i = a, e.g. marital status = single). A first extension also includes negations built with one value, i.e. A_i ≠ a. A next extension includes any internal disjunction of values (A_i = a ∨ ... ∨ b, e.g. marital status = single ∨ widowed ∨ divorced). With each of these extensions of the basic description language, the number of potential left hand side conditions gets larger, and much larger for internal disjunctions.
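The three selector types above can all be represented as a set V_i of allowed values per attribute. The following is a minimal sketch under that assumption; the class name `Rule`, the attribute names, and the four-value domain for marital status are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    """if A_1 in V_1 and ... and A_n in V_n then target = target_value."""
    conditions: dict      # attribute name -> frozenset of allowed values (V_i)
    target_value: str

    def covers(self, obj: dict) -> bool:
        # Attributes missing from `conditions` correspond to V_i = D_i,
        # i.e. to selectors that are omitted from the left hand side.
        return all(obj[attr] in allowed
                   for attr, allowed in self.conditions.items())


# Assumed domain D_i of the nominal attribute "marital_status".
MARITAL = frozenset({"single", "married", "widowed", "divorced"})

one_value = frozenset({"single"})              # marital status = single
negation = MARITAL - {"married"}               # marital status != married
internal_disjunction = frozenset({"single", "widowed", "divorced"})

rule = Rule({"marital_status": internal_disjunction}, target_value="yes")
print(rule.covers({"marital_status": "widowed", "age": 70}))   # True
print(rule.covers({"marital_status": "married", "age": 40}))   # False
```

With n values in a domain, there are n one-value selectors, n negated selectors, and on the order of 2^n internal disjunctions, which is why the last extension enlarges the search space so strongly.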

For an ordinal variable Ai, intervals of values are dynamically constructed by most rule search algorithms and used as selectors. Several discretization methods are offered (see also C3.4 and C7.1).

Decision rules can be applied for different data mining tasks. Besides classification, which is the main focus of this section, they can also be used for domain understanding and nugget detection. The main reason for this wide application area of decision rules is their expressive and easily human-readable representation, as well as the broad range of evaluation and search strategies that can be applied to find rule sets. When using decision rules for classification or prediction purposes, prediction accuracy plays a major role in evaluating rules and rule sets. The prediction accuracy of a rule set can be estimated by applying the rule set to a test set of objects and calculating the relative frequency of correct classifications. Rule searchers mainly differ in the evaluation functions used to rank rules and in the search strategies applied to heuristically traverse the space of potential left hand sides.

A set of rules is first derived in a data mining task and then applied in a subsequent performance task to predict the class to which a new object belongs. Three forms of rule sets can be distinguished: only one rule set which specifically refers to a single distinguished class, a system of k rule sets where each rule set is related to one of the k classes, and a single rule set jointly treating all classes. In the first case, the user is interested only in one distinguished class and wants to know whether an object belongs to the class. If no rule of the single rule set can be applied to a new object, the object is predicted not to belong to the class. In the second case, the k rule sets are typically derived separately one after the other.

Usually a vector of probabilities is connected with a rule, containing for each class the probability that an object satisfying the precondition of the rule belongs to that class. The probabilities are estimated from the database as the relative frequencies of objects that belong to the class within the preconditional subgroup. A rule usually predicts the class corresponding to the largest probability in the vector. When a rule set refers to a distinguished class, the probability is calculated that an object in the subgroup belongs to the class. This probability must be larger than 0.5 for a rule to be applied for classification. For overlapping rules (from k rule sets or jointly dealing with k classes), i.e. when several rules cover an object, some conflict resolving procedure has to be applied. The conflict can be resolved by ordering the rules in the form of a decision list, so that the first rule in the given order that is applicable to an object predicts its class, or by combination techniques, e.g. calculating the sum of the probability vectors of the competing applicable rules and predicting the class with the highest resulting probability.
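The estimation of probability vectors and the two conflict resolution schemes can be sketched as follows. This is only an illustration: the function names, the representation of a rule as a pair (coverage predicate, probability vector), and the example probabilities are assumptions, not part of any of the cited systems.

```python
from collections import Counter


def class_probabilities(data, covers, classes, target="cls"):
    """Relative class frequencies within the subgroup selected by `covers`."""
    counts = Counter(obj[target] for obj in data if covers(obj))
    total = sum(counts.values())
    return {c: (counts[c] / total if total else 0.0) for c in classes}


def predict_decision_list(rules, obj, default):
    """The first applicable rule in the given order predicts the class."""
    for covers, probs in rules:
        if covers(obj):
            return max(probs, key=probs.get)
    return default


def predict_by_sum(rules, obj, classes, default):
    """Sum the probability vectors of all applicable rules, take the maximum."""
    totals = {c: 0.0 for c in classes}
    fired = False
    for covers, probs in rules:
        if covers(obj):
            fired = True
            for c in classes:
                totals[c] += probs[c]
    return max(totals, key=totals.get) if fired else default


# Two overlapping rules with hypothetical probability vectors.
rules = [
    (lambda o: o["sex"] == "m", {"a": 0.6, "b": 0.4}),
    (lambda o: o["married"], {"a": 0.1, "b": 0.9}),
]
married_male = {"sex": "m", "married": True}
print(predict_decision_list(rules, married_male, default="a"))      # a
print(predict_by_sum(rules, married_male, ["a", "b"], default="a"))  # b
```

The example object is covered by both rules, and the two schemes disagree: the decision list stops at the first rule, whereas the summed vectors (0.7 for a, 1.3 for b) favour class b.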

C5.1.4.2 Sequential covering approaches

Because of their expressive and easily understandable representation, decision rules are often applied also for classification purposes, making the prediction transparent to the user. In contrast to applications that focus on domain understanding, classification accuracy plays the major role in classification applications, and simplicity of the rule set is only a secondary goal.

Thus the sequential covering search heuristic, which explores the space of potential rule sets by iteratively searching a best rule for the still uncovered data, is very attractive for classification applications. The interpretation problems that exist with this approach occur specifically in domain understanding applications. Therefore we concentrate in this classification section on the sequential covering algorithm and only briefly summarize other approaches.

The general control of the sequential covering algorithm is simple: it iteratively searches for the best rule (with some basic best rule search heuristic) and explicitly or implicitly eliminates all objects in the database that are covered by this rule before applying the next iteration step to the remaining data. This is only a greedy hill climbing approach in the space of rule sets, without backtracking, and it does not guarantee to find the best set of rules. In classification applications, (best) rule sets are assessed primarily by their classification accuracy, but simplicity (e.g. number of rules) also plays a role to avoid overfitting. Other criteria to assess rule sets are discussed in C5.3.1.3.
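The control loop just described can be sketched in a few lines. The sketch below is not the procedure of any particular system: it assumes one-value selectors only, a single distinguished target value, and the share of positive examples (with coverage as tie breaker) as a stand-in quality function. All function names and the toy data are hypothetical.

```python
def covers(cond, obj):
    return all(obj[a] == v for a, v in cond.items())


def quality(cond, data, target, positive):
    """(share of positives, coverage) of the subgroup described by cond."""
    covered = [o for o in data if covers(cond, o)]
    if not covered:
        return 0.0, 0
    pos = sum(o[target] == positive for o in covered)
    return pos / len(covered), len(covered)


def find_best_rule(data, attributes, target, positive):
    """Greedy general-to-specific hill climbing, one new selector per step."""
    cond, best = {}, quality({}, data, target, positive)
    improved = True
    while improved:
        improved = False
        for a in attributes:
            if a in cond:
                continue
            for v in sorted({o[a] for o in data}):
                cand = {**cond, a: v}
                score = quality(cand, data, target, positive)
                if score > best:
                    best, best_cond, improved = score, cand, True
        if improved:
            cond = best_cond
    return cond, best[0]


def sequential_covering(data, attributes, target, positive, min_share=0.5):
    rules, remaining = [], list(data)
    while any(o[target] == positive for o in remaining):
        cond, share = find_best_rule(remaining, attributes, target, positive)
        if not cond or share <= min_share:
            break
        rules.append((cond, share))
        # Eliminate all objects covered by the new rule.
        remaining = [o for o in remaining if not covers(cond, o)]
    return rules


def person(sex, married, target):
    return {"sex": sex, "married": married, "target": target}


DATA = ([person("m", "yes", "yes")] * 2 + [person("m", "no", "yes")] * 3
        + [person("f", "yes", "yes")] * 2 + [person("f", "no", "no")] * 2)

print(sequential_covering(DATA, ["sex", "married"], "target", "yes"))
# [({'sex': 'm'}, 1.0), ({'married': 'yes'}, 1.0)]
```

On this toy data the second rule is found only among the objects that remain after the first rule has removed all males, so it has to be read as a rule about married females.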

Let us assume a simple example to demonstrate the interpretation drawback of this sequential covering approach. If the first best rule identifies males, all males are eliminated from the data base, the next rule is searched among the remaining females and e.g. married is found as second best condition for a given target group (for simplicity reasons we assume the same right hand side target group in both rules). Because the second rule is found only in the subpopulation of females, it must correctly be presented as a rule for married females.

However the rule: if married and female then target group does not necessarily get the same evaluation in the whole population as the rule if married then target group in the subpopulation of females. Both rules have the same probability or certainty (the percentage of married females with the target property within married females), but different reference probabilities (the percentage of persons with the target property within the whole population, resp. the subpopulation of females). Thus the evaluation will be different, if the reference probability is used for the evaluation of a rule.
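A small numeric illustration of this point, with entirely hypothetical counts and with the difference between certainty and reference probability as an assumed evaluation function:

```python
# Hypothetical counts: size of each group and number of objects with the
# target property inside it.
population = {"size": 1000, "target": 400}
females = {"size": 500, "target": 150}
married_females = {"size": 200, "target": 120}

# Certainty is the same for both readings of the rule.
p = married_females["target"] / married_females["size"]        # 0.6

# Reference probabilities differ with the population the rule refers to.
p0_whole = population["target"] / population["size"]           # 0.4
p0_females = females["target"] / females["size"]               # 0.3

gain_whole = round(p - p0_whole, 3)      # "if married and female", whole population
gain_females = round(p - p0_females, 3)  # "if married", subpopulation of females
print(p, gain_whole, gain_females)       # 0.6 0.2 0.3
```

Any evaluation function that involves the reference probability would therefore rank the two formulations differently, although they describe the same objects.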

Also the rule: if married then target group does not necessarily hold in the whole population. Thus presenting both rules on males and married persons is misleading for domain understanding purposes. Comprehending the generated ordered rule set can be difficult, because the interpretation of a single rule depends on the other rules that precede it. For classification, this nevertheless works when the rules are executed in their generation order for classifying new objects. If a new object is a male, the target group is predicted. The second rule is applied only when the first rule is not applicable, i.e. it is applied correctly for a married female.

Within this sequential covering approach and the large and noisy data context of data mining, some variations of a heuristic, general to specific, and generate and test search strategy are applied in the partially ordered space of conditional left hand side subgroups to find a best rule (see also B8 and C5.3.1.3). We now summarize some of these variants, especially the CN2 (Clark and Niblett 1989), GoldDigger (Riddle et al. 1994), RIPPER (Cohen 1995), and PRIM (Friedman and Fisher 1999) approaches.

The description languages for rules include the much larger search space of internal disjunctions both for CN2 and PRIM; RIPPER allows single valued selectors which can also be negated. But additionally, RIPPER is applicable for multiple value attributes (B1), called set-valued features, constructing selectors such as red ∈ colour or blue ∉ colour. In this case, a single object can have several colours and the first selector defines the set of objects for which at least one colour is red.

Whereas the description language defines the set of all potential left hand side preconditions, the expansion operator (neighborhood operator), which a general to specific search strategy uses to construct the specializations of a description, fixes the space of descriptions that are actually processed. Usually one must distinguish nominal and ordinal variables. Using a nominal expansion variable V that does not yet occur in the current description, specializations are generated by adding, for all values v of V, a further conjunctive selector V = v, or also V ≠ v when negations are allowed. If the description language includes internal disjunctions, the description is specialized by eliminating one internal disjunctive term from a selector of the description; this includes the case that the internal disjunction of all values of V is implicitly given, which means that V does not really occur in the description. For ordinal expansion variables V, usually the selectors V < v and V > v are added for all values v. PRIM uses a more patient discretization, cutting only small upper and lower boundary quantiles (e.g. deciles corresponding to ten percent) from a current ordinal selector, which however is problematic for U-form dependencies.
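The expansion operators described above can be sketched as small candidate generators. The function names and the tuple encoding of selectors are assumptions made for this illustration. In particular, `peel` only produces the two candidate intervals of a quantile cut and does not choose between them, which a complete PRIM-style search would do with its quality function.

```python
def expand_nominal(attr, values, allow_negation=False):
    """Selectors V = v (and V != v if negations are allowed) for all values v."""
    selectors = [(attr, "==", v) for v in values]
    if allow_negation:
        selectors += [(attr, "!=", v) for v in values]
    return selectors


def shrink_disjunction(allowed):
    """Internal disjunctions: remove one disjunctive element from a selector."""
    if len(allowed) <= 1:
        return []
    return [allowed - {v} for v in sorted(allowed)]


def expand_ordinal(attr, values):
    """Selectors V < v and V > v for all occurring values v."""
    return [(attr, op, v) for v in sorted(set(values)) for op in ("<", ">")]


def peel(sorted_values, alpha=0.1):
    """Candidates that cut the lower or the upper alpha-quantile of an interval."""
    k = max(1, int(alpha * len(sorted_values)))
    return sorted_values[k:], sorted_values[:-k]


print(expand_nominal("colour", ["red", "blue"], allow_negation=True))
print(shrink_disjunction(frozenset({"single", "widowed", "divorced"})))
print(expand_ordinal("age", [30, 20, 30]))
lower_cut, upper_cut = peel(list(range(20)), alpha=0.1)
print(lower_cut[0], upper_cut[-1])   # 2 17
```

The contrast in step size is visible in the last two calls: `expand_ordinal` proposes cuts at every occurring value, while `peel` removes only two of twenty values per step.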

These expansion operators are applied in steps by a search heuristic, where the first step starts with the most general subgroup of all objects. Usually a hill-climbing or beam search strategy is used, selecting at each expansion step the best expansion, or the best w expansions in case of a beam search with width w. To identify the best expansions and the currently best overall description in an expansion step, a quality function is used for ranking the expanded descriptions. A broad spectrum of quality functions is used by diverse rule generating systems, mainly on a statistical or information theoretic (entropy measure) background.

A further distinction is whether rules are searched only for some fixed value of the target variable or for all values of the target variable in parallel. This option influences the quality function, which specifically refers to the distinguished target value when only rules for that value are to be derived. When a symmetric quality function such as entropy is used (not directed to a specially selected target value), the target value associated with the found left hand side of a rule is the one to which the majority of objects covered by the left hand side belong.
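A symmetric, entropy based ranking with majority class assignment and selection of the best w expansions might look like the sketch below. It assumes one-value selectors and the base-2 logarithm; the function names and the toy data are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy of the class distribution in a subgroup (lower is better)."""
    n = len(labels)
    return sum((c / n) * math.log2(n / c) for c in Counter(labels).values())


def majority_class(labels):
    """Target value assigned to a left hand side under a symmetric quality."""
    return Counter(labels).most_common(1)[0][0]


def best_w(candidates, data, target, w):
    """Keep the w descriptions whose covered subgroups have lowest entropy."""
    scored = []
    for cond in candidates:
        labels = [o[target] for o in data
                  if all(o[a] == v for a, v in cond.items())]
        if labels:
            scored.append((entropy(labels), cond, majority_class(labels)))
    scored.sort(key=lambda t: t[0])
    return scored[:w]


data = [
    {"sex": "m", "married": "yes", "cls": "pos"},
    {"sex": "m", "married": "no", "cls": "pos"},
    {"sex": "f", "married": "yes", "cls": "pos"},
    {"sex": "f", "married": "no", "cls": "neg"},
]
candidates = [{"sex": "m"}, {"sex": "f"}, {"married": "yes"}]
for score, cond, cls in best_w(candidates, data, "cls", w=2):
    print(cond, "->", cls, "entropy", score)
```

With w = 1 this reduces to a hill climber; with larger w the search keeps several descriptions alive per step, as in a beam search.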

In addition to a quality function, sometimes a filter is used to select only those expanded descriptions (and corresponding rules) that are statistically significant. This implicitly includes a cutoff criterion that stops further specialization when no generated expanded description is statistically significant. Again, various such significance tests are applied by diverse rule algorithms. Besides statistical significance, some other constraints may be applied by rule finders, e.g. that the precondition covers at least a specified minimal number of objects or consists of at most d conjunctions.
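One common form of such a filter is the likelihood ratio statistic, which compares the class frequencies observed in a subgroup with the frequencies expected from the overall class distribution. The sketch below assumes the usual form 2 Σ f_i ln(f_i / e_i) and a chi-square threshold supplied by the caller; the function names and the example counts are hypothetical.

```python
import math


def likelihood_ratio(observed, overall):
    """2 * sum_i f_i * ln(f_i / e_i) over the classes.

    observed: class -> frequency f_i inside the subgroup
    overall:  class -> frequency in the whole dataset; the expected
              frequency e_i is the overall share scaled to the subgroup size.
    """
    n = sum(observed.values())
    total = sum(overall.values())
    stat = 0.0
    for cls, f in observed.items():
        if f > 0:                       # a term with f_i = 0 contributes 0
            e = n * overall[cls] / total
            stat += f * math.log(f / e)
    return 2.0 * stat


def is_significant(observed, overall, threshold):
    # The statistic is approximately chi-square distributed with
    # (number of classes - 1) degrees of freedom.
    return likelihood_ratio(observed, overall) > threshold


overall = {"a": 50, "b": 50}
print(likelihood_ratio({"a": 5, "b": 5}, overall))           # 0.0
print(round(likelihood_ratio({"a": 10, "b": 0}, overall), 2))  # 13.86
# 3.84 is the 95% chi-square quantile for one degree of freedom.
print(is_significant({"a": 10, "b": 0}, overall, threshold=3.84))  # True
```

A subgroup whose class distribution mirrors the whole dataset scores 0 and is filtered out, however pure or large it is.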

Table 1: Four sequential covering strategies in comparison

| Search aspect | 1) CN2 | 2) GoldDigger | 3) RIPPER | 4) PRIM |
|---|---|---|---|---|
| Description language | internal disjunctions | pos. or negated one-value conjunctions | pos. one-value conjunctions, also set-valued | internal disjunctions |
| Expansion (specialization) with nominal variable V | remove a disjunctive element in a selector | for all values v of V: add conjunctive selector V = v, V ≠ v | for all values v of nominal (set-valued) V: V = v, (v ∈ V) | remove a disjunctive element in a selector |
| Discretization: specialization with ordinal variable V | user provided subrange values v: V < v, V > v | for all occurring values v of V: V < v, V > v | for all occurring values v of V: V < v, V > v | remove upper, lower quantile of current interval |
| Search strategy | beam search | hill climber | hill climber | hill climber |
| Backtrack strategy | no | yes, separate data | yes, separate data | yes, separate data |
| Quality for ranking | entropy = −Σ_i p_i log p_i, p_i = target share of class i | percentage of pos. examples for dist. target value | information gain | average of continuous or binary target |
| Significance test | likelihood ratio for expected, observed frequency distrib. | no | no | no |
| Other constraints | no | precondition: at least p pos. examples | no | precondition: at least p examples |
| Refinement of rule set | no | no | optimizing, pruning rule set | no |

A next component of a rule search procedure is the inclusion of backtracking modifying the heuristically found best rule(s) by trying more general preconditions (e.g. eliminating conjunctive terms) to check if they have a still better quality. Due to the heuristic nature of rule searching, not all of these generalizations may have been processed during previous search.

When the best rules are generated with a separate training dataset, statistical pruning approaches check on a test dataset whether more general rules have a better quality. Overfitting is thus treated by avoiding overly specific rules that adapt to the peculiarities of the given training dataset.
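A minimal sketch of such a pruning step follows. It assumes one-value selectors, the share of positive examples on the separate dataset as quality, and greedy removal of single conjunctive terms. None of these choices is taken from a specific system, and the names and data are hypothetical.

```python
def share_of_positives(cond, data, target, positive):
    covered = [o for o in data if all(o[a] == v for a, v in cond.items())]
    if not covered:
        return 0.0
    return sum(o[target] == positive for o in covered) / len(covered)


def prune(cond, prune_data, target, positive):
    """Drop conjunctive terms as long as quality on prune_data improves."""
    best = dict(cond)
    improved = True
    while improved and best:
        improved = False
        current = share_of_positives(best, prune_data, target, positive)
        for attr in list(best):
            general = {a: v for a, v in best.items() if a != attr}
            if share_of_positives(general, prune_data, target, positive) > current:
                best, improved = general, True
                break
    return best


def row(sex, married, city, target):
    return {"sex": sex, "married": married, "city": city, "target": target}


# Separate dataset that was not used to generate the rule.
PRUNE_DATA = [
    row("f", "yes", "X", "yes"), row("f", "yes", "X", "no"),
    row("f", "yes", "Y", "yes"), row("f", "yes", "Y", "yes"),
    row("f", "no", "X", "no"), row("m", "yes", "X", "no"),
]

rule = {"sex": "f", "married": "yes", "city": "X"}
print(prune(rule, PRUNE_DATA, "target", "yes"))
# {'sex': 'f', 'married': 'yes'}
```

Here the selector on city turns out to be an artefact of the training data: removing it raises the share of positives on the separate dataset from 0.5 to 0.75, while removing either of the other two selectors lowers it.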

Another distinction refers to the elimination of objects in the sequential covering strategy. Especially when only positive rules (for one value of a target variable) are searched, one can eliminate only the positive examples, which favours selectors that are less correlated to the already found subgroups.

RIPPER includes an optimization method for a rule set. Each rule of the derived rule set is considered in turn, and an alternative new rule is constructed from scratch, with pruning guided so as to minimize the error rate of the modified rule set. A decision based on the Minimum Description Length heuristic (C8.2.1) is then made whether the new rule or the original rule is included. This optimization algorithm can be iterated. An overview of these features of rule finding algorithms, as they are applied in some systems, is given in table 1.

C5.1.4.2 Other approaches

We have focused in C5.1.4.2 on general to specific, generate and test search heuristics that are dominantly applied in the data mining context of large and noisy data. Especially in the early Machine Learning environment, which concentrated on exact rules classifying training data perfectly, bottom up search strategies have been applied too. These are mainly directed by a single positive example, i.e. they apply example driven search strategies (as opposed to generate and test strategies). The main problem with these example driven strategies is that the examples can mislead the search when they are noisy. AQ (Michalski 1969) is a very prominent representative of the example driven (but general to specific) approaches. Within the family of AQ algorithms, some extensions also deal with noisy data (Michalski et al. 1986), but are less appropriate for large datasets.

Another approach not using the sequential covering technique is based on association rules (Liu et al. 1998). In a first step, all association rules (C5.2.3) are found that have as right hand side conclusion a simple selector built with the given target attribute (for a selected confidence and a low support parameter). In a second step, these rules are ordered hierarchically, first by confidence and then by support. In a further iteration, which similarly to the sequential covering approach eliminates all covered objects, a subset of the association rules is identified that is used as the final rule set for classification. The reported results show somewhat better classification accuracy than that of rule sets induced by the sequential covering approach.
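The ordering and selection steps can be sketched as follows. This is a simplified illustration and not the procedure of Liu et al. (1998). It assumes that the association rules, with their confidence and support, have already been mined, and it adds the assumption that a rule is kept only if it correctly classifies at least one remaining object. The dictionary layout and the rule names are hypothetical.

```python
def order_rules(rules):
    """Order by confidence first, then by support (both descending)."""
    return sorted(rules, key=lambda r: (-r["confidence"], -r["support"]))


def select_rules(rules, data, target):
    """Pass over the ordered rules, eliminating the objects each kept rule covers."""
    selected, remaining = [], list(data)
    for r in order_rules(rules):
        covered = [o for o in remaining if r["covers"](o)]
        # Assumption: keep a rule only if it classifies some remaining
        # object correctly.
        if any(o[target] == r["cls"] for o in covered):
            selected.append(r)
            remaining = [o for o in remaining if not r["covers"](o)]
        if not remaining:
            break
    return selected


RULES = [
    {"name": "r1", "covers": lambda o: o["married"] == "yes",
     "cls": "yes", "confidence": 0.9, "support": 0.3},
    {"name": "r2", "covers": lambda o: o["sex"] == "m",
     "cls": "yes", "confidence": 0.9, "support": 0.5},
    {"name": "r3", "covers": lambda o: o["sex"] == "f",
     "cls": "no", "confidence": 0.6, "support": 0.2},
    {"name": "r4", "covers": lambda o: o["sex"] == "m" and o["married"] == "yes",
     "cls": "yes", "confidence": 0.8, "support": 0.2},
]
DATA = [
    {"sex": "m", "married": "yes", "target": "yes"},
    {"sex": "m", "married": "no", "target": "yes"},
    {"sex": "f", "married": "yes", "target": "yes"},
    {"sex": "f", "married": "no", "target": "no"},
]

print([r["name"] for r in order_rules(RULES)])                  # ['r2', 'r1', 'r4', 'r3']
print([r["name"] for r in select_rules(RULES, DATA, "target")])  # ['r2', 'r1', 'r3']
```

Rule r4 is dropped because every object it covers has already been eliminated by the higher ranked rule r2.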

CWS (Domingos 1996) dynamically interleaves rule induction and performance evaluation of a current rule set. Thus a new rule (which can also be a refinement of an existing rule) is not evaluated independently from the already found rules, but the accuracy of the rule set consisting of the current rules and the new rule is assessed, and that new rule is added to the current rule set that achieves the highest increase in evaluation.

C5.1.4.3 Time complexity and performance

To assess the applicability of rule searching in very large databases, analysing the time complexity of rule searchers is important. Most comparisons use decision tree methods or CN2 as a reference. The C4.5RULES system (Quinlan 1993) generates rules from decision trees (C5.1.3) by statistical postpruning of decision trees. C4.5RULES has empirically been observed to require O(N³) computation time as a function of the number of objects N in the database (Cohen 1995). For CWS, a time complexity of O(N) is claimed (Domingos 1996), together with empirically better accuracy on large datasets than C4.5RULES and CN2.

RIPPER is competitive with C4.5RULES in accuracy, but has time complexity of O(N log N).

The number A of attributes is also important for analysing time complexity. A worst case bound for CN2 is given by O(N²A²W) with beam width W. O(NA²) is also reported for C4.5 when there are no numeric attributes (no discretization). Some pruned, near exhaustive search methods (Segal and Etzioni 1994) cause running time to become exponential in A. Thus an efficient preselection of attributes would often be necessary to ensure time efficient rule discovery.

These figures are only approximate values to assess the appropriateness of a rule searching method for a given very large dataset. A comprehensive, detailed, and authoritative comparative study of time complexity and other performance figures, e.g. accuracy, is still missing.

References

P. Clark and T. Niblett 1989. The CN2 induction algorithm. Machine Learning 3, 261–284.

W. Cohen 1995. Fast effective rule induction. In Proceedings of the Twelfth International Conference on Machine Learning, 115–123, Tahoe City, CA: Morgan-Kaufmann.

P. Domingos 1996. Linear Time Rule Induction. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, 96–101, Menlo Park, CA, AAAI Press.

J. Friedman and N. Fisher 1999. Bump Hunting in High-Dimensional Data. In Statistics and Computing, 9:2, 1–20.

B. Liu, W. Hsu, and Y. Ma 1998. Integrating Classification and Association Rule Mining. In Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining, 80–86, Menlo Park, CA, AAAI Press.

R.S. Michalski 1969. On the quasi-minimal solution of the general covering problem. In Proceedings of the First International Symposium on Information Processing, 125–128, Bled, Yugoslavia.

R.S. Michalski, I. Mozetic, J. Hong, and H. Lavrac 1986. The multi-purpose incremental learning system AQ15 and its testing application in three medical domains. In Proceedings of the Fifth National Conference on AI, 619–625, Seattle, WA: AAAI Press.

J.R. Quinlan 1993. C4.5: Programs for Machine Learning. San Mateo, CA: Morgan-Kaufmann.

P. Riddle, R. Segal, and O. Etzioni 1994. Representation Design and Brute-Force Induction in a Boeing Manufacturing Domain. In Applied Artificial Intelligence, 8, 125–147.

R. Segal and O. Etzioni 1994. Learning decision lists using homogeneous rules. In Proceedings of the Twelfth National Conference on AI, 1041–1045, Philadelphia, Morgan-Kaufmann.