CHAPTER 5 MDL-BASED PRUNING OF RULE SETS
5.3 Minimum Description Length (MDL) Principle
5.3.1 Existing Coding Methods
5.3.1.1 Model encoding
The coding length of a model M, L(M), is the sum of the coding lengths of its rules:
$$L(M) = \sum_{i} L(r_i) \tag{5.2}$$
Each rule consists of a sequence of conditions, where each condition is either a value of a nominal attribute or an interval of values (either greater-than, GT, or less-than-or-equal-to, LTE, some threshold value) of a continuous attribute. Therefore, the coding length for all conditions of a single rule is the sum of the coding lengths of these three possible kinds of conditions:
$$L(r_i) = L(\text{Nominal\_Conds}) + L(\text{GT\_Conds}) + L(\text{LTE\_Conds}) \tag{5.3}$$
The coding length of the nominal conditions is given by Equation 5.4 (adapted from (Pfahringer, 1997) and (Robnik-Šikonja and Kononenko, 1998)):
$$L(\text{Nominal\_Conds}) = \log_2 \binom{N_n}{m_1} + \sum_{i=1}^{m_1} \log_2 n_{A_i} \tag{5.4}$$
where $N_n$ is the number of all nominal attributes, $m_1$ is the number of nominal attributes involved in the antecedent of the rule, and $n_{A_i}$ is the number of possible values that a given nominal attribute $A_i$ can take.
The first term on the right-hand side of the above equation is the average coding length for selecting a subset of attributes from the set of all nominal attributes. The second term is the coding length for specifying the respective values for each of the selected attributes.
Similarly, the coding lengths of the continuous conditions, L(GT_Conds) and L(LTE_Conds), are estimated by first selecting the continuous attributes actually involved in the antecedent of the rule and then encoding the respective thresholds, as shown in Equations 5.5 and 5.6 (adapted from (Pfahringer, 1997) and (Robnik-Šikonja and Kononenko, 1998)):
$$L(\text{GT\_Conds}) = \log_2 \binom{N_c}{m_2} + \sum_{i=1}^{m_2} L(t_{ij}) \tag{5.5}$$
$$L(\text{LTE\_Conds}) = \log_2 \binom{N_c}{m_3} + \sum_{i=1}^{m_3} L(t_{ij}) \tag{5.6}$$
where $N_c$ is the number of all continuous attributes, and $m_2$ and $m_3$ are the numbers of continuous attributes used to create conditions of the form $A_i > t_{ij}$ and $A_i \le t_{ij}$, respectively.
The cost of specifying a single threshold for a continuous attribute $A_i$ is given by:
$$L(t_{ij}) = \log_2 (d - 1) \tag{5.7}$$
where $d - 1$ is the number of possible thresholds, and $d$ is the number of distinct values of the attribute $A_i$ occurring in the examples covered by the current rule.
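As a rough illustration of Equations 5.2 to 5.7, the following Python sketch computes the coding length of a rule set. The function names and the representation of a rule as three lists (value counts of the nominal attributes used, and distinct-value counts d for the attributes used in GT and LTE conditions) are assumptions made for this example only. It also assumes d is at least 2 for every thresholded attribute, so that the logarithm in Equation 5.7 is defined.

```python
from math import comb, log2

def nominal_cost(n_nominal, values_per_attr):
    """Eq. 5.4: pick m1 of the N_n nominal attributes, then one value for each."""
    m1 = len(values_per_attr)
    return log2(comb(n_nominal, m1)) + sum(log2(n) for n in values_per_attr)

def threshold_cost(d):
    """Eq. 5.7: one of d - 1 candidate thresholds among d distinct values."""
    return log2(d - 1)

def continuous_cost(n_continuous, distinct_counts):
    """Eqs. 5.5 and 5.6: pick the attributes used, then one threshold for each."""
    m = len(distinct_counts)
    return log2(comb(n_continuous, m)) + sum(threshold_cost(d) for d in distinct_counts)

def rule_cost(n_nominal, n_continuous, nominal_values, gt_distinct, lte_distinct):
    """Eq. 5.3: nominal, GT and LTE conditions are coded separately and summed."""
    return (nominal_cost(n_nominal, nominal_values)
            + continuous_cost(n_continuous, gt_distinct)
            + continuous_cost(n_continuous, lte_distinct))

def model_cost(rules, n_nominal, n_continuous):
    """Eq. 5.2: the model cost is the sum of its rule costs."""
    return sum(rule_cost(n_nominal, n_continuous, *rule) for rule in rules)

# Hypothetical data set with 4 nominal and 3 continuous attributes.
# The rule tests one nominal attribute with 3 values and one GT condition
# on an attribute with 9 distinct values among the covered examples.
rule = ([3], [9], [])
print(model_cost([rule], n_nominal=4, n_continuous=3))
```

In this sketch a rule with no conditions costs zero bits, because each binomial coefficient is 1 when no attribute is selected.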
5.3.1.2 Data encoding
Encoding the data given the model may be thought of either as encoding the data points that are covered by the model or as encoding the exceptional instances that are erroneously classified by the model. There are several different schemes for encoding the classes of the instances covered by model M. Let S be a set of instances covered by model M containing N instances, each belonging to one of k classes. Let $n_{c_i}$ be the number of instances in class $C_i$. The cost of encoding the classes for the N instances is given in (Quinlan and Rivest, 1989) as:
$$L(S/M) = \log_2 \binom{N + k - 1}{k - 1} + \log_2 \frac{N!}{n_{c_1}!\, n_{c_2}! \cdots n_{c_k}!} \tag{5.8}$$
The first term in Equation 5.8 is the number of bits needed to specify the class distribution of the training instances, that is, the number of instances in each class. The second term is the number of bits required to encode the class for each instance once the class distribution is known.
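The two terms described above can be sketched in Python as follows. The function name is hypothetical, and the second term is taken here to be the base-2 logarithm of the multinomial coefficient N!/(n_1! ... n_k!), which is the usual way of counting the class assignments compatible with a known class distribution; treat that reading as an assumption of this example.

```python
from math import comb, factorial, log2

def class_encoding_cost(class_counts):
    """Bits to send the class distribution, then the class of every instance."""
    n = sum(class_counts)
    k = len(class_counts)
    # First term: number of ways to split n instances over k classes.
    distribution_bits = log2(comb(n + k - 1, k - 1))
    # Second term: number of labelings consistent with the given class counts.
    arrangements = factorial(n)
    for count in class_counts:
        arrangements //= factorial(count)
    return distribution_bits + log2(arrangements)

# A pure set costs only the distribution term; a mixed set costs more.
print(class_encoding_cost([4, 0]))
print(class_encoding_cost([2, 2]))
```

The integer divisions are exact because each partial quotient is itself a product of binomial coefficients and a factorial.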
Krichevsky and Trofimov (1983) proposed another scheme (Equation 5.9):
(5.9)
where $\Gamma(\cdot)$ is the Gamma function, defined by the integral:
$$\Gamma(z) = \int_0^{\infty} y^{z-1} e^{-y}\, dy \tag{5.10}$$
It has been shown that Equation 5.9 yields more accurate encoding costs than Equation 5.8, especially when some nc are close to either 0 or N (Mehta et al., 1995).
A scheme for coding the exceptions to the model was first introduced in Quinlan (1993) and then in Quinlan (1994) with a slight modification. The basic idea can be outlined as
be given by identifying the misclassified instances of the rules R. Assuming a binary-class problem, misclassified instances can be specified by indicating which of the instances covered by the rules R are false positives and which of those not covered are false negatives, i.e. two sets of exceptions. It should be noted that learning tasks involving multiple classes, when dealt with on a class-by-class basis, are essentially two-class problems in which the goal is to generate rules that cover instances of one of the classes, called the target class, while not covering instances belonging to any other class. The coding length of a sensible encoding scheme for identifying t exceptions in N instances is (assuming all ways of selecting t of the N instances are equally likely):
$$L(N, t) = \log_2 (N + 1) + \log_2 \binom{N}{t} \tag{5.11}$$
which may be interpreted as the cost of specifying t in the range 0 to N, plus the cost of specifying which t of the N instances are exceptions.
Let $C$ be the number of instances covered by the rules and $\bar{C}$ the number of instances not covered. Further, let $f_p$ be the number of false positive instances (instances that are covered by the rules but are really negative) and $f_n$ the number of false negative instances (positive instances not covered by the rules). The exceptions cost of specifying the data given the model is then:
$$L(D/M) = L(C, f_p) + L(\bar{C}, f_n) \tag{5.12}$$
The first term is the number of bits needed to indicate the false positives among the instances covered by the rules, and the second term gives a similar expression for identifying the false negatives among the instances not covered. This is called the divided strategy, since errors are separated into two groups.
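A minimal Python sketch of Equation 5.11 and of the divided strategy of Equation 5.12 is given below. The function and parameter names are chosen for this illustration and do not come from the cited work.

```python
from math import comb, log2

def exceptions_cost(n, t):
    """Eq. 5.11: bits to state t (0..n) plus bits to say which t of n are exceptions."""
    return log2(n + 1) + log2(comb(n, t))

def divided_cost(covered, not_covered, fp, fn):
    """Eq. 5.12: false positives among covered instances, false negatives among the rest."""
    return exceptions_cost(covered, fp) + exceptions_cost(not_covered, fn)

# Hypothetical rule set covering 40 of 100 instances, with 3 false positives
# and 5 false negatives.
print(divided_cost(covered=40, not_covered=60, fp=3, fn=5))
```

Even an error-free rule set pays the two log2(N + 1) terms, since the receiver must still be told that both exception counts are zero.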
This scheme for encoding exceptions has two problems. First, it is symmetric and would give ambiguous results, as the cost of encoding $f_p$ and $f_n$ can be the same as that of encoding $(N - P) - f_p$ and $P - f_n$ respectively, where $P$ is the number of instances belonging to the target class and $(N - P)$ is the number of negative instances. Second, it could lead to poor choices among contending models. In order to solve the second problem, Quinlan (1994) described a simple approach which attempts to restrict the candidate models from which the final model is selected. He introduced a bias in favour of models whose predicted class distribution matches that observed in the data. This bias was justified in that a model learned from the data should accurately summarise that data. To implement this bias, an ad-hoc penalty function that significantly increases the description length of unsatisfactory models was employed. Empirical results showed that this bias was effective in selecting models with a lower error rate on unseen instances.
Quinlan (1994) also explored an alternative scheme, called the uniform coding strategy, for estimating L(D/M). This scheme (Equation 5.13) encodes errors in a single group rather than separating them into false positives and false negatives:
$$L(D/M) = L(N, e) \tag{5.13}$$
where $N$ is the total number of instances in the training data set and $e$ is the total number of errors, calculated as $f_p + f_n$.
Equation 5.13 still exhibits a counter-intuitive symmetry: the cost of encoding e errors is the same as the cost for N - e errors.
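The symmetry can be checked numerically. The sketch below assumes that L(N, e) in Equation 5.13 is the exceptions cost of Equation 5.11. The function name and the example figures are illustrative only.

```python
from math import comb, log2

def uniform_cost(n, e):
    """Eq. 5.13, using Eq. 5.11: all e errors are coded as one group of exceptions."""
    return log2(n + 1) + log2(comb(n, e))

n = 100
few_errors = uniform_cost(n, 5)        # a model that is wrong on 5 instances
many_errors = uniform_cost(n, n - 5)   # a model that is wrong on 95 instances
print(few_errors == many_errors)
```

The two costs coincide because the binomial coefficient for choosing e of N instances equals the one for choosing N - e of them, so this code length alone cannot distinguish a nearly perfect model from a nearly always wrong one.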
Instead of relying on an artificial penalty function, Quinlan (1995) presented a biased
exceptions coding strategy that achieves the same effect in a manner consistent with the
instances predicted by a model and observed in the training data are the same when the numbers of false positive and false negative errors are equal. Before the strategy is defined, a theoretically optimal scheme for coding the exceptions needs to be introduced.
In Equation 5.11, the t exceptions are encoded based on the assumption that t is equally likely a priori. The length of an ideal coding scheme in which t may have unequal likelihoods is given by:
$$L(N, t, p) = \log_2 (N + 1) + t \log_2 (1/p) + (N - t) \log_2 (1/(1 - p)) \tag{5.14}$$
where $p$ is the probability of a message selection. Of course, this assumes that $p$ is independent of the previous messages and that it is known to the receiver.
The coding length of the data given the model can then be defined by:
$$L(D/M) = L(C, f_p, e/(2C)) + L(\bar{C}, f_n, f_n/\bar{C}) \tag{5.15}$$
The term $L(C, f_p, e/(2C))$ is the number of bits needed to specify the error messages for covered instances. The term $L(\bar{C}, f_n, f_n/\bar{C})$ is the number of bits required to encode the error messages for instances not covered. The error probabilities of covered and non-covered instances are derived from the assumption that false positives and false negatives are balanced and that the sender first transmits the errors in the $C$ instances covered by the model and then communicates those in the $\bar{C}$ instances not covered. There is a slight complication: if the number $C$ of covered instances is small, $e/(2C)$ may be greater than one. To overcome this problem, the above equation is followed when the model covers at least half the instances. If less than half are covered, the following equation is used:
$$L(D/M) = L(\bar{C}, f_n, e/(2\bar{C})) + L(C, f_p, f_p/C) \tag{5.16}$$
where the false negative errors in the instances not covered are transmitted first, using the probability $e/(2\bar{C})$, followed by the false positives using $f_p/C$.
Adopting the coding scheme represented by Equation 5.14, Equations 5.12 and 5.13 can be rewritten as:
$$L(D/M) = L(C, f_p, f_p/C) + L(\bar{C}, f_n, f_n/\bar{C}) \tag{5.17}$$
$$L(D/M) = L(N, e, e/N) \tag{5.18}$$
The biased strategy and the divided strategy, represented respectively by Equation 5.15 and Equation 5.17, are similar except that the former uses the initial assumption of equal numbers of false positive and false negative errors to derive error probabilities for covered and non-covered instances.
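The relationship between the strategies can be illustrated with a short Python sketch built on Equation 5.14. All names are hypothetical. The biased cost follows the description in the text: the first group of errors is coded with probability e divided by twice the group size, the second with its observed error rate, and the group sent first is the covered instances when the model covers at least half of them. Two conventions are assumptions of this example: a term with a zero count contributes no bits, and the error rate of an empty group is taken as zero.

```python
from math import log2

def coded_cost(n, t, p):
    """Eq. 5.14: cost of t exceptions among n messages with exception probability p."""
    bits = log2(n + 1)
    if t > 0:
        bits += t * log2(1 / p)
    if n - t > 0:
        bits += (n - t) * log2(1 / (1 - p))
    return bits

def rate(count, total):
    """count / total, taken as 0 for an empty group (assumption of this sketch)."""
    return count / total if total else 0.0

def biased_cost(covered, not_covered, fp, fn):
    """Biased strategy: Eq. 5.15 if at least half the instances are covered, else Eq. 5.16."""
    e = fp + fn
    if covered >= not_covered:
        return (coded_cost(covered, fp, rate(e, 2 * covered))
                + coded_cost(not_covered, fn, rate(fn, not_covered)))
    return (coded_cost(not_covered, fn, rate(e, 2 * not_covered))
            + coded_cost(covered, fp, rate(fp, covered)))

def divided_cost(covered, not_covered, fp, fn):
    """Eq. 5.17: each group is coded with its own observed error rate."""
    return (coded_cost(covered, fp, rate(fp, covered))
            + coded_cost(not_covered, fn, rate(fn, not_covered)))

def uniform_cost(n, e):
    """Eq. 5.18: all errors coded as one group with the overall error rate."""
    return coded_cost(n, e, rate(e, n))

# Balanced errors: the biased and divided costs coincide.
print(biased_cost(60, 40, 3, 3), divided_cost(60, 40, 3, 3))
# Unbalanced errors: the biased strategy charges extra bits.
print(biased_cost(60, 40, 6, 0), divided_cost(60, 40, 6, 0))
```

With equal numbers of false positives and false negatives, e/(2C) equals the observed rate of false positives among covered instances, so the two strategies agree. When the errors are unbalanced, the biased strategy codes the first group with a probability that differs from its observed error rate, which costs more bits; this is how the preference for balanced models arises without a separate penalty function.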