Accuracy and coverage are fundamental properties of a nugget because they allow us to establish a partial ordering, ≤ca, of rules. The partial ordering ≤ca can be defined as follows:
• Given rules r1 and r2, r1 <ca r2 if and only if Cov(r1) ≤ Cov(r2) and Acc(r1) < Acc(r2), or Cov(r1) < Cov(r2) and Acc(r1) ≤ Acc(r2).
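\[ \mathrm{Cov}(r) = \frac{c}{b}, \qquad \mathrm{DefAcc}(\mathit{class}) = \frac{b}{d}, \qquad \mathrm{Load}(r) = \frac{\mathrm{Acc}(r) - \mathrm{DefAcc}(\mathit{class})}{1 - \mathrm{DefAcc}(\mathit{class})} \]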
• Also, r1 =ca r2 if and only if Cov(r1) = Cov(r2) and Acc(r1) = Acc(r2).
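The definition above can be sketched as a small comparison routine. This is only an illustrative sketch: the `Rule` class, the `compare_ca` function and the numeric values are hypothetical names and data introduced here, not part of the original formulation. The function returns `None` when two rules cannot be ordered, which is what makes the ordering partial.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Rule:
    """Hypothetical container for the two properties used by the ordering."""
    name: str
    cov: float  # coverage, Cov(r)
    acc: float  # accuracy, Acc(r)


def compare_ca(r1: Rule, r2: Rule) -> Optional[str]:
    """Return '<', '>' or '=' under the ca ordering, or None if incomparable."""
    if r1.cov == r2.cov and r1.acc == r2.acc:
        return "="
    # Both properties are no larger and the rules are not equal,
    # so at least one of the two inequalities is strict.
    if r1.cov <= r2.cov and r1.acc <= r2.acc:
        return "<"
    if r2.cov <= r1.cov and r2.acc <= r1.acc:
        return ">"
    return None  # one rule wins on coverage, the other on accuracy


# Illustrative values only.
a = Rule("a", cov=0.2, acc=0.6)
b = Rule("b", cov=0.2, acc=0.9)  # same coverage as a, higher accuracy
c = Rule("c", cov=0.7, acc=0.5)  # more coverage than b, lower accuracy

print(compare_ca(a, b))  # <
print(compare_ca(b, c))  # None
```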
The partial ordering ≤ca, illustrated in Figure 2, was also proposed independently by Bayardo and Agrawal (1999). In this simple graph, the coverage of a rule is represented by the x-axis and its accuracy by the y-axis. Rules r1 and r2 have the same coverage, with r2 having higher accuracy; r2 is the “preferred” rule, as it is higher in the partial ordering ≤ca. Similarly, r3 and r4 have the same accuracy, but r4 has higher coverage, so r4 is higher in the partial ordering than r3. r5 has lower accuracy and coverage than both r2 and r4, and hence r5 is lower in the partial ordering than both r2 and r4.
This ordering is called partial because it cannot order all rules. In fact, each rule defines a rectangular area, marked by the dotted line in Figure 2, and a rule can only be ordered with respect to another if it falls within the perimeter of the other rule’s area. For example, rules r2 and r4 cannot be ordered with respect to one another using ≤ca, as they belong to different accuracy/coverage areas.
The simple ordering of rules established by ≤ca may appear obvious. It seems safe to assume that, with equal accuracy, a rule with more coverage represents a more interesting concept. Similarly, with equal coverage, a rule with higher accuracy is a more interesting concept. The partial ordering ≤ca is therefore important and must be enforced by any measure of interest that is used to guide the search for nuggets. Surprisingly, many of the measures of interest proposed in the literature do not support this partial ordering.
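To make the requirement concrete, the following sketch checks whether a candidate interest measure respects ≤ca on a small set of rules. Everything here is an assumption made for illustration: the coordinates are hypothetical values chosen only to reproduce the relations described for Figure 2, and "respecting the ordering" is read strictly, i.e. r1 <ca r2 should imply a strictly lower score for r1. The two toy measures (accuracy alone and coverage alone) are not measures discussed in the text; they simply show how a violation can arise.

```python
from itertools import permutations
from typing import Callable, Dict, List, Tuple

Point = Tuple[float, float]  # (coverage, accuracy)

# Hypothetical coordinates consistent with the description of Figure 2.
RULES: Dict[str, Point] = {
    "r1": (0.3, 0.6),
    "r2": (0.3, 0.9),  # same coverage as r1, higher accuracy
    "r3": (0.5, 0.5),
    "r4": (0.8, 0.5),  # same accuracy as r3, higher coverage
    "r5": (0.2, 0.4),  # below both r2 and r4
}


def strictly_below(p: Point, q: Point) -> bool:
    """True if p <ca q: no larger on either property, and not equal."""
    return p != q and p[0] <= q[0] and p[1] <= q[1]


def violations(measure: Callable[[float, float], float],
               rules: Dict[str, Point]) -> List[Tuple[str, str]]:
    """Pairs (lo, hi) with lo <ca hi whose scores do not strictly increase."""
    return [
        (lo, hi)
        for lo, hi in permutations(rules, 2)
        if strictly_below(rules[lo], rules[hi])
        and not measure(*rules[lo]) < measure(*rules[hi])
    ]


def accuracy_only(cov: float, acc: float) -> float:
    return acc


def coverage_only(cov: float, acc: float) -> float:
    return cov


print(violations(accuracy_only, RULES))  # [('r3', 'r4')]
print(violations(coverage_only, RULES))  # [('r1', 'r2')]
```

Under this strict reading, a measure that looks at only one of the two properties cannot separate rules that differ only in the other property.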
The partial ordering establishes that the more covering and the more accurate a rule is, the more interesting it is. However, in real-world datasets there is often a trade-off between accuracy and coverage. In commercial databases, a completely accurate description of a class often cannot be found. The more general a pattern is (i.e., the higher its coverage), the lower its accuracy tends to be. Very specific patterns, capturing the behaviour of a few world entities, may achieve very high levels of accuracy. As patterns become more general and capture the behaviour of a whole target class, we can expect their accuracy to drop, reflecting the level of noise present in the real environment. Hence, when a completely covering and accurate description of a class cannot be found, the interest measure needs to balance the trade-off between the two properties. In other words, it needs to be able to select one of the defined areas of accuracy/coverage as the target for the discovery.
Figure 2: Partial rule ordering. [Plot of accuracy (Acc, y-axis) against coverage (Cov, x-axis) showing rules r1 to r5, with r1 ≤ca r2, r3 ≤ca r4, r5 ≤ca r2 and r5 ≤ca r4.]
Interest measures should therefore contain some criteria for selecting one of the areas of accuracy/coverage as the more interesting one. What, then, should the criteria be for choosing between high-accuracy areas and high-coverage areas? The answer will vary from one application to another.
For example, suppose that a medical database contains the characteristics and history of patients with a particular disease, and that patients are divided into two classes: those who suffer the disease in its initial stages, and those who suffer it in an advanced stage. Assume that a drug is available which may prevent the disease from spreading in the initial stages, but which would have serious side effects for patients with the disease in an advanced stage. In this case, the description of the class “patient with disease in initial stage” used to administer the drug would have to be very accurate to be of use, so accuracy is the most important property to consider in an interest measure guiding the search for rules. If, however, the drug had no side effects for other patients, but was extremely effective at curing the disease if found in the initial stages, then coverage of the class “patient with disease in initial stage” should be the guiding force for nugget discovery.
A measure of interest for nugget discovery must therefore have two fundamental qualities:
• It must establish the ≤ca partial ordering between any two nuggets that are comparable under that ordering.
• It must also allow the search to be geared towards accurate rules or highly covering rules, depending on the preferences of the user or the application needs.
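As a minimal sketch of how a single measure could offer both qualities, consider a weighted sum of accuracy and coverage. This particular measure, the parameter name `lam`, and the rule coordinates are all hypothetical and introduced here purely for illustration; it is not a measure proposed in the text. For any weight strictly between 0 and 1 the score increases strictly with both accuracy and coverage, so rules that are comparable under ≤ca are scored consistently with it, while the weight decides between rules that the partial ordering leaves incomparable.

```python
from typing import Dict, Tuple

# Hypothetical (coverage, accuracy) values. r2 and r4 are incomparable
# under the partial ordering, and r5 lies below both of them.
RULES: Dict[str, Tuple[float, float]] = {
    "r2": (0.3, 0.9),  # accurate but not very covering
    "r4": (0.8, 0.5),  # covering but less accurate
    "r5": (0.2, 0.4),
}


def weighted_interest(cov: float, acc: float, lam: float) -> float:
    """Illustrative measure: lam weights accuracy, (1 - lam) weights coverage."""
    if not 0.0 < lam < 1.0:
        # At lam = 0 or lam = 1 one property is ignored, so rules that
        # differ only in that property would receive equal scores.
        raise ValueError("lam must lie strictly between 0 and 1")
    return lam * acc + (1.0 - lam) * cov


def preferred(lam: float) -> str:
    """Name of the highest-scoring rule for a given weight."""
    return max(RULES, key=lambda name: weighted_interest(*RULES[name], lam))


print(preferred(0.9))  # r2: the search is geared towards accuracy
print(preferred(0.1))  # r4: the search is geared towards coverage
```

In both settings r5 scores below r2 and r4, as the partial ordering requires; only the choice between the incomparable rules r2 and r4 changes with the weight.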