

4.4 Post-Processing of Association Rules

4.4.4 Objective Rule Interestingness Measures

Objective measures of interestingness are based on the statistical properties of association rules. Among others [137], the best known are interest [7, 60], correlation [60, 206] and intensity of implication [118]. We illustrate them with the real example shown in table 4.2, which lists the results produced by the association rules algorithm on data set 1, containing 88163 receipts:

| Itemset | Support count | Support (%) |
| --- | --- | --- |
| {orange juice} | 579 | 0.6573 % |
| {semi-skimmed milk} | 2885 | 3.273 % |
| {orange juice, semi-skimmed milk} | 140 | 0.1584 % |

The interest, also called lift, measures the statistical dependence of a product association by relating the observed frequency of co-occurrence s(A ∪ C) of the antecedent (A) and the consequent (C) of the rule to the frequency of co-occurrence expected under the assumption that A and C are independent. Interest is therefore defined as:

$$I(A \Rightarrow C) = \frac{s(A \cup C)}{s(A) \cdot s(C)} \tag{4.1}$$

Note that interest is a symmetric measure. An interest value equal to 1 indicates that the observed frequency of the rule in the data (numerator) equals the expected frequency (denominator), given the assumption of independence between the antecedent (A) and the consequent (C) of the rule. An interest value larger than 1 indicates that the combination of A and C occurs more frequently in the data than we would expect (i.e. positive interdependence). An interest value smaller than 1 indicates less co-occurrence than expected, and thus a negative interdependence. Applied to the example given in table 4.2, the interest measure of the rule orange juice ⇒ semi-skimmed milk equals

$$I(\text{orange juice} \Rightarrow \text{semi-skimmed milk}) = \frac{0.001584}{0.006573 \times 0.03273} = 7.363$$
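The calculation above can be reproduced with a few lines of Python. This is a minimal sketch, assuming that supports are passed as fractions (the percentages of table 4.2 divided by 100); the function name `interest` and its parameter names are illustrative and do not come from the original text.

```python
def interest(support_ac: float, support_a: float, support_c: float) -> float:
    """Interest (lift) of A => C, computed from relative supports (fractions)."""
    if support_a <= 0 or support_c <= 0:
        raise ValueError("antecedent and consequent supports must be positive")
    # Observed co-occurrence divided by the co-occurrence expected
    # if A and C were independent.
    return support_ac / (support_a * support_c)


# Relative supports taken from table 4.2 (percentages divided by 100).
lift = interest(support_ac=0.001584, support_a=0.006573, support_c=0.03273)
print(round(lift, 3))
```

Swapping the antecedent and the consequent leaves the result unchanged, which reflects the symmetry of the measure.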

This is a fairly high value and thus demonstrates a strongly positive interdependence between orange juice and semi-skimmed milk. The interest measure has therefore been used as guidance for retailers to identify complementarity and substitution effects between products [25]. From a marketing modeller’s perspective, this may however not be entirely accurate. Marketing modellers usually define complementarity and substitution in terms of the effect on the sales of a particular product as a result of a marketing action on another product (see figure 3.1). The interest therefore really only measures higher or lower than expected co-occurrence, not complementarity or substitution.

Our own analysis on real data showed that when looking at the association rules with high interest values, the involved products are typically usage complements or result from variety-seeking behaviour by the customer.

Another objective measure of interestingness for association rules is based on the statistical notion of correlation between the items in the antecedent and the consequent of the rule [60, 177, 206]. The idea is to construct a contingency table from the association rule results and test the interdependence between the antecedent and the consequent of the rule by means of chi-squared analysis. This is demonstrated below for the example in table 4.3.

| | semi-skimmed milk | ¬ semi-skimmed milk | Totals |
| --- | --- | --- | --- |
| orange juice | 140 | 439 | 579 |
| ¬ orange juice | 2745 | 84839 | 87584 |
| Totals | 2885 | 85278 | 88163 |

Table 4.3: contingency table for orange juice and semi-skimmed milk
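The cells of such a contingency table follow directly from the support counts of the two items and of their combination. The sketch below illustrates this under the assumption that only the total number of receipts and the three support counts are known; the function `contingency_table` and its argument names are hypothetical.

```python
def contingency_table(n: int, count_a: int, count_c: int, count_ac: int):
    """Return the 2x2 table [[A and C, A and not C], [not A and C, not A and not C]]."""
    a_not_c = count_a - count_ac
    not_a_c = count_c - count_ac
    not_a_not_c = n - count_a - count_c + count_ac
    cells = [[count_ac, a_not_c], [not_a_c, not_a_not_c]]
    if any(value < 0 for row in cells for value in row):
        raise ValueError("inconsistent support counts")
    return cells


# A = orange juice, C = semi-skimmed milk, counts as in table 4.2.
table = contingency_table(n=88163, count_a=579, count_c=2885, count_ac=140)
for row in table:
    print(row)
```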

A chi-squared analysis on this contingency table produces the following result, with i representing the row index, j the column index, and O_ij and E_ij the observed and expected frequencies of cell (i,j):

$$\chi^2 = \sum_i \sum_j \frac{(O_{ij} - E_{ij})^2}{E_{ij}} \tag{4.2}$$

thus,

$$
\begin{aligned}
\chi^2 ={}& \frac{(140 - 579 \times 2885/88163)^2}{579 \times 2885/88163} + \frac{(439 - 579 \times 85278/88163)^2}{579 \times 85278/88163} \\
&+ \frac{(2745 - 87584 \times 2885/88163)^2}{87584 \times 2885/88163} + \frac{(84839 - 87584 \times 85278/88163)^2}{87584 \times 85278/88163} \\
={}& 804.87 \gg 3.84
\end{aligned}
$$
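The same statistic can be computed programmatically. The following sketch uses plain Python and no continuity correction, matching formula 4.2; the function name `chi_squared` is illustrative, and the critical value 3.84 is the one quoted in the text for one degree of freedom.

```python
def chi_squared(observed):
    """Pearson chi-squared statistic for a table of observed counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(observed):
        for j, o_ij in enumerate(row):
            # Expected count under independence of rows and columns.
            e_ij = row_totals[i] * col_totals[j] / n
            statistic += (o_ij - e_ij) ** 2 / e_ij
    return statistic


observed = [[140, 439], [2745, 84839]]  # table 4.3
chi2 = chi_squared(observed)
significant = chi2 > 3.84  # cut-off for one degree of freedom at the 0.05 level
print(round(chi2, 2), significant)
```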

At a significance level of 0.05 with one degree of freedom, the cut-off value is 3.84. Consequently, semi-skimmed milk and orange juice are significantly interdependent at the 95% (1 - 0.05) confidence level. However, some important comments can be made on these results.

First of all, there is a relation between the chi-squared test for statistical interdependence and the interest value (see formula 4.1). Indeed, the chi-squared distance for the (1,1) cell in the contingency table corresponds closely to the interest measure: the more the interest value deviates from 1, the bigger its contribution to the chi-squared statistic. This is fairly easy to prove by changing notation from interest to chi-squared. Expressed in counts, the interest of the combination in cell (i,j) is the ratio of observed to expected frequency, O_ij / E_ij, so that

$$\frac{(O_{ij} - E_{ij})^2}{E_{ij}} = E_{ij} \left( \frac{O_{ij}}{E_{ij}} - 1 \right)^2$$

Now, for a given expected frequency E_ij, the rule that maximizes the deviation of the interest from 1, i.e. |O_ij / E_ij - 1|, also maximizes the chi-squared distance of that cell in the contingency table. In other words, rules with strong negative or positive interdependence as measured by the interest value contribute strongly to the chi-squared statistic. The only difference is that the chi-squared statistic measures the overall interdependence for a set of variables (thus over the entire contingency table), whereas the interest measures the interdependence for one particular combination of events of those variables.
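The link between the two measures can be checked numerically on table 4.3. The sketch below computes, for each cell, the ratio of observed to expected count (the interest expressed in counts) and the chi-squared distance of that cell; the helper `cell_contributions` is hypothetical. Because it works from counts, the ratio for the (1,1) cell may differ slightly from the value obtained from the percentages in table 4.2.

```python
def cell_contributions(observed):
    """For each cell return (observed/expected ratio, chi-squared distance, expected)."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    result = []
    for i, row in enumerate(observed):
        for j, o_ij in enumerate(row):
            e_ij = row_totals[i] * col_totals[j] / n
            result.append((o_ij / e_ij, (o_ij - e_ij) ** 2 / e_ij, e_ij))
    return result


cells = cell_contributions([[140, 439], [2745, 84839]])
total = sum(distance for _, distance, _ in cells)
for ratio, distance, expected in cells:
    # The distance equals expected * (ratio - 1)^2 up to rounding error.
    print(f"ratio={ratio:.3f} distance={distance:.3f} share={distance / total:.1%}")
```

In this example the (1,1) cell, which has the largest deviation of the ratio from 1, accounts for most of the statistic.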

Secondly, the chi-squared test rests on the normal approximation to the binomial distribution. This approximation breaks down when the expected values are small. Moore [205] therefore suggests using the chi-squared test only when all cells in the contingency table have expected values greater than 1, and at least 80% of the cells have expected values greater than 5. In a real-world setting, however, these requirements are easily violated. One way to avoid this problem is to set the minimum support threshold high enough; another is to use an exact calculation of the probability instead of the chi-squared approximation. The latter, however, turns out to be prohibitively expensive [60].
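The rule of thumb attributed to Moore [205] is straightforward to check before running the test. The following sketch is one possible reading of it, with the thresholds 1, 5 and 80% taken from the text; the function names and the default parameter names are assumptions made for illustration.

```python
def expected_counts(observed):
    """Expected cell counts under independence of rows and columns."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def chi_squared_applicable(observed, min_all=1.0, min_most=5.0, share=0.8):
    """True if all expected counts exceed min_all and at least
    `share` of them exceed min_most."""
    expected = [e for row in expected_counts(observed) for e in row]
    all_ok = all(e > min_all for e in expected)
    most_ok = sum(e > min_most for e in expected) >= share * len(expected)
    return all_ok and most_ok


print(chi_squared_applicable([[140, 439], [2745, 84839]]))  # table 4.3
print(chi_squared_applicable([[1, 2], [3, 1000]]))          # a sparse table
```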

Finally, it is tempting to use the value of the chi-squared statistic as an indication of the degree of dependence. However, an important limitation of the chi-squared statistic is that, for the same degree of dependence, its value grows with the size of the data set. While comparing chi-squared values within the same data set may be meaningful, it is therefore not advisable to compare chi-squared values across different data sets.
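This sensitivity to the size of the data set can be illustrated with a small experiment. The sketch below multiplies every cell of table 4.3 by 10, which leaves all proportions (and hence the degree of dependence) unchanged, and compares the two statistics; the scaling factor and the function name are arbitrary choices for illustration.

```python
def chi_squared(observed):
    """Pearson chi-squared statistic for a table of observed counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    return sum(
        (o - row_totals[i] * col_totals[j] / n) ** 2 / (row_totals[i] * col_totals[j] / n)
        for i, row in enumerate(observed)
        for j, o in enumerate(row)
    )


base = [[140, 439], [2745, 84839]]                # table 4.3
scaled = [[10 * value for value in row] for row in base]  # same proportions, 10x the receipts
ratio = chi_squared(scaled) / chi_squared(base)
print(round(ratio, 6))
```

The statistic scales with the number of receipts even though the relationship between the two products is identical in both tables.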

Last but not least, intensity of implication [118, 256] is also worth mentioning within the context of statistical measures of interestingness of association rules. The idea is to measure the statistical surprise of having so few negative examples on a rule as compared with a random draw. Consider a database D, where |D| is the number of transactions in the database, and an association rule X ⇒ Y. Now, let U and V be two sets randomly chosen from D with the same cardinalities as X and Y, i.e. s(X) = s(U) and s(Y) = s(V), and let ¬Y mean ‘not Y’, as shown in figure 4.6.

Figure 4.6: Intensity of implication (the observed sets X and Y in the database, with s(X), s(Y) and s(X ∧ ¬Y), are compared with the random sets U and V, with s(U), s(V) and s(U ∧ ¬V))

Let s(U ∧ ¬V) be the random variable that counts the negative examples obtained at random under the assumption that U and V are independent, and s(X ∧ ¬Y) the number of negative examples observed on the rule. Now, if s(X ∧ ¬Y) is unusually small compared with s(U ∧ ¬V), the value we would expect at random, then we say that the rule X ⇒ Y has a strong statistical implication. In other words, the intensity of implication for a rule X ⇒ Y is stronger if the quantity P[s(U ∧ ¬V) ≤ s(X ∧ ¬Y)] is smaller. Intensity of implication is then defined as 1 - P[s(U ∧ ¬V) ≤ s(X ∧ ¬Y)]. The random variable s(U ∧ ¬V) follows the hypergeometric law, and therefore the intensity of implication can be written as:

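$$1 - P[s(U \wedge \neg V) \le s(X \wedge \neg Y)] = 1 - \sum_{k=\max(0,\, s(X) - s(Y))}^{s(X \wedge \neg Y)} \frac{\binom{|D| - s(Y)}{k} \binom{s(Y)}{s(X) - k}}{\binom{|D|}{s(X)}}$$

where |D| is the number of transactions in the database and all supports are expressed as transaction counts.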
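For databases of moderate size, the definition 1 - P[s(U ∧ ¬V) ≤ s(X ∧ ¬Y)] can be evaluated exactly. The sketch below assumes that supports are given as transaction counts and that s(U ∧ ¬V) is hypergeometric, counting how many of the s(X) randomly drawn transactions fall among the transactions that do not contain Y. The function `intensity_exact` and its parameter names are illustrative, and the example values are those of table 4.3.

```python
from math import comb


def intensity_exact(n: int, s_x: int, s_y: int, neg: int) -> float:
    """1 - P[N <= neg], where N is hypergeometric: s_x transactions drawn
    from n, counting those that fall among the n - s_y without Y."""
    k_min = max(0, s_x - s_y)          # fewer negatives than this is impossible
    k_max = min(neg, s_x, n - s_y)     # more negatives than this is impossible
    numerator = sum(
        comb(n - s_y, k) * comb(s_y, s_x - k) for k in range(k_min, k_max + 1)
    )
    # Exact integer arithmetic up to the final division.
    return 1.0 - numerator / comb(n, s_x)


# orange juice => semi-skimmed milk: 439 of the 579 orange juice receipts
# do not contain semi-skimmed milk.
print(intensity_exact(n=88163, s_x=579, s_y=2885, neg=439))
```

The binomial coefficients involved have well over a thousand digits even for this single rule, which gives an impression of why the exact formula becomes impractical for large databases.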

This formula for intensity of implication is suitable as long as the number of cases in the database, i.e. |D|, is reasonably small. Otherwise, the binomial coefficients in the above formula explode very quickly. Therefore, Suzuki et al. [256] came up with an approximation of this formula for big data sets. They argue that if s(U ∧ ¬V) is small, which is often the case in rule discovery, then the Poisson approximation can be applied. In that case, the above formula for intensity of implication reduces to a much simpler version that is easier to compute:

$$1 - \sum_{k=0}^{s(X \wedge \neg Y)} \frac{\lambda^k}{k!} e^{-\lambda}, \qquad \lambda = \frac{s(X) \, (|D| - s(Y))}{|D|}$$

Nevertheless, the computational burden is still quite high since, for every rule, the calculation involves a summation over a relatively large number of terms.
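A Poisson version of the calculation can be sketched as follows. It assumes that the Poisson parameter is the mean of the hypergeometric variable, i.e. s(X) times the fraction of transactions without Y; this choice, like the function name `intensity_poisson`, is an assumption of the sketch rather than a formula taken from [256]. Terms are evaluated in log space to avoid overflow in the powers and factorials.

```python
from math import exp, lgamma, log


def intensity_poisson(n: int, s_x: int, s_y: int, neg: int) -> float:
    """Approximate 1 - P[N <= neg] with N ~ Poisson(lam),
    lam = s_x * (n - s_y) / n (assumed mean of the negative examples)."""
    lam = s_x * (n - s_y) / n
    if lam == 0.0:
        return 0.0  # no negative examples possible, so P[N <= neg] = 1
    # One term per possible number of negative examples up to the observed one.
    cumulative = sum(
        exp(-lam + k * log(lam) - lgamma(k + 1)) for k in range(neg + 1)
    )
    return 1.0 - min(1.0, cumulative)


print(intensity_poisson(n=88163, s_x=579, s_y=2885, neg=439))
```

The loop runs over s(X ∧ ¬Y) + 1 terms for every rule, which is the summation cost referred to above.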