
Chapter 2 Literature Review

2.3 Data Analysis

2.3.3 Data Mining and Machine Learning Methods

2.3.3.7 Association Rule Mining

In the early 1990s, as technology for recording business data became increasingly prevalent, the number of studies analysing that data grew, marking the beginning of the field of data mining. One of the earliest applications was identifying shopping trends in large-scale databases of customer transactions. Each record in these databases is the list of items that a customer bought in a single shopping transaction. This problem presented a unique challenge: the cases did not share a fixed, limited set of attributes with a value for each one, but instead consisted of a variable number of items drawn from a large set of possible items. To work with these cases, each was typically treated as having a large number of Boolean attributes, one for each possible item, most of which are false for any given case. This allows the automatic identification of which items are associated with which other items; in this case to identify how better to market products, but the approach can be generalised to discovering which attributes are related to which other attributes (Agrawal, Imielinski, & Swami, 1993). The resultant association rules are of the form X ⇒ I, where X is an item (or attribute) set, and I is an individual item (or attribute) which is not contained in X (Agrawal, et al., 1993).
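The Boolean-attribute treatment of transactions can be sketched as follows. This is a minimal illustration rather than the representation used in any cited work; the three transactions and their item names are hypothetical.

```python
# Hypothetical transactions: each is the set of items bought together.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "jam"},
    {"milk"},
]

# The item universe defines one Boolean attribute per possible item.
items = sorted(set().union(*transactions))

# Each case becomes a fixed-length Boolean row. With a large item
# universe, most entries in any one row would be False.
boolean_rows = [[item in t for item in items] for t in transactions]

for row in boolean_rows:
    print(dict(zip(items, row)))
```

Every case now has the same attributes, at the cost of a wide and mostly false table.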

Interestingness Measures

While it is trivially simple to enumerate all the possible association rules that could be supported by a dataset, or more correctly a data model, the difficulty lies in identifying which of those rules are adequately evidenced by the data to warrant regarding them as reasonable and indicative of knowledge (Agrawal, et al., 1993). Interestingness measures are used to make this judgement. The simplest of these, support and confidence, form the fundamental components of many of the other measures, some of the more common of which are described here. The confidence of a rule is the percentage of the cases that have the antecedent of the rule (X) that also have the consequent of the rule (I). For example, in a dataset of 20 cases, if 10 cases had all items in X, and 4 of those 10 cases had the consequent I, the rule confidence would be 40%. The confidence suggests how likely the rule is to represent a true association: if 100% of cases with item A also have item B, the system can be maximally confident that there is an association between A and B; whereas if only 5% of cases with A also have B, this is much more likely to be coincidental, or at least unreliable enough not to warrant further action.

Support attempts to describe the statistical significance of a rule: it is the fraction of cases in the dataset that satisfy the rule, having both the antecedent and the consequent. In the example above, 4 of the 20 cases have both X and I, giving the rule a support of 20%. This helps to indicate the likelihood that an association can be generalised beyond the current data (Agrawal, et al., 1993).
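The running example can be reproduced with a short sketch. The item names "x", "i" and "other" are hypothetical stand-ins chosen only to match the counts in the text (20 cases, 10 with X, 4 of those with I), and exact fractions are used to avoid rounding.

```python
from fractions import Fraction

# 20 cases: 4 with X and I, 6 with X only, 10 with neither.
cases = (
    [{"x", "i"}] * 4
    + [{"x"}] * 6
    + [{"other"}] * 10
)

def support(itemset, cases):
    """Fraction of cases containing every item in `itemset`."""
    hits = sum(1 for case in cases if itemset <= case)
    return Fraction(hits, len(cases))

def confidence(antecedent, consequent, cases):
    """Fraction of antecedent cases that also contain the consequent."""
    return support(antecedent | consequent, cases) / support(antecedent, cases)

rule_support = support({"x", "i"}, cases)          # 4 of 20 cases
rule_confidence = confidence({"x"}, {"i"}, cases)  # 4 of 10 cases
print(float(rule_support), float(rule_confidence))
```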

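Lift, also called interest (Brin, Motwani, Ullman, & Tsur, 1997; Bayardo Jr. & Agrawal, 1999), is a measure of how singularly dependent the consequent is on the antecedent. A lift close to 1 indicates that the consequent is unlikely to be dependent on the antecedent. Lift can be defined as (Bayardo Jr. & Agrawal, 1999; Brin, et al., 1997):

$$\mathrm{lift}(X \Rightarrow I) = \frac{\mathrm{conf}(X \Rightarrow I)}{\mathrm{sup}(I)} = \frac{\mathrm{sup}(X \cup I)}{\mathrm{sup}(X)\,\mathrm{sup}(I)}$$

where sup denotes support as a fraction of the cases in the dataset and conf denotes confidence.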

In more literal terms, the lift value describes how much more likely, multiplicatively, the consequent is to appear in the set of cases that have the antecedent than in the overall set of cases. For example: as previously, there are 20 cases, 10 having X and 4 of those with I, giving X ⇒ I a confidence of 40%; if the only cases of those 20 that have I are the 4 that also have X, then I appears in 20% of all cases and the rule has a lift of 2 (0.4 / 0.2). Although only 40% of cases with the antecedent X have the consequent, which would seem to indicate a relatively weak correlation, X is still twice as good a predictor of I as random selection, which would pick a case with I only 20% of the time. Lift is also a symmetrical measurement, in that:

$$\mathrm{lift}(X \Rightarrow I) = \mathrm{lift}(I \Rightarrow X)$$
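The lift example and its symmetry can be checked numerically. The sketch below assumes lift is computed as the confidence of the rule divided by the fraction of all cases that contain the consequent; the function and constant names are illustrative only.

```python
from fractions import Fraction

N_CASES = 20   # cases in the dataset
N_X = 10       # cases containing the antecedent X
N_I = 4        # cases containing the consequent I
N_XI = 4       # cases containing both X and I

def lift(n_both, n_antecedent, n_consequent, n_cases):
    """Rule confidence divided by the baseline rate of the consequent."""
    conf = Fraction(n_both, n_antecedent)
    baseline = Fraction(n_consequent, n_cases)
    return conf / baseline

lift_x_to_i = lift(N_XI, N_X, N_I, N_CASES)
# Swapping the roles of X and I gives the reverse rule I => X.
lift_i_to_x = lift(N_XI, N_I, N_X, N_CASES)
print(lift_x_to_i, lift_i_to_x)
```

Both directions give the same value even though their confidences differ (40% for X ⇒ I, 100% for I ⇒ X).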

Gain is a measure used by Fukuda et al. to help find optimal ranges for rule definition, and is defined as:

$$\mathrm{gain}(X \Rightarrow I) = n(X \cup I) - \theta \cdot n(X)$$

where n(·) is the number of cases containing the given items and θ represents the minimum confidence threshold (Fukuda, Morimoto, Morishita, & Tokuyama, 1996). Explicitly, the resultant value of gain is the number of cases that support the rule above the minimum necessary for the rule to match the confidence threshold, given the support for the antecedent. Continuing from the previous examples, if the minimum confidence threshold was set at 20%, then gain(X ⇒ I) would be 2: the minimum number of cases required for the confidence to meet the threshold of 20% is 2 (0.2 × 10), and the number of cases supporting the rule, n(X ∪ I), is 4.

Piatetsky-Shapiro defined a further interestingness measure in 1991, which Bayardo and Agrawal pointed out is a specialised case of gain, with θ fixed as n(I)/|D|, where n(I) is the number of cases with the consequent and |D| is the number of cases in the dataset (Bayardo Jr. & Agrawal, 1999; Piatetsky-Shapiro, 1991). The measure thus calculates, of the cases that support the antecedent, how many more have the consequent than would be expected, using the ratio of the number of cases with the consequent to the size of the dataset to derive the expected value. To illustrate, again using the previous examples: if 10 out of 20 cases have antecedent X, and 4 cases have consequent I, all 4 of which also have X, then the Piatetsky-Shapiro gain is 2, indicating that the antecedent X appears in association with the consequent I in 2 more cases than would be expected. That is, 4 cases have both consequent and antecedent, while based on the ratio of cases with the consequent to cases in the dataset (0.2), it would be expected that only 2 of the 10 cases with the antecedent would also have the consequent.
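Both gain examples can be verified with a few lines. This sketch assumes gain is the number of cases supporting the rule minus the threshold multiplied by the number of antecedent cases, and that the Piatetsky-Shapiro measure is the same calculation with the threshold set to the consequent's share of the dataset. The function names are illustrative.

```python
from fractions import Fraction

def gain(n_both, n_antecedent, theta):
    """Cases supporting the rule beyond the theta * n_antecedent cases
    needed for the rule to reach a confidence of theta."""
    return n_both - theta * n_antecedent

def piatetsky_shapiro(n_both, n_antecedent, n_consequent, n_cases):
    """Gain with theta fixed to the consequent's share of the dataset."""
    return gain(n_both, n_antecedent, Fraction(n_consequent, n_cases))

# Running example: 20 cases, 10 with X, 4 with I, all 4 of which have X.
gain_value = gain(4, 10, Fraction(20, 100))   # threshold of 20%
ps_value = piatetsky_shapiro(4, 10, 4, 20)    # theta = 4/20
print(gain_value, ps_value)
```

The two values coincide here only because the chosen threshold of 20% happens to equal the proportion of cases with I.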

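Conviction is another function of confidence which was designed to complement lift, as it considers the probabilities of both the consequent and antecedent individually (Brin, et al., 1997). It is defined as:

$$\mathrm{conviction}(X \Rightarrow I) = \frac{1 - \mathrm{sup}(I)}{1 - \mathrm{conf}(X \Rightarrow I)}$$

where sup(I) is the fraction of cases in the dataset that have the consequent and conf is the confidence of the rule.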
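As a rough numerical illustration on the running example, the sketch below assumes the ratio form of conviction commonly attributed to Brin et al. (1997), written out in the docstring. Returning infinity for a rule with 100% confidence is a convention chosen here, not something taken from the source.

```python
import math
from fractions import Fraction

def conviction(n_both, n_antecedent, n_consequent, n_cases):
    """(1 - sup(I)) / (1 - conf(X => I)), with sup(I) as a fraction of
    all cases. Returns infinity when the rule holds in every case."""
    conf = Fraction(n_both, n_antecedent)
    sup_consequent = Fraction(n_consequent, n_cases)
    if conf == 1:
        return math.inf
    return (1 - sup_consequent) / (1 - conf)

# 20 cases, 10 with X, 4 with I, all 4 of which also have X.
conviction_x_to_i = conviction(4, 10, 4, 20)
# Reverse rule I => X: every case with I also has X.
conviction_i_to_x = conviction(4, 4, 10, 20)
print(conviction_x_to_i, conviction_i_to_x)
```

Unlike lift, the two directions of the rule give different values under this definition.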

Each of these measures provides a relative measurement of interestingness for each possible rule; however, as they are based on different measurements they will often provide conflicting rankings. In order to improve the efficiency of the data mining, all measurements for all possibilities are rarely calculated: rather, a first pass finds all rules which meet a minimum threshold for simple measurements such as confidence and support, and more complex measurements are then made over the remaining rules (Agrawal, et al., 1993; Bayardo Jr. & Agrawal, 1999; Lenca, Vaillant, & Lallich, 2006). These will again often have minimum thresholds, so that only those rules which surpass the threshold value for each measure are displayed. The literature generally does not suggest optimal thresholds for these or other measures, and threshold values are rarely discussed in detail. The most common view is that the thresholds should be modifiable by the user, as the required minimum interestingness of a rule is dependent on the data and what the user is looking for (Hidber, 1999; Lenca, et al., 2006; Tan & Kumar, 2001), although some methods have attempted to develop relative thresholds (Lavrač, Flach, & Zupan, 1999). The major problems with association rules are the computational complexity of identifying the rules and the often vague results: the method provides no explanation for why these associations exist, which makes it difficult to quantify exactly how well an association might generalise, or what to do with the associations once discovered. Nevertheless, association rule mining became a very popular approach in data mining which found wide application in marketing research (Fayyad, et al., 1996a).
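The two-pass filtering described above can be sketched as follows. This is a brute-force enumeration over a toy dataset, not the candidate-generation algorithm of any cited work. The threshold values, item names and the choice of lift as the second-pass measure are all assumptions made for illustration.

```python
from fractions import Fraction
from itertools import combinations

# Toy dataset matching the running example (hypothetical item names).
cases = [{"x", "i"}] * 4 + [{"x"}] * 6 + [{"other"}] * 10

# User-adjustable thresholds; these values are arbitrary.
MIN_SUPPORT = Fraction(1, 10)
MIN_CONFIDENCE = Fraction(3, 10)
MIN_LIFT = Fraction(3, 2)

def support(itemset):
    """Fraction of cases containing every item in `itemset`."""
    return Fraction(sum(1 for c in cases if itemset <= c), len(cases))

items = sorted(set().union(*cases))

# First pass: keep rules X => I that meet the cheap thresholds.
candidates = []
for size in (1, 2):
    for antecedent in combinations(items, size):
        x = set(antecedent)
        sup_x = support(x)
        if sup_x == 0:
            continue  # antecedent never occurs, confidence undefined
        for i in items:
            if i in x:
                continue  # consequent must not be contained in X
            sup_rule = support(x | {i})
            conf = sup_rule / sup_x
            if sup_rule >= MIN_SUPPORT and conf >= MIN_CONFIDENCE:
                candidates.append((antecedent, i, sup_rule, conf))

# Second pass: compute the costlier measure only for the survivors.
rules = []
for antecedent, i, sup_rule, conf in candidates:
    rule_lift = conf / support({i})
    if rule_lift >= MIN_LIFT:
        rules.append((antecedent, i, sup_rule, conf, rule_lift))

for antecedent, i, sup_rule, conf, rule_lift in rules:
    print(antecedent, "=>", i, float(sup_rule), float(conf), float(rule_lift))
```

Changing any of the three thresholds changes which rules are reported, which is why the user is normally left to set them.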