2.2 Association Rules and Frequent Pattern Mining
2.2.2 Frequent Pattern Mining
As noted above, FPM plays an essential role in ARM. On its own, FPM is concerned with finding frequent patterns (frequently co-occurring subsets of attributes) in data.
A variety of FPM algorithms have been proposed. With respect to tabular data, the majority of these have been integrated with ARM algorithms. Of these the best known, and most frequently cited, is the Apriori algorithm [7]. There are a great many variations of Apriori; two of note are AprioriTid and AprioriHybrid. Another well-known FPM algorithm is FP-growth, which is founded on a set enumeration tree structure called the FP-tree. The Apriori, AprioriTid and AprioriHybrid algorithms are discussed further in Sub-section 2.2.2.1, and FP-growth in Sub-section 2.2.2.2. The FPM algorithm adopted with respect to this thesis is TFP [30, 31]; this is discussed further in Sub-section 2.2.2.3.
2.2.2.1 The Apriori Algorithm
The Apriori algorithm operates in an iterative manner by first identifying frequent 1-itemsets and then using these to identify frequent 2-itemsets, and so on, in a “generate, count-support and prune” loop. An important aspect of the Apriori algorithm (and many other FPM algorithms) is the downward closure property of itemsets, which is used to limit the search space. This property states that an itemset cannot be frequent if any of its subsets is not frequent. Some pseudo code describing the Apriori algorithm is presented in Algorithm 2.1. Given I, a set of itemsets in a transaction dataset D, the algorithm commences by generating the candidate 1-itemsets I_k (where k = 1); then, for each itemset a_i in I_k, the support, a_i.support, is obtained. Each itemset a_i where a_i.support ≤ σ (where σ is some support threshold) is pruned from I_k. What is left in I_k are the frequent 1-itemsets (k = 1). The itemsets in I_k are then used to generate the I_{k+1} itemsets (thus using the downward closure property of itemsets). The efficiency of Apriori (and similar algorithms) is significantly affected when it is applied to very large datasets (that cannot be held in primary storage), as multiple scans through the database will be required. Thus, many modifications to the Apriori algorithm have been proposed to address this issue, for example AprioriTid and AprioriHybrid [8].
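The “generate, count-support and prune” loop described above can be illustrated with a minimal Python sketch. It is not the thesis implementation: the function name `apriori`, the toy dataset `D` and the threshold value are assumptions made for illustration. Following Algorithm 2.1, an itemset is pruned when its support is ≤ σ.

```python
from itertools import combinations

def apriori(transactions, sigma):
    """Return {itemset: support} for every itemset with support > sigma."""
    transactions = [frozenset(t) for t in transactions]
    items = sorted({i for t in transactions for i in t})
    candidates = [frozenset([i]) for i in items]   # candidate 1-itemsets
    frequent = {}
    k = 1
    while candidates:
        # Count support: one pass over the dataset per level.
        support = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        # Prune candidates whose support is <= sigma (as in Algorithm 2.1).
        level = {c: s for c, s in support.items() if s > sigma}
        frequent.update(level)
        # Generate (k+1)-candidates by joining frequent k-itemsets, keeping
        # only those whose k-subsets are all frequent (downward closure).
        k += 1
        joined = {a | b for a in level for b in level if len(a | b) == k}
        candidates = [c for c in joined
                      if all(frozenset(s) in level for s in combinations(c, k - 1))]
    return frequent

# Illustrative toy dataset (an assumption, chosen only for this example).
D = [{"A", "B", "C"}, {"B", "C"}, {"A", "B", "E"}, {"B", "D", "E"}, {"A", "D", "E"}]
frequent = apriori(D, sigma=1)
```

Each pass of the `while` loop rescans every transaction, which is the cost that becomes significant when the dataset cannot be held in primary storage.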
AprioriTid uses a “vertical” representation of the data where each single attribute has a Transaction ID list (a TID list) associated with it. The support for a single item is then simply the length of the appropriate TID list. The support for a 2-itemset is obtained from a single intersection operation, and so on. Algorithm 2.2 describes the pseudo code for AprioriTid. The algorithm commences by using the same candidate itemset generation algorithm as Apriori for producing the candidate sets I_k. The algorithm again proceeds in an iterative manner. At each iteration the algorithm scans I_k and obtains the support of the itemsets using intersection operations. Each item in I_k has a TID list associated with it. The size of the intersection of the TID lists is then the support for each itemset in I_k. If the support is ≤ σ, then the itemset is pruned from I_k.
Algorithm 2.1: Apriori FPM algorithm
input : I, D and minimum support threshold σ
output: F
F = { };
k = 1;
C_k = I;
while C_k ≠ ∅ do
    for ∀c ∈ C_k do
        Count support for c with reference to D;
    end
    for ∀c ∈ C_k do
        if support c ≤ σ then
            Prune from C_k;
        end
    end
    F = F ∪ C_k;
    k++;
    C_k = the set of candidate k-itemsets derived from C_{k−1};
end
In terms of speed, performance and memory management, reported experiments indicated that AprioriTid outperformed the original Apriori algorithm when generating large k-itemsets [45, 81]. Apriori performed better than AprioriTid in the initial passes, but in the later passes AprioriTid performed better than Apriori. For this reason a combination algorithm, called AprioriHybrid, was introduced in which Apriori is used in the initial passes and AprioriTid in the later passes.
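The TID-list support counting described above for AprioriTid can be sketched as follows. This is a minimal illustration only: the helper names `tid_lists` and `support`, and the toy dataset, are assumptions rather than part of the cited algorithms.

```python
def tid_lists(transactions):
    """Build the vertical representation: item -> set of transaction IDs."""
    lists = {}
    for tid, transaction in enumerate(transactions):
        for item in transaction:
            lists.setdefault(item, set()).add(tid)
    return lists

def support(itemset, lists):
    """Support of a non-empty itemset = size of the intersection of its TID lists."""
    tids = set.intersection(*(lists[item] for item in itemset))
    return len(tids)

# Illustrative toy dataset (an assumption, chosen only for this example).
D = [{"A", "B", "C"}, {"B", "C"}, {"A", "B", "E"}, {"B", "D", "E"}, {"A", "D", "E"}]
lists = tid_lists(D)
support_b = support("B", lists)      # length of a single TID list
support_ab = support("AB", lists)    # one intersection operation
support_abe = support("ABE", lists)  # two intersection operations
```

Once the TID lists have been built, no further scan of the original transactions is needed, which is consistent with the advantage reported for the later passes.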
2.2.2.2 The Frequent Pattern-Growth Algorithm
Another established FPM algorithm is the Frequent Pattern (FP)-growth algorithm. While Apriori develops itemsets using a candidate generation method, FP-growth uses a partitioning-based, divide-and-conquer method [19, 52]. In common with a number of other FPM algorithms, including TFP discussed in the following sub-section, FP-growth uses a set enumeration tree structure, the FP-tree, in which to store itemset data. The pseudo code for FP-growth is shown in Algorithm 2.3. The algorithm starts by calculating the support for each single item in I; unsupported items are pruned from the dataset. The remaining items in each transaction are then ordered according to their frequency and stored in the FP-tree header table. The transactions are then stored in the FP-tree. What sets the FP-tree apart from other set enumeration tree structures is that it includes additional links, originating from the header table, that connect tree nodes featuring the same label. The FP-growth algorithm proceeds in a depth-first manner starting with the least frequent item in the header table. For each entry the support value for the item is produced by following the links connecting all occurrences of the
Algorithm 2.2: AprioriTid FPM algorithm
input : I, set of TID lists, σ
output: F
F = { };
k = 1;
C_k = I;
for ∀c ∈ C_k do
    support c = the length of the corresponding TID list;
end
for ∀c ∈ C_k do
    if support c ≤ σ then
        Prune from C_k;
    end
end
F = F ∪ C_k;
k = 2;
C_k = the set of 2-itemsets derived from C_{k−1};
while C_k ≠ ∅ do
    for ∀c ∈ C_k do
        Obtain support for c from the size of the intersection of the TID lists for itemsets in c;
    end
    for ∀c ∈ C_k do
        if support c ≤ σ then
            Prune from C_k;
        end
    end
    F = F ∪ C_k;
    k++;
    C_k = the set of k-itemsets derived from C_{k−1};
end
current item in the FP-tree. If the item is adequately supported, then for each leaf node a set of ancestor labels is produced, each of which has a support equivalent to the sum of the leaf node items from which they originate. If the set of ancestor labels is not null, a new FP-tree is generated with the set of ancestor labels as the dataset, and the process is repeated. A disadvantage of FP-growth, when finding long frequent patterns, is that many FP-trees may be generated and processed, thus introducing additional overheads. The benefit provided by FP-growth is that the ordering of the 1-itemsets according to their support, and the pruning of unsupported 1-itemsets at the beginning of the mining process, reduce the size of the input dataset, thus contributing to the efficiency of the approach (although there is no reason why this expedient cannot be applied to other FPM algorithms).
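The header table, the same-label links and the collection of ancestor labels described above can be sketched in Python as follows. This is a simplified illustration and not a full FP-growth implementation: the names `FPNode`, `build_fp_tree`, `support_via_links` and `ancestor_labels`, the toy dataset, and the convention that an item is supported when its count is ≥ `min_support` (as in the test on h in Algorithm 2.3) are all assumptions.

```python
from collections import Counter

class FPNode:
    def __init__(self, item, parent):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}   # item label -> child FPNode
        self.link = None     # next node in the tree carrying the same label

def build_fp_tree(transactions, min_support):
    counts = Counter(item for t in transactions for item in set(t))
    # Supported items, most frequent first (ties broken alphabetically).
    order = sorted((i for i in counts if counts[i] >= min_support),
                   key=lambda i: (-counts[i], i))
    rank = {item: r for r, item in enumerate(order)}
    root = FPNode(None, None)
    header = {item: None for item in order}
    for t in transactions:
        node = root
        for item in sorted((i for i in set(t) if i in rank), key=rank.get):
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, node)
                node.children[item] = child
                # Prepend the new node to the chain of same-label links.
                child.link, header[item] = header[item], child
            child.count += 1
            node = child
    return root, header

def support_via_links(header, item):
    """Sum the counts of all nodes labelled `item` by following the links."""
    total, node = 0, header[item]
    while node is not None:
        total += node.count
        node = node.link
    return total

def ancestor_labels(header, item):
    """For each occurrence of `item`, return (ancestor labels, count)."""
    paths, node = [], header[item]
    while node is not None:
        path, parent = [], node.parent
        while parent is not None and parent.item is not None:
            path.append(parent.item)
            parent = parent.parent
        if path:
            paths.append((tuple(reversed(path)), node.count))
        node = node.link
    return paths

# Illustrative toy dataset (an assumption, chosen only for this example).
D = [{"A", "B", "C"}, {"B", "C"}, {"A", "B", "E"}, {"B", "D", "E"}, {"A", "D", "E"}]
root, header = build_fp_tree(D, min_support=2)
paths_for_d = sorted(ancestor_labels(header, "D"))
```

In a complete FP-growth implementation the sets returned by `ancestor_labels` would form the dataset for the next, smaller FP-tree, and the process would recurse.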
Algorithm 2.3: FP-growth Algorithm - Frequent itemset mining
input : I, D and minimum support threshold σ
output: F
for ∀c ∈ I do
    Get support for c from D;
end
for ∀c ∈ I do
    if support for c ≤ σ then
        F = F ∪ C;
    end
end
H = Header table of elements in C ordered by descending support;
D′ = D reordered according to ordering of H;
for ∀h ∈ H do
    Follow links through FP-tree and obtain support;
    if h support ≥ σ then
        add to F
    end
    D_temp = set of items created by following through links;
    Repeat process using D_temp and D′;
end
2.2.2.3 Total From Partial
The Total From Partial (TFP) algorithm is an established FPM algorithm that, like FP-growth, utilizes a set enumeration tree structure for fast lookup purposes [30]. TFP is itself an extension of another algorithm, Apriori-T, which was developed as a more efficient ARM algorithm than straightforward Apriori. Apriori-T uses a reverse set enumeration tree data structure, the Total support tree (T-tree), that facilitates fast “look up”. TFP extends Apriori-T by introducing a second tree structure, the Partial support tree (P-tree), in which partial support counts are stored. TFP offers advantages, with respect to generating frequent itemsets, in terms of time and storage efficiency; it also provides a good data structure for finding association rules [31]. As noted above, the significance of TFP in the context of this thesis is that it is the foundation on which the proposed TM-TFP trend mining algorithm is based (see Chapter 4). TFP is therefore discussed in some detail in this section. The discussion is presented in terms of the generation of the P-tree and the T-tree; the first is discussed in Sub-section 2.2.2.3.1, and the second in Sub-section 2.2.2.3.2.
2.2.2.3.1 Partial support tree (P-tree)
The concept of the P-tree was introduced by Coenen et al. in [30, 31]. The P-tree is described as a “preprocessing” tree structure (similar to the FP-tree) into which an input dataset can be translated so that it is stored in a more concise way and, at the same time, some partial support counting can take place. Figure 2.1 shows an example of how a P-tree is generated. Let D = {{A, B, C}, {B, C}, {A, B, E}, {B, D, E}, {A, D, E}}. P-tree generation commences with the first record in D. The record {A, B, C} is stored, together with its support count of 1, as a single P-tree node. The second record {B, C} is stored in a second P-tree node, also with a support count of 1, and linked to the first node so that it becomes a “sibling” of this first node. The next record {A, B, E} has a common prefix {A, B} with the first P-tree node. This node is therefore split into a parent-child pair, with {A, B} as the parent and {C} as the child (both with a support count of 1). Then {A, B, E} is added by incrementing the count for {A, B} and adding a further P-tree node for {E} as a sibling of {C} (with a support count of 1). The fourth record {B, D, E} shares a leading substring {B} with the P-tree node representing {B, C}. This node is therefore split into another parent-child pair, {B} and {C}. The fourth record is then included by incrementing {B} and adding {D, E} as a sibling of {C}. The fifth record, {A, D, E}, is included in a similar manner by splitting {A, B} and including {D, E} as a sibling of {B}.
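The insertion steps in this example can be reproduced with a short Python sketch. It is a simplified illustration rather than the P-tree implementation of [30, 31]: the names `PNode`, `insert` and `dump` are assumptions, siblings are held in a Python list instead of being chained by sibling references, and records are assumed to be presented as sorted tuples of items.

```python
class PNode:
    def __init__(self, items, support=1):
        self.items = tuple(items)   # the node code (a sequence of items)
        self.support = support      # partial support count
        self.children = []

def common_prefix_len(a, b):
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def insert(siblings, record):
    """Insert a sorted record into a list of sibling P-tree nodes."""
    for node in siblings:
        k = common_prefix_len(node.items, record)
        if k == 0:
            continue
        if k < len(node.items):
            # Split the node: the shared prefix stays as the parent and the
            # remainder becomes a child that keeps the old count and subtree.
            child = PNode(node.items[k:], node.support)
            child.children = node.children
            node.items = node.items[:k]
            node.children = [child]
        node.support += 1
        if record[k:]:
            insert(node.children, record[k:])
        return
    siblings.append(PNode(record))   # no shared prefix: add a new sibling

def dump(nodes):
    """Nested (node code, support, children) view of the tree."""
    return [("".join(n.items), n.support, dump(n.children)) for n in nodes]

# The dataset D used in the example above.
D = [("A", "B", "C"), ("B", "C"), ("A", "B", "E"), ("B", "D", "E"), ("A", "D", "E")]
roots = []
for record in D:
    insert(roots, record)
tree = dump(roots)
```

After the fifth record, `tree` holds a top-level node A (support 3) with children B (support 2, itself the parent of C and E) and DE, alongside a top-level node B (support 2) with children C and DE, which agrees with the description above.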
Usage of the P-tree provides several advantages with respect to the generation of frequent patterns:
1. Faster run times because the counting of pattern support is done partially as the P-tree is constructed.
2. Reduced storage requirements with respect to large datasets where the likelihood of duplicate records and common prefixes is high.
A comparison between the operation of the FP-tree and the P-tree was conducted by Ahmed et al. [10]. Despite similarities in their structure Ahmed et al. highlighted two distinctions between the two:
1. The FP-tree is a more pointer-rich data structure which leads to a more complicated implementation, whereas the P-tree is simpler to implement.
2. P-tree nodes seek to hold sequences of itemsets which are partially closed, while the FP-tree nodes hold separate itemsets.
The internal representation of the P-tree presented in Figure 2.1 is given in Figure 2.2. From the figure it should be noted that a P-tree node has four elements: (i) the node code, (ii) the support value, (iii) a reference to a potential sibling node and (iv) a reference to a potential child node.
2.2.2.3.2 Total support tree (T-tree)
The T-tree is used in the second stage of the TFP algorithm, where frequent patterns are identified. A T-tree is a reverse set enumeration tree that is used to store frequent
Figure 2.1: P-tree generation (five panels, n = 1 to n = 5, showing the P-tree after each record of D has been added)
patterns. Each level in the T-tree is actually an array (some authors refer to this structure as a trie). Items are stored “in reverse” as this is facilitated by the indexing mechanism permitted by the use of arrays. This indexing also facilitates fast look up [30]. The T-tree is generated in an Apriori manner (see Sub-section 2.2.2.1) from the P-tree. In other words, the T-tree is generated level by level starting with level 1 (1-itemsets). Figure 2.3 illustrates the T-tree constructed using the P-tree presented in Figure 2.1. The example assumes that the support threshold for frequent patterns is σ = 2. The T-tree includes nodes for all the items that may exist at a particular level. Initially the support for each node is set to 0. The support counts are then updated (Figure 2.3(b)) as a result of a traversal of the P-tree. Level 1 pruning is then carried out so that nodes that do not have support above σ = 2 are “removed”. The following level in the T-tree is then constructed from the supported nodes in level 1, followed by level 2 pruning (Figure 2.3(d)), and so on.
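The step in which support counts are updated by traversing the P-tree can be illustrated with the following Python sketch. It shows one way of deriving total support from the partial counts and is not the traversal used by TFP itself, which is described in [30, 31]; the array-based T-tree is also not modelled. The name `total_support` and the nested-tuple encoding of the P-tree are assumptions; the node codes and counts are those of the final P-tree in the example of Sub-section 2.2.2.3.1.

```python
# Final P-tree of the example, encoded as (node code, partial support, children).
P_TREE = [
    ("A", 3, [("B", 2, [("C", 1, []), ("E", 1, [])]), ("DE", 1, [])]),
    ("B", 2, [("C", 1, []), ("DE", 1, [])]),
]

def total_support(itemset, nodes=P_TREE, path=frozenset()):
    """Total support of a non-empty itemset, summed from partial counts.

    Items on any root-to-leaf path are in ascending order, so the last item
    of the itemset occurs in at most one node per path. That node's count is
    added when the rest of the itemset also lies on the path leading to it.
    """
    itemset = frozenset(itemset)
    last = max(itemset)
    total = 0
    for code, count, children in nodes:
        seen = path | set(code)
        if last in code:
            if itemset <= seen:
                total += count
        else:
            total += total_support(itemset, children, seen)
    return total

# Level 1 support counts (1-itemsets), before any pruning is applied.
level1 = {item: total_support(item) for item in "ABCDE"}
support_ae = total_support("AE")
support_be = total_support("BE")
```

The counts in `level1` are the values against which level 1 pruning would then be applied, after which the level 2 candidates would be counted in the same way.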
Figure 2.2: Internal representation of the P-tree generated in Figure 2.1
The T-tree offers two advantages that contribute to an efficient mining process:
1. The storage requirements for the T-tree are lower than those of other tree structures (such as the FP-tree).
2. The fast lookup facility provided by the indexing mechanism.
Given the benefits of the P-tree and T-tree data structures discussed above, the TFP algorithm is used as the foundation for the frequent pattern trend mining algorithm proposed later in this thesis (Chapter 4).