Parallel leap traversal Approach : An example

The following example illustrates how the Bi- foldLeap approach is applied in parallel. Figure 4 (A) presents 7 transactions made of 8 distinct items which are: A, B, C, D, E, F, G, and H with prices $10, $20, $30, $40, $50, $60, and $70

respectively. Assuming we want to mine those transactions with a support threshold equals to at least 3 and generates patterns that their total prices are between $30 and $100 (i.e., P() : Sum of Prices < $100, and Q()Sum of Prices > $30, using two processors. Figures 4 (A) and Figure 4 (B) illustrate all the needed steps to accomplish this task. The database is partitioned between the two processors where the first three transactions are assigned to the first processor, P1, and the remaining ones are assigned to the second processor, P2

(Figure 4 (A)). In the first scan of the database, each processor finds the local support for each item: P1 finds the support of A, B, C, D, E, F and

G which are 3, 2, 2, 2, 2, 1 and 2 respectively, and P 2 the supports of A, B, C, D, E, F, and H which are 2, 3, 3, 3, 3, 3, 2. A reduced operation is executed to find that the global support of A, B, C, D, E, F, G, and H items is 5, 5, 5, 5, 5, 4, 2, and 2. The last two items are pruned as they do not meet the threshold criteria (support > 2), and the remaining ones are declared frequent items of size 1. The set of Global frequent 1-itemset is broadcast to all processors using the first round of messages.

The second scan of the database starts by build- ing the local headerless tree for each processor. From each tree the local frequent path bases are generated. In P1 the frequent-path-bases ABCDE, ABE, and ACDF with branch support equal to 1 are generated. P2 generates ACDEF, BCDF, BEF, and ABCDE with branch supports equal to 1 for all of them (Figure 4 (B)). The second set of messages is executed to send the locally generated frequent path bases to P1. Here, identical ones are merged and the final global set of frequent path bases are broadcast to all processors with their branch support. (Figure 4 (C)).

Each processor is assigned a set of header nodes to build their intersection tree as in Figure 4.D. In our example, the first, third, and sixth frequent path bases are assigned to P1 as header nodes for its intersection trees. P2 is assigned to the second, fourth, and fifth frequent path bases. The first tree of intersection in P1 produces 3 P-maximals (i.e., with total prices is less than $100) BCD: $90, ABE: $80, and ACD: $80 with support equal to 3, 3, and 4 respectively. The second assigned tree does not produce any P-maximals. P1 produces 3 local P-maximals which are BCD: $90, ABE: $80, and ACD: $80. P2 produced BE: $70, and AE: $60 with support equal to 4 and 4 respectively. All local P-maximals are sent to P1 in which any local P-maximal that has any other superset of local P-maximals from other processors are removed.

The remaining patterns are declared as global P-maximals (Figure 4 (E)). Subsets of the Global P-maximals that satisfy Q() which is prices > 30 are kept and others are pruned. The final results set produces D: $40, E: $50, AC: $40, AD: $50, BC: $50, BD: $60, CD: $70, BE: $70, AE: $60, BCD: $90, ABE: $80, and ACD: $80

sequentIAl PerforMAnce

evAluAtIon

To evaluate our BifoldLeap algorithm, we conducted a set of experiments to test the effect of pushing monotone and anti-monotone constraints separately, and then both in combination for the same datasets. To quantify scalability, we experi- mented with datasets of varying size. We also measured the impact of pushing vs. postprocessing constraints on the number of evaluations of P()

and Q(). Like in Bucila et al. (2002), we assigned prices to items using both uniform and zipf distri- butions. Our constraints consisted of conjunctions of tests for aggregate, minimum, and maximum price in relation to specific thresholds.

We received an FP-Bonsai code (base on FP-Growth) from its original authors Bonchi et al. (2004). Unfortunately, not all pruning and clever constraint considerations suggested in their FP-Bonsai paper were implemented in this code. Moreover, the implementation as received produced some false positives and false nega- tives. This is why we opted not to add it to our comparison study. Although, with simple and only

monotone constraints, the received FP-Bonsai implementation was indeed fast. FP-Bonsai as described in the paper has merit, but because of lack of time we could not implement it ourselves (albeit adding implementation bias) or fix the received code.

We compared our algorithm with Dualminer (Bucila et al., 2002). Based on its authors’ recom- mendations, we built the Dualminer framework on top of the MAFIA (Bucila et al., 2001) imple-

mentation provided by its original authors. Our experiments were conducted on a 3GHz Intel P4 with 2 GB of memory running Linux 2.4.25, Red Hat Linux release 9. The times reported also include the time to output all matching itemsets. We have tested these algorithms using both real datasets provided by FIMI (2003) and synthetic datasets generated using Q. synthetic data genera- tion code (2000); we used “retail” as our primary

real dataset reported here. A dataset with the same characteristics as the one reported in Bucila et al. (2002) was also generated. We also report some results with our DPC-COFI to illustrate the ad- vantages of the Leap approach (BifoldLeap).

retail dataset 0 0 0 0 0 00 0 0 0 0 0.% 0.% 0.0% Support T im e i n s e c o n d s .

BifoldLeap (Q only) DPC-COFI (Q only) DualMiner (Q only) BifoldLeap (P only) DPC-COFI (P only) DualMiner (P only) BifoldLeap (P & Q) DPC-COFI (P & Q) DualMiner (P & Q)

Figure 5. Pushing P(), Q(), and P() ∩ Q()

Figure 6. Pushing more selective P(), Q(), and P() ∩ Q()

Figure 7. Pushing extremely selective P(), Q(), and P() � Q()

retail dataset 0 0 0 0 0 00 0 0 0 0 0.% 0.% 0.0% Support T im e i n s e c o n d s .

BifoldLeap (Q only) DPC-COFI (Q only) DualMiner (Q only) BifoldLeap (P only) DPC-COFI (P only) DualMiner (P only) BifoldLeap (P & Q) DPC-COFI (P & Q) DualMiner (P & Q)

retail dataset 0 0. . . . . 0.% 0.% 0.0% Support T im e i n s e c o n d s .

Impact of P() and Q() selectivity on

In document Data Mining Patterns New Methods And Applications Pascal Poncelet (2008) pdf (Page 63-67)