Since the introduction of association rules a decade ago and the launch of the research in efficient fre- quent itemset mining, the development of effective approaches for mining large transactional data- bases has been the focus of many research studies. Furthermore, it is widely recognized that mining for frequent items or association rules, regardless of its efficiency, usually yields an overwhelming, crushing number of patterns. This is one of the reasons it is argued that the integration of data mining and database management technologies is required Chaudhuri (1998). These large sets of discovered patterns could be queried. Expressing constraints using a query language could indeed help sift through the large pattern set to identify the useful ones.
We argue that pushing the consideration of these constraints at the mining process before discovering the patterns is an efficient and ef- fective way to solve the problem. This does not exclude the integration of data mining and database systems, but suggests the need for data mining query languages intricately integrated with the data mining process.
In this chapter we address the issue of early consideration of monotone and anti-monotone
constraints in the case of frequent itemset mining. We propose a leap traversal approach, BifoldLeap that traverses the search space by jumping from relevant node to relevant node and simultaneously checking for constraint violations. The approach we propose uses existing data structures, FP- tree and COFI-tree, but introduces new pruning techniques to reduce the search costs. We con-
ducted a battery of experiments to evaluate our constraint-based search and report a fraction of them herein for lack of space. The experiments show the advantages of pushing both monotone
and anti-monotone constraints as early as possible in the mining process despite the overhead of constraint checking. We also compared our algo- rithm to Dualminer, a state-of-the-art algorithm in constraint-based frequent itemset mining, and showed how our algorithm outperforms it and can find all frequent itemsets, the closed and the maximal patterns that satisfy constraints along with their exact supports.
Parallelizing the search for frequent patterns plays an important role in opening the doors to the mining of extremely large datasets. Not all good sequential algorithms can be effectively parallel- ized and parallelization alone is not enough. An algorithm has to be well suited for parallelization, and in the case of frequent pattern mining, clever methods for searching are certainly an advantage. The algorithm we propose for parallel mining of frequent patterns while pushing constraints is based on a new technique for astutely jumping within the search space, and more importantly, is composed of autonomous task segments that can be performed separately and thus minimize communication between processors.
Our proposal is based on the finding of particular patterns, called pattern bases, from which selective jumps in the search space can be performed in parallel and independently from each other pattern base in the pursuit of frequent patterns that satisfy user’s constraints. The suc- cess of this approach is attributed to the fact that pattern base intersection is independent and each intersection tree can be assigned to a given processor. The decrease in the size of intersection trees allows a fair strategy for distributing work among processors and in the course reducing most of the load balancing issues. While other published works claim results with millions of transactions, our approach allows the mining in reasonable time of databases in the order of
billion transactions using relatively inexpensive clusters; 16 dual-processor boxes in our case. This is mainly credited to the low communication cost of our approach.
references
Agrawal, R., Imielinski, T., & Swami, A. (1993). Mining association rules between sets of items in large databases. In Proceedings of SIGMOD Int.
Conf. Management of Data (pp 207–216). Agrawal, R., & Srikant, R. (1994). Fast algorithms for mining association rules. In Proceedings of the Int. Conf. Very Large Data Bases (pp 487–499).
Bayardo, R. J. (1998). Efficiently mining long pat-
terns from databases. ACM SIGMOD.
Bonchi, F., Giannotti, Mazzanti, A., & Pedreschi, D. (2004).Examiner: Optimized level-wise fre- quent pattern mining with monotone constraints. In Proceedings of the IEEE ICDM, Melbourne, Florida.
Bonchi, F., & Goethals, B. (2004). FP-Bonsai: The art of growing and pruning small fp-trees. In Proceedings of the Pacific-Asia Conference on
Knowledge Discovery and Data Mining (PAKDD ’04) (pp.155–160 ).
Bonchi, F., & Lucchese, C. (2004). On closed con- strained frequent pattern mining. In Proceedings of the IEEE International Conference on Data Mining (ICDM’04).
Bucila, C., Gehrke, J., Kifer, D., & White, W. (2002). Dualminer: A dual-pruning algorithm for itemsets with constraints. In Proceedings of the Eight ACM SIGKDD International Conf. on Knowledge Discovery and Data Mining (pp. 42–51), Edmonton, Alberta.
Burdick, D., Calimlim, M. J., & Gehrke, J. (2001).
Mafia: A maximal frequent itemset algorithm for
transactional databases. ICDE (pp 443–452). Chaudhuri, S. (1998).Data mining and database systems: Where is the intersection? Bulletin of
the Technical Committee on Data Engineering,
p.21.
El-Hajj, M., & Zaïane, O. R. (2003).Non recursive generation of frequent k-itemsets from frequent pattern tree representations. In Proceedings of 5th
International Conference on Data Warehousing and Knowledge Discovery (DaWak’2003). El-Hajj, M., & Zaïane, O. R. (2005). Mining with constraints by pruning and avoiding ineffectual processing. In Proceedings of the 18th Austra-
lian Joint Conference on Artificial Intelligence, Sydney, Australia.
El-Hajj, M., & Zaïane, O. R. (2006). Parallel leap: Large-scale maximal pattern mining in a distributed environment. In Proceedings of
the International Conference on Parallel and Distributed Systems (ICPADS’06), Minneapolis, Minnesota.
El-Hajj, M., Zaïane, O. R., & Nalos, P. (2005). Bifold constraint-based mining by simultaneous monotone and anti-monotone checking. In Pro- ceedings of the IEEE 2005 International Confer- ence on Data Mining, Houston, Texas.
Han, J., Pei, J., & Yin, Y.Mining frequent patterns without candidate generation. In Proceedings of
the ACM SIGMOD Intl. Conference on Manage- ment of Data (pp 1–12).
FIMI (2003). Itemset mining implementations repository. Retrieved March 6, 2007, from http:// fimi.cs.helsinki.fi/
Lakshmanan, L., Ng, R., Han, J., & Pang, A. (1999). Optimization of constrained frequent set queries with 2variable constraints. In Proceedings of the
ACM SIGMOD Conference on Management of Data (pp 157–168).
Pasquier, N., Bastide, Y., Taouil, R., & Lakhal, L. Discovering frequent closed itemsets for associa- tion rules. In Proceedings of the International
Conference on Database Theory (ICDT), (pp 398–416).
Pie, J., & Han, J. (2000). Can we push more constraints into frequent pattern mining? In
Proceedings of the ACM SIGKDD Conference
(pp 350–354).
Pie, J., Han, J., & Lakshmanan, L. Mining fre- quent itemsets with convertible constraints. In
Proceedings of the IEEE ICDE Conference (pp 433–442).
Q. synthetic data generation code. (2000). Re- trieved March 6, 2007, fromhttp://www.almaden. ibm.com/software/quest/Resources/index.shtml Ting, R. M., Bailey, J., & Ramamohanarao, K. (2004). Paradualminer: An efficient parallel implementation of the dualminer algorithm. In
Proceedings of the Eight Pacific-Asia Conference, PAKDD (pp 96–105).
Zaïane, O. R., & El-Hajj, M. (2005).Pattern lattice traversal by selective jumps. In Proceedings of the Int’l Conf. on Data Mining and Knowledge Discovery (ACM SIGKDD) (pp 729–735).