3. Related Work
3.2. Algorithms for the FCSM Problem
3.2.1. Frequent Tree Mining Algorithms
Among the practical implementations of frequent subgraph mining algorithms, frequent tree mining algorithms are most closely related to our work. Several algorithms have been proposed for computing the set of frequent trees in databases of trees, forests, or “arbitrary” graphs. As shown in Section 2.2, frequent subtrees can be enumerated efficiently (i.e., with polynomial delay) in forest transaction databases. However, most systems do not use an efficient embedding operator and hence may incur exponential delay and memory consumption even in this case.
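To illustrate why explicitly stored embeddings can lead to exponential memory consumption even for tree patterns, consider a star pattern with k leaves and a star transaction graph with n ≥ k leaves, where all vertices carry the same label. For k ≥ 2 the center must be mapped to the center, and the leaves can be mapped to any k distinct leaves in any order, which gives n!/(n−k)! embeddings. The following Python sketch is not taken from any of the systems discussed here. The helper names star and count_embeddings are hypothetical, and the brute-force enumeration only serves to confirm the count on a small instance.

```python
from itertools import permutations
from math import perm


def star(n_leaves):
    """Adjacency sets of a star: vertex 0 is the center, 1..n_leaves are leaves."""
    adj = {0: set(range(1, n_leaves + 1))}
    for leaf in range(1, n_leaves + 1):
        adj[leaf] = {0}
    return adj


def count_embeddings(pattern, host):
    """Count injective maps pattern -> host that preserve every pattern edge.

    Brute force over all injective vertex maps; labels are assumed uniform.
    """
    p_vertices = sorted(pattern)
    count = 0
    for image in permutations(sorted(host), len(p_vertices)):
        phi = dict(zip(p_vertices, image))
        if all(phi[v] in host[phi[u]] for u in pattern for v in pattern[u]):
            count += 1
    return count


n, k = 5, 3
# The brute-force count agrees with the closed form n!/(n-k)!.
assert count_embeddings(star(k), star(n)) == perm(n, k)
```

An embedding list that stores every such map therefore grows with n!/(n−k)! for a single transaction, although the pattern and the transaction graph are both trees.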
Algorithm | Work | Database | Embedding operator (complexity) | Remarks
--- | --- | --- | --- | ---
FreeTreeMiner | Chi et al (2003) | | Chung (1987) (polynomial) |
HybridTreeMiner | Chi et al (2004a) | Forests | Embedding lists (exponential) |
FreeTreeMiner | Rückert and Kramer (2004) | Graphs | Support sets (exponential) |
F3TM | Zhao and Yu (2008) | Graphs | Ullmann (1976) (exponential) |
FSG | Kuramochi and Karypis (2004) | Graphs | Embedding lists (exponential) | Mines all frequent subgraphs
MoSS | Borgelt and Berthold (2002); Borgelt et al (2005) | Chemical graphs | Embedding lists (exponential) | Mines all frequent subgraphs
gSpan | Yan and Han (2002) | Graphs | Cordella et al (1998) (exponential) | Mines all frequent subgraphs
FFSM | Huan et al (2003) | Graphs | Embedding lists (exponential) | Mines all frequent subgraphs
Gaston | Nijssen and Kok (2004); Nijssen and Kok (2005) | Graphs | Embedding lists (exponential) | Can mine paths, trees, and cyclic patterns
– | Horváth and Ramon (2010) | Bounded tree-width | Specialized (incr. pol. time) | Mines all frequent subgraphs
Summarize-Mine | Chen et al (2009) | Graphs | Embedding lists (exponential) | Mines a random subset of all frequent subgraphs
MUSE | Zou et al (2010) | Uncertain graphs | Embedding lists (exponential) |
REAFUM | Li and Wang (2015) | Graphs | Embedding lists (exponential) | β subgraph isomorphism
– | Schulz et al (2018) | Graphs | Dalmau et al (2002) (polynomial) | Partially injective homomorphism; results in a superset of frequent trees
Table 3.1.: An overview of related frequent subtree and subgraph mining systems for forest and graph transaction databases. Unless stated otherwise, these methods enumerate the full set of frequent subtrees and are our direct competitors.
FreeTreeMiner by Chi et al (2003) solves the FTM problem for tree databases. This work introduces tree mining as an area of research and develops the first⁷ algorithm that uses canonical representations of trees for efficient pattern generation. The authors propose a canonical string representation for trees and a levelwise algorithm to mine all frequent trees in a tree database. Based on their particular canonical representation, they argue that all frequent trees can be generated either by joining two frequent trees H + e and H + e′ that share a common parent H and differ in exactly one edge, or by extending the frequent tree H by a single edge f such that the resulting tree has a larger height⁸. Duplicate candidate generation is reduced⁹ by identifying nontrivial automorphisms of H, and some support counting steps are avoided by first checking whether all possible parent patterns of H + e are frequent. Chi et al use the efficient algorithm of Chung (1987) to compute the support of a candidate tree pattern in the tree database. They evaluate their algorithm on a chemical dataset, on an IP multi-cast dataset that represents one-to-many streaming topologies on the Internet, and on synthetic datasets.
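The role of a canonical string in duplicate detection can be illustrated with a minimal Python sketch. It does not reproduce the representation of Chi et al (2003). Instead, it assumes vertex-labeled free trees without edge labels, roots the tree at its center (or at both centers in turn), and encodes each rooted tree by sorting the codes of its children. The function names and the bracket encoding are assumptions made for this illustration, and labels are assumed not to contain parentheses.

```python
def tree_centers(adj):
    """Return the one or two center vertices of a tree given as adjacency sets."""
    degree = {v: len(nb) for v, nb in adj.items()}
    remaining = len(adj)
    leaves = [v for v, d in degree.items() if d <= 1]
    # Peel off the current leaves layer by layer until at most two vertices remain.
    while remaining > 2:
        remaining -= len(leaves)
        next_leaves = []
        for leaf in leaves:
            for nb in adj[leaf]:
                degree[nb] -= 1
                if degree[nb] == 1:
                    next_leaves.append(nb)
        leaves = next_leaves
    return leaves


def rooted_code(adj, labels, root, parent=None):
    """Encode the subtree below root; sorting makes the code order-independent."""
    children = sorted(
        rooted_code(adj, labels, child, root)
        for child in adj[root]
        if child != parent
    )
    return labels[root] + "(" + "".join(children) + ")"


def canonical_string(adj, labels):
    """Smallest rooted code over all centers; equal for isomorphic labeled trees."""
    return min(rooted_code(adj, labels, c) for c in tree_centers(adj))


# The same labeled path C - O - N, stored with different vertex identifiers.
t1 = ({0: {1}, 1: {0, 2}, 2: {1}}, {0: "C", 1: "O", 2: "N"})
t2 = ({"x": {"y"}, "y": {"x", "z"}, "z": {"y"}}, {"x": "N", "y": "O", "z": "C"})
assert canonical_string(*t1) == canonical_string(*t2)
```

In this simplified setting, two candidate trees are isomorphic exactly if their strings coincide, so a set of strings suffices to filter duplicate candidates.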
HybridTreeMiner by Chi et al (2004a) also solves the FTM problem for tree databases and, in addition, the problem of mining rooted trees in databases of rooted trees; hence the name of the algorithm. There are two main differences to their FreeTreeMiner algorithm above: first, they use a DFS approach instead of a BFS approach, and second, they propose a novel way of counting the support. The authors now resort to embedding lists, but maintain them in such a way that only one pass over the database is required. If a candidate pattern H + e + e′ is generated by joining two frequent patterns H + e and H + e′, its support can be computed by joining the support lists of the parent patterns: two embeddings are compatible if they are identical on H and map the endpoints of e and e′ to different vertices. All embeddings of H + e + e′ can therefore be constructed by combining such compatible embeddings. The extension operation works in a similar way, by combining compatible embeddings of H and of the frequent tree corresponding to the single edge f. In this way, explicit access to the graph database is not necessary after initially computing the embedding lists of all frequent tree patterns consisting of a single edge. They evaluate their algorithm on a chemical tree dataset and on a synthetic tree dataset and compare it to FreeTreeMiner (discussed above), reporting that HybridTreeMiner is faster by an order of magnitude. Interestingly, the IP multi-cast dataset is not considered in this study. In (Chi et al, 2004b), they extended this system to mine only closed frequent subtrees or maximal frequent subtrees.
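The join of two embedding lists described for HybridTreeMiner can be sketched in Python as follows. This is a simplified illustration and not the implementation of Chi et al (2004a). It assumes that an embedding is stored as a pair of a transaction identifier and a dictionary from pattern vertices to transaction vertices, that e and e′ each add one new leaf to H, and that labels and edges were already verified when the parent lists were built. All function and parameter names are hypothetical.

```python
def join_embedding_lists(list_e, list_f, core, leaf_e, leaf_f):
    """Join the embedding lists of H+e and H+e' into the list of H+e+e'.

    list_e, list_f: lists of (transaction id, mapping) pairs.
    core: tuple of the vertices of the common parent H.
    leaf_e, leaf_f: the pattern vertices added by e and e', respectively.
    """
    # Index the embeddings of H+e' by transaction and by their restriction to H.
    index = {}
    for tid, phi in list_f:
        key = (tid, tuple(phi[v] for v in core))
        index.setdefault(key, []).append(phi[leaf_f])

    joined = []
    for tid, phi in list_e:
        key = (tid, tuple(phi[v] for v in core))
        for image_f in index.get(key, []):
            # Compatible: identical on H, new endpoints mapped to different vertices.
            if image_f != phi[leaf_e]:
                joined.append((tid, {**phi, leaf_f: image_f}))
    return joined


def support(embedding_list):
    """Number of distinct transactions that contain at least one embedding."""
    return len({tid for tid, _ in embedding_list})


# Transaction 0 is a star with center "c" and leaves "a", "b" (uniform labels).
# H is the single vertex "h"; e and e' attach the new leaves "u" and "w" to it.
list_e = [(0, {"h": "c", "u": "a"}), (0, {"h": "c", "u": "b"}),
          (0, {"h": "a", "u": "c"}), (0, {"h": "b", "u": "c"})]
list_f = [(0, {"h": phi["h"], "w": phi["u"]}) for _, phi in list_e]
joined = join_embedding_lists(list_e, list_f, ("h",), "u", "w")
```

Only the two embeddings that map h to the center survive the join, because the pairs that map h to a leaf would send both new endpoints to the same vertex. The transaction graphs themselves are never consulted.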
⁷ Zaki (2002) introduced “tree mining” earlier, but considered rooted ordered trees and a different embedding operator.
⁸ With respect to its canonical representation, which is a rooted tree.
⁹ The authors claim to avoid duplicate candidate enumeration by identifying pattern automorphisms. They do not prove, however, that their technique guarantees nonredundant candidate enumeration. FreeTreeMiner additionally compares canonical strings of candidate patterns.
FreeTreeMiner by Rückert and Kramer (2004) solves the FTM problem in databases containing cyclic graphs. The authors propose a canonical string representation that allows their candidate generation process to reduce the number of duplicate evaluations of candidate patterns. They define the height of a vertex in a tree pattern as its distance to the root of the canonical representation and generate patterns by extending only on leaves with largest height. When evaluating the frequency of a candidate pattern by computing all of its embeddings explicitly, the algorithm at the same time computes the embedding lists for all extensions by a single edge. All extensions of height(H) + 1 are obtained by combining such single edge extensions. These candidate patterns are only recursively extended if they are in canonical form. The authors do not prove the correctness of their algorithm (neither soundness nor completeness nor irredundancy). They evaluate it on the AIDS database.
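Computing the embedding lists of all single edge extensions while enumerating the embeddings of a pattern, as described for the FreeTreeMiner of Rückert and Kramer (2004), can be sketched in Python as follows. The sketch is an illustration under simplifying assumptions and not the authors' implementation: vertices carry labels, edges do not, every extension adds one new leaf, and the restriction to leaves of largest height and the canonical form check are omitted. The data layout and all names are hypothetical.

```python
from collections import defaultdict


def single_edge_extensions(embeddings, graphs, new_vertex):
    """Group all one-leaf extensions of a tree pattern by their descriptor.

    embeddings: list of (transaction id, mapping) pairs of the current pattern.
    graphs: dict from transaction id to (labels, adjacency sets).
    new_vertex: identifier used for the added pattern vertex.
    Returns a dict from (attachment vertex, new label) to an embedding list.
    """
    extended = defaultdict(list)
    for tid, phi in embeddings:
        labels, adj = graphs[tid]
        used = set(phi.values())
        for p, g in phi.items():
            for x in adj[g]:
                # Only unused neighbors keep the extended mapping injective.
                if x not in used:
                    extended[(p, labels[x])].append((tid, {**phi, new_vertex: x}))
    return extended


def frequent_extensions(extended, min_support):
    """Keep the extensions that occur in at least min_support transactions."""
    return {
        ext: embs
        for ext, embs in extended.items()
        if len({tid for tid, _ in embs}) >= min_support
    }


graphs = {
    0: ({1: "C", 2: "O", 3: "N"}, {1: {2, 3}, 2: {1}, 3: {1}}),
    1: ({1: "C", 2: "O"}, {1: {2}, 2: {1}}),
}
# Current pattern: a single vertex "a" labeled C, embedded once per transaction.
embeddings = [(0, {"a": 1}), (1, {"a": 1})]
extended = single_edge_extensions(embeddings, graphs, "b")
frequent = frequent_extensions(extended, min_support=2)
```

Since only extensions that occur in some transaction graph are ever generated, no candidate with empty support is evaluated. The price is the explicit storage of all embeddings.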
F3TM by Zhao and Yu (2008) similarly solves the FTM problem in databases containing cyclic graphs, using a depth-first search over the pattern space. They employ an iterative version of (Ullmann, 1976) for the support counting step, which is intertwined with the candidate generation step. In particular, for a frequent pattern H they explicitly store a subset of all subgraph isomorphisms in an embedding list. The authors focus on the candidate generation step and show that the complete set of frequent patterns of a dataset can be obtained by extending the patterns only on a well-defined subset of their vertices, resulting in fewer duplicate candidate patterns. The number of candidate patterns is further reduced by considering automorphisms of the patterns and by considering only pattern extensions that are actually present in some transaction graph. The authors evaluate their algorithm on a variant of the AIDS database that is also considered in this thesis (cf. Section 2.4) and on artificial data obtained with the generator of Kuramochi and Karypis (2001). In (Zhao and Yu, 2007) they extend F3TM to mine closed frequent trees.