Frequent Tree Mining Algorithms - Algorithms for the FCSM Problem

3. Related Work

3.2. Algorithms for the FCSM Problem

3.2.1. Frequent Tree Mining Algorithms

Among the practical implementations of frequent subgraph mining algorithms, frequent tree mining algorithms are most closely related to our work here. Several algorithms have been proposed for computing the set of frequent trees in databases of trees, forests, or “arbitrary” graphs. As shown in Section 2.2, frequent subtrees can be enumerated efficiently (i.e., with polynomial delay) in forest transaction databases. However, most systems do not use an efficient embedding operator and hence may incur exponential delay and memory consumption even in this case.

System | Reference | Database | Embedding computation (complexity) | Remarks
HybridTreeMiner | Chi et al (2004a) | Forests | Embedding lists (exponential) |
FreeTreeMiner | Rückert and Kramer (2004) | Graphs | Support sets (exponential) |
F3TM | Zhao and Yu (2008) | Graphs | Ullmann (1976) (exponential) |
FSG | Kuramochi and Karypis (2004) | Graphs | Embedding lists (exponential) | Mines all frequent subgraphs
MoSS | Borgelt and Berthold (2002); Borgelt et al (2005) | Chemical graphs | Embedding lists (exponential) | Mines all frequent subgraphs
gSpan | Yan and Han (2002) | Graphs | Cordella et al (1998) (exponential) | Mines all frequent subgraphs
FFSM | Huan et al (2003) | Graphs | Embedding lists (exponential) | Mines all frequent subgraphs
Gaston | Nijssen and Kok (2004); Nijssen and Kok (2005) | Graphs | Embedding lists (exponential) | Can mine paths, trees, and cyclic patterns
– | Horváth and Ramon (2010) | Bounded tree-width | Specialized (incr. pol. time) | Mines all frequent subgraphs
Summarize-Mine | Chen et al (2009) | Graphs | Embedding lists (exponential) | Mines a random subset of all frequent subgraphs
MUSE | Zou et al (2010) | Uncertain graphs | Embedding lists (exponential) |
REAFUM | Li and Wang (2015) | Graphs | Embedding lists (exponential) | β subgraph isomorphism
– | Schulz et al (2018) | Graphs | Dalmau et al (2002) (polynomial) | Partially injective homomorphism; results in superset of frequent trees

Table 3.1.: An overview on related frequent subtree and subgraph mining systems for forest and graph transaction databases. Unless stated otherwise, these methods enumerate the full set of frequent subtrees and are our direct competitors.


FreeTreeMiner by Chi et al (2003) solves the FTM problem for tree databases. This work introduces tree mining as an area of research and develops the first[7] algorithm that uses canonical representations of trees for efficient pattern generation. The authors propose a canonical string representation for trees and a levelwise algorithm to mine all frequent trees in a tree database. Based on their particular canonical representation, they argue that all frequent trees can be generated by either joining two frequent trees H+e, H+e′ with a common parent H that differ in exactly one edge, or by extending the frequent tree H by a single edge f such that the resulting tree has a larger height[8]. Duplicate candidate generation is reduced[9] by identifying nontrivial automorphisms of H, and some support counting steps are avoided by first checking whether all possible parent patterns of H+e are frequent. Chi et al use the efficient algorithm of Chung (1987) to compute the support of a candidate tree pattern in the tree database. They evaluate their algorithm on a chemical dataset, an IP multi-cast dataset that represents one-to-many streaming topologies on the Internet, and on synthetic datasets.
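To illustrate what a canonical string representation of a free tree can look like, here is a minimal Python sketch. It is not the specific representation of Chi et al; it uses the textbook approach of rooting the tree at its center and sorting the encodings of the subtrees. The function names (`tree_centers`, `rooted_code`, `canonical_string`) and the input format (adjacency dictionaries with vertex labels only, labels free of parentheses) are assumptions made for this example.

```python
def tree_centers(adj):
    """Return the one or two center vertices of a nonempty tree.

    adj maps each vertex to the list of its neighbors.
    Leaves are peeled off layer by layer until at most two vertices remain.
    """
    degree = {v: len(nbrs) for v, nbrs in adj.items()}
    leaves = [v for v, d in degree.items() if d <= 1]
    remaining = len(adj)
    while remaining > 2:
        remaining -= len(leaves)
        next_leaves = []
        for leaf in leaves:
            for u in adj[leaf]:
                degree[u] -= 1
                if degree[u] == 1:
                    next_leaves.append(u)
        leaves = next_leaves
    return leaves


def rooted_code(adj, labels, root, parent=None):
    """Encode the subtree below root; child encodings are sorted."""
    child_codes = sorted(
        rooted_code(adj, labels, child, root)
        for child in adj[root]
        if child != parent
    )
    return "(" + labels[root] + "".join(child_codes) + ")"


def canonical_string(adj, labels):
    """Smallest rooted encoding over the (at most two) centers of the tree."""
    return min(rooted_code(adj, labels, c) for c in tree_centers(adj))


# Two isomorphic trees with different vertex names get the same string.
tree_a = ({0: [1], 1: [0, 2], 2: [1]}, {0: "C", 1: "C", 2: "O"})
tree_b = ({5: [7], 7: [5, 9], 9: [7]}, {5: "O", 7: "C", 9: "C"})
print(canonical_string(*tree_a) == canonical_string(*tree_b))  # True
```

Because isomorphic trees receive identical strings, a miner can detect duplicate candidates by comparing strings instead of running an isomorphism test for every pair of candidates.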

HybridTreeMiner by Chi et al (2004a) also solves the FTM problem for tree databases, and, in addition, the problem of mining rooted trees in databases of rooted trees. Hence the name of the algorithm. There are two main differences to their FreeTreeMiner algorithm above: First, they use a DFS approach instead of a BFS approach, and second, they propose a novel way of counting the support. Now, the authors resort to embedding lists but use them in a smart way that requires only one pass over the database. If a candidate pattern H+e+e′ is generated by joining two frequent patterns H+e, H+e′, its support can be computed by joining the support lists of the parent patterns: Two embeddings are compatible if they are identical on H and map the endpoints of e and e′ to different vertices. All embeddings for H+e+e′ can therefore be constructed by combining such compatible embeddings. The extension operation works in a similar way by combining compatible embeddings of H and the frequent tree corresponding to the single edge f. In this way, an explicit access to the graph database is not necessary after initially computing the embedding lists of all frequent tree patterns consisting of single edges. They evaluate their algorithm on a chemical tree dataset and on a synthetic tree dataset and compare it to FreeTreeMiner (discussed above). They show that this approach is faster by an order of magnitude. Interestingly, the IP multi-cast dataset is not considered in this study. In (Chi et al, 2004b), they extended this system to mine only closed frequent subtrees or maximal frequent subtrees.
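The join of embedding lists described above can be sketched as follows. This is a simplified illustration and not the data structure of Chi et al: an embedding is assumed to be a pair of a transaction id and a dictionary from pattern vertices to transaction vertices, `x` and `y` denote the new endpoints of e and e′, and all names (`join_embedding_lists`, `support`, `h_vertices`) are hypothetical.

```python
from collections import defaultdict


def join_embedding_lists(emb_e, emb_e2, h_vertices, x, y):
    """Combine compatible embeddings of H+e and H+e' into embeddings of H+e+e'.

    emb_e:      embeddings of H+e, whose new vertex is x
    emb_e2:     embeddings of H+e', whose new vertex is y
    h_vertices: tuple of the vertices of the common parent pattern H
    """
    # Index the embeddings of H+e' by transaction and by their restriction to H.
    index = defaultdict(list)
    for tid, psi in emb_e2:
        key = (tid, tuple(psi[v] for v in h_vertices))
        index[key].append(psi[y])

    joined = []
    for tid, phi in emb_e:
        key = (tid, tuple(phi[v] for v in h_vertices))
        for image_y in index.get(key, []):
            # Compatible: identical on H, new endpoints mapped to different vertices.
            if image_y != phi[x]:
                combined = dict(phi)
                combined[y] = image_y
                joined.append((tid, combined))
    return joined


def support(embeddings):
    """Number of transactions with at least one embedding."""
    return len({tid for tid, _ in embeddings})


# Transaction 0 is the unlabeled path 11 - 10 - 12; H is the single vertex "a".
# H+e and H+e' are both a single edge, written here with new vertex x resp. y.
single_edge_images = [(10, 11), (10, 12), (11, 10), (12, 10)]
emb_e = [(0, {"a": a, "x": b}) for a, b in single_edge_images]
emb_e2 = [(0, {"a": a, "y": b}) for a, b in single_edge_images]
joined = join_embedding_lists(emb_e, emb_e2, ("a",), "x", "y")
print(len(joined), support(joined))
```

In this example only the embeddings that map `a` to the middle vertex 10 survive the join, since every other pair of embeddings maps `x` and `y` to the same vertex. Note that the join never touches the transaction graphs themselves, which mirrors the observation that database access is unnecessary once the single-edge embedding lists exist.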

FreeTreeMiner by Rückert and Kramer (2004) solves the FTM problem in databases containing cyclic graphs. The authors propose a canonical string representation that allows their candidate generation process to reduce the number of duplicate evaluations of candidate patterns. They define the height of a vertex in a tree pattern as the distance to the root of the canonical representation and generate patterns by only extending on leaves with largest height. When evaluating the frequency of a candidate pattern by computing all of its embeddings explicitly, the algorithm at the same time computes the embedding lists for all extensions by a single edge. All extensions of height(H) + 1 are obtained by combining such single edge extensions. These candidate patterns are only recursively extended if they are in canonical form. The authors do not prove the correctness of their algorithm (neither soundness, completeness, nor irredundancy) and evaluate their algorithm on the AIDS database.

[7] Zaki (2002) introduced “tree mining” before, but considered rooted ordered trees and a different embedding operator.

[8] With respect to its canonical representation, which is a rooted tree.

[9] The authors claim to avoid duplicate candidate enumeration by identifying pattern automorphisms. They do not prove, however, that their technique guarantees nonredundant candidate enumeration. FreeTreeMiner additionally compares canonical strings of candidate patterns.

F3TM by Zhao and Yu (2008) similarly solves the FTM problem in databases containing cyclic graphs using a depth-first search over the pattern space. They focus on the candidate generation step and employ an iterative version of the algorithm of Ullmann (1976) for the support counting step, which is intertwined with the candidate generation step. In particular, for a frequent pattern H they explicitly store a subset of all subgraph isomorphisms in an embedding list. The authors show that the complete set of frequent patterns of a dataset can be obtained by extending the patterns only on a well-defined subset of their vertices, resulting in fewer duplicate candidate patterns. The number of candidate patterns is further reduced by considering automorphisms of the patterns and by considering only pattern extensions that are actually present in some transaction graph. The authors evaluate their algorithm on a variant of the AIDS database also considered in this thesis (cf. Section 2.4) and on artificial data obtained with the generator of Kuramochi and Karypis (2001). In (Zhao and Yu, 2007), they extend F3TM to mine closed frequent trees.
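To make the cost of support counting by subgraph isomorphism concrete, the following Python sketch counts the transactions that contain a tree pattern using plain backtracking. It is neither F3TM nor Ullmann's algorithm, which additionally prunes candidates with a refinement step; it only shows the kind of search whose worst-case running time is exponential. The function names and the input format (adjacency dictionaries with vertex labels, nonempty patterns, non-induced subgraph isomorphism) are assumptions for this example.

```python
from collections import deque


def bfs_order(pattern_adj, start):
    """Return [(vertex, parent)] in BFS order; the start vertex has parent None."""
    order, seen, queue = [], {start}, deque([(start, None)])
    while queue:
        v, p = queue.popleft()
        order.append((v, p))
        for u in pattern_adj[v]:
            if u not in seen:
                seen.add(u)
                queue.append((u, v))
    return order


def has_embedding(pattern_adj, pattern_lab, graph_adj, graph_lab):
    """True if the tree pattern is subgraph isomorphic to the graph."""
    order = bfs_order(pattern_adj, next(iter(pattern_adj)))

    def extend(i, mapping, used):
        if i == len(order):
            return True
        v, p = order[i]
        # The first pattern vertex may go anywhere; every other vertex must be
        # mapped to a neighbor of the image of its BFS parent. For a tree
        # pattern this parent edge is the only edge that has to be checked.
        candidates = graph_adj if p is None else graph_adj[mapping[p]]
        for g in candidates:
            if g not in used and graph_lab[g] == pattern_lab[v]:
                mapping[v] = g
                used.add(g)
                if extend(i + 1, mapping, used):
                    return True
                used.discard(g)
                del mapping[v]
        return False

    return extend(0, {}, set())


def support(pattern_adj, pattern_lab, database):
    """Number of transaction graphs that contain the pattern."""
    return sum(
        has_embedding(pattern_adj, pattern_lab, g_adj, g_lab)
        for g_adj, g_lab in database
    )


# Pattern: path C - C - O.
p_adj = {0: [1], 1: [0, 2], 2: [1]}
p_lab = {0: "C", 1: "C", 2: "O"}
# Transaction 1: triangle on C, C, O. Transaction 2: path C - O - C.
g1 = ({"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b"]}, {"a": "C", "b": "C", "c": "O"})
g2 = ({"a": ["b"], "b": ["a", "c"], "c": ["b"]}, {"a": "C", "b": "O", "c": "C"})
print(support(p_adj, p_lab, [g1, g2]))  # 1
```

The search stops at the first embedding it finds in each transaction. A system that stores embedding lists would instead keep enumerating and record every embedding, or a subset of them.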

In document Efficient Frequent Subtree Mining Beyond Forests (Page 51-54)