4.4 A Road Map of Frequent Subgraph Mining

4.4.3 Edge Based Enumeration

Level-wise Search: the FSG Algorithm

FSG (Frequent Subgraph Mining) [KK01] identifies all frequent patterns by a level-wise search procedure. In the first step, FSG preprocesses the input graph database and identifies all frequent single-edge patterns. At step k, FSG identifies the set of frequent subgraphs that contain exactly k edges; this set is denoted F_k. For a graph G, the edge size of G is the number of edges that G contains. The task at step k is subdivided into two phases, candidate subgraph proposing and candidate subgraph validation, which are detailed below.

Candidate Subgraph Proposing. Given a set of frequent (k−1)-edge graphs F_{k−1}, FSG constructs candidate frequent k-edge subgraphs by “joining” two frequent (k−1)-edge subgraphs. In FSG, two graphs are “joinable” if they have the same edge size l > 0 and they share a common subgraph of edge size l−1. The “join” of two joinable graphs G_1 and G_2 with edge size k−1 produces the set of graphs with edge size k that are supergraphs of both G_1 and G_2. In other words, the FSG join operation is defined as:

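$$
\text{FSG-join}(G_1, G_2) =
\begin{cases}
\{\, G \mid G_1 \subseteq G,\ G_2 \subseteq G,\ |E[G]| = k \,\} & \text{if } G_1 \text{ and } G_2 \text{ are joinable} \\
\emptyset & \text{otherwise}
\end{cases}
$$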

We use |E[G]| to denote the edge size of a graph G.

FSG applies the join operation to every pair of joinable graphs in F_{k−1} to produce a list of candidate k-edge patterns C_k. The join operation is illustrated in Figure 4.3, and the pseudocode is presented in Algorithm 2.
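The joinability condition can be illustrated with a small Python sketch. It rests on assumptions that go beyond the text: a graph is stored as a set of labeled edge tuples `(u, v, label_u, label_edge, label_v)`, isomorphism is decided by a brute-force canonical form that is only practical for tiny graphs, and the connectivity of the shared (l−1)-edge core is not checked. The names `canonical` and `joinable` are hypothetical and are not taken from [KK01].

```python
from itertools import permutations

def canonical(edges):
    """Return a form of an edge set that is invariant under node renaming.

    Brute force over all node numberings, so only usable for tiny graphs.
    """
    nodes = sorted({x for u, v, *_ in edges for x in (u, v)})
    best = None
    for perm in permutations(range(len(nodes))):
        rename = dict(zip(nodes, perm))
        code = []
        for u, v, lu, le, lv in edges:
            i, j = rename[u], rename[v]
            if i > j:  # store each undirected edge with the smaller index first
                i, j, lu, lv = j, i, lv, lu
            code.append((i, j, lu, le, lv))
        code = tuple(sorted(code))
        if best is None or code < best:
            best = code
    return best

def joinable(g1, g2):
    """True if g1 and g2 have the same edge size l > 0 and share a core of l - 1 edges."""
    if len(g1) != len(g2) or not g1:
        return False
    cores = {canonical(g1 - {e}) for e in g1}  # every G1 - e1
    return any(canonical(g2 - {e}) in cores for e in g2)  # some G2 - e2 matches

p = {(0, 1, "a", "x", "a"), (1, 2, "a", "x", "b")}  # path a - a - b
q = {(7, 8, "a", "x", "a"), (8, 9, "a", "x", "a")}  # path a - a - a
r = {(0, 1, "b", "y", "b"), (1, 2, "b", "y", "b")}  # path b - b - b
print(joinable(p, q))  # True: both contain the single edge a - a
print(joinable(p, r))  # False: no common one-edge core
```

Generating the joined k-edge graphs themselves is more involved and is left to [KK01]; the sketch covers only the test on line 3 of Algorithm 2.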

Example 4.4.1 In Figure 4.3, we show the FSG join operation for a pair of graphs.

Figure 4.3: An example of applying the FSG join operation to a pair of graphs. Nodes and edges have the same label.

Candidate Subgraph Validation. FSG selects the frequent subgraphs with edge size k from the set C_k by computing the support value of each graph in C_k. To compute the support value of a graph G, FSG scans the graph database and, for each graph G′ in the database, uses a subgraph isomorphism test to determine whether G is a subgraph of G′, updating the support value of G if it is. The pseudocode of FSG-validation is presented below in Algorithm 3.
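The validation step can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation from [KK01]: a graph is a pair `(node_labels, edge_labels)` of dictionaries, the subgraph test is label-preserving and non-induced, it is done by brute force over injective node mappings, and support is reported as the fraction of database graphs containing the pattern. The function names are hypothetical.

```python
from itertools import permutations

def is_subgraph(pattern, target):
    """Brute-force, label-preserving subgraph test (non-induced)."""
    p_nodes, p_edges = pattern
    t_nodes, t_edges = target
    # Undirected lookup: frozenset({u, v}) -> edge label.
    t_lookup = {frozenset(e): lab for e, lab in t_edges.items()}
    p_ids = list(p_nodes)
    for image in permutations(t_nodes, len(p_ids)):
        f = dict(zip(p_ids, image))  # candidate injective node mapping
        if any(p_nodes[u] != t_nodes[f[u]] for u in p_ids):
            continue  # a node label does not match
        if all(
            frozenset((f[u], f[v])) in t_lookup
            and t_lookup[frozenset((f[u], f[v]))] == lab
            for (u, v), lab in p_edges.items()
        ):
            return True
    return False

def support(pattern, database):
    """Fraction of database graphs that contain `pattern`."""
    return sum(is_subgraph(pattern, g) for g in database) / len(database)

g1 = ({0: "a", 1: "a", 2: "b"}, {(0, 1): "x", (1, 2): "y"})
g2 = ({0: "a", 1: "b"}, {(0, 1): "y"})
ab = ({0: "a", 1: "b"}, {(0, 1): "y"})  # single edge a -y- b
aa = ({0: "a", 1: "a"}, {(0, 1): "x"})  # single edge a -x- a
print(support(ab, [g1, g2]))  # 1.0
print(support(aa, [g1, g2]))  # 0.5
```

With a threshold of σ = 0.6, only the pattern `ab` would be kept as frequent in this toy database.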

Putting It All Together. Here we present the pseudocode for the FSG algorithm, which identifies all frequent subgraphs F in a graph database 𝒢 with support threshold 0 < σ ≤ 1. We have simplified the FSG algorithm to explain its basic structure; see [KK01] for details of performance improvements.

Algorithm 1 FSG(𝒢, σ): Frequent Subgraph Mining

1: F_1 ← {e | s(e) ≥ σ}  # all frequent edges
2: k ← 2
3: while F_{k−1} ≠ ∅ do
4:   C_k ← FSG-join(F_{k−1}, k)
5:   F_k ← FSG-validation(C_k, 𝒢, σ)
6:   k ← k + 1
7: end while
8: F ← ⋃_{i∈[1,k−1]} F_i

Algorithm 2 FSG-join(F_{k−1}, k): join pairs of subgraphs in F_{k−1}

1: C_k ← ∅
2: for each G_1, G_2 ∈ F_{k−1} do
3:   if there exist e_1 ∈ E[G_1] and e_2 ∈ E[G_2] such that G_1 − e_1 = G_2 − e_2 then
4:     C_k ← C_k ∪ {G | G_1 ⊆ G, G_2 ⊆ G, |E[G]| = k}  # joinable
5:   end if
6: end for
7: return C_k

Algorithm 3 FSG-validation(C_k, 𝒢, σ): Validate Frequent Subgraphs

1: F_k ← ∅
2: for each G ∈ C_k do
3:   s(G) ← 0
4:   for each G′ ∈ 𝒢 do
5:     if G ⊆ G′ then s(G) ← s(G) + 1/|𝒢| end  # computing support value as a fraction of the database
6:   end for
7:   if s(G) ≥ σ then F_k ← F_k ∪ {G} end
8: end for
9: return F_k

Depth-First Search: The gSpan Algorithm

gSpan uses a depth-first algorithm to search for frequent subgraphs [YH02]. Like FSG, gSpan first preprocesses the graph database and identifies all frequent single edges. gSpan then proposes candidate subgraphs with a novel extension operation. To understand this extension operation, we first introduce the depth-first code representation of a graph, developed in gSpan.

Depth-First Code of Graphs. Given a connected graph G, a depth-first search S of G produces a list of the nodes in G. We denote the nodes in V[G] as 1, 2, ..., n according to the order in which they are enumerated in S, where n is the size of the graph G. We call node n the rightmost node.

Each edge in G is represented by a 5-element tuple e = (i, j, λ(i), λ(i, j), λ(j)), where i and j are nodes in G (i < j) and λ is the labeling function of G that assigns labels to nodes and edges.

We define a total order ≼ on the edges of G such that e_1 ≼ e_2 if i_1 < i_2, or if i_1 = i_2 and j_1 ≤ j_2.

Given a graph G and a depth-first search S, we may sort the edges of G according to the total order ≼ and concatenate the sorted edges to produce a single sequence of labels. Such a sequence is a depth-first code of G. A graph G may have many depth-first codes; the smallest one (in the lexicographical order of sequences) is the canonical depth-first search code of G, denoted DFS(G). The depth-first tree that produces the canonical form of G is its canonical DFS tree.
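The construction can be made concrete with a short Python sketch that follows the definitions exactly as stated in this section: number the nodes by a depth-first search, turn each edge into a 5-tuple, sort, and take the minimum over all depth-first searches. It assumes a small connected graph given as two dictionaries (node labels, and edge labels keyed by node pairs), and it enumerates every depth-first order by brute force. The function names are hypothetical; [YH02] describes how gSpan avoids this exhaustive enumeration.

```python
def dfs_orders(adj, order, path):
    """Yield every node order that a depth-first search can produce from `order`."""
    if len(order) == len(adj):
        yield tuple(order)
        return
    # Backtrack until the node on top of the path has an unvisited neighbour.
    while not any(v not in order for v in adj[path[-1]]):
        path = path[:-1]
    for v in adj[path[-1]]:
        if v not in order:
            yield from dfs_orders(adj, order + [v], path + [v])

def dfs_code(node_labels, edge_labels, order):
    """Sorted 5-tuples (i, j, label_i, label_ij, label_j) for one DFS numbering."""
    idx = {v: n + 1 for n, v in enumerate(order)}  # nodes are numbered 1..n
    code = []
    for (u, v), le in edge_labels.items():
        a, b = (u, v) if idx[u] < idx[v] else (v, u)  # enforce i < j
        code.append((idx[a], idx[b], node_labels[a], le, node_labels[b]))
    return tuple(sorted(code))

def canonical_dfs_code(node_labels, edge_labels):
    """Smallest DFS code over all depth-first searches of a connected graph."""
    adj = {v: [] for v in node_labels}
    for u, v in edge_labels:
        adj[u].append(v)
        adj[v].append(u)
    return min(
        dfs_code(node_labels, edge_labels, order)
        for start in node_labels
        for order in dfs_orders(adj, [start], [start])
    )

# Path b -x- a -y- a, written with two different node namings.
g = ({"p": "b", "q": "a", "r": "a"}, {("p", "q"): "x", ("q", "r"): "y"})
h = ({1: "a", 2: "a", 3: "b"}, {(2, 1): "y", (3, 2): "x"})
print(canonical_dfs_code(*g))  # ((1, 2, 'a', 'x', 'b'), (1, 3, 'a', 'y', 'a'))
print(canonical_dfs_code(*g) == canonical_dfs_code(*h))  # True
```

Because the minimum is taken over all depth-first searches, isomorphic graphs receive the same code regardless of how their nodes are named.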

Candidate Subgraph Proposing. In gSpan, a frequent subgraph G is extended to a candidate frequent subgraph G′ by adding one edge to G. gSpan carefully chooses the position of the newly introduced edge so that all frequent subgraphs can still be enumerated with the extension operation. See [YH02] for further details.

Candidate Subgraph Validation. gSpan uses the same procedure as FSG (a scan of the graph database with subgraph isomorphism tests to determine the support value) to select frequent subgraphs from a set of candidates.

Compared with the level-wise search algorithm FSG, gSpan has better memory utilization due to its depth-first search and achieves an order-of-magnitude speedup in several benchmarks [YH02].

Putting It All Together. We present the gSpan algorithm below. We use ⊑ to denote the prefix relation between two strings. The procedure gSpan-validation is the same as FSG-validation and hence is not duplicated.

The key idea in gSpan is a “selective” candidate proposing strategy (line 1, Algorithm 6). Under this scheme, given a graph G, gSpan proposes only those supergraphs G′ of G for which the DFS code of G is a prefix of that of G′. See [YH02] for proofs of how and why this selective proposing scheme works.
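The prefix filter itself is simple to express. The following Python sketch assumes that DFS codes are already available as tuples of 5-element edge tuples (how they are computed is outside this snippet) and uses the hypothetical helper names `is_prefix` and `select_extensions`; it illustrates only the selection condition of Algorithm 6, not how gSpan generates the candidates.

```python
def is_prefix(code, other):
    """True if DFS code `code` is a prefix of DFS code `other`."""
    return other[:len(code)] == code

def select_extensions(parent_code, candidate_codes):
    """Keep candidates with exactly one more edge whose code extends `parent_code`."""
    k = len(parent_code) + 1
    return [c for c in candidate_codes if len(c) == k and is_prefix(parent_code, c)]

parent = ((1, 2, "a", "x", "a"),)
kept = parent + ((2, 3, "a", "x", "b"),)  # parent code plus one appended edge
dropped = ((1, 2, "a", "x", "b"), (2, 3, "b", "x", "a"))  # does not start with parent
print(select_extensions(parent, [kept, dropped]) == [kept])  # True
```

A supergraph whose canonical code does not begin with the parent's code is skipped here; it is expected to be reached from a different parent instead, which is what the proofs in [YH02] establish.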

Algorithm 4 gSpan(𝒢, σ): Frequent Subgraph Mining

1: F_1 ← {e | s(e) ≥ σ}  # all frequent edges
2: F ← F_1
3: k ← 1
4: for each G ∈ F_1 do
5:   F ← F ∪ gSpan-search(G, k, 𝒢, σ)
6: end for

Algorithm 5 gSpan-search(G, k, 𝒢, σ)

k ← k + 1
C_k ← gSpan-extension(G, k)
F_k ← gSpan-validation(C_k, 𝒢, σ)
F ← F_k
for each G′ ∈ F_k do
  F ← F ∪ gSpan-search(G′, k, 𝒢, σ)
end for
return F

Algorithm 6 gSpan-extension(G, k)

1: C_k ← {G′ | G ⊂ G′, |E[G′]| = k, DFS(G) ⊑ DFS(G′)}
2: return C_k

Other Edge-Based Depth-First Algorithms

One performance bottleneck of gSpan is pattern validation. This is particularly true when dealing with large and complex graphs (dense graphs with few distinct labels), since subgraph isomorphism testing is expensive. Instead of searching for all subgraph isomorphisms from scratch, the method proposed by Borgelt & Berthold [BB02] maintains a list of all subgraph isomorphisms (“embeddings”) of a frequent subgraph G to support incremental subgraph isomorphism testing.
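The idea of incremental testing with embedding lists can be sketched in Python as follows. This is an illustration of the general idea only, not the data structure used in [BB02]. It assumes that pattern nodes are numbered from 0, that an embedding is a tuple mapping pattern node p to a target node `emb[p]`, that the target graph is given as `(node_labels, adj)` with `adj[u][v]` holding the edge label, and that edge labels are never `None`. The function name and parameters are hypothetical.

```python
def extend_embeddings(embeddings, target, i, j, edge_label, new_node_label=None):
    """Extend stored embeddings of a pattern by one new pattern edge (i, j).

    Pattern node i must already be matched; j is either an already-matched
    pattern node or the next new pattern node (j == len(emb)).
    """
    node_labels, adj = target
    extended = []
    for emb in embeddings:
        u = emb[i]
        if j < len(emb):
            # New edge joins two already-matched pattern nodes: just check it.
            if adj[u].get(emb[j]) == edge_label:
                extended.append(emb)
        else:
            # New edge introduces pattern node j: try each unused neighbour of u.
            for v, lab in adj[u].items():
                if lab == edge_label and node_labels[v] == new_node_label and v not in emb:
                    extended.append(emb + (v,))
    return extended

# Target graph: a(0) -x- a(1) -y- b(2)
target = ({0: "a", 1: "a", 2: "b"}, {0: {1: "x"}, 1: {0: "x", 2: "y"}, 2: {1: "y"}})
embs = [(0, 1), (1, 0)]  # embeddings of the one-edge pattern a -x- a
embs = extend_embeddings(embs, target, 1, 2, "y", new_node_label="b")
print(embs)  # [(0, 1, 2)]
print(extend_embeddings(embs, target, 0, 2, "x"))  # []
```

A pattern occurs in a database graph exactly when its embedding list for that graph is non-empty, so each extension step only inspects the neighbourhoods of stored embeddings instead of restarting the subgraph isomorphism search.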

In FFSM [HWP03], another edge-based depth-first search method, a hybrid candidate proposing algorithm supports both join and extension operations with improved efficiency. We cover the details of FFSM in Section 5.