2.2.4 Canonical Forms

A canonical form is an agreed way of representing some object (such as a graph). In the context of graph mining, canonical forms are important so that the relevant algorithms can be both effective and efficient [177, 222]. In isomorphism testing, a canonical form ensures that two identical graphs are represented in the same manner, which simplifies the test: if a pair of graphs is isomorphic, then their canonical labellings will be identical [20, 117, 136, 177]. In graph mining algorithms that involve the generation of candidate graphs, such as the VULS mining algorithms described in this thesis, a canonical form is also needed to avoid generating duplicates (the same graph expressed in different ways).
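As a minimal illustration of the idea, the sketch below computes a brute-force canonical form for a small unlabelled graph by trying every vertex renumbering and keeping the smallest sorted edge list. This is not one of the labelling strategies used in this thesis; the function name `canonical_form` and the example graphs are hypothetical, and the factorial cost makes the approach usable only for very small graphs.

```python
from itertools import permutations


def canonical_form(num_vertices, edges):
    """Return the smallest sorted edge list over all vertex renumberings.

    Illustrative brute force only: the cost grows factorially with the
    number of vertices.
    """
    best = None
    for perm in permutations(range(num_vertices)):
        # Renumber both endpoints of every edge, then normalise the order.
        relabelled = sorted(tuple(sorted((perm[u], perm[v]))) for u, v in edges)
        if best is None or relabelled < best:
            best = relabelled
    return tuple(best)


# Two numberings of the same three-vertex path, plus a triangle.
path_a = [(0, 1), (1, 2)]
path_b = [(0, 2), (2, 1)]
triangle = [(0, 1), (1, 2), (0, 2)]

# Isomorphic graphs share one canonical form, so duplicates collapse.
distinct = {canonical_form(3, g) for g in (path_a, path_b, triangle)}
print(len(distinct))
```

Storing candidates by canonical form in a set, as in the last lines, is one simple way in which duplicate candidates can be discarded.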

From the literature we can identify a number of canonical labelling strategies such as: (i) Minimum DFS Coding [229], (ii) the coding proposed by Akihiro Inokuchi [118], (iii) Canonical Adjacency Matrix (CAM) encoding [109, 116], (iv) canonical representation of free trees [180], (v) string encodings for rooted ordered trees [152, 171], (vi) canonical forms for rooted unordered trees [8, 42], (vii) DFS Label Sequence (DFS-LS) encoding [236], (viii) Depth-Label Sequence (DLS) encoding [127], (ix) Breadth-First Canonical String (BFCS) encoding [40], (x) Depth-First Canonical String (DFCS) encoding [43] and (xi) Consolidated Prüfer Sequence (CPS) encoding [200]. The first, Minimum DFS Coding, differs from the other canonical forms in that it uses a tree representation of each graph instead of a more traditional adjacency matrix. With respect to the work described in this thesis, Minimum DFS Coding was adopted because of its popularity in the context of frequent subgraph mining.

Depth-First Search (DFS) is an algorithm for traversing or searching tree or graph data structures. One starts at the root (some arbitrary vertex in the case of a graph) and explores each branch as far as possible before backtracking. This requires some ordering of vertex identifiers: given two branches emanating from a vertex, an ordering needs to be imposed so that one branch is selected before the other. In the case of graphs, a similar mechanism is needed to select the start vertex.
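The traversal can be sketched as follows. This is a minimal sketch, assuming the graph is stored as a dictionary of adjacency lists and that neighbours are tried in ascending identifier order; the function name `dfs_order` and the example graph are hypothetical.

```python
def dfs_order(adjacency, start):
    """Return the vertices in the order a depth-first search visits them.

    Neighbours are tried in sorted order; this is the imposed ordering
    that decides which branch is explored first.
    """
    visited = []
    seen = set()

    def visit(vertex):
        seen.add(vertex)
        visited.append(vertex)
        for neighbour in sorted(adjacency[vertex]):
            if neighbour not in seen:
                visit(neighbour)  # explore this branch fully before backtracking

    visit(start)
    return visited


# Hypothetical five-vertex undirected graph.
graph = {0: [1, 2], 1: [0, 2, 4], 2: [0, 1, 3], 3: [2], 4: [1]}

print(dfs_order(graph, 0))  # [0, 1, 2, 3, 4]
print(dfs_order(graph, 2))  # [2, 0, 1, 4, 3]
```

Changing the start vertex changes the visiting order, which is why a graph admits many different DFS trees.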

For a single graph G, many different DFS trees can be constructed by selecting a different root U and different growing edges, and each DFS tree has its own DFS code. In other words, many DFS codes can be used to represent G. To prevent ambiguity we require a canonical form, so that among all these DFS codes exactly one, the minimal DFS code, is selected. It is this minimal DFS code that is then used to represent G. In the following, the generation of DFS codes for a graph G is described first. An example is then given to illustrate the process and to show how the minimal code is chosen as the canonical form from among the DFS codes generated.

The DFS code for a graph consists of a sequence of tuples describing the edges in a given graph G. The tuples are of the form:

⟨U, V, L_U, L_E, L_V⟩

where: (i) U is the identifier for the start vertex, (ii) V is the identifier for the end vertex, (iii) L_U is the vertex label for U, (iv) L_E is the edge label and (v) L_V is the vertex label for V. The DFS codes for G are generated as follows:

1. Impose an ordering on the vertices (a sequential set of unique vertex identifiers). Mark all edges as “unread”.

2. Select the first vertex in the ordering as the root (the start point) U.

3. Create an ordered list of edges {e_1, e_2, ..., e_n} emanating from root U. List backward edges prior to forward edges.

4. Process the edge list {e_1, e_2, ..., e_n} in order, for i = 1, ..., n (1 ≤ i ≤ n):

(a) If e_i is marked as being “unread”, create a code describing the edge and store it, mark the edge as “read” and proceed to (b).

(b) If e_i is a forward edge, choose the end vertex of e_i as root U and repeat from 3.
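The steps above can be sketched in code as follows. This is a minimal sketch under stated assumptions rather than a definitive implementation: the function name `dfs_code` and the data layout are hypothetical, the U and V entries of each tuple are taken to be DFS discovery indices, backward edges are ordered by the discovery index of their end vertex, and the example graph is reconstructed from column (b) of Table 2.1 because the figure itself is not reproduced here.

```python
def dfs_code(vertex_labels, edge_labels, order):
    """Build one DFS code for an undirected labelled graph.

    vertex_labels: dict mapping vertex -> label
    edge_labels:   dict mapping frozenset({u, v}) -> label
    order:         list of all vertices, the imposed ordering (step 1)
    """
    rank = {v: i for i, v in enumerate(order)}
    adjacency = {v: [] for v in order}
    for edge in edge_labels:
        u, v = tuple(edge)
        adjacency[u].append(v)
        adjacency[v].append(u)

    root = order[0]          # step 2: first vertex in the ordering
    discovered = {root: 0}   # vertex -> identifier used in the code
    read = set()             # edges already marked "read"
    code = []

    def grow(u):
        # Step 3: backward edges (to discovered vertices) before forward ones.
        backward = sorted((v for v in adjacency[u] if v in discovered),
                          key=discovered.get)
        forward = sorted((v for v in adjacency[u] if v not in discovered),
                         key=rank.get)
        for v in backward + forward:          # step 4
            edge = frozenset((u, v))
            if edge in read:
                continue
            read.add(edge)                    # step 4(a)
            is_forward = v not in discovered
            if is_forward:
                discovered[v] = len(discovered)
            code.append((discovered[u], discovered[v],
                         vertex_labels[u], edge_labels[edge], vertex_labels[v]))
            if is_forward:
                grow(v)                       # step 4(b)

    grow(root)
    return code


# Graph reconstructed from column (b) of Table 2.1 (an assumption).
labels = {0: "X", 1: "Y", 2: "X", 3: "Z", 4: "Z"}
edges = {frozenset((0, 1)): "a", frozenset((1, 2)): "b", frozenset((0, 2)): "a",
         frozenset((2, 3)): "c", frozenset((1, 3)): "b", frozenset((1, 4)): "d"}

code_b = dfs_code(labels, edges, [0, 1, 2, 3, 4])
print(code_b[0], code_b[-1])  # (0, 1, 'X', 'a', 'Y') (1, 4, 'Y', 'd', 'Z')
```

Passing a different vertex ordering, for example `[1, 0, 2, 3, 4]`, starts the search from the vertex labelled Y and yields a different DFS code for the same graph.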

Figure 2.3: Depth-First Search Tree and its Forward/Backward Edge Set [229]. Note that forward edges are represented by solid lines and backward edges by dashed lines.

Table 2.1: DFS codes for Figure 2.3(b), (c) and (d) [229]

| Edge no. | (b) | (c) | (d) |
|---|---|---|---|
| 0 | ⟨0, 1, X, a, Y⟩ | ⟨0, 1, Y, a, X⟩ | ⟨0, 1, X, a, X⟩ |
| 1 | ⟨1, 2, Y, b, X⟩ | ⟨1, 2, X, a, X⟩ | ⟨1, 2, X, a, Y⟩ |
| 2 | ⟨2, 0, X, a, X⟩ | ⟨2, 0, X, b, Y⟩ | ⟨2, 0, Y, b, X⟩ |
| 3 | ⟨2, 3, X, c, Z⟩ | ⟨2, 3, X, c, Z⟩ | ⟨2, 3, Y, b, Z⟩ |
| 4 | ⟨3, 1, Z, b, Y⟩ | ⟨3, 0, Z, b, Y⟩ | ⟨3, 0, Z, c, X⟩ |
| 5 | ⟨1, 4, Y, d, Z⟩ | ⟨0, 4, Y, d, Z⟩ | ⟨2, 4, Y, d, Z⟩ |

For instance, consider the graph G given in Figure 2.3(a). From this graph various DFS trees can be generated, depending on which vertex is chosen as the root U. Figures 2.3(b), (c) and (d) give three different DFS trees for the graph in Figure 2.3(a). If we choose the first “X” in Figure 2.3(a) as the root U, the DFS tree shown in Figure 2.3(b) is generated. In the same manner, if we choose “Y” or the second “X” as the root, the trees shown in Figures 2.3(c) and (d) are generated. Thus G can be represented in three different ways, with one DFS code per DFS tree, as shown in Table 2.1; all three codes describe the same graph G (Figure 2.3(a)). Note that forward edges are represented by solid lines and backward edges by dashed lines. The question is how to find the minimal DFS code among these three DFS codes.

The minimum DFS code, as the name suggests, is the minimum code according to a lexicographic order on the graph edges. In other words, an ordering is imposed on the element values of the five-element tuples ⟨U, V, L_U, L_E, L_V⟩. Thus, in Table 2.1, edge 0 is compared first across (b), (c) and (d). The first and second elements are the same (0 and 1). For the third element, “X” lexicographically precedes “Y”, so DFS codes (b) and (d) are “smaller” than (c), and (c) is eliminated. Codes (b) and (d) agree on the third and fourth elements of edge 0 (“X” and “a”) but differ on the fifth, where “X” precedes “Y”, so (d) is smaller than (b). Comparing edge 1 leads to the same conclusion: the first two elements are the same (1 and 2), and the third element of (d), “X”, precedes that of (b), “Y”. Thus (d) is the minimum DFS code for G, and it is therefore the unique canonical form with which to represent G (Figure 2.3(a)). Note that a generated subgraph is a duplicate if and only if its DFS code is not the minimum one.
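The comparison can be reproduced with a short sketch. Only the first two tuples of each column of Table 2.1 are transcribed, since the ordering of the three codes is already decided there. Python's built-in element-by-element comparison of lists of tuples is used as a stand-in for the lexicographic order described above; this is an assumption that holds here because the compared tuples share the same U and V entries, and the variable names are hypothetical.

```python
# First two edges of each DFS code from Table 2.1.
codes = {
    "b": [(0, 1, "X", "a", "Y"), (1, 2, "Y", "b", "X")],
    "c": [(0, 1, "Y", "a", "X"), (1, 2, "X", "a", "X")],
    "d": [(0, 1, "X", "a", "X"), (1, 2, "X", "a", "Y")],
}

# Lists of tuples compare element by element, left to right.
ranking = sorted(codes, key=codes.get)
minimum = min(codes, key=codes.get)

print(ranking)  # ['d', 'b', 'c']
print(minimum)  # d
```

The ranking agrees with the manual comparison: (d) precedes (b), which precedes (c).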

Given two graphs G and G′, G is isomorphic to G′ if and only if the minimum DFS code of G is identical to the minimum DFS code of G′. The isomorphism testing process is given in Algorithm 1. If the minimum DFS codes of two graphs are not the same, the two graphs are not isomorphic. Thus the problem of mining subgraphs can be said to be equivalent to analysing their corresponding minimum DFS codes. Note that the minimal DFS code depends on the global set of labels in an input graph or set of input graphs. Any two subgraphs that subscribe to this global labelling can therefore be compared; if they do not subscribe to it, comparison is not possible. As will become apparent in Section 2.3.3.2, minimum DFS codes can enhance the process of frequent subgraph mining, because isomorphism testing reduces to comparing the minimal DFS codes of two graphs. For the VULS classification proposed in this thesis, minimum DFS coding was therefore adopted for subgraph matching (subgraph isomorphism) with respect to both VULS mining and vertex classification by VULS.

Algorithm 1

procedure IsomorphismTest(G, G′)
    S = the set of minimal DFS codes for graph G
    S′ = the set of minimal DFS codes for graph G′
    result = true
    if |S| ≠ |S′| then
        result = false
    else
        for all i from 1 to |S| do
            if S_i ≠ S′_i then
                result = false
                return result
            end if
        end for
    end if
    return result
end procedure
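A minimal Python sketch of Algorithm 1 is given below. It assumes the minimum DFS code of each graph has already been computed and is supplied as a list of five-element tuples; the function name `isomorphism_test` and the example codes are hypothetical.

```python
def isomorphism_test(min_code_g, min_code_h):
    """Compare two minimum DFS codes tuple by tuple, as in Algorithm 1."""
    if len(min_code_g) != len(min_code_h):
        return False          # different numbers of edges
    for tuple_g, tuple_h in zip(min_code_g, min_code_h):
        if tuple_g != tuple_h:
            return False      # first mismatch settles the test
    return True


# Hypothetical minimum DFS codes for three small graphs.
code_1 = [(0, 1, "X", "a", "X"), (1, 2, "X", "a", "Y")]
code_2 = [(0, 1, "X", "a", "X"), (1, 2, "X", "a", "Y")]
code_3 = [(0, 1, "X", "a", "X"), (1, 2, "X", "b", "Y")]

print(isomorphism_test(code_1, code_2))  # True
print(isomorphism_test(code_1, code_3))  # False
```

The cost of the test itself is linear in the number of edges; the expensive part, which this sketch leaves out, is computing the minimum DFS codes in the first place.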