Backward-Match-Voting algorithm - Vertex unique labelled subgraph mining

As noted in the introduction above, the idea is to use pre-labelled subgraphs, such as sets of identified VULS, to label the vertices in an unseen graph G by matching the individual pre-labelled subgraphs with subgraphs in G. A general issue with this approach is that a vertex in G may be matched to more than one vertex in the given set of pre-labelled subgraphs, possibly with different vertex labels. In this case the conflicting labelling needs to be resolved. A number of mechanisms could be used to do this; we propose a voting mechanism here. A variety of voting mechanisms exist, including: (i) majority voting and (ii) weighted voting. We adopted the first because the most appropriate weighting to use was difficult to determine. We refer to the proposed vertex classification approach as the Backward-Match-Voting mechanism because: (i) we work “backwards” from the maximum value k = max to k = 1 (this is more efficient because a larger number of vertices in G is potentially covered, increasing the likelihood of reaching 100% coverage early in the process), (ii) we “match” the graph structures and edge labellings of pre-labelled subgraphs to G so as to label the vertices in G using the labels from the pre-labelled subgraphs, and (iii) where more than one vertex label is assigned to a vertex in G we use a majority “voting” scheme.
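The majority voting step can be illustrated with a minimal sketch. The function name, the vertex identifiers and the candidate labels below are hypothetical, and the tie-breaking rule (smallest label wins) is an assumption made only to keep the result deterministic; the text does not specify how ties are resolved.

```python
from collections import Counter

def majority_label(candidate_labels):
    """Return the most frequent candidate label, or None if there are none."""
    if not candidate_labels:
        return None
    counts = Counter(candidate_labels)
    best = max(counts.values())
    # Assumption: ties are broken by taking the smallest label.
    return min(label for label, n in counts.items() if n == best)

# Hypothetical candidate labels collected for three vertices of G.
votes = {"V1": ["A", "B", "A", "C"], "V2": ["C", "C"], "V3": []}
resolved = {v: majority_label(labels) for v, labels in votes.items()}
print(resolved)
```

A weighted scheme would replace the raw counts with per-subgraph weights, which is where the difficulty of choosing appropriate weights arises.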

The Backward-Match-Voting algorithm is presented in Algorithm 10. The algorithm takes as input a collected set U of pre-labelled subgraphs (such as a set of VULS generated as described in the previous chapter) and a new graph G which has known edge labels but unknown vertex labels. The algorithm also uses the parameter max, the maximum size of the pre-labelled subgraphs in U, so that it does not try to find matches beyond this size. When a set of VULS is used as the pre-labelled subgraphs, max is set to the same value as that used by the VULS mining algorithm that generated the VULS. The output is a vertex labelled graph G. The algorithm starts with pre-labelled subgraphs of size max and iteratively proceeds with pre-labelled subgraphs of decreasing size until one-edge pre-labelled subgraphs are reached or 100% coverage of the vertices in G is obtained, whichever happens first. At the beginning of the algorithm, a set of vertex labels LV is initialized (line 11) to hold vertex labels

Algorithm 10: Backward-Match-Voting algorithm to predict vertex labels in a vertex-unlabelled graph

 1: Input:
 2:   U = a set of pre-labelled subgraphs (such as VULS)
 3:   G = ⟨V, E, LE⟩, edge labelled graph (with unlabelled vertices)
 4:   max = the maximum size of an element of U
 5:   L = default vertex label
 6: Output:
 7:   Graph G with labelled vertices
 8: procedure main(G, max, U, L)
 9:   coverage = 0
10:   k = max
11:   LV = ∅
12:   while (k ≥ 1) do
13:     G′ = the set of all subgraphs in G
14:     for all c ∈ U where |c| ≡ k do
15:       for each g ⊆ G′ isomorphic to c do
16:         add (v, lv) to LV, for each v in g with lv equal to the label of v in c
17:       end for
18:     end for
19:     coverage = compute the coverage of G using Equation 4.1
20:     if coverage ≡ 100% then
21:       exit
22:     end if
23:     k = k − 1
24:   end while
25:   for all v in G do
26:     if v occurs in LV then
27:       set the label of v in G to the most frequent label of v in LV
28:     end if
29:   end for
30:   Set the label of any remaining unlabelled vertex in G to L
31: end procedure

extracted from relevant pre-labelled subgraphs in U. Note that each element of LV corresponds to a vertex in G that does not yet have a label associated with it (so all vertices in G on start-up). On each iteration we then process each pre-labelled subgraph c of size k; if a k-edge subgraph g within the input graph G is isomorphic to c, each vertex in g is labelled according to the vertex labelling in c. In this manner, part of LV is populated on each iteration. Once all k-edge subgraphs in U have been processed, the coverage is calculated (line 19); if the coverage equals 100% the process stops (line 21), otherwise we continue with the (k−1)-edge subgraphs. Once all items in U have been processed, or 100% coverage has been reached, we process each vertex v in graph G. If a vertex v has more than one label associated with it, the voting mechanism is invoked and the most popular label is assigned to v in G. It may be the case that some elements in LV are still set to null, indicating that the corresponding vertex has not been covered by any pre-labelled subgraph. In this case a default label is assigned (line 30). In the evaluation presented later in this thesis, the most frequent vertex label from the training data is used as the default label.
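The procedure described above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the implementation used in the thesis: the data layout (an edge dictionary keyed by vertex pairs, patterns as dictionaries), the function names and the toy graph are all hypothetical; matching is done by brute-force enumeration of vertex mappings, which is only practical for very small graphs; and because Equation 4.1 is not reproduced in this section, coverage is assumed to be the fraction of vertices in G that have received at least one candidate label.

```python
from collections import Counter, defaultdict
from itertools import permutations

def matches(pattern, graph_vertices, graph_edges):
    """Yield each distinct labelling {vertex of G: label} obtained by
    embedding `pattern` in G with every edge label preserved."""
    p_vertices = sorted(pattern["labels"])
    seen = set()
    for image in permutations(sorted(graph_vertices), len(p_vertices)):
        m = dict(zip(p_vertices, image))
        if all(graph_edges.get(frozenset((m[a], m[b]))) == lab
               for a, b, lab in pattern["edges"]):
            labelling = frozenset((m[a], pattern["labels"][a]) for a in p_vertices)
            # Assumption: symmetric embeddings that give the same labelling
            # are counted once.
            if labelling not in seen:
                seen.add(labelling)
                yield dict(labelling)

def backward_match_voting(graph_vertices, graph_edges, patterns, max_size, default_label):
    votes = defaultdict(list)                 # plays the role of LV
    k = max_size
    while k >= 1:                             # work "backwards" from max to 1
        for c in patterns:
            if len(c["edges"]) == k:          # |c| is taken as the edge count
                for labelling in matches(c, graph_vertices, graph_edges):
                    for v, lab in labelling.items():
                        votes[v].append(lab)
        # Assumed coverage: every vertex has at least one candidate label.
        if len(votes) == len(graph_vertices):
            break
        k -= 1
    labels = {}
    for v in graph_vertices:
        if v in votes:
            # Majority vote; Counter breaks ties by first occurrence.
            labels[v] = Counter(votes[v]).most_common(1)[0][0]
        else:
            labels[v] = default_label         # vertex not covered by any pattern
    return labels

# Toy input: a path V1-V2-V3-V4 with edge labels, plus an isolated vertex V5.
G_vertices = {"V1", "V2", "V3", "V4", "V5"}
G_edges = {frozenset(("V1", "V2")): "black",
           frozenset(("V2", "V3")): "red",
           frozenset(("V3", "V4")): "green"}
U = [
    {"edges": [("a", "b", "black"), ("b", "c", "red")],
     "labels": {"a": "A", "b": "B", "c": "C"}},
    {"edges": [("a", "b", "red"), ("b", "c", "green")],
     "labels": {"a": "B", "b": "C", "c": "D"}},
    {"edges": [("x", "y", "red")], "labels": {"x": "B", "y": "D"}},
]
predicted = backward_match_voting(G_vertices, G_edges, U, max_size=2, default_label="A")
print(predicted)
```

In this toy run the two-edge patterns label V1 to V4 unambiguously, the one-edge pattern then adds conflicting candidates for V2 and V3 that are outvoted, and the uncovered vertex V5 receives the default label.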

Note that the Backward-Match-Voting algorithm starts with the largest pre-labelled subgraphs (those of size max) and moves down to pre-labelled subgraphs of size 1. The reason for doing this is that larger pre-labelled subgraphs are better at predicting vertex labels than smaller ones, because they define a more specific structure. In other words, it is easier to match small pre-labelled subgraphs to subgraphs in G, often resulting in an incorrect vertex labelling, than it is to match larger pre-labelled subgraphs. Hence the “backwards” in the Backward-Match-Voting algorithm.

The above is illustrated by the example given in Figures 6.2 to 6.5. Figure 6.2 shows an unlabelled graph that features LE = {black, green, red}. Figures 6.3 to 6.5 show a set of labelled one-edge subgraphs and a labelled four-edge subgraph. The one-edge and four-edge pre-labelled subgraphs are shown in Figures 6.3 and 6.5 respectively. The set {V1, V2, . . . , V9} in Figure 6.4 indicates the identifiers for the vertices in Figure 6.2. Using the largest, four-edge, pre-labelled subgraph (Figure 6.5) the vertex labelling for the new graph will be: LV1 = {A}, LV2 = {C}, LV4 = {D}, LV6 = {D}, LV8 = {B}. Using the one-edge VULS (Figure 6.3) the vertex labelling will instead be: LV1 = {A, B, C, D}, LV2 = {A, C}, LV4 = {A, D}, LV6 = {A, D}, LV8 = {A, B}. This example indicates that the vertex labelling is more ambiguous when using (small) one-edge VULS, where every vertex receives several candidate labels, than when using (large) four-edge VULS, where every vertex receives exactly one.
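The difference between the two labellings in this example can be quantified with a short sketch. The candidate sets are copied from the text above; the helper function name is hypothetical, and counting vertices with more than one candidate label is simply one possible way to express the ambiguity, not a measure defined in the thesis.

```python
# Candidate label sets from the example (vertices V1, V2, V4, V6, V8).
four_edge = {"V1": {"A"}, "V2": {"C"}, "V4": {"D"}, "V6": {"D"}, "V8": {"B"}}
one_edge = {"V1": {"A", "B", "C", "D"}, "V2": {"A", "C"}, "V4": {"A", "D"},
            "V6": {"A", "D"}, "V8": {"A", "B"}}

def ambiguous_vertices(candidates):
    """Return the vertices that have more than one candidate label."""
    return sorted(v for v, labels in candidates.items() if len(labels) > 1)

for name, candidates in (("four-edge", four_edge), ("one-edge", one_edge)):
    total = sum(len(labels) for labels in candidates.values())
    print(name, ambiguous_vertices(candidates), total)
```

With the four-edge subgraph no vertex is ambiguous, whereas with the one-edge subgraphs all five vertices require the voting step.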

Figure 6.2: Input graph G.

Figure 6.3: One-edge pre-labelled subgraphs.
