This section explores how graph fingerprints can be used to compare graphs efficiently and accurately on the basis of topological structure. In this context, two graphs are considered similar if they share similar global and micro-level (vertex and edge) topological features. The approach, entitled Graph FingerPrint Comparison (GFP-C), was developed after existing serial methods for graph comparison were found to be unable to scale to massive graphs and to be slow even when comparing modestly sized ones. Additionally, the literature presented in Section 3.2.1 reveals gaps in current methods. In particular, an approach that meets the following criteria is missing:
1. Scalability - Highly scalable to massive graphs of millions of vertices/edges, and capable of computing the similarity in a finite time.
2. Sensitivity to Graph Size - Taking the size and order of the graphs into consideration.
3. Sensitivity to Similar Topologies - Detecting the difference between graphs which are highly structurally and topologically similar.
4. Label Free - Able to perform comparisons without requiring labelled datasets, although the approach should still function when they are available.
5. Low Number of User Defined Parameters - A minimum number of user defined parameters should be required to measure graph similarity.
3.4.1 Graph Comparison Approach Overview
The approach comprises two distinct stages: the generation of a graph’s fingerprint (GFP-X), as described in Section 3.3, and the comparison of these fingerprints (GFP-C). GFP-X reduces the high dimensionality inherent in complex graphs to two fixed-length feature vectors. It does so by extracting both micro-level (vertex) and macro-level (global) topological features from the given graph. The decision to extract both vertex-level and global-level features was driven by the desire to make the comparison more sensitive than current state-of-the-art methods [25] to small variations in the underlying graph topology and to the overall size of the graph.
During GFP-X (detailed in Section 3.3), the vertex and global generation steps each produce a feature vector for every graph. Graphs can then be compared by computing the distance between their feature vectors; in this work we use the Canberra distance metric [141]. This results in two separate similarity scores, one comparing the vertex-level topology and one comparing the global-level features. The last stage combines these two scores to produce the final similarity score between the two graphs.
To help fulfil the scalability criterion established in Section 3.4, GFP-X and GFP-C have been written to use Apache Spark [245], a distributed parallel processing framework that enables graphs to be processed across multiple machines. At the time this work was performed, alternative parallelization approaches, such as the use of GPUs, could not handle graphs of the required size or scale beyond a single machine [211]. The Apache Spark implementation, along with other details, is documented in Appendix A.
3.4.2 Comparison of Graph Fingerprints
The GFP-C approach compares the fingerprints of two graphs in order to compute the similarity between them. In this work, the Canberra distance was selected to measure the numerical distance between the fingerprints, similar to [25]. Other distance metrics tested included the Bray-Curtis, Correlation, Chebyshev, Cosine and Manhattan distances, but these were found to be insensitive when the feature vectors were highly similar, or produced unintuitive results such as a high similarity score for highly dissimilar graphs.
The Canberra distance between two vectors p and q of n dimensions is defined as [141]:
\[
CD(p, q) = \sum_{i=1}^{n} \frac{|p_i - q_i|}{|p_i| + |q_i|}. \tag{3.4}
\]
It should be noted that when p_i and q_i are both equal to zero, the corresponding term of the sum is undefined (zero divided by zero) and is taken to be zero. Additionally, the maximum value the measure can return is equal to the number of dimensions in the vectors being compared. For example, comparing two vectors of ten dimensions gives a maximum possible Canberra distance of ten. The Canberra distance is also sensitive to changes close to zero, which makes it well suited to detecting small variations between graphs that are highly topologically similar, one of the key goals of the GFP-C approach. The Canberra distance is used to compare both the vertex feature vectors and the global feature vectors. The closer the result of the Canberra distance is to zero, the more ‘similar’ the two graphs are, with a score of zero indicating that the graphs are ‘fingerprint’ identical.
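The properties described above can be illustrated with a minimal sketch of the Canberra distance in plain Python. This is not the Apache Spark implementation used in this work; the function names and example vectors are hypothetical, and the Manhattan distance is included only to contrast how the two metrics respond to a change close to zero.

```python
from typing import Sequence


def canberra_distance(p: Sequence[float], q: Sequence[float]) -> float:
    """Canberra distance as in Equation 3.4; a 0/0 term contributes zero."""
    if len(p) != len(q):
        raise ValueError("vectors must have the same number of dimensions")
    total = 0.0
    for p_i, q_i in zip(p, q):
        denominator = abs(p_i) + abs(q_i)
        if denominator == 0:
            # Both components are zero: the term is undefined, so treat it as 0.
            continue
        total += abs(p_i - q_i) / denominator
    return total


def manhattan_distance(p: Sequence[float], q: Sequence[float]) -> float:
    """Manhattan (L1) distance, shown only for comparison."""
    return sum(abs(p_i - q_i) for p_i, q_i in zip(p, q))


# Identical vectors (including a shared zero component) give a distance of zero.
identical = canberra_distance([0.0, 2.0, 5.0], [0.0, 2.0, 5.0])

# Components of opposite sign make every term equal to 1, so the distance
# reaches its upper bound: the number of dimensions (here 3).
upper_bound = canberra_distance([1.0, 2.0, 3.0], [-1.0, -2.0, -3.0])

# A small absolute change near zero: Canberra weights it relative to the
# magnitude of the components, whereas Manhattan reports only the raw gap.
near_zero_canberra = canberra_distance([0.001, 100.0], [0.002, 100.0])
near_zero_manhattan = manhattan_distance([0.001, 100.0], [0.002, 100.0])

print(identical, upper_bound, near_zero_canberra, near_zero_manhattan)
```

In the last example the first component doubles from 0.001 to 0.002. The Manhattan distance reports a difference of roughly 0.001, whereas the Canberra distance reports roughly one third, because each term is normalised by the magnitude of the components being compared.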
3.4.3 Final Similarity Score Generation
The GFP-C approach returns two similarity scores for the two graphs being compared: f_v, the distance between the vertex feature vectors, and f_g, the distance between the global feature vectors. These two scores can be used independently to compare the global and local topological structure as separate entities. However, the GFP-C approach can also produce a final similarity score between the two graphs using the following aggregation:
\[
\mathit{FinalSimScore} = f_v + \gamma f_g,
\]
where γ is a user-controllable parameter that sets the weighting of the distance between the global feature vectors in the final similarity score.
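The aggregation can be sketched as follows, assuming each fingerprint is held as a pair of equal-length numeric vectors. The `Fingerprint` container, the function names and the example values are hypothetical and chosen only for illustration; no default for γ is assumed, so it is passed explicitly.

```python
from typing import NamedTuple, Sequence


class Fingerprint(NamedTuple):
    """Hypothetical container for the two fixed-length vectors of one graph."""
    vertex_features: Sequence[float]
    global_features: Sequence[float]


def canberra_distance(p: Sequence[float], q: Sequence[float]) -> float:
    """Canberra distance; a term whose components are both zero contributes zero."""
    if len(p) != len(q):
        raise ValueError("vectors must have the same number of dimensions")
    total = 0.0
    for p_i, q_i in zip(p, q):
        denominator = abs(p_i) + abs(q_i)
        if denominator != 0:
            total += abs(p_i - q_i) / denominator
    return total


def final_similarity_score(a: Fingerprint, b: Fingerprint, gamma: float) -> float:
    """Combine the vertex-level and global-level distances as f_v + gamma * f_g."""
    f_v = canberra_distance(a.vertex_features, b.vertex_features)
    f_g = canberra_distance(a.global_features, b.global_features)
    return f_v + gamma * f_g


# Illustrative fingerprints (values are made up, not taken from real graphs).
graph_a = Fingerprint(vertex_features=[1.0, 3.0], global_features=[10.0, 0.0])
graph_b = Fingerprint(vertex_features=[3.0, 3.0], global_features=[30.0, 0.0])

# f_v = |1 - 3| / (1 + 3) = 0.5 and f_g = |10 - 30| / (10 + 30) = 0.5,
# so with gamma = 2 the final score is 0.5 + 2 * 0.5 = 1.5.
score = final_similarity_score(graph_a, graph_b, gamma=2.0)
print(score)
```

Setting γ to zero reduces the final score to the vertex-level distance alone, while larger values give the global-level distance more influence over the result.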