
This chapter has explored the Graph Fingerprint and detailed how it can be used for the tasks of graph comparison and global graph classification.

The Graph FingerPrint Comparison approach for assessing the similarity of two unlabelled graphs, based upon their macro and micro features, has been presented. The GFP-X fingerprint generation exploits Apache Spark and GraphX to extract powerful, neighbourhood-based features from a graph in parallel. When comparing two graphs, the GFP-C approach is shown to be sensitive to small variations in graph topology, graph size and function without requiring labelled datasets, whilst also scaling nearly sub-linearly with dataset size across a Spark cluster. Thus the GFP-C approach achieves all of the goals established for it in Section 3.4. The approach demonstrates promising results, and the concept of a compact but accurate representation of a graph has numerous potential additional applications within machine learning.
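The idea of comparing two graphs through compact feature vectors can be illustrated with a minimal sketch. It is not the GFP-X or GFP-C implementation: the feature set (vertex and edge counts plus simple degree statistics), the function names `degree_fingerprint` and `canberra`, and the choice of a Canberra distance are all assumptions made purely for illustration, and the code runs on a single machine rather than on Spark.

```python
import math
from collections import defaultdict


def degree_fingerprint(edges):
    """Return a small feature vector for an undirected, unlabelled graph.

    Hypothetical features (NOT the thesis's feature set): vertex count and
    edge count as global features, then the mean, maximum and standard
    deviation of vertex degree as a summary of per-vertex features.
    Vertices with no edges are not visible in an edge list and are ignored.
    """
    degree = defaultdict(int)
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    degs = list(degree.values())
    n = len(degs)
    mean = sum(degs) / n
    variance = sum((d - mean) ** 2 for d in degs) / n
    return [float(n), float(len(edges)), mean, float(max(degs)), math.sqrt(variance)]


def canberra(a, b):
    """Canberra distance between two equal-length vectors (illustrative choice)."""
    total = 0.0
    for x, y in zip(a, b):
        denom = abs(x) + abs(y)
        if denom > 0:  # skip terms where both entries are zero
            total += abs(x - y) / denom
    return total


# Two toy graphs with the same number of vertices but different topology.
ring = [(i, (i + 1) % 6) for i in range(6)]  # 6-cycle
star = [(0, i) for i in range(1, 6)]         # hub with 5 leaves

fp_ring = degree_fingerprint(ring)
fp_star = degree_fingerprint(star)

print(canberra(fp_ring, fp_ring))  # identical fingerprints give distance 0.0
print(canberra(fp_ring, fp_star))  # differing topology gives a positive distance
```

A smaller distance indicates more similar fingerprints under these assumed features; no labels are needed at any point, which mirrors the unlabelled setting described above.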

Further, this chapter has presented a novel approach for global graph classification entitled Deep Topology Classification. The presented results show that the combination of extracting deep topological and global features from a graph and classifying these via a deep neural network is an effective approach to the problem of global graph classification. The approach is shown to have over 99% classification accuracy after k-fold cross validation across both a multi-class and a binary dataset. This compares very favourably with the current state-of-the-art approach, which has an accuracy of just 88.4% for the multi-class dataset and 68% for the binary dataset.
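For readers unfamiliar with the evaluation protocol, the following is a minimal sketch of k-fold cross validation over feature vectors. It makes several assumptions: the data are synthetic two-dimensional points rather than graph fingerprints, a simple nearest-centroid classifier stands in for the deep neural network, and all function names are hypothetical.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Shuffle sample indices and split them into k near-equal folds.

    Assumes n_samples >= k so that no fold is empty.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def nearest_centroid_fit(X, y):
    """Compute one mean vector (centroid) per class label."""
    sums, counts = {}, {}
    for xi, yi in zip(X, y):
        acc = sums.setdefault(yi, [0.0] * len(xi))
        for j, value in enumerate(xi):
            acc[j] += value
        counts[yi] = counts.get(yi, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def nearest_centroid_predict(centroids, x):
    """Return the label whose centroid is closest to x (squared Euclidean)."""
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))


def cross_validate(X, y, k=5, seed=0):
    """Train on k-1 folds, test on the held-out fold, and average accuracy."""
    folds = k_fold_indices(len(X), k, seed)
    accuracies = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in range(len(X)) if i not in held]
        model = nearest_centroid_fit([X[i] for i in train], [y[i] for i in train])
        correct = sum(nearest_centroid_predict(model, X[i]) == y[i] for i in held_out)
        accuracies.append(correct / len(held_out))
    return sum(accuracies) / k


# Synthetic stand-in data: two well-separated classes of 20 points each.
rng = random.Random(42)
X = [[rng.gauss(0.0, 0.1), rng.gauss(0.0, 0.1)] for _ in range(20)] + \
    [[rng.gauss(5.0, 0.1), rng.gauss(5.0, 0.1)] for _ in range(20)]
y = [0] * 20 + [1] * 20

mean_accuracy = cross_validate(X, y, k=5)
print(mean_accuracy)
```

Every sample is used for testing exactly once, so the reported figure is an average over k held-out folds rather than a score on a single favourable split.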

3.9.1 Current Limitations

Whilst the work presented in this chapter has been successful when compared with competing approaches, there are some limitations with the work which are worth considering:

Global graph only: Currently the work in this chapter has only considered applications that can be considered global graph tasks. There are however many important tasks in the field of graph mining which operate at the level of vertices and edges. The work presented thus far would not be applicable to such tasks.

Datasets used: Due to the highlighted issues around the lack of large, labelled and publicly available graph datasets, this chapter has made use of synthetically generated graphs as a proxy in many of the experiments. However, it remains to be seen whether the high accuracy demonstrated by the approaches would be maintained if real-world data were used instead.

Hand-crafted features: The graph fingerprints comprise various topological features extracted from the graphs. Whilst they have proven to be effective across the two tasks and the datasets (both empirical and synthetic) used for evaluation, it is unknown if the same set of features would continue to work well across all domains and tasks. One clear trend in the machine learning literature is the move away from the use of hand-crafted features as input, and for models to automatically learn the best data representation for themselves [80].

Lack of interpretability: The DTC approach explored in this chapter uses a deep neural network to perform classification. However, concerns have been raised in the literature about how interpretable such models are [249]. Interpretability is covered in greater detail in later chapters, but briefly, a model is said to be interpretable if the decisions made by it can be understood clearly [77]. The use of a deep network in this work could reduce the interpretability of the approach in real-world use, for example by limiting the ability of the model to ‘explain’ why a graph was classified as belonging to a certain domain.

3.9.2 Future Work

There is considerable scope for future research based upon the work presented in this chapter. Further work could incorporate other topological features into the graph fingerprint beyond those studied thus far, perhaps focusing on those which can exploit any auxiliary information available with the graphs. Additionally, steps could be taken to allow the DTC approach to be used on empirical datasets, which could be achieved via data augmentation techniques that allow models to be trained upon limited amounts of input data.

Epilogue

This chapter has explored how best to represent a graph using only a set of topological features extracted from it. The features were shown to be useful for the tasks of graph comparison and global classification, thereby achieving research objective 1. Additionally, the research presented in this chapter has, since its initial publication, been expanded by a number of works from other researchers which cite this work. For example, recent work has attempted to apply the concepts explored here to real-world datasets to show that empirical graphs can indeed be classified via their structural properties [197]. Other work has explored the use of a variation of the graph fingerprint vector as a way to increase the realism of synthetic graph generation methods by minimising the distance between generated and real graphs [169].

In the following chapter, the focus will shift from problems at the level of entire graphs to those at the level of their constituent parts: vertices and edges. Additionally, study will turn to the emerging range of graph embedding techniques [84, 92, 124, 167], which learn graph representations automatically. Knowledge gained in this chapter about the ability of certain topological features to represent a graph will be used to attempt to bring some level of interpretability to these new approaches.

Chapter 4

Exploring the Semantic Content of Unsupervised Graph Embeddings

Prologue

The work in Chapter 3 explored how a graph can be accurately represented by topological features extracted from it. The work in this chapter changes scale to focus upon learning representations at the level of vertices. In addition, focus will shift to recent methods which, unlike the hand-crafted and mathematically understood topological features explored thus far, attempt to automatically learn the best representations for a given problem. Such approaches are unsupervised machine learning models, commonly referred to as graph embeddings, which have recently emerged and demonstrated superior performance to traditional topological feature-based approaches for a range of vertex-centric tasks. These approaches attempt to learn a mapping from the vertices to a vector space, such that certain key relationships present between vertices are maintained in the resulting vector space.

In order to investigate research objective 2 (see Section 1.3), this chapter will explore the possibility of bringing some level of interpretability to the new family of unsupervised graph embedding models by investigating whether any known topological features are represented in the vector space. The experimental evidence presented in this chapter demonstrates that several of the known topological features, many of which were explored in Chapter 3, can be detected in the embedding space. This suggests that the topological structures captured by graph embedding techniques approximate many of the same structural connectivity patterns used by human experts when representing graphs.

The work presented in this chapter has been published as the following works:

Stephen Bonner, John Brennan, Ibad Kureshi, Georgios Theodoropoulos, Andrew Stephen McGough, and Boguslaw Obara. Evaluating the quality of graph embeddings via topological feature reconstruction. In IEEE International Conference on Big Data, pages 2691–2700. IEEE, 2017

Stephen Bonner, Ibad Kureshi, John Brennan, Georgios Theodoropoulos, Andrew Stephen McGough, and Boguslaw Obara. Exploring the semantic content of unsupervised graph embeddings: An empirical study. Data Science and Engineering, 4(3):269–289, 2019