Graph Datasets - Advances in Learning and Understanding with Graphs through Machine Learning

It has been strongly argued that many of the recent successes in the field of machine learning, especially those of approaches exploiting deeper models, have been driven by the availability of large, high-quality and, importantly, labelled datasets [80]. For example, the ImageNet dataset has been key to dramatic advances in the ability of Computer Vision (CV) models by providing them with over 14 million human-annotated images from which to learn [65]. In the field of Natural Language Processing (NLP), recent advances have also been driven by the availability of massive quantities of text data, with Wikipedia alone providing over 3.7 billion English language words [241].

However, the field of graph analysis does not, to date, have the same quantity of high-quality public datasets available for use by researchers. This has arguably made it more challenging to achieve the same level of progress in graph processing models as in other domains. The three main sources of public graph datasets in the field, and thus of those used throughout this thesis, are the Stanford Network Analysis Project (SNAP) [146], the Network Repository [196] and the Koblenz Network Collection (KONECT) [138].

Whilst these data sources are useful, they do not contain the quantity and variety of data seen in datasets from other fields. For example, SNAP contains fewer than 200 unique graphs across 18 domains. Because of this lack of empirical data, the work presented in this thesis uses both synthetically generated graphs and graphs whose topological structure has been altered in some way.

2.4.1 Graph Generation Methods

There has long been an interest in developing methods for generating synthetic and random graphs which conform to given structural constraints, thus replicating empirical data [113,148]. Using such approaches means that an unlimited number of graphs can be generated, of varying sizes and structural complexities, thus helping to reduce the aforementioned data access issues. It has also been proposed that graphs generated from a known mathematical process could be used as a benchmark for a machine learning algorithm, as the algorithm could be tasked with uncovering the underlying generative process [8]. Some of the major synthetic graph generation methods utilised throughout this thesis are detailed in this section.

Random Graphs

In the generation of random graphs, as proposed by Erdős and Rényi [20], the probability of the existence of each edge is equal. Thus the edges of graphs generated using the Erdős-Rényi method are placed uniformly at random among the vertex pairs, and the resulting vertex degrees cluster around the mean (following a binomial distribution) rather than forming a heavy tail. Such graphs would prove challenging for any machine learning model trained upon them, as there is no real structure to be learned.
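The equal-probability edge rule can be illustrated with a minimal sketch. It assumes the G(n, p) formulation, in which every unordered vertex pair is included independently with probability p. The function names `erdos_renyi_graph` and `degree_sequence`, and the parameter values in the usage lines, are illustrative choices rather than anything taken from this thesis.

```python
import itertools
import random


def erdos_renyi_graph(n, p, seed=None):
    """Return the edge set of a G(n, p) random graph on vertices 0..n-1."""
    rng = random.Random(seed)
    edges = set()
    # Visit every unordered vertex pair once and keep it with probability p.
    for u, v in itertools.combinations(range(n), 2):
        if rng.random() < p:
            edges.add((u, v))
    return edges


def degree_sequence(n, edges):
    """Count how many edges touch each vertex."""
    degrees = [0] * n
    for u, v in edges:
        degrees[u] += 1
        degrees[v] += 1
    return degrees


# Illustrative parameters (assumptions, not values used in the thesis).
edges = erdos_renyi_graph(n=200, p=0.05, seed=0)
degrees = degree_sequence(200, edges)
# The mean degree is expected to lie close to p * (n - 1).
print(sum(degrees) / len(degrees))
```

Because each pair is treated identically, no vertex is favoured, which is why the resulting graphs carry little structure for a model to learn.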

Scale-Free Graphs

One of the most widely used models for studying the formation of networks is the Barabási-Albert (BA) model [20]. It has been noted that real-world vertex degree distributions exhibit a fat-tailed, or power-law, shape, meaning that a majority of vertices have a low degree value, whilst only a few vertices have a high value [20]. These graphs were termed ‘scale-free’ graphs, due to their lack of natural scale [67]. Since this discovery, scale-free graphs have been reported in many other graph studies [98]. Although the prevalence of graphs which exhibit strict power-law distributions has been called into question [130], generating graphs with this property can be a useful first approximation.

The BA model was designed to produce graphs which have a degree distribution which is approximately power-law, thus more closely replicating real-world data. The BA model functions as follows: upon each new vertex being added to the graph, it has a probability $p$ of forming an edge to an existing vertex $v$:

$$p(v) = \frac{k_v}{\sum_{i \in V} k_i}, \tag{2.12}$$

where $k_v$ denotes the degree of vertex $v$ and $V$ is the set of existing vertices.

As the chance of new edges being formed is directly proportional to a vertex’s degree value, hubs or densely connected vertices will appear.
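The degree-proportional attachment rule can be sketched as follows. This is a minimal illustration, not the implementation used in this thesis: the function name `barabasi_albert_graph`, the parameter `m` (edges added per new vertex) and the star-shaped seed graph are all assumptions. Drawing `m` distinct targets by rejection also departs slightly from exact proportional sampling.

```python
import random


def barabasi_albert_graph(n, m, seed=None):
    """Grow a graph to n vertices, attaching each new vertex to m existing ones."""
    if not 1 <= m < n:
        raise ValueError("m must satisfy 1 <= m < n")
    rng = random.Random(seed)
    # Seed graph (an assumption): a star joining vertex m to vertices 0..m-1,
    # so that every initial vertex starts with a non-zero degree.
    edges = [(v, m) for v in range(m)]
    # Each vertex appears in this list once per incident edge, so a uniform
    # draw from it selects vertex v with probability k_v / sum_i k_i.
    endpoints = [x for edge in edges for x in edge]
    for new in range(m + 1, n):
        targets = set()
        # Redraw until m distinct existing vertices have been chosen.
        while len(targets) < m:
            targets.add(rng.choice(endpoints))
        for t in sorted(targets):
            edges.append((new, t))
            endpoints.extend((new, t))
    return edges


# Illustrative parameters (assumptions, not values used in the thesis).
ba_edges = barabasi_albert_graph(n=1000, m=3, seed=0)
ba_degree = {}
for u, v in ba_edges:
    ba_degree[u] = ba_degree.get(u, 0) + 1
    ba_degree[v] = ba_degree.get(v, 0) + 1
print(max(ba_degree.values()), min(ba_degree.values()))
```

Keeping a flat list of edge endpoints avoids recomputing the degree sum at every step, since sampling uniformly from that list is equivalent to sampling a vertex in proportion to its degree.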

Forest Fire Graphs

Whilst the BA model produces graphs which display the characteristic power-law degree distribution, it fails to capture other structural characteristics observed in graph data [145]. To address this issue, the Forest Fire model for synthetic graph generation has been proposed [145]. The model is designed to capture the shrinking diameter and increasing densification characteristics highlighted in that study as being missing from other methods. The model functions in such a way that a new vertex v entering the graph attaches to an existing vertex w chosen uniformly at random. Vertex v then begins to burn through a selection of the in and out edges of w, creating links to the vertices it touches with a certain probability. The graphs created by the Forest Fire model conform to both shrinking diameter and increasing densification, as well as featuring a power-law degree distribution.
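The burning process can be sketched in simplified form. In this illustration each out-edge and in-edge of a burning vertex spreads the fire independently, with probabilities `p_forward` and `p_backward` respectively. Those parameter names, their default values and the independent-spread rule are assumptions made for brevity, and the published model [145] may select the burned links differently.

```python
import random


def forest_fire_graph(n, p_forward=0.35, p_backward=0.2, seed=None):
    """Simplified Forest Fire growth; returns a set of directed edges (v, u)."""
    rng = random.Random(seed)
    out_nbrs = {0: set()}
    in_nbrs = {0: set()}
    edges = set()
    for v in range(1, n):
        # The new vertex v first picks an existing vertex w uniformly at random.
        w = rng.randrange(v)
        burned = {w}
        frontier = [w]
        while frontier:
            x = frontier.pop()
            # The fire may spread along the out-edges and in-edges of x,
            # each with its own probability.
            candidates = [(y, p_forward) for y in sorted(out_nbrs[x])]
            candidates += [(y, p_backward) for y in sorted(in_nbrs[x])]
            for y, prob in candidates:
                if y not in burned and rng.random() < prob:
                    burned.add(y)
                    frontier.append(y)
        # The new vertex links to every vertex the fire reached.
        out_nbrs[v] = set(burned)
        in_nbrs[v] = set()
        for u in burned:
            in_nbrs[u].add(v)
            edges.add((v, u))
    return edges


# Illustrative parameters (assumptions, not values used in the thesis).
ff_edges = forest_fire_graph(n=500, seed=0)
print(len(ff_edges) / 500)  # average number of out-links per vertex
```

Raising the two probabilities lets each fire reach more vertices, so later arrivals create more links on average; this is the mechanism behind the densification behaviour described above.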

2.4.2 Graph Topology Random Rewire Process

Throughout the work presented in this thesis we will make use of the random rewire process to alter a given graph’s topological structure. The random rewire process perturbs a given source graph’s degree distribution by randomly altering the source and target of a set number of edges according to the Erdős-Rényi random model. This results in edges which are uniformly distributed among the vertices, instead of the more frequently observed power-law like distribution [20,56]. The number of edges which are altered can be controlled to cause either major or minor changes to the graph’s topology. During this rewire process, it is not guaranteed that the source or target of a selected edge will be altered; indeed, this is not always possible given the graph’s topology. It should also be noted that the rewiring process does not change the total number of edges or vertices within the graph.
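A minimal sketch of such a rewire step, for an undirected graph stored as a set of vertex pairs, is given below. The function name `random_rewire`, the `max_tries` cut-off and the choice to forbid self-loops and duplicate edges are assumptions made for illustration; the exact procedure used in this thesis may differ.

```python
import random


def random_rewire(n, edges, num_rewires, seed=None, max_tries=100):
    """Return a copy of an undirected edge set with up to num_rewires edges moved."""
    rng = random.Random(seed)
    current = {tuple(sorted(e)) for e in edges}
    if not current or n < 2:
        return current
    for _step in range(num_rewires):
        # Choose an existing edge uniformly at random.
        old = rng.choice(sorted(current))
        for _attempt in range(max_tries):
            # Draw two distinct vertices uniformly, so no self-loop can arise.
            u, v = rng.sample(range(n), 2)
            new = (min(u, v), max(u, v))
            if new not in current:
                # Swap the old edge for the new one, keeping the edge count fixed.
                current.remove(old)
                current.add(new)
                break
        # If no free vertex pair was found, the edge stays where it was.
    return current


# Illustrative input: a path graph on five vertices (an assumption).
path = {(0, 1), (1, 2), (2, 3), (3, 4)}
rewired_path = random_rewire(5, path, num_rewires=2, seed=0)
print(len(rewired_path) == len(path))
```

Raising `num_rewires` moves the graph further from its original topology, while the vertex and edge counts stay fixed. An attempt that finds no free vertex pair leaves the selected edge unchanged, which is one way a rewire can fail to alter the graph.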
