This section outlines the two approaches explored in this chapter, including the relevant background technologies, the proposed model architectures and the training procedure. The two approaches are entitled Temporal Offset Reconstruction and Temporal Neighbourhood Aggregation. This section makes use of the notation detailed in Table 5.1, which lists the symbols used and an associated description.

Symbol                      Definition
$G$                         A graph with an associated set of vertices $V$ and a corresponding set of edges $E$.
$A$                         The adjacency matrix of graph $G$, a symmetric matrix of size $|V| \times |V|$, where $A_{i,j}$ is 1 if an edge is present and 0 otherwise.
$\hat{A}$                   $A$ normalised by its degree matrix $D$ and its identity matrix $I$ such that $\hat{A} = D^{-\frac{1}{2}} (A + I) D^{-\frac{1}{2}}$ [128].
$X$                         A matrix of features for each $v \in V$, set to the identity matrix $I$ (of the same size as $A$) for this work.
$H$                         The intermediate vertex representations in GCN and TNA layers.
$Z$                         The final variationally sampled representation matrix for each $v \in V$.
$G'$                        A temporal graph comprised of snapshots $\{G_1, G_2, \ldots, G_T\}$.
$T$                         The number of snapshots in $G'$.
$G_t$                       A graph from $G'$.
$\sigma_s$                  The sigmoid activation function.
$\sigma_r$                  The rectified linear activation function (ReLU).
$\sigma_{lr}$               The leaky ReLU activation function.
$l$                         A certain layer in the model.
$W_g^{(l)}$                 A weight matrix at layer $l$ used in the GCN.
$W_s^{(l)}$                 A weight matrix at layer $l$ used in the skip connection.
$W_{\{r,u,h\}}^{(l)}$       Hidden transform matrices in the GRU.
$U_{\{r,u,h\}}^{(l)}$       Input transform matrices in the GRU.
$\mathcal{N}(\mu, \sigma)$  A multi-dimensional Gaussian distribution parametrised by vectors $\mu$ and $\sigma$.
$\Theta$                    A trainable model containing a set of parameters.

Table 5.1: Definitions and Notations for Temporal Graph Learning

5.3.1 Motivation

Many of the phenomena that are commonly represented via graph structures are known to evolve over time: links between entities form and break in a constantly evolving stream of changes. We thus view graphs as a series of snapshots, with each graph snapshot containing the connections present at that particular moment in time. More formally, we can redefine a graph $G$ to be a temporal graph $G' = \{G_1, G_2, \ldots, G_T\}$, made up of one graph snapshot $G_t$ for every $t \in [1, T]$.

In many real-world use cases of machine learning, a model is trained on historical data and then used to make predictions about new events at a future point in time. This practice is common in the recommender systems industry, for example, where recent state-of-the-art systems for recommending items to users are based on graph convolutions [24, 240]. However, to date, the majority of models for creating graph representations do not consider how the graph evolves over time. This could result in models which have good initial predictive capability, but whose performance degrades as the graph continues to change.

Additionally, a common and vital task within the field of graph mining is that of future link prediction, where the goal is to accurately predict which vertices within a graph will form a connection in the future [83]. Figure 5.1 highlights this future link prediction task, where the goal is to predict the new edges, coloured in red, formed in $G_T$, given the previous graphs in the temporal history, $G_1$ and $G_2$. Any model designed to accomplish this task must learn the evolution patterns present in edge formation, even though the number of edges changing at each time point is often a small fraction of the total number.
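To make the prediction target concrete, the following minimal sketch extracts the edges that are new in the final snapshot relative to the history. It assumes, purely for illustration, that each snapshot is stored as a set of undirected edges written as sorted vertex pairs; the function name `new_edges` and the toy snapshots are hypothetical and not taken from this work.

```python
def new_edges(history, target):
    """Return the edges of `target` that appear in no snapshot of `history`.

    Each snapshot is assumed to be a set of (u, v) tuples with u < v.
    """
    seen = set().union(*history)  # every edge observed so far
    return target - seen


# Toy temporal history G_1, G_2 and the final snapshot G_T (illustrative data).
g_1 = {(0, 1), (1, 2)}
g_2 = {(0, 1), (1, 2), (2, 3)}
g_T = {(0, 1), (1, 2), (2, 3), (0, 3)}

# The edges a future link prediction model should score highly.
print(new_edges([g_1, g_2], g_T))  # {(0, 3)}
```

In this toy example only one of the four edges in the final snapshot is new, which mirrors the observation that changed edges are usually a small fraction of the total.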

We propose to tackle this challenging problem of creating temporally robust graph embeddings by training a model to explicitly recreate a future time step of the graph. More concretely, a graph $G_i$ is used as input to a model $\theta(G_i)$, which learns a representation for each vertex in $G_i$ such that its output can accurately predict the graph $G_{i+\delta}$. Ideally, we want to create a model $\theta(G_i)$ which can perform this temporal offset reconstruction using the graphs $G_i$ and $G_{i+\delta}$ alone, $G_{i+\delta} = \theta(G_i)$, requiring no pre-processing steps which could affect the model's performance (e.g. random walk procedures), no pre-computed vertex features and no labels.
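As a rough illustration of how the training examples for temporal offset reconstruction could be assembled, the sketch below pairs every snapshot with the snapshot that lies a fixed offset ahead of it. The helper `offset_pairs` and the string placeholders standing in for graphs are hypothetical; the actual data pipeline is not specified in this section.

```python
def offset_pairs(snapshots, delta=1):
    """Pair each snapshot G_i with its reconstruction target G_{i+delta}."""
    if delta < 1:
        raise ValueError("delta must be a positive integer")
    # Inputs drop the last `delta` snapshots; targets drop the first `delta`.
    return list(zip(snapshots[:-delta], snapshots[delta:]))


# Placeholders for four graph snapshots (any graph object would do here).
snapshots = ["G1", "G2", "G3", "G4"]

print(offset_pairs(snapshots, delta=1))  # [('G1', 'G2'), ('G2', 'G3'), ('G3', 'G4')]
print(offset_pairs(snapshots, delta=2))  # [('G1', 'G3'), ('G2', 'G4')]
```

No labels or vertex features are involved: the target of each pair is simply a later snapshot of the same graph.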

The remainder of this section will detail the graph convolutions used to create the vertex representations, the models we explore to perform the temporal offset reconstruction and the training procedure.

5.3.2 Background Technologies

We first review the background technologies we are employing to make the presented approaches possible, namely Graph Convolutions [128] and Recurrent Neural Networks [53, 101].

Graph Convolutions

To perform the graph encoding required to create the initial vertex representations, we utilise spectral Graph Convolution Networks (GCN) [128]. One can consider a GCN to be a differentiable function for aggregating information from the immediate neighbourhood of vertices [49, 93]. A GCN takes the normalised adjacency matrix $\hat{A}$ representing a graph $G$, and a matrix of initial vertex-level features $X$, and computes a new matrix of vertex-level features $H = \mathrm{GCN}(\hat{A}, X)$. $X$ can be initialised with pre-computed vertex features, but it is sufficient to initialise it with one-hot feature vectors (in which case $X$ is the identity matrix $I$). A GCN can contain many layers which aggregate the data, where the operation performed at each layer by the GCN [128] is:

$$\mathrm{GCN}^{(l)}(H^{(l-1)}, \hat{A}) = \sigma_r\left(\hat{A} H^{(l-1)} W_g^{(l)}\right), \tag{5.1}$$

where $l$ is the number of the current layer, $W_g^{(l)}$ denotes the weight matrix of that layer, and $H^{(l-1)}$ refers to the features computed at the previous layer, or is equal to $X$ at $l = 0$.
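The layer operation in Equation 5.1 can be sketched in a few lines of NumPy. This is an illustrative sketch only, not the implementation used in this work: the function names, the toy three-vertex path graph and the random weights are assumptions, and the degree matrix is assumed to be computed from $A + I$ (the usual GCN convention), which also avoids division by zero for isolated vertices.

```python
import numpy as np


def normalise_adjacency(A):
    """Compute A_hat = D^{-1/2} (A + I) D^{-1/2}.

    Assumption: D holds the degrees of A + I, i.e. self-loops are counted.
    """
    A_self = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_self.sum(axis=1))
    # Scale row i by d_i^{-1/2} and column j by d_j^{-1/2}.
    return A_self * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


def gcn_layer(A_hat, H_prev, W):
    """One GCN layer: ReLU(A_hat @ H_prev @ W)."""
    return np.maximum(A_hat @ H_prev @ W, 0.0)


# Toy path graph 0 - 1 - 2 with one-hot input features (X = I).
A = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
A_hat = normalise_adjacency(A)
X = np.eye(3)

rng = np.random.default_rng(0)
W_1 = rng.normal(size=(3, 4))   # hypothetical weights: 3 inputs -> 4 hidden units
H_1 = gcn_layer(A_hat, X, W_1)
print(H_1.shape)  # (3, 4)
```

Because $X$ is the identity here, the first layer reduces to $\sigma_r(\hat{A} W_g^{(l)})$, so each vertex representation is a normalised sum of the weight rows belonging to itself and its neighbours.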

One can consider the GCN function to be aggregating a weighted average of the neighbourhood features for each vertex in the graph. Stacking multiple GCN layers has the effect of increasing the number of hops from which a vertex-level representation can aggregate information: a three-layer GCN will aggregate information from up to three hops within the graph to create each representation.
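The growth of the receptive field with depth can be checked on a small example. The sketch below considers only the sparsity pattern of $(A + I)^k$ and ignores the normalisation, weights and non-linearity, so it gives an upper bound on which vertices can influence a representation after $k$ layers. The five-vertex path graph and the helper `reachable_from` are illustrative assumptions.

```python
import numpy as np

# Path graph 0 - 1 - 2 - 3 - 4 with self-loops added, as in A + I.
n = 5
A = np.zeros((n, n))
for i in range(n - 1):
    A[i, i + 1] = A[i + 1, i] = 1.0
A_self = A + np.eye(n)


def reachable_from(vertex, num_layers):
    """Vertices whose input features can reach `vertex` after `num_layers` layers."""
    # Entry (i, j) of (A + I)^k is non-zero iff j is within k hops of i.
    pattern = np.linalg.matrix_power(A_self, num_layers)
    return set(np.flatnonzero(pattern[vertex]).tolist())


print(reachable_from(0, 1))  # {0, 1}
print(reachable_from(0, 3))  # {0, 1, 2, 3}
```

With a single layer, vertex 0 only sees itself and its direct neighbour; with three layers it sees every vertex up to three hops away, but not vertex 4.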

The original methods presented in the literature required GCN-based models to be trained via supervised learning, where the final vertex representation is tuned via provided labels for a specific task, with classification being a common example [93, 128]. This is a key difference between GCNs and other graph embedding approaches, as the latter commonly require no labels and are thus applicable to a broader selection of graphs. Recently, extensions to the GCN framework have been made which allow for convolutional auto-encoders for graph datasets [127]. Auto-encoders are a type of unsupervised neural network model which attempt to compress input data to a low-dimensional space, and then reconstruct the original data directly from the learned representation.

Recurrent Neural Networks (RNN)

RNNs are neural networks with circular dependencies between neurons. Activations of a recurrent layer are dependent on their own previous activations from a previous forward pass, and therefore form a type of internal state that can store information across time steps. They are frequently used in sequence processing tasks where the response at one time step should depend in some way on previous observations. Long Short-Term Memory (LSTM) [101] and Gated Recurrent Units (GRU) [53] are RNNs with learned gating mechanisms, which mitigate the vanishing gradient problem when back-propagating errors over a sequence of inputs, allowing the model to learn longer-term dependencies. For this work, we employ the GRU cell, as it empirically offers similar performance to an LSTM, but with fewer overall parameters. The GRU computes the output $h_t$ for the input vector $x_t$ at time $t$ in the following manner [53]:

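$$
\begin{aligned}
u_t &= \sigma_s\left(x_t U_u^{(l)} + h_{t-1} W_u^{(l)}\right) \\
r_t &= \sigma_s\left(x_t U_r^{(l)} + h_{t-1} W_r^{(l)}\right) \\
\tilde{h}_t &= \tanh\left(x_t U_h^{(l)} + (r_t \ast h_{t-1}) W_h^{(l)}\right) \\
h_t &= (1 - u_t) \ast h_{t-1} + u_t \ast \tilde{h}_t,
\end{aligned}
\tag{5.2}
$$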

where $\ast$ is the Hadamard product, $r_t$ and $u_t$ are the reset and update gate values at time $t$, $U^{(l)}$ and $W^{(l)}$ are trainable parameter matrices at layer $l$, and $\sigma_s$ and $\tanh$ are the sigmoid and hyperbolic tangent activation functions.
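A single GRU step following Equation 5.2 can be sketched in NumPy as below. This is a minimal sketch under stated assumptions rather than the implementation used in this work: the dictionary layout of the parameters, the dimensions and the random initialisation are hypothetical, and bias terms are omitted because the equation does not include them.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def gru_step(x_t, h_prev, U, W):
    """One GRU update; U holds input transforms, W hidden transforms."""
    u_t = sigmoid(x_t @ U["u"] + h_prev @ W["u"])              # update gate
    r_t = sigmoid(x_t @ U["r"] + h_prev @ W["r"])              # reset gate
    h_tilde = np.tanh(x_t @ U["h"] + (r_t * h_prev) @ W["h"])  # candidate state
    return (1.0 - u_t) * h_prev + u_t * h_tilde                # gated blend


rng = np.random.default_rng(0)
d_in, d_hid = 4, 3  # illustrative sizes
U = {k: rng.normal(scale=0.1, size=(d_in, d_hid)) for k in "urh"}
W = {k: rng.normal(scale=0.1, size=(d_hid, d_hid)) for k in "urh"}

# Run the cell over a sequence of five random input vectors.
h = np.zeros(d_hid)
for x_t in rng.normal(size=(5, d_in)):
    h = gru_step(x_t, h, U, W)
print(h.shape)  # (3,)
```

Each new state is an element-wise convex combination of the previous state and the candidate state, so, starting from a zero state, every entry of $h_t$ stays within $(-1, 1)$.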