Chapter 2: Graph Neural Networks for Graph Representation Learning
2.1 Graph Neural Networks for Graph Classification
2.1.3 Deep Graph Convolutional Neural Network (DGCNN)
To address the problems of summing-based aggregation, we propose Deep Graph Convolutional Neural Network (DGCNN). DGCNN uses a simplified message passing form and a novel sorting-based aggregation named SortPooling, which sorts vertex states according to vertices’ structural roles such that individual node information and the global topology are preserved. Then, it applies 1-D convolutions to the node sequences to learn from the global graph topology.
Message passing layers. We first introduce the message passing (graph convolution) layers of DGCNN. For node v, the message passing takes the following form:
$$m_v^{t+1} = \frac{1}{|\Gamma(v)|+1}\Big( z_v^t + \sum_{u\in\Gamma(v)} z_u^t \Big), \tag{2.1}$$
$$z_v^{t+1} = f(W^t m_v^{t+1}), \tag{2.2}$$
where $f$ is an element-wise nonlinear transformation such as tanh, and $W^t$ is a learnable parameter matrix. The above formulation first calculates the message $m_v^{t+1}$ by averaging the vertex states of $v$ and $v$’s neighbors. Then, a one-layer feedforward neural network is applied to $m_v^{t+1}$ to output $v$’s state at the next time step. It is a particular realization of (1.1) and (1.2), working pretty well in practice.
Figure 2.1 (panels: original image, shuffled image): A consistent input ordering is crucial for CNNs’ successes on graph classification. If we randomly shuffle the pixels of the left image, then state-of-the-art convolutional neural networks (CNNs) will fail to recognize it as an eagle.
If we vertically (row-wise) stack the node states $z_v^t$ into a matrix $Z^t$, where the node order is the same as in the adjacency matrix $A$ of the graph, then we can have a matrix formulation of the above message passing:
$$Z^{t+1} = f\big(\tilde{D}^{-1}\tilde{A}\, Z^t (W^t)^\top\big),$$
where $\tilde{A} = A + I$ and $\tilde{D}$ is a diagonal degree matrix with $\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$ (the transpose on $W^t$ appears because each state $z_v^t$ occupies a row of $Z^t$). It reduces to the vector forms (2.1) and (2.2) if we split the above calculations into rows.
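The equivalence between the per-node form and the matrix form can be checked numerically. The following is a minimal NumPy sketch, assuming an unweighted, undirected graph without self-loops and $f = \tanh$; the function names, the toy adjacency matrix, and the weight shapes are illustrative assumptions, not part of the original formulation.

```python
import numpy as np

def message_passing_per_node(A, Z, W):
    """Apply (2.1) and (2.2) one node at a time; Z stores one state per row."""
    n = A.shape[0]
    Z_next = np.zeros((n, W.shape[0]))
    for v in range(n):
        neighbors = np.flatnonzero(A[v])                       # Gamma(v)
        m = (Z[v] + Z[neighbors].sum(axis=0)) / (len(neighbors) + 1)  # (2.1)
        Z_next[v] = np.tanh(W @ m)                             # (2.2) with f = tanh
    return Z_next

def message_passing_matrix(A, Z, W):
    """Same computation for all nodes at once, using A~ = A + I."""
    A_tilde = A + np.eye(A.shape[0])
    deg = A_tilde.sum(axis=1, keepdims=True)   # diagonal of D~, as a column
    # Dividing each row by its degree is multiplication by D~^{-1}.
    # W is transposed because node states are stacked as rows of Z.
    return np.tanh(((A_tilde @ Z) / deg) @ W.T)

# Toy graph with 4 nodes (hypothetical example data).
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
rng = np.random.default_rng(0)
Z0 = rng.normal(size=(4, 3))   # 3 input channels per node
W0 = rng.normal(size=(5, 3))   # maps 3 channels to 5 channels

Z_loop = message_passing_per_node(A, Z0, W0)
Z_matrix = message_passing_matrix(A, Z0, W0)
print(np.allclose(Z_loop, Z_matrix))
```

The matrix version avoids the Python loop and is the form one would typically use in practice; the loop version is kept only to mirror (2.1) and (2.2) directly.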
After multiple message passing layers, we concatenate the outputs $Z^t$, $t = 1, \ldots, T$, horizontally, written as $Z^{1:T} := [Z^1, \ldots, Z^T]$. In the concatenated output $Z^{1:T} \in \mathbb{R}^{n \times c}$, where $n$ is the number of nodes and $c$ is the total number of feature channels, each row can be regarded as a “feature descriptor” of a vertex, encoding its multi-hop local substructure information.
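As an illustration of how the concatenated descriptor matrix is assembled, the sketch below stacks several message passing layers and concatenates their outputs column-wise. It assumes $f = \tanh$, that the initial states are given by a node feature matrix (called `X` here, a hypothetical name), and arbitrary layer widths chosen only for the example.

```python
import numpy as np

def dgcnn_node_descriptors(A, X, weights):
    """Run T message passing layers and return [Z^1, ..., Z^T] concatenated
    horizontally, one row per node."""
    A_tilde = A + np.eye(A.shape[0])
    P = A_tilde / A_tilde.sum(axis=1, keepdims=True)   # D~^{-1} A~
    Z, outputs = X, []
    for W in weights:                  # one weight matrix per layer
        Z = np.tanh(P @ Z @ W.T)       # matrix form of (2.1) and (2.2)
        outputs.append(Z)
    return np.concatenate(outputs, axis=1)

# Path graph on 5 nodes (hypothetical example data).
A = np.zeros((5, 5))
for i in range(4):
    A[i, i + 1] = A[i + 1, i] = 1.0

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))            # 4 input features per node
channel_sizes = [4, 8, 8, 1]           # input width, then T = 3 layer widths
weights = [rng.normal(size=(c_out, c_in))
           for c_in, c_out in zip(channel_sizes[:-1], channel_sizes[1:])]

Z_1T = dgcnn_node_descriptors(A, X, weights)
print(Z_1T.shape)   # n = 5 rows, c = 8 + 8 + 1 = 17 columns
```

Each row of `Z_1T` mixes information from 1-hop, 2-hop, and 3-hop neighborhoods, since the output of layer $t$ depends on nodes up to $t$ hops away.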
The SortPooling layer. Next, we introduce the SortPooling layer, which is used to replace the plain summing layer in previous work. We notice that images and many other types of data are naturally presented with some order. For example, image pixels are arranged in a spatial order, and document words are presented in a sequential order. Figure 2.1 gives an example. Graphs, on the other hand, usually lack a tensor representation with fixed ordering. Thus, can we sort graph nodes ourselves to attach an order to graphs?
The main function of the SortPooling layer is to sort the feature descriptors, each of which represents a vertex, in a consistent order before feeding them into 1-D convolutional layers. The question is: in what order should we sort the vertices? In image classification, pixels are naturally arranged with some spatial order. In text classification, we can use dictionary order to sort words. In graphs, we can sort vertices according to their structural roles within the graph. The structural roles of nodes can be given by the Weisfeiler-Lehman (WL) algorithm [165], which iteratively encodes nodes’ neighborhoods into integer colors, so that the same neighborhoods are encoded into the same color and different neighborhoods are encoded into different colors. After convergence, the WL colors can mark the relative structural positions of the nodes within the graph.
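The color refinement idea can be sketched in a few lines. This is a minimal illustration of one-dimensional WL refinement, assuming unlabeled nodes that all start from the same color; the function name `wl_colors` and the adjacency-dictionary input format are illustrative choices rather than anything prescribed by [165].

```python
def wl_colors(adj, max_iters=10):
    """adj maps each node to a list of its neighbors.
    Returns a dict mapping each node to an integer color."""
    colors = {v: 0 for v in adj}       # assumption: uniform initial colors
    for _ in range(max_iters):
        # A node's signature is its own color plus the sorted multiset
        # of its neighbors' colors.
        signatures = {
            v: (colors[v], tuple(sorted(colors[u] for u in adj[v])))
            for v in adj
        }
        # Compress distinct signatures into small integers; sorting the
        # signatures makes the color assignment deterministic.
        palette = {sig: i for i, sig in enumerate(sorted(set(signatures.values())))}
        new_colors = {v: palette[signatures[v]] for v in adj}
        # Refinement can only split color classes, so an unchanged class
        # count means the partition has converged.
        if len(set(new_colors.values())) == len(set(colors.values())):
            return new_colors
        colors = new_colors
    return colors

# Path graph 0 - 1 - 2 - 3 - 4 (hypothetical example data).
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
colors = wl_colors(path)
print(colors)
```

On the path graph, the two endpoints receive one color, their two neighbors a second color, and the center node a third, which reflects their symmetric structural positions.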
We notice that our message passing scheme shares the same idea as WL: it also iteratively encodes neighborhoods into vertex states, except that it uses continuous hidden states instead of integer colors and a learnable encoding function. We can thus regard the hidden states $Z^t$, $t = 1, \ldots, T$, as the continuous WL colors, and use these continuous WL colors to sort the vertices.
Given the $n \times c$ input $Z^{1:T}$, where each row is a vertex’s feature descriptor and each column is a feature channel, the output of SortPooling is a $k \times c$ tensor, where $k$ is a user-defined integer. In the SortPooling layer, the input $Z^{1:T}$ is first sorted row-wise according to $Z^T$. We can regard these final hidden states as the vertices’ most refined continuous WL colors, and sort all the vertices using these final colors. This way, a consistent ordering is imposed for graph vertices, making it possible to train traditional neural networks on the sorted graph representations. Ideally, we need the graph convolution layers to be deep enough (meaning $T$ is large), so that $Z^T$ is able to partition vertices into different colors/groups as finely as possible.
Figure 2.2 (stages shown: input graph, graph convolution layers, concatenate, SortPooling, 1-D convolution, dense layers): The overall structure of DGCNN. An input graph is first passed through multiple message passing layers where node information is propagated between neighbors. Then the vertex features are sorted and pooled with a SortPooling layer, and passed to 1-D convolutional layers to learn a predictive model.
The vertex order based on $Z^T$ is calculated by first sorting vertices using the last channel of $Z^T$ in a descending order. If two vertices have the same value in the last channel, the tie is broken by comparing their values in the second to last channel, and so on. If ties still exist, we continue comparing their values in $Z^{T-1}$, $Z^{T-2}$, and so on until ties are broken. Such an order is similar to the lexicographical order, except for comparing sequences from right to left. We can prove that such a sorting scheme ensures permutation invariance, which is important for graph isomorphism.
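The right-to-left tie-breaking rule maps naturally onto NumPy's `np.lexsort`, which treats the last key as the primary sort key. The sketch below is one possible realization, assuming that tie-breaking channels are also compared in descending order; the function name and the toy matrix are illustrative.

```python
import numpy as np

def sortpooling_order(Z_1T):
    """Return row indices that sort vertices by the last column (descending),
    breaking ties with the second to last column, and so on leftwards.
    Because Z^{1:T} = [Z^1, ..., Z^T], moving left past the columns of Z^T
    continues into Z^{T-1}, Z^{T-2}, and so on."""
    # np.lexsort sorts ascending with the LAST key as primary, so the keys
    # are the negated columns of Z_1T (rows of the transposed matrix).
    return np.lexsort(-Z_1T.T)

# Four vertices, two channels; rows 0 and 2 tie in the last channel.
Z_1T = np.array([[0.2, 0.5],
                 [0.9, 0.1],
                 [0.7, 0.5],
                 [0.4, 0.8]])
order = sortpooling_order(Z_1T)
print(order)          # vertex 3 first, then 2 and 0 (tie broken by column 0), then 1

# Relabeling the vertices does not change the sorted output.
perm = np.array([2, 0, 3, 1])
Z_perm = Z_1T[perm]
same = np.array_equal(Z_perm[sortpooling_order(Z_perm)], Z_1T[order])
print(same)
```

The final check illustrates the invariance property on one example: as long as no two vertices have identical descriptors, any relabeling of the vertices yields the same sorted matrix.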
In addition to sorting vertex features in a consistent order, the second function of SortPooling is to unify the sizes of the output tensors. After sorting, we truncate or extend the output tensor in the first dimension from $n$ to $k$, so that graphs with different numbers of vertices all yield outputs with $k$ rows. This is done by deleting the last $n-k$ rows if $n > k$, or appending $k-n$ zero rows if $n < k$.
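The size unification step is simple enough to state directly in code. The following sketch assumes the input rows are already sorted; `unify_size` is a hypothetical helper name.

```python
import numpy as np

def unify_size(Z_sorted, k):
    """Truncate or zero-pad a sorted (n, c) matrix so that it has k rows."""
    n, c = Z_sorted.shape
    if n >= k:
        return Z_sorted[:k]                        # drop the last n - k rows
    padding = np.zeros((k - n, c), dtype=Z_sorted.dtype)
    return np.vstack([Z_sorted, padding])          # append k - n zero rows

big = np.arange(10, dtype=float).reshape(5, 2)     # n = 5
small = np.ones((2, 2))                            # n = 2

truncated = unify_size(big, 3)
padded = unify_size(small, 4)
print(truncated.shape, padded.shape)
```

Because truncation removes rows from the end of the sorted order, the vertices that are dropped are those ranked lowest by the sorting rule.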
As a bridge between graph convolution layers and traditional layers, SortPooling has another great benefit in that it can pass loss gradients back to previous layers by remembering the sorted order of its input, making the training of previous layers’ parameters feasible.
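To make the gradient routing concrete, the sketch below writes a forward and backward pass by hand. It assumes the sorted order is treated as a fixed index mapping during the backward pass, so gradients simply flow back to the rows that were kept; the function names are hypothetical, and an automatic differentiation framework would normally handle this step.

```python
import numpy as np

def sortpool_forward(Z, k):
    """Sort rows (last column descending, ties broken leftwards), keep k rows,
    and remember which input rows were kept."""
    order = np.lexsort(-Z.T)
    kept = order[:k]                       # indices of surviving vertices
    out = np.zeros((k, Z.shape[1]))
    out[:len(kept)] = Z[kept]              # zero rows remain if n < k
    return out, kept

def sortpool_backward(grad_out, kept, input_shape):
    """Scatter output gradients back to the input rows they came from."""
    grad_Z = np.zeros(input_shape)
    grad_Z[kept] = grad_out[:len(kept)]    # dropped vertices receive zero gradient
    return grad_Z

Z = np.array([[0.2, 0.5],
              [0.9, 0.1],
              [0.7, 0.5],
              [0.4, 0.8]])
out, kept = sortpool_forward(Z, k=3)
grad_out = np.array([[1.0, 1.0],
                     [2.0, 2.0],
                     [3.0, 3.0]])
grad_Z = sortpool_backward(grad_out, kept, Z.shape)
print(kept)
print(grad_Z)
```

Here the vertex ranked first receives the gradient of the first output row, and the vertex that was truncated receives no gradient at all.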
After SortPooling, traditional 1-D convolutions are applied to the sorted node representations, similar to how convolutional filters move on image pixels. Figure 2.2 illustrates the overall architecture of DGCNN.
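For intuition, a 1-D convolution over the sorted vertex sequence can be written as a sliding window along the vertex axis. The sketch below is only one plausible configuration: the filter width, the stride of one, the ReLU activation, and the function name are all assumptions made for illustration, since the text above does not fix these details.

```python
import numpy as np

def conv1d_over_nodes(S, filters, bias):
    """S: (k, c) sorted node representations.
    filters: (num_filters, width, c); each filter spans `width` consecutive
    vertices and all c channels. Returns (k - width + 1, num_filters)."""
    k, c = S.shape
    num_filters, width, _ = filters.shape
    out = np.empty((k - width + 1, num_filters))
    for i in range(k - width + 1):
        window = S[i:i + width]                            # (width, c)
        # Contract each filter with the window over the width and channel axes.
        out[i] = np.tensordot(filters, window, axes=([1, 2], [0, 1])) + bias
    return np.maximum(out, 0.0)                            # ReLU (assumed)

S = np.ones((5, 2))                    # k = 5 vertices, c = 2 channels
filters = np.ones((3, 2, 2))           # 3 filters of width 2
bias = np.zeros(3)
features = conv1d_over_nodes(S, filters, bias)
print(features.shape)
```

Because the rows of `S` follow a consistent order across graphs, a filter at a given position always sees vertices of comparable rank, which is what makes the analogy with filters moving over image pixels meaningful.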