Contents:IntroductionBackground & DefinitionBackgroundDefinitionCategorisation and FrameworksTaxonomy of GNNsFrameworksRecGNNsGNN*Graph Echo State NetworkGated Graph Neural NetworkStochastic Steady-state EmbeddingConvGNNsSpectral-basedSpectral CNNChebyshev Spectral CNN (ChebNet)Graph Convolutional Network (GCN)Spatial-basedNeural Network for Graphs (NN4G)Diffusion Convolutional Neural Network (DCNN)Diffusion Graph Convolution (DGC)Message Passing Neural Network (MPNN)Graph Isomorphism Network (GIN)GraphSageGraph Attention Network (GAT)Mixture Model Network (MoNet)PATCHY-SANTraining EfficiencyGraph PoolingGraph IsomorphismEquivariance & InvarianceGraph AutoencodersGraph Autoencoders for Network EmbeddingsStructural Deep Network Embedding (SDNE)Graph Autoencoder (GAE*)Variational Graph Autoencoder (VGAE)GraphSageDGIDeep Recursive Network Embedding (DRNE)Graph Autoencoders for Graph GenerationSpatial-Temporal GNNsRNN-base approachesCNN-based approachesLatent Graph ConstructionApplicationsDatasetsEvaluation & ImplementationsApplications
Justification for GNNs: "there is an increasing number of applications where data are represented in the form of graphs" - particular drive to extend deep learning to graph-structured data.
Gives a history of the development of GNNs. Key historical papers: Sperduti et al. (1997), Goriet al. (2005), Scarselli et al. (2009), Gallicchio et al. (2010).
Network embedding: aim to derive representations for nodes, capturing feature and structure information. However, no requirement to use deep-learning methods.
Graph kernel methods: typically used for graph classification. Idea is to classify a graph by calculating "distance" to other graphs via a kernel function. Typically involve mapping graphs or nodes into vector spaces and using traditional kernel approaches.
Standard graph formulation, with node attributes and edge attributes . Note that edge denotes an edge from to .
A spatial-temporal graph is defined with a separate set of attributes for each timestep.
Types of GNN:
RecGNN: nodes exchange information using unlimited number of steps until stable equilibrium reached.
ConvGNN: generalises convolution to graphs. Uses fixed number of steps, with each forming an independently parameterised "layer".
GAE: attempt to encode nodes/graphs into a latent vector space, and then reconstruct original graph. Reconstruction may be step-by-step, or all-at-once.
STGNN: consider spatial and temporal dependencies simultaneously. Often use graph convolutions to capture spatial dependency, and RNN/CNNs to model the temporal dependency.
Types of output: node-level, edge-level, graph-level.
Types of training:
Supervised; graph-level classification
Semi-supervised; node-level classification: whole task based around a single (large) network, with partial node-labels where missing node-classes are to be learned.
Unsupervised; graph embedding: commonly approaches include GAE, and also negative sampling, where a logistic regression layer attempts to distinguish connected nodes from unconnected ones.
Comparison of Rec/ConvGNN architectures from the literature also given.
Historical method; ConvGNNs typically preferred currently.
Scarselli et al. (2009) (* distinguishes from general framework)
RecGNN using the update function:
Key constraint needed to reach stable equilibrium is that must be a contraction mapping (guaranteed to shrink the distance between two points after projection into latent space).
If is a neural network, this means a penalty term must be imposed on Jacobian.
Some "convergence criterion" typically used to limit iterations in practice.
Gallicchio et al. (2010)
2 parts: encoder & output layer
The encoder is randomly initialized and requires no training. It implements a contractive state transition function to recurrently update node states until convergence.
The output layer is trained by taking the fixed node states as inputs.
Li et al. (2015)
Uses a GRU at the nodes, meaning a fixed-number of steps can be used, and no parameter constraints are required.
Note that the function applied to neighbouring states is a linear transformation, and edge/node features are ignored:
These features are integrated via .
Parameters are trained using backpropagation. This can be problematic for large graphs.
Dai et al. (2018)
Enables scaling gradient updates to large graphs.
Alternatively samples a batch of nodes for state update and a batch of nodes for gradient computation.
For stability, a recurrent function is used taking a weighted average of historical and new states.
These make use of the normalised graph Laplacian matrix: where is the degree matrix, and is the adjacency matrix.
As is real symmetric positive semidefinite, it can be eigendecomposed: .
We consider a graph signal, —note that this is not in .
The graph Fourier transform is then defined as , and the inverse . This basis gives us the space in which we wish to manipulate the signal.
Given a filter , we define the graph convolution of the input signal with the filter as
We can get this into matrix-multiplication form by putting the graph Fourier transform result inside a matrix diagonal: , which simplifies the above equation to
Comparing this to our eigendecomposition, we can think of this as multiplying by but multiplying each eigenvalue by some "filter" value.
These treat the filter as a set of learnable parameters, with one diagonal matrix for each pair of input () and output () "channels", per-layer (): .
Each layer is then defined as:
where is the graph signal, and is the number of output channels for a layer.
This is essentially just a learned spectral graph convolution summed over each channel, with an added activation function.
This approach has 3 key limitations:
- Perturbations to the graph structure change the eigenbasis.
- The learned filters are only meaningful for the given graph.
- Eigendecomposition is .
Defferrard et al (2016)
Approximates the filter by Chebyshev polynomials of the scaled eigenvalue matrix: , where .
Chebyshev polynomials here are defined recursively by .
As a result, the spectral graph convolution becomes:
As where (can be proven by induction on ), ChebNet takes the form:
thus avoiding the expensive eigendecomposition (now ) and making the filters localised.
Kipf et al. (2017)
Assuming the approximation and , the above equation becomes
To simplify this further, we assume . We also want to consider a multi-channel input (i.e. going from to ). In this case we also want to add an activation function across channels. This gives us
where . This is starting to look a lot like a spatial method!
In practice, some substitutions are made to address numerical instability here:
- (standard degree matrix, but using
- (compare to above: has "moved inside" the multiplication)
These are generally preferred over spectral-based GNNs, as spectral models are typically based on whole-graph computations (e.g. eigendecomp) and are sensitive to small changes in the graph.
First spatially-framed GNN.
Considers each layer to have its own parameters and uses weighted neighbourhood aggregation with an activation function, plus uses skip connections:
Atwood and Towsley (2016)
DCNN regards graph convolutions as a diffusion process.
It assumes information is transferred from one node to one of its neighboring nodes with a certain transition probability until equilibrium is reached.
Layers are computed via
where the probability transition matrix .
DCNN concatenates (which note, don't depend on each other) together as the final model outputs.
Li et al. (2018)
As the stationary distribution of a diffusion process (see above) is a summation of power series of probability transition matrices, DGC sums the at the output:
Gilmer et al. (2017)
Framework for spatial-based ConvGNNs:
where , and , are functions with learnable parameters (, plus an optional readout function).
Veličković et al. (2019)
Previous MPNN-based methods can't distinguish different graph structures from the graph embeddings. To address this, a node adjusts its own weight by a learnable parameter :
Hamilton et al. (2017)
Uses sampling to obtain a fixed number of neighbours per node:
where , is a permutation-invariant aggregation function, is a random sample of node 's neighbours.
Training algorithm work by for each node, forming a fixed-size tree by expanding in a series of fixed-size sampling steps. Representations are then propagated hierarchically back up the tree.
Veličković et al. (2017)
Uses attention mechanism between nodes:
where is a LeakyReLU activation function and is a vector of learnable parameters. Multiple attention heads are used.
Gated Attention Network (GAAN), Zhang et al. (2018) introduces an additional attention score for each attention head.
Monti et al. (2017)
Also assigns neighbour weights, but using pseudo-coordinates, and mapping relative distances to weights.
Several existing GNN approaches can be viewed as special instances of MoNet using non-parametric weight functions.
Niepert et al. (2016)
Ranks a node's neighbours according to graph-structural properties, and associates each rank with a learnable weight. Selects the top ranked neighbours.
Can then be converted into grid-structured data, and a convolutional filter is applied.
Large-scale Graph Convolutional Network (LGCN), Gao et al. (2018) is similar to PATCHY-SAN, but ranks neighbours based on node feature information.
Training standard ConvGNNs usually requires keeping the whole graph and intermediate representations in memory.
GraphSage addresses this by training on one tree at a time
Fast-GCN, Chen et al. (2018) samples a fixed number of arbitrary nodes for each graph convolutional layer.
Cluster-GCN, Chiang et al. (2019) samples a subgraph based on a clustering algorithm and performs graph convolutions within the subgraph.
The paper compares the time and memory complexity of these and other methods.
Standard approach is to use mean/max/sum pooling to implement down-sampling.
One downside is that the embedding is always fixed-size. Vinyals et al. (2016) propose Set2Set to address this, and use an LSTM to encode information about the order of nodes.
ChebNet uses the Graclus clustering algorithm to coarsen input graphs into multiple levels, and then rearranges nodes into a balanced binary tree. This tree is then arbitrarily aggregated from bottom to top to create pools/clusters of similar nodes.
DCGNN, Zhang et al. (2018) also pools by sorting nodes into some order, but in their case it is based on structural information using WL colours.
DiffPool, Ying et al. (2018) learns an explicit cluster assignment matrix in the same way that representations are learned, with probabilities associated with each class:
however, this creates dense graphs after pooling, so complexity thereafter is .
More recently, SAGPool, Lee et al. (2019) has used self-attention to learn pooling.
Isomorphic graphs are those that are topologically identical. Some key results here. Xu et al. (2019) prove that:
- if a GNN maps two graphs to different embeddings, they will be identifiedas non-isomorphic by the Weisfeiler-Lehman (WL) test of isomorphism
- Common GNNs like GCN and GraphSage can't distinguish certain graph structures
- If the aggregation functions and readout function are injective, then the GNN is at most as powerful as the WL test (i.e. things the WL test can't distinguish between, neither can these GNNs).
In the following examples, is a permutation matrix.
For node-level tasks we require permutation equivariance.
In this case we have , and equivariance is defined as:
For graph-level tasks we require permutation invariance.
In this case we have , and invariance is defined as:
In this case we consider learning representations across a single graph.
Embeddings Definition: a low-dimensional vector representation of a node which preserves its topological information.
- encoder: graph → embedding
- decoder: embedding → graph
Wang et al. (2016)
Focuses only on topological embeddings (i.e. we only have , not features ).
Uses a stacked autoencoder to learn first and second-order proximity.
Loss function at the embeddings, and at the outputs:
where (features = adjacency vec), and
if , if .
Kipf and Welling (2016)
Also encodes feature information.
Uses two CGN layers to learn enbeddings:
The dot product between two embeddings is then the prediction for an edge:
The loss function used is the cross-entropy.
again, Kipf and Welling (2016)
Optimizes the variational lower bound :
I think this is the standard formulation. Let's break down these terms:
- The Gaussian prior:
- The (variational?) encoder: , where are outputs from stacked CGN-based layers
- The decoder:
Adversarially Regularized Variational Graph Autoencoder (ARVGA), Pan et al. (2018) uses a GAN to learn an encoder that is indistinguishable from the prior.
Uses negative sampling to ensure the decoder learns similar representations for neighbours and dissimilar ones for distant nodes:
Here is a neighbor of , whereas is sampled from which generates nodes that are distant from . is the number of negative samples.
Veličković et al. (2019)
drives local network embeddings to capture global structural information by maximizing local mutual information. It shows a distinct improvement overGraphSage experimentally.
Cui et al. (2018)
Learns network (i.e. not feature) embeddings.
Assumes node embedding should be predictable from neighbour's embeddings—learns these embeddings by minimising distance between embedding and LSTM run over neighbours.
Here the order sent to the LSTM is arbitrary. In some cases this would be an issue, but not here:
DRNE implicitly learns network embeddings via an LSTM network rather than using the LSTM network to generate network embeddings. It avoids the problem that the LSTM is not invariant to the permutation of node sequences.
ARVGA, Yu et al. (2018) and NetRA, Cui et al. (2018) use similar LSTM approaches to regularise network embeggings via adversarial training.
In this case we consider multiple graphs, and attempt to learn the generative distribution of of the graphs by encoding into hidden representations and then decoding.
A key use case here is molecular graph generation.
The is the sequential approach: "Sequential approaches generate a graph by proposing nodes and edges step by step."
And the global approach which outputs a graph all at once.
These approaches largely involve composing the kinds of approaches outlined above, as well as taking from some other areas of ML (e.g. GANs, RL)
A downside of global approaches is that they tend to be .
The task of STGNNs can be forecasting future node values or labels, or predicting spatial-temporal graph labels. STGNNs follow two directions, RNN-based methods and CNN-based methods.
Suppose a simple RNN takes the form
This can be adapted to a STGNN by inserting a graph convolutional layer as follows:
Graph Convolutional Recurrent Network (GCRN), Seo et al. (2018) use an LSTM with ChebNet.
Diffusion Convolutional Recurrent Neural Network (DCRNN), Li et al. (2018) use a diffusion graph convolutional layer with a GRU.
Alternatively, RNNs can be used at the node and edge-level.
Structural-RNN, Jain et al. (2016) uses a node-RNN and edge_RNN, through which temporal information is passed. To incorporate spatial information, a node-RNN takes the outputs of edge-RNNs as inputs. To save model complexity costs, nodes and edges are split into "semantic groups", which each share the same RNN model.
RNN-based approaches suffer from time-consuming iterative propagation and gradient explosion/vanishing issues ... CNN-based approaches tackle spatial-temporal graphs in a non-recursive manner with the advantages of parallel computing, stable gradients, and low memory requirements.
These approaches work by interleaving 1D-CNN layers with graph convolutional layers, to learn temporal and spatial dependencies respectively.
"Assume the inputs to a spatial-temporal GNN is a tensor , the 1D-CNN layer slides over along the time axis to aggregate temporal information for each node while the graph convolutional layer operates on to aggregate spatial information at each time step."
CGCN, Yu et al. (2018) integrates 1D convolutional layers with ChebNet or CGN layers.
ST-GCN, Yan et al. (2018) uses 1D convolutional layers with PGC layers.
These methods can learn a latent static graph automatically from data.
Graph WaveNet, Wu et al. (2018) uses source node embeddings and learns target node embeddings , using the combination to derive a self-adaptive adjacency matrix:
This provides interpretable correlations among entities.
GaAN, Zhang et al. (2018) employs attention mechanisms to learn these dependencies through an RNN-based approach.
ASTGCN, Guo et al. (2019) includes both a spatial attention function and a temporal attention function through a CNN-based approach.
Again, these methods typically cannot be used on very large graphs as they are .
Note, Shchur et al. (2018) and this anonymous paper point out a number of flaws in the way GNNs are typically evaluated.
Most methods use a standard split of train/valid/test.
Evaluation metrics: average accuracy & F1 score.
Common pitfalls: using the same train/valid/test split across multiple experiments underestimates generalisation error.
Evaluation: 10-fold CV typically used
Common pitfalls: ambiguous experimental settings; test set of each fold used both for model selection and risk assessment.
- PyTorch Geometric
- Deep Graph Library (GDL)
Scene graph generation
This involves recognising the semantic relationships between objects in a visual scene, by parsing an image into a graph. The inverse is also possible.
As natural language can be parsed as semantic graphs where each word represents an object, it is a promising solution to synthesize images given textual descriptions.
Point cloud classification
A point cloudis a set of 3D points (e.g. recorded by LiDAR). Classifying and segmenting these helps us to "see".
STGNNs can be applied to videos of agents interacting in the real world, with graphs formed over joints.
Include human-object interation, few-shot image classification, semantic segmentation, visual reasoning, question answering.
The inter-relations of documents or words can be used to infer labels. Although sentences are lists, they may also have a latent graph/tree structure of relationships.
This generates sentences with the same meaning given a semantic graph of abstract words (an Abstract Meaning Representation). The inverse task (sequence-to-graph) is very useful in knowledge discovery.