Graphcore’s IPU and GNNs

Speaker: Gianandrea Minneci (Graphcore)

Purpose of talk: giving computational angle to the challenges faced by GNNs.

Computational Requirements for GNNs

Computational model has some key differences wrt. other common DL models:

Sparse memory gather/scatter

Adjacency/laplacian-type matrix multiplications introduce v sparse operations.
Diagram highlights that sparsity also changes between hops/layers.
Contrast to CNNs where access pattern predictable and can be leveraged for efficiency gains).
Access patterns not completely random, but complex
notion image

Heterogeneous models

notion image
Heterogeneous models may effectively mean modelling different kinds of GNNs in parallel.
This is hard on modern accelerators which still largely rely on single instruction types.
I assume the suggestion here is that because each IPU tile has its own independent instruction & mem it is able to do MIMD much more effectively and hence better suited to heterogeneous models?
 

High sparsity

Recent improvements in hardware support for sparsity - but graphs are several orders of mag. more sparse than these.
Recall Bill Daly's presentation - A100 sparsity support assumes 2 in 4 zero entries.
Two ways you could make a sparse compution more efficient:
  1. Make the sparse linear algebra as efficient as possible by creating good low-level libraries (this is a big challenge!)
  1. To use use gather-scatter pattern to "pack" the indices that you need.
Challenges of packing/densifying:
  1. Because nodes aren't independent, "densifying" means that some data needs to be duplicated ➡️ uses more memory
  1. Overhead added because management of indexing and load balancing is challenging
(Graphcore has explored both cases - would be interested to see what they concluded!)

Neighbour explosion

notion image
Neighbourhood-hops quite quickly mean we need the whole graph.
Typically addressed by e..g
  • reducing number of neighbours / sample (e.g. GraphSage)
  • split the graph initially into ~indep subgraphs.
Relatively deep GNNs don't seem to be able to achieve same level of success as other DL models
Current approaches to deep GNNs make approximations limiting neighbourhood explosion for deep networks
Interesting point: research suggests that these approximation can be quite harmful to performance
 

Scaling to large graphs

If computations were dense, v high bandwidth needed. As expected, actual "effective" bandwidth needed much lower
➡️ question is then how to leverage this / get close to ideal scenario?
notion image

Dynamic Graphs

Many graphs evolve over time: social networks, forecasting, epidemiology, natural events etc
notion image
However ML accelerators have limited recourse to batching along the time dimension as large batches produce stale updates.
This is quite unique to GNNs relative to e.g. LSTMs - graph is updated during the batch (assuming time-dim), so trying to increase performance using batch parallelism means we're using a lot of stale data.
➡️ Standard techniques for using accelerators must be adapted.
How exactly does this fit with the advantages of IPU? See my notes ⬇️

Graphcore IPU

🧠 Design principles

IPU designed in order to:
  • Primarily: to optimise access to memory at a fixed cost (fixed number of cycles - ideally one) independent of access pattern (e.g. continuous/sparse).
  • So that efficient fine-grain computation can be done—e.g. small batch size, small vector operations, low latency—without having to use large batch sizes or specialised hardware.
  • Achieve MIMD - in a way that's simple for programmers.
  • Efficient scaling to multi-chip systems (chip-to-chip bandwidth & communication with large banks of memory)
  • High performance for low precision compute (and inbuilt support for stochastic rounding)

Single-Chip

Some key design features to address these:
  • Very large number of cores, and crucialy each is on a tile with a large chunk of SRAM - so each is its own kind of system, which feeds into MIMD parallelism paradigm
  • Not to say that GPUs can't technically do this, but because the SRAM is split and is so big there are memory concurrency issues.
  • Sparse instructions and libraries
  • Low-latency IPU links
 
Each core has direct access to local memory at a fixed cost, independent of access pattern
Therefore no concurrency / missed cache issues - giving true MIMD
CPU is on a separate server. IPU has direct access to the DRAM.
Multiple IPUs can be connected with very low latency between them, each with access to large DRAM (slower communication here).
The shared cache model may have conflicts.
The shared cache model may have conflicts.
 

Memory Usage Patterns (esp for GNNs)

notion image
Some quite different ways of using memory, compared to GPUs.
Most of data ideally transferred between IPUs, but for applications with large mem requirements, large mem can be utilised.
➡️❗this could be a good use-case for GNNs on large graphs which need to store large embeddings - if we can do access in a sparse, efficient manner (if not sparse, bandwidth requirements too high & access too slow).
Having relatively cheap access to memory that can scale to terrabytes for a medium-sized system could be a possible solution to scaling up a complex GNN to millions of nodes and even hundreds of millions of edges.

Comparing Memory Subsystems

M2000 vs A100 - comparing mem access patterns:
notion image
On-chip memory much smaller - intense scheduling needed to go from and to HBM memory (HBM faster than the DDR used for IPU-based system, but smaller)
This means execution patterns very different:
1. All weights & activations fits in processor memory -900MB in this case (fast!)
2. Most w & a still in processor mem, but using mem-intensive optimiser e.g. ADAM which requires DDR access
3. Same case but leverages the fact that we have multiple IPU, each of which is MIMD
4. Typical GPU setup (for simplicity, the cache is ignored as it is v small)
1. All weights & activations fits in processor memory -900MB in this case (fast!) 2. Most w & a still in processor mem, but using mem-intensive optimiser e.g. ADAM which requires DDR access 3. Same case but leverages the fact that we have multiple IPU, each of which is MIMD 4. Typical GPU setup (for simplicity, the cache is ignored as it is v small)
In the third case we recall that the IPUs are actually mutiple chips / multiple instruction so one part is dedicated to IO and the remainder can do the computation, effectivley "hiding" I/O overhead.
In the fourth case, this is the typical pattern for the GPU. The GPU will have to pass weights and activations back-and-forth from the HBM for every phase of the execution.

Multi-server case

IPU-POD64 - perhaps compare to a cluster.
Going from one to multiple GPUs requires added layer of complexity, whereas IPU was designed with this use-case in mind.

Q & A

What has been / can be done in the chip design to accommodate GNNs in particular?
Given chip design is a long process, mainly focussed on what can be done given current hardware design → less in the other direction.
At the lowest level each tile can even be programmed independently, so in theory specialised libraries can be created, e.g. to accommodate large embeddings.
Extent to which GNNs will influence further hardware design very dependent on what gains traction etc. (dynamic graphs?)
Have you been able to run some of the GNN workloads on Graphcore yet?
In initial stages, remain software details to be worked out to make best use of hardware for different models. Actively looking for research collaborations. Some problems will be better suited to this architecture than others—GPUs for all the reasons discussed.

My Notes

  • Appears that key computational challenge for larger graphs is how to do the kinds of sparse memory access required (gather-scatter) in an efficient way. This is fast on the limited on-chip mem of an IPU (speed indep. of access pattern), but much slower when we leave the IPU and have to access slower (but larger) DDR designed for fast contiguous mem access.
  • This sparse gather-scatter accessing mem is the main bottleneck. Key challenge is how to leverage sparsity to reduce mem bandwidth here, with "effective" bandwidth of sparse ops several orders of mag. lower than dense equivalent.
  • If I understand, the issue is essentially with sparse data accesses needed to construct minibatches from RAM?
  • Is this all model-parallel? Would like to see this framed in terms of model/data parallelism to make this all clear.
  • Is suggestion about heterogeneous models that we can effectively train multiple GNNs at once via MIMD at the IPU-core level? Could also have interesting implications for ensemble approaches too? (although data is only sampled once)
  • Dynamic graphs: I'm not entirely clear how these are handled? Can we batch on the nodes and consider each node across all timesteps? This seems like a promising approach, as long as not too many timesteps to consider. Is this not what is done as standard with e.g. LSTM though? Need to look into challenges here.