Data Movement Is All You Need: A Case Study on Optimizing Transformers

Speaker: Andrei Ivanov

Existing Transformer implementations attain far less than peak GPU FLOPS.
On BERT-large, the authors demonstrate performance improvements:
  • 30% over PyTorch
  • 20% over TensorFlow + XLA
  • 8% over DeepSpeed

Data movement is the bottleneck
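
A rough back-of-the-envelope sketch of why (sizes assumed: BERT-large hidden size 1024, sequence length 512, batch 8, fp16; not figures from the talk). It estimates flops per byte moved for each operator class; normalization and element-wise operators fall far below what a GPU needs to be compute-bound:

```python
# Assumed toy sizes for the arithmetic-intensity estimate.
batch, seq, hidden = 8, 512, 1024
bytes_per_elem = 2  # fp16

# Tensor contraction: one dense (hidden x hidden) matmul.
matmul_flops = 2 * batch * seq * hidden * hidden
matmul_bytes = bytes_per_elem * (2 * batch * seq * hidden + hidden * hidden)
print("matmul flops/byte:", matmul_flops / matmul_bytes)        # ~hundreds

# Statistical normalization: layer norm over the hidden dimension.
elems = batch * seq * hidden
ln_flops = 8 * elems                       # mean, variance, scale/shift (rough)
ln_bytes = bytes_per_elem * 2 * elems      # read input, write output
print("layernorm flops/byte:", ln_flops / ln_bytes)             # ~2

# Element-wise: bias add + residual.
ew_flops = 2 * elems
ew_bytes = bytes_per_elem * 3 * elems      # two reads, one write
print("element-wise flops/byte:", ew_flops / ew_bytes)          # < 1
```
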

Group operators into 3 classes (sketch below):
  • Tensor contractions = matmuls
  • Statistical normalizations = softmax / layer norm
  • Element-wise = biases, dropout, activations, residual connections
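
A minimal PyTorch sketch (dimensions and ops are illustrative assumptions, not the talk's code) showing where each class appears in one encoder feed-forward sub-layer:

```python
import torch
import torch.nn.functional as F

# Assumed toy sizes, just to make the three operator classes visible.
batch, seq, hidden = 8, 512, 1024
x = torch.randn(batch, seq, hidden)
w1, b1 = torch.randn(hidden, 4 * hidden), torch.randn(4 * hidden)
w2, b2 = torch.randn(4 * hidden, hidden), torch.randn(hidden)

# Tensor contraction: first feed-forward matmul.
h = x @ w1
# Element-wise: bias add, activation, dropout.
h = F.dropout(F.gelu(h + b1), p=0.1)
# Tensor contraction: second feed-forward matmul.
y = h @ w2
# Element-wise: bias add and residual connection.
y = y + b2 + x
# Statistical normalization: layer norm over the hidden dimension.
y = F.layer_norm(y, (hidden,))
```
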

Dataflow graph: multi-head attention

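A compact sketch of the multi-head attention dataflow the graph depicts (shapes assumed; this is not the talk's implementation). Each step is annotated with its operator class:

```python
import torch

# Assumed sizes: BERT-large-like hidden width, 16 heads.
batch, seq, hidden, heads = 8, 512, 1024, 16
head_dim = hidden // heads
x = torch.randn(batch, seq, hidden)
wq, wk, wv, wo = (torch.randn(hidden, hidden) for _ in range(4))

def split_heads(t):
    # (batch, seq, hidden) -> (batch, heads, seq, head_dim)
    return t.view(batch, seq, heads, head_dim).transpose(1, 2)

# Tensor contractions: Q, K, V projections.
q, k, v = split_heads(x @ wq), split_heads(x @ wk), split_heads(x @ wv)
# Tensor contraction: scaled QK^T attention scores.
scores = q @ k.transpose(-2, -1) / head_dim ** 0.5
# Statistical normalization: softmax over the key dimension.
attn = scores.softmax(dim=-1)
# Tensor contractions: attention-weighted values, then output projection.
out = (attn @ v).transpose(1, 2).reshape(batch, seq, hidden) @ wo
```
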

Operator fusion opportunities

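One illustrative opportunity, sketched under assumptions: the memory-bound element-wise and normalization operators that follow a matmul (bias, dropout, residual, layer norm) can be fused into a single kernel so intermediates stay on-chip instead of round-tripping through GPU memory. The sketch below uses torch.compile only as a generic fusion mechanism; the paper relies on its own fused kernels, not this API.

```python
import torch
import torch.nn.functional as F

# Hypothetical fused "epilogue": bias add + dropout + residual + layer norm.
# All of these are memory-bound, so evaluating them in one pass avoids
# writing and re-reading the intermediate tensors.
@torch.compile
def fused_epilogue(y, bias, residual, weight, eps=1e-5):
    z = F.dropout(y + bias, p=0.1) + residual
    mean = z.mean(dim=-1, keepdim=True)
    var = z.var(dim=-1, unbiased=False, keepdim=True)
    return (z - mean) / torch.sqrt(var + eps) * weight

batch, seq, hidden = 8, 512, 1024
out = fused_epilogue(
    torch.randn(batch, seq, hidden),   # matmul output
    torch.randn(hidden),               # bias
    torch.randn(batch, seq, hidden),   # residual input
    torch.ones(hidden),                # layer-norm scale
)
```
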