Data Movement Is All You Need: A Case Study on Optimizing Transformers

Speaker: Andrei Ivanov

Existing Transformer implementations attain far less than peak GPU FLOPS.
On BERT-large, the authors demonstrate performance improvements:
  • 30% over PyTorch
  • 20% over TensorFlow + XLA
  • 8% over DeepSpeed

Data movement is the bottleneck
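
A rough back-of-the-envelope sketch of why (sizes assumed: BERT-large hidden size 1024, sequence length 512, batch 8, fp16; not figures from the talk). It estimates flops per byte moved for each operator class; normalization and element-wise operators fall far below what a GPU needs to be compute-bound:

```python
# Assumed toy sizes for the arithmetic-intensity estimate.
batch, seq, hidden = 8, 512, 1024
bytes_per_elem = 2  # fp16

# Tensor contraction: one dense (hidden x hidden) matmul.
matmul_flops = 2 * batch * seq * hidden * hidden
matmul_bytes = bytes_per_elem * (2 * batch * seq * hidden + hidden * hidden)
print("matmul flops/byte:", matmul_flops / matmul_bytes)        # ~hundreds

# Statistical normalization: layer norm over the hidden dimension.
elems = batch * seq * hidden
ln_flops = 8 * elems                       # mean, variance, scale/shift (rough)
ln_bytes = bytes_per_elem * 2 * elems      # read input, write output
print("layernorm flops/byte:", ln_flops / ln_bytes)             # ~2

# Element-wise: bias add + residual.
ew_flops = 2 * elems
ew_bytes = bytes_per_elem * 3 * elems      # two reads, one write
print("element-wise flops/byte:", ew_flops / ew_bytes)          # < 1
```
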

Group operators into 3 classes (sketch below):
  • Tensor contractions = matmuls
  • Statistical normalizations = softmax / layer norm
  • Element-wise = biases, dropout, activations, residual connections
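
A minimal PyTorch sketch (dimensions and ops are illustrative assumptions, not the talk's code) showing where each class appears in one encoder feed-forward sub-layer:

```python
import torch
import torch.nn.functional as F

# Assumed toy sizes, just to make the three operator classes visible.
batch, seq, hidden = 8, 512, 1024
x = torch.randn(batch, seq, hidden)
w1, b1 = torch.randn(hidden, 4 * hidden), torch.randn(4 * hidden)
w2, b2 = torch.randn(4 * hidden, hidden), torch.randn(hidden)

# Tensor contraction: first feed-forward matmul.
h = x @ w1
# Element-wise: bias add, activation, dropout.
h = F.dropout(F.gelu(h + b1), p=0.1)
# Tensor contraction: second feed-forward matmul.
y = h @ w2
# Element-wise: bias add and residual connection.
y = y + b2 + x
# Statistical normalization: layer norm over the hidden dimension.
y = F.layer_norm(y, (hidden,))
```
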

Dataflow graph: multi-head attention

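A compact sketch of the multi-head attention dataflow the graph depicts (shapes assumed; this is not the talk's implementation). Each step is annotated with its operator class:

```python
import torch

# Assumed sizes: BERT-large-like hidden width, 16 heads.
batch, seq, hidden, heads = 8, 512, 1024, 16
head_dim = hidden // heads
x = torch.randn(batch, seq, hidden)
wq, wk, wv, wo = (torch.randn(hidden, hidden) for _ in range(4))

def split_heads(t):
    # (batch, seq, hidden) -> (batch, heads, seq, head_dim)
    return t.view(batch, seq, heads, head_dim).transpose(1, 2)

# Tensor contractions: Q, K, V projections.
q, k, v = split_heads(x @ wq), split_heads(x @ wk), split_heads(x @ wv)
# Tensor contraction: scaled QK^T attention scores.
scores = q @ k.transpose(-2, -1) / head_dim ** 0.5
# Statistical normalization: softmax over the key dimension.
attn = scores.softmax(dim=-1)
# Tensor contractions: attention-weighted values, then output projection.
out = (attn @ v).transpose(1, 2).reshape(batch, seq, hidden) @ wo
```
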

Operator fusion opportunities

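One illustrative opportunity, sketched under assumptions: the memory-bound element-wise and normalization operators that follow a matmul (bias, dropout, residual, layer norm) can be fused into a single kernel so intermediates stay on-chip instead of round-tripping through GPU memory. The sketch below uses torch.compile only as a generic fusion mechanism; the paper relies on its own fused kernels, not this API.

```python
import torch
import torch.nn.functional as F

# Hypothetical fused "epilogue": bias add + dropout + residual + layer norm.
# All of these are memory-bound, so evaluating them in one pass avoids
# writing and re-reading the intermediate tensors.
@torch.compile
def fused_epilogue(y, bias, residual, weight, eps=1e-5):
    z = F.dropout(y + bias, p=0.1) + residual
    mean = z.mean(dim=-1, keepdim=True)
    var = z.var(dim=-1, unbiased=False, keepdim=True)
    return (z - mean) / torch.sqrt(var + eps) * weight

batch, seq, hidden = 8, 512, 1024
out = fused_epilogue(
    torch.randn(batch, seq, hidden),   # matmul output
    torch.randn(hidden),               # bias
    torch.randn(batch, seq, hidden),   # residual input
    torch.ones(hidden),                # layer-norm scale
)
```
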