
TPUs

Speaker: Cliff Young

The Unreasonable Effectiveness of Deep Learning

The algorithms/models keep changing, meaning that the systems problem keeps changing too.
The engineering is currently ahead of the science: we want to try to understand why, for example, TPUs are so effective.

The Revolution

Starts with AlexNet, but GPUs were expensive and inefficient
TPU v1:
  • deployed 2015, paper in 2017
  • single stream of control
  • inference only
  • 30x perf compared to CPU/GPUs
  • Maybe the first high-volume matrix architecture?
Using systolic arrays:
  • A grid of compute units where the wavefront expands from one corner step by step, rather than a standard linear pipeline
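To make the wavefront picture concrete, here is a minimal cycle-level sketch (my own illustration in plain Python/NumPy, not TPU code, and using the output-stationary dataflow rather than necessarily the TPU's own): each processing element (PE) holds one output, operands stream in from the left and top with a one-cycle skew per row/column, and compute spreads diagonally from the corner.

```python
import numpy as np

def systolic_matmul(A, B):
    """Toy cycle-level simulation of an output-stationary systolic array.

    PE (i, j) accumulates C[i, j]. Row i of A streams in from the left and
    column j of B from the top, each skewed by one cycle per PE, so the
    compute wavefront expands diagonally from the top-left corner rather
    than marching through a single linear pipeline.
    """
    n, k = A.shape
    _, m = B.shape
    C = np.zeros((n, m))
    for t in range(n + m + k - 2):        # cycles until the array drains
        for i in range(n):
            for j in range(m):
                step = t - i - j          # which operand pair reaches PE (i, j) now
                if 0 <= step < k:
                    C[i, j] += A[i, step] * B[step, j]
    return C

A = np.random.randn(4, 6)
B = np.random.randn(6, 5)
assert np.allclose(systolic_matmul(A, B), A @ B)
```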
TPU v2:
  • new task: training
  • requires floating-point arithmetic
  • very highly parallel
  • 2 Cores each with scalar, vector and matrix units
Cloud TPU v3 out now
"Cambrian Explosion" in DL Accelerators
  • Many startups targeting this space (e.g. GraphCore)
  • Inference has a huge diversity of design points
  • But training is surprisingly convergent
Data Parallelism: replicate the model N times.
Model Parallelism: cut up the model into multiple pieces (hard problem)
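A minimal sketch of data parallelism using JAX's pmap (my own illustration, not from the talk): every device holds a full copy of a toy model's weights plus its own shard of the batch, and gradients are averaged with an all-reduce (pmean) so the replicas stay in sync. Model parallelism would instead split the weights themselves across devices, which is why it is the harder problem.

```python
import jax
import jax.numpy as jnp

n_dev = jax.local_device_count()

def loss_fn(w, x, y):
    return jnp.mean((x @ w - y) ** 2)        # toy linear model

def train_step(w, x, y):
    grads = jax.grad(loss_fn)(w, x, y)
    # All-reduce: average gradients across replicas so every copy stays in sync.
    grads = jax.lax.pmean(grads, axis_name="devices")
    return w - 0.1 * grads

p_train_step = jax.pmap(train_step, axis_name="devices")

# Replicate the parameters, shard the batch along the leading device axis.
w = jnp.zeros((16, 1))
w_repl = jnp.broadcast_to(w, (n_dev,) + w.shape)
x = jnp.ones((n_dev, 32, 16))                # per-device batch of 32 examples
y = jnp.ones((n_dev, 32, 1))
w_repl = p_train_step(w_repl, x, y)
```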

Floating point formats

HPC people want more-bit FP computation, whereas ML people can get away with 16 or even 8 bits.
There are some benefits to higher precision for ML though. Is the future mixed-precision algorithms?
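One concrete form of mixed precision, sketched in JAX (my own illustration; bfloat16 is the 16-bit format TPUs use): multiply in bfloat16 but accumulate the products in float32.

```python
import jax
import jax.numpy as jnp

# bfloat16: 1 sign, 8 exponent, 7 mantissa bits -- float32's range with
# far less precision. Store and multiply in bfloat16, accumulate in float32.
x = jnp.ones((128, 256), dtype=jnp.bfloat16)
w = jnp.ones((256, 64), dtype=jnp.bfloat16)

# Ask for float32 accumulation of the bfloat16 products, mirroring what a
# matrix unit does in hardware.
y = jax.lax.dot(x, w, preferred_element_type=jnp.float32)
assert y.dtype == jnp.float32
```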

Sparsity

Working today:
  • Pruning on the inference side (see the sketch after this list)
  • Dropout
  • Structured sparsity (e.g. sparse attention)
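A minimal one-shot magnitude-pruning sketch for the first bullet (my own illustration in Python/JAX, not a production recipe): keep the largest weights by absolute value and zero out the rest. Real pipelines usually prune gradually while fine-tuning rather than in one shot.

```python
import jax.numpy as jnp

def magnitude_prune(w, sparsity=0.9):
    """Zero out the smallest-magnitude fraction of weights."""
    k = int(w.size * sparsity)                    # how many weights to drop
    threshold = jnp.sort(jnp.abs(w).ravel())[k]   # magnitude cut-off
    mask = jnp.abs(w) >= threshold                # keep only the large weights
    return w * mask, mask

w = jnp.array([[0.01, -0.5, 0.03], [0.7, -0.02, 0.9]])
w_pruned, mask = magnitude_prune(w, sparsity=0.5)
```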
Promising: GNNs
Sparsity in NNs is low by HPC standards (in HPC, ≥ 98% zeros counts as sparse)
Brains may be sparse
Science: how can we make sparse training work?
Engineering: what are the sparse architectures that are worth building?

Weird unscientific observations

Distillation → going larger, training, then going back smaller is more effective than training directly at that size
Feedback alignment → replacing backprop's transposed weights with fixed random feedback weights works just as well
Lottery Ticket Hypothesis → sparse, accurate subnetworks already exist inside the randomly initialised network and we just have to chip away to find them?
Some factorisations work for CNNs → Inception (2014), depthwise separable convolutions (2016)
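A minimal sketch of the distillation objective (my own illustration, following the standard soft-target recipe rather than anything specific from the talk): the large teacher's temperature-softened outputs become the training target for the smaller student.

```python
import jax
import jax.numpy as jnp

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy of the student against temperature-softened teacher targets."""
    t_probs = jax.nn.softmax(teacher_logits / temperature, axis=-1)
    s_logprobs = jax.nn.log_softmax(student_logits / temperature, axis=-1)
    # Equivalent to KL(teacher || student) up to the teacher's constant entropy;
    # scaled by T^2 so gradient magnitudes stay comparable across temperatures.
    return -jnp.mean(jnp.sum(t_probs * s_logprobs, axis=-1)) * temperature ** 2

student = jnp.array([[2.0, 0.5, -1.0]])
teacher = jnp.array([[3.0, 0.1, -2.0]])
loss = distillation_loss(student, teacher)
```

In practice this term is mixed with the ordinary cross-entropy on the true labels.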

Space race in language understanding

Ever-larger machines: OpenAI's 10k-GPU cluster, ~3,640 petaflop/s-days of training.

Takeaways

Science: ask why
  • SGD works for training
  • What numeric formats we need
  • How sparsity can be used
Engineering:
  • Sapir-Whorf hypothesis: the language you speak helps/hurts the concepts you can think about; the same applies to machines
  • Can we do LHC-scale things under our desks?