Speaker: Cliff Young
The Unreasonable Effectiveness of Deep Learning
The algorithms/models keep changing, meaning that the systems problem keeps changing too.
The engineering is currently ahead of the science - we want to try to understand why, e.g., TPUs are so effective.
The Revolution
Starts with AlexNet, but GPUs were expensive and inefficient
TPU v1:
- deployed 2015, paper in 2017
- single stream of control
- inference only
- 30x perf compared to CPU/GPUs
- Maybe the first high-volume matrix architecture?
Using systolic arrays:
- Computation propagates as a wavefront that expands from one corner of the grid, step by step, rather than through a standard linear pipeline
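A minimal NumPy sketch of that idea (illustrative only, not the TPU's actual datapath): an output-stationary systolic array computing C = A @ B. Rows of A stream in from the left and columns of B from the top, each delayed by one cycle per row/column, so the set of active cells expands from the top-left corner as a wavefront.

```python
import numpy as np

def systolic_matmul(A, B):
    """Cycle-by-cycle simulation of an output-stationary systolic array."""
    n, k = A.shape
    k2, m = B.shape
    assert k == k2
    C = np.zeros((n, m))      # each cell (i, j) accumulates output C[i, j] in place
    a_reg = np.zeros((n, m))  # A value currently held by each cell (moves right)
    b_reg = np.zeros((n, m))  # B value currently held by each cell (moves down)
    for t in range(k + n + m - 2):            # enough cycles to drain the array
        a_reg = np.roll(a_reg, 1, axis=1)     # A values hop one cell to the right
        b_reg = np.roll(b_reg, 1, axis=0)     # B values hop one cell down
        for i in range(n):                    # inject row i of A at the left edge,
            kk = t - i                        # delayed by i cycles (the skew)
            a_reg[i, 0] = A[i, kk] if 0 <= kk < k else 0.0
        for j in range(m):                    # inject column j of B at the top edge
            kk = t - j
            b_reg[0, j] = B[kk, j] if 0 <= kk < k else 0.0
        C += a_reg * b_reg                    # every cell does one MAC per cycle
    return C

A = np.random.randn(3, 4)
B = np.random.randn(4, 5)
assert np.allclose(systolic_matmul(A, B), A @ B)
```

The skew is the point: a[i, k] and b[k, j] arrive at cell (i, j) on the same cycle, so no cell needs global coordination, only its left and upper neighbours.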
TPU v2:
- new task: training
- requires floating-point arithmetic
- very highly parallel
- 2 cores, each with scalar, vector, and matrix units
Cloud TPU v3 out now
"Cambrian Explosion" in DL Accelerators
- Many startups targeting this space (e.g. GraphCore)
- Inference has huge diversity of design points
- But training designs are surprisingly convergent
Data Parallelism: replicate the model N times; each replica processes a slice of the batch.
Model Parallelism: cut the model itself into multiple pieces across devices (a hard problem)
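A hedged NumPy sketch of both strategies, with array slices standing in for devices and a plain average standing in for the all-reduce; the linear model and MSE gradient are invented purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 16))   # batch of 32 examples, 16 features
Y = rng.normal(size=(32, 8))
W = rng.normal(size=(16, 8))    # one weight matrix, mean-squared-error loss

def grad(W, xb, yb):
    """dL/dW for the squared error of xb @ W against yb, averaged over the batch."""
    return 2.0 * xb.T @ (xb @ W - yb) / len(xb)

# Data parallelism: replicate W, shard the batch, average ("all-reduce") gradients.
shards = np.split(np.arange(len(X)), 4)           # 4 replicas, 8 examples each
grads = [grad(W, X[s], Y[s]) for s in shards]     # computed independently per replica
g_data_parallel = np.mean(grads, axis=0)          # the all-reduce step
assert np.allclose(g_data_parallel, grad(W, X, Y))

# Model parallelism: each "device" owns a column slice of W and computes its
# slice of the activations; the outputs are gathered afterwards.
W_parts = np.split(W, 4, axis=1)                  # 4 devices, 2 output columns each
out_parts = [X @ Wp for Wp in W_parts]
out = np.concatenate(out_parts, axis=1)
assert np.allclose(out, X @ W)
```

The asymmetry shows why model parallelism is the hard one: the data-parallel version needs a single collective at the end, while the model-parallel version has to reason about how the network's internal structure is cut.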
Floating point formats
HPC people want more bits in their FP computations, whereas ML people can get away with 16 or even 8.
There are still some benefits to high precision for ML, though. Is the future mixed-precision algorithms?
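A quick numeric illustration with NumPy's float16 standing in for accelerator 16-bit formats (TPUs actually use bfloat16, which stock NumPy does not provide): storing values in 16 bits is fine, but naively accumulating in 16 bits stalls long before the true sum, which is one argument for mixed precision with a wider accumulator.

```python
import numpy as np

x = np.full(100_000, 0.0001, dtype=np.float16)   # true sum is 10.0

naive = np.float16(0.0)
for v in x:                                      # accumulate entirely in float16
    naive = np.float16(naive + v)

mixed = np.sum(x, dtype=np.float32)              # fp16 inputs, fp32 accumulator

print("float16 accumulator:", float(naive))      # stalls (near 0.25 here), nowhere near 10
print("float32 accumulator:", float(mixed))      # ~10.0, up to fp16 rounding of the inputs
```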
Sparsity
Working today:
- Pruning on the inference side (see the pruning sketch below)
- Dropout
- Structured sparsity (e.g. sparse attention)
Promising: GNNs
Sparsity in NNs is low by HPC standards (HPC considers ≥ 98% zeros to be sparse)
Brains may be sparse
Science: how can we make sparse training work?
Engineering: what are the sparse architectures that are worth building?
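As referenced in the list above, a minimal magnitude-pruning sketch (layer size, sparsity target, and function names are illustrative, not from any particular system): zero the smallest-magnitude weights of a trained layer, keep the mask, and notice that even an aggressive 90% is well short of HPC's ≥ 98% notion of sparse.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(256, 256))        # stand-in for a trained dense layer's weights

def magnitude_prune(W, sparsity):
    """Zero out the fraction `sparsity` of weights with the smallest magnitude."""
    threshold = np.quantile(np.abs(W), sparsity)
    mask = np.abs(W) >= threshold
    return W * mask, mask

W_pruned, mask = magnitude_prune(W, sparsity=0.90)
print("fraction of zero weights:", 1.0 - mask.mean())   # ~0.90
```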
Weird unscientific observations
Distillation → training a larger model and then shrinking it back down is more effective than training directly at the small size (see the loss sketch after this list)
Feedback alignment → replacing the transposed weights in backprop with fixed random feedback weights works surprisingly well
Lottery Ticket Hypothesis → sparse, accurate subnetworks already exist inside the randomly initialised architecture; we just have to chip away at the rest to find them?
Some factorisations work for CNNs → Inception (2014), depthwise separable convs (2016); see the parameter-count sketch after this list
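As noted against the distillation item, a minimal sketch of a Hinton-style distillation objective: the teacher and student models are left abstract, only the loss on one batch of logits is shown, and the temperature and weighting values are illustrative defaults.

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)          # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Blend hard cross-entropy on the labels with KL to the teacher's soft targets."""
    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    soft = np.mean(np.sum(p_teacher * (np.log(p_teacher) - np.log(p_student)), axis=-1))
    hard = np.mean(-np.log(softmax(student_logits)[np.arange(len(labels)), labels]))
    return alpha * hard + (1 - alpha) * (T * T) * soft   # T^2 keeps gradient scales comparable

# Toy batch: 8 examples, 10 classes; pretend the teacher logits came from the big model.
rng = np.random.default_rng(0)
teacher_logits = 3.0 * rng.normal(size=(8, 10))
student_logits = rng.normal(size=(8, 10))
labels = rng.integers(0, 10, size=8)
print(distillation_loss(student_logits, teacher_logits, labels))
```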
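And a back-of-the-envelope parameter count for the depthwise-separable factorisation (channel sizes are arbitrary): a k×k convolution from C_in to C_out channels is replaced by a k×k depthwise pass plus a 1×1 pointwise pass.

```python
k, c_in, c_out = 3, 128, 256

standard = k * k * c_in * c_out           # full k x k convolution kernel
separable = k * k * c_in + c_in * c_out   # depthwise pass + 1x1 pointwise pass

print("standard params: ", standard)      # 294,912
print("separable params:", separable)     # 33,920
print("reduction: %.1fx" % (standard / separable))   # ~8.7x
```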
Space race in language understanding
Ever larger machines: OpenAI's 10k-GPU cluster, 3,640 petaflop/s-days of training compute.
Takeaways
Science: ask why
- Why SGD works for training
- What numeric formats we need
- How sparsity can be used
Engineering:
- Sapir-Whorf hypothesis: the language you speak helps or hurts the concepts you can think about - the same applies to machines
- Can we do LHC-scale things under our desks?