Speaker: Cliff Young

### The Unreasonable Effectiveness of Deep Learning

The algorithms/models keep changing, meaning that the systems problem keeps changing too.

The engineering is currently ahead of the science: we want to understand why, e.g., TPUs are so effective.

#### The Revolution

It starts with AlexNet (2012), but GPUs were expensive and inefficient for the job

TPU v1:

- deployed 2015, paper in 2017

- single stream of control

- inference only

- 30x perf compared to CPU/GPUs

- Maybe the first high-volume matrix architecture?

Using systolic arrays:

- Computation sweeps across a 2D grid as a wavefront expanding from one corner, step by step, rather than through a standard linear pipeline
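A toy wavefront simulation (my own sketch, not the actual TPU microarchitecture) of an output-stationary systolic matmul: PE (i, j) consumes its step-th pair of operands at clock tick i + j + step, so activity spreads diagonally from the corner.

```python
import numpy as np

def systolic_matmul(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Toy output-stationary systolic matmul: one accumulator per PE,
    with operands skewed so computation spreads as a diagonal wavefront
    from the corner rather than through a linear pipeline."""
    n, k = a.shape
    _, m = b.shape
    acc = np.zeros((n, m))                       # one accumulator per PE
    for t in range(n + m + k - 2):               # global clock ticks
        for i in range(n):
            for j in range(m):
                step = t - i - j                 # skew: PE (i, j) sees its
                if 0 <= step < k:                # step-th operands at tick
                    acc[i, j] += a[i, step] * b[step, j]  # i + j + step
    return acc

a, b = np.random.rand(3, 4), np.random.rand(4, 5)
assert np.allclose(systolic_matmul(a, b), a @ b)
```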

TPU v2:

- new task: training

- requires floating-point arithmetic

- very highly parallel

- 2 Cores each with scalar, vector and matrix units

Cloud TPU v3 out now

"Cambrian Explosion" in DL Accelerators

- Many startups targeting this space (e.g. GraphCore)

- Inference has a huge diversity of design points

- But training is surprisingly convergent

Data Parallelism: replicate the model N times, giving each replica a different slice of every batch.
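A minimal sketch of the data-parallel recipe (a toy NumPy linear model, not any particular framework): every replica holds the same weights, computes a gradient on its own shard, and the gradients are averaged before a shared update.

```python
import numpy as np

# Toy data-parallel step for a linear model y ≈ x @ w (setup hypothetical):
# each "replica" holds the same w and computes a gradient on its own shard.
def replica_grad(w, x_shard, y_shard):
    err = x_shard @ w - y_shard                  # per-replica forward pass
    return x_shard.T @ err / len(x_shard)        # per-replica MSE gradient

w = np.zeros(8)
x, y = np.random.rand(64, 8), np.random.rand(64)
shards = zip(np.split(x, 4), np.split(y, 4))     # N = 4 replicas
grads = [replica_grad(w, xs, ys) for xs, ys in shards]
w -= 0.1 * np.mean(grads, axis=0)                # all-reduce average, then step
```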

Model Parallelism: cut the model itself into multiple pieces (a hard problem)
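A corresponding model-parallel sketch (again a toy, assuming a single dense layer): the weight matrix is split column-wise across two "devices", each computes its slice of the output, and the slices are concatenated. Choosing where to cut, and scheduling the communication, is the hard part in practice.

```python
import numpy as np

# Toy model parallelism: split one layer's weights across two "devices".
x = np.random.rand(32, 128)                  # activations (batch, d_in)
w = np.random.rand(128, 256)                 # full layer weight matrix
w_dev0, w_dev1 = np.split(w, 2, axis=1)      # half the output columns each
out = np.concatenate([x @ w_dev0, x @ w_dev1], axis=1)
assert np.allclose(out, x @ w)               # same result as a single device
```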

### Floating point formats

HPC people want more bits in their FP computations, whereas ML people can get away with 16 or even 8

There are some benefits to high precision for ML, though. Is the future mixed-precision algorithms?
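A sketch of why a format like bfloat16 suits ML: it keeps float32's 8 exponent bits (so dynamic range survives) and gives up 16 mantissa bits of precision. Emulated here by truncating float32 bit patterns; real hardware rounds rather than truncates.

```python
import numpy as np

def to_bfloat16(x: np.ndarray) -> np.ndarray:
    """Emulate bfloat16 by zeroing the low 16 bits of each float32,
    keeping sign (1) + exponent (8) + top 7 mantissa bits."""
    bits = x.astype(np.float32).view(np.uint32)
    return (bits & 0xFFFF0000).view(np.float32)

x = np.array([1e-30, 1.0, 3.14159265, 1e30], dtype=np.float32)
print(to_bfloat16(x))   # range survives; only ~2-3 decimal digits remain
# Mixed precision in practice: low-precision multiplies, float32 accumulation.
```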

### Sparsity

Working today:

- Pruning on the inference side (see the sketch after this list)

- dropout

- Structured sparsity (e.g. sparse attention)
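A minimal magnitude-pruning sketch (the generic recipe, not any specific paper's): zero out the smallest-magnitude weights and keep a mask so they stay zero during later fine-tuning.

```python
import numpy as np

def prune(w: np.ndarray, sparsity: float):
    """Zero the smallest-magnitude fraction of weights, return mask too."""
    k = int(w.size * sparsity)                    # number of weights to drop
    threshold = np.sort(np.abs(w), axis=None)[k]  # k-th smallest magnitude
    mask = np.abs(w) >= threshold
    return w * mask, mask

w = np.random.randn(256, 256)
w_pruned, mask = prune(w, 0.9)   # 90% sparse: still "dense" by HPC standards
print(1 - mask.mean())           # ≈ 0.9
```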

Promising: GNNs

Sparsity in NNs is low by HPC standards (HPC ≥ 98% = sparse)

Brains may be sparse

Science: how can we make sparse training work?

Engineering: what are the sparse architectures that are worth building?

### Weird unscientific observations

Distillation → training a larger model and then shrinking it back down is more effective than training directly at the smaller size
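A sketch of the standard distillation loss (temperature value illustrative): the small student is trained to match the big teacher's softened output distribution rather than just the hard labels.

```python
import numpy as np

def softened_probs(logits, T):
    """Softmax at temperature T (higher T = softer distribution)."""
    z = logits / T
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def distill_loss(student_logits, teacher_logits, T=4.0):
    """Cross-entropy of the student against the teacher's softened outputs."""
    p_teacher = softened_probs(teacher_logits, T)
    log_p_student = np.log(softened_probs(student_logits, T))
    return -(p_teacher * log_p_student).sum(axis=-1).mean()

teacher = np.random.randn(32, 10)   # logits from the big trained model
student = np.random.randn(32, 10)   # logits from the small model
print(distill_loss(student, teacher))
```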

Feedback alignment → fixed random feedback weights, in place of the transposed forward weights, work just as well in backprop
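A toy NumPy demonstration of the idea (two-layer ReLU net, all sizes and learning rates illustrative): the backward pass routes the output error through a fixed random matrix B instead of W2's transpose.

```python
import numpy as np

rng = np.random.default_rng(0)
x, y = rng.normal(size=(128, 20)), rng.normal(size=(128, 5))
w1 = rng.normal(size=(20, 50)) * 0.1         # forward weights, trained
w2 = rng.normal(size=(50, 5)) * 0.1
b_fixed = rng.normal(size=(50, 5)) * 0.1     # random feedback weights, frozen

for _ in range(500):
    h = np.maximum(x @ w1, 0)                # forward pass, ReLU hidden layer
    err = (h @ w2) - y                       # output error (MSE gradient)
    delta_h = (err @ b_fixed.T) * (h > 0)    # b_fixed.T, NOT w2.T, routes error
    w2 -= 0.05 * h.T @ err / len(x)
    w1 -= 0.05 * x.T @ delta_h / len(x)

print(np.mean(err ** 2))                     # training error still falls
```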

Lottery Ticket Hypothesis → sparse, accurate subnetworks already exist inside a randomly initialised network, and we just have to chip away to find them?

*Some* factorisations work for CNNs → Inception (2014), depthwise separable convolutions (2016)
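The parameter-count arithmetic behind the depthwise separable factorisation (standard formulas): a full K×K convolution mixes space and channels jointly; the factorised version does a per-channel K×K convolution followed by a 1×1 channel-mixing convolution.

```python
def conv_params(k, c_in, c_out):
    return k * k * c_in * c_out                  # standard convolution

def depthwise_separable_params(k, c_in, c_out):
    return k * k * c_in + c_in * c_out           # depthwise + pointwise

print(conv_params(3, 256, 256))                  # 589,824
print(depthwise_separable_params(3, 256, 256))   # 67,840 (~8.7x fewer)
```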

### Space race in language understanding

Ever-larger machines: OpenAI's 10k-GPU cluster, 3,640 petaflop/s-days of training.
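For a sense of scale, converting that figure (petaflop/s-days, the unit used in OpenAI's compute estimates) to raw operations:

```python
total_flops = 3640 * 1e15 * 86400   # petaflop/s × seconds per day × days
print(f"{total_flops:.2e} FLOPs")   # ≈ 3.14e+23
```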

#### Takeaways

Science: ask **why**:

- SGD works for training

- what number formats we need

- how sparsity can be used

Engineering:

- Sapir-Whorf hypothesis: the language you speak helps/hurts the concepts you can think about; the same applies to machines

- Can we do LHC-scale things under our desks?