Speaker: Cliff Young
The Unreasonable Effectiveness of Deep Learning
The algorithms/models keep changing, meaning that the systems problem keeps changing too.
The engineering is currently ahead of the science - we want to try to understand why, e.g., TPUs are so effective.
The Revolution
Starts with AlexNet, but GPUs were expensive and inefficient
TPU v1:
- deployed 2015, paper in 2017
- single stream of control
- inference only
- 30x perf compared to CPU/GPUs
- Maybe the first high-volume matrix architecture?
Using systolic arrays:
- Computation propagates as a wavefront that expands from one corner of the grid, step by step, rather than through a standard linear pipeline
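A minimal NumPy sketch of that idea (illustrative only, not the TPU's actual datapath): an output-stationary systolic array computing C = A @ B. Rows of A stream in from the left and columns of B from the top, each delayed by one cycle per row/column, so the set of active cells expands from the top-left corner as a wavefront.

```python
import numpy as np

def systolic_matmul(A, B):
    """Cycle-by-cycle simulation of an output-stationary systolic array."""
    n, k = A.shape
    k2, m = B.shape
    assert k == k2
    C = np.zeros((n, m))      # each cell (i, j) accumulates output C[i, j] in place
    a_reg = np.zeros((n, m))  # A value currently held by each cell (moves right)
    b_reg = np.zeros((n, m))  # B value currently held by each cell (moves down)
    for t in range(k + n + m - 2):            # enough cycles to drain the array
        a_reg = np.roll(a_reg, 1, axis=1)     # A values hop one cell to the right
        b_reg = np.roll(b_reg, 1, axis=0)     # B values hop one cell down
        for i in range(n):                    # inject row i of A at the left edge,
            kk = t - i                        # delayed by i cycles (the skew)
            a_reg[i, 0] = A[i, kk] if 0 <= kk < k else 0.0
        for j in range(m):                    # inject column j of B at the top edge
            kk = t - j
            b_reg[0, j] = B[kk, j] if 0 <= kk < k else 0.0
        C += a_reg * b_reg                    # every cell does one MAC per cycle
    return C

A = np.random.randn(3, 4)
B = np.random.randn(4, 5)
assert np.allclose(systolic_matmul(A, B), A @ B)
```

The skew is the point: a[i, k] and b[k, j] arrive at cell (i, j) on the same cycle, so no cell needs global coordination, only its left and upper neighbours.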
TPU v2:
- new task: training
- requires floating-point arithmetic
- very highly parallel
- 2 cores, each with scalar, vector, and matrix units
Cloud TPU v3 out now
"Cambrian Explosion" in DL Accelerators
- Many startups targeting this space (e.g. GraphCore)
- Inference has huge diversity of design points
- But training designs are surprisingly convergent
Data Parallelism: replicate the model N times; each replica processes a slice of the batch.
Model Parallelism: cut the model itself into multiple pieces across devices (a hard problem)
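A hedged NumPy sketch of both strategies, with array slices standing in for devices and a plain average standing in for the all-reduce; the linear model and MSE gradient are invented purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 16))   # batch of 32 examples, 16 features
Y = rng.normal(size=(32, 8))
W = rng.normal(size=(16, 8))    # one weight matrix, mean-squared-error loss

def grad(W, xb, yb):
    """dL/dW for the squared error of xb @ W against yb, averaged over the batch."""
    return 2.0 * xb.T @ (xb @ W - yb) / len(xb)

# Data parallelism: replicate W, shard the batch, average ("all-reduce") gradients.
shards = np.split(np.arange(len(X)), 4)           # 4 replicas, 8 examples each
grads = [grad(W, X[s], Y[s]) for s in shards]     # computed independently per replica
g_data_parallel = np.mean(grads, axis=0)          # the all-reduce step
assert np.allclose(g_data_parallel, grad(W, X, Y))

# Model parallelism: each "device" owns a column slice of W and computes its
# slice of the activations; the outputs are gathered afterwards.
W_parts = np.split(W, 4, axis=1)                  # 4 devices, 2 output columns each
out_parts = [X @ Wp for Wp in W_parts]
out = np.concatenate(out_parts, axis=1)
assert np.allclose(out, X @ W)
```

The asymmetry shows why model parallelism is the hard one: the data-parallel version needs a single collective at the end, while the model-parallel version has to reason about how the network's internal structure is cut.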
Floating point formats
HPC people want more bits in their FP computations, whereas ML people can get away with 16 or even 8.
There are still some benefits to high precision for ML, though. Is the future mixed-precision algorithms?
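A quick numeric illustration with NumPy's float16 standing in for accelerator 16-bit formats (TPUs actually use bfloat16, which stock NumPy does not provide): storing values in 16 bits is fine, but naively accumulating in 16 bits stalls long before the true sum, which is one argument for mixed precision with a wider accumulator.

```python
import numpy as np

x = np.full(100_000, 0.0001, dtype=np.float16)   # true sum is 10.0

naive = np.float16(0.0)
for v in x:                                      # accumulate entirely in float16
    naive = np.float16(naive + v)

mixed = np.sum(x, dtype=np.float32)              # fp16 inputs, fp32 accumulator

print("float16 accumulator:", float(naive))      # stalls (near 0.25 here), nowhere near 10
print("float32 accumulator:", float(mixed))      # ~10.0, up to fp16 rounding of the inputs
```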
Sparsity
Working today:
- Pruning on the inference side (see the pruning sketch below)
- Dropout
- Structured sparsity (e.g. sparse attention)
Promising: GNNs
Sparsity in NNs is low by HPC standards (HPC considers ≥ 98% zeros to be sparse)
Brains may be sparse
Science: how can we make sparse training work?
Engineering: what are the sparse architectures that are worth building?
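As referenced in the list above, a minimal magnitude-pruning sketch (layer size, sparsity target, and function names are illustrative, not from any particular system): zero the smallest-magnitude weights of a trained layer, keep the mask, and notice that even an aggressive 90% is well short of HPC's ≥ 98% notion of sparse.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(256, 256))        # stand-in for a trained dense layer's weights

def magnitude_prune(W, sparsity):
    """Zero out the fraction `sparsity` of weights with the smallest magnitude."""
    threshold = np.quantile(np.abs(W), sparsity)
    mask = np.abs(W) >= threshold
    return W * mask, mask

W_pruned, mask = magnitude_prune(W, sparsity=0.90)
print("fraction of zero weights:", 1.0 - mask.mean())   # ~0.90
```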
Weird unscientific observations
Distillation → training a larger model and then shrinking it back down is more effective than training directly at the small size (see the loss sketch after this list)
Feedback alignment → replacing the transposed weights in backprop with fixed random feedback weights works surprisingly well
Lottery Ticket Hypothesis → sparse, accurate subnetworks already exist inside the randomly initialised architecture; we just have to chip away at the rest to find them?
Some factorisations work for CNNs → Inception (2014), depthwise separable convs (2016); see the parameter-count sketch after this list
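As noted against the distillation item, a minimal sketch of a Hinton-style distillation objective: the teacher and student models are left abstract, only the loss on one batch of logits is shown, and the temperature and weighting values are illustrative defaults.

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)          # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Blend hard cross-entropy on the labels with KL to the teacher's soft targets."""
    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    soft = np.mean(np.sum(p_teacher * (np.log(p_teacher) - np.log(p_student)), axis=-1))
    hard = np.mean(-np.log(softmax(student_logits)[np.arange(len(labels)), labels]))
    return alpha * hard + (1 - alpha) * (T * T) * soft   # T^2 keeps gradient scales comparable

# Toy batch: 8 examples, 10 classes; pretend the teacher logits came from the big model.
rng = np.random.default_rng(0)
teacher_logits = 3.0 * rng.normal(size=(8, 10))
student_logits = rng.normal(size=(8, 10))
labels = rng.integers(0, 10, size=8)
print(distillation_loss(student_logits, teacher_logits, labels))
```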
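And a back-of-the-envelope parameter count for the depthwise-separable factorisation (channel sizes are arbitrary): a k×k convolution from C_in to C_out channels is replaced by a k×k depthwise pass plus a 1×1 pointwise pass.

```python
k, c_in, c_out = 3, 128, 256

standard = k * k * c_in * c_out           # full k x k convolution kernel
separable = k * k * c_in + c_in * c_out   # depthwise pass + 1x1 pointwise pass

print("standard params: ", standard)      # 294,912
print("separable params:", separable)     # 33,920
print("reduction: %.1fx" % (standard / separable))   # ~8.7x
```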
Space race in language understanding
Ever larger machines: OpenAI's 10k-GPU cluster, 3,640 petaflop/s-days of training compute.
Takeaways
Science: ask why
- Why SGD works for training
- What numeric formats we need
- How sparsity can be used
Engineering:
- Sapir-Whorf hypothesis: the language you speak helps or hurts the concepts you can think about - the same applies to machines
- Can we do LHC-scale things under our desks?