William Dally
The models themselves have existed for decades; deep learning also needed:
- Data (e.g. ImageNet)
- Hardware
Deep learning is gated by hardware
Nvidia has doubled GPU performance every year for 8 years
317x improvement over those 8 years - mostly as a result of better architecture
2012: Kepler
- 3.95 TFLOPS
- FP32 math
- 250 GB/s
- 300 W
2016: Pascal
- 10.6 TFLOPS
- 21.3 TFLOPS (FP16)
- Dot-product instruction improvements
- 732 GB/s
2017: Volta
- Added Tensor Cores!
- 15 TFLOPS (FP32)
- 125 TFLOPS (FP16)
- 900 GB/s
2018: Turing
- Integer Tensor Cores!
- 65 TFLOPS (FP16 with FP32 accumulate)
- 130 TFLOPS (FP16)
- 261 TOPS (Int8)
- 672 GB/s (GDDR6 memory)
- Ray Tracing!
2020: Ampere
- Sparsity!
- BF16 & TF32
- 156/312 TFLOPS (TF32, dense/sparse)
- missing...
Key improvements are in:
- Number representation (gives biggest wins; "what you want to do is INT8" - see the quantization sketch below)
- Complex instructions: DP4A, FFMA, IMMA (amortise instruction overhead over many maths ops)
- Only a little (~2x) from process technology
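A minimal sketch of the INT8 idea, assuming symmetric per-tensor quantization (function names and the 64x64 sizes are illustrative, not NVIDIA's implementation):

```python
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor INT8 quantization: map max |x| to 127."""
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

# INT8 matmul with wide (int32) accumulation - the pattern that
# IMMA-style tensor core instructions implement in hardware.
x = np.random.randn(64, 64).astype(np.float32)
w = np.random.randn(64, 64).astype(np.float32)
qx, sx = quantize_int8(x)
qw, sw = quantize_int8(w)
y = (qx.astype(np.int32) @ qw.astype(np.int32)) * (sx * sw)
print(np.abs(y - x @ w).max())  # small quantization error
```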
Accelerators: what exactly is an accelerator?
- Start with a matrix multiplier
- Tiling
- Maximise re-use from memory hierarchy
- Number of levels and their sizes are free variables to be optimised (see the tiled-matmul sketch after this list)
- Exploit sparsity
- Compression
- Data gating ("I am a zero")
- Sparse computation (only got there with Ampere - now gives 2x gain)
- Number representation
- Coding
- Scaling (put the bits where they do the most good)
- Scale per vector (per-vector scale factors; see the sketch below)
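A toy sketch of the tiling idea, assuming a single on-chip buffer level; the block size and names are illustrative:

```python
import numpy as np

def tiled_matmul(a, b, tile=16):
    """Blocked matmul: each operand tile is fetched once into a small
    buffer (standing in for on-chip SRAM/registers) and reused across a
    whole tile of outputs, cutting traffic to the next memory level."""
    m, k = a.shape
    _, n = b.shape
    assert m % tile == 0 and n % tile == 0 and k % tile == 0
    c = np.zeros((m, n), dtype=a.dtype)
    for i in range(0, m, tile):
        for j in range(0, n, tile):
            acc = np.zeros((tile, tile), dtype=a.dtype)
            for p in range(0, k, tile):
                # one fetch of each tile feeds tile^3 multiply-adds
                acc += a[i:i+tile, p:p+tile] @ b[p:p+tile, j:j+tile]
            c[i:i+tile, j:j+tile] = acc
    return c
```

The reuse factor grows with the tile size; stacking several such levels gives the multi-level hierarchy whose level count and sizes are the free variables above.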
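And a hedged sketch of per-vector scaling: one INT8 scale factor per small block, so each vector spends its bits on its own dynamic range. The block length of 32 is an assumption:

```python
import numpy as np

def quantize_per_vector(x, vec=32):
    """INT8 quantization with one scale per length-`vec` block; an
    outlier only degrades the block it lives in, not the whole tensor."""
    assert x.size % vec == 0
    blocks = x.reshape(-1, vec)
    scales = np.abs(blocks).max(axis=1, keepdims=True) / 127.0
    scales[scales == 0] = 1.0  # all-zero blocks: avoid divide-by-zero
    q = np.clip(np.round(blocks / scales), -127, 127).astype(np.int8)
    return q, scales  # dequantize with q * scales
```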
Logarithmic Number Representations
Can get the same accuracy with fewer bits
The maths then takes up a smaller proportion of the energy budget
Number Representations
Two considerations: range & accuracy (plus structure)
(see his slide charting range vs accuracy for FP32/FP16 and Int32/16/8)
A base-2 log number is a floating-point number without a mantissa
Log trades dynamic range for accuracy by the choice of base
Log representations: integer errors are constant in absolute terms, so the relative error is largest around zero - which is exactly the range of numbers used in DL! Log keeps the relative error roughly constant there.
Multiply → add: in his example a ~10x reduction (see the log-arithmetic sketch below)
But what happens to the add?
- Classically: convert back to integer form - expensive!
- But they discovered there's a trick to make this much cheaper
- They do this by factoring out ?
- Rewatch this bit!
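A toy illustration of log-domain arithmetic (not the factoring trick itself, which these notes don't capture): multiply reduces to an exponent add, while the naive add converts back to linear form, which is the expensive step. Base 2 and float exponents here are assumptions for illustration:

```python
import numpy as np

# Toy log-number system: magnitude stored as log2|x|, sign kept aside.
def to_log(x):
    return np.sign(x), np.log2(np.abs(x))

def log_mul(s1, e1, s2, e2):
    # multiplication is just an adder on the exponents - cheap
    return s1 * s2, e1 + e2

def log_add_naive(s1, e1, s2, e2):
    # the classical add: back to linear, add, back to log - expensive
    x = s1 * 2.0 ** e1 + s2 * 2.0 ** e2
    return np.sign(x), np.log2(np.abs(x))

s, e = log_mul(*to_log(0.75), *to_log(-1.5))
print(s * 2.0 ** e)  # -1.125 == 0.75 * -1.5
```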
Conclusion
GPU inference performance doubling every year - largest gains from number representations
Q & A
We see a doubling of performance every year from architecture - unlike the Moore's-law era of process improvement and exploiting hidden parallelism
"to build good accelerators we almost always have to change the algorithm" - for optimisation algorithms (e.g. ADAM) we can come up with alternatives that work better with hardware
View a GPU as a platform for domain-specific accelerators - GPUs themselves very expensive to develop.
Structured sparsity is currently inference-only - partly because we tend not to start with sparsity; it is learned, then possibly exploited at inference time (see the 2:4 pruning sketch below)
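A hedged sketch of the Ampere-style 2:4 pattern that makes this exploitable: keep the two largest-magnitude weights in each group of four. Magnitude-based selection is an assumption here; in practice the mask comes from pruning plus fine-tuning:

```python
import numpy as np

def prune_2_of_4(w):
    """2:4 structured pruning: in every group of 4 weights along a row,
    zero the two smallest magnitudes; sparse tensor cores then skip the
    zeros for the ~2x inference gain."""
    rows, cols = w.shape
    assert cols % 4 == 0
    grouped = w.copy().reshape(rows, cols // 4, 4)
    drop = np.argsort(np.abs(grouped), axis=-1)[..., :2]  # two smallest
    np.put_along_axis(grouped, drop, 0.0, axis=-1)
    return grouped.reshape(rows, cols)

w = np.random.randn(4, 8).astype(np.float32)
print(prune_2_of_4(w))  # exactly two non-zeros per group of four
```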