
Nvidia Chips

William Dally
 
In the past we already had the models; what was missing:
  • Data (e.g. ImageNet)
  • Hardware
 
Deep learning is gated by hardware
 
Nvidia has doubled GPU performance roughly every year for 8 years
317x improvement over those 8 years - mostly the result of better architecture
 
2012: Kepler
  • 3.95 TFLOPS
  • FP32 math
  • 250 GB/s
  • 300w
2016: Pascal
  • 10.6 TFLOPS
  • 21.3 TFLOPS (FP16)
  • Dot product operation improvements
  • 732 GB/s
2017: Volta
  • Added Tensor Cores!
  • 15 TFLOPS (FP32)
  • 125 TFLOPS (FP16)
  • 900 GB/s
2018: Turing
  • Integer Tensor Cores!
  • 65 TFLOPS (FP32)
  • 130 TFLOPS (FP16)
  • 261 TOPS (Int8)
  • 672 GB/s (GDDR6 memory)
  • Ray Tracing!
2020: Ampere
  • Sparsity!
  • BF16 & TF32
  • 156/312 TFLOPS (TF32, dense/sparse)
  • missing...
 
Key improvements are in:
  • Number representation (gives the biggest wins; "what you want to do is Int8" - see the sketch below)
  • Complex instructions: DP4A, FFMA, IMMA
  • Only a bit (~2x) from process technology
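
To make the Int8 point concrete, a minimal sketch (my illustration, not Dally's exact scheme; function names and shapes are made up): give each vector its own scale factor so the 8 bits cover exactly that vector's range, multiply in Int8, accumulate exactly in Int32, and undo the scales at the end - the pattern that dot-product instructions like DP4A accelerate.

```python
import numpy as np

def quantize_per_vector(x):
    """Int8 quantization with one scale per vector (row).

    Putting the bits where they do the most good: each row gets its own
    scale, so an outlier in one row doesn't waste the 8-bit range of another.
    """
    scale = np.abs(x).max(axis=1, keepdims=True) / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def int8_matmul(qa, sa, qb, sb):
    """Int8 multiplies, exact Int32 accumulation, float rescale at the end."""
    acc = qa.astype(np.int32) @ qb.astype(np.int32).T   # what DP4A-style ops do
    return acc * (sa * sb.T)                            # undo the two scales

a = np.random.randn(4, 64).astype(np.float32)
b = np.random.randn(4, 64).astype(np.float32)
qa, sa = quantize_per_vector(a)
qb, sb = quantize_per_vector(b)
print(np.abs(int8_matmul(qa, sa, qb, sb) - a @ b.T).max())  # small quantization error
```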
 
Accelerators: what exactly is an accelerator?
  • Start with a matrix multiplier
  • Tiling (see the sketch after this list)
    • Maximise re-use from the memory hierarchy
    • The number of levels and the tile sizes are free variables to optimise
  • Exploit sparsity
    • Compression
    • Data gating ("I am a zero")
    • Sparse computation (only got there with Ampere - now gives a 2x gain)
  • Number representation
    • Coding
    • Scaling (put the bits where they do the most good)
    • Scale per vector (cf. the Int8 sketch above)
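
A toy sketch of the tiling idea (my illustration - the single tile size T and the two-level hierarchy are assumptions; real designs nest registers, SRAM and DRAM, with the number of levels and the sizes at each level as the free variables):

```python
import numpy as np

def tiled_matmul(A, B, T=32):
    """C = A @ B computed in T x T tiles.

    Each A- and B-tile is fetched once into a small buffer and then reused
    across a whole output tile - the re-use a memory hierarchy exploits.
    """
    M, K = A.shape
    K2, N = B.shape
    assert K == K2 and M % T == 0 and N % T == 0 and K % T == 0  # toy assumption
    C = np.zeros((M, N), dtype=A.dtype)
    for i in range(0, M, T):
        for j in range(0, N, T):
            acc = np.zeros((T, T), dtype=A.dtype)   # lives in "registers"
            for k in range(0, K, T):
                a = A[i:i+T, k:k+T]                 # one "SRAM" fetch...
                b = B[k:k+T, j:j+T]
                acc += a @ b                        # ...reused across the tile
            C[i:i+T, j:j+T] = acc
    return C

A, B = np.random.randn(64, 64), np.random.randn(64, 64)
assert np.allclose(tiled_matmul(A, B), A @ B)
```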
 

Logarithmic Number Representations

Can get the same accuracy with fewer bits
Maths then takes up a smaller proportion of energy usage
 

Number Representations

Two considerations: range & accuracy (plus structure)
(see chart of these for FP32/16 & Int32/16/8 - copy-paste his slide image)
 
Log base 2 is a floating-point number without a mantissa (the fraction bits move into the exponent)
Log trades dynamic range for accuracy via the choice of base
Integer representations have constant absolute error, which is a large percentage error around zero - exactly the range of numbers used in DL! Log representations keep the relative error roughly constant (quick illustration below)
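
A quick numeric illustration of that point (my own, not from the slides; the ranges and bit widths are assumed): integer formats have a fixed absolute step, so relative error explodes near zero, while a log format's relative error is the same everywhere.

```python
import numpy as np

x = np.array([0.01, 0.1, 1.0, 10.0])

# Integer-style: constant step size -> constant absolute error
step = 10.0 / 127                        # Int8 covering [-10, 10] (assumed range)
rel_int = (step / 2) / x                 # worst-case relative error per value

# Log-style: constant step in log2(x) -> constant relative error
f = 4                                    # fractional exponent bits (assumed)
rel_log = (2.0 ** (2.0 ** -(f + 1)) - 1) * np.ones_like(x)

print(rel_int)   # ~3.9, 0.39, 0.039, 0.0039 - blows up near zero
print(rel_log)   # ~0.022 everywhere
```
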
Multiply → add: in his example a ~10x energy reduction
But what happens to the add?
  • Classically: convert back to integer form - expensive!
  • But they discovered there's a trick to make this much cheaper
  • They do this by factoring out the fractional part of the exponent, I believe (see the sketch below)
  • Rewatch this bit!
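
My reconstruction of that trick, pieced together from other Dally material - treat it as an assumption (the F/BIAS parameters and function names are mine): split each log value into integer part q and fractional part r, so 2^(q + r/SCALE) = 2^q · 2^(r/SCALE); r can take only 2^F distinct values, so keep one shift-and-add accumulator per value of r and convert to linear once per bucket at the end, instead of once per product.

```python
import numpy as np

F = 4                    # fractional exponent bits (assumed)
SCALE = 1 << F           # log values stored as fixed point: round(log2(x) * SCALE)
BIAS = 32                # keeps shift amounts non-negative in this toy version

def to_lns(x):
    """Encode positive reals as quantized base-2 log values (plain integers).
    Positive values only - a real LNS adds a sign bit and a special zero."""
    return np.round(np.log2(x) * SCALE).astype(np.int64)

def lns_dot(ea, eb):
    """Dot product in the log domain.

    Multiplication is just integer addition of log values. For the sum,
    rather than converting every product back to linear form (expensive),
    factor 2^(q + r/SCALE) into 2^q * 2^(r/SCALE): r takes only SCALE
    distinct values, so accumulate 2^q with shifts into one accumulator
    per r, and apply the 2^(r/SCALE) constants once at the very end.
    """
    e = ea + eb                            # multiply = add in the log domain
    q, r = e // SCALE, e % SCALE           # integer / fractional exponent parts
    acc = [0] * SCALE                      # one integer accumulator per r
    for qi, ri in zip(q, r):
        acc[ri] += 1 << (int(qi) + BIAS)   # shift-and-add, no multiplier
    return sum(acc[ri] * 2.0 ** (ri / SCALE) for ri in range(SCALE)) / 2.0 ** BIAS

a = np.array([1.5, 2.0, 3.7])
b = np.array([0.9, 1.1, 4.2])
print(lns_dot(to_lns(a), to_lns(b)), "vs", a @ b)   # close, up to quantization error
```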
 

Conclusion

GPU inference performance doubling every year - largest gains from number representations

Q & A

We see a doubling of performance every year coming from architecture - different from the Moore's-law era, when gains came from process improvement and from exploiting hidden parallelism
"to build good accelerators we almost always have to change the algorithm" - for optimisation algorithms (e.g. ADAM) we can come up with alternatives that work better with hardware
View a GPU as a platform for domain-specific accelerators - GPUs themselves are very expensive to develop.
Structured sparsity is currently only for inference - partly because we tend not to start with sparsity; it is learned during training and then possibly exploited at inference time (sketch below)
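
For reference, a sketch of the 2:4 structured-sparsity pattern Ampere exploits at inference (my illustration; the pruning rule shown - keep the two largest magnitudes per group of four - is the standard one, but function names are mine):

```python
import numpy as np

def prune_2_of_4(w):
    """Enforce Ampere-style 2:4 structured sparsity along the last axis.

    In every contiguous group of 4 weights, keep the 2 largest-magnitude
    entries and zero the other 2. Hardware stores only the survivors plus
    2-bit indices per group and skips the zero multiplies (~2x throughput).
    """
    g = w.reshape(-1, 4)
    drop = np.argsort(np.abs(g), axis=1)[:, :2]    # the 2 smallest per group
    pruned = g.copy()
    np.put_along_axis(pruned, drop, 0.0, axis=1)
    return pruned.reshape(w.shape)

w = np.random.randn(2, 8).astype(np.float32)
print(prune_2_of_4(w))   # exactly two zeros in every group of four
```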