William Dally
The models themselves have existed for decades; deep learning also needed:
- Data (e.g. ImageNet)
- Hardware
Deep learning is gated by hardware
Nvidia has doubled GPU performance every year for 8 years
317x improvement over those 8 years - mostly as a result of better architecture
2012: Kepler
- 3.95 TFLOPS
- FP32 math
- 250 GB/s
- 300 W
2016: Pascal
- 10.6 TFLOPS
- 21.3 TFLOPS (FP16)
- Dot-product instruction improvements
- 732 GB/s
2017: Volta
- Added Tensor Cores!
- 15 TFLOPS (FP32)
- 125 TFLOPS (FP16)
- 900 GB/s
2018: Turing
- Integer Tensor Cores!
- 65 TFLOPS (FP16 with FP32 accumulate)
- 130 TFLOPS (FP16)
- 261 TOPS (Int8)
- 672 GB/s (GDDR6 memory)
- Ray Tracing!
2020: Ampere
- Sparsity!
- BF16 & TF32
- 156/312 TFLOPS (TF32, dense/sparse)
- missing...
Key improvements are in:
- Number representation (gives biggest wins; "what you want to do is INT8" - see the quantization sketch below)
- Complex instructions: DP4A, FFMA, IMMA (amortise instruction overhead over many maths ops)
- Only a little (~2x) from process technology
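A minimal sketch of the INT8 idea, assuming symmetric per-tensor quantization (function names and the 64x64 sizes are illustrative, not NVIDIA's implementation):

```python
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor INT8 quantization: map max |x| to 127."""
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

# INT8 matmul with wide (int32) accumulation - the pattern that
# IMMA-style tensor core instructions implement in hardware.
x = np.random.randn(64, 64).astype(np.float32)
w = np.random.randn(64, 64).astype(np.float32)
qx, sx = quantize_int8(x)
qw, sw = quantize_int8(w)
y = (qx.astype(np.int32) @ qw.astype(np.int32)) * (sx * sw)
print(np.abs(y - x @ w).max())  # small quantization error
```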
Accelerators: what exactly is an accelerator?
- Start with a matrix multiplier
- Tiling
- Maximise re-use from memory hierarchy
- Number of levels and their sizes are free variables to be optimised (see the tiled-matmul sketch after this list)
- Exploit sparsity
- Compression
- Data gating ("I am a zero")
- Sparse computation (only got there with Ampere - now gives 2x gain)
- Number representation
- Coding
- Scaling (put the bits where they do the most good)
- Scale per vector (per-vector scale factors; see the sketch below)
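A toy sketch of the tiling idea, assuming a single on-chip buffer level; the block size and names are illustrative:

```python
import numpy as np

def tiled_matmul(a, b, tile=16):
    """Blocked matmul: each operand tile is fetched once into a small
    buffer (standing in for on-chip SRAM/registers) and reused across a
    whole tile of outputs, cutting traffic to the next memory level."""
    m, k = a.shape
    _, n = b.shape
    assert m % tile == 0 and n % tile == 0 and k % tile == 0
    c = np.zeros((m, n), dtype=a.dtype)
    for i in range(0, m, tile):
        for j in range(0, n, tile):
            acc = np.zeros((tile, tile), dtype=a.dtype)
            for p in range(0, k, tile):
                # one fetch of each tile feeds tile^3 multiply-adds
                acc += a[i:i+tile, p:p+tile] @ b[p:p+tile, j:j+tile]
            c[i:i+tile, j:j+tile] = acc
    return c
```

The reuse factor grows with the tile size; stacking several such levels gives the multi-level hierarchy whose level count and sizes are the free variables above.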
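And a hedged sketch of per-vector scaling: one INT8 scale factor per small block, so each vector spends its bits on its own dynamic range. The block length of 32 is an assumption:

```python
import numpy as np

def quantize_per_vector(x, vec=32):
    """INT8 quantization with one scale per length-`vec` block; an
    outlier only degrades the block it lives in, not the whole tensor."""
    assert x.size % vec == 0
    blocks = x.reshape(-1, vec)
    scales = np.abs(blocks).max(axis=1, keepdims=True) / 127.0
    scales[scales == 0] = 1.0  # all-zero blocks: avoid divide-by-zero
    q = np.clip(np.round(blocks / scales), -127, 127).astype(np.int8)
    return q, scales  # dequantize with q * scales
```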
Logarithmic Number Representations
Can get the same accuracy with fewer bits
The maths then takes up a smaller proportion of the energy budget
Number Representations
Two considerations: range & accuracy (plus structure)
(see his slide charting range vs accuracy for FP32/FP16 and Int32/16/8)
A base-2 log number is a floating-point number without a mantissa
Log trades dynamic range for accuracy by the choice of base
Log representations: integer errors are constant in absolute terms, so the relative error is largest around zero - which is exactly the range of numbers used in DL! Log keeps the relative error roughly constant there.
Multiply → add: in his example a ~10x reduction (see the log-arithmetic sketch below)
But what happens to the add?
- Classically: convert back to integer form - expensive!
- But they discovered there's a trick to make this much cheaper
- They do this by factoring out ?
- Rewatch this bit!
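A toy illustration of log-domain arithmetic (not the factoring trick itself, which these notes don't capture): multiply reduces to an exponent add, while the naive add converts back to linear form, which is the expensive step. Base 2 and float exponents here are assumptions for illustration:

```python
import numpy as np

# Toy log-number system: magnitude stored as log2|x|, sign kept aside.
def to_log(x):
    return np.sign(x), np.log2(np.abs(x))

def log_mul(s1, e1, s2, e2):
    # multiplication is just an adder on the exponents - cheap
    return s1 * s2, e1 + e2

def log_add_naive(s1, e1, s2, e2):
    # the classical add: back to linear, add, back to log - expensive
    x = s1 * 2.0 ** e1 + s2 * 2.0 ** e2
    return np.sign(x), np.log2(np.abs(x))

s, e = log_mul(*to_log(0.75), *to_log(-1.5))
print(s * 2.0 ** e)  # -1.125 == 0.75 * -1.5
```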
Conclusion
GPU inference performance doubling every year - largest gains from number representations
Q & A
We see a doubling of performance every year from architecture - unlike the Moore's-law era of process improvement and exploiting hidden parallelism
"to build good accelerators we almost always have to change the algorithm" - for optimisation algorithms (e.g. ADAM) we can come up with alternatives that work better with hardware
View a GPU as a platform for domain-specific accelerators - GPUs themselves very expensive to develop.
Structured sparsity is currently inference-only - partly because we tend not to start with sparsity; it is learned, then possibly exploited at inference time (see the 2:4 pruning sketch below)
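A hedged sketch of the Ampere-style 2:4 pattern that makes this exploitable: keep the two largest-magnitude weights in each group of four. Magnitude-based selection is an assumption here; in practice the mask comes from pruning plus fine-tuning:

```python
import numpy as np

def prune_2_of_4(w):
    """2:4 structured pruning: in every group of 4 weights along a row,
    zero the two smallest magnitudes; sparse tensor cores then skip the
    zeros for the ~2x inference gain."""
    rows, cols = w.shape
    assert cols % 4 == 0
    grouped = w.copy().reshape(rows, cols // 4, 4)
    drop = np.argsort(np.abs(grouped), axis=-1)[..., :2]  # two smallest
    np.put_along_axis(grouped, drop, 0.0, axis=-1)
    return grouped.reshape(rows, cols)

w = np.random.randn(4, 8).astype(np.float32)
print(prune_2_of_4(w))  # exactly two non-zeros per group of four
```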