🔟 Number Formats


Motivation

From 2012 to 2022, how much did GPU performance improve?
300x
What are the key factors underlying GPU performance improvement from 2012 to 2022?
  1. Numerics
  1. Process transistor density
  1. Clock speed
  1. Architectural refinements
How much have numerics improved GPU performance from 2012 to 2022, and why?
16x: from fp32 to fp8 (the compute increase is quadratic in the bit-width reduction: (32/8)^2 = 16)
Why do smaller number formats lead to quadratic performance improvements?
Silicon area for arithmetic is quadratic in bit-width
How much has process transistor density improved GPU performance from 2012 to 2022, and from what process nodes?
8x from 28nm to 5nm
What does the nm value refer to for a process node?
Nothing, it’s purely a marketing term
What did the nm for a process node traditionally refer to?
The length of the transistor gate
How much has clock speed improved GPU performance from 2012 to 2022?
1.7x (at a power cost)
How much have architectural refinements improved GPU performance from 2012 to 2022?
1.4x

Integer representations

Four main signed int representations
  1. sign–magnitude
  1. ones' complement
  1. two's complement
  1. offset binary

Sign-magnitude

Sign-magnitude integer representation
A sign bit s followed by magnitude bits m: value = (-1)^s × m
Cons of sign-magnitude integer representation
  1. Two ways of representing 0
  1. As a consequence, the range is only ±(2^(n-1) - 1)
  1. Addition & subtraction require different behaviour depending on the sign bit

One’s complement

One’s complement integer representation
Negative x is represented as the bitwise NOT of x; the most-significant bit acts as the sign
How do you negate a one’s complement integer?
Flip the bits
Pros of one’s complement integer representation vs sign-magnitude
  1. Addition & subtraction are the same as for unsigned integers, with the exception of the end-around carry
Cons of one’s complement integer representation vs two’s complement
  1. Two ways of representing 0
  1. As a consequence, the range is only ±(2^(n-1) - 1)
  1. Addition & subtraction require end-around carry
What is end-around carry in arithmetic?
Adding an ‘overflowing’ carry at the most-significant bit back to the least significant bit
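A minimal Python sketch of this (4-bit words assumed; helper name illustrative):

```python
# Sketch: 4-bit ones' complement addition with end-around carry.
def ones_complement_add(a: int, b: int, bits: int = 4) -> int:
    mask = (1 << bits) - 1
    total = (a & mask) + (b & mask)
    if total > mask:                  # carry out of the most-significant bit...
        total = (total + 1) & mask    # ...is added back at the least-significant bit
    return total

# -2 is 1101 (bitwise NOT of 0010); 3 + (-2) = 0011 + 1101 = 1|0000 -> carry -> 0001 = 1
assert ones_complement_add(0b0011, 0b1101) == 0b0001
```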

Two’s complement

Two’s complement integer representation
Negative x is represented by the bit pattern of the unsigned value 2^n - x
How do you negate a two’s complement integer?
Flip the bits and add 1
Pros of two’s complement integer representation
  1. One way of representing 0
  1. Range is -2^(n-1) … 2^(n-1) - 1
  1. Addition & subtraction can ignore the sign bit, relying on the natural modulo-2^n overflow behaviour
Alternative interpretation for the sign bit in the two’s complement integer representation
Most-significant bit represents -2^(n-1); the remaining bits keep their usual positive weights
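A small sketch of this reading (illustrative helper, not from the source):

```python
# Value of an n-bit two's complement string: the MSB contributes -2^(n-1),
# all other bits their usual positive weights.
def twos_complement_value(bits: int, n: int) -> int:
    msb = -(1 << (n - 1)) if bits & (1 << (n - 1)) else 0
    return msb + (bits & ((1 << (n - 1)) - 1))

assert twos_complement_value(0b1111, 4) == -1  # negate 1: flip 0001 -> 1110, add 1
assert twos_complement_value(0b1000, 4) == -8
assert twos_complement_value(0b0111, 4) == 7
```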

Offset binary

Offset binary integer representation
value = (unsigned value of the bits) - bias
Typical bias used for offset binary integer representation
2^(n-1) (IEEE 754 exponents use 2^(n-1) - 1)

Real number representations

Fixed-point representations

Fixed-point number representation
x = m × s, where m is a signed integer representation (typically two’s complement) and s is a fixed scale factor
What signed integer representation is typically used in a fixed-point mantissa?
Two’s complement
What does the “fixed point” mean in a fixed-point representation?
(Assuming a power-of-2 scale) the binary representation of our number is just the integer’s bits with a binary point at a fixed location
“Everyday” instance of a fixed-point representation
Currency: e.g. prices stored as a whole number of cents (an integer count with a fixed scale of 1/100)
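A minimal sketch of such a scheme (power-of-2 scale assumed, as above; names illustrative):

```python
# Fixed point: value = integer * fixed scale; here the binary point sits
# 8 bits from the right (scale = 2^-8).
SCALE_BITS = 8

def to_fixed(x: float) -> int:
    return round(x * (1 << SCALE_BITS))   # stored as a plain integer

def from_fixed(q: int) -> float:
    return q / (1 << SCALE_BITS)

q = to_fixed(3.14159)
print(q, from_fixed(q))  # 804 3.140625 -- quantisation error < 2^-9
```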
Why are fixed-point representations a poor fit for machine learning?
  1. If scale is too small, quantisation error rapidly becomes large
  1. If scale is too large, clipping error rapidly becomes large
  1. The “sweet spot” is very narrow

IEEE 754 floating-point standard

What number is the IEEE floating-point standard?
IEEE 754
What’s the difference between an arithmetic format and an interchange format?
Arithmetic format: the mathematical values represented
Interchange format: how these values are realised as bit strings
What do you need to specify an IEEE 754 float format?
  1. an exponent range (really just e_max, as typically e_min = 1 - e_max)
  1. # mantissa bits
  1. base
What do you need to specify a value in an IEEE 754 float format?
  1. sign bit
  1. exponent bits
  1. mantissa bits
Other words for mantissa?
  1. Significand
  1. Coefficient
(there are more)
Other word for base (of a number)?
Radix
IEEE 754 binary interchange format representation
(-1)^s × 1.m × 2^(e - bias)
Where: s = sign bit, e = unsigned value of the exponent bits, m = mantissa bits (1.m includes the implicit leading 1), bias = 2^(k-1) - 1 for k exponent bits
Plus special values
How to “read” an IEEE 754 interchange format exponent bit string
  1. The bias bit pattern has all bits set to one except the leading bit (0111…1)
  1. Work out the value of the bits you’d have to add to this pattern to get your bit string (or vice versa)
  1. Take the direction (add vs subtract) as the sign of the exponent
  1. The scale factor is 2^(that value)
How to “read” an IEEE 754 interchange format mantissa bit string
  1. Cut off all trailing zeros
  1. Take the unsigned value of the remaining bits
  1. Divide by 2^(number of remaining bits) (i.e. a 1000… value with one extra bit to the left)
  1. Add 1 (the implicit leading bit)
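A sketch that mechanises both “reading” recipes for normal binary32 values (struct-based bit access; not from the source):

```python
import struct

# Decode a normal FP32 value into (sign, unbiased exponent, mantissa).
def decode_fp32(x: float) -> tuple[int, int, float]:
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31
    exponent = ((bits >> 23) & 0xFF) - 127    # subtract the bias (0111 1111 = 127)
    mantissa = 1 + (bits & 0x7FFFFF) / 2**23  # bits / 2^(#bits), then + 1
    return sign, exponent, mantissa

assert decode_fp32(-6.5) == (1, 2, 1.625)     # -6.5 = -1.625 * 2^2
```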
What is the proper IEEE 754 name for fp32 & fp16?
binary32 and binary16
IEEE 754 special values?
  1. Infinity (+ve & -ve)
  1. NaN (quiet & signalling)
  1. Subnormal numbers
IEEE 754 infinity representation
  1. Exponent all 1s
  1. Mantissa all 0s
(sign bit either)
How to signal an IEEE 754 NaN
  1. Exponent all 1s
  1. Mantissa non-zero (any non-zero value)
How to interpret an IEEE 754 NaN mantissa
  1. 1st mantissa bit = 1 → qNaN, otherwise signalling NaN
  1. Rest of mantissa = payload
How to signal an IEEE 754 subnormal number
Exponent all zeros
IEEE 754 subnormal number representation
Exponent = 1 - bias (e_min, with the exponent field all zeros); the implicit leading mantissa bit becomes 0
How do IEEE 754 floats typically resolve rounding ties?
Round to nearest; ties go to the candidate whose final mantissa bit is 0 (even)
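A small numpy illustration (assuming numpy’s float16 cast, which rounds to nearest, ties to even):

```python
import numpy as np

# fp16 spacing above 2048 is 2, so 2049 and 2051 are exact halfway ties.
assert np.float16(2049) == np.float16(2048)  # tie -> even mantissa (...0)
assert np.float16(2051) == np.float16(2052)  # tie -> even mantissa (...0)
```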
Max value of an IEEE 754 float?
(2 - 2^-p) × 2^(e_max), for p mantissa bits
Approximate max value of an IEEE 754 float?
2^(e_max + 1)
Bias of an IEEE 754 float?
2^(k-1) - 1, for k exponent bits
Max exponent for an IEEE 754 float?
e_max = 2^(k-1) - 1 (= the bias)
Min exponent for an IEEE 754 float?
e_min = 1 - e_max
Absolute min normal value of an IEEE 754 float?
2^(e_min)
Absolute min subnormal value of an IEEE 754 float?
2^(e_min - p), for p mantissa bits
Basic approach to adding (or subtracting) IEEE 754 floats?
  1. Increase precision
  1. Shift the number with the smaller exponent to the right until exponents match, including implicit leading 1
  1. Add / subtract, adjusting exponent if necessary
  1. Round back to original precision
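A rough sketch of this recipe on exact (mantissa, exponent) integer pairs, value = m × 2^e (illustrative only; real hardware also renormalises the result):

```python
GUARD = 3  # extra low bits kept while shifting ("increase precision")

def float_add(m1: int, e1: int, m2: int, e2: int) -> tuple[int, int]:
    if e1 < e2:                          # ensure (m1, e1) has the larger exponent
        m1, e1, m2, e2 = m2, e2, m1, e1
    m1, m2 = m1 << GUARD, m2 << GUARD    # widen both mantissas
    m2 >>= e1 - e2                       # shift the smaller-exponent mantissa right
    total = m1 + m2
    return (total + (1 << (GUARD - 1))) >> GUARD, e1  # round back (half-up, for brevity)

assert float_add(3, 2, 2, 1) == (4, 2)   # (3 * 2^2) + (2 * 2^1) = 12 + 4 = 16
```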
Why does the “shift trick” work when adding/subtracting IEEE 754 floats?
Dividing the mantissa by 2^k to raise the exponent by k is the same as shifting the mantissa right by k, so long as the implicit leading 1 is included in the shift
Basic approach to multiplying IEEE 754 floats?
  1. Add exponents
  1. Multiply mantissas in higher precision
  1. Round mantissa & adjust exponent if necessary
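A quick worked example of this recipe (values chosen for illustration): (1.5 × 2^3) × (1.25 × 2^2) = (1.5 × 1.25) × 2^(3+2) = 1.875 × 2^5 = 60. No exponent adjustment is needed here since 1.875 < 2; a product mantissa ≥ 2 would be shifted right by one with the exponent incremented.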
What is a binade?
A set of floating-point numbers with the same exponent value

FP32

How many exponent & mantissa bits in FP32
(8, 23)
FP32 approximate max value (decimal)
~3.4 × 10^38
FP32 approximate max value (power of 2)
2^128
FP32 approximate min absolute normal value (decimal)
~1.2 × 10^-38
FP32 min absolute normal value (power of 2)
2^-126
FP32 bias
127

TF32

How many exponent & mantissa bits in TF32
(8, 10)
What is the motivation for TF32?
Keep the range of FP32, but reduce the precision to that of FP16
How many bits are required to represent a number in TF32?
19
What is TF32?
A compute mode - not a number format
How are matmuls done in TF32?
  1. Inputs are rounded to TF32
  1. Products output in FP32
  1. Accumulation in FP32
When do GPUs do 32-bit compute in TF32?
By default (from Ampere onwards)
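For reference, frameworks gate this behaviour with flags; e.g. these PyTorch switches (library defaults have changed across versions, so treat this as a sketch rather than a statement of current defaults):

```python
import torch

# Allow fp32 matmuls / cuDNN convolutions to run as TF32 on tensor cores.
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True
```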

FP16

Benefits of reduced-precision formats
  1. Faster compute
  1. Reduced memory
  1. Reduced comms
How many exponent & mantissa bits in FP16
(5, 10)
FP16 max value (decimal)
65504
FP16 approximate max value (power of 2)
2^16
FP16 approximate min absolute normal value (decimal)
~6.1 × 10^-5
FP16 min absolute normal value (power of 2)
2^-14
FP16 bias
15
What techniques do they use in the mixed precision paper to enable FP16 training?
  1. Master weights in FP32
  1. Loss scaling
  1. FP32 partials
What might be a reason networks use FP32 master weights?
Swamping
What is swamping?
Loss of precision when adding a small number to a large one: exponent alignment right-shifts the small mantissa, truncating its low bits
How to avoid swamping in large dot products
  1. (Hierarchical) chunk-based accumulation
  1. Stochastic rounding
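A small numpy demonstration of swamping and the chunked fix (sizes arbitrary):

```python
import numpy as np

xs = np.full(4096, 0.25, dtype=np.float16)   # true sum = 1024

naive = np.float16(0)
for x in xs:
    naive = np.float16(naive + x)   # beyond 512, adding 0.25 is a half-ulp tie -> swamped

chunked = np.float16(0)
for chunk in xs.reshape(64, 64):    # sum small chunks first, then combine
    chunked = np.float16(chunked + chunk.sum(dtype=np.float16))

print(naive, chunked)  # 512.0 1024.0
```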

BFLOAT16

How many exponent & mantissa bits in bfloat16
(8, 7)
What is the motivation for bfloat16?
Keep the range of FP32, but reduce the bit width to 16
How to calculate max/min absolute values for bfloat16
They’re the same as FP32 (identical exponent width & bias); only the precision differs

FP8

What are the two different proposed sizes for FP8?
1.5.2 | 1.4.3
How does the GAQ proposed FP8 standard deviate from the IEEE 754 format?
  1. The all-1s exponent is no longer reserved for special values
  1. Negative zero is now reserved for both Inf & NaN
  1. The bias is 1 greater
FP8 1.5.2 approx max value (decimal)
57344 (~5.7 × 10^4), for both GAQ & NAI
FP8 1.5.2 approximate max value (power of 2)
~2^16, for both GAQ & NAI
FP8 1.5.2 min absolute normal value (power of 2)
GAQ: 2^-15
NAI: 2^-14
FP8 1.5.2 bias
GAQ: 16
NAI: 15
FP8 1.4.3 max value (decimal)
GAQ: 240
NAI: 448
FP8 1.4.3 min absolute normal value (power of 2)
GAQ: 2^-7
NAI: 2^-6
FP8 1.4.3 bias
GAQ: 8
NAI: 7
What is the recommended FP8 format for activations?
1.4.3
What is the recommended FP8 format for weights?
1.4.3
What is the recommended FP8 format for grad_xs?
1.5.2
What is the recommended FP8 format for grad_ws?
1.5.2
Why might the first layer be harder to train in FP8?
We may not have enough bits to represent certain input modalities such as pixel colours
Why might the last layer be harder to train in FP8?
  1. Large embeddings can give out-of-range values
  1. Softmax sensitive to low-precision values
How is the Nvidia 1.5.2 format different/similar to Graphcore’s?
It conforms exactly to the IEEE 754 conventions (bias 15, standard Inf & NaN encodings).
How does the Nvidia 1.4.3 format differ from Graphcore’s?
Similar:
  1. Single representation for INF & NaN
Different:
  1. Only the all-1s mantissa (with the all-1s exponent) represents NaN - other mantissa values under that exponent are valid numbers
  1. -ve Zero maintained
Why do Nvidia argue against using -0 as the single NaN representation?
  1. It breaks the +ve / -ve symmetry inherent in the IEEE 754 standard
  1. This may break algorithm implementations (e.g. comparison / sorting)
What FP8 format do Nvidia propose for doing quantisation from FP16?
  1. 1.4.3 (as there are no grads)
Where do Nvidia propose we add scaling factors for FP16-to-FP8 quantisation?
Just like standard int8:
Weights: per-channel
Activations: per-tensor
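A sketch of the per-tensor case (illustrative; the final FP8 cast is left to hardware / library kernels; 448 is the NAI 1.4.3 max from above):

```python
import numpy as np

FP8_E4M3_MAX = 448.0   # NAI 1.4.3 max value

def quantise_per_tensor(x: np.ndarray) -> tuple[np.ndarray, float]:
    # Scale so the largest magnitude maps onto the FP8 max; dequantise
    # later with x_q * scale.
    scale = float(np.abs(x).max()) / FP8_E4M3_MAX
    return x / scale, scale
```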

IPU

What are half-partials?
Rounding of the output of the AMP unit down to FP16.
What does a partial usually refer to on the IPU?
The final output of a single vector going through the RHS of the AMP unit.
What are the two main instruction types for accelerated compute on the IPU?
AMP and SLIC

AMP

Hierarchy of compute on an IPU (numbers not required)
  1. Tiles
  1. AMP sets
  1. AMP engines
  1. AMP units
How many tiles per IPU? (mk2)
1472
How many AMP sets per IPU tile? (mk2)
2
How many AMP engines per AMP set? (mk2)
4
How many AMP engines per IPU tile? (mk2)
8
How many AMP units per AMP engine? (mk2)
2
How many AMP units per AMP set? (mk2)
8
How many AMP units per IPU tile? (mk2)
16
How many AMP units per IPU? (mk2)
16 units/tile × 1472 tiles = 23,552
What is the basic unit of computation on an IPU tile in FP16, per instruction? (mk2)
A matmul
(we can then stream vectors on the RHS)
What is the basic unit of computation on an IPU tile in FP32, per instruction? (mk2)
A matmul
(we can then stream vectors on the RHS)
What is the basic unit of computation per IPU tile in FP16, per cycle? (mk2)
A matmul
(we can then stream vectors on the RHS)
What is the basic unit of computation per AMP unit in FP16, per instruction? (mk2)
An inner product
What is the basic unit of computation per AMP unit in FP16, per cycle? (mk2)
An inner product
What is a phase in an AMP unit?
A block of work that takes a cycle
How does a tile split a matmul over its AMP units?
  1. LHS rows mapped to AMP units
  1. LHS columns / RHS rows broken into 4 phases
How many cycles does it take a tile to perform a single matmul in FP16? (mk2)
4

MACs & FLOPs

How many MACs can each AMP unit perform per cycle in FP16? (mk2)
4
How many MACs can each tile perform per cycle in FP16? (mk2)
64
How many MACs can each tile perform per cycle in FP32? (mk2)
16
How many FLOPs per MAC?
2
How many FP16 FLOPs can an IPU process for each FP32 FLOP
4
How many FLOPs can each tile perform per cycle in FP16? (mk2)
128
How many FLOPs can each tile perform per cycle in FP32? (mk2)
32
How many FLOPs can each IPU perform per cycle in FP16? (mk2)
128 FLOPs * 1472 tiles ~= 188,000 FLOPs
IPU mk2 (original) clock speed
1.325GHz
IPU Bow clock speed
1.85GHz
Bow speedup vs original mk2?
40% (clock speed increase)
How many FLOP/s can each mk2 original IPU perform in FP16?
188,000 FLOPs per cycle * 1.325GHz ~= 250TFLOP/s
How many FLOP/s can each Bow IPU perform in FP16?
250TFLOP/s * 1.4 ~= 350TFLOP/s
Steps in an IPU AMP unit cycle
  1. Compute element-wise products (output in higher precision)
  1. Align result exponents
  1. Round down to FP32
  1. Sequential sum
  1. Accumulate to FP16 / FP32 partials (the FP16 process is covered in another card)
(recall: this is an inner product for FP16)

Stochastic rounding

IPU rounding options
  1. Round to Nearest (ties to even)
  1. Stochastic Rounding
What is the key benefit of stochastic rounding?
It reduces swamping
Stochastic rounding equations
round(x) = ⌊x⌋ with probability (⌈x⌉ - x) / (⌈x⌉ - ⌊x⌋), else ⌈x⌉ with probability (x - ⌊x⌋) / (⌈x⌉ - ⌊x⌋), where ⌊x⌋ / ⌈x⌉ are the nearest representable values below / above x
How is stochastic rounding implemented in practice?
  1. Generate a uniform random bit string, aligned to the portion to be truncated
  1. Add this to the higher-precision operand
  1. Sum
  1. Truncate
Why does the random bit-string implementation of stochastic rounding work?
  1. Using the extreme bit values is equivalent to floor and ceiling
  1. Intermediate values linearly interpolate
  1. With probability proportional to distance of sum from upper / lower bounds
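A numpy sketch of this implementation for fp32 → fp16 (13 = 23 - 10 truncated mantissa bits; NaN/Inf/subnormal edge cases ignored):

```python
import numpy as np

TRUNC_BITS = 13   # fp32 has 23 mantissa bits, fp16 keeps 10
rng = np.random.default_rng(0)

def stochastic_round_to_fp16(x: np.ndarray) -> np.ndarray:
    bits = x.astype(np.float32).view(np.uint32)
    noise = rng.integers(0, 1 << TRUNC_BITS, size=bits.shape, dtype=np.uint32)
    bits = bits + noise                        # a carry out decides up vs down
    bits &= ~np.uint32((1 << TRUNC_BITS) - 1)  # truncate the low bits
    return bits.view(np.float32).astype(np.float16)

# 2048.25 sits 1/8 of the way from 2048 to 2050 (fp16 ulp = 2):
x = np.full(100_000, 2048.25, dtype=np.float32)
print(stochastic_round_to_fp16(x).astype(np.float32).mean())  # ~2048.25 in expectation
```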

To adjust

Is the shift issue for mul or sum? How is the other one handled?

Open questions

  • Results re: stochastic rounding and batch size
  • For FP8 what might stay in FP16? Presumably master weights and opt state. What about grads for all-reduce / accumulation? Does it need to?
  • How is SNR computed?
  • Are partials within the AMP unit always in FP32? Does the outer partials sum have an FP32 part too? I think FP8 still accumulates into an FP32 partial, which can optionally be rounded down afterwards? Is this true? Could it be accumulated in FP16 (or BF16?)
  • Is stochastic rounding done for all of: phases, partials, exchange partials, grad acc, grad all-reduce (terminology here?)