🔟 Number Formats


Motivation

From 2012 to 2022, how much did GPU performance improve?
300x
What are the key factors underlying GPU performance improvement from 2012 to 2022?
  1. Numerics
  1. Process transistor density
  1. Clock speed
  1. Architectural refinements
How much have numerics improved GPU performance from 2012 to 2022, and why?
16x: from fp32 to fp8 (the compute increase is quadratic in the bit-width reduction: (32/8)^2 = 16)
Why do smaller number formats lead to quadratic performance improvements?
Silicon area for arithmetic is quadratic in bit-width
How much has process transistor density improved GPU performance from 2012 to 2022, and from what process nodes?
8x from 28nm to 5nm
What does the nm value refer to for a process node?
Nothing, it’s purely a marketing term
What did the nm for a process node traditionally refer to?
The length of the transistor gate
How much has clock speed improved GPU performance from 2012 to 2022?
1.7x (at a power cost)
How much have architectural refinements improved GPU performance from 2012 to 2022?
1.4x

Integer representations

Four main signed int representations
  1. sign–magnitude
  1. ones' complement
  1. two's complement
  1. offset binary

Sign-magnitude

Sign-magnitude integer representation
A sign bit s followed by magnitude bits m: value = (-1)^s × m
Cons of sign-magnitude integer representation
  1. Two ways of representing 0
  1. As a consequence, the range is only ±(2^(n-1) - 1)
  1. Addition & subtraction require different behaviour depending on the sign bit

One’s complement

One’s complement integer representation
Negative x is represented as the bitwise NOT of x; the most-significant bit acts as the sign
How do you negate a one’s complement integer?
Flip the bits
Pros of one’s complement integer representation vs sign-magnitude
  1. Addition & subtraction are the same as for unsigned integers, with the exception of the end-around carry
Cons of one’s complement integer representation vs two’s complement
  1. Two ways of representing 0
  1. As a consequence, the range is only ±(2^(n-1) - 1)
  1. Addition & subtraction require end-around carry
What is end-around carry in arithmetic?
Adding an ‘overflowing’ carry at the most-significant bit back to the least significant bit
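A minimal Python sketch of this (4-bit words assumed; helper name illustrative):

```python
# Sketch: 4-bit ones' complement addition with end-around carry.
def ones_complement_add(a: int, b: int, bits: int = 4) -> int:
    mask = (1 << bits) - 1
    total = (a & mask) + (b & mask)
    if total > mask:                  # carry out of the most-significant bit...
        total = (total + 1) & mask    # ...is added back at the least-significant bit
    return total

# -2 is 1101 (bitwise NOT of 0010); 3 + (-2) = 0011 + 1101 = 1|0000 -> carry -> 0001 = 1
assert ones_complement_add(0b0011, 0b1101) == 0b0001
```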

Two’s complement

Two’s complement integer representation
Negative x is represented by the bit pattern of the unsigned value 2^n - x
How do you negate a two’s complement integer?
Flip the bits and add 1
Pros of two’s complement integer representation
  1. One way of representing 0
  1. Range is -2^(n-1) … 2^(n-1) - 1
  1. Addition & subtraction can ignore the sign bit, relying on the natural modulo-2^n overflow behaviour
Alternative interpretation for the sign bit in the two’s complement integer representation
Most-significant bit represents -2^(n-1); the remaining bits keep their usual positive weights
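A small sketch of this reading (illustrative helper, not from the source):

```python
# Value of an n-bit two's complement string: the MSB contributes -2^(n-1),
# all other bits their usual positive weights.
def twos_complement_value(bits: int, n: int) -> int:
    msb = -(1 << (n - 1)) if bits & (1 << (n - 1)) else 0
    return msb + (bits & ((1 << (n - 1)) - 1))

assert twos_complement_value(0b1111, 4) == -1  # negate 1: flip 0001 -> 1110, add 1
assert twos_complement_value(0b1000, 4) == -8
assert twos_complement_value(0b0111, 4) == 7
```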

Offset binary

Offset binary integer representation
value = (unsigned value of the bits) - bias
Typical bias used for offset binary integer representation
2^(n-1) (IEEE 754 exponents use 2^(n-1) - 1)

Real number representations

Fixed-point representations

Fixed-point number representation
x = m × s, where m is a signed integer representation (typically two’s complement) and s is a fixed scale factor
What signed integer representation is typically used in a fixed-point mantissa?
Two’s complement
What does the “fixed point” mean in a fixed-point representation?
(Assuming a power-of-2 scale) the binary representation of our number is just the integer’s bits with a binary point at a fixed location
“Everyday” instance of a fixed-point representation
Currency: e.g. prices stored as a whole number of cents (an integer count with a fixed scale of 1/100)
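A minimal sketch of such a scheme (power-of-2 scale assumed, as above; names illustrative):

```python
# Fixed point: value = integer * fixed scale; here the binary point sits
# 8 bits from the right (scale = 2^-8).
SCALE_BITS = 8

def to_fixed(x: float) -> int:
    return round(x * (1 << SCALE_BITS))   # stored as a plain integer

def from_fixed(q: int) -> float:
    return q / (1 << SCALE_BITS)

q = to_fixed(3.14159)
print(q, from_fixed(q))  # 804 3.140625 -- quantisation error < 2^-9
```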
Why are fixed-point representations a poor fit for machine learning?
  1. If scale is too small, quantisation error rapidly becomes large
  1. If scale is too large, clipping error rapidly becomes large
  1. The “sweet spot” is very narrow

IEEE 754 floating-point standard

What number is the IEEE floating-point standard?
IEEE 754
What’s the difference between an arithmetic format and an interchange format?
Arithmetic format: the mathematical values represented
Interchange format: how these values are realised as bit strings
What do you need to specify an IEEE 754 float format?
  1. an exponent range (really just e_max, as typically e_min = 1 - e_max)
  1. # mantissa bits
  1. base
What do you need to specify a value in an IEEE 754 float format?
  1. sign bit
  1. exponent bits
  1. mantissa bits
Other words for mantissa?
  1. Significand
  1. Coefficient
(there are more)
Other word for base (of a number)?
Radix
IEEE 754 binary interchange format representation
(-1)^s × 1.m × 2^(e - bias)
Where: s = sign bit, e = unsigned value of the exponent bits, m = mantissa bits (1.m includes the implicit leading 1), bias = 2^(k-1) - 1 for k exponent bits
Plus special values
How to “read” an IEEE 754 interchange format exponent bit string
  1. The bias bit pattern has all bits set to one except the leading bit (0111…1)
  1. Work out the value of the bits you’d have to add to this pattern to get your bit string (or vice versa)
  1. Take the direction (add vs subtract) as the sign of the exponent
  1. The scale factor is 2^(that value)
How to “read” an IEEE 754 interchange format mantissa bit string
  1. Cut off all trailing zeros
  1. Take the unsigned value of the remaining bits
  1. Divide by 2^(number of remaining bits) (i.e. a 1000… value with one extra bit to the left)
  1. Add 1 (the implicit leading bit)
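A sketch that mechanises both “reading” recipes for normal binary32 values (struct-based bit access; not from the source):

```python
import struct

# Decode a normal FP32 value into (sign, unbiased exponent, mantissa).
def decode_fp32(x: float) -> tuple[int, int, float]:
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31
    exponent = ((bits >> 23) & 0xFF) - 127    # subtract the bias (0111 1111 = 127)
    mantissa = 1 + (bits & 0x7FFFFF) / 2**23  # bits / 2^(#bits), then + 1
    return sign, exponent, mantissa

assert decode_fp32(-6.5) == (1, 2, 1.625)     # -6.5 = -1.625 * 2^2
```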
What is the proper IEEE 754 name for fp32 & fp16?
binary32 and binary16
IEEE 754 special values?
  1. Infinity (+ve & -ve)
  1. NaN (quiet & signalling)
  1. Subnormal numbers
IEEE 754 infinity representation
  1. Exponent all 1s
  1. Mantissa all 0s
(sign bit either)
How to signal an IEEE 754 NaN
  1. Exponent all 1s
  1. Mantissa non-zero (any non-zero value)
How to interpret an IEEE 754 NaN mantissa
  1. 1st mantissa bit = 1 → qNaN, otherwise signalling NaN
  1. Rest of mantissa = payload
How to signal an IEEE 754 subnormal number
Exponent all zeros
IEEE 754 subnormal number representation
Exponent = 1 - bias (e_min, with the exponent field all zeros); the implicit leading mantissa bit becomes 0
How do IEEE 754 floats typically resolve rounding ties?
Round to nearest; ties go to the candidate whose final mantissa bit is 0 (even)
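A small numpy illustration (assuming numpy’s float16 cast, which rounds to nearest, ties to even):

```python
import numpy as np

# fp16 spacing above 2048 is 2, so 2049 and 2051 are exact halfway ties.
assert np.float16(2049) == np.float16(2048)  # tie -> even mantissa (...0)
assert np.float16(2051) == np.float16(2052)  # tie -> even mantissa (...0)
```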
Max value of an IEEE 754 float?
(2 - 2^-p) × 2^(e_max), for p mantissa bits
Approximate max value of an IEEE 754 float?
2^(e_max + 1)
Bias of an IEEE 754 float?
2^(k-1) - 1, for k exponent bits
Max exponent for an IEEE 754 float?
e_max = 2^(k-1) - 1 (= the bias)
Min exponent for an IEEE 754 float?
e_min = 1 - e_max
Absolute min normal value of an IEEE 754 float?
2^(e_min)
Absolute min subnormal value of an IEEE 754 float?
2^(e_min - p), for p mantissa bits
Basic approach to adding (or subtracting) IEEE 754 floats?
  1. Increase precision
  1. Shift the number with the smaller exponent to the right until exponents match, including implicit leading 1
  1. Add / subtract, adjusting exponent if necessary
  1. Round back to original precision
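A rough sketch of this recipe on exact (mantissa, exponent) integer pairs, value = m × 2^e (illustrative only; real hardware also renormalises the result):

```python
GUARD = 3  # extra low bits kept while shifting ("increase precision")

def float_add(m1: int, e1: int, m2: int, e2: int) -> tuple[int, int]:
    if e1 < e2:                          # ensure (m1, e1) has the larger exponent
        m1, e1, m2, e2 = m2, e2, m1, e1
    m1, m2 = m1 << GUARD, m2 << GUARD    # widen both mantissas
    m2 >>= e1 - e2                       # shift the smaller-exponent mantissa right
    total = m1 + m2
    return (total + (1 << (GUARD - 1))) >> GUARD, e1  # round back (half-up, for brevity)

assert float_add(3, 2, 2, 1) == (4, 2)   # (3 * 2^2) + (2 * 2^1) = 12 + 4 = 16
```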
Why does the “shift trick” work when adding/subtracting IEEE 754 floats?
Dividing the mantissa by 2^k to raise the exponent by k is the same as shifting the mantissa right by k, so long as the implicit leading 1 is included in the shift
Basic approach to multiplying IEEE 754 floats?
  1. Add exponents
  1. Multiply mantissas in higher precision
  1. Round mantissa & adjust exponent if necessary
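A quick worked example of this recipe (values chosen for illustration): (1.5 × 2^3) × (1.25 × 2^2) = (1.5 × 1.25) × 2^(3+2) = 1.875 × 2^5 = 60. No exponent adjustment is needed here since 1.875 < 2; a product mantissa ≥ 2 would be shifted right by one with the exponent incremented.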
What is a binade?
A set of floating-point numbers with the same exponent value

FP32

How many exponent & mantissa bits in FP32
(8, 23)
FP32 approximate max value (decimal)
~3.4 × 10^38
FP32 approximate max value (power of 2)
2^128
FP32 approximate min absolute normal value (decimal)
~1.2 × 10^-38
FP32 min absolute normal value (power of 2)
2^-126
FP32 bias
127

TF32

How many exponent & mantissa bits in TF32
(8, 10)
What is the motivation for TF32?
Keep the range of FP32, but reduce the precision to that of FP16
How many bits are required to represent a number in TF32?
19
What is TF32?
A compute mode - not a number format
How are matmuls done in TF32?
  1. Inputs are rounded to TF32
  1. Products output in FP32
  1. Accumulation in FP32
When do GPUs do 32-bit compute in TF32?
By default (from Ampere onwards)
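For reference, frameworks gate this behaviour with flags; e.g. these PyTorch switches (library defaults have changed across versions, so treat this as a sketch rather than a statement of current defaults):

```python
import torch

# Allow fp32 matmuls / cuDNN convolutions to run as TF32 on tensor cores.
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True
```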

FP16

Benefits of reduced-precision formats
  1. Faster compute
  1. Reduced memory
  1. Reduced comms
How many exponent & mantissa bits in FP16
(5, 10)
FP16 max value (decimal)
65504
FP16 approximate max value (power of 2)
2^16
FP16 approximate min absolute normal value (decimal)
~6.1 × 10^-5
FP16 min absolute normal value (power of 2)
2^-14
FP16 bias
15
What techniques do they use in the mixed precision paper to enable FP16 training?
  1. Master weights in FP32
  1. Loss scaling
  1. FP32 partials
What might be a reason networks use FP32 master weights?
Swamping
What is swamping?
Loss of precision when adding a small number to a large one: exponent alignment right-shifts the small mantissa, truncating its low bits
How to avoid swamping in large dot products
  1. (Hierarchical) chunk-based accumulation
  1. Stochastic rounding
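A small numpy demonstration of swamping and the chunked fix (sizes arbitrary):

```python
import numpy as np

xs = np.full(4096, 0.25, dtype=np.float16)   # true sum = 1024

naive = np.float16(0)
for x in xs:
    naive = np.float16(naive + x)   # beyond 512, adding 0.25 is a half-ulp tie -> swamped

chunked = np.float16(0)
for chunk in xs.reshape(64, 64):    # sum small chunks first, then combine
    chunked = np.float16(chunked + chunk.sum(dtype=np.float16))

print(naive, chunked)  # 512.0 1024.0
```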

BFLOAT16

How many exponent & mantissa bits in bfloat16
(8, 7)
What is the motivation for bfloat16?
Keep the range of FP32, but reduce the bit width to 16
How to calculate max/min absolute values for bfloat16
They’re the same as FP32 (identical exponent width & bias); only the precision differs

FP8

What are the two different proposed sizes for FP8?
1.5.2 | 1.4.3
How does the GAQ proposed FP8 standard deviate from the IEEE 754 format?
  1. The all-1s exponent is no longer reserved for special values
  1. Negative zero is now reserved for both Inf & NaN
  1. The bias is 1 greater
FP8 1.5.2 approx max value (decimal)
57344 (~5.7 × 10^4), for both GAQ & NAI
FP8 1.5.2 approximate max value (power of 2)
~2^16, for both GAQ & NAI
FP8 1.5.2 min absolute normal value (power of 2)
GAQ: 2^-15
NAI: 2^-14
FP8 1.5.2 bias
GAQ: 16
NAI: 15
FP8 1.4.3 max value (decimal)
GAQ: 240
NAI: 448
FP8 1.4.3 min absolute normal value (power of 2)
GAQ: 2^-7
NAI: 2^-6
FP8 1.4.3 bias
GAQ: 8
NAI: 7
What is the recommended FP8 format for activations?
1.4.3
What is the recommended FP8 format for weights?
1.4.3
What is the recommended FP8 format for grad_xs?
1.5.2
What is the recommended FP8 format for grad_ws?
1.5.2
Why might the first layer be harder to train in FP8?
We may not have enough bits to represent certain input modalities such as pixel colours
Why might the last layer be harder to train in FP8?
  1. Large embeddings can give out-of-range values
  1. Softmax sensitive to low-precision values
How is the Nvidia 1.5.2 format different/similar to Graphcore’s?
It conforms exactly to the IEEE 754 conventions (bias 15, standard Inf & NaN encodings).
How does the Nvidia 1.4.3 format differ from Graphcore’s?
Similar:
  1. Single representation for INF & NaN
Different:
  1. Only the all-1s mantissa (with the all-1s exponent) represents NaN - other mantissa values under that exponent are valid numbers
  1. -ve Zero maintained
Why do Nvidia argue against using -0 as the single NaN representation?
  1. It breaks the +ve / -ve symmetry inherent in the IEEE 754 standard
  1. This may break algorithm implementations (e.g. comparison / sorting)
What FP8 format do Nvidia propose for doing quantisation from FP16?
  1. 1.4.3 (as there are no grads)
Where do Nvidia propose we add scaling factors for FP16-to-FP8 quantisation?
Just like standard int8:
Weights: per-channel
Activations: per-tensor
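A sketch of the per-tensor case (illustrative; the final FP8 cast is left to hardware / library kernels; 448 is the NAI 1.4.3 max from above):

```python
import numpy as np

FP8_E4M3_MAX = 448.0   # NAI 1.4.3 max value

def quantise_per_tensor(x: np.ndarray) -> tuple[np.ndarray, float]:
    # Scale so the largest magnitude maps onto the FP8 max; dequantise
    # later with x_q * scale.
    scale = float(np.abs(x).max()) / FP8_E4M3_MAX
    return x / scale, scale
```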

IPU

What are half-partials?
Rounding of the output of the AMP unit down to FP16.
What does a partial usually refer to on the IPU?
The final output of a single vector going through the RHS of the AMP unit.
What are the two main instruction types for accelerated compute on the IPU?
AMP and SLIC

AMP

Hierarchy of compute on an IPU (numbers not required)
  1. Tiles
  1. AMP sets
  1. AMP engines
  1. AMP units
How many tiles per IPU? (mk2)
1472
How many AMP sets per IPU tile? (mk2)
2
How many AMP engines per AMP set? (mk2)
4
How many AMP engines per IPU tile? (mk2)
8
How many AMP units per AMP engine? (mk2)
2
How many AMP units per AMP set? (mk2)
8
How many AMP units per IPU tile? (mk2)
16
How many AMP units per IPU? (mk2)
16 units/tile × 1472 tiles = 23,552
What is the basic unit of computation on an IPU tile in FP16, per instruction? (mk2)
A matmul
(we can then stream vectors on the RHS)
What is the basic unit of computation on an IPU tile in FP32, per instruction? (mk2)
A matmul
(we can then stream vectors on the RHS)
What is the basic unit of computation per IPU tile in FP16, per cycle? (mk2)
A matmul
(we can then stream vectors on the RHS)
What is the basic unit of computation per AMP unit in FP16, per instruction? (mk2)
An inner product
What is the basic unit of computation per AMP unit in FP16, per cycle? (mk2)
An inner product
What is a phase in an AMP unit?
A block of work that takes a cycle
How does a tile split a matmul over its AMP units?
  1. LHS rows mapped to AMP units
  1. LHS columns / RHS rows broken into 4 phases
How many cycles does it take a tile to perform a single matmul in FP16? (mk2)
4

MACs & FLOPs

How many MACs can each AMP unit perform per cycle in FP16? (mk2)
4
How many MACs can each tile perform per cycle in FP16? (mk2)
64
How many MACs can each tile perform per cycle in FP32? (mk2)
16
How many FLOPs per MAC?
2
How many FP16 FLOPs can an IPU process for each FP32 FLOP
4
How many FLOPs can each tile perform per cycle in FP16? (mk2)
128
How many FLOPs can each tile perform per cycle in FP32? (mk2)
32
How many FLOPs can each IPU perform per cycle in FP16? (mk2)
128 FLOPs * 1472 tiles ~= 188,000 FLOPs
IPU mk2 (original) clock speed
1.325GHz
IPU Bow clock speed
1.85GHz
Bow speedup vs original mk2?
40% (clock speed increase)
How many FLOP/s can each mk2 original IPU perform in FP16?
188,000 FLOPs per cycle * 1.325GHz ~= 250TFLOP/s
How many FLOP/s can each Bow IPU perform in FP16?
250TFLOP/s * 1.4 ~= 350TFLOP/s
Steps in an IPU AMP unit cycle
  1. Compute element-wise products (output in higher precision)
  1. Align result exponents
  1. Round down to FP32
  1. Sequential sum
  1. Accumulate to FP16 / FP32 partials (the FP16 process is covered in another card)
(recall: this is an inner product for FP16)

Stochastic rounding

IPU rounding options
  1. Round to Nearest (ties to even)
  1. Stochastic Rounding
What is the key benefit of stochastic rounding?
It reduces swamping
Stochastic rounding equations
round(x) = ⌊x⌋ with probability (⌈x⌉ - x) / (⌈x⌉ - ⌊x⌋), else ⌈x⌉ with probability (x - ⌊x⌋) / (⌈x⌉ - ⌊x⌋), where ⌊x⌋ / ⌈x⌉ are the nearest representable values below / above x
How is stochastic rounding implemented in practice?
  1. Generate a uniform random bit string, aligned to the portion to be truncated
  1. Add this to the higher-precision operand
  1. Sum
  1. Truncate
Why does the random bit-string implementation of stochastic rounding work?
  1. Using the extreme bit values is equivalent to floor and ceiling
  1. Intermediate values linearly interpolate
  1. With probability proportional to distance of sum from upper / lower bounds
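A numpy sketch of this implementation for fp32 → fp16 (13 = 23 - 10 truncated mantissa bits; NaN/Inf/subnormal edge cases ignored):

```python
import numpy as np

TRUNC_BITS = 13   # fp32 has 23 mantissa bits, fp16 keeps 10
rng = np.random.default_rng(0)

def stochastic_round_to_fp16(x: np.ndarray) -> np.ndarray:
    bits = x.astype(np.float32).view(np.uint32)
    noise = rng.integers(0, 1 << TRUNC_BITS, size=bits.shape, dtype=np.uint32)
    bits = bits + noise                        # a carry out decides up vs down
    bits &= ~np.uint32((1 << TRUNC_BITS) - 1)  # truncate the low bits
    return bits.view(np.float32).astype(np.float16)

# 2048.25 sits 1/8 of the way from 2048 to 2050 (fp16 ulp = 2):
x = np.full(100_000, 2048.25, dtype=np.float32)
print(stochastic_round_to_fp16(x).astype(np.float32).mean())  # ~2048.25 in expectation
```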

To adjust

Is the shift issue for mul or sum? How is the other one handled?

Open questions

  • Results re: stochastic rounding and batch size
  • For FP8 what might stay in FP16? Presumably master weights and opt state. What about grads for all-reduce / accumulation? Does it need to?
  • How is SNR computed?
  • Are partials within the AMP unit always in FP32? Does the outer partials sum have an FP32 part too? I think FP8 still accumulates into an FP32 partial, which can optionally be rounded down afterwards? Is this true? Could it be accumulated in FP16 (or BF16?)
  • Is stochastic rounding done for all of: phases, partials, exchange partials, grad acc, grad all-reduce (terminology here?)