### Contents

- Motivation
- Integer representations
  - Sign-magnitude
  - One’s complement
  - Two’s complement
  - Offset binary
- Real number representations
  - Fixed-point representations
  - IEEE 754 floating-point standard
  - FP32
  - TF32
  - FP16
  - BFLOAT16
  - FP8
- IPU
  - AMP
  - MACs & FLOPs
  - Stochastic rounding
- To adjust
- Open questions

### Motivation

## From 2012 to 2022, how much has GPU performance improved?

300x

## What are the key factors underlying GPU performance improvement from 2012 to 2022?

- Numerics

- Process transistor density

- Clock speed

- Architectural refinements

## How much have numerics improved GPU performance from 2012 to 2022, and why?

16x: from fp32 to fp8
(compute increase is quadratic)

## Why do smaller number formats lead to quadratic performance improvements?

Silicon area for arithmetic is quadratic in bit-width

## How much has process transistor density improved GPU performance from 2012 to 2022, and from what process nodes?

8x from 28nm to 5nm

## What does the *nm* value refer to for a process node?

Nothing, it’s purely a marketing term

## What did the *nm* for a process node *traditionally* refer to?

The length of the transistor gate

## How much has clock speed improved GPU performance from 2012 to 2022?

1.7x (at a power cost)

## How much have architectural refinements improved GPU performance from 2012 to 2022?

1.4x

### Integer representations

## Four main signed int representations

- sign–magnitude

- one's complement

- two's complement

- offset binary

#### Sign-magnitude

## Sign-magnitude integer representation

The leading bit encodes the sign; the remaining bits encode the magnitude as an unsigned integer

## Cons of sign-magnitude integer representation

- Two ways of representing 0

- As a consequence, the range is only ±(2^(n-1) - 1)

- Addition & subtraction require different behaviour depending on the sign bit

#### One’s complement

## One’s complement integer representation

Non-negative numbers are unsigned binary; negatives are formed by flipping every bit (bitwise NOT) of the corresponding positive number

## How do you negate a one’s complement integer?

Flip the bits

## Pros of one’s complement integer representation vs sign-magnitude

- Addition & subtraction are the same as for unsigned integers, with the exception of the *end-around carry*

## Cons of one’s complement integer representation vs two’s complement

- Two ways of representing 0

- As a consequence, the range is only ±(2^(n-1) - 1)

- Addition & subtraction require an *end-around carry*

## What is *end-around carry* in arithmetic?

Adding an ‘overflowing’ carry at the most-significant bit back to the least significant bit
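
A minimal Python sketch of the end-around carry on 4-bit one's-complement patterns (the function name is illustrative, not from the source):

```python
def ones_complement_add(a: int, b: int, bits: int = 4) -> int:
    """Add two one's-complement bit patterns, applying the end-around carry."""
    mask = (1 << bits) - 1
    total = (a & mask) + (b & mask)
    if total > mask:                    # carry out of the most-significant bit...
        total = (total & mask) + 1      # ...is added back in at the least-significant bit
    return total & mask

# -1 is 0b1110 in 4-bit one's complement; -1 + 2 should give +1 (0b0001)
assert ones_complement_add(0b1110, 0b0010) == 0b0001
```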

#### Two’s complement

## Two’s complement integer representation

Non-negative numbers are unsigned binary; negatives are formed by flipping every bit and adding 1 (equivalently, the most-significant bit carries weight -2^(n-1))

## How do you negate a two’s complement integer?

Flip the bits and add 1
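
A quick Python check of the flip-and-add-1 rule, assuming an 8-bit width for illustration:

```python
def twos_complement_negate(x: int, bits: int = 8) -> int:
    """Negate a two's-complement bit pattern: flip the bits, then add 1."""
    mask = (1 << bits) - 1
    return ((x ^ mask) + 1) & mask

# 5 -> -5: 0b00000101 -> 0b11111011 (the unsigned pattern 251 = 256 - 5)
assert twos_complement_negate(0b00000101) == 0b11111011
```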

## Pros of two’s complement integer representation

- One way of representing 0

- Range is [-2^(n-1), 2^(n-1) - 1]

- Addition & subtraction can ignore the sign bit, relying on modular (wrap-around) overflow behaviour

## Alternative interpretation for the sign bit in the two’s complement integer representation

The most-significant bit represents -2^(n-1) (a negative weight); the remaining bits are an unsigned integer

#### Offset binary

## Offset binary integer representation

Read the bits as an unsigned integer and subtract a fixed bias: value = unsigned(bits) - bias

## Typical bias used for offset binary integer representation

2^(n-1) for n bits (IEEE 754 exponent fields instead use 2^(k-1) - 1)

### Real number representations

#### Fixed-point representations

## Fixed-point number representation

x = s · m: a fixed scale s multiplied by an integer mantissa m, where m is a signed integer representation (typically two’s complement) and s is fixed for the whole format (often a power of 2)

## What signed integer representation is typically used in a fixed-point mantissa?

Two’s complement

## What does the “fixed point” mean in a fixed-point representation?

(Assuming a power-of-2 scale) the binary representation of our number is just the mantissa bits with the binary point at a fixed location
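
A minimal NumPy sketch of a power-of-2-scale fixed-point encode (the function name and the 7.8-bit split are illustrative choices, not from the source); it also makes the quantisation/clipping errors discussed below concrete:

```python
import numpy as np

def fixed_point(x, frac_bits=8, int_bits=7):
    """Round x to a signed fixed-point value: integer mantissa times a fixed scale 2**-frac_bits."""
    scale = 2.0 ** -frac_bits
    m = np.round(np.asarray(x) / scale)                            # integer mantissa
    lo, hi = -(2 ** (frac_bits + int_bits)), 2 ** (frac_bits + int_bits) - 1
    return np.clip(m, lo, hi) * scale                              # clip to the representable range

print(fixed_point(3.14159))   # 3.140625 (small quantisation error)
print(fixed_point(200.0))     # ~128     (clipped: the scale is too small for this value)
```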

## “Everyday” instance of a fixed-point representation

*Scientific notation*

## Why are fixed-point representations a poor fit for machine learning?

- If scale is too small, quantisation error rapidly becomes large

- If scale is too large, clipping error rapidly becomes large

- The “sweet spot” is very narrow

#### IEEE 754 floating-point standard

## What number is the IEEE floating-point standard?

IEEE 754

## What’s the difference between an *arithmetic format* and an *interchange format*?

**Arithmetic format:** the mathematical values represented

**Interchange format:** how these values are realised as bit strings

## What do you need to specify an IEEE 754 float format?

- an exponent range (really just e_max, as typically *e_min = 1 - e_max*)

- # mantissa bits

- base

## What do you need to specify a *value* in an IEEE 754 float format?

- sign bit

- exponent bits

- mantissa bits

## Other words for mantissa?

- Significand

- Coefficient

(there are more)

## Other word for base (of a number)?

Radix

## IEEE 754 binary interchange format representation

(-1)^S × 1.M × 2^(E - bias)

Where:

- S = the sign bit

- E = the exponent bits, read as an unsigned (offset binary) integer

- M = the mantissa bits, read as a binary fraction after the implicit leading 1

Plus special values

## How to “read” an IEEE 754 interchange format *exponent* bit string

- The bias corresponds to the bit pattern with all bits set except the leading one (0111…1, i.e. 2^(k-1) - 1)

- Work out the value you would have to add to this bias pattern to reach your bit pattern (or vice versa)

- Take the direction (above or below the bias pattern) to be the sign of the exponent

- The exponent contribution is 2^(that signed value)

## How to “read” an IEEE 754 interchange format *mantissa* bit string

- Cut off all trailing zeros

- Read the remaining bits as an unsigned integer

- Divide by 2^(number of remaining bits) (i.e. the same bit string with an extra 1 bit to the left: 1000…)

- Add 1 (the implicit leading bit)

(see the decoding sketch below)
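
A small Python sketch that decodes a normal binary32 value along these lines (the field widths are binary32's; the `decode_fp32` name is mine, and subnormals and special values are not handled):

```python
import struct

def decode_fp32(x: float):
    """Split a binary32 value into sign, exponent and mantissa fields, then re-assemble it."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31
    exp_bits = (bits >> 23) & 0xFF          # 8 exponent bits, offset binary with bias 127
    mant_bits = bits & 0x7FFFFF             # 23 mantissa bits
    exponent = exp_bits - 127               # subtract the bias
    mantissa = 1 + mant_bits / 2**23        # implicit leading 1 plus the fractional bits
    return (-1) ** sign * mantissa * 2.0 ** exponent

assert decode_fp32(-6.25) == -6.25          # normal values round-trip exactly
```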

## What is the proper IEEE 754 name for fp32 & fp16?

*binary32* and *binary16*

## IEEE 754 special values?

- Infinity (+ve & -ve)

- NaN (quiet & signalling)

- Subnormal numbers

## IEEE 754 infinity representation

- Exponent all 1s

- Mantissa all 0s

(sign bit either)

## How to signal an IEEE 754 NaN

- Exponent all 1s

- Mantissa non-zero (any non-zero pattern)

## How to interpret an IEEE 754 NaN mantissa

- 1st mantissa bit = 1 → qNaN, otherwise signalling NaN

- Rest of mantissa = payload

## How to signal an IEEE 754 subnormal number

Exponent all zeros (with a non-zero mantissa; an all-zero mantissa gives ±0)

## IEEE 754 subnormal number representation

Stored exponent all zeros, interpreted as e_min = 1 - bias; the implicit leading mantissa bit is removed (treated as 0)

## How do IEEE 754 floats typically resolve rounding ties?

Round half to even: ties go to the neighbour whose last mantissa bit is 0

## Max value of an IEEE 754 float?

(2 - 2^-m) × 2^e_max, for m mantissa bits

## Approximate max value of an IEEE 754 float?

≈ 2^(e_max + 1)

## Bias of an IEEE 754 float?

2^(k-1) - 1, for k exponent bits

## Max exponent for an IEEE 754 float?

e_max = 2^(k-1) - 1 (equal to the bias)

## Min exponent for an IEEE 754 float?

e_min = 1 - e_max

## Absolute min normal value of an IEEE 754 float?

2^e_min

## Absolute min subnormal value of an IEEE 754 float?

2^(e_min - m)
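
These generic values can be sanity-checked against NumPy's float metadata (a small sketch; `finfo.tiny` is the smallest normal value):

```python
import numpy as np

for dtype in (np.float16, np.float32):
    fi = np.finfo(dtype)
    print(dtype.__name__, "max:", fi.max, "min normal:", fi.tiny)
# float16: max ~65504,   min normal ~6.1e-05
# float32: max ~3.4e38,  min normal ~1.2e-38
```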

## Basic approach to adding (or subtracting) IEEE 754 floats?

- Increase precision

- Shift the number with the smaller exponent to the right until the exponents match, **including** the implicit leading 1 in the shift (see the sketch below)

- Add / subtract, adjusting exponent if necessary

- Round back to original precision
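
A minimal Python sketch of these steps for positive values only, on (exponent, integer-mantissa) pairs with an FP16-like 10-bit mantissa (the names and simplifications, such as truncating instead of rounding, are mine):

```python
def fp_add(e_a, m_a, e_b, m_b, mant_bits=10):
    """Add two positive floats given as (exponent, mantissa) pairs, where the mantissa is an
    integer holding the implicit leading 1 followed by `mant_bits` fraction bits."""
    if e_a < e_b:                              # make operand a the one with the larger exponent
        (e_a, m_a), (e_b, m_b) = (e_b, m_b), (e_a, m_a)
    m_b >>= e_a - e_b                          # shift the smaller operand right, leading 1 included
    e, m = e_a, m_a + m_b
    while m >= 1 << (mant_bits + 1):           # renormalise if the mantissa overflowed
        m >>= 1
        e += 1
    return e, m                                # (truncates rather than rounds; a real unit rounds)

# 1.5 * 2^3 + 1.25 * 2^1 = 14.5: mantissas are 1.5*1024 = 1536 and 1.25*1024 = 1280
e, m = fp_add(3, 1536, 1, 1280)
assert (m / 1024) * 2 ** e == 14.5
```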

## Why does the “shift trick” work when adding/subtracting IEEE 754 floats?

Shifting the mantissa right by the exponent difference is equivalent to rewriting the number with the larger exponent, so long as the implicit leading 1 is included in the shift

## Basic approach to multiplying IEEE 754 floats?

- Add exponents

- Multiply mantissas in higher precision

- Round mantissa & adjust exponent if necessary (worked example below)
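
A quick worked example (numbers chosen for illustration): (1.5 × 2^3) × (1.25 × 2^1) → add exponents 3 + 1 = 4, multiply mantissas 1.5 × 1.25 = 1.875 (still in [1, 2), so no exponent adjustment), giving 1.875 × 2^4 = 30.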

## What is a binade?

A set of floating-point numbers with the same exponent value

#### FP32

## How many exponent & mantissa bits in FP32

(8, 23)

## FP32 approximate max value (decimal)

≈ 3.4 × 10^38

## FP32 approximate max value (power of 2)

≈ 2^128

## FP32 approximate min absolute normal value (decimal)

≈ 1.2 × 10^-38

## FP32 min absolute normal value (power of 2)

2^-126

## FP32 bias

127

#### TF32

## How many exponent & mantissa bits in TF32

(8, 10)

## What is the motivation for TF32?

Keep the range of FP32, but reduce the precision to that of FP16

## How many bits are required to represent a number in TF32?

19

## What is TF32?

A *compute mode*, not a number format

## How are matmuls done in TF32?

- Inputs are rounded to TF32

- Products output in FP32

- Accumulation in FP32

## When do GPUs do 32-bit compute in TF32?

By default (from Ampere onwards)
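
A minimal PyTorch sketch of toggling the TF32 compute mode explicitly (these flags exist in PyTorch; an Ampere-or-newer CUDA GPU is assumed):

```python
import torch

# TF32 is a compute mode: tensors stay FP32 in memory, but on Ampere-or-newer GPUs
# the tensor cores can round matmul inputs to TF32 and accumulate in FP32.
torch.backends.cuda.matmul.allow_tf32 = True   # allow TF32 for matmuls
torch.backends.cudnn.allow_tf32 = True         # allow TF32 for cuDNN convolutions

a = torch.randn(1024, 1024, device="cuda")
b = torch.randn(1024, 1024, device="cuda")
c = a @ b                                       # inputs rounded to TF32, products/accumulation in FP32
```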

#### FP16

## Benefits of reduced-precision formats

- Faster compute

- Reduced memory

- Reduced comms

## How many exponent & mantissa bits in FP16

(5, 10)

## FP16 max value (decimal)

65,504

## FP16 approximate max value (power of 2)

≈ 2^16

## FP16 approximate min absolute normal value (decimal)

≈ 6.1 × 10^-5

## FP16 min absolute normal value (power of 2)

2^-14

## FP16 bias

15

## What techniques do they use in the mixed precision paper to enable FP16 training?

- Master weights in FP32

- Loss scaling

- FP32 partials (see the PyTorch sketch below)
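
A minimal PyTorch sketch of these ideas (the model, data and hyperparameters are placeholders): parameters stay in FP32, the forward/backward pass runs in FP16 where safe, and `GradScaler` provides dynamic loss scaling:

```python
import torch

model = torch.nn.Linear(1024, 1024).cuda()                 # parameters ("master weights") stay FP32
optimiser = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()                       # dynamic loss scaling

x = torch.randn(32, 1024, device="cuda")
with torch.autocast("cuda", dtype=torch.float16):          # FP16 compute where it is safe
    loss = model(x).square().mean()
scaler.scale(loss).backward()                              # scale the loss so FP16 grads stay in range
scaler.step(optimiser)                                     # unscales the grads, FP32 weight update
scaler.update()                                            # adjusts the scale if grads overflowed
```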

## What might be a reason networks use FP32 master weights?

Swamping

## What is swamping?

Loss of precision due to shift-truncation in *large-to-small* number addition

## How to avoid swamping in large dot products

- (Hierarchical) chunk-based accumulation (see the NumPy demonstration below)

- Stochastic rounding
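
A NumPy demonstration of swamping and of chunk-based accumulation (the array contents and chunk size are arbitrary illustrative choices):

```python
import numpy as np

x = np.full(100_000, 0.01, dtype=np.float16)       # true sum is 1000

acc = np.float16(0.0)
for v in x:                                        # naive sequential FP16 accumulation
    acc = np.float16(acc + v)                      # once acc is large, +0.01 is swamped (rounds away)
print(acc)                                         # stalls around ~32

chunks = [np.float16(c.sum(dtype=np.float16)) for c in np.split(x, 1000)]
acc2 = np.float16(0.0)
for s in chunks:                                   # chunk sums (~1.0 each) are closer in scale to acc2
    acc2 = np.float16(acc2 + s)
print(acc2)                                        # ~1000
```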

#### BFLOAT16

## How many exponent & mantissa bits in bfloat16

(8, 7)

## What is the motivation for bfloat16?

Keep the range of FP32, but reduce the bit width to 16

## How to calculate max/min absolute values for bfloat16

Effectively the same as FP32: the exponent range is identical, so the min normal is 2^-126 and the max is ≈ 3.4 × 10^38 (marginally below FP32's, due to fewer mantissa bits)

#### FP8

## What are the two different proposed sizes for FP8?

1.5.2 | 1.4.3

## How does the GAQ (Graphcore, AMD, Qualcomm) proposed FP8 standard deviate from the IEEE 754 format?

- The all-1s exponent is no longer reserved for special values

- Negative zero is now reserved for both Inf & NaN

- The bias is 1 greater

## FP8 1.5.2 approx max value (decimal)

≈ 57,344, for both GAQ & NAI (NVIDIA, Arm, Intel)

## FP8 1.5.2 approximate max value (power of 2)

≈ 2^16 (exactly 1.75 × 2^15), for both GAQ & NAI

## FP8 1.5.2 min absolute normal value (power of 2)

GAQ: 2^-15

NAI: 2^-14

## FP8 1.5.2 bias

GAQ: 16

NAI: 15

## FP8 1.4.3 max value (decimal)

GAQ: 240

NAI: 448

## FP8 1.4.3 min absolute normal value (power of 2)

GAQ: 2^-7

NAI: 2^-6

## FP8 1.4.3 bias

GAQ: 8

NAI: 7

## What is the recommended FP8 format for activations?

1.4.3

## What is the recommended FP8 format for weights?

1.4.3

## What is the recommended FP8 format for grad_xs?

1.5.2

## What is the recommended FP8 format for grad_ws?

1.5.2

## Why might the first layer be harder to train in FP8?

We may not have enough bits to represent certain input modalities such as pixel colours

## Why might the last layer be harder to train in FP8?

- Large embeddings can give out-of-range values

- Softmax sensitive to low-precision values

## How is the Nvidia 1.5.2 format similar to / different from Graphcore’s?

It conforms exactly to the IEEE 754 standard.

## How does the Nvidia 1.4.3 format differ from Graphcore’s?

Similar:

- Single representation for INF & NaN

Different:

- The all-1s mantissa (with the all-1s exponent) represents NaN; other mantissa values under the all-1s exponent are valid numbers

- Negative zero is retained

## Why do Nvidia argue against using -0 as the single NaN representation?

- It breaks the +ve / -ve symmetry inherent in the IEEE 754 standard

- This may break algorithm implementations (e.g. comparison / sorting)

## What FP8 format do Nvidia propose for doing quantisation from FP16?

- 1.4.3 (as there are no grads)

## Where do Nvidia propose we add scaling factors for FP16-to-FP8 quantisation?

Just like standard int8:

Weights: per-channel

Activations: per-tensor

### IPU

## What are half-partials?

Rounding of the output of the AMP unit down to FP16.

## What does a partial usually refer to on the IPU?

The final output of a single vector going through the RHS of the AMP unit.

## What are the two main instruction types for accelerated compute on the IPU?

AMP and SLIC

#### AMP

## Hierarchy of compute on an IPU (numbers not required)

- Tiles

- AMP sets

- AMP engines

- AMP units

## How many tiles per IPU? (mk2)

1472

## How many AMP sets per IPU tile? (mk2)

2

## How many AMP engines per AMP set? (mk2)

4

## How many AMP engines per IPU tile? (mk2)

8

## How many AMP units per AMP engine? (mk2)

2

## How many AMP units per AMP set? (mk2)

8

## How many AMP units per IPU tile? (mk2)

16

## How many AMP units per IPU? (mk2)

16 AMP units per tile × 1472 tiles = 23,552

## What is the basic unit of computation on an IPU tile in FP16, *per instruction*? (mk2)

A matmul

(we can then stream vectors on the RHS)

## What is the basic unit of computation on an IPU tile in FP32, *per instruction*? (mk2)

A matmul

(we can then stream vectors on the RHS)

## What is the basic unit of computation per IPU tile in FP16, *per cycle*? (mk2)

A matmul

(we can then stream vectors on the RHS)

## What is the basic unit of computation per AMP unit in FP16, *per instruction*? (mk2)

An inner product

## What is the basic unit of computation per AMP unit in FP16, *per cycle*? (mk2)

An inner product

## What is a phase in an AMP unit?

A block of work that takes a cycle

## How does a tile split a matmul over its AMP units?

- LHS rows mapped to AMP units

- LHS columns / RHS rows broken into 4 phases

## How many cycles does it take a tile to perform a single matmul in FP16? (mk2)

4

#### MACs & FLOPs

## How many MACs can each AMP unit perform per cycle in FP16? (mk2)

4

## How many MACs can each tile perform per cycle in FP16? (mk2)

64

## How many MACs can each tile perform per cycle in FP32? (mk2)

16

## How many FLOPs per MAC?

2

## How many FP16 FLOPs can an IPU process for each FP32 FLOP

4

## How many FLOPs can each tile perform per cycle in FP16? (mk2)

128

## How many FLOPs can each tile perform per cycle in FP32? (mk2)

32

## How many FLOPs can each IPU perform per cycle in FP16? (mk2)

128 FLOPs * 1472 tiles ~= 188,000 FLOPs

## IPU mk2 (original) clock speed

1.325GHz

## IPU Bow clock speed

1.85GHz

## Bow speedup vs original mk2?

40% (clock speed increase)

## How many FLOP/s can each mk2 original IPU perform in FP16?

188,000 FLOPs per cycle * 1.325GHz ~= 250TFLOP/s

## How many FLOP/s can each Bow IPU perform in FP16?

250TFLOP/s * 1.4 ~= 350TFLOP/s

## Steps in an IPU AMP unit cycle

- Compute element-wise products (output in higher precision)

- Align result exponents

- Round down to FP32

- Sequential sum

- Accumulate to FP16 / FP32 partials (FP16 process in other card)

(recall: this is an inner product for FP16)

#### Stochastic rounding

## IPU rounding options

- Round to Nearest (ties to even)

- Stochastic Rounding

## What is the key benefit of stochastic rounding?

It reduces swamping

## Stochastic rounding equations

SR(x) = floor(x) with probability 1 - p, and ceil(x) with probability p, where p = (x - floor(x)) / (ceil(x) - floor(x)) and floor(x), ceil(x) are the nearest representable values below and above x; in expectation SR(x) = x

## How is stochastic rounding implemented in practice?

- Generate a uniform random bit string, aligned to the portion to be truncated

- Add it to the higher-precision operand

- Perform the sum

- Truncate the result down to the target precision (see the sketch below)
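
A NumPy sketch of this scheme for FP32 → FP16 rounding, where the 13 low mantissa bits are truncated (the function name and test values are mine; subnormals and special values are ignored):

```python
import numpy as np

rng = np.random.default_rng(0)

def stochastic_round_fp32_to_fp16(x: np.ndarray) -> np.ndarray:
    """Stochastically round FP32 values to FP16: add uniform random bits over the
    13 mantissa bits that will be dropped, then truncate."""
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    noise = rng.integers(0, 1 << 13, size=bits.shape, dtype=np.uint32)   # aligned to the truncated portion
    keep = np.uint32(0xFFFFE000)          # sign + exponent + top 10 mantissa bits (drop the low 13)
    rounded = (bits + noise) & keep       # add the random bits, then truncate
    return rounded.view(np.float32).astype(np.float16)

# 1 + 2**-13 is exactly representable in FP32 but falls below FP16 precision:
# round-to-nearest always gives 1.0, while stochastic rounding preserves it on average.
x = np.full(100_000, 1 + 2**-13, dtype=np.float32)
print(stochastic_round_fp32_to_fp16(x).astype(np.float32).mean())        # ~1.000122 (= 1 + 2**-13)
```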

## Why does the random bit-string implementation of stochastic rounding work?

- Using the extreme bit values is equivalent to floor and ceiling

- Intermediate values linearly interpolate

- With probability proportional to distance of sum from upper / lower bounds

### To adjust

Is the shift issue for mul or sum? How is the other one handled?

### Open questions

- Results r.e. stochastic rounding and batch size

- For FP8 what might stay in FP16? Presumably master weights and opt state. What about grads for all-reduce / accumulation? Does it need to?

- How is SNR computed?

- Are partials *within* the AMP unit always in FP32? Does the outer partials sum have an FP32 part too? I think FP8 still accumulates into an FP32 partial, which can optionally be rounded down afterwards? Is this true? Could it be accumulated in FP16 (or BF16?)

- Is stochastic rounding done for all of: phases, partials, exchange partials, grad acc, grad all-reduce (terminology here?)