Mixed Precision Training

Benefits of sub-32-bit fp training
  1. Less memory usage
  1. Less memory bandwidth required (local and network)
  1. Faster math (FP16 arithmetic has higher throughput on supporting hardware)
Minimum non-denorm value in FP16
  1. Bias = 15
  1. Min exp number for non-denorm = 1
  1. The minimum value is therefore 2^(1 - 15) = 2^-14 ≈ 6.1e-5 (worked out below)
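Putting those pieces together, the smallest normal (non-denorm) FP16 value works out as:

```latex
x_{\min}^{\text{norm}} = 1.0 \times 2^{\,1 - 15} = 2^{-14} \approx 6.10 \times 10^{-5}
```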
Minimum denorm value in FP16
  1. Minimum denorm: the exponent field is 0, so the scale is 2^(1 - 15) = 2^-14 with no implicit leading 1
  1. Setting only the smallest significand bit multiplies this by 2^-10
  1. Giving 2^-14 × 2^-10 = 2^-24 ≈ 6.0e-8 (worked out below)
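Putting those pieces together, the smallest denormal FP16 value works out as:

```latex
x_{\min}^{\text{denorm}} = 2^{-10} \times 2^{\,1 - 15} = 2^{-24} \approx 5.96 \times 10^{-8}
```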
Maximum value in FP16
  1. Set all exponent bits to 1 except the smallest (all 1s would encode Inf/NaN)
  1. Bias = 15
  1. Exp number = 0b11110 = 30
  1. Combined with the bias this gives a scale of 2^(30 - 15) = 2^15 = 32768
  1. Setting all the significand bits to 1 then multiplies by 2 - 2^-10 = 1.9990234375
  1. To give a max value of 32768 × (2 - 2^-10)
  1. = 65504 (worked out below)
Loss scaling steps
All in FP16 (with a few specific exceptions):
  1. Multiply loss by scale factor
  1. Standard backprop (chain rule ensures scaling propagates)
  1. Multiply the weight gradients by 1/scale (unscale them) and feed them to the optimiser (see the sketch below)
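A minimal PyTorch-style sketch of these steps, assuming an FP16 model, a loss already computed in FP16, and an illustrative fixed scale of 1024 (the function name is hypothetical, not a library API):

```python
import torch

def scaled_step(model, optimizer, loss, scale=1024.0):
    # 1. Multiply the (FP16) loss by the scale factor.
    scaled_loss = loss * scale
    # 2. Standard backprop; the chain rule carries the scale into every gradient.
    scaled_loss.backward()
    # 3. Unscale the weight gradients (multiply by 1/scale) before the optimiser sees them.
    inv_scale = 1.0 / scale
    for p in model.parameters():
        if p.grad is not None:
            p.grad.mul_(inv_scale)
    optimizer.step()
    optimizer.zero_grad()
```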
Loss scaling: where FP32 is typically still used
  1. Master copy of weights
  1. Large reductions ➡ī¸ e.g. batch-norm mean & var statistics
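One common way to realise the FP32 master-weights exception, as a sketch; the helper names are illustrative, not a library API:

```python
import torch

def make_master_copy(fp16_params):
    # Keep an FP32 master copy of each parameter; the optimiser updates these,
    # not the FP16 working weights.
    return [p.detach().clone().float() for p in fp16_params]

def sync_fp16_from_master(master_params, fp16_params):
    # After the optimiser step, refresh the FP16 working copy from the FP32 masters.
    with torch.no_grad():
        for master, p in zip(master_params, fp16_params):
            p.copy_(master)  # copy_ casts FP32 -> FP16 automatically
```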
Loss scaling: how to choose a scaling factor dynamically
Increase the scale gradually; when gradients overflow (Inf/NaN), decrease it and skip that optimiser step (see the sketch below)
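A sketch of the usual growth/backoff scheme for choosing the scale dynamically; the function name and parameter values here are illustrative (PyTorch's GradScaler follows a similar policy):

```python
import torch

def update_scale(grads, scale, good_steps,
                 growth=2.0, backoff=0.5, growth_interval=2000):
    # Overflow check: any Inf/NaN in the gradients means the scale is too large.
    overflow = any(not torch.isfinite(g).all() for g in grads if g is not None)
    if overflow:
        # Decrease the scale and signal the caller to skip this optimiser step.
        return scale * backoff, 0, True
    good_steps += 1
    if good_steps >= growth_interval:
        # No overflow for a while: try a larger scale again.
        return scale * growth, 0, False
    return scale, good_steps, False
```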