Benefits of sub-32-bit fp training
- Less memory usage
- Less memory bandwidth required (local and network)
- Faster math (FP16 arithmetic has higher throughput on most accelerators)
Minimum non-denorm value in FP16
- Bias = 15
- Min exp field value for non-denorm = 1, so the actual exponent is 1 - 15 = -14
- The minimum value therefore is 2^-14 ≈ 6.1 × 10^-5
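A quick check in Python, a minimal sketch comparing the value against NumPy's float16 metadata (variable names are just for illustration):

```python
import numpy as np

# Smallest normal (non-denorm) FP16 value: exponent field = 1, significand = 0
bias = 15
min_normal = 2.0 ** (1 - bias)                  # 2^-14
print(min_normal)                               # 6.103515625e-05
assert min_normal == np.finfo(np.float16).tiny  # NumPy's smallest normal float16
```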
Minimum denorm value in FP16
- Minimum denorm value has exponent field = 0, which still scales by 2^-14 but drops the implicit leading 1
- If only the smallest significand bit is set, we multiply by 2^-10
- Giving 2^-14 × 2^-10 = 2^-24 ≈ 5.96 × 10^-8
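The same kind of check for the smallest denormal; note that `np.finfo(...).smallest_subnormal` assumes NumPy ≥ 1.22:

```python
import numpy as np

# Smallest denormal FP16 value: exponent field = 0 (still scales by 2^-14),
# with only the lowest of the 10 significand bits set (a further factor of 2^-10)
min_denorm = 2.0 ** -14 * 2.0 ** -10   # 2^-24
print(min_denorm)                      # 5.960464477539063e-08
assert min_denorm == np.finfo(np.float16).smallest_subnormal
```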
Maximum value in FP16
- Set all exp bits to 1 except the smallest (the all-ones exponent is reserved for Inf/NaN)
- Bias = 15
- Exp field value = 0b11110 = 30
- Combined with the bias this gives an exponent of 30 - 15 = 15
- Setting all the significand bits to 1 then multiplies by 2 - 2^-10
- To give a max value of 2^15 × (2 - 2^-10) = 65504
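The same arithmetic as a short sketch, checked against NumPy's reported float16 maximum:

```python
import numpy as np

# Largest finite FP16 value: exponent field = 0b11110 = 30, all significand bits set
bias = 15
exponent = 30 - bias                # 15
significand = 2.0 - 2.0 ** -10      # implicit 1 + (1 - 2^-10)
max_value = 2.0 ** exponent * significand
print(max_value)                    # 65504.0
assert max_value == np.finfo(np.float16).max
```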
Loss scaling steps
All in FP16 (with a few specific exceptions):
- Multiply the loss by a scale factor
- Run standard backprop (the chain rule ensures the scaling propagates to every gradient)
- Multiply the weight gradients by 1/scale and feed them to the optimiser (sketched below)
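A minimal sketch of one training step with static loss scaling in PyTorch; `model`, `optimiser`, `loss_fn`, and the `scale` value are placeholder assumptions, not something from these notes:

```python
import torch

def scaled_step(model, optimiser, loss_fn, x, y, scale=1024.0):
    # All names are placeholders; scale=1024.0 is an arbitrary example value
    optimiser.zero_grad()
    loss = loss_fn(model(x), y)
    (loss * scale).backward()        # steps 1-2: scale the loss, backprop scales all grads
    for p in model.parameters():     # step 3: multiply grads by 1/scale
        if p.grad is not None:
            p.grad.div_(scale)
    optimiser.step()
```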
Loss scaling: when to use FP32 typically
- Master copy of weights (sketched after this list)
- Large reductions → e.g. batch-norm mean & variance statistics
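A hypothetical sketch of the FP32 master-weight pattern, assuming an FP16 model and a plain SGD update; all names are illustrative:

```python
import torch

def make_master_copy(model):
    # FP32 master copy of the FP16 working weights (illustrative helper)
    return [p.detach().clone().float() for p in model.parameters()]

def sgd_update(model, master_params, lr, scale):
    for p, m in zip(model.parameters(), master_params):
        grad = p.grad.float() / scale   # unscale the FP16 gradient in FP32
        m -= lr * grad                  # accumulate the update in FP32
        p.data.copy_(m)                 # round back down to FP16 for the next step
```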
Loss scaling: how to choose a scaling factor dynamically
Increase the scale gradually while training runs without overflow; when an overflow occurs, skip that step's weight update and decrease the scale (see the sketch below)
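A minimal sketch of such a dynamic scaler; the growth/backoff constants mirror common defaults but are assumptions here:

```python
class DynamicLossScaler:
    def __init__(self, init_scale=2.0 ** 15, growth_factor=2.0,
                 backoff_factor=0.5, growth_interval=2000):
        self.scale = init_scale
        self.growth_factor = growth_factor
        self.backoff_factor = backoff_factor
        self.growth_interval = growth_interval
        self._good_steps = 0

    def update(self, found_overflow: bool) -> None:
        if found_overflow:
            # Inf/NaN in the gradients: skip the weight update, shrink the scale
            self.scale *= self.backoff_factor
            self._good_steps = 0
        else:
            self._good_steps += 1
            if self._good_steps >= self.growth_interval:
                # Long overflow-free stretch: try a larger scale
                self.scale *= self.growth_factor
                self._good_steps = 0
```

Since the weight update is skipped whenever an overflow is detected, an occasionally too-large scale only costs the odd wasted iteration rather than corrupting the weights.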