Mixed Precision Training

Benefits of sub-32-bit fp training
  1. Less memory usage
  1. Less memory bandwidth required (local and network)
  1. Faster math (FP16 arithmetic has higher throughput on supporting hardware)
Minimum non-denorm value in FP16
  1. Bias = 15
  1. Min exp number for non-denorm = 1
  1. The minimum value is therefore 2^(1 - 15) = 2^-14 ≈ 6.1e-5 (worked out below)
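Putting those pieces together, the smallest normal (non-denorm) FP16 value works out as:

```latex
x_{\min}^{\text{norm}} = 1.0 \times 2^{\,1 - 15} = 2^{-14} \approx 6.10 \times 10^{-5}
```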
Minimum denorm value in FP16
  1. Minimum denorm: the exponent field is 0, so the scale is 2^(1 - 15) = 2^-14 with no implicit leading 1
  1. Setting only the smallest significand bit multiplies this by 2^-10
  1. Giving 2^-14 × 2^-10 = 2^-24 ≈ 6.0e-8 (worked out below)
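Putting those pieces together, the smallest denormal FP16 value works out as:

```latex
x_{\min}^{\text{denorm}} = 2^{-10} \times 2^{\,1 - 15} = 2^{-24} \approx 5.96 \times 10^{-8}
```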
Maximum value in FP16
  1. Set all exponent bits to 1 except the smallest (all 1s would encode Inf/NaN)
  1. Bias = 15
  1. Exp number = 0b11110 = 30
  1. Combined with the bias this gives a scale of 2^(30 - 15) = 2^15 = 32768
  1. Setting all the significand bits to 1 then multiplies by 2 - 2^-10 = 1.9990234375
  1. To give a max value of 32768 × (2 - 2^-10)
  1. = 65504 (worked out below)
Loss scaling steps
All in FP16 (with a few specific exceptions):
  1. Multiply loss by scale factor
  1. Standard backprop (chain rule ensures scaling propagates)
  1. Multiply the weight gradients by 1/scale (unscale them) and feed them to the optimiser (see the sketch below)
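A minimal PyTorch-style sketch of these steps, assuming an FP16 model, a loss already computed in FP16, and an illustrative fixed scale of 1024 (the function name is hypothetical, not a library API):

```python
import torch

def scaled_step(model, optimizer, loss, scale=1024.0):
    # 1. Multiply the (FP16) loss by the scale factor.
    scaled_loss = loss * scale
    # 2. Standard backprop; the chain rule carries the scale into every gradient.
    scaled_loss.backward()
    # 3. Unscale the weight gradients (multiply by 1/scale) before the optimiser sees them.
    inv_scale = 1.0 / scale
    for p in model.parameters():
        if p.grad is not None:
            p.grad.mul_(inv_scale)
    optimizer.step()
    optimizer.zero_grad()
```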
Loss scaling: where FP32 is typically still used
  1. Master copy of weights
  1. Large reductions ➡ī¸ e.g. batch-norm mean & var statistics
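One common way to realise the FP32 master-weights exception, as a sketch; the helper names are illustrative, not a library API:

```python
import torch

def make_master_copy(fp16_params):
    # Keep an FP32 master copy of each parameter; the optimiser updates these,
    # not the FP16 working weights.
    return [p.detach().clone().float() for p in fp16_params]

def sync_fp16_from_master(master_params, fp16_params):
    # After the optimiser step, refresh the FP16 working copy from the FP32 masters.
    with torch.no_grad():
        for master, p in zip(master_params, fp16_params):
            p.copy_(master)  # copy_ casts FP32 -> FP16 automatically
```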
Loss scaling: how to choose a scaling factor dynamically
Increase the scale gradually; when gradients overflow (Inf/NaN), decrease it and skip that optimiser step (see the sketch below)
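A sketch of the usual growth/backoff scheme for choosing the scale dynamically; the function name and parameter values here are illustrative (PyTorch's GradScaler follows a similar policy):

```python
import torch

def update_scale(grads, scale, good_steps,
                 growth=2.0, backoff=0.5, growth_interval=2000):
    # Overflow check: any Inf/NaN in the gradients means the scale is too large.
    overflow = any(not torch.isfinite(g).all() for g in grads if g is not None)
    if overflow:
        # Decrease the scale and signal the caller to skip this optimiser step.
        return scale * backoff, 0, True
    good_steps += 1
    if good_steps >= growth_interval:
        # No overflow for a while: try a larger scale again.
        return scale * growth, 0, False
    return scale, good_steps, False
```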