Adam: A Method for Stochastic Optimization
Diederik P. Kingma, Jimmy Ba



Who wrote the Adam paper? (last names)
Kingma & Ba
What year was the Adam paper published?
2015 (ICLR; the arXiv preprint appeared in December 2014)


What kinds of optimisation problems does the Adam paper target?
  1. Stochastic
  1. High-dimensional parameter space
From what is the name Adam derived?
Adaptive moment estimation
Which previous methods does Adam leverage?
AdaGrad & RMSProp
Why is Adam's focus on stochastic functions important for ML?
Minibatch training makes the loss stochastic: each minibatch gives a different, noisy evaluation of the objective and its gradient



Adam equations
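For reference, the update rule from Algorithm 1 of the paper can be written out as below. This is a transcription, with $g_t$ the minibatch gradient, $\alpha$ the step size, $\beta_1, \beta_2$ the decay rates, $\epsilon$ a small constant, and $m_0 = v_0 = 0$.

```latex
\begin{aligned}
g_t &= \nabla_\theta f_t(\theta_{t-1}) \\
m_t &= \beta_1 m_{t-1} + (1-\beta_1)\, g_t \\
v_t &= \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2 \\
\hat{m}_t &= m_t / (1-\beta_1^t) \\
\hat{v}_t &= v_t / (1-\beta_2^t) \\
\theta_t &= \theta_{t-1} - \alpha\, \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)
\end{aligned}
```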
What is the "effective" step-size of Adam?
$\Delta_t = \alpha \cdot \hat{m}_t / \sqrt{\hat{v}_t}$ (assuming $\epsilon = 0$)
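How can we re-write the Adam bias-correction in one line?
$\alpha_t = \alpha \cdot \sqrt{1-\beta_2^t} / (1-\beta_1^t)$
(and the final line then uses the un-corrected estimates: $\theta_t = \theta_{t-1} - \alpha_t \cdot m_t / (\sqrt{v_t} + \hat{\epsilon})$)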
Adam one-line version
(term 1 = bias-correction, term 2 & 3 = factorisation)
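One way to reconstruct the one-line version is to substitute the closed forms of $m_t$ and $v_t$ into $\hat{m}_t / \sqrt{\hat{v}_t}$, which gives $\Delta_t = \alpha \cdot \frac{\sqrt{1-\beta_2^t}}{1-\beta_1^t} \cdot \frac{1-\beta_1}{\sqrt{1-\beta_2}} \cdot \frac{\sum_i \beta_1^{t-i} g_i}{\sqrt{\sum_i \beta_2^{t-i} g_i^2}}$. The sketch below checks numerically that this product matches the usual recursive step. It assumes $\epsilon = 0$ and scalar gradients, and the function names are mine rather than the paper's.

```python
import math

def adam_step_standard(grads, alpha=0.001, beta1=0.9, beta2=0.999):
    """Step at time t = len(grads) using the usual recursive form (eps = 0)."""
    m = v = 0.0
    for g in grads:
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
    t = len(grads)
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    return alpha * m_hat / math.sqrt(v_hat)

def adam_step_one_line(grads, alpha=0.001, beta1=0.9, beta2=0.999):
    """Same step written as a product of three terms (eps = 0)."""
    t = len(grads)
    term1 = math.sqrt(1 - beta2 ** t) / (1 - beta1 ** t)   # bias correction
    term2 = (1 - beta1) / math.sqrt(1 - beta2)             # constant set by the betas
    num = sum(beta1 ** (t - i) * g for i, g in enumerate(grads, start=1))
    den = math.sqrt(sum(beta2 ** (t - i) * g * g for i, g in enumerate(grads, start=1)))
    term3 = num / den                                      # ratio of weighted gradient sums
    return alpha * term1 * term2 * term3

grads = [0.3, -1.2, 0.7, 2.0, -0.1]   # arbitrary toy gradients
print(adam_step_standard(grads), adam_step_one_line(grads))
```

The two printed values should agree up to floating-point rounding.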
How is Adam modified to get AdaMax?
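Replace the $\sqrt{\hat{v}_t}$ term with $u_t = \max(\beta_2 \cdot u_{t-1}, |g_t|)$ (no bias correction is needed for $u_t$).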
How, theoretically, does AdaMax relate to Adam?
It's equivalent to replacing the $L^2$ norm used for $v_t$ with an $L^\infty$ norm.
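A minimal sketch of the AdaMax update for a single scalar parameter, following the description in the paper. The function name and toy gradients are illustrative; the defaults $\alpha = 0.002$, $\beta_1 = 0.9$, $\beta_2 = 0.999$ are the values the paper suggests for AdaMax.

```python
def adamax_step(theta, m, u, g, t, alpha=0.002, beta1=0.9, beta2=0.999):
    """One AdaMax update for a scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * g                      # first moment, same as Adam
    u = max(beta2 * u, abs(g))                           # infinity-norm replaces the v_t average
    theta = theta - (alpha / (1 - beta1 ** t)) * m / u   # only m is bias-corrected
    return theta, m, u

theta, m, u = 1.0, 0.0, 0.0
for t, g in enumerate([0.5, -2.0, 0.1], start=1):        # toy gradient stream
    theta, m, u = adamax_step(theta, m, u, g, t)
print(theta, m, u)
```

Note that this sketch divides by $u_t$ directly, so it would fail if every gradient so far were exactly zero; a real implementation would need to guard against that.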


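When using Adam, what approximate upper bound do we have on the step-size magnitude and why?
- $|\Delta_t| \lessapprox \alpha$, because in the one-line Adam formula each term's magnitude:
  1. Tends to 1 (the bias-correction term)
  1. Is <1 only if the beta params satisfy $1-\beta_1 < \sqrt{1-\beta_2}$: the paper bounds the step by $\alpha$ in that case, and by $\alpha (1-\beta_1)/\sqrt{1-\beta_2}$ otherwise (about $3.16\,\alpha$ for the defaults $\beta_1 = 0.9$, $\beta_2 = 0.999$, reached only with very sparse gradients)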
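Why can Adam be thought of as establishing a trust region?
It gives an approximate upper bound of $\alpha$ on the step-size magnitude, so each update stays within a region around the current parameters where the gradient estimate is still informative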
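A small numeric check of the bound, assuming $\epsilon = 0$ and scalar gradients (the helper name is mine). A constant gradient gives a step of magnitude $\alpha$, while a single gradient arriving after a long run of zeros approaches the paper's looser bound $\alpha (1-\beta_1)/\sqrt{1-\beta_2}$.

```python
import math

def effective_step(grads, alpha=0.001, beta1=0.9, beta2=0.999):
    """|alpha * m_hat / sqrt(v_hat)| after consuming all of grads (eps = 0)."""
    m = v = 0.0
    for g in grads:
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
    t = len(grads)
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    return abs(alpha * m_hat / math.sqrt(v_hat))

dense = effective_step([5.0] * 1000)            # same gradient at every step
sparse = effective_step([0.0] * 9999 + [5.0])   # one gradient after many zeros
loose_bound = 0.001 * (1 - 0.9) / math.sqrt(1 - 0.999)
print(dense, sparse, loose_bound)
```

The dense case sits at $\alpha$ regardless of the gradient's size, which is the trust-region behaviour; the sparse case shows why the bound is only approximate for the default betas.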
What does the Adam paper term the signal-to-noise ratio?
$\hat{m}_t / \sqrt{\hat{v}_t}$
In the Adam paper, what is $\hat{m}_t / \sqrt{\hat{v}_t}$ termed?
The signal-to-noise ratio
In the Adam paper, why is it desirable that a low signal-to-noise ratio gives a smaller step?
A smaller SNR (i.e. more noise) means greater uncertainty about whether the direction of $\hat{m}_t$ matches the direction of the true gradient, so smaller steps are safer; the SNR typically falls towards 0 close to an optimum
For Adam, what does multiplying the learning rate by the signal-to-noise ratio do to it over time?
It anneals it automatically: the SNR typically falls towards 0 as we get closer to an optimum, so the effective step shrinks
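What effect does the scale of the gradients have on Adam?
None: the update is invariant to rescaling the gradients (exactly so when $\epsilon = 0$)
Why is Adam invariant to the scale of the gradient?
In term 3 of the one-line equation, any constant scale factor $c$ cancels: it scales $\hat{m}_t$ by $c$ and $\hat{v}_t$ by $c^2$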
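A quick numeric check of the invariance, assuming $\epsilon = 0$. The helper below is an illustrative scalar Adam, not code from the paper, and the scale factor 1024 is arbitrary.

```python
import math

def adam_updates(grads, alpha=0.001, beta1=0.9, beta2=0.999, eps=0.0):
    """Return the sequence of Adam steps for a stream of scalar gradients."""
    m = v = 0.0
    steps = []
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        steps.append(alpha * m_hat / (math.sqrt(v_hat) + eps))
    return steps

grads = [0.3, -1.2, 0.7, 2.0, -0.1]                   # arbitrary toy gradients
plain = adam_updates(grads)
scaled = adam_updates([1024.0 * g for g in grads])    # every gradient scaled by c = 1024
print(max(abs(a - b) for a, b in zip(plain, scaled)))
```

Passing a non-zero `eps` (for example `1e-8`) makes the two sequences differ slightly, since $\epsilon$ does not scale with the gradients.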
Given Adam is invariant to the scale of the gradient, why do we still need to unscale the gradients when performing loss scaling?
(I think) because the invariance only holds when the scale stays constant across steps, whereas dynamic loss scaling changes the scale over time (and with $\epsilon > 0$ the invariance is only approximate anyway)

Initialisation Bias Correction

I initially had trouble understanding these equations. The following will help:
  1. We're comparing to not , which I initially thought.
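  1. To get from equations (2) to (3) we assume $\mathbb{E}[g_i^2]$ is approximated by $\mathbb{E}[g_t^2]$, and use the $\zeta$ to correct for the approximation.
  1. From (3) to (4) is a very simple step following our rule for sums of finite geometric series: $\sum_{i=1}^{t} \beta_2^{t-i} = (1-\beta_2^t)/(1-\beta_2)$.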
Why, broadly, does Adam need a correction term?
Because it's biased towards whatever value we initialise the running estimates with
By initialising the moving averages in Adam as zeros, what problem do we create?
We bias the estimates towards zero, especially during the first steps and when the betas are close to 1
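To make the bias concrete, here is a small sketch (names are illustrative) that feeds a constant gradient into the second-moment average. The uncorrected estimate starts near zero and only slowly approaches $g^2$, while dividing by $1 - \beta_2^t$ recovers $g^2$ at every step.

```python
def second_moment_estimates(g, steps, beta2=0.999):
    """Return (raw, corrected) second-moment estimates for a constant gradient g."""
    v = 0.0                                  # zero initialisation, as in Adam
    out = []
    for t in range(1, steps + 1):
        v = beta2 * v + (1 - beta2) * g * g
        v_hat = v / (1 - beta2 ** t)         # initialisation bias correction
        out.append((v, v_hat))
    return out

estimates = second_moment_estimates(g=2.0, steps=10)
raw_first, corrected_first = estimates[0]
print(raw_first, corrected_first)            # raw is far below g^2 = 4, corrected is not
```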
By initialising the moving averages in Adam as zeros, what closed-form expression of the current estimate do we get?
$v_t = (1-\beta_2) \sum_{i=1}^{t} \beta_2^{t-i} \cdot g_i^2$
$1 - \beta_2^t$ can be expressed in terms of a geometric series how?
$(1-\beta_2) \sum_{i=1}^{t} \beta_2^{t-i}$
Proof of Adam bias correction (same holds for $m_t$ and $v_t$):
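For reference, the chain of steps in the paper's derivation can be written out as below for the second moment. The labels (2) to (4) follow the equation numbering used in these notes, and $\zeta$ is the error from replacing each $\mathbb{E}[g_i^2]$ with $\mathbb{E}[g_t^2]$ (zero when the second moment is stationary).

```latex
\begin{aligned}
v_t &= (1-\beta_2) \sum_{i=1}^{t} \beta_2^{\,t-i}\, g_i^2 \\
\mathbb{E}[v_t] &= \mathbb{E}\!\left[ (1-\beta_2) \sum_{i=1}^{t} \beta_2^{\,t-i}\, g_i^2 \right] && (2) \\
&= \mathbb{E}[g_t^2]\,(1-\beta_2) \sum_{i=1}^{t} \beta_2^{\,t-i} + \zeta && (3) \\
&= \mathbb{E}[g_t^2]\,(1-\beta_2^t) + \zeta && (4)
\end{aligned}
```

Dividing $v_t$ by $(1-\beta_2^t)$ therefore removes the factor introduced by the zero initialisation, which is exactly the $\hat{v}_t$ correction; the same argument with $\beta_1$ and $g_i$ gives $\hat{m}_t$.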

Convergence Analysis

I've skipped this part: it looks hugely complicated, and in fact it was shown later to contain an error (it took several years to spot, so it appears most people skipped this part). I'm sure someone out there has done a simpler presentation of this, which at some point I may take a look at.


Logistic Regression

  • MNIST & IMDB (bag-of-words features)
  • L2 regularisation
  • Simple convex objective
  • Stepsize decay ($\alpha_t = \alpha / \sqrt{t}$)

Neural Net