Adam

About

Who wrote the Adam paper? (last names)

Kingma & Ba

What year was that Adam paper published?

2015

Introduction

What kinds of optimisation problems does the Adam paper target?

Stochastic

High-dimensional parameter space

From what is the name Adam derived?

Adaptive moment estimation

Which previous methods does Adam leverage?

AdaGrad & RMSProp

Why is Adam's focus on stochastic functions important for ML?

Minibatch training makes the loss stochastic: SGD sees a different loss value (and gradient) for each minibatch

Method

Algorithm

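Adam equations

$$
\begin{aligned}
g_t &= \nabla_\theta f_t(\theta_{t-1}) \\
m_t &= \beta_1 m_{t-1} + (1 - \beta_1)\, g_t \\
v_t &= \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2 \\
\hat{m}_t &= m_t / (1 - \beta_1^t) \\
\hat{v}_t &= v_t / (1 - \beta_2^t) \\
\theta_t &= \theta_{t-1} - \alpha\, \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)
\end{aligned}
$$

(with $m_0 = v_0 = 0$, and $g_t^2$ taken elementwise)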
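To make the update rule concrete, here is a minimal NumPy sketch of a single Adam step. The function name `adam_step` and the toy quadratic objective are my own assumptions for illustration, not anything from the paper; the default hyperparameters are the ones the paper suggests.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based timestep. (Hypothetical helper.)"""
    m = beta1 * m + (1 - beta1) * grad          # running average of the gradient
    v = beta2 * v + (1 - beta2) * grad ** 2     # running average of the squared gradient
    m_hat = m / (1 - beta1 ** t)                # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)                # bias-corrected second moment
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Toy problem (assumed): minimise f(theta) = sum(theta^2), whose gradient is 2 * theta.
theta = np.array([1.0, -2.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)
for t in range(1, 5001):
    theta, m, v = adam_step(theta, 2 * theta, m, v, t, alpha=0.01)
final_theta = theta
```

The moving averages start at zero and are threaded through the loop by the caller, which keeps the step function stateless.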

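What is the "effective" step-size of Adam?

$\Delta_t = \alpha \cdot \hat{m}_t / \sqrt{\hat{v}_t}$ (assuming $\epsilon = 0$, where $\hat{m}_t$ and $\hat{v}_t$ are the bias-corrected moment estimates)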

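How can we re-write the Adam bias-correction in one line?

$\alpha_t = \alpha \cdot \sqrt{1 - \beta_2^t} / (1 - \beta_1^t)$

(and the final line then uses the un-corrected estimates: $\theta_t = \theta_{t-1} - \alpha_t \cdot m_t / (\sqrt{v_t} + \hat{\epsilon})$)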

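Adam one-line version

$$
\Delta_t = \alpha \cdot \frac{\sqrt{1 - \beta_2^t}}{1 - \beta_1^t} \cdot \frac{1 - \beta_1}{\sqrt{1 - \beta_2}} \cdot \frac{\sum_{i=1}^{t} \beta_1^{t-i} g_i}{\sqrt{\sum_{i=1}^{t} \beta_2^{t-i} g_i^2}}
$$

(term 1 = bias-correction, term 2 & 3 = factorisation of $m_t / \sqrt{v_t}$, taking $\epsilon = 0$)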
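As a sanity check on folding the bias-correction into the step-size, the sketch below runs both orderings side by side. The random gradients are a stand-in I've assumed, and $\epsilon$ is set to 0, since the two forms only match exactly in that case (with a non-zero $\epsilon$ they differ slightly).

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, beta1, beta2 = 0.001, 0.9, 0.999

theta_a = np.array([0.5, -1.5])
theta_b = theta_a.copy()
m = np.zeros(2)
v = np.zeros(2)
for t in range(1, 101):
    g = rng.normal(size=2)                  # stand-in stochastic gradient (assumed)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2

    # (a) explicit bias correction, then the update
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    theta_a = theta_a - alpha * m_hat / np.sqrt(v_hat)

    # (b) correction folded into the step-size, un-corrected estimates in the update
    alpha_t = alpha * np.sqrt(1 - beta2 ** t) / (1 - beta1 ** t)
    theta_b = theta_b - alpha_t * m / np.sqrt(v)

max_gap = np.max(np.abs(theta_a - theta_b))   # largest difference between the two
```

Version (b) saves two elementwise divisions per step, which is the efficiency argument for reordering.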

How is Adam modified to get AdaMax?

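Replace the $\sqrt{\hat{v}_t}$ term with $u_t = \max(\beta_2 \cdot u_{t-1}, |g_t|)$. The step becomes $\theta_t = \theta_{t-1} - (\alpha / (1 - \beta_1^t)) \cdot m_t / u_t$, and $u_t$ needs no bias correction.

How, theoretically, does AdaMax relate to Adam?

It's equivalent to replacing the $L^2$ norm used for $v_t$ with an $L^\infty$ norm.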
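A minimal sketch of one AdaMax step, for comparison with Adam. The helper name `adamax_step` and the toy quadratic are assumptions of mine; the default $\alpha = 0.002$ is the value the paper suggests for AdaMax. Note this sketch has no guard against $u_t = 0$, which would happen if a gradient entry were exactly zero from the start.

```python
import numpy as np

def adamax_step(theta, grad, m, u, t, alpha=0.002, beta1=0.9, beta2=0.999):
    """One AdaMax update; t is the 1-based timestep. (Hypothetical helper.)"""
    m = beta1 * m + (1 - beta1) * grad
    u = np.maximum(beta2 * u, np.abs(grad))     # exponentially weighted infinity norm
    theta = theta - (alpha / (1 - beta1 ** t)) * m / u   # only m is bias-corrected
    return theta, m, u

# Toy problem (assumed): f(theta) = sum(theta^2), gradient 2 * theta.
theta0 = np.array([1.0, -2.0])
grad0 = 2 * theta0
theta1, m, u = adamax_step(theta0, grad0, np.zeros(2), np.zeros(2), t=1)

theta = theta1
for t in range(2, 101):
    theta, m, u = adamax_step(theta, 2 * theta, m, u, t)
```

On the first step every coordinate moves by exactly $\alpha$ in the direction opposite to its gradient's sign, since $m_1 / (1 - \beta_1) = g_1$ and $u_1 = |g_1|$.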

Interpretation

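When using Adam, what approximate upper bound do we have on the step-size magnitude and why?

$|\Delta_t| \lessapprox \alpha$, because in the one-line Adam formula, $\Delta_t = \alpha \cdot (\text{term 1}) \cdot (\text{term 2}) \cdot (\text{term 3})$, the terms behave as follows:

Term 1, $\sqrt{1 - \beta_2^t} / (1 - \beta_1^t)$: tends to 1 as $t$ grows

Term 2, $(1 - \beta_1) / \sqrt{1 - \beta_2}$: fixed by the beta params. It is < 1 only when $1 - \beta_1 < \sqrt{1 - \beta_2}$; the paper's defaults ($\beta_1 = 0.9$, $\beta_2 = 0.999$) give about 3.16, and the paper's bound in that case is $|\Delta_t| \le \alpha \cdot (1 - \beta_1) / \sqrt{1 - \beta_2}$, reached only under extreme sparsity (a gradient that has been zero at every step except the current one)

In more common scenarios the paper argues $\hat{m}_t / \sqrt{\hat{v}_t} \approx \pm 1$, since $|\mathbb{E}[g] / \sqrt{\mathbb{E}[g^2]}| \le 1$, which gives $|\Delta_t| \lessapprox \alpha$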

Why can Adam be thought of as establishing a trust region?

It gives an approximate upper bound on the step-size magnitude of $\alpha$
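The bound can be probed numerically. The sketch below computes $|\Delta_t| / \alpha$ (with $\epsilon = 0$) for two scalar gradient sequences I've made up for illustration: a constant gradient, and an extremely sparse one that is zero until the very last step. The helper name `step_ratios` is hypothetical.

```python
import numpy as np

def step_ratios(grads, beta1=0.9, beta2=0.999):
    """|Delta_t| / alpha at every step of a scalar gradient sequence (eps = 0)."""
    m = v = 0.0
    ratios = []
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        # No gradient seen yet means no step; treat 0/0 as 0.
        ratios.append(abs(m_hat) / np.sqrt(v_hat) if v_hat > 0 else 0.0)
    return np.array(ratios)

beta1, beta2 = 0.9, 0.999
sparse_bound = (1 - beta1) / np.sqrt(1 - beta2)   # about 3.16 for these betas

dense = step_ratios(np.full(1000, 3.0))           # same gradient at every step
sparse_grads = np.zeros(20000)
sparse_grads[-1] = 3.0                            # zero until the final step
sparse = step_ratios(sparse_grads)
```

With the constant gradient the ratio stays at 1, so each step has magnitude $\alpha$. In the sparse case the final step approaches $\alpha \cdot (1 - \beta_1) / \sqrt{1 - \beta_2}$, which is why $\alpha$ is only an approximate bound.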

What does the Adam paper term the signal-to-noise ratio?

$\hat{m}_t / \sqrt{\hat{v}_t}$

In the Adam paper, what is $\hat{m}_t / \sqrt{\hat{v}_t}$ termed?

The signal-to-noise ratio

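In the Adam paper, why is it desirable that a smaller signal-to-noise ratio gives a smaller effective step?

A smaller SNR (i.e. more noise) means greater uncertainty about whether the direction of $\hat{m}_t$ matches the direction of the true gradient, so smaller steps are safer. The SNR typically gets closer to 0 towards an optimum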

For Adam, what does multiplying the learning rate by the signal-to-noise ratio do to it over time?

It anneals it automatically, as the SNR typically shrinks towards 0 as we get closer to an optimum

What effect does the scale of the gradients have on Adam?

It is invariant to them

Why is Adam invariant to the scale of the gradient?

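In term 3 of the one-line equation, any constant scale factor $c$ cancels: it multiplies the numerator (built from the $g_i$) by $c$ and the denominator (built from $\sqrt{\sum g_i^2}$) by $|c|$ (for $c > 0$ and $\epsilon = 0$)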
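A quick numerical check of the cancellation, using random stand-in gradients (an assumption for illustration) and a hypothetical helper `adam_deltas`. The second comparison shows that the invariance is only exact for $\epsilon = 0$: once the gradients are small enough to be comparable to $\epsilon$, the updates change.

```python
import numpy as np

def adam_deltas(grads, alpha=0.001, beta1=0.9, beta2=0.999, eps=0.0):
    """Parameter changes produced by Adam for a scalar gradient sequence."""
    m = v = 0.0
    deltas = []
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        deltas.append(-alpha * m_hat / (np.sqrt(v_hat) + eps))
    return np.array(deltas)

rng = np.random.default_rng(0)
grads = rng.normal(size=50)          # stand-in gradients (assumed)

# eps = 0: scaling by c scales m_hat by c and sqrt(v_hat) by |c|, so updates match.
gap_no_eps = np.max(np.abs(adam_deltas(grads) - adam_deltas(1000.0 * grads)))

# eps > 0: tiny gradients (scale 1e-6) are no longer large compared with eps.
gap_with_eps = np.max(
    np.abs(adam_deltas(grads, eps=1e-8) - adam_deltas(1e-6 * grads, eps=1e-8))
)
```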

Given Adam is invariant to the scale of the gradient, why do we still need to unscale the gradients when performing loss scaling?

(I think) because the invariance only holds for a fixed scale, whereas loss scaling uses variable scaling

Initialisation Bias Correction

I initially had trouble understanding these equations. The following will help:

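We're comparing the expectation $\mathbb{E}[v_t]$ to the true second moment $\mathbb{E}[g_t^2]$, not the raw values $v_t$ and $g_t^2$, which I initially thought.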

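To get from equations (2) to (3) we assume each $\mathbb{E}[g_i^2]$ is approximated by $\mathbb{E}[g_t^2]$, and use the $\zeta$ term to correct for the approximation ($\zeta = 0$ if the true second moment is stationary).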

From (3) to (4) is a very simple step following our rule for sums of finite geometric series.

Why, broadly, does Adam need a correction term?

Because it's biased towards whatever value we initialise the running estimates with

By initialising the moving averages in Adam as zeros what problem do we create?

We bias the estimates towards zero

By initialising the moving averages in Adam as zeros what closed-form expression of the current estimate do we get?

$v_t = (1 - \beta_2) \sum_{i=1}^{t} \beta_2^{t-i} \cdot g_i^2$

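$\mathbb{E}[v_t]$ can be expressed in terms of a geometric series how?

$\mathbb{E}[v_t] \approx \mathbb{E}[g_t^2] \cdot (1 - \beta_2) \sum_{i=1}^{t} \beta_2^{t-i} = \mathbb{E}[g_t^2] \cdot (1 - \beta_2^t)$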

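Proof of Adam bias correction (same holds for $m_t$ and $\beta_1$):

$$
\begin{aligned}
\mathbb{E}[v_t] &= \mathbb{E}\Big[(1 - \beta_2) \sum_{i=1}^{t} \beta_2^{t-i} \cdot g_i^2\Big] && (2) \\
&= \mathbb{E}[g_t^2] \cdot (1 - \beta_2) \sum_{i=1}^{t} \beta_2^{t-i} + \zeta && (3) \\
&= \mathbb{E}[g_t^2] \cdot (1 - \beta_2^t) + \zeta && (4)
\end{aligned}
$$

So dividing $v_t$ by $(1 - \beta_2^t)$ removes the bias towards zero (up to $\zeta$).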
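The bias and its correction are easy to see numerically in the stationary case. This sketch assumes a constant scalar gradient (my choice, so that $\mathbb{E}[g^2] = g^2$ exactly and $\zeta = 0$) and compares the running average, the closed-form sum, and the corrected estimate after 10 steps.

```python
import numpy as np

beta2 = 0.999
g = 2.0                       # constant gradient (assumed), so E[g^2] = 4 exactly
steps = 10

v = 0.0                       # zero initialisation is what causes the bias
for t in range(1, steps + 1):
    v = beta2 * v + (1 - beta2) * g ** 2

# Closed form of the same running average: (1 - beta2) * sum_i beta2^(t-i) * g_i^2
closed_form = (1 - beta2) * sum(beta2 ** (steps - i) * g ** 2 for i in range(1, steps + 1))

# The geometric series collapses to (1 - beta2^t), which the correction divides out.
v_hat = v / (1 - beta2 ** steps)
```

After 10 steps the raw estimate `v` is still roughly 1% of the true value 4, while `v_hat` recovers it.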

Convergence Analysis

I've skipped this part: it looks hugely complicated, and in fact it was shown later to contain an error (it took several years to spot, so it appears most people skipped this part). I'm sure someone out there has done a simpler presentation of this, which at some point I may take a look at.

Experiments

Logistic Regression

MNIST & IMDB reviews (bag-of-words features)

L2 regularisation

Simple convex objective

Stepsize decay ($\alpha_t = \alpha / \sqrt{t}$)
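A rough sketch of this kind of setup: L2-regularised logistic regression (a simple convex objective) trained with Adam and a $1/\sqrt{t}$ stepsize decay. Everything specific here is an assumption of mine rather than the paper's configuration: the data is synthetic instead of MNIST or IMDB, and the sizes, regularisation strength `lam`, and base stepsize are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 512, 20
X = rng.normal(size=(n, d))                      # synthetic features (assumed)
y = (X @ rng.normal(size=d) > 0).astype(float)   # synthetic binary labels (assumed)

def loss_and_grad(w, Xb, yb, lam=1e-3):
    """L2-regularised logistic loss and its gradient on a batch."""
    z = Xb @ w
    loss = np.mean(np.logaddexp(0.0, z) - yb * z) + 0.5 * lam * (w @ w)
    p = 1.0 / (1.0 + np.exp(-z))                 # predicted probabilities
    grad = Xb.T @ (p - yb) / len(yb) + lam * w
    return loss, grad

alpha, beta1, beta2, eps = 0.01, 0.9, 0.999, 1e-8
w = np.zeros(d)
m = np.zeros(d)
v = np.zeros(d)
initial_loss, _ = loss_and_grad(w, X, y)
for t in range(1, 501):
    idx = rng.choice(n, size=128, replace=False)  # one minibatch
    _, g = loss_and_grad(w, X[idx], y[idx])
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    alpha_t = alpha / np.sqrt(t)                  # 1/sqrt(t) stepsize decay
    w = w - alpha_t * m_hat / (np.sqrt(v_hat) + eps)
final_loss, _ = loss_and_grad(w, X, y)
```

With zero initial weights the starting loss is $\log 2$, and the full-data loss should be lower after training.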