### Contents

- About
- Introduction
- Method
  - Algorithm
  - Interpretation
  - Initialisation Bias Correction
  - Convergence Analysis
- Experiments
  - Logistic Regression
  - Neural Net
  - Bias-Correction

### About

## Who wrote the *Adam* paper? (last names)

Kingma & Ba

## What year was that *Adam* paper published?

2015

### Introduction

## What kinds of optimisation problems does the *Adam* paper target?

- Stochastic

- High-dimensional parameter space

## From what is the name *Adam* derived?

Adaptive moment estimation

## Which previous methods does *Adam* leverage?

AdaGrad & RMSProp

## Why is *Adam*'s focus on *stochastic* functions important for ML?

Because ML training uses minibatch SGD: the loss is a stochastic function of the parameters, taking a different value for each sampled minibatch

### Method

#### Algorithm

*Adam* equations (with gradient $g_t = \nabla_\theta f_t(\theta_{t-1})$ and decay rates $\beta_1, \beta_2$):

$$
\begin{aligned}
m_t &= \beta_1 m_{t-1} + (1-\beta_1)\, g_t \\
v_t &= \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2 \\
\hat{m}_t &= m_t / (1-\beta_1^t) \\
\hat{v}_t &= v_t / (1-\beta_2^t) \\
\theta_t &= \theta_{t-1} - \alpha\, \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)
\end{aligned}
$$
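A minimal NumPy sketch of these update rules (my own illustration, not code from the paper; the function name and interface are made up):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for parameters theta given gradient grad at (1-indexed) step t."""
    m = beta1 * m + (1 - beta1) * grad        # biased first-moment estimate
    v = beta2 * v + (1 - beta2) * grad**2     # biased second-moment estimate
    m_hat = m / (1 - beta1**t)                # bias-corrected first moment
    v_hat = v / (1 - beta2**t)                # bias-corrected second moment
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```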

## What is the "effective" step-size of *Adam*?

## How can we re-write the *Adam* bias-correction in one line?

(and the final line then uses the un-corrected estimates)

*Adam* one-line version:

$$\theta_t = \theta_{t-1} - \alpha \cdot \frac{\sqrt{1-\beta_2^t}}{1-\beta_1^t} \cdot \frac{m_t}{\sqrt{v_t} + \hat{\epsilon}}$$

(the $\frac{\sqrt{1-\beta_2^t}}{1-\beta_1^t}$ factor folds in the bias-correction; the $\frac{m_t}{\sqrt{v_t} + \hat{\epsilon}}$ factor is the update factorised in terms of the un-corrected estimates)
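A small sketch (my own check, not from the paper) that folding both bias corrections into the step size gives the same update as correcting $m_t$ and $v_t$ separately (with $\epsilon = 0$ so the two forms match exactly):

```python
import numpy as np

alpha, beta1, beta2 = 0.001, 0.9, 0.999
rng = np.random.default_rng(0)
m = v = 0.0
for t in range(1, 101):
    g = rng.normal()
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    # Two-line form: correct m_t and v_t separately.
    step_two_line = alpha * (m / (1 - beta1**t)) / np.sqrt(v / (1 - beta2**t))
    # One-line form: fold both corrections into the step size alpha_t.
    alpha_t = alpha * np.sqrt(1 - beta2**t) / (1 - beta1**t)
    step_one_line = alpha_t * m / np.sqrt(v)
    assert np.isclose(step_two_line, step_one_line)
```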

## How is *Adam* modified to get *AdaMax*?

Replace the $v_t = \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2$ update with $u_t = \max(\beta_2 \cdot u_{t-1}, |g_t|)$ (an exponentially weighted infinity norm), and divide by $u_t$ instead of $\sqrt{\hat{v}_t}$.

## How, theoretically, does *AdaMax* relate to *Adam*?

It's equivalent to replacing the $L^2$ norm used for $v_t$ with an $L^\infty$ norm (the limit of the $L^p$ generalisation as $p \to \infty$).
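A minimal NumPy sketch of the resulting *AdaMax* update (my own illustration; the paper's Algorithm 2 omits the small `eps` I add here for numerical safety):

```python
import numpy as np

def adamax_step(theta, grad, m, u, t, alpha=0.002, beta1=0.9, beta2=0.999, eps=1e-8):
    """One AdaMax update: v_t is replaced by an exponentially weighted infinity norm u_t."""
    m = beta1 * m + (1 - beta1) * grad
    u = np.maximum(beta2 * u, np.abs(grad))    # replaces v_t = beta2*v + (1-beta2)*g**2
    theta = theta - (alpha / (1 - beta1**t)) * m / (u + eps)
    return theta, m, u
```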

#### Interpretation

## When using *Adam*, what approximate upper bound do we have on the step-size magnitude and why?

- $\approx \alpha$ (the learning rate), because in the one-line *Adam* formula each factor's magnitude is $\lesssim 1$:

- The bias-correction factor $\frac{\sqrt{1-\beta_2^t}}{1-\beta_1^t}$ tends to 1

- The $\beta$ params are typically set so that $\left|\frac{m_t}{\sqrt{v_t}}\right| < 1$

## Why can *Adam* be thought of as establishing a trust region?

It gives an approximate upper bound of $\alpha$ on the step-size magnitude, so we know roughly the maximum distance any single update can move the parameters
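A quick numerical illustration of this (my own sketch, with made-up noisy gradients). Note the paper's strict bound with the default betas is $\alpha \cdot (1-\beta_1)/\sqrt{1-\beta_2} \approx 3.16\,\alpha$; in the common case $|\hat{m}_t/\sqrt{\hat{v}_t}| < 1$ keeps the step near or below $\alpha$:

```python
import numpy as np

alpha, beta1, beta2, eps = 0.001, 0.9, 0.999, 1e-8
rng = np.random.default_rng(1)
m = v = 0.0
max_step = 0.0
for t in range(1, 10_001):
    g = 3.0 + rng.normal()                    # noisy gradients with a non-zero mean
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    step = alpha * (m / (1 - beta1**t)) / (np.sqrt(v / (1 - beta2**t)) + eps)
    max_step = max(max_step, abs(step))
print(max_step, "vs alpha =", alpha)          # largest |step| stays on the order of alpha
```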

## What does the *Adam* paper term the *signal-to-noise ratio*?

$\hat{m}_t / \sqrt{\hat{v}_t}$

## In the *Adam* paper, what is $\hat{m}_t / \sqrt{\hat{v}_t}$ termed?

The *signal-to-noise ratio*

## In the *Adam* paper, do we want a high or low *signal-to-noise ratio* and why?

Smaller (i.e. more noise) near the optimum: the direction of the true gradient (the signal) becomes less certain as we approach the optimum, so smaller effective steps are appropriate there

## For *Adam*, what does multiplying the learning rate by the *signal-to-noise ratio* do to it over time?

It anneals it, since the SNR ⬇️ as we get closer to the optimum

## What effect does the scale of the gradients have on *Adam*?

It is invariant to them

## Why is *Adam* invariant to the scale of the gradient?

In the $\frac{m_t}{\sqrt{v_t} + \hat{\epsilon}}$ factor of the one-line equation, any constant scale on the gradients cancels (ignoring $\hat{\epsilon}$): rescaling $g_t \to c\, g_t$ gives $\frac{c\, m_t}{\sqrt{c^2 v_t}} = \frac{m_t}{\sqrt{v_t}}$
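A small sketch (my own check) that scaling every gradient by a fixed constant leaves the *Adam* step unchanged, since the constant cancels between $m_t$ and $\sqrt{v_t}$ ($\hat{\epsilon}$ omitted so the cancellation is exact):

```python
import numpy as np

alpha, beta1, beta2 = 0.001, 0.9, 0.999
grads = np.random.default_rng(2).normal(size=50)

def final_step(scale):
    """Run Adam's moment updates on scaled gradients and return the last step."""
    m = v = 0.0
    for t, g in enumerate(scale * grads, start=1):
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g**2
    alpha_t = alpha * np.sqrt(1 - beta2**t) / (1 - beta1**t)
    return alpha_t * m / np.sqrt(v)           # eps omitted so the cancellation is exact

assert np.isclose(final_step(1.0), final_step(1000.0))
```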

## Given *Adam* is invariant to the scale of the gradient, why do we still need to unscale the gradients when performing loss scaling?

(I think) because the invariance only holds for a *fixed* scale, whereas loss scaling uses variable scaling

#### Initialisation Bias Correction

I initially had trouble understanding these equations. The following will help:

- We're comparing $\mathbb{E}[v_t]$ to $\mathbb{E}[g_t^2]$, not $v_t$ to $g_t^2$, which I initially thought.

- To get from equation (2) to (3) we assume each $\mathbb{E}[g_i^2]$ is approximated by $\mathbb{E}[g_t^2]$, and use the $\zeta$ term to correct for the approximation error.

- From (3) to (4) is a very simple step following our rule for sums of finite geometric series: $(1-\beta_2)\sum_{i=1}^{t}\beta_2^{t-i} = 1-\beta_2^t$.

## Why, broadly, does *Adam* need a correction term?

Because it's biased towards whatever value we initialise the running estimates with

## By initialising the moving averages in *Adam* as zeros what problem do we create?

We bias the estimates towards zero

## By initialising the moving averages in *Adam* as zeros what closed-form expression of the current estimate do we get?

$$v_t = (1-\beta_2)\sum_{i=1}^{t}\beta_2^{t-i}\, g_i^2$$

(and analogously $m_t = (1-\beta_1)\sum_{i=1}^{t}\beta_1^{t-i}\, g_i$)

## How can $\sum_{i=1}^{t}\beta_2^{t-i}$ be expressed in terms of a geometric series?

$$\sum_{i=1}^{t}\beta_2^{t-i} = \sum_{k=0}^{t-1}\beta_2^{k} = \frac{1-\beta_2^t}{1-\beta_2}$$

## Proof of *Adam* bias correction (the same holds for $m_t$ and $\beta_1$):

$$
\begin{aligned}
\mathbb{E}[v_t] &= \mathbb{E}\Big[(1-\beta_2)\sum_{i=1}^{t}\beta_2^{t-i}\, g_i^2\Big] \\
&= \mathbb{E}[g_t^2]\,(1-\beta_2)\sum_{i=1}^{t}\beta_2^{t-i} + \zeta \\
&= \mathbb{E}[g_t^2]\,(1-\beta_2^t) + \zeta
\end{aligned}
$$

so dividing $v_t$ by $(1-\beta_2^t)$ removes the bias (up to the small error term $\zeta$).
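A small numerical sketch (my own, assuming i.i.d. unit-variance gradients) of both the closed form for $v_t$ and the effect of the bias correction:

```python
import numpy as np

beta2 = 0.999
rng = np.random.default_rng(3)

# 1) Recurrence vs closed form: v_t = (1 - beta2) * sum_i beta2**(t-i) * g_i**2
g = rng.normal(size=200)
v = 0.0
for t in range(1, len(g) + 1):
    v = beta2 * v + (1 - beta2) * g[t - 1] ** 2
closed_form = (1 - beta2) * sum(beta2 ** (len(g) - i) * g[i - 1] ** 2
                                for i in range(1, len(g) + 1))
assert np.isclose(v, closed_form)

# 2) With i.i.d. gradients (E[g^2] = 1), E[v_t] = E[g^2] * (1 - beta2**t), so the raw
#    estimate is biased towards its zero initialisation; dividing by (1 - beta2**t)
#    removes the bias.
n_runs, t = 10_000, 50
v_runs = np.zeros(n_runs)
for _ in range(t):
    g_t = rng.normal(size=n_runs)
    v_runs = beta2 * v_runs + (1 - beta2) * g_t**2
print(v_runs.mean())                       # ~ (1 - beta2**t) * 1 ≈ 0.049 (biased low)
print((v_runs / (1 - beta2**t)).mean())    # ~ 1 (bias-corrected)
```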

#### Convergence Analysis

I've skipped this part: it looks hugely complicated, and in fact it was shown later to contain an error (it took several years to spot, so it appears most people skipped this part). I'm sure someone out there has done a simpler presentation of this, which at some point I may take a look at.

### Experiments

#### Logistic Regression

- MNIST & IMBD word-pred

- L2 regularisation

- Simple convex objective

- Stepsize decay