Contents: Empirical Risk Minimisation · (Mini)Batch Algorithms (Batch Size) · Optimisation Challenges (Ill-conditioning · Critical Points · Local Minima · Saddle Points) · Optimisation Algorithms (SGD with momentum · AdaGrad · RMSProp · Adam) · Normalisation (Batch Norm · Layer Norm · Instance Norm · Group Norm · Weight Norm)
(This page is based primarily on material from Chapter 8 of the Deep Learning book)
The risk is defined as the expected loss over the true data-generating distribution:
$$J^*(\theta) = \mathbb{E}_{(x,y) \sim p_{\text{data}}} \, L(f(x; \theta), y)$$
The empirical risk is the risk taken over the empirical distribution defined by the training set:
$$\mathbb{E}_{(x,y) \sim \hat{p}_{\text{data}}} \, L(f(x; \theta), y) = \frac{1}{m} \sum_{i=1}^{m} L(f(x^{(i)}; \theta), y^{(i)})$$
The surrogate loss function is a tractable proxy for the loss we actually want to minimise: e.g. 0-1 classification error → NLL.
Why use a mini-batch?
Gradient descent requires that we calculate the gradient of the loss over the training dataset.
Using a mini-batch is a way of sampling this gradient.
We typically aim to use a minibatch size of around 32–256, for the following reasons:
- GPU: We require at least this many items to fully utilise GPUs, which also work best with power-of-2 sizes
- Memory: Large batches may not fit in memory, especially for data parallelism
- Diminishing returns: The standard error of the gradient estimate decreases only as $1/\sqrt{m}$ for batch size $m$, giving diminishing returns for larger batch sizes (see the sketch after this list)
- Generalisation: Large batch sizes empirically generalise worse, possibly because they converge to sharp minima. Generalisation error is often best with batch size = 1, although the gradient variance is extremely high there.
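A quick empirical check of the $1/\sqrt{m}$ scaling (a minimal sketch: the per-example "gradients" here are synthetic 1-D samples, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Per-example "gradients": the true gradient is the mean over all examples,
# and a minibatch gradient is the mean over a random subset of size m.
n_examples = 100_000
grads = rng.normal(loc=1.0, scale=5.0, size=n_examples)  # 1-D for simplicity

for m in [1, 32, 256, 4096]:
    # Standard error of a mean of m samples scales as sigma / sqrt(m):
    estimates = [grads[rng.choice(n_examples, size=m)].mean() for _ in range(1000)]
    print(f"batch size {m:5d}: std of gradient estimate = {np.std(estimates):.4f}")
# Going from m=32 to m=4096 costs 128x the compute but only ~11x less noise.
```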
For second-order optimisation methods, larger batch sizes are required to minimise fluctuations in the estimate of $H^{-1}g$.
Why is a high condition number for the Hessian a problem for optimisation?
The second-order Taylor approximation to the cost function predicts that a gradient descent step of $-\epsilon g$ adds
$$\frac{1}{2}\epsilon^2 g^\top H g - \epsilon g^\top g$$
to the cost.
In many cases the first-order term $g^\top g$ does not shrink significantly during training.
However, the curvature term $g^\top H g$ will often grow significantly (ill-conditioning), meaning small values of the learning rate $\epsilon$ must be used.
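A minimal numeric illustration of the effect, using a hypothetical 2-D quadratic cost with condition number 100 (all values chosen for illustration):

```python
import numpy as np

# Ill-conditioned quadratic cost J(theta) = 0.5 * theta^T H theta.
H = np.diag([1.0, 100.0])          # condition number = 100
theta = np.array([1.0, 1.0])

eps = 0.019  # just under the stability limit 2 / lambda_max = 0.02
for step in range(5):
    g = H @ theta                  # gradient of the quadratic
    # Change in cost predicted by the 2nd-order Taylor expansion:
    predicted = 0.5 * eps**2 * g @ H @ g - eps * g @ g
    theta = theta - eps * g
    print(f"step {step}: predicted dJ = {predicted:+.4f}, theta = {theta}")
# The step size is capped by the largest eigenvalue (curvature 100),
# so progress along the lambda=1 direction is ~100x slower.
```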
What is a critical point?
A point where the gradient = 0.
How can we tell what kind of critical point we have?
The eigenvalues of the Hessian fall into four cases (see the sketch after this list):
- All +ve (positive definite): local minimum
- All -ve (negative definite): local maximum
- Has both +ve and -ve: saddle point
- Else (some eigenvalues are zero): inconclusive
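A minimal sketch of this classification; the helper `classify_critical_point` and its tolerance are my own illustration, not from the source:

```python
import numpy as np

def classify_critical_point(hessian: np.ndarray, tol: float = 1e-8) -> str:
    """Classify a critical point from the eigenvalues of the Hessian there."""
    eigvals = np.linalg.eigvalsh(hessian)  # Hessian is symmetric
    if np.all(eigvals > tol):
        return "local minimum"      # positive definite
    if np.all(eigvals < -tol):
        return "local maximum"      # negative definite
    if np.any(eigvals > tol) and np.any(eigvals < -tol):
        return "saddle point"
    return "inconclusive"           # some eigenvalues ~ 0

# f(x, y) = x^2 - y^2 has a critical point at the origin:
print(classify_critical_point(np.array([[2.0, 0.0], [0.0, -2.0]])))  # saddle point
```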
How do local minima affect deep learning?
Experts now suspect that, for sufficiently large neural networks, most local minima have a low cost value, and that finding the global minimum is not particularly important.
How to assess if local minima are a problem?
If the norm of the gradient does not shrink to an insignificant size, the problem is neither local minima nor any other kind of critical point.
For high-dimensional problems, what can we say about saddle points versus local minima?
The ratio of saddle points to local minima grows exponentially in the number of dimensions. Intuitively, a local minimum requires every eigenvalue of the Hessian to be positive; if each eigenvalue's sign were an independent coin flip, that becomes exponentially unlikely as dimensionality grows.
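A crude Monte Carlo illustration of that intuition (random symmetric Gaussian matrices are only a stand-in for real Hessians):

```python
import numpy as np

rng = np.random.default_rng(0)

def frac_all_positive(n: int, trials: int = 2000) -> float:
    """Fraction of random symmetric matrices whose eigenvalues are all +ve."""
    count = 0
    for _ in range(trials):
        a = rng.normal(size=(n, n))
        eigvals = np.linalg.eigvalsh((a + a.T) / 2)  # symmetrise
        count += np.all(eigvals > 0)
    return count / trials

for n in [1, 2, 4, 8]:
    print(f"dim {n}: P(critical point is a minimum) ~ {frac_all_positive(n):.4f}")
# The probability collapses rapidly with dimension: almost all critical
# points of a random high-dimensional function are saddle points.
```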
$n$th central moment: $\mathbb{E}\left[\left(X - \mathbb{E}[X]\right)^n\right]$
What are the aims of SGD with momentum?
- To overcome high curvature of the loss function / poor conditioning of the Hessian (oscillating up the sides of the valley)
- To reduce variance in the minibatch gradient estimate
How mathematically does momentum alter SGD?
A velocity term accumulates an exponentially-decaying moving average of past gradients, and the parameters move in that direction:
$$v \leftarrow \alpha v - \epsilon \nabla_\theta J(\theta) \qquad \theta \leftarrow \theta + v$$
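A minimal sketch of the update in code (`grad_fn` is a hypothetical minibatch-gradient oracle; hyperparameter values are illustrative):

```python
def sgd_momentum_step(theta, v, grad_fn, lr=0.01, alpha=0.9):
    """One SGD-with-momentum step: v <- alpha*v - lr*g, theta <- theta + v."""
    g = grad_fn(theta)          # minibatch gradient estimate
    v = alpha * v - lr * g      # exponentially-decaying average of -gradients
    return theta + v, v

# Example on J(theta) = 0.5 * theta^2, whose gradient is theta:
theta, v = 5.0, 0.0
for _ in range(3):
    theta, v = sgd_momentum_step(theta, v, grad_fn=lambda t: t)
```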
AdaGrad: learns a separate learning rate per parameter based on the accumulated history of squared gradients.
All computed element-wise:
$$r \leftarrow r + g \odot g \qquad \Delta\theta = -\frac{\epsilon}{\delta + \sqrt{r}} \odot g$$
RMSProp: AdaGrad but with an exponentially-weighted gradient history (and $\delta$ moved inside the root).
All computed element-wise:
$$r \leftarrow \rho r + (1 - \rho)\, g \odot g \qquad \Delta\theta = -\frac{\epsilon}{\sqrt{\delta + r}} \odot g$$
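A side-by-side sketch of the two accumulators (hyperparameter defaults are typical values, not from this page):

```python
import numpy as np

def adagrad_step(theta, r, g, lr=0.01, delta=1e-7):
    r = r + g * g                        # accumulate the full squared-gradient history
    return theta - lr / (delta + np.sqrt(r)) * g, r

def rmsprop_step(theta, r, g, lr=0.001, rho=0.9, delta=1e-6):
    r = rho * r + (1 - rho) * g * g      # exponentially-weighted history instead
    return theta - lr / np.sqrt(delta + r) * g, r  # delta inside the root
```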
Adam (see the worked sketch after this list):
- Combines RMSProp and momentum
- 1st-moment estimate (exponentially-weighted mean of the gradient) used to approximate the gradient
- 2nd-moment estimate (exponentially-weighted mean of the squared gradient) used to scale the step size
- Bias correction for each to account for zero-initialisation
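Putting the bullets together, a sketch of the standard Adam step (default hyperparameters are the commonly-used values, not from this page):

```python
import numpy as np

def adam_step(theta, s, r, t, g, lr=0.001, beta1=0.9, beta2=0.999, delta=1e-8):
    """One Adam step. s and r start at zero; t is the (1-based) timestep."""
    s = beta1 * s + (1 - beta1) * g          # 1st-moment estimate (momentum-like)
    r = beta2 * r + (1 - beta2) * g * g      # 2nd-moment estimate (RMSProp-like)
    s_hat = s / (1 - beta1 ** t)             # bias correction for zero-initialisation
    r_hat = r / (1 - beta2 ** t)
    theta = theta - lr * s_hat / (np.sqrt(r_hat) + delta)
    return theta, s, r
```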
What is the motivation behind batch norm?
- Standard parameter updates are a result of complex interactions across multiple levels which can be hard to learn
- Explicitly learning the mean and variance of a layer's outputs makes the learning problem for each layer wrt. the previous layer much easier
Definition of batch norm:
We sum across the batch dimension (i.e. it 'disappears'), broadcasting as appropriate:
$$\mu = \frac{1}{m}\sum_{i=1}^{m} x_i \qquad \sigma^2 = \frac{1}{m}\sum_{i=1}^{m} (x_i - \mu)^2 \qquad y_i = \gamma\,\frac{x_i - \mu}{\sqrt{\sigma^2 + \delta}} + \beta$$
Batch norm at test time:
Use aggregate statistics computed during training (e.g. running averages of $\mu$ and $\sigma^2$); see the sketch below.
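A minimal sketch of both modes for an input of shape (batch, features); the running-average momentum constant is my own illustrative choice:

```python
import numpy as np

def batch_norm(x, gamma, beta, running_mean, running_var,
               training=True, momentum=0.1, delta=1e-5):
    """Batch norm over axis 0 (the batch dimension) of x: (batch, features)."""
    if training:
        mu = x.mean(axis=0)                  # batch dim disappears
        var = x.var(axis=0)
        # Update aggregate statistics for use at test time:
        running_mean = (1 - momentum) * running_mean + momentum * mu
        running_var = (1 - momentum) * running_var + momentum * var
    else:
        mu, var = running_mean, running_var  # use training-time aggregates
    x_hat = (x - mu) / np.sqrt(var + delta)  # broadcasts over the batch
    return gamma * x_hat + beta, running_mean, running_var

# Usage:
x = np.random.randn(64, 10)
gamma, beta = np.ones(10), np.zeros(10)
rm, rv = np.zeros(10), np.ones(10)
y, rm, rv = batch_norm(x, gamma, beta, rm, rv, training=True)
```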
Should we batch norm the input to a linear layer or the output?
The original paper recommends normalising the outputs, but forums seem to suggest normalising the input is more effective.
Definition of layer norm:
Same as batch norm, but summing across the feature dimension(s)
Definition of instance norm:
Layer norm, but specifically for multi-channel images: each channel of each example is normalised individually.
Definition of group norm:
Instance norm but across groups of channels.
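The four image norms differ only in which axes the statistics are taken over. A minimal sketch on an NCHW tensor (the `normalise` helper is my own, and the learned scale/shift are omitted for brevity):

```python
import numpy as np

def normalise(x, axes, delta=1e-5):
    """Normalise x over the given axes (learned scale/shift omitted)."""
    mu = x.mean(axis=axes, keepdims=True)
    var = x.var(axis=axes, keepdims=True)
    return (x - mu) / np.sqrt(var + delta)

x = np.random.randn(8, 32, 16, 16)             # (N, C, H, W)
batch_norm_out = normalise(x, axes=(0, 2, 3))   # per channel, across the batch
layer_norm_out = normalise(x, axes=(1, 2, 3))   # per example, across features
inst_norm_out = normalise(x, axes=(2, 3))       # per example, per channel
# Group norm: reshape channels into G groups, then normalise per group:
g = 4
grouped = x.reshape(8, g, 32 // g, 16, 16)
group_norm_out = normalise(grouped, axes=(2, 3, 4)).reshape(x.shape)
```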
Definition of weight norm:
Normalise the weights and multiply them by some learned scale factor (not the gradient!):
$$w = \frac{g}{\lVert v \rVert}\, v$$
where $v$ is the underlying weight vector and the scalar $g$ is learned.
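A one-function sketch of the reparameterisation (`v` and `scale` would both be learned parameters; `scale` plays the role of $g$ above):

```python
import numpy as np

def weight_norm(v, scale):
    """Reparameterise a weight vector as w = scale * v / ||v||.

    Direction (v) and magnitude (scale) are then learned independently;
    `scale` is a learned scalar, not a gradient.
    """
    return scale * v / np.linalg.norm(v)
```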