# Optimisation

(This page is based primarily on material from Chapter 8 of the Deep Learning book)

### Empirical Risk Minimisation

The risk is defined as the expected loss over the true data-generating distribution:

$$J^*(\theta) = \mathbb{E}_{(x, y) \sim p_{\text{data}}} \, L(f(x; \theta), y)$$

The empirical risk is the risk taken over the empirical distribution defined by the training set:

$$\hat{J}(\theta) = \frac{1}{m} \sum_{i=1}^{m} L(f(x^{(i)}; \theta), y^{(i)})$$

A surrogate loss function is a proxy for the loss we actually care about, chosen to be easier to optimise: e.g. 0-1 classification loss → NLL.
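These definitions can be sketched concretely (toy example; the two-class probabilities and helper names are my own, not from the book):

```python
import numpy as np

def nll_loss(probs, y):
    # Surrogate loss: negative log-likelihood of the true class.
    return -np.log(probs[np.arange(len(y)), y])

def empirical_risk(probs, y):
    # Empirical risk: average the loss over the m training examples.
    return nll_loss(probs, y).mean()

# Toy predicted class probabilities for m = 2 examples, 2 classes.
probs = np.array([[0.9, 0.1],
                  [0.2, 0.8]])
y = np.array([0, 1])
risk = empirical_risk(probs, y)
```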

### (Mini)Batch Algorithms

Why use a mini-batch?
Gradient descent requires the gradient of the loss over the whole training dataset.
Using a mini-batch gives a cheap, unbiased sample-based estimate of this gradient.

#### Batch Size

We typically aim to use a minibatch size from around 32-256. We do so for the following reasons:
1. GPU: We require at least this many items to utilise GPUs, which also work best with powers of 2
1. Memory: Large batches may not fit in memory, especially for data parallelism
1. Diminishing returns: The standard error of the gradient estimate decreases only as $1/\sqrt{m}$ for batch size $m$, giving less-than-linear returns for larger batch sizes
1. Generalisation: Large batch sizes empirically generalise much less well, possibly because they lead to sharp minima. Generalisation error is often best with batch size = 1, although variance is extremely high here.
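
The diminishing-returns point can be checked empirically; a minimal sketch (synthetic per-example "gradients" with unit variance, my own setup):

```python
import numpy as np

rng = np.random.default_rng(0)
# Pretend each training example contributes a noisy scalar gradient.
per_example_grads = rng.normal(0.0, 1.0, size=100_000)

def grad_standard_error(batch_size, n_trials=2000):
    # Empirical standard error of the minibatch mean gradient.
    means = [rng.choice(per_example_grads, size=batch_size).mean()
             for _ in range(n_trials)]
    return float(np.std(means))

# Quadrupling the batch size only roughly halves the noise (1/sqrt(m)).
se_32 = grad_standard_error(32)
se_128 = grad_standard_error(128)
```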

For second-order optimisation methods larger batch sizes are required to minimise fluctuations in the estimates of the Hessian $H$ and its inverse.

### Optimisation Challenges

#### Ill-conditioning

Why is a high condition number for the Hessian a problem for optimisation?
The second-order Taylor approximation to the cost function predicts that a gradient descent step of $-\epsilon g$ adds

$$\frac{1}{2} \epsilon^2 g^\top H g - \epsilon g^\top g$$

to the cost. In many cases $g^\top g$ does not shrink significantly during training.
However, $g^\top H g$ will often grow significantly (ill-conditioning), meaning small values of the learning rate $\epsilon$ must be used.
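
A numerical sketch of this effect (the diagonal Hessian and gradient values are arbitrary choices of mine):

```python
import numpy as np

# Ill-conditioned Hessian: curvature 100 in one direction, 1 in the other.
H = np.diag([100.0, 1.0])
g = np.array([1.0, 1.0])  # current gradient

def predicted_cost_change(eps):
    # Second-order Taylor prediction for a gradient step of -eps * g.
    return 0.5 * eps**2 * g @ H @ g - eps * g @ g

change_small = predicted_cost_change(0.01)  # curvature term still small
change_large = predicted_cost_change(0.05)  # curvature term dominates
```

Here a learning rate of 0.05 already *increases* the predicted cost, even though the curvature along the second direction alone would tolerate far larger steps.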

#### Critical Points

What is a critical point?
A point where the gradient $\nabla_\theta J(\theta) = 0$.

How can we tell what kind of critical point we have?
The eigenvalues of the Hessian fall into four cases:
1. All +ve (positive definite): local minimum
1. All -ve (negative definite): local maximum
1. Has both +ve and -ve: saddle point
1. Else: inconclusive
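
This classification can be sketched directly (the function name and tolerance are my own):

```python
import numpy as np

def classify_critical_point(H, tol=1e-8):
    # Inspect the eigenvalues of the (symmetric) Hessian at the point.
    eig = np.linalg.eigvalsh(H)
    if np.all(eig > tol):
        return "local minimum"
    if np.all(eig < -tol):
        return "local maximum"
    if np.any(eig > tol) and np.any(eig < -tol):
        return "saddle point"
    return "inconclusive"  # some eigenvalues are (near) zero

# f(x, y) = x^2 - y^2 has Hessian diag(2, -2) at its critical point (0, 0).
kind = classify_critical_point(np.diag([2.0, -2.0]))
```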

#### Local Minima

How do local minima affect deep learning?
Experts now suspect that for sufficiently large neural networks, most local minima have a low cost value, and finding the global minimum is not particularly important.

How to assess if local minima are a problem?
If the norm of the gradient does not shrink to an insignificant size during training, the problem is not local minima (or critical points of any kind).

For high-dimensional problems, what can we say about saddle points versus local minima?
The ratio of saddle points to local minima grows exponentially with the number of dimensions: intuitively, a random Hessian is exponentially unlikely to have all its eigenvalues share the same sign.

### Optimisation Algorithms

$n$th moment: $\mathbb{E}[X^n]$
$n$th central moment: $\mathbb{E}[(X - \mathbb{E}[X])^n]$

#### SGD with momentum

What are the aims of SGD with momentum?
1. To overcome high curvature of the loss function / poor conditioning of the Hessian (oscillating up the sides of the valley)
1. To reduce variance in the minibatch gradient estimate

How mathematically does momentum alter SGD?
A velocity $v$ accumulates an exponentially decaying moving average of past gradients:

$$v \leftarrow \alpha v - \epsilon \nabla_\theta J(\theta), \qquad \theta \leftarrow \theta + v$$
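A minimal NumPy sketch of the momentum update (the hyperparameters and the toy quadratic objective are my own choices):

```python
import numpy as np

def sgd_momentum_step(theta, v, grad, lr=0.01, alpha=0.9):
    # v accumulates an exponentially decaying average of past gradients.
    v = alpha * v - lr * grad
    theta = theta + v
    return theta, v

# Minimise f(theta) = 0.5 * theta^2, whose gradient is theta itself.
theta, v = np.array([5.0]), np.zeros(1)
for _ in range(500):
    theta, v = sgd_momentum_step(theta, v, grad=theta)
```
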
#### AdaGrad

What is AdaGrad?
Learns a separate learning rate per-parameter based on the historical gradient.
All computed element-wise:

$$r \leftarrow r + g \odot g, \qquad \Delta\theta = -\frac{\epsilon}{\delta + \sqrt{r}} \odot g$$
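
A sketch of the AdaGrad update (illustrative hyperparameter values):

```python
import numpy as np

def adagrad_step(theta, r, grad, lr=0.01, delta=1e-7):
    # r accumulates the squared gradient over *all* of training, so a
    # parameter with a long history of large gradients takes small steps.
    r = r + grad * grad
    theta = theta - lr / (delta + np.sqrt(r)) * grad
    return theta, r

# Repeated identical gradients produce shrinking steps.
theta, r = np.zeros(1), np.zeros(1)
g = np.array([2.0])
theta1, r = adagrad_step(theta, r, g)
theta2, r = adagrad_step(theta1, r, g)
```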

#### RMSProp

All computed element-wise:

$$r \leftarrow \rho r + (1 - \rho)\, g \odot g, \qquad \Delta\theta = -\frac{\epsilon}{\sqrt{\delta + r}} \odot g$$
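
A sketch of the RMSProp update (the toy quadratic and the learning rate in the loop are my own choices):

```python
import numpy as np

def rmsprop_step(theta, r, grad, lr=0.001, rho=0.9, delta=1e-6):
    # Exponentially decaying average of squared gradients: unlike
    # AdaGrad, old gradient history is eventually forgotten.
    r = rho * r + (1 - rho) * grad * grad
    theta = theta - lr / np.sqrt(delta + r) * grad
    return theta, r

# Minimise f(theta) = 0.5 * theta^2.
theta, r = np.array([3.0]), np.zeros(1)
for _ in range(500):
    theta, r = rmsprop_step(theta, r, grad=theta, lr=0.01)
```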

#### Adam

What are the key features of Adam?
1. Combines RMSProp and momentum
1. 1st moment estimate $s$ used to approximate the gradient
1. 2nd moment estimate $r$ used to scale the step size
1. Bias correction for each to account for zero-initialisation
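
A sketch of the full Adam update (hyperparameter defaults follow common usage; the toy objective is mine):

```python
import numpy as np

def adam_step(theta, s, r, t, grad,
              lr=0.001, beta1=0.9, beta2=0.999, delta=1e-8):
    s = beta1 * s + (1 - beta1) * grad          # 1st moment (momentum-like)
    r = beta2 * r + (1 - beta2) * grad * grad   # 2nd moment (RMSProp-like)
    s_hat = s / (1 - beta1 ** t)                # bias correction: both
    r_hat = r / (1 - beta2 ** t)                # moments start at zero
    theta = theta - lr * s_hat / (np.sqrt(r_hat) + delta)
    return theta, s, r

# Minimise f(theta) = 0.5 * theta^2; t counts steps from 1.
theta, s, r = np.array([3.0]), np.zeros(1), np.zeros(1)
for t in range(1, 2001):
    theta, s, r = adam_step(theta, s, r, t, grad=theta, lr=0.01)
```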

### Normalisation

#### Batch Norm

What is the motivation behind batch norm?
1. Standard parameter updates are a result of complex interactions across multiple levels which can be hard to learn
1. Explicitly learning the mean and variance of a layer's outputs (via the parameters $\gamma$ and $\beta$) makes the learning problem for each layer wrt. the previous layers much easier

Definition of batch norm:
We average across the batch dimension (i.e. it 'disappears'), broadcasting as appropriate:

$$\mu = \frac{1}{m} \sum_{i=1}^{m} x_i, \qquad \sigma^2 = \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu)^2$$

$$\hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}}, \qquad y_i = \gamma \hat{x}_i + \beta$$

Batch norm at test time:
Use aggregate statistics computed during training (e.g. running averages of $\mu$ and $\sigma^2$).

Should we batch norm the input to a linear layer or the output?
The original paper recommends normalising the output of a linear layer (the pre-activations), but forum discussion suggests normalising the input can be more effective.
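
A minimal training-mode forward pass (NumPy sketch; the shapes and the (batch, features) layout are my own convention):

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-5):
    # x has shape (batch, features); statistics are over the batch axis.
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
x = rng.normal(5.0, 2.0, size=(64, 8))
out = batch_norm_train(x, gamma=np.ones(8), beta=np.zeros(8))
# Each feature of `out` now has (approximately) zero mean, unit variance.
```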

#### Layer Norm

Definition of layer norm:
Same as batch norm, but computing the statistics across the feature dimension(s) instead of the batch dimension.

#### Instance Norm

Definition of instance norm:
Layer norm but specifically for multi-channel images: each channel of each example is normalised individually (statistics taken over the spatial dimensions only).

#### Group Norm

Definition of group norm:
Instance norm but across groups of channels.
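
The four activation norms differ only in which axes the statistics are computed over; a sketch for an (N, C, H, W) tensor (the shapes and group count are my own choices):

```python
import numpy as np

def normalise(x, axes, eps=1e-5):
    mu = x.mean(axis=axes, keepdims=True)
    var = x.var(axis=axes, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 6, 8, 8))  # (batch N, channels C, height H, width W)

bn = normalise(x, axes=(0, 2, 3))  # batch norm: per channel, over N, H, W
ln = normalise(x, axes=(1, 2, 3))  # layer norm: per example, over C, H, W
inorm = normalise(x, axes=(2, 3))  # instance norm: per example and channel

# Group norm: split C into groups, then normalise each group per example.
groups = 2
gn = normalise(x.reshape(4, groups, 6 // groups, 8, 8),
               axes=(2, 3, 4)).reshape(x.shape)
```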

#### Weight Norm

Definition of weight norm:
Normalise the weights and multiply them by some learned scale factor $g$ (we normalise the weights, not the gradient!):

$$w = \frac{g}{\lVert v \rVert} v$$
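
A one-line sketch of this reparameterisation (the example vector is my own):

```python
import numpy as np

def weight_norm(v, g):
    # Direction comes from v; magnitude comes from the learned scalar g.
    return g * v / np.linalg.norm(v)

v = np.array([3.0, 4.0])  # ||v|| = 5
w = weight_norm(v, g=2.0)
```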