Contents: Empirical Risk Minimisation · (Mini)Batch Algorithms (Batch Size) · Optimisation Challenges (Ill-conditioning · Critical Points · Local Minima · Saddle Points) · Optimisation Algorithms (SGD with momentum · AdaGrad · RMSProp · Adam) · Normalisation (Batch Norm · Layer Norm · Instance Norm · Group Norm · Weight Norm)
(This page is based primarily on material from Chapter 8 of the Deep Learning book)
The risk is defined as the expected loss over the true data-generating distribution:
$$J^*(\theta) = \mathbb{E}_{(x,y) \sim p_{\text{data}}} \, L(f(x; \theta), y)$$
The empirical risk is the risk taken over the empirical distribution defined by the training set:
$$\mathbb{E}_{(x,y) \sim \hat{p}_{\text{data}}} \, L(f(x; \theta), y) = \frac{1}{m} \sum_{i=1}^{m} L(f(x^{(i)}; \theta), y^{(i)})$$
The surrogate loss function is a tractable proxy for the loss we actually want to minimise: e.g. 0-1 classification error → NLL.
Why use a mini-batch?
Gradient descent requires that we calculate the gradient of the loss over the training dataset.
Using a mini-batch is a way of sampling this gradient.
We typically aim to use a minibatch size of around 32–256, for the following reasons:
- GPU: We require at least this many items to fully utilise GPUs, which also work best with power-of-2 sizes
- Memory: Large batches may not fit in memory, especially for data parallelism
- Diminishing returns: The standard error of the gradient estimate decreases only as $1/\sqrt{m}$ for batch size $m$, giving diminishing returns for larger batch sizes (see the sketch after this list)
- Generalisation: Large batch sizes empirically generalise worse, possibly because they converge to sharp minima. Generalisation error is often best with batch size = 1, although the gradient variance is extremely high there.
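A quick empirical check of the $1/\sqrt{m}$ scaling (a minimal sketch: the per-example "gradients" here are synthetic 1-D samples, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Per-example "gradients": the true gradient is the mean over all examples,
# and a minibatch gradient is the mean over a random subset of size m.
n_examples = 100_000
grads = rng.normal(loc=1.0, scale=5.0, size=n_examples)  # 1-D for simplicity

for m in [1, 32, 256, 4096]:
    # Standard error of a mean of m samples scales as sigma / sqrt(m):
    estimates = [grads[rng.choice(n_examples, size=m)].mean() for _ in range(1000)]
    print(f"batch size {m:5d}: std of gradient estimate = {np.std(estimates):.4f}")
# Going from m=32 to m=4096 costs 128x the compute but only ~11x less noise.
```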
For second-order optimisation methods, larger batch sizes are required to minimise fluctuations in the estimate of $H^{-1}g$.
Why is a high condition number for the Hessian a problem for optimisation?
The second-order Taylor approximation to the cost function predicts that a gradient descent step of $-\epsilon g$ adds
$$\frac{1}{2}\epsilon^2 g^\top H g - \epsilon g^\top g$$
to the cost.
In many cases the first-order term $g^\top g$ does not shrink significantly during training.
However, the curvature term $g^\top H g$ will often grow significantly (ill-conditioning), meaning small values of the learning rate $\epsilon$ must be used.
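A minimal numeric illustration of the effect, using a hypothetical 2-D quadratic cost with condition number 100 (all values chosen for illustration):

```python
import numpy as np

# Ill-conditioned quadratic cost J(theta) = 0.5 * theta^T H theta.
H = np.diag([1.0, 100.0])          # condition number = 100
theta = np.array([1.0, 1.0])

eps = 0.019  # just under the stability limit 2 / lambda_max = 0.02
for step in range(5):
    g = H @ theta                  # gradient of the quadratic
    # Change in cost predicted by the 2nd-order Taylor expansion:
    predicted = 0.5 * eps**2 * g @ H @ g - eps * g @ g
    theta = theta - eps * g
    print(f"step {step}: predicted dJ = {predicted:+.4f}, theta = {theta}")
# The step size is capped by the largest eigenvalue (curvature 100),
# so progress along the lambda=1 direction is ~100x slower.
```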
What is a critical point?
A point where the gradient = 0.
How can we tell what kind of critical point we have?
The eigenvalues of the Hessian fall into four cases (see the sketch after this list):
- All +ve (positive definite): local minimum
- All -ve (negative definite): local maximum
- Has both +ve and -ve: saddle point
- Else (some eigenvalues are zero): inconclusive
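A minimal sketch of this classification; the helper `classify_critical_point` and its tolerance are my own illustration, not from the source:

```python
import numpy as np

def classify_critical_point(hessian: np.ndarray, tol: float = 1e-8) -> str:
    """Classify a critical point from the eigenvalues of the Hessian there."""
    eigvals = np.linalg.eigvalsh(hessian)  # Hessian is symmetric
    if np.all(eigvals > tol):
        return "local minimum"      # positive definite
    if np.all(eigvals < -tol):
        return "local maximum"      # negative definite
    if np.any(eigvals > tol) and np.any(eigvals < -tol):
        return "saddle point"
    return "inconclusive"           # some eigenvalues ~ 0

# f(x, y) = x^2 - y^2 has a critical point at the origin:
print(classify_critical_point(np.array([[2.0, 0.0], [0.0, -2.0]])))  # saddle point
```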
How do local minima affect deep learning?
Experts now suspect that, for sufficiently large neural networks, most local minima have a low cost value, and that finding the global minimum is not particularly important.
How to assess if local minima are a problem?
If the norm of the gradient does not shrink to an insignificant size, the problem is neither local minima nor any other kind of critical point.
For high-dimensional problems, what can we say about saddle points versus local minima?
The ratio of saddle points to local minima grows exponentially in the number of dimensions. Intuitively, a local minimum requires every eigenvalue of the Hessian to be positive; if each eigenvalue's sign were an independent coin flip, that becomes exponentially unlikely as dimensionality grows.
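A crude Monte Carlo illustration of that intuition (random symmetric Gaussian matrices are only a stand-in for real Hessians):

```python
import numpy as np

rng = np.random.default_rng(0)

def frac_all_positive(n: int, trials: int = 2000) -> float:
    """Fraction of random symmetric matrices whose eigenvalues are all +ve."""
    count = 0
    for _ in range(trials):
        a = rng.normal(size=(n, n))
        eigvals = np.linalg.eigvalsh((a + a.T) / 2)  # symmetrise
        count += np.all(eigvals > 0)
    return count / trials

for n in [1, 2, 4, 8]:
    print(f"dim {n}: P(critical point is a minimum) ~ {frac_all_positive(n):.4f}")
# The probability collapses rapidly with dimension: almost all critical
# points of a random high-dimensional function are saddle points.
```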
$n$th central moment: $\mathbb{E}\left[\left(X - \mathbb{E}[X]\right)^n\right]$
What are the aims of SGD with momentum?
- To overcome high curvature of the loss function / poor conditioning of the Hessian (oscillating up the sides of the valley)
- To reduce variance in the minibatch gradient estimate
How mathematically does momentum alter SGD?
A velocity term accumulates an exponentially-decaying moving average of past gradients, and the parameters move in that direction:
$$v \leftarrow \alpha v - \epsilon \nabla_\theta J(\theta) \qquad \theta \leftarrow \theta + v$$
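A minimal sketch of the update in code (`grad_fn` is a hypothetical minibatch-gradient oracle; hyperparameter values are illustrative):

```python
def sgd_momentum_step(theta, v, grad_fn, lr=0.01, alpha=0.9):
    """One SGD-with-momentum step: v <- alpha*v - lr*g, theta <- theta + v."""
    g = grad_fn(theta)          # minibatch gradient estimate
    v = alpha * v - lr * g      # exponentially-decaying average of -gradients
    return theta + v, v

# Example on J(theta) = 0.5 * theta^2, whose gradient is theta:
theta, v = 5.0, 0.0
for _ in range(3):
    theta, v = sgd_momentum_step(theta, v, grad_fn=lambda t: t)
```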
AdaGrad: learns a separate learning rate per parameter based on the accumulated history of squared gradients.
All computed element-wise:
$$r \leftarrow r + g \odot g \qquad \Delta\theta = -\frac{\epsilon}{\delta + \sqrt{r}} \odot g$$
RMSProp: AdaGrad but with an exponentially-weighted gradient history (and $\delta$ moved inside the root).
All computed element-wise:
$$r \leftarrow \rho r + (1 - \rho)\, g \odot g \qquad \Delta\theta = -\frac{\epsilon}{\sqrt{\delta + r}} \odot g$$
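A side-by-side sketch of the two accumulators (hyperparameter defaults are typical values, not from this page):

```python
import numpy as np

def adagrad_step(theta, r, g, lr=0.01, delta=1e-7):
    r = r + g * g                        # accumulate the full squared-gradient history
    return theta - lr / (delta + np.sqrt(r)) * g, r

def rmsprop_step(theta, r, g, lr=0.001, rho=0.9, delta=1e-6):
    r = rho * r + (1 - rho) * g * g      # exponentially-weighted history instead
    return theta - lr / np.sqrt(delta + r) * g, r  # delta inside the root
```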
Adam (see the worked sketch after this list):
- Combines RMSProp and momentum
- 1st-moment estimate (exponentially-weighted mean of the gradient) used to approximate the gradient
- 2nd-moment estimate (exponentially-weighted mean of the squared gradient) used to scale the step size
- Bias correction for each to account for zero-initialisation
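Putting the bullets together, a sketch of the standard Adam step (default hyperparameters are the commonly-used values, not from this page):

```python
import numpy as np

def adam_step(theta, s, r, t, g, lr=0.001, beta1=0.9, beta2=0.999, delta=1e-8):
    """One Adam step. s and r start at zero; t is the (1-based) timestep."""
    s = beta1 * s + (1 - beta1) * g          # 1st-moment estimate (momentum-like)
    r = beta2 * r + (1 - beta2) * g * g      # 2nd-moment estimate (RMSProp-like)
    s_hat = s / (1 - beta1 ** t)             # bias correction for zero-initialisation
    r_hat = r / (1 - beta2 ** t)
    theta = theta - lr * s_hat / (np.sqrt(r_hat) + delta)
    return theta, s, r
```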
What is the motivation behind batch norm?
- Standard parameter updates are a result of complex interactions across multiple levels which can be hard to learn
- Explicitly learning the mean and variance of a layer's outputs makes the learning problem for each layer wrt. the previous layer much easier
Definition of batch norm:
We sum across the batch dimension (i.e. it 'disappears'), broadcasting as appropriate:
$$\mu = \frac{1}{m}\sum_{i=1}^{m} x_i \qquad \sigma^2 = \frac{1}{m}\sum_{i=1}^{m} (x_i - \mu)^2 \qquad y_i = \gamma\,\frac{x_i - \mu}{\sqrt{\sigma^2 + \delta}} + \beta$$
Batch norm at test time:
Use aggregate statistics computed during training (e.g. running averages of $\mu$ and $\sigma^2$); see the sketch below.
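A minimal sketch of both modes for an input of shape (batch, features); the running-average momentum constant is my own illustrative choice:

```python
import numpy as np

def batch_norm(x, gamma, beta, running_mean, running_var,
               training=True, momentum=0.1, delta=1e-5):
    """Batch norm over axis 0 (the batch dimension) of x: (batch, features)."""
    if training:
        mu = x.mean(axis=0)                  # batch dim disappears
        var = x.var(axis=0)
        # Update aggregate statistics for use at test time:
        running_mean = (1 - momentum) * running_mean + momentum * mu
        running_var = (1 - momentum) * running_var + momentum * var
    else:
        mu, var = running_mean, running_var  # use training-time aggregates
    x_hat = (x - mu) / np.sqrt(var + delta)  # broadcasts over the batch
    return gamma * x_hat + beta, running_mean, running_var

# Usage:
x = np.random.randn(64, 10)
gamma, beta = np.ones(10), np.zeros(10)
rm, rv = np.zeros(10), np.ones(10)
y, rm, rv = batch_norm(x, gamma, beta, rm, rv, training=True)
```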
Should we batch norm the input to a linear layer or the output?
The original paper recommends normalising the outputs, but forums seem to suggest normalising the input is more effective.
Definition of layer norm:
Same as batch norm, but summing across the feature dimension(s)
Definition of instance norm:
Layer norm, but specifically for multi-channel images: each channel of each example is normalised individually.
Definition of group norm:
Instance norm but across groups of channels.
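The four image norms differ only in which axes the statistics are taken over. A minimal sketch on an NCHW tensor (the `normalise` helper is my own, and the learned scale/shift are omitted for brevity):

```python
import numpy as np

def normalise(x, axes, delta=1e-5):
    """Normalise x over the given axes (learned scale/shift omitted)."""
    mu = x.mean(axis=axes, keepdims=True)
    var = x.var(axis=axes, keepdims=True)
    return (x - mu) / np.sqrt(var + delta)

x = np.random.randn(8, 32, 16, 16)             # (N, C, H, W)
batch_norm_out = normalise(x, axes=(0, 2, 3))   # per channel, across the batch
layer_norm_out = normalise(x, axes=(1, 2, 3))   # per example, across features
inst_norm_out = normalise(x, axes=(2, 3))       # per example, per channel
# Group norm: reshape channels into G groups, then normalise per group:
g = 4
grouped = x.reshape(8, g, 32 // g, 16, 16)
group_norm_out = normalise(grouped, axes=(2, 3, 4)).reshape(x.shape)
```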
Definition of weight norm:
Normalise the weights and multiply them by some learned scale factor (not the gradient!):
$$w = \frac{g}{\lVert v \rVert}\, v$$
where $v$ is the underlying weight vector and the scalar $g$ is learned.
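A one-function sketch of the reparameterisation (`v` and `scale` would both be learned parameters; `scale` plays the role of $g$ above):

```python
import numpy as np

def weight_norm(v, scale):
    """Reparameterise a weight vector as w = scale * v / ||v||.

    Direction (v) and magnitude (scale) are then learned independently;
    `scale` is a learned scalar, not a gradient.
    """
    return scale * v / np.linalg.norm(v)
```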