Empirical Risk Minimisation

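The risk is defined as the expected loss over the true data-generating distribution:
$$J^*(\theta) = \mathbb{E}_{(x, y) \sim p_{\text{data}}} L(f(x; \theta), y)$$
The empirical risk is the risk taken over the empirical distribution defined by the training set:
$$J(\theta) = \mathbb{E}_{(x, y) \sim \hat{p}_{\text{data}}} L(f(x; \theta), y) = \frac{1}{m} \sum_{i=1}^{m} L(f(x^{(i)}; \theta), y^{(i)})$$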
The surrogate loss function is a proxy for a loss we care about but cannot optimise efficiently: e.g. classification error (0-1 loss) → negative log-likelihood (NLL).

(Mini)Batch Algorithms

Why use a mini-batch?
Exact gradient descent requires that we calculate the gradient of the loss over the entire training dataset, which is expensive.
Using a mini-batch gives an estimate of this gradient from a random sample of the data at a fraction of the cost.

Batch Size

We typically aim to use a minibatch size of around 32 to 256. We do so for the following reasons:
  1. GPU: We require at least this many items to utilise GPUs, which also work best with powers of 2
  1. Memory: Large batches may not fit in memory, especially for data parallelism
  1. Diminishing returns: The standard error of the gradient estimate decreases only as $1/\sqrt{n}$ in the batch size $n$ (the variance as $1/n$), giving diminishing returns for larger batch sizes
  1. Generalisation: Large batch sizes empirically tend to generalise worse, possibly because they converge to sharp minima. Generalisation error is often best with batch size = 1, although the variance of the gradient estimate is extremely high here.
For second-order optimisation methods, larger batch sizes are required to minimise fluctuations in the estimates of $H^{-1}g$.
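The diminishing-returns point can be checked numerically. The following is a minimal sketch, assuming a toy 1-D least-squares problem; the data, the helper `minibatch_grad_std`, and the batch sizes are illustrative choices, not part of these notes. It measures the spread of the mini-batch gradient estimate at several batch sizes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D least-squares problem (assumed for illustration):
# loss_i(w) = 0.5 * (w * x_i - y_i)^2
x = rng.normal(size=10_000)
y = 3.0 * x + rng.normal(size=10_000)
w = 0.0

# Per-example gradients d loss_i / d w at the current w.
per_example_grads = (w * x - y) * x

def minibatch_grad_std(batch_size, trials=2_000):
    """Empirical std of the mini-batch gradient over many random batches."""
    idx = rng.integers(0, len(x), size=(trials, batch_size))
    return per_example_grads[idx].mean(axis=1).std()

std_by_size = {n: minibatch_grad_std(n) for n in (16, 64, 256)}
```

Under these assumptions, each 4x increase in batch size should roughly halve the spread, so 16x the compute buys only a 4x reduction in noise.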

Optimisation Challenges


Why is a high condition number for the Hessian a problem for optimisation?
The second-order Taylor approximation to the cost function predicts that a gradient descent step of $-\epsilon g$ adds to the cost:
$$\frac{1}{2} \epsilon^2 g^\top H g - \epsilon g^\top g$$
In many cases the squared gradient norm $g^\top g$ does not shrink significantly during training.
However, $g^\top H g$ will often grow significantly (ill-conditioning), meaning small values of $\epsilon$ must be used.
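To make the effect of ill-conditioning concrete, here is a small sketch on a quadratic cost, for which the second-order Taylor expansion is exact. The Hessian, starting point, and step sizes are assumed values chosen for illustration. The predicted change in cost for a step of size eps along the negative gradient is 0.5 * eps^2 * g^T H g - eps * g^T g.

```python
import numpy as np

# Quadratic cost J(theta) = 0.5 * theta^T H theta with an ill-conditioned Hessian.
H = np.diag([1.0, 100.0])          # condition number 100 (assumed example)
theta = np.array([1.0, 1.0])

def cost(t):
    return 0.5 * t @ H @ t

g = H @ theta                      # gradient of the quadratic at theta

def predicted_change(eps):
    """Second-order Taylor prediction for the step theta - eps * g."""
    return 0.5 * eps**2 * (g @ H @ g) - eps * (g @ g)

eps_small, eps_large = 0.005, 0.05
change_small = cost(theta - eps_small * g) - cost(theta)   # cost goes down
change_large = cost(theta - eps_large * g) - cost(theta)   # cost goes up
```

Here the curvature term dominates as soon as eps exceeds 2 g^T g / g^T H g (about 0.02 for these values), so the larger step increases the cost even though it follows the negative gradient.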

Critical Points

What is a critical point?
A point where the gradient is zero.
How can we tell what kind of critical point we have?
The eigenvalues of the Hessian fall into four cases:
  1. All +ve (positive definite): local minimum
  1. All -ve (negative definite): local maximum
  1. Has both +ve and -ve: saddle point
  1. Else (at least one zero eigenvalue and the rest all of one sign): inconclusive
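The four cases above can be sketched as a small classifier. This is an illustrative helper, not a standard library function: the name `classify_critical_point` and the tolerance `tol` used to decide whether an eigenvalue counts as zero are assumptions.

```python
import numpy as np

def classify_critical_point(hessian, tol=1e-8):
    """Classify a critical point from the eigenvalues of a symmetric Hessian."""
    eigvals = np.linalg.eigvalsh(hessian)   # eigenvalues of a symmetric matrix
    if np.all(eigvals > tol):
        return "local minimum"
    if np.all(eigvals < -tol):
        return "local maximum"
    if np.any(eigvals > tol) and np.any(eigvals < -tol):
        return "saddle point"
    return "inconclusive"                   # some eigenvalue is (numerically) zero

# f(x, y) = x^2 - y^2 has Hessian diag(2, -2) at the origin.
kind = classify_critical_point(np.diag([2.0, -2.0]))
```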

Local Minima

How do local minima affect deep learning?
Experts now suspect that for sufficiently large neural networks, most local minima have a low cost and finding the global minimum is not particularly important.
How to assess if local minima are a problem?
If the norm of the gradient does not shrink to insignificant size, the problem is not local minima (nor any other kind of critical point).

Saddle Points

For high-dimensional problems, what can we say about saddle points versus local minima?
The expected ratio of saddle points to local minima grows exponentially in the number of dimensions.

Optimisation Algorithms

$k$th moment: $\mathbb{E}[X^k]$
$k$th central moment: $\mathbb{E}[(X - \mathbb{E}[X])^k]$
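As a quick illustration of the two definitions, the sketch below estimates them from samples by replacing the expectation with a sample mean. The function names and the sample values are assumptions made for this example.

```python
import numpy as np

def moment(samples, k):
    """Sample estimate of the k-th moment E[X^k]."""
    return np.mean(samples ** k)

def central_moment(samples, k):
    """Sample estimate of the k-th central moment E[(X - E[X])^k]."""
    return np.mean((samples - np.mean(samples)) ** k)

samples = np.array([1.0, 2.0, 3.0, 4.0])
mean = moment(samples, 1)               # first moment: the mean
variance = central_moment(samples, 2)   # second central moment: the variance
```

The first moment is the mean and the second central moment is the variance, which are the two quantities the adaptive optimisers track as running estimates.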

SGD with momentum

What are the aims of SGD with momentum?
  1. To overcome high curvature of the loss function / poor conditioning of the Hessian (oscillating up the sides of the valley)
  1. To reduce variance in the minibatch gradient estimate
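How mathematically does momentum alter SGD?
A velocity $v$ accumulates an exponentially decaying sum of past gradients, with $\alpha \in [0, 1)$ controlling the decay:
$$v \leftarrow \alpha v - \epsilon g, \qquad \theta \leftarrow \theta + v$$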
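A minimal sketch of the momentum update, assuming the common formulation in which a velocity is decayed by alpha and then moved by the scaled negative gradient. The function name, the quadratic test problem, and the hyperparameter values are illustrative assumptions.

```python
import numpy as np

def sgd_momentum_step(theta, v, grad, lr=0.05, alpha=0.9):
    """One momentum step: decay the velocity, add the scaled negative gradient."""
    v = alpha * v - lr * grad
    theta = theta + v
    return theta, v

# Poorly conditioned quadratic f(theta) = 0.5 * theta^T H theta (assumed example).
H = np.diag([1.0, 10.0])
theta = np.array([1.0, 1.0])
v = np.zeros_like(theta)
for _ in range(200):
    theta, v = sgd_momentum_step(theta, v, H @ theta)
```

On this toy problem the iterate ends up very close to the minimum at the origin; in practice `grad` would be a mini-batch estimate rather than the exact gradient.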


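AdaGrad

Learns a separate learning rate per parameter based on the history of squared gradients.
All computed element-wise:
$$r \leftarrow r + g \odot g, \qquad \theta \leftarrow \theta - \frac{\epsilon}{\delta + \sqrt{r}} \odot g$$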
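A minimal sketch of one AdaGrad step, assuming the standard form in which squared gradients are accumulated and the step is divided by the square root of that sum. The names `adagrad_step`, `r`, `lr`, and `delta` are chosen for this example.

```python
import numpy as np

def adagrad_step(theta, r, grad, lr=0.1, delta=1e-7):
    """One AdaGrad step; every operation is element-wise."""
    r = r + grad * grad                               # accumulate squared gradients
    theta = theta - lr / (delta + np.sqrt(r)) * grad  # per-parameter step size
    return theta, r

theta = np.array([1.0, 1.0])
r = np.zeros_like(theta)
grad = np.array([1.0, 100.0])       # gradients of very different scale (assumed)
theta, r = adagrad_step(theta, r, grad)
```

After the first step both parameters move by about `lr`, even though their gradients differ by a factor of 100, which is the per-parameter scaling in action. Because `r` only grows, the effective learning rate shrinks monotonically over training.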


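RMSProp

AdaGrad but with exponentially-weighted gradient history (and delta moved inside root):
All computed element-wise:
$$r \leftarrow \rho r + (1 - \rho) g \odot g, \qquad \theta \leftarrow \theta - \frac{\epsilon}{\sqrt{\delta + r}} \odot g$$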
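A minimal sketch of the RMSProp update under the same conventions as AdaGrad, with the accumulator replaced by an exponentially-weighted average and `delta` inside the square root. The names, the constant gradient, and the hyperparameter values are assumptions for illustration.

```python
import numpy as np

def rmsprop_step(theta, r, grad, lr=0.01, rho=0.9, delta=1e-6):
    """One RMSProp step; every operation is element-wise."""
    r = rho * r + (1 - rho) * grad * grad            # decaying average of g^2
    theta = theta - lr / np.sqrt(delta + r) * grad   # delta sits inside the root
    return theta, r

theta = np.array([1.0, -1.0])
r = np.zeros_like(theta)
grad = np.array([2.0, -0.02])       # held constant purely for illustration
for _ in range(200):
    prev = theta
    theta, r = rmsprop_step(theta, r, grad)
last_step = np.abs(theta - prev)
```

With a constant gradient the average `r` settles at the squared gradient, so each parameter ends up moving by roughly `lr` per step regardless of gradient scale. Unlike AdaGrad, old gradients are forgotten, so the step size does not decay to zero.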


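Adam

For more detail see
  1. Combines RMSProp and momentum
  1. First-moment estimate (exponentially-weighted mean of the gradient) used in place of the raw gradient
  1. Second-moment estimate (exponentially-weighted mean of the squared gradient) used to scale the step size
  1. Bias correction for each to account for zero-initialisation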
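The points above can be sketched as a single update function. This is a minimal illustration assuming the commonly published form of Adam. The names `m`, `v`, `beta1`, `beta2`, and `eps`, and the default values, are conventions adopted for this example rather than anything stated in these notes.

```python
import numpy as np

def adam_step(theta, m, v, grad, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step at (1-indexed) iteration t; every operation is element-wise."""
    m = beta1 * m + (1 - beta1) * grad          # first-moment (momentum) estimate
    v = beta2 * v + (1 - beta2) * grad * grad   # second-moment (RMSProp) estimate
    m_hat = m / (1 - beta1 ** t)                # bias correction for zero init
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

theta = np.array([1.0, -1.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)
grad = np.array([0.5, -20.0])       # assumed example gradient
theta, m, v = adam_step(theta, m, v, grad, t=1)
```

At the first step the bias correction exactly undoes the zero initialisation, so each parameter moves by about `lr` in the direction opposite to its gradient's sign, whatever the gradient's magnitude.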


Batch Norm

What is the motivation behind batch norm?
  1. The effect of a standard parameter update on one layer depends on complex interactions with all the other layers, which makes learning hard
  1. Explicitly learning the mean and variance of a layer's outputs makes the learning problem for each layer wrt. the previous layer much easier
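Definition of batch norm:
$$\text{BN}(x) = \gamma \odot \frac{x - \mu}{\sqrt{\sigma^2 + \delta}} + \beta$$
where $\gamma$ and $\beta$ are the learned scale and shift, and $\delta$ is a small constant for numerical stability.
We sum across the batch dimension (i.e. it 'disappears'), broadcasting as appropriate:
$$\mu = \frac{1}{n} \sum_{i=1}^{n} x_i, \qquad \sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2$$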
Batch norm at test time:
Use aggregate statistics (e.g. running averages of the batch mean and variance) computed during training.
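A minimal sketch of batch norm in training and test mode for inputs of shape (batch, features). The function names, the running-average rule with a `momentum` parameter, and the use of the biased batch variance for the running estimate are simplifying assumptions. Real libraries differ in such details.

```python
import numpy as np

def batch_norm_train(x, gamma, beta, running_mean, running_var,
                     momentum=0.1, delta=1e-5):
    """Normalise with batch statistics and update the running averages."""
    mu = x.mean(axis=0)             # reduce over the batch dimension
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + delta)   # mu and var broadcast over the batch
    running_mean = (1 - momentum) * running_mean + momentum * mu
    running_var = (1 - momentum) * running_var + momentum * var
    return gamma * x_hat + beta, running_mean, running_var

def batch_norm_eval(x, gamma, beta, running_mean, running_var, delta=1e-5):
    """Normalise with the aggregate statistics collected during training."""
    x_hat = (x - running_mean) / np.sqrt(running_var + delta)
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
x = rng.normal(5.0, 3.0, size=(64, 4))
gamma, beta = np.ones(4), np.zeros(4)
out, run_mean, run_var = batch_norm_train(x, gamma, beta, np.zeros(4), np.ones(4))
test_out = batch_norm_eval(x[:1], gamma, beta, run_mean, run_var)
```

In evaluation mode the output for one example no longer depends on which other examples share its batch, which is why a single-example batch works at test time.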
Should we batch norm the input to a linear layer or the output?
The original paper recommends normalising the outputs, but forum discussions suggest that normalising the input can be more effective.

Layer Norm

Definition of layer norm:
Same as batch norm, but summing across the feature dimension(s) of each example rather than across the batch.
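A minimal sketch of layer norm for inputs of shape (batch, features), assuming a single feature axis. The function name and the constant `delta` are choices made for this example.

```python
import numpy as np

def layer_norm(x, gamma, beta, delta=1e-5):
    """Normalise each example over its own features (the last axis)."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + delta) + beta

rng = np.random.default_rng(0)
x = rng.normal(2.0, 4.0, size=(8, 16))
out = layer_norm(x, np.ones(16), np.zeros(16))
```

Because the statistics are computed per example, the result for one row is the same whatever else is in the batch, and no running averages are needed at test time.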

Instance Norm

Definition of instance norm:
Layer norm specialised to multi-channel images: each channel of each example is normalised individually over its spatial dimensions.

Group Norm

Definition of group norm:
Instance norm but across groups of channels.
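The relationship between instance, group, and layer norm can be sketched with one function for image tensors of shape (N, C, H, W). This is an illustrative sketch that omits the learned scale and shift. The function name and layout are assumptions.

```python
import numpy as np

def group_norm(x, num_groups, delta=1e-5):
    """Normalise each example over groups of channels and all spatial positions."""
    n, c, h, w = x.shape
    if c % num_groups != 0:
        raise ValueError("channels must divide evenly into groups")
    grouped = x.reshape(n, num_groups, c // num_groups, h, w)
    mu = grouped.mean(axis=(2, 3, 4), keepdims=True)
    var = grouped.var(axis=(2, 3, 4), keepdims=True)
    return ((grouped - mu) / np.sqrt(var + delta)).reshape(n, c, h, w)

rng = np.random.default_rng(0)
x = rng.normal(1.0, 2.0, size=(2, 4, 3, 3))
instance_out = group_norm(x, num_groups=4)   # one channel per group: instance norm
group_out = group_norm(x, num_groups=2)      # two channels per group
layer_out = group_norm(x, num_groups=1)      # all channels together: layer norm
```

Setting `num_groups` to the number of channels recovers instance norm, and setting it to 1 recovers layer norm over (C, H, W), so group norm interpolates between the two.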

Weight Norm

Definition of weight norm:
Normalise the weights and multiply them by some learned scale factor $g$ (not the gradient!):
$$w = g \frac{v}{\lVert v \rVert}$$
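A minimal sketch of the reparameterisation, assuming the usual form in which a direction vector `v` is normalised to unit length and rescaled by a learned scalar `g`. Both names are conventions adopted here, and `g` is the scale, not the gradient.

```python
import numpy as np

def weight_norm(v, g):
    """Return the effective weight w = g * v / ||v||."""
    return g * v / np.linalg.norm(v)

v = np.array([3.0, 4.0])    # unnormalised direction parameters (assumed values)
g = 2.0                     # learned scale
w = weight_norm(v, g)
```

The norm of `w` always equals `g`, so the optimiser controls the length and the direction of the weight vector through separate parameters.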