### Empirical Risk Minimisation

The **risk** is defined as the expected loss over the true data-generating distribution:

$$J^*(\theta) = \mathbb{E}_{(x, y) \sim p_{\text{data}}} L(f(x; \theta), y)$$

The **empirical risk** is the risk taken over the empirical distribution defined by the training set:

$$J(\theta) = \mathbb{E}_{(x, y) \sim \hat{p}_{\text{data}}} L(f(x; \theta), y) = \frac{1}{m} \sum_{i=1}^{m} L(f(x^{(i)}; \theta), y^{(i)})$$

A **surrogate loss function** is a proxy for the quantity we actually care about, chosen because it is easier to optimise: e.g. 0–1 classification error → NLL.

### (Mini)Batch Algorithms

Why use a mini-batch?

Gradient descent requires that we calculate the gradient of the loss over the training dataset.

Using a mini-batch is a way of sampling this gradient.
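As a sketch of what "sampling the gradient" means (a toy 1-D linear regression; all names here are illustrative, not from the source), the minibatch gradient is a noisy but unbiased estimate of the full-dataset gradient:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dataset for 1-D linear regression: loss(w) = mean_i (w * x_i - y_i)^2
X = rng.normal(size=1000)
y = 3.0 * X + rng.normal(scale=0.1, size=1000)

def grad(xb, yb, w):
    # Gradient of the mean squared error wrt w, on any batch
    return np.mean(2 * xb * (xb * w - yb))

w = 0.0
full_grad = grad(X, y, w)

# Individual minibatch gradients are noisy, but their average
# recovers the full gradient: the estimator is unbiased
batch_grads = []
for _ in range(2000):
    idx = rng.choice(1000, size=32, replace=False)
    batch_grads.append(grad(X[idx], y[idx], w))

print(abs(np.mean(batch_grads) - full_grad))  # small
```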

#### Batch Size

We typically aim to use a minibatch size from around 32-256. We do so for the following reasons:

**GPU:** We require at least this many items to utilise GPUs effectively, which also work best with powers of 2

**Memory:** Large batches may not fit in memory, especially for data parallelism

**Diminishing returns:** The standard error of the gradient estimate decreases only as $1/\sqrt{m}$ for batch size $m$, giving diminishing returns for larger batch sizes

**Generalisation:** Large batch sizes empirically generalise much less well, possibly because they lead to sharp minima. Generalisation error is often best with batch size = 1, although the variance of the gradient estimate is extremely high there.

For second-order optimisation methods, larger batch sizes are required to minimise fluctuations in the estimates of $H^{-1} g$.
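A quick numerical check of the diminishing-returns claim (the setup is illustrative): the standard error of a minibatch mean falls as $1/\sqrt{m}$, so quadrupling the batch size only halves the noise.

```python
import numpy as np

rng = np.random.default_rng(1)
population = rng.normal(size=100_000)  # stand-in for per-example gradients

def std_error(m, trials=5000):
    # Empirical standard deviation of the batch-mean estimator at batch size m
    means = [population[rng.choice(len(population), m)].mean() for _ in range(trials)]
    return np.std(means)

ratio = std_error(32) / std_error(128)
print(ratio)  # close to 2 = sqrt(128 / 32)
```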

### Optimisation Challenges

#### Ill-conditioning

Why is a high condition number for the hessian a problem for optimisation?

The second-order Taylor approximation to the cost function predicts that a gradient descent step of $-\epsilon g$ adds

$$\frac{1}{2} \epsilon^2 g^\top H g - \epsilon g^\top g$$

to the cost.

In many cases $g^\top g$ does not shrink significantly during training.

However, $g^\top H g$ will often grow significantly (ill-conditioning), meaning small values of $\epsilon$ must be used.

#### Critical Points

What is a critical point?

A point where the gradient is zero.

How can we tell what kind of critical point we have?

The eigenvalues of the hessian fall into four cases:

**All +ve (positive definite):** local minimum

**All -ve (negative definite):** local maximum

**Has both +ve and -ve:** saddle point

**Else (some eigenvalues zero):** inconclusive
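The four cases above can be checked numerically from the Hessian's eigenvalues; a minimal sketch (the function name is mine, not from the source):

```python
import numpy as np

def classify_critical_point(H, tol=1e-8):
    """Classify a critical point from the eigenvalues of its (symmetric) Hessian."""
    eig = np.linalg.eigvalsh(H)
    if np.all(eig > tol):
        return "local minimum"      # positive definite
    if np.all(eig < -tol):
        return "local maximum"      # negative definite
    if np.any(eig > tol) and np.any(eig < -tol):
        return "saddle point"       # mixed signs
    return "inconclusive"           # some eigenvalues (near) zero

print(classify_critical_point(np.diag([1.0, 2.0])))   # local minimum
print(classify_critical_point(np.diag([1.0, -2.0])))  # saddle point
print(classify_critical_point(np.diag([1.0, 0.0])))   # inconclusive
```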

#### Local Minima

How do local minima affect deep learning?

Experts now suspect that for sufficiently large neural networks, most local minima have a low cost, and finding the global minimum is not particularly important.

How to assess if local minima are a problem?

If the norm of the gradient does not shrink to an insignificant size during training, the problem is not local minima (or any other kind of critical point).

#### Saddle Points

For high-dimensional problems, what can we say about saddle points versus local minima?

The ratio of saddle points to local minima grows exponentially with the number of dimensions.

### Optimisation Algorithms

$n$th moment: $\mathbb{E}[x^n]$

$n$th central moment: $\mathbb{E}[(x - \mathbb{E}[x])^n]$

#### SGD with momentum

What are the aims of SGD with momentum?

- To overcome high curvature of the loss function / poor conditioning of the Hessian (oscillating up the sides of the valley)

- To reduce variance in the minibatch gradient estimate

How, mathematically, does momentum alter SGD?

$$v \leftarrow \alpha v - \epsilon \nabla_\theta J(\theta)$$

$$\theta \leftarrow \theta + v$$
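A minimal sketch of the momentum update, assuming the standard formulation with velocity $v$, momentum coefficient $\alpha$ and learning rate $\epsilon$ (the quadratic test function is mine):

```python
import numpy as np

def sgd_momentum_step(theta, v, grad, lr=0.01, alpha=0.9):
    # Velocity accumulates an exponentially-decaying sum of past gradients
    v = alpha * v - lr * grad
    theta = theta + v
    return theta, v

# Ill-conditioned quadratic f(theta) = 0.5 * (theta_1^2 + 100 * theta_2^2)
curvature = np.array([1.0, 100.0])
theta, v = np.array([1.0, 1.0]), np.zeros(2)
for _ in range(500):
    theta, v = sgd_momentum_step(theta, v, curvature * theta)
print(theta)  # near the minimum at the origin
```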

#### AdaGrad

Learns a separate learning rate per parameter based on the historical gradient.

All computed element-wise:

$$r \leftarrow r + g \odot g$$

$$\Delta\theta = -\frac{\epsilon}{\delta + \sqrt{r}} \odot g$$
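A sketch of the AdaGrad update: squared gradients accumulate in $r$, and each parameter's step is divided by $\sqrt{r}$ (the quadratic test harness is illustrative):

```python
import numpy as np

def adagrad_step(theta, r, grad, lr=0.1, delta=1e-7):
    # r accumulates the squared gradient over the entire history,
    # so the effective learning rate only ever shrinks
    r = r + grad * grad
    theta = theta - lr / (delta + np.sqrt(r)) * grad
    return theta, r

# Each parameter gets its own effective learning rate
theta, r = np.array([1.0, 1.0]), np.zeros(2)
curvature = np.array([1.0, 100.0])
for _ in range(2000):
    theta, r = adagrad_step(theta, r, curvature * theta)
print(theta)
```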

#### RMSProp

AdaGrad but with an exponentially-weighted gradient history (and $\delta$ moved inside the root):

All computed element-wise:

$$r \leftarrow \rho r + (1 - \rho)\, g \odot g$$

$$\Delta\theta = -\frac{\epsilon}{\sqrt{\delta + r}} \odot g$$
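The same kind of sketch for RMSProp, with an exponential moving average in place of AdaGrad's full sum ($\rho$ and $\delta$ defaults are typical values, not from the source):

```python
import numpy as np

def rmsprop_step(theta, r, grad, lr=0.01, rho=0.9, delta=1e-6):
    # r is an exponential moving average of the squared gradient,
    # so old gradients are forgotten (unlike AdaGrad); delta sits inside the root
    r = rho * r + (1 - rho) * grad * grad
    theta = theta - lr / np.sqrt(delta + r) * grad
    return theta, r

theta, r = np.array([1.0, 1.0]), np.zeros(2)
curvature = np.array([1.0, 100.0])
for _ in range(2000):
    theta, r = rmsprop_step(theta, r, curvature * theta)
print(theta)  # small, though RMSProp hovers near the minimum at the lr scale
```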

#### Adam

Key points (see the Adam paper for more detail):

- Combines RMSProp and momentum

- 1st-moment estimate used to approximate the gradient

- 2nd-moment estimate used to scale the step size

- Bias correction for each to account for zero-initialisation
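Putting those pieces together, a sketch of one Adam step with bias correction (default hyperparameters are the commonly used ones; the test harness is mine):

```python
import numpy as np

def adam_step(theta, m, v, grad, t, lr=0.01, beta1=0.9, beta2=0.999, delta=1e-8):
    # 1st-moment (momentum-like) and 2nd-moment (RMSProp-like) estimates
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    # Both start at zero, so early estimates are biased towards zero;
    # dividing by (1 - beta^t) corrects this
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + delta)
    return theta, m, v

theta, m, v = np.array([1.0, 1.0]), np.zeros(2), np.zeros(2)
curvature = np.array([1.0, 100.0])
for t in range(1, 3001):
    theta, m, v = adam_step(theta, m, v, curvature * theta, t)
print(theta)
```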

### Normalisation

#### Batch Norm

What is the motivation behind batch norm?

- Standard parameter updates are a result of complex interactions across multiple levels which can be hard to learn

- Explicitly learning the mean and variance of a layer's outputs makes the learning problem for each layer wrt. the previous layer much easier

Definition of batch norm:

$$\mu = \frac{1}{m} \sum_{i=1}^{m} x_i, \qquad \sigma^2 = \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu)^2$$

$$\hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \delta}}, \qquad y_i = \gamma \hat{x}_i + \beta$$

We average across the batch dimension (i.e. it 'disappears'), broadcasting as appropriate:
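A minimal sketch of the training-time forward pass over a `(batch, features)` array (function and variable names are mine):

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, delta=1e-5):
    # Statistics are per feature: the batch axis (0) 'disappears'
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + delta)  # broadcasts back over the batch
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(64, 8))
y = batch_norm_forward(x, gamma=np.ones(8), beta=np.zeros(8))
print(y.mean(axis=0), y.std(axis=0))  # ~0 and ~1 per feature
```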

Batch norm at test time:

Use aggregate statistics computed during training

Should we batch norm the input to a linear layer or the output?

The original paper recommends normalising the outputs of a linear layer (the pre-activations), but forums seem to suggest normalising the inputs is more effective.

#### Layer Norm

Definition of layer norm:

Same as batch norm, but averaging across the feature dimension(s) instead of the batch dimension

#### Instance Norm

Definition of instance norm:

Layer norm, but specifically for multi-channel images: each channel of each sample is normalised individually.

#### Group Norm

Definition of group norm:

Instance norm but across groups of channels.
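The four schemes above differ only in which axes the statistics are taken over; a sketch for a `(N, C, H, W)` image batch (the reshape implements the channel groups; all names are mine):

```python
import numpy as np

def normalise(x, axes, delta=1e-5):
    # Normalise over the given axes; every remaining index keeps its own stats
    mu = x.mean(axis=axes, keepdims=True)
    var = x.var(axis=axes, keepdims=True)
    return (x - mu) / np.sqrt(var + delta)

x = np.random.default_rng(0).normal(size=(4, 8, 5, 5))  # (N, C, H, W)

bn = normalise(x, axes=(0, 2, 3))   # batch norm: per channel, across the batch
ln = normalise(x, axes=(1, 2, 3))   # layer norm: per sample, across all features
inorm = normalise(x, axes=(2, 3))   # instance norm: per sample, per channel
gn = normalise(x.reshape(4, 2, 4, 5, 5), axes=(2, 3, 4)).reshape(x.shape)  # group norm, 2 groups
```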

#### Weight Norm

Definition of weight norm:

Normalise the weight vector and multiply it by a learned scale factor $g$ (this $g$ is a parameter, not the gradient!):

$$w = \frac{g}{\lVert v \rVert} v$$
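A one-line sketch of the reparameterisation: $v$ carries the direction, the learned scalar $g$ the magnitude.

```python
import numpy as np

def weight_norm(v, g):
    # w = (g / ||v||) * v: direction from v, magnitude from the learned scalar g
    return g / np.linalg.norm(v) * v

v = np.array([3.0, 4.0])
w = weight_norm(v, g=2.0)
print(w, np.linalg.norm(w))  # the norm of w is exactly g
```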