Regularisation


🚨
NOTE: this section is a skeleton and requires fleshing-out

Introduction

Definition

Any method that reduces the gap between training and test error (potentially at the expense of a higher training error).

Approaches

  1. Early stopping
  2. Dataset regularisation
    1. Dataset augmentation
    2. Noise injection
  3. Parameter regularisation
    1. Parameter norm penalties
    2. Parameter sharing
  4. Multiple learners
    1. Ensemble methods
    2. Dropout

Parameter Norm Penalties

This is any weighted norm of the weights that we add to our objective function.
It penalises the weight terms but not the biases (regularising the biases can lead to underfitting).
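
As a sketch of the general form (notation assumed: $\alpha \ge 0$ is the penalty coefficient, $\Omega$ the norm penalty):

$$\tilde{J}(\boldsymbol{\theta}; \boldsymbol{X}, \boldsymbol{y}) = J(\boldsymbol{\theta}; \boldsymbol{X}, \boldsymbol{y}) + \alpha\,\Omega(\boldsymbol{w})$$

where $\Omega$ is applied to the weights $\boldsymbol{w}$ only (the biases are left unpenalised) and $\alpha$ trades the penalty off against the data-fit term $J$.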

L2 Regularisation

L2 regularisation is also known as what?
Ridge regression, Tikhonov regularisation or weight decay.
 
Definition of L2 regularisation:
Add the penalty term $\frac{\alpha}{2}\|\boldsymbol{w}\|_2^2 = \frac{\alpha}{2}\boldsymbol{w}^\top\boldsymbol{w}$ to the loss function.
 
Why is L2 regularisation a form of weight decay?
Because the gradient descent parameter update becomes $\boldsymbol{w} \leftarrow (1 - \epsilon\alpha)\,\boldsymbol{w} - \epsilon\,\nabla_{\boldsymbol{w}} J(\boldsymbol{w})$: the weights are multiplicatively shrunk ("decayed") by $(1 - \epsilon\alpha)$ before each gradient step.
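
A brief sketch of where this comes from (assuming the $\frac{\alpha}{2}\boldsymbol{w}^\top\boldsymbol{w}$ penalty above and learning rate $\epsilon$):

$$\nabla_{\boldsymbol{w}}\tilde{J}(\boldsymbol{w}) = \alpha\boldsymbol{w} + \nabla_{\boldsymbol{w}} J(\boldsymbol{w})
\quad\Rightarrow\quad
\boldsymbol{w} \leftarrow \boldsymbol{w} - \epsilon\big(\alpha\boldsymbol{w} + \nabla_{\boldsymbol{w}} J(\boldsymbol{w})\big) = (1 - \epsilon\alpha)\,\boldsymbol{w} - \epsilon\,\nabla_{\boldsymbol{w}} J(\boldsymbol{w})$$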
🚨
TODO: proof of below
What effect does L2 regularisation have on the optimal weights $\boldsymbol{w}^*$?
  1. Assume a quadratic approximation to the cost function around its minimum $\boldsymbol{w}^*$.
  2. L2 regularisation rescales $\boldsymbol{w}^*$ along the axes defined by the eigenvectors of the Hessian $\boldsymbol{H}$ of the cost.
  3. The component of $\boldsymbol{w}^*$ that is aligned with the $i$-th eigenvector of $\boldsymbol{H}$ is rescaled by $\frac{\lambda_i}{\lambda_i + \alpha}$ (i.e. the smaller the eigenvalue / the less the curvature in that direction, the more that direction shrinks).
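
A sketch of the argument behind these claims (assuming a quadratic approximation $\hat{J}$ to the unregularised cost around its minimum $\boldsymbol{w}^*$, with Hessian $\boldsymbol{H}$):

$$\hat{J}(\boldsymbol{w}) = J(\boldsymbol{w}^*) + \tfrac{1}{2}(\boldsymbol{w} - \boldsymbol{w}^*)^\top \boldsymbol{H}\,(\boldsymbol{w} - \boldsymbol{w}^*)$$

Adding the penalty $\frac{\alpha}{2}\boldsymbol{w}^\top\boldsymbol{w}$ and setting the gradient to zero at the regularised optimum $\tilde{\boldsymbol{w}}$:

$$\alpha\tilde{\boldsymbol{w}} + \boldsymbol{H}(\tilde{\boldsymbol{w}} - \boldsymbol{w}^*) = 0 \quad\Rightarrow\quad \tilde{\boldsymbol{w}} = (\boldsymbol{H} + \alpha\boldsymbol{I})^{-1}\boldsymbol{H}\,\boldsymbol{w}^*$$

With the eigendecomposition $\boldsymbol{H} = \boldsymbol{Q}\boldsymbol{\Lambda}\boldsymbol{Q}^\top$ this is $\tilde{\boldsymbol{w}} = \boldsymbol{Q}(\boldsymbol{\Lambda} + \alpha\boldsymbol{I})^{-1}\boldsymbol{\Lambda}\boldsymbol{Q}^\top\boldsymbol{w}^*$, so the component along the $i$-th eigenvector is scaled by $\frac{\lambda_i}{\lambda_i + \alpha}$.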
🚨
TODO: proof of below
What effect does L2 regularisation (ridge regression) have on the least squares solution?
The solution becomes $\boldsymbol{w} = (\boldsymbol{X}^\top\boldsymbol{X} + \alpha\boldsymbol{I})^{-1}\boldsymbol{X}^\top\boldsymbol{y}$: it acts as though the variance of the input (the diagonal of $\boldsymbol{X}^\top\boldsymbol{X}$) has increased in each dimension by $\alpha$.
This causes the weights to shrink in compensation.
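
A minimal numpy sketch of this shrinkage (the toy data, the value of alpha and the closed-form solve are illustrative assumptions, not part of the notes):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                       # toy design matrix
y = X @ np.array([1.0, -2.0, 0.5, 3.0, 0.0]) + rng.normal(scale=0.1, size=100)

alpha = 10.0                                        # L2 penalty coefficient

# ordinary least squares: w = (X^T X)^{-1} X^T y
w_ols = np.linalg.solve(X.T @ X, X.T @ y)

# ridge / L2-regularised solution: w = (X^T X + alpha I)^{-1} X^T y
w_ridge = np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)

print(np.linalg.norm(w_ols), np.linalg.norm(w_ridge))  # the ridge weights have the smaller norm
```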

L1 Regularisation

L1 regularisation is also known as what?
Lasso regression.
 
Definition of L1 regularisation:
Add the penalty term $\alpha\|\boldsymbol{w}\|_1 = \alpha\sum_i |w_i|$ to the loss function.
 
How does L1 regularisation change the parameter update?
The gradient of the penalty is $\alpha\,\mathrm{sign}(\boldsymbol{w})$, so the update becomes $\boldsymbol{w} \leftarrow \boldsymbol{w} - \epsilon\alpha\,\mathrm{sign}(\boldsymbol{w}) - \epsilon\,\nabla_{\boldsymbol{w}} J(\boldsymbol{w})$, i.e. the "weight decay" is just a reduction by a constant term $\epsilon\alpha$ towards zero for each element, rather than a multiplicative shrinkage.
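
A minimal sketch contrasting the two updates (the function name, learning rate and toy values are assumptions for illustration):

```python
import numpy as np

def sgd_step(w, grad_J, lr=0.1, alpha=0.01, penalty="l2"):
    """One gradient step on J(w) plus an L1 or L2 penalty."""
    if penalty == "l2":
        # multiplicative shrinkage: w <- (1 - lr*alpha) * w - lr * grad_J
        return (1 - lr * alpha) * w - lr * grad_J
    if penalty == "l1":
        # constant-size pull towards zero: w <- w - lr*alpha*sign(w) - lr * grad_J
        return w - lr * alpha * np.sign(w) - lr * grad_J
    raise ValueError(penalty)

w = np.array([0.5, -0.2, 1.5])
g = np.zeros_like(w)                 # pretend the data-fit gradient is zero
print(sgd_step(w, g, penalty="l2"))  # each weight shrinks in proportion to its size
print(sgd_step(w, g, penalty="l1"))  # each weight moves towards zero by the same fixed amount
```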
🚨
TODO: rest of L1 details & constrained opt

Noise Robustness

Label smoothing

What is label smoothing?
Replacing the hard 1/0 classification targets for a problem with $k$ classes with $1 - \epsilon$ and $\frac{\epsilon}{k - 1}$ respectively, for some small $\epsilon$.
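
A minimal sketch of the mapping (the function name and $\epsilon = 0.1$ are assumptions for illustration):

```python
import numpy as np

def smooth_labels(y, k, eps=0.1):
    """Map hard class indices y to smoothed targets:
    1 - eps for the true class, eps / (k - 1) for every other class."""
    targets = np.full((len(y), k), eps / (k - 1))
    targets[np.arange(len(y)), y] = 1 - eps
    return targets

print(smooth_labels(np.array([0, 2]), k=3))
# [[0.9  0.05 0.05]
#  [0.05 0.05 0.9 ]]
```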

Semi-supervised learning

Outline

  • A standard approach is to learn a generative model (either $P(\boldsymbol{x})$ or $P(\boldsymbol{x}, \boldsymbol{y})$) that shares parameters with a discriminative model ($P(\boldsymbol{y} \mid \boldsymbol{x})$)
  • The models are then optimised using a single loss function that trades off between the supervised criterion (e.g. $-\log P(\boldsymbol{y} \mid \boldsymbol{x})$) and the unsupervised criterion (e.g. $-\log P(\boldsymbol{x})$); see the sketch below
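
A minimal sketch of that trade-off (the function name and the weighting lam are assumptions; in practice both log-probabilities would come from the shared-parameter models):

```python
def semi_supervised_loss(log_p_y_given_x, log_p_x, lam=0.5):
    """Trade off the supervised criterion -log P(y|x) against the
    unsupervised criterion -log P(x), weighted by lam."""
    return -log_p_y_given_x + lam * (-log_p_x)

# toy log-probabilities for a single labelled example:
print(semi_supervised_loss(log_p_y_given_x=-0.3, log_p_x=-5.0))  # 0.3 + 0.5 * 5.0 = 2.8
```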

Dropout

How does dropout work?
For each minibatch item we sample a binary mask to apply to all the non-output units in the network, based on some fixed probability.
 
How does backpropagation work for dropout?
As though we were using a pruned version of the network corresponding to the given mask.
 
How do we do inference using a network trained with dropout?
Either:
  • Average the output over multiple random masks (can work well with as few as 10-20 masks)
  • Use the weight scaling inference rule: use the full model but multiply the weights out of each unit by the probability of including that unit (see the sketch at the end of this section)
 
Intuitive interpretation of dropout:
A form of ensemble learning/bagging without training multiple separate models
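
A minimal numpy sketch of both the training-time masking and the weight-scaling inference rule (the layer shapes, ReLU non-linearity and p_keep value are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
p_keep = 0.5  # assumed inclusion probability for hidden units

def hidden_layer(x, W, b, train=True):
    """A single ReLU layer with dropout applied to its output units."""
    h = np.maximum(0, x @ W + b)
    if train:
        mask = rng.random(h.shape) < p_keep   # fresh binary mask per minibatch item
        return h * mask
    # weight-scaling inference: keep every unit but scale its output by p_keep,
    # equivalent to multiplying the weights out of the unit by its inclusion probability
    return h * p_keep

x = rng.normal(size=(4, 8))            # toy minibatch of 4 examples
W = rng.normal(size=(8, 16))
b = np.zeros(16)
print(hidden_layer(x, W, b, train=True))   # roughly half of the activations are zeroed
print(hidden_layer(x, W, b, train=False))  # all activations kept, scaled by p_keep
```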