Background
Maximum Likelihood
A cost function enables us to calculate a scalar performance metric evaluating our model's predictions over a dataset. We aim to optimise with respect to this metric, so we want to choose a cost function with a minimum point that represents the "best" performance for the model.
We typically think of the model's functional form as fixed, but parameterised by some learned parameters $\boldsymbol{\theta}$. This reduces our optimisation problem to finding the right solution in parameter space, rather than function space. Thus we can frame the cost function as an evaluation of our model's parameters.
We can think of our model as either outputting a single prediction, or as defining a conditional probability distribution: $p(y \mid \mathbf{x}; \boldsymbol{\theta})$. We will consider the latter case first. We desire a cost function with a minimum at the point where our model is "best", but how do we define this?
One common answer to this question is the maximum likelihood principle. This simply states that given a dataset (inputs and labels), the "best" model is the one that allocates the highest probability to the correct labels given the inputs.
The following steps show how we can describe this criterion using a cost function, denoted $J(\boldsymbol{\theta})$:
- We define the likelihood of the data as: $p(\mathbf{Y} \mid \mathbf{X}; \boldsymbol{\theta})$.
- Assuming the data is i.i.d., we can factorise the joint distribution as: $p(\mathbf{Y} \mid \mathbf{X}; \boldsymbol{\theta}) = \prod_{i=1}^{m} p(y^{(i)} \mid \mathbf{x}^{(i)}; \boldsymbol{\theta})$.
- Our criterion states that the "best" set of parameters should give the most probability to the training data. In mathematical terms, this means: $\boldsymbol{\theta}^{*} = \arg\max_{\boldsymbol{\theta}} p(\mathbf{Y} \mid \mathbf{X}; \boldsymbol{\theta})$.
- As specified by our definition of the cost function, we need $\boldsymbol{\theta}^{*} = \arg\min_{\boldsymbol{\theta}} J(\boldsymbol{\theta})$.
- One form for $J(\boldsymbol{\theta})$ which satisfies this is: $J(\boldsymbol{\theta}) = -p(\mathbf{Y} \mid \mathbf{X}; \boldsymbol{\theta})$.
- Long chains of multiplication can lead to problems such as numerical instability. Moving into log space solves this problem without changing the $\arg\min$. This gives us our final form for $J(\boldsymbol{\theta})$, the negative log-likelihood (sketched in code just after this list): $J(\boldsymbol{\theta}) = -\sum_{i=1}^{m} \log p(y^{(i)} \mid \mathbf{x}^{(i)}; \boldsymbol{\theta})$
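To make this concrete, here is a minimal NumPy sketch of the NLL, assuming the model has already produced a probability for each example's correct label (the function name and numbers are purely illustrative):

```python
import numpy as np

def negative_log_likelihood(probs_of_true_labels):
    """NLL: sum of -log p(y_i | x_i; theta) over the dataset.

    `probs_of_true_labels` holds the probability the model assigns
    to the *correct* label of each training example.
    """
    return -np.sum(np.log(probs_of_true_labels))

# Toy example: a model that assigns high probability to the correct
# labels has a lower NLL than one that is unsure.
confident_model = np.array([0.9, 0.8, 0.95])
unsure_model = np.array([0.5, 0.5, 0.5])
print(negative_log_likelihood(confident_model))  # ~0.38
print(negative_log_likelihood(unsure_model))     # ~2.08
```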
We now have a criterion for our model's parameters! This gives us something to tune the parameters with respect to, regardless of the choice of model.
There is another observation we can make to further justify our use of maximum likelihood. We can re-frame our formula for the NLL as an expectation over the empirical data distribution (dividing by the constant $m$, which does not change the optimum):

$$J(\boldsymbol{\theta}) = -\mathbb{E}_{\mathbf{x}, y \sim \hat{p}_{\text{data}}} \log p_{\text{model}}(y \mid \mathbf{x}; \boldsymbol{\theta})$$
This formulation is exactly the same as something we have seen before: the cross-entropy. This leads us to a neat conclusion:
Minimising the NLL is the same as minimising the cross-entropy of the model's distribution relative to the data distribution.
Neat!
Note that this can also be framed as minimising the KL divergence between the two distributions, as the KL is simply $D_{\mathrm{KL}}(\hat{p}_{\text{data}} \,\|\, p_{\text{model}}) = H(\hat{p}_{\text{data}}, p_{\text{model}}) - H(\hat{p}_{\text{data}})$, and the entropy term here is irrelevant for the optimisation because it does not depend on $\boldsymbol{\theta}$.
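We can verify the decomposition cross-entropy = entropy + KL numerically; the two distributions below are made up purely for illustration:

```python
import numpy as np

# Hypothetical discrete "data" and "model" distributions over 3 classes.
p_data = np.array([0.7, 0.2, 0.1])
p_model = np.array([0.6, 0.3, 0.1])

entropy = -np.sum(p_data * np.log(p_data))
cross_entropy = -np.sum(p_data * np.log(p_model))
kl = np.sum(p_data * np.log(p_data / p_model))

# Cross-entropy differs from the KL only by the (theta-independent) entropy.
print(np.isclose(cross_entropy, entropy + kl))  # True
```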
This gives us three inter-related ways of motivating our decision to minimise the NLL:
- It satisfies the maximum likelihood principle.
- It minimises the cross-entropy of the model's distribution relative to the data distribution.
- It minimises the KL divergence from the model's distribution to the data distribution.
Binary Logistic Regression
(Some good notes on this section can be found at: peterroelants.github.io/posts/cross-entropy-logistic)
For tasks where the prediction is of a binary label, we can use our model to define a Bernoulli distribution over $y$ conditioned on $\mathbf{x}$. The task of our network is to learn a conditional value for the distribution's parameter $\hat{y} = P(y = 1 \mid \mathbf{x})$ (the final activation), which we can then use for prediction.
We have a particular requirement for this parameter: $\hat{y} \in [0, 1]$. To satisfy this, we must add a layer to the end of our network to bound its raw output $z$ (note: this unbounded value is sometimes called a logit). One common choice is the sigmoid function.
Wikipedia defines the sigmoid function as a general family of S-shaped curves, and refers to this particular function as the logistic function.
The sigmoid function is defined as follows:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

We can use this to model $P(y = 1 \mid \mathbf{x}) = \sigma(z)$ (recalling that $z$ is a function of $\mathbf{x}$), and then in accordance with the laws of probability we can take $P(y = 0 \mid \mathbf{x}) = 1 - \sigma(z)$ to give us our full distribution over labels.
Three interesting properties of this function are:
- $1 - \sigma(z) = \sigma(-z)$
- $\frac{\mathrm{d}}{\mathrm{d}z}\sigma(z) = \sigma(z)\,(1 - \sigma(z))$
- $\sigma^{-1}(\hat{y}) = \log\left(\frac{\hat{y}}{1 - \hat{y}}\right)$, the logit (log-odds) function
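These properties are easy to check numerically; here is a minimal sketch (the `sigmoid` helper is our own illustrative definition):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-5, 5, 11)

# Symmetry: 1 - sigmoid(z) == sigmoid(-z)
print(np.allclose(1 - sigmoid(z), sigmoid(-z)))  # True

# Derivative: sigma'(z) == sigma(z) * (1 - sigma(z)), checked numerically
eps = 1e-6
numeric_grad = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
print(np.allclose(numeric_grad, sigmoid(z) * (1 - sigmoid(z))))  # True

# Inverse: the logit (log-odds) recovers z from sigmoid(z)
print(np.allclose(np.log(sigmoid(z) / (1 - sigmoid(z))), z))  # True
```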
But why use this particular bounding function over any other form? Well, it turns out that if we assume a very simple linear model for the unnormalised log probability, this is what results.
We begin by modelling the unnormalised log probability, $\log \tilde{P}(y \mid \mathbf{x})$. This is a good place to start, as whereas $P(y \mid \mathbf{x})$ must lie in $[0, 1]$, $\log \tilde{P}(y \mid \mathbf{x})$ is free to take any value in $(-\infty, \infty)$. The simplest model for our final layer is the linear model:

$$\log \tilde{P}(y \mid \mathbf{x}) = yz$$

One useful feature of this form is that the $y = 0$ case is constant: $\log \tilde{P}(y = 0 \mid \mathbf{x}) = 0$. We will see when we normalise how this translates into a single output controlling the probabilities of both cases, as required for a Bernoulli distribution.
To convert this into an expression for the probability distribution we take the following steps:

$$\tilde{P}(y \mid \mathbf{x}) = \exp(yz)$$

$$P(y \mid \mathbf{x}) = \frac{\exp(yz)}{\sum_{y' \in \{0, 1\}} \exp(y'z)} = \frac{\exp(yz)}{1 + \exp(z)}$$

$$P(y \mid \mathbf{x}) = \sigma\big((2y - 1)z\big)$$
Thus we have shown that the sigmoid activation function is the natural consequence of a linear model for our log probabilities.
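As a quick sanity check of this derivation, the normalised linear model and the sigmoid form agree numerically (toy values, illustrative helper):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-4, 4, 9)
for y in (0, 1):
    normalised = np.exp(y * z) / (1.0 + np.exp(z))   # exp(yz) / sum_y' exp(y'z)
    via_sigmoid = sigmoid((2 * y - 1) * z)           # sigma((2y - 1) z)
    print(y, np.allclose(normalised, via_sigmoid))   # True for both labels
```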
We can use this form with the NLL defined in the previous section:

$$J(\boldsymbol{\theta}) = -\log P(y \mid \mathbf{x}) = -\log \sigma\big((2y - 1)z\big) = \zeta\big((1 - 2y)z\big)$$

where $\zeta(x) = \log(1 + e^{x})$ is the softplus function.
Visualising the softplus function, its curve looks like a smooth version of the ReLU function. To minimise this cost (which is our objective) we must therefore drive its argument $(1 - 2y)z$ towards the negative extreme: $z \to +\infty$ when $y = 1$, and $z \to -\infty$ when $y = 0$.
Thus the consequence of using the sigmoid activation in combination with maximum likelihood is that our learning objective for the logits is to make the predictions for our 1 labels as positive as possible, and for our 0 labels as negative as possible.
This is pretty much what we would intuitively expect! It's promising to see that maximum likelihood in combination with a linear log-probability model (i.e. sigmoid) leads to such a natural objective for our network.
Note that the above equation is not the form one typically sees the NLL/CE of a binary variable written in. More common (although perhaps less insightful) is the form:

$$J(\boldsymbol{\theta}) = -\big[y \log \hat{y} + (1 - y)\log(1 - \hat{y})\big], \qquad \hat{y} = \sigma(z)$$
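The two forms are equal, which we can check with a small sketch (names and values here are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softplus(x):
    return np.log1p(np.exp(x))

z = np.linspace(-3, 3, 7)
for y in (0, 1):
    standard_form = -(y * np.log(sigmoid(z)) + (1 - y) * np.log(1 - sigmoid(z)))
    softplus_form = softplus((1 - 2 * y) * z)
    print(y, np.allclose(standard_form, softplus_form))  # True for both labels
```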
One practical consideration we also shouldn't overlook here is how amenable this combination of cost function and final layer (/distribution) is to gradient-based optimisation.
What we really care about here is the degree to which the size of the gradient shrinks when the outputs of the layer are towards the extremes. We call this phenomenon saturation, and it leads to very slow learning in cases where we have predictions that are incorrect by a large amount.
Sigmoid activation combined with MSE has exactly this problem: for a confidently wrong prediction (e.g. $z \ll 0$ with $y = 1$) the derivative of the MSE cost with respect to $z$ tends to 0, whereas in our case it tends to $-1$.
The derivative of the cost function with respect to $z$ is simply:

$$\frac{\partial J}{\partial z} = \sigma(z) - y = \hat{y} - y$$
Take a moment to appreciate how wonderfully simple this is!
For the purpose of learning, the specific gradient values are ideal. In the case of a very wrong input for a positive label ($y = 1$, $z \ll 0$), we have $\sigma(z) \to 0$ so the derivative tends to $-1$; for a very wrong negative label ($y = 0$, $z \gg 0$) the derivative tends to $1$.
This is exactly the behaviour we want: large gradients for very wrong predictions (although not too large). Conversely, the gradient for good predictions tends to zero in both cases. Learning will only slow down for this layer when we get close to the right answer!
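To see the contrast with the saturating sigmoid-plus-MSE combination, here is a small sketch (made-up values) comparing the two gradients for a confidently wrong positive example:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

y, z = 1.0, -10.0  # positive label, confidently wrong prediction

# MSE cost (sigmoid(z) - y)^2: gradient saturates towards 0.
mse_grad = 2 * (sigmoid(z) - y) * sigmoid(z) * (1 - sigmoid(z))

# NLL / cross-entropy cost: gradient stays close to -1.
nll_grad = sigmoid(z) - y

print(mse_grad)  # ~ -9e-05, vanishingly small
print(nll_grad)  # ~ -0.99995
```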
Multi-class Logistic Regression
(Some good notes on this section can also be found at: peterroelants.github.io/posts/cross-entropy-softmax)
When we have $n$ output classes we instead use a Multinoulli distribution with $n$ parameters. The labels here are now $y \in \{1, \dots, n\}$.
To avoid an implicit assumption about numerically close classes being more similar, we model this using a network with $n$ outputs, $z_1, \dots, z_n$. We add a final layer to convert each of these into a probability of the associated class being the value of the label: $\hat{y}_i = P(y = i \mid \mathbf{x})$.
We can think of our model's final output as a vector that represents a probability distribution over labels. Note that this means we must also make sure to normalise the output values so that they sum to 1.
We use the same approach as in the binary-class case to model $P(y = i \mid \mathbf{x})$. We begin with a linear model for the unnormalised log probability at each output:

$$z_i = \log \tilde{P}(y = i \mid \mathbf{x})$$

Exponentiating and normalising gives:

$$\mathrm{softmax}(\mathbf{z})_i = \frac{\exp(z_i)}{\sum_{j=1}^{n} \exp(z_j)}$$
We can think of this function as a "soft" version of the $\arg\max$ function; in fact, some suggest that it should more properly be named softargmax. Softmax is the generalisation of the sigmoid over a vector. Its derivative is also similar (in fact, for the $i = j$ case it has the same form). We find this using the quotient rule:

$$\frac{\partial\, \mathrm{softmax}(\mathbf{z})_i}{\partial z_j} = \mathrm{softmax}(\mathbf{z})_i \big(\mathbb{1}_{i = j} - \mathrm{softmax}(\mathbf{z})_j\big)$$
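We can sanity-check this Jacobian against finite differences with a minimal sketch (illustrative values):

```python
import numpy as np

def softmax(z):
    e = np.exp(z)
    return e / np.sum(e)

z = np.array([1.0, -0.5, 2.0])
s = softmax(z)

# Analytic Jacobian: J[i, j] = s_i * (1[i == j] - s_j)
analytic = np.diag(s) - np.outer(s, s)

# Finite-difference Jacobian for comparison
eps = 1e-6
numeric = np.zeros((3, 3))
for j in range(3):
    dz = np.zeros(3)
    dz[j] = eps
    numeric[:, j] = (softmax(z + dz) - softmax(z - dz)) / (2 * eps)

print(np.allclose(analytic, numeric))  # True
```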
We can plug the softmax into the NLL to give:

$$J(\boldsymbol{\theta}) = -\log \mathrm{softmax}(\mathbf{z})_y = -z_y + \log \sum_{j=1}^{n} \exp(z_j)$$
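A quick check (toy logits) that the two expressions for the multi-class NLL agree:

```python
import numpy as np

def softmax(z):
    e = np.exp(z)
    return e / np.sum(e)

z = np.array([2.0, -1.0, 0.5])  # logits for 3 classes
y = 2                           # index of the correct class

nll_direct = -np.log(softmax(z)[y])
nll_logsumexp = -z[y] + np.log(np.sum(np.exp(z)))

print(np.isclose(nll_direct, nll_logsumexp))  # True
```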
One point worth noting is that this looks a bit simpler than the binary case - shouldn't it be at least as complex? The reason is that here we have one output for each class, whereas, because the probabilities sum to 1, we only really need $n - 1$ of them. We used that minimal parameterisation (a single output $z$) in the binary case, which made the reasoning a little more complex.
Intuitively, we can understand this cost function as incentivising the correct output $z_y$ to increase and the rest to decrease. Since $\log \sum_j \exp(z_j) \approx \max_j z_j$, the second term punishes the largest incorrect output the most, which is also desirable.
We can see exactly how learning progresses by calculating the gradient of the NLL. We do this over a single label for a single output $z_i$:

$$\frac{\partial J}{\partial z_i} = \mathrm{softmax}(\mathbf{z})_i - \mathbb{1}_{i = y} = \hat{y}_i - y_i$$
Fantastic! This is exactly the same as the gradient in the binary case, but we have it over each item of the output vector instead of the scalar we had before. Everything we deduced for that scenario works just the same here.
We won't prove it here, but if we create a regression model which outputs the mean of a Gaussian, it turns out that the partial derivatives at the final layer are also of the form $\hat{y} - y$. Evidently this is a really simple and desirable gradient to have at the final layer!
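And a finite-difference check of this gradient (again with made-up logits):

```python
import numpy as np

def softmax(z):
    e = np.exp(z)
    return e / np.sum(e)

def nll(z, y):
    return -np.log(softmax(z)[y])

z = np.array([0.3, -1.2, 2.0])
y = 0  # correct class index
one_hot = np.eye(len(z))[y]

# Analytic gradient: softmax(z) - one-hot(y)
analytic_grad = softmax(z) - one_hot

eps = 1e-6
numeric_grad = np.array([
    (nll(z + eps * np.eye(len(z))[i], y) - nll(z - eps * np.eye(len(z))[i], y)) / (2 * eps)
    for i in range(len(z))
])

print(np.allclose(analytic_grad, numeric_grad))  # True
```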
Numerically Stable Softmax
We should instinctively be wary of the exponential and log terms in our softmax function. As we saw in chapter 4, these terms have the potential to drive our values into ranges that can't be precisely represented by floating-point numbers if we're not careful.
Specifically, we want to avoid extremely large or extremely negative inputs to the softmax, which would drive the exponentials to infinity or zero respectively.
Fortunately, there is a trick which can help in this regard. It can be easily shown that $\mathrm{softmax}(\mathbf{z}) = \mathrm{softmax}(\mathbf{z} + c)$ for any constant $c$. From this we can derive our numerically stable variant of softmax:

$$\mathrm{softmax}(\mathbf{z})_i = \frac{\exp\big(z_i - \max_j z_j\big)}{\sum_k \exp\big(z_k - \max_j z_j\big)}$$
This means that our inputs to the function are simply the differences between each input and the largest input. Thus, if the scale of all the inputs becomes very large or negative, the computation will still be stable so long as the relative values are not extremely different across the inputs.
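Here is a minimal sketch of the stable variant, showing it behaves where a naive implementation overflows (illustrative, not production code):

```python
import numpy as np

def naive_softmax(z):
    e = np.exp(z)
    return e / np.sum(e)

def stable_softmax(z):
    shifted = z - np.max(z)          # largest input becomes 0
    e = np.exp(shifted)
    return e / np.sum(e)

z = np.array([1000.0, 1001.0, 1002.0])  # large-scale logits
print(naive_softmax(z))   # [nan nan nan] after overflow warnings
print(stable_softmax(z))  # [0.09003057 0.24472847 0.66524096]
```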
I believe that the gradient of the softmax should discourage these relative values getting too large, but I'm not sure. If this is the case, then we are protected against numerical instability from all directions.