Based on Chapter 4 of the Deep Learning Book
Gradient descent is simple, right? It's just
Well there's a bit more to consider. Lets start by looking at what we're actually trying to do here. Our aim is to minimise a function .
We make two assumptions about this function:
- We don't have a nice analytic expression for it.
- We do have a nice way of calculating the gradient (and ideally the Hessian).
We can't easily calculate the value of that minimises , so gradient descent proposes the following:
We can find the (possibly local) minimum of by taking small linear steps from some starting value, in the direction of greatest decrease of at each step.
We want a scalar value that describes the slope of the multivariate function in the direction of unit vector .
This is the directional derivative defined as follows:
The Taylor series gives us a local approximation of at a point .
It is defined as:
where and are the gradient and Hessian of at .
Defined for some matrix as:
The higher this number, the more will change as a result of small changes in . This can make approximate solutions to equations using increasingly inaccurate.
At each iteration we want to take a step in the direction that minimises the slope of .
In other words, the direction of minimising .
As , the desired value of is that where , which is achieved when is in the opposite direction to .
This gives us the update rule of: .
What can we say about the effect of this approximation over a step-size of ?
Considering the second-order Taylor approximation round the point , we can see that our step in the direction of the gradient gives:
These three terms represent: the original value, the improvement due to the gradient step, and the effect of the curvature of over the step (may help or hinder depending on the sign).
non-positive: is always decreasing
positive: too-large a step will reduce performance
Assuming the latter case, we can differentiate the above equation to find the optimal step size (for this quadratic approximation). This comes out as:
In other words - the Hessian gives us a good starting point for tuning the step-size!
In fact, we can go a step further. The worst-case scenario is when aligns with the maximum eigenvector of , corresponding to eigenvalue . This gives:
This gives us some intuition as to how the eigenvalues of the Hessian are tied to the maximum learning rate.
If the condition number of the Hessian is large, gradient descent performs poorly as the derivative increases at different rates in different dimensions. To adjust for this, we typically have to use a step size small enough to account for the direction with the highest curvature, making progress in other directions slow.
Understanding this was something of an epiphany for me. I had assumed that the typical approach to first and second-order optimisation was fundamentally the same but with a second term added to the Taylor approximation. You can do it that way, but...
- For first-order optimisation we have to take a bounded step in the direction of the gradient (because minimising the actual linear version just wants us to go to ).
- However, for second-order optimisation we can go straight to the minimum!
Visualise this as fitting a curve to a point. If it's a linear tangent then minimising the line goes out the bottom of the graph. However a quadratic curve has some nice minimum which we can jump straight to!
Let's look back at our Taylor series. Minimising the first order equation wrt gives us this nonsense:
In other words, we can only minimise this linear equation if our local point is already a flat line; otherwise there is no solution for that gives us a minimum - we must take a tentative step instead!
However, things change when we look at the second-order approximation:
This gives us a different update equation which jumps straight to the approximated minimum!
One drawback here however is that this method, unlike first-order methods, is attracted to saddle points where it can get stuck.
It's rare, of course, that the quadratic minimum will take us straight to the actual minimum, but in practice this tends to iterate much faster than gradient descent.