Contents
- Motivation
- Dependent variable
- Independent variable
- Plot criteria
- Naive scaling law
- Correct scaling law
  - Kaplan et al.’s power-law
  - Understanding the loss
  - Probs
- Let’s plot!
Motivation
These notes are motivated by a simple question: how should we plot the performance of our model?
Dependent variable
What do I mean by “performance” of a model? Potentially any of:
- Accuracy: the fraction of “successful” predictions for our training problem
- Probability $q$: the probability our model assigns to the target label
- Loss $L$: the metric we are minimising. Typically the (average) negative log probability
We’re going to focus on the second two metrics here.
Independent variable
We’re going to focus here on any of:
- Model size
- Compute budget
- Dataset size
as these are the hyperparameters studied by Kaplan et al. (2020), who show that they all obey a power-law relationship. From here on we will use the variable $x$ to denote any one of these independent variables.
Plot criteria
I’m going to suggest two different criteria for a given plot, both of which we hope to be able to satisfy:
- Our axes plot the most intuitive variables available - we can immediately “make sense” of what the graph is showing us
- We get a straight line for an “ideal” result (one that perfectly matches our model of the relationship). Why do we care about this?
  - It allows us to get a clearer sense of how closely our results match our model - do we get a straight line in practice?
  - The gradient and intercept of the line will typically reflect some key coefficients in our model. These will be immediately visible from the plot.
  - It allows us to extrapolate beyond our recorded results by simply extending the line
Naive scaling law
We’ll begin with a simple, and wrong approach to doing this. We make the following assumption:
Let’s assume that multiplying our hyperparameter $x$ by a factor $c$ always corresponds to a multiplicative change of $d$ in the probability of error $p_{\mathrm{err}}$.
E.g. if we double the number of parameters ($c = 2$), we expect the error to drop by a quarter ($d = 3/4$).
This can be shown to lead to a power-law relationship:
$$p_{\mathrm{err}} = a \, x^{-b}$$
where $b = -\log d / \log c$ and $a$ is a constant. If we take the logs of both sides, we get:
$$\log p_{\mathrm{err}} = \log a - b \log x$$
This gives us a linear relationship using a log-log scale. Job done!
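As a quick sketch of what this looks like (made-up coefficients $a$ and $b$, plotted with numpy/matplotlib), a power law comes out as a straight line on log-log axes:

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up coefficients for the naive power law p_err = a * x**(-b)
a, b = 0.5, 0.415

x = np.logspace(6, 10, 20)          # e.g. model sizes from 1e6 to 1e10 parameters
p_err = a * x ** -b                 # probability of error under the naive law

plt.loglog(x, p_err, marker="o")    # power law -> straight line on log-log axes
plt.xlabel("hyperparameter $x$")
plt.ylabel("error probability")
plt.show()
```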
This looks great, except that according to Kaplan et al. this is not the right power-law relationship. It’s not the probability of error that obeys a power law, but the loss (i.e. the negative log probability).
Correct scaling law
Kaplan et al.’s power-law
Kaplan et al. tell us that for any of our hyperparameters $x$, the loss can be decomposed into an irreducible and a reducible loss, the latter of which obeys a power law:
$$L(x) = L_\infty + a \, x^{-b}$$
Comparing this to our naive scaling law, we’ve done two things:
- Swapped our error for the loss
- Subtracted a constant irreducible loss from our original loss
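To see why that subtracted constant matters for plotting, here is a small sketch (made-up $L_\infty$, $a$, $b$) showing that the raw loss bends on log-log axes, while the reducible part $L(x) - L_\infty$ gives a straight line:

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up coefficients for L(x) = L_inf + a * x**(-b)
L_inf, a, b = 1.7, 400.0, 0.3

x = np.logspace(6, 10, 20)
loss = L_inf + a * x ** -b

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.loglog(x, loss)                 # raw loss: flattens out as it approaches L_inf
ax1.set_title("raw loss (bends)")
ax2.loglog(x, loss - L_inf)         # reducible loss: a clean straight line
ax2.set_title("reducible loss (straight)")
plt.show()
```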
Understanding the loss
The standard loss is the mean negative log probability over the training data:
$$L = -\frac{1}{n}\sum_{i=1}^{n} \log q(y_i)$$
where $q(y_i)$ is the probability our model assigns to the target $y_i$.
We can view this loss as the sum of an irreducible and a reducible loss:
$$L = \underbrace{-\frac{1}{n}\sum_{i=1}^{n} \log p(y_i)}_{\text{irreducible}} \;+\; \underbrace{-\frac{1}{n}\sum_{i=1}^{n} \log \frac{q(y_i)}{p(y_i)}}_{\text{reducible}}$$
where $p(y_i)$ is the probability of the target under the (true) data-generating distribution.
These terms also have an information-theoretic interpretation. Our loss can be viewed as an estimate of the cross entropy $H(p, q)$ of the distribution $q$ defined by our model, with respect to the (true) data-generating distribution $p$.
This decomposes into the sum of the entropy of the (true) data distribution and the KL divergence of the model distribution from the data distribution:
$$H(p, q) = H(p) + D_{\mathrm{KL}}(p \,\|\, q)$$
The KL is what obeys a power law (with the irreducible loss $L_\infty$ playing the role of the entropy $H(p)$):
$$D_{\mathrm{KL}}(p \,\|\, q) = a \, x^{-b}$$
Unlike the cross entropy, which has a lower bound given by the entropy $H(p)$, the KL divergence spans the full range from $0$ to $\infty$. This makes it better to plot, especially in the next section where we turn everything into probs!
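As a tiny numerical check of this decomposition (made-up four-class distributions $p$ and $q$):

```python
import numpy as np

# Made-up true distribution p and model distribution q over four classes
p = np.array([0.5, 0.25, 0.15, 0.10])
q = np.array([0.4, 0.30, 0.20, 0.10])

cross_entropy = -(p * np.log(q)).sum()   # H(p, q): what the training loss estimates
entropy = -(p * np.log(p)).sum()         # H(p): the irreducible part
kl = (p * np.log(p / q)).sum()           # D_KL(p || q): the reducible part

print(cross_entropy, entropy + kl)       # the two agree
```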
Probs
A key criterion for us is interpretable axes.
The information-theoretic viewpoint shows us that it’s really the KL that follows the power law. But a raw KL number doesn’t mean very much at a glance. Is a KL of 2.9 “good”? 🤷
However, we do know what to make of our raw probabilities. They have a nice $[0, 1]$ bound, and we generally have a sense, for a given task, of how “good” a probability of 0.87 is. So can we view our neat power law in terms of probability values? Yes!
$$D_{\mathrm{KL}}(p \,\|\, q) \approx -\frac{1}{n}\sum_{i=1}^{n} \log \tilde{q}_i, \qquad \tilde{q}_i = \frac{q(y_i)}{p(y_i)}$$
where $\tilde{q}_i$ is the model probability $q(y_i)$ divided by the true data probability $p(y_i)$. These scaled probabilities can then be related to our power-law:
$$-\frac{1}{n}\sum_{i=1}^{n} \log \tilde{q}_i = a \, x^{-b}$$
We’ve now related our power-law to the mean negative log of (scaled) probabilities, rather than the raw KL, which is easier to work with.
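In code, the conversion might look as follows (a sketch with hypothetical per-example probabilities; in practice we rarely know the true $p(y_i)$, so this is purely illustrative):

```python
import numpy as np

# Hypothetical per-example probabilities for the observed targets:
# q_i from the model, p_i from the (usually unknown) true data distribution.
q_i = np.array([0.60, 0.35, 0.80, 0.50])
p_i = np.array([0.70, 0.40, 0.85, 0.65])

scaled = q_i / p_i                      # the scaled probabilities above
kl_estimate = -np.log(scaled).mean()    # mean negative log of scaled probs ~ KL
geo_mean = np.exp(-kl_estimate)         # geometric mean of the scaled probs, in (0, 1]

print(kl_estimate, geo_mean)
```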
Let’s plot!
The scaling laws paper has this figure:
Let’s just focus on the yellow line. We know it obeys a scaling law, and they plot params vs loss on a log scale.
However, we don’t quite get a straight line. To get one, we’d have to account for the entropy $H(p)$.
Plotting the geometric mean of the scaled probabilities, $\exp(-D_{\mathrm{KL}})$, accounts for this (via the subtraction of the entropy $H(p)$ from the loss), and also gives us a y-axis scale between 0 and 1:
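For a rough sense of what such a plot looks like, here is a sketch reusing the made-up $L(x) = L_\infty + a\,x^{-b}$ coefficients from earlier:

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up coefficients, reusing L(x) = L_inf + a * x**(-b) from above
L_inf, a, b = 1.7, 400.0, 0.3

x = np.logspace(6, 10, 50)
loss = L_inf + a * x ** -b
geo_mean_scaled_prob = np.exp(-(loss - L_inf))   # exp(-KL), lives in (0, 1]

plt.semilogx(x, geo_mean_scaled_prob)
plt.xlabel("model size $x$")
plt.ylabel("geometric mean of scaled probs")
plt.ylim(0, 1)
plt.show()
```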
Neat!
Update: ignoring the power-law stuff here, I think plotting the inverse perplexity gives what I want:
$$\mathrm{PPL}^{-1} = \exp\!\left(\frac{1}{n}\sum_{i=1}^{n} \log q(y_i)\right) = \left(\prod_{i=1}^{n} q(y_i)\right)^{1/n}$$
This denotes the geometric mean of the probs - a pretty good and interpretable (unlike actual perplexity!) measure of how well the model’s doing, that’s also aligned with the training objective! It starts at 0 (although in practice, random guessing gives 1/num_classes) and perfection is 1 (or in practice, a bit less because of irreducible loss).
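A minimal sketch of computing it (assuming `token_log_probs`, a hypothetical array holding the model’s log-probabilities of the observed tokens):

```python
import numpy as np

# Hypothetical log-probabilities the model assigns to the observed tokens
token_log_probs = np.array([-0.9, -2.1, -0.4, -1.3, -0.7])

mean_loss = -token_log_probs.mean()   # the usual mean negative log prob loss
inverse_ppl = np.exp(-mean_loss)      # geometric mean of the token probabilities

print(inverse_ppl)                    # in (0, 1]; higher is better
```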