Contents
- Motivation
- Dependent variable
- Independent variable
- Plot criteria
- Naive scaling law
- Correct scaling law
  - Kaplan et al.’s power-law
  - Understanding the loss
  - Probs
- Let’s plot!
Motivation
These notes are motivated by a simple question: how should we plot the performance of our model?
Dependent variable
What do I mean by “performance” of a model? Potentially any of:
- Accuracy: the fraction of “successful” predictions for our training problem
- Probability $q$: the probability our model assigns to the target label
- Loss $L$: the metric we are minimising. Typically the (average) negative log probability
We’re going to focus on the second two metrics here.
Independent variable
We’re going to focus here on any of:
- Model size
- Compute budget
- Dataset size
as these are the hyperparameters studied by Kaplan et al. (2020), who show that they all obey a power-law relationship. From here on we will use the variable $x$ to denote any one of these independent variables.
Plot criteria
I’m going to suggest two different criteria for a given plot, both of which we hope to be able to satisfy:
- Our axes plot the most intuitive variables available - we can immediately “make sense” of what the graph is showing us
- We get a straight line for an “ideal” result (one that perfectly matches our model of the relationship). Why do we care about this?
  - It allows us to get a clearer sense of how closely our results match our model - do we get a straight line in practice?
  - The gradient and intercept of the line will typically reflect some key coefficients in our model. These will be immediately visible from the plot.
  - It allows us to extrapolate beyond our recorded results by simply extending the line
Naive scaling law
We’ll begin with a simple, and wrong approach to doing this. We make the following assumption:
Let’s assume that multiplying our hyperparameter $x$ by a factor $c$ always corresponds to a multiplicative change of $d$ in the probability of error $p_{\mathrm{err}}$.
E.g. if we double the number of parameters ($c = 2$), we expect the error to drop by a quarter ($d = 3/4$).
This can be shown to lead to a power-law relationship:
$$p_{\mathrm{err}} = a \, x^{-b}$$
where $b = -\log d / \log c$ and $a$ is a constant. If we take the logs of both sides, we get:
$$\log p_{\mathrm{err}} = \log a - b \log x$$
This gives us a linear relationship using a log-log scale. Job done!
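As a quick sketch of what this looks like (made-up coefficients $a$ and $b$, plotted with numpy/matplotlib), a power law comes out as a straight line on log-log axes:

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up coefficients for the naive power law p_err = a * x**(-b)
a, b = 0.5, 0.415

x = np.logspace(6, 10, 20)          # e.g. model sizes from 1e6 to 1e10 parameters
p_err = a * x ** -b                 # probability of error under the naive law

plt.loglog(x, p_err, marker="o")    # power law -> straight line on log-log axes
plt.xlabel("hyperparameter $x$")
plt.ylabel("error probability")
plt.show()
```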
This looks great, except that according to Kaplan et al. this is not the right power-law relationship. It’s not the probability of error that obeys a power law, but the loss (i.e. the negative log probability).
Correct scaling law
Kaplan et al.’s power-law
Kaplan et al. tell us that for any of our hyperparameters $x$, the loss can be decomposed into an irreducible and a reducible loss, the latter of which obeys a power law:
$$L(x) = L_\infty + a \, x^{-b}$$
Comparing this to our naive scaling law, we’ve done two things:
- Swapped our error for the loss
- Subtracted a constant irreducible loss from our original loss
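To see why that subtracted constant matters for plotting, here is a small sketch (made-up $L_\infty$, $a$, $b$) showing that the raw loss bends on log-log axes, while the reducible part $L(x) - L_\infty$ gives a straight line:

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up coefficients for L(x) = L_inf + a * x**(-b)
L_inf, a, b = 1.7, 400.0, 0.3

x = np.logspace(6, 10, 20)
loss = L_inf + a * x ** -b

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.loglog(x, loss)                 # raw loss: flattens out as it approaches L_inf
ax1.set_title("raw loss (bends)")
ax2.loglog(x, loss - L_inf)         # reducible loss: a clean straight line
ax2.set_title("reducible loss (straight)")
plt.show()
```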
Understanding the loss
The standard loss is the mean negative log probability over the training data:
$$L = -\frac{1}{n}\sum_{i=1}^{n} \log q(y_i)$$
where $q(y_i)$ is the probability our model assigns to the target $y_i$.
We can view this loss as the sum of an irreducible and a reducible loss:
$$L = \underbrace{-\frac{1}{n}\sum_{i=1}^{n} \log p(y_i)}_{\text{irreducible}} \;+\; \underbrace{-\frac{1}{n}\sum_{i=1}^{n} \log \frac{q(y_i)}{p(y_i)}}_{\text{reducible}}$$
where $p(y_i)$ is the probability of the target under the (true) data-generating distribution.
These terms also have an information-theoretic interpretation. Our loss can be viewed as an estimate of the cross entropy $H(p, q)$ of the distribution $q$ defined by our model, with respect to the (true) data-generating distribution $p$.
This decomposes into the sum of the entropy of the (true) data distribution and the KL divergence of the model distribution from the data distribution:
$$H(p, q) = H(p) + D_{\mathrm{KL}}(p \,\|\, q)$$
The KL is what obeys a power law (with the irreducible loss $L_\infty$ playing the role of the entropy $H(p)$):
$$D_{\mathrm{KL}}(p \,\|\, q) = a \, x^{-b}$$
Unlike the cross entropy, which has a lower bound given by the entropy $H(p)$, the KL divergence spans the full range from $0$ to $\infty$. This makes it better to plot, especially in the next section where we turn everything into probs!
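As a tiny numerical check of this decomposition (made-up four-class distributions $p$ and $q$):

```python
import numpy as np

# Made-up true distribution p and model distribution q over four classes
p = np.array([0.5, 0.25, 0.15, 0.10])
q = np.array([0.4, 0.30, 0.20, 0.10])

cross_entropy = -(p * np.log(q)).sum()   # H(p, q): what the training loss estimates
entropy = -(p * np.log(p)).sum()         # H(p): the irreducible part
kl = (p * np.log(p / q)).sum()           # D_KL(p || q): the reducible part

print(cross_entropy, entropy + kl)       # the two agree
```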
Probs
A key criterion for us is interpretable axes.
The information-theoretic viewpoint shows us that it’s really the KL that follows the power law. But a raw KL number doesn’t mean very much at a glance. Is a KL of 2.9 “good”? 🤷
However, we do know what to make of our raw probabilities. They have a nice $[0, 1]$ bound, and we generally have a sense, for a given task, of how “good” a probability of 0.87 is. So can we view our neat power law in terms of probability values? Yes!
$$D_{\mathrm{KL}}(p \,\|\, q) \approx -\frac{1}{n}\sum_{i=1}^{n} \log \tilde{q}_i, \qquad \tilde{q}_i = \frac{q(y_i)}{p(y_i)}$$
where $\tilde{q}_i$ is the model probability $q(y_i)$ divided by the true data probability $p(y_i)$. These scaled probabilities can then be related to our power-law:
$$-\frac{1}{n}\sum_{i=1}^{n} \log \tilde{q}_i = a \, x^{-b}$$
We’ve now related our power-law to the mean negative log of (scaled) probabilities, rather than the raw KL, which is easier to work with.
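In code, the conversion might look as follows (a sketch with hypothetical per-example probabilities; in practice we rarely know the true $p(y_i)$, so this is purely illustrative):

```python
import numpy as np

# Hypothetical per-example probabilities for the observed targets:
# q_i from the model, p_i from the (usually unknown) true data distribution.
q_i = np.array([0.60, 0.35, 0.80, 0.50])
p_i = np.array([0.70, 0.40, 0.85, 0.65])

scaled = q_i / p_i                      # the scaled probabilities above
kl_estimate = -np.log(scaled).mean()    # mean negative log of scaled probs ~ KL
geo_mean = np.exp(-kl_estimate)         # geometric mean of the scaled probs, in (0, 1]

print(kl_estimate, geo_mean)
```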
Let’s plot!
The scaling laws paper has this figure:
Let’s just focus on the yellow line. We know it obeys a scaling law, and they plot params vs loss on a log scale.
However, we don’t quite get a straight line. To get one, we’d have to account for the entropy $H(p)$.
Plotting the geometric mean of the scaled probabilities, $\exp(-D_{\mathrm{KL}})$, accounts for this (via the subtraction of the entropy $H(p)$ from the loss), and also gives us a y-axis scale between 0 and 1:
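For a rough sense of what such a plot looks like, here is a sketch reusing the made-up $L(x) = L_\infty + a\,x^{-b}$ coefficients from earlier:

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up coefficients, reusing L(x) = L_inf + a * x**(-b) from above
L_inf, a, b = 1.7, 400.0, 0.3

x = np.logspace(6, 10, 50)
loss = L_inf + a * x ** -b
geo_mean_scaled_prob = np.exp(-(loss - L_inf))   # exp(-KL), lives in (0, 1]

plt.semilogx(x, geo_mean_scaled_prob)
plt.xlabel("model size $x$")
plt.ylabel("geometric mean of scaled probs")
plt.ylim(0, 1)
plt.show()
```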
Neat!
Update: ignoring the power-law stuff here, I think plotting the inverse perplexity gives what I want:
$$\mathrm{PPL}^{-1} = \exp\!\left(\frac{1}{n}\sum_{i=1}^{n} \log q(y_i)\right) = \left(\prod_{i=1}^{n} q(y_i)\right)^{1/n}$$
This denotes the geometric mean of the probs - a pretty good and interpretable (unlike actual perplexity!) measure of how well the model’s doing, that’s also aligned with the training objective! It starts at 0 (although in practice, random guessing gives 1/num_classes) and perfection is 1 (or in practice, a bit less because of irreducible loss).
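A minimal sketch of computing it (assuming `token_log_probs`, a hypothetical array holding the model’s log-probabilities of the observed tokens):

```python
import numpy as np

# Hypothetical log-probabilities the model assigns to the observed tokens
token_log_probs = np.array([-0.9, -2.1, -0.4, -1.3, -0.7])

mean_loss = -token_log_probs.mean()   # the usual mean negative log prob loss
inverse_ppl = np.exp(-mean_loss)      # geometric mean of the token probabilities

print(inverse_ppl)                    # in (0, 1]; higher is better
```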