Evaluating Language Models

Cross entropy loss

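A language model is a probability distribution $q_\theta(x)$ over sequences of tokens $x$. We have a second, unknown "true" probability distribution $p(x)$, from which we sample our training data $x^{(1)}, \dots, x^{(N)}$.
The maximum-likelihood objective gives us:
$$\hat{\theta} = \arg\max_\theta \prod_{i=1}^{N} q_\theta(x^{(i)}) = \arg\min_\theta \left( -\sum_{i=1}^{N} \log q_\theta(x^{(i)}) \right)$$
Why is this a cross-entropy loss? Because, ignoring a constant multiple, it is the same as a Monte Carlo estimate of the cross entropy of the model with respect to the data distribution!
The cross entropy of the model with respect to the data distribution is as follows:
$$H(p, q_\theta) = -\sum_x p(x) \log q_\theta(x) = -\mathbb{E}_{x \sim p}\left[\log q_\theta(x)\right]$$
Sampling from $p$ gives us a dataset, allowing us to estimate:
$$H(p, q_\theta) \approx -\sum_x \hat{p}(x) \log q_\theta(x)$$
where $\hat{p}(x)$ is the empirical distribution of the sequences, equal to the fraction of times $x$ appears in the dataset. This is equivalent to:
$$-\frac{1}{N} \sum_{i=1}^{N} \log q_\theta(x^{(i)})$$
This is the negative log-likelihood divided by $N$, so minimising it is equivalent to maximising the likelihood. The maximum-likelihood objective is just a cross-entropy estimate!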
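The equivalence between the average negative log-likelihood and the cross entropy under the empirical distribution can be checked numerically. The following is a minimal sketch, assuming a toy "model" stored as a dictionary of probabilities; the names `average_nll`, `empirical_cross_entropy`, `q` and `dataset` are illustrative and not part of the original notes.

```python
import math
from collections import Counter

def average_nll(dataset, q):
    """Average negative log-likelihood of the dataset under model q."""
    return -sum(math.log(q[x]) for x in dataset) / len(dataset)

def empirical_cross_entropy(dataset, q):
    """Cross entropy H(p_hat, q), where p_hat(x) is the fraction of
    times x appears in the dataset."""
    counts = Counter(dataset)
    n = len(dataset)
    return -sum((c / n) * math.log(q[x]) for x, c in counts.items())

# Hypothetical toy model over three possible "sequences" and a toy sample.
q = {"a": 0.5, "b": 0.3, "c": 0.2}
dataset = ["a", "a", "b", "c", "a", "b"]

nll = average_nll(dataset, q)
ce = empirical_cross_entropy(dataset, q)
print(nll, ce)  # the two values agree up to floating-point error
```

Grouping the sum by distinct value of `x` is all that separates the two functions, which is why they agree.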
What does the cross-entropy mean? It can be thought of in the following way:
The cross entropy (with logarithms taken base 2) can be interpreted as the number of bits per message needed, on average, to encode events drawn from the true distribution $p$ when using an optimal code for the distribution $q$.


General Case

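The perplexity of a discrete probability distribution $p$ is defined as:
$$PP(p) = 2^{H(p)} = 2^{-\sum_x p(x) \log_2 p(x)}$$
⬆️ perplexity ⇔ ⬆️ entropy (perplexity increases monotonically with entropy)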
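Equal probs: $p = (0.5, 0.5)$, entropy $H(p) = 1$ bit, perplexity $PP(p) = 2$ (or $PP(p) = k$ with $k$ equally likely events)
Mixed probs: e.g. $p = (0.9, 0.1)$, $PP(p) \approx 1.38$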
Recall that the entropy is "a measure of the expected, or "average", number of bits required to encode the outcome of the random variable, using a theoretical optimal variable-length code" (wiki)
The following are some definitions of perplexity:
  1. A measure of a probability distribution's uncertainty
  2. A random variable with perplexity k has the same uncertainty as a fair k-sided die
  3. The number of values in the discrete uniform distribution with the same entropy as the given distribution
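The "fair k-sided die" reading of perplexity can be verified directly. This is a small sketch, assuming entropy is measured in bits (base-2 logarithms); the function names `entropy_bits` and `perplexity` are chosen here for illustration.

```python
import math

def entropy_bits(p):
    """Shannon entropy in bits of a discrete distribution given as a list."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def perplexity(p):
    """Perplexity of a distribution: 2 raised to its entropy in bits."""
    return 2 ** entropy_bits(p)

fair_coin = perplexity([0.5, 0.5])    # uniform over 2 outcomes
fair_die = perplexity([1 / 6] * 6)    # uniform over 6 outcomes
biased_coin = perplexity([0.9, 0.1])  # less uncertain than a fair coin

print(fair_coin, fair_die, biased_coin)
```

The uniform cases come out at the number of outcomes (2 and 6), while the biased coin lands between 1 and 2: it is less uncertain than a fair coin but not deterministic.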

Relative to a sample

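In this case we replace the entropy $H(p)$ with the cross entropy $H(\hat{p}, q)$, where $\hat{p}$ is the empirical distribution of the sample $x^{(1)}, \dots, x^{(N)}$ and $q$ is the model distribution.
This simply makes the perplexity:
$$PP = 2^{H(\hat{p}, q)} = 2^{-\sum_x \hat{p}(x) \log_2 q(x)}$$
If we follow the maths out, we get the following:
$$PP = 2^{-\frac{1}{N} \sum_{i=1}^{N} \log_2 q(x^{(i)})} = \prod_{i=1}^{N} q(x^{(i)})^{-1/N} = \left( \prod_{i=1}^{N} \frac{1}{q(x^{(i)})} \right)^{1/N}$$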
Fantastic! To summarise, we can interpret the perplexity for a distribution relative to a sample as either:
  1. The exponential (base 2) of the cross entropy of the model distribution wrt. the data distribution
  2. The geometric mean of the inverse probabilities
How do we interpret this now? Simple:
The number of values in the discrete uniform distribution with the same cross entropy (wrt. the data distribution) as the given distribution
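The two readings of sample perplexity summarised above (exponentiated cross entropy and geometric mean of inverse probabilities) can be compared numerically. A minimal sketch, assuming we already have the probabilities the model assigns to each sample; `model_probs` is made-up data and both function names are illustrative.

```python
import math

def perplexity_from_cross_entropy(probs):
    """2 ** (average negative log2-probability of the samples)."""
    cross_entropy = -sum(math.log2(q) for q in probs) / len(probs)
    return 2 ** cross_entropy

def perplexity_geometric_mean(probs):
    """Geometric mean of the inverse probabilities of the samples."""
    return math.prod(1 / q for q in probs) ** (1 / len(probs))

# Hypothetical probabilities a model assigns to four observed samples.
model_probs = [0.5, 0.25, 0.125, 0.125]

pp_ce = perplexity_from_cross_entropy(model_probs)
pp_gm = perplexity_geometric_mean(model_probs)
print(pp_ce, pp_gm)
```

For this data the average negative log2-probability is (1 + 2 + 3 + 3) / 4 = 2.25 bits, so both functions give 2^2.25.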

For sequences

In language models we tend to calculate our loss over whole sequences, but in this case the perplexity becomes enormous: the probability of a sequence is a product of many per-token probabilities, so its inverse grows exponentially with the sequence length.
The solution is to evaluate the perplexity per token. This is a little odd, as our prediction for each token takes the rest of the sequence into account, so the per-token perplexity can actually be quite low!
I think the interpretation of what perplexity means here is the same as in the above section, except that our model and data distributions are conditional on the rest of the sequence.
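To make the difference in scale concrete, here is a small sketch comparing sequence-level and per-token perplexity. It assumes we are given the conditional probability the model assigned to each token of a single sequence; the values in `token_probs` are invented for illustration.

```python
import math

def sequence_perplexity(token_probs):
    """Inverse probability of the whole sequence, treating the sequence
    as a single sample."""
    return 1 / math.prod(token_probs)

def per_token_perplexity(token_probs):
    """2 ** (average negative log2-probability per token)."""
    avg_bits = -sum(math.log2(q) for q in token_probs) / len(token_probs)
    return 2 ** avg_bits

# Hypothetical: a 20-token sequence where each token gets probability 0.25.
token_probs = [0.25] * 20

seq_pp = sequence_perplexity(token_probs)
tok_pp = per_token_perplexity(token_probs)
print(seq_pp, tok_pp)
```

Here the sequence-level value is 4^20 (about 1.1e12), whereas the per-token value is 4 and does not depend on the sequence length.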

WordPiece Model

Trained model breaks words into wordpieces
Special boundary symbols added so that original sequences can be recovered unambiguously
Training optimisation problem:
  • Given a training corpus and a number of desired tokens $D$
  • select $D$ wordpieces
  • such that the resulting corpus is minimal in the number of wordpieces when segmented according to the chosen wordpiece model
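The quantity being minimised can be illustrated by segmenting a tiny corpus with two candidate vocabularies of the same size and counting the resulting wordpieces. This is only a sketch of evaluating the objective, not a training procedure. It assumes a greedy longest-match-first segmentation and an underscore `_` as the word-boundary symbol; both are illustrative choices that the notes do not specify, and all names here are hypothetical.

```python
def segment(word, vocab):
    """Greedy longest-match-first split of one word into wordpieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate piece until it is found in the vocabulary.
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            raise ValueError(f"cannot segment {word!r} with this vocabulary")
        pieces.append(word[start:end])
        start = end
    return pieces

def tokenize(sentence, vocab):
    """Mark each word start with '_' (assumed boundary symbol), then segment."""
    return [p for w in sentence.split() for p in segment("_" + w, vocab)]

def detokenize(pieces):
    """Recover the original sentence from the wordpieces."""
    return "".join(pieces).replace("_", " ").strip()

corpus = ["low lower lowest"]
chars = set("_lowerst")      # single characters guarantee a segmentation
vocab_a = chars | {"_low"}   # two candidate vocabularies of equal size
vocab_b = chars | {"_lo"}

size_a = sum(len(tokenize(s, vocab_a)) for s in corpus)
size_b = sum(len(tokenize(s, vocab_b)) for s in corpus)
print(size_a, size_b)
```

With this toy corpus `vocab_a` yields fewer wordpieces than `vocab_b`, so it would be preferred under the objective, and `detokenize` shows how the boundary symbol lets the original text be recovered.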


BLEU (bilingual evaluation understudy) is an algorithm for evaluating the quality of text which has been machine-translated from one natural language to another. (wiki)
Output is between 0 and 1 (some papers multiply by 100 and report it between 0 and 100). A score of 1 means the candidate is identical to one of the reference translations. Given the ambiguity of language, even human translations score below 1, so 1 should not be the target.


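Given a candidate translation and a number of reference translations:
For each distinct n-gram $w$ in the candidate translation:
  1. Take its maximum count in any single reference translation, $m_{\max}$
  2. Take its count in the candidate translation, $m_w$, and clip it by $m_{\max}$ to get $\min(m_w, m_{\max})$
  3. Add $\min(m_w, m_{\max})$ to a cumulative sum
Divide by the number of n-grams in the candidate translation:
$$p_n = \frac{\sum_{w} \min(m_w, m_{\max})}{\sum_{w} m_w}$$
where both sums run over the distinct n-grams $w$ of the candidate and $m_{\max}$ is computed separately for each $w$.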
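The clipping steps above can be sketched in a few lines. This is a minimal illustration of the clipped n-gram precision only, not a full BLEU implementation (it omits the brevity penalty and the combination across n-gram orders). It assumes whitespace tokenisation, and the function names are illustrative.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, references, n):
    """Clipped n-gram precision of a candidate against several references."""
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0
    ref_counts = [Counter(ngrams(ref, n)) for ref in references]
    clipped_total = 0
    for gram, count in cand_counts.items():
        # Maximum count of this n-gram in any single reference.
        max_ref = max(rc[gram] for rc in ref_counts)
        clipped_total += min(count, max_ref)
    return clipped_total / sum(cand_counts.values())

candidate = "the the the the the the the".split()
references = [
    "the cat is on the mat".split(),
    "there is a cat on the mat".split(),
]

p1 = modified_precision(candidate, references, 1)
p2 = modified_precision(candidate, references, 2)
print(p1, p2)
```

In this example "the" appears 7 times in the candidate but at most twice in any one reference, so the unigram precision is 2/7 rather than 7/7. Without clipping, a degenerate candidate like this would score perfectly.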