MAUVE: Measuring the Gap Between Neural Text and Human Text using Divergence Frontiers


Enormous language models have led to breakthroughs and enabled new tasks, such as open-ended text generation.

Open-ended text-generation task

Open-ended text generation (OETG) has an evaluation problem: how do we judge output quality when many continuations are equally valid?
Current solution: human evaluation ➡️ expensive & unreliable
Two failure modes: a Type I error is when the model distribution Q puts mass on text that is implausible under the human distribution P (e.g. degenerate repetition); a Type II error is when Q fails to cover text that is plausible under P. Type I is the obvious error, but Type II is also very important.
KL(Q‖P) and KL(P‖Q) are natural metrics for the Type I and Type II errors respectively.
However, either can be infinite ⚠️ — KL(P‖Q) blows up as soon as Q assigns zero probability to anything P supports (and vice versa).
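A minimal numpy sketch of why this breaks (toy distributions, not from the paper):

```python
import numpy as np

def kl(p, q):
    """KL(P || Q) = sum_x P(x) * log(P(x)/Q(x)), summed over P's support."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    with np.errstate(divide="ignore"):
        return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# Toy distributions over 3 "texts"; each assigns 0 to something the other supports.
p = np.array([0.5, 0.5, 0.0])   # stand-in for the human distribution P
q = np.array([0.5, 0.0, 0.5])   # stand-in for the model distribution Q

print(kl(p, q))   # inf
print(kl(q, p))   # inf
```

A single zero-probability event is enough to make the divergence useless as a score.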

The MAUVE method

Solution: calculate each KL with the target being a weighted mixture R_λ = λ·P + (1−λ)·Q. Since R_λ covers the support of both P and Q, both KL terms stay finite; sweeping λ from 0 to 1 traces out the divergence curve.
Like a PR curve!
The area under this curve is MAUVE.
If MAUVE = 1 → P & Q are identical; 0 → completely dissimilar
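A toy numpy sketch of the curve and its area for two small discrete distributions (c is the paper's scaling constant; c = 1 and the (0, 1)/(1, 0) anchor points are simplifying choices here, not the official implementation):

```python
import numpy as np

def kl(p, q):
    """KL(P || Q) over the support of P (Q must be positive there)."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def mauve_score(p, q, c=1.0, n=99):
    """Toy MAUVE: area under the divergence curve of mixtures R = lam*P + (1-lam)*Q."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    xs, ys = [1.0], [0.0]                        # anchor the curve at (1, 0)
    for lam in np.linspace(0.001, 0.999, n):
        r = lam * p + (1.0 - lam) * q            # mixture covers both supports -> finite KL
        xs.append(np.exp(-c * kl(q, r)))         # small when Q puts mass where P doesn't
        ys.append(np.exp(-c * kl(p, r)))         # small when P puts mass where Q doesn't
    xs.append(0.0); ys.append(1.0)               # anchor the curve at (0, 1)
    xs, ys = np.array(xs), np.array(ys)
    # trapezoid rule along the curve (x runs from 1 down to 0, hence the abs)
    return float(abs(np.sum(np.diff(xs) * (ys[1:] + ys[:-1]) / 2.0)))

p = np.array([0.4, 0.3, 0.3])
print(mauve_score(p, p))                             # identical distributions -> 1.0
print(mauve_score(p, np.array([0.05, 0.05, 0.9])))   # dissimilar -> closer to 0
```

When P = Q the curve collapses to the corner (1, 1) of the unit square, so the area is 1; as the distributions drift apart the curve bows inward and the area shrinks toward 0.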

Computing MAUVE in practice

We require probabilities under P and Q for the documents we compare, which is generally intractable. The solution has two steps: (1) embed each document with a deep language model, then (2) quantise the embeddings into a discrete distribution.
The deep embedding gives a continuous probability distribution (in embedding space). However, that space is still too high-dimensional to estimate a density over. Quantisation (k-means clustering of the embeddings, with cluster frequencies as the histogram) reduces this, making it tractable.
Note that the absolute MAUVE score depends on the embedding model (e.g. RoBERTa or GPT-2), but comparisons between scores (i.e. the ranking of several generation models) tend to be similar across embeddings.
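A sketch of the embed-then-quantise step. The embeddings here are random stand-ins (in practice they would be, e.g., GPT-2 or RoBERTa features of each document), and the k-means is a simple from-scratch version rather than the authors' released implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for deep embeddings of human and model text:
# 500 documents each, 16-dim features; the model's distribution is slightly shifted.
human_emb = rng.normal(0.0, 1.0, size=(500, 16))
model_emb = rng.normal(0.3, 1.0, size=(500, 16))

def quantise(emb_a, emb_b, k=8, iters=20):
    """Joint k-means over both embedding sets; returns two histograms over clusters."""
    x = np.vstack([emb_a, emb_b])
    centroids = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(iters):
        # assign every point to its nearest centroid, then recompute centroids
        d = np.linalg.norm(x[:, None, :] - centroids[None, :, :], axis=-1)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = x[labels == j].mean(axis=0)
    # add-one smoothing so neither histogram has empty bins (keeps KL finite)
    p = np.bincount(labels[: len(emb_a)], minlength=k) + 1.0
    q = np.bincount(labels[len(emb_a):], minlength=k) + 1.0
    return p / p.sum(), q / q.sum()

p_hist, q_hist = quantise(human_emb, model_emb)
# These two discrete distributions are what the divergence-curve computation runs on.
```

The key design point: both corpora are clustered jointly, so P and Q are histograms over the *same* bins and can be compared directly.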


MAUVE correlates strongly with human judgements
MAUVE also correlates with quality improvements from other sources, e.g. increased model size.
That's it! Super simple. Why hasn't anyone thought of this before??


  • What is nucleus sampling?
  • How exactly do we get these distributions? How can both GPT2 and RoBERTa be used?