BIG-Bench

Abstract

204 tasks (plus subset for fast evaluation)

Diverse topics

Focus on tasks beyond the capabilities of current language models

Test several dense and MoE models, as well as humans

General findings:

Scale helps
Humans generally better than AIs
Performance similar across model classes
MoE helps
“emergent” tasks tend to involve multiple steps / components; knowledge-based tasks relying on memorisation grow more slowly

PaLM (>0) is fantastic on BIG-bench Lite.

Introduction

Limitations of current benchmarks

Existing benchmarks have limited scope, often on areas already well-handled:

Language understanding
Summarisation
Trivia QA

Benchmarks often have short useful lifespans, maybe a few years (e.g. SuperGLUE 18 months until solved)

Follow pattern of challenge-solve-replace
Likely due to restricted scope

Labelling not done by experts or authors

Tasks made unnecessarily easy, so they can be performed by e.g. Mechanical Turk

What’s in BIG-Bench?

Review criteria

Highlights:

Formatting: Tasks should be formatted in a way that is easy for humans to read and interpret

Specificity: Tasks should aim to cleanly capture some specific capability of language models

Difficulty: Tasks must not be fully solvable by existing language models. Tasks that include varying difficulty [and/or] are completely beyond the capabilities of current language models are also encouraged

Not solvable by memorizing the Internet: Task authors should be wary of tasks where target input/output pairs may appear verbatim in model training data

Compute resources: Reviewers will consider whether the computational resources used by a task are large enough to make running the task practically challenging

Size: Tasks should include a minimum of 32 input-output pairs of examples

API

2 types of task:

JSON

compare inputs & outputs/logits

uses standard metrics

80% of tasks

Programmatic

written in python

multiple interactions with model

custom metrics

interface requires methods for a) generation, b) log probs
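A minimal sketch of what a two-method model interface could look like, and how a programmatic task might call it several times. The signatures below and the toy model are illustrative assumptions, not the benchmark's confirmed API; check the repo for the real interface.

```python
import math
from typing import List, Protocol


class LanguageModel(Protocol):
    """Assumed interface: one method per capability named in the notes."""

    def generate_text(self, prompt: str, max_length: int = 32) -> str:
        ...

    def cond_log_prob(self, prompt: str, targets: List[str]) -> List[float]:
        ...


class UniformToyModel:
    """Hypothetical stand-in with no real knowledge, used only to exercise the interface."""

    def generate_text(self, prompt: str, max_length: int = 32) -> str:
        # Return the last word of the prompt, truncated to max_length characters.
        words = prompt.split()
        return words[-1][:max_length] if words else ""

    def cond_log_prob(self, prompt: str, targets: List[str]) -> List[float]:
        # Give every candidate target the same normalised log probability.
        return [math.log(1.0 / len(targets)) for _ in targets]


def run_programmatic_task(model: LanguageModel) -> dict:
    """Query the model more than once and compute custom metrics from the replies."""
    generated = model.generate_text("Repeat this word: hello")
    log_probs = model.cond_log_prob("2 + 2 =", ["4", "5"])
    return {
        "echo_correct": float(generated == "hello"),
        "p_correct_answer": math.exp(log_probs[0]),
    }
```

Any object with these two methods can be passed to `run_programmatic_task`, which is the point of keeping the interface this small.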

All of Lite is JSON
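A rough sketch of a JSON-style task and a multiple-choice score computed from per-choice log probs. The field names, the toy task, and `toy_score_fn` are assumptions for illustration; the real schema lives in the repo.

```python
import json
from typing import Callable, Dict, List

# Hypothetical two-example task; field names are assumed, not the confirmed schema.
TASK_JSON = """
{
  "name": "toy_arithmetic",
  "preferred_score": "multiple_choice_grade",
  "examples": [
    {"input": "1 + 1 =", "target_scores": {"2": 1, "3": 0}},
    {"input": "2 + 3 =", "target_scores": {"4": 0, "5": 1}}
  ]
}
"""

ScoreFn = Callable[[str, List[str]], List[float]]


def multiple_choice_grade(task: Dict, score_fn: ScoreFn) -> float:
    """Pick the highest-scoring choice per example and average the credit earned."""
    total = 0.0
    for example in task["examples"]:
        choices = list(example["target_scores"])
        log_probs = score_fn(example["input"], choices)
        best = max(range(len(choices)), key=lambda i: log_probs[i])
        total += example["target_scores"][choices[best]]
    return total / len(task["examples"])


def toy_score_fn(prompt: str, choices: List[str]) -> List[float]:
    """Hypothetical scorer that does the addition itself, standing in for a model."""
    left, _, right = prompt.rstrip(" =").partition(" + ")
    answer = str(int(left) + int(right))
    return [0.0 if choice == answer else -5.0 for choice in choices]
```

Because the task is pure data, the same scoring loop works for every JSON task; only the scoring function (the model) changes.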

Each task defines preferred metric, with high and low scores, so we can normalise
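A small sketch of the normalisation this enables, assuming a linear rescale where the task's low score maps to 0 and its high score maps to 100. The function name and the 0 to 100 range are assumptions for illustration.

```python
def normalise_score(raw: float, low: float, high: float) -> float:
    """Linearly rescale a raw metric so that low -> 0 and high -> 100."""
    if high == low:
        raise ValueError("high and low scores must differ")
    return 100.0 * (raw - low) / (high - low)
```

For a 4-way multiple-choice task with low = 0.25 (random guessing) and high = 1.0, a raw accuracy of 0.625 normalises to 50, and anything below chance goes negative. This is what makes scores comparable across tasks with different metrics and chance levels.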

Special task for detecting data leakage

Evaluation

Models: (all sampling with 0 temperature)

BIG-G:

LaMDA style model

trained by Google

2.8T tokens (SentencePiece)

BIG-G sparse:

Based on ST-MoE

32 experts, top-2 routing, capacity factor (cf) = 1.25, sparse layer every 4th layer

Same setup as dense

Largest model = 46B (8.9B FLOP-equivalent dense)
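To make the routing settings concrete, here is a toy sketch of top-2 routing with a capacity limit. It is not the ST-MoE implementation: the capacity formula (cf x top-k x tokens / experts) and the drop-on-overflow behaviour are assumptions, and real routers operate on batched tensors.

```python
import math
from typing import List, Tuple


def expert_capacity(num_tokens: int, num_experts: int,
                    capacity_factor: float, top_k: int) -> int:
    """Assumed formula: an even share of the token-to-expert assignments, scaled by cf."""
    return math.ceil(capacity_factor * top_k * num_tokens / num_experts)


def top2_route(router_logits: List[List[float]],
               capacity: int) -> List[List[Tuple[int, float]]]:
    """Send each token to its two highest-probability experts, dropping overflow."""
    num_experts = len(router_logits[0])
    load = [0] * num_experts
    assignments = []
    for logits in router_logits:
        # Softmax over experts (shifted by the max for numerical stability).
        peak = max(logits)
        exps = [math.exp(x - peak) for x in logits]
        total = sum(exps)
        probs = [e / total for e in exps]
        top2 = sorted(range(num_experts), key=lambda i: probs[i], reverse=True)[:2]
        gate_sum = sum(probs[i] for i in top2)
        routed = []
        for expert in top2:
            # An expert that is already full skips the token (assumed behaviour).
            if load[expert] < capacity:
                load[expert] += 1
                routed.append((expert, probs[expert] / gate_sum))
        assignments.append(routed)
    return assignments
```

With 32 experts, top-2 and cf = 1.25, a batch of 64 tokens gives each expert room for 5 tokens under this assumed formula. Only two experts run per token, which is why a 46B-parameter sparse model can cost about as much compute as a much smaller dense one.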

GPT

(PaLM)

Human evaluators:

Raters = selected experts

Allowed to use all available resources, including web search

Report mean and max

Behaviour of models & humans on BIG-Bench

Model family

Cross entropy and scaling behavior are extremely similar across model families. This similarity is particularly striking, as it persists despite differences in evaluation dataset, training dataset, training hyperparameters, and model architecture.

Calibration

Sparsity

2x decrease in FLOPs for same model performance

dramatic improvements in calibration (10x!)

They achieve about a tenfold improvement in the FLOP-matched parameter count needed to reach a given calibration score.
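These notes do not record how the calibration score is computed. As an assumption, the sketch below uses expected calibration error (ECE), a common choice: bin predictions by confidence and average the gap between confidence and accuracy in each bin.

```python
from typing import List, Tuple


def expected_calibration_error(predictions: List[Tuple[float, bool]],
                               num_bins: int = 10) -> float:
    """predictions holds (confidence in chosen answer, whether it was correct)."""
    bins: List[List[Tuple[float, bool]]] = [[] for _ in range(num_bins)]
    for confidence, correct in predictions:
        # Confidence 1.0 falls into the last bin rather than overflowing.
        index = min(int(confidence * num_bins), num_bins - 1)
        bins[index].append((confidence, correct))

    total = len(predictions)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_confidence = sum(conf for conf, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        # Weight each bin's confidence/accuracy gap by its share of the predictions.
        ece += len(bucket) / total * abs(mean_confidence - accuracy)
    return ece
```

A model that says 75% and is right 75% of the time scores 0; a model that says 95% but is right half the time scores about 0.45. Lower is better.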

Linearity & Breakthroughness

Still appears to me that “breakthroughness” is much more a property of the metric than anything to do with learning / the model.

Breakthroughness is much less extreme in the underlying log probs, even for the most breakthroughy tasks. A slight exception is periodic_elements, where the flatlining is due to the model getting to the point of “randomness”, and the subsequent breakthrough indicates a real jump. I suspect this could be like the idiom task breakthrough, where a certain scale is needed to absorb a few key nuggets of info that make the task possible.

They actually go on to support this hypothesis:

if the metric is broken down bit-by-bit the effect is much less dramatic (e.g. multiple-choice metric)

Also not using all-or-nothing metrics (i.e. use BLEU, ROUGE etc rather than exact_str_match)

However, if the downstream performance is what we really care about, in some cases the “smooth version” of the task/metric teaches us nothing useful about actual task performance

We’re tempted to think about this in terms of “Aha!” moments in humans - but this is a function of time, not of scale
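A toy illustration of the metric point above: the same sequence of steadily improving outputs looks like a breakthrough under an all-or-nothing metric and like smooth progress under a partial-credit one. The outputs are invented for illustration, and `token_overlap` is a simple stand-in for partial-credit metrics, not BLEU or ROUGE.

```python
def exact_str_match(prediction: str, target: str) -> float:
    """All-or-nothing: full credit only for an identical string."""
    return float(prediction == target)


def token_overlap(prediction: str, target: str) -> float:
    """Partial credit: fraction of positions where the tokens agree."""
    pred_tokens = prediction.split()
    target_tokens = target.split()
    if not pred_tokens and not target_tokens:
        return 1.0
    matches = sum(p == t for p, t in zip(pred_tokens, target_tokens))
    return matches / max(len(pred_tokens), len(target_tokens))


TARGET = "the cat sat on the mat"

# Hypothetical outputs from four models of increasing scale (made up for this sketch).
OUTPUTS_BY_SCALE = [
    "a dog",
    "the cat",
    "the cat sat on",
    "the cat sat on the mat",
]

exact_scores = [exact_str_match(out, TARGET) for out in OUTPUTS_BY_SCALE]
partial_scores = [token_overlap(out, TARGET) for out in OUTPUTS_BY_SCALE]
```

Here `exact_scores` stays at 0 until the final model and then jumps to 1, while `partial_scores` rises at every step.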

One possible explanation for the breakthrough phenomenon on multistep tasks is that the probability of success on the task scales like the product of the success probabilities on each step. If the probabilities of each of k steps increase linearly, their product will increase like a kth-order polynomial, which will be nearly flat until a sudden increase.
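A quick numerical check of this argument, assuming the k steps are independent and share the same per-step success probability p (both simplifying assumptions): task success is then p^k.

```python
from typing import List, Tuple


def multistep_success(per_step_probability: float, num_steps: int) -> float:
    """Probability that all steps succeed, assuming independent, equally hard steps."""
    return per_step_probability ** num_steps


def success_curve(num_steps: int, points: int = 11) -> List[Tuple[float, float]]:
    """Sweep per-step probability linearly from 0 to 1 and record task success."""
    curve = []
    for i in range(points):
        p = i / (points - 1)
        curve.append((p, multistep_success(p, num_steps)))
    return curve
```

With k = 5, a per-step probability of 0.5 gives task success of only about 0.03, while 0.9 gives about 0.59. The per-step skill improves linearly, yet most of the visible gain in task success arrives in the last stretch.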

Subjectivity:

The observations [above] lead to the conclusion that the capabilities and trajectories across scale of language models are much more subjective than we would like to believe, even when quantified through specific tasks. Different choices in task design can make the same capability appear to stagnate, gradually improve, or suddenly break through.

Brittleness

Some sensible-looking improvements in the prompt can cause performance to deteriorate (!)

Conclusion → if task doesn’t look like anything in the training data there can be issues.

Social Bias

High-level conclusions:

Bias often increases with scale in settings with broad or ambiguous context.

Bias can decrease with scale in settings with narrow, unambiguous context.

Bias can potentially be steered through appropriately chosen prompting.

Behaviour on selected tasks

Checkmate-in-one

This task is trivial for humans who know the rules of chess and are able to use a chess board as an external memory aid to track piece positions … None of the BIG-G models tested can solve this task.

They are getting better at some aspect of chess though (finding legal moves), so it’s not as though they can’t learn at all in this setting. But this is too hard for them.

(not from paper, see github repo)

Some interesting selected tasks from Lite

Program State Analysis/Automatic Debugging

Debug without running code

Most models essentially random

PaLM better than average human and nearing best (!)

Similar results seen for Code Description (non-PaLM models a bit more competitive here)

Conceptual Combinations

Task = “identify properties of two-concept combinations that are not typically a property of the constituent concepts”
E.g. Sliced apples are: a) cooked in a pie, b) v sharp, c) bleed, d) dysfunctional
Most models not much better than random, but PaLM nearing average rater

Conlang Translation

Linguistics Olympiad-difficulty translation problems
Perf here much more linear, and even across models
Performance human-competitive. PaLM best and beats best human.

Emoji Movie

PaLM again smashes this vs other models

Formal Fallacies and Syllogisms with Negation

Distinguish valid arguments from logical fallacies (focus on negation)
All models are barely above random, with the best still below average humans and far below the best humans

Language Identification

Again, suddenly the largest PaLM model is massively better here

Known Unknowns

Task = answer unknown when that’s the correct answer
All models struggle here: scale helps v little and average human still way better

Logic Grid Puzzles

Basically constraint satisfaction problems
PaLM at scale notably better than others, and just beats average human (40%)
But still huge gap here to the best humans (100%!)

Operators

“defines (arithmetic) operators using natural language and applies them to operands, expecting the model to give results without being exposed to examples.”
Improves smoothly with size
PaLM better than average human, still a reasonable gap to best

Shakespeare Dialogue

“determine whether two nearby "lines" from dialogue in a Shakespearean play were spoken by the same or different characters”
Very difficult task - average humans barely above random
Best are 100% - memorisation?
Models are all barely better than random, with no clear benefit from massive scale - in fact PaLM seems to get worse?

Strange Stories

Measure model’s “emotional intelligence” (Theory of Mind) - can you infer mental states of others
Children begin to do this from age 4
All models get good at this when they get large enough
PaLM considerably best, getting over average human (which isn’t far from best)

StrategyQA

“Open-domain questions where the required reasoning steps are implicit in the question and should be inferred using a strategy”
Known as a key limitation
Massive scale appears to help a lot here
Still a long way from the best humans

Symbol Interpretation Task

“Scene” is described purely in symbols (definition given) - task = choose sentence that best describes scene
Variety of abilities required:

Separate text from logic tokens
Language understanding
Perception
Reasoning

All models are terrible, scale doesn’t help. Essentially random.
Not clear to me why models so bad, many abilities required so this is particularly tricky - definitely one to keep an eye on!

WinoWhy

Winograd (WSC) tasks involve identifying “pronoun coreference choices”
This task gives a list of explanations of the target WSC answer, and the task = pick the correct one
V hard tasks, average rater not much better than random; though best does far better
(some) models clearly better than average rater and random
However, not clear that scale is helping here