- 204 tasks (plus subset for fast evaluation)
- Diverse topics
- Focus on tasks beyond the capabilities of current language models
- Test several dense and MoE models, as well as humans
- General findings:
- Scale helps
- Humans generally better than AIs
- Performance similar across model classes
- MoE helps
- “emergent” tasks tend to involve multiple steps / components; knowledge-based tasks relying on memorisation grow more slowly
PaLM (>0) is fantastic on BIG-bench Lite.
- Existing benchmarks have limited scope, often on areas already well-handled:
- Language understanding
- Trivial QA
- Benchmarks often have short useful lifespans, maybe a few years (e.g. SuperGLUE 18 months until solved)
- Follow pattern of challenge-solve-replace
- Likely due to restricted scope
- Labelling not done by experts or authors
- Tasks made unnecessarily easy, so they can be performed by e.g. Mechanical Turk
Formatting: Tasks should be formatted in a way that is easy for humans to read and interpret
Specificity: Tasks should aim to cleanly capture some specific capability of language models
Difficulty: Tasks must not be fully solvable by existing language models. Tasks that include varying difficulty, and/or are completely beyond the capabilities of current language models, are also encouraged
Not solvable by memorizing the Internet: Task authors should be wary of tasks where target input/output pairs may appear verbatim in model training data
Compute resources: Reviewers will consider whether the computational resources used by a task are large enough to make running the task practically challenging
Size: Tasks should include a minimum of 32 input-output pairs of examples
2 types of task:
- JSON tasks
    - compare model outputs/logits against target outputs
    - use standard metrics
    - ~80% of tasks
- programmatic tasks
    - written in Python
    - multiple interactions with the model
    - custom metrics
- in both cases the model interface requires methods for a) text generation, b) conditional log probs
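The generation + log-probs interface maps naturally onto a small abstract class. A minimal sketch (the method names and signatures here are my own shorthand, not necessarily the repo's exact API):

```python
from abc import ABC, abstractmethod
from typing import List

class Model(ABC):
    """Minimal interface a model must expose to run BIG-bench-style
    tasks: free-text generation plus conditional log-probabilities."""

    @abstractmethod
    def generate_text(self, inputs: str, max_length: int = 64) -> str:
        """Return a continuation of `inputs` (used by generative metrics)."""

    @abstractmethod
    def cond_log_prob(self, inputs: str, targets: List[str]) -> List[float]:
        """Return log P(target | inputs) for each candidate target
        (used by multiple-choice / logit-comparison metrics)."""

class EchoModel(Model):
    # Trivial stand-in implementation, just to show the shape.
    def generate_text(self, inputs: str, max_length: int = 64) -> str:
        return inputs[:max_length]

    def cond_log_prob(self, inputs: str, targets: List[str]) -> List[float]:
        # Pretend longer targets are less likely.
        return [-float(len(t)) for t in targets]

m = EchoModel()
print(m.cond_log_prob("Q: pick one", ["a", "bb"]))  # → [-1.0, -2.0]
```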
All of Lite is JSON
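For a concrete picture, a JSON task is essentially a list of input/target pairs plus metadata. A toy sketch (field and task names are illustrative assumptions, and a real task would need at least 32 examples):

```python
import json

# Hypothetical minimal BIG-bench-style JSON task. Only two example
# pairs are shown for brevity; real tasks require 32 or more.
task = {
    "name": "toy_addition",  # hypothetical task name
    "description": "Add two small numbers.",
    "preferred_score": "exact_str_match",
    "metrics": ["exact_str_match"],
    "examples": [
        {"input": "What is 2 + 3?", "target": "5"},
        {"input": "What is 7 + 1?", "target": "8"},
    ],
}

serialized = json.dumps(task, indent=2)
print(serialized)
```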
Each task defines preferred metric, with high and low scores, so we can normalise
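The normalisation itself is just a linear rescaling. A sketch, assuming (as the paper does) that the declared low score maps to 0 and the high score to 100:

```python
def normalize_score(raw: float, low: float, high: float) -> float:
    """Linearly rescale a task's preferred metric so that `low`
    (e.g. random-chance performance) maps to 0 and `high`
    (e.g. perfect performance) maps to 100."""
    return 100.0 * (raw - low) / (high - low)

# e.g. a 4-way multiple-choice task: chance accuracy 0.25, max 1.0
print(normalize_score(0.25, low=0.25, high=1.0))   # → 0.0
print(normalize_score(1.0, low=0.25, high=1.0))    # → 100.0
print(normalize_score(0.625, low=0.25, high=1.0))  # → 50.0
```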
Special task for detecting data leakage
Models: (all sampling with temperature 0)
- BIG-G: LaMDA-style dense models
    - trained by Google
    - 2.8T tokens (SentencePiece)
- BIG-G sparse: based on ST-MoE
    - 32 experts, top-2 routing, capacity factor 1.25, MoE in every fourth layer
    - otherwise same setup as the dense models
    - largest model = 46B params (FLOP-equivalent to an 8.9B dense model)
- Raters = selected experts
- Allowed to use all available resources, including web search
- Report mean and max
Cross entropy and scaling behavior are extremely similar across model families. This similarity is particularly striking, as it persists despite differences in evaluation dataset, training dataset, training hyperparameters, and model architecture.
- 2x decrease in FLOPs for same model performance
- dramatic improvements in calibration (10x!)
They achieve about a tenfold improvement in the FLOP-matched parameter count needed to reach a given calibration score.
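For "calibration score" I'm reading this as something like expected calibration error (ECE) — the paper's exact metric may differ. A toy sketch of ECE: bucket predictions by confidence and compare average confidence to empirical accuracy in each bucket:

```python
from typing import List

def expected_calibration_error(confidences: List[float],
                               correct: List[bool],
                               n_bins: int = 10) -> float:
    """ECE: weighted average, over confidence bins, of
    |mean confidence - empirical accuracy|."""
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        # Half-open bins (lo, hi], with 0.0 assigned to the first bin.
        idx = [i for i, c in enumerate(confidences)
               if lo < c <= hi or (b == 0 and c == 0.0)]
        if not idx:
            continue
        avg_conf = sum(confidences[i] for i in idx) / len(idx)
        acc = sum(correct[i] for i in idx) / len(idx)
        ece += (len(idx) / n) * abs(avg_conf - acc)
    return ece

# Perfectly calibrated toy example: 80% confidence, 80% correct.
conf = [0.8] * 10
corr = [True] * 8 + [False] * 2
print(expected_calibration_error(conf, corr))  # → 0.0
```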
They go on to support the hypothesis that apparent breakthroughs are partly an artifact of metric choice:
- if the metric is broken down bit-by-bit the effect is much less dramatic (e.g. multiple-choice metric)
- Also by not using all-or-nothing metrics (i.e. using BLEU, ROUGE etc. rather than exact_str_match)
- However, if the downstream performance is what we really care about, in some cases the “smooth version” of the task/metric teaches us nothing useful about actual task performance
- We’re tempted to think about this in terms of “Aha!” moments in humans - but this is a function of time, not of scale
One possible explanation for the breakthrough phenomenon on multistep tasks is that the probability of success on the task scales like the product of the success probabilities on each step. If the probabilities of each of k steps increase linearly, their product will increase like a kth-order polynomial, which will be nearly flat until a sudden increase.
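This multiplicative picture is easy to check numerically: even when the per-step success probability p rises linearly, the overall success p^k stays near zero for a long time and then shoots up.

```python
# Toy illustration of the multistep-breakthrough explanation: if each
# of k steps succeeds independently with probability p, overall task
# success is p**k, which looks flat and then "emergent" as p grows.
k = 10
success = {p: p ** k for p in (0.2, 0.4, 0.6, 0.8, 0.9, 1.0)}
for p, s in success.items():
    print(f"per-step p = {p:.1f}  ->  task success = {s:.4f}")
```

With k = 10, task success is still below 0.01 at p = 0.6, then roughly triples between p = 0.8 and p = 0.9.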
The observations [above] lead to the conclusion that the capabilities and trajectories across scale of language models are much more subjective than we would like to believe, even when quantified through specific tasks. Different choices in task design can make the same capability appear to stagnate, gradually improve, or suddenly break through.
Some sensible-looking improvements in the prompt can cause performance to deteriorate (!)
Conclusion → if task doesn’t look like anything in the training data there can be issues.
- Bias often increases with scale in settings with broad or ambiguous context.
- Bias can decrease with scale in settings with narrow, unambiguous context.
- Bias can potentially be steered through appropriately chosen prompting.
This task is trivial for humans who know the rules of chess and are able to use a chess board as an external memory aid to track piece positions … None of the BIG-G models tested can solve this task.
They are getting better at some aspect of chess though (finding legal moves), so it’s not as though they can’t learn at all in this setting. But this is too hard for them.
(not from paper, see github repo)
- Program State Analysis/Automatic Debugging
- Debug without running code
- Most models essentially random
- PaLM better than average human and nearing best (!)
- Similar results seen for Code Description (non-PaLM models a bit more competitive here)
- Conceptual Combinations
- Task = “identify properties of two-concept combinations that are not typically a property of the constituent concepts”
- E.g. Sliced apples are: a) cooked in a pie, b) v sharp, c) bleed, d) dysfunctional
- Most models not much better than random, but PaLM nearing average rater
- Conlang Translation
- Linguistics Olympiad-difficulty translation problems
- Perf here much more linear, and even across models
- Performance human-competitive. PaLM best and beats best human.
- Emoji Movie
- PaLM again smashes this vs other models
- Formal Fallacies and Syllogisms with Negation
- Distinguish valid arguments from logical fallacies (focus on negation)
- All models are barely above random, with the best still below average humans and far below the best humans
- Language Identification
- Again, suddenly the largest PaLM model is massively better here
- Known Unknowns
- Task = answer unknown when that’s the correct answer
- All models struggle here: scale helps v little and average human still way better
- Logic Grid Puzzles
- Basically constraint satisfaction problems
- PaLM at scale notably better than others, and just beats average human (40%)
- But still huge gap here to the best humans (100%!)
- “defines (arithmetic) operators using natural language and applies them to operands, expecting the model to give results without being exposed to examples.”
- Improves smoothly with size
- PaLM better than average human, still a reasonable gap to best
- Shakespeare Dialogue
- “determine whether two nearby "lines" from dialogue in a Shakespearean play were spoken by the same or different characters”
- Very difficult task - average humans barely above random
- Best are 100% - memorisation?
- Models are all barely better than random, with no clear benefit from massive scale - in fact PaLM seems to get worse?
- Strange Stories
- Measure model’s “emotional intelligence” (Theory of Mind) - can you infer mental states of others
- Children begin to do this from age 4
- All models get good at this when they get large enough
- PaLM considerably best, getting over average (which isn’t far from best)
- “Open-domain questions where the required reasoning steps are implicit in the question and should be inferred using a strategy”
- Known as a key limitation
- Massive scale appears to help a lot here
- Still a long way from the best humans
- Symbol Interpretation Task
- “Scene” is described purely in symbols (definition given) - task = choose sentence that best describes scene
- Variety of abilities required:
- Separate text from logic tokens
- Language understanding
- All models are terrible, scale doesn’t help. Essentially random.
- Not clear to me why models so bad, many abilities required so this is particularly tricky - definitely one to keep an eye on!
- Winograd (WSC) tasks involve identifying “pronoun coreference choices”
- This task gives a list of explanations of the target WSC answer, and the task = pick the correct one
- V hard tasks; average rater not much better than random, though best does far better
- (some) models clearly better than average and random
- However, not clear that scale is helping here