- We have accurate scaling curves for cross-entropy (CE) loss (over > 7 orders of magnitude!)
- But not for performance on downstream tasks
These emergent abilities can’t be predicted by extrapolating from smaller-scale curves.
Aim of paper = show examples of this across the literature (not investigate when/why)
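The contrast above can be made concrete with a toy sketch (my own construction, with made-up illustrative constants): CE loss that follows a smooth power law in compute fits a straight line in log-log space and extrapolates cleanly, which is exactly what downstream task performance fails to do.

```python
# Hypothetical sketch: CE loss follows a smooth power law in compute,
# so it extrapolates well; downstream accuracy does not behave like this.
import numpy as np

# Synthetic training compute (FLOPs) spanning ~7 orders of magnitude.
compute = np.logspace(18, 25, 8)

# Assume CE loss follows a power law L(C) = a * C^(-b) (illustrative values).
a, b = 1e3, 0.15
loss = a * compute ** (-b)

# Fit a line in log-log space: log L = log a - b * log C.
slope, intercept = np.polyfit(np.log(compute), np.log(loss), 1)

# Extrapolate to 10x more compute than observed.
pred = np.exp(intercept + slope * np.log(1e26))
print(f"fitted exponent b ~ {-slope:.3f}, extrapolated loss ~ {pred:.3f}")
```

Because the synthetic loss is exactly a power law, the fit recovers the exponent and the extrapolation lands on the curve; real downstream metrics break this pattern at the emergence threshold.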
Definition of emergence:
Emergence is when quantitative changes in a system result in qualitative changes in behavior
In the context of ML:
- Performance is near-random until a critical threshold of scale
- After the threshold, performance rapidly increases
- Also known as a “phase transition”
3 key axes: (1) training compute (FLOPs), (2) model parameters, (3) dataset size (tokens)
1 & 2 are linked (except with MoE models), so can’t be differentiated here. For 3, many papers just use a fixed number of tokens at all scales, so again hard to compare. Hence the main focus is on FLOPs (which implies proportionate params)
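Why axes 1 & 2 move together can be seen from the standard back-of-envelope estimate that training compute is roughly 6 × params × tokens: with the token budget held fixed, FLOPs are directly proportional to parameter count. A minimal sketch (the token budget is an arbitrary illustrative value):

```python
# Sketch of why compute and params are linked: with the common approximation
# C ~ 6 * N * D (training FLOPs ~ 6 x params x tokens), fixing the token
# budget D makes compute directly proportional to parameter count N.
def train_flops(n_params: float, n_tokens: float) -> float:
    """Back-of-envelope estimate of total training compute."""
    return 6 * n_params * n_tokens

D = 300e9  # fixed token budget, as in many of the surveyed papers
for n in (1e9, 10e9, 100e9):
    print(f"{n:.0e} params -> {train_flops(n, D):.2e} FLOPs")
```

Scaling params by 10x scales FLOPs by 10x under a fixed token count, so the two axes cannot be varied independently in most of the surveyed results.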
(can also try and correlate CE/perplexity with emergent ability scaling)
- MMLU (multi-task language understanding) result suggests “the ability to solve knowledge-based questions spanning a large collection of topics might require scaling up past this threshold”
- WiC (word-in-context) is the only task requiring huge PaLM-scale models. Suggests distinguishing word senses across contexts may require massive scale
Augmented prompting: (emergence here is the point at which the prompting method, e.g. chain-of-thought, surpasses the few-shot baseline - see fig 3)
- Notion of emergence may be exaggerated by evaluation metrics. E.g. if exact string match is required, then incremental progress may not be visible. Similar logic applies to multi-step problems, where the final answer is only right if every step is.
- However, this is only a partial explanation, as emergence is seen even on classification tasks.
- Note though that CE improves even while accuracy remains at random, so models are “getting better” in some sense even before emergence.
- We can think of downstream metrics as “masking” log-likelihood improvements, until a certain scale.
- However this does not explain why/how some downstream metrics have this effect.
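A toy simulation (my own construction, not from the paper) of the masking point above: if per-token accuracy rises smoothly with scale, an all-or-nothing exact-match metric over a multi-token answer stays near zero for most of that range and then shoots up, looking “emergent”.

```python
# Toy illustration (not from the paper): smooth per-token improvement can
# look like sudden emergence under an all-or-nothing exact-match metric.
import numpy as np

scales = np.arange(1, 11)      # pseudo "model scale" steps
p_token = scales / 10.0        # per-token accuracy rises smoothly: 0.1 .. 1.0
k = 20                         # answer length in tokens
exact_match = p_token ** k     # all k tokens must be correct independently

for s, p, em in zip(scales, p_token, exact_match):
    print(f"scale {s:2d}: per-token {p:.1f} -> exact match {em:.4f}")
```

Exact match is below 1e-5 at per-token accuracy 0.5 and only becomes noticeable near 0.9, even though the underlying per-token metric improves linearly the whole time.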