
Emergent Abilities of Large Language Models

Date: 2022
Venue: DBLP

Introduction

  1. We have accurate scaling curves for cross-entropy (CE) loss (spanning more than 7 orders of magnitude of compute!)
  1. But not for performance on downstream tasks
Emergent abilities on downstream tasks can’t be predicted by extrapolating the small-scale curves (toy fit below).
Aim of paper = survey examples of this across the literature (not investigate when/why it happens)
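
A minimal sketch of why the loss side is predictable (all constants are synthetic, not from the paper): CE loss tends to follow a smooth power law in compute, L(C) ≈ a·C^-b, so a fit to small runs extrapolates reliably. This is exactly what fails for emergent downstream metrics.

```python
# Toy power-law extrapolation for CE loss (all numbers invented).
import numpy as np

rng = np.random.default_rng(0)
compute = np.logspace(18, 22, num=9)             # hypothetical training FLOPs
loss = 2e4 * compute ** -0.2                     # synthetic power-law loss
loss *= np.exp(rng.normal(0, 0.01, loss.shape))  # small multiplicative noise

# Fit log L = log a - b * log C using only the six smallest runs.
slope, intercept = np.polyfit(np.log(compute[:6]), np.log(loss[:6]), deg=1)
a, b = np.exp(intercept), -slope

print(f"fitted exponent b = {b:.3f}")
print(f"extrapolated loss at 1e24 FLOPs: {a * 1e24 ** -b:.3f}")
```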

Emergent Abilities

Definition

Definition of emergence:
Emergence is when quantitative changes in a system result in qualitative changes in behavior
In the context of ML:
  1. Performance is near-random at small scales
  1. Until a critical threshold of scale, after which performance rapidly increases (toy curve below)
  1. Also known as a “phase transition”
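
A toy curve (every number invented) of that phase-transition shape: accuracy sits at the random baseline until a hypothetical critical scale, then rises sharply. Modeled here as a logistic in log-compute.

```python
# Toy "phase transition": flat at the random baseline, sharp rise past ~1e22 FLOPs.
import numpy as np

log_compute = np.linspace(18, 24, num=13)  # log10 of training FLOPs (hypothetical)
random_baseline = 0.25                     # e.g. 4-way multiple choice

# Logistic jump centered at an invented critical scale of 1e22 FLOPs.
accuracy = random_baseline + (0.9 - random_baseline) / (
    1 + np.exp(-4.0 * (log_compute - 22.0))
)

for lc, acc in zip(log_compute, accuracy):
    print(f"1e{lc:.1f} FLOPs -> accuracy {acc:.2f}")
```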

Measuring scaling

3 key axes:
  1. FLOPs
  1. Params
  1. Data
1 & 2 linked (except with MoE), so can’t differentiate here. For 3 many papers just use a fixed number of tokens at all scales, so again hard to compare. Hence main focus will be on flops (implies params proportionate)
(can also try and correlate CE/perplexity with emergent ability scaling)
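
Quick sketch of why axes 1 & 2 are linked: for dense Transformers, a common approximation is training FLOPs ≈ 6·N·D (N = params, D = training tokens), so with D held fixed, compute is directly proportional to parameter count. The budgets below are illustrative, not from any particular paper.

```python
# Training-compute approximation for dense Transformers: FLOPs ~= 6 * N * D.
def train_flops(n_params: float, n_tokens: float) -> float:
    return 6 * n_params * n_tokens

D = 300e9  # a fixed token budget, as many papers use at every model scale
for N in (1e9, 10e9, 100e9):
    print(f"{N / 1e9:>5.0f}B params, {D / 1e9:.0f}B tokens -> {train_flops(N, D):.1e} FLOPs")
```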

Emergent Scale

Few-shot:
  • MMLU (Massive Multitask Language Understanding) result suggests “the ability to solve knowledge-based questions spanning a large collection of topics might require scaling up past this threshold”
  • WiC (Word-in-Context) is the only task requiring huge PaLM-scale models, suggesting that distinguishing word meanings across contexts may require massive scale
Augmented prompting: emergence here is the scale at which the prompting/finetuning technique surpasses the plain few-shot baseline (see Fig. 3 and the sketch below)
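A sketch of reading off an augmented-prompting emergence point (accuracies below are invented for illustration): the emergence scale is the smallest compute at which the technique beats the plain few-shot baseline.

```python
# Find the first scale where an augmented technique overtakes few-shot prompting.
import numpy as np

flops = np.array([1e20, 1e21, 1e22, 1e23, 1e24])             # hypothetical scales
few_shot = np.array([0.05, 0.08, 0.15, 0.18, 0.20])          # invented accuracies
chain_of_thought = np.array([0.02, 0.05, 0.10, 0.35, 0.55])  # invented accuracies

better = chain_of_thought > few_shot
assert better.any(), "technique never surpasses the baseline at these scales"
crossover = flops[np.argmax(better)]  # argmax returns the first True index
print(f"technique surpasses few-shot at ~{crossover:.0e} training FLOPs")
```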

Discussion

Explaining emergence

  • Notion of emergence may be exaggerated by evaluation metrics. E.g. if an exact string match is required, then incremental progress may not be visible. Similar logic applies to multi-step problems (toy model after this list).
  • However, this is only a partial explanation, as emergence is seen even on classification tasks.
  • Note though that CE loss improves even while accuracy is still at the random baseline, so models are “getting better” in some sense even before emergence.
  • We can think of downstream metrics as “masking” log-likelihood improvements until a certain scale.
  • However, this does not explain why/how some downstream metrics have this masking effect.
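
A toy model (my illustration; the paper makes the argument qualitatively) of how all-or-nothing metrics mask smooth progress on multi-step problems: if each of k steps succeeds independently with probability p, exact-match accuracy is p^k, which stays near zero until p is large and then climbs steeply.

```python
# Exact-match accuracy on a k-step problem where each step succeeds w.p. p.
k = 10  # all 10 steps must be correct for the final answer to match exactly
for p in (0.5, 0.7, 0.8, 0.9, 0.95, 0.99):
    print(f"per-step accuracy p = {p:.2f} -> exact-match accuracy p**k = {p ** k:.4f}")
```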