Emergent Abilities of Large Language Models

Introduction

We have accurate scaling curves for cross-entropy (CE) loss (spanning more than 7 orders of magnitude!)

But not for performance on downstream tasks

Some downstream abilities are emergent: they can’t be predicted by extrapolating the performance of smaller models.

Aim of paper = show examples of this across the literature (not investigate when/why)

Emergent Abilities

Definition

Definition of emergence:

Emergence is when quantitative changes in a system result in qualitative changes in behavior

In the context of ML:

Performance is near-random at smaller scales

Past a critical threshold of scale, performance rapidly increases to well above random

Also known as a “phase transition”
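A minimal sketch of how this definition could be applied to a measured scaling curve. The function name, the `margin` parameter and the data points are all hypothetical (not from the paper); the idea is just to find the smallest scale from which the score stays clearly above the random baseline.

```python
def emergence_threshold(scales, scores, random_baseline, margin=0.05):
    """Return the smallest scale from which every larger scale scores above
    random_baseline + margin, or None if performance never leaves random.

    `margin` is an arbitrary illustrative cutoff, not a value from the paper.
    """
    threshold = None
    for scale, score in sorted(zip(scales, scores)):
        if score > random_baseline + margin:
            if threshold is None:
                threshold = scale  # first clearly-above-random point
        else:
            threshold = None  # fell back to near-random, so reset
    return threshold


# Made-up accuracies on a 4-way multiple-choice task (random baseline = 0.25).
flops = [1e20, 1e21, 1e22, 1e23, 1e24]
accuracy = [0.24, 0.26, 0.25, 0.41, 0.63]
print(emergence_threshold(flops, accuracy, random_baseline=0.25))
```

With these made-up numbers the curve is flat for three orders of magnitude and then jumps, which is the "phase transition" shape. A smoothly improving curve would return its first point instead.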

Measuring scaling

3 key axes:

FLOPs

Params

Data

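FLOPs and params are linked (except with MoE), so their effects can’t be separated here. For data, many papers use a fixed number of training tokens at all model scales, so it is again hard to compare. Hence the main focus is on training FLOPs (which, for dense models trained on a fixed token budget, scale in proportion to params)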
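To see why FLOPs and params move together, here is a sketch using the common rule of thumb from the scaling-law literature that training compute for a dense Transformer is roughly 6 × params × tokens. This approximation is an assumption brought in for illustration, not something stated in these notes, and the token budget below is made up.

```python
def approx_training_flops(n_params, n_tokens):
    """Rule-of-thumb training compute for a dense Transformer: C ~ 6 * N * D.

    Assumption: every parameter is used for every token, which is why this
    breaks down for MoE models (only a fraction of params is active per token).
    """
    return 6 * n_params * n_tokens


fixed_tokens = 300e9  # illustrative fixed token budget shared by all model sizes
for n_params in (1e9, 10e9, 100e9):
    flops = approx_training_flops(n_params, fixed_tokens)
    print(f"{n_params:.0e} params -> {flops:.1e} training FLOPs")
```

With the token count held fixed, 10x the params means 10x the FLOPs, so plotting against either axis gives the same curve shape.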

(can also try and correlate CE/perplexity with emergent ability scaling)

Emergent Scale

Few-shot:

MMLU (Massive Multitask Language Understanding) result suggests “the ability to solve knowledge-based questions spanning a large collection of topics might require scaling up past this threshold” (the threshold being the scale at which accuracy first rises above random)

WiC (Word in Context) is the only task that needs a huge PaLM-scale model to get above random. Suggests that telling apart the meanings of a word in different contexts may require massive scale

Augmented prompting: here emergence is the scale at which a prompting method starts to beat the baseline without it, e.g. plain few-shot prompting (see fig 3)

Discussion

Explaining emergence

The appearance of emergence may be exaggerated by the evaluation metric. E.g. if an exact string match is required, a partially correct answer earns no credit, so incremental progress is invisible. Similar logic applies to multi-step problems, where only the final answer is scored.
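A toy calculation of the multi-step case, assuming (for illustration only) that each step succeeds independently with the same probability. The function name and the numbers are hypothetical; the point is that smooth per-step gains turn into a sharp jump in the all-or-nothing score.

```python
def all_steps_correct(per_step_accuracy, n_steps):
    """Probability that every one of n_steps is right, assuming the steps
    are independent and equally hard (a simplifying assumption)."""
    return per_step_accuracy ** n_steps


# Per-step accuracy improves steadily; the 10-step exact-match score does not.
for p in (0.5, 0.7, 0.9, 0.99):
    print(f"per-step {p:.2f} -> 10-step exact match {all_steps_correct(p, 10):.3f}")
```

Going from 0.5 to 0.7 per step barely moves the 10-step score, while going from 0.9 to 0.99 moves it a lot, even though both are improvements a smoother metric would register.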

However, this is only a partial explanation, as emergence is seen even on classification tasks.

Note though that CE improves even while accuracy is still at the random stage, so models are “getting better” in some sense even before emergence.

We can think of downstream metrics as “masking” log-likelihood improvements, until a certain scale.

However, this does not explain why/how some downstream metrics have this effect.
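A small sketch of the masking effect on a single 4-way classification question. The three "models" and their probabilities are made up for illustration. It shows the masking itself (accuracy only looks at the top prediction, CE looks at the probability of the right answer), not why particular tasks behave this way.

```python
import math


def cross_entropy(probs, correct):
    """Negative log-likelihood of the correct class."""
    return -math.log(probs[correct])


def is_correct(probs, correct):
    """Accuracy only checks whether the correct class has the top probability."""
    return max(range(len(probs)), key=probs.__getitem__) == correct


# Hypothetical predicted distributions from three model sizes; index 0 is correct.
CORRECT = 0
predictions = {
    "small": [0.10, 0.60, 0.15, 0.15],
    "medium": [0.30, 0.50, 0.10, 0.10],
    "large": [0.55, 0.30, 0.10, 0.05],
}
for name, probs in predictions.items():
    ce = cross_entropy(probs, CORRECT)
    print(f"{name}: CE={ce:.3f}, correct={is_correct(probs, CORRECT)}")
```

CE drops at every step, but the answer only flips to correct at the largest size, once the right class overtakes the distractor.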