Abstract
...Notably, it can be difficult to evaluate how modeling decisions may impact emergent capabilities, given that these capabilities arise mainly from sheer scale. Targeting a multilingual language model in the 100B+ parameters scale, our goal is to identify an architecture and training setup that makes the best use of our 1,000,000 A100-GPU-hours budget. Specifically, we perform an ablation study comparing different modeling practices and their impact on zero-shot generalization...
Goal: figure out the best model to train at 100B+ param scale by running a large ablation at ~1.3B param scale, specifically looking at the effect of each choice on zero-shot generalisation.
Intro
Output: design for architecture and training setup for a 100B param & 1M hour budget.
Eval scale: 1.3B params
Objective: zero-shot generalization
Arch analysis:
- Encoder, decoder or encoder-decoder
- Pretraining dataset
- Positional embeddings
- Activation functions
- Embedding norm
Zero-shot generalization
Methods
Architecture & Pretraining Objective
- All based on autoregressive language models (these show the strongest zero-shot generalization, though the notes don't pin down why)
- They can be easily turned into non-causal decoders
Experimental Setup
- Based on GPT-3
- Main difference is that all layers use full attention (no sparse as compute saving negligible)
- 112B tokens - note that this is significantly above the compute-optimal token count identified by Kaplan et al.
Evaluation
- Literature shows upstream performance is not always aligned with downstream; also, loss comparisons across architectures may not be valid
- Few and zero-shot are correlated
- Finetuning not used, as this is probably not how the model will be used, and it is considered impractical at 100B scale (I guess that’s just for end users, right? ST-MoE can do FT!)
- Use the EleutherAI Language Model Evaluation Harness to eval across 27 tasks - designed to reproduce as closely as possible the eval setup of GPT-3
Baselines
- GPT-Neo (1.3B, trained on the Pile)
- GPT-3 (OpenAI API) Babbage (1.3B) & Curie (6.7B)
Impact of Pretraining Data
![notion image](https://www.notion.so/image/https%3A%2F%2Fs3-us-west-2.amazonaws.com%2Fsecure.notion-static.com%2Ff82e4a41-2492-413d-9c5e-60916770c784%2FScreenshot_2022-04-28_at_22.50.26.png?table=block&id=13b23dd0-538c-4aec-ab49-d75081502953&cache=v2)
Results:
- The Pile is the best - it is diverse, cross-domain, combining web crawls with curated high-quality sources
- OSCAR = multilingual, based on Common Crawl (close second)
- C4 = also based on Common Crawl (distant third)
![notion image](https://www.notion.so/image/https%3A%2F%2Fs3-us-west-2.amazonaws.com%2Fsecure.notion-static.com%2Fb5933a82-bcc9-456a-b913-161062cc73b1%2FScreenshot_2022-04-28_at_22.52.54.png?table=block&id=b7ef77b0-8f32-4ef4-bd70-e2594ed2f99e&cache=v2)
Architecture Ablations
Positional Embeddings
![notion image](https://www.notion.so/image/https%3A%2F%2Fs3-us-west-2.amazonaws.com%2Fsecure.notion-static.com%2F10bc12d2-2f80-4999-a505-5b4a3becb2e8%2FScreenshot_2022-04-28_at_22.57.32.png?table=block&id=bc8d29ac-48ea-410d-ad0a-52e48e015ced&cache=v2)
![Wow!](https://www.notion.so/image/https%3A%2F%2Fs3-us-west-2.amazonaws.com%2Fsecure.notion-static.com%2F3d8a8dc9-54a4-4e8b-938b-2b9cf6ddab41%2FScreenshot_2022-04-28_at_23.01.19.png?table=block&id=505f6e2d-fb60-4dab-9afe-444c06a3c050&cache=v2)
- ALiBi positional embeddings best:
ALiBi does not add positional embeddings to word embeddings; instead, it biases query-key attention scores with a penalty that is proportional to their distance
- ALiBi > learned > rotary
- Baseline without any position info shows competitive perf! (note: positional embeddings matter much more for bidirectional models; here the causal mask already leaks position information)
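The bias ALiBi adds can be sketched in a few lines. A minimal NumPy version (slope schedule as described in the ALiBi paper; head count and shapes are illustrative, not the authors' implementation):

```python
import numpy as np

def alibi_bias(n_heads: int, seq_len: int) -> np.ndarray:
    """Per-head linear biases added to query-key attention scores.

    Head slopes form a geometric sequence starting at 2^(-8/n_heads),
    so each head penalizes distant tokens at a different rate.
    """
    slopes = np.array([2.0 ** (-8.0 * (h + 1) / n_heads) for h in range(n_heads)])
    positions = np.arange(seq_len)
    distance = positions[None, :] - positions[:, None]    # (seq, seq), = j - i
    # bias[h, i, j] = -slope_h * (i - j): zero on the diagonal, increasingly
    # negative for keys further in the past (future positions are masked anyway)
    return slopes[:, None, None] * distance[None, :, :]   # (heads, seq, seq)

# usage, per head h: scores = q @ k.T / np.sqrt(d) + alibi_bias(n_heads, seq_len)[h]
```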
Activation Functions
![notion image](https://www.notion.so/image/https%3A%2F%2Fs3-us-west-2.amazonaws.com%2Fsecure.notion-static.com%2F915b38fa-3e68-4637-8c50-142e8981cc2f%2FScreenshot_2022-04-28_at_23.02.42.png?table=block&id=bd2aa48f-4a5d-4016-9380-d8cd3aafb753&cache=v2)
- SwiGLU slightly better, but reduces throughput by roughly a third → so use GELU!
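For reference, a minimal NumPy sketch of the two FFN variants being compared (dimensions are illustrative; SwiGLU's extra input projection is one source of the throughput cost):

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU (Hendrycks & Gimpel)
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def swiglu_ffn(x, W, V, W2):
    """SwiGLU feed-forward: (SiLU(x @ W) * (x @ V)) @ W2.

    Note the *two* input projections W and V where a GELU FFN needs one.
    """
    z = x @ W
    return ((z / (1 + np.exp(-z))) * (x @ V)) @ W2   # SiLU(z) = z * sigmoid(z)

# illustrative shapes: d_model = 4, expanded dim = 8
rng = np.random.default_rng(0)
x = rng.normal(size=(2, 4))
W, V, W2 = (rng.normal(size=s) for s in [(4, 8), (4, 8), (8, 4)])

out_swiglu = swiglu_ffn(x, W, V, W2)   # gated variant, shape (2, 4)
out_gelu = gelu(x @ W) @ W2            # plain GELU FFN, one projection
```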
Embedding Norm
- It has been suggested that adding a norm layer right after the embedding helps with stability for large models
- However, their findings show this significantly harms zero-shot performance
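What the "embedding norm" variant actually changes, sketched in NumPy (sizes are illustrative; the learnable gain/bias of a real LayerNorm is omitted):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # normalize each vector to zero mean, unit variance along the last axis
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

vocab, d_model = 1000, 64                  # illustrative sizes
rng = np.random.default_rng(0)
emb_table = rng.normal(size=(vocab, d_model))

token_ids = np.array([3, 17, 42])
h_plain = emb_table[token_ids]             # standard: embeddings go straight in
h_normed = layer_norm(h_plain)             # variant: LayerNorm right after lookup
```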
Multilingual
multilingual models significantly underperform their monolingual counterparts on English zero-shot benchmarks
Scaling to 176B parameters
Compute Budget
- 18 weeks
- 52 nodes of 8 80GB A100s. (416 chips)
- ~1M hours
- Assuming 100 TFLOP/s throughput
- Budget ≈ 4,500 PF-days ≈ 380 ZFLOPs ≈ 23% more than GPT-3
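A quick back-of-envelope check of those figures (pure arithmetic; the 100 TFLOP/s sustained throughput per GPU is the assumption stated above, and the ~1M GPU-hours is the rounded budget):

```python
gpu_hours = 1_000_000                  # ~the stated budget
throughput = 100e12                    # assumed sustained 100 TFLOP/s per A100

total_flops = gpu_hours * 3600 * throughput   # 3.6e23 FLOPs
pf_day = 1e15 * 86400                         # FLOPs in one PF-day

print(total_flops / 1e21)     # 360.0  -> ~360 ZFLOPs
print(total_flops / pf_day)   # ~4167 PF-days, same ballpark as the quoted ~4,500
```

The small gap to the quoted 4,500 PF-days / 380 ZFLOPs comes from the budget being a bit above 1M hours (416 GPUs for 18 weeks is ~1.26M raw GPU-hours before overheads).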
Scaling Laws
Their runs track Kaplan et al.’s scaling laws so closely that they just go and use Kaplan’s:
![notion image](https://www.notion.so/image/https%3A%2F%2Fs3-us-west-2.amazonaws.com%2Fsecure.notion-static.com%2F6bf39a84-7ff9-4b95-831f-b5d43ccac542%2FScreenshot_2022-04-28_at_23.19.28.png?table=block&id=e6b062d0-4bf6-440e-a2ec-c06c23a90ec2&cache=v2)
- However, pretraining loss doesn’t always translate to downstream perf
- Lit indicates significant perf increases possible well past point of “optimal training”
- Scaling laws also neglect inference cost
Based on this they take the Kaplan-optimal model size and token count as an upper and lower bound respectively:
Optimal: 392B params & 165B tokens
Them: 176B params & 300-400B tokens
Final Architecture
300-400B tokens constrains model size to be around 160-200B params
Levine et al.’s work suggests 70-80 layers
Hidden dim = 14k
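Sanity check on the shape: the standard approximation for transformer parameter count is ~12 × n_layers × d_model² (attention + MLP blocks), plus the embedding matrix. Here d_model = 14336 and vocab = 250,000 are illustrative values near the "14k" and multilingual-vocab figures, not numbers from the notes:

```python
n_layers, d_model, vocab = 70, 14336, 250_000   # illustrative values

block_params = 12 * n_layers * d_model**2       # ~172.6B for the transformer blocks
embed_params = vocab * d_model                  # ~3.6B for the embedding matrix
total = block_params + embed_params

print(total / 1e9)                              # ~176B parameters
```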