Abstract
...Notably, it can be difficult to evaluate how modeling decisions may impact emergent capabilities, given that these capabilities arise mainly from sheer scale. Targeting a multilingual language model in the 100B+ parameters scale, our goal is to identify an architecture and training setup that makes the best use of our 1,000,000 A100-GPU-hours budget. Specifically, we perform an ablation study comparing different modeling practices and their impact on zero-shot generalization...
Goal: figure out the best model to train at the 100B-param scale by running a large ablation study at the ~1.3B-param scale, specifically looking at the impact of each choice on zero-shot generalization.
Intro
Output: design for architecture and training setup for a 100B param & 1M hour budget.
Eval scale: 1.3B params
Objective: zero-shot generalization
Arch analysis:
- Encoder, decoder or encoder-decoder
- Pretraining dataset
- Positional embeddings
- Activation functions
- Embedding norm
Zero-shot generalization
Methods
Architecture & Pretraining Objective
- All based on autoregressive (decoder-only) language models, since these show the strongest zero-shot generalization (why exactly, though?)
- They can also be cheaply turned into non-causal decoders later, just by changing the attention mask (see the sketch below)
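A minimal NumPy sketch of what that adaptation amounts to (function names are mine, not the paper's): the causal and non-causal decoders share the same weights, and only the attention mask differs, with the non-causal variant attending bidirectionally over a prefix.

```python
import numpy as np

def causal_mask(seq_len: int) -> np.ndarray:
    """Autoregressive mask: position i may attend to positions <= i."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def prefix_mask(seq_len: int, prefix_len: int) -> np.ndarray:
    """Non-causal ('prefix') mask: the first prefix_len positions attend to
    each other bidirectionally; later positions remain causal."""
    mask = causal_mask(seq_len)
    mask[:prefix_len, :prefix_len] = True
    return mask

# Same weights, different mask, which is why a trained causal decoder can be
# converted into a non-causal one.
print(causal_mask(5).astype(int))
print(prefix_mask(5, prefix_len=3).astype(int))
```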
Experimental Setup
- Based on GPT-3
- Main difference: all layers use full (dense) attention; no sparse attention, since the compute saving is negligible at this scale
- 112B tokens; note that this is significantly above the compute-optimal amount identified by Kaplan et al. for this model size (rough per-run cost sketched below)
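As a rough sense of scale, the standard C ≈ 6·N·D approximation (my back-of-envelope, not the paper's accounting) puts each 1.3B-param, 112B-token ablation run at roughly 10 PF-days, i.e. a small fraction of the final training budget:

```python
# Back-of-envelope cost of one ablation run, using the common C ~= 6*N*D rule.
N = 1.3e9            # parameters
D = 112e9            # training tokens
flops = 6 * N * D    # ~8.7e20 FLOPs
pf_days = flops / (1e15 * 86400)
print(f"~{flops:.1e} FLOPs per ablation run (~{pf_days:.0f} PF-days)")
```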
Evaluation
- Literature shows that upstream performance (e.g. pretraining loss) is not always aligned with downstream performance. Also, loss comparisons across architectures may not be valid
- Few-shot and zero-shot performance are correlated
- Finetuning not used: it is probably not how the model will be used, and it is considered impractical at the 100B scale (I guess that's just for end users, right? ST-MoE can do finetuning!)
- Use the EleutherAI Language Model Evaluation Harness to evaluate across 27 tasks, designed to reproduce the GPT-3 evaluation setup as closely as possible
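For reference, a sketch of how a similar zero-shot evaluation can be run with the harness's Python API today. The model, tasks, and exact keyword arguments below are placeholders/assumptions (they vary across harness versions), not the paper's precise 27-task setup:

```python
# pip install lm-eval   (EleutherAI Language Model Evaluation Harness)
import lm_eval

# Placeholder model and task list; the paper uses 27 tasks mirroring GPT-3's eval setup.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=EleutherAI/gpt-neo-1.3B",
    tasks=["lambada_openai", "hellaswag", "piqa"],
    num_fewshot=0,      # zero-shot, as in the paper
    batch_size=8,
)
print(results["results"])
```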
Baselines
- GPT-Neo (1.3B, trained on the Pile)
- GPT-3 (OpenAI API) Babbage (1.3B) & Curie (6.7B)
Impact of Pretraining Data
Results:
- The Pile is the best: it is diverse and cross-domain, combining web crawls with curated high-quality sources
- OSCAR = multilingual, based on Common Crawl (close second)
- C4 = also based on Common Crawl (a distant third)
Architecture Ablations
Positional Embeddings
- ALiBi positional embeddings best:
ALiBi does not add positional embeddings to word embeddings; instead, it biases query-key attention scores with a penalty that is proportional to their distance
- ALiBi > learned > rotary
- A baseline with no position information at all shows competitive performance! (note: position embeddings matter much more for bidirectional models; here the causal mask already gives the model an implicit notion of position)
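A minimal NumPy sketch of the ALiBi bias (helper name is mine; the per-head slopes follow the geometric sequence from the ALiBi paper):

```python
import numpy as np

def alibi_bias(num_heads: int, seq_len: int) -> np.ndarray:
    """Additive attention bias: a per-head linear penalty on query-key distance."""
    # Slopes form a geometric sequence: 2^(-8/H), 2^(-16/H), ..., 2^(-8)
    slopes = 2.0 ** (-8.0 * np.arange(1, num_heads + 1) / num_heads)
    pos = np.arange(seq_len)
    distance = pos[:, None] - pos[None, :]              # (seq, seq), i - j
    bias = -slopes[:, None, None] * distance[None]      # (heads, seq, seq)
    # Causal mask: keys in the future get -inf before the softmax
    return np.where(distance[None] < 0, -np.inf, bias)

# This bias is simply added to the query-key attention scores; no positional
# embeddings are added to the word embeddings.
```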
Activation Functions
- SwiGLU is slightly better, but reduces throughput by about a third, so use GELU!
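For concreteness, NumPy sketches of the two feed-forward variants (function and argument names are mine; the SwiGLU form follows Shazeer, 2020). The extra gating projection is one reason SwiGLU costs more per layer at the same hidden width:

```python
import numpy as np

def gelu(z):
    # tanh approximation of GELU
    return 0.5 * z * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (z + 0.044715 * z**3)))

def swish(z):
    # Swish / SiLU: z * sigmoid(z)
    return z / (1.0 + np.exp(-z))

def ffn_gelu(x, W_in, W_out):
    # Standard transformer FFN: up-project, GELU, down-project
    return gelu(x @ W_in) @ W_out

def ffn_swiglu(x, W_gate, W_up, W_out):
    # SwiGLU FFN: a Swish-gated linear unit, then down-projection.
    # Note the extra weight matrix (W_gate) relative to the GELU FFN.
    return (swish(x @ W_gate) * (x @ W_up)) @ W_out
```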
Embedding Norm
- It has been suggested that adding a layer norm right after the embedding layer helps with training stability for large models
- However, their findings show this significantly harms performance
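The ablated change itself is tiny; a hedged sketch (names mine) of where the extra norm sits:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def embed(token_ids, embedding_table, use_embedding_norm=False):
    h = embedding_table[token_ids]
    if use_embedding_norm:
        # The ablated variant: LayerNorm directly after the embedding lookup,
        # before the first transformer block (helps stability, hurts zero-shot here).
        h = layer_norm(h)
    return h
```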
Multilingual
- Multilingual models significantly underperform their monolingual counterparts on English zero-shot benchmarks
Scaling to 176B parameters
Compute Budget
- 18 weeks
- 52 nodes of 8× 80GB A100s (416 GPUs)
- ~1M hours
- Assuming 100 TFLOP/s sustained throughput per GPU
- Budget ≈ 4,500 PF-days ≈ 380 ZFLOPs, i.e. ~23% more than GPT-3
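Quick sanity check of the arithmetic (the paper's exact GPU-hour count is a bit above 1M, which is why the quoted ~4,500 PF-days / ~23% figures come out slightly higher than this round-number version; GPT-3's reported training compute is ~3.14e23 FLOPs ≈ 3,640 PF-days):

```python
GPU_HOURS = 1.0e6        # ~1M A100-hours (actual budget is slightly higher)
THROUGHPUT = 100e12      # assumed sustained throughput per GPU: 100 TFLOP/s

total_flops = GPU_HOURS * 3600 * THROUGHPUT     # ~3.6e23 FLOPs
pf_days = total_flops / (1e15 * 86400)          # ~4,200 PF-days
zflops = total_flops / 1e21                     # ~360 ZFLOPs

GPT3_PF_DAYS = 3640                             # from the GPT-3 paper
print(f"{pf_days:,.0f} PF-days, {zflops:.0f} ZFLOPs, "
      f"{pf_days / GPT3_PF_DAYS - 1:+.0%} vs GPT-3")
```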
Scaling Laws
Their ablation runs track Kaplan et al.'s scaling laws so closely that they just use Kaplan's directly:
- However, pretraining loss doesn’t always translate to downstream perf
- Literature indicates significant performance increases are possible well past the point of "optimal" training
- Scaling laws also neglect inference cost
Based on this, they take the Kaplan-optimal model size and token count as an upper and lower bound respectively:
Optimal: 392B params & 165B tokens
Them: 176B params & 300-400B tokens
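Both points can be sanity-checked against the budget with the standard C ≈ 6·N·D approximation for training FLOPs (a back-of-envelope check, not the paper's exact accounting):

```python
def train_flops(n_params: float, n_tokens: float) -> float:
    # Common approximation: ~6 FLOPs per parameter per training token
    return 6 * n_params * n_tokens

budget = 4500 * 1e15 * 86400    # ~4,500 PF-days expressed in FLOPs

for label, n, d in [("Kaplan-optimal", 392e9, 165e9),
                    ("their choice",   176e9, 350e9)]:   # 350B = midpoint of 300-400B
    c = train_flops(n, d)
    print(f"{label}: {c:.2e} FLOPs ({c / budget:.0%} of budget)")
```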
Final Architecture
300-400B tokens constrains model size to be around 160-200B params
Levine et al.’s work suggests 70-80 layers
Hidden dim = 14k
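Rough check of the shape with the usual P ≈ 12·L·d² decoder-parameter approximation (ignores embeddings and biases; my back-of-envelope, not the paper's exact count):

```python
def approx_params(n_layers: int, d_model: int) -> float:
    # ~12 * L * d^2: 4*d^2 for attention (Q, K, V, O) + 8*d^2 for a 4x-wide FFN
    return 12 * n_layers * d_model ** 2

for n_layers in (70, 80):
    p = approx_params(n_layers, 14_000)
    print(f"{n_layers} layers, d_model ~14k -> ~{p / 1e9:.0f}B params")
```

Both land inside the 160-200B window above.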