Abstract
...Notably, it can be difficult to evaluate how modeling decisions may impact emergent capabilities, given that these capabilities arise mainly from sheer scale. Targeting a multilingual language model in the 100B+ parameters scale, our goal is to identify an architecture and training setup that makes the best use of our 1,000,000 A100-GPU-hours budget. Specifically, we perform an ablation study comparing different modeling practices and their impact on zero-shot generalization...
Goal: figure out the best model to train at 100B+ param scale by running a large ablation at ~1.3B param scale, specifically looking at the effect of each choice on zero-shot generalisation.
Intro
Output: design for architecture and training setup for a 100B param & 1M hour budget.
Eval scale: 1.3B params
Objective: zero-shot generalization
Arch analysis:
- Encoder, decoder or encoder-decoder
- Pretraining dataset
- Positional embeddings
- Activation functions
- Embedding norm
Zero-shot generalization
Methods
Architecture & Pretraining Objective
- All based on autoregressive language models (these show the strongest zero-shot generalization, though the notes don't pin down why)
- They can be easily turned into non-causal decoders
Experimental Setup
- Based on GPT-3
- Main difference is that all layers use full attention (no sparse as compute saving negligible)
- 112B tokens - note that this is significantly above the compute-optimal token count identified by Kaplan et al.
Evaluation
- Literature shows upstream performance is not always aligned with downstream; also, loss comparisons across architectures may not be valid
- Few and zero-shot are correlated
- Finetuning not used, as this is probably not how the model will be used, and it is considered impractical at 100B scale (I guess that’s just for end users, right? ST-MoE can do FT!)
- Use the EleutherAI Language Model Evaluation Harness to eval across 27 tasks - designed to reproduce as closely as possible the eval setup of GPT-3
Baselines
- GPT-Neo (1.3B, trained on the Pile)
- GPT-3 (OpenAI API) Babbage (1.3B) & Curie (6.7B)
Impact of Pretraining Data
![notion image](https://www.notion.so/image/https%3A%2F%2Fs3-us-west-2.amazonaws.com%2Fsecure.notion-static.com%2Ff82e4a41-2492-413d-9c5e-60916770c784%2FScreenshot_2022-04-28_at_22.50.26.png?table=block&id=13b23dd0-538c-4aec-ab49-d75081502953&cache=v2)
Results:
- The Pile is the best - it is diverse, cross-domain, combining web crawls with curated high-quality sources
- OSCAR = multilingual, based on Common Crawl (close second)
- C4 = also based on Common Crawl (distant third)
![notion image](https://www.notion.so/image/https%3A%2F%2Fs3-us-west-2.amazonaws.com%2Fsecure.notion-static.com%2Fb5933a82-bcc9-456a-b913-161062cc73b1%2FScreenshot_2022-04-28_at_22.52.54.png?table=block&id=b7ef77b0-8f32-4ef4-bd70-e2594ed2f99e&cache=v2)
Architecture Ablations
Positional Embeddings
![notion image](https://www.notion.so/image/https%3A%2F%2Fs3-us-west-2.amazonaws.com%2Fsecure.notion-static.com%2F10bc12d2-2f80-4999-a505-5b4a3becb2e8%2FScreenshot_2022-04-28_at_22.57.32.png?table=block&id=bc8d29ac-48ea-410d-ad0a-52e48e015ced&cache=v2)
![Wow!](https://www.notion.so/image/https%3A%2F%2Fs3-us-west-2.amazonaws.com%2Fsecure.notion-static.com%2F3d8a8dc9-54a4-4e8b-938b-2b9cf6ddab41%2FScreenshot_2022-04-28_at_23.01.19.png?table=block&id=505f6e2d-fb60-4dab-9afe-444c06a3c050&cache=v2)
- ALiBi positional embeddings best:
ALiBi does not add positional embeddings to word embeddings; instead, it biases query-key attention scores with a penalty that is proportional to their distance
- ALiBi > learned > rotary
- Baseline without any position info shows competitive perf! (note: positional embeddings matter much more for bidirectional models; here the causal mask already leaks position information)
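The bias ALiBi adds can be sketched in a few lines. A minimal NumPy version (slope schedule as described in the ALiBi paper; head count and shapes are illustrative, not the authors' implementation):

```python
import numpy as np

def alibi_bias(n_heads: int, seq_len: int) -> np.ndarray:
    """Per-head linear biases added to query-key attention scores.

    Head slopes form a geometric sequence starting at 2^(-8/n_heads),
    so each head penalizes distant tokens at a different rate.
    """
    slopes = np.array([2.0 ** (-8.0 * (h + 1) / n_heads) for h in range(n_heads)])
    positions = np.arange(seq_len)
    distance = positions[None, :] - positions[:, None]    # (seq, seq), = j - i
    # bias[h, i, j] = -slope_h * (i - j): zero on the diagonal, increasingly
    # negative for keys further in the past (future positions are masked anyway)
    return slopes[:, None, None] * distance[None, :, :]   # (heads, seq, seq)

# usage, per head h: scores = q @ k.T / np.sqrt(d) + alibi_bias(n_heads, seq_len)[h]
```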
Activation Functions
![notion image](https://www.notion.so/image/https%3A%2F%2Fs3-us-west-2.amazonaws.com%2Fsecure.notion-static.com%2F915b38fa-3e68-4637-8c50-142e8981cc2f%2FScreenshot_2022-04-28_at_23.02.42.png?table=block&id=bd2aa48f-4a5d-4016-9380-d8cd3aafb753&cache=v2)
- SwiGLU slightly better, but reduces throughput by roughly a third → so use GELU!
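For reference, a minimal NumPy sketch of the two FFN variants being compared (dimensions are illustrative; SwiGLU's extra input projection is one source of the throughput cost):

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU (Hendrycks & Gimpel)
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def swiglu_ffn(x, W, V, W2):
    """SwiGLU feed-forward: (SiLU(x @ W) * (x @ V)) @ W2.

    Note the *two* input projections W and V where a GELU FFN needs one.
    """
    z = x @ W
    return ((z / (1 + np.exp(-z))) * (x @ V)) @ W2   # SiLU(z) = z * sigmoid(z)

# illustrative shapes: d_model = 4, expanded dim = 8
rng = np.random.default_rng(0)
x = rng.normal(size=(2, 4))
W, V, W2 = (rng.normal(size=s) for s in [(4, 8), (4, 8), (8, 4)])

out_swiglu = swiglu_ffn(x, W, V, W2)   # gated variant, shape (2, 4)
out_gelu = gelu(x @ W) @ W2            # plain GELU FFN, one projection
```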
Embedding Norm
- It has been suggested that adding a norm layer right after the embedding helps with stability for large models
- However, their findings show this significantly harms zero-shot performance
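What the "embedding norm" variant actually changes, sketched in NumPy (sizes are illustrative; the learnable gain/bias of a real LayerNorm is omitted):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # normalize each vector to zero mean, unit variance along the last axis
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

vocab, d_model = 1000, 64                  # illustrative sizes
rng = np.random.default_rng(0)
emb_table = rng.normal(size=(vocab, d_model))

token_ids = np.array([3, 17, 42])
h_plain = emb_table[token_ids]             # standard: embeddings go straight in
h_normed = layer_norm(h_plain)             # variant: LayerNorm right after lookup
```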
Multilingual
multilingual models significantly underperform their monolingual counterparts on English zero-shot benchmarks
Scaling to 176B parameters
Compute Budget
- 18 weeks
- 52 nodes of 8 80GB A100s. (416 chips)
- ~1M hours
- Assuming 100 TFLOP/s throughput
- Budget ≈ 4,500 PF-days ≈ 380 ZFLOPs ≈ 23% more than GPT-3
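A quick back-of-envelope check of those figures (pure arithmetic; the 100 TFLOP/s sustained throughput per GPU is the assumption stated above, and the ~1M GPU-hours is the rounded budget):

```python
gpu_hours = 1_000_000                  # ~the stated budget
throughput = 100e12                    # assumed sustained 100 TFLOP/s per A100

total_flops = gpu_hours * 3600 * throughput   # 3.6e23 FLOPs
pf_day = 1e15 * 86400                         # FLOPs in one PF-day

print(total_flops / 1e21)     # 360.0  -> ~360 ZFLOPs
print(total_flops / pf_day)   # ~4167 PF-days, same ballpark as the quoted ~4,500
```

The small gap to the quoted 4,500 PF-days / 380 ZFLOPs comes from the budget being a bit above 1M hours (416 GPUs for 18 weeks is ~1.26M raw GPU-hours before overheads).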
Scaling Laws
Their runs track Kaplan et al.’s scaling laws so closely that they just go and use Kaplan’s:
![notion image](https://www.notion.so/image/https%3A%2F%2Fs3-us-west-2.amazonaws.com%2Fsecure.notion-static.com%2F6bf39a84-7ff9-4b95-831f-b5d43ccac542%2FScreenshot_2022-04-28_at_23.19.28.png?table=block&id=e6b062d0-4bf6-440e-a2ec-c06c23a90ec2&cache=v2)
- However, pretraining loss doesn’t always translate to downstream perf
- Lit indicates significant perf increases possible well past point of “optimal training”
- Scaling laws also neglect inference cost
Based on this they take the Kaplan-optimal model size and token count as an upper and lower bound respectively:
Optimal: 392B params & 165B tokens
Them: 176B params & 300-400B tokens
Final Architecture
300-400B tokens constrains model size to be around 160-200B params
Levine et al.’s work suggests 70-80 layers
Hidden dim = 14k
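Sanity check on the shape: the standard approximation for transformer parameter count is ~12 × n_layers × d_model² (attention + MLP blocks), plus the embedding matrix. Here d_model = 14336 and vocab = 250,000 are illustrative values near the "14k" and multilingual-vocab figures, not numbers from the notes:

```python
n_layers, d_model, vocab = 70, 14336, 250_000   # illustrative values

block_params = 12 * n_layers * d_model**2       # ~172.6B for the transformer blocks
embed_params = vocab * d_model                  # ~3.6B for the embedding matrix
total = block_params + embed_params

print(total / 1e9)                              # ~176B parameters
```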