
What Language Model to Train if You Have One Million GPU Hours?

Date: 2022
Venue: DBLP

Abstract

...Notably, it can be difficult to evaluate how modeling decisions may impact emergent capabilities, given that these capabilities arise mainly from sheer scale. Targeting a multilingual language model in the 100B+ parameters scale, our goal is to identify an architecture and training setup that makes the best use of our 1,000,000 A100-GPU-hours budget. Specifically, we perform an ablation study comparing different modeling practices and their impact on zero-shot generalization...
Goal: figure out the best model to train at the 100B-param scale by running a large ablation at the ~1B-param scale, using zero-shot generalization as the evaluation objective.

Intro

Output: an architecture and training-setup design for a 100B-param model under a 1M GPU-hour budget.
Eval scale: 1.3B params
Objective: zero-shot generalization
Arch analysis:
  1. Encoder-only, decoder-only, or encoder-decoder
  2. Pretraining dataset
  3. Positional embeddings
  4. Activation functions
  5. Embedding norm

Zero-shot generalization

 

Methods

Architecture & Pretraining Objective

  • All based on autoregressive (decoder-only) language models (the authors favour these for zero-shot generalization, but why particularly?)
  • They can easily be turned into non-causal (prefix-LM) decoders (see the mask sketch below)
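A minimal sketch of the difference, my own and not from the paper: a causal decoder masks all future positions, while a non-causal (prefix-LM) decoder attends bidirectionally over a prefix and causally over the rest.

```python
import torch

def attention_mask(seq_len: int, prefix_len: int = 0) -> torch.Tensor:
    """Boolean mask where True means "may attend".

    prefix_len=0 gives a standard causal decoder; prefix_len>0 gives a
    non-causal (prefix-LM) decoder with bidirectional attention over the
    first prefix_len tokens.
    """
    # Lower-triangular causal mask: token i attends to tokens 0..i.
    mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    # Open up full bidirectional attention within the prefix.
    mask[:prefix_len, :prefix_len] = True
    return mask

print(attention_mask(5).int())                 # causal decoder
print(attention_mask(5, prefix_len=3).int())   # non-causal (prefix-LM) decoder
```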

Experimental Setup

  • Based on GPT-3
  • Main difference: all layers use dense attention (no sparse attention, since the compute savings are negligible)
  • Trained on 112B tokens; note this is significantly above the compute-optimal amount identified by Kaplan et al. for this model size

Evaluation

  • The literature shows upstream performance (pretraining loss) is not always aligned with downstream performance; loss comparisons across different architectures may also not be valid
  • Few-shot and zero-shot performance are correlated
  • Finetuning is not evaluated, since it is likely not how the model will be used and is considered impractical at the 100B+ scale (I guess that's just for end users, right? ST-MoE can do finetuning!)
  • They use the EleutherAI Language Model Evaluation Harness to evaluate across 27 tasks, designed to reproduce the GPT-3 evaluation setup as closely as possible (example invocation below)
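A sketch of what such a zero-shot run looks like, assuming the current lm-evaluation-harness Python API (which has changed since the paper); the model and tasks below are illustrative, not the paper's exact 27-task suite.

```python
# Zero-shot evaluation with EleutherAI's lm-evaluation-harness (v0.4+ API assumed).
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                       # HuggingFace backend
    model_args="pretrained=EleutherAI/gpt-neo-1.3B",  # 1.3B baseline from the paper
    tasks=["lambada_openai", "hellaswag", "piqa"],    # illustrative subset of tasks
    num_fewshot=0,                                    # zero-shot, as in the paper
    batch_size=8,
)
print(results["results"])
```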

Baselines

  • GPT-Neo (1.3B, trained on the Pile)
  • GPT-3 (OpenAI API) Babbage (1.3B) & Curie (6.7B)

Impact of Pretraining Data

Results:
  1. The Pile is the best: it is a diverse, cross-domain dataset combining web crawls with curated high-quality sources
  2. OSCAR (multilingual, based on Common Crawl) is a close second
  3. C4 (also based on Common Crawl) is a distant third

Architecture Ablations

Positional Embeddings

Wow!
  • ALiBi positional embeddings are best:
    • ALiBi does not add positional embeddings to the word embeddings; instead, it biases the query-key attention scores with a penalty proportional to their distance (sketched below)
  • ALiBi > learned > rotary
  • A baseline with no positional information at all is surprisingly competitive! (Note: positional embeddings matter much more for bidirectional models; here the causal attention mask already provides implicit position information)
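A minimal sketch of the ALiBi bias (my own, following Press et al., not the paper's code); the slope schedule below assumes the number of heads is a power of two.

```python
import torch

def alibi_bias(num_heads: int, seq_len: int) -> torch.Tensor:
    """Per-head bias added to raw attention scores: -m_h * (i - j)."""
    # Head-specific slopes form a geometric sequence 2^(-8/n), 2^(-16/n), ...
    slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)])
    pos = torch.arange(seq_len)
    # Distance between query position i and key position j (future positions are
    # masked out elsewhere by the causal mask).
    distance = (pos[:, None] - pos[None, :]).clamp(min=0)   # (seq, seq)
    return -slopes[:, None, None] * distance                # (heads, seq, seq)

bias = alibi_bias(num_heads=8, seq_len=16)  # add to q @ k.T / sqrt(d) before softmax
```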

Activation Functions

notion image
  • SwiGLU is slightly better, but reduces throughput by about a third → so use GELU! (sketch of both variants below)
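A minimal sketch of the two feed-forward variants (my own, at a fixed d_ff, not the paper's exact shapes); the SwiGLU block carries an extra gate projection, consistent with the throughput cost noted above.

```python
import torch.nn as nn
import torch.nn.functional as F

class GeluFFN(nn.Module):
    """Standard transformer MLP: up-projection, GELU, down-projection."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.up = nn.Linear(d_model, d_ff)
        self.down = nn.Linear(d_ff, d_model)

    def forward(self, x):
        return self.down(F.gelu(self.up(x)))

class SwiGLUFFN(nn.Module):
    """Gated MLP: SiLU(gate(x)) * up(x), then down-projection."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.up = nn.Linear(d_model, d_ff)
        self.gate = nn.Linear(d_model, d_ff)   # extra projection vs. the GELU MLP
        self.down = nn.Linear(d_ff, d_model)

    def forward(self, x):
        return self.down(F.silu(self.gate(x)) * self.up(x))
```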

Embedding Norm

  • It has been suggested that adding a layer norm right after the embedding layer helps training stability for large models
  • However, their ablation shows this significantly harms zero-shot performance (sketch of the variant below)
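For concreteness, a sketch of the variant being ablated (my own minimal version, not the paper's code): a LayerNorm applied directly to the token embeddings.

```python
import torch.nn as nn

class EmbeddingWithNorm(nn.Module):
    """Token embedding followed by the extra LayerNorm under ablation."""
    def __init__(self, vocab_size: int, d_model: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.norm = nn.LayerNorm(d_model)  # the "embedding norm" being tested

    def forward(self, token_ids):
        return self.norm(self.embed(token_ids))
```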

Multilingual

Multilingual models significantly underperform their monolingual counterparts on English zero-shot benchmarks.

Scaling to 176B parameters

Compute Budget

  • 18 weeks
  • 52 nodes of 8× 80GB A100s (416 GPUs)
  • ~1M A100-hours
  • Assuming 100 TFLOP/s sustained throughput per GPU
    • Budget ≈ 4,500 PF-days ≈ 380 ZFLOPs, i.e. roughly 23% more than GPT-3 (worked out below)
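A quick check of that arithmetic (a sketch: the exact allocated GPU-hours are approximate, and the 100 TFLOP/s sustained throughput is the stated assumption):

```python
gpus = 52 * 8                            # 416 A100s
print(gpus * 18 * 7 * 24)                # ~1.26M GPU-hours if fully dedicated for 18 weeks

gpu_hours = 1.05e6                       # "~1M" budgeted A100-hours (approximate)
flops_per_gpu = 100e12                   # assumed sustained throughput, FLOP/s
total_flops = gpu_hours * 3600 * flops_per_gpu

print(f"{total_flops / (1e15 * 86400):,.0f} PF-days")   # ~4,400 PF-days
print(f"{total_flops / 1e21:.0f} ZFLOPs")               # ~380 ZFLOPs
print(f"{total_flops / 3.14e23 - 1:+.0%} vs GPT-3")     # ~+20% over GPT-3's 3.14e23 FLOPs
```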

Scaling Laws

Their own scaling-law fits are so close to Kaplan et al.'s that they simply reuse Kaplan's laws.
  • However, pretraining loss doesn’t always translate to downstream perf
  • The literature indicates significant performance gains are possible well past the point of "compute-optimal" training
  • Scaling laws also neglect inference cost
Based on this, they take Kaplan's "optimal" model size and token count as an upper and lower bound respectively:
Kaplan-optimal: 392B params & 165B tokens
Them: 176B params & 300-400B tokens (sanity check below)
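That token range can be sanity-checked with the standard C ≈ 6·N·D estimate of training FLOPs for a dense transformer (my check, not a calculation from the paper):

```python
C = 3.8e23            # total training budget in FLOPs (~380 ZFLOPs, from above)
N = 176e9             # chosen parameter count
D = C / (6 * N)       # implied token budget from C ≈ 6*N*D
print(f"{D / 1e9:.0f}B tokens")   # ~360B, inside the 300-400B range
```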

Final Architecture

300-400B tokens constrains model size to be around 160-200B params
Levine et al.’s work suggests 70-80 layers
Hidden dim = 14k
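A back-of-the-envelope check that this shape lands near 176B parameters, using the usual ~12·L·d² count of non-embedding parameters plus vocab·d for the embedding matrix (d = 14336 is taken as a concrete value for "14k", and the ~250k multilingual vocabulary is my assumption, not a figure from these notes):

```python
L = 70                 # number of layers (low end of the 70-80 suggestion)
d = 14336              # hidden dimension ("14k")
vocab = 250_000        # assumed multilingual vocabulary size (illustrative)

non_embedding = 12 * L * d**2        # attention + MLP weights, ~12*L*d^2
embedding = vocab * d                # token embedding matrix
print(f"{(non_embedding + embedding) / 1e9:.0f}B params")   # ~176B
```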