Introduction
- Trained GPT-3 style models, but with "latest best practices in data collection and efficient training"
- Releasing GPT-style models up to 175B params
- ~1000 80GB A100s (~150 TFLOP/s utilisation per GPU; rough compute check below)
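My back-of-envelope check on that compute budget, assuming the paper's ~180B training tokens and the standard C ≈ 6·N·D FLOPs rule of thumb (both assumptions I'm bringing in, not stated in these notes):

```python
# Back-of-envelope training compute via the common C ~= 6 * N * D rule of thumb.
params = 175e9                     # N: parameters
tokens = 180e9                     # D: training tokens (~180B per the paper; assumed here)
total_flops = 6 * params * tokens  # ~1.9e23 FLOPs
cluster = 1000 * 150e12            # 1000 GPUs x 150 TFLOP/s each
ideal_days = total_flops / cluster / 86400
print(f"{total_flops:.2e} FLOPs -> ~{ideal_days:.0f} ideal days")  # ~15 days
# The actual run took ~2 months, consistent with restarts, downtime, and ramp-up.
```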
Method
Models
Training Setup
- Normal init with std=0.006
- Output-layer init std scaled by 1/sqrt(2L), where L = number of layers (init sketch after this list)
- Seq len = 2048
- AdamW
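A minimal sketch of that init scheme; the parameter-name matching (`out_proj`) is my own illustrative convention, not the paper's code:

```python
import math
import torch.nn as nn

STD = 0.006

def init_weights(model: nn.Module, num_layers: int) -> None:
    """Normal(0, 0.006) everywhere; residual output projections get std / sqrt(2L)."""
    for name, p in model.named_parameters():
        if p.dim() < 2:
            continue  # leave biases / LayerNorm params at their defaults
        if name.endswith("out_proj.weight"):  # illustrative name for block output layers
            nn.init.normal_(p, mean=0.0, std=STD / math.sqrt(2 * num_layers))
        else:
            nn.init.normal_(p, mean=0.0, std=STD)
```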
Corpus
- Parts of RoBERTa dataset (BookCorpus, Stories, CCNews)
- The Pile
- PushShift.io Reddit
Significant data processing was needed: specifically, deduplication and removing subsets of the data that caused gradient spikes (minimal dedup sketch below)
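A minimal exact-match dedup sketch via content hashing; the real pipeline used fuzzier document-level matching (MinHash-style, if I recall correctly), so treat this as illustrative only:

```python
import hashlib

def dedup_exact(docs: list[str]) -> list[str]:
    """Keep the first occurrence of each document, matching on normalised text."""
    seen: set[str] = set()
    kept = []
    for doc in docs:
        digest = hashlib.sha256(" ".join(doc.lower().split()).encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept

print(dedup_exact(["The cat sat.", "the  cat sat.", "Another doc."]))
# -> ['The cat sat.', 'Another doc.']
```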
Training efficiency
- "Fully Sharded Data Parallel" (RTS?)
- "Megatron-LM Tensor Parallelism"
- Model weights FP16, Adam state in FP32
- "Dynamic loss scaling" - presume this is automatic loss scaling (ALS), but not clear (sketch below)
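Dynamic loss scaling as implemented in PyTorch AMP, which I believe matches what the paper means: grow the scale while gradients stay finite, halve it (and skip the step) on overflow. A generic sketch, not OPT's actual Megatron/FSDP loop:

```python
import torch

model = torch.nn.Linear(512, 512).cuda()
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)   # Adam state kept in FP32
scaler = torch.cuda.amp.GradScaler()                   # dynamic loss scaling

for _ in range(100):
    x = torch.randn(8, 512, device="cuda")
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = model(x).square().mean()                # forward in FP16
    opt.zero_grad(set_to_none=True)
    scaler.scale(loss).backward()                      # scale loss to avoid FP16 grad underflow
    scaler.step(opt)                                   # unscales; skips step if grads overflowed
    scaler.update()                                    # raise scale on success, halve on overflow
```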
Training process
- Frequent hardware failures (35 manual restarts, 70+ automatic, and 100 hosts cycled over 2 months)
- Frequent loss divergences → solution: restart from an earlier checkpoint and lower the LR (recovery loop sketched below)
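The recovery recipe as a runnable toy loop; the divergence heuristic, checkpoint cadence, and 0.5x LR cut are all my assumptions (the paper just says they rewound to an earlier checkpoint and lowered the LR):

```python
import copy
import torch

model = torch.nn.Linear(16, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
last_good = copy.deepcopy(model.state_dict())

for step in range(1000):
    x = torch.randn(32, 16)
    loss = (model(x) - x.sum(dim=1, keepdim=True)).square().mean()
    if not torch.isfinite(loss) or loss.item() > 1e3:   # divergence heuristic (assumed)
        model.load_state_dict(last_good)                # rewind to the last good checkpoint
        for g in opt.param_groups:
            g["lr"] *= 0.5                              # lower the LR and resume (factor assumed)
        continue
    opt.zero_grad()
    loss.backward()
    opt.step()
    if step % 100 == 0:
        last_good = copy.deepcopy(model.state_dict())   # periodic "checkpoint"
```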
Evaluations
Similar setup to GPT-3 (main comparison point)
Versus other language models at the same param size:
Similar perf to:
- GPT-3
- Chinchilla
- Gopher
Worse than:
- PaLM (attributed to PaLM's higher-quality & more diverse data)
However, this is on average; per-task performance differs substantially.
Bias & Toxicity Evaluations
Very thorough here - looks like a good example of how to do this well. Comparing vs GPT-3:
- Better at detecting hate speech
- Worse (Table 4) or similar (Table 5) at exhibiting bias (differences again attributed to data)
- Worse at generating toxic language
- Similar at dialogue safety
Limitations
In general, the authors qualitatively observe that OPT-175B suffers from the same limitations noted in other LLMs:
- Not good with "declarative instructions or point-blank interrogatives" (see InstructGPT for addressing this)
- Repetitiveness (solutions: unlikelihood training or best-first decoding; unlikelihood sketch after this list)
- Factual incorrectness (solution: retrieval-augmented models)
- Toxicity & bias
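For the repetition item above, a toy token-level unlikelihood loss in the spirit of Welleck et al. (2019): on top of the usual MLE term, penalise probability mass placed on tokens already seen in the context. The alpha weight and candidate construction are simplifications, not the exact recipe:

```python
import torch
import torch.nn.functional as F

def unlikelihood_loss(logits: torch.Tensor, targets: torch.Tensor, alpha: float = 1.0):
    """logits: (T, V); targets: (T,). MLE + penalty on repeating context tokens."""
    log_probs = F.log_softmax(logits, dim=-1)
    mle = F.nll_loss(log_probs, targets)
    penalty = logits.new_zeros(())
    for t in range(1, targets.size(0)):
        cands = targets[:t].unique()
        cands = cands[cands != targets[t]]        # never penalise the true next token
        if cands.numel():
            p = log_probs[t, cands].exp().clamp(max=1 - 1e-6)
            penalty = penalty + (-torch.log1p(-p)).sum()
    return mle + alpha * penalty / targets.size(0)

logits = torch.randn(6, 100, requires_grad=True)
loss = unlikelihood_loss(logits, torch.tensor([5, 9, 5, 9, 5, 9]))
loss.backward()
```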
In summary, the authors still believe this technology is premature for commercial deployment (primarily due to data concerns).
General thoughts
- Not an especially interesting / surprising paper - roughly what I imagined from the abstract
- Pretraining data quality makes a big difference
- However, a very well-done example of how to write this kind of thing
- Excellent eval, bias analysis, and limitations
- Limitations give a really good overview of current issues with LLMs and how people have addressed them.
- Was interested to see the clear statement that this tech is premature
- Like that they released their log-book