OPT: Open Pre-trained Transformer Language Models



  • Trained GPT-3 style models, but with “latest best practices in data collection and efficient training”
  • Releasing GPT-style models up to 175B params
  • 1000 80GB A100 GPUs (~150 TFLOP/s utilisation per GPU)




Training Setup

  • Normal init with std=0.006
  • Output layers scaled by 1/√(2L), where L is the number of layers (Megatron-LM-style init)
  • Seq len = 2048
  • AdamW
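The init scheme in the bullets above can be sketched as follows. This is a toy illustration with assumed names (`init_weights`, the layer/width values), not the paper's actual code; only the std=0.006 and 1/√(2L) scaling come from the paper:

```python
import math
import random
import statistics

def init_weights(n_layers, n_params, std=0.006, seed=0):
    """Sketch of OPT-style init: draw weights from N(0, 0.006), then
    scale weights feeding into residual streams by 1/sqrt(2 * n_layers)
    (Megatron-style). Returns (base weights, scaled output weights)."""
    rng = random.Random(seed)
    base = [rng.gauss(0.0, std) for _ in range(n_params)]
    residual_scale = 1.0 / math.sqrt(2 * n_layers)
    residual_out = [w * residual_scale for w in base]
    return base, residual_out

base, resid = init_weights(n_layers=96, n_params=10000)
print(round(statistics.stdev(base), 3))  # ~0.006
```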


Pretraining data:
  • Parts of the RoBERTa dataset (BookCorpus, Stories, CCNews)
  • The Pile
  • PushShift.io Reddit
Significant data processing was needed: specifically, deduplication and removing certain subsets that caused gradient spikes
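As a rough illustration of the dedup step, here is an exact-match version using content hashing. (The paper used fuzzier MinHash-based dedup; this sketch, with assumed names, only shows the basic idea.)

```python
import hashlib

def dedup(docs):
    """Drop duplicate documents by hash of normalised content.
    Exact-match only: a stand-in for the fuzzier MinHash-style
    dedup actually used for OPT's pretraining data."""
    seen = set()
    out = []
    for doc in docs:
        h = hashlib.sha256(doc.strip().lower().encode()).hexdigest()
        if h not in seen:
            seen.add(h)
            out.append(doc)
    return out

print(dedup(["a b c", "A b C ", "x y z"]))  # → ['a b c', 'x y z']
```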

Training efficiency

  • “Fully Sharded Data Parallel” (RTS?)
  • “Megatron-LM Tensor Parallelism”
  • Model weights FP16, Adam state in FP32
  • “dynamic loss scaling” - presume this is ALS, but not clear
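Dynamic loss scaling in mixed-precision training works roughly as follows: multiply the loss by a scale factor before backprop so FP16 gradients don't underflow; on overflow (inf/NaN gradients), skip the step and shrink the scale; after a run of clean steps, grow it again. A minimal sketch (the class name and constants are illustrative, not from the paper):

```python
class DynamicLossScaler:
    """Toy dynamic loss scaler: halve the scale on gradient overflow
    and skip the step; double it after `growth_interval` clean steps.
    The defaults are illustrative, not OPT's actual settings."""
    def __init__(self, scale=2.0**16, growth_interval=2000):
        self.scale = scale
        self.growth_interval = growth_interval
        self._good_steps = 0

    def update(self, found_overflow):
        """Returns True if the optimizer step should be applied."""
        if found_overflow:
            self.scale /= 2.0        # back off after inf/NaN grads
            self._good_steps = 0
            return False             # skip this step
        self._good_steps += 1
        if self._good_steps >= self.growth_interval:
            self.scale *= 2.0        # try a larger scale again
            self._good_steps = 0
        return True

scaler = DynamicLossScaler(scale=8.0, growth_interval=2)
print([scaler.update(f) for f in [False, True, False, False]], scaler.scale)
# → [True, False, True, True] 8.0
```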

Training process

  • Frequent hardware failures (35 manual restarts, 70+ automatic, and 100 hosts cycled over 2 months)
  • Frequent loss divergences → solution = previous checkpoint & lower lr
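The divergence-recovery loop can be sketched as below. This is a toy model of the procedure described (roll back to the last good checkpoint, resume with a lower learning rate); the function names, decay factor, and divergence condition are all made up for illustration:

```python
def train_with_restarts(steps, lr, checkpoint, diverged, lr_decay=0.7):
    """Toy training loop: on divergence, restore the last good
    checkpoint and lower the learning rate before resuming.
    `diverged(step, lr)` and `lr_decay` are illustrative stand-ins."""
    history = []
    step = 0
    while step < steps:
        if diverged(step, lr):
            step, state = checkpoint           # roll back to last good state
            lr *= lr_decay                     # resume with a lower lr
            history.append(("restart", step, round(lr, 3)))
        else:
            state = f"state@{step}"
            checkpoint = (step, state)         # save a good checkpoint
            history.append(("step", step, round(lr, 3)))
            step += 1
    return history

# Pretend the loss spikes at step 3 until the lr is low enough.
def diverged(step, lr):
    return step == 3 and lr > 0.5

h = train_with_restarts(5, lr=1.0, checkpoint=(0, "init"), diverged=diverged)
print(h[-1])
```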


Evaluations

Similar setup to GPT-3 (main comparison point)
Versus other language models at the same param size:
Similar perf to:
  • GPT-3
  • Chinchilla
  • Gopher
Worse than:
  • PaLM (gap attributed to PaLM's higher-quality & more diverse data)
However, this is on average; per-task performance differs substantially.

Bias & Toxicity Evaluations

Very thorough here; looks like a good example of how to do this well. Compared with GPT-3, OPT is:
  • Better at detecting hate speech
  • Worse (Table 4) or similar (Table 5) at exhibiting bias (differences again attributed to data)
  • Worse at generating toxic language
  • Similar at dialogue safety


“In general, we qualitatively observe that OPT-175B suffers from the same limitations noted in other LLMs”:
  • Not good with “declarative instructions or point-blank interrogatives” (see InstructGPT for addressing this)
  • Repetitiveness (solutions: unlikelihood training or best-first decoding)
  • Factual incorrectness (solution: retrieval-augmented models)
  • Toxicity & bias
“In summary, we still believe this technology is premature for commercial deployment” (primarily due to data concerns).

General thoughts

  • Not an especially interesting or surprising paper: it's what I imagined from the abstract
  • Pretraining data quality makes a big difference
  • However, a very well-done example of how to write this kind of thing
  • Excellent eval, bias analysis, and limitations
  • The limitations section gives a really good overview of current issues with LLMs and how people have addressed them.
  • Was interested to see the clear statement that this tech is premature
  • Like that they released their log-book