Introduction
- Trained GPT-3 style models, but with "latest best practices in data collection and efficient training"
- Releasing GPT-style models up to 175B params
- ~1000 80GB A100s (~150 TFLOP/s utilisation per GPU; rough compute check below)
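My back-of-envelope check on that compute budget, assuming the paper's ~180B training tokens and the standard C ≈ 6·N·D FLOPs rule of thumb (both assumptions I'm bringing in, not stated in these notes):

```python
# Back-of-envelope training compute via the common C ~= 6 * N * D rule of thumb.
params = 175e9                     # N: parameters
tokens = 180e9                     # D: training tokens (~180B per the paper; assumed here)
total_flops = 6 * params * tokens  # ~1.9e23 FLOPs
cluster = 1000 * 150e12            # 1000 GPUs x 150 TFLOP/s each
ideal_days = total_flops / cluster / 86400
print(f"{total_flops:.2e} FLOPs -> ~{ideal_days:.0f} ideal days")  # ~15 days
# The actual run took ~2 months, consistent with restarts, downtime, and ramp-up.
```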
Method
Models
Training Setup
- Normal init with std=0.006
- Output-layer init std scaled by 1/sqrt(2L), where L = number of layers (init sketch after this list)
- Seq len = 2048
- AdamW
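A minimal sketch of that init scheme; the parameter-name matching (`out_proj`) is my own illustrative convention, not the paper's code:

```python
import math
import torch.nn as nn

STD = 0.006

def init_weights(model: nn.Module, num_layers: int) -> None:
    """Normal(0, 0.006) everywhere; residual output projections get std / sqrt(2L)."""
    for name, p in model.named_parameters():
        if p.dim() < 2:
            continue  # leave biases / LayerNorm params at their defaults
        if name.endswith("out_proj.weight"):  # illustrative name for block output layers
            nn.init.normal_(p, mean=0.0, std=STD / math.sqrt(2 * num_layers))
        else:
            nn.init.normal_(p, mean=0.0, std=STD)
```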
Corpus
- Parts of RoBERTa dataset (BookCorpus, Stories, CCNews)
- The Pile
- PushShift.io Reddit
Significant data processing was needed: specifically, deduplication and removing subsets of the data that caused gradient spikes (minimal dedup sketch below)
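A minimal exact-match dedup sketch via content hashing; the real pipeline used fuzzier document-level matching (MinHash-style, if I recall correctly), so treat this as illustrative only:

```python
import hashlib

def dedup_exact(docs: list[str]) -> list[str]:
    """Keep the first occurrence of each document, matching on normalised text."""
    seen: set[str] = set()
    kept = []
    for doc in docs:
        digest = hashlib.sha256(" ".join(doc.lower().split()).encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept

print(dedup_exact(["The cat sat.", "the  cat sat.", "Another doc."]))
# -> ['The cat sat.', 'Another doc.']
```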
Training efficiency
- "Fully Sharded Data Parallel" (RTS?)
- "Megatron-LM Tensor Parallelism"
- Model weights FP16, Adam state in FP32
- "Dynamic loss scaling" - presume this is automatic loss scaling (ALS), but not clear (sketch below)
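Dynamic loss scaling as implemented in PyTorch AMP, which I believe matches what the paper means: grow the scale while gradients stay finite, halve it (and skip the step) on overflow. A generic sketch, not OPT's actual Megatron/FSDP loop:

```python
import torch

model = torch.nn.Linear(512, 512).cuda()
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)   # Adam state kept in FP32
scaler = torch.cuda.amp.GradScaler()                   # dynamic loss scaling

for _ in range(100):
    x = torch.randn(8, 512, device="cuda")
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = model(x).square().mean()                # forward in FP16
    opt.zero_grad(set_to_none=True)
    scaler.scale(loss).backward()                      # scale loss to avoid FP16 grad underflow
    scaler.step(opt)                                   # unscales; skips step if grads overflowed
    scaler.update()                                    # raise scale on success, halve on overflow
```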
Training process
- Frequent hardware failures (35 manual restarts, 70+ automatic, and 100 hosts cycled over 2 months)
- Frequent loss divergences → solution: restart from an earlier checkpoint and lower the LR (recovery loop sketched below)
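The recovery recipe as a runnable toy loop; the divergence heuristic, checkpoint cadence, and 0.5x LR cut are all my assumptions (the paper just says they rewound to an earlier checkpoint and lowered the LR):

```python
import copy
import torch

model = torch.nn.Linear(16, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
last_good = copy.deepcopy(model.state_dict())

for step in range(1000):
    x = torch.randn(32, 16)
    loss = (model(x) - x.sum(dim=1, keepdim=True)).square().mean()
    if not torch.isfinite(loss) or loss.item() > 1e3:   # divergence heuristic (assumed)
        model.load_state_dict(last_good)                # rewind to the last good checkpoint
        for g in opt.param_groups:
            g["lr"] *= 0.5                              # lower the LR and resume (factor assumed)
        continue
    opt.zero_grad()
    loss.backward()
    opt.step()
    if step % 100 == 0:
        last_good = copy.deepcopy(model.state_dict())   # periodic "checkpoint"
```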
Evaluations
Similar setup to GPT-3 (main comparison point)
Versus other language models at the same param size:
Similar perf to:
- GPT-3
- Chinchilla
- Gopher
Worse than:
- PaLM (attributed to PaLM's higher-quality & more diverse data)
However, this is on average; per-task performance differs substantially.
Bias & Toxicity Evaluations
Very thorough here - looks like a good example of how to do this well. Comparing vs GPT-3:
- Better at detecting hate speech
- Worse (Table 4) or similar (Table 5) at exhibiting bias (differences again attributed to data)
- Worse at generating toxic language
- Similar at dialogue safety
Limitations
In general, the authors qualitatively observe that OPT-175B suffers from the same limitations noted in other LLMs:
- Not good with "declarative instructions or point-blank interrogatives" (see InstructGPT for addressing this)
- Repetitiveness (solutions: unlikelihood training or best-first decoding; unlikelihood sketch after this list)
- Factual incorrectness (solution: retrieval-augmented models)
- Toxicity & bias
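For the repetition item above, a toy token-level unlikelihood loss in the spirit of Welleck et al. (2019): on top of the usual MLE term, penalise probability mass placed on tokens already seen in the context. The alpha weight and candidate construction are simplifications, not the exact recipe:

```python
import torch
import torch.nn.functional as F

def unlikelihood_loss(logits: torch.Tensor, targets: torch.Tensor, alpha: float = 1.0):
    """logits: (T, V); targets: (T,). MLE + penalty on repeating context tokens."""
    log_probs = F.log_softmax(logits, dim=-1)
    mle = F.nll_loss(log_probs, targets)
    penalty = logits.new_zeros(())
    for t in range(1, targets.size(0)):
        cands = targets[:t].unique()
        cands = cands[cands != targets[t]]        # never penalise the true next token
        if cands.numel():
            p = log_probs[t, cands].exp().clamp(max=1 - 1e-6)
            penalty = penalty + (-torch.log1p(-p)).sum()
    return mle + alpha * penalty / targets.size(0)

logits = torch.randn(6, 100, requires_grad=True)
loss = unlikelihood_loss(logits, torch.tensor([5, 9, 5, 9, 5, 9]))
loss.backward()
```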
In summary, the authors still believe this technology is premature for commercial deployment (primarily due to data concerns).
General thoughts
- Not an especially interesting / surprising paper - roughly what I imagined from the abstract
- Pretraining data quality makes a big difference
- However, a very well-done example of how to write this kind of thing
- Excellent eval, bias analysis, and limitations
- Limitations give a really good overview of current issues with LLMs and how people have addressed them.
- Was interested to see the clear statement that this tech is premature
- Like that they released their log-book