Introduction
BigScience Large Open-science Open-access Multilingual Language Model (BLOOM)
Motivation
- Costs of LLM training only affordable for big tech
- Prior to OPT & BLOOM, most LLMs were not publicly available
- Previous LLMs primarily trained on English-language text
Overview
- 176B params (released publicly)
- 46 natural languages, 13 programming languages
- Compute provided by the French govt., using the Jean Zay supercomputer
- Aim of paper is to document process for sake of community
Model
Training Data
- Uses ROOTS corpus
- Emphasis on needs and rights of “data subjects” (those who created the text or whom the text is about)
- And on reducing bias resulting from naive web-crawling
- Tools for visualising the dataset available on the 🤗 website
- Some web-crawled data (OSCAR dataset) still used for the sake of volume (38% of corpus); see the loading sketch below
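A minimal sketch (not from the paper) of peeking at one language slice of the web-crawled OSCAR data via the 🤗 `datasets` library; the dataset and config names below refer to the public OSCAR release and are assumptions (they may vary by `datasets` version), not the actual ROOTS pipeline:

```python
# Stream a slice of OSCAR rather than downloading the full crawl.
# Dataset/config/field names are assumptions about the public OSCAR release.
from datasets import load_dataset

oscar_fr = load_dataset(
    "oscar", "unshuffled_deduplicated_fr", split="train", streaming=True
)

for i, example in enumerate(oscar_fr):
    print(example["text"][:200])  # first 200 characters of each document
    if i == 2:
        break
```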
Model Architecture
- Balanced the tradeoff between existing, proven LLM architectures and promising but untested architectural innovations
- Chose a GPT-style causal decoder-only model because of its zero/few-shot abilities; finetuning 100B-param LLMs is unwieldy (see the causal-mask sketch after this list)
- Main objective here is zero-shot generalisation
- Results of their ablations suggest causal decoder-only models perform best at this
- Did not consider MoEs “due to a lack of widely used GPU-based codebases suitable for training them at scale”
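To make “causal decoder-only” concrete, a generic PyTorch sketch of the lower-triangular attention mask such models use (illustrative only, not the actual BLOOM/Megatron-DeepSpeed implementation):

```python
import torch

seq_len = 5
# Lower-triangular mask: position i may attend only to positions 0..i.
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

scores = torch.randn(seq_len, seq_len)                  # raw attention scores
scores = scores.masked_fill(~causal_mask, float("-inf"))  # block future positions
attn = torch.softmax(scores, dim=-1)                    # future positions get weight 0
print(attn)
```

This masking is what lets a single next-token objective serve directly for zero-shot prompting: generation only ever conditions on the left context.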
Numerics
- Started off with float16 but switched to bfloat16 because of “training instabilities” (they cite OPT and GLM-130B as examples); see the comparison below
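A quick illustration (generic PyTorch, not from the paper) of why bfloat16 is more robust: it keeps float32's exponent range, trading away mantissa precision, so large values that overflow to inf in float16 stay finite:

```python
import torch

# Compare the numeric ranges of the three formats.
for dtype in (torch.float16, torch.bfloat16, torch.float32):
    info = torch.finfo(dtype)
    print(f"{str(dtype):>16}  max={info.max:.3e}  smallest normal={info.tiny:.3e}")

# float16 overflows above ~6.55e4, so a large intermediate value becomes inf:
x = torch.tensor(70000.0)
print(x.to(torch.float16))   # inf (overflow)
print(x.to(torch.bfloat16))  # ~7.0e4 (representable, though less precise)
```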