🌸

BLOOM

Title
BLOOM: A 176B-Parameter Open-Access Multilingual Language Model
Authors
Le Scao et al. (BigScience Workshop)
Date
2022
Venue
DBLP
Keywords

Introduction

BigScience Large Open-science Open-access Multilingual Language Model (BLOOM)

Motivation

  1. Costs of LLM training only affordable for big tech
  2. Prior to OPT & BLOOM, most LLMs not publicly available
  3. Previous LLMs primarily trained on English

Overview

  1. 176B params, released publicly (loading sketch after this list)
  2. 46 natural languages, 13 programming languages
  3. Compute provided by the French govt., using the Jean Zay supercomputer
  4. Aim of the paper is to document the process for the sake of the community
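The released checkpoints are on the Hugging Face Hub. A minimal sketch of loading one with `transformers`, using the small 560M-parameter variant so it runs on a single machine (the full model id is `bigscience/bloom`); illustrative, not the paper's own training or evaluation code:

```python
# Minimal sketch: load a released BLOOM checkpoint with Hugging Face transformers.
# Assumes `transformers` and `torch` are installed; uses the 560M-parameter
# variant so it fits on one machine (the full "bigscience/bloom" needs far more memory).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "bigscience/bloom-560m"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Zero-shot generation: the model simply continues the prompt, no finetuning.
inputs = tokenizer("Translate to French: I like cheese.\nTranslation:", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```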

Model

Training Data

  • Uses ROOTS corpus
  • Emphasis on the needs and rights of “data subjects” (those who create the text or whom it is about)
  • Also an emphasis on reducing the bias that results from naive web-crawling
  • Tools for visualising dataset available on 🤗 website
  • Some web-crawled data (the OSCAR dataset) still used for the sake of volume (38% of the corpus); see the loading sketch below
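For the web-crawled portion, a minimal sketch of streaming a slice of OSCAR with the 🤗 `datasets` library. The dataset and config names here are assumptions for illustration and may have changed on the Hub; ROOTS itself combines OSCAR-derived web data with many curated sources and filtering steps:

```python
# Minimal sketch: stream a slice of the web-crawled OSCAR corpus with 🤗 datasets.
# Dataset/config names are illustrative assumptions, not the ROOTS pipeline itself.
from datasets import load_dataset

# Streaming avoids downloading the full dump; "unshuffled_deduplicated_fr" is one
# of OSCAR's per-language configurations.
oscar_fr = load_dataset("oscar", "unshuffled_deduplicated_fr", split="train", streaming=True)

for i, example in enumerate(oscar_fr):
    print(example["text"][:200])  # first 200 characters of each document
    if i == 2:
        break
```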

Model Architecture

  • Balanced the tradeoff between existing, proven LLM architectures and promising but untested architectural innovations
  • Chose a GPT-style causal decoder-only model because of its zero/few-shot abilities; finetuning 100B-param LLMs is unwieldy (see the sketch after this list)
    • Main objective here is zero-shot generalisation
    • Their architecture-ablation results suggest causal decoder models are best for this
  • Did not consider MoEs “due to a lack of widely used GPU-based codebases suitable for training them at scale”
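A minimal sketch of what “causal decoder-only” means in practice: each position attends only to earlier positions via a triangular mask, which is what enables left-to-right generation and zero-shot prompting. Illustrative code, not BLOOM's actual implementation:

```python
# Minimal sketch of causal (decoder-only) self-attention masking in PyTorch.
# Token t can only attend to tokens <= t; this is the defining property of a
# GPT-style causal decoder, as opposed to encoder or encoder-decoder models.
import torch
import torch.nn.functional as F

def causal_self_attention(x: torch.Tensor) -> torch.Tensor:
    """x: (batch, seq_len, d_model). Single-head attention with a causal mask."""
    batch, seq_len, d_model = x.shape
    q, k, v = x, x, x  # in a real layer these come from learned projections

    scores = q @ k.transpose(-2, -1) / d_model ** 0.5            # (batch, seq, seq)
    mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(mask, float("-inf"))              # block attention to future tokens
    return F.softmax(scores, dim=-1) @ v

out = causal_self_attention(torch.randn(2, 5, 16))
print(out.shape)  # torch.Size([2, 5, 16])
```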
Numerics:
  • Started off with float16 but switched to bfloat16 because of “training instabilities” (they cite OPT and GLM-130B as examples); see the sketch below
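A quick illustration of why the switch helps: float16 has a narrow exponent range and overflows easily, while bfloat16 keeps float32's exponent range at the cost of precision. Sketch using PyTorch, not tied to BLOOM's training stack:

```python
# Quick illustration of the float16 vs bfloat16 tradeoff.
# float16 has more mantissa bits (higher precision) but a narrow exponent range,
# so large activations/gradients overflow to inf; bfloat16 shares float32's
# exponent range, which tends to make large-scale training more stable.
import torch

print(torch.finfo(torch.float16).max)    # 65504.0  -> overflows easily
print(torch.finfo(torch.bfloat16).max)   # ~3.39e38 -> same range as float32

x = torch.tensor(70000.0)
print(x.to(torch.float16))   # inf (overflow)
print(x.to(torch.bfloat16))  # 70144.0 (representable, but less precise)
```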