Introduction
BigScience Large Open-science Open-access Multilingual Language Model (BLOOM)
Motivation
- Costs of LLM training are only affordable for big tech companies
- Prior to OPT & BLOOM, most LLMs were not publicly available
- Previous LLMs were trained primarily on English-language text
Overview
- 176B params, released publicly (a minimal loading sketch follows below)
- 46 natural languages, 13 programming languages
- Compute provided by the French government via the Jean Zay supercomputer
- Aim of the paper is to document the development process for the benefit of the community
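Since the weights are public, here is a minimal sketch of loading the checkpoint with the Hugging Face `transformers` library. The dtype, `device_map`, and prompt are illustrative assumptions, not the paper's setup, and `device_map="auto"` additionally requires `accelerate`:

```python
# Sketch: load the public BLOOM checkpoint and generate a continuation.
# The full 176B model needs multi-GPU sharding; the smaller variants
# (e.g. bigscience/bloom-560m) are convenient for local experiments.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "bigscience/bloom"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # BLOOM was trained in bfloat16 (see Numerics below)
    device_map="auto",           # shard across available devices (requires accelerate)
)

inputs = tokenizer("BLOOM is a multilingual language model that", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```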
Model
Training Data
- Uses ROOTS corpus
- Emphasis on needs and rights of “data subjects” (those who create text or whom it is about)
- And on reducing bias resulting from naive web-crawling
- Tools for visualising dataset available on 🤗 website
- Some web-crawled data (the OSCAR dataset) is still used for the sake of volume (38% of the corpus); a toy filtering sketch follows below
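To make the “naive web-crawling” point concrete, here is a toy, purely illustrative sketch of the kind of cleanup (simple quality heuristics plus exact deduplication) that separates a curated corpus from raw crawl dumps. The thresholds and helper names are assumptions for illustration, not the actual ROOTS pipeline:

```python
# Toy sketch of the kind of cleanup applied to web-crawled text before it
# enters a training corpus: simple quality heuristics plus exact deduplication.
# Illustration of the idea only, not the actual ROOTS processing pipeline.
import hashlib

def looks_like_natural_text(doc: str) -> bool:
    words = doc.split()
    if len(words) < 20:                       # drop very short fragments
        return False
    alpha_ratio = sum(c.isalpha() for c in doc) / max(len(doc), 1)
    return alpha_ratio > 0.6                  # drop markup/boilerplate-heavy docs

def dedup_and_filter(docs):
    seen = set()
    for doc in docs:
        digest = hashlib.sha256(doc.strip().encode("utf-8")).hexdigest()
        if digest in seen or not looks_like_natural_text(doc):
            continue
        seen.add(digest)
        yield doc

crawl = [
    "a short fragment",
    "This is a longer paragraph of ordinary prose " * 5,
    "This is a longer paragraph of ordinary prose " * 5,  # exact duplicate
]
print(len(list(dedup_and_filter(crawl))))  # -> 1
```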
Model Architecture
- Balanced the tradeoff between existing, proven LLM architectures and promising but untested architectural innovations
- Chose a GPT-style causal decoder-only model because of its zero/few-shot abilities; finetuning 100B-param LLMs is unwieldy (a causal-attention sketch follows after this list)
- Main objective here is zero-shot generalisation
- Results from their architecture investigation suggest causal decoder-only models perform best at this
- Did not consider MoEs “due to a lack of widely used GPU-based codebases suitable for training them at scale”
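A minimal sketch (not the paper's code) of what “causal decoder-only” means in practice: every position may attend only to itself and earlier positions, enforced with a lower-triangular attention mask:

```python
# Sketch: causal (autoregressive) self-attention masking, the property that
# distinguishes a GPT/BLOOM-style decoder from an encoder or encoder-decoder.
import torch
import torch.nn.functional as F

def causal_self_attention(x, w_q, w_k, w_v):
    # x: (seq_len, d_model); w_*: (d_model, d_head) projection matrices
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = (q @ k.T) / (k.shape[-1] ** 0.5)                 # (seq_len, seq_len)
    causal_mask = torch.tril(torch.ones_like(scores)).bool()  # lower-triangular
    scores = scores.masked_fill(~causal_mask, float("-inf"))  # hide future tokens
    return F.softmax(scores, dim=-1) @ v

seq_len, d_model, d_head = 8, 16, 16
x = torch.randn(seq_len, d_model)
proj = lambda: torch.randn(d_model, d_head) / d_model ** 0.5
print(causal_self_attention(x, proj(), proj(), proj()).shape)  # torch.Size([8, 16])
```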
Numerics:
- Started off with `float16` but switched to `bfloat16` because of “training instabilities” (they cite OPT and GLM-130B as examples)
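A small sketch of why the dtype matters: `bfloat16` keeps float32's exponent range, so large activations and gradients do not overflow to inf, at the cost of fewer mantissa bits; this is the usual explanation for the instabilities seen with `float16` at this scale:

```python
# Sketch: compare the numeric ranges of float16, bfloat16 and float32 in PyTorch.
import torch

for dtype in (torch.float16, torch.bfloat16, torch.float32):
    info = torch.finfo(dtype)
    print(f"{str(dtype):>15}: max={info.max:.3e}, eps={info.eps:.3e}")

# float16 overflows just above 6.5e4, so large intermediate values become inf;
# bfloat16 shares float32's ~3.4e38 range, trading precision for headroom.
x = torch.tensor(70000.0)
print(x.to(torch.float16))   # inf (out of float16 range)
print(x.to(torch.bfloat16))  # ~70144 (representable, with coarse rounding)
```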