😐

BERT

Title
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Authors
Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova
Date
2019
Venue
NAACL
Keywords
nlp
transformer
unsupervised
fine-tuning

Introduction

Pre-training language models has been shown to improve performance.
Pre-training typically uses unidirectional language models to learn general language representations.
Two existing strategies:
  1. feature-based: uses task-specific architectures that include pre-trained representations as additional features (e.g. ELMo)
  1. fine-tuning: minimal task-specific parameters ➡️ training on downstream tasks involves simply fine-tuning all pre-trained parameters.
Argument: unidirectionality limits architectures used during pre-training (e.g. in GPT).
Solution: BERT
Uses masked language model (MLM) pre-training objective.
It is an encoder model ➡️ works well for classification-type tasks
Has an A-B sentence input format for simple non-classification tasks like question answering

Contributions

  1. Demonstrate the importance of bidirectional pre-training.
  1. Demonstrate that pre-trained representations reduce the need for much task-specific engineering.
  1. SOTA for 11 NLP tasks.
  1. First fine-tuning based model that achieves SOTA on sentence-level and token-level tasks.

Related Work

Unsupervised Feature-based Approaches

Widely used, e.g. word2vec, GloVe.
Objectives include L→R language modelling and discriminating correct from incorrect words given left & right context.
Initially just word embeddings, also sentence and paragraph embeddings.
ELMo extracts context-sensitive features from a L→R and a R→L language model, where the representation of each token is the concatenation of the two directional representations.

Unsupervised Fine-tuning Approaches

Like feature based approaches, started with word embeddings, now sentence/para/document encoding.
Advantage for this approach = few params need to be trained from scratch.
↪️ success of GPT

Transfer Learning from Supervised Data

Has been work showing effective transfer from supervised language tasks (with large datasets).

BERT

2 steps:
  1. pre-training: several different pre-training tasks for single set of params
  1. fine-tuning: multiple tasks each with separate params, fine-tuned from pretrained params
Same architecture for both, apart from the output layers.

Model Architecture

Bidirectional Transformer Encoder
BERT Base: L=12, H=768, A=12, T=110M
BERT Large: L=24, H=1024, A=16, T=340M
(L=# layers / transformer blocks, H = hidden size, A = # self-attention heads, T = total # params)
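As a sanity check on T, the rough parameter counts can be recovered from L and H alone. A minimal sketch, assuming the ~30k WordPiece vocab and 512 max positions described later, and ignoring biases and layer norms:

```python
# Back-of-envelope BERT parameter count from (L, H); not the paper's exact accounting.
def approx_params(L, H, vocab=30522, max_pos=512, ffn_mult=4):
    embeddings = (vocab + max_pos + 2) * H      # token + position + segment embedding tables
    attention = 4 * H * H                       # Q, K, V and output projections per layer
    ffn = 2 * H * (ffn_mult * H)                # feed-forward H -> 4H -> H per layer
    return embeddings + L * (attention + ffn)   # biases / layer norms omitted (small)

print(f"BERT Base : ~{approx_params(12, 768) / 1e6:.0f}M")    # ~109M (paper reports 110M)
print(f"BERT Large: ~{approx_params(24, 1024) / 1e6:.0f}M")   # ~334M (paper reports 340M)
```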

I/O Representations

WordPiece embeddings used, with 30k token vocab.
First token is always the special [CLS] token. Its final hidden state is used as the aggregate rep. for classification tasks.
Sentence pairs separated with [SEP] token. Learned embedding added to each token to indicate which sentence it belongs to.
Note the distinction between the learned input embeddings E, and the final hidden states T. Also, the C state is used as the aggregate state for classification tasks.
Segment embeddings are the learned embeddings that indicate which sentence (A or B) a token belongs to.
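A minimal sketch of how a packed A-B input and its summed input representation could be built (toy vocab and hidden size, randomly initialised tables standing in for the learned embeddings):

```python
import numpy as np

rng = np.random.default_rng(0)
H = 8                                   # toy hidden size (768 in BERT Base)
vocab = {"[CLS]": 0, "[SEP]": 1, "the": 2, "dog": 3, "barked": 4, "it": 5, "was": 6, "loud": 7}

# Randomly initialised embedding tables (learned in the real model).
tok_emb = rng.normal(size=(len(vocab), H))   # one row per WordPiece token
seg_emb = rng.normal(size=(2, H))            # row 0 = sentence A, row 1 = sentence B
pos_emb = rng.normal(size=(512, H))          # learned absolute position embeddings

def encode_pair(sent_a, sent_b):
    tokens = ["[CLS]"] + sent_a + ["[SEP]"] + sent_b + ["[SEP]"]
    segments = [0] * (len(sent_a) + 2) + [1] * (len(sent_b) + 1)
    ids = [vocab[t] for t in tokens]
    # Input representation = token embedding + segment embedding + position embedding.
    x = tok_emb[ids] + seg_emb[segments] + pos_emb[: len(ids)]
    return tokens, x

tokens, x = encode_pair(["the", "dog", "barked"], ["it", "was", "loud"])
print(tokens)      # ['[CLS]', 'the', 'dog', 'barked', '[SEP]', 'it', 'was', 'loud', '[SEP]']
print(x.shape)     # (9, 8): one summed embedding per input position
```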

Pre-training

Unlike previous approaches, pre-training tasks are not unidirectional. Two tasks are used:
Task #1: Masked LM
Standard LMs can only be trained unidirectionally as bidirectional conditioning would allow each word to indirectly "see itself".
Mask a random 15% of the input tokens and then predict them. 80% of the time, masking = replacing the input token with [MASK]; 10% of the time with a random token; 10% of the time unchanged.
Unlike denoising auto-encoders, only the masked words are predicted, not the whole input.
Final hidden vectors for the masked tokens are fed into a softmax over the vocabulary.
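A sketch of the masking rule above in plain Python (the toy vocabulary stands in for the 30k WordPiece vocab):

```python
import random

VOCAB = ["[MASK]", "the", "dog", "barked", "at", "cat", "loudly", "a"]  # toy stand-in vocab

def mask_for_mlm(tokens, mask_prob=0.15, rng=random.Random(0)):
    """Return (possibly corrupted) input tokens and the per-position prediction targets."""
    inputs, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:              # 15% of positions are chosen for prediction
            targets.append(tok)                   # loss is computed only at these positions
            r = rng.random()
            if r < 0.8:
                inputs.append("[MASK]")           # 80%: replace with [MASK]
            elif r < 0.9:
                inputs.append(rng.choice(VOCAB))  # 10%: replace with a random token
            else:
                inputs.append(tok)                # 10%: keep the original token
        else:
            inputs.append(tok)
            targets.append(None)                  # no prediction target here
    return inputs, targets

print(mask_for_mlm(["the", "dog", "barked", "at", "the", "cat", "loudly"]))
```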
Task #2: Next Sentence Prediction (NSP)
Uses the [CLS] token's final hidden state C to predict whether sentence B actually follows sentence A.
Trained on inputs consisting of sentence pairs A, B. A is the current sentence. 50% of the time, B is the actual next sentence (label = IsNext); the other 50% of the time, B is a random sentence from the corpus (label = NotNext).
Corpus = BooksCorpus (800M words) and English Wikipedia (2,500M words).
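A sketch of how an NSP training pair could be drawn from such a corpus (assuming documents are already split into sentences; IsNext/NotNext labels as described above):

```python
import random

def make_nsp_example(docs, rng=random.Random(0)):
    """docs: list of documents, each a list of sentences. Returns (sent_a, sent_b, is_next)."""
    doc = rng.choice(docs)
    i = rng.randrange(len(doc) - 1)          # pick a sentence that has a successor
    sent_a = doc[i]
    if rng.random() < 0.5:
        return sent_a, doc[i + 1], True      # IsNext: the actual next sentence
    other = rng.choice(docs)                 # NotNext: a random sentence from the corpus
    return sent_a, rng.choice(other), False

docs = [["He went out.", "It was raining.", "He came back."],
        ["The market fell.", "Traders panicked."]]
for _ in range(3):
    print(make_nsp_example(docs))
```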

Fine-tuning

Bidirectional attention makes adapting to target task very simple. We just feed in task-specific input and outputs and fine-tune end-to-end.
A-B sentence setup for pre-training makes tasks like paraphrasing, entailment and question answering easy to encode.
Compared to pre-training, fine-tuning is relatively inexpensive. All of the results in the paper can be replicated in at most 1 hour on a single Cloud TPU, or a few hours on a GPU, starting from the exact same pre-trained model.
⬆️ really amazing!
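A minimal PyTorch sketch of this recipe for a sentence-level classification task: one new linear layer over the [CLS] state C, with everything fine-tuned end-to-end. The encoder here is a small stand-in for the pre-trained model, and the hyperparameters mirror the GLUE setup reported below.

```python
import torch
import torch.nn as nn

H, num_labels = 768, 2

# Stand-in for a pre-trained BERT encoder: anything mapping token ids -> (batch, seq, H).
encoder = nn.Sequential(
    nn.Embedding(30522, H),
    nn.TransformerEncoder(nn.TransformerEncoderLayer(d_model=H, nhead=12, batch_first=True),
                          num_layers=2))

classifier = nn.Linear(H, num_labels)   # the only task-specific parameters added
optimizer = torch.optim.AdamW(list(encoder.parameters()) + list(classifier.parameters()),
                              lr=2e-5)  # small LR typical for fine-tuning
loss_fn = nn.CrossEntropyLoss()

def training_step(token_ids, labels):
    hidden = encoder(token_ids)          # (batch, seq, H)
    cls = hidden[:, 0]                   # C = final hidden state of the [CLS] token
    loss = loss_fn(classifier(cls), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# One toy step with random data (batch size 32, as in the GLUE experiments).
print(training_step(torch.randint(0, 30522, (32, 16)), torch.randint(0, num_labels, (32,))))
```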

Experiments

GLUE

Contains the following tasks (standard GLUE task overview):
  • QQP (binary classification): are the two questions semantically equivalent?
  • QNLI (binary classification): does sentence B contain the answer to the question in sentence A?
  • SST-2 (binary classification): is the movie review positive or negative?
  • CoLA (binary classification): is the sentence grammatical or ungrammatical?
  • STS-B (regression): how similar are sentences A and B?
  • MRPC (binary classification): is sentence B a paraphrase of sentence A?
  • MNLI (3-class classification): does sentence A entail, contradict, or neither with respect to sentence B?
  • RTE (binary classification): does sentence A entail sentence B?
  • WNLI (binary classification): sentence B replaces sentence A's ambiguous pronoun with one of the nouns - is this the correct noun?
Results:
Both BERT Base and BERT Large outperform all prior systems on all GLUE tasks by a substantial margin, with BERT Large clearly ahead of BERT Base.
Approach for BERT:
  • only extra params needed are for final classification layer
  • batch size = 32
  • fine-tuning over 3 epochs over each dataset

SQuAD v1-2 & SWAG

Results tables for SQuAD v1.1, SQuAD v2.0, and SWAG (BERT Large sets a new SOTA on each).

Ablation Studies

Pre-training

NSP task helps a decent amount on a couple of tasks. Bidirectionality gives a big jump on a couple of tasks!

Model Size

Not only does bigger = better, we haven't found the limit yet.
It has long been known that increasing the model size will lead to continual improvements on large-scale tasks such as machine translation and language modeling, which is demonstrated by the LM perplexity of held-out training data shown in Table 6. However, we believe that this is the first work to demonstrate convincingly that scaling to extreme model sizes also leads to large improvements on very small scale tasks, provided that the model has been sufficiently pre-trained.

Feature-based approach

Advantages of this approach: a) some tasks need a task-specific architecture, so transferring learned features is the only option; b) huge computational savings, since an expensive representation of the training data can be pre-computed once and then reused with cheaper models on top.
The feature-based rows reflect which layers' activations of the frozen model are used as features for the target task. When multiple layers are concatenated, performance is very strong, almost as good as the fine-tuning approach. Using just the embeddings, though, is noticeably weaker.
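A sketch of the computational-savings point in (b): the expensive contextual features are computed once, cached, and a cheap classifier is trained on top. `frozen_encoder` is a stand-in for running BERT and concatenating the chosen layers; the data here is random.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def frozen_encoder(sentences):
    """Stand-in for a frozen BERT: one fixed-size feature vector per sentence
    (e.g. the concatenation of the top few layers' token states)."""
    return rng.normal(size=(len(sentences), 768))

sentences = [f"example sentence {i}" for i in range(200)]
labels = rng.integers(0, 2, size=200)

# The expensive part runs exactly once; the cached features can feed many cheap models.
features = frozen_encoder(sentences)
np.save("cached_features.npy", features)

cheap_model = LogisticRegression(max_iter=1000).fit(features, labels)
print("train accuracy:", cheap_model.score(features, labels))
```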

Anki

Key innovation of BERT model
It jointly conditions on both left and right context
BERT modification for fine-tuning
Only an additional output layer is needed; all pre-trained parameters are then fine-tuned end-to-end
At a high-level, what is the input to BERT?
Either a single sentence, or a sentence pair (sentence = single span of contiguous text, not an actual sentence)
BERT special tokens
  1. [CLS]: First token of every sequence → corresponding final hidden state used as output for classification tasks
  1. [SEP]: sentence separator
  1. [MASK]: replaces tokens selected for MLM prediction (used in pre-training only)
BERT transformer token input representation
The sum of:
  1. Token embedding
  1. Segment (i.e. sentence A or B) embedding
  1. Position embedding
BERT output representation for word prediction tasks (e.g. MLM)
Each final hidden vector is fed into a softmax over the vocabulary
BERT pretraining loss
Sum of NLL for MLM & NSP
BERT MLM task process
  1. Random 15% of input tokens masked
  1. When masking:
      • 80%: replace with [MASK]
      • 10%: random token
      • 10%: unchanged
BERT NSP task process
  1. Binary output for [CLS] indicating if B follows A in the corpus
  1. 50% of the time true, 50% false
How does BERT deal with the quadratic sequence length cost?
Two phases of pre-training:
  1. 90%: seq length 128
  1. 10%: seq length 512
Why does BERT use two phases of pre-training?
To combat the transformer's quadratic sequence length cost
How does BERT encode SQuAD (just input)
Question = sentence A, passage = sentence B
How does BERT make predictions for SQuAD
Over just the passage features:
  1. Output token vectors are each multiplied by learned start and end embedding vectors
  1. Predicted span indices = $(\hat{i}, \hat{j}) = \arg\max_{i \le j} (S \cdot T_i + E \cdot T_j)$
How is BERT's loss computed for SQuAD
  1. Output token vectors are each multiplied by learned start and end embedding vectors
  1. Start probability $P_i = \frac{e^{S \cdot T_i}}{\sum_j e^{S \cdot T_j}}$; similarly for the end with $E$
  1. Loss = sum of the NLLs of the correct start and end positions: $-\log P^{\text{start}}_{s^*} - \log P^{\text{end}}_{e^*}$
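A numpy sketch of these three cards together: learned start/end vectors S and E are dotted with the passage token states T_i, the span score is S·T_i + E·T_j with j ≥ i, and the loss is the NLL of the true start and end positions (all arrays random here).

```python
import numpy as np

rng = np.random.default_rng(0)
H, seq_len = 768, 20                              # toy passage length
T = rng.normal(size=(seq_len, H))                 # final hidden states for passage tokens
S, E = rng.normal(size=H), rng.normal(size=H)     # learned start and end vectors

def softmax(z):
    z = z - z.max()
    return np.exp(z) / np.exp(z).sum()

p_start, p_end = softmax(T @ S), softmax(T @ E)

# Prediction: highest-scoring span (i, j) with j >= i, score = S.T_i + E.T_j.
scores = (T @ S)[:, None] + (T @ E)[None, :]
scores[np.tril_indices(seq_len, k=-1)] = -np.inf  # disallow spans with end before start
i_hat, j_hat = np.unravel_index(np.argmax(scores), scores.shape)

# Training loss for a labelled example with true span (s*, e*).
s_true, e_true = 3, 7
loss = -np.log(p_start[s_true]) - np.log(p_end[e_true])
print((i_hat, j_hat), loss)
```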
How is BERT adapted to handle SQuAD v2
SQuAD v2 has a "no answer" option ➡️ to predict this, span starts & ends at [CLS]