Contents:
- Introduction
- Contributions
- Related Work: Unsupervised Feature-based Approaches · Unsupervised Fine-tuning Approaches · Transfer Learning from Supervised Data
- BERT: Model Architecture · I/O Representations · Pre-training · Fine-tuning
- Experiments: GLUE · SQuAD v1-2 & SWAG
- Ablation Studies: Pre-training · Model Size · Feature-based approach
- Anki
Introduction
Pre-training language models has been shown to improve performance.
Pre-training typically uses unidirectional language models to learn general language representations.
Two existing strategies:
- feature-based: uses task-specific architectures that include pre-trained representations as additional features (e.g. ELMo)
- fine-tuning: minimal task-specific parameters ➡️ training on downstream tasks involves simply fine-tuning all pre-trained parameters.
Argument: unidirectionality limits architectures used during pre-training (e.g. in GPT).
Solution: BERT
Uses masked language model (MLM) pre-training objective.
It is an encoder model ➡️ works well for classification-type tasks
Has an A-B sentence-pair input format that handles non-classification tasks like question answering
Contributions
- Demonstrate the importance of bidirectional pre-training.
- Demonstrate that pre-trained representations reduce the need for much task-specific engineering.
- SOTA for 11 NLP tasks.
- First fine-tuning based model that achieves SOTA on sentence-level and token-level tasks.
Related Work
Unsupervised Feature-based Approaches
Widely used, e.g. word2vec, GloVe.
Objectives include L→R language modelling and discriminating correct from incorrect words given left and right context.
Initially just word embeddings, also sentence and paragraph embeddings.
ELMo extracts context-sensitive features from a L→R and a R→L language model, where the representation of each token is the concatenation of the two directional representations.
Unsupervised Fine-tuning Approaches
Like feature based approaches, started with word embeddings, now sentence/para/document encoding.
Advantage for this approach = few params need to be trained from scratch.
↪️ success of GPT
Transfer Learning from Supervised Data
There has been work showing effective transfer from supervised language tasks with large datasets.
BERT
2 steps:
- pre-training: several different pre-training tasks for single set of params
- fine-tuning: multiple tasks each with separate params, fine-tuned from pretrained params
Same architecture for both, apart from the output layers.
Model Architecture
Bidirectional Transformer Encoder
BERT Base: L=12, H=768, A=12, T=110M
BERT Large: L=24, H=1024, A=16, T=340M
(L = # layers / transformer blocks, H = hidden size, A = # self-attention heads, T = total # params)
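As a sanity check on those numbers, a rough parameter count can be derived from L and H alone. The vocab size of 30,522 and max position of 512 come from the published BERT configuration; the helper below is an illustrative back-of-envelope sketch, not the official code:

```python
def approx_bert_params(L, H, vocab=30522, max_pos=512, ffn_mult=4):
    """Rough transformer-encoder parameter count; ignores small terms like the pooler."""
    embeddings = (vocab + max_pos + 2) * H               # token + position + segment embeddings
    attention = 4 * (H * H + H)                          # Q, K, V and output projections (+ biases)
    ffn = H * (ffn_mult * H) + ffn_mult * H + (ffn_mult * H) * H + H  # two linear layers (+ biases)
    layer_norms = 2 * 2 * H                              # two LayerNorms per block (scale + bias)
    return embeddings + L * (attention + ffn + layer_norms)

print(f"BERT Base:  ~{approx_bert_params(12, 768) / 1e6:.0f}M")   # ~109M (quoted as 110M)
print(f"BERT Large: ~{approx_bert_params(24, 1024) / 1e6:.0f}M")  # ~334M (quoted as 340M)
```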
I/O Representations
WordPiece embeddings used, with 30k token vocab.
First token is always the special [CLS] token. Its final hidden state is used as the aggregate representation for classification tasks.
Sentence pairs separated with [SEP] token. Learned embedding added to each token to indicate which sentence it belongs to.
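A minimal sketch of this input construction in PyTorch (not from the paper; the token ids are illustrative placeholders and the embedding tables are randomly initialised here):

```python
import torch
import torch.nn as nn

H, VOCAB, MAX_LEN = 768, 30522, 512
tok_emb = nn.Embedding(VOCAB, H)    # WordPiece token embeddings
seg_emb = nn.Embedding(2, H)        # segment embedding: sentence A = 0, sentence B = 1
pos_emb = nn.Embedding(MAX_LEN, H)  # learned (not sinusoidal) position embeddings

# illustrative ids for "[CLS] <sentence A tokens> [SEP] <sentence B tokens> [SEP]"
token_ids   = torch.tensor([[101, 7592, 2088, 102, 2129, 2024, 2017, 102]])
segment_ids = torch.tensor([[0,   0,    0,    0,   1,    1,    1,    1]])
positions   = torch.arange(token_ids.size(1)).unsqueeze(0)

input_embeddings = tok_emb(token_ids) + seg_emb(segment_ids) + pos_emb(positions)
print(input_embeddings.shape)  # torch.Size([1, 8, 768]), one vector per input token
```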
Pre-training
Unlike previous approaches, pre-training tasks are not unidirectional. Two tasks are used:
Task #1: Masked LM
Standard LMs can only be trained unidirectionally as bidirectional conditioning would allow each word to indirectly "see itself".
Randomly mask 15% of the input tokens and then predict them. 80% of the time, masking = replacing the input token with [MASK]; 10% of the time with a random token; 10% of the time unchanged.
Unlike denoising auto-encoders, only the masked words are predicted, not the whole input reconstructed.
The final hidden vectors of the masked tokens are fed into a softmax over the vocabulary.
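A sketch of that 80/10/10 masking rule (illustrative preprocessing, not the paper's code; the [MASK] id and the -100 "ignore" label follow common PyTorch convention):

```python
import random

MASK_ID = 103  # [MASK] in the standard BERT WordPiece vocab

def mask_tokens(token_ids, vocab_size, mask_prob=0.15):
    """Return (corrupted inputs, labels); labels are -100 wherever no prediction is made."""
    inputs, labels = list(token_ids), [-100] * len(token_ids)
    for i in range(len(inputs)):
        if random.random() < mask_prob:
            labels[i] = inputs[i]                        # only masked positions enter the loss
            r = random.random()
            if r < 0.8:
                inputs[i] = MASK_ID                      # 80%: replace with [MASK]
            elif r < 0.9:
                inputs[i] = random.randrange(vocab_size) # 10%: replace with a random token
            # else: 10%: keep the original token unchanged
    return inputs, labels

print(mask_tokens([7592, 2088, 2129, 2024, 2017], vocab_size=30522))
```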
Task #2: Next Sentence Prediction (NSP)
Uses the final hidden state C of the initial [CLS] token to predict whether sentence B follows sentence A.
Trained on inputs consisting of sentence pairs A, B. 50% of the time B is the actual next sentence after A (label: IsNext); the other 50% of the time B is a random sentence from the corpus (label: NotNext).
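A sketch of how those A/B training pairs could be drawn (assuming `docs` is a list of documents, each a list of sentences; helper names are hypothetical):

```python
import random

def make_nsp_example(docs):
    """Return (sentence_a, sentence_b, is_next_label) with a 50/50 IsNext/NotNext split."""
    doc = random.choice(docs)
    i = random.randrange(len(doc) - 1)        # pick a sentence that has a successor
    sent_a = doc[i]
    if random.random() < 0.5:
        sent_b, is_next = doc[i + 1], 1       # IsNext: the true next sentence
    else:
        sent_b, is_next = random.choice(random.choice(docs)), 0  # NotNext: a random sentence
    return sent_a, sent_b, is_next

docs = [["Sentence one.", "Sentence two.", "Sentence three."],
        ["Another doc.", "With more text."]]
print(make_nsp_example(docs))
```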
Corpus = BooksCorpus (800M words) and English Wikipedia (2,500M words).
Fine-tuning
Bidirectional attention makes adapting to target tasks very simple. We just feed in the task-specific inputs and outputs and fine-tune end-to-end.
A-B sentence setup for pre-training makes tasks like paraphrasing, entailment and question answering easy to encode.
Compared to pre-training, fine-tuning is relatively inexpensive. All of the results in the paper can be replicated in at most 1 hour on a single Cloud TPU, or a few hours on a GPU, starting from the exact same pre-trained model.
⬆️ really amazing!
Experiments
GLUE
Contains the following tasks:
Standard GLUE task overview

| Task | Type | Description |
| --- | --- | --- |
| QNLI | binary classification | Does sentence B contain the answer to the question in sentence A? |
| SST-2 | binary classification | Is the movie review positive or negative? |
| CoLA | binary classification | Is the sentence grammatical or ungrammatical? |
| MRPC | binary classification | Is sentence B a paraphrase of sentence A? |
| MNLI | multi-class classification | Does sentence A entail, contradict, or stand neutral with respect to sentence B? |
| RTE | binary classification | Does sentence A entail sentence B? |
| WNLI | binary classification | Sentence B replaces sentence A's ambiguous pronoun with one of the nouns - is this the correct noun? |
Results:
Approach for BERT:
- only extra params needed are for final classification layer
- batch size = 32
- fine-tuning over 3 epochs over each dataset
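A minimal sketch of that setup: the only new parameters are a single classification layer applied to the final [CLS] hidden state C (the shapes and the random stand-in for C are assumptions for illustration):

```python
import torch
import torch.nn as nn

H, num_labels, batch_size = 768, 2, 32
classifier = nn.Linear(H, num_labels)        # the only task-specific parameters

C = torch.randn(batch_size, H)               # stand-in for the batch of final [CLS] vectors
labels = torch.randint(0, num_labels, (batch_size,))
loss = nn.functional.cross_entropy(classifier(C), labels)  # softmax + NLL over the labels
loss.backward()   # in full fine-tuning this gradient would also flow through all of BERT's parameters
```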
SQuAD v1-2 & SWAG
Ablation Studies
Pre-training
Model Size
It has long been known that increasing the model size will lead to continual improvements on large-scale tasks such as machine translation and language modeling, which is demonstrated by the LM perplexity of held-out training data shown in Table 6. However, we believe that this is the first work to demonstrate convincingly that scaling to extreme model sizes also leads to large improvements on very small scale tasks, provided that the model has been sufficiently pre-trained.
Feature-based approach
Advantages of this approach: a) some tasks require a task-specific architecture, so transferring learned features is the only option; b) major computational savings from pre-computing an expensive representation of the training data once and then running cheaper models on top.
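As an illustration of point (b), here is a sketch using the Hugging Face transformers library (a modern convenience, not something from the paper): run BERT once with gradients disabled, cache the contextual features, and train a cheap model on top of the cache.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased").eval()

sentences = ["An example training sentence.", "Another one."]   # placeholder data
with torch.no_grad():                                           # no backprop through BERT
    batch = tokenizer(sentences, padding=True, return_tensors="pt")
    features = bert(**batch).last_hidden_state                  # (batch, seq_len, 768)

# `features` can be written to disk once and reused by a small task model
# (e.g. a linear classifier or BiLSTM) for many cheap training runs.
```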
Anki
Key innovation of BERT model
It jointly conditions on both left and right context
BERT modification for fine-tuning
Only the final layer needs to be changed
At a high-level, what is the input to BERT?
Either a single sentence, or a sentence pair
(sentence = single span of contiguous text, not an actual sentence)
BERT special tokens
- [CLS]: First token of every sequence → corresponding final hidden state used as output for classification tasks
- [SEP]: sentence separator
- [MASK]: replaces masked-out tokens during MLM pre-training
BERT transformer token input representation
The sum of:
- Token embedding
- Segment (i.e. sentence A or B) embedding
- Position embedding
BERT output representation for word prediction tasks (e.g. MLM)
Each final hidden vector is fed into a softmax over the vocabulary
BERT pretraining loss
Sum of NLL for MLM & NSP
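A toy sketch of that summed loss (all tensors below are random stand-ins with assumed shapes, just to show the two cross-entropy terms being added):

```python
import torch
import torch.nn.functional as F

batch, seq_len, vocab = 8, 128, 30522
mlm_logits = torch.randn(batch, seq_len, vocab)                    # per-token vocab scores
mlm_labels = torch.full((batch, seq_len), -100, dtype=torch.long)  # -100 = not a masked position
mlm_labels[:, 5] = 42                                              # pretend one token was masked
nsp_logits = torch.randn(batch, 2)                                 # IsNext / NotNext scores from C
nsp_labels = torch.randint(0, 2, (batch,))

loss = (F.cross_entropy(mlm_logits.view(-1, vocab), mlm_labels.view(-1), ignore_index=-100)
        + F.cross_entropy(nsp_logits, nsp_labels))
print(loss)
```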
BERT MLM task process
- Random 15% of input tokens masked
- When masking:
- 80%: replace with [MASK]
- 10%: random token
- 10%: unchanged
BERT NSP task process
- Binary output for [CLS] indicating if B follows A in the corpus
- 50% of the time true, 50% false
How does BERT deal with the quadratic sequence length cost?
Two phases of pre-training:
- 90%: seq length 128
- 10%: seq length 512
Why does BERT use two phases of pre-training?
To combat the transformer's quadratic sequence length cost
How does BERT encode SQuAD (just input)
Question = sentence A, passage = sentence B
How does BERT make predictions for SQuAD
Over just the passage (sentence B) features:
- Each output token vector $T_i$ is dotted with a learned start vector $S$ and end vector $E$
- Predicted span indices = $\operatorname{argmax}_{j \geq i}\,(S \cdot T_i + E \cdot T_j)$
How is BERT's loss computed for SQuAD
- Output token vectors are each dotted with the learned start and end vectors $S$ and $E$
- Start probabilities: $P_i = \frac{e^{S \cdot T_i}}{\sum_j e^{S \cdot T_j}}$, similarly for the end position
- Loss = sum of the negative log-likelihoods of the correct start and end positions
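A tiny sketch of that span scoring (tensor shapes and the random stand-ins are assumptions, not the paper's code):

```python
import torch

H, seq_len = 768, 384
T = torch.randn(seq_len, H)               # final hidden vectors of the passage tokens
S, E = torch.randn(H), torch.randn(H)     # learned start and end vectors

start_scores, end_scores = T @ S, T @ E   # S·T_i and E·T_j for every position
start_probs = torch.softmax(start_scores, dim=0)   # the P_i defined above

# predicted span = argmax over j >= i of S·T_i + E·T_j
span_scores = start_scores[:, None] + end_scores[None, :]
valid = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool))   # keep only j >= i
best = span_scores.masked_fill(~valid, float("-inf")).argmax()
i, j = divmod(best.item(), seq_len)
print(i, j)
```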
How is BERT adapted to handle SQuAD v2
SQuAD v2 has a "no answer" option ➡️ to predict this, span starts & ends at [CLS]