Contents:
- Introduction
- Contributions
- Related Work: Unsupervised Feature-based Approaches · Unsupervised Fine-tuning Approaches · Transfer Learning from Supervised Data
- BERT: Model Architecture · I/O Representations · Pre-training · Fine-tuning
- Experiments: GLUE · SQuAD v1-2 & SWAG
- Ablation Studies: Pre-training · Model Size · Feature-based approach
- Anki
Introduction
Pre-training language models has been shown to improve performance.
Pre-training typically uses unidirectional language models to learn general language representations.
Two existing strategies:
- feature-based: uses task-specific architectures that include pre-trained representations as additional features (e.g. ELMo)
- fine-tuning: minimal task-specific parameters ➡️ training on downstream tasks involves simply fine-tuning all pre-trained parameters.
Argument: unidirectionality limits architectures used during pre-training (e.g. in GPT).
Solution: BERT
Uses masked language model (MLM) pre-training objective.
It is an encoder model ➡️ works well for classification-type tasks
Has an A-B sentence-pair input format that handles non-classification tasks like question answering
Contributions
- Demonstrate the importance of bidirectional pre-training.
- Demonstrate that pre-trained representations reduce the need for much task-specific engineering.
- SOTA for 11 NLP tasks.
- First fine-tuning based model that achieves SOTA on sentence-level and token-level tasks.
Related Work
Unsupervised Feature-based Approaches
Widely used, e.g. word2vec, GloVe.
Objectives include L→R language modelling and discriminating correct from incorrect words given left and right context.
Initially just word embeddings, also sentence and paragraph embeddings.
ELMo extracts context-sensitive features from a L→R and a R→L language model, where the representation of each token is the concatenation of the two directional representations.
Unsupervised Fine-tuning Approaches
Like feature based approaches, started with word embeddings, now sentence/para/document encoding.
Advantage for this approach = few params need to be trained from scratch.
↪️ success of GPT
Transfer Learning from Supervised Data
There has been work showing effective transfer from supervised language tasks with large datasets.
BERT
2 steps:
- pre-training: several different pre-training tasks for single set of params
- fine-tuning: multiple tasks each with separate params, fine-tuned from pretrained params
Same architecture for both, apart from the output layers.
Model Architecture
Bidirectional Transformer Encoder
BERT Base: L=12, H=768, A=12, T=110M
BERT Large: L=24, H=1024, A=16, T=340M
(L = # layers / transformer blocks, H = hidden size, A = # self-attention heads, T = total # params)
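As a sanity check on those numbers, a rough parameter count can be derived from L and H alone. The vocab size of 30,522 and max position of 512 come from the published BERT configuration; the helper below is an illustrative back-of-envelope sketch, not the official code:

```python
def approx_bert_params(L, H, vocab=30522, max_pos=512, ffn_mult=4):
    """Rough transformer-encoder parameter count; ignores small terms like the pooler."""
    embeddings = (vocab + max_pos + 2) * H               # token + position + segment embeddings
    attention = 4 * (H * H + H)                          # Q, K, V and output projections (+ biases)
    ffn = H * (ffn_mult * H) + ffn_mult * H + (ffn_mult * H) * H + H  # two linear layers (+ biases)
    layer_norms = 2 * 2 * H                              # two LayerNorms per block (scale + bias)
    return embeddings + L * (attention + ffn + layer_norms)

print(f"BERT Base:  ~{approx_bert_params(12, 768) / 1e6:.0f}M")   # ~109M (quoted as 110M)
print(f"BERT Large: ~{approx_bert_params(24, 1024) / 1e6:.0f}M")  # ~334M (quoted as 340M)
```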
I/O Representations
WordPiece embeddings used, with 30k token vocab.
First token is always the special [CLS] token. Its final hidden state is used as the aggregate representation for classification tasks.
Sentence pairs separated with [SEP] token. Learned embedding added to each token to indicate which sentence it belongs to.
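A minimal sketch of this input construction in PyTorch (not from the paper; the token ids are illustrative placeholders and the embedding tables are randomly initialised here):

```python
import torch
import torch.nn as nn

H, VOCAB, MAX_LEN = 768, 30522, 512
tok_emb = nn.Embedding(VOCAB, H)    # WordPiece token embeddings
seg_emb = nn.Embedding(2, H)        # segment embedding: sentence A = 0, sentence B = 1
pos_emb = nn.Embedding(MAX_LEN, H)  # learned (not sinusoidal) position embeddings

# illustrative ids for "[CLS] <sentence A tokens> [SEP] <sentence B tokens> [SEP]"
token_ids   = torch.tensor([[101, 7592, 2088, 102, 2129, 2024, 2017, 102]])
segment_ids = torch.tensor([[0,   0,    0,    0,   1,    1,    1,    1]])
positions   = torch.arange(token_ids.size(1)).unsqueeze(0)

input_embeddings = tok_emb(token_ids) + seg_emb(segment_ids) + pos_emb(positions)
print(input_embeddings.shape)  # torch.Size([1, 8, 768]), one vector per input token
```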
Pre-training
Unlike previous approaches, pre-training tasks are not unidirectional. Two tasks are used:
Task #1: Masked LM
Standard LMs can only be trained unidirectionally as bidirectional conditioning would allow each word to indirectly "see itself".
Randomly mask 15% of the input tokens and then predict them. 80% of the time, masking = replacing the input token with [MASK]; 10% of the time with a random token; 10% of the time unchanged.
Unlike denoising auto-encoders, only the masked words are predicted, not the whole input reconstructed.
The final hidden vectors of the masked tokens are fed into a softmax over the vocabulary.
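A sketch of that 80/10/10 masking rule (illustrative preprocessing, not the paper's code; the [MASK] id and the -100 "ignore" label follow common PyTorch convention):

```python
import random

MASK_ID = 103  # [MASK] in the standard BERT WordPiece vocab

def mask_tokens(token_ids, vocab_size, mask_prob=0.15):
    """Return (corrupted inputs, labels); labels are -100 wherever no prediction is made."""
    inputs, labels = list(token_ids), [-100] * len(token_ids)
    for i in range(len(inputs)):
        if random.random() < mask_prob:
            labels[i] = inputs[i]                        # only masked positions enter the loss
            r = random.random()
            if r < 0.8:
                inputs[i] = MASK_ID                      # 80%: replace with [MASK]
            elif r < 0.9:
                inputs[i] = random.randrange(vocab_size) # 10%: replace with a random token
            # else: 10%: keep the original token unchanged
    return inputs, labels

print(mask_tokens([7592, 2088, 2129, 2024, 2017], vocab_size=30522))
```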
Task #2: Next Sentence Prediction (NSP)
Uses the final hidden state C of the initial [CLS] token to predict whether sentence B follows sentence A.
Trained on inputs consisting of sentence pairs A, B. 50% of the time B is the actual next sentence after A (label: IsNext); the other 50% of the time B is a random sentence from the corpus (label: NotNext).
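A sketch of how those A/B training pairs could be drawn (assuming `docs` is a list of documents, each a list of sentences; helper names are hypothetical):

```python
import random

def make_nsp_example(docs):
    """Return (sentence_a, sentence_b, is_next_label) with a 50/50 IsNext/NotNext split."""
    doc = random.choice(docs)
    i = random.randrange(len(doc) - 1)        # pick a sentence that has a successor
    sent_a = doc[i]
    if random.random() < 0.5:
        sent_b, is_next = doc[i + 1], 1       # IsNext: the true next sentence
    else:
        sent_b, is_next = random.choice(random.choice(docs)), 0  # NotNext: a random sentence
    return sent_a, sent_b, is_next

docs = [["Sentence one.", "Sentence two.", "Sentence three."],
        ["Another doc.", "With more text."]]
print(make_nsp_example(docs))
```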
Corpus = BooksCorpus (800M words) and English Wikipedia (2,500M words).
Fine-tuning
Bidirectional attention makes adapting to target tasks very simple. We just feed in the task-specific inputs and outputs and fine-tune end-to-end.
A-B sentence setup for pre-training makes tasks like paraphrasing, entailment and question answering easy to encode.
Compared to pre-training, fine-tuning is relatively inexpensive. All of the results in the paper can be replicated in at most 1 hour on a single Cloud TPU, or a few hours on a GPU, starting from the exact same pre-trained model.
⬆️ really amazing!
Experiments
GLUE
Contains the following tasks:
Standard GLUE task overview

| Task | Type | Description |
| --- | --- | --- |
| QNLI | binary classification | Does sentence B contain the answer to the question in sentence A? |
| SST-2 | binary classification | Is the movie review positive or negative? |
| CoLA | binary classification | Is the sentence grammatical or ungrammatical? |
| MRPC | binary classification | Is sentence B a paraphrase of sentence A? |
| MNLI | multi-class classification | Does sentence A entail, contradict, or stand neutral with respect to sentence B? |
| RTE | binary classification | Does sentence A entail sentence B? |
| WNLI | binary classification | Sentence B replaces sentence A's ambiguous pronoun with one of the nouns - is this the correct noun? |
Results:
Approach for BERT:
- only extra params needed are for final classification layer
- batch size = 32
- fine-tuning over 3 epochs over each dataset
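A minimal sketch of that setup: the only new parameters are a single classification layer applied to the final [CLS] hidden state C (the shapes and the random stand-in for C are assumptions for illustration):

```python
import torch
import torch.nn as nn

H, num_labels, batch_size = 768, 2, 32
classifier = nn.Linear(H, num_labels)        # the only task-specific parameters

C = torch.randn(batch_size, H)               # stand-in for the batch of final [CLS] vectors
labels = torch.randint(0, num_labels, (batch_size,))
loss = nn.functional.cross_entropy(classifier(C), labels)  # softmax + NLL over the labels
loss.backward()   # in full fine-tuning this gradient would also flow through all of BERT's parameters
```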
SQuAD v1-2 & SWAG
Ablation Studies
Pre-training
Model Size
It has long been known that increasing the model size will lead to continual improvements on large-scale tasks such as machine translation and language modeling, which is demonstrated by the LM perplexity of held-out training data shown in Table 6. However, we believe that this is the first work to demonstrate convincingly that scaling to extreme model sizes also leads to large improvements on very small scale tasks, provided that the model has been sufficiently pre-trained.
Feature-based approach
Advantages of this approach: a) some tasks require a task-specific architecture, so transferring learned features is the only option; b) major computational savings from pre-computing an expensive representation of the training data once and then running cheaper models on top.
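As an illustration of point (b), here is a sketch using the Hugging Face transformers library (a modern convenience, not something from the paper): run BERT once with gradients disabled, cache the contextual features, and train a cheap model on top of the cache.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased").eval()

sentences = ["An example training sentence.", "Another one."]   # placeholder data
with torch.no_grad():                                           # no backprop through BERT
    batch = tokenizer(sentences, padding=True, return_tensors="pt")
    features = bert(**batch).last_hidden_state                  # (batch, seq_len, 768)

# `features` can be written to disk once and reused by a small task model
# (e.g. a linear classifier or BiLSTM) for many cheap training runs.
```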
Anki
Key innovation of BERT model
It jointly conditions on both left and right context
BERT modification for fine-tuning
Only the final layer needs to be changed
At a high-level, what is the input to BERT?
Either a single sentence, or a sentence pair
(sentence = single span of contiguous text, not an actual sentence)
BERT special tokens
- [CLS]: First token of every sequence → corresponding final hidden state used as output for classification tasks
- [SEP]: sentence separator
- [MASK]: replaces masked-out tokens during MLM pre-training
BERT transformer token input representation
The sum of:
- Token embedding
- Segment (i.e. sentence A or B) embedding
- Position embedding
BERT output representation for word prediction tasks (e.g. MLM)
Each final hidden vector is fed into a softmax over the vocabulary
BERT pretraining loss
Sum of NLL for MLM & NSP
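A toy sketch of that summed loss (all tensors below are random stand-ins with assumed shapes, just to show the two cross-entropy terms being added):

```python
import torch
import torch.nn.functional as F

batch, seq_len, vocab = 8, 128, 30522
mlm_logits = torch.randn(batch, seq_len, vocab)                    # per-token vocab scores
mlm_labels = torch.full((batch, seq_len), -100, dtype=torch.long)  # -100 = not a masked position
mlm_labels[:, 5] = 42                                              # pretend one token was masked
nsp_logits = torch.randn(batch, 2)                                 # IsNext / NotNext scores from C
nsp_labels = torch.randint(0, 2, (batch,))

loss = (F.cross_entropy(mlm_logits.view(-1, vocab), mlm_labels.view(-1), ignore_index=-100)
        + F.cross_entropy(nsp_logits, nsp_labels))
print(loss)
```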
BERT MLM task process
- Random 15% of input tokens masked
- When masking:
- 80%: replace with [MASK]
- 10%: random token
- 10%: unchanged
BERT NSP task process
- Binary output for [CLS] indicating if B follows A in the corpus
- 50% of the time true, 50% false
How does BERT deal with the quadratic sequence length cost?
Two phases of pre-training:
- 90%: seq length 128
- 10%: seq length 512
Why does BERT use two phases of pre-training?
To combat the transformer's quadratic sequence length cost
How does BERT encode SQuAD (just input)
Question = sentence A, passage = sentence B
How does BERT make predictions for SQuAD
Over just the passage (sentence B) features:
- Each output token vector $T_i$ is dotted with a learned start vector $S$ and end vector $E$
- Predicted span indices = $\operatorname{argmax}_{j \geq i}\,(S \cdot T_i + E \cdot T_j)$
How is BERT's loss computed for SQuAD
- Output token vectors are each dotted with the learned start and end vectors $S$ and $E$
- Start probabilities: $P_i = \frac{e^{S \cdot T_i}}{\sum_j e^{S \cdot T_j}}$, similarly for the end position
- Loss = sum of the negative log-likelihoods of the correct start and end positions
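A tiny sketch of that span scoring (tensor shapes and the random stand-ins are assumptions, not the paper's code):

```python
import torch

H, seq_len = 768, 384
T = torch.randn(seq_len, H)               # final hidden vectors of the passage tokens
S, E = torch.randn(H), torch.randn(H)     # learned start and end vectors

start_scores, end_scores = T @ S, T @ E   # S·T_i and E·T_j for every position
start_probs = torch.softmax(start_scores, dim=0)   # the P_i defined above

# predicted span = argmax over j >= i of S·T_i + E·T_j
span_scores = start_scores[:, None] + end_scores[None, :]
valid = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool))   # keep only j >= i
best = span_scores.masked_fill(~valid, float("-inf")).argmax()
i, j = divmod(best.item(), seq_len)
print(i, j)
```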
How is BERT adapted to handle SQuAD v2
SQuAD v2 has a "no answer" option ➡️ to predict this, span starts & ends at [CLS]