BERTology

Title
A Primer in BERTology: What We Know About How BERT Works
Authors
Anna Rogers and Olga Kovaleva and Anna Rumshisky
Date
2020
Venue
TACL
Keywords
transformer
nlp
survey
BERT

What knowledge does BERT have?

Syntactic knowledge

“Encodes positional information about word tokens well on its lower layers, but switches to a hierarchically-oriented encoding on higher layers.” (Lin et al., 2019)
  • Lin et al. show this via the following auxiliary task:
  • Take a pretrained BERT and truncate it at a given layer. Freeze the network and add a classifier layer, which is then trained. It has to differentiate between the following:
    • (figure omitted: example of the auxiliary task, contrasting the correct token with a distractor)
  • Earlier layers tend to pick the distractor, whereas later layers pick the correct token.
  • Without going into depth, this grammatical task is designed such that a linear interpretation of the sentence structure leads to the error, while a hierarchical interpretation gives the correct answer.
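The general probing recipe (frozen features + a small trained classifier) can be sketched with a toy stand-in for the frozen layer activations — the data here is synthetic, not Lin et al.'s actual task:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for frozen layer-k activations: the class signal lives
# along a single feature direction, everything else is noise.
n, d = 200, 16
y = (rng.random(n) < 0.5).astype(float)
X = rng.normal(size=(n, d))
X[:, 0] += 4.0 * y

# Linear probe: logistic regression by gradient descent.
# The "BERT" features X are never updated -- only w and b are trained.
w, b = np.zeros(d), 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # sigmoid
    w -= 0.5 * (X.T @ (p - y)) / n
    b -= 0.5 * (p - y).mean()

acc = float((((X @ w + b) > 0) == (y == 1)).mean())
print(f"probe accuracy: {acc:.2f}")
```

High probe accuracy is then read as "this information is present at layer k" — with the usual caveat that the probe itself adds capacity.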
Embeddings encode information about parts of speech, syntactic chunks and roles
❌ able to recover syntactic structure from self-attention weights
 
✅ able to recover syntactic structure from token representations
Certain types of negation are very poorly understood by BERT (Ettinger, 2019)
  • It appears (assuming results are correct) to do well at “A dog is a _” tasks, but fail at tasks like “A dog is not a _”.
  • May be because MLM task has too many potential options for “_”
Insensitive to heavily malformed inputs
  • Not clear if this is a good or bad thing!
  • Likely BERT does not need to rely on syntax to solve tasks (even if it captures some of it)
Susceptible to adversarial sequences (not just BERT) (Wallace et al., 2019)
  • Found using “white box” gradient-guided search over tokens
  • Universal triggers exist independent of surrounding context
  • Can be transferred to other models with different embeddings (although trained on the same dataset) and are still adversarial
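The gradient-guided search works via a first-order (HotFlip-style) approximation: score every vocabulary token by how much swapping it in would change the loss, using only the embedding gradient. A minimal sketch with a toy embedding matrix and a made-up gradient:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy setup: vocab of 1000 token embeddings, dim 32.
V, d = 1000, 32
E = rng.normal(size=(V, d))

cur_id = 42                     # current trigger token
grad = rng.normal(size=d)       # dL/d(embedding); from a backward pass in practice

# First-order estimate of the loss change from swapping token i in:
#   L(e_i) - L(e_cur) ≈ (e_i - e_cur) · grad
# To push the loss *up* (break the model), take the argmax over the vocab.
scores = (E - E[cur_id]) @ grad
best = int(np.argmax(scores))
print("best replacement token id:", best)
```

Iterating this swap over each trigger position is the core of the search; "white box" because it needs gradients w.r.t. the embeddings.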

Semantic knowledge

Shown to encode information about entity types, relations, semantic roles, proto-roles
Struggles with representations of numbers
  • Wordpiece tokeniser makes this particularly bad
Remarkably brittle to named entity replacements (Balasubramanian et al. 2020)
  • Coreference resolution task requires identifying all phrases in the sentence that refer to the same entity
  • On this task, 85% of test sentences change their prediction after a single person-name replacement
  • Suggests model poor at forming generic representations of entities
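The number problem is easy to see with a toy greedy longest-match tokenizer in the WordPiece style (the vocabulary here is hypothetical): numbers absent from the vocab fragment into arbitrary pieces, so similar quantities get dissimilar representations.

```python
def wordpiece(word, vocab):
    """Greedy longest-match-first subword split, WordPiece style."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece
            if piece in vocab:
                pieces.append(piece)
                start = end
                break
            end -= 1
        else:
            return ["[UNK]"]   # no piece matched at this position
    return pieces

# Hypothetical vocab: some numbers present as whole tokens, others not.
vocab = {"1", "2", "3", "7", "19", "##0", "##4", "##7", "##19", "1990"}
print(wordpiece("1990", vocab))   # kept whole
print(wordpiece("1947", vocab))   # fragments into pieces
```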

World knowledge

  • Struggles with abstract attributes of objects
  • However, very good at knowledge induction if framed as a “fill-in-the-blank” task - even competitive with knowledge bases (Petroni et al., 2019)
  • Poor at reasoning about relationships/properties between entities

Localising Linguistic Knowledge

Embeddings

(output of a Transformer layer, typically the final one)
  • Two random words will on average have a much higher cosine similarity than expected if embeddings were directionally uniform/isotropic (i.e. covariance = identity). As isotropy has been shown to be beneficial for static word embeddings, this may be a problem for BERT
  • For sentence-level embeddings: standard choice = [CLS] token, but Toshniwal et al. present & evaluate several alternative approaches (e.g. normalized token mean)
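A quick numpy sketch of both points — a shared offset direction (anisotropy) inflates the average cosine similarity of otherwise-random vectors, and the normalized token mean is a drop-in alternative to [CLS] (all vectors here are synthetic):

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_cos(X, pairs=2000):
    """Average cosine similarity over random pairs of rows."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    i = rng.integers(0, len(X), pairs)
    j = rng.integers(0, len(X), pairs)
    return float(np.mean(np.sum(Xn[i] * Xn[j], axis=1)))

d = 64
iso = rng.normal(size=(500, d))        # isotropic: covariance ≈ identity
aniso = iso + 4.0 * np.ones(d)         # shared offset -> one dominant direction

iso_cos, aniso_cos = mean_cos(iso), mean_cos(aniso)
print(f"isotropic mean cos:   {iso_cos:.2f}")    # near 0
print(f"anisotropic mean cos: {aniso_cos:.2f}")  # much higher

# Sentence embedding via normalized token mean (one alternative to [CLS]):
tokens = rng.normal(size=(12, d))      # toy token vectors for one sentence
sent = tokens.mean(axis=0)
sent /= np.linalg.norm(sent)
```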

Self-attention heads

(figure omitted: typical self-attention patterns, e.g. vertical, diagonal, heterogeneous)
  • Some heads seem to specialise in certain types of syntactic relations
  • However, no head has the complete syntactic tree information
  • We should be very wary about using attention maps for interpreting a model - currently much work debating this assumption, and many papers cherry-pick clear examples
  • Most heads only encode trivial linguistic information (<50% of heads exhibit the heterogeneous pattern, many vertical) → redundancy
  • Most attention is to special tokens. Not clear why. One hypothesis is that they act as a “no-op”, indicating to ignore the head if its pattern is not applicable to the current case
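The "attention mass on special tokens" diagnostic is straightforward to compute from an attention matrix; a sketch with toy logits for a head that strongly prefers [CLS]/[SEP] (positions and logits invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy attention logits for one head over an 8-token sequence:
# [CLS] tok tok tok [SEP] tok tok [SEP]
special = np.array([0, 4, 7])
logits = rng.normal(size=(8, 8))
logits[:, special] += 3.0                 # this head prefers special tokens

attn = np.exp(logits - logits.max(axis=1, keepdims=True))
attn /= attn.sum(axis=1, keepdims=True)   # softmax over keys: rows sum to 1

# Diagnostic: how much of each query's attention lands on [CLS]/[SEP]?
frac = float(attn[:, special].sum(axis=1).mean())
print(f"mean attention mass on special tokens: {frac:.2f}")
```

A head like this is a candidate "no-op": nearly all of its mass goes to tokens that carry no content at that position.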

Layers

  • Lower layers have the most information about (linear) word order
  • Middle layers have the most syntactic information
  • Final layers have the most task-specific information
    • For pre-training, this means MLM
    • This is reflected in the fact that middle layers are the “most transferable”
  • The entire model has semantic information spread across it

Training BERT

Model architecture

  • Number of heads not as significant as the number of layers
  • Deeper models have more capacity to encode non-task-specific information
  • Many self-attention heads naturally learn the same pattern
  • Raganato et al. (2020) show that for translation we can pre-set self-attention patterns
  • Press et al. (2020) report benefits from more self-attention sublayers lower in the model, and more FFN sublayers higher
  • Significance of larger hidden size varies across settings

Training regime

  • Training with very large batch sizes (32k) is suggested to be possible with no performance degradation
  • Zhou et al. (2019) suggest normalisation of the [CLS] token may improve performance
  • Gong et al. show a layer-by-layer approach, where earlier layers are trained first and their weights copied to later layers, which may lead to faster training

Pre-training

Things that have been altered in the literature:
  • How to mask
  • What to mask
    • e.g. full words, spans, named entities
    • Masking named entities improves the representation of structured knowledge
  • Where to mask
  • Alternatives to masking
  • NSP alternatives
    • Removing NSP doesn’t hurt / slightly improves performance
  • Other tasks
  • Increased dataset size & longer training beneficial
  • Explicitly supplying structured knowledge in the data:
    • E.g. RoBERTa enhanced with both linguistic and factual knowledge with task-specific adapters
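A minimal sketch of one "what to mask" variant — span masking (SpanBERT-style), where contiguous spans rather than single tokens are replaced (span count/length here are illustrative, not the paper's settings):

```python
import random

MASK = "[MASK]"

def mask_spans(tokens, n_spans=2, span_len=3, seed=0):
    """Mask contiguous spans (SpanBERT-style) instead of single tokens."""
    rng = random.Random(seed)
    out = list(tokens)
    for _ in range(n_spans):
        start = rng.randrange(0, len(out) - span_len + 1)
        out[start:start + span_len] = [MASK] * span_len
    return out

tokens = "the quick brown fox jumps over the lazy dog today".split()
print(mask_spans(tokens))
```

The model must then reconstruct whole multi-token units from context, which is the intuition behind spans/entities improving structured knowledge.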
Benefit of pre-training:
  • Kovaleva et al. (2019) show that for some tasks pre-trained weights perform worse than random initialisation + fine-tuning
  • However, does help in most situations
  • Prasanna et al. (2020) show that although most pre-trained weights help, some subnetworks help while others hurt
Hard to know how much impact each modification has, because increased size has made thorough ablations expensive 💰 Gains are typically marginal though.

Fine-tuning

Kovaleva et al. (2019) show that for GLUE fine-tuning
  • The last two layers change most (expected)
  • The changes caused self-attention to focus more on [SEP]
  • if Clark et al. (2019) are right, and [SEP] = no-op, fine-tuning here is basically telling BERT what to ignore
Possible improvements:
  • Using the outputs of layers throughout the model rather than just the final one
  • Supervised training stage between pre-training and fine-tuning
  • Adversarial token perturbations & regularisation
  • “Adapter modules” which freeze the original model and train only small inserted layers plus the task-specific head (good for low-resource settings)
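A sketch of the adapter idea in its common bottleneck form (down-project, nonlinearity, up-project, residual add); the dimensions are illustrative. With the up-projection zero-initialised, an untrained adapter is exactly the identity, so fine-tuning starts from the pre-trained behaviour:

```python
import numpy as np

def adapter(x, W_down, W_up):
    """Bottleneck adapter: down-project, ReLU, up-project, residual add."""
    return x + np.maximum(0.0, x @ W_down) @ W_up

d, bottleneck = 768, 64
rng = np.random.default_rng(0)
x = rng.normal(size=(4, d))              # toy hidden states for 4 tokens

W_down = rng.normal(scale=0.02, size=(d, bottleneck))
W_up = np.zeros((bottleneck, d))         # zero init -> adapter starts as identity

out = adapter(x, W_down, W_up)
print(np.allclose(out, x))               # True: untrained adapter is a no-op
```

Only W_down/W_up (a few % of the model's parameters per layer) are trained; the frozen backbone is shared across tasks.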
Research issue:
  • High variance for GLUE tasks demonstrated across initialisations and data shuffles
  • Many fine-tuning “improvements” may actually be within baseline variance

How big should BERT be?

Overparameterisation

  • Not only are many BERT layers redundant (learn very similar patterns), for some fine-tuning tasks performance increases if they're removed
  • Gordon et al. (2020) find that 30–40% of the weights can be pruned without impact on downstream tasks
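Magnitude pruning at those sparsity levels is essentially a one-liner: zero every weight whose absolute value falls below the chosen quantile (toy weight matrix here):

```python
import numpy as np

def magnitude_prune(W, sparsity):
    """Zero the smallest-magnitude fraction `sparsity` of the weights."""
    thresh = np.quantile(np.abs(W), sparsity)
    return np.where(np.abs(W) < thresh, 0.0, W)

rng = np.random.default_rng(0)
W = rng.normal(size=(256, 256))
Wp = magnitude_prune(W, sparsity=0.35)   # cf. Gordon et al.'s 30-40%

zeroed = float(np.mean(Wp == 0))
print(f"fraction zeroed: {zeroed:.2f}")
```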

Compression techniques

  • Knowledge distillation, via teacher-student mimicking of either:
    • Loss function
    • Activation patterns
    • Or via some kind of pre-training knowledge transfer
  • Quantization
  • Pruning
Better to train a larger model and compress heavily, than a smaller model and compress lightly
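The standard distillation loss behind the teacher-student setup: KL divergence between temperature-softened teacher and student output distributions (the logits here are made up):

```python
import numpy as np

def softmax(z, T=1.0):
    z = np.asarray(z, dtype=float) / T
    e = np.exp(z - z.max())
    return e / e.sum()

def distill_kl(teacher_logits, student_logits, T=2.0):
    """KL(teacher || student) between temperature-softened distributions."""
    p, q = softmax(teacher_logits, T), softmax(student_logits, T)
    return float(np.sum(p * (np.log(p) - np.log(q))))

teacher = [4.0, 1.0, -2.0]
close = distill_kl(teacher, [3.5, 1.2, -1.0])   # student roughly agrees
far = distill_kl(teacher, [-2.0, 1.0, 4.0])     # student disagrees
print(f"close: {close:.3f}  far: {far:.3f}")
```

Temperature T > 1 softens the teacher's distribution so its "dark knowledge" about non-top classes contributes to the gradient.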