A Primer in BERTology: What We Know About How BERT Works
Anna Rogers and Olga Kovaleva and Anna Rumshisky

What knowledge does BERT have?

Syntactic knowledge

“Encodes positional information about word tokens well on its lower layers, but switches to a hierarchically-oriented encoding on higher layers.” (Lin et al., 2019)
  • Lin et al. show this via the following auxiliary task:
  • Take a pre-trained BERT and chop it off at a given layer. Freeze the network and add a classifier layer, which is then trained. It has to differentiate between the correct token and a distractor.
  • Earlier layers tend to pick the distractor, whereas later layers pick the correct token.
  • Without going into depth, this grammatical task is designed such that a linear interpretation of the sentence structure leads to the error, and a hierarchical interpretation is needed to give the correct answer.
Embeddings encode information about parts of speech, syntactic chunks and roles
❌ able to recover syntactic structure from self-attention weights
✅ able to recover syntactic structure from token representations
Certain types of negation are very poorly understood by BERT (Ettinger, 2019)
  • It appears (assuming results are correct) to do well at “A dog is a _” tasks, but fail at tasks like “A dog is not a _”.
  • May be because MLM task has too many potential options for “_”
Insensitive to heavily malformed inputs
  • Not clear if this is a good or bad thing!
  • Likely BERT does not need to rely on syntax to solve tasks (even if it captures some of it)
Susceptible to adversarial sequences (not just BERT) (Wallace et al., 2019)
  • Found using “white box” gradient-guided search over tokens
  • Universal triggers exist independent of surrounding context
  • Can be transferred to other models with different embeddings (although trained on the same dataset) and are still adversarial

Semantic knowledge

Shown to encode information about entity types, relations, semantic roles, proto-roles
Struggles with representations of numbers
  • Wordpiece tokeniser makes this particularly bad
Remarkably brittle to named entity replacements (Balasubramanian et al. 2020)
  • Coreference resolution task requires identifying all phrases in the sentence that refer to the same entity
  • On this task, 85% of test sentences change their predictions when a single person name is changed
  • Suggests the model is poor at forming generic representations of entities
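
To make the point about numbers and the WordPiece tokeniser concrete, here is a minimal sketch of greedy longest-match-first WordPiece splitting. The vocabulary `TOY_VOCAB` is made up for illustration and is far smaller than BERT's real one; the function name `wordpiece` is likewise hypothetical. It only aims to show how numerically close values can end up with quite different piece sequences.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into WordPiece units."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation-piece marker
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary (an assumption, not BERT's actual vocabulary).
TOY_VOCAB = {"1", "2", "12", "##3", "##4", "##34", "##5", "dog"}

print(wordpiece("1234", TOY_VOCAB))  # ['12', '##34']
print(wordpiece("1235", TOY_VOCAB))  # ['12', '##3', '##5']
```

The two numbers differ by one, yet they share only their first piece, so the model has no direct signal that they are close in value.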

World knowledge

  • Struggles with abstract attributes of objects
  • However, very good at knowledge induction if framed as a “fill-in-the-blank” task - even competitive with knowledge bases (Petroni et al., 2019)
  • Poor at reasoning about relationships/properties between entities

Localising Linguistic Knowledge


BERT embeddings

(output of a Transformer layer, typically the final one)
  • Two random words will on average have a much higher cosine similarity than expected if the embeddings were directionally uniform (isotropic). As isotropy has been shown to be beneficial for static word embeddings, this may be a problem for BERT
  • For sentence-level embeddings, the standard choice is the [CLS] token, but Toshniwal et al. present and evaluate several alternative approaches (e.g. normalized token mean)
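
The anisotropy observation can be illustrated with a small sketch that measures the mean cosine similarity over all pairs of vectors. The vectors here are synthetic Gaussian samples, not real BERT embeddings, and the shared offset of 3.0 is an arbitrary assumption chosen to mimic a "narrow cone" of directions.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_pairwise_cosine(vectors):
    """Average cosine similarity over all distinct pairs of vectors."""
    n = len(vectors)
    sims = [cosine(vectors[i], vectors[j])
            for i in range(n) for j in range(i + 1, n)]
    return sum(sims) / len(sims)


rng = random.Random(0)
dim, count = 64, 50

# Roughly isotropic vectors: independent zero-mean Gaussian coordinates.
isotropic = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(count)]

# Anisotropic vectors: the same samples plus a shared offset (assumed value).
anisotropic = [[x + 3.0 for x in vec] for vec in isotropic]

iso_score = mean_pairwise_cosine(isotropic)      # close to 0
aniso_score = mean_pairwise_cosine(anisotropic)  # much closer to 1
print(iso_score, aniso_score)
```

With real embeddings, the same statistic computed over randomly sampled word pairs would be the quantity the bullet above refers to.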

Self-attention heads

  • Some heads seem to specialise in certain types of syntactic relations
  • However, no head has the complete syntactic tree information
  • We should be very wary about using attention maps for interpreting a model - currently much work debating this assumption, and many papers cherry-pick clear examples
  • Most heads only encode trivial linguistic information (<50% of heads exhibit the heterogeneous pattern, many vertical) → redundancy
  • Most attention is to special tokens. Not clear why. One hypothesis is that they act as a “no-op”, indicating to ignore the head if its pattern is not applicable to the current case
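
As a rough illustration of how "most attention is to special tokens" could be quantified, the sketch below computes the average share of attention mass that one head assigns to [CLS] and [SEP]. The attention matrix is a hand-written example, not taken from a real model, and `special_token_share` is a hypothetical helper name.

```python
def special_token_share(attention, tokens, special=("[CLS]", "[SEP]")):
    """Mean fraction of attention mass each query token puts on special tokens.

    attention[i][j] is the weight from query token i to key token j,
    with each row assumed to sum to 1.
    """
    special_idx = [j for j, tok in enumerate(tokens) if tok in special]
    per_query = [sum(row[j] for j in special_idx) for row in attention]
    return sum(per_query) / len(per_query)


tokens = ["[CLS]", "a", "dog", "[SEP]"]

# Made-up attention weights for a single head (illustrative assumption).
attention = [
    [0.5, 0.1, 0.1, 0.3],
    [0.2, 0.1, 0.1, 0.6],
    [0.1, 0.2, 0.1, 0.6],
    [0.3, 0.1, 0.1, 0.5],
]

share = special_token_share(attention, tokens)
print(round(share, 3))
```

A value near 1 for a head would be consistent with the "no-op" hypothesis, although it does not by itself confirm it.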


Layers

  • Lower layers have the most information about (linear) word order
  • Middle layers have the most syntactic information
  • Final layers have the most task-specific information
    • For pre-training, this means MLM
    • This is reflected in the fact that middle layers are the “most transferable”
  • The entire model has semantic information spread across it

Training BERT

Model architecture

  • Number of heads not as significant as the number of layers
  • Deeper models have more capacity to encode non-task-specific information
  • Many self-attention heads naturally learn the same pattern
  • Raganato et al. (2020) show that for translation we can pre-set self-attention patterns
  • Press et al. (2020) report benefits from more self-attention sublayers lower in the model, and more FFN sublayers higher
  • Significance of larger hidden size varies across settings

Training regime

  • Training with large batch sizes (32k) is suggested to be possible with no performance degradation
  • Zhou et al. (2019) suggest normalisation of the [CLS] token may improve performance
  • Gong et al. show a layer-by-layer approach, in which earlier layers are trained first and then copied to later layers, that may lead to faster training
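
One possible reading of the layer-copying idea is sketched below: a trained shallow stack is duplicated to initialise a model twice as deep, which is then trained further. The dict-of-lists layer representation and the name `stack_layers` are assumptions for illustration only; the actual schedule used by Gong et al. is not described in these notes.

```python
import copy


def stack_layers(trained_layers):
    """Initialise a model twice as deep by copying the trained layers on top."""
    lower = copy.deepcopy(trained_layers)
    upper = copy.deepcopy(trained_layers)  # independent copies, trained further later
    return lower + upper


# Hypothetical "trained" two-layer model; each layer is a dict of parameters.
shallow = [{"w": [0.1, 0.2]}, {"w": [0.3, 0.4]}]
deep = stack_layers(shallow)

print(len(deep))  # 4
```

The copies are deep copies so that subsequent training of the upper layers does not silently modify the lower ones.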


Pre-training

Things that have been altered in the literature:
  • How to mask
  • What to mask
    • e.g. full words, spans, named entities
    • Latter improves representation of structured knowledge
  • Where to mask
  • Alternatives to masking
  • NSP alternatives
    • Removing NSP doesn’t hurt / slightly improves performance
  • Other tasks
  • Increased dataset size & longer training beneficial
  • Explicitly supplying structured knowledge in the data:
    • E.g. RoBERTa enhanced with both linguistic and factual knowledge with task-specific adapters
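
As an example of the "what to mask" choice, the sketch below masks full words rather than individual WordPiece tokens, so that all pieces of a word are masked together. It is a simplification: every selected piece is replaced by [MASK], whereas BERT's original recipe also sometimes keeps the token or swaps in a random one. The helper names and the example sentence are assumptions.

```python
import random

SPECIAL_TOKENS = {"[CLS]", "[SEP]"}


def group_wordpieces(tokens):
    """Group token indices so '##' continuation pieces stay with their word."""
    groups = []
    for i, tok in enumerate(tokens):
        if tok in SPECIAL_TOKENS:
            continue  # never mask special tokens
        if tok.startswith("##") and groups and groups[-1][-1] == i - 1:
            groups[-1].append(i)
        else:
            groups.append([i])
    return groups


def whole_word_mask(tokens, mask_prob=0.15, rng=None, mask_token="[MASK]"):
    """Mask each word with probability mask_prob, covering all of its pieces."""
    rng = rng or random.Random()
    masked = list(tokens)
    for group in group_wordpieces(tokens):
        if rng.random() < mask_prob:
            for i in group:
                masked[i] = mask_token
    return masked


tokens = ["[CLS]", "the", "token", "##iser", "splits", "num", "##ber", "##s", "[SEP]"]
print(whole_word_mask(tokens, mask_prob=0.5, rng=random.Random(0)))
```

Span or named-entity masking could follow the same pattern by changing how the index groups are built.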
Benefit of pre-training:
  • Kovaleva et al. (2019) show that for some tasks pre-training is worse than random initialisation followed by fine-tuning
  • However, does help in most situations
  • Prasanna et al. (2020) show that although most pre-trained weights help, some subnetworks are good and some bad
Hard to know how much impact each modification has, because increased model size has made thorough ablations expensive 💰. Gains are typically marginal, though.


Fine-tuning

Kovaleva et al. (2019) show that for GLUE fine-tuning:
  • The last two layers change most (expected)
  • The changes caused self-attention to focus more on [SEP]
  • If Clark et al. (2019) are right, and [SEP] = no-op, fine-tuning here is basically telling BERT what to ignore
Possible improvements:
  • Using the outputs of layers throughout the model rather than just the final one
  • Supervised training stage between pre-training and fine-tuning
  • Adversarial token perturbations & regularisation
  • “Adapter modules”, which freeze the pre-trained model and train only small added task-specific modules (good for low-resource settings)
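
The first improvement listed above, using outputs from layers throughout the model, could for instance be realised as a softmax-weighted sum of per-layer vectors. The sketch below shows this for a single token; the weighting scheme and the names `mix_layers` and `layer_logits` are assumptions, not necessarily what the cited works do.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def mix_layers(layer_outputs, layer_logits):
    """Weighted sum of one token's vectors from each layer.

    layer_outputs: list of vectors, one per layer, all the same length.
    layer_logits: one scalar per layer (learned during fine-tuning in practice).
    """
    weights = softmax(layer_logits)
    dim = len(layer_outputs[0])
    return [sum(w * layer[d] for w, layer in zip(weights, layer_outputs))
            for d in range(dim)]


# Toy vectors standing in for three layers' outputs for one token.
layers = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
print(mix_layers(layers, [0.0, 0.0, 0.0]))  # equal weights give the plain average
```

With equal logits this reduces to averaging the layers, and pushing one logit much higher than the rest recovers that single layer's output.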
Research issue:
  • High variance for GLUE tasks demonstrated across initialisations and data shuffles
  • Many fine-tuning “improvements” may actually be within baseline variance
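
A simple sanity check for the variance issue is to compare a reported score against the spread of the baseline across random seeds. The sketch below uses a two-standard-deviation band, which is an arbitrary assumption, and the scores are invented purely for illustration.

```python
import statistics


def within_baseline_variance(baseline_scores, new_score, num_std=2.0):
    """True if new_score lies within num_std sample std devs of the baseline mean."""
    mean = statistics.mean(baseline_scores)
    std = statistics.stdev(baseline_scores)
    return abs(new_score - mean) <= num_std * std


# Made-up baseline scores from five seeds (not real GLUE results).
baseline = [88.1, 89.4, 87.6, 90.2, 88.7]

print(within_baseline_variance(baseline, 89.9))
print(within_baseline_variance(baseline, 92.0))
```

A proper comparison would also run the proposed method over several seeds and use a significance test, which this sketch does not attempt.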

How big should BERT be?


Overparameterisation

  • Not only are many BERT layers redundant (learn very similar patterns), for some fine-tuning tasks performance increases if they're removed
  • Gordon et al. (2020) find that 30–40% of the weights can be pruned without impact on downstream tasks
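
To give a feel for what pruning 30–40% of the weights means, here is a minimal sketch that zeroes the smallest-magnitude entries of a flat weight list. Magnitude is assumed as the pruning criterion, since these notes do not state which one is used, and the weights are made-up values.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the `sparsity` fraction of weights with the smallest |value|."""
    k = int(len(weights) * sparsity)  # number of weights to remove
    if k == 0:
        return list(weights)
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(order[:k])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]


# Made-up weights for illustration.
weights = [0.5, -0.01, 0.2, -0.9, 0.03, 0.7, -0.04, 0.1, 0.6, -0.3]
pruned = magnitude_prune(weights, sparsity=0.3)

print(pruned)
```

In a real model this would be applied per weight matrix (or globally), usually followed by further training to recover any lost accuracy.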

Compression techniques

  • Knowledge distillation, via teacher-student mimicking of either:
    • Loss function
    • Activation patterns
    • Or via some kind of pre-training knowledge transfer
  • Quantization
  • Pruning
Better to train a larger model and compress heavily, than a smaller model and compress lightly
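
Teacher-student mimicking through the loss function can be sketched as a cross-entropy between temperature-softened output distributions. This follows the classic soft-target formulation of distillation and is an assumption about what "loss function" refers to here; the logits and the temperature of 2.0 are illustrative values.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax of logits divided by a temperature (higher means softer)."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy of the student's softened distribution against the teacher's."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(p_teacher, p_student))


teacher = [4.0, 1.0, 0.5]       # made-up teacher logits
good_student = [3.8, 1.1, 0.4]  # close to the teacher
poor_student = [0.5, 4.0, 1.0]  # disagrees with the teacher

print(distillation_loss(good_student, teacher))
print(distillation_loss(poor_student, teacher))
```

The loss is smallest when the student's softened distribution matches the teacher's, so minimising it pushes the smaller model towards the larger one's behaviour.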