
Revealing the Dark Secrets of BERT

Title
Revealing the Dark Secrets of BERT
Authors
Olga Kovaleva, Alexey Romanov, Anna Rogers, Anna Rumshisky
Date
2019
Venue
EMNLP
Keywords
BERT
Note: for a more in-depth discussion of BERTology, see 🥸 BERTology (where some of this paper’s findings are mentioned).

Overview

Contributions:
  1. Analysis of how attention weights capture linguistic information
  2. Evidence of BERT’s overparameterisation (really?) and a simple improvement (disabling some attention heads)

Types of attention head

[Figures: example attention maps for the five self-attention patterns the paper identifies - vertical, diagonal, vertical + diagonal, block, and heterogeneous]
Also some results indicating that certain attention heads appear to learn “certain types of linguistic relations”
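As a quick way to eyeball these patterns yourself, here’s a minimal sketch (my own, not the paper’s code) that pulls per-head attention maps out of a pre-trained BERT via the transformers library; the sentence and the layer/head indices are arbitrary.

```python
# Sketch: extract per-head attention maps from BERT for inspection.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_attentions=True)
model.eval()

inputs = tokenizer("The cat sat on the mat.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions is a tuple of 12 tensors (one per layer),
# each of shape (batch, num_heads, seq_len, seq_len).
layer, head = 0, 0
attn_map = outputs.attentions[layer][0, head]      # (seq_len, seq_len)
print(attn_map.shape, attn_map.sum(dim=-1))        # each row sums to ~1 (softmax)
```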

Change in self-attention patterns after fine-tuning

Fine-tuning changes the final layers far more than earlier ones on most tasks (apparently similar results exist for convnets)
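The comparison behind this claim is (roughly) cosine similarity between flattened attention maps before and after fine-tuning. Below is a hedged sketch of how that statistic could be computed, assuming `attn_pre` and `attn_ft` are the `outputs.attentions` tuples from the pre-trained and fine-tuned models run on the same input.

```python
# Sketch: per-layer similarity between pre-trained and fine-tuned attention.
import torch
import torch.nn.functional as F

def per_layer_attention_similarity(attn_pre, attn_ft):
    sims = []
    for a, b in zip(attn_pre, attn_ft):
        # a, b: (batch, heads, seq, seq) -> flatten each head's attention map
        a_flat = a.flatten(start_dim=2)
        b_flat = b.flatten(start_dim=2)
        sims.append(F.cosine_similarity(a_flat, b_flat, dim=-1).mean().item())
    return sims  # one value per layer; lower = more change from fine-tuning
```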

Attention to special tokens

Earlier layers attend more to the CLS token, and later layers to the SEP token.
This is interesting - I’d have assumed the CLS token matters more in later layers, where the classification output is about to be used. I’m assuming the MLM task isn’t using CLS though, which may cause it to behave oddly?
Also found that:
Contrary to our initial hypothesis that the vertical attention pattern may be motivated by linguistically meaningful features, we found that it is associated predominantly, if not exclusively, with attention to [CLS] and [SEP] tokens
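A rough sketch of how the special-token statistic could be measured: the share of each layer’s attention mass that lands on [CLS] or [SEP] positions. The helper below is my own illustration, not the authors’ code, and it ignores padding for simplicity.

```python
# Sketch: average fraction of attention going to a given special token, per layer.
import torch

def special_token_attention_share(attentions, input_ids, token_id):
    """attentions: tuple of (batch, heads, seq, seq); token_id: e.g. tokenizer.sep_token_id."""
    mask = (input_ids == token_id).float()  # (batch, seq) - positions holding the token
    shares = []
    for layer_attn in attentions:
        # attention mass each query token sends to positions holding `token_id`
        mass = (layer_attn * mask[:, None, None, :]).sum(dim=-1)  # (batch, heads, seq)
        shares.append(mass.mean().item())
    return shares  # one average share per layer
```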

Disabling self-attention heads

(i.e. setting a head’s attention weights to a uniform value)
Our experiments suggest that certain heads have a detrimental effect on the overall performance of BERT, and this trend holds for all the chosen tasks. Unexpectedly, disabling some heads leads not to a drop in accuracy, as one would expect, but to an increase in performance
We found no evidence that attention patterns that are mappable onto core frame-semantic relations actually improve BERT’s performance. 2 out of 144 heads that seem to be “responsible” for these relations (see Section 4.2) do not appear to be important in any of the GLUE tasks: disabling of either one does not lead to a drop of accuracy.
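For concreteness, here’s a minimal sketch of what “disabling” means here: replacing a head’s post-softmax attention distribution with a uniform one. This is my own standalone helper, not the authors’ implementation, and it ignores padding masks.

```python
# Sketch: ablate chosen heads by making their attention uniform.
import torch

def disable_heads(attn_probs, heads_to_disable):
    """attn_probs: (batch, num_heads, seq, seq) post-softmax attention.
    heads_to_disable: iterable of head indices to replace with uniform weights."""
    out = attn_probs.clone()
    seq_len = attn_probs.size(-1)
    uniform = torch.full_like(attn_probs[:, 0], 1.0 / seq_len)  # (batch, seq, seq)
    for h in heads_to_disable:
        out[:, h] = uniform
    return out
```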

Thoughts

It’s a really interesting question why SEP gets so much attention weight. I guess it’s often really important to know which sequence each token is in, so that makes sense. However it does seem like this imbalance is probably a bit of an issue.
But wait - don’t you have segment embeddings for that??? Surely the SEP token doesn’t tell you anything you don’t already know? Eugh.
I’d have thought that CLS would have gotten a lot more attention weight, especially in the later layers where it’s closer to being used - but the opposite is true. How weird. Maybe the tokens inform the CLS representation, but it never goes the other way?
Also, the idea that attention heads learn useful linguistically interpretable patterns seems dead. There’s clearly some useful stuff going on in some of them (especially, I’d have thought, in the heterogeneous heads - can we see an ablation just removing those?), but a) many are harmful for certain tasks, and b) some whole layers can be harmful, suggesting attention really isn’t working that well here.