What Does BERT Look At?

Study of the attention maps of pre-trained BERT models.

A surprisingly large amount of BERT’s attention focuses on the delimiter token [SEP], which we argue is used by the model as a sort of no-op.

Particular heads correspond remarkably well to particular relations. For example, we find heads that find direct objects of verbs, determiners of nouns, objects of prepositions, and objects of possessive pronouns with >75% accuracy.

Surface-level patterns

Relative position

Most heads put little attention on the current token.

Frequently see heads that specialise in attending to previous or next token.
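
As a rough illustration of how these relative-position patterns could be measured, the sketch below averages the weight a head places on the token at a fixed offset (0 for the current token, -1 for the previous one, +1 for the next). The function name `attention_at_offset` and the toy attention map are made up for this example and are not from the paper; a real map would come from a BERT forward pass.

```python
def attention_at_offset(attn, offset):
    """Mean weight that token i places on token i + offset.

    attn[i][j] is the attention from token i to token j (each row sums to 1).
    Positions where i + offset falls outside the sequence are skipped.
    """
    n = len(attn)
    weights = [attn[i][i + offset] for i in range(n) if 0 <= i + offset < n]
    return sum(weights) / len(weights) if weights else 0.0


# Hypothetical 4-token map for a head that mostly looks at the previous token.
toy_attn = [
    [1.0, 0.0, 0.0, 0.0],
    [0.9, 0.1, 0.0, 0.0],
    [0.0, 0.8, 0.1, 0.1],
    [0.0, 0.1, 0.9, 0.0],
]

prev_share = attention_at_offset(toy_attn, -1)  # attention to previous token
self_share = attention_at_offset(toy_attn, 0)   # attention to current token
```

On this toy map the previous-token share is much higher than the self share, which is the kind of signature a "previous token" head would show.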

Separator tokens

Over half of the attention in layers 6-10 focuses on [SEP]. Why?

Their response here deserves reproducing in full:

One possible explanation is that [SEP] is used to aggregate segment-level information...However, further analysis makes us doubtful this is the case. If this explanation were true, we would expect attention heads processing [SEP] to attend broadly over the whole segment to build up these representations. However, they instead almost entirely (more than 90%; see bottom of Figure 2) attend to themselves and the other [SEP] token. Furthermore, qualitative analysis (see Figure 5) shows that heads with specific functions attend to [SEP] when the function is not called for. For example, in head 8-10 direct objects attend to their verbs. For this head, non-nouns mostly attend to [SEP]. Therefore, we speculate that attention over these special tokens might be used as a sort of “no-op” when the attention head’s function is not applicable.

Figure 3 shows that in the layers where [SEP] begins to receive attention weight, its gradient becomes smaller. They claim this is further evidence for their no-op theory. Initially I wasn’t sure, but looking at head 8-10 in Figure 5, it seems clear that [SEP] is just somewhere for everyone else “to go”, to get out of the way. It seems to follow that in this case the gradient magnitude would be low: if attending to [SEP] does nothing, changing that attention should barely change the output.
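
To make the [SEP] numbers in this section concrete, here is a minimal sketch of how the share of attention going to [SEP] could be computed for one head, both over all tokens and restricted to rows that belong to [SEP] itself (the "more than 90%" figure in the quote). The helper `attention_share_to`, its `sources` parameter, and the toy values are assumptions for illustration, not the paper's code.

```python
def attention_share_to(tokens, attn, target="[SEP]", sources=None):
    """Average attention mass placed on `target` tokens.

    attn[i][j] is the weight token i places on token j (rows sum to 1).
    If `sources` is given, only rows whose token is in `sources` are averaged.
    """
    target_cols = [j for j, tok in enumerate(tokens) if tok == target]
    rows = [i for i, tok in enumerate(tokens) if sources is None or tok in sources]
    if not rows:
        return 0.0
    per_row = [sum(attn[i][j] for j in target_cols) for i in rows]
    return sum(per_row) / len(per_row)


# Hypothetical tokens and attention map (not real BERT output).
toy_tokens = ["[CLS]", "cat", "[SEP]", "sat", "[SEP]"]
toy_attn = [
    [0.2, 0.1, 0.3, 0.1, 0.3],
    [0.1, 0.1, 0.4, 0.1, 0.3],
    [0.0, 0.0, 0.5, 0.0, 0.5],
    [0.1, 0.5, 0.2, 0.1, 0.1],
    [0.0, 0.0, 0.5, 0.0, 0.5],
]

overall = attention_share_to(toy_tokens, toy_attn)                       # all rows
sep_to_sep = attention_share_to(toy_tokens, toy_attn, sources={"[SEP]"})  # [SEP] rows only
```

Averaging `overall` across the heads of a layer would give a per-layer figure comparable in spirit to the "over half" claim, though the paper's exact averaging may differ.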

Focused vs Broad Attention

Probing individual heads

These results are great. Table 1 shows that there are quite a few heads that are very good at particular relations, so we do see many attention heads learning interpretable patterns. These results are better than I’ve seen in other bertology papers, although it’s worth noting that “there are many relations for which BERT only slightly improves over the simple baseline, so we would not say individual attention heads capture dependency structure as a whole.”
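
A sketch of how a single head might be scored on dependency relations, assuming the simple rule "each dependent predicts its most-attended-to other word as its syntactic head". The function `head_accuracy_by_relation`, the arc format, and the toy sentence are my own illustration; only the dependent-to-head direction is shown here.

```python
def head_accuracy_by_relation(attn, arcs):
    """Score one attention head against gold dependency arcs.

    attn[i][j]: word-level attention from word i to word j.
    arcs: list of (dependent_index, head_index, relation) gold triples.
    Returns {relation: fraction of dependents whose top-attended other
    word is the gold head}.
    """
    correct, total = {}, {}
    for dep, gold_head, rel in arcs:
        # Most-attended-to word, excluding the dependent itself.
        candidates = [j for j in range(len(attn[dep])) if j != dep]
        predicted = max(candidates, key=lambda j: attn[dep][j])
        total[rel] = total.get(rel, 0) + 1
        correct[rel] = correct.get(rel, 0) + int(predicted == gold_head)
    return {rel: correct[rel] / total[rel] for rel in total}


# Hypothetical example: "she reads the book"
toy_attn = [
    [0.1, 0.2, 0.1, 0.6],  # she
    [0.3, 0.4, 0.1, 0.2],  # reads
    [0.1, 0.2, 0.1, 0.6],  # the
    [0.1, 0.7, 0.1, 0.1],  # book
]
toy_arcs = [(0, 1, "nsubj"), (2, 3, "det"), (3, 1, "dobj")]

scores = head_accuracy_by_relation(toy_attn, toy_arcs)
```

A head that is good at one relation and poor at another, as in this toy case, is what "specialised" looks like under this metric; comparing against a fixed-offset baseline is what tells you whether the score is actually impressive.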

Notes

Previous papers seem to cast doubt on the idea of specific heads fulfilling neat, specific functions. However, here they show quite clearly that some heads do, albeit not all. I suspect getting this analysis to come out nicely is a little tricky, but it seems to be possible (e.g. they convert byte-pair encoding token weights to word token weights for some of this analysis).
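
For reference, here is a sketch of one way to do the token-to-word conversion mentioned above. As I read the paper, attention to a split-up word is summed over its tokens and attention from a split-up word is averaged over its tokens, which keeps each row summing to 1. The function name `token_to_word_attention` and the `word_ids` input (one word index per token) are assumptions for this example.

```python
def token_to_word_attention(attn, word_ids):
    """Collapse a token-token attention map into a word-word map.

    attn[i][j]: attention from token i to token j (rows sum to 1).
    word_ids[t]: index of the word that token t belongs to.
    """
    n_words = max(word_ids) + 1

    # Step 1: attention *to* a split word = sum over its tokens (columns).
    summed = []
    for row in attn:
        word_row = [0.0] * n_words
        for j, w in enumerate(word_ids):
            word_row[w] += row[j]
        summed.append(word_row)

    # Step 2: attention *from* a split word = mean over its tokens (rows).
    counts = [word_ids.count(w) for w in range(n_words)]
    word_attn = [[0.0] * n_words for _ in range(n_words)]
    for i, w in enumerate(word_ids):
        for k in range(n_words):
            word_attn[w][k] += summed[i][k] / counts[w]
    return word_attn


# Hypothetical tokens: "play", "##ing", "chess" -> words "playing", "chess"
toy_attn = [
    [0.2, 0.3, 0.5],
    [0.1, 0.5, 0.4],
    [0.3, 0.3, 0.4],
]
toy_word_ids = [0, 0, 1]

word_attn = token_to_word_attention(toy_attn, toy_word_ids)
```

Summing on the "to" side and averaging on the "from" side is what preserves the row normalisation; averaging on both sides would not.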

They give a great explanation of the [SEP]-as-a-no-op phenomenon. I’ve quoted it at length because they put it really well, and I buy it. Figure 5 shows this nicely.

Interesting to see how [CLS], [SEP] and punctuation all have their own distinct phases of usefulness within the layers (fig 2).

Some heads in the early layers attend so broadly that their output is roughly a bag-of-vectors representation of the sentence.
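
One way to put a number on "focused vs broad" attention is the entropy of each token's attention distribution, averaged over a head: a uniform (bag-of-vectors-like) head has entropy near log(n) for n tokens, while a head that always picks one token has entropy near 0. The sketch below assumes natural-log entropy; the helper names and toy maps are illustrative only.

```python
import math


def attention_entropy(row):
    """Shannon entropy (in nats) of one token's attention distribution."""
    return -sum(p * math.log(p) for p in row if p > 0)


def mean_head_entropy(attn):
    """Average entropy over all rows (tokens) of one head's attention map."""
    return sum(attention_entropy(row) for row in attn) / len(attn)


# Hypothetical heads over a 4-token input.
broad_head = [[0.25, 0.25, 0.25, 0.25] for _ in range(4)]
focused_head = [
    [1.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
]

broad_entropy = mean_head_entropy(broad_head)      # equals log(4) for uniform rows
focused_entropy = mean_head_entropy(focused_head)  # zero for one-hot rows
```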