Motivation
- Leverage the same kind of benefits seen in NLP using transformers, but for images
- Hope that scale can ultimately trump inductive bias (i.e. of CNNs)
- They make the approach as NLP-like as possible: image patches play the role of tokens, and everything else stays close to the standard Transformer
Method
- Reshape input from $x \in \mathbb{R}^{H \times W \times C}$ to a sequence of flattened patches $x_p \in \mathbb{R}^{N \times (P^2 \cdot C)}$, where patches have resolution $P \times P$ and $N = HW/P^2$
- Linear projection to $\mathbb{R}^{N \times D}$, where $D$ is the latent/model dimension
- Prepend [class] token
- Add 1D positional embeddings
- Send to a standard Transformer (encoder only, no decoder); a minimal sketch of the full pipeline follows this list
- Still using layernorm
- Classification head attached to the [class] position
- Classification head = FFN (pre-training) or linear layer (fine-tuning)
- Supervised objective
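A minimal PyTorch sketch of the pipeline above (patchify, linearly project, prepend the [class] token, add 1D positional embeddings, run the encoder, classify from the [class] position). The class name, defaults, and encoder layer choice are illustrative (roughly ViT-Base-shaped), not the authors' implementation.

```python
import torch
import torch.nn as nn

class ViTSketch(nn.Module):
    def __init__(self, image_size=224, patch_size=16, in_chans=3,
                 dim=768, depth=12, heads=12, num_classes=1000):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2                    # N = HW / P^2
        self.patch_size = patch_size
        self.proj = nn.Linear(patch_size * patch_size * in_chans, dim)   # linear projection
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))            # [class] token
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))  # 1D positions
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=4 * dim,
                                           batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.norm = nn.LayerNorm(dim)
        self.head = nn.Linear(dim, num_classes)   # linear head (fine-tuning setup)

    def forward(self, x):                                   # x: (B, C, H, W)
        B, C, H, W = x.shape
        P = self.patch_size
        # Reshape (B, C, H, W) -> (B, N, P*P*C): flatten non-overlapping patches
        x = x.unfold(2, P, P).unfold(3, P, P)               # (B, C, H/P, W/P, P, P)
        x = x.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * P * P)
        x = self.proj(x)                                    # (B, N, D)
        cls = self.cls_token.expand(B, -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embed     # prepend [class], add positions
        x = self.norm(self.encoder(x))
        return self.head(x[:, 0])                           # classify from [class] position

logits = ViTSketch()(torch.randn(2, 3, 224, 224))           # (2, 1000)
```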
Fine-tuning
- New classification head
- Typically higher resolution
- 2D-interpolate the pre-trained positional embeddings so they cover the larger image's patch grid (see the sketch after this list)
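A sketch of that interpolation step, assuming a square patch grid and a prepended [class] embedding; the function name and the bicubic mode are choices made here for illustration, not necessarily the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def resize_pos_embed(pos_embed, new_grid):
    """pos_embed: (1, 1 + N, D) with a leading [class] embedding; new_grid: e.g. 24 for 384px / 16."""
    cls_pe, patch_pe = pos_embed[:, :1], pos_embed[:, 1:]
    old_grid = int(patch_pe.shape[1] ** 0.5)                     # N = old_grid^2
    d = patch_pe.shape[-1]
    # Treat the patch embeddings as a (old_grid x old_grid) feature map and resize it in 2D
    patch_pe = patch_pe.reshape(1, old_grid, old_grid, d).permute(0, 3, 1, 2)
    patch_pe = F.interpolate(patch_pe, size=(new_grid, new_grid),
                             mode="bicubic", align_corners=False)
    patch_pe = patch_pe.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, d)
    return torch.cat([cls_pe, patch_pe], dim=1)                  # (1, 1 + new_grid^2, D)

# e.g. 224px / 16 = 14x14 grid at pre-training -> 384px / 16 = 24x24 grid at fine-tuning
new_pe = resize_pos_embed(torch.randn(1, 1 + 14 * 14, 768), new_grid=24)
```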
Datasets
- ImageNet: 1k classes, 1.3M images
- ImageNet-21k: 21k classes, 14M images
- JFT: 18k classes, 303M (high-resolution) images
Models
Compared against:
- “Big Transfer (BiT) (Kolesnikov et al., 2020), which performs supervised transfer learning with large ResNets”
- “Noisy Student (Xie et al., 2020), which is a large EfficientNet trained using semi-supervised learning on ImageNet and JFT-300M with the labels removed. Currently, Noisy Student is the state of the art on ImageNet and BiT-L on the other datasets reported here”
Results
Interpretability
- The learned patch-embedding filters clearly capture a diverse set of low-level features (they resemble plausible basis functions for the structure within each patch)
- The position embeddings are most similar to the ones in the same row or column - a good sign that 2D image structure is being learned from purely 1D embeddings
- Attention starts off as a mixture of local (CNN-like) and global heads, and becomes more uniformly global in deeper layers (see the attention-distance sketch below)
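A rough sketch of the kind of mean-attention-distance diagnostic behind that last observation: weight each query-key pair's spatial distance (in pixels) by its attention weight to get a per-head "receptive field size". Shapes, names, and the random inputs here are assumptions for illustration, not the authors' code.

```python
import torch

def mean_attention_distance(attn, grid, patch_size):
    """attn: (heads, N, N) attention over N = grid*grid patch tokens ([class] token removed)."""
    coords = torch.stack(torch.meshgrid(torch.arange(grid), torch.arange(grid),
                                        indexing="ij"), dim=-1).reshape(-1, 2).float()
    dist = torch.cdist(coords, coords) * patch_size     # (N, N) pixel distances between patches
    # Expected distance per query under the attention distribution, averaged over queries
    return (attn * dist).sum(dim=-1).mean(dim=-1)       # (heads,)

attn = torch.softmax(torch.randn(12, 14 * 14, 14 * 14), dim=-1)
print(mean_attention_distance(attn, grid=14, patch_size=16))   # small = local head, large = global head
```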