
ViT

Title: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Authors: Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby
Date: 2020
Venue: ICLR
Keywords: transformer, vision, unsupervised

Motivation

  • Leverage the same kind of benefits seen in NLP using transformers, but for images
  • Hope that scale can ultimately trump inductive bias (i.e. of CNNs)
  • They seem to make the approach as NLP-like as possible: image patches stand in for tokens, and everything else is very similar to a standard transformer

Method

  1. Reshape the input image $x \in \mathbb{R}^{H \times W \times C}$ into a sequence of flattened patches $x_p \in \mathbb{R}^{N \times (P^2 \cdot C)}$, where each patch has resolution $P \times P$ and $N = HW/P^2$
  2. Linear projection to $\mathbb{R}^D$, where $D$ is the latent/model dimension
  3. Prepend a learnable [class] token
  4. Add 1D positional embeddings
  5. Send to the transformer (encoder only)
  6. Still using layernorm
  7. Classification head attached to the [class] position
  8. Classification head = FFN (pre-training) or linear layer (fine-tuning)
  9. Supervised objective (see the sketch after this list)
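
To make the pipeline concrete, here is a minimal sketch of the forward pass in PyTorch. This is not the authors' implementation (which is in JAX); the hyperparameters roughly follow ViT-Base and all names are illustrative.

```python
# Minimal sketch of the ViT forward pass described above (illustrative only).
import torch
import torch.nn as nn

class ViTSketch(nn.Module):
    def __init__(self, image_size=224, patch_size=16, channels=3,
                 dim=768, depth=12, heads=12, num_classes=1000):
        super().__init__()
        assert image_size % patch_size == 0
        num_patches = (image_size // patch_size) ** 2           # N = HW / P^2
        patch_dim = channels * patch_size ** 2                  # P^2 * C
        self.patch_size = patch_size
        self.proj = nn.Linear(patch_dim, dim)                   # linear projection to R^D
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))   # learnable [class] token
        self.pos_emb = nn.Parameter(torch.zeros(1, num_patches + 1, dim))  # 1D positions
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=4 * dim,
            norm_first=True, batch_first=True)                  # pre-LN encoder block
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, num_classes)                 # fine-tuning head (linear)

    def forward(self, images):                                  # images: (B, C, H, W)
        B, C, H, W = images.shape
        P = self.patch_size
        # Reshape (B, C, H, W) -> (B, N, P^2 * C): non-overlapping patches, flattened.
        x = images.unfold(2, P, P).unfold(3, P, P)              # (B, C, H/P, W/P, P, P)
        x = x.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * P * P)
        x = self.proj(x)                                        # (B, N, D)
        cls = self.cls_token.expand(B, -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_emb           # prepend [class], add positions
        x = self.encoder(x)
        return self.head(x[:, 0])                               # classify from [class] token

logits = ViTSketch()(torch.randn(2, 3, 224, 224))               # (2, 1000)
```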

Fine-tuning

  • New classification head
  • Typically higher resolution
  • Interpolate the pre-trained positional embeddings (in 2D, according to their location in the original image) so they cover the larger patch grid of higher-resolution images (see the sketch below)
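
A rough sketch of that interpolation, assuming the position embeddings are stored as a (1, 1+N, D) tensor with the [class] embedding first; the function name and layout are my own, the paper only describes the idea of interpolating according to the patches' 2D location.

```python
# Illustrative 2D interpolation of learned position embeddings for fine-tuning
# at higher resolution (not the authors' code).
import torch
import torch.nn.functional as F

def resize_pos_emb(pos_emb, old_grid, new_grid):
    """pos_emb: (1, 1 + old_grid**2, D), [class] embedding first."""
    cls_emb, grid_emb = pos_emb[:, :1], pos_emb[:, 1:]
    D = grid_emb.shape[-1]
    # Lay the patch embeddings out on their 2D grid and resize bicubically.
    grid_emb = grid_emb.reshape(1, old_grid, old_grid, D).permute(0, 3, 1, 2)
    grid_emb = F.interpolate(grid_emb, size=(new_grid, new_grid),
                             mode="bicubic", align_corners=False)
    grid_emb = grid_emb.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, D)
    return torch.cat([cls_emb, grid_emb], dim=1)

# e.g. pre-trained at 224px (14x14 patches), fine-tuned at 384px (24x24 patches)
new_emb = resize_pos_emb(torch.randn(1, 1 + 14 * 14, 768), 14, 24)   # (1, 577, 768)
```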

Datasets

ImageNet: 1k classes, 1.3M images
ImageNet-21k: 21k classes, 14M images
JFT-300M: 18k classes, 303M (high-resolution) images

Models

Compared against:
  • “Big Transfer (BiT) (Kolesnikov et al., 2020), which performs supervised transfer learning with large ResNets”
  • “Noisy Student (Xie et al., 2020), which is a large EfficientNet trained using semi-supervised learning on ImageNet and JFT-300M with the labels removed. Currently, Noisy Student is the state of the art on ImageNet and BiT-L on the other datasets reported here”

Results

Noisy Student is trained on JFT (with labels removed?) - so this is a fair comparison. Not sure about BiT.
ViT outperforms BiT once the pre-training dataset gets very large. Not sure why the BiT error bars are so huge.

Interpretability

  1. We can clearly see that the initial patch embeddings are capturing very different features.
  2. The position embeddings are most similar to the ones in the same row or column - a good sign! (see the similarity sketch below)
  3. The attention starts off as a mixture of local (like a CNN) and global, and becomes more uniformly global in deeper layers.
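
For observation 2, the row/column structure can be checked by computing the cosine similarity of each learned position embedding with all the others. An illustrative version (a 14x14 patch grid and D=768 are assumed, and the random tensor is a stand-in for the learned embeddings):

```python
# Position-embedding similarity maps, sketched for observation 2 above.
import torch
import torch.nn.functional as F

pos_emb = torch.randn(14 * 14, 768)                    # stand-in for learned embeddings
unit = F.normalize(pos_emb, dim=-1)
sim = unit @ unit.T                                    # (196, 196) cosine similarities
# sim[i] reshaped to (14, 14) shows how similar patch i is to every grid position;
# in a trained ViT the highest values lie along patch i's row and column.
sim_maps = sim.reshape(14 * 14, 14, 14)
```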