Motivation
- Leverage the same kind of benefits seen in NLP using transformers, but for images
- Hope that scale can ultimately trump inductive bias (i.e. of CNNs)
- They make the approach as NLP-like as possible: image patches play the role of tokens, and everything else stays close to the standard Transformer
Method
- Reshape input from $x \in \mathbb{R}^{H \times W \times C}$ to a sequence of flattened patches $x_p \in \mathbb{R}^{N \times (P^2 \cdot C)}$, where patches have resolution $P \times P$ and $N = HW/P^2$
- Linear projection to $\mathbb{R}^{N \times D}$, where $D$ is the latent/model dimension
- Prepend [class] token
- Add 1D positional embeddings
- Send to a standard Transformer (encoder only, no decoder); a minimal sketch of the full pipeline follows this list
- Still using layernorm
- Classification head attached to the [class] position
- Classification head = FFN (pre-training) or linear layer (fine-tuning)
- Supervised objective
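A minimal PyTorch sketch of the pipeline above (patchify, linearly project, prepend the [class] token, add 1D positional embeddings, run the encoder, classify from the [class] position). The class name, defaults, and encoder layer choice are illustrative (roughly ViT-Base-shaped), not the authors' implementation.

```python
import torch
import torch.nn as nn

class ViTSketch(nn.Module):
    def __init__(self, image_size=224, patch_size=16, in_chans=3,
                 dim=768, depth=12, heads=12, num_classes=1000):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2                    # N = HW / P^2
        self.patch_size = patch_size
        self.proj = nn.Linear(patch_size * patch_size * in_chans, dim)   # linear projection
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))            # [class] token
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))  # 1D positions
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=4 * dim,
                                           batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.norm = nn.LayerNorm(dim)
        self.head = nn.Linear(dim, num_classes)   # linear head (fine-tuning setup)

    def forward(self, x):                                   # x: (B, C, H, W)
        B, C, H, W = x.shape
        P = self.patch_size
        # Reshape (B, C, H, W) -> (B, N, P*P*C): flatten non-overlapping patches
        x = x.unfold(2, P, P).unfold(3, P, P)               # (B, C, H/P, W/P, P, P)
        x = x.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * P * P)
        x = self.proj(x)                                    # (B, N, D)
        cls = self.cls_token.expand(B, -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embed     # prepend [class], add positions
        x = self.norm(self.encoder(x))
        return self.head(x[:, 0])                           # classify from [class] position

logits = ViTSketch()(torch.randn(2, 3, 224, 224))           # (2, 1000)
```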
Fine-tuning
- New classification head
- Typically higher resolution
- 2D-interpolate the pre-trained positional embeddings so they cover the larger image's patch grid (see the sketch after this list)
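A sketch of that interpolation step, assuming a square patch grid and a prepended [class] embedding; the function name and the bicubic mode are choices made here for illustration, not necessarily the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def resize_pos_embed(pos_embed, new_grid):
    """pos_embed: (1, 1 + N, D) with a leading [class] embedding; new_grid: e.g. 24 for 384px / 16."""
    cls_pe, patch_pe = pos_embed[:, :1], pos_embed[:, 1:]
    old_grid = int(patch_pe.shape[1] ** 0.5)                     # N = old_grid^2
    d = patch_pe.shape[-1]
    # Treat the patch embeddings as a (old_grid x old_grid) feature map and resize it in 2D
    patch_pe = patch_pe.reshape(1, old_grid, old_grid, d).permute(0, 3, 1, 2)
    patch_pe = F.interpolate(patch_pe, size=(new_grid, new_grid),
                             mode="bicubic", align_corners=False)
    patch_pe = patch_pe.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, d)
    return torch.cat([cls_pe, patch_pe], dim=1)                  # (1, 1 + new_grid^2, D)

# e.g. 224px / 16 = 14x14 grid at pre-training -> 384px / 16 = 24x24 grid at fine-tuning
new_pe = resize_pos_embed(torch.randn(1, 1 + 14 * 14, 768), new_grid=24)
```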
Datasets
- ImageNet: 1k classes, 1.3M images
- ImageNet-21k: 21k classes, 14M images
- JFT: 18k classes, 303M (high-resolution) images
Models
Compared against:
- “Big Transfer (BiT) (Kolesnikov et al., 2020), which performs supervised transfer learning with large ResNets”
- “Noisy Student (Xie et al., 2020), which is a large EfficientNet trained using semi-supervised learning on ImageNet and JFT-300M with the labels removed. Currently, Noisy Student is the state of the art on ImageNet and BiT-L on the other datasets reported here”
Results
Interpretability
- The learned patch-embedding filters clearly capture a diverse set of low-level features (they resemble plausible basis functions for the structure within each patch)
- The position embeddings are most similar to the ones in the same row or column - a good sign that 2D image structure is being learned from purely 1D embeddings
- Attention starts off as a mixture of local (CNN-like) and global heads, and becomes more uniformly global in deeper layers (see the attention-distance sketch below)
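A rough sketch of the kind of mean-attention-distance diagnostic behind that last observation: weight each query-key pair's spatial distance (in pixels) by its attention weight to get a per-head "receptive field size". Shapes, names, and the random inputs here are assumptions for illustration, not the authors' code.

```python
import torch

def mean_attention_distance(attn, grid, patch_size):
    """attn: (heads, N, N) attention over N = grid*grid patch tokens ([class] token removed)."""
    coords = torch.stack(torch.meshgrid(torch.arange(grid), torch.arange(grid),
                                        indexing="ij"), dim=-1).reshape(-1, 2).float()
    dist = torch.cdist(coords, coords) * patch_size     # (N, N) pixel distances between patches
    # Expected distance per query under the attention distribution, averaged over queries
    return (attn * dist).sum(dim=-1).mean(dim=-1)       # (heads,)

attn = torch.softmax(torch.randn(12, 14 * 14, 14 * 14), dim=-1)
print(mean_attention_distance(attn, grid=14, patch_size=16))   # small = local head, large = global head
```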