Motivation
- Leverage the same kind of benefits seen in NLP using transformers, but for images
- Hope that scale can ultimately trump inductive bias (i.e. the inductive biases baked into CNNs)
- They make the approach as NLP-like as possible: image patches play the role of tokens, and everything else closely follows the standard Transformer
![notion image](https://www.notion.so/image/https%3A%2F%2Fs3.us-west-2.amazonaws.com%2Fsecure.notion-static.com%2F0814f82a-75f3-4025-8932-2be10960ae6d%2FScreenshot_2022-01-05_at_15.28.23.png%3FX-Amz-Algorithm%3DAWS4-HMAC-SHA256%26X-Amz-Content-Sha256%3DUNSIGNED-PAYLOAD%26X-Amz-Credential%3DAKIAT73L2G45EIPT3X45%252F20221016%252Fus-west-2%252Fs3%252Faws4_request%26X-Amz-Date%3D20221016T192505Z%26X-Amz-Expires%3D86400%26X-Amz-Signature%3D9d79be483191f9cf0995f701bcd9ada6da4cb37fa63254cb6bbee58f0999f65d%26X-Amz-SignedHeaders%3Dhost%26x-id%3DGetObject?table=block&id=348798ed-0631-4e16-bb31-bac840c6ea63&cache=v2)
Method
- Reshape the input $x \in \mathbb{R}^{H \times W \times C}$ into a sequence of flattened patches $x_p \in \mathbb{R}^{N \times (P^2 \cdot C)}$, where each patch has resolution $(P, P)$ and $N = HW/P^2$
- Linear projection to $\mathbb{R}^{N \times D}$, where $D$ is the latent/model dimension (a minimal forward-pass sketch follows this list)
- Prepend a [class] token
- Add 1D positional embeddings
- Send to transformer (just encoder)
- Still using layernorm
- Classification head attached to the [class] position
- Classification head = MLP with one hidden layer (pre-training) or a single linear layer (fine-tuning)
- Supervised objective
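Putting the method bullets together, here is a minimal PyTorch forward-pass sketch (my own, not the authors' code; the class name `ViTSketch` and the ViT-Base-ish hyperparameters are assumptions):

```python
# Minimal ViT forward-pass sketch (illustrative only; names/values are assumptions)
import torch
import torch.nn as nn


class ViTSketch(nn.Module):
    def __init__(self, image_size=224, patch_size=16, in_channels=3,
                 dim=768, depth=12, heads=12, num_classes=1000):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2   # N = HW / P^2
        patch_dim = in_channels * patch_size ** 2       # P^2 * C
        self.patch_size = patch_size

        # Linear projection of flattened patches: (P^2 * C) -> D
        self.proj = nn.Linear(patch_dim, dim)
        # Learnable [class] token and 1D positional embeddings (N + 1 positions)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        # Standard Transformer encoder (pre-LayerNorm)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=4 * dim,
                                           batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        # Classification head (single linear layer, as at fine-tuning time)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):                        # x: (B, C, H, W)
        B, C, H, W = x.shape
        P = self.patch_size
        # Reshape the image into a sequence of flattened P x P patches
        x = x.unfold(2, P, P).unfold(3, P, P)    # (B, C, H/P, W/P, P, P)
        x = x.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * P * P)  # (B, N, P^2*C)
        x = self.proj(x)                         # (B, N, D)
        cls = self.cls_token.expand(B, -1, -1)   # prepend [class] token
        x = torch.cat([cls, x], dim=1)           # (B, N+1, D)
        x = x + self.pos_embed                   # add 1D positional embeddings
        x = self.encoder(x)                      # Transformer encoder only
        return self.head(x[:, 0])                # classify from the [class] position


logits = ViTSketch()(torch.randn(2, 3, 224, 224))  # -> (2, 1000)
```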
Fine-tuning
- New classification head
- Typically higher resolution
- 2D-interpolate the pre-trained positional embeddings to the larger patch grid of the higher-resolution images (sketch below)
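A sketch of what that interpolation can look like, assuming the usual layout of the 1D positional embeddings on the patch grid (the helper `resize_pos_embed` and the shapes are my assumptions, not from the paper):

```python
# Sketch: 2D interpolation of pre-trained positional embeddings for fine-tuning
# at higher resolution (helper name and shapes are assumptions)
import torch
import torch.nn.functional as F


def resize_pos_embed(pos_embed, old_grid, new_grid):
    """pos_embed: (1, 1 + old_grid**2, D), including the [class] position."""
    cls_pos, patch_pos = pos_embed[:, :1], pos_embed[:, 1:]
    D = patch_pos.shape[-1]
    # Lay patch embeddings out on their original 2D grid, interpolate, flatten back
    patch_pos = patch_pos.reshape(1, old_grid, old_grid, D).permute(0, 3, 1, 2)
    patch_pos = F.interpolate(patch_pos, size=(new_grid, new_grid),
                              mode="bicubic", align_corners=False)
    patch_pos = patch_pos.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, D)
    return torch.cat([cls_pos, patch_pos], dim=1)


# e.g. 224px / 16 = 14x14 grid pre-training -> 384px / 16 = 24x24 grid fine-tuning
new_pos = resize_pos_embed(torch.randn(1, 1 + 14 * 14, 768), 14, 24)  # (1, 577, 768)
```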
Datasets
- ImageNet: 1k classes, 1.3M images
- ImageNet-21k: 21k classes, 14M images
- JFT: 18k classes, 303M (high-res) images
Models
Compared against:
- “Big Transfer (BiT) (Kolesnikov et al., 2020), which performs supervised transfer learning with large ResNets”
- “Noisy Student (Xie et al., 2020), which is a large EfficientNet trained using semi-supervised learning on ImageNet and JFT-300M with the labels removed. Currently, Noisy Student is the state of the art on ImageNet and BiT-L on the other datasets reported here”
Results
![Noisy Student is trained on JFT (with labels removed?) - so this is a fair comparison. Not sure about BiT.](https://www.notion.so/image/https%3A%2F%2Fs3.us-west-2.amazonaws.com%2Fsecure.notion-static.com%2F600a5fe6-bd89-44e9-8773-9bbb9bbd9869%2FScreenshot_2022-01-05_at_17.38.37.png%3FX-Amz-Algorithm%3DAWS4-HMAC-SHA256%26X-Amz-Content-Sha256%3DUNSIGNED-PAYLOAD%26X-Amz-Credential%3DAKIAT73L2G45EIPT3X45%252F20221016%252Fus-west-2%252Fs3%252Faws4_request%26X-Amz-Date%3D20221016T192505Z%26X-Amz-Expires%3D86400%26X-Amz-Signature%3D1d4aa2d0e5c8acb6563150541dc2ac42a45ed298c4895413b64288948c64f1a0%26X-Amz-SignedHeaders%3Dhost%26x-id%3DGetObject?table=block&id=8fee3031-a39e-4fad-bde2-b52805b8eb3b&cache=v2)
![ViT outperforms BiT when we get to very large pre-training. Not sure why BiT error bars so huge?](https://www.notion.so/image/https%3A%2F%2Fs3.us-west-2.amazonaws.com%2Fsecure.notion-static.com%2F98b77c26-cde9-4dd0-be56-2041036bed3d%2FScreenshot_2022-01-05_at_17.42.50.png%3FX-Amz-Algorithm%3DAWS4-HMAC-SHA256%26X-Amz-Content-Sha256%3DUNSIGNED-PAYLOAD%26X-Amz-Credential%3DAKIAT73L2G45EIPT3X45%252F20221016%252Fus-west-2%252Fs3%252Faws4_request%26X-Amz-Date%3D20221016T192505Z%26X-Amz-Expires%3D86400%26X-Amz-Signature%3D4609fb4827cda943d7adf9fdc0b66ade3e44aa54939487886268b33fd63a9f96%26X-Amz-SignedHeaders%3Dhost%26x-id%3DGetObject?table=block&id=49d81fb1-2095-461b-a5d5-0eae4e80d397&cache=v2)
![notion image](https://www.notion.so/image/https%3A%2F%2Fs3.us-west-2.amazonaws.com%2Fsecure.notion-static.com%2F443cfa1c-15ac-4a26-8fef-a3c8d6e59bb1%2FScreenshot_2022-01-05_at_17.46.53.png%3FX-Amz-Algorithm%3DAWS4-HMAC-SHA256%26X-Amz-Content-Sha256%3DUNSIGNED-PAYLOAD%26X-Amz-Credential%3DAKIAT73L2G45EIPT3X45%252F20221016%252Fus-west-2%252Fs3%252Faws4_request%26X-Amz-Date%3D20221016T192505Z%26X-Amz-Expires%3D86400%26X-Amz-Signature%3Dcccfbc4a492d4d67323c38bd5b13b0c9f8e637308780d6b44f393becaf8b27cd%26X-Amz-SignedHeaders%3Dhost%26x-id%3DGetObject?table=block&id=fd0563ea-18ce-4b89-8df0-5f16ed1009cc&cache=v2)
Interpretability
![notion image](https://www.notion.so/image/https%3A%2F%2Fs3.us-west-2.amazonaws.com%2Fsecure.notion-static.com%2F1ba27bbe-d6c8-4477-8dc9-d54d33daece2%2FScreenshot_2022-01-05_at_17.51.46.png%3FX-Amz-Algorithm%3DAWS4-HMAC-SHA256%26X-Amz-Content-Sha256%3DUNSIGNED-PAYLOAD%26X-Amz-Credential%3DAKIAT73L2G45EIPT3X45%252F20221016%252Fus-west-2%252Fs3%252Faws4_request%26X-Amz-Date%3D20221016T192505Z%26X-Amz-Expires%3D86400%26X-Amz-Signature%3D8321d01fdca2088e27a715ef267ee14a0d6c2d74812ac051e33047197b43a17c%26X-Amz-SignedHeaders%3Dhost%26x-id%3DGetObject?table=block&id=403f5fcb-1c07-4ce0-960c-956d2176da31&cache=v2)
- The learned filters of the initial patch-embedding projection clearly capture very different low-level features (they resemble plausible basis functions for the structure within each patch)
- Each position embedding is most similar to the others in the same row or column - a good sign that the 1D embeddings have learned the 2D grid topology
- Attention starts off as a mixture of local (CNN-like) and global heads, and becomes almost uniformly global in the deeper layers (a sketch of the underlying "mean attention distance" metric follows)
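For reference, the statistic behind that last point can be computed roughly as below: the average pixel distance between a query patch and the patches it attends to, weighted by the attention weights (the function name and shapes are my assumptions):

```python
# Sketch of "mean attention distance" per head (names/shapes are assumptions)
import torch


def mean_attention_distance(attn, grid_size, patch_size=16):
    """attn: (heads, N, N) attention weights over N = grid_size**2 patches."""
    coords = torch.stack(torch.meshgrid(torch.arange(grid_size),
                                        torch.arange(grid_size),
                                        indexing="ij"), dim=-1).reshape(-1, 2).float()
    # Pairwise patch-centre distances in pixels: (N, N)
    dist = torch.cdist(coords, coords) * patch_size
    # Attention-weighted average distance, one value per head
    return (attn * dist).sum(dim=(-1, -2)) / attn.sum(dim=(-1, -2))


attn = torch.softmax(torch.randn(12, 196, 196), dim=-1)  # 12 heads, 14x14 grid
print(mean_attention_distance(attn, grid_size=14))        # (12,) distances in pixels
```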