Attention Is All You Need

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin



What is transduction in NLP? What kind of models does this require?
Input and output = sequences, not general/intermediate representation. Requires an encoder and decoder.
From a hardware perspective, why are transformers preferable to RNNs?
Transformers remove sequential dependency ➡️ better parallelism ➡️ better utilisation
Attention Is All You Need is not the first paper to introduce attention - what makes it special?
First sequence transduction approach to entirely replace recurrence.


Transformer overall architecture (don't have to explain encoder and decoder internals)
(figure: overall encoder-decoder Transformer architecture)
with the embedding layers and the pre-softmax linear layer sharing weights (think of the output projection as the embedding in reverse?)


Standard transformer encoder architecture
(figure: Transformer encoder layer)
Standard transformer decoder architecture
(figure: Transformer decoder layer)


Scaled dot-product attention
Attention(Q, K, V) = softmax(QKᵀ / √d_k) V, with dimensions Q: m×d_k, K: n×d_k, V: n×d_v, giving a d_v-dimensional feature vector per output (one per query)
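The equation above can be sketched directly in NumPy (a minimal, unbatched version, without masking):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q: (m, d_k) queries, K: (n, d_k) keys, V: (n, d_v) values."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                      # (m, n) scaled similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # row-wise softmax
    return weights @ V                                   # (m, d_v): one feature vector per query

rng = np.random.default_rng(0)
out = scaled_dot_product_attention(rng.normal(size=(4, 8)),
                                   rng.normal(size=(6, 8)),
                                   rng.normal(size=(6, 16)))
assert out.shape == (4, 16)
```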
Additive attention (idea, not equation)
Concatenate each query-key pair and feed them into a scalar-output neural network to calculate attention weights
Multiplicative attention (compared with dot-product attention)
The QKᵀ term turns into QWKᵀ, where W is a learned weight matrix.
Advantage of dot-product/multiplicative attention vs additive
Can all be represented as one big matmul (i.e. ⬆️ parallelism) (I think additive can be treated as one big NN operation, but requires duplication of every q-k pair to create batch)
Standard multi-head attention architecture
(figure: multi-head attention)
Rationale for using multiple attention heads
Each one learns to attend to different parts of the sequence
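A minimal NumPy sketch of multi-head self-attention (the weight matrices and shapes here are illustrative, not the paper's exact parameterisation — the paper uses separate per-head projections, which is equivalent to splitting one big projection as below):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo, h):
    """X: (n, d_model); Wq/Wk/Wv/Wo: (d_model, d_model); h heads of size d_model // h."""
    n, d_model = X.shape
    d_head = d_model // h
    # Project, then split the last dimension into h heads.
    Q = (X @ Wq).reshape(n, h, d_head).transpose(1, 0, 2)  # (h, n, d_head)
    K = (X @ Wk).reshape(n, h, d_head).transpose(1, 0, 2)
    V = (X @ Wv).reshape(n, h, d_head).transpose(1, 0, 2)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)    # (h, n, n): each head attends independently
    heads = softmax(scores) @ V                            # (h, n, d_head)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)  # concatenate heads
    return concat @ Wo                                     # final output projection

rng = np.random.default_rng(0)
d = 32
X = rng.normal(size=(10, d))
out = multi_head_attention(X, *(rng.normal(size=(d, d)) for _ in range(4)), h=4)
assert out.shape == (10, d)
```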


How do the feedforward layers in a transformer work?
They are applied to each position separately and identically.
Why can transformers be thought of as having a convolutional aspect?
The separate and identical application of the feedforward layer is equivalent to stacking convolutional layers with kernel size 1.
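The kernel-size-1 equivalence is easy to check numerically: applying the same linear layer at every position is the same single matmul over the whole sequence (a toy NumPy demonstration):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(10, 16))      # sequence length 10, model dim 16
W = rng.normal(size=(16, 32))      # one position-wise linear layer

# Position-wise application: the same W at every position...
positionwise = np.stack([x[t] @ W for t in range(x.shape[0])])

# ...is exactly a kernel-size-1 convolution over the sequence, i.e. one matmul.
conv1x1 = x @ W

assert np.allclose(positionwise, conv1x1)
```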
What is notable about the embedding layers and final linear layer of a standard transformer?
They share the same weight matrix
How does Attention Is All You Need implement positional embedding?
  1. Each dimension of the embedding is a point on a sinusoid (position moves along the wave)
  1. The wavelengths of the sinusoids increase geometrically across dimensions (from 2π up to 10000 · 2π), i.e. the frequencies decrease
  1. Positional embeddings are summed with the input embeddings
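The steps above can be sketched as (following the paper's PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(...) formula):

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """Build the (max_len, d_model) sinusoidal positional-encoding matrix."""
    pos = np.arange(max_len)[:, None]
    i = np.arange(0, d_model, 2)[None, :]
    angle = pos / (10000 ** (i / d_model))   # wavelengths grow geometrically across dims
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)              # even dims: sine
    pe[:, 1::2] = np.cos(angle)              # odd dims: cosine
    return pe

pe = sinusoidal_positional_encoding(50, 64)
# Summed with (hypothetical) input embeddings before the first layer:
# x = token_embeddings + pe[:seq_len]
```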


Complexity of an attention layer (assuming k&v are of length n, output of length m, and dimension d)
O(n · m · d)
Complexity of a self-attention layer (assuming k&v are of length n, dimension d)
O(n² · d)
Typical complexity of a recurrent layer (seq length n, hidden dim d)
O(n · d²)
When does an attention layer have better complexity than a recurrent or convolutional layer?
When n < d (seq length < hidden dim; which is typically the case for seq modelling)



What kind of input embeddings are used in Attention Is All You Need?
Byte-pair encoding
Byte-pair encoding
  1. Initial tokens = chars in corpus
  1. Repeatedly: merge the most common adjacent token pair to create a new token
  1. Stop when desired vocab size (number of tokens) is reached
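The three steps above, as a toy sketch (learning merges only, on whole words; real BPE implementations operate on word frequencies over a corpus):

```python
from collections import Counter

def bpe_merges(words, target_vocab_size):
    """Toy BPE: learn merge rules until the vocab reaches target_vocab_size."""
    corpus = [list(w) for w in words]          # start from individual characters
    vocab = {c for w in corpus for c in w}
    merges = []
    while len(vocab) < target_vocab_size:
        # Count adjacent token pairs across the corpus.
        pairs = Counter((w[i], w[i + 1]) for w in corpus for i in range(len(w) - 1))
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]    # most frequent pair
        merges.append((a, b))
        vocab.add(a + b)
        # Replace every occurrence of the pair with the merged token.
        for w in corpus:
            i = 0
            while i < len(w) - 1:
                if w[i] == a and w[i + 1] == b:
                    w[i:i + 2] = [a + b]
                else:
                    i += 1
    return merges

merges = bpe_merges(["low", "lower", "lowest", "newest"], target_vocab_size=12)
```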
Benefits of byte-pair encoding
  1. Token size a hybrid of character and word-level encodings
  1. Enables encoding of rare words
In NLP, what is constituency parsing?
Breaking sentences down into nested constituents (phrase structure, e.g. noun phrases and verb phrases)
In NLP, the task of breaking sentences down into nested constituents (phrase structure) is known as what?
Constituency parsing
How does Attention Is All You Need generate the whole autoregressive output sequence?
  1. Beam search
  1. With variable length
  1. And a length penalty
Beam search
Breadth-first search over output tokens, keeping only the top k (the beam width) candidates at each level by pruning the rest
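A toy sketch of beam search with a length penalty (the `step_log_probs` model interface and the penalty form `score / len^alpha` are illustrative assumptions, not the paper's exact formulation):

```python
import math

def beam_search(step_log_probs, beam_width, max_len, eos, alpha=0.6):
    """step_log_probs(prefix) -> {token: log_prob} is a hypothetical model hook."""
    beams = [([], 0.0)]                  # (token sequence, summed log-prob)
    finished = []
    for _ in range(max_len):
        # Expand every live beam by every candidate next token.
        candidates = []
        for seq, score in beams:
            for tok, lp in step_log_probs(seq).items():
                candidates.append((seq + [tok], score + lp))
        # Prune: keep only the best `beam_width` hypotheses at this level.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_width]:
            if seq[-1] == eos:
                finished.append((seq, score / len(seq) ** alpha))  # length penalty
            else:
                beams.append((seq, score))
        if not beams:
            break
    finished += [(s, sc / len(s) ** alpha) for s, sc in beams]
    return max(finished, key=lambda c: c[1])[0]

# Dummy model: prefers token 1 twice, then forces EOS (= token 2).
def fake_model(prefix):
    if len(prefix) < 2:
        return {1: math.log(0.9), 2: math.log(0.1)}
    return {2: 0.0}
```

Without the length penalty, the search would be biased toward short sequences, since every added token makes the summed log-probability more negative.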
What regularisation does Attention Is All You Need use?
Dropout and label smoothing


Translation: better BLEU score for EN-DE and EN-FR than existing methods, including ensembles, with similar or lower training cost (FLOPs).
Constituency Parsing: strong results, although not quite SOTA
Ablations: broadly show that increasing dimensionality across the transformer helps, although they vary too many things at once to draw strong conclusions. Dropout clearly helps. Sinusoidal positional encodings perform no differently from learned positional embeddings.