Attention Is All You Need

Introduction

What is transduction in NLP? What kind of models does this require?

Input and output = sequences, not general/intermediate representation. Requires an encoder and decoder.

From a hardware perspective, why are transformers preferable to RNNs?

Transformers remove sequential dependency ➡️ better parallelism ➡️ better utilisation

Attention Is All You Need is not the first paper to introduce attention - what makes it special?

First sequence transduction model to rely entirely on attention, replacing both recurrence and convolution.

Method

Transformer overall architecture (don't have to explain encoder and decoder internals)

with embedding layer and output linear layer sharing weights (think of as reverse process?)

Encoder-decoder

Standard transformer encoder architecture

Standard transformer decoder architecture

Attention

Scaled dot-product attention

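$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$

with dimensions $Q \in \mathbb{R}^{m \times d_k}$, $K \in \mathbb{R}^{n \times d_k}$, $V \in \mathbb{R}^{n \times d_v}$, giving a $d_v$-dimensional feature vector per output (an $m \times d_v$ result)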
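A minimal NumPy sketch of scaled dot-product attention, softmax(QKᵀ/√d_k)V. The function names, the random inputs and the sizes (3 queries, 5 keys/values) are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max before exponentiating for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q: (m, d_k), K: (n, d_k), V: (n, d_v) -> output (m, d_v), weights (m, n)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (m, n): similarity of each query to each key
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights          # weighted average of the value vectors

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 8))     # m = 3 queries, d_k = 8
K = rng.normal(size=(5, 8))     # n = 5 keys
V = rng.normal(size=(5, 16))    # n = 5 values, d_v = 16
out, weights = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (3, 16): one feature vector per output position
```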

Additive attention (idea, not equation)

Concatenate each query-key pair and feed them into a scalar-output neural network to calculate attention weights
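A rough sketch of the additive idea, assuming a one-hidden-layer tanh network as the scorer. The parameter names `W1`, `b1`, `w2`, the hidden size and the random values are hypothetical. Note how every query-key pair has to be materialised before the network is applied.

```python
import numpy as np

def additive_attention_weights(Q, K, W1, b1, w2):
    """Score each [q; k] pair with a small scalar-output network, then softmax."""
    m, n = Q.shape[0], K.shape[0]
    # Materialise all m*n concatenated pairs: shape (m, n, d_q + d_k).
    pairs = np.concatenate(
        [np.repeat(Q[:, None, :], n, axis=1), np.repeat(K[None, :, :], m, axis=0)],
        axis=-1,
    )
    hidden = np.tanh(pairs @ W1 + b1)    # (m, n, h)
    scores = hidden @ w2                 # (m, n): one scalar per pair
    scores = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
d, h = 8, 12
Q = rng.normal(size=(3, d))
K = rng.normal(size=(5, d))
W1 = rng.normal(size=(2 * d, h))   # stands in for learned parameters
b1 = np.zeros(h)
w2 = rng.normal(size=h)
add_weights = additive_attention_weights(Q, K, W1, b1, w2)
print(add_weights.shape)  # (3, 5)
```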

Multiplicative attention (compared with dot-product attention)

The term $QK^\top$ turns into $QWK^\top$, where $W$ is a weight matrix.
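A small sketch comparing the two score functions. `W` is a random stand-in for a learned weight matrix and the sizes are arbitrary. With `W` set to the identity, the multiplicative scores reduce to plain dot-product scores.

```python
import numpy as np

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 8))
K = rng.normal(size=(5, 8))
W = rng.normal(size=(8, 8))            # stands in for a learned weight matrix

dot_scores = Q @ K.T                   # dot-product scores, shape (3, 5)
mult_scores = Q @ W @ K.T              # multiplicative scores, shape (3, 5)
identity_scores = Q @ np.eye(8) @ K.T  # W = I recovers the dot-product case

print(np.allclose(identity_scores, dot_scores))  # True
```

Both variants are plain matrix products over all query-key pairs at once, which is what makes them easy to parallelise.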

Advantage of dot-product/multiplicative attention vs additive

Can all be represented as one big matmul (i.e. ⬆️ parallelism) (I think additive can be treated as one big NN operation, but requires duplication of every q-k pair to create batch)

Standard multi-head attention architecture

Rationale for using multiple attention heads

Each head can learn to attend to different positions in the sequence, using a different learned projection (representation subspace)
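A compact sketch of multi-head self-attention, assuming square projection matrices and a model dimension divisible by the number of heads. All names, sizes and the random weights are illustrative. Each head gets its own slice of the projected queries, keys and values, so each can produce a different attention pattern.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(X, W_q, W_k, W_v, W_o, num_heads):
    """X: (n, d_model); every projection matrix: (d_model, d_model)."""
    n, d_model = X.shape
    d_head = d_model // num_heads

    def split_heads(M):
        # (n, d_model) -> (num_heads, n, d_head)
        return M.reshape(n, num_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split_heads(X @ W_q), split_heads(X @ W_k), split_heads(X @ W_v)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)     # (num_heads, n, n)
    weights = softmax(scores, axis=-1)                      # one pattern per head
    heads = weights @ V                                     # (num_heads, n, d_head)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)   # concatenate the heads
    return concat @ W_o, weights

rng = np.random.default_rng(0)
n, d_model, num_heads = 6, 16, 4
X = rng.normal(size=(n, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))
mha_out, mha_weights = multi_head_self_attention(X, W_q, W_k, W_v, W_o, num_heads)
print(mha_out.shape, mha_weights.shape)  # (6, 16) (4, 6, 6)
```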

Misc

How do the feedforward layers in a transformer work?

They are applied to each position separately and identically.

Why can transformers be thought of as having a convolutional aspect?

The separate and identical application of the feedforward layer is equivalent to stacking convolutional layers with kernel size 1.
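A short sketch of the position-wise feedforward layer, max(0, xW1 + b1)W2 + b2, with small made-up sizes. Applying it to the whole sequence at once gives the same result as applying it to each position on its own, which is exactly what a kernel-size-1 convolution does: the same linear map at every position, with no mixing between positions.

```python
import numpy as np

def position_wise_ffn(X, W1, b1, W2, b2):
    """X: (n, d_model). The same two-layer ReLU network is applied to every row."""
    return np.maximum(0.0, X @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
n, d_model, d_ff = 5, 8, 32
X = rng.normal(size=(n, d_model))
W1, b1 = rng.normal(size=(d_model, d_ff)), rng.normal(size=d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), rng.normal(size=d_model)

batched = position_wise_ffn(X, W1, b1, W2, b2)
# Apply the layer to one position at a time and stack the results.
per_position = np.stack(
    [position_wise_ffn(x[None, :], W1, b1, W2, b2)[0] for x in X]
)
print(np.allclose(batched, per_position))  # True
```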

What is notable about the embedding layers and final linear layer of a standard transformer?

They share the same weight matrix
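A toy illustration of the weight sharing: a single matrix `E` is used both for the embedding lookup and, transposed, as the pre-softmax linear layer. The vocabulary size, dimensions and the random `hidden` array (a stand-in for decoder outputs) are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model = 10, 4
E = rng.normal(size=(vocab_size, d_model))   # the single shared weight matrix

token_ids = np.array([2, 7, 7, 1])
# Embedding lookup; the paper scales the embedding weights by sqrt(d_model).
embedded = E[token_ids] * np.sqrt(d_model)   # (4, d_model)

hidden = rng.normal(size=(4, d_model))       # stand-in for decoder outputs
logits = hidden @ E.T                        # (4, vocab_size): reuses E transposed
print(embedded.shape, logits.shape)  # (4, 4) (4, 10)
```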

How does Attention Is All You Need implement positional embedding?

Each dimension of the embedding is a point on a sinusoidal wave

The wavelengths of the sinusoids increase geometrically across dimensions (from $2\pi$ to $10000 \cdot 2\pi$), i.e. the frequencies decrease geometrically

Positional embeddings are summed with input embeddings
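A sketch of the sinusoidal encoding using the paper's formulas, PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)). It assumes an even `d_model`. The sizes and the random token embeddings are illustrative.

```python
import numpy as np

def sinusoidal_positional_encoding(num_positions, d_model):
    """Returns an array of shape (num_positions, d_model); d_model must be even."""
    positions = np.arange(num_positions)[:, None]       # (num_positions, 1)
    i = np.arange(d_model // 2)[None, :]                # (1, d_model / 2)
    # Angular frequency shrinks geometrically with i, so wavelength grows.
    angular_freq = 1.0 / (10000 ** (2 * i / d_model))
    pe = np.zeros((num_positions, d_model))
    pe[:, 0::2] = np.sin(positions * angular_freq)      # even dimensions
    pe[:, 1::2] = np.cos(positions * angular_freq)      # odd dimensions
    return pe

pe = sinusoidal_positional_encoding(num_positions=50, d_model=16)

rng = np.random.default_rng(0)
token_embeddings = rng.normal(size=(50, 16))   # stand-in for input embeddings
encoder_input = token_embeddings + pe          # summed, not concatenated
print(encoder_input.shape)  # (50, 16)
```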

Analysis

Complexity of an attention layer (assuming k&v are of size $d$, input of length $n$, and output of length $m$)

$O(n \cdot m \cdot d)$

Complexity of a self-attention layer (assuming k&v are of size $d$, input of length $n$)

$O(n^2 \cdot d)$

Typical complexity of a recurrent layer (seq length $n$, hidden dim $d$)

$O(n \cdot d^2)$

When does an attention layer have better complexity than a recurrent or convolutional layer?

When $n < d$ (seq length < hidden dim, which is typically the case for seq modelling)
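A back-of-the-envelope comparison using the per-layer complexities reported in the paper, O(n²·d) for self-attention and O(n·d²) for a recurrent layer, ignoring constant factors. The values n = 50 and d = 512 are chosen purely for illustration.

```python
def self_attention_ops(n, d):
    # Every position attends to every other position: n * n dot products of size d.
    return n * n * d

def recurrent_ops(n, d):
    # One d x d matrix-vector product per time step.
    return n * d * d

n, d = 50, 512
print(self_attention_ops(n, d))  # 1280000
print(recurrent_ops(n, d))       # 13107200
```

The ratio between the two counts is d / n, so self-attention does less work per layer whenever the sequence is shorter than the hidden dimension.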

Experiments

Details

What kind of input tokenisation is used in Attention Is All You Need?

Byte-pair encoding (for EN-DE; EN-FR uses a word-piece vocabulary)

Initial tokens = chars in corpus

Repeatedly: merge the most frequent adjacent pair of tokens to create a new token

Stop when desired vocab size (number of tokens) is reached
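The steps above can be sketched as a toy trainer. This is a simplified illustration, not the tokeniser used in the paper: it merges within whitespace-free words only, and it stops after a fixed number of merges (`num_merges`, a hypothetical parameter), which is equivalent to fixing the vocab size at initial characters plus merges.

```python
from collections import Counter

def train_bpe(words, num_merges):
    """words: list of strings. Returns the learned merges, in order."""
    # Start with every word split into single characters.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent token pairs, weighted by word frequency.
        pair_counts = Counter()
        for tokens, freq in corpus.items():
            for pair in zip(tokens, tokens[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # nothing left to merge
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite the corpus with the best pair fused into one token.
        new_corpus = Counter()
        for tokens, freq in corpus.items():
            merged, i = [], 0
            while i < len(tokens):
                if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == best:
                    merged.append(tokens[i] + tokens[i + 1])
                    i += 2
                else:
                    merged.append(tokens[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges

merges = train_bpe(["abab", "abab", "abc"], num_merges=2)
print(merges)  # [('a', 'b'), ('ab', 'ab')]
```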

Benefits of byte-pair encoding

Token granularity sits between character-level and word-level encodings

Enables encoding of rare words

In NLP, what is constituency parsing?

Breaking a sentence down into nested constituents (phrases such as noun phrases and verb phrases), giving a parse tree

In NLP, the task of breaking sentences down into nested constituents (phrases) is known as what?

Constituency parsing

How does Attention Is All You Need generate the whole autoregressive output sequence?

Beam search

With variable length

And a length penalty

Beam search

Breadth-first search, keeping only the best $k$ (beam width) possibilities by pruning at the most recent level
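A minimal beam search sketch over a hypothetical toy model (`toy_model` and its probabilities are invented for illustration). The final ranking divides the summed log-probability by `len ** alpha`, a simple length normalisation assumed here for brevity; it is not necessarily the exact penalty used in the paper.

```python
import math

def beam_search(next_log_probs, beam_width, max_len, eos="<eos>", alpha=0.6):
    beams = [((), 0.0)]     # (tokens so far, summed log-probability)
    finished = []
    for _ in range(max_len):
        # Expand every live hypothesis by every possible next token.
        candidates = []
        for tokens, score in beams:
            for tok, lp in next_log_probs(tokens).items():
                candidates.append((tokens + (tok,), score + lp))
        # Prune: keep only the best `beam_width` expansions at this level.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_width]:
            (finished if tokens[-1] == eos else beams).append((tokens, score))
        if not beams:
            break
    finished.extend(beams)  # hypotheses cut off at max_len still compete
    # Length-normalised score, so longer outputs are not unfairly penalised.
    return max(finished, key=lambda c: c[1] / (len(c[0]) ** alpha))

def toy_model(prefix):
    # Hypothetical next-token distributions, for illustration only.
    if len(prefix) == 0:
        return {"a": math.log(0.6), "b": math.log(0.4)}
    if prefix[-1] == "a":
        return {"<eos>": math.log(0.5), "a": math.log(0.5)}
    return {"<eos>": math.log(0.9), "a": math.log(0.1)}

best_tokens, best_score = beam_search(toy_model, beam_width=2, max_len=3)
print(best_tokens)  # ('b', '<eos>')
```

Greedy decoding would start with "a" (probability 0.6) and end with a sequence of probability 0.3, whereas keeping two hypotheses finds "b" followed by end-of-sequence at 0.36.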

Uses dropout and label smoothing for regularisation
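A small sketch of label smoothing. It assumes the common variant in which the smoothing mass ε is spread uniformly over all classes, with ε = 0.1 as in the paper; the class count and targets are made up.

```python
import numpy as np

def smooth_labels(target_ids, num_classes, epsilon=0.1):
    """One-hot targets softened towards the uniform distribution."""
    one_hot = np.eye(num_classes)[target_ids]
    return one_hot * (1.0 - epsilon) + epsilon / num_classes

def cross_entropy(log_probs, target_dist):
    # Mean over examples of -sum_c target(c) * log p(c).
    return -(target_dist * log_probs).sum(axis=-1).mean()

targets = np.array([2, 0])
smoothed = smooth_labels(targets, num_classes=5)
print(smoothed[0])  # 0.92 on the true class, 0.02 on each of the others

# A model that predicts the uniform distribution over the 5 classes.
uniform_log_probs = np.full((2, 5), np.log(1.0 / 5))
loss = cross_entropy(uniform_log_probs, smoothed)
```

Because the targets never reach probability 1, the model is discouraged from becoming over-confident in a single token.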

Results

Translation: better BLEU score for EN-DE and EN-FR than existing methods, including ensembles, with similar or less training cost (FLOPs).

Constituency Parsing: strong results, although not quite SOTA

Ablations: broadly show that increasing dimensionality across the transformer helps, although they vary too many things at once to draw strong conclusions. Dropout definitely helps. Sinusoidal positional encodings perform about the same as learned positional embeddings.