Attention
Original paper: https://arxiv.org/pdf/1409.0473.pdf
What's wrong with Seq2Seq?
A critical and apparent disadvantage of this fixed-length context vector design is the incapability of remembering long sentences: the encoder has often forgotten the first part of the input by the time it finishes processing the whole sequence.
The attention mechanism was born to help memorize long source sentences in neural machine translation.
Overview
Attention mechanism for sequence-to-sequence problems.
- Training data: pairs of a source sequence $\mathbf{x} = (x_1, \dots, x_n)$ and a target sequence $\mathbf{y} = (y_1, \dots, y_m)$
- Encoder is a recurrent network that learns hidden states $h_i$ for each source position $i$
- Decoder is a recurrent network that learns hidden states $s_t$ for each target position $t$
- The context vector $c_t$ is based on $s_t$ and all previous and subsequent encoder states $h_i$: $c_t = \sum_{i=1}^{n} \alpha_{t,i}\, h_i$
- The weights $\alpha_{t,i} = \mathrm{softmax}_i\big(\mathrm{score}(s_t, h_i)\big)$ define the alignment between each source and target position.
The matrix of alignment scores is a nice byproduct that explicitly shows the correlation between source and target words.
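A minimal sketch of this weighted-sum context vector, assuming a plain dot-product score; the tensor names and shapes are illustrative, not from the paper:

```python
import torch
import torch.nn.functional as F

n, d = 6, 8                               # n source tokens, hidden size d (arbitrary)
encoder_states = torch.randn(n, d)        # h_1 ... h_n from the (bi)RNN encoder
decoder_state = torch.randn(d)            # s_t, the current decoder state

scores = encoder_states @ decoder_state   # score(s_t, h_i) for every i, shape (n,)
alpha = F.softmax(scores, dim=0)          # alignment weights alpha_{t,i}, sum to 1
context = alpha @ encoder_states          # c_t = sum_i alpha_{t,i} h_i, shape (d,)
```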
Some popular score functions are:
- Dot product: $\mathrm{score}(s_t, h_i) = s_t^\top h_i$
- Scaled dot-product: $\mathrm{score}(s_t, h_i) = \dfrac{s_t^\top h_i}{\sqrt{n}}$, where $n$ is the hidden-state dimension
- Additive: $\mathrm{score}(s_t, h_i) = v_a^\top \tanh\!\big(W_a [s_t; h_i]\big)$
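Hedged sketches of these three score functions; the learned additive parameters $W_a$, $v_a$ and the dimension names are illustrative assumptions:

```python
import torch

def dot_score(s_t, h):            # s_t: (d,), h: (n, d) -> scores of shape (n,)
    return h @ s_t

def scaled_dot_score(s_t, h):     # same, divided by sqrt of the hidden dimension
    return (h @ s_t) / h.shape[-1] ** 0.5

class AdditiveScore(torch.nn.Module):
    """score(s_t, h_i) = v_a^T tanh(W_a [s_t; h_i])"""
    def __init__(self, d_dec, d_enc, d_attn):
        super().__init__()
        self.W_a = torch.nn.Linear(d_dec + d_enc, d_attn, bias=False)
        self.v_a = torch.nn.Linear(d_attn, 1, bias=False)

    def forward(self, s_t, h):                 # s_t: (d_dec,), h: (n, d_enc)
        s_rep = s_t.expand(h.shape[0], -1)     # repeat s_t for every source position
        return self.v_a(torch.tanh(self.W_a(torch.cat([s_rep, h], dim=-1)))).squeeze(-1)
```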
Transformers
Simple scaled dot-product attention equation: $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\dfrac{QK^\top}{\sqrt{d_k}}\right)V$
The dimensions of $QK^\top$ are simply (target x source), leading to our attention map!
Consider a single row of this map for a given target, $a_t = \mathrm{softmax}\!\big(q_t K^\top / \sqrt{d_k}\big)$, and think of its entries as weights over the source positions.
Then the attention output row for that target is simply $a_t V$, a weighted sum of the rows of $V$. Simple!
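As a sketch, the whole map-and-weighted-sum computation fits in a few lines (shapes are illustrative; batching and masking are left out):

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    # Q: (targets, d_k), K: (sources, d_k), V: (sources, d_v)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(Q.shape[-1])  # (targets, sources) attention map
    weights = F.softmax(scores, dim=-1)                        # each target row sums to 1
    return weights @ V                                         # weighted sums of V's rows
```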
Multi-head self attention
- V, K, Q all come from the same input (hence "self" attention)
- Multiple linear layers transforming V, K, Q
- Going into multiple scaled dot-product attention heads
- Concatenated and fed into a linear layer
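A minimal sketch of those four steps for a single unbatched sequence; the class name, head-splitting details, and shapes are my own assumptions rather than the paper's exact formulation:

```python
import torch

class MultiHeadSelfAttention(torch.nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.q_proj = torch.nn.Linear(d_model, d_model)    # per-head projections, carved
        self.k_proj = torch.nn.Linear(d_model, d_model)    # out of one big linear layer each
        self.v_proj = torch.nn.Linear(d_model, d_model)
        self.out_proj = torch.nn.Linear(d_model, d_model)  # final linear after concatenation

    def forward(self, x):                      # x: (seq, d_model); Q, K, V all come from x
        seq = x.shape[0]
        def split(t):                          # (seq, d_model) -> (n_heads, seq, d_head)
            return t.view(seq, self.n_heads, self.d_head).transpose(0, 1)
        q, k, v = split(self.q_proj(x)), split(self.k_proj(x)), split(self.v_proj(x))
        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5   # per-head (seq, seq) maps
        heads = torch.softmax(scores, dim=-1) @ v               # per-head attention outputs
        concat = heads.transpose(0, 1).reshape(seq, -1)         # concatenate heads
        return self.out_proj(concat)                            # fed into a final linear layer
```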
Encoder
- Multi-head attention
- Followed by fully-connected feed-forward network
- Layer norm after each
- Residual connections over each
- Stack of these layers
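A sketch of one such encoder layer, using PyTorch's built-in torch.nn.MultiheadAttention and the post-norm ordering (layer norm applied after each residual add); d_ff and the other names are assumptions. A full encoder would stack several of these and add positional encodings, omitted here:

```python
import torch

class EncoderLayer(torch.nn.Module):
    def __init__(self, d_model, n_heads, d_ff):
        super().__init__()
        self.attn = torch.nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = torch.nn.Sequential(
            torch.nn.Linear(d_model, d_ff),
            torch.nn.ReLU(),
            torch.nn.Linear(d_ff, d_model),
        )
        self.norm1 = torch.nn.LayerNorm(d_model)
        self.norm2 = torch.nn.LayerNorm(d_model)

    def forward(self, x):                      # x: (batch, seq, d_model)
        a, _ = self.attn(x, x, x)              # self-attention: Q = K = V = x
        x = self.norm1(x + a)                  # residual connection, then layer norm
        return self.norm2(x + self.ff(x))      # same pattern around the feed-forward net
```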
Decoder
- Input comes from the (right-shifted) output sequence (at training time this is the target labels; at inference it is the previously generated outputs)
- First multi-head attention layer is masked so as to not use future sequence information
- Second multi-head attention layer aligns encoder output with masked attention layer's output
- Otherwise the same structure as the encoder:
- Followed by fully-connected feed-forward network
- Layer norm after each
- Residual connections over each
- Stack of these layers
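And a matching sketch of one decoder layer, again with torch.nn.MultiheadAttention; the boolean causal mask implements the "masked" first attention layer, and the second call aligns the decoder with the encoder output (names and shapes are assumptions):

```python
import torch

class DecoderLayer(torch.nn.Module):
    def __init__(self, d_model, n_heads, d_ff):
        super().__init__()
        self.self_attn = torch.nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = torch.nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = torch.nn.Sequential(
            torch.nn.Linear(d_model, d_ff),
            torch.nn.ReLU(),
            torch.nn.Linear(d_ff, d_model),
        )
        self.norm1 = torch.nn.LayerNorm(d_model)
        self.norm2 = torch.nn.LayerNorm(d_model)
        self.norm3 = torch.nn.LayerNorm(d_model)

    def forward(self, y, enc_out):  # y: (batch, tgt_len, d_model), enc_out: (batch, src_len, d_model)
        tgt_len = y.shape[1]
        # Causal mask: True = "not allowed to attend", so each position sees only itself and the past
        mask = torch.triu(torch.ones(tgt_len, tgt_len, dtype=torch.bool), diagonal=1)
        a, _ = self.self_attn(y, y, y, attn_mask=mask)   # masked self-attention over shifted outputs
        y = self.norm1(y + a)
        a, _ = self.cross_attn(y, enc_out, enc_out)      # Q from decoder, K/V from encoder output
        y = self.norm2(y + a)
        return self.norm3(y + self.ff(y))
```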