
Attention & Transformers


Attention

What's wrong with Seq2Seq?

A critical and apparent disadvantage of this fixed-length context vector design is its inability to remember long sentences: the encoder has often forgotten the first part of the input by the time it finishes processing the whole sequence.
The attention mechanism was born to help memorize long source sentences in neural machine translation.

Overview

Attention mechanism for sequence-to-sequence problems.
  1. Training data: pairs of a source sequence $\mathbf{x} = [x_1, \dots, x_n]$ and a target sequence $\mathbf{y} = [y_1, \dots, y_m]$
  1. Encoder is a recurrent network that learns some state $h_i$ for each source token $x_i$
  1. Decoder is a recurrent network that learns some state $s_t$ for each target position $t$
  1. The context vector $c_t$ is based on $s_t$ and all previous and subsequent encoder states $h_i$: $c_t = \sum_{i=1}^{n} \alpha_{t,i} h_i$ (see the sketch after this list)
  1. The weights $\alpha_{t,i} = \operatorname{softmax}_i\big(\operatorname{score}(s_t, h_i)\big)$ define the alignment between each source and target position
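A minimal numpy sketch of one decoder step under this formulation (function and variable names such as `attention_step`, `H`, and `s_t` are my own, not from these notes):

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention_step(s_t, H, score_fn):
    """One decoder step: score s_t against every encoder state h_i,
    normalize into alignment weights, and build the context vector."""
    scores = np.array([score_fn(s_t, h) for h in H])  # (n,)
    alpha = softmax(scores)                           # alignment weights alpha_{t,i}
    c_t = alpha @ H                                   # c_t = sum_i alpha_{t,i} h_i
    return c_t, alpha

# Toy usage with a dot-product score
H = np.random.randn(6, 8)      # n = 6 encoder states of dimension 8
s_t = np.random.randn(8)       # current decoder state
c_t, alpha = attention_step(s_t, H, score_fn=lambda s, h: s @ h)
```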
 
 
The matrix of alignment scores is a nice byproduct that explicitly shows the correlation between source and target words.
[Figure: alignment score matrix between source and target words]
Some popular score functions are:
  1. Dot product: $\operatorname{score}(s_t, h_i) = s_t^\top h_i$
  1. Scaled dot-product: $\operatorname{score}(s_t, h_i) = \dfrac{s_t^\top h_i}{\sqrt{d}}$, where $d$ is the hidden-state dimension
  1. Additive: $\operatorname{score}(s_t, h_i) = v_a^\top \tanh\big(W_a [s_t; h_i]\big)$
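As a rough illustration of these three options (the parameter names `W_a` and `v_a` follow the usual additive-attention convention and are otherwise my own), each could plug into `attention_step` above as the `score_fn`:

```python
import numpy as np

def dot_score(s, h):
    return s @ h

def scaled_dot_score(s, h):
    return (s @ h) / np.sqrt(h.shape[-1])

def additive_score(s, h, W_a, v_a):
    # v_a^T tanh(W_a [s; h]); W_a: (d_a, d_s + d_h), v_a: (d_a,)
    return v_a @ np.tanh(W_a @ np.concatenate([s, h]))
```

The additive score carries its own learned parameters ($W_a$, $v_a$), so in practice it would be wrapped in a closure or a small module before being passed around.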

Transformers

Simple scaled dot-product attention equation:
$$\operatorname{Attention}(Q, K, V) = \operatorname{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right) V$$
The dimensions of $QK^\top$ are simply (target × source), leading to our attention map!
Consider a single row of this map for a given target, $a_t = \operatorname{softmax}\!\big(q_t K^\top / \sqrt{d_k}\big)$ (where $q_t$ is that target's query row), and think of it as a set of weights over the source positions.
Then the attention output row for the target is simply $a_t V$, a weighted sum of the rows of $V$. Simple!
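The same equation in matrix form, as a minimal numpy sketch (the shapes in the toy usage are illustrative assumptions):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q: (targets, d_k), K: (sources, d_k), V: (sources, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (targets, sources) attention map
    scores -= scores.max(axis=-1, keepdims=True)    # for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                              # each output row is a weighted sum of V's rows

# Toy usage: 4 target positions attending over 6 source positions
Q = np.random.randn(4, 8)
K = np.random.randn(6, 8)
V = np.random.randn(6, 16)
out = scaled_dot_product_attention(Q, K, V)         # (4, 16)
```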
 

Multi-head self attention

  1. V, K, Q all come from the same input (self-attention)
  1. Separate linear layers transform V, K, Q for each head
  1. Each head runs scaled dot-product attention
  1. Head outputs are concatenated and fed into a final linear layer (see the sketch below)
[Figure: multi-head attention block]
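A compact numpy sketch of those four steps for a single sequence, assuming square (d_model × d_model) projection matrices; every name here is illustrative:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_self_attention(X, num_heads, W_q, W_k, W_v, W_o):
    """X: (seq, d_model); W_q, W_k, W_v, W_o: (d_model, d_model)."""
    seq, d_model = X.shape
    d_head = d_model // num_heads
    # 1-2. Q, K, V all come from X via separate linear projections, split into heads
    def project(W):
        return (X @ W).reshape(seq, num_heads, d_head).transpose(1, 0, 2)  # (heads, seq, d_head)
    Q, K, V = project(W_q), project(W_k), project(W_v)
    # 3. Scaled dot-product attention, independently per head
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)   # (heads, seq, seq)
    heads = softmax(scores) @ V                           # (heads, seq, d_head)
    # 4. Concatenate the heads and apply the final linear layer
    concat = heads.transpose(1, 0, 2).reshape(seq, d_model)
    return concat @ W_o

# Toy usage
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 16))                          # seq = 5, d_model = 16
W_q, W_k, W_v, W_o = (0.1 * rng.standard_normal((16, 16)) for _ in range(4))
Y = multi_head_self_attention(X, num_heads=4, W_q=W_q, W_k=W_k, W_v=W_v, W_o=W_o)  # (5, 16)
```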

Encoder

  1. Multi-head attention
  1. Followed by fully-connected feed-forward network
  1. Layer norm after each sub-layer
  1. Residual connections around each sub-layer
  1. Stack of these layers (see the sketch below)
[Figure: Transformer encoder block]
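A rough PyTorch sketch of one such encoder layer, using the post-layer-norm arrangement from the original Transformer paper (class name and hyperparameters are assumptions, not from these notes):

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    def __init__(self, d_model=512, num_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                         # x: (batch, seq, d_model)
        # Multi-head self-attention, residual connection, then layer norm
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + attn_out)
        # Position-wise feed-forward network, residual connection, then layer norm
        return self.norm2(x + self.ff(x))

# The encoder is a stack of these layers
encoder = nn.Sequential(*[EncoderLayer() for _ in range(6)])
out = encoder(torch.randn(2, 10, 512))            # (batch=2, seq=10, d_model=512)
```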

Decoder

  1. Input comes from the (right-shifted) output sequence (presumably at training time this is the labels, and at inference time it is the previously generated outputs)
  1. First multi-head attention layer is masked so that it cannot use future sequence information
  1. Second multi-head attention layer aligns the encoder output with the masked attention layer's output
  1. Subsequent observations are the same as for the encoder (see the sketch below):
    1. Followed by fully-connected feed-forward network
    2. Layer norm after each sub-layer
    3. Residual connections around each sub-layer
    4. Stack of these layers
[Figure: Transformer decoder block]
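And a matching PyTorch sketch of one decoder layer (again post-layer-norm; names are mine), with the causal mask on the first attention layer and encoder-decoder cross-attention in the second:

```python
import torch
import torch.nn as nn

class DecoderLayer(nn.Module):
    def __init__(self, d_model=512, num_heads=8, d_ff=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, y, enc_out):                # y: (batch, tgt_seq, d_model)
        tgt_len = y.size(1)
        # Causal mask: True marks positions that may NOT be attended to (the future)
        mask = torch.triu(torch.ones(tgt_len, tgt_len, dtype=torch.bool), diagonal=1)
        a, _ = self.self_attn(y, y, y, attn_mask=mask)
        y = self.norm1(y + a)
        # Cross-attention: queries from the decoder, keys/values from the encoder output
        a, _ = self.cross_attn(y, enc_out, enc_out)
        y = self.norm2(y + a)
        # Position-wise feed-forward, residual connection, layer norm
        return self.norm3(y + self.ff(y))

layer = DecoderLayer()
out = layer(torch.randn(2, 7, 512), torch.randn(2, 10, 512))   # (batch, tgt_seq, d_model)
```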

Overall transformer sketch

[Figure: overall Transformer architecture]