Attention
Original paper: https://arxiv.org/pdf/1409.0473.pdf
What's wrong with Seq2Seq?
A critical and apparent disadvantage of this fixed-length context vector design is the incapability of remembering long sentences: the encoder has often forgotten the first part of the input by the time it finishes processing the whole sequence.
The attention mechanism was born to help memorize long source sentences in neural machine translation.
Overview
Attention mechanism for sequence-to-sequence problems.
- Training data: pairs of a source sequence $\mathbf{x} = (x_1, \dots, x_n)$ and a target sequence $\mathbf{y} = (y_1, \dots, y_m)$
- Encoder is a recurrent network that learns hidden states $h_i$ for each source position $i$
- Decoder is a recurrent network that learns hidden states $s_t$ for each target position $t$
- The context vector $c_t$ is based on $s_t$ and all previous and subsequent encoder states $h_i$: $c_t = \sum_{i=1}^{n} \alpha_{t,i}\, h_i$
- The weights $\alpha_{t,i} = \mathrm{softmax}_i\big(\mathrm{score}(s_t, h_i)\big)$ define the alignment between each source and target position.
The matrix of alignment scores is a nice byproduct that explicitly shows the correlation between source and target words.
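A minimal sketch of this weighted-sum context vector, assuming a plain dot-product score; the tensor names and shapes are illustrative, not from the paper:

```python
import torch
import torch.nn.functional as F

n, d = 6, 8                               # n source tokens, hidden size d (arbitrary)
encoder_states = torch.randn(n, d)        # h_1 ... h_n from the (bi)RNN encoder
decoder_state = torch.randn(d)            # s_t, the current decoder state

scores = encoder_states @ decoder_state   # score(s_t, h_i) for every i, shape (n,)
alpha = F.softmax(scores, dim=0)          # alignment weights alpha_{t,i}, sum to 1
context = alpha @ encoder_states          # c_t = sum_i alpha_{t,i} h_i, shape (d,)
```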
Some popular score functions are:
- Dot product: $\mathrm{score}(s_t, h_i) = s_t^\top h_i$
- Scaled dot-product: $\mathrm{score}(s_t, h_i) = \dfrac{s_t^\top h_i}{\sqrt{n}}$, where $n$ is the hidden-state dimension
- Additive: $\mathrm{score}(s_t, h_i) = v_a^\top \tanh\!\big(W_a [s_t; h_i]\big)$
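Hedged sketches of these three score functions; the learned additive parameters $W_a$, $v_a$ and the dimension names are illustrative assumptions:

```python
import torch

def dot_score(s_t, h):            # s_t: (d,), h: (n, d) -> scores of shape (n,)
    return h @ s_t

def scaled_dot_score(s_t, h):     # same, divided by sqrt of the hidden dimension
    return (h @ s_t) / h.shape[-1] ** 0.5

class AdditiveScore(torch.nn.Module):
    """score(s_t, h_i) = v_a^T tanh(W_a [s_t; h_i])"""
    def __init__(self, d_dec, d_enc, d_attn):
        super().__init__()
        self.W_a = torch.nn.Linear(d_dec + d_enc, d_attn, bias=False)
        self.v_a = torch.nn.Linear(d_attn, 1, bias=False)

    def forward(self, s_t, h):                 # s_t: (d_dec,), h: (n, d_enc)
        s_rep = s_t.expand(h.shape[0], -1)     # repeat s_t for every source position
        return self.v_a(torch.tanh(self.W_a(torch.cat([s_rep, h], dim=-1)))).squeeze(-1)
```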
Transformers
Simple scaled dot-product attention equation: $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\dfrac{QK^\top}{\sqrt{d_k}}\right)V$
The dimensions of $QK^\top$ are simply (target x source), leading to our attention map!
Consider a single row of this map for a given target, $a_t = \mathrm{softmax}\!\big(q_t K^\top / \sqrt{d_k}\big)$, and think of its entries as weights over the source positions.
Then the attention output row for that target is simply $a_t V$, a weighted sum of the rows of $V$. Simple!
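As a sketch, the whole map-and-weighted-sum computation fits in a few lines (shapes are illustrative; batching and masking are left out):

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    # Q: (targets, d_k), K: (sources, d_k), V: (sources, d_v)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(Q.shape[-1])  # (targets, sources) attention map
    weights = F.softmax(scores, dim=-1)                        # each target row sums to 1
    return weights @ V                                         # weighted sums of V's rows
```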
Multi-head self attention
- V, K, Q all come from the same input (hence "self" attention)
- Multiple linear layers transforming V, K, Q
- Going into multiple scaled dot-product attention heads
- Concatenated and fed into a linear layer
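A minimal sketch of those four steps for a single unbatched sequence; the class name, head-splitting details, and shapes are my own assumptions rather than the paper's exact formulation:

```python
import torch

class MultiHeadSelfAttention(torch.nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.q_proj = torch.nn.Linear(d_model, d_model)    # per-head projections, carved
        self.k_proj = torch.nn.Linear(d_model, d_model)    # out of one big linear layer each
        self.v_proj = torch.nn.Linear(d_model, d_model)
        self.out_proj = torch.nn.Linear(d_model, d_model)  # final linear after concatenation

    def forward(self, x):                      # x: (seq, d_model); Q, K, V all come from x
        seq = x.shape[0]
        def split(t):                          # (seq, d_model) -> (n_heads, seq, d_head)
            return t.view(seq, self.n_heads, self.d_head).transpose(0, 1)
        q, k, v = split(self.q_proj(x)), split(self.k_proj(x)), split(self.v_proj(x))
        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5   # per-head (seq, seq) maps
        heads = torch.softmax(scores, dim=-1) @ v               # per-head attention outputs
        concat = heads.transpose(0, 1).reshape(seq, -1)         # concatenate heads
        return self.out_proj(concat)                            # fed into a final linear layer
```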
Encoder
- Multi-head attention
- Followed by fully-connected feed-forward network
- Layer norm after each
- Residual connections over each
- Stack of these layers
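A sketch of one such encoder layer, using PyTorch's built-in torch.nn.MultiheadAttention and the post-norm ordering (layer norm applied after each residual add); d_ff and the other names are assumptions. A full encoder would stack several of these and add positional encodings, omitted here:

```python
import torch

class EncoderLayer(torch.nn.Module):
    def __init__(self, d_model, n_heads, d_ff):
        super().__init__()
        self.attn = torch.nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = torch.nn.Sequential(
            torch.nn.Linear(d_model, d_ff),
            torch.nn.ReLU(),
            torch.nn.Linear(d_ff, d_model),
        )
        self.norm1 = torch.nn.LayerNorm(d_model)
        self.norm2 = torch.nn.LayerNorm(d_model)

    def forward(self, x):                      # x: (batch, seq, d_model)
        a, _ = self.attn(x, x, x)              # self-attention: Q = K = V = x
        x = self.norm1(x + a)                  # residual connection, then layer norm
        return self.norm2(x + self.ff(x))      # same pattern around the feed-forward net
```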
Decoder
- Input comes from the (right-shifted) output sequence (at training time this is the target labels; at inference it is the previously generated outputs)
- First multi-head attention layer is masked so as to not use future sequence information
- Second multi-head attention layer aligns encoder output with masked attention layer's output
- Otherwise the same structure as the encoder:
- Followed by fully-connected feed-forward network
- Layer norm after each
- Residual connections over each
- Stack of these layers
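And a matching sketch of one decoder layer, again with torch.nn.MultiheadAttention; the boolean causal mask implements the "masked" first attention layer, and the second call aligns the decoder with the encoder output (names and shapes are assumptions):

```python
import torch

class DecoderLayer(torch.nn.Module):
    def __init__(self, d_model, n_heads, d_ff):
        super().__init__()
        self.self_attn = torch.nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = torch.nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = torch.nn.Sequential(
            torch.nn.Linear(d_model, d_ff),
            torch.nn.ReLU(),
            torch.nn.Linear(d_ff, d_model),
        )
        self.norm1 = torch.nn.LayerNorm(d_model)
        self.norm2 = torch.nn.LayerNorm(d_model)
        self.norm3 = torch.nn.LayerNorm(d_model)

    def forward(self, y, enc_out):  # y: (batch, tgt_len, d_model), enc_out: (batch, src_len, d_model)
        tgt_len = y.shape[1]
        # Causal mask: True = "not allowed to attend", so each position sees only itself and the past
        mask = torch.triu(torch.ones(tgt_len, tgt_len, dtype=torch.bool), diagonal=1)
        a, _ = self.self_attn(y, y, y, attn_mask=mask)   # masked self-attention over shifted outputs
        y = self.norm1(y + a)
        a, _ = self.cross_attn(y, enc_out, enc_out)      # Q from decoder, K/V from encoder output
        y = self.norm2(y + a)
        return self.norm3(y + self.ff(y))
```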