RNN Overview
What is the key feature of a vanilla RNN?
The computation of the hidden state at one time step based on past states:

$$h^{(t)} = f\big(h^{(t-1)}, x^{(t)}; \theta\big)$$
Computing backpropagation in a vanilla RNN:
The key observation is that the gradient of the hidden state $h^{(t)}$ is calculated as the sum of the gradients flowing backwards from the output $o^{(t)}$ and from the next hidden unit $h^{(t+1)}$:

$$\nabla_{h^{(t)}} L = \left(\frac{\partial o^{(t)}}{\partial h^{(t)}}\right)^{\!\top} \nabla_{o^{(t)}} L + \left(\frac{\partial h^{(t+1)}}{\partial h^{(t)}}\right)^{\!\top} \nabla_{h^{(t+1)}} L$$
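A minimal NumPy sketch of this accumulation at a single step, under the assumed standard equations $o^{(t)} = c + V h^{(t)}$ and $h^{(t+1)} = \tanh\big(b + W h^{(t)} + U x^{(t+1)}\big)$; the function name and argument layout are illustrative.

```python
import numpy as np

def backprop_step(grad_o_t, grad_h_next, h_next, V, W):
    """Return dL/dh^(t) given dL/do^(t) and dL/dh^(t+1).

    Assumes o^(t) = c + V h^(t) and h^(t+1) = tanh(b + W h^(t) + U x^(t+1)),
    so dh^(t+1)/dh^(t) = diag(1 - h^(t+1)^2) W.
    """
    from_output = V.T @ grad_o_t                        # path through o^(t)
    from_future = W.T @ (grad_h_next * (1.0 - h_next ** 2))  # path through h^(t+1)
    return from_output + from_future
```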
Standard design of a vanilla RNN (hidden-to-hidden recurrence with an output at each time step):

$$a^{(t)} = b + W h^{(t-1)} + U x^{(t)}, \qquad h^{(t)} = \tanh\big(a^{(t)}\big), \qquad o^{(t)} = c + V h^{(t)}, \qquad \hat{y}^{(t)} = \operatorname{softmax}\big(o^{(t)}\big)$$
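A minimal NumPy sketch of the forward pass under these equations; the dimensions, random initialisation, and toy sequence are illustrative assumptions.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                      # numerical stability
    e = np.exp(z)
    return e / e.sum()

def rnn_forward(xs, W, U, V, b, c, h0):
    """Run a vanilla RNN over a sequence of input vectors xs."""
    h = h0
    hs, ys = [], []
    for x in xs:
        a = b + W @ h + U @ x            # a^(t) = b + W h^(t-1) + U x^(t)
        h = np.tanh(a)                   # h^(t) = tanh(a^(t))
        o = c + V @ h                    # o^(t) = c + V h^(t)
        ys.append(softmax(o))            # y_hat^(t) = softmax(o^(t))
        hs.append(h)
    return hs, ys

# Toy usage: 4 time steps, 3-dim inputs, 5-dim hidden state, 2 output classes.
rng = np.random.default_rng(0)
n_in, n_h, n_out = 3, 5, 2
W, U, V = rng.normal(size=(n_h, n_h)), rng.normal(size=(n_h, n_in)), rng.normal(size=(n_out, n_h))
b, c, h0 = np.zeros(n_h), np.zeros(n_out), np.zeros(n_h)
xs = [rng.normal(size=n_in) for _ in range(4)]
hs, ys = rnn_forward(xs, W, U, V, b, c, h0)
```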
Bidirectional RNNs
Combines an RNN moving forwards in time (hidden state $h^{(t)}$, summarising the past) with one moving backwards in time (hidden state $g^{(t)}$, summarising the future); the output at each step can then depend on both $h^{(t)}$ and $g^{(t)}$.
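A minimal NumPy sketch of a bidirectional RNN layer: one recurrence runs forwards, one runs backwards, and the two hidden states at each step are concatenated. Parameter names and shapes are illustrative assumptions.

```python
import numpy as np

def rnn_pass(xs, W, U, b, h0):
    """One directional tanh-RNN pass; returns the hidden state at each step."""
    h, hs = h0, []
    for x in xs:
        h = np.tanh(b + W @ h + U @ x)
        hs.append(h)
    return hs

def birnn(xs, params_fwd, params_bwd):
    h_fwd = rnn_pass(xs, *params_fwd)                  # h^(t): past context
    h_bwd = rnn_pass(xs[::-1], *params_bwd)[::-1]      # g^(t): future context
    # The output at step t can condition on both directions.
    return [np.concatenate([hf, hb]) for hf, hb in zip(h_fwd, h_bwd)]
```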
Encoder-Decoder Sequence-to-Sequence Architectures
Purpose of encoder-decoder/sequence-to-sequence RNN architecture:
Map input sequences to output sequences of arbitrary, possibly different lengths
Sketch of encoder-decoder/sequence-to-sequence RNN architecture:
- Encoder RNN takes input sequence and outputs single (final) context vector
- Decoder RNN takes the context vector and feeds it as input at each step, producing the output sequence (see the code sketch below)
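A minimal NumPy sketch of this architecture under illustrative assumptions: the encoder's final hidden state is used as the context vector C, and the decoder receives C as an extra input at every step while feeding its own previous output back in. Parameter names and shapes are hypothetical.

```python
import numpy as np

def encode(xs, W, U, b, h0):
    h = h0
    for x in xs:
        h = np.tanh(b + W @ h + U @ x)   # ordinary recurrence over the input
    return h                             # final hidden state = context vector C

def decode(C, n_steps, W, U, R, V, b, c, h0, y0):
    h, y, ys = h0, y0, []
    for _ in range(n_steps):             # output length is independent of input length
        # Context C and the previous output y both feed each decoder step.
        h = np.tanh(b + W @ h + U @ C + R @ y)
        y = np.tanh(c + V @ h)           # toy output; a softmax layer in practice
        ys.append(y)
    return ys
```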
Long-Term Dependencies
The challenge
The challenge of long-term dependencies can be seen from the repeated application of the same linear map: ignoring inputs and nonlinearities, $h^{(t)} = W h^{(t-1)}$ gives $h^{(t)} = W^{t} h^{(0)}$, and with the eigendecomposition $W = Q \Lambda Q^{-1}$ this is $h^{(t)} = Q \Lambda^{t} Q^{-1} h^{(0)}$. Components along eigenvectors whose eigenvalues are larger than 1 in magnitude will explode, and those with eigenvalues less than 1 will vanish.
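A small NumPy demonstration of the eigenvalue argument with illustrative values: repeatedly applying the same matrix makes components with |eigenvalue| > 1 blow up and components with |eigenvalue| < 1 shrink towards zero.

```python
import numpy as np

W = np.diag([1.1, 0.9])        # eigenvalues 1.1 (explodes) and 0.9 (vanishes)
h = np.array([1.0, 1.0])
for t in range(1, 101):
    h = W @ h
    if t % 25 == 0:
        print(t, h)            # first component grows ~1.1^t, second decays ~0.9^t
```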
Simple approaches
- Add skip connections through time (possibly remove some nearer connections)
- Use leaky units: $h^{(t)}$'s connection to $h^{(t-1)}$ is replaced with a connection to an exponentially weighted moving average of past states, $\mu^{(t)} = \alpha\, \mu^{(t-1)} + (1 - \alpha)\, h^{(t)}$. The higher the weight $\alpha$ on the past, the further back the memory reaches (see the sketch below).
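A minimal sketch of the leaky-unit update: the running state is an exponentially weighted moving average, so information persists for roughly $1/(1-\alpha)$ steps. The value of alpha and the input pulse are illustrative.

```python
def leaky_accumulate(values, alpha=0.95):
    mu, history = 0.0, []
    for v in values:
        mu = alpha * mu + (1.0 - alpha) * v   # mu^(t) = a*mu^(t-1) + (1-a)*v^(t)
        history.append(mu)
    return history

pulse = [1.0] + [0.0] * 49                    # a single input at t = 0
print(leaky_accumulate(pulse)[::10])          # the pulse decays slowly over time
```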
Gated RNNs
Given that a "gate" is a sigmoid unit of the current input and previous hidden state, used to multiplicatively scale another signal:

$$g^{(t)} = \sigma\big(b_g + U_g x^{(t)} + W_g h^{(t-1)}\big)$$
LSTM
An LSTM cell keeps an internal cell state $s^{(t)}$ controlled by forget, input, and output gates $f^{(t)}, i^{(t)}, q^{(t)}$, each of the gate form above:

$$s^{(t)} = f^{(t)} \odot s^{(t-1)} + i^{(t)} \odot \tilde{s}^{(t)}, \qquad h^{(t)} = q^{(t)} \odot \tanh\big(s^{(t)}\big)$$

Every nonlinearity is the sigmoid function $\sigma$, except for the candidate update $\tilde{s}^{(t)} = \tanh\big(b + U x^{(t)} + W h^{(t-1)}\big)$ and sometimes the transformation of $s^{(t)}$ in the output, which are $\tanh$.
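A minimal NumPy sketch of a single LSTM step following the equations above; the parameter dictionary keys (Wf, Uf, ...) and shapes are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, s_prev, p):
    f = sigmoid(p["bf"] + p["Uf"] @ x + p["Wf"] @ h_prev)      # forget gate
    i = sigmoid(p["bi"] + p["Ui"] @ x + p["Wi"] @ h_prev)      # input gate
    q = sigmoid(p["bq"] + p["Uq"] @ x + p["Wq"] @ h_prev)      # output gate
    s_tilde = np.tanh(p["b"] + p["U"] @ x + p["W"] @ h_prev)   # candidate (tanh)
    s = f * s_prev + i * s_tilde                               # cell state update
    h = q * np.tanh(s)                                         # hidden output
    return h, s
```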
GRU
Only the candidate state uses a $\tanh$ activation; the rest (the update gate $u^{(t)}$ and reset gate $r^{(t)}$) use the sigmoid:

$$h^{(t)} = u^{(t)} \odot h^{(t-1)} + \big(1 - u^{(t)}\big) \odot \tanh\!\Big(b + U x^{(t)} + W \big(r^{(t)} \odot h^{(t-1)}\big)\Big)$$
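A minimal NumPy sketch of a single GRU step under the same conventions; parameter names and shapes are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x, h_prev, p):
    u = sigmoid(p["bu"] + p["Uu"] @ x + p["Wu"] @ h_prev)           # update gate
    r = sigmoid(p["br"] + p["Ur"] @ x + p["Wr"] @ h_prev)           # reset gate
    h_tilde = np.tanh(p["b"] + p["U"] @ x + p["W"] @ (r * h_prev))  # candidate (tanh)
    return u * h_prev + (1.0 - u) * h_tilde                         # interpolate old/new
```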