(This page is based primarily on material from Chapter 10 of the Deep Learning book)
What is the key feature of a vanilla RNN?
The computation of the hidden state h^(t) at one time step from the previous state, typically h^(t) = tanh(W h^(t-1) + U x^(t) + b):
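This recurrence can be sketched in a few lines of numpy; the sizes and the tanh nonlinearity are illustrative choices, not fixed by the definition:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, purely for illustration.
n_hidden, n_input = 4, 3
W = rng.normal(size=(n_hidden, n_hidden)) * 0.1   # hidden-to-hidden weights
U = rng.normal(size=(n_hidden, n_input)) * 0.1    # input-to-hidden weights
b = np.zeros(n_hidden)                            # bias

def rnn_step(h_prev, x):
    """One vanilla RNN step: h_t = tanh(W h_{t-1} + U x_t + b)."""
    return np.tanh(W @ h_prev + U @ x + b)

# Unroll over a short input sequence, starting from a zero state.
h = np.zeros(n_hidden)
for x in rng.normal(size=(5, n_input)):
    h = rnn_step(h, x)
```

The same weights W, U, b are reused at every time step; only the state h changes.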
Computing backpropagation in a vanilla RNN:
The key observation is that the gradient at the hidden state h^(t) is calculated as the sum of the gradients flowing backwards from the output o^(t) and from the next hidden state h^(t+1).
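A minimal BPTT sketch makes the two-path sum explicit, assuming a tanh RNN with outputs o_t = V h_t and a stand-in of all-ones for the loss gradients at each output:

```python
import numpy as np

rng = np.random.default_rng(1)
n_h, n_x, n_o, T = 3, 2, 2, 4
W = rng.normal(size=(n_h, n_h)) * 0.1   # hidden-to-hidden
U = rng.normal(size=(n_h, n_x)) * 0.1   # input-to-hidden
V = rng.normal(size=(n_o, n_h)) * 0.1   # hidden-to-output

# Forward pass: h_t = tanh(W h_{t-1} + U x_t), o_t = V h_t.
xs = rng.normal(size=(T, n_x))
hs = [np.zeros(n_h)]
for x in xs:
    hs.append(np.tanh(W @ hs[-1] + U @ x))

# Backward pass; do[t] stands in for dL/do_t.
do = [np.ones(n_o) for _ in range(T)]
dh_next = np.zeros(n_h)   # gradient arriving from h_{t+1} (zero at t = T-1)
for t in reversed(range(T)):
    # Key step: gradient at h_t sums the path through o_t and through h_{t+1}.
    dh = V.T @ do[t] + dh_next
    # Backprop through tanh, then through W to reach h_{t-1}.
    dh_raw = (1.0 - hs[t + 1] ** 2) * dh
    dh_next = W.T @ dh_raw
```

Weight gradients (for W, U, V) would accumulate inside the same backward loop; they are omitted here to keep the two-path structure in focus.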
Standard approach to a bidirectional RNN:
Combines an RNN moving forwards in time with one moving backwards in time, so the representation at each step depends on both past and future inputs:
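A sketch of the idea, with two independent (hypothetical) weight sets and the per-step states concatenated:

```python
import numpy as np

rng = np.random.default_rng(2)
n_h, n_x, T = 3, 2, 5
Wf, Uf = rng.normal(size=(n_h, n_h)) * 0.1, rng.normal(size=(n_h, n_x)) * 0.1
Wb, Ub = rng.normal(size=(n_h, n_h)) * 0.1, rng.normal(size=(n_h, n_x)) * 0.1
xs = rng.normal(size=(T, n_x))

def run(W, U, seq):
    """Run a vanilla tanh RNN over seq, returning the state at every step."""
    h, out = np.zeros(n_h), []
    for x in seq:
        h = np.tanh(W @ h + U @ x)
        out.append(h)
    return out

fwd = run(Wf, Uf, xs)               # left-to-right pass
bwd = run(Wb, Ub, xs[::-1])[::-1]   # right-to-left pass, re-aligned in time
# The combined state at time t sees past (fwd) and future (bwd) context.
states = [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]
```

Output layers then read from the concatenated state, which here has twice the hidden width.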
Purpose of encoder-decoder/sequence-to-sequence RNN architecture:
Map input sequences to output sequences of arbitrary, possibly different, lengths
Sketch of encoder-decoder/sequence-to-sequence RNN architecture:
- Encoder RNN takes input sequence and outputs single (final) context vector
- Decoder RNN takes context vector and feeds it as input at each step, producing output sequence
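The two-step sketch above can be written out directly; the weight names and sizes are illustrative, and input/output lengths (7 and 3 here) are deliberately different:

```python
import numpy as np

rng = np.random.default_rng(3)
n_h, n_x, n_y = 4, 3, 2
# Hypothetical weights for the two RNNs.
We, Ue = rng.normal(size=(n_h, n_h)) * 0.1, rng.normal(size=(n_h, n_x)) * 0.1
Wd, Ud = rng.normal(size=(n_h, n_h)) * 0.1, rng.normal(size=(n_h, n_h)) * 0.1
Vd = rng.normal(size=(n_y, n_h)) * 0.1

# Encoder: consume the whole input; keep only the final state as context.
h = np.zeros(n_h)
for x in rng.normal(size=(7, n_x)):       # input sequence of length 7
    h = np.tanh(We @ h + Ue @ x)
context = h

# Decoder: feed the context vector as input at every step.
s, outputs = np.zeros(n_h), []
for _ in range(3):                        # output sequence of length 3
    s = np.tanh(Wd @ s + Ud @ context)
    outputs.append(Vd @ s)
```

In a trained model the decoder would typically also consume its own previous output at each step; that feedback path is omitted here for brevity.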
The challenge of long-term dependencies can be seen from the repeated application of the recurrent weights: ignoring inputs and nonlinearities, h^(t) ≈ W^t h^(0), so components along eigenvectors of W with eigenvalues larger than 1 will explode, and those with eigenvalues less than 1 will vanish.
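A tiny numeric demonstration, using a diagonal W so the eigenvalues (1.1 and 0.9) are explicit:

```python
import numpy as np

# Diagonal W makes the eigenvalues explicit: 1.1 (> 1) and 0.9 (< 1).
W = np.diag([1.1, 0.9])
h = np.ones(2)
for _ in range(100):   # repeated linear application, h <- W h
    h = W @ h
# After 100 steps the first component has exploded (1.1**100)
# and the second has all but vanished (0.9**100).
```

The same effect governs backpropagated gradients, since they are repeatedly multiplied by W^T.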
Mitigations for the long-term dependency problem:
- Add skip connections through time (possibly removing some of the nearer, length-one connections)
- Use leaky units: h^(t-1)'s connection to h^(t) is replaced with a connection to an exponentially weighted running average mu^(t) = alpha mu^(t-1) + (1 - alpha) h^(t) of past states. The closer the weight alpha is to 1, the further back the memory reaches.
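The leaky-unit update can be sketched as a running average over scalar stand-ins for hidden values; alpha here is a hand-picked illustrative weight:

```python
import numpy as np

rng = np.random.default_rng(4)
alpha = 0.99   # weight near 1 -> long memory of past states
mu = 0.0       # running (leaky) average of past hidden values
for h in rng.normal(size=200):
    mu = alpha * mu + (1 - alpha) * h   # exponentially weighted update
# A value seen t steps ago still contributes with weight ~ alpha**t,
# e.g. alpha**100 is still about 0.37 here, versus 0 with no leak.
influence_after_100 = alpha ** 100
```

Choosing alpha trades off responsiveness to new inputs against memory length; it can also be learned per unit.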
Given a "gate" is: g = sigma(b + U x + W h), where
sigma is the sigmoid function, except for the candidate cell input (and sometimes the cell output) which use tanh.
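This gating pattern is the LSTM cell; a minimal sketch with one weight matrix per gate acting on the concatenation [h; x] (biases omitted, sizes hypothetical):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(5)
n_h, n_x = 3, 2
# One weight matrix per gate (biases omitted for brevity).
Wf, Wi, Wo, Wg = (rng.normal(size=(n_h, n_h + n_x)) * 0.1 for _ in range(4))

def lstm_step(h, c, x):
    z = np.concatenate([h, x])
    f = sigmoid(Wf @ z)    # forget gate (sigmoid)
    i = sigmoid(Wi @ z)    # input gate (sigmoid)
    o = sigmoid(Wo @ z)    # output gate (sigmoid)
    g = np.tanh(Wg @ z)    # candidate cell value: the tanh exception
    c = f * c + i * g      # gated cell update keeps a protected linear path
    h = o * np.tanh(c)     # gated output (the second, optional tanh)
    return h, c

h, c = np.zeros(n_h), np.zeros(n_h)
for x in rng.normal(size=(5, n_x)):
    h, c = lstm_step(h, c, x)
```

The cell state c is updated additively through f and i rather than by repeated matrix multiplication, which is what eases the vanishing-gradient problem.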
In the GRU, the candidate state h~^(t) uses a tanh activation; the rest (the update and reset gates) use sigmoid.
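Assuming this card refers to the GRU, a minimal sketch of one step (biases omitted, one hypothetical weight matrix per gate; the update gate interpolates between the old and candidate states):

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

rng = np.random.default_rng(6)
n_h, n_x = 3, 2
Wz, Wr, Wh = (rng.normal(size=(n_h, n_h + n_x)) * 0.1 for _ in range(3))

def gru_step(h, x):
    z = sigmoid(Wz @ np.concatenate([h, x]))      # update gate (sigmoid)
    r = sigmoid(Wr @ np.concatenate([h, x]))      # reset gate (sigmoid)
    h_cand = np.tanh(Wh @ np.concatenate([r * h, x]))  # candidate: the tanh
    return (1 - z) * h + z * h_cand               # interpolate old vs. new

h = np.zeros(n_h)
for x in rng.normal(size=(5, n_x)):
    h = gru_step(h, x)
```

Compared with the LSTM, the GRU folds the forget/input pair into the single update gate z and has no separate cell state.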