Recurrent Neural Networks

RNN Overview

What is the key feature of a vanilla RNN?
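The (hidden) state at each time step is computed from past states (together with the current input), reusing the same function and parameters at every step:
$$h^{(t)} = f\left(h^{(t-1)}, x^{(t)}; \theta\right)$$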
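The recurrence can be sketched in a few lines. This is a minimal illustration, assuming a scalar state, a tanh nonlinearity, and hypothetical parameter names (`w_hh`, `w_xh`, `b`) that are not taken from the notes:

```python
import math

def rnn_step(h_prev, x, w_hh, w_xh, b):
    """One vanilla RNN step for a scalar state: h_t = tanh(w_hh*h_prev + w_xh*x + b)."""
    return math.tanh(w_hh * h_prev + w_xh * x + b)

def rnn_forward(xs, h0=0.0, w_hh=0.5, w_xh=1.0, b=0.0):
    """Unroll the same step (shared parameters) over the whole input sequence."""
    hs = []
    h = h0
    for x in xs:
        h = rnn_step(h, x, w_hh, w_xh, b)  # new state depends on the previous state
        hs.append(h)
    return hs

states = rnn_forward([1.0, 0.0, -1.0])  # one hidden state per input element
```

The point of the sketch is that `rnn_step` is the only place where parameters appear, so the same weights are applied at every position in the sequence.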
Computing backpropagation in a vanilla RNN:
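The key observation is that the gradient of the loss with respect to $h^{(t)}$ is the sum of the gradients flowing backwards from the output $o^{(t)}$ and from the next hidden unit $h^{(t+1)}$ (at the final time step only the output term remains):
$$\nabla_{h^{(t)}} L = \left(\frac{\partial h^{(t+1)}}{\partial h^{(t)}}\right)^{\top} \nabla_{h^{(t+1)}} L + \left(\frac{\partial o^{(t)}}{\partial h^{(t)}}\right)^{\top} \nabla_{o^{(t)}} L$$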
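The two gradient paths can be checked numerically on a tiny example. The sketch below assumes a scalar RNN with $h_t = \tanh(w h_{t-1} + u x_t)$, output $o_t = v h_t$, and a squared-error loss; these choices and all names are illustrative assumptions, not the notes' own setup.

```python
import math

def forward(xs, w, u, v, h0=0.0):
    """Scalar RNN: h_t = tanh(w*h_{t-1} + u*x_t), o_t = v*h_t."""
    hs, outs = [], []
    h = h0
    for x in xs:
        h = math.tanh(w * h + u * x)
        hs.append(h)
        outs.append(v * h)
    return hs, outs

def loss(xs, ys, w, u, v):
    """Sum over time of 0.5 * (o_t - y_t)^2."""
    _, outs = forward(xs, w, u, v)
    return sum(0.5 * (o - y) ** 2 for o, y in zip(outs, ys))

def grad_w(xs, ys, w, u, v, h0=0.0):
    """dL/dw by backpropagation through time."""
    hs, outs = forward(xs, w, u, v, h0)
    dw = 0.0
    da_next = 0.0  # gradient at the pre-activation of step t+1; zero past the end
    for t in reversed(range(len(xs))):
        from_output = v * (outs[t] - ys[t])  # path through o_t
        from_future = w * da_next            # path through h_{t+1}
        dh = from_output + from_future       # the sum described above
        da = dh * (1.0 - hs[t] ** 2)         # back through tanh
        h_prev = hs[t - 1] if t > 0 else h0
        dw += da * h_prev                    # w is shared, so contributions add up
        da_next = da
    return dw

xs, ys = [0.5, -1.0, 0.3, 0.8], [0.1, 0.2, -0.3, 0.4]
w, u, v = 0.7, -0.4, 1.3
eps = 1e-6
numeric = (loss(xs, ys, w + eps, u, v) - loss(xs, ys, w - eps, u, v)) / (2 * eps)
analytic = grad_w(xs, ys, w, u, v)
```

Comparing `analytic` with the central finite difference `numeric` is a quick sanity check that both paths have been accounted for.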
Standard approach to vanilla RNN:
 can be made to depend on ,  or both (as in this case)

Bidirectional RNNs

Combine an RNN moving forwards in time with one moving backwards in time, so that the representation at each step can draw on both past and future inputs.
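A bidirectional layer can be sketched as two independent RNNs whose states are paired up per time step. This is a minimal illustration with scalar states; the parameter values and the helper names `run_rnn` and `bidirectional` are assumptions made for the example.

```python
import math

def run_rnn(xs, w_hh, w_xh, h0=0.0):
    """Scalar vanilla RNN; returns the hidden state after each input."""
    hs, h = [], h0
    for x in xs:
        h = math.tanh(w_hh * h + w_xh * x)
        hs.append(h)
    return hs

def bidirectional(xs, fwd_params=(0.5, 1.0), bwd_params=(0.3, -1.0)):
    """Pair a forward-in-time state with a backward-in-time state at each step."""
    forward_states = run_rnn(xs, *fwd_params)
    # Run the second RNN over the reversed sequence, then flip its states
    # back so that index t lines up with input x_t.
    backward_states = run_rnn(xs[::-1], *bwd_params)[::-1]
    return list(zip(forward_states, backward_states))

pairs = bidirectional([1.0, 2.0, 3.0])
```

In `pairs[t]`, the first entry summarises inputs up to and including step `t`, while the second summarises inputs from step `t` to the end.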

Encoder-Decoder Sequence-to-Sequence Architectures

Purpose of encoder-decoder/sequence-to-sequence RNN architecture:
Map an input sequence to an output sequence whose length may differ and is not fixed in advance.
Sketch of encoder-decoder/sequence-to-sequence RNN architecture:
  1. The encoder RNN reads the input sequence and outputs a single (final) context vector.
  2. The decoder RNN takes the context vector as input at each step and produces the output sequence.
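The two steps can be sketched as follows. This is a toy illustration with scalar states; the weights are arbitrary, and fixing the output length with `n_steps` is a simplifying assumption (practical decoders usually stop when they emit an end-of-sequence symbol).

```python
import math

def encode(xs, w_hh=0.5, w_xh=1.0):
    """Encoder RNN: read the whole input, return the final hidden state as the context."""
    h = 0.0
    for x in xs:
        h = math.tanh(w_hh * h + w_xh * x)
    return h

def decode(context, n_steps, w_ss=0.5, w_cs=1.0, w_out=2.0):
    """Decoder RNN: the same context is fed in as input at every step."""
    s, ys = 0.0, []
    for _ in range(n_steps):
        s = math.tanh(w_ss * s + w_cs * context)
        ys.append(w_out * s)
    return ys

context = encode([0.2, -0.1, 0.4])    # input of length 3
outputs = decode(context, n_steps=5)  # output of length 5
```

The input and output lengths are decoupled because the only thing passed between the two networks is the fixed-size `context`.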

Long-Term Dependencies

The challenge

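The challenge of long-term dependencies can be seen from the repeated application of the same linear layer (ignoring inputs and nonlinearities):
$$h^{(t)} = W h^{(t-1)} = W^{t} h^{(0)}$$
If $W$ admits an eigendecomposition $W = Q \Lambda Q^{\top}$ with orthogonal $Q$, then $W^{t} = Q \Lambda^{t} Q^{\top}$: components along eigenvalues of $W$ with magnitude larger than 1 will explode, and those with magnitude less than 1 will vanish.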
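A small numerical sketch makes the effect concrete. It assumes a diagonal weight matrix, so that the eigenvalues are simply the diagonal entries; the function name and the chosen values 1.1 and 0.9 are illustrative.

```python
def repeated_linear(h0, eigenvalues, steps):
    """Apply h <- W h `steps` times for a diagonal W whose diagonal holds the eigenvalues."""
    h = list(h0)
    for _ in range(steps):
        h = [lam * hi for lam, hi in zip(eigenvalues, h)]
    return h

# One component with eigenvalue 1.1, one with eigenvalue 0.9, both starting at 1.
h = repeated_linear([1.0, 1.0], eigenvalues=[1.1, 0.9], steps=100)
```

After 100 steps the first component has grown to roughly 1.4e4 while the second has shrunk to roughly 2.7e-5, even though both eigenvalues are close to 1.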

Simple approaches

  1. Add skip connections through time (possibly removing some of the nearer connections).
  2. Use leaky units: a unit's connection to its immediately preceding value is replaced with a connection to an exponentially weighted sum of its past values. The higher the weight on the past, the further back the memory reaches.
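The leaky-unit idea can be sketched as a running average. The symbols `mu` (the accumulated value) and `alpha` (the weight on the past) are assumed names for this example, as is the update rule $\mu_t = \alpha \mu_{t-1} + (1 - \alpha) v_t$.

```python
def leaky_average(values, alpha):
    """Running average mu_t = alpha*mu_{t-1} + (1-alpha)*v_t, starting from mu = 0."""
    mu, trace = 0.0, []
    for v in values:
        mu = alpha * mu + (1.0 - alpha) * v
        trace.append(mu)
    return trace

# A single impulse followed by silence: how long is it remembered?
impulse = [1.0] + [0.0] * 9
short_memory = leaky_average(impulse, alpha=0.5)
long_memory = leaky_average(impulse, alpha=0.95)
```

With `alpha=0.5` the impulse has decayed by a factor of `0.5**9` after nine further steps, whereas with `alpha=0.95` it has only decayed by `0.95**9`, so a weight closer to 1 retains information for longer.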

Gated RNNs

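A "gate" is a unit computed from the current input and the previous hidden state, whose output multiplicatively scales another signal:
$$g^{(t)} = \sigma\left(W x^{(t)} + U h^{(t-1)} + b\right)$$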


$\sigma$ is the sigmoid function, except for the candidate update and sometimes the output activation, which use $\tanh$.


The candidate state uses $\tanh$ activation; the rest use sigmoid.
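As an illustration of how gates with these activations fit together, here is a minimal sketch of one step of an LSTM cell, assuming the commonly used formulation with forget, input and output gates. The scalar state and the parameter dictionary `p` are simplifications chosen for the example, not notation from these notes.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One step of a scalar LSTM cell.

    `p` maps each of 'f', 'i', 'o', 'g' to a tuple (w_x, w_h, b).
    """
    def pre(name):
        w_x, w_h, b = p[name]
        return w_x * x + w_h * h_prev + b

    f = sigmoid(pre("f"))    # forget gate: how much of the old cell state to keep
    i = sigmoid(pre("i"))    # input gate: how much of the candidate to write
    o = sigmoid(pre("o"))    # output gate: how much of the cell state to expose
    g = math.tanh(pre("g"))  # candidate update uses tanh, not sigmoid
    c = f * c_prev + i * g   # gated update of the cell state
    h = o * math.tanh(c)     # gated output
    return h, c

params = {name: (0.5, 0.5, 0.0) for name in "fiog"}
h, c = lstm_step(x=1.0, h_prev=0.0, c_prev=0.0, p=params)
```

Because `f` multiplies `c_prev` directly, a forget gate close to 1 lets the cell state pass through many steps almost unchanged, which is how gating helps with long-term dependencies.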