(This page is based primarily on material from Chapter 10 of the Deep Learning book)
What is the key feature of a vanilla RNN?
The computation of the hidden state h^(t) at one time step from the previous state, typically h^(t) = tanh(W h^(t-1) + U x^(t) + b):
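This recurrence can be sketched in a few lines of numpy; the sizes and the tanh nonlinearity are illustrative choices, not fixed by the definition:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, purely for illustration.
n_hidden, n_input = 4, 3
W = rng.normal(size=(n_hidden, n_hidden)) * 0.1   # hidden-to-hidden weights
U = rng.normal(size=(n_hidden, n_input)) * 0.1    # input-to-hidden weights
b = np.zeros(n_hidden)                            # bias

def rnn_step(h_prev, x):
    """One vanilla RNN step: h_t = tanh(W h_{t-1} + U x_t + b)."""
    return np.tanh(W @ h_prev + U @ x + b)

# Unroll over a short input sequence, starting from a zero state.
h = np.zeros(n_hidden)
for x in rng.normal(size=(5, n_input)):
    h = rnn_step(h, x)
```

The same weights W, U, b are reused at every time step; only the state h changes.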
Computing backpropagation in a vanilla RNN:
The key observation is that the gradient at the hidden state h^(t) is calculated as the sum of the gradients flowing backwards from the output o^(t) and from the next hidden state h^(t+1).
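A minimal BPTT sketch makes the two-path sum explicit, assuming a tanh RNN with outputs o_t = V h_t and a stand-in of all-ones for the loss gradients at each output:

```python
import numpy as np

rng = np.random.default_rng(1)
n_h, n_x, n_o, T = 3, 2, 2, 4
W = rng.normal(size=(n_h, n_h)) * 0.1   # hidden-to-hidden
U = rng.normal(size=(n_h, n_x)) * 0.1   # input-to-hidden
V = rng.normal(size=(n_o, n_h)) * 0.1   # hidden-to-output

# Forward pass: h_t = tanh(W h_{t-1} + U x_t), o_t = V h_t.
xs = rng.normal(size=(T, n_x))
hs = [np.zeros(n_h)]
for x in xs:
    hs.append(np.tanh(W @ hs[-1] + U @ x))

# Backward pass; do[t] stands in for dL/do_t.
do = [np.ones(n_o) for _ in range(T)]
dh_next = np.zeros(n_h)   # gradient arriving from h_{t+1} (zero at t = T-1)
for t in reversed(range(T)):
    # Key step: gradient at h_t sums the path through o_t and through h_{t+1}.
    dh = V.T @ do[t] + dh_next
    # Backprop through tanh, then through W to reach h_{t-1}.
    dh_raw = (1.0 - hs[t + 1] ** 2) * dh
    dh_next = W.T @ dh_raw
```

Weight gradients (for W, U, V) would accumulate inside the same backward loop; they are omitted here to keep the two-path structure in focus.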
Standard approach to a bidirectional RNN:
Combines an RNN moving forwards in time with one moving backwards in time, so the representation at each step depends on both past and future inputs:
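A sketch of the idea, with two independent (hypothetical) weight sets and the per-step states concatenated:

```python
import numpy as np

rng = np.random.default_rng(2)
n_h, n_x, T = 3, 2, 5
Wf, Uf = rng.normal(size=(n_h, n_h)) * 0.1, rng.normal(size=(n_h, n_x)) * 0.1
Wb, Ub = rng.normal(size=(n_h, n_h)) * 0.1, rng.normal(size=(n_h, n_x)) * 0.1
xs = rng.normal(size=(T, n_x))

def run(W, U, seq):
    """Run a vanilla tanh RNN over seq, returning the state at every step."""
    h, out = np.zeros(n_h), []
    for x in seq:
        h = np.tanh(W @ h + U @ x)
        out.append(h)
    return out

fwd = run(Wf, Uf, xs)               # left-to-right pass
bwd = run(Wb, Ub, xs[::-1])[::-1]   # right-to-left pass, re-aligned in time
# The combined state at time t sees past (fwd) and future (bwd) context.
states = [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]
```

Output layers then read from the concatenated state, which here has twice the hidden width.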
Purpose of encoder-decoder/sequence-to-sequence RNN architecture:
Map input sequences to output sequences of arbitrary, possibly different, lengths
Sketch of encoder-decoder/sequence-to-sequence RNN architecture:
- Encoder RNN takes input sequence and outputs single (final) context vector
- Decoder RNN takes context vector and feeds it as input at each step, producing output sequence
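The two-step sketch above can be written out directly; the weight names and sizes are illustrative, and input/output lengths (7 and 3 here) are deliberately different:

```python
import numpy as np

rng = np.random.default_rng(3)
n_h, n_x, n_y = 4, 3, 2
# Hypothetical weights for the two RNNs.
We, Ue = rng.normal(size=(n_h, n_h)) * 0.1, rng.normal(size=(n_h, n_x)) * 0.1
Wd, Ud = rng.normal(size=(n_h, n_h)) * 0.1, rng.normal(size=(n_h, n_h)) * 0.1
Vd = rng.normal(size=(n_y, n_h)) * 0.1

# Encoder: consume the whole input; keep only the final state as context.
h = np.zeros(n_h)
for x in rng.normal(size=(7, n_x)):       # input sequence of length 7
    h = np.tanh(We @ h + Ue @ x)
context = h

# Decoder: feed the context vector as input at every step.
s, outputs = np.zeros(n_h), []
for _ in range(3):                        # output sequence of length 3
    s = np.tanh(Wd @ s + Ud @ context)
    outputs.append(Vd @ s)
```

In a trained model the decoder would typically also consume its own previous output at each step; that feedback path is omitted here for brevity.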
The challenge of long-term dependencies can be seen from the repeated application of the recurrent weights: ignoring inputs and nonlinearities, h^(t) ≈ W^t h^(0), so components along eigenvectors of W with eigenvalues larger than 1 will explode, and those with eigenvalues less than 1 will vanish.
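A tiny numeric demonstration, using a diagonal W so the eigenvalues (1.1 and 0.9) are explicit:

```python
import numpy as np

# Diagonal W makes the eigenvalues explicit: 1.1 (> 1) and 0.9 (< 1).
W = np.diag([1.1, 0.9])
h = np.ones(2)
for _ in range(100):   # repeated linear application, h <- W h
    h = W @ h
# After 100 steps the first component has exploded (1.1**100)
# and the second has all but vanished (0.9**100).
```

The same effect governs backpropagated gradients, since they are repeatedly multiplied by W^T.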
Mitigations for the long-term dependency problem:
- Add skip connections through time (possibly removing some of the nearer, length-one connections)
- Use leaky units: h^(t-1)'s connection to h^(t) is replaced with a connection to an exponentially weighted running average mu^(t) = alpha mu^(t-1) + (1 - alpha) h^(t) of past states. The closer the weight alpha is to 1, the further back the memory reaches.
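The leaky-unit update can be sketched as a running average over scalar stand-ins for hidden values; alpha here is a hand-picked illustrative weight:

```python
import numpy as np

rng = np.random.default_rng(4)
alpha = 0.99   # weight near 1 -> long memory of past states
mu = 0.0       # running (leaky) average of past hidden values
for h in rng.normal(size=200):
    mu = alpha * mu + (1 - alpha) * h   # exponentially weighted update
# A value seen t steps ago still contributes with weight ~ alpha**t,
# e.g. alpha**100 is still about 0.37 here, versus 0 with no leak.
influence_after_100 = alpha ** 100
```

Choosing alpha trades off responsiveness to new inputs against memory length; it can also be learned per unit.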
Given a "gate" is: g = sigma(b + U x + W h), where
sigma is the sigmoid function, except for the candidate cell input (and sometimes the cell output) which use tanh.
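This gating pattern is the LSTM cell; a minimal sketch with one weight matrix per gate acting on the concatenation [h; x] (biases omitted, sizes hypothetical):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(5)
n_h, n_x = 3, 2
# One weight matrix per gate (biases omitted for brevity).
Wf, Wi, Wo, Wg = (rng.normal(size=(n_h, n_h + n_x)) * 0.1 for _ in range(4))

def lstm_step(h, c, x):
    z = np.concatenate([h, x])
    f = sigmoid(Wf @ z)    # forget gate (sigmoid)
    i = sigmoid(Wi @ z)    # input gate (sigmoid)
    o = sigmoid(Wo @ z)    # output gate (sigmoid)
    g = np.tanh(Wg @ z)    # candidate cell value: the tanh exception
    c = f * c + i * g      # gated cell update keeps a protected linear path
    h = o * np.tanh(c)     # gated output (the second, optional tanh)
    return h, c

h, c = np.zeros(n_h), np.zeros(n_h)
for x in rng.normal(size=(5, n_x)):
    h, c = lstm_step(h, c, x)
```

The cell state c is updated additively through f and i rather than by repeated matrix multiplication, which is what eases the vanishing-gradient problem.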
In the GRU, the candidate state h~^(t) uses a tanh activation; the rest (the update and reset gates) use sigmoid.
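Assuming this card refers to the GRU, a minimal sketch of one step (biases omitted, one hypothetical weight matrix per gate; the update gate interpolates between the old and candidate states):

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

rng = np.random.default_rng(6)
n_h, n_x = 3, 2
Wz, Wr, Wh = (rng.normal(size=(n_h, n_h + n_x)) * 0.1 for _ in range(3))

def gru_step(h, x):
    z = sigmoid(Wz @ np.concatenate([h, x]))      # update gate (sigmoid)
    r = sigmoid(Wr @ np.concatenate([h, x]))      # reset gate (sigmoid)
    h_cand = np.tanh(Wh @ np.concatenate([r * h, x]))  # candidate: the tanh
    return (1 - z) * h + z * h_cand               # interpolate old vs. new

h = np.zeros(n_h)
for x in rng.normal(size=(5, n_x)):
    h = gru_step(h, x)
```

Compared with the LSTM, the GRU folds the forget/input pair into the single update gate z and has no separate cell state.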