### RNN Overview

What is the key feature of a vanilla RNN?

The computation of the hidden state at one time step as a function of past state (and the current input): $h^{(t)} = f(h^{(t-1)}, x^{(t)}; \theta)$.

Computing backpropagation in a vanilla RNN:

The key observation is that the gradient of the hidden state $h^{(t)}$ is calculated as the sum of the gradients flowing backwards from the output $o^{(t)}$ and from the next hidden state $h^{(t+1)}$.
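
A rough NumPy sketch of this accumulation (the helper name and shape conventions are mine; it assumes the standard $\tanh$ update written out in the next note, with the forward hidden states and the per-step output gradients already computed):

```python
import numpy as np

def bptt_hidden_grads(dh_from_output, W, h):
    """Illustrative BPTT sketch: at each step t, the gradient w.r.t. h[t]
    is the sum of the gradient flowing back from the output o[t]
    (given here as dh_from_output[t]) and the gradient flowing back
    from the next hidden state h[t+1] through W and tanh.
    Shapes: dh_from_output and h are (T, n_hidden); W is (n_hidden, n_hidden)."""
    T = len(h)
    dh = np.zeros_like(h)
    dh_next = np.zeros(W.shape[0])        # nothing flows in from beyond the last step
    for t in reversed(range(T)):
        # key observation: sum of the two incoming gradient paths
        dh[t] = dh_from_output[t] + dh_next
        # push the gradient one step further back: through tanh, then through W
        da = dh[t] * (1.0 - h[t] ** 2)    # d tanh(a)/da = 1 - h^2
        dh_next = W.T @ da
    return dh
```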

Standard approach to a vanilla RNN (tanh hidden units, softmax output):

$$a^{(t)} = b + W h^{(t-1)} + U x^{(t)}$$
$$h^{(t)} = \tanh\left(a^{(t)}\right)$$
$$o^{(t)} = c + V h^{(t)}$$
$$\hat{y}^{(t)} = \operatorname{softmax}\left(o^{(t)}\right)$$
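
A minimal NumPy sketch of this forward pass (the function name, parameter shapes, and dimensions are illustrative assumptions, not from the source):

```python
import numpy as np

def softmax(z):
    z = z - z.max()                     # numerical stability
    e = np.exp(z)
    return e / e.sum()

def rnn_forward(x_seq, h0, U, W, V, b, c):
    """Vanilla RNN forward pass over a sequence of input vectors.
    x_seq: (T, n_in), h0: (n_hidden,), U: (n_hidden, n_in),
    W: (n_hidden, n_hidden), V: (n_out, n_hidden)."""
    h, hs, ys = h0, [], []
    for x in x_seq:
        a = b + W @ h + U @ x           # pre-activation
        h = np.tanh(a)                  # hidden state
        o = c + V @ h                   # output logits
        hs.append(h)
        ys.append(softmax(o))           # predicted distribution
    return np.array(hs), np.array(ys)
```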

### Bidirectional RNNs

Combines an RNN moving forwards in time with one moving backwards in time, so the representation at each time step can depend on the whole input sequence.
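
A sketch of the idea under the same $\tanh$-RNN assumptions as above: two independent recurrences, one run left-to-right and one right-to-left, whose states are concatenated per time step (parameter names are mine):

```python
import numpy as np

def birnn_forward(x_seq, Wf, Uf, bf, Wb, Ub, bb):
    """Run one tanh RNN left-to-right and another right-to-left,
    then concatenate their states at each time step, so the
    representation of step t depends on the whole sequence."""
    T = len(x_seq)
    n_hidden = Wf.shape[0]
    h_fwd = np.zeros((T, n_hidden))
    h_bwd = np.zeros((T, n_hidden))
    h = np.zeros(n_hidden)
    for t in range(T):                        # forward-in-time RNN
        h = np.tanh(Wf @ h + Uf @ x_seq[t] + bf)
        h_fwd[t] = h
    g = np.zeros(n_hidden)
    for t in reversed(range(T)):              # backward-in-time RNN
        g = np.tanh(Wb @ g + Ub @ x_seq[t] + bb)
        h_bwd[t] = g
    return np.concatenate([h_fwd, h_bwd], axis=1)   # (T, 2 * n_hidden)
```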

### Encoder-Decoder Sequence-to-Sequence Architectures

Purpose of encoder-decoder/sequence-to-sequence RNN architecture:

Map an input sequence to an output sequence of a different, arbitrary length.

Sketch of encoder-decoder/sequence-to-sequence RNN architecture:

- Encoder RNN reads the input sequence and outputs a single (final) context vector

- Decoder RNN takes the context vector and feeds it in as input at each step, producing the output sequence (see the sketch below)
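
A minimal sketch of the two halves, assuming simple $\tanh$ RNN cells; real decoders usually also feed back the previously emitted token, which is omitted here, and all names are illustrative:

```python
import numpy as np

def encoder(x_seq, We, Ue, be):
    """Encoder RNN: consume the input sequence and return the final
    hidden state as the context vector C."""
    h = np.zeros(We.shape[0])
    for x in x_seq:
        h = np.tanh(We @ h + Ue @ x + be)
    return h                                  # context vector C

def decoder(C, n_steps, Wd, Ud, bd, V, c):
    """Decoder RNN: receive the context vector C as an extra input at
    every step and produce an output sequence whose length need not
    match the input's (fixed to n_steps here for simplicity)."""
    s = np.zeros(Wd.shape[0])
    outputs = []
    for _ in range(n_steps):
        s = np.tanh(Wd @ s + Ud @ C + bd)     # context fed in at each step
        outputs.append(V @ s + c)             # output logits
    return np.array(outputs)
```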

### Long-Term Dependencies

#### The challenge

The challenge of long-term dependencies can be seen from the repeated application of the same linear map: ignoring inputs and nonlinearities, $h^{(t)} = W^\top h^{(t-1)}$ gives $h^{(t)} = \left(W^t\right)^\top h^{(0)}$, and with the eigendecomposition $W = Q \Lambda Q^\top$ (orthogonal $Q$) this becomes $h^{(t)} = Q^\top \Lambda^t Q \, h^{(0)}$.

Components along eigenvectors whose eigenvalues are larger than 1 in magnitude will explode, and those with eigenvalues smaller than 1 will vanish.
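
A small numeric illustration (the matrices are arbitrary diagonal examples with eigenvalues above and below 1):

```python
import numpy as np

h0 = np.array([1.0, 1.0])
W_explode = np.array([[1.2, 0.0],
                      [0.0, 1.1]])            # eigenvalues 1.2 and 1.1 (> 1)
W_vanish  = np.array([[0.9, 0.0],
                      [0.0, 0.5]])            # eigenvalues 0.9 and 0.5 (< 1)

for name, W in [("explode", W_explode), ("vanish", W_vanish)]:
    h = h0.copy()
    for _ in range(50):                       # 50 repeated linear steps
        h = W @ h
    print(name, np.linalg.norm(h))            # roughly 1.2**50 vs 0.9**50
```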

#### Simple approaches

- Add **skip connections** through time (possibly removing some nearer connections)

- Use **leaky units**: $h^{(t)}$'s direct connection to $h^{(t-1)}$ is replaced with a connection to an exponentially weighted running average of past hidden states. The higher the weight on the past, the further back the memory reaches (sketched below).
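
One way such a leaky state can be realized is as an exponentially weighted running average of the ordinary updates; this is only a sketch, with names of my choosing:

```python
import numpy as np

def leaky_rnn(x_seq, W, U, b, alpha=0.9):
    """Leaky-unit RNN: the state is an exponentially weighted running
    average of the ordinary tanh updates, so information decays over
    roughly 1/(1 - alpha) steps; a larger alpha means a longer memory."""
    mu = np.zeros(W.shape[0])
    states = []
    for x in x_seq:
        v = np.tanh(W @ mu + U @ x + b)        # ordinary one-step update
        mu = alpha * mu + (1.0 - alpha) * v    # leaky / running-average state
        states.append(mu)
    return np.array(states)
```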

### Gated RNNs

Given that a "gate" is a sigmoid unit whose output multiplicatively scales another signal:

$$g^{(t)} = \sigma\left(W^g x^{(t)} + U^g h^{(t-1)} + b^g\right)$$

#### LSTM

$\sigma$ is the sigmoid function for every gate, except for the candidate cell update $\tilde{c}^{(t)}$ and sometimes the output nonlinearity applied to the cell state $c^{(t)}$, which are $\tanh$.
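
A minimal single-step LSTM sketch following the standard gate equations (the parameter naming scheme is my assumption):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev,
              Wf, Uf, bf, Wi, Ui, bi, Wo, Uo, bo, Wc, Uc, bc):
    """One LSTM step: three sigmoid gates plus a tanh candidate update."""
    f = sigmoid(Wf @ x + Uf @ h_prev + bf)        # forget gate
    i = sigmoid(Wi @ x + Ui @ h_prev + bi)        # input gate
    o = sigmoid(Wo @ x + Uo @ h_prev + bo)        # output gate
    c_tilde = np.tanh(Wc @ x + Uc @ h_prev + bc)  # candidate cell update (tanh)
    c = f * c_prev + i * c_tilde                  # new cell state
    h = o * np.tanh(c)                            # new hidden state
    return h, c
```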

#### GRU

The candidate state $\tilde{h}^{(t)}$ uses a $\tanh$ activation; the rest (the update and reset gates) use the sigmoid.
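
A minimal single-step GRU sketch, using one common convention for the update gate (names are mine):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x, h_prev, Wz, Uz, bz, Wr, Ur, br, Wh, Uh, bh):
    """One GRU step: sigmoid update/reset gates, tanh candidate state."""
    z = sigmoid(Wz @ x + Uz @ h_prev + bz)               # update gate
    r = sigmoid(Wr @ x + Ur @ h_prev + br)               # reset gate
    h_tilde = np.tanh(Wh @ x + Uh @ (r * h_prev) + bh)   # candidate state (tanh)
    h = z * h_prev + (1.0 - z) * h_tilde                 # interpolate old and new
    return h
```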