RNN Overview
What is the key feature of a vanilla RNN?
The computation of the hidden state at one time step from the previous state and the current input:

$$h^{(t)} = f(h^{(t-1)}, x^{(t)}; \theta), \qquad \text{e.g. } h^{(t)} = \tanh(W h^{(t-1)} + U x^{(t)} + b)$$
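A minimal NumPy sketch of this recurrence (function and parameter names are illustrative, not from the notes); the same weights W, U, b are reused at every time step.

```python
import numpy as np

# One vanilla RNN step: h_t = tanh(W h_{t-1} + U x_t + b)
def rnn_step(h_prev, x, W, U, b):
    return np.tanh(W @ h_prev + U @ x + b)

# Unroll over an input sequence, reusing the same parameters at every step
def rnn_forward(xs, h0, W, U, b):
    h, hs = h0, []
    for x in xs:                 # xs: list of input vectors x^(1), ..., x^(T)
        h = rnn_step(h, x, W, U, b)
        hs.append(h)
    return hs
```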
Computing backpropagation in a vanilla RNN:
The key observation is that the gradient at $h^{(t)}$ is the sum of the gradient flowing back from the output $o^{(t)}$ and the gradient flowing back from the next hidden state $h^{(t+1)}$:

$$\nabla_{h^{(t)}} L = \left(\frac{\partial h^{(t+1)}}{\partial h^{(t)}}\right)^{\top} \nabla_{h^{(t+1)}} L + \left(\frac{\partial o^{(t)}}{\partial h^{(t)}}\right)^{\top} \nabla_{o^{(t)}} L$$
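A sketch of how that sum shows up in code, assuming the tanh recurrence above and that `dL_dh_from_outputs[t]` already holds the gradient reaching $h^{(t)}$ through the output at step $t$ (all names are illustrative):

```python
import numpy as np

# Backprop through time for h_t = tanh(W h_{t-1} + U x_t + b).
# The key line is `dh = dL_dh_from_outputs[t] + dh_next`: the gradient at h_t is
# the sum of the path through the output o_t and the path from the future state h_{t+1}.
def bptt(xs, hs, dL_dh_from_outputs, h0, W, U, b):
    dW, dU, db = np.zeros_like(W), np.zeros_like(U), np.zeros_like(b)
    dh_next = np.zeros_like(hs[0])               # gradient arriving from h_{t+1}
    for t in reversed(range(len(xs))):
        dh = dL_dh_from_outputs[t] + dh_next     # sum of the two gradient paths
        da = dh * (1.0 - hs[t] ** 2)             # back through tanh
        h_prev = hs[t - 1] if t > 0 else h0
        dW += np.outer(da, h_prev)
        dU += np.outer(da, xs[t])
        db += da
        dh_next = W.T @ da                       # gradient passed back to h_{t-1}
    return dW, dU, db
```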
Standard approach to vanilla RNN:
![The hidden state can be made to depend on the current input, the previous hidden state, or both (as in this case)](https://www.notion.so/image/https%3A%2F%2Fs3.us-west-2.amazonaws.com%2Fsecure.notion-static.com%2F876ca5eb-2175-4e2b-b594-59af9320d5ff%2FUntitled.png%3FX-Amz-Algorithm%3DAWS4-HMAC-SHA256%26X-Amz-Content-Sha256%3DUNSIGNED-PAYLOAD%26X-Amz-Credential%3DAKIAT73L2G45EIPT3X45%252F20221016%252Fus-west-2%252Fs3%252Faws4_request%26X-Amz-Date%3D20221016T192502Z%26X-Amz-Expires%3D86400%26X-Amz-Signature%3Dd9f776b92439ebfa982bfc1c14c950e18d29238c3faa5c8170d2cefb8e01e3d6%26X-Amz-SignedHeaders%3Dhost%26x-id%3DGetObject?table=block&id=d3307e0b-ef5d-4ab0-a8b5-3e406affc9d2&cache=v2)
Bidirectional RNNs
Combines an RNN moving forward in time with one moving backward in time:
![notion image](https://www.notion.so/image/https%3A%2F%2Fs3.us-west-2.amazonaws.com%2Fsecure.notion-static.com%2F7c9a3689-9cd4-4d1a-9314-524c35a8537e%2FUntitled.png%3FX-Amz-Algorithm%3DAWS4-HMAC-SHA256%26X-Amz-Content-Sha256%3DUNSIGNED-PAYLOAD%26X-Amz-Credential%3DAKIAT73L2G45EIPT3X45%252F20221016%252Fus-west-2%252Fs3%252Faws4_request%26X-Amz-Date%3D20221016T192502Z%26X-Amz-Expires%3D86400%26X-Amz-Signature%3Dd91faeb7d96f53cd152fd473c473e0023679ee29940170bb8a2ec57dc96843b1%26X-Amz-SignedHeaders%3Dhost%26x-id%3DGetObject?table=block&id=bd94df54-962c-4b50-8dae-a34009c24ed9&cache=v2)
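A minimal sketch of the idea (parameter names are illustrative): run one vanilla RNN left-to-right, another right-to-left, and concatenate the two hidden states at each time step.

```python
import numpy as np

def run_rnn(xs, h0, W, U, b):
    hs, h = [], h0
    for x in xs:
        h = np.tanh(W @ h + U @ x + b)
        hs.append(h)
    return hs

# Bidirectional RNN: each time step sees context from both the past and the future.
def birnn_forward(xs, h0_f, h0_b, Wf, Uf, bf, Wb, Ub, bb):
    hs_fwd = run_rnn(xs, h0_f, Wf, Uf, bf)               # forward-in-time pass
    hs_bwd = run_rnn(xs[::-1], h0_b, Wb, Ub, bb)[::-1]   # backward-in-time pass, re-aligned
    return [np.concatenate([hf, hb]) for hf, hb in zip(hs_fwd, hs_bwd)]
```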
Encoder-Decoder Sequence-to-Sequence Architectures
Purpose of encoder-decoder/sequence-to-sequence RNN architecture:
Map input sequences to output sequences of arbitrary, possibly different, lengths
Sketch of encoder-decoder/sequence-to-sequence RNN architecture:
- Encoder RNN takes the input sequence and outputs a single (final) context vector
- Decoder RNN takes the context vector and feeds it as input at each step, producing the output sequence (see the sketch below)
![notion image](https://www.notion.so/image/https%3A%2F%2Fs3.us-west-2.amazonaws.com%2Fsecure.notion-static.com%2F115ff558-f239-4fe1-8600-f081f9125552%2FUntitled.png%3FX-Amz-Algorithm%3DAWS4-HMAC-SHA256%26X-Amz-Content-Sha256%3DUNSIGNED-PAYLOAD%26X-Amz-Credential%3DAKIAT73L2G45EIPT3X45%252F20221016%252Fus-west-2%252Fs3%252Faws4_request%26X-Amz-Date%3D20221016T192502Z%26X-Amz-Expires%3D86400%26X-Amz-Signature%3De5c1a65da7306e7b24b287e25fe11881dba81abbd881d2273346919c15cf09fc%26X-Amz-SignedHeaders%3Dhost%26x-id%3DGetObject?table=block&id=410886f7-6c55-4a47-8427-ec4aa237aaa3&cache=v2)
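A minimal sketch of the architecture above, with illustrative parameter names: the encoder's final hidden state becomes the context vector, which the decoder receives as its input at every output step. Note that the input length `len(xs)` and output length `T_out` are independent.

```python
import numpy as np

def rnn_step(h_prev, x, W, U, b):
    return np.tanh(W @ h_prev + U @ x + b)

def seq2seq_forward(xs, T_out, enc_params, dec_params, h0_enc, h0_dec, V, c_b):
    We, Ue, be = enc_params
    Wd, Ud, bd = dec_params
    # Encoder: read the whole input sequence, keep only the final hidden state.
    h = h0_enc
    for x in xs:
        h = rnn_step(h, x, We, Ue, be)
    context = h
    # Decoder: the context vector is fed in as the input at every output step.
    h, ys = h0_dec, []
    for _ in range(T_out):
        h = rnn_step(h, context, Wd, Ud, bd)
        ys.append(V @ h + c_b)           # simple linear readout per step
    return ys
```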
Long-Term Dependencies
The challenge
The challenge of long-term dependencies can be seen from the repeated application of the same linear map. Simplifying the recurrence to $h^{(t)} = W^{\top} h^{(t-1)}$ gives $h^{(t)} = (W^{t})^{\top} h^{(0)}$, and if $W$ admits an eigendecomposition $W = Q \Lambda Q^{\top}$ with orthogonal $Q$, this becomes

$$h^{(t)} = Q^{\top} \Lambda^{t} Q\, h^{(0)}$$

Eigenvalues of $W$ with magnitude larger than 1 will explode, and those with magnitude less than 1 will vanish.
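A quick numerical illustration (diagonal $W$ chosen so the eigenvalues are explicit):

```python
import numpy as np

W = np.diag([1.1, 0.9])        # eigenvalues: 1.1 (explodes), 0.9 (vanishes)
h = np.array([1.0, 1.0])
for t in range(50):
    h = W.T @ h                # repeated application of the same linear map
print(h)                       # ~[117.4, 0.005]: one component explodes, the other vanishes
```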
Simple approaches
- Add skip connections through time (possibly remove some nearer connections)
- Use leaky units: $h^{(t)}$'s direct connection to $h^{(t-1)}$ is replaced with a connection to an exponentially weighted running average of past hidden states, e.g. $\mu^{(t)} = \alpha\, \mu^{(t-1)} + (1 - \alpha)\, h^{(t)}$. The closer the weight $\alpha$ is to 1, the further back the memory reaches (see the sketch after this list).
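A one-line sketch of the leaky-unit update (the name `leaky_update` and the default $\alpha$ are illustrative):

```python
def leaky_update(mu_prev, h_new, alpha=0.99):
    # alpha near 1 -> long memory of past states; alpha near 0 -> mostly the newest h
    return alpha * mu_prev + (1.0 - alpha) * h_new
```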
Gated RNNs
Given a "gate" is:
LSTM
Every gate activation is the sigmoid function, except for the candidate cell update $\tilde{c}^{(t)}$ and sometimes the output nonlinearity on the cell state, which are $\tanh$.
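A minimal NumPy sketch of one LSTM step in the standard formulation (forget, input, and output gates through the sigmoid, tanh for the candidate cell); weight names are illustrative.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x, h_prev, c_prev, p):
    f = sigmoid(p["Wf"] @ x + p["Uf"] @ h_prev + p["bf"])        # forget gate
    i = sigmoid(p["Wi"] @ x + p["Ui"] @ h_prev + p["bi"])        # input gate
    o = sigmoid(p["Wo"] @ x + p["Uo"] @ h_prev + p["bo"])        # output gate
    c_tilde = np.tanh(p["Wc"] @ x + p["Uc"] @ h_prev + p["bc"])  # candidate cell (tanh)
    c = f * c_prev + i * c_tilde                                 # new cell state
    h = o * np.tanh(c)                                           # new hidden state
    return h, c
```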
GRU
The candidate state $\tilde{h}^{(t)}$ uses the $\tanh$ activation; the rest (the update and reset gates) use the sigmoid.
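A matching GRU-step sketch (standard formulation; weight names are illustrative):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x, h_prev, p):
    z = sigmoid(p["Wz"] @ x + p["Uz"] @ h_prev + p["bz"])              # update gate
    r = sigmoid(p["Wr"] @ x + p["Ur"] @ h_prev + p["br"])              # reset gate
    h_tilde = np.tanh(p["Wh"] @ x + p["Uh"] @ (r * h_prev) + p["bh"])  # candidate state (tanh)
    return (1.0 - z) * h_prev + z * h_tilde                            # new hidden state
```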