### Distributed Implementations

#### Asynchronous stochastic gradient descent

- Parameters stored in a database

- Each independent process reads parameters, computes gradients and increments parameters

- All done without a lock

- Processes may overwrite or interfere with each other's updates, but this seems to work well in practice regardless
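A minimal lock-free sketch in Python (the toy linear-regression data, thread count, and learning rate are illustrative assumptions; real systems shard parameters across a parameter server):

```python
import numpy as np
from threading import Thread

# Toy problem: recover w_true for a linear model y = X @ w_true.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
w_true = rng.normal(size=10)
y = X @ w_true

w = np.zeros(10)   # shared parameters ("the database"), read and written with no lock
lr = 0.01

def worker(seed, n_steps=5000):
    local_rng = np.random.default_rng(seed)
    for _ in range(n_steps):
        i = local_rng.integers(len(X))
        grad = (X[i] @ w - y[i]) * X[i]   # gradient computed from possibly stale parameters
        w[:] = w - lr * grad              # in-place increment; updates from other threads may collide

threads = [Thread(target=worker, args=(s,)) for s in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("parameter error:", np.linalg.norm(w - w_true))
```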

### Speech Recognition

Objective of speech recognition:

- Input = acoustic signal (phonemes, or whole utterances)

- Represented by input vectors taken over small time frames

- Output sequence of words

Speech recognition approaches:

- Early approaches: Gaussian mixture models combined with hidden Markov models (GMM-HMM)

- More recent: sequence-to-sequence LSTM models
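As a rough illustration of the neural approach, the sketch below (PyTorch, with made-up feature and vocabulary sizes) maps a sequence of acoustic feature frames to per-frame scores over an output vocabulary; a full recogniser would wrap something like this in an encoder-decoder or CTC-style training setup:

```python
import torch
import torch.nn as nn

class AcousticLSTM(nn.Module):
    """LSTM over acoustic feature frames, producing per-frame vocabulary scores."""
    def __init__(self, n_features=40, hidden=128, vocab_size=30):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, vocab_size)

    def forward(self, frames):            # frames: (batch, time, n_features)
        out, _ = self.lstm(frames)
        return self.proj(out)             # (batch, time, vocab_size) logits

model = AcousticLSTM()
dummy = torch.randn(2, 100, 40)           # 2 utterances, 100 frames of 40-dim features each
print(model(dummy).shape)                 # torch.Size([2, 100, 30])
```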

### NLP

Language model definition:

A probability distribution over sequences of natural language tokens (e.g. words & punctuation)


n-gram definition:

a contiguous sequence of n tokens; an n-gram language model estimates each token's probability conditioned on the previous n - 1 tokens
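A minimal count-based bigram (n = 2) sketch over a toy corpus, using unsmoothed maximum-likelihood estimates (the corpus and function names are illustrative):

```python
from collections import Counter, defaultdict

corpus = "the cat sat on the mat . the dog sat on the rug .".split()

# Count how often each token follows each previous token.
counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def bigram_prob(next_token, prev_token):
    c = counts[prev_token]
    return c[next_token] / sum(c.values()) if c else 0.0

print(bigram_prob("cat", "the"))   # P(cat | the) = 0.25 on this corpus
```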


#### Word Embeddings

Word embedding definition:

A dense representation of a word's semantics in the form of a numeric vector that represents key properties and relationships, e.g. vector("cat") - vector("kitten") is similar to vector("dog") - vector("puppy")
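A toy illustration of that analogy property with hand-made 3-dimensional vectors (real embeddings are learned and typically have hundreds of dimensions):

```python
import numpy as np

# Hand-made 3-d vectors; dimensions loosely mean (feline-ness, canine-ness, youth).
vec = {
    "cat":    np.array([1.0, 0.0, 0.0]),
    "kitten": np.array([1.0, 0.0, 1.0]),
    "dog":    np.array([0.0, 1.0, 0.0]),
    "puppy":  np.array([0.0, 1.0, 1.0]),
}

print(vec["cat"] - vec["kitten"])   # [0. 0. -1.]
print(vec["dog"] - vec["puppy"])    # [0. 0. -1.] -- the same "adult minus young" direction
```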


#### Skip-Gram Model

- Form the dataset by:
  - Each word is a target
  - Each target has an associated *context* comprising a number of nearby words
  - Each (target, context-word) pair goes into the dataset

- Start with an arbitrary embedding matrix, one row per word

- Look up the context word's embedding and multiply it by the rest of the embedding matrix (excluding the context word's own row)

- Softmax over the output to get target probabilities

- Form the loss by comparing with the one-hot representation of the target
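A minimal numpy sketch of one training step, assuming a single shared embedding matrix for simplicity (real implementations usually keep separate input and output matrices, and the "excluding itself" detail is skipped here):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "cat", "sat", "on", "mat"]
V, D = len(vocab), 8
E = rng.normal(scale=0.1, size=(V, D))        # embedding matrix, one row per word

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def train_step(context_id, target_id, lr=0.1):
    h = E[context_id].copy()                  # context word's embedding
    probs = softmax(E @ h)                    # predicted distribution over the vocabulary
    grad_logits = probs.copy()
    grad_logits[target_id] -= 1.0             # cross-entropy gradient vs the one-hot target
    grad_E = np.outer(grad_logits, h)         # gradient w.r.t. every row of E
    grad_h = E.T @ grad_logits                # gradient w.r.t. the context embedding
    E[:] -= lr * grad_E
    E[context_id] -= lr * grad_h
    return -np.log(probs[target_id])          # loss for this (target, context-word) pair

# (target, context-word) pairs from a window of one word over "the cat sat"
pairs = [(0, 1), (1, 0), (1, 2), (2, 1)]
for target, context in pairs:
    print(round(train_step(context, target), 3))
```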

#### Continuous Bag-of-Words

Like skip-gram, but takes as input all words in the nearby context and averages over their word embeddings.
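A small sketch of the CBOW forward pass under the same toy setup, showing the context embeddings being averaged before the softmax:

```python
import numpy as np

def cbow_probs(context_ids, E):
    h = E[context_ids].mean(axis=0)        # average embedding of all context words
    logits = E @ h
    logits -= logits.max()
    probs = np.exp(logits)
    return probs / probs.sum()             # predicted distribution over the vocabulary

E = np.random.default_rng(0).normal(scale=0.1, size=(5, 8))
print(cbow_probs([0, 2, 3], E).round(3))   # context words 0, 2 and 3 jointly predict the target
```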

#### Hierarchical Softmax

- Binary tree

- Leaf nodes = word labels (not embeddings)

- Non-leaf nodes = embeddings; the inner product of a node's embedding with the context vector gives the probability that the target is in its left or right subtree

- Training
  - Evaluate the branch probabilities along the path to the target and compare against a label of 1
  - Update the non-leaf embeddings and the context embedding so the path leads to the target word

- Inference
  - To obtain a full distribution over words, we still have to evaluate the whole tree
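A small sketch of how a word's probability is read off the tree: each internal node holds a vector, and the word's probability is the product of sigmoid branch probabilities along its path (the 4-word vocabulary and tree layout are made up; note the leaf probabilities sum to 1):

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8
node_vecs = rng.normal(scale=0.1, size=(3, D))   # internal nodes: 0 = root, 1 and 2 = its children

# Path to each leaf word: (internal node, branch), branch = +1 for left, -1 for right.
paths = {
    "cat":  [(0, +1), (1, +1)],
    "dog":  [(0, +1), (1, -1)],
    "fish": [(0, -1), (2, +1)],
    "bird": [(0, -1), (2, -1)],
}

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def word_prob(word, context_vec):
    p = 1.0
    for node, branch in paths[word]:
        p *= sigmoid(branch * node_vecs[node] @ context_vec)   # P(take this branch | context)
    return p

h = rng.normal(size=D)                           # some context embedding
print(sum(word_prob(w, h) for w in paths))       # 1.0: the leaf probabilities are normalised
```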

#### Word2Vec

- Either CBOW or skip-gram model

- Trained with a negative log-likelihood (NLL) loss

- Makes the softmax cheaper to compute by either:
  - Negative sampling: we consider alignment with the target and a random sample of non-target word embeddings
  - Hierarchical softmax
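A sketch of the negative-sampling objective for one (context, target) pair, with illustrative vocabulary size and number of negatives; in practice negatives are drawn from a smoothed unigram distribution rather than uniformly:

```python
import numpy as np

rng = np.random.default_rng(0)
V, D, K = 1000, 50, 5                         # vocabulary size, embedding size, negatives per pair
E_in = rng.normal(scale=0.1, size=(V, D))     # input (context) embeddings
E_out = rng.normal(scale=0.1, size=(V, D))    # output (target) embeddings

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(context_id, target_id):
    h = E_in[context_id]
    negatives = rng.integers(V, size=K)                  # randomly sampled non-target words
    pos = np.log(sigmoid(E_out[target_id] @ h))          # pull the true pair together
    neg = np.log(sigmoid(-E_out[negatives] @ h)).sum()   # push the sampled pairs apart
    return -(pos + neg)

print(neg_sampling_loss(3, 7))
```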

### Recommender Systems

#### Collaborative filtering

- Given a sparse target matrix R: targets may be binary (buy / ignore) or continuous (a rating), with an extra "missing" value

- We wish to learn a matrix factorisation R ≈ A Bᵀ (plus bias vectors for each factor matrix)

- A Bᵀ will then also contain predictions for the missing values

- Loss is only computed on non-missing values

- Alternative: singular value decomposition (SVD)
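A minimal matrix-factorisation sketch with a masked squared-error loss (bias vectors omitted; the toy rating matrix, rank, and learning rate are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
R = np.array([[5.0, 3.0, np.nan, 1.0],        # users x items, np.nan marks "missing"
              [4.0, np.nan, np.nan, 1.0],
              [1.0, 1.0, np.nan, 5.0],
              [np.nan, 1.0, 5.0, 4.0]])
mask = ~np.isnan(R)                            # loss is only computed on non-missing entries
n_users, n_items = R.shape
k = 2                                          # rank of the factorisation
A = rng.normal(scale=0.1, size=(n_users, k))
B = rng.normal(scale=0.1, size=(n_items, k))

for _ in range(2000):
    err = np.where(mask, A @ B.T - np.nan_to_num(R), 0.0)   # error on observed entries only
    grad_A, grad_B = err @ B, err.T @ A
    A -= 0.05 * grad_A
    B -= 0.05 * grad_B

print((A @ B.T).round(1))                      # reconstruction, including the missing cells
```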