Distributed Implementations
Asynchronous stochastic gradient descent
- Parameters stored in a database
- Each independent process reads the current parameters, computes gradients on its own data, and writes its updates back
- All done without a lock
- Processes may overwrite or interfere with each other's updates, but this seems to work well regardless
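Below is a minimal Hogwild-style sketch of this idea in Python. A shared-memory array stands in for the parameter database, the model is a toy linear least-squares fit, and all names and hyperparameters are illustrative rather than taken from any particular system.

```python
import numpy as np
from multiprocessing import Array, Process

def worker(shared_w, X, y, lr, steps):
    # View the shared parameter block as a numpy array; no lock is ever taken.
    w = np.frombuffer(shared_w.get_obj())
    rng = np.random.default_rng()
    for _ in range(steps):
        i = rng.integers(len(y))           # pick one training example
        grad = (X[i] @ w - y[i]) * X[i]    # gradient of squared error for that example
        w -= lr * grad                     # in-place update; may race with other workers

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 5))
    true_w = np.arange(5.0)
    y = X @ true_w
    shared_w = Array("d", 5)               # shared memory standing in for the parameter store
    procs = [Process(target=worker, args=(shared_w, X, y, 0.01, 2000)) for _ in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    # Despite unsynchronised, possibly overwritten updates, this should end up close to true_w.
    print(np.frombuffer(shared_w.get_obj()))
```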
Speech Recognition
Objective of speech recognition:
- Input = acoustic signal (phonemes, or whole utterances)
- Represented by input vectors taken over small time frames
- Output = sequence of words
Speech recognition approaches:
- Early: Gaussian mixture models / hidden Markov models
- Seq-to-seq LSTM
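As a rough illustration of the neural approach, the sketch below shows only the acoustic-encoder half of such a system: an LSTM mapping per-frame feature vectors to per-frame output scores. A full seq-to-seq recogniser would add a decoder that emits the word sequence; the feature and output dimensions here are made-up placeholders.

```python
import torch
import torch.nn as nn

# Hypothetical acoustic encoder: per-frame feature vectors in, per-frame scores out.
class AcousticLSTM(nn.Module):
    def __init__(self, n_features=40, hidden=128, n_outputs=30):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_outputs)

    def forward(self, frames):          # frames: (batch, time, n_features)
        h, _ = self.lstm(frames)        # (batch, time, hidden)
        return self.out(h)              # (batch, time, n_outputs) unnormalised scores

model = AcousticLSTM()
utterance = torch.randn(1, 200, 40)     # 200 small time frames of 40-dim features
print(model(utterance).shape)           # torch.Size([1, 200, 30])
```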
NLP
Language model definition:
A probability distribution over sequences of natural language tokens (e.g. words & punctuation)
n-gram definition:
a contiguous sequence of n tokens; in an n-gram language model, each token's probability is modelled as conditionally dependent on the previous n - 1 tokens
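As a concrete toy example of the n = 2 case, a bigram model can be estimated purely by counting; the corpus and function below are made up for illustration.

```python
from collections import Counter, defaultdict

# Toy corpus; a real model would use a much larger corpus and add smoothing.
corpus = "the cat sat on the mat . the dog sat on the rug .".split()

bigram_counts = defaultdict(Counter)
for prev, cur in zip(corpus, corpus[1:]):
    bigram_counts[prev][cur] += 1

def p_next(prev, cur):
    # Maximum-likelihood estimate of P(cur | prev) from the counts.
    return bigram_counts[prev][cur] / sum(bigram_counts[prev].values())

print(p_next("the", "cat"))   # 0.25: "the" is followed once each by cat, mat, dog, rug
```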
Word Embeddings
word embedding definition:
A dense representation of a word's semantics in the form of a numeric vector that represents key properties and relationships. e.g. vector("cat") - vector("kitten") is similar to vector("dog") - vector("puppy")
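The toy snippet below illustrates that analogy arithmetic with hand-picked 3-dimensional vectors (not real learned embeddings, which would typically have hundreds of dimensions).

```python
import numpy as np

# Hand-picked toy vectors (not learned embeddings) just to show the arithmetic.
vec = {
    "cat":    np.array([0.9, 0.1, 0.8]),
    "kitten": np.array([0.9, 0.8, 0.8]),
    "dog":    np.array([0.1, 0.1, 0.9]),
    "puppy":  np.array([0.1, 0.8, 0.9]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

adult_minus_young_1 = vec["cat"] - vec["kitten"]
adult_minus_young_2 = vec["dog"] - vec["puppy"]
print(cosine(adult_minus_young_1, adult_minus_young_2))   # 1.0: the difference vectors align
```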
Skip-Gram Model
- Form dataset by:
- Each word is a target
- Each target has an associated context comprising a number of nearby words
- Each (target, context-word) pair goes into the dataset
- Start with an arbitrary embedding matrix, one row per word
- Look up the context word's embedding and multiply it by the embedding matrix (excluding the context word's own row) to get a score for each word
- Softmax over the scores to get target probabilities
- Form the loss by comparing with the one-hot representation of the target word
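The sketch below implements this training loop on a toy corpus, with one simplification: it uses separate input (context) and output (target) embedding matrices, as in standard word2vec, rather than a single shared matrix. The corpus, dimensions and learning rate are arbitrary choices.

```python
import numpy as np

corpus = "the quick brown fox jumps over the lazy dog".split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}
V, D, window, lr = len(vocab), 10, 2, 0.1

# Each (target, context-word) pair within the window goes into the dataset.
pairs = [(idx[corpus[t]], idx[corpus[c]])
         for t in range(len(corpus))
         for c in range(max(0, t - window), min(len(corpus), t + window + 1))
         if c != t]

rng = np.random.default_rng(0)
E = rng.normal(scale=0.1, size=(V, D))   # input (context) embeddings, one row per word
W = rng.normal(scale=0.1, size=(V, D))   # output (target) embeddings

for epoch in range(50):
    for target, context in pairs:
        scores = W @ E[context]                  # one score per vocabulary word
        probs = np.exp(scores - scores.max())
        probs /= probs.sum()                     # softmax over the scores
        grad = probs.copy()
        grad[target] -= 1.0                      # cross-entropy gradient vs. one-hot target
        dE = W.T @ grad                          # gradient w.r.t. the context embedding
        dW = np.outer(grad, E[context])          # gradient w.r.t. the output embeddings
        E[context] -= lr * dE
        W -= lr * dW
```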
Continuous Bag-of-Words
Like skip-gram, but takes as input all words in the nearby context and averages over their word embeddings.
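Continuing the toy sketch above (and reusing its E, W and idx), the only change for a CBOW-style input is the averaging step:

```python
# Continuing the sketch above: average the context words' embeddings before scoring.
context_words = ["quick", "fox"]
context_vec = E[[idx[w] for w in context_words]].mean(axis=0)
scores = W @ context_vec                 # then softmax + cross-entropy exactly as before
```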
Hierarchical Softmax
- Binary tree
- Leaf nodes = word labels (not embeddings)
- Non-leaf nodes = embeddings; the inner product of a non-leaf embedding with the context vector (passed through a sigmoid) gives the probability that the target lies in its left vs. right subtree
- Training
- Evaluate the branch probabilities along the path to the target leaf; the resulting path probability is compared against a label of 1
- Update the non-leaf embeddings and the context embedding so the path to the target word becomes more probable
- Inference
- To recover the full distribution (or find the most probable word) we still have to evaluate the whole tree
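A toy sketch of the probability computation, assuming a hand-built tree over four words; the node embeddings, paths and dimensions are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8
words = ["cat", "dog", "fish", "bird"]
# Path to each leaf: a list of (internal-node id, go-left?) decisions in a fixed binary tree.
paths = {
    "cat":  [(0, True),  (1, True)],
    "dog":  [(0, True),  (1, False)],
    "fish": [(0, False), (2, True)],
    "bird": [(0, False), (2, False)],
}
node_emb = rng.normal(scale=0.1, size=(3, D))   # one embedding per non-leaf node

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def p_word(word, context_vec):
    # Product of left/right branch probabilities along the path to the word's leaf.
    p = 1.0
    for node, go_left in paths[word]:
        p_left = sigmoid(node_emb[node] @ context_vec)
        p *= p_left if go_left else (1.0 - p_left)
    return p

context_vec = rng.normal(size=D)
print(sum(p_word(w, context_vec) for w in words))   # sums to 1 across the vocabulary
```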
Word2Vec
- Either CBOW or skip-gram model
- Trained with a negative log-likelihood (NLL) loss
- Makes the softmax cheaper by either:
- Negative sampling: only consider alignment with the target and with a small random sample of non-target word embeddings, instead of the whole vocabulary
- Hierarchical softmax
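A sketch of one negative-sampling update, assuming embedding matrices E and W shaped like those in the skip-gram sketch above; the number of negatives and the uniform sampling are simplifications (word2vec samples negatives from a smoothed unigram distribution).

```python
import numpy as np

def negative_sampling_step(E, W, context, target, rng, k=5, lr=0.1):
    # One update for a single (context, target) pair: the true target gets label 1,
    # k randomly drawn words get label 0, and each is scored with a sigmoid.
    e = E[context].copy()
    samples = [(target, 1.0)] + [(n, 0.0) for n in rng.integers(len(W), size=k)]
    grad_e = np.zeros_like(e)
    for word, label in samples:
        score = 1.0 / (1.0 + np.exp(-(W[word] @ e)))   # sigmoid of the inner product
        g = score - label                              # binary cross-entropy gradient
        grad_e += g * W[word]
        W[word] -= lr * g * e                          # pull the target closer, push negatives away
    E[context] -= lr * grad_e
```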
Recommender Systems
Collaborative filtering
- Given a sparse target matrix Y: entries may be binary (buy / ignore) or continuous (rating), with an extra "missing" value
- We wish to learn a matrix factorisation such that Y ≈ U Vᵀ (plus bias vectors for each matrix)
- U Vᵀ will also then contain predictions for the missing values
- Loss is only computed on non-missing values
- Alternative = SVD
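A minimal sketch of the factorisation approach, fitting U and V by SGD on the observed entries of a small hand-made ratings matrix; bias vectors are omitted for brevity and all hyperparameters are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
Y = np.array([[5.0, 3.0, np.nan, 1.0],
              [4.0, np.nan, np.nan, 1.0],
              [1.0, 1.0, np.nan, 5.0],
              [np.nan, 1.0, 5.0, 4.0]])    # nan marks the "missing" entries
n_users, n_items = Y.shape
rank, lr, reg = 2, 0.05, 0.01
U = rng.normal(scale=0.1, size=(n_users, rank))
V = rng.normal(scale=0.1, size=(n_items, rank))
observed = [(i, j) for i in range(n_users) for j in range(n_items) if not np.isnan(Y[i, j])]

for epoch in range(500):
    for i, j in observed:                   # loss is computed on non-missing values only
        err = U[i] @ V[j] - Y[i, j]
        U[i], V[j] = (U[i] - lr * (err * V[j] + reg * U[i]),
                      V[j] - lr * (err * U[i] + reg * V[j]))

print(np.round(U @ V.T, 1))                 # also fills in predictions for the missing entries
```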