🏗️

Applications


Distributed Implementations

Asynchronous stochastic gradient descent

  1. Parameters are stored in a shared parameter store (e.g. a database)
  1. Each independent process reads the parameters, computes gradients, and increments the parameters
  1. All of this is done without any locking
  1. Processes may overwrite or interfere with each other's updates, but it seems to work well regardless (a sketch follows below)
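
A minimal sketch of this lock-free scheme (in the spirit of Hogwild!), assuming the shared parameter store is just a numpy array rather than a real database; the least-squares objective and the `worker` function are illustrative stand-ins:

```python
import threading
import numpy as np

true_w = np.array([2.0, -3.0, 0.5])
params = np.zeros(3)                         # shared parameters, no lock around them

def worker(seed, steps=500, lr=0.01):
    rng = np.random.default_rng(seed)
    for _ in range(steps):
        X = rng.normal(size=(32, 3))         # fake mini-batch for a least-squares fit
        y = X @ true_w + 0.01 * rng.normal(size=32)
        w = params.copy()                    # read the current parameters
        grad = 2 * X.T @ (X @ w - y) / len(y)
        params[:] -= lr * grad               # increment in place; concurrent workers
                                             # may interleave or overwrite updates

threads = [threading.Thread(target=worker, args=(seed,)) for seed in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(params)                                # ends up close to true_w regardless
```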

Speech Recognition

Objective of speech recognition:
  1. Input = acoustic signal (of phonemes or whole utterances)
  1. Represented by input vectors computed over small time frames (see the framing sketch below)
  1. Output = a sequence of words

Speech recognition approaches:
  1. Early: Gaussian mixture models / hidden Markov models
  1. More recent: sequence-to-sequence LSTMs
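
A toy sketch of the input representation: slice a waveform into short overlapping frames, each of which becomes one input vector (in practice per-frame features such as MFCCs would be computed). The 25 ms / 10 ms frame and hop sizes are typical values assumed here, not taken from the notes:

```python
import numpy as np

sample_rate = 16000
signal = np.random.randn(sample_rate)              # 1 second of fake audio
frame_len = int(0.025 * sample_rate)               # 25 ms frames
hop = int(0.010 * sample_rate)                     # 10 ms hop between frames

frames = np.stack([signal[start:start + frame_len]
                   for start in range(0, len(signal) - frame_len + 1, hop)])
print(frames.shape)                                # (num_frames, frame_len) input vectors
```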

NLP

Language model definition:
A probability distribution over sequences of natural language tokens (e.g. words & punctuation)

n-gram definition:
A sequence of n consecutive tokens; an n-gram model treats each token's probability as conditionally dependent on the previous n-1 tokens
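
A toy count-based bigram (n = 2) model as an example, using a made-up corpus; it estimates P(token | previous token) from counts and scores a sequence with the chain rule:

```python
from collections import Counter

corpus = "the cat sat on the mat . the dog sat on the rug .".split()
bigram_counts = Counter(zip(corpus, corpus[1:]))
prev_counts = Counter(corpus[:-1])

def p(word, prev):
    # maximum-likelihood estimate of P(word | prev); unseen bigrams get probability 0
    return bigram_counts[(prev, word)] / prev_counts[prev]

def sequence_prob(tokens):
    prob = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        prob *= p(word, prev)
    return prob

print(sequence_prob("the cat sat on the rug .".split()))   # 0.0625
```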

Word Embeddings

word embedding definition:
A dense representation of a word's semantics in the form of a numeric vector that represents key properties and relationships, e.g. vector("cat") - vector("kitten") is similar to vector("dog") - vector("puppy")
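
A toy illustration of the analogy property using made-up 3-d vectors (real embeddings are learned, e.g. by word2vec, and have far more dimensions):

```python
import numpy as np

vec = {
    "cat":    np.array([0.9, 0.1, 0.6]),
    "kitten": np.array([0.9, 0.8, 0.6]),
    "dog":    np.array([0.1, 0.1, 0.7]),
    "puppy":  np.array([0.1, 0.8, 0.7]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

offset_cat = vec["cat"] - vec["kitten"]
offset_dog = vec["dog"] - vec["puppy"]
print(cosine(offset_cat, offset_dog))     # close to 1: the two offsets point the same way
```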

Skip-Gram Model

  1. Form the dataset by:
    1. Treating each word as a target
    2. Giving each target an associated context comprising a number of nearby words
    3. Adding each (target, context-word) pair to the dataset
  1. Start with an arbitrary (randomly initialised) embedding matrix, one row per word
  1. Look up the context word's embedding and take inner products with the rest of the embedding matrix (excluding the context word's own row)
  1. Apply a softmax over the resulting scores to get target probabilities
  1. Form the loss by comparing these probabilities with the one-hot representation of the true target (a sketch of one training step follows below)
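
A minimal sketch of one training step following the list above, assuming separate input and output embedding matrices (a common simplification of "the rest of the embedding matrix"); the vocabulary and training pair are made up:

```python
import numpy as np

vocab = ["the", "cat", "sat", "on", "mat"]
V, d = len(vocab), 8
rng = np.random.default_rng(0)
W_in = rng.normal(scale=0.1, size=(V, d))       # one embedding row per word
W_out = rng.normal(scale=0.1, size=(V, d))      # output embeddings used for scoring

def train_step(context_idx, target_idx, lr=0.1):
    v = W_in[context_idx]                       # context word embedding
    scores = W_out @ v                          # inner product with every word
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()                        # softmax -> target probabilities
    loss = -np.log(probs[target_idx])           # cross-entropy vs the one-hot target
    d_scores = probs.copy()
    d_scores[target_idx] -= 1.0                 # gradient of softmax cross-entropy
    W_in[context_idx] -= lr * (W_out.T @ d_scores)
    W_out -= lr * np.outer(d_scores, v)
    return loss

# one (target, context-word) pair: target "sat" with context word "cat"
print(train_step(context_idx=vocab.index("cat"), target_idx=vocab.index("sat")))
```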

Continuous Bag-of-Words

Like skip-gram, but takes as input all the words in the nearby context, averages their word embeddings, and predicts the centre (target) word from that average (sketch below).
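
A matching CBOW sketch, using the same kind of made-up vocabulary and embedding matrices as the skip-gram sketch above (repeated here so it runs on its own): average the context words' embeddings, then predict the centre word from that average:

```python
import numpy as np

vocab = ["the", "cat", "sat", "on", "mat"]
V, d = len(vocab), 8
rng = np.random.default_rng(1)
W_in = rng.normal(scale=0.1, size=(V, d))
W_out = rng.normal(scale=0.1, size=(V, d))

context = [vocab.index(w) for w in ["the", "sat", "on"]]   # words around the target
target = vocab.index("cat")

v_avg = W_in[context].mean(axis=0)          # average of the context embeddings
scores = W_out @ v_avg
probs = np.exp(scores - scores.max())
probs /= probs.sum()                        # softmax over the vocabulary
print(-np.log(probs[target]))               # cross-entropy loss for this example
```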

Hierarchical Softmax

  • Binary tree
  • Leaf nodes = word labels (not embeddings)
  • Non-leaf nodes = embeddings; the inner product of a non-leaf embedding with the context vector (passed through a sigmoid) gives the probability that the target lies in its left vs. right subtree (see the toy sketch after this list)
  • Training
    • Evaluate the branch probabilities along the target word's path and compare each to a label of 1 (the branch that actually leads to the target)
    • Update the non-leaf embeddings & the context embedding so the path decisions lead to the target word
  • Inference
    • To get the full distribution (or the most likely word) we still have to evaluate over the whole tree
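
A toy sketch of the idea for a 4-word vocabulary with a fixed binary tree; the tree layout, the random vectors, and the sigmoid-of-inner-product convention are illustrative assumptions. Each inner node holds an embedding, and a word's probability is the product of the left/right decisions along its path:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

d = 4
rng = np.random.default_rng(0)
inner = {name: rng.normal(size=d) for name in ["root", "L", "R"]}   # non-leaf embeddings

# path to each leaf word: (inner node, direction), +1 = left branch, -1 = right branch
paths = {
    "cat":   [("root", +1), ("L", +1)],
    "dog":   [("root", +1), ("L", -1)],
    "house": [("root", -1), ("R", +1)],
    "tree":  [("root", -1), ("R", -1)],
}

def word_prob(word, context_vec):
    prob = 1.0
    for node, direction in paths[word]:
        prob *= sigmoid(direction * (inner[node] @ context_vec))
    return prob

context_vec = rng.normal(size=d)
print({w: round(word_prob(w, context_vec), 3) for w in paths})
print(sum(word_prob(w, context_vec) for w in paths))    # sums to 1 over the vocabulary
```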

Word2Vec

  • Either CBOW or skip-gram model
  • Trained with a negative log-likelihood (NLL) loss
  • Makes the softmax cheaper by either:
    • Negative sampling: only consider alignment with the true target and with a random sample of non-target word embeddings (sketch below)
    • Hierarchical softmax
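
A sketch of the negative-sampling objective for one (context, target) pair: instead of a full softmax over the vocabulary, the true target is scored against a handful of randomly sampled non-target words. The matrix sizes and indices are made up:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

V, d = 1000, 50
rng = np.random.default_rng(0)
W_in = rng.normal(scale=0.1, size=(V, d))       # context embeddings
W_out = rng.normal(scale=0.1, size=(V, d))      # output (target-side) embeddings

def neg_sampling_loss(context_idx, target_idx, num_neg=5):
    v = W_in[context_idx]
    negatives = rng.choice(np.delete(np.arange(V), target_idx),
                           size=num_neg, replace=False)
    loss = -np.log(sigmoid(W_out[target_idx] @ v))           # pull the true target closer
    loss -= np.sum(np.log(sigmoid(-W_out[negatives] @ v)))   # push the negatives away
    return loss

print(neg_sampling_loss(context_idx=3, target_idx=42))
```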

Recommender Systems

Collaborative filtering

  • Given a sparse target matrix R: targets may be binary (buy / ignore) or continuous (a rating), with an extra "missing" value
  • We wish to learn a matrix factorisation U, V such that R ≈ U Vᵀ (plus bias vectors for each matrix)
  • U Vᵀ will then also contain predictions for the missing values
  • Loss is only computed on non-missing values
  • Alternative = SVD
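
A minimal sketch of the factorisation with a masked squared-error loss, trained by batch gradient descent; the ratings matrix, learning rate, and regularisation strength are made up, and 0 stands in for "missing":

```python
import numpy as np

rng = np.random.default_rng(0)
R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [0, 1, 5, 4]], dtype=float)       # toy ratings, 0 = missing
observed = R > 0                                # mask: loss only on non-missing entries

n_users, n_items, k = R.shape[0], R.shape[1], 2
U = 0.1 * rng.normal(size=(n_users, k))
V = 0.1 * rng.normal(size=(n_items, k))
b_u = np.zeros(n_users)                         # per-user bias
b_i = np.zeros(n_items)                         # per-item bias

lr, reg = 0.02, 0.01
for _ in range(5000):
    pred = U @ V.T + b_u[:, None] + b_i[None, :]
    err = (R - pred) * observed                 # errors on observed entries only
    U += lr * (err @ V - reg * U)
    V += lr * (err.T @ U - reg * V)
    b_u += lr * err.sum(axis=1)
    b_i += lr * err.sum(axis=0)

pred = U @ V.T + b_u[:, None] + b_i[None, :]
print(np.round(pred, 1))                        # also fills in the missing cells
```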