#### In 5 bullet points

- Current models give increased performance when more parameters and data are used

- This leads to a quadratic increase in training costs

- Using sparse gating with MoEs, we can train each parameter on the *subset* of the data it can help the most with

- This is done in a distributed way, within a layer of the network

- Allows models to be trained with far more parameters and still accrue expected performance gains, without huge increase in training costs

#### Overview

**Context:**

- Current limitation for ML is number of parameters

- With more parameters we also require more data

**High-level problem:** adding more parameters and more data to existing models gives a **quadratic** ⬆️ in training costs

**High-level solution:** MoE model where only parts of the model are active, conditional on the input

**MoE problems addressed:**

- Effective batch size can get so small as to be inefficient

- Network bandwidth can become a bottleneck

- Not been tried on large datasets

#### Method

**Proposed approach:**

**Expert:** feed-forward NN, 1 hidden layer + ReLU

**Gate:** softmax gating + noise + top-k sparsity (noisy top-k gating; see the sketch below)

**Rest:** word embedding layer, LSTM layers before & after the MoE layer
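
A minimal numpy sketch of the expert and the noisy top-k gate (softmax over the k largest noisy logits, with input-dependent Gaussian noise scaled by a softplus term), following the paper's formulation. The weight shapes and the `moe_forward` helper are illustrative assumptions, not the paper's code.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def noisy_top_k_gate(x, w_gate, w_noise, k, rng):
    """Noisy top-k gating: add input-dependent Gaussian noise to the gate logits,
    keep only the k largest per example, softmax over the survivors."""
    clean = x @ w_gate                                         # [batch, n_experts]
    noise_scale = np.logaddexp(0.0, x @ w_noise)               # softplus
    noisy = clean + rng.standard_normal(clean.shape) * noise_scale
    kth = np.sort(noisy, axis=-1)[:, -k][:, None]              # k-th largest per row
    masked = np.where(noisy >= kth, noisy, -np.inf)            # drop everything else
    return softmax(masked)                                     # sparse gates, rows sum to 1

def expert_forward(x, w_in, w_out):
    """One expert: a feed-forward net with a single ReLU hidden layer."""
    return np.maximum(x @ w_in, 0.0) @ w_out

def moe_forward(x, gates, experts):
    """Output = gate-weighted sum of expert outputs; experts with zero gate are skipped."""
    y = np.zeros((x.shape[0], experts[0][1].shape[1]))
    for i, (w_in, w_out) in enumerate(experts):
        idx = np.nonzero(gates[:, i] > 0.0)[0]
        if idx.size:
            y[idx] += gates[idx, i:i + 1] * expert_forward(x[idx], w_in, w_out)
    return y
```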

**Experimental Details:**

- Dropout

- Residual connections

- Activation checkpointing

- Attention mechanism between encoder and decoder

**Distributed Implementation:**

**Data parallel:** LSTM + gating layers

**Model parallel:** experts grouped across devices

**Hierarchical MoE:** first gating network = data-parallel, secondary MoEs = single device
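
A conceptual single-process sketch of the hybrid scheme: gating runs on every replica (data parallel), while each device owns a group of experts (model parallel) and must receive every example routed to one of its experts. The function name and the `expert_to_device` mapping are hypothetical, for illustration only.

```python
import numpy as np

def group_tokens_by_device(gates, expert_to_device, n_devices):
    """For each device, collect the indices of the examples whose selected experts
    live on that device (conceptually, the payload of the device-to-device exchange)."""
    per_device = [set() for _ in range(n_devices)]
    n_examples, n_experts = gates.shape
    for e in range(n_experts):
        routed = np.nonzero(gates[:, e] > 0.0)[0]
        per_device[expert_to_device[e]].update(routed.tolist())
    return [sorted(ix) for ix in per_device]
```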

#### Problems Solved

**Small batch size:**

**Problem:** With sparse gating, each expert receives only a small fraction (≈ k/n) of each batch, so per-expert batch size shrinks as the number of experts grows

**Solution:** Compute a group of sequential LSTM outputs and send them to the MoE layer as one (macro)batch ("convolutional approach"; sketched below)

**Outcome:** Increases effective batch size & efficiency
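
A sketch of that macro-batching, assuming activations shaped `[batch, time, dim]` and reusing the hypothetical `moe_forward` from the gating sketch (`gate_fn` is assumed to be a closure over the gating weights): flattening batch and time multiplies the per-expert batch size by the number of time steps.

```python
def moe_over_all_timesteps(x_btd, gate_fn, experts):
    """Flatten (batch, time) so the MoE layer sees one large batch rather than one
    small batch per time step (the "convolutional approach" in these notes)."""
    b, t, d = x_btd.shape
    flat = x_btd.reshape(b * t, d)                 # effective batch size: b * t
    y = moe_forward(flat, gate_fn(flat), experts)  # dispatch/combine as sketched earlier
    return y.reshape(b, t, -1)
```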

**Network bandwidth:**

**Problem:** Network bandwidth can be the major limitation, since each example's activations must be shipped to the experts' devices and back

**Solution:** Raise the arithmetic intensity (ops:bytes ratio), which scales with the expert's hidden layer size (see the back-of-envelope calculation below)

**Outcome:** By using larger expert hidden layers we can hide the cost of network communication
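
A back-of-envelope calculation of why larger hidden layers help: per token, the expert does two matmuls worth of compute while only the d-dimensional input and output cross the network, so ops:bytes grows linearly with h. The numbers are illustrative assumptions, not values from the paper.

```python
# d = model width, h = expert hidden size (illustrative values only)
d, h, bytes_per_float = 1024, 8192, 4
flops_per_token = 2 * d * h + 2 * h * d        # input->hidden and hidden->output matmuls
bytes_per_token = 2 * d * bytes_per_float      # ship activations to the expert and back
print(flops_per_token / bytes_per_token)       # ops:bytes ratio, proportional to h
```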

**Expert importance balancing:**

**Problem:** Vicious cycle where commonly selected experts are trained more, get better, and are selected even more

**Solution:** Add an auxiliary *importance loss* to the training loss: the squared coefficient of variation of per-expert importance (the sum of gate values over the batch), scaled by a hand-tuned weight (see the sketch below)

**Outcome:** Regularises the gating mechanism to make experts equally important
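
A sketch of the importance loss, assuming `gates` is the `[batch, n_experts]` output of the gating sketch above; `w_importance` is a hand-tuned weight whose value here is just a placeholder.

```python
def importance_loss(gates, w_importance=0.1):
    """Importance of expert i = sum of its gate values over the batch; penalise the
    squared coefficient of variation (var / mean^2) so no expert dominates."""
    importance = gates.sum(axis=0)                                   # [n_experts]
    cv_squared = importance.var() / (importance.mean() ** 2 + 1e-10)
    return w_importance * cv_squared
```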

**Expert load balancing:**

**Problem:** The importance loss equalises gate weights across the batch, but not explicitly the load (number of examples) each expert receives

**Solution:** Add an auxiliary *load loss* to the training loss: the squared coefficient of variation of a smooth estimate of per-expert load (see the sketch below)

**Outcome:** Regularises the gating mechanism to balance the number of examples routed to each expert
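
A matching sketch for the load loss. The paper's load term is a smooth, differentiable estimate of how many examples each expert receives (derived from the gating noise distribution); the hard count below is a simplification for illustration, and `w_load` is a placeholder value.

```python
def load_loss(gates, w_load=0.1):
    """Penalise uneven load with the same CV^2 form; load here is a hard count of
    examples routed to each expert (the paper uses a smooth estimate instead)."""
    load = (gates > 0.0).sum(axis=0).astype(float)                   # [n_experts]
    cv_squared = load.var() / (load.mean() ** 2 + 1e-10)
    return w_load * cv_squared
```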

**Adam Adjustment:**

**Problem:** Adam optimiser states take up too much memory

**Solution:**

- Drop the first-moment gradient estimates → just use the current gradient value

- Keep a factored representation of each parameter matrix's second-moment estimates (see the sketch below)
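
A sketch of the factored second-moment idea (closely related to what was later formalised as Adafactor), assuming a 2-D parameter matrix: only a row vector and a column vector of moving averages are stored, and the full estimate is rebuilt as a rank-1 outer product.

```python
import numpy as np

def factored_second_moment(v_row, v_col, grad, beta2=0.999, eps=1e-30):
    """Maintain row/column moving averages of grad**2 and reconstruct the full
    per-parameter second-moment estimate as outer(row, col) / mean(row)."""
    g2 = grad ** 2
    v_row = beta2 * v_row + (1.0 - beta2) * g2.mean(axis=1)      # [rows]
    v_col = beta2 * v_col + (1.0 - beta2) * g2.mean(axis=0)      # [cols]
    v_hat = np.outer(v_row, v_col) / (v_row.mean() + eps)        # [rows, cols]
    return v_row, v_col, v_hat
```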

#### Experiments

**1 Bn Word Language Modelling**

**Increased no. experts/params:** Near-linear speedup for flat model; slight improvement beyond that for hierarchical.

**Increased budget:** Linear improvement, comparable to LSTM.

**100 Bn Word Language Modelling**

**Increased no. experts/params:** Linear speedup dropping off; can't quite scale to 100 bn params

**Problem:** Possibly too much sparsity?

**Increased data:** The amount of data determines the asymptote

**Machine Translation:** Far higher BLEU and lower perplexity than baselines, with far more params