In 5 bullet points
- Current models give increased performance when more parameters and data are used
- This leads to a quadratic increase in training costs
- Using sparse gating with MoEs, each parameter is trained only on the subset of the data where it helps most
- This is done in a distributed way, within a layer of the network
- Allows models to be trained with far more parameters and still accrue the expected performance gains, without a huge increase in training costs
Overview
Context:
- A key current limitation for ML models is the number of parameters (model capacity)
- With more parameters we also require more data
High-level problem: adding more parameters and more data to existing models gives a quadratic ⬆️ in training costs
High-level solution: MoE model where only parts of the model are active, conditional on the input
MoE problems addressed:
- Effective (per-expert) batch size can get so small as to be inefficient
- Network bandwidth can become a bottleneck
- MoEs have not been tried on very large datasets
Method
Proposed approach:
Expert: ff-NN, 1 hidden layer + ReLU
Gate: softmax gating + tunable Gaussian noise + top-k sparsity: G(x) = Softmax(KeepTopK(x·W_g + StandardNormal()·Softplus(x·W_noise), k)) (see the sketch below)
Rest: Word embedding layer, LSTM layers before & after
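A minimal NumPy sketch of the gating and expert pieces above, following the gating formula from the paper; the weight names (W_g, W_noise), sizes, and initialisation are illustrative assumptions, not the paper's code.

```python
import numpy as np

def softplus(z):
    return np.log1p(np.exp(z))

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def noisy_top_k_gating(x, W_g, W_noise, k, rng):
    """G(x) = Softmax(KeepTopK(x@W_g + N(0,1) * Softplus(x@W_noise), k))."""
    clean = x @ W_g                                    # (batch, n_experts)
    noisy = clean + rng.standard_normal(clean.shape) * softplus(x @ W_noise)
    kth = np.sort(noisy, axis=-1)[:, -k][:, None]      # k-th largest logit per example
    masked = np.where(noisy >= kth, noisy, -np.inf)    # KeepTopK: discard the rest
    return softmax(masked)                             # at most k non-zero gates per row

def make_expert(d_model, d_hidden, rng):
    """Expert = feed-forward net, 1 hidden layer + ReLU."""
    W1 = 0.02 * rng.standard_normal((d_model, d_hidden))
    W2 = 0.02 * rng.standard_normal((d_hidden, d_model))
    return lambda x: np.maximum(x @ W1, 0.0) @ W2

def moe_layer(x, experts, gates):
    """y = sum_i G(x)_i * E_i(x); experts with a zero gate are skipped entirely."""
    y = np.zeros_like(x)
    for i, expert in enumerate(experts):
        idx = np.flatnonzero(gates[:, i] > 0)
        if idx.size:
            y[idx] += gates[idx, i:i + 1] * expert(x[idx])
    return y

rng = np.random.default_rng(0)
batch, d_model, d_hidden, n_experts, k = 8, 16, 32, 4, 2
x = rng.standard_normal((batch, d_model))
W_g = rng.standard_normal((d_model, n_experts))
W_noise = rng.standard_normal((d_model, n_experts))
experts = [make_expert(d_model, d_hidden, rng) for _ in range(n_experts)]
gates = noisy_top_k_gating(x, W_g, W_noise, k, rng)
y = moe_layer(x, experts, gates)   # (batch, d_model); only k experts ran per example
```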
Experimental Details:
- Dropout
- Residual connections
- Activation checkpointing
- Attention mechanism between encoder and decoder
Distributed Implementation:
Data parallel: LSTM + gating layers
Model parallel: experts grouped across devices
Hierarchical MoE: first gating network = data-parallel, secondary MoEs = single device
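A rough sketch of how the two gating levels combine in the hierarchical case: the primary gate picks an expert group, a per-group secondary gate picks experts inside it, and the effective weight is the product. Noise and top-k sparsity are omitted for brevity; all names and sizes are illustrative.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
batch, d_model, n_groups, experts_per_group = 4, 8, 3, 5
x = rng.standard_normal((batch, d_model))
W_primary = rng.standard_normal((d_model, n_groups))                       # replicated (data-parallel)
W_secondary = rng.standard_normal((n_groups, d_model, experts_per_group))  # one group per device

g_primary = softmax(x @ W_primary)                                # (batch, n_groups)
g_secondary = softmax(np.einsum('bd,gde->bge', x, W_secondary))   # (batch, n_groups, experts)
combined = g_primary[:, :, None] * g_secondary                    # weight for expert (group, e)
assert np.allclose(combined.sum(axis=(1, 2)), 1.0)                # still a proper distribution
```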
Problems Solved
Small batch size:
Problem: As we add devices, the per-expert batch size shrinks
Solution: We can compute a group of sequential LSTM outputs and send them as a (macro)batch to the MoE layer ("convolutional approach")
Outcome: Increases effective batch size & efficiency
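A sketch of that trick, assuming the MoE layer is applied identically at every timestep: fold the time dimension into the batch before dispatching to the experts (the `moe` function here is just a placeholder).

```python
import numpy as np

batch, time, d_model = 32, 20, 16
h = np.random.default_rng(0).standard_normal((batch, time, d_model))  # unrolled LSTM outputs

def moe(x):
    return x  # placeholder for a real MoE layer applied to a 2-D (examples, d_model) batch

flat = h.reshape(batch * time, d_model)        # effective batch grows by a factor of `time`
out = moe(flat).reshape(batch, time, d_model)  # restore (batch, time) afterwards
```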
Network bandwidth:
Problem: Major limitation can be network bandwidth
Solution: An expert's arithmetic intensity (ops:bytes ratio) scales with its hidden layer size (computation ∝ input_size × hidden_size, while network traffic ∝ input_size)
Outcome: By using larger hidden layers we can hide the cost of the network
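A back-of-the-envelope version of that argument with made-up sizes: per token, an expert's compute grows with d_model × d_hidden, while the traffic it generates is only its d_model-sized input and output, so the ops:bytes ratio grows linearly with the hidden size.

```python
# Illustrative sizes only, not the paper's configuration.
d_model = 1024         # expert input/output width
d_hidden = 8192        # expert hidden layer width
bytes_per_value = 4    # float32 activations

flops = 2 * d_model * d_hidden + 2 * d_hidden * d_model   # two matmuls (mul+add = 2 ops)
bytes_moved = 2 * d_model * bytes_per_value               # input in + output out over the network
print(flops / bytes_moved)   # ≈ d_hidden / 2 ops per byte: scales with the hidden layer size
```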
Expert importance balancing:
Problem: Vicious cycle where commonly selected experts trained more, get better, and are selected more
Solution: Add an importance loss: L_importance(X) = w_importance · CV(Importance(X))², where Importance(X) = Σ_{x∈X} G(x)
Outcome: Regularises gating mechanism to make experts equally important
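A sketch of that importance loss, assuming `gates` is the (batch, n_experts) matrix from the gating sketch above and `w_importance` is a hand-tuned weight.

```python
import numpy as np

def importance_loss(gates, w_importance=0.1):
    """CV^2 penalty on per-expert importance (sum of gate values over the batch)."""
    importance = gates.sum(axis=0)                                    # (n_experts,)
    cv_squared = importance.var() / (importance.mean() ** 2 + 1e-10)  # CV^2 = var / mean^2
    return w_importance * cv_squared
```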
Expert load balancing:
Problem: The importance loss equalises total gate weight across the batch, but not explicitly the load (number of examples) each expert receives
Solution: Add a load loss: L_load(X) = w_load · CV(Load(X))², where Load_i(X) is a smooth estimator of the number of examples routed to expert i
Outcome: Regularises the gating mechanism so experts receive roughly equal numbers of examples
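A sketch of the load loss in the same style. The paper uses a smooth, differentiable estimator of per-expert load; for brevity this version counts hard assignments instead, which captures the idea but not the differentiability.

```python
import numpy as np

def load_loss(gates, w_load=0.1):
    """CV^2 penalty on per-expert load (here: hard count of examples routed to each expert)."""
    load = (gates > 0).sum(axis=0).astype(float)            # (n_experts,)
    cv_squared = load.var() / (load.mean() ** 2 + 1e-10)    # CV^2 = var / mean^2
    return w_load * cv_squared
```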
Adam Adjustment:
Problem: Adam optimiser states take up too much memory
Solution:
- Drop first-moment gradient estimates → use the current gradient instead
- Factored representation of each parameter matrix's second moment estimates
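A minimal sketch of that factored second-moment idea; the decay rate, epsilon, and learning rate are placeholders, not the paper's settings, and bias correction is omitted.

```python
import numpy as np

def factored_update(param, grad, row_avg, col_avg, lr=1e-3, beta2=0.999, eps=1e-30):
    """Adam-like step with no first moment (the raw gradient is used) and a factored
    second moment: keep running per-row and per-column means of grad**2 and rebuild the
    full matrix as their outer product divided by the overall mean."""
    sq = grad ** 2
    row_avg = beta2 * row_avg + (1 - beta2) * sq.mean(axis=1)    # (rows,)
    col_avg = beta2 * col_avg + (1 - beta2) * sq.mean(axis=0)    # (cols,)
    v_hat = np.outer(row_avg, col_avg) / (row_avg.mean() + eps)  # rank-1 reconstruction
    param = param - lr * grad / (np.sqrt(v_hat) + eps)
    return param, row_avg, col_avg

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 512))
g = rng.standard_normal((256, 512))
r, c = np.zeros(256), np.zeros(512)
W, r, c = factored_update(W, g, r, c)   # stores 256 + 512 values instead of 256 * 512
```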
Experiments
1 Bn Word Language Modelling
Increased no. experts/params: near-linear perplexity gains for the flat model; the hierarchical model gives a slight further improvement beyond that.
Increased compute budget: roughly linear improvement, comparable to the LSTM baselines.
100 Bn Word Language Modelling
Increased no. experts/params: roughly linear gains that tail off; can't quite scale all the way to 100 bn params
Problem: possibly too much sparsity?
Increased data: the amount of training data determines the performance asymptote
Machine Translation: Far higher BLEU and lower perplexity than baselines, with far more params