
Sparsely-Gated Mixture-of-Experts

Title
Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer
Authors
Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, Jeff Dean
Date
2017
Venue
ICLR
Keywords
moe
sparsity

Contents:



Introduction

What is the problem/context that MoE methods are trying to solve?
  1. ⬆️ parameters + ⬆️ data = ⬆️ accuracy
  1. Increasing both leads to quadratic increase in training costs
How does Mixture of Experts (MoE) work at a high-level?
A "gating" network outputs sparse weights to select which expert networks to activate, based on the input
Why does MoE address the quadratic training cost problem?
Not all parameters are trained on all datapoints

Prior Issues

No prior work has demonstrated that gating can give ⬆️ capacity, efficiency or performance. This is due to:
  1. GPUs slow at branching. For this reason most approaches turn on/off large chunks of the network when gating.
  1. Conditional computation reduces the batch size for the conditionally active chunks, and can get too small to be efficient.
  1. Network bandwidth can be a bottleneck
  1. Special loss terms may be necessary to induce sparsity, but this can affect performance
  1. Existing literature doesn't deal with very large datasets (e.g. labelled image recognition datasets are relatively small and the labels offer minimal signal)

Contribution

  1. Address all 5 prior issues
  1. > 1000x improvement in model capacity, with only minor losses in efficiency
  1. Significant ⬆️ SOTA on public language modelling and translation datasets

Approach

New type of NN layer: Sparsely-Gated Mixture-of-Experts Layer (MoE).
[Figure: the MoE layer embedded between stacked LSTM layers]
  • Expert = simple ff-NN
  • Combined by a trainable gating network ➡️ selects a sparse combination of experts to process each input
  • Gating Network applied convolutionally over previous inputs
  • Trained jointly by backprop
  • Used here between stacked LSTM layers
  • Called at each position in the input, and may select different experts at each
  • Experts tend to become highly specialised

Related Work

Different types of experts have been used: SVM, Gaussian Processes, Dirichlet Processes, deep networks
Different configurations also: hierarchical, sequential experts, infinite numbers of experts
These approaches have not leveraged sparsity to improve computational efficiency though
 
Typically experts have been mixed at the top-level. Eigen et al. (2013) use MoEs internally within a deep model ➡️ this paper builds on this, adding sparsity and investigating increased model capacity.

The MoE Layer

Input $x$; Output $y = \sum_{i=1}^{n} G(x)_i \, E_i(x)$
Gating network $G(x)$ (sparse $n$-dimensional output)
Set of $n$ expert networks $E_1, \dots, E_n$, each with their own parameters
  • Whenever $G(x)_i = 0$ we can avoid computing $E_i(x)$ (see the sketch below).
  • Often have $n$ in the thousands, but evaluate just a handful of experts per input.
  • We may use a two-level hierarchical MoE ➡️ a primary gating network chooses a combination of "experts", each of which is itself a secondary MoE layer.
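A minimal NumPy sketch of this forward pass, assuming the experts are 1-hidden-layer ReLU ff-NNs (as used later in the paper); all sizes and names here are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np

def expert_forward(x, W1, W2):
    """One expert: a 1-hidden-layer ReLU feed-forward network."""
    return np.maximum(x @ W1, 0.0) @ W2

def moe_forward(x, gate, experts):
    """y = sum_i G(x)_i * E_i(x), skipping experts whose gate weight is zero.

    gate:    length-n vector G(x), mostly zeros (sparse).
    experts: list of (W1, W2) parameter tuples, one per expert.
    """
    y = np.zeros(experts[0][1].shape[1])
    for i, g in enumerate(gate):
        if g == 0.0:              # sparsity → this expert is never evaluated
            continue
        W1, W2 = experts[i]
        y += g * expert_forward(x, W1, W2)
    return y

# Illustrative usage
rng = np.random.default_rng(0)
d_model, d_hidden, n = 512, 1024, 4
experts = [(0.02 * rng.standard_normal((d_model, d_hidden)),
            0.02 * rng.standard_normal((d_hidden, d_model))) for _ in range(n)]
x = rng.standard_normal(d_model)
G = np.array([0.7, 0.0, 0.3, 0.0])     # a sparse gate output with k=2
y = moe_forward(x, G, experts)          # only experts 0 and 2 are computed
```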

Gating Network

Softmax gating:
$G_\sigma(x) = \mathrm{Softmax}(x \cdot W_g)$
where $W_g$ is a learned weight matrix.
 
Noisy Top-K gating:
Softmax gating + sparsity + noise:
$G(x) = \mathrm{Softmax}(\mathrm{KeepTopK}(H(x), k))$, where $H(x)_i = (x \cdot W_g)_i + \mathrm{StandardNormal}() \cdot \mathrm{Softplus}((x \cdot W_{noise})_i)$ and $\mathrm{KeepTopK}(v, k)_i = v_i$ if $v_i$ is among the top $k$ elements of $v$, and $-\infty$ otherwise.
The noise is Gaussian per component, with a learned, Softplus-activated scale.
The $\mathrm{KeepTopK}$ term enforces sparsity, as it sets all but the top-$k$ entries to $-\infty$, which become zero after the softmax.
The noise term helps with load balancing (see Appendix).
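A hedged NumPy sketch of noisy top-k gating as described above; the trainable matrices $W_g$ and $W_{noise}$ are per the paper, the function and variable names are mine:

```python
import numpy as np

def softplus(v):
    return np.log1p(np.exp(v))

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def noisy_top_k_gating(x, W_g, W_noise, k, rng):
    """G(x) = Softmax(KeepTopK(H(x), k)).

    H(x)_i = (x @ W_g)_i + StandardNormal() * Softplus((x @ W_noise)_i)
    KeepTopK keeps the top-k components and sets the rest to -inf,
    so they become exactly zero after the softmax.
    """
    clean = x @ W_g
    noise_std = softplus(x @ W_noise)
    H = clean + rng.standard_normal(clean.shape) * noise_std
    top_k = np.argsort(H)[-k:]             # indices of the k largest logits
    kept = np.full_like(H, -np.inf)
    kept[top_k] = H[top_k]
    return softmax(kept)
```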
 
Training the gating network
  • Standard backprop
  • Gradients also pass back through the gating network to previous layers
  • Bengio et al. (2015) use boolean gates (non-differentiable) and a REINFORCE-style approach

Addressing Performance Challenges

Batch Size

Problem: Say we can use a max batch size of $b$ → this means each expert gets $\approx \frac{kb}{n}$ examples. As the number of experts increases, this makes training increasingly inefficient as we're effectively training on tiny batch sizes.
Solution: Use a very large original batch size.
Further problem: this increases the memory needed to store activations for the backward pass, potentially hitting memory limits.
Further solutions:
  1. Mixing data and model parallelism
  1. Taking advantage of convolutionality

Mixing data and model parallelism

Conventional distributed training setting = asynchronous processing w/ central parameter server
Approach here:
  • (sub?-)batches processed synchronously so that we can combine for the MoE layer
  • Standard layer of model (LSTM) + gating network distributed via standard data-parallel approach
  • One copy of each expert. These are then grouped across devices.
  • Each expert (/device) receives the elements of the overall batch that the gating network assigns to it
  • Assuming we still have the same per-device batch size limit $b$ and $d$ devices, each expert now receives $\approx \frac{kbd}{n}$ examples ➡️ effective batch size scales linearly with the number of devices (see the worked example below)
  • Many features remain constant with this model as we scale: device batch size, device memory and bandwidth, number of model steps.
For hierarchical MoE, the first gating network is data-parallel, and each secondary MoE resides on one device.
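A quick arithmetic check of the effective expert batch size under this scheme; all numbers below are illustrative assumptions, not the paper's exact configuration:

```python
b = 1024   # per-device batch size limit
d = 16     # number of devices (each holding n/d experts)
n = 256    # total number of experts
k = 4      # experts selected per example

per_expert_single_device = k * b / n       # 16 examples → tiny expert batches
per_expert_data_parallel = k * b * d / n   # 256 examples → scales linearly with d
print(per_expert_single_device, per_expert_data_parallel)
```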

Taking advantage of convolutionality

This paper's convolutional input architecture allows the input to the MoE to be gathered across multiple timesteps and grouped together to increase the expert batch size.
For recurrent models this is not possible (a paper is cited with a potentially applicable trick).

Network bandwidth

Most network communication = sending experts' input & output
We are bounded by the network if an expert does less computation per byte sent over the network (for its input and output) than the device's ratio of compute to network bandwidth
(i.e. if arithmetic intensity < the device's ops:byte ratio over the network)
As the experts are implemented as 1-hidden-layer ReLU-activated ff-NNs ➡️ the weight matrices are of size $\text{input\_size} \times \text{hidden\_size}$ and $\text{hidden\_size} \times \text{output\_size}$
This means that the arithmetic intensity is $\mathcal{O}(\text{hidden\_size})$
Therefore as long as $\text{hidden\_size}$ is ⬆️ enough (or we have more hidden layers), we aren't bounded by the network.
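A small sketch of this arithmetic-intensity argument for one expert (float32 and the specific sizes are assumptions for illustration):

```python
def expert_arithmetic_intensity(input_size, hidden_size, output_size,
                                bytes_per_value=4):
    """Rough flops per byte of network I/O for a 1-hidden-layer ff-NN expert.

    Compute: ~2 ops (multiply + add) per weight per example.
    Network traffic: the example's input and output vectors.
    """
    flops = 2 * (input_size * hidden_size + hidden_size * output_size)
    io_bytes = (input_size + output_size) * bytes_per_value
    return flops / io_bytes

# With input_size == output_size this equals 2 * hidden_size / bytes_per_value,
# i.e. it grows linearly with hidden_size.
print(expert_arithmetic_intensity(512, 1024, 512))   # 512.0
```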

Balancing Expert Utilisation

We have observed that the gating network tends to converge to a state where it always produces large weights for the same few experts. This imbalance is self-reinforcing, as the favored experts are trained more rapidly and thus are selected even more by the gating network
Eigen et al. (2013) use a hard constraint to deal with this, whereas Bengio et al. (2015) use a soft constraint on the batch-wise average of each gate.
Bengio et al. (2015) also add a per-example sparsity loss (not needed here due to the top-k constraint), and a loss to encourage diversity of gate values (this happens naturally here as the experts specialise).
 
This paper uses a soft constraint by adding the following to the loss:
$L_{importance}(X) = w_{importance} \cdot CV(\mathrm{Importance}(X))^2$, where $\mathrm{Importance}(X) = \sum_{x \in X} G(x)$
Here we take the vector of gating weights for each item in the batch, and sum across the batch to create a per-expert vector of importance weights → the sum indicates how "important" each expert is to the batch.
We then take the importance weight vector, calculate its coefficient of variation, and square it. Adding this to the loss penalises us for having very different importances across experts.
$w_{importance}$ is just a hyperparameter which weights this part of the loss term.
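A minimal sketch of this importance loss (the default value of w_importance here is an illustrative assumption):

```python
import numpy as np

def importance_loss(gates, w_importance=0.1):
    """L_importance(X) = w_importance * CV(Importance(X))^2.

    gates: (batch_size, n_experts) array of G(x) values for the batch X.
    Importance(X) = per-expert sum of gate weights over the batch.
    """
    importance = gates.sum(axis=0)                       # per-expert importance
    cv = importance.std() / (importance.mean() + 1e-10)  # coefficient of variation
    return w_importance * cv ** 2
```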
 
Second loss function outlined in appendix

Experiments

1 Billion Word Language Modeling Benchmark

Dataset: shuffled unique sentences from news articles, totaling approximately 829 million words, with a vocabulary of 793,471 words (Chelba et al., 2013).
Previous SOTA: (as of 2016/17) Jozefowicz et al. (2016), using stacked LSTM layers
Paper's Approach: two LSTM layers with a MoE layer in between
Investigating #experts:
  • Train a series of models with fixed computational cost
  • We vary the number of experts: 4, 32, 256 (regular); 256, 1024, 4096 (hierarchical)
  • Each expert has around 1 million params
  • k=4
Investigating computation budget:
  • 4 bn params
  • Larger LSTMs, and fewer but larger experts (Appendix for details)
Results:
(Left) Adding experts gives near-linear improvement for the flat approach, and improvements for hierarchical. Note also that the initial approach n = k = 4 is equal in perf to the baseline which has the same computational cost.

(Right) Continue to improve well as computational budget increases.
Even the low-budget MoE model beats SOTA; the high-budget one smashes it! (note this is computational budget)

We see that the key here appears to be the huge number of params.
Computational efficiency:
Measured in TFLOPS/GPU = number of floating-point ops / (time for a step * no. GPUs)
For all MoE models, floating-point ops for experts represent between 37% and 46% of the total ops.
13-32 Tesla K40 GPUs.
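The efficiency metric above as a short helper (a sketch of the stated definition; the names are mine):

```python
def tflops_per_gpu(flops_per_step, step_time_seconds, num_gpus):
    """Observed computational efficiency: total flops / (step time * #GPUs), in TFLOPS."""
    return flops_per_step / (step_time_seconds * num_gpus) / 1e12
```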

100 billion word Google News Corpus

Observation: the 1-billion-word corpus produces diminishing returns once the number of params in the MoE layer exceeds ~1 billion.
Hypothesis: with 100 billion words, the extra capacity should yield a significant quality improvement.
Dataset: shuffled unique sentences from Google’s internal news corpus
Model: computational cost ≈ 8M ops/timestep, MoE layers containing up to 131072 experts, corresponding to up to 137 billion parameters in the MoE layer
Results:
a) the larger dataset makes a large difference to the asymptote! b) we see good improvements all the way up to the 65536-experts mark! c) the 131072-experts model drops off → it's suggested this is to do with too much sparsity.
Note that even the 65536-expert model keeps a strong efficiency of 0.72 TFLOPS/GPU.

Machine Translation (Single Language Pair)

Model: modified version of the GNMT encoder-decoder model (Wu et al., 2016). 2048 experts, with ~8bn params total
Datasets: WMT’14 En→Fr and En→De corpora, with 36M and 5M sentence pairs respectively
Results:
Across all 3: lower perplexity and higher BLEU. Far more params used.

Multilingual machine translation

Train on very large combined dataset of twelve language pairs.
[Table: multilingual machine translation results]

Appendices

Load balancing loss

Problem: Although importance loss encourages similar weights across the batch, it doesn't explicitly encourage an equal number of elements per expert
Solution: define a smooth estimator $\mathrm{Load}(X)$ of the number of examples assigned to each expert, and penalise its variation:
$P(x, i) = \Phi\!\left(\frac{(x \cdot W_g)_i - \mathrm{kth\_excluding}(H(x), k, i)}{\mathrm{Softplus}((x \cdot W_{noise})_i)}\right)$, $\quad \mathrm{Load}(X)_i = \sum_{x \in X} P(x, i)$, $\quad L_{load}(X) = w_{load} \cdot CV(\mathrm{Load}(X))^2$
where $\Phi$ is the Normal CDF and $\mathrm{kth\_excluding}(H(x), k, i)$ is the $k$-th highest component of $H(x)$ excluding component $i$.
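A hedged NumPy/SciPy sketch of this load loss; the argument names are mine and the default w_load is an illustrative assumption:

```python
import numpy as np
from scipy.stats import norm

def load_balancing_loss(clean_logits, noise_std, noisy_logits, k, w_load=0.1):
    """L_load(X) = w_load * CV(Load(X))^2.

    Load(X)_i is a smooth estimate of how many examples in the batch place
    expert i inside their top-k, given the routing noise.

    clean_logits: (batch, n) = x @ W_g
    noise_std:    (batch, n) = Softplus(x @ W_noise)
    noisy_logits: (batch, n) = H(x), the noisy logits actually used for routing
    """
    _, n = noisy_logits.shape
    load = np.zeros(n)
    for i in range(n):
        others = np.delete(noisy_logits, i, axis=1)      # exclude component i
        kth_excluding = np.sort(others, axis=1)[:, -k]   # k-th highest of the rest
        p = norm.cdf((clean_logits[:, i] - kth_excluding) / noise_std[:, i])
        load[i] = p.sum()
    cv = load.std() / (load.mean() + 1e-10)
    return w_load * cv ** 2
```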
Initial load imbalance: the weight matrices $W_g$ and $W_{noise}$ are initialised to zero to give a balanced initialisation
Results: I'm not convinced - would really like to see (0,1) and (1,0) to confirm need for both.
[Table: results for different (w_importance, w_load) settings]

Experimental Details

Additional info:
Applies dropout to all layer outputs apart from the softmax
Residual connections
Activation checkpointing
8M Ops/ts Model:
Embedding: (512)
LSTM: (512) hidden size - 2M total per layer across devices
MoE:
Params: (512 x 1024), (1024 x 512) ≈ 1M params per expert (×k=4 → ~4M per timestep)
d(evices): 16
n(umber of experts): 4 → 256 (hier: 256 → 4096)
k: 4 (2,2)
total params: 8M → 270M (270M → 4B)
8M = 2M + 4M + 2M
143M Ops/ts Model:
Increase comes through larger hidden layer sizes throughout ⬇️
Embedding: (1024)
LSTM: (4096) hidden size
MoE:
Params: (1024 x 8192), (8192 x 1024) ≈ 16M params per expert (×k=4 → ~64M per timestep)
d(evices): 16
n(umber of experts): 256 hier
k: 2,2
total params: 4B
Largest model: hierarchical with ~131k experts, 137B params
Adam Adjustment:
  1. No first moment gradient estimates → just use current value
  1. Factored representation of each parameter matrix's second moment estimates (see the hedged sketch below)
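A rough, heavily hedged sketch of what such an update could look like. This is my reconstruction in the spirit of the two points above (and of the later Adafactor optimiser), not the paper's exact rule; all hyperparameters and names are assumptions:

```python
import numpy as np

def factored_second_moment_update(param, grad, row_v, col_v,
                                  lr=1e-3, beta2=0.999, eps=1e-8):
    """Hedged sketch: no first-moment estimate (use the raw gradient) and a
    factored second-moment estimate kept as running row/column means of grad**2.
    The full second-moment matrix is reconstructed as a rank-1 outer product."""
    g2 = grad ** 2
    row_v[:] = beta2 * row_v + (1 - beta2) * g2.mean(axis=1)   # per-row running mean
    col_v[:] = beta2 * col_v + (1 - beta2) * g2.mean(axis=0)   # per-column running mean
    v_hat = np.outer(row_v, col_v) / (row_v.mean() + eps)      # rank-1 reconstruction
    param -= lr * grad / (np.sqrt(v_hat) + eps)
    return param, row_v, col_v
```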
Interpretability (for translation task):
[Table: example expert specialisations on the translation task]
Attention:
  1. translation task uses attention between encoder and decoder
  1. original paper uses (weighted) NN-type approach
  1. this paper uses (weighted) tanh inner product

My Notes

General thoughts

This seems like a promising method. We're essentially adding parameters+data, but reducing the quadratic cost increase by only training each parameter on a subset of the datapoints it can help most with.
This idea seems to work well. The key innovations here are that a) we have sparse gating, which means that each expert only focuses on its "assigned" datapoints, and b) this all happens within a layer of the network.
The upshot here is that we still get the performance gains we'd expect by adding more parameters in a standard way, but without the associated blowup in training costs.
The leveraging of sparsity is also interesting from a wider ML perspective - this is key to some of the most successful approaches. As one reviewer on openreview said:
we know historically that sparsity of parameters is among the most important modelling principles in machine learning, being used with great success in e.g. Lasso with the l1 penalty, in SVM with the hinge loss and in ConvNets by setting connections outside the receptive field to zero.

vs Data and Model parallelism

Data parallel is "wasteful" because we replicate parameters; model parallel because we replicate data.
This MoE approach has no parameter replication, and only data replication by a factor of $k$. The challenge is that now not every parameter is trained on every datapoint - but it seems to work!

LSTM load vs MoE

One issue that's not mentioned explicitly is the amount of data each expert is trained on relative to the LSTM layer.
Consider that for our LSTM layer we have a total minibatch size of $bd$, spread across our $d$ devices. At the MoE layer we then replicate the data $k$ times and spread it across $n$ experts, giving a per-expert minibatch size of $\frac{kbd}{n}$. Thus the fraction of the total minibatch each expert gets is $\frac{k}{n}$.
Additionally, each device gets $b$ elements for the LSTM layer and $kb$ for the MoE layer (its $\frac{n}{d}$ experts each receive $\frac{kbd}{n}$ elements), so each device's MoE layer has to cope with $k$ times more data.
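A quick numeric check of these ratios (the numbers are illustrative assumptions):

```python
b, d, n, k = 1024, 16, 256, 4

total_batch = b * d                              # elements per step at the LSTM layer
per_expert = k * b * d / n                       # elements per expert at the MoE layer
fraction_per_expert = per_expert / total_batch   # = k / n
per_device_moe = per_expert * (n / d)            # each device hosts n/d experts
print(fraction_per_expert, per_device_moe / b)   # 0.015625 (= k/n), 4.0 (= k)
```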

Suitability for the IPU

  1. Spread out to experts and back to out-LSTM are scatter and gather operations respectively ➡️ expensive for GPU, not for IPU
  1. As IPU-tiles offer very fine-grained parallelism, we could have more experts of a smaller size ➡️ this may be beneficial
  1. As IPU better supports MIMD we have the possibility of different architectures for different experts?

Outstanding questions

  1. How "smart" does the gating have to be? I would like to see an ablation where the rest of the model is fixed and we incrementally increase the size of the hidden layer. The translation "specialisation" results clearly show that the gating is learning something meaningful, but it's not clear how important this decision-making is.
  1. Ditto on the above for the LSTM layers.
  1. How does the value of $k$ affect performance? How good/bad is a very small $k$? What about a much larger one?
  1. How much does the Adam simplification hurt performance? Is there precedent for this method?
  1. I would like to see some experiments on the network bandwidth problem - how large does the hidden layer size have to be to offset the overhead?
  1. I'm not convinced that the experiments justify the use of both losses?

TL;DR


Related Papers

Switch Transformers

  • Replaces the FF layer in a transformer with a MoE layer
  • Shows k=1 just as good, and has additional benefits
  • Uses Mesh-Tensorflow for tensor/model parallelism
  • Replaces double loss with a single, simple loss
  • Trained on mixed-precision bfloat16
  • Combine expert, model and data parallelism (note: "expert" framed as new form of parallelism)

V-MoE

  • Same transformer modification as switch transformer, but using K>1 MoE approach
  • Train largest vision models to-date (15B params), matching SOTA dense models with less compute

Deep MoE via Shallow Embedding

Intro
  • Shazeer et al. require many times larger models to recover SOTA
  • Question: "can we stack and train many layers of mixture of experts models to improve accuracy and reduce prediction cost without radically increasing the network width?"
  • Hundreds of MoE layers
[Figure: Deep MoE architecture]
Method
  • Treats each channel in a convolutional layer as an expert → 1 for each input channel
  • Number of experts selected varies at each layer (no top-k)
  • Two additional loss terms: a) for sparsity, b) embedding classification loss = embedding & label cross-entropy