Contents:IntroductionPrior IssuesContributionApproachRelated WorkThe MoE LayerGating NetworkAddressing Performance ChallengesBatch SizeMixing data and model parallelismTaking advantage of convolutionalityNetwork bandwidthBalancing Expert UtilisationExperiments1 Billion Word Language Modeling Benchmark100 billion word Google News CorpusMachine Translation (Single Language Pair)Multilingual machine translationAppendicesLoad balancing lossExperimental DetailsMy NotesGeneral thoughtsvs Data and Model parallelismLSTM load vs MoESuitability for the IPUOutstanding questionsTL;DRRelated PapersSwitch TransformersV-MoEDeep MoE via Shallow Embedding
What is the problem/context that MoE methods are trying to solve?
- ⬆️ parameters + ⬆️ data = ⬆️ accuracy
- Increasing both leads to quadratic increase in training costs
How does Mixture of Experts (MoE) work at a high-level?
A "gating" network outputs sparse weights to select which expert networks to activate, based on the input
Why does MoE address the quadratic training cost problem?
Not all parameters are trained on all datapoints
No prior work has demonstrated that gating can give ⬆️ capacity, efficiency or performance. This is due to:
- GPUs slow at branching. For this reason most approaches turn on/off large chunks of the network when gating.
- Conditional computation reduces the batch size for the conditionally active chunks, and can get too small to be efficient.
- Network bandwidth can be a bottleneck
- Special loss terms may be necessary to induce sparsity, but this can affect performance
- Existing literature doesn't deal with very large datasets (e.g. labelled image recognition datasets are relatively small and the labels offer minimal signal)
- Address all 5 prior issues
- > 1000x improvement in model capacity, with only minor losses in efficiency
- Significant ⬆️ SOTA on public language modelling and translation datasets
New type of NN layer: Sparsely-Gated Mixture-of-Experts Layer (MoE).
- Expert = simple ff-NN
- Compined by trainable gating network ➡️ selects sparse combination of experts to process each input
- Gating Network applied convolutionally over previous inputs
- Trained jointly by backprop
- Used here between stacked LSTM layers
- Called at each position in the input, and may select different experts at each
- Experts tend to become highly specialised
Different types of experts have been used: SVM, Gaussian Processes, Dirichlet Processes, deep networks
Different configuerations also: hierarchical, sequential experts, infinite number of experts
These approaches have not leveraged sparsity to improve computational efficiency though
Typically experts have been mixed at the top-level. Eigen et al. (2013) use MoEs internally within a deep model ➡️ this paper builds on this, adding sparsity and investigating increased model capacity.
Input ; Output
Gating network (sparse output)
Set of expert networks , each with their own parameters
- Whenever we can avoid computing .
- Often have , but evaluate just a handful.
- We may use a two-level hierarchical MoE ➡️a primary gating network chooses a combination of "experts", each of which is itself a secondary MoE layer.
where is a learned weight matrix.
Noisy Top-K gating:
Softmax gating + spartisy + noise.
Where is a vector of noise, where each element is sampled from a normal distribution.
The term enforces sparsity, as it sets the outputs to zero.
The noise term helps with load balancing (see Appendix).
Training the gating network
- Standard backprop
- Gradients also pass back through the gating network to previous layers
- Bengio et al. (2015) use boolean gates (non-differentiable) and a REINFORCE-style approach
Problem: Say we can use max batch size b → this means each expert gets examples. As the number of experts increases, this makes training increasingly inefficient as we're effectively training on tiny batch sizes.
Solution: Use a very large original batch size.
Further problem: this increases the memory needed to store activations for the backward pass, potentially hitting memory limits.
- Mixing data and model parallelism
- Taking advantage of convolutionality
Conventional distributed training setting = asynchronous processing w/ central parameter server
- (sub?-)batches processed synchronously so that we can combine for the MoE layer
- Standard layer of model (LSTM) + gating network distributed via standard data-parallel approach
- One copy of each expert. These are then grouped across devices.
- Each expert (/device) receives the elements of the overall batch that the gating network assigns to it
- Assuming we still have the same per-device batch size limit and devices, each expert now receives examples ➡️ effective batch size scales linearly with the number of devices
- Many features remain constant with this model as we scale: device batch size, device memory and bandwidth, number of model steps.
For hierarchical MoE, the first gating network is data-parallel, and each secondary MoE resides on one device.
This paper's convolutional input architecture allows the input to the MoE to be gathered across multiple timesteps and grouped together to increase the expert batch size.
For recurrent models this is not possible (a paper is cited with a potentially applicable trick).
Most network communication = sending experts' input & output
We are bounded by the network if an expert does less computation per byte required across the network (i.e. for I/O), than the max ops/byte of the device over the network
(i.e. if arithmetic intensity < ops:byte, using the network bandwidth)
As implementation of experts = 1 hidden-layer ReLU-activated ff-NN ➡️ weight matrices are of size
This means that the arithmetic intensity is
Therefore as long as is ⬆️ enough (or we have more hidden layers), we aren't bounded by the network.
We have observed that the gating network tends to converge to a state where it always produces large weights for the same few experts. This imbalance is self-reinforcing, as the favored experts are trained more rapidly and thus are selected even more by the gating network
Eigen et al. (2013) use a hard constraint to deal with this, whereas Bengio et al. (2015) use a soft constraint on the batch-wise average of each gate.
Bengio et al. (2015) also add a per-example sparsity loss (not needed here due to the top-k constraint), and a loss to encourage diversity of gate values (this happens naturally here as the experts specialise).
This paper uses a soft constraint by adding the following to the loss:
Here we take the vector of gating weights for each item in the batch, and sum across them to create a per-expert vector of importance weights → the sum indicates how "important" each expert is to the batch.
We then take the importance weight vector and calculate it's coefficient of variation, then square. By adding this to the loss it penalises us for having different importances for each expert.
is just a hyperparameter which weights this part of the loss-term.
Second loss function outlined in appendix
Dataset: shuffled unique sentences from news articles, totaling approximately 829 million words, with a vocabulary of 793,471 words (Chelba et al., 2013).
Previous SOTA: (as. of 2016/7) Jozefowics et al. (2016) uses stacked LSTM layers
Paper's Approach: two LSTM layers with a MoE layer in between
- Train a series of models with fixed computational cost
- We vary the number of experts: 4, 32, 256 (regular); 256, 1024, 4096 (hierarchical)
- Each expert has around 1 million params
Investigating computation budget:
- 4 bn params
- Larger LSTMs, and fewer but larger experts (Appendix for details)
Measured in TFLOPS/GPU = number of floating-point ops / (time for a step * no. GPUs)
For all MoE models, floating-point ops for experts represent between 37% and 46% of the total ops.
13-32 Tesla K40 GPUs.
Observation: for 1 billion-word corpus produces diminishing returns as number of params in MoE layer exceeds 1 billion.
Hypothesis: for 100 billion words significant quality improvement
Dataset: shuffled unique sentences from Google’s internal news corpus
Model: computational cost = 8m ops/timesteps, MoE layers containing experts, corresponds to 137 billion parameters in the MoE layer
Note that even the experts model keeps a strong efficiency of 0.72 TFLOPS/GPU.
Model: modified version of the GNMT encoder-decoder model (Wu et al., 2016). 2048 experts, with ~8bn params total
Datasets: WMT’14 En→Fr and En→De corpora, with 36M and 5m sentence pairs respectively
Train on very large combined dataset of twelve language pairs.
Problem: Although importance loss encourages similar weights across the batch, it doesn't explicitly encourage an equal number of elements per expert
where is the Normal CDF.
Initial load imbalance: weight matrices initialised to zero to give ballanced initialisation
Results: I'm not convinced - would really like to see (0,1) and (1,0) to confirm need for both.
Uses dropout to all layer outputs apart from softmax
8M Ops/ts Model:
LSTM: (512) hidden size - 2M total per layer across devices
Params: (512 x 2014), (1024 x 512) = 1M params (*k = 4M per )
n(umber of experts): 4 → 256 (hier: 256 → 4096)
k: 4 (2,2)
total params: 8M → 270M (270M → 4B)
8M = 2M + 4M + 2M
143M Ops/ts Model:
Increase comes through larger hidden layer sizes throughout ⬇️
LSTM: (4096) hidden size
Params: (1024 x 8192), (8192x 1024) = 16M params (*k = 64M per )
n(umber of experts): 256 hier
total params: 4B
Largest model : hierarchical with 130k experts, 137B params
- No first moment gradient estimates → just use current value
- Factored representation of each parameter matrix's second moment estimates
Interpretability (for translation task):
- translation task uses attention between encoder and decoder
- original paper uses (weighted) NN-type approach
- this paper uses (weighted) tanh inner product
This seems like a promising method. We're essentially adding parameters+data, but reducing the quadratic cost increase by only training each parameter on a subset of the datapoints it can help most with.
This idea seems to work well. The key innovations here are that a) we have sparse gating, which means that each expert only focuses on its "assigned" datapoints, and b) this all happens within a layer of the network.
The upshot here is that we still get the performance gains we'd expect by adding more parameters in a standard way, but without the associated blowup in training costs.
The leveraging of sparsity is also interesting from a wider ML perspective - this is key to some of the most successful approaches. As one reviewer on openreview said:
we know historically that sparsity of parameters is among the most important modelling principles in machine learning, being used with great success in e.g. Lasso with the l1 penalty, in SVM with the hinge loss and in ConvNets by setting connections outside the receptive field to zero.
Data parallel is "wasteful" because we replicate parameters; model parallel because we replicate data.
This MoE approach has no parameter replication, and only data replication of a factor of . The challenge is that now not every parameter is trained on every datapoint - but it seems to work!
One issue that's not mentioned explicitly is the amount of data each expert is trained on relative to the LSTM layer.
Consider that for our LSTM layer we have a total minibatch size of , spread across our devices. At the MoE layer we then replicate the data times, and spread it across experts, giving a per-expert minibatch size of . Thus the fraction of the total minibatch each expert gets is .
Additionally, each device gets elements for the LSTM layer, and for the MoE layer, so each device's MoE layer has to cope with times more data.
- Spread out to experts and back to out-LSTM are scatter and gather operations respectively ➡️ expensive for GPU, not for IPU
- As IPU-tiles offer very fine-grained parallelism, we could have more experts of a smaller size ➡️ this may be beneficial
- As IPU better supports MIMD we have the possibility of different architectures for different experts?
- How "smart" does the gating have to be? I would like to see an ablation where the rest of the model is fixed and we incrementally increase the size of the hidden layer. The translation "specialisation" results clearly show that the gating is learning something meaningful, but it's not clear how important this decision-making is.
- Ditto on the above for the LSTM layers.
- How does the value of affect performance? How good/bad is ? What about
- How much does the Adam simplification hurt performance? Is there precedent for this method?
- I would like to see some experiments on the network bandwidth problem - how large does the hidden layer size have to be to offset the overhead?
- I'm not convinced that the experiments justify the use of both losses?
- Replaces the FF layer in a transformer with a MoE layer
- Shows k=1 just as good, and has additional benefits
- Uses Mesh-Tensorflow for tensor/model parallelism
- Replaces double loss with a single, simple loss
- Trained on mixed-precision bfloat16
- Combine expert, model and data parallelism (note: "expert" framed as new form of parallelism)
- Same transformer modification as switch transformer, but using K>1 MoE approach
- Train largest vision models to-date (15B params), matching SOTA dense models with less compute
- Shazeer et al. require many times larger models to recover SOTA
- Question: "can we stack and train many layers of mixture of experts models to improve accuracy and reduce prediction cost without radically increasing the network width?"
- Hundreds of MoE layers
- Treats each channel in a convolutional layer as an expert → 1 for each input channel
- Number of experts selected varies at each layer (no top-k)
- Two additional loss terms: a) for sparsity, b) embedding classification loss = embedding & label cross-entropy