Sparsely-Gated Mixture-of-Experts

Contents:Introduction Prior Issues Contribution Approach Related Work The MoE Layer Gating Network Addressing Performance Challenges Batch Size Mixing data and model parallelism Taking advantage of convolutionality Network bandwidth Balancing Expert Utilisation Experiments 1 Billion Word Language Modeling Benchmark 100 billion word Google News Corpus Machine Translation (Single Language Pair)Multilingual machine translation Appendices Load balancing loss Experimental Details My Notes General thoughts vs Data and Model parallelism LSTM load vs MoE Suitability for the IPU Outstanding questions TL;DR Related Papers Switch Transformers V-MoE Deep MoE via Shallow Embedding

Introduction

What is the problem/context that MoE methods are trying to solve?

⬆️ parameters + ⬆️ data = ⬆️ accuracy

Increasing both leads to quadratic increase in training costs

How does Mixture of Experts (MoE) work at a high-level?

A "gating" network outputs sparse weights to select which expert networks to activate, based on the input

Why does MoE address the quadratic training cost problem?

Not all parameters are trained on all datapoints

Prior Issues

No prior work has demonstrated that gating can give ⬆️ capacity, efficiency or performance. This is due to:

GPUs slow at branching. For this reason most approaches turn on/off large chunks of the network when gating.

Conditional computation reduces the batch size for the conditionally active chunks, and can get too small to be efficient.

Network bandwidth can be a bottleneck

Special loss terms may be necessary to induce sparsity, but this can affect performance

Existing literature doesn't deal with very large datasets (e.g. labelled image recognition datasets are relatively small and the labels offer minimal signal)

Contribution

Address all 5 prior issues

> 1000x improvement in model capacity, with only minor losses in efficiency

Significant ⬆️ SOTA on public language modelling and translation datasets

Approach

New type of NN layer: Sparsely-Gated Mixture-of-Experts Layer (MoE).

Expert = simple ff-NN

Compined by trainable gating network ➡️ selects sparse combination of experts to process each input

Gating Network applied convolutionally over previous inputs

Trained jointly by backprop

Used here between stacked LSTM layers

Called at each position in the input, and may select different experts at each

Experts tend to become highly specialised

Related Work

Different types of experts have been used: SVM, Gaussian Processes, Dirichlet Processes, deep networks

Different configuerations also: hierarchical, sequential experts, infinite number of experts

These approaches have not leveraged sparsity to improve computational efficiency though

Typically experts have been mixed at the top-level. Eigen et al. (2013) use MoEs internally within a deep model ➡️ this paper builds on this, adding sparsity and investigating increased model capacity.

The MoE Layer

Input ; Output

Gating network (sparse output)

Set of expert networks , each with their own parameters

Whenever we can avoid computing .

Often have , but evaluate just a handful.

We may use a two-level hierarchical MoE ➡️a primary gating network chooses a combination of "experts", each of which is itself a secondary MoE layer.

Gating Network

Softmax gating:

where is a learned weight matrix.

Noisy Top-K gating:

Softmax gating + spartisy + noise.

Where is a vector of noise, where each element is sampled from a normal distribution.

The term enforces sparsity, as it sets the outputs to zero.

The noise term helps with load balancing (see Appendix).

Training the gating network

Standard backprop

Gradients also pass back through the gating network to previous layers

Bengio et al. (2015) use boolean gates (non-differentiable) and a REINFORCE-style approach

Addressing Performance Challenges

Batch Size

Problem: Say we can use max batch size b → this means each expert gets examples. As the number of experts increases, this makes training increasingly inefficient as we're effectively training on tiny batch sizes.

Solution: Use a very large original batch size.

Further problem: this increases the memory needed to store activations for the backward pass, potentially hitting memory limits.

Further solutions:

Mixing data and model parallelism

Taking advantage of convolutionality

Mixing data and model parallelism

Conventional distributed training setting = asynchronous processing w/ central parameter server

Approach here:

(sub?-)batches processed synchronously so that we can combine for the MoE layer

Standard layer of model (LSTM) + gating network distributed via standard data-parallel approach

One copy of each expert. These are then grouped across devices.

Each expert (/device) receives the elements of the overall batch that the gating network assigns to it

Assuming we still have the same per-device batch size limit and devices, each expert now receives examples ➡️ effective batch size scales linearly with the number of devices

Many features remain constant with this model as we scale: device batch size, device memory and bandwidth, number of model steps.

For hierarchical MoE, the first gating network is data-parallel, and each secondary MoE resides on one device.

Taking advantage of convolutionality

This paper's convolutional input architecture allows the input to the MoE to be gathered across multiple timesteps and grouped together to increase the expert batch size.

For recurrent models this is not possible (a paper is cited with a potentially applicable trick).

Network bandwidth

Most network communication = sending experts' input & output

We are bounded by the network if an expert does less computation per byte required across the network (i.e. for I/O), than the max ops/byte of the device over the network

(i.e. if arithmetic intensity < ops:byte, using the network bandwidth)

As implementation of experts = 1 hidden-layer ReLU-activated ff-NN ➡️ weight matrices are of size

This means that the arithmetic intensity is

Therefore as long as is ⬆️ enough (or we have more hidden layers), we aren't bounded by the network.

Balancing Expert Utilisation

We have observed that the gating network tends to converge to a state where it always produces large weights for the same few experts. This imbalance is self-reinforcing, as the favored experts are trained more rapidly and thus are selected even more by the gating network

Eigen et al. (2013) use a hard constraint to deal with this, whereas Bengio et al. (2015) use a soft constraint on the batch-wise average of each gate.

Bengio et al. (2015) also add a per-example sparsity loss (not needed here due to the top-k constraint), and a loss to encourage diversity of gate values (this happens naturally here as the experts specialise).

This paper uses a soft constraint by adding the following to the loss:

Here we take the vector of gating weights for each item in the batch, and sum across them to create a per-expert vector of importance weights → the sum indicates how "important" each expert is to the batch.

We then take the importance weight vector and calculate it's coefficient of variation, then square. By adding this to the loss it penalises us for having different importances for each expert.

is just a hyperparameter which weights this part of the loss-term.

Second loss function outlined in appendix

Experiments

1 Billion Word Language Modeling Benchmark

Dataset: shuffled unique sentences from news articles, totaling approximately 829 million words, with a vocabulary of 793,471 words (Chelba et al., 2013).

Previous SOTA: (as. of 2016/7) Jozefowics et al. (2016) uses stacked LSTM layers

Paper's Approach: two LSTM layers with a MoE layer in between

Investigating #experts:

Train a series of models with fixed computational cost

We vary the number of experts: 4, 32, 256 (regular); 256, 1024, 4096 (hierarchical)

Each expert has around 1 million params

Investigating computation budget:

4 bn params

Larger LSTMs, and fewer but larger experts (Appendix for details)

Results:

(Left) Adding experts gives near-linear improvement for flat approach, and improvements for hierarchical. Note also that the initial approach n = k = 4 is equal in perf to the baseline which has same computational cost.

(Right) Continue to improve well as computational budget increases. — (Left) Adding experts gives near-linear improvement for flat approach, and improvements for hierarchical. Note also that the initial approach n = k = 4 is equal in perf to the baseline which has same computational cost. (Right) Continue to improve well as computational budget increases.

Even the low-budget MoE model beats SOTA; the high-budget one smashes it! (note this is computational budget)

We see that the key here appears to be the huge number of params. — Even the low-budget MoE model beats SOTA; the high-budget one smashes it! (note this is *computational budget)* We see that the key here appears to be the huge number of params.

Computational efficiency:

Measured in TFLOPS/GPU = number of floating-point ops / (time for a step * no. GPUs)

For all MoE models, floating-point ops for experts represent between 37% and 46% of the total ops.

13-32 Tesla K40 GPUs.

100 billion word Google News Corpus

Observation: for 1 billion-word corpus produces diminishing returns as number of params in MoE layer exceeds 1 billion.

Hypothesis: for 100 billion words significant quality improvement

Dataset: shuffled unique sentences from Google’s internal news corpus

Model: computational cost = 8m ops/timesteps, MoE layers containing experts, corresponds to 137 billion parameters in the MoE layer

Results:

a) the larger dataset makes a large difference to the asymptote! b) we see good improvements all the way up to the experts mark! c) the experts model drops off → it's suggested this is to do with too much sparsity.

Note that even the experts model keeps a strong efficiency of 0.72 TFLOPS/GPU.

Machine Translation (Single Language Pair)

Model: modified version of the GNMT encoder-decoder model (Wu et al., 2016). 2048 experts, with ~8bn params total

Datasets: WMT’14 En→Fr and En→De corpora, with 36M and 5m sentence pairs respectively

Results:

Across all 3: lower perplexity and higher BLEU. Far more params used.

Multilingual machine translation

Train on very large combined dataset of twelve language pairs.

Appendices

Load balancing loss

Problem: Although importance loss encourages similar weights across the batch, it doesn't explicitly encourage an equal number of elements per expert

Solution:

where is the Normal CDF.

Initial load imbalance: weight matrices initialised to zero to give ballanced initialisation

Results: I'm not convinced - would really like to see (0,1) and (1,0) to confirm need for both.

Experimental Details

Additional info:

Uses dropout to all layer outputs apart from softmax

Residual connections

Activation checkpointing

8M Ops/ts Model:

Embedding: (512)

LSTM: (512) hidden size - 2M total per layer across devices

MoE:

Params: (512 x 2014), (1024 x 512) = 1M params (*k = 4M per )

d(evices): 16

n(umber of experts): 4 → 256 (hier: 256 → 4096)

k: 4 (2,2)

total params: 8M → 270M (270M → 4B)

8M = 2M + 4M + 2M

143M Ops/ts Model:

Increase comes through larger hidden layer sizes throughout ⬇️

Embedding: (1024)

LSTM: (4096) hidden size

MoE:

Params: (1024 x 8192), (8192x 1024) = 16M params (*k = 64M per )

d(evices): 16

n(umber of experts): 256 hier

k: 2,2

total params: 4B

Largest model : hierarchical with 130k experts, 137B params

Adam Adjustment:

No first moment gradient estimates → just use current value

Factored representation of each parameter matrix's second moment estimates

Interpretability (for translation task):

Attention:

translation task uses attention between encoder and decoder

original paper uses (weighted) NN-type approach

this paper uses (weighted) tanh inner product

My Notes

General thoughts

This seems like a promising method. We're essentially adding parameters+data, but reducing the quadratic cost increase by only training each parameter on a subset of the datapoints it can help most with.

This idea seems to work well. The key innovations here are that a) we have sparse gating, which means that each expert only focuses on its "assigned" datapoints, and b) this all happens within a layer of the network.

The upshot here is that we still get the performance gains we'd expect by adding more parameters in a standard way, but without the associated blowup in training costs.

The leveraging of sparsity is also interesting from a wider ML perspective - this is key to some of the most successful approaches. As one reviewer on openreview said:

we know historically that sparsity of parameters is among the most important modelling principles in machine learning, being used with great success in e.g. Lasso with the l1 penalty, in SVM with the hinge loss and in ConvNets by setting connections outside the receptive field to zero.

vs Data and Model parallelism

Data parallel is "wasteful" because we replicate parameters; model parallel because we replicate data.

This MoE approach has no parameter replication, and only data replication of a factor of . The challenge is that now not every parameter is trained on every datapoint - but it seems to work!

LSTM load vs MoE

One issue that's not mentioned explicitly is the amount of data each expert is trained on relative to the LSTM layer.

Consider that for our LSTM layer we have a total minibatch size of , spread across our devices. At the MoE layer we then replicate the data times, and spread it across experts, giving a per-expert minibatch size of . Thus the fraction of the total minibatch each expert gets is .

Additionally, each device gets elements for the LSTM layer, and for the MoE layer, so each device's MoE layer has to cope with times more data.

Suitability for the IPU

Spread out to experts and back to out-LSTM are scatter and gather operations respectively ➡️ expensive for GPU, not for IPU

As IPU-tiles offer very fine-grained parallelism, we could have more experts of a smaller size ➡️ this may be beneficial

As IPU better supports MIMD we have the possibility of different architectures for different experts?

Outstanding questions

How "smart" does the gating have to be? I would like to see an ablation where the rest of the model is fixed and we incrementally increase the size of the hidden layer. The translation "specialisation" results clearly show that the gating is learning something meaningful, but it's not clear how important this decision-making is.

Ditto on the above for the LSTM layers.

How does the value of affect performance? How good/bad is ? What about

How much does the Adam simplification hurt performance? Is there precedent for this method?

I would like to see some experiments on the network bandwidth problem - how large does the hidden layer size have to be to offset the overhead?

I'm not convinced that the experiments justify the use of both losses?