### Contents:

Contents:IntroductionPrior IssuesContributionApproachRelated WorkThe MoE LayerGating NetworkAddressing Performance ChallengesBatch SizeMixing data and model parallelismTaking advantage of convolutionalityNetwork bandwidthBalancing Expert UtilisationExperiments1 Billion Word Language Modeling Benchmark100 billion word Google News CorpusMachine Translation (Single Language Pair)Multilingual machine translationAppendicesLoad balancing lossExperimental DetailsMy NotesGeneral thoughtsvs Data and Model parallelismLSTM load vs MoESuitability for the IPUOutstanding questionsTL;DRRelated PapersSwitch TransformersV-MoEDeep MoE via Shallow Embedding

### Introduction

**What **is the problem/context that MoE methods are trying to solve?

- ⬆️ parameters + ⬆️ data = ⬆️ accuracy

- Increasing both
**quadratic**increase in training costs

**How** does Mixture of Experts (MoE) work at a high-level?

A "gating" network outputs sparse weights to select which expert networks to activate, based on the input

**Why **does MoE address the quadratic training cost problem?

Not all parameters are trained on all datapoints

**Prior Issues**

No prior work has demonstrated that gating can give ⬆️ capacity, efficiency or performance. This is due to:

- GPUs slow at branching. For this reason most approaches turn on/off large chunks of the network when gating.

- Conditional computation reduces the batch size for the conditionally active chunks, and can get too small to be efficient.

- Network bandwidth can be a bottleneck

- Special loss terms may be necessary to induce sparsity, but this can affect performance

- Existing literature doesn't deal with very large datasets (e.g. labelled image recognition datasets are relatively small and the labels offer minimal signal)

#### Contribution

- Address all 5 prior issues

- > 1000x improvement in model capacity, with only minor losses in efficiency

- Significant ⬆️ SOTA on public language modelling and translation datasets

#### Approach

New type of NN layer:

**Sparsely-Gated Mixture-of-Experts Layer (MoE)**.- Expert = simple ff-NN

- Compined by trainable gating network ➡️ selects sparse combination of experts to process each input

- Gating Network applied convolutionally over previous inputs

- Trained jointly by backprop

- Used here between stacked LSTM layers

- Called at each position in the input, and may select different experts at each

- Experts tend to become highly specialised

#### Related Work

Different types of experts have been used: SVM, Gaussian Processes, Dirichlet Processes, deep networks

Different configuerations also: hierarchical, sequential experts, infinite number of experts

These approaches have

*not*leveraged sparsity to improve computational efficiency thoughTypically experts have been mixed at the top-level. Eigen et al. (2013) use MoEs internally within a deep model ➡️ this paper builds on this, adding sparsity and investigating increased model capacity.

### The MoE Layer

Input ; Output

Gating network (sparse output)

Set of expert networks , each with their own parameters

- Whenever we can avoid computing .

- Often have , but evaluate just a handful.

- We may use a two-level hierarchical MoE ➡️a primary gating network chooses a combination of "experts", each of which is itself a secondary MoE layer.

#### Gating Network

**Softmax gating:**

where is a learned weight matrix.

**Noisy Top-K gating:**

Softmax gating + spartisy + noise.

Where is a vector of noise, where each element is sampled from a normal distribution.

The term enforces sparsity, as it sets the outputs to zero.

The noise term helps with load balancing (see Appendix).

**Training the gating network**

- Standard backprop

- Gradients also pass back through the gating network to previous layers

- Bengio et al. (2015) use boolean gates (non-differentiable) and a REINFORCE-style approach

### Addressing Performance Challenges

#### Batch Size

**Problem:**Say we can use max batch size b → this means each expert gets examples. As the number of experts increases, this makes training increasingly inefficient as we're effectively training on tiny batch sizes.

**Solution:**Use a very large original batch size.

**Further problem:**this increases the memory needed to store activations for the backward pass, potentially hitting memory limits.

**Further solutions:**

- Mixing data and model parallelism

- Taking advantage of convolutionality

#### Mixing data and model parallelism

Conventional distributed training setting = asynchronous processing w/ central parameter server

Approach here:

- (sub?-)batches processed synchronously so that we can combine for the MoE layer

- Standard layer of model (LSTM) + gating network distributed via standard data-parallel approach

- One copy of each expert. These are then grouped across devices.

- Each expert (/device) receives the elements of the overall batch that the gating network assigns to it

- Assuming we still have the same per-device batch size limit and devices, each expert now receives examples ➡️ effective batch size scales linearly with the number of devices

- Many features remain constant with this model as we scale: device batch size, device memory and bandwidth, number of model steps.

For hierarchical MoE, the first gating network is data-parallel, and each secondary MoE resides on one device.

#### Taking advantage of convolutionality

This paper's convolutional input architecture allows the input to the MoE to be gathered across multiple timesteps and grouped together to increase the expert batch size.

For recurrent models this is not possible (a paper is cited with a potentially applicable trick).

#### Network bandwidth

Most network communication = sending experts' input & output

We are bounded by the network if an expert does less computation per byte required across the network (i.e. for I/O), than the max ops/byte of the device over the network

(i.e. if arithmetic intensity < ops:byte, using the

*network bandwidth)*As implementation of experts = 1 hidden-layer ReLU-activated ff-NN ➡️ weight matrices are of size

This means that the arithmetic intensity is

Therefore as long as is ⬆️ enough (or we have more hidden layers), we aren't bounded by the network.

### Balancing Expert Utilisation

We have observed that the gating network tends to converge to a state where it always produces large weights for the same few experts. This imbalance is self-reinforcing, as the favored experts are trained more rapidly and thus are selected even more by the gating network

Eigen et al. (2013) use a hard constraint to deal with this, whereas Bengio et al. (2015) use a soft constraint on the batch-wise average of each gate.

Bengio et al. (2015) also add a per-example sparsity loss (not needed here due to the top-k constraint), and a loss to encourage diversity of gate values (this happens naturally here as the experts specialise).

This paper uses a soft constraint by adding the following to the loss:

Here we take the vector of gating weights for each item in the batch, and sum across them to create a per-expert vector of

*importance*weights → the sum indicates how "important" each expert is to the batch.We then take the importance weight vector and calculate it's coefficient of variation, then square. By adding this to the loss it penalises us for having different importances for each expert.

is just a hyperparameter which weights this part of the loss-term.

Second loss function outlined in appendix

### Experiments

#### 1 Billion Word Language Modeling Benchmark

**Dataset:**shuffled unique sentences from news articles, totaling approximately 829 million words, with a vocabulary of 793,471 words (Chelba et al., 2013).

**Previous SOTA:**(as. of 2016/7) Jozefowics et al. (2016) uses stacked LSTM layers

**Paper's Approach:**two LSTM layers with a MoE layer in between

**Investigating #experts:**

- Train a series of models with fixed computational cost

- We vary the number of experts: 4, 32, 256 (regular); 256, 1024, 4096 (hierarchical)

- Each expert has around 1 million params

- k=4

**Investigating computation budget:**

- 4 bn params

- Larger LSTMs, and fewer but larger experts (Appendix for details)

**Results:**

**Computational efficiency:**

Measured in TFLOPS/GPU = number of floating-point ops / (time for a step * no. GPUs)

For all MoE models, floating-point ops for experts represent between 37% and 46% of the total ops.

13-32 Tesla K40 GPUs.

**100 billion word Google News Corpus**

**Observation:**for 1 billion-word corpus produces diminishing returns as number of params in MoE layer exceeds 1 billion.

**Hypothesis:**for 100 billion words significant quality improvement

**Dataset:**shuffled unique sentences from Google’s internal news corpus

**Model:**computational cost = 8m ops/timesteps, MoE layers containing experts, corresponds to 137 billion parameters in the MoE layer

**Results:**

Note that even the experts model keeps a strong efficiency of 0.72 TFLOPS/GPU.

#### Machine Translation (Single Language Pair)

**Model:**modified version of the GNMT encoder-decoder model (Wu et al., 2016). 2048 experts, with ~8bn params total

**Datasets:**WMT’14 En→Fr and En→De corpora, with 36M and 5m sentence pairs respectively

**Results:**

#### Multilingual machine translation

Train on very large combined dataset of twelve language pairs.

### Appendices

#### Load balancing loss

**Problem:**Although importance loss encourages similar weights across the batch, it doesn't explicitly encourage an equal number of elements per expert

**Solution:**

where is the Normal CDF.

**Initial load imbalance:**weight matrices initialised to zero to give ballanced initialisation

**Results:**I'm not convinced - would really like to see (0,1) and (1,0) to confirm need for both.

#### Experimental Details

**Additional info:**

Uses dropout to all layer outputs apart from softmax

Residual connections

Activation checkpointing

**8M Ops/ts Model:**

**Embedding:**(512)

**LSTM:**(512) hidden size - 2M total per layer across devices

**MoE:**

**Params:**(512 x 2014), (1024 x 512) = 1M params (*k = 4M per )

**d(evices):**16

**n(umber of experts):**4 → 256 (hier: 256 → 4096)

**k:**4 (2,2)

**total params:**8M → 270M (270M → 4B)

**8M**= 2M + 4M + 2M

**143M Ops/ts Model:**

Increase comes through larger hidden layer sizes throughout ⬇️

**Embedding:**(1024)

**LSTM:**(4096) hidden size

**MoE:**

**Params:**(1024 x 8192), (8192x 1024) = 16M params (*k = 64M per )

**d(evices):**16

**n(umber of experts):**256 hier

**k:**2,2

**total params:**4B

**Largest model :**hierarchical with 130k experts, 137B params

**Adam Adjustment:**

- No first moment gradient estimates → just use current value

- Factored representation of each parameter matrix's second moment estimates

**Interpretability (for translation task):**

**Attention:**

- translation task uses attention between encoder and decoder

- original paper uses (weighted) NN-type approach

- this paper uses (weighted) tanh inner product

### My Notes

#### General thoughts

This seems like a promising method. We're essentially adding parameters+data, but reducing the quadratic cost increase by only training each parameter on a

*subset*of the datapoints it can help most with.This idea seems to work well. The key innovations here are that a) we have

*sparse*gating, which means that each expert only focuses on its "assigned" datapoints, and b) this all happens*within*a layer of the network.The upshot here is that we still get the performance gains we'd expect by adding more parameters in a standard way, but without the associated blowup in training costs.

The leveraging of sparsity is also interesting from a wider ML perspective - this is key to some of the most successful approaches. As one reviewer on openreview said:

we know historically that sparsity of parameters is among the most important modelling principles in machine learning, being used with great success in e.g. Lasso with thel1 penalty, in SVM with thehinge lossand inConvNetsby setting connections outside the receptive field to zero.

#### vs Data and Model parallelism

Data parallel is "wasteful" because we replicate parameters; model parallel because we replicate data.

This MoE approach has no parameter replication, and only data replication of a factor of . The challenge is that now not every parameter is trained on every datapoint - but it seems to work!

#### LSTM load vs MoE

One issue that's not mentioned explicitly is the amount of data each expert is trained on relative to the LSTM layer.

Consider that for our LSTM layer we have a total minibatch size of , spread across our devices. At the MoE layer we then replicate the data times, and spread it across experts, giving a per-expert minibatch size of . Thus the fraction of the total minibatch each expert gets is .

Additionally, each device gets elements for the LSTM layer, and for the MoE layer, so each device's MoE layer has to cope with times more data.

#### Suitability for the IPU

- Spread out to experts and back to out-LSTM are scatter and gather operations respectively ➡️ expensive for GPU, not for IPU

- As IPU-tiles offer very fine-grained parallelism, we could have more experts of a smaller size ➡️ this may be beneficial

- As IPU better supports MIMD we have the possibility of different architectures for different experts?

#### Outstanding questions

- How "smart" does the gating have to be? I would like to see an ablation where the rest of the model is fixed and we incrementally increase the size of the hidden layer. The translation "specialisation" results clearly show that the gating is learning something meaningful, but it's not clear how important this decision-making is.

- Ditto on the above for the LSTM layers.

- How does the value of affect performance? How good/bad is ? What about

- How much does the Adam simplification hurt performance? Is there precedent for this method?

- I would like to see some experiments on the network bandwidth problem - how large does the hidden layer size have to be to offset the overhead?

- I'm not convinced that the experiments justify the use of both losses?

### TL;DR

TL;DR### Related Papers

#### Switch Transformers

- Replaces the FF layer in a transformer with a MoE layer

- Shows k=1 just as good, and has additional benefits

- Uses Mesh-Tensorflow for tensor/model parallelism

- Replaces double loss with a single, simple loss

- Trained on mixed-precision bfloat16

- Combine expert, model and data parallelism (note: "expert" framed as new form of parallelism)

#### V-MoE

- Same transformer modification as switch transformer, but using K>1 MoE approach

- Train largest vision models to-date (15B params), matching SOTA dense models with less compute

#### Deep MoE via Shallow Embedding

**Intro**

- Shazeer et al. require many times larger models to recover SOTA

- Question: "can we stack and train many layers of mixture of experts models to improve accuracy and reduce prediction cost without radically increasing the network width?"

- Hundreds of MoE layers

**Method**

- Treats each channel in a convolutional layer as an expert → 1 for each input channel

- Number of experts selected varies at each layer (no top-k)

- Two additional loss terms: a) for sparsity, b) embedding classification loss = embedding & label cross-entropy