In 5 bullet points
- Current models give increased performance when more parameters and data are used
- This leads to a roughly quadratic increase in training costs: compute per example grows with the number of parameters, and the amount of data grows alongside it
- Using sparse gating with MoEs, we can train each param on the subset of the data it can help the most with
- This is done in a distributed way, within a layer of the network
- Allows models with far more parameters to be trained and still accrue the expected performance gains, without a huge increase in training costs
Overview
Context:
- A current limitation for ML is model capacity: the number of parameters bounds how much information a model can absorb
- With more parameters we also require more data
High-level problem: adding more parameters and more data to existing models gives a quadratic ⬆️ in training costs
High-level solution: MoE model where only parts of the model are active, conditional on the input
MoE problems addressed:
- Effective batch size can get so small as to be inefficient
- Network bandwidth can become a bottleneck
- Conditional computation has not previously been demonstrated on large datasets
Method
Proposed approach:
Expert: ff-NN, 1 hidden layer + ReLU
Gate: softmax gating + tunable Gaussian noise + top-k sparsity: G(x) = Softmax(KeepTopK(H(x), k)), where H(x)_i = (x·W_g)_i + StandardNormal()·Softplus((x·W_noise)_i) and KeepTopK sets all but the k largest entries to −∞ (see the sketch after this list)
Rest: Word embedding layer, LSTM layers before & after
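A minimal numpy sketch of this gating scheme (the names `W_g` and `W_noise` follow the paper's notation, but the code itself is illustrative, not the authors' implementation):

```python
import numpy as np

def softplus(x):
    return np.log1p(np.exp(x))

def noisy_top_k_gating(x, W_g, W_noise, k):
    """Noisy top-k gating: keep only k experts per example, softmax over the survivors."""
    clean = x @ W_g                                    # (batch, n_experts) raw gate logits
    noise_std = softplus(x @ W_noise)                  # learned, input-dependent noise scale
    noisy = clean + np.random.randn(*clean.shape) * noise_std
    # KeepTopK: everything outside the top k is set to -inf, so it softmaxes to 0
    kth = np.sort(noisy, axis=1)[:, -k][:, None]
    masked = np.where(noisy >= kth, noisy, -np.inf)
    gates = np.exp(masked - masked.max(axis=1, keepdims=True))
    gates = gates / gates.sum(axis=1, keepdims=True)   # sparse: at most k non-zeros per row
    return gates, clean, noise_std

def moe_forward(x, gates, experts):
    """Weighted sum of the (few) selected experts' outputs."""
    out = np.zeros_like(x)
    for i, expert in enumerate(experts):
        sel = gates[:, i] > 0                          # only the examples routed to expert i
        if sel.any():
            out[sel] += gates[sel, i:i + 1] * expert(x[sel])
    return out
```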
Experimental Details:
- Dropout
- Residual connections
- Activation checkpointing
- Attention mechanism between encoder and decoder
Distributed Implementation:
Data parallel: LSTM + gating layers
Model parallel: experts partitioned across devices, each expert resident on a single device (see the dispatch sketch below)
Hierarchical MoE: first gating network = data-parallel, secondary MoEs = single device
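A rough single-process sketch of how the dispatch/combine could look under this hybrid scheme (the function names and the list-of-batches representation of "devices" are illustrative, not the paper's actual implementation):

```python
import numpy as np

def moe_dispatch(local_batches, gate_fn, experts):
    """Simulate: gating replicated per device (data parallel), experts owned by devices (model parallel)."""
    # 1) Data parallel: each "device" runs the replicated gate on its own local batch.
    gates = [gate_fn(x) for x in local_batches]                 # one (b, n_experts) per device
    outputs = [np.zeros_like(x) for x in local_batches]

    for i, expert in enumerate(experts):
        # 2) Dispatch: gather, from every device, the rows routed to expert i
        #    (in the real system this is a network send to the device owning expert i).
        index = [np.nonzero(g[:, i] > 0)[0] for g in gates]
        rows = [local_batches[d][idx] for d, idx in enumerate(index) if len(idx)]
        if not rows:
            continue
        y = expert(np.concatenate(rows))                        # expert runs on its own device
        # 3) Combine: scatter results back to their source device, scaled by the gate values.
        off = 0
        for d, idx in enumerate(index):
            outputs[d][idx] += gates[d][idx, i:i + 1] * y[off:off + len(idx)]
            off += len(idx)
    return outputs
```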
Problems Solved
Small batch size:
Problem: Each expert receives only a small fraction (~k·b/n) of the batch, and as we add devices the per-device batch shrinks further
Solution: We can compute a group of sequential LSTM outputs and send them all at once as a (macro)batch to the MoE layer ("convolutional approach"; see the sketch after this block)
Outcome: Increases effective batch size & efficiency
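A hedged sketch of the idea, assuming a simplified LSTM step of the form `h = lstm_step(x_t, h)` with hidden size equal to the input size (both simplifications are mine, for brevity):

```python
import numpy as np

def lstm_then_moe(inputs, lstm_step, moe_layer, h0):
    """Run the LSTM over all timesteps first, then apply the MoE once to the flattened outputs.

    inputs: (batch, time, dim). Instead of calling the MoE per timestep on `batch` rows,
    we collect every timestep's output and call it once on batch * time rows, which
    multiplies the effective batch each expert sees by the unrolled sequence length.
    """
    b, t, d = inputs.shape
    h = h0
    steps = []
    for step in range(t):                              # the sequential part stays sequential
        h = lstm_step(inputs[:, step, :], h)
        steps.append(h)
    flat = np.stack(steps, axis=1).reshape(b * t, d)   # (b*t, d) macrobatch for the MoE
    return moe_layer(flat).reshape(b, t, d)
```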
Network bandwidth:
Problem: Major limitation can be network bandwidth
Solution: Increase the arithmetic intensity (ops:bytes) of each expert by enlarging its hidden layer: compute per token grows with the hidden size, while the bytes sent over the network per token are fixed by the input/output size (worked example below)
Outcome: By using larger hidden layers we can hide the cost of the network
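A back-of-envelope version of that argument, assuming a one-hidden-layer expert with input/output width d, hidden width h, and float32 values on the wire (the numbers are illustrative):

```python
def expert_arithmetic_intensity(d, h, bytes_per_value=4):
    """Rough ops-to-bytes ratio for sending one token to a remote expert.

    Compute: two matmuls of a (1, d) row with (d, h) and (h, d) weights ~= 4*d*h FLOPs.
    Network: d input values go over the wire and d output values come back.
    """
    flops = 2 * d * h + 2 * h * d           # multiply-adds counted as 2 ops each
    bytes_moved = 2 * d * bytes_per_value   # input there + output back
    return flops / bytes_moved

# Growing the hidden layer raises compute per byte sent, hiding the network cost:
print(expert_arithmetic_intensity(d=1024, h=1024))   # 512.0 FLOPs per byte
print(expert_arithmetic_intensity(d=1024, h=8192))   # 4096.0 FLOPs per byte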
Expert importance balancing:
Problem: Vicious cycle where commonly selected experts are trained more, get better, and are selected even more often
Solution: Add an importance loss to the training loss: Importance(X)_i = Σ_{x∈X} G(x)_i and L_importance(X) = w_importance · CV(Importance(X))², where CV is the coefficient of variation (sketch below)
Outcome: Regularises the gating mechanism so all experts receive roughly equal total gate weight across a batch
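A minimal sketch of the importance loss, assuming `gates` is the sparse (batch, n_experts) output of the gating network; the loss weight here is an illustrative value:

```python
import numpy as np

def importance_loss(gates, w_importance=0.1):
    """Importance = per-expert sum of gate values over the batch; penalise its spread."""
    importance = gates.sum(axis=0)                        # how heavily each expert was used
    cv = importance.std() / (importance.mean() + 1e-10)   # coefficient of variation
    return w_importance * cv ** 2
```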
Expert load balancing:
Problem: The importance loss equalises total gate weight across the batch, but experts can still receive very different numbers of examples (unbalanced load)
Solution: Add a load loss: Load(X)_i = Σ_{x∈X} P(x, i), where P(x, i) is a smooth estimate (derived from the gating noise) of the probability that expert i is selected for x; L_load(X) = w_load · CV(Load(X))² (sketch below)
Outcome: Regularises the gating mechanism so experts receive roughly equal numbers of examples
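A hedged sketch of the load estimator, assuming access to the clean logits, noisy logits, and per-example noise scales produced by the gating network; the explicit loops are for clarity, not efficiency, and the loss weight is illustrative:

```python
import numpy as np
from scipy.stats import norm

def load_loss(clean_logits, noisy_logits, noise_std, k, w_load=0.1):
    """Smooth per-expert load estimate: expected number of examples routed to each expert.

    For each example and expert i, estimate the probability that expert i would still make
    the top k if the gating noise were re-sampled, then penalise the spread of the totals.
    """
    batch, n = noisy_logits.shape
    p = np.empty((batch, n))
    for b in range(batch):
        for i in range(n):
            others = np.delete(noisy_logits[b], i)
            kth_excluding = np.sort(others)[-k]           # k-th highest logit among the others
            p[b, i] = norm.cdf((clean_logits[b, i] - kth_excluding) / noise_std[b, i])
    load = p.sum(axis=0)                                  # estimated load per expert
    cv = load.std() / (load.mean() + 1e-10)
    return w_load * cv ** 2
```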
Adam Adjustment:
Problem: Adam optimiser states take up too much memory
Solution:
- No first-moment gradient estimates → just use the current gradient
- Factored representation of each parameter matrix's second-moment estimates (per-row and per-column averages; see the sketch below)
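A rough sketch of the factored second-moment idea (close in spirit to what the paper describes, and to the later Adafactor optimiser); the decay rate, epsilon handling, and reconstruction details here are assumptions:

```python
import numpy as np

def factored_update(param, grad, row_acc, col_acc, lr=1e-3, beta2=0.999, eps=1e-30):
    """One step of the memory-saving variant sketched above.

    Instead of a full (rows x cols) second-moment matrix, keep decayed per-row and
    per-column means of grad**2 and rebuild a rank-1 estimate from them; skip the
    first-moment accumulator entirely and use the raw gradient.
    """
    g2 = grad ** 2 + eps
    row_acc *= beta2; row_acc += (1 - beta2) * g2.mean(axis=1)   # shape (rows,)
    col_acc *= beta2; col_acc += (1 - beta2) * g2.mean(axis=0)   # shape (cols,)
    v = np.outer(row_acc, col_acc) / row_acc.mean()              # rank-1 reconstruction
    param -= lr * grad / (np.sqrt(v) + eps)
    return param, row_acc, col_acc
```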
Experiments
1 Bn Word Language Modelling
Figure: (Left) Fixed computational budget, increased no. experts; (Right) Fixed number of experts, increased budget.
Increased no. experts/params: Near-linear perplexity improvement for the flat model; slight further improvement beyond that for the hierarchical one.
Increased budget: Linear improvement, comparable to LSTM.
100 Bn Word Language Modelling
Increased no. experts/params: Near-linear improvement that tails off at the largest sizes; can't quite scale all the way to ~100 bn params
Problem: Possibly too much sparsity?
Increased data: Amount determines asymptote
Machine Translation: Far higher BLEU and lower perplexity than baselines, with far more params