Introduction
Deployment of (inductive?) ML models has 2 parts:
- Generate the representation
- Use the representation for some downstream application (cost scales with the embedding dimensionality d and the data size N)
Problem: downstream apps dominate compute at web-scale, and each has different cost constraints
Solution:
By encoding coarse-to-fine-grained representations, which are as accurate as independently trained counterparts, MRL learns (with minimal overhead) a single representation that can be deployed adaptively at no additional cost during inference.
Focus on two key web-scale tasks: classification & retrieval
Method
Matryoshka Representation Learning (MRL):
- Run the full neural network as normal to generate the output embedding z (dimension d).
- Take each power-of-two-sized prefix of z (its first m dimensions), and apply a per-size projection (the input sizes differ, so the matrices must too) to get a prediction for each
- Apply the loss to each prediction, and minimise the sum (a minimal sketch follows below)
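A minimal PyTorch sketch of these steps, assuming a classification setup; the names (`MRLHead`), the nesting sizes, and the unit loss weights are illustrative assumptions, not the paper's exact code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MRLHead(nn.Module):
    def __init__(self, embed_dim=2048, num_classes=1000,
                 nesting_dims=(8, 16, 32, 64, 128, 256, 512, 1024, 2048)):
        super().__init__()
        self.nesting_dims = nesting_dims
        # One separate linear classifier per nested prefix size m.
        self.heads = nn.ModuleList(nn.Linear(m, num_classes)
                                   for m in nesting_dims)

    def forward(self, z, labels):
        # z: (batch, embed_dim) embedding from the full backbone.
        loss = 0.0
        for head, m in zip(self.heads, self.nesting_dims):
            logits = head(z[:, :m])            # predict from the first m dims
            loss = loss + F.cross_entropy(logits, labels)
        return loss                            # minimise the summed losses
```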
Result:
- One trained model yields O(log d) different usable embedding sizes
- In fact you can truncate to intermediate sizes that were never explicitly trained and still get accurate embeddings, i.e. the sizes interpolate (see the snippet after this list)
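At inference, adapting the size is just truncation of the prefix; a hedged snippet, where the helper name and the renormalisation step (standard for retrieval) are assumptions:

```python
import torch
import torch.nn.functional as F

def truncate_embedding(z: torch.Tensor, m: int) -> torch.Tensor:
    """Keep the first m dims of a Matryoshka embedding, re-unit-normalised."""
    return F.normalize(z[:, :m], dim=-1)

# usage: z = backbone(x); z64 = truncate_embedding(z, 64)
```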
Efficient MRL:
- Share projections across sizes by taking an initial slice of the largest size's projection matrix (sketch below)
- Saves (almost) half the projection parameters, since the separate per-size matrices (8 + 16 + … + d columns) otherwise sum to just under 2d columns
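A sketch of this weight-tying, under the same assumed setup as `MRLHead` above (the class name and initialisation are illustrative): one full-size matrix W, where the size-m classifier is just its first m columns.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EfficientMRLHead(nn.Module):
    def __init__(self, embed_dim=2048, num_classes=1000,
                 nesting_dims=(8, 16, 32, 64, 128, 256, 512, 1024, 2048)):
        super().__init__()
        self.nesting_dims = nesting_dims
        # A single shared weight matrix instead of one per size.
        self.W = nn.Parameter(torch.empty(num_classes, embed_dim))
        nn.init.kaiming_uniform_(self.W)

    def forward(self, z, labels):
        loss = 0.0
        for m in self.nesting_dims:
            logits = z[:, :m] @ self.W[:, :m].T   # slice of the big projection
            loss = loss + F.cross_entropy(logits, labels)
        return loss
```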