Speakers: Bilge Acun, Chunxing Yin
Recommendation Models at FB:
- News Feed Ranking
- Stories Ranking
- Instagram Explore
In FB datacentres, recommendation models account for:
- ~50% of training
- ~80% of inference
Deep Learning Recommendation Model (DLRM)
Includes sparse features: e.g. pages liked, videos watched, etc
Embedding lookup acts like a hashmap indexed by the sparse feature's IDs, mapping each ID to a learned dense vector.
From a systems perspective, embedding learning is the most important part to optimise in these models.
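As a concrete picture of the lookup, here is a minimal PyTorch sketch (not the speakers' code; the table size, dimension, and IDs are made-up values): each sparse feature gets its own embedding table, and a lookup gathers the rows for that feature's IDs and pools them into one dense vector.

```python
import torch
import torch.nn as nn

# One embedding table for one sparse feature (e.g. "pages liked").
num_rows, emb_dim = 1_000_000, 64           # vocabulary size, embedding dimension (illustrative)
table = nn.EmbeddingBag(num_rows, emb_dim, mode="sum")

# One sample whose sparse feature contains three IDs.
ids = torch.tensor([12, 4051, 999_999])
offsets = torch.tensor([0])                 # a single bag starting at position 0
pooled = table(ids, offsets)                # gathers 3 rows and sums them
print(pooled.shape)                         # torch.Size([1, 64])
```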
Challenges in Embedding Learning
- Huge vocabulary size → memory capacity requirements of embedding tables have grown from 10s of GBs to TBs.
- Skewed data distribution in embedding tables → access frequencies typically follow a power-law distribution, both for rows within a table and across the tables themselves.
Motivation
AIM: "to make the tables smaller and denser, in order to trade off memory requirements for computation, to make them fit better to memory limited accelerators"
Tensor Train Compression
A low-rank tensor factorisation method
Tensor factorisation:
Think of the factorisation as an einsum over the TT-cores, where the rank dimensions cancel out: adjacent cores contract with each other in the middle, and the two boundary ranks are size 1. For a table $W \in \mathbb{R}^{M \times N}$ with $M = \prod_k m_k$ and $N = \prod_k n_k$, the TT format is $W\big((i_1,\dots,i_d),(j_1,\dots,j_d)\big) = G_1[i_1, j_1]\, G_2[i_2, j_2] \cdots G_d[i_d, j_d]$, where each core slice $G_k[i_k, j_k]$ is an $r_{k-1} \times r_k$ matrix and $r_0 = r_d = 1$. This leaves the factorisation with the same dims as the original.
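A minimal NumPy sketch of that contraction (shapes and TT-ranks are assumptions, not values from the talk): the rank indices of adjacent cores cancel with each other, the boundary ranks are size 1, and the reconstructed table has the same overall dimensions as the original.

```python
import numpy as np

m, n, ranks = (4, 5, 5), (2, 4, 8), (1, 16, 16, 1)    # M = 100 rows, N = 64 dims, TT-ranks
cores = [np.random.randn(ranks[k], m[k], n[k], ranks[k + 1]) for k in range(3)]

# Contract: b and c are the middle rank indices that cancel pairwise;
# a and d are the boundary ranks of size 1.
full = np.einsum("aipb,bjqc,cksd->aijkpqsd", *cores)
full = full.reshape(np.prod(m), np.prod(n))           # back to a (100, 64) table
print(full.shape)
```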
Application to DLRM
Replace the embedding tables with TT format, with appropriately chosen TT-ranks.
TT-cores learned during training.
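A hedged sketch of what a TT-format lookup could look like (the tt_row helper, shapes, and ranks are illustrative assumptions, not the speakers' implementation): the flat row index is split into per-core indices, and only the requested embedding row is materialised from the cores, so the full table is never stored.

```python
import numpy as np

m, n, ranks = (4, 5, 5), (2, 4, 8), (1, 16, 16, 1)    # illustrative factorised shapes
cores = [np.random.randn(ranks[k], m[k], n[k], ranks[k + 1]) for k in range(3)]

def tt_row(row_id):
    # Split the flat row index into mixed-radix digits (i1, i2, i3).
    i1, rem = divmod(row_id, m[1] * m[2])
    i2, i3 = divmod(rem, m[2])
    # Each slice has shape (r_{k-1}, n_k, r_k); contract the rank dimensions.
    slices = [cores[0][:, i1], cores[1][:, i2], cores[2][:, i3]]
    row = np.einsum("apb,bqc,csd->pqs", *slices)
    return row.reshape(-1)                            # embedding vector of length 64

print(tt_row(42).shape)                               # (64,)
```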
Challenges:
- Low-rank approximation performance degradation
- Hyperparameter tuning for TT-ranks
- Extra compute required
Benefit: Memory Reduction
- Compress the largest embeddings
- Overall model reduction ranges from 4x to 120x
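An illustrative back-of-envelope parameter count (all numbers are assumptions, not from the talk): a single large table can shrink by orders of magnitude, while overall model reduction is smaller because only the largest tables are replaced.

```python
from math import prod

# Hypothetical table: ~11M rows x 128 dims, factorised into 3 TT-cores.
m, n, ranks = (200, 220, 250), (4, 4, 8), (1, 32, 32, 1)

full_params = prod(m) * prod(n)                               # ~1.4e9 parameters
tt_params = sum(ranks[k] * m[k] * n[k] * ranks[k + 1] for k in range(3))
print(f"full: {full_params:,}  TT: {tt_params:,}  ratio: {full_params / tt_params:.0f}x")
```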
Model quality
In some cases, TT compression can actually improve accuracy.
Comparison vs Hashed Embeddings
Hashed embeddings simply hash multiple IDs into the same bucket (a shared embedding row) to reduce table size.
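For contrast, a minimal sketch of the hashing trick (bucket count, IDs, and the modulo hash are illustrative assumptions): many raw IDs map onto a smaller table, so colliding IDs share one embedding row.

```python
import torch
import torch.nn as nn

num_buckets, emb_dim = 100_000, 64
table = nn.Embedding(num_buckets, emb_dim)

raw_ids = torch.tensor([7, 123_456_789, 987_654_321])   # IDs from a huge vocabulary
bucket_ids = raw_ids % num_buckets                       # simple modulo hash into buckets
vectors = table(bucket_ids)                              # shape: (3, 64); collisions share rows
```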
Note: the two can be combined.
Both methods appear similar? Is this an improvement?