TT-Rec: Tensor Train Compression for Deep Learning Recommendation Models

Speakers: Bilge Acun, Chunxing Yin

Recommendation Models at FB:
  • News Feed Ranking
  • Stories Ranking
  • Instagram Explore
In FB datacentres, they account for:
  • ~50% of AI training cycles
  • ~80% of AI inference cycles

Deep Learning Recommendation Model (DLRM)

Includes sparse features, e.g. pages liked, videos watched, etc.
Embedding lookup is effectively a hashmap indexed by the sparse feature id.
From a systems perspective, embedding learning is the most important part of these models to optimise.
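The lookup can be sketched in a few lines; shapes here are illustrative, not FB's real configuration:

```python
import numpy as np

# Toy embedding lookup: each sparse feature id indexes one row of the table.
vocab_size, emb_dim = 1000, 16
table = np.random.rand(vocab_size, emb_dim)

# One user's sparse feature, e.g. ids of pages liked.
ids = np.array([3, 42, 7])
vectors = table[ids]          # (3, 16): one embedding row per id
pooled = vectors.sum(axis=0)  # pooled into a single (16,) feature vector
```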

Challenges in Embedding Learning

  • Huge vocabulary sizes → memory capacity requirements of embedding tables have grown from tens of GBs to TBs
  • Skewed data distribution in embedding tables → access frequencies typically follow a power-law distribution, both for rows within a table and across the tables themselves

Motivation

AIM: "to make the tables smaller and denser, in order to trade off memory requirements for computation, to make them fit better to memory limited accelerators"

Tensor Train Compression

A low-rank tensor factorisation method
Tensor factorisation: a d-dimensional tensor W of shape n_1 × n_2 × … × n_d is written as a product of d TT-cores G_k, where each core G_k has shape R_{k−1} × n_k × R_k and the boundary ranks are R_0 = R_d = 1.
The n_k of the cores match the cardinality of each corresponding dimension of W.
Think of the factorisation as an einsum, where each of the rank dimensions cancels out: the middle ones with each other, and the end ones with each other, as R_0 = R_d = 1. This leaves the factorisation with the same dims as the original.
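The einsum view above can be sketched for a 3-d tensor (sizes are illustrative):

```python
import numpy as np

# TT factorisation of a 3-d tensor. Each core G_k has shape
# (R_{k-1}, n_k, R_k), with boundary ranks R_0 = R_3 = 1.
n1, n2, n3 = 4, 5, 6
r1, r2 = 2, 3
G1 = np.random.rand(1, n1, r1)
G2 = np.random.rand(r1, n2, r2)
G3 = np.random.rand(r2, n3, 1)

# Rank dims cancel in the einsum: b with b, c with c, and the end ranks
# a with a (both size 1), leaving a full (n1, n2, n3) tensor.
W = np.einsum('aib,bjc,cka->ijk', G1, G2, G3)
```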

Application to DLRM

Replace the embedding tables with TT format, with appropriately chosen TT-ranks.
TT-cores learned during training.
Challenges:
  • Model-quality degradation from the low-rank approximation
  • Hyperparameter tuning for the TT-ranks
  • Extra compute required to reconstruct embedding rows from the cores
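A lookup in TT format can be sketched as follows; the factorisation of the vocab and embedding dim, and all sizes, are hypothetical choices for illustration:

```python
import numpy as np

# Hypothetical TT-format embedding lookup. Vocab V = v1*v2*v3 and embedding
# dim D = d1*d2*d3 are factorised; core k has shape (r_k, v_k, d_k, r_{k+1}).
v = (8, 8, 8)      # V = 512 rows
d = (2, 2, 4)      # D = 16 columns
r = (1, 4, 4, 1)   # TT-ranks (r0 = r3 = 1); a tuning knob in practice

rng = np.random.default_rng(0)
cores = [rng.random((r[k], v[k], d[k], r[k + 1])) for k in range(3)]

def tt_lookup(i):
    # Split the flat row index i into a multi-index (i1, i2, i3).
    i1, rest = divmod(i, v[1] * v[2])
    i2, i3 = divmod(rest, v[2])
    # Contract the selected slices over the rank dims to get one D-dim row.
    row = np.einsum('xay,ybz,zcx->abc',
                    cores[0][:, i1], cores[1][:, i2], cores[2][:, i3])
    return row.reshape(-1)  # (16,) vector; the full table is never materialised
```

This is the "extra compute" trade-off: each lookup is a small chain of matrix products instead of a single memory read.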

Benefit: Memory Reduction

  • Compress only the largest embedding tables
  • Overall model size reduction ranges from 4x to 120x
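A back-of-envelope count shows where the reduction comes from; the sizes below are illustrative, not from the talk:

```python
# Parameter count for one table: dense vs TT format.
v = (200, 220, 250)   # vocab factorised: V = 11,000,000 rows
d = (4, 4, 4)         # embedding dim factorised: D = 64
r = (1, 32, 32, 1)    # TT-ranks

full_params = 200 * 220 * 250 * 64                              # dense table
tt_params = sum(r[k] * v[k] * d[k] * r[k + 1] for k in range(3))
ratio = full_params / tt_params   # a several-hundred-fold reduction here
```

TT-core size grows roughly with the rank squared, so the ranks control the compression/quality trade-off.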

Model quality

In some cases TT compression can actually improve accuracy

Comparison vs Hashed Embeddings

Hashed embeddings simply hash multiple embedding rows into one bucket to reduce table size
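The hashing trick can be sketched like this (sizes are illustrative):

```python
import numpy as np

# Raw ids from a huge space are hashed into a small table,
# so colliding ids share one embedding row.
num_buckets, emb_dim = 100, 16
table = np.random.rand(num_buckets, emb_dim)

def hashed_lookup(raw_id):
    return table[hash(raw_id) % num_buckets]

# e.g. raw ids 5 and 105 collide here, ending up with the same embedding.
```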
Note: the two can be combined.
Both methods appear similar at first glance: is TT actually an improvement?