TT-Rec: Tensor Train Compression for Deep Learning Recommendation Models

Speakers: Bilge Acun, Chunxing Yin

Recommendation Models at FB:
  • News Feed Ranking
  • Stories Ranking
  • Instagram Explore
In FB datacentres, they account for:
  • ~50% of AI training cycles
  • ~80% of AI inference cycles

Deep Learning Recommendation Model (DLRM)

Includes sparse features, e.g. pages liked, videos watched, etc.
Embedding lookup is effectively a hashmap indexed by the sparse feature id.
From a systems perspective, embedding learning is the most important part of these models to optimise.
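The lookup can be sketched in a few lines; shapes here are illustrative, not FB's real configuration:

```python
import numpy as np

# Toy embedding lookup: each sparse feature id indexes one row of the table.
vocab_size, emb_dim = 1000, 16
table = np.random.rand(vocab_size, emb_dim)

# One user's sparse feature, e.g. ids of pages liked.
ids = np.array([3, 42, 7])
vectors = table[ids]          # (3, 16): one embedding row per id
pooled = vectors.sum(axis=0)  # pooled into a single (16,) feature vector
```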

Challenges in Embedding Learning

  • Huge vocabulary sizes → memory capacity requirements of embedding tables have grown from tens of GBs to TBs
  • Skewed data distribution in embedding tables → access frequencies typically follow a power-law distribution, both for rows within a table and across the tables themselves

Motivation

AIM: "to make the tables smaller and denser, in order to trade off memory requirements for computation, to make them fit better to memory limited accelerators"

Tensor Train Compression

A low-rank tensor factorisation method
Tensor factorisation: a d-dimensional tensor W of shape n_1 × n_2 × … × n_d is written as a product of d TT-cores G_k, where each core G_k has shape R_{k−1} × n_k × R_k and the boundary ranks are R_0 = R_d = 1.
The n_k of the cores match the cardinality of each corresponding dimension of W.
Think of the factorisation as an einsum, where each of the rank dimensions cancels out: the middle ones with each other, and the end ones with each other, as R_0 = R_d = 1. This leaves the factorisation with the same dims as the original.
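The einsum view above can be sketched for a 3-d tensor (sizes are illustrative):

```python
import numpy as np

# TT factorisation of a 3-d tensor. Each core G_k has shape
# (R_{k-1}, n_k, R_k), with boundary ranks R_0 = R_3 = 1.
n1, n2, n3 = 4, 5, 6
r1, r2 = 2, 3
G1 = np.random.rand(1, n1, r1)
G2 = np.random.rand(r1, n2, r2)
G3 = np.random.rand(r2, n3, 1)

# Rank dims cancel in the einsum: b with b, c with c, and the end ranks
# a with a (both size 1), leaving a full (n1, n2, n3) tensor.
W = np.einsum('aib,bjc,cka->ijk', G1, G2, G3)
```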

Application to DLRM

Replace the embedding tables with TT format, with appropriately chosen TT-ranks.
TT-cores learned during training.
Challenges:
  • Model-quality degradation from the low-rank approximation
  • Hyperparameter tuning for the TT-ranks
  • Extra compute required to reconstruct embedding rows from the cores
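A lookup in TT format can be sketched as follows; the factorisation of the vocab and embedding dim, and all sizes, are hypothetical choices for illustration:

```python
import numpy as np

# Hypothetical TT-format embedding lookup. Vocab V = v1*v2*v3 and embedding
# dim D = d1*d2*d3 are factorised; core k has shape (r_k, v_k, d_k, r_{k+1}).
v = (8, 8, 8)      # V = 512 rows
d = (2, 2, 4)      # D = 16 columns
r = (1, 4, 4, 1)   # TT-ranks (r0 = r3 = 1); a tuning knob in practice

rng = np.random.default_rng(0)
cores = [rng.random((r[k], v[k], d[k], r[k + 1])) for k in range(3)]

def tt_lookup(i):
    # Split the flat row index i into a multi-index (i1, i2, i3).
    i1, rest = divmod(i, v[1] * v[2])
    i2, i3 = divmod(rest, v[2])
    # Contract the selected slices over the rank dims to get one D-dim row.
    row = np.einsum('xay,ybz,zcx->abc',
                    cores[0][:, i1], cores[1][:, i2], cores[2][:, i3])
    return row.reshape(-1)  # (16,) vector; the full table is never materialised
```

This is the "extra compute" trade-off: each lookup is a small chain of matrix products instead of a single memory read.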

Benefit: Memory Reduction

  • Compress only the largest embedding tables
  • Overall model size reduction ranges from 4x to 120x
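A back-of-envelope count shows where the reduction comes from; the sizes below are illustrative, not from the talk:

```python
# Parameter count for one table: dense vs TT format.
v = (200, 220, 250)   # vocab factorised: V = 11,000,000 rows
d = (4, 4, 4)         # embedding dim factorised: D = 64
r = (1, 32, 32, 1)    # TT-ranks

full_params = 200 * 220 * 250 * 64                              # dense table
tt_params = sum(r[k] * v[k] * d[k] * r[k + 1] for k in range(3))
ratio = full_params / tt_params   # a several-hundred-fold reduction here
```

TT-core size grows roughly with the rank squared, so the ranks control the compression/quality trade-off.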

Model quality

In some cases TT compression can actually improve accuracy

Comparison vs Hashed Embeddings

Hashed embeddings simply hash multiple embedding rows into one bucket to reduce table size
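The hashing trick can be sketched like this (sizes are illustrative):

```python
import numpy as np

# Raw ids from a huge space are hashed into a small table,
# so colliding ids share one embedding row.
num_buckets, emb_dim = 100, 16
table = np.random.rand(num_buckets, emb_dim)

def hashed_lookup(raw_id):
    return table[hash(raw_id) % num_buckets]

# e.g. raw ids 5 and 105 collide here, ending up with the same embedding.
```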
Note: the two can be combined.
Both methods appear similar at first glance: is TT actually an improvement?