🪅

Learning in High Dimension Always Amounts to Extrapolation

Title: Learning in High Dimension Always Amounts to Extrapolation
Authors: Randall Balestriero, Jerome Pesenti, Yann LeCun
Date: 2021
Venue: DBLP
Keywords: extrapolation

Introduction

Definition: Interpolation occurs for a sample if it belongs to the convex hull of a set of samples (extrapolation is defined as the converse). A membership test is sketched right after this list.
Common assumption: as an algorithm transitions from interpolation to extrapolation, its performance decreases
Goal of the paper: show interpolation almost surely never occurs in high-dimensional spaces (>100) regardless of the underlying intrinsic dimension of the data manifold.
Corollaries:
  1. DL models basically always extrapolate
  2. The extrapolation regime is not necessarily to be avoided
  3. Generalisation should not be thought of in terms of extrapolation/interpolation
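The definition translates directly into a test: a sample is in the interpolation regime iff a small linear program is feasible. A minimal sketch (my own helper, not code from the paper) using NumPy/SciPy:

```python
import numpy as np
from scipy.optimize import linprog

def in_convex_hull(x, X):
    """Return True if x (shape (d,)) lies in the convex hull of the rows of
    X (shape (N, d)): feasibility of  X^T @ lam = x, sum(lam) = 1, lam >= 0."""
    N, d = X.shape
    A_eq = np.vstack([X.T, np.ones((1, N))])   # (d + 1, N) equality constraints
    b_eq = np.concatenate([x, [1.0]])          # target point, plus sum-to-one
    c = np.zeros(N)                            # pure feasibility problem, no objective
    res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    return res.status == 0                     # status 0 = solved, i.e. feasible

# quick sanity check in 2-D: the origin is almost surely inside the hull of
# 1000 Gaussian samples, a far-away point almost surely isn't
X = np.random.randn(1000, 2)
print(in_convex_hull(np.zeros(2), X), in_convex_hull(10 * np.ones(2), X))
```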
From the conclusion:
Interpolation and extrapolation [...] provide an intuitive geometrical characterization on the location of new samples with respect to a given dataset. Those terms are commonly used as geometrical proxy to predict a model’s performances on unseen samples and many have reached the conclusion that a model’s generalization performance depends on how a model interpolates. In other words, how accurate is a model within a dataset’s convex-hull defines its generalization performances. In this paper, we proposed to debunk this (mis)conception.

Interpolation is Doomed by the Curse of Dimensionality

The Role of the Intrinsic, Ambient and Convex Hull Dimensions

Ambient dimension: the dimension of the space in which the data lives.
Intrinsic dimension (of the underlying data manifold): the number of variables needed in a minimal representation of the data.
Convex hull dimension: the dimension of the smallest affine subspace that contains the entire data manifold.
 
Claim: The probability of interpolation occurring depends on the convex hull dimension, not the intrinsic (manifold) dimension.
Evidence:
How to interpret this: we want to find the underlying variable that drives the probability of interpolation (left axis). By fixing the convex hull dimension while increasing the ambient dimension d (RHS), we show that the probability of interpolation doesn't change (the black line is vertical), and hence this is the underlying variable we're looking for.

LHS: the intrinsic dimension is set equal to the ambient dimension d, so it increases with d. This assumes the manifold has the same dimensionality as the data. Result: as they increase together, the probability of interpolation decreases. This is the standard curse-of-dimensionality setting.

Middle: the intrinsic dimension is now just 1 (a one-dimensional nonlinear manifold). This assumes the minimal representation of the data is a single variable. If the intrinsic dimension were the variable we're looking for, this should give a fixed probability of interpolation regardless of d. Result: as d increases, the probability of interpolation still decreases! So the dimensionality of the manifold isn't the key. The diagonal black line effectively shows how much we have to grow our dataset to regain the same probability as d increases. The straightness of the line (on a log-scale x-axis) shows that the number of samples must increase exponentially!

RHS: the data is now sampled from a density within an affine subspace of fixed dimension, which limits the convex hull dimension (as well as the intrinsic dimension). Now we get the same probability of interpolation regardless of d 🙂.
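A rough sketch of the kind of simulation I imagine behind this figure (my reconstruction, not the authors' code): it contrasts the LHS setting (no manifold structure, intrinsic dimension = d) with the RHS setting (a fixed 2-D affine subspace embedded in R^d), reusing in_convex_hull from the sketch above.

```python
import numpy as np
# assumes in_convex_hull() from the earlier sketch is in scope

def p_interpolation(sample_fn, d, N=300, trials=100, seed=0):
    """Monte-Carlo estimate of P(a fresh sample falls inside the convex hull
    of N training samples drawn from the same distribution in R^d)."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(trials):
        X = sample_fn(N, d, rng)
        x = sample_fn(1, d, rng)[0]
        hits += in_convex_hull(x, X)
    return hits / trials

# LHS: no manifold structure, intrinsic dim == ambient dim == d
gaussian_full = lambda n, d, rng: rng.standard_normal((n, d))

# RHS: data confined to a *fixed* 2-D affine subspace of R^d, so the
# convex hull dimension stays at 2 no matter how large d gets
def gaussian_affine_2d(n, d, rng):
    basis = np.zeros((2, d))
    basis[0, 0] = basis[1, 1] = 1.0        # fixed embedding of R^2 into R^d
    return rng.standard_normal((n, 2)) @ basis

for d in (2, 4, 8, 16, 32):
    print(f"d={d:3d}  full-dim: {p_interpolation(gaussian_full, d):.2f}"
          f"  2-D affine: {p_interpolation(gaussian_affine_2d, d):.2f}")
```

With N fixed, the full-dimensional probability should collapse towards 0 as d grows, while the 2-D affine case stays roughly constant, which is exactly the convex-hull-dimension story.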

Real Datasets and Embeddings are no Exception

Question: what if real datasets have a special type of low-dimensional manifold embedding that means we are still in the interpolation regime?
Result: on MNIST, CIFAR and ImageNet, despite the low-dimensional manifold, finding samples in the interpolation regime is still exponentially difficult.
 

No interpolation in pixel-space

notion image

No interpolation in embedding-space (!)

Question: "one could argue that the key interest of machine learning is not to perform interpolation in the data space, but rather in a (learned) latent space" - so do we interpolate in the latent space?
This take doesn't account for the results in Table 1.
Result: apparently not!? This seems remarkable but makes total sense when you consider how high-dimensional the latent space is.
Key quote:
We observed that embedding-spaces provide seemingly organized representations (with linear separability of the classes), yet, interpolation remains an elusive goal even for embedding-spaces of only 30 dimensions. Hence current deep learning methods operate almost surely in an extrapolation regime in both the data space, and their embedding space.
Note: this is for embeddings!

A couple of things aren't clear to me here and give pause for thought. They pick 30 dims at random out of 512. Is this OK? I think so: if anything it should save us from some of the curse of dimensionality! But it feels a bit strange.

Why do they not show a ResNet trained on the datasets themselves? Surely we want to see what kind of embedding is learned. They pretrain on ImageNet, which is one of the three datasets, and it still suffers horribly.
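For reference, here is a small-scale sketch of how I understand the embedding experiment (ImageNet-pretrained ResNet-18 features, 30 random dims out of 512, then the same convex-hull test). The choice of MNIST and the tiny sample sizes below are mine, so treat it as a toy version rather than the paper's setup.

```python
import numpy as np
import torch
import torchvision
from torchvision import transforms
# assumes in_convex_hull() from the earlier sketch is in scope

# ImageNet-pretrained ResNet-18; we keep its 512-d penultimate features
# (older torchvision versions use pretrained=True instead of weights=...)
model = torchvision.models.resnet18(weights="IMAGENET1K_V1").eval()
features = torch.nn.Sequential(*list(model.children())[:-1])   # drop the fc head

tfm = transforms.Compose([
    transforms.Resize(224),
    transforms.Grayscale(num_output_channels=3),   # MNIST is single-channel
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])
train = torchvision.datasets.MNIST("data", train=True, download=True, transform=tfm)
test = torchvision.datasets.MNIST("data", train=False, download=True, transform=tfm)

def embed(dataset, n):
    """Return the first n embeddings, shape (n, 512), as a NumPy array."""
    loader = torch.utils.data.DataLoader(dataset, batch_size=256)
    feats = []
    with torch.no_grad():
        for imgs, _ in loader:
            feats.append(features(imgs).flatten(1))
            if sum(f.shape[0] for f in feats) >= n:
                break
    return torch.cat(feats)[:n].numpy()

E_train, E_test = embed(train, 2000), embed(test, 100)
dims = np.random.default_rng(0).choice(512, size=30, replace=False)  # 30 random dims
rate = np.mean([in_convex_hull(e[dims], E_train[:, dims]) for e in E_test])
print("fraction of test embeddings in the interpolation regime:", rate)
```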

Is interpolation/extrapolation info preserved when using dimensionality reduction techniques?

TL;DR:
dimensionality reduction methods lose the interpolation/extrapolation information and lead to visual misconceptions significantly skewed towards interpolation
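A quick illustration of why this happens, using plain PCA as the simplest stand-in for a dimensionality reduction method (the paper also looks at nonlinear methods): samples that are extrapolating in the original space routinely land inside the convex hull of the 2-D projection.

```python
import numpy as np
from sklearn.decomposition import PCA
# assumes in_convex_hull() from the earlier sketch is in scope

rng = np.random.default_rng(0)
d, N = 50, 500
X = rng.standard_normal((N, d))       # "training" set in high dimension
new = rng.standard_normal((200, d))   # fresh samples from the same distribution

# fraction in the interpolation regime in the original 50-d space...
p_highdim = np.mean([in_convex_hull(x, X) for x in new])

# ...versus after a 2-D PCA projection fitted on the training set
pca = PCA(n_components=2).fit(X)
X2, new2 = pca.transform(X), pca.transform(new)
p_2d = np.mean([in_convex_hull(x, X2) for x in new2])

print(f"interpolation rate: {p_highdim:.2f} in {d}-d vs {p_2d:.2f} after 2-D PCA")
```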
 

My take-away

notion image