Hyperdimensional Computing

Contents



Motivation

New nanotechnology can enable much more power-efficient and large-scale circuitry, compared with existing semiconductor technology
This can be used to improve conventional computing (e.g. large nonvolatile memories).
But it also makes radically new architectures possible.
One such example is High Dimensional Computing (HDC). The key feature of this kind of computing is that is uses much larger word-sizes, where words no-longer simply represent integers or approximate floating-point numbers.
For instance, a traditional system might have a word-size of 64 bits, whereas HDC might use 10,000 bits.
These new word representations require a different set of operations for arithmetic and addressing memory.
There is good reason to believe that the kind of system used in HDC is particularly well-suited to machine-learning applications.
🙊
A word on word-sizes
In the traditional setup, a memory address might be 64 bits wide, addressing words that themselves are of 64 bits. There are addressable words, giving a maximum of exabytes. This is the theoretical maximum memory such a system could support, and is probably more than enough!
In the HDC setup, a memory address might be 10,000 bits wide. This could potentially address an enormous words is used in the traditional way. We don’t need such capacity, and we’ll see that addressing memory works very differently in this world.

Arithmetic

In a traditional processor we’d have dedicated circuits for computing standard arithmetic operations on 32/64-bit floating-point numbers.
For HDC we have a new word-size, representing a new kind of value. It has it’s own dedicated aritmetic operations, and its own (much larger) circuits.

Premise

We want our high-dimensional words to represent a lot more than just a few real numbers or a memory address (as is the case in the traditional approach). But we still want them to have some of the fundamental algebraic properties of the kinds of mathmatical objects we’re used to dealing with (i.e. real numbers).
The solution is to define a set of new operations that form (or approximate) a field over vector space (see the Wikipedia page for an overview of what you need to define for a field).
In short, we require the following, approximately:
  1. Addition is commutative
  1. Multiplication is invertible
  1. Multiplication distributes over addition
  1. Vectors can be compared for similarity (the space has a metric)
So what operations will give us these properties?
🧑‍🎓
An ML perspective
A word in HDC is a lot like a hidden representation in a modern ML model. Both are vectors, only the former is a vector of bits, and the latter a vector of floating-point numbers.
Interestingly, the amount of data per-word used by a 10,000 bit HDC system is of a similar magnitude to that of our current deployment-scale ML models. If we were to use a float16 representation, a BERT-Base hidden representation would take up 12,288 bits. Very close!

Metric

Operation: Hamming distance (the number of positions in which the bits are different)
We can model the hamming distance of randomly-drawn words using a binomial distribution.
For 10,000-bit words we have n=10k, p=0.5, which gives an std of 50. So 95% of (randomly-drawn) values will have a hamming distance of 5k 100.
That most unrelated words are roughly the same distance away is a useful property, as words that are related will hopefully be easily distinguishable by distance alone.

Addition

Operation: mode (component-wise)
In the event we get a tie, we break it randomly (bear in mind that we typically do this over multiple words).
The output of randomly drawn words is likely close in Hamming distance (i.e. similar) to its inputs. This distance is statistically significant for over a thousand arguments.
Based on this, we can consider the resulting word to represent an “approximate set” of its arguments.

Multiplication

Operation: XOR (component-wise)
Ties are again broken randomly.
Here the output of randomly drawn words is likely far in Hamming distance (dissimilar) to its inputs.
Distance is also preserved between words that are both multiplied by the same amount.
XOR has both the required inverse and distributional properties (though the latter is only approximate over even numbers of vectors).
It also enables variable binding, which as we will see, allows data to be approximately stored and retrieved within a word.

Representing data

In HDC we can (approximately) represent computer science’s classic data structures—and in fact, any kind of object comprising pointers and values.

Structured data

A key data structure (from which others can be built) is the associative array (i.e. dictionary, or map). It is represented as follows:
Consider an associative array with variables , to which we assign the values .
In HDC each is a separate (high-dimensional) word. We will choose to be randomly-drawn vectors.
These can be combined into a single word using the holographic (or holistic) representation:
This new word contains the supplied values in a way that can be easily retrieved with little information lost (so long as the number of key-value pairs is not too large). How is this done?
Retrieval:
To retrieve we multiply by its inverse, which is simply . Using the distributive property, we get:
Using the previously mentioned properties of addition, we can see this roughly reduces to the mode of and two random vectors. As the mode approximately preserves distance, (so long as there aren’t too many other values) the resulting .
(If we’re not happy with our noisy vector, we always have the option of doing some nearest-neighbour search to get the original value, assuming we still have it available.)

Sequences

Simple approach:
To represent a (variable-length) sequence, we can use the above associative array technique. To do this we make the first element a key, and the second element its value. We then repeat this indefinitely (basically, a linked-list).
For example, given a sequence , we would compute:
Given we can retrieve , which lets us retrieve , and so on. In fact, we can even start part-way through the sequence.
By combining this with the structured data approach above, many kinds of data structure can also be derived.
One issue here is that is doesn’t handle duplicate values well - we end up introducing a cycle, even if we don’t want one. I think some kind of positional encoding could work to fix this—the approach taken in the literature is slightly different though ⬇️
Permutation approach:
In this approach we make use of another key operation: permutation. Kanerva (2014) sums up the useful properties of permutation succinctly:
A permutation resembles multiplication (with XOR) in the sense that it distributes over vector addition—it distributes over any coordinatewise operation including multiplication with XOR—it preserves distance, and it is invertible. However, it is not its own inverse: permutation of a permutation is another, different permutation, which makes it suitable for encoding sequences.
We will represent a fixed, arbitrary permutation operation by the symbol .
The basic idea here is that we input the sequence one element at a time, and at each step permute our existing word, before adding the new element:
Because of the similarity property of addition, we should be able to retrieve elements given surrounding subsequences. There are some complexities here that aren’t clear to me yet, so I’ll avoid a fuller explanation for now.
One thing to note though is that elements of the mode can be weighted to give more emphasis to recent data.

Questions

  1. Is there an easy way to calculate mode from existing instructions? Are there ways of approximating these operations with instructions that we already have in common accelerators?
  1. What are the benefits and challenges of engineering these circuits from a hardware perspective?
  1. What is this magical nanotechnology and what are its tradeoffs?
  1. Is there any penalty into splitting up HDC into smaller words that we then combine? Do the circuits have to be gigantic, or can we do this well using existing semiconductor technology?
  1. What are the old AI approaches mentioned (e.g. Holographic Reduced Representation)? Do we care?
  1. Do we scale this by using multiple 10k words, or moving to >10k words? (even if the circuits don’t scale, we’d maybe give ourselves a way of computing much larger words using the fixed-size circuits)
  1. How does the permutation-sequence thing work