(Based on Chapter 3 of the Deep Learning Book)
This is a very brief overview of some key concepts in information theory.
The amount of information an event tells us depends on its likelihood. Frequent events tell us little, while rare events tell us a lot.
Information theory gives us a measure, the self-information, that quantifies how much information an event gives us. We denote this by $I(x)$.
Such a function should satisfy the following:
- An event with probability 1 has $I(x) = 0$
- The less likely an event, the more information it transmits
- The information conveyed by independent events should be additive
We therefore define the self-information as follows:

$$I(x) = -\ln P(x)$$

Here we use the natural logarithm, so the unit of information is the nat. If base 2 is used, this measurement is called shannons or bits.
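The three requirements above can be checked numerically. Below is a minimal sketch (the function name `self_information` is just illustrative):

```python
import math

def self_information(p, base=math.e):
    """Information conveyed by an event of probability p (nats by default)."""
    return -math.log(p, base)

# Requirement 1: a certain event carries no information
print(self_information(1.0))

# Requirement 2: rarer events carry more information
print(self_information(0.5, base=2))    # 1 bit
print(self_information(0.25, base=2))   # 2 bits

# Requirement 3: independent events are additive, since
# -log(P(a)P(b)) = -log P(a) - log P(b)
print(self_information(0.5 * 0.25, base=2))  # 3 bits
```

The additivity requirement is exactly why a logarithm is the natural choice: it turns the product of independent probabilities into a sum.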
Moving to whole probability distributions, we define the expected information in an event sampled from a distribution $P$ as the Shannon entropy:

$$H(P) = \mathbb{E}_{x \sim P}[I(x)] = -\mathbb{E}_{x \sim P}[\ln P(x)]$$
Distributions that are closer to deterministic have lower entropy, while distributions that are closer to uniform have higher entropy.
When $x$ is continuous, this is also known as differential entropy.
If we want to compare the information in two probability distributions, we use the Kullback-Leibler divergence, which is the expected log probability ratio between the two distributions:

$$D_{\mathrm{KL}}(P \,\|\, Q) = \mathbb{E}_{x \sim P}\left[\ln \frac{P(x)}{Q(x)}\right]$$
The KL divergence is $0$ when $P$ and $Q$ are the same.
This is sometimes thought of as a measure of "distance" between the two distributions. However, this measure is not symmetric — in general $D_{\mathrm{KL}}(P \,\|\, Q) \neq D_{\mathrm{KL}}(Q \,\|\, P)$ — so it does not satisfy the usual requirements of a distance.
To visualise the asymmetry, see figure 3.6 in the book. The key point here is that if we wish to minimise $D_{\mathrm{KL}}(P \,\|\, Q)$:
- From the perspective of $Q$: $Q$ should place high probability wherever $P$ has high probability (and where $P$ has low probability, $Q$ can be low or high)
- From the perspective of $P$: $P$ should place low probability wherever $Q$ has low probability (and where $Q$ has high probability, $P$ can be low or high)
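Both the zero-at-equality property and the asymmetry are easy to verify numerically. A small sketch (the `kl_divergence` helper and the example distributions are illustrative):

```python
import math

def kl_divergence(p, q):
    """D_KL(P || Q) for discrete distributions given as lists of probabilities."""
    # Terms with P(x) = 0 contribute nothing to the expectation under P
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.9, 0.05, 0.05]       # a peaked distribution
q = [1/3, 1/3, 1/3]         # the uniform distribution

# Zero only when the two distributions coincide
print(kl_divergence(p, p))

# Asymmetric: swapping the arguments gives a different value
print(kl_divergence(p, q))
print(kl_divergence(q, p))
```

Note also that if $Q(x) = 0$ at a point where $P(x) > 0$, the divergence is infinite — this is what drives the "cover the high-probability regions" behaviour described in the bullets above.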
A similar measure is the cross-entropy, which is defined as:

$$H(P, Q) = -\mathbb{E}_{x \sim P}[\ln Q(x)]$$
This measure can be thought of in the following way:
The cross-entropy can be interpreted as the number of bits per message needed (on average) to encode events drawn from the true distribution $P$, when using an optimal code designed for distribution $Q$.
Note that the cross-entropy can be written as the Shannon entropy of $P$ plus the KL divergence from $P$ to $Q$:

$$H(P, Q) = H(P) + D_{\mathrm{KL}}(P \,\|\, Q)$$
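This decomposition can be confirmed numerically. A small sketch, reusing the same discrete list-of-probabilities representation as before (all helper names are illustrative):

```python
import math

def entropy(p):
    """H(P) in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def kl_divergence(p, q):
    """D_KL(P || Q) in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def cross_entropy(p, q):
    """H(P, Q) = -E_{x~P}[ln Q(x)]."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.9, 0.05, 0.05]
q = [1/3, 1/3, 1/3]

# The identity H(P, Q) = H(P) + D_KL(P || Q) holds up to floating-point error
lhs = cross_entropy(p, q)
rhs = entropy(p) + kl_divergence(p, q)
print(abs(lhs - rhs) < 1e-12)  # True
```

This identity is why minimising cross-entropy with respect to $Q$ (as in classification losses) is equivalent to minimising $D_{\mathrm{KL}}(P \,\|\, Q)$: the $H(P)$ term does not depend on $Q$.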