
# Probability

$$\frac{\lambda^k e^{-\lambda}}{k!}$$

This is a rather arbitrary selection of probability-related stuff that I thought worth reminding myself of.

### Random Variables

Definition:
A random variable is a measurable function $X : \Omega \to E$ from a set of possible outcomes $\Omega$ to a measurable space $E$.
For our purposes, we can simplify this to:
A random variable is a function that assigns each outcome of an event to a real number: $X : \Omega \to \mathbb{R}$.
Wikipedia gives this informally as:
A random variable is a variable whose values depend on outcomes of a random phenomenon
We can then calculate probabilities of random variables equalling different values as a result of an event.
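As a concrete sketch of this idea, the snippet below (illustrative only; the names `omega` and `X` are my own) defines a random variable counting heads in two fair coin flips and computes a probability from it:

```python
from fractions import Fraction
from itertools import product

# Sample space: all ordered outcomes of two fair coin flips.
omega = list(product(["H", "T"], repeat=2))

# Random variable X: maps each outcome to a real number (the head count).
def X(outcome):
    return outcome.count("H")

# P(X = 1) = (# outcomes mapping to 1) / (# outcomes), assuming each
# outcome is equally likely.
p = Fraction(sum(1 for o in omega if X(o) == 1), len(omega))
print(p)  # 1/2
```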

### Common Probability Distributions

#### Bernoulli Distribution

Support: $k \in \{0, 1\}$
Parameters: probability of success $p \in [0, 1]$
PMF: $p^k (1-p)^{1-k}$
CDF: $0$ for $k < 0$; $1 - p$ for $0 \leq k < 1$; $1$ for $k \geq 1$
Mean: $p$
Variance: $p(1-p)$

#### Multinoulli / Categorical Distribution

Support: $k \in \{1, \dots, K\}$
Parameters:
• number of categories: $K > 0$
• event probabilities: $p_1, \dots, p_K$, where $\sum_{k=1}^{K} p_k = 1$
PMF: $p(x = k) = p_k$

#### Binomial Distribution

Support: $k \in \{0, 1, \dots, n\}$
Parameters:
• Number of trials: $n$
• Probability of success for each trial: $p \in [0, 1]$
PMF: $\binom{n}{k} p^k (1-p)^{n-k}$
Mean: $np$
Variance: $np(1-p)$
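As a quick sanity check, the binomial PMF can be evaluated directly and the mean and variance recovered numerically (a minimal sketch in plain Python; the helper name `binomial_pmf` is just for illustration):

```python
import math

def binomial_pmf(k, n, p):
    """PMF of the binomial distribution: C(n, k) p^k (1-p)^(n-k)."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 10, 0.3
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]

# Mean and variance computed from the PMF should match np and np(1-p).
mean = sum(k * pk for k, pk in enumerate(pmf))
var = sum((k - mean) ** 2 * pk for k, pk in enumerate(pmf))
print(mean, var)  # ~np = 3.0, ~np(1-p) = 2.1
```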

#### Multinomial Distribution

Support: $x_i \in \{0, \dots, n\}$, with $\sum_{i=1}^{K} x_i = n$
Parameters:
• Number of trials: $n$
• Outcome probabilities: $p_1, \dots, p_K$
PMF: $\frac{n!}{x_1! \cdots x_K!} p_1^{x_1} \cdots p_K^{x_K}$
Mean: $\mathbb{E}[X_i] = np_i$
Variance: $\mathrm{Var}(X_i) = np_i(1 - p_i)$

#### Geometric distribution

Either:
• The number of Bernoulli trials needed to get one success (support $\{1, 2, 3, \dots\}$, PMF $(1-p)^{k-1} p$)
• The number of failures before the first success (support $\{0, 1, 2, \dots\}$, PMF $(1-p)^k p$)
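A simulation makes the first convention concrete: under it, $\mathbb{E}[X] = 1/p$. This is a rough sketch (the function name and the choice $p = 0.25$ are mine):

```python
import random

random.seed(3)

# First convention: X = number of Bernoulli(p) trials needed to get one
# success, so P(X = k) = (1-p)^(k-1) p and E[X] = 1/p.
p = 0.25

def trials_until_success():
    k = 1
    while random.random() >= p:  # this trial was a failure
        k += 1
    return k

n = 50_000
mean = sum(trials_until_success() for _ in range(n)) / n
print(mean)  # ~ 1/p = 4
```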

#### Hypergeometric Distribution

The probability of $k$ successes in $n$ draws without replacement from a population of size $N$ that contains exactly $K$ successes.
Similar to the binomial distribution, except that the binomial draws with replacement. When $N$ is much larger than $n$, the binomial is often a good enough approximation.
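The approximation claim can be checked numerically. A sketch (function names and the population sizes are made up for illustration):

```python
import math

def hypergeom_pmf(k, N, K, n):
    """P(k successes in n draws without replacement from a population of
    size N containing K successes)."""
    return math.comb(K, k) * math.comb(N - K, n - k) / math.comb(N, n)

def binomial_pmf(k, n, p):
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

# When N >> n, sampling without replacement barely changes the odds on
# each draw, so Binomial(n, p = K/N) is a close approximation.
N, K, n, k = 10_000, 3_000, 10, 4
print(hypergeom_pmf(k, N, K, n))  # nearly equal to...
print(binomial_pmf(k, n, K / N))  # ...the binomial value
```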

#### Poisson Distribution

Support: $k \in \{0, 1, 2, \dots\}$
Parameters:
• Expected number of occurrences (fixed window): $\lambda > 0$
PMF: $\frac{\lambda^k e^{-\lambda}}{k!}$
CDF: $e^{-\lambda} \sum_{i=0}^{\lfloor k \rfloor} \frac{\lambda^i}{i!}$
Mean: $\lambda$
Variance: $\lambda$
Notes:
• Represents the probability of $k$ events occurring in a fixed interval, assuming these events occur with a known constant mean rate $\lambda$, and independently of the time since the last event.
• Useful approximation for the Binomial when $n$ is large and $p$ small enough to make $np$ moderate, as the Binomial CDF is hard to compute.
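The Binomial approximation in the notes above can be demonstrated directly. A minimal sketch, with made-up values of $n$ and $p$ chosen so $np = 2$:

```python
import math

def poisson_pmf(k, lam):
    """Poisson PMF: lambda^k * e^(-lambda) / k!"""
    return lam**k * math.exp(-lam) / math.factorial(k)

# Binomial(n, p) with n large and p small is close to Poisson(np).
n, p = 1000, 0.002
lam = n * p  # lambda = np = 2.0
k = 3

pois = poisson_pmf(k, lam)
binom = math.comb(n, k) * p**k * (1 - p)**(n - k)
print(pois, binom)  # the two values are nearly equal
```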

#### Normal / Gaussian Distribution

Support: $x \in \mathbb{R}$
Parameters: mean $\mu$, variance $\sigma^2$
PDF: $\frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{(x - \mu)^2}{2\sigma^2}}$
The CDF is more complex and cannot be expressed in terms of elementary functions.
The central limit theorem shows that the sum of many independent random variables is approximately normally distributed.
The following gives an interesting Bayesian interpretation of the normal distribution:
Out of all possible probability distributions with the same variance, the normal distribution encodes the maximum amount of uncertainty over the real numbers. We can thus think of the normal distribution as being the one that inserts the least amount of prior knowledge into a model.
Multivariate normal distribution, PDF: $\mathcal{N}(\boldsymbol{x}; \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \sqrt{\frac{1}{(2\pi)^n \det \boldsymbol{\Sigma}}} \exp\left(-\frac{1}{2} (\boldsymbol{x} - \boldsymbol{\mu})^\top \boldsymbol{\Sigma}^{-1} (\boldsymbol{x} - \boldsymbol{\mu})\right)$
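The central limit theorem is easy to see empirically: a sum of $n$ independent Uniform(0, 1) variables has mean $n/2$ and variance $n/12$, and its distribution is already close to normal for modest $n$. A rough simulation sketch (sample sizes are arbitrary):

```python
import random
import statistics

random.seed(0)

# Each element of `sums` is a sum of n independent Uniform(0, 1) draws;
# by the CLT these sums are approximately Normal(n/2, n/12).
n, n_samples = 30, 20_000
sums = [sum(random.random() for _ in range(n)) for _ in range(n_samples)]

print(statistics.mean(sums))      # ~ n/2  = 15
print(statistics.variance(sums))  # ~ n/12 = 2.5
```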

#### Exponential Distribution

Support: $x \in [0, \infty)$
Parameters: rate $\lambda > 0$
PDF: $\lambda e^{-\lambda x}$
CDF: $1 - e^{-\lambda x}$
One benefit of using this distribution is that it has a sharp peak at $x = 0$.
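Because the CDF has a simple closed form, exponential samples are easy to draw by inverting it. A sketch of inverse-CDF sampling (the choice $\lambda = 2$ is arbitrary):

```python
import math
import random

random.seed(2)

# Inverse-CDF sampling: if U ~ Uniform(0, 1), then x = -ln(1 - U) / lam
# follows an Exponential(lam) distribution, since F(x) = 1 - e^(-lam x).
lam = 2.0
samples = [-math.log(1 - random.random()) / lam for _ in range(50_000)]

mean = sum(samples) / len(samples)
print(mean)  # ~ 1/lam = 0.5
```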

#### Laplace Distribution

Support: $x \in \mathbb{R}$
Parameters:
• location: $\mu$
• scale: $b > 0$
PDF: $\frac{1}{2b} \exp\left(-\frac{|x - \mu|}{b}\right)$
CDF: $\frac{1}{2} \exp\left(\frac{x - \mu}{b}\right)$ if $x \leq \mu$; $1 - \frac{1}{2} \exp\left(-\frac{x - \mu}{b}\right)$ if $x \geq \mu$
This is similar to the exponential distribution, but it allows us to place the peak anywhere we wish.
It is similar to the normal distribution too, but uses an absolute difference rather than the square.

#### Dirac Distribution

If we wish to specify that all the mass in a probability distribution clusters around a single point then we can use the Dirac delta function, $\delta(x)$, which is zero-valued everywhere except $0$, yet integrates to $1$ (this is a special mathematical object called a generalised function).
PDF: $p(x) = \delta(x - \mu)$

#### Empirical Distribution

We can use the Dirac delta function with our training data, $\{x^{(1)}, \dots, x^{(m)}\}$, to define the following PDF: $\hat{p}(x) = \frac{1}{m} \sum_{i=1}^{m} \delta(x - x^{(i)})$
This concentrates all of the probability mass on the training data. In effect, this distribution represents the distribution that we sample from when we train a model on this dataset.
It is also the PDF that maximises the likelihood of the training data.
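For discrete data the empirical distribution is just a normalised count of occurrences, and sampling from it amounts to drawing training points uniformly. A small sketch with a made-up dataset:

```python
import random
from collections import Counter

random.seed(1)

# Hypothetical training set; the empirical distribution places mass 1/m on
# each training point (a Dirac delta at each x^(i), duplicates stacking up).
data = [1.0, 1.0, 2.0, 5.0]
m = len(data)

def empirical_mass(x):
    """Probability mass the empirical distribution assigns to the point x."""
    return sum(1 for xi in data if xi == x) / m

# Sampling from the empirical distribution = uniform draws from the data.
draws = Counter(random.choice(data) for _ in range(10_000))
print(empirical_mass(1.0))   # 0.5, since 1.0 appears twice out of four
print(draws[1.0] / 10_000)   # ~0.5
```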

### Structured Probabilistic Models

#### Motivation

The number of parameters in a probability distribution over $n$ random variables is exponential in $n$. Hence, using a single probability distribution over a large number of random variables can be very inefficient.
If we can factorise joint probability distributions into chains of conditional distributions, we can greatly reduce the number of parameters and computational cost.
We call these structured probabilistic models or graphical models.

#### Directed Models

Given a graph $\mathcal{G}$, we define the immediate parents of a node $x_i$ (as defined by the directed edges) as $Pa_{\mathcal{G}}(x_i)$. We can then express the factorisation as follows: $p(\boldsymbol{x}) = \prod_i p(x_i \mid Pa_{\mathcal{G}}(x_i))$
The graph itself effectively encodes a number of (mainly conditional) independence relations between random variables. Specifically, each node is conditionally independent of its non-descendants given the values of its parents. This is really what we're exploiting to gain the efficiency here.
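A tiny concrete example of this factorisation: a chain $a \to b \to c$ of binary variables, where the joint is built from small conditional tables rather than one full joint table. All probability values are made up for illustration:

```python
# Directed factorisation p(a, b, c) = p(a) p(b|a) p(c|b) for the chain
# a -> b -> c. The tables below are arbitrary illustration values.
p_a = {0: 0.6, 1: 0.4}                                    # p_a[a]
p_b_given_a = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}  # p_b_given_a[a][b]
p_c_given_b = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.5, 1: 0.5}}  # p_c_given_b[b][c]

def joint(a, b, c):
    """Factorised joint probability for the chain a -> b -> c."""
    return p_a[a] * p_b_given_a[a][b] * p_c_given_b[b][c]

# The factorisation needs 1 + 2 + 2 = 5 free parameters instead of the
# 2^3 - 1 = 7 required by a full joint table, and still sums to 1.
total = sum(joint(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1))
print(total)  # 1.0 (up to float error)
```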

#### Undirected Models

In undirected models we associate groups of nodes with a factor.
We define a clique as a set of nodes that are all connected to one another.
Each clique $\mathcal{C}^{(i)}$ in the model is then associated with a factor $\phi^{(i)}(\mathcal{C}^{(i)})$. Note that these factors are simply non-negative functions, not probability distributions.
To obtain the full joint probability distribution, we then multiply and normalise: $p(\boldsymbol{x}) = \frac{1}{Z} \prod_i \phi^{(i)}(\mathcal{C}^{(i)})$
where $Z$ is a normalising constant (i.e. the sum/integral of the product of the factors over all outcomes).
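To make the multiply-and-normalise step concrete, here is a sketch of an undirected model over three binary variables with cliques $\{x_1, x_2\}$ and $\{x_2, x_3\}$; the factor values are arbitrary non-negative numbers chosen for illustration:

```python
from itertools import product

# Factors over the cliques {x1, x2} and {x2, x3}; these are non-negative
# functions, not probability distributions.
phi_12 = {(0, 0): 2.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 4.0}
phi_23 = {(0, 0): 3.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 3.0}

def unnormalised(x1, x2, x3):
    return phi_12[(x1, x2)] * phi_23[(x2, x3)]

# Z: sum of the product of factors over all possible states.
Z = sum(unnormalised(*x) for x in product((0, 1), repeat=3))

def p(x1, x2, x3):
    """Joint probability obtained by normalising the product of factors."""
    return unnormalised(x1, x2, x3) / Z

total = sum(p(*x) for x in product((0, 1), repeat=3))
print(Z, total)  # total is 1.0 by construction
```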