
Generative Modelling


Motivation

In machine learning we generally want to come up with some kind of probabilistic representation of our data distribution.
The typical use-case here is prediction: we want to represent the distribution of outputs given an input, $p(y \mid x)$. A specific prediction can then be made from this distribution by sampling from it or taking its mode/mean.
When we model this conditional distribution $p(y \mid x)$ directly, it is known as discriminative modelling.
This is usually a good approach, but in some cases alternative strategies are worth considering.
When we model the joint distribution $p(x, y)$, it is known as generative modelling.
This generative model has more uses than the conditional distribution, as it models the distribution of the input data as well as the output data. It can still be used for prediction though, as we will demonstrate. There are prediction problems for which we may indeed prefer the generative approach to the discriminative one.

Preliminaries

Joint and Conditional Models

Here we seek to answer two questions:
  1. How do we go from the joint distribution $p(x, y)$ to the conditional distribution $p(y \mid x)$ required for prediction?
  2. How do we represent the joint distribution in the first place?
An answer to both questions comes via Bayes' rule:

$$p(y \mid x) = \frac{p(x \mid y)\,p(y)}{p(x)}$$
The numerator breaks down the joint distribution into two separate distributions: $p(x, y) = p(x \mid y)\,p(y)$. There are ways of representing the joint distribution directly, but this factorised form is typically easier to handle.
This of course requires not one but three representations that we must choose or derive: $p(x \mid y)$, $p(y)$ and $p(x)$. For now we will not worry about $p(x)$; for our purposes it suffices to note that:

$$p(x) = \sum_{y} p(x \mid y)\,p(y)$$

To approximate these functions, we want to represent them using parameterised models. We use weights $\theta$ and $\pi$ respectively:

$$p(x \mid y; \theta), \qquad p(y; \pi)$$
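To make these pieces concrete, here is a minimal NumPy sketch (the numbers and variable names are purely illustrative, not from anything above) showing how the conditional $p(y \mid x)$ is recovered from a class prior and class-conditional likelihoods via Bayes' rule:

```python
import numpy as np

# Illustrative numbers: 3 classes, and the class-conditional likelihoods
# p(x | y) evaluated at one particular input x.
prior = np.array([0.5, 0.3, 0.2])          # p(y)
likelihood = np.array([0.02, 0.10, 0.05])  # p(x | y) for this x

joint = likelihood * prior                 # p(x, y) = p(x | y) p(y)
evidence = joint.sum()                     # p(x) = sum_y p(x | y) p(y)
posterior = joint / evidence               # p(y | x) via Bayes' rule

print(posterior, posterior.sum())          # posterior sums to 1
```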

Maximum Likelihood Prediction for Generative Classifiers

Maximum Likelihood (ML) estimation is a method for deriving the parameters of a model for some dataset. It is the solution to the optimisation problem:

$$\hat{\theta} = \arg\max_{\theta} \; p(\mathcal{D}; \theta)$$

where $\mathcal{D}$ is our training data, and $p(\cdot\,; \theta)$ is our model parameterised by weights $\theta$.
In other words, we seek to find the parameters for the model which maximise the likelihood of the data occurring.
 
Given the framework for generative modelling set out above, our ML estimation problem becomes:

$$\hat{\theta}, \hat{\pi} = \arg\max_{\theta, \pi} \prod_{i=1}^{N} p(x_i \mid y_i; \theta)\, p(y_i; \pi)$$

Note that the product here is based on the standard i.i.d. assumption used for training data, meaning the probability of each datapoint can be treated independently.
Negating this decomposition, taking the logarithm and requiring the minimum rather than the maximum (none of which changes the solution), this becomes:

$$\hat{\theta}, \hat{\pi} = \arg\min_{\theta, \pi} \left[ -\sum_{i=1}^{N} \log p(x_i \mid y_i; \theta) \;-\; \sum_{i=1}^{N} \log p(y_i; \pi) \right]$$
The two terms in this equation can thus be treated independently. The first is more challenging, and we will examine two strategies for modelling and solving this class-conditional feature distribution in subsequent sections. For now we will focus on the simpler second term.
We are going to focus here on the case where our output is a categorical variable, modelled by a categorical (multinoulli) distribution with $K$ categories. In this case we have simply $p(y = k; \pi) = \pi_k$. This gives the following optimisation problem:

$$\hat{\pi} = \arg\min_{\pi} \left( -\sum_{k=1}^{K} N_k \log \pi_k \right) \quad \text{subject to} \quad \sum_{k=1}^{K} \pi_k = 1$$

where $N_k$ is the number of datapoints of class $k$. We can solve for $\pi$ by taking the Lagrangian and setting its gradient to zero:

$$\mathcal{L}(\pi, \lambda) = -\sum_{k=1}^{K} N_k \log \pi_k + \lambda \left( \sum_{k=1}^{K} \pi_k - 1 \right), \qquad \frac{\partial \mathcal{L}}{\partial \pi_k} = -\frac{N_k}{\pi_k} + \lambda = 0 \;\Rightarrow\; \pi_k = \frac{N_k}{\lambda}$$

Applying the constraint $\sum_k \pi_k = 1$ gives $\lambda = N$, and hence $\hat{\pi}_k = N_k / N$.
This is all to tell us that the MLE for the class probability is simply the fraction of the class present in the training data! We would probably have assumed this to be the case anyway, but now we have proof!
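As a quick sanity check, here is a small NumPy sketch of this class-prior MLE (the labels are illustrative):

```python
import numpy as np

# Illustrative labels for N = 10 training points with K = 3 classes.
y = np.array([0, 0, 1, 2, 1, 0, 2, 2, 2, 0])

K = 3
N_k = np.bincount(y, minlength=K)  # counts per class
pi_hat = N_k / len(y)              # MLE: fraction of each class in the data

print(pi_hat)                      # [0.4 0.2 0.4]
```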
All that remains is modelling the class-conditional feature distribution. We will examine two approaches to this: Naive Bayes and Discriminant Analysis.

Classification

Given some model for the class-conditional feature distribution, we can now compare classes pairwise to see which is more probable. Given classes $j$ and $k$, we predict class $j$ over class $k$ if:

$$p(x \mid y = j)\,p(y = j) > p(x \mid y = k)\,p(y = k)$$

or equivalently, in terms of negative log-probabilities,

$$-\log p(x \mid y = j) - \log p(y = j) < -\log p(x \mid y = k) - \log p(y = k)$$

Based on this we can make our classification by choosing the class with the lowest discriminant function $\delta_k$ evaluated on $x$:

$$\hat{y} = \arg\min_{k} \delta_k(x), \qquad \delta_k(x) = -\log p(x \mid y = k; \theta) - \log p(y = k; \pi)$$
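A minimal sketch of this decision rule, assuming we already have functions returning $\log p(x \mid y = k)$ and the log priors (all names and distributions here are illustrative):

```python
import numpy as np
from scipy.stats import norm

def predict(x, log_likelihood_fns, log_prior):
    """Pick the class with the lowest discriminant
    delta_k(x) = -log p(x | y=k) - log p(y=k)."""
    deltas = np.array([-f(x) - lp for f, lp in zip(log_likelihood_fns, log_prior)])
    return int(np.argmin(deltas))

# Illustrative toy example: two 1-D Gaussian class-conditionals.
log_likelihood_fns = [norm(0.0, 1.0).logpdf, norm(3.0, 1.0).logpdf]
log_prior = np.log([0.7, 0.3])

print(predict(1.0, log_likelihood_fns, log_prior))  # -> 0
print(predict(2.8, log_likelihood_fns, log_prior))  # -> 1
```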

The Naive Bayes Classifier

The Naive Bayes classifier uses the generative framework outlined above. It adds to this a particular assumption for deriving the class-conditional feature distribution (note that we will assume from this point onwards that $x = (x_1, \dots, x_D)$ represents a vector of $D$ features).
This "naive Bayes" assumption is that all features in $x$ are mutually independent given the category $y$. This allows us to represent the joint probability distribution as:

$$p(x, y = k) = p(y = k) \prod_{d=1}^{D} p(x_d \mid y = k)$$
This again makes our modelling problem simpler: we now only need to represent the probability of a single feature, given its class.
One useful advantage of naive Bayes is that we can easily mix-and-match continuous and categorical input data. We will consider each case separately here, but combining them is straightforward.
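Since the factorisation is just a sum of per-feature log-probabilities, mixing feature types amounts to adding their log-likelihood contributions. A small sketch under illustrative, made-up parameters for one categorical and one Gaussian feature:

```python
import numpy as np
from scipy.stats import norm

# Illustrative per-class parameters (not derived from anything above).
theta_cat = np.array([[0.7, 0.2, 0.1],   # p(x_cat = v | y = 0)
                      [0.1, 0.3, 0.6]])  # p(x_cat = v | y = 1)
mu, sigma = np.array([0.0, 2.0]), np.array([1.0, 1.0])  # Gaussian feature per class
log_pi = np.log([0.5, 0.5])

x_cat, x_cont = 2, 1.7  # one mixed-type datapoint

# -log p(x, y=k) = -log pi_k - log p(x_cat | k) - log p(x_cont | k)
neg_log_joint = -(log_pi + np.log(theta_cat[:, x_cat]) + norm.logpdf(x_cont, mu, sigma))
print(np.argmin(neg_log_joint))  # -> 1
```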

Categorical Naive Bayes

Simple presentation
If we continue as normal, the MLE for a class-conditional categorical feature can be calculated in the same way as above:

$$\hat{\theta}_{d,k,v} = \frac{N_{d,k,v}}{N_k}$$

i.e. the proportion of datapoints of class $k$ that have label $v$ for feature $d$. Our joint probability distribution becomes:

$$p(x, y = k) = \pi_k \prod_{d=1}^{D} \theta_{d,k,x_d}$$

or equivalently:

$$p(x, y = k) = \pi_k \prod_{d=1}^{D} \prod_{v} \theta_{d,k,v}^{\mathbb{1}[x_d = v]}$$
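A minimal NumPy sketch of these count-based estimates, assuming integer-encoded categorical features (the data and shapes are illustrative):

```python
import numpy as np

# Illustrative integer-encoded data: N = 6 points, D = 2 categorical features.
X = np.array([[0, 1],
              [1, 1],
              [0, 0],
              [2, 1],
              [0, 0],
              [1, 1]])
y = np.array([0, 0, 0, 1, 1, 1])

K, D, V = 2, X.shape[1], 3   # classes, features, values per feature

N_k = np.bincount(y, minlength=K)
theta = np.zeros((D, K, V))  # theta[d, k, v] = p(x_d = v | y = k)
for d in range(D):
    for k in range(K):
        counts = np.bincount(X[y == k, d], minlength=V)  # N_{d,k,v}
        theta[d, k] = counts / N_k[k]                    # N_{d,k,v} / N_k

print(theta[0, 0])  # p(x_0 = v | y = 0) for each value v
```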
Histogram-based presentation
The above formulation allows us to model the joint distribution, and hence make predictions etc. However, it is not in the most useful form. For a categorical data distribution we often want to be able to answer questions like: "What is the probability of drawing more than 3 red balls in 5 draws?" Work is needed to derive an answer to this sort of question using the above form.
For this reason, we often present our model and solution for a single feature as being over a histogram $x = (x_1, \dots, x_K)$ of multiple datapoints and all classes, where each $x_k$ represents the number of events of class $k$ observed.
Using this approach, the natural model for $p(x)$ is a multinomial distribution:

$$p(x; \theta, N) = \frac{N!}{x_1! \cdots x_K!} \prod_{k=1}^{K} \theta_k^{x_k}$$

where $N = \sum_{k=1}^{K} x_k$.
This is fundamentally no different from the simple presentation, but if we put our model for the features in terms of histograms it allows us to answer more natural questions about our data model.
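For example, the ball-drawing question above can be answered directly from the multinomial model by summing the pmf over all qualifying histograms. A sketch using scipy (the colour probabilities are illustrative):

```python
import numpy as np
from scipy.stats import multinomial

# Illustrative urn: p(red, green, blue), e.g. estimated from training counts.
theta = np.array([0.5, 0.3, 0.2])
n_draws = 5

# P(more than 3 red balls in 5 draws) = sum of multinomial pmfs over all
# histograms (red, green, blue) with red > 3 and red + green + blue = 5.
prob = 0.0
for red in range(4, n_draws + 1):
    for green in range(n_draws - red + 1):
        blue = n_draws - red - green
        prob += multinomial.pmf([red, green, blue], n=n_draws, p=theta)

print(prob)  # 0.1875, matching the binomial marginal P(red > 3)
```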

Gaussian Naive Bayes

This is similar to our simple presentation above, but in this case we assume conditionally independent continuous features which we choose to model using a Gaussian distribution. We thus have:

$$p(x \mid y = k) = \prod_{d=1}^{D} \mathcal{N}(x_d; \mu_{d,k}, \sigma_{d,k}^2)$$

where we estimate the mean $\mu_{d,k}$ and variance $\sigma_{d,k}^2$ for each feature-class pair from the training data.
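A minimal sketch of fitting and scoring a Gaussian naive Bayes model with NumPy, under synthetic illustrative data:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Illustrative data: N = 100 points, D = 2 continuous features, K = 2 classes.
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
K = 2

# MLE of per feature-class mean and variance, plus class priors.
mu = np.array([X[y == k].mean(axis=0) for k in range(K)])   # shape (K, D)
var = np.array([X[y == k].var(axis=0) for k in range(K)])   # shape (K, D)
log_pi = np.log(np.bincount(y, minlength=K) / len(y))

def predict(x):
    # delta_k(x) = -sum_d log N(x_d; mu_dk, var_dk) - log pi_k; pick the lowest.
    log_lik = norm.logpdf(x, loc=mu, scale=np.sqrt(var)).sum(axis=1)
    return int(np.argmin(-log_lik - log_pi))

print(predict(np.array([0.1, -0.2])), predict(np.array([2.9, 3.2])))  # 0 1
```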

Linear Discriminant Analysis

If we have continuous features but don't make the naive Bayes assumption, we're forced to model the joint class-conditional distribution over the features. When we do this using a multivariate Gaussian distribution, it is known as linear discriminant analysis.
In this case we have $p(x \mid y = k) = \mathcal{N}(x; \mu_k, \Sigma)$. Note that for now we assume a common covariance matrix $\Sigma$ across all classes. We can estimate this from the data using the within-class scatter matrix $S = \sum_{i=1}^{N} (x_i - \hat{\mu}_{y_i})(x_i - \hat{\mu}_{y_i})^\top$, giving $\hat{\Sigma} = S / N$.
Plugging this into Bayes' rule, taking the logarithm and dropping terms that are the same for every class, we get the discriminant functions $\delta_k$:

$$\delta_k(x) = \frac{1}{2}\mu_k^\top \Sigma^{-1} \mu_k - x^\top \Sigma^{-1} \mu_k - \log \pi_k$$

This is known as linear discriminant analysis because the discriminant function is linear in $x$.
💡
I believe the decision boundary here is equivalent to using a linear model directly on one-hot-over-$K$ encoded output vectors, without a sigmoid layer. The linear term in this case appears identical, but I can't quite make the log term cancel out. It is suggested here that the two decision boundaries are indeed the same. A little more work is needed here for the maths to make sense.
If we assume each class has its own covariance matrix, then the decision boundaries become quadratic curves. This is known as quadratic discriminant analysis.
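Returning to the shared-covariance (LDA) case, here is a minimal NumPy sketch of the fit and the lowest-discriminant decision rule, on illustrative synthetic data:

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative data: two classes in 2-D with a shared covariance.
X = np.vstack([rng.normal([0, 0], 1.0, (50, 2)), rng.normal([3, 2], 1.0, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
K, N = 2, len(y)

pi_hat = np.bincount(y, minlength=K) / N
mu_hat = np.array([X[y == k].mean(axis=0) for k in range(K)])

# Shared covariance estimated from the within-class scatter matrix.
centred = X - mu_hat[y]
sigma_hat = centred.T @ centred / N
sigma_inv = np.linalg.inv(sigma_hat)

def discriminants(x):
    # delta_k(x) = 0.5 mu_k' S^-1 mu_k - x' S^-1 mu_k - log pi_k  (linear in x)
    return np.array([0.5 * m @ sigma_inv @ m - x @ sigma_inv @ m - np.log(p)
                     for m, p in zip(mu_hat, pi_hat)])

x_new = np.array([2.5, 1.5])
print(np.argmin(discriminants(x_new)))  # predicted class for x_new
```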