
# 1. Introduction

Below are the key ideas and equations in Chapter 1: Introduction of Probabilistic Machine Learning: An Introduction.
This is intended as a concise reference to accompany the text, and as the basis of an Anki flashcard deck
(see my explanation of the benefits of using Anki here:
📗 *Probabilistic Machine Learning: An Introduction*)


## 1.2 Supervised Learning

### 1.2.1 Classification

#### 1.2.1.4 Empirical risk minimization

What is the definition of empirical risk? (in words)
The average loss calculated over the training set.
What is the definition of empirical risk? (equation)
$$\mathcal{L}(\boldsymbol{\theta}) = \frac{1}{N}\sum_{n=1}^{N} \ell(y_n, f(\boldsymbol{x}_n; \boldsymbol{\theta}))$$
where $\ell$ is our loss function.
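The definition above can be sketched in a few lines of Python. This is a minimal illustration using a zero-one loss and a toy threshold classifier (both hypothetical, not from the text):

```python
def zero_one_loss(y_true, y_pred):
    """Loss is 1 for a misclassification, 0 otherwise."""
    return 0 if y_true == y_pred else 1

def empirical_risk(loss, predict, data):
    """Average loss of `predict` over (x, y) training pairs."""
    return sum(loss(y, predict(x)) for x, y in data) / len(data)

# Toy classifier: predict class 1 if x > 0, else class 0 (hypothetical).
predict = lambda x: 1 if x > 0 else 0
data = [(-2, 0), (-1, 0), (1, 1), (3, 0)]  # the last point is misclassified
print(empirical_risk(zero_one_loss, predict, data))  # 0.25
```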

#### 1.2.1.5 Uncertainty

Predictive error due to lack of knowledge is known as what? (two terms)
1. Epistemic uncertainty
1. Model uncertainty
Predictive error due to intrinsic stochasticity is known as what? (two terms)
1. Aleatoric uncertainty
1. Data uncertainty
What is the purpose of the softmax function?
It converts a vector of real values into a vector of probabilities.
What is the definition of the softmax function? (equation)
$$\text{softmax}(\boldsymbol{a})_c = \frac{e^{a_c}}{\sum_{c'=1}^{C} e^{a_{c'}}}$$
What is the definition of a logistic regression model? (equation)
$$p(y \mid \boldsymbol{x}; \boldsymbol{\theta}) = \text{Cat}(y \mid \text{softmax}(\mathbf{W}\boldsymbol{x} + \boldsymbol{b}))$$
where $\text{softmax}$ is the softmax function.
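A minimal softmax sketch in pure Python (the max-subtraction is a standard numerical-stability trick, not part of the definition):

```python
import math

def softmax(a):
    """Map a vector of reals to a probability vector.
    Subtracting max(a) leaves the result unchanged but avoids overflow."""
    m = max(a)
    exps = [math.exp(x - m) for x in a]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print(probs)       # roughly [0.66, 0.24, 0.10]
print(sum(probs))  # sums to 1 (up to floating point)
```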

#### 1.2.1.6 Maximum likelihood estimation

What is the common choice of loss function for probabilistic models? (in words)
The negative log probability.
What is the common choice of loss function for probabilistic models? (equation)
$$\ell(y, f(\boldsymbol{x}; \boldsymbol{\theta})) = -\log p(y \mid f(\boldsymbol{x}; \boldsymbol{\theta}))$$
What is the definition of negative log likelihood (NLL)? (in words)
The empirical risk, using negative log loss.
What is the definition of negative log likelihood (NLL)? (equation)
$$\text{NLL}(\boldsymbol{\theta}) = -\frac{1}{N}\sum_{n=1}^{N} \log p(y_n \mid f(\boldsymbol{x}_n; \boldsymbol{\theta}))$$
What is the definition of the maximum likelihood estimate (MLE)? (equation)
$$\hat{\boldsymbol{\theta}}_{\text{mle}} = \operatorname*{argmin}_{\boldsymbol{\theta}} \text{NLL}(\boldsymbol{\theta})$$
where $\text{NLL}(\boldsymbol{\theta})$ is the negative log likelihood.
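As a concrete sketch, for a Bernoulli model the MLE has the closed form of the sample mean, and the NLL is minimised there. The coin-flip data below is hypothetical:

```python
import math

def nll(theta, ys):
    """Average negative log likelihood of Bernoulli data under parameter theta."""
    return -sum(math.log(theta if y == 1 else 1 - theta) for y in ys) / len(ys)

ys = [1, 0, 1, 1]            # toy coin-flip data (hypothetical)
mle = sum(ys) / len(ys)      # closed-form Bernoulli MLE: the sample mean

# The NLL at the MLE is no larger than at other parameter values:
print(nll(mle, ys) <= nll(0.5, ys))  # True
print(nll(mle, ys) <= nll(0.9, ys))  # True
```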

### 1.2.2 Regression

What is the difference between classification and regression?
For the former the output is a categorical class label, whereas for the latter it's real-valued.
What is the definition of mean squared error (MSE)? (in words)
The empirical risk, using quadratic loss.
What is the definition of mean squared error (MSE)? (equation)
$$\text{MSE}(\boldsymbol{\theta}) = \frac{1}{N}\sum_{n=1}^{N} \left(y_n - f(\boldsymbol{x}_n; \boldsymbol{\theta})\right)^2$$
How are the NLL and MSE linked?
If we assume our predictions have fixed-variance Gaussian noise and compute the NLL, it equals the MSE up to a scale factor and an additive constant.
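This link can be checked numerically. A minimal sketch assuming unit-variance Gaussian noise, with hypothetical predictions and targets:

```python
import math

def gaussian_nll(preds, ys, sigma2=1.0):
    """Average NLL assuming y_n ~ N(f(x_n), sigma^2) with fixed variance."""
    return sum(0.5 * math.log(2 * math.pi * sigma2)
               + (y - f) ** 2 / (2 * sigma2)
               for f, y in zip(preds, ys)) / len(ys)

def mse(preds, ys):
    return sum((y - f) ** 2 for f, y in zip(preds, ys)) / len(ys)

preds, ys = [1.0, 2.0, 3.0], [1.5, 1.5, 3.5]
# With sigma^2 = 1: NLL = MSE/2 + constant, i.e. they differ only by an affine map.
const = 0.5 * math.log(2 * math.pi)
print(abs(gaussian_nll(preds, ys) - (mse(preds, ys) / 2 + const)) < 1e-12)  # True
```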

#### 1.2.2.2 Polynomial regression

What is the definition of a polynomial regression model of degree $D$? (equation)
$$f(x; \boldsymbol{w}) = \boldsymbol{w}^\top \boldsymbol{\phi}(x), \quad \boldsymbol{\phi}(x) = [1, x, x^2, \ldots, x^D]$$
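A minimal sketch of the feature expansion and prediction (the weights below are hypothetical, chosen only to illustrate evaluation):

```python
def poly_features(x, degree):
    """Feature vector phi(x) = [1, x, x**2, ..., x**degree]."""
    return [x ** d for d in range(degree + 1)]

def predict(w, x):
    """f(x; w) = w^T phi(x)."""
    return sum(wi * phi for wi, phi in zip(w, poly_features(x, len(w) - 1)))

w = [1.0, 0.0, 2.0]      # hypothetical weights: f(x) = 1 + 2x^2
print(predict(w, 3.0))   # 19.0
```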

### 1.2.3 Overfitting and generalization

What is the definition of the population risk? (equation)
$$\mathcal{L}(\boldsymbol{\theta}; p^*) = \mathbb{E}_{p^*(\boldsymbol{x}, y)}\left[\ell(y, f(\boldsymbol{x}; \boldsymbol{\theta}))\right]$$
where $p^*$ is the true (but unknown) distribution used to generate the training set.
What is the definition of the generalization gap? (verbal)
The population risk minus the empirical risk.
How can we frame overfitting in terms of the generalization gap?
It is present if a model has a large generalization gap.

### 1.2.4 No free lunch theorem

What is the premise of the no free lunch theorem?
There is no single best model that works optimally for all kinds of problems.

## 1.3 Unsupervised learning

What distributions are supervised and unsupervised learning trying to model?
These approaches model $p(y \mid \boldsymbol{x})$ and $p(\boldsymbol{x})$ respectively.

### 1.3.2 Discovering latent “factors of variation”

What is a latent factor?
A hidden low-dimensional variable, from which the observed high-dimensional variable is generated.
What is the definition of a factor analysis (FA) model? (equation)
$$p(\boldsymbol{x} \mid \boldsymbol{z}) = \mathcal{N}(\boldsymbol{x} \mid \mathbf{W}\boldsymbol{z} + \boldsymbol{\mu}, \boldsymbol{\Sigma})$$
where $\boldsymbol{z}$ are the latent factors.

### 1.3.3 Self-supervised learning

What is self-supervised learning?
A form of unsupervised learning that trains on 'proxy' supervised tasks, created from the unlabelled data.

## 1.4 Reinforcement learning

In reinforcement learning, what is a policy?
A model which specifies which action to take in response to each possible input.

## 1.5 Data

### 1.5.3 Preprocessing discrete input data

#### 1.5.3.1 One-hot encoding

What is the definition of a one-hot encoding for a variable that takes one of $K$ categorical values? (equation)
$$\text{one-hot}(y) = [\mathbb{I}(y=1), \ldots, \mathbb{I}(y=K)]$$
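A one-line Python sketch of this encoding:

```python
def one_hot(y, K):
    """Encode a categorical value y in {1, ..., K} as a K-dim indicator vector."""
    return [1 if y == k else 0 for k in range(1, K + 1)]

print(one_hot(2, 4))  # [0, 1, 0, 0]
```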

#### 1.5.3.2 Feature crosses

What is the definition of feature crosses? (verbal)
One-hot encoding over every possible combination of values, given multiple categorical values.
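A minimal sketch of a feature cross over two hypothetical categorical variables (a "colour" with 2 values and a "size" with 3 values gives 6 crossed features):

```python
from itertools import product

def feature_cross(categories_a, categories_b, a, b):
    """One-hot encode the pair (a, b) over all combinations
    of the two categorical variables."""
    combos = list(product(categories_a, categories_b))
    return [1 if (a, b) == combo else 0 for combo in combos]

v = feature_cross(["red", "blue"], ["S", "M", "L"], "blue", "M")
print(v)       # [0, 0, 0, 0, 1, 0]
print(sum(v))  # exactly one active crossed feature
```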

### 1.5.4 Preprocessing text data

#### 1.5.4.1 Bag of words model

What is a bag of words model?
A representation of a text document where we ignore word order.
What is stop word removal?
The dropping of common but uninformative words (e.g. "the", "and").
What is word stemming?
Replacing words with their base form (e.g. “running” → “run”)
What is the definition of a vector space model of text? (verbal)
Given a vocabulary of $D$ tokens, it encodes a document into a $D$-dimensional vector where each element indicates the frequency of the corresponding token.
What is the definition of a term frequency (TF) matrix of a text dataset? (verbal)
A matrix where each entry $\text{TF}_{ij}$ is the frequency of term $i$ in document $j$.
What is the definition of inverse document frequency (IDF)? (equation)
$$\text{IDF}_i = \log \frac{N}{1 + \text{DF}_i}$$
where $\text{DF}_i$ is the number of documents with term $i$, and $N$ is the total number of documents.

#### 1.5.4.2 TF-IDF

What does TF-IDF stand for?
It stands for term frequency-inverse document frequency.
What is the definition of the term frequency-inverse document frequency (TF-IDF)? (equation)
$$\text{TFIDF}_{ij} = \text{TF}_{ij} \times \text{IDF}_i$$
where $\text{TF}_{ij}$ is the frequency of term $i$ in document $j$, and $\text{IDF}_i$ is the inverse document frequency.
(we often normalise each row as well)
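The TF and IDF definitions can be combined in a short sketch, using $\text{IDF}_i = \log(N / (1 + \text{DF}_i))$ as above. The tokenised toy documents are hypothetical, and row normalisation is omitted for brevity:

```python
import math

def tf_idf(docs):
    """Compute a TF-IDF matrix (one row per document) for tokenised documents."""
    vocab = sorted({t for doc in docs for t in doc})
    n = len(docs)
    df = {t: sum(t in doc for doc in docs) for t in vocab}      # document frequency
    idf = {t: math.log(n / (1 + df[t])) for t in vocab}         # inverse doc frequency
    return [[doc.count(t) * idf[t] for t in vocab] for doc in docs], vocab

docs = [["cat", "sat"], ["dog", "sat"], ["dog", "ran"]]
matrix, vocab = tf_idf(docs)
# Distinctive terms ("cat") get positive weight; ubiquitous ones ("sat") get ~0.
print(matrix[0][vocab.index("cat")] > 0)  # True
```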

#### 1.5.4.3 Word embeddings

What is a word embedding?
A mapping from a high-dimensional one-hot vector $\boldsymbol{s}$, to a lower-dimensional dense vector $\boldsymbol{e}$, via multiplication by (i.e. indexing into) an embedding matrix $\mathbf{E}$:
$$\boldsymbol{e} = \mathbf{E}\boldsymbol{s}$$
What is the definition of a bag of word embeddings? (verbal)
The sum of the word embeddings of each token in a document.
What is the definition of a bag of word embeddings? (equation)
$$\bar{\boldsymbol{e}} = \sum_{t} \boldsymbol{e}_t = \mathbf{E}\tilde{\boldsymbol{s}}$$
where $\mathbf{E}$ is the embedding matrix, and $\tilde{\boldsymbol{s}}$ is the vector space model of the document.
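A minimal sketch of this matrix-vector view, with a hypothetical 2-dimensional embedding for a 3-word vocabulary:

```python
def embed(E, s):
    """Multiply the K x V embedding matrix E by a V-dim (count) vector s."""
    return [sum(E[k][v] * s[v] for v in range(len(s))) for k in range(len(E))]

E = [[1.0, 0.0, 2.0],   # hypothetical embedding matrix (K=2, V=3)
     [0.0, 1.0, 3.0]]
s_tilde = [1, 0, 2]     # bag-of-words counts: word 0 once, word 2 twice
print(embed(E, s_tilde))  # [5.0, 6.0] = embedding of word 0 + 2 * embedding of word 2
```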

### 1.5.5 Handling missing data

Name the three kinds of missing data.
1. Missing completely at random (MCAR)
1. Missing at random (MAR)
1. Not missing at random (NMAR)
What is the definition of missing completely at random (MCAR)? (verbal)
The 'missingness' of data does not depend on the hidden or observed features.
What is the definition of missing at random (MAR)? (verbal)
The 'missingness' of data does not depend on the hidden features, but may depend on the observed features.
What is the definition of not missing at random (NMAR)? (verbal)
The 'missingness' of data depends on the hidden features.
What must we do if we have not missing at random (NMAR) data?
If this is the case, we model the missing data mechanism, since the lack of information may be informative.
What is mean value imputation?
Replacing missing values by their empirical mean.
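A minimal sketch of mean value imputation on a single feature column, with `None` standing in for a missing value (a common but here hypothetical convention):

```python
def mean_impute(column):
    """Replace missing values (None) by the empirical mean of the observed values."""
    observed = [x for x in column if x is not None]
    mean = sum(observed) / len(observed)
    return [mean if x is None else x for x in column]

print(mean_impute([1.0, None, 3.0, None, 5.0]))  # [1.0, 3.0, 3.0, 3.0, 5.0]
```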