### About

Below are the key ideas and equations in *Chapter 1: Introduction* of *Probabilistic Machine Learning: An Introduction*. This is intended as a **concise reference** to accompany the text, and as the basis of an **Anki flashcard deck** (see my explanation of the benefits of using Anki here: Probabilistic Machine Learning: An Introduction).

### Links

⬇️ (TODO: Anki download)

⬇️ Download PDF

😸 Github

### Table of contents

- 1.2 Supervised Learning
    - 1.2.1 Classification
        - 1.2.1.4 Empirical risk minimization
        - 1.2.1.5 Uncertainty
        - 1.2.1.6 Maximum likelihood estimation
    - 1.2.2 Regression
        - 1.2.2.2 Polynomial regression
    - 1.2.3 Overfitting and generalization
    - 1.2.4 No free lunch theorem
- 1.3 Unsupervised learning
    - 1.3.2 Discovering latent “factors of variation”
    - 1.3.3 Self-supervised learning
- 1.4 Reinforcement learning
- 1.5 Data
    - 1.5.3 Preprocessing discrete input data
        - 1.5.3.1 One-hot encoding
        - 1.5.3.2 Feature crosses
    - 1.5.4 Preprocessing text data
        - 1.5.4.1 Bag of words model
        - 1.5.4.2 TF-IDF
        - 1.5.4.3 Word embeddings
    - 1.5.5 Handling missing data

Press `cmd/ctrl` + `option/alt` + `t` to expand/close all the toggle lists at once (only on desktop).

## 1.2 Supervised Learning

### 1.2.1 Classification

#### 1.2.1.4 Empirical risk minimization

## What is the definition of **empirical risk**? (in words)

The average loss calculated over the training set.

## What is the definition of **empirical risk**? (equation)

$$\mathcal{L}(\boldsymbol{\theta}) = \frac{1}{N} \sum_{n=1}^{N} \ell(y_n, f(\boldsymbol{x}_n; \boldsymbol{\theta}))$$

where $\ell$ is our loss function.
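
A minimal NumPy sketch of this average, using an illustrative 0-1 loss (the function names and data are my own, not from the text):

```python
import numpy as np

def empirical_risk(loss, y, y_pred):
    """Average loss over the training set: (1/N) * sum of per-example losses."""
    return np.mean([loss(yn, yp) for yn, yp in zip(y, y_pred)])

# Illustrative 0-1 loss for classification: 1 if the prediction is wrong.
zero_one = lambda y, yhat: float(y != yhat)

y_true = np.array([0, 1, 1, 0])
y_hat = np.array([0, 1, 0, 0])
risk = empirical_risk(zero_one, y_true, y_hat)  # 1 mistake out of 4 -> 0.25
```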

#### 1.2.1.5 Uncertainty

## Predictive error due to **lack of knowledge** is known as what? (two terms)

- Epistemic uncertainty

- Model uncertainty

## Predictive error due to **intrinsic stochasticity** is known as what? (two terms)

- Aleatoric uncertainty

- Data uncertainty

## What is the purpose of the **softmax** function?

It converts a vector of real values into a vector of probabilities.

## What is the definition of the **softmax** function? (equation)

$$\text{softmax}(\boldsymbol{a})_c = \frac{e^{a_c}}{\sum_{c'=1}^{C} e^{a_{c'}}}$$
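
A minimal NumPy sketch (function and input names are my own):

```python
import numpy as np

def softmax(a):
    """Convert a vector of real values into a vector of probabilities."""
    z = np.exp(a - np.max(a))  # subtract the max for numerical stability
    return z / np.sum(z)

p = softmax(np.array([2.0, 1.0, 0.1]))
# p is non-negative, sums to 1, and preserves the ordering of the inputs
```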

## What is the definition of a **logistic regression** model? (equation)

$$f(\boldsymbol{x}; \boldsymbol{\theta}) = \mathcal{S}(\boldsymbol{W}\boldsymbol{x} + \boldsymbol{b})$$

where $\mathcal{S}$ is the softmax function.

#### 1.2.1.6 Maximum likelihood estimation

## What is the common choice of loss function for probabilistic models? (in words)

The negative log probability.

## What is the common choice of loss function for probabilistic models? (equation)

$$\ell(y, f(\boldsymbol{x}; \boldsymbol{\theta})) = -\log p(y \mid f(\boldsymbol{x}; \boldsymbol{\theta}))$$

## What is the definition of **negative log likelihood (NLL)**? (in words)

The empirical risk, using negative log loss.

## What is the definition of **negative log likelihood (NLL)**? (equation)

$$\text{NLL}(\boldsymbol{\theta}) = -\frac{1}{N} \sum_{n=1}^{N} \log p(y_n \mid f(\boldsymbol{x}_n; \boldsymbol{\theta}))$$

## What is the definition of the **maximum likelihood estimate (MLE)**? (equation)

$$\hat{\boldsymbol{\theta}}_{\text{mle}} = \operatorname*{argmin}_{\boldsymbol{\theta}} \text{NLL}(\boldsymbol{\theta})$$

where $\text{NLL}$ is the negative log likelihood.

### 1.2.2 Regression

## What is the difference between classification and regression?

In classification the target is a categorical class label, whereas in regression it is real-valued.

## What is the definition of **mean squared error (MSE)**? (in words)

The empirical risk, using quadratic loss.

## What is the definition of **mean squared error (MSE)**? (equation)

$$\text{MSE}(\boldsymbol{\theta}) = \frac{1}{N} \sum_{n=1}^{N} (y_n - f(\boldsymbol{x}_n; \boldsymbol{\theta}))^2$$

## How are the NLL and MSE linked?

If we assume our predictions have Gaussian noise with fixed variance, the NLL equals the MSE up to a scale factor and an additive constant, so minimising one minimises the other.
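
A small numerical check of this link, assuming unit-variance Gaussian noise (the data here is synthetic):

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(size=100)   # illustrative targets
f = rng.normal(size=100)   # illustrative predictions

mse = np.mean((y - f) ** 2)

# NLL of y under N(f, 1): 0.5*(y - f)^2 + 0.5*log(2*pi) per example.
nll = np.mean(0.5 * (y - f) ** 2 + 0.5 * np.log(2 * np.pi))

# NLL = 0.5 * MSE + constant, so both are minimised by the same parameters.
assert np.isclose(nll, 0.5 * mse + 0.5 * np.log(2 * np.pi))
```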

#### 1.2.2.2 Polynomial regression

## What is the definition of a **polynomial regression** model, of degree $D$? (equation)

$$f(x; \boldsymbol{w}) = \boldsymbol{w}^\top \boldsymbol{\phi}(x), \quad \boldsymbol{\phi}(x) = [1, x, x^2, \ldots, x^D]$$
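
A sketch of fitting such a model by least squares (the data and degree are illustrative):

```python
import numpy as np

def poly_features(x, degree):
    """phi(x) = [1, x, x^2, ..., x^D] for each scalar input."""
    return np.stack([x ** d for d in range(degree + 1)], axis=1)

x = np.linspace(-1, 1, 50)
y = 1.0 + 2.0 * x + 3.0 * x ** 2  # noiseless quadratic for illustration

Phi = poly_features(x, degree=2)
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)  # fit weights by least squares
# w recovers approximately [1, 2, 3]
```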

### 1.2.3 Overfitting and generalization

## What is the definition of the **population risk**? (equation)

$$\mathcal{L}(\boldsymbol{\theta}; p^*) = \mathbb{E}_{p^*(\boldsymbol{x}, y)} \left[ \ell(y, f(\boldsymbol{x}; \boldsymbol{\theta})) \right]$$

where $p^*(\boldsymbol{x}, y)$ is the true (but unknown) distribution used to generate the training set.

## What is the definition of the **generalization gap**? (verbal)

The population risk minus the empirical risk.

## How can we frame **overfitting** in terms of the generalization gap?

It is present if a model has a large generalization gap.

### 1.2.4 No free lunch theorem

## What is the premise of the **no free lunch** theorem?

There is no single best model that works optimally for all kinds of problems.

## 1.3 Unsupervised learning

## What distributions are supervised and unsupervised learning trying to model?

These approaches model $p(y \mid \boldsymbol{x})$ and $p(\boldsymbol{x})$ respectively.

### 1.3.2 Discovering latent “factors of variation”

## What is a **latent factor?**

A hidden low-dimensional variable, from which the observed high-dimensional variable is generated.

## What is the definition of a **factor analysis** (FA) model? (equation)

$$p(\boldsymbol{x} \mid \boldsymbol{z}) = \mathcal{N}(\boldsymbol{x} \mid \boldsymbol{W}\boldsymbol{z} + \boldsymbol{\mu}, \boldsymbol{\Sigma})$$

where $\boldsymbol{z}$ are the latent factors.

### 1.3.3 Self-supervised learning

## What is **self-supervised learning?**

A form of unsupervised learning that trains on 'proxy' supervised tasks, created from the unlabelled data.

## 1.4 Reinforcement learning

## In reinforcement learning, what is a **policy**?

A model which specifies which action to take in response to each possible input.

## 1.5 Data

### 1.5.3 Preprocessing discrete input data

#### 1.5.3.1 One-hot encoding

## What is the definition of a **one-hot encoding** for a variable that takes $K$ categorical values? (equation)

$$\text{one-hot}(x) = [\mathbb{I}(x = 1), \ldots, \mathbb{I}(x = K)]$$
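
A minimal sketch, assuming categories are numbered 1 to K:

```python
import numpy as np

def one_hot(x, K):
    """Encode category x (in 1..K) as a length-K indicator vector."""
    v = np.zeros(K)
    v[x - 1] = 1.0
    return v

one_hot(2, 3)  # -> array([0., 1., 0.])
```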

#### 1.5.3.2 Feature crosses

## What is the definition of **feature crosses**? (verbal)

One-hot encoding over every possible combination of values of multiple categorical variables.
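
A sketch with two hypothetical categorical variables (names and levels are my own):

```python
from itertools import product

def feature_cross(levels_a, levels_b, a, b):
    """One-hot over every (a, b) combination of two categorical variables."""
    combos = list(product(levels_a, levels_b))
    return [1.0 if (a, b) == c else 0.0 for c in combos]

# Two variables with 2 and 3 levels give a 6-dimensional crossed encoding.
v = feature_cross(["m", "f"], ["red", "green", "blue"], "f", "green")
```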

### 1.5.4 Preprocessing text data

#### 1.5.4.1 Bag of words model

## What is a **bag of words** model?

A representation of a text document where we ignore word order.

## What is **stop word removal**?

The dropping of common but uninformative words (e.g. "the", "and").

## What is **word stemming**?

Replacing words with their base form (e.g. “running” → “run”).

## What is the definition of a **vector space model** of text? (verbal)

Given a vocabulary of $D$ tokens, it encodes a document into a $D$-dimensional vector where each element indicates the frequency of the corresponding word.

## What is the definition of a **term frequency (TF) matrix** of a text dataset? (verbal)

A matrix where each entry $\text{TF}_{ij}$ is the frequency of term $i$ in document $j$.

## What is the definition of **inverse document frequency (IDF)**? (equation)

$$\text{IDF}_i \triangleq \log \frac{N}{1 + \text{DF}_i}$$

where $\text{DF}_i$ is the number of documents with term $i$, and $N$ is the total number of documents.

#### 1.5.4.2 TF-IDF

## What does **TF-IDF** stand for?

It stands for term frequency-inverse document frequency.

## What is the definition of the **term frequency-inverse document frequency (TF-IDF)**? (equation)

$$\text{TFIDF}_{ij} = \log(\text{TF}_{ij} + 1) \times \text{IDF}_i$$

where $\text{TF}_{ij}$ is the frequency of term $i$ in document $j$, and $\text{IDF}_i$ is the inverse document frequency.

(we often normalise each row as well)
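
A NumPy sketch of computing TF-IDF, assuming $\text{TFIDF}_{ij} = \log(\text{TF}_{ij} + 1) \times \text{IDF}_i$ with $\text{IDF}_i = \log(N / (1 + \text{DF}_i))$ (the counts are illustrative):

```python
import numpy as np

# Rows are terms, columns are documents; raw counts are illustrative.
TF = np.array([[3.0, 0.0, 0.0],
               [0.0, 2.0, 1.0]])
N = TF.shape[1]                  # number of documents
DF = np.sum(TF > 0, axis=1)      # number of documents containing each term
IDF = np.log(N / (1.0 + DF))
TFIDF = np.log(TF + 1.0) * IDF[:, None]
# Term 1 appears in 2 of 3 documents, so IDF[1] = log(3/3) = 0 and its
# TF-IDF scores vanish; the rarer term 0 keeps a positive weight.
```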

#### 1.5.4.3 Word embeddings

## What is a **word embedding**?

A mapping from a high-dimensional one-hot vector $\boldsymbol{x} \in \{0, 1\}^{V}$, to a lower-dimensional dense vector $\boldsymbol{e} \in \mathbb{R}^{K}$, via multiplication by (i.e. indexing into) an embedding matrix $\boldsymbol{E} \in \mathbb{R}^{K \times V}$:

$$\boldsymbol{e} = \boldsymbol{E}\boldsymbol{x}$$

## What is the definition of a **bag of word embeddings**? (verbal)

The sum of the word embeddings of each token in a document.

## What is the definition of a **bag of word embeddings**? (equation)

$$\bar{\boldsymbol{e}} = \sum_{t=1}^{T} \boldsymbol{E}\boldsymbol{x}_t = \boldsymbol{E}\tilde{\boldsymbol{x}}$$

where $\boldsymbol{E}$ is the embedding matrix, and $\tilde{\boldsymbol{x}} = \sum_{t=1}^{T} \boldsymbol{x}_t$ is the vector space model of the document.
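
A quick NumPy check that summing per-token embeddings equals multiplying the embedding matrix by the document's word-count vector (all sizes and values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
V, K = 5, 3                      # vocabulary size, embedding dimension
E = rng.normal(size=(K, V))      # embedding matrix (illustrative values)

def one_hot(i, V):
    v = np.zeros(V)
    v[i] = 1.0
    return v

tokens = [0, 2, 2, 4]            # a toy document as token indices
bag = sum(E @ one_hot(t, V) for t in tokens)   # sum of word embeddings
counts = np.bincount(tokens, minlength=V).astype(float)
assert np.allclose(bag, E @ counts)            # same as E times the counts
```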

### 1.5.5 Handling missing data

## Name the three kinds of missing data.

- Missing completely at random (MCAR)

- Missing at random (MAR)

- Not missing at random (NMAR)

## What is the definition of **missing completely at random (MCAR)**? (verbal)

The 'missingness' of data does not depend on the hidden or observed features.

## What is the definition of **missing at random (MAR)**? (verbal)

The 'missingness' of data does not depend on the hidden features, but may depend on the observed features.

## What is the definition of **not missing at random (NMAR)**? (verbal)

The 'missingness' of data depends on the hidden features.

## What must we do if we have not missing at random (NMAR) data?

If this is the case, we must model the missing data mechanism, since the lack of information may itself be informative.

## What is **mean value imputation**?

Replacing missing values by their empirical mean.
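
A minimal NumPy sketch of column-wise mean imputation (function name and data are my own):

```python
import numpy as np

def mean_impute(X):
    """Replace NaNs in each column with that column's observed mean."""
    X = X.copy()
    col_means = np.nanmean(X, axis=0)      # per-column mean, ignoring NaNs
    idx = np.where(np.isnan(X))            # (row, col) positions of NaNs
    X[idx] = np.take(col_means, idx[1])    # fill each NaN with its column mean
    return X

X = np.array([[1.0, np.nan],
              [3.0, 4.0],
              [np.nan, 8.0]])
X_imp = mean_impute(X)  # NaNs become 2.0 (column 0) and 6.0 (column 1)
```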