About
Below are the key ideas and equations in Chapter 1: Introduction of Probabilistic Machine Learning: An Introduction.
This is intended as a concise reference to accompany the text, and as the basis of an Anki flashcard deck (see my explanation of the benefits of using Anki here: Probabilistic Machine Learning: An Introduction).
Links
⬇️ (TODO: Anki download)
⬇️ Download PDF 
😸 Github
Table of contents
1.2 Supervised Learning
1.2.1 Classification
1.2.1.4 Empirical risk minimization
1.2.1.5 Uncertainty
1.2.1.6 Maximum likelihood estimation
1.2.2 Regression
1.2.2.2 Polynomial regression
1.2.3 Overfitting and generalization
1.2.4 No free lunch theorem
1.3 Unsupervised learning
1.3.2 Discovering latent “factors of variation”
1.3.3 Self-supervised learning
1.4 Reinforcement learning
1.5 Data
1.5.3 Preprocessing discrete input data
1.5.3.1 One-hot encoding
1.5.3.2 Feature crosses
1.5.4 Preprocessing text data
1.5.4.1 Bag of words model
1.5.4.2 TF-IDF
1.5.4.3 Word embeddings
1.5.5 Handling missing data
1.2 Supervised Learning
1.2.1 Classification
1.2.1.4 Empirical risk minimization
What is the definition of empirical risk? (in words)
The average loss calculated over the training set.
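What is the definition of empirical risk? (equation)
$\mathcal{L}(\boldsymbol{\theta}) \triangleq \frac{1}{N} \sum_{n=1}^{N} \ell(y_n, f(\mathbf{x}_n; \boldsymbol{\theta}))$
where $\ell$ is our loss function.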
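As a minimal sketch of this definition, the snippet below averages a loss over a toy training set. The names `empirical_risk`, `zero_one_loss`, and `threshold_classifier`, and the data itself, are illustrative assumptions and do not come from the book.

```python
def empirical_risk(loss, predict, inputs, targets):
    """Average loss of `predict` over a training set."""
    losses = [loss(y, predict(x)) for x, y in zip(inputs, targets)]
    return sum(losses) / len(losses)


def zero_one_loss(y_true, y_pred):
    """Loss of 1 for a misclassification, 0 otherwise."""
    return 0.0 if y_true == y_pred else 1.0


def threshold_classifier(x):
    """Hypothetical model: predict class 1 when the input exceeds 0.5."""
    return 1 if x > 0.5 else 0


# Toy training set (made up for illustration).
inputs = [0.1, 0.4, 0.6, 0.9]
targets = [0, 1, 1, 1]

risk = empirical_risk(zero_one_loss, threshold_classifier, inputs, targets)
print(risk)
```

With the zero-one loss, the empirical risk is simply the misclassification rate on the training set.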
1.2.1.5 Uncertainty
Predictive error due to lack of knowledge is known as what? (two terms)
- Epistemic uncertainty
- Model uncertainty
Predictive error due to intrinsic stochasticity is known as what? (two terms)
- Aleatoric uncertainty
- Data uncertainty
What is the purpose of the softmax function?
It converts a vector of real values into a vector of probabilities.
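What is the definition of the softmax function? (equation)
$\mathrm{softmax}(\mathbf{a}) \triangleq \left[ \frac{e^{a_1}}{\sum_{c'=1}^{C} e^{a_{c'}}}, \ldots, \frac{e^{a_C}}{\sum_{c'=1}^{C} e^{a_{c'}}} \right]$
What is the definition of a logistic regression model? (equation)
$p(y = c \mid \mathbf{x}; \boldsymbol{\theta}) = \mathrm{softmax}_c(\mathbf{W}\mathbf{x} + \mathbf{b})$
where $\mathrm{softmax}_c$ is the $c$-th output of the softmax function.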
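A rough sketch of the softmax function and a multiclass logistic regression prediction follows. It assumes the model computes logits as a weight matrix times the input plus a bias; the names `W`, `b`, and `logistic_regression_probs` are illustrative. Subtracting the maximum logit before exponentiating is a standard numerical-stability trick that leaves the result unchanged.

```python
import math


def softmax(logits):
    """Map a list of real values to probabilities that sum to 1."""
    m = max(logits)  # shift for numerical stability; does not change the output
    exps = [math.exp(a - m) for a in logits]
    total = sum(exps)
    return [e / total for e in exps]


def logistic_regression_probs(W, b, x):
    """Class probabilities softmax(W x + b) for one input vector x."""
    logits = [
        sum(w_cj * x_j for w_cj, x_j in zip(row, x)) + b_c
        for row, b_c in zip(W, b)
    ]
    return softmax(logits)


# Hypothetical parameters: 3 classes, 2 input features.
W = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
b = [0.0, 0.0, 0.0]
probs = logistic_regression_probs(W, b, [2.0, 1.0])
print(probs)
```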
1.2.1.6 Maximum likelihood estimation
What is the common choice of loss function for probabilistic models? (in words)
The negative log probability.
What is the common choice of loss function for probabilistic models? (equation)
$\ell(y, f(\mathbf{x}; \boldsymbol{\theta})) = -\log p(y \mid f(\mathbf{x}; \boldsymbol{\theta}))$
What is the definition of negative log likelihood (NLL)? (in words)
The empirical risk, using negative log loss.
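What is the definition of negative log likelihood (NLL)? (equation)
$\mathrm{NLL}(\boldsymbol{\theta}) = -\frac{1}{N} \sum_{n=1}^{N} \log p(y_n \mid f(\mathbf{x}_n; \boldsymbol{\theta}))$
What is the definition of the maximum likelihood estimate (MLE)? (equation)
$\hat{\boldsymbol{\theta}}_{\mathrm{mle}} = \operatorname*{argmin}_{\boldsymbol{\theta}} \mathrm{NLL}(\boldsymbol{\theta})$
where $\mathrm{NLL}(\boldsymbol{\theta})$ is the negative log likelihood.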
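To make the NLL and the MLE concrete, here is a small sketch that assumes the simplest possible probabilistic model: a Bernoulli distribution with a single parameter `theta` and no inputs. The data and the grid search are illustrative choices; in practice the minimiser is usually found analytically or with a gradient-based optimiser.

```python
import math


def bernoulli_nll(theta, data):
    """Average negative log probability of binary outcomes under Bernoulli(theta)."""
    log_probs = [math.log(theta if y == 1 else 1.0 - theta) for y in data]
    return -sum(log_probs) / len(data)


# Toy data set of binary outcomes (made up for illustration).
data = [1, 1, 0, 1]

# Crude grid search over theta in (0, 1); the MLE is the NLL minimiser.
grid = [i / 100 for i in range(1, 100)]
theta_mle = min(grid, key=lambda theta: bernoulli_nll(theta, data))
print(theta_mle)
```

For this model the minimiser coincides with the fraction of ones in the data, which is what the grid search recovers.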
1.2.2 Regression
What is the difference between classification and regression?
For the former the output is categorical (a class label), whereas for the latter it is real-valued.
What is the definition of mean squared error (MSE)? (in words)
The empirical risk, using quadratic loss.
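What is the definition of mean squared error (MSE)? (equation)
$\mathrm{MSE}(\boldsymbol{\theta}) = \frac{1}{N} \sum_{n=1}^{N} (y_n - f(\mathbf{x}_n; \boldsymbol{\theta}))^2$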
How are the NLL and MSE linked?
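If we assume the targets equal our predictions plus Gaussian noise with fixed variance $\sigma^2$ and compute the NLL, it is proportional to the MSE, up to an additive constant that does not depend on the parameters.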
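The link can be checked numerically. The sketch below assumes a Gaussian likelihood with a fixed standard deviation `sigma`, for which the averaged NLL works out to MSE / (2 * sigma^2) + 0.5 * log(2 * pi * sigma^2). The data values are made up.

```python
import math


def mse(targets, preds):
    """Mean squared error between targets and predictions."""
    return sum((y - f) ** 2 for y, f in zip(targets, preds)) / len(targets)


def gaussian_nll(targets, preds, sigma):
    """Average negative log density of targets under N(pred, sigma^2)."""
    total = 0.0
    for y, f in zip(targets, preds):
        log_prob = -0.5 * math.log(2 * math.pi * sigma**2) - (y - f) ** 2 / (2 * sigma**2)
        total -= log_prob
    return total / len(targets)


targets = [1.0, 2.0, 3.0]
preds = [1.5, 2.0, 2.0]
sigma = 2.0

nll = gaussian_nll(targets, preds, sigma)
# Scaled MSE plus a constant that does not depend on the predictions.
rescaled_mse = mse(targets, preds) / (2 * sigma**2) + 0.5 * math.log(2 * math.pi * sigma**2)
print(nll, rescaled_mse)
```

Because the scale factor and the constant do not depend on the model parameters, minimising the NLL and minimising the MSE pick out the same parameters.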
1.2.2.2 Polynomial regression
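What is the definition of a polynomial regression model, of degree $D$? (equation)
$f(x; \mathbf{w}) = \mathbf{w}^\top \boldsymbol{\phi}(x)$, where $\boldsymbol{\phi}(x) = [1, x, x^2, \ldots, x^D]$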
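A minimal sketch of a polynomial regression prediction, assuming the usual form in which a scalar input is expanded into the features 1, x, x^2, ..., x^D and combined linearly with a weight vector. The names `poly_features` and `poly_predict` and the weights are illustrative.

```python
def poly_features(x, degree):
    """Feature expansion [1, x, x^2, ..., x^degree] of a scalar input."""
    return [x**d for d in range(degree + 1)]


def poly_predict(w, x):
    """Prediction w^T phi(x); the degree is implied by the length of w."""
    phi = poly_features(x, len(w) - 1)
    return sum(w_d * phi_d for w_d, phi_d in zip(w, phi))


# Hypothetical degree-2 model: f(x) = 1 + 0*x + 2*x^2.
w = [1.0, 0.0, 2.0]
prediction = poly_predict(w, 3.0)
print(prediction)
```

The model is nonlinear in the input but still linear in the weights, which is why it can be fitted with ordinary linear regression machinery.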
1.2.3 Overfitting and generalization
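What is the definition of the population risk? (equation)
$\mathcal{L}(\boldsymbol{\theta}; p^*) \triangleq \mathbb{E}_{p^*(\mathbf{x}, y)}\left[\ell(y, f(\mathbf{x}; \boldsymbol{\theta}))\right]$
where $p^*$ is the true (but unknown) distribution used to generate the training set.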
What is the definition of the generalization gap? (verbal)
The population risk minus the empirical risk.
How can we frame overfitting in terms of the generalization gap?
A model is overfitting if it has a large generalization gap, i.e. its population risk is much higher than its empirical risk.
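The population risk cannot be computed directly because the true distribution is unknown, so it is commonly approximated by the average loss on held-out test data. The sketch below uses that approximation with a deliberately overfitted, hypothetical "memorizer" model; all data values are made up.

```python
def average_squared_loss(predict, inputs, targets):
    """Average quadratic loss of `predict` on a data set."""
    errors = [(y - predict(x)) ** 2 for x, y in zip(inputs, targets)]
    return sum(errors) / len(errors)


train_x, train_y = [0.0, 1.0, 2.0], [0.1, 0.9, 2.2]
test_x, test_y = [0.5, 1.5], [0.5, 1.5]

# Hypothetical overfitted model: a lookup table of the training set
# that falls back to 0.0 for any input it has not seen.
memory = dict(zip(train_x, train_y))


def memorizer(x):
    return memory.get(x, 0.0)


train_risk = average_squared_loss(memorizer, train_x, train_y)
test_risk = average_squared_loss(memorizer, test_x, test_y)  # stand-in for population risk
gap = test_risk - train_risk
print(train_risk, test_risk, gap)
```

The memorizer reaches zero training loss yet does poorly on unseen inputs, so the estimated gap is large.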
1.2.4 No free lunch theorem
What is the premise of the no free lunch theorem?
There is no single best model that works optimally for all kinds of problems.
1.3 Unsupervised learning
What distributions are supervised and unsupervised learning trying to model?
These approaches model $p(\mathbf{y} \mid \mathbf{x})$ and $p(\mathbf{x})$ respectively.
1.3.2 Discovering latent “factors of variation”
What is a latent factor?
A hidden low-dimensional variable, from which the observed high-dimensional variable is generated.
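What is the definition of a factor analysis (FA) model? (equation)
$p(\mathbf{x}_n \mid \mathbf{z}_n; \boldsymbol{\theta}) = \mathcal{N}(\mathbf{x}_n \mid \mathbf{W}\mathbf{z}_n + \boldsymbol{\mu}, \boldsymbol{\Sigma})$
where $\mathbf{z}_n$ are the latent factors.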
1.3.3 Self-supervised learning
What is self-supervised learning?
A form of unsupervised learning that trains on 'proxy' supervised tasks, created from the unlabelled data.
1.4 Reinforcement learning
In reinforcement learning, what is a policy?
A model which specifies which action to take in response to each possible input.
1.5 Data
1.5.3 Preprocessing discrete input data
1.5.3.1 One-hot encoding
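What is the definition of a one-hot encoding for a variable $x$ that takes $K$ categorical values? (equation)
$\text{one-hot}(x) = [\mathbb{I}(x = 1), \ldots, \mathbb{I}(x = K)]$
where $\mathbb{I}(\cdot)$ is the indicator function.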
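A minimal sketch of one-hot encoding: the output has one entry per possible category, with a 1 in the position of the observed value and 0 elsewhere. The function name and the colour categories are illustrative.

```python
def one_hot(value, categories):
    """Indicator vector with a 1 at the position of `value` in `categories`."""
    return [1 if value == c else 0 for c in categories]


colours = ["red", "green", "blue"]
encoded = one_hot("green", colours)
print(encoded)
```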
1.5.3.2 Feature crosses
What is the definition of feature crosses? (verbal)
One-hot encoding over every possible combination of values, given multiple categorical variables.
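As an illustration, the sketch below crosses two categorical variables by enumerating every combination of their values and one-hot encoding the observed combination. The variable names and category labels are made up.

```python
from itertools import product


def feature_cross(values, category_lists):
    """One-hot vector over all combinations of the given categorical variables."""
    combos = list(product(*category_lists))
    return [1 if combo == tuple(values) else 0 for combo in combos]


# Hypothetical variables: vehicle type and country of origin.
vehicle_types = ["S", "T", "F"]
origins = ["U", "J"]

crossed = feature_cross(("T", "J"), [vehicle_types, origins])
print(crossed)
```

The crossed vector has 3 x 2 = 6 entries, so a linear model on top of it can learn a separate weight for each combination and thereby capture interactions between the two variables.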
1.5.4 Preprocessing text data
1.5.4.1 Bag of words model
What is a bag of words model?
A representation of a text document where we ignore word order.
What is stop word removal?
The dropping of common but uninformative words (e.g. "the", "and").
What is word stemming?
Replacing words with their base form (e.g. “running” → “run”).
What is the definition of a vector space model of text? (verbal)
Given a vocabulary of $V$ tokens, it encodes a document into a $V$-dimensional vector where each element counts how often the corresponding token occurs in the document.
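The preprocessing steps and the resulting count vector can be sketched as follows. The stop word list, the whitespace tokenizer, and the three-word vocabulary are simplifying assumptions; stemming is omitted because a faithful stemmer needs more than a few lines.

```python
from collections import Counter

# Tiny hypothetical stop word list; real lists are much longer.
STOP_WORDS = {"the", "and", "a"}


def tokenize(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [t for t in text.lower().split() if t not in STOP_WORDS]


def count_vector(tokens, vocab):
    """Vector space model: one count per vocabulary entry, word order ignored."""
    counts = Counter(tokens)
    return [counts[word] for word in vocab]


vocab = ["cat", "chased", "dog"]
tokens = tokenize("The cat and the dog chased the cat")
vector = count_vector(tokens, vocab)
print(tokens, vector)
```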
What is the definition of a term frequency (TF) matrix of a text dataset? (verbal)
A matrix where each entry $\mathrm{TF}_{ij}$ is the frequency of term $i$ in document $j$.
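What is the definition of inverse document frequency (IDF)? (equation)
$\mathrm{IDF}_i \triangleq \log \frac{N}{1 + \mathrm{DF}_i}$
where $N$ is the total number of documents and $\mathrm{DF}_i$ is the number of documents with term $i$.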
1.5.4.2 TF-IDF
What does TF-IDF stand for?
It stands for term frequency-inverse document frequency.
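What is the definition of the term frequency-inverse document frequency (TF-IDF)? (equation)
$\mathrm{TFIDF}_{ij} = \log(\mathrm{TF}_{ij} + 1) \times \mathrm{IDF}_i$
where $\mathrm{TF}_{ij}$ is the frequency of term $i$ in document $j$, and $\mathrm{IDF}_i$ is the inverse document frequency.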
(we often normalise each row as well)
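A small sketch of the computation, assuming the variant TFIDF_ij = log(TF_ij + 1) * IDF_i with IDF_i = log(N / (1 + DF_i)), where N is the number of documents and DF_i is the number of documents containing term i. Other libraries use slightly different smoothing, and the optional normalisation step is left out here. The count matrix is made up.

```python
import math


def tf_idf(tf):
    """TF-IDF matrix from a term-frequency matrix tf[i][j] (term i, document j)."""
    n_docs = len(tf[0])
    weighted = []
    for row in tf:
        doc_freq = sum(1 for count in row if count > 0)  # DF_i
        idf = math.log(n_docs / (1 + doc_freq))          # IDF_i
        weighted.append([math.log(count + 1) * idf for count in row])
    return weighted


# Hypothetical counts: term 0 occurs in every document, term 1 in only one.
tf = [
    [1, 1, 1, 1],
    [3, 0, 0, 0],
]
weights = tf_idf(tf)
print(weights)
```

The rare term receives a large positive weight in the one document where it appears, while the term that occurs everywhere is down-weighted (slightly below zero under this particular smoothing).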
1.5.4.3 Word embeddings
What is a word embedding?
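A mapping from a high-dimensional one-hot vector $\mathbf{x}_{nt} \in \{0, 1\}^V$ to a lower-dimensional dense vector $\mathbf{e}_{nt} \in \mathbb{R}^K$, via multiplication by (i.e. indexing into) an embedding matrix $\mathbf{E} \in \mathbb{R}^{K \times V}$: $\mathbf{e}_{nt} = \mathbf{E}\mathbf{x}_{nt}$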
What is the definition of a bag of word embeddings? (verbal)
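The sum of the word embeddings of each token in a document.
What is the definition of a bag of word embeddings? (equation)
$\bar{\mathbf{e}}_n = \sum_{t=1}^{T} \mathbf{e}_{nt} = \mathbf{E}\tilde{\mathbf{x}}_n$
where $\mathbf{E}$ is the embedding matrix, and $\tilde{\mathbf{x}}_n$ is the vector space model of the document.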
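The two ways of computing a bag of word embeddings can be compared directly: summing the embedding of each token, or multiplying the embedding matrix by the document's count vector. The sketch assumes an embedding matrix with one column per vocabulary entry; the matrix values and token ids are made up.

```python
def embed(E, token_id):
    """Embedding of one token: the column of E selected by its one-hot vector."""
    return [row[token_id] for row in E]


def sum_of_embeddings(E, token_ids):
    """Add up the embeddings of every token in the document."""
    total = [0.0] * len(E)
    for token_id in token_ids:
        total = [a + b for a, b in zip(total, embed(E, token_id))]
    return total


def matvec(E, x):
    """Matrix-vector product E x."""
    return [sum(e * x_v for e, x_v in zip(row, x)) for row in E]


# Hypothetical embedding matrix: 2 embedding dimensions, vocabulary of 3 tokens.
E = [
    [1.0, 0.0, 2.0],
    [0.5, 1.0, 0.0],
]
token_ids = [0, 2, 2]   # the document as a sequence of vocabulary indices
counts = [1, 0, 2]      # the same document as a count vector

by_sum = sum_of_embeddings(E, token_ids)
by_product = matvec(E, counts)
print(by_sum, by_product)
```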
1.5.5 Handling missing data
Name the three kinds of missing data.
- Missing completely at random (MCAR)
- Missing at random (MAR)
- Not missing at random (NMAR)
What is the definition of missing completely at random (MCAR)? (verbal)
The 'missingness' of data does not depend on the hidden or observed features.
What is the definition of missing at random (MAR)? (verbal)
The 'missingness' of data does not depend on the hidden features, but may depend on the observed features.
What is the definition of not missing at random (NMAR)? (verbal)
The 'missingness' of data depends on the hidden features.
What must we do if we have not missing at random (NMAR) data?
If this is the case, we model the missing data mechanism, since the lack of information may be informative.
What is mean value imputation?
Replacing the missing values of a feature with the empirical mean of its observed values.
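A minimal sketch of mean value imputation for a single feature column, assuming missing entries are marked with `None` and that at least one value is observed. The function name and data are illustrative.

```python
def mean_impute(column):
    """Replace None entries with the mean of the observed entries."""
    observed = [v for v in column if v is not None]
    mean = sum(observed) / len(observed)  # assumes at least one observed value
    return [mean if v is None else v for v in column]


heights = [1.0, None, 3.0, None]
imputed = mean_impute(heights)
print(imputed)
```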