🤹‍♀️

Sampling


Validation

Validation Sets

We can think of a validation set as an approximation of the test set. We can use it to assess "test error" in scenarios where we don't yet want to touch the test set.
This can be used for:
  • Model selection
  • Hyperparameter search
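A minimal sketch (in Python, with assumed toy data and assumed candidate polynomial degrees) of using a held-out validation set for model selection:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=200)
y = np.sin(x) + rng.normal(scale=0.3, size=200)  # assumed toy data

# Hold out a quarter of the training data as a validation set;
# the real test set stays untouched.
idx = rng.permutation(len(x))
train, val = idx[:150], idx[150:]

def val_mse(degree):
    # Fit a polynomial of the given degree on the training split,
    # then measure its MSE on the validation split.
    coeffs = np.polyfit(x[train], y[train], degree)
    preds = np.polyval(coeffs, x[val])
    return np.mean((y[val] - preds) ** 2)

degrees = [1, 2, 3, 5, 9]                 # assumed candidate models
best = min(degrees, key=val_mse)          # model selection via validation error
print({d: round(val_mse(d), 3) for d in degrees}, "-> chosen degree:", best)
```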

Leave-One-Out Cross-Validation

Motivation:
  • using a hold-out validation set means we lose a lot of training data
  • different random splits can give very different results
Approach:
  1. For i = 1 to n:
    1. Fit the model on all the data except $(x_i, y_i)$
    2. Calculate $\mathrm{MSE}_i = (y_i - \hat{y}_i)^2$ of this model on the held-out $(x_i, y_i)$
  2. Average the n MSEs calculated
i.e.:
$\mathrm{CV}_{(n)} = \frac{1}{n}\sum_{i=1}^{n}\mathrm{MSE}_i$
Benefit:
  • Uses all data
  • Always gives the same results
Drawback:
  • Have to fit the model $n$ times
In the special case of linear or polynomial regression there is a neat fix for this, meaning we can compute the same CV score using a single model fitted on all the data:
$\mathrm{CV}_{(n)} = \frac{1}{n}\sum_{i=1}^{n}\left(\frac{y_i - \hat{y}_i}{1 - h_i}\right)^2$
where $h_i$ is the leverage of the point $x_i$, denoting how unusual its value is, defined (for simple linear regression):
$h_i = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_{j=1}^{n}(x_j - \bar{x})^2}$
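A sketch of LOOCV for simple linear regression on assumed toy data, checking that the brute-force loop of $n$ fits and the leverage shortcut give the same CV score:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
x = rng.uniform(0, 10, size=n)
y = 2.0 * x + 1.0 + rng.normal(scale=1.0, size=n)  # assumed toy data

# Brute force: fit n models, each with one observation left out.
mses = []
for i in range(n):
    keep = np.arange(n) != i
    slope, intercept = np.polyfit(x[keep], y[keep], 1)
    mses.append((y[i] - (slope * x[i] + intercept)) ** 2)
cv_loop = np.mean(mses)

# Shortcut: a single fit on all the data, residuals rescaled by (1 - h_i).
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (slope * x + intercept)
h = 1 / n + (x - x.mean()) ** 2 / np.sum((x - x.mean()) ** 2)  # leverage
cv_shortcut = np.mean((residuals / (1 - h)) ** 2)

print(cv_loop, cv_shortcut)  # the two scores agree (up to floating point)
```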

K-Fold Cross Validation

This is the same as leave-one-out, except that we leave $n/k$ datapoints out each time we fit the model, resulting in $k$ different groups/fits. Leave-one-out essentially sets $k = n$.
Benefits:
  • Reduces the cost of having to fit the model $n$ times: only $k$ fits are needed
  • Has lower variance than LOOCV, as outputs are less correlated
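A from-scratch sketch of k-fold CV (k = 5 and the straight-line model are assumptions); scikit-learn's KFold/cross_val_score do the same bookkeeping:

```python
import numpy as np

def k_fold_cv_mse(x, y, fit, predict, k=5, seed=0):
    """Average held-out MSE over k folds; fit/predict are model callbacks."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))
    folds = np.array_split(idx, k)               # k roughly equal groups
    mses = []
    for held_out in folds:
        train = np.setdiff1d(idx, held_out)      # everything not in this fold
        model = fit(x[train], y[train])
        preds = predict(model, x[held_out])
        mses.append(np.mean((y[held_out] - preds) ** 2))
    return float(np.mean(mses))

# Example on assumed toy data with a straight-line fit.
rng = np.random.default_rng(2)
x = rng.uniform(0, 5, size=100)
y = 3 * x - 2 + rng.normal(scale=0.5, size=100)
print(k_fold_cv_mse(
    x, y,
    fit=lambda xt, yt: np.polyfit(xt, yt, 1),
    predict=lambda m, xv: np.polyval(m, xv),
))
```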

Bootstrapping

The bootstrap is a statistical tool used to quantify the uncertainty of an estimator in a wide range of circumstances.
Method:
  • Repeat $B$ times:
    • Randomly sample $n$ observations from the training data with replacement.
    • Calculate the value of the estimator ($\hat{\alpha}^{*r}$) on this bootstrap sample.
  • Use these $B$ values to calculate the standard error of the estimator ($\hat{\alpha}$) over the whole dataset:
$\mathrm{SE}_B(\hat{\alpha}) = \sqrt{\frac{1}{B-1}\sum_{r=1}^{B}\left(\hat{\alpha}^{*r} - \frac{1}{B}\sum_{r'=1}^{B}\hat{\alpha}^{*r'}\right)^2}$
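A sketch of bootstrapping a standard error; the estimator (the sample median) and B = 1000 are assumptions for illustration:

```python
import numpy as np

def bootstrap_se(data, estimator, B=1000, seed=0):
    """Estimate the standard error of `estimator` from B bootstrap resamples."""
    rng = np.random.default_rng(seed)
    n = len(data)
    stats = np.array([
        estimator(data[rng.integers(0, n, size=n)])  # n draws with replacement
        for _ in range(B)
    ])
    # ddof=1 gives the 1/(B-1) factor from the SE formula above.
    return stats.std(ddof=1)

data = np.random.default_rng(3).exponential(scale=2.0, size=200)  # assumed data
print("bootstrap SE of the median:", bootstrap_se(data, np.median))
```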

Handling Imbalanced Datasets

Detecting "imbalanced" classification

These are evaluation metrics, not training metrics:
  • Confusion matrix
  • ROC
  • AUROC
  • F1 Score
It can also be good practice to use stratified sampling to ensure the class balance is preserved in both the training and test sets. The same idea applies to stratified cross-validation.
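A sketch of these evaluation metrics and of a stratified split, using scikit-learn on an assumed 95/5 imbalanced toy dataset:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Assumed toy dataset with a 95/5 class imbalance.
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)

# stratify=y keeps the class proportions the same in the train and test splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
probs = clf.predict_proba(X_te)[:, 1]
preds = (probs >= 0.5).astype(int)

print(confusion_matrix(y_te, preds))         # per-class errors, not just accuracy
print("AUROC:", roc_auc_score(y_te, probs))  # threshold-free ranking quality
print("F1:   ", f1_score(y_te, preds))       # balances precision and recall
```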

Addressing imbalanced data in training

  • Collect more data
  • Undersampling majority classes
  • Oversampling minority classes
  • Generating synthetic data for minority classes (see SMOTE)
  • Weighted cost function (in-training)
  • Change classification threshold (post-training)
🚨
Changing the class balance in the training dataset alters the prior probability of each class and hence our classification. We must account for this.
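A sketch of the last two options in the list above, a weighted cost function (via scikit-learn's class_weight) and a post-training threshold change, on the same assumed imbalanced toy dataset:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Option 1: weighted cost function - minority-class mistakes cost more in training.
weighted = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_tr, y_tr)
print("weighted F1:", f1_score(y_te, weighted.predict(X_te)))

# Option 2: train as usual, then lower the classification threshold post-training.
plain = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
probs = plain.predict_proba(X_te)[:, 1]
for threshold in (0.5, 0.2):  # 0.2 is an assumed, more permissive threshold
    print(f"threshold {threshold}:", f1_score(y_te, (probs >= threshold).astype(int)))
```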

Synthetic Minority Oversampling Technique (SMOTE)

Generates a synthetic example by selecting a random point along the line segment (in feature space) between a minority-class sample and one of its nearest neighbours of the same class.
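A from-scratch sketch of SMOTE's interpolation step (k = 5 neighbours and the toy minority-class data are assumptions; the imbalanced-learn library's SMOTE adds more machinery):

```python
import numpy as np

def smote_samples(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating each chosen minority
    sample towards one of its k nearest same-class neighbours."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances within the minority class only.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)                          # not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    base = rng.integers(0, len(X_min), size=n_new)           # random minority points
    nbr = neighbours[base, rng.integers(0, k, size=n_new)]   # one of their k neighbours
    lam = rng.random((n_new, 1))                             # position on the segment
    return X_min[base] + lam * (X_min[nbr] - X_min[base])

X_minority = np.random.default_rng(4).normal(size=(20, 2))   # assumed minority class
print(smote_samples(X_minority, n_new=40).shape)             # 40 new synthetic points
```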
