Validation
Validation Sets
We can think of a validation set as an approximation of the test set. We can use it to assess "test error" in scenarios where we don't yet want to touch the test set.
This can be used for:
- Model selection
- Hyperparameter search
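A minimal sketch of hyperparameter search with a held-out validation set (the synthetic data, the Ridge model, and the candidate penalties are all assumptions for illustration); the test set is only touched once, at the very end:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

# First split off the test set, then carve a validation set out of the remainder.
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval, test_size=0.25, random_state=0)

# Hyperparameter search: pick the ridge penalty with the lowest validation MSE.
best_alpha, best_mse = None, np.inf
for alpha in [0.01, 0.1, 1.0, 10.0, 100.0]:
    model = Ridge(alpha=alpha).fit(X_train, y_train)
    mse = mean_squared_error(y_val, model.predict(X_val))
    if mse < best_mse:
        best_alpha, best_mse = alpha, mse

# Only now do we touch the test set, using the selected hyperparameter.
final_model = Ridge(alpha=best_alpha).fit(X_trainval, y_trainval)
print(best_alpha, mean_squared_error(y_test, final_model.predict(X_test)))
```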
Leave-One-Out Cross-Validation
Motivation:
- Holding out a validation set means we lose a lot of training data
- Different random splits can give very different results
Approach:
- For $i = 1$ to $n$:
    - Fit the model on all data points except the $i$-th, $(x_i, y_i)$
    - Calculate the MSE of this model on the held-out point $(x_i, y_i)$
- Average the $n$ MSEs calculated, i.e.:
$$\mathrm{CV}_{(n)} = \frac{1}{n}\sum_{i=1}^{n}\mathrm{MSE}_i$$
Benefits:
- Uses all of the data
- Always gives the same result (there is no random splitting)
Drawback:
- Have to fit the model $n$ times
In the special case of linear or polynomial regression there is a neat fix for this, meaning we can compute the same CV score using a single model fitted on all of the data:
$$\mathrm{CV}_{(n)} = \frac{1}{n}\sum_{i=1}^{n}\left(\frac{y_i - \hat{y}_i}{1 - h_i}\right)^2$$
where $h_i$ is the leverage of the point $x_i$, denoting how unusual its value is, defined (for simple linear regression) as:
$$h_i = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_{i'=1}^{n}(x_{i'} - \bar{x})^2}$$
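A minimal sketch of this shortcut for simple linear regression (the synthetic data and its coefficients are assumptions for illustration), checking that the single-fit formula matches brute-force LOOCV:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
x = rng.uniform(0, 10, n)
y = 2.0 * x + 1.0 + rng.normal(0.0, 1.0, n)

X = np.column_stack([np.ones(n), x])          # design matrix with intercept

# Brute-force LOOCV: refit n times, each time leaving one point out.
cv_brute = 0.0
for i in range(n):
    mask = np.arange(n) != i
    beta = np.linalg.lstsq(X[mask], y[mask], rcond=None)[0]
    cv_brute += (y[i] - X[i] @ beta) ** 2
cv_brute /= n

# Shortcut: fit once on all the data, then reuse residuals and leverages.
beta_full = np.linalg.lstsq(X, y, rcond=None)[0]
y_hat = X @ beta_full
h = 1.0 / n + (x - x.mean()) ** 2 / np.sum((x - x.mean()) ** 2)  # leverage
cv_shortcut = np.mean(((y - y_hat) / (1.0 - h)) ** 2)

print(cv_brute, cv_shortcut)                  # the two values agree
```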
K-Fold Cross Validation
This is the same as leave-one-out, except that we leave $n/k$ data points out each time we fit the model, resulting in $k$ different groups/fits. Leave-one-out essentially sets $k = n$.
Benefits:
- Reduces the cost of having to fit the model $n$ times (only $k$ fits are needed)
- Has lower variance than LOOCV, as the fitted models (and hence their errors) are less correlated
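A minimal k-fold sketch with scikit-learn ($k = 5$, the linear model, and the synthetic data are assumptions for illustration):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_regression(n_samples=200, n_features=5, noise=5.0, random_state=0)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), X, y, cv=kf, scoring="neg_mean_squared_error")
print(-scores.mean())                         # average MSE over the 5 folds
```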
Bootstrapping
The bootstrap is a statistical tool used to quantify the uncertainty of an estimator in a wide range of circumstances.
Method:
- Repeat $B$ times:
    - Randomly sample $n$ observations from the training data with replacement.
    - Calculate the value of the estimator ($\hat{\alpha}^{*b}$) on this resampled dataset.
- Use these $B$ values to calculate the standard error of the estimator $\hat{\alpha}$:
$$\mathrm{SE}_B(\hat{\alpha}) = \sqrt{\frac{1}{B-1}\sum_{b=1}^{B}\left(\hat{\alpha}^{*b} - \frac{1}{B}\sum_{b'=1}^{B}\hat{\alpha}^{*b'}\right)^2}$$
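A minimal bootstrap sketch (the estimator is assumed to be the sample median and the data are synthetic, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.exponential(scale=2.0, size=100)   # stand-in for the training data

B = 1000
estimates = np.empty(B)
for b in range(B):
    # Sample n observations with replacement and recompute the estimator.
    resample = rng.choice(data, size=data.size, replace=True)
    estimates[b] = np.median(resample)

se_boot = estimates.std(ddof=1)               # bootstrap standard error
print(np.median(data), se_boot)
```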
Handling Imbalanced Datasets
Detecting "imbalanced" classification
These are evaluation metrics, not training metrics:
- Confusion matrix
- ROC
- AUROC
- F1 Score
It can also be good practice to use stratified sampling to ensure the class balance is the same in the training and test sets (see the sketch below). The same idea applies to stratified cross-validation.
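A minimal sketch of a stratified split followed by the metrics above (the logistic regression classifier and the synthetic imbalanced data are assumptions for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, f1_score, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)

# stratify=y keeps the class proportions the same in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
probs = clf.predict_proba(X_test)[:, 1]
preds = clf.predict(X_test)

print(confusion_matrix(y_test, preds))            # confusion matrix
print(f1_score(y_test, preds))                    # F1 score
print(roc_auc_score(y_test, probs))               # AUROC
fpr, tpr, thresholds = roc_curve(y_test, probs)   # points on the ROC curve
```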
Addressing imbalanced data in training
- Collect more data
- Undersampling majority classes
- Oversampling minority classes
- Generating synthetic data for minority classes (see SMOTE)
- Weighted cost function (in-training)
- Change classification threshold (post-training)
Note that changing the class balance in the training dataset alters the prior probability of each class, and hence our classifications; we must account for this (for example by adjusting the classification threshold, as in the sketch below).
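A minimal sketch of two of the options above, a weighted cost function and a changed classification threshold (the classifier, the synthetic data, and the 0.2 threshold are assumptions for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Weighted cost function: mistakes on the minority class are penalised more.
weighted = LogisticRegression(class_weight="balanced", max_iter=1000)
weighted.fit(X_train, y_train)

# Changed classification threshold: predict the positive class whenever its
# probability exceeds a threshold lower than the default 0.5.
plain = LogisticRegression(max_iter=1000).fit(X_train, y_train)
probs = plain.predict_proba(X_test)[:, 1]
preds_lower_threshold = (probs >= 0.2).astype(int)

print(weighted.predict(X_test).mean(), preds_lower_threshold.mean())
```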
Synthetic Minority Oversampling Technique (SMOTE)
Generates synthetic minority-class samples by selecting a random point along the line segment (in feature space) between a minority-class sample and one of its nearest neighbours of the same class.
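A minimal sketch of just this interpolation step (not a full SMOTE implementation; the minority-class points and $k = 5$ are assumptions for illustration). In practice the imbalanced-learn package provides a ready-made implementation in imblearn.over_sampling.SMOTE.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X_minority = rng.normal(size=(20, 2))           # stand-in minority-class features

k = 5
nn = NearestNeighbors(n_neighbors=k + 1).fit(X_minority)  # +1: each point is its own nearest neighbour

i = rng.integers(len(X_minority))               # pick a random minority sample
_, idx = nn.kneighbors(X_minority[i:i + 1])
neighbour = X_minority[rng.choice(idx[0][1:])]  # random neighbour, skipping the point itself

# New synthetic point a random fraction of the way along the segment to the neighbour.
lam = rng.uniform()
synthetic = X_minority[i] + lam * (neighbour - X_minority[i])
print(synthetic)
```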