Preliminaries: value of , , set of values to test.
- Use recursive binary splitting until every terminal node has fewer than observations.
- Apply cost complexity pruning to the full tree for each to get a series of candidate trees.
- Use K-fold CV across a range of values. For :
- Repeat (1.) & (2.) leaving out the fold.
- Evaluate the MSE of each CV candidate tree on the fold.
- Select the candidate tree from (2.) with an value corresponding to the best average MSE across each (3.2) step.
Recursive Binary Splitting Algorithm:
Given data , we perform repeated greedy splits of the feature space to form our decision tree.
These splits select a feature and partition the data around value for that feature, creating two regions:
The predictions for each region are simply the mean of the corresponding values.
We seek the values of and that minimise .
Cost Complexity Pruning Algorithm:
Given some , return the subtree that minimises:
Where is the number of leaf nodes in a given subtree.
This is essentially the same as regression trees, but we can no longer use the RSS to assess the loss in a given region.
The alternatives are:
(where is the proportion of observations in the region from class )
Classification error rate:
Bootstrapping in the context of decision trees is known as bagging. In this context, note:
- We no longer prune individual trees
- Using a very large number of trees does not overfit
- We can estimate the test error without having to use cross-validation:
The latter is known as out-of-bag error estimation.
The out-of-bag observations are the observations not sampled in each iteration (on average, about of the dataset assuming "bag size"=n). For each observation we can average the predictions of the models for which the observation is OOB, and then calculate the overall OOB MSE across each.
This method addresses the following problem with bagging decision trees:
- The order on which predictors are split is heavily correlated between trees, especially early on when strong predictors stand out consistently.
- This correlation makes the reduction in variance caused by bootstrapping less effective.
To address this, we do the following:
- At each split we randomly select out of the predictors to assess
- Typically we choose
This method, like bootstrapping, can be used in a variety of domains. In this context, we do the following:
- For :
- Fit tree with terminal nodes to the training data, with target (not ).
- Update function :
- Update value :
- This approach learns slowly by gradually reducing the residuals (typically a good thing!)
- This approach, unlike RF, can overfit, but only still quite robust to overfitting
- We can use CV to select
- Very low , e.g. , can be the best model
The kind of boosting presented above fits into a wider framework known as "gradient boosting".
This is because we can write the update to our function as:
This is just gradient descent, with the MSE loss function differentiated to give the term!
Instead of tuning parameters in the learner (i.e. tree) based on the gradient—like in a neural network—here we're fitting each learner in the additive model to approximate the new gradient at each step, relative to the current additive model. By adding together these gradient approximations we descend towards the minimum of the function.
This is really neat - it frames boosting in a context we already know well, and shows how even without tweaking a model repeatedly, we can still leverage gradients to learn well. Different loss functions can be chosen and fitted into this framework to give different updates.