🌲

Tree-Based Methods

About


(This page is based primarily on material from Chapter 8 of Introduction to Statistical Learning)
 

Decision Trees

Regression Trees

Overall Algorithm:
Preliminaries: value of $t$, $K$, set of $\alpha$ values to test.
  1. Use recursive binary splitting until every terminal node has fewer than $t$ observations.
  2. Apply cost complexity pruning to the full tree for each $\alpha$ to get a series of candidate trees.
  3. Use K-fold CV across the set of $\alpha$ values. For $k = 1, \dots, K$:
    1. Repeat (1) & (2) leaving out the $k^{\text{th}}$ fold.
    2. Evaluate the MSE of each candidate tree on the $k^{\text{th}}$ fold.
  4. Select the candidate tree from (2) with the $\alpha$ value corresponding to the best average MSE across the (3.2) steps.
 
Recursive Binary Splitting Algorithm:
Given data $\{(y, \, \mathbf{x}) \mid y \in \mathbb{R}, \; \mathbf{x} \in \mathbb{R}^d\}$, we perform repeated greedy splits of the feature space to form our decision tree.
These splits select a feature $j \in \{1, \dots, d\}$ and partition the data around a value $s$ for that feature, creating two regions:
$$R_1(j, s) = \{(y, \, \mathbf{x}) \mid \mathbf{x}_j < s\}, \quad R_2(j, s) = \{(y, \, \mathbf{x}) \mid \mathbf{x}_j \ge s\}$$
The prediction for each region is simply the mean of the corresponding $y$ values.
We seek the values of $j$ and $s$ that minimise $\text{RSS}(R_1) + \text{RSS}(R_2)$.
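A minimal NumPy sketch of one such greedy split, scanning every feature $j$ and every observed value $s$ and returning the pair that minimises $\text{RSS}(R_1) + \text{RSS}(R_2)$ (the function names and toy data are my own, not from ISL):

```python
import numpy as np

def rss(y):
    """Residual sum of squares around the region's mean prediction."""
    return float(np.sum((y - y.mean()) ** 2))

def best_split(X, y):
    """Return the (j, s) of the single greedy split minimising RSS(R1) + RSS(R2)."""
    best_j, best_s, best_loss = None, None, np.inf
    for j in range(X.shape[1]):
        for s in np.unique(X[:, j]):
            left = X[:, j] < s                 # R1: x_j <  s
            right = ~left                      # R2: x_j >= s
            if not left.any() or not right.any():
                continue
            loss = rss(y[left]) + rss(y[right])
            if loss < best_loss:
                best_j, best_s, best_loss = j, s, loss
    return best_j, best_s

# Toy example: the response depends on whether feature 1 exceeds 0.5
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = np.where(X[:, 1] > 0.5, 2.0, -1.0) + rng.normal(scale=0.1, size=200)
print(best_split(X, y))   # expect j = 1 and s just above 0.5
```

Growing the full tree is just this split applied recursively within each resulting region until the stopping rule (fewer than $t$ observations in a node) kicks in.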
Cost Complexity Pruning Algorithm:
Given some $\alpha$, return the subtree $T \subset T_0$ (where $T_0$ is the full tree grown by recursive binary splitting) that minimises:
$$\sum_{m=1}^{|T|} \text{RSS}(R_m) + \alpha|T|$$
where $|T|$ is the number of leaf nodes in a given subtree and $R_m$ is the region corresponding to the $m^{\text{th}}$ leaf.
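Putting the pieces above together, here is a sketch of the full regression-tree pipeline in scikit-learn, assuming its `ccp_alpha` cost-complexity penalty plays the role of $\alpha$ and `min_samples_split` plays the role of $t$ (the dataset is synthetic):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=400, n_features=5, noise=10.0, random_state=0)
t, K = 10, 5   # minimum node size to keep splitting, number of CV folds

# (1) Grow a large tree via recursive binary splitting.
full_tree = DecisionTreeRegressor(min_samples_split=t, random_state=0).fit(X, y)

# (2) The cost-complexity pruning path gives the candidate alpha values / subtrees.
alphas = full_tree.cost_complexity_pruning_path(X, y).ccp_alphas

# (3) K-fold CV: estimate the test MSE for each alpha.
cv_mse = [
    -cross_val_score(
        DecisionTreeRegressor(min_samples_split=t, ccp_alpha=a, random_state=0),
        X, y, cv=K, scoring="neg_mean_squared_error",
    ).mean()
    for a in alphas
]

# (4) Keep the subtree whose alpha gave the lowest average CV MSE.
best_alpha = alphas[int(np.argmin(cv_mse))]
final_tree = DecisionTreeRegressor(
    min_samples_split=t, ccp_alpha=best_alpha, random_state=0
).fit(X, y)
print(best_alpha, final_tree.get_n_leaves())
```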

Classification Trees

This is essentially the same as regression trees, but we can no longer use the RSS to assess the loss in a given region.
The alternatives (where $p_k$ is the proportion of observations in the region belonging to class $k$) are:
  • Classification error rate: $1 - \max_k p_k$
  • Gini index: $\sum_{k=1}^K \, p_k (1 - p_k)$
  • Cross-entropy: $-\sum_{k=1}^K \, p_k \log p_k$
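For intuition, a quick sketch comparing the three measures on a fairly pure region versus a mixed one (function names are my own):

```python
import numpy as np

def classification_error(p):
    return 1.0 - np.max(p)

def gini(p):
    return float(np.sum(p * (1.0 - p)))

def cross_entropy(p):
    p = p[p > 0]                      # convention: 0 * log(0) = 0
    return float(-np.sum(p * np.log(p)))

# Class proportions for a fairly pure region and a mixed region (K = 3 classes)
for p in (np.array([0.9, 0.05, 0.05]), np.array([0.4, 0.3, 0.3])):
    print(classification_error(p), gini(p), cross_entropy(p))
```

The Gini index and cross-entropy are more sensitive to node purity than the classification error rate, which is why they are usually preferred when growing the tree.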

Bagging

Bootstrap aggregation, or bagging, fits a tree to each of many bootstrap samples of the training data and averages their predictions. In this context, note:
  • We no longer prune individual trees
  • Using a very large number of trees does not overfit
  • We can estimate the test error without having to use cross-validation:
The latter is known as out-of-bag error estimation.
The out-of-bag (OOB) observations are the observations not sampled in a given iteration (on average, about $\frac{1}{3}$ of the dataset, assuming a "bag size" of $n$). For each observation we can average the predictions of the trees for which that observation is OOB, and then calculate the overall OOB MSE across all observations.
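A minimal scikit-learn sketch of bagging with OOB error estimation: `BaggingRegressor` bags decision trees by default, and with `oob_score=True` its fitted `oob_prediction_` attribute holds, for each observation, the average prediction from the trees for which it was out of bag (the synthetic data is my own choice):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor

X, y = make_regression(n_samples=500, n_features=10, noise=5.0, random_state=0)

# 500 trees, each fitted to a bootstrap sample of size n (sampling with replacement)
bag = BaggingRegressor(n_estimators=500, oob_score=True, random_state=0).fit(X, y)

# OOB MSE: every observation is predicted only by trees that never saw it
oob_mse = np.mean((y - bag.oob_prediction_) ** 2)
print(oob_mse)
```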

Random Forest

This method addresses the following problem with bagging decision trees:
  • The choice of splits is heavily correlated between trees, especially near the top of each tree, where strong predictors stand out consistently and get chosen again and again.
  • This correlation between trees makes the reduction in variance from averaging the bootstrapped trees less effective.
To address this, we do the following:
  • At each split we randomly select $m$ out of the $p$ predictors to assess
  • Typically we choose $m = \sqrt{p}$
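The same idea in scikit-learn, where `max_features` controls $m$. This is a sketch; note that for regression forests scikit-learn considers all $p$ predictors by default, so $m = \sqrt{p}$ has to be requested explicitly:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=500, n_features=16, noise=5.0, random_state=0)

# At each split only m = sqrt(p) = 4 randomly chosen predictors are assessed,
# which decorrelates the trees relative to plain bagging.
rf = RandomForestRegressor(
    n_estimators=500, max_features="sqrt", oob_score=True, random_state=0
).fit(X, y)
print(rf.oob_score_)   # OOB R^2 as a cross-validation-free estimate of test performance
```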

Boosting

This method, like bootstrapping, can be used in a variety of domains. In this context, we do the following:
Preliminaries: set $d$, $\lambda$, $B$
  1. Set $f(x) = 0$ and $r_i = y_i \;\; \forall i$
  2. For $b = 1, \dots, B$:
    1. Fit a tree $f^b$ with $d+1$ terminal nodes to the training data, with target $r$ (not $y$).
    2. Update the function: $f(x) \leftarrow f(x) + \lambda f^b(x)$
    3. Update the residuals: $r_i \leftarrow r_i - \lambda f^b(x_i) \;\; \forall i$
  3. Return $f(x)$
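A direct translation of the algorithm into a short scikit-learn loop (a sketch: mapping "$d+1$ terminal nodes" to `max_leaf_nodes=d + 1` and the toy data are my own choices):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, n_features=5, noise=5.0, random_state=0)
d, lam, B = 2, 0.01, 1000          # tree size, shrinkage lambda, number of trees

f_hat = np.zeros(len(y))           # f(x) = 0
r = y.copy()                       # r_i = y_i
trees = []

for b in range(B):
    # Fit a small tree with d+1 terminal nodes to the residuals r (not y)
    tree = DecisionTreeRegressor(max_leaf_nodes=d + 1, random_state=0).fit(X, r)
    trees.append(tree)
    update = lam * tree.predict(X)
    f_hat += update                # f(x) <- f(x) + lambda * f^b(x)
    r -= update                    # r_i  <- r_i  - lambda * f^b(x_i)

def predict(X_new):
    """The boosted model: f(x) = sum over b of lambda * f^b(x)."""
    return lam * np.sum([t.predict(X_new) for t in trees], axis=0)

print(np.mean((y - f_hat) ** 2))   # training MSE falls slowly as B grows
```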
Note:
  • This approach learns slowly by gradually reducing the residuals (typically a good thing!)
  • This approach, unlike random forests, can overfit if $B$ is too large, but it is still quite robust to overfitting
  • We can use CV to select $B$
  • A very low $d$, e.g. $d = 1$ (stumps), can give the best model

Gradient Boosting

The kind of boosting presented above fits into a wider framework known as "gradient boosting".
This is because we can write the update to our function as:
$$f(x) \leftarrow f(x) + \lambda f^b(x), \quad \text{where } f^b \text{ is fitted to the residuals } r_i = y_i - f(x_i)$$
This is just gradient descent: the residuals are the negative gradient of the squared-error loss $\tfrac{1}{2}(y_i - f(x_i))^2$ with respect to $f(x_i)$, i.e. the MSE loss function differentiated to give the $(y - f(x))$ term!
Instead of tuning parameters in the learner (i.e. the tree) based on the gradient, as we would in a neural network, here we fit each learner in the additive model to approximate the new gradient at each step, relative to the current additive model. By adding together these gradient approximations we descend towards the minimum of the loss function.
This is really neat - it frames boosting in a context we already know well, and shows how even without tweaking a model repeatedly, we can still leverage gradients to learn well. Different loss functions can be chosen and fitted into this framework to give different updates.
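To make the pluggable-loss point concrete, here is a hedged sketch of generic gradient boosting in which each round's tree is fitted to the negative gradient of whichever loss we pick: squared error recovers the familiar residuals $y - f(x)$, while absolute error gives $\operatorname{sign}(y - f(x))$. The function names are mine, and per-leaf step sizes / line search are omitted for brevity:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def neg_gradient(y, f, loss):
    """-dL/df for two common losses."""
    if loss == "squared_error":     # L = 1/2 (y - f)^2  =>  -dL/df = y - f
        return y - f
    if loss == "absolute_error":    # L = |y - f|        =>  -dL/df = sign(y - f)
        return np.sign(y - f)
    raise ValueError(f"unknown loss: {loss}")

def gradient_boost(X, y, loss="squared_error", lam=0.1, B=200, d=2):
    """Fit B small trees, each approximating the current negative gradient."""
    f = np.zeros(len(y))
    trees = []
    for b in range(B):
        g = neg_gradient(y, f, loss)                          # pseudo-residuals
        tree = DecisionTreeRegressor(max_leaf_nodes=d + 1).fit(X, g)
        f += lam * tree.predict(X)                            # gradient descent step in function space
        trees.append(tree)
    return trees, f
```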