Part II Statistical Methodology
6.2 Tree Based Methods
6.2.4 Pruning
Pruning starts with a fully grown tree. The procedure then
iteratively removes the branches that contribute least to the predictive accuracy of the tree, forming a sequence of nested, potentially optimal subtrees from which the optimal subtree 𝑇∗ is selected. The optimal subtree is the subtree that minimises the
tree based methods is referred to as cost-complexity pruning. For a fully grown tree, 𝑇, the total tree cost-complexity is defined as:
\[ R_\alpha(T) = R(T) + \alpha\,|\tilde{\tau}| \tag{6.3} \]
where 𝑅(𝑇) is a measure of the quality of the tree (the tree cost), 𝛼 (≥ 0) is the complexity parameter (the cost of a single additional terminal node), and |𝜏̃| is the number of terminal nodes in the fully grown tree. Here the cost of the total tree 𝑅(𝑇), measured by the quality of its terminal nodes, is penalised further according to the complexity of the tree, where complexity is measured simply by the size of the tree, i.e. the number of terminal nodes |𝜏̃|. The complexity parameter 𝛼 is a non-negative real number, and each value of 𝛼 may lead to a different subtree that minimises the cost-complexity. Breiman et al. showed that for every value of 𝛼 there is a unique smallest subtree of the fully grown tree that minimises the cost-complexity; because the fully grown tree has only a finite number of subtrees while 𝛼 takes infinitely many values, each minimising subtree is optimal over a whole interval of 𝛼 values (142). Therefore, instead of searching through every possible subtree for each value of 𝛼 to find the subtree with minimal cost-complexity using (6.3), Breiman and colleagues proposed an algorithm that creates a sequence of complexity parameter values. This algorithm utilises a function 𝛼(𝜏) to estimate the complexity parameter:
\[ \alpha(\tau) = \frac{R_s(\tau) - R_s(\tilde{\tau}_\tau)}{|\tilde{\tau}_\tau| - 1} \tag{6.4} \]
where 𝜏̃ is the set of all terminal nodes, 𝜏̃𝜏 is the set of offspring terminal nodes of the internal node 𝜏, 𝑅𝒔(𝜏) is the resubstitution cost of the internal node 𝜏 and 𝑅𝒔(𝜏̃𝜏) is the resubstitution cost of the offspring terminal nodes 𝜏̃𝜏. Breiman et al. used the resubstitution cost to help prune back a tree. It is called the resubstitution cost because the same data used to build the tree are used again to estimate the cost of the tree. The numerator of the function, 𝑅𝒔(𝜏) − 𝑅𝒔(𝜏̃𝜏), is the difference between the cost of the internal node 𝜏 and the total cost of the terminal nodes in the branch connected to 𝜏, denoted by 𝜏̃𝜏. Since we are describing regression trees here, i.e. the response is continuous, the pruning algorithm uses the sum of squared errors (SSE) as the measure of the resubstitution cost, $R_s(\tau) = \sum_{i \in \tau} (Y_i - \bar{Y}_\tau)^2$. Note that the SSE was also used as the impurity function in the tree growing process described in the previous section.
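As a rough illustration of (6.4), the following Python sketch computes the SSE resubstitution cost of a node and the resulting value of 𝛼(𝜏) for a single internal node with two offspring terminal nodes. The function names and the six response values are hypothetical and chosen only to keep the arithmetic easy to follow; they do not come from the data analysed in this thesis.

```python
from typing import List, Sequence


def sse(values: Sequence[float]) -> float:
    """Resubstitution cost of a node: sum of squared deviations from the node mean."""
    mean = sum(values) / len(values)
    return sum((y - mean) ** 2 for y in values)


def alpha(node_values: Sequence[float], leaf_values: List[Sequence[float]]) -> float:
    """Equation (6.4): cost reduction per additional terminal node below tau."""
    cost_node = sse(node_values)                          # R_s(tau)
    cost_leaves = sum(sse(leaf) for leaf in leaf_values)  # R_s of the offspring terminal nodes
    return (cost_node - cost_leaves) / (len(leaf_values) - 1)


# Hypothetical internal node whose six observations are split into two terminal nodes.
left = [1.0, 2.0, 3.0]
right = [7.0, 8.0, 9.0]
node = left + right

print(sse(node))                    # 58.0, the cost if tau were a terminal node
print(sse(left) + sse(right))       # 4.0, the total cost of the two terminal nodes
print(alpha(node, [left, right]))   # 54.0
```

In this toy case the split lowers the cost from 58 to 4 at the price of one extra terminal node, so 𝛼(𝜏) = 54; a branch that reduced the cost by less would receive a smaller value and would be pruned earlier.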
Now that the components of the complexity parameter estimating function 𝛼(𝜏) have been described, the steps of the pruning algorithm for determining the first subtree can be explained. The algorithm consists of the following steps:
1) Let 𝑇0 be the fully grown tree. Compute the estimate of the complexity parameter 𝛼 using the function 𝛼(𝜏) (see (6.4)) for all internal nodes (i.e. ∀𝜏 ∉ 𝜏̃) of 𝑇0.
2) Find the internal node with the smallest value of 𝛼(𝜏) and remove (prune) all branches descending from this node, so that this internal node becomes a terminal node. The resulting tree forms the first subtree 𝑇1, corresponding to the complexity parameter estimate 𝛼1 given by the smallest value of 𝛼(𝜏).
3) Repeat steps 1) and 2) using 𝑇1 as the initial tree.
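The steps above can be sketched in Python as follows. This is a minimal illustration under simplifying assumptions, not the implementation used by any particular software package: the `Node` class, the function names and the node costs are hypothetical, each node simply stores the resubstitution cost it would have as a terminal node, and ties between equally weak branches are broken arbitrarily.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Node:
    cost: float                      # resubstitution cost if this node were terminal
    left: Optional["Node"] = None
    right: Optional["Node"] = None

    def is_leaf(self) -> bool:
        return self.left is None and self.right is None


def leaves(node: Node) -> List[Node]:
    """Terminal nodes in the branch rooted at `node`."""
    if node.is_leaf():
        return [node]
    return leaves(node.left) + leaves(node.right)


def internal_nodes(node: Node) -> List[Node]:
    """All non-terminal nodes in the branch rooted at `node`."""
    if node.is_leaf():
        return []
    return [node] + internal_nodes(node.left) + internal_nodes(node.right)


def alpha(node: Node) -> float:
    """Equation (6.4) for an internal node."""
    below = leaves(node)
    return (node.cost - sum(leaf.cost for leaf in below)) / (len(below) - 1)


def weakest_link_sequence(root: Node) -> List[Tuple[float, int]]:
    """Prune in place; return (alpha estimate, number of terminal nodes) per subtree."""
    sequence = [(0.0, len(leaves(root)))]                 # fully grown tree
    while not root.is_leaf():
        weakest = min(internal_nodes(root), key=alpha)    # steps 1) and 2)
        a = alpha(weakest)
        weakest.left = weakest.right = None               # node becomes terminal
        sequence.append((a, len(leaves(root))))           # step 3): repeat on the subtree
    return sequence


# Hypothetical fully grown tree with four terminal nodes.
root = Node(100.0,
            Node(30.0, Node(10.0), Node(12.0)),
            Node(40.0, Node(5.0), Node(15.0)))
print(weakest_link_sequence(root))
# [(0.0, 4), (8.0, 3), (20.0, 2), (30.0, 1)]
```

In this toy tree the left branch is removed first because it lowers the cost by only 8 per additional terminal node, and the estimates grow as pruning proceeds until only the root node remains.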
Steps 1) and 2) of the above algorithm are repeated, using the previously formed subtree as the initial tree of the next iteration. The value 𝛼(𝜏) computed for each internal node (step 1) reflects how much additional predictive accuracy, per additional terminal node, the branch connected to node 𝜏 contributes to the tree. Hence, larger values of 𝛼(𝜏) indicate a greater contribution. Therefore, each iteration of the pruning procedure removes the branch that contributes least to the tree's predictive accuracy, thus forming an increasing sequence of complexity parameter estimates 𝛼0 < 𝛼1 < 𝛼2 < ⋯ < 𝛼𝑚,
where 𝛼0 = 0 for the fully grown tree, i.e. there is no additional cost for extra terminal nodes, so the fully grown tree has the lowest cost-complexity. Furthermore, the complexity parameter sequence 𝛼𝑘 (𝑘 = 0, 1, …, 𝑚) corresponds to a sequence of nested optimal subtrees 𝑇0 ⊃ 𝑇1 ⊃ 𝑇2 ⊃ ⋯ ⊃ 𝑇𝑚, where each subtree in the sequence is a subtree of the previous one, i.e. 𝑇𝑘−1 ⊃ 𝑇𝑘. The algorithm continues until the final subtree 𝑇𝑚 in the sequence is just the root node. The optimal subtree 𝑇∗ is then selected from the sequence of nested optimal subtrees. How to select the optimal subtree is described in the next section. This pruning procedure is often referred to as weakest-link pruning.
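To illustrate how each subtree in the nested sequence minimises the cost-complexity (6.3) over a whole interval of 𝛼 values, the short Python sketch below evaluates the cost-complexity of four nested subtrees for a handful of 𝛼 values. The costs and tree sizes are hypothetical, and ties are resolved in favour of the smaller subtree, which is an assumption made here for definiteness.

```python
from typing import List, Tuple

# Hypothetical nested sequence T0, T1, T2, T3 given as (tree cost, number of terminal nodes).
subtrees: List[Tuple[float, int]] = [(42.0, 4), (50.0, 3), (70.0, 2), (100.0, 1)]


def cost_complexity(cost: float, n_leaves: int, alpha: float) -> float:
    """Equation (6.3): tree cost plus alpha times the number of terminal nodes."""
    return cost + alpha * n_leaves


def best_subtree(alpha: float) -> int:
    """Index k of the subtree with minimal cost-complexity (smaller tree wins ties)."""
    return min(
        range(len(subtrees)),
        key=lambda k: (cost_complexity(*subtrees[k], alpha), subtrees[k][1]),
    )


print([best_subtree(a) for a in (0, 5, 10, 25, 40)])
# [0, 0, 1, 2, 3]
```

With these numbers the fully grown tree is preferred for 𝛼 below 8, the first subtree for 𝛼 between 8 and 20, the second for 𝛼 between 20 and 30, and the root node alone for larger values, so only the breakpoints of 𝛼 need to be considered when selecting 𝑇∗.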