
Part II Statistical Methodology

6.2 Tree Based Methods

6.2.4 Pruning

The pruning process starts with a fully grown tree. The procedure then iteratively removes the branches that contribute least to the predictive accuracy of the tree, forming a sequence of nested, potentially optimal subtrees from which the optimal subtree 𝑇∗ is selected. The optimal subtree is the subtree that minimises the [...] tree based methods is referred to as cost-complexity pruning. For a fully grown tree, 𝑇, the total tree cost-complexity is defined as:

𝑅_𝛼(𝑇) = 𝑅(𝑇) + 𝛼|𝜏̃| (6.3)

where 𝑅(𝑇) is a measure of the quality of a tree, or the tree cost, 𝛼 (≥ 0) is the complexity parameter (the cost of a single additional terminal node), and |𝜏̃| is the number of terminal nodes in the fully grown tree. Here the cost of the total tree 𝑅(𝑇), measured by the quality of its terminal nodes, is penalized further with respect to the complexity of the tree, where the complexity is measured simply by the size of the tree, i.e. the number of terminal nodes |𝜏̃|. The complexity parameter 𝛼 is a non-negative continuous real number, and each value of 𝛼 may lead to a different subtree that minimizes the cost-complexity. Breiman et al. were able to show that every value of 𝛼 has a unique smallest subtree of the fully grown tree that minimizes the cost-complexity; thus a finite number of subtrees corresponds to an infinite number of complexity parameter values (142). Therefore, instead of searching through every possible subtree for each value of 𝛼 to find the subtree with minimal cost-complexity using (6.3), Breiman and colleagues proposed an algorithm that creates a sequence of complexity parameter values. This algorithm utilises a function 𝛼(𝜏) to estimate the complexity parameter:

𝛼(𝜏) = (𝑅𝒔(𝜏) − 𝑅𝒔(𝜏̃_𝜏)) / (|𝜏̃_𝜏| − 1) (6.4)

where 𝜏̃ is the set of all terminal nodes and 𝜏̃_𝜏 is the set of offspring terminal nodes of the internal node 𝜏, 𝑅𝒔(𝜏) is the resubstitution cost of the internal node 𝜏 and 𝑅𝒔(𝜏̃_𝜏) is the resubstitution cost of the offspring terminal nodes 𝜏̃_𝜏. Breiman et al. used the resubstitution cost to help prune back a tree. It is called the resubstitution cost because the same data used to build the tree are again used to estimate the cost of a tree. The function numerator 𝑅𝒔(𝜏) − 𝑅𝒔(𝜏̃_𝜏) is the difference between the cost of the internal node 𝜏 and the total cost of the terminal nodes in the branch connected to 𝜏, denoted by 𝜏̃_𝜏. Since we are describing regression trees here, i.e. the response is continuous, the pruning algorithm uses the sum of squared errors (SSE) as a measure of the resubstitution cost to prune back a tree (𝑅𝒔(𝜏) = ∑_{𝑖∈𝜏} (𝑌_𝑖 − 𝑌̅_𝜏)²). Note that the SSE was also used as the impurity function in the tree growing process described in the previous section.
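
As a rough numerical illustration of (6.4), the sketch below computes the SSE resubstitution cost and the resulting 𝛼(𝜏) for a single internal node. The function names `sse` and `alpha` and the response values are hypothetical, chosen only for this example; they do not come from the source.

```python
def sse(values):
    """Sum of squared errors of the responses around their node mean."""
    mean = sum(values) / len(values)
    return sum((y - mean) ** 2 for y in values)


def alpha(node_values, leaf_value_groups):
    """Estimate alpha(tau) as in (6.4) for one internal node tau.

    node_values: all responses falling in the internal node tau.
    leaf_value_groups: the same responses grouped by offspring terminal node.
    """
    cost_node = sse(node_values)                          # cost of tau itself
    cost_leaves = sum(sse(g) for g in leaf_value_groups)  # cost of its terminal nodes
    return (cost_node - cost_leaves) / (len(leaf_value_groups) - 1)


# Hypothetical internal node with two offspring terminal nodes.
left, right = [1.0, 2.0, 3.0], [7.0, 8.0, 9.0]
alpha_tau = alpha(left + right, [left, right])
print(alpha_tau)  # 54.0: SSE of 58 in the node, 2 + 2 in its terminal nodes
```

Here collapsing the branch would raise the resubstitution cost from 4 to 58 while removing one terminal node, so the branch is worth keeping for any 𝛼 below 54.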

Now that the components of the complexity parameter estimating function 𝛼(𝜏) have been described, the steps of the pruning algorithm for determining the first subtree can be explained. The algorithm consists of the following steps:

1) Let 𝑇₀ be a fully grown tree. Compute the estimate of the complexity parameter 𝛼 using the function 𝛼(𝜏) (see (6.4)) for all internal nodes (i.e. ∀𝜏 ∉ 𝜏̃) of the initial fully grown tree 𝑇₀.

2) Find the internal node with the smallest value of 𝛼(𝜏) and remove (prune) all subsequent branches connected to this node. This internal node therefore becomes a terminal node. The resulting tree thus forms the first subtree 𝑇₁ corresponding to the complexity parameter estimate 𝛼₁, as estimated by 𝛼(𝜏).

3) Repeat steps 1) and 2) using 𝑇₁ as the initial tree.
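
The three steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the source: the `Node` class, the helper names, and the toy response values are all hypothetical, the SSE is used as the resubstitution cost, and ties in 𝛼(𝜏) are broken by taking the first node found.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    values: list                                   # responses falling in this node
    children: list = field(default_factory=list)   # empty list => terminal node


def sse(values):
    """Sum of squared errors around the node mean (resubstitution cost)."""
    mean = sum(values) / len(values)
    return sum((y - mean) ** 2 for y in values)


def leaves(node):
    """Terminal nodes of the branch rooted at `node`."""
    if not node.children:
        return [node]
    return [leaf for child in node.children for leaf in leaves(child)]


def internal_nodes(node):
    """All non-terminal nodes of the branch rooted at `node`."""
    if not node.children:
        return []
    found = [node]
    for child in node.children:
        found.extend(internal_nodes(child))
    return found


def alpha(node):
    """alpha(tau) from (6.4) for an internal node."""
    branch_leaves = leaves(node)
    branch_cost = sum(sse(leaf.values) for leaf in branch_leaves)
    return (sse(node.values) - branch_cost) / (len(branch_leaves) - 1)


def weakest_link_sequence(root):
    """Prune `root` in place; return (alpha_k, terminal node count) per subtree."""
    sequence = [(0.0, len(leaves(root)))]                 # T_0 with alpha_0 = 0
    while root.children:                                  # step 3: repeat
        weakest = min(internal_nodes(root), key=alpha)    # steps 1 and 2
        alpha_k = alpha(weakest)
        weakest.children = []                             # node becomes terminal
        sequence.append((alpha_k, len(leaves(root))))
    return sequence


# Hypothetical toy tree with three terminal nodes.
branch = Node([1, 2, 3, 7, 8, 9], [Node([1, 2, 3]), Node([7, 8, 9])])
root = Node([1, 2, 3, 7, 8, 9, 20, 22], [branch, Node([20, 22])])
sequence = weakest_link_sequence(root)
print(sequence)  # [(0.0, 3), (54.0, 2), (384.0, 1)]
```

In this toy tree the lower branch has 𝛼(𝜏) = 54 against 219 for the root, so it is pruned first; the root is then the only internal node left and is collapsed at 𝛼 = 384.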

Steps 1) and 2) of the above algorithm are repeated, each time using the previously formed subtree as the initial tree of the next iteration. The value 𝛼(𝜏) computed for each internal node (step 1) reflects how much additional predictive accuracy, per extra terminal node, the branch connected to node 𝜏 contributes to the tree, so larger values of 𝛼(𝜏) indicate a greater contribution. Each iteration of the pruning procedure therefore removes the branch that contributes least to the tree's predictive accuracy, forming an increasing sequence of complexity parameter estimates 𝛼₀ < 𝛼₁ < 𝛼₂ < ⋯ < 𝛼_𝑚, where 𝛼₀ = 0 for the fully grown tree, i.e. there is no additional cost for extra terminal nodes, hence the fully grown tree is the best predictor of the data used to build it. Furthermore, the complexity parameter sequence 𝛼_𝑘 (𝑘 = 0, 1, …, 𝑚) corresponds to a sequence of nested optimal subtrees 𝑇₀ ⊃ 𝑇₁ ⊃ 𝑇₂ ⊃ ⋯ ⊃ 𝑇_𝑚, where each subtree in the sequence is a subtree of the previous tree, i.e. 𝑇_{𝑘−1} ⊃ 𝑇_𝑘. The algorithm continues until the final subtree 𝑇_𝑚 in the sequence is just the root node. The optimal subtree 𝑇∗ is then selected from the sequence of nested optimal subtrees. How to select the optimal subtree is described in the next section. This pruning procedure is often referred to as weakest-link pruning.
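
To illustrate how a nested sequence relates back to the cost-complexity in (6.3), the sketch below evaluates 𝑅_𝛼(𝑇) for a hypothetical sequence of three subtrees at a few values of 𝛼. The costs, terminal node counts, and function names are assumptions made for this example, and the code only shows which subtree minimises (6.3) for a fixed 𝛼; it is not the procedure for selecting 𝑇∗.

```python
def cost_complexity(cost, n_terminal, alpha):
    """R_alpha(T) = R(T) + alpha * (number of terminal nodes), as in (6.3)."""
    return cost + alpha * n_terminal


# Hypothetical nested sequence: (label, resubstitution cost, terminal nodes).
subtrees = [("T0", 6.0, 3), ("T1", 60.0, 2), ("T2", 444.0, 1)]


def best_subtree(alpha):
    """Label of the subtree in the sequence with the smallest R_alpha(T)."""
    return min(subtrees, key=lambda t: cost_complexity(t[1], t[2], alpha))[0]


choices = {a: best_subtree(a) for a in (0.0, 10.0, 100.0, 400.0)}
print(choices)  # {0.0: 'T0', 10.0: 'T0', 100.0: 'T1', 400.0: 'T2'}
```

Each subtree remains the minimiser over a whole interval of 𝛼 values, which is why a finite sequence of subtrees is enough to cover the continuous range of the complexity parameter.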

In document Recursive partitioning based approaches for low back pain subgroup identification in individual patient data meta analyses (Page 108-111)