
Part II Statistical Methodology

6.2 Tree Based Methods

6.2.3 Node splitting

[Figure: example tree with splits on employment status (employed vs unemployed) and baseline RMDQ (≤4 vs >4); root node 1 ($\tau$), with node values 2: 0.33, 4: 1.79 and 5: 3.60.]

Having explained the basic concepts behind growing an initial, fully grown tree, two questions remain to be addressed:

1) Which covariate, out of all covariates, should be split, and at what optimal split point $s$?

2) When should an internal node not be split (i.e. when should it be declared a terminal node)?

Before answering the first question, note that the total number of potential split points varies at each stage of the tree-growing process and depends on whether the covariates are continuous or categorical. For a continuous or ordered discrete covariate, the number of possible split points is the number of its distinct values (among the individuals in the node being split) minus one. For example, a continuous variable with 100 distinct values has 100 − 1 = 99 possible split points. If a covariate is categorical with $M$ categories, then there are $2^{M-1} - 1$ potential split points. For example, assume ethnicity is a categorical covariate with three categories: white, black and Asian. It then has $2^{3-1} - 1 = 3$ possible split points: white vs black and Asian, white and black vs Asian, and white and Asian vs black. The overall number of possible splits when initially splitting the root node is simply the sum of the possible splits over all covariates. Therefore, to answer the first question, all potential splits for all covariates are evaluated using a goodness-of-split criterion to decide which covariate to split and at what optimal split point $s$.
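The counting rules above can be checked with a short sketch. The function names `n_ordered_splits` and `categorical_splits` are hypothetical and chosen only for this illustration; the enumeration fixes one category on the left-hand side so that mirrored partitions are not counted twice, which is one simple way (among others) of arriving at $2^{M-1} - 1$ splits.

```python
from itertools import combinations


def n_ordered_splits(values):
    """Number of candidate split points for a continuous or ordered covariate."""
    return len(set(values)) - 1


def categorical_splits(categories):
    """Enumerate the 2^(M-1) - 1 two-group partitions of M categories."""
    cats = sorted(set(categories))
    anchor, rest = cats[0], cats[1:]
    splits = []
    # Keep the first category on the left so that mirrored partitions
    # (left and right swapped) are not counted twice.
    for k in range(len(rest)):  # k = number of other categories joining the anchor
        for extra in combinations(rest, k):
            left = {anchor, *extra}
            right = set(cats) - left
            splits.append((left, right))
    return splits


print(n_ordered_splits(range(100)))  # -> 99

ethnicity_splits = categorical_splits(["white", "black", "asian"])
print(len(ethnicity_splits))  # -> 3
for left, right in ethnicity_splits:
    print(sorted(left), "vs", sorted(right))
```

For the three ethnicity categories, the loop lists the same three partitions described in the text.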

Splitting criterion

The goodness-of-split criterion is typically based on an impurity function and measures the reduction in the heterogeneity of an outcome $Y$ achieved by the two child nodes created when an internal node is split. An impurity function quantifies how impure (heterogeneous) a node is, and hence how impure the two child nodes are once a split has been formed. In an ideal case, an optimal split $s$ would form two subgroups of individuals that are completely homogeneous (pure) in terms of the outcome $Y$; in practice this is highly unlikely, and the child nodes will be ‘partially homogeneous’ or ‘impure’. Moreover, the amount of node impurity varies across all possible splits over all covariates $X_j$. Therefore the impurity function is evaluated for all possible splits to find the optimal split that maximises the reduction in impurity, i.e. produces the most “pure” subgroups. The type of node impurity measure used depends on whether the response variable is continuous or categorical. When the response is continuous, a natural choice of node impurity measure is the within-node sum of squares:

$$i(\tau) = \sum_{i \in \tau} (Y_i - \bar{Y}_\tau)^2 \qquad (6.1)$$

where $i \in \tau$ indexes the individuals in node $\tau$ and $\bar{Y}_\tau$ is the mean of the response for those individuals. The goodness-of-split (the reduction in impurity) can therefore be calculated for a split $s$ of an internal node $\tau$ into left and right child nodes, $\tau_L$ and $\tau_R$ respectively, as follows:

πœ‘(𝑠, 𝜏) = βˆ†π‘–(𝑠, 𝜏) = 𝑖(𝜏) βˆ’ 𝑖(𝜏𝐿) βˆ’ 𝑖(πœπ‘…) (6.2)

Here the goodness-of-split simply subtracts the impurity of the two child nodes from the impurity of their parent node. As mentioned earlier, it is evaluated over all possible splits to find the optimal split $s$ for each covariate $X_j$. Subsequently, the covariate whose optimal split maximises $\varphi(s, \tau)$, i.e. the split that leads to the biggest difference between the means of the two child nodes, is chosen to form the new split. This procedure is applied recursively to the newly formed internal nodes at each stage to continue the tree-growing process.
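As a minimal sketch of the search described above, the following code computes the within-node sum of squares of equation (6.1), the goodness-of-split of equation (6.2), and an exhaustive search over all covariates and split points. The function names, the toy data, the restriction to continuous or ordered covariates, and the use of midpoints between consecutive distinct values as candidate split points are all assumptions made for illustration; they are not taken from the text.

```python
def impurity(y):
    """Within-node sum of squares, as in equation (6.1)."""
    if not y:
        return 0.0
    mean = sum(y) / len(y)
    return sum((v - mean) ** 2 for v in y)


def goodness_of_split(y_left, y_right):
    """Reduction in impurity, as in equation (6.2)."""
    return impurity(y_left + y_right) - impurity(y_left) - impurity(y_right)


def best_split(rows, y):
    """Return (phi, covariate index, split point) with the largest phi.

    rows: one list of covariate values per individual (ordered covariates only)
    y:    continuous responses, one per individual
    """
    best = None
    for j in range(len(rows[0])):
        values = sorted({row[j] for row in rows})
        # Assumption: candidate split points are midpoints between
        # consecutive distinct values of covariate j.
        for lo, hi in zip(values, values[1:]):
            s = (lo + hi) / 2
            left = [yi for row, yi in zip(rows, y) if row[j] <= s]
            right = [yi for row, yi in zip(rows, y) if row[j] > s]
            phi = goodness_of_split(left, right)
            if best is None or phi > best[0]:
                best = (phi, j, s)
    return best


# Toy responses, made up for illustration.
print(goodness_of_split([1.0, 2.0, 3.0], [7.0, 8.0, 9.0]))  # -> 54.0

# Two toy covariates; only the first separates low from high responses.
rows = [[1, 10], [2, 20], [3, 10], [4, 20]]
y = [1.0, 1.2, 5.0, 5.2]
phi, j, s = best_split(rows, y)
print(j, s)  # -> 0 2.5
```

In the first toy example the parent sum of squares is 58 and each child has a sum of squares of 2, so the reduction is 58 − 2 − 2 = 54. In the second, the search selects the first covariate because splitting it at 2.5 separates the low responses from the high ones.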

Stopping criteria

The second question is how to determine when to stop growing a tree. One possible solution, although not the best, is to implement some form of stopping rule, also referred to as pre-pruning. For example, a minimum number of individuals in a child node could be set, e.g. n = 10 or 2% of the original sample size, so that a node becomes a terminal node (i.e. stops splitting) if it goes below that number. However, implementing such a stopping rule can be problematic, resulting in the tree-growing process stopping either too early (under-fitting) or too late (over-fitting) (131).
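A stopping rule of this kind can be sketched as follows. This is an illustrative toy rather than the procedure of any particular package: the names `grow` and `count_leaves`, the single ordered covariate, the midpoint split points, and the reading of the rule as "reject any split that would leave a child with fewer than `min_child_size` individuals" are all assumptions made for the example.

```python
def sum_of_squares(y):
    """Within-node sum of squares of the responses y."""
    mean = sum(y) / len(y)
    return sum((v - mean) ** 2 for v in y)


def grow(x, y, min_child_size):
    """Recursively split on one ordered covariate x; returns a nested dict."""
    best = None
    values = sorted(set(x))
    for lo, hi in zip(values, values[1:]):
        s = (lo + hi) / 2
        left = [i for i, xi in enumerate(x) if xi <= s]
        right = [i for i, xi in enumerate(x) if xi > s]
        if min(len(left), len(right)) < min_child_size:
            continue  # stopping rule: a child node would be too small
        phi = (sum_of_squares(y)
               - sum_of_squares([y[i] for i in left])
               - sum_of_squares([y[i] for i in right]))
        if best is None or phi > best[0]:
            best = (phi, s, left, right)
    if best is None:  # no admissible split, so the node is terminal
        return {"terminal": True, "n": len(y), "mean": sum(y) / len(y)}
    _, s, left, right = best
    return {
        "terminal": False,
        "split": s,
        "left": grow([x[i] for i in left], [y[i] for i in left], min_child_size),
        "right": grow([x[i] for i in right], [y[i] for i in right], min_child_size),
    }


def count_leaves(node):
    """Number of terminal nodes in a tree returned by grow."""
    if node["terminal"]:
        return 1
    return count_leaves(node["left"]) + count_leaves(node["right"])


# Toy data, made up for illustration: the response steps from 0 to 1 at x = 10.
x = list(range(20))
y = [0.0] * 10 + [1.0] * 10
print(count_leaves(grow(x, y, min_child_size=10)))  # -> 2
print(count_leaves(grow(x, y, min_child_size=1)))   # -> 20
```

With a minimum child size of 10 the tree stops after the single informative split, whereas with a minimum of 1 it keeps splitting until every terminal node holds one individual, even though the later splits do not reduce the impurity at all.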

Breiman suggested that no stopping rules be put in place and that an initial, fully grown (saturated) tree $T_0$ be formed such that the nodes cannot be split any further, i.e. all individuals in a node are identical. Such a tree fits the available data very well but is rather unstable and predicts future data relatively poorly. Simpler subtrees nested in $T_0$ may fit the data well enough while being better predictors of future data, making predictions more generalizable; however, going through all possible subtrees could be a daunting task. Thus, to overcome the problem of under- or over-fitting, improve model stability, improve the prediction of future data and limit the choice of optimal subtrees, Breiman introduced the concept of post-pruning: a process analogous to backward stepwise regression that simply removes nodes that contribute minimally to the predictive accuracy of the tree (130). The next section describes the post-pruning process, which will be referred to simply as “pruning” from now on.