
Algorithm 3.2. Greedy induction of a binary decision tree

 1: function BuildDecisionTree($\mathcal{L}$)
 2:     Create a decision tree ϕ with root node $t_0$
 3:     Create an empty stack S of open nodes $(t, \mathcal{L}_t)$
 4:     S.push($(t_0, \mathcal{L})$)
 5:     while S is not empty do
 6:         $t, \mathcal{L}_t$ = S.pop()
 7:         if the stopping criterion is met for t then
 8:             $\hat{y}_t$ = some constant value
 9:         else
10:             Find the split on $\mathcal{L}_t$ that maximizes impurity decrease: $s^* = \arg\max_{s \in Q} \Delta i(s, t)$
11:             Partition $\mathcal{L}_t$ into $\mathcal{L}_{t_L} \cup \mathcal{L}_{t_R}$ according to $s^*$
12:             Create the left child node $t_L$ of t
13:             Create the right child node $t_R$ of t
14:             S.push($(t_R, \mathcal{L}_{t_R})$)
15:             S.push($(t_L, \mathcal{L}_{t_L})$)
16:         end if
17:     end while
18:     return ϕ
19: end function
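The stack-based procedure of Algorithm 3.2 can be sketched in Python as follows. This is only a minimal illustration under several assumptions that the algorithm itself leaves open: the stopping criterion (node purity or fewer than a hypothetical `min_samples` objects), the family Q of splits (axis-aligned thresholds), the impurity measure (Gini) and the assignment rule (plurality vote) are all choices made here for the sake of the example, and the names `gini`, `best_split`, `build_decision_tree` and `predict` are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a non-empty list of class labels (assumed criterion)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(samples):
    """Exhaustive search over axis-aligned splits x[j] <= v (assumed family Q).

    Returns (impurity decrease, j, v, left, right), or None when no split
    separates the samples into two non-empty parts.
    """
    n = len(samples)
    parent = gini([y for _, y in samples])
    best = None
    for j in range(len(samples[0][0])):
        # Dropping the largest value guarantees a non-empty right part.
        for v in sorted({x[j] for x, _ in samples})[:-1]:
            left = [(x, y) for x, y in samples if x[j] <= v]
            right = [(x, y) for x, y in samples if x[j] > v]
            decrease = (parent
                        - len(left) / n * gini([y for _, y in left])
                        - len(right) / n * gini([y for _, y in right]))
            if best is None or decrease > best[0]:
                best = (decrease, j, v, left, right)
    return best

def build_decision_tree(samples, min_samples=2):
    """Greedy induction with an explicit stack of open nodes (Lines 3-17)."""
    root = {}
    stack = [(root, samples)]
    while stack:
        node, node_samples = stack.pop()
        labels = [y for _, y in node_samples]
        split = None
        # Assumed stopping criterion: pure node or too few samples.
        if len(node_samples) >= min_samples and len(set(labels)) > 1:
            split = best_split(node_samples)
        if split is None:
            # Terminal node: label it with the most frequent class.
            node["value"] = Counter(labels).most_common(1)[0][0]
        else:
            _, j, v, left, right = split
            node.update(feature=j, threshold=v, left={}, right={})
            stack.append((node["right"], right))
            stack.append((node["left"], left))
    return root

def predict(node, x):
    """Route x from the root down to a terminal node and return its label."""
    while "value" not in node:
        branch = "left" if x[node["feature"]] <= node["threshold"] else "right"
        node = node[branch]
    return node["value"]

# Toy learning set: one input variable, two classes.
data = [((0.0,), "a"), ((1.0,), "a"), ((2.0,), "b"), ((3.0,), "b")]
tree = build_decision_tree(data)
```

Pushing the right child before the left one, as in Lines 14-15, means that the left subtree is expanded first, so the tree is grown depth-first from left to right.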

The rest of this chapter is dedicated to a detailed discussion of specific parts of Algorithm 3.2. Section 3.4 discusses assignment rules for terminal nodes (Line 8 of Algorithm 3.2), while Section 3.5 outlines stopping criteria for deciding when a node becomes terminal (Line 7 of Algorithm 3.2). Section 3.6 then presents families Q of splitting rules, impurity criteria i(t) to evaluate the goodness of splits, and strategies for finding the best split s∗ ∈ Q (Line 10 of Algorithm 3.2). As we will see, the crux of the problem lies in finding good splits and in knowing when to stop splitting.

Lookahead search

The greedy assumption on which Algorithm 3.2 relies may produce trees that are suboptimal. A natural improvement of the greedy strategy is lookahead search, which evaluates the goodness of a split together with that of the splits deeper in the tree, assuming the former split has been performed. As empirically investigated in [Murthy and Salzberg, 1995], such an approach is however not only more computationally intensive, it is also not significantly better than greedy induction. For this reason, more elaborate variants of Algorithm 3.2 are not considered within this work. Let us note however that trees grown with lookahead search are usually shorter, which may be a strong advantage when interpretability matters.

3.4 Assignment rules

Let us assume that node t has been declared terminal given some stopping criterion (see Section 3.5). The next step (Line 8 of Algorithm 3.2) in the induction procedure is to label t with a constant value $\hat{y}_t$ to be used as a prediction of the output variable Y. As such, node t can be regarded as a simplistic model defined locally on $\mathcal{X}_t \times \mathcal{Y}$ and producing the same output value $\hat{y}_t$ for all possible input vectors falling into t.

Let us first notice that, for a tree ϕ of fixed structure, minimizing the global generalization error is strictly equivalent to minimizing the local generalization error of each simplistic model in the terminal nodes. Indeed,

$$
\begin{aligned}
\mathit{Err}(\phi) &= \mathbb{E}_{X,Y}\{L(Y, \phi(X))\} \\
&= \sum_{t \in \widetilde{\phi}} P(X \in \mathcal{X}_t)\, \mathbb{E}_{X,Y|t}\{L(Y, \hat{y}_t)\}
\end{aligned}
\tag{3.6}
$$

where $\widetilde{\phi}$ denotes the set of terminal nodes in ϕ and where the inner expectation¹ is the local generalization error of the model at node t. In this latter form, a model which minimizes Err(ϕ) is a model which minimizes the inner expectation leaf-wise. Learning the best possible decision tree (of fixed structure) therefore simply amounts to finding the best constant $\hat{y}_t$ at each terminal node.
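As a purely illustrative instance of Equation 3.6, consider a tree with only two terminal nodes $t_1$ and $t_2$. The probabilities 0.7 and 0.3 and the local errors 0.1 and 0.2 below are hypothetical values chosen for the arithmetic, not results from any experiment:

```latex
\begin{aligned}
\mathit{Err}(\phi)
  &= P(X \in \mathcal{X}_{t_1})\, \mathbb{E}_{X,Y|t_1}\{L(Y, \hat{y}_{t_1})\}
   + P(X \in \mathcal{X}_{t_2})\, \mathbb{E}_{X,Y|t_2}\{L(Y, \hat{y}_{t_2})\} \\
  &= 0.7 \times 0.1 + 0.3 \times 0.2 \\
  &= 0.13
\end{aligned}
```

Neither term depends on the constant chosen at the other leaf, which is why the constants can be optimized independently of each other.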

3.4.1 Classification

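When L is the zero-one loss, the inner expectation in Equation 3.6 is minimized by the plurality rule:

$$
\begin{aligned}
\hat{y}_t^* &= \arg\min_{c \in \mathcal{Y}} \mathbb{E}_{X,Y|t}\{1(Y \neq c)\} \\
&= \arg\min_{c \in \mathcal{Y}} P(Y \neq c \mid X \in \mathcal{X}_t) \\
&= \arg\max_{c \in \mathcal{Y}} P(Y = c \mid X \in \mathcal{X}_t)
\end{aligned}
\tag{3.7}
$$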

Put otherwise, the generalization error of t is minimized by predicting the class which is the most likely for the samples in the subspace of t. Note that if the maximum is achieved by two or more different classes, then $\hat{y}_t$ is assigned arbitrarily as any one of the maximizing classes.

Equation 3.7 cannot be solved without the probability distribution P(X, Y). However, its solution can be approximated by using estimates of the local generalization error. Let $N_t$ denote the number of objects in $\mathcal{L}_t$ and let $N_{ct}$ denote the number of objects of class c in $\mathcal{L}_t$. Then, the proportion $N_{ct}/N_t$ can be interpreted as the estimated probability² $p(Y = c \mid X \in \mathcal{X}_t)$ (shortly denoted $p(c|t)$) of class c in t and therefore be used to solve Equation 3.7:

$$
\hat{y}_t = \arg\min_{c \in \mathcal{Y}} 1 - p(c|t) = \arg\max_{c \in \mathcal{Y}} p(c|t)
\tag{3.8}
$$

1 The joint expectation of X and Y is taken over all objects i ∈ Ω such that $x_i \in \mathcal{X}_t$.
2 Lower case p denotes an estimated probability while upper case P denotes a theoretical probability.
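Assignment rule 3.8 amounts to counting classes within a leaf. The following sketch, in which the function name `assign_class` and the toy labels are hypothetical, computes the proportions $N_{ct}/N_t$ and returns the class that maximizes them. Ties are broken in favor of the class encountered first, which is one arbitrary choice among the maximizing classes.

```python
from collections import Counter

def assign_class(leaf_labels):
    """Plurality rule: return (y_hat, estimated probabilities p(c|t))."""
    n_t = len(leaf_labels)                      # N_t
    counts = Counter(leaf_labels)               # N_ct for every class c
    proportions = {c: n_ct / n_t for c, n_ct in counts.items()}
    # arg max over classes; on ties, max keeps the first class encountered.
    y_hat = max(proportions, key=proportions.get)
    return y_hat, proportions

# Toy leaf holding five objects from three classes.
y_hat, p = assign_class(["a", "b", "a", "c", "a"])
```

On this toy leaf, class "a" accounts for three of the five objects, so it is the assigned label and its estimated probability is 3/5.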

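Similarly, let us also define the proportion $N_t/N$ as the estimated probability $p(X \in \mathcal{X}_t)$ (shortly denoted $p(t)$). Plugging this estimate into Equation 3.6 and approximating the local generalization error with $1 - p(\hat{y}_t|t)$ as done above, it follows:

$$
\begin{aligned}
\widehat{\mathit{Err}}(\phi) &= \sum_{t \in \widetilde{\phi}} p(t)\,(1 - p(\hat{y}_t|t)) \\
&= \sum_{t \in \widetilde{\phi}} \frac{N_t}{N} \left(1 - \frac{N_{\hat{y}_t t}}{N_t}\right) \\
&= \frac{1}{N} \sum_{t \in \widetilde{\phi}} \left(N_t - N_{\hat{y}_t t}\right) \\
&= \frac{1}{N} \sum_{t \in \widetilde{\phi}} \sum_{(x,y) \in \mathcal{L}_t} 1(y \neq \hat{y}_t) \\
&= \frac{1}{N} \sum_{(x,y) \in \mathcal{L}} 1(y \neq \phi(x)) \\
&= \widehat{\mathit{Err}}^{train}(\phi)
\end{aligned}
\tag{3.9}
$$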

Accordingly, approximating Equation 3.6 through local probability estimates computed from class proportions in $\mathcal{L}_t$ reduces to the resubstitution estimate of ϕ (Equation 2.5). In other words, assignment rule 3.8 in fact minimizes the resubstitution estimate rather than the true generalization error.
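The chain of equalities in Equation 3.9 can be checked numerically on a small example. In the sketch below, a tree is represented only by the class labels that fall into each of its terminal nodes, which is an assumption made to keep the example short. The function names and the toy leaves are hypothetical, and exact fractions are used so that the two quantities can be compared without rounding.

```python
from collections import Counter
from fractions import Fraction

def leafwise_estimate(leaves):
    """Sum over leaves of p(t) * (1 - p(y_hat_t | t)), first line of Eq. 3.9."""
    n = sum(len(labels) for labels in leaves)
    total = Fraction(0)
    for labels in leaves:
        n_t = len(labels)
        n_majority = Counter(labels).most_common(1)[0][1]  # count of y_hat_t
        total += Fraction(n_t, n) * (1 - Fraction(n_majority, n_t))
    return total

def resubstitution_estimate(leaves):
    """Fraction of training objects misclassified by the plurality rule."""
    n = sum(len(labels) for labels in leaves)
    errors = 0
    for labels in leaves:
        y_hat = Counter(labels).most_common(1)[0][0]
        errors += sum(1 for y in labels if y != y_hat)
    return Fraction(errors, n)

# Three terminal nodes partitioning a learning set of 10 objects.
leaves = [["a", "a", "b"], ["b", "b", "b", "a"], ["c", "a", "c"]]
estimate = leafwise_estimate(leaves)
train_error = resubstitution_estimate(leaves)
```

Each toy leaf misclassifies exactly one object, so both quantities equal 3/10.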

An important property of assignment rule 3.8 is that the more one splits a terminal node in any way, the smaller the resubstitution estimate $\widehat{\mathit{Err}}^{train}(\phi)$ becomes.
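This property can be illustrated by brute force on a toy leaf. The sketch below, whose function names and labels are hypothetical, enumerates every way of dividing the objects of one leaf into two non-empty parts and counts the training errors made by the plurality rule before and after the split. Dividing the objects arbitrarily rather than through a splitting rule from Q is an assumption of the example, which only checks that the error count does not increase.

```python
from collections import Counter
from itertools import combinations

def leaf_errors(labels):
    """Training errors made by the plurality rule on one non-empty leaf."""
    return len(labels) - max(Counter(labels).values())

def errors_after_every_split(labels):
    """Yield the total error count for each split into two non-empty parts."""
    indices = range(len(labels))
    for size in range(1, len(labels)):
        for left_idx in combinations(indices, size):
            left = [labels[i] for i in left_idx]
            right = [labels[i] for i in indices if i not in left_idx]
            yield leaf_errors(left) + leaf_errors(right)

labels = ["a", "a", "a", "b", "b"]
before = leaf_errors(labels)                    # errors of the unsplit leaf
after = list(errors_after_every_split(labels))  # errors after each split
```

On this leaf, the split that isolates the two "b" objects removes all training errors, while a split such as {a, b} versus {a, a, b} leaves the error count unchanged.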

Proposition 3.1. For any non-empty split of a terminal node $t \in \widetilde{\phi}$ into