Classification using Boosted Decision Trees

The software package TMVA (Hoecker et al. 2007) provides a generic framework in which different multivariate classification methods can be trained, evaluated and tested in parallel. As of version 3.8.4 (the version used in this work), the following algorithms are included in different representations: Rectangular cut optimisation, Projective Likelihood estimator, Multidimensional Likelihood estimator, k-Nearest Neighbour Classifier, H-Matrix discriminant, Fisher discriminant, Function Discriminant Analysis, Artificial Neural Networks (ANN), Support Vector Machine, Boosted Decision Trees (BDT) and Predictive learning via rule ensembles (see Hoecker et al. 2007 for further details and different classifier training options). Although these methods respond differently to (non-)linear correlations and differ in their robustness against, e.g., overtraining or weakly classifying variables, they are all essentially extensions of one-dimensional cut-based analysis techniques to multivariate algorithms. Multivariate analysis (MVA) methods can be divided into two types: those which consider non-linear correlations between input parameters in the classification (e.g. ANN and BDT) and those which do not (e.g. Likelihood-, Fisher- and Cuts-based methods). Given the considerations made in the previous section, the former algorithms are expected to be preferable for the purposes of this work, a hypothesis which is tested in Section 2.6. While boosted decision trees effectively ignore weakly or non-classifying variables in the separation, neural networks can suffer from them, leading to degraded performance or unexpected behaviour of the ANN response.

MVA classifiers are commonly used in the natural sciences and sociology for complex problems, such as the classification of events of different types using a set of input variables. In particular, the BDT algorithm has been successfully used for particle identification in high energy physics (Yang et al. 2005; Abazov et al. 2008) and for supernova searches in optical astronomy (Bailey et al. 2007).


Figure 2.6: Sketch of a decision tree. An event, described by a parameter set $M_i = (m_{i,1}, \ldots, m_{i,6})$, undergoes at each node a binary split criterion (passed or failed) on one of its parameters until it ends up in a leaf. This leaf marks it as signal (S) or background (B) (Figure adapted from Ohm 2007; Ohm et al. 2009b).

2.2.1 Basics of the Decision Tree algorithm

A decision tree can be depicted by a two-dimensional structure like the one shown in Fig. 2.6. At each branching, a binary split criterion (passed or failed) on one of the characterising input parameters is applied, thereby classifying events of unknown type as signal- or background-like. The training of a decision tree is the process of determining these split criteria with a training set consisting of events of known type. Since single trees suffer from statistical fluctuations, the single tree is extended to a forest of decision trees to achieve a stable response and an improved performance. All trees of a BDT differ in their binary split criteria; the final response is calculated as the weighted mean vote of the classifications of all single trees. This vote is the output of the BDT and is referred to as the ζ variable in this work; it describes how background- or signal-like an event is. The forest of trees is generated from the initial single tree by a process called “boosting”.
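The weighted mean vote can be illustrated with a minimal Python sketch. It is not the TMVA implementation: the encoding of a tree's classification as +1 (signal) or −1 (background), the per-tree weights, and the names `forest_response`, `trees` and `tree_weights` are assumptions made for this example only.

```python
def forest_response(event, trees, tree_weights):
    """Weighted mean vote of all trees; lies between -1 and +1."""
    total = sum(tree_weights)
    votes = sum(w * tree(event) for tree, w in zip(trees, tree_weights))
    return votes / total


# Three hypothetical one-cut "trees" acting on a two-parameter event.
# Each returns +1 (signal-like leaf) or -1 (background-like leaf).
trees = [
    lambda m: 1 if m[0] < 0.5 else -1,
    lambda m: 1 if m[1] < 2.0 else -1,
    lambda m: 1 if m[0] < 0.3 else -1,
]
tree_weights = [1.0, 2.0, 1.0]  # assumed per-tree weights

zeta = forest_response([0.4, 1.5], trees, tree_weights)
print(zeta)  # two trees vote signal (weights 1 and 2), one votes background
```

With this convention, a response close to +1 marks an event as signal-like and a response close to −1 as background-like.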

2.2.2 The training procedure for a single tree

In the following, the training (or building) of a single tree is described in more detail. During the training process, appropriate splitting criteria for each node in a tree are determined using a training sample $S$ of events of known type. This training sample consists of a signal training sample $S_1$ and a background training sample $S_2$, which in turn comprise $N_1$ signal events and $N_2$ background events, respectively. Each event in the training set is characterised by a set of input parameters $M_i$ and a weighting factor $\omega_i$. A single decision tree is built from such a training sample by performing the following steps:

⋄ The training samples are normalised to the total number of signal and background events in such a way that all signal events have the same weight $\omega_i(S_1) = 1/N_1$ and all background events have the same weight $\omega_i(S_2) = 1/N_2$.

⋄ The tree-building procedure starts at the root node (top node in Fig. 2.6), where the variable and split value that provide the best separation of signal and background events are determined. Correspondingly, $S$ is divided into two subsets of events that either pass or fail this splitting criterion. Each subset is fed into a child node, where again the cut parameter which best separates signal and background events is determined.

⋄ This process is applied recursively until one of two stop criteria is fulfilled. Tree building is stopped if further splitting would not increase the separation, or if a preassigned minimum number of events is reached in the child node. Thereby overtraining due to statistically insignificant leaves is avoided. According to the majority of signal and background events, the last-grown nodes (which are called leaves) are assigned signal- (S) or background (B) type, respectively (see Fig. 2.6).
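The steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the TMVA implementation: the text does not name the separation measure, so a weighted Gini index is assumed here, and the dictionary-based tree as well as the names `initial_weights`, `best_split`, `build_tree`, `classify` and `min_events` are hypothetical.

```python
def initial_weights(labels):
    """Give each signal event weight 1/N1 and each background event 1/N2."""
    n_sig = sum(1 for y in labels if y == 1)
    n_bkg = len(labels) - n_sig
    return [1.0 / n_sig if y == 1 else 1.0 / n_bkg for y in labels]


def gini(sig_w, bkg_w):
    """Weighted Gini index of a node (assumed separation measure)."""
    total = sig_w + bkg_w
    return 0.0 if total == 0 else sig_w * bkg_w / total


def node_weights(events):
    """Summed signal and background weights of (params, label, weight) tuples."""
    sig = sum(w for _, y, w in events if y == 1)
    bkg = sum(w for _, y, w in events if y == 0)
    return sig, bkg


def best_split(events):
    """Scan every variable and cut value; return (gain, var, cut) of the best one."""
    parent = gini(*node_weights(events))
    best = (1e-12, None, None)  # small tolerance guards against rounding noise
    for var in range(len(events[0][0])):
        values = sorted({m[var] for m, _, _ in events})
        for cut in values[:-1]:  # skip the largest value: both sides stay non-empty
            passed = [e for e in events if e[0][var] <= cut]
            failed = [e for e in events if e[0][var] > cut]
            gain = parent - gini(*node_weights(passed)) - gini(*node_weights(failed))
            if gain > best[0]:
                best = (gain, var, cut)
    return best


def build_tree(events, min_events=2):
    """Grow a tree recursively; leaves are labelled 'S' or 'B' by weighted majority."""
    _, var, cut = best_split(events)
    if var is not None:
        passed = [e for e in events if e[0][var] <= cut]
        failed = [e for e in events if e[0][var] > cut]
    # Stop if no cut improves the separation or a child would be too small.
    if var is None or min(len(passed), len(failed)) < min_events:
        sig, bkg = node_weights(events)
        return {"leaf": "S" if sig > bkg else "B"}
    return {"var": var, "cut": cut,
            "pass": build_tree(passed, min_events),
            "fail": build_tree(failed, min_events)}


def classify(tree, params):
    """Follow the pass/fail decisions from the root node down to a leaf."""
    while "leaf" not in tree:
        tree = tree["pass"] if params[tree["var"]] <= tree["cut"] else tree["fail"]
    return tree["leaf"]


# Toy training sample with one input parameter; signal sits at low values.
params = [[0.1], [0.2], [0.3], [0.7], [0.8], [0.9]]
labels = [1, 1, 1, 0, 0, 0]
weights = initial_weights(labels)
tree = build_tree(list(zip(params, labels, weights)))
print(classify(tree, [0.25]), classify(tree, [0.75]))
```

In this toy sample a single cut separates the two classes completely, so the tree consists of the root node and two leaves.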

2.2.3 Boosting

Single decision trees are sensitive to statistical fluctuations in the training sample, hence a boosting procedure is applied which extends a single tree to a forest of trees. Thereby, the stability of the method is increased. In the boosting procedure, events that got misclassified in the building of the $(n-1)$st tree are multiplied with a boost weight, $\alpha_n$, thereby getting a higher weight in the training of the $n$th tree. Hence, the boosting is applied to all trees except for the first one. This method is known as AdaBoost or adaptive boost (Freund & Schapire 1997), where $\alpha_n$ is calculated from the fraction of misclassified events in all leaves of the tree $n-1$, $\mathrm{err}_{n-1}$:

$$\alpha_n = \frac{1 - \mathrm{err}_{n-1}}{\mathrm{err}_{n-1}}. \tag{2.2}$$

The misclassification error $\mathrm{err}_{i,n-1}$ in a single node $i$ in tree $n-1$ is calculated using the number of signal events $S_{i,n-1}$ and background events $B_{i,n-1}$ in that node:

$$\mathrm{err}_{i,n-1} = 1 - \max(p,\, 1-p), \qquad p = \frac{S_{i,n-1}}{S_{i,n-1} + B_{i,n-1}}. \tag{2.3}$$
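As an illustrative calculation with made-up numbers (not taken from this work), consider a node containing 30 signal and 10 background events. Eq. 2.3 then gives

$$p = \frac{30}{30 + 10} = 0.75, \qquad \mathrm{err}_{i,n-1} = 1 - \max(0.75,\, 0.25) = 0.25 .$$

If, likewise, the tree $n-1$ as a whole had a misclassified fraction of $\mathrm{err}_{n-1} = 0.25$, Eq. 2.2 would yield

$$\alpha_n = \frac{1 - 0.25}{0.25} = 3 ,$$

so each misclassified event would enter the training of tree $n$ with three times its previous weight.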

After having applied $\alpha_n$ to each misclassified event, the training samples of signal and background events are re-normalised to keep the sum of weights of all events in a decision tree constant.
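One boosting step can be sketched in Python as follows. The sketch rests on two assumptions that the text does not spell out: $\mathrm{err}_{n-1}$ is taken as the weighted fraction of misclassified training events, and the re-normalisation is applied to signal and background separately so that each class keeps its summed weight. The function names `boost_weight` and `reweight` are hypothetical.

```python
def boost_weight(err):
    """Eq. 2.2: alpha_n = (1 - err_{n-1}) / err_{n-1}."""
    if not 0.0 < err < 1.0:
        raise ValueError("err must lie strictly between 0 and 1")
    return (1.0 - err) / err


def reweight(labels, weights, predictions):
    """Boost misclassified events, then re-normalise signal and background.

    labels and predictions use 1 for signal and 0 for background.
    Returns the new event weights and the boost weight alpha.
    """
    wrong = [y != p for y, p in zip(labels, predictions)]
    # Assumed definition: weighted fraction of misclassified events.
    err = sum(w for w, bad in zip(weights, wrong) if bad) / sum(weights)
    alpha = boost_weight(err)
    boosted = [w * alpha if bad else w for w, bad in zip(weights, wrong)]
    # Assumed re-normalisation: each class keeps its summed weight.
    new = list(boosted)
    for cls in (0, 1):
        before = sum(w for w, y in zip(weights, labels) if y == cls)
        after = sum(w for w, y in zip(boosted, labels) if y == cls)
        new = [w * before / after if y == cls else w for w, y in zip(new, labels)]
    return new, alpha


# Toy example: four signal and four background events with equal weights;
# the previous tree misclassified one event of each type.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
weights = [0.25] * 8
predictions = [1, 1, 1, 0, 0, 0, 0, 1]
new_weights, alpha = reweight(labels, weights, predictions)
print(alpha, new_weights)
```

After the step, the two misclassified events carry a larger share of their class weight than the correctly classified ones, while the summed weight of each class is unchanged.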

2.2.4 BDT settings

The BDT method used in this work is provided by the TMVA package (version 3.8.4). The decision tree settings are mostly default values, which have been optimised and tested by the TMVA developers. They provide a fast training process and at the same time a stable response of the classifier; default values are marked with a * in the following:

⋄ The total number of trees in the forest was chosen to be 200*, a good compromise between maximum separation performance and acceptable computing cost. Varying this number over a wide range does not change the presented results significantly.

In document Development of an advanced gamma/hadron separation technique and application to particular gamma-ray sources with H.E.S.S. (Page 44-47)