Boosted decision trees - Statistical methods

Statistical methods

2.2 Boosted decision trees

The choice of the distribution used to perform the likelihood fit can have a strong impact on the sensitivity to the signal strength. It is therefore important to select the variable that provides the best separation between signal and background. However, a single kinematic variable is often not suited for this purpose, because the distinguishing features of the signal can be embedded in correlations between multiple variables. This issue can be partly overcome by combining multiple features into engineered variables. For example, one could sum the transverse momenta of multiple particles, so as to include kinematic information on all the objects in a given event. Unfortunately, it is difficult to construct the optimal variable "by hand", so this analysis makes use of a more refined technique. A set of variables encoding some of the most important kinematic features of the H⁺ decay is chosen, and a boosted decision tree (BDT) [87] is then trained on the entire set using MC events. The BDT learns to exploit the correlations between variables in order to distinguish between signal and background events. The BDT is then applied to data and MC events, generating an output proportional to the probability that each event was generated by signal. This output is used to perform the profile likelihood fit.
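As a small illustration of the engineered variable mentioned above (the sum of the transverse momenta of several particles), the following sketch computes a scalar p_T sum from momentum components. The function names and the representation of a particle as a (p_x, p_y) pair are assumptions made for this example only; they are not taken from the analysis code.

```python
import math

def transverse_momentum(px, py):
    """Transverse momentum: magnitude of the momentum in the x-y plane."""
    return math.hypot(px, py)

def scalar_pt_sum(particles):
    """Scalar sum of p_T over all objects in one event.

    `particles` is a list of (px, py) pairs (hypothetical event format).
    """
    return sum(transverse_momentum(px, py) for px, py in particles)

# Toy event with three objects (arbitrary units, illustrative values only).
event = [(3.0, 4.0), (6.0, 8.0), (0.0, 5.0)]
pt_sum = scalar_pt_sum(event)
```

Such a variable condenses information about all objects into one number, but it cannot capture every correlation, which is the motivation for the multivariate approach.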

The fundamental component of a BDT is a single decision tree. A tree is a construct that applies a sequence of cuts on the input variables so as to separate the classes of interest. For example, the decision tree of Figure 2.2 is trained to distinguish between two classes, signal and background. The tree can base its decision on three variables: the p_T, η and φ of a lepton. A metric called the Gini index is used to select the variable and the position at which each cut is performed. The Gini index is defined as G = P_S(1 − P_S), where P_S is the signal purity (the fraction of signal events) of the sample on which it is computed. The optimal cut is defined as the one that maximises the separation gain (SG):

SG = G(branch) − G(leaf 1) − G(leaf 2), (2.8)

where G(branch) is the Gini index computed before performing the cut, while G(leaf 1) and G(leaf 2) are the ones computed on the two subsets of events divided by the cut. The maximum number of subsequent cuts is known as the depth of the tree (the decision tree of Figure 2.2 has a depth of 3).
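The cut selection described above can be sketched as follows. The code transcribes G = P_S(1 − P_S) and Eq. (2.8) literally and scans candidate cuts on a single variable. The toy values, the function names and the choice of midpoints between neighbouring values as candidate cuts are assumptions made for illustration. Some implementations weight each Gini term by the number of events in the node; that weighting is deliberately left out here to stay with the formula as written.

```python
def gini(n_sig, n_bkg):
    """Gini index G = P_S * (1 - P_S) for a node with the given event counts."""
    total = n_sig + n_bkg
    if total == 0:
        return 0.0
    purity = n_sig / total
    return purity * (1.0 - purity)

def node_gini(labels):
    """Gini index of a list of labels (1 = signal, 0 = background)."""
    n_sig = sum(labels)
    return gini(n_sig, len(labels) - n_sig)

def separation_gain(values, labels, cut):
    """SG = G(branch) - G(leaf 1) - G(leaf 2) for the cut `value < cut`."""
    leaf1 = [lab for v, lab in zip(values, labels) if v < cut]
    leaf2 = [lab for v, lab in zip(values, labels) if v >= cut]
    return node_gini(labels) - node_gini(leaf1) - node_gini(leaf2)

def best_cut(values, labels):
    """Scan midpoints between neighbouring distinct values; return (cut, gain)."""
    points = sorted(set(values))
    candidates = [(a + b) / 2.0 for a, b in zip(points, points[1:])]
    return max(((c, separation_gain(values, labels, c)) for c in candidates),
               key=lambda pair: pair[1])

# Toy lepton p_T values (illustrative only) with their true classes.
lepton_pt = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0]
is_signal = [0, 0, 0, 1, 1, 1]
cut, gain = best_cut(lepton_pt, is_signal)
```

For this perfectly separable toy sample the scan selects the cut at 35, where both leaves are pure and the gain equals the Gini index of the parent branch, 0.25. A full tree would repeat the scan over every input variable and apply it recursively to each leaf until the maximum depth is reached.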

Figure 2.2: Schematic representation of a decision tree trained on a dataset composed of signal and background events. The training variables are the p_T, η and φ of a lepton. The output of the tree is the probability that each event was generated by signal. The probabilities do not correspond to the result of a real training but have been randomly generated. Values below 0.5 (in red) correspond to background-like events, while values above 0.5 (in green) are associated with signal-like events.

Multiple decision trees can be combined into ensembles, known as "forests", to mitigate statistical fluctuations and improve performance. In the case of BDTs, the trees are trained one after another, using events that are weighted to give larger importance to the ones that were misclassified by the previous tree. The final probability is produced as a weighted average of the scores of all trees. While the output of a single tree is a discrete distribution, BDTs produce pseudo-continuous distributions, spanning values from −1 to 1 (or 0 to 1). The smaller the BDT output, the more background-like the input event. To maximise the performance of the BDT, one could increase the size of the forest or the depth of each tree. However, there is a limiting factor given by the number of training events. If the forest is too large or the trees are too deep, the BDT could overtrain, i.e. pick up statistical fluctuations characteristic of the training sample only. This would result in a sub-optimal classifier. To mitigate these effects, a k-fold training is performed (k ≥ 2): each of the k BDTs is trained on a sub-sample that excludes a different 1/k of the events, and the excluded events are then used for validation. The optimal BDT parameters are chosen as the ones that minimise the differences between the BDT output distributions for the training and validation events, while maximising the separation between signal and background. The separation power of the BDT is estimated using the Receiver Operating Characteristic (ROC) curve. The ROC curve is a graphical representation of the background rejection efficiency versus the signal acceptance efficiency of a given distribution. Larger values of the area under the ROC curve (AUC) correspond to a higher probability of assigning the correct class to an event. The AUC must be computed on the validation set to avoid biases coming from the training. A perfect classifier would have AUC = 1, while a classifier with no separation power has AUC = 0.5. An example of a ROC curve is provided in Figure 2.3.
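The boosting and k-fold procedures described above can be sketched as follows. This is a minimal illustration, assuming an AdaBoost-style reweighting, trees of depth 1 ("stumps") acting on a single variable, and a fold assignment based on the event index modulo k. None of these choices is stated in the text, and the stumps are selected by weighted misclassification error rather than by the Gini index to keep the example short. All names and toy values are hypothetical.

```python
import math

def stump_predict(x, cut, sign):
    """Depth-1 tree: returns `sign` if x >= cut, otherwise `-sign`."""
    return sign if x >= cut else -sign

def train_stump(xs, ys, weights):
    """Pick the cut and orientation with the smallest weighted error."""
    points = sorted(set(xs))
    best = None
    for cut in ((a + b) / 2.0 for a, b in zip(points, points[1:])):
        for sign in (+1, -1):
            err = sum(w for x, y, w in zip(xs, ys, weights)
                      if stump_predict(x, cut, sign) != y)
            if best is None or err < best[2]:
                best = (cut, sign, err)
    return best

def train_boosted_stumps(xs, ys, n_trees):
    """Train trees in sequence; labels are +1 (signal) or -1 (background)."""
    weights = [1.0 / len(xs)] * len(xs)
    forest = []
    for _ in range(n_trees):
        cut, sign, err = train_stump(xs, ys, weights)
        err = min(max(err, 1e-10), 1.0 - 1e-10)  # keep the logarithm finite
        alpha = 0.5 * math.log((1.0 - err) / err)  # weight of this tree
        forest.append((alpha, cut, sign))
        # Misclassified events (y * prediction = -1) get a larger weight.
        weights = [w * math.exp(-alpha * y * stump_predict(x, cut, sign))
                   for x, y, w in zip(xs, ys, weights)]
        norm = sum(weights)
        weights = [w / norm for w in weights]
    return forest

def bdt_score(forest, x):
    """Weighted average of the tree outputs, between -1 and 1."""
    total = sum(alpha for alpha, _, _ in forest)
    if total == 0.0:
        return 0.0
    return sum(a * stump_predict(x, c, s) for a, c, s in forest) / total

def kfold_splits(n_events, k):
    """Yield (training, validation) index lists; each fold holds out 1/k."""
    for fold in range(k):
        train = [i for i in range(n_events) if i % k != fold]
        valid = [i for i in range(n_events) if i % k == fold]
        yield train, valid

# Toy sample: one background event (x = 5) sits among the signal events.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [-1, -1, -1, +1, -1, +1, +1, +1]
forest = train_boosted_stumps(xs, ys, n_trees=3)
scores = [bdt_score(forest, x) for x in xs]

# k-fold training: one forest per fold, validated on the held-out events.
fold_forests = [train_boosted_stumps([xs[i] for i in tr], [ys[i] for i in tr], 3)
                for tr, _ in kfold_splits(len(xs), k=2)]
```

On this toy sample a single stump cannot isolate the background event at x = 5, whereas the three reweighted stumps together assign a score of the correct sign to every training event. On a real sample, agreement on the training events alone would not indicate good performance: the comparison between the training and validation outputs of each fold is what reveals overtraining.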

Figure 2.3: Example of a Receiver Operating Characteristic curve. The ROC curve, at the bottom, is a graphical representation of the separation between the Gaussian distributions drawn in the top left corner. The red distribution can be interpreted as the signal distribution and the blue as the background distribution. The ROC curve is built by drawing the rate of true positive (signal) events versus the rate of false positive (background) events above a moving threshold (the black vertical line between the Gaussian distributions) [88].
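The construction of a ROC curve with a moving threshold, and the computation of the AUC, can be sketched as follows. The sketch uses the true-positive versus false-positive convention of the caption above and integrates the curve with the trapezoidal rule. The function names and the toy scores are assumptions made for illustration.

```python
def roc_curve(scores, labels):
    """Return (fpr, tpr) lists obtained by lowering a threshold over the scores.

    labels: 1 = signal, 0 = background. Events with score >= threshold are
    selected as signal-like.
    """
    n_sig = sum(labels)
    n_bkg = len(labels) - n_sig
    fpr, tpr = [0.0], [0.0]  # threshold above every score: nothing selected
    for threshold in sorted(set(scores), reverse=True):
        selected = [y for s, y in zip(scores, labels) if s >= threshold]
        true_pos = sum(selected)
        false_pos = len(selected) - true_pos
        tpr.append(true_pos / n_sig)
        fpr.append(false_pos / n_bkg)
    return fpr, tpr

def area_under_curve(fpr, tpr):
    """Trapezoidal integral of tpr as a function of fpr."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for x0, x1, y0, y1 in zip(fpr, fpr[1:], tpr, tpr[1:]))

# Toy classifier outputs for four signal and four background events.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 1, 0, 0]
fpr, tpr = roc_curve(scores, labels)
auc = area_under_curve(fpr, tpr)
```

For these toy scores the AUC is 0.8125, which equals the fraction of signal-background pairs (13 out of 16) in which the signal event has the higher score. Plotting the background rejection, 1 − fpr, against tpr instead gives the same area.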

In document Search for charged Higgs bosons decaying into top and bottom quarks with single-lepton final states using pp collisions collected at a centre-of-mass energy of 13 TeV by the ATLAS detector (Page 48-53)