5.3 Random forest of bagged decision trees
5.3.1 The binary tree
The basic unit of the forest is the binary decision tree. In our applications, the goal of the tree is to separate events of unknown class into two categories, e.g., signal or background (Section 8.3), or glitch or not (Chapter 6). In general, a tree has the following elements, also called nodes:
• root: the first node in a tree, at which all the training data starts;
• branching point: where a binary split is made such that a node splits into two daughter nodes; which events go to which daughter depends on the parameter and threshold chosen by the algorithm;
• leaf: a terminal node (no more splits are made).
The entire set of training data (for which the class is known) starts at the root node. For n-dimensional data, the $i$-th row of training data looks like:
$$(x_1, x_2, \ldots, x_n, y, w)_i, \tag{5.7}$$
where $x$ is the n-dimensional feature vector used in the previous two sections, $y \in \{0, 1\}$ indicates the class to which it belongs, and $w$ is the weight assigned to the event by the user (in the simplest case, all weights are set to 1).
In a generic self-creating tree, at each node, all thresholds on all feature-space dimensions are tested, and the one that best optimizes the chosen figure of merit is picked. If no dimension/threshold can improve the figure of merit, the node becomes a leaf. Otherwise, it is a branching point, and all events that have a numerical value of the chosen dimension lower than the chosen threshold take the “left” branch and the rest take the “right” branch. A simple choice for the figure of merit on a node, $Q$, is $p$, the correctly classified fraction of events [105]. Once the branching begins, the daughter nodes come in pairs:
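$$p_{\text{left}} = \frac{\sum_{i \in \text{left node},\; y_i = 0} w_i}{\sum_{i \in \text{left node}} w_i} \quad \text{or} \tag{5.8}$$
$$p_{\text{right}} = \frac{\sum_{i \in \text{right node},\; y_i = 1} w_i}{\sum_{i \in \text{right node}} w_i}, \tag{5.9}$$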
where left and right are defined such that the right-hand side of Equation (5.10) is maximized, if the figure of merit is symmetric with respect to the two classes, as $p$ is. For asymmetric figures of merit, the split is chosen such that the right-hand side of Equation (5.11) is maximized.
The condition for becoming a terminal node for a symmetric figure of merit is
$$Q_{\text{parent node}} \sum_{i \in \text{parent node}} w_i > Q_{\text{left node}} \sum_{i \in \text{left node}} w_i + Q_{\text{right node}} \sum_{i \in \text{right node}} w_i, \tag{5.10}$$
while for an asymmetric figure of merit it is
$$Q_{\text{parent node}} > \max(Q_{\text{left node}}, Q_{\text{right node}}). \tag{5.11}$$
There are other criteria that can be put in place beforehand to stop splitting. The package used in this thesis, which will be described in Section 5.3.2.1, only sets a minimum number of events allowed on a leaf [105].
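The split search and the terminal-node condition of Equation (5.10) can be illustrated with a minimal Python sketch for the symmetric figure of merit $Q = p$. This is not the code of the package used in this thesis. The function names, the row layout `(x, y, w)`, the use of midpoints between adjacent feature values as candidate thresholds, and the choice of the majority class for the parent node's $p$ are all assumptions made for illustration.

```python
def weighted_class_sums(rows):
    """Return (sum of weights with y == 0, sum of weights with y == 1)."""
    w0 = sum(w for _, y, w in rows if y == 0)
    w1 = sum(w for _, y, w in rows if y == 1)
    return w0, w1


def best_split(rows):
    """Exhaustively test every dimension and candidate threshold.

    rows: list of (x, y, w), with x a tuple of floats, y in {0, 1}, w a weight.
    Returns (score, dimension, threshold), or None if no split is possible.
    The score is the right-hand side of Equation (5.10) with Q = p.
    """
    best = None
    n_dims = len(rows[0][0])
    for d in range(n_dims):
        values = sorted({x[d] for x, _, _ in rows})
        # Assumption: candidate thresholds are midpoints of adjacent values.
        for lo, hi in zip(values, values[1:]):
            t = 0.5 * (lo + hi)
            left = [r for r in rows if r[0][d] < t]
            right = [r for r in rows if r[0][d] >= t]
            w0_l, w1_l = weighted_class_sums(left)
            w0_r, w1_r = weighted_class_sums(right)
            # p_left * W_left + p_right * W_right, taking whichever
            # assignment of the two classes to the daughters is larger.
            score = max(w0_l + w1_r, w1_l + w0_r)
            if best is None or score > best[0]:
                best = (score, d, t)
    return best


def is_terminal(rows):
    """Apply Equation (5.10): the node is a leaf if no split beats the parent."""
    w0, w1 = weighted_class_sums(rows)
    parent_score = max(w0, w1)  # Q_parent * W_parent, majority class assumed
    split = best_split(rows)
    return split is None or parent_score > split[0]


# Four unit-weight events in two dimensions; dimension 0 separates the classes.
rows = [
    ((1.0, 5.0), 0, 1.0),
    ((2.0, 3.0), 0, 1.0),
    ((3.0, 4.0), 1, 1.0),
    ((4.0, 1.0), 1, 1.0),
]
best = best_split(rows)
print(best)                    # best (score, dimension, threshold)
print(is_terminal(rows))       # mixed node
print(is_terminal(rows[:2]))   # node containing only Class 0 events
```

In this toy data set the best split is on dimension 0, and a node holding only Class 0 events satisfies Equation (5.10) and becomes a leaf. A minimum number of events per leaf, as mentioned above, would be an additional check before accepting a split.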
After a tree is “grown” (i.e. trained), the structure of the tree is saved. The tree is a series of branching points, each defined by a dimension and a threshold. The leaves can be defined in a discrete or continuous manner. If discrete leaves are chosen, each leaf is labeled as either Class 0 or Class 1, depending on how many Class 0 and Class 1 training events landed on said leaf. If the leaves are labeled in a continuous manner, then they are each assigned a “rank”:
$$r = \frac{\sum w_1}{\sum w_0 + \sum w_1}, \tag{5.12}$$
where $w_1$ and $w_0$ are the weights of the Class 1 and Class 0 events on the leaf, respectively, and each sum runs only over events on that leaf. If the weights are all set to 1, then this rank is simply the fraction of the total number of events on a leaf that are Class 1. When an event of unknown class is evaluated by the tree, it will deterministically end up on one leaf and is either assigned to a class (discrete leaves) or given a rank (continuous leaves).
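As a hedged illustration of Equation (5.12) and of how a saved tree evaluates an event, the following Python sketch computes the rank of a leaf and walks an event down a small tree with continuous leaves. The nested-dictionary representation and the names `leaf_rank` and `evaluate` are hypothetical and are not taken from the package used in this thesis.

```python
def leaf_rank(weights_class0, weights_class1):
    """Rank of a leaf, Equation (5.12): sum(w1) / (sum(w0) + sum(w1))."""
    s0 = sum(weights_class0)
    s1 = sum(weights_class1)
    return s1 / (s0 + s1)


def evaluate(node, x):
    """Follow the branching points until a leaf is reached; return its rank.

    A branching point is a dict with keys "dim", "threshold", "left", "right".
    A leaf is a dict with the single key "rank".
    """
    while "rank" not in node:
        if x[node["dim"]] < node["threshold"]:
            node = node["left"]    # value below threshold: "left" branch
        else:
            node = node["right"]   # otherwise: "right" branch
    return node["rank"]


# One branching point on dimension 0; all training weights are set to 1.
tree = {
    "dim": 0,
    "threshold": 2.5,
    "left": {"rank": leaf_rank([1.0, 1.0, 1.0], [1.0])},   # 3 Class 0, 1 Class 1
    "right": {"rank": leaf_rank([1.0], [1.0, 1.0, 1.0])},  # 1 Class 0, 3 Class 1
}
print(evaluate(tree, (1.0, 9.0)))  # event landing on the left leaf
print(evaluate(tree, (3.0, 0.0)))  # event landing on the right leaf
```

With unit weights, the left leaf has rank 1/4 and the right leaf has rank 3/4, i.e. the fraction of Class 1 training events on each leaf.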
The process of splitting is equivalent to recursively partitioning the feature space into rectangular regions, where the rectangles are analogous to the nodes; this makes trees easy to interpret [105]. Other benefits of decision trees are:
• Not only are they immune to complications caused by correlated dimensions, but the correlations actually help the tree make better decisions [105];
• They can deal with mixed data types (float versus integer);
• They are more easily interpreted than other machine learning algorithms — i.e., not “black boxes”;
• They are not computationally limited by a very large feature space [105].
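The correspondence between leaves and rectangular regions of feature space, noted before the list above, can be made concrete with a short Python sketch. It assumes a hypothetical nested-dictionary tree (branching points with keys "dim", "threshold", "left", "right"; leaves with key "rank") and is meant only as an illustration, not as the representation used by any particular package.

```python
import math


def leaf_regions(node, bounds):
    """Yield (bounds, rank) for every leaf below `node`.

    bounds is a list of (low, high) intervals, one per feature dimension,
    describing the axis-aligned rectangle of events that reach `node`.
    """
    if "rank" in node:
        yield bounds, node["rank"]
        return
    d, t = node["dim"], node["threshold"]
    low, high = bounds[d]
    # Left daughter: values below the threshold in dimension d.
    left_bounds = bounds[:d] + [(low, min(high, t))] + bounds[d + 1:]
    # Right daughter: values at or above the threshold in dimension d.
    right_bounds = bounds[:d] + [(max(low, t), high)] + bounds[d + 1:]
    yield from leaf_regions(node["left"], left_bounds)
    yield from leaf_regions(node["right"], right_bounds)


# A two-dimensional tree with two branching points and three leaves.
tree = {
    "dim": 0,
    "threshold": 2.5,
    "left": {"rank": 0.25},
    "right": {
        "dim": 1,
        "threshold": 4.0,
        "left": {"rank": 0.5},
        "right": {"rank": 1.0},
    },
}
unbounded = [(-math.inf, math.inf), (-math.inf, math.inf)]
regions = list(leaf_regions(tree, unbounded))
for region, rank in regions:
    print(region, rank)
```

Each leaf is reported together with the interval it covers in every dimension, so the three leaves of this toy tree tile the two-dimensional feature space with three non-overlapping rectangles.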
Simple decision trees are often described as “weak” classifiers. Some weaknesses are listed here:
• The decisions cannot be reversed — if the first split is bad, the tree will never recover; this can be thought of as an instability in the method [105];
• They can fall victim to overtraining (the tree perfectly classifies the training set but fails to classify an independent testing set drawn from the same population). Therefore, a validation set must be used.
Creating an ensemble (or “forest”) of decision trees and averaging their output can mitigate the problems of a single decision tree [107]. A modern realization of this scheme is discussed in the following section.