s→ J/ψ φ
acceptance and the decay-time resolution determined here are close to the ones used in the official analysis, while the final fit result and the statistical subtraction of background candidates were only used as a cross-check. For consistency, it was decided to present the results of this autonomous study rather than those of the official LHCb analysis. A comparison of the two sets of results is shown in Appendix G.3.
3.3 Probability density function of the decay $B_s^0 \to J/\psi\,\phi$
The extraction of the physics parameters relies on a correct description of the decay-time and angular distributions of the selected $B_s^0 \to J/\psi\,\phi$ decays. The parametrization of these distributions is called the probability density function (PDF) and is developed throughout the following chapters. The starting point is the underlying PDF as it would appear if the complete information about all $B_s^0 \to J/\psi\,\phi$ decays were available. It is essentially given by Equations (1.53) and (1.54):

$$\mathrm{PDF}(t, \Omega \mid q) = \frac{1}{N_q} \sum_{k=1}^{10} A_k\, h_{k,q}(t)\, f_k(\Omega), \qquad (3.1)$$
where $\Omega$ represents the three angles of the helicity basis and $q = \pm 1$ corresponds to the initial flavour of the $B_s^0$/$\bar{B}_s^0$ meson. The squared absolute amplitudes $A_\perp^2$, $A_0^2$, $A_\parallel^2$ and $A_S^2$ are parametrized by an S-wave fraction, $F_S^j = A_S^2$, for every bin $j$ in $m(K^-K^+)$, and by values for $|A_\perp|^2$ and $|A_0|^2$ representing the respective fractions of the resonant component. The parallel component is fixed by $A_\perp^2 + A_0^2 + A_\parallel^2 = 1$. $N_q$ is a normalization factor that is given by:
$$N_q = \int_{t=0}^{\infty} \int_{\Omega} \mathrm{PDF}(t, \Omega \mid q)\, \mathrm{d}\Omega\, \mathrm{d}t. \qquad (3.2)$$
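The role of the normalization factor can be illustrated with a small numerical sketch. It assumes that $N_q$ is obtained by integrating the sum in Equation (3.1) without the prefactor $1/N_q$. All ingredients are hypothetical toy choices and not the functions of the analysis: two terms instead of ten, a single angle $\cos\theta$ instead of the three helicity angles, and arbitrary values for the amplitudes and decay widths.

```python
import math

# Hypothetical toy ingredients, chosen only for illustration.
AMPLITUDES = [0.6, 0.4]   # stand-ins for the coefficients A_k
WIDTHS = [0.7, 0.6]       # assumed decay widths of the toy h_k(t)

def h(k, t):
    """Toy time dependence h_k(t): a plain exponential decay."""
    return math.exp(-WIDTHS[k] * t)

def f(k, c):
    """Toy angular functions f_k of c = cos(theta)."""
    return 1.0 + c * c if k == 0 else 1.0 - c * c

def unnormalized(t, c):
    """Sum over k of A_k h_k(t) f_k(c), i.e. the PDF without 1/N."""
    return sum(a * h(k, t) * f(k, c) for k, a in enumerate(AMPLITUDES))

def integrate(func, t_max=40.0, n_t=1000, n_c=100):
    """Midpoint rule on a (t, cos theta) grid; t_max truncates the infinite range."""
    dt, dc = t_max / n_t, 2.0 / n_c
    total = 0.0
    for i in range(n_t):
        t = (i + 0.5) * dt
        for j in range(n_c):
            total += func(t, -1.0 + (j + 0.5) * dc)
    return total * dt * dc

norm = integrate(unnormalized)   # plays the role of N_q

def pdf(t, c):
    """Normalized toy PDF."""
    return unnormalized(t, c) / norm

print("normalization factor:", norm)
print("integral of normalized PDF:", integrate(pdf))
```

In a real fit the normalization has to be recomputed whenever the physics parameters change, which is why analytic integrals are usually preferred over a grid sum like the one above.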
This PDF has to be modified when introducing flavour tagging, detector acceptances and resolution effects. At the end of the respective chapter or section, the relevant modifications of the PDF are given.
4 Analysis tools
4.1 Efficient selection: Boosted decision trees
The task to solve
A typical situation at the beginning of an analysis based on data from a collider experiment is a sample of signal candidates that is swamped by background processes. It is crucial to discriminate effectively between these two contributions and to obtain a signal sample that is as large and pure as possible. While the classical approach is based on the optimization of a set of rectangular cuts on some of the properties of the candidates, the method presented here automatically takes correlations between these properties into account. In addition, it reduces the final trade-off between signal efficiency and background rejection rate to the simple choice of a cut value on a single classification variable.
Machine learning
The algorithms presented here fall in the class of supervised machine learning. This means that the algorithm is trained to discriminate between signal and background candidates using a set of labeled candidates of these two categories. Typically, these training data sets are obtained from control regions in data or from simulation. An independent set of such samples can then be used to get an unbiased estimate of the performance of the algorithm.
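The separation into a training sample and an independent test sample can be sketched as follows. This is a minimal illustration under stated assumptions: the one-dimensional toy sample, the Gaussian shapes, and the helper names `split_labeled_sample`, `train_cut` and `accuracy` are hypothetical, and the "algorithm" is deliberately reduced to a single cut placed halfway between the class means.

```python
import random

def split_labeled_sample(candidates, test_fraction=0.5, seed=1):
    """Shuffle labeled candidates and split them into two disjoint sets."""
    shuffled = list(candidates)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(test_fraction * len(shuffled)))
    return shuffled[n_test:], shuffled[:n_test]  # (training, test)

def train_cut(data):
    """'Training': place a cut halfway between the two class means."""
    sig = [x for x, y in data if y == 1]
    bkg = [x for x, y in data if y == -1]
    return 0.5 * (sum(sig) / len(sig) + sum(bkg) / len(bkg))

def accuracy(data, cut):
    """Fraction of candidates for which sign(x - cut) matches the label."""
    return sum((1 if x > cut else -1) == y for x, y in data) / len(data)

# Toy labeled sample (arbitrary shapes): y = +1 signal, y = -1 background.
rng = random.Random(0)
sample = ([(rng.gauss(1.0, 1.0), 1) for _ in range(500)]
          + [(rng.gauss(-1.0, 1.0), -1) for _ in range(500)])

training, test = split_labeled_sample(sample)
cut = train_cut(training)
print("accuracy on training set:", accuracy(training, cut))
print("accuracy on independent test set:", accuracy(test, cut))
```

Only the number obtained on the test set is an unbiased performance estimate, because none of its candidates influenced the choice of the cut.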
Decision trees
We consider a training sample $N$ that consists of two species, labeled as $y = 1$ and $y = -1$, which have a set of properties $x$. A decision tree (DT) aims to create regions, called leaves, in the property space and classifies them as either $y = 1$ or $y = -1$. These leaves are defined in an iterative procedure that is based on binary decisions in one of the properties $x_i$. Figure 4.1 shows a simple example of a DT.
Figure 4.1: A simple example of a decision tree. The colors blue and red represent the two species that are separated by splits in the variables $x_{i_0}$ and $x_{i_1}$ at the values $c_0$ and $c_1$. (Branch labels: $x_{i_0} \le c_0$, $x_{i_0} > c_0$; $x_{i_1} \le c_1$, $x_{i_1} > c_1$.)
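In each step of this procedure, a property $x_{i_k}$ and a respective cut value $c_k$ are chosen such that the chosen metric is minimized. An example of such a metric is the sum of squares of the differences between the predicted species and the true species for all elements $n$ in the leaf $\tilde{N} \subset N$ that is currently processed:

$$\sum_{\substack{n \in \tilde{N} \\ x_{i_k} \le c_k}} (\tilde{y}_1 - y_n)^2 + \sum_{\substack{n \in \tilde{N} \\ x_{i_k} > c_k}} (\tilde{y}_2 - y_n)^2. \qquad (4.1)$$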
Here, $y_n \in \{1, -1\}$ is the actual species of the element $n$, and $\tilde{y}_1$ and $\tilde{y}_2$ are the predicted species in the respective leaf. This prediction is typically defined as the species that is more abundant in this region. In this case, every wrongly assigned element contributes $(\pm 2)^2 = 4$ and every correctly assigned element contributes 0, so the metric is directly proportional to the number of wrongly assigned species hypotheses.
The same concepts can also be applied to regression trees, which predict a continuous variable $y$ instead of a binary classification. Typically, the average $y$ value of the entries of the training sample in a leaf is chosen as the predicted value. A decision or regression tree is then completely defined by the parameters $\{(i_0, c_0), \ldots, (i_l, c_l)\}$, i.e. the cut values $c_k$ and the properties $x_{i_k}$ to cut on. The number of those cuts, $l$, depends on the depth of the tree.
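The search for a single split can be sketched as follows. This is a minimal illustration, assuming an exhaustive scan over all properties and over the midpoints between neighbouring observed values as candidate cuts. The function names `split_metric` and `best_split` and the toy sample are hypothetical and not taken from the analysis.

```python
def split_metric(sample, i, c):
    """Metric of Equation (4.1) for a split at x[i] <= c versus x[i] > c.

    `sample` is a list of (x, y) pairs with y in {+1, -1}. The prediction in
    each daughter leaf is the more abundant species (ties are assigned to +1).
    """
    total = 0
    for side in ([y for x, y in sample if x[i] <= c],
                 [y for x, y in sample if x[i] > c]):
        prediction = 1 if sum(side) >= 0 else -1
        total += sum((prediction - y) ** 2 for y in side)
    return total

def best_split(sample):
    """Scan all properties and candidate cuts; return (metric, i, c)."""
    best = None
    for i in range(len(sample[0][0])):
        values = sorted({x[i] for x, _ in sample})
        # Candidate cuts: midpoints between neighbouring observed values.
        for lo, hi in zip(values, values[1:]):
            candidate = (split_metric(sample, i, 0.5 * (lo + hi)), i, 0.5 * (lo + hi))
            if best is None or candidate < best:
                best = candidate
    return best

# Toy sample with two properties; only the first one separates the species.
toy = [((0.2, 5.0), -1), ((0.4, 1.0), -1), ((0.6, 4.0), 1),
       ((0.8, 2.0), 1), ((0.9, 3.0), 1)]
metric, i, c = best_split(toy)
print("best split: property", i, "at cut value", c, "with metric", metric)
```

A full tree would apply `best_split` recursively to the two daughter leaves until a maximum depth or a minimum leaf size is reached.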
Boosting and gradient boosting
The decision trees presented previously can in principle perfectly solve the task of classifying a training sample. However, they suffer from instability under small variations of this training sample. To mitigate this effect and ensure a good performance also on an independent sample, the method of boosting is employed [66]. The concept of boosting involves the sequential combination of many relatively weak classification or regression algorithms, called weak learners, to obtain a more powerful, but still robust, overall algorithm. One way to formulate a boosting algorithm is the gradient boosting method [67], which is discussed in more detail below.
We consider again a training data set with $N$ entries that have the properties $x$ and $y$. The aim is to find a function $F$ such that $F(x)$ infers the variable $y$ of an entry from its other properties $x$. Given a general loss function $L(F(x_1), y_1, \ldots, F(x_N), y_N)$ that measures the deviation between the predicted and true values of $y$, the gradient boosting algorithm minimizes $L$ by means of a gradient descent method in which the gradients are approximated by weak learners. An example of such a loss function is the metric given in Equation (4.1):

$$L(F(x_1), y_1, \ldots, F(x_N), y_N) = \sum_{i=1}^{N} (F(x_i) - y_i)^2, \qquad (4.2)$$
but in general any differentiable loss function can be used. In the case discussed here, the weak learners, $\phi(x, \theta)$, are the previously introduced regression trees that are described by the parameters $\theta = \{(i_0, c_0), \ldots, (i_l, c_l)\}$, see Figure 4.1.
The first step of the boosting is to fit a weak learner, $\phi(x, \theta_0)$, to the training data, which is then the first estimate $F_0(x)$ of the desired relation between $y$ and $x$. The following three steps, see Figure 4.2 for illustration, are then repeated $M$ times to sequentially improve this approximation:
For $m = 1, \ldots, M$:
1. Calculate the negative gradient $r_m$ of the loss function $L$ with respect to the prediction of the current model:

$$r_{im} = -\left[\frac{\partial L(F(x_1), y_1, \ldots, F(x_N), y_N)}{\partial F(x_i)}\right]_{F = F_{m-1}} \qquad (4.3)$$

In the case of the loss function given in Equation (4.2), these residuals are given for every element $i$ of the training sample by:

$$r_{im} = -2\,[F_{m-1}(x_i) - y_i]. \qquad (4.4)$$
2. Fit a weak learner $\phi(x, \theta_m)$ to the residuals $r_{im}$, so that it approximates the negative gradient of the loss function.
Figure 4.2: Schematic view of one step of the gradient boosting technique. The red line indicates the value of the loss function $L$ evaluated for the training sample. The x axis represents one dimension of the high-dimensional space $F(x_i)$, with $i \in \{1, \ldots, N\}$. (Axis labels: horizontal $F(x_i)$ with the points $F_{m-1}(x_i)$ and $F_m(x_i)$ marked; vertical $L(F(x_1), y_1, \ldots, F(x_N), y_N)$.)
3. Update the estimate of the relation between $y$ and $x$:

$$F_m(x) = F_{m-1}(x) + \nu_m\, \phi(x, \theta_m), \qquad (4.5)$$

where $\nu_m$ is a real parameter that can be determined using a line search to minimize the loss function.
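The boosting iteration can be sketched in a few lines for the squared-error loss of Equation (4.2). This is a minimal illustration and not the implementation used in the analysis. It assumes depth-one regression trees as weak learners, fits each weak learner to the residuals of Equation (4.4), and uses the closed-form line search that exists for this particular loss. All function names and the toy data are hypothetical.

```python
def fit_stump(xs, targets):
    """Weak learner: depth-one regression tree fitted by least squares.

    Returns theta = (i, c, left_value, right_value). Assumes that at least
    one property takes two or more distinct values.
    """
    best = None
    for i in range(len(xs[0])):
        values = sorted({x[i] for x in xs})
        for lo, hi in zip(values, values[1:]):
            c = 0.5 * (lo + hi)
            left = [r for x, r in zip(xs, targets) if x[i] <= c]
            right = [r for x, r in zip(xs, targets) if x[i] > c]
            lv, rv = sum(left) / len(left), sum(right) / len(right)
            sse = (sum((r - lv) ** 2 for r in left)
                   + sum((r - rv) ** 2 for r in right))
            if best is None or sse < best[0]:
                best = (sse, (i, c, lv, rv))
    return best[1]

def phi(x, theta):
    """Evaluate the weak learner phi(x, theta)."""
    i, c, lv, rv = theta
    return lv if x[i] <= c else rv

def loss(preds, ys):
    """Squared-error loss of Equation (4.2)."""
    return sum((p - y) ** 2 for p, y in zip(preds, ys))

def gradient_boost(xs, ys, n_rounds):
    theta0 = fit_stump(xs, ys)            # first estimate F_0
    model = [(1.0, theta0)]               # F_0 is stored with unit weight
    preds = [phi(x, theta0) for x in xs]
    for m in range(1, n_rounds + 1):
        # Step 1: residuals r_im = -2 [F_{m-1}(x_i) - y_i], Equation (4.4).
        residuals = [-2.0 * (p - y) for p, y in zip(preds, ys)]
        # Step 2: fit a weak learner to the residuals.
        theta = fit_stump(xs, residuals)
        steps = [phi(x, theta) for x in xs]
        # Step 3: line search for nu_m (closed form for squared error).
        denom = sum(s * s for s in steps)
        if denom == 0.0:
            break                         # nothing left to improve
        nu = sum(s * (y - p) for s, y, p in zip(steps, ys, preds)) / denom
        preds = [p + nu * s for p, s in zip(preds, steps)]
        model.append((nu, theta))
    return model

def predict(model, x):
    """Linear combination of the weak-learner outputs."""
    return sum(nu * phi(x, theta) for nu, theta in model)

# Toy problem: a 'bump' in one property that a single stump cannot describe.
xs = [(k / 10.0,) for k in range(20)]
ys = [1.0 if 5 <= k < 15 else -1.0 for k in range(20)]
model = gradient_boost(xs, ys, n_rounds=10)
print("final loss:", loss([predict(model, x) for x in xs], ys))
```

For other loss functions only the residual calculation and the line search change; the fit of the weak learner to the residuals stays the same.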
In this way, the final prediction of $y$ based on $x$ is given by the linear combination of the outputs of many weak learners:

$$F(x) = \sum_{m=1}^{M} \nu_m\, \phi(x, \theta_m), \qquad (4.6)$$

and minimizes the defined loss function $L$. During the boosting iterations, the step parameters $\nu_m$ are typically scaled by a factor from the interval $(0, 1]$. This procedure is called shrinkage and, although more weak learners have to be combined, makes the boosting more robust.
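Written out, shrinkage modifies the update of Equation (4.5) as shown in the following sketch. The symbol $\eta$ for the scaling factor is a notation assumed here for illustration and does not appear in the text above.

```latex
F_m(x) = F_{m-1}(x) + \eta\, \nu_m\, \phi(x, \theta_m), \qquad \eta \in (0, 1].
```

For $\eta = 1$ this reduces to Equation (4.5). Smaller values shorten every step, so that more iterations $M$ are typically needed to reach the same value of the loss function on the training sample.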
In the analysis presented in this thesis, the implementation of gradient boosting within the TMVA framework [68] is used to discriminate between signal and background candidates in data. Although the classifier consists solely of regression trees, such a classifier is usually called a boosted decision tree (BDT).