
acceptance and the decay-time resolution determined here are close to the ones used in the official analysis, while the final fit result and the statistical subtraction of background candidates were only used as a cross-check. For consistency reasons, it was decided to present the results of this autonomous study rather than the one of the official LHCb analysis. A comparison of the two sets of results is shown in Appendix G.3.

3.3 Probability density function of the decay $B_s^0 \to J/\psi\,\phi$

The extraction of the physics parameters relies on a correct description of the decay-time and angular distributions of the selected $B_s^0 \to J/\psi\,\phi$ decays. The parametrization of these distributions is called the probability density function (PDF) and will be developed throughout the following chapters. The starting point is the underlying PDF as it would appear if the complete information about all $B_s^0 \to J/\psi\,\phi$ decays were available. It is essentially given by Equations (1.53) and (1.54):

$$\mathrm{PDF}(t, \Omega \,|\, q) = \frac{1}{N_q} \sum_{k=1}^{10} A_k\, h_{k,q}(t)\, f_k(\Omega), \tag{3.1}$$

where $\Omega$ represents the three angles of the helicity basis and $q = \pm 1$ corresponds to the initial flavour of the $B_s^0$/$\bar{B}_s^0$ meson. The absolute amplitudes squared $A^2_{\perp,0,\parallel,S}$ are parametrized by an S-wave fraction, $F_{Sj} = A_S^2$, for every bin in $m(K^-K^+)$ and values for $|A_\perp|^2$ and $|A_0|^2$ representing the respective fraction of the resonant component. The parallel component is fixed by $A_\perp^2 + A_0^2 + A_\parallel^2 = 1$. $N_q$ is a normalization factor that is given by:

$$N_q = \int_{t=0}^{\infty} \int_{\Omega} \mathrm{PDF}(t, \Omega \,|\, q)\, \mathrm{d}\Omega\, \mathrm{d}t. \tag{3.2}$$
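The structure of Equations (3.1) and (3.2) can be illustrated with a minimal numerical sketch. Everything in it is a toy assumption made only for illustration: two terms instead of ten, a single angular variable $c = \cos\theta$ instead of the three helicity angles, a single flavour $q$, and made-up functions standing in for $h_{k,q}(t)$ and $f_k(\Omega)$. The normalization factor is taken as the integral of the unnormalized sum.

```python
import math

# Toy ingredients (hypothetical, NOT the physical functions of the analysis).
A = [0.7, 0.3]                                   # toy amplitude coefficients A_k
h = [lambda t: math.exp(-t),                     # toy decay-time dependence h_k(t)
     lambda t: math.exp(-1.5 * t)]
f = [lambda c: 1.0 + c * c,                      # toy angular dependence f_k(c)
     lambda c: 1.0 - c * c]


def unnormalized(t, c):
    """Sum over k of A_k * h_k(t) * f_k(c), i.e. Eq. (3.1) without 1/N_q."""
    return sum(a_k * h_k(t) * f_k(c) for a_k, h_k, f_k in zip(A, h, f))


def integrate(func, lo, hi, n):
    """One-dimensional midpoint rule with n equal steps."""
    step = (hi - lo) / n
    return step * sum(func(lo + (i + 0.5) * step) for i in range(n))


def integrate_2d(g, t_max=30.0, n_t=800, n_c=100):
    """Integrate g(t, c) over t in [0, t_max] and c in [-1, 1].

    The infinite upper limit in t is truncated at t_max, where the toy
    exponentials are negligible.
    """
    return integrate(lambda t: integrate(lambda c: g(t, c), -1.0, 1.0, n_c),
                     0.0, t_max, n_t)


# Normalization factor in the spirit of Eq. (3.2).
N_q = integrate_2d(unnormalized)


def pdf(t, c):
    """Normalized toy PDF."""
    return unnormalized(t, c) / N_q


# By construction the normalized PDF integrates to one.
total = integrate_2d(pdf)
```

In a real fit the integrals would be evaluated analytically or with dedicated tools; the midpoint rule is chosen here only to keep the sketch free of external dependencies.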

This PDF has to be modified when introducing flavour tagging, detector acceptances and resolution effects. At the end of the respective chapter or section, the relevant modifications of the PDF are given.

4 Analysis tools

4.1 Efficient selection: Boosted decision trees

The task to solve

A typical situation at the beginning of an analysis based on data from a collider experiment is a sample of signal candidates that is swamped by background processes. It is crucial to discriminate effectively between these two contributions and to obtain a signal sample that is as large and pure as possible. While the classical approach is based on the optimization of a set of rectangular cuts on some of the properties of the candidates, the method presented here automatically takes correlations between these properties into account. In addition, it reduces the final trade-off between signal efficiency and background rejection to the simple choice of a cut value on a single classification variable.

Machine learning

The algorithms presented here fall in the class of supervised machine learning. This means that the algorithm is trained to discriminate between signal and background candidates using a set of labeled candidates of these two categories. Typically, these training data sets are obtained from control regions in data or from simulation. An independent set of such samples can then be used to get an unbiased estimate of the performance of the algorithm.

Decision trees

We consider a training sample $N$ that consists of two species, labeled as $y = 1$ and $y = -1$, which have a set of properties $x$. A decision tree (DT) aims to create regions, called leaves, in the property space and classifies them as either $y = 1$ or $y = -1$. These leaves are defined in an iterative procedure that is based on binary decisions in one of the properties $x_i$. Figure 4.1 shows a simple example of a DT.


Figure 4.1: A simple example of a decision tree with the splits $x_{i_0} \le c_0$ / $x_{i_0} > c_0$ and $x_{i_1} \le c_1$ / $x_{i_1} > c_1$. The colors blue and red represent the two species that are separated by splits in the variables $x_{i_0}$ and $x_{i_1}$ at the values $c_0$ and $c_1$.

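In each step of this procedure, the algorithm chooses a property $x_{i_k}$ and a respective cut value $c_k$ such that the chosen metric is minimized. An example of such a metric is the sum of squares of the difference between the predicted species and the true species for all elements $n$ in the leaf $\tilde{N} \subset N$ that is currently processed:

$$\sum_{\substack{n \in \tilde{N} \\ x_{i_k} \le c_k}} (\tilde{y}_1 - y_n)^2 + \sum_{\substack{n \in \tilde{N} \\ x_{i_k} > c_k}} (\tilde{y}_2 - y_n)^2. \tag{4.1}$$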

Here, $y_n \in \{1, -1\}$ is the actual species of the element $n$, and $\tilde{y}_1$ and $\tilde{y}_2$ are the predicted species in the respective leaf. This prediction is typically defined as the species that is more abundant in this region. In this case, the metric is therefore directly proportional to the number of wrongly assigned species hypotheses.
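The search for a single split can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the analysis: the candidate cuts are taken as midpoints between neighbouring distinct values (one common choice), ties in the majority vote are resolved towards $y = 1$, and the function names and the toy sample are hypothetical.

```python
def majority(labels):
    """Predicted species of a region: the more abundant label (+1 on ties)."""
    return 1 if sum(labels) >= 0 else -1


def split_metric(xs, ys, cut):
    """Sum of squared differences of Eq. (4.1) for the split x <= cut / x > cut."""
    left = [y for x, y in zip(xs, ys) if x <= cut]
    right = [y for x, y in zip(xs, ys) if x > cut]
    total = 0.0
    for side in (left, right):
        if side:
            pred = majority(side)
            # Each wrongly assigned element contributes (1 - (-1))^2 = 4.
            total += sum((pred - y) ** 2 for y in side)
    return total


def best_split(sample):
    """Scan all properties i and candidate cuts c; return (metric, i, c).

    sample is a list of (x_vector, y) pairs with y in {1, -1}.
    """
    n_props = len(sample[0][0])
    ys = [y for _, y in sample]
    best = None
    for i in range(n_props):
        xs = [x[i] for x, _ in sample]
        values = sorted(set(xs))
        # Candidate cuts: midpoints between neighbouring distinct values.
        for lo, hi in zip(values, values[1:]):
            cut = 0.5 * (lo + hi)
            metric = split_metric(xs, ys, cut)
            if best is None or metric < best[0]:
                best = (metric, i, cut)
    return best


# Toy sample: property 0 carries no information, property 1 separates the species.
sample = [((0.1, 1.0), -1), ((0.9, 1.2), -1), ((0.4, 1.1), -1),
          ((0.2, 3.0), 1), ((0.8, 3.3), 1), ((0.5, 2.9), 1)]
result = best_split(sample)
```

For this toy sample the scan selects property 1 with a cut between the two groups, where the metric vanishes because no element is wrongly assigned.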

The same concepts can also be applied to regression trees, which try to predict a continuous variable $y$ instead of a binary classification. Typically, the average $y$ value of the entries of the training sample in a leaf is chosen as the predicted value. A decision or regression tree is then completely defined by the parameters $\{(i_0, c_0), \ldots, (i_l, c_l)\}$, i.e. the cut values $c_k$ and the properties $x_{i_k}$ to cut on. The number of those cuts, $l$, depends on the depth of the tree.
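A regression tree of limited depth can be sketched as a recursive procedure. The sketch below rests on assumptions chosen for illustration: the split criterion is the sum of squared deviations from the leaf averages, the tree is stored as a nested dictionary, and all names and the toy data are hypothetical.

```python
def fit_tree(points, depth):
    """Fit a regression tree to a list of (x_vector, y) pairs.

    Returns a nested dict: leaves hold the average y, inner nodes hold the
    property index "i", the cut value "c" and the two subtrees.
    """
    ys = [y for _, y in points]
    mean = sum(ys) / len(ys)
    if depth == 0 or len(set(ys)) == 1:
        return {"value": mean}
    best = None
    for i in range(len(points[0][0])):
        values = sorted({x[i] for x, _ in points})
        for lo, hi in zip(values, values[1:]):
            cut = 0.5 * (lo + hi)
            left = [p for p in points if p[0][i] <= cut]
            right = [p for p in points if p[0][i] > cut]
            sse = 0.0
            for side in (left, right):
                side_mean = sum(y for _, y in side) / len(side)
                sse += sum((y - side_mean) ** 2 for _, y in side)
            if best is None or sse < best[0]:
                best = (sse, i, cut, left, right)
    if best is None:  # all entries share the same x: no split possible
        return {"value": mean}
    _, i, cut, left, right = best
    return {"i": i, "c": cut,
            "le": fit_tree(left, depth - 1),
            "gt": fit_tree(right, depth - 1)}


def predict(tree, x):
    """Follow the binary decisions down to a leaf and return its average y."""
    while "value" not in tree:
        tree = tree["le"] if x[tree["i"]] <= tree["c"] else tree["gt"]
    return tree["value"]


def cuts(tree):
    """Collect the parameters (i_k, c_k) that define the tree."""
    if "value" in tree:
        return []
    return [(tree["i"], tree["c"])] + cuts(tree["le"]) + cuts(tree["gt"])


# Toy data: one property, two plateaus in y.
points = [((0.0,), 1.0), ((1.0,), 1.2), ((2.0,), 3.0), ((3.0,), 3.2)]
tree = fit_tree(points, depth=1)
```

With depth 1 the tree consists of a single cut between the two plateaus, and `cuts(tree)` returns the one pair $(i_0, c_0)$ that defines it.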

Boosting and gradient boosting

The decision trees presented previously can in principle perfectly solve the task of classifying a training sample. However, they suffer from instability under small variations of this training sample. To mitigate this effect and ensure a good performance also on an independent sample, the method of boosting is employed [66]. The concept of boosting involves the sequential combination of many relatively weak classification or regression algorithms, called weak learners, to obtain a more powerful, but still robust, overall algorithm. One way to formulate a boosting algorithm is the gradient boosting method [67], which is discussed in more detail below.

We consider again a training data set with $N$ entries that have the properties $x$ and $y$. The aim is to find a function $F$ such that $F(x)$ infers the variable $y$ of an entry based on its other properties $x$. Given a general loss function $L(F(x_1), y_1, \ldots, F(x_N), y_N)$ that measures the deviation between the predicted and true values of $y$, the gradient boosting algorithm tries to minimize $L$ by means of a gradient descent method, in which the gradients are approximated by weak learners. An example of such a loss function is the metric given in Equation (4.1):

$$L(F(x_1), y_1, \ldots, F(x_N), y_N) = \sum_{i=1}^{N} (F(x_i) - y_i)^2, \tag{4.2}$$

but in general any loss function can be used. In the case discussed here, the weak learners, $\phi(x, \theta)$, are the previously introduced regression trees that are described by the parameters $\theta = \{(i_0, c_0), \ldots, (i_l, c_l)\}$, see Figure 4.1.

The first step of the boosting is to fit a weak learner, $\phi(x, \theta_0)$, to the training data, which is then the first estimate $F_0(x)$ of the desired relation between $y$ and $x$. The following three steps, see Figure 4.2 for illustration, are then repeated $M$ times to sequentially improve this approximation:

For $m = 1, \ldots, M$:

1. Calculate the negative gradient $r_m$ of the loss function $L$ with respect to the prediction of the current model:

$$r_{im} = -\left[\frac{\partial L(F(x_1), y_1, \ldots, F(x_N), y_N)}{\partial F(x_i)}\right]_{F = F_{m-1}} \tag{4.3}$$

In the case of the loss function given in Equation (4.2), these residuals are given for every element $i$ of the training sample by:

$$r_{im} = -2\,[F_{m-1}(x_i) - y_i]. \tag{4.4}$$

2. Fit a weak learner $\phi(x, \theta_m)$ to the residuals $r_{im}$.

Figure 4.2: Schematic view of one step during the gradient boosting technique, showing the loss function $L(F(x_1), y_1, \ldots, F(x_N), y_N)$ as a function of $F(x_i)$ with the points $F_{m-1}(x_i)$ and $F_m(x_i)$. The red line indicates the value of the loss function $L$ evaluated for the training sample. The x axis represents one dimension of the high-dimensional space $F(x_i)$, with $i \in \{1, \ldots, N\}$.

3. Update the estimate of the relation between $y$ and $x$:

$$F_m(x) = F_{m-1}(x) + \nu_m\, \phi(x, \theta_m), \tag{4.5}$$

where $\nu_m$ is a real parameter that can be determined using a line search to minimize the loss function.
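The iteration can be sketched in Python for the quadratic loss of Equation (4.2). This is a minimal sketch under stated assumptions, not the TMVA implementation: the weak learners are depth-1 regression trees on a single property, the weak learner of each iteration is fitted to the residuals $r_{im}$ (assumed here from the statement that the gradients are approximated by weak learners), the line search uses the closed-form minimum available for the quadratic loss, and the `shrinkage` argument, function names and toy data are hypothetical.

```python
def fit_stump(xs, targets):
    """Depth-1 regression tree on one property: (cut, left_mean, right_mean)."""
    values = sorted(set(xs))
    best = None
    for lo, hi in zip(values, values[1:]):
        cut = 0.5 * (lo + hi)
        left = [t for x, t in zip(xs, targets) if x <= cut]
        right = [t for x, t in zip(xs, targets) if x > cut]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((t - lm) ** 2 for t in left) + sum((t - rm) ** 2 for t in right)
        if best is None or sse < best[0]:
            best = (sse, cut, lm, rm)
    return best[1:]


def stump_predict(stump, x):
    cut, lm, rm = stump
    return lm if x <= cut else rm


def loss(preds, ys):
    """Quadratic loss of Eq. (4.2)."""
    return sum((p - y) ** 2 for p, y in zip(preds, ys))


def gradient_boost(xs, ys, n_steps, shrinkage=1.0):
    # F_0: a first weak learner fitted directly to y.
    stump0 = fit_stump(xs, ys)
    model = [(1.0, stump0)]
    preds = [stump_predict(stump0, x) for x in xs]
    history = [loss(preds, ys)]
    for _ in range(n_steps):
        # Step 1: negative gradient of the quadratic loss, Eq. (4.4).
        r = [-2.0 * (p - y) for p, y in zip(preds, ys)]
        # Step 2 (assumed): approximate the residuals with a weak learner.
        stump = fit_stump(xs, r)
        phi = [stump_predict(stump, x) for x in xs]
        denom = sum(v * v for v in phi)
        if denom == 0.0:
            break  # residuals vanish, nothing left to improve
        # Step 3: closed-form line search for nu, then update as in Eq. (4.5).
        nu = sum(v * (y - p) for v, y, p in zip(phi, ys, preds)) / denom
        nu *= shrinkage
        preds = [p + nu * v for p, v in zip(preds, phi)]
        model.append((nu, stump))
        history.append(loss(preds, ys))
    return model, history


def model_predict(model, x):
    """Linear combination of the weak-learner outputs."""
    return sum(nu * stump_predict(stump, x) for nu, stump in model)


# Toy data: y = x^2 on a grid of 20 points.
xs = [i / 10.0 for i in range(20)]
ys = [x * x for x in xs]
model, history = gradient_boost(xs, ys, n_steps=20, shrinkage=0.5)
```

The list `history` records the loss after each iteration; for the quadratic loss with an exact line search it cannot increase from one iteration to the next.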

In this way, the final prediction of $y$ based on $x$ is given by the linear combination of the output of many weak learners:

$$F(x) = \sum_{m=1}^{M} \nu_m\, \phi(x, \theta_m), \tag{4.6}$$

and minimizes the defined loss function $L$. During the boosting iterations, the step parameters $\nu_m$ are typically scaled by a number in the interval $(0, 1]$. This procedure is called shrinkage and, although more weak learners have to be combined, makes the boosting more robust.
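For the quadratic loss of Equation (4.2), the line search has a closed form, which makes the effect of shrinkage explicit. The following is a sketch of that special case; the symbols $\nu_m^{\ast}$ for the unscaled line-search result and $\eta$ for the shrinkage factor are introduced here only for illustration.

```latex
\begin{align}
\nu_m^{\ast}
  &= \arg\min_{\nu} \sum_{i=1}^{N}
     \bigl[F_{m-1}(x_i) + \nu\,\phi(x_i,\theta_m) - y_i\bigr]^2
   = \frac{\sum_{i=1}^{N} \phi(x_i,\theta_m)\, r_{im}}
          {2 \sum_{i=1}^{N} \phi(x_i,\theta_m)^2}, \\
F_m(x)
  &= F_{m-1}(x) + \eta\,\nu_m^{\ast}\,\phi(x,\theta_m),
  \qquad \eta \in (0, 1].
\end{align}
```

The first line follows from setting the derivative with respect to $\nu$ to zero and inserting $r_{im} = -2\,[F_{m-1}(x_i) - y_i]$. Since the loss is a convex parabola in $\nu$, any $\eta \in (0, 1]$ still does not increase the loss, but each weak learner then contributes a smaller step.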

In the analysis presented in this thesis, the implementation of gradient boosting within the TMVA framework [68] is used to discriminate between signal and background candidates in data. Although the classifier consists solely of regression trees, such a classifier is usually called a boosted decision tree (BDT).

In document CP violation and lifetime measurements in the decay B_0^s → J/ψ φ with the LHCb experiment (Page 55-61)