5.6 Data-Driven Background Estimation
5.6.1 Hemisphere Mixing
The data-driven background estimation method proposed here divides each event into two parts, referred to as hemispheres, so that each can be substituted by a hemisphere from a different event in order to produce an artificial dataset. A graphical illustration of the hemisphere mixing technique used in this work is provided in Figure 5.6. The transverse thrust axis, defined as the axis in the $x$-$y$ plane for which the sum of the absolute values of the projections of the transverse momenta of the selected subset of reconstructed jets is maximal, is used as a reference: each original event is divided into two halves by the plane perpendicular to this axis. This procedure is carried out for all the collected events that pass the selection described in Section 5.5, creating a dataset (or library) of hemispheres with twice as many rows as the number of original events. Each half, or hemisphere, can be reduced to a set of reconstructed jets with their directions relative to the thrust axis. Once the hemisphere library has been created and an appropriate distance metric has been defined, each hemisphere in the original event can be substituted by a similar one from a different event. The procedure results in an artificial dataset that can be used to model the background component.
[Figure 5.6 diagram. Panels: Original Event (x, y axes; transverse thrust axis; break in two hemispheres), Hemisphere library (filled in 1st pass, queried on 2nd), Mixed Event (x, y axes; transverse thrust axis; using replaced hemispheres). Legend: b-tag jets, non b-tag jets.]
Figure 5.6: Schematic depiction of the hemisphere mixing background estimation procedure. The red arrows represent b-tagged jets and the blue arrows represent jets that were not b-tagged in an event. The first step consists of finding the thrust axis in the $x$-$y$ plane. The event is then divided into two hemispheres, each composed of a set of jets, by the plane perpendicular to the thrust axis. All these hemispheres are used to create a dataset (or library) of hemispheres. For each original event, an artificial event can be created by substituting each original hemisphere with one of its nearest neighbours, once a distance metric for hemispheres has been defined. Figure adapted from [148].
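The thrust-axis search and the hemisphere split described above can be illustrated with a minimal Python sketch. It is not the analysis code: the function names, the `(pT, phi)` jet representation and the brute-force angular scan (with `n_steps` grid points) are assumptions made for illustration only.

```python
import math

def transverse_thrust_phi(jets, n_steps=3600):
    """Scan phi_T in [0, pi) and return the angle that maximises
    sum_i |pT_i * cos(phi_i - phi_T)|. `jets` is a list of (pT, phi).
    The grid scan is an illustrative choice, not the method of the analysis."""
    best_phi, best_sum = 0.0, -1.0
    for step in range(n_steps):
        phi_t = math.pi * step / n_steps
        total = sum(abs(pt * math.cos(phi - phi_t)) for pt, phi in jets)
        if total > best_sum:
            best_phi, best_sum = phi_t, total
    return best_phi

def split_hemispheres(jets, phi_t):
    """Split the jets by the sign of their projection onto the thrust axis,
    i.e. by the plane perpendicular to that axis."""
    positive = [j for j in jets if math.cos(j[1] - phi_t) > 0]
    negative = [j for j in jets if math.cos(j[1] - phi_t) <= 0]
    return positive, negative

# Toy event (hypothetical values): two roughly back-to-back pairs of jets,
# each given as (pT in GeV, phi in radians).
event = [(100.0, 1.1), (80.0, 0.8), (90.0, -2.0), (70.0, -2.3)]
phi_t = transverse_thrust_phi(event)
hemi_a, hemi_b = split_hemispheres(event, phi_t)
```

In this toy event the two jets near phi = 1 end up in one hemisphere and the two jets near phi = -2 in the other; running the same split over every selected event would fill the hemisphere library with two entries per event.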
The matching between the original and the replacement hemisphere is done by finding the pair that minimises an inter-hemisphere distance. This distance is a function of the set of reconstructed jets contained within each hemisphere, and it is a combination of discrete and continuous variables. The discrete requirement for matching original hemispheres with those in the library is that they have the same number of jets Nh [...] distributions for the artificial data. The previous condition also avoids creating artificial events that do not pass the event selection, e.g. by combining a hemisphere with two b-tagged jets with another one including only one b-tagged jet, which would result in artificial events with fewer than four b-tagged jets. For infrequent jet and b-jet multiplicity categories, the discrete condition is relaxed by merging them into a single category. This is the case, for example, when four jets or b-jets are present in the hemisphere. In addition to this categorisation, the following continuous distance metric between the original hemisphere $h_o$ and each hemisphere $h_q$ from the library is defined as a measure of similarity:
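\begin{equation}
d(h_o, h_q)^2 = \frac{\left(M_t(h_o) - M_t(h_q)\right)^2}{\mathrm{Var}(M_t)} + \frac{\left(T(h_o) - T(h_q)\right)^2}{\mathrm{Var}(T)} + \frac{\left(T_a(h_o) - T_a(h_q)\right)^2}{\mathrm{Var}(T_a)} + \frac{\left(P_z(h_o) - P_z(h_q)\right)^2}{\mathrm{Var}(P_z)} \tag{5.6}
\end{equation}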
where $M_t(h)$ is the invariant mass of the system composed of all the jets contained in the hemisphere, $T(h)$ is the scalar sum of the projections of the transverse momenta of all jets in the hemisphere onto the thrust axis, $T_a(h)$ is the scalar sum of the transverse momenta projections onto an axis orthogonal to the thrust axis, and $P_z(h)$ is the absolute value of the projection of the vector sum of the jet momenta along the beam axis. The denominators in Equation 5.6 are the variances for each variable and discrete category, as estimated directly from the library of hemispheres. This normalisation is included in order to reduce the effect of the different scales of the components on the distance metric.
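As an illustration of Equation 5.6, the following Python sketch computes the four hemisphere variables and the variance-normalised squared distance. The function names and the `(px, py, pz, E)` jet representation are hypothetical, and taking the absolute value of each jet projection in the scalar sums is an assumption about how the "scalar sum" is formed. The `library` passed to `feature_variances` is meant to hold the hemispheres of a single discrete category.

```python
import math
from statistics import pvariance

def hemisphere_features(jets, thrust_phi):
    """Return (M_t, T, T_a, P_z) for a hemisphere.
    `jets` is a list of (px, py, pz, E); `thrust_phi` is the thrust-axis angle."""
    tx, ty = math.cos(thrust_phi), math.sin(thrust_phi)
    px = sum(j[0] for j in jets)
    py = sum(j[1] for j in jets)
    pz = sum(j[2] for j in jets)
    energy = sum(j[3] for j in jets)
    # Invariant mass of the summed jet four-momenta (clipped at zero).
    mass = math.sqrt(max(energy**2 - px**2 - py**2 - pz**2, 0.0))
    # Scalar sums of the per-jet projections along and across the thrust axis
    # (absolute values per jet: an assumption of this sketch).
    t_along = sum(abs(j[0] * tx + j[1] * ty) for j in jets)
    t_across = sum(abs(-j[0] * ty + j[1] * tx) for j in jets)
    return (mass, t_along, t_across, abs(pz))

def feature_variances(library):
    """Variance of each variable, estimated from a list of feature tuples.
    A variable with zero variance would make the distance undefined."""
    return tuple(pvariance(column) for column in zip(*library))

def distance_sq(f_o, f_q, variances):
    """Squared distance of Equation 5.6 between two feature tuples."""
    return sum((a - b) ** 2 / v for a, b, v in zip(f_o, f_q, variances))

# Toy library of (M_t, T, T_a, P_z) tuples with hypothetical values in GeV.
library = [(10.0, 100.0, 20.0, 5.0), (12.0, 110.0, 25.0, 15.0), (30.0, 200.0, 60.0, 50.0)]
variances = feature_variances(library)
d_near = distance_sq(library[0], library[1], variances)
d_far = distance_sq(library[0], library[2], variances)
single_jet = hemisphere_features([(50.0, 0.0, 0.0, 50.0)], 0.0)
```

Dividing by the variances makes each term dimensionless, so a variable with a large numerical range, such as $T$, does not dominate the sum.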
The substitute for each original hemisphere is found by searching for its $k$th nearest-neighbour hemisphere in the library. The closest hemisphere ($k = 0$), at zero distance, would be the original hemisphere itself, which is present in the library. The hemisphere is therefore substituted with its $k$th nearest neighbour, considering only $k \geq 1$. Assuming forward-backward symmetry in the $z$ direction and rotational symmetry in $\phi$, and given that the distance metric $d(h_o, h_q)^2$ does not depend on the sign of the $z$ component or on the absolute value of $\phi$, all the jets in the substitute hemisphere can be rotated in $\phi$, or have the sign of their $p_z$ flipped, to match the properties of the original hemisphere. It is possible to consider different $k$ neighbours for each hemisphere, obtaining a different artificial dataset in each case. Each of these artificial datasets can be labelled by a tuple $(k_1, k_2)$, where $k_1$ indicates the ordinal of the neighbour used as the substitute for the original hemisphere corresponding to $\Delta\phi > 0$ with respect to the thrust vector rotated by $\pi/2$ clockwise, and $k_2$ corresponds to the ordinal of the neighbour substituting the other original hemisphere. Consequently, if up to $k_{\max}$ neighbours are considered for each hemisphere, a total of $k_{\max}^2$ artificial datasets, each of the same size as the original dataset, can be composed by considering all the $(k_1, k_2)$ combinations.
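The neighbour search and the $(k_1, k_2)$ labelling can be sketched as follows. This is a brute-force illustration under stated assumptions: the dictionary layout with `"category"` and `"features"` keys, the function names and the unit variances in the toy example are all hypothetical, and a real implementation would likely use an indexed nearest-neighbour search instead of sorting the whole library.

```python
from itertools import product

def distance_sq(f_o, f_q, variances):
    """Variance-normalised squared distance between two feature tuples."""
    return sum((a - b) ** 2 / v for a, b, v in zip(f_o, f_q, variances))

def kth_neighbour(query_idx, library, variances, k):
    """Index of the k-th nearest neighbour of library[query_idx], restricted to
    hemispheres in the same discrete category. k = 0 is the query itself;
    None is returned when the category holds fewer than k + 1 hemispheres."""
    query = library[query_idx]
    same_category = [i for i, h in enumerate(library)
                     if h["category"] == query["category"]]
    ranked = sorted(
        same_category,
        # The second key element keeps the query first in case of exact ties.
        key=lambda i: (distance_sq(query["features"], library[i]["features"],
                                   variances), i != query_idx),
    )
    return ranked[k] if k < len(ranked) else None

def neighbour_combinations(k_max):
    """All (k1, k2) labels with 1 <= k1, k2 <= k_max: k_max**2 in total."""
    return list(product(range(1, k_max + 1), repeat=2))

# Toy library: category is (number of jets, number of b-tagged jets).
library = [
    {"category": (2, 1), "features": (10.0, 100.0, 20.0, 5.0)},
    {"category": (2, 1), "features": (11.0, 105.0, 21.0, 6.0)},
    {"category": (2, 1), "features": (30.0, 200.0, 60.0, 50.0)},
    {"category": (3, 2), "features": (10.0, 100.0, 20.0, 5.0)},
]
variances = (1.0, 1.0, 1.0, 1.0)  # unit variances, for illustration only
first = kth_neighbour(0, library, variances, 1)
labels = neighbour_combinations(10)
```

With $k_{\max} = 10$ the enumeration yields 100 labels, matching the $k_{\max}^2$ count in the text. The last hemisphere in the toy library is never returned for the first one, despite identical features, because it sits in a different category.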
The rationale for the above technique rests on the fact that QCD multi-jet production at leading order corresponds to a $2 \to 2$ parton scattering process, which is then affected by higher-order corrections such as QCD radiation, pileup or multiple interactions. By breaking the event into two hemispheres using the transverse thrust axis, the aim is to separate the outcomes of the processes associated with each of the two final-state partons in the $2 \to 2$ approximation. The hemisphere distance metric attempts to preserve the main properties of the event, while avoiding strong correlations between jets in the two hemispheres. The goal of the hemisphere mixing procedure is then to obtain an artificial dataset in which the effect of the signal present in the original dataset is effectively removed. This has been tested by injecting up to 100 times the expected SM contribution of simulated HH production events into a dataset of simulated QCD multi-jet events [147]. The distributions of the various variables after hemisphere mixing are not affected by the presence of signal, and are compatible with the QCD multi-jet component, which is the dominant component. The level of agreement for the variables used as input to the probabilistic classifier in a control region will be discussed in more detail in Section 5.6.2.
[Figure 5.7 plot. Horizontal axis: $k_1$-$k_2$ neighbour combination (1-1, 2-2, 4-4, 8-8, 16-16, 32-32, 64-64, 128-128). Vertical axis: $\chi^2$ score (mixed versus re-mixed data), from 80 to 140. Reference bands: 50%, 10%, 5%. CMS preliminary, 35.9 fb$^{-1}$ (13 TeV).]
Figure 5.7: Comparison ($\chi^2$ score) of the mixed and re-mixed data (see Section 5.6.2) as a function of the neighbour combination $(k_1, k_2)$. The test score has been calculated from the binned distribution of the probabilistic classifier output. The one-sided confidence bands for the test score are also included for guidance. Figure adapted from [148].
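The exact definition of the test score is not given in this section. As a hedged illustration only, one common $\chi^2$ statistic for comparing two binned distributions from samples of equal size is sketched below; the function name and the toy bin contents are hypothetical, and the score used in the analysis may differ.

```python
def chi2_score(counts_a, counts_b):
    """Chi-square statistic between two histograms with identical binning,
    assuming equal-size samples: sum over bins of (a - b)^2 / (a + b).
    Bins that are empty in both histograms are skipped."""
    if len(counts_a) != len(counts_b):
        raise ValueError("histograms must have the same number of bins")
    score = 0.0
    for a, b in zip(counts_a, counts_b):
        if a + b > 0:
            score += (a - b) ** 2 / (a + b)
    return score

# Toy binned classifier outputs (hypothetical counts, equal totals).
mixed = [120, 80, 40, 10]
remixed = [110, 90, 38, 12]
score = chi2_score(mixed, remixed)
```

For compatible distributions a statistic of this kind is expected to be of the order of the number of bins, which is why confidence bands are a useful guide when reading a plot such as Figure 5.7.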
The hemisphere mixing technique is applied to the data events passing the selection described in Section 5.5. Artificial datasets up to $k_{\max} = 10$ have been considered, given that good modelling was observed up to very large values of $k_{\max}$. The test score for the compatibility between the mixed and re-mixed artificial data as a function of the combination label is shown in Figure 5.7; the modelling breaks down only at high values, e.g. $k = 128$. The neighbour combinations up to $k_{\max} = 10$ are sub-divided into three sets, used for training the probabilistic classifier (training), for validating and optimising the classifier (validation) and for estimating the background distribution of the final summary statistic (application). The last dataset is referred to as the application set instead of the test set because its purpose is not to obtain unbiased estimates of the classifier performance, but rather to extract unbiased estimates of the classifier output distribution for background events. The artificial datasets are not all independent, e.g. the $(1, 1)$ and $(1, 2)$ datasets use the same first hemisphere, so some care is required when splitting the mixed datasets. The dataset splitting considered in this analysis, using the $(k_1, k_2)$ notation described before, corresponds to:
• training set: concatenation of the (1, 1), (1, 2), (2, 1) and (2, 2) mixed datasets
• validation set: concatenation of the (3, 4), (5, 6), (7, 8) and (9, 10) mixed datasets
• application set: concatenation of the (4, 3), (6, 5), (8, 7) and (10, 9) mixed datasets

Note that the observations in the training set are not fully independent. Reusing hemispheres in the training sample might at most slightly degrade the classifier performance, but it does not bias the inference results in any way, provided that an independent set is used for inference. The next section is devoted to the validation of the background model in data control regions and the development of a methodology to correct for possible biases in the final summary statistic expectations. For completeness, a comparison between the distributions of the relevant variables used as input to the probabilistic classifier, as obtained from the available QCD multi-jet simulations and as estimated using hemisphere mixing, is shown in Figure 5.8. The overall agreement is good, as expected from the discussion at the beginning of this section, although the statistical uncertainties coming from the simulated QCD dataset in the low $H_T$ range are large.
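The independence properties of the splitting listed above can be checked with a short Python sketch. The notion of a "slot", meaning the pair (hemisphere position, neighbour ordinal), is introduced here purely for illustration, and the helper names are hypothetical. Two mixed datasets that share a slot reuse the same substitute hemisphere for every event.

```python
from itertools import combinations

# (k1, k2) labels of the mixed datasets in each set, as listed in the text.
splits = {
    "training": [(1, 1), (1, 2), (2, 1), (2, 2)],
    "validation": [(3, 4), (5, 6), (7, 8), (9, 10)],
    "application": [(4, 3), (6, 5), (8, 7), (10, 9)],
}

def hemisphere_slots(labels):
    """Set of (position, k) slots used by a list of (k1, k2) labels."""
    first = {(1, k1) for k1, _ in labels}
    second = {(2, k2) for _, k2 in labels}
    return first | second

def shared_slots(name_a, name_b):
    """Slots used by both sets; empty when no substitute hemisphere is shared."""
    return hemisphere_slots(splits[name_a]) & hemisphere_slots(splits[name_b])

def reuses_hemispheres(name):
    """True if some slot appears in more than one dataset of the same set."""
    labels = splits[name]
    return len(hemisphere_slots(labels)) < 2 * len(labels)

overlaps = {pair: shared_slots(*pair) for pair in combinations(splits, 2)}
```

Under this bookkeeping the three sets share no slot with each other, the training set reuses hemispheres internally (four datasets drawn from only four slots), and the validation and application sets do not, which is consistent with the remark about the training observations not being fully independent.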