Dimensionality Reduction - Event Simulation and Reconstruction

2.3 Event Simulation and Reconstruction

3.1.3 Dimensionality Reduction

which is referred to as generator-level re-weighting, a procedure that in some cases can even be done analytically. The concept of re-weighting will be useful to model different parameter points in Chapter 5 with a single set of simulated observations as well as to understand how the effect of varying parameters can be modelled via differentiable transformations in Chapter6.

3.1.3 Dimensionality Reduction

In the previous overview of the basic statistical modelling principles of experimental high-energy physics, the structure and properties of the probability distribution of the full detector readoutx_{∈ X have been considered. The consideration of the detector} readout as single observable variable x in the generative model greatly simplifies the modelling narrative, plus also allows to include the effect of any arbitrary event selection as a deterministic function 1C(x). Nevertheless, the high-dimensionality of

the readout space x ∈ X (i.e. O(108_{)) complicates its direct use when comparing}

simulated and recorded observations, which is crucial when carrying out any type of statistical inference.

The high-dimensionality of the raw detector readout space x ∈ X also makes it very difficult to specify an effective event selection 1C(x) that is able to reduce the

contributions from non-interesting or not well-modelled background processes. This motivates the use of a dimensionality reduction function f (x) : X −→ Y, from the raw detector readout space_{X ⊆ R}dto a lower dimensional space_{Y ⊆ R}b. Heref (x)

represents any deterministic function of the detector readout, but in practice it can be implemented by a series of consecutive transformations.

Let us denote as y _{∈ Y the resulting variable after the transformation f(x) is} applied to the observed detector readout. If the function f is differentiable and bijective (i.e. there is a one-to-one correspondence betweenx and y), the probability density distribution function ofy could be obtained as:

p(y_{|θ) = p(x|θ)} ⏐ ⏐ ⏐ ⏐det dx dy ⏐ ⏐ ⏐ ⏐ (3.19)

where the last term is the Jacobian determinant of the inverse of f . The transformations commonly used in particle colliders are non-bijective and sometimes non- differentiable, plus Equation3.19is in any case of little use whenp(x_{|θ) is intractable.} However, the expectation value of y as well any other deterministic transformation of the detector readoutx after any arbitrary event selection 1C(x) can be obtained

using simulated samples for a given interaction process as shown in Equation3.14, independently of whether the transformation is invertible or differentiable. In the rest of this section, the main procedures followed to reduce the dimensionality of the observable space and its objectives from a statistical perspective will be discussed.

Event Reconstruction

The methods of event reconstruction, as described in Section2.3.3, provide a very efficient way to transform the high-dimensional detector readout to a lower-dimensional space that can more easily be interpreted from a physical standpoint. In fact, reconstruction can be viewed as a complex procedural technique of inference on a subset of the latent variables given the detector readout x of an event. These methods attempt to walk back the generative chain described in Equation3.9to recover the subset of the parton-levelzp (andzs or zd in some cases) that strongly depends on

the detector readouts, providing a compressed summary of the information in the event about the parameters of interest θ. The dimensionality of the output of the reconstruction procedurey_recodepends on the subset of variables considered for each physical object, which typically amounts to a total of_{O(100) dimensions, which is a} significant reduction fromdim(_{X ) ≈ O(10}8_).

Due to the detector noise and characteristics, the reconstruction functionf_reco(x) : X −→ Yreco cannot fully recover zp ∈ Zp. This is the case for neutrinos that leave

the detector undetected, when the measured four-momentum of a given particle dif- fers from the real value, or when the reconstructed particle does not even exist inzp.

Simulated events can then be used to make calibrated probabilistic statements of the resulting reconstructed physical objects and their relation with the actual unobserved particles going through the detector. Particle identification (e.g. jet b-tagging) and fine-tuned momentum regressions on the reconstructed objects can also be thought of as inference of latent variables, which amounts to using the additional detector information around an object to measure more precisely its properties. These properties include the type of particle that produced the detector readouts clustered for particle identification, or a more precise determination of the momentum for particle regression.

One aspect of the generative model that complicates both reconstruction and statistical inference which has not been discussed yet is that efficient representations of the latent space of simulated events are not easily represented as a fixed-size real vector z ∈ Z ⊆ Rb_{. Let us consider as an example the parton-level latent inform-}

ation zp, which amounts to a short list of produced particles. The total number

of particles and the number of particles of each type are variable, thus zp is better

represented by a set (or several sets, one for each particle type): zset

p ={zip | i ∈ {1, ..., np}} (3.20)

where np is the total number of particles produced at parton-level and zip are the

latent variables associated to each particle (i.e. type, four-momenta, charge, colour and spin). A similar set structure can be attributed to latent variables describing long-lived particles after the parton-shower zs, while additional variables might be

associated to each particle (e.g. production vertex) and total number and type di- versity would be considerably larger. Because the number of particles and their type greatly varies between different interaction processes, the mapping this structure to observable variable space is very useful. In fact, the result of general event reconstruction process at CMS can be expressed also as a set of physical objects:

yset_reco={yi

reco | i ∈ {1, ..., nreco}} (3.21)

where nreco is the total number of particles, yireco are the reconstructed variables

for each physical object (i.e. reconstructed type, reconstructed four-momenta, reconstructed charge and any other reconstructed attributes). The calibration between the reconstructed physical objects yset

reco and the actual particles produced in the

collision zset

distance criterion, see Section2.2.1) and the comparison of their reconstructed and generated attributes.

The fact that both reconstructed and latent spaces have a variable-size set structure greatly complicates the application of inference and learning techniques directly based on yset

reco, because they often can only deal with a fixed-size vector of real

numbers Rb. Similarly to what is done for event selection, often the elements in the set of reconstructed objects in an event are reduced by imposing a given condition based on their attributes (e.g. type, isolation or momenta). There exist naive ways to embed a set such as yset

reco as a fixed-size vector Rb, such as taking the relevant

attributes of the first nsel objects according to a specific ordering convention after

a given object selection and possibly padding with zeros or alternative numerical values the elements that do not exist for a given event. Some of the newer machine learning techniques that will be presented in Chapter 4 can deal with variable-size input, such as sequences, sets or graphs inputs, by embedding them in vector representations internally, providing new ways to deal with the mentioned representational issue.

Summary Statistics

The attributes of the subset of reconstructed objects selected in an event for a given analysis, often as a fixed-size vector representationy_sel ∈ Zsel ⊆ Rb, are often still

too high-dimensional to be considered directly for statistical inference. The effect- iveness of the likelihood-free techniques that will be presented later in this chapter strongly depend on the dimensionality of the observable space considered. Hence, it is desirable to further combine the reconstructed outputs in a lower dimensional summary statistic, which can be either a function of each single observation or a set of multiple observations, so simpler statistical models that relate the parameters of interest with the observations can be constructed.

Until now, we have been dealing with the problem of how a single event is distrib- utedp(x|θ), however in practice a collection D = {x0, ..., xn} of events is considered

for inference. Let us first consider again the set D, before any trigger or event selection, similarly to what was done at the beginning of Section 3.1.1. Because of the independence between events, the probability density of a given set D can be expressed as the product of individual probability densities for each eventxi:

p(D|θ) =

x∏i∈D

wherep(xi|θ) can only be modelled realistically by forward simulation, and has the

mixture model structure and latent factorisation discussed before. After an arbitrary event selection 1C(x), only a subset of events DC ={x0, ..., xnC} ⊆ D remain. These

events are also independent, so their probability density can be expressed as:

g(DC|θ) = xi_∏∈DC

g(xi|θ) (3.23)

where the dependence between the distribution function after the event selection g(xi|θ) and that before p(xi|θ) was already described in Equation3.7. If we are only

focussed on the probability distribution of the events in DC, we would be neglecting

an important quantity that can also provide information about the parameters of interest: the total number of events that pass the event selection nC. Because this

quantity depends on the set of recorded readouts _{D, where each individual readout} xi is assumed to be an independent and identically distributed variable, the total

number of selected events nC after a deterministic selection 1C(x) can be modelled

using a binomial distribution:

p(nC|n, θ) = Binomial(n, ϵ) ≈ Poisson(nϵ) (3.24)

where n is the total number of events, and the dependence on the parameters is contained in the total efficiencyϵ, i.e. probability x_{∼ p(x|θ) of passing the selection} criteria, that can be defined as ϵ = ∫ 1C(x)p(x|θ). The Poisson approximation is

justified because the number of trials n is sufficiently large (i.e. 40 million bunch crossings per second) and the total selection efficiencies ϵ _{≤ 0.000025 already at} trigger level, as discussed in Section 2.2.7. This type of stochastic process is also referred to in the literature as multi-dimensional homogenous Poisson point process [88]. The expected value of nC coincides with the Poisson mean nϵ. It could be

more intuitively linked with the parameters of interest θ by making explicit the contributions from the different mixture processes:

E D∼p(D|θ) [nC] = n K−1_∑ j=0 ϕj E x∼pj(x|θ) [1C(x)] = n K−1_∑ j=0 ϕjϵj (3.25)

where the efficiency for each process ϵj =

∫

1C(x)pj(x|θ) can be estimated using

simulated observations as shown in Equation3.13. In principle, all possible processes j that could occur have to be considered, i.e. cases when no hard collision occurred as well as the inclusive contribution of each possible hard process, as described in

Equation3.4. However, if the product of the expected probability of a given process occurring ϕj and the event selection efficiency ϵj is low enough relative to the total

efficiency ϵ = ∑K−1_j=0 ϕjϵj, the effect of those mixture components can be safely

neglected.

The situation discussed above is often the case for events where no hard collision occurred after some basic event selection, that isϵnot-hard≈ 0 so it can thus be

neglected. For the subset of bunch crossings where hard interactions occur, the probability of a given type of interaction before any event selection might be expressed as the product of its cross sectionσj by the total integrated luminosity during the data

taking period_Lint divided by the total number of bunch crossings, thus the expected

value for number of observations nC after an event selection that reduces to a neg-

ligible fraction the contribution of non-hard processes 1C(x) can also be expressed

as: E D∼p(D|θ) [nC] = n K−1_∑ j=0 Lintσj n ϵj =Lint K−1_∑ j=0 σj ϵj (3.26)

wherenj =Lint σj ϵj is the expected number of events coming from a given process

j, that can be estimated with theoretical input regarding σj, simulated observations

to estimateϵj and an experimental measurement of the luminosity L.

The number of observations nC that pass a given event selection 1C(x), which

normally includes trigger and some additional analysis dependent selection, is the quantity that serves as the basis of the simplest statistical model used in particle physics to link theoretical parameters and observations. This type of summary statistic is very effective when the parameter of interest is the cross section of a single process σS and the rest of background processes are well modelled by theoretical

predictions and simulated observations. In that case, if all parameters but σS are

known, a cut-and-count sample-based likelihood can be built based on Equation3.24, corresponding to the following probability density function:

p(nC|σS) = Poisson ( σsϵs+ j∈B_∑ σjϵj ) (3.27)

which can be used to carry out statistical inference about σS given an observed

number of events that pass the event selectionnobs

C , using classical techniques.

The previous concept can be applied to several disjoint subsets of_{X simultaneously} T ={C0, ...,Cb}, each characterised by a different indicator function 1Ct(x) defining

function for the variablenT ={nC0, ..., nCb}, given that each nCi is independent, can be obtained as: p(nT|θ) = C∏i∈T Poisson (_j∈H ∑ nCi j (θ) ) (3.28) wherenCi

j (θ) is the expected number of observed events coming from process j after

the selection _Ci. As long as a parametrisation of nC_ji(θ) exists, which can be often

estimated asnCi

j (θ) =L σj ϵCji(θ), Equation3.28can be used to construct a likelihood

to carry out inference on the parametersθ based on the observed value of the sample summary statistic nobs

T .

Sufficient Statistics The selection count vector nobs

T (D), which has not been specified yet, could be also

written as a sum over a function nT(x) : X ⊆ Rd −→ Y ⊆ {0, 1}b ⊂ Rb applied for

each event in D =_{x0, ..., xn} as follows:

nobs_T (D) =

x∑i∈D

nT(x) (3.29)

wherenobs

T (D) could be described as a summary statistic of the whole collection of

observations whilenT(xi) summarises a single event xi.

There are infinite ways to choose a lower-dimensional summary statistic of the detector readout s(x) :_{X ⊆ R}d_{−→ Y ⊆ R}b_{. Functions of the type} _n

T(x) are only

a reduced subset, yet still infinite, of the possible space of functions. Regardless of the likelihood-free inference methods considered (see Section 3.2), the need of a low-dimensional summary statistic can be thought as an effective consequence of the curse of dimensionality, because the number of simulated observations required to realistically model the probability density function or compute useful distance measures rapidly increases with the number of dimensions.

In general, the selection of a summary statistics(x) is far from trivial, and naive choices can lead to large losses of useful information about the parameters of interest θ. From classical statistics, we can define a sufficient summary statistic as the function of the set of observations that can be used for carrying out inference about the model parameters θ of a given statistical model in place of the original dataset without losing information [89]. Such a sufficient statistic contains all the information in the observed sample useful to compute any estimate on the model parameters, and no complementary statistic can add any additional information aboutθ contained in

the set of observations. Sufficient statistics can be formally characterised using the Fisher-Neyman factorisation criterion, which states that a summary statistic s(x) is sufficient for the parameter vector θ if and only if the probability distribution function ofx can be factorised as follows:

p(x_{|θ) = q(x)r(s(x)|θ)} (3.30)

whereq(x) is a non-negative function that does not depend on the parameters and r(x) is also a non-negative function for which the dependence on the parameters θ is a function of the summary statistic s(x). The identity function s(x) = x is a sufficient summary statistic according to the theorem in Equation 3.30, however we are only interested in summaries that reduce the original data dimensionality without losing of useful information about the parametersθ.

The definition of sufficiency can also be applied to a set of observations D = {x0, ..., xn}. In fact if we assume they are independent and identically distributed,

ands(x) is sufficient for each observation xi, we may rewrite Equation 3.22as:

p(D|θ) =

x∏i∈D

q(x)

x∏i∈D

r(s(xi)|θ) = q(D)r(s(D)|θ)

where the set of sufficient summary statistics for each observation is a sufficient summary statistic for the whole datasets(D) =_{{ s(x}i) | ∀xi ∈ D} and the dependence

on the summary statistic is contained as the product of independent factors for each observation.

Because p(x|θ) is not available in closed form in particle collider experiments, the general task of finding a sufficient summary statistic that reduces the dimensionality cannot be tackled directly by analytic means. However, for finite mixture models where the only model parameters are a function of the mixture coefficients ϕj, probabilistic classification can be used to obtain (approximate) sufficient sum-

mary statistics. We will return to this topic in Chapter4. When the parameters of interest or additional unknown parameters affect the mixture components pj(x|θ),

the construction of sufficient summary statistics cannot be addressed directly, thus a fraction information about the parametersθ is very likely to be lost in the dimensionality reduction step. An automated way to obtain powerful summary statistics in those cases using machine learning techniques will be presented in Chapter6.

Synthetic Likelihood

The advantage of using lower-dimensional summary statisticss(D) :XD ⊆ Rd×n−→

YD ⊆ Rb×nof the detector readout collected by the experiment is that often the gen-

In document Statistical Learning and Inference at Particle Collider Experiments (Page 94-120)