Probabilistic Classification and Regression

2.3 Event Simulation and Reconstruction

4.1.1 Probabilistic Classification and Regression

One of the conceptually simple, yet versatile, tasks that can be addressed with machine learning algorithms is classification. A classifier or a classification rule is a function f (x) : X −→ Y that predicts a label y ∈ {0, ..., k − 1}, denoting corres- pondence to one category in a set of of k categories, for each input x _{∈ X . The} task of classification, in the context of machine learning algorithms, is to produce classification functionsf (x) that perform well on an unobserved set of data.

Classification is often framed as belonging to a larger category of tasks referred to as supervised learning, where the goal is predicting the value of an output variable y (here a multi-dimensional vector for generality) based on the observed values of the input variablesx, based on a learning set of n input vectors with known output values S = _{(x0, y0), ..., (xn, yn)}. The output values y are known in the learning

set, because they were previously determined by an external method, typically a teacher or supervisor looking at past observations, thus explaining the name of these family of techniques.

From a statistical standpoint, the input observations and target values from the learning set can be viewed as random variables sampled from a joint probability distribution p(x, y), which is typically unknown. The family of supervised learning tasks also includes regression, which amounts to construct af (x) that can to predict a numerical target output y, and structured output tasks where the output vector y is a vector or a complex data structure where its elements are tightly interrelated. As will be reviewed in Section 4.3, most analysis problems amenable by machine learning in high-energy physics experiments are framed as classification and regression tasks, while the use of structured output tasks is instead not quite extended. The reconstruction of the set and properties of physical objects in an event directly from the detector readout could be framed as a structured output task, if it was to

be approached directly using machine learning algorithms instead of the procedures described in Section2.3.3.

The goal of supervised learning is not to perform well on the learning setS used for improving at the specified task, but rather to perform well on additional unseen observations sampled from the joint distribution p(x, y). Supervised learning algorithms exploit the conditional relations between the input and the output variables, in order to classify new observations better than a random classification rule that does not depend on the value of x. When using machine learning techniques in data analysis at the LHC, as will be reviewed in Section 4.3, simulated observations are used instead of expert-labelled past observations. Simulated observations correspond to random samples of the joint distribution over the latent variables for the generative modelp(x, z|θ), as described in Section 3.1.

In fact, the problem of inferring a subset of latent variables z of the statistical model for the raw detector readouts of a collider experimentx, or from any determ- inistic function of its(x), can be cast as a supervised learning problem. The learning setS would consist of simulated observations xi (or a summary of it s(xi)), and a

matching subset of interest of the latent variablesy_i ∈ Y ⊆ Z. The supervised learning task can then be viewed as the estimation of the conditional expectation value

Ep(y|x=xi)[y] for each given input observation xi, thus characterising the probability

distributionp(y|x).

While several performance measuresP are possible for a given task T , for supervised learning is common to use performance measures that estimate the expected prediction error, or risk R, of a given predictor function f (x), which can normally be expressed as:

R(f ) = E

(x,y)∼p(x,y)

[L(y, f (x))] (4.1)

whereL is a loss function that quantifies the discrepancy between the true output and the prediction. The quantity defined in Equation4.1is often also referred to as risk, test error, or also as generalisation error.

The optimal model for a given task T thus depends on the definition of its loss functionL, if the objective is minimising the expected prediction error. In practice, the expected prediction error cannot be estimated analytically because p(x, y) is not known, or not tractable in the case of a generative simulation model. The

generalisation error has thus to be estimated from a subset of labelled samplesS′ ₌ {(x0, y0), ..., (xn′, y_n′)} as follows: R(f )_{≈ R}S’= 1 n′ ∑ (xi,yi)∈S′ L(y_i, f (xi)) (4.2)

which is also commonly referred to as empirical risk approximationRS’(f) based on

the setS′. The supervised learning problem can then be stated as one of finding the function ˆf from a class of functions F, which depends on the particularities of the algorithm, that minimises the empirical risk over the learning set S:

f = arg min

f ∈F

RS(f ) (4.3)

which is referred to as empirical risk minimisation (ERM) [112], and it is at core of most of the existing learning techniques, such as those described in Section 4.2. However, the ultimate goal of a learning algorithm is to find a function f∗ that minimises the risk or expected prediction error R(f ):

f∗= arg min

f ∈F

R(f ) (4.4)

where R(f ) is the quantity defined in Equation 4.1, corresponding to the generalisation error, or average expected performance on unseen observations sampled from p(x, y). The previous equation can be used to define the optimal prediction function fB(x), also referred as Bayes model, which represents the minimal error that any

supervised learning algorithm can achieve due to the intrinsic statistical fluctuations and properties in the data. The Bayes model can be expressed as:

fB(x) = arg min y∈Y

y∼p(y|x)

[L(y, f (x))] (4.5)

where the last term indicates the optimal choice of target y for each value of x. The previous expression can be obtained by explicitly considering the conditional expectation in the risk term described in Equation 4.4, that is R(h) =

Ex∼p(x|y)

[

Ey∼p(y|x)[L(y, f (x))]

]

, that can be obtained using Bayes theorem. The Bayes modelfB(x), and its corresponding risk R(fB), also referred as residual error,

can only be estimated ifp(x, y) is known and the expectation can be computed analytically. Even though the Bayes optimal model cannot be obtained for real world

problems, it can be useful nevertheless when benchmarking techniques in synthetic datasets or for theoretical studies.

Because most learning algorithms optimisef , or its parameters, using the learning set S, the empirical risk RS(f ) is not a good estimator of the expected generalisa-

tion error R(f ). In general, RS(f ) underestimates RS(f ) because the statistical

fluctuations of the finite number of observations in S can be learnt to increase the performance onS, while they are not useful for prediction on a new set of observations. If the family of functions _{F considered in the learning algorithm is flexible} enough, which is often the case, it is possible to achieveRS(f ) = 0 for the learning

set S while the generalisation error R(f ) is well away from zero. This effect can actually lead to a degradation of the generalisation error while the empirical risk in the learning set is decreasing during the learning procedure, which is often referred to as over-fitting.

To compare different prediction functions or to realistically evaluate the generalised performance of a given prediction modelf , it is useful to be able to compute unbiased estimates ofR(f ). The simplest way to obtain such estimate is to divide the learning setS into two disjoint random subsets Strain and Stest. The train subsetStrain will

be used by the learning algorithm to optimise the prediction function f by means of empirical risk minimisation, as described in Equation 4.3. The hold-out or test subset Stest can then be used to obtain an unbiased estimation of the performance

off on unseen observations.

For many learning algorithms, the learning process, or training, is iterative: the function f is optimised incrementally based on the training data. In those cases, an estimation of the generalisation error as the training evolves may be useful to stop the training procedure and avoid the degradation of generalisation due over- fitting, in what is referred as early stopping. In those cases, as well as to compare and ensemble the results of various predictor functions and model configurations, is useful to hold out a fraction of Strain which is commonly referred as validation set

Svalid. Alternative approaches to estimate the generalisation error exist, including

cross-validation and its variations [113], which are usually preferred when the amount of training data is reduced.

Another important concept for most machine learning techniques is that of hyper- parameters. The majority of machine learning algorithms depend on a set of parameters that regulate the flexibility of the family of functions_{F to consider for empir-} ical risk minimisation as well as the details of the optimisation procedure followed to solve the task presented in Equation4.3. The expected performance of a given model

depends on these parameters, however their optimal value depends on the particularities of the data (e.g. number of input dimensions or number of size of the data size). This motivates the notion of hyper-parameter optimisation, where the performance of the various choices of hyper-parameters is evaluated on the validation set or by means of cross-validation techniques, in order to select the best configuration.

The loss function L of a supervised learning algorithm, which quantifies the dis- crepancies between the prediction and the true output target, depends on the task T and formally defines it. A principled loss function for classification is the zero-one loss, which is defined as zero when the predictionf (x) matches the target y and one otherwise. The zero-one risk can then be expressed as:

R0−1(f ) = E (x,y)∼p(x,y)

[1(y_{̸= f(x))]} (4.6)

where 1(y ̸= f(x)) is an indicator function, which was defined in Equation 3.6. The zero-one loss is non-differentiable when y = f (x) and its gradients are zero elsewhere; in addition, it is not convex, a property which makes the minimisation task in Equation 4.3 hard to tackle by optimisation algorithms. In fact, it can be proven that finding the function f in F that minimises directly the R0−1 empirical

risk with a training sample is a NP-hard problem [114]. The Bayes optimal classifier for the 0-1 loss can nevertheless be easily obtained from Equation 4.7as a function of the conditional expectation:

fB(x) = arg min y∈Y

y∼p(y|x)

[1(y̸= f(x))] = arg max

y∈Y

p(y|x) (4.7)

thus the optimal classifier amounts to the prediction of the most likely output category y for a given input x. The previous problem is normally referred to as hard classification, where the objective is to assign a category for each input observation. Because most problem in high-energy physics that can be cast as supervised learning are ultimate inference problems as will be reviewed in Section4.3, it is generally more useful to consider the problem of soft classification, which instead amounts to estimate the class probability for each inputx.

Soft classification is especially useful when the classes are not separable, which is often the case for applications in collider experiments. Luckily, soft classification is also a consequence of most convex relaxations of the zero-one loss of Equation

4.6. For a two-class classification problem, e.g signal versus background, a useful approximation of the zero-one loss is the binary cross entropy, defined as:

LBCE(y, f (x)) =−y log(f(x)) − (1 − y) log(1 − f(x)) (4.8)

where now the one-dimensional output predictionf (x), when bounded between 0 and 1 (e.g. using a sigmoid/logistic function), will effectively approximate the conditional probabilityp(y = 1|x). In fact, the Bayes optimal model for a binary cross-entropy classifier is:

fB(x) = E

(x,y)∼p(x,y)

[LBCE(y, f (x))] = p(y = 1|x)

= ∑ p(x|y = 1)p(y = 1) ∀yi∈{0,1}p(x|y = yi)p(y = yi) = ( 1 +p(x|y = 0)p(y = 0) p(x|y = 1)p(y = 1) )−1 (4.9)

where the second line in the equation is a direct consequence of Bayes theorem and from the last term it can be clearly seen that the prediction output is monotonic with the density ratio between the probability density functions for each category. Similar results can be obtained for the Bayes optimal classifier when using other soft relaxations of the zero-one function. Machine learning binary classifiers will effectively approximate this quantity directly from empirical samples, where the prior probabilities of each class represent the relative presence of observations from each category.

Binary cross entropy is a subclass of the more general cross entropy loss function, that can be used fork-categories classification, commonly referred to as multi-class classification. In these cases, a k-dimensional vector target y is often constructed, where each component yi is one if the observation belongs to the class i or zero

otherwise, and the output of the prediction function ˆy = f (x) is also a vector of k components. Within this framework, the cross entropy loss can then be defined as:

LCE(y, f (x)) =−

∑

yilog ˆyi (4.10)

which can be used to recover Equation 4.8 when k = 2, considering the one- dimensional target and prediction as the i=1 elements and that y0 = 1− y and

y0= 1− f(x). If the prediction output is to generally represent exclusive class prob-

abilities, as is the goal of soft classification, the prediction sum is expected to be one. A simple way to ensure the aforementioned property is to apply a function that en- sures that the prediction outputs are in the range[0, 1] and normalised so∑_iyˆi = 1.

The softmax function is a common choice to achieving the mentioned transformation within the field of machine learning. It is a generalisation of the logistic function to k dimensions, and is defined as:

ˆ yi = efi(x)/τ ∑k j=0efj(x)/τ (4.11)

wherefi andfj refer to thei and j elements of the vector function f (x) and τ is the

temperature, a parameter that regulates the softness of the operator which is often omitted (i.e.τ = 1). In the limit of τ → 0+_{, the probability of the largest component}

will tend to 1 while others to 0. The softmax output can be used to represent the probability distribution of a categorical distribution in a differentiable way, where the outcome represent the probabilities of each of the k possible outcomes. We will make use of this function in Chapter 6. When the softmax function and the cross entropy loss are used together for multiclass classification, the optimal Bayes model is: fB,i(x) = E (x,y)∼p(x,y) [LCE(y, f (x))] = p(y = yi|x) = ∑ p(x|y = yi)p(y = yi) ∀yi∈{0,...,k−1}p(x|y = yi)p(y = yi) (4.12)

which can also be expressed as a function of a sum of density ratios of the categories.

In document Statistical Learning and Inference at Particle Collider Experiments (Page 120-126)