S/B discrimination with ML - Prototype of machine learning “as a service” for CMS physics in signal vs background discrimination

Most HEP analyses need mechanisms to perform event classification. In particular, methods are needed to discriminate signal events from background events, where what counts as signal depends on the actual physics goal. This might be done at the event level (e.g. Higgs searches, SUSY searches, top-mass measurement, ...), at the cone level (e.g. tau versus quark-jet reconstruction, ...), at the track level (e.g. particle identification, ...), in the secondary vertex finding (e.g. b-tagging), in the flavour tagging, etc. This can only be done with input information from multiple variables coming from a variety of sources, such as kinematic variables (masses, momenta, decay angles, ...), event properties (jet/lepton multiplicity, sum of charges, ...), event shape (sphericity, various types of high-order moments, ...), and detector response (silicon hits, dE/dx, Cherenkov angle, shower profiles, muon hits, ...). Traditionally, although the details differ case by case, this is done by selecting a few powerful input variables and combining them. Nowadays, new ML-based methods allow the use of up to 100 (and more) variables without loss of classification power.

Suppose an analyst has a data sample that consists of two types of events. In real life they are mixed together in the same sample, but we can use simulations to obtain separate samples of both types and mix them together with class labels S for “Signal” and B for “Background” (in blue and red respectively in Figure 4.3). Note that we restrict ourselves here to the two-class case, while many classifiers are also able to deal with several classes. The question is how to set the decision boundaries, in terms of cuts on specific variables, so as to best select events of type S, given that we only have a limited set of discriminating variables: let’s suppose only two, x1 and x2, in this example.

Figure 4.3: Example of Signal versus Background discrimination using decision boundaries. Discussion in the text.

In the three plots of Figure 4.3, from left to right, one tries to achieve this goal by applying rectangular cuts (plot on the left), a linear boundary (plot in the center), and a non-linear boundary (plot on the right). Once a set of possible boundaries has been decided on, how does one find the optimal one? Judging from this example the problem might seem very simple, but this is only because the example has 2 variables, and the human eye, with the brain behind it, has very good pattern recognition capabilities. These capabilities become much weaker if we have more variables, e.g. dozens of them.

Suppose that each event, whether Signal or Background, has “N” measured variables, which we can call “features”. These variables characterize the members of a given population: the same variables are used for signal and background, but the two groups have different probability density functions (PDFs). The problem becomes finding a mapping from an N-dimensional input/observable/“feature” space to a one-dimensional output, i.e. a function that takes N arguments x1, ..., xN and gives a single variable as output. With a rule-based approach, the PDFs of S and B are plotted and a cut can be found that maximizes the efficiency and the purity of the selected Signal sample, as estimated on the simulated data. This is basically the idea (although simplified here) behind a “Multivariate Analysis” (MVA) technique.

This is where ML techniques naturally apply. Given a certain class of models, an ML system might be able to find the mapping discussed above automatically, by using “known” or “previously solved” events, i.e. by learning from known “patterns”, such that the resulting output variable has good generalization properties when applied to truly “unknown” events. This is precisely what a machine is supposed to do when applying supervised machine learning algorithms. Of course, there is no magic: one still needs to choose the discriminating variables, choose the class of models, tune the “learning parameters” (bias versus variance trade-off), check the generalization properties, consider the trade-off between statistical and systematic uncertainties, etc.
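The mapping from an N-dimensional feature space to a one-dimensional output, followed by a cut on that output, can be illustrated with a small toy sketch. It is not the method used in this work: the two Gaussian samples stand in for simulated Signal and Background, a Fisher linear discriminant is chosen here only as a simple example of a linear boundary, and the figure of merit (efficiency times purity, with equally sized S and B samples) is an assumption made for illustration. All function and variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Toy "simulated" samples (assumption): two features (x1, x2) per event,
# drawn from Gaussians with different means for Signal and Background.
n_events = 5000
signal = rng.normal(loc=[1.0, 1.0], scale=1.0, size=(n_events, 2))
background = rng.normal(loc=[-1.0, -1.0], scale=1.0, size=(n_events, 2))


def fisher_direction(sig, bkg):
    """Fisher discriminant direction w = S_W^{-1} (mean_S - mean_B)."""
    within = np.cov(sig, rowvar=False) + np.cov(bkg, rowvar=False)
    return np.linalg.solve(within, sig.mean(axis=0) - bkg.mean(axis=0))


def scan_cut(score_sig, score_bkg, n_steps=200):
    """Scan cuts on the 1-D output and keep the one that maximizes
    efficiency * purity (an illustrative figure of merit)."""
    lo = min(score_sig.min(), score_bkg.min())
    hi = max(score_sig.max(), score_bkg.max())
    best, best_fom = (lo, 0.0, 0.0), -1.0
    for cut in np.linspace(lo, hi, n_steps):
        n_sig = np.count_nonzero(score_sig > cut)  # selected signal events
        n_bkg = np.count_nonzero(score_bkg > cut)  # selected background events
        if n_sig + n_bkg == 0:
            continue
        eff = n_sig / score_sig.size   # fraction of signal that is kept
        pur = n_sig / (n_sig + n_bkg)  # signal fraction of the selection
        if eff * pur > best_fom:
            best_fom, best = eff * pur, (cut, eff, pur)
    return best


# Map each event from (x1, x2) to a single output variable, then cut on it.
w = fisher_direction(signal, background)
best_cut, efficiency, purity = scan_cut(signal @ w, background @ w)
print(f"cut={best_cut:.3f} efficiency={efficiency:.3f} purity={purity:.3f}")
```

In a real analysis the purity would depend on the expected Signal and Background yields, so the events would need to be weighted accordingly, and the cut would be checked on events not used to derive the direction w.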

Operationally, what happens is that a program is “trained” on a predefined set of “training examples”, which builds its ability to reach an accurate conclusion when given new data. Note however that the goal of ML is never to make “perfect” guesses. What counts as the “best” guesses, i.e. how “good” or “better” than others they are, depends essentially on the level of precision in the prediction that one’s problem needs and on the quality of the ML models we can afford to apply, e.g. also taking into account the finite amount of computing resources we may have available to run an ML system. Ultimately, the goal of any successful ML effort is never to reach perfection, but only and always to make guesses that are good enough to be useful for the problem under study.

Among the steps to follow in building a successful ML model, much care must be given to the choice of the training data. Regardless of how such data is operationally divided in the model implementation (e.g. training vs validation vs test sub-samples), and considering instead as “training data” the whole set of data used to build up the model, it is worth underlining that this must be a statistically significant random sample, as ML builds heavily on statistics. If the sample is not random, the price to pay is that the machine learns patterns in the data that are not actually there. If the sample is not large enough, the price to pay is that the machine will not learn enough, or (even worse) will reach inaccurate conclusions.
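The operational division into training, validation and test sub-samples mentioned above can be sketched as a random shuffle of the event indices followed by a split. This is a minimal illustration only: the function name split_indices, the 60/20/20 fractions and the fixed seed are assumptions and are not taken from this work.

```python
import numpy as np


def split_indices(n_events, fractions=(0.6, 0.2, 0.2), seed=0):
    """Randomly split event indices into train/validation/test sub-samples.

    The default fractions are illustrative, not a recommendation.
    """
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    rng = np.random.default_rng(seed)
    # Shuffle first, so that any ordering in the input (e.g. by run or by
    # production batch) does not leak into one sub-sample only.
    shuffled = rng.permutation(n_events)
    n_train = int(round(fractions[0] * n_events))
    n_valid = int(round(fractions[1] * n_events))
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]  # whatever remains
    return train, valid, test


train_idx, valid_idx, test_idx = split_indices(1000)
print(len(train_idx), len(valid_idx), len(test_idx))
```

Shuffling before splitting addresses only the randomness of the division. It cannot repair a sample that was not collected randomly in the first place, or one that is too small.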

In document Prototype of machine learning “as a service” for CMS physics in signal vs background discrimination (Page 64-66)