3.13 Calculation of Uncertainties for Statistical Learning Methods
3.13.2 Systematic Uncertainties
Unfortunately, one quite frequently meets the opinion that statistical learning methods carry an uncertainty of their own which somehow needs to be added to the total uncertainty. There exist ideas like varying the behaviour of a classifier by small amounts (e.g. by varying the weights of a neural network) and observing the variation of the outputs. But what will that tell us? It only tells us that the output of a classifier does indeed depend on the values which represent the learned hypothesis.
A simple way to derive the true systematic uncertainties is to think of a trained classifier as something which solves a classification problem by applying a cut in its output distribution. That is all it is. Like any well-known one-dimensional cut, this multidimensional cut does nothing but propagate the systematic uncertainties of its inputs to the output. There is no uncertainty from the classifier itself (a cut has no uncertainty). There is also no uncertainty from the learning procedure, since we want to evaluate only the classifier we obtained. The question whether a training with different parameters would have resulted in a different classifier has nothing to do with the systematic uncertainties of the classifier we obtained. Again: we have a fixed classifier with a fixed cut in the output distribution, and all we have to calculate is the propagation of the systematic uncertainties of the inputs.
For one-dimensional cuts this propagation is often simulated by a variation of the cut. This is a legitimate procedure, but only in exactly this one-dimensional case. Let us assume that a quantity z has an uncertainty δz and a cut z < c is applied. Varying the cut by δz is legitimate because this variation is identical to a variation of all events by δz.
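The equivalence for the one-dimensional case can be checked numerically. The following is only a minimal sketch: the toy Gaussian sample, the cut value c = 0.5 and the uncertainty δz = 0.1 are assumptions chosen for illustration. It relies on the fact that z + δz < c holds exactly when z < c − δz.

```python
import random

def pass_fraction(values, cut):
    """Fraction of events satisfying z < cut."""
    return sum(z < cut for z in values) / len(values)

rng = random.Random(1)
z_values = [rng.gauss(0.0, 1.0) for _ in range(10_000)]  # toy sample (assumption)
c, dz = 0.5, 0.1  # hypothetical cut value and uncertainty

# Option 1: shift every event upwards by dz and keep the cut fixed.
fraction_shifted_events = pass_fraction([z + dz for z in z_values], c)

# Option 2: keep the events as they are and move the cut downwards by dz.
fraction_shifted_cut = pass_fraction(z_values, c - dz)

print(fraction_shifted_events, fraction_shifted_cut)
```

Both fractions agree (up to floating-point rounding for events lying exactly at the cut), which is why varying the cut is an acceptable shortcut for a single quantity z.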
For statistical learning methods we cut in the output distribution, which depends on all inputs simultaneously. We have no information about any uncertainty there, but what we do know are the uncertainties of the inputs. To calculate the propagation of these uncertainties, several modified test sets have to be created in which the input quantities are varied according to their own uncertainties. These modified test sets are passed through the statistical learning method without any changes to the method itself; the cut also stays the same. The resulting output distributions change according to the uncertainties of the inputs, so the same cut as before will result in new efficiencies and rejections. Calculating the variation over the modified test sets finally yields an estimate of the systematic uncertainty which corresponds to the propagated uncertainties of the inputs.
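The procedure just described can be sketched in a few lines. Everything concrete here is an assumption made for illustration: the function classifier is a hypothetical stand-in for a trained method with frozen parameters, the cut value and the toy test sets are invented, and only a single upward variation of the first input is shown.

```python
import math
import random

def classifier(x1, x2, x3):
    """Hypothetical stand-in for a trained classifier with frozen parameters."""
    return 1.0 / (1.0 + math.exp(-(1.5 * x1 + 0.8 * x2 - 0.5 * x3)))

CUT = 0.5  # fixed cut on the classifier output (assumed value)

def efficiency(signal):
    """Fraction of signal events with an output above the cut."""
    return sum(classifier(*ev) > CUT for ev in signal) / len(signal)

def rejection(background):
    """Fraction of background events with an output at or below the cut."""
    return sum(classifier(*ev) <= CUT for ev in background) / len(background)

def vary(events, index, transform):
    """Return a modified copy of a test set with one input transformed."""
    return [tuple(transform(v) if i == index else v for i, v in enumerate(ev))
            for ev in events]

# Toy test sets (assumption): three inputs per event.
rng = random.Random(0)
signal = [(rng.gauss(1.0, 1.0), rng.gauss(2.0, 0.5), rng.gauss(1.0, 0.3))
          for _ in range(5000)]
background = [(rng.gauss(-1.0, 1.0), rng.gauss(1.0, 0.5), rng.gauss(1.5, 0.3))
              for _ in range(5000)]

nominal_eff = efficiency(signal)
nominal_rej = rejection(background)

# Modified test sets: the first input is shifted upwards by an assumed 0.1.
signal_up = vary(signal, 0, lambda v: v + 0.1)
background_up = vary(background, 0, lambda v: v + 0.1)

# Same classifier, same cut; only the inputs have changed.
delta_eff = efficiency(signal_up) - nominal_eff
delta_rej = rejection(background_up) - nominal_rej
print(nominal_eff, nominal_rej, delta_eff, delta_rej)
```

Note that neither classifier nor CUT is touched between the nominal and the modified evaluation; the differences delta_eff and delta_rej come entirely from the varied input.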
The systematic uncertainties of the inputs can, for example, be due to an energy calibration which varies over time, a varying noise contribution, a detector efficiency which degrades over time, or a movement of parts of the detector. In general, a systematic uncertainty is any variation which may appear after the training set has been fixed and which affects the performance on future events by changing the inputs (or the underlying quantities from which the inputs are derived) systematically in one direction (which may vary over time).
The following procedure assumes that the inputs are independent, so that the variations can be added in quadrature. One could also imagine that two or more of the inputs are correlated because they depend on a common set of underlying quantities. In that case the correct procedure would be to vary these underlying quantities according to their uncertainties and to follow their propagation via the actual inputs to the output of the statistical learning method.
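For the correlated case, a rough sketch may help. It assumes two hypothetical inputs (an energy sum and an energy fraction) which both depend on the same measured energy e_a, a made-up linear classifier, and an assumed 5% scale uncertainty on e_a. The underlying quantity is varied, both inputs are recomputed from it, and only then is the fixed classifier applied.

```python
def derive_inputs(e_a, e_b):
    """Hypothetical inputs built from two measured energies; both depend on e_a."""
    total = e_a + e_b
    return total, e_a / total  # x1: energy sum, x2: energy fraction

def classifier(x1, x2):
    """Stand-in for the fixed, already trained classifier (assumption)."""
    return 0.02 * x1 + x2

CUT = 1.0  # fixed cut on the classifier output (assumed value)

def efficiency(raw_events, scale_a=1.0):
    """Efficiency after scaling the underlying energy e_a by scale_a."""
    passed = 0
    for e_a, e_b in raw_events:
        x1, x2 = derive_inputs(scale_a * e_a, e_b)  # both inputs move together
        passed += classifier(x1, x2) > CUT
    return passed / len(raw_events)

# Toy signal test set of (e_a, e_b) pairs, invented for illustration.
raw_signal = [(30.0, 10.0), (12.0, 20.0), (8.0, 14.0), (25.0, 5.0), (10.0, 25.0)]

nominal = efficiency(raw_signal)
delta_up = efficiency(raw_signal, scale_a=1.05) - nominal
delta_down = efficiency(raw_signal, scale_a=0.95) - nominal
print(nominal, delta_up, delta_down)
```

Varying x1 and x2 separately would treat them as independent and would miss the fact that a shift in e_a moves both inputs coherently.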
Example: Systematic Uncertainties
Figure 3.18 shows how an original test set showing an efficiency of 80% and a rejection of 90% is modified six times, with a variation upwards and downwards for each of the three inputs. The known systematic uncertainties may, for example, be the following: x1 has a Gaussian error distribution with an absolute sigma of σ1 = 0.1, while x2 and x3 have relative errors of σ2/x2 = 5% and σ3/x3 = 10%.
The usual way to create the modified test sets is to create exactly two sets per input (one for each 1σ variation, up and down). In our example we would have six modified test sets, starting with a set in which x1 is replaced by x1 + 0.1 in every event and ending with a set in which x3 is replaced by 0.9·x3 in every event. Since the individual inputs are assumed here to be independent, the six differences in efficiency and rejection (modified vs. original) can be added in quadrature. This is done by adding up all squared differences in the positive direction on the one hand and all squared differences in the negative direction on the other hand. This procedure also covers the case that both modification directions for the same input sometimes lead to a change (of efficiency or rejection) in the same direction.⁶
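The quadrature sum by sign can be written down compactly. The sketch below uses invented efficiency differences (they are not read off figure 3.18) and a hypothetical helper combine_in_quadrature; the x1 entries illustrate the case in which both variations of one input lower the efficiency.

```python
import math

def combine_in_quadrature(differences):
    """Return (upward, downward) totals from signed differences to the original."""
    up = math.sqrt(sum(d * d for d in differences if d > 0))
    down = math.sqrt(sum(d * d for d in differences if d < 0))
    return up, down

# Hypothetical efficiency differences (modified minus original test set).
efficiency_differences = {
    "x1 + 0.1": -0.02,
    "x1 - 0.1": -0.01,   # both x1 variations lower the efficiency
    "1.05 * x2": 0.03,
    "0.95 * x2": -0.02,
    "1.10 * x3": 0.04,
    "0.90 * x3": -0.01,
}

up, down = combine_in_quadrature(efficiency_differences.values())
print(f"efficiency uncertainty: +{up:.3f} / -{down:.3f}")
```

Because the differences are grouped by their sign rather than by the direction of the input variation, both x1 entries contribute to the downward total and neither is lost. The same function would be applied to the six rejection differences.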
As can be seen in figure 3.18 the same classifier and the same cut as for the original test set are used for all the modified sets. The variation of the efficiency is shown as
⁶ One can understand this effect if one thinks of a Gaussian input distribution for signal events from which the centre part, say [µ−σ, µ+σ], is selected by the learning method. Any shift in this distribution will result in a loss of efficiency.