using prior knowledge) are better suited. The data compression step from “raw” data – which usually consists of many inputs – to a higher level of abstraction with usually fewer inputs is called preprocessing (discussed in section 3.6).
Careful consideration of the known meanings and interactions of the available inputs is typically combined with many cycles of trial and error in which different combinations of inputs are tried out. For supervised learning (see below) the correlation between each input $x_i$ and the target value $y$
$$\rho_i = \frac{\mathrm{cov}(x_i, y)}{\sigma_{x_i}\sigma_y} = \frac{1}{N}\sum_{k=1}^{N}\frac{(x_{ik}-\bar{x}_i)(y_k-\bar{y})}{\sigma_{x_i}\sigma_y} \tag{3.1}$$
indicates which inputs may be well suited to describe the target because of a high correlation. A technique called relevance, which estimates the importance of each input after the training has been done, will be discussed in section 3.14.
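As an illustration of equation (3.1), the following minimal sketch computes the correlation of each input with the target and ranks the inputs by its absolute value. The function names, the toy data and the convention of returning 0 for a constant input are assumptions made for this example only; the standard deviations are taken with the factor 1/N so that they are consistent with the covariance in (3.1).

```python
import math


def correlation(x, y):
    """Correlation between one input x and the target y, following eq. (3.1)."""
    n = len(x)
    if n == 0 or n != len(y):
        raise ValueError("x and y must be non-empty and of equal length")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Covariance and standard deviations all use the same 1/N normalisation.
    cov = sum((xk - mean_x) * (yk - mean_y) for xk, yk in zip(x, y)) / n
    sigma_x = math.sqrt(sum((xk - mean_x) ** 2 for xk in x) / n)
    sigma_y = math.sqrt(sum((yk - mean_y) ** 2 for yk in y) / n)
    if sigma_x == 0.0 or sigma_y == 0.0:
        # Assumption of this sketch: a constant input is reported as uncorrelated.
        return 0.0
    return cov / (sigma_x * sigma_y)


def rank_inputs(inputs, y):
    """Return (name, rho) pairs sorted by decreasing |rho|."""
    rho = {name: correlation(values, y) for name, values in inputs.items()}
    return sorted(rho.items(), key=lambda item: abs(item[1]), reverse=True)


# Invented toy data: x1 follows the target closely, x2 does not.
y = [0, 0, 1, 1, 1, 0]
inputs = {
    "x1": [0.1, 0.3, 0.9, 0.8, 0.7, 0.2],
    "x2": [0.5, 0.4, 0.5, 0.6, 0.4, 0.6],
}
ranking = rank_inputs(inputs, y)
for name, rho in ranking:
    print(f"{name}: rho = {rho:+.3f}")
```

Such a ranking only captures linear dependence on a single input; an input with a low correlation may still be useful in combination with others.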
Too many inputs combined with only a moderate number of training examples lead to serious problems. The curse of dimensionality means that too few training examples distributed in a very high-dimensional space result in such a sparse density that the learning task becomes very difficult; overtraining (see section 3.11) in particular will be a problem. Section 3.6 will discuss some algorithms which try to reduce the dimensionality of the input space while trying not to lose information.
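The sparsity caused by a high-dimensional input space can be made tangible with a small numerical experiment. The sketch below is an illustration only: it assumes events distributed uniformly in a unit hypercube (real data are rarely uniform), keeps the number of events fixed, and measures the mean Euclidean distance from each event to its nearest neighbour as the dimension grows. The function name and all parameter values are hypothetical.

```python
import math
import random


def mean_nearest_neighbour_distance(n_points, dim, rng):
    """Mean distance to the nearest neighbour for uniform points in [0, 1]^dim."""
    points = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    total = 0.0
    for i, p in enumerate(points):
        # Brute-force search over all other points; fine for a few hundred events.
        total += min(math.dist(p, q) for j, q in enumerate(points) if j != i)
    return total / n_points


rng = random.Random(0)  # fixed seed so that the experiment is reproducible
distances = {
    dim: mean_nearest_neighbour_distance(200, dim, rng)
    for dim in (1, 2, 5, 10, 20)
}
for dim, dist in distances.items():
    print(f"dim = {dim:2d}: mean nearest-neighbour distance = {dist:.3f}")
```

With the number of events held fixed, the nearest neighbour moves further away as inputs are added, so any local estimate has to rely on events that are no longer close to the point of interest.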
Dealing with missing values for some components of the input vector in some events is important, for example, in medical applications (e.g. a specific test was not done on some of the patients). Some learning methods therefore have the capability to deal with missing information. In the further discussion, however, the data will be assumed to be complete.
A weight might also be part of the information that is available for each event. The weight does not describe the example but states its importance. Usually all examples have the same importance and thus all have weight 1. But for some datasets some events may be regarded as more important than others, resulting in a weight $w_i > 1$ for some and $w_i < 1$ for others. These weights may come from Monte Carlo simulations or may be introduced on purpose to modify the behaviour of the learning method (the boosting method does so, as described in section 5.5.4). Weights must not be used like normal inputs to distinguish one event from another; they should be used to steer the learning process and to evaluate the performance (see section 3.12).
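How event weights can enter the evaluation of the performance may be sketched as follows. The weighted misclassification rate used here is one possible choice and is an assumption of this example, as are the function name and the toy numbers. Note that the weights are never passed to the classifier as an input; they only scale the contribution of each event to the error.

```python
def weighted_error_rate(targets, predictions, weights=None):
    """Fraction of the total event weight that is misclassified."""
    if weights is None:
        # Default case: every event has the same importance (weight 1).
        weights = [1.0] * len(targets)
    if not (len(targets) == len(predictions) == len(weights)):
        raise ValueError("targets, predictions and weights must have equal length")
    total = sum(weights)
    if total <= 0.0:
        raise ValueError("the sum of weights must be positive")
    wrong = sum(w for t, p, w in zip(targets, predictions, weights) if t != p)
    return wrong / total


# Invented toy example: the second event is misclassified.
targets = [1, 1, 0, 0]
predictions = [1, 0, 0, 0]
unweighted = weighted_error_rate(targets, predictions)
weighted = weighted_error_rate(targets, predictions, [1.0, 3.0, 0.5, 0.5])
print(unweighted, weighted)
```

The same mistake costs more when the misclassified event carries a large weight, which is the sense in which weights steer the learning process and the evaluation.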
3.3 Supervised and Unsupervised Learning
Supervision of a statistical learning method means that each input vector $\vec{x}$ gets a label, the target value $y$. These target values have to be predicted by the learning method, and a teacher is available who provides feedback about the errors that are made.
For unsupervised learning the only available information is the sampled distribution of data points in the input space. Naturally the task is then to derive statements about the underlying probability distribution from which the data points were sampled. Typical statements that result from unsupervised learning tell about local densities and clusters. Unsupervised learning is sometimes useful as a preprocessing step to supervised learning. Section 3.6 discusses, for example, clustering as a way to reduce the dimensionality of a
problem. Typical unsupervised learning methods are discussed for example in [38, 31]. In the following we will concentrate on supervised learning.
Supervised learning means that $N$ pairs of input and corresponding target $(\vec{x}_i, y_i)$, $i = 1 \dots N$, are given to the learning method. The special case where the learning method itself can decide at which position $\vec{x}$ in the input space the corresponding target value $y$ should be given by the teacher is called active learning. Throughout this thesis we will always assume the more general case where the $N$ pairs $(\vec{x}_i, y_i)$, $i = 1 \dots N$, are fixed in advance.
The inference from these $N$ pairs of input and target to some kind of output function $\mathrm{out}(\vec{x})$ will be called the training of the statistical learning method. In chapter 5 we will see that some methods do not require a training step: they derive the output directly from the given examples without building a model in advance. Figure 3.1 shows a scheme of the usual process of training and evaluation for a classification problem.
Figure 3.1: Supervised learning methods (here for classification) which build a model are trained, resulting in a classifier which is then used to evaluate new events.
Example: Radial Basis Function Neural Networks
Radial Basis Function neural networks (figure 3.2) are usually trained in two steps. The first step consists of unsupervised clustering of the data. The centres and variances of the radial basis functions are determined, for example, with a Gaussian mixture model (see [31]).
The second step consists of supervised learning for which the found clusters remain fixed and only the influence of each cluster on the output is determined. Since only the weights of all basis functions in the final sum have to be determined this can be done by a simple multi-linear regression.
Figure 3.2: Structure of a Radial Basis Function Neural Network: The activation $a_i$ of each hidden neuron depends on the distance of the input event to the centre $c_i$ and on the radius $r_i$. The output is the (positively or negatively) weighted sum of all activations.
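The two training steps can be sketched in a few lines, under several assumptions that are not taken from the text: a plain k-means clustering is used here in place of a Gaussian mixture model, the radius of each basis function is set to the RMS distance of the cluster members from their centre, the activation is taken to be Gaussian, and the toy regression target is invented. NumPy is assumed to be available, and all function names are hypothetical.

```python
import numpy as np


def squared_distances(X, centres):
    """Squared Euclidean distance of every event to every centre."""
    return ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)


def cluster(X, n_clusters, n_iter=50, seed=0):
    """Step 1 (unsupervised): k-means centres and one radius per cluster."""
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), size=n_clusters, replace=False)].copy()
    for _ in range(n_iter):
        labels = squared_distances(X, centres).argmin(axis=1)
        for j in range(n_clusters):
            members = X[labels == j]
            if len(members) > 0:  # keep the old centre if a cluster is empty
                centres[j] = members.mean(axis=0)
    labels = squared_distances(X, centres).argmin(axis=1)
    radii = np.ones(n_clusters)  # fallback radius for clusters with < 2 members
    for j in range(n_clusters):
        members = X[labels == j]
        if len(members) > 1:
            radii[j] = np.sqrt(((members - centres[j]) ** 2).sum(axis=1).mean())
    return centres, np.maximum(radii, 1e-6)


def activations(X, centres, radii):
    """Gaussian activation of each hidden neuron for each event."""
    return np.exp(-squared_distances(X, centres) / (2.0 * radii ** 2))


def fit_output_weights(X, y, centres, radii):
    """Step 2 (supervised): linear least squares with the clusters held fixed."""
    w, *_ = np.linalg.lstsq(activations(X, centres, radii), y, rcond=None)
    return w


def predict(X, centres, radii, w):
    """Output: weighted sum of all activations."""
    return activations(X, centres, radii) @ w


# Invented toy problem: learn sin(x) from 200 events with one input.
rng = np.random.default_rng(1)
X = rng.uniform(-3.0, 3.0, size=(200, 1))
y = np.sin(X[:, 0])
centres, radii = cluster(X, n_clusters=8)
w = fit_output_weights(X, y, centres, radii)
rmse = float(np.sqrt(np.mean((predict(X, centres, radii, w) - y) ** 2)))
print(f"training RMSE: {rmse:.3f}")
```

Because the centres and radii stay fixed in the second step, the output is linear in the remaining parameters, which is why an ordinary least-squares fit suffices and no iterative optimisation is needed.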