
4.2 Machine Learning Techniques

4.2.2 Artificial Neural Networks

An alternative way to carry out empirical risk minimisation is to consider a function $f(x;\phi)$, which depends on a vector of parameters $\phi$, and attempt to find the values of $\phi$ that minimise the risk $R_S(f)$ over the learning set $S = \{(x_0, y_0), \ldots, (x_n, y_n)\}$.

If $f(x;\phi)$ is differentiable with respect to the parameter vector $\phi$, the minimisation from Equation 4.4 can be attempted with gradient-based methods. The simplest gradient-based optimisation technique is referred to as gradient descent (GD). It can be applied to the previous problem by initialising the parameter vector at random to $\phi_0$ and then iteratively updating the model parameters $\phi$ at each step $t$ according to:

$$\phi_{t+1} = \phi_t - \eta(t)\,\nabla_\phi R_S(\phi_t) = \phi_t - \eta(t)\,\nabla_\phi \frac{1}{n} \sum_{(x_i,y_i)\in S} \Big( L(y_i, f(x_i;\phi_t)) + \Omega(\phi_t) \Big) \tag{4.23}$$

where $\nabla_\phi$ is the gradient operator with respect to the model parameters, $\eta(t)$ is the learning rate or step size and $\Omega(\phi)$ is a generic regularisation term added to the loss to constrain model complexity. Many other gradient-based optimisation methods exist [124], e.g. using second-order derivative information. The previous flavour of gradient descent is often referred to as batch gradient descent, because the whole learning set $S$ is used to compute the parameter updates at each step. Batch gradient descent can be very computationally demanding when the number of observations in $S$ is large and the computation of the gradient of the loss for each labelled observation is costly. In addition, batch gradient descent is a deterministic optimisation method and is likely to get stuck in a local minimum if the optimisation surface is non-convex.
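The batch update rule can be illustrated with a minimal sketch. The linear model, the squared-error loss, the L2 form of $\Omega$, the constant learning rate and all function names below are assumptions chosen for illustration; none of them is prescribed by the text.

```python
import random

# Illustrative assumptions: f(x; phi) = phi[0] + phi[1] * x, squared-error loss,
# Omega(phi) = l2 * |phi|^2 and a constant learning rate eta(t) = learning_rate.

def risk_gradient(phi, data, l2=0.0):
    """Gradient of (1/n) * sum_i [(f(x_i; phi) - y_i)^2 + l2 * |phi|^2] w.r.t. phi."""
    n = len(data)
    g0 = g1 = 0.0
    for x, y in data:
        residual = phi[0] + phi[1] * x - y
        g0 += 2.0 * residual
        g1 += 2.0 * residual * x
    return [g0 / n + 2.0 * l2 * phi[0], g1 / n + 2.0 * l2 * phi[1]]

def gradient_descent(data, phi0, learning_rate=0.5, n_steps=500, l2=0.0):
    """Batch gradient descent: every step uses the whole learning set."""
    phi = list(phi0)
    for _ in range(n_steps):
        grad = risk_gradient(phi, data, l2)
        phi = [p - learning_rate * g for p, g in zip(phi, grad)]
    return phi

# Noise-free toy learning set generated from y = 1 + 2x.
toy_data = [(i / 10.0, 1.0 + 2.0 * (i / 10.0)) for i in range(11)]

# Random initialisation of the parameter vector, with a fixed seed for reproducibility.
rng = random.Random(0)
phi0 = [rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)]
phi_hat = gradient_descent(toy_data, phi0)
```

For this convex toy problem the iterates approach the generating parameters (1, 2); in the non-convex case no such guarantee holds.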

A variation of the previous technique, referred to as stochastic gradient descent (SGD) [125], overcomes the mentioned issues by using a random subset $B = \{(x_0, y_0), \ldots, (x_m, y_m)\}$ of $m$ observations from the training set $S$ at each step. If $m$ is small the updates can be computed much faster, the trade-off being noisier estimates of $\mathbb{E}_{(x_i,y_i)\in S}\left[\nabla_\phi L(y_i, f(x_i;\phi_t))\right]$. In SGD, the parameter update rule from Equation 4.23 can instead be expressed as:

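$$\phi_{t+1} = \phi_t - \eta(t)\,\nabla_\phi R_B(\phi_t) = \phi_t - \eta(t)\,\nabla_\phi \frac{1}{m} \sum_{(x_i,y_i)\in B} \Big( L(y_i, f(x_i;\phi_t)) + \Omega(\phi_t) \Big) \tag{4.24}$$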

where $B$ is a random subset of size $m$ of the learning set $S$. In the original formulation $m = 1$, yet nowadays a larger value of $m$ is often used, in what is referred to as mini-batch SGD, to balance the noise of the gradient estimate and to take advantage of vectorised computations. Several variations of SGD exist, which in some cases can provide convergence advantages over the previous update rule by using adaptive learning rates or momentum in the update dynamics [126]. Stochastic gradient descent methods are a key element for training complex differentiable machine learning models $f(x;\phi)$ such as artificial neural networks, which will be discussed in the rest of this section. SGD in combination with a non-decomposable loss function is also used in Chapter 6 to learn inference-aware summary statistics.
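A minimal sketch of mini-batch SGD is given next. As before, the linear model, the squared-error loss, the constant learning rate, the batch size and the function names are illustrative assumptions rather than choices made in this work.

```python
import random

# Illustrative assumptions: f(x; phi) = phi[0] + phi[1] * x, squared-error loss,
# Omega(phi) = l2 * |phi|^2 and a constant learning rate.

def batch_gradient(phi, batch, l2=0.0):
    """Gradient of the risk estimated on the mini-batch B only."""
    m = len(batch)
    g0 = g1 = 0.0
    for x, y in batch:
        residual = phi[0] + phi[1] * x - y
        g0 += 2.0 * residual
        g1 += 2.0 * residual * x
    return [g0 / m + 2.0 * l2 * phi[0], g1 / m + 2.0 * l2 * phi[1]]

def sgd(data, phi0, batch_size=4, learning_rate=0.2, n_steps=5000, seed=0):
    """Mini-batch SGD: each step draws a random subset B of size m from S."""
    rng = random.Random(seed)
    phi = list(phi0)
    for _ in range(n_steps):
        batch = rng.sample(data, batch_size)  # sampled without replacement
        grad = batch_gradient(phi, batch)
        phi = [p - learning_rate * g for p, g in zip(phi, grad)]
    return phi

# Noise-free toy learning set generated from y = 1 + 2x.
toy_data = [(i / 10.0, 1.0 + 2.0 * (i / 10.0)) for i in range(11)]
phi_sgd = sgd(toy_data, phi0=[0.0, 0.0])
```

Because the toy labels are noise-free, the gradient noise vanishes at the optimum and a constant learning rate suffices here; with noisy data a decreasing $\eta(t)$ or one of the adaptive variants [126] would typically be preferred.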

A particularly promising family of parametric functions $f(x;\phi)$ is referred to as artificial neural networks. Artificial neural networks are differentiable functions based on the composition of simple (and possibly non-linear) operations. The simplest type of artificial neural network, depicted in Figure 4.2, is referred to as a feed-forward neural network. It maps an input $x$ to an output $y$ by means of a series of forward transformations, referred to as neural network layers. In the simplest configuration, the values at a given layer $k$ other than the input layer can be computed as a non-linear transformation of the result of a linear combination of the outputs of the previous layer after the addition of a bias term. The previous transformation can be expressed very compactly in matrix form as:

$$a^k = g\big((W^k)^T a^{k-1} + b^k\big) \tag{4.25}$$

where $a^k$ is the outcome in vector notation after the layer transformation, $a^{k-1}$ is the vector of values from the previous transformation (or $a^0 = x$ if it is the first layer after the input), $W^k$ is a matrix with all the linear combination coefficients and $b^k$ is the bias vector that is added after the linear combination. The activation function $g(z)$ is applied element-wise, and it is often based on a simple non-linear function. The sigmoid function $\sigma(z) = 1/(1 + e^{-z})$ used to be a common choice for the activation function, but nowadays the rectified linear unit (ReLU) function $g(z) = \max(0, z)$ and its variants are most frequently used instead.

The full feed-forward model $f(x;\phi)$ is based on the composition of transformations of the type described in Equation 4.25. When a single transformation is applied, i.e. $y = g(W^T x + b)$, the model can be referred to as a perceptron. If the model is instead based on the composition of several transformations, it can also be called a multi-layer perceptron (MLP), and each of the intermediate transformations (which can be composed of an arbitrary number of computational units) is referred to as a hidden layer. The model in Figure 4.2 is an MLP. The advantage of using models based on feed-forward neural networks with hidden layers is that, according to the universal approximation theorem [127], they can approximate any continuous function on a compact domain to arbitrary accuracy given enough hidden units. In fact, while it is still the focus of theoretical research, the use of a large number of hidden layers is found to increase the expressivity and facilitate the training of powerful neural network models. The experimental success of this family of techniques has led to the concept of deep learning, where multiple transformation layers are used for learning data representations in many learning tasks.
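The layer transformation from Equation 4.25 and its composition into an MLP can be sketched in a few lines. The weights, biases and input below are arbitrary illustrative values, and the helper names `dense_layer` and `mlp_forward` are hypothetical.

```python
import math

def relu(z):
    return max(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dense_layer(a_prev, W, b, g):
    """Compute a^k = g((W^k)^T a^{k-1} + b^k).

    W is stored as a list of rows with shape (len(a_prev), len(b)), so the
    transpose in Equation 4.25 corresponds to summing over the row index i.
    """
    outputs = []
    for j in range(len(b)):
        z = sum(W[i][j] * a_prev[i] for i in range(len(a_prev))) + b[j]
        outputs.append(g(z))  # activation applied element-wise
    return outputs

def mlp_forward(x, layers):
    """Compose several layers; `layers` is a list of (W, b, g) tuples."""
    a = x  # a^0 = x
    for W, b, g in layers:
        a = dense_layer(a, W, b, g)
    return a

# Arbitrary example: 2 inputs -> 3 hidden units (ReLU) -> 1 output (sigmoid).
W1 = [[0.5, -2.0, 0.0],
      [0.25, 0.5, 1.0]]
b1 = [0.0, 0.5, 1.0]
W2 = [[1.0], [1.0], [-1.0]]
b2 = [2.0]

x = [1.0, 2.0]
hidden = dense_layer(x, W1, b1, relu)  # -> [1.0, 0.0, 3.0]
y = mlp_forward(x, [(W1, b1, relu), (W2, b2, sigmoid)])  # -> [0.5]
```

The second hidden unit receives a negative pre-activation and is therefore set to zero by the ReLU, while the output pre-activation is exactly zero, so the sigmoid returns 0.5.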

[Figure 4.2 diagram: input layer with nodes x1 to x4 and bias node b0, hidden layer 1 with bias node b1, hidden layer 2 with bias node b2, and output layer with nodes y1 to y4.]

Figure 4.2: Graphical representation of a feed-forward neural network with two hidden layers, which is a function mapping an input $x$ to an output $y$ by means of simple non-linear transformations. The output value of a node in each layer (other than the input layer) is the result of applying an activation function $g$ to a linear combination of the previous layer outputs plus possibly a bias term.

A good choice of depth and overall structure for a neural network model depends on the problem at hand as well as on the characteristics and size of the available learning set. It therefore frequently has to be defined by trial and error, based on the performance on a validation set as discussed in Section 4.1.1. The output size and the choice of activation function in the last transformation often depend on the task at hand. For binary classification tasks, it is practical to use the sigmoid function $\sigma(z) = 1/(1 + e^{-z})$ as the activation function of the last layer, in combination with a loss function for soft classification (e.g. binary cross entropy from Equation 4.8). For multi-class classification problems, such as the one discussed in Section 4.3.2, the size of the output vector usually matches the number of categories, given that the softmax function (see Equation 4.11) is often used in the last layer to approximate conditional class probabilities in combination with a cross entropy loss (see Equation 4.10). For learning tasks other than classification, different output structures and constraints might be used, e.g. the output vector size in the use case in Chapter 6 corresponds to the number of dimensions of the resulting summary statistic, which is based on a transformation of the input using a multi-layer neural network.
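The two classification output configurations can be sketched as follows. The functions use the standard textbook definitions of binary cross entropy, softmax and cross entropy, which are assumed here to match Equations 4.8, 4.11 and 4.10; the function names and the example values are illustrative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def binary_cross_entropy(y, p):
    """Loss for a binary label y in {0, 1} and a predicted probability p in (0, 1)."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def softmax(z):
    """Map a vector of real-valued outputs to probabilities that sum to one."""
    shift = max(z)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(label, probs):
    """Loss for an integer class label and a vector of class probabilities."""
    return -math.log(probs[label])

# Binary classification: one output unit followed by a sigmoid.
p_signal = sigmoid(0.0)
loss_binary = binary_cross_entropy(1, p_signal)

# Multi-class classification: one output unit per category followed by a softmax.
class_probs = softmax([2.0, 1.0, 0.1])
loss_multi = cross_entropy(0, class_probs)
```

In both cases the last-layer activation turns unconstrained outputs into values that can be read as probabilities, which is what the cross entropy losses expect as input.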

The SGD update rule from Equation 4.24 requires the computation of the gradients of the loss function with respect to the model parameters. For complex models, e.g. those put together by stacking layers such as those described in Equation 4.25, the computation of derivatives by numerical finite differences or symbolic differentiation may become rather challenging. The former requires the evaluation of the loss function after variations of the parameters, for a total of at least twice the number of parameters, and is affected by round-off and truncation errors. A naive use of the latter could instead lead to very large expressions for the exact derivative that cannot be easily simplified. Given that a numerical function as implemented in a computer program is a sequence of simple operations (e.g. addition, subtraction, exponentiation, etc.), it is possible to efficiently obtain gradients and other derivatives by applying the chain rule repeatedly based on the structure of the program, the derivatives of the simple operations and a record of the intermediate values.
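The cost of numerical differentiation can be made explicit with a short sketch based on central finite differences. The toy loss, the step size `eps` and the function names are assumptions made for illustration.

```python
import math

def finite_difference_gradient(func, params, eps=1e-6):
    """Central finite differences: two function evaluations per parameter."""
    grad = []
    n_evals = 0
    for i in range(len(params)):
        up = list(params)
        down = list(params)
        up[i] += eps
        down[i] -= eps
        grad.append((func(up) - func(down)) / (2.0 * eps))
        n_evals += 2
    return grad, n_evals

def toy_loss(p):
    """Arbitrary smooth function of three parameters."""
    return p[0] ** 2 + 3.0 * p[0] * p[1] + math.sin(p[2])

# Exact gradient at (1, 2, 0) is (2*1 + 3*2, 3*1, cos(0)) = (8, 3, 1).
fd_grad, n_evals = finite_difference_gradient(toy_loss, [1.0, 2.0, 0.0])
```

The number of evaluations grows linearly with the number of parameters, which quickly becomes prohibitive for models with many parameters, and the result is only approximate because of the truncation and round-off errors controlled by `eps`.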

The previous family of techniques, which will not be discussed in depth in this work, is referred to as automatic differentiation (AD) [128]. The most efficient way of computing the gradients of a scalar-valued function that depends on many parameters, such as the gradient of the empirical risk for a batch of observations from Equation 4.24, is by means of reverse-mode automatic differentiation, which is also referred to as backpropagation in the context of neural network training. The computational cost of computing the full gradient of the loss to numerical precision using backpropagation is of the same order as a single forward evaluation of the loss, which provides a great advantage relative to finite differences. In addition, when implemented in a computation framework, it can be generally applied to any numerical function as long as it can be expressed as a computational graph, e.g. an arbitrary program containing control flow statements, without requiring complex expression simplification as would be the case for symbolic differentiation. In fact, modern computational frameworks that include automatic differentiation such as TensorFlow [129] or PyTorch [130] may also be used to compute higher-order gradients (e.g. Hessian matrix elements), which are useful in Chapter 6 to build a differentiable approximation of the covariance matrix based on a summary statistic.
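The idea behind reverse-mode automatic differentiation can be illustrated with a toy implementation for scalar values. This is only a sketch under simplifying assumptions (scalar nodes, three supported operations); the class `Var` and the functions `exp` and `backward` are hypothetical and do not correspond to the interface of TensorFlow or PyTorch.

```python
import math

class Var:
    """Scalar node of a computational graph.

    Each node records its value, its parent nodes and the local derivative of
    the node with respect to each parent, as evaluated in the forward pass.
    """
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents  # tuple of (parent_node, local_derivative)
        self.grad = 0.0

    def __add__(self, other):
        other = other if isinstance(other, Var) else Var(other)
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        other = other if isinstance(other, Var) else Var(other)
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))

def exp(v):
    e = math.exp(v.value)
    return Var(e, ((v, e),))

def backward(output):
    """Accumulate d(output)/d(node) in node.grad for every node of the graph."""
    order, seen = [], set()

    def visit(node):  # depth-first search producing a topological order
        if id(node) in seen:
            return
        seen.add(id(node))
        for parent, _ in node.parents:
            visit(parent)
        order.append(node)

    visit(output)
    output.grad = 1.0
    for node in reversed(order):  # chain rule applied from output to inputs
        for parent, local_derivative in node.parents:
            parent.grad += local_derivative * node.grad

# f(a, b) = a*b + exp(a*b); the intermediate value u = a*b is reused.
a, b = Var(0.5), Var(2.0)
u = a * b
f = u + exp(u)
backward(f)
# Analytically: df/da = b * (1 + e^u) and df/db = a * (1 + e^u), with u = 1.
```

A single backward sweep over the recorded graph yields the derivatives with respect to all inputs at once, which is why the cost stays of the same order as the forward evaluation regardless of the number of parameters.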

As mentioned before, reverse-mode automatic differentiation can be used to compute the gradients of an arbitrary function as long as it can be represented as a computational graph containing simple differentiable operations. Thus the neural network model $f(x;\phi)$ is not restricted to the composition of layers of the type described in Equation 4.25, which are often referred to as fully connected or dense layers. Alternative function components are useful for dealing with data that cannot be represented by a fixed-length vector [115], e.g. convolutional layers are often useful for working with 2D images while recurrent layers extend the application of neural networks to sequences that vary in length between observations. Both convolutional and recurrent layers are used in the neural network model for jet flavour-tagging described in Section 4.3.2. Other differentiable neural network components have also been developed to deal with permutation-invariant sets [131] or graphs [132] as input data structures, which could have promising applications in analyses at particle collider experiments.