5.4 Methods Based on Linear Separation
5.4.2 Neural Networks
Neural networks [77, 78] have become a standard tool in many fields of application. Based on the idea of modelling a machine learning algorithm on the human brain, neural networks are now used as precise classifiers (and regression methods) in many commercial and scientific applications. We will focus here on the so-called multilayer perceptron or feed forward network.
The elementary unit of a neural network is called a neuron (figure 5.5 (a)). It computes the function

$$\mathrm{out} = \sigma\Big(\sum_j x_j w_j - b\Big) \quad\text{where}\quad \sigma(a) = \frac{1}{1+e^{-a}}. \tag{5.20}$$
The sigmoidal transfer function σ is plotted in figure 5.5 (b).
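As a minimal sketch of equation 5.20, the neuron can be written in a few lines of plain Python. The function names `sigmoid` and `neuron` and the example weights are illustrative assumptions, not taken from the text.

```python
import math

def sigmoid(a):
    """Soft threshold function sigma(a) = 1 / (1 + exp(-a)) from equation 5.20."""
    return 1.0 / (1.0 + math.exp(-a))

def neuron(x, w, b):
    """Weighted sum of the inputs minus the threshold b, passed through sigma."""
    weighted_sum = sum(xj * wj for xj, wj in zip(x, w)) - b
    return sigmoid(weighted_sum)

# Hypothetical weights and threshold, chosen only for illustration.
w = [1.0, -2.0, 0.5]
b = 0.5

print(neuron([1.0, 0.0, 1.0], w, b))  # argument of sigma is +1.0, output above 0.5
print(neuron([0.0, 1.0, 0.0], w, b))  # argument of sigma is -2.5, output below 0.5
```

An input for which the weighted sum equals the threshold gives an output of exactly 0.5.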
The argument of the sigmoid function is called the activation. Geometrically, the functionality of such a neuron is easy to interpret once the activation is recognised as a distance measure for a separating hyperplane in the input space, defined by the normal vector $\vec{w}$ and the threshold $b$. Applying $\sigma$ to the sum results in a small value (near 0) on one side of the hyperplane and a large value (near 1) on the other, with a soft transition region in between. The length of the weight vector $\vec{w}$ scales the steepness of the threshold function and thereby the size of the transition region, as shown in figure 5.6.

Figure 5.5: A neuron sums up weighted inputs, subtracts the threshold (a) and passes the sum through the soft threshold function (b). [Panel (a): inputs x1, x2, x3 with weights w1, w2, w3, threshold −b and output out. Panel (b): σ plotted against the activation a.]

Figure 5.6: The length of the weight vector scales the width of the transition region in input space. The separating hyperplane is a line in this two-dimensional example. On the line the activation is 0 and the output thus 0.5. The longer the weight vector, the smaller is the width of the band from, for example, out = 0.1 to out = 0.9.

Historically, the first kind of feed forward neural network consisted of only one neuron and was called the perceptron [79]. A simple training rule was developed to apply this single neuron to real-life problems. The criticism of Minsky and Papert [80] eliminated most of the enthusiasm which came along with this first attempt to create an artificial neural network that performs similarly to the human brain. Minsky and Papert remarked that this linear classifier is not able to solve some very simple problems like the XOR configuration shown in figure 4.1.
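The XOR limitation can be illustrated with a short experiment. The sketch below assumes the classic perceptron learning rule with a hard threshold (the text does not spell out the training rule, so this choice is an assumption) and applies it to the AND and the XOR configuration. All names are illustrative.

```python
def predict(x, w, b):
    """Hard-threshold neuron: 1 on one side of the hyperplane, 0 on the other."""
    return 1 if sum(xj * wj for xj, wj in zip(x, w)) - b > 0 else 0

def train_perceptron(samples, epochs=100, eta=1.0):
    """Classic perceptron rule (assumed here); returns (w, b, converged)."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        errors = 0
        for x, y in samples:
            delta = y - predict(x, w, b)
            if delta != 0:
                errors += 1
                # Move the hyperplane towards classifying this sample correctly.
                w = [wj + eta * delta * xj for wj, xj in zip(w, x)]
                b -= eta * delta
        if errors == 0:  # a full pass without mistakes: data are separated
            return w, b, True
    return w, b, False

AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

_, _, and_ok = train_perceptron(AND)
_, _, xor_ok = train_perceptron(XOR)
print("AND separable by one neuron:", and_ok)
print("XOR separable by one neuron:", xor_ok)
```

For AND a separating line exists and the rule finds it. For XOR no single line separates the two classes, so no pass over the data can ever be free of errors, regardless of the number of epochs.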
It took several years to find a training rule for the multi-layer network structure shown in figure 5.7, which offers the possibility to model much more complex functions than just a linear separation. This is achieved with the help of a sufficiently large number of neurons in the so-called hidden⁴ layers. Throughout this thesis only one hidden layer will be used, since a general theorem [81] guarantees that any continuous function can be approximated arbitrarily well with only one hidden layer containing a sufficiently large number of neurons.
The output of a network with one hidden layer and a single output is calculated by

$$\mathrm{out} = \sigma\Big(\sum_i \sigma\Big(\sum_j x_j w_{ij} - b_i\Big)\cdot\tilde{w}_i - \tilde{b}\Big) \tag{5.21}$$

where $\vec{x}$ is the input, $w_{ij}$ are the weights connecting neuron $i$ in the hidden layer with the $j$th input and $\tilde{w}_i$ are the weights connecting the output neuron with the $i$th neuron in the hidden layer. $b_i$ are the thresholds of the hidden neurons and $\tilde{b}$ is the threshold of the output neuron. From the classification point of view a combination of the separations done with the hidden neurons is calculated in the output neuron.

⁴ Often the input layer is counted as the first layer despite the fact that no neuron is calculated there. Figure 5.7 would then be a three-layer network.

Figure 5.7: Architecture of a feed forward neural network with one hidden layer.
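Equation 5.21 can be sketched directly in Python. The weights below are hand-picked for illustration (they are not trained and not taken from the text); they are chosen so that two hidden neurons combine into the XOR configuration that a single neuron cannot separate.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def network_output(x, w_hidden, b_hidden, w_out, b_out):
    """One hidden layer, single output, following equation 5.21."""
    hidden = [
        sigmoid(sum(xj * wij for xj, wij in zip(x, w_i)) - b_i)
        for w_i, b_i in zip(w_hidden, b_hidden)
    ]
    return sigmoid(sum(h * wt for h, wt in zip(hidden, w_out)) - b_out)

# Hypothetical hand-picked parameters:
# hidden neuron 0 fires if at least one input is 1, hidden neuron 1 only if both are 1.
w_hidden = [[20.0, 20.0], [20.0, 20.0]]
b_hidden = [10.0, 30.0]
# The output neuron fires if neuron 0 is on and neuron 1 is off.
w_out = [20.0, -20.0]
b_out = 10.0

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, network_output(x, w_hidden, b_hidden, w_out, b_out))
```

The output is close to 0 for (0, 0) and (1, 1) and close to 1 for (0, 1) and (1, 0): the output neuron combines the two linear separations of the hidden layer.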
Regression
From the regression point of view any arbitrarily complex function can be formed by overlaying the sigmoid functions from a sufficiently large number of hidden neurons. The soft threshold function of the output neuron is often omitted for regression but can still be used if the range of the target is within the interval [0,1].
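To illustrate how overlaid sigmoids build up a function, the sketch below uses two hidden neurons and a linear output (the output sigmoid is omitted, as described above) to form a localised bump between x = 1 and x = 2. The parameters are hypothetical and set by hand; more hidden neurons would allow more such building blocks to be combined.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def regression_output(x, w_hidden, b_hidden, w_out, b_out):
    """One-dimensional input, one hidden layer, linear output neuron."""
    hidden = [sigmoid(x * w - b) for w, b in zip(w_hidden, b_hidden)]
    return sum(h * wt for h, wt in zip(hidden, w_out)) - b_out

# Hypothetical parameters: a rising edge at x = 1 minus a rising edge at x = 2.
steepness = 20.0
w_hidden = [steepness, steepness]
b_hidden = [steepness * 1.0, steepness * 2.0]
w_out = [1.0, -1.0]
b_out = 0.0

for x in [0.0, 1.0, 1.5, 2.0, 3.0]:
    print(x, regression_output(x, w_hidden, b_hidden, w_out, b_out))
```

The result is close to 1 inside the interval and close to 0 far outside it. Smaller values of `steepness` give softer edges.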
Parameters and Regularisation
Both weights and biases are optimised to fit the given classification task during the training phase. In principle any optimisation technique can be used to find the best weights. Historically the “back-propagation” algorithm [82, 83] was used in the first applications and is still used frequently. It will be used here as an example to discuss controlling parameters and mechanisms for regularisation.
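Given the cost function per event, with target value $y$,

$$\mathrm{Cost} = \frac{1}{2}\big(\mathrm{out}(\vec{x}) - y\big)^2 \tag{5.22}$$

a gradient descent approach

$$\Delta w \propto -\frac{\partial\,\mathrm{Cost}}{\partial w} \quad\text{with}\quad w \in \{\tilde{w}_i, \tilde{b}, w_{ij}, b_i\} \tag{5.23}$$

leads to the update rule

$$\Delta w(k) = -\eta\,\frac{\partial\,\mathrm{Cost}}{\partial w} + \mu\,\Delta w(k-1) \quad\text{with}\quad w \in \{\tilde{w}_i, \tilde{b}, w_{ij}, b_i\}. \tag{5.24}$$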
The partial derivatives can be calculated directly for $\tilde{w}_i$ and $\tilde{b}$ and, via the chain rule (back-propagation), also for $w_{ij}$ and $b_i$. The update rule 5.24 contains two parameters: the step width of the gradient descent ($\eta$) and the scaling of a momentum term ($\mu$). Both mainly control how fast the algorithm converges and help it to avoid getting stuck in secondary minima. These parameters can be set in various ways and can even be varied during the training (compare the details given in appendix B.3).
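The following sketch writes out the partial derivatives of the cost 5.22 for the network 5.21 and applies the update rule 5.24 for a single event. It is an illustration under assumptions, not the implementation used in this thesis: the dictionary layout of the parameters, the helper names and the example values of η and µ are all hypothetical.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def forward(x, p):
    """Return hidden outputs and network output for equation 5.21."""
    hidden = [sigmoid(sum(xj * wij for xj, wij in zip(x, w_i)) - b_i)
              for w_i, b_i in zip(p["w"], p["b"])]
    out = sigmoid(sum(h * wt for h, wt in zip(hidden, p["w_out"])) - p["b_out"])
    return hidden, out

def cost(x, y, p):
    """Cost per event, equation 5.22."""
    return 0.5 * (forward(x, p)[1] - y) ** 2

def gradients(x, y, p):
    """Partial derivatives of the cost with respect to all weights and thresholds."""
    hidden, out = forward(x, p)
    d_out = (out - y) * out * (1.0 - out)            # derivative w.r.t. output activation
    d_hid = [d_out * wt * h * (1.0 - h)              # chain rule back to hidden activations
             for wt, h in zip(p["w_out"], hidden)]
    return {
        "w_out": [d_out * h for h in hidden],
        "b_out": -d_out,                             # thresholds enter with a minus sign
        "w": [[d * xj for xj in x] for d in d_hid],
        "b": [-d for d in d_hid],
    }

def zeros_like(v):
    return [zeros_like(e) for e in v] if isinstance(v, list) else 0.0

def step(param, grad, prev, eta, mu):
    """Equation 5.24 for one parameter or a (nested) list; returns (new value, delta)."""
    if isinstance(param, list):
        pairs = [step(q, g, d, eta, mu) for q, g, d in zip(param, grad, prev)]
        return [a for a, _ in pairs], [d for _, d in pairs]
    delta = -eta * grad + mu * prev
    return param + delta, delta

def train_event(x, y, p, prev, eta=0.1, mu=0.9):
    """One update of all parameters for one event; prev holds the previous deltas."""
    g = gradients(x, y, p)
    new_p, new_prev = {}, {}
    for key in p:
        new_p[key], new_prev[key] = step(p[key], g[key], prev[key], eta, mu)
    return new_p, new_prev

# Hypothetical starting values for a network with two inputs and two hidden neurons.
params = {"w": [[0.5, -0.3], [0.2, 0.8]], "b": [0.1, -0.2],
          "w_out": [0.7, -0.4], "b_out": 0.05}
prev = {key: zeros_like(value) for key, value in params.items()}
x, y = (1.0, 0.0), 1.0

params1, prev1 = train_event(x, y, params, prev)
print("cost before:", cost(x, y, params))
print("cost after: ", cost(x, y, params1))
```

In a full training the update is repeated over all events for many iterations, and `prev` carries the momentum from one update to the next.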
Regularisation is done with the number of hidden neurons: the more separating hyperplanes are used, the more complex the decision boundary can be. The lengths of the weight vectors also influence the overtraining behaviour. The shorter the weight vectors are, the softer the threshold function is (small $a$ in equation 5.20, compare figure 5.6). Soft threshold functions are combined to soft decision boundaries, while long weight vectors induce sharp thresholds and sharp decision boundaries. A weight decay term added to the update rule of the back-propagation algorithm can be used to penalise large weights and thereby control overtraining (compare the details given in appendix B.3). A weight decay for the output neuron can also be interpreted as maximisation of the margin (compare the support vector algorithm in section 5.4.3).
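The link between weight length and softness can be made quantitative for a single neuron. Since the activation equals the length of the weight vector times the signed distance from the hyperplane, the band from out = 0.1 to out = 0.9 has the width 2 ln 9 / |w|. The sketch below computes this width and shows one common form of a weight decay step. The exact decay term of appendix B.3 is not reproduced here, so the form `-eta * decay * w` and the parameter name `decay` are assumptions.

```python
import math

def transition_width(w, lo=0.1, hi=0.9):
    """Distance in input space between the out=lo and out=hi contours of one neuron."""
    def logit(p):
        return math.log(p / (1.0 - p))   # inverse of the sigmoid
    norm = math.sqrt(sum(wj * wj for wj in w))
    return (logit(hi) - logit(lo)) / norm

def decayed_update(w, grad, eta, decay):
    """Gradient step with an additional weight decay term (assumed form)."""
    return [wj - eta * (gj + decay * wj) for wj, gj in zip(w, grad)]

w = [4.0, 3.0]                       # hypothetical weight vector of length 5
print("width before decay:", transition_width(w))

# With a vanishing cost gradient only the decay acts and shortens the vector.
for _ in range(10):
    w = decayed_update(w, [0.0, 0.0], eta=0.1, decay=0.5)
print("width after decay: ", transition_width(w))
```

Each decay step multiplies the weights by a constant factor below one, so the transition region widens and the decision boundary becomes softer unless the cost gradient counteracts it.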
The gradient descent is a local optimisation process and depends on the starting point given by a random initialisation of the weights and biases. Usually multiple networks with different initialisations are trained to avoid local minima.
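The dependence on the starting point can be demonstrated without a full network. The sketch below replaces the cost surface of a network by a hypothetical one-dimensional function with one global and one secondary minimum, runs gradient descent from several random starting points and keeps the best result. The toy function, the step width and the number of restarts are illustrative assumptions.

```python
import random

def toy_cost(w):
    """Hypothetical cost with a global minimum near -1.30 and a secondary one near 1.13."""
    return w ** 4 - 3.0 * w ** 2 + w

def toy_gradient(w):
    return 4.0 * w ** 3 - 6.0 * w + 1.0

def descend(w, eta=0.01, steps=500):
    """Plain gradient descent from the starting point w."""
    for _ in range(steps):
        w -= eta * toy_gradient(w)
    return w

def multi_start(starts):
    """Run gradient descent from every start and return the end point with lowest cost."""
    return min((descend(w0) for w0 in starts), key=toy_cost)

rng = random.Random(1)  # fixed seed so the run is reproducible
starts = [rng.uniform(-2.0, 2.0) for _ in range(5)]
for w0 in starts:
    w_end = descend(w0)
    print("start", w0, "-> end", w_end, "cost", toy_cost(w_end))
print("best of all starts:", multi_start(starts))
```

A single run ends in whichever minimum lies downhill from its starting point. Taking the best of several runs reduces the chance of reporting only the secondary minimum.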
Execution Times and Variants
The training times depend on the chosen strategy but are usually minutes to hours. Once a network is trained, the evaluation for any given input is very fast. Hardware implementations making use of the inherent parallelism of neural networks have been constructed. In recent implementations the calculation of a large digital neural network may take only 400 ns [84]. Hardware implementations of neural networks will be discussed in appendix A.

Variants of the presented method of a feed forward neural network with one hidden layer naturally extend the architecture to more hidden layers (despite the theorem discussed above) and to specific interconnections of these layers (including recurrent ones). Frequently more than one output neuron is used. As mentioned above, there are many different possibilities to train the network besides back-propagation, ranging from conjugate gradient [85] to genetic algorithms [86]. Extensions to the basic training procedure implement dynamic construction (adding) and dynamic pruning (removing) of neurons [87].