2.3 Machine Learning
2.3.1 Supervised Learning
2.3.1.2 Artificial Neural Networks
Artificial Neural Networks (ANNs) are a robust and general approach to approximating real-valued, discrete-valued and vector-valued target functions [Hop82]. ANNs have proved extremely effective on a large variety of problems, such as pattern recognition [Bis95, Fuk88, CG88, Rip07], handwritten character recognition [LBD+89, RMS89, SBB+92], face recognition [LKL97, LGTB97], stock market prediction [KAYT90, GKD11], image compression [DH95] and many others.
The study of ANNs has been partially inspired by the observation of how neurons organize themselves into tightly connected networks in biological systems such as the human brain. Artificial neural networks are composed of densely interconnected simple units that take a number of real-valued inputs (possibly coming from other units) and produce a single real-valued output. There exist different types of fundamental units that serve as building blocks in ANNs; one common kind is the so-called perceptron [Ros62], depicted in Figure 2.6. The perceptron takes as input a vector X of N real values, calculates a linear combination of the inputs and then generates an output O(X) that is either 1 or 0, depending on whether or not the linear combination reaches a threshold.
\[
O(X) =
\begin{cases}
1 & \text{if } \sum_{i=1}^{N} x_i w_i \ge w_0 \\
0 & \text{otherwise}
\end{cases}
\tag{2.15}
\]
where $w_i$ is a real value called weight that determines the contribution of the input $x_i$, and $w_0$ represents the threshold.
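The threshold rule of Equation 2.15 can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `perceptron_output` and the example weights and threshold are assumptions chosen for the illustration, not values taken from the text.

```python
from typing import Sequence


def perceptron_output(x: Sequence[float], w: Sequence[float], w0: float) -> int:
    """Return 1 if the weighted sum of the inputs reaches the threshold w0, else 0."""
    weighted_sum = sum(xi * wi for xi, wi in zip(x, w))
    return 1 if weighted_sum >= w0 else 0


# Hypothetical example: two inputs, both weights 0.5, threshold 0.7.
both_active = perceptron_output([1.0, 1.0], [0.5, 0.5], 0.7)  # weighted sum 1.0
one_active = perceptron_output([1.0, 0.0], [0.5, 0.5], 0.7)   # weighted sum 0.5
print(both_active, one_active)
```

With these assumed values the unit fires only when both inputs are active, since a single active input contributes 0.5, which is below the threshold.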
Single perceptrons can be used as linear classifiers. The linear combination of weighted inputs is taken as input by the activation function, which is triggered only when its input exceeds a given threshold. Commonly used activation functions are non-linear functions such as the sigmoid $\sigma(x) = 1/(1 + e^{-x})$, the hyperbolic tangent $\tanh(x) = 2\sigma(2x) - 1$, the rectified linear unit (ReLU) $\mathrm{ReLU}(x) = \max(0, x)$ and many others. Single perceptrons can represent the primitive boolean functions AND, OR, NAND and NOR. Since every boolean function can be represented through combinations of these primitives, all boolean functions can be expressed using a network of perceptrons two levels deep, where the second stage collects the outputs of multiple first-stage units.
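As a minimal sketch, the activation functions named above and the four boolean primitives can be written as follows. The weights and thresholds in `GATES` are hypothetical choices (many others would work equally well), and `threshold_unit` is simply the perceptron rule of Equation 2.15 under an assumed name.

```python
import math
from itertools import product


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def tanh_from_sigmoid(x: float) -> float:
    # Uses the identity tanh(x) = 2 * sigmoid(2x) - 1.
    return 2.0 * sigmoid(2.0 * x) - 1.0


def relu(x: float) -> float:
    return max(0.0, x)


def threshold_unit(x, w, w0) -> int:
    """Perceptron rule: 1 if the weighted sum reaches the threshold w0, else 0."""
    return 1 if sum(xi * wi for xi, wi in zip(x, w)) >= w0 else 0


# Assumed (weights, threshold) pairs, one valid choice per boolean primitive.
GATES = {
    "AND": ((1.0, 1.0), 1.5),
    "OR": ((1.0, 1.0), 0.5),
    "NAND": ((-1.0, -1.0), -1.5),
    "NOR": ((-1.0, -1.0), -0.5),
}


def truth_table(name: str) -> list:
    """Outputs for the inputs (0,0), (0,1), (1,0), (1,1), in that order."""
    w, w0 = GATES[name]
    return [threshold_unit(x, w, w0) for x in product((0, 1), repeat=2)]


for name in GATES:
    print(name, truth_table(name))
```

Each gate differs from the others only in the sign of its weights and the position of its threshold, which is why a single unit suffices for all four.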
Figure 2.6: Example of ANN. Source: https://github.com/cdipaolo/goml/tree/master/perceptron
Perceptrons can only classify linearly separable groups of instances. Linearly separable means that a single straight line (or, in higher dimensions, a plane) drawn through the input space is sufficient to separate the instances belonging to the target class from all the others. If the instances cannot be separated in this fashion, a classifier based on a single perceptron will never be able to classify all of them correctly. In order to overcome the limitations of single perceptrons, artificial neural networks are modeled as collections of neurons that are connected in an acyclic graph: the outputs of some neurons can become inputs to other neurons [RHW85]. Cycles are not allowed since they would imply an infinite loop in the forward pass of the network. Often, ANNs are organized in separate layers of neurons; typical ANN topologies have an initial layer (input layer), a final layer (output layer) and one or multiple intermediate layers (hidden layers). An example of a basic ANN can be seen in Figure 2.7. Multi-layered ANNs where the information flows from the input layer to the output layer with no cycles allowed are also called feedforward networks. Feedforward networks containing three layers of units are able to approximate any function to arbitrary accuracy, given a sufficient (potentially very large) number of units in each layer [Cyb88, Cyb89].
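A classic illustration of this limitation is the XOR function, which is not linearly separable. The sketch below is an illustrative assumption rather than part of the original text: it searches a small grid of weights and thresholds to show that no single threshold unit in that grid reproduces XOR, and then builds XOR from a two-stage feedforward arrangement of units. The names `threshold_unit` and `xor_network` and all weights are hypothetical.

```python
from itertools import product


def threshold_unit(x, w, w0) -> int:
    """Perceptron rule: 1 if the weighted sum reaches the threshold w0, else 0."""
    return 1 if sum(xi * wi for xi, wi in zip(x, w)) >= w0 else 0


INPUTS = list(product((0, 1), repeat=2))   # (0,0), (0,1), (1,0), (1,1)
XOR_TARGETS = [0, 1, 1, 0]

# Try every (w1, w2, w0) on a coarse grid from -2.0 to 2.0 in steps of 0.5.
grid = [v / 2.0 for v in range(-4, 5)]
single_unit_solves_xor = any(
    [threshold_unit(x, (w1, w2), w0) for x in INPUTS] == XOR_TARGETS
    for w1, w2, w0 in product(grid, repeat=3)
)


def xor_network(a: int, b: int) -> int:
    # First stage: two units computing OR and NAND of the inputs.
    h_or = threshold_unit((a, b), (1.0, 1.0), 0.5)
    h_nand = threshold_unit((a, b), (-1.0, -1.0), -1.5)
    # Second stage: one unit computing AND of the first-stage outputs.
    return threshold_unit((h_or, h_nand), (1.0, 1.0), 1.5)


print("single unit solves XOR on grid:", single_unit_solves_xor)
print("two-stage network:", [xor_network(a, b) for a, b in INPUTS])
```

The grid search is only a demonstration, not a proof. The general argument is short: output 0 for (0,0) forces $w_0 > 0$, outputs 1 for (0,1) and (1,0) force $w_1 \ge w_0$ and $w_2 \ge w_0$, so $w_1 + w_2 \ge 2w_0 > w_0$, which contradicts output 0 for (1,1).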
The first step in creating an ANN consists of training the model in order to determine the input-output mapping. In the learning process the weights of the connections between the neurons are updated until they reach appropriate values; after the training stage the weights are fixed. Afterwards, the network can be used to generate an output given a vector of real values as input, for example to perform classification tasks.
Figure 2.7: Example of ANN. Source: http://cs231n.github.io/neural-networks-1/
The most commonly used algorithm to train ANNs is the backpropagation algorithm [RHW88, CR95]. The core steps of the backpropagation algorithm are the following. 1) Give a training sample $\langle X, T \rangle$ to the ANN as input and compare the generated output with the expected result; compute the error in each output neuron. For each output unit $k$ the error $\delta_k$ is computed with the formula $\delta_k = o_k (1 - o_k)(t_k - o_k)$, with $o_k$ the output of the neuron and $t_k$ the target outcome. 2) Propagate the error “backwards” to the hidden layers. The error formula for a hidden neuron $h$ is $\delta_h = o_h (1 - o_h) \sum_k w_{kh} \delta_k$; the error assigned to a hidden neuron depends on the errors of the output neurons it feeds, weighted by the weights of the connections between them. 3) Update each network weight $w_{ji} \leftarrow w_{ji} + \Delta_{ji}$, where $\Delta_{ji} = \eta \delta_j x_{ji}$; $x_{ji}$ is the input value to which the weight is applied and $\eta$ is the learning rate. The factor $o(1 - o)$ in these formulas is the derivative of the sigmoid, so they apply to networks of sigmoid units. The weight-update loop in backpropagation may be iterated a huge number of times, therefore several different termination conditions can be used to halt the procedure. One may choose to stop after a fixed number of iterations, once the error on the training examples falls below some threshold, once the error measure has not improved for a certain number of iterations, or once the error on a separate validation set of examples meets some criterion; the last option helps to keep overfitting in check.
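The three steps above can be sketched for a network with one hidden layer of sigmoid units. This is a minimal sketch under several assumptions not stated in the text: the class name `TinyNetwork`, a bias weight per unit (applied to a constant input of 1, playing the role of the threshold), the initial weight range, the learning rate, the boolean OR training set and the two termination conditions (an error threshold and a maximum number of epochs) are all illustrative choices.

```python
import math
import random


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


class TinyNetwork:
    """One hidden layer of sigmoid units; index 0 of each weight row is a bias."""

    def __init__(self, n_in: int, n_hidden: int, n_out: int, seed: int = 0):
        rng = random.Random(seed)
        self.w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
                         for _ in range(n_hidden)]
        self.w_out = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
                      for _ in range(n_out)]

    def forward(self, x):
        xs = [1.0] + list(x)  # constant 1 feeds the bias weight
        hidden = [sigmoid(sum(w * v for w, v in zip(row, xs)))
                  for row in self.w_hidden]
        hs = [1.0] + hidden
        out = [sigmoid(sum(w * v for w, v in zip(row, hs)))
               for row in self.w_out]
        return xs, hs, out

    def train_sample(self, x, t, eta: float) -> None:
        xs, hs, out = self.forward(x)
        # Step 1: delta_k = o_k (1 - o_k) (t_k - o_k) for each output unit k.
        delta_out = [o * (1.0 - o) * (tk - o) for o, tk in zip(out, t)]
        # Step 2: delta_h = o_h (1 - o_h) sum_k w_kh delta_k for each hidden unit h.
        # hs[h + 1] is the output of hidden unit h (hs[0] is the bias input).
        delta_hidden = [
            hs[h + 1] * (1.0 - hs[h + 1])
            * sum(self.w_out[k][h + 1] * delta_out[k] for k in range(len(out)))
            for h in range(len(self.w_hidden))
        ]
        # Step 3: w_ji <- w_ji + eta * delta_j * x_ji for every weight.
        for k, row in enumerate(self.w_out):
            for i in range(len(row)):
                row[i] += eta * delta_out[k] * hs[i]
        for h, row in enumerate(self.w_hidden):
            for i in range(len(row)):
                row[i] += eta * delta_hidden[h] * xs[i]


def total_error(net: TinyNetwork, data) -> float:
    """Half the sum of squared output errors over the whole data set."""
    return 0.5 * sum((tk - ok) ** 2
                     for x, t in data
                     for tk, ok in zip(t, net.forward(x)[2]))


def train(net, data, eta=0.5, max_epochs=5000, error_threshold=0.01) -> int:
    """Stop when the training error is small enough or after max_epochs."""
    for epoch in range(1, max_epochs + 1):
        for x, t in data:
            net.train_sample(x, t, eta)
        if total_error(net, data) < error_threshold:
            return epoch
    return max_epochs


# Assumed toy training set: the boolean OR function.
OR_DATA = [((0, 0), (0,)), ((0, 1), (1,)), ((1, 0), (1,)), ((1, 1), (1,))]

net = TinyNetwork(n_in=2, n_hidden=2, n_out=1)
initial_error = total_error(net, OR_DATA)
epochs_used = train(net, OR_DATA)
final_error = total_error(net, OR_DATA)
print(f"error before: {initial_error:.4f}, after: {final_error:.4f}, "
      f"epochs: {epochs_used}")
```

The hidden-layer errors are computed before any weight is changed, so step 2 uses the same weights $w_{kh}$ that produced the forward pass. Weights are updated after every sample (stochastic updates); accumulating the updates over the whole training set before applying them would be an equally valid variant.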
Over the years a lot of research effort has been devoted to improving the efficiency of this fundamental component of ANN-based learning [LBOM98, LK90, MVA99, RB93, VON92]. For example, Wang et al. [WTT+04] propose a novel backpropagation algorithm aimed at avoiding the local minima problem caused by neuron saturation in the hidden layer. The main point of the new method is to adapt the activation functions in order to prevent saturation in hidden layer neurons. Other approaches address weaknesses of backpropagation such as the risk of overfitting [TLL95, LG00]. Schittenkopf et al. [SDB97] describe a strategy to avoid overfitting in two-layered networks. They use two additional linear layers and principal component analysis to reduce the dimension of both the inputs and the internal representation; in this way less significant neurons and better generalization are obtained. Giles et al. [Gil01] show that increasing the number of hidden units and applying backpropagation with early stopping leads to ANNs that generalize well. This is due to the excess capacity of the hidden layer, which allows a better fit in regions of high non-linearity. Early stopping ensures that the enlarged network is not over-trained and therefore does not overfit.
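Early stopping can be sketched as a rule applied to the sequence of errors measured on a separate validation set after each training epoch. The sketch below is a generic illustration, not the procedure of any of the cited works; the function name `early_stopping_epoch`, the `patience` parameter and the example error values are assumptions.

```python
from typing import Sequence


def early_stopping_epoch(val_errors: Sequence[float], patience: int) -> int:
    """Return the epoch whose weights would be kept.

    Training is halted once the validation error has not improved for
    `patience` consecutive epochs; the best epoch seen so far is returned.
    """
    best_error = float("inf")
    best_epoch = 0
    for epoch, error in enumerate(val_errors):
        if error < best_error:
            best_error, best_epoch = error, epoch
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs: stop training
    return best_epoch


# Hypothetical validation errors: improvement up to epoch 2, then a rise.
errors = [0.9, 0.7, 0.6, 0.65, 0.7, 0.8, 0.5]
print(early_stopping_epoch(errors, patience=3))
```

With a patience of 3 the loop halts at epoch 5 and keeps the weights from epoch 2, so the later improvement at epoch 6 is never observed. A larger patience trades extra training time for a lower risk of stopping too early.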
Artificial Neural Networks are powerful instruments that can act as universal function approximators. One of the main drawbacks of ANNs is their lack of “transparency”: ANNs behave as black boxes, and once a network has been trained it is not always trivial (or even possible) to understand the criteria it uses to produce a certain output given a set of inputs. Another problem with ANNs is choosing the correct number of hidden layers and neurons: too few neurons can lead to poor approximations, while too many can result in overfitting and complicate the training phase. A very good analysis of the right number of hidden layers and neurons can be found in [CY01, KP00]. Even networks of practical size can represent a large number of nonlinear functions, making ANNs a powerful instrument for learning discrete and continuous functions whose general form is unknown in advance.