Model specification
4.6 Neural networks
Neural networks can be used for many purposes, notably descriptive and predictive data mining. They were originally developed in the field of machine learning to try to imitate the neurophysiology of the human brain through the combination of simple computational elements (neurons) in a highly interconnected system. They have become an important data mining method. However, the neural networks developed since the 1980s have only recently received attention from statisticians (e.g. Bishop, 1995; Ripley, 1996). Despite controversies over the real ‘intelligence’ of neural networks, there is no doubt that they have now become useful statistical models. In particular, they show a notable ability to fit observed data, especially with high-dimensional databases and data sets characterised by incomplete information, errors or inaccuracies. We will treat neural networks as a methodology for data analysis; we will recall the neurobiological model only to illustrate the fundamental principles.
A neural network is composed of a set of elementary computational units, called neurons, connected together through weighted connections. These units are organised in layers so that every neuron in a layer is exclusively connected to the neurons of the preceding layer and the subsequent layer. Every neuron, also called a node, represents an autonomous computational unit and receives inputs as a series of signals that dictate its activation. Following activation, every neuron produces an output signal. All the input signals reach the neuron simultaneously, so the neuron receives more than one input signal, but it produces only one output signal. Every input signal is associated with a connection weight. The weight determines the relative importance the input signal can have in producing the final impulse transmitted by the neuron. The connections can be excitatory, inhibitory or null according to whether the corresponding weights are respectively positive, negative or zero. The weights are adaptive coefficients that, by analogy with the biological model, are modified in response to the various signals that travel on the network according to a suitable learning algorithm. A threshold value, called bias, is usually introduced. Bias is similar to an intercept in a regression model.
In more formal terms, a generic neuron $j$, with a threshold $\theta_j$, receives $n$ input signals $\mathbf{x} = [x_1, x_2, \ldots, x_n]$ from the units to which it is connected in the previous layer. Each signal is assigned an importance weight; together the weights form the vector $\mathbf{w}_j = [w_{1j}, w_{2j}, \ldots, w_{nj}]$.
The same neuron processes the input signals, their importance weights, and the threshold value through a combination function. The combination function produces a value called the potential, or net input. An activation function transforms the potential into an output signal. Figure 4.7 schematically represents the activity of a neuron. The combination function is usually linear, therefore the potential is a weighted sum of the input values multiplied by the weights of the respective connections. The sum is compared with the threshold value. The potential of a neuron $j$ is defined by the linear combination
$$P_j = \sum_{i=1}^{n} x_i w_{ij} - \theta_j.$$
To simplify this expression, the bias term can be absorbed by adding a further input with constant value, $x_0 = 1$, connected to the neuron $j$ through a weight $w_{0j} = -\theta_j$:
$$P_j = \sum_{i=0}^{n} x_i w_{ij}.$$
Now consider the output signal. The output of the $j$th neuron, $y_j$, is obtained by applying the activation function to the potential $P_j$:
$$y_j = f(\mathbf{x}, \mathbf{w}_j) = f(P_j) = f\left(\sum_{i=0}^{n} x_i w_{ij}\right).$$
Figure 4.7 Representation of the activity of a neuron in a neural network: the inputs x1, x2, ..., xn, weighted by w1j, w2j, ..., wnj, are combined into the potential (1), which the activation function (2) transforms into the output.
The quantities in bold are vectors.
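The computation performed by a single neuron can be sketched in a few lines of Python. This is only an illustrative sketch: the function names (`potential`, `potential_absorbed`, `neuron_output`) and the numerical values are assumptions made here, not part of the text.

```python
def potential(x, w, theta):
    """Potential P_j = sum_i x_i * w_ij - theta_j of one neuron."""
    return sum(xi * wi for xi, wi in zip(x, w)) - theta


def potential_absorbed(x, w, theta):
    """Same potential, with the bias absorbed as the weight
    w_0j = -theta_j on an extra constant input x_0 = 1."""
    x_ext = [1.0] + list(x)
    w_ext = [-theta] + list(w)
    return sum(xi * wi for xi, wi in zip(x_ext, w_ext))


def neuron_output(x, w, theta, f):
    """Output y_j = f(P_j) for a given activation function f."""
    return f(potential(x, w, theta))


# Hypothetical input signals, weights and threshold.
x = [0.5, -1.0, 2.0]
w = [0.4, 0.3, 0.1]
theta = 0.2

p_plain = potential(x, w, theta)
p_absorbed = potential_absorbed(x, w, theta)
y = neuron_output(x, w, theta, f=lambda p: p)  # identity activation
```

Both ways of writing the potential give the same value up to floating-point rounding, which is why the bias can be treated as just another weight when the network is fitted.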
In the definition of a neural network model, the activation function is typically one of the elements to specify. Three types are commonly employed: linear, stepwise and sigmoidal. A linear activation function is defined by
$$f(P_j) = \alpha + \beta P_j,$$
where $P_j$ is defined on the set of real numbers, and $\alpha$ and $\beta$ are real constants; $\alpha = 0$ and $\beta = 1$ is a particular case called the identity function, usually employed when the model requires the output of a neuron to be exactly equal to its level of activation (potential). Notice the strong similarity between the linear activation function and the expression for a regression line (Section 4.3). In fact, a regression model can be seen as a simple type of neural network.
A stepwise activation function is defined by
$$f(P_j) = \begin{cases} \alpha, & P_j \geq \theta_j, \\ \beta, & P_j < \theta_j. \end{cases}$$
It can assume only two values depending on whether or not the potential reaches the threshold $\theta_j$. For $\alpha = 1$, $\beta = 0$ and $\theta_j = 0$ we obtain the so-called sign activation function, which takes the value 0 if the potential is negative and $+1$ otherwise.
Sigmoidal, or S-shaped, activation functions are probably the most commonly used. They produce only positive output; the range of the function is the interval (0, 1). They are widely used because they are non-linear and also because they are easily differentiable and understandable. A sigmoidal activation function is defined by
$$f(P_j) = \frac{1}{1 + e^{-\alpha P_j}},$$
where $\alpha$ is a positive parameter that regulates the slope of the function.
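The three activation functions described above can be sketched as follows. The function names and default parameter values are illustrative assumptions; the branch on the sign of the argument in `sigmoid` is an implementation choice made here to avoid numerical overflow, not something prescribed by the text.

```python
import math


def linear(p, alpha=0.0, beta=1.0):
    """Linear activation alpha + beta * p; the defaults give the identity."""
    return alpha + beta * p


def step(p, alpha=1.0, beta=0.0, theta=0.0):
    """Stepwise activation: alpha if p >= theta, otherwise beta."""
    return alpha if p >= theta else beta


def sigmoid(p, alpha=1.0):
    """Sigmoidal activation 1 / (1 + exp(-alpha * p)).

    The two branches are algebraically identical; splitting on the sign
    keeps math.exp from overflowing for large negative arguments.
    """
    z = alpha * p
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


# Evaluate each function on a few hypothetical potentials.
potentials = [-2.0, 0.0, 2.0]
linear_out = [linear(p) for p in potentials]
step_out = [step(p) for p in potentials]
sigmoid_out = [sigmoid(p) for p in potentials]
```

A larger value of `alpha` in `sigmoid` makes the curve steeper around zero, so that it increasingly resembles the stepwise function.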
Another type of activation function, the softmax function, is typically used to normalise the output of different but related nodes. Consider $g$ such nodes, and let their outputs be $v_j$, $j = 1, \ldots, g$. The softmax function normalises the $v_j$ so that they sum to 1:
$$\operatorname{softmax}(v_j) = \frac{\exp(v_j)}{\sum_{h=1}^{g} \exp(v_h)}, \quad j = 1, \ldots, g.$$
The softmax function is used in supervised classification problems, where the response variable can take $g$ alternative levels.
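A minimal sketch of the softmax normalisation follows. Subtracting the largest output before exponentiating is an assumption added here for numerical stability; it leaves the result unchanged because the common factor cancels in the ratio.

```python
import math


def softmax(v):
    """Normalise the node outputs v_1, ..., v_g so that they sum to 1."""
    m = max(v)  # shift by the maximum (stability choice, not in the text)
    exps = [math.exp(vj - m) for vj in v]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical outputs of g = 3 related nodes.
probs = softmax([1.0, 2.0, 3.0])
```

The normalised values preserve the ordering of the original outputs and can be read as fitted probabilities for the $g$ levels of the response.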
4.6.1 Architecture of a neural network
The neurons of a neural network are organised in layers. These layers can be of three types: input, output or hidden. The input layer receives information only from the external environment; each neuron in it usually corresponds to an explanatory variable. The input layer does not perform any calculation; it transmits information to the next level. The output layer produces the final results, which are sent by the network to the outside of the system. Each of its neurons corresponds to a response variable. In a neural network there are generally two or more response variables. Between the output and the input layer there can be one or more intermediate levels, called hidden layers because they are not directly in contact with the external environment. These layers are exclusively for analysis; their function is to take the relationship between the input variables and the output variables and adapt it more closely to the data.
In the literature there is no standard convention for counting the number of layers in a neural network. Some authors count all the layers of neurons and others count only the layers of weighted neurons. We will use the weighted neurons and count the number of layers that are to be learnt from the data.
The ‘architecture’ of a neural network refers to the network’s organisation: the number of layers, the number of units (neurons) belonging to each layer, and the manner in which the units are connected. Network architecture can be represented using a graph, hence people often use the term ‘network topology’ instead of ‘network architecture’. Four main characteristics are used to classify network topology:
• degree of differentiation of the input and output layer;
• number of layers;
• direction of the flow of computation;
• type of connections.
The simplest topology is called autoassociative; it has a single layer of intraconnected neurons. The input units coincide with the output units; there is no differentiation. We will not consider this type of network, as it is of no statistical interest. Networks with a single layer of weighted neurons are known as single layer perceptrons. They have $n$ input units $(x_1, \ldots, x_n)$ connected to a layer of $p$ output units $(y_1, \ldots, y_p)$ through weights that can be represented in matrix form:
$$
\begin{bmatrix}
w_{11} & \cdots & w_{1j} & \cdots & w_{1p} \\
\vdots &        & \vdots &        & \vdots \\
w_{i1} & \cdots & w_{ij} & \cdots & w_{ip} \\
\vdots &        & \vdots &        & \vdots \\
w_{n1} & \cdots & w_{nj} & \cdots & w_{np}
\end{bmatrix},
$$
for $i = 1, \ldots, n$ and $j = 1, \ldots, p$; $w_{ij}$ is the weight of the connection between the $i$th neuron of the input layer and the $j$th neuron of the output layer.
Neural networks with more than one layer of weighted neurons, which contain one or more hidden layers, are called multilayer perceptrons, and we will concentrate on these. A two-layer network has one hidden layer; there are $n$ neurons in the input layer, $h$ in the hidden layer and $p$ in the output layer. Weights $w_{ik}$ ($i = 1, \ldots, n$; $k = 1, \ldots, h$) connect the input layer nodes with the hidden layer nodes; weights $z_{kj}$ ($k = 1, \ldots, h$; $j = 1, \ldots, p$) connect the hidden layer nodes with the output layer nodes. The neurons of the hidden layer receive information from the input layer, weighted by the weights $w_{ik}$, and produce outputs $h_k = f(\mathbf{x}, \mathbf{w}_k)$, where $f$ is the activation function of the units in the hidden layer. The neurons of the output layer receive the outputs from the hidden layer, weighted by the weights $z_{kj}$, and produce the final network outputs $y_j = g(\mathbf{h}, \mathbf{z}_j)$. The output of neuron $j$ in the output layer is
$$y_j = g\left(\sum_k h_k z_{kj}\right) = g\left(\sum_k z_{kj}\, f\left(\sum_i x_i w_{ik}\right)\right).$$
This equation shows that the output values of a neural network are determined recursively and typically in a non-linear way.
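The recursive computation of the outputs of a two-layer network can be sketched as below. The names `forward`, `W` and `Z`, the choice of a sigmoidal hidden activation with an identity output activation, and all numerical values are assumptions made for illustration; bias terms are omitted, as in the equation above, and could be absorbed through a constant input.

```python
import math


def sigmoid(p):
    """Sigmoidal activation with slope parameter 1."""
    return 1.0 / (1.0 + math.exp(-p))


def identity(p):
    """Identity activation."""
    return p


def forward(x, W, Z, f=sigmoid, g=identity):
    """Outputs y_j = g(sum_k z_kj * f(sum_i x_i * w_ik)).

    x : list of n input values
    W : n x h nested list, W[i][k] = w_ik (input to hidden weights)
    Z : h x p nested list, Z[k][j] = z_kj (hidden to output weights)
    """
    n, h, p = len(x), len(Z), len(Z[0])
    hidden = [f(sum(x[i] * W[i][k] for i in range(n))) for k in range(h)]
    return [g(sum(hidden[k] * Z[k][j] for k in range(h))) for j in range(p)]


# Hypothetical network with n = 2 inputs, h = 3 hidden units, p = 1 output.
x = [1.0, 2.0]
W = [[0.5, -0.5, 0.0],
     [0.25, 0.25, 0.0]]
Z = [[1.0], [1.0], [2.0]]
y = forward(x, W, Z)
```

With these values the hidden potentials are 1, 0 and 0, so the single output equals `sigmoid(1.0) + 0.5 + 2 * 0.5`. Replacing `f` by the identity would collapse the two layers into a single linear map, which illustrates why the non-linearity of the hidden units matters.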
Different information flows lead to different types of network. In feedforward networks the information moves in only one direction, from one layer to the next, and there are no return cycles. In feedback networks it is possible for information to return to previous layers. If each unit of a layer is connected to all the units of the next layer, the network is described as totally interconnected; if each unit is connected to every unit of every layer, the network is described as totally connected.
Networks can also be classified into three types according to their connection weightings: networks with fixed weights, supervised networks and unsupervised networks. We shall not consider networks with fixed weights as they cannot learn from the data and they cannot constitute a statistical model. Supervised networks use a supervising variable, a concept introduced in Section 4.5. With a supervised network, there can be information about the value of a response variable corresponding to the values of the explanatory variables; this information can be used to learn the weights of the neural network model. The response variable behaves as a supervisor for the problem. When this information is not available, the learning of the weights is exclusively based on the explanatory variables and there is no supervisor. Here is the same idea expressed more formally:
• Supervised learning. Assume that each observation is described by a pair of vectors $(\mathbf{x}_i, \mathbf{t}_i)$ representing the explanatory and response variables, respectively. Let $D = \{(\mathbf{x}_1, \mathbf{t}_1), \ldots, (\mathbf{x}_n, \mathbf{t}_n)\}$ represent the set of all available observations. The problem is to determine a neural network $\mathbf{y}_i = f(\mathbf{x}_i)$, $i = 1, \ldots, n$, such that the sum of the distances $d(\mathbf{y}_i, \mathbf{t}_i)$ is minimum. Notice the analogy with linear regression models.
• Unsupervised learning. Each observation is described by only one vector, with all available variables, $D = \{\mathbf{x}_1, \ldots, \mathbf{x}_n\}$. The problem is the partitioning of the set $D$ into subsets such that the vectors $\mathbf{x}_i$ belonging to the same subset are ‘close’ with respect to a fixed measure of distance. This is basically a classification problem.
We now examine the multilayer perceptron, an example of a supervised network, and the Kohonen network, an example of an unsupervised network.
4.6.2 The multilayer perceptron
The multilayer perceptron is the most commonly used architecture for predictive data mining. It is a feedforward network, with possibly several hidden layers, one input layer and one output layer, totally interconnected. It can be considered as a highly non-linear generalisation of the linear regression model when the output variables are quantitative, or of the logistic regression model when the output variables are qualitative.
Preliminary analysis
Multilayer perceptrons, and neural networks in general, are often used inefficiently on real data because no preliminary considerations are applied. Although neural networks are powerful computation tools for data analysis, they also require exploratory analysis (Chapter 3).
Coding of the variables
The variables used in a neural network can be classified by type (qualitative or quantitative) and by their role in the network (input or output). Input and output in neural networks correspond to explanatory and response variables in statistical methods. In a neural network, quantitative variables are represented by one neuron. Qualitative variables, both explanatory and response, are represented in a binary way using several neurons for every variable; the number of neurons equals the number of levels of the variable. In practice the number of neurons to represent a variable need not be exactly equal to the number of its levels. It is advisable to eliminate one level, and therefore one neuron, since the value of that neuron will be completely determined by the others.
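The binary coding of a qualitative variable can be sketched as follows. The function name `dummy_code`, the choice to drop the first level in alphabetical order, and the example data are all assumptions made for illustration.

```python
def dummy_code(values, drop_first=True):
    """Code a qualitative variable with one binary neuron per level.

    With drop_first=True one level (here the first in sorted order, an
    arbitrary choice) is eliminated, since its indicator is completely
    determined by the others.
    """
    levels = sorted(set(values))
    kept = levels[1:] if drop_first else levels
    rows = [[1 if v == level else 0 for level in kept] for v in values]
    return kept, rows


# Hypothetical qualitative variable with three levels.
colours = ["red", "green", "blue", "green"]
kept, rows = dummy_code(colours)
full_levels, full_rows = dummy_code(colours, drop_first=False)
```

Here the level `blue` is dropped, so an observation equal to `blue` is coded by zeros on both remaining neurons.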
Transformation of the variables
Once the variables are coded, a preliminary descriptive analysis may indicate the need for some kind of transformation, perhaps to standardise the input variables to weight them in a proper way. Standardisation of the response variable is not strictly necessary. If a network has been trained with transformed input or output, when it is used for prediction, the outputs must be mapped on to the original scale.
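A minimal sketch of standardising a variable and mapping values back on to the original scale is given below. The helper names, the use of the population standard deviation, and the data are assumptions made here for illustration.

```python
import statistics


def fit_scaler(values):
    """Return the mean and standard deviation used for standardisation."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)  # population SD; an illustrative choice
    if sd == 0:
        sd = 1.0  # a constant variable is only centred
    return mean, sd


def standardise(values, mean, sd):
    """Transform values to zero mean and unit standard deviation."""
    return [(v - mean) / sd for v in values]


def to_original_scale(z_values, mean, sd):
    """Map standardised values (e.g. network outputs) back to the original scale."""
    return [z * sd + mean for z in z_values]


# Hypothetical response values used for training.
t = [10.0, 20.0, 30.0]
mean, sd = fit_scaler(t)
z = standardise(t, mean, sd)
back = to_original_scale(z, mean, sd)
```

The mean and standard deviation should be stored with the fitted network, because the same two numbers are needed to transform new inputs and to return predictions to their original units.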
Reduction of the dimensionality of the input variables
One of the most important forms of preprocessing is reduction of the dimensionality of the input variables. The simplest approach is to eliminate a subset of the original inputs. Other approaches create linear or non-linear combinations of the original variables to represent the input for the network. Principal component methods can be usefully employed here (Section 3.5).
Choice of the architecture
The architecture of a neural network can have a fundamental impact on the results obtained with real data. Nowadays, many neural networks optimise their architecture as part of the learning process. Network architectures are rarely compared using the classical methods described later in this chapter; this is because a neural network does not need an underlying probabilistic model and seldom has one. Even when there is an underlying probabilistic model, it is often very difficult to derive the distribution of the statistical test functions. Instead it is possible to make comparisons based on the predictive performance of the alternative structures; an example is the cross-validation method (Section 5.4).
Learning of the weights
Learning the weights in multilayer perceptrons appears to introduce no particular problems. Having specified an architecture for the network, the weights are estimated on the basis of the data, as if they were parameters of a (complex) regression model. But in practice there are at least two aspects to consider:
• The error function between the observed values and the fitted values could be a classical distance function, such as the Euclidean distance or the misclassification error, or it could depend in a probabilistic way on the conditional distribution of the output variable with respect to the inputs.
• The optimisation algorithm needs to be a computationally efficient method to obtain estimates of the weights by minimising the error function.
The error functions usually employed for multilayer perceptrons are based on the maximum likelihood principle (Section 4.9). For a given training data set $D = \{(\mathbf{x}_1, \mathbf{t}_1), \ldots, (\mathbf{x}_n, \mathbf{t}_n)\}$, this requires us to minimise the entropy error function
$$E(\mathbf{w}) = -\sum_{i=1}^{n} \log p(\mathbf{t}_i \mid \mathbf{x}_i; \mathbf{w}),$$
where $p(\mathbf{t}_i \mid \mathbf{x}_i; \mathbf{w})$ is the distribution of the response variable, conditional on the input values and the weights. For more details, see Bishop (1995). We now look at the form of the error function for the two principal applications of the multilayer perceptron: predicting a continuous response (predictive regression) and a qualitative response (predictive classification).
Error functions for predictive regression
Every component $t_{i,k}$ of the response vector $\mathbf{t}_i$ is assumed to be the sum of a deterministic component and an error term, similar to linear regression:
$$t_{i,k} = y_{i,k} + \varepsilon_{i,k}, \quad k = 1, \ldots, q,$$
where $y_{i,k} = y_k(\mathbf{x}_i, \mathbf{w})$ is the $k$th component of the output vector $\mathbf{y}_i$. To obtain more information from a neural network for this problem it can be assumed that the error terms are normally distributed, similar to the normal linear model (Section 4.11).
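Under the normality assumption, the maximum likelihood error function, that is the negative log-likelihood of the training data, can be written out explicitly. The sketch below additionally assumes that the error terms are independent with a common variance $\sigma^2$; both the independence and the symbol $\sigma^2$ are assumptions introduced here for illustration.

```latex
% Assumption: independent errors, \varepsilon_{i,k} \sim N(0, \sigma^2)
p(\mathbf{t}_i \mid \mathbf{x}_i; \mathbf{w})
  = \prod_{k=1}^{q} \frac{1}{\sqrt{2\pi\sigma^2}}
    \exp\left\{ -\frac{(t_{i,k} - y_{i,k})^2}{2\sigma^2} \right\}

% Negative log-likelihood over the n observations
-\sum_{i=1}^{n} \log p(\mathbf{t}_i \mid \mathbf{x}_i; \mathbf{w})
  = \frac{1}{2\sigma^2} \sum_{i=1}^{n} \sum_{k=1}^{q} (t_{i,k} - y_{i,k})^2
    + \frac{nq}{2} \log\left(2\pi\sigma^2\right)
```

Only the double sum of squared differences involves the weights, through the fitted values $y_{i,k}$; the second term and the positive factor $1/(2\sigma^2)$ do not.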
Since the objective of statistical learning is to minimise the error function in terms of the weights, we can omit everything that does not depend on the weights. We obtain
$$E(\mathbf{w}) = \sum_{i=1}^{n} \sum_{k=1}^{q} (t_{i,k} - y_{i,k})^2.$$
This expression can be minimised using a least squares procedure (Section 4.3). In fact, linear regression can be seen as a neural network model without hidden layers and with a linear activation function.
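The link between linear regression and a network without hidden layers can be illustrated with a small sketch. It fits one input and one output with an identity activation, first by iterative gradient descent on the sum-of-squares error and then by the closed-form least squares solution. Gradient descent is used here only as an example of an iterative optimiser; the learning rate, number of iterations, function names and data are assumptions, not values from the text.

```python
def sum_of_squares(w0, w1, xs, ts):
    """Sum-of-squares error of the network output y = w0 + w1 * x."""
    return sum((t - (w0 + w1 * x)) ** 2 for x, t in zip(xs, ts))


def fit_by_gradient_descent(xs, ts, rate=0.01, epochs=5000):
    """Minimise the sum-of-squares error iteratively (illustrative settings)."""
    w0, w1 = 0.0, 0.0
    for _ in range(epochs):
        residuals = [t - (w0 + w1 * x) for x, t in zip(xs, ts)]
        grad0 = -2.0 * sum(residuals)
        grad1 = -2.0 * sum(r * x for r, x in zip(residuals, xs))
        w0 -= rate * grad0
        w1 -= rate * grad1
    return w0, w1


def fit_by_least_squares(xs, ts):
    """Closed-form least squares estimates for a single explanatory variable."""
    n = len(xs)
    mx, mt = sum(xs) / n, sum(ts) / n
    sxy = sum((x - mx) * (t - mt) for x, t in zip(xs, ts))
    sxx = sum((x - mx) ** 2 for x in xs)
    w1 = sxy / sxx
    return mt - w1 * mx, w1


# Hypothetical training data.
xs = [0.0, 1.0, 2.0, 3.0]
ts = [1.1, 2.9, 5.2, 6.8]

gd_w0, gd_w1 = fit_by_gradient_descent(xs, ts)
ls_w0, ls_w1 = fit_by_least_squares(xs, ts)
```

On this small example the two sets of estimates agree closely, since both minimise the same error function; with hidden layers and non-linear activations no closed form is available and an iterative method is required.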
Error functions for predictive classification
Multilayer perceptrons can also be employed for solving classification problems. In this type of application, a neural network is employed to estimate the probabilities of affiliation of every observation to the various groups. There is usually an output unit for each possible class, and the activation function for each output unit represents the conditional probability $P(C_k \mid \mathbf{x})$, where $C_k$ is the $k$th class and $\mathbf{x}$ is the input vector. The output value $y_{i,k}$ represents the fitted probability that the observation $i$ belongs to the $k$th group $C_k$. To minimise the error function with respect to the weights, we need to minimise
$$E(\mathbf{w}) = -\sum_{i=1}^{n} \sum_{k=1}^{q} \left[ t_{i,k} \log y_{i,k} + (1 - t_{i,k}) \log(1 - y_{i,k}) \right],$$
which represents a distance based on the entropy index of heterogeneity (Section 3.1). Notice that a particular case of the preceding expression can be obtained for the logistic regression model.
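The classification error function can be evaluated with a short sketch such as the one below. The function name `cross_entropy`, the clipping constant `eps` (added here so that the logarithm is never taken at exactly 0 or 1) and the example matrices are assumptions made for illustration.

```python
import math


def cross_entropy(T, Y, eps=1e-12):
    """Error E = -sum_i sum_k [t*log(y) + (1 - t)*log(1 - y)].

    T : observed 0/1 class indicators, one row per observation
    Y : fitted probabilities with the same shape as T
    eps : clipping bound keeping log() finite (an implementation choice)
    """
    total = 0.0
    for t_row, y_row in zip(T, Y):
        for t, y in zip(t_row, y_row):
            y = min(max(y, eps), 1.0 - eps)
            total -= t * math.log(y) + (1 - t) * math.log(1 - y)
    return total


# Hypothetical indicators for two observations and two classes.
T = [[1, 0], [0, 1]]
Y_sharp = [[0.9, 0.1], [0.2, 0.8]]   # fitted probabilities close to T
Y_vague = [[0.6, 0.4], [0.5, 0.5]]   # fitted probabilities far from T

e_sharp = cross_entropy(T, Y_sharp)
e_vague = cross_entropy(T, Y_vague)
```

Fitted probabilities that agree more closely with the observed class indicators give a smaller error, which is the behaviour a distance between observed and fitted values should have.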
In fact, logistic regression can be seen as a neural network model without hid-