
Computational data mining

4.6 Neural networks

Neural networks can be used for many purposes, notably descriptive and predictive data mining. They were originally developed in the field of machine learning to try to imitate the neurophysiology of the human brain through the combination of simple computational elements (neurons) in a highly interconnected system. They have become an important data mining method. However, the neural networks developed since the 1980s have only recently received attention from statisticians (e.g. Bishop, 1995; Ripley, 1996). Despite controversies over the real ‘intelligence’ of neural networks, there is no doubt they have now become useful statistical models. In particular, they show a notable ability to fit observed data, especially with high-dimensional databases, and data sets characterised by incomplete information, errors or inaccuracies. We will treat neural networks as a methodology for data analysis; we will recall the neurobiological model only to illustrate the fundamental principles.

A neural network is composed of a set of elementary computational units, called neurons, connected together through weighted connections. These units are organised in layers so that every neuron in a layer is exclusively connected to the neurons of the preceding layer and the subsequent layer. Every neuron, also called a node, represents an autonomous computational unit and receives inputs as a series of signals that dictate its activation. Following activation, every neuron produces an output signal. All the input signals reach the neuron simultaneously, so the neuron receives more than one input signal, but it produces only one output signal. Every input signal is associated with a connection weight. The weight determines the relative importance the input signal can have in producing the final impulse transmitted by the neuron. The connections can be excitatory, inhibitory or null according to whether the corresponding weights are respectively positive, negative or zero. The weights are adaptive coefficients that, in analogy with the biological model, are modified in response to the various signals that travel on the network according to a suitable learning algorithm. A threshold value, called bias, is usually introduced. Bias is similar to an intercept in a regression model.

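In more formal terms, a generic neuron $j$, with a threshold $\theta_j$, receives $n$ input signals $\boldsymbol{x} = [x_1, x_2, \ldots, x_n]$ from the units to which it is connected in the previous layer. Each signal has an importance weight attached; together these form the vector $\boldsymbol{w}_j = [w_{1j}, w_{2j}, \ldots, w_{nj}]$.

Figure 4.8 Representation of the activity of a neuron in a neural network. [Figure: the inputs $x_1, x_2, \ldots, x_n$, with weights $w_{1j}, w_{2j}, \ldots, w_{nj}$, are combined into the potential (1), which the activation function (2) transforms into the output.]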

The same neuron processes the input signals, their importance weights and the threshold value through something called a combination function. The combination function produces a value called potential, or net input. An activation function transforms the potential into an output signal. Figure 4.8 schematically represents the activity of a neuron. The combination function is usually linear, so the potential is a sum of the input values, each multiplied by the weight of the respective connection. This sum is compared with the value of the threshold. The potential of neuron $j$ is defined by the following linear combination:

$$P_j = \sum_{i=1}^{n} x_i w_{ij} - \theta_j$$

To simplify the expression for potential, the bias term can be absorbed by adding a further input with constant value $x_0 = 1$, connected to the neuron $j$ through a weight $w_{0j} = -\theta_j$:

$$P_j = \sum_{i=0}^{n} x_i w_{ij}$$

Now consider the output signal. The output of the $j$th neuron, $y_j$, is obtained by applying the activation function to potential $P_j$:

$$y_j = f(\boldsymbol{x}, \boldsymbol{w}_j) = f(P_j) = f\left(\sum_{i=0}^{n} x_i w_{ij}\right)$$
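The two ways of writing the potential can be checked numerically. The following is a minimal sketch, not part of the original text: the function names and the numerical values are illustrative assumptions, and the identity function stands in for the activation function $f$.

```python
def potential_with_threshold(x, w, theta):
    """Potential with the threshold written explicitly: sum_i x_i * w_ij - theta_j."""
    return sum(xi * wi for xi, wi in zip(x, w)) - theta


def potential_absorbed_bias(x, w, theta):
    """Same potential, with the bias absorbed as x_0 = 1 and w_0j = -theta_j."""
    x_ext = [1.0] + list(x)
    w_ext = [-theta] + list(w)
    return sum(xi * wi for xi, wi in zip(x_ext, w_ext))


def neuron_output(x, w, theta, activation):
    """Output y_j = f(P_j) for a caller-supplied activation function f."""
    return activation(potential_absorbed_bias(x, w, theta))


x = [0.5, -1.0, 2.0]   # illustrative input signals (assumed values)
w = [0.4, 0.3, 0.1]    # illustrative connection weights (assumed values)
theta = 0.2            # illustrative threshold (assumed value)

p1 = potential_with_threshold(x, w, theta)
p2 = potential_absorbed_bias(x, w, theta)
y = neuron_output(x, w, theta, lambda p: p)  # identity activation

print(p1, p2, y)  # the three values agree up to rounding
```

Both forms give the same potential, which is why the threshold can be treated as just another weight during learning.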

The quantities in bold italics are vectors. In defining a neural network model, the activation function is typically one of the elements to specify. Three types are commonly employed: linear, stepwise and sigmoidal. A linear activation function is defined by

$$f(P_j) = \alpha + \beta P_j$$

where $P_j$ is defined on the set of real numbers, and $\alpha$ and $\beta$ are real constants; $\alpha = 0$ and $\beta = 1$ is a particular case called the identity function, usually employed when the model requires the output of a neuron to be exactly equal to its level of activation (potential). Notice the strong similarity between the linear activation function and the expression for the regression line (Section 4.3). In fact, a regression model can be seen as a simple type of neural network.

A stepwise activation function is defined by

$$f(P_j) = \begin{cases} \alpha & P_j \geq \theta_j \\ \beta & P_j < \theta_j \end{cases}$$

The activation function can assume only two values according to whether or not the potential reaches the threshold $\theta_j$. For $\alpha = 1$, $\beta = 0$ and $\theta_j = 0$ we obtain the so-called sign activation function, which takes value 0 if the potential is negative and value $+1$ if the potential is non-negative.

Sigmoidal, or S-shaped, activation functions are probably the most used. They produce only positive output; the range of the function is the interval (0, 1). They are widely used because they are non-linear and also because they are easily differentiable and understandable. A sigmoidal activation function is defined by

$$f(P_j) = \frac{1}{1 + e^{-\alpha P_j}}$$

where $\alpha$ is a positive parameter that regulates the slope of the function.
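The three activation functions can be written down directly. This is a minimal sketch with assumed function names and default parameter values; assigning the boundary case $P_j = \theta_j$ to $\alpha$ in the stepwise function is also an assumption.

```python
import math


def linear(p, alpha=0.0, beta=1.0):
    """Linear activation alpha + beta * p; the defaults give the identity function."""
    return alpha + beta * p


def stepwise(p, alpha=1.0, beta=0.0, theta=0.0):
    """Stepwise activation: alpha when p reaches the threshold theta, beta otherwise."""
    return alpha if p >= theta else beta


def sigmoid(p, alpha=1.0):
    """Sigmoidal activation 1 / (1 + exp(-alpha * p)); alpha > 0 controls the slope.

    Note: math.exp can overflow for very large negative alpha * p.
    """
    return 1.0 / (1.0 + math.exp(-alpha * p))


for p in (-2.0, 0.0, 2.0):
    print(p, linear(p), stepwise(p), sigmoid(p))
```

Evaluating the functions on a few potentials shows the qualitative difference: the linear function is unbounded, the stepwise function jumps between two values, and the sigmoid moves smoothly between 0 and 1.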

Another type of activation function, the softmax function, is typically used to normalise the output of different but related nodes. Consider $g$ such nodes, and let their outputs be $v_j$, $j = 1, \ldots, g$. The softmax function normalises the $v_j$ so they sum to 1:

$$\operatorname{softmax}(v_j) = \frac{\exp(v_j)}{\sum_{h=1}^{g} \exp(v_h)} \qquad (j = 1, \ldots, g)$$

The softmax function is used in supervised classification problems, where the response variable can take $g$ alternative levels.
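A short sketch of the softmax normalisation follows. Subtracting the largest output before exponentiating is an implementation choice added here for numerical stability; it does not change the result. The input values are illustrative.

```python
import math


def softmax(v):
    """Normalise the node outputs v_1, ..., v_g so that they are positive and sum to 1."""
    m = max(v)  # shifting by the maximum avoids overflow and leaves the ratios unchanged
    exps = [math.exp(vj - m) for vj in v]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 1.0, 0.1])  # illustrative outputs of g = 3 related nodes
print(probs, sum(probs))
```

The normalised values keep the ordering of the original outputs and can be read as fitted class probabilities.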

4.6.1 Architecture of a neural network

The neurons of a neural network are organised in layers. These layers can be of three types: input, output or hidden. The input layer receives information only from the external environment; each neuron in it usually corresponds to an explanatory variable. The input layer does not perform any calculation; it transmits information to the next level. The output layer produces the final results, which are sent by the network to the outside of the system. Each of its neurons corresponds to a response variable. In a neural network there are generally two or more response variables. Between the output layer and the input layer there can be one or more intermediate layers, called hidden layers because they are not directly in contact with the external environment. These layers are exclusively for analysis; their function is to take the relationship between the input variables and the output variables and adapt it more closely to the data. In the literature there is no standard convention for calculating the number of layers in a neural network. Some authors count all the layers of neurons and others count the number of layers of weighted neurons. We will count only the layers of weighted neurons, that is, the layers whose weights are to be learnt from the data. The ‘architecture’ of a neural network refers to the network’s organisation: the number of layers, the number of units (neurons) belonging to each layer, and the manner in which the units are connected. Network architecture can be represented using a graph, hence people often use the term ‘network topology’ instead of ‘network architecture’. Four main characteristics are used to classify network topology:

• Degree of differentiation of the input and output layer

• Number of layers

• Direction of flow for the computation

• Type of connections

The simplest topology is called autoassociative; it has a single layer of intraconnected neurons. The input units coincide with the output units; there is no differentiation. We will not consider this type of network, as it has no statistical interest. Networks with a single layer of weighted neurons are known as single-layer perceptrons. They have $n$ input units $(x_1, \ldots, x_n)$ connected to a layer of $p$ output units $(y_1, \ldots, y_p)$ through a system of weights. The weights can be represented in matrix form:

$$\begin{bmatrix} w_{11} & \cdots & w_{1j} & \cdots & w_{1p} \\ \vdots & & \vdots & & \vdots \\ w_{i1} & \cdots & w_{ij} & \cdots & w_{ip} \\ \vdots & & \vdots & & \vdots \\ w_{n1} & \cdots & w_{nj} & \cdots & w_{np} \end{bmatrix}$$

for $i = 1, \ldots, n$; $j = 1, \ldots, p$. The generic weight $w_{ij}$ represents the weight of the connection between the $i$th neuron of the input layer and the $j$th neuron of the output layer.

Neural networks with more than one layer of weighted neurons, which contain one or more hidden layers, are called multilayer perceptrons, and we will concentrate on these. A two-layer network has one hidden layer; there are $n$ neurons in the input layer, $h$ in the hidden layer and $p$ in the output layer. Weights $w_{ik}$ ($i = 1, \ldots, n$; $k = 1, \ldots, h$) connect the input layer nodes with the hidden layer nodes; weights $z_{kj}$ ($k = 1, \ldots, h$; $j = 1, \ldots, p$) connect the hidden layer nodes with the output layer nodes. The neurons of the hidden layer receive information from the input layer, weighted by the weights $w_{ik}$, and produce outputs $h_k = f(\boldsymbol{x}, \boldsymbol{w}_k)$, where $f$ is the activation function of the units in the hidden layer. The neurons of the output layer receive the outputs from the hidden layer, weighted by the weights $z_{kj}$, and produce the final network outputs $y_j = g(\boldsymbol{h}, \boldsymbol{z}_j)$. The output of neuron $j$ in the output layer is

$$y_j = g\left(\sum_k h_k z_{kj}\right) = g\left(\sum_k z_{kj}\, f\left(\sum_i x_i w_{ik}\right)\right)$$

This equation shows that the output values of a neural network are determined recursively and typically in a non-linear way.
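The recursive computation can be sketched as a forward pass through a two-layer network. This is an illustration under stated assumptions: the weights are arbitrary, bias terms are omitted for brevity, the hidden activation $f$ is taken to be a sigmoid and the output activation $g$ the identity. None of these choices comes from the text.

```python
import math


def sigmoid(p):
    """Sigmoidal activation with unit slope."""
    return 1.0 / (1.0 + math.exp(-p))


def layer_potentials(inputs, weights):
    """weights[i][k] connects input i to unit k; returns one potential per unit."""
    n_units = len(weights[0])
    return [
        sum(inputs[i] * weights[i][k] for i in range(len(inputs)))
        for k in range(n_units)
    ]


def forward(x, W, Z, f=sigmoid, g=lambda p: p):
    """Return hidden outputs h_k = f(sum_i x_i w_ik) and outputs y_j = g(sum_k h_k z_kj)."""
    h = [f(p) for p in layer_potentials(x, W)]
    y = [g(p) for p in layer_potentials(h, Z)]
    return h, y


x = [1.0, 0.5]                 # n = 2 inputs (assumed values)
W = [[0.1, -0.2, 0.3],
     [0.4, 0.5, -0.6]]         # n x h weights w_ik, with h = 3 hidden units
Z = [[0.7], [-0.8], [0.9]]     # h x p weights z_kj, with p = 1 output unit

h, y = forward(x, W, Z)
print(h, y)
```

The output depends on the inputs only through the hidden outputs, which is the nesting of $f$ inside $g$ that makes the model non-linear in the weights.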

Different information flows lead to different types of network. In feedforward networks the information moves in only one direction, from one layer to the next, and there are no return cycles. In feedback networks it is possible that information returns to previous layers. If each unit of a layer is connected to all the units of the next layer, the network is described as totally interconnected; if each unit is connected to every unit of every layer, the network is described as totally connected.

Networks can also be classified into three types according to their connection weightings: networks with fixed weights, supervised networks and unsupervised networks. We shall not consider networks with fixed weights as they cannot learn from the data and they do not offer a statistical model. Supervised networks use a supervising variable, a concept introduced in Section 4.5. With a supervised network, there can be information about the value of a response variable corresponding to the values of the explanatory variables; this information can be used to learn the weights of the neural network model. The response variable behaves as a supervisor for the problem. When this information is not available, the learning of the weights is exclusively based on the explanatory variables and there is no supervisor. Here is the same idea expressed more formally:

Supervised learning: assume that each observation is described by a pair of vectors $(\boldsymbol{x}_i, \boldsymbol{t}_i)$ representing the explanatory and response variables, respectively. Let $D = \{(\boldsymbol{x}_1, \boldsymbol{t}_1), \ldots, (\boldsymbol{x}_n, \boldsymbol{t}_n)\}$ represent the set of all available observations. The problem is to determine a neural network $\boldsymbol{y}_i = f(\boldsymbol{x}_i)$, $i = 1, \ldots, n$, such that the sum of the distances $d(\boldsymbol{y}_i, \boldsymbol{t}_i)$ is minimum. Notice the analogy with linear regression models.

Unsupervised learning: each observation is described by only one vector, with all available variables, $D = \{\boldsymbol{x}_1, \ldots, \boldsymbol{x}_n\}$. The problem is the partitioning of the set $D$ into subsets such that the vectors $\boldsymbol{x}_i$ belonging to the same subset are ‘close’ according to a fixed measure of distance. This is basically a classification problem.

We will now examine the multilayer perceptron, an example of a supervised network, and the Kohonen network, an example of an unsupervised network.

4.6.2 The multilayer perceptron

The multilayer perceptron is the most used architecture for predictive data mining. It is a feedforward network with possibly several hidden layers, one input layer and one output layer, totally interconnected. It can be considered as a highly non-linear generalisation of the linear regression model when the output variables are quantitative, or of the logistic regression model when the output variables are qualitative.

Preliminary analysis

Multilayer perceptrons, and neural networks in general, are often used inefficiently on real data because no preliminary considerations are applied. Although neural networks are powerful computational tools for data analysis, they also require exploratory analysis (Chapter 3).

Coding of the variables

The variables used in a neural network can be classified by type – qualitative or quantitative – and by their role in the network – input or output. Input and output in neural networks correspond to explanatory and response in statistical methods. In a neural network, quantitative variables are represented by one neuron. The qualitative variables, both explanatory and responses, are represented in a binary way using several neurons for every variable; the number of neurons equals the number of levels of the variable (Section 2.3). In practice the number of neurons to represent a variable need not be exactly equal to the number of its levels. It is advisable to eliminate one level, and therefore one neuron, since the value of that neuron will be completely determined by the others.
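The binary coding of a qualitative variable can be sketched as follows. The function name, the variable and its levels are hypothetical; dropping the last level rather than any other is an arbitrary choice made for illustration.

```python
def encode_qualitative(value, levels, drop_last=True):
    """Code a qualitative value as binary indicators, one per retained level.

    With drop_last=True the final level is omitted, since its indicator
    is completely determined by the others.
    """
    if value not in levels:
        raise ValueError(f"unknown level: {value!r}")
    kept = levels[:-1] if drop_last else levels
    return [1 if value == level else 0 for level in kept]


levels = ["low", "medium", "high"]  # hypothetical variable with three levels

full = encode_qualitative("medium", levels, drop_last=False)  # [0, 1, 0]
reduced = encode_qualitative("medium", levels)                # [0, 1]
baseline = encode_qualitative("high", levels)                 # [0, 0]
print(full, reduced, baseline)
```

With the reduced coding the omitted level is recognised by all indicators being zero, so no information is lost by removing one neuron.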

Transformation of the variables

Once the variables are coded, a preliminary descriptive analysis may underline the need for some kind of transformation, perhaps to standardise the input variables to weight them in a proper way. Standardisation of the response variable is not strictly necessary. If a network has been trained with transformed input or output, when it is used for prediction, the outputs must be mapped on to the original scale.
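A minimal sketch of standardising a variable and mapping values back to the original scale is given below. The helper names and data are illustrative assumptions, and the standard deviation uses the divisor $n$, which is one of several possible conventions.

```python
import math


def fit_scaler(values):
    """Return the mean and standard deviation (divisor n) of the training values."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return mean, math.sqrt(var)  # sd is zero for a constant variable, which cannot be scaled


def standardise(values, mean, sd):
    """Transform values to zero mean and unit standard deviation."""
    return [(v - mean) / sd for v in values]


def to_original_scale(values, mean, sd):
    """Invert the standardisation, e.g. for predictions made on the transformed scale."""
    return [v * sd + mean for v in values]


t = [10.0, 12.0, 14.0, 20.0]          # illustrative response values
mean, sd = fit_scaler(t)
t_std = standardise(t, mean, sd)
t_back = to_original_scale(t_std, mean, sd)
print(t_std, t_back)
```

The same mean and standard deviation estimated on the training data must be reused at prediction time; otherwise the back-transformed outputs are not comparable with the original measurements.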

Reduction in the dimensionality of the input variables

One of the most important forms of preprocessing is reduction in the dimensionality of the input variables. The simplest approach is to eliminate a subset of the original inputs. Other approaches create linear or non-linear combinations of the original variables to represent the input for the network. Principal component methods can be usefully employed here (Section 3.5).

Choice of the architecture

The architecture of a neural network can have a fundamental impact on its performance on real data. Nowadays, many neural networks optimise their architecture as part of the learning process. Network architectures are rarely compared using the classical methods of Chapter 5; this is because a neural network does not need an underlying probabilistic model and seldom has one. Even when there is an underlying probabilistic model, it is often very difficult to derive the distribution of the statistical test functions. Instead it is possible to make comparisons based on the predictive performances of the alternative structures; an example is the cross-validation method (Chapter 6).

Learning of the weights

Learning the weights in multilayer perceptrons appears to introduce no particular problems. Having specified an architecture for the network, the weights are estimated on the basis of the data, as if they were parameters of a (complex) regression model. But in practice there are at least two aspects to consider:

• The error function between the observed values and the fitted values could be a classical distance function, such as the Euclidean distance or the misclassification error, or it could depend in a probabilistic way on the conditional distribution of the output variable with respect to the inputs.

• The optimisation algorithm needs to be a computationally efficient method to obtain estimates of the weights by minimising the error function.

The error functions usually employed for multilayer perceptrons are based on the maximum likelihood principle (Section 5.1). For a given training data set $D = \{(\boldsymbol{x}_1, \boldsymbol{t}_1), \ldots, (\boldsymbol{x}_n, \boldsymbol{t}_n)\}$, this requires us to minimise the entropy error function:

$$E(\boldsymbol{w}) = -\sum_{i=1}^{n} \log p(\boldsymbol{t}_i \mid \boldsymbol{x}_i; \boldsymbol{w})$$

where $p(\boldsymbol{t}_i \mid \boldsymbol{x}_i; \boldsymbol{w})$ is the distribution of the response variable, conditional on the input values and the weights. For more details see Bishop (1995). We will now look at the form of the error function for the two principal applications of the multilayer perceptron: predicting a continuous response (predictive regression) and predicting a qualitative response (predictive classification).

Error functions for predictive regression

Every component $t_{i,k}$ of the response vector $\boldsymbol{t}_i$ is assumed to be the sum of a deterministic component and an error term, similar to linear regression:

$$t_{i,k} = y_{i,k} + \varepsilon_{i,k} \qquad (k = 1, \ldots, q)$$

where $y_{i,k}$ is the $k$th component of the output vector $\boldsymbol{y}_i$. To obtain more information from a neural network for this problem it can be assumed that the error terms are normally distributed, similar to the normal linear model (Section 5.3). Since the objective of statistical learning is to minimise the error function in terms of the weights, we can omit everything that does not depend on the weights. Then we obtain

$$E(\boldsymbol{w}) = \sum_{i=1}^{n} \sum_{k=1}^{q} (t_{i,k} - y_{i,k})^2$$

This expression can be minimised using a least squares procedure (Section 4.3). In fact, linear regression can be seen as a neural network model without hidden layers and with a linear activation function.
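The step from the entropy error function to the sum of squares can be sketched as follows. The text states only that the errors are normal; the sketch additionally assumes that they are independent with zero mean and a common variance $\sigma^2$ that does not depend on the weights. Under these assumptions the conditional density of the response is

$$p(\boldsymbol{t}_i \mid \boldsymbol{x}_i; \boldsymbol{w}) = \prod_{k=1}^{q} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(t_{i,k} - y_{i,k})^2}{2\sigma^2}\right)$$

so that

$$-\sum_{i=1}^{n} \log p(\boldsymbol{t}_i \mid \boldsymbol{x}_i; \boldsymbol{w}) = \frac{1}{2\sigma^2} \sum_{i=1}^{n} \sum_{k=1}^{q} (t_{i,k} - y_{i,k})^2 + \frac{nq}{2} \log(2\pi\sigma^2)$$

The second term and the positive factor $1/(2\sigma^2)$ do not depend on the weights, so minimising this expression is equivalent to minimising the sum of squared differences between $t_{i,k}$ and $y_{i,k}$.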

Error functions for predictive classification

Multilayer perceptrons can also be employed for solving classification problems. They are then used to estimate the probabilities of affiliation of every observation to the various groups. There is usually an output unit for each possible class, and the activation function for each output unit represents the conditional probability $P(C_k \mid \boldsymbol{x})$, where $C_k$ is the $k$th class and $\boldsymbol{x}$ is the input vector. Output value $y_{i,k}$ represents the fitted probability that the observation $i$ belongs to the $k$th group $C_k$. To minimise the error function with respect to the weights, we need to minimise

$$E(\boldsymbol{w}) = -\sum_{i=1}^{n} \sum_{k=1}^{q} \left[ t_{i,k} \log y_{i,k} + (1 - t_{i,k}) \log(1 - y_{i,k}) \right]$$

which represents a distance based on the entropy index of heterogeneity (Section 3.1). Notice that a particular case can be obtained for the logistic regression model.

In fact, logistic regression can be seen as a neural network model without hidden nodes and with a logistic activation function and softmax output function. In contrast to logistic regression, which produces a linear discriminant rule, a multilayer perceptron provides a non-linear discriminant rule and this cannot be given a simple analytical description.
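The classification error function can be evaluated directly from the targets and the fitted probabilities. In this minimal sketch the function name, the data and the clipping constant `eps` are assumptions; the clipping is added only to avoid taking the logarithm of zero.

```python
import math


def cross_entropy_error(targets, outputs, eps=1e-12):
    """E(w) = -sum_i sum_k [t_ik log y_ik + (1 - t_ik) log(1 - y_ik)]."""
    total = 0.0
    for t_i, y_i in zip(targets, outputs):
        for t_ik, y_ik in zip(t_i, y_i):
            y_c = min(max(y_ik, eps), 1.0 - eps)  # keep probabilities away from 0 and 1
            total += t_ik * math.log(y_c) + (1.0 - t_ik) * math.log(1.0 - y_c)
    return -total


targets = [[1, 0], [0, 1]]            # class indicators t_ik for two observations
good = [[0.9, 0.1], [0.2, 0.8]]       # fitted probabilities close to the targets
poor = [[0.4, 0.6], [0.7, 0.3]]       # fitted probabilities far from the targets

e_good = cross_entropy_error(targets, good)
e_poor = cross_entropy_error(targets, poor)
print(e_good, e_poor)
```

Fitted probabilities that agree with the observed classes give a smaller error, which is the behaviour the learning algorithm exploits when adjusting the weights.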

Choice of optimisation algorithm

In general, the error function $E(\boldsymbol{w})$ of a neural network is highly non-linear in the weights, so there may be many minima that satisfy the condition $\nabla E = 0$.
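The presence of several minima can be illustrated with a toy example. The one-weight function below is an arbitrary polynomial chosen for illustration, not the error function of an actual network, and plain gradient descent with a fixed step is used only as a simple stand-in for an optimisation algorithm.

```python
def error(w):
    """Toy one-weight error surface with two local minima (not a real network error)."""
    return w ** 4 - 3 * w ** 2 + w


def gradient(w):
    """Derivative of the toy error function with respect to w."""
    return 4 * w ** 3 - 6 * w + 1


def gradient_descent(w0, rate=0.01, steps=2000):
    """Repeatedly move the weight against the gradient, starting from w0."""
    w = w0
    for _ in range(steps):
        w -= rate * gradient(w)
    return w


w_left = gradient_descent(-2.0)    # start to the left of both minima
w_right = gradient_descent(2.0)    # start to the right of both minima
print(w_left, error(w_left))
print(w_right, error(w_right))
```

Both runs stop at a point where the gradient is practically zero, but they reach different weights with different error values, so the result of the optimisation depends on the starting point.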