Lecture 17-Classification by Backpropagation-M

(1)

CSC479

Data Mining

Lecture # 17

Classification by

Backpropagation

(2)

• Brain

• A marvelous piece of architecture and design.

• In association with a nervous system, it controls the life patterns, communications, interactions, growth and development of

(3)



There are about 10^10 to 10^14 nerve cells (called neurons) in an adult human brain.



Neurons are highly connected with each

other. Each nerve cell is connected to

hundreds of thousands of other nerve cells.



Passage of information between neurons is

slow (in comparison to transistors in an IC). It

takes place in the form of electrochemical

signals between two neurons in milliseconds.



Energy consumption per neuron is low

(4)

These look rather like blobs of ink, don't they?

Taking a closer look

reveals that there is a

large collection of different

molecules, working together

coherently, in an organized

manner.

(5)

[Figure: structure of a biological neuron, with labels for the cell body, nucleus, dendrites, axon, synapses, and axons arriving from other neurons]

(6)

(7)

(8)



An artificial neural network is an information

processing system that has certain performance

characteristics in common with biological neural

networks.



An ANN can be characterized by:

1. Architecture: The pattern of connections between different neurons.

2. Training or Learning Algorithms: The method of determining weights on the connections.

(9)



There are two basic categories:

1. Feed-forward Neural Networks

These are the nets in which the signals flow from the input units to the output units, in a forward direction.

They are further classified as:

   1. Single Layer Nets

   2. Multi-layer Nets

2. Recurrent Neural Networks



These are the nets in which the signals can

(10)

(11)

(12)

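[Figure: feed-forward network diagram with input units X_1 ... X_n, output units Y_1 ... Y_m, connection weights (labeled w_11 ... w_nm and v), and bias inputs fixed at 1]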

(13)

Supervised Training



Training is accomplished by presenting a sequence of

training vectors or patterns, each with an associated

target output vector.



The weights are then adjusted according to a learning

algorithm.



During training, the network develops an associative

memory. It can then recall a stored pattern when it is

given an input vector that is sufficiently similar to a

vector it has learned.

Unsupervised Training



A sequence of input vectors is provided, but no target

vectors are specified in this case.



The net modifies its weights and biases, so that the

(14)


Classification by Backpropagation



Backpropagation: A neural network learning algorithm

Started by psychologists and neurobiologists to develop and test computational analogues of neurons

A neural network: A set of connected input/output units where each connection has a weight associated with it

During the learning phase, the network learns by adjusting the weights so as to be able to predict the correct class label of the input tuples

Also referred to as connectionist learning due to the

(15)


Neural Network as a Classifier



Weakness

• Long training time

• Require a number of parameters typically best determined empirically, e.g., the network topology or "structure"

• Poor interpretability: Difficult to interpret the symbolic meaning behind the learned weights and of "hidden units" in the network

Strength

• High tolerance to noisy data

• Ability to classify untrained patterns

• Well-suited for continuous-valued inputs and outputs

• Successful on a wide array of real-world data

• Algorithms are inherently parallel

• Techniques have recently been developed for the extraction of

(16)


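A Neuron (= a perceptron)

• The n-dimensional input vector x is mapped into variable y by means of the scalar product and a nonlinear function mapping

[Figure: a single neuron. The input vector x = (x_0, x_1, ..., x_n) and the weight vector w = (w_0, w_1, ..., w_n) are combined into a weighted sum, the bias θ_j is subtracted, and the activation function f produces the output y]

For example:

$$ y = \operatorname{sign}\left( \sum_{i=0}^{n} w_i x_i - \theta_j \right) $$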
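To make the mapping concrete, here is a minimal sketch of a single unit with a sign activation. It assumes the bias acts as a threshold `theta` subtracted from the weighted sum, and it maps a sum of exactly zero to +1. The function name, weights, and threshold are hypothetical values chosen for illustration.

```python
def perceptron_output(x, w, theta):
    """Return sign(sum_i w_i * x_i - theta) as +1 or -1."""
    weighted_sum = sum(w_i * x_i for w_i, x_i in zip(w, x))
    # Assumption: a result of exactly zero is treated as +1.
    return 1 if weighted_sum - theta >= 0 else -1


# Hypothetical 2-input unit that fires only when both inputs are 1.
weights = [1.0, 1.0]
threshold = 1.5
outputs = [perceptron_output([a, b], weights, threshold)
           for a in (0, 1) for b in (0, 1)]
print(outputs)  # [-1, -1, -1, 1]
```

With these particular values the unit separates the input (1, 1) from the other three binary inputs, which is the kind of linear decision a single perceptron can represent.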



(17)

A Multi-Layer Feed-Forward Neural Network

• Given a unit j in a hidden or output layer, the net input, I_j, to unit j is

$$ I_j = \sum_i w_{ij} O_i + \theta_j $$

• Given the net input I_j to unit j, then O_j, the output of unit j, is computed as

$$ O_j = \frac{1}{1 + e^{-I_j}} $$
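The two computations for a single unit can be sketched as follows. This is an illustrative sketch only: the function names and the example weights, inputs, and bias are hypothetical.

```python
import math


def net_input(weights, inputs, bias):
    """I_j = sum_i w_ij * O_i + theta_j for one unit j."""
    return sum(w * o for w, o in zip(weights, inputs)) + bias


def sigmoid(net):
    """O_j = 1 / (1 + e^(-I_j)), the logistic activation."""
    return 1.0 / (1.0 + math.exp(-net))


# Hypothetical unit j with two incoming connections and zero bias.
I_j = net_input([0.5, -0.5], [1.0, 1.0], 0.0)
O_j = sigmoid(I_j)
print(I_j, O_j)  # 0.0 0.5
```

The logistic function squashes any net input into the open interval (0, 1), so a net input of zero gives an output of exactly 0.5.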





(18)

• Backpropagate the error: The error is propagated backward by updating the weights and biases to reflect the error of the network's prediction. For a unit j in the output layer, the error Err_j is computed by

$$ Err_j = O_j (1 - O_j)(T_j - O_j) $$

where O_j is the actual output of unit j, and T_j is the known target value of the given training tuple.

• The error of a hidden layer unit j is

$$ Err_j = O_j (1 - O_j) \sum_k Err_k w_{jk} $$

where w_jk is the weight of the connection from unit j to a unit k in the next higher layer, and Err_k is the error of unit k.

• Weights are updated by the following equations, where Δw_ij is the change in weight w_ij

$$ \Delta w_{ij} = (l) Err_j O_i $$

$$ w_{ij} = w_{ij} + \Delta w_{ij} $$

• Biases are updated by the following equations, where Δθ_j is the change in bias θ_j

$$ \Delta \theta_j = (l) Err_j $$

$$ \theta_j = \theta_j + \Delta \theta_j $$
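The error terms and update rules can be written as small helper functions. This is a sketch under stated assumptions: the function names are hypothetical, `l` stands for the learning rate, and the numbers at the bottom are made up purely to exercise the formulas.

```python
def output_error(o_j, t_j):
    """Err_j = O_j (1 - O_j) (T_j - O_j) for an output unit."""
    return o_j * (1.0 - o_j) * (t_j - o_j)


def hidden_error(o_j, downstream):
    """Err_j = O_j (1 - O_j) * sum_k Err_k w_jk for a hidden unit.

    `downstream` is a list of (Err_k, w_jk) pairs, one per unit k
    in the next higher layer.
    """
    return o_j * (1.0 - o_j) * sum(err_k * w_jk for err_k, w_jk in downstream)


def updated_weight(w_ij, l, err_j, o_i):
    """w_ij + delta_w_ij, where delta_w_ij = l * Err_j * O_i."""
    return w_ij + l * err_j * o_i


def updated_bias(theta_j, l, err_j):
    """theta_j + delta_theta_j, where delta_theta_j = l * Err_j."""
    return theta_j + l * err_j


# Hypothetical values: one output unit and one hidden unit feeding it.
err_out = output_error(0.5, 1.0)                 # 0.5 * 0.5 * 0.5
err_hid = hidden_error(0.5, [(err_out, 0.4)])    # 0.25 * (err_out * 0.4)
new_w = updated_weight(0.2, 0.9, err_out, 1.0)
new_theta = updated_bias(0.1, 0.9, err_out)
print(err_out, err_hid, new_w, new_theta)
```

Note that the hidden-unit error depends on the errors of the layer above it, which is why the computation proceeds from the output layer backward.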

(19)


How a Multi-Layer Neural Network Works

• The inputs to the network correspond to the attributes measured for each training tuple

• Inputs are fed simultaneously into the units making up the input layer

• They are then weighted and fed simultaneously to a hidden layer

• The number of hidden layers is arbitrary, although usually only one is used

• The weighted outputs of the last hidden layer are input to units making up the output layer, which emits the network's prediction

• The network is feed-forward in that none of the weights cycles back to an input unit or to an output unit of a previous layer

• From a statistical point of view, networks perform nonlinear

(20)


Defining a Network Topology



First decide the network topology: the number of units in the input layer, the number of hidden layers (if > 1), the number of units in each hidden layer, and the number of units in the output layer

Normalize the input values for each attribute measured in the training tuples to the range [0.0, 1.0]

Input: for a discrete-valued attribute, use one input unit per domain value, each initialized to 0

Output: for classification with more than two classes, use one output unit per class



Once a network has been trained and its accuracy is
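The input preparation described on this slide can be sketched as below. Min-max scaling is an assumed choice of normalization (the slide states only the target range), and the function names and sample values are hypothetical.

```python
def min_max_normalize(values):
    """Scale a list of numbers linearly into the range [0.0, 1.0]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant attribute is mapped to 0.0 everywhere.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def one_hot(value, domain):
    """One input unit per domain value: all 0 except the unit for `value`."""
    units = [0] * len(domain)
    units[domain.index(value)] = 1
    return units


ages = [20, 30, 40]
print(min_max_normalize(ages))                      # [0.0, 0.5, 1.0]
print(one_hot("green", ["red", "green", "blue"]))   # [0, 1, 0]
```

The same one-unit-per-value encoding can be used for the output layer when there are more than two classes.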

(21)


Backpropagation

• Iteratively process a set of training tuples and compare the network's prediction with the actual known target value

• For each training tuple, the weights are modified to minimize the mean squared error between the network's prediction and the actual target value

• Modifications are made in the "backwards" direction: from the output layer, through each hidden layer down to the first hidden layer, hence "backpropagation"

• Steps

  • Initialize weights (to small random numbers) and biases in the network

  • Propagate the inputs forward (by applying activation function)

  • Backpropagate the error (by updating weights and biases)
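The three steps can be put together in a short training loop. This is a minimal sketch, not a definitive implementation: it assumes one hidden layer, a single sigmoid output unit, weight updates after every tuple, and initial values drawn from [-0.5, 0.5]. The dictionary layout, function names, training data (logical OR), and epoch count are all hypothetical.

```python
import math
import random


def sigmoid(net):
    return 1.0 / (1.0 + math.exp(-net))


def init_network(n_in, n_hidden, seed=0):
    """Step 1: small random weights and biases (range is an assumption)."""
    rng = random.Random(seed)

    def small():
        return rng.uniform(-0.5, 0.5)

    return {
        "w_h": [[small() for _ in range(n_in)] for _ in range(n_hidden)],
        "b_h": [small() for _ in range(n_hidden)],
        "w_o": [small() for _ in range(n_hidden)],
        "b_o": small(),
    }


def forward(net, x):
    """Step 2: propagate the inputs forward through the sigmoid units."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(ws, x)) + b)
              for ws, b in zip(net["w_h"], net["b_h"])]
    out = sigmoid(sum(w * h for w, h in zip(net["w_o"], hidden)) + net["b_o"])
    return hidden, out


def train_tuple(net, x, target, l):
    """Step 3: backpropagate the error, updating weights and biases in place."""
    hidden, out = forward(net, x)
    err_o = out * (1 - out) * (target - out)
    # Hidden errors use the output weights as they were before this update.
    err_h = [h * (1 - h) * err_o * w for h, w in zip(hidden, net["w_o"])]
    net["w_o"] = [w + l * err_o * h for w, h in zip(net["w_o"], hidden)]
    net["b_o"] += l * err_o
    for j, e in enumerate(err_h):
        net["w_h"][j] = [w + l * e * xi for w, xi in zip(net["w_h"][j], x)]
        net["b_h"][j] += l * e


def mean_squared_error(net, data):
    return sum((t - forward(net, x)[1]) ** 2 for x, t in data) / len(data)


# Hypothetical training set: logical OR of two binary attributes.
data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]
net = init_network(n_in=2, n_hidden=2)
error_before = mean_squared_error(net, data)
for epoch in range(2000):
    for x, t in data:
        train_tuple(net, x, t, l=0.9)
error_after = mean_squared_error(net, data)
print(error_before, error_after)
```

Updating after each tuple is one design choice; accumulating the changes and applying them once per epoch is an alternative that this sketch does not show.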

(22)

Example 9.1

• The class label of the training tuple is 1 and the learning rate l is 0.9

• Let the initial weight and bias values be:

[Table: initial weight and bias values, not recovered]

$$ I_j = \sum_i w_{ij} O_i + \theta_j $$

$$ O_j = \frac{1}{1 + e^{-I_j}} $$





(23)


Example 9.1 (Cont…)

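$$ Err_j = O_j (1 - O_j)(T_j - O_j) $$ (output layer unit)

$$ Err_j = O_j (1 - O_j) \sum_k Err_k w_{jk} $$ (hidden layer unit)

$$ \Delta w_{ij} = (l) Err_j O_i $$

$$ w_{ij} = w_{ij} + \Delta w_{ij} $$

$$ \Delta \theta_j = (l) Err_j $$

$$ \theta_j = \theta_j + \Delta \theta_j $$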
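One pass of these formulas over a single training tuple can be traced numerically as sketched below. Only the class label (1) and the learning rate (0.9) come from the example. The network shape (three inputs, hidden units 4 and 5, output unit 6) and every input, weight, and bias value are ASSUMED for illustration, because the example's own table of initial values is not available in this text.

```python
import math


def sigmoid(net):
    return 1.0 / (1.0 + math.exp(-net))


# ASSUMED illustrative values, not taken from the slide.
x = {1: 1.0, 2: 0.0, 3: 1.0}
w = {(1, 4): 0.2, (1, 5): -0.3, (2, 4): 0.4, (2, 5): 0.1,
     (3, 4): -0.5, (3, 5): 0.2, (4, 6): -0.3, (5, 6): -0.2}
theta = {4: -0.4, 5: 0.2, 6: 0.1}
target, l = 1.0, 0.9  # class label and learning rate stated in the example

# Forward pass: O_j = sigmoid(sum_i w_ij * O_i + theta_j).
O = dict(x)
for j in (4, 5):
    O[j] = sigmoid(sum(w[(i, j)] * O[i] for i in (1, 2, 3)) + theta[j])
O[6] = sigmoid(sum(w[(j, 6)] * O[j] for j in (4, 5)) + theta[6])

# Backward pass: output unit first, then the hidden units.
err = {6: O[6] * (1 - O[6]) * (target - O[6])}
for j in (4, 5):
    err[j] = O[j] * (1 - O[j]) * err[6] * w[(j, 6)]

# Updates: w_ij += l * Err_j * O_i and theta_j += l * Err_j.
new_w = {(i, j): w_ij + l * err[j] * O[i] for (i, j), w_ij in w.items()}
new_theta = {j: t + l * err[j] for j, t in theta.items()}

print(round(O[6], 3), round(new_w[(4, 6)], 3), round(new_theta[6], 3))
```

Because the target is 1 and the output starts below it, the output unit's error is positive, so its bias and the weights from active hidden units move upward.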

(24)

Terminating condition:

Training stops when

• All Δw_ij in the previous epoch are so small as to be below some specified threshold, or

• The percentage of tuples misclassified in the previous epoch is below some threshold, or

• A prespecified number of epochs has expired.

In practice, several hundreds of thousands of epochs may be required before the weights will converge.
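The three stopping tests can be combined in a single check, sketched below. The function name, its parameters, and all default thresholds are hypothetical; suitable values depend on the data and the network.

```python
def should_stop(weight_changes, misclassified, n_tuples, epoch,
                delta_threshold=1e-4, error_threshold=0.01,
                max_epochs=100_000):
    """Return True if any terminating condition holds for the last epoch.

    weight_changes: all delta_w_ij values from the previous epoch
    misclassified:  number of tuples misclassified in the previous epoch
    n_tuples:       size of the training set
    epoch:          number of epochs completed so far
    """
    all_changes_small = all(abs(d) < delta_threshold for d in weight_changes)
    few_errors = misclassified / n_tuples < error_threshold
    out_of_epochs = epoch >= max_epochs
    return all_changes_small or few_errors or out_of_epochs


print(should_stop([1e-6, -1e-5], 50, 100, 3))  # True: all changes are tiny
print(should_stop([0.1], 50, 100, 3))          # False: no condition holds
```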

(25)

• Efficiency of backpropagation: Each epoch (one iteration through the training set) takes O(|D| × w) time, with |D| tuples and w weights, but the number of epochs can be exponential in n, the number of inputs, in the worst case

• Rule extraction from networks: network pruning

  • Simplify the network structure by removing weighted links that have the least effect on the trained network

  • Then perform link, unit, or activation value clustering

  • The set of input and activation values is studied to derive rules describing the relationship between the input and hidden unit layers

• Sensitivity analysis: assess the impact that a given input variable has on a network output. The knowledge gained from this analysis can be represented in rules
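One simple way to carry out a sensitivity analysis is to shift a single input by a small amount and observe how the output changes. The sketch below assumes that perturbation approach; the `sensitivity` helper, the step size, and the stand-in `predict` function (a single sigmoid unit with fixed, made-up weights in place of a trained network) are all hypothetical.

```python
import math


def sensitivity(predict, x, index, delta=0.1):
    """Change in the output when input `index` is shifted by `delta`."""
    shifted = list(x)
    shifted[index] += delta
    return predict(shifted) - predict(x)


def predict(x):
    """Hypothetical stand-in for a trained network."""
    # The second input has weight 0.0, so it cannot affect the output.
    return 1.0 / (1.0 + math.exp(-(2.0 * x[0] + 0.0 * x[1] - 1.0)))


effects = [sensitivity(predict, [0.5, 0.5], i) for i in range(2)]
print(effects)
```

Here the first input raises the output when increased and the second has no effect, which could be summarized as a rule of the form "if the first input increases, the output increases".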
