Artificial Neural Networks

(1)

Artificial Neural Networks

Oliver Schulte

School of Computing Science Simon Fraser University Introduction to Deep Learning

(2)

Neural Networks

• Neural networks arise from attempts to model human/animal brains

• Many models, many claims of biological plausibility

• We will focus onmulti-layer perceptrons

(3)

Uses of Neural Networks

• Pros

• Good for continuous input variables.

• General continuous function approximators.

• Highly non-linear.

• Learn features.

• Good to use in continuous domains with little knowledge:

• When you don’t know good features.

• You don’t know the form of a good functional model.

• Cons

• Not interpretable, “black box”.

• Learning is slow.

(4)

Function Approximation Demos

• Home Value of Hockey State https://user-images. githubusercontent.com/22108101/

28182140-eb64b49a-67bf-11e7-97aa-046298f721e5. jpg

• Function Learning Examples (open in Safari) http://neuron.eng.wayne.edu/

(5)

Applications

There are many, many applications.

• World-Champion Backgammon Player.

http://en.wikipedia.org/wiki/TD-Gammon http://en.wikipedia.org/wiki/Backgammon

• No Hands Across America Tour.

http://www.cs.cmu.edu/afs/cs/usr/tjochem/ www/nhaa/nhaa_home_page.html

• Digit Recognition with 99.26% accuracy.

• Speech Recognition

http://research.microsoft.com/en-us/news/ features/speechrecognition-082911.aspx

• Translation http://translate.google.com

(6)

Outline

Feed-forward Networks

Network Training

Error Backpropagation

Theory: Backpropagation implements Gradient Descent

(7)

Outline

Network Training

(8)

Feed-forward Networks Network Training Error Backpropagation Theory: Backpropagation implements Gradient Descent Examples

Neurons

Model of an individual neuron j

• Pass input inj through a non-linearactivation functionto

get output aj= g(inj)

• For non-input nodes, the input is the weighted linear sum of connected node activations + bias w0,j:

inj = n X i=0 wijai Output

Σ

Input

Links FunctionInput ActivationFunction OutputLinks aj = g(inj) aj g inj wi,j w0,j ai

(9)

Feed-forward Networks Network Training Error Backpropagation Theory: Backpropagation implements Gradient Descent Examples

Neurons

Σ

Input

Links FunctionInput ActivationFunction OutputLinks aj = g(inj) aj g inj wi,j w0,j ai

(10)

Neurons

Σ

Input

Links FunctionInput ActivationFunction OutputLinks a0 = 1 _a_j_{= g(in}_j₎ aj g inj wi,j w0,j Bias Weight ai

(11)

Activation Functions

• Can use a variety of activation functions

• Sigmoidal (S-shaped)

• Logistic sigmoid 1/(1 + exp(−a)) (useful for binary classification)

• Hyperbolic tangent tanh

• Radial basis function aj=Pi(xi− wji)2

• Softmax

• Useful for multi-class classification

• Hard Threshold

• Rectified Linear Unit (deep learning)

• . . .

• Should be differentiable for gradient-based learning (later)

(12)

Activation Functions Visualized

0 0.5 1 -8 -6 -4 -2 0 2 4 6 8 0 0.5 1 -6 -4 -2 0 2 4 6 -2 ₀ 2 ₄ 6 -4 -2 0 2 4 6 8 10 0 0.2 0.4 0.6 0.8 1 x1 x2 Left Threshold

Middle Logistic sigmoid Logistic(x) = _1+exp(−x)1 maps a real number to a probability

(13)

Network of Neurons

w3,5 3,6 w 4,5 w 4,6 w 5 6 w1,3 1,4 w 2,3 w 2,4 w 1 2 3 4 w1,3 1,4 w 2,3 w 2,4 w 1 2 3 4 (b) (a)

(14)

Function Composition

Think logic circuits

-4 _-2 0 ₂ 4 x1 -4 -20 2 4 x₂ 0 0.2 0.4 0.6 0.8 1 hW(x1, x2) -4 _-2 0 ₂ 4 x1 -4 -20 2 4 x2 0 0.2 0.4 0.6 0.8 1 hW(x1, x2)

(15)

Hidden Units As Feature Extractors

sample training patterns

learned input-to-hidden weights

...

FIGURE 6.13. The top images represent patterns from a large training set used to train a 64-2-3 sigmoidal network for classifying three characters. The bottom figures show the input-to-hidden weights, represented as patterns, at the two hidden units after training. Note that these learned weights indeed describe feature groupings useful for the clas-sification task. In large networks, such patterns of learned weights may be difficult to interpret in this way. From: Richard O. Duda, Peter E. Hart, and David G. Stork,Pattern Classification. Copyright c! 2001 by John Wiley & Sons, Inc.

• 64 input nodes

• 2 hidden units

(16)

(17)

(18)

Expressive Power

• From the outside perspective, a neural net defines a function from input node values x to output node values.

• Let us write aw(x)for the output activation vector.

• How complex can aw be?

• With 1 hidden layer, can represent

• exactly Boolean functions

• approximately continuous functions. e.g. non-convex classification. http://playground.tensorflow.org

• With 2 hidden layers, can represent approximately an arbitrary function.

(19)

Outline

Network Training

(20)

Measuring Training Error

• Given a specified network structure, how do we set its parameters (weights)?

• As usual, we define a criterion to measure how well our network performs, optimize against it

• Training data are (xn, yn)

• Corresponds to neural net with multiple output nodes • Given a set of weight values w, the network defines an

output activation vector aw(x).

• Given training data x, y, define

• output node error Ek_{≡ `(a}

w,k(x), yk)measures the loss for a single output. Must be differentiable.

• e.g. `(aw,k(x), yk) = 1/2(yk− aw,k)2

• Theerror function is the sum of the individual output node

errors E(w, x) =P

(21)

Measuring Training Error

• Given a specified network structure, how do we set its parameters (weights)?

• As usual, we define a criterion to measure how well our network performs, optimize against it

• Training data are (xn, yn)

• Corresponds to neural net with multiple output nodes

• Given a set of weight values w, the network defines an output activation vector aw(x).

• Given training data x, y, define

• output node error Ek_{≡ `(a}

w,k(x), yk)measures the loss for a

single output. Must be differentiable.

• e.g. `(aw,k(x), yk) = 1/2(yk− aw,k)2

• Theerror function is the sum of the individual output node

errors E(w, x) =P

(22)

Parameter Optimization

w1 w2 E(w) wA wB _w_C ∇E

• For typical cases, the error function E(w) ≡P

xE(w, x)is

nasty

• Nasty = non-convex

(23)

Gradient Descent

• The function aw(x)implemented by a network is

complicated.

• No closed-form: Use gradient descent.

• It isn’t obvious how to compute error function derivatives with respect to hidden weights.

• The credit assignment problem.

(24)

Outline

Network Training

(25)

Notice of Notation

• In the following we consider a backpropagation training step for a fixed example (xn, yn)and fixed current weights

w.

• We adapt our notation by omitting these arguments unless needed.

• Examples:

• Eninstead of E(w, xn)

(26)

Error Backpropagation

• Backprop is an efficient method for computing error derivatives ∂En

∂wji for all nodes in the network. Intuition:

1. Calculating derivatives for weights connected to output nodes is easy.

2. Treat the derivatives as virtual “error”—how far is each node activation “off”. Compute derivative of error for nodes in previous layer.

3. Repeat until you reach input nodes.

• Propagates backwards the output error signal through the network.

(27)

Error at the output nodes

• First, feed training example xnforward through the network,

storing all node activations ai

• Calculating derivatives for weights connected to output nodes is easy.

• For output node k with activation ak = g(ink) = g(Piwikai)

and target value yk the error signal is

∆[k] ≡ −g0(ink)

∂`(ak, yk)

∂ak

. • E.g. for squared error loss ∆[k] = g0(ink)(yk− ak)

• Gradient Descent Weight Update:

(28)

Error at the output nodes

• First, feed training example xnforward through the network,

storing all node activations ai

• Calculating derivatives for weights connected to output nodes is easy.

• For output node k with activation ak = g(ink) = g(Piwikai)

and target value yk the error signal is

∆[k] ≡ −g0(ink)

∂`(ak, yk)

∂ak

.

• E.g. for squared error loss ∆[k] = g0(ink)(yk− ak)

• Gradient Descent Weight Update:

(29)

Error at the hidden nodes

• Consider a hidden node i connected to downstream nodes in the next layer.

• The error signal ∆[i] is node activation derivative, times the weighted sum of contributions to the connected error signals. • In symbols, ∆[i] = g0(ini) X j wij∆[j].

(30)

Backpropagation Picture

The error signal at a hidden unit is proportional to the error signals at the units it influences:

∆[i] = g0(ini) c

X

j=1

(31)

The Backpropagation Algorithm

1. Apply input vector xnand forward propagate to find all

inputs ini and outputs ai.

2. Evaluate the error signals ∆kfor all output nodes.

3. Backpropagate the ∆k to obtain error signals ∆jfor each

hidden node.

4. Perform the gradient descent updates for each weight vector wij:

wij← wij+ α × ai× ∆[j]

(32)

Outline

Network Training

(33)

Correctness Proof for Backpropagation Algorithm I.

func%onal diagram

in

_j g

a

_j

in

_K

aj x wjk

Exercise: From this functional diagram find expressions for the following quantities: • ∂ink ∂wjk • ∂ink ∂aj • ∂ink ∂inj • ∂inj ∂wij

(34)

Correctness Proof for Backpropagation Algorithm II.

ai inj En

aix wij

• We need to show that −∂En

wij = ∆[j] · ai.

• This follows easily given the following result Theorem

For each node j, we have ∆[j] = −∂En

inj .

• Proof given theorem: −∂En

wij = −

∂En

inj ·

∂inj

∂wij = ∆[j] · ai.

(35)

Multi-variate Chain Rule

f"

x" u"

y"

• For f (x, y), with f differentiable wrt x and y, and x and y differentiable wrt u and v: ∂f ∂u = ∂f ∂x ∂x ∂u+ ∂f ∂y ∂y ∂u and ∂f ∂v = ∂f ∂x ∂x ∂v+ ∂f ∂y ∂y ∂v

(36)

Proof of Theorem, I

• We want to show that ∆[j] = −∂En

inj .

• Think of the error as a function of the activation levels of the nodes after node j.

• Formally, we can write ∂En

∂inj =

∂

∂injEn(ink1, ink2, . . . , inkm)

where {ki} are the indices of the nodes that receive input

from j.

• Now using the multi-variate chain rule, we have ∂En ∂inj =X k ∂En ∂ink ∂ink ∂inj

• We saw before that ∂ink

∂inj = wjk× g

0_(in j).

(37)

Proof of Theorem, I

inj .

• Think of the error as a function of the activation levels of the nodes after node j.

• Formally, we can write ∂En

∂inj =

∂

∂injEn(ink1, ink2, . . . , inkm)

where {ki} are the indices of the nodes that receive input

from j.

• Now using the multi-variate chain rule, we have ∂En ∂inj =X k ∂En ∂ink ∂ink ∂inj

• We saw before that ∂ink

∂inj = wjk× g

0_(in j).

(38)

Proof of Theorem, II

inj .

• Proof by backward induction. Easy to see that the claim is true for output nodes. (Exercise).

• Inductive step: Consider node j and suppose that ∆[k] = −∂En

ink for all nodes k that receive input from j.

• Using the multivariate chain rule, we have

−∂En ∂inj = m X k=1 −∂En ∂ink ∂ink ∂inj = m X k=1 ∆[k]∂ink ∂inj = m X k=1 ∆[k]wjkg0(inj) = ∆[j].

where step 2 applies the inductive hypothesis, step 3 the result from the previous slide, and step 4 the definition of ∆[j].

(39)

Outline

Network Training

(41)

Applications of Neural Networks

• Many success stories for neural networks

• Credit card fraud detection

• Hand-written digit recognition

• Face detection

(42)

Hand-written Digit Recognition

• MNIST - standard dataset for hand-written digit recognition

(43)

LeNet-5

INPUT 32x32 Convolutions Subsampling Convolutions C1: feature maps 6@28x28 Subsampling S2: f. maps 6@14x14 S4: f. maps 16@5x5 C5: layer 120 C3: f. maps 16@10x10 F6: layer 84 Full connection Full connection Gaussian connections OUTPUT 10

• LeNet developed by Yann LeCun et al.

• Convolutional neural network

• Local receptive fields (5x5 connectivity)

• Subsampling (2x2)

• Shared weights (reuse same 5x5 “filter”)

• Breaking symmetry

• See

(44)

4!>6 3!>5 8!>2 2!>1 5!>3 4!>8 2!>8 3!>5 6!>5 7!>3 9!>4 8!>0 7!>8 5!>3 8!>7 0!>6 3!>7 2!>7 8!>3 9!>4 8!>2 5!>3 4!>8 3!>9 6!>0 9!>8 4!>9 6!>1 9!>4 9!>1 9!>4 2!>0 6!>1 3!>5 3!>2 9!>5 6!>0 6!>0 6!>0 6!>8 4!>6 7!>3 9!>4 4!>6 2!>7 9!>7 4!>3 9!>4 9!>4 9!>4 8!>7 4!>2 8!>4 3!>5 8!>4 6!>5 8!>5 3!>8 3!>8 9!>8 1!>5 9!>8 6!>3 0!>2 6!>5 9!>5 0!>7 1!>6 4!>9 2!>1 2!>8 8!>5 4!>9 7!>2 7!>2 6!>5 9!>7 6!>1 5!>6 5!>0 4!>9 2!>8

(45)

Conclusion

• Feed-forward networks can be used for predicting discrete or continuous target variables

• Very expressive, can approximate arbitrary continuous functions.

• Different activation functions possible.

• Learning is more difficult, error function has many local minima

• Use stochastic gradient descent, obtain (good?) local minimum

Artificial Neural Networks