Artificial Neural Networks
Oliver Schulte
School of Computing Science Simon Fraser University Introduction to Deep Learning
Neural Networks
• Neural networks arise from attempts to model human/animal brains
• Many models, many claims of biological plausibility
• We will focus onmulti-layer perceptrons
Uses of Neural Networks
• Pros
• Good for continuous input variables.
• General continuous function approximators.
• Highly non-linear.
• Learn features.
• Good to use in continuous domains with little knowledge:
• When you don’t know good features.
• You don’t know the form of a good functional model.
• Cons
• Not interpretable, “black box”.
• Learning is slow.
Function Approximation Demos
• Home Value of Hockey State https://user-images. githubusercontent.com/22108101/
28182140-eb64b49a-67bf-11e7-97aa-046298f721e5. jpg
• Function Learning Examples (open in Safari) http://neuron.eng.wayne.edu/
Applications
There are many, many applications.
• World-Champion Backgammon Player.
http://en.wikipedia.org/wiki/TD-Gammon http://en.wikipedia.org/wiki/Backgammon
• No Hands Across America Tour.
http://www.cs.cmu.edu/afs/cs/usr/tjochem/ www/nhaa/nhaa_home_page.html
• Digit Recognition with 99.26% accuracy.
• Speech Recognition
http://research.microsoft.com/en-us/news/ features/speechrecognition-082911.aspx
• Translation http://translate.google.com
Outline
Feed-forward Networks
Network Training
Error Backpropagation
Theory: Backpropagation implements Gradient Descent
Outline
Feed-forward Networks
Network Training
Error Backpropagation
Theory: Backpropagation implements Gradient Descent
Feed-forward Networks Network Training Error Backpropagation Theory: Backpropagation implements Gradient Descent Examples
Neurons
Model of an individual neuron j
• Pass input inj through a non-linearactivation functionto
get output aj= g(inj)
• For non-input nodes, the input is the weighted linear sum of connected node activations + bias w0,j:
inj = n X i=0 wijai Output
Σ
InputLinks FunctionInput ActivationFunction OutputLinks aj = g(inj) aj g inj wi,j w0,j ai
Feed-forward Networks Network Training Error Backpropagation Theory: Backpropagation implements Gradient Descent Examples
Neurons
Model of an individual neuron j
• Pass input inj through a non-linearactivation functionto
get output aj= g(inj)
• For non-input nodes, the input is the weighted linear sum of connected node activations + bias w0,j:
inj = n X i=0 wijai Output
Σ
InputLinks FunctionInput ActivationFunction OutputLinks aj = g(inj) aj g inj wi,j w0,j ai
Neurons
Model of an individual neuron j
• Pass input inj through a non-linearactivation functionto
get output aj= g(inj)
• For non-input nodes, the input is the weighted linear sum of connected node activations + bias w0,j:
inj = n X i=0 wijai Output
Σ
InputLinks FunctionInput ActivationFunction OutputLinks a0 = 1 aj = g(inj) aj g inj wi,j w0,j Bias Weight ai
Activation Functions
• Can use a variety of activation functions• Sigmoidal (S-shaped)
• Logistic sigmoid 1/(1 + exp(−a)) (useful for binary classification)
• Hyperbolic tangent tanh
• Radial basis function aj=Pi(xi− wji)2
• Softmax
• Useful for multi-class classification
• Hard Threshold
• Rectified Linear Unit (deep learning)
• . . .
• Should be differentiable for gradient-based learning (later)
Activation Functions Visualized
0 0.5 1 -8 -6 -4 -2 0 2 4 6 8 0 0.5 1 -6 -4 -2 0 2 4 6 -2 0 2 4 6 -4 -2 0 2 4 6 8 10 0 0.2 0.4 0.6 0.8 1 x1 x2 Left ThresholdMiddle Logistic sigmoid Logistic(x) = 1+exp(−x)1 maps a real number to a probability
Network of Neurons
w3,5 3,6 w 4,5 w 4,6 w 5 6 w1,3 1,4 w 2,3 w 2,4 w 1 2 3 4 w1,3 1,4 w 2,3 w 2,4 w 1 2 3 4 (b) (a)Function Composition
Think logic circuits
-4 -2 0 2 4 x1 -4 -20 2 4 x2 0 0.2 0.4 0.6 0.8 1 hW(x1, x2) -4 -2 0 2 4 x1 -4 -20 2 4 x2 0 0.2 0.4 0.6 0.8 1 hW(x1, x2)
Hidden Units As Feature Extractors
sample training patternslearned input-to-hidden weights
...
...
...
FIGURE 6.13. The top images represent patterns from a large training set used to train a 64-2-3 sigmoidal network for classifying three characters. The bottom figures show the input-to-hidden weights, represented as patterns, at the two hidden units after training. Note that these learned weights indeed describe feature groupings useful for the clas-sification task. In large networks, such patterns of learned weights may be difficult to interpret in this way. From: Richard O. Duda, Peter E. Hart, and David G. Stork,Pattern Classification. Copyright c! 2001 by John Wiley & Sons, Inc.
• 64 input nodes
• 2 hidden units
Expressive Power
• From the outside perspective, a neural net defines a function from input node values x to output node values.
• Let us write aw(x)for the output activation vector.
• How complex can aw be?
• With 1 hidden layer, can represent
• exactly Boolean functions
• approximately continuous functions. e.g. non-convex classification. http://playground.tensorflow.org
• With 2 hidden layers, can represent approximately an arbitrary function.
Outline
Feed-forward Networks
Network Training
Error Backpropagation
Theory: Backpropagation implements Gradient Descent
Measuring Training Error
• Given a specified network structure, how do we set its parameters (weights)?
• As usual, we define a criterion to measure how well our network performs, optimize against it
• Training data are (xn, yn)
• Corresponds to neural net with multiple output nodes • Given a set of weight values w, the network defines an
output activation vector aw(x).
• Given training data x, y, define
• output node error Ek≡ `(a
w,k(x), yk)measures the loss for a single output. Must be differentiable.
• e.g. `(aw,k(x), yk) = 1/2(yk− aw,k)2
• Theerror function is the sum of the individual output node
errors E(w, x) =P
Measuring Training Error
• Given a specified network structure, how do we set its parameters (weights)?
• As usual, we define a criterion to measure how well our network performs, optimize against it
• Training data are (xn, yn)
• Corresponds to neural net with multiple output nodes
• Given a set of weight values w, the network defines an output activation vector aw(x).
• Given training data x, y, define
• output node error Ek≡ `(a
w,k(x), yk)measures the loss for a
single output. Must be differentiable.
• e.g. `(aw,k(x), yk) = 1/2(yk− aw,k)2
• Theerror function is the sum of the individual output node
errors E(w, x) =P
Parameter Optimization
w1 w2 E(w) wA wB wC ∇E• For typical cases, the error function E(w) ≡P
xE(w, x)is
nasty
• Nasty = non-convex
Gradient Descent
• The function aw(x)implemented by a network is
complicated.
• No closed-form: Use gradient descent.
• It isn’t obvious how to compute error function derivatives with respect to hidden weights.
• The credit assignment problem.
Outline
Feed-forward Networks
Network Training
Error Backpropagation
Theory: Backpropagation implements Gradient Descent
Notice of Notation
• In the following we consider a backpropagation training step for a fixed example (xn, yn)and fixed current weights
w.
• We adapt our notation by omitting these arguments unless needed.
• Examples:
• Eninstead of E(w, xn)
Error Backpropagation
• Backprop is an efficient method for computing error derivatives ∂En
∂wji for all nodes in the network. Intuition:
1. Calculating derivatives for weights connected to output nodes is easy.
2. Treat the derivatives as virtual “error”—how far is each node activation “off”. Compute derivative of error for nodes in previous layer.
3. Repeat until you reach input nodes.
• Propagates backwards the output error signal through the network.
Error at the output nodes
• First, feed training example xnforward through the network,
storing all node activations ai
• Calculating derivatives for weights connected to output nodes is easy.
• For output node k with activation ak = g(ink) = g(Piwikai)
and target value yk the error signal is
∆[k] ≡ −g0(ink)
∂`(ak, yk)
∂ak
. • E.g. for squared error loss ∆[k] = g0(ink)(yk− ak)
• Gradient Descent Weight Update:
Error at the output nodes
• First, feed training example xnforward through the network,
storing all node activations ai
• Calculating derivatives for weights connected to output nodes is easy.
• For output node k with activation ak = g(ink) = g(Piwikai)
and target value yk the error signal is
∆[k] ≡ −g0(ink)
∂`(ak, yk)
∂ak
.
• E.g. for squared error loss ∆[k] = g0(ink)(yk− ak)
• Gradient Descent Weight Update:
Error at the hidden nodes
• Consider a hidden node i connected to downstream nodes in the next layer.
• The error signal ∆[i] is node activation derivative, times the weighted sum of contributions to the connected error signals. • In symbols, ∆[i] = g0(ini) X j wij∆[j].
Backpropagation Picture
The error signal at a hidden unit is proportional to the error signals at the units it influences:
∆[i] = g0(ini) c
X
j=1
The Backpropagation Algorithm
1. Apply input vector xnand forward propagate to find all
inputs ini and outputs ai.
2. Evaluate the error signals ∆kfor all output nodes.
3. Backpropagate the ∆k to obtain error signals ∆jfor each
hidden node.
4. Perform the gradient descent updates for each weight vector wij:
wij← wij+ α × ai× ∆[j]
Outline
Feed-forward Networks
Network Training
Error Backpropagation
Theory: Backpropagation implements Gradient Descent
Correctness Proof for Backpropagation Algorithm I.
func%onal diagram
in
j ga
jin
Kaj x wjk
Exercise: From this functional diagram find expressions for the following quantities: • ∂ink ∂wjk • ∂ink ∂aj • ∂ink ∂inj • ∂inj ∂wij
Correctness Proof for Backpropagation Algorithm II.
ai inj En
aix wij
• We need to show that −∂En
wij = ∆[j] · ai.
• This follows easily given the following result Theorem
For each node j, we have ∆[j] = −∂En
inj .
• Proof given theorem: −∂En
wij = −
∂En
inj ·
∂inj
∂wij = ∆[j] · ai.
Multi-variate Chain Rule
f"
x" u"
y"
• For f (x, y), with f differentiable wrt x and y, and x and y differentiable wrt u and v: ∂f ∂u = ∂f ∂x ∂x ∂u+ ∂f ∂y ∂y ∂u and ∂f ∂v = ∂f ∂x ∂x ∂v+ ∂f ∂y ∂y ∂v
Proof of Theorem, I
• We want to show that ∆[j] = −∂Eninj .
• Think of the error as a function of the activation levels of the nodes after node j.
• Formally, we can write ∂En
∂inj =
∂
∂injEn(ink1, ink2, . . . , inkm)
where {ki} are the indices of the nodes that receive input
from j.
• Now using the multi-variate chain rule, we have ∂En ∂inj =X k ∂En ∂ink ∂ink ∂inj
• We saw before that ∂ink
∂inj = wjk× g
0(in j).
Proof of Theorem, I
• We want to show that ∆[j] = −∂Eninj .
• Think of the error as a function of the activation levels of the nodes after node j.
• Formally, we can write ∂En
∂inj =
∂
∂injEn(ink1, ink2, . . . , inkm)
where {ki} are the indices of the nodes that receive input
from j.
• Now using the multi-variate chain rule, we have ∂En ∂inj =X k ∂En ∂ink ∂ink ∂inj
• We saw before that ∂ink
∂inj = wjk× g
0(in j).
Proof of Theorem, II
• We want to show that ∆[j] = −∂Eninj .
• Proof by backward induction. Easy to see that the claim is true for output nodes. (Exercise).
• Inductive step: Consider node j and suppose that ∆[k] = −∂En
ink for all nodes k that receive input from j.
• Using the multivariate chain rule, we have
−∂En ∂inj = m X k=1 −∂En ∂ink ∂ink ∂inj = m X k=1 ∆[k]∂ink ∂inj = m X k=1 ∆[k]wjkg0(inj) = ∆[j].
where step 2 applies the inductive hypothesis, step 3 the result from the previous slide, and step 4 the definition of ∆[j].
Other Learning Topics
• Regularization: L2-regularizer (weight decay).
• Experimenting with Network Architectures is often key.
• Learn Architecture
• Prune Weights: the Optimal Brain Damage Method.
Outline
Feed-forward Networks
Network Training
Error Backpropagation
Theory: Backpropagation implements Gradient Descent
Applications of Neural Networks
• Many success stories for neural networks
• Credit card fraud detection
• Hand-written digit recognition
• Face detection
Hand-written Digit Recognition
• MNIST - standard dataset for hand-written digit recognition
LeNet-5
INPUT 32x32 Convolutions Subsampling Convolutions C1: feature maps 6@28x28 Subsampling S2: f. maps 6@14x14 S4: f. maps 16@5x5 C5: layer 120 C3: f. maps 16@10x10 F6: layer 84 Full connection Full connection Gaussian connections OUTPUT 10• LeNet developed by Yann LeCun et al.
• Convolutional neural network
• Local receptive fields (5x5 connectivity)
• Subsampling (2x2)
• Shared weights (reuse same 5x5 “filter”)
• Breaking symmetry
• See
4!>6 3!>5 8!>2 2!>1 5!>3 4!>8 2!>8 3!>5 6!>5 7!>3 9!>4 8!>0 7!>8 5!>3 8!>7 0!>6 3!>7 2!>7 8!>3 9!>4 8!>2 5!>3 4!>8 3!>9 6!>0 9!>8 4!>9 6!>1 9!>4 9!>1 9!>4 2!>0 6!>1 3!>5 3!>2 9!>5 6!>0 6!>0 6!>0 6!>8 4!>6 7!>3 9!>4 4!>6 2!>7 9!>7 4!>3 9!>4 9!>4 9!>4 8!>7 4!>2 8!>4 3!>5 8!>4 6!>5 8!>5 3!>8 3!>8 9!>8 1!>5 9!>8 6!>3 0!>2 6!>5 9!>5 0!>7 1!>6 4!>9 2!>1 2!>8 8!>5 4!>9 7!>2 7!>2 6!>5 9!>7 6!>1 5!>6 5!>0 4!>9 2!>8
Conclusion
• Feed-forward networks can be used for predicting discrete or continuous target variables
• Very expressive, can approximate arbitrary continuous functions.
• Different activation functions possible.
• Learning is more difficult, error function has many local minima
• Use stochastic gradient descent, obtain (good?) local minimum