CS 1571 Intro to AI
CS 1571 Introduction to AI Lecture 27
Milos Hauskrecht [email protected] 5329 Sennott Square
Multi-layer neural networks
Announcements
Homeworks:
• Homework 10 due today
Final exam: December 08, 2003 at 10:00-11:50am
• Location: 5129 Sennott Square
• Closed book
• Cumulative
• Format similar to the midterm exam Office hours:
• Milos: Tue 2:30-4:00pm, Wed 11:00-12:00am
• Tomas: Wed 2:00-3:30pm, Fri 10:00-11:30am
CS 1571 Intro to AI
Linear units
Logistic regression Linear regression
∑ 1
)
| 1
(y x
p =
w0
w1
w2
wd
∑ z 1
x1
w0
w1
w2
wd
xd
x2
) (x f
∑
=+
= d
j
j jx w w
f
1
) 0
(x ( ) ( 1| , ) ( )
1
0
∑
=
+
=
=
= d
j j jx w w
g y
p
f x x w
j j
j w y f x
w ← +α( − (x)) wj←wj+α(y−f(x))xj ))
(
0 (
0 w y f x
w ← +α − On-line gradient update:
)) (
0 (
0 w y f x
w ← +α − On-line gradient update:
The same
= ) (x f x1
xd
x2
Limitations of basic linear units
Logistic regression Linear regression
∑ 1
)
| 1
(y x
p =
w0
w1
w2
wd
∑ z 1
w0
w1
w2
wd
) (x f
Function linear in inputs !! Linear decision boundary!!
∑
=+
= d
j
j jx w w
f
1
) 0
(x ( ) ( 1| , ) ( )
1
0
∑
=
+
=
=
= d
j j jx w w
g y
p
f x x w
x1
xd
x2
x1
xd
x2
CS 1571 Intro to AI
Extensions of simple linear units
) ( )
(
1
0 x
x m j
j
wj
w
f
∑
φ=
+
=
) ∑
1(x φ
)
2(x φ
)
m( x φ
1
x1
w0
w1
w2
wm
xd
)
j(x
φ - an arbitrary function of x
• use feature (basis) functions to model nonlinearities
)) ( (
) (
1
0 x
x m j
j
wj
w g
f
∑
φ=
+
=
Linear regression Logistic regression
Regression with the quadratic model.
Limitation:linear hyper-plane only a non-linear surface can be better
CS 1571 Intro to AI
Classification with the linear model.
Logistic regression model defines a linear decision boundary
• Example: 2 classes (blue and red points)
-2 -1.5 -1 -0.5 0 0.5 1 1.5 2
-2 -1.5 -1 -0.5 0 0.5 1 1.5 2
Decision boundary
Linear decision boundary
• logistic regression model is not optimal, but not that bad
-4 -3 -2 -1 0 1 2 3 4 5 6
-4 -3 -2 -1 0 1 2 3 4 5
CS 1571 Intro to AI
When logistic regression fails?
• Example in which the logistic regression model fails
-4 -3 -2 -1 0 1 2 3 4 5
-4 -3 -2 -1 0 1 2 3 4 5
Limitations of linear units.
-2 -1.5 -1 -0.5 0 0.5 1 1.5 2
-2 -1.5 -1 -0.5 0 0.5 1 1.5 2
• Logistic regression does not work for parity functions -no linear decision boundary exists
Solution:a model of a non-linear decision boundary
CS 1571 Intro to AI
Example. Regression with polynomials.
Regression with polynomials of degree m
• Data points:pairs of
• Feature functions:m feature functions
• Function to learn:
m i i
i i
m i
i x w w x
w w
x
f
∑ ∑
=
=
+
= +
=
1 0 1
0 ( )
) ,
( w φ
i i(x) = x φ
>
< x,y
m i =1,2,K,
∑
=x )
1(x φ
2 2(x)=x φ
m m(x)= x φ
1
x
w0
w1
w2
wm
Learning with extended linear units
) ( )
(
1
0 x
x m j
j
wj
w
f
∑
φ=
+
=
Feature (basis) functions model nonlinearities
Important property:
• The problem of learning the weights is the same as it was for the linear units
• Trick: we have changed the inputs – but the weights are still linear in the new input
) ∑
1(x φ
)
2(x φ
)
m( x φ
1
x1
w0
w1
w2
wm
xd
)) ( (
) (
1
0 x
x m j
j
wj
w g
f
∑
φ=
+
=
Linear regression Logistic regression
CS 1571 Intro to AI
Function to learn:
On line gradient updatefor the <x,y> pair
Gradient updates are of the same form as in the linear and logistic regression models
) ( )) , (
( x w j x
j
j w y f
w = +α − φ
)) , (
0 (
0 w y f x w
w = +α −
Learning with feature functions.
) ( )
, (
1
0 w x
w x
f k i
i iφ
∑
=+
= w
Example:Regression with polynomials of degree m
• On line updatefor <x,y> pair
j j
j w y f x
w = +α( − (x,w)) )) , (
0 (
0 w y f x w
w = +α −
Example. Regression with polynomials.
m i i
i i
m i
i x w w x
w w
x
f
∑ ∑
=
=
+
= +
=
1 0 1
0 ( )
) ,
( w φ
CS 1571 Intro to AI
Multi-layered neural networks
• Alternative way to introduce nonlinearities to regression/classification models
• Idea:Cascade several simple neural models with logistic units. Much like neuron connections.
Multilayer neural network
Hidden layer Output layer Input layer
Cascades multiple logistic regression units Also called a multilayer perceptron (MLP)
∑
1
x1
)
| 1
(y x
p = )
1
1(
,
w0
) 1
1(
,
wk
) 1
2(
,
wk
xd
x2 z1(2)
) 1
2(
,
w0
∑ ) 1
1( z
) 1
2( z
1
) 2
1(
,
w0
) 2
1(
,
w1
) 2
1(
,
w2
Example: (2 layer) classifier with non-linear decision boundaries
CS 1571 Intro to AI
Multilayer neural network
• Models non-linearities through logistic regression units
• Can be applied to both regression and binary classification problems
∑
1
) ,
| 1 ( )
(x = p y= xw f
) 1
1(
,
w0
) 1
1(
,
wk
) 1
2(
,
wk
) 2
1( z )
1
2(
,
w0
∑ ) 1
1( z
) 1
2( z
) 2
1(
,
w0
) 2
1(
,
w1
) 2
1(
,
w2
Hidden layer Output layer
Input layer
1
) , ( )
(x f x w
f =
regression
classification
option x1
xd
x2
Multilayer neural network
• Non-linearities are modeled using multiple hidden logistic regression units (organized in layers)
• The output layer determines whether it is a regression or a binary classification problem
) ,
| 1 ( )
(x = p y= xw f
Hidden layers Output layer
Input layer
) , ( )
(x f x w
f =
regression
classification
option x1
xd
x2
CS 1571 Intro to AI
Learning with MLP
• How to learn the parameters of the neural network?
• Online gradient descent algorithm – Weight updates based on
• We need to compute gradients for weights in all units
• Can be computed in one backward sweep through the net !!!
• The process is called back-propagation ) ,
online ( i w
j j
j J D
w w
w ∂
− ∂
← α
) ,
online (Di w J
Backpropagation
∑ xi(k) xl(k+1)
) 1
, (k−
wij zi(k) wl,i(k+1) ∑ zl(k+1) )
1 (k− xj
k-th level (k+1)-th level (k-1)-th level
) (k
xi - output of the unit i on level k
) (k
zi - input to the sigmoid function on level k
∑
−+
=
j
j j i i
i k w k w k x k
z( ) ,0( ) , ( ) ( 1) ))
( ( )
(k g z k xi = i
)
, (k
wij - weight between units j and i on levels (k-1) and k
CS 1571 Intro to AI
Backpropagation
)
i(k δ
)) ( ( )
(K y f x,w
i =− −
δ
)
, (k wij
Update weight using a data point
) , ) (
) ( ( )
(
, ,
, online u w
j i j
i j
i J D
k k w
w k
w ∂
− ∂
← α
) , ) (
) (
( online u w
i
i J D
k k z
∂
= ∂
Let δ
Then: ( ) ( , ) (( ), ) (()) ( ) ( 1)
, ,
−
∂ =
∂
∂
=∂
∂
∂ k x k
k w
k z k
z D D J
k J
w ij i j
i i
u online u
online j
i
w δ w
S.t. is computed from and the next layer δl(k+1) ))
( 1 )(
( ) 1 ( ) 1 ( )
(k k wl,i k xi k xi k
l l
i −
+ +
=
∑
δδ
Last unit(is the same as for the regular linear units):
It is the same for the classification with the log-likelihood measure of fit and linear regression with least-squares error!!!
>
=< y Du x,
) (k xi
Learning with MLP
• Online gradient descent algorithm – Weight update:
) , ) (
) ( ( )
( online
, ,
, u w
j i j
i j
i J D
k k w
w k
w ∂
− ∂
← α
) 1 ( ) ) (
( ) ( )
( ) , ) (
, ) (
( ,
,
−
∂ =
∂
∂
= ∂
∂
∂ k x k
k w
k z k
z D D J
k J
w i j i j
i i
u online u
online j
i
w δ w
) 1 ( ) ( )
( )
( ,
, k ← w k − k x k −
wi j i j αδi j
) 1 (k − xj
)
i(k δ
- j-th output of the (k-1) layer
- derivative computed via backpropagation α - a learning rate
CS 1571 Intro to AI
Online gradient descent algorithm for MLP
Online-gradient-descent (D, number of iterations) Initialize all weights
for i=1:1: number of iterations
do select a data point Du=<x,y> from D set learning rate
compute outputs for each unit
compute derivatives via backpropagation update all weights (in parallel)
end for
return weights w
)
, (k wi j
α
) 1 ( ) ( )
( )
( ,
, k ← w k − k x k −
wi j i j αδi j
) (k xj
)
i(k δ
Xor Example.
-2 -1.5 -1 -0.5 0 0.5 1 1.5 2
-2 -1.5 -1 -0.5 0 0.5 1 1.5 2
• linear decision boundary does not exist
CS 1571 Intro to AI
Xor example. Linear unit
Xor example.
Neural network with 2 hidden units
CS 1571 Intro to AI
Xor example.
Neural network with 10 hidden units
MLP in practice
• Optical character recognition– digits 20x20 – Automatic sorting of mails
– 5 layer network with multiple output functions
10 outputs (0,1,…9)
…
20x20 = 400 inputs
5 10 3000
4 300 1200
3 1200 50000
2 784 3136
1 3136 78400 layer Neurons Weights