Multi-layer neural networks

(1)

CS 1571 Intro to AI

CS 1571 Introduction to AI Lecture 27

Milos Hauskrecht [email protected] 5329 Sennott Square

Multi-layer neural networks

Announcements

Homeworks:

• Homework 10 due today

Final exam: December 08, 2003 at 10:00-11:50am

• Location: 5129 Sennott Square

• Closed book

• Cumulative

• Format similar to the midterm exam Office hours:

• Milos: Tue 2:30-4:00pm, Wed 11:00-12:00am

• Tomas: Wed 2:00-3:30pm, Fri 10:00-11:30am

(2)

CS 1571 Intro to AI

Linear units

Logistic regression Linear regression

∑ 1

)

| 1

(y x

p =

w0

w1

w2

wd

∑ z 1

x1

w0

w1

w2

wd

xd

x2

) (x f

∑

=

+

= ^d

j

j jx w w

f

1

) 0

(x ( ) ( 1| , ) ( )

1

0

∑

=

+

=

= ^d

j j jx w w

g y

p

f x x w

j j

j w y f x

w ← +α( − (x)) w_j←w_j+α(y−f(x))x_j ))

(

0 (

0 w y f x

w ← +α − On-line gradient update:

)) (

0 (

0 w y f x

w ← +α − On-line gradient update:

The same

= ) (x f x1

xd

x2

Limitations of basic linear units

Logistic regression Linear regression

∑ 1

)

| 1

(y x

p =

w0

w1

w2

wd

∑ z 1

w0

w1

w2

wd

) (x f

Function linear in inputs !! Linear decision boundary!!

∑

=

+

= ^d

j

j jx w w

f

1

) 0

(x ( ) ( 1| , ) ( )

1

0

∑

=

+

=

= ^d

j j jx w w

g y

p

f x x w

x1

xd

x2

x1

xd

x2

(3)

CS 1571 Intro to AI

Extensions of simple linear units

) ( )

(

1

0 x

x ^m _j

j

wj

w

f

∑

φ

=

+

=

) ∑

1(x φ

)

2(x φ

)

m( x φ

1

x1

w0

w1

w2

wm

xd

)

j(x

φ - an arbitrary function of x

• use feature (basis) functions to model nonlinearities

)) ( (

) (

1

0 x

x ^m _j

j

wj

w g

f

∑

φ

=

+

=

Linear regression Logistic regression

Regression with the quadratic model.

Limitation:linear hyper-plane only a non-linear surface can be better

(4)

CS 1571 Intro to AI

Classification with the linear model.

Logistic regression model defines a linear decision boundary

• Example: 2 classes (blue and red points)

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

Decision boundary

Linear decision boundary

• logistic regression model is not optimal, but not that bad

-4 -3 -2 -1 0 1 2 3 4 5 6

-4 -3 -2 -1 0 1 2 3 4 5

(5)

CS 1571 Intro to AI

When logistic regression fails?

• Example in which the logistic regression model fails

-4 -3 -2 -1 0 1 2 3 4 5

Limitations of linear units.

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

• Logistic regression does not work for parity functions -no linear decision boundary exists

Solution:a model of a non-linear decision boundary

(6)

CS 1571 Intro to AI

Example. Regression with polynomials.

Regression with polynomials of degree m

• Data points:pairs of

• Feature functions:m feature functions

• Function to learn:

m i i

i i

m i

i x w w x

w w

x

f

∑ ∑

=

+

= +

=

1 0 1

0 ( )

) ,

( w φ

i i(x) = x φ

>

< x,y

m i =1,2,K,

∑

=x )

1(x φ

2 2(x)=x φ

m m(x)= x φ

1

x

w0

w1

w2

wm

Learning with extended linear units

) ( )

(

1

0 x

x ^m _j

j

wj

w

f

∑

φ

=

+

=

Feature (basis) functions model nonlinearities

Important property:

• The problem of learning the weights is the same as it was for the linear units

• Trick: we have changed the inputs – but the weights are still linear in the new input

) ∑

1(x φ

)

2(x φ

)

m( x φ

1

x1

w0

w1

w2

wm

xd

)) ( (

) (

1

0 x

x ^m _j

j

wj

w g

f

∑

φ

=

+

=

Linear regression Logistic regression

(7)

CS 1571 Intro to AI

Function to learn:

On line gradient updatefor the <x,y> pair

Gradient updates are of the same form as in the linear and logistic regression models

) ( )) , (

( x w _j x

j

j w y f

w = +α − φ

)) , (

0 (

0 w y f x w

w = +α −

Learning with feature functions.

) ( )

, (

1

0 w x

w x

f ^k _i

i iφ

∑

=

+

= w

Example:Regression with polynomials of degree m

• On line updatefor <x,y> pair

j j

j w y f x

w = +α( − (x,w)) )) , (

0 (

0 w y f x w

w = +α −

Example. Regression with polynomials.

m i i

i i

m i

i x w w x

w w

x

f

∑ ∑

=

+

= +

=

1 0 1

0 ( )

) ,

( w φ

(8)

CS 1571 Intro to AI

Multi-layered neural networks

• Alternative way to introduce nonlinearities to regression/classification models

• Idea:Cascade several simple neural models with logistic units. Much like neuron connections.

Multilayer neural network

Hidden layer Output layer Input layer

Cascades multiple logistic regression units Also called a multilayer perceptron (MLP)

∑

1

x1

)

| 1

(y x

p = )

1

1(

,

w0

) 1

1(

,

wk

) 1

2(

,

wk

xd

x2 z₁(2)

) 1

2(

,

w0

∑ ) 1

1( z

) 1

2( z

1

) 2

1(

,

w0

) 2

1(

,

w1

) 2

1(

,

w2

Example: (2 layer) classifier with non-linear decision boundaries

(9)

CS 1571 Intro to AI

Multilayer neural network

• Models non-linearities through logistic regression units

• Can be applied to both regression and binary classification problems

∑

1

) ,

| 1 ( )

(x = p y= xw f

) 1

1(

,

w0

) 1

1(

,

wk

) 1

2(

,

wk

) 2

1( z )

1

2(

,

w0

∑ ) 1

1( z

) 1

2( z

) 2

1(

,

w0

) 2

1(

,

w1

) 2

1(

,

w2

Hidden layer Output layer

Input layer

1

) , ( )

(x f x w

f =

regression

classification

option x1

xd

x2

Multilayer neural network

• Non-linearities are modeled using multiple hidden logistic regression units (organized in layers)

• The output layer determines whether it is a regression or a binary classification problem

) ,

| 1 ( )

(x = p y= xw f

Hidden layers Output layer

Input layer

) , ( )

(x f x w

f =

regression

classification

option x1

xd

x2

(10)

CS 1571 Intro to AI

Learning with MLP

• How to learn the parameters of the neural network?

• Online gradient descent algorithm – Weight updates based on

• We need to compute gradients for weights in all units

• Can be computed in one backward sweep through the net !!!

• The process is called back-propagation ) ,

online ( _i w

j j

j J D

w w

w ∂

− ∂

← α

) ,

online (D_i w J

Backpropagation

∑ ^xⁱ^(k⁾ ^x^l⁽^k⁺¹⁾

) 1

, (k−

w_i_j z_i(k) _w_l_,_i₍_k₊₁₎ ∑ z_l(k+1) )

1 (k− x_j

k-th level (k+1)-th level (k-1)-th level

) (k

x_i - output of the unit i on level k

) (k

z_i - input to the sigmoid function on level k

∑

⁻

+

=

j

j j i i

i k w k w k x k

z( ) _,₀( ) _, ( ) ( 1) ))

( ( )

(k g z k x_i = _i

)

, (k

w_i_j - weight between units j and i on levels (k-1) and k

(11)

CS 1571 Intro to AI

Backpropagation

)

i(k δ

)) ( ( )

(K y f x,w

i =− −

δ

)

, (k w_i_j

Update weight using a data point

) , ) (

) ( ( )

(

, ,

, _online _u w

j i j

i j

i J D

k k w

w k

w ∂

− ∂

← α

) , ) (

) (

( _online _u w

i

i J D

k k z

∂

= ∂

Let δ

Then: ₍ ₎ ⁽ ^, ⁾ ⁽₍ ₎^, ⁾ ⁽₍⁾₎ ⁽ ⁾ ⁽ ¹⁾

, ,

−

∂ =

∂

=∂

∂

∂ k x k

k w

k z k

z D D J

k J

w _i_j ⁱ ^j

i i

u online u

online j

i

w δ w

S.t. is computed from and the next layer δ_l(k+1) ))

( 1 )(

( ) 1 ( ) 1 ( )

(k k w_l_,_i k x_i k x_i k

l l

i  −

 



 + +

=

∑

^δ

δ

Last unit(is the same as for the regular linear units):

It is the same for the classification with the log-likelihood measure of fit and linear regression with least-squares error!!!

>

=< y D_u x,

) (k x_i

Learning with MLP

• Online gradient descent algorithm – Weight update:

) , ) (

) ( ( )

( _online

, ,

, _u w

j i j

i j

i J D

k k w

w k

w ∂

− ∂

← α

) 1 ( ) ) (

( ) ( )

( ) , ) (

, ) (

( _,

,

−

∂ =

∂

= ∂

∂

∂ k x k

k w

k z k

z D D J

k J

w _i _j ⁱ ^j

i i

u online u

online j

i

w δ w

) 1 ( ) ( )

( )

( _,

, k ← w k − k x k −

w_i _j _i _j αδ_i _j

) 1 (k − x_j

)

i(k δ

- j-th output of the (k-1) layer

- derivative computed via backpropagation α - a learning rate

(12)

CS 1571 Intro to AI

Online gradient descent algorithm for MLP

Online-gradient-descent (D, number of iterations) Initialize all weights

for i=1:1: number of iterations

do select a data point D_u=<x,y> from D set learning rate

compute outputs for each unit

compute derivatives via backpropagation update all weights (in parallel)

end for

return weights w

)

, (k w_i _j

α

) 1 ( ) ( )

( )

( _,

, k ← w k − k x k −

w_i _j _i _j αδ_i _j

) (k x_j

)

i(k δ

Xor Example.

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

• linear decision boundary does not exist

(13)

CS 1571 Intro to AI

Xor example. Linear unit

Xor example.

Neural network with 2 hidden units

(14)

CS 1571 Intro to AI

Xor example.

Neural network with 10 hidden units

MLP in practice

• Optical character recognition– digits 20x20 – Automatic sorting of mails

– 5 layer network with multiple output functions

10 outputs (0,1,…9)

…

20x20 = 400 inputs

5 10 3000

4 300 1200

3 1200 50000

2 784 3136

1 3136 78400 layer Neurons Weights