• No results found

Multi-layer neural networks

N/A
N/A
Protected

Academic year: 2022

Share "Multi-layer neural networks"

Copied!
14
0
0

Loading.... (view fulltext now)

Full text

(1)

CS 1571 Intro to AI

CS 1571 Introduction to AI Lecture 27

Milos Hauskrecht [email protected] 5329 Sennott Square

Multi-layer neural networks

Announcements

Homeworks:

• Homework 10 due today

Final exam: December 08, 2003 at 10:00-11:50am

• Location: 5129 Sennott Square

• Closed book

• Cumulative

• Format similar to the midterm exam Office hours:

• Milos: Tue 2:30-4:00pm, Wed 11:00-12:00am

• Tomas: Wed 2:00-3:30pm, Fri 10:00-11:30am

(2)

CS 1571 Intro to AI

Linear units

Logistic regression Linear regression

1

)

| 1

(y x

p =

w0

w1

w2

wd

z 1

x1

w0

w1

w2

wd

xd

x2

) (x f

=

+

= d

j

j jx w w

f

1

) 0

(x ( ) ( 1| , ) ( )

1

0

=

+

=

=

= d

j j jx w w

g y

p

f x x w

j j

j w y f x

w ← +α( − (x)) wjwj+α(yf(x))xj ))

(

0 (

0 w y f x

w ← +α − On-line gradient update:

)) (

0 (

0 w y f x

w ← +α − On-line gradient update:

The same

= ) (x f x1

xd

x2

Limitations of basic linear units

Logistic regression Linear regression

1

)

| 1

(y x

p =

w0

w1

w2

wd

z 1

w0

w1

w2

wd

) (x f

Function linear in inputs !! Linear decision boundary!!

=

+

= d

j

j jx w w

f

1

) 0

(x ( ) ( 1| , ) ( )

1

0

=

+

=

=

= d

j j jx w w

g y

p

f x x w

x1

xd

x2

x1

xd

x2

(3)

CS 1571 Intro to AI

Extensions of simple linear units

) ( )

(

1

0 x

x m j

j

wj

w

f

φ

=

+

=

)

1(x φ

)

2(x φ

)

m( x φ

1

x1

w0

w1

w2

wm

xd

)

j(x

φ - an arbitrary function of x

• use feature (basis) functions to model nonlinearities

)) ( (

) (

1

0 x

x m j

j

wj

w g

f

φ

=

+

=

Linear regression Logistic regression

Regression with the quadratic model.

Limitation:linear hyper-plane only a non-linear surface can be better

(4)

CS 1571 Intro to AI

Classification with the linear model.

Logistic regression model defines a linear decision boundary

• Example: 2 classes (blue and red points)

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

Decision boundary

Linear decision boundary

• logistic regression model is not optimal, but not that bad

-4 -3 -2 -1 0 1 2 3 4 5 6

-4 -3 -2 -1 0 1 2 3 4 5

(5)

CS 1571 Intro to AI

When logistic regression fails?

• Example in which the logistic regression model fails

-4 -3 -2 -1 0 1 2 3 4 5

-4 -3 -2 -1 0 1 2 3 4 5

Limitations of linear units.

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

• Logistic regression does not work for parity functions -no linear decision boundary exists

Solution:a model of a non-linear decision boundary

(6)

CS 1571 Intro to AI

Example. Regression with polynomials.

Regression with polynomials of degree m

• Data points:pairs of

• Feature functions:m feature functions

• Function to learn:

m i i

i i

m i

i x w w x

w w

x

f

∑ ∑

=

=

+

= +

=

1 0 1

0 ( )

) ,

( w φ

i i(x) = x φ

>

< x,y

m i =1,2,K,

=x )

1(x φ

2 2(x)=x φ

m m(x)= x φ

1

x

w0

w1

w2

wm

Learning with extended linear units

) ( )

(

1

0 x

x m j

j

wj

w

f

φ

=

+

=

Feature (basis) functions model nonlinearities

Important property:

• The problem of learning the weights is the same as it was for the linear units

• Trick: we have changed the inputs – but the weights are still linear in the new input

)

1(x φ

)

2(x φ

)

m( x φ

1

x1

w0

w1

w2

wm

xd

)) ( (

) (

1

0 x

x m j

j

wj

w g

f

φ

=

+

=

Linear regression Logistic regression

(7)

CS 1571 Intro to AI

Function to learn:

On line gradient updatefor the <x,y> pair

Gradient updates are of the same form as in the linear and logistic regression models

) ( )) , (

( x w j x

j

j w y f

w = +α − φ

)) , (

0 (

0 w y f x w

w = +α −

Learning with feature functions.

) ( )

, (

1

0 w x

w x

f k i

i iφ

=

+

= w

Example:Regression with polynomials of degree m

• On line updatefor <x,y> pair

j j

j w y f x

w = +α( − (x,w)) )) , (

0 (

0 w y f x w

w = +α −

Example. Regression with polynomials.

m i i

i i

m i

i x w w x

w w

x

f

∑ ∑

=

=

+

= +

=

1 0 1

0 ( )

) ,

( w φ

(8)

CS 1571 Intro to AI

Multi-layered neural networks

• Alternative way to introduce nonlinearities to regression/classification models

• Idea:Cascade several simple neural models with logistic units. Much like neuron connections.

Multilayer neural network

Hidden layer Output layer Input layer

Cascades multiple logistic regression units Also called a multilayer perceptron (MLP)

1

x1

)

| 1

(y x

p = )

1

1(

,

w0

) 1

1(

,

wk

) 1

2(

,

wk

xd

x2 z1(2)

) 1

2(

,

w0

) 1

1( z

) 1

2( z

1

) 2

1(

,

w0

) 2

1(

,

w1

) 2

1(

,

w2

Example: (2 layer) classifier with non-linear decision boundaries

(9)

CS 1571 Intro to AI

Multilayer neural network

• Models non-linearities through logistic regression units

• Can be applied to both regression and binary classification problems

1

) ,

| 1 ( )

(x = p y= xw f

) 1

1(

,

w0

) 1

1(

,

wk

) 1

2(

,

wk

) 2

1( z )

1

2(

,

w0

) 1

1( z

) 1

2( z

) 2

1(

,

w0

) 2

1(

,

w1

) 2

1(

,

w2

Hidden layer Output layer

Input layer

1

) , ( )

(x f x w

f =

regression

classification

option x1

xd

x2

Multilayer neural network

• Non-linearities are modeled using multiple hidden logistic regression units (organized in layers)

• The output layer determines whether it is a regression or a binary classification problem

) ,

| 1 ( )

(x = p y= xw f

Hidden layers Output layer

Input layer

) , ( )

(x f x w

f =

regression

classification

option x1

xd

x2

(10)

CS 1571 Intro to AI

Learning with MLP

• How to learn the parameters of the neural network?

• Online gradient descent algorithm – Weight updates based on

• We need to compute gradients for weights in all units

• Can be computed in one backward sweep through the net !!!

• The process is called back-propagation ) ,

online ( i w

j j

j J D

w w

w

− ∂

← α

) ,

online (Di w J

Backpropagation

xi(k) xl(k+1)

) 1

, (k

wij zi(k) wl,i(k+1) zl(k+1) )

1 (k xj

k-th level (k+1)-th level (k-1)-th level

) (k

xi - output of the unit i on level k

) (k

zi - input to the sigmoid function on level k

+

=

j

j j i i

i k w k w k x k

z( ) ,0( ) , ( ) ( 1) ))

( ( )

(k g z k xi = i

)

, (k

wij - weight between units j and i on levels (k-1) and k

(11)

CS 1571 Intro to AI

Backpropagation

)

i(k δ

)) ( ( )

(K y f x,w

i =

δ

)

, (k wij

Update weight using a data point

) , ) (

) ( ( )

(

, ,

, online u w

j i j

i j

i J D

k k w

w k

w

α

) , ) (

) (

( online u w

i

i J D

k k z

=

Let δ

Then: ( ) ( , ) (( ), ) (()) ( ) ( 1)

, ,

=

=

k x k

k w

k z k

z D D J

k J

w ij i j

i i

u online u

online j

i

w δ w

S.t. is computed from and the next layer δl(k+1) ))

( 1 )(

( ) 1 ( ) 1 ( )

(k k wl,i k xi k xi k

l l

i

+ +

=

δ

δ

Last unit(is the same as for the regular linear units):

It is the same for the classification with the log-likelihood measure of fit and linear regression with least-squares error!!!

>

=< y Du x,

) (k xi

Learning with MLP

• Online gradient descent algorithm – Weight update:

) , ) (

) ( ( )

( online

, ,

, u w

j i j

i j

i J D

k k w

w k

w

− ∂

← α

) 1 ( ) ) (

( ) ( )

( ) , ) (

, ) (

( ,

,

∂ =

= ∂

k x k

k w

k z k

z D D J

k J

w i j i j

i i

u online u

online j

i

w δ w

) 1 ( ) ( )

( )

( ,

, kw kk x k

wi j i j αδi j

) 1 (kxj

)

i(k δ

- j-th output of the (k-1) layer

- derivative computed via backpropagation α - a learning rate

(12)

CS 1571 Intro to AI

Online gradient descent algorithm for MLP

Online-gradient-descent (D, number of iterations) Initialize all weights

for i=1:1: number of iterations

do select a data point Du=<x,y> from D set learning rate

compute outputs for each unit

compute derivatives via backpropagation update all weights (in parallel)

end for

return weights w

)

, (k wi j

α

) 1 ( ) ( )

( )

( ,

, kw kk x k

wi j i j αδi j

) (k xj

)

i(k δ

Xor Example.

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

• linear decision boundary does not exist

(13)

CS 1571 Intro to AI

Xor example. Linear unit

Xor example.

Neural network with 2 hidden units

(14)

CS 1571 Intro to AI

Xor example.

Neural network with 10 hidden units

MLP in practice

Optical character recognition– digits 20x20 – Automatic sorting of mails

– 5 layer network with multiple output functions

10 outputs (0,1,…9)

20x20 = 400 inputs

5 10 3000

4 300 1200

3 1200 50000

2 784 3136

1 3136 78400 layer Neurons Weights

References

Related documents

When available, data from earlier toxicological studies including data on toxicokinetics will provide information on potential local, gastroenteral effects, and the extent

As shown in Table 2, most of the commercial so- lutions (with the exclusion of FortiDB and Secure- Sphere) use rules (black or white listing) to define unauthorized behavior that

1) To develop a new skin color modeling and detection method for detecting human targets in complex images. 2) To propose a new methodology for testing and evaluation of

eRouter DOCSIS Embedded Router: An eSAFE that is compliant with [eRouter], providing IPv4 and/or IPv6 data forwarding, address configuration, and Domain Name services to

(b) if the land is not sold at the time and place so appointed, the Court may either order the land to be again offered for sale or make a vesting order in the case of a mortgage

We use this classification of constrained firms to conduct two additional empirical analyses: (i) first, we analyze the consistency between the classification from

Section 4 empirically investigates the relationship between the single measures for labor market institutions and globalization: in each scenario, I will first briefly

and forgive us our debts as we ourselves have forgiven our debtors, and lead us not into temptation but deliver us from evil.. Phillips The New Testament in Modern English 1958