• No results found

Solutions to Selected Problems-Duda, Hart

N/A
N/A
Protected

Academic year: 2021

Share "Solutions to Selected Problems-Duda, Hart"

Copied!
12
0
0

Loading.... (view fulltext now)

Full text

(1)

Solutions to Selected Problems In:

Solutions to Selected Problems In:

Pattern Classification

Pattern Classification

by Duda, Hart, Stork

by Duda, Hart, Stork

John L. Weatherwax

John L. Weatherwax

∗∗

February 24, 2008

February 24, 2008

Problem Solutions

Problem Solutions

Chapter 2 (Bayesian Decision Theory)

Chapter 2 (Bayesian Decision Theory)

Problem 11 (randomized rules) Problem 11 (randomized rules) Part (a):

Part (a): LetLet R R((xx) be the average risk given measurement) be the average risk given measurement xx. Then since this risk depends. Then since this risk depends on the action and correspondingly the probability of taking such action, this risk can be on the action and correspondingly the probability of taking such action, this risk can be expressed as expressed as R R((xx) ) == m m

ii=1=1 R R((aaii

||

xx))P P ((aaii

||

xx))

Then the total risk (over all possible

Then the total risk (over all possible xx) is given by the average of ) is given by the average of  R R((xx) ) oror R R = =

 

 

Ω Ωxx (( m m

ii=1=1 R R((aaii

||

xx))P P ((aaii

||

xx)))) p p((xx))dxdx Part (b):

Part (b):  One would pick One would pick P 

P ((aaii

||

xx) =) =

00 ii = anything else11 ii = ArgMin = ArgMin = anything elseiiRR((aaii

||

xx))

Thus for every

Thus for every x x  pick the action that minimizes the local risk. This amounts to a determin- pick the action that minimizes the local risk. This amounts to a determin-istic rule i.e. there is no benefit to using randomness.

istic rule i.e. there is no benefit to using randomness.

[email protected][email protected]

(2)

Part (c):

Part (c):   Yes, since in a deterministic decision rule setting randomizing some decisions  Yes, since in a deterministic decision rule setting randomizing some decisions might further improve performance. But how this would be done would be difficult to say. might further improve performance. But how this would be done would be difficult to say.

Problem 13 Problem 13 Let Let λ λ((ααii

||

ωω j j) ) ==

00 ii = = j j λ λrr ii = = c c + 1 + 1 λ λss   otherwise  otherwise

Then since the definition of risk is expected loss we have Then since the definition of risk is expected loss we have

R R((ααii

||

xx) ) == cc

ii=1=1 λ λ((aaii

||

ωω j j))P P ((ωω j j

||

xx)) ii =  = 11,, 22, . . . , c, . . . , c (1)(1) R R((ααcc+1+1

||

xx) ) == λλrr (2)(2) Now for

Now for i i =  = 11,, 22, . . . , c, . . . , c the risk can be simplified as the risk can be simplified as R R((ααii

||

xx) ) == λ λss cc

 j  j=1=1,, jj



==ii λ λ((aaii

||

ωω j j))P P ((ωω j j

||

xx) ) == λ λss(1(1

P P ((ωωii

||

xx))))

Our decision function is to choose the action with the smallest risk, this means that we pick Our decision function is to choose the action with the smallest risk, this means that we pick the reject option if and only if 

the reject option if and only if  λ λrr < < λλss(1(1

P ((ωP  ωii

||

xx))))

∀∀

ii =  = 11,, 22, . . . , c, . . . , c This is equivalent to This is equivalent to P  P ((ωωii

||

xx)) < < 1 1

 λ  λrr λ λss

∀∀

ii =  = 11,, 22, . . . , c, . . . , c

This can be interpreted as if all posterior probabilities are below a threshold (1

This can be interpreted as if all posterior probabilities are below a threshold (1

λλrr

λ

λss) ) wewe

should reject. Otherwise we pick action

should reject. Otherwise we pick action  i i  such that such that λ λss(1(1

P P ((ωωii

||

xx)))) <  < λλss(1(1

P P ((ωωii

||

xx))))

∀∀

jj = = 11,, 22, ., . . .. . , c, c, , jj

 

 

== i i This is equivalent to This is equivalent to P  P ((ωωii

||

xx)) >  > P P ((ωω j j

||

xx))

∀∀

jj

 

 

== i i Thi

This s can be can be ininteterprrpreteeted d as as selselecectinting g the class with the class with the largethe largest st postpostererioriori i proprobabbabiliilityty. . InIn summary we should reject if and only if 

summary we should reject if and only if  P  P ((ωωii

||

xx)) < < 1 1

 λ  λrr λ λss

∀∀

ii =  = 11,, 22, . . . , c, . . . , c (3)(3) Note that if 

Note that if  λλrr  = 0 there is no loss for rejecting (equivalent benefit/reward for correctly  = 0 there is no loss for rejecting (equivalent benefit/reward for correctly

classifying) and we always reject. If 

classifying) and we always reject. If  λλrr > > λλss, the loss from rejecting is greater than the loss, the loss from rejecting is greater than the loss

from missclassifing and we should never reject. This can be seen from the above since from missclassifing and we should never reject. This can be seen from the above since λλrr

λ λss >> 1 1

so

so 11

λλrr

λ

(3)

Problem 31 (the probability of error) Problem 31 (the probability of error) Part (a):

Part (a):  We can assume with out loss of generality that We can assume with out loss of generality that µµ11 < < µµ22. . If thiIf this is not the cas is not the casese

initially change the labels of class one and class two so that it is. We begin by recalling the initially change the labels of class one and class two so that it is. We begin by recalling the probability of error probability of error P  P ee P  P ee == P P (( ˆ ˆH H  == H  H 11

||

H H 22))P P ((H H 22) +) + P  P (( ˆ ˆH H  == H  H 22

||

H H 11))P P ((H H 11)) = =

 

 

ξξ

−∞

−∞

 p  p((xx

||

H H 22))dxdx

P P ((H H 22) +) +

 

 

ξξ  p  p((xx

||

H H 11))dxdx

P ((H P  H 11)) .. Here

Here ξ ξ   is the decision boundary betw  is the decision boundary between the two classeen the two classes i.e. es i.e. we claswe classify as sify as classclass H H 11  when  when

x

x < < ξ ξ  and classify and classify xx as class as class H H 22  when  when x x > > ξ ξ . . HHeerere P P ((H H ii) are the priors for each of the) are the priors for each of the

tw

two o classclasses. es. WWe e begin by finding begin by finding the Baythe Bayes decision boundaryes decision boundary ξ ξ  under the minimum error under the minimum error crit

criterionerion. . FFrom the discusrom the discussion in sion in the book, the book, the decithe decision boundary is given by a sion boundary is given by a liklikelihelihoodood ratio test. That is we decide class

ratio test. That is we decide class H H 22 if if 

 p  p((xx

||

H H 22))  p  p((xx

||

H H 11))

 ≥

 ≥

P  P ((H H 11)) P  P ((H H 22))

C C 2121

C C 1111 C  C 1212

C C 2222

,, and classify the point as a member of 

and classify the point as a member of  H H 11 other otherwisewise. . The decisThe decision bion boundaroundaryy ξ ξ  is the point is the point

at which this inequality is an

at which this inequality is an  equality   equality . . If we If we use a use a minminimimum probabum probabiliility of ty of ererror critror critererionion the costs

the costs C C ijij are given by are given by C C ijij = = 11

δ δ ijij, and the decision boundary reduces when, and the decision boundary reduces when P P ((H H 11) ) ==

P ((H H 22) =) = 1122 toto

 p

 p((ξ ξ 

||

H H 11) =  p) = p((ξ ξ 

||

H H 22)) ..

If each conditional distributions is Gaussian

If each conditional distributions is Gaussian p p((xx

||

H H ii) ) ==

 N 

 N 

((µµii,, σσii22), then the above expression), then the above expression

becomes becomes 11

√ 

√ 

22πσπσ 1 1 exp exp

{−

{−

11 22 ((ξ ξ 

µµ11))22 σ σ22 1 1

}}

= =

√ 

√ 

11 22πσπσ22 exp exp

{−

{−

11 22 ((ξ ξ 

µµ22))22 σ σ22 2 2

}}

.. If we assume (for simplicity) that

If we assume (for simplicity) that σ σ11 = = σ σ22 = = σ σ the above expression simplifies to ( the above expression simplifies to (ξ ξ 

µµ11))22 ==

((ξ ξ 

µµ22))22, or on taking the square root of both sides of this we get, or on taking the square root of both sides of this we get  ξ  ξ 

µµ11 = =

±

±

((ξ ξ 

µµ22). If we). If we

take the plus sign in this equation we get the contradiction that

take the plus sign in this equation we get the contradiction that  µ µ11 = = µ µ22. Taking the minus. Taking the minus

sign and solving for

sign and solving for ξ  ξ  we find our decision threshold given by we find our decision threshold given by ξ 

ξ  = = 11

22((µµ11 + + µ µ22)) .. Again since we have uniform priors (

Again since we have uniform priors (P P ((H H 11) ) == P P ((H H 22) ) == 1122), we have an error probability), we have an error probability

given by given by 22P P ee = =

 

 

ξξ

−∞

−∞

11

√ 

√ 

22πσπσ expexp

{−

{−

11 22 ((xx

µµ22))22 σ σ22

}}

dxdx + +

 

 

ξξ 11

√ 

√ 

22πσπσ expexp

{−

{−

11 22 ((xx

µµ11))22 σ σ22

}}

dxdx .. or or 22

√ 

√ 

22πσπσP P ee = =

 

 

ξξ

−∞

−∞

exp exp

{−

{−

11 22 ((xx

µµ22))22 σ σ22

}}

dxdx + +

 

 

ξξ exp exp

{−

{−

11 22 ((xx

µµ11))22 σ σ22

}}

dxdx ..

To evaluate the first integral use the substitution

To evaluate the first integral use the substitution vv == xx

µµ22

√ 

√ 

2σ soso dvdv ==

√ 

√ 

dxdx2σ, and in the second, and in the second

use

use vv = = xx

µµ22

√ 

√ 

2σ soso dv dv  = =

√ 

√ 

dxdx2σ. This then gives. This then gives

22

√ 

√ 

22πσπσP P ee = = σ σ

√ 

√ 

22

 

 

µ µ11µµ22 2 2√ √ 22σσ

−∞

−∞

ee

vv22dvdv + + σ σ

√ 

√ 

22

 

 

µµ2211√ √ −−22µµσσ22 ee

vv22dvdv ..

(4)
(5)

as

as

||

µµ11

µµ22

||

σ

σ

 + +

. This can happen if  µ. This can happen if µ11 andand µµ22 get far apart of  get far apart of  σσ shrinks to zero. In each shrinks to zero. In each

case the

case the classclassificatification ion problproblem em gets easier.gets easier.

Problem 32 (probability of error in higher dimensions) Problem 32 (probability of error in higher dimensions)

As in Problem 31 the classification decision boundary are now the points (

As in Problem 31 the classification decision boundary are now the points ( xx) that satisfy) that satisfy

||||

xx

µµ11

||||

22 ==

||||

xx

µµ22

||||

22..

Expanding each side of this expression we find Expanding each side of this expression we find

||||

xx

||||

22

22xxttµµ11 + +

||||

µµ11

||||

22 ==

||||

xx

||||

22

22xxttµµ22 + +

||||

µµ22

||||

22..

Canceling the common

Canceling the common

||||

xx

||||

22 from each side and grouping the terms linear infrom each side and grouping the terms linear in  x x we have that we have that x xtt((µµ11

µµ22) ) == 11 22



||||

µµ11

||||

2 2

− ||||

µµ22

||||

22



..

Which is the equation for a hyperplane. Which is the equation for a hyperplane.

Problem 34 (error bounds on non-Gaussian data) Problem 34 (error bounds on non-Gaussian data) Part (a):

Part (a):  Since the Bhattacharyya bound is a specialization of the Chernoff bound, we Since the Bhattacharyya bound is a specialization of the Chernoff bound, we begin by considering the Chernoff bound. Specifically, using the bound min(

begin by considering the Chernoff bound. Specifically, using the bound min(a,a, bb))

aaβ β bb11

β β ,,

for

for a  a >> 0, 0, b  b >> 0, and 0 0, and 0

β β 

 ≤

 ≤

1, one can show that1, one can show that P 

P (error)(error)

P P ((ωω11))β β P P ((ωω22))11

β β ee

kk((β β )) ffoor r 00

β β 

 ≤

 ≤

11 ..

Here the expression

Here the expression kk((β β ) is given by) is given by kk((β β ) =) = β β (1(1

β β )) 22 ((µµ22

µµ11)) tt[[β β ΣΣ 1 1 + (1 + (1

β β )Σ)Σ22]]

11((µµ22

µµ11) +) +  1  1 22 ln ln

||

β β ΣΣ11 + (1 + (1

β β )Σ)Σ22

||

||

ΣΣ11

||

β β 

||

ΣΣ22

||

11

β β 

 . . When we desire to compute the Bhattacharyya bound we assign

When we desire to compute the Bhattacharyya bound we assign β β  = = 11//2. 2. FFor a or a twtwo classo class scalar problem with symmetric means and equal variances that is where

scalar problem with symmetric means and equal variances that is where  µ µ11 = =

µµ,, µ µ22 =  = ++µµ,,

and

and σ σ11 = = σ σ22 = = σ σ  the expression for the expression for k k((β β ) above becomes) above becomes

kk((β β ) ) == β β (1(1

β β )) 22 (2(2µµ))



βσβσ 2 2+ (1+ (1

β β ))σσ22



11(2(2µµ) +) + 1 1 22 ln ln

||

βσβσ22+ (1+ (1

β β ))σσ22

||

σ σ22β β σσ2(12(1

β β ))

= = 22β β (1(1

β β ))µµ 2 2 σ σ22 ..

Since the above expression is valid for

Since the above expression is valid for al al l l  β  β  in the range 0 in the range 0

β β 

 ≤

 ≤

1, we can obtain the tightest1, we can obtain the tightest possible bound by computing the

possible bound by computing the maximum  maximum  of the expression of the expression kk((β β ) (since the dominant) (since the dominant β β  expression is

expression is ee

kk((β β )). . TTo o evevaluate this maximaluate this maximum um of of  k k((β β ) we take the derivative with respect) we take the derivative with respect

to

to β  β  and set that result equal to zero obtaining and set that result equal to zero obtaining dd dβ  dβ 

22β β (1(1

β β ))µµ22 σ σ22

= = 00

(6)

which gives which gives

11

22β β  =  = 00 or

or β β  = = 11//2. 2. ThThus in us in thithis s cascase the Cherne the Chernoff boundoff bound equals  equals   the Bhat  the Bhattactacharra bound. harra bound. WhenWhen β 

β  = = 11//2 we have for the following bound on our our error probability2 we have for the following bound on our our error probability P 

P (error)(error)

 

 

P P ((ωω11))P P ((ωω22))ee

µ µ22

2 2σσ22 ..

To further simplify things if we assume that our variance is such that

To further simplify things if we assume that our variance is such that  σ σ22 == µ µ22, and that our, and that our priors over class are taken to be uniform (

priors over class are taken to be uniform (P P ((ωω11) ) == P  P ((ωω22) ) = = 11//2) the above becomes2) the above becomes

P  P (error)(error)

11 22ee

1 1//22

00..30333033 ..

(7)

Chapter 3 (Maximum Likelihood and Bayesian Estimation)

Chapter 3 (Maximum Likelihood and Bayesian Estimation)

Problem 34 Problem 34

In the problem statement we are told that to operate on a data of “size”

In the problem statement we are told that to operate on a data of “size” nn, requires, requires f f ((nn)) numeric calculations. Assuming that each calculation requires 10

numeric calculations. Assuming that each calculation requires 10

99 seconds of time to com-seconds of time to

com-pute, we can process a data structure of size

pute, we can process a data structure of size nn in 10 in 10

99f ((nn) seconds) seconds. . If we wanIf we want our t our resuresultslts

finished by a time

finished by a time T  T  then the largest data structure that we can process must satisfy then the largest data structure that we can process must satisfy 10

10

99f ((nf  n))

T T  oror nn

f f 

11(10(1099T T )) .. Converting the ranges of times into seconds we find that

Converting the ranges of times into seconds we find that T 

T  ==

{{

1sec1sec,, 11 hohourur,, 1day1day,, 11 yyeaearr

}}

==

{{

1sec1sec,, 36360000 sesecc,, 8640864000 secsec,, 3153315360006000 sesecc

}}

.. Using a very naive algorithm we can compute the largest

Using a very naive algorithm we can compute the largest nn such that such that f  f ((nn) by simply incre-) by simply incre-menting

menting nn repeat repeatedlyedly. . This is done This is done in the Matlab scriptin the Matlab script  prob 34 chap 3.m  prob 34 chap 3.m and the results and the results

are displayed in Table XXX. are displayed in Table XXX.

Probl

Problem em 35 35 (rec(recursivursive e v.s. v.s. diredirect ct calcucalculatiolation n of of meanmeans s and and covcovariaariances)nces) Part (a):

Part (a):   To compute ˆ  To compute ˆµµnn we have to add we have to add nn,, dd-dimensional vectors resulting in-dimensional vectors resulting in ndnd  com-

 com-putat

putations. ions. WWe then e then havhave e to divide each componento divide each component t by the scalarby the scalar nn, resulting in another, resulting in another nn computations. Thus in total we have

computations. Thus in total we have nd

nd + + n n =  = (1 +(1 + d d))nn ,, calculations.

calculations.

To compute the sample covariance matrix,

To compute the sample covariance matrix, C C nn, we have to first compute the differences, we have to first compute the differences

x

xkk

µµˆˆnn requiring dd subtraction for each vector requiring  subtraction for each vector xxkk. Since we must do this for each of the. Since we must do this for each of the nn

vectors

vectors x xkk we have a total of  we have a total of  nd nd calculations required to compute all of the calculations required to compute all of the  x xkk

µµˆˆnn difference difference

ve

vectorctors. s. WWe then have to compute then have to compute the outer products (e the outer products (xxkk

 −

 −

µµˆˆnn)()(xxkk

 −

 −

µµˆˆnn))

′′

, which require, which require

dd22 calccalculatioulations to ns to produce eacproduce each h (in a (in a naivnaive implemene implementationtation). ). SincSince e we havwe havee nn  such outer  such outer

products to compute computing them all requires

products to compute computing them all requires ndnd22 operations. We next, have to add alloperations. We next, have to add all of these

of these n n  outer products together, resulting in another set of ( outer products together, resulting in another set of ( nn

1)1)dd22 operations. Finallyoperations. Finally

dividing by the scalar

dividing by the scalar n n

1 requires another1 requires another d d22 operations. Thus in total we haveoperations. Thus in total we have nd

nd + + nd nd22 + (+ (nn

1)1)dd22++ d d22 = = 22ndnd22 ++ nd nd ,, calculations.

calculations. Part (b):

Part (b):  We can derive recursive formulas for ˆ We can derive recursive formulas for ˆµµnn andand C C nn in the followi in the following wayng way. . First forFirst for

ˆˆ µ

µnn we note that we note that

ˆˆ µ µnn+1+1 == 11 n n + 1 + 1 n n+1+1

k k=1=1 x xkk == x xnn+1+1 n n + 1 + 1 + +

nn n n + 1 + 1

 1 1 n n n n

k k=1=1 x xkk = = 11 n n + 1 + 1xxnn+1+1 + +

nn n n + 1 + 1

µµˆˆ n n..

(8)

Writing

Writing nn+1n+1n asas nn+1+1n+1n+1

11 = = 11

nn+1+111  the above becomes the above becomes ˆˆ µ µnn+1+1 == 11 n n + 1 + 1xxnn+1+1 +  + ˆˆµµnn

11 n n + 1 + 1µµˆˆnn = = ˆˆµµnn + +

11 n n + 1 + 1

((xx n n+1+1

µµˆˆnn)) ,, as expected. as expected.

To derive the recursive update formula for the covariance matrix

To derive the recursive update formula for the covariance matrix C C nn we begin by recalling we begin by recalling

the definition of 

the definition of  C  C nn of of 

C  C nn+1+1 = = 11 n n n n+1+1

k k=1=1 ((xxkk

µµˆˆnn+1+1)()(xxkk

µµˆˆnn+1+1))

′′

..

Now introducing the recursive definition of the mean ˆ

Now introducing the recursive definition of the mean ˆµµnn+1+1 as computed above we have that as computed above we have that

C  C nn+1+1 == 11 n n n n+1+1

k k=1=1 ((xxkk

µµˆˆnn

11 n n + 1 + 1((xxnn+1+1

µµˆˆnn))())(xxkk

µµˆˆnn

11 n n + 1 + 1((xxnn+1+1

µµˆˆnn))))

′′

= = 11 n n n n+1+1

k k=1=1 ((xxkk

µµˆˆnn)()(xxkk

µµˆˆnn))

′′

11 n n((nn + 1) + 1) n n+1+1

k k=1=1 ((xxkk

µµˆˆnn)()(xxnn+1+1

µµˆˆnn))

′′

nn((nn + 1)11 + 1) n n+1+1

k k=1=1 ((xxnn+1+1

µµˆˆnn)()(xxkk

µµˆˆnn))

′′

11 n n((nn + 1) + 1)22 n n+1+1

k k=1=1 ((xxnn+1+1

µµˆˆnn)()(xxnn+1+1

µµˆˆnn))

′′

 . . Since ˆ

Since ˆµµnn = = nn11

nnkk=1=1 x xkk we see that we see that

n nˆˆµµnn

11 n n n n

k k=1=1 x xkk = = 0 0 oorr 11 n n n n

k k=1=1 ((ˆˆµµnn

xxkk) ) = = 00 ,,

Using this fact the second and third terms the above can be greatly simplified. For example Using this fact the second and third terms the above can be greatly simplified. For example we have that we have that n n+1+1

k k=1=1 ((xxkk

µµˆˆnn) =) = n n

k k=1=1 ((xxkk

µµˆˆnn) + () + (xxnn+1+1

µµˆˆnn) ) == x xnn+1+1

µµˆˆnn

Using this identity and the fact that the fourth term has a summand that does not depend Using this identity and the fact that the fourth term has a summand that does not depend on

on k k  we have we have C C nn+1+1 given by given by

C  C nn+1+1 == n n

11 n n

11 n n

11 n n

k k=1=1 ((xxkk

µµˆˆnn)()(xxkk

µµˆˆnn))

′′

 + + 11 n n

11((xxnn+1+1

µµˆˆnn)()(xxnn+1+1

µµˆˆnn))

′′

nn((nn + 1)22 + 1)((xxn+1n+1

µµˆˆnn)()(xxnn+1+1

µµˆˆnn) +) + 11 n n((nn + 1) + 1)((xxnn+1+1

µµˆˆnn)()(xxnn+1+1

µµˆˆnn))

′′

= =

11

11 n n

n n + +

11 n n

 −

 −

11 n n((nn + 1) + 1)

((xx n n+1+1

µµˆˆnn)()(xxnn+1+1

µµˆˆnn))

′′

= = C C nn

11 n nC C nn + + 11 n n

nn + 1 + 1

11 n n + 1 + 1

((xx n n+1+1

µµˆˆnn)()(xxnn+1+1

µµˆˆnn))

′′

= =

nn

11 n n

n n + + 11 n n + 1 + 1((xxnn+1+1

µµˆˆnn)(x)(xnn+1+1

µµˆˆnn))

′′

..

(9)

which is the result in the book. which is the result in the book. Part (c):

Part (c):  To compute ˆ To compute ˆµµnn using the recurrence formulation we have using the recurrence formulation we have dd subtractions to com- subtractions to

com-pute

pute xxnn

 ˆ ˆµµnn

11 and the scalar division requires another and the scalar division requires another dd operati operations. ons. FinallFinallyy, , additaddition byion by

the ˆ

the ˆµµnn

11 requires another requires another dd operations giving in total 3 operations giving in total 3dd operations. To be compared with operations. To be compared with

the (

the (dd + 1) + 1)nn operations in the non-recursive case. operations in the non-recursive case.

To compute the number of operations required to compute

To compute the number of operations required to compute C C nn recursively we have dd com- recursively we have

 com-putations to compute the difference

putations to compute the difference  x xnn

µµˆˆnn

11 and then and then d d22 operations to compute the outeroperations to compute the outer

product (

product (xxnn

µµˆˆnn

11)(x)(xnn

µµˆˆnn

11))

′′

, another, another d d22 operations to multiplyoperations to multiply C  C nn

11 by by



nnnn

2211



and finallyand finally

dd22 operations to add this product to the other one. Thus in total we haveoperations to add this product to the other one. Thus in total we have

dd + + d d22 ++ d d22++ d d22 ++ d d22 = = 44dd22 ++ d d .. To be compare with 2

To be compare with 2ndnd22 ++ nd nd   in the non   in the non rerecurcursivsive e cascase. e. FFor largeor large nn   the non-recursive  the non-recursive

calculations are much more computationally intensive. calculations are much more computationally intensive.

(10)

Chapter 6 (Multilayer Neural Networks)

Chapter 6 (Multilayer Neural Networks)

Problem 39 (derivatives of matrix inner products) Problem 39 (derivatives of matrix inner products) Part (a):

Part (a):  We present a slightly different method to prove this here. We desire the gradient We present a slightly different method to prove this here. We desire the gradient (or derivative) of the function

(or derivative) of the function

φ

φ

xxT T KKxx .. The

The ii-th component of this derivative is given by-th component of this derivative is given by ∂φ ∂φ ∂x ∂xii == ∂  ∂  ∂x ∂xii



xx T  T KKxx



== e eT T ii K x Kx + + x xT T KeKeii where

where e eii is the i is the i-th elementary basis function for-th elementary basis function for RRnn, i.e. it has a 1 in the, i.e. it has a 1 in the  i i-th position and-th position and

zeros everywhere else. Now since zeros everywhere else. Now since

((eeT T ii K K xx))T T  == x xT T K K T T eeT T ii == e eT T ii K K xx ,, the above becomes

the above becomes

∂φ ∂φ ∂x ∂xii

=

= e eT T ii K K xx + + e eT T ii K  K T T xx = = e eT T ii



((K K  + + K  K T T ))xx



.. Since multiplying by

Since multiplying by eeT T 

ii   on the left selects the  on the left selects the ii-th row from the expression to its right we-th row from the expression to its right we

see that the full gradient expression is given by see that the full gradient expression is given by

φφ =  = ((K K  + + K  K T T ))xx ,, as request

as requested ed in the in the texttext. . Note that this Note that this exprexpressiession on can also bcan also be e proproveved d easileasily y by writinby writing g eaceachh term in component notation as suggested in the text.

term in component notation as suggested in the text. Part (b):

Part (b): If If  K  K  is symmetric then since is symmetric then since K K T T  == K  K  the above expression simplifies to the above expression simplifies to

φφ =  = 22KxKx ,, as claimed in the book.

as claimed in the book.

Chapter 9 (Algorithm-Independent Machine Learning)

Chapter 9 (Algorithm-Independent Machine Learning)

Problem 9 (sums of binomial coefficients) Problem 9 (sums of binomial coefficients) Part (a):

Part (a):  We first recall the binomial theorem which is We first recall the binomial theorem which is ((xx + + y y))nn == n n

k k=0=0

nn kk

xx k kyynn

kk.. If we let

If we let xx = 1  = 1 anandd y y  = 1 then = 1 then xx + + y y = 2 and the sum above becomes = 2 and the sum above becomes 22nn == n n

k k=0=0

nn kk

,,

(11)

whi

whicch h is the is the stastateted d ideidentntitityy. . As an As an asiaside, note also that de, note also that if we if we letlet xx ==

 −

 −

1 and1 and yy = 1 then = 1 then x

x + + y y = 0 and we obtain another interesting identity involving the binomial coefficients = 0 and we obtain another interesting identity involving the binomial coefficients 0 0 == n n

k k=0=0

nn kk

((

1)1)kk ..

Problem 23 (the average of the leave one out means) Problem 23 (the average of the leave one out means) We define the leave one out mean

We define the leave one out mean  µ µ((ii)) asas

µ µ((ii)) = = 11 n n

11

 j  j



==ii x x j j . .

This can obviously be computed for every

This can obviously be computed for every  i i. Now we want to consider the mean of the leave. Now we want to consider the mean of the leave one out means. To do so first notice that

one out means. To do so first notice that  µ µ((ii)) can be written as can be written as

µ µ((ii)) == 11 n n

11

 j j



= =ii x x j j = = 11 n n

11

nn

 j  j=1=1 x x j j

xxii

= = 11 n n

11

n n

11 n n

nn  j  j=1=1 x x j j

xxii

= = nn n n

11µµˆˆ

x xii n n

11 . .

With this we see that the mean of all of the leave one out means is given by With this we see that the mean of all of the leave one out means is given by

11 n n n n

ii=1=1 µ µ((ii)) == 11 n n n n

ii=1=1

nn n n

11µµˆˆ

x xii n n

11

= = nn n n

11µµˆˆ

11 n n

11

11 n n

nn ii=1=1 x xii = = nn n n

11µµˆˆ

ˆˆ µ µ n n

11 = = ˆˆµµ ,, as expected. as expected.

Problem 40 (maximum likelihood estimation with a binomial random variable) Problem 40 (maximum likelihood estimation with a binomial random variable) Since

Since kk has a binomial distribution with parameters ( has a binomial distribution with parameters (nn

′′

,, pp) we have that) we have that

P  P ((kk) =) =

nn

′′

kk

 p p k k(1(1

 p p))nn′′

kk..

(12)

Then the

Then the pp   that maximizes this expression is given by taking the derivative of the above  that maximizes this expression is given by taking the derivative of the above (with respect to

(with respect to  p p) setting the resulting expression equal to zero and solving for) setting the resulting expression equal to zero and solving for  p p. We find. We find that this derivative is given by

that this derivative is given by dd dp dpP P ((kk) ) ==

nn

′′

kk

kpkp k k

11(1(1

 p p))nn′′

kk ++

nn

′′

kk

 p p k k(1(1

 p p))nn′′

kk

11((nn

′′

kk)()(

1)1) .. Which when set equal to zero and solve for

Which when set equal to zero and solve for  p p  we find that we find that p p = = nnkk′′, or the empirical counting, or the empirical counting estimate of the probability of success as we were asked to show.

References

Related documents

In this paper we discussed the methods of estimating prediction error of the simple linear regression model using two methods of bootstrap, Efron’s bootstrap and Banks’ bootstrap..

This auditorium is used false ceiling. The false ceiling used to conceal service lines such as air conditioning systems, electrical wiring and sprinkler. They used to give

eLearning in working life (corporate training sector) has its main advantage in supporting the elimination of the border between learning and working. Some examples are

Western blot analysis revealed that the polypeptides corresponding to full-length, unaltered Mid1 were detected in HEK293 expressing Mid1-S605A–GFP or Mid1-GFP ( Fig. 5C ),

An experimental study was conducted to investigate the ability of micro-jets in controlling the amplitude of shock unsteadiness associated with the interaction

1.2.3.4 Summary of the terminological dispute over controversial transitivity alternation In this section so far, we have reviewed three major accounts of the ‘theme +

Furthermore, since most of the land clearing in the Amazon since 1960 has taken place in areas with relatively open vegetation, especially in the border areas to Southern Brazil where