Solutions to Selected Problems-Duda, Hart

(1)

Solutions to Selected Problems In:

Pattern Classiﬁcation

by Duda, Hart, Stork

John L. Weatherwax

∗∗

February 24, 2008

Problem Solutions

Chapter 2 (Bayesian Decision Theory)

Problem 11 (randomized rules) Problem 11 (randomized rules) Part (a):

Part (a): LetLet R R((xx) be the average risk given measurement) be the average risk given measurement xx. Then since this risk depends. Then since this risk depends on the action and correspondingly the probability of taking such action, this risk can be on the action and correspondingly the probability of taking such action, this risk can be expressed as expressed as R R((xx) ) == m m



ii=1=1 R R((aaii

||

xx))P P ((aaii

||

xx))

Then the total risk (over all possible

Then the total risk (over all possible xx) is given by the average of ) is given by the average of R R((xx) ) oror R R = =



Ω Ωxx (( m m



ii=1=1 R R((aaii

||

xx))P P ((aaii

||

xx)))) p p((xx))dxdx Part (b):

Part (b): One would pick One would pick P

P ((aaii

||

xx) =) =



_{00 ii = anything else}11 ii = ArgMin = ArgMin_{= anything else}iiRR((aaii

||

xx))

Thus for every

Thus for every x x pick the action that minimizes the local risk. This amounts to a determin- pick the action that minimizes the local risk. This amounts to a determin-istic rule i.e. there is no beneﬁt to using randomness.

istic rule i.e. there is no beneﬁt to using randomness.

∗

∗_{[email protected]}_{[email protected]}

(2)

Part (c):

Part (c): Yes, since in a deterministic decision rule setting randomizing some decisions Yes, since in a deterministic decision rule setting randomizing some decisions might further improve performance. But how this would be done would be diﬃcult to say. might further improve performance. But how this would be done would be diﬃcult to say.

Problem 13 Problem 13 Let Let λ λ((ααii

||

ωω j j) ) ==







00 ii = = j j λ λrr ii = = c c + 1 + 1 λ λss otherwise otherwise

Then since the deﬁnition of risk is expected loss we have Then since the deﬁnition of risk is expected loss we have

R R((ααii

||

xx) ) == cc



ii=1=1 λ λ((aaii

||

ωω j j))P P ((ωω j j

||

xx)) ii = = 11,, 22, . . . , c, . . . , c (1)(1) R R((ααcc+1+1

||

xx) ) == λλrr (2)(2) Now for

Now for i i = = 11,, 22, . . . , c, . . . , c the risk can be simpliﬁed as the risk can be simpliﬁed as R R((ααii

||

xx) ) == λ λss cc



j j=1=1,, jj

_

==ii λ λ((aaii

||

ωω j j))P P ((ωω j j

||

xx) ) == λ λss(1(1

−

P P ((ωωii

||

xx))))

Our decision function is to choose the action with the smallest risk, this means that we pick Our decision function is to choose the action with the smallest risk, this means that we pick the reject option if and only if

the reject option if and only if λ λrr < < λλss(1(1

−

P ((ωP ωii

||

xx))))

∀∀

ii = = 11,, 22, . . . , c, . . . , c This is equivalent to This is equivalent to P P ((ωωii

||

xx)) < < 1 1

−

λ λrr λ λss

∀∀

ii = = 11,, 22, . . . , c, . . . , c

This can be interpreted as if all posterior probabilities are below a threshold (1

₋

λλrr

λ

λss) ) wewe

should reject. Otherwise we pick action

should reject. Otherwise we pick action i i such that such that λ λss(1(1

−

P P ((ωωii

||

xx)))) < < λλss(1(1

−

P P ((ωωii

||

xx))))

∀∀

jj = = 11,, 22, ., . . .. . , c, c, , jj



== i i This is equivalent to This is equivalent to P P ((ωωii

||

xx)) > > P P ((ωω j j

||

xx))

∀∀

jj



== i i Thi

This s can be can be ininteterprrpreteeted d as as selselecectinting g the class with the class with the largethe largest st postpostererioriori i proprobabbabiliilityty. . InIn summary we should reject if and only if

summary we should reject if and only if P P ((ωωii

||

xx)) < < 1 1

−

λ λrr λ λss

∀∀

ii = = 11,, 22, . . . , c, . . . , c (3)(3) Note that if

Note that if λλrr = 0 there is no loss for rejecting (equivalent beneﬁt/reward for correctly = 0 there is no loss for rejecting (equivalent beneﬁt/reward for correctly

classifying) and we always reject. If

classifying) and we always reject. If λλrr > > λλss, the loss from rejecting is greater than the loss, the loss from rejecting is greater than the loss

from missclassiﬁng and we should never reject. This can be seen from the above since from missclassiﬁng and we should never reject. This can be seen from the above since λλrr

λ λss >> 1 1

so

so 11

₋

λλrr

λ

(3)

Problem 31 (the probability of error) Problem 31 (the probability of error) Part (a):

Part (a): We can assume with out loss of generality that We can assume with out loss of generality that µµ11 < < µµ22. . If thiIf this is not the cas is not the casese

initially change the labels of class one and class two so that it is. We begin by recalling the initially change the labels of class one and class two so that it is. We begin by recalling the probability of error probability of error P P ee P P ee == P P (( ˆ ˆH H == H H 11

||

H H 22))P P ((H H 22) +) + P P (( ˆ ˆH H == H H 22

||

H H 11))P P ((H H 11)) = =







ξξ

−∞

p p((xx

_||

H H 22))dxdx



P P ((H H 22) +) +







∞

ξξ p p((xx

_||

H H 11))dxdx



P ((H P H 11)) .. Here

Here ξ ξ is the decision boundary betw is the decision boundary between the two classeen the two classes i.e. es i.e. we claswe classify as sify as classclass H H 11 when when

x

x < < ξ ξ and classify and classify xx as class as class H H 22 when when x x > > ξ ξ . . HHeerere P P ((H H ii) are the priors for each of the) are the priors for each of the

tw

two o classclasses. es. WWe e begin by ﬁnding begin by ﬁnding the Baythe Bayes decision boundaryes decision boundary ξ ξ under the minimum error under the minimum error crit

criterionerion. . FFrom the discusrom the discussion in sion in the book, the book, the decithe decision boundary is given by a sion boundary is given by a liklikelihelihoodood ratio test. That is we decide class

ratio test. That is we decide class H H 22 if if

p p((xx

_||

H H 22)) p p((xx

_||

H H 11))

≥

P P ((H H 11)) P P ((H H 22))



C C 2121

−

C C 1111 C C 1212

−

C C 2222



_,, and classify the point as a member of

and classify the point as a member of H H 11 other otherwisewise. . The decisThe decision bion boundaroundaryy ξ ξ is the point is the point

at which this inequality is an

at which this inequality is an equality equality . . If we If we use a use a minminimimum probabum probabiliility of ty of ererror critror critererionion the costs

the costs C C ijij are given by are given by C C ijij = = 11

−

δ δ ijij, and the decision boundary reduces when, and the decision boundary reduces when P P ((H H 11) ) ==

P

P ((H H 22) =) = 11₂₂ toto

p

p((ξ ξ

_||

H H 11) = p) = p((ξ ξ

||

H H 22)) ..

If each conditional distributions is Gaussian

If each conditional distributions is Gaussian p p((xx

_||

H H ii) ) ==

N

((µµii,, σσii22), then the above expression), then the above expression

becomes becomes 11

√

_22πσ_πσ 1 1 exp exp

_{−

11 22 ((ξ ξ

₋

µµ11))22 σ σ22 1 1

}}

= =

_√

11 22πσπσ22 exp exp

_{−

11 22 ((ξ ξ

₋

µµ22))22 σ σ22 2 2

}}

.. If we assume (for simplicity) that

If we assume (for simplicity) that σ σ11 = = σ σ22 = = σ σ the above expression simpliﬁes to ( the above expression simpliﬁes to (ξ ξ

−

µµ11))22 ==

((ξ ξ

₋

µµ22))22, or on taking the square root of both sides of this we get, or on taking the square root of both sides of this we get ξ ξ

−

µµ11 = =

±

((ξ ξ

−

µµ22). If we). If we

take the plus sign in this equation we get the contradiction that

take the plus sign in this equation we get the contradiction that µ µ11 = = µ µ22. Taking the minus. Taking the minus

sign and solving for

sign and solving for ξ ξ we ﬁnd our decision threshold given by we ﬁnd our decision threshold given by ξ

ξ = = 11

22((µµ11 + + µ µ22)) .. Again since we have uniform priors (

Again since we have uniform priors (P P ((H H 11) ) == P P ((H H 22) ) == 11₂₂), we have an error probability), we have an error probability

given by given by 22P P ee = =



ξξ

−∞

11

√

_22πσ_πσ expexp

_{−

11 22 ((xx

₋

µµ22))22 σ σ22

}}

dxdx + +



∞

ξξ 11

√

_22πσ_πσ expexp

_{−

11 22 ((xx

₋

µµ11))22 σ σ22

}}

dxdx .. or or 22

√

22πσπσP P ee = =



ξξ

−∞

exp exp

_{−

11 22 ((xx

₋

µµ22))22 σ σ22

}}

dxdx + +



∞

ξξ exp exp

_{−

11 22 ((xx

₋

µµ11))22 σ σ22

}}

dxdx ..

To evaluate the ﬁrst integral use the substitution

To evaluate the ﬁrst integral use the substitution vv == xx

−

µµ22

√

_2σ₂_σ soso dvdv ==

√

dxdx₂_2σ_σ, and in the second, and in the second

use

use vv = = xx

−

µµ22

√

_2σ₂_σ soso dv dv = =

√

dxdx₂_2σ_σ. This then gives. This then gives

22

√

22πσπσP P ee = = σ σ

√

22



µ µ₁₁₋₋µµ₂₂ 2 2√ √ 22σσ

−∞

ee

−

vv22dvdv + + σ σ

√

22



∞

−

µµ2211√ √ −−22µµσσ22 ee

−

vv22dvdv ..

(4)

(5)

as

||

µµ11

−

µµ22

||

σ

→

+ +

∞

. This can happen if µ. This can happen if µ11 andand µµ22 get far apart of get far apart of σσ shrinks to zero. In each shrinks to zero. In each

case the

case the classclassiﬁcatiﬁcation ion problproblem em gets easier.gets easier.

Problem 32 (probability of error in higher dimensions) Problem 32 (probability of error in higher dimensions)

As in Problem 31 the classiﬁcation decision boundary are now the points (

As in Problem 31 the classiﬁcation decision boundary are now the points ( xx) that satisfy) that satisfy

||||

xx

₋

µµ11

||||

22 ==

||||

xx

−

µµ22

||||

22..

Expanding each side of this expression we ﬁnd Expanding each side of this expression we ﬁnd

||||

xx

_||||

22

₋

22xxttµµ11 + +

||||

µµ11

||||

22 ==

||||

xx

||||

22

−

22xxttµµ22 + +

||||

µµ22

||||

22..

Canceling the common

_||||

xx

_||||

22 from each side and grouping the terms linear infrom each side and grouping the terms linear in x x we have that we have that x xtt((µµ11

−

µµ22) ) == 11 22



||||

µµ11

||||

2 2

−

− ||||

µµ22

||||

22



..

Which is the equation for a hyperplane. Which is the equation for a hyperplane.

Problem 34 (error bounds on non-Gaussian data) Problem 34 (error bounds on non-Gaussian data) Part (a):

Part (a): Since the Bhattacharyya bound is a specialization of the Chernoff bound, we Since the Bhattacharyya bound is a specialization of the Chernoff bound, we begin by considering the Chernoff bound. Specifically, using the bound min(

begin by considering the Chernoﬀ bound. Speciﬁcally, using the bound min(a,a, bb))

_≤

aaβ β _bb11

₋

β β _,,

for

for a a >> 0, 0, b b >> 0, and 0 0, and 0

_≤

β β

_≤

1, one can show that1, one can show that P

P (error)(error)

_≤

P P ((ωω11))β β P P ((ωω22))11

−

β β ee

−

kk((β β )) ffoor r 00

≤

β β

≤

11 ..

Here the expression

Here the expression kk((β β ) is given by) is given by kk((β β ) =) = β β (1(1

−

β β )) 22 ((µµ22

−

µµ11)) tt_[[β_β Σ_Σ 1 1 + (1 + (1

−

β β )Σ)Σ22]]

−

11((µµ22

−

µµ11) +) + 1 1 22 ln ln



_||

β β ΣΣ11 + (1 + (1

−

β β )Σ)Σ22

||

ΣΣ11

||

β β

||

ΣΣ22

||

11

−

β β



. . When we desire to compute the Bhattacharyya bound we assign

When we desire to compute the Bhattacharyya bound we assign β β = = 11//2. 2. FFor a or a twtwo classo class scalar problem with symmetric means and equal variances that is where

scalar problem with symmetric means and equal variances that is where µ µ11 = =

−

µµ,, µ µ22 = = ++µµ,,

and

and σ σ11 = = σ σ22 = = σ σ the expression for the expression for k k((β β ) above becomes) above becomes

kk((β β ) ) == β β (1(1

−

β β )) 22 (2(2µµ))



βσβσ 2 2_{+ (1}_{+ (1}

−

β β ))σσ22

_

−

11(2(2µµ) +) + 1 1 22 ln ln



_||

βσβσ22_{+ (1}_{+ (1}

−

β β ))σσ22

||

σ σ22β β _σ_σ2(12(1

₋

β β ))



= = 22β β (1(1

−

β β ))µµ 2 2 σ σ22 ..

Since the above expression is valid for

Since the above expression is valid for al al l l β β in the range 0 in the range 0

_≤

β β

_≤

1, we can obtain the tightest1, we can obtain the tightest possible bound by computing the

possible bound by computing the maximum maximum of the expression of the expression kk((β β ) (since the dominant) (since the dominant β β expression is

expression is ee

−

kk((β β ))_{. . T}_To_{o ev}_{evaluate this maxim}_{aluate this maximum}_{um of}_of k_k((β_{β ) we take the derivative with respect}_{) we take the derivative with respect}

to

to β β and set that result equal to zero obtaining and set that result equal to zero obtaining dd dβ dβ



22β β (1(1

₋

β β ))µµ22 σ σ22



₌_{= 00}

(6)

which gives which gives

11

₋

22β β = = 00 or

or β β = = 11//2. 2. ThThus in us in thithis s cascase the Cherne the Chernoﬀ boundoﬀ bound equals equals the Bhat the Bhattactacharra bound. harra bound. WhenWhen β

β = = 11//2 we have for the following bound on our our error probability2 we have for the following bound on our our error probability P

P (error)(error)

_≤



P P ((ωω11))P P ((ωω22))ee

−

µ µ22

2 2σσ22 ..

To further simplify things if we assume that our variance is such that

To further simplify things if we assume that our variance is such that σ σ22 == µ µ22, and that our, and that our priors over class are taken to be uniform (

priors over class are taken to be uniform (P P ((ωω11) ) == P P ((ωω22) ) = = 11//2) the above becomes2) the above becomes

P P (error)(error)

_≤

11 22ee

−

1 1//22

≈

00..30333033 ..

(7)

Chapter 3 (Maximum Likelihood and Bayesian Estimation)

Problem 34 Problem 34

In the problem statement we are told that to operate on a data of “size”

In the problem statement we are told that to operate on a data of “size” nn, requires, requires f f ((nn)) numeric calculations. Assuming that each calculation requires 10

numeric calculations. Assuming that each calculation requires 10

−

99 _{seconds of time to com-}_{seconds of time to}

com-pute, we can process a data structure of size

pute, we can process a data structure of size nn in 10 in 10

−

99_f_f ((n_{n) seconds}_{) seconds. . If we wan}_{If we want our}_{t our resu}_results_lts

ﬁnished by a time

ﬁnished by a time T T then the largest data structure that we can process must satisfy then the largest data structure that we can process must satisfy 10

10

−

99f ((nf n))

_≤

T T oror nn

_≤

f f

−

11(10(1099T T )) .. Converting the ranges of times into seconds we ﬁnd that

Converting the ranges of times into seconds we ﬁnd that T

T ==

_{{

1sec1sec,, 11 hohourur,, 1day1day,, 11 yyeaearr

_}}

==

_{{

1sec1sec,, 36360000 sesecc,, 8640864000 secsec,, 3153315360006000 sesecc

_}}

.. Using a very naive algorithm we can compute the largest

Using a very naive algorithm we can compute the largest nn such that such that f f ((nn) by simply incre-) by simply incre-menting

menting nn repeat repeatedlyedly. . This is done This is done in the Matlab scriptin the Matlab script prob 34 chap 3.m prob 34 chap 3.m and the results and the results

are displayed in Table XXX. are displayed in Table XXX.

Probl

Problem em 35 35 (rec(recursivursive e v.s. v.s. diredirect ct calcucalculatiolation n of of meanmeans s and and covcovariaariances)nces) Part (a):

Part (a): To compute ˆ To compute ˆµµnn we have to add we have to add nn,, dd-dimensional vectors resulting in-dimensional vectors resulting in ndnd com-

com-putat

putations. ions. WWe then e then havhave e to divide each componento divide each component t by the scalarby the scalar nn, resulting in another, resulting in another nn computations. Thus in total we have

computations. Thus in total we have nd

nd + + n n = = (1 +(1 + d d))nn ,, calculations.

calculations.

To compute the sample covariance matrix,

To compute the sample covariance matrix, C C nn, we have to first compute the differences, we have to first compute the differences

x

xkk

−

µµˆˆnn requiring dd subtraction for each vector requiring subtraction for each vector xxkk. Since we must do this for each of the. Since we must do this for each of the nn

vectors

vectors x xkk we have a total of we have a total of nd nd calculations required to compute all of the calculations required to compute all of the x xkk

−

µµˆˆnn diﬀerence diﬀerence

ve

vectorctors. s. WWe then have to compute then have to compute the outer products (e the outer products (xxkk

−

µµˆˆnn)()(xxkk

−

µµˆˆnn))

′′

, which require, which require

dd22 _calc_calculatio_{ulations to}_{ns to produce eac}_{produce each}_{h (in a}_{(in a naiv}_{naive implemen}_{e implementation}_tation)._{). Sinc}_Since_{e we hav}_{we havee n}_{n such outer}_{such outer}

products to compute computing them all requires

products to compute computing them all requires ndnd22 operations. We next, have to add alloperations. We next, have to add all of these

of these n n outer products together, resulting in another set of ( outer products together, resulting in another set of ( nn

₋

1)1)dd22 _{operations. Finally}_{operations. Finally}

dividing by the scalar

dividing by the scalar n n

₋

1 requires another1 requires another d d22 operations. Thus in total we haveoperations. Thus in total we have nd

nd + + nd nd22 + (+ (nn

₋

1)1)dd22++ d d22 = = 22ndnd22 ++ nd nd ,, calculations.

calculations. Part (b):

Part (b): We can derive recursive formulas for ˆ We can derive recursive formulas for ˆµµnn andand C C nn in the followi in the following wayng way. . First forFirst for

ˆˆ µ

µnn we note that we note that

ˆˆ µ µnn+1+1 == 11 n n + 1 + 1 n n+1+1



k k=1=1 x xkk == x xnn+1+1 n n + 1 + 1 + +



nn n n + 1 + 1



1 1 n n n n



k k=1=1 x xkk = = 11 n n + 1 + 1xxnn+1+1 + +



nn n n + 1 + 1



_µ_µ_ˆˆ n n..

(8)

Writing

Writing _n_n+1n₊₁n asas nn+1+1_n+1_n₊₁

−

11 = = 11

₋

_n_n+1₊₁11 the above becomes the above becomes ˆˆ µ µnn+1+1 == 11 n n + 1 + 1xxnn+1+1 + + ˆˆµµnn

−

11 n n + 1 + 1µµˆˆnn = = ˆˆµµnn + +



11 n n + 1 + 1



_((x_x n n+1+1

−

µµˆˆnn)) ,, as expected. as expected.

To derive the recursive update formula for the covariance matrix

To derive the recursive update formula for the covariance matrix C C nn we begin by recalling we begin by recalling

the deﬁnition of

the deﬁnition of C C nn of of

C C nn+1+1 = = 11 n n n n+1+1



k k=1=1 ((xxkk

−

µµˆˆnn+1+1)()(xxkk

−

µµˆˆnn+1+1))

′′

..

Now introducing the recursive deﬁnition of the mean ˆ

Now introducing the recursive deﬁnition of the mean ˆµµnn+1+1 as computed above we have that as computed above we have that

C C nn+1+1 == 11 n n n n+1+1



k k=1=1 ((xxkk

−

µµˆˆnn

−

11 n n + 1 + 1((xxnn+1+1

−

µµˆˆnn))())(xxkk

−

µµˆˆnn

−

11 n n + 1 + 1((xxnn+1+1

−

µµˆˆnn))))

′′

= = 11 n n n n+1+1



k k=1=1 ((xxkk

−

µµˆˆnn)()(xxkk

−

µµˆˆnn))

′′

−

11 n n((nn + 1) + 1) n n+1+1



k k=1=1 ((xxkk

−

µµˆˆnn)()(xxnn+1+1

−

µµˆˆnn))

′′

−

_n_n((n_{n + 1)}11_{+ 1)} n n+1+1



k k=1=1 ((xxnn+1+1

−

µµˆˆnn)()(xxkk

−

µµˆˆnn))

′′

−

11 n n((nn + 1) + 1)22 n n+1+1



k k=1=1 ((xxnn+1+1

−

µµˆˆnn))

′′

. . Since ˆ

Since ˆµµnn = = _n_n11



nn_k_k=1₌₁ x xkk we see that we see that

n nˆˆµµnn

−

11 n n n n



k k=1=1 x xkk = = 0 0 oorr 11 n n n n



k k=1=1 ((ˆˆµµnn

−

xxkk) ) = = 00 ,,

Using this fact the second and third terms the above can be greatly simpliﬁed. For example Using this fact the second and third terms the above can be greatly simpliﬁed. For example we have that we have that n n+1+1



k k=1=1 ((xxkk

−

µµˆˆnn) =) = n n



k k=1=1 ((xxkk

−

µµˆˆnn) + () + (xxnn+1+1

−

µµˆˆnn) ) == x xnn+1+1

−

µµˆˆnn

Using this identity and the fact that the fourth term has a summand that does not depend Using this identity and the fact that the fourth term has a summand that does not depend on

on k k we have we have C C nn+1+1 given by given by

C C nn+1+1 == n n

₋

11 n n



11 n n

₋

11 n n



k k=1=1 ((xxkk

−

µµˆˆnn)()(xxkk

−

µµˆˆnn))

′′

+ + 11 n n

₋

11((xxnn+1+1

−

µµˆˆnn))

′′



−

_n_n((n_{n + 1)}22_{+ 1)}((xxn+1n+1

−

µµˆˆnn) +) + 11 n n((nn + 1) + 1)((xxnn+1+1

−

µµˆˆnn))

′′

= =



11

₋

11 n n



_C_C n n + +



11 n n

−

11 n n((nn + 1) + 1)



_((x_x n n+1+1

−

µµˆˆnn))

′′

= = C C nn

−

11 n nC C nn + + 11 n n



nn + 1 + 1

₋

11 n n + 1 + 1



_((x_x n n+1+1

−

µµˆˆnn))

′′

= =



nn

−

11 n n



_C_C n n + + 11 n n + 1 + 1((xxnn+1+1

−

µµˆˆnn)(x)(xnn+1+1

−

µµˆˆnn))

′′

..

(9)

which is the result in the book. which is the result in the book. Part (c):

Part (c): To compute ˆ To compute ˆµµnn using the recurrence formulation we have using the recurrence formulation we have dd subtractions to com- subtractions to

com-pute

pute xxnn

−

ˆ ˆµµnn

₋

11 and the scalar division requires another and the scalar division requires another dd operati operations. ons. FinallFinallyy, , additaddition byion by

the ˆ

the ˆµµnn

₋

11 requires another requires another dd operations giving in total 3 operations giving in total 3dd operations. To be compared with operations. To be compared with

the (

the (dd + 1) + 1)nn operations in the non-recursive case. operations in the non-recursive case.

To compute the number of operations required to compute

To compute the number of operations required to compute C C nn recursively we have dd com- recursively we have

com-putations to compute the diﬀerence

putations to compute the diﬀerence x xnn

−

µµˆˆnn

₋

11 and then and then d d22 operations to compute the outeroperations to compute the outer

product (

product (xxnn

−

µµˆˆnn

₋

11)(x)(xnn

−

µµˆˆnn

₋

11))

′′

, another, another d d22 operations to multiplyoperations to multiply C C nn

₋

11 by by



nn_n_n

−

₋

22₁₁



and ﬁnallyand ﬁnally

dd22 _{operations to add this product to the other one. Thus in total we have}_{operations to add this product to the other one. Thus in total we have}

dd + + d d22 ++ d d22++ d d22 ++ d d22 = = 44dd22 ++ d d .. To be compare with 2

To be compare with 2ndnd22 ₊_+ nd_{nd in the non}_{in the non re}_recur_cursiv_sive_{e cas}_case._{e. F}_{For large}_{or large n}_{n the non-recursive}_{the non-recursive}

calculations are much more computationally intensive. calculations are much more computationally intensive.

(10)

Chapter 6 (Multilayer Neural Networks)

Problem 39 (derivatives of matrix inner products) Problem 39 (derivatives of matrix inner products) Part (a):

Part (a): We present a slightly diﬀerent method to prove this here. We desire the gradient We present a slightly diﬀerent method to prove this here. We desire the gradient (or derivative) of the function

(or derivative) of the function

φ

_≡

xxT T KKxx .. The

The ii-th component of this derivative is given by-th component of this derivative is given by ∂φ ∂φ ∂x ∂xii == ∂ ∂ ∂x ∂xii



xx T T _K_Kx_x



== e eT T _ii K x Kx + + x xT T KeKeii where

where e eii is the i is the i-th elementary basis function for-th elementary basis function for RRnn, i.e. it has a 1 in the, i.e. it has a 1 in the i i-th position and-th position and

zeros everywhere else. Now since zeros everywhere else. Now since

((eeT T _ii K K xx))T T == x xT T K K T T eeT T _ii == e eT T _ii K K xx ,, the above becomes

the above becomes

∂φ ∂φ ∂x ∂xii

=

= e eT T _ii K K xx + + e eT T _ii K K T T xx = = e eT T _ii

_

((K K + + K K T T ))xx

_

.. Since multiplying by

Since multiplying by eeT T

ii on the left selects the on the left selects the ii-th row from the expression to its right we-th row from the expression to its right we

see that the full gradient expression is given by see that the full gradient expression is given by

∇

φφ = = ((K K + + K K T T ))xx ,, as request

as requested ed in the in the texttext. . Note that this Note that this exprexpressiession on can also bcan also be e proproveved d easileasily y by writinby writing g eaceachh term in component notation as suggested in the text.

term in component notation as suggested in the text. Part (b):

Part (b): If If K K is symmetric then since is symmetric then since K K T T == K K the above expression simpliﬁes to the above expression simpliﬁes to

∇

φφ = = 22KxKx ,, as claimed in the book.

as claimed in the book.

Chapter 9 (Algorithm-Independent Machine Learning)

Problem 9 (sums of binomial coeﬃcients) Problem 9 (sums of binomial coeﬃcients) Part (a):

Part (a): We ﬁrst recall the binomial theorem which is We ﬁrst recall the binomial theorem which is ((xx + + y y))nn == n n



k k=0=0



nn kk



xx k k_yynn

₋

kk_.. If we let

If we let xx = 1 = 1 anandd y y = 1 then = 1 then xx + + y y = 2 and the sum above becomes = 2 and the sum above becomes 22nn == n n



k k=0=0



nn kk



,,

(11)

whi

whicch h is the is the stastateted d ideidentntitityy. . As an As an asiaside, note also that de, note also that if we if we letlet xx ==

₋

1 and1 and yy = 1 then = 1 then x

x + + y y = 0 and we obtain another interesting identity involving the binomial coeﬃcients = 0 and we obtain another interesting identity involving the binomial coeﬃcients 0 0 == n n



k k=0=0



nn kk



₍₍

−

1)1)kk ..

Problem 23 (the average of the leave one out means) Problem 23 (the average of the leave one out means) We deﬁne the leave one out mean

We deﬁne the leave one out mean µ µ((ii)) asas

µ µ((ii)) = = 11 n n

₋

11



j j

_

==ii x x j j . .

This can obviously be computed for every

This can obviously be computed for every i i. Now we want to consider the mean of the leave. Now we want to consider the mean of the leave one out means. To do so ﬁrst notice that

one out means. To do so ﬁrst notice that µ µ((ii)) can be written as can be written as

µ µ((ii)) == 11 n n

₋

11



_j_j



= =ii x x j j = = 11 n n

₋

11



nn



j j=1=1 x x j j

−

xxii



= = 11 n n

₋

11



n n



11 n n



_

nn j j=1=1 x x j j

−

xxii



= = nn n n

₋

11µµˆˆ

−

x xii n n

₋

11 . .

With this we see that the mean of all of the leave one out means is given by With this we see that the mean of all of the leave one out means is given by

11 n n n n



ii=1=1 µ µ((ii)) == 11 n n n n



ii=1=1



nn n n

₋

11µµˆˆ

−

x xii n n

₋

11



= = nn n n

₋

11µµˆˆ

−

11 n n

₋

11



11 n n



_

nn ii=1=1 x xii = = nn n n

₋

11µµˆˆ

−

ˆˆ µ µ n n

₋

11 = = ˆˆµµ ,, as expected. as expected.

Problem 40 (maximum likelihood estimation with a binomial random variable) Problem 40 (maximum likelihood estimation with a binomial random variable) Since

Since kk has a binomial distribution with parameters ( has a binomial distribution with parameters (nn

′′

_{,, pp) we have that}_{) we have that}

P P ((kk) =) =



nn

′′

kk



p p k k₍₁₍₁

−

p p))nn′′

−

kk..

(12)

Then the

Then the pp that maximizes this expression is given by taking the derivative of the above that maximizes this expression is given by taking the derivative of the above (with respect to

(with respect to p p) setting the resulting expression equal to zero and solving for) setting the resulting expression equal to zero and solving for p p. We ﬁnd. We ﬁnd that this derivative is given by

that this derivative is given by dd dp dpP P ((kk) ) ==



nn

′′

kk



kpkp k k

₋

11₍₁₍₁

−

p p))nn′′

−

kk ++



nn

′′

kk



p p k k₍₁₍₁

−

p p))nn′′

−

kk

−

11((nn

′′

₋

kk)()(

₋

1)1) .. Which when set equal to zero and solve for

Which when set equal to zero and solve for p p we ﬁnd that we ﬁnd that p p = = _n_nkk_′′, or the empirical counting, or the empirical counting estimate of the probability of success as we were asked to show.