Solutions to Selected Problems In:
Pattern Classification
by Duda, Hart, Stork
John L. Weatherwax
February 24, 2008
Problem Solutions
Chapter 2 (Bayesian Decision Theory)
Problem 11 (randomized rules)

Part (a): Let $R(x)$ be the average risk given measurement $x$. Since this risk depends on the action and correspondingly on the probability of taking such action, it can be expressed as
$$R(x) = \sum_{i=1}^{m} R(a_i|x)\,P(a_i|x).$$
Then the total risk (over all possible $x$) is given by the average of $R(x)$, or
$$R = \int_{\Omega_x} \left( \sum_{i=1}^{m} R(a_i|x)\,P(a_i|x) \right) p(x)\,dx.$$

Part (b): One would pick
$$P(a_i|x) = \begin{cases} 1 & i = \operatorname{ArgMin}_j R(a_j|x) \\ 0 & i = \text{anything else.} \end{cases}$$
Thus for every $x$ pick the action that minimizes the local risk. This amounts to a deterministic rule, i.e. there is no benefit to using randomness.

Part (c): Yes, since in a deterministic decision rule setting randomizing some decisions might further improve performance. But how this would be done would be difficult to say.
Problem 13

Let
$$\lambda(\alpha_i|\omega_j) = \begin{cases} 0 & i = j \\ \lambda_r & i = c+1 \\ \lambda_s & \text{otherwise.} \end{cases}$$
Then since the definition of risk is expected loss we have
$$R(\alpha_i|x) = \sum_{j=1}^{c} \lambda(\alpha_i|\omega_j)\,P(\omega_j|x), \quad i = 1, 2, \ldots, c \tag{1}$$
$$R(\alpha_{c+1}|x) = \lambda_r \tag{2}$$
Now for $i = 1, 2, \ldots, c$ the risk can be simplified as
$$R(\alpha_i|x) = \lambda_s \sum_{j=1,\, j \ne i}^{c} P(\omega_j|x) = \lambda_s\,(1 - P(\omega_i|x)).$$
Our decision function is to choose the action with the smallest risk. This means that we pick the reject option if and only if
$$\lambda_r < \lambda_s\,(1 - P(\omega_i|x)) \quad \forall\, i = 1, 2, \ldots, c.$$
This is equivalent to
$$P(\omega_i|x) < 1 - \frac{\lambda_r}{\lambda_s} \quad \forall\, i = 1, 2, \ldots, c.$$
This can be interpreted as: if all posterior probabilities are below the threshold $1 - \frac{\lambda_r}{\lambda_s}$ we should reject. Otherwise we pick the action $i$ such that
$$\lambda_s\,(1 - P(\omega_i|x)) < \lambda_s\,(1 - P(\omega_j|x)) \quad \forall\, j = 1, 2, \ldots, c,\ j \ne i.$$
This is equivalent to
$$P(\omega_i|x) > P(\omega_j|x) \quad \forall\, j \ne i.$$
This can be interpreted as selecting the class with the largest posterior probability. In summary we should reject if and only if
$$P(\omega_i|x) < 1 - \frac{\lambda_r}{\lambda_s} \quad \forall\, i = 1, 2, \ldots, c. \tag{3}$$
Note that if $\lambda_r = 0$ there is no loss for rejecting (equivalent benefit/reward for correctly classifying) and we always reject. If $\lambda_r > \lambda_s$, the loss from rejecting is greater than the loss from misclassifying and we should never reject. This can be seen from the above since $\frac{\lambda_r}{\lambda_s} > 1$, so $1 - \frac{\lambda_r}{\lambda_s} < 0$, and Equation (3) can then never be satisfied because posterior probabilities are nonnegative.
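The reject rule of Problem 13 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `decide_with_reject` and its argument layout are hypothetical, and the posteriors are assumed to be already computed and to sum to one.

```python
def decide_with_reject(posteriors, lambda_r, lambda_s):
    """Return the index of the chosen class, or None to reject.

    posteriors: list of P(omega_i | x), assumed to sum to 1 (hypothetical input).
    lambda_r:   loss incurred by rejecting.
    lambda_s:   loss incurred by a substitution (misclassification) error.
    """
    # Conditional risk of choosing class i is lambda_s * (1 - P(omega_i | x)).
    risks = [lambda_s * (1.0 - p) for p in posteriors]
    # The smallest class risk belongs to the class with the largest posterior.
    best = min(range(len(posteriors)), key=lambda i: risks[i])
    # Reject only when rejecting is strictly cheaper than the best class.
    if lambda_r < risks[best]:
        return None
    return best
```

With `lambda_r = 0.3` and `lambda_s = 1.0` the threshold is 0.7, so a posterior vector whose largest entry is 0.4 is rejected, while one whose largest entry is 0.8 is classified.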
Problem 31 (the probability of error)

Part (a): We can assume without loss of generality that $\mu_1 < \mu_2$. If this is not the case initially, change the labels of class one and class two so that it is. We begin by recalling the probability of error $P_e$
$$P_e = P(\hat{H} = H_1|H_2)\,P(H_2) + P(\hat{H} = H_2|H_1)\,P(H_1) = \left( \int_{-\infty}^{\xi} p(x|H_2)\,dx \right) P(H_2) + \left( \int_{\xi}^{\infty} p(x|H_1)\,dx \right) P(H_1).$$
Here $\xi$ is the decision boundary between the two classes, i.e. we classify $x$ as class $H_1$ when $x < \xi$ and as class $H_2$ when $x > \xi$, and $P(H_i)$ are the priors for each of the two classes. We begin by finding the Bayes decision boundary $\xi$ under the minimum error criterion. From the discussion in the book, the decision boundary is given by a likelihood ratio test. That is, we decide class $H_2$ if
$$\frac{p(x|H_2)}{p(x|H_1)} \ge \frac{P(H_1)}{P(H_2)} \left( \frac{C_{21} - C_{11}}{C_{12} - C_{22}} \right),$$
and classify the point as a member of $H_1$ otherwise. The decision boundary $\xi$ is the point at which this inequality is an equality. If we use a minimum probability of error criterion the costs $C_{ij}$ are given by $C_{ij} = 1 - \delta_{ij}$, and when $P(H_1) = P(H_2) = \frac{1}{2}$ the decision boundary reduces to
$$p(\xi|H_1) = p(\xi|H_2).$$
If each conditional distribution is Gaussian, $p(x|H_i) = N(\mu_i, \sigma_i^2)$, then the above expression becomes
$$\frac{1}{\sqrt{2\pi}\,\sigma_1} \exp\left\{ -\frac{1}{2} \frac{(\xi - \mu_1)^2}{\sigma_1^2} \right\} = \frac{1}{\sqrt{2\pi}\,\sigma_2} \exp\left\{ -\frac{1}{2} \frac{(\xi - \mu_2)^2}{\sigma_2^2} \right\}.$$
If we assume (for simplicity) that $\sigma_1 = \sigma_2 = \sigma$ the above expression simplifies to $(\xi - \mu_1)^2 = (\xi - \mu_2)^2$, or on taking the square root of both sides we get $\xi - \mu_1 = \pm(\xi - \mu_2)$. If we take the plus sign in this equation we get the contradiction that $\mu_1 = \mu_2$. Taking the minus sign and solving for $\xi$ we find our decision threshold given by
$$\xi = \frac{1}{2}(\mu_1 + \mu_2).$$
Again since we have uniform priors ($P(H_1) = P(H_2) = \frac{1}{2}$), we have an error probability given by
$$2P_e = \int_{-\infty}^{\xi} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{ -\frac{1}{2} \frac{(x - \mu_2)^2}{\sigma^2} \right\} dx + \int_{\xi}^{\infty} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{ -\frac{1}{2} \frac{(x - \mu_1)^2}{\sigma^2} \right\} dx,$$
or
$$2\sqrt{2\pi}\,\sigma P_e = \int_{-\infty}^{\xi} \exp\left\{ -\frac{1}{2} \frac{(x - \mu_2)^2}{\sigma^2} \right\} dx + \int_{\xi}^{\infty} \exp\left\{ -\frac{1}{2} \frac{(x - \mu_1)^2}{\sigma^2} \right\} dx.$$
To evaluate the first integral use the substitution $v = \frac{x - \mu_2}{\sqrt{2}\,\sigma}$ so $dv = \frac{dx}{\sqrt{2}\,\sigma}$, and in the second use $v = \frac{x - \mu_1}{\sqrt{2}\,\sigma}$ so $dv = \frac{dx}{\sqrt{2}\,\sigma}$. This then gives
$$2\sqrt{2\pi}\,\sigma P_e = \sigma\sqrt{2} \int_{-\infty}^{\frac{\mu_1 - \mu_2}{2\sqrt{2}\,\sigma}} e^{-v^2}\,dv + \sigma\sqrt{2} \int_{-\frac{\mu_1 - \mu_2}{2\sqrt{2}\,\sigma}}^{\infty} e^{-v^2}\,dv.$$
Both integrals tend to zero, and hence $P_e \to 0$, as $\frac{|\mu_1 - \mu_2|}{\sigma} \to +\infty$. This can happen if $\mu_1$ and $\mu_2$ get far apart or $\sigma$ shrinks to zero. In each case the classification problem gets easier.
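As a numerical sanity check on Problem 31, the sketch below compares a closed form for $P_e$ against direct integration of the two error integrals. The closed form $P_e = \frac{1}{2}\,\mathrm{erfc}\!\left(\frac{\mu_2 - \mu_1}{2\sqrt{2}\,\sigma}\right)$ is our own evaluation of those integrals via the complementary error function, not a line taken from the text, and all function names are hypothetical. Equal priors, equal variances, and $\mu_1 < \mu_2$ are assumed.

```python
import math

def error_closed_form(mu1, mu2, sigma):
    """P_e = 0.5 * erfc(|mu2 - mu1| / (2 sqrt(2) sigma)) for equal priors."""
    a = abs(mu2 - mu1) / (2.0 * math.sqrt(2.0) * sigma)
    return 0.5 * math.erfc(a)

def gaussian_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (math.sqrt(2.0 * math.pi) * sigma)

def error_numeric(mu1, mu2, sigma, steps=20000):
    """Midpoint-rule estimate of P_e, assuming mu1 < mu2 and equal priors.

    The infinite tails are truncated 12 sigma away from the boundary xi,
    where the remaining Gaussian mass is negligible.
    """
    xi = 0.5 * (mu1 + mu2)
    h = 12.0 * sigma / steps
    # Mass of class H2 that falls below xi (H2 points labeled H1).
    miss_h2 = h * sum(gaussian_pdf(xi - (k + 0.5) * h, mu2, sigma)
                      for k in range(steps))
    # Mass of class H1 that falls above xi (H1 points labeled H2).
    miss_h1 = h * sum(gaussian_pdf(xi + (k + 0.5) * h, mu1, sigma)
                      for k in range(steps))
    return 0.5 * miss_h2 + 0.5 * miss_h1
```

Increasing the separation `mu2 - mu1` or shrinking `sigma` drives both functions toward zero, in line with the limiting statement of the problem.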
Problem 32 (probability of error in higher dimensions)

As in Problem 31, the classification decision boundary is now the set of points $x$ that satisfy
$$\|x - \mu_1\|^2 = \|x - \mu_2\|^2.$$
Expanding each side of this expression we find
$$\|x\|^2 - 2x^t\mu_1 + \|\mu_1\|^2 = \|x\|^2 - 2x^t\mu_2 + \|\mu_2\|^2.$$
Canceling the common $\|x\|^2$ from each side and grouping the terms linear in $x$ we have that
$$x^t(\mu_1 - \mu_2) = \frac{1}{2}\left( \|\mu_1\|^2 - \|\mu_2\|^2 \right),$$
which is the equation for a hyperplane.
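The equivalence between the nearest-mean rule and the hyperplane test of Problem 32 can be checked numerically. The following is a minimal sketch with hypothetical function names; vectors are plain Python lists, and a point exactly on the boundary is assigned to class 2 by both functions (an arbitrary tie-breaking assumption).

```python
def nearest_mean(x, mu1, mu2):
    """Return 1 if x is strictly closer to mu1 than to mu2, else 2."""
    d1 = sum((a - b) ** 2 for a, b in zip(x, mu1))
    d2 = sum((a - b) ** 2 for a, b in zip(x, mu2))
    return 1 if d1 < d2 else 2

def hyperplane_side(x, mu1, mu2):
    """Return 1 if x^t (mu1 - mu2) > (||mu1||^2 - ||mu2||^2) / 2, else 2."""
    lhs = sum(a * (m1 - m2) for a, m1, m2 in zip(x, mu1, mu2))
    rhs = 0.5 * (sum(m * m for m in mu1) - sum(m * m for m in mu2))
    return 1 if lhs > rhs else 2
```

Both functions return the same label for every point, because $\|x - \mu_1\|^2 < \|x - \mu_2\|^2$ rearranges to exactly the linear inequality used in `hyperplane_side`.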
Problem 34 (error bounds on non-Gaussian data)

Part (a): Since the Bhattacharyya bound is a specialization of the Chernoff bound, we begin by considering the Chernoff bound. Specifically, using the bound $\min(a, b) \le a^{\beta} b^{1-\beta}$, for $a > 0$, $b > 0$, and $0 \le \beta \le 1$, one can show that
$$P(\text{error}) \le P(\omega_1)^{\beta} P(\omega_2)^{1-\beta} e^{-k(\beta)} \quad \text{for } 0 \le \beta \le 1.$$
Here the expression $k(\beta)$ is given by
$$k(\beta) = \frac{\beta(1-\beta)}{2} (\mu_2 - \mu_1)^t \left[ \beta\Sigma_1 + (1-\beta)\Sigma_2 \right]^{-1} (\mu_2 - \mu_1) + \frac{1}{2} \ln \frac{|\beta\Sigma_1 + (1-\beta)\Sigma_2|}{|\Sigma_1|^{\beta} |\Sigma_2|^{1-\beta}}.$$
When we desire to compute the Bhattacharyya bound we assign $\beta = 1/2$. For a two class scalar problem with symmetric means and equal variances, that is where $\mu_1 = -\mu$, $\mu_2 = +\mu$, and $\sigma_1 = \sigma_2 = \sigma$, the expression for $k(\beta)$ above becomes
$$k(\beta) = \frac{\beta(1-\beta)}{2} (2\mu) \left[ \beta\sigma^2 + (1-\beta)\sigma^2 \right]^{-1} (2\mu) + \frac{1}{2} \ln \frac{|\beta\sigma^2 + (1-\beta)\sigma^2|}{\sigma^{2\beta} \sigma^{2(1-\beta)}} = \frac{2\beta(1-\beta)\mu^2}{\sigma^2}.$$
Since the above expression is valid for all $\beta$ in the range $0 \le \beta \le 1$, we can obtain the tightest possible bound by computing the maximum of the expression $k(\beta)$ (since the dominant $\beta$ expression is $e^{-k(\beta)}$). To evaluate this maximum of $k(\beta)$ we take the derivative with respect to $\beta$ and set that result equal to zero, obtaining
$$\frac{d}{d\beta} \left( \frac{2\beta(1-\beta)\mu^2}{\sigma^2} \right) = 0,$$
which gives $1 - 2\beta = 0$ or $\beta = 1/2$. Thus in this case the Chernoff bound equals the Bhattacharyya bound. When $\beta = 1/2$ we have the following bound on our error probability
$$P(\text{error}) \le \sqrt{P(\omega_1) P(\omega_2)}\; e^{-\frac{\mu^2}{2\sigma^2}}.$$
To further simplify things, if we assume that our variance is such that $\sigma^2 = \mu^2$, and that our priors over class are taken to be uniform ($P(\omega_1) = P(\omega_2) = 1/2$), the above becomes
$$P(\text{error}) \le \frac{1}{2} e^{-1/2} \approx 0.3033.$$
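The scalar Chernoff bound of this problem is easy to evaluate numerically. The sketch below uses hypothetical function names and a simple grid search over $\beta$ (an assumption made for illustration; the text finds the optimum analytically). The function `true_error` uses the standard closed form for two equal-variance Gaussians with equal priors and means $\pm\mu$, which is not derived in this problem.

```python
import math

def k_beta(beta, mu, sigma):
    """k(beta) = 2 beta (1 - beta) mu^2 / sigma^2 for means -mu, +mu."""
    return 2.0 * beta * (1.0 - beta) * mu ** 2 / sigma ** 2

def chernoff_bound(beta, mu, sigma, p1=0.5, p2=0.5):
    """P(omega_1)^beta P(omega_2)^(1 - beta) exp(-k(beta))."""
    return p1 ** beta * p2 ** (1.0 - beta) * math.exp(-k_beta(beta, mu, sigma))

def best_beta(mu, sigma, p1=0.5, p2=0.5, grid=1001):
    """Grid search for the beta in [0, 1] that gives the tightest bound."""
    betas = [i / (grid - 1) for i in range(grid)]
    return min(betas, key=lambda b: chernoff_bound(b, mu, sigma, p1, p2))

def true_error(mu, sigma):
    """Exact Bayes error for equal priors: 0.5 * erfc(mu / (sigma sqrt(2)))."""
    return 0.5 * math.erfc(mu / (sigma * math.sqrt(2.0)))
```

For `mu = sigma = 1` with uniform priors the search returns `beta = 0.5`, and `chernoff_bound(0.5, 1.0, 1.0)` evaluates to $\frac{1}{2}e^{-1/2}$, which sits above the exact error from `true_error(1.0, 1.0)` as a bound should.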
Chapter 3 (Maximum Likelihood and Bayesian Estimation)

Problem 34

In the problem statement we are told that to operate on data of "size" $n$ requires $f(n)$ numeric calculations. Assuming that each calculation requires $10^{-9}$ seconds of time to compute, we can process a data structure of size $n$ in $10^{-9} f(n)$ seconds. If we want our results finished by a time $T$ then the largest data structure that we can process must satisfy
$$10^{-9} f(n) \le T \quad \text{or} \quad n \le f^{-1}(10^{9}\,T).$$
Converting the ranges of times into seconds we find that
$$T = \{1\ \text{sec},\ 1\ \text{hour},\ 1\ \text{day},\ 1\ \text{year}\} = \{1\ \text{sec},\ 3600\ \text{sec},\ 86400\ \text{sec},\ 31536000\ \text{sec}\}.$$
Using a very naive algorithm we can compute the largest $n$ such that $f(n) \le 10^{9}\,T$ by simply incrementing $n$ repeatedly. This is done in the Matlab script prob 34 chap 3.m and the results are displayed in Table XXX.
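The search for the largest feasible $n$ can be sketched in Python as follows. This is not the Matlab script mentioned above: instead of incrementing $n$ one step at a time it swaps in doubling followed by bisection, which gives the same answer for any nondecreasing $f$ but finishes quickly even when $n$ is large. The function name `largest_n` and the growth functions in `EXAMPLE_GROWTH` are illustrative assumptions, not necessarily those of the problem statement.

```python
import math

# Time budgets in seconds, as listed in the text.
SECONDS = {"1 sec": 1, "1 hour": 3600, "1 day": 86400, "1 year": 31536000}

# Hypothetical growth functions f(n), chosen only for illustration.
EXAMPLE_GROWTH = {
    "n": lambda n: n,
    "n log2 n": lambda n: n * math.log2(n),
    "n^2": lambda n: n ** 2,
    "n^3": lambda n: n ** 3,
    "2^n": lambda n: 2 ** n,
}

def largest_n(f, budget_ops):
    """Largest integer n >= 1 with f(n) <= budget_ops (0 if there is none).

    Assumes f is nondecreasing on the positive integers.
    """
    if f(1) > budget_ops:
        return 0
    hi = 1
    while f(2 * hi) <= budget_ops:   # doubling phase
        hi *= 2
    lo, hi = hi, 2 * hi              # invariant: f(lo) <= budget < f(hi)
    while hi - lo > 1:               # bisection phase
        mid = (lo + hi) // 2
        if f(mid) <= budget_ops:
            lo = mid
        else:
            hi = mid
    return lo

def build_table():
    """Map each (growth name, time label) pair to the largest feasible n."""
    return {(name, label): largest_n(f, 10 ** 9 * secs)
            for name, f in EXAMPLE_GROWTH.items()
            for label, secs in SECONDS.items()}
```

Here `budget_ops` is the number of calculations available, i.e. $10^{9}\,T$ for a budget of $T$ seconds.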
Problem 35 (recursive vs. direct calculation of means and covariances)

Part (a): To compute $\hat{\mu}_n$ we have to add $n$, $d$-dimensional vectors, resulting in $nd$ computations. We then have to divide each component by the scalar $n$, resulting in another $n$ computations. Thus in total we have
$$nd + n = (1 + d)n$$
calculations.

To compute the sample covariance matrix, $C_n$, we have to first compute the differences $x_k - \hat{\mu}_n$, requiring $d$ subtractions for each vector $x_k$. Since we must do this for each of the $n$ vectors $x_k$ we have a total of $nd$ calculations required to compute all of the $x_k - \hat{\mu}_n$ difference vectors. We then have to compute the outer products $(x_k - \hat{\mu}_n)(x_k - \hat{\mu}_n)'$, which require $d^2$ calculations to produce each (in a naive implementation). Since we have $n$ such outer products to compute, computing them all requires $nd^2$ operations. We next have to add all of these $n$ outer products together, resulting in another set of $(n-1)d^2$ operations. Finally dividing by the scalar $n - 1$ requires another $d^2$ operations. Thus in total we have
$$nd + nd^2 + (n-1)d^2 + d^2 = 2nd^2 + nd$$
calculations.

Part (b): We can derive recursive formulas for $\hat{\mu}_n$ and $C_n$ in the following way. First for $\hat{\mu}_n$ we note that
$$\hat{\mu}_{n+1} = \frac{1}{n+1} \sum_{k=1}^{n+1} x_k = \frac{x_{n+1}}{n+1} + \frac{n}{n+1} \left( \frac{1}{n} \sum_{k=1}^{n} x_k \right) = \frac{1}{n+1} x_{n+1} + \frac{n}{n+1} \hat{\mu}_n.$$
Writing $\frac{n}{n+1}$ as $\frac{n+1-1}{n+1} = 1 - \frac{1}{n+1}$ the above becomes
$$\hat{\mu}_{n+1} = \frac{1}{n+1} x_{n+1} + \hat{\mu}_n - \frac{1}{n+1} \hat{\mu}_n = \hat{\mu}_n + \frac{1}{n+1} (x_{n+1} - \hat{\mu}_n),$$
as expected.

To derive the recursive update formula for the covariance matrix $C_n$ we begin by recalling the definition of $C_{n+1}$,
$$C_{n+1} = \frac{1}{n} \sum_{k=1}^{n+1} (x_k - \hat{\mu}_{n+1})(x_k - \hat{\mu}_{n+1})'.$$
Now introducing the recursive definition of the mean $\hat{\mu}_{n+1}$ as computed above we have that
$$C_{n+1} = \frac{1}{n} \sum_{k=1}^{n+1} \left( x_k - \hat{\mu}_n - \frac{1}{n+1}(x_{n+1} - \hat{\mu}_n) \right) \left( x_k - \hat{\mu}_n - \frac{1}{n+1}(x_{n+1} - \hat{\mu}_n) \right)'$$
$$= \frac{1}{n} \sum_{k=1}^{n+1} (x_k - \hat{\mu}_n)(x_k - \hat{\mu}_n)' - \frac{1}{n(n+1)} \sum_{k=1}^{n+1} (x_k - \hat{\mu}_n)(x_{n+1} - \hat{\mu}_n)'$$
$$\quad - \frac{1}{n(n+1)} \sum_{k=1}^{n+1} (x_{n+1} - \hat{\mu}_n)(x_k - \hat{\mu}_n)' + \frac{1}{n(n+1)^2} \sum_{k=1}^{n+1} (x_{n+1} - \hat{\mu}_n)(x_{n+1} - \hat{\mu}_n)'.$$
Since $\hat{\mu}_n = \frac{1}{n} \sum_{k=1}^{n} x_k$ we see that
$$\hat{\mu}_n - \frac{1}{n} \sum_{k=1}^{n} x_k = 0 \quad \text{or} \quad \frac{1}{n} \sum_{k=1}^{n} (\hat{\mu}_n - x_k) = 0.$$
Using this fact the second and third terms in the above can be greatly simplified. For example we have that
$$\sum_{k=1}^{n+1} (x_k - \hat{\mu}_n) = \sum_{k=1}^{n} (x_k - \hat{\mu}_n) + (x_{n+1} - \hat{\mu}_n) = x_{n+1} - \hat{\mu}_n.$$
Using this identity and the fact that the fourth term has a summand that does not depend on $k$ we have $C_{n+1}$ given by
$$C_{n+1} = \frac{n-1}{n} \left( \frac{1}{n-1} \sum_{k=1}^{n} (x_k - \hat{\mu}_n)(x_k - \hat{\mu}_n)' \right) + \frac{1}{n} (x_{n+1} - \hat{\mu}_n)(x_{n+1} - \hat{\mu}_n)'$$
$$\quad - \frac{2}{n(n+1)} (x_{n+1} - \hat{\mu}_n)(x_{n+1} - \hat{\mu}_n)' + \frac{1}{n(n+1)} (x_{n+1} - \hat{\mu}_n)(x_{n+1} - \hat{\mu}_n)'$$
$$= \left( 1 - \frac{1}{n} \right) C_n + \left( \frac{1}{n} - \frac{1}{n(n+1)} \right) (x_{n+1} - \hat{\mu}_n)(x_{n+1} - \hat{\mu}_n)'$$
$$= C_n - \frac{1}{n} C_n + \frac{1}{n} \left( \frac{n+1-1}{n+1} \right) (x_{n+1} - \hat{\mu}_n)(x_{n+1} - \hat{\mu}_n)'$$
$$= \left( \frac{n-1}{n} \right) C_n + \frac{1}{n+1} (x_{n+1} - \hat{\mu}_n)(x_{n+1} - \hat{\mu}_n)',$$
which is the result in the book.

Part (c): To compute $\hat{\mu}_n$ using the recurrence formulation we have $d$ subtractions to compute $x_n - \hat{\mu}_{n-1}$, and the scalar division requires another $d$ operations. Finally, the addition of $\hat{\mu}_{n-1}$ requires another $d$ operations, giving in total $3d$ operations. This is to be compared with the $(d+1)n$ operations in the non-recursive case.

To compute the number of operations required to compute $C_n$ recursively we have $d$ computations to compute the difference $x_n - \hat{\mu}_{n-1}$ and then $d^2$ operations to compute the outer product $(x_n - \hat{\mu}_{n-1})(x_n - \hat{\mu}_{n-1})'$, another $d^2$ operations to scale this outer product by $\frac{1}{n}$, another $d^2$ operations to multiply $C_{n-1}$ by $\frac{n-2}{n-1}$, and finally $d^2$ operations to add this product to the other one. Thus in total we have
$$d + d^2 + d^2 + d^2 + d^2 = 4d^2 + d.$$
This is to be compared with $2nd^2 + nd$ in the non-recursive case. For large $n$ the non-recursive calculations are much more computationally intensive.
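The recursive updates of Problem 35 can be checked against the direct formulas with a short pure-Python sketch. The function names are hypothetical, vectors are plain lists, and $C_1$ is taken to be the zero matrix as an assumption; this is harmless because its weight $\frac{n-1}{n}$ vanishes at $n = 1$.

```python
def direct_mean_cov(xs):
    """Sample mean and unbiased sample covariance computed from scratch."""
    n, d = len(xs), len(xs[0])
    mu = [sum(x[i] for x in xs) / n for i in range(d)]
    cov = [[sum((x[i] - mu[i]) * (x[j] - mu[j]) for x in xs) / (n - 1)
            for j in range(d)] for i in range(d)]
    return mu, cov

def recursive_update(n, mu, cov, x_new):
    """Given the mean and covariance of n samples, absorb sample n + 1."""
    d = len(mu)
    diff = [x_new[i] - mu[i] for i in range(d)]
    # mu_{n+1} = mu_n + (x_{n+1} - mu_n) / (n + 1)
    mu_next = [mu[i] + diff[i] / (n + 1) for i in range(d)]
    # C_{n+1} = ((n - 1) / n) C_n + (x_{n+1} - mu_n)(x_{n+1} - mu_n)' / (n + 1)
    cov_next = [[(n - 1) / n * cov[i][j] + diff[i] * diff[j] / (n + 1)
                 for j in range(d)] for i in range(d)]
    return mu_next, cov_next

def recursive_mean_cov(xs):
    """Run the recursion over all samples, starting from the first one."""
    d = len(xs[0])
    mu = list(xs[0])
    cov = [[0.0] * d for _ in range(d)]  # assumed C_1 = 0
    for n, x_new in enumerate(xs[1:], start=1):
        mu, cov = recursive_update(n, mu, cov, x_new)
    return mu, cov
```

Each call to `recursive_update` costs a number of operations that depends on $d$ only, whereas `direct_mean_cov` revisits all $n$ samples, which mirrors the operation counts in part (c).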
Chapter 6 (Multilayer Neural Networks)

Problem 39 (derivatives of matrix inner products)

Part (a): We present a slightly different method to prove this here. We desire the gradient (or derivative) of the function
$$\phi \equiv x^T K x.$$
The $i$-th component of this derivative is given by
$$\frac{\partial \phi}{\partial x_i} = \frac{\partial}{\partial x_i} \left( x^T K x \right) = e_i^T K x + x^T K e_i,$$
where $e_i$ is the $i$-th elementary basis vector for $\mathbb{R}^n$, i.e. it has a 1 in the $i$-th position and zeros everywhere else. Now since a scalar equals its own transpose,
$$x^T K e_i = (x^T K e_i)^T = e_i^T K^T x,$$
the above becomes
$$\frac{\partial \phi}{\partial x_i} = e_i^T K x + e_i^T K^T x = e_i^T \left( (K + K^T) x \right).$$
Since multiplying by $e_i^T$ on the left selects the $i$-th row from the expression to its right we see that the full gradient expression is given by
$$\nabla \phi = (K + K^T) x,$$
as requested in the text. Note that this expression can also be proved easily by writing each term in component notation as suggested in the text.

Part (b): If $K$ is symmetric then since $K^T = K$ the above expression simplifies to
$$\nabla \phi = 2Kx,$$
as claimed in the book.
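The gradient formula $\nabla\phi = (K + K^T)x$ can be verified against central finite differences. The sketch below is pure Python with hypothetical function names; the step size `h` is an assumed default, and matrices are lists of rows.

```python
def quad_form(K, x):
    """phi(x) = x^T K x."""
    n = len(x)
    return sum(x[i] * K[i][j] * x[j] for i in range(n) for j in range(n))

def analytic_grad(K, x):
    """Gradient from the derivation: (K + K^T) x."""
    n = len(x)
    return [sum((K[i][j] + K[j][i]) * x[j] for j in range(n))
            for i in range(n)]

def numeric_grad(K, x, h=1e-6):
    """Central-difference approximation of the gradient of phi."""
    grad = []
    for i in range(len(x)):
        x_plus, x_minus = list(x), list(x)
        x_plus[i] += h
        x_minus[i] -= h
        grad.append((quad_form(K, x_plus) - quad_form(K, x_minus)) / (2.0 * h))
    return grad
```

Because $\phi$ is quadratic, the central difference agrees with the analytic gradient up to rounding error, for symmetric and non-symmetric $K$ alike. For a symmetric $K$ the result also equals $2Kx$.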
Chapter 9 (Algorithm-Independent Machine Learning)

Problem 9 (sums of binomial coefficients)

Part (a): We first recall the binomial theorem, which is
$$(x + y)^n = \sum_{k=0}^{n} \binom{n}{k} x^k y^{n-k}.$$
If we let $x = 1$ and $y = 1$ then $x + y = 2$ and the sum above becomes
$$2^n = \sum_{k=0}^{n} \binom{n}{k},$$
which is the stated identity. As an aside, note also that if we let $x = -1$ and $y = 1$ then $x + y = 0$ and we obtain another interesting identity involving the binomial coefficients
$$0 = \sum_{k=0}^{n} \binom{n}{k} (-1)^k.$$

Problem 23 (the average of the leave one out means)

We define the leave one out mean $\mu_{(i)}$ as
$$\mu_{(i)} = \frac{1}{n-1} \sum_{j \ne i} x_j.$$
This can obviously be computed for every $i$. Now we want to consider the mean of the leave one out means. To do so first notice that $\mu_{(i)}$ can be written as
$$\mu_{(i)} = \frac{1}{n-1} \sum_{j \ne i} x_j = \frac{1}{n-1} \left( \sum_{j=1}^{n} x_j - x_i \right) = \frac{1}{n-1} \left( n \left( \frac{1}{n} \sum_{j=1}^{n} x_j \right) - x_i \right) = \frac{n}{n-1} \hat{\mu} - \frac{x_i}{n-1}.$$
With this we see that the mean of all of the leave one out means is given by
$$\frac{1}{n} \sum_{i=1}^{n} \mu_{(i)} = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{n}{n-1} \hat{\mu} - \frac{x_i}{n-1} \right) = \frac{n}{n-1} \hat{\mu} - \frac{1}{n-1} \left( \frac{1}{n} \sum_{i=1}^{n} x_i \right) = \frac{n}{n-1} \hat{\mu} - \frac{\hat{\mu}}{n-1} = \hat{\mu},$$
as expected.

Problem 40 (maximum likelihood estimation with a binomial random variable)

Since $k$ has a binomial distribution with parameters $(n', p)$ we have that
$$P(k) = \binom{n'}{k} p^k (1-p)^{n'-k}.$$
Then the $p$ that maximizes this expression is given by taking the derivative of the above (with respect to $p$), setting the resulting expression equal to zero, and solving for $p$. We find that this derivative is given by
$$\frac{d}{dp} P(k) = \binom{n'}{k} k p^{k-1} (1-p)^{n'-k} + \binom{n'}{k} p^k (1-p)^{n'-k-1} (n'-k)(-1).$$
When this is set equal to zero and solved for $p$ we find that $p = \frac{k}{n'}$, or the empirical counting estimate of the probability of success, as we were asked to show.
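The maximum likelihood result of Problem 40 can be confirmed numerically with a brute-force search. This is a minimal sketch, not part of the original solution: the function names and the grid resolution are assumptions, and `n` below plays the role of $n'$ in the text.

```python
import math

def binomial_likelihood(k, n, p):
    """P(k) = C(n, k) p^k (1 - p)^(n - k)."""
    return math.comb(n, k) * p ** k * (1.0 - p) ** (n - k)

def grid_mle(k, n, grid=10001):
    """Value of p on a uniform grid over [0, 1] that maximizes P(k)."""
    candidates = [i / (grid - 1) for i in range(grid)]
    return max(candidates, key=lambda p: binomial_likelihood(k, n, p))
```

For example, with `k = 7` successes in `n = 20` trials the grid maximizer lands at (or within one grid step of) $k/n' = 0.35$.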