Bayesian decision theory


(1)

The Gaussian classifier

Nuno Vasconcelos, ECE Department, UCSD

(2)

Bayesian decision theory

recall that we have

Y – state of the world

X – observations

g(x) – decision function

L[g(x),y] – loss of predicting y with g(x)

Bayes decision rule is the rule that minimizes the risk

Risk = E_{X,Y}\left[ L(X,Y) \right]

for the "0-1" loss

L[g(x), y] = \begin{cases} 1, & g(x) \neq y \\ 0, & g(x) = y \end{cases}
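As a quick illustration, here is a minimal Python sketch of the "0-1" loss and the empirical risk it induces; the labels and predictions are made up for the example. Under this loss, the risk is simply the probability of error.

import numpy as np

def zero_one_loss(prediction, y):
    # L[g(x), y] = 1 if g(x) != y, 0 otherwise
    return float(prediction != y)

y = np.array([0, 1, 1, 0, 1])        # hypothetical true states of the world
g_of_x = np.array([0, 1, 0, 0, 1])   # hypothetical decisions g(x)
risk = np.mean([zero_one_loss(p, t) for p, t in zip(g_of_x, y)])
print(risk)  # 0.2: with the 0-1 loss, the risk is just the error rate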

(3)

MAP rule

the optimal decision rule can be written as

1) i^*(x) = \arg\max_i P_{Y|X}(i|x)

2) i^*(x) = \arg\max_i P_{X|Y}(x|i) P_Y(i)

3) i^*(x) = \arg\max_i \left[ \log P_{X|Y}(x|i) + \log P_Y(i) \right]

we have started to study the case of Gaussian classes

P_{X|Y}(x|i) = \frac{1}{\sqrt{(2\pi)^d |\Sigma_i|}} \exp\left\{ -\frac{1}{2} (x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i) \right\}
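To make the MAP rule concrete, here is a hedged NumPy sketch that evaluates the Gaussian class-conditionals and picks the class maximizing P_{X|Y}(x|i) P_Y(i). The means, covariances, and priors are illustrative assumptions, not values from the lecture.

import numpy as np

def gaussian_likelihood(x, mu, Sigma):
    # P_{X|Y}(x|i) for a d-dimensional Gaussian with mean mu and covariance Sigma
    d = len(mu)
    diff = x - mu
    norm = np.sqrt(((2 * np.pi) ** d) * np.linalg.det(Sigma))
    return float(np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff) / norm)

def map_rule(x, mus, Sigmas, priors):
    # i*(x) = argmax_i P_{X|Y}(x|i) P_Y(i)
    scores = [gaussian_likelihood(x, mu, S) * p
              for mu, S, p in zip(mus, Sigmas, priors)]
    return int(np.argmax(scores))

# two made-up classes
mus = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]
Sigmas = [np.eye(2), np.eye(2)]
priors = [0.5, 0.5]
print(map_rule(np.array([1.5, 1.5]), mus, Sigmas, priors))  # 1: closer to the second mean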

(4)

The Gaussian classifier

the BDR can be written as

i^*(x) = \arg\min_i \left[ d_i(x, \mu_i) + \alpha_i \right]

with

d_i(x, y) = (x - y)^T \Sigma_i^{-1} (x - y)

\alpha_i = \log\left( (2\pi)^d |\Sigma_i| \right) - 2 \log P_Y(i)

the optimal rule is to assign x to the closest class, where "closest" is measured with the Mahalanobis distance d_i(x, y), to which the constant \alpha_i is added to account for the class prior

(figure: two Gaussian classes; the discriminant is the set where P_{Y|X}(1|x) = 0.5)
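The same decision can be computed in this "distance plus offset" form. The sketch below, again with assumed parameters, implements d_i(x, µ_i) and α_i literally and takes the arg min; it agrees with the MAP sketch above.

import numpy as np

def mahalanobis_sq(x, y, Sigma):
    # d_i(x, y) = (x - y)^T Sigma_i^{-1} (x - y)
    diff = x - y
    return float(diff @ np.linalg.inv(Sigma) @ diff)

def alpha(Sigma, prior):
    # alpha_i = log((2*pi)^d |Sigma_i|) - 2 log P_Y(i)
    d = Sigma.shape[0]
    return float(np.log(((2 * np.pi) ** d) * np.linalg.det(Sigma)) - 2 * np.log(prior))

def bdr(x, mus, Sigmas, priors):
    # i*(x) = argmin_i [ d_i(x, mu_i) + alpha_i ]
    costs = [mahalanobis_sq(x, mu, S) + alpha(S, p)
             for mu, S, p in zip(mus, Sigmas, priors)]
    return int(np.argmin(costs))

# same illustrative classes as before
mus = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]
Sigmas = [np.eye(2), np.eye(2)]
priors = [0.5, 0.5]
print(bdr(np.array([1.5, 1.5]), mus, Sigmas, priors))  # 1, matching the MAP rule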

(5)

The Gaussian classifier

if \Sigma_i = \Sigma, \; \forall i, then

i^*(x) = \arg\max_i g_i(x)

with

g_i(x) = w_i^T x + w_{i0}

w_i = \Sigma^{-1} \mu_i

w_{i0} = -\frac{1}{2} \mu_i^T \Sigma^{-1} \mu_i + \log P_Y(i)

the BDR is a linear function (a linear discriminant)

(figure: two Gaussian classes; the discriminant is the set where P_{Y|X}(1|x) = 0.5)
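A small sketch of this shared-covariance linear discriminant: it computes w_i and w_{i0} as defined above for two hypothetical classes and compares them at a test point. All numbers are assumptions chosen for illustration.

import numpy as np

def linear_discriminant(mu, Sigma, prior):
    # w_i = Sigma^{-1} mu_i,  w_{i0} = -(1/2) mu_i^T Sigma^{-1} mu_i + log P_Y(i)
    Sigma_inv = np.linalg.inv(Sigma)
    w = Sigma_inv @ mu
    w0 = float(-0.5 * mu @ Sigma_inv @ mu + np.log(prior))
    return w, w0

# two hypothetical classes sharing one covariance
Sigma = np.array([[1.0, 0.3], [0.3, 1.0]])
w_i, w0_i = linear_discriminant(np.array([0.0, 0.0]), Sigma, 0.5)
w_j, w0_j = linear_discriminant(np.array([2.0, 0.0]), Sigma, 0.5)

x = np.array([0.5, 0.0])
scores = [w_i @ x + w0_i, w_j @ x + w0_j]   # g_i(x) and g_j(x)
print(int(np.argmax(scores)))               # 0: x is assigned to the first class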

(6)

Geometric interpretation

classes i, j share a boundary if there is a set of x such that

g_i(x) = g_j(x)

or

(w_i - w_j)^T x + (w_{i0} - w_{j0}) = 0

\left( \Sigma^{-1}\mu_i - \Sigma^{-1}\mu_j \right)^T x + \left( -\frac{1}{2}\mu_i^T \Sigma^{-1} \mu_i + \frac{1}{2}\mu_j^T \Sigma^{-1} \mu_j + \log\frac{P_Y(i)}{P_Y(j)} \right) = 0

(7)

Geometric interpretation

note that

\left( \Sigma^{-1}\mu_i - \Sigma^{-1}\mu_j \right)^T x + \left( -\frac{1}{2}\mu_i^T \Sigma^{-1} \mu_i + \frac{1}{2}\mu_j^T \Sigma^{-1} \mu_j + \log\frac{P_Y(i)}{P_Y(j)} \right) = 0

can be written as

(\mu_i - \mu_j)^T \Sigma^{-1} x - \frac{1}{2}\left( \mu_i^T \Sigma^{-1} \mu_i - \mu_j^T \Sigma^{-1} \mu_j \right) + \log\frac{P_Y(i)}{P_Y(j)} = 0

next, we use

\mu_i^T \Sigma^{-1} \mu_i - \mu_j^T \Sigma^{-1} \mu_j = \mu_i^T \Sigma^{-1} \mu_i - \mu_i^T \Sigma^{-1} \mu_j + \mu_i^T \Sigma^{-1} \mu_j - \mu_j^T \Sigma^{-1} \mu_j

(8)

Geometric interpretation

which can be written as

\mu_i^T \Sigma^{-1} \mu_i - \mu_j^T \Sigma^{-1} \mu_j = \mu_i^T \Sigma^{-1} (\mu_i - \mu_j) + (\mu_i - \mu_j)^T \Sigma^{-1} \mu_j = (\mu_i - \mu_j)^T \Sigma^{-1} (\mu_i + \mu_j)

using this in

(\mu_i - \mu_j)^T \Sigma^{-1} x - \frac{1}{2}\left( \mu_i^T \Sigma^{-1} \mu_i - \mu_j^T \Sigma^{-1} \mu_j \right) + \log\frac{P_Y(i)}{P_Y(j)} = 0

(9)

Geometric interpretation

leads to

w^T x + b = 0

with

w = \Sigma^{-1} (\mu_i - \mu_j)

b = -\frac{1}{2} (\mu_i + \mu_j)^T \Sigma^{-1} (\mu_i - \mu_j) + \log\frac{P_Y(i)}{P_Y(j)}

this is the equation of a hyper-plane of parameters w and b

(10)

Geometric interpretation

which can also be written as

(\mu_i - \mu_j)^T \Sigma^{-1} x - \frac{1}{2} (\mu_i - \mu_j)^T \Sigma^{-1} (\mu_i + \mu_j) + \log\frac{P_Y(i)}{P_Y(j)} = 0

or

w^T (x - x_0) = 0

with

w = \Sigma^{-1} (\mu_i - \mu_j)

x_0 = \frac{1}{2} (\mu_i + \mu_j) - \frac{\mu_i - \mu_j}{(\mu_i - \mu_j)^T \Sigma^{-1} (\mu_i - \mu_j)} \log\frac{P_Y(i)}{P_Y(j)}

(11)

Geometric interpretation

this is the equation of a hyper-plane with normal vector w that passes through x_0

w^T (x - x_0) = 0

w = \Sigma^{-1} (\mu_i - \mu_j)

x_0 = \frac{1}{2} (\mu_i + \mu_j) - \frac{\mu_i - \mu_j}{(\mu_i - \mu_j)^T \Sigma^{-1} (\mu_i - \mu_j)} \log\frac{P_Y(i)}{P_Y(j)}

this is the optimal decision boundary for Gaussian classes of equal covariance

(figure: hyper-plane with normal vector w passing through x_0)
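A hedged numerical check of these boundary formulas, with made-up means, covariance, and priors: every point satisfying w^T(x - x_0) = 0 should tie the two linear discriminants from the earlier slide.

import numpy as np

mu_i, mu_j = np.array([0.0, 0.0]), np.array([2.0, 1.0])
Sigma = np.array([[1.0, 0.2], [0.2, 2.0]])
p_i, p_j = 0.7, 0.3

Sigma_inv = np.linalg.inv(Sigma)
diff = mu_i - mu_j
w = Sigma_inv @ diff
x0 = 0.5 * (mu_i + mu_j) - diff * np.log(p_i / p_j) / (diff @ Sigma_inv @ diff)

def g(x, mu, prior):
    # linear discriminant g_i(x) = w_i^T x + w_{i0} from the equal-covariance slide
    return float(mu @ Sigma_inv @ x - 0.5 * mu @ Sigma_inv @ mu + np.log(prior))

x1 = x0 + 3.0 * np.array([-w[1], w[0]])   # another point on the hyper-plane (orthogonal step)
print(np.isclose(g(x0, mu_i, p_i), g(x0, mu_j, p_j)),
      np.isclose(g(x1, mu_i, p_i), g(x1, mu_j, p_j)))  # True True: both points tie the classes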

(12)

Geometric interpretation

special case i): \Sigma_i = \sigma^2 I

the optimal boundary has

w = \frac{1}{\sigma^2} (\mu_i - \mu_j)

x_0 = \frac{1}{2} (\mu_i + \mu_j) - \frac{\sigma^2}{\|\mu_i - \mu_j\|^2} \log\frac{P_Y(i)}{P_Y(j)} \, (\mu_i - \mu_j)

(13)

Geometric interpretation

this is

w = \frac{1}{\sigma^2} (\mu_i - \mu_j): a vector along the line through \mu_i and \mu_j

x_0 = \frac{1}{2} (\mu_i + \mu_j) - \frac{\sigma^2}{\|\mu_i - \mu_j\|^2} \log\frac{P_Y(i)}{P_Y(j)} \, (\mu_i - \mu_j)

(figure: Gaussian classes, equal covariance \sigma^2 I)

(14)

Geometric interpretation

for equal prior probabilities ( P_Y(i) = P_Y(j) ), the optimal boundary is

w = \frac{1}{\sigma^2} (\mu_i - \mu_j), \qquad x_0 = \frac{\mu_i + \mu_j}{2}

- a plane through the mid-point between \mu_i and \mu_j
- orthogonal to the line that joins \mu_i and \mu_j

(figure: Gaussian classes, equal covariance \sigma^2 I)

(15)

Geometric interpretation

for different prior probabilities ( P_Y(i) \neq P_Y(j) ), x_0 moves along the line through \mu_i and \mu_j

w = \frac{1}{\sigma^2} (\mu_i - \mu_j), \qquad x_0 = \frac{1}{2} (\mu_i + \mu_j) - \frac{\sigma^2}{\|\mu_i - \mu_j\|^2} \log\frac{P_Y(i)}{P_Y(j)} \, (\mu_i - \mu_j)

(figure: Gaussian classes, equal covariance \sigma^2 I)

(16)

Geometric interpretation

what is the effect of the prior? ( P_Y(i) \neq P_Y(j) )

x_0 = \frac{1}{2} (\mu_i + \mu_j) - \frac{\sigma^2}{\|\mu_i - \mu_j\|^2} \log\frac{P_Y(i)}{P_Y(j)} \, (\mu_i - \mu_j)

x_0 moves away from \mu_i if P_Y(i) > P_Y(j), making it more likely to pick i

(figure: Gaussian classes, equal covariance \sigma^2 I)
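A small worked example of this shift, with assumed one-dimensional parameters: with equal priors x_0 sits at the midpoint, and as P_Y(i) grows it moves away from µ_i, enlarging the region assigned to class i.

import numpy as np

mu_i, mu_j, sigma2 = np.array([0.0]), np.array([4.0]), 1.0
diff = mu_i - mu_j
for p_i in (0.5, 0.7, 0.9):
    x0 = 0.5 * (mu_i + mu_j) - sigma2 / (diff @ diff) * np.log(p_i / (1 - p_i)) * diff
    print(p_i, x0)
# 0.5 -> [2.]   (the midpoint)
# 0.7 -> [2.21], 0.9 -> [2.55]: x0 moves away from mu_i = 0 toward mu_j = 4,
# so a larger part of the line is assigned to the more likely class i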

(17)

Geometric interpretation

what is the strength of this effect? ( P_Y(i) \neq P_Y(j) )

x_0 = \frac{1}{2} (\mu_i + \mu_j) - \frac{\sigma^2}{\|\mu_i - \mu_j\|^2} \log\frac{P_Y(i)}{P_Y(j)} \, (\mu_i - \mu_j)

the shift is "inversely proportional to the distance between the means, in units of standard deviation"

(figure: Gaussian classes, equal covariance \sigma^2 I)

(18)

Geometric interpretation

note the similarities with the scalar case, where the optimal decision was a threshold at

x_0 = \frac{\mu_i + \mu_j}{2} - \frac{\sigma^2}{\mu_i - \mu_j} \log\frac{P_Y(i)}{P_Y(j)}

while here we have

w^T (x - x_0) = 0

w = \frac{1}{\sigma^2} (\mu_i - \mu_j), \qquad x_0 = \frac{1}{2} (\mu_i + \mu_j) - \frac{\sigma^2}{\|\mu_i - \mu_j\|^2} \log\frac{P_Y(i)}{P_Y(j)} \, (\mu_i - \mu_j)

the hyper-plane is the high-dimensional version of the threshold!

(19)

Geometric interpretation

(figure: boundary hyper-plane in 1, 2, and 3D, for various prior configurations)

(20)

Geometric interpretation

special case ii): \Sigma_i = \Sigma

the optimal boundary is

w^T (x - x_0) = 0

w = \Sigma^{-1} (\mu_i - \mu_j)

x_0 = \frac{1}{2} (\mu_i + \mu_j) - \frac{\mu_i - \mu_j}{(\mu_i - \mu_j)^T \Sigma^{-1} (\mu_i - \mu_j)} \log\frac{P_Y(i)}{P_Y(j)}

x_0 is basically the same; the strength of the prior is now inversely proportional to the Mahalanobis distance between the means

• w is multiplied by \Sigma^{-1}, which changes its direction and the slope of the hyper-plane

(21)

Geometric interpretation

equal but arbitrary covariance

w = \Sigma^{-1} (\mu_i - \mu_j)

x_0 = \frac{1}{2} (\mu_i + \mu_j) - \frac{\mu_i - \mu_j}{(\mu_i - \mu_j)^T \Sigma^{-1} (\mu_i - \mu_j)} \log\frac{P_Y(i)}{P_Y(j)}

(figure: Gaussian classes, equal covariance \Sigma)

(22)

Geometric interpretation

in the homework you will show that the separating plane is tangent to the pdf iso-contours at x0

(figure: Gaussian classes, equal covariance \Sigma; iso-contours around \mu_i and \mu_j)

this reflects the fact that the natural distance is now the Mahalanobis distance

(23)

Geometric interpretation

(figure: boundary hyper-plane in 1, 2, and 3D, for various prior configurations)

(24)

Geometric interpretation

what about the generic case where covariances are different?

in this case

i^*(x) = \arg\min_i \left[ d_i(x, \mu_i) + \alpha_i \right]

d_i(x, y) = (x - y)^T \Sigma_i^{-1} (x - y)

\alpha_i = \log\left( (2\pi)^d |\Sigma_i| \right) - 2 \log P_Y(i)

there is not much to simplify; dropping the d \log 2\pi term, which is the same for all classes,

g_i(x) = (x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i) + \log|\Sigma_i| - 2 \log P_Y(i) = x^T \Sigma_i^{-1} x - 2 \mu_i^T \Sigma_i^{-1} x + \mu_i^T \Sigma_i^{-1} \mu_i + \log|\Sigma_i| - 2 \log P_Y(i)

(25)

Geometric interpretation

and

g_i(x) = x^T \Sigma_i^{-1} x - 2 \mu_i^T \Sigma_i^{-1} x + \mu_i^T \Sigma_i^{-1} \mu_i + \log|\Sigma_i| - 2 \log P_Y(i)

which can be written as

g_i(x) = x^T W_i x + w_i^T x + w_{i0}

with

W_i = \Sigma_i^{-1}

w_i = -2 \Sigma_i^{-1} \mu_i

w_{i0} = \mu_i^T \Sigma_i^{-1} \mu_i + \log|\Sigma_i| - 2 \log P_Y(i)

for 2 classes, the decision boundary is hyper-quadratic

this could mean a hyper-plane, a pair of hyper-planes, hyper-spheres, hyper-ellipsoids, hyper-hyperboloids, etc.
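The quadratic discriminant is easy to assemble directly from these definitions. The sketch below, with made-up class parameters, returns W_i, w_i, and w_{i0}, and picks the class with the smallest g_i(x), consistent with the arg min form of the previous slide.

import numpy as np

def quadratic_discriminant(mu, Sigma, prior):
    # W_i = Sigma_i^{-1},  w_i = -2 Sigma_i^{-1} mu_i,
    # w_{i0} = mu_i^T Sigma_i^{-1} mu_i + log|Sigma_i| - 2 log P_Y(i)
    Sigma_inv = np.linalg.inv(Sigma)
    W = Sigma_inv
    w = -2.0 * Sigma_inv @ mu
    w0 = float(mu @ Sigma_inv @ mu + np.log(np.linalg.det(Sigma)) - 2 * np.log(prior))
    return W, w, w0

def g(x, W, w, w0):
    # g_i(x) = x^T W_i x + w_i^T x + w_{i0}; smaller is better in this arg min form
    return float(x @ W @ x + w @ x + w0)

# two hypothetical classes with different covariances
params = [quadratic_discriminant(np.array([0.0, 0.0]), np.eye(2), 0.5),
          quadratic_discriminant(np.array([3.0, 0.0]), 4.0 * np.eye(2), 0.5)]
x = np.array([1.0, 0.0])
print(int(np.argmin([g(x, *p) for p in params])))  # 0: x is assigned to the first class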

(26)

Geometric interpretation

in 2 and 3D: (figure: example hyper-quadratic decision boundaries)

(27)

The sigmoid

we have derived all of this from the log-based BDR

i^*(x) = \arg\max_i \left[ \log P_{X|Y}(x|i) + \log P_Y(i) \right]

when there are only two classes, it is also interesting to look at the original definition

i^*(x) = \arg\max_i g_i(x)

with

g_i(x) = P_{Y|X}(i|x) = \frac{ P_{X|Y}(x|i) P_Y(i) }{ P_X(x) } = \frac{ P_{X|Y}(x|i) P_Y(i) }{ P_{X|Y}(x|0) P_Y(0) + P_{X|Y}(x|1) P_Y(1) }

(28)

The sigmoid

note that this can be written as

i^*(x) = \arg\max_i g_i(x)

g_1(x) = \frac{ P_{X|Y}(x|1) P_Y(1) }{ P_{X|Y}(x|1) P_Y(1) + P_{X|Y}(x|0) P_Y(0) }

g_0(x) = 1 - g_1(x)

and, for Gaussian classes, the posterior probabilities are

g_1(x) = \frac{1}{ 1 + \exp\left\{ \frac{1}{2}\left[ d_1(x, \mu_1) - d_0(x, \mu_0) + \alpha_1 - \alpha_0 \right] \right\} }

where, as before,

d_i(x, y) = (x - y)^T \Sigma_i^{-1} (x - y)

\alpha_i = \log\left( (2\pi)^d |\Sigma_i| \right) - 2 \log P_Y(i)

(29)

The sigmoid

the posterior

g_1(x) = \frac{1}{ 1 + \exp\left\{ \frac{1}{2}\left[ d_1(x, \mu_1) - d_0(x, \mu_0) + \alpha_1 - \alpha_0 \right] \right\} }

is a sigmoid, and looks like this

(figure: sigmoid posterior; the discriminant is the set where P_{Y|X}(1|x) = 0.5)

(30)

The sigmoid

the sigmoid appears in neural networks

it is the true posterior for Gaussian problems where the covariances are the same

(figure: equal variances; a single boundary, halfway between the means)
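To see why the equal-covariance case is special, the sketch below (assumed parameters) checks numerically that the sigmoid's exponent is then affine in x, so the posterior has exactly the logistic form 1/(1 + exp(-(a^T x + c))).

import numpy as np

mu0, mu1 = np.array([0.0, 0.0]), np.array([2.0, 1.0])
Sigma = np.array([[1.0, 0.3], [0.3, 1.5]])
p0, p1 = 0.4, 0.6
Sigma_inv = np.linalg.inv(Sigma)

# with Sigma_0 = Sigma_1, the exponent 0.5*(d1 - d0 + a1 - a0) equals -(a^T x + c) with:
a = Sigma_inv @ (mu1 - mu0)
c = -0.5 * (mu1 @ Sigma_inv @ mu1 - mu0 @ Sigma_inv @ mu0) + np.log(p1 / p0)

def posterior_1(x):
    d = lambda mu: (x - mu) @ Sigma_inv @ (x - mu)
    alpha = lambda p: -2 * np.log(p)   # the equal-covariance log terms cancel in the difference
    expo = 0.5 * (d(mu1) - d(mu0) + alpha(p1) - alpha(p0))
    return 1.0 / (1.0 + np.exp(expo))

x = np.array([0.7, -0.2])
print(np.isclose(posterior_1(x), 1.0 / (1.0 + np.exp(-(a @ x + c)))))  # True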

(31)

The sigmoid

but not necessarily when the covariances are different

(figure: variances are different; two boundaries)

(32)

Bayesian decision theory

advantages:

• the BDR is optimal and cannot be beaten
• Bayes keeps you honest
• models reflect a causal interpretation of the problem; this is how we think
• natural decomposition into "what we knew already" (the prior) and "what the data tells us" (the CCD)
• no need for heuristics to combine these two sources of information
• the BDR is, almost invariably, intuitive
• Bayes rule, the chain rule, and marginalization enable modularity, and scalability to very complicated models and problems

problems:

• the BDR is optimal only insofar as the models are correct.

(33)

References
