The Gaussian classifier
Nuno Vasconcelos, ECE Department, UCSD
Bayesian decision theory
recall that we have
• Y – state of the world
• X – observations
• g(x) – decision function
• L[g(x),y] – loss of predicting y with g(x)
Bayes decision rule is the rule that minimizes the risk

$$\text{Risk} = E_{X,Y}\left\{ L[g(X), Y] \right\}$$

for the "0-1" loss

$$L[g(x), y] = \begin{cases} 1, & g(x) \neq y \\ 0, & g(x) = y \end{cases}$$
MAP rule
the optimal decision rule can be written as

• 1) $i^*(x) = \arg\max_i P_{Y|X}(i|x)$

• 2) $i^*(x) = \arg\max_i P_{X|Y}(x|i) \, P_Y(i)$

• 3) $i^*(x) = \arg\max_i \left[ \log P_{X|Y}(x|i) + \log P_Y(i) \right]$

we have started to study the case of Gaussian classes

$$P_{X|Y}(x|i) = \frac{1}{\sqrt{(2\pi)^d |\Sigma_i|}} \exp\left\{ -\frac{1}{2} (x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i) \right\}$$
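as a quick illustration, here is a minimal numpy sketch of form 3) of the MAP rule for two Gaussian classes (all means, covariances, and priors below are made-up values for the example):

```python
import numpy as np

def log_gaussian(x, mu, Sigma):
    # log of the Gaussian class-conditional P_{X|Y}(x|i) defined above
    d = len(mu)
    diff = x - mu
    _, logdet = np.linalg.slogdet(Sigma)
    quad = diff @ np.linalg.solve(Sigma, diff)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + quad)

# hypothetical two-class problem in 2D
mus = [np.array([0.0, 0.0]), np.array([3.0, 1.0])]
Sigmas = [np.eye(2), np.array([[2.0, 0.5], [0.5, 1.0]])]
priors = [0.7, 0.3]

x = np.array([1.5, 0.5])
# form 3): i*(x) = argmax_i [log P_{X|Y}(x|i) + log P_Y(i)]
scores = [log_gaussian(x, mus[i], Sigmas[i]) + np.log(priors[i])
          for i in range(2)]
print("decision:", int(np.argmax(scores)))
```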
The Gaussian classifier
the BDR can be written as

$$i^*(x) = \arg\min_i \left[ d_i(x, \mu_i) + \alpha_i \right]$$

with

$$d_i(x, y) = (x - y)^T \Sigma_i^{-1} (x - y)$$

$$\alpha_i = \log\left[ (2\pi)^d |\Sigma_i| \right] - 2 \log P_Y(i)$$

the optimal rule is to assign x to the closest class
• "closest" is measured with the Mahalanobis distance d_i(x,y), to which the constant α_i is added to account for the class prior

[figure: two Gaussian classes; the discriminant is the set of points where $P_{Y|X}(1|x) = 0.5$]
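the same decision can be computed in this "closest class" form; a minimal sketch, reusing the made-up parameters from the previous example:

```python
import numpy as np

def mahalanobis_sq(x, y, Sigma):
    # d_i(x, y) = (x - y)^T Sigma_i^{-1} (x - y)
    diff = x - y
    return diff @ np.linalg.solve(Sigma, diff)

def alpha(Sigma, prior, d):
    # alpha_i = log[(2 pi)^d |Sigma_i|] - 2 log P_Y(i)
    _, logdet = np.linalg.slogdet(Sigma)
    return d * np.log(2 * np.pi) + logdet - 2 * np.log(prior)

mus = [np.array([0.0, 0.0]), np.array([3.0, 1.0])]   # illustrative values
Sigmas = [np.eye(2), np.array([[2.0, 0.5], [0.5, 1.0]])]
priors = [0.7, 0.3]

x = np.array([1.5, 0.5])
costs = [mahalanobis_sq(x, mus[i], Sigmas[i]) + alpha(Sigmas[i], priors[i], 2)
         for i in range(2)]
# d_i + alpha_i equals -2 [log P_{X|Y}(x|i) + log P_Y(i)],
# so the argmin agrees with the MAP decision
print("assign x to class", int(np.argmin(costs)))
```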
The Gaussian classifier
if $\Sigma_i = \Sigma, \; \forall i$, then

$$i^*(x) = \arg\max_i g_i(x)$$

with

$$g_i(x) = w_i^T x + w_{i0}$$

$$w_i = \Sigma^{-1} \mu_i \qquad w_{i0} = -\frac{1}{2} \mu_i^T \Sigma^{-1} \mu_i + \log P_Y(i)$$

• the BDR is a linear function, or a linear discriminant
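a minimal sketch of the linear discriminant, again with illustrative parameters (a shared covariance is assumed, as this special case requires):

```python
import numpy as np

Sigma = np.array([[1.0, 0.3], [0.3, 2.0]])  # shared, made-up covariance
Sigma_inv = np.linalg.inv(Sigma)
mus = [np.array([0.0, 0.0]), np.array([3.0, 1.0])]
priors = [0.6, 0.4]

def g(i, x):
    # g_i(x) = w_i^T x + w_i0, with w_i = Sigma^{-1} mu_i and
    # w_i0 = -1/2 mu_i^T Sigma^{-1} mu_i + log P_Y(i)
    w = Sigma_inv @ mus[i]
    w0 = -0.5 * mus[i] @ Sigma_inv @ mus[i] + np.log(priors[i])
    return w @ x + w0

x = np.array([1.0, 0.5])
print("decision:", int(np.argmax([g(0, x), g(1, x)])))
```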
Geometric interpretation
classes i and j share a boundary if
• there is a set of x such that

$$g_i(x) = g_j(x)$$

• or

$$(w_i - w_j)^T x + (w_{i0} - w_{j0}) = 0$$

$$\left( \Sigma^{-1}\mu_i - \Sigma^{-1}\mu_j \right)^T x + \left( -\frac{1}{2} \mu_i^T \Sigma^{-1} \mu_i + \frac{1}{2} \mu_j^T \Sigma^{-1} \mu_j + \log \frac{P_Y(i)}{P_Y(j)} \right) = 0$$
Geometric interpretation
note that

$$\left( \Sigma^{-1}\mu_i - \Sigma^{-1}\mu_j \right)^T x + \left( -\frac{1}{2} \mu_i^T \Sigma^{-1} \mu_i + \frac{1}{2} \mu_j^T \Sigma^{-1} \mu_j + \log \frac{P_Y(i)}{P_Y(j)} \right) = 0$$

• can be written as

$$(\mu_i - \mu_j)^T \Sigma^{-1} x - \frac{1}{2}\left( \mu_i^T \Sigma^{-1} \mu_i - \mu_j^T \Sigma^{-1} \mu_j \right) + \log \frac{P_Y(i)}{P_Y(j)} = 0$$

next, we use

$$\mu_i^T \Sigma^{-1} \mu_i - \mu_j^T \Sigma^{-1} \mu_j = \mu_i^T \Sigma^{-1} \mu_i - \mu_i^T \Sigma^{-1} \mu_j + \mu_i^T \Sigma^{-1} \mu_j - \mu_j^T \Sigma^{-1} \mu_j$$
Geometric interpretation
which, since $\Sigma^{-1}$ is symmetric, can be written as

$$\mu_i^T \Sigma^{-1} \mu_i - \mu_j^T \Sigma^{-1} \mu_j = \mu_i^T \Sigma^{-1} (\mu_i - \mu_j) + (\mu_i - \mu_j)^T \Sigma^{-1} \mu_j$$

$$= (\mu_i - \mu_j)^T \Sigma^{-1} \mu_i + (\mu_i - \mu_j)^T \Sigma^{-1} \mu_j = (\mu_i - \mu_j)^T \Sigma^{-1} (\mu_i + \mu_j)$$

using this in

$$(\mu_i - \mu_j)^T \Sigma^{-1} x - \frac{1}{2}\left( \mu_i^T \Sigma^{-1} \mu_i - \mu_j^T \Sigma^{-1} \mu_j \right) + \log \frac{P_Y(i)}{P_Y(j)} = 0$$
Geometric interpretation
leads to

$$\underbrace{(\mu_i - \mu_j)^T \Sigma^{-1}}_{w^T} x \; \underbrace{- \frac{1}{2}(\mu_i + \mu_j)^T \Sigma^{-1} (\mu_i - \mu_j) + \log \frac{P_Y(i)}{P_Y(j)}}_{b} = 0$$

i.e. $w^T x + b = 0$ with

$$w = \Sigma^{-1} (\mu_i - \mu_j)$$

$$b = -\frac{1}{2}(\mu_i + \mu_j)^T \Sigma^{-1} (\mu_i - \mu_j) + \log \frac{P_Y(i)}{P_Y(j)}$$

this is the equation of a hyper-plane of parameters w and b
Geometric interpretation
which can also be written as

$$(\mu_i - \mu_j)^T \Sigma^{-1} x - \frac{1}{2}(\mu_i + \mu_j)^T \Sigma^{-1} (\mu_i - \mu_j) + \log \frac{P_Y(i)}{P_Y(j)} = 0$$

$$(\mu_i - \mu_j)^T \Sigma^{-1} \left[ x - \frac{1}{2}(\mu_i + \mu_j) + \frac{\mu_i - \mu_j}{(\mu_i - \mu_j)^T \Sigma^{-1} (\mu_i - \mu_j)} \log \frac{P_Y(i)}{P_Y(j)} \right] = 0$$

or

$$w^T (x - x_0) = 0$$

with

$$w = \Sigma^{-1} (\mu_i - \mu_j)$$

$$x_0 = \frac{1}{2}(\mu_i + \mu_j) - \frac{\mu_i - \mu_j}{(\mu_i - \mu_j)^T \Sigma^{-1} (\mu_i - \mu_j)} \log \frac{P_Y(i)}{P_Y(j)}$$
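a minimal sketch computing the boundary parameters w and x0 from these formulas (parameter values are illustrative):

```python
import numpy as np

Sigma = np.array([[1.0, 0.3], [0.3, 2.0]])   # shared, made-up covariance
mu_i, mu_j = np.array([0.0, 0.0]), np.array([3.0, 1.0])
P_i, P_j = 0.6, 0.4

diff = mu_i - mu_j
w = np.linalg.solve(Sigma, diff)             # w = Sigma^{-1} (mu_i - mu_j)
# diff @ w is the Mahalanobis term (mu_i - mu_j)^T Sigma^{-1} (mu_i - mu_j)
x0 = 0.5 * (mu_i + mu_j) - diff / (diff @ w) * np.log(P_i / P_j)

print("w  =", w)
print("x0 =", x0)  # for P_i = P_j this reduces to the midpoint (mu_i + mu_j)/2
# every boundary point x satisfies w^T (x - x0) = 0, in particular x0 itself
```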
Geometric interpretation
this is the equation of the hyper-plane
• of normal vector w
• that passes through x0

$$w^T (x - x_0) = 0$$

$$w = \Sigma^{-1} (\mu_i - \mu_j) \qquad x_0 = \frac{1}{2}(\mu_i + \mu_j) - \frac{\mu_i - \mu_j}{(\mu_i - \mu_j)^T \Sigma^{-1} (\mu_i - \mu_j)} \log \frac{P_Y(i)}{P_Y(j)}$$

[figure: the optimal decision boundary for Gaussian classes of equal covariance]
Geometric interpretation
special case i): $\Sigma_i = \sigma^2 I$
the optimal boundary has

$$w = \frac{1}{\sigma^2} (\mu_i - \mu_j)$$

$$x_0 = \frac{1}{2}(\mu_i + \mu_j) - \frac{\sigma^2 (\mu_i - \mu_j)}{\| \mu_i - \mu_j \|^2} \log \frac{P_Y(i)}{P_Y(j)}$$

• w is a vector along the line through µi and µj

[figure: Gaussian classes, equal covariance σ²I]
Geometric interpretation
for equal prior probabilities ($P_Y(i) = P_Y(j)$), the optimal boundary is

$$w = \frac{1}{\sigma^2} (\mu_i - \mu_j) \qquad x_0 = \frac{\mu_i + \mu_j}{2}$$

• a plane through the midpoint between µi and µj
• orthogonal to the line that joins µi and µj

[figure: Gaussian classes, equal covariance σ²I]
Geometric interpretation
for different prior probabilities ($P_Y(i) \neq P_Y(j)$), x0 moves along the line through µi and µj

$$w = \frac{1}{\sigma^2} (\mu_i - \mu_j)$$

$$x_0 = \frac{1}{2}(\mu_i + \mu_j) - \frac{\sigma^2 (\mu_i - \mu_j)}{\| \mu_i - \mu_j \|^2} \log \frac{P_Y(i)}{P_Y(j)}$$

[figure: Gaussian classes, equal covariance σ²I]
Geometric interpretation
what is the effect of the prior? ($P_Y(i) \neq P_Y(j)$)

$$x_0 = \frac{1}{2}(\mu_i + \mu_j) - \frac{\sigma^2 (\mu_i - \mu_j)}{\| \mu_i - \mu_j \|^2} \log \frac{P_Y(i)}{P_Y(j)}$$

x0 moves away from µi if PY(i) > PY(j), making it more likely to pick i

[figure: Gaussian classes, equal covariance σ²I]
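a small numeric illustration of this effect, with made-up parameters: as PY(i) grows, x0 slides away from µi (towards µj), enlarging the region assigned to class i:

```python
import numpy as np

sigma2 = 1.0                                  # illustrative variance
mu_i, mu_j = np.array([0.0, 0.0]), np.array([4.0, 0.0])
diff = mu_i - mu_j

for P_i in [0.5, 0.7, 0.9]:
    P_j = 1.0 - P_i
    x0 = 0.5 * (mu_i + mu_j) - sigma2 * diff / (diff @ diff) * np.log(P_i / P_j)
    print(f"P_Y(i) = {P_i:.1f}  ->  x0 = {x0}")
# x0 moves from the midpoint (2, 0) towards mu_j = (4, 0) as P_Y(i) grows
```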
Geometric interpretation
what is the strength of this effect? ($P_Y(i) \neq P_Y(j)$)

$$x_0 = \frac{1}{2}(\mu_i + \mu_j) - \frac{\sigma^2 (\mu_i - \mu_j)}{\| \mu_i - \mu_j \|^2} \log \frac{P_Y(i)}{P_Y(j)}$$

the displacement of x0 is "inversely proportional to the distance between the means, in units of standard deviation"

[figure: Gaussian classes, equal covariance σ²I]
Geometric interpretation
note the similarities with the scalar case where, for $\mu_1 > \mu_0$, we pick class 0 when

$$x < \frac{\mu_0 + \mu_1}{2} + \frac{\sigma^2}{\mu_1 - \mu_0} \log \frac{P_Y(0)}{P_Y(1)}$$

while here we have

$$w^T (x - x_0) = 0$$

$$w = \frac{1}{\sigma^2} (\mu_i - \mu_j) \qquad x_0 = \frac{1}{2}(\mu_i + \mu_j) - \frac{\sigma^2 (\mu_i - \mu_j)}{\| \mu_i - \mu_j \|^2} \log \frac{P_Y(i)}{P_Y(j)}$$

• the hyper-plane is the high-dimensional version of the threshold!
Geometric interpretation
[figure: the boundary hyper-plane in 1, 2, and 3D, for various prior configurations]
Geometric interpretation
special case ii): $\Sigma_i = \Sigma$
the optimal boundary is

$$w^T (x - x_0) = 0$$

$$w = \Sigma^{-1} (\mu_i - \mu_j)$$

$$x_0 = \frac{1}{2}(\mu_i + \mu_j) - \frac{\mu_i - \mu_j}{(\mu_i - \mu_j)^T \Sigma^{-1} (\mu_i - \mu_j)} \log \frac{P_Y(i)}{P_Y(j)}$$

• x0 is basically the same; the strength of the prior is inversely proportional to the Mahalanobis distance between the means
• w is multiplied by $\Sigma^{-1}$, which changes its direction and therefore the slope of the hyper-plane
Geometric interpretation
equal but arbitrary covariance:

$$w = \Sigma^{-1} (\mu_i - \mu_j)$$

$$x_0 = \frac{1}{2}(\mu_i + \mu_j) - \frac{\mu_i - \mu_j}{(\mu_i - \mu_j)^T \Sigma^{-1} (\mu_i - \mu_j)} \log \frac{P_Y(i)}{P_Y(j)}$$

[figure: Gaussian classes, equal covariance Σ]
Geometric interpretation
in the homework you will show that the separating plane is tangent to the pdf iso-contours at x0
[figure: Gaussian classes, equal covariance Σ]
• reflects the fact that the natural distance is now Mahalanobis
Geometric interpretation
[figure: the boundary hyper-plane in 1, 2, and 3D, for various prior configurations]
Geometric interpretation
what about the generic case, where the covariances are different?
• in this case

$$i^*(x) = \arg\min_i \left[ d_i(x, \mu_i) + \alpha_i \right]$$

$$d_i(x, y) = (x - y)^T \Sigma_i^{-1} (x - y) \qquad \alpha_i = \log\left[ (2\pi)^d |\Sigma_i| \right] - 2 \log P_Y(i)$$

• there is not much to simplify; dropping the constant $d \log 2\pi$, which is common to all classes,

$$g_i(x) = (x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i) + \log |\Sigma_i| - 2 \log P_Y(i)$$

$$= x^T \Sigma_i^{-1} x - 2 \mu_i^T \Sigma_i^{-1} x + \mu_i^T \Sigma_i^{-1} \mu_i + \log |\Sigma_i| - 2 \log P_Y(i)$$
Geometric interpretation
this can be written as

$$g_i(x) = x^T W_i x + w_i^T x + w_{i0}$$

with

$$W_i = \Sigma_i^{-1} \qquad w_i = -2 \Sigma_i^{-1} \mu_i \qquad w_{i0} = \mu_i^T \Sigma_i^{-1} \mu_i + \log |\Sigma_i| - 2 \log P_Y(i)$$

for 2 classes, the decision boundary is hyper-quadratic
• this could mean a hyper-plane, a pair of hyper-planes, hyper-spheres, hyper-ellipsoids, hyper-hyperboloids, etc.
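a minimal sketch of the resulting quadratic discriminant, with illustrative parameters (the constant d log 2π is dropped, as above, since it is common to all classes):

```python
import numpy as np

mus = [np.array([0.0, 0.0]), np.array([3.0, 1.0])]       # made-up parameters
Sigmas = [np.eye(2), np.array([[2.0, 0.5], [0.5, 1.0]])]
priors = [0.5, 0.5]

def g(i, x):
    # g_i(x) = x^T W_i x + w_i^T x + w_i0, to be *minimized* over i
    Si = np.linalg.inv(Sigmas[i])
    W = Si                                    # W_i = Sigma_i^{-1}
    w = -2.0 * Si @ mus[i]                    # w_i = -2 Sigma_i^{-1} mu_i
    _, logdet = np.linalg.slogdet(Sigmas[i])
    w0 = mus[i] @ Si @ mus[i] + logdet - 2.0 * np.log(priors[i])
    return x @ W @ x + w @ x + w0

x = np.array([1.0, 0.5])
print("decision:", int(np.argmin([g(0, x), g(1, x)])))
```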
Geometric interpretation
[figure: decision boundaries in 2 and 3D]
The sigmoid
we have derived all of this from the log-based BDR

$$i^*(x) = \arg\max_i \left[ \log P_{X|Y}(x|i) + \log P_Y(i) \right]$$

when there are only two classes, it is also interesting to look at the original definition

$$i^*(x) = \arg\max_i g_i(x)$$

with

$$g_i(x) = P_{Y|X}(i|x) = \frac{P_{X|Y}(x|i) \, P_Y(i)}{P_X(x)} = \frac{P_{X|Y}(x|i) \, P_Y(i)}{P_{X|Y}(x|0) P_Y(0) + P_{X|Y}(x|1) P_Y(1)}$$
The sigmoid
note that this can be written as

$$i^*(x) = \arg\max_i g_i(x)$$

$$g_0(x) = \frac{1}{1 + \dfrac{P_{X|Y}(x|1) \, P_Y(1)}{P_{X|Y}(x|0) \, P_Y(0)}} \qquad g_1(x) = 1 - g_0(x)$$

and, for Gaussian classes, the posterior probabilities are

$$g_0(x) = \frac{1}{1 + \exp\left\{ \frac{1}{2}\left[ d_0(x, \mu_0) - d_1(x, \mu_1) + \alpha_0 - \alpha_1 \right] \right\}}$$

where, as before,

$$d_i(x, y) = (x - y)^T \Sigma_i^{-1} (x - y) \qquad \alpha_i = \log\left[ (2\pi)^d |\Sigma_i| \right] - 2 \log P_Y(i)$$
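a minimal sketch evaluating this posterior (illustrative parameters, with d_i and α_i exactly as defined above):

```python
import numpy as np

def d(x, mu, Sigma):
    # d_i(x, mu_i) = (x - mu_i)^T Sigma_i^{-1} (x - mu_i)
    diff = x - mu
    return diff @ np.linalg.solve(Sigma, diff)

def alpha(Sigma, prior, dim):
    # alpha_i = log[(2 pi)^d |Sigma_i|] - 2 log P_Y(i)
    _, logdet = np.linalg.slogdet(Sigma)
    return dim * np.log(2 * np.pi) + logdet - 2.0 * np.log(prior)

mus = [np.array([0.0, 0.0]), np.array([3.0, 1.0])]       # made-up parameters
Sigmas = [np.eye(2), np.array([[2.0, 0.5], [0.5, 1.0]])]
priors = [0.7, 0.3]

x = np.array([1.5, 0.5])
z = (d(x, mus[0], Sigmas[0]) - d(x, mus[1], Sigmas[1])
     + alpha(Sigmas[0], priors[0], 2) - alpha(Sigmas[1], priors[1], 2))
g0 = 1.0 / (1.0 + np.exp(0.5 * z))   # the sigmoid posterior P_{Y|X}(0|x)
print("P_Y|X(0|x) =", g0)
```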
The sigmoid
the posterior g0(x) is a sigmoid and looks like this

$$g_0(x) = \frac{1}{1 + \exp\left\{ \frac{1}{2}\left[ d_0(x, \mu_0) - d_1(x, \mu_1) + \alpha_0 - \alpha_1 \right] \right\}}$$

[figure: sigmoid posterior; the discriminant is the set of points where $P_{Y|X}(1|x) = 0.5$]
The sigmoid
the sigmoid appears in neural networks
• it is the true posterior for Gaussian problems where the covariances are the same

[figure: equal variances, a single boundary at the halfway point between the means]
The sigmoid
but not necessarily when the covariances are different
[figure: variances are different, two boundaries]
Bayesian decision theory
advantages:
• BDR is optimal and cannot be beaten
• Bayes keeps you honest
• models reflect a causal interpretation of the problem; this is how we think
• natural decomposition into “what we knew already” (prior) and
"what data tells us" (the class-conditional density, CCD)
• no need for heuristics to combine these two sources of info
• BDR is, almost invariably, intuitive
• Bayes rule, chain rule, and marginalization enable modularity and scalability to very complicated models and problems
problems:
• the BDR is optimal only insofar as the models are correct.