The Gaussian classifier
Nuno Vasconcelos, ECE Department, UCSD
Bayesian decision theory
recall that we have
• Y – state of the world
• X – observations
• g(x) – decision function
• L[g(x),y] – loss of predicting y with g(x)
Bayes decision rule is the rule that minimizes the risk

$$\text{Risk} = E_{X,Y}\left\{ L[g(X), Y] \right\}$$

for the "0-1" loss

$$L[g(x), y] = \begin{cases} 1, & g(x) \neq y \\ 0, & g(x) = y \end{cases}$$
MAP rule
the optimal decision rule can be written as

• 1) $i^*(x) = \arg\max_i P_{Y|X}(i|x)$

• 2) $i^*(x) = \arg\max_i P_{X|Y}(x|i) \, P_Y(i)$

• 3) $i^*(x) = \arg\max_i \left[ \log P_{X|Y}(x|i) + \log P_Y(i) \right]$

we have started to study the case of Gaussian classes

$$P_{X|Y}(x|i) = \frac{1}{\sqrt{(2\pi)^d |\Sigma_i|}} \exp\left\{ -\frac{1}{2} (x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i) \right\}$$
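as a quick illustration, here is a minimal numpy sketch of form 3) of the MAP rule for two Gaussian classes (all means, covariances, and priors below are made-up values for the example):

```python
import numpy as np

def log_gaussian(x, mu, Sigma):
    # log of the Gaussian class-conditional P_{X|Y}(x|i) defined above
    d = len(mu)
    diff = x - mu
    _, logdet = np.linalg.slogdet(Sigma)
    quad = diff @ np.linalg.solve(Sigma, diff)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + quad)

# hypothetical two-class problem in 2D
mus = [np.array([0.0, 0.0]), np.array([3.0, 1.0])]
Sigmas = [np.eye(2), np.array([[2.0, 0.5], [0.5, 1.0]])]
priors = [0.7, 0.3]

x = np.array([1.5, 0.5])
# form 3): i*(x) = argmax_i [log P_{X|Y}(x|i) + log P_Y(i)]
scores = [log_gaussian(x, mus[i], Sigmas[i]) + np.log(priors[i])
          for i in range(2)]
print("decision:", int(np.argmax(scores)))
```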
The Gaussian classifier
the BDR can be written as

$$i^*(x) = \arg\min_i \left[ d_i(x, \mu_i) + \alpha_i \right]$$

with

$$d_i(x, y) = (x - y)^T \Sigma_i^{-1} (x - y)$$

$$\alpha_i = \log\left[ (2\pi)^d |\Sigma_i| \right] - 2 \log P_Y(i)$$

the optimal rule is to assign x to the closest class
• "closest" is measured with the Mahalanobis distance d_i(x,y), to which the constant α_i is added to account for the class prior

[figure: two Gaussian classes; the discriminant is the set of points where $P_{Y|X}(1|x) = 0.5$]
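the same decision can be computed in this "closest class" form; a minimal sketch, reusing the made-up parameters from the previous example:

```python
import numpy as np

def mahalanobis_sq(x, y, Sigma):
    # d_i(x, y) = (x - y)^T Sigma_i^{-1} (x - y)
    diff = x - y
    return diff @ np.linalg.solve(Sigma, diff)

def alpha(Sigma, prior, d):
    # alpha_i = log[(2 pi)^d |Sigma_i|] - 2 log P_Y(i)
    _, logdet = np.linalg.slogdet(Sigma)
    return d * np.log(2 * np.pi) + logdet - 2 * np.log(prior)

mus = [np.array([0.0, 0.0]), np.array([3.0, 1.0])]   # illustrative values
Sigmas = [np.eye(2), np.array([[2.0, 0.5], [0.5, 1.0]])]
priors = [0.7, 0.3]

x = np.array([1.5, 0.5])
costs = [mahalanobis_sq(x, mus[i], Sigmas[i]) + alpha(Sigmas[i], priors[i], 2)
         for i in range(2)]
# d_i + alpha_i equals -2 [log P_{X|Y}(x|i) + log P_Y(i)],
# so the argmin agrees with the MAP decision
print("assign x to class", int(np.argmin(costs)))
```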
The Gaussian classifier
if $\Sigma_i = \Sigma, \; \forall i$, then

$$i^*(x) = \arg\max_i g_i(x)$$

with

$$g_i(x) = w_i^T x + w_{i0}$$

$$w_i = \Sigma^{-1} \mu_i \qquad w_{i0} = -\frac{1}{2} \mu_i^T \Sigma^{-1} \mu_i + \log P_Y(i)$$

• the BDR is a linear function, or a linear discriminant
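a minimal sketch of the linear discriminant, again with illustrative parameters (a shared covariance is assumed, as this special case requires):

```python
import numpy as np

Sigma = np.array([[1.0, 0.3], [0.3, 2.0]])  # shared, made-up covariance
Sigma_inv = np.linalg.inv(Sigma)
mus = [np.array([0.0, 0.0]), np.array([3.0, 1.0])]
priors = [0.6, 0.4]

def g(i, x):
    # g_i(x) = w_i^T x + w_i0, with w_i = Sigma^{-1} mu_i and
    # w_i0 = -1/2 mu_i^T Sigma^{-1} mu_i + log P_Y(i)
    w = Sigma_inv @ mus[i]
    w0 = -0.5 * mus[i] @ Sigma_inv @ mus[i] + np.log(priors[i])
    return w @ x + w0

x = np.array([1.0, 0.5])
print("decision:", int(np.argmax([g(0, x), g(1, x)])))
```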
Geometric interpretation
classes i and j share a boundary if
• there is a set of x such that

$$g_i(x) = g_j(x)$$

• or

$$(w_i - w_j)^T x + (w_{i0} - w_{j0}) = 0$$

$$\left( \Sigma^{-1}\mu_i - \Sigma^{-1}\mu_j \right)^T x + \left( -\frac{1}{2} \mu_i^T \Sigma^{-1} \mu_i + \frac{1}{2} \mu_j^T \Sigma^{-1} \mu_j + \log \frac{P_Y(i)}{P_Y(j)} \right) = 0$$
Geometric interpretation
note that

$$\left( \Sigma^{-1}\mu_i - \Sigma^{-1}\mu_j \right)^T x + \left( -\frac{1}{2} \mu_i^T \Sigma^{-1} \mu_i + \frac{1}{2} \mu_j^T \Sigma^{-1} \mu_j + \log \frac{P_Y(i)}{P_Y(j)} \right) = 0$$

• can be written as

$$(\mu_i - \mu_j)^T \Sigma^{-1} x - \frac{1}{2}\left( \mu_i^T \Sigma^{-1} \mu_i - \mu_j^T \Sigma^{-1} \mu_j \right) + \log \frac{P_Y(i)}{P_Y(j)} = 0$$

next, we use

$$\mu_i^T \Sigma^{-1} \mu_i - \mu_j^T \Sigma^{-1} \mu_j = \mu_i^T \Sigma^{-1} \mu_i - \mu_i^T \Sigma^{-1} \mu_j + \mu_i^T \Sigma^{-1} \mu_j - \mu_j^T \Sigma^{-1} \mu_j$$
Geometric interpretation
which, since $\Sigma^{-1}$ is symmetric, can be written as

$$\mu_i^T \Sigma^{-1} \mu_i - \mu_j^T \Sigma^{-1} \mu_j = \mu_i^T \Sigma^{-1} (\mu_i - \mu_j) + (\mu_i - \mu_j)^T \Sigma^{-1} \mu_j$$

$$= (\mu_i - \mu_j)^T \Sigma^{-1} \mu_i + (\mu_i - \mu_j)^T \Sigma^{-1} \mu_j = (\mu_i - \mu_j)^T \Sigma^{-1} (\mu_i + \mu_j)$$

using this in

$$(\mu_i - \mu_j)^T \Sigma^{-1} x - \frac{1}{2}\left( \mu_i^T \Sigma^{-1} \mu_i - \mu_j^T \Sigma^{-1} \mu_j \right) + \log \frac{P_Y(i)}{P_Y(j)} = 0$$
Geometric interpretation
leads to

$$\underbrace{(\mu_i - \mu_j)^T \Sigma^{-1}}_{w^T} x \; \underbrace{- \frac{1}{2}(\mu_i + \mu_j)^T \Sigma^{-1} (\mu_i - \mu_j) + \log \frac{P_Y(i)}{P_Y(j)}}_{b} = 0$$

i.e. $w^T x + b = 0$ with

$$w = \Sigma^{-1} (\mu_i - \mu_j)$$

$$b = -\frac{1}{2}(\mu_i + \mu_j)^T \Sigma^{-1} (\mu_i - \mu_j) + \log \frac{P_Y(i)}{P_Y(j)}$$

this is the equation of a hyper-plane of parameters w and b
Geometric interpretation
which can also be written as

$$(\mu_i - \mu_j)^T \Sigma^{-1} x - \frac{1}{2}(\mu_i + \mu_j)^T \Sigma^{-1} (\mu_i - \mu_j) + \log \frac{P_Y(i)}{P_Y(j)} = 0$$

$$(\mu_i - \mu_j)^T \Sigma^{-1} \left[ x - \frac{1}{2}(\mu_i + \mu_j) + \frac{\mu_i - \mu_j}{(\mu_i - \mu_j)^T \Sigma^{-1} (\mu_i - \mu_j)} \log \frac{P_Y(i)}{P_Y(j)} \right] = 0$$

or

$$w^T (x - x_0) = 0$$

with

$$w = \Sigma^{-1} (\mu_i - \mu_j)$$

$$x_0 = \frac{1}{2}(\mu_i + \mu_j) - \frac{\mu_i - \mu_j}{(\mu_i - \mu_j)^T \Sigma^{-1} (\mu_i - \mu_j)} \log \frac{P_Y(i)}{P_Y(j)}$$
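a minimal sketch computing the boundary parameters w and x0 from these formulas (parameter values are illustrative):

```python
import numpy as np

Sigma = np.array([[1.0, 0.3], [0.3, 2.0]])   # shared, made-up covariance
mu_i, mu_j = np.array([0.0, 0.0]), np.array([3.0, 1.0])
P_i, P_j = 0.6, 0.4

diff = mu_i - mu_j
w = np.linalg.solve(Sigma, diff)             # w = Sigma^{-1} (mu_i - mu_j)
# diff @ w is the Mahalanobis term (mu_i - mu_j)^T Sigma^{-1} (mu_i - mu_j)
x0 = 0.5 * (mu_i + mu_j) - diff / (diff @ w) * np.log(P_i / P_j)

print("w  =", w)
print("x0 =", x0)  # for P_i = P_j this reduces to the midpoint (mu_i + mu_j)/2
# every boundary point x satisfies w^T (x - x0) = 0, in particular x0 itself
```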
Geometric interpretation
this is the equation of the hyper-plane
• of normal vector w
• that passes through x0

$$w^T (x - x_0) = 0$$

$$w = \Sigma^{-1} (\mu_i - \mu_j) \qquad x_0 = \frac{1}{2}(\mu_i + \mu_j) - \frac{\mu_i - \mu_j}{(\mu_i - \mu_j)^T \Sigma^{-1} (\mu_i - \mu_j)} \log \frac{P_Y(i)}{P_Y(j)}$$

[figure: the optimal decision boundary for Gaussian classes of equal covariance]
Geometric interpretation
special case i): $\Sigma_i = \sigma^2 I$
the optimal boundary has

$$w = \frac{1}{\sigma^2} (\mu_i - \mu_j)$$

$$x_0 = \frac{1}{2}(\mu_i + \mu_j) - \frac{\sigma^2 (\mu_i - \mu_j)}{\| \mu_i - \mu_j \|^2} \log \frac{P_Y(i)}{P_Y(j)}$$

• w is a vector along the line through µi and µj

[figure: Gaussian classes, equal covariance σ²I]
Geometric interpretation
for equal prior probabilities ($P_Y(i) = P_Y(j)$), the optimal boundary is

$$w = \frac{1}{\sigma^2} (\mu_i - \mu_j) \qquad x_0 = \frac{\mu_i + \mu_j}{2}$$

• a plane through the midpoint between µi and µj
• orthogonal to the line that joins µi and µj

[figure: Gaussian classes, equal covariance σ²I]
Geometric interpretation
for different prior probabilities ($P_Y(i) \neq P_Y(j)$), x0 moves along the line through µi and µj

$$w = \frac{1}{\sigma^2} (\mu_i - \mu_j)$$

$$x_0 = \frac{1}{2}(\mu_i + \mu_j) - \frac{\sigma^2 (\mu_i - \mu_j)}{\| \mu_i - \mu_j \|^2} \log \frac{P_Y(i)}{P_Y(j)}$$

[figure: Gaussian classes, equal covariance σ²I]
Geometric interpretation
what is the effect of the prior? ($P_Y(i) \neq P_Y(j)$)

$$x_0 = \frac{1}{2}(\mu_i + \mu_j) - \frac{\sigma^2 (\mu_i - \mu_j)}{\| \mu_i - \mu_j \|^2} \log \frac{P_Y(i)}{P_Y(j)}$$

x0 moves away from µi if PY(i) > PY(j), making it more likely to pick i

[figure: Gaussian classes, equal covariance σ²I]
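a small numeric illustration of this effect, with made-up parameters: as PY(i) grows, x0 slides away from µi (towards µj), enlarging the region assigned to class i:

```python
import numpy as np

sigma2 = 1.0                                  # illustrative variance
mu_i, mu_j = np.array([0.0, 0.0]), np.array([4.0, 0.0])
diff = mu_i - mu_j

for P_i in [0.5, 0.7, 0.9]:
    P_j = 1.0 - P_i
    x0 = 0.5 * (mu_i + mu_j) - sigma2 * diff / (diff @ diff) * np.log(P_i / P_j)
    print(f"P_Y(i) = {P_i:.1f}  ->  x0 = {x0}")
# x0 moves from the midpoint (2, 0) towards mu_j = (4, 0) as P_Y(i) grows
```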
Geometric interpretation
what is the strength of this effect? ($P_Y(i) \neq P_Y(j)$)

$$x_0 = \frac{1}{2}(\mu_i + \mu_j) - \frac{\sigma^2 (\mu_i - \mu_j)}{\| \mu_i - \mu_j \|^2} \log \frac{P_Y(i)}{P_Y(j)}$$

the displacement of x0 is "inversely proportional to the distance between the means, in units of standard deviation"

[figure: Gaussian classes, equal covariance σ²I]
Geometric interpretation
note the similarities with the scalar case where, for $\mu_1 > \mu_0$, we pick class 0 when

$$x < \frac{\mu_0 + \mu_1}{2} + \frac{\sigma^2}{\mu_1 - \mu_0} \log \frac{P_Y(0)}{P_Y(1)}$$

while here we have

$$w^T (x - x_0) = 0$$

$$w = \frac{1}{\sigma^2} (\mu_i - \mu_j) \qquad x_0 = \frac{1}{2}(\mu_i + \mu_j) - \frac{\sigma^2 (\mu_i - \mu_j)}{\| \mu_i - \mu_j \|^2} \log \frac{P_Y(i)}{P_Y(j)}$$

• the hyper-plane is the high-dimensional version of the threshold!
Geometric interpretation
[figure: the boundary hyper-plane in 1, 2, and 3D, for various prior configurations]
Geometric interpretation
special case ii): $\Sigma_i = \Sigma$
the optimal boundary is

$$w^T (x - x_0) = 0$$

$$w = \Sigma^{-1} (\mu_i - \mu_j)$$

$$x_0 = \frac{1}{2}(\mu_i + \mu_j) - \frac{\mu_i - \mu_j}{(\mu_i - \mu_j)^T \Sigma^{-1} (\mu_i - \mu_j)} \log \frac{P_Y(i)}{P_Y(j)}$$

• x0 is basically the same; the strength of the prior is inversely proportional to the Mahalanobis distance between the means
• w is multiplied by $\Sigma^{-1}$, which changes its direction and therefore the slope of the hyper-plane
Geometric interpretation
equal but arbitrary covariance:

$$w = \Sigma^{-1} (\mu_i - \mu_j)$$

$$x_0 = \frac{1}{2}(\mu_i + \mu_j) - \frac{\mu_i - \mu_j}{(\mu_i - \mu_j)^T \Sigma^{-1} (\mu_i - \mu_j)} \log \frac{P_Y(i)}{P_Y(j)}$$

[figure: Gaussian classes, equal covariance Σ]
Geometric interpretation
in the homework you will show that the separating plane is tangent to the pdf iso-contours at x0
[figure: Gaussian classes, equal covariance Σ]
• reflects the fact that the natural distance is now Mahalanobis
Geometric interpretation
[figure: the boundary hyper-plane in 1, 2, and 3D, for various prior configurations]
Geometric interpretation
what about the generic case, where the covariances are different?
• in this case

$$i^*(x) = \arg\min_i \left[ d_i(x, \mu_i) + \alpha_i \right]$$

$$d_i(x, y) = (x - y)^T \Sigma_i^{-1} (x - y) \qquad \alpha_i = \log\left[ (2\pi)^d |\Sigma_i| \right] - 2 \log P_Y(i)$$

• there is not much to simplify; dropping the constant $d \log 2\pi$, which is common to all classes,

$$g_i(x) = (x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i) + \log |\Sigma_i| - 2 \log P_Y(i)$$

$$= x^T \Sigma_i^{-1} x - 2 \mu_i^T \Sigma_i^{-1} x + \mu_i^T \Sigma_i^{-1} \mu_i + \log |\Sigma_i| - 2 \log P_Y(i)$$
Geometric interpretation
this can be written as

$$g_i(x) = x^T W_i x + w_i^T x + w_{i0}$$

with

$$W_i = \Sigma_i^{-1} \qquad w_i = -2 \Sigma_i^{-1} \mu_i \qquad w_{i0} = \mu_i^T \Sigma_i^{-1} \mu_i + \log |\Sigma_i| - 2 \log P_Y(i)$$

for 2 classes, the decision boundary is hyper-quadratic
• this could mean a hyper-plane, a pair of hyper-planes, hyper-spheres, hyper-ellipsoids, hyper-hyperboloids, etc.
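a minimal sketch of the resulting quadratic discriminant, with illustrative parameters (the constant d log 2π is dropped, as above, since it is common to all classes):

```python
import numpy as np

mus = [np.array([0.0, 0.0]), np.array([3.0, 1.0])]       # made-up parameters
Sigmas = [np.eye(2), np.array([[2.0, 0.5], [0.5, 1.0]])]
priors = [0.5, 0.5]

def g(i, x):
    # g_i(x) = x^T W_i x + w_i^T x + w_i0, to be *minimized* over i
    Si = np.linalg.inv(Sigmas[i])
    W = Si                                    # W_i = Sigma_i^{-1}
    w = -2.0 * Si @ mus[i]                    # w_i = -2 Sigma_i^{-1} mu_i
    _, logdet = np.linalg.slogdet(Sigmas[i])
    w0 = mus[i] @ Si @ mus[i] + logdet - 2.0 * np.log(priors[i])
    return x @ W @ x + w @ x + w0

x = np.array([1.0, 0.5])
print("decision:", int(np.argmin([g(0, x), g(1, x)])))
```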
Geometric interpretation
[figure: decision boundaries in 2 and 3D]
The sigmoid
we have derived all of this from the log-based BDR

$$i^*(x) = \arg\max_i \left[ \log P_{X|Y}(x|i) + \log P_Y(i) \right]$$

when there are only two classes, it is also interesting to look at the original definition

$$i^*(x) = \arg\max_i g_i(x)$$

with

$$g_i(x) = P_{Y|X}(i|x) = \frac{P_{X|Y}(x|i) \, P_Y(i)}{P_X(x)} = \frac{P_{X|Y}(x|i) \, P_Y(i)}{P_{X|Y}(x|0) P_Y(0) + P_{X|Y}(x|1) P_Y(1)}$$
The sigmoid
note that this can be written as

$$i^*(x) = \arg\max_i g_i(x)$$

$$g_0(x) = \frac{1}{1 + \dfrac{P_{X|Y}(x|1) \, P_Y(1)}{P_{X|Y}(x|0) \, P_Y(0)}} \qquad g_1(x) = 1 - g_0(x)$$

and, for Gaussian classes, the posterior probabilities are

$$g_0(x) = \frac{1}{1 + \exp\left\{ \frac{1}{2}\left[ d_0(x, \mu_0) - d_1(x, \mu_1) + \alpha_0 - \alpha_1 \right] \right\}}$$

where, as before,

$$d_i(x, y) = (x - y)^T \Sigma_i^{-1} (x - y) \qquad \alpha_i = \log\left[ (2\pi)^d |\Sigma_i| \right] - 2 \log P_Y(i)$$
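a minimal sketch evaluating this posterior (illustrative parameters, with d_i and α_i exactly as defined above):

```python
import numpy as np

def d(x, mu, Sigma):
    # d_i(x, mu_i) = (x - mu_i)^T Sigma_i^{-1} (x - mu_i)
    diff = x - mu
    return diff @ np.linalg.solve(Sigma, diff)

def alpha(Sigma, prior, dim):
    # alpha_i = log[(2 pi)^d |Sigma_i|] - 2 log P_Y(i)
    _, logdet = np.linalg.slogdet(Sigma)
    return dim * np.log(2 * np.pi) + logdet - 2.0 * np.log(prior)

mus = [np.array([0.0, 0.0]), np.array([3.0, 1.0])]       # made-up parameters
Sigmas = [np.eye(2), np.array([[2.0, 0.5], [0.5, 1.0]])]
priors = [0.7, 0.3]

x = np.array([1.5, 0.5])
z = (d(x, mus[0], Sigmas[0]) - d(x, mus[1], Sigmas[1])
     + alpha(Sigmas[0], priors[0], 2) - alpha(Sigmas[1], priors[1], 2))
g0 = 1.0 / (1.0 + np.exp(0.5 * z))   # the sigmoid posterior P_{Y|X}(0|x)
print("P_Y|X(0|x) =", g0)
```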
The sigmoid
the posterior g0(x) is a sigmoid and looks like this

$$g_0(x) = \frac{1}{1 + \exp\left\{ \frac{1}{2}\left[ d_0(x, \mu_0) - d_1(x, \mu_1) + \alpha_0 - \alpha_1 \right] \right\}}$$

[figure: sigmoid posterior; the discriminant is the set of points where $P_{Y|X}(1|x) = 0.5$]
The sigmoid
the sigmoid appears in neural networks
• it is the true posterior for Gaussian problems where the covariances are the same

[figure: equal variances, a single boundary at the halfway point between the means]
The sigmoid
but not necessarily when the covariances are different
[figure: variances are different, two boundaries]
Bayesian decision theory
advantages:
• BDR is optimal and cannot be beaten
• Bayes keeps you honest
• models reflect a causal interpretation of the problem; this is how we think
• natural decomposition into “what we knew already” (prior) and
"what data tells us" (the class-conditional density, CCD)
• no need for heuristics to combine these two sources of info
• BDR is, almost invariably, intuitive
• Bayes rule, chain rule, and marginalization enable modularity and scalability to very complicated models and problems
problems:
• the BDR is optimal only insofar as the models are correct.