Maximum likelihood estimation
Nuno Vasconcelos
UCSD
Bayesian decision theory
• recall that we have
– Y – state of the world
– X – observations
– g(x) – decision function
– L[g(x),y] – loss of predicting y with g(x)
• Bayes decision rule is the rule that minimizes the risk
$$\text{Risk} = E_{X,Y}\big[\, L(g(X), Y) \,\big]$$
• for the “0-1” loss
$$L[g(x), y] = \begin{cases} 0, & g(x) = y \\ 1, & g(x) \neq y \end{cases}$$
• the optimal decision rule is the maximum a posteriori probability (MAP) rule
• we have shown that it can be implemented in any of the three following ways
– 1)
$$i^*(x) = \arg\max_i P_{Y|X}(i \mid x)$$
– 2)
$$i^*(x) = \arg\max_i P_{X|Y}(x \mid i)\, P_Y(i)$$
– 3)
$$i^*(x) = \arg\max_i \big[\log P_{X|Y}(x \mid i) + \log P_Y(i)\big]$$
• by introducing a “model” for the class-conditional distributions we can express this as a simple equation
– e.g. for the multivariate Gaussian
$$P_{X|Y}(x \mid i) = \frac{1}{\sqrt{(2\pi)^d |\Sigma_i|}} \exp\left\{-\frac{1}{2}(x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i)\right\}$$
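The density above is easy to evaluate numerically. Below is a minimal NumPy sketch (the function name gaussian_pdf and the example parameters are illustrative, not from the lecture):

```python
import numpy as np

def gaussian_pdf(x, mu, Sigma):
    """Multivariate Gaussian class-conditional P_X|Y(x|i) with mean mu, covariance Sigma."""
    d = len(mu)
    diff = x - mu
    # normalization: 1 / sqrt((2*pi)^d |Sigma|)
    norm = 1.0 / np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))
    # exponent: -(1/2) (x - mu)^T Sigma^{-1} (x - mu)
    return norm * np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff))

# example: a 2-d Gaussian evaluated at one point
mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.3], [0.3, 2.0]])
print(gaussian_pdf(np.array([0.5, -0.5]), mu, Sigma))
```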
The Gaussian classifier
• the solution is
$$i^*(x) = \arg\min_i \big[\, d_i(x, \mu_i) + \alpha_i \,\big]$$
with
$$d_i(x, y) = (x - y)^T \Sigma_i^{-1} (x - y)$$
$$\alpha_i = \log\big[(2\pi)^d |\Sigma_i|\big] - 2 \log P_Y(i)$$
(figure: the discriminant, the surface where $P_{Y|X}(1 \mid x) = 0.5$)
• the optimal rule is to assign x to the closest class
• “closest” is measured with the Mahalanobis distance $d_i(x, y)$
• can be further simplified in special cases
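A minimal sketch of this decision rule, assuming the per-class parameters (μ_i, Σ_i, P_Y(i)) are already known; all names are illustrative:

```python
import numpy as np

def gaussian_classify(x, mus, Sigmas, priors):
    """Assign x to the class minimizing d_i(x, mu_i) + alpha_i."""
    scores = []
    for mu, Sigma, prior in zip(mus, Sigmas, priors):
        diff = x - mu
        d_i = diff @ np.linalg.solve(Sigma, diff)   # Mahalanobis distance
        alpha_i = np.log((2 * np.pi) ** len(mu) * np.linalg.det(Sigma)) - 2 * np.log(prior)
        scores.append(d_i + alpha_i)
    return int(np.argmin(scores))

# two illustrative 2-d classes with equal priors
mus = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
Sigmas = [np.eye(2), np.eye(2)]
print(gaussian_classify(np.array([2.5, 2.0]), mus, Sigmas, [0.5, 0.5]))  # -> 1
```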
Geometric interpretation
• for Gaussian classes with equal covariance $\Sigma_i = \sigma^2 I$, the boundary is the hyperplane $w^T(x - x_0) = 0$, with
$$w = \mu_i - \mu_j$$
$$x_0 = \frac{\mu_i + \mu_j}{2} - \frac{\sigma^2}{\|\mu_i - \mu_j\|^2} \log\frac{P_Y(i)}{P_Y(j)}\,(\mu_i - \mu_j)$$
(figure: two isotropic Gaussians with means $\mu_i$, $\mu_j$; the boundary passes through $x_0$)
Geometric interpretation
• for Gaussian classes with equal but arbitrary covariance $\Sigma_i = \Sigma$, the boundary is the hyperplane $w^T(x - x_0) = 0$, with
$$w = \Sigma^{-1}(\mu_i - \mu_j)$$
$$x_0 = \frac{\mu_i + \mu_j}{2} - \frac{1}{(\mu_i - \mu_j)^T \Sigma^{-1} (\mu_i - \mu_j)} \log\frac{P_Y(i)}{P_Y(j)}\,(\mu_i - \mu_j)$$
(figure: the boundary through $x_0$, with normal $w$)
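The boundary formulas from the two slides above take only a few lines of code. A sketch, assuming a shared covariance (pass Σ = σ²I to recover the isotropic case); names are illustrative:

```python
import numpy as np

def linear_boundary(mu_i, mu_j, Sigma, p_i, p_j):
    """Hyperplane w^T (x - x0) = 0 between two Gaussian classes with shared Sigma."""
    diff = mu_i - mu_j
    w = np.linalg.solve(Sigma, diff)            # w = Sigma^{-1} (mu_i - mu_j)
    maha = diff @ np.linalg.solve(Sigma, diff)  # (mu_i - mu_j)^T Sigma^{-1} (mu_i - mu_j)
    x0 = 0.5 * (mu_i + mu_j) - (np.log(p_i / p_j) / maha) * diff
    return w, x0

w, x0 = linear_boundary(np.array([0.0, 0.0]), np.array([2.0, 0.0]), np.eye(2), 0.5, 0.5)
print(w, x0)   # for equal priors, x0 is the midpoint of the two means
```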
Bayesian decision theory
• advantages:
– BDR is optimal and cannot be beaten
– Bayes keeps you honest
– models reflect causal interpretation of the problem, this is how we think
– natural decomposition into “what we knew already” (prior) and “what data tells us” (CCD)
– no need for heuristics to combine these two sources of info
– BDR is, almost invariably, intuitive
– Bayes rule, chain rule, and marginalization enable modularity, and scalability to very complicated models and problems
• problems:
– BDR is optimal only insofar as the models are correct.
Implementation
• we do have an optimal solution
$$w = \Sigma^{-1}(\mu_i - \mu_j)$$
$$x_0 = \frac{\mu_i + \mu_j}{2} - \frac{1}{(\mu_i - \mu_j)^T \Sigma^{-1} (\mu_i - \mu_j)} \log\frac{P_Y(i)}{P_Y(j)}\,(\mu_i - \mu_j)$$
• but in practice we do not know the values of the parameters $\mu_i$, $\Sigma$, $P_Y(i)$
– we have to somehow estimate these values
– this is OK, we can come up with an estimate from a training set
– e.g. use the average value as an estimate for the mean
$$\hat{w} = \hat{\Sigma}^{-1}(\hat{\mu}_i - \hat{\mu}_j)$$
$$\hat{x}_0 = \frac{\hat{\mu}_i + \hat{\mu}_j}{2} - \frac{1}{(\hat{\mu}_i - \hat{\mu}_j)^T \hat{\Sigma}^{-1} (\hat{\mu}_i - \hat{\mu}_j)} \log\frac{\hat{P}_Y(i)}{\hat{P}_Y(j)}\,(\hat{\mu}_i - \hat{\mu}_j)$$
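A sketch of this plug-in procedure, assuming two classes that share one covariance; the pooled covariance estimate used here is one common choice, not something the slides prescribe:

```python
import numpy as np

def plug_in_boundary(X_i, X_j):
    """Estimate the parameters from training data, then plug them into the boundary."""
    mu_i, mu_j = X_i.mean(axis=0), X_j.mean(axis=0)      # sample means
    centered = np.vstack([X_i - mu_i, X_j - mu_j])
    Sigma = centered.T @ centered / len(centered)        # pooled covariance estimate
    p_i = len(X_i) / (len(X_i) + len(X_j))               # empirical prior of class i
    diff = mu_i - mu_j
    w = np.linalg.solve(Sigma, diff)
    maha = diff @ np.linalg.solve(Sigma, diff)
    x0 = 0.5 * (mu_i + mu_j) - (np.log(p_i / (1 - p_i)) / maha) * diff
    return w, x0
```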
Important
• warning: at this point all optimality claims for the BDR cease to be valid!!
• the BDR is guaranteed to achieve the minimum loss only when we use the true probabilities
• when we “plug in” the probability estimates, we could be implementing a classifier that is quite distant from the optimal
– e.g. if the $P_{X|Y}(x \mid i)$ look like the example above
– I could never approximate them well by parametric models (e.g. Gaussian)
Maximum likelihood
• this seems pretty serious
– how should I get these probabilities then?
• we rely on the maximum likelihood (ML) principle
• this has three steps:
– 1) we choose a parametric model for all probabilities
– to make this clear we denote the vector of parameters by Θ and the class-conditional distributions by
$$P_{X|Y}(x \mid i;\, \Theta)$$
– note that this means that Θ is NOT a random variable (otherwise it would have to show up as a subscript)
– it is simply a parameter, and the probabilities are a function of this parameter
Maximum likelihood
• three steps:
– 2) we assemble a collection of datasets
$$D^{(i)} = \{x_1^{(i)}, \ldots, x_n^{(i)}\}$$
a set of examples drawn independently from class i
– 3) we select the parameters of class i to be the ones that maximize the probability of the data from that class
$$\Theta_i^* = \arg\max_\Theta P_{X|Y}\big(D^{(i)} \mid i;\, \Theta\big) = \arg\max_\Theta \log P_{X|Y}\big(D^{(i)} \mid i;\, \Theta\big)$$
– like before, it does not really make any difference to maximize probabilities or their logs
Maximum likelihood
• since
– each sample D^{(i)} is considered independently
– the parameter Θ_i is estimated only from sample D^{(i)}
• we simply have to repeat the procedure for all classes
• so, from now on, we omit the class variable
$$\Theta^* = \arg\max_\Theta P_X(D;\, \Theta) = \arg\max_\Theta \log P_X(D;\, \Theta)$$
• the function $P_X(D;\, \Theta)$ is called the likelihood of the parameter Θ with respect to the data
• or simply the likelihood function
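To make the definition concrete, here is a sketch of a Gaussian log-likelihood as a function of Θ = (μ, σ²) for a fixed sample (names are illustrative):

```python
import numpy as np

def log_likelihood(D, mu, sigma2):
    """log P_X(D; Theta) for i.i.d. scalar Gaussian data, Theta = (mu, sigma2)."""
    D = np.asarray(D, dtype=float)
    return -0.5 * len(D) * np.log(2 * np.pi * sigma2) - np.sum((D - mu) ** 2) / (2 * sigma2)

# the sample is fixed; the likelihood varies with the parameter
D = [1.2, 0.7, 2.1]
print(log_likelihood(D, mu=1.0, sigma2=1.0))
print(log_likelihood(D, mu=2.0, sigma2=1.0))
```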
Maximum likelihood
• note that the likelihood function is a function of the parameters Θ
• it does not have the same shape as the density itself
• e.g. the likelihood function of a Gaussian is not bell-shaped
• the likelihood is defined only after we have a sample
– e.g., for a single Gaussian observation d:
$$P_X(d;\, \Theta) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{-\frac{(d - \mu)^2}{2\sigma^2}\right\}$$
_{X}Maximum likelihood
• given a sample, to obtain the ML estimate we need to solve
$$\Theta^* = \arg\max_\Theta P_X(D;\, \Theta)$$
• when Θ is a scalar this is high-school calculus
• we have a maximum when
– the first derivative is zero
– the second derivative is negative
The gradient
• in higher dimensions, the generalization of the derivative is the gradient
• the gradient of a function f(w) at z is
$$\nabla f(z) = \left(\frac{\partial f}{\partial w_0}(z), \ldots, \frac{\partial f}{\partial w_{n-1}}(z)\right)^T$$
• the gradient has a nice geometric interpretation
– it points in the direction of maximum growth of the function
– which makes it perpendicular to the contours where the function is constant
(figure: contours of $f(x, y)$ with gradient vectors $\nabla f(x_0, y_0)$ and $\nabla f(x_1, y_1)$ drawn perpendicular to them)
The gradient
• note that if ∇f = 0
– there is no direction of growth
– also −∇f = 0, and there is no direction of decrease
– we are either at a local minimum, a local maximum, or a “saddle” point
• conversely, at a local min, max, or saddle point
– there is no direction of growth or decrease
– ∇f = 0
• this shows that we have a critical point if and only if ∇f = 0
• to determine which type it is, we need second-order conditions
(figure: max, min, and saddle configurations)
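A critical point can be checked numerically with a finite-difference gradient. A sketch (function and names are illustrative):

```python
import numpy as np

def numerical_gradient(f, z, eps=1e-6):
    """Central-difference approximation of the gradient of f at z."""
    z = np.asarray(z, dtype=float)
    g = np.zeros_like(z)
    for k in range(len(z)):
        e = np.zeros_like(z)
        e[k] = eps
        g[k] = (f(z + e) - f(z - e)) / (2 * eps)
    return g

f = lambda v: -v[0] ** 2 - v[1] ** 2            # has a maximum at the origin
print(numerical_gradient(f, [0.0, 0.0]))        # ~[0, 0]: a critical point
print(numerical_gradient(f, [1.0, 0.0]))        # nonzero: not a critical point
```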
The Hessian
• the extension of the second-order derivative is the Hessian matrix
$$\nabla^2 f(x) = \begin{bmatrix} \dfrac{\partial^2 f}{\partial x_0^2}(x) & \cdots & \dfrac{\partial^2 f}{\partial x_0 \partial x_{n-1}}(x) \\ \vdots & \ddots & \vdots \\ \dfrac{\partial^2 f}{\partial x_{n-1} \partial x_0}(x) & \cdots & \dfrac{\partial^2 f}{\partial x_{n-1}^2}(x) \end{bmatrix}$$
– at each point x, it gives us the quadratic function
$$x^t\, \nabla^2 f(x)\, x$$
that best approximates f(x)
The Hessian
• this means that, when the gradient is zero at x, we have
– a maximum when the function can be approximated by an “upwards-facing” quadratic
– a minimum when the function can be approximated by a “downwards-facing” quadratic
– a saddle point otherwise
(figure: max, min, and saddle configurations)
The Hessian
• for any matrix M, the function
$$x^t M x$$
is
– an upwards-facing quadratic when M is negative definite
– a downwards-facing quadratic when M is positive definite
– a saddle otherwise
• hence, all that matters is the definiteness of the Hessian
• we have a maximum when the Hessian is negative definite
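In practice, definiteness is usually checked through the eigenvalues of the (symmetric) Hessian: all negative means negative definite, hence a maximum. A sketch (names illustrative):

```python
import numpy as np

def classify_critical_point(H, tol=1e-9):
    """Classify a critical point from the eigenvalues of its Hessian H."""
    eig = np.linalg.eigvalsh(H)           # eigenvalues of a symmetric matrix
    if np.all(eig < -tol):
        return "maximum"                  # H negative definite
    if np.all(eig > tol):
        return "minimum"                  # H positive definite
    return "saddle or degenerate"

print(classify_critical_point(np.array([[-2.0, 0.0], [0.0, -1.0]])))  # maximum
print(classify_critical_point(np.array([[2.0, 0.0], [0.0, -1.0]])))   # saddle or degenerate
```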
Maximum likelihood
• in summary, given a sample, we need to solve
$$\Theta^* = \arg\max_\Theta P_X(D;\, \Theta)$$
• the solutions are the parameters Θ* such that
$$\nabla_\Theta P_X(D;\, \Theta^*) = 0$$
$$\theta^t\, \nabla^2_\Theta P_X(D;\, \Theta^*)\, \theta \le 0, \quad \forall \theta \in \mathbb{R}^n$$
• note that you always have to check the second-order condition!
Maximum likelihood
• let’s consider the Gaussian example
• given a sample {T_1, …, T_N} of independent points
• the likelihood is
$$P_X(D;\, \Theta) = \prod_{n=1}^{N} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{-\frac{(T_n - \mu)^2}{2\sigma^2}\right\}$$
Maximum likelihood
• and the log-likelihood is
$$\log P_X(D;\, \Theta) = -\frac{N}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{n=1}^{N} (T_n - \mu)^2$$
• the derivative with respect to the mean is zero when
$$\frac{1}{\sigma^2}\sum_{n=1}^{N} (T_n - \hat{\mu}) = 0$$
• or
$$\hat{\mu} = \frac{1}{N}\sum_{n=1}^{N} T_n$$
• note that this is just the sample mean
Maximum likelihood
• and the log-likelihood is
$$\log P_X(D;\, \Theta) = -\frac{N}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{n=1}^{N} (T_n - \mu)^2$$
• the derivative with respect to the variance is zero when
$$-\frac{N}{2\hat{\sigma}^2} + \frac{1}{2\hat{\sigma}^4}\sum_{n=1}^{N} (T_n - \hat{\mu})^2 = 0$$
• or
$$\hat{\sigma}^2 = \frac{1}{N}\sum_{n=1}^{N} (T_n - \hat{\mu})^2$$
• note that this is just the sample variance
Maximum likelihood
• example:
– if the sample is {10, 20, 30, 40, 50}, then
$$\hat{\mu} = \frac{10 + 20 + 30 + 40 + 50}{5} = 30 \qquad \hat{\sigma}^2 = \frac{(-20)^2 + (-10)^2 + 0^2 + 10^2 + 20^2}{5} = 200$$
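A quick NumPy check of these numbers (the array name T is illustrative):

```python
import numpy as np

T = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
mu_hat = T.mean()                        # sample mean: 30.0
sigma2_hat = ((T - mu_hat) ** 2).mean()  # ML variance (1/N normalization): 200.0
print(mu_hat, sigma2_hat)
```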
Homework
• show that the Hessian is negative definite
$$\theta^t\, \nabla^2_\Theta P_X(D;\, \theta)\, \theta \le 0, \quad \forall \theta \in \mathbb{R}^n$$
• show that these formulas can be generalized to the vector case
– D^{(i)} = {x_1^{(i)}, ..., x_n^{(i)}} set of examples from class i
– the ML estimates are
$$\hat{\mu}_i = \frac{1}{n}\sum_j x_j^{(i)} \qquad \hat{\Sigma}_i = \frac{1}{n}\sum_j \big(x_j^{(i)} - \hat{\mu}_i\big)\big(x_j^{(i)} - \hat{\mu}_i\big)^T$$
• note that the ML solution is usually intuitive
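The vector-case estimates map directly to code. A sketch (ml_gaussian is an illustrative name; note the 1/n normalization of Σ̂, which is the ML estimate rather than the unbiased 1/(n−1) one):

```python
import numpy as np

def ml_gaussian(X):
    """ML estimates (mu_hat, Sigma_hat) from the rows of X, one sample per row."""
    mu = X.mean(axis=0)                      # mu_hat = (1/n) sum_j x_j
    centered = X - mu
    Sigma = centered.T @ centered / len(X)   # Sigma_hat = (1/n) sum_j (x_j - mu)(x_j - mu)^T
    return mu, Sigma

X = np.random.default_rng(0).normal(size=(500, 2))
mu_hat, Sigma_hat = ml_gaussian(X)
print(mu_hat)      # close to [0, 0]
print(Sigma_hat)   # close to the identity
```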
Estimators
• when we talk about estimators, it is important to keep in mind that
– an estimate is a number
– an estimator is a random variable
$$\hat{\theta} = f(X_1, \ldots, X_n)$$
• an estimate is the value of the estimator for a given sample
• if D = {x_1, ..., x_n}, when we say
$$\hat{\mu} = \frac{1}{n}\sum_j x_j$$
what we mean is
$$\hat{\mu} = f(x_1, \ldots, x_n) \quad \text{with} \quad f(X_1, \ldots, X_n) = \frac{1}{n}\sum_j X_j, \quad X_1 = x_1, \ldots, X_n = x_n$$
where the $X_i$ are random variables
Bias and variance
• we know how to produce estimators (by ML)
• how do we evaluate an estimator?
• Q_1: is the expected value equal to the true value?
• this is measured by the bias
– if
$$\hat{\theta} = f(X_1, \ldots, X_n)$$
then
$$\text{Bias}\big(\hat{\theta}\big) = E_{X_1, \ldots, X_n}\big[f(X_1, \ldots, X_n) - \theta\big]$$
– an estimator that has bias will usually not converge to the perfect estimate θ, no matter how large the sample is
– e.g. if θ is negative and the estimator is
$$f(X_1, \ldots, X_n) = \frac{1}{n}\sum_j X_j^2$$
then the bias is clearly non-zero
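This example is easy to verify by simulation. A sketch, drawing X_j ~ N(θ, 1) with θ = −1 (all names illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
theta, n, trials = -1.0, 100, 2000

# f(X_1,...,X_n) = (1/n) sum_j X_j^2 is always >= 0, so its expectation
# can never equal the negative true value theta
estimates = [(rng.normal(theta, 1.0, n) ** 2).mean() for _ in range(trials)]
print(np.mean(estimates))   # ~2.0 (= theta^2 + sigma^2), nowhere near -1
```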
Bias and variance
• the estimator is then said to be biased
– this means that it is not expressive enough to approximate the true value arbitrarily well
– this will be clearer when we talk about density estimation
• Q_2: assuming that the estimator converges to the true value, how many sample points do we need?
– this can be measured by the variance
$$\text{Var}\big(\hat{\theta}\big) = E_{X_1, \ldots, X_n}\Big[\big(f(X_1, \ldots, X_n) - E_{X_1, \ldots, X_n}[f(X_1, \ldots, X_n)]\big)^2\Big]$$
– the variance usually decreases as one collects more training examples
Example
• ML estimator for the mean of a Gaussian $N(\mu, \sigma^2)$:
$$\text{Bias}\big(\hat{\mu}\big) = E_{X_1, \ldots, X_n}\big[\hat{\mu} - \mu\big] = E_{X_1, \ldots, X_n}\big[\hat{\mu}\big] - \mu$$
$$= E_{X_1, \ldots, X_n}\left[\frac{1}{n}\sum_i X_i\right] - \mu = \frac{1}{n}\sum_i E_{X_i}\big[X_i\big] - \mu$$
$$= \frac{1}{n}\sum_i \mu - \mu = \mu - \mu = 0$$
• i.e., the ML estimator of the mean is unbiased
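A small simulation (names illustrative) confirms the result, and also previews the variance behavior from the previous slide: the average of the estimator stays at μ for every n, while its spread shrinks like σ²/n:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 3.0, 2.0

for n in (10, 100, 1000):
    # 5000 independent samples of size n -> 5000 realizations of the estimator
    estimates = rng.normal(mu, sigma, size=(5000, n)).mean(axis=1)
    print(n, estimates.mean().round(3), estimates.var().round(4))
    # mean stays ~3 (zero bias); variance ~ sigma^2 / n
```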