Maximum likelihood estimation
Nuno Vasconcelos
UCSD
Bayesian decision theory
• recall that we have
– Y – state of the world
– X – observations
– g(x) – decision function
– L[g(x),y] – loss of predicting y with g(x)
• Bayes decision rule is the rule that minimizes the risk
$$\text{Risk} = E_{X,Y}\big[\, L(g(X), Y) \,\big]$$
• for the “0-1” loss
$$L[g(x), y] = \begin{cases} 0, & g(x) = y \\ 1, & g(x) \neq y \end{cases}$$
• the optimal decision rule is the maximum a posteriori probability (MAP) rule
• we have shown that it can be implemented in any of the three following ways
– 1)
$$i^*(x) = \arg\max_i P_{Y|X}(i \mid x)$$
– 2)
$$i^*(x) = \arg\max_i P_{X|Y}(x \mid i)\, P_Y(i)$$
– 3)
$$i^*(x) = \arg\max_i \big[\log P_{X|Y}(x \mid i) + \log P_Y(i)\big]$$
• by introducing a “model” for the class-conditional distributions we can express this as a simple equation
– e.g. for the multivariate Gaussian
$$P_{X|Y}(x \mid i) = \frac{1}{\sqrt{(2\pi)^d |\Sigma_i|}} \exp\left\{-\frac{1}{2}(x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i)\right\}$$
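The density above is easy to evaluate numerically. Below is a minimal NumPy sketch (the function name gaussian_pdf and the example parameters are illustrative, not from the lecture):

```python
import numpy as np

def gaussian_pdf(x, mu, Sigma):
    """Multivariate Gaussian class-conditional P_X|Y(x|i) with mean mu, covariance Sigma."""
    d = len(mu)
    diff = x - mu
    # normalization: 1 / sqrt((2*pi)^d |Sigma|)
    norm = 1.0 / np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))
    # exponent: -(1/2) (x - mu)^T Sigma^{-1} (x - mu)
    return norm * np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff))

# example: a 2-d Gaussian evaluated at one point
mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.3], [0.3, 2.0]])
print(gaussian_pdf(np.array([0.5, -0.5]), mu, Sigma))
```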
The Gaussian classifier
• the solution is
$$i^*(x) = \arg\min_i \big[\, d_i(x, \mu_i) + \alpha_i \,\big]$$
with
$$d_i(x, y) = (x - y)^T \Sigma_i^{-1} (x - y)$$
$$\alpha_i = \log\big[(2\pi)^d |\Sigma_i|\big] - 2 \log P_Y(i)$$
(figure: the discriminant, the surface where $P_{Y|X}(1 \mid x) = 0.5$)
• the optimal rule is to assign x to the closest class
• “closest” is measured with the Mahalanobis distance $d_i(x, y)$
• can be further simplified in special cases
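A minimal sketch of this decision rule, assuming the per-class parameters (μ_i, Σ_i, P_Y(i)) are already known; all names are illustrative:

```python
import numpy as np

def gaussian_classify(x, mus, Sigmas, priors):
    """Assign x to the class minimizing d_i(x, mu_i) + alpha_i."""
    scores = []
    for mu, Sigma, prior in zip(mus, Sigmas, priors):
        diff = x - mu
        d_i = diff @ np.linalg.solve(Sigma, diff)   # Mahalanobis distance
        alpha_i = np.log((2 * np.pi) ** len(mu) * np.linalg.det(Sigma)) - 2 * np.log(prior)
        scores.append(d_i + alpha_i)
    return int(np.argmin(scores))

# two illustrative 2-d classes with equal priors
mus = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
Sigmas = [np.eye(2), np.eye(2)]
print(gaussian_classify(np.array([2.5, 2.0]), mus, Sigmas, [0.5, 0.5]))  # -> 1
```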
Geometric interpretation
• for Gaussian classes with equal covariance $\Sigma_i = \sigma^2 I$, the boundary is the hyperplane $w^T(x - x_0) = 0$, with
$$w = \mu_i - \mu_j$$
$$x_0 = \frac{\mu_i + \mu_j}{2} - \frac{\sigma^2}{\|\mu_i - \mu_j\|^2} \log\frac{P_Y(i)}{P_Y(j)}\,(\mu_i - \mu_j)$$
(figure: two isotropic Gaussians with means $\mu_i$, $\mu_j$; the boundary passes through $x_0$)
Geometric interpretation
• for Gaussian classes with equal but arbitrary covariance $\Sigma_i = \Sigma$, the boundary is the hyperplane $w^T(x - x_0) = 0$, with
$$w = \Sigma^{-1}(\mu_i - \mu_j)$$
$$x_0 = \frac{\mu_i + \mu_j}{2} - \frac{1}{(\mu_i - \mu_j)^T \Sigma^{-1} (\mu_i - \mu_j)} \log\frac{P_Y(i)}{P_Y(j)}\,(\mu_i - \mu_j)$$
(figure: the boundary through $x_0$, with normal $w$)
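The boundary formulas from the two slides above take only a few lines of code. A sketch, assuming a shared covariance (pass Σ = σ²I to recover the isotropic case); names are illustrative:

```python
import numpy as np

def linear_boundary(mu_i, mu_j, Sigma, p_i, p_j):
    """Hyperplane w^T (x - x0) = 0 between two Gaussian classes with shared Sigma."""
    diff = mu_i - mu_j
    w = np.linalg.solve(Sigma, diff)            # w = Sigma^{-1} (mu_i - mu_j)
    maha = diff @ np.linalg.solve(Sigma, diff)  # (mu_i - mu_j)^T Sigma^{-1} (mu_i - mu_j)
    x0 = 0.5 * (mu_i + mu_j) - (np.log(p_i / p_j) / maha) * diff
    return w, x0

w, x0 = linear_boundary(np.array([0.0, 0.0]), np.array([2.0, 0.0]), np.eye(2), 0.5, 0.5)
print(w, x0)   # for equal priors, x0 is the midpoint of the two means
```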
Bayesian decision theory
• advantages:
– BDR is optimal and cannot be beaten
– Bayes keeps you honest
– models reflect causal interpretation of the problem, this is how we think
– natural decomposition into “what we knew already” (prior) and “what data tells us” (CCD)
– no need for heuristics to combine these two sources of info
– BDR is, almost invariably, intuitive
– Bayes rule, chain rule, and marginalization enable modularity, and scalability to very complicated models and problems
• problems:
– BDR is optimal only insofar as the models are correct.
Implementation
• we do have an optimal solution
$$w = \Sigma^{-1}(\mu_i - \mu_j)$$
$$x_0 = \frac{\mu_i + \mu_j}{2} - \frac{1}{(\mu_i - \mu_j)^T \Sigma^{-1} (\mu_i - \mu_j)} \log\frac{P_Y(i)}{P_Y(j)}\,(\mu_i - \mu_j)$$
• but in practice we do not know the values of the parameters $\mu_i$, $\Sigma$, $P_Y(i)$
– we have to somehow estimate these values
– this is OK, we can come up with an estimate from a training set
– e.g. use the average value as an estimate for the mean
$$\hat{w} = \hat{\Sigma}^{-1}(\hat{\mu}_i - \hat{\mu}_j)$$
$$\hat{x}_0 = \frac{\hat{\mu}_i + \hat{\mu}_j}{2} - \frac{1}{(\hat{\mu}_i - \hat{\mu}_j)^T \hat{\Sigma}^{-1} (\hat{\mu}_i - \hat{\mu}_j)} \log\frac{\hat{P}_Y(i)}{\hat{P}_Y(j)}\,(\hat{\mu}_i - \hat{\mu}_j)$$
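A sketch of this plug-in procedure, assuming two classes that share one covariance; the pooled covariance estimate used here is one common choice, not something the slides prescribe:

```python
import numpy as np

def plug_in_boundary(X_i, X_j):
    """Estimate the parameters from training data, then plug them into the boundary."""
    mu_i, mu_j = X_i.mean(axis=0), X_j.mean(axis=0)      # sample means
    centered = np.vstack([X_i - mu_i, X_j - mu_j])
    Sigma = centered.T @ centered / len(centered)        # pooled covariance estimate
    p_i = len(X_i) / (len(X_i) + len(X_j))               # empirical prior of class i
    diff = mu_i - mu_j
    w = np.linalg.solve(Sigma, diff)
    maha = diff @ np.linalg.solve(Sigma, diff)
    x0 = 0.5 * (mu_i + mu_j) - (np.log(p_i / (1 - p_i)) / maha) * diff
    return w, x0
```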
Important
• warning: at this point all optimality claims for the BDR cease to be valid!!
• the BDR is guaranteed to achieve the minimum loss only when we use the true probabilities
• when we “plug in” the probability estimates, we could be implementing a classifier that is quite distant from the optimal
– e.g. if the $P_{X|Y}(x \mid i)$ look like the example above
– I could never approximate them well by parametric models (e.g. Gaussian)
Maximum likelihood
• this seems pretty serious
– how should I get these probabilities then?
• we rely on the maximum likelihood (ML) principle
• this has three steps:
– 1) we choose a parametric model for all probabilities
– to make this clear we denote the vector of parameters by Θ and the class-conditional distributions by
$$P_{X|Y}(x \mid i;\, \Theta)$$
– note that this means that Θ is NOT a random variable (otherwise it would have to show up as a subscript)
– it is simply a parameter, and the probabilities are a function of this parameter
Maximum likelihood
• three steps:
– 2) we assemble a collection of datasets
$$D^{(i)} = \{x_1^{(i)}, \ldots, x_n^{(i)}\}$$
a set of examples drawn independently from class i
– 3) we select the parameters of class i to be the ones that maximize the probability of the data from that class
$$\Theta_i^* = \arg\max_\Theta P_{X|Y}\big(D^{(i)} \mid i;\, \Theta\big) = \arg\max_\Theta \log P_{X|Y}\big(D^{(i)} \mid i;\, \Theta\big)$$
– like before, it does not really make any difference to maximize probabilities or their logs
Maximum likelihood
• since
– each sample D^{(i)} is considered independently
– the parameter Θ_i is estimated only from sample D^{(i)}
• we simply have to repeat the procedure for all classes
• so, from now on, we omit the class variable
$$\Theta^* = \arg\max_\Theta P_X(D;\, \Theta) = \arg\max_\Theta \log P_X(D;\, \Theta)$$
• the function $P_X(D;\, \Theta)$ is called the likelihood of the parameter Θ with respect to the data
• or simply the likelihood function
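To make the definition concrete, here is a sketch of a Gaussian log-likelihood as a function of Θ = (μ, σ²) for a fixed sample (names are illustrative):

```python
import numpy as np

def log_likelihood(D, mu, sigma2):
    """log P_X(D; Theta) for i.i.d. scalar Gaussian data, Theta = (mu, sigma2)."""
    D = np.asarray(D, dtype=float)
    return -0.5 * len(D) * np.log(2 * np.pi * sigma2) - np.sum((D - mu) ** 2) / (2 * sigma2)

# the sample is fixed; the likelihood varies with the parameter
D = [1.2, 0.7, 2.1]
print(log_likelihood(D, mu=1.0, sigma2=1.0))
print(log_likelihood(D, mu=2.0, sigma2=1.0))
```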
Maximum likelihood
• note that the likelihood function is a function of the parameters Θ
• it does not have the same shape as the density itself
• e.g. the likelihood function of a Gaussian is not bell-shaped
• the likelihood is defined only after we have a sample
– e.g., for a single Gaussian observation d:
$$P_X(d;\, \Theta) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{-\frac{(d - \mu)^2}{2\sigma^2}\right\}$$
_{X}Maximum likelihood
• given a sample, to obtain the ML estimate we need to solve
$$\Theta^* = \arg\max_\Theta P_X(D;\, \Theta)$$
• when Θ is a scalar this is high-school calculus
• we have a maximum when
– the first derivative is zero
– the second derivative is negative
The gradient
• in higher dimensions, the generalization of the derivative is the gradient
• the gradient of a function f(w) at z is
$$\nabla f(z) = \left(\frac{\partial f}{\partial w_0}(z), \ldots, \frac{\partial f}{\partial w_{n-1}}(z)\right)^T$$
• the gradient has a nice geometric interpretation
– it points in the direction of maximum growth of the function
– which makes it perpendicular to the contours where the function is constant
(figure: contours of $f(x, y)$ with gradient vectors $\nabla f(x_0, y_0)$ and $\nabla f(x_1, y_1)$ drawn perpendicular to them)
The gradient
• note that if ∇f = 0
– there is no direction of growth
– also −∇f = 0, and there is no direction of decrease
– we are either at a local minimum, a local maximum, or a “saddle” point
• conversely, at a local min, max, or saddle point
– there is no direction of growth or decrease
– ∇f = 0
• this shows that we have a critical point if and only if ∇f = 0
• to determine which type it is, we need second-order conditions
(figure: max, min, and saddle configurations)
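A critical point can be checked numerically with a finite-difference gradient. A sketch (function and names are illustrative):

```python
import numpy as np

def numerical_gradient(f, z, eps=1e-6):
    """Central-difference approximation of the gradient of f at z."""
    z = np.asarray(z, dtype=float)
    g = np.zeros_like(z)
    for k in range(len(z)):
        e = np.zeros_like(z)
        e[k] = eps
        g[k] = (f(z + e) - f(z - e)) / (2 * eps)
    return g

f = lambda v: -v[0] ** 2 - v[1] ** 2            # has a maximum at the origin
print(numerical_gradient(f, [0.0, 0.0]))        # ~[0, 0]: a critical point
print(numerical_gradient(f, [1.0, 0.0]))        # nonzero: not a critical point
```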
The Hessian
• the extension of the second-order derivative is the Hessian matrix
$$\nabla^2 f(x) = \begin{bmatrix} \dfrac{\partial^2 f}{\partial x_0^2}(x) & \cdots & \dfrac{\partial^2 f}{\partial x_0 \partial x_{n-1}}(x) \\ \vdots & \ddots & \vdots \\ \dfrac{\partial^2 f}{\partial x_{n-1} \partial x_0}(x) & \cdots & \dfrac{\partial^2 f}{\partial x_{n-1}^2}(x) \end{bmatrix}$$
– at each point x, it gives us the quadratic function
$$x^t\, \nabla^2 f(x)\, x$$
that best approximates f(x)
The Hessian
• this means that, when the gradient is zero at x, we have
– a maximum when the function can be approximated by an “upwards-facing” quadratic
– a minimum when the function can be approximated by a “downwards-facing” quadratic
– a saddle point otherwise
(figure: max, min, and saddle configurations)
The Hessian
• for any matrix M, the function
$$x^t M x$$
is
– an upwards-facing quadratic when M is negative definite
– a downwards-facing quadratic when M is positive definite
– a saddle otherwise
• hence, all that matters is the definiteness of the Hessian
• we have a maximum when the Hessian is negative definite
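In practice, definiteness is usually checked through the eigenvalues of the (symmetric) Hessian: all negative means negative definite, hence a maximum. A sketch (names illustrative):

```python
import numpy as np

def classify_critical_point(H, tol=1e-9):
    """Classify a critical point from the eigenvalues of its Hessian H."""
    eig = np.linalg.eigvalsh(H)           # eigenvalues of a symmetric matrix
    if np.all(eig < -tol):
        return "maximum"                  # H negative definite
    if np.all(eig > tol):
        return "minimum"                  # H positive definite
    return "saddle or degenerate"

print(classify_critical_point(np.array([[-2.0, 0.0], [0.0, -1.0]])))  # maximum
print(classify_critical_point(np.array([[2.0, 0.0], [0.0, -1.0]])))   # saddle or degenerate
```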
Maximum likelihood
• in summary, given a sample, we need to solve
$$\Theta^* = \arg\max_\Theta P_X(D;\, \Theta)$$
• the solutions are the parameters Θ* such that
$$\nabla_\Theta P_X(D;\, \Theta^*) = 0$$
$$\theta^t\, \nabla^2_\Theta P_X(D;\, \Theta^*)\, \theta \le 0, \quad \forall \theta \in \mathbb{R}^n$$
• note that you always have to check the second-order condition!
Maximum likelihood
• let’s consider the Gaussian example
• given a sample {T_1, …, T_N} of independent points
• the likelihood is
$$P_X(D;\, \Theta) = \prod_{n=1}^{N} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{-\frac{(T_n - \mu)^2}{2\sigma^2}\right\}$$
Maximum likelihood
• and the log-likelihood is
$$\log P_X(D;\, \Theta) = -\frac{N}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{n=1}^{N} (T_n - \mu)^2$$
• the derivative with respect to the mean is zero when
$$\frac{1}{\sigma^2}\sum_{n=1}^{N} (T_n - \hat{\mu}) = 0$$
• or
$$\hat{\mu} = \frac{1}{N}\sum_{n=1}^{N} T_n$$
• note that this is just the sample mean
Maximum likelihood
• and the log-likelihood is
$$\log P_X(D;\, \Theta) = -\frac{N}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{n=1}^{N} (T_n - \mu)^2$$
• the derivative with respect to the variance is zero when
$$-\frac{N}{2\hat{\sigma}^2} + \frac{1}{2\hat{\sigma}^4}\sum_{n=1}^{N} (T_n - \hat{\mu})^2 = 0$$
• or
$$\hat{\sigma}^2 = \frac{1}{N}\sum_{n=1}^{N} (T_n - \hat{\mu})^2$$
• note that this is just the sample variance
Maximum likelihood
• example:
– if the sample is {10, 20, 30, 40, 50}, then
$$\hat{\mu} = \frac{10 + 20 + 30 + 40 + 50}{5} = 30 \qquad \hat{\sigma}^2 = \frac{(-20)^2 + (-10)^2 + 0^2 + 10^2 + 20^2}{5} = 200$$
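A quick NumPy check of these numbers (the array name T is illustrative):

```python
import numpy as np

T = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
mu_hat = T.mean()                        # sample mean: 30.0
sigma2_hat = ((T - mu_hat) ** 2).mean()  # ML variance (1/N normalization): 200.0
print(mu_hat, sigma2_hat)
```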
Homework
• show that the Hessian is negative definite
$$\theta^t\, \nabla^2_\Theta P_X(D;\, \theta)\, \theta \le 0, \quad \forall \theta \in \mathbb{R}^n$$
• show that these formulas can be generalized to the vector case
– D^{(i)} = {x_1^{(i)}, ..., x_n^{(i)}} set of examples from class i
– the ML estimates are
$$\hat{\mu}_i = \frac{1}{n}\sum_j x_j^{(i)} \qquad \hat{\Sigma}_i = \frac{1}{n}\sum_j \big(x_j^{(i)} - \hat{\mu}_i\big)\big(x_j^{(i)} - \hat{\mu}_i\big)^T$$
• note that the ML solution is usually intuitive
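The vector-case estimates map directly to code. A sketch (ml_gaussian is an illustrative name; note the 1/n normalization of Σ̂, which is the ML estimate rather than the unbiased 1/(n−1) one):

```python
import numpy as np

def ml_gaussian(X):
    """ML estimates (mu_hat, Sigma_hat) from the rows of X, one sample per row."""
    mu = X.mean(axis=0)                      # mu_hat = (1/n) sum_j x_j
    centered = X - mu
    Sigma = centered.T @ centered / len(X)   # Sigma_hat = (1/n) sum_j (x_j - mu)(x_j - mu)^T
    return mu, Sigma

X = np.random.default_rng(0).normal(size=(500, 2))
mu_hat, Sigma_hat = ml_gaussian(X)
print(mu_hat)      # close to [0, 0]
print(Sigma_hat)   # close to the identity
```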
Estimators
• when we talk about estimators, it is important to keep in mind that
– an estimate is a number
– an estimator is a random variable
$$\hat{\theta} = f(X_1, \ldots, X_n)$$
• an estimate is the value of the estimator for a given sample
• if D = {x_1, ..., x_n}, when we say
$$\hat{\mu} = \frac{1}{n}\sum_j x_j$$
what we mean is
$$\hat{\mu} = f(x_1, \ldots, x_n) \quad \text{with} \quad f(X_1, \ldots, X_n) = \frac{1}{n}\sum_j X_j, \quad X_1 = x_1, \ldots, X_n = x_n$$
where the $X_i$ are random variables
Bias and variance
• we know how to produce estimators (by ML)
• how do we evaluate an estimator?
• Q_1: is the expected value equal to the true value?
• this is measured by the bias
– if
$$\hat{\theta} = f(X_1, \ldots, X_n)$$
then
$$\text{Bias}\big(\hat{\theta}\big) = E_{X_1, \ldots, X_n}\big[f(X_1, \ldots, X_n) - \theta\big]$$
– an estimator that has bias will usually not converge to the perfect estimate θ, no matter how large the sample is
– e.g. if θ is negative and the estimator is
$$f(X_1, \ldots, X_n) = \frac{1}{n}\sum_j X_j^2$$
then the bias is clearly non-zero
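This example is easy to verify by simulation. A sketch, drawing X_j ~ N(θ, 1) with θ = −1 (all names illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
theta, n, trials = -1.0, 100, 2000

# f(X_1,...,X_n) = (1/n) sum_j X_j^2 is always >= 0, so its expectation
# can never equal the negative true value theta
estimates = [(rng.normal(theta, 1.0, n) ** 2).mean() for _ in range(trials)]
print(np.mean(estimates))   # ~2.0 (= theta^2 + sigma^2), nowhere near -1
```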
Bias and variance
• the estimator is then said to be biased
– this means that it is not expressive enough to approximate the true value arbitrarily well
– this will be clearer when we talk about density estimation
• Q_2: assuming that the estimator converges to the true value, how many sample points do we need?
– this can be measured by the variance
$$\text{Var}\big(\hat{\theta}\big) = E_{X_1, \ldots, X_n}\Big[\big(f(X_1, \ldots, X_n) - E_{X_1, \ldots, X_n}[f(X_1, \ldots, X_n)]\big)^2\Big]$$
– the variance usually decreases as one collects more training examples
Example
• ML estimator for the mean of a Gaussian $N(\mu, \sigma^2)$:
$$\text{Bias}\big(\hat{\mu}\big) = E_{X_1, \ldots, X_n}\big[\hat{\mu} - \mu\big] = E_{X_1, \ldots, X_n}\big[\hat{\mu}\big] - \mu$$
$$= E_{X_1, \ldots, X_n}\left[\frac{1}{n}\sum_i X_i\right] - \mu = \frac{1}{n}\sum_i E_{X_i}\big[X_i\big] - \mu$$
$$= \frac{1}{n}\sum_i \mu - \mu = \mu - \mu = 0$$
• i.e., the ML estimator of the mean is unbiased
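A small simulation (names illustrative) confirms the result, and also previews the variance behavior from the previous slide: the average of the estimator stays at μ for every n, while its spread shrinks like σ²/n:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 3.0, 2.0

for n in (10, 100, 1000):
    # 5000 independent samples of size n -> 5000 realizations of the estimator
    estimates = rng.normal(mu, sigma, size=(5000, n)).mean(axis=1)
    print(n, estimates.mean().round(3), estimates.var().round(4))
    # mean stays ~3 (zero bias); variance ~ sigma^2 / n
```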