Maximum likelihood estimation


(1)

Maximum likelihood estimation

Nuno Vasconcelos

UCSD

(2)

Bayesian decision theory

• recall that we have
  – Y – state of the world
  – X – observations
  – g(x) – decision function
  – L[g(x),y] – loss of predicting y with g(x)
• Bayes decision rule is the rule that minimizes the risk

  Risk = E_{X,Y}[ L(g(X), Y) ]

• for the “0-1” loss

  L[g(x), y] = 1 if g(x) ≠ y,  0 if g(x) = y

• optimal decision rule is the maximum a-posteriori probability rule

(3)

MAP rule

• we have shown that it can be implemented in any of the three following ways

  – 1) i*(x) = argmax_i P_{Y|X}(i|x)

  – 2) i*(x) = argmax_i P_{X|Y}(x|i) P_Y(i)

  – 3) i*(x) = argmax_i [ log P_{X|Y}(x|i) + log P_Y(i) ]

• by introducing a “model” for the class-conditional distributions we can express this as a simple equation
  – e.g. for the multivariate Gaussian

  P_{X|Y}(x|i) = 1/√((2π)^d |Σ_i|) · exp{ −½ (x − µ_i)^T Σ_i^{−1} (x − µ_i) }
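The MAP rule with Gaussian class-conditionals can be sketched directly from form 3). A minimal sketch, assuming numpy is available; the names `gaussian_pdf` and `map_rule` are illustrative, not from the slides:

```python
import numpy as np

def gaussian_pdf(x, mu, Sigma):
    # multivariate Gaussian density P_{X|Y}(x|i) from the slide
    d = len(mu)
    diff = x - mu
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))
    return float(np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff)) / norm)

def map_rule(x, mus, Sigmas, priors):
    # form 3): i*(x) = argmax_i [ log P_{X|Y}(x|i) + log P_Y(i) ]
    scores = [np.log(gaussian_pdf(x, m, S)) + np.log(p)
              for m, S, p in zip(mus, Sigmas, priors)]
    return int(np.argmax(scores))
```

With equal priors and identity covariances this reduces to picking the nearest class mean, which matches the geometric discussion later in the deck.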

(4)

The Gaussian classifier

• the solution is

  i*(x) = argmin_i [ d_i(x, µ_i) + α_i ]

  with

  d_i(x, y) = (x − y)^T Σ_i^{−1} (x − y)
  α_i = log( (2π)^d |Σ_i| ) − 2 log P_Y(i)

• discriminant: P_{Y|X}(1|x) = 0.5
• the optimal rule is to assign x to the closest class
• closest is measured with the Mahalanobis distance d_i(x, y)
• can be further simplified in special cases
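The argmin form of the rule is easy to sketch. A hedged illustration, assuming numpy; `discriminant` is a name chosen here, not from the slides:

```python
import numpy as np

def discriminant(x, mu_i, Sigma_i, prior_i):
    # d_i(x, mu_i) + alpha_i from the slide; the class with the
    # SMALLEST value wins, so argmin over classes implements the rule
    diff = x - mu_i
    d_i = float(diff @ np.linalg.solve(Sigma_i, diff))  # Mahalanobis distance
    alpha_i = np.log((2 * np.pi) ** len(mu_i) * np.linalg.det(Sigma_i)) \
              - 2 * np.log(prior_i)
    return d_i + alpha_i
```

With equal covariances and equal priors, α_i is the same for every class, so the rule assigns x to the class with the closest mean.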

(5)

Geometric interpretation

• for Gaussian classes, equal covariance Σ_i = σ² I

  w = µ_i − µ_j

  x_0 = (µ_i + µ_j)/2 − σ²/‖µ_i − µ_j‖² · log( P_Y(i)/P_Y(j) ) · (µ_i − µ_j)

(figure: class means µ_i and µ_j with the boundary normal w through the point x_0)

Geometric interpretation

• for Gaussian classes, equal but arbitrary covariance Σ

  w = Σ^{−1} (µ_i − µ_j)

  x_0 = (µ_i + µ_j)/2 − log( P_Y(i)/P_Y(j) ) / ( (µ_i − µ_j)^T Σ^{−1} (µ_i − µ_j) ) · (µ_i − µ_j)

(figure: class means µ_i and µ_j with the boundary normal w through the point x_0)
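The quantities w and x_0 above determine the separating hyperplane (w is the normal, x_0 a point on the boundary). A minimal sketch of computing them, assuming numpy; `boundary_params` is an illustrative name:

```python
import numpy as np

def boundary_params(mu_i, mu_j, Sigma, p_i, p_j):
    # w = Sigma^{-1}(mu_i - mu_j);  x0 shifts off the midpoint when
    # the priors are unequal, exactly as in the slide's formula
    diff = mu_i - mu_j
    w = np.linalg.solve(Sigma, diff)
    x0 = 0.5 * (mu_i + mu_j) - (np.log(p_i / p_j) / (diff @ w)) * diff
    return w, x0
```

For equal priors the log term vanishes and x_0 is just the midpoint of the two means.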

(7)

Bayesian decision theory

• advantages:

– BDR is optimal and cannot be beaten – Bayes keeps you honest

– models reflect causal interpretation of the problem, this is how we think

– natural decomposition into “what we knew already” (prior) and

“what data tells us” (CCD)

– no need for heuristics to combine these two sources of info
– BDR is, almost invariably, intuitive
– Bayes rule, chain rule, and marginalization enable modularity, and scalability to very complicated models and problems

• problems:

– BDR is optimal only insofar as the models are correct.

(8)

Implementation

• we do have an optimal solution

  w = Σ^{−1} (µ_i − µ_j)

  x_0 = (µ_i + µ_j)/2 − log( P_Y(i)/P_Y(j) ) / ( (µ_i − µ_j)^T Σ^{−1} (µ_i − µ_j) ) · (µ_i − µ_j)

• but in practice we do not know the values of the parameters µ_i, Σ, P_Y(i)
  – we have to somehow estimate these values
  – this is OK, we can come up with an estimate from a training set
  – e.g. use the average value as an estimate for the mean
• plugging in the estimates:

  ŵ = Σ̂^{−1} (µ̂_i − µ̂_j)

  x̂_0 = (µ̂_i + µ̂_j)/2 − log( P̂_Y(i)/P̂_Y(j) ) / ( (µ̂_i − µ̂_j)^T Σ̂^{−1} (µ̂_i − µ̂_j) ) · (µ̂_i − µ̂_j)

(9)

Important

• warning: at this point all optimality claims for the BDR cease to be valid!
• the BDR is guaranteed to achieve the minimum loss only when we use the true probabilities
• when we “plug in” the probability estimates, we could be implementing a classifier that is quite distant from the optimal
  – e.g. if the P_{X|Y}(x|i) look like the example above, I could never approximate them well by parametric models (e.g. Gaussian)

(10)

Maximum likelihood

• this seems pretty serious
  – how should I get these probabilities then?
• we rely on the maximum likelihood (ML) principle
• this has three steps:
  – 1) we choose a parametric model for all probabilities
  – to make this clear we denote the vector of parameters by Θ and the class-conditional distributions by

  P_{X|Y}(x|i; Θ)

  – note that this means that Θ is NOT a random variable (otherwise it would have to show up as subscript)
  – it is simply a parameter, and the probabilities are a function of this parameter

(11)

Maximum likelihood

• three steps:
  – 2) we assemble a collection of datasets

  D^(i) = {x_1^(i), ..., x_n^(i)}  set of examples drawn independently from class i

  – 3) we select the parameters of class i to be the ones that maximize the probability of the data from that class

  Θ_i* = argmax_Θ P_{X|Y}(D^(i) | i; Θ)
       = argmax_Θ log P_{X|Y}(D^(i) | i; Θ)

  – like before, it does not really make any difference to maximize probabilities or their logs

(12)

Maximum likelihood

• since
  – each sample D^(i) is considered independently
  – parameter Θ_i is estimated only from sample D^(i)
• we simply have to repeat the procedure for all classes
• so, from now on we omit the class variable

  Θ* = argmax_Θ P_X(D; Θ) = argmax_Θ log P_X(D; Θ)

• the function P_X(D; Θ) is called the likelihood of the parameter Θ with respect to the data
• or simply the likelihood function

(13)

Maximum likelihood

• note that the likelihood function is a function of the parameters Θ
• it does not have the same shape as the density itself
• e.g. the likelihood function of a Gaussian is not bell-shaped
• the likelihood is defined only after we have a sample

  P_X(d; Θ) = (2πσ²)^{−1/2} exp{ −(d − µ)² / (2σ²) },  Θ = (µ, σ²)
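This is easy to see numerically: fix a sample, sweep µ over a grid, and evaluate the log-likelihood at each value. A minimal sketch assuming numpy (the sample values and grid are illustrative):

```python
import numpy as np

def log_likelihood(mu, sample, sigma=1.0):
    # log P_X(D; Theta) as a function of mu, with sigma held fixed
    return float(np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                        - (sample - mu) ** 2 / (2 * sigma**2)))

sample = np.array([1.0, 2.0, 3.0])
grid = np.linspace(-5.0, 5.0, 1001)
best = grid[int(np.argmax([log_likelihood(m, sample) for m in grid]))]
```

The maximizing µ lands on the sample mean (2.0 here), anticipating the derivation a few slides below.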

(14)

Maximum likelihood

• given a sample, to obtain the ML estimate we need to solve

  Θ* = argmax_Θ P_X(D; Θ)

• when Θ is a scalar this is high-school calculus
• we have a maximum when
  – first derivative is zero
  – second derivative is negative

(15)

The gradient

• in higher dimensions, the generalization of the derivative is the gradient
• the gradient of a function f(w) at z is

  ∇f(z) = ( ∂f/∂w_0 (z), ..., ∂f/∂w_{n−1} (z) )^T

• the gradient has a nice geometric interpretation
  – it points in the direction of maximum growth of the function
  – which makes it perpendicular to the contours where the function is constant

(figure: contours of f(x,y), with gradient vectors at (x_0, y_0) and (x_1, y_1))

(16)

The gradient

• note that if ∇f = 0
  – there is no direction of growth
  – also −∇f = 0, and there is no direction of decrease
  – we are either at a local minimum, local maximum, or “saddle” point
• conversely, at a local min, max, or saddle point
  – there is no direction of growth or decrease
  – ∇f = 0
• this shows that we have a critical point if and only if ∇f = 0
• to determine which type we need second-order conditions

(figure: surfaces with a max, a min, and a saddle)

(17)

The Hessian

• the extension of the second-order derivative is the Hessian matrix

  ∇²f(x_0) = [ ∂²f/∂x_1² (x_0)        ...  ∂²f/∂x_1∂x_n (x_0) ]
             [        ...                         ...         ]
             [ ∂²f/∂x_n∂x_1 (x_0)     ...  ∂²f/∂x_n² (x_0)    ]

• at each point x_0, it gives us the quadratic function

  x^T ∇²f(x_0) x

  that best approximates f(x)

(18)

The Hessian

• this means that, when the gradient is zero at x, we have
  – a maximum when the function can be approximated by an “upwards-facing” quadratic
  – a minimum when the function can be approximated by a “downwards-facing” quadratic
  – a saddle point otherwise

(figure: a max, a saddle, and a min)

(19)

The Hessian

• for any matrix M, the function

  x^T M x

  is
  – an upwards-facing quadratic when M is negative definite
  – a downwards-facing quadratic when M is positive definite
  – a saddle otherwise
• hence, all that matters is the positive definiteness of the Hessian
• we have a maximum when the Hessian is negative definite

(20)

Maximum likelihood

• in summary, given a sample, we need to solve

  Θ* = argmax_Θ P_X(D; Θ)

• the solutions are the parameters such that

  ∇_Θ P_X(D; Θ) = 0

  θ^T ∇²_Θ P_X(D; Θ) θ ≤ 0,  ∀θ ∈ ℜⁿ

• note that you always have to check the second-order condition!

(21)

Maximum likelihood

• let’s consider the Gaussian example
• given a sample {T_1, ..., T_N} of independent points
• the likelihood is

  P_X(D; Θ) = ∏_j (2πσ²)^{−1/2} exp{ −(T_j − µ)² / (2σ²) }

(22)

Maximum likelihood

• and the log-likelihood is

  log P_X(D; Θ) = −(N/2) log(2πσ²) − (1/2σ²) Σ_j (T_j − µ)²

• the derivative with respect to the mean is zero when

  (1/σ²) Σ_j (T_j − µ) = 0

• or

  µ̂ = (1/N) Σ_j T_j

• note that this is just the sample mean

(23)

Maximum likelihood

• and the log-likelihood is

  log P_X(D; Θ) = −(N/2) log(2πσ²) − (1/2σ²) Σ_j (T_j − µ)²

• the derivative with respect to the variance is zero when

  −N/(2σ²) + (1/2σ⁴) Σ_j (T_j − µ)² = 0

• or

  σ̂² = (1/N) Σ_j (T_j − µ̂)²

• note that this is just the sample variance

(24)

Maximum likelihood

• example:
  – if sample is {10, 20, 30, 40, 50}, then µ̂ = 30 and σ̂² = 200
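The two ML formulas applied to this sample can be checked in a couple of lines (assuming numpy):

```python
import numpy as np

# ML estimates for the sample from the slide
sample = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
mu_hat = sample.mean()                         # (1/N) sum_j T_j  -> 30.0
sigma2_hat = np.mean((sample - mu_hat) ** 2)   # (1/N) sum_j (T_j - mu_hat)^2  -> 200.0
```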

(25)

Homework

• show that the Hessian is negative definite

  θ^T ∇²_Θ P_X(D; Θ) θ ≤ 0,  ∀θ ∈ ℜⁿ

• show that these formulas can be generalized to the vector case
  – D^(i) = {x_1^(i), ..., x_n^(i)} set of examples from class i
  – the ML estimates are

  µ̂_i = (1/n) Σ_j x_j^(i)

  Σ̂_i = (1/n) Σ_j ( x_j^(i) − µ̂_i )( x_j^(i) − µ̂_i )^T

• note that the ML solution is usually intuitive
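The vector-case estimates above translate directly to code. A minimal sketch, assuming numpy; `ml_estimates` is an illustrative name:

```python
import numpy as np

def ml_estimates(X):
    # X: (n, d) array of examples x_j^(i) from one class
    mu_hat = X.mean(axis=0)                      # (1/n) sum_j x_j
    centered = X - mu_hat
    Sigma_hat = centered.T @ centered / len(X)   # note 1/n, not 1/(n-1)
    return mu_hat, Sigma_hat
```

Note the 1/n normalization: the ML covariance estimate is the biased sample covariance, which foreshadows the bias discussion on the next slides.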

(26)

Estimators

• when we talk about estimators, it is important to keep in mind that
  – an estimate is a number
  – an estimator is a random variable

  θ̂ = f(X_1, ..., X_n)

• an estimate is the value of the estimator for a given sample
• if D = {x_1, ..., x_n}, when we say

  µ̂ = (1/n) Σ_j x_j

  what we mean is

  µ̂ = f(x_1, ..., x_n)  with  f(X_1, ..., X_n) = (1/n) Σ_j X_j

• the X_i are random variables

(27)

Bias and variance

• we know how to produce estimators (by ML)
• how do we evaluate an estimator?
• Q1: is the expected value equal to the true value?
• this is measured by the bias
  – if

  θ̂ = f(X_1, ..., X_n)

  then

  Bias(θ̂) = E_{X_1,...,X_n}[ f(X_1, ..., X_n) − θ ]

  – an estimator that has bias will usually not converge to the perfect estimate θ, no matter how large the sample is
  – e.g. if θ is negative and the estimator is

  f(X_1, ..., X_n) = (1/n) Σ_j X_j²

  the bias is clearly non-zero
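A quick Monte Carlo check makes the point concrete: the estimator (1/n) Σ_j X_j² is always non-negative, so its expectation can never reach a negative θ. A hedged sketch, assuming numpy; the distribution and sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
theta = -1.0                                # a negative true parameter
X = rng.normal(theta, 1.0, size=(5000, 20)) # 5000 samples of n = 20 points
estimates = (X ** 2).mean(axis=1)           # (1/n) sum_j X_j^2, per sample
mean_estimate = estimates.mean()            # ~ E[X^2] = theta^2 + 1 = 2
```

The average of the estimates sits near +2, nowhere near θ = −1, no matter how many samples are drawn: a non-vanishing bias.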

(28)

Bias and variance

• the estimator is said to be biased
  – this means that it is not expressive enough to approximate the true value arbitrarily well
  – this will be clearer when we talk about density estimation
• Q2: assuming that the estimator converges to the true value, how many sample points do we need?
  – this can be measured by the variance

  Var(θ̂) = E_{X_1,...,X_n}{ ( f(X_1, ..., X_n) − E_{X_1,...,X_n}[ f(X_1, ..., X_n) ] )² }

  – the variance usually decreases as one collects more training examples

(29)

Example

• ML estimator for the mean of a Gaussian N(µ, σ²)

  Bias(µ̂) = E_{X_1,...,X_n}[ µ̂ − µ ]
          = E_{X_1,...,X_n}[ (1/n) Σ_i X_i − µ ]
          = (1/n) Σ_i E_{X_i}[ X_i ] − µ
          = (1/n) Σ_i µ − µ
          = µ − µ
          = 0

• the estimator is unbiased
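The unbiasedness derivation can be checked empirically: average the sample-mean estimator over many independent samples and it should hug the true µ. A minimal sketch, assuming numpy; the parameter values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma, n = 3.0, 2.0, 25
# 20000 independent samples of n points each; one mu_hat per sample
sample_means = rng.normal(mu, sigma, size=(20000, n)).mean(axis=1)
avg = sample_means.mean()   # approximates E[mu_hat], which equals mu
```

Individual estimates scatter around µ (with variance σ²/n), but their average converges to µ, exactly as the bias computation predicts.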
