# Maximum likelihood estimation


### recall that we have

– Y – state of the world
– X – observations
– g(x) – decision function
– L[g(x), y] – loss of predicting y with g(x)

### for the "0-1" loss, the optimal decision function is the Bayes decision rule (BDR), which can be written in three equivalent forms

– 1) in terms of the posterior probabilities

$$g^*(x) = \arg\max_i P_{Y|X}(i|x)$$

– 2) in terms of the class-conditional distributions and the class priors, by Bayes rule (the denominator $P_X(x)$ does not depend on i and can be dropped)

$$g^*(x) = \arg\max_i P_{X|Y}(x|i)\, P_Y(i)$$

– 3) or, since the log is monotonic,

$$g^*(x) = \arg\max_i \left[ \log P_{X|Y}(x|i) + \log P_Y(i) \right]$$
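As a sanity check, the equivalence of the three forms can be verified on a tiny discrete problem. The probability tables below are made up for illustration; only the argmax equivalence is the point.

```python
import math

# A hypothetical two-class problem with a discrete observation x in {0, 1, 2}.
# P_X_given_Y[i][x] is the class-conditional P_{X|Y}(x|i); P_Y[i] is the prior.
P_X_given_Y = {0: [0.7, 0.2, 0.1], 1: [0.1, 0.3, 0.6]}
P_Y = {0: 0.4, 1: 0.6}

def posterior(i, x):
    # P_{Y|X}(i|x) by Bayes rule
    num = P_X_given_Y[i][x] * P_Y[i]
    den = sum(P_X_given_Y[j][x] * P_Y[j] for j in P_Y)
    return num / den

def bdr_posterior(x):
    # form 1: argmax of the posterior
    return max(P_Y, key=lambda i: posterior(i, x))

def bdr_joint(x):
    # form 2: argmax of P_{X|Y}(x|i) * P_Y(i), denominator dropped
    return max(P_Y, key=lambda i: P_X_given_Y[i][x] * P_Y[i])

def bdr_log(x):
    # form 3: argmax of log P_{X|Y}(x|i) + log P_Y(i)
    return max(P_Y, key=lambda i: math.log(P_X_given_Y[i][x]) + math.log(P_Y[i]))

for x in range(3):
    assert bdr_posterior(x) == bdr_joint(x) == bdr_log(x)
print([bdr_posterior(x) for x in range(3)])   # → [0, 1, 1]
```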

### for Gaussian class-conditional distributions we can express this as a simple equation

– e.g. for the multivariate Gaussian

$$P_{X|Y}(x|i) = \frac{1}{\sqrt{(2\pi)^d |\Sigma_i|}} \exp\left\{ -\frac{1}{2}(x-\mu_i)^T \Sigma_i^{-1} (x-\mu_i) \right\}$$

– taking logs, the BDR becomes a minimum-distance rule

$$g^*(x) = \arg\min_i \left[ d_i(x, \mu_i) + \alpha_i \right]$$

with

$$d_i(x, y) = (x-y)^T \Sigma_i^{-1} (x-y), \qquad \alpha_i = \log (2\pi)^d |\Sigma_i| - 2 \log P_Y(i)$$

– for two classes, the discriminant (decision boundary) is the set of points where

$$P_{Y|X}(1|x) = 0.5$$
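A minimal sketch of this rule in one dimension, checking that the $d_i + \alpha_i$ discriminant picks the same class as the direct $\arg\max_i P_{X|Y}(x|i)\,P_Y(i)$ form. The means, variances, and priors below are illustrative, not from the text.

```python
import math

# Hypothetical 1-D two-class Gaussian problem (all numbers illustrative).
mu = {0: 0.0, 1: 3.0}
var = {0: 1.0, 1: 4.0}      # Sigma_i, here scalar variances
prior = {0: 0.5, 1: 0.5}
d_dim = 1

def gaussian(x, m, v):
    return math.exp(-0.5 * (x - m) ** 2 / v) / math.sqrt(2 * math.pi * v)

def bdr_argmax(x):
    # argmax_i P_{X|Y}(x|i) * P_Y(i)
    return max(prior, key=lambda i: gaussian(x, mu[i], var[i]) * prior[i])

def bdr_discriminant(x):
    # argmin_i [ d_i(x, mu_i) + alpha_i ] with
    # d_i(x, y) = (x - y)^2 / var_i and
    # alpha_i = log((2*pi)^d * var_i) - 2 * log P_Y(i)
    def cost(i):
        d_i = (x - mu[i]) ** 2 / var[i]
        alpha_i = math.log((2 * math.pi) ** d_dim * var[i]) - 2 * math.log(prior[i])
        return d_i + alpha_i
    return min(prior, key=cost)

for x in [-1.0, 0.5, 1.4, 2.0, 5.0]:
    assert bdr_argmax(x) == bdr_discriminant(x)
```

The two functions agree at every point because the second is just $-2\log$ of the first, a monotone transform.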

### when all classes share the same covariance, the BDR is a hyperplane

– if $\Sigma_i = \sigma^2 I$ for all i, the quadratic terms in x cancel and the boundary between classes i and j is the hyperplane

$$w^T (x - x_0) = 0$$

with

$$w = \mu_i - \mu_j, \qquad x_0 = \frac{\mu_i + \mu_j}{2} - \frac{\sigma^2}{\|\mu_i - \mu_j\|^2} \log\frac{P_Y(i)}{P_Y(j)}\,(\mu_i - \mu_j)$$

– for a general common covariance $\Sigma_i = \Sigma$, the boundary is still a hyperplane $w^T(x - x_0) = 0$, now with

$$w = \Sigma^{-1}(\mu_i - \mu_j), \qquad x_0 = \frac{\mu_i + \mu_j}{2} - \frac{\log\left[ P_Y(i)/P_Y(j) \right]}{(\mu_i - \mu_j)^T \Sigma^{-1} (\mu_i - \mu_j)}\,(\mu_i - \mu_j)$$

– note that when the priors are equal, $x_0$ is simply the midpoint between the two means
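The boundary formula can be checked numerically in the scalar case: at $x_0$ the two joint densities $P_{X|Y}(x|i)\,P_Y(i)$, and hence the two posteriors, coincide. All parameter values below are made up for the check.

```python
import math

# Hypothetical 1-D classes with equal variance sigma^2 and unequal priors.
mu_i, mu_j = 1.0, 4.0
sigma2 = 2.0
p_i, p_j = 0.7, 0.3

# boundary point from the hyperplane formula, scalar case:
# x0 = (mu_i + mu_j)/2 - sigma^2/(mu_i - mu_j) * log(P(i)/P(j))
x0 = (mu_i + mu_j) / 2 - sigma2 / (mu_i - mu_j) * math.log(p_i / p_j)

def joint(x, m, p):
    # P_{X|Y}(x|class) * P_Y(class) for a Gaussian class-conditional
    return p * math.exp(-0.5 * (x - m) ** 2 / sigma2) / math.sqrt(2 * math.pi * sigma2)

# at x0 the two joint densities (hence the posteriors) must be equal
assert abs(joint(x0, mu_i, p_i) - joint(x0, mu_j, p_j)) < 1e-9
```

With the larger prior on class i, $x_0 \approx 3.06$ sits closer to $\mu_j$ than the midpoint 2.5, enlarging the decision region of the more probable class.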

### Bayesian decision theory

– BDR is optimal and cannot be beaten
– Bayes keeps you honest
– models reflect causal interpretation of the problem, this is how we think
– natural decomposition into "what we knew already" (prior) and "what data tells us" (CCD)
– no need for heuristics to combine these two sources of info
– BDR is, almost invariably, intuitive
– Bayes rule, chain rule, and marginalization enable modularity, and scalability to very complicated models and problems

### problems:

– BDR is optimal only insofar as the models are correct.

### but in practice we do not know the values of the parameters $\mu_i$, $\Sigma_i$, $P_Y(i)$

– we have to somehow estimate these values
– this is OK, we can come up with an estimate from a training set
– e.g. use the average value as an estimate for the mean

### if the estimates are poor, we may end up with a classifier that is quite distant from the optimal

– e.g. if the $P_{X|Y}(x|i)$ look like the example above, I could never approximate them well by parametric models (e.g. Gaussian)

### this seems pretty serious

– how should I get these probabilities then?

### maximum likelihood: this has three steps

– 1) we choose a parametric model for all probabilities
– to make this clear we denote the vector of parameters by $\Theta$ and the class-conditional distributions by $P_{X|Y}(x|i; \Theta)$
– note that this means that $\Theta$ is NOT a random variable (otherwise it would have to show up as a subscript)
– it is simply a parameter, and the probabilities are a function of this parameter

### three steps:

– 2) we assemble a collection of datasets: $D^{(i)} = \{x_1^{(i)}, \ldots, x_n^{(i)}\}$, a set of examples drawn independently from class i
– 3) we select the parameters of class i to be the ones that maximize the probability of the data from that class

$$\Theta_i^* = \arg\max_\Theta P_{X|Y}(D^{(i)}|i; \Theta) = \arg\max_\Theta \log P_{X|Y}(D^{(i)}|i; \Theta)$$

– like before, it does not really make any difference to maximize probabilities or their logs
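Step 3 can be sketched with a deliberately simple, hypothetical model: a Bernoulli class-conditional whose single parameter $\theta$ is chosen, per class, by maximizing the log-likelihood over a grid of candidates. The grid maximizer matches the closed-form ML estimate (the sample frequency).

```python
import math

# Hypothetical Bernoulli model per class:
# P_{X|Y}(x|i; theta) = theta^x * (1-theta)^(1-x), fitted independently per class.
data = {
    0: [0, 0, 1, 0, 0],   # D^(0): examples drawn from class 0
    1: [1, 1, 1, 0, 1],   # D^(1): examples drawn from class 1
}

def log_likelihood(theta, sample):
    # log P(D|i; theta) = sum_j log P(x_j|i; theta) for an iid sample
    return sum(x * math.log(theta) + (1 - x) * math.log(1 - theta) for x in sample)

grid = [k / 100 for k in range(1, 100)]   # candidate theta values
theta_hat = {i: max(grid, key=lambda t: log_likelihood(t, d)) for i, d in data.items()}

print(theta_hat)   # → {0: 0.2, 1: 0.8}
```

Each class is fitted only from its own dataset, which is exactly the per-class decoupling described next.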

### since

– each sample $D^{(i)}$ is considered independently
– parameter $\Theta_i$ is estimated only from sample $D^{(i)}$
– we simply have to repeat, for each class, the solution of

$$\Theta^* = \arg\max_\Theta P_X(D; \Theta) = \arg\max_\Theta \log P_X(D; \Theta)$$

where D is the sample and we drop the class dependence for notational simplicity
– since the sample is iid, $P_X(D; \Theta) = \prod_j P_X(x_j; \Theta)$
– the function $P_X(D; \Theta)$ is the likelihood of the parameter $\Theta$ with respect to the data D

### we have a maximum when

– first derivative is zero
– second derivative is negative

### for vector parameters, we need the gradient

$$\nabla f = \left( \frac{\partial f}{\partial x_1}, \ldots, \frac{\partial f}{\partial x_n} \right)^T$$

### interpretation

– it points in the direction of maximum growth of the function
– which makes it perpendicular to the contours where the function is constant

[figure: gradient vectors $\nabla f(x_0, y_0)$ and $\nabla f(x_1, y_1)$ drawn perpendicular to the level contours of $f(x, y)$]

### note that if ∇f = 0

– there is no direction of growth
– also -∇f = 0, and there is no direction of decrease
– we are either at a local minimum, a local maximum, or a "saddle" point

### conversely, at a local min, max, or saddle point

– there is no direction of growth or decrease
– ∇f = 0

### to distinguish between these cases we need the Hessian, the matrix of second-order partial derivatives

$$\nabla^2 f(x) = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial x_n \partial x_1} & \cdots & \frac{\partial^2 f}{\partial x_n^2} \end{bmatrix}$$

– at each point x, it gives us the quadratic function $\varepsilon^T \nabla^2 f(x)\, \varepsilon$ that best approximates f around x, via the second-order Taylor expansion

$$f(x + \varepsilon) \approx f(x) + \nabla f(x)^T \varepsilon + \frac{1}{2}\, \varepsilon^T \nabla^2 f(x)\, \varepsilon$$

### then, if the gradient is zero at x, we have

– a maximum when the function can be approximated by an "upwards-facing" (concave) quadratic
– a minimum when the function can be approximated by a "downwards-facing" (convex) quadratic

### the quadratic form $x^T M x$ is

– an upwards-facing (concave) quadratic when M is negative definite
– a downwards-facing (convex) quadratic when M is positive definite

### in summary, the ML estimate

$$\Theta^* = \arg\max_\Theta P_X(D; \Theta)$$

is characterized by the conditions

– $\nabla_\Theta \log P_X(D; \Theta)\big|_{\Theta = \Theta^*} = 0$
– $\nabla_\Theta^2 \log P_X(D; \Theta)\big|_{\Theta = \Theta^*}$ negative definite
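These two conditions can be checked numerically with finite differences, here for the mean of a Gaussian with the variance treated as known. The sample and variance values are arbitrary; the point is that at $\hat\mu$ the first derivative of the log-likelihood vanishes and the second is negative.

```python
import math

sample = [2.0, 4.0, 9.0]
mu_hat = sum(sample) / len(sample)   # ML estimate of the mean (the sample average)
sigma2 = 4.0                         # variance treated as known in this sketch

def log_lik(mu):
    # log-likelihood of the iid Gaussian sample as a function of mu
    return sum(-0.5 * math.log(2 * math.pi * sigma2)
               - (x - mu) ** 2 / (2 * sigma2) for x in sample)

h = 1e-5
# central finite differences for the first and second derivatives at mu_hat
d1 = (log_lik(mu_hat + h) - log_lik(mu_hat - h)) / (2 * h)                       # ~ 0
d2 = (log_lik(mu_hat + h) - 2 * log_lik(mu_hat) + log_lik(mu_hat - h)) / h ** 2  # < 0

assert abs(d1) < 1e-6
assert d2 < 0
```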

### example: Gaussian of unknown mean and variance

– for an iid sample $D = \{x_1, \ldots, x_n\}$, solving these conditions for the Gaussian gives the ML estimates

$$\hat\mu = \frac{1}{n}\sum_j x_j, \qquad \hat\sigma^2 = \frac{1}{n}\sum_j (x_j - \hat\mu)^2$$

### example:

– if sample is {10, 20, 30, 40, 50}, then $\hat\mu = 30$ and $\hat\sigma^2 = 200$
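The arithmetic of the worked example is easy to confirm:

```python
# ML estimates for the sample {10, 20, 30, 40, 50}
sample = [10, 20, 30, 40, 50]
n = len(sample)

mu_hat = sum(sample) / n                                   # sample average
sigma2_hat = sum((x - mu_hat) ** 2 for x in sample) / n    # average squared deviation

print(mu_hat, sigma2_hat)   # → 30.0 200.0
```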

### show that these formulas can be generalized to the vector case

– given $D^{(i)} = \{x_1^{(i)}, \ldots, x_n^{(i)}\}$, a set of examples from class i, the ML estimates are

$$\hat\mu_i = \frac{1}{n}\sum_j x_j^{(i)}, \qquad \hat\Sigma_i = \frac{1}{n}\sum_j \left( x_j^{(i)} - \hat\mu_i \right)\left( x_j^{(i)} - \hat\mu_i \right)^T$$
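A short pure-Python sketch of the vector-case estimates on a hypothetical 2-D sample: the mean is the per-coordinate average and the covariance is the average of outer products of the centered points.

```python
# Hypothetical 2-D sample from one class (values chosen for easy arithmetic).
sample = [(1.0, 2.0), (3.0, 6.0), (2.0, 4.0), (2.0, 4.0)]
n, d = len(sample), 2

# mu_hat: per-coordinate average
mu_hat = [sum(x[k] for x in sample) / n for k in range(d)]

# Sigma_hat: average of outer products (x - mu_hat)(x - mu_hat)^T
sigma_hat = [[sum((x[a] - mu_hat[a]) * (x[b] - mu_hat[b]) for x in sample) / n
              for b in range(d)] for a in range(d)]

print(mu_hat)     # → [2.0, 4.0]
print(sigma_hat)  # → [[0.5, 1.0], [1.0, 2.0]]
```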

### when we talk about estimators, it is important to keep in mind that

– an estimate is a number: the value obtained from a particular sample, e.g. $\hat\mu = \frac{1}{n}\sum_j x_j$ for observed points $x_1, \ldots, x_n$
– an estimator is a random variable: the same formula applied to the random sample, e.g.

$$\hat\mu(X_1, \ldots, X_n) = \frac{1}{n}\sum_j X_j$$

### bias

– if

$$E\left[ \hat\theta(X_1, \ldots, X_n) \right] = \theta$$

then the estimator is said to be unbiased
– an estimator that has bias will usually not converge to the perfect estimate $\theta$, no matter how large the sample is
– e.g. if $\theta$ is negative and the estimator is

$$\hat\theta(X_1, \ldots, X_n) = \frac{1}{n}\sum_j X_j^2$$

the bias is clearly non-zero, since this estimator can never be negative
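The bias phenomenon can be made concrete without any sampling: enumerating every equally likely sample from a small discrete distribution gives the exact expectation of the ML variance estimator, which falls short of $\sigma^2$ by the factor $(n-1)/n$. The distribution and sample size below are arbitrary choices for the demonstration.

```python
from itertools import product

values = [0.0, 1.0, 2.0]   # X uniform on three values (hypothetical)
n = 2                      # sample size

mean = sum(values) / len(values)                          # true mu = 1.0
var = sum((v - mean) ** 2 for v in values) / len(values)  # true sigma^2 = 2/3

def sigma2_hat(xs):
    # the ML variance estimate of a sample
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

# exact expectation over all equally likely samples of size n
samples = list(product(values, repeat=n))
expected = sum(sigma2_hat(s) for s in samples) / len(samples)

# E[sigma2_hat] = (n-1)/n * sigma^2, i.e. sigma^2/2 here: the estimator is biased
assert abs(expected - (n - 1) / n * var) < 1e-9
```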

### otherwise, the estimator is said to be biased

– this means that it is not expressive enough to approximate the true value arbitrarily well
– this will be clearer when we talk about density estimation

### variance: assuming that the estimator converges to the true value, how many sample points do we need?

– this can be measured by the variance

$$\text{Var}\left[ \hat\theta \right] = E\left[ \left( \hat\theta(X_1, \ldots, X_n) - E\left[ \hat\theta(X_1, \ldots, X_n) \right] \right)^2 \right]$$

– the variance usually decreases as one collects more training examples
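This decrease can be checked exactly for the sample mean: enumerating all samples from a small discrete distribution shows that the estimator's variance equals $\sigma^2/n$. The two-point distribution below is an arbitrary choice.

```python
from itertools import product

values = [0.0, 3.0]               # X uniform on two values (hypothetical)
mu = sum(values) / len(values)    # true mean = 1.5
var = sum((v - mu) ** 2 for v in values) / len(values)   # true sigma^2 = 2.25

def var_of_mean(n):
    # exact variance of the sample-mean estimator over all equally likely samples
    samples = list(product(values, repeat=n))
    means = [sum(s) / n for s in samples]
    e = sum(means) / len(means)
    return sum((m - e) ** 2 for m in means) / len(means)

# the estimator's variance shrinks as sigma^2 / n
for n in [1, 2, 4, 8]:
    assert abs(var_of_mean(n) - var / n) < 1e-12
```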

### e.g. for the sample mean

$$E\left[ \hat\mu(X_1, \ldots, X_n) \right] = E\left[ \frac{1}{n}\sum_i X_i \right] = \frac{1}{n}\sum_i E[X_i] = \mu$$

so the estimator is unbiased, and, for independent $X_i$,

$$\text{Var}\left[ \hat\mu(X_1, \ldots, X_n) \right] = \frac{1}{n^2}\sum_i \text{Var}[X_i] = \frac{\sigma^2}{n}$$

which decreases to zero as the sample grows