• No results found

Bayesian Theory and Computation. Lecture 12: Expectation Maximization

N/A
N/A
Protected

Academic year: 2021

Share "Bayesian Theory and Computation. Lecture 12: Expectation Maximization"

Copied!
32
0
0

Loading.... (view fulltext now)

Full text

(1)

Bayesian Theory and Computation

Lecture 12: Expectation Maximization

Cheng Zhang

School of Mathematical Sciences, Peking University May 04, 2021

(2)

Introduction

2/32

I In this lecture, we discuss Expectation-Maximization (EM), which is an iterative optimization method dealing with missing or latent data.

I In such cases, we may assume the observed data x are generated from random variable X along with missing or unobserved data z from random variable Z. We envision complete data would have been y = (x, z).

I Very often, the inclusion of the observed data z is a data augmentationstrategy to ease computation. In this case, Z is often referred to as latentvariable.

(3)

Latent Variable Model

3/32

I Some of the variables in the model are not observed. I Examples: mixture model, hidden Markov model (HMM),

latent Dirichlet allocation (LDA), etc.

(4)

Marginal Likelihood

4/32

I complete data likelihood p(x, z|θ), θ is model parameter I When z is missing, we need to marginalize out z and use

the marginal log-likelihood for learning log p(x|θ) = logX

z

p(x, z|θ)

I Examples: Gaussian mixture model. z ∼ Discrete(π), θ = (π, µ, Σ) p(x|θ) =X k p(z = k|θ)p(x|z = k, θ) =X k πkN (x|µk, Σk) =X k πk 1 (2π)d/2 k|1/2 exp  −1 2(x − µk) TΣ−1 k (x − µk) 

(5)

Learning in Latent Variable Model

5/32

I For most of these latent variable models, when the missing components z are observed, the complete data likelihood often factorizes, and the maximum likelihood estimates hence have closed-form solutions.

I When z are not observed, marginalization destroys the factorizible structure and makes learning much more difficult.

I How to learn in this scenario?

I Idea 1: simply take derivative and use gradient ascent directly

I Idea 2: find appropriate estimates of z (e.g., using the current conditional distribution p(z|x, θ)), fill them in and do complete data learning – This isEM!

(6)

Expectation Maximization

6/32

I At each iteration, the EM algorithm involves two steps

I based on the current θ(t), fill in unobserved z to get

complete data (x, z0)

I Update θ to maximize the complete data log-likelihood `(x, z0|θ) = log p(x, z0|θ)

I How to choose z0?

I Use conditional distribution p(z|x, θ(t))

I Take full advantage of the current estimates θ(t)

Ep(z|x,θ(t))`(x, z|θ) =

X

z

p(z|x, θ(t))`(x, z|θ)

(7)

EM Algorithm

7/32

More specifically, we start from some initial θ(0). In each iteration, we follow the two steps below

I Expectation (E-step): compute p(z|x, θ(t)) and form the expectation using the current estimate θ(t)

Q(t)(θ) = Ep(z|x,θ(t))`(x, z|θ)

I Maximization (M-step): Find θ that maximizes the expected complete data log-likelihood

θ(t+1)= arg max θ

Q(t)(θ)

In many cases, the expectation is easier to handle than the marginal log-likelihood.

(8)

How does EM Work?

8/32

I EM algorithm can be viewed as optimizing a lower bound on the marginal log-likelihood L(θ) = log p(x|θ)

I A class of lower bounds

L(θ) = logX z p(x, z|θ) = logX z q(z)p(x, z|θ) q(z) ≥X z q(z) logp(x, z|θ) q(z) - Jensen’s inequality =X z q(z) log p(x, z|θ) −X z q(z) log q(z), ∀q(z)

I The term in the last equation is often calledFree-energy

F (q, θ) =X z

q(z) log p(x, z|θ) −X z

(9)

Lower Bound Maximization

9/32

I Free-energy is a lower bound of the true log-likelihood L(θ) ≥ F (q, θ)

I EM is simply doingcoordinate ascenton F (q, θ)

I E-step: Find q(t) that maximizes F (q, θ(t))

I M-step: Find θ(t+1) that maximizes F (q(t), θ) I Properties:

I Each iteration improves F

F (q(t+1), θ(t+1)) ≥ F (q(t), θ(t))

I Each iteration improves L as well L(θ(t+1)) ≥ L(θ(t))

(10)

E-step

10/32

I Find q that maximizes F (q, θ(t))

F (q, θ) =X z q(z) log p(x, z|θ) −X z q(z) log q(z) =X z q(z) logp(z|x, θ)p(x|θ) q(z) =X z q(z) logp(z|x, θ) q(z) + log p(x|θ) = L(θ) − DKL(q(z)kp(z|x, θ)) ≤ L(θ)

(11)

E-step

11/32

F (q, θ(t)) = L(θ(t)) − DKL(q(z)kp(z|x, θ(t)))

I KL divergence is non-negative and is minimized (equals to 0) iff the two distributions are identical.

I Therefore, F (q, θ(t)) is maximized at q(t)(z) = p(z|x, θ(t)). I So when we are computing p(z|x, θ(t)), we are actually

computing arg maxqF (q, θ(t)) I Moreover,

F (q(t), θ(t)) = L(θ(t))

this meansthe lower bound matches the true log-likelihood at θ(t), which is crucial for the improvement on L.

(12)

M-step

12/32

I Find θ(t+1) that maximizes F (q(t), θ) θ(t+1)= arg max θ F (q(t), θ) = arg max θ X z p(z|x, θ(t)) log p(x, z|θ) + H(p(z|x, θ(t))) = arg max θ Ep(z|x,θ (t))`(x, z|θ)

I The expected complete data log-likelihood usually can be solved in the same manner (closed-form solutions) as the fully-observed model.

(13)

Monotonicity of EM

13/32

L(θ(t+1)) ≥ F (q(t), θ(t+1))

(14)

EM for Exponential Families

14/32

I When the complete data follow an exponential family distribution (in canonical form), the density is

p(x, z|θ) = h(x, z) exp(θ · T (x, z) − A(θ)) I E-step Q(t)(θ) = Ep(z|x,θ(t))log p(x, z|θ) = θ · Ep(z|x,θ(t))T (x, z) − A(θ) + Const I M-step ∇θQ(t)(θ) = 0 ⇒ Ep(z|x,θ(t))T (x, z) = ∇θA(θ) = Ep(x,z|θ)T (x, z)

(15)

Examples: Censored Survival Times

15/32

I In survival analyses, we often have to terminate our study before observing the real survival times, leading to

censored survival data.

I Suppose the observed data are Y = {(t1, δ1), . . . , (tn, δn)}, where Tj ∼ Exp(µ) and δj is the indicator of a censored sample. WLOG, assume δi= 0, i ≤ r, δi = 1, i > r I The log-likelihood function is

log p(Y |µ) = r X i=1 log p(ti|µ) + X i>r log p(Ti> ti|µ) = −r log µ − n X i=1 ti/µ I The MLE of µ: ˆµ =Pn i=1ti/r

(16)

Examples: Censored Survival Times

16/32

I Let us see how EM works in this simple case.

I Let t = (T1, . . . , Tn) = (T1, . . . , Tr, z) be the complete data vector, where z = (Tr+1, . . . , Tn) are the unobserved n − r censored random variables.

I Natural parameter 1/µ, sufficient statisticsPn

i=1Ti, and EµPni=1Ti = nµ

I By the lack of memory, Ti|Ti> ti ∼ ti+ Exp(µ), ∀i > r.

Ep(z|Y,µ(k)) n X i=1 Ti = r X i=1 ti+ X i>r ti+ (n − r)µ(k) I Update formula µ(k+1)= Pn i=1ti+ (n − r)µ(k) n

(17)

Gaussian Mixture Model

17/32

I Consider clustering of data X = {x1, . . . , xN} using a finite mixture of Gaussians.

z ∼ Discrete(π), x|z = k ∼ N (µk, Σk) θ = {πk, µk, Σk}Kk=1 are model parameters

I Complete data log-likelihood

log p(x, z|θ) = log K Y k=1 (p(z = k)p(x|z = k))1z=k = K X k=1 1z=k(log πk+ log N (x|µk, Σk))

(18)

E-step

18/32

I Compute the conditional probability p(zn|xn, θ(t)) via Bayes’ theorem p(zn|xn, θ) = p(zn, xn|θ) P znp(zn, xn|θ) p(zn= k|xn, θ(t)) = πk(t)N (xn|µ(t)k , Σ(t)k ) P kπ (t) k N (xn|µ (t) k , Σ (t) k )

I Denote γn,k(t) , p(zn= k|xn, θ(t)), which can be viewed as a

soft clustering of xn

X k

(19)

E-step

19/32

I Expected complete-data log-likelihood

Q(t)(θ) =X n X zn p(zn|xn, θ(t)) log p(xn, zn|θ) =X n X k γn,k(t) (log πk+ log N (xn|µk, Σk)) =X k X n γn,k(t) (log πk+ log N (xn|µk, Σk)) Substitute N (xn|µk, Σk) in Q(t)(θ) =X k X n γn,k(t)log πk− d 2log(2π) − 1 2log |Σk| −1 2(xn− µk) TΣ−1 k (xn− µk) 

(20)

M-step

20/32

I Maximize Q(t)(θ) with respect to π using Lagrange multipliers πk(t+1)∝X n γn,k(t) Therefore π(t+1)k = P nγ (t) n,k P k P nγ (t) n,k = P nγ (t) n,k P n P kγ (t) n,k = P nγ (t) n,k N I Note thatP nγ (t)

n,k can be viewed as the weighted number of data points in mixture component k, and π(t+1)k is the fraction of data the belongs to mixture component k.

(21)

M-step

21/32

I Compute the derivative w.r.t µk ∂Q(t)(θ) ∂µk =X n γn,k(t)Σ−1k (xn− µk) = Σ−1k X n γn,k(t)(xn− µk) I Therefore, µ(t+1)k = P nγ (t) n,kxn P nγ (t) n,k

µ(t+1)k is the weighted mean of data points assigned to mixture component k

I Similarly, we can get

Σ(t+1)k = P nγ (t) n,k(xn− µ (t+1) k )(xn− µ (t+1) k )T P nγ (t) n,k

(22)

EM algorithm for Gaussian Mixture Models

22/32

I E-step: Compute the soft clustering probabilities

γn,k(t) = π (t) k N (xn|µ(t)k , Σ (t) k ) P kπ (t) k N (xn|µ (t) k , Σ (t) k ) I M-step: Update parameters

π(t+1)k = P nγ (t) n,k N µ(t+1)k = P nγ (t) n,kxn P nγ (t) n,k Σ(t+1)k = P nγ (t) n,k(xn− µ (t+1) k )(xn− µ (t+1) k )T P nγ (t) n,k

(23)
(24)
(25)

Connection to k-means

25/32

I The k-means algorithm follows two steps

I Assignment step: assign data to the nearest cluster γn,k=



1, k = arg mink0kxn− µk0k

0, otherwise

I Update step: set µk to the mean of data points assigned to

the k-th cluster µk= P nγ (t) n,kxn P nγ (t) n,k = 1 Nk X n:γn,k=1 xn

Nk is the number of data points assigned to the k-th cluster. I Therefore, k-means can be viewed as a special case of EM

for Gaussian mixture models where Σk= I and γn,k are hard assignments instead of soft clustering probabilities.

(26)

Hidden Markov Model

26/32

I Sequence data x1, x2, . . . , xT, each xn∈ Rd

I Hidden variables z1, z2, . . . , zT, each zt∈ {1, 2, . . . , K} I Joint probability p(x, z) = p(z1) T −1 Y t=1 p(zt+1|zt) T Y t=1 p(xt|zt)

I p(xt|zt) is theemission probability, could be a Gaussian p(xt|zt= k) = N (xt|µk, Σk)

I p(zt+1|zt) is thetransition probability, a K × K matrix aij = p(zt+1= j|zt= i),Pjaij = 1

(27)

Expected Complete Data Log-likelihood

27/32

I The expected complete data log-likelihood is Q = Ep(z|x)log p(x, z) =X z p(z|x) log p(z1) + T −1 X t=1 log p(zt+1|zt) + T X t=1 log p(xt|zt) ! =X z1 p(z1|x) log p(z1) + T −1 X t=1 X zt,zt+1 p(zt, zt+1|x) log p(zt+1|zt) + T X t=1 X zt p(zt|x) log p(xt|zt)

I Therefore, in the E-step, we need to compute unary and pairwise marginal probabilities p(zt|x) and p(zt, zt+1|x).

(28)

E-step: Forward-Backward Algorithm

28/32

I Using the sequential structure of HMM, we can compute these marginal probabilities via dynamic programming. I The forward algorithm

αt+1(j) = p(zt+1= j, x1, . . . , xt+1) =X i p(zt+1= j, zt= i, x1, . . . , xt+1) = p(xt+1|zt+1= j) X i p(zt+1= j|zt= i)p(zt, x1, . . . , xt) = p(xt+1|zt+1= j) X i aijp(zt, x1, . . . , xt) = p(xt+1|zt+1= j) X i aijαt(i)

(29)

E-step: Forward-Backward Algorithm

29/32

I The backward algorithm

βt(i) = p(xt+1, . . . , xT|zt= i)

=X

jp(xt+1, . . . , xT, zt+1= j|zt= i)

=X

jaijp(xt+1|zt+1= j)βt+1(j) I Unary marginal probability

p(zt= j|x) ∝ p(zt= j, x) = αt(j)βt(j) I Pairwise marginal probability

p(zt+1= j, zt= i|x) ∝ p(zt+1= j, zt= i, x)

(30)

M-step

30/32

I From the E-step, we have γt,k= p(zt= k|x) = αt(k)βt(k) P kαt(k)βt(k) ξt(i, j) = p(zt+1= j, zt= i|x) = αt(i)aijp(xt+1|zt+1= j)βt+1(j) P kαt(k)βt(k) I The expected complete data log-likelihood is

Q =X k γ1,klog πk+ T −1 X t=1 X i,j ξt(i, j) log aij + T X t=1 X k γt,klog N (xt|µk, Σk)

I Closed form solution for M-step – just like in the Gaussian mixture model

(31)

Summary

31/32

EM algorithm finds MLE for models with missing/latent variables. Applicable if the following pieces are easy to solve

I Estimating missing data from observed data using current parameters (E-step)

I Find complete data MLE (M-step) Pros

I No need for gradients, learning rates, etc. I Fast convergence

I Monotonicity. Guaranteed to improve L at every iteration Cons

I Can get stuck at local optimal

(32)

References

32/32

I A. P. Dempster, N. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39:1–38, 1977.

I R. M. Neal and G. E. Hinton. A view of the em algorithm that justifies incremental, sparse, and other variants. Learning in Graphical Models, 89:355–368, 1998. I Baum, L. E. (1972). An inequality and associated

maximization technique in statistical estimation for probabilistic functions of Markov processes. In Shisha, O. (Ed.), Inequalities III: Proceedings of the 3rd Symposium on Inequalities, 1–8. Academic Press.

References

Related documents

The Board asked the study consultant to review available insurance industry personal property, commercial liability, and commercial property premium and claims data for Nova Scotia

Sulaiman, Department of Computer Engineering, College of Engineering Uni- versity of Mosul, Mosul, (IRAQ). E-mail address

The second foF2 model, which in fact introduces a new regional mapping technique, employs the Juliusruh neu- ral network model and uses the E-layer local noon calculated daily

From the values, in the engineers‟ point of view, the most important factors to NPD success in Indian SMEs is the role of the top management and social factor, topping the table

Our target is to reduce the number of emergency hospital admissions caused by accidental injuries to children and young people aged under 18 years per 10,000 of the population by

1 doesn new problem work probably let said years ll long question course 2 michael netcom andrew new uucp internet mark steve opinions mike mail net 3 thanks mail email help fax

Department of Physical Medicine and Rehabilitation Ljubljanska bb 81000 Podgorica, Montenegro Phone: +38-2-20412412 Fax: +38-2-20412412 Email 1: [email protected] Email

Banks of cylinders or manifold are also used to supply oxygen, nitrous oxide and air, although air is often compressed, dried and filtered on site.. The normal size cylinder for