the MCEM Algorithm
While in principle any frequentist point estimator could be used when estimating the hyperparameters α in the marginal p(y|α), the most natural way to proceed is to regard L(α; y) = p(y|α) as the likelihood of the hyperparameters α and then find the point estimator α̂ as the maximizer of this marginal likelihood. We will hence use the marginal maximum likelihood estimator (MMLE) defined by
\[
\hat{\alpha}_{\mathrm{MMLE}} = \arg\max_{\alpha} L(\alpha; y) = \arg\max_{\alpha} p(y|\alpha)
\]
to estimate the hyperparameters α. Using Equation (5.3), the marginal likelihood is given by
\[
L(\alpha; y) = p(y|\alpha) = \int p(y|\lambda)\, p(\lambda|\alpha) \,\mathrm{d}\lambda.
\]
Since this is an intractable integral, we could use the law of large numbers (Theorem A.13) to find its MC approximation based on a sample from the prior
\[
L(\alpha; y) \approx \frac{1}{N} \sum_{i=1}^{N} p(y|\lambda^{(i)}), \quad \lambda^{(i)} \overset{\text{i.i.d.}}{\sim} p(\lambda|\alpha). \tag{6.1}
\]
Unfortunately, this approach does not work well in practice. The reason for this is that in the high-dimensional sample space, most of the sampled values of λ fall on regions of the space where the likelihood is numerically zero. This is true even for very reasonable priors p(λ|α). Hence, an extremely large sample from the prior would be required to get even a rough idea about the value of the likelihood L(α; y).
To solve this problem, we use the EM algorithm described in Section 4.1.1 to find the maximum of the marginal likelihood L(α; y). To do this, we regard the means of the true histogram λ as unobserved latent variables. Hence, the complete data are (y, λ) with the likelihood function L(α; y, λ) = p(y, λ|α) for the hyperparameters α. The two likelihood functions are related by
\[
L(\alpha; y) = p(y|\alpha) = \int p(y, \lambda|\alpha) \,\mathrm{d}\lambda = \int L(\alpha; y, \lambda) \,\mathrm{d}\lambda,
\]
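which corresponds to Equation (4.8) in our general formulation of the EM algorithm. Denoting the complete-data log-likelihood by l(α; y, λ) = log p(y, λ|α), it follows that on the kth E-step of the algorithm, we compute the conditional expectation
\[
Q(\alpha; \alpha^{(k)}) = \mathrm{E}\left[ l(\alpha; y, \lambda) \,\middle|\, y, \alpha^{(k)} \right] = \mathrm{E}\left[ \log p(y, \lambda|\alpha) \,\middle|\, y, \alpha^{(k)} \right].
\]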
Since p(y, λ|α) = p(y|λ)p(λ|α) and we are only interested in how Q depends on α, we can write
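\[
\begin{aligned}
Q(\alpha; \alpha^{(k)}) &= \mathrm{E}\left[ \log p(\lambda|\alpha) \,\middle|\, y, \alpha^{(k)} \right] + \mathrm{E}\left[ \log p(y|\lambda) \,\middle|\, y, \alpha^{(k)} \right] \\
&= \mathrm{E}\left[ \log p(\lambda|\alpha) \,\middle|\, y, \alpha^{(k)} \right] + \mathrm{const} \\
&= \int p(\lambda|y, \alpha^{(k)}) \log p(\lambda|\alpha) \,\mathrm{d}\lambda + \mathrm{const}.
\end{aligned}
\]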
This is again an intractable integral, so we use the MC approximation of Equation (5.7) to find
\[
Q(\alpha; \alpha^{(k)}) \approx \frac{1}{N} \sum_{i=1}^{N} \log p(\lambda^{(i)}|\alpha) + \mathrm{const}, \quad \lambda^{(i)} \sim p(\lambda|y, \alpha^{(k)}),
\]
where the sample λ(1), λ(2), . . . , λ(N) is produced using the Metropolis–Hastings algorithm described in Section 5.2. Thus, on the E-step of the algorithm, we sample N observations from the posterior p(λ|y, α(k)) computed for the current iterate of the hyperparameters. The arithmetic mean of the values of the log-prior corresponding to this sample is then used to approximate the value of Q(α; α(k)) up to a constant which does not depend on α. On the subsequent M-step of the algorithm, the approximate value of Q(α; α(k)) is maximized with respect to α. The difficulty of this maximization depends on the choice of the family of priors {p(λ|α)}_α. We show below that for the Gaussian smoothness prior (5.12), the maximization can be carried out analytically, while for more complicated choices of {p(λ|α)}_α, it might be necessary to find the maximum numerically using standard nonlinear optimization algorithms.
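To see why a Gaussian prior can lead to an analytic M-step, consider the following hedged illustration. It assumes, purely for the sake of the example and without restating the actual prior (5.12), a zero-mean Gaussian prior with precision matrix αΩ, where Ω is a fixed symmetric positive semidefinite matrix of rank r (both Ω and r are symbols introduced here). Then log p(λ|α) = (r/2) log α − (α/2) λᵀΩλ + const, and the MC approximation of Q becomes
\[
Q(\alpha; \alpha^{(k)}) \approx \frac{r}{2} \log \alpha - \frac{\alpha}{2} \cdot \frac{1}{N} \sum_{i=1}^{N} (\lambda^{(i)})^{\mathsf{T}} \Omega \lambda^{(i)} + \mathrm{const}.
\]
Setting the derivative with respect to α to zero gives
\[
\frac{r}{2\alpha} - \frac{1}{2N} \sum_{i=1}^{N} (\lambda^{(i)})^{\mathsf{T}} \Omega \lambda^{(i)} = 0
\quad \Longrightarrow \quad
\alpha^{(k+1)} = \frac{r}{\frac{1}{N} \sum_{i=1}^{N} (\lambda^{(i)})^{\mathsf{T}} \Omega \lambda^{(i)}},
\]
and since the second derivative −r/(2α²) is negative for all α > 0, this stationary point is the maximizer under the stated assumption.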
Since we need to resort to MC integration when computing the conditional expectation, the iteration outlined above is not exactly the EM iteration described in Section 4.1.1 but instead its stochastic Monte Carlo version. This extension of the original EM algorithm was first proposed by Wei and Tanner in [60], who called it the Monte Carlo EM (MCEM) algorithm. See also [43, Section 6.3] for a review of the literature on the MCEM algorithm.
To summarize the discussion above, the MCEM algorithm for finding the marginal maximum likelihood estimator α̂_MMLE of the hyperparameters α consists of the following iteration:
1. Pick some initial guess α(0) and set k = 0.
2. E-step:
(a) Sample λ(1), λ(2), . . . , λ(N ) from the posterior p(λ|y, α(k)).
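(b) Compute:
\[
\tilde{Q}(\alpha; \alpha^{(k)}) = \frac{1}{N} \sum_{i=1}^{N} \log p(\lambda^{(i)}|\alpha). \tag{6.2}
\]
3. M-step: Set
\[
\alpha^{(k+1)} = \arg\max_{\alpha} \tilde{Q}(\alpha; \alpha^{(k)}).
\]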
4. Set k ← k + 1.
5. If some stopping rule C(α(k), α(k−1), . . . , α(0)) is satisfied, set α̂_MMLE = α(k) and terminate the iteration; otherwise go to step 2.
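The iteration above can be sketched in Python on a toy model. Everything model-specific here is an assumption made for illustration and is not the model of this text: the prior is taken to be λ_j ~ N(0, 1/α) independently, the likelihood is y_j | λ_j ~ N(λ_j, 1), and the posterior is sampled exactly instead of with the Metropolis–Hastings algorithm. The function names are hypothetical. For this toy model the MMLE is known in closed form, which gives something to compare the iterates against, and a fixed number of iterations stands in for the stopping rule.

```python
import math
import random


def sample_posterior(y, alpha, n_samples, rng):
    """E-step (a): draw lam ~ p(lam | y, alpha), available exactly in the toy model."""
    shrink = 1.0 / (1.0 + alpha)   # posterior mean of lam_j is shrink * y_j
    post_sd = math.sqrt(shrink)    # posterior variance of lam_j is 1 / (1 + alpha)
    return [[rng.gauss(shrink * yj, post_sd) for yj in y] for _ in range(n_samples)]


def mean_squared_norm(samples):
    """Average of ||lam||^2 over the posterior sample."""
    return sum(sum(v * v for v in lam) for lam in samples) / len(samples)


def q_tilde(alpha, samples):
    """E-step (b): Equation (6.2) for the assumed prior lam_j ~ N(0, 1/alpha)."""
    dim = len(samples[0])
    return (0.5 * dim * math.log(alpha / (2.0 * math.pi))
            - 0.5 * alpha * mean_squared_norm(samples))


def m_step(samples):
    """M-step: closed-form maximizer of q_tilde over alpha > 0 (toy prior only)."""
    return len(samples[0]) / mean_squared_norm(samples)


def mcem(y, alpha0, n_samples, n_iter, rng):
    """Run n_iter MCEM iterations and return all iterates alpha(0), ..., alpha(n_iter)."""
    alphas = [alpha0]
    for _ in range(n_iter):
        samples = sample_posterior(y, alphas[-1], n_samples, rng)
        # q_tilde never needs to be evaluated here because its maximizer is explicit.
        alphas.append(m_step(samples))
    return alphas


def closed_form_mmle(y):
    """MMLE of alpha in the toy model, where y_j ~ N(0, 1 + 1/alpha) marginally.

    Only valid when the mean of y_j^2 exceeds 1.
    """
    mean_sq = sum(v * v for v in y) / len(y)
    return 1.0 / (mean_sq - 1.0)


if __name__ == "__main__":
    rng = random.Random(1)
    # Simulated data with true alpha = 0.25 (prior standard deviation 2).
    y = [rng.gauss(0.0, 2.0) + rng.gauss(0.0, 1.0) for _ in range(100)]
    alphas = mcem(y, alpha0=1.0, n_samples=200, n_iter=30, rng=rng)
    print("last MCEM iterate:", alphas[-1])
    print("closed-form MMLE :", closed_form_mmle(y))
```

Because the E-step is random, the iterates do not settle on a single value but keep fluctuating around the closed-form MMLE; increasing `n_samples` should shrink these fluctuations.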
Replacing the E-step with its MC approximation complicates both the theoretical and practical convergence analysis of the EM algorithm. Firstly, the random fluctuations of the MC estimator invalidate the monotonicity of the original EM algorithm (see Theorem 4.3). Secondly, the stopping rule C(α(k), α(k−1), . . . , α(0)) should consider more than just the latest iteration of the algorithm to see if the iterates appear to fluctuate around some central value before claiming convergence. Nevertheless, despite these complications, the MCEM algorithm has been successfully applied to various problems of practical interest, see e.g. [7, 39].
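One simple way to make a stopping rule look at more than the latest iterate is to compare the averages of two consecutive windows of iterates. The sketch below is only one possible choice and is not the rule used in this text; the function name `has_converged`, the window length, and the tolerance are hypothetical, and a scalar hyperparameter is assumed.

```python
def has_converged(iterates, window=10, rel_tol=0.01):
    """Return True if the means of the last two windows of iterates agree.

    iterates: list of scalar hyperparameter values alpha(0), ..., alpha(k).
    The rule needs at least 2 * window iterates before it can fire.
    """
    if len(iterates) < 2 * window:
        return False
    recent = iterates[-window:]
    previous = iterates[-2 * window:-window]
    mean_recent = sum(recent) / window
    mean_previous = sum(previous) / window
    # Relative change between window means, insensitive to zero-mean jitter.
    return abs(mean_recent - mean_previous) <= rel_tol * abs(mean_previous)
```

Averaging over a window damps the MC noise of individual iterates, so the rule reacts to a drift in the central value and not to the fluctuations around it. The window length trades off responsiveness against robustness to that noise.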
The MCEM algorithm has a rather intuitive interpretation. First, on the E-step, we use the current iterate α(k) to produce a sample of λ's from the posterior. Since this sample summarizes our current understanding of λ, on the M-step we tune the prior by changing α to match this sample as well as possible. The α that best matches the posterior sample then becomes the next iterate α(k+1).
There are two reasons why the MCEM algorithm is numerically much more stable than the first MC approximation (6.1) that we tried to use to find the MMLE. Firstly, in MCEM, the λ's are sampled from the posterior and hence most of them are reasonable true histograms. This means that they should also lie within the bulk of the prior probability density, making Q̃ in Equation (6.2) well behaved. In contrast, in Equation (6.1), the sample is generated from the prior, resulting mostly in very unlikely true histograms. Secondly, the sum in (6.1) is over plain densities, whereas the sum in (6.2) is over log-densities. This makes the computations in MCEM much more robust against small pdf values.
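Both weaknesses of the approximation (6.1) can be reproduced on a toy model. The sketch below assumes, for illustration only, the prior λ_j ~ N(0, 1/α) independently and the likelihood y_j | λ_j ~ N(λ_j, 1); neither is the model of this text, and all function names are hypothetical. For this toy model the exact log marginal likelihood is available in closed form, so the naive estimate can be compared against it.

```python
import math
import random


def log_likelihood(y, lam):
    """Log of an assumed unit-variance Gaussian likelihood p(y | lam)."""
    sq_dist = sum((yj - lj) ** 2 for yj, lj in zip(y, lam))
    return -0.5 * sq_dist - 0.5 * len(y) * math.log(2.0 * math.pi)


def simulate_data(alpha, dim, rng):
    """Draw lam_j ~ N(0, 1/alpha) and then y_j | lam_j ~ N(lam_j, 1)."""
    prior_sd = 1.0 / math.sqrt(alpha)
    return [rng.gauss(0.0, prior_sd) + rng.gauss(0.0, 1.0) for _ in range(dim)]


def naive_mc_marginal(y, alpha, n_samples, rng):
    """Average the plain densities p(y | lam) over prior draws, as in (6.1)."""
    prior_sd = 1.0 / math.sqrt(alpha)
    total = 0.0
    for _ in range(n_samples):
        lam = [rng.gauss(0.0, prior_sd) for _ in y]
        total += math.exp(log_likelihood(y, lam))  # may underflow to 0.0
    return total / n_samples


def exact_log_marginal(y, alpha):
    """Closed-form log p(y | alpha) for the toy model: y_j ~ N(0, 1 + 1/alpha)."""
    var = 1.0 + 1.0 / alpha
    return (-0.5 * sum(yj * yj for yj in y) / var
            - 0.5 * len(y) * math.log(2.0 * math.pi * var))


if __name__ == "__main__":
    rng = random.Random(0)
    alpha = 0.25
    for dim, n_samples in [(100, 2000), (1000, 50)]:
        y = simulate_data(alpha, dim, rng)
        estimate = naive_mc_marginal(y, alpha, n_samples, rng)
        log_estimate = math.log(estimate) if estimate > 0.0 else float("-inf")
        print("dim", dim, "log of naive estimate:", log_estimate,
              "exact log marginal:", exact_log_marginal(y, alpha))
```

In the 100-dimensional case the prior draws rarely land where the likelihood is appreciable, so the estimate should come out far below the exact value. In the 1000-dimensional case every plain density underflows to zero in double precision, so the estimate is exactly zero even though each log-density is a perfectly ordinary finite number.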