The empirical Bayes approach
3.4 Computation via the EM algorithm
For the two-stage model, EB analysis requires maximization of the marginal likelihood given in (3.2). For many models, standard iterative maximum likelihood methods can be used directly to produce the MMLE and the observed information matrix. In more complicated settings, the EM algorithm (where EM stands for "expectation-maximization") offers an alternative approach, as we now describe.
Versions of the EM algorithm have been used for decades. Dempster, Laird, and Rubin (1977) consolidated the theory and provided instructive examples. The EM algorithm is attractive when the function to be maximized can be represented as a missing data likelihood, as is the case for (3.2). It converts such a maximum likelihood situation into one with a "pseudo-complete" log-likelihood or score function, produces the MLE for this case, and continues recursively until convergence. The approach is most effective when finding the complete data MLE is relatively straightforward.
Several authors have shown how to compute the observed information matrix within the EM structure, or provided various extensions intended to accelerate its convergence or ease its implementation.
Since the marginal likelihood (3.2) can be represented as a missing data likelihood, the EM algorithm is effective. It also provides a conceptual
introduction to the more advanced Monte Carlo methods for producing the full posterior distribution presented in Section 5.4. Here we sketch the approach and provide basic examples.
3.4.1 EM for PEB
Consider a model where, if we observed $\theta$, the MLE for $\eta$ would be relatively straightforward. For example, in the compound sampling model (3.4), given $\theta$ the MLE for $\eta$ can be computed using only the prior g. Suppressing the dependency on y on the left-hand side of equations, let

$$S(\theta, \eta) = \frac{\partial}{\partial \eta} \log g(\theta \mid \eta)$$ (3.27)

be the score function. For the "E-step," let $\eta^{(j)}$ denote the current estimate of the hyperparameter at iteration j, and compute

$$\bar{S}(\eta \mid \eta^{(j)}) = E\left[ S(\theta, \eta) \mid y, \eta^{(j)} \right].$$ (3.28)

This step involves standard Bayesian calculations to arrive at the conditional expectation of the sufficient statistics for $\eta$ based on $\theta$. The "M-step" then uses $\bar{S}$ in (3.28) to compute a new estimate of the hyperparameter, and the recursion proceeds until convergence. That is, $\eta^{(j+1)}$ solves

$$\bar{S}(\eta \mid \eta^{(j)}) = 0.$$ (3.29)

If both the E and M steps are relatively straightforward (the most attractive case being when the conditional distributions in (3.28) and (3.29) are exponential families), the EM approach works very well.
Dempster et al. (1977), Meilijson (1989), Tanner (1993, p.43), and many other researchers have shown that, at every iteration, the marginal likelihood either increases or stays constant, i.e.,

$$m(y \mid \eta^{(j+1)}) \ge m(y \mid \eta^{(j)}),$$

and thus the algorithm converges monotonically. This convergence could be to a local (rather than global) maximum, however, and so, as with many optimization algorithms, multiple starting points are recommended.
Example 3.2 Consider the Gaussian/Gaussian model (3.9) with unequal sampling variances, and let $\eta = (\mu, \tau^2)$, where for notational convenience we use T for $\tau^2$. For model component i, -2 times the loglikelihood for $\eta$ as a function of $\theta_i$ is, up to an additive constant, $\log(T) + (\theta_i - \mu)^2 / T$, producing the score vector

$$S_i(\theta_i, \eta) = \begin{pmatrix} (\theta_i - \mu)/T \\ \left[ (\theta_i - \mu)^2 - T \right] / (2T^2) \end{pmatrix}.$$ (3.30)

Expanding the second component gives $\left[ \theta_i^2 - 2\mu\theta_i + \mu^2 - T \right] / (2T^2)$. The MLE for $\eta$ depends on the sufficient statistics $\sum_i \theta_i$ and $\sum_i \theta_i^2$. The EM proceeds by computing $\bar{S}$ as in (3.28) using the conditional posterior distribution of the $\theta_i$, substituting the posterior mean for $\theta_i$ and the posterior variance plus the square of the mean for $\theta_i^2$, solving the "pseudo-complete"-data MLE equations, and continuing the recursion. Note that the algorithm requires expected sufficient statistics, so $\theta_i^2$ must be replaced by its expected value, not by the square of the expectation of the imputed $\theta_i$ (as though $\theta_i$ were observed).
For this model and generalizations, the EM partitions the likelihood maximization problem into components that are relatively easy to handle. For example, unequal sampling variances produce no additional difficulty.
EM is most attractive when both the E and the M steps are relatively more straightforward than dealing directly with the marginal likelihood. However, even when the E and M steps are non-trivial, they may be handled numerically, resulting in potentially simplified or stabilized computations.
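The recursion of Example 3.2 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: the sampling variances are taken as known, and the function names (`em_gaussian_peb`, `marginal_loglik`), the starting values, the tolerance, and the data are all hypothetical choices that do not come from the text.

```python
import numpy as np


def marginal_loglik(y, sigma2, mu, T):
    """Marginal log-likelihood: y_i ~ N(mu, sigma2_i + T), independently."""
    var = sigma2 + T
    return -0.5 * np.sum(np.log(2.0 * np.pi * var) + (y - mu) ** 2 / var)


def em_gaussian_peb(y, sigma2, mu0=0.0, T0=1.0, tol=1e-10, max_iter=10_000):
    """EM for the prior mean mu and prior variance T (hypothetical helper).

    Assumed model: y_i | theta_i ~ N(theta_i, sigma2_i), theta_i ~ N(mu, T),
    with the sampling variances sigma2_i known and possibly unequal.
    """
    y = np.asarray(y, dtype=float)
    sigma2 = np.asarray(sigma2, dtype=float)
    mu, T = float(mu0), float(T0)
    history = [marginal_loglik(y, sigma2, mu, T)]
    for _ in range(max_iter):
        # E-step: posterior mean and variance of each theta_i given y_i.
        B = sigma2 / (sigma2 + T)            # shrinkage factors
        post_mean = B * mu + (1.0 - B) * y
        post_var = (1.0 - B) * sigma2
        # M-step: complete-data MLEs with theta_i replaced by its posterior
        # mean and (theta_i - mu)^2 by its posterior expectation, i.e. the
        # posterior variance plus the squared deviation of the posterior mean.
        mu_new = post_mean.mean()
        T_new = np.mean(post_var + (post_mean - mu_new) ** 2)
        change = abs(mu_new - mu) + abs(T_new - T)
        mu, T = mu_new, T_new
        history.append(marginal_loglik(y, sigma2, mu, T))
        if change < tol:
            break
    return mu, T, history


y = np.array([-2.0, 0.5, 1.0, 3.5, 6.0])        # made-up data
sigma2 = np.array([1.0, 0.5, 2.0, 1.0, 0.5])    # made-up sampling variances
mu_hat, T_hat, history = em_gaussian_peb(y, sigma2)
print(mu_hat, T_hat)
```

The recorded `history` should never decrease, which is the monotonicity property cited from Dempster et al. (1977). Unequal sampling variances enter only through the shrinkage factors in the E-step, so they add no extra work.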
3.4.2 Computing the observed information
Though the EM algorithm can be very effective for finding the MMLE, it never operates directly on the marginal likelihood or score function.
It therefore does not provide a direct estimate of the observed information needed to produce the asymptotic variance of the MMLE. Several approaches to estimating observed information have been proposed; all are based on the standard decomposition of a variance into an expected conditional variance plus the variance of a conditional expectation.
Using the notation for hierarchical models and suppressing dependence on $\eta$, the decomposition of the Fisher information for $\eta$ is

$$I_{\theta} = E\left[ I_{\theta \mid y} \right] + I_{y},$$ (3.31)

where $I_{\theta}$ is the information that would result from direct observation of $\theta$. The first term on the right is the amount of information on $\eta$ provided by the conditional distribution of the complete data given y, while the second term is the information on $\eta$ provided in the marginal likelihood (the required information for the MMLE).
Louis (1982) estimates observed information versions of the left-hand side and the first right-hand term in (3.31), obtaining the second term by subtraction. Meng and Rubin (1991) generalize the approach, producing the Supplemented EM (SEM) algorithm. It requires only the EM computer code and a subroutine to compute the left-hand side. Meilijson (1989) shows how to take advantage of the special case where y is comprised of independent components to produce an especially straightforward computation.
With i indexing components, this approach uses (3.28) for each component, producing

$$\hat{I}(y, \hat{\eta}) = \sum_{i=1}^{k} \bar{S}_i(\hat{\eta} \mid \hat{\eta}) \, \bar{S}_i(\hat{\eta} \mid \hat{\eta})^{\top},$$ (3.32)

requiring no additional computational burden. This gradient approach can be applied to (3.30) in the Gaussian example. With unequal sampling variances, validity of the approach requires either that they are sampled from a distribution (a "supermodel"), producing an i.i.d. marginal model, or weaker conditions ensuring large sample convergence. Otherwise, the more complicated approaches of Louis (1982) or Meng and Rubin (1991) must be used.
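For the Gaussian example, the gradient approach can be illustrated as follows. The sketch assumes that the information estimate is the sum, over components, of the outer products of the componentwise expected scores evaluated at the MMLE. It also assumes known sampling variances. The function names, the fixed number of EM iterations, and the data are hypothetical.

```python
import numpy as np


def expected_scores(y, sigma2, mu, T):
    """E-step expectation of each component's score for (mu, T).

    Row i is E[S_i | y_i], where S_i is the complete-data score
    ((theta_i - mu) / T, ((theta_i - mu)^2 - T) / (2 T^2)).
    """
    B = sigma2 / (sigma2 + T)
    post_mean = B * mu + (1.0 - B) * y
    post_var = (1.0 - B) * sigma2
    s_mu = (post_mean - mu) / T
    s_T = (post_var + (post_mean - mu) ** 2 - T) / (2.0 * T ** 2)
    return np.column_stack([s_mu, s_T])


def gradient_information(y, sigma2, mu, T):
    """Sum of outer products of the componentwise expected scores."""
    S = expected_scores(y, sigma2, mu, T)
    return S.T @ S


def em_mmle(y, sigma2, mu=0.0, T=1.0, n_iter=500):
    """Plain EM iterations, used here only to locate the MMLE."""
    for _ in range(n_iter):
        B = sigma2 / (sigma2 + T)
        post_mean = B * mu + (1.0 - B) * y
        post_var = (1.0 - B) * sigma2
        mu = post_mean.mean()
        T = np.mean(post_var + (post_mean - mu) ** 2)
    return mu, T


y = np.array([-2.0, 0.5, 1.0, 3.5, 6.0, 2.0, -1.0, 4.0])      # made-up data
sigma2 = np.array([1.0, 0.5, 2.0, 1.0, 0.5, 1.5, 1.0, 0.8])   # made-up variances
mu_hat, T_hat = em_mmle(y, sigma2)
info = gradient_information(y, sigma2, mu_hat, T_hat)
cov = np.linalg.inv(info)   # rough asymptotic covariance of (mu_hat, T_hat)
print(info)
print(np.sqrt(np.diag(cov)))
```

The expected scores are already produced by the E-step, so the estimate costs nothing beyond the EM run itself. With only eight made-up observations the resulting standard errors are purely illustrative.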
3.4.3 EM for NPEB
The EM algorithm is also ideally suited for computing the nonparametric maximum likelihood (NPML) estimate, introduced in Subsection 3.2 above. To initiate the EM algorithm, recall that the prior which maximizes the likelihood (3.5) in the nonparametric setting must be a discrete distribution with at most k mass points. Thus, assume the prior has mass points $\theta_1, \ldots, \theta_J$ and probabilities $p_1, \ldots, p_J$, where J = k. Then, for the $(\nu + 1)$st iteration, let $w_{ij}^{(\nu)}$ be the conditional probability, given $y_i$ and the current mass points and probabilities, that component i arises from mass point j, and let

$$p_j^{(\nu+1)} = \frac{1}{k} \sum_{i=1}^{k} w_{ij}^{(\nu)} \quad \text{and} \quad \theta_j^{(\nu+1)} \ \text{maximize} \ \sum_{i=1}^{k} w_{ij}^{(\nu)} \log f(y_i \mid \theta).$$

Note that the updated $\theta_j$ maximize weighted likelihoods, and that the updated $\theta_j$ are not necessarily equal to the previous values. The ws are particularly straightforward to compute using Bayes' theorem:

$$w_{ij}^{(\nu)} \propto f(y_i \mid \theta_j^{(\nu)}) \, p_j^{(\nu)}.$$

Normalization requires dividing by the sum over index j.
For the Poisson sampling distribution, we have

$$\theta_j^{(\nu+1)} = \frac{\sum_{i} w_{ij}^{(\nu)} y_i}{\sum_{i} w_{ij}^{(\nu)}},$$

a weighted average of the $y_i$. The initial mass points should be spread out to extend somewhat beyond the range of the data. The number of mass points in the NPML is generally much smaller than k. Therefore, though the number of mass points will remain constant at J, some combination of a subset of the $p_j$ approaching 0 and the $\theta_j$ approaching each other will occur as the EM iterations proceed. The EM algorithm has no problem dealing with this convergence to a ridge in the parameter space, though convergence will be very slow.
3.4.4 Speeding convergence and generalizations
Several authors (see, e.g., Tanner, 1993, pp. 55-57) provide methods for speeding convergence of the EM algorithm using the information decomposition (3.31) to compute an acceleration matrix. Though the approach is successful in many applications, its use moves away from the basic attraction of the EM algorithm: its ability to decompose a complicated likelihood maximization into relatively straightforward components.
Several other extensions to the EM algorithm further enhance its effectiveness. Meng and Rubin (1993) describe an algorithm designed to avoid the nested EM iterations required when the M-step is not available in closed form. Called the Expectation/Conditional Maximization (ECM) algorithm, it replaces the M-step with a set of conditional maximization steps. Meng and Rubin (1992) review this extension and others, including the Supplemented ECM (SECM) algorithm, which supplements the ECM algorithm to estimate the asymptotic variance matrix of the MMLE in the same way that the SEM algorithm supplements EM.
Finally, the EM algorithm also provides a starting point for Monte Carlo methods. For example, if the E-step cannot be computed analytically but values can be sampled from the appropriate conditional distribution, then the required conditional expectation can be estimated by Monte Carlo integration (see Section 5.3 for details).
Wei and Tanner (1990) refer to this approach as the Monte Carlo EM (MCEM) algorithm. In fact, the entire conditional distribution of S could be estimated in this way, providing input for computing the full posterior distribution in a "data augmentation" approach (Tanner and Wong, 1987).
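A Monte Carlo E-step can be sketched on the Gaussian/Gaussian model of Example 3.2. This is an illustration only: in that model the conditional distribution is normal, so the E-step is available in closed form and sampling is not actually needed. The function name, the number of draws, the number of iterations, the seed, and the data are assumptions made for the sketch.

```python
import numpy as np


def mcem_gaussian(y, sigma2, mu=0.0, T=1.0, n_iter=50, n_draws=20_000, seed=0):
    """Monte Carlo EM for (mu, T) in the Gaussian/Gaussian model (a sketch).

    Assumed model: y_i | theta_i ~ N(theta_i, sigma2_i), theta_i ~ N(mu, T),
    with known sampling variances sigma2_i.
    """
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    sigma2 = np.asarray(sigma2, dtype=float)
    for _ in range(n_iter):
        # Sample theta_i from its conditional distribution given y_i, mu, T.
        B = sigma2 / (sigma2 + T)
        post_mean = B * mu + (1.0 - B) * y
        post_sd = np.sqrt((1.0 - B) * sigma2)
        draws = rng.normal(post_mean, post_sd, size=(n_draws, y.size))
        # Monte Carlo estimates of the expected sufficient statistics,
        # followed by the usual complete-data M-step.
        mu = draws.mean()
        T = np.mean((draws - mu) ** 2)
    return mu, T


y = np.array([-2.0, 0.5, 1.0, 3.5, 6.0])        # made-up data
sigma2 = np.array([1.0, 0.5, 2.0, 1.0, 0.5])    # made-up sampling variances
mu_mc, T_mc = mcem_gaussian(y, sigma2)
print(mu_mc, T_mc)
```

Because each E-step is estimated with simulation noise, the iterates fluctuate around the MMLE rather than converging exactly; more draws per iteration reduce the fluctuation.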
Meng and Rubin (1992) observe that their Partitioned ECM (PECM) algorithm, which partitions the parameter vector into k subcomponents in order to reduce the dimensionality of the maximizations, is essentially a deterministic version of the Gibbs sampler, an extremely general and easy-to-use tool for Bayesian computation. We discuss this and other popular Markov chain Monte Carlo methods in Section 5.4.
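As a closing illustration, the NPML iteration of Subsection 3.4.3 can be sketched for Poisson data. The sketch assumes a Poisson sampling density, J = k starting mass points on an evenly spaced grid, and equal starting probabilities. The function name, the grid, the iteration count, the small numerical floors, and the data are hypothetical choices, not taken from the text.

```python
import numpy as np
from math import lgamma


def npml_poisson_em(y, n_iter=500):
    """EM for a discrete prior on Poisson means (illustrative sketch).

    Returns the mass points theta, their probabilities p, and the marginal
    log-likelihood recorded at the start of every iteration.
    """
    y = np.asarray(y, dtype=float)
    k = y.size
    J = k                                    # J = k mass points
    # Starting mass points spread a little beyond the range of the data.
    theta = np.linspace(max(y.min() - 1.0, 0.1), y.max() + 1.0, J)
    p = np.full(J, 1.0 / J)
    log_fact = np.array([lgamma(v + 1.0) for v in y])
    history = []
    for _ in range(n_iter):
        # log of p_j * f(y_i | theta_j) for the Poisson density, shape (k, J)
        log_joint = (np.log(p) + y[:, None] * np.log(theta) - theta
                     - log_fact[:, None])
        row_max = log_joint.max(axis=1, keepdims=True)
        log_marg = row_max[:, 0] + np.log(np.exp(log_joint - row_max).sum(axis=1))
        history.append(log_marg.sum())
        # E-step: Bayes' theorem, normalised over the mass-point index j.
        w = np.exp(log_joint - log_marg[:, None])
        # M-step: new probabilities and weighted-average mass points.
        col = np.maximum(w.sum(axis=0), 1e-300)   # guard against empty columns
        p = col / k
        theta = np.maximum((w * y[:, None]).sum(axis=0) / col, 1e-12)
    return theta, p, history


y = np.array([0, 1, 1, 2, 3, 7, 8, 9, 10, 12], dtype=float)   # made-up counts
theta, p, history = npml_poisson_em(y)
print(np.round(theta, 3))
print(np.round(p, 3))
```

Printing the results typically shows several mass points drifting toward common values while some probabilities shrink, which is the slow convergence to a ridge described in Subsection 3.4.3. The number of points stays fixed at J throughout.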