Expectation-maximization algorithm - Basics of network structure

1.3 Basics of network structure

1.3.8 Expectation-maximization algorithm

As mentioned in section 1.3.6, maximization via derivatives is not sufficient for many network models. To supplement that approach, this dissertation will make extensive use of the maximization technique known as the expectation-maximization (EM) algorithm [39]. The EM algorithm is a technique for maximizing the likelihood of a parameterized model with latent (that is, unobserved) variables. For stochastic blockmodels, the latent variables are the communities (or whatever vertex information the model concerns itself with) and the parameters are the mixing matrix and any other relevant parameters explicitly defined in the model. The algorithm splits the single maximization into two deterministic steps, each of which can be much simpler to solve than the combined problem. The first (Expectation) step is to find the distribution of the latent variables while holding the observables and parameters constant, and the second (Maximization) step is to maximize the explicit expression with the latent variables over the parameters, a step which is often done directly with derivatives. These two steps are computed in an alternating fashion until the parameters converge, at which point both steps are satisfied simultaneously. The EM algorithm is proven never to decrease the likelihood from one step to the next, and to converge to a critical point of the likelihood (footnote 13). This critical point is not guaranteed to be the absolute maximum, so the algorithm will typically be run multiple times (from different starting points, since it is deterministic) and the best result kept as the desired answer.
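The alternating steps and the restart strategy described above can be sketched on a toy problem. This is a minimal illustration, not one of the network models of this dissertation: it assumes a hypothetical mixture of two biased coins, where each observation is the number of heads in `n` flips and the latent variable is which coin was used. The function names, the synthetic data, and the stopping rule (change in log-likelihood rather than in the parameters) are all assumptions made for the example.

```python
import math
import random


def clamp(v, eps=1e-12):
    """Keep a probability strictly inside (0, 1) so logs and divisions stay defined."""
    return min(max(v, eps), 1.0 - eps)


def component_weights(h, n, pi, p):
    """Joint weight of head count h with each coin (binomial coefficient dropped)."""
    a = pi * p[0] ** h * (1 - p[0]) ** (n - h)
    b = (1 - pi) * p[1] ** h * (1 - p[1]) ** (n - h)
    return a, b


def log_likelihood(data, n, pi, p):
    """Log-likelihood of the data, up to a constant that does not depend on the parameters."""
    return sum(math.log(sum(component_weights(h, n, pi, p))) for h in data)


def em(data, n, rng, max_iters=500, tol=1e-10):
    """One deterministic EM run from a random starting point drawn from rng."""
    pi = rng.uniform(0.2, 0.8)
    p = [rng.uniform(0.05, 0.95), rng.uniform(0.05, 0.95)]
    history = [log_likelihood(data, n, pi, p)]
    for _ in range(max_iters):
        # E step: posterior probability that each observation came from coin 0,
        # holding the parameters fixed.
        resp = []
        for h in data:
            a, b = component_weights(h, n, pi, p)
            resp.append(a / (a + b))
        # M step: closed-form parameter updates, holding the posterior fixed.
        r0 = sum(resp)
        r1 = len(data) - r0
        pi = clamp(r0 / len(data))
        p[0] = clamp(sum(r * h for r, h in zip(resp, data)) / (n * r0))
        p[1] = clamp(sum((1 - r) * h for r, h in zip(resp, data)) / (n * r1))
        history.append(log_likelihood(data, n, pi, p))
        # Assumed stopping rule: stop once the log-likelihood stops improving.
        if history[-1] - history[-2] < tol:
            break
    return history, pi, p


def make_data(rng, trials=200, n=20, p_true=(0.3, 0.8)):
    """Synthetic head counts: each trial picks one of two coins with equal probability."""
    data = []
    for _ in range(trials):
        coin = 0 if rng.random() < 0.5 else 1
        data.append(sum(rng.random() < p_true[coin] for _ in range(n)))
    return data


rng = random.Random(0)
data = make_data(rng)
# Several restarts from different starting points; keep the run with the best likelihood.
runs = [em(data, 20, rng) for _ in range(5)]
best_history, best_pi, best_p = max(runs, key=lambda run: run[0][-1])
print("best log-likelihood:", best_history[-1])
print("estimated coin biases:", sorted(best_p))
```

Each entry of `history` records the log-likelihood after one full E and M pass, so the claim that the likelihood never decreases can be inspected directly for every restart.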

There are many equivalent formulations of the EM algorithm, and two will be presented here which will be useful later in the dissertation. Mathematically the objective is to turn the log of a sum into a sum of logs, since these are much easier to differentiate. A simple and direct way this can appear is via Jensen’s inequality in the form

\[ \log \sum_u x_u \ge \sum_u q_u \log \frac{x_u}{q_u}, \tag{1.5} \]

where the $x_u$ are some set of positive numbers (which in our case will be related to the probability of a specific edge) and the $q_u$ are any nonnegative numbers satisfying $\sum_u q_u = 1$. This statement is a valid application of Jensen’s inequality because the logarithm function is concave. Notice in particular that the equality can be recovered by making the particular choice

\[ q_u = \frac{x_u}{\sum_{u'} x_{u'}}. \tag{1.6} \]
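As a quick numerical check of (1.5) and (1.6), the short sketch below evaluates both sides of the inequality for an arbitrary valid q and for the optimal choice. The values in `x` and the helper name `jensen_bound` are illustrative assumptions, not quantities from any model in this dissertation.

```python
import math


def jensen_bound(x, q):
    """Right-hand side of (1.5); terms with q_u = 0 contribute nothing."""
    return sum(qu * math.log(xu / qu) for xu, qu in zip(x, q) if qu > 0)


x = [0.2, 0.5, 1.3]                      # arbitrary positive numbers (illustrative)
log_sum = math.log(sum(x))               # left-hand side of (1.5)

q_uniform = [1 / 3, 1 / 3, 1 / 3]        # a valid but non-optimal choice of q
q_optimal = [xu / sum(x) for xu in x]    # the choice in (1.6)

print("log of sum:        ", log_sum)
print("bound, uniform q:  ", jensen_bound(x, q_uniform))
print("bound, optimal q:  ", jensen_bound(x, q_optimal))
```

The uniform choice gives a strictly smaller value than the log of the sum, while the choice in (1.6) reproduces it up to floating-point rounding.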

It is not immediately clear how the $q_u$ correspond to the latent variables of our models, so this is demonstrated by comparing this method with a second way of writing the EM algorithm:

\[ \log P(G|\Theta) \ge \log P(G|\Theta) - D\big(Q(Z)\,\|\,P(Z|G,\Theta)\big), \tag{1.7} \]

where $P(G|\Theta)$ is the probability of the data given the parameters (written in a format suggestive of the application to networks), $Z$ refers to the latent variables, and we’ve introduced $Q(Z)$ as any probability distribution over $Z$. Here

\[ D\big(P_1(X)\,\|\,P_2(X)\big) = \sum_X P_1(X) \log \frac{P_1(X)}{P_2(X)} \]

is the Kullback-Leibler divergence (also referred to as the relative entropy) of distributions $P_1(X)$ and $P_2(X)$ over some random variable $X$. In particular, the Kullback-Leibler divergence is nonnegative and equals 0 if and only if $P_1(X) = P_2(X)$ almost everywhere (footnote 14). Thus whereas $Q(Z)$ can be any probability distribution, the right hand side of equation (1.7) will be maximized (and thus the two sides of the equation will be equal) when $Q(Z) = P(Z|G,\Theta)$, the posterior distribution of the latent variables. Simplifying equation (1.7) gives

\[ \log P(G|\Theta) \ge \sum_Z Q(Z) \log P(G|Z,\Theta) + \sum_Z Q(Z) \log P(Z|\Theta) - \sum_Z Q(Z) \log Q(Z). \tag{1.8} \]

$P(G|Z,\Theta)$ is the form of our model given the latent variables and is much easier to write explicitly than $P(G|\Theta)$ since there’s no additional sum over the latent variables. The term $P(Z|\Theta)$ is the prior on the latent variables; it often depends on the parameters independently of the probability of the graph and can thus be maximized separately from the rest of the equation. The two steps of the EM algorithm are to compute $Q(Z) = P(Z|G,\Theta)$ holding $\Theta$ constant, then to maximize equation (1.8) with respect to $\Theta$ while holding $Q(Z)$ constant. Comparing equations (1.5) and (1.8) with the appropriate substitutions ($q_u = Q(Z)$ and $x_u = P(G,Z|\Theta)$, with $u$ ranging over the configurations of the latent variables $Z$), it is clear that the two are equivalent and thus both represent valid EM algorithms. This dissertation will primarily use the Jensen formulation since it will be easier to follow for the models presented.

13. Since each step individually maximizes the log-likelihood with respect to a particular set of parameters, it is trivial that the likelihood cannot decrease from one iteration to the next. Furthermore, well-defined likelihoods have an upper bound of 1, so this increase has to stop somewhere. It will soon be shown that where this two-step process converges corresponds to a critical point of the original likelihood.

14. For the network models, the fact that the Kullback-Leibler divergence is additive over indepen-
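The equivalence of the two formulations can be checked numerically on a small example. The sketch below assumes a hypothetical table `joint` holding $P(G,Z|\Theta)$ for one fixed observed $G$ and three latent configurations; the numbers and function names are illustrative only. It compares the Jensen bound of (1.5), under the substitutions $q_u = Q(Z)$ and $x_u = P(G,Z|\Theta)$, with the right hand side of (1.7).

```python
import math

# Hypothetical values of P(G, Z=z | Theta) for the observed G and three latent states z.
joint = [0.02, 0.10, 0.08]
evidence = sum(joint)                       # P(G | Theta), the sum over latent states
posterior = [j / evidence for j in joint]   # P(Z | G, Theta)


def kl_divergence(p1, p2):
    """Kullback-Leibler divergence D(p1 || p2) for discrete distributions."""
    return sum(a * math.log(a / b) for a, b in zip(p1, p2) if a > 0)


def lower_bound(q):
    """Jensen form of the bound: sum_Z Q(Z) log(P(G, Z | Theta) / Q(Z))."""
    return sum(qz * math.log(j / qz) for qz, j in zip(q, joint) if qz > 0)


q = [0.5, 0.3, 0.2]                         # an arbitrary distribution over Z
jensen_side = lower_bound(q)
kl_side = math.log(evidence) - kl_divergence(q, posterior)

print("Jensen form:       ", jensen_side)
print("log P(G) minus KL: ", kl_side)
print("log P(G):          ", math.log(evidence))
print("bound at posterior:", lower_bound(posterior))
```

The first two printed values agree for any valid `q`, and the bound reaches $\log P(G|\Theta)$ only when `q` equals the posterior, which is the E step.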

In most applications of the EM algorithm, the M step is easy to compute, and a closed-form solution is not uncommon. The E step can be another story, however: it is rare for this step to have a closed-form solution, so a numerical method like Markov chain Monte Carlo (MCMC) is usually needed to obtain the distribution of the latent variables [127].

In document Blockmodeling Techniques for Complex Networks. (Page 40-43)