
1.5 Learning and inference

1.5.2 Message passing

Message passing schemes propagate local factor information to the other factors and try to achieve global consistency by enforcing local consistency. These algorithms are most conveniently described using factor graphs.

Definition 21 (Factor graph) (Kschischang et al., 2001) Given a pdf which factorizes onto groups of nodes: $p(x) = \frac{1}{Z} \prod_{c \in C} \phi_c(x_c)$, its factor graph is defined as a bipartite graph, where one side consists of the original nodes and the other side consists of the factors given by the prescribed factorization of $p$. A node $i$ is linked with a factor $c$ if, and only if, $i$ is involved in the factor $c$ ($i \sim c$).

For example, given a joint distribution:

$$p(x_1, \ldots, x_5) = f_A(x_1)\, f_B(x_2)\, f_C(x_1, x_2, x_3)\, f_D(x_3, x_4)\, f_E(x_3, x_5),$$

the corresponding factor graph is shown in Figure 1.3. Every graphical model can be represented by a factor graph whose factors are subsets of the maximal cliques.
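The bipartite structure of this example can be written down directly from the factor scopes. The following is a minimal sketch; the names `scopes`, `factor_graph_edges` and `neighbours_of_variable` are illustrative choices and do not come from the text.

```python
# Factor scopes for p(x1..x5) = fA(x1) fB(x2) fC(x1,x2,x3) fD(x3,x4) fE(x3,x5).
# The dict name `scopes` and the helpers below are hypothetical, for illustration only.
scopes = {
    "A": (1,),
    "B": (2,),
    "C": (1, 2, 3),
    "D": (3, 4),
    "E": (3, 5),
}

def factor_graph_edges(scopes):
    """Bipartite edge list: (variable i, factor c) whenever i is involved in c."""
    return sorted((i, c) for c, nodes in scopes.items() for i in nodes)

def neighbours_of_variable(scopes, i):
    """Names of the factors linked to variable i."""
    return sorted(c for c, nodes in scopes.items() if i in nodes)

edges = factor_graph_edges(scopes)
print(edges)
print(neighbours_of_variable(scopes, 3))  # factors that involve x3
```

The graph has 10 vertices (5 variables and 5 factors) and 9 edges and is connected, so this particular factor graph is a tree.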

Interpreting $c$ as the set of nodes associated with factor $c$, one can define the following message passing scheme on the factor graph:

variable $i \in c$ to factor $c$:
$$m_{i \to c}(x_i) = \prod_{c' : i \sim c',\, c' \neq c} m_{c' \to i}(x_i),$$

factor $c$ to variable $i \sim c$:
$$m_{c \to i}(x_i) = \sum_{x_{c \setminus \{i\}}} \left( f_c(x_i, x_{c \setminus \{i\}}) \prod_{j \in c \setminus \{i\}} m_{j \to c}(x_j) \right),$$

and the final marginal distribution can be obtained by $p(x_c) := f_c(x_c) \prod_{i \in c} m_{i \to c}(x_i)$ up to a normalization constant. By replacing the sum-product above with max-product, the same scheme can be used for MAP inference. This is the idea of the generalized distributive law with different semirings (Kschischang et al., 2001).
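As an illustration, the two message updates can be run on the five-variable example and checked against brute-force enumeration. This is a sketch under stated assumptions: all variables are binary, the factor tables are random positive numbers, messages are normalized after each update for numerical convenience, and single-variable marginals are read off as the normalized product of incoming factor-to-variable messages (a standard identity not spelled out in the text). All function and variable names are hypothetical.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
K = 2  # states per variable (assumption: every variable is binary)

# Scopes of the example factors, with variables 0-indexed; tables are random (assumption).
scopes = {"A": (0,), "B": (1,), "C": (0, 1, 2), "D": (2, 3), "E": (2, 4)}
tables = {c: rng.uniform(0.5, 2.0, size=(K,) * len(s)) for c, s in scopes.items()}
n_vars = 5

def sum_product(scopes, tables, n_vars, n_iters=10):
    # Messages in both directions, initialised to all-ones vectors.
    m_vf = {(i, c): np.ones(K) for c, s in scopes.items() for i in s}
    m_fv = {(c, i): np.ones(K) for c, s in scopes.items() for i in s}
    for _ in range(n_iters):
        # Variable -> factor: product of messages from the other neighbouring factors.
        new_vf = {}
        for (i, c) in m_vf:
            msg = np.ones(K)
            for c2, s2 in scopes.items():
                if c2 != c and i in s2:
                    msg = msg * m_fv[(c2, i)]
            new_vf[(i, c)] = msg / msg.sum()
        # Factor -> variable: multiply in the other variables' messages, then sum them out.
        new_fv = {}
        for (c, i) in m_fv:
            s = scopes[c]
            t = tables[c].copy()
            for axis, j in enumerate(s):
                if j != i:
                    shape = [1] * len(s)
                    shape[axis] = K
                    t = t * new_vf[(j, c)].reshape(shape)
            other_axes = tuple(a for a, j in enumerate(s) if j != i)
            msg = t.sum(axis=other_axes) if other_axes else t
            new_fv[(c, i)] = msg / msg.sum()
        m_vf, m_fv = new_vf, new_fv
    # Single-variable marginals from the incoming factor messages.
    marginals = []
    for i in range(n_vars):
        b = np.ones(K)
        for c, s in scopes.items():
            if i in s:
                b = b * m_fv[(c, i)]
        marginals.append(b / b.sum())
    return np.array(marginals)

def brute_force_marginals(scopes, tables, n_vars):
    # Enumerate the full joint table; feasible only for tiny models.
    joint = np.zeros((K,) * n_vars)
    for x in itertools.product(range(K), repeat=n_vars):
        p = 1.0
        for c, s in scopes.items():
            p *= tables[c][tuple(x[j] for j in s)]
        joint[x] = p
    joint /= joint.sum()
    return np.array([joint.sum(axis=tuple(a for a in range(n_vars) if a != i))
                     for i in range(n_vars)])

bp = sum_product(scopes, tables, n_vars)
exact = brute_force_marginals(scopes, tables, n_vars)
print(np.abs(bp - exact).max())
```

Because this factor graph is a tree, the two results should agree up to floating-point error once the messages have propagated across the graph. Replacing the sum over `other_axes` with a maximum gives the max-product variant.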

Belief propagation (BP) is guaranteed to converge on graphs with at most one loop (Weiss, 2000) or when the joint distribution is Gaussian with arbitrary topology (Weiss, 2001). Unfortunately, when there is more than one loop, no guarantee can be made on convergence, or on convergence to the true marginals. Ihler et al. (2005) provided some convergence analysis and conditions using contraction of the dynamic range. In general, convergence remains an open issue, although loopy BP often performs well in practice.

A major advance in message passing inference was made by Minka (2001), called expectation propagation (EP). In a nutshell, it approximates all the factors $f_c(x_c)$ with some restricted (simple) forms $\tilde{f}_c(x_c)$, such as a product of independent Gaussians, so that inference on the joint approximation $\{\tilde{f}_c(x_c)\}_c$ is tractable. The approximation criterion is to minimize the KL divergence between the given pdf $q \propto \prod_c f_c(x_c)$ and the approximant $p \propto \prod_c \tilde{f}_c(x_c)$. If the approximant is restricted to exponential families, this is equivalent to moment matching. For computational tractability, a cavity update scheme is employed, i.e., cycle through all the factors and each time optimize one factor's approximation in the context of the other factors' current approximations. Here we sketch some technical details because EP will be used extensively in Chapter 3; the full details can be found in (Minka, 2001).

Suppose we have a pre-specified exponential family $P_\phi$ for which efficient inference is available. Now we are given an arbitrary pdf $q$ and inference is intractable on it. A natural idea is to approximate $q$ by some distribution $p(x;\theta) \in P_\phi$, and then simply use the marginals, partition functions, etc. of $p(x;\theta)$ as the surrogate of those of $q$. The above approximation can be in the sense of projecting $q$ to $P_\phi$ in KL divergence:

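$$\min_{\theta \in \Theta} \mathrm{KL}(q \,\|\, p(x;\theta)) \;\Leftrightarrow\; \min_{\theta \in \Theta} \mathrm{KL}\big(q \,\|\, \exp(\langle \phi(x), \theta \rangle - g(\theta))\big).$$

Taking the gradient with respect to $\theta$ and equating it to 0 yields the optimality condition:

$$\mathbb{E}_{x \sim q}[\phi(x)] = \mathbb{E}_{x \sim p(x;\theta)}[\phi(x)],$$

which means matching the expectation of features.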
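The step from the KL objective to feature matching can be spelled out as follows. This is a sketch of the standard exponential-family argument in the notation above; it assumes that $g$ is the log-partition function, so that $p(x;\theta)$ is normalized and $\nabla g(\theta) = \mathbb{E}_{x \sim p(x;\theta)}[\phi(x)]$.

```latex
\begin{align*}
\mathrm{KL}(q \,\|\, p(x;\theta))
  &= \int q(x) \log q(x)\, dx
     - \big\langle \mathbb{E}_{x \sim q}[\phi(x)],\, \theta \big\rangle + g(\theta), \\
\nabla_\theta\, \mathrm{KL}(q \,\|\, p(x;\theta))
  &= -\mathbb{E}_{x \sim q}[\phi(x)] + \nabla g(\theta)
   = -\mathbb{E}_{x \sim q}[\phi(x)] + \mathbb{E}_{x \sim p(x;\theta)}[\phi(x)].
\end{align*}
```

Setting the gradient to zero equates the two expectations. Since $g$ is convex, the objective is convex in $\theta$ and any stationary point is a global minimizer.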

Now suppose the distributions have graphical model structures, and $q$ factorizes as

$$q(x) = \frac{1}{Z} \prod_{c \in C} f_c(x_c),$$

and we naturally wish to project onto $P_\phi$ where $\phi$ factorizes into $\mathrm{vec}_{c \in C}\{\phi_c(x_c)\}$:

$$\min_{\theta \in \Theta} \mathrm{KL}(q \,\|\, p(x;\theta)) \;\Leftrightarrow\; \min_{\theta \in \Theta} \mathrm{KL}\left( \frac{1}{Z} \prod_{c \in C} f_c(x_c) \,\Big\|\, \exp\Big( \sum_{c \in C} \langle \phi_c(x_c), \theta_c \rangle - g(\theta) \Big) \right). \qquad (1.21)$$

Although the result of moment matching still holds, it is now intractable to compute the moments in general, and in fact this is the problem we wanted to tackle in the first place. This obstacle also precludes clique-wise block coordinate descent. Although matching the moments clique by clique independently is computationally feasible, it does not give good approximations unless all the cliques are disjoint.

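Let us first ignore the normalizer and write $\tilde{f}_c(x_c;\theta_c) := \exp(\langle \phi_c(x_c), \theta_c \rangle)$. EP takes a cavity approach (Opper & Winther, 2000): cycle through all the cliques, and for each clique $c$, find the best approximant $\tilde{f}_c$ of $f_c$ keeping the other current approximants $\tilde{f}_{c'}$ ($c' \neq c$) fixed, i.e.

$$\min_{\theta_c} \mathrm{KL}\left( f_c(x_c) \prod_{c' \neq c} \tilde{f}_{c'}(x_{c'};\theta_{c'}) \,\Big\|\, \tilde{f}_c(x_c;\theta_c) \prod_{c' \neq c} \tilde{f}_{c'}(x_{c'};\theta_{c'}) \right).$$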

Since only one factor from $q$, namely $f_c$, is involved, this optimization over $\theta_c$ is feasible via moment matching. Different algorithms can be derived by further assuming different forms of the exponential family. For example, loopy belief propagation can be recovered when each $\phi_c$ completely decomposes onto individual nodes: $\phi_c(x_c) = \mathrm{vec}_{i \in c}\, \phi_{c,i}(x_i)$. The normalization factor can be obtained by matching the zeroth order moment, and is usually done after the above cyclic procedure terminates.
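The cavity update can be illustrated on a deliberately small problem. The sketch below is a toy instance and not the general clique-based algorithm: it assumes a single scalar variable, a standard Gaussian prior kept exact, logistic factors $f_i(x) = \sigma(s_i x)$, Gaussian site approximations stored as natural parameters, and moments computed by brute-force quadrature on a grid. The model, the slopes and all names (`ep_1d`, `moments`, `grid`) are assumptions made for illustration.

```python
import numpy as np

# Integration grid (assumption: essentially all mass lies in [-10, 10]).
grid = np.linspace(-10.0, 10.0, 4001)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def moments(weights):
    """Mean and variance of an unnormalized density tabulated on `grid`."""
    w = weights / weights.sum()
    mean = (w * grid).sum()
    var = (w * (grid - mean) ** 2).sum()
    return mean, var

def ep_1d(slopes, n_passes=20, prior_prec=1.0):
    """EP for q(x) proportional to N(x; 0, 1/prior_prec) * prod_i sigmoid(slopes[i] * x)."""
    n = len(slopes)
    site_prec = np.zeros(n)  # site natural parameters (precision),
    site_pm = np.zeros(n)    # and precision times mean; zeros mean every site equals 1
    for _ in range(n_passes):
        for i in range(n):
            # Cavity: current Gaussian approximation with site i divided out.
            cav_prec = prior_prec + site_prec.sum() - site_prec[i]
            cav_pm = site_pm.sum() - site_pm[i]
            # Tilted distribution: exact factor i times the Gaussian cavity.
            log_cavity = -0.5 * cav_prec * grid ** 2 + cav_pm * grid
            tilted = sigmoid(slopes[i] * grid) * np.exp(log_cavity - log_cavity.max())
            mean, var = moments(tilted)
            # Moment matching, then divide out the cavity to obtain the new site.
            site_prec[i] = 1.0 / var - cav_prec
            site_pm[i] = mean / var - cav_pm
    post_prec = prior_prec + site_prec.sum()
    return site_pm.sum() / post_prec, 1.0 / post_prec

slopes = np.array([2.0, 1.0, 3.0, -1.5])
ep_mean, ep_var = ep_1d(slopes)

# Reference: moments of the exact q, computed on the same grid.
exact = np.exp(-0.5 * grid ** 2) * np.prod(sigmoid(slopes[:, None] * grid), axis=0)
exact_mean, exact_var = moments(exact)
print(ep_mean, ep_var)
print(exact_mean, exact_var)
```

Storing the sites as natural parameters makes the cavity a simple subtraction, which mirrors the division of the current approximation by $\tilde{f}_c$ in the objective above.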

Unfortunately, EP still has no convergence guarantee, and even when it converges, there is no guarantee that it will give the correct inference results. Nevertheless, it works well in practice, and a theoretical analysis is available in (Minka, 2005), which also provides unified comparisons with some other inference algorithms.

A simplified version of EP takes only one pass through the factors in $q$. This is known as assumed density filtering (ADF) (Maybeck, 1982; Opper, 1998), and is useful for online learning where factors are revealed in a sequence and must be discarded before the next factor arrives (for privacy or storage constraints). In general, the accuracy of ADF is inferior to that of EP, and it is sensitive to the order in which the factors are revealed.
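A single-pass update of this kind might look as follows. This is a minimal sketch on a toy problem chosen for illustration, not taken from the text: one scalar variable, a Gaussian belief, logistic factors $\sigma(s x)$ arriving one at a time, and moment matching done by quadrature on a grid. The function name `adf_1d` and the slope values are hypothetical.

```python
import numpy as np

# Integration grid (assumption: essentially all mass lies in [-10, 10]).
grid = np.linspace(-10.0, 10.0, 4001)

def adf_1d(slopes, prior_mean=0.0, prior_var=1.0):
    """One pass: absorb each factor sigmoid(s * x) into a Gaussian belief, then discard it."""
    mean, var = prior_mean, prior_var
    for s in slopes:
        belief = np.exp(-0.5 * (grid - mean) ** 2 / var)  # current Gaussian belief
        tilted = belief / (1.0 + np.exp(-s * grid))       # belief times the new factor
        w = tilted / tilted.sum()
        mean = (w * grid).sum()                           # moment matching: keep only
        var = (w * (grid - mean) ** 2).sum()              # the mean and the variance
    return mean, var

slopes = [4.0, -3.0, 2.0, 0.5]
forward = adf_1d(slopes)         # factors revealed in the given order
backward = adf_1d(slopes[::-1])  # same factors, reversed order
print(forward, backward)
```

Only the current mean and variance are kept between steps, so each factor can be discarded once it has been absorbed. Comparing `forward` with `backward` gives a quick check of how sensitive the result is to the order of arrival.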