Approximate Inference - Inference and Learning

2.2 Inference and Learning

2.2.2 Approximate Inference

In most settings, and in particular when using relational PGMs, exact inference algorithms are not feasible. Therefore, one often resorts to approximate inference algorithms. One class of approximate inference algorithms comprises sampling-based methods. There is a large choice of sampling approaches, including rejection sampling, importance sampling, and Markov chain Monte Carlo (MCMC) methods, to name only a few. The general idea behind sampling algorithms is to obtain a set of samples drawn from the distribution p(X). As a representative of the MCMC methods, we will now briefly describe Gibbs sampling, and we refer to [21, Chapter 11] and [124, Chapter 12] for a general overview of sampling methods.

Gibbs sampling [64] is a very simple yet widely applied MCMC method. It serves as an approximate inference technique for problems where the conditional probability of a variable given its Markov blanket can be computed easily. Starting with an initial value for each of the unobserved variables, we iterate over all of these variables in a pre-defined, or alternatively random, order. For each variable, we sample a new value from its conditional probability distribution given the current values of its Markov blanket. In many BNs and MRFs, every sampling step can be done fast because one can easily sample from the conditional distribution when the values of all parents, respectively neighbors, are fixed. Iterating once over all variables in the network is often referred to as a sweep. These sweeps are repeated for a fixed number of iterations. The more sweeps we run, the less influence the initial state of the variables has, and the better the samples approximate the true posterior distribution. To reduce the bias introduced by the initial state, a burn-in phase is usually incorporated into the sampling process, i.e., the samples from the first sweeps are discarded. From the MCMC algorithms’ point of view, the Gibbs sampler is a particular instance of the Metropolis-Hastings algorithm with a constant acceptance probability equal to one, i.e., all samples are accepted. Gibbs sampling will be used in different parts of this thesis because the MRFs in use allow us to easily sample from the required conditional distributions. For example, we will see in Chapter 7 how Gibbs sampling is used to generate samples for count random variables that are assumed to be conditionally Poisson distributed.

Another interesting MCMC approach worth mentioning is slice sampling. This sampling algorithm also finds applications in PGM inference and is referred to again in Section 2.3.1.
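To make the Gibbs sweeps described above concrete, the following is a minimal sketch on a hypothetical toy model, not one of the models used in this thesis. It assumes a chain of three binary variables; the names EDGES, COUPLING, BIAS, and all numeric values are illustrative assumptions. The estimated marginals are compared against exact marginals obtained by enumeration, which is only possible because the model is tiny.

```python
import itertools
import math
import random

# Hypothetical toy MRF: binary chain 0 - 1 - 2; all numbers are assumed values.
EDGES = [(0, 1), (1, 2)]
COUPLING = 0.8            # log-potential added when two neighbours agree
BIAS = [0.5, 0.0, -0.3]   # log-potential added when variable i takes value 1
N = len(BIAS)


def conditional_p1(x, i):
    """p(x_i = 1 | rest); only the neighbours (Markov blanket) of i matter."""
    s1, s0 = BIAS[i], 0.0
    for a, b in EDGES:
        if i in (a, b):
            other = x[b] if a == i else x[a]
            s1 += COUPLING * (other == 1)
            s0 += COUPLING * (other == 0)
    return 1.0 / (1.0 + math.exp(s0 - s1))


def gibbs_marginals(num_sweeps=20000, burn_in=1000, seed=0):
    """Estimate p(x_i = 1) from Gibbs samples, discarding the burn-in sweeps."""
    rng = random.Random(seed)
    x = [rng.randint(0, 1) for _ in range(N)]   # arbitrary initial state
    counts = [0] * N
    for sweep in range(num_sweeps):
        for i in range(N):                      # one sweep, fixed variable order
            x[i] = 1 if rng.random() < conditional_p1(x, i) else 0
        if sweep >= burn_in:
            for i in range(N):
                counts[i] += x[i]
    kept = num_sweeps - burn_in
    return [c / kept for c in counts]


def exact_marginals():
    """Exact p(x_i = 1) by enumerating all 2^N states (toy models only)."""
    states = list(itertools.product([0, 1], repeat=N))
    weights = []
    for s in states:
        score = sum(BIAS[i] * s[i] for i in range(N))
        score += sum(COUPLING * (s[i] == s[j]) for i, j in EDGES)
        weights.append(math.exp(score))
    z = sum(weights)
    return [sum(w for s, w in zip(states, weights) if s[i] == 1) / z
            for i in range(N)]


if __name__ == "__main__":
    print("Gibbs:", gibbs_marginals())
    print("exact:", exact_marginals())
```

Every sampled value is kept, which reflects the acceptance probability of one mentioned above. The number of sweeps and the burn-in length are arbitrary choices for this sketch and would have to be tuned for a real model.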

While MCMC methods have long been among the most popular approximate inference algorithms, interest in alternatives to MCMC has been increasing since the late 1990s. A large class of these alternatives are message-passing approaches. These algorithms exchange messages between nodes in the graph, or between nodes and factors in a factor graph, until a convergence criterion is met. Probably the most famous algorithm in this group is Belief Propagation (BP) [177]. BP is exact on tree-structured problems and has also shown good performance on loopy graphs [158]. Similar to VE and JT, BP can easily be adapted to compute max-marginals, i.e., to solve MAP inference problems; this variant is referred to as max-product BP, as opposed to sum-product BP for marginal inference. We will describe BP in detail in Section 2.4.1 because it is of major importance for this thesis. There also exists an extension of BP to Gaussian MRFs, which is referred to as Gaussian BP (GaBP) [246]. It was shown that running GaBP on specific graphs amounts to running a power iteration approach. We will shed more light on this connection in Chapter 5 because we will use this power iteration method for running Label Propagation, a graph labeling algorithm, in a large-scale labeling task.
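As an illustration of the statement that BP is exact on tree-structured problems, the following sketch runs sum-product message passing on a hypothetical three-variable chain and compares the resulting beliefs with brute-force marginals. The potentials UNARY and PAIR and all function names are assumptions made for this example only; the precise message updates used in this thesis are those of Section 2.4.1.

```python
import itertools
import math

# Hypothetical chain x0 - x1 - x2 with binary states; all numbers are assumed.
UNARY = [[1.0, 2.0], [1.0, 1.0], [3.0, 1.0]]   # UNARY[i][state]
PAIR = [[2.0, 1.0], [1.0, 2.0]]                # PAIR[a][b], shared by both edges
N = len(UNARY)


def normalise(v):
    total = sum(v)
    return [a / total for a in v]


def bp_chain_marginals():
    """Sum-product BP on a chain: one left-to-right and one right-to-left pass."""
    # fwd[i][s]: message arriving at variable i from its left neighbour.
    fwd = [[1.0, 1.0] for _ in range(N)]
    for i in range(1, N):
        fwd[i] = normalise([
            sum(UNARY[i - 1][a] * fwd[i - 1][a] * PAIR[a][b] for a in range(2))
            for b in range(2)
        ])
    # bwd[i][s]: message arriving at variable i from its right neighbour.
    bwd = [[1.0, 1.0] for _ in range(N)]
    for i in range(N - 2, -1, -1):
        bwd[i] = normalise([
            sum(UNARY[i + 1][b] * bwd[i + 1][b] * PAIR[a][b] for b in range(2))
            for a in range(2)
        ])
    # Belief of a variable: local potential times all incoming messages.
    return [normalise([UNARY[i][s] * fwd[i][s] * bwd[i][s] for s in range(2)])
            for i in range(N)]


def brute_force_marginals():
    """Exact marginals by enumerating all joint states (toy models only)."""
    marg = [[0.0, 0.0] for _ in range(N)]
    for x in itertools.product(range(2), repeat=N):
        w = math.prod(UNARY[i][x[i]] for i in range(N))
        w *= math.prod(PAIR[x[i]][x[i + 1]] for i in range(N - 1))
        for i in range(N):
            marg[i][x[i]] += w
    return [normalise(m) for m in marg]


if __name__ == "__main__":
    print("BP:   ", bp_chain_marginals())
    print("exact:", brute_force_marginals())
```

Replacing each inner sum by a max would turn this sketch into the max-product variant for max-marginals. On a graph with loops, the same updates would have to be iterated until convergence and would no longer be guaranteed to be exact.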

Variational inference approaches approximate a complex target distribution by a simpler distribution for which running inference is tractable. The goal is then to find a distribution from a family of simpler distributions that is as close as possible to the original one. Hence, the task of inference can be seen as an optimization problem where the objective is the distance between the true distribution and the approximation. Probabilistic queries are then answered by using the simpler distribution as a proxy. Variational methods include approaches such as Expectation Propagation [150] and Mean Field approximations [105]. However, BP was also later understood to fall into this paradigm and was described in terms of the Bethe approximation to the free energy [253]. In general, variational inference subsumes many different algorithms and ideas, and as Jordan et al. [105] state, “there is as much art as there is science in our current understanding of how variational methods can be applied to probabilistic inference”. Therefore, a full coverage of variational inference is beyond the scope of this introduction, and we refer the reader to [105] and [124, Chapter 11] for more details.
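The optimization view of variational inference can be illustrated with a naive mean field sketch. It assumes a fully factorized approximation q and the KL divergence from q to the target as the distance, which is a common choice but an assumption here; the toy chain model, its parameters, and all function names are hypothetical. Minimizing this KL divergence is equivalent to maximizing a lower bound on the log partition function, which the sketch checks against the exact value obtained by enumeration.

```python
import itertools
import math

# Hypothetical toy MRF: binary chain 0 - 1 - 2; all numbers are assumed values.
EDGES = [(0, 1), (1, 2)]
COUPLING = 0.8            # log-potential added when two neighbours agree
BIAS = [0.5, 0.0, -0.3]   # log-potential added when variable i takes value 1
N = len(BIAS)


def neighbours(i):
    return [b if a == i else a for a, b in EDGES if i in (a, b)]


def mean_field(num_sweeps=200):
    """Coordinate updates of q_i = q(x_i = 1) for a fully factorised q."""
    q = [0.5] * N
    for _ in range(num_sweeps):
        for i in range(N):
            # Expected log-score difference between x_i = 1 and x_i = 0 under q.
            field = BIAS[i] + COUPLING * sum(2.0 * q[j] - 1.0
                                             for j in neighbours(i))
            q[i] = 1.0 / (1.0 + math.exp(-field))
    return q


def elbo(q):
    """Lower bound on log Z: expected log-score under q plus entropy of q."""
    energy = sum(BIAS[i] * q[i] for i in range(N))
    energy += sum(COUPLING * (q[i] * q[j] + (1 - q[i]) * (1 - q[j]))
                  for i, j in EDGES)
    entropy = -sum(p * math.log(p) + (1 - p) * math.log(1 - p) for p in q)
    return energy + entropy


def log_partition():
    """Exact log Z by enumeration; feasible only because the model is tiny."""
    total = 0.0
    for x in itertools.product([0, 1], repeat=N):
        score = sum(BIAS[i] * x[i] for i in range(N))
        score += sum(COUPLING * (x[i] == x[j]) for i, j in EDGES)
        total += math.exp(score)
    return math.log(total)


if __name__ == "__main__":
    q = mean_field()
    print("mean field marginals:", q)
    print("lower bound:", elbo(q), "log Z:", log_partition())
```

The gap between the bound and log Z equals the KL divergence from q to the true distribution, so a smaller gap indicates a better proxy for answering queries.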

Among approximate inference algorithms, it is worthwhile to explicitly point out several MAP inference algorithms. For example, Boykov et al. [25] use graph cuts to find minimal energy assignments for a class of MRFs which are frequently found in computer vision problems. The MAP problem for multinomial random variables has an equivalent formulation as an integer linear program over the marginal polytope M(G) (see Equation (2.4)). Of course, this program has the same complexity as the original problem and is hence intractable. The relaxation of this problem no longer requires the indicator variables to be integer. Although the relaxation is only an approximation, it comes with the benefit that a solution is guaranteed to be optimal for the original problem if it is integral. Although the linear program can be solved in polynomial time, it is often impossible to use standard solvers due to the huge number of constraints necessary to specify the problem. Therefore, several approaches have been introduced that make use of the specific structure of the linear program for MAP.

One example of such an approach is Tree-Reweighted BP (TRWBP) [239], which is a message passing algorithm akin to max-product BP. However, the messages are slightly altered and extended by edge-appearance probabilities. If all of these probabilities are one, then TRWBP reduces to standard BP. The intuition behind TRWBP is to reformulate the original MAP problem as a convex combination of tree-structured problems. Each subproblem is represented as a spanning tree, and the appearance probability of each edge is based on the distribution over the trees. We have already mentioned max-product BP and its correctness for tree-structured problems above. Therefore, TRWBP can be seen as a combination over a tractable subclass of problems. TRWBP is especially interesting because it finds the same solutions as a linear program relaxation and hence connects a well-known message passing algorithm in the spirit of BP with linear programs.

A comparison of TRWBP and CPLEX, a well-known commercial linear program solver, can be found in [251]. Yanover et al. show that TRWBP is not only faster than CPLEX for many examples, but also capable of solving large instances which CPLEX cannot handle. There is also a sum-product variant of TRWBP presented by Wainwright et al. [238]. Lastly, Yarkony et al. [252] present a TRWBP variant that is based on covering trees (CT). By using covering trees instead of spanning trees, the number of parameters in TRWBP is reduced while the same bounds on the MAP are achieved. However, it is necessary to introduce copy-nodes which have to be aligned by means of optimization to obtain a consistent solution.

As described, the CT-based approach is capable of reducing the number of required parameters compared to the original TRWBP by using a single tree instead of a distribution over spanning trees. With Max Product Linear Programming (MPLP), Globerson and Jaakkola [70] present an approximate MAP inference algorithm that also relies on the linear program relaxation but is parameter-free and still guaranteed to converge. It is again a message passing algorithm and similar in structure to max-product BP. It can be seen as block coordinate descent in the dual of the linear program relaxation. In each iteration, block coordinate descent fixes all variables except for a subset and optimizes over this subset.

Approaches such as TRWBP and covering trees can be seen as two instances of a more general paradigm. The idea of decomposing the original MRF into tractable sub-structures can be found in various other inference and learning approaches as well, and is the fundamental idea in dual decomposition [212], also known as Lagrangian relaxation. Here, the original problem is decomposed and inference is run repeatedly on the sub-problems. The local solutions to each sub-problem are then combined into one global solution.

Instead of a decomposition of the original problem, we can also further constrain the feasible joint probability distributions in M(G). By doing so, we obtain approaches such as the likelihood maximization approach in [129]. We will return to this algorithm in Section 3.3 where we will explain it in detail and also describe the message updates precisely.

So far, we have roughly distinguished between marginal and MAP inference algorithms. However, there also exist approaches that combine the ideas of both inference types. One example is the Perturb-and-MAP random fields by Papandreou and Yuille [175]. In this approach, noise is injected into the factors and MAP inference is run afterwards. By doing this in multiple rounds, uncertainty is induced into the MAP solutions and the process approximates Gibbs sampling. Hence, the results can be used for parameter learning and probabilistic reasoning.

After this brief overview of inference algorithms, we will now summarize some of the ideas that have been proposed to combine logic and PGMs. These template languages describe PGMs compactly and also promote the concept of lifted inference, which will be explained afterwards.

In document Graphical models beyond standard settings: lifted decimation, labeling, and counting (Page 37-40)