1.5 Learning and inference
1.5.4 Variational inference
Variational methods pose a quantity that is hard to compute as the optimal value of some function, and then apply optimization algorithms to that function. For example, the solution of the linear system $Ax = b$ is exactly the minimizer of $\frac{1}{2} x^\top A x - \langle b, x \rangle$ if $A$ is positive definite. For unconstrained quadratics, algorithms such as conjugate gradient (Hestenes & Stiefel, 1952) can optimize it very efficiently. In the same spirit, Wainwright (2002) formulated the log partition function as the optimal value of a certain constrained optimization problem, whose optimizer is exactly the feature mean. This new framework allows direct application of a large body of optimization techniques, which can be further accelerated by utilizing the structure of the graph (e.g., Wainwright et al., 2003, 2005; Sontag & Jaakkola, 2007). Intuitively, the key idea is the Fenchel-Young equality
$$g(\theta) = \sup_{\mu} \left\{ \langle \theta, \mu \rangle - g^\star(\mu) \right\},$$
where $g^\star$ is the Fenchel dual of $g$. Now three questions arise: a) what is the domain of $g^\star$, b) how to compute $g^\star$, and c) how to carry out the optimization. We will answer the first two questions in the next part, and then survey some approximate algorithms for optimization.
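The linear-system example at the start of this section can be made concrete with a short sketch. The code below is a textbook conjugate gradient iteration written in plain Python; the matrix `A`, the vector `b`, and the helper names are illustrative assumptions and do not come from the cited works.

```python
# Minimal sketch: minimize f(x) = 0.5 * x^T A x - <b, x> by conjugate gradient.
# A, b, and all helper names are illustrative assumptions.

def matvec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def conjugate_gradient(A, b, tol=1e-12):
    """Minimize 0.5 x^T A x - <b, x> for a symmetric positive definite A."""
    n = len(b)
    x = [0.0] * n
    r = list(b)          # residual b - A x; equals the negative gradient at x = 0
    p = list(r)          # first search direction
    rs = dot(r, r)
    for _ in range(n):   # at most n steps in exact arithmetic
        if rs < tol:
            break
        Ap = matvec(A, p)
        alpha = rs / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x

A = [[4.0, 1.0], [1.0, 3.0]]   # symmetric positive definite
b = [1.0, 2.0]
x_star = conjugate_gradient(A, b)
residual = [bi - axi for bi, axi in zip(b, matvec(A, x_star))]
```

The minimizer `x_star` of the quadratic satisfies $Ax = b$, so `residual` should vanish up to rounding error.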
Marginal polytope and $g^\star$
Wainwright & Jordan (2008, Theorem 3.4) showed that the domain of $g^\star$ is related to the range of the expectation of $\phi(x)$ with respect to all distributions that are absolutely continuous with respect to $\nu$.
Definition 22 (Marginal polytope) Define the marginal polytope of $\phi$ as
$$\mathcal{M}_\phi := \left\{ \mu \in \mathbb{R}^d : \exists\, p(\cdot) \ \text{s.t.} \int \phi(x)\, p(x)\, \nu(dx) = \mu \right\}.$$
Note that the pdf $p$ in the definition is not required to be in the exponential family $\mathcal{P}_\phi$; however, adding this restriction is straightforward. Given $\theta \in \Theta$, we are interested in the expectation of $\phi(x)$ under $p(x; \theta)$, and formally we define a mapping $\Lambda_\phi : \Theta \to \mathcal{M}$ as
$$\Lambda_\phi(\theta) := \mathbb{E}_\theta[\phi(x)] = \int \phi(x)\, p(x; \theta)\, \nu(dx).$$
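We then have the range of $\Lambda_\phi$ on $\Theta$, i.e., the space of mean parameters with respect to $\mathcal{P}_\phi$:
$$\Lambda_\phi(\Theta) := \left\{ \int \phi(x)\, p(x)\, \nu(dx) : p \in \mathcal{P}_\phi \right\}.$$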
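$\mathcal{M}_\phi$ is obviously convex, while $\Lambda_\phi(\Theta)$ is not necessarily convex. Hence we call $\mathcal{M}_\phi$ the marginal polytope. Also, neither $\mathcal{M}$ nor $\Lambda_\phi(\Theta)$ is guaranteed to be closed. When $\phi$ is clear from context, we omit the subscript $\phi$ in $\mathcal{M}_\phi$, $\Lambda_\phi(\theta)$, and $\Lambda_\phi(\Theta)$. $\Lambda(\Theta)$ and $\mathcal{M}$ are related as follows.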
Proposition 23 (Wainwright & Jordan, 2003, Theorem 1) The mean parameter mapping $\Lambda$ is onto the relative interior of $\mathcal{M}$, i.e., $\Lambda(\Theta) = \operatorname{ri} \mathcal{M}$.
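As a small illustration of Definition 22 and Proposition 23, consider a hypothetical model on two binary variables with $\phi(x) = (x_1, x_2, x_1 x_2)$ and $\nu$ the counting measure. For this assumed model, $\mathcal{M}$ is the convex hull of the four vectors $\phi(x)$, and it is described by four linear inequalities, one per joint state. The sketch below (all names are illustrative) computes $\Lambda(\theta)$ by enumeration and checks that it satisfies the strict inequalities, i.e., that it lies in $\operatorname{ri} \mathcal{M}$.

```python
import itertools
import math

def phi(x):
    """Sufficient statistics of a two-node binary model (an illustrative choice)."""
    x1, x2 = x
    return (x1, x2, x1 * x2)

STATES = list(itertools.product([0, 1], repeat=2))

def mean_map(theta):
    """Lambda(theta) = E_theta[phi(x)], computed by enumerating the four states."""
    weights = [math.exp(sum(t * f for t, f in zip(theta, phi(x)))) for x in STATES]
    Z = sum(weights)
    return tuple(
        sum(w * phi(x)[k] for w, x in zip(weights, STATES)) / Z for k in range(3)
    )

def in_relative_interior(mu):
    """Strict versions of the four inequalities describing M for this model.

    Each expression is the probability of one joint state:
    p(1,1), p(1,0), p(0,1), and p(0,0), in that order.
    """
    m1, m2, m12 = mu
    return m12 > 0 and m1 - m12 > 0 and m2 - m12 > 0 and 1 - m1 - m2 + m12 > 0

thetas = [(0.0, 0.0, 0.0), (2.0, -1.0, 0.5), (-3.0, 3.0, -4.0)]
means = [mean_map(t) for t in thetas]
```

Every finite $\theta$ assigns positive probability to all four states, which is why the mean never reaches the boundary of $\mathcal{M}$ in this example.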
The mean parameter $\mu = \mathbb{E}_{x \sim p}[\phi(x)]$ can be roughly considered as a signature of the density $p$. The following is an important theorem which provides an explicit form of the Fenchel dual of the log partition function, in terms of the entropy of the distribution.
Theorem 24 (Fenchel dual of $g(\theta)$ and entropy) (Wainwright & Jordan, 2003, Theorem 2) For any $\mu \in \operatorname{ri} \mathcal{M}$, let $\theta(\mu)$ denote an element in $\Lambda^{-1}(\mu)$. Denote as $H(p)$ the entropy of pdf $p$. The Fenchel-Legendre dual of $g(\theta)$ has the form
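$$g^\star(\mu) = \begin{cases} -H(p(x; \theta(\mu))) & \text{if } \mu \in \operatorname{ri} \mathcal{M}, \\ +\infty & \text{if } \mu \notin \operatorname{cl} \mathcal{M}. \end{cases}$$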
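For any boundary point $\mu \in \operatorname{bd} \mathcal{M} := \operatorname{cl} \mathcal{M} \setminus \operatorname{ri} \mathcal{M}$, we have $g^\star(\mu) = \lim_{n \to \infty} -H(p(x; \theta(\mu_n)))$ taken over a sequence $\{\mu_n\} \subseteq \operatorname{ri} \mathcal{M}$ converging to $\mu$.
[Figure 1.4: Simple break of injectivity. Labels: θ1, θ2, P, μ[P].]
[Figure 1.5: Imaginary break of injectivity. Labels: θ1, θ2, P, Q, μ[P] = μ[Q].]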
This theorem also implies that given the mean parameter $\mu$, the entropy of the distribution is independent of which natural parameter is used from $\Lambda^{-1}(\mu)$. Considering that both $\theta$ and $\mu$ can serve as a signature of the distribution, it is natural to investigate their relationship, which turns out to hinge on the minimality of sufficient statistics.
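Theorem 24 and the Fenchel-Young equality can be checked numerically in the simplest case. The sketch below assumes a Bernoulli variable with $\phi(x) = x$, for which $g(\theta) = \log(1 + e^\theta)$ and $\operatorname{ri} \mathcal{M} = (0, 1)$; the function names and the grid search are illustrative choices, not part of the cited results.

```python
import math

def g(theta):
    """Log partition function of a Bernoulli with phi(x) = x."""
    return math.log1p(math.exp(theta))

def neg_entropy(mu):
    """-H(p) for a Bernoulli with mean mu in (0, 1)."""
    return mu * math.log(mu) + (1 - mu) * math.log(1 - mu)

def g_star(mu):
    """Fenchel dual sup_theta {theta * mu - g(theta)}.

    The supremum is attained where g'(theta) = mu, i.e., at theta = logit(mu).
    """
    theta = math.log(mu / (1 - mu))
    return theta * mu - g(theta)

def g_via_dual(theta, grid_size=10001):
    """Approximate sup over mu in (0, 1) of {theta * mu - g_star(mu)} on a grid."""
    mus = [(k + 0.5) / grid_size for k in range(grid_size)]
    return max(theta * mu - neg_entropy(mu) for mu in mus)

theta = 0.7
mu = 1 / (1 + math.exp(-theta))                    # mean parameter Lambda(theta)
gap_dual = abs(g_star(mu) - neg_entropy(mu))       # g_star equals negative entropy
gap_primal = abs(g(theta) - g_via_dual(theta))     # g is recovered from g_star
```

Both gaps should be near zero: `gap_dual` up to rounding error, and `gap_primal` up to the resolution of the grid.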
Theorem 25 (Injectivity of mean mapping) (Wainwright & Jordan, 2008, Proposition 3.2) The mean map $\Lambda$ is one-to-one if, and only if, $\phi(x)$ is minimal.
Since $\theta$ is mapped to the marginal polytope $\mathcal{M}$ via the pdf $p(x; \theta)$, injectivity can break in two different ways: a) two different natural parameters giving the same distribution, see Figure 1.4; and b) different distributions in the exponential family giving the same mean, see Figure 1.5. The minimality assumption seems to preclude only the first case. Fortunately, it turns out that this second map $\mathcal{P}_\phi \to \mathcal{M}$ is injective irrespective of whether the sufficient statistics are minimal.
Theorem 26 Using the same notation as in Theorem 25, the mapping from distribution $p \in \mathcal{P}_\phi$ to the mean $\mathbb{E}_{x \sim p}[\phi(x)]$ is injective regardless of whether $\phi(x)$ is minimal.
Proof The proof is based on the maximum entropy interpretation of exponential families. Suppose two pdfs $p, q \in \mathcal{P}_\phi$ have the same mean $\mu$. Let the pdf $p^*$ (not necessarily in $\mathcal{P}_\phi$) be the optimal solution of the following optimization problem:
$$\max_{p} \; H(p), \quad \text{s.t.} \quad \mathbb{E}_{x \sim p}[\phi(x)] = \mu. \qquad (1.22)$$
Note the optimization is not restricted to $\mathcal{P}_\phi$, and the feasible region must be nonempty since $p$ and $q$ satisfy the constraint. Since entropy is a strictly concave functional and the linear constraints form a convex set, the optimal solution $p^*$ is unique and is well known to be in $\mathcal{P}_\phi$. As the entropy of exponential family distributions can be fully determined by its mean (Theorem 24), $p$ and $q$ must have the same entropy as $p^*$. Hence $p$ and $q$ are also optimal for (1.22), and the uniqueness of the optimal solution implies $p = q = p^*$.
The significance of this theorem is that the mean of sufficient statistics uniquely identifies the distribution in the exponential family, and furthermore if the sufficient statistics are minimal, the natural parameter can also be uniquely identified.
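The distinction between Theorems 25 and 26 can be seen in a toy case. The sketch below assumes a binary variable with the overcomplete (non-minimal) statistics $\phi(x) = (\mathbb{1}[x = 0], \mathbb{1}[x = 1])$; all names are illustrative. Shifting $\theta$ along the direction $(1, 1)$ changes the natural parameter but neither the distribution nor its mean, whereas the log odds, which serve as the natural parameter of the minimal representation $\phi(x) = x$, are unchanged by the shift.

```python
import math

PHIS = [(1, 0), (0, 1)]   # phi(0) and phi(1) for the overcomplete representation

def distribution(theta):
    """p(x; theta) on x in {0, 1} with phi(x) = (1[x=0], 1[x=1])."""
    weights = [math.exp(theta[0]), math.exp(theta[1])]
    Z = sum(weights)
    return [w / Z for w in weights]

def mean(theta):
    """E_theta[phi(x)], computed by summing over the two states."""
    p = distribution(theta)
    return [sum(p[x] * PHIS[x][k] for x in range(2)) for k in range(2)]

def log_odds(theta):
    """Natural parameter of the minimal representation phi(x) = x."""
    return theta[1] - theta[0]

theta_a = (0.3, -1.2)
theta_b = (theta_a[0] + 5.0, theta_a[1] + 5.0)   # shift along the direction (1, 1)

same_dist = all(
    abs(a - b) < 1e-12 for a, b in zip(distribution(theta_a), distribution(theta_b))
)
same_mean = all(abs(a - b) < 1e-12 for a, b in zip(mean(theta_a), mean(theta_b)))
```

Here `theta_a` and `theta_b` differ, yet `same_dist` and `same_mean` both hold, which corresponds to the situation of Figure 1.4 rather than that of Figure 1.5.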
Optimization techniques for variational inference
By Theorem 24, $g^\star(\mu)$ is just the negative entropy of the distribution corresponding to $\mu$. However, the optimization problem $g(\theta) = \sup_{\mu \in \mathcal{M}} \langle \theta, \mu \rangle - g^\star(\mu)$ is hard for two reasons: a) the constraint set $\mathcal{M}$ is extremely difficult to characterize explicitly; b) the negative entropy $g^\star$ is defined indirectly, hence it lacks an explicit form in $\mu$. Therefore, one resorts to outer or inner bounds of $\mathcal{M}$ and upper or lower bounds of $g^\star$. This leads to various algorithms (Wainwright & Jordan, 2008), such as
• Naive mean field. It only considers a subset (inner approximation) of $\mathcal{M}$ where the mean parameter of each edge is fixed to be the product of the means of its two end points. This essentially assumes that all the nodes are independent, and yields a lower bound on $g(\theta)$. In this case, the entropy factorizes and becomes easy to compute.
• Tree-reweighted sum-product. By noticing that the entropy of trees can be computed efficiently, Wainwright et al. (2005) studied the restriction of any mean parameter $\mu$ to a spanning tree $T$: $\mu(T)$. Since this restriction removes those constraints corresponding to ignored edges, the entropy of $\mu(T)$ is higher than that of $\mu$, hence the restriction leads to a concave upper bound of $\langle \theta, \mu \rangle - g^\star(\mu)$. Moreover, as a convex combination of upper bounds is still an upper bound, it can be tightened by further optimizing over the convex combination, e.g., over the distribution on spanning trees (the spanning tree polytope).
• Log-determinant relaxation. Observing that $\mathcal{M}$ can be characterized by constraining the moment matrices to be positive semi-definite to any order, Wainwright & Jordan (2006) proposed a relaxation based on Gaussian approximation.
• Cutting plane. Sontag & Jaakkola (2007) proposed a new class of outer bounds on the marginal polytope, by drawing its equivalence with the cut polytope (Barahona & Mahjoub, 1986). Different from most previous methods, which fix the outer bound a priori, Sontag & Jaakkola (2007) progressively tighten the outer bound according to the current infeasible solution. This is done by efficiently finding a violated constraint via a series of projections onto the cut polytope.