
1.5 Learning and inference

1.5.4 Variational inference

Variational methods pose a quantity that is hard to compute as the optimal value of some function, and then apply optimization algorithms to it. For example, the solution of the linear system $Ax = b$ is exactly the minimizer of $\frac{1}{2} x^\top A x - \langle b, x \rangle$ if $A$ is positive definite. For unconstrained quadratics, algorithms such as conjugate gradient (Hestenes & Stiefel, 1952) can optimize it very efficiently. In the same spirit, Wainwright (2002) formulated the log partition function as the optimal value of a certain function under some constraints, and its optimizer is exactly the feature mean. This new framework allows direct application of a large body of optimization techniques, which can be further accelerated by utilizing the structure of the graph (e.g., Wainwright et al., 2003, 2005; Sontag & Jaakkola, 2007). Intuitively, the key idea is the Fenchel-Young equality

$$g(\theta) = \sup_{\mu} \left\{ \langle \theta, \mu \rangle - g^\star(\mu) \right\},$$

where $g^\star$ is the Fenchel dual of $g$. Now three questions arise: a) what is the domain of $g^\star$, b) how to compute $g^\star$, and c) how to carry out the optimization. We will answer the first two questions in the next part, and then survey some approximate algorithms for optimization.
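To make the opening example concrete, the following is a minimal sketch of conjugate gradient applied to the quadratic $\frac{1}{2} x^\top A x - \langle b, x \rangle$. The matrix `A`, the vector `b`, and the helper names (`dot`, `matvec`, `quadratic`, `conjugate_gradient`) are illustrative assumptions, not taken from the text; the code only assumes that `A` is symmetric positive definite.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def matvec(A, x):
    return [dot(row, x) for row in A]

def quadratic(A, b, x):
    """Objective 0.5 * x^T A x - <b, x>."""
    return 0.5 * dot(x, matvec(A, x)) - dot(b, x)

def conjugate_gradient(A, b, tol=1e-24, max_iter=100):
    """Minimize the quadratic above; A must be symmetric positive definite."""
    x = [0.0] * len(b)
    r = list(b)            # residual b - A x (the negative gradient) at x = 0
    p = list(r)            # first search direction
    rs_old = dot(r, r)
    for _ in range(max_iter):
        if rs_old < tol:   # squared residual norm small enough
            break
        Ap = matvec(A, p)
        alpha = rs_old / dot(p, Ap)          # exact line search along p
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        beta = rs_new / rs_old
        p = [ri + beta * pi for ri, pi in zip(r, p)]  # A-conjugate direction
        rs_old = rs_new
    return x

# Hypothetical 2x2 positive definite system, chosen only for illustration.
A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
x_star = conjugate_gradient(A, b)
```

The minimizer `x_star` of the quadratic coincides with the solution of $Ax = b$, which is the sense in which the linear system has been recast as an optimization problem.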

Marginal polytope and $g^\star$

Wainwright & Jordan (2008, Theorem 3.4) showed that the domain of $g^\star$ is related to the range of the expectation of $\phi(x)$ with respect to all distributions that are absolutely continuous with respect to $\nu$.

Definition 22 (Marginal polytope) Define the marginal polytope of $\phi$ as

$$\mathcal{M}_\phi := \left\{ \mu \in \mathbb{R}^d : \exists\, p(\cdot), \text{ s.t. } \int \phi(x)\, p(x)\, \nu(dx) = \mu \right\}.$$

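Note that the pdf $p$ in the definition is not required to be in the exponential family $\mathcal{P}_\phi$; however, adding this restriction is straightforward. Given $\theta \in \Theta$, we are interested in the expectation of $\phi(x)$ under $p(x;\theta)$, and formally we define a mapping $\Lambda_\phi : \Theta \mapsto \mathcal{M}$ as

$$\Lambda_\phi(\theta) := E_\theta[\phi(x)] = \int \phi(x)\, p(x;\theta)\, \nu(dx).$$

The range of $\Lambda_\phi$ on $\Theta$, i.e., the space of mean parameters with respect to $\mathcal{P}_\phi$, is then

$$\Lambda_\phi(\Theta) := \left\{ \int \phi(x)\, p(x)\, \nu(dx) : p \in \mathcal{P}_\phi \right\}.$$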

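$\mathcal{M}_\phi$ is obviously convex, while $\Lambda_\phi(\Theta)$ is not necessarily convex. Hence we call $\mathcal{M}_\phi$ the marginal polytope. Also, neither $\mathcal{M}$ nor $\Lambda_\phi(\Theta)$ is guaranteed to be closed. When $\phi$ is clear from context, we omit the subscript $\phi$ in $\mathcal{M}_\phi$, $\Lambda_\phi(\theta)$, and $\Lambda_\phi(\Theta)$. $\Lambda(\Theta)$ and $\mathcal{M}$ are related as follows.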

Proposition 23 (Wainwright & Jordan, 2003, Theorem 1) The mean parameter mapping $\Lambda$ is onto the relative interior of $\mathcal{M}$, i.e., $\Lambda(\Theta) = \operatorname{ri} \mathcal{M}$.
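As an illustration of these definitions, consider a toy model that is not discussed in the text: two binary variables with sufficient statistics $\phi(x) = (x_1, x_2, x_1 x_2)$ and counting measure $\nu$. The sketch below tests membership in the marginal polytope and evaluates the mean map by enumeration. All function names are hypothetical, and the closed-form membership test relies on an assumption specific to this toy model: the four points $\phi(x)$ are affinely independent, so the pmf matching a given mean is unique.

```python
import itertools
import math

STATES = list(itertools.product([0, 1], repeat=2))

def phi(x):
    x1, x2 = x
    return (x1, x2, x1 * x2)

def recover_pmf(mu):
    """Unique signed weights on the four states whose phi-mean equals mu."""
    mu1, mu2, mu12 = mu
    return {
        (1, 1): mu12,
        (1, 0): mu1 - mu12,
        (0, 1): mu2 - mu12,
        (0, 0): 1.0 - mu1 - mu2 + mu12,
    }

def in_marginal_polytope(mu, tol=1e-12):
    """mu is in M iff the recovered weights form a valid pmf."""
    return all(v >= -tol for v in recover_pmf(mu).values())

def in_relative_interior(mu):
    """For this full-dimensional M, ri M means all weights strictly positive."""
    return all(v > 0.0 for v in recover_pmf(mu).values())

def mean_map(theta):
    """Lambda(theta) = E_theta[phi(x)], computed by enumerating all states."""
    weights = [math.exp(sum(t * f for t, f in zip(theta, phi(x))))
               for x in STATES]
    z = sum(weights)
    return tuple(sum(w * phi(x)[k] for w, x in zip(weights, STATES)) / z
                 for k in range(3))

# Arbitrary natural parameter, chosen only for illustration.
mu_from_theta = mean_map((2.0, -1.0, 0.5))
```

In this toy model every finite $\theta$ gives strictly positive probabilities, so `mean_map` lands in the relative interior, consistent with Proposition 23. A vertex such as $(1, 1, 1)$ belongs to $\mathcal{M}$ but is not attained by any finite $\theta$, and a point such as $(0.5, 0.5, 0.6)$ lies outside $\mathcal{M}$ because it would require $E[x_1 x_2] > E[x_1]$.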

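The mean parameter $\mu = E_{x \sim p}[\phi(x)]$ can be roughly considered as a signature of the density $p$. The following is an important theorem which provides an explicit form of the Fenchel dual of the log partition function, in terms of the entropy of the distribution.

Theorem 24 (Fenchel dual of $g(\theta)$ and entropy) (Wainwright & Jordan, 2003, Theorem 2) For any $\mu \in \operatorname{ri} \mathcal{M}$, let $\theta(\mu)$ denote an element in $\Lambda^{-1}(\mu)$. Denote as $H(p)$ the entropy of pdf $p$. The Fenchel-Legendre dual of $g(\theta)$ has the form

$$g^\star(\mu) = \begin{cases} -H(p(x;\theta(\mu))) & \text{if } \mu \in \operatorname{ri} \mathcal{M}, \\ +\infty & \text{if } \mu \notin \operatorname{cl} \mathcal{M}. \end{cases}$$

For any boundary point $\mu \in \operatorname{bd} \mathcal{M} := \operatorname{cl} \mathcal{M} \setminus \operatorname{ri} \mathcal{M}$, we have $g^\star(\mu) = \lim_{n \to \infty} -H(p(x;\theta(\mu_n)))$ taken over a sequence $\{\mu_n\} \subseteq \operatorname{ri} \mathcal{M}$ converging to $\mu$.

[Figure 1.4: Simple break of injectivity. Two natural parameters $\theta$ map to the same distribution $P$, with mean $\mu[P]$.]

[Figure 1.5: Imaginary break of injectivity. Two distributions $P$ and $Q$ have the same mean, $\mu[P] = \mu[Q]$.]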
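Theorem 24 can be checked numerically on the simplest case, a Bernoulli variable with $\phi(x) = x$, for which $g(\theta) = \log(1 + e^\theta)$ and $\mathcal{M} = [0, 1]$. This example is an illustration added here, not part of the cited results. The sketch computes $g^\star(\mu) = \sup_\theta \{\theta\mu - g(\theta)\}$ by a ternary search over a bounded interval (an assumption made for simplicity; any one-dimensional concave maximizer would do) and compares it with the negative entropy. All function names are hypothetical.

```python
import math

def log_partition(theta):
    """g(theta) = log(1 + exp(theta)), written in a numerically stable form."""
    return max(theta, 0.0) + math.log1p(math.exp(-abs(theta)))

def entropy(mu):
    """Entropy of a Bernoulli distribution with mean mu, with 0 log 0 = 0."""
    return -sum(q * math.log(q) for q in (mu, 1.0 - mu) if q > 0.0)

def maximize_concave(f, lo, hi, iters=200):
    """Ternary search for the maximum of a concave function on [lo, hi]."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if f(m1) < f(m2):
            lo = m1
        else:
            hi = m2
    x = 0.5 * (lo + hi)
    return x, f(x)

def dual_numeric(mu, radius=40.0):
    """sup over theta in [-radius, radius] of theta * mu - g(theta)."""
    return maximize_concave(lambda t: t * mu - log_partition(t),
                            -radius, radius)[1]

def primal_from_dual(theta):
    """sup over mu in [0, 1] of theta * mu - g_star(mu), with g_star = -H."""
    return maximize_concave(lambda m: theta * m + entropy(m), 0.0, 1.0)

# Interior point: the numeric dual should match the negative entropy.
gap_interior = abs(dual_numeric(0.3) - (-entropy(0.3)))

# Outside cl M: the supremum keeps growing as the search interval widens.
growth_outside = dual_numeric(1.5, radius=80.0) - dual_numeric(1.5, radius=40.0)

# Fenchel-Young equality: recover g(theta) and the maximizing mean.
mu_hat, g_hat = primal_from_dual(0.7)
```

Inside $\operatorname{ri} \mathcal{M} = (0, 1)$ the two quantities agree up to numerical error. For $\mu = 1.5$ the bounded search value grows with the interval, which is how $g^\star(\mu) = +\infty$ shows up in a finite computation. At the boundary, $-H(\mu_n) \to 0$ as $\mu_n \to 0$, matching the limit in the theorem.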

This theorem also implies that given the mean parameter $\mu$, the entropy of the distribution is independent of which natural parameter is used from $\Lambda^{-1}(\mu)$. Considering that both $\theta$ and $\mu$ can serve as a signature of the distribution, it is natural to investigate their relationship, which turns out to hinge on the minimality of the sufficient statistics.

Theorem 25 (Injectivity of mean mapping) (Wainwright & Jordan, 2008, Proposition 3.2) The mean map $\Lambda$ is one-to-one if, and only if, $\phi(x)$ is minimal.

Since $\theta$ is mapped to the marginal polytope $\mathcal{M}$ via the pdf $p(x;\theta)$, injectivity can break in two different ways: a) two different natural parameters giving the same distribution, see Figure 1.4; and b) different distributions in the exponential family giving the same mean, see Figure 1.5. The minimality assumption seems to preclude only the first case. Fortunately, it turns out that this second map $\mathcal{P}_\phi \mapsto \mathcal{M}$ is injective irrespective of whether the sufficient statistics are minimal.
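The first kind of break can be seen in a small example that is not from the text: a binary variable with the overcomplete (non-minimal) statistics $\phi(x) = (\mathbb{1}[x = 0], \mathbb{1}[x = 1])$. Shifting both natural parameters by the same constant leaves the distribution, and hence its mean and entropy, unchanged. The function names and parameter values below are hypothetical.

```python
import math

def pmf_overcomplete(theta):
    """p(x; theta) proportional to exp(theta[x]) for x in {0, 1}."""
    m = max(theta)                       # subtract the max for stability
    w = [math.exp(t - m) for t in theta]
    z = sum(w)
    return [wi / z for wi in w]

def mean_overcomplete(theta):
    """E[phi(x)] equals (p(0), p(1)) because phi is a one-hot indicator."""
    return pmf_overcomplete(theta)

def entropy(p):
    return -sum(q * math.log(q) for q in p if q > 0.0)

def minimal_parameter(p):
    """Natural parameter under the minimal statistic phi(x) = x."""
    return math.log(p[1] / p[0])

# Two different natural parameters that differ by a constant shift.
theta_a = (0.3, 1.1)
theta_b = (0.3 + 5.0, 1.1 + 5.0)

p_a = pmf_overcomplete(theta_a)
p_b = pmf_overcomplete(theta_b)
```

Here `theta_a` and `theta_b` differ, yet `p_a` and `p_b` coincide, so the mean map is not one-to-one in $\theta$, as Theorem 25 predicts for a non-minimal $\phi$. Under the minimal statistic $\phi(x) = x$, both collapse to the single natural parameter returned by `minimal_parameter`.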

Theorem 26 Using the same notation as in Theorem 25, the mapping from distribution $p \in \mathcal{P}_\phi$ to the mean $E_{x \sim p}[\phi(x)]$ is injective regardless of whether $\phi(x)$ is minimal.

Proof The proof is based on the maximum entropy interpretation of exponential families. Suppose two pdfs $p, q \in \mathcal{P}_\phi$ have the same mean $\mu$. Let the pdf $p^*$ (not necessarily in $\mathcal{P}_\phi$) be the optimal solution of the following optimization problem:

$$\underset{p}{\text{maximize}} \;\; H(p), \quad \text{s.t. } E_{x \sim p}[\phi(x)] = \mu. \tag{1.22}$$

Note the optimization is not restricted to $\mathcal{P}_\phi$, and the feasible region must be nonempty since $p$ and $q$ satisfy the constraint. Since entropy is a strictly concave functional and the linear constraints form a convex set, the optimal solution $p^*$ is unique and is well known to be in $\mathcal{P}_\phi$. As the entropy of an exponential family distribution is fully determined by its mean (Theorem 24), $p$ and $q$ must have the same entropy as $p^*$. Therefore $p$ and $q$ are also optimal for (1.22), and the uniqueness of the optimal solution implies $p = q = p^*$.

The significance of this theorem is that the mean of sufficient statistics uniquely identifies the distribution in the exponential family, and furthermore if the sufficient statistics are minimal, the natural parameter can also be uniquely identified.

Optimization techniques for variational inference

By Theorem 24, $g^\star(\mu)$ is just the negative entropy of the distribution corresponding to $\mu$. However, the optimization problem $g(\theta) = \sup_{\mu \in \mathcal{M}} \{\langle \theta, \mu \rangle - g^\star(\mu)\}$ is hard in two respects: a) the constraint set $\mathcal{M}$ is extremely difficult to characterize explicitly; b) the negative entropy $g^\star$ is defined indirectly, hence it lacks an explicit form in $\mu$. Therefore, one resorts to outer or inner bounds of $\mathcal{M}$ and upper or lower bounds of $g^\star$. This leads to various algorithms (Wainwright & Jordan, 2008), such as

• Naive mean field. It only considers a subset (inner approximation) of $\mathcal{M}$ where the mean parameter of each edge is fixed to be the product of the means of its two end points. This essentially assumes that all the nodes are independent, and yields a lower bound on $g(\theta)$. In this case, the entropy factorizes and becomes easy to compute.

• Tree-reweighted sum-product. By noticing that the entropy of trees can be computed efficiently, Wainwright et al. (2005) studied the restriction of any mean parameter $\mu$ to a spanning tree $T$: $\mu(T)$. Since this restriction removes those constraints corresponding to ignored edges, the entropy of $\mu(T)$ is higher than that of $\mu$, hence the restriction leads to a concave upper bound of $\langle \theta, \mu \rangle - g^\star(\mu)$. Moreover, as a convex combination of upper bounds is still an upper bound, it can be tightened by further optimizing over the convex combination, e.g., over the distribution over spanning trees (the so-called spanning tree polytope).

• Log-determinant relaxation. Observing that $\mathcal{M}$ can be characterized by constraining the moments to be positive semi-definite to any order, Wainwright & Jordan (2006) proposed a relaxation based on Gaussian approximation.

• Cutting plane. Sontag & Jaakkola (2007) proposed a new class of outer bounds on the marginal polytope, by drawing its equivalence with the cut polytope (Barahona & Mahjoub, 1986). Different from most previous methods, which fix the outer bound a priori, Sontag & Jaakkola (2007) progressively tighten the outer bound according to the current infeasible solution. This is done by efficiently finding a violated constraint via a series of projections onto the cut polytope.
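The naive mean field bound from the list above can be sketched on a small binary pairwise model, $p(x;\theta) \propto \exp(\sum_i \theta_i x_i + \sum_{(i,j)} \theta_{ij} x_i x_j)$ with $x \in \{0,1\}^n$. The model, its parameter values, and all function names are assumptions made for illustration. The coordinate ascent update used here is the standard fixed-point form for this model and is not taken from the text.

```python
import itertools
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def binary_entropy(m):
    return -sum(q * math.log(q) for q in (m, 1.0 - m) if q > 0.0)

def exact_log_partition(theta_node, theta_edge):
    """g(theta) by brute-force enumeration; feasible only for tiny n."""
    scores = []
    for x in itertools.product([0, 1], repeat=len(theta_node)):
        s = sum(t * xi for t, xi in zip(theta_node, x))
        s += sum(w * x[i] * x[j] for (i, j), w in theta_edge.items())
        scores.append(s)
    m = max(scores)
    return m + math.log(sum(math.exp(s - m) for s in scores))

def mean_field_objective(mu, theta_node, theta_edge):
    """<theta, mu> plus the factorized entropy, with mu_ij = mu_i * mu_j."""
    value = sum(t * m for t, m in zip(theta_node, mu))
    value += sum(w * mu[i] * mu[j] for (i, j), w in theta_edge.items())
    return value + sum(binary_entropy(m) for m in mu)

def naive_mean_field(theta_node, theta_edge, sweeps=100):
    """Coordinate ascent: each update maximizes the objective in one mu_i."""
    mu = [0.5] * len(theta_node)
    history = []
    for _ in range(sweeps):
        for i in range(len(mu)):
            field = theta_node[i]
            for (a, b), w in theta_edge.items():
                if a == i:
                    field += w * mu[b]
                elif b == i:
                    field += w * mu[a]
            mu[i] = sigmoid(field)
        history.append(mean_field_objective(mu, theta_node, theta_edge))
    return mu, history

# Hypothetical 4-node cycle, parameters chosen only for illustration.
theta_node = [0.5, -0.3, 0.2, -0.1]
theta_edge = {(0, 1): 1.0, (1, 2): -0.5, (2, 3): 0.8, (3, 0): 0.4}

log_z = exact_log_partition(theta_node, theta_edge)
mu_mf, bound_history = naive_mean_field(theta_node, theta_edge)
```

Each entry of `bound_history` is a lower bound on `log_z`, and the sequence does not decrease because every coordinate update is an exact maximization in one variable. The bound is generally not tight, and coordinate ascent may stop at a local optimum since the mean field objective is not concave in $\mu$ jointly.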

In document Graphical Models: Modeling, Optimization, and Hilbert Space Embedding (Page 43-47)