• No results found

Variational Mean Field for Graphical Models

N/A
N/A
Protected

Academic year: 2021

Share "Variational Mean Field for Graphical Models"

Copied!
37
0
0

Loading.... (view fulltext now)

Full text

(1)

Variational Mean Field

Variational Mean Field

for Graphical Models

for Graphical Models

CS/CNS/EE 155

Baback Moghaddam

Machine Learning Group

baback @ jpl.nasa.gov

(2)

Approximate Inference

Approximate Inference

• Consider general UGs

(

i.e.,

not tree-structured)

• All basic computations are intractable

(for large

G

)

- likelihoods & partition function - marginals & conditionals

(3)

Inference

Inference

Exact Exact VE JT BP Approximate Approximate Stochastic Stochastic Gibbs, M-H (MC) MC, SA Deterministic Deterministic Cluster Cluster ~MPMP LBP EP Variational Variational

Taxonomy of Inference Methods

(4)

Approximate Inference

Approximate Inference

• Stochastic

(Sampling)

- Metropolis-Hastings, Gibbs, (Markov Chain) Monte Carlo, etc

- Computationally expensive, but is “exact” (in the limit)

• Deterministic

(Optimization)

- Mean Field (MF), Loopy Belief Propagation (LBP)

- Variational Bayes (VB), Expectation Propagation (EP) - Computationally cheaper, but is not exact (gives bounds)

(5)

original G (Naïve) MF H0 structured MF Hs

c c c x x p( ) φ ( ) q(x) ∝ qA(xA) qB(xB)

• General idea

- approximate

p

(

x

)

by a simpler factored distribution

q

(

x

)

- minimize “distance”

D

(

p||q

)

- e.g., Kullback-Liebler

i i i x q x q( ) ( )

Mean Field : Overview

(6)

• Naïve MF has roots in

Statistical Mechanics

(1890s)

- physics of spin glasses (Ising), ferromagnetism, etc

- why is it called why is it called ““Mean FieldMean Field”” ??

with full factorization : E[xi xj] = E[xi] E[xj]

Mean Field : Overview

Mean Field : Overview

• Structured MF is more “modern”

Coupled HMM Structured MF approximation

(7)

KL Projection

KL Projection

D

D

(

(

Q||P

Q||P

)

)

• Infer hidden

h

given visible

v (clamp v

nodes with

δ

‘s)

Variational

Variational

: optimize KL globally

the right density form for Q “falls out”

KL is easier since we’re taking E[.] wrt simpler Q Q seeks mode with the largest mass (not height) so it will tend to underestimate the support of P

(8)

KL Projection

KL Projection

D

D

(

(

P||Q

P||Q

)

)

• Infer hidden

h

given visible

v (clamp v

nodes with

δ

‘s)

Expectation Propagation

Expectation Propagation

(EP) : optimize KL

locally

this KL is harder since we’re taking E[.] wrt P

no nice global solution for Q “falls out”

must sequentially tweak each qc(match moments)

Q covers all modes so it overestimates support

(9)

α

α

-

-

divergences

divergences

• The 2 basic KL divergences are

special cases

of

D

α

(

p||q

)

is non-negative and 0 iff

p = q

– when

α

- 1 we get KL(P||Q)

– when

α

+1 we get KL(Q||P)

– when

α

= 0 D0(P||Q) is proportional to HellingerHellinger’’s distance (metric)s

(10)

for more on

α

α

-

-

divergences

divergences

(11)

for specific examples of

See

Chapter 10

Chapter 10

Variational Single Gaussian Variational Linear Regression Variational Mixture of Gaussians Variational Logistic Regression Expectation Propagation (α = -1)

1

±

=

(12)

Hierarchy of Algorithms

Hierarchy of Algorithms

(based on

α

and structuring)

BP • fully factorized • KL(p||q) EP • exp family • KL(p||q) FBP • fully factorized • Dα(p||q) Power EP • exp family • Dα(p||q) MF • fully factorized • KL(q||p) TRW • fully factorized • Dα(p||q) α>1 Structured MF • exp family • KL(q||p) by Tom Minka

(13)

Variational MF

Variational MF

) (

1

)

(

1

)

(

x c c c

e

Z

x

Z

x

p

=

γ

=

ψ

=

c c c

x

x

)

log(

(

))

(

γ

ψ

dx

e

Z

=

log

(x)

log

ψ

dx

x

Q

e

x

Q

x

=

)

(

)

(

log

ψ ( )

E

log

[

e

(x)

/

Q

(

x

)]

Q ψ

)]

(

/

[

log

sup

E

e

(x)

Q

x

Q Q ψ

=

}

)]

(

[

)]

(

[

{

sup

Q

E

Q

x

+

H

Q

x

=

ψ

Jensen’s

(14)

Variational MF

Variational MF

}

)]

(

[

)]

(

[

{

sup

log

Z

Q

E

Q

ψ

x

+

H

Q

x

Equality is obtained for

Q

(

x

) =

P

(

x

)

(all

Q

admissible)

Using any other

Q

yields a lower bound on

log

Z

The slack in this bound is KL-divergence

D

(

Q||P

)

Goal

: restrict

Q

to a

tractable subclass

Q

optimize with

sup

Q

to tighten this bound

note we’re (also) maximizing entropy

H

[Q]

(15)

Variational MF

Variational MF

}

)]

(

[

)]

(

[

{

sup

log

Z

Q

E

Q

ψ

x

+

H

Q

x

Most common specialized family :

“log-linear models”

linear in parameters

θ

(natural parameters of EFs)

clique potentials

φ

(

x

)

(sufficient statistics of EFs)

Fertile ground for plowing

Convex Analysis

Convex Analysis

)

(

)

(

)

(

x

x

T

x

c c c c

φ

θ

φ

θ

ψ

=

=

(16)

The Old Testament The New Testament

Convex Analysis

(17)

Variational MF for EF

Variational MF for EF

}

)]

(

[

)]

(

[

{

sup

log

Z

Q

E

Q

ψ

x

+

H

Q

x

}

)]

(

[

)]

(

[

{

sup

log

Z

E

T

x

H

Q

x

Q Q

+

θ

φ

}

)]

(

[

)]

(

[

{

sup

log

Z

T

E

Q

x

H

Q

x

Q

+

θ

φ

}

)

(

*

{

sup

)

(

θ

µ

θ

µ

A

µ

A

T M

M

M = set of all moment parameters realizable under subclass Q

EF

(18)

Variational MF for EF

Variational MF for EF

So it looks like we are just optimizing a concave function (linear term + negative-entropy) over a convex set

Yet it is hard ... Why?

1. graph probability (being a measure) requires a very large

number of marginalization constraints for consistency (leads to a typically beastly marginal polytope MM in the discrete case)

e.g., a complete 7-node graph’s polytope has over 108 facets !

In fact, optimizing just the linear term alone can be hard

2. exact computation of entropy -A*(

µ

) is highly non-trivial

(19)

Gibbs Sampling for

Gibbs Sampling for

Ising

Ising

• Binary MRF G = (V,E) with pairwise clique potentials

1. pick a node s at random 2. sample u ~ Uniform(0,1) 3. update node s :

4. goto step 1

(20)

Naive MF for

Naive MF for

Ising

Ising

• use a variational mean parameter at each site 1. pick a node s at random

2. update its parameter :

3. goto step 1

• deterministic “loopy” message-passing • how well does it work? depends on

θ

(21)

Graphical Models as EF

Graphical Models as EF

GG(V,E) with nodes • sufficient stats :

• clique potentials likewise for

θ

st

• probability • log-partition

(22)

• For any mean parameter µ where θ(µ) is the corresponding natural parameter

• the log-partition function has this variational representation

• this supremum is achieved at the moment-matching value of µ

Variational Theorem for EF

Variational Theorem for EF

in relative interior of M

(23)

Main Idea: (convex) functions can be “supported” (lower-bounded) by a continuum of lines (hyperplanes) whose intercepts create a conjugate dual

of the original function (and vice versa)

conjugate dual of A

conjugate dual of A*

Note that A** = A (iff A is convex)

Legendre

(24)

Dual Map for EF

Dual Map for EF

Two equivalent parameterizations of the EF

Bijective mapping between Ω and the interior of M

Mapping is defined by the gradients of A and its dual A*

(25)

GG(V,E) = graph with discrete nodes

Then MM = convex hull of all

φ

(x)

• equivalent to intersecting half-spaces aT

µ

> b

• difficult to characterize for large G • hence difficult to optimize over

• interior of MM is 1-to-1 with Ω

Marginal Polytope

(26)

GG(V,E) = a single Bernoulli node

φ

(x) = x • density

• log-partition (of course we knew this)

• we know A* too, but let’s solve for it variationally

• differentiate stationary point

• rearrange to , substitute into A*

The Simplest Graph

The Simplest Graph

x

Note: we found both the mean parameter and the lower bound using the variational method

(27)

GG(V,E) = 2 connected Bernoulli nodes •

• moments •

• variational problem • solve (it’s still easy!)

The 2

The 2

nd

nd

Simplest Graph

Simplest Graph

x1 x2

(28)

3 nodes 16 constraints

# of constraints blows up real fast:

7 nodes 200,000,000+ constraints

hard to keep track of valid µ’s (i.e., the full shape and extent of M)

no more checking our results against closed-forms expressions that we already knew in advance!

unless G remains a tree, entropy A* will not decompose nicely, etc

The 3

The 3

rd

rd

Simplest Graph

Simplest Graph

x1 x2

(29)

• tractable subgraph H = (V,0)

• fully-factored distribution • moment space

• entropy is additive :

-• variational problem for A(

θ

)

• using coordinate ascent :

Variational MF for

(30)

M

M

trtr is a non-convex inner approximation

• optimizing over

M

M

trtr must then yield a lower bound

Variational MF for

Variational MF for

Ising

Ising

M

M

tr

(31)

• suppose we have a tree

G =

(

V,T

)

• useful factorization for trees

• entropy becomes

- singleton terms - pairwise terms

Factorization with Trees

Factorization with Trees

(32)

• pretend entropy factorizes like a tree (Bethe approximation) • define pseudo marginals

Variational MF for

Variational MF for

Loopy

Loopy

Graphs

Graphs

must impose these normalization

and marginalization constraints

• define local polytope

L

(

G

)

obeying these constraints • note that for any

G

with equality only for trees : M(G) = L(G)

)

(

)

(

G

L

G

(33)

L

(

G

)

is an outer polyhedral approximation

solving this Bethe Variational Problem we get the LBP eqs !

Variational MF for

Variational MF for

Loopy

Loopy

Graphs

Graphs

so fixed points of LBP are the stationary points of the BVP

this not only illuminates what was originally an educated “hack” (LBP)

(34)

see ICML

(35)

• SMF can also be cast in terms of “Free Energy”

etc

• Tightening the var bound = min KL divergence

• Other schemes (

e.g

, “Variational Bayes”) = SMF

- with additional conditioning (hidden, visible, parameter)

• Solving variational problem gives both

µ

and

A

(

θ

)

• Helps to see problems through lens of Var Analysis

Summary

(36)

Matrix of Inference Methods

EP, variational EM, VB,

NBP, Gibbs

EP, EM, VB,

NBP, Gibbs

EKF, UKF, moment matching (ADF)

Particle filter

Other

Loopy BP

Gibbs

Jtree = sparse linear algebra

BP = Kalman filter

Gaussian

Loopy BP, mean field, structured variational, EP, graph-cuts Gibbs VarElim, Jtree, recursive conditioning BP = forwards Boyen-Koller (ADF), beam search Discrete High treewidth Low treewidth Chain (online)

Exact Deterministic approximation Stochastic approximation

BP = Belief Propagation, EP = Expectation Propagation, ADF = Assumed Density Filtering, EKF = Extended Kalman Filter, UKF = unscented Kalman filter, VarElim = Variable Elimination, Jtree= Junction Tree, EM = Expectation Maximization, VB = Variational Bayes, NBP = Non-parametric BP

(37)

T H

T H

E

E

E N

References

Related documents

Structural unity among viral origin binding proteins: crystal structure of the nuclease domain of adeno-associated virus Rep. The nuclease domain of adeno-associated virus

• Use of vaccines against (Spn) and type b (Hib), the two most common bacterial causes of childhood pneumonia, and against rotavirus, the most common cause

This qualitative study was carried out to identify the factors contributing to training effectiveness from the employee’s perspective among operational employees

Interestingly, however, by comparing the occupational skills of Mexicans versus other immigrant groups, we find stronger evidence that legal status facilitated by IRCA legislation

Today, however, improved two- way communication across this interface is now clearly identified as a likely way to improve performance at this interface.(21) However, as in the case

To the author’s best knowledge no studies has been reported yet on non-aligned bioconvective stagnation point flow of a nanofluid comprising gyrotactic microorganisms across

The HDM-4 model is applied to alternative scenarios to establish whether savings in road user costs and road maintenance and rehabilitation expenditure, thanks to axle-load control

We demonstrated higher risk of dose-limiting toxicity and overall mortality for metastatic renal cell cancer patients with low versus high skeletal muscle index or skeletal