Variational Bayes - Extensions of non-negative matrix factorization and their application to th

Variational techniques are used as approximation methods in a variety of fields such as statistics [Rus76], statistical mechanics and quantum statistics [Fey72], [Par88], quantum mechanics [Mit94], or finite element analysis [Bat96]. The strategy is always similar: a complex problem is converted into a simpler one by decoupling the degrees of freedom in the original problem. This decoupling is achieved by an expansion which includes additional adjustable parameters.

A famous example of variational techniques in physics is the class of mean field approximations [Mac03], in which, e.g., the interactions of many spins are replaced by an averaged quantity, the mean field, to make the computations tractable.

6.2.1 A lower bound for the log evidence

In general, variational approximations are deterministic procedures which provide bounds for probabilities of interest.

One possible bound is induced by Jensen's inequality [Jen06]. Let X denote the observed variables, Y the latent variables, and Θ the parameters of a fixed model m, which will be implicitly assumed in the following. The log evidence can be bounded from below:

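\begin{align}
\ln P(X) &= \ln \int P(X, Y, \Theta)\, dY\, d\Theta \tag{6.16} \\
&= \ln \int Q(Y, \Theta)\, \frac{P(X, Y, \Theta)}{Q(Y, \Theta)}\, dY\, d\Theta \tag{6.17} \\
&\geq \int Q(Y, \Theta) \ln \frac{P(X, Y, \Theta)}{Q(Y, \Theta)}\, dY\, d\Theta \tag{6.18} \\
&= \int Q(Y, \Theta) \ln \frac{P(Y, \Theta|X)\, P(X)}{Q(Y, \Theta)}\, dY\, d\Theta \tag{6.19} \\
&= \int Q(Y, \Theta) \ln \frac{P(Y, \Theta|X)}{Q(Y, \Theta)}\, dY\, d\Theta + \int Q(Y, \Theta) \ln P(X)\, dY\, d\Theta \tag{6.20} \\
&= -\mathrm{KL}\big(Q(Y, \Theta)\,\|\,P(Y, \Theta|X)\big) + \ln P(X) \tag{6.21}
\end{align}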

The first expression in the last line is the negative Kullback-Leibler divergence between the variational distribution Q(Y, Θ) and the posterior distribution P(Y, Θ|X).

According to Gibbs' inequality, the KL divergence between two probability distributions Q and P is always non-negative and is zero only if Q = P. This is a well-known result from statistical physics and information theory [CT91], [Mac03]. Hence, maximizing the lower bound in eq. (6.18) w.r.t. the variational distribution Q(Y, Θ), which is equivalent to minimizing the KL divergence in eq. (6.21), leads to

Q(Y, Θ) = P (Y, Θ|X) (6.22)

that is, the optimal variational distribution equals the posterior distribution of the latent variables and parameters.
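The decomposition in eqs. (6.18) to (6.22) can be checked numerically. The following is a minimal sketch, assuming a hypothetical discrete model in which Y takes two values and Θ three, so that the integrals become sums. The table `joint` and all function names are illustrative assumptions and do not stem from the cited literature.

```python
import math

# Hypothetical joint P(X = x_obs, Y = y, Theta = t) for one fixed observation,
# tabulated over 2 latent states (rows) and 3 parameter values (columns).
joint = [[0.10, 0.05, 0.02],
         [0.03, 0.08, 0.12]]
cells = [(i, j) for i in range(2) for j in range(3)]

evidence = sum(joint[i][j] for i, j in cells)                 # P(X)
log_evidence = math.log(evidence)                             # ln P(X)
posterior = [[p / evidence for p in row] for row in joint]    # P(Y, Theta | X)


def lower_bound(q):
    """Right-hand side of eq. (6.18): sum of Q * ln(P(X, Y, Theta) / Q)."""
    return sum(q[i][j] * math.log(joint[i][j] / q[i][j])
               for i, j in cells if q[i][j] > 0)


def kl_divergence(q, p):
    """KL(Q || P) for two tables with the same support."""
    return sum(q[i][j] * math.log(q[i][j] / p[i][j])
               for i, j in cells if q[i][j] > 0)


# An arbitrary variational distribution: uniform over all six cells.
q_uniform = [[1.0 / 6.0] * 3 for _ in range(2)]
bound_uniform = lower_bound(q_uniform)
gap_uniform = kl_divergence(q_uniform, posterior)

# The optimal choice of eq. (6.22): Q equals the posterior.
bound_posterior = lower_bound(posterior)

print("ln P(X)            :", log_evidence)
print("bound + KL, uniform:", bound_uniform + gap_uniform)
print("bound, Q=posterior :", bound_posterior)
```

In this toy setting the bound plus the KL divergence reproduces ln P(X) for any Q, and the gap vanishes only when Q is the posterior.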

Usually, the computation of the true posterior P (Y, Θ|X) is intractable.

Hence, simpler forms of the approximating distributions are chosen which render the problem tractable. Hinton and van Camp [HvC93] and Hinton and Zemel [HZ94] use a separable Gaussian form for Q in a similar task.

In contrast, the so-called free-form approach [Att00], [GB00b] assumes the following factorized approximation

Q(Y, Θ) ≈ Q(Y)Q(Θ) (6.23)

which yields the lower bound

\begin{equation}
\ln P(X) \geq \int Q(Y)\, Q(\Theta) \ln \frac{P(X, Y, \Theta)}{Q(Y)\, Q(\Theta)}\, dY\, d\Theta =: F_Q \tag{6.24}
\end{equation}

In general, this factorized approximation of the true posterior will not reach equality, so the bound remains strictly below the log evidence. The strategy of variational Bayes is thus to make this lower bound as tight as possible.

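It is interesting to rewrite the bound in eq. (6.24) as

\begin{align}
F_Q &= \int Q(Y)\, Q(\Theta) \ln \frac{P(X, Y|\Theta)\, P(\Theta)}{Q(Y)\, Q(\Theta)}\, dY\, d\Theta \tag{6.25} \\
&= \int Q(Y)\, Q(\Theta) \ln \frac{P(X, Y|\Theta)}{Q(Y)}\, dY\, d\Theta + \int Q(\Theta) \ln \frac{P(\Theta)}{Q(\Theta)}\, d\Theta \tag{6.26} \\
&= \left\langle \ln \frac{P(X, Y|\Theta)}{Q(Y)} \right\rangle_{Q(Y)\, Q(\Theta)} - \mathrm{KL}\big(Q(\Theta)\,\|\,P(\Theta)\big) \tag{6.27}
\end{align}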

where the first term corresponds to the averaged log likelihood and the second term is the KL divergence between the approximate posterior Q(Θ) and the prior P(Θ) of the parameters. With an increasing number of parameters |Θ|, this KL divergence grows and F_Q decreases. Hence, a large number of parameters is automatically penalized by the variational Bayes framework. This was pointed out by Attias [Att00], who also showed that the popular Bayesian information criterion (BIC) for model order selection [Sch78] is a limiting case of the variational Bayes framework. In the large sample limit N → ∞, the parameter posterior is sharply peaked about the most probable value Θ = Θ*, and the KL term in F_Q reduces to (|Θ*|/2) ln N.
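The growth of the KL term with ln N can be illustrated in closed form. The sketch below assumes, purely for illustration, an isotropic Gaussian approximate posterior whose variance shrinks as 1/N and an isotropic zero-mean Gaussian prior. The function names and default values are hypothetical, and the Gaussian setting is an assumption that is not taken from [Att00].

```python
import math


def kl_posterior_vs_prior(d, n, mean=0.5, sigma2=1.0, prior_var=1.0):
    """KL( N(mean * 1, (sigma2 / n) * I_d) || N(0, prior_var * I_d) ).

    Assumed setting: d parameters, n samples, posterior variance sigma2 / n.
    """
    v = sigma2 / n  # posterior variance per dimension, shrinking with n
    per_dim = 0.5 * (v / prior_var + mean ** 2 / prior_var - 1.0
                     + math.log(prior_var / v))
    return d * per_dim


def bic_penalty(d, n):
    """The term (d / 2) * ln n that appears in the BIC."""
    return 0.5 * d * math.log(n)


d = 5
for n in (10 ** 2, 10 ** 4, 10 ** 6, 10 ** 8):
    kl = kl_posterior_vs_prior(d, n)
    print(n, kl, bic_penalty(d, n), kl - bic_penalty(d, n))
```

Under these assumptions the difference between the KL term and (d/2) ln N settles to a constant that does not depend on N, so the ln N term dominates for large samples, and the penalty scales linearly with the number of parameters d.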

6.2.2 The VBEM algorithm

The Variational Bayesian EM algorithm [Att00], [GB01], [BG04] iteratively maximizes F_Q in eq. (6.24) w.r.t. the free distributions Q(Y) and Q(Θ). Each update is done while keeping the other quantities fixed. Setting

\begin{equation}
\frac{\partial}{\partial Q(\Theta)} \left[ F_Q + \lambda \left( \int Q(\Theta)\, d\Theta - 1 \right) \right] = 0 \tag{6.28}
\end{equation}

where λ is a Lagrange multiplier ensuring the normalization of the density Q(Θ), immediately leads to

\begin{equation}
Q(\Theta) \propto P(\Theta) \exp \left[ \int \ln P(X, Y|\Theta)\, Q(Y)\, dY \right] \tag{6.29}
\end{equation}

Similarly, one can derive

\begin{equation}
Q(Y) \propto \exp \left[ \int \ln P(X, Y|\Theta)\, Q(\Theta)\, d\Theta \right] \tag{6.30}
\end{equation}

Equations (6.29) and (6.30) constitute the Variational Bayesian EM algorithm. Beal and Ghahramani [BG04] show that the Variational Bayesian EM algorithm reduces to the ordinary EM algorithm if the parameter density is restricted to be a Dirac delta function Q(Θ) = δ(Θ − Θ*).
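The alternating updates can be sketched on a toy problem. The example below assumes a hypothetical discrete model with two latent states Y and a parameter Θ restricted to a grid of three values, so that both update equations become finite sums. The tables `prior` and `lik` and all function names are illustrative assumptions.

```python
import math

# Hypothetical toy model: 2 latent states Y, 3 grid values for Theta.
prior = [0.5, 0.3, 0.2]               # P(Theta)
lik = [[0.20, 0.10, 0.05],            # P(X = x_obs, Y = y | Theta), one row per y
       [0.05, 0.30, 0.60]]
n_y, n_t = 2, 3


def normalize(log_weights):
    """Exponentiate and normalize, subtracting the maximum for stability."""
    top = max(log_weights)
    weights = [math.exp(v - top) for v in log_weights]
    total = sum(weights)
    return [w / total for w in weights]


def free_energy(q_y, q_t):
    """F_Q of eq. (6.24) for discrete Y and Theta."""
    return sum(q_y[i] * q_t[j]
               * math.log(lik[i][j] * prior[j] / (q_y[i] * q_t[j]))
               for i in range(n_y) for j in range(n_t))


q_y = [0.5, 0.5]
q_t = [1.0 / 3.0] * 3
history = [free_energy(q_y, q_t)]

for _ in range(50):
    # Update of Q(Theta): prior times exp of the Q(Y)-averaged log likelihood.
    q_t = normalize([math.log(prior[j])
                     + sum(q_y[i] * math.log(lik[i][j]) for i in range(n_y))
                     for j in range(n_t)])
    # Update of Q(Y): exp of the Q(Theta)-averaged log likelihood.
    q_y = normalize([sum(q_t[j] * math.log(lik[i][j]) for j in range(n_t))
                     for i in range(n_y)])
    history.append(free_energy(q_y, q_t))

log_evidence = math.log(sum(lik[i][j] * prior[j]
                            for i in range(n_y) for j in range(n_t)))
print("final F_Q:", history[-1], " ln P(X):", log_evidence)
```

Each sweep cannot decrease F_Q, and because the posterior of this toy model does not factorize, the final value stays below ln P(X).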

Attias [Att99], [Att00] was the first to describe the variational Bayes framework and showed that it is a generalization of the well-known EM algorithm [DLR77], which is the method of choice for maximum likelihood parameter estimation in statistics. Earlier, Neal and Hinton [NH98] had generalized the EM algorithm for maximum likelihood estimation to cases where a lower bound is iteratively maximized w.r.t. parameters and hidden variables.

Ghahramani and Beal [GB01] applied variational Bayes to the large class of conjugate-exponential models, which can be characterized by two conditions:

1. The complete-data likelihood is in the exponential family, i.e. can be written as

\begin{equation}
P(X, Y|\Theta) = g(\Theta)\, f(X, Y)\, e^{\Phi(\Theta)^T u(X, Y)} \tag{6.31}
\end{equation}

where Φ(Θ) is the vector of natural parameters, u and f are functions and g is a normalization constant.

2. The parameter prior is conjugate to the complete-data likelihood

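\begin{equation}
P(\Theta|\eta, \nu) = h(\eta, \nu)\, g(\Theta)^{\eta}\, e^{\Phi(\Theta)^T \nu} \tag{6.32}
\end{equation}

where η and ν are the hyperparameters of the prior and h is a normalization constant.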
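As a worked illustration of the two conditions, consider a single binary observation x with success probability θ. This Bernoulli case without latent variables is an assumed stand-in for the complete data (X, Y) and is not an example from [GB01].

```latex
% Condition 1: the Bernoulli likelihood in the form of eq. (6.31)
\begin{align*}
P(x|\theta) &= \theta^{x} (1 - \theta)^{1 - x}
             = (1 - \theta) \exp\left\{ x \ln \frac{\theta}{1 - \theta} \right\}, \\
g(\theta) &= 1 - \theta, \qquad f(x) = 1, \qquad
\Phi(\theta) = \ln \frac{\theta}{1 - \theta}, \qquad u(x) = x .
\end{align*}

% Condition 2: the conjugate prior in the form of eq. (6.32)
\begin{align*}
P(\theta|\eta, \nu) &= h(\eta, \nu)\, (1 - \theta)^{\eta}
                       \exp\left\{ \nu \ln \frac{\theta}{1 - \theta} \right\}
                     = h(\eta, \nu)\, \theta^{\nu} (1 - \theta)^{\eta - \nu},
\end{align*}
% which is a Beta density with shape parameters (\nu + 1, \eta - \nu + 1).

% Observing x multiplies the prior by g(\theta) e^{\Phi(\theta) x}, so that
\begin{align*}
P(\theta|x, \eta, \nu) \propto g(\theta)^{\eta + 1}\, e^{\Phi(\theta) (\nu + x)},
\qquad \text{i.e.} \quad \eta \to \eta + 1, \quad \nu \to \nu + x .
\end{align*}
```

In this example the posterior keeps the functional form of the prior, and η acts as a count of pseudo-observations while ν accumulates the sufficient statistic u(x).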


6.2.3 Applications of variational inference

Special attention has been paid to variational methods in the context of graphical models (see [JGJS98] for a tutorial). In densely connected graphical models there are often averaging phenomena which render nodes relatively insensitive to particular values of their neighbors. Variational methods take advantage of these averaging phenomena and can lead to simple approximation procedures.

Machine learning applications of variational Bayes include ensemble learning for neural networks [HvC93] and mixtures of experts [WM96].

The use of mixture models as approximating distributions is discussed in [JJ98], [BLJJ98] and [Att00]. Variational Bayes versions of popular data analysis techniques have been developed, such as Bayesian logistic regression [JJ97], Bayesian mixtures of factor analyzers [GB00b], variational principal component analysis [Bis99], ensemble learning for independent component analysis [Lap99], factor analysis [HK07], and Bayesian independent component analysis [WP07].

Statistical physics, continued

In statistical physics, variational free energy minimization [Mac03] is a method which approximates the (usually very complex) distribution P (x|β, J) given in eq. (6.6) by a simpler one Q(x; θ) that is parameterized by adjustable parameters θ. The quality of the approximation can be measured by the variational free energy

\begin{equation}
\beta \tilde{F}(\theta) = \sum_{x} Q(x; \theta) \ln \frac{Q(x; \theta)}{P(x|\beta, J)} - \ln Z(\beta, J) \tag{6.33}
\end{equation}

which is the sum of the Kullback-Leibler divergence or relative entropy D_KL(Q||P) between the two distributions Q and P and the true free energy of the system, defined as

βF := − ln Z(β, J) (6.34)

Thus, the variational free energy β ˜F (θ) is bounded below by the true free energy βF and the two quantities are equal if Q(x; θ) = P (x|β, J).

The optimization strategy is to vary the parameters θ so as to minimize β F̃(θ). The approximating distribution Q is then a simplified approximation to the true distribution P, and the value of β F̃(θ) is an upper bound on βF.
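The bound can be verified by brute force on a very small system. The sketch below assumes that P(x|β, J) has the Boltzmann form exp(−βE(x; J))/Z(β, J) with a spin energy E(x; J) = −(1/2) Σ J_mn x_m x_n, and that Q factorizes over the spins with the spin means as adjustable parameters θ. The energy function, the coupling matrix `J`, and the candidate parameter values are hypothetical choices made for illustration.

```python
import itertools
import math

beta = 0.7
# Hypothetical symmetric couplings between 3 spins (zero diagonal).
J = [[0.0, 1.0, -0.5],
     [1.0, 0.0, 0.8],
     [-0.5, 0.8, 0.0]]
n = 3
states = list(itertools.product((-1, 1), repeat=n))


def energy(x):
    """Assumed energy E(x; J) = -1/2 * sum_mn J_mn x_m x_n."""
    return -0.5 * sum(J[a][b] * x[a] * x[b]
                      for a in range(n) for b in range(n))


log_z = math.log(sum(math.exp(-beta * energy(x)) for x in states))
true_free_energy = -log_z  # beta * F as in eq. (6.34)


def q_prob(x, means):
    """Factorized Q(x; theta) with per-spin means theta_k in [-1, 1]."""
    prob = 1.0
    for m, s in zip(means, x):
        prob *= (1.0 + m * s) / 2.0
    return prob


def variational_free_energy(means):
    """beta * F~(theta) of eq. (6.33), by enumerating all 2^n states."""
    kl = 0.0
    for x in states:
        q = q_prob(x, means)
        if q > 0.0:
            p = math.exp(-beta * energy(x) - log_z)
            kl += q * math.log(q / p)
    return kl - log_z


candidates = [(0.0, 0.0, 0.0), (0.5, 0.5, 0.2), (-0.9, 0.3, 0.0),
              (0.6, 0.6, 0.4), (-0.6, -0.6, -0.4)]
values = [variational_free_energy(c) for c in candidates]
for c, v in zip(candidates, values):
    print(c, v, ">=", true_free_energy)
```

Enumeration is only feasible for a handful of spins. For larger systems the factorized form of Q is what keeps the evaluation of β F̃(θ) tractable.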

Chapter 7

Bayesian approaches to NMF

While chapter 2 introduced the technique of NMF and chapter 6 gave a general introduction to Bayesian learning theory, this chapter brings both approaches together. First, the statistical aspects of usual NMF are discussed in section 7.1. Then, section 7.2 briefly reviews existing literature on Bayesian approaches to NMF.
