2.2.2 Composite structure of the model
2.3 New formulations of GaP
2.3.1 GaP as a composite negative multinomial model
2.3.2 GaP as a composite multinomial model
2.4 Closed-form marginal likelihood
2.4.1 Analytical expression
2.4.2 Self-regularization
2.5 Optimization algorithms
2.5.1 Expectation-Maximization
2.5.2 Monte Carlo E-step
2.5.3 M-step
2.6 Experimental work
2.6.1 Comparison of the algorithms
2.6.2 Examples of the self-regularization phenomenon
2.7 Discussion
2.1 Introduction
A wide variety of probabilistic models are based on the following observation model:
v_{fn} \sim \mathrm{Poisson}([WH]_{fn}),    (2.1)
with W ≥ 0, H ≥ 0. Such models are called Poisson factorization models, or sometimes Poisson factor analysis. As discussed in Section 1.3, we may classify these models in two categories:
• Semi-Bayesian models. When independent Gamma priors are assumed on H, and W is treated as a deterministic variable, we retrieve the so-called Gamma-Poisson model of Canny (2004) and Buntine and Jakulin (2006);
• Bayesian models, which assume prior distributions on both W and H. Typical prior distributions include independent Gamma distributions, or column-wise Dirichlet distributions. These Bayesian models have found applications in image processing (Cemgil, 2009), as well as in recommender systems (Ma et al., 2011; Gopalan et al., 2015), where they have received particular attention. Several works have also been devoted to non-parametric approaches to alleviate the problem of setting the factorization rank K (Zhou et al., 2012; Gopalan et al., 2014; Zhou and Carin, 2015).
In this chapter, we focus on maximum marginal likelihood estimation in the Gamma-Poisson model. This problem was first addressed in Dikmen and Févotte (2012). In their experiments, they found this estimation process to be robust to over-specified values of K, an intriguing behavior that was left unexplained. We provide the following contributions:
• We provide a closed-form expression of the marginal likelihood. This expression is tedious to compute for large F and K, as it involves combinatorial operations, but is workable for problems in small dimension.
• We show that the proposed closed-form expression reveals a penalization term on the columns of W that explains the “self-regularization” effect observed in Dikmen and Févotte (2012).
• We compare three variants of the EM algorithm, and experimentally show that the one based on the marginalization of H has favorable properties.
The rest of the chapter is organized as follows. Section 2.2 introduces the Gamma-Poisson model. In Section 2.3, we propose two novel formulations of the model in which H has been marginalized out. This leads to a closed-form expression of the marginal likelihood, discussed in Section 2.4. Three optimization algorithms are presented in Section 2.5, and experimental work is conducted in Section 2.6. We conclude with a general discussion in Section 2.7.
Figure 2.1: Graphical representations of the Gamma-Poisson model. Observed variables are in blue, while latent variables are in white. Deterministic parameters are represented as black dots. The observation v_n is a vector of size F. From left to right: (a) The standard model. The latent variable h_n is of size K; (b) The augmented model with variables C. The latent variable c_{kn} is of size F, and h_{kn} is scalar.
2.2 The Gamma-Poisson model
2.2.1 First formulation
The Gamma-Poisson (GaP) model is a probabilistic matrix factorization model which was introduced in the field of text information retrieval (Canny, 2004; Buntine and Jakulin, 2006). In this field, a corpus of text documents is typically represented by an integer-valued matrix V of size F × N, where each column v_n represents a document as a so-called “bag of words”. Given a vocabulary of F words (or in practice semantic stems), the matrix entry v_{fn} is the number of occurrences of word f in the document n. GaP is a generative model described by a dictionary of “topics” or “patterns” W (a non-negative matrix of size F × K) and a non-negative “activation” or “score” matrix H (of size K × N), as follows:
h_{kn} \sim \mathrm{Gamma}(\alpha_k, \beta_k),    (2.2)
v_{fn} \mid h_n \sim \mathrm{Poisson}([WH]_{fn}),    (2.3)
where we use the shape and rate parametrization of the Gamma distribution, see its p.d.f. in Equation (0.8). The dictionary W is treated as a free deterministic variable.
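The generative process of Equations (2.2)-(2.3) can be illustrated with a minimal NumPy sketch. The function name `sample_gap`, the dimensions and the parameter values are assumptions chosen for illustration only; they do not come from the chapter. Note that NumPy parametrizes the Gamma distribution by shape and scale, so the rate β_k is passed as the scale 1/β_k.

```python
import numpy as np

def sample_gap(W, alpha, beta, N, rng):
    """Draw (H, V) from the GaP model of Equations (2.2)-(2.3).

    W     : (F, K) non-negative dictionary, treated as a fixed parameter
    alpha : (K,) Gamma shape parameters
    beta  : (K,) Gamma rate parameters
    """
    K = W.shape[1]
    # NumPy's sampler takes a scale, so a rate beta_k becomes scale 1 / beta_k.
    H = rng.gamma(shape=alpha[:, None], scale=1.0 / beta[:, None], size=(K, N))
    # Each entry v_fn is Poisson with rate [WH]_fn.
    V = rng.poisson(W @ H)
    return H, V

# Illustrative (hypothetical) sizes and parameter values.
rng = np.random.default_rng(0)
F, K, N = 6, 3, 5
W = rng.uniform(0.1, 2.0, size=(F, K))
alpha = np.full(K, 2.0)
beta = np.full(K, 1.0)

H, V = sample_gap(W, alpha, beta, N, rng)
print(V)  # an F x N matrix of non-negative integer counts
```

Each column of `V` plays the role of one bag-of-words document, and only `H` is random given `W`, `alpha` and `beta`.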
We consider maximum marginal likelihood estimation (MMLE), in which H is treated as a latent variable over which the joint likelihood is integrated. In other words, MMLE relies on the minimization of
L(W, \alpha, \beta) \overset{\mathrm{def}}{=} -\log p(V; W, \alpha, \beta)    (2.4)
= -\log \int_{H} p(V \mid H; W) \, p(H; \alpha, \beta) \, dH,    (2.5)
where α = {α_1, . . . , α_K} and β = {β_1, . . . , β_K}. We emphasize that the dictionary W is treated as a free deterministic variable.
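To make the integral in Equation (2.5) concrete, the sketch below evaluates it in the scalar case F = K = N = 1 in two ways. The first is a naive Monte Carlo average of the Poisson likelihood over draws of h from its Gamma prior; this is only a sanity check, not one of the estimation algorithms of this chapter. The second uses the standard fact that a Gamma mixture of Poisson distributions is negative binomial. All function names and numerical values are assumptions made for illustration.

```python
import math
import numpy as np

def poisson_logpmf(v, rate):
    """Log-pmf of a Poisson(rate) variable evaluated at the integer v."""
    return v * np.log(rate) - rate - math.lgamma(v + 1)

def mc_neg_log_marginal(v, w, alpha, beta, n_samples, rng):
    """Naive Monte Carlo estimate of -log p(v; w, alpha, beta), scalar case."""
    # Draw h from the prior Gamma(alpha, rate=beta), i.e. scale = 1 / beta.
    h = rng.gamma(shape=alpha, scale=1.0 / beta, size=n_samples)
    log_lik = poisson_logpmf(v, w * h)
    # Log-mean-exp for numerical stability.
    m = log_lik.max()
    return -(m + np.log(np.mean(np.exp(log_lik - m))))

def exact_neg_log_marginal(v, w, alpha, beta):
    """Exact value in the scalar case: v is negative binomial distributed."""
    log_p = (math.lgamma(v + alpha) - math.lgamma(alpha) - math.lgamma(v + 1)
             + alpha * math.log(beta / (beta + w))
             + v * math.log(w / (beta + w)))
    return -log_p

# Illustrative (hypothetical) values.
rng = np.random.default_rng(0)
v, w, alpha, beta = 3, 1.0, 2.0, 1.0
approx = mc_neg_log_marginal(v, w, alpha, beta, n_samples=200_000, rng=rng)
exact = exact_neg_log_marginal(v, w, alpha, beta)
print(approx, exact)  # the two values should be close
```

With these values the exact marginal probability is 1/8, so the exact objective equals log 8. The naive estimator degrades quickly as F, K and N grow, which is one reason to look for more structured approaches.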
2.2.2 Composite structure of the model
GaP can be augmented with auxiliary variables C, yielding a composite model, thanks to the superposition property of the Poisson distribution (Févotte et al., 2009):
h_{kn} \sim \mathrm{Gamma}(\alpha_k, \beta_k),    (2.6)
c_{fkn} \mid h_{kn} \sim \mathrm{Poisson}(w_{fk} h_{kn}),    (2.7)
v_{fn} = \sum_{k} c_{fkn}.    (2.8)
In the remainder, the vectors c_{kn} = [c_{1kn}, . . . , c_{Fkn}]^T of size F, whose sum over k equals v_n, will be referred to as components. C will denote the F × K × N tensor with coefficients c_{fkn}. Figure 2.1 displays graphical representations of the Gamma-Poisson model, both in its initial form (on the left) and augmented form (on the right).
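The augmented model of Equations (2.6)-(2.8) can be simulated in a few lines of NumPy. The sketch below is illustrative only: the dimensions and parameter values are assumptions, and the tensor layout (axis 0 for f, axis 1 for k, axis 2 for n) is a choice made here to match the F × K × N convention.

```python
import numpy as np

# Illustrative (hypothetical) sizes and parameter values.
rng = np.random.default_rng(0)
F, K, N = 4, 3, 5
W = rng.uniform(0.1, 2.0, size=(F, K))
alpha = np.full(K, 2.0)
beta = np.full(K, 1.0)

# Equation (2.6): h_kn ~ Gamma(alpha_k, rate=beta_k), i.e. scale = 1 / beta_k.
H = rng.gamma(shape=alpha[:, None], scale=1.0 / beta[:, None], size=(K, N))

# Equation (2.7): c_fkn | h_kn ~ Poisson(w_fk * h_kn), stored as an (F, K, N) tensor.
rates = W[:, :, None] * H[None, :, :]
C = rng.poisson(rates)

# Equation (2.8): v_fn is the sum of the components over k.
V = C.sum(axis=1)

# The component rates sum to [WH]_fn, which is why V follows Equation (2.3).
print(np.allclose(rates.sum(axis=1), W @ H))
```

Since a sum of independent Poisson variables is Poisson with the summed rate, `V` obtained this way has the same distribution given `H` as in the non-augmented model.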