2.4 Topic Models
2.4.1 Latent Dirichlet Allocation
Latent Dirichlet Allocation (LDA) [Pritchard et al. 2000; Blei et al. 2003]⁹ uses a similar model to PLSI, but applies Bayesian inference. Dirichlet priors (conjugate to multinomials) are used, which greatly simplifies the structure of the posterior and, thus, inference procedures. Gibbs sampling [Griffiths and Steyvers 2004] or variational methods [Blei et al. 2003] are typically used to estimate the posterior. More recently, an approach using statistical recovery has been developed that is orders of magnitude faster, making a relatively simple separability assumption [Arora et al. 2012]. These Bayesian methods produce models that generalise far better than previous maximum-likelihood approaches.
One way to envisage LDA is to imagine solving a puzzle. You start with many jars (documents) containing coloured marbles (words) and a collection of bags (topics). You need to distribute the marbles into the bags, trying to ensure that each bag contains mostly marbles of only a few colours. There is another restriction, however: you need also to make sure that each jar has most of its marbles in only a few bags. Usually this problem has no good solution — if you satisfy one requirement, the other doesn’t do very well. If, however, some of the documents (jars) cover the same semantic topic (which we represent by a small set of colours/word types), you can put all the words (marbles) characteristic of that literal topic into the one bag. Those documents (jars) then look “pretty good”. Similarly, if several semantic topics are referred to in a document, you can put most of that document’s words into bags representing those topics and do “pretty well”.
There is a tension here between the two requirements: a reluctance to have marbles of more than a few colours in a bag and a desire to put most of a document’s marbles into only a few bags. The balance between these is governed by two parameters of the LDA model, typically labelled α (fewer topics per document) and β (fewer words per topic). The third parameter for LDA is the number of topics.
LDA uses Bayesian inference: that is, it proposes a parametrised generative model for the texts and then seeks the most probable parameter set given the text under investigation¹⁰. In order to make this inference, a ‘prior’ distribution over possible parameter values is provided. These inferred parameters are typically referred to as the model’s latent variables or hidden variables, whereas the term parameters typically refers to some parametrisation of the prior distribution.
⁹ The same model was invented independently in the fields of population genetics [Pritchard et al. 2000] and text analysis [Blei et al. 2003]. Both papers have been highly influential, with 12369 and 9056 citations respectively (Google Scholar, Aug. 2014).
¹⁰ True Bayesian inference seeks the full probability distribution over model parameters, how-
Given data D, latent variables Θ and prior P(Θ), by Bayes’ rule we have:
\[ P(\Theta \mid D) = \frac{P(D \mid \Theta)\,P(\Theta)}{P(D)} \tag{2.2} \]
We seek values for Θ that maximise P(Θ|D). Since P(D) does not vary with Θ, P(D) can be seen as a normalisation constant and does not need to be calculated.
The generative model proposed by LDA consists of a fixed number, T, of topics, each represented as a multinomial distribution over words, and a multinomial distribution over topics assigned to each document. Implicit here is a corpus structure that has a fixed number, D, of documents and a fixed length, N_d, for each document d. Each word-position in each document is then filled by first choosing a topic from the containing document’s topic distribution, then choosing a word from that topic’s word distribution.
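The generative process just described can be sketched in a few lines of Python. This is only an illustrative sketch using the standard library: the function names, the toy sizes and the values of alpha and beta are assumptions made here, and the Dirichlet draws anticipate the priors used by LDA.

```python
import random

def sample_dirichlet(concentration, size, rng):
    """Draw one point from a symmetric Dirichlet via normalised gamma draws."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [x / total for x in draws]

def generate_corpus(num_topics, vocab_size, doc_lengths, alpha, beta, seed=0):
    """Generate word ids for each document following the LDA generative model."""
    rng = random.Random(seed)
    # One word distribution (phi) per topic.
    phi = [sample_dirichlet(beta, vocab_size, rng) for _ in range(num_topics)]
    corpus = []
    for length in doc_lengths:
        # One topic distribution (theta) per document.
        theta = sample_dirichlet(alpha, num_topics, rng)
        doc = []
        for _ in range(length):
            # Fill the word position: choose a topic, then a word from that topic.
            z = rng.choices(range(num_topics), weights=theta)[0]
            w = rng.choices(range(vocab_size), weights=phi[z])[0]
            doc.append(w)
        corpus.append(doc)
    return corpus

# Hypothetical toy settings: 3 topics, 10 word types, 3 documents of fixed length.
corpus = generate_corpus(num_topics=3, vocab_size=10, doc_lengths=[8, 5, 12],
                         alpha=0.5, beta=0.1)
print(corpus)
```

Inference runs this process in reverse: only the word ids are observed, and the topic choices and both sets of distributions have to be recovered.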
The prior used for the word-topic and topic-document multinomials is the Dirichlet distribution. This is a natural choice as it is conjugate to the multinomial distribution: the posterior distributions are also Dirichlet, greatly simplifying posterior estimation. With appropriate parameter settings, the Dirichlet priors also encourage sparsity, with probability concentrated around multinomials that have most of their entries near zero. This is almost always desired for the topic-word distributions (topics with few words), but for the document-topic distributions it is sometimes natural to encourage topic diversity (where each document contains a broad mixture of topics).
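The sparsity effect can be seen numerically by comparing draws from symmetric Dirichlet distributions with a small and a large concentration parameter. The sketch below is illustrative only: the dimension (20), the two concentration values and the helper names are assumptions chosen here, not settings taken from the text.

```python
import random

def sample_dirichlet(concentration, size, rng):
    """Draw one point from a symmetric Dirichlet via normalised gamma draws."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [x / total for x in draws]

def mean_largest_entry(concentration, size=20, trials=2000, seed=1):
    """Average size of the largest entry over many Dirichlet draws."""
    rng = random.Random(seed)
    largest = [max(sample_dirichlet(concentration, size, rng))
               for _ in range(trials)]
    return sum(largest) / trials

# Small concentration: most of the mass piles onto a few entries (sparse).
sparse = mean_largest_entry(0.05)
# Large concentration: mass is spread fairly evenly over all entries (diverse).
diffuse = mean_largest_entry(5.0)
print(f"mean largest entry: {sparse:.2f} vs {diffuse:.2f}")
```

A large average maximum means a typical draw is dominated by one or two entries, which is the behaviour wanted for topics with few words.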
Figure 2.1: LDA Plate Diagram
This process is often represented with a plate diagram such as Figure 2.1. In the diagram, boxes represent collections: K topics, M documents and N words (ideally, N would be subscripted as it varies between documents, but this is usually omitted). Circles represent individual entities: α and β are parameters for the Dirichlet priors, θ the topic mixture for a document, φ a topic, Z a topic chosen from θ and W a word chosen from Z. W is grey, indicating that it is an observed variable (the only one in this model). Similar plate diagrams are often used to describe the model intra-dependencies of LDA variants and adaptations.
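Read as a factorisation, the plate diagram corresponds to the joint distribution below. The indices k, d and n (topic, document and word position) and the subscripted length N_d are notation introduced here for illustration; they do not appear in the diagram itself.

\[ P(W, Z, \theta, \phi \mid \alpha, \beta) = \prod_{k=1}^{K} P(\phi_k \mid \beta) \; \prod_{d=1}^{M} \Big[ P(\theta_d \mid \alpha) \prod_{n=1}^{N_d} P(z_{dn} \mid \theta_d)\, P(w_{dn} \mid \phi_{z_{dn}}) \Big] \]

Each product corresponds to one box in the diagram, and each factor to one arrow into a circle.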
Estimation
As with most Bayesian models of modest complexity, direct calculation of maximum a posteriori (MAP) values for θ and φ is not feasible. Instead one must employ a method that estimates these values. As noted earlier in this section, most of the literature employs either Gibbs sampling [Griffiths and Steyvers 2004] or variational methods [Blei et al. 2003]. Below I describe the Gibbs sampling approach in somewhat more detail.
Gibbs sampling is a method for estimating multivariate probability distributions that originated in statistical physics [Geman and Geman 1984]. It is a Markov chain Monte Carlo (MCMC) method, meaning it produces a series of values that constitute a Markov chain whose stationary distribution is the probability distribution being sought. After a “burn-in” period during which the Markov chain settles down to a close approximation to its stationary distribution, the values can be taken as good estimates for samples from the target distribution. Gibbs sampling achieves this by sampling each of the model’s latent variables in turn from its conditional distribution, keeping the others fixed. A collapsed Gibbs sampler, as used in LDA estimation, first integrates out some of the model’s variables. For LDA estimation, θ and φ are integrated out, so we need only sample the word-topic allocations z.
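The basic mechanics (sample one variable at a time with the others held fixed, and discard a burn-in period) can be illustrated on a much simpler target than LDA. The sketch below assumes a standard bivariate normal with correlation rho, whose conditionals are known in closed form; the target, the function name and all settings are assumptions chosen here for illustration, and this sampler is not collapsed.

```python
import random

def gibbs_bivariate_normal(rho, num_samples, burn_in, seed=0):
    """Gibbs-sample a standard bivariate normal with correlation `rho`."""
    rng = random.Random(seed)
    cond_sd = (1.0 - rho ** 2) ** 0.5   # std. dev. of x given y (and y given x)
    x, y = 5.0, -5.0                    # deliberately poor starting point
    samples = []
    for step in range(burn_in + num_samples):
        # Sample each variable in turn from its conditional, the other held fixed.
        x = rng.gauss(rho * y, cond_sd)
        y = rng.gauss(rho * x, cond_sd)
        if step >= burn_in:             # keep only post-burn-in values
            samples.append((x, y))
    return samples

samples = gibbs_bivariate_normal(rho=0.8, num_samples=20000, burn_in=500)
# With zero means and unit variances, E[xy] equals the correlation rho.
mean_xy = sum(x * y for x, y in samples) / len(samples)
print(f"estimated E[xy] = {mean_xy:.2f} (target 0.8)")
```

Successive samples from such a chain are correlated, so more draws are needed than with independent sampling to reach the same accuracy.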
The full derivation of the Gibbs sampling update equations for LDA utilises relatively standard mathematical techniques which I do not present here. The resulting update equations are as follows. Write j for a particular topic index, W for the vocabulary size, T for the number of topics, and $n^{(d)}_j$, $n^{(w)}_j$ for the counts of words assigned to topic j in document d or of word w respectively. The −i, j subscripts indicate that word i of the corpus is left out of the count. A ‘·’ in place of a super- or sub-script indicates a sum over that index (from [Griffiths and Steyvers 2004]).
\[ P(z_i = j \mid \mathbf{z}_{-i}, \mathbf{w}) \propto \frac{n^{(w_i)}_{-i,j} + \beta}{n^{(\cdot)}_{-i,j} + W\beta} \cdot \frac{n^{(d_i)}_{-i,j} + \alpha}{n^{(d_i)}_{-i,\cdot} + T\alpha} \tag{2.3} \]
For a given word position i, the second denominator does not depend on j, so it is fixed and we can simplify thus:
\[ P(z_i = j \mid \mathbf{z}_{-i}, \mathbf{w}) \propto \frac{n^{(w_i)}_{-i,j} + \beta}{n^{(\cdot)}_{-i,j} + W\beta} \left( n^{(d_i)}_{-i,j} + \alpha \right) \tag{2.4} \]
Once we are satisfied that the model has converged to a reasonable level, we can calculate posterior values for document topic distributions θ and topic word distributions φ as follows.
\[ \phi^{(w)}_j \approx \frac{n^{(w)}_j + \beta}{n^{(\cdot)}_j + W\beta} \tag{2.5} \]
\[ \theta^{(d)}_j \approx \frac{n^{(d)}_j + \alpha}{n^{(d)}_{\cdot} + T\alpha} \tag{2.6} \]
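The update in equation (2.4) and the estimates in equations (2.5) and (2.6) translate fairly directly into code. The following is a minimal, unoptimised sketch rather than a reference implementation: the function name, the toy corpus, the hyperparameter values and the fixed number of sweeps are assumptions made here, and the estimates are read off a single final sample rather than averaged over several.

```python
import random

def lda_gibbs(docs, vocab_size, num_topics, alpha, beta, sweeps, seed=0):
    """Collapsed Gibbs sampler for LDA; returns (phi, theta) estimates."""
    rng = random.Random(seed)
    n_wj = [[0] * num_topics for _ in range(vocab_size)]  # word w in topic j
    n_dj = [[0] * num_topics for _ in docs]               # doc d words in topic j
    n_j = [0] * num_topics                                # all words in topic j
    z = []                                                # topic per word position
    for d, doc in enumerate(docs):                        # random initial allocation
        z.append([rng.randrange(num_topics) for _ in doc])
        for w, j in zip(doc, z[d]):
            n_wj[w][j] += 1
            n_dj[d][j] += 1
            n_j[j] += 1
    for _ in range(sweeps):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                j = z[d][i]
                # Leave the current word out of the counts (the -i subscript).
                n_wj[w][j] -= 1
                n_dj[d][j] -= 1
                n_j[j] -= 1
                # Unnormalised conditional for each topic, equation (2.4).
                weights = [(n_wj[w][k] + beta) / (n_j[k] + vocab_size * beta)
                           * (n_dj[d][k] + alpha) for k in range(num_topics)]
                j = rng.choices(range(num_topics), weights=weights)[0]
                # Record the new allocation and restore the counts.
                z[d][i] = j
                n_wj[w][j] += 1
                n_dj[d][j] += 1
                n_j[j] += 1
    # Point estimates from the final sample, equations (2.5) and (2.6).
    phi = [[(n_wj[w][j] + beta) / (n_j[j] + vocab_size * beta)
            for w in range(vocab_size)] for j in range(num_topics)]
    theta = [[(n_dj[d][j] + alpha) / (len(doc) + num_topics * alpha)
              for j in range(num_topics)] for d, doc in enumerate(docs)]
    return phi, theta

# Hypothetical toy corpus of word ids: ids 0-2 and 3-5 occur in different documents.
docs = [[0, 1, 2, 0, 1], [3, 4, 5, 3, 4], [0, 2, 1, 2], [4, 5, 3, 5]]
phi, theta = lda_gibbs(docs, vocab_size=6, num_topics=2,
                       alpha=0.1, beta=0.1, sweeps=200)
```

Here `phi[j]` is the estimated word distribution for topic j and `theta[d]` the estimated topic distribution for document d. In practice convergence would be monitored rather than assumed after a fixed number of sweeps.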