3.3 Stochastic Processes and the Nonparametric Model
3.3.3 Pitman-Yor Process with a Mixture Base
Note that the base measure $H$ of a PYP is not restricted to a single probability distribution; we can let $H$ be a mixture distribution such as
$$H = \rho_1 H_1 + \rho_2 H_2 + \cdots + \rho_n H_n, \tag{3.35}$$
where $\sum_{i=1}^{n} \rho_i = 1$ and $\{H_1, \dots, H_n\}$ is a set of distributions over the same measurable space $(X, B)$ as $H$.
With this specification of $H$, the PYP is also called the compound Poisson-Dirichlet process in Du [2012], or the doubly hierarchical Pitman-Yor process in Wood and Teh [2009]. A special case is the DP equivalent, which is known as the DP with mixed random measures in Kim et al. [2012]. In the CRP representation, if the base distribution is a mixture of multiple PYPs, we can treat the PYP as a restaurant with multiple parent restaurants. More on this in Chapter 5.
In the above discussion we have assumed constant values for the $\rho_i$, though we can go fully Bayesian and assign a prior distribution to them. A natural prior is the Dirichlet distribution:
$$(\rho \mid \gamma) \sim \mathrm{Dirichlet}(\gamma), \tag{3.36}$$
where we define $\rho = (\rho_1, \dots, \rho_n)$ and $\gamma = (\gamma_1, \dots, \gamma_n)$.
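As a minimal sketch of Equations (3.35) and (3.36), the following Python snippet draws the mixture weights from a Dirichlet prior and then samples from the resulting mixture base. The three Gaussian components, the value of `gamma`, and the helper name `sample_mixture_base` are illustrative assumptions and do not come from the text; any set of distributions over a common space would serve.

```python
import numpy as np

def sample_mixture_base(rng, rho, components, size):
    """Draw `size` samples from H = sum_i rho_i * H_i.

    `components` is a list of callables; each takes the random generator
    and returns one draw from the corresponding H_i (hypothetical interface).
    """
    rho = np.asarray(rho, dtype=float)
    # Choose a component index for every draw according to the weights rho,
    # then sample from the chosen component.
    idx = rng.choice(len(components), size=size, p=rho)
    draws = np.array([components[i](rng) for i in idx])
    return draws, idx

rng = np.random.default_rng(0)

# (rho | gamma) ~ Dirichlet(gamma), as in Eq. (3.36); gamma is an assumed value.
gamma = np.array([1.0, 1.0, 1.0])
rho = rng.dirichlet(gamma)

# Assumed components H_1, H_2, H_3: Gaussians over the real line.
components = [
    lambda r: r.normal(-5.0, 1.0),
    lambda r: r.normal(0.0, 1.0),
    lambda r: r.normal(5.0, 1.0),
]

samples, idx = sample_mixture_base(rng, rho, components, size=1000)
```

With fixed weights, the `rng.dirichlet` line would simply be replaced by a constant vector that sums to one.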
3.4 Summary
This chapter provides a brief review of some relevant and important probability distributions and their characteristics. In particular, we touch on choosing conjugate priors to simplify the corresponding posterior distributions. This led to the discussion of the Hierarchical Dirichlet Model in Section 3.2.4, which serves as a bridge to the discussion of some related stochastic processes.
An application of these probability distributions and stochastic processes is in the area of topic modelling. This will be reviewed in the next chapter.
Chapter 4
Topic Models
One example out of many successful Bayesian applications is topic modelling, which automatically discovers the latent (or hidden) structure of a corpus of documents. Here, a document is not restricted to text; it can be an image, a video or even genes (with genetic information). Essentially, topic modelling can be applied to any data that can be represented by a set of items or features [Blei et al., 2003; Fergus et al., 2005; Zheng et al., 2006; Hospedales et al., 2012]. In this dissertation, we discuss topic modelling in the context of text analysis.
“Topic modelling algorithms are statistical methods that analyse the words of the original texts to discover the themes that run through them, how those themes are connected to each other, and how they change over time” [Blei, 2012]. With topic models, we are able to analyse and summarise electronic documents — which are growing in size exponentially — quickly and automatically.
A topic is essentially a set of words grouped together by their co-occurrence and other factors (which depend on the topic model). Since a topic does not come with a word or a title that describes it, practitioners tend to represent a topic by its first n most significant words and to label it by hand. To avoid this manual task, the research community has proposed several methods to label topics automatically. Recent attempts at automatic topic labelling include the work of Lau et al. [2011], Mao et al. [2012], Aletras and Stevenson [2014], and Cano Basave et al. [2014].
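To illustrate the "first n most significant words" representation, here is a small sketch. It assumes that a topic is stored as a probability vector over a vocabulary and that "significance" means probability under the topic; the vocabulary, the probabilities, and the helper name `top_words` are made up for illustration.

```python
import numpy as np

def top_words(topic_probs, vocab, n=3):
    """Return the n words with the highest probability under one topic."""
    # argsort is ascending, so reverse it and keep the first n indices.
    order = np.argsort(topic_probs)[::-1][:n]
    return [vocab[i] for i in order]

# Hypothetical vocabulary and topic-word probabilities.
vocab = ["gene", "dna", "cell", "ball", "team"]
topic_probs = np.array([0.40, 0.30, 0.20, 0.06, 0.04])

summary = top_words(topic_probs, vocab, n=3)
print(summary)  # ['gene', 'dna', 'cell']
```

Other notions of significance, for example scores that discount words common to every topic, would change only the vector being sorted.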
Topic modelling is used in many domains such as text analysis and computer vision. In text analysis, topic modelling has been used for document clustering, topic exploration, sentiment analysis, text summarisation, document segmentation, and information retrieval [Blei, 2012]. In computer vision, topic modelling has been used successfully in face recognition [Lu et al., 2003] and scene recognition [Fei-Fei and Perona, 2005]. In this chapter, we discuss some popular topic models used in practice.
4.1 Latent Dirichlet Allocation
The latent Dirichlet allocation (LDA) [Blei et al., 2003] is the simplest Bayesian topic model; it is a fully Bayesian extension of the probabilistic latent semantic indexing (pLSI) [Hofmann, 1999]. The LDA can also be seen as a type of principal component analysis for discrete data [Buntine, 2002].
[Figure 4.1: plate diagram with nodes $\mu$, $\theta_d$, $z_{dn}$, $w_{dn}$, $\phi_k$, $\gamma$ and plates $D$, $N_d$, $K$]
Figure 4.1: Graphical model of the latent Dirichlet allocation (LDA). The shaded node represents the observed variable while the unshaded nodes represent latent variables.
The LDA is an admixture model: each word in a document is assigned to a topic, and hence a document is linked to multiple topics rather than to only one topic. The Bayesian model of the LDA is given by the following generative process:
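$$(\theta_d \mid \mu) \sim \mathrm{Dirichlet}(\mu), \quad \text{for } d = 1, \dots, D, \tag{4.1}$$
$$(\phi_k \mid \gamma) \sim \mathrm{Dirichlet}(\gamma), \quad \text{for } k = 1, \dots, K, \tag{4.2}$$
$$(z_{dn} \mid \theta_d) \sim \mathrm{Discrete}(\theta_d), \tag{4.3}$$
$$(w_{dn} \mid z_{dn}, \phi) \sim \mathrm{Discrete}(\phi_{z_{dn}}), \quad \text{for } n = 1, \dots, N_d. \tag{4.4}$$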
In the above, $\mu$ and $\gamma$ are the parameters of the Dirichlet priors on $\theta_d$ and $\phi_k$ respectively,⁷ while $z_{dn}$ is a topic index (i.e., a label for a particular topic, usually numbered) and $w_{dn}$ is the word associated with document $d$ and position $n$ (the $n$-th word in the text sequence); $k$ is used to index the topics (out of $K$ seen topics during sampling). Figure 4.1 shows the graphical model for the LDA.
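The generative process in Equations (4.1) to (4.4) can be sketched in a few lines of Python. This is an illustration under stated assumptions and not an implementation from the cited works: the vocabulary size `V`, the function name `generate_lda_corpus`, and all numeric values are hypothetical, and both Dirichlet priors are taken to be symmetric with scalar parameters `mu` and `gamma`.

```python
import numpy as np

def generate_lda_corpus(D, K, V, doc_lengths, mu, gamma, seed=0):
    """Simulate a corpus from the LDA generative process.

    D: number of documents, K: number of topics, V: vocabulary size (assumed),
    doc_lengths[d]: N_d, mu / gamma: symmetric Dirichlet parameters.
    """
    rng = np.random.default_rng(seed)
    theta = rng.dirichlet(np.full(K, mu), size=D)    # Eq. (4.1): one row per document
    phi = rng.dirichlet(np.full(V, gamma), size=K)   # Eq. (4.2): one row per topic
    z, w = [], []
    for d in range(D):
        # Eq. (4.3): a topic index for every word position in document d.
        z_d = rng.choice(K, size=doc_lengths[d], p=theta[d])
        # Eq. (4.4): each word is drawn from the topic assigned to its position.
        w_d = np.array([rng.choice(V, p=phi[k]) for k in z_d], dtype=int)
        z.append(z_d)
        w.append(w_d)
    return theta, phi, z, w

# Hypothetical sizes, for illustration only.
theta, phi, z, w = generate_lda_corpus(
    D=3, K=2, V=6, doc_lengths=[5, 8, 4], mu=0.5, gamma=0.1
)
```

Words are encoded as integer indices into the vocabulary; in a real corpus only `w` would be observed, while `theta`, `phi` and `z` are latent.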
Under this model, our aim is to infer the latent variables $\theta$ and $\phi$, which are known as the document-topic distribution and the topic-word distribution respectively. Inference can be performed easily via collapsed Gibbs sampling, in which the conjugacy between the distributions in the model allows a marginal posterior distribution to be derived. The Gibbs sampling is performed on the latent variable $z$, with $\theta$ and $\phi$ integrated out, even though the main interest is in them. This is because $\theta$ and $\phi$ can be estimated rather easily once we have a sample of $z$.
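A compact sketch of such a collapsed Gibbs sampler is given below. It uses the standard conditional for symmetric priors, $p(z_{dn} = k \mid \text{rest}) \propto (n_{dk} + \mu)(n_{kw} + \gamma)/(n_k + V\gamma)$, where the counts exclude the current word. The count-array names, the vocabulary size `V`, the toy documents, and the function name are assumptions introduced here; the sketch favours clarity over speed.

```python
import numpy as np

def collapsed_gibbs_lda(docs, K, V, mu, gamma, n_iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA with symmetric Dirichlet priors.

    docs: list of documents, each a list of word indices in [0, V).
    Returns the final assignments z and point estimates of theta and phi.
    """
    rng = np.random.default_rng(seed)
    D = len(docs)
    n_dk = np.zeros((D, K))   # words in document d assigned to topic k
    n_kw = np.zeros((K, V))   # times word w is assigned to topic k
    n_k = np.zeros(K)         # total words assigned to topic k
    z = []
    for d, doc in enumerate(docs):       # random initialisation of z
        z_d = rng.integers(K, size=len(doc))
        z.append(z_d)
        for w, k in zip(doc, z_d):
            n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1
    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                k = z[d][n]
                # Remove the current word from the counts.
                n_dk[d, k] -= 1; n_kw[k, w] -= 1; n_k[k] -= 1
                # Conditional over topics with theta and phi integrated out.
                p = (n_dk[d] + mu) * (n_kw[:, w] + gamma) / (n_k + V * gamma)
                k = rng.choice(K, p=p / p.sum())
                z[d][n] = k
                n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1
    # Estimate theta and phi from the counts implied by the sample z.
    theta = (n_dk + mu) / (n_dk.sum(axis=1, keepdims=True) + K * mu)
    phi = (n_kw + gamma) / (n_k[:, None] + V * gamma)
    return z, theta, phi

# Hypothetical toy corpus over a vocabulary of four word indices.
docs = [[0, 1, 0, 1], [2, 3, 2, 3], [0, 1, 2, 3]]
z, theta, phi = collapsed_gibbs_lda(docs, K=2, V=4, mu=0.5, gamma=0.1, n_iters=50)
```

The last two assignments inside the function show why $\theta$ and $\phi$ need not be sampled: given $z$, their posterior means follow directly from the topic counts and the prior parameters.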
Due to the simplicity of the LDA and its ease of implementation, it has been used widely in a variety of applications. It is also easily extended into a more compli-
⁷ Here, the notation Dirichlet(a) represents the symmetric Dirichlet distribution with parameter