Topic Modelling with Metadata - Nonparametric Bayesian Topic Modelling with Auxiliary Data

cated model for complex problems. A straightforward extension of the LDA is the hierarchical Dirichlet process LDA (HDP-LDA) [Teh et al., 2006], which is a Bayesian nonparametric generalisation of the LDA. One advantage of nonparametric modelling is that it overcomes a limitation of the LDA, in which the number of topics is a fixed constant. The HDP-LDA relaxes this constraint and learns the number of topics directly from the data. We note that the HDP-LDA is a special case of the hierarchical Pitman-Yor process LDA, which we revisit in Chapter 5.

4.2 Topic Modelling with Metadata

Another extension to the LDA makes use of metadata, that is, auxiliary information that accompanies a document. For instance, tweets (short documents from Twitter) contain additional information such as authors, tags, and hyperlinks. This information is often discarded in a vanilla topic model such as the LDA.

In the context of microblogs, such as tweets, each document is limited to a certain size⁸ and usually contains informal language (deliberate misspellings, acronyms, and abbreviations). Previous findings [Zhao et al., 2011] suggest that the LDA does not work as well as other models that use metadata, as topics obtained from the LDA are mostly incoherent and not interpretable. A natural remedy is to aggregate these microblog documents by author to form larger documents [Weng et al., 2010; Hong and Davison, 2010].
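The author-based aggregation described above can be sketched as follows. This is a minimal illustration rather than the procedure of the cited papers: the function name `pool_by_author`, the toy tweets, and the whitespace tokenisation are all assumptions made for the example.

```python
from collections import defaultdict

def pool_by_author(tweets):
    """Concatenate the tokens of all tweets by the same author into one pseudo-document."""
    pooled = defaultdict(list)
    for author, text in tweets:
        # Naive lower-casing and whitespace tokenisation (an assumption for this sketch).
        pooled[author].extend(text.lower().split())
    return dict(pooled)

# Hypothetical toy data: (author, tweet text) pairs.
tweets = [
    ("alice", "new topic model paper out"),
    ("bob", "gr8 match 2nite"),
    ("alice", "LDA on tweets is hard"),
]
pooled = pool_by_author(tweets)
```

Each pooled token list can then be passed to a standard topic model as a single, longer document.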

Instead of employing an ad hoc approach to improve the LDA, a better solution would be to design a topic model that is better suited to modelling such documents. Topic models that make use of metadata include the author-topic model [Rosen-Zvi et al., 2004], the tag-topic model [Tsai, 2011], the relational topic model [Chang and Blei, 2010], supervised LDA [Mcauliffe and Blei, 2008], Twitter-LDA [Zhao and Jiang, 2011], Topic-Link LDA [Liu et al., 2009], and others. These models are able to make additional inferences on documents, such as obtaining the word distributions that correspond to certain authors or tags.

4.2.1 Author-topic Model

The author-topic model proposed by Rosen-Zvi et al. [2004] makes use of authorship information to improve topic modelling. It is a combination of the LDA and the author model. The author model is analogous to a topic model, but with words generated from author–word distributions rather than topic–word distributions. Unlike the LDA, the author model is not an admixture model.

In the author-topic model, a new latent variable $x$ is introduced, which serves to assign a word to an author. Hence, each word under this model is assigned a topic

Figure 4.2: Graphical model for the author-topic model (ATM). As before, shaded nodes represent observed variables while unshaded nodes represent latent variables.

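and an author. The generative model for the author-topic model can be summarised as follows:

\begin{align}
(\nu_i \mid \mu) &\sim \mathrm{Dirichlet}(\mu), & \text{for } i &= 1, \dots, A, \tag{4.5} \\
(\phi_k \mid \gamma) &\sim \mathrm{Dirichlet}(\gamma), & \text{for } k &= 1, \dots, K, \tag{4.6} \\
(x_{dn} \mid a_d) &\sim \mathrm{Uniform}(a_d), & \text{for } d &= 1, \dots, D, \ n = 1, \dots, N_d, \tag{4.7} \\
(z_{dn} \mid x_{dn}, \nu) &\sim \mathrm{Discrete}(\nu_{x_{dn}}), \tag{4.8} \\
(w_{dn} \mid z_{dn}, \phi) &\sim \mathrm{Discrete}(\phi_{z_{dn}}). \tag{4.9}
\end{align}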

Here, $\nu_i$ is the author–topic distribution for author $i$, which is used to generate the latent topic $z_{dn}$ given the latent author $x_{dn}$, who is assumed to have written word $n$ in document $d$. Figure 4.2 shows the graphical model of the author-topic model.
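The generative process in (4.5) to (4.9) can be illustrated with a short forward-sampling sketch. It is only an illustration under stated assumptions: the hyperparameters µ and γ are treated as symmetric scalar concentrations, authors, topics, and words are zero-indexed integers, the vocabulary size `V` is introduced for the example, and all function names are hypothetical.

```python
import random

def sample_dirichlet(alpha, dim, rng):
    """Draw from a symmetric Dirichlet(alpha) of size dim via normalised Gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(dim)]
    total = sum(draws)
    return [g / total for g in draws]

def sample_discrete(probs, rng):
    """Draw an index from the discrete distribution given by probs."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def generate_atm(doc_authors, doc_lengths, A, K, V, mu, gamma, seed=0):
    """Forward-sample a corpus from the author-topic model.

    doc_authors[d] is the author list a_d; doc_lengths[d] is N_d.
    Returns nu, phi, and, per document, a list of (x, z, w) triples.
    """
    rng = random.Random(seed)
    nu = [sample_dirichlet(mu, K, rng) for _ in range(A)]       # (4.5) author-topic
    phi = [sample_dirichlet(gamma, V, rng) for _ in range(K)]   # (4.6) topic-word
    corpus = []
    for a_d, n_d in zip(doc_authors, doc_lengths):
        words = []
        for _ in range(n_d):
            x = rng.choice(a_d)                # (4.7) latent author, uniform over a_d
            z = sample_discrete(nu[x], rng)    # (4.8) topic from that author's distribution
            w = sample_discrete(phi[z], rng)   # (4.9) word from that topic's distribution
            words.append((x, z, w))
        corpus.append(words)
    return nu, phi, corpus

# Hypothetical toy setting: three authors, two topics, six word types.
doc_authors = [[0, 1], [2]]
doc_lengths = [5, 3]
nu, phi, corpus = generate_atm(doc_authors, doc_lengths, A=3, K=2, V=6, mu=0.5, gamma=0.5)
```

In practice only the words and the author lists are observed; the author and topic assignments returned here are the latent quantities that inference has to recover.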

Note that the latent author $x_{dn}$ is generated uniformly⁹ from $a_d$, the list of authors of document $d$. This means that each word in document $d$ is assumed to be contributed at random by one of the authors. However, this assumption is not realistic, since a document is often written by the first author and then adjusted by the others. In addition, the assumption fails to recognise the dependency between words in terms of authorship, that is, consecutive words tend to be written by the same person. A relaxation of this assumption would be to induce asymmetry in authorship and/or to assign authorship given the structure of the documents.

4.2.2 Tag-topic Model

The tag-topic model [Tsai, 2011] is essentially the same as the author-topic model, except that the authorship information is replaced by tags. The model is arguably better than the author-topic model as tags are more closely related to topics than authors are; in

⁹ We denote by $\mathrm{Uniform}(b)$ the discrete uniform distribution whose outcome is one of the values in $b$, each chosen with probability $1/|b|$.

Figure 4.3: Graphical model for the supervised LDA. Shaded nodes represent observed variables while unshaded nodes represent latent variables.

fact, a tag can be seen as a topic. The graphical model of the tag-topic model is omitted as it is almost identical to that of the author-topic model. Several other topic model variants also utilise tag information, including TagLDA [Zhu et al., 2006] and Tag-LDA (with a hyphen) [Si and Sun, 2009].

4.2.3 Supervised LDA

In contrast to the author-topic model and the tag-topic model, where the metadata are used in generating the words in a document, supervised LDA deals with metadata that are themselves generated by the model, just as the words are. Supervised LDA works with any metadata that are relevant to a document, for example, movie ratings (scores) for movie reviews (text). As such, supervised LDA can also be used to predict any quantity of interest (metadata) given the text data.

The graphical model for supervised LDA is given in Figure 4.3. Under this model, in addition to the usual generative process given by standard LDA (see Section 4.1), the observed variable $y_d$ is generated by

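\[
(y_d \mid z_d, \beta, \delta) \sim \mathrm{GLM}(\bar{z}_d, \beta, \delta), \tag{4.10}
\]

where $\mathrm{GLM}(x, \beta, \delta)$ denotes the generalised linear model [McCullagh, 1984] with covariates $x$, regression parameters $\beta$ and dispersion parameter $\delta$. The probability density function for the GLM is given by

\[
p(y \mid x, \beta, \delta) = h(y, \delta) \exp\!\left( \frac{(x \cdot \beta)\, y - A(x \cdot \beta)}{\delta} \right). \tag{4.11}
\]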

Here, the functions $h(y, \delta)$ and $A(x \cdot \beta)$ are known as the base measure and the log-normaliser, respectively. For more details, refer to Mcauliffe and Blei [2008] and McCullagh [1984].
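As a worked special case (an illustration chosen here, not an example given in the text above), consider a Gaussian response with mean $x \cdot \beta$ and variance $\delta$. Taking

\[
h(y, \delta) = \frac{1}{\sqrt{2\pi\delta}} \exp\!\left( -\frac{y^2}{2\delta} \right), \qquad A(x \cdot \beta) = \frac{(x \cdot \beta)^2}{2},
\]

and substituting into the GLM density gives

\[
p(y \mid x, \beta, \delta) = \frac{1}{\sqrt{2\pi\delta}} \exp\!\left( -\frac{y^2 - 2 (x \cdot \beta)\, y + (x \cdot \beta)^2}{2\delta} \right) = \frac{1}{\sqrt{2\pi\delta}} \exp\!\left( -\frac{\left( y - x \cdot \beta \right)^2}{2\delta} \right),
\]

which is the density of a normal distribution with mean $x \cdot \beta$ and variance $\delta$, so in this case the model reduces to linear regression on the covariates.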

Note that supervised LDA represents each $z_{dn}$ as a vector in which exactly one entry is 1 (the rest being 0). The explanatory power to predict $y_d$ is then

given by the mean vector $\bar{z}_d$, which is the topic proportion for document $d$:

\[
\bar{z}_d = \frac{\sum_{n=1}^{N_d} z_{dn}}{N_d}. \tag{4.12}
\]
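The mean vector in (4.12) and its use as a covariate can be sketched in a few lines. The sketch assumes topic assignments stored as integer indices rather than one-hot vectors, and it uses a Gaussian response as one possible GLM; the function names and the numerical values are hypothetical.

```python
import random

def topic_proportions(assignments, K):
    """Mean of the one-hot topic indicators: the fraction of words assigned to each topic."""
    counts = [0] * K
    for z in assignments:
        counts[z] += 1
    n = len(assignments)
    return [c / n for c in counts]

def sample_gaussian_response(z_bar, beta, delta, rng):
    """Gaussian special case of the GLM: y ~ Normal(z_bar . beta, variance delta)."""
    mean = sum(zb * b for zb, b in zip(z_bar, beta))
    return rng.gauss(mean, delta ** 0.5)

# Hypothetical topic assignments for a five-word document with K = 3 topics.
z_d = [0, 2, 2, 1, 2]
z_bar = topic_proportions(z_d, K=3)
y_d = sample_gaussian_response(z_bar, beta=[1.0, 3.0, 5.0], delta=0.25, rng=random.Random(0))
```

Averaging the indicators makes the covariate independent of document length, so documents of different sizes share the same regression parameters.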
