Citation Network Topic Model (CNTM) - Nonparametric Bayesian Topic Modelling with Auxiliary Dat

x

ν

D

θ

_d′

D

z

ϕ

γ

µ

w

θ

ϕ

′dk

Figure 7.1: Graphical model for the Citation Network Topic Model (CNTM). The shaded nodes represent observed variables while the unshaded nodes are latent. The box on the top left with D2entries is the citation network on documents represented as a Boolean matrix. The remainder is a nonparametric author-topic model where the A authors on the left have topic distributions that influence the D document–topic distributions. TheK topics, shown in the top right, have bursty modelling following Buntine and Mishra [2014].

7.3 Citation Network Topic Model

In this section, we propose a topic model that jointly model thetext,authors, and the citation network of research publications (documents). We name the topic model the Citation Network Topic Model (CNTM). We first describe the topic model part of the CNTM where the citations are not considered, which will be used for comparison later in Section 7.7. We then complete the CNTM with the discussion on its network component. Additionally, we propose a novel approach to incorporate supervision into CNTM, which is detailed in Section 7.3.3. The full graphical model for CNTM is displayed in Figure 7.1.

7.3.1 Hierarchical Pitman-Yor Process Topic Model

In modelling authorship, the CNTM modifies the approach of the ATM, which as- sumes that the words in a publication are equally attributed to the different authors. This is not reflected in practice since publications are often mainly written by the first author, except when the order is alphabetical. Thus, we assume that the first author is dominant and attribute all the words in a publication to the first author. Although we could model the contribution of each author on a publication by, say, using a Dirichlet distribution, we found that considering only the first author gives a simpler learning algorithm and cleaner results.

The generation process of the topic model component of the CNTM is as follows. We first sample a root topic distribution µwith a GEM distribution to act as a base

84 Bibliographic Analysis on Research Publications

distribution for the author–topic distributionsνa for each author a, as follows:

µ∼GEM(αµ,βµ), (7.1)

νa|µ∼PYP(ανa,βνa,µ), for a=1, . . . ,A. (7.2) Given the first authoradof each publicationd, we sample the document–topic priorθ_d0 and the document–topic distributionθ_d, as the following:

θ0_d|ad,ν ∼PYP(αθ0d_,_βθd0_,_ν_a

d), (7.3)

θd|θ0_d ∼PYP(αθd,βθd,θ_d0), for d=1, . . . ,D. (7.4) Note that instead of modelling a single document–topic distribution, we model a document–topic hierarchy with θ0 andθ. The primedθ0 represents the topics of the document in the context of the citation network. The unprimed θ represents the topics of the text, naturally related to θ0 but not the same. Such modelling gives citation information a higher impact, taking into account the relatively low amount of citations compared to the text. The technical details on the effect of such modelling is presented in Section 7.9.2.

On the vocabulary side, we generate a background word distributionγgivenHγ, a discrete uniform vector of length _|V|, that is, Hγ _{= (}_{. . . ,} 1

|V|, . . .). The symbol V denotes the set of distinct word tokens observed in a corpus. Then, we sample a topic–word distribution φk for each topic k, with γ as the base distribution. This is illustrated by

γ∼PYP(αγ,βγ,Hγ), (7.5)

φk|γ∼PYP(αφk,βφk,γ), for k=1, . . . ,K. (7.6) Modelling word burstiness is important since words in a document are likely to repeat in the document [Buntine and Mishra, 2014]. The same applies to publication abstract, as illustrated by Table 7.2 in Section 7.6. To address this property, we make the topics bursty so each document only focusses on a subset of words in the topic. This is achieved by defining theφ0_dkfor each topic kin documentdas

φ0_dk|φk ∼PYP(αφdk0 ,_βφ0dk,_φ_k). (7.7)

Finally, for each word wdn in document d, we sample the corresponding topic assignmentz_dn from the document–topic distributionθ_d; while the wordw_dnis sam- pled from the topic–word distributionφ0_d givenzdn, that is,

z_dn_|θ_d∼Discrete(θ_d), (7.8)

§7.3 Citation Network Topic Model (CNTM) 85

Note that w includes words from the title and the abstract of the publications, but not the full article. This is because title and abstract provide a good summary of the topics of a given publication and thus more suited for topic modelling, while the full article might contain too much technical details that are not too relevant.

In the next section, we carry through to the modelling of the citation network accompanying a publication collections. This completes the CNTM.

7.3.2 Citation Network Poisson Model

To model the citation network between publications, we assume that the citations are generated conditioned on the topic distributions of the publications. Our approach is motivated by the degree-corrected variant of PMTLM [Zhuet al., 2013]. Denoting xijas the number of times documenticiting documentj, we modelxijwith a Poisson distribution with mean parameterλij, namely,

xij|λij ∼Poisson(λij), for i=1, . . . ,D; j=1, . . . ,D, (7.10) whereλij =λ+_i λ−_j ∑_kλT_kθ_ik0 θ0_jk.

Here,λ+_i is the propensity of documentito cite andλ−_j represents the popularity of cited document j. The parameter λ_kT scales the k-th topic, effectively penalising common topics and strengthen rare topics. Hence, a citation from document i to document j is more likely when these documents are having relevant topics. The Poisson distribution is used instead of a Bernoulli distribution because it leads to dramatically reduced complexity in analysis [Zhu et al., 2013]. We note that the Poisson distribution is similar to the Bernoulli distribution when the mean parameter is small. We present a list of variables associated with the CNTM in Table 7.1.

7.3.3 Incorporating Supervision into CNTM

Author modelling allows topic sharing of multiple documents written by the same author. However, there are many authors who have authored only a few publications. Their treatment can be problematic to topic models that incorporate authorship information since these authors are not discriminative enough for prediction. We propose an extension of the CNTM to address this issue. We name this extension the supervised CNTM (SCNTM) as it allows supervision during model training.

We introduce author threshold η, a parameter that controls the level of supervision used by the SCNTM. In the SCNTM, for each document, if the author has produced less thanηpublications, the authorship is replaced by adummy authorthat corresponds to the categorical label of the document. These authors are effectively merged to form a collection of authors with similar research area. For example, η = 2 means authors who have only a single publication are replaced, while η = 1 corresponds to no replacement. In addition to being able to cluster the documents

86 Bibliographic Analysis on Research Publications

Table 7.1: List of Variables for the Citation Network Topic Model (CNTM).

Variable Name Description

zdn Topic Category (topical) label for word wdn. wdn Word Observed word or phrase at position_{in document} _d. n

xij Citations Number of times document

icites document j.

φ_dk0 Word distribution Probability distribution in generating words given documentdand topick. φk Topic–word distribution Word prior forφ0_dk.

θd Document–topic distribution Probability distribution in generating_{topics for document}_d. θ_d0 Document–topic prior Topic prior for θ_d.

νa Author–topic distribution Probability distribution in generating topics for authora.

γ Global word distribution Word prior forφk. µ Global topic distribution Topic prior for νa.

αN Discount Discount parameter for PYPN. βN Concentration Concentration parameter for PYPN. HN Base distribution Base distribution of the PYP N.

λij Rate Rate parameter or the mean forxij. λ_i+ Cite propensity Propensity to cite for documenti. λ_i− Cited propensity Propensity to be cited for document j. λT_k Scaling factor Citation scaling factor for topick.

better, as shown in Section 7.7.4, the SCNTM achieves a reduction of memory usage by greatly reducing the number of authors that need to be modelled.

In document Nonparametric Bayesian Topic Modelling with Auxiliary Data (Page 107-110)