x
ijν
aD
2θ
d′D
z
dnϕ
kγ
µ
w
dnθ
dϕ
′dkFigure 7.1: Graphical model for the Citation Network Topic Model (CNTM). The shaded nodes represent observed variables while the unshaded nodes are latent. The box on the top left with D2entries is the citation network on documents represented as a Boolean matrix. The remainder is a nonparametric author-topic model where the A authors on the left have topic distributions that influence the D document–topic distributions. TheK topics, shown in the top right, have bursty modelling following Buntine and Mishra [2014].
7.3
Citation Network Topic Model
In this section, we propose a topic model that jointly model thetext,authors, and the citation network of research publications (documents). We name the topic model the Citation Network Topic Model (CNTM). We first describe the topic model part of the CNTM where the citations are not considered, which will be used for comparison later in Section 7.7. We then complete the CNTM with the discussion on its network component. Additionally, we propose a novel approach to incorporate supervision into CNTM, which is detailed in Section 7.3.3. The full graphical model for CNTM is displayed in Figure 7.1.
7.3.1 Hierarchical Pitman-Yor Process Topic Model
In modelling authorship, the CNTM modifies the approach of the ATM, which as- sumes that the words in a publication are equally attributed to the different authors. This is not reflected in practice since publications are often mainly written by the first author, except when the order is alphabetical. Thus, we assume that the first author is dominant and attribute all the words in a publication to the first author. Although we could model the contribution of each author on a publication by, say, using a Dirichlet distribution, we found that considering only the first author gives a simpler learning algorithm and cleaner results.
The generation process of the topic model component of the CNTM is as follows. We first sample a root topic distribution µwith a GEM distribution to act as a base
84 Bibliographic Analysis on Research Publications
distribution for the author–topic distributionsνa for each author a, as follows:
µ∼GEM(αµ,βµ), (7.1)
νa|µ∼PYP(ανa,βνa,µ), for a=1, . . . ,A. (7.2) Given the first authoradof each publicationd, we sample the document–topic priorθd0 and the document–topic distributionθd, as the following:
θ0d|ad,ν ∼PYP(αθ0d,βθd0,νa
d), (7.3)
θd|θ0d ∼PYP(αθd,βθd,θd0), for d=1, . . . ,D. (7.4) Note that instead of modelling a single document–topic distribution, we model a document–topic hierarchy with θ0 andθ. The primedθ0 represents the topics of the document in the context of the citation network. The unprimed θ represents the topics of the text, naturally related to θ0 but not the same. Such modelling gives citation information a higher impact, taking into account the relatively low amount of citations compared to the text. The technical details on the effect of such modelling is presented in Section 7.9.2.
On the vocabulary side, we generate a background word distributionγgivenHγ, a discrete uniform vector of length |V|, that is, Hγ = (. . . , 1
|V|, . . .). The symbol V denotes the set of distinct word tokens observed in a corpus. Then, we sample a topic–word distribution φk for each topic k, with γ as the base distribution. This is illustrated by
γ∼PYP(αγ,βγ,Hγ), (7.5)
φk|γ∼PYP(αφk,βφk,γ), for k=1, . . . ,K. (7.6) Modelling word burstiness is important since words in a document are likely to repeat in the document [Buntine and Mishra, 2014]. The same applies to publication abstract, as illustrated by Table 7.2 in Section 7.6. To address this property, we make the topics bursty so each document only focusses on a subset of words in the topic. This is achieved by defining theφ0dkfor each topic kin documentdas
φ0dk|φk ∼PYP(αφdk0 ,βφ0dk,φk). (7.7)
Finally, for each word wdn in document d, we sample the corresponding topic assignmentzdn from the document–topic distributionθd; while the wordwdnis sam- pled from the topic–word distributionφ0d givenzdn, that is,
zdn|θd∼Discrete(θd), (7.8)
§7.3 Citation Network Topic Model (CNTM) 85
Note that w includes words from the title and the abstract of the publications, but not the full article. This is because title and abstract provide a good summary of the topics of a given publication and thus more suited for topic modelling, while the full article might contain too much technical details that are not too relevant.
In the next section, we carry through to the modelling of the citation network accompanying a publication collections. This completes the CNTM.
7.3.2 Citation Network Poisson Model
To model the citation network between publications, we assume that the citations are generated conditioned on the topic distributions of the publications. Our approach is motivated by the degree-corrected variant of PMTLM [Zhuet al., 2013]. Denoting xijas the number of times documenticiting documentj, we modelxijwith a Poisson distribution with mean parameterλij, namely,
xij|λij ∼Poisson(λij), for i=1, . . . ,D; j=1, . . . ,D, (7.10) whereλij =λ+i λ−j ∑kλTkθik0 θ0jk.
Here,λ+i is the propensity of documentito cite andλ−j represents the popularity of cited document j. The parameter λkT scales the k-th topic, effectively penalising common topics and strengthen rare topics. Hence, a citation from document i to document j is more likely when these documents are having relevant topics. The Poisson distribution is used instead of a Bernoulli distribution because it leads to dramatically reduced complexity in analysis [Zhu et al., 2013]. We note that the Poisson distribution is similar to the Bernoulli distribution when the mean parameter is small. We present a list of variables associated with the CNTM in Table 7.1.
7.3.3 Incorporating Supervision into CNTM
Author modelling allows topic sharing of multiple documents written by the same author. However, there are many authors who have authored only a few publica- tions. Their treatment can be problematic to topic models that incorporate author- ship information since these authors are not discriminative enough for prediction. We propose an extension of the CNTM to address this issue. We name this extension the supervised CNTM (SCNTM) as it allows supervision during model training.
We introduce author threshold η, a parameter that controls the level of super- vision used by the SCNTM. In the SCNTM, for each document, if the author has produced less thanηpublications, the authorship is replaced by adummy authorthat corresponds to the categorical label of the document. These authors are effectively merged to form a collection of authors with similar research area. For example, η = 2 means authors who have only a single publication are replaced, while η = 1 corresponds to no replacement. In addition to being able to cluster the documents
86 Bibliographic Analysis on Research Publications
Table 7.1: List of Variables for the Citation Network Topic Model (CNTM).
Variable Name Description
zdn Topic Category (topical) label for word wdn. wdn Word Observed word or phrase at positionin document d. n
xij Citations Number of times document
icites doc- ument j.
φdk0 Word distribution Probability distribution in generating words given documentdand topick. φk Topic–word distribution Word prior forφ0dk.
θd Document–topic distribution Probability distribution in generatingtopics for documentd. θd0 Document–topic prior Topic prior for θd.
νa Author–topic distribution Probability distribution in generating topics for authora.
γ Global word distribution Word prior forφk. µ Global topic distribution Topic prior for νa.
αN Discount Discount parameter for PYPN. βN Concentration Concentration parameter for PYPN. HN Base distribution Base distribution of the PYP N.
λij Rate Rate parameter or the mean forxij. λi+ Cite propensity Propensity to cite for documenti. λi− Cited propensity Propensity to be cited for document j. λTk Scaling factor Citation scaling factor for topick.
better, as shown in Section 7.7.4, the SCNTM achieves a reduction of memory usage by greatly reducing the number of authors that need to be modelled.