5.5 Evaluations on Topic Models
5.5.4 Document Clustering
Recall that topic models assign a topic to each word in a document, essentially performing a soft clustering [Erosheva and Fienberg, 2005] of the documents in which the membership is given by the document–topic distribution θ. To evaluate the clustering of the documents, we convert the soft clustering to a hard clustering by choosing the topic that best represents each document, hereafter called the dominant topic. The dominant topic of a document d is the topic with the highest proportion in its topic distribution, that is,
\[
\text{Dominant Topic}(\theta_d) = \arg\max_{k} \theta_{dk} . \tag{5.35}
\]
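As an illustration of Equation (5.35), the following is a minimal Java sketch, assuming that θd is available as a plain array of topic proportions. The class name `DominantTopic` and the tie-breaking rule (the smallest index wins) are assumptions made for this example and are not taken from the framework.

```java
/** Illustrative helper; the name and tie-breaking rule are assumptions. */
final class DominantTopic {
    private DominantTopic() {}

    /** Returns the index k that maximises theta[k]; ties go to the smallest index. */
    static int of(double[] theta) {
        if (theta.length == 0) {
            throw new IllegalArgumentException("empty topic distribution");
        }
        int best = 0;
        for (int k = 1; k < theta.length; k++) {
            if (theta[k] > theta[best]) {
                best = k;
            }
        }
        return best;
    }
}
```

Applying this to every document yields one cluster label per document, which is the hard clustering evaluated below.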
Two commonly used evaluation measures for clustering are purity and normalised mutual information (NMI) [Manning et al., 2008]. Purity is a simple clustering measure that can be interpreted as the proportion of documents correctly clustered, while NMI is an information-theoretic measure used for comparing clusterings. Here, we denote the ground truth classes as S = {s1, . . . , sJ} and the obtained clusters as R = {r1, . . . , rK}, where each sj and rk represents a collection (set) of documents. The purity and NMI can then be computed as
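\[
\text{purity}(S,R) = \frac{1}{D} \sum_{k=1}^{K} \max_{j} |r_k \cap s_j| , \qquad
\text{NMI}(S,R) = \frac{2\,\text{MI}(S;R)}{E(S) + E(R)} , \tag{5.36}
\]
where MI(S;R) denotes the mutual information between the two sets and E(·) denotes the entropy. They are defined as follows:
\[
\text{MI}(S;R) = \sum_{k=1}^{K} \sum_{j=1}^{J} \frac{|r_k \cap s_j|}{D} \log_2 \frac{D\,|r_k \cap s_j|}{|r_k|\,|s_j|} , \qquad
E(R) = - \sum_{k=1}^{K} \frac{|r_k|}{D} \log_2 \frac{|r_k|}{D} . \tag{5.37}
\]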
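Equations (5.36) and (5.37) can be sketched in Java as follows. This is a minimal sketch under stated assumptions: each document carries an integer cluster label in 0..K−1 and an integer class label in 0..J−1, D is taken to be the total number of documents, and empty cells contribute zero to the sums (the usual 0 log 0 = 0 convention). The class and method names are hypothetical.

```java
/** Illustrative implementation of purity and NMI; all names are assumptions. */
final class ClusteringScores {
    private ClusteringScores() {}

    /** Contingency table n[k][j] = |r_k ∩ s_j| built from per-document labels. */
    static int[][] contingency(int[] cluster, int[] truth, int numClusters, int numClasses) {
        int[][] n = new int[numClusters][numClasses];
        for (int d = 0; d < cluster.length; d++) {
            n[cluster[d]][truth[d]]++;
        }
        return n;
    }

    /** Purity as in Equation (5.36). */
    static double purity(int[][] n, int numDocs) {
        int correct = 0;
        for (int[] row : n) {
            int max = 0;
            for (int v : row) {
                max = Math.max(max, v);
            }
            correct += max;
        }
        return (double) correct / numDocs;
    }

    static double log2(double x) {
        return Math.log(x) / Math.log(2.0);
    }

    /** Entropy as in Equation (5.37); empty sets are skipped. */
    static double entropy(int[] sizes, int numDocs) {
        double e = 0.0;
        for (int size : sizes) {
            if (size > 0) {
                double p = (double) size / numDocs;
                e -= p * log2(p);
            }
        }
        return e;
    }

    /** NMI as in Equation (5.36); undefined (NaN) if both entropies are zero. */
    static double nmi(int[][] n, int numDocs) {
        int[] clusterSizes = new int[n.length];
        int[] classSizes = new int[n[0].length];
        for (int k = 0; k < n.length; k++) {
            for (int j = 0; j < n[k].length; j++) {
                clusterSizes[k] += n[k][j];
                classSizes[j] += n[k][j];
            }
        }
        double mi = 0.0;
        for (int k = 0; k < n.length; k++) {
            for (int j = 0; j < n[k].length; j++) {
                if (n[k][j] > 0) {
                    double joint = (double) n[k][j] / numDocs;
                    double ratio = (double) numDocs * n[k][j]
                            / ((double) clusterSizes[k] * classSizes[j]);
                    mi += joint * log2(ratio);
                }
            }
        }
        return 2.0 * mi / (entropy(classSizes, numDocs) + entropy(clusterSizes, numDocs));
    }
}
```

For example, four documents with cluster labels {0, 0, 1, 1} and class labels {0, 0, 1, 1} give a purity of 1 and an NMI of 1, whereas class labels {0, 1, 0, 1} give a purity of 0.5 and an NMI of 0.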
5.6 Implementation
To perform inference on a general topic model with a hierarchical PYP structure, we implemented a general topic modelling framework that modularises the PYP nodes. In this section, we briefly discuss the implementation of this framework, which is written in the Java programming language.
Our topic model framework consists of three parts: data preprocessing, model learning, and evaluation. Here we focus on model learning. We leave the discussion of data preprocessing to the later chapters, which require different preprocessing techniques tailored to various data types. The implementation of the evaluation is relatively straightforward and is thus not discussed.
5.6.1 State
We first discuss the state of the model, which is the collection of variables used in the model. The state consists of all the PYP nodes of the model, the base distribution Hγ, and the topic assignments Z. We briefly describe each part as follows.
5.6.1.1 PYP Node
Each PYP node N in the implementation framework stores the discount parameter αN and the concentration parameter βN, as well as the associated customer counts and table counts. Additionally, the PYP node also has a reference to its parent node (base distribution), allowing recursive operations to be performed easily.
The PYP node has routines (functions) to increment counts and decrement counts (and also to sample the Bernoulli indicator used in decrementing counts). These procedures are recursive in that they call the respective routines of the parent node. In addition, the PYP node has a routine to sample a new concentration parameter βN, following the procedure in Section 5.4.3. Finally, the PYP node can also compute the modularised likelihood and the likelihood ratio according to Equation (5.9) and Equation (5.17), and estimate its posterior mean with Equation (5.28).
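The recursive structure of the increment routine can be sketched as follows. This is only an illustration under stated assumptions, not the framework's code: the class name `CountNode` and its methods are hypothetical, the discount and concentration parameters are left out, and the decision of whether a customer opens a new table is delegated to a caller-supplied `BooleanSupplier`, because the sampling probabilities from the earlier sections are not reproduced here. The only rule built in is that the first customer for an outcome must open a table. Decrementing is omitted.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.BooleanSupplier;

/** Hypothetical node layout; class and method names are illustrative only. */
final class CountNode {
    private final CountNode parent;  // null at the top of the hierarchy
    private final Map<Integer, Integer> customers = new HashMap<>();
    private final Map<Integer, Integer> tables = new HashMap<>();

    CountNode(CountNode parent) {
        this.parent = parent;
    }

    int customers(int k) {
        return customers.getOrDefault(k, 0);
    }

    int tables(int k) {
        return tables.getOrDefault(k, 0);
    }

    /**
     * Adds one customer for outcome k. If the customer opens a new table,
     * the parent node receives one customer as well (the recursive step).
     */
    void increment(int k, BooleanSupplier opensNewTable) {
        // The first customer for k always opens a table; otherwise ask the caller.
        boolean newTable = tables(k) == 0 || opensNewTable.getAsBoolean();
        customers.merge(k, 1, Integer::sum);
        if (newTable) {
            tables.merge(k, 1, Integer::sum);
            if (parent != null) {
                parent.increment(k, opensNewTable);
            }
        }
    }
}
```

With a single child, the parent's customer count for an outcome always equals the child's table count for that outcome, which is the invariant the recursion is meant to maintain.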
5.6.1.2 Base Distribution
Here, we describe the base distribution that is in the form of a probability vector. An example of this is Hγ, which is a uniform vector. In our implementation, we treat the base distribution like a PYP node: we store the “customer counts” for the base distribution, which are just the table counts from its child node. Unlike a PYP node, the base distribution does not have table counts. Note that although storing both the customer counts and the table counts for each PYP node may seem redundant, it is actually important for the more complicated topic models in the later chapters.
The base distribution has similar routines to the PYP node, with a major difference in the way the likelihood is computed. For instance, the modularised posterior likelihood for the base distribution Hγ corresponds to the last term in Equation (5.12).
5.6.1.3 Topic Assignments
The topics in the topic model are represented as non-negative integers from 0 to K−1, where K is the number of topics. As such, the topic assignments in Z take values from 0 to K−1.
In our implementation, we store the topic assignments zd for each document d as separate variables. Each zd is a vector of zdn and has a reference to its parent node, θd. The topic assignments zd have a routine to initialise themselves randomly, which also updates the counts of the parent nodes recursively.
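The random initialisation can be sketched as follows. This is a simplified illustration under stated assumptions: the parent node θd is reduced to a plain array of customer counts, so the recursive update of further ancestors is not shown, and the class name `TopicAssignments` and its fields are hypothetical.

```java
import java.util.Random;

/** Illustrative per-document topic assignments; names are assumptions. */
final class TopicAssignments {
    final int[] z;             // z[n] is the topic of the n-th word, in 0..K-1
    final int[] parentCounts;  // stand-in for the customer counts of the parent node

    TopicAssignments(int numWords, int numTopics) {
        this.z = new int[numWords];
        this.parentCounts = new int[numTopics];
    }

    /** Draws each topic uniformly at random and updates the parent's counts. */
    void initialise(Random rng) {
        int numTopics = parentCounts.length;
        for (int n = 0; n < z.length; n++) {
            z[n] = rng.nextInt(numTopics);
            parentCounts[z[n]]++;
        }
    }
}
```

After initialisation, the counts held by the parent sum to the number of words in the document, so the state is consistent before sampling begins.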
5.6.1.4 Customer Counts and Table Counts
The table counts t and the customer counts c for a PYP node N can be sparse, that is, most of the tk and ck are zero.¹⁰ For efficient storage of the table counts and the customer counts, we adopt the OpenIntIntHashMap from the Colt library,¹¹ a hash map specialised for integer keys and values that is more efficient than a general-purpose HashMap.
Furthermore, we store the various sums of the table counts and the customer counts in a cache. This avoids computing the sums repeatedly and thus speeds up the algorithm considerably.
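The combination of sparse storage and a cached sum can be sketched as follows. To keep the example self-contained, it uses `java.util.HashMap` as a stand-in for Colt's OpenIntIntHashMap, which is an assumption of this sketch and not what the framework uses. The class name `SparseCounts` and its methods are also hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative sparse count vector with a cached total; names are assumptions. */
final class SparseCounts {
    // Only non-zero counts are stored; HashMap stands in for OpenIntIntHashMap.
    private final Map<Integer, Integer> counts = new HashMap<>();
    private int total = 0;  // cached sum of all counts

    int get(int k) {
        return counts.getOrDefault(k, 0);
    }

    /** Adds delta (which may be negative) to the count of k and updates the cache. */
    void add(int k, int delta) {
        int updated = get(k) + delta;
        if (updated < 0) {
            throw new IllegalStateException("count for " + k + " would become negative");
        }
        if (updated == 0) {
            counts.remove(k);  // keep the map sparse
        } else {
            counts.put(k, updated);
        }
        total += delta;
    }

    /** Returns the cached sum in constant time instead of iterating over the map. */
    int total() {
        return total;
    }

    int nonZeroEntries() {
        return counts.size();
    }
}
```

Because the total is updated alongside every change, reading it never requires a pass over the stored counts, at the cost of one extra addition per update.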