Latent Dirichlet Allocation in R
Martin Ponweiser
Diploma Thesis
Institute for Statistics and Mathematics
Diploma Thesis
Latent Dirichlet Allocation in R
Submitted for the academic degree of Magister der Sozial- und Wirtschaftswissenschaften at the Wirtschaftsuniversität Wien
Supervised by
Assist.-Prof. Dipl.-Ing. Dr. Bettina Grün, Department of Applied Statistics, Johannes Kepler Universität Linz
Examiner
Univ.-Prof. Dipl.-Ing. Dr. Kurt Hornik, Institute for Statistics and Mathematics, Wirtschaftsuniversität Wien
by
Martin Ponweiser
Topic models are a recent research field within the information sciences of information retrieval and text mining. Using machine learning techniques, document corpora are turned into statistical models, which makes them easier to search and to explore. The best-known topic model is latent Dirichlet allocation (LDA), which since its introduction by Blei et al. in 2003 has established itself as a useful tool in various disciplines and has also served as the foundation for the development of more complex topic models. This diploma thesis is devoted to the practical application of LDA. The data analysis of one of the first publications on LDA, the article “Finding scientific topics” by Thomas Griffiths and Mark Steyvers (2004), is reproduced in the statistical programming language R with the help of the new R package “topicmodels” by Bettina Grün and Kurt Hornik. The complete workflow, from extracting a text corpus from the website of the journal PNAS, through preparing the data and transforming it into a document-term matrix, to model selection, model estimation, and the preparation and visualisation of the results, is fully documented and commented. The outcome largely matches the analysis of the original article; the procedure of Griffiths/Steyvers is thus reproduced, and at the same time the suitability of the open-source tools of the R environment for text mining with LDA is confirmed.
Keywords: latent Dirichlet allocation, LDA, R, topic models, text mining, information retrieval, statistics
Diploma Thesis
Latent Dirichlet Allocation in R
Martin Ponweiser
Institute for Statistics and Mathematics, Vienna University of Economics and Business
Supervisors: Bettina Grün, Kurt Hornik
Topic models are a new research field within the computer sciences of information retrieval and text mining. They are generative probabilistic models of text corpora inferred by machine learning, and they can be used for retrieval and text mining tasks. The most prominent topic model is latent Dirichlet allocation (LDA), which was introduced in 2003 by Blei et al. and has since sparked the development of other topic models for domain-specific purposes. This thesis focuses on LDA’s practical application. Its main goal is the replication of the data analyses from the 2004 LDA paper “Finding scientific topics” by Thomas Griffiths and Mark Steyvers within the framework of the R statistical programming language and the R package topicmodels by Bettina Grün and Kurt Hornik. The complete process, including extraction of a text corpus from the PNAS journal’s website, data preprocessing, transformation into a document-term matrix, model selection, model estimation, as well as presentation of the results, is fully documented and commented. The outcome closely matches the analyses of the original paper, so the research by Griffiths/Steyvers can be reproduced. Furthermore, this thesis demonstrates the suitability of the R environment for text mining with LDA.
Keywords: latent Dirichlet allocation, LDA, R, topic models, text mining, information retrieval, statistics
Acknowledgements
First and foremost, I want to thank Regina, Helga, Josef and Simon for their unconditional love and faith in me. I owe you more than I can say.
I offer my sincerest gratitude to my thesis supervisor Bettina Grün, who during the last three years supported me with her profound knowledge and a seemingly never-ending patience whilst allowing me to work in my own way. Bettina’s positive outlook and methodical approach to solving problems have been a huge inspiration to me. Thank you.
Ingo Feinerer helped me in setting up my development environment and by providing his dissertation’s LaTeX source as a template, as well as by giving me general and tm-specific advice many times.
I also want to thank my sisters Rita and Barbara for their motivation. Extra hugs for Rita for proofreading the complete thesis.
Last, but not least, I am thankful to the whole staff of the Institute for Statistics and Mathematics, especially Kurt Hornik and Stefan Theußl, for giving me the opportunity to be part of the team last year as wu.cloud administrator.
Contents
1. Introduction and Overview 2
1.1. Motivation and Scope of Thesis . . . 2
1.2. Organisation of Thesis . . . 3
2. Background in Computer Sciences 5
2.1. Information Retrieval . . . 5
2.1.1. Common Information Retrieval Models . . . 7
2.2. Natural Language Processing . . . 9
2.3. Text Mining . . . 12
3. Latent Dirichlet Allocation: a Topic Model 13
3.1. Topic Models . . . 13
3.2. Literature and Impact . . . 13
3.3. Observed Data / Model Input . . . 14
3.4. The Dirichlet Distribution . . . 14
3.5. Generative Process . . . 15
3.6. Model Estimation . . . 21
3.6.1. Collapsed Gibbs Sampling . . . 22
3.7. Model Evaluation and Selection . . . 24
3.7.1. Performance Measurement on Data . . . 24
3.7.2. Performance Measurement on Secondary Tasks . . . 26
3.7.3. Performance Measurement by Human Judgement . . . 26
3.8. Use of Fitted Topic Models . . . 26
3.8.1. Finding Similar Documents Through Querying and Browsing . . . 27
3.8.2. Exploring the Relation between Topics and Corpus Variables . . . 27
3.9. Extending LDA . . . 28
4. LDA in the R Environment 29
4.1. Overview of LDA Implementations . . . 29
4.2. The R Programming Language and Environment . . . 31
4.3. Topic Modeling with the R Packages tm and topicmodels . . . 31
4.3.1. The topicmodels Package . . . 31
4.3.2. Preprocessing with the tm Package . . . 32
4.3.3. Model Selection by Harmonic Mean in topicmodels . . . 33
5. Analyzing the 1991 to 2001 Corpus of PNAS Journal Abstracts 37
5.1. Retrieving and Preprocessing the Corpus . . . 38
5.1.1. Legal Clearance . . . 38
5.1.2. Web Scraping . . . 38
5.1.3. Importing the Corpus and Generating a Document-Term Matrix in R . . . 39
5.2. Model Selection . . . 40
5.2.1. Result of Model Selection . . . 43
5.3. Model Fitting . . . 44
5.4. Relations between Topics and Categories (Year 2001) . . . 44
5.4.1. Preparations and Showing the Category Distribution for 2001 . . 45
5.4.2. Finding the Diagnostic Topics . . . 47
5.4.3. Visualising Topics and Categories . . . 48
5.4.4. Comparison with Original Paper and Interpretation . . . 52
5.5. Hot and Cold Topics . . . 54
5.5.1. Comparison with Original Paper and Interpretation . . . 60
5.6. Document Tagging and Highlighting . . . 62
5.6.1. Finding the Example Abstract from Griffiths/Steyvers (2004) . . 62
5.6.2. Tagging and Highlighting of an Abstract . . . 63
5.6.3. Comparison with Original Paper . . . 63
6. Conclusion and Future Work 64
6.1. Future Work . . . 64
A. Appendix: Software Used 65
B. Appendix: Model Fitting on the Sun Grid Engine 66
B.1. Submitting Sun Grid Engine Jobs . . . 66
C. Appendix: Source Code Listings 67
C.1. beta-distribution.R . . . 67
C.2. dirichlet-3d.R . . . 67
C.3. dirichlet-samples.R . . . 68
C.4. sge-modelselection-chains.R . . . 68
C.5. sge-modelselection-likelihoods.R . . . 69
C.6. modelselection-chain-sizes.R . . . 70
C.7. classifications-fig-category-frequency.R . . . 70
C.8. classifications-fig-levelplot-most-diagnostic.R . . . 71
C.9. classifications-fig-levelplot-most-diagnostic-by-prevalence.R . . . 72
C.10. classifications-table-most-diagnostic-five-terms.R . . . 73
C.11. classifications-fig-levelplot-five-most-diagnostic.R . . . 73
C.12. classifications-original-topics-table.R . . . 74
C.13. trends-table-year-frequencies.R . . . 75
C.14. trends-table-significance.R . . . 75
C.15. trends-fig-all-300.R . . . 76
C.16. trends-fig-five-hot-and-cold.R . . . 76
C.17. trends-table-terms.R . . . 77
C.18. trends-original-terms-table.R . . . 77
C.19. tagging-document-tag-latex.R . . . 77
C.20. tagging-table-topics-most-prevalent.R . . . 80
C.21. appendix-model-300-terms-tables.R . . . 80
D. Appendix: Web Scraping a Corpus of PNAS Journal Abstracts 82
D.1. Site Structure and Content . . . 82
D.2. Implementation . . . 82
D.2.1. Preparing Scrapy . . . 83
D.2.2. Calling Scrapy . . . 83
D.2.3. Selecting All Further Downloads . . . 84
D.2.4. Downloading the Selected Abstracts and Full Texts . . . 84
D.2.5. Cleaning the Downloaded Files . . . 85
D.2.6. Merging and Inspecting the Available Metadata . . . 86
D.2.7. Importing Abstracts and Metadata as a tm Corpus in R . . . 87
D.3. Listings . . . 87
D.3.1. settings.py . . . 87
D.3.2. items.py . . . 88
D.3.3. pnas_spider.py . . . 88
D.3.4. pipelines.py . . . 91
D.3.5. 1-scrape.sh . . . 91
D.3.6. 2-select.py . . . 91
D.3.7. 3-get.py . . . 95
D.3.8. 4-scrub.py . . . 99
D.3.9. 5-zip.py . . . 103
D.3.10. csv_unicode.py . . . 108
D.3.11. opt-categories.py . . . 109
D.3.12. tm-corpus-make.R . . . 110
E. Appendix: PNAS Journal 1991-2001 Corpus: 300 Topics and Their Twelve Most Probable Terms from the LDA Model Sample 112
1. Introduction and Overview
Information overload is one of the problems that have arrived with the Information Age. Huge collections of data, be it literature, emails, web pages or content in other digital media, have grown to sizes that make it hard for humans to handle them easily. It is the computer sciences of information retrieval, text mining and natural language processing that deal with this problem. Researchers from these fields have developed the techniques that help us transform our search queries (on Google or Bing) into meaningful search results, recommend new books to us based on our previous purchases (Amazon) or automatically translate documents between languages.
A new approach to extracting information from digital data are so-called topic models. As the name suggests, they are about modeling which data entities are likely to belong to the same topic or subject. Topic model algorithms enable us to feed a computer program with data and have documents or other content (images or videos) assigned to topics that are meaningful to humans. The topics are the result of an unsupervised learning process and can then be used for searching or browsing the original data collection.
In this thesis, I focus on the topic model latent Dirichlet allocation (Lda), which was first proposed by Blei et al. in 2003. In comparison to other topic models, Lda has the advantage of being a probabilistic model that firstly performs better than alternatives such as probabilistic latent semantic indexing (Plsi) (Blei et al., 2003) and that secondly, as a Bayesian network, is easier to extend to more specific purposes. Variations on the original Lda have led to topic models such as correlated topic models (Ctm), author-topic models (Atm) and hierarchical topic models (Htm), all of which make different assumptions about the data, with each being suited for specific analyses.
1.1. Motivation and Scope of Thesis
Lda and similar topic models are still relatively new and therefore offer many directions for further research. Most available papers focus on improvements to the theoretical foundation, i.e., extensions to the basic Lda model and also better learning or performance evaluation techniques. This thesis explores a less prominent aspect of topic modeling with Lda, namely the practical side.
The initial and also many later papers on Lda give examples of applications in which one or more models are built on document collections. The resulting distributions are then presented in word tables or “refined” in tools using more advanced visualization and/or analysis. Due to the limited space in journals it is common practice that authors only publish the results of their analyses and omit most of the steps that were taken in the process. I have found that authors often publish the modeling software they have written, but not the source code underlying their papers. It is therefore hardly possible to test claims made by the authors, understand the complete context or improve the papers. Instead of trying to solve this general problem of scientific methodology, I have decided to replicate an existing Lda paper and explicitly document all the steps and difficulties involved, to see how much information is actually missing in this specific example and to draw conclusions on the general state of publications on topic models.
My thesis uses the framework of the statistical programming language R. The choice of R is an obvious one, as it is the de facto standard for open-source statistical analyses. It is widely used by academics and businesses, and a large variety of additional packages for solving user tasks is available.
In 2009, Bettina Grün and Kurt Hornik published the R package topicmodels, which includes three open-source implementations of topic models from other authors. The R language itself and an interface to the powerful R text mining package tm make this combination of tools ideal for both topic model research and application.
Consequently, the goals of this thesis are:
to give an introduction to the sciences behind topic models,
to review the literature on Lda and summarize the model,
to demonstrate the use of the R package topicmodels,
to replicate the results of an often cited paper (Griffiths and Steyvers, 2004),
to extend and improve the topicmodels package where needed, and
to identify potential research gaps.
1.2. Organisation of Thesis
In Chapter 2 I introduce the main computer sciences which apply topic models: information retrieval, natural language processing and text mining.
Chapter 3 describes topic models and their use in general. It explains Lda’s theory and the Gibbs estimation algorithm.
Chapter 4 lists available topic model software implementations and then discusses how Lda topic modeling can be accomplished within the environment of the R programming language.
Chapter 5 is the main application part, in which I aim at replicating the results of an often cited paper on Lda (Griffiths and Steyvers, 2004). This involves downloading a large corpus of abstracts from the Pnas journal’s web page, preprocessing the data, performing model selection and fitting, and text mining the corpus via the model data. Chapter 6 concludes this thesis with a summary, the insights gained and the research gaps identified.
The appendix contains source code that is too long to be printed in the main chapters and lists all other resources used for the thesis: a list of the software that was used, a description of the Sun Grid Engine, R source code listings and documentation of the corpus download.
2. Background in Computer Sciences
In this chapter I give short introductions to the computer sciences of information retrieval, natural language processing and text mining, which are closely related and sometimes overlap in their goals and tools. The first probabilistic topic model, latent Dirichlet allocation (Lda), was presented as an information retrieval model (Blei et al., 2003), solvable by machine learning techniques (machine learning being a subdiscipline of the computer science of artificial intelligence). Topic models can also be seen as classical text mining or natural language processing tools, therefore I have opted to shortly describe these fields as well. It should be noted that research in these domains is highly cross-disciplinary and also touches on linguistics, statistics and psychology.
2.1. Information Retrieval
Consider a collection of a hundred thousand or more digital text documents, for example scanned books. You now want to make these data accessible to a normal person through a computer interface. Ideally, the user should be able to enter a few words that describe what he or she is looking for, and immediately after submitting a query the relevant results should be displayed (in a ranked order), serving as an entry point to the full texts. Also, the user should be notified when new documents that are similar to his or her previous results are added to the collection, and furthermore they should also be able to browse the data guided by a meaningful structure instead of just filtering it.
If you were to design and implement a system according to the specifications above, you would invariably face the following problems:
how to match the meaning of a query to that of a relevant text document; consider the case of synonyms (different words with the same meaning) and homographs (same spelling, different meaning),
how to rank search results,
how to find similar documents,
how to process terabytes of data when the user expects instant results, etc.
These and related questions are addressed by the computer science of information retrieval (commonly abbreviated as Ir), which deals with “searching for documents, for information within documents, and for metadata about documents, as well as that of searching relational databases and the World Wide Web” (Wikipedia, 2011a). The history of information retrieval started in the 1950s, when it was one of the first applications of early computers (Stock, 2007). Since then it has become an important aspect of the tasks we expect computers to carry out for us, most prominently shown by our everyday use of web search engines.
The main goals of Ir may thus be reduced to enabling (Baeza-Yates and Ribeiro-Neto, 1999, Chapters 1 and 2):
ad-hoc retrieval,
filtering (of new documents in a collection, based on a user profile), and
browsing.
As explained in Manning et al. (2009), the simplest way to match terms of an ad-hoc retrieval query to documents is a linear scan within the full texts (also called grepping, after the standard Unix program for searching with regular expressions), which returns the passages where the search terms occur. Naturally, this would require the system to scan through the entire collection from beginning to end for each new query, making this approach unfeasible for large databases. A general solution to this problem is to prepare a document index in advance.
An index is a sorted term dictionary that points to where these terms occur in a document collection. For each term in a collection’s vocabulary the index thus records in which documents the term was posted, and optionally other information as well, such as how often the term occurs in general or in a specific document (see Figure 2.1). Such indices are also called inverted indices or inverted lists (Weiss et al., 2005) because they are built by sorting a forward index (a list of words for each document) by words (instead of by documents). The inverted index is “essentially without rivals [. . . ] the most efficient structure for supporting ad hoc text search” (Manning et al., 2009), because:
It can be treated like a matrix, but occupies less space than a regular matrix due to the sparsity of the information, meaning there are relatively few non-zero entries in the matrix. Adding and removing documents can be translated into intuitive operations on the matrix.
It allows us to separate dictionary lists and posting lists, by keeping the terms in the working memory and referencing (the larger) term postings (that are stored on non-volatile storage devices like hard disks) by pointers, thereby optimizing the use of system resources.
One side effect of the use of inverted indices is that the word order in original documents is lost. This can be fixed by storing word positions in the posting lists, at the cost of additionally occupied memory. Most topic models (introduced in Chapter 3.1) rely on input from a so-called document-term matrix, which is essentially an index in the form of one sparse matrix that lists term-per-document frequencies.
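To illustrate the idea, the following is a minimal sketch of an inverted index in R; the two toy documents and the helper objects (docs, index) are invented for this example and not part of any package.

# A toy inverted index: a named list mapping each term to the ids of the
# documents in which it occurs. Real IR systems use far more compact
# and efficient structures.
docs <- list(d1 = c("new", "home", "sales", "top", "forecasts"),
             d2 = c("home", "sales", "rise", "in", "july"))
terms <- sort(unique(unlist(docs)))
index <- lapply(setNames(terms, terms),
                function(t) names(Filter(function(words) t %in% words, docs)))
index[["home"]]   # returns "d1" "d2": the posting list of the term "home"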
Note that indices usually are built after preprocessing the original data. This step might include natural language processing techniques (see Section 2.2) such as tokenization, that is, the decision which sequence of letters is treated as a single term, or the removal of stop words.
Figure 2.1.: Construction of an inverted index from two short documents, adapted from Manning et al. (2009, Figure 1.4).
2.1.1. Common Information Retrieval Models
Having established inverted indices as the most basic representation of documents in information retrieval models we can now look at what types of models are available to accomplish the main goals of Ir.
Figure 2.2 on page 8 is a taxonomy of Ir models that was adapted from Wikipedia (2011a); similar categorizations can be found in Baeza-Yates and Ribeiro-Neto (1999), Stock (2007) and Manning et al. (2009).
The figure arranges Ir models along two dimensions: first by the models’ mathematical basis and, secondly, by the way term-interdependencies are treated. (The following two subsections were quoted verbatim from Wikipedia (2011a) and extended with references where seen as appropriate.)
Categorization by Mathematical Basis
“Set-theoretic models represent documents as sets of words or phrases. Similarities are usually derived from set-theoretic operations on those sets. Common models are:
Standard Boolean model: [Baeza-Yates and Ribeiro-Neto (1999, Ch. 2.5.2)],
Extended Boolean model: [Baeza-Yates and Ribeiro-Neto (1999, Ch. 2.6.2)],
Fuzzy retrieval: [Fox and Sharat (1986), Baeza-Yates and Ribeiro-Neto (1999, Ch. 2.6.1)]
Figure 2.2.: Categorization of IR models, taken from the Wikipedia entry for IR; the original figure was published in Kuropka (2004).
Algebraic models represent documents and queries usually as vectors, matrices, or tuples. The similarity of the query vector and document vector is represented as a scalar value.
Vector space model: [Salton et al. (1975), Manning et al. (2009, Ch. 6.3),]
Generalized vector space model: [Wong et al. (1985), Baeza-Yates and
Ribeiro-Neto (1999, Ch. 2.5.1),]
Extended Boolean model: [Salton et al. (1983), also a set-theoretic model.]
Latent semantic indexing (Lsi), [also known as] latent semantic analysis (Lsa): [see Deerwester et al. (1990), Baeza-Yates and Ribeiro-Neto (1999, Ch. 2.7.2). Lsi can be considered a non-probabilistic topic model.]
Probabilistic models treat the process of document retrieval as a probabilistic inference [problem]. Similarities are computed as probabilities that a document is relevant for a given query. Probabilistic theorems like the Bayes’ theorem are often used in these models.
Binary independence model: [Robertson and Sparck Jones (1976),
Baeza-Yates and Ribeiro-Neto (1999, Ch. 2.5.4).]
Probabilistic relevance model [and its derivative, Okapi BM25:
[. . . ]
Language models, [probability distributions over word sequences, like
n-gram models: Manning et al. (2009, Ch. 12), Zhai (2009).]
[Probabilistic topic models, probabilistic latent semantic indexing (Plsi,
also called Plsa, Hofmann (1999)), latent Dirichlet allocation (Lda, Blei et al. (2003)).]
[. . . ]” (Wikipedia, 2011a)
Categorization by Modelling of Term-Interdependencies
“Models without term-interdependencies treat different terms/words as independent. This fact is usually represented in vector space models by the orthogonality assumption of term vectors or in probabilistic models by an independency assumption for term variables.
Models with immanent term interdependencies allow a representation of interdependencies between terms. However, the degree of the interdependency between two terms is defined by the model itself. It is usually directly or indirectly derived (e.g., by dimensional reduction) from the co-occurrence of those terms in the whole set of documents.
Models with transcendent term interdependencies allow a representation of interdependencies between terms, but they do not allege how the interdependency between two terms is defined. They [rely on] an external source for the degree of interdependency between two terms. (For example a human or sophisticated algorithms.) [. . . ]” (Wikipedia, 2011a)
2.2. Natural Language Processing
As explained above, information retrieval intends to make document collections more accessible via search engines or interfaces for browsing. In contrast to that, the computer science of natural language processing (Nlp) deals with tasks of processing human language. Nlp has its roots in artificial intelligence (Ai) research, but other major scientific domains have approached the same problems as well: linguistics (“computational linguistics”), electrical engineering (“speech recognition”) and psychology (“computational psycholinguistics”) (Jurafsky and Martin, 1999).
Major goals in Nlp are (Jurafsky and Martin, 1999): speech recognition and synthesis, natural language understanding and machine translation.
Nlp tools have often been incorporated into Ir systems to enhance retrieval performance; however, especially with large corpora (i.e., gigabytes of full text), only the most efficient and robust techniques are advised (Kruschwitz, 2005, p. 24).
Figure 2.3 on page 11 shows which Nlp techniques are commonly used in the context of Ir. Order and actual application of the steps in the diagram to both queries and documents depend on the respective goals of an analysis.
The workflow in the figure can be described as follows (Stock, 2007, p. 100); a short R sketch of some of these preprocessing steps with the tm package is given after the list:
If unknown, the character encoding (e.g., Unicode, Ascii, . . . ) of a document or query must be determined.
Next, the language is detected (English, Russian, . . . ), to which the determination of a writing system (Latin alphabet, Cyrillic alphabet, . . . ) is also related, including its respective directionality (left-to-right or right-to-left).
If the document is in a markup format like Html, then the text needs to be separated from layout or navigational elements.
Instead of using words as the basis of an analysis, one can also use a text that is converted into n-grams. In general, an n-gram is a subsequence of n items of a sequence (Wikipedia, 2011d). If n = 1 one speaks of a unigram, an n-gram with n = 2 is referred to as a bigram, one with n = 3 as a trigram. In this context the items are single letters (as opposed to word n-grams, which for example are used in certain language models).
If an Ir model is n-gram based, the steps of single word based processing can be skipped and the n-grams are directly fed into models (Stock, 2007, p. 211).
If the model is based on words, a necessary step is to split a document into single words. This method is called text (or word) segmentation or tokenization (Manning and Schütze, 1999, p. 124–129) and is aided by the fact that most languages delimit words by whitespace.
Words which are of little use to an analysis (“and”, “is”, . . . ) can be tagged and deleted as stop words, which are often listed on a stop list (Manning and Schütze, 1999, p. 533).
When a model relies on direct user input, recognition and correction of spelling errors may be warranted at this point (Stock, 2007, Ch. 18).
To keep certain words from being mangled in a subsequent step, named entity recognition may be applied to tag these words (Stock, 2007, p. 255).
Different word forms (for example “retrieving” and “retrieval”) can then be normalized via conflation (Stock, 2007, p. 227). If just the word stem is needed for retrieval, then one employs stemming (“retriev”). Stemming is commonly implemented as a simple removal of suffixes (or prefixes). If the conflation requires a linguistically correct word form as an outcome (“retrieve”), lemmatization is used. In this case dictionaries might be needed to help map the different word forms to a base form.
Figure 2.3.: Typical application of Nlp techniques for information retrieval models (adapted and translated from Stock (2007, p. 101, Fig. 8.5)).
Part-of-speech tagging is used to mark up word classes within sentences (noun, verb, . . . ; Manning and Schütze, 1999, Ch. 10). A more shallow form thereof is text chunking, which may be applied to identify and tag compound words (also called “phrases”; Stock, 2007, Ch. 15). Information obtained by these techniques may help to build a better inverted index and thus increase retrieval performance.
Moving into the world of semantics (meaning) and similarity, words may be tagged as homonyms (words with the same spelling and pronunciation but different meanings) or synonyms (different words with the same meaning).
If a query should return results in more than one language, then a translation module must be used (Stock, 2007, Ch. 28).
Finally, resolution of anaphora (pronouns like “she” or “he”) and ellipses (omissions in a sentence) may further normalize the meaning of a text (Stock, 2007, Ch. 17).
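As announced above, here is a small, illustrative sketch of how several of the word-based steps (lower-casing, stop word removal, stemming) might be carried out with the R package tm. The two example sentences are invented; the content_transformer wrapper assumes a recent tm version, and stemming additionally requires the SnowballC package.

# Toy corpus and a few typical tm preprocessing transformations.
library(tm)
corpus <- Corpus(VectorSource(c("Retrieving documents is fun!",
                                "Document retrieval and text mining.")))
corpus <- tm_map(corpus, content_transformer(tolower))       # normalise case
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))  # stop word removal
corpus <- tm_map(corpus, stemDocument)                       # crude suffix stripping
inspect(corpus)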
2.3. Text Mining
Text mining is the “process of deriving [(new)] high-quality information from text” (Wikipedia, 2011b). It is a highly cross-disciplinary field that can trace its roots to the theory and practice of data mining (also called “knowledge discovery in databases”, or Kdd; Alexander and Wolff, 2005; Weiss et al., 2005). Due to its relative novelty, text mining often overlaps with information retrieval or natural language processing (Hearst, 1999; Alexander and Wolff, 2005; Wikipedia, 2011c). (Semi-)automated tasks in text mining include:
Classification, or categorization, i.e., “assign[ing] a priori known labels to text documents” (Feinerer, 2008), with applications like spam classification (Berry and Kogan, 2010).
Clustering, or “relationship identification, i.e., finding connections and similarities between distinct subsets of documents in the corpus” (Feinerer, 2008). Topic models are well suited to address this task (Srivastava and Sahami, 2009; Berry and Kogan, 2010).
Keyword extraction (Berry and Kogan, 2010).
Summarization (Fan et al., 2006).
Topic tracking (Fan et al., 2006). Topic models have been proposed specifically for this task (Blei and Lafferty, 2006; Wang and McCallum, 2006; Wei et al., 2007).
Anomaly (novelty) and trend detection and prediction (Weiss et al., 2005).
3. Latent Dirichlet Allocation: a Topic Model
3.1. Topic Models
Topic models are “[probabilistic] latent variable models of documents that exploit the correlations among the words and latent semantic themes” (Blei and Lafferty, 2007). The name “topics” refers to the hidden variable relations (= distributions), which are to be estimated and which link words in a vocabulary to their occurrence in documents. A document is seen as a mixture of topics. This intuitive explanation of how documents can be generated is modeled as a stochastic process which is then “reversed” (Blei and Lafferty, 2009) by machine learning techniques that return estimates of the latent variables. With these estimates it is possible to perform information retrieval or text mining tasks on a document corpus.
3.2. Literature and Impact
Latent Dirichlet allocation (Lda) has been thoroughly explained in the original paper by Blei et al. (2003), and also in Griffiths and Steyvers (2004); Heinrich (2005); Blei and Lafferty (2009); Berry and Kogan (2010); Blei (2011) and others. A video lecture on the topic given by David Blei at the Machine Learning Summer School (Mlss) 2009 in Cambridge is also available online (Blei, 2009).
Lda“[. . . ] has made a big impact in the fields of natural language processing and statistical machine learning” (Wang and McCallum, 2005) and “[. . . ] has quickly become one of the most popular probabilistic text modeling techniques in machine learning and has inspired a series of research works in this direction” (Wei and Croft, 2007).
Among the research that Lda has spawned are different estimation methods (e.g., Griffiths and Steyvers (2004), Asuncion et al. (2009), etc.) and also numerous extensions to the standard Lda model, e.g., hierarchical Dirichlet processes (Hdp, Teh et al., 2006b), dynamic topic models (Dtm, Blei and Lafferty, 2006), correlated topic models (Ctm, Blei and Lafferty, 2007), etc.
What follows in the next sections is an overview of the Lda topic model, largely based on the original authors’ work.
3.3. Observed Data / Model Input
Lda was first applied to text corpora (Blei et al., 2003), although it has later been extended to images (Iwata et al., 2007) and videos (Wang et al., 2007) as well. In this thesis I will only cover analyses of texts. Words (= terms) are the basic unit of data in a document. The set of all words in a corpus is called a vocabulary. Splitting up a document into words is referred to as “tokenization” (also see Section 2.2).
Lda relies on the bag-of-words assumption (Blei et al., 2003), which means that the words in a document are exchangeable and their order therefore is not important. This leads to the representation of a document collection as a so-called document-term matrix (Dtm, sometimes also a term-document matrix, which simply is the transposed Dtm), in which the frequencies of words in documents are captured. Figure 3.1 shows an exemplary document-term matrix of a small corpus.
Figure 3.1.: Exemplary document-term matrix with D = 3 documents and vocabulary of size V = 5 terms.
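A document-term matrix of this kind can be produced directly in R; the following is a sketch with the tm package (see also Chapter 4), where the three short example documents are invented.

# Build a document-term matrix from a toy corpus with tm.
library(tm)
texts <- c("the cat sat on the mat",
           "the dog chased the cat",
           "dogs and cats are pets")
corpus <- Corpus(VectorSource(texts))
dtm <- DocumentTermMatrix(corpus)   # documents as rows, terms as columns
inspect(dtm)                        # cells contain term frequencies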
3.4. The Dirichlet Distribution
Lda’s generative model posits that the characteristics of topics and documents are drawn from Dirichlet distributions.
The Dirichlet distribution is the multivariate generalization of the beta distribution, which itself has been used in Bayesian statistics for modeling belief (see for example Neapolitan, 2003, Chapters 6 and 7; Balakrishnan and Nevzorov, 2003, Ch. 27). The Dirichlet distribution’s probability density function is defined as (see for example Blei and Lafferty, 2009):

p(x \mid \alpha_1, \ldots, \alpha_K) = \frac{\Gamma\left(\sum_{i=1}^{K} \alpha_i\right)}{\prod_{i=1}^{K} \Gamma(\alpha_i)} \prod_{i=1}^{K} x_i^{\alpha_i - 1},    (3.1)
with α being a positive K-vector and Γ denoting the Gamma function, which is a generalization of the factorial function to real values (see for example Neapolitan, 2003, p. 298):

\Gamma(z) = \int_0^{\infty} t^{z-1} e^{-t} \, dt.    (3.2)
Figure 3.2 on page 16 shows the density of Dirichlet distributions with various α parameters over the two-simplex. A greater sum of α (sometimes called “scaling”) leads to a peakier density (Blei, 2009). Values of α smaller than one lead to higher density at the extremes of the simplex.
Figure 3.3 on page 17 shows samples drawn from Dirichlet distributions of order five. The Dirichlets are symmetric, meaning that the vector α contains identical values. As these values approach zero, the draws get sparser, favoring more extreme outcomes on the simplex.
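Draws like those in Figure 3.3 can be generated in base R by normalising independent Gamma variates; the helper function rdirichlet_sym below is hypothetical and only illustrates the principle (the thesis’s own script, dirichlet-samples.R in Appendix C.3, may differ).

# Draw n samples from a symmetric Dirichlet of order k with parameter alpha:
# each row of Gamma(alpha, 1) variates, divided by its row sum, is one draw.
rdirichlet_sym <- function(n, k, alpha) {
  x <- matrix(rgamma(n * k, shape = alpha), nrow = n)
  x / rowSums(x)
}
set.seed(1)
round(rdirichlet_sym(5, 5, alpha = 0.1), 3)   # sparse, extreme draws
round(rdirichlet_sym(5, 5, alpha = 100), 3)   # draws close to uniform (0.2, ..., 0.2)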
3.5. Generative Process
In order to infer topics from a corpus one has to have a certain model of how documents are generated, i.e., how a person with a limited set of tools and information (model parameters) would come up with a new document, word by word (not taking into account correct word order). Lda’s generative model can be summarized as:
1. For each topic: decide which words are likely.
2. For each document:
   a) decide what proportions of topics should be in the document;
   b) for each word:
      i. choose a topic,
      ii. given this topic, choose a likely word (generated in step 1).
The Lda model involves drawing samples from Dirichlet distributions (see previous section) and from multinomial distributions. A generalization of the binomial distribution, the multinomial distribution tells us the probability of observing a count of two or more independent events, given the number of draws and fixed probabilities per outcome that sum to one. In our case the outcomes are terms and topics.
A geometric interpretation of the model is shown in Figure 3.4 on page 18, where a simple generative model of three terms, three topics and the resulting four documents is represented on the two-simplex.
To formally describe the model, some symbols are introduced in Table 3.1 on page 18. The probabilistic generative process is defined (Blei and Lafferty, 2009) as:
1. For each topic k, draw a distribution over words φ_k ∼ Dir(β).
2. For each document d,
   a) draw a vector of topic proportions θ_d ∼ Dir(α);
   b) for each word i,
      i. draw a topic assignment z_{d,i} ∼ Mult(θ_d), z_{d,i} ∈ {1, . . . , K},
      ii. draw a word w_{d,i} ∼ Mult(φ_{z_{d,i}}), w_{d,i} ∈ {1, . . . , V}.
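The process can be simulated directly in R; the sketch below uses a toy vocabulary with invented sizes and hyperparameters and reuses the Gamma-based Dirichlet sampler idea from Section 3.4 (all object names are hypothetical).

# Simulate the LDA generative process for D documents of N words each.
set.seed(42)
K <- 3; V <- 8; D <- 4; N <- 10      # topics, vocabulary size, documents, words per document
alpha <- 0.5; beta <- 0.1
rdir <- function(n, k, a) { x <- matrix(rgamma(n * k, a), n); x / rowSums(x) }
phi   <- rdir(K, V, beta)            # step 1: term distribution per topic
theta <- rdir(D, K, alpha)           # step 2a: topic proportions per document
docs <- lapply(seq_len(D), function(d) {
  z <- sample.int(K, N, replace = TRUE, prob = theta[d, ])               # step 2b-i
  vapply(z, function(k) sample.int(V, 1, prob = phi[k, ]), integer(1))   # step 2b-ii
})
docs[[1]]   # the first simulated document as a vector of vocabulary indices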
Figure 3.2.: Densities of Dirichlet distributions over the two-simplex for α = (4, 4, 2), (2, 4, 4), (2, 4, 2), (2, 2, 2), (1, 1, 1) and (0.5, 0.5, 0.5).
Figure 3.3.: Five samples each from symmetric Dirichlet distributions of order five with α = 0.01, 0.1, 1, 10 and 100 (event probabilities per outcome).
Figure 3.4.: Geometric interpretation of Lda (adapted from Blei et al., 2003, Figure 4).
K specified number of topics
k auxiliary index over topics
V number of words in vocabulary
v auxiliary index over vocabulary terms
d auxiliary index over documents
Nd document length (number of words)
i auxiliary index over words in a document
α positive K-vector
β positive V-vector
Dir(α) a K-dimensional Dirichlet
Dir(β) a V-dimensional Dirichlet
z topic indices: z_{d,i} = k means that the i-th word in the d-th document is assigned to topic k
As an example of a hierarchical Bayesian model (see for example Gelman et al., 2004), Lda can also be denoted as a graphical model in plates notation. Graphical models are a standard way of representing statistical models of “random variables that are linked in complex ways” (Jordan, 2004). Figure 3.5 shows Lda in plates notation, where rectangles signify a repetition over enclosed nodes.
Figure 3.5.: Lda in plates notation (adapted from Blei and Lafferty (2009)).
A translation of this directed acyclic graph (Dag) into a joint probability distribution of independent variables is possible when the Markov condition holds true, which is the case here (see for example Neapolitan, 2003). From Figure 3.5 we therefore deduce the following joint distribution:
p(w, z, \theta, \phi \mid \alpha, \beta) = p(\theta \mid \alpha)\, p(z \mid \theta)\, p(\phi \mid \beta)\, p(w \mid z, \phi).    (3.3)

Marginalizing over the joint distribution will allow us to obtain the sought-after probabilities that are contained in the model.
I will now explain in detail the four factors on the right hand side of the equation above.
First, the topic distribution per document is drawn from the Dirichlet distribution, given the Dirichlet parameter α, which is a K-vector with components α_k > 0:

p(\theta \mid \alpha) = \frac{\Gamma\left(\sum_{k=1}^{K} \alpha_k\right)}{\prod_{k=1}^{K} \Gamma(\alpha_k)}\, \theta_1^{\alpha_1 - 1} \cdots \theta_K^{\alpha_K - 1} = \frac{\Gamma(\alpha_\cdot)}{\prod_{k=1}^{K} \Gamma(\alpha_k)} \prod_{k=1}^{K} \theta_k^{\alpha_k - 1}.
The second, shorter version of this equation introduces the use of the dot operator (·) in the variable index as a shorthand for summation over all the variables’ values. The following table is an example of three such per-document distributions:
The next component of the joint distribution is the distribution of the topic-to-word assignments in the corpus, z, which depends on the aforementioned distribution θ. Each word w_i in a document of N words is therefore assigned a value from 1, . . . , K. The next table shows an example of this:
In the joint distribution we express the probability of z for all documents and topics in terms of the word count n_{d,k,·}, which is the number of times topic k has been assigned to any word in document d:

p(z \mid \theta) = \prod_{d=1}^{D} \prod_{k=1}^{K} \theta_{d,k}^{n_{d,k,\cdot}}.
The term distributions per topic of the whole corpus, φ_k, are (again) drawn from a Dirichlet distribution, with parameter β. This is an example of some possible topic distributions for four topics:
φ_{k,v} gives us the probability that term v is drawn when the topic was chosen to be k. We express the probability of φ for all topics and all words of the vocabulary as:

p(\phi \mid \beta) = \prod_{k=1}^{K} \frac{\Gamma(\beta_{k,\cdot})}{\prod_{v=1}^{V} \Gamma(\beta_{k,v})} \prod_{v=1}^{V} \phi_{k,v}^{\beta_{k,v} - 1}.
Finally, the probability of a corpus w given its parents z and φ in the graphical model is:

p(w \mid z, \phi) = \prod_{k=1}^{K} \prod_{v=1}^{V} \phi_{k,v}^{n_{\cdot,k,v}},

where n_{·,k,v} is the count of how many times topic k was assigned to vocabulary term v in the whole corpus.
Multiplying the four factors and collecting terms yields the full joint distribution:

p(w, z, \theta, \phi \mid \alpha, \beta) = p(\theta \mid \alpha)\, p(z \mid \theta)\, p(\phi \mid \beta)\, p(w \mid z, \phi)
= \left( \prod_{d=1}^{D} \frac{\Gamma(\alpha_\cdot)}{\prod_{k=1}^{K} \Gamma(\alpha_k)} \prod_{k=1}^{K} \theta_{d,k}^{\alpha_k - 1} \right) \left( \prod_{d=1}^{D} \prod_{k=1}^{K} \theta_{d,k}^{n_{d,k,\cdot}} \right) \times \left( \prod_{k=1}^{K} \frac{\Gamma(\beta_{k,\cdot})}{\prod_{v=1}^{V} \Gamma(\beta_{k,v})} \prod_{v=1}^{V} \phi_{k,v}^{\beta_{k,v} - 1} \right) \left( \prod_{k=1}^{K} \prod_{v=1}^{V} \phi_{k,v}^{n_{\cdot,k,v}} \right)
= \left( \prod_{d=1}^{D} \frac{\Gamma(\alpha_\cdot)}{\prod_{k=1}^{K} \Gamma(\alpha_k)} \prod_{k=1}^{K} \theta_{d,k}^{\alpha_k + n_{d,k,\cdot} - 1} \right) \times \left( \prod_{k=1}^{K} \frac{\Gamma(\beta_{k,\cdot})}{\prod_{v=1}^{V} \Gamma(\beta_{k,v})} \prod_{v=1}^{V} \phi_{k,v}^{\beta_{k,v} + n_{\cdot,k,v} - 1} \right).    (3.4)
We now marginalize the latent variables out in order to be able to write down a model’s probability when a corpus w and the hyperparameters (α and β) are given. This probability is needed for a “maximum-likelihood estimation of the model parameters and to infer the distribution of latent variables” (Chang and Yu, 2007):

p(w \mid \alpha, \beta) = \int_{\phi} \int_{\theta} \sum_{z} \left( \prod_{d=1}^{D} \frac{\Gamma(\alpha_\cdot)}{\prod_{k=1}^{K} \Gamma(\alpha_k)} \prod_{k=1}^{K} \theta_{d,k}^{\alpha_k + n_{d,k,\cdot} - 1} \right) \times \left( \prod_{k=1}^{K} \frac{\Gamma(\beta_{k,\cdot})}{\prod_{v=1}^{V} \Gamma(\beta_{k,v})} \prod_{v=1}^{V} \phi_{k,v}^{\beta_{k,v} + n_{\cdot,k,v} - 1} \right) d\theta\, d\phi.    (3.5)
The sum over all possible combinations of topic assignments z (which determine the counts n_{d,k,·} and n_{·,k,v}) makes this probability computationally intractable (Chang and Yu, 2007; Blei et al., 2003) and therefore prohibits the use of the conventional expectation maximization algorithm. Instead, we have to resort to machine learning techniques to find good approximations of the marginal probability (see next section).
3.6. Model Estimation
The following learning algorithms have been used to infer topic models’ (latent) variables (Asuncion et al., 2009; Blei and Lafferty, 2009):
Maximum likelihood (Ml) estimation (Hofmann, 1999) for the Plsa model (Lda’s predecessor without Dirichlet priors), where α and β are ignored, and φ and θ are treated as parameters.
Maximum a posteriori (Map) estimation (Chien and Wu, 2008), an extension of Ml estimation.
Variational Bayesian inference (Vb, Blei et al., 2003), where in the smoothed model version z, φ and θ are hidden variables and the posterior distribution is approximated using a variational distribution.
Collapsed variational Bayesian inference (Teh et al., 2006a), where the variables φ and θ are marginalized (“collapsed”).
Collapsed Gibbs sampling (Griffiths and Steyvers, 2004), a Markov chain Monte Carlo (Mcmc)-based method, described in more detail in the next subsection.
Expectation propagation (Ep) (Minka and Lafferty, 2002; Griffiths and Steyvers, 2004).
Having reviewed previous work in Teh et al. (2006a); Mukherjee and Blei (2009); Welling et al. (2008); Girolami and Kabán (2003) and having compared these algorithms, Asuncion et al. (2009) come to the following conclusions:
“Update equations in these algorithms are closely connected”, and
“[...] using the appropriate hyperparameters causes the performance difference between these algorithms to largely disappear”. Hyperparameters can be learned during model training through various approaches (listed in Asuncion et al. (2009)).
Parallelizing the algorithms is possible and allows close to real-time analysis of small corpora (D = 3000 documents).
In the same paper the authors propose an optimized version of collapsed variational Bayesian inference that offers a slight performance advantage compared to the other algorithms (“Cvb0”).
In this thesis I focus on one of the inference algorithms, namely Gibbs sampling, which I describe in the next subsection.
3.6.1. Collapsed Gibbs Sampling
Computing the marginal distribution (Equation 3.5 on page 21) “involves evaluating a probability distribution on a large discrete state space” (Steyvers et al., 2004). The paper by Steyvers et al. was the first to show how Gibbs sampling could be employed to estimate Lda’s latent variables. The Gibbs sampling algorithm is a typical Markov chain Monte Carlo (Mcmc) method and was originally proposed for image restoration (Geman and Geman, 1984). Mcmc methods solve the problem of obtaining samples from complex probability distributions by the use of random numbers (MacKay, 2005, Ch. 29).
“Gibbs sampling (also known as alternating conditional sampling) [. . . ] simulates a high-dimensional distribution by sampling on lower-dimensional subsets of variables where each subset is conditioned on the value of all others. The sampling is done sequentially and proceeds until the sampled values approximate the target distribution.” (Steyvers and Griffiths, 2007). Applied to the Lda model, we need the “probability of topic z_{a,b} being assigned to w_{a,b}, the b-th word of the a-th document, given z_{-(a,b)}, all the other topic assignments to all the other words” (Carpenter, 2010):

p(z_{a,b} \mid z_{-(a,b)}, w, \alpha, \beta),    (3.6)
which is proportional to (Carpenter, 2010; Chang and Yu, 2007):

p(z_{a,b} \mid z_{-(a,b)}, w, \alpha, \beta) \propto p(w, z \mid \alpha, \beta) = \int_{\theta} \int_{\phi} p(w, z, \theta, \phi \mid \alpha, \beta) \, d\theta\, d\phi.    (3.7)
A series of expansions and equation transformations (Carpenter, 2010; Elkan, 2010) results in the unnormalized conditional probability (Griffiths and Steyvers, 2004; Steyvers and Griffiths, 2007) (a more general derivation can be found in Heinrich (2005)):

p(z_{a,b} \mid z_{-(a,b)}, w, \alpha, \beta) \propto \frac{n_{z_{a,b},a,\cdot}^{-(a,b)} + \alpha}{n_{\cdot,a,\cdot}^{-(a,b)} + K\alpha} \times \frac{n_{z_{a,b},\cdot,w_{a,b}}^{-(a,b)} + \beta}{n_{z_{a,b},\cdot,\cdot}^{-(a,b)} + J\beta}.    (3.8)
If asymmetric priors α and β are allowed and the topic-related denominator is omitted (which is the same for every word in the vocabulary), we arrive at the following conditional probability (Carpenter, 2010):

p(z_{a,b} \mid z_{-(a,b)}, w, \alpha, \beta) \propto \left( n_{z_{a,b},a,\cdot}^{-(a,b)} + \alpha_{z_{a,b}} \right) \times \frac{n_{z_{a,b},\cdot,w_{a,b}}^{-(a,b)} + \beta_{w_{a,b}}}{n_{z_{a,b},\cdot,\cdot}^{-(a,b)} + \sum_{j=1}^{J} \beta_j}.    (3.9)
“The first multiplicand in the numerator, n_{z_{a,b},a,\cdot}^{-(a,b)} + \alpha_{z_{a,b}}, is just the number of other word[s] in document a that have been assigned to topic z_{a,b} plus the topic prior. The second multiplicand in the numerator, n_{z_{a,b},\cdot,w_{a,b}}^{-(a,b)} + \beta_{w_{a,b}}, is the number of times the current word w_{a,b} has been assigned to topic z_{a,b} plus the word prior. The denominator just normalizes the second term to a probability.” (Carpenter, 2010). The normalized conditional probability is:
p(z_{a,b} \mid z_{-(a,b)}, w, \alpha, \beta) = \frac{\left( n_{z_{a,b},a,\cdot}^{-(a,b)} + \alpha_{z_{a,b}} \right) \times \dfrac{n_{z_{a,b},\cdot,w_{a,b}}^{-(a,b)} + \beta_{w_{a,b}}}{n_{z_{a,b},\cdot,\cdot}^{-(a,b)} + \sum_{j=1}^{J} \beta_j}}{\sum_{k=1}^{K} \left( n_{k,a,\cdot}^{-(a,b)} + \alpha_k \right) \times \dfrac{n_{k,\cdot,w_{a,b}}^{-(a,b)} + \beta_{w_{a,b}}}{n_{k,\cdot,\cdot}^{-(a,b)} + \sum_{j=1}^{J} \beta_j}}.    (3.10)
The learning algorithm can be summarized (Steyvers and Griffiths, 2007) as:
1. Initialize the topic-to-word assignments z randomly from {1, . . . , K}.
2. For each Gibbs sample:
   a) “For each word token, the count matrices n^{-(a,b)} are first decremented by one for the entries that correspond to the current topic assignment.” (Steyvers and Griffiths, 2007)
   b) A new topic is sampled from the distribution in Equation 3.10.
   c) The count matrices are updated by incrementing by one at the new topic assignment.
3. Discard samples during the initial burn-in period.
4. After the Markov chain has reached a stationary distribution, i.e., the posterior distribution over topic assignments, samples can be taken at a fixed lag (averaging over Gibbs samples is recommended for statistics that are invariant to the ordering of topics (Griffiths and Steyvers, 2004)).
The other latent variables θ and φ, which are the interest of most analyses, can be estimated from a single Gibbs sample of z (Griffiths and Steyvers, 2004):

\hat{\theta}_{d,k} = \frac{\alpha_k + n_{d,k,\cdot}}{\alpha_\cdot + n_{d,\cdot,\cdot}}    (3.11)

\hat{\phi}_{k,v} = \frac{\beta_{k,v} + n_{\cdot,k,v}}{\beta_{k,\cdot} + n_{\cdot,k,\cdot}}    (3.12)
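To make the mechanics concrete, the following is a minimal collapsed Gibbs sampler written in plain R, following the update of Equation 3.10 with symmetric priors and the point estimates of Equations 3.11 and 3.12. It is an illustrative sketch, not the implementation used in this thesis (the topicmodels package wraps compiled C/C++ code); docs is assumed to be a list of integer vectors holding the vocabulary indices of each document’s words.

# Minimal collapsed Gibbs sampler for LDA with symmetric priors.
gibbs_lda <- function(docs, K, V, alpha = 0.1, beta = 0.1, iter = 200) {
  D <- length(docs)
  ndk <- matrix(0, D, K)   # n_{d,k,.}: topic counts per document
  nkv <- matrix(0, K, V)   # n_{.,k,v}: term counts per topic
  nk  <- numeric(K)        # n_{.,k,.}: total words assigned to each topic
  z <- lapply(docs, function(w) sample.int(K, length(w), replace = TRUE))
  for (d in seq_len(D)) for (i in seq_along(docs[[d]])) {
    k <- z[[d]][i]; v <- docs[[d]][i]
    ndk[d, k] <- ndk[d, k] + 1; nkv[k, v] <- nkv[k, v] + 1; nk[k] <- nk[k] + 1
  }
  for (it in seq_len(iter)) for (d in seq_len(D)) for (i in seq_along(docs[[d]])) {
    v <- docs[[d]][i]; k <- z[[d]][i]
    # decrement counts for the current assignment
    ndk[d, k] <- ndk[d, k] - 1; nkv[k, v] <- nkv[k, v] - 1; nk[k] <- nk[k] - 1
    # full conditional (Equation 3.10, symmetric priors, up to normalisation)
    p <- (ndk[d, ] + alpha) * (nkv[, v] + beta) / (nk + V * beta)
    k <- sample.int(K, 1, prob = p)
    z[[d]][i] <- k
    ndk[d, k] <- ndk[d, k] + 1; nkv[k, v] <- nkv[k, v] + 1; nk[k] <- nk[k] + 1
  }
  # point estimates from the final sample (Equations 3.11 and 3.12)
  list(theta = (ndk + alpha) / rowSums(ndk + alpha),
       phi   = (nkv + beta) / rowSums(nkv + beta),
       z     = z)
}

The sampler could, for example, be run on a small corpus in this format (such as the toy corpus simulated in Section 3.5) via gibbs_lda(docs, K = 3, V = 8); burn-in and lag handling are omitted for brevity.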
3.7. Model Evaluation and Selection
Model evaluation, i.e., measuring a model’s performance, is needed to ensure that the (topic) model is able to generalize from the training data in a useful way. One also speaks of model selection when the decision is taken of “what model(s) to use for inference”, given data x and potential fitted models (Burnham and Anderson, 2002, Ch. 1). For instance, a common problem in topic modeling is to choose the number of topics if this parameter is not specified a priori (Blei and Lafferty, 2009). Depending on the goals and available means, a researcher can employ a variety of performance metrics (Wallach et al., 2009; Buntine, 2009; Chang and Blei, 2009), which I list in the next subsections.
3.7.1. Performance Measurement on Data
Most topic model papers measure the performance of the learned model on held-out (i.e., non-training, or test) data (with the Pnas analysis in Griffiths and Steyvers (2004) being an exception, where model selection is performed on the complete corpus). “Estimating the probability of held-out documents provides a clear, interpretable metric for evaluating the performance of topic models relative to other topic-based models as well as to other non-topic-based generative models” (Wallach et al., 2009). These metrics are well suited for qualitative goals, “such as corpus exploration [and] choosing the number of topics that provides the best language model” (Blei and Lafferty, 2009).
A related approach is “document completion”, namely to divide documents (as opposed to dividing corpora) into training and test sets (Rosen-Zvi et al., 2004; Wallach et al., 2009). “While this approach is reasonable, it [. . . ] does not address the whole problem, the quality of individual documents” (Buntine, 2009).
Perplexity “[. . . ], used by convention in language modeling, is monotonically decreasing in the likelihood of the test data, and is algebraically equivalent to the inverse of the geometric mean per-word likelihood. A lower perplexity score indicates better generalization performance” (Blei et al., 2003), see also Rosen-Zvi et al. (2004).
Empirical likelihood, see Li and McCallum (2006).
Marginal likelihood, which can be approximated by:
– Harmonic mean method, see Newton and Raftery (1994); Griffiths and Steyvers (2004); Buntine and Jakulin (2004); Griffiths et al. (2005); Wallach (2006) and the next subsection.
– (Annealed/mean field) importance sampling, see Neal (2001); McCallum (2002); Li and McCallum (2006); Buntine (2009).
– Chib-style estimation, see Chib (1995); Murray and Salakhutdinov (2009).
– Left-to-right samplers, see Del Moral et al. (2006); Wallach (2008); Buntine (2009).
In Wallach et al. (2009) it was found that “the Chib-style estimator and ‘left-to-right’ algorithm presented in this paper provide a clear methodology for accurately assessing and selecting topic models”. Further research in Buntine (2009) shows that a sequential version of Wallach’s left-to-right sampler and the even more efficient mean field importance sampler are optimal for measuring marginal model likelihood on held-out data.
Harmonic Mean Method
This method was first applied by Griffiths and Steyvers in their 2004 Bayesian approach to finding the optimal number of topics. The method has since then been used in a variety of papers (Griffiths et al., 2005; Zheng et al., 2006; Wallach, 2006) because of its simplicity and relative computational efficiency, but has also been shown to be unstable (Wallach et al., 2009; Buntine, 2009).
In its effort to replicate the findings of Griffiths and Steyvers, this thesis also makes use of the harmonic mean method for model selection. The method is best summarized by a direct quote from their paper (symbol notation, literature and equation references have been adapted accordingly):
“In our case, the data are the words in the corpus, w, and the model is specified by the number of topics, K, so we wish to compute the likelihood p(w|K). The complication is that this requires summing over all possible assignments of words to topics z [, i.e., p(w|K) = ∫ p(w|z, K) p(z) dz]. However, we can approximate p(w|K) by taking the harmonic mean of a set of values of p(w|z, K) when z is sampled from the posterior p(z|w, K) [(Kass and Raftery, 1995, Section 4.3)]. Our Gibbs sampling algorithm provides such samples, and the value of p(w|z, K) can be computed from Eq. 3.13[:]”

p(w \mid z) = \left( \frac{\Gamma(V\beta)}{\Gamma(\beta)^V} \right)^K \prod_{k=1}^{K} \frac{\prod_{w=1}^{V} \Gamma(n_k^{(w)} + \beta)}{\Gamma(n_k^{(\cdot)} + V\beta)},    (3.13)

“[. . . ] in which n_k^{(w)} is the number of times word w has been assigned to topic k in the vector of assignments z, and Γ(·) is the standard Gamma function” (Griffiths and Steyvers, 2004).
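A minimal sketch of this estimator in R is given below; it assumes that log_lik is a numeric vector of log p(w|z, K) values collected from the Gibbs samples (for instance, the log-likelihoods that topicmodels stores when a model is fitted with the keep control option), and it works on the log scale to avoid numerical underflow.

# Harmonic mean of likelihood values supplied on the log scale,
# computed stably with a log-sum-exp style shift.
harmonic_mean <- function(log_lik) {
  m <- max(-log_lik)                      # shift before exponentiating
  -(log(mean(exp(-log_lik - m))) + m)     # log of the harmonic mean
}
# Example with made-up values: approximate log p(w | K) for one candidate K.
harmonic_mean(c(-279000, -278500, -278800))

Repeating this for a range of candidate numbers of topics K and comparing the resulting log-likelihoods is the model selection strategy replicated in Chapter 5.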
3.7.2. Performance Measurement on Secondary Tasks
In some scenarios the model can be evaluated by cross validation on the error of an external task at hand, such as document classification or information retrieval (Wei and Croft, 2006; Titov and McDonald, 2008).
3.7.3. Performance Measurement by Human Judgement
Steyvers and Griffiths (2007) “showed that the number of topics a word appears in correlates with how many distinct senses it has and reproduced many of the metrics used in the psychological community based on human performance” (Chang and Blei, 2009). Following up on this line of thought, Chang and Blei (2009) presented two metrics that allow for a better evaluation of a topic model’s latent structure by using input from humans:
Word intrusion “measures how semantically ‘cohesive’ the topics inferred by a model are and tests whether topics correspond to natural groupings for humans” (Chang and Blei, 2009).
Topic intrusion “measures how well a topic model’s decomposition of a document as a mixture of topics agrees with human associations of topics with a document” (Chang and Blei, 2009).
The result of Chang and Blei (2009) was that for three topic models (Plsi, Lda, Ctm) traditional measures were “indeed, negatively correlated with the measures of topic quality developed in [their] paper”. The authors therefore suggest that topic model developers should “focus on evaluations that depend on real-world task performance rather than optimizing likelihood-based measures” – a view which is re-iterated by Blei (2011).
3.8. Use of Fitted Topic Models
“Representing the content of words and documents with probabilistic topics has one distinct advantage over a purely spatial representation. Each topic is individually interpretable, providing a probability distribution over words that picks out a coherent cluster [of correlated terms]” (Steyvers and Griffiths, 2007). Topics can thus be regarded as “soft clusters” (Heinrich, 2005; Newman et al., 2006), which makes them candidates for a multitude of applications.
3.8.1. Finding Similar Documents Through Querying and Browsing
Lda and topic models in general can be applied for the classical tasks of information retrieval, i.e., ad-hoc retrieval, browsing and filtering (see Section 2.1). “Querying can be considered topic estimation of unknown documents (queries) and comparing these topic distributions to those of the known documents. Appropriate similarity measures permit ranking.” (Heinrich, 2005).
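One simple, hedged illustration of such a comparison in R: rank known documents against a query by the Hellinger distance between topic distributions, one common choice of similarity measure. Here theta is assumed to be the documents-by-topics matrix of a fitted model and q the query’s estimated topic distribution; both objects are hypothetical placeholders.

# Hellinger distance between two discrete distributions p and q.
hellinger <- function(p, q) sqrt(0.5 * sum((sqrt(p) - sqrt(q))^2))
# Distance of every document's topic distribution to the query's, smallest first.
ranking <- order(apply(theta, 1, hellinger, q = q))
head(ranking)   # indices of the most similar documents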
Enabling the exploration of corpora through browsing is Lda’s major strength (Blei et al., 2003; Blei and Lafferty, 2009). Links to online examples can be found on David Blei’s homepage http://www.cs.princeton.edu/~blei/topicmodeling.html.
Examples of Lda’s use in ad-hoc retrieval have been shown in Azzopardi et al. (2004); Wei and Croft (2006, 2007); Yi and Allan (2008). Filtering new documents in databases and recommender systems are other obvious, though so far unexplored, applications.
3.8.2. Exploring the Relation between Topics and Corpus Variables
Related to browsing and text mining, it is possible to combine topics with secondary corpus variables that were not part of the training process. For the Lda model this was first demonstrated in Griffiths and Steyvers (2004), where two document metavariables, namely year of publication and category designated by the authors, were used to gain additional insight into the corpus structure and model semantics. For example, to find out topic trends over the years, the per-document topic distributions were aggregated by averaging over all documents from the same year. This concept is illustrated in Figure 3.6.
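In R, such an aggregation can be sketched as follows, assuming a fitted topicmodels object fit and a vector years that holds the publication year of each document in the same order as the rows of the document-term matrix (both objects are placeholders for this example).

# Average per-document topic proportions by publication year.
library(topicmodels)
theta <- posterior(fit)$topics                                  # documents x topics
mean_by_year <- aggregate(as.data.frame(theta),
                          by = list(year = years), FUN = mean)  # one row per year
head(mean_by_year)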
3.9. Extending LDA
Since the publication of the Lda topic model (Blei et al., 2003) hundreds of papers have been written on the subject. Among these have been improvements to the model estimation (e.g., distributed inference algorithms, see Newman et al. (2007); Wang (2008); Newman et al. (2009)) and dozens of new model specifications that take into account not only documents but corresponding metadata like date (dynamic topic models, see Blei and Lafferty (2006)) or authors (author-topic model, see Rosen-Zvi et al. (2004, 2005); Shen et al. (2006)). A current bibliography of topic modeling by David Mimno is available at: http://www.cs.princeton.edu/~mimno/topics.html. Discussions and announcements on topic modeling are carried out on David Blei’s topic models mailing list.
4. LDA in the R Environment
This chapter gives an overview of current software implementations of the Lda topic model and describes in more detail the R packages tm and topicmodels, which I use for the data analysis.
4.1. Overview of LDA Implementations
The first Lda software was published by David Blei in 2004 as lda-c and implements the variational inference described in his earlier paper (Blei et al., 2003). Other researchers have since implemented Lda and many other topic models, so that today there exists a variety of both open and closed source topic modeling software. Table 4.1 on page 30 lists packages that allow model fitting by Lda. This list is not exhaustive and omits stand-alone implementations of other topic models.
It should be noted that Lda can also be implemented in general machine learning frameworks, examples thereof are:
HBC (Hierarchical Bayes Compiler), see http://www.cs.utah.edu/~hal/HBC/.
WinBUGS, see http://www.mrc-bsu.cam.ac.uk/bugs/winbugs/contents.shtml and http://www.arbylon.net/projects/ for an Lda script by Gregor Heinrich.
Apache Mahout, see https://cwiki.apache.org/confluence/display/
Package | Models | Language | Author(s) | License
Gensim | Lsa, Lda | Python | Radim Řehůřek | Gnu Lgpl
GibbsLDA++ / JGibbLDA | Lda | C++ / Java | Xuan-Hieu Phan and Cam-Tu Nguyen | Gnu Lgpl 2
Infer.NET | Lda and others | .NET languages | Microsoft Research Cambridge | non-commercial use only
lda | Lda and others | R | Jonathan Chang | Gnu Lgpl
lda | Lda | Matlab or C | Daichi Mochihashi | n.a.
lda-c | Lda | C | David Blei | Gnu Lgpl 2
lda-j | Lda | Java | Gregor Heinrich | Gnu Lgpl 2
Lda, Plsa | Lda and Plsa | Matlab | Jakob Verbeek | n.a.
lda1pfs | Lda (1Pfs inference) | Java | Anthony DiFranco | mostly Gpl 3
LingPipe | Lda and others | Java | Alias-i, Inc | proprietary license
Lda with Gibbs sampler | Lda | Python | Mathieu Blondel | n.a.
lsa-lda | Lsa, Lda | C | n.a. | Gnu Gpl 2
MALLET | Lda and others | Java | MALLET team | Common Public License Version 1.0 (Cpl)
Matlab Topic Modeling Toolbox | Lda, At, Hmm-Lda, Lda-Col | Matlab | Mark Steyvers, Tom Griffiths | free of charge for research and education
Nonparametric Bayesian Mixture Models | Lda and others | Matlab, C | Yee Whye Teh | free for research or education
onlinelda | Online inference for Lda | Python | Matt Hoffman | Gnu Gpl 3
plda | Parallel Lda | C++ | Zhiyuan Liu, Yuzhou Zhang, Edward Y. Chang, Maosong Sun | Apache License 2.0
streamLDA | Lda (online/streamed) | Python | Matthew D. Hoffman, Jessy Cowan-Sharp, Jordan Boyd-Graber | Gnu Gpl 3
tmve | Topic Model Visualization Engine (for lda-c) | Python | Allison Chaney | Mit license
topicmodels | Lda, Ctm | R (C, C++) | Bettina Grün, Kurt Hornik | Gnu Gpl 2
Yahoo! Lda | Parallel Lda (Hadoop) | C++ | Shravan Narayanamurthy | Apache License 2.0

Table 4.1.: Lda implementations (2011).
4.2. The R Programming Language and Environment
The practical part of this thesis consists of topic modeling and corpus exploration using the R programming language. The following description is a direct quote from the R homepage (R Development Core Team, 2011):
“R is a language and environment for statistical computing and graphics. It is a Gnu project which is similar to the S language and environment which was developed at Bell Laboratories (formerly At&t, now Lucent Technologies) by John Chambers and colleagues. R can be considered as a different implementation of S. There are some important differences, but much code written for S runs unaltered under R. R provides a wide variety of statistical (linear and nonlinear modeling, classical statistical tests, time-series analysis, classification, clustering, ...) and graphical techniques, and is highly extensible. The S language is often the vehicle of choice for research in statistical methodology, and R provides an Open Source route to participation in that activity.”
R is available on Unix-like platforms (Unix, FreeBSD, Linux), Windows and MacOS and its source is openly available under the Gnu General Public License 2 (Gpl) (R Development Core Team, 2011). The program can be invoked either from the command line in batch mode, where complete R scripts are read and evaluated by the interpreter, or run as an interactive session, where typed-in expressions are read, evaluated and the results printed immediately. There also exists a variety of graphical user interfaces (Guis) which wrap the base program for users who prefer integrated development environments instead of a command line, e.g. RStudio. Furthermore, text editors like Emacs or vim offer modes which make writing and testing R scripts easier.
The standard capabilities of R are extensible by so-called packages which can be installed by the user and can be loaded from within an R session. A central repository of such additional packages by third parties (and the R code and documentation itself) is the Comprehensive R Archive Network (Cran, http://cran.r-project.org/). Cran mirrors are available around the world and currently host more than 3,400 packages from a wide range of categories, such as econometrics, graphics or machine learning.
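For example, the two packages used in the remainder of this thesis can be installed from a Cran mirror and loaded from within an R session as follows (assuming a working internet connection):

install.packages(c("tm", "topicmodels"))   # fetch and install from a Cran mirror
library("tm")                              # load the packages into the session
library("topicmodels")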
4.3. Topic Modeling with the R Packages tm and topicmodels
4.3.1. The topicmodels Package
The R package topicmodels (Grün and Hornik, 2011) was written by Bettina Grün and Kurt Hornik and is available from Cran servers. It includes and provides interfaces to David Blei’s open source implementations of Lda (lda-c) and Ctm (ctm-c), both written in the C programming language and using the Vem algorithm for inference.
Also included as an alternative inference method for Lda models is the open source Gibbs sampler GibbsLDA++ by Xuan-Hieu Phan and Cam-Tu Nguyen.
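A minimal sketch of fitting these models with topicmodels is given below; dtm is assumed to be a document-term matrix created with tm, k = 50 is an arbitrary choice for the number of topics, and the Gibbs control parameters are illustrative values only.

library("topicmodels")

fit_vem   <- LDA(dtm, k = 50)                    # Lda fitted by the Vem algorithm (default)
fit_gibbs <- LDA(dtm, k = 50, method = "Gibbs",  # Lda fitted by the GibbsLDA++ sampler
                 control = list(burnin = 1000, thin = 100, iter = 1000))
fit_ctm   <- CTM(dtm, k = 50)                    # correlated topic model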
Introductions and examples for topicmodels are available from:

the package reference documentation, which can be viewed via the R help system or as a single PDF file (topicmodels.pdf at http://cran.r-project.org/package=topicmodels), or

the authors’ paper in the Journal of Statistical Software (Grün and Hornik, 2011).

For data input the topicmodels package depends on the package tm (text mining), which is described in the next subsection.
4.3.2. Preprocessing with the tm Package
tm is a framework for handling text mining tasks within R by Ingo Feinerer (Feinerer, 2008). The tm framework is designed as middleware between the R system environment and the application layer, which in the case of this thesis is the R package topicmodels. It enables the user to import data from various formats via so-called “readers” and offers standard tools for preprocessing and further transformation and filtering of texts and text collections. Furthermore, it handles the creation of document-term matrices, a feature which topicmodels makes use of.
A list of all tm readers is shown in Table 4.2. If available, metadata will be stored with the documents and the corpus.
Reader Description
readPlain() Read in files as plain text ignoring metadata
readRCV1() Read in files in Reuters Corpus Volume 1 Xml format
readReut21578XML() Read in files in Reuters 21578 Xml format
readGmane() Read in Gmane Rss feeds
readNewsgroup() Read in newsgroup postings in Uci Kdd archive format
readPDF() Read in Pdf documents
readDOC() Read in MS Word documents
readHTML() Read in simply structured Html documents

Table 4.2.: tm readers (Feinerer, 2008, Table 4.1).
After reading, corpora can be transformed and filtered (i.e., searched) with the functions tm_map() and tm_filter(). Available text and corpus transformations include:

removal of punctuation, numbers, whitespace and stop words,
conversion to lower case, and
stemming.
Linguistic models that are based on the bag-of-words assumption (like Lda) rely on document-term matrices (see Section 3.3 on page 14). In tm these matrices are constructed from the corpus as instances of the class DocumentTermMatrix, and the following options can be specified (a short example follows the list):
different tokenizers (see Section 2.2),
matching against a dictionary,
minimum term frequencies,
minimum word lengths, and
most of the standard transformations (see above).
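The following sketch combines the transformations and matrix options listed above; texts is a hypothetical character vector of raw documents, and the control option names (e.g. minWordLength, minDocFreq) follow the tm version used here and may differ in later releases of the package.

library("tm")

# Build a corpus from raw text and apply some standard transformations.
corpus <- Corpus(VectorSource(texts))
corpus <- tm_map(corpus, stripWhitespace)                    # collapse whitespace
corpus <- tm_map(corpus, removeWords, stopwords("english"))  # remove stop words

# Document-term matrix; the remaining transformations and filters can be
# specified directly via the control list.
dtm <- DocumentTermMatrix(corpus,
                          control = list(tolower           = TRUE,  # lower-case conversion
                                         removePunctuation = TRUE,
                                         removeNumbers     = TRUE,
                                         minWordLength     = 3,     # minimum word length
                                         minDocFreq        = 2))    # minimum term frequency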
The current functionality of tm is documented in detail in the package’s online help and in its reference manual (tm.pdf at http://cran.r-project.org/package=tm).
4.3.3. Model Selection by Harmonic Mean in topicmodels
As explained in Section 3.7.1, one way of selecting the model parameter K, the number of topics, is to approximate the marginal corpus likelihood (which depends on K) by taking the harmonic mean of a set of likelihoods from samples generated by the Gibbs sampler (Griffiths and Steyvers, 2004). Version 0.7 of topicmodels did not provide this method of model selection1, and since I wanted to replicate the Griffiths and Steyvers (2004) paper on exploring the Pnas corpus via Lda, I had to implement this functionality myself. This required a small addition to the topicmodels source code and some lines of R code.
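A minimal sketch of the harmonic mean calculation itself is given below; it assumes that loglik is a numeric vector holding the log-likelihoods log P(w | z) of the individual Gibbs samples and works in log space to avoid numerical underflow.

# Log of the harmonic mean of a set of likelihoods, given their logarithms;
# uses the log-sum-exp trick for numerical stability.
harmonicMean <- function(loglik) {
  n <- length(loglik)
  m <- max(-loglik)
  log(n) - (m + log(sum(exp(-loglik - m))))
}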
When an Lda model is fitted by Gibbs sampling, the function LDA() returns the model as an object of class LDA_Gibbs which contains nearly all data of interest for further processing. However, the following R lines:
> library("topicmodels")
> getSlots("LDA_Gibbs")
and a cross-check with the package’s documentation reveal that one sample attribute needed for calculating the formula in Section 3.7.1, namely the number of instances of word i assigned to topic j, is missing from the returned object’s slots. This variable nw is internally used by the GibbsLDA++ Gibbs sampler as a substitute for the explicit topic-to-word assignments z; it can therefore be added to the C code that constructs the returned R object.
Changes to the topicmodels Source Code
The R interface is implemented by the package’s source file rGibbslda.cpp. Using diff to compare the original file with my version shows the updated lines:
79,86d78
<
< // MP: nw: number of instances of word i assigned to topic j, size VxK
< tp = PROTECT(allocMatrix(INTSXP, model->K, model->V));
< for (i = 0; i < model->K; i++)
< for (j = 0; j < model->V; j++)
< INTEGER(tp)[i + model->K * j] = model->nw[j][i];
< SET_SLOT(ans, install("nw"), tp);
< UNPROTECT(1);
A second change was then made to allclasses.R, which contains all class declarations (again, as diff output):
131,132c131
< delta = "numeric",
< nw = "matrix"), # MP: additional slot
---
> delta = "numeric"),
These changes are only available from within R after compiling and installing this custom build of topicmodels. Instructions on how to install R packages (in user space) are given in the manual “R Installation and Administration” (R Development Core Team, 2010).
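After installing the patched package, the additional slot can be accessed like any other slot of a fitted model; for example (fit denoting a hypothetical object of class LDA_Gibbs):

nw <- fit@nw   # matrix of word-to-topic assignment counts added by the patch above
dim(nw)        # topics in rows, terms in columns, as constructed in rGibbslda.cpp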
Functions in lda-gibbs-tools.R
The file lda-gibbs-tools.R is a collection of all further functions that implement the model selection as described in Section 3.7.1 on page 25.
The wrapper function ldaGibbsSamples() is needed for the repeated call of the Lda model estimation in order to produce a chain of model samples. Samples are taken after a burn-in interval and at a specified lag.
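A hypothetical call of the wrapper defined below could look as follows (all parameter values are illustrative only):

# Draw 16 Lda samples with K = 50 topics from the document-term matrix 'dtm',
# after a burn-in of 1000 Gibbs iterations and with a lag of 100 iterations
# between consecutive samples.
samples <- ldaGibbsSamples(dtm, k = 50, burniniter = 1000,
                           sampleinterval = 100, nsamples = 16,
                           control = list(seed = 1))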
ldaGibbsSamples <- function(dtm, k, burniniter, sampleinterval, nsamples,
                            control = NULL, model = NULL, ...) {
  # Value: returns a list of objects of class "LDA"
  # Note: The "call" slot of the samples contains just the
  #       wrapper's variable names and is therefore useless.

  # check control and remove iter parameter, if present:
  if (!is.null(control)) {