2.4 Topic Models
2.4.3 Data Pre-Processing
Human-generated text is an extremely complex form of data. Techniques for text analysis necessarily make substantial assumptions and approximations about the structure of the text in order to be tractable at all. In the case of topic models, actual word frequency and co-occurrence distributions may differ significantly from the typical Dirichlet priors and from the distributions implied by the model structure, leading to topics that may not faithfully represent the sought-after document semantics. For example, large vocabularies and Zipf-distributed word frequencies may allow topic models to detect word co-occurrence relations that are not strictly semantic in nature, reducing the utility of the resulting models.
For these reasons, a pre-processing step is often used to remove aspects of the text that significantly skew, or add substantial noise to, the analysis results. Many of the approaches below aim to obtain a smaller, denser vocabulary that captures more focussed semantics in the topics. Other approaches attempt to ameliorate statistical anomalies in co-occurrence and word frequencies. Though the following list focuses on pre-processing steps typically used for topic modelling, most of these techniques are also used in other text analysis approaches (particularly bag-of-words approaches). Typical pre-processing steps for topic modelling are:
• Tokenisation: processing text to identify the tokens or words is not always trivial, especially with social media data such as tweets.
• Removal of “stop words”: these are words that are frequent but carry little or no information relevant to the task at hand.
• Word stemming: verb tenses and plurals may be considered to contain no relevant differences in meaning, so suffixes such as ‘-ing’ and ‘-s’ are often removed.
• Named entity recognition: look for constructs such as ‘Barack Obama’ or ‘Mr Foo’ and replace them with a single token, also attempting to equate that token with the same ‘entity’ in other forms (e.g. ‘President Obama’ or ‘Mr A. Foo’).
• Duplicate document detection: duplicate or near-duplicate documents often fail to satisfy the (usually statistical) assumptions of text analysis algorithms, and can produce unexpected or spurious results. Detecting and removing duplicates may improve analyses.
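The steps listed above can be sketched as a small pipeline. This is a minimal illustration using only the Python standard library, not the pipeline used in this thesis: the stop list, the suffix rules in `crude_stem` and the function names are all hypothetical, and named entity recognition is omitted because it needs a trained model. Only exact duplicates (after normalisation) are detected here.

```python
import re
from collections import Counter

# Hypothetical, tiny stop list for illustration; real lists are much longer.
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "are", "to", "in"}


def tokenise(text):
    """Lowercase the text and split it into alphabetic word tokens (crude)."""
    return re.findall(r"[a-z]+", text.lower())


def crude_stem(token):
    """Strip a few common suffixes; a stand-in for a real stemmer."""
    for suffix in ("ing", "es", "s"):
        # Keep at least three characters so short words are left alone.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(documents):
    """Tokenise, drop stop words, stem, and skip exact duplicate documents."""
    seen = set()
    bags = []
    for text in documents:
        tokens = [crude_stem(t) for t in tokenise(text) if t not in STOP_WORDS]
        key = tuple(tokens)  # documents identical after normalisation collide
        if key in seen:
            continue
        seen.add(key)
        bags.append(Counter(tokens))  # bag-of-words representation
    return bags
```

Because duplicates are compared after stop word removal and stemming, "The cats are running" and "the cat is running" count as the same document in this sketch. Whether that is desirable depends on the corpus.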
A number of other pre-processing techniques have been used in the literature. Talley et al. [Talley et al. 2011] identified vacuous topics by their strikingly uniform distribution over documents without any documents strong in the topic. Words from these topics were removed from the data. They also created a vocabulary of ∼600 acronyms and ∼4200 commonly used bigrams and phrases.
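One way to make the notion of a vacuous topic concrete is to measure how uniform each topic's weight is across documents. The sketch below flags topics whose normalised entropy over documents is close to 1 and whose largest per-document weight is small. The entropy criterion, both thresholds and the function name are assumptions made for illustration; they are not taken from Talley et al.

```python
import math


def vacuous_topics(theta, min_norm_entropy=0.98, max_weight=0.2):
    """Return indices of topics spread near-uniformly over documents.

    theta[d][k] is the weight of topic k in document d.
    Both thresholds are illustrative assumptions.
    """
    n_docs = len(theta)
    n_topics = len(theta[0])
    flagged = []
    for k in range(n_topics):
        column = [theta[d][k] for d in range(n_docs)]
        total = sum(column)
        if total == 0 or n_docs < 2:
            continue
        # Normalise the topic's weights into a distribution over documents.
        probs = [w / total for w in column]
        entropy = -sum(p * math.log(p) for p in probs if p > 0)
        norm_entropy = entropy / math.log(n_docs)  # 1.0 means uniform
        # Uniform over documents AND no document strong in the topic.
        if norm_entropy >= min_norm_entropy and max(column) <= max_weight:
            flagged.append(k)
    return flagged
```

The words most strongly associated with the flagged topics would then be candidates for removal before the model is re-estimated.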
Another approach is to restrict the vocabulary to words known to be of specific interest. Poldrack et al. [Poldrack et al. 2012] restricted the vocabulary to words found in a domain-specific ontology when analysing a collection of related research papers. Restricting the vocabulary in this way can be useful for summarising documents with a known conceptual framework, but prevents the discovery of novel conceptual associations with unexpected words.
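Vocabulary restriction of this kind amounts to a whitelist filter over tokens. The sketch below uses a made-up set of ontology terms (not those of Poldrack et al.) and also returns the discarded words, which makes the trade-off visible: anything outside the whitelist can never appear in a topic.

```python
from collections import Counter

# Hypothetical ontology terms, for illustration only.
ONTOLOGY_TERMS = {"memory", "attention", "emotion", "language"}


def restrict_vocabulary(tokens, allowed):
    """Split tokens into a kept bag-of-words and a bag of discarded words."""
    kept = Counter(t for t in tokens if t in allowed)
    dropped = Counter(t for t in tokens if t not in allowed)
    return kept, dropped


tokens = "working memory and attention interact with reward".split()
kept, dropped = restrict_vocabulary(tokens, ONTOLOGY_TERMS)
# 'reward' is discarded, so an association between reward and memory
# could not be discovered by a model trained on the kept tokens.
```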
The opposite approach, not restricting the vocabulary with standard methods such as stop word removal, is also an option. It has been used in connection with adaptor grammars in [Wong et al. 2012], where stop words were included in an extension of standard topic models for the task of identifying the native language of authors writing in English as a second language. The approach has also been used by Thelwall et al. [Thelwall et al. 2012], where standard machine learning approaches were used for sentiment detection. They did not remove stop words in their analysis, noting that they are potential indicators of sentiment. Due to the psychometric intent of the models used in Chapters 5 and 6, and the clear relevance of “function words” (such as articles and pronouns) as indicators of social psychological processes [Chung and Pennebaker 2013], stop words were not removed.
O’Connor et al. [O’Connor et al. 2010] provide a good example of tweet tokenisation. Their tokeniser treats “hash tags, @-replies, abbreviations, strings of punctuation, emoticons and unicode glyphs (e.g. musical notes) as tokens”. I have employed a similar approach in Chapters 5 and 6.
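A tokeniser along these lines can be approximated with a single ordered regular expression. The pattern below is a rough sketch written for illustration; it is neither O’Connor et al.'s tokeniser nor the one used in Chapters 5 and 6. The emoticon and abbreviation rules are deliberately simplistic, and the treatment of URLs as single tokens is an added assumption.

```python
import re

# Alternatives are tried in order, so more specific patterns come first.
TOKEN_PATTERN = re.compile(
    r"""
      https?://\S+                      # URLs (an assumption, kept whole)
    | [@\#]\w+                          # @-replies and hash tags
    | [:;=][\-o*']?[)\](\[dDpP/\\|]     # simple western emoticons, e.g. :) ;-D
    | (?:[A-Za-z]\.){2,}                # dotted abbreviations such as U.S.
    | \w+(?:'\w+)?                      # words, with an optional apostrophe
    | [^\w\s]+                          # runs of punctuation or other glyphs
    """,
    re.VERBOSE,
)


def tokenise_tweet(text):
    """Return the tweet's tokens in order of appearance."""
    return TOKEN_PATTERN.findall(text)


example = "@bob loving the new #album :) \u266b\u266b U.S. tour!!!"
print(tokenise_tweet(example))
```

One known limitation of this sketch is that an emoticon immediately preceded by punctuation (with no space) is absorbed into the punctuation run rather than recognised separately.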
Though not commonly noted in the topic modelling literature, repeated sub-texts can have a detrimental effect on topic model performance [Cohen et al. 2013], an observation I have also made when modelling the data presented in this thesis. A simple way to reduce this problem is to remove repeated texts. It can be argued that little thematic information is lost in this way, since the original texts remain. This approach was applied in the models from Chapters 5 and 6, where retweets were removed. It should be noted, however, that retweets often contain small amounts of extra text, which is lost when they are removed.
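The trade-off described above can be illustrated with a short sketch. It assumes retweets are marked with the textual convention "RT @user:", which is an assumption made for illustration only; how retweets were actually identified in Chapters 5 and 6 is not specified here, and real data may carry an explicit retweet flag instead. The function names are hypothetical.

```python
import re

# Assumed textual retweet marker: optional comment, then "RT @user: original".
RT_PATTERN = re.compile(r"\bRT @\w+:?\s*(?P<original>.*)$", re.DOTALL)


def split_retweet(text):
    """Return (added_comment, original_text), or None if not a retweet."""
    match = RT_PATTERN.search(text)
    if match is None:
        return None
    return text[: match.start()].strip(), match.group("original").strip()


def remove_retweets(tweets):
    """Drop retweets; also collect the extra text that is lost with them."""
    kept, lost_comments = [], []
    for tweet in tweets:
        parts = split_retweet(tweet)
        if parts is None:
            kept.append(tweet)
        elif parts[0]:
            lost_comments.append(parts[0])  # text added by the retweeter
    return kept, lost_comments
```

Returning the discarded comments alongside the retained tweets makes it possible to check how much added text is being lost for a given corpus.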
A recent topic model includes duplicate sub-texts in the generative model [Cohen et al. 2014]. Their model is specific to the patient record context (groups of notes drawing text from a single ‘root’ patient record per group), though it could be generalised to other duplicate structures, such as retweets, with relatively little effort. This approach is superior to simply removing near-duplicate documents, as the remaining, non-duplicated words are also modelled. I have not developed such a model; however, it would be of interest for future work, especially given the importance of retweets as indicators of community acceptance of a tweet's contents.