Topic categorization works by using particular elements of language (e.g. words, word stems, strings of words, and parts of speech) as predictors of which category a particular document falls into. The essential insight is that certain words or phrases mark when a particular topic is being discussed or how a given text compares to known samples of language. The raw text is transformed into numerical data by using a computer program to count the occurrences of each specific feature of language in the sample being analyzed. In this approach, each document is represented by a vector of counts, one for each feature appearing anywhere in the sample, and stacking these vectors yields a document-by-feature count matrix. The methods discussed below differ mostly in how they use these word-count matrices to categorize texts.
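As a minimal sketch of the counting step described above, the Python snippet below builds a document-by-feature count matrix from whitespace-delimited words. The function name `build_count_matrix` and the two toy documents are assumptions made purely for illustration; an applied project would likely add stemming, stop-word removal, or multi-word features.

```python
from collections import Counter


def build_count_matrix(documents):
    """Return (vocabulary, matrix) where matrix[i][j] counts term j in document i."""
    # Lowercase and split on whitespace: a deliberately crude tokenizer.
    tokenized = [doc.lower().split() for doc in documents]
    # The vocabulary is every distinct token in the whole sample, sorted.
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)  # terms absent from a document count as 0
        matrix.append([counts[term] for term in vocabulary])
    return vocabulary, matrix


# Hypothetical two-document sample.
toy_documents = ["tax cut tax", "health care tax"]
vocabulary, count_matrix = build_count_matrix(toy_documents)
```

Each row of `count_matrix` is one document's count vector, and every row has the same length because the vocabulary is drawn from the entire sample.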
Latent Semantic Analysis (LSA)
Latent Semantic Analysis (LSA) was pioneered in cognitive psychology and is rooted in the idea that the human brain uses processes similar to factor analysis to decode the meaning of communication (Landauer and Dumais 1997; for a general introduction to LSA, see Landauer, Foltz and Laham 1998). The general insight is that a given word’s meaning is a function of the contexts in which it routinely appears, and a passage’s meaning is determined by the words that appear in it. Models in this family differ in the number and type of assumptions they need to function, but they all use word counts to uncover the latent dimensional structure of the texts being analyzed. In a topic categorization set-up, dimensions capture clusters of words that appear in a particular context but not in others. LSA models estimate the dimensional structure of language without any input from the researcher, thereby obviating the need for a pre-set topic coding scheme. It is important to note that LSA models make no use of word order, syntactic relationships, or morphology (Simon and Xenos 2004). Beyond the cognitive theory of meaning cited above, these models are appealing because they generally produce sensible results.
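The idea of recovering a latent dimension from word counts can be sketched in a few lines of Python. The example below centers each word's counts and then uses power iteration to extract the leading singular dimension of the centered matrix (equivalent to a first principal component). This is a simplified stand-in for the factor-analytic and decomposition-based procedures in this literature, not the procedure of any author cited here; the function names, the toy counts, and the choice of starting vector are all assumptions.

```python
import math


def center_columns(matrix):
    """Subtract each word's mean count so dimensions reflect co-variation."""
    n_rows = len(matrix)
    means = [sum(row[j] for row in matrix) / n_rows for j in range(len(matrix[0]))]
    return [[value - means[j] for j, value in enumerate(row)] for row in matrix]


def leading_dimension(matrix, iterations=100):
    """Power iteration for the top singular value, word loadings, and document scores."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    # Assumed starting point: the first coordinate axis. It must not be
    # orthogonal to the leading dimension, which holds for the toy data below.
    loadings = [1.0] + [0.0] * (n_cols - 1)
    for _ in range(iterations):
        # Project documents onto the current loadings, then map back to words.
        scores = [sum(a * b for a, b in zip(row, loadings)) for row in matrix]
        updated = [
            sum(matrix[i][j] * scores[i] for i in range(n_rows))
            for j in range(n_cols)
        ]
        norm = math.sqrt(sum(x * x for x in updated))
        loadings = [x / norm for x in updated]
    scores = [sum(a * b for a, b in zip(row, loadings)) for row in matrix]
    singular_value = math.sqrt(sum(x * x for x in scores))
    return singular_value, loadings, scores


# Hypothetical counts: rows are documents, columns follow `terms`.
terms = ["abortion", "ban", "budget", "tax"]
counts = [
    [3, 2, 0, 0],
    [2, 3, 0, 0],
    [0, 0, 3, 2],
    [0, 0, 2, 3],
]
singular_value, loadings, scores = leading_dimension(center_columns(counts))
```

In this toy sample the words that co-occur load with the same sign on the recovered dimension, and the documents split into two groups along it, without any coding scheme being supplied in advance.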
Simon and Xenos reintroduced the idea of factor analyzing word-frequency matrices into political science. They conducted exploratory factor analysis to identify groups of words used to discuss partial-birth abortion. The dimensional structure that Simon and Xenos uncovered was generally consistent with a hand-coding of texts, evidence that examining word usage captures much of the same information that a human researcher would identify by reading. The major objection to their method is that exploratory factor analysis assumes that observations are independent of one another. Rules of grammar, however loosely followed, imply that many words are not structurally independent of one another. That said, the success of their demonstration implies that violating this assumption may not always be fatal.
Most LSA models are variants of basic factor analysis, amended so as to avoid violated assumptions like the one just discussed. Quinn, Monroe, and their co-authors (2006) use a multinomial mixture model to identify the topics of congressional speeches from the 105th-108th Congresses. The authors settled on 42 topics (dimensions) after running the model for higher and lower order solutions. It should be noted that LSA models generally function best for high-dimensional solutions, usually in the 50-1500 dimension range (Landauer, Foltz and Laham 1998). The fact that Quinn et al.’s model is optimized at 42 topics is evidence that congressional speech is more highly organized than many other types of language. For any Douglas Adams fans, a 42-topic solution also hints that Monroe and company are courting deeper questions than they recognize.
One major advantage of Quinn et al.’s model is that it can incorporate a time parameter, which allows their model to capture the changing relationship between words and topics over time. This is an important advantage because language is a dynamically evolving entity. When words start appearing in new contexts, these models capture those changes in contextual meaning. The major limitation of LSA models is that they require very large bodies of text. Quinn and his colleagues analyzed every word uttered in Congress during the years studied. Their results are more generalizable than those based on more specific samples of text, but collecting, archiving, and preprocessing the entire Congressional Record is a massive undertaking. LSA methods, therefore, may not be feasible for researchers with less time, fewer graduate student assistants, or where the availability of relevant text is limited.
Support Vector Machines
Support Vector Machines (SVMs) are another family of models designed to sort text into categories of interest. These models identify words that are characteristic of specific categories within a “training” sample and then use those discriminating words to sort new texts into the same categories. These models have been used in the private sector to distinguish good from bad customer product reviews (Dave, Lawrence and Pennock 2003) and to sort viewer comments on movies (Pang, Lee and Vaithyanathan 2002); for a comprehensive discussion of SVMs, see Vapnik (1982), Cortes and Vapnik (1995), and Vapnik (1999). SVMs have produced similar or superior results to alternative methods of identifying categories in text (1999; 2005).
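To make the train-then-sort logic concrete, the sketch below fits a bias-free linear SVM to word counts using Pegasos-style stochastic subgradient steps on the L2-regularized hinge loss, and then reads off the per-word weights as a rough measure of discriminating power. It is an illustrative toy rather than the software used in any study cited here; the vocabulary, the four hypothetical titles, the labels, and the settings `lam` and `epochs` are all assumptions.

```python
def train_linear_svm(rows, labels, lam=0.01, epochs=50):
    """Fit a bias-free linear SVM by Pegasos-style subgradient steps on the hinge loss.

    rows: count vectors; labels: +1 or -1; lam: L2 regularization strength.
    """
    weights = [0.0] * len(rows[0])
    step = 0
    for _ in range(epochs):
        for features, label in zip(rows, labels):
            step += 1
            learning_rate = 1.0 / (lam * step)
            margin = label * sum(w * x for w, x in zip(weights, features))
            # Regularization shrinks every weight by the factor (1 - 1/step).
            weights = [w * (1.0 - 1.0 / step) for w in weights]
            # Documents inside the margin pull the weights toward their label.
            if margin < 1.0:
                weights = [
                    w + learning_rate * label * x
                    for w, x in zip(weights, features)
                ]
    return weights


def classify(weights, features):
    """Return +1 or -1 according to the side of the separating hyperplane."""
    score = sum(w * x for w, x in zip(weights, features))
    return 1 if score >= 0 else -1


# Hypothetical titles, already converted to counts over this vocabulary.
vocabulary = ["army", "hospital", "insurance", "medicare", "navy", "weapons"]
training_rows = [
    [0, 1, 0, 1, 0, 0],  # "medicare hospital"  -> health (+1)
    [1, 0, 0, 0, 1, 0],  # "army navy"          -> defense (-1)
    [0, 1, 1, 0, 0, 0],  # "hospital insurance" -> health (+1)
    [0, 0, 0, 0, 1, 1],  # "navy weapons"       -> defense (-1)
]
training_labels = [1, -1, 1, -1]

weights = train_linear_svm(training_rows, training_labels)
word_weights = dict(zip(vocabulary, weights))

# An unseen title, "medicare insurance", expressed over the same vocabulary.
new_title_label = classify(weights, [0, 0, 1, 1, 0, 0])
```

Positive entries in `word_weights` mark words that push a text toward the first category and negative entries mark words that push it toward the second, which is the sense in which the fitted model exposes the discriminating words.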
Hillard, Purpura, and Wilkerson (2007) used SVMs to categorize the 380,000 bills introduced in the 80th-105th Congresses based on the words appearing in each bill’s title. They found that the SVM program successfully replicated how those same bills were manually coded by the Congressional Bills Project, and with much less effort. Once again, analyzing word usage captures many of the same patterns that human readers identify.
A few characteristics of SVM models warrant specific mention. First, the program reports the discriminatory power of each word used. This allows researchers to trace the change over time in which words distinguish ideologies or topical categories from one another. While not as rigorously dynamic as LSA models, this still provides some leverage on how the meaning of specific words evolves. Second, the success of the Congressional Bills Project experiment indicates that SVMs may be capable of working with extremely short segments of text. This is useful because long texts with lots of words may not always be easily collected. Third, SVMs appear to work well in both high- and low-dimensional contexts. This is a strength, but also an indication that the dimensionality of the results is largely determined by which categories the researcher chooses to analyze and by the selection of training samples. Finally, while SVMs clearly require smaller bodies of text than Monroe and Quinn’s method, good results still require sizable training samples.
Non-Parametric Categorization
Gary King, Daniel Hopkins, and their research team have been exploring ways of categorizing political text that avoid what they see as key problems in how CATA methods are currently being implemented in political science (Hopkins and King 2007).
First, they argue that we should change the benchmark for CATA categorization success. “Accurate estimates of these document category proportions has not been the goal of most work in the classification literature, which has focused instead on increasing the accuracy of individual document classification. Unfortunately, methods tuned to maximize the percent of documents correctly classified can still produce substantial biases in the aggregate proportion of documents within each category” (Hopkins and King 2007). The focus on the percentage of individual documents accurately classified is a holdover from computational linguistics and may be a misleading standard for social scientists who are using these estimates to model aggregate behavior. To resolve this problem, Hopkins and King produce estimates of document mis-categorization rates and use those to correct the aggregate measures of relative topic frequency.
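The logic of correcting an aggregate proportion with estimated mis-categorization rates can be illustrated for the simplest case of two categories. The sketch below inverts the standard identity relating the observed share of documents placed in a category to the true share, given the classifier's sensitivity and specificity. It is offered as an illustration of the general idea under that two-category assumption, not as the authors' estimator; the function name and all numbers are hypothetical.

```python
def corrected_share(observed_share, sensitivity, specificity):
    """Recover the true share of a category from a classifier's observed share.

    Inverts: observed = sensitivity * true + (1 - specificity) * (1 - true).
    """
    false_positive_rate = 1.0 - specificity
    denominator = sensitivity - false_positive_rate
    if abs(denominator) < 1e-12:
        raise ValueError("classifier is uninformative; the share is not identified")
    return (observed_share - false_positive_rate) / denominator


# Hypothetical numbers: the classifier catches 90% of documents that truly
# belong to the category, correctly rejects 80% of those that do not, and
# places 34% of all documents in the category.
estimate = corrected_share(observed_share=0.34, sensitivity=0.9, specificity=0.8)
```

With these made-up rates the raw share of 0.34 overstates a true share of roughly 0.20, even though most individual documents are classified correctly. This is the kind of aggregate bias the quoted passage warns about.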
Hopkins and King also point out that using words to predict categories inverts the data generation process we are attempting to model. In the real world, authors and speakers know what they are talking about at the outset, and choose language accordingly. The functional form of both LSA models and SVMs implies that words precede categories, the equivalent of asserting that members of Congress only discover what they were talking about after they have finished speaking. In addition, estimating a document’s category as a function of its speech elements requires parametric assumptions that are almost never met by textual data. For these reasons, Hopkins and King propose modeling language profiles (the vector of count data for each document) as a consequence of unobserved categories, rather than their cause. They argue that their approach is also less sensitive to differences in language between training samples and populations of interest.
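One way to see what it means to treat language profiles as a consequence of categories is a two-category mixture: the frequency of each profile in the population is a weighted average of its frequency within each category, with the unknown category shares as weights. The sketch below solves that relationship for the share by least squares. It is a deliberately simplified illustration of the modeling direction, not the authors' estimator, and the function name and all profile frequencies are hypothetical.

```python
def estimate_category_share(population_profile, profile_given_a, profile_given_b):
    """Least-squares share of category A in a two-category mixture.

    Assumes P(profile) = share * P(profile | A) + (1 - share) * P(profile | B)
    for every word profile, with the conditional frequencies taken from a
    hand-coded training sample.
    """
    numerator = sum(
        (p - b) * (a - b)
        for p, a, b in zip(population_profile, profile_given_a, profile_given_b)
    )
    denominator = sum(
        (a - b) ** 2 for a, b in zip(profile_given_a, profile_given_b)
    )
    if denominator == 0:
        raise ValueError("the two categories use identical language profiles")
    share = numerator / denominator
    # Keep the estimate inside the unit interval.
    return min(1.0, max(0.0, share))


# Hypothetical frequencies of four word profiles (e.g. presence or absence
# of two key words) within each hand-coded category and in the population.
profile_given_a = [0.5, 0.3, 0.1, 0.1]
profile_given_b = [0.1, 0.1, 0.3, 0.5]
population_profile = [0.22, 0.16, 0.24, 0.38]

share_a = estimate_category_share(
    population_profile, profile_given_a, profile_given_b
)
```

Note that no individual document is ever assigned to a category here; only the aggregate share is estimated, which matches the change of benchmark the authors advocate.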
Hopkins and King demonstrate their method by categorizing the sentiment of weblogs toward the major presidential candidates for 2008. They collected all blog postings between February 1st and 5th 2007 that mention any of the candidates and sorted them into categories ranging from extremely negative to extremely positive. “The idea is to create a daily opinion poll that summarizes the views of people who join the national conversation to express an opinion” (2007). They chose their sample to capture the days immediately preceding and following John Kerry’s botched “joke” about dropping out of high school being a direct road to service in Iraq. Hopkins and King show that their method captures a massive spike in negative sentiment in the wake of Kerry’s gaffe. The success of their method is all the more impressive given the consensus in the literature that sentiment categorization is more difficult for CATA tools than topic sorting (Pang, Lee and Vaithyanathan 2002).