2 PREVIOUS LITERATURE REVIEW
2.1 EFFICIENT MARKET HYPOTHESIS
2.3.1 CONTENT ANALYSIS METHODOLOGIES
Recent literature has used various approaches to turn text into quantified metrics. The idea is typically to transform qualitative information into a sentiment score: authors usually wish to distinguish positive from negative sentiment, and possibly to measure its magnitude. As a criterion for judging whether the estimated sentiment correctly reflects the text being quantified, Mitra and Mitra (2010) suggest comparing a computer’s annotations to how a human, or a group of humans (in particular a group of experts), interprets the text. Alternatively, one could use market-based measures, defining the sentiment by the market reaction that follows (sentiment = market reaction). This latter approach inherently assumes that the market reacts to a news story. Under that assumption, changes in a correctly constructed sentiment measure should correlate with stock performance metrics, such as returns, volatility or volume.
Sentiment analysis process
To analyze sentiment, one must first collect texts and then process and analyze them in order to construct a sentiment score. Mitra and Mitra (2010) split the information flow into information gathering (mainstream news, pre-news and web 2.0/social media), pre-analysis, classification and assignment of sentiment scores, and analysis against financial market data. Once completed, the analysis results can be fed into various quantitative models for return prediction, trading decisions or risk assessment. This approach is illustrated in Figure 2.
Figure 2: Information flow and computational architecture (Mitra and Mitra, 2010)
Various sources of information have been used in previous literature. Many authors (e.g. Engelberg, 2008; Li, 2006; Loughran and McDonald, 2011) have focused on 10-K reports, which are easy to access and analyze because they follow a fixed structure and by definition relate to a specific company. Other media examined range from general news (e.g. Antweiler and Frank, 2006; Tetlock, 2008) to message board posts (Antweiler and Frank, 2002).
Pre-analysis of information starts with decisions on how media will be filtered from a database. For 10-K reports, deciding which report relates to which company is straightforward. For news and social media, the mapping needs to be carefully defined: a news item can mention multiple companies and does not always relate to a single one. Today’s news databases usually include search functionality by company ticker, which is often the result of a machine learning algorithm run by the database provider. While Tetlock et al. (2008) search news by the official company name and then filter the results further to ensure the news is highly relevant to the company, Engelberg (2008) relies on Factiva’s automatic classification of news by company code.
After matching texts to the right companies, most authors also perform some pre-processing of the texts. This is necessary, for example, to include the heading as part of the text, or to deal with elements that are not in story format, such as tables, pictures and disclaimers, which could add unnecessary noise to a sentiment score.
Mitra and Mitra (2010) also note that it can be beneficial to identify stories that are current: news that merely reports older news is less relevant, as the information is no longer novel, and should often be given less weight or excluded from sentiment score metrics. Adjustments for the timing of news flow can also be used. News flow is seasonal: at some points of the day, week, month and year, more news (and new information) reaches the market than at others. Finally, links between news items should be analyzed, as a single item often covers a number of topics (e.g. a company’s earnings announcement brings a wide variety of information on different topics).
After preprocessing, news items are classified to construct a sentiment score. Das (2010), also cited by Mitra and Mitra (2010), identifies six methods for classifying sentences: the naïve classifier and two variations of it (the discriminant-based classifier and the adjective-adverb phrase classifier); two algorithms that determine the class from the composition of lexicon items in a sentence (the vector distance classifier and the Bayesian classifier); and the support vector machine (SVM). Das (2010) also proposes applying a voting scheme across a number of classifiers, so that a message is assigned the category chosen by most classifiers.
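The voting scheme can be illustrated with a minimal Python sketch. The function name `majority_vote` and the tie-breaking rule (falling back to "neutral") are assumptions made for illustration; the source does not specify how ties are resolved.

```python
from collections import Counter


def majority_vote(labels):
    """Return the label chosen by the most classifiers.

    Assumption: a tie between the top labels falls back to "neutral",
    since the source does not say how ties are handled.
    """
    if not labels:
        raise ValueError("at least one classifier label is required")
    ranked = Counter(labels).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return "neutral"
    return ranked[0][0]


# Hypothetical outputs of five classifiers for one message.
votes = ["positive", "positive", "negative", "neutral", "positive"]
message_label = majority_vote(votes)
```

Here three of the five hypothetical classifiers agree, so the message is labeled positive.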
Sentiment classifiers
The naïve classifier (also known as the “word count” or “bag of words” method) works by counting word occurrences and assigning a label to the text based on which category of words is most common (e.g. positive or negative, or neutral if no majority exists). To work, this method requires a lexicon, i.e. a list of words that have been categorized as “positive”, “negative”, etc. Due to its ease of implementation, this is the most commonly used classifier and has been used in most studies in the finance domain (e.g. by Tetlock, 2008; with some additions by Engelberg, 2008; and Loughran and McDonald, 2011). As a modification of the naïve classifier, Das proposes a discriminant-based classifier that assigns different weights to different words (e.g. a negative weight of 0.5 for a slightly negative word, and 2 for a highly negative word). The adjective-adverb phrase classifier also works similarly to the naïve classifier, but considers only noun phrases that include adjectives or adverbs: e.g. “a strong profit” would be considered for classification, but “a profit” would not be included even if the word “profit” exists in the lexicon. In addition, Engelberg (2008) experiments with adding the impact of simple negations that change the meaning of expressions (for example “not bad” vs. “bad”).
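As a rough illustration, the word count approach, the weighted variant and a simple negation rule can be sketched in Python as follows. The lexicon, its weights, the negation list and the rule of flipping only the word directly after a negation are all hypothetical choices for this sketch and are not taken from the cited studies.

```python
import re

# Hypothetical toy lexicon: word -> weight. With all weights at +1/-1 this is
# a plain word count; unequal weights give a discriminant-style variant.
LEXICON = {
    "strong": 1.0, "profit": 1.0, "good": 1.0,
    "bad": -1.0, "loss": -1.0, "disastrous": -2.0,
}
NEGATIONS = {"not", "no", "never"}  # assumed list of simple negations


def score_text(text, lexicon=LEXICON, handle_negation=True):
    """Sum lexicon weights over the words of a text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = 0.0
    for i, token in enumerate(tokens):
        weight = lexicon.get(token, 0.0)
        # Assumed rule: a negation directly before a word flips its polarity.
        if handle_negation and i > 0 and tokens[i - 1] in NEGATIONS:
            weight = -weight
        score += weight
    return score


def classify(text, **kwargs):
    """Label a text positive, negative, or neutral when the score is zero."""
    score = score_text(text, **kwargs)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

With negation handling switched off, “not bad” is scored like “bad”; with it switched on, the polarity of “bad” is reversed.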
The vector distance classifier treats every word in the lexicon as a dimension of a vector space and describes each message as a vector in that space. A training set of messages is pre-classified, and each new message is assigned the polarity of the training vector with which it forms the smallest angle. The Bayesian classifier, on the other hand, determines the count of each lexical item (e.g. a word) in a message. From a training set, it is possible to know with what likelihood each lexical item appears in a certain category. From these word-based frequencies, it is possible to calculate the probability that a message falls into each category, and to assign the message the category with the highest probability. For example, O’Hare et al. (2009) use multinomial naïve Bayesian classifiers to recognize sentiment in financial blogs.
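Both ideas can be sketched in a few lines of Python. This is a toy illustration under stated assumptions: the lexicon and training messages are invented, the smallest angle is found via the largest cosine similarity, and the Bayesian variant uses add-one (Laplace) smoothing, which the source does not mention.

```python
import math
from collections import Counter

LEXICON = ["profit", "growth", "loss", "decline"]  # hypothetical lexicon
TRAINING = [  # hypothetical pre-classified messages (already tokenized)
    (["profit", "growth", "growth"], "positive"),
    (["record", "profit"], "positive"),
    (["loss", "decline"], "negative"),
    (["decline", "decline", "loss"], "negative"),
]


def to_vector(tokens, lexicon=LEXICON):
    """Describe a message as counts along one dimension per lexicon word."""
    counts = Counter(tokens)
    return [counts[word] for word in lexicon]


def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def vector_distance_classify(tokens, training=TRAINING):
    """Assign the label of the training message with the smallest angle."""
    vec = to_vector(tokens)
    best = max(training, key=lambda ex: cosine(vec, to_vector(ex[0])))
    return best[1]


def naive_bayes_classify(tokens, training=TRAINING, lexicon=LEXICON):
    """Pick the category with the highest (log) posterior probability."""
    scores = {}
    for label in sorted({lab for _, lab in training}):
        docs = [toks for toks, lab in training if lab == label]
        word_counts = Counter(w for toks in docs for w in toks if w in lexicon)
        total = sum(word_counts.values())
        log_prob = math.log(len(docs) / len(training))  # category prior
        for word in tokens:
            if word in lexicon:
                # Add-one smoothing (an assumption) avoids zero probabilities.
                log_prob += math.log(
                    (word_counts[word] + 1) / (total + len(lexicon))
                )
        scores[label] = log_prob
    return max(scores, key=scores.get)
```

In this toy setup, a message containing “growth” and “profit” lies closest to the positive training vectors, while a message containing “loss” and “decline” receives the highest probability under the negative category.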
Support Vector Machines (SVMs) are a classification technique that is similar to cluster analysis but can be used in very high-dimensional spaces. Given a large number of texts and a training corpus, an SVM can classify texts by, for example, treating all words in the lexicon as dimensions and then grouping the texts based on information in the training corpus (see footnote 61). For example, this could be used to first identify which words are typically present in a positive sentence, and then to classify further sentences on that basis. The advantage of SVMs is their flexibility in being able to learn features even in highly sophisticated environments. SVMs are used by e.g. O’Hare et al. (2009) to classify financial text.
61 For a more technical description of SVMs, see Das, 2010 and Vapnik and Lerner (1963); Vapnik and
Going further in sophistication, Moilanen et al. (2010) have developed a method called “quasi-compositional sentiment sequencing”, which we also use as the basis for our methodology. Compared to a baseline word count, this method assumes that the order in which polarities appear can change the polarity of the whole sentence. For example, a sentence with three words in the sequence “positive-negative-positive” could be labeled positive by majority vote with a word count algorithm. The logic of quasi-compositional sentiment sequencing, on the other hand, is to ask “what kind of polarities have human annotators given to sentences that have words in the sequence ‘positive-negative-positive’”. To simplify sentences, the method compresses consecutive identical polarities in the sequence (e.g. sentences with “positive-neutral-neutral-positive-positive” and “positive-neutral-neutral-neutral-positive” would both be compressed to “positive-neutral-positive”). With this compression, the required training set shrinks significantly. To implement the actual classification, the authors use a standard SVM approach and a readily annotated corpus (MPQA). According to the reported results, quasi-compositional sequencing yields better results than simpler methods, especially for sentences with many different polarities (the authors use positive, negative, neutral and reversal).
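The compression step alone can be sketched in Python as below. This covers only the collapsing of consecutive identical polarities; it is not the authors’ full method, which additionally trains an SVM on the compressed sequences. The function name `compress` is a hypothetical choice.

```python
from itertools import groupby


def compress(polarities):
    """Collapse each run of consecutive identical polarities into one item."""
    return [label for label, _run in groupby(polarities)]


# The two example sequences from the text.
first = compress(["positive", "neutral", "neutral", "positive", "positive"])
second = compress(["positive", "neutral", "neutral", "neutral", "positive"])
```

Both sequences reduce to the same pattern, so a single training example covers both sentences.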
Considerations on classifiers
To work, classifiers often need supplementary databases: a dictionary records the category of each word (whether it is an adverb, an adjective, a noun, etc.), a lexicon assigns words to various polarities (e.g. a list of positive words), and a training corpus of base messages shows examples of how different sentences should be classified. The contents of these databases can vary significantly between studies, and e.g. changing from a general lexicon to a domain-specific lexicon can make a large difference (see e.g. Loughran and McDonald, 2011). The most commonly used lexicon in the financial literature has so far been the General Inquirer’s Harvard-IV-4 psychological dictionary (e.g. Tetlock et al., 2008; Engelberg, 2008).
Beyond polarity itself, a sentiment algorithm can consider the window within which sentiment is detected, as well as the magnitude of the sentiment. While most papers either consider sentiment at the document level or always categorize sentences, there are also other options for labeling a text with a certain polarity. O’Hare et al. (2009) introduce the concept of word (and sentence and paragraph) windows: for the sentiment on a certain topic, they consider only text that lies within a distance of n from a topic word (e.g. only 5 words before and 5 words after a certain topic word).
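A word window of this kind can be sketched in Python as follows. The function name `window_tokens`, the example sentence and the company name are hypothetical; keeping the topic word itself in the output and merging overlapping windows are assumptions of this sketch.

```python
def window_tokens(tokens, topic, n=5):
    """Keep only tokens within n positions of any occurrence of the topic word.

    Assumptions: the topic word itself is kept, and overlapping windows
    around repeated topic words are merged.
    """
    keep = set()
    for i, token in enumerate(tokens):
        if token == topic:
            keep.update(range(max(0, i - n), min(len(tokens), i + n + 1)))
    return [tokens[i] for i in sorted(keep)]


sentence = "markets fell sharply but acme posted strong profit despite weak demand"
near_topic = window_tokens(sentence.split(), topic="acme", n=2)
```

With n = 2, the words “fell” and “weak” fall outside the window and would not affect the sentiment attributed to the hypothetical topic word “acme”.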
For a human reader, it is also evident that the context of an expression affects how strong its polarity should be. For example, Engelberg (2008) relates negative sentiment at the sentence level to one of six themes (positive fundamentals, negative fundamentals, future outlook, environment, operations, and other) and finds that the themes can be used to refine the perceived impact of sentiment.
Once classified, the detected polarized words and sentences need to be combined to arrive at an aggregate sentiment score; in other words, the “sentiment of the day”. Authors have adopted various approaches to aggregation: e.g. Tetlock et al. (2008) combine all news of a particular day into one article and calculate the proportion of negative words in this article. Das (2010), on the other hand, labels each message as a “buy” or a “sell” signal and then calculates the total number of buys and sells per day. The chosen approach has an impact: for example, a long article would typically carry a larger weight with Tetlock’s approach, whereas with Das’s approach the weight of each article would be the same.
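The difference between the two aggregation approaches can be illustrated with a small Python sketch. The negative word list, the example articles and the function names are hypothetical, and the functions are simplified readings of the approaches described above rather than the authors’ exact procedures.

```python
NEGATIVE_WORDS = {"loss", "decline", "weak"}  # hypothetical negative lexicon


def pooled_negative_share(articles):
    """Pool all of a day's articles and return the share of negative words."""
    tokens = [token for article in articles for token in article]
    if not tokens:
        return 0.0
    return sum(token in NEGATIVE_WORDS for token in tokens) / len(tokens)


def signal_counts(labels):
    """Count per-message "buy" and "sell" labels for a day."""
    return labels.count("buy"), labels.count("sell")


# One long, negative article and one short article without negative words.
long_article = ["loss"] * 10 + ["the"] * 10
short_article = ["profit", "rose", "again", "today"]

pooled = pooled_negative_share([long_article, short_article])
buys, sells = signal_counts(["sell", "buy"])  # one label per article
```

In the pooled measure the long article dominates, since 10 of the 24 pooled words are negative, whereas with one label per message the two articles offset each other with one buy and one sell.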