5 METHODOLOGY
[Table fragment: control annotation (rows) vs. full-time annotator (columns: Negative, Neutral, Positive). Surviving rows: 17 (4 %), 261 (64 %), 2 (0 %); and 1 (0 %), 1 (0 %), 30 (7 %).]
source of interannotator disagreement. Additionally, some financial expressions may be difficult to interpret, and whether they are positive or negative may depend on the context. For example, ‘dividend cuts’ may be seen as a negative signal unless accompanied by sound reasoning (see also Mitra and Mitra, 2010). Finally, writing styles such as cynicism (Hsueh et al., 2009) and inherent variability in a word’s meaning (Maks and Vossen, 2010) may be sources of interannotator disagreement.
Maks and Vossen (2010) use a third annotator to arrive at a ‘gold standard,’ and thus we also consider the possibility of using further annotators. Hsueh et al. (2009), on the other hand, conclude that it is possible to use only one expert annotator, finding that this gives 97.4% correlation with the gold standard in their case. Compared with similar annotation studies, our Kappas compare relatively well (cf. e.g. Maks and Vossen, 2010, κ=0.80). As our algorithm makes its sentiment estimates based on probabilities and uses at least 30 labeled sentences for each pattern, 100% agreement is not necessary for the algorithm to work correctly. Consequently, we do not consider it necessary to add more annotators¹⁴⁸. However, after further analysis (see section 6.1.2) we notice that a somewhat different training set may be better suited to training our algorithm.
148 The same set of sentences has been further annotated by more annotators in our parallel study. For details, see Malo et al. (2013b). However, even after further annotations, the training set remains in line with our original annotator’s categorizations.
Training set B
Our original training set gives us a high-quality set of tagged sentences, and in our view it is a solid benchmark for checking whether an algorithm classifies sentences correctly. However, our algorithm can only recognize sentiment; it cannot recognize the credibility of the author (Appendix J - Error descriptions for LPS: Company talking in advertising-like tone about its own operations) or the relevance of a sentence for a company’s success (Appendix J - Error descriptions for LPS: Inability to recognize significance of events, Positive convention of talking about something, Inability to understand the magnitude or value of items). A training set in which the ‘correct answer’ also includes deductions based on this information has more correct annotations, but those annotations are noisier from the algorithm’s point of view: our approach does not include a model for assessing relevance or credibility, and therefore sentences where an annotator considers this information effectively increase noise. To adjust our training set for better results, we create a second training set (Training set B¹⁴⁹) that is annotated by a researcher with a business background from Aalto School of Business. Thus, we have two different versions of the same training set:
A. Training set with credibility and relevance assessment: a person with a financial background reads the sentences and uses all their knowledge, except for company-specific knowledge, to annotate the sentences
B. Training set without credibility and relevance assessment: as above, but the person does not assess the credibility or relevance of the sentence.
The difference between the two training sets is illustrated by the three example sentences in Table 10:
Table 10: Differences between training set A and B
Sentence | Training set A | Training set B | Difference
“I think my company will beat its competitors” -CEO. | Slightly positive / neutral | Very positive | Credibility of author
The company’s 100th year celebration party was a great success. | Slightly positive / neutral | Very positive | Relevance of adjustment
“The company will likely beat its competitors” -Financial Times. | Very positive | Very positive | N/A
It is clear that Training set A is closer to a ‘true’ sentiment estimate, and that its annotations are superior to those of Training set B in this sense. Credibility and relevance are, however, usually not assessed from the polarized sentences. Rather, they require knowledge of the credibility of different sources, of the relevance of different events to companies, and so on. As this would require us to identify a much wider variety of objects in the sentences, and would amount to a number of studies in its own right, we create Training set B so that we can directly relate polarized words to sentiment.
We create our Training set B using set A as a basis. However, we go through the neutral sentences in particular in order to detect cases where a sentence has been classified as neutral due to lack of credibility (e.g. a biased source, such as a CEO explaining how good his own products are) or lack of relevance (e.g. an event that is considered unimportant). For such cases, we annotate the sentence with its polarity even if we know that this is likely not relevant for the stock price. To further simplify the exercise, we only tag sentences on a 3-step scale in Training set B. The interannotator agreement between Training set A and Training set B is summarized in Table 11.
Table 11: Interannotator agreement between Training set A and Training set B
κ=0.61
As expected, interannotator agreement is lower in this case than for the control annotation, as the annotation instructions differed. In particular, many sentences that the first, “stricter” annotator characterizes as neutral are assigned a sentiment by the second annotator. For example, a sentence that is written in a positive tone but is not relevant for the company should be annotated as neutral in Training set A, but as positive in Training set B.
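As a rough cross-check of the reported agreement, Cohen's kappa can be recomputed from the cell shares given for Table 11. The sketch below is illustrative only: the function name is ours, the inputs are the rounded percentages rather than the underlying counts, and the row/column orientation is an assumption (kappa is unaffected by swapping rows and columns).

```python
# Cell shares reported for Table 11, in percent of all sentences.
# Assumption: rows and columns both follow the order Positive, Neutral, Negative.
TABLE_11 = [
    [23, 15, 0],
    [2, 46, 0],
    [0, 5, 9],
]


def cohens_kappa(matrix):
    """Cohen's kappa for a square agreement matrix of counts or shares."""
    size = len(matrix)
    total = sum(sum(row) for row in matrix)
    # Observed agreement: share of items on the diagonal.
    observed = sum(matrix[i][i] for i in range(size)) / total
    # Chance agreement: product of the two annotators' marginal shares.
    row_share = [sum(matrix[i]) / total for i in range(size)]
    col_share = [sum(matrix[i][j] for i in range(size)) / total
                 for j in range(size)]
    expected = sum(r * c for r, c in zip(row_share, col_share))
    return (observed - expected) / (1 - expected)


kappa = cohens_kappa(TABLE_11)
```

On these rounded shares the result is roughly 0.62, close to the reported κ=0.61; the small gap is plausibly due to rounding in the published percentages.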
5.2 Sentiment aggregation
In this sub-section, we deal with several considerations that arise when calculating sentiment scores for full articles (as opposed to individual sentences), and when aggregating the sentiment scores of multiple articles on a given day into one sentiment score. We first discuss the aggregation technique we have employed, and then move on to additional considerations that we have not yet covered in our discussion of the sentiment estimation methodology.
5.2.1 Daily sentiment aggregation
Once we have applied our different methods of investor sentiment estimation, we need to aggregate the polarized results for a document. If there are several articles for a given company on a given day, we further aggregate the sentiment scores of these articles into a single sentiment score for that firm and day.
Table 11 cell shares (% of sentences; rows: Training set B, columns: Training set A)
Training set B \ Training set A | Positive | Neutral | Negative
Positive | 23 % | 15 % | 0 %
Neutral | 2 % | 46 % | 0 %
Negative | 0 % | 5 % | 9 %
To ensure that our sample articles are relevant, we start by filtering news based on their characteristics. As explained in Section 4.4.3, we have already removed news items with fewer than 100 words. In addition, for the word count we require each news item to contain a minimum of three negative words, at least two of them unique (cf. Tetlock et al., 2008). Similarly, we require documents to have at least two sentences with negative polarity (our word count may still use news items that fail this sentence criterion). The exclusion is done in order to eliminate stories that contain only tables or lists with mostly quantitative information. For example, a table might contain an individual word multiple times in its header, and could thus add considerable noise to the sentiment score if it were included. The articles that meet the aforementioned criteria are included in our news sample and hence in our sentiment score.
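The filtering rules above can be sketched as two eligibility checks, one per estimation method. This is a minimal illustration, not our production code: the function names, the tokenized input, the `negative_lexicon` set and the polarity labels are all hypothetical.

```python
MIN_WORDS = 100          # news items below this length are dropped (Section 4.4.3)
MIN_NEG_WORDS = 3        # minimum negative words for the word count
MIN_UNIQUE_NEG_WORDS = 2
MIN_NEG_SENTENCES = 2    # minimum negative sentences for the sentence-level model


def eligible_for_word_count(words, negative_lexicon):
    """Check the length and negative-word criteria for one tokenized news item."""
    if len(words) < MIN_WORDS:
        return False
    hits = [w.lower() for w in words if w.lower() in negative_lexicon]
    return (len(hits) >= MIN_NEG_WORDS
            and len(set(hits)) >= MIN_UNIQUE_NEG_WORDS)


def eligible_for_sentence_model(words, sentence_polarities):
    """Check the length and negative-sentence criteria for one news item.

    sentence_polarities is assumed to be a list of labels such as
    "negative", "neutral" or "positive", one per sentence.
    """
    if len(words) < MIN_WORDS:
        return False
    negatives = sum(1 for p in sentence_polarities if p == "negative")
    return negatives >= MIN_NEG_SENTENCES
```

A table-like item that repeats a single negative word in its header fails the uniqueness criterion and is excluded from the word count.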
To consolidate polarized elements, Das (2010) suggests calculating ‘sents’, where a positive (‘BUY’) signal is scored as +1, a negative (‘SELL’) signal as -1, and a neutral (‘HOLD’) signal as 0. From the news items that remain after filtering, we calculate the sentiment score Neg¹⁵⁰ for document d as
$$\mathrm{Neg}_d = \frac{\text{Number of negative words (sentences) in } d}{\text{Total number of words (sentences) in } d}$$
As can be seen, we use the same method of aggregation for the word count and the Linearized Phrase-Structure model. This keeps the consolidated sentiments consistent, which allows us to better compare the methods.
Aggregation of the daily sentiment based on multiple articles with different scores can be done either by (a) using averages of the sentiment per article, or by (b) combining all articles within a day into a composite article. The chosen method can significantly affect the weight that each source gets in the sentiment score. In option (a) each document can be given a weight (an equal weight, or some other weight), while this is not possible if we aggregate all words and sentences directly into a composite article (option b). Previous studies have calculated the aggregate daily media sentiment for a company by pooling all news items together to create a composite article. The sentiment of the day is then the number of negative words in all articles during the day divided by the total number of words in all articles during the day (e.g., Tetlock et al., 2008; Engelberg, 2008). However, we choose to differ from this approach. Consider a day when six articles are published: an article with 1,000 words, 100 of which are negative; and five articles with 200 words, 0 of which are negative. With prior literature’s method, this example would yield a sentiment score of:
$$\mathrm{Neg}_t = \frac{100 + 5 \times 0}{1000 + 5 \times 200} = \frac{100}{2000} = 5\,\%$$
150 As negative news has been shown by previous literature to be most influential, we use in consolidated
The weight of an article in the daily sentiment score would thus be directly proportional to its length, which may simply be a function of writing style. While writing in a certain style may affect people’s perception, we believe that the number of sources matters more in the formation of an aggregate sentiment than the length of, and the negativity conveyed by, what may be a single source¹⁵¹. Therefore, we aggregate daily sentiments based on equal weights between sources; in other words, using an average of the articles’ sentiment scores within a day:
$$\mathrm{Neg}_t = \frac{1}{N} \sum_{d=1}^{N} \mathrm{Neg}_d$$
where N = number of articles during day t
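To make the difference between the two aggregation options concrete, the six-article example can be worked through in a few lines. This is an illustrative sketch; the function names and the `(total_words, negative_words)` tuple layout are our own choices for the example.

```python
# One article with 1,000 words (100 negative) and five with 200 words (0 negative).
articles = [(1000, 100)] + [(200, 0)] * 5  # (total_words, negative_words)


def composite_score(items):
    """Option (b): pool all articles of the day into one composite article."""
    total_words = sum(total for total, _ in items)
    negative_words = sum(negative for _, negative in items)
    return negative_words / total_words


def equal_weight_score(items):
    """Option (a) with equal weights: average the per-article Neg scores."""
    scores = [negative / total for total, negative in items]
    return sum(scores) / len(scores)


composite = composite_score(articles)        # 100 / 2000
equal_weight = equal_weight_score(articles)  # (0.10 + 5 * 0) / 6
```

The composite article gives 5 %, whereas equal weighting gives about 1.7 %: the single long, negative article dominates the first figure but counts as only one of six sources in the second.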
The roots of our approach can be traced back to behavioral finance theory¹⁵². According to mental accounting, people do not aggregate related information rationally but consider each piece in isolation. As discussed, aggregating using composite articles overweights lengthy articles relative to the number of articles. We hypothesize the following: agents do not pool the different news items of a day, but use a 1/N-style heuristic in forming their sentiment estimate for the day, which leads to equal weighting of news, that is, averaging.
5.2.2 Considerations on daily sentiment aggregation
Prior studies have estimated daily sentiment scores using methods other than the simple fraction of negative words to total words. For instance, Tetlock (2007) standardizes the negative fraction as follows:
$$\mathrm{neg} = \frac{\mathrm{Neg} - \mu_{neg}}{\sigma_{neg}}$$
where µneg is the mean of Neg over the previous 365 days and σneg is the standard deviation over the same period. Standardization might be needed, for instance, if different publications change their coverage style, or if new publications have been added to a news database during the sample period. However, we do not follow this approach, as previous studies have not found a significant difference between standardization and the simple fraction (Tetlock et al., 2008; Engelberg, 2008). Also, there may be a justified shift in negativity during a sample period. For example, our sample spans the financial crisis. It could therefore be justified that the sentiment changes over time, and smoothing the sentiment with standardization would distort the correct sentiment.
151 In fact, proxies of impact should be the prestige and number of readers of a source in a more sophisticated sentiment algorithm.
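For reference, the standardization that we choose not to apply can be sketched as follows. The function name and input layout are hypothetical, and the use of the sample standard deviation is our assumption; the text above does not say whether the sample or population estimator is meant.

```python
from statistics import mean, stdev


def standardized_neg(previous_values, current_value):
    """Standardize today's Neg using the values from the trailing window.

    previous_values: Neg scores from the previous 365 days (assumed input).
    current_value: today's Neg score.
    """
    if len(previous_values) < 2:
        raise ValueError("need at least two past observations")
    sigma = stdev(previous_values)  # assumption: sample standard deviation
    if sigma == 0:
        raise ValueError("standard deviation of the window is zero")
    return (current_value - mean(previous_values)) / sigma
```

A persistent rise in Neg, such as during the financial crisis, gradually raises the window mean, so later values of the standardized series understate the shift. This is the distortion discussed above.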
Besides standardization, term-weighting has been used when estimating daily investor sentiment (Loughran and McDonald, 2011). The method takes into account the length of a document, the frequency of terms, and the commonality of terms within the entire corpus. According to Loughran and McDonald, term-weighting can be especially beneficial when using a dictionary that is not tailored to the context in which it is used: for example, using the Harvard psychology dictionary in a financial context. However, as we are using a context-specific dictionary and wish to stay consistent with other studies (only Loughran and McDonald have used term-weighting), we refrain from using it.
As we are using closing prices, we need to take into account the sentiment changes occurring during weekends in order to accurately reflect the relationship between financial metrics and sentiment. Therefore, we calculate the sentiment scores for Mondays by adding the sentiment of the weekend to the sentiment of Monday. By doing so, we take into account that the change from Friday’s closing price to Monday’s closing price includes news from Saturday, Sunday and Monday. We also use this method for other days when the stock market has been closed.
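The weekend adjustment can be sketched as below. This is one possible reading rather than a definitive implementation: we take "adding" literally and sum the daily scores of closed days into the next trading day (pooling the articles before averaging would be an alternative reading). The function names and the `holidays` argument are hypothetical.

```python
from collections import defaultdict
from datetime import date, timedelta


def next_trading_day(day, holidays=frozenset()):
    """Return day itself if the market is open, else the next open day."""
    while day.weekday() >= 5 or day in holidays:  # 5 = Saturday, 6 = Sunday
        day += timedelta(days=1)
    return day


def roll_to_trading_days(daily_scores, holidays=frozenset()):
    """Add the scores of closed days to the next trading day's score.

    daily_scores: dict mapping calendar date -> daily sentiment score.
    """
    rolled = defaultdict(float)
    for day, score in daily_scores.items():
        rolled[next_trading_day(day, holidays)] += score
    return dict(rolled)


scores = {
    date(2013, 3, 1): 0.02,  # Friday
    date(2013, 3, 2): 0.01,  # Saturday
    date(2013, 3, 3): 0.03,  # Sunday
    date(2013, 3, 4): 0.05,  # Monday
}
rolled = roll_to_trading_days(scores)
```

In this example, Friday keeps its own score, while Monday's score becomes the sum of the Saturday, Sunday and Monday scores.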
Once we have calculated these sentiment scores, we further check them for seasonality and industry trends to make sure that no systematic bias affects the sentiment. We test whether there are time periods when negative news is more common, and whether negative news is consistently reported more in a certain industry. As we noted when discussing standardization, there may be a good explanation for such trends: for example, a certain industry could be steadily declining in value and therefore warrant a steady increase in negative sentiment towards it. Nevertheless, we wish to make sure that there is a logical explanation behind such trends, and that they are not simply a matter of which news items are included in our sample.