5 METHODOLOGY
5.3 SENTIMENT ESTIMATION METHODOLOGY LIMITATIONS
Das (2010) illustrates the inverse relationship between data volume and algorithm complexity with the data and algorithms pyramids, reproduced in Figure 17. In general, the ‘bag-of-words’ method sits on the lowest level of the pyramid, whereas our methodology, the Linearized Phrase-Structure model, sits on the content level. However, as is evident from the figure, there is still work to be done to reach the context level.
Figure 17: The data and algorithms pyramids (Das, 2010)
So far, the prevalent methodology in the extant literature has been a naïve word count based on different dictionaries. While simple and fast to use, the word count method has its limits. In their influential article, Loughran and McDonald (2011) show that dictionaries unrelated to the context of the data misclassify words. As a result, they create word lists for the financial context, which significantly improve the results of a word count methodology. Yet the sentiment derived from a word count with context-specific dictionaries remains a naïve proxy for the actual sentiment: simply counting the words of a text cannot yield an understanding of its meaning. The pitfalls of the methodology are many. For instance, the word ‘bad’ counts as a negative word in both ‘bad result’ and ‘not a bad result’, and sarcastically written text can be misinterpreted outright.

As an alternative way of measuring sentiment, we have proposed that the Linearized Phrase-Structure model can yield better results by recognizing common patterns in financial text that a word count cannot. While our methodology is an improvement vis-à-vis the prevalent methodology, it is still far from the sentiment that multiple human annotators would derive. Unlike a human annotator, the Linearized Phrase-Structure model cannot detect topics, judge the relevance of a text, or assess its credibility. Nor can we pinpoint temporal differences in the information in a text. All in all, our methodology is a significant improvement over the naïve word count methodology; however, there is still significant room for improvement.
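The ‘bad result’ pitfall can be reproduced with a minimal sketch of a naïve word count. The word list and the function name `negative_fraction` below are hypothetical and chosen only for illustration; they are neither the Loughran and McDonald (2011) lists nor the implementation used in this study.

```python
import re

# Hypothetical mini-dictionary, for illustration only
# (not the Loughran and McDonald word list).
NEGATIVE_WORDS = {"bad", "loss", "weak", "decline"}


def negative_fraction(text: str) -> float:
    """Share of tokens found in the negative word list (naive word count)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    negatives = sum(1 for token in tokens if token in NEGATIVE_WORDS)
    return negatives / len(tokens)


# Both phrases register a negative word, although the second one
# expresses a favourable outcome: the count ignores the negation.
print(negative_fraction("bad result"))
print(negative_fraction("not a bad result"))
```

Both calls return a positive negative-word share, which shows that a pure count cannot separate a negated expression from a plain one.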
Our relevancy filtering is relatively limited, and we do not differentiate between topics or their relative importance. Ideally, we would retrieve all news that affects the sentiment of a company and then sort it by relevance. At the moment, we filter implicitly, as we search for news based on the companies’ tickers. This has two consequences. First, should a news item be important but not mention the company (e.g., important industry news), we may miss it in our sentiment estimate. Second, we give all news the same weight, regardless of relative relevance. In reality, we might be better off giving each news item a sentiment score with a weight that depends on the topic it covers. Such a relevance assessment could operate on two levels: first, we would assess which topics are relevant for the company, and by how much; second, we would assess on a sentence level how relevant each sentence is for the topic. Naturally, implementing topic recognition and finding relative relevancy weights would not be a trivial task, and could be a topic for future research.
The Linearized Phrase-Structure model cannot assess the credibility of sentences. The algorithm therefore operates in a child-like manner, believing everything it sees. This can lead to several biases that cause sentences to be tagged differently than a human annotator would tag them. Ideally, we would assess the credibility of different authors and adjust for authors who tend to write in a certain way. For example, some publications may be more favorable in their writing style. A case in point is a company authoring an article that discusses its own operations in a positive manner.
Another caveat relating to credibility is our aggregation method. Currently, we assign each article equal weight regardless of the publication, which is a naïve way of weighting publications. For instance, an article in The Financial Times would most likely have more impact than a local newspaper article due to its larger circulation and higher perceived credibility. However, adjusting the methodology to take these factors into account is an extensive undertaking, and we leave the issue for future research. We suggest that, for example, different weights could be applied to the sentiment scores before aggregation, depending on the source.
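The suggested source weighting could be sketched as a weighted mean of per-article scores. The source labels, the weight values, and the function name `weighted_sentiment` are assumptions made purely for illustration; how such weights should be calibrated is left open here.

```python
from typing import Iterable, Mapping, Tuple

# Hypothetical source weights, for illustration only;
# no calibrated values are proposed in the text.
SOURCE_WEIGHTS: Mapping[str, float] = {"national_daily": 3.0, "local_paper": 1.0}
DEFAULT_WEIGHT = 1.0


def weighted_sentiment(
    articles: Iterable[Tuple[str, float]],
    weights: Mapping[str, float] = SOURCE_WEIGHTS,
) -> float:
    """Weighted mean of article scores; each article is (source, score)."""
    total_weight = 0.0
    weighted_sum = 0.0
    for source, score in articles:
        # Sources without an explicit weight fall back to the default.
        weight = weights.get(source, DEFAULT_WEIGHT)
        total_weight += weight
        weighted_sum += weight * score
    if total_weight == 0.0:
        raise ValueError("no articles, or all weights are zero")
    return weighted_sum / total_weight


articles = [("national_daily", 0.5), ("local_paper", 0.1)]
print(weighted_sentiment(articles))              # source-weighted aggregate
print(weighted_sentiment(articles, weights={}))  # equal weights per article
```

Passing an empty weight mapping reproduces the equal-weight aggregation, so the two schemes can be compared on the same set of articles.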
The Linearized Phrase-Structure model does not distinguish between the time periods that the information in a news item describes: we give all retrieved news equal weight, regardless of the time period of the information in question. For instance, a story describing a company’s history is handled in exactly the same way as a story bringing new information to the market or a story speculating on the future. In reality, markets react very differently to new information vis-à-vis old information.[156] One approach to overcoming this limitation would be to detect topics as they appear for the first time and to discount secondary news, that is, ‘news of news’. Another approach would be to keep track of the time aspect when estimating sentiment and to use that information in the estimation process. For example, Cahan et al. (2011) hypothesize that gathering speculations around future dates can help with the use of sentiment information.
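The first approach, discounting ‘news of news’, could look roughly like the sketch below. It assumes that each news item already carries a topic label and that items arrive in chronological order; topic detection itself is not shown. The function name `novelty_weights` and the discount of 0.5 are hypothetical.

```python
from typing import Iterable, List


def novelty_weights(topics: Iterable[str], repeat_weight: float = 0.5) -> List[float]:
    """Weight 1.0 for the first item on a topic, `repeat_weight` for follow-ups.

    `topics` holds one (assumed, pre-assigned) topic label per news item,
    in chronological order.
    """
    seen = set()
    weights = []
    for topic in topics:
        weights.append(repeat_weight if topic in seen else 1.0)
        seen.add(topic)
    return weights


stream = ["profit_warning", "ceo_change", "profit_warning", "profit_warning"]
print(novelty_weights(stream))  # [1.0, 1.0, 0.5, 0.5]
```

The resulting weights could then be applied to the per-item sentiment scores before aggregation, so that repeated coverage of the same event counts for less than its first appearance.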
In addition to the aforementioned considerations, our choice to focus on the fraction of negative sentences and words can be questioned.[157] It is a valid point that other metrics may be more useful than negativity in estimating sentiment. However, the extant literature has documented on several occasions that negativity outperforms other metrics (e.g., Tetlock, 2007). Therefore, we conclude that our choice to focus on the negativity of a given text is well-founded.
Another consideration is the qualitative data we use to estimate sentiment. As our sample consists of qualitative texts from the LexisNexis database, we may miss important publications that the database does not include. That said, LexisNexis covers different sources of qualitative texts quite comprehensively. Nevertheless, we may miss some publications due to copyright and coverage issues. Furthermore, we miss qualitative texts that focus on specific products of companies but do not mention the company by name. Such texts, and the sentiment in them, most likely carry significant value and affect financial metrics. We also acknowledge that our sentiment estimate entirely omits social media and non-written media, i.e., TV and radio broadcasts. However, accounting for these factors is not a trivial task, and we therefore leave the matter for future research.
We conclude that an ideal sentiment model would mimic the key stages of an analyst’s thought process in assessing the impact that a news article has on a company, and would draw conclusions similar to those of a financial analyst. In addition, such a model should incorporate all available information that is relevant to a firm. We find that while our sentiment estimates are a significant improvement over the word count methodology, more work remains to be done on estimating sentiment more accurately.

[156] However, we do recognize that tone can in itself have a significant impact, even in the absence of new information (content).

[157] Previous studies have identified that negative sentiment appears to be the most influential one. However, Das (2010), for example, finds that a daily ‘disagreement sentiment’, which proxies the disagreement in the market by counting both positive and negative signals, can be used to estimate how well the negative sentiment works. In times when there is both positive and negative information on the market, the predictive power of the negative sentiment tends to decrease. We have aimed to take this into account by using the negative fraction of all words, but it is possible that taking the positive sentiment into account in the way Das suggests could further improve our results.