3 Design of a Classifier for Investor Sentiment in Blogs 89
3.3 Evaluation
This section evaluates the configurations of the document-to-vector-transformation and of the SVM learning machine, determined in the last section, with respect to the accuracy of classifying the sentiment orientation. It reports the results of an exploratory experimental analysis that searches for the most accurate configuration of two parameters, namely the text representation and the C-parameter of the SVM learning machine, using this thesis’ corpus and a cross-validation approach.
3.3.1 Cross-Validation Approach
A cross-validation approach was used for evaluating the classifier of the sentiment orientation of investor sentiment in terms of accuracy and for conducting experiments with respect to
parameter configurations. The general problem in evaluating a supervised machine learning classifier is that human-annotated (i.e., labeled) blog documents are required both for (1) training the classifier and (2) evaluating the classifier. These annotated examples are provided by this thesis’ corpus designed in Section 3.1, so the number of annotated blog documents is limited to the size of the corpus. Because training and evaluation of the classifier must be conducted on separate sets of annotated blog documents, a strategy is needed that uses the limited annotations efficiently while still yielding a reliable accuracy estimate.
The k-fold cross-validation approach randomly divides the set of annotated blog documents of a corpus into k non-overlapping, equally sized folds (e.g., (Kohavi, 1995)). Stratified cross-validation additionally requires each fold to contain approximately the same proportion of positively and negatively labeled blog documents as the whole corpus (Kohavi, 1995). Training and evaluation are conducted k times: each of the k folds is used as the test set once while all other folds together form the training set (e.g., (Kohavi, 1995)). The training set is used to train the supervised machine learning classifier (see Section 2.4.3.2), and the classifier is then applied to classify the blog documents in the test set. The role of the test set is rotated among all folds. Finally, the classification accuracy estimate is computed as the share of correctly classified blog documents over all k test sets (e.g., (Kohavi, 1995)). In this thesis, k=10 was used, which is common practice in sentiment classification (e.g., (Ng et al., 2006; Pang & Lee, 2004; Wang & Manning, 2012)), and the folds were stratified as proposed by Kohavi (1995). Note that the folds were composed only once, as defined in this section. That is, each experiment (described in the next section) was based on the same folds, i.e., folds containing the same blog documents.
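The procedure described above can be illustrated with a minimal Python sketch. It is not the implementation used in this thesis (which relied on GATE); the function names, the round-robin assignment, and the fixed random seed are assumptions made for illustration only. The fixed seed mirrors the fact that the folds were composed once and then reused for every experiment.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=10, seed=0):
    """Assign document indices to k folds with similar class proportions.

    Hypothetical helper: a fixed seed keeps the fold composition static,
    so that every experiment can reuse exactly the same folds.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(k)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal the shuffled indices of this class round-robin over the folds.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds


def cross_validated_accuracy(docs, labels, folds, train_fn, predict_fn):
    """Pool the correct predictions of all k test folds into one accuracy."""
    correct = 0
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in range(len(docs)) if i not in held_out]
        # train_fn and predict_fn are placeholders for any classifier.
        model = train_fn([docs[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        for i in test_idx:
            if predict_fn(model, docs[i]) == labels[i]:
                correct += 1
    return correct / len(docs)
```

Here, `train_fn` and `predict_fn` stand in for the training and application of the SVM classifier. The accuracy is pooled over all test documents rather than averaged per fold, which matches the description above.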
3.3.2 Experiments and Results
Using the cross-validation approach and this thesis’ corpora, experiments for choosing the parameter configurations of the text representation (see Section 3.2.1.1) and the C-parameter (see Section 3.2.2) were conducted with the objective of maximizing the accuracy relative to the accuracy of a baseline SVM classifier configuration. For the machine learning experiments, a slightly modified version of the corpora presented in Section 3.1 was used, in which all company mentions in the body and title of the blog documents were exchanged for a neutral “[comp]”-string. This treatment is intended to prevent the classifier from associating company names with a sentiment orientation class and to make it abstract away from specific companies, which potentially increases its generalization ability. For each experiment, a combination of the title and the body of each blog document was used.
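The exchange of company mentions for a neutral string can be sketched as follows. This is an illustrative assumption, not the procedure used for this thesis’ corpora: the function name `mask_companies` and the list-based, case-sensitive matching are hypothetical, and the company names in the usage example are arbitrary.

```python
import re


def mask_companies(text, company_names, placeholder="[comp]"):
    """Replace every listed company mention with a neutral placeholder."""
    if not company_names:
        return text
    # Try longer names first so that "Apple Inc." wins over "Apple".
    names = sorted(company_names, key=len, reverse=True)
    alternatives = "|".join(re.escape(name) for name in names)
    # Lookarounds ensure a mention is not part of a longer word.
    pattern = re.compile(r"(?<!\w)(?:" + alternatives + r")(?!\w)")
    return pattern.sub(lambda match: placeholder, text)


masked = mask_companies(
    "Apple Inc. beats Microsoft; apple pie",
    ["Apple", "Apple Inc.", "Microsoft"],
)
```

Because the matching in this sketch is case-sensitive, the lowercase "apple" in the example is left untouched.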
Baseline
The baseline configuration comprises the parameter configurations derived in Section 3.2 and summarized in Table 6. Unigrams (i.e., the simplest configuration) were used as baseline
text representation. C=1 was used as the baseline C-parameter configuration. This is the default configuration in the GATE implementation of SVM (Cunningham et al., 2011, p.371), which was used in this thesis. The baseline configuration achieves an accuracy of 76.2% on Corpus A. Corpus A contains a mix of blog documents with one sentiment orientation (regarding one or multiple stocks) and blog documents with multiple sentiment orientations (regarding multiple stocks), of which one was selected randomly for training and evaluation. Such a mix also occurs in out-of-sample blog documents. Thus, the accuracy on Corpus A can be assumed to be a good estimate of the out-of-sample accuracy.
Effect of Blog Documents with Multiple Sentiment Orientations
Two experiments were conducted with a classifier trained on Corpus B, which contains only blog documents annotated with one sentiment orientation, to study the effect of blog documents with multiple sentiment orientations. The results were compared to the result of the baseline classifier, which uses Corpus A; Corpus A essentially extends Corpus B with blog documents annotated with multiple sentiment orientations. In the first experiment, training on Corpus B and evaluating on Corpus A did not yield a higher accuracy than training and evaluating on Corpus A. That is, including blog documents with multiple sentiment orientations in the training set was not found to decrease the accuracy. In the second experiment, training and evaluating only on blog documents with one sentiment orientation (on Corpus B) yielded a slightly higher accuracy than the baseline. Table 8 provides an overview of the results of the experiments.
Table 8: Effects of documents annotated with one vs. multiple sentiment orientations on training and evaluating a classifier for the sentiment orientation.
Experiment | Baseline: Train and test with multiple sent. orientations | Train with one sentiment orientation | Train and test with one sentiment orientation
Training corpus | A | B | B
Testing corpus | A | A | B
Hypothesis | n/a | Accuracy is higher than baseline. | Accuracy is higher than baseline.
Accuracy | 76.2% | 75.5% | 77.0%
Support for the hypothesis | n/a | No support | Support
Turning to the experiments in detail, it seems reasonable to hypothesize that training a classifier on Corpus A, which contains blog documents with multiple sentiment orientations, reduces the accuracy due to “distractions” by vocabulary that does not refer to the target class of a training example. This setup corresponds to the baseline experiment. An alternative formulation of the hypothesis is that training a classifier only on blog documents that were annotated with one sentiment orientation (i.e., from Corpus B) yields a higher accuracy than the baseline. To test the hypothesis, the baseline configuration was used to train a
classifier using Corpus B and evaluate on Corpus A to be able to compare results to the baseline result.
To conduct the experiment, each of the ten test sets of Corpus B used in the 10-fold cross-validation was enriched with five randomly chosen blog documents that are part of Corpus A but not of Corpus B. The assignment was random but static for the whole cross-validation. Of the 50 blog documents used in total for enriching the test sets, 47 had been annotated with multiple sentiment orientations (of which one was selected for evaluation); see Table 58 in the Appendix for a list. The remaining three blog documents (see Table 57) had been annotated with one sentiment orientation but are not part of Corpus B, because the corpus was to contain the same number of negative and positive sentiment orientations (see Section 3.1.2). These three blog documents were added to the test sets as well, so that all test sets together contain the same blog documents as Corpus A and the cross-validation result can be compared to the baseline result. The resulting accuracy of the experiment is 75.5%.
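The enrichment of the test sets can be sketched in Python as follows. The sketch rests on assumptions: the function name, the representation of blog documents by integer identifiers, and the seed are hypothetical, and only the test sets are extended, so that in each round the classifier is still trained on the remaining Corpus B folds alone.

```python
import random


def enrich_test_folds(folds_b, extra_ids, per_fold=5, seed=0):
    """Append a fixed random share of extra documents to each test fold.

    folds_b   -- test folds of Corpus B (lists of document identifiers)
    extra_ids -- documents that are in Corpus A but not in Corpus B
    """
    if len(extra_ids) != per_fold * len(folds_b):
        raise ValueError("extra documents must divide evenly over the folds")
    rng = random.Random(seed)  # fixed seed: random but static assignment
    shuffled = list(extra_ids)
    rng.shuffle(shuffled)
    enriched = []
    for n, fold in enumerate(folds_b):
        extras = shuffled[n * per_fold:(n + 1) * per_fold]
        enriched.append(list(fold) + extras)
    return enriched
```

With ten folds and 50 extra blog documents, each enriched test set receives exactly five additional documents, and every extra document is evaluated exactly once.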
The experiment’s accuracy is slightly lower than the baseline accuracy. This result provides an indication for rejecting the hypothesis. That is, no indication was found for a positive (negative) effect of blog documents with one (multiple) sentiment orientation on the accuracy. A possible reason is the smaller number of training blog documents in Corpus B compared to Corpus A, which would indicate that the number of training examples is more important for increasing the accuracy. Consequently, Corpus B was not used in the experiments for determining the parameter configurations below. Rather, Corpus A was used for training and evaluation, as in the baseline experiment.
Finally, to get an indication of the effect of blog documents with multiple sentiment orientations on the evaluation, a second experiment was conducted in which a classifier was trained and evaluated on Corpus B, which consists only of blog documents with one sentiment orientation. The respective hypothesis is that the accuracy is higher than in the baseline experiment (of training and evaluating on Corpus A) because Corpus B represents the natural configuration for document level classification. The resulting accuracy of 77.0% seems to indicate support for the hypothesis, although the results are not directly comparable because the evaluation corpora are not identical. However, the baseline accuracy, which evaluates a one-sentiment-orientation-per-document classifier on a corpus containing multiple sentiment orientations, is only slightly lower. This result can be interpreted as a supporting argument for this thesis’ document level classification approach, which does not differentiate between possible multiple sentiment orientations regarding multiple stocks in one blog document.
Choosing the Text Representation
In Section 3.2.1.1, text representations that are common and suitable for machine learning sentiment classification were presented and discussed. As a baseline text representation unigrams were used, which are a typical choice (e.g., (Sebastiani, 2002, p.10)). Adding higher order n-grams (i.e., bigrams and trigrams) can provide additional information but also
increases the number of features (see Section 3.2.1.1). To choose the best text representation regarding the overall Corpus A, experiments of altering only the text representation of the baseline configuration were conducted. The results are listed in Table 9.
Table 9: Accuracy of the classifier using a specific text representation.
Text representation | Unigrams (baseline) | Unigrams & bigrams | Unigrams & bigrams & trigrams
Accuracy | 76.2% | 79.2% | 77.0%
Clearly, adding bigrams to unigrams improved the accuracy appreciably with respect to the baseline configuration. Further adding trigrams did not improve the accuracy with respect to using unigrams and bigrams. Thus, unigrams & bigrams were used as text representation for this thesis’ document-to-vector-transformation.
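The three text representations compared above differ only in the maximum n-gram order. The following minimal Python sketch shows how the corresponding features could be derived from a tokenized blog document. The function name and the whitespace-joined feature strings are assumptions for illustration; the actual feature extraction in this thesis was carried out in GATE.

```python
def ngram_features(tokens, max_n=2):
    """Return all n-grams of order 1..max_n as space-joined strings."""
    features = []
    for n in range(1, max_n + 1):
        # Slide a window of length n over the token sequence.
        for start in range(len(tokens) - n + 1):
            features.append(" ".join(tokens[start:start + n]))
    return features


tokens = ["shares", "of", "[comp]", "rose"]
unigrams = ngram_features(tokens, max_n=1)
unigrams_bigrams = ngram_features(tokens, max_n=2)
unigrams_bigrams_trigrams = ngram_features(tokens, max_n=3)
```

For the four example tokens, the sketch yields four unigrams, three additional bigrams, and two additional trigrams, which illustrates how the number of features grows with the n-gram order.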
Choosing the C-Parameter
The C-parameter can be used for tuning the accuracy of the SVM training algorithm (see Section 2.4.3.4). In Section 3.2.2, suitable configurations of the C-parameter were derived from the literature. In an exploratory analysis, the baseline configuration in combination with the unigram & bigram text representation was used to evaluate each of the C-parameter configurations on Corpus A. The resulting accuracies are listed in Table 10. In the explored C-parameter range, the accuracy neither increased nor decreased with respect to the default configuration of C=1. Thus, C=1 was used in the following.
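The exploratory analysis over the C-parameter can be sketched as a simple search over the exponential grid C=2^i listed in Table 10. The sketch is illustrative only: `evaluate` is a hypothetical callback that would return the cross-validated accuracy for a given C, and the rule of keeping the default unless another value is strictly better is an assumption consistent with the choice of C=1 described above.

```python
def c_grid(low=-5, high=20):
    """Exponential grid C = 2**i for i = low, ..., high."""
    return [2.0 ** i for i in range(low, high + 1)]


def select_c(grid, evaluate, default=1.0):
    """Keep the default C unless another grid value is strictly better."""
    best_c, best_acc = default, evaluate(default)
    for c in grid:
        acc = evaluate(c)
        if acc > best_acc:
            best_c, best_acc = c, acc
    return best_c, best_acc


grid = c_grid()
# Stand-in for the cross-validation: constant accuracy, as in Table 10.
chosen_c, chosen_acc = select_c(grid, lambda c: 0.792)
```

With a constant accuracy over the whole grid, the sketch retains the default value of C.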
The resulting final classifier’s level of accuracy surpasses the accuracy of classifiers of investor sentiment in relevant related work (e.g., <70% in Das & Chen (2007) and 75.1% in O’Hare et al. (2009)). In contrast to many other approaches (see Section 2.6), the classification accuracy was evaluated and made transparent. Using the best parameter configurations in terms of accuracy reported in this section, a classifier was trained on the overall Corpus A (using all training examples of annotated blog documents consisting of title and body with specific company mentions exchanged for a “[comp]”-string). This classifier serves in Section 4 for classifying the (sentiment orientation of) investor sentiment of large datasets of investment blog documents. These investor sentiments are to be validated in a portfolio simulation.
Table 10: Accuracy of the classifier using a specific C-parameter.
i | C=2^i | Accuracy
–5 | 0.03125 | 79.2%
–4 | 0.0625 | 79.2%
–3 | 0.125 | 79.2%
–2 | 0.25 | 79.2%
–1 | 0.5 | 79.2%
0 | 1 | 79.2%
1 | 2 | 79.2%
2 | 4 | 79.2%
3 | 8 | 79.2%
4 | 16 | 79.2%
5 | 32 | 79.2%
6 | 64 | 79.2%
7 | 128 | 79.2%
8 | 256 | 79.2%
9 | 512 | 79.2%
10 | 1,024 | 79.2%
11 | 2,048 | 79.2%
12 | 4,096 | 79.2%
13 | 8,192 | 79.2%
14 | 16,384 | 79.2%
15 | 32,768 | 79.2%
16 | 65,536 | 79.2%
17 | 131,072 | 79.2%
18 | 262,144 | 79.2%
19 | 524,288 | 79.2%
20 | 1,048,576 | 79.2%