4. Tools and Results
4.2. Automatic Classification Results
4.2.5. Combining N-gram and Sentiment Features
Next, we extend our experiments to an n-gram-based approach for the cosine similarity feature extraction. In this part of the research, we cover one-gram, tri-gram, and five-gram similarity analysis and then combine the n-grams with the same sentiment features that we used in the unigram classifications. To perform n-gram analysis, we start by creating IDF dictionaries for the different n-grams, as in publication [3], because the dictionary we have been using so far is unigram-based and cannot be applied to higher-order n-grams. The unigram and one-gram approaches are identical, except that for the one-gram approach we create a new word-based IDF dictionary from our dataset, whereas the unigram approach uses an existing dictionary. The theory behind using n-grams is that taking word order into account can improve classification performance (Khreisat 2006; Bespalov et al. 2011). Limiting the analysis to one-grams, tri-grams, and five-grams means that we treat sequences of one, three, or five words found in the texts as the features of interest. We are also interested in testing the performance of the different n-grams against each other. In theory, an n-gram approach can lead to either better or worse performance. In texts that contain few tri-grams or five-grams, performance may drop because there are no matches. Previous studies using n-grams have shown that going beyond tri-grams does not necessarily increase classification performance (Fürnkranz 1998). Our approach has some similarities to the approach used by Bespalov et al. (2011). Here we partly answer the second and third research questions by showing that combining sentiment features with n-gram features can improve
performance with the naïve Bayes classifier; however, we also show that this is not always the case.
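The dictionary-building step described above can be illustrated with a minimal sketch. It assumes whitespace-tokenized documents and the plain IDF form log(N / df); the exact IDF variant used in publication [3] is not stated here, and the function names `ngrams` and `build_idf` as well as the toy documents are hypothetical.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Return all word n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def build_idf(documents, n):
    """Map each n-gram to log(N / df), where df is the number of
    documents that contain the n-gram at least once (assumed IDF form)."""
    doc_freq = Counter()
    for tokens in documents:
        # set() so that each document counts an n-gram only once
        doc_freq.update(set(ngrams(tokens, n)))
    total = len(documents)
    return {gram: math.log(total / df) for gram, df in doc_freq.items()}

# Toy documents, invented for illustration only.
docs = [
    "the service was very good today".split(),
    "the service was very slow today".split(),
    "delivery was late".split(),
]

idf_1 = build_idf(docs, 1)  # one-gram dictionary
idf_3 = build_idf(docs, 3)  # tri-gram dictionary
idf_5 = build_idf(docs, 5)  # five-gram dictionary

# A three-word text yields no five-grams at all, which is the
# "no matches" situation mentioned in the text.
print(ngrams(docs[2], 5))
```

Even in this toy example the short third document contributes nothing to the five-gram dictionary, which illustrates why higher-order n-grams can reduce performance on short texts.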
Moving to higher-order n-gram analysis also raises the computational requirements, because the dictionaries become many times larger and more category TF-IDF values need to be calculated and compared. In the unigram models, we experimented with different numbers of top-weighted TF-IDF words and settled on the top 15,000 weighted words, as our experiments showed that performance increased only marginally beyond that point. Our research in [3] found that the number of weighted words per category had to be raised above 15,000 for n-gram classification, because a category can contain over two million n-gram words, while most unigram categories contained only around 100,000 words. We tested a couple of different sizes and ended up using the top 100,000 TF-IDF weighted category words for tri-grams and the top 120,000 TF-IDF weighted category words for the five-gram analysis. This was done to scale the number of words per category with the order of the n-grams.
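The selection of the top-weighted category n-grams can be sketched as follows. This is an illustration under assumptions rather than the implementation from [3]: it weights each n-gram by raw term frequency times IDF, and the names `TOP_K_BY_ORDER` and `top_weighted_ngrams` and the toy counts are hypothetical. Only the tri-gram and five-gram cut-offs are taken from the text.

```python
from collections import Counter

# Cut-offs stated in the text for tri-grams and five-grams.
TOP_K_BY_ORDER = {3: 100_000, 5: 120_000}

def top_weighted_ngrams(category_counts, idf, k):
    """Keep the k n-grams with the highest TF-IDF weight in a category.

    category_counts: n-gram -> term frequency within the category
    idf:             n-gram -> inverse document frequency
    Weighting by tf * idf is an assumption made for this sketch.
    """
    weights = {g: tf * idf.get(g, 0.0) for g, tf in category_counts.items()}
    # Sort by descending weight; break ties on the n-gram for determinism.
    ranked = sorted(weights.items(), key=lambda item: (-item[1], item[0]))
    return dict(ranked[:k])

# Toy category with three tri-grams, invented for illustration.
counts = Counter({("a", "b", "c"): 4, ("b", "c", "d"): 1, ("c", "d", "e"): 3})
idf = {("a", "b", "c"): 0.5, ("b", "c", "d"): 1.5, ("c", "d", "e"): 1.0}

top = top_weighted_ngrams(counts, idf, k=2)
print(top)
```

In a real run, `k` would be looked up from `TOP_K_BY_ORDER` for the n-gram order in use, so that the retained vocabulary grows with the order of the n-grams as described above.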
Table 7 shows the performance of the one-gram classification that uses IDF values calculated from our dataset. Table 8 shows the tri-gram classification results, and Table 9 shows the performance of the five-gram classification. Adding sentiment features increases the classification performance for category 13 compared with using only similarity features. Unigram classification still has the best performance for category 8, while category 12 has the best performance using one-grams. Category 17 has the best performance using five-grams with sentiment, and for category 13 the performance is even between models. Contrary to the study by Fürnkranz (1998), the five-gram performance so far seems to be better than the tri-gram performance on average.

One-gram Similarity Classification Performance (Naïve Bayes)
Category   Accuracy   Precision   Recall   F-measure
8          77.83%     0.75        0.82     0.79
12         80.49%     0.77        0.85     0.81
13         74.73%     0.73        0.79     0.76
17         71.35%     0.81        0.56     0.66

Combined One-gram + Sentiment Performance (Naïve Bayes)
Category   Accuracy   Precision   Recall   F-measure
8          70.53%     0.64        0.92     0.76
12         68.29%     0.61        1.00     0.75
13         76.23%     0.74        0.82     0.78
17         67.19%     0.88        0.39     0.54

Table 7. One-gram classification performance using an IDF dictionary created from the dataset. Method taken from publication [3], experiments were extended to cover one-
Tri-gram Similarity Classification Performance (Naïve Bayes)
Category   Accuracy   Precision   Recall   F-measure
8          68.96%     0.63        0.92     0.75
12         73.17%     0.65        0.98     0.78
13         70.56%     0.64        0.95     0.76
17         67.45%     0.62        0.92     0.74

Combined Tri-gram + Sentiment Performance (Naïve Bayes)
Category   Accuracy   Precision   Recall   F-measure
8          67.81%     0.62        0.94     0.74
12         64.63%     0.58        1.00     0.73
13         71.84%     0.65        0.94     0.77
17         79.17%     0.94        0.62     0.75

Table 8. Tri-gram classification performance using the IDF dictionary created from the dataset. Method taken from publication [3], experiments were re-run to match the dataset and features of later extension models.

Five-gram Similarity Classification Performance (Naïve Bayes)
Category   Accuracy   Precision   Recall   F-measure
8          73.25%     0.66        0.93     0.78
12         73.17%     0.65        0.98     0.78
13         68.42%     0.62        0.94     0.75
17         69.27%     0.63        0.92     0.75

Combined Five-gram + Sentiment Performance (Naïve Bayes)
Category   Accuracy   Precision   Recall   F-measure
8          71.67%     0.65        0.95     0.77
12         64.63%     0.58        1.00     0.73
13         69.91%     0.64        0.92     0.75
17         83.33%     0.94        0.71     0.81

Table 9. Five-gram classification performance using IDF dictionary developed from the dataset. Method taken from publication [3], but experiments were re-run to match the
The performance of the models is at this point still quite far from practically usable, as we defined the threshold for practically usable performance as an F-measure of 0.9 or higher. The highest F-measure reached so far is 0.82 for category 8.
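The comparison against the usability threshold can be reproduced with a short sketch. It assumes that the reported F-measure is the balanced F1 score, that is, the harmonic mean of precision and recall; the names `f_measure` and `USABLE_THRESHOLD` are hypothetical. Because the tables report precision and recall rounded to two decimals, a value recomputed this way can differ from the reported F-measure by about 0.01.

```python
USABLE_THRESHOLD = 0.9  # practical usability threshold stated in the text

def f_measure(precision, recall):
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (precision, recall) pairs copied from Tables 7 and 9.
reported = {
    "one-gram, category 12": (0.77, 0.85),
    "five-gram + sentiment, category 17": (0.94, 0.71),
}

for name, (precision, recall) in reported.items():
    score = f_measure(precision, recall)
    usable = score >= USABLE_THRESHOLD
    print(f"{name}: F-measure = {score:.2f}, usable = {usable}")
```

Both recomputed scores fall below the threshold, consistent with the conclusion that none of the n-gram models is practically usable yet.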