To evaluate the feature mining process, we should understand what has to be considered a feature. In our work, we assume that all features contain exactly two words and consist of a combination of a noun and a verb. However, we cannot consider all of the combinations to be features. Few approaches have been examined to measure the quality of feature mining.
First, a possible approach is to find existing reviews in thematic magazines about the application and manually extract features discussed from there. This approach would also allow seeing which feature is acknowledged as a good one. That might have helped with further steps of measurement of sentiment analysis. Unfortunately, this approach contains many drawbacks:
1. We cannot affirm that outcome of professionals reviews is correlating with customersβ opinion. Therefore, this approach cannot produce valuable output for sentiment analysis comparison.
2. Professional reviews are mostly very general and not specific to concrete features of the application. Articles usually provide an overview of the app and describe user experience without evaluating the application.
45
The second possible approach is creating of a questionnaire for customers who use the application on daily basis. This survey would ask about what users consider being a feature and what they do not. Poll would also help to measure satisfaction about the feature and might help with the evaluation of sentiment analysis. Unfortunately, during our work, not enough customers was found to provide a reasonable amount of opinions.
The third approach, which was chosen for evaluation of this step, is the manual extraction of features from a sample of reviews and comparing with aspects obtained from the algorithm. The idea behind this approach is described in [2]. The process is based on an independent evaluation of samples by few coders (in work specified above seven coders worked on 2800 reviews). These coders independently from each other marked sentences that contain feature or opinion about the feature extracted aspect mentioned and measured sentiment for each feature. Each coder was also measuring time which he spent on measuring and extracting features. The amount of hours spent on 900 reviews was between 8 and 12.5 hours for every coder. Figures proof that manual extraction of reviews is a process that requires many efforts. Unfortunately, for our work we did not have enough resources to perform this kind of analysis. However, feature extraction process was based on the same approaches with using the same tools (NLTK taggers, WordNet lemmatizer, SentiStrength). Therefore, we can reuse results for feature extraction and sentiment analysis obtained in [2]
For feature measurement, three methods were used: precision, recall, and F-measure. Formula for precision is ππ
ππ+πΉπ, where TP β is amount of true positive results (when feature
presented in both manual and automatic extraction), FP β is amount of false positive results (when feature presented in automatic feature extraction but not in manual)
Formula for recall is ππ
ππ+πΉπ , where TP β is amount of true positive results, FN - amount
of false negative results (when feature is presented in manual review but not in automatic extraction). For F-measure general form was used: πΉ = 2 β ππππππ πππβππππππ
ππππππ πππ+ππππππ
An example of the output is shown in the tables below. Authors used different applications from different categories. Table 7 shows amount of features extracted from application:
46
Table 7. Amount of feature extracted [2]
In this table πΉπ refers to topic extraction with adjectives and πΉππ to extraction without adjectives. As we specified in Chapter 3 in out work we assume that adjectives are mostly represents opinion words and contain sentiments. Therefore, approach used in our work correlate to second column of the table.
Table 8. Summarizing of results for feature extraction [2]
Table 8 shows calculated precision, recall, and F-measure for results. Similarly, for the previous table, we only interested in πΉππ part of it, therefore other part of table is not provided.
Worth mentioning that average precision achieved is around 60% and recall around 50%. AngryBirds application achieves the lowest recall and precision. This application, according to authors, produced most of the disagreements during manual feature extraction.
47
Approach for feature mining used in our work, however, has some limitations. Firstly, we only concentrate on features which consist of two words. The real feature name might be any length or even be described by one word. Also, non-frequent features will not be mined using specified method. Features which described in a low amount of reviews will be simply ignored.
5.3 Sentiment Analysis
As was specified in chapter 3, SentiStrength tool with default parameters has been chosen for performing sentiment analysis. However, SentiStrength has some limitation in use. Firstly, it measures text by giving to the sentence the score equal to the highest rate of the word in it. It means that for relatively long sentences with many adjectives, total score would contain both negative and positive sentiment score with high absolute value. Secondly, for some features during a manual inspection of results SentiStrength was affected by feature itself. For example, in a sentence which refers to feature βcalorie lossβ, the word loss has strong negative sentiment score in SentiStrengthβs vocabulary. Therefore, the sentence might be possibly counted as negative. As a possible solution, manual configuration of SentiStrength vocabularies may be used. Also, for each feature might be useful to exclude words of feature itself from the dictionary. However, this approaches have not been tested during our work and can be a possible improvement for the future. Finally, SentiStrength is not able to determine irony and sarcasm. Therefore, the sentence might get incorrect sentiment score for such sentences.
To evaluate SentiStrength, we refer to [30]. In this work, authors use SentiStrength with default parameters to identify the accuracy of the tool. As input data, authors used three corpuses based on Twitter: SentiStrength Twitter corpus, Earth Hour 2014 and Earth Hour 2015. Evaluation results of accuracy are shown in Table 9:
48
Except SentiStrength authors also used other tools for sentiment analysis. Columns SS, EH2014, and EH2015 corresponds to SentiStrength corpus, Earth Hour 2014 corpus and Earth Hour 2015 corpus respectively. Worth noticing that accuracy for SentiStrength for all three corpuses is above 59%.
49
6 Summary and Future Work
This chapter contains a summary of our work. Here will be underlined what has been done and what might be subject to further development of the topic.