5. Sentiment analysis
5.3. Sentiment classification methods
As stated earlier, the basic task of sentiment analysis is to identify the sentiment polarity of a given text, which is called sentiment classification [Pang et al., 2002]. It can be done with three types of approaches: the machine learning approach, the lexicon-based approach and the hybrid approach [Maynard and Funk, 2011; Medhat et al., 2014].
Figure 5.2. Sentiment classification methods [Medhat et al., 2014].
The machine learning approach treats sentiment analysis as a text classification problem, in which a text is assigned to one of three categories according to its sentiment polarity: positive, negative or neutral. It usually relies on supervised learning methods, where a large quantity of texts labeled with their sentiment polarities is used to train classifiers. Unsupervised learning methods are used when labeled training data is difficult to obtain.
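As an illustration of the supervised setting described above, the following is a minimal sketch of a text classifier trained on labeled examples. The choice of a multinomial Naive Bayes model with add-one smoothing, the function names (tokenize, train_naive_bayes, classify) and the tiny training set are all assumptions made for illustration; the text above does not prescribe a particular classifier, and a realistic classifier would need far more labeled data.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def train_naive_bayes(labeled_texts):
    """Collect per-class document and word counts from (text, label) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in labeled_texts:
        class_counts[label] += 1
        for token in tokenize(text):
            word_counts[label][token] += 1
            vocabulary.add(token)
    return class_counts, word_counts, vocabulary


def classify(text, model):
    """Return the label with the highest log-probability for the text."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        # Log prior of the class.
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for token in tokenize(text):
            if token not in vocabulary:
                continue  # ignore words never seen during training
            # Add-one (Laplace) smoothed word likelihood.
            count = word_counts[label][token] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical labeled training data, invented for this sketch.
TRAINING_DATA = [
    ("great app, works well", "positive"),
    ("love the fabulous design", "positive"),
    ("terrible update, keeps crashing", "negative"),
    ("slow and annoying ads", "negative"),
    ("it is a calendar app", "neutral"),
    ("the update changes the icon", "neutral"),
]

model = train_naive_bayes(TRAINING_DATA)
print(classify("fabulous app, love it", model))
print(classify("annoying and slow", model))
```

With this toy training set, the first text is assigned to the positive class and the second to the negative class.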
The lexicon-based approach judges the sentiment polarity of a text based on the sentiment words that it contains. It relies on a sentiment lexicon, that is, a list of positive and negative words that is manually created or automatically generated. The sentiment polarity of the text is then decided by the ratio of positive to negative sentiment words [Maynard and Funk, 2011]. In the earlier example of the user review of a social app in Section 5.2, the whole review is considered positive because it contains more positive words, such as “great”, “liked”, “fabulous” and “helped”, than negative words. The key task in using the lexicon-based approach is therefore to generate the sentiment lexicon. Apart from creating the lexicon manually, which is very time consuming, there are two automated approaches to collecting a sentiment lexicon: the dictionary-based approach and the corpus-based approach.
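The counting rule described above can be sketched in a few lines. The word lists, the function name lexicon_polarity and the example sentence are hypothetical and chosen only for illustration; the sketch compares the two counts directly, which gives the same decision as checking whether the ratio of positive to negative words is above or below one.

```python
import re

# Hypothetical toy lexicon; a real sentiment lexicon would be far larger.
POSITIVE_WORDS = {"great", "liked", "fabulous", "helped", "good"}
NEGATIVE_WORDS = {"bad", "slow", "crashes", "annoying", "poor"}


def lexicon_polarity(text):
    """Classify a text by comparing its counts of positive and negative words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    positive = sum(1 for token in tokens if token in POSITIVE_WORDS)
    negative = sum(1 for token in tokens if token in NEGATIVE_WORDS)
    if positive > negative:
        return "positive"
    if negative > positive:
        return "negative"
    return "neutral"  # equal counts, or no sentiment words at all


# Invented example sentence: three positive words against one negative word.
print(lexicon_polarity("Great app, I liked it and it helped me, but it is a bit slow."))
```

The example sentence contains three positive words and one negative word, so it is classified as positive. Note that such a sketch ignores negation and intensity, which a practical lexicon-based system would have to handle.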
The basic idea of the dictionary-based approach is first to collect a small set of sentiment words manually as the seed list. These sentiment words are then looked up in a dictionary such as WordNet [Miller, 1995], a popular lexical database for the English language, to find their synonyms and antonyms. For a positive (or negative) word, its synonyms are considered positive (or negative), while its antonyms are considered negative (or positive). All newly found synonyms and antonyms are added to the seed list, and the next iteration starts. When no new synonyms or antonyms are found, the iteration stops and the sentiment lexicon is obtained. The major disadvantage of the dictionary-based approach is that the generated lexicon is not domain specific [Medhat et al., 2014], although some sentiment words have different sentiment polarities in different domains. A good example is the word “unpredictable”. It generally has a negative polarity in the health and aerospace domains, e.g., “the new flu virus is evolving in an unpredictable way” and “the landing point of the rocket is unpredictable”, but is most likely to be positive in the movie domain, e.g., “the plot of this movie is unpredictable” [Ofek et al., 2013]. SentiWordNet [Baccianella et al., 2010] is one of the most commonly used sentiment lexicons generated with the dictionary-based approach. As shown in Figure 5.3, in SentiWordNet v3.0.0 the adjective “unpredictable” has a positive sentiment score of 0 in every sense and a negative sentiment score of 0.625 or 0.25 depending on the sense, which means that it will be scored as negative regardless of the domain in which it appears.
Figure 5.3. The word “unpredictable” in the SentiWordNet.
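The iterative expansion used by the dictionary-based approach can be sketched as follows. The SYNONYMS and ANTONYMS tables are a hypothetical miniature thesaurus standing in for WordNet, and the function name expand_lexicon is likewise an assumption; the sketch only illustrates the loop described above, in which synonyms inherit the polarity of a word, antonyms receive the opposite polarity, and the process stops when no new words are found.

```python
# Hypothetical miniature thesaurus standing in for a dictionary such as WordNet.
SYNONYMS = {
    "good": {"great", "fine"},
    "great": {"good", "fabulous"},
    "bad": {"poor", "awful"},
    "poor": {"bad"},
}
ANTONYMS = {
    "good": {"bad"},
    "fabulous": {"awful"},
}


def expand_lexicon(seed_positive, seed_negative):
    """Grow positive and negative word sets from seeds until nothing new is found."""
    positive, negative = set(seed_positive), set(seed_negative)
    # Words whose synonyms and antonyms still have to be looked up.
    frontier = [(w, "pos") for w in positive] + [(w, "neg") for w in negative]
    while frontier:
        next_frontier = []
        for word, label in frontier:
            if label == "pos":
                same, other, other_label = positive, negative, "neg"
            else:
                same, other, other_label = negative, positive, "pos"
            # Synonyms inherit the polarity of the current word.
            for synonym in SYNONYMS.get(word, ()):
                if synonym not in positive and synonym not in negative:
                    same.add(synonym)
                    next_frontier.append((synonym, label))
            # Antonyms receive the opposite polarity.
            for antonym in ANTONYMS.get(word, ()):
                if antonym not in positive and antonym not in negative:
                    other.add(antonym)
                    next_frontier.append((antonym, other_label))
        frontier = next_frontier  # stops when an iteration adds no new words
    return positive, negative


positive_words, negative_words = expand_lexicon({"good"}, set())
print(sorted(positive_words))
print(sorted(negative_words))
```

Starting from the single seed word “good”, this toy thesaurus yields four positive and three negative words. The resulting polarities depend only on the dictionary and not on any domain, which is the limitation discussed above.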
The corpus-based approach helps to generate a domain-specific sentiment lexicon [Medhat et al., 2014]. It also starts with a small set of sentiment words as a seed list, and the list is expanded by finding other sentiment words in a large document corpus based on syntactic or co-occurrence patterns [Medhat et al., 2014]. Hatzivassiloglou and McKeown [1997] conducted the first study using the corpus-based approach. They started with a labelled seed list of adjectives (657 positive and 679 negative adjectives) and used the conjunctions between adjectives to determine the sentiment polarities of other adjectives. For most connectives, such as “and”, “or”, “either-or” and “neither-nor”, the conjoined adjectives were assumed to have the same sentiment polarity. The only exception was the connective “but”, which was assumed to connect two adjectives of opposite sentiment polarities. The obvious weakness of this method is that conjunction rules cannot extract adjectives that do not appear in a conjoined pair. Qiu et al. [2009] proposed a more complex method called double propagation, which parsed the words in a sentence by their dependency relations and used part-of-speech (POS) information to extract the related features and sentiment words that matched predefined dependency rules. After finding the related features and sentiment words, Qiu et al. [2009] defined rules to estimate the sentiment polarity of a sentiment word based on two observations: “same polarity for same feature in a review” and “same polarity for same sentiment word in a domain corpus”.
Features thus act as a bridge that connects sentiment words which are related (they modify the same feature) but are not directly joined by connectives, and this proved to be an effective method with a high F-score [Qiu et al., 2009]. To the best of the author’s knowledge, there is currently no domain-specific sentiment lexicon built from app review data as a corpus, possibly because app store data mining is a new field of study inspired by the proliferation of the mobile application business.
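The conjunction rules described above for the corpus-based approach can be illustrated with a simplified sketch. It is not the implementation of Hatzivassiloglou and McKeown [1997]: it only applies the stated rules (same polarity across “and”, “or” and “nor”, opposite polarity across “but”) to adjacent word triples. The function name propagate_by_conjunction, the candidate adjective set (used here in place of POS tagging) and the example sentences are all hypothetical.

```python
import re

SAME_POLARITY = {"and", "or", "nor"}  # conjoined adjectives share a polarity
OPPOSITE_POLARITY = {"but"}           # conjoined adjectives have opposite polarity


def flip(label):
    """Return the opposite polarity label."""
    return "negative" if label == "positive" else "positive"


def propagate_by_conjunction(sentences, seed, adjectives):
    """Label unknown adjectives that are conjoined with already labeled ones."""
    lexicon = dict(seed)
    changed = True
    while changed:  # repeat until a full pass labels nothing new
        changed = False
        for sentence in sentences:
            tokens = re.findall(r"[a-z]+", sentence.lower())
            # Look at every "<word> <connective> <word>" triple.
            for left, conj, right in zip(tokens, tokens[1:], tokens[2:]):
                if left not in adjectives or right not in adjectives:
                    continue
                if conj in SAME_POLARITY:
                    same = True
                elif conj in OPPOSITE_POLARITY:
                    same = False
                else:
                    continue
                for known, unknown in ((left, right), (right, left)):
                    if known in lexicon and unknown not in lexicon:
                        label = lexicon[known]
                        lexicon[unknown] = label if same else flip(label)
                        changed = True
    return lexicon


# Hypothetical corpus, seed list and candidate adjectives for illustration.
corpus = [
    "The interface is simple and elegant.",
    "The app is elegant but sluggish.",
    "Sync is sluggish and unreliable.",
    "The icon looks modern.",
]
candidates = {"simple", "elegant", "sluggish", "unreliable", "modern"}
result = propagate_by_conjunction(corpus, {"simple": "positive"}, candidates)
print(result)
```

In this example “elegant” is labeled positive, while “sluggish” and “unreliable” are labeled negative. The adjective “modern” stays unlabeled because it never appears in a conjoined pair, which illustrates the weakness of conjunction rules noted above.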