Chapter 3 Technological Background
3.3 Classification Methods
3.3.1 Lexicon-based Classifier
In lexicon-based classification, documents are assigned labels by counting the words/n-grams in them that also appear in pre-constructed word/n-gram lists [84]. Lexicon-based classification is mostly used to infer the sentiment polarities of documents with the help of sentiment lexicons. Existing sentiment lexicons can be roughly divided into two categories: polarity-based lexicons and valence-based lexicons. In polarity-based lexicons, words/n-grams are annotated with their overall sentiment orientations, i.e., positive or negative, as in the Opinion Lexicon [122], the Macquarie Semantic Orientation Lexicon (MSOL) [192] and the Multi-perspective Question Answering (MPQA) Opinion Lexicon [288]. In valence-based lexicons, words/n-grams are annotated with valence scores that express sentiment intensity, as in the AFINN Lexicon [205], the SentiWordNet Lexicon [25], the Sentiment140 Lexicon [194] and the NRC Hashtag Sentiment Lexicon [194].
Three main approaches have been proposed to generate sentiment lexicons: manual, dictionary-based and corpus-based [174].
The AFINN Lexicon [205] and the MPQA Lexicon [288] were constructed through the manual approach: each word/n-gram in these two lexicons was annotated by hand by the authors. This is labour-intensive and time-consuming; moreover, the annotation results can be biased, because annotators differ in how they perceive sentiment. The AFINN Lexicon uses discrete values ranging from −5 (very negative) to +5 (very positive) to denote the sentiment valences. The dictionary-based approach uses a small set of sentiment seed words with known positive or negative orientations to bootstrap the collection of positive and negative words, based on the synonym and antonym structure of a dictionary [174].
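The bootstrapping idea behind the dictionary-based approach can be illustrated with a minimal sketch. The synonym and antonym tables below are hypothetical toy data standing in for a real dictionary, and the breadth-first propagation with a "first label wins" rule is an assumption made for illustration, not the procedure of any particular lexicon.

```python
from collections import deque

# Hypothetical toy dictionary; a real system would query a resource such as WordNet.
SYNONYMS = {
    "good": ["great", "fine"],
    "great": ["good", "superb"],
    "bad": ["poor", "awful"],
    "poor": ["bad"],
}
ANTONYMS = {
    "good": ["bad"],
    "great": ["awful"],
}


def bootstrap_lexicon(seeds, synonyms, antonyms):
    """Propagate seed orientations (+1 / -1) through synonym and antonym links."""
    lexicon = dict(seeds)
    queue = deque(seeds)
    while queue:
        word = queue.popleft()
        orientation = lexicon[word]
        # Synonyms inherit the orientation; antonyms receive the opposite one.
        neighbours = [(s, orientation) for s in synonyms.get(word, [])]
        neighbours += [(a, -orientation) for a in antonyms.get(word, [])]
        for other, label in neighbours:
            if other not in lexicon:  # assumption: the first label assigned is kept
                lexicon[other] = label
                queue.append(other)
    return lexicon


lexicon = bootstrap_lexicon({"good": 1}, SYNONYMS, ANTONYMS)
```

Starting from the single seed "good", the sketch labels every word reachable through the toy tables; words with no path to a seed stay outside the lexicon, which mirrors the coverage limit of dictionary-based lexicons.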
Dictionary-based approaches were employed in the construction of the SentiWordNet Lexicon [25], the Opinion Lexicon [122] and the MSOL [192]. Concretely, for the Opinion Lexicon, researchers enriched the adjective seed words with their synonyms and antonyms from WordNet [188]: synonyms were assigned the same sentiment orientation as the seed word, and antonyms the opposite orientation. For the SentiWordNet Lexicon, researchers additionally trained multiple classifiers on the definitions of the enriched sentiment seed words, and applied the classifiers to calculate the positive, negative and objective scores of all the words in WordNet. An extra random-walk step was then performed on the WordNet definiens-definiendum binary relationship graph to adjust the positive and negative scores, following the intuition that positivity and negativity can be mapped from the definitions to the words being defined. The final positive and negative valences were determined by applying power law distribution functions to the rankings of the positive and negative scores generated by the random-walk step, and the objective valences were assigned based on the positive and negative intensities. For each word, the positive, negative and objective valences were continuous values in the interval [0.0, 1.0] that summed to 1.0. For the MSOL, researchers initially used eleven affix patterns to expand the seed word set, and then employed the group information from the Macquarie Thesaurus [37] to perform another expansion: if a group had more positive seed words than negative seed words, all the words/n-grams in the group were marked as positive; otherwise, all the words/n-grams in the group were marked as negative; if a word/n-gram occurred in multiple groups, its sentiment orientation was determined by its most common sentiment orientation. The sizes of the lexicons generated by dictionary-based approaches are restricted by the sizes of the dictionaries, which are not adequate to cover the language-usage variations and multi-word expressions in social media texts.
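The group-based expansion described for the MSOL can be sketched as a two-level vote. The groups and seed words below are hypothetical toy data; skipping groups that contain no seed word and resolving ties between orientations as negative are assumptions of this sketch, not details given in the text.

```python
from collections import Counter


def label_groups(groups, seeds):
    """Label each thesaurus group from the seed words it contains."""
    labels = {}
    for name, words in groups.items():
        pos = sum(1 for w in words if seeds.get(w) == 1)
        neg = sum(1 for w in words if seeds.get(w) == -1)
        if pos == 0 and neg == 0:
            continue  # assumption: groups without any seed word are left unlabelled
        # Positive if positive seeds outnumber negative seeds, otherwise negative.
        labels[name] = 1 if pos > neg else -1
    return labels


def label_words(groups, seeds):
    """Give every word the most common orientation among the groups it occurs in."""
    group_labels = label_groups(groups, seeds)
    votes = {}
    for name, label in group_labels.items():
        for word in groups[name]:
            votes.setdefault(word, Counter())[label] += 1
    # Assumption: a tie between orientations is resolved as negative.
    return {w: (1 if c[1] > c[-1] else -1) for w, c in votes.items()}


# Hypothetical thesaurus groups and seed words.
GROUPS = {
    "g1": ["happy", "joyful", "sunny"],
    "g2": ["sad", "gloomy", "dim"],
    "g3": ["happy", "joyful", "bright", "dim"],
    "g4": ["plain", "flat"],
    "g5": ["dim", "gloomy", "murky"],
}
SEEDS = {"happy": 1, "joyful": 1, "sad": -1, "gloomy": -1}

word_labels = label_words(GROUPS, SEEDS)
```

In this toy data "dim" occurs in two negative groups and one positive group, so it is marked negative, while "plain" and "flat" receive no label because their group contains no seed word.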
The corpus-based approach also uses a small set of sentiment seed words with known positive or negative orientations to bootstrap, but is based on the syntactic or co-occurrence patterns in a large corpus [175]. Researchers exploited the same corpus-based approach to generate the Sentiment140 Lexicon [194] and the NRC Hashtag Sentiment Lexicon [194]. The construction of these two lexicons was based on the assumption that all words/n-grams in a tweet express a coherent sentiment orientation. Researchers initially utilised 32 positive hashtags and 36 negative hashtags to annotate a tweet corpus: a tweet was considered positive if it contained one of the 32 positive hash-tagged seed words, and negative if it contained one of the 36 negative hash-tagged seed words. Then the Point-wise Mutual Information (PMI) scores of all words/n-grams were calculated: a positive score indicated an association with positive sentiment orientation, and a negative score an association with negative sentiment orientation. The PMI scores were further employed as the sentiment valences, which were continuous values in the interval (−∞, ∞). The corpus-based approach is an efficient and automatic solution for generating domain-dependent and context-specific sentiment lexicons. However, for words/n-grams without a sufficient number of occurrences, the reliability of the assigned sentiment orientation/valence is in question; moreover, the intra-sentential sentiment coherency assumption can be invalid for sentences with complex syntactic structures.
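The corpus-based construction can be sketched in a few lines. The scoring formula used here, PMI(w, positive) − PMI(w, negative), which simplifies to the log-ratio of the word's relative frequencies in positive and negative tweets, is one common formulation and is assumed rather than taken from the text above. The seed hashtags, the toy tweets, the add-one smoothing and the skipping of tweets with no or conflicting seeds are likewise illustrative assumptions.

```python
import math
from collections import Counter

# Hypothetical seed hashtags; the lexicons discussed above used larger seed sets.
POSITIVE_TAGS = {"#happy", "#love"}
NEGATIVE_TAGS = {"#sad", "#angry"}


def label_tweet(tokens):
    """Distant supervision: +1 / -1 if the tweet carries a positive / negative seed."""
    has_pos = any(t in POSITIVE_TAGS for t in tokens)
    has_neg = any(t in NEGATIVE_TAGS for t in tokens)
    if has_pos == has_neg:
        return None  # assumption: no seed, or conflicting seeds, means the tweet is skipped
    return 1 if has_pos else -1


def pmi_lexicon(tweets, smoothing=1.0):
    """Score each word by PMI(w, positive) - PMI(w, negative), in bits."""
    pos_counts, neg_counts = Counter(), Counter()
    for tweet in tweets:
        tokens = tweet.lower().split()
        label = label_tweet(tokens)
        if label is None:
            continue
        words = [t for t in tokens if not t.startswith("#")]
        (pos_counts if label == 1 else neg_counts).update(words)
    vocab = set(pos_counts) | set(neg_counts)
    # Additive smoothing (an assumption) keeps the scores finite for unseen pairs.
    n_pos = sum(pos_counts.values()) + smoothing * len(vocab)
    n_neg = sum(neg_counts.values()) + smoothing * len(vocab)
    return {
        w: math.log2(((pos_counts[w] + smoothing) / n_pos)
                     / ((neg_counts[w] + smoothing) / n_neg))
        for w in vocab
    }


TWEETS = [
    "great day #happy",
    "great food #love",
    "terrible day #sad",
    "terrible service #angry",
    "nothing special",
]
scores = pmi_lexicon(TWEETS)
```

On the toy tweets, "great" receives a positive score, "terrible" a negative one, and "day", which occurs equally often on both sides, a score of zero. Words seen only once, such as "food", still receive a non-zero score, which illustrates why valences of rare words are unreliable.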
The lexicon-based classification is unsupervised and relies on the linguistic heuristics introduced by the researchers. When using a polarity-based lexicon, because only words/n-grams with strong sentiment intensities are included in the lexicon, the sentiment orientation of a document is decided by the difference between the number of positive words/n-grams and the number of negative words/n-grams from the sentiment lexicon that appear in the document, as in [122, 212]. Specifically, when using +1 to denote positive sentiment orientation and −1 to denote negative sentiment orientation, the sentiment orientation of document $d$, denoted by $SO_d$, can be decided as follows:
$$SO_d = \operatorname{sgn}\Big( \sum_{w_v \in w_d \cap w_L} tf_{d,v} \times so_v \Big). \tag{3.7}$$
In the above equation, $w_d$ represents all the words in the document, $w_L$ represents all the words in the lexicon, $so_v$ represents the sentiment orientation of word $w_v$ labelled in the lexicon, $tf_{d,v}$ represents the term frequency of word $w_v$ in document $d$, and $\operatorname{sgn}$ represents the sign function.
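Equation (3.7) translates directly into code. The following is a minimal sketch that assumes whitespace tokenisation and a hypothetical four-word polarity lexicon; the function and variable names are illustrative only.

```python
from collections import Counter


def sign(x):
    """Sign function: +1, 0 or -1."""
    return (x > 0) - (x < 0)


def sentiment_orientation(tokens, lexicon):
    """Equation (3.7): sign of the frequency-weighted sum of lexicon orientations."""
    tf = Counter(tokens)  # term frequencies tf_{d,v}
    # Only words that occur both in the document and in the lexicon contribute.
    total = sum(tf[word] * so for word, so in lexicon.items() if word in tf)
    return sign(total)


# Hypothetical polarity lexicon: +1 for positive, -1 for negative.
POLARITY_LEXICON = {"good": 1, "great": 1, "bad": -1, "awful": -1}

doc = "the food was good really good but the service was bad".split()
orientation = sentiment_orientation(doc, POLARITY_LEXICON)
```

Note that the sign function returns 0 when positive and negative counts cancel out or when no lexicon word occurs in the document, so a practical classifier needs a rule for such neutral or undecided cases.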
When using a valence-based lexicon, words/n-grams carrying different levels of sentiment information are all included in the lexicon, and the sentiment valence of a document is usually calculated as the average sentiment valence of all the words/n-grams from the sentiment lexicon that appear in the document, as in [52, 189, 274]. Specifically, the sentiment valence of document $d$, denoted by $SV_d$, can be calculated as follows:
$$SV_d = \frac{\sum_{w_v \in w_d \cap w_L} tf_{d,v} \times sv_v}{\sum_{w_v \in w_d \cap w_L} tf_{d,v}}. \tag{3.8}$$
In the above equation, $sv_v$ represents the sentiment valence of word $w_v$ annotated in the lexicon.
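Equation (3.8) can be sketched in the same way. The valence scores below are hypothetical and are not taken from any published lexicon; returning `None` for a document without lexicon words is an assumption, since the equation is undefined when its denominator is zero.

```python
from collections import Counter


def sentiment_valence(tokens, lexicon):
    """Equation (3.8): frequency-weighted average valence of the matched words."""
    tf = Counter(tokens)  # term frequencies tf_{d,v}
    matched = [word for word in tf if word in lexicon]
    denominator = sum(tf[word] for word in matched)
    if denominator == 0:
        return None  # assumption: no lexicon word in the document, valence undefined
    numerator = sum(tf[word] * lexicon[word] for word in matched)
    return numerator / denominator


# Hypothetical valence lexicon with illustrative scores.
VALENCE_LEXICON = {"good": 2.0, "excellent": 4.0, "bad": -2.0}

doc = "good good excellent bad plot".split()
valence = sentiment_valence(doc, VALENCE_LEXICON)  # (2*2.0 + 4.0 - 2.0) / 4 = 1.5
```

Unlike the sign in Equation (3.7), the averaged valence is gradable: it retains how strongly positive or negative the matched words are, not only which side dominates.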
Because lexicon-based classification is unsupervised, it can be easily implemented, interpreted and modified. In particular, valence-based lexicons can be employed to generate gradable results. However, the coverage and credibility of the lexicon limit its effectiveness, especially when facing texts of great flexibility and variability, such as textual content on social media. The classification result depends on the rules introduced by the researchers, which mostly consider only the individual words/n-grams in the texts and ignore syntactic and semantic information. Even though some researchers have proposed additional rules to modify the sentiment orientations and valences of the words/n-grams in the lexicon based on their contexts [128, 256], these rules are not comprehensive enough to cover all language usage patterns, especially in domain-dependency and polysemy scenarios. The high dependency on handcrafted rules also restricts the application of lexicon-based classification to areas where such rules can be easily generalised, such as sentiment polarity classification, and excludes areas where more sophisticated inference is needed, such as target-specific stance detection. When facing data from various unknown domains, lexicon-based classification using existing lexicons is stronger in generality than other approaches. In this thesis, lexicon-based, unsupervised sentiment analysis is employed to quantify the aggregated sentiment bias in the multilingual Wikipedia contexts of a specified entity, which come from various unknown domains (Chapter 6).