Weakly-Supervised Techniques for the Analysis of Evaluation in Text. Jonathon Read

Submitted for the degree of Doctor of Philosophy, University of Sussex

Declaration

I hereby declare that this thesis has not been and will not be submitted in whole or in part to another University for the award of any other degree.

Signature:

UNIVERSITY OF SUSSEX

Jonathon Read, Doctor of Philosophy

Weakly-Supervised Techniques for the Analysis of Evaluation in Text

Summary

A common approach to sentiment analysis is to employ supervised machine-learning methods to acquire prominent features of sentiment. However, the success of these methods is dependent on the domain, topic and time-period represented by the training data. This thesis explores an alternative approach to sentiment analysis, whereby the polarity of text is found by comparing the similarity of its constituents with prototypical examples of positivity and negativity. The techniques proposed are evaluated on various tasks in sentiment analysis and, while they are inferior to well-trained supervised techniques, they perform consistently across different domains, topics and time-periods.

The second aspect of this thesis concerns Appraisal, a functional linguistic theory of evaluation in English. The Appraisal theory describes a hierarchy of the language used to communicate evaluation, detailing types of Attitude (how writers communicate their point of view), Engagement (how writers align themselves with respect to the position of others) and Graduation (how writers amplify or diminish their opinions), the recognition of which may assist in performing other tasks in sentiment analysis. The thesis describes the creation of a corpus of book reviews annotated according to the Appraisal theory, and an assessment of the difficulty of performing analyses of Appraisal by way of an inter-annotator agreement study. The corpus is used to evaluate the weakly-supervised methods’ performance when identifying Appraisal-bearing words. The methods are then used to investigate the application of Appraisal recognition to the broader field of sentiment analysis.

Acknowledgements

Most importantly, thanks go to my supervisor, John Carroll. John made this thesis possible by organising my studentship from the EPSRC. He has offered careful and considered advice throughout the course of my studies and taught me research method.

Thanks also to Diana McCarthy who kindled my interest in sentiment analysis with a pointer to a research paper and later offered advice as a member of my thesis committee. Bill Keller also served on the committee, and was particularly helpful in advising on the design of the annotation study. David Hope’s help went beyond what one might reasonably expect from a fellow D.Phil. student in spending far too many sunny summer afternoons annotating book reviews. Kentaro Inui kindly provided exposure to research in Japan by arranging my visit to the Nara Institute of Science and Technology. Thanks to all the NLCL faculty, students and visitors for informative seminars and discussions.

I am also grateful to the following people, who provided assistance by making their data/software available: Thorsten Joachims (SVMLight), Roy Lipski (newswire articles labelled with sentiment), Bo Pang (movie reviews labelled with sentiment), John Trenkle (language classification software).

Thanks also to my friends, who bought me more than a few beers over the course of my frugal student years. Special thanks to Lis for her enthusiasm in the face of my pessimism about my research, and provision of ample distraction while shopping for plants.

List of Tables

2.1 Riloff and Wiebe’s (2003) syntactic templates and examples of patterns of subjective expressions.

2.2 Pang and Lee’s (2008) exploration of sentiment classification of movie reviews using keywords selected by human judges and simple statistics, with percentage accuracy and ties in number of keywords found.

2.3 Patterns of part-of-speech tags used by SO-PMI-IR (Turney, 2002) for extracting phrases from problem documents.

2.4 Attributes for the two main MPQA annotation types.

3.1 Accuracies of supervised classifiers when training and testing on different topics. Best performance on a test set for each model is highlighted in bold.

3.2 Accuracies of supervised classifiers when training and testing on different domains. Best performance on a test set for each model is highlighted in bold.

3.3 The top twenty most divergent features found when comparing sentiment probabilities in the Newswire and Polarity 1.0 data sets.

3.4 Accuracies of supervised classifiers when training and testing on different time-periods. Best performance on a test set for each model is highlighted in bold.

3.5 Examples of emoticons and the frequency of usage observed in Usenet articles, in percent.

3.6 Accuracy of Emoticon-trained sentiment classifiers across topics.

3.7 Accuracy of Emoticon-trained sentiment classifiers across domains.

3.8 Accuracy of Emoticon-trained sentiment classifiers across time periods.

3.9 Coverage of classifiers, in percent.

4.1 Lund and Burgess’s (1996) example matrix for “the horse raced past the barn fell”, computed for a window width of five words.

4.2 Distance metrics employed by Levy et al. (1998).

4.3 Prototypes selected for Sentiment classes.

4.4 Prototypes selected for the six Basic emotions.

4.5 The performance of weakly supervised methods in classifying POSITIV and NEGATIV entries in the General Inquirer, with respect to polarity.

4.6 The performance of weakly supervised methods in determining the sentiment of movie reviews in Pang and Lee’s (2004) data set.

4.7 The accuracies of supervised and weakly-supervised methods in classifying newswire articles according to sentiment in various topics, with the harmonic means of the accuracies.

4.8 The accuracies of supervised and weakly-supervised methods in classifying documents in the domains of newswire articles and movie reviews, with the harmonic means of the accuracies.

4.9 The accuracies of supervised and weakly-supervised methods in classifying movie reviews from data sets representing different time-periods, with the harmonic means of the accuracies.

4.10 The mean annotators’ correlation scores for each type in the Affective Text shared task (Strapparava and Mihalcea, 2007).

4.11 The performance of the weakly-supervised techniques in the Valence test compared with entrants in the Affective Text shared task (Strapparava and Mihalcea, 2007). Systems are in alphabetical order, with highest performers in each measure highlighted in bold.

4.12 The performance of the weakly-supervised techniques in the Affect sub task compared with entrants in the Affective Text shared task (Strapparava and Mihalcea, 2007) and the results reported by Strapparava and Mihalcea (2008) (SM). The best results in each measure and each emotion are highlighted in bold.

4.13 The mean performance across all six emotions of the weakly-supervised techniques in the Affect test compared with entrants in the Affective Text shared task (Strapparava and Mihalcea, 2007) and the results reported by Strapparava and Mihalcea (2008) (SM). The best results in each measure are highlighted in bold.

4.14 Contingency tables of the labels output by the weakly-supervised methods in the word classification task. Rows indicate the labels chosen by the method while columns represent the correct label. The distribution of the labels in the test set was 44.6% Positive and 55.4% Negative.

4.15 The word similarity methods’ coverage of types and instances in the document level sentiment classification task.

5.1 Illustrations of Affect.

5.2 Illustrations of Judgement.

5.3 Illustrations of Appreciation.

6.1 MUC-7 test score definitions (Chinchor, 1998).

6.2 MUC-7 test scores, evaluating the agreement in text anchors selected by the annotators. When considering agreement in text anchors there is only one class of interest, hence the SUB measure will always be zero. The average between the two annotators is calculated using the harmonic mean.

6.3 Harmonic means of MUC-7 test scores evaluating the agreement in text anchors selected by the annotators for various matching constraints.

6.4 The contingency table showing d’s choices (columns) in terms of percentage of j’s annotations (rows). The MIS column indicates the percentage of j’s annotations where d did not provide a match.

6.5 The contingency table showing j’s choices (columns) in terms of percentage of d’s annotations (rows). The MIS column indicates the percentage of d’s annotations where j did not provide a match.

6.6 κ values at the different levels of the Appraisal taxonomy over all annotation types and over Attitude, Engagement, and Graduation types only.

6.7 Interpretations of κ values, suggested by Landis and Koch (1977).

6.8 Senses of the verb ‘abandon’ listed in WordNet 2.1, accompanied by an Appraisal class proposed by annotator j based on the gloss.

6.9 All possible combinations of questionnaire respondents with English as a first language and/or familiarity with Appraisal, with corresponding respondent frequencies (n) and Kappa (κ) scores. All values are significant at p < 0.05.

7.2 The distribution of annotations in the development and test data sets, according to the number of grams.

7.3 The distribution of Appraisal types found in the development data, at various levels of the Appraisal hierarchy.

7.4 The performance of word similarity algorithms in classifying expressions according to the various levels of the Appraisal framework. (w) indicates weighted versions of the algorithms.

7.5 The performance of word similarity algorithms in extracting features according to the various levels of the Appraisal framework. (w) indicates weighted versions of the algorithms.

7.6 The mean and variance of the cross-validated optimal thresholds for each method at each level of the Appraisal hierarchy. (w) indicates weighted versions of the algorithms.

7.7 The performance of word similarity algorithms in determining the polarity of instances of Attitude.

7.8 Prototypical words of Graduation.

7.9 The performance of word similarity algorithms in determining the direction of instances of Graduation.

7.10 Descriptive statistics summarising the unexpectedness of the selections made by each method.

7.11 The calculation of Lexical Association scores using Pointwise Mutual Information, for the word but for Counter and Deny.

7.12 Similarity scores for the prototypes of Happiness and Security from the Semantic Space algorithm for the words happy and unhappy.

7.13 A summary of proposed heuristics based on the Appraisal Theory types.

7.14 The optimal weights for each class when applying Appraisal heuristics on the sentiment classification task.

7.15 The performance of various approaches to determining the sentiment of movie reviews in Pang and Lee’s (2004) Polarity 2.0 dataset. Items marked with † are supervised approaches.

7.16 The performance of the weakly-supervised techniques in the Valence test compared with entrants in the Affective Text shared task (Strapparava and Mihalcea, 2007). Systems are in alphabetical order, with highest performers in each measure highlighted in bold.

7.17 The mean performance across all six emotions of the weakly-supervised techniques in the Emotions test compared with entrants in the Affective Text shared task (Strapparava and Mihalcea, 2007) and the results reported by Strapparava and Mihalcea (2008) (SM). Systems are in alphabetical order, with highest performers in each measure highlighted in bold.

D.1 A contingency table with unexpectedness values of the selections made by the unweighted Lexical Association method when performing the Appraisal Classification task. For ease of reading only the contingencies where |u| > x̄ + σ are displayed. Columns indicate the type chosen by the method while the correct type is listed by row.

D.2 A contingency table with unexpectedness values of the selections made by the unweighted Semantic Space method when performing the Appraisal Classification task. For ease of reading only the contingencies where |u| > x̄ + σ are displayed. Columns indicate the type chosen by the method while the correct type is listed by row.

D.3 A contingency table with unexpectedness values of the selections made by the unweighted Distributional Similarity method when performing the Appraisal Classification task. For ease of reading only the contingencies where |u| > x̄ + σ are displayed. Columns indicate the type chosen by the method while the correct type is listed by row.

List of Figures

2.1 Hatzivassiloglou and McKeown’s (1997) example of how the adjective simplistic deviates from its semantic group, in that its semantic orientation is opposite to the highly related word simple.

3.1 Change in performance of the supervised classifiers when constraining the number of reviews permitted of any given movie, in percent.

3.2 Change in Performance of the SVM Classifier on held out reviews from Polarity 1.0, varying training set size and window context size. The datapoints represent 2,200 experiments in total.

4.1 A subsumption hierarchy describing the types of relations output by the RASP system (Briscoe et al., 2006).

4.2 The results of each optimising General Inquirer test carried out on the Lexical Association method.

4.3 The results of each optimising General Inquirer test carried out on the Semantic Space algorithm.

4.4 The results of each optimising General Inquirer test carried out on the Distributional Similarity algorithm.

4.5 The results of each optimising Movie Review test carried out on the Lexical Association method.

4.6 The results of each optimising Movie Review test carried out on the Semantic Space algorithm.

4.7 The results of each optimising Movie Review test carried out on the Distributional Similarity algorithm.

4.8 The results of each optimising Affective Text valence test carried out on the

4.9 The results of each optimising Affective Text valence test carried out on the Semantic Space method.

4.10 The results of each optimising Affective Text valence test carried out on the Distributional Similarity method.

4.11 The results of each optimising Affective Text emotion test carried out on the Lexical Association method.

4.12 The results of each optimising Affective Text emotion test carried out on the Semantic Space method.

4.13 The results of each optimising Affective Text emotion test carried out on the Distributional Similarity method.

4.14 Frequencies of annotations by valence score in the test set compared with the valence score assigned by the Semantic Space method.

5.1 A systems network depicting the structure of Appraisal resources (Martin and White, 2005).

5.2 The Cognitive Structure of Emotions (from Ortony et al., 1988).

5.3 Types of structural prosody in discourse. Examples from Martin and White (2005).

5.4 The attitude system.

5.5 Strategies for inscribing and invoking attitude (Martin and White, 2005).

5.6 The engagement system.

5.7 The graduation system.

6.1 The custom-made Appraisal annotation tool.

6.2 The Appraisal framework showing the hierarchical levels. Labels are accompanied by the harmonic mean of the F1 of the annotators for appraisal type/overall types for that level.

6.3 The harmonic mean of the recall (REC) exhibited by the annotators at the various levels of the Appraisal taxonomy.

6.4 The harmonic mean of the precision (PRE) exhibited by the annotators at the various levels of the Appraisal taxonomy.

6.5 The harmonic mean of the annotators’ substitution (SUB) rates at the various levels of the Appraisal taxonomy.

6.6 The harmonic mean of the annotators’ error (ERR) rates at the various

7.1 The results of each optimising test carried out on the unweighted lexical association algorithm, using lemmatised tokens and PMI.

7.2 The results of each optimising test carried out on the weighted lexical association algorithm, using lemmatised tokens and PMI.

7.3 The results of each optimising test carried out on the unweighted semantic space algorithm, using lemmatised tokens and PMI.

7.4 The results of each optimising test carried out on the weighted semantic space algorithm, using lemmatised tokens and PMI.

7.5 The results of each optimising test carried out on the distributional

Chapter 1

Introduction

1.1 Background

The past decade has witnessed a swell of interest in the analysis of authors’ opinions as expressed in written documents. This is in no small part due to the proliferation of electronically-published opinion which presents a wealth of easily-accessible text of interest to governments, companies and individuals seeking to automatically distill public opinion. This thesis studies aspects of the computational analysis of opinion in text.

Opinion is conveyed in text in a wide variety of domains and genres. Prior to the proliferation of the Internet, most publicly-available opinion was limited to reports and editorials in newspapers. More recently, however, the World Wide Web has allowed both traditional providers and also the general public to distribute written content on a scale not previously possible. Newspapers reproduce much of their content online, while the blogosphere enables suitably skilled Web users to easily publish their thoughts for the consideration of the rest of the community. There are numerous professional and enthusiast review websites, and many online retailers, such as Amazon and iTunes, encourage their customers to review their purchases for the benefit of other shoppers. There is such a wealth of product reviews, in fact, that it has prompted the development of opinion-aggregation websites such as Metacritic.com. A further important source of opinion may be found in collections of emails received by customer relations departments.

Reliable methods of automatically analysing evaluative language would be useful in a number of application areas which currently rely on manual analysis. For example, stock market traders often employ manual analyses of sentiment in news articles about a company in order to predict fluctuations in its share price. However, a high degree of accuracy can be obtained automatically by training supervised machine learning classifiers such as Naïve Bayes or Support Vector Machines (Pang et al., 2002). The same technology might be applied by opinion-aggregation websites, whose staff manually collate the scores of reviews for products such as films, music, games and television programmes in order to derive averages. Sentence-level sentiment classification may be of benefit to researchers investigating social networks in academic and online communities. Often such networks are constructed using citations made by authors (Wasserman and Faust, 1994); classifying these citations by sentiment would enable the network analysis to discriminate between favourable and unfavourable references.

Other applications require more detail about the types of opinions expressed. For example, political parties and governmental departments are often interested in understanding public opinion on some contentious issue, and so commission person-to-person surveys. Similarly, traditional business market research techniques involve conducting surveys or organising focus group sessions to collect the opinions of a small number of members of the public. These techniques are time-consuming and costly, and findings determined in this way are questionable if the participants are not sufficiently representative of the public.

Instead, these tasks could be accomplished automatically. Opinion-oriented information retrieval techniques could obtain articles relevant to a given topic such as a political issue or a company’s product. Opinion-mining techniques might then be employed to identify facets of opinions expressed within the document including the holder, target and nature of the opinion (Wiebe et al., 2003). These expressions of opinion could then be clustered in order to generate descriptive statistics to summarise public opinion. This approach might also be insufficiently representative of the public; however, this could be mitigated by employing techniques to identify author demographics (Liu and Mihalcea, 2007; Argamon et al., 2009).

An application of interest to government intelligence agencies is that of automatically monitoring dissent on extremist websites. For example, Abbasi and Chen (2007) described how affect recognition techniques can be employed to automatically analyse the forums of extremist groups.

Techniques to process evaluative language would also be of benefit to other avenues of computing research that are concerned with interaction with humans. For instance, analysis of emotion in text might be useful to researchers working in Affective Computing (Picard, 1997). Affective Computing is a branch of human-computer interaction research that seeks to adapt interfaces to users’ emotional state. This is typically achieved by monitoring speech patterns, facial expressions and body gestures, but a deeper analysis of users’ emotions and opinions could provide more detailed information about their interactive experience.

Other aspects of human-computer interaction that could benefit from a textual analysis of emotion include: providing clues for prosody in text-to-speech synthesis (Alm et al., 2005); computationally generated humour (Stock and Strapparava, 2005); generation of gestures or facial expressions in avatars and robots (Nakano et al., 2005); and generation of emotive or persuasive language (Strapparava and Mihalcea, 2008). Such analysis could also be applied in Expressive Artificial Intelligence art projects (Mateas, 2001), particularly in interactive dramas in which computer-controlled characters need to respond appropriately to the opinions and emotions expressed by the player. Spertus (1997) described a system for detecting ‘flames’ in online forums. This might be generalised to detect inappropriate language in formal communications. For example, in much the same way as applications highlight incorrect spelling and grammar, one could develop email and word processing applications that warn users of unsuitably affective language such as being overly familiar with customers or aggressive towards colleagues.

Automatic processing of the kind mentioned above is typically achieved using techniques rooted in one or more of: supervised machine learning, weakly-supervised machine learning, and linguistically-inspired heuristics. Pang et al. (2002), for example, classified reviews as being positive or negative in sentiment using Naïve Bayes, Maximum Entropy and Support Vector Machine classifiers. Turney (2002) showed how this could also be achieved (with a lesser degree of accuracy) using weakly-supervised machine learning by comparing the similarity of target words with prototypical examples of positive and negative sentiment. Polanyi and Zaenen (2004) explored the application of contextual valence shifters, that is, lexical items noted as having an effect on the sentiment conveyed by a sentence.
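As a rough illustration of the weakly-supervised idea described above, the following sketch scores a word by its association with a few positive prototypes minus its association with a few negative ones, using pointwise mutual information (PMI) over document co-occurrence. It is only a sketch under stated assumptions: the toy corpus, the prototype lists, the smoothing constant and all function names are hypothetical and are not taken from this thesis or from Turney's system, which estimated association from search-engine hit counts rather than a local corpus.

```python
import math

def doc_freq(docs, *words):
    """Count the documents that contain every one of the given words."""
    return sum(1 for d in docs if all(w in d for w in words))

def pmi(docs, a, b, eps=0.01):
    """Smoothed pointwise mutual information of a and b sharing a document.

    eps is an assumed smoothing constant that avoids log(0) for unseen pairs.
    """
    n = len(docs)
    p_ab = (doc_freq(docs, a, b) + eps) / n
    p_a = (doc_freq(docs, a) + eps) / n
    p_b = (doc_freq(docs, b) + eps) / n
    return math.log2(p_ab / (p_a * p_b))

def orientation(docs, word, positive, negative):
    """Association with positive prototypes minus association with negative ones."""
    pos = sum(pmi(docs, word, p) for p in positive)
    neg = sum(pmi(docs, word, q) for q in negative)
    return pos - neg

# Hypothetical unlabelled corpus: each document is reduced to a set of tokens.
CORPUS = [set(s.split()) for s in [
    "an excellent and moving film",
    "excellent acting and a moving story",
    "a poor and tedious plot",
    "poor pacing and tedious dialogue",
    "a good film with a moving ending",
    "bad and tedious from start to finish",
]]

# Hypothetical prototype words standing in for positivity and negativity.
POSITIVE = ["excellent", "good"]
NEGATIVE = ["poor", "bad"]

moving_score = orientation(CORPUS, "moving", POSITIVE, NEGATIVE)
tedious_score = orientation(CORPUS, "tedious", POSITIVE, NEGATIVE)
print("moving:", "positive" if moving_score > 0 else "negative")
print("tedious:", "positive" if tedious_score > 0 else "negative")
```

The only supervision here is the pair of short prototype lists; everything else is estimated from unlabelled text, which is why such methods are described as weakly supervised.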

1.2 Overview of the Thesis

This thesis is particularly concerned with weakly-supervised methods for the classification of text according to its positivity or negativity. Previous research has found that supervised machine-learning techniques can be very effective at this task. A number of studies, however, have found that this performance is dependent on a good match between training and testing data with respect to topic. The thesis shows that the data must also match with respect to domain and time-period, and so proposes and evaluates classification techniques based on the similarity of words. These techniques are only weakly-supervised as they simply require a small set of prototypical words and a large unlabelled corpus of general text, and therefore potentially do not suffer from the types of dependency exhibited by supervised techniques. The experiments reported in the thesis show that, while the performance of the weakly-supervised methods is inferior to traditional supervised techniques, their accuracy is reasonably consistent across domains, topics and time-periods.

Applications such as those described above can potentially be supported by a number of frameworks that describe aspects of evaluative and emotional language, such as types of emotion and opinions, and the variables that affect their intensity. These frameworks typically originate in the fields of cognitive science, linguistics and psychology, and could be informative for computational experiments with evaluative language. Ekman (1993), for instance, derived from facial expressions a list of basic emotions which can provide a set of classes for affect recognition. Gratch and Marsella (2004) developed a cognitive model of appraisal that considered several variables affecting the strength of appraisal, such as the relevance and urgency of an event, and the degree to which the ego is involved. This model was created for use by avatars simulating an emotional reaction, but could be used to inform analyses of evaluative language if suitable indicators of these variables could be found. Wiebe et al. (2005) created a scheme for the annotation of the mental and emotional state conveyed by text. Their scheme distinguished between explicit expressions such as The U.S. fears a spill-over and subjective expressive elements where the affective state is implied by words that contain negative connotations (e.g. We foresaw electoral fraud but not daylight robbery). Hyland (1998) described the linguistic phenomenon of hedging, where writers express the degree to which an opinion is speculative or unconfirmed (e.g. perhaps or somewhat). Di Marco and Mercer (2004) used features based on hedging to determine the nature of the relationships between scientific articles.

The thesis extends the breadth of analysis of evaluation in language by investigating the computational analysis of text according to the Appraisal Theory (Martin and White, 2005), a Systemic Functional Linguistic theory of evaluation which is couched in terms of English, but potentially applicable to other languages. It distinguishes between types of attitude (personal affect, judgement of people and appreciation of objects) and describes how authors use language to communicate their engagement with other writers, and to amplify or diminish the strength of their opinions. Knowledge of these types of language could enhance existing techniques for the analysis of evaluation in language by considering the type and strength of evaluation communicated, and identifying when and how authors report the opinions of others. The thesis presents a method employed to manually annotate a corpus of book reviews according to the theory, the performance of the weakly-supervised methods in performing an analysis according to the Appraisal framework, and an application of the theory to the task of sentiment classification.

The content presented in each chapter of this thesis is summarised below.

Chapter 2: Background. This chapter reviews previous research conducted in the area of automatically analysing evaluative language in text. Four related lines of research are considered: subjectivity analysis, which seeks to distinguish between facts and expressions of emotion or opinion; sentiment analysis, which focuses on whether text is generally positive or negative in feeling; opinion mining, which extends subjectivity analysis by identifying aspects of opinion such as holders and targets; and affect recognition, which attempts to recognise and label different types of emotions.

Chapter 3: Dependency in Supervised Techniques for Sentiment Classification. This chapter presents experiments that demonstrate that good performance of supervised machine learning techniques for sentiment classification is dependent on a good match between the training and testing data, with respect to domain, topic and the time-period represented by that data. It then proposes that these dependencies might be mitigated by using a body of general text, and therefore discusses the results of training supervised classifiers on text collected by extracting paragraphs containing a smile or frown emoticon from Usenet postings.

Chapter 4: Weakly Supervised Techniques for Sentiment Analysis. This chapter proposes that the best way to avoid the problems of domain, topic and time-period dependency in sentiment analysis is to instead employ word similarity methods that relate problem words in a very large corpus to prototypical examples of sentiment. It reviews three word similarity techniques: Lexical Association, Semantic Spaces and Distributional Similarity. Each of these methods is applied to three tasks: constructing a polarity lexicon, in which entries are labelled as being positive or negative in sentiment; classifying documents as being positive or negative; and scoring sentences according to the strength of sentiment and six basic emotions. It concludes by discussing the strengths and weaknesses of the similarity methods for analysis of evaluative language.

Chapter 5: Introduction to Appraisal Theory. As the subsequent chapters of the thesis are concerned with Appraisal Theory, this chapter provides an outline of the basics of the theory. It describes the types of language utilised in communicating the three parallel systems of Appraisal: Attitude, which is concerned with types of evaluations; Engagement, which describes how authors align with or distance themselves from the opinions of others; and Graduation, which considers how language can amplify or diminish the strength of opinions. This chapter also discusses previous computational work based on Appraisal Theory.

Chapter 6: Annotating Expressions of Appraisal. This chapter describes the annotation of a corpus of book reviews according to Appraisal Theory. It presents the results of an inter-annotator agreement study, and considers instances of systematic disagreement that suggest areas in which the theory might be improved. It also reports the results of a survey designed to evaluate the difficulty of Appraisal analysis in particularly ambiguous situations. Although the annotation task is difficult, there are many instances where the annotators agree; these are used to create a gold-standard corpus for the appraisal analysis experiments.

Chapter 7: Computational Appraisal Analysis. This chapter presents the results of evaluating the word similarity techniques described in Chapter 4 in labelling expressions and extracting words according to Appraisal Theory, and discusses the strengths and weaknesses of the methods for this task. It also proposes a number of heuristics based on Appraisal Theory, and applies these to the task of sentiment classification.

Chapter 8: Conclusion. This chapter summarises the techniques and experiments reported in the thesis, presents overall conclusions derived from the results, and proposes several directions for future work.

Chapter 2

Background

Researchers investigating the computational analysis of evaluation in natural language have labelled their work using a number of terms, including: opinion mining, sentiment analysis and subjectivity analysis. This chapter reviews each of these related areas and considers the difference between these names with respect to which aspects of evaluative language they focus on. Section 2.1 discusses the analysis of subjectivity in text, where researchers look for clues as to whether a proposition represents a private state (that is, an aspect of a writer’s psychological state, which is not open to objective verification (Quirk et al., 1985)) or instead a matter of fact. When considering sentiment in text (reviewed in Section 2.2), the task is to determine the polarity of opinion-bearing expressions, that is, whether they are generally positive or negative in sentiment. Section 2.3 discusses the area of opinion mining, which analyses these expressions in more detail, determining opinion holders, targets and types. Finally, Section 2.4 discusses the related area of affect recognition, which seeks to identify evaluative language in more dimensions than polarity, specifically distinguishing between different types of emotion.

2.1 Subjectivity Analysis

Subjective texts represent aspects of some individual’s point of view, such as their beliefs, emotions and perceptions (Banfield, 1982). In contrast, objective sentences express factual information (or at least information that is believed to be factual by the individual). Automatic recognition of subjective text is beneficial in a number of natural language processing applications, such as tracking point-of-view (Wiebe, 1994) and answering questions/extracting information with regard to matters of fact or opinion (Wiebe and Wilson, 2002; Riloff et al., 2005). Recognising subjective language is also a useful starting point when considering any aspect of opinions in text, since disregarding objective content can expedite and simplify further analysis (Pang and Lee, 2004).

Wiebe (1994) observes that the subjective or objective status of propositions is rarely presented explicitly, making classification a challenging task for computational methods. The problem is complicated further if one considers that documents are never wholly either subjective or objective (Wiebe et al., 2001b), which makes evaluation a difficult endeavour. Nevertheless, several studies demonstrate that automatic identification of subjective language is possible to some extent.

2.1.1 Annotating Expressions of Subjectivity

Wiebe et al. (1999) presented a case study of human capability in judging sentences as subjective or objective. The coding of subjectivity status was intention-based:

“If the primary intention of a sentence is objective presentation of material that is factual to the reporter, the sentence is objective. Otherwise, it is subjective.” (Wiebe et al., 1999)

Four judges independently annotated the same 14 articles chosen at random from the Wall Street Journal portion of the Penn Treebank (Marcus et al., 1993). The judges classified non-compound sentences and each conjunct of each compound sentence as subjective or objective and assigned a certainty value (Bruce and Wiebe, 1999). The annotations were corrected for bias using a variety of statistical methods and then discussed by the judges in order to create an updated coding manual. The same judges used the updated coding manual to annotate a disjoint set of documents.

The authors employed the Kappa coefficient (Carletta, 1996) to evaluate inter-judge pairwise agreement, finding that if items on which the annotators were uncertain were excluded then pairwise agreements between judges yielded a Kappa value of over 0.87, a score high enough to allow definite conclusions (Krippendorff, 1980). We can therefore conclude that the judges were able to agree on the subjective/objective nature of sentences, provided they had confidence in their own classifications.

Wiebe et al. employed the conjuncts labelled in their corpus annotation study to evaluate machine learning models for the classification of sentences according to subjectivity. Sentences were represented using the following features:

• the presence of a pronoun, an adjective, a cardinal number, a modal other than will and an adverb other than not;

• whether the sentence begins a new paragraph or not; and

• co-occurrence of word tokens and punctuation with respect to subjective/objective classifications.

The average accuracy across a ten-fold cross validation was 72.2% compared with a random-choice baseline of 51.0% and an upper bound of 89.5% (estimated from human performance).

2.1.2 Learning Subjective Language

Having demonstrated the feasibility of automatically recognising subjective language, even when using simple features, Wiebe and colleagues continued to investigate methods of learning more about the nature of subjective language (collated in Wiebe et al., 2004).

Wiebe (2000) clustered adjectives according to distributional similarity (using Lin’s (1998) method) in order to grow sets of clues of subjectivity from a small number of seed terms.

Hatzivassiloglou and Wiebe (2000) examined the effect of various adjective features in determining the subjectivity of sentences. They found that the semantic orientation (whether it was positive or negative in sentiment) and the gradability (whether it accepted modifiers that serve to intensify or diminish its strength) of an adjective to a large extent predicted the subjectivity of the sentences in which it appeared. Furthermore, employing automatic methods for determining semantic orientation and gradability improved the precision of automatic subjectivity classification.

Wiebe et al. (2001b) presented a method for learning collocational clues of subjectivity in text. Their method identified collocations of fixed word stems and a generalised collocational pattern (a collocation where one position can be filled by infrequently appearing words). The precision of an n-gram was calculated as the number of instances of that n-gram (n being from 1 to 4) in subjective elements relative to the total number of instances of that n-gram. The method labelled an n-gram as a subjective fixed-n-gram if its precision was at least 0.1 and greater than or equal to the precision of each of its components. To extract generalised collocational patterns Wiebe et al. replaced hapax legomena (words that appear only once in the corpus) with a placeholder (Unique). That is, they treated the set of unique words as a single frequently occurring word. The same criteria used to evaluate regular n-grams were used to determine whether any n-gram with Unique as a constituent was a subjective collocational pattern; if subjective, such a pattern was said to be a ugen-n-gram (unique generalised n-gram).
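The precision criterion described above can be illustrated with a minimal sketch. It is not Wiebe et al.'s implementation: it assumes that subjectivity labels are available per sentence rather than per subjective element, that the "components" of an n-gram are its constituent unigrams, and that the placeholder is the literal string UNIQUE. The function names are hypothetical.

```python
from collections import Counter

def ngrams(tokens, max_n=4):
    """Yield every n-gram of length 1..max_n as a tuple of tokens."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])

def replace_hapax(corpus, placeholder="UNIQUE"):
    """Replace words that occur exactly once in the corpus with a placeholder."""
    counts = Counter(tok for tokens, _ in corpus for tok in tokens)
    return [([t if counts[t] > 1 else placeholder for t in tokens], label)
            for tokens, label in corpus]

def subjective_ngrams(corpus, min_precision=0.1, max_n=4):
    """Return {n-gram: precision} for n-grams that pass both criteria.

    Precision = occurrences in subjective text / total occurrences.
    Assumption: the components of an n-gram are its constituent unigrams.
    """
    total, subjective = Counter(), Counter()
    for tokens, is_subjective in corpus:
        for gram in ngrams(tokens, max_n):
            total[gram] += 1
            if is_subjective:
                subjective[gram] += 1
    precision = {g: subjective[g] / total[g] for g in total}
    return {
        g: p for g, p in precision.items()
        if p >= min_precision and all(p >= precision[(w,)] for w in g)
    }

# Toy corpus of (tokens, is_subjective) pairs, invented for illustration.
corpus = [
    (["drives", "me", "mad"], True),
    (["drives", "a", "car"], False),
    (["me", "mad"], True),
]
selected = subjective_ngrams(corpus)
```

Running subjective_ngrams on the output of replace_hapax gives the generalised variant, in which any selected n-gram containing UNIQUE plays the role of a ugen-n-gram.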

Wiebe and Wilson (2002) observed that while some expressions are subjective in all contexts (the exclamation mark (!), for example), most are dependent on the surrounding context. Having identified potential subjective elements in previous studies such as those mentioned above, they attempted to disambiguate these elements in context. Their method followed that of Wiebe (1994), where an element was considered as more likely to be subjective if nearby elements were subjective. A potential subjective element (PSE) was considered to be high density if the number of subjective elements within a window W around the PSE was greater than some threshold value T. The authors used a corpus marked with subjective element annotations collected in previous studies (Wiebe et al., 1999, 2001a), finding that a high density of PSEs was strongly indicative of opinionated text.
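As a rough illustration of the density criterion, the sketch below marks a PSE as high density when more than T other PSEs fall within W positions on either side of it. Treating the neighbouring PSEs as the "subjective elements" being counted, and measuring the window in token positions, are assumptions made for the example; the function name is hypothetical.

```python
def high_density_pses(is_pse, window, threshold):
    """Return the indices of PSEs whose neighbourhood is dense in other PSEs.

    is_pse    -- list of booleans, one per token position
    window    -- W, number of positions inspected on each side
    threshold -- T, a PSE qualifies if it has more than T neighbouring PSEs
    """
    dense = []
    for i, flag in enumerate(is_pse):
        if not flag:
            continue
        lo, hi = max(0, i - window), min(len(is_pse), i + window + 1)
        neighbours = sum(is_pse[lo:hi]) - 1  # do not count the PSE itself
        if neighbours > threshold:
            dense.append(i)
    return dense

flags = [True, True, False, True, False, False, False, True]
dense = high_density_pses(flags, window=2, threshold=1)
```

In this toy sequence only the PSE at index 1 has two other PSEs within two positions, so it is the only one marked as high density.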

Riloff and Wiebe (2003) investigated the use of information extraction patterns as a means of representing subjective expressions. They utilised high-precision subjectivity and objectivity classifiers to obtain a large number of automatically labelled sentences. An extraction pattern learning algorithm was applied to this training data to learn lexico-syntactic patterns of subjectivity. The patterns were used to identify further subjective sentences and these in turn were used to provide more automatically labelled sentences. This process was then bootstrapped until an optimal set of patterns was found. Riloff and Wiebe (2003) argued that information extraction patterns are linguistically richer than simple n-grams and described an example potential pattern of subjectivity, <x> drives <y> up the wall, where x and y are noun phrases. They argued that this pattern could match many sentences such as “George drives me up the wall” or “The nosy old man drives his quiet neighbours up the wall.”

The first stage of their method was to employ high-precision subjectivity and objectivity classifiers. The classifiers used a lexicon of subjectivity clues from previous research, divided into sets of strongly subjective and weakly subjective clues. The classifier judged a sentence as subjective if it contained two or more strongly subjective clues. The classifier labelled a sentence as objective if the current, the previous and the following sentences contained no strongly subjective clues and at most one weakly subjective clue. However, a possible shortcoming of this method is that it may mislabel as objective a sentence that contains a strongly subjective clue, because the clue has not been encountered before and does not co-occur with subjectivity clues that have been seen previously. The concern here is that a depth of subjectivity clues of a certain type may be learnt, as opposed to a variety of types.
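The two labelling rules can be sketched as follows. This is an illustrative reading rather than Riloff and Wiebe's code: it assumes that clues are single tokens, that the limit of one weakly subjective clue applies to the three-sentence window as a whole, and that sentences matching neither rule are left unlabelled (None). The clue sets in the example are invented.

```python
def label_sentences(sentences, strong, weak):
    """Label tokenised sentences as 'subjective', 'objective' or None."""
    # Per sentence: (number of strong clues, number of weak clues).
    counts = [
        (sum(t in strong for t in tokens), sum(t in weak for t in tokens))
        for tokens in sentences
    ]
    labels = []
    for i, (n_strong, _) in enumerate(counts):
        if n_strong >= 2:
            labels.append("subjective")
            continue
        # Previous, current and following sentence (fewer at the edges).
        context = counts[max(0, i - 1):i + 2]
        ctx_strong = sum(s for s, _ in context)
        ctx_weak = sum(w for _, w in context)
        if ctx_strong == 0 and ctx_weak <= 1:
            labels.append("objective")
        else:
            labels.append(None)  # neither rule fires; leave unlabelled
    return labels

strong_clues = {"hate", "awful"}   # hypothetical lexicon entries
weak_clues = {"seems"}
doc = [
    ["i", "hate", "this", "awful", "film"],
    ["it", "opened", "in", "may"],
    ["the", "cast", "is", "large"],
    ["it", "ran", "for", "weeks"],
    ["it", "seems", "awful"],
]
labels = label_sentences(doc, strong_clues, weak_clues)
```

Most sentences in the toy document remain unlabelled, which reflects the design: the rules trade recall for precision so that the automatically labelled training data is as clean as possible.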

Having automatically constructed a training set of subjective and objective sentences, the authors used Riloff’s (1996) AutoSlog-TS to learn extraction patterns indicative of subjectivity. AutoSlog-TS uses a set of syntactic templates to describe a search space of possible patterns (listed in the left column of Table 2.1). These templates are exhaustively applied to the training data, generating every possible instantiation of each template (the right column of Table 2.1 lists some examples). The algorithm ranks the patterns according to the frequency of the pattern in subjective sentences relative to the total frequency of the pattern. Patterns are accepted based on two thresholds: one for the pattern’s frequency and another for the pattern’s frequency in subjective sentences. The authors found that augmenting the high-precision subjectivity classifier with the learned extraction patterns improved the recall by over 7 percentage points and reduced precision by only around 1 percentage point.

Wiebe and Riloff (2005) extended this approach in their OpinionFinder system by attempting to learn patterns of objective expressions as well as subjective. In this study they employed an additional bootstrapping step involving self-training Naïve Bayes classifiers which, once trained on sentences labelled by the extraction pattern process described above, labelled an additional large collection of unannotated data. The most confidently labelled sentences from this data set were used to bootstrap the extraction-pattern learner (and subsequently the Naïve Bayes classifiers once again). This additional bootstrapping led to a large increase in recall with a relatively minor drop in precision, with results comparable to supervised methods.

Wiebe and Mihalcea (2006) discussed the integration of word sense disambiguation techniques with subjectivity analysis, asking if it is possible to label word senses as being subjective or objective. For instance, consider the following two senses of the word alarm: “His alarm grew” versus “The alarm went off”; subjectivity analysis and further analysis of evaluation might benefit from methods for the automatic discrimination of word senses’ subjectivity. The authors began with an annotation study where two judges independently annotated 138 senses of 32 words from WordNet with labels subjective, objective, both, or uncertain. Overall agreement was 85.5%, while the Kappa value of 0.74 indicated strong agreement beyond that expected by chance. (Disregarding uncertain cases resulted in a Kappa value of 0.90.) Their method for classifying word senses of a target word as subjective or objective involved finding the distributionally similar words (DSW) using Lin’s method (1998) and for each sense of the target word computing a WordNet-based similarity score (WNSS) with each of the DSW. They then scored a sense as the sum of

SYNTACTIC FORM              EXAMPLE PATTERN
<subj> passive-verb         <subj> was satisfied
<subj> active-verb          <subj> complained
<subj> active-verb dobj     <subj> dealt blow
<subj> verb infinitive      <subj> appear to be
<subj> aux noun             <subj> has position
active-verb <dobj>          endorsed <dobj>
infinitive <dobj>           to condemn <dobj>
verb infinitive <dobj>      get to know <dobj>
noun aux <dobj>             fact is <dobj>
noun prep <np>              opinion on <np>
active-verb prep <np>       agrees with <np>
passive-verb prep <np>      was worried about <np>
infinitive prep <np>        to resort to <np>

Table 2.1: Riloff and Wiebe’s (2003) syntactic templates and examples of patterns of subjective expressions.

the WNSS of each DSW that appeared in a subjective expression in the MPQA corpus (Wiebe et al., 2005) divided by the total WNSS of all the DSW. The approach achieved a break-even point¹ of 0.50, significantly out-performing a random-choice baseline (break-even point of 0.27).
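The sense-scoring step of Wiebe and Mihalcea's method amounts to a weighted proportion, which the following sketch makes explicit. It assumes the WordNet-based similarity scores have already been computed and are supplied as a dictionary; the words and scores shown are invented, and the function name is hypothetical.

```python
def sense_subjectivity(wnss, subjective_words):
    """Score one word sense.

    wnss             -- {distributionally similar word: WNSS with this sense}
    subjective_words -- DSW observed in subjective expressions of a corpus
    Returns the WNSS mass of the subjective DSW divided by the total WNSS.
    """
    total = sum(wnss.values())
    if total == 0:
        return 0.0  # no similarity evidence for this sense
    subjective_mass = sum(s for w, s in wnss.items() if w in subjective_words)
    return subjective_mass / total

# Invented similarity scores for one sense of "alarm".
scores = {"fear": 0.5, "panic": 0.25, "clock": 0.25}
score = sense_subjectivity(scores, subjective_words={"fear", "panic"})
```

A score close to 1 indicates that the sense is most similar to words that tend to occur in subjective expressions.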

2.1.3 Multi-lingual Subjectivity Analysis Resources

Mihalcea et al. (2007) generated resources for the analysis of subjectivity in new languages by utilising existing tools and resources. First, they created a new subjectivity lexicon for Romanian by translating an English lexicon of subjectivity (created by Wiebe and Riloff, 2005). Following Riloff and Wiebe (2003), this lexicon was employed in a high-precision subjectivity/objectivity classifier² in order to label large amounts of raw text. Evaluating the classifier on 504 sentences manually annotated by two judges indicated that subjective

¹ The point at which precision and recall are equal.

² Learning extraction patterns and bootstrapping the process was not possible, however, due to the lack of appropriate linguistic analysis tools for the Romanian language.

sentences could be found with a good degree of precision (0.80), but a low performance overall (F1 = 0.44). Secondly, the authors demonstrated that a Romanian corpus could be reliably annotated by projecting sentence-level annotations from an automatically labelled parallel English corpus. This was evaluated by training machine-learning classifiers with a Romanian corpus annotated in this manner (from two sets of sentences, one subjective and one objective). When classifying the 504 manually annotated sentences, the machine-learning classifiers performed well (F1 = 0.68). This approach relies on having parallel corpora in English and the target language, however, so may not be readily transferred to other languages. Instead, Banea et al. (2008) proposed using machine translation techniques to propagate subjectivity annotations from English to other languages, achieving results comparable to those obtained for manually translated parallel corpora.

2.2 Sentiment Analysis

A frequently explored task in the fields of Sentiment Analysis and Opinion Mining is sentiment classification, which involves labelling text according to the writer’s general feeling toward their subject. Typically, the classes Positive and Negative are used (Pang et al., 2002; Turney, 2002), though increasingly researchers consider three-class systems including neutrality (Koppel and Schler, 2005), or four-class systems that consider degrees of sentiment (Pang and Lee, 2005) or incorporate the notion that some reviews are balanced in sentiment, containing equal amounts of positive and negative text (Zagibalov and Carroll, 2008a).

Given an example sentiment-bearing sentence (from a book review):

“The handsome features and sober patrician voice of George Alagiah are as familiar as wallpaper”

one might consider sentiment classification to be a straightforward task. In this instance the adjectives handsome, sober, patrician and familiar are all cues to suggest a positive sentiment. Pang et al. (2002) asked whether simple keyword spotting could suffice for sentiment classification. They asked two human judges to pick good indicators of positive and negative sentiment in movie reviews. Table 2.2 lists these keywords, and a similar sized list of keywords obtained using simple statistics based on frequency counts in a corpus of positive and negative movie reviews. Employing a simple keyword spotting classification approach resulted in only a moderately successful performance (compared with a baseline of 50%), so many researchers have investigated more sophisticated techniques.

                   Proposed word lists                                                  Accuracy   Ties
Human 1            positive: dazzling, brilliant, phenomenal, excellent, fantastic      58%        75%
                   negative: suck, terrible, awful, unwatchable, hideous
Human 2            positive: gripping, mesmerizing, riveting, spectacular, cool,        64%        39%
                             awesome, thrilling, badass, excellent, moving, exciting
                   negative: bad, cliched, sucks, boring, stupid, slow
Statistics-based   positive: love, wonderful, best, great, superb, still, beautiful     69%        16%
                   negative: bad, worst, stupid, waste, boring, ?, !

Table 2.2: Pang and Lee’s (2008) exploration of sentiment classification of movie reviews using keywords selected by human judges and simple statistics, with percentage accuracy and ties in number of keywords found.
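A keyword-spotting classifier of the kind evaluated in Table 2.2 can be sketched in a few lines. The word lists below are the Human 1 lists from the table; the counting rule (compare the number of positive and negative keywords found, and report a tie when they are equal) is an assumption about how such a classifier might operate, not a description of the original experimental code.

```python
POSITIVE = {"dazzling", "brilliant", "phenomenal", "excellent", "fantastic"}
NEGATIVE = {"suck", "terrible", "awful", "unwatchable", "hideous"}

def keyword_classify(tokens, positive=POSITIVE, negative=NEGATIVE):
    """Classify a tokenised review by counting keyword occurrences."""
    n_pos = sum(t in positive for t in tokens)
    n_neg = sum(t in negative for t in tokens)
    if n_pos > n_neg:
        return "positive"
    if n_neg > n_pos:
        return "negative"
    return "tie"  # equal counts, including the case where no keyword is found

review = "a dazzling and brilliant film with one awful scene".split()
label = keyword_classify(review)
```

A review containing none of the keywords yields a tie, which suggests why short word lists lead to the high tie rates shown in the table.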

Note that three terms have been used in the literature that all refer to the same concept of sentiment in terms of positivity or negativity of propositions, namely: polarity, semantic orientation and valence. These terms are used interchangeably in this dissertation, depending on which is most relevant for the work under discussion, but they should be regarded as synonymous.

Two primary tracks of research for the sentiment classification task currently exist: one involving the construction of lexical resources of sentiment, and another approach based on machine learning techniques.

2.2.1 Supervised Machine-learning Approach

The prevalent approach in sentiment classification, at the time of writing, is based on supervised machine-learning techniques. Pang et al. (2002) introduced this approach by evaluating the application of a variety of machine learning methods to classifying movie reviews according to their polarity. They showed that large quantities of data on polarity could be obtained from the World Wide Web when reviews contain numeric ratings (e.g. a star rating system). They created a large set of movie reviews annotated with polarity by assuming 1 or 2 stars out of 5 indicated a review was negative, and 4 or 5 stars indicated it was positive. They trained Naïve Bayes, Maximum Entropy and Support Vector Machine (SVM) classifiers using a variety of features, finding that accuracies all turned out at around 80% (50% baseline), though the SVM classifier using the boolean presence (rather than frequency) of unigrams as features tended to be the best performing configuration.

Other researchers have reported incremental improvements to this accuracy. Such improvements may be made by thresholding scores for features (Dave et al., 2003), or by performing linguistic analysis, such as part-of-speech tagging or determining semantic relations (Gamon, 2004). More substantial gains may be made by only considering the sentences in a text that have been classified as subjective in a previous step. Pang and Lee (2004) extracted subjective sentences from movie reviews (using machine-learning techniques for subjectivity classification). Repeating the evaluation employed previously by Pang et al. (2002) (but with a larger data set) achieved an accuracy of around 86%.

Mullen and Collier (2004) complemented traditional features for text classification with diverse information sources. For each text they calculated Osgood semantic differentiation factors derived from the WordNet minimum path length between each adjective in a document and a pair of seed words appropriate for each factor. SVM models were then trained using these factors, SO-PMI-IR (see Section 2.2.2), unigrams and features representing their relationship to aspects of the text such as topic. The authors also asserted that the sentiment expressed toward a specific entity is most easily identified with reference to that entity. They thus weighted features appropriately when they appeared in proximity to the subject of the discourse. The authors reported an improvement of around 2.5% accuracy when using their additional features with unigrams over using unigrams alone.

Noting that lexical features are key to many approaches in sentiment analysis and opinion recognition and that simple unigrams often perform well, Riloff et al. (2006) asked how often more complicated features were beneficial and which types of features were most valuable. To explore these questions they defined a subsumption hierarchy to specify the relationships between features. The researchers defined feature subsumption as occurring when all the text captured by one feature is also captured by a subsuming feature. For example, the unigram “happy” subsumes the bigram “very happy”. To use the hierarchy representationally they assigned each feature to its appropriate node before performing a top-down breadth-first traversal, discarding any features that were subsumed by a more general node. They investigated unigrams, bigrams and extraction patterns (Riloff and Wiebe, 2003), but the technique is readily expandable to include other types of features.

To estimate the quality of features so that better-performing complex features are not subsumed by their ancestors, Riloff et al. employed the Information Gain (IG) metric and defined behavioral subsumption as occurring “if two criteria are met: (1) [feature] A representationally subsumes [feature] B, and (2) IG(A) ≥ IG(B) − δ”. δ is a parameter that indicates the degree of information B must convey over A so that B is worth retaining rather than subsuming into A. Evaluating their feature selection technique using an SVM classifier and three data sets from sentiment and opinion research (OP: Wiebe et al.,

The tax proposal was   { simple and well-received      }   by the public.
                       { simplistic but well-received  }
                       { *simplistic and well-received }

Figure 2.1: Hatzivassiloglou and McKeown’s (1997) example of how the adjective simplistic deviates from its semantic group, in that its semantic orientation is opposite to the highly related word simple.

2004; Polarity: Pang and Lee, 2004; and MPQA: Wiebe et al., 2005), the authors report a small gain (typically less than 1%) in performance on each task over baselines of using unigrams, bigrams or extraction patterns in isolation. They note however that the subsumption methodology counteracts the negative effect of aggregating these feature sets in their entirety.
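The behavioural subsumption test lends itself to a short sketch. The version below is restricted to n-gram features and makes two simplifying assumptions: an n-gram representationally subsumes any longer n-gram that contains it as a contiguous sub-sequence, and every feature is compared against every other feature rather than being visited in a top-down breadth-first traversal. The Information Gain values are invented for illustration and the function names are hypothetical.

```python
def represents(a, b):
    """True if n-gram a is a proper contiguous sub-sequence of n-gram b."""
    n = len(a)
    return n < len(b) and any(b[i:i + n] == a for i in range(len(b) - n + 1))

def prune_subsumed(info_gain, delta):
    """Keep the features that no other feature behaviourally subsumes.

    A behaviourally subsumes B when A representationally subsumes B
    and IG(A) >= IG(B) - delta.
    """
    kept = []
    for b, ig_b in info_gain.items():
        subsumed = any(
            represents(a, b) and ig_a >= ig_b - delta
            for a, ig_a in info_gain.items()
        )
        if not subsumed:
            kept.append(b)
    return kept

# Invented Information Gain scores.
ig = {("happy",): 0.30, ("very", "happy"): 0.31, ("not", "happy"): 0.45}
kept = prune_subsumed(ig, delta=0.05)
```

Here “very happy” adds too little information over “happy” to be retained, whereas “not happy” exceeds the unigram's score by more than δ and is kept.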

2.2.2 Lexical Approach

Hatzivassiloglou and McKeown (1997) first investigated the semantic orientation of adjectives, noting that the type of conjunct used to join two adjectives can indicate when their semantic orientations differ, such as in Figure 2.1. The example also demonstrates that words that share a high degree of synonymy can sometimes differ in their semantic orientation (simple versus simplistic). Using a log-linear model of regression on all conjunctions of adjectives extracted from a corpus and a clustering algorithm, they created a set of positive adjectives and a set of negative adjectives with 92% accuracy.

However, Turney (2002) argued that isolated adjectives can be insufficient to determine semantic orientation over larger units of text, giving the example of the word unpredictable, which is negative in the context of an automobile’s steering but positive when describing the plot of a movie. Rather than using individual adjectives to classify the semantic orientation of text, Turney proposed SO-PMI-IR, a method for determining the orientation of phrases. SO-PMI-IR begins with the extraction of bigram phrases from a problem document using certain extraction patterns (listed in Table 2.3).

The semantic orientation (SO) of each phrase is then calculated as the Pointwise Mutual Information (PMI, a measure of the degree of statistical dependence) of the phrase and the word excellent minus the PMI of the phrase and the word poor³, where PMI is calculated following Church and Hanks (1990):

\[
\mathrm{PMI}(\mathit{word}_1, \mathit{word}_2) = \log_2 \left( \frac{p(\mathit{word}_1 \,\&\, \mathit{word}_2)}{p(\mathit{word}_1)\, p(\mathit{word}_2)} \right) \tag{2.1}
\]

     First Word           Second Word              Third Word (Not Extracted)
1.   JJ                   NN or NNS                anything
2.   RB, RBR, or RBS      JJ                       not NN nor NNS
3.   JJ                   JJ                       not NN nor NNS
4.   NN or NNS            JJ                       not NN nor NNS
5.   RB, RBR, or RBS      VB, VBD, VBN, or VBG     anything

Table 2.3: Patterns of part-of-speech tags used by SO-PMI-IR (Turney, 2002) for extracting phrases from problem documents.
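The patterns in Table 2.3 can be applied to part-of-speech tagged text as in the sketch below. It assumes the input has already been tagged with Penn Treebank tags and is supplied as (word, tag) pairs; the third-word constraint is encoded as a set of tags that must not follow the bigram, with None meaning that anything may follow. The names are hypothetical and the tagged sentence is invented.

```python
NOUN = {"NN", "NNS"}
ADVERB = {"RB", "RBR", "RBS"}
VERB = {"VB", "VBD", "VBN", "VBG"}

# (tags allowed for first word, tags allowed for second word,
#  tags the third word must NOT have, or None for no constraint)
PATTERNS = [
    ({"JJ"}, NOUN, None),
    (ADVERB, {"JJ"}, NOUN),
    ({"JJ"}, {"JJ"}, NOUN),
    (NOUN, {"JJ"}, NOUN),
    (ADVERB, VERB, None),
]

def extract_phrases(tagged):
    """Return the two-word phrases that match any pattern in Table 2.3."""
    phrases = []
    for i in range(len(tagged) - 1):
        (w1, t1), (w2, t2) = tagged[i], tagged[i + 1]
        # Tag of the following word, or None at the end of the sentence.
        t3 = tagged[i + 2][1] if i + 2 < len(tagged) else None
        for first, second, banned_third in PATTERNS:
            if (t1 in first and t2 in second
                    and (banned_third is None or t3 not in banned_third)):
                phrases.append(f"{w1} {w2}")
                break
    return phrases

sentence = [("the", "DT"), ("very", "RB"), ("unpredictable", "JJ"),
            ("plot", "NN"), ("really", "RB"), ("surprised", "VBD"),
            ("me", "PRP")]
phrases = extract_phrases(sentence)
```

In the example, “very unpredictable” is rejected by the second pattern because it is followed by a noun, while “unpredictable plot” and “really surprised” are extracted.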

Performing algebraic simplification, and specifying the operator NEAR as representing co-occurrence, Turney defined the semantic orientation of a phrase as:

\[
\mathrm{SO}(\mathit{phrase}) = \log_2 \left( \frac{\mathrm{hits}(\mathit{phrase}\ \mathrm{NEAR}\ \text{“excellent”})\ \mathrm{hits}(\text{“poor”})}{\mathrm{hits}(\mathit{phrase}\ \mathrm{NEAR}\ \text{“poor”})\ \mathrm{hits}(\text{“excellent”})} \right) \tag{2.2}
\]

The hits() function represents the number of hits returned by a query made to an Information Retrieval engine. In Turney’s case this was AltaVista, selected for its provision of the NEAR operator, which specifies that queries should only be satisfied when terms appear within ten words of each other⁴.
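Equation 2.2 can be computed directly from four hit counts, as the following sketch shows. The counts are passed in as plain numbers, since no search engine is queried here. The smoothing parameter is an assumption added for the example so that a phrase with zero co-occurrence hits does not cause a division by zero; it is not part of Equation 2.2 as stated above.

```python
import math

def semantic_orientation(near_excellent, near_poor,
                         hits_excellent, hits_poor, smoothing=0.01):
    """Compute SO(phrase) from hit counts, following Equation 2.2.

    near_excellent -- hits(phrase NEAR "excellent")
    near_poor      -- hits(phrase NEAR "poor")
    hits_excellent -- hits("excellent")
    hits_poor      -- hits("poor")
    smoothing      -- hypothetical constant added to the NEAR counts
    """
    numerator = (near_excellent + smoothing) * hits_poor
    denominator = (near_poor + smoothing) * hits_excellent
    return math.log2(numerator / denominator)

# Invented counts: the phrase co-occurs eight times as often with "excellent".
so = semantic_orientation(80, 10, 1000, 1000, smoothing=0)
```

A positive value indicates that the phrase is more strongly associated with excellent than with poor, and a negative value indicates the reverse.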

Turney evaluated the system using 410 documents from the Epinions web site, sampling from reviews in the domains of automobiles, banks, movies and travel destinations. Overall, SO-PMI-IR achieved an accuracy of 74.4% (compared with a majority-choice baseline of 59%), but struggled in the domain of movie reviews with an accuracy of 65.8%.

Turney’s lexical approach and the supervised machine learning techniques discussed above may not be as different as they appear. Beineke et al. (2004) explained how SO-PMI-IR effectively builds a large corpus of positive and negative training data using the seed words. They showed, through algebraic manipulation, how the calculation of pointwise mutual information between co-occurrences may be conceptualised as a Naïve Bayes classifier. They also contributed to the development of SO-PMI-IR by arguing that seed terms can suffer from being too general or too specific. A seed word may be too general in that it is polysemous, but not all of the meanings may represent sentiment (e.g. “poor” in quality, or “poor” in economic status). Conversely, a term might be too specific; there may be several ways to express a sentiment, and a single word may not capture them all.

Another statistical approach to building polarity lexicons is that proposed by Takamura et al. (2005), who estimated semantic orientation by analogy with electron spin; “regarding each word as an electron and its semantic orientation as the spin of the electron, [they] construct a lexical network by connecting two words if, for example, one word appears in the gloss of the other word”. Under this analogy the authors employed techniques from the statistical mechanics literature (namely the spin model and the mean field approximation) to label the General Inquirer’s positive and negative words, reporting an accuracy of 81.9% using 14 seed terms. The researchers went on to evaluate latent variable models (Takamura et al., 2006) and probability models (Takamura et al., 2007) for labelling phrases as positive, negative or neutral.

An alternative knowledge-based approach to calculating the semantic orientation of words is to employ a WordNet-based measure. Kamps et al. (2004) described a method of calculating the orientation of adjectives by using the minimum path length, MPL (w1, w2), between that word and a pair of antonymous prototypes (p1 and p2):

\[
\mathrm{orientation}(w, p_1, p_2) = \frac{\mathrm{MPL}(w, p_2) - \mathrm{MPL}(w, p_1)}{\mathrm{MPL}(p_1, p_2)} \tag{2.3}
\]

They evaluated this function by calculating values for the three major factors of Osgood et al.’s Theory of Semantic Differentiation (1957): Evaluative (good versus bad), Potency (strong versus weak) and Activity (active versus passive). For instance, the Evaluative function became:

\[
\mathrm{EVA}(w) = \frac{\mathrm{MPL}(w, \text{bad}) - \mathrm{MPL}(w, \text{good})}{\mathrm{MPL}(\text{good}, \text{bad})} \tag{2.4}
\]

The authors evaluated the measure using the manually constructed sets of the General Inquirer (Stone, 1966), which contains explicit labels for the Osgood factors. The precision of their method differed significantly for the three factors, with Evaluative = 68.2%, Potency = 71.4%, and Activity = 61.9%.
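Equations 2.3 and 2.4 can be illustrated with a breadth-first search over a small graph. The sketch does not use WordNet: the graph below is an invented chain of synonymy links standing in for it, and the function names are hypothetical. It also assumes every word is connected to both prototypes; if no path exists, mpl returns None and orientation will fail.

```python
from collections import deque

def mpl(graph, start, goal):
    """Minimum path length between two words, found by breadth-first search."""
    if start == goal:
        return 0
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour == goal:
                return dist + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None  # the two words are not connected

def orientation(graph, w, p1, p2):
    """Equation 2.3: ranges from +1 (at p1) to -1 (at p2)."""
    return (mpl(graph, w, p2) - mpl(graph, w, p1)) / mpl(graph, p1, p2)

# Invented chain of synonymy links: good - fine - okay - mediocre - bad.
edges = [("good", "fine"), ("fine", "okay"),
         ("okay", "mediocre"), ("mediocre", "bad")]
graph = {}
for a, b in edges:
    graph.setdefault(a, set()).add(b)
    graph.setdefault(b, set()).add(a)

eva_fine = orientation(graph, "fine", "good", "bad")
```

With good and bad as the prototypes this is the Evaluative function of Equation 2.4: words nearer to good receive positive scores and words nearer to bad receive negative scores.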

Andreevskaia and Bergler (2006) constructed fuzzy sets of sentiment-bearing adjectives using the synonymy, antonymy and hyponymy relations, and subsequently the retrieved glosses detailed by WordNet to grow sets from seed terms. The researchers reported 66.5% accuracy in classifying words from the General Inquirer according to positivity, negativity or neutrality, and 76.1% when disregarding neutral examples.

Zagibalov and Carroll (2008a) described an iterative retraining approach to learning sentiment-bearing vocabulary from a pair of antonymous seed terms. Working in Chinese, they identified zones of text delimited by punctuation marks, and classified each zone by its sentiment based on the frequency of the seed terms, while accounting for negating words; documents were labelled by choosing whichever class had the highest number of zones. In each iteration, their technique automatically classified the test data into subcorpora of positive and negative, and weighted sentiment scores for lexical items based on the relative frequency within each subcorpus. After 18 iterations Zagibalov and Carroll found a precision of 87.6% and a recall of 86.9% when evaluating the method on a corpus of product reviews. Their method performed favourably compared to a manually collated sentiment lexicon (which achieved 87.8% precision and 77.1% recall). Zagibalov and Carroll (2008b) later both reduced the supervision required and improved the performance of the basic technique by using only one seed term.

2.2.3 Determining Degrees of Sentiment

Attention has recently turned to considering neutral documents (Engström, 2004; Koppel and Schler, 2005) and degrees of sentiment (Pang and Lee, 2005). One might think that solving the two-way problem will automatically solve the problem of determining document neutrality (assuming neutral documents lie in the middle of the problem space). However, Koppel and Schler (2005) showed that negative documents tend to be identified by the absence of positive features rather than the presence of negative features; it is therefore difficult to distinguish between neutral and negative examples.

Pang and Lee (2005) evaluated human performance in determining the relative difference of pairs of movie reviews, that is, whether one is more positive, less positive or as positive. They found that performance was perfect when there was a large difference between the star-ratings of the reviews, but performance dropped as the difference decreased. However, the humans always out-performed the random-choice baseline. The authors then evaluated three algorithms to classify the degree of sentiment, ranging from 0 (negative) to 3 (positive), or in terms of negative, neutral or positive. They found that a One-vs-all (OVA) binary SVM approach worked best on the 3-way task, while OVA and a regression-based technique performed similarly on the 4-class problem. Both approaches beat a random-choice baseline on both tasks.

Goldberg and Zhu (2006) also examined techniques for labelling documents according to the degree of sentiment. They developed a semi-supervised approach of graph-based learning which encoded heuristics for rating-inference. Evaluating on the data sets produced by Pang and Lee (2005) indicated that the graph-based approach outperformed supervised learning when fewer than 200 labelled documents were available.

2.2.4 Labelling On-Topic Sentiments

Hurst and Nigam (2004) advocated a fusion of sentiment and topic classification. They argued that for sentiment analysis to be useful, methods must be capable of extracting sentiment-bearing sentences about a specific topic from a pool containing both relevant and irrelevant texts. Their assumption was that any sentence deemed to be both representative of sentiment and on-topic will be representative of sentiment about that topic. The sentences of a document were classified according to topic using a Winnow classifier. Sentences judged to be on-topic were chunked using shallow natural language processing and tagged with semantic orientation obtained from a manually created lexicon, tailored for the problem domain.
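The two-stage pipeline described above can be sketched in a few lines. The sketch makes several assumptions that are not taken from Hurst and Nigam (2004): it uses the multiplicative promote/demote variant of Winnow over binary word features with a threshold of half the vocabulary size, it omits the chunking step, and the training sentences, the lexicon and all names are invented for illustration.

```python
class Winnow:
    """Winnow with multiplicative promotion and demotion (assumed variant)."""

    def __init__(self, vocabulary, alpha=2.0):
        self.alpha = alpha
        self.weights = {term: 1.0 for term in vocabulary}
        self.threshold = len(self.weights) / 2.0

    def predict(self, tokens):
        active = [t for t in set(tokens) if t in self.weights]
        return sum(self.weights[t] for t in active) > self.threshold

    def update(self, tokens, on_topic):
        if self.predict(tokens) == on_topic:
            return  # mistake-driven: correct predictions change nothing
        # Promote active features on a missed positive, demote on a false alarm.
        factor = self.alpha if on_topic else 1.0 / self.alpha
        for t in set(tokens):
            if t in self.weights:
                self.weights[t] *= factor

# Invented training sentences for the hypothetical topic "battery".
TOPIC_TRAIN = [
    ("the battery lasts all day", True),
    ("the screen looks dim", False),
    ("battery life is poor", True),
    ("shipping was slow", False),
]

# Invented domain lexicon of semantic orientations.
LEXICON = {"poor": -1, "slow": -1, "dim": -1, "great": 1, "reliable": 1}

def train_topic_model(data, epochs=5):
    vocabulary = {tok for sentence, _ in data for tok in sentence.split()}
    model = Winnow(vocabulary)
    for _ in range(epochs):
        for sentence, on_topic in data:
            model.update(sentence.split(), on_topic)
    return model

def on_topic_sentiment(model, sentences):
    """Keep on-topic sentences and tag them with a lexicon-based orientation."""
    tagged = []
    for sentence in sentences:
        tokens = sentence.split()
        if not model.predict(tokens):
            continue
        total = sum(LEXICON.get(tok, 0) for tok in tokens)
        label = "positive" if total > 0 else "negative" if total < 0 else "neutral"
        tagged.append((sentence, label))
    return tagged
```

Given a trained model, `on_topic_sentiment(model, ["the battery is poor", "shipping was slow"])` discards the second sentence as off-topic even though it carries sentiment. This reflects the assumption that a sentence which is both sentiment-bearing and on-topic expresses sentiment about that topic.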

Eguchi and Lavrenko (2006) also considered the relationship between topic and sentiment but preferred to frame the sentiment classification problem with respect to Information Retrieval. They proposed generative models into which search terms are input to generate statements representing similar topicality and sentiment, before being ranked according to a probability estimate. The authors evaluated their models on documents from the MPQA corpus (Wiebe et al., 2005) which had been subsequently annotated with polarity labels (Wilson et al., 2005). The MPQA corpus contains annotations at the phrase level, and hence Eguchi and Lavrenko took the annotation as being indicative of polarity and the rest of its parent sentence as describing the topic. The topic-aware models showed a significant improvement over standard language models when retrieving on-topic sentiment-bearing sentences, with the best-performing configuration achieving a mean average precision of 22.0%.
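For readers unfamiliar with the metric, mean average precision can be computed as in the sketch below. It follows the standard Information Retrieval definition and assumes that every relevant item appears somewhere in each ranking; the exact evaluation protocol of Eguchi and Lavrenko (2006) is not reproduced here, and the function names are illustrative.

```python
def average_precision(relevance):
    """Average precision of one ranked list.

    relevance -- list of booleans in rank order (True = relevant item).
    Assumes all relevant items occur in the list.
    """
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / hits if hits else 0.0

def mean_average_precision(rankings):
    """Mean of the average precision over a set of queries."""
    return sum(average_precision(r) for r in rankings) / len(rankings)
```

Because precision is sampled at the rank of every relevant item, the measure rewards systems that place relevant sentences near the top of the ranking, and a single relevant item retrieved at a low rank contributes little.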

2.2.5 The Effect of Context on Sentiment

Polanyi and Zaenen (2004) considered how the author of a text can manipulate the expression of sentiment with contextual valence shifters. These include straightforward types of lexical choice such as negators (not, never, etc.), intensifiers (deeply versus slightly, for example), and presuppositional items (e.g. sufficient versus barely sufficient). A more subtle example of valence shifting resulting from lexical choice is that of modal operators, which can be used to set up a possible context that may not necessarily represent author sentiment (the authors give the example: “If Mary were a terrible person, she would be mean to her dogs” which does not denote that Mary is terrible, nor that she is mean to her dogs, but instead describes a possible scenario, indicated by the modal operator if). The authors also described context valence shifters that operate at the discourse level. These include connectors (e.g. but, on the contrary, etc.) joining propositions of opposite sentiment, lists of items that share sentiment, and elaborations that accentuate sentiment. Other aspects of discourse that might act to confound automatic sentiment analysis are: propositions that evaluate multiple entities, distinguishing features of genre, reported speech, sub-topics and asides, and cultural norms (for example, revolutionary might connote freedom-fighter or terrorist depending on one’s cultural background).

Kennedy and Inkpen (2006) incorporated the simpler lexical types of contextual valence shifters (negators and intensifiers/diminishers) into both the supervised machine-learning and lexical approaches to sentiment classification. They obtained lexical information from the General Inquirer (Stone, 1966), asserting that in the General Inquirer intensifiers are labelled as overstatements whereas diminishers are termed understatements. This assumption is difficult to accept, though, as under/over-stating items in the General Inquirer are often labelled with positivity or negativity, so it is unclear whether they convey sentiment by themselves, amplify the context’s sentiment, or both. Furthermore, Polanyi and Zaenen’s examples of intensifiers such as deeply or rather do not align easily with many instances from the General Inquirer’s overstatement and understatement words (for example, the overstatements of accurate, billion or celebrity or the understatements of unlucky, vexing or weak). Judging from the lexicon’s entries it would be more appropriate to use over/under-statements only when they are not labelled with positivity or negativity. Nevertheless, applying negators and intensifiers learnt in this way on the Polarity 1.0 movie review data set (Pang et al., 2002), Kennedy and Inkpen (2006) found that they improved the accuracy of a lexical approach to sentiment classification by around 1.5%. The supervised machine learning approach also benefited, with a small but statistically-significant improvement of about 0.7 percentage points. The authors found intensifiers added value beyond that provided by negators, but did not report finding a significant difference.
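The way negators and intensifiers can modify a lexical (term-counting) approach is illustrated by the sketch below. It is not the scheme of Kennedy and Inkpen (2006): the word lists and numeric weights are invented, and the sketch assumes that a shifter applies to the next sentiment-bearing word, however far away that word is.

```python
# Invented lexicon entries and weights, for illustration only.
POLARITY = {"good": 1.0, "great": 1.0, "bad": -1.0, "terrible": -1.0}
NEGATORS = {"not", "never"}
SHIFTERS = {"deeply": 2.0, "very": 2.0, "slightly": 0.5, "barely": 0.5}

def score_tokens(tokens):
    """Sum word polarities, letting preceding shifters modify each word.

    Negators flip the sign of the next sentiment word; intensifiers and
    diminishers scale it. Pending shifters are cleared once applied.
    """
    total = 0.0
    negate = False
    scale = 1.0
    for tok in tokens:
        if tok in NEGATORS:
            negate = not negate
        elif tok in SHIFTERS:
            scale *= SHIFTERS[tok]
        elif tok in POLARITY:
            value = POLARITY[tok] * scale
            total += -value if negate else value
            negate, scale = False, 1.0  # shifters are consumed
    return total

def classify(text):
    """Label a text by the sign of its shifted polarity score."""
    total = score_tokens(text.lower().split())
    return "positive" if total > 0 else "negative" if total < 0 else "neutral"
```

Under these assumptions `classify("not very good")` is negative, whereas plain term counting would label it positive. The unbounded scope is a deliberate simplification; a practical system would need to limit how far a negator or intensifier reaches.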

Wilson et al. (2005) also considered context when researching methods for phrase-level sentiment classification using supervised machine learning. Their data was derived from annotations