Reputation Management System

Mihai Damaschin

Matthijs Dorst

Maria Gerontini

Cihat Imamoglu

Caroline Queva

May, 2012


Abstract


Chapter 1

Introduction

Word-of-mouth marketing has always been an important success factor for consumer-oriented businesses, and brand reputation is an important part of a company’s value. Today, with social media, the reputation of a brand, product or service can change much more rapidly than before, and the range of consumer sentiment and attitude is much wider.

Although the reputation of brands, products and services was previously difficult to track, today it can be tracked through what is written about them online, e.g., in micro-blogs such as Twitter. Brand reputation mining, that is, monitoring social media sources for language that may affect reputation positively or negatively, can be a powerful tool for an organization’s public relations and marketing departments.

To address these issues, we developed an online reputation management system. The system takes a keyword (the name of a company, for instance) and categorizes the opinions expressed in tweets as positive, negative or neutral. Moreover, it measures how strong people’s attitudes are and takes into account how influential and authoritative the authors are on Twitter. Although our system uses techniques from machine learning, information retrieval and natural language processing, the results are presented to the reader in a simple and understandable manner.


Chapter 2

Methods

2.1 Search


2.2 Get tweets

To perform any kind of analysis, we needed to retrieve data about the tweets. Besides the actual tweet text, we also needed information about the users who posted the tweets and properties such as the number of retweets, followers and favorites.

Twitter offers access to most of its functionality through a REST API. Rather than writing our own Java/REST connector, we used the twitter4j library. Although it is not an official library, it gave us direct access to the data through Java method calls. On top of this library we added another layer specific to our use case.

One problem we faced was the rate limit imposed by the API. Our program can make only 150 calls per hour when not authenticated and 350 when authenticated. For the tweet data, we solved this issue with a common strategy in such situations: caching. We extended classes in the twitter4j library so that our methods work regardless of where the data actually comes from. For the user data, on the other hand, this limitation meant that we could not implement PageRank on a user graph.
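The caching idea can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in the project: it wraps a hypothetical remote lookup function instead of extending twitter4j classes, and the names CachingTweetSource, remote and search are invented for the example.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Caches results per query so that repeated lookups do not spend API calls. */
final class CachingTweetSource {
    private final Map<String, List<String>> cache = new HashMap<>();
    // Hypothetical stand-in for the real, rate-limited API call.
    private final Function<String, List<String>> remote;
    private int remoteCalls = 0;

    CachingTweetSource(Function<String, List<String>> remote) {
        this.remote = remote;
    }

    /** Returns cached tweets for the query; calls the remote source only on a cache miss. */
    List<String> search(String query) {
        List<String> cached = cache.get(query);
        if (cached != null) {
            return cached;
        }
        remoteCalls++;
        List<String> fresh = remote.apply(query);
        cache.put(query, fresh);
        return fresh;
    }

    /** Number of calls that actually reached the remote source. */
    int remoteCalls() {
        return remoteCalls;
    }
}
```

Because callers only see the search method, they do not need to know whether a result came from the cache or from the remote source, which mirrors the goal described above.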

2.3 Sentiment Analysis

2.3.1 Tokenization

Before the tweets could be analyzed, some preprocessing was required. The first step is the tokenization of tweets. At the beginning of the project, we used a simple tokenizer that split the text using whitespace as the delimiter. In this scheme, both words and punctuation can end up in a token. This approach was too limited and needed improvement to handle punctuation. The final tokenizer uses the StandardTokenizer class of Lucene (Apache, http://lucene.apache.org/ ). This tokenizer splits words at punctuation characters, removing the punctuation, and at hyphens, unless the token contains a number, in which case the whole token is interpreted as a product number and is not split. It also recognizes email addresses and internet hostnames as single tokens.
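To illustrate why whitespace splitting alone is limiting, the sketch below contrasts it with a crude punctuation-stripping variant. The second method is only a rough stand-in written for this example; it is not Lucene's StandardTokenizer and does not reproduce its rules for hyphens, numbers, email addresses or hostnames. The class and method names are assumptions.

```java
import java.util.ArrayList;
import java.util.List;

final class SimpleTokenizers {
    /** Initial approach: split on runs of whitespace; punctuation stays attached to words. */
    static List<String> whitespace(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** Rough approximation of punctuation handling: drop leading and trailing punctuation. */
    static List<String> stripPunctuation(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : whitespace(text)) {
            String cleaned = t.replaceAll("^\\p{Punct}+|\\p{Punct}+$", "");
            if (!cleaned.isEmpty()) {
                tokens.add(cleaned);
            }
        }
        return tokens;
    }
}
```

With the input "Not good, really!", the whitespace version yields the tokens "good," and "really!", which would not match plain lexicon entries, whereas the second version yields "good" and "really".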


2.3.2 Lexicon

Once the tweets about the company being searched for have been retrieved, they must be analyzed to determine whether they are positive or not. Three labels are used for the tweets: negative, neutral and positive. A neutral tweet is defined as one whose sentiment score equals 0.

For this analysis, a sentiment lexicon is used. The lexicon consists of sentiment words, each associated with a score, for example (good: +1), (bad: -1), (like: +1), (hate: -2). Using the lexicon is straightforward: one sums the scores of the tokens to obtain the overall sentiment score of a tweet. However, this method can be improved with some heuristics.
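The basic lexicon lookup can be sketched as below. The lexicon contains only the four example entries from the text; the class and method names are hypothetical, and mapping scores above 0 to positive and below 0 to negative is an assumption (the text only defines neutral as a score of 0).

```java
import java.util.HashMap;
import java.util.Map;

final class LexiconScorer {
    // Toy lexicon holding only the example entries given in the text.
    static final Map<String, Integer> LEXICON = new HashMap<>();
    static {
        LEXICON.put("good", 1);
        LEXICON.put("bad", -1);
        LEXICON.put("like", 1);
        LEXICON.put("hate", -2);
    }

    /** Sums the lexicon score of every token; unknown tokens contribute 0. */
    static int score(String[] tokens) {
        int total = 0;
        for (String token : tokens) {
            total += LEXICON.getOrDefault(token.toLowerCase(), 0);
        }
        return total;
    }

    /** Assumed mapping: score 0 is neutral, above 0 positive, below 0 negative. */
    static String label(int score) {
        if (score > 0) {
            return "positive";
        }
        if (score < 0) {
            return "negative";
        }
        return "neutral";
    }
}
```

For the tokens "I", "like", "it", "but", "hate", "waiting", the sum is (+1) + (-2) = -1, so the tweet would be labeled negative under these assumptions.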

The first issue to take into account is negation. To handle it, we added negation words to the database; when a negation is found in a sentence, the scores of the two following words are reversed by multiplying them by -1. For example, “This is not good” gives a score of (-1)*(+1) = -1.
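A minimal sketch of the negation heuristic follows. The list of negation words is an assumption (the text does not enumerate them), as are the class and variable names; the window of two following words is taken from the description above.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class NegationScorer {
    static final Map<String, Integer> LEXICON = new HashMap<>();
    // Assumed negation words; the actual database contents are not given in the text.
    static final Set<String> NEGATIONS = new HashSet<>(Arrays.asList("not", "never", "no"));
    static {
        LEXICON.put("good", 1);
        LEXICON.put("bad", -1);
        LEXICON.put("like", 1);
        LEXICON.put("hate", -2);
    }

    /** Sums token scores, reversing the sign of the two words that follow a negation. */
    static int score(String[] tokens) {
        int total = 0;
        int flipsLeft = 0; // upcoming words still inside the negation window
        for (String raw : tokens) {
            String token = raw.toLowerCase();
            if (NEGATIONS.contains(token)) {
                flipsLeft = 2;
                continue;
            }
            int s = LEXICON.getOrDefault(token, 0);
            if (flipsLeft > 0) {
                s = -s;
                flipsLeft--;
            }
            total += s;
        }
        return total;
    }
}
```

With this sketch, the tokens of "This is not good" score -1, while in "not a very good" the word "good" is the third word after the negation and keeps its score of +1.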

Similarly, modifiers were added to the analysis. Modifiers are words like “really”, “very” or “quite”. Each of these words modifies the sentiment score of the following word by multiplying it by a factor proportional to the strength of the modifier: for example, “very” multiplies by 2, whereas “quite” multiplies by 0.5.
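The modifier heuristic can be sketched in the same style. Only the two factors stated in the text ("very" = 2, "quite" = 0.5) are included; the factor for "really" is not given and is therefore left out. Class and variable names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

final class ModifierScorer {
    static final Map<String, Double> LEXICON = new HashMap<>();
    static final Map<String, Double> MODIFIERS = new HashMap<>();
    static {
        LEXICON.put("good", 1.0);
        LEXICON.put("bad", -1.0);
        LEXICON.put("like", 1.0);
        LEXICON.put("hate", -2.0);
        // Factors taken from the examples in the text.
        MODIFIERS.put("very", 2.0);
        MODIFIERS.put("quite", 0.5);
    }

    /** Sums token scores; a modifier scales the score of the single word that follows it. */
    static double score(String[] tokens) {
        double total = 0.0;
        double factor = 1.0; // multiplier carried over from a preceding modifier
        for (String raw : tokens) {
            String token = raw.toLowerCase();
            Double modifier = MODIFIERS.get(token);
            if (modifier != null) {
                factor = modifier;
                continue;
            }
            total += factor * LEXICON.getOrDefault(token, 0.0);
            factor = 1.0; // the modifier applies to the next word only
        }
        return total;
    }
}
```

Under these assumptions, "very good" scores 2.0 and "quite bad" scores -0.5.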

It is also important to handle sentences like “I love companyA but I hate companyB”. For this, the notion of distance was added to the analysis. Distance increases the score of sentiment words close to the name of the company and reduces the score of sentiment words farther away. A Gaussian distribution is used with a maximum distance of 4; this means that sentiment words far from the name of the company (4 words between them) are not taken into account.
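One possible reading of the distance heuristic is sketched below. Several details are assumptions, since the text does not specify them: distance is measured as the difference in token positions, words at a distance greater than 4 get weight 0, the Gaussian has a spread of sigma = 2 and a peak of 1, and "love" is given a score of +2. All names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

final class DistanceWeightedScorer {
    static final int MAX_DISTANCE = 4; // cutoff mentioned in the text
    static final double SIGMA = 2.0;   // assumed spread; not given in the text
    static final Map<String, Integer> LEXICON = new HashMap<>();
    static {
        LEXICON.put("love", 2);  // assumed score for illustration
        LEXICON.put("hate", -2); // score from the text's lexicon example
    }

    /** Gaussian weight for a token offset; 0 beyond the maximum distance. */
    static double weight(int distance) {
        int d = Math.abs(distance);
        if (d > MAX_DISTANCE) {
            return 0.0;
        }
        return Math.exp(-(d * d) / (2.0 * SIGMA * SIGMA));
    }

    /** Scores a tweet relative to the first mention of the company; 0 if it is not mentioned. */
    static double score(String[] tokens, String company) {
        int companyIndex = -1;
        for (int i = 0; i < tokens.length; i++) {
            if (tokens[i].equalsIgnoreCase(company)) {
                companyIndex = i;
                break;
            }
        }
        if (companyIndex < 0) {
            return 0.0;
        }
        double total = 0.0;
        for (int i = 0; i < tokens.length; i++) {
            int s = LEXICON.getOrDefault(tokens[i].toLowerCase(), 0);
            total += s * weight(i - companyIndex);
        }
        return total;
    }
}
```

For the tokens of "I love companyA but I hate companyB", this sketch gives a positive score for companyA (where "love" is adjacent and "hate" is three positions away) and a negative score for companyB (where "love" lies beyond the cutoff).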
