Vol 7, No 7 (2017)

Research Article

July

2017

Computer Science and Software Engineering

ISSN: 2277-128X (Volume-7, Issue-7)

A Survey on Opinion Mining

R. Lydia Priyadharsini, M. Lovelin Ponn Felciah

Department of Computer Science, Bishop Heber College, Bharathidasan University, Tiruchirappalli, Tamilnadu, India

DOI: 10.23956/ijarcsse/V7I7/0113

Abstract: The number of customer reviews has increased in recent years with the popularity of e-commerce and social media. People express their personal opinions on many social networks. Product reviews help new users learn about a product and help customers choose the best products. At present, a huge number of opinions is available on review sites, blogs, forums, etc. When the number of opinions or reviews grows large, it becomes difficult to go through all of them and get a clear idea. Hence it is essential to summarize the opinions and provide a clear and general view for quick and better understanding. Opinion mining is the art of extracting useful opinions. This paper presents a survey on opinion mining, the techniques used and the challenges faced.

Keywords: customer reviews, products, review sites, opinion, opinion mining

I. INTRODUCTION

Opinion mining is the field of study that analyzes people’s opinions, sentiments, evaluations, attitudes and emotions from written text. Its importance has extended to management science and social science, and it also contributes to business and society. In the world of e-commerce, a great many products are sold online. Online merchants provide a way for customers to enter reviews online, which helps the merchants enhance their business: once they know the customers' expectations and remarks, companies can take steps to satisfy the customers’ needs. When a user posts a comment online, we can find emotions and opinions in it. Humans carry a variety of emotions such as joy, sorrow, fear and excitement. Identifying these emotions is much easier in face-to-face communication than in written communication [3]. Opinion mining is used to explore the opinions users give on any social media or networking site for various commercial applications [3]. With the wide growth of e-facilities, many people come forward to express their views or opinions about a person, a product or anything they want. The distillation of knowledge from this huge amount of unstructured information is a very complicated task [10].

A. Components of an opinion:

An opinion usually contains the following three parts [3].

Opinion holder: The opinion holder is the one who gives the opinion on a product.

Opinion object: The object on which the opinion holder expresses a view. The object may be a product, a product feature, a person or any post that was posted online.

Opinion orientation: The opinion orientation may be positive, negative or neutral. This is called the semantic orientation of the opinion [1].

For example, consider the comment “According to John, the picture clarity of the camera is excellent.”

Here, John is the opinion holder, picture clarity is the object and the orientation is positive (since the word “excellent” denotes positive feedback).
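The three components can be held in a small record type. The following is a minimal sketch; the class name `Opinion` and its field names are illustrative assumptions and do not come from the cited works.

```python
from dataclasses import dataclass

ORIENTATIONS = {"positive", "negative", "neutral"}


@dataclass(frozen=True)
class Opinion:
    """Hypothetical container for the three parts of an opinion."""
    holder: str       # who expresses the opinion
    target: str       # object or feature the opinion is about
    orientation: str  # "positive", "negative" or "neutral"

    def __post_init__(self):
        # Reject anything outside the three orientations named in the text.
        if self.orientation not in ORIENTATIONS:
            raise ValueError(f"unknown orientation: {self.orientation}")


# The example comment about John and the camera, encoded by hand.
example = Opinion(holder="John", target="picture clarity", orientation="positive")
```

Extracting these fields automatically from raw text is the actual mining problem; the record only fixes the shape of the result.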

B. Text Categorization:

When performing opinion mining, it is important to know whether the opinion given by the user is subjective or objective [7]. Subjective sentences carry emotions; they are written from the reviewer's point of view. Objective sentences contain facts and reveal no emotion of the reviewer. Example: “I felt sad for the German team” and “The German team was weak today”. The first sentence is an example of a subjective sentence and the latter of an objective one.

C. Levels of Opinion Mining:

Opinion mining is usually done at three levels: document level, sentence level and aspect level. In document-level analysis, the opinion is determined for the whole document and may be positive or negative. For example, for a product review, the system determines whether the review as a whole says that the product is good or bad. In sentence-level analysis, the system goes through each sentence and finds out whether it expresses a positive, negative or neutral opinion about the product. Aspect-level analysis is a fine-grained analysis, also called feature-based analysis. This level aims at extracting opinions on individual features of the product; here the product’s attributes or features are reviewed.
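The difference between the three levels can be illustrated with a toy scorer. This is only a sketch: the word list `LEXICON`, the naive sentence splitting and the substring test for aspects are simplifying assumptions, not a method taken from the surveyed papers.

```python
import re

# Hypothetical toy lexicon; a real system would use a much larger resource.
LEXICON = {"good": 1, "great": 1, "excellent": 1,
           "bad": -1, "poor": -1, "terrible": -1}


def score(text):
    """Sum the lexicon scores of the words in the text."""
    return sum(LEXICON.get(w, 0) for w in re.findall(r"[a-z']+", text.lower()))


def label(value):
    return "positive" if value > 0 else "negative" if value < 0 else "neutral"


def document_level(review):
    """One label for the whole review."""
    return label(score(review))


def sentence_level(review):
    """One label per sentence (sentences split on ., ! and ?)."""
    sentences = [s.strip() for s in re.split(r"[.!?]+", review) if s.strip()]
    return [(s, label(score(s))) for s in sentences]


def aspect_level(review, aspects):
    """One label per aspect, based on the sentences that mention it."""
    labelled = {}
    for aspect in aspects:
        total = sum(score(s) for s, _ in sentence_level(review)
                    if aspect in s.lower())
        labelled[aspect] = label(total)
    return labelled


review = "The screen is excellent. The battery is poor. I bought it in May."
```

For this review the document-level label is neutral because one positive and one negative word cancel out, while the sentence and aspect levels keep the two opinions apart.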

II. RELATED WORK

ISSN(E): 2277-128X, ISSN(P): 2277-6451, DOI: 10.23956/ijarcsse/V7I7/0113, pp. 141-145

Ding et al. (2008) proposed a holistic lexicon-based approach that allows the system to handle context-dependent opinion words [1]. The method basically counts the number of positive and negative opinion words that are near the product feature in each review sentence. For instance, if the number of positive opinion words is greater than the number of negative opinion words, we may conclude that the final opinion about the product feature is positive. This approach also exploits external information and evidence in other sentences rather than studying only the current sentence. The holistic lexicon-based approach also uses conjunction rules: when the word “and” is used between two opinion words, we can assume that the orientations of the two words are the same. For example, if a review sentence contains the words “beautiful and spacious”, and we do not know whether “spacious” is positive or negative but know that “beautiful” is positive, we can infer that “spacious” is also positive.
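The counting step can be sketched as follows. This is a simplified illustration and not the authors' implementation: the word sets, the fixed token window and the restriction to single-word features are all assumptions.

```python
# Hypothetical opinion word lists for illustration only.
POSITIVE = {"great", "excellent", "amazing", "beautiful"}
NEGATIVE = {"bad", "poor", "terrible", "blurry"}


def feature_orientation(sentence, feature, window=4):
    """Compare counts of positive and negative words near a one-word feature.

    Returns "positive", "negative", "neutral", or None when the feature
    does not occur in the sentence.
    """
    tokens = [t.strip(".,!?;:").lower() for t in sentence.split()]
    if feature not in tokens:
        return None
    idx = tokens.index(feature)
    # Only words within `window` tokens of the feature are counted.
    nearby = tokens[max(0, idx - window): idx + window + 1]
    pos = sum(t in POSITIVE for t in nearby)
    neg = sum(t in NEGATIVE for t in nearby)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "neutral"


sentence = "The lens is great but the flash is terrible and blurry."
```

With the default window, `lens` is labelled positive and `flash` negative, because two negative words outweigh the single positive word near `flash`.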

Previous approaches have mostly relied on natural language processing techniques or statistical information, whereas Jin and Ho (2009) proposed a new machine learning framework using lexicalized HMMs. The approach naturally integrates linguistic features, such as part-of-speech tags and the surrounding contextual clues of words, into automatic learning [6].

Feature-based opinion mining identifies the product features and summarizes the opinions on those features by grouping positive and negative opinions. Zhai et al. (2011) exploited two pieces of natural language knowledge: sharing words (feature expressions that have words in common) and lexical similarity. Expressions related in either way are grouped together. A further piece of natural language knowledge is also used, positive and negative correlation: people mostly do not repeat the same thing in the same sentence, so features mentioned together in one sentence can be put into different groups. For example, consider the sentence “I like the picture quality, the battery life, and zoom of this camera” [4]. From this sentence we can infer that “picture quality”, “battery life” and “zoom” are probably not synonyms. For semi-supervised learning, the EM algorithm based on naïve Bayesian classification is used.

Pak and Paroubek, in “Twitter as a Corpus for Sentiment Analysis and Opinion Mining”, proposed a method to collect a corpus for sentiment analysis and used it to develop a sentiment classifier that determines the polarity (positive, negative or neutral) of a sentence. Texts containing positive emotions, negative emotions and objective text (no emotion) were collected from Twitter, a linguistic analysis of the corpus was performed, and the authors show how to build a sentiment classifier that uses the collected corpus as training data [11].

Ding and Liu (2007) studied the problem of finding the semantic orientation of an opinion. Most of the existing methods use opinion words for this process. Since opinion words are context dependent, Ding and Liu proposed some linguistic rules to deal with the problem, along with a new opinion aggregation function. A system called Opinion Observer has also been developed [5]. The opinion aggregation function is used to determine the orientation score; based on the final score, the opinion is determined to be positive or negative. For context-dependent opinion words, the following linguistic rules are used: the intra-sentence conjunction rule, the pseudo intra-sentence conjunction rule and the inter-sentence conjunction rule. In the first rule, the conjunction word is given explicitly. Example: “This camera takes great pictures and has a long battery life.” We can infer that “long” is a positive opinion word here because it is conjoined with the positive word “great”. In the second rule, the conjunction word is not given explicitly. Example: “The camera has a long battery life, which is great.” In the third rule, the next sentence is also considered. For example: “The picture quality is great. However, the battery life is short.” Also consider the synonym and antonym rule: if a word is found to be positive (or negative) in a context for a feature, its synonyms are also considered positive (or negative), and its antonyms are considered negative (or positive) [5].
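The three conjunction rules can be approximated with a few lines of string handling. The sketch below is a rough illustration under stated assumptions, not the Opinion Observer system: the seed lexicon `KNOWN`, the connective list and the clause splitting are all hypothetical simplifications.

```python
import re

# Hypothetical seed lexicon of words whose orientation is already known.
KNOWN = {"great": 1, "excellent": 1, "bad": -1, "poor": -1}


def clause_score(clause):
    """Sum of known orientations in one clause."""
    return sum(KNOWN.get(t, 0) for t in re.findall(r"[a-z]+", clause))


def sign(value):
    return 1 if value > 0 else -1


def infer_orientation(text, word):
    """Guess the orientation (+1, -1, or 0 if unknown) of a context-dependent word.

    Same clause as a known word: same orientation (pseudo intra-sentence rule).
    Joined by "and": same orientation. Joined by "but"/"however": opposite.
    """
    # The capture group keeps the connectives between the clauses.
    parts = re.split(r"\b(and|but|however)\b", text.lower())
    clauses, connectives = parts[0::2], parts[1::2]
    for i, clause in enumerate(clauses):
        if word not in re.findall(r"[a-z]+", clause):
            continue
        own = clause_score(clause)
        if own:
            return sign(own)
        if i > 0 and clause_score(clauses[i - 1]):
            s = sign(clause_score(clauses[i - 1]))
            return s if connectives[i - 1] == "and" else -s
        if i + 1 < len(clauses) and clause_score(clauses[i + 1]):
            s = sign(clause_score(clauses[i + 1]))
            return s if connectives[i] == "and" else -s
    return 0
```

Applied to the three example sentences, the sketch marks “long” as positive in the first two and “short” as negative in the third. A word with no known opinion word nearby stays undetermined.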

III. CHALLENGES IN OPINION MINING

A. Domain independence:

Opinion mining uses opinion words (words that bear opinion) to classify an opinion as positive or negative. If the review contains desirable words (for example, “amazing” or “great”), it is concluded that the review is positive. Similarly, when the review contains undesirable words (for example, “bad” or “poor”), it is concluded that the review is negative. However, certain words that are positive in one domain or context might be negative in another. For example, consider the sentences “the battery life is long” and “it takes a long time to focus”. The word “long” is positive in the first sentence but negative in the second.

B. Detection of fake reviews:

The web contains both authentic and spam content. For effective sentiment classification, spam content should be eliminated before processing. This can be done by identifying duplicates, by detecting outliers and by considering the reputation of the reviewer.
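Duplicate identification, the first of these options, can be sketched with Jaccard similarity over word bigrams. The shingle size and the 0.8 threshold are arbitrary assumptions chosen for illustration; outlier detection and reviewer reputation are not covered here.

```python
def shingles(text, k=2):
    """Set of k-word shingles of a lower-cased text."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of the shingle sets of two texts."""
    sa, sb = shingles(a), shingles(b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def near_duplicates(reviews, threshold=0.8):
    """Index pairs of reviews whose similarity reaches the threshold."""
    pairs = []
    for i in range(len(reviews)):
        for j in range(i + 1, len(reviews)):
            if jaccard(reviews[i], reviews[j]) >= threshold:
                pairs.append((i, j))
    return pairs


reviews = [
    "Best phone ever, buy it now",
    "best phone ever, buy it now",
    "The battery drains quickly after a month",
]
```

The pairwise loop is quadratic in the number of reviews, so it only suits small collections.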

C. Use of short forms and Orthographic words:

People’s way of messaging has changed a lot. Even on social media, they opt to use short forms rather than typing a whole word, for example “thnx” instead of “thanks”, “gr8” instead of “great” and so on. Orthographic words are words lengthened in excitement, for example “it is sooo sweeeeeeeet”. Words of these kinds create a challenge for opinion mining techniques that use an opinion lexicon to classify opinion words.
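One common preprocessing idea is to normalize such tokens before the lexicon lookup. The sketch below assumes a hand-made short-form table and a tiny vocabulary (`SHORT_FORMS` and `VOCAB` are hypothetical); it does not strip punctuation.

```python
import re

# Hypothetical lookup tables; the two short forms come from the text above.
SHORT_FORMS = {"thnx": "thanks", "gr8": "great"}
VOCAB = {"so", "sweet", "great", "thanks"}

# Three or more repetitions of the same character.
ELONGATED = re.compile(r"(.)\1{2,}")


def normalize_token(token):
    token = SHORT_FORMS.get(token, token)
    if not ELONGATED.search(token):
        return token
    two = ELONGATED.sub(r"\1\1", token)  # "sweeeeeeeet" -> "sweet"
    one = ELONGATED.sub(r"\1", token)    # "sooo" -> "so"
    for candidate in (two, one):
        if candidate in VOCAB:
            return candidate
    return two  # fall back to the doubled form when neither is known


def normalize(text):
    return " ".join(normalize_token(t) for t in text.lower().split())
```

Trying the doubled form first keeps legitimate double letters, as in “sweet”, before falling back to a single letter.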

D. Using sarcastic words:

in a metaphorical manner [8]. Example review about a smartphone: “All the features you want. Too bad they don’t work.”

IV. DATA SOURCES

The data sources include blogs, online product reviews, micro blogging and forums [2].

A. Blogs:

“Blog” is an abbreviated version of “weblog”. It is a regularly updated website, usually run by an individual or a small group. A blogger is the person who writes the content for the blog. The contents are generally called posts. Blogs allow people to leave comments about an article.

B. Online Reviews:

In most cases, a consumer who wishes to buy a product takes steps to learn about its quality and, most importantly, what others think about it. For this, they go through the reviews written by consumers who have already used the product or are using it.

C. Micro blogging:

Here users share information as short messages. The messages can be in text, image, audio or video format. Twitter is a popular micro blogging service. On Twitter the short messages are called tweets.

D. Internet Forums:

These are online discussion areas where members can hold a conversation in the form of posts. A member can read and review the other members' posts.

V. MACHINE LEARNING TECHNIQUES

There are a number of machine learning techniques used to classify the opinions given in the review.

A. Naive bayes(NB):

It is an effective classification algorithm. Naive bayes algorithm is mostly used for document classification. This algorithm is based on applying bayes theorem with strong independence assumptions with the features. Naive bayes classifier assumes the presence of a particular feature on a given class and that feature is unaware of the presence of any other feature. This is called class conditional independence.

Bayes theorem:

In Bayesian terms, X is considered the “evidence”. Let H be some hypothesis, such as that the data tuple X belongs to a specified class C. For classification problems, we want to determine P(H | X), the probability that the hypothesis H holds given the “evidence” X.

P(H) is the prior probability, or a priori probability, of H.

P(X) is the prior probability of X.

P(X | H) is the posterior probability of X conditioned on H.

P(H | X) is the posterior probability, or a posteriori probability, of H conditioned on X.

Bayes theorem is

$$P(H \mid X) = \frac{P(X \mid H)\,P(H)}{P(X)}$$
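A minimal sketch of a naive Bayes text classifier built on this theorem is given below. It assumes a multinomial word-count model with add-one (Laplace) smoothing, which is one common choice and not something prescribed by the text; the training sentences are invented. P(X) is the same for every class, so it is left out when comparing classes.

```python
import math
from collections import Counter, defaultdict


def train(docs):
    """docs: list of (text, label). Returns class counts, word counts, vocabulary."""
    class_docs = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in docs:
        class_docs[label] += 1
        for w in text.lower().split():
            word_counts[label][w] += 1
            vocab.add(w)
    return class_docs, word_counts, vocab


def predict(text, model):
    class_docs, word_counts, vocab = model
    total_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label in class_docs:
        # log P(H): prior probability of the class
        score = math.log(class_docs[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for w in text.lower().split():
            if w not in vocab:
                continue  # ignore words never seen in training
            # log P(x_i | H) with add-one smoothing; the sum over words
            # relies on class conditional independence
            score += math.log((word_counts[label][w] + 1)
                              / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented toy training data.
training = [
    ("great camera excellent pictures", "positive"),
    ("amazing battery great zoom", "positive"),
    ("poor battery bad pictures", "negative"),
    ("bad zoom terrible camera", "negative"),
]
model = train(training)
```

Working in log space avoids numerical underflow when many small probabilities are multiplied.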

B. Support Vector Machine (SVM):

It is an algorithm used to classify both linear and nonlinear data. SVM uses a non-linear mapping to transform the original training data into a higher dimension. In this algorithm, we plot each data item as a point in n-dimensional space, where n is the number of features, and the value of each feature is the value of a particular coordinate. Classification is then performed by finding the hyperplane that separates the two classes. SVM finds this hyperplane with the help of “support vectors”, which are the essential training tuples, and “margins”, which are defined by the support vectors. Fig. 1 [12] shows the classification performed by finding the hyperplane that differentiates two classes.
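The roles of the hyperplane, the margin and the support vectors can be shown on a two-dimensional toy example. The sketch does not train an SVM: the weights `W` and bias `B` are assumed to be already known, and they were picked by hand so that the closest points of each class lie exactly on the margin.

```python
import math

# Hypothetical, hand-picked parameters of a linear SVM (not learned here).
W = (0.5, 0.5)
B = -2.0


def decision(x):
    """Signed value w.x + b; zero on the separating hyperplane."""
    return sum(wi * xi for wi, xi in zip(W, x)) + B


def classify(x):
    return 1 if decision(x) >= 0 else -1


def margin_width():
    """Distance between the two margin boundaries, 2 / ||w||."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in W))


def support_vectors(points, tol=1e-9):
    """Points lying on a margin boundary, where |w.x + b| = 1."""
    return [p for p in points if abs(abs(decision(p)) - 1.0) <= tol]


points = [(3, 3), (4, 4), (1, 1), (0, 0)]
```

Here `(3, 3)` and `(1, 1)` are the support vectors. The other two points could be removed without moving the hyperplane.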

C. Centroid classification algorithm:

It is a very simple algorithm. For each training class, a centroid vector, also called a prototype vector, is calculated. The test document is compared with all the centroids using a similarity measure. Finally, the document is assigned to the class with the most similar centroid.
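A minimal sketch of this procedure follows. It assumes bag-of-words count vectors and cosine similarity, which the text does not specify, and the training sentences are invented.

```python
import math
from collections import Counter, defaultdict


def vectorize(text):
    """Bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def train_centroids(docs):
    """Average the document vectors of each class."""
    sums, counts = defaultdict(Counter), Counter()
    for text, label in docs:
        sums[label].update(vectorize(text))
        counts[label] += 1
    return {label: {t: v / counts[label] for t, v in vec.items()}
            for label, vec in sums.items()}


def classify(text, centroids):
    """Assign the class whose centroid is most similar to the document."""
    vec = vectorize(text)
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))


# Invented toy training data.
training = [
    ("great camera excellent pictures", "positive"),
    ("amazing battery great zoom", "positive"),
    ("poor battery bad pictures", "negative"),
    ("bad zoom terrible camera", "negative"),
]
centroids = train_centroids(training)
```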

D. N-Grams:

The N-gram model is a word prediction model that uses probabilistic methods to predict the next word after observing the previous N-1 words. Computing the probability of the next word is closely related to computing the probability of a sequence of words.
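For N = 2 (a bigram model), the probability of the next word given the previous one can be estimated from counts. The sketch below uses plain relative frequencies without smoothing, and the three-sentence corpus is invented for illustration.

```python
from collections import Counter, defaultdict


def train_bigrams(sentences):
    """Count, for every word, the words that directly follow it."""
    following = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for prev, nxt in zip(words, words[1:]):
            following[prev][nxt] += 1
    return following


def next_word_probabilities(model, prev):
    """P(next | prev) estimated as count(prev, next) / count(prev, *)."""
    counts = model.get(prev, Counter())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()} if total else {}


def predict_next(model, prev):
    probs = next_word_probabilities(model, prev)
    return max(probs, key=probs.get) if probs else None


corpus = [
    "the battery life is long",
    "the battery is weak",
    "the picture quality is great",
]
model = train_bigrams(corpus)
```

In this corpus “the” is followed by “battery” twice and by “picture” once, so “battery” is predicted with estimated probability 2/3.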

E. K Nearest Neighbor (KNN):

The K Nearest Neighbor algorithm is considered a lazy learning algorithm; it classifies data based on their similarity to their neighbors. Initially a parameter k is chosen, where k is the number of nearest neighbors. The distance between the query instance and all the training samples is calculated. After sorting the distances, the k samples with the smallest distances are taken as the nearest neighbors. The prediction for the query instance is the majority category among these nearest neighbors.
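These steps translate almost directly into code. The sketch assumes Euclidean distance and two invented numeric features per review (counts of positive and negative words); both choices are illustrative.

```python
import math
from collections import Counter


def knn_predict(query, samples, k=3):
    """samples: list of (feature_vector, label) pairs."""
    # Sort the training samples by their distance to the query instance.
    by_distance = sorted(samples, key=lambda s: math.dist(query, s[0]))
    # Majority vote among the k nearest neighbors.
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Invented features: (number of positive words, number of negative words).
samples = [
    ((3, 0), "positive"), ((2, 1), "positive"), ((4, 1), "positive"),
    ((0, 3), "negative"), ((1, 2), "negative"), ((1, 4), "negative"),
]
```

Because no model is built in advance, all the work happens at query time, which is why the algorithm is called lazy.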

VI. EVALUATION AND DISCUSSION

The performance of different techniques used for opinion mining is evaluated by calculating various metrics like precision, recall and F-measure.

Precision:

The percentage of retrieved documents that are in fact relevant to the query.

Recall:

The percentage of documents relevant to the query that were in fact retrieved.

F1 score:

Precision and recall are often used together in the F1 score (also called F-score or F-measure), their harmonic mean, which is a measure of a test's accuracy.
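For a classifier, the three metrics can be computed from the counts of true positives (TP), false positives (FP) and false negatives (FN) for one class. The function below is a small sketch; the label lists are invented.

```python
def precision_recall_f1(predicted, actual, positive="positive"):
    """Precision, recall and F1 for one class, given parallel label lists."""
    pairs = list(zip(predicted, actual))
    tp = sum(p == positive and a == positive for p, a in pairs)
    fp = sum(p == positive and a != positive for p, a in pairs)
    fn = sum(p != positive and a == positive for p, a in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0  # TP / (TP + FP)
    recall = tp / (tp + fn) if tp + fn else 0.0     # TP / (TP + FN)
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


predicted = ["positive", "positive", "negative", "negative", "negative"]
actual = ["positive", "negative", "positive", "positive", "negative"]
```

Here TP = 1, FP = 1 and FN = 2, so precision is 0.5, recall is 1/3 and F1 is 0.4.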

The following table shows some of the approaches proposed to solve the challenges due to domain independence and polarity of opinion-bearing words.

AUTHOR | PROPOSED APPROACH | PURPOSE

Ding et al. (2008) | Holistic lexicon-based approach | To handle context-dependent words

Jin and Ho (2009) | Lexicalized HMM | Integrates linguistic features

Ding and Liu (2007) | Opinion Observer and opinion aggregation function | To find the semantic orientation of an opinion

Pak and Paroubek | Sentiment classifier | To determine the polarity of an opinion

Zhai et al. (2011) | An unsupervised approach that extracts, expands and classifies emotional words | To determine evaluative sentences

VII. CONCLUSION

Opinion mining is an active research area. It deals with the analysis of the reviews or opinions given by users on social media or on the web, and it extracts a useful, summarized view of all the opinions. This paper has dealt with the importance of opinion mining, some of the techniques used to perform it and the challenges that have to be overcome. In future, effective techniques have to be introduced to overcome the challenges in opinion mining.

ACKNOWLEDGEMENT

I would like to express my special thanks to the Almighty and to my guide, Mrs. M. Lovelin Ponn Felciah, for her full support and encouragement in carrying out my research work successfully. I would also like to thank my dear friends who supported me.

REFERENCES

[1] Xiaowen Ding, Bing Liu, Philip S. Yu, “A Holistic Lexicon-Based Approach to Opinion Mining”, WSDM’08.

[2] G. Vinodhini, RM. Chandrasekaran, “Sentiment Analysis and Opinion Mining: A Survey”, 2012, IJARCSSE.

[3] Asmita Dhokrat, Sunil Khillare, C. Namrata Mahender, “Review on Techniques and Tools used for Opinion Mining”, International Journal of Computer Applications Technology and Research, Volume 4, Issue 6, 419-424, 2015, ISSN: 2319-8656.

[4] Zhongwu Zhai, Bing Liu, Hua Xu, Peifa Jia, “Clustering Product Features for Opinion Mining”, WSDM’11.

[5] Xiaowen Ding and Bing Liu, “The Utility of Linguistic Rules in Opinion Mining”, SIGIR’07, July 23-27, 2007, Amsterdam, The Netherlands. ACM 978-1-59593-597-7/07/0007.

[6] Wei Jin, Hung Hay Ho, “A Novel Lexicalized HMM-based Learning Framework for Web Opinion Mining”, Proceedings of the 26th International Conference on Machine Learning, Montreal, Canada, 2009.

[7] Andrea Esuli and Fabrizio Sebastiani, “Determining Term Subjectivity and Term Orientation for Opinion Mining”, EACL, 2006.

[8] Bakhtawar Seerat, Farouque Azam, “Opinion Mining: Issues and Challenges (A Survey)”, International Journal of Computer Applications (0975-8887), Volume 49, No. 9, July 2012.

[9] Zhongwu Zhai, Bing Liu, Lei Zhang, Hua Xu, Peifa Jia, “Identifying Evaluative Sentences in Online Discussions”, 2011, Association for the Advancement of Artificial Intelligence (www.aaai.org).

[10] Erik Cambria, Robert Speer, Catherine Havasi, Amir Hussain, “SenticNet: A Publicly Available Semantic Resource for Opinion Mining”, Commonsense Knowledge: Papers from the AAAI Fall Symposium (FS-10-02), Association for the Advancement of Artificial Intelligence (www.aaai.org).

[11] Alexander Pak, Patrick Paroubek, “Twitter as a Corpus for Sentiment Analysis and Opinion Mining”, Université de Paris-Sud, Laboratoire LIMSI-CNRS, Bâtiment 508, F-91405 Orsay Cedex, France.
