Vol 8, No 10 (2018)

Research Article

October 2018

Computer Science and Software Engineering

ISSN: 2277-128X (Volume-8, Issue-10)

Comparative Analysis of Different Supervised Machine Learning Algorithms for Sentiment Analysis

Gundeep Singh¹, Sukhmanjit Kaur², Prabhsimran Singh³

¹,³ Department of Computer Engineering & Technology, Guru Nanak Dev University, Amritsar, Punjab, India

² Department of Computer Applications, Khalsa College of Management & Technology, Amritsar, Punjab, India

[email protected], [email protected], [email protected]

Abstract: With advances in technology, sentiment analysis has grown rapidly in the last few years. Sentiment analysis is a natural language processing technique that deals with detecting emotion in a given text. A major reason for the popularity of this field is that it applies to a large number of application domains. As a result, various techniques have been proposed to perform sentiment analysis using supervised machine learning algorithms. This raises a major question: which algorithm is the best? This paper tries to answer this question by performing a comparative analysis of various algorithms on different datasets. Our results show that Naïve Bayes performs best overall, giving the best results on two of the four datasets.

Keywords: Sentiment Analysis, Text Mining, Supervised machine learning, classification, WEKA.

I. INTRODUCTION

Sentiment analysis is a field of study that uses natural language processing (NLP), statistics and machine learning techniques to extract or identify the opinion (sentiment) content of a given text [1]. The past decade has witnessed rapid growth in the volume of digital data, much of which is generated on social networking websites (SNW). Researchers in the field of sentiment analysis take a keen interest in this data for mining the opinions people express through tweets on Twitter, status updates on Facebook, etc. This digital data can be used for a wide range of analysis and interpretation. However, the mammoth amount of data available on these SNW cannot be processed directly to derive useful results, and hence sophisticated techniques such as machine learning algorithms are used to accomplish this complicated task. This gives rise to another problem: which technique or algorithm should be used for the analysis? There are multiple machine learning algorithms, classified into supervised and unsupervised learning techniques. In supervised learning the instances are labeled, while in unsupervised learning the instances are not labeled.

This paper addresses the problem stated above by finding which algorithm is best suited for performing sentiment analysis on a given dataset. For this, we consider some well-known supervised machine learning algorithms, namely Decision Tree (DT), K-Nearest Neighbors (KNN), Naïve Bayes (NB), Random Forest (RF) and Support Vector Machine (SVM), and perform a comparative analysis that puts the various approaches into perspective.

II. SUPERVISED MACHINE LEARNING ALGORITHMS

Supervised learning is the machine learning task of inferring a function from labeled training data. The training data consists of training examples, each made up of an input value and the corresponding desired output value. It is so named because the scientist or researcher acts as a supervisor to the algorithm, guiding it toward the results it should produce at the end of the computation. This is similar to a student learning from a teacher. In this section we discuss some well-known supervised machine learning algorithms: Decision Tree (DT) [2], K-Nearest Neighbors (KNN) [3], Naïve Bayes (NB) [4], Random Forest (RF) [5] and Support Vector Machine (SVM) [6]. Table 1 gives a descriptive comparison of these algorithms.
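To make the idea of learning from labeled examples concrete, the sketch below trains a small multinomial Naïve Bayes classifier on a handful of labeled sentences. It is a minimal illustration in plain Python and not the WEKA implementation used in this paper; the toy training sentences, the whitespace tokenizer and the function names `tokenize`, `train_nb` and `predict_nb` are assumptions made for the example.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


def train_nb(examples):
    """Learn class counts and per-class word counts from (sentence, label) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for sentence, label in examples:
        class_counts[label] += 1
        for word in tokenize(sentence):
            word_counts[label][word] += 1
            vocabulary.add(word)
    return class_counts, word_counts, vocabulary


def predict_nb(model, sentence):
    """Return the label with the highest log posterior, using Laplace smoothing."""
    class_counts, word_counts, vocabulary = model
    total_examples = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label, count in class_counts.items():
        score = math.log(count / total_examples)  # log prior
        total_words = sum(word_counts[label].values())
        for word in tokenize(sentence):
            if word not in vocabulary:
                continue  # skip words never seen in training
            smoothed = (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            score += math.log(smoothed)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy training data: each example pairs an input with its desired output.
training_data = [
    ("absolutely loved it", "positive"),
    ("so good and tasty", "positive"),
    ("wasted my money", "negative"),
    ("the acting was bad", "negative"),
]
model = train_nb(training_data)
print(predict_nb(model, "loved the good cake"))
print(predict_nb(model, "bad acting wasted time"))
```

The labels in `training_data` play the role of the supervisor: the classifier only ever sees which output is desired for each input and generalizes from those pairs.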

ISSN(E): 2277-128X, ISSN(P): 2277-6451, pp. 61-65

Table 1. Various supervised machine learning algorithms

| Property | DT [2] | KNN [3] | NB [4] | RF [5] | SVM [6] |
|---|---|---|---|---|---|
| About | It uses a tree structure to classify the data | It is a non-parametric, instance-based algorithm used for classification | It uses a simple statistical Bayesian classifier for performing classification | Random Forests use multiple classification trees. Each tree gives a classification (class vote). The forest chooses the classification having the most votes | They work on the principle of finding the location of decision boundaries (hyperplanes) that produce the optimal separation of classes |
| Explanation Ability | High | Average | Average | Low | Low |
| Accuracy | Average | Average | High | High | High |
| Training Speed | Low | Average | High | Low | Low |
| Classification Speed | Average | Average | High | Low | Average |
| Missing Values Handling | Average | Average | High | High | Average |
| Noise Handling | Low | Low | High | High | Low |
| Dealing With Discrete/Binary/Continuous Attributes | All | All | Not Continuous | All | Not Discrete |
| Need of Scaling | No | Yes | No | No | No |
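The voting step described for Random Forests in Table 1 can be sketched as follows. The sketch covers only how class votes are combined, not how the trees are built; the per-tree predictions are hypothetical values, and `forest_vote` is an illustrative name rather than a WEKA function.

```python
from collections import Counter


def forest_vote(tree_predictions):
    """Return the class that receives the most votes from the individual trees."""
    votes = Counter(tree_predictions)
    winner, _ = votes.most_common(1)[0]
    return winner


# Hypothetical class votes from five trees for one sentence.
votes_for_sentence = ["positive", "negative", "positive", "positive", "negative"]
print(forest_vote(votes_for_sentence))
```

In this sketch a tie is resolved in favor of the class that was counted first, which is an arbitrary choice made for brevity.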

III. EXPERIMENTATION AND RESULTS

The entire experimentation was performed using WEKA 3.8 [7], which contains a collection of machine learning algorithms for performing various data mining tasks. We used a total of three datasets: the Amazon [8] dataset containing product reviews, the IMDb [9] dataset containing movie reviews and the Yelp [10] dataset containing restaurant reviews. Each dataset consists of 1000 sentences, each with its corresponding sentiment. All the datasets are evenly distributed, containing 500 positive and 500 negative sentences. In addition, we used a fourth dataset that contains all three datasets in combined form. The statistics of the datasets are shown in Table 2, while Table 3 shows an example from each dataset.

Table 2. Statistics of datasets

| Source | Positive Sentences | Negative Sentences | Total Sentences |
|---|---|---|---|
| Amazon | 500 | 500 | 1000 |
| IMDb | 500 | 500 | 1000 |
| Yelp | 500 | 500 | 1000 |
| Combined Dataset | 1500 | 1500 | 3000 |
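The combined dataset and the class counts in Table 2 can be reproduced with a few lines of code. The sketch below assumes each dataset is available as a list of (sentence, label) pairs with label 1 for positive and 0 for negative; the placeholder sentences and the helper `make_placeholder` are hypothetical stand-ins for the real data files, whose format is not described here.

```python
from collections import Counter


def make_placeholder(source, n_positive=500, n_negative=500):
    """Build placeholder (sentence, label) pairs; real sentences would be loaded from file."""
    positives = [(f"{source} positive sentence {i}", 1) for i in range(n_positive)]
    negatives = [(f"{source} negative sentence {i}", 0) for i in range(n_negative)]
    return positives + negatives


sources = ("Amazon", "IMDb", "Yelp")
datasets = {name: make_placeholder(name) for name in sources}

# The fourth dataset is the concatenation of the three sources.
datasets["Combined"] = [pair for name in sources for pair in datasets[name]]

for name, pairs in datasets.items():
    counts = Counter(label for _, label in pairs)
    print(name, counts[1], counts[0], len(pairs))
```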

Each dataset under consideration is passed through classification models built using the different machine learning algorithms, i.e. Decision Tree (DT), K-Nearest Neighbors (KNN), Naïve Bayes (NB), Random Forest (RF) and Support Vector Machine (SVM), as discussed earlier (see Section II).

Table 3. Examples from each dataset

| Source | Positive Sentences | Negative Sentences |
|---|---|---|
| Amazon | I bought this to use with my Kindle Fire and absolutely loved it! | Needless to say I wasted my money |
| IMDb | Saw the movie today and thought it was a good effort good messages for kids | The acting was bad the dialogs were extremely shallow and insincere |
| Yelp | Thats right the red velvet cake ohhh this stuff is so good | |


The results for the Amazon dataset are shown in Table 3. SVM gave the best results for the Amazon dataset, closely followed by NB and RF.

Table 3. Experimentation Results of Amazon dataset

| Measure | DT | KNN | NB | RF | SVM |
|---|---|---|---|---|---|
| Accuracy (%) | 68.4 | 67.6 | 77.8 | 77.7 | 79.3 |
| Kappa Statistics | 0.368 | 0.352 | 0.556 | 0.554 | 0.586 |
| Relative absolute error (%) | 71.23 | 69.60 | 51.53 | 71.75 | 41.4 |
| Root relative squared error (%) | 96.93 | 103.201 | 76.44 | 79.83 | 90.99 |
| Matthews correlation coefficient | 0.372 | 0.359 | 0.559 | 0.555 | 0.586 |
| F-Measure | 0.682 | 0.673 | 0.777 | 0.777 | 0.793 |
| Precision-Recall Curve (PRC Area) | 0.697 | 0.714 | 0.874 | 0.871 | 0.732 |
| Area under the Curve | 0.723 | 0.740 | 0.879 | 0.871 | 0.793 |
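Several of the measures reported for the Amazon dataset can be derived from a two-class confusion matrix. The sketch below computes accuracy, the kappa statistic, the Matthews correlation coefficient and the F-measure of the positive class. The confusion-matrix counts are hypothetical: they were chosen only so that 793 of 1000 sentences are classified correctly, matching the 79.3% accuracy reported for SVM, since the actual counts are not given here. The name `binary_metrics` is likewise illustrative, and a positive-class F-measure may differ from an averaged F-measure reported by a tool.

```python
import math


def binary_metrics(tp, fn, fp, tn):
    """Compute accuracy, kappa, MCC and positive-class F-measure from confusion counts."""
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total
    # Chance agreement: actual and predicted class proportions multiplied, summed over classes.
    expected = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / (total * total)
    kappa = (accuracy - expected) / (1 - expected)
    mcc = (tp * tn - fp * fn) / math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "kappa": kappa, "mcc": mcc, "f_measure": f_measure}


# Hypothetical counts for a balanced test set of 500 positive and 500 negative sentences.
metrics = binary_metrics(tp=400, fn=100, fp=107, tn=393)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With balanced classes the chance agreement is 0.5, so kappa reduces to 2 × accuracy − 1. This is consistent with the accuracy and kappa rows reported for the Amazon dataset, for example 2 × 0.793 − 1 = 0.586 for SVM.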

The results for the IMDb dataset are shown in Table 4. NB gave the best results for the IMDb dataset, while RF and SVM were second and third respectively in terms of performance.

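Table 4. Experimentation Results of IMDb dataset

| Measure | DT | KNN | NB | RF | SVM |
|---|---|---|---|---|---|
| Accuracy (%) | 57.1 | 59 | 80 | 75.1 | 74.1 |
| Relative absolute error (%) | 88.6 | 82.95 | 51.75 | 82.61 | 51.8 |
| Matthews correlation coefficient | 0.142 | 0.188 | 0.600 | 0.504 | 0.482 |
| F-Measure | 0.571 | 0.581 | 0.800 | 0.750 | 0.741 |
| Precision-Recall Curve (PRC Area) | 0.566 | 0.596 | 0.873 | 0.807 | 0.679 |
| Area under the Curve | 0.593 | 0.622 | 0.878 | 0.817 | 0.741 |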
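Identifying the best algorithm for a dataset amounts to taking the maximum over one row of results. The sketch below does this for the accuracy values reported above for the Amazon and IMDb datasets; the dictionary layout and the function names `best_algorithm` and `ranking` are illustrative choices.

```python
# Accuracy (%) per algorithm, as reported for the Amazon and IMDb datasets.
accuracy = {
    "Amazon": {"DT": 68.4, "KNN": 67.6, "NB": 77.8, "RF": 77.7, "SVM": 79.3},
    "IMDb": {"DT": 57.1, "KNN": 59.0, "NB": 80.0, "RF": 75.1, "SVM": 74.1},
}


def best_algorithm(scores):
    """Return the algorithm name with the highest score."""
    return max(scores, key=scores.get)


def ranking(scores):
    """Return algorithm names sorted from best to worst score."""
    return sorted(scores, key=scores.get, reverse=True)


for dataset, scores in accuracy.items():
    print(dataset, best_algorithm(scores), ranking(scores))
```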

The results for the Yelp dataset are shown in Table 5. RF gave the best results for the Yelp dataset, while SVM and NB were second and third respectively in terms of performance.

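Table 5. Experimentation Results of Yelp dataset

| Measure | DT | KNN | NB | RF | SVM |
|---|---|---|---|---|---|
| Relative absolute error (%) | 81.86 | 72.22 | 54.86 | 76.30 | 45.6 |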


The results for the combined dataset are shown in Table 6. NB narrowly outperformed RF and gave the best results for the combined dataset, while SVM was third in terms of performance. Figure 1 shows a graphical representation of the comparison results.

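Table 6. Experimentation Results of Combined dataset

| Measure | DT | KNN | NB | RF | SVM |
|---|---|---|---|---|---|
| Relative absolute error (%) | 72.73 | 78.55 | 51.54 | 70.91 | 42.73 |
| Precision-Recall Curve (PRC Area) | 0.691 | 0.631 | 0.874 | 0.871 | 0.786 |
| Area under the Curve | | | | | |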

Fig. 1. Comparison results of different machine learning algorithms. Panels: Accuracy, F-Measure and Area Under the Curve (scale 0 to 100) for DT, KNN, NB, RF and SVM on the Amazon, IMDb, Yelp and Combined datasets.


IV. CONCLUSIONS

Text classification, especially sentiment analysis, is currently one of the hottest domains and is catching the attention of many researchers worldwide. The main issue, however, is to find the best algorithm for performing sentiment analysis. This paper tries to answer this question by performing a comparative analysis of different supervised machine learning algorithms for sentiment analysis on the datasets under consideration. Our results showed that no algorithm is a clear-cut best across the different datasets. Out of the total four datasets, Naïve Bayes gave the best performance for two, while Random Forest and Support Vector Machine shared the honors for the remaining two datasets. Hence we can conclude that for performing sentiment analysis Naïve Bayes is the best choice, closely followed by Random Forest and Support Vector Machine. However, we considered only a handful of algorithms for comparison; with the introduction of new problem-specific hybrid and stacking techniques, there is a definite chance that these techniques can outperform Naïve Bayes.

REFERENCES

[1] Liu, Bing. "Sentiment analysis and opinion mining." Synthesis lectures on human language technologies 5, no. 1 (2012): 1-167. doi: 10.2200/S00416ED1V01Y201204HLT016

[2] Quinlan, J. Ross. "C4.5: programs for machine learning". Elsevier, 2014.

[3] Aha, David W., Dennis Kibler, and Marc K. Albert. "Instance-based learning algorithms." Machine learning 6, no. 1 (1991): 37-66. doi: 10.1007/BF00153759

[4] McCallum, Andrew, and Kamal Nigam. "A comparison of event models for naive bayes text classification." In AAAI-98 workshop on learning for text categorization, vol. 752, pp. 41-48. 1998.

[5] Breiman, Leo. "Random forests." Machine learning 45, no. 1 (2001): 5-32. doi: 10.1023/A:1010933404324

[6] Schölkopf, Bernhard, Christopher JC Burges, and Alexander J. Smola, eds. Advances in kernel methods: support vector learning. MIT press, 1999.

[7] Witten, Ian H., Eibe Frank, Mark A. Hall, and Christopher J. Pal. "Data Mining: Practical machine learning tools and techniques". Morgan Kaufmann, 2016.

[8] McAuley, Julian, and Jure Leskovec. "Hidden factors and hidden topics: understanding rating dimensions with review text." In Proceedings of the 7th ACM conference on Recommender systems, pp. 165-172. ACM, 2013. doi: 10.1145/2507157.2507163

[9] Maas, Andrew L., Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. "Learning word vectors for sentiment analysis." In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies-Volume 1, pp. 142-150. Association for Computational Linguistics, 2011.