Twitter Data Classification by Machine Learning for Sentiment Analysis

(1)

Twitter Data Classification by Machine Learning

for Sentiment Analysis

Pragya Mishra 1* and Rajesh kumar Nigam 2 1

M. Tech Scholar, Technocrats Institute of Technology Bhopal, Madhya Pradesh, India. 2

Associate Professor, Technocrats Institute of Technology Bhopal, Madhya Pradesh, India.

Date of publication (dd/mm/yyyy): 08/04/2021

Abstract – In this paper is proposed method that performs classification of tweet data sentiment on Twitter. Twitter is an online social networking website that contains a rich amount of sentiment data that can be structured, semi-structured and unstructured data that can help to analyze behavior of guy. Data classification and analyzing is big task. Here is used Naive Bayes classifier to classify data and used machine learning to analysis human behavior. Tweet analysis is work two basic mode : (A) sentiment expressed by a phrase in the context of a tweet, and (B) overall sentiment of a tweet. It is implemented with the help of python Ecosystem to improve its scalability and efficiency. In this sentiments analysis process also refer the NLP (Natural language processing). It is internal action process between human and computer, it analyzes the treasure of natural language data here we get idea how to mine twitter data to get sentiment. Here we used deep neural network and text mining algorithm to extract all text data to analysis accurate opinion of all user and increase accuracy.

Keywords – Sentiment Analysis, NLP (Natural Language Processing), Twitter data, Naive Baise Classifier, SVM, Textblob.

I. I

NTRODUCTION

We live in a society where the textual data on the Internet is growing at a rapid pace and many companies are trying to use this deluge of data to extract people’s views towards their products. Online social network platforms, with their large-scale repositories of user-generated content, can provide unique opportunities to gain insights into the emotional “pulse of the nation”, and indeed the global community. A great source of unstructured text information is included in social networks, where it's unfeasible to manually analyze such amounts of data. There is an outsized number of social network websites that enable users to contribute, modify, and grade the content, also on expressing their personal opinions about specific topics. Some examples include blogs, forums, product reviews sites, and social networks, like Twitter (http://twitter.com/). Twitter (San Francisco, CA, USA) is a microblogging site that offers the opportunity for the analysis of expressed mood, and previous studies have shown that geographical, diurnal, weekly, and seasonal patterns of positive and negative affects can be observed. Microblogging and more particularly Twitter is employed for the subsequent reasons:

 Microblogging platforms are used by different people to express their opinion about different topics, thus it is a valuable source of people’s opinions.

 Twitter contains an enormous number of text posts and it grows every day. The collected corpus can be arbitrarily large.

 Twitter’s audience varies from regular users to celebrities, company representatives, politicians, and even country presidents. Therefore, it’s possible together text posts of users from different social and interest groups.

(2)

 Twitter’s audience is represented by users from many countries as the audience of microblogging platforms and services grows a day, data from these sources are often utilized in opinion mining and sentiment analysis tasks. For example, manufacturing companies could also be curious about the subsequent questions: millions of people are using social network sites like Facebook, Twitter, Google Plus, etc. to express their emotions, opinion, and share views about their daily lives.

Through the web communities, we get interactive media where consumers inform and influence others through forums. Social media is generating an outsized volume of sentiment rich data within the sort of tweets, status updates, blog posts, comments, reviews, etc. Moreover, social media provides an opportunity for businesses by giving a platform to connect with their customers for advertising. People mostly depend on user-generated content online to an excellent extent for deciding. For e.g. if someone wants to shop for a product or wants to use any service, then they firstly search its reviews online, discuss it on social media before taking a choice, the amount of content generated by users is just too vast for a traditional user to research, so there’s a requirement to automate this, various sentiment analysis techniques are widely used. Sentiment analysis (SA) tells the user whether the knowledge about the merchandise is satisfactory or not before they pip out, Marketers and firms use this analysis data to know about their products or services in such how that it is often offered as per the user's requirements. Textual Information retrieval techniques mainly specialize in processing, searching, or analyzing the factual data present. Facts have an objective component but, there is another textual content that expresses subjective characteristics. These contents are mainly opinions, sentiments, appraisals, attitudes, and emotions, which form the core of Sentiment Analysis (SA). It offers many challenging opportunities to develop new applications, mainly thanks to the large growth of obtainable information on online sources like blogs and social networks. For example, recommendations of things proposed by a recommendation system are often predicted by taking under consideration considerations like positive, or negative opinions about those items by making use of SA. Comments, reviews and opinions of the people play a crucial role to work out whether a given population is satisfied with the merchandise, services. It helps in predicting the sentiment of a good sort of people on a specific event of interest just like the review of a movie, their opinion on various topics roaming around the world. These data are essential for sentiment analysis, to get the general sentiment of population, retrieval of knowledge from sources like Twitter, Facebook, Blogs are essential. For the sentiment analysis, we focus our attention towards the Twitter, a micro-blogging social networking website. Twitter generates huge data that can't be handled manually to extract some useful information and thus, the ingredients of automatic classification is required to handle those data. Tweets are unambiguous short text messages that are up to a maximum of 140 characters. By the use of Twitter, millions of people around the world to be connected with their family, friends, and colleagues through their computers or mobile phones. The Twitter interface allows the user to post short messages, and that can be read by any other Twitter user. Twitter contained a variety of text posts and grows every day. We choose Twitter because the source for opinion mining just because of its popularity and data processing. The Existing Database is not able to process the big amount of data within a specified amount of time. Also, this sort of knowledge base is restricted for the processing of structured data and features a limitation when handling an outsized amount of data. So, the traditional solution cannot help an organization to manage and process unstructured data. The use of Big Data technologies like Hadoop is the best way to solve Big Data challenges. Sentiment Analysis is the process of ‘computationally’ determining whether a bit of writing is positive, negative, or neutral. It’s also referred to as opinion mining, deriving the opinion or

(3)

attitude of a speaker. Sentiment analysis may be a term that you simply must have heard if you've got been within the tech field long enough. It is a method of predicting whether a bit of data (i.e. text, most commonly) indicates a positive, negative, or neutral sentiment on the subject in this article, we will go through making a Python program that analyzes the sentiment of tweets on a particular topic. The user will be able to input a keyword and get the sentiment on it based on the latest 100 tweets that contain the input keyword.

II. M

ETHODOLOGY

Sentiments analysis is the invented science of psychology and sociology and both are the scientific study of people emotions, relationships, opinions, and behaviors (wiki). Psychologists apply sentiments process through the hypothesis but data scientist applies through the data. In other words, it is the computational process which identifies and categories the opinions, thoughts and ideas through the text data. The sentiments analysis process also refer the NLP (Natural language processing). It is internal action process between human and computer. It also analyzes the treasure of natural language data. Sentiments analysis is expressed in two different categories: polarity and subjectivity. The polarity measure the text data is positive (>0) or negative (<0> (or neutral (0). Classifying is a sentence as subjective or objective, known as subjectivity classification (monkeylearn.com). Subjectivity measures from (0.0 to 1.0). Where 0.0 is very objective and 1.0 is very subjective. But In this thesis we calculate only the sentiments polarity from twitter data (tweets data is in CSV format). Polarity showed three different colors positive for green color, negative in red color and neutral in blue color. Polarity calculated through the python code using library of Textblob and python module Natural Language Tool Kit (NLTK).

When we go for sentiment analysis there are many option and tools. The most popular tools are MATLAB, Python, and Java and C # and due to huge no of libraries available in python and easiest in code so mostly researcher used python because it is sensible and suitable choice. The sentiments analysis algorithm consists of 4 modules. The procedure in each model starts with importing data with pandas, since the powerfulness of pandas for processes and data preprocessing. Then used NLTK and Textblob for analyzing the text of CSV file and calculate the polarity of each text separately and output is a numeric format (-1 to +1). In this research, first collected the tweets from Twitter with given keyword and then analyze the whole text and gave the result, then Matplotlib plotting the result on the pie chat and with different colors and different formats positive, negative and neutral (greater than zero, less than zero and equal to zero). This program only those texts analyze when the required keyword is founded.

(4)

2.1. Textblob Library

Textblob is the python library which process the textual data. Textblob provide API to access its methods and easily perform NLP task. The main reason behind the usage of Textblob is it’s like a python string easy to use without worrying the syntax. Textblob consist on different function like part of speech, noun phrase, sentiment analysis, tokenization, word inflection and lemmatization, wordlist, spelling correction, translation or language detection and N-gram. The textblob work and play with all kind of texts. Textblob support all kind of text formats. This also important module of python which use for sentiment analyzing and also classifying the data which part is positive of which part is negative (Steven loria, 2018). This is the main part of research and textblob is the key library for sentiments analyzing.

2.2. NLTK (Natural Language ToolKit)

Library Natural language toolkit it is also called NLTK. It is a suitcase of libraries such as symbolic and statistical natural language process which support Python English written programs. This toolkit have different classifications like sentiments, metrics, parse, tags, tokenization, chat, chunk, classify, translate, twitter, interface, draw, cluster and etc. NLTK includes graphical demonstrations and sample data. It is accompanied by a book that explains the underlying concepts behind the language processing tasks supported by the toolkit, plus a cookbook (Bird, Edward, el, 2009). NLTK used the Python platform for building the programs of natural language text (human language) for using statistical natural language processing. NLTK is open source librar y for python which used on any platform, such as windows, Mac, Linux and many other platforms. In our thesis, we used Textblob library for sentiment analysis which import the NLTK module and sentiment analyzer is the subclass of NLTK module: Polarity results of input keyword NLTK (Natural Language ToolKit) Library Natural language toolkit it is also called NLTK. It is a suitcase of libraries such as symbolic and statistical natural language process which support Python English written programs. This toolkit have different classifications like sentiments, metrics, parse, tags, tokenization, chat, chunk, classify, translate, twitter, interface, draw, cluster and etc. NLTK includes graphical demonstrations and sample data. It is accompanied by a book that explains the underlying concepts behind the language processing tasks supported by the toolkit, plus a cookbook (Bird, Edward, el, 2009). NLTK used the Python platform for building the programs of natural language text (human language) for using statistical natural language processing. NLTK is open source library for python which used on any platform, such as windows, Mac, Linux and many other platforms. In our thesis, we used Textblob library for sentiment analysis which import the NLTK module and sentiment analyzer is the subclass of NLTK module.

2.3. Sentiments Analyzer

A sentiment analyzer is a tool to implement and facilitate sentiment analysis task using NLTK features and classifiers, especially for teaching and demonstrative purposes. A sentiment analysis tool based on machine learning approaches.

2.4. Algorithm

Section A: Preparing the Test Set.

(5)

of text processing. This is definitely correct. In fact, both our Test and Training data will merely comprise of text. I chose to start with the Test set in order to get you all warmed up for the Training set extraction part, as it will rely more on the API. Here is a bit of an overview of what we are about to do:

1. Register Twitter application to get our own credentials.

2. Authenticate our Python script with the API using the credentials.

3. Create function to download tweets based on a search keyword.

Registering an application with Twitter is critical, as it is the only way to get authentication credentials. As soon as we get our credentials, we will start writing code. Step 3 is where the Test set lies. We will be downloading tweets based on the term that we are trying to analyze the sentiment on.

Step A.1: Getting the authentication credentials.

Step A.2: Authenticating our Python script.

Step A.3: Creating the function to build the Test set.

Section B: Preparing the Training Set.

Section C: Pre-processing Tweets in the Data Sets.

Section D: DNN Classifier.

Step D.1: Building the vocabulary.

Step D.2: Matching tweets against our vocabulary.

Step D.3: Building our feature vector.

Step D.4: Training the classifier.

Section E: Testing the Model.

III. S

IMULATION

R

ESULT

In this section, we used python to implement sentimental analysis. Some packages have utilized including tweepy and textblob. We can install the required libraries by following commands:

 pip install tweepy.  pip install textblob.

The second step is downloading the dictionary by running the following command: python -m textblob. download_corpora. The textblob is a python library for text processing and it uses NLTK for natural language processing. Corpora is a large and structured set of texts which we need for analyzing tweets.

To connect to Twitter and query latest tweets, we need to create an account on twitter and define an application. Users need to go to the apps.twitter.com/app/new and generate the api keys. The Application settings is shown in the figure.

(6)

Fig. 2. Twitter application management.

Following shows the sample output of the program for the ‘Corona Virus’ as a query based on the last 100 tweets from Twitter.

Fig. 3. Sentiment analysis result.

Figure 2 shows the sentimental analysis results based on Corona Virus. As it can be clearly seen in the figure3 the percentage of the neutral tweets are significantly high. This is also important to mention that depends on the data of the experiment we may get different results as people’s opinion may change depends on the world circumstances for example Corona Virus as it becomes the world of the year in 2020. For some queries, the neutral tweets are more than 60% which clearly shows the limitation of the current works.

(7)

Fig. 4. Pie chart.

The pie chart, as shown in figure 4, illustrates the data based on the results we got form this step. If we run the program in different times we may get different results, small variance, based on the tweets we fetch. We run the program three times and these results are the average of the outputs.

Table 5.1. Comparison based on classifier.

Classifier Accuracy TextBlob 28.4 Vader 31.5 Logistic regression 40.9 SVM 41.4 FastText 40.6 Proposed 82.3

Textblob + NLTK + Machine Learning

IV. C

ONCLUSION

Twitter sentiment analysis comes under the category of text and opinion mining. It focuses on analyzing the sentiments of the tweets and feeding the data to a machine learning model to train it and then check its accuracy, so that we can use this model for future use according to the results. The view of people can be positive or negative. An adjective plays a crucial role in identifying sentiment from part of data. It comprises of steps like data collection, text preprocessing, sentiment detection, sentiment classification, training and testing the model. This research topic has evolved during the last decade with models reaching the efficiency of almost 90%-95%. But it still lacks the dimension of diversity in the data. Along with this it has a lot of application issues with the slang used and the short forms of words. Many analyzers don’t perform well when the number of classes is increased. Also, it’s still not tested that how accurate the model will be for topics other than the one in consideration. Twitter is large source of data which make it more attractive for performing sentiment analysis. We perform analysis on around 100 tweets so that we analyze the result. It can used for any purpose based tweets we collect with the help of keyword. Hence sentiment analysis has a very bright scope of development in future.

(8)

R

EFERENCES

[1] T. Wilson, J. Wiebe and P. Hoffmann, “Recognizing contextual polarity in phrase-level sentiment analysis,” in Proceedings of HLT and EMNLP. ACL, (2005), pp. 347–354.

[2] C.C. Tao, S.K. Kim, Y.A. Lin, Y.Y. Yu, G. Bradski, A.Y. Ng and Kunle Olukotun, “Map-reduce for machine learning on multicore”, In NIPS, vol. 6, (2006), pp. 281-288.

[3] L. Jimmy, and A. Kolcz, “Large-scale machine learning at twitter”, In Proceedings of the 2012 ACM SIGMOD International conference on management of data, ACM, (2012), pp. 793-804.

[4] B. Jiang, U. Topaloglu and F. Yu, “Towards large-scale twitter mining for drug-related adverse events”, In Proceedings of the 2012 international workshop on Smart health and wellbeing, ACM, (2012), pp.25-32.

[5] L. Bingwei, E. Blasch, Y. Chen, D. Shen and G. Chen, “Scalable sentiment classification for big data analysis using Naive Bayes Classifier”, In Big Data, 2013 IEEE International Conference on, IEEE, (2013), pp. 99-104.

[6] A. Cuesta, David F. and María D. R-Moreno, “A framework for massive Twitter data extraction and analysis”, In Malaysian Journal of Computer Science, (2014), pp. 50-67.

[7] S. Michal and A. Romanowski, “Sentiment analysis of Twitter data within big data distributed environment for stock prediction”, In Computer Science and Information Systems (FedCSIS), 2015 Federated Conference on, IEEE, (2015), pp. 1349-1354.

[8] T. Mohit, I. Gohokar, J. Sable, D. Paratwar and R. Wajgi, “Multi-Class Tweet categorization using map reduce paradigm”, In International Journal of Computer Trends and Technology. (2014), pp. 78-81.

[9] D. Jeffrey and S. Ghemawat, “MapReduce: simplified data processing on large clusters”, Communications of the ACM 51.1, (2008), pp. 107-113.

[10] B. Yingyi, “HaLoop: Efficient iterative data processing on large clusters”, Proceedings of the VLDB Endowment 3.1 -2, (2010), pp. 285-296.

[11] T. Maite, “Lexicon-based methods for sentiment analysis”, Computational linguistics 37.2, (2011), pp.267-307.

[12] R. Tushar and S. Srivastava, “Analyzing stock market movements using twitter sentiment analysis”, Proceedings of the 2012 International conference on advances in social networks analysis and mining (ASONAM 2012). IEEE Computer Society, (2012). [13] D. Pessemier and Martens “Movie Tweetings: a movie rating dataset collected from Twitter”, Ghent University, Ghent, Belgium,

(2013).

[14] Twitter. Twitter Search API, available at https://dev.twitter.com/rest/public/search.

[15] V.D. Katkar, S.V. Kulkarni, “A Novel Parallel implementation of Naive Bayesian classifier for Big Data”, International Conference on Green Computing, Communication and Conservation of Energy, 978-1-4673-6126-2/2013 IEEE, pp. 847-852.

[16] S. Kumar, F. Morstatter and H. Liu, “Twitter data analytics”, Springer Science & Business Media, (2013). [17] B. Vishal, “Data Mining in Dynamic Social Networks and Fuzzy Systems”, IGI Global, (2013).

[18] G. Elmer, G. Langlois and J. Redden, “Compromised data: From social media to big data”, Bloomsbury Publishing USA, (2015). [19] Nalini K. and L.J. Sheela, “Classification of Tweets using Text classifier to detect cyber bullying”, In Emerging ICT for Bridging the

Future-Proceedings of the 49th annual convention of the computer society of India CSI, Springer International Publishing, vol. 2, (2015), pp. 637-645.

[20] Jaba S. L. and Dr V. Shanthi, “An approach for discretization and feature selection of Continuous-valued attributes in Medical Images for Classification Learning”, International Journal of Computer Theory and Engineering, vol. 1, no. 2, pp. 154.

[21] T. White, “Hadoop: The Definitive Guide”, Third Edition, O’Reilley, (2012). [22] L. George,“HBase: The Definitive Guide”, O’Reilley, (2011).

[23] E. Hewitt, “Cassandra: The Definitive Guide”, O’Reilley, (2010). [24] A. Gates, “Programming Pig”, O’Reilley, (2011).

[25] A. Pak and P. Paroubek. “Twitter as corpus for sentiment analysis and opinion mining”. In Proceedings of the seventh conference on International Language Resources and Evaluation, 2010, pp.1320-1326

[26] R. Parikh and M. Movassate, “Sentiment analysis of user-generated Twitter updates using various Classi_cation Techniques”, CS224N Final Report, 2009

[27] Go, R. Bhayani, L. Huang. “Twitter Sentiment Classification using distant supervision”. Stanford University, Technical Paper, 2009. [28] L. Barbosa, J. Feng. “Robust sentiment detection on Twitter from biased and noisy data”. Coling 2010: Poster Volume, pp. 36-44. [29] Bifet and E. Frank, "Sentiment Knowledge Discovery inTwitter Streaming Data”, In Proceedings of the 13th International conference

on discovery science, Berlin, Germany: Springer, 2010, pp. 1-15.

[30] Agarwal, B. Xie, I. Vovsha, O. Rambow, R. Passonneau, “Sentiment analysis of Twitter data”, In proceedings of the ACL 2011 Workshop on Languages in Social Media, 2011 , pp. 30-38.

[31] Dmitry Davidov, Ari Rappoport. “Enhanced sentiment learning using Twitter hashtags and smileys”. Coling 2010: Poster Volume pages 241{249, Beijing, August 2010.

[32] Po-Wei Liang, Bi-Ru Dai, “Opinion mining on social media data”, IEEE 14th International conference on mobile data management, Milan, Italy, June 3 - 6, 2013, pp 91-96, ISBN: 978-1-494673-6068-5, http://doi.ieeecomputersociety.org/10.1109/MDM.2013. [33] Pablo Gamallo, Marcos Garcia, “Citius: a naive-bayes strategy for sentiment analysis on English Tweets”, 8th International workshop

on semantic evaluation (SemEval 2014), Dublin, Ireland, Aug 23-24 2014, pp 171-175.

[34] Neethu M,S and Rajashree R,” Sentiment Analysis in Twitter using machine learning techniques” 4th ICCCNT 2013, at Tiruchengode, India. IEEE – 31661.

[35] P.D. Turney, “Thumbs up or thumbs down?: semantic orientation applied to unsupervised classification of reviews,” in Proceedings of the 40th annual meeting on association for computational linguistics, pp. 417–424, Association for Computational Linguistics, 2002. [36] J. Kamps, M. Marx, R. J. Mokken, and M. De Rijke, “Using wordnet to measure semantic orientations of adjectives,” 2004.

[37] R. Xia, C. Zong, and S. Li, “Ensemble of feature sets and classification algorithms for sentiment classification,” Informatio n Sciences: an International Journal, vol. 181, no. 6, pp. 1138–1152, 2011.

A

UTHOR

’

S

P

ROFILE First Author

Pragya Mishra, M. Tech Scholar, Technocrats Institute of Technology Bhopal, Madhya Pradesh, India. Second Author