Twitter Data Classification using Hidden Markov Model


Abstract: Twitter data commonly contains customers' reviews, feedback and sentiments regarding a product or service. However, customers are often confused and overwhelmed by the volume of raw data generated on Twitter. A promising solution is to analyze the texts and their labels (positive or negative customer reviews) using a text classifier. In this paper, a text classifier constructed using a Hidden Markov Model is utilized. The effectiveness of the classifier is demonstrated using Twitter reviews of a product. In the case example, the proposed model performed well in terms of classification accuracy and F-measure. The outcomes of this study can be used to develop more sophisticated statistical models as well as to compare other text classifiers on other Twitter text data.

Index Terms: Data Analytics, Hidden Markov Model, Text Classification, Twitter Data

I. INTRODUCTION

Information retrieval from plain text has received considerable attention because of its wide range of applications. With technological advancement, text data is now growing rapidly through social media. Social media generates millions of data items every day, which are key to many business decisions. Many companies now launch their products based on sentiment analysis and on customer reviews and feedback, and these decisions rely on text analysis. Online shops now commonly combine a recommendation system with a product review system, and this recommendation system works through text analysis [1]-[4]. The Hidden Markov Model (HMM) is a sequential probabilistic model which can be very effective in text classification and analysis [5]. In [6], Barros et al. found that HMM compensated for the low performance of weaker classifiers, and features were selected to implement the text classifier.

In [7], HMM was applied to speech recognition, which was examined in depth. In [8], an HMM framework was developed to retrieve information using multiple word generation mechanisms.

II. OVERVIEW OF TEXT CLASSIFICATION

A. Sentiment Analysis

People have long shared their product reviews and opinions through social media, generating a huge volume of structured and unstructured data which can be used for data analysis. Based on such reviews, companies are now developing their business strategies and production lines. On Twitter, people generate approximately three hundred to five hundred tweets every day, and these tweets contain polarities for a specific subject [9]. Internet forums are another place for such discussions.

Sentiment analysis is a dedicated tool for understanding and analyzing such language. As forums are an important place for worldwide discussion, it is also important to understand forum language.

Chatbots are now used in forum discussions, and sentiment analysis makes it easier to maintain a healthy communication environment [10].


¹Dr Kazi Arif-Uz-Zaman, ²Md. Fantacher Islam

Department of Industrial Engineering and Management (IEM), Khulna University of Engineering & Technology (KUET), Khulna-9203, Khulna, Bangladesh


B. Email Filtering

Every day, 2-3 billion emails are sent around the globe. Such emails contain a lot of valuable, confidential information as well as irrelevant and unwanted messages. People sometimes receive many unwanted emails, some of which contain sexual or illegal drug-related content. It is therefore important to manage this volume of messages, as approximately 2 million emails are transmitted every minute. Irrelevant email should not be delivered to the inbox because it contains unnecessary content, so the mailing system needs filtering to prevent it. In this regard, the text classification approach has proven effective in preventing these messages from reaching the inbox [11], and it takes less time and costs less. Through text classification, unwanted email is categorized as spam and then directed to the spam folder [12].

C. Supplier Selection

We are now in the era of “Big Data”, and data can be found in structured, semi-structured and unstructured formats. Data is usually generated from three main sources: people, machines (sensors) and organizations. Among them, most unstructured data is generated by people through social media, and it also contains the most valuable information. Some companies need to import raw materials from global suppliers. Without prior supplier selection criteria, companies are unable to turn such data into valuable information for decision making. A text mining approach might be an effective way to overcome such difficulties by analyzing supplier criteria and variables through big data analytics [13]. Based on tweets about supplier status, it is now possible to select potential suppliers using text classification.

III. HMM CLASSIFIER

A. Basics of HMM

HMM is a probabilistic model for sequential data with two kinds of variables: hidden and observed. It is very useful when the observed variable is known but the hidden variable is unknown and follows a probability distribution. HMM has performed effectively in many applications and has shown positive results in speech recognition, pattern identification and weather prediction. In text classification, the observed variables are taken from the word frequency matrix, while the hidden variables are the labels of the texts (positive and negative tweets).

B. Document representation

To label a new text, the HMM classifier must first be trained on existing text data, known as the training data set. For the probability calculation, each text document is represented as a row in a matrix of words, as shown in Table I.

Table I Word matrix

Document  Word 1  Word 2  Word 3  Word 4  Word 5
d1        v1      v2      v3      v5      None
d2        v5      v6      v7      None    None
d3        v8      v3      None    None    None
d4        v9      v10     v11     v1      v12
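The layout of Table I can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in the study: the function name `build_word_matrix` is hypothetical, and it assumes each document has already been tokenised into an ordered word list.

```python
from typing import List, Optional


def build_word_matrix(documents: List[List[str]]) -> List[List[Optional[str]]]:
    """Pad every tokenised document with None so all rows have equal length."""
    width = max((len(doc) for doc in documents), default=0)
    return [doc + [None] * (width - len(doc)) for doc in documents]


# The four documents of Table I, written as ordered word lists.
docs = [
    ["v1", "v2", "v3", "v5"],
    ["v5", "v6", "v7"],
    ["v8", "v3"],
    ["v9", "v10", "v11", "v1", "v12"],
]
matrix = build_word_matrix(docs)
for name, row in zip(["d1", "d2", "d3", "d4"], matrix):
    print(name, row)
```

The longest document sets the number of columns, which is why d4 has no padding while d3 has three None entries.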

Pre-processing, also called text cleaning, is then required to remove stop words, punctuation and unnecessary symbols from the text document. This step cleans the text and prepares the document for the next step. Stemming is also required to clean the text and to reduce redundancy.
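A minimal sketch of such a cleaning step is given here, using only the Python standard library. The stop-word list, the suffix-stripping rule in `naive_stem` and the removal of URLs and @mentions are all assumptions made for illustration; the tools actually used in the study are not stated, and a real pipeline would normally rely on an established stop-word list and stemmer.

```python
import re
import string

# Illustrative stop-word list (an assumption, far shorter than a real one).
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "and",
              "in", "on", "it", "this", "i", "my"}


def naive_stem(word: str) -> str:
    """Strip a few common English suffixes; a crude stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def clean_tweet(text: str) -> list:
    """Lower-case, drop URLs/@mentions/punctuation/stop words, then stem."""
    text = text.lower()
    text = re.sub(r"https?://\S+|@\w+", " ", text)  # URLs and @mentions
    text = text.translate(str.maketrans("", "", string.punctuation))
    return [naive_stem(tok) for tok in text.split() if tok not in STOP_WORDS]


print(clean_tweet("Loving the new phone, it is charging so fast!"))
# ['lov', 'new', 'phone', 'charg', 'so', 'fast']
```

The output shows how stemming merges word forms (for example "charging" becomes "charg"), which reduces the vocabulary size before the probability matrix is built.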

C. Probability calculation and classifying new documents

For a given training dataset containing c classes and D documents, the output HMM probability matrix B can be calculated as (1):

$$
B_j(v_l) = f \cdot \frac{\sum_{d \in D_c} M_d(v_l, j)}{\sum_{d \in D_c} H_d(j)} + (1 - f) \cdot \frac{\sum_{d \in D_c} G_d(v_l)}{\sum_{i=0}^{|V|} \sum_{d \in D_c} G_d(v_i)} \tag{1}
$$

where B_j(v_l) is the probability of the word v_l being emitted at rank j; f ∈ [0,1] weights the two terms; M_d(v_l, j) is 1 if the word v_l is found at rank position j of document d and 0 otherwise; H_d(j) is 1 if document d has any word at rank j and 0 otherwise; G_d(v_l) is 1 if v_l appears at least once in document d and 0 otherwise; D_c is the set of training documents of class c; and |V| is the total number of feature words.
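Equation (1) can be sketched in Python as follows. This is an illustration under stated assumptions, not the code used in the study: the function name `emission_matrix` is hypothetical, ranks are zero-based, each document is an ordered word list for a single class, and the default f = 0.5 is an arbitrary choice.

```python
from collections import Counter
from typing import Dict, List


def emission_matrix(docs: List[List[str]], vocab: List[str],
                    n_ranks: int, f: float = 0.5) -> List[Dict[str, float]]:
    """Return B as a list over ranks j; B[j][word] is B_j(word) for one class."""
    # G term: number of documents in which each word appears at least once.
    doc_freq = Counter(w for doc in docs for w in set(doc))
    total_doc_freq = sum(doc_freq[w] for w in vocab)
    matrix = []
    for j in range(n_ranks):
        # M term: how often each word sits at rank j.
        at_rank = Counter(doc[j] for doc in docs if len(doc) > j)
        # H term: number of documents that have any word at rank j.
        docs_with_rank = sum(1 for doc in docs if len(doc) > j)
        row = {}
        for w in vocab:
            positional = at_rank[w] / docs_with_rank if docs_with_rank else 0.0
            background = doc_freq[w] / total_doc_freq if total_doc_freq else 0.0
            row[w] = f * positional + (1 - f) * background
        matrix.append(row)
    return matrix


# The documents of Table I, treated here as the training set of one class.
class_docs = [
    ["v1", "v2", "v3", "v5"],
    ["v5", "v6", "v7"],
    ["v8", "v3"],
    ["v9", "v10", "v11", "v1", "v12"],
]
vocab = sorted({w for doc in class_docs for w in doc})
B = emission_matrix(class_docs, vocab, n_ranks=5, f=0.5)
print(round(B[0]["v1"], 4))  # 0.1964
```

For v1 at the first rank, the positional term is 1/4 (one of four documents starts with v1) and the background term is 2/14 (v1 occurs in two documents out of fourteen word-document occurrences), giving 0.5 · 0.25 + 0.5 · 0.1429 ≈ 0.1964. Each row of B sums to one whenever at least one document reaches that rank.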

Afterward, the new document must be pre-processed for testing, which yields a new word list T_d. Finally, the trained HMM-based classifier calculates the probability for each label (i.e. positive or negative tweet) using (2), where B_h is the model trained on the documents of the respective label.

$$
P(T_d \mid B_h) = \prod_{j=1}^{\min(|T_d| - 1,\; n - 1)} B_j(T_{d_j}) \tag{2}
$$
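The classification rule in (2) can be sketched as follows. The names `likelihood` and `classify` and the toy probabilities are hypothetical. The sketch assumes zero-based ranks, so the product runs over the first min(|T_d|, n) words, where n is taken to be the number of ranks in the model; it also assumes that a word unseen at a rank contributes probability 0, which the source does not specify.

```python
from typing import Dict, List

Model = List[Dict[str, float]]  # Model[j][word] = B_j(word)


def likelihood(words: List[str], model: Model) -> float:
    """Multiply B_j(word) over the ranks shared by the document and the model."""
    prob = 1.0
    for j in range(min(len(words), len(model))):
        prob *= model[j].get(words[j], 0.0)  # unseen word -> 0 (assumption)
    return prob


def classify(words: List[str], models: Dict[str, Model]) -> str:
    """Return the label whose model gives the word list the highest likelihood."""
    return max(models, key=lambda label: likelihood(words, models[label]))


# Hand-made toy models with two ranks; the numbers are illustrative only.
models = {
    "positive": [{"great": 0.7, "bad": 0.3}, {"phone": 0.6, "battery": 0.4}],
    "negative": [{"great": 0.2, "bad": 0.8}, {"phone": 0.5, "battery": 0.5}],
}
print(classify(["great", "phone"], models))  # positive
print(classify(["bad", "battery"], models))  # negative
```

For long documents the product of many small probabilities can underflow, so a practical variant would sum log-probabilities instead; the plain product is kept here to mirror (2).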

IV. CASE SCENARIO

In this study, Twitter product reviews were chosen as the case dataset, in which people tweet their emotions about product brands, labeled as negative or positive. The dataset was collected from kaggle.com as a .csv file containing tweets along with their respective emotions. The tweets column contained 3548 tweets in text format. For the analysis, 3370 tweets were chosen as training data and the remaining 178 tweets were kept for testing. In the training phase, the tweet data needed to be pre-processed. For HMM model development, the dataset had to be formatted as model input, with the hidden and observed states set. For the training data, the number of hidden states found was 22 for the positive class and 21 for the negative class, whereas the number of observed states was 62282 for positive reviews and 11319 for negative reviews. Finally, the probability matrix was constructed from the two variables as shown previously.

V. RESULT & DISCUSSION

The HMM classifier achieves an accuracy of 84%, as shown in Table II along with the other performance metrics; the model was trained on 3370 text documents and tested on 178 text documents.

Table II Model performance metrics

Performance metric  Value (%)
Accuracy            84
Precision           82
Recall              84
F-measure           83
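For reference, the metrics in Table II follow the usual definitions, sketched below for a binary confusion matrix. The counts passed to `classification_metrics` are hypothetical and are not the study's confusion matrix, and the source does not state whether its precision and recall were averaged over the two classes. The helper `f_measure` only checks that the reported precision and recall are consistent with the reported F-measure.

```python
def f_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Accuracy, precision, recall and F-measure from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure(precision, recall),
    }


# Hypothetical counts, for illustration only.
print(classification_metrics(tp=40, fp=10, fn=10, tn=40))

# Consistency check on the reported values: precision 82 %, recall 84 %.
print(round(f_measure(82, 84)))  # 83
```

With precision 82% and recall 84%, the harmonic mean is 2 · 82 · 84 / 166 ≈ 82.99, which rounds to the 83% given in Table II.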

The confusion matrix in Fig. 1 shows that the True Positive column corresponds to negative emotion on brand, whereas the True Negative column represents positive emotion on brand. Negative emotion has more false predictions than positive emotion on brand because there are fewer observation states in the negative emotion text corpus.

Fig. 1 Confusion matrix for the Test data

The results would be more accurate if the outliers among the corpus words, shown in Fig. 2, could be handled. The figure shows the frequency of the words in the entire text corpus.

Fig. 2 Corpus Word Frequencies
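Word frequencies of the kind plotted in Fig. 2 can be obtained with a simple count over the cleaned corpus. The sketch below is an illustration only: the function name `word_frequencies` and the three toy documents are hypothetical, and the corpus is assumed to be a list of tokenised tweets.

```python
from collections import Counter
from typing import List


def word_frequencies(documents: List[List[str]]) -> Counter:
    """Count how often each word occurs across the whole corpus."""
    return Counter(word for doc in documents for word in doc)


# Toy corpus, for illustration only.
corpus = [
    ["great", "phone"],
    ["bad", "battery"],
    ["great", "battery", "great"],
]
freq = word_frequencies(corpus)
print(freq.most_common(2))  # [('great', 3), ('battery', 2)]
```

Sorting the counts in this way makes unusually frequent words easy to spot before deciding how to treat them.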

VI. CONCLUSION & FUTURE DIRECTION

This paper presents a text classification model in which an HMM classifier is used for Twitter emotion analysis. Observation state and hidden state variables were calculated from the Twitter text corpus, and the probability matrix was developed using equation (1) in the HMM classifier section. The results show that positive emotion on brand was predicted better than negative emotion because the negative class has fewer corpus words. The prediction of negative emotion on brand was dominated by some outliers, which are shown in Fig. 2 in the result and discussion section. The model was trained on just 3370 tweets, which seemed inadequate for good prediction, and the negative emotion data was much smaller than the positive data, which created some bias. Therefore, if the model can be trained on more data and the class bias can be reduced, HMM can be expected to be a strong classifier for sentiment analysis, as this research shows how HMM performs with a small corpus.

REFERENCES

[1] Dumais, S., et al., Inductive learning algorithms and representations for text categorization. 1998.

[2] Joachims, T. Text categorization with support vector machines: Learning with many relevant features. in European conference on machine learning. 1998. Springer.

[3] Mitra, V., C.-J. Wang, and S. Banerjee, Text classification: A least square support vector machine approach. Applied Soft Computing, 2007. 7(3): p. 908-914.

[4] Yang, Y. and X. Liu. A re-examination of text categorization methods. in SIGIR. 1999.

[5] Freitag, D. and A. McCallum. Information extraction with HMMs and shrinkage. in Proceedings of the AAAI-99 workshop on machine learning for information extraction. 1999. Orlando, Florida.

[6] Barros, F.A., et al., Combining text classifiers and hidden markov models for information extraction. International Journal on Artificial Intelligence Tools, 2009. 18(02): p. 311-329.

[7] Rabiner, L.R., A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 1989. 77(2): p. 257-286.

[8] Miller, D.R., T. Leek, and R.M. Schwartz. A hidden Markov model information retrieval system. in SIGIR. 1999.

[9] Westbergh, P., et al., Noise, distortion and dynamic range of single mode 1.3 µm InGaAs vertical cavity surface emitting lasers for radio-over-fibre links. IET optoelectronics, 2008. 2(2): p. 88-95.

[10] Abbasi, A., H. Chen, and A. Salem, Sentiment analysis in multiple languages: Feature selection for opinion classification in web forums. ACM Transactions on Information Systems (TOIS), 2008. 26(3): p. 12.

[11] Androutsopoulos, I., et al. An experimental comparison of naive Bayesian and keyword-based anti-spam filtering with personal e-mail messages. in Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval. 2000. ACM.

[12] Gómez Hidalgo, J.M., et al. Content based SMS spam filtering. in Proceedings of the 2006 ACM symposium on Document engineering. 2006. ACM.

[13] Su, C.-J. and Y.-A. Chen, Risk assessment for global supplier selection using text mining. Computers & Electrical Engineering, 2018. 68: p. 140-155.