Text Classification Performance Analysis on Machine Learning


R Ravi Kumar¹ M Babu Reddy² P Praveen³

¹Assistant Professor, Department of CSE, S R Engineering College, Warangal, Telangana, India.

[email protected]

²Assistant Professor, Department of Computer Science, Krishna University, Machilipatnam, A.P, India.

[email protected]

³Associate Professor, Department of CSE, S R Engineering College, Warangal, Telangana, India.

[email protected]

Abstract

Automated text classification is considered a vital method for managing and processing the vast, widespread and continuously growing number of documents in digital form. In general, text classification plays an important role in information extraction and summarization, text retrieval, and question answering.

Intrusion detection systems play an important role in network security. An intrusion detection model is a predictive model used to classify network data traffic as normal or intrusion. Machine learning algorithms are used to build accurate models for clustering, classification and prediction. In supervised classification, labeled text documents are used to train the classifier. This paper applies such classifiers to different kinds of labeled documents and measures their accuracy. An Artificial Neural Network (ANN) model using a Back Propagation Network (BPN) is used together with several other models to create an independent platform for the labeled, supervised text classification process. An existing benchmark approach is used to analyze classification performance on labeled documents. Experimental analysis on real data reveals which model works well in terms of classification accuracy.

1. INTRODUCTION

Text classification of labeled documents is becoming increasingly necessary because the number of documents on the World Wide Web (WWW) keeps growing. Text classification is the task of assigning a document to a predefined category [1]. In this research, several machine learning algorithms are applied to a number of datasets to determine the accuracy of the classifiers. The algorithms include Naïve Bayes techniques such as Multinomial Naïve Bayes and Bernoulli Naïve Bayes; linear classifier models including Logistic Regression and Stochastic Gradient Descent; Support Vector Machine models including Support Vector Classification (SVC) and Linear Support Vector Classification (Linear SVC); and an Artificial Neural Network model, the Back Propagation Network. The algorithms are applied to real experimental datasets such as the Reuters corpus, the Brown corpus and the Movie review corpus.

Automatic text classification has been an important application and research topic since the inception of digital documents. Today, text classification is a necessity because of the very large number of text documents that we have to deal with daily. In general, text classification includes topic-based text classification and genre-based text classification. Topic-based text categorization classifies documents according to their topics. Texts can also be written in many genres, for instance scientific articles, news reports, movie reviews, and advertisements. Genre is defined by the way a text was created, the way it was edited, the register of language it uses, and the kind of audience to whom it is addressed. Previous work on genre classification recognized that this task differs from topic-based categorization. Typically, most data for genre classification are collected from the web, through newsgroups, bulletin boards, and broadcast or printed news. They come from multiple sources and consequently have different formats, different preferred vocabularies and often significantly different writing styles, even for documents within one genre. In short, the data are heterogeneous.

In the present era, it is difficult to envision a world without the internet, and almost everyone depends on it. It has become an important medium in applications such as education and business, so the security of data communicated through the internet is necessary. A secure network is maintained by an Intrusion Detection System (IDS). The IDS observes the data traffic carefully and identifies it as normal or as an intrusion. Nowadays most applications depend on advanced network technologies, namely wireless networks, wireless sensor networks and Bluetooth. In wireless sensor networks, security mechanisms such as key-management protocols, authentication techniques and security protocols cannot be used because of resource constraints. An Intrusion Detection System is therefore a suitable security mechanism for wireless sensor networks.

1.1 Intrusion Detection System

An Intrusion Detection System (IDS)12 is a security mechanism used to monitor the network for abnormal behavior. The IDS identifies and reports whether user activity is normal or not. It compares the user's activities with already stored intrusion records to identify an intrusion. Supervised machine learning techniques can build accurate predictive models for large data sets, which is not possible with traditional methods. As specified by Tom Mitchell3, machine learning based intrusion detection falls under two categories: anomaly and misuse.

Because the IDS learns its patterns from the training data, the misuse based method is used. Misuse based detection can detect only known attacks; new attacks cannot be identified. An anomaly based IDS observes the normal behavior and treats any change in that behavior as an anomaly. An anomaly based IDS can therefore detect new attacks that were not learned from the training data.
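The contrast between the two detection styles can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and not the method of this paper: the signature set, the record fields (`protocol`, `port`, `bytes`) and the z-score rule with a threshold of 3 are all hypothetical.

```python
from statistics import mean, stdev

# Hypothetical signatures of known attacks: (protocol, destination port) pairs.
KNOWN_ATTACK_SIGNATURES = {("tcp", 23), ("udp", 1900)}


def misuse_detect(record):
    """Flag a record only if it matches a stored attack signature."""
    return (record["protocol"], record["port"]) in KNOWN_ATTACK_SIGNATURES


def fit_baseline(normal_byte_counts):
    """Summarise normal traffic by its mean and sample standard deviation."""
    return mean(normal_byte_counts), stdev(normal_byte_counts)


def anomaly_detect(record, baseline, z_threshold=3.0):
    """Flag a record whose size deviates strongly from the normal baseline."""
    mu, sigma = baseline
    return abs(record["bytes"] - mu) / sigma > z_threshold


# Made-up observations of normal traffic and two made-up test records.
baseline = fit_baseline([500, 520, 480, 510, 490])
known_attack = {"protocol": "tcp", "port": 23, "bytes": 505}
novel_attack = {"protocol": "tcp", "port": 443, "bytes": 9000}
```

In this toy setting the signature check catches `known_attack` but misses `novel_attack`, while the baseline check does the opposite, which mirrors the trade-off described above.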

To date, different machine learning techniques such as Artificial neural networks7, Support Vector Machine4 and Naive Bayes5, 6 based techniques have been proposed for intrusion detection. A hybrid detection technique that combines different techniques is proposed by8. The literature comparing supervised machine learning techniques for intrusion detection is limited. Hence this paper aims at understanding the implications of using supervised machine learning techniques for intrusion detection.

2. LITERATURE REVIEW

Several works on text classification have appeared in the recent past. Li et al. [2] proposed a text classification technique using positive and unlabeled data. The paper works with two sets of documents: one is labeled as positive and the other is unlabeled. The unlabeled set contains both positive and negative documents, and the task is to identify the positive and negative documents within it. The finding of the paper is that only one class needs to be labeled, i.e. the other class does not have to be labeled.

Aggarwal et al. [3] survey the various approaches to text categorization. The survey discusses various aspects of automatic text categorization and presents different types of techniques with their benefits and shortcomings for different applications. Tong et al. [4] used support vector machines for the text categorization problem and found many relevant features while applying the support vector model for classification. In our paper, by contrast, we used an extensive package of SVM classification techniques that is highly optimized for text classification. A voting process is used to ensure the best result in SVM classification.
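A voting process of this kind can be sketched as a simple majority vote over the labels predicted by several classifiers. The sketch below is only an illustration under assumptions: the classifier names and their outputs are hypothetical, and the paper does not state which voting rule was actually used.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the most common label and the share of voters that chose it.

    Ties are broken in favour of the label that was seen first.
    """
    counts = Counter(predictions)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(predictions)


# Hypothetical outputs of three SVM variants for a single document.
votes = {"SVC": "pos", "LinearSVC": "pos", "NuSVC": "neg"}
label, agreement = majority_vote(list(votes.values()))
```

Here two of the three hypothetical voters agree, so the document is labeled `pos` with an agreement of two thirds. The agreement ratio can serve as a rough confidence score for the voted label.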

Liu, Bing, et al. [5] discussed text classification by labeling words instead of documents. This is very helpful for labeling the documents themselves, but each document has to be rich in words relevant to its class. In our paper we used Named Entity Recognition (NER) from Natural Language Processing (NLP), which makes this easier for documents.

Hinrich Schütze [6] estimates the conditional probability of a particular word/term/token given a class as its relative frequency in the documents belonging to that class. Bernoulli Naïve Bayes is less suitable when we need to classify long documents, because it does not consider multiple occurrences of words. Tang, Bo, et al. [7] used Bayesian inference in automated text categorization. They used information gain (IG) and Maximum Discrimination (MD) for feature selection. In this paper we used the document frequency metric for feature selection and the Naïve Bayes rule to compare the results with previous work [1].
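Document frequency selection followed by a Naïve Bayes rule can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the implementation used in the paper: the toy corpus, the `min_df` cut-off and the add-one (Laplace) smoothing are all choices made here for illustration.

```python
import math
from collections import Counter, defaultdict


def select_by_document_frequency(docs, min_df=2):
    """Keep terms that occur in at least `min_df` documents."""
    df = Counter(term for doc in docs for term in set(doc))
    return {term for term, count in df.items() if count >= min_df}


def train_multinomial_nb(docs, labels, vocab):
    """Estimate log priors and add-one smoothed log P(term | class)."""
    term_counts = defaultdict(Counter)
    class_counts = Counter(labels)
    for doc, label in zip(docs, labels):
        term_counts[label].update(t for t in doc if t in vocab)
    model = {}
    for label, n_docs in class_counts.items():
        total = sum(term_counts[label].values())
        log_prior = math.log(n_docs / len(docs))
        log_likelihood = {
            t: math.log((term_counts[label][t] + 1) / (total + len(vocab)))
            for t in vocab
        }
        model[label] = (log_prior, log_likelihood)
    return model


def predict(model, doc):
    """Pick the class with the highest log posterior; unknown terms are skipped."""
    def score(label):
        log_prior, log_likelihood = model[label]
        return log_prior + sum(log_likelihood[t] for t in doc if t in log_likelihood)
    return max(model, key=score)


# Toy tokenised corpus, made up for illustration.
docs = [
    ["great", "movie", "great", "cast"],
    ["great", "plot", "fine", "cast"],
    ["boring", "movie", "weak", "plot"],
    ["boring", "weak", "script"],
]
labels = ["pos", "pos", "neg", "neg"]

vocab = select_by_document_frequency(docs, min_df=2)
model = train_multinomial_nb(docs, labels, vocab)
```

Because the multinomial model counts every occurrence of a term, the repeated word `great` in the first document contributes twice to the estimate for the `pos` class. A Bernoulli model would record only its presence.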

3. MACHINE LEARNING ALGORITHMS

After feature selection and transformation, the documents can easily be represented in a form that can be used by a machine learning (ML) algorithm. Many text classifiers have been proposed in the literature using machine learning techniques, probabilistic models, and so on. They often differ in the approach adopted: decision trees, naïve Bayes, rule induction, neural networks, nearest neighbors and, lately, support vector machines. Although many approaches have been proposed, automated text classification is still a major area of research, primarily because the effectiveness of current automated text classifiers is not perfect and still needs improvement.

Naive Bayes is often used in text classification applications and experiments because of its simplicity and effectiveness. However, its performance is often degraded because it does not model text well. Schneider addressed these problems and showed that they can be solved by a few simple corrections. Klopotek and Woch presented results of an empirical evaluation of a Bayesian multinet classifier based on a new method of learning very large tree-like Bayesian networks. The study indicates that tree-like Bayesian networks are able to handle a text classification task in one hundred thousand variables with sufficient speed and accuracy.

Support vector machines (SVM), when applied to text classification, provide excellent precision but poor recall. One way of customizing SVMs to improve recall is to adjust the threshold associated with an SVM. Shanahan and Roma described an automatic process for adjusting the thresholds of generic SVMs with better results. Johnson et al. described a fast decision tree construction algorithm that takes advantage of the sparsity of text data, and a rule simplification method that converts the decision tree into a logically equivalent rule set [9]. Lim proposed a method which improves the performance of kNN based text classification by using well estimated parameters. Some variants of the kNN method with different decision functions, k values, and feature sets were proposed and evaluated to find adequate parameters. The corner classification (CC) network is a kind of feed forward neural network for instant document classification. A training algorithm, named TextCC, is presented in [34].
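The effect of adjusting a decision threshold on precision and recall can be shown with a small sketch. The scores below are made-up signed distances from a separating hyperplane, and the two thresholds are chosen by hand. This does not reproduce the automatic procedure of Shanahan and Roma.

```python
def precision_recall(scores, labels, threshold):
    """Precision and recall when `score >= threshold` is predicted positive."""
    predicted = [s >= threshold for s in scores]
    tp = sum(p and y for p, y in zip(predicted, labels))
    fp = sum(p and not y for p, y in zip(predicted, labels))
    fn = sum((not p) and y for p, y in zip(predicted, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up decision scores with their true labels (True = relevant).
scores = [1.4, 0.9, 0.2, -0.1, -0.3, -0.8, -1.2, -1.5]
labels = [True, True, True, True, False, True, False, False]

default = precision_recall(scores, labels, threshold=0.0)
lowered = precision_recall(scores, labels, threshold=-0.5)
```

With the default threshold of zero, every predicted positive in this toy data is correct but two relevant documents are missed. Lowering the threshold recovers one of them at the cost of one false positive, so recall rises while precision falls.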

The level of difficulty of text classification tasks naturally varies. As the number of distinct classes increases, so does the difficulty, and therefore the size of the training set needed. In any multi-class text classification task, inevitably some classes will be more difficult than others to classify. Reasons for this may be: (1) very few positive training examples for the class, and/or (2) lack of good predictive features for that class.

When training a binary classifier per category in text categorization, we use all the documents in the training corpus that belong to that category as relevant training data and all the documents in the training corpus that belong to all the other categories as non-relevant training data. It is often the case that there is an overwhelming number of non-relevant training documents, especially when there is a large collection of categories with each assigned to a small number of documents, which is typically an "imbalanced data problem". This problem presents a particular challenge to classification algorithms, which can achieve high accuracy by simply classifying every example as negative. To overcome this problem, cost sensitive learning is needed [5].

A scalability analysis of a number of classifiers in text categorization is given. Vinciarelli presents categorization experiments performed over noisy texts. By noisy is meant any text obtained through an extraction process (affected by errors) from media other than digital texts (e.g. transcriptions of speech recordings extracted with a recognition system). The performance of the categorization system over the clean and noisy (word error rate between ~10 and ~50 percent) versions of the same documents is compared. The noisy texts are obtained through handwriting recognition and simulation of optical character recognition. The results show that the performance loss is acceptable. Other authors [36] also proposed to parallelize and distribute the process of text classification. With such a procedure, the performance of classifiers can be improved in both accuracy and time complexity.
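The imbalanced data problem discussed in this section can be made concrete with a short sketch. The 5-to-95 class split and the misclassification costs are assumptions chosen for illustration. The cost function is one simple way to express cost sensitivity, not a method taken from the cited work.

```python
def accuracy_and_recall(y_true, y_pred):
    """Accuracy over all examples and recall on the positive class."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    positives = sum(y_true)
    return correct / len(y_true), (tp / positives if positives else 0.0)


def weighted_error(y_true, y_pred, fn_cost=19.0, fp_cost=1.0):
    """Cost-sensitive error: a missed positive costs more than a false alarm.

    The default fn_cost equals the negative-to-positive ratio (95 / 5) of the
    toy data below; this choice is an assumption, not a rule from the paper.
    """
    cost = 0.0
    for t, p in zip(y_true, y_pred):
        if t and not p:
            cost += fn_cost
        elif p and not t:
            cost += fp_cost
    return cost


# Made-up one-vs-rest split: 5 relevant documents among 100.
y_true = [True] * 5 + [False] * 95
always_negative = [False] * 100

accuracy, recall = accuracy_and_recall(y_true, always_negative)
cost = weighted_error(y_true, always_negative)
```

The trivial classifier that rejects every document reaches high accuracy on this data while finding none of the relevant documents. Under the weighted error it is penalised for each missed positive, which is the behavior cost sensitive learning aims for.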

4. METHODOLOGY

The methodology used is shown in Fig. 1. In the pre-processing step, all categorical data in textual form are converted to numerical form. The pre-processed data is divided into training data and testing data. The models are built using Logistic Regression, Gaussian Naive Bayes, Support Vector Machine and Random Forest classifiers. These models are used to predict the labels of the test data. The actual labels and predicted labels are compared, and Accuracy, True Positive Rate (TPR) and False Positive Rate (FPR) are computed. Based on these parameters, the performance of the models is compared.
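The pre-processing and splitting described above can be sketched as follows. The record fields, the integer coding scheme and the 75/25 split are assumptions made for illustration, since the paper does not specify them.

```python
import random


def encode_categorical(rows, column):
    """Map each distinct string in `column` to an integer code (sorted order)."""
    values = sorted({row[column] for row in rows})
    mapping = {value: code for code, value in enumerate(values)}
    encoded = [{**row, column: mapping[row[column]]} for row in rows]
    return encoded, mapping


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle with a fixed seed, then hold out the last part as test data."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) - int(len(shuffled) * test_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical connection records; field names are illustrative only.
records = [
    {"protocol": "tcp", "bytes": 181, "label": "normal"},
    {"protocol": "udp", "bytes": 105, "label": "normal"},
    {"protocol": "icmp", "bytes": 1032, "label": "intrusion"},
    {"protocol": "tcp", "bytes": 0, "label": "intrusion"},
    {"protocol": "tcp", "bytes": 239, "label": "normal"},
    {"protocol": "udp", "bytes": 146, "label": "normal"},
    {"protocol": "icmp", "bytes": 520, "label": "intrusion"},
    {"protocol": "tcp", "bytes": 235, "label": "normal"},
]

encoded, protocol_codes = encode_categorical(records, "protocol")
encoded, label_codes = encode_categorical(encoded, "label")
train, test = train_test_split(encoded, test_fraction=0.25, seed=0)
```

Integer codes impose an artificial order on categories such as the protocol. A one-hot encoding is a common alternative when the classifier is sensitive to that order.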

The following steps are used to build the models.

1. Pre-process the data set.

2. Divide the data set into training data and testing data.

3. Build the classifier models on the training data:

• Logistic Regression

• Support Vector Machine

• Gaussian Naive Bayes

• Random Forest

4. Read the test data.

5. Test the classifier models on the test data.

6. Compute and compare TPR, FPR, Precision, Recall, F1-Score and Accuracy for all the models.

Fig. 1. Methodology.
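The measures named in step 6 all follow from the four confusion-matrix counts. The sketch below uses the standard definitions. The label names and the example predictions are made up and are not results from this study.

```python
def confusion_counts(y_true, y_pred, positive="intrusion"):
    """Count true/false positives and negatives for one positive class."""
    tp = fp = tn = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1
            else:
                fp += 1
        elif actual == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def safe_div(numerator, denominator):
    """Return 0.0 instead of failing when a denominator is zero."""
    return numerator / denominator if denominator else 0.0


def evaluate(y_true, y_pred, positive="intrusion"):
    """Compute accuracy, TPR, FPR, precision, recall and F1-score."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)  # recall is the same quantity as TPR
    return {
        "accuracy": safe_div(tp + tn, tp + fp + tn + fn),
        "tpr": recall,
        "fpr": safe_div(fp, fp + tn),
        "precision": precision,
        "recall": recall,
        "f1": safe_div(2 * precision * recall, precision + recall),
    }


# Made-up actual and predicted labels for eight test records.
y_true = ["intrusion"] * 3 + ["normal"] * 5
y_pred = ["intrusion", "intrusion", "normal", "intrusion",
          "normal", "normal", "normal", "normal"]
metrics = evaluate(y_true, y_pred)
```

Running `evaluate` once per model on the same test labels gives one dictionary per classifier, which can then be compared measure by measure.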

CONCLUSION

This research has the goal of finding the best classification method and selecting the best voted classifier, i.e. the classifier with the highest accuracy. It also shows that the performance of text document classification depends somewhat on how well the associated corpus is organized as well as on the classification method. The performance of the supervised machine learning classifiers Support Vector Machine, Random Forest, Logistic Regression and Gaussian Naive Bayes has also been compared for intrusion detection. The text classification problem is an Artificial Intelligence research topic, especially given the vast number of documents available in the form of web pages and other electronic texts such as emails and discussion forum postings. It has been observed that even for a specified classification method, the classification performance of classifiers trained on different text corpora differs, and in some cases such differences are quite substantial. This observation implies that a) classifier performance depends to some degree on its training corpus, and b) high quality training corpora may yield classifiers of good performance. Unfortunately, little research has so far appeared in the literature on how to exploit training text corpora to improve classifier performance.

REFERENCES

[1] Ikonomakis, M., S. Kotsiantis, and V. Tampakas. "Text classification using machine learning techniques." WSEAS transactions on computers 4.8 (2005): 966-974.

[2] Li, Xiaoli, and Bing Liu. "Learning to classify texts using positive and unlabeled data." IJCAI. Vol. 3. 2003.

[3] Aggarwal, Charu C., and ChengXiang Zhai. "A survey of text classification algorithms." Mining text data. Springer US, 2012. 163-222.

[4] Tong, Simon, and Daphne Koller. "Support vector machine active learning with applications to text classification." Journal of machine learning research 2.Nov (2001): 45-66.

[5] Sallauddin Mohmmad, Shabana, Ranganath Kanakam, "Provisioning Elasticity On IoT's Data In Shared-Nothing Nodes", International Journal of Pure and Applied Mathematics, Volume 117 No. 7, pp. 165-173, Oct. 2017.

[6] R. Ravi Kumar, M. Babu Reddy and P. Praveen, "A review of feature subset selection on unsupervised learning," 2017 Third International Conference on Advances in Electrical, Electronics, Information, Communication and Bio-Informatics (AEEICB), Chennai, 2017, pp. 163-167. doi: 10.1109/AEEICB.2017.7972404.

[7] Tang, Bo, et al. "A Bayesian classification approach using class-specific features for text categorization." IEEE Transactions on Knowledge and Data Engineering 28.6 (2016): 1602-1606.

[8] McCallum, Andrew, and Kamal Nigam. "A comparison of event models for naive bayes text classification." AAAI-98 workshop on learning for text categorization. Vol. 752. 1998.

[9] Mohammed Ali Shaik, P.Praveen, Dr.R.Vijaya Prakash, "Novel Classification Scheme for Multi Agents", Asian Journal of Computer Science and Technology, ISSN: 2249-0701 Vol.8 No.S3, 2019, pp. 54-58.

[10] M. Sheshikala, D. Rajeswara Rao and R. Vijaya Prakash, "Computation Analysis for Finding Co-Location Patterns using Map-Reduce Framework", Indian Journal of Science and Technology, Vol 10(8), DOI: 10.17485/ijst/2017/v10i8/106709, February 2017.