Impact of feature selection techniques in Text Classification: An Experimental study

(1)

Advances in Engineering, Management and Sciences J. Mech. Cont.& Math. Sci., Special Issue, No.-3, September (2019) pp 39-51

Copyright reserved © J. Mech. Cont.& Math. Sci.

S. Rahamat Bashaet al.

39

Impact of feature selection techniques in Text Classification:

An Experimental study

S. Rahamat Basha

¹

, J.Keziya Rani

²

, JJC Prasad Yadav

³

, G.Ravi Kumar

⁴

1,2

Department of CS& T, SK University, AP, India.

3

Department of CSE, RGMCET, AP, India.

4

Department of CS, Rayalaseema University, AP, India

1

[email protected], Corresponding Author: S.Rahamat Basha

Email: [email protected]

https://doi.org/10.26782/jmcms.spl.3/2019.09.00004

Abstract

This work is a study of comparing different feature selection techniques on the accuracy of text classification. Text Mining or Document Categorization is a supervised learning (an Information Retrieval task which learns from labeled train data) technique where it uses labeled (set of instances with predefine labels) train instances or data to learn the categorization job and then it categorize the test text instances automatically using the system that is learnt. In the field of IR and management tasks, classification plays an important lead. The text categorization procedure includes the steps text pre-processing (cleaning, stop word removal and stemming), feature extraction or feature reduction or feature selection and then categorization. In this work, two machine learning algorithm/classifiers (Naïve Bayes and K-Nearest Neighbor) are used for classification. The analyzed experimental results show that Naïve Bayes algorithm gives more accuracy in many cases i.e. with many feature selection techniques and K-Nearest Neighbor classifier works well only in the cases, when the feature selection techniques either Information Gain (IG) or Mutual Information (MI). The results of experiments reported here were generated while Self-made corpus used for training and Reuters-21578 corpus used for testing.

Keywords :

Stop word removal, stemming, feature weighting and selection, K-NN, Naïve Bayesian

I. Introduction

Generally, the corpus used in text categorization [I], consists finite set of documents which belonging to predefined classes. These documents usually consists huge number of features or terms or words which results high dimensional feature vector [V], [VI]. To reduce this severity, in preprocessing, we remove the unnecessary word which does not support the classification task (example articles,

(2)

40

verbs and prepositions etc.). For text categorization (By supervised learning procedure) we assign some documents from predefined categories (for example earns money, crude etc., from Reuter’s corpus). In day to day life, the digital documents in the web are increases, the numbers of terms (i.e., features) that are in those documents are large in number but a few are informative. It is the severe problem which degrades the efficiency in Information Retrieval (IR) procedures. This work also includes stemming process [III] which reduces the dimensionality of feature space and the stochastic dependence between terms. A better feature selection procedure reflects the effectiveness on classification and computational efficiency.

The reminder of this work is organized as follows. Section 2 gives some back ground about feature selection. Section 3 presents feature weightening and implementation procedure of different feature selection methods and KNN and Naïve Bayes classifiers. Section 4 describes about experimental results and analysis. Finally conclusion of this work illustrated in section 5. It is important to develop scalable and efficient models for effective text classification with respect to time and space complexities. The feature extraction methods like Principal Component Analysis (PCA) [X], Linear Discriminant Analysis (LDA) [II], and Orthogonal Centroid algorithm [VII] were able to effectively convert the high dimensional data to a lower dimensional data. Feature clustering for feature reduction [XI], Fuzzy based feature selection techniques [IV] and automatic fuzzy clustering algorithm [IV] were proved as efficient feature reduction models for in reducing dimensionality size.

II. Related Work

By Feature selection approach we remove the redundant and irrelevant features from the corpus, i.e., selecting a subset of features from the training set and uses that set as feature set for text classification. Some supervised feature selection approaches (IG, MI, OR, CHI, NGL, GSS) [I] were used for this task. In our work, to reduce the noisy of data with respect to term frequency (TF) [I], document frequency (DF) [I], are implemented by giving user input threshold. By selecting the most probable features from total features (by the threshold value k) analysis of results with respect to dimensionality size possible is in our work.

III. Feature Weightening

In this process, each feature (single word or term or token) is assigned with a score based on some score computing function and higher scored (weighted) terms are selected. The score computing functions includes some mathematical definitions and probabilistic approaches which are estimated by some static information in the documents across different categories. Some example notations based on probabilities are:

 

P t

_{: Pod}

_X

containing term t;

 

ⁱ

P C

: Pod

X

that not belonging to the category

C

i;

  ^,

ⁱ

P t C

: The joint Pod

X

containing term t and belonging to the category

C

i;

(3)

41

 

ⁱ

^/

P C t

: Pod

X

that belonging to the category

C

i under the condition that it contains term t;

  ^/

ⁱ

P t C

: Pod

X

that containing the term t under the condition that it belongs to the category

C

i;

[Pod¹: The probability of a document]

Document Frequency

The # of documents in which a word occurs is DF.

𝐷𝐹 = ∑ 𝐴

₍₁₎

Ai= document i where word present, m= # of documents, iis an integer range from 1 to m

We computed the DF for every unique tern in the training corpus and the features which have less DF than the predefined threshold were to be removed.

Mutual Information (MI)

The MI and IG are gives similar results for binary problems. But we implemented multi class problem solving procedure (The corpus is with more than two classes) so that these two techniques gives different results.

Chi Square

It is a statistical measure and used to measure the independence of a feature or word and a class. In this context,”Null hypothesis here is that”, the particular word and category are independent completely, i.e., that that word is useless for classifying documents.

𝑋 = ∑⁽ ⁾ (2)

WhereO = the Observed (actual) value, E = the Expected value

GSS Coefficient

It is simplified Chi Square function and named by the inventors Galvotti, Sebastiani and Simi.

𝐺𝑆𝑆(𝐹) =

|𝑐|

𝑚𝑎𝑥 𝑘 = 1

𝐺𝑆𝑆(𝐶 )(3)

F= Feature, CK=class k and |c| =Number of classes.

(4)

42

Odds Ratio

This measure compares the odds of a word occurring in one class with the odds for it occurring in another class.

OR

is positive score if the feature more often occur in one document than other, it is negative if the feature occurs more in the other and it is zero if the feature is equal in both i.e.,ln(1)=0.

𝑂𝑅(𝐹, 𝐶 ) = 𝑙𝑛

^{( |} ⁾⁽ ^{( |} ⁾⁾

( | )( ( | ))

= 𝑙𝑛

, ,

(4)

NGL Coefficient

This measure is a simple variant of the Chi Square metric, also called as correlation coefficient and named by the last names of inventors Ng, Gho and Low.

Information Gain

This method implemented on the constraints of class membership function (presence/absence) and by that how much information gained. The𝐼𝐺of a term t is given as 𝐼𝐺(𝑇, 𝑎) = 𝐻(𝑇) − 𝐻(𝑇/𝑎) where

H (T) =Shannon entropy

H (T/a) is conditional entropy of T is given value of term a.

T is an instance or feature vector of the form (𝑋, 𝑌) = 𝑥 , 𝑥 , … , 𝑥 , 𝑌) Here 𝑥 , 𝑥 , … , 𝑥 are the frequencies of the term or binary representation if term presence 1 otherwise 0

The KNN and Naïve Bayes Classifiers

The advantage of KNN in this model is, by choosing different constraints in every level of classification task we may compare the results with respect to N (N is variable) most matched values. By computing the Euclidean distance we implemented this classifier.

The Naïve Bayes classifier is both binary and multi class classifier. This classifier works on Bayes rule. It works as follows. Let, There are 𝐶 classes in the data set to be categorized.

Let, the prior probability of each class of classifying a feature into𝐶 as 𝑃(𝐶 ) and the values of 𝑃(𝐶 ) can be estimated from the training instances. For n attribute values 𝑣 , the goal of categorization is to find the conditional probability𝑃(𝐶 |𝑣 ∧ 𝑣 ∧. .∧ 𝑣 ). By Bayes rule, this probability should be equivalent to

^, ^,… ⁽ ⁾

( , _,… ) (5)

(5)

43

The assumption of naïve Bayesian classifier is that, within each category, the values viare all independent of each other.

∴ According to the laws of independent probability,

𝑃(𝑣 𝑎𝑙𝑙 𝑡ℎ𝑒 𝑜𝑡ℎ𝑒𝑟 𝑣𝑎𝑙𝑢𝑒𝑠 𝑜𝑓 𝑣 , 𝐶 ) = 𝑃(𝑣 |𝐶 )

(6)

&

∴

𝑃(𝑣 ∧ 𝑣 ∧. .∧ 𝑣 |𝐶 ) = 𝑃(𝑣 |𝐶 )𝑃(𝑣 |𝐶 ) … 𝑃( 𝑣 |𝐶 )

(7) The each factor on the right-hand side of the equation (7) could be determined from the trained instances, because (for any arbitrary v_i),

𝑃(𝑣 |𝐶 ) ≈ [#(𝑣 ∧ 𝐶 )]/[#(𝐶 )](8)

The # symbol in (8) represents the number of such occurrences in the training instances or samples.

∴

The categorization of the test instances could now be estimated by, 𝑃(𝐶 |𝑣 ∧ 𝑣 ∧. .∧ 𝑣 )which is proportional

to

𝑃(𝐶 )𝑃(𝑣 |𝐶 )𝑃(𝑣 |𝐶 )𝑃(𝑣 |𝐶 ) … 𝑃(𝑣 |𝐶 ).

IV. Results and Analysis Data Set1: Self-made Data Set

To train the classifiers, we used a small (of size around 1.5MB) self-made corpus. This allows, the running time for training needed to be as short as possible.

The documents of self-made corpus collected articles in online from the New York Times, Washington post and CNN.com out of the standard category. “Business”,

“Science”, “Sports”, “Health”, “Education”, “Movies”, and “Travel”. We collected 150 documents with the following categories. Business (23), Education (24), Health (30), Movies (10), Science (27), Sports (30), Travel (6), with an average 702 words per document.

Self-made data set analysis using Voyant Tools

The Tends [VI] of the categories shown in the following diagram shows the relative frequencies of the words on Y axis and the document name in X axis. The relative frequency is much generalized measure which can be used in many contexts. For Example, Let a person won 75 games when he played a total of 100 games. The Relative Frequency of winning by that person will be 75/100= 0.75.

(6)

J. Mech. Cont.& Math. Sci.,

Fig. 1 Tends of the categories of self (a),(b),(c),(d),(e),(f),(g), & (h).

Advances in Engineering, Management and Sciences J. Mech. Cont.& Math. Sci., Special Issue, No.-3, September (2019) pp

et al.

44

of the categories of self-made corpus is shown in figures (a),(b),(c),(d),(e),(f),(g), & (h).

Advances in Engineering, Management and Sciences (2019) pp 39-51

made corpus is shown in figures

(7)

The Cirrus tool [IX] of voyant tools

self- made corpus is as follows. We have not removed stop words and stemming when applying this corpus to voyant tools. So, every cloud of the following diagram has the stop words as major.

et al.

45

of voyant tools -exposes the word clouds of the categories of the made corpus is as follows. We have not removed stop words and stemming when applying this corpus to voyant tools. So, every cloud of the following diagram has the stop words as major.

Advances in Engineering, Management and Sciences (2019) pp 39-51 exposes the word clouds of the categories of the made corpus is as follows. We have not removed stop words and stemming when applying this corpus to voyant tools. So, every cloud of the following diagram

(8)

Fig.2: Cirrus of the categories shown in figures (a),(b),(c),(d),(e),(f),(g) & (h).

The word clouds generally with the featu

frequency in that corpus. The above diagrams clearly shows that the all

consists adjectives(for example said because the corpus is with stop words also).The word clouds also shows that every word (almost except with stop words) semantically related to the corresponding category, for example the words growth , price

business text category cloud.

Fig.3: Corpus description in terms of Total and Unique word(without stop words) Advances in Engineering, Management and Sciences J. Mech. Cont.& Math. Sci., Special Issue, No.-3, September (2019) pp

et al.

46

of the categories shown in figures (a),(b),(c),(d),(e),(f),(g) & (h).

The word clouds generally with the features which have relatively more term requency in that corpus. The above diagrams clearly shows that the all

consists adjectives(for example said because the corpus is with stop words also).The word clouds also shows that every word (almost except with stop words) semantically related to the corresponding category, for example the words growth , price

business text category cloud.

Corpus description in terms of Total and Unique word(without stop words) Advances in Engineering, Management and Sciences

(2019) pp 39-51

of the categories shown in figures (a),(b),(c),(d),(e),(f),(g) & (h).

res which have relatively more term requency in that corpus. The above diagrams clearly shows that the all word clouds consists adjectives(for example said because the corpus is with stop words also).The word clouds also shows that every word (almost except with stop words) semantically related to the corresponding category, for example the words growth , prices etc in

Corpus description in terms of Total and Unique word(without stop words)

(9)

Fig.4: Corpus description in terms of Total and Unique word

Fig.5: Impact of stop words &stemming on corpus

From figure.5 and table1 shows the impact of preprocessing in the self

The X-axis of the bar graph shown the corpus after applying one or more preprocessing technique as shown in table1. The unique words that are resulted with all the combinations of preprocessing techniques and the impact is projected in the figure 5

Lancaster stemmer resulted less unique words. The corpus used as the input the feature selection techniques

Table 1:

et al.

47

Corpus description in terms of Total and Unique word (with stop words)

Impact of stop words &stemming on corpus

.5 and table1 shows the impact of preprocessing in the self-made corpus.

axis of the bar graph shown the corpus after applying one or more preprocessing technique as shown in table1. The unique words that are resulted

ions of preprocessing techniques and the impact is projected in the figure 5. The corpus after removing stop word and stemming with Lancaster stemmer resulted less unique words. The corpus used as the input the feature selection techniques [XII], chi square, IG and MI.

Table 1: Methods used to reduce number of Features Corpus Stop words Stemming

A No No

B Yes No

C No Porter

D Yes Porter

E No Lancaster

F Yes Lancaster

with stop words)

made corpus.

axis of the bar graph shown the corpus after applying one or more preprocessing technique as shown in table1. The unique words that are resulted ions of preprocessing techniques and the impact isclearly The corpus after removing stop word and stemming with Lancaster stemmer resulted less unique words. The corpus used as the input the

(10)

Data Set2: The Reuters 21578 corpus We have used this corpus as test

categories. The corpus is available in many internet sources. By using the freely available tool SX we convert the 22 SGML documents to XML documents. We then deleted some single characters which (decim

spaces) were rejected by the validating XML parser.

The results shown below in the tables are generated under the constraints where classifier is KNN, N=30, Threshold=1, Threshold step size=0.1, method of summation is sum and DF=<1, TF=<1 for the test documents Reut2

Reut2.001.xml, Reut 2.002.xml respectively. Many observations possible, for example, by table1, table2 and table3, Chi Square is much far away with respect to IG and MG values and when feature spa

MI diminished.

Breakeven point is the point where precision and recall meet. The mean of score of precisions is called average precision. The average precision is calculated over a scale of point from 0 to 1.

Table 2

Break-even 11 point-precision

Avg precision

et al.

48 Data Set2: The Reuters 21578 corpus

We have used this corpus as test data. reuters21578 [VIII] corpus consists totally categories. The corpus is available in many internet sources. By using the freely available tool SX we convert the 22 SGML documents to XML documents. We then deleted some single characters which (decimal values below 30, for example white spaces) were rejected by the validating XML parser.

The results shown below in the tables are generated under the constraints where classifier is KNN, N=30, Threshold=1, Threshold step size=0.1, method of

sum and DF=<1, TF=<1 for the test documents Reut2

Reut2.001.xml, Reut 2.002.xml respectively. Many observations possible, for example, by table1, table2 and table3, Chi Square is much far away with respect to IG and MG values and when feature space size increases the similarity between IG and Breakeven point is the point where precision and recall meet. The mean of score of precisions is called average precision. The average precision is calculated over a scale

Table 2: Dimensionality of feature space=250

Chi Square IG MI

even-point 0.112 0.765 0.716

precision 0.253 0.831 0.745

precision 0.196 0.869 0.762

Advances in Engineering, Management and Sciences (2019) pp 39-51 ] corpus consists totally 135 categories. The corpus is available in many internet sources. By using the freely available tool SX we convert the 22 SGML documents to XML documents. We then al values below 30, for example white The results shown below in the tables are generated under the constraints where classifier is KNN, N=30, Threshold=1, Threshold step size=0.1, method of sum and DF=<1, TF=<1 for the test documents Reut2-000.xml, Reut2.001.xml, Reut 2.002.xml respectively. Many observations possible, for example, by table1, table2 and table3, Chi Square is much far away with respect to IG ce size increases the similarity between IG and Breakeven point is the point where precision and recall meet. The mean of score of precisions is called average precision. The average precision is calculated over a scale

(9)

(10)

(11)

Table

Break-even-Point 11 point-precision

Avg precision Table

Break Point 11 point Precision Avg precision

V. KNN Classifier performance

Many observations for the categorization problem by two classifiers (KNN and Naïve Bayes) resulted. For most results we may conclude that Naïve Bayes classifier gives better results even though the performance of two classifiers is efficient. From fig. 6 we conclude that, KNN classifier gives different results for each feature selection technique but from fig. 7 Naïve Bayes classifier gives similar results for all feature selection techniques except for IG and MI. We observe that KNN classifier given similar results for GSS, Chi Square and NGL techniques at feature space size=1000.

`

Fig.6: KNN classifier results for the test document Reut.002.xml at constraints k=30, TF<1 and DF<1.

et al.

49

Table 3: Dimensionality of feature space=500

ChiSquare IG MI

Point 0.000 0.787 0.786

precision 0.125 0.851 0.844

Avg precision 0.137 0.891 0.872

Table 4: Dimensionality of feature space=1000 Chi

Square IG MI

Break-even-

Point 0.266 0.713 0.561

11 point

Precision 0.390 0.743 0.574

Avg precision 0.345 0.77 0.588

KNN Classifier performance

Many observations for the categorization problem by two classifiers (KNN and Naïve Bayes) resulted. For most results we may conclude that Naïve Bayes classifier gives better results even though the performance of two classifiers is e conclude that, KNN classifier gives different results for each feature selection technique but from fig. 7 Naïve Bayes classifier gives similar results for all feature selection techniques except for IG and MI. We observe that KNN r results for GSS, Chi Square and NGL techniques at feature

: KNN classifier results for the test document Reut.002.xml at constraints k=30, TF<1 and DF<1.

MI 0.786 0.844 0.872

Many observations for the categorization problem by two classifiers (KNN and Naïve Bayes) resulted. For most results we may conclude that Naïve Bayes classifier gives better results even though the performance of two classifiers is e conclude that, KNN classifier gives different results for each feature selection technique but from fig. 7 Naïve Bayes classifier gives similar results for all feature selection techniques except for IG and MI. We observe that KNN r results for GSS, Chi Square and NGL techniques at feature

: KNN classifier results for the test document Reut.002.xml at constraints

(12)

50

Naïve Bayes classifier

Fig.7: Naïve Bayes Classifier results for the test document Reut.002.xml at constraints TF<1 and DF<1.

VI. Conclusion

This work clearly shows the importance of using less and balanced data for training. The results are exposed using multiple feature selection techniques and two different classifiers. The impact of preprocessing techniques like preprocessing, stemming also shown in the dimensionality reduction. For a better text classification model implementation, every step of the model should yield better results. By this system we proved that Lancaster stemmer (generated unique terms 7288) better then Porter’s stemmer (generated unique terms 8697), comparison of Weighting techniques Chi Square, MI and IG. From fig. 1 and fig. 2, we also proved that under some constraints Naïve Bayes classifier gives better results than KNN.

References

I. “A Fuzzy Self-Constructing Feature Clustering Algorithm for Text Classification”, IEEE transactions on knowledge and data engineering, Vol.:

23, Issue: 3, March 2011.

II. A.M. Martinez and A.C. Kak, “PCA versus LDA”, IEEE Trans. Pattern Analysis and Machine Intelligence, Vol.: 23, Issue: 2, pp. 228-233, Feb.

2001.

III. An algorithm for suffix stripping by M. F. Porter, http://maya.cs.depaul.edu/~classes/ds575/papers/porter-algorithm.html, 1980.

IV. Eghbal G. Mansoori and Khadijeh S. Shafiee, "On fuzzy feature selection in designing fuzzy classifiers for high-dimensional data", Evolving Systems, Vol.:7, Issue:4, pp 255–265, December 2016.

(13)

51

V. F.Sebastani, “Machine Learning in Automated Text Categorization”, ACM Computing Surveys, Vol.: 34, Issue: 1, pp.1-47, 2002.

VI. H.kim, p. Howland, and H. park, “Dimension Reduction in Text Classification with Support Vector Machines”, J.Machine Learning Research, Vol.: 6, pp. 37-53, 2005.

VII. H. Park, M. Jeon, and J. Rosen, “Lower Dimensional Representation of Text Data Based on Centroids and Least Squares”, BIT Numerical Math, Vol.: 43, pp. 427-448, 2003.

VIII. https://archive.ics.uci.edu/ml/datasets/Reuters- 21578+Text+Categorization+Collection.

IX. https://voyant-tools.org/?corpus=1621ff41879200779eb5bf827f2e3881 X. I.T. Jolliffe, Principal Component Analysis. Springer-Verlag, 1986.

XI. N. Slonim and N. Tishby, “The Power of Word Clusters for Text Classification”, Proc. 23rd European Colloquium on Information Retrieval Research (ECIR), 2001.

XII. Oystern Lohre Garnes, Kjetil Norvag, Robert Neumayer, “Feature Selection for Text Categorization”, Norwegian University of Science and Technology, June 2009.

Impact of feature selection techniques in Text Classification: An Experimental study

Impact of feature selection techniques in Text Classification:

An Experimental study

S. Rahamat Basha

, J.Keziya Rani

, JJC Prasad Yadav

, G.Ravi Kumar

Department of CS& T, SK University, AP, India.

Department of CSE, RGMCET, AP, India.

Department of CS, Rayalaseema University, AP, India

[email protected],

[email protected], Corresponding Author: S.Rahamat Basha

Email: [email protected]

Abstract

Keywords :

I. Introduction

II. Related Work

III. Feature Weightening

 

P t

X

 

P C

X

C

  ,

P t C

X

C

 

/

P C t

X

C

  /

P t C

X

C

Document Frequency

𝐷𝐹 = ∑ 𝐴

Mutual Information (MI)

Chi Square

GSS Coefficient

Odds Ratio

OR

𝑂𝑅(𝐹, 𝐶 ) = 𝑙𝑛

= 𝑙𝑛

(4)

NGL Coefficient

Information Gain

The KNN and Naïve Bayes Classifiers

The advantage of KNN in this model is, by choosing different constraints in every level of classification task we may compare the results with respect to N (N is variable) most matched values. By computing the Euclidean distance we implemented this classifier.

𝑃(𝑣 𝑎𝑙𝑙 𝑡ℎ𝑒 𝑜𝑡ℎ𝑒𝑟 𝑣𝑎𝑙𝑢𝑒𝑠 𝑜𝑓 𝑣 , 𝐶 ) = 𝑃(𝑣 |𝐶 )

&

𝑃(𝑣 ∧ 𝑣 ∧. .∧ 𝑣 |𝐶 ) = 𝑃(𝑣 |𝐶 )𝑃(𝑣 |𝐶 ) … 𝑃( 𝑣 |𝐶 )

𝑃(𝑣 |𝐶 ) ≈ [#(𝑣 ∧ 𝐶 )]/[#(𝐶 )](8)

The categorization of the test instances could now be estimated by, 𝑃(𝐶 |𝑣 ∧ 𝑣 ∧. .∧ 𝑣 )which is proportional

to

IV. Results and Analysis Data Set1: Self-made Data Set

Self-made data set analysis using Voyant Tools

V. KNN Classifier performance

KNN Classifier performance

Naïve Bayes classifier

VI. Conclusion

_X

  ^,

^/

  ^/