Using Polynomial Neural Networks for Arabic Text Categorization

ISSN 1450-216X / 1450-202X Vol. 152 No 3 April, 2019, pp. 256-264 http://www.europeanjournalofscientificresearch.com

Adel Hamdan Mohammad
Computer Science Department, The World Islamic Sciences and Education University, Amman, Jordan
E-mail: [email protected]; [email protected]

Nidhal Kamel Taha El-Omari
Software Engineering Department, The World Islamic Sciences and Education University, Amman, Jordan
E-mail: [email protected]; [email protected]

Abstract

Text classification is an active research topic, driven by the large and ever-growing number of electronic documents. Many studies address English text classification, but far fewer address Arabic, and the work carried out on Arabic datasets is not yet sufficient to ensure effective classification. This paper applies the Polynomial Neural Network, a classifier that has shown good results on English datasets, to an Arabic dataset. The dataset used in this research was built in-house. Experimental results demonstrate that Polynomial Neural Networks can be used with Arabic datasets and that the results are promising.

Keywords: Machine Learning, Text classification (TC), Polynomial Neural Networks (PNNs), Arabic Text Categorization.

1. Introduction

The rapid growth and availability of electronic documents has made text classification (TC) one of today's emerging research topics. Millions of electronic documents are available on the internet; most of them are in English and, in turn, most methods and techniques used for text classification are also based on English. For that reason, extensive research has been conducted on text classification using English datasets [1]. A related application is spam detection, which uses several methods and techniques to classify emails based on their content [2,3,4].

Text classification aims to identify a document by comparing its content with a group of interrelated documents. Although each document differs from the others, all documents of the same class are alike and may share certain common characteristics [5, 6]. To classify a new document and differentiate it from the others, the aim is to capture these common characteristics and use them as a classification criterion [5, 6].

Several methods are used for text classification, such as C4.5, K-Means, Support Vector Machines (SVM), Hidden Markov Models, Artificial Neural Networks (ANN), the k-Nearest Neighbors algorithm (KNN), Naïve Bayes (NB), and many others [7,8,9]. Again, most of them have been applied to English datasets. As mentioned above, the number of studies conducted on Arabic datasets is not sufficient [9,10,11].

Many research papers on text classification using Polynomial Neural Networks (PNNs) have been published. Nevertheless, most of these studies concern English datasets, and only a few address PNN text classification with Arabic datasets [1,12,13,14]. This paper uses PNNs with the Arabic language and investigates how well PNNs perform on an Arabic dataset.

The remainder of the paper is organized as follows. Section 2 describes Polynomial Neural Networks (PNNs). Section 3 describes the Arabic language. Section 4 reviews related studies of PNNs as text classifiers. Section 5 presents the in-house dataset used in the experiments. Section 6 covers document pre-processing, term selection and weighting. Section 7 discusses the experiments and their results, and Section 8 concludes the paper.

2. Polynomial Neural Networks

An Artificial Neural Network (ANN) can be regarded as a simplified form of a Polynomial Neural Network (PNN). The ANN is a powerful classification method that can capture complex relationships among documents and has therefore been used by many researchers and practitioners to identify patterns in those relationships [5, 6]. Many studies apply neural networks to both English and Arabic, and they report encouraging results [2,3,4,12]. Neural networks are an excellent classification method, especially for spam detection [2,3,4]. One of the main disadvantages of ANNs is their limited ability to generalize over input patterns. PNNs are also an excellent method for text classification. According to Al-Tahrawi et al. [1], PNNs were first used for English text classification in 2008, and still only a few studies use PNNs to classify English text documents. PNNs were then used for Arabic text classification in 2015. In PNNs, general connections link the input variables to the output variables, and polynomial functions are applied to the variables to preserve partial dependencies. PNNs are considered one of the most competitive text classification methods on English datasets [1]. In this paper, the authors use the Polynomial Neural Network proposed by Campbell as a supervised learning algorithm [15].

Text classification, in general, automatically assigns any new, unseen document to one or more predefined classes [5, 6]. The PNN algorithm can be used to classify both English and Arabic text documents, but its use with Arabic still needs more investigation. Polynomial network architectures have several representations and forms. This research uses two layers (an input layer and an output layer). The first layer holds a group of inputs or features x = (x1, x2, ..., xN), where N is the number of input features.

N also determines the set of monomial basis functions p(x) of the required order (degree) K. Note that there is one basis vector p(x) for each input or observation. According to Campbell [15], the elements of p(x) for a polynomial of degree K are monomials of the following form [1]:

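$$x_1^{k_1} x_2^{k_2} \cdots x_N^{k_N}, \quad \text{where } k_j \ge 0 \text{ and } 0 \le \sum_{j=1}^{N} k_j \le K \tag{1}$$

In this research, if an input vector x contains two features (x1, x2), the second-order PNN basis function p(x) is as follows [1]:

$$p(x) = \begin{bmatrix} 1 & x_1 & x_2 & x_1^2 & x_1 x_2 & x_2^2 \end{bmatrix}^t \tag{2}$$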
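The degree-2 expansion described above can be illustrated with a minimal sketch. The function name `monomial_basis` and the sample input are assumptions made for illustration; they are not taken from the authors' implementation. The sketch enumerates every monomial of total degree 0 to K over the input features.

```python
from itertools import combinations_with_replacement


def monomial_basis(x, degree=2):
    """Return all monomials of the entries of x with total degree 0..degree.

    Hypothetical helper: for x = (x1, x2) and degree 2 the order is
    [1, x1, x2, x1^2, x1*x2, x2^2].
    """
    terms = []
    for d in range(degree + 1):
        # Each index tuple picks d features (with repetition) to multiply.
        for idx in combinations_with_replacement(range(len(x)), d):
            value = 1.0
            for i in idx:
                value *= x[i]
            terms.append(value)
    return terms


print(monomial_basis([2.0, 3.0]))  # [1.0, 2.0, 3.0, 4.0, 6.0, 9.0]
```

The length of the expanded vector grows quickly with the number of features (for N features and degree 2 it is (N+1)(N+2)/2), which is one reason feature selection matters before the expansion.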

The degree of the polynomial can be 0 (constant), 1 (linear), 2 (quadratic), and so on; this research uses polynomials of degree 2. The results of the first layer are then combined by the second layer. This is done by computing scores w^t p(x), where w is the weight vector of the classification model. A score w_j^t p(x_i) is calculated for each input vector x_i and for each class j. The final output for class j is the average of these scores, where M is the number of feature vectors scored against class j [1,15].

$$S_j = \frac{1}{M} \sum_{i=1}^{M} w_j^t \, p(x_i) \tag{3}$$
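As a rough illustration of the scoring step, the sketch below averages the scores w_j^t p(x_i) for each class and picks the class with the highest average. The class labels, weight values and expanded vectors are invented for the example, and the function names `class_score` and `predict` are hypothetical.

```python
def class_score(w_j, expanded_vectors):
    """Average of the dot products w_j . p(x_i) over the M expanded vectors."""
    m = len(expanded_vectors)
    total = 0.0
    for p in expanded_vectors:
        total += sum(w * v for w, v in zip(w_j, p))
    return total / m


def predict(weights_by_class, expanded_vectors):
    """Return the best-scoring class label and the score of every class."""
    scores = {
        label: class_score(w, expanded_vectors)
        for label, w in weights_by_class.items()
    }
    return max(scores, key=scores.get), scores


# Toy data (assumed): two already-expanded vectors of length 6.
expanded = [
    [1.0, 2.0, 3.0, 4.0, 6.0, 9.0],
    [1.0, 0.0, 1.0, 0.0, 0.0, 1.0],
]
# Toy weight vectors (assumed), one per class.
weights = {
    "sports": [0.1, 0.2, 0.0, 0.0, 0.0, 0.0],
    "politics": [0.0, 0.0, 0.1, 0.0, 0.0, 0.0],
}

label, scores = predict(weights, expanded)
print(label, scores)
```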

Polynomial network experiments are carried out in two phases: a training phase and a testing phase. The training phase identifies the ideal output using the mean squared error as the criterion. It starts by creating a term vector x for each training document; these term vectors are used to calculate the weight vector “w” for each class.

The purpose of training is to find a set of optimum weights “w”. The optimum weights minimize the distance between the linear combination of the expanded training data and the ideal target output [1,15]. Terms in this research are represented using the well-known information retrieval weighting “term frequency-inverse document frequency (tf-idf)”. After sufficient training, each new document is assigned to the nearest class.
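One way to picture the training objective is ordinary least squares: choose w so that w^t p(x_i) is as close as possible, in the mean squared error sense, to a target of 1 for documents in the class and 0 otherwise. The sketch below solves the normal equations directly. It is an illustration under stated assumptions and not necessarily the training procedure of [15]; the small `ridge` term, the function names and the toy data are all assumptions added here.

```python
def solve_linear_system(a, b):
    """Solve a * x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - s) / aug[r][r]
    return x


def train_class_weights(expanded, targets, ridge=1e-6):
    """Least-squares weights minimising sum_i (w . p(x_i) - target_i)^2.

    ridge is an assumed small stabiliser that keeps the system solvable.
    """
    dim = len(expanded[0])
    gram = [
        [sum(p[i] * p[j] for p in expanded) for j in range(dim)]
        for i in range(dim)
    ]
    for i in range(dim):
        gram[i][i] += ridge
    rhs = [sum(p[i] * t for p, t in zip(expanded, targets)) for i in range(dim)]
    return solve_linear_system(gram, rhs)


# Toy data (assumed): expanded vectors [1, x] and 0/1 class targets.
expanded = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
targets = [0.0, 0.0, 1.0, 1.0]
w = train_class_weights(expanded, targets)
print(w)  # close to [-0.1, 0.4]
```

In a one-versus-rest setting, one such weight vector would be trained per class, each with its own 0/1 target vector.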

3. Arabic Language

Arabic is spoken as a native language by more than 350 million people. The Arabic alphabet consists of twenty-eight letters plus some special characters. Arabic is written from right to left, and its letters take different forms depending on their position in the word (beginning, middle, or end) [17,18]. One of the most important characteristics of Arabic is that most words derive from a root [19,20]. Relatedly, more than 80% of Arabic words can be mapped to a three-letter root [19,20].

4. Related Studies

Mayy M. Al-Tahrawi [1]. In this paper, the authors use Polynomial Networks for Arabic text classification. The experiments are conducted on a large Arabic dataset. Stemming is applied, and chi-square is used for feature selection. The results demonstrate that PNNs are competitive and can be used for Arabic text classification.

Mayy M. Al-Tahrawi [21]. In this research, the author uses Polynomial Networks for Arabic text classification. The results of the PNNs are then compared with “Support Vector Machine”, “Naïve Bayes”, “K-Nearest Neighbor”, “Logistic Regression”, and “Radial Basis Function Networks”. The results demonstrate that PNNs are competitive.

Adel Hamdan [18,19]. In these studies, the authors apply different methods and techniques to Arabic text classification: k-nearest neighbour, Decision Trees (C4.5), Rocchio Classifier, Support Vector Machine, Naïve Bayes and Neural Network. Results are obtained using different feature selection methods. Their experiments also demonstrate the differences between these algorithms in text classification.

Majed Ismail [22]. In this research, the authors apply different algorithms to text classification. They implement Sequential Minimal Optimization (SMO), Naïve Bayesian and J48 (C4.5) algorithms using the Weka program. The experiments demonstrate that SMO achieves the best accuracy.

Tarek Fouad [23]. In this research, the authors deploy a Support Vector Machine (SVM) as an Arabic text classifier and compare its results with the Rocchio and Bayes classifiers. Experiments are conducted on 1132 Arabic documents. In this research, the Rocchio classifier gives the most promising results.

Rasha Elhassan [24]. In this research, the authors review the available research in the field of Arabic text categorization. They demonstrate that building a benchmark Arabic dataset is important and that stemming is a major problem in Arabic text categorization.

Ghazi Raho [25]. In this research, the authors conduct a comparative study on feature selection. They demonstrate that dataset pre-processing is crucial for success, and they compare the performance of different classifiers. The experiments are done on the BBC Arabic dataset using K-nearest neighbors, the “Naïve Bayesian method” and “Naïve Bayes multinomial”.

Raed Al-khurayji [26]. In this research, the authors use “kernel naive Bayes (Kernel NB)” for Arabic text classification. They pre-process the Arabic documents and convert the words into vectors using “term frequency-inverse document frequency (tf-idf)”. After that, they classify the vectors using kernel naive Bayes. The experiments are conducted on a collected dataset and demonstrate that “Kernel NB” achieves improvements of 13% in training accuracy and 9% in testing accuracy.

5. Dataset

In practice, one of the crucial problems facing Arabic text categorization is the absence of a dedicated Arabic benchmark dataset. To overcome this problem, most authors use their own collected (i.e. in-house) datasets [27,28,29]; the authors of this research likewise use an in-house dataset. It was collected from several web sites such as Aljazeera [30], Saudi Press Agency [31], and Al-Hayat [32]. As shown in Table 1, this dataset consists of 2000 Arabic documents.

Table 1: The Dataset of Arabic documents

| Category | Total number of documents | # of training documents | # of testing documents |
|---|---|---|---|
| Politics | 200 | 120 | 80 |
| Economics | 200 | 120 | 80 |
| Culture | 200 | 120 | 80 |
| Sports | 200 | 120 | 80 |
| Art | 200 | 120 | 80 |
| Technology | 300 | 200 | 100 |
| Science | 300 | 200 | 100 |
| Education | 400 | 250 | 150 |
| Total | 2000 | 1250 | 750 |

6. Document Pre-processing, Term Selection and Weighting

Document pre-processing is one of the most important success factors in text classification. Removing useless data is very important for reducing the size of a document. Pre-processing can be done through feature selection, with the intention of removing unnecessary data such as auxiliary verbs, digits, numbers and other uninformative tokens [33].

Feature selection includes stemming, which cleans the documents so that they are ready for the next stage, building the vector space. There are several feature selection methods, such as Chi-Square (CHI), Information Gain (IG), Document Frequency (DF), Mutual Information (MI), and others. In this research, Chi-square is used to select features, which are then represented using “term frequency-inverse document frequency (tf-idf)” weighting.
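The two steps named above can be sketched on a toy corpus. The paper does not state which chi-square or tf-idf variant was used, so the formulas below are assumptions: the common 2x2 contingency form of chi-square for a term and a class, and tf-idf as raw term count times log(N/df). The English placeholder tokens and the function names are hypothetical.

```python
import math


def chi_square(term, label, docs):
    """Chi-square statistic of a term for a class from a 2x2 contingency table.

    docs is a list of (tokens, label) pairs.
    """
    n11 = sum(1 for toks, y in docs if term in toks and y == label)
    n10 = sum(1 for toks, y in docs if term in toks and y != label)
    n01 = sum(1 for toks, y in docs if term not in toks and y == label)
    n00 = sum(1 for toks, y in docs if term not in toks and y != label)
    n = n11 + n10 + n01 + n00
    denom = (n11 + n01) * (n10 + n00) * (n11 + n10) * (n01 + n00)
    if denom == 0:
        return 0.0
    return n * (n11 * n00 - n01 * n10) ** 2 / denom


def tf_idf(term, tokens, corpus):
    """Raw term count in one document times log(N / document frequency)."""
    tf = tokens.count(term)
    df = sum(1 for doc in corpus if term in doc)
    if df == 0:
        return 0.0
    return tf * math.log(len(corpus) / df)


# Toy corpus (assumed placeholder tokens standing in for stemmed Arabic terms).
docs = [
    (["goal", "match", "team"], "sports"),
    (["match", "team", "win"], "sports"),
    (["vote", "party", "team"], "politics"),
    (["vote", "election"], "politics"),
]
corpus = [toks for toks, _ in docs]

print(chi_square("match", "sports", docs))   # strongly class-specific term
print(chi_square("team", "sports", docs))    # weaker association
print(tf_idf("match", corpus[0], corpus))
```

Terms would typically be ranked by their chi-square value per class, the top-ranked terms kept, and each kept term weighted with tf-idf to form the document vector.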

7. Experimental Results and Analysis

The authors use PNNs as a text classifier to classify Arabic documents. As described in Section 5, the in-house dataset of 2000 Arabic documents is used; these documents belong to eight categories.

The evaluation measures most widely used for English and Arabic text classification are Recall, Precision, and F1. To understand these three measures, suppose that “a” is the number of relevant documents that are retrieved and “b” is the number of irrelevant documents that are retrieved. Likewise, suppose that “c” is the number of relevant documents that are not retrieved and “d” is the number of irrelevant documents that are not retrieved. Table 2 illustrates these symbols, and the following formulas show how to compute the measures:

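$$\text{Precision} = \frac{a}{a+b} \tag{4}$$

$$\text{Recall} = \frac{a}{a+c} \tag{5}$$

$$F1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{6}$$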

Table 2: The symbols that are required to calculate the three evaluation measures

| | Relevant documents | Irrelevant documents |
|---|---|---|
| # of retrieved documents | a | b |
| # of un-retrieved documents | c | d |
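The three measures can be computed directly from the counts a, b and c defined above, with F1 taken as the harmonic mean of Precision and Recall. The sketch below uses invented counts, and the function name `precision_recall_f1` is hypothetical; the count d is not needed by any of the three measures.

```python
def precision_recall_f1(a, b, c):
    """Compute Precision, Recall and F1 from retrieval counts.

    a: relevant documents retrieved
    b: irrelevant documents retrieved
    c: relevant documents not retrieved
    """
    precision = a / (a + b) if (a + b) else 0.0
    recall = a / (a + c) if (a + c) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Invented counts for one category (not taken from the paper's experiments).
p, r, f1 = precision_recall_f1(a=80, b=20, c=10)
print(round(p, 2), round(r, 2), round(f1, 2))  # 0.8 0.89 0.84
```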

Table 3 shows the experimental performance of PNNs on the three evaluation measures: Precision, Recall, and F1. These results indicate that PNNs give promising results on the Arabic dataset. The calculated averages are 0.81, 0.86 and 0.84 for Precision, Recall and F1, respectively.

The results of the three evaluation measures (Precision, Recall, and F1) are also summarized in Figure 1, Figure 2, and Figure 3, respectively. These figures show that PNNs can be used for Arabic text classification and work well on the Arabic dataset. It can be concluded that PNNs can acquire knowledge through learning and, in turn, exhibit some notion of intelligence through the classification of Arabic text documents.

Table 3: PNNs for Precision, Recall and F1

| Category (PNNs) | Precision | Recall | F1 |
|---|---|---|---|
| Politics | 0.79 | 0.89 | 0.83 |
| Economics | 0.81 | 0.91 | 0.85 |
| Culture | 0.77 | 0.88 | 0.82 |
| Sports | 0.81 | 0.91 | 0.85 |
| Art | 0.79 | 0.82 | 0.80 |
| Technology | 0.85 | 0.86 | 0.85 |
| Science | 0.81 | 0.79 | 0.79 |
| Education | 0.90 | 0.88 | 0.88 |
| Average | 0.81 | 0.86 | 0.84 |

Figure 1: PNNs Precision

Figure 2: PNNs recall

Figure 3: PNNs F1

8. Conclusion and Future Works

This research addresses text classification, a hot topic because of the large number of electronic documents. Researchers can find an enormous number of experiments on English datasets, but the number of experiments on Arabic datasets is still far from sufficient.

The experiments of this research were performed on an Arabic dataset using PNNs as a text classifier. The results indicate that PNNs can be applied to Arabic datasets as an acceptable classifier. To demonstrate that PNNs can serve as a good classifier for Arabic documents, this research uses three evaluation measures: Precision, Recall and F1. The experiments are conducted on a dataset of eight categories with a total of 2000 documents, of which 62.5% are used in the training phase and the remaining 37.5% in the testing phase.

PNNs can therefore acquire knowledge through learning and show some notion of intelligence through the classification of Arabic text documents. Based on the results of this technique, a comparative study between ANNs and PNNs could be addressed as future work.

References

[1] Mayy M. Al-Tahrawi, Sumaya N. Al-Khatib, “Arabic text classification using Polynomial Networks”, Journal of King Saud University-Computer and Information Sciences, http://dx.doi.org/10.1016/j.jksuci.2015.02.003, 2014.

[2] Adel Hamdan, Raed Abu-Zitar, “Spam Detection Using Assisted Artificial Immune System”, Volume: 25, Issue: 8(2011) pp. 1275-1295, International Journal of Pattern Recognition and Artificial Intelligence, 2011.

[3] Raed Abu-Zitar, Adel Hamdan, “Application of Genetic Optimized Artificial Immune System and Neural Networks in Spam Detection”, Applied Soft Computing, Volume 11, Issue 4, June 2011, pp. 3827-3845, Elsevier, 2011.

[4] Adel Hamdan, Raed Abu-Zitar, “Genetic optimized artificial immune system in spam detection: a review and a model”, Artificial Intelligence Review, Volume 40, Issue 30, pp. 305-377, 2013.

[5] N. K. T. El-Omari and A. A. Awajan, “Document Image Segmentation and Compression using Artificial Neural Networks and Evolutionary Methods”, in International Conference on Information and Communication Systems (ICICS09), pp. 320-324, 2009.

[6] N. K. T. El-omari, A. H. Omari, O. F. Al-badarneh, and H. Abdel-jaber, “Scanned Document Image Segmentation Using Back-Propagation Artificial Neural Network Based Technique”, Int. J. Comput. Commun, vol. 6, no. 4, pp. 183-190, 2012.

[7] Rasha Elhassan, Mahmoud Ahmed, “Arabic Text Classification Review”, International Journal of Computer Science and Software Engineering (IJCSSE), Volume 4, Issue 1, pp. 1-5, January 2015.

[8] Mofleh Al-diabat, “Arabic Text Categorization Using Classification Rule Mining”, Applied Mathematical Sciences, Vol. 6, 2012, no. 81, pp. 4033-4046, 2012.

[9] F. Sebastiani, “Machine learning in automated text categorization”, ACM Computing Surveys, volume 34 number 1. pp.1-47, 2002.

[10] Dharmadhikari, C.S., Ingle, M. and Kulkarni, P., “Empirical Studies on Machine Learning Based Text Classification Algorithms”, Advanced Computing: An International Journal (ACIJ), Vol.2, pp. 161-169, 2011.

[11] Mohammad Ali H. Eljinini, Wa'el Musa Hadi, Adel Hamdan, Mohammad Ghatasheh, “Performance of NB and SVM Classifiers in Arabic Text Data”, proceedings of The 14th International Business Information Management Association, Conference on Global Business Transformation through Innovation and Knowledge Management, Istanbul, Turkey. IBIMA, pp. 2593-2599, 2010.

[12] Sameh Ghwanmeh, Adel Hamdan, Ali Al-ibrahim, “Innovative Artificial Neural Networks- Based Decision Support System for Heart Diseases Diagnosis”, Journal of Intelligent Learning Systems and Applications, Vol. 5 No. 3, pp.176-183, 2013.

[13] Adel Hamdan, “Apply Two Feature Selections (Chi-square and Symmetric Uncertainty) Using C4.5 Classification Algorithm Based on Arabic dataset”, ISER-431st International Conference on Advances in Business Management and Information Science (ICABMIS), Istanbul, Turkey, 6th-7th September, 2018.

[14] Abdel salam obeidat, Sameh Ghwanmeh, Ali Al-ibrahim, Nidhal El-Omari, Adel Hamdan, “Performance and Effectiveness Examination of the IQE and AQE with Application on Arabic Content”, International Journal of Current Engineering and Technology, IJCET, June-2013, Vol.3, No.3, pp.795-797, 2013.

[15] Campbell, W.M., Assaleh, K.T., Broun, C.C., “A novel algorithm for training polynomial networks”, In: Int NAISO Symp Information Science Innovations ISI’2001, Dubai, UAE, March 2001. doi: http://dx.doi.org/10.1.1.28.5119, 2001.

[16] http://www.vcclab.org/lab/pnn/, Last Accessed: January 4, 2019.

[17] Rehab Duwairi, “Machine learning for Arabic text categorization”, Journal of American Society for Information Science and Technology (JASIST), Vol. 57, No. 8, pp. 1005-1010, 2005.

[18] Adel Hamdan, Tariq Alwada’n, Omar Al-Momani, “Arabic Text Categorization Using Support vector machine, Naïve Bayes and Neural Network”, GSTF Journal of Computing Volume 5, Issue 1; pp. 108-115, 2016.

[19] Adel Hamdan, Omar Al-Momani, Tariq Alwada’n, “Arabic Text Categorization Using k- nearest neighbour, Decision Trees (C4.5) and Rocchio Classifier: A Comparative Study”, International Journal of Current Engineering and Technology, Vol.6, No.2, April 2016.

[20] Adel Hamdan, “Comparing Two Feature Selections Methods (Information Gain And Gain Ratio) On Three Different Classification Algorithms Using Arabic dataset”, Journal of Theoretical and Applied Information Technology Vol. 96 Issue 6, 2018.

[21] Mayy M. Al-Tahrawi, “Polynomial Neural Networks versus Other Arabic Text Classifiers”, Journal of Software., Volume 11, Number 4, 2016.

[22] Majed Ismail Hussien, Fekry Olayah, Minwer AL-dwan & Ahlam Shamsan, “Arabic Text Classification Using SMO, naïve bayesian, j48 algorithms”, international journal of research and reviews in applied sciences, vol 9 Issue 2/IJRRAS_9_2_15.pdf, November 2011.

[23] Tarek Fouad Gharib, Mena Badieh Habib, and Zaki Taha Fayed, “Arabic Text Classification Using Support Vector Machines”, International Journal of Computers and Their Applications, Vol (16), Issue(4), 2009.

[24] Rasha Elhassan and Mahmoud Ahmed, “Arabic Text Classification review”, International Journal of Computer Science and Software Engineering (IJCSSE), Volume 4, Issue 1, January 2015.

[25] Ghazi Raho, Riyad Al-Shalabi, Ghassan Kanaan, Asma'a Nassar, “Different Classification Algorithms Based on Arabic Text Classification: Feature Selection Comparative Study”, (IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 6, No. 2, 2015.

[26] Raed Al-khurayji, Ahmed Sameh, “an effective Arabic text classification approach based on kernel naive bayes, classifier”, International Journal of Artificial Intelligence and Applications (IJAIA), Vol.8, No.6, November 2017.

[27] Khreisat, L., “Arabic Text Classification Using N-Gram Frequency Statistics: A Comparative Study”, In: Proceedings of the 2006 International Conference on Data Mining (DMIN 2006), June 26-29, Las Vegas, Nevada, USA, pp. 78-82, 2006.

[28] Kanaan, G., Al-Shalabi, R., Ghwanmeh, S., “A comparison of text-classification techniques applied to Arabic text”, J. Am. Soc. Inform. Sci. Technol. 60 (9), 1836-1844. http://dx.doi.org/10.1002/asi.v60:9, 2009.

[29] Fodil, L., Sayoud, H., Ouamour, S., “Theme classification of Arabic text: a statistical approach”, In: Terminology and Knowledge Engineering 2014, Berlin, Germany, pp. 77-86, 2014.

[30] https://www.aljazeera.net Last Accessed: December 4, 2018.

[31] https://www.spa.gov.sa/?lang=ar Last Accessed: December 4, 2018.

[32] http://www.alhayat.com Last Accessed: December 4, 2018.

[33] Liu, H. and Motoda, H., “Feature Extraction, Construction and Selection: A Data Mining Perspective”, Boston, Massachusetts (MA): Kluwer Academic Publishers, 1998.