AWERProcedia Information Technology & Computer Science

1 (2012) 1007-1012

2nd World Conference on Information Technology (WCIT-2011)

Turkish anti-spam filtering using binary and probabilistic models

Semih Ergin^a*, Efnan Sora Gunal^a, Huseyin Yigit^a, Rifat Aydin^a

^a Eskisehir Osmangazi University, Department of Electrical and Electronics Engineering, Eskisehir, 26480, Turkey

Abstract

In this paper, a Turkish anti-spam filter is implemented to detect text-based Turkish spam e-mails (junk or bulk e-mails). Because e-mails are extremely easy and cheap to send, they have gained tremendous popularity not only as a means of communicating with friends, but also as a medium for bombarding unsuspecting e-mail boxes with undesired e-mails, usually for advertisement. Spam e-mail is the general name for these undesired e-mails. In order to classify Turkish e-mails as spam or legitimate, a Turkish e-mail database containing examples of text-based spam and normal e-mails was first constructed. Secondly, the content of each e-mail was analyzed and the distinct words appearing in each e-mail were found. A stemmer function was then developed to determine the root form of each distinct word. The Mutual Information (MI) score of each stem was calculated, and two different types of feature vectors were constructed according to these MI scores. After feature vector extraction, a Bayesian classifier was used to categorize the e-mails as spam or legitimate using two distinct models, a binary model and a probabilistic model. In the learning (training) stage, 600 text-based Turkish e-mails (300 spam and 300 legitimate) were used, while 200 Turkish e-mails (100 spam and 100 legitimate) were classified in the test phase. The two models were tested individually; the probabilistic model achieved a success rate of 89%, whereas the binary model achieved a success rate of 93%.

Keywords: Spam filtering, Binary model, Probabilistic model, Bayesian classifier

Selection and peer review under responsibility of Prof. Dr. Hafize Keser.

1. Introduction

In recent years, internet technology has made daily life easier in a wide variety of fields, especially communication. Electronic mail (e-mail) is now the most common communication tool on the internet owing to its simplicity and low cost. Since an e-mail is extremely easy and cheap to send, it has gained enormous popularity not only as a means of communicating with people, but also as a medium for conducting electronic commerce. Unfortunately, the same reasons that made e-mail so popular also attracted direct marketers, who bombard e-mail boxes with undesired e-mails, particularly for advertisement purposes. Spam (junk, bulk, unsolicited) e-mail is the common name for these undesired e-mails [1]. In one study, America Online (AOL) stated that it had received 1.8 million spam e-mails until precautions were taken [2]. According to research performed by Symantec in 2002, 63% of people receive over 50 spam e-mails per week while 37% of them receive over 1000; 65% waste at least 10 minutes daily handling spam e-mails while 24% waste over 20 minutes [3].

* ADDRESS FOR CORRESPONDENCE: Semih Ergin, Eskisehir Osmangazi University, Department of Electrical and Electronics Engineering, Eskisehir, 26480, Turkey

E-mail address: [email protected] / Tel.: +90-222-239-3750 / Ext: 3265; fax: +90-222-229-0535.

As a result of this growing problem, automated methods to discriminate spam from legitimate e-mails are becoming necessary. To prevent spam e-mails, both users and administrators of e-mail systems use various anti-spam techniques. However, no technique is able to offer a complete solution to this problem, and each has trade-offs related to accuracy and processing time. The anti-spam techniques can be separated into two main groups: Static methods and dynamic methods [4]. Static methods are based on a predefined address list. For instance, an e-mail server allows a person to receive an e-mail only if his/her address is one of the recipient addresses. On the contrary, dynamic methods are more complicated. These methods take the contents of e-mails into consideration and adapt their filtering decisions with respect to the contents. Most of them utilize the common text categorization techniques by implementing machine learning methods [2].

In this study, a database containing examples of spam and legitimate e-mails in Turkish was first constructed. The content of each e-mail was analyzed and the distinct words occurring in all of the e-mails were determined. Then, a stemmer function specific to Turkish was employed to obtain the root forms of the distinct words. The Mutual Information (MI) scores of the distinct words were calculated, and two types of feature vectors were constructed according to these MI scores. Following this process, the binary and probabilistic models of the Bayesian classifier were tested, and their success rates were compared.

2. Morphological Stemming

Since Turkish is an agglutinative language, a new word can be derived by adding an affix (usually a suffix) to the root form of a Turkish word. A single word in an agglutinative language may correspond to a phrase made up of several words in a non-agglutinative language. This makes the language more complex to process than non-agglutinative languages such as English. In the classification process for agglutinative languages, the affixed forms of words are not suitable as features. Different affixed forms correspond to a common root form, and since they all express the same concept, the root forms of words can be used as features instead. Thus, a stemmer function is important to develop. According to [5], extracting the first five characters of Turkish words is an effective way to obtain their corresponding roots. The stemmer function extracts the first five characters as the root form of a word if the word has more than five characters. If the word has five or fewer characters, the stemmer takes all of its characters as the root form. The stemmer function applies this process to each distinct word in turn. This technique is called the 5-gram method in information retrieval studies [6]. As an example, if the word is ‘kitabım’, the stemmer function gives ‘kitab’ as its output.
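The truncation rule described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation; the function name `stem` and the `prefix_len` parameter are hypothetical and chosen only for this example.

```python
def stem(word, prefix_len=5):
    """Return the first `prefix_len` characters of `word` as its root form.

    Words with `prefix_len` or fewer characters are returned unchanged.
    Python strings are sequences of Unicode code points, so Turkish
    letters such as 'ı' or 'ş' each count as one character.
    """
    return word[:prefix_len]


if __name__ == "__main__":
    for w in ["kitabım", "fırsatlar", "çay"]:
        print(w, "->", stem(w))
```

Because slicing never raises on short strings, no separate branch is needed for words of five or fewer characters.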

3. Experimental Study

In the experimental study, spam and normal e-mail examples were collected from different e-mail accounts in order to compose a database. The database contains 400 spam e-mails and 400 normal e-mails. Three hundred e-mails of each class are used for training and the remaining one hundred e-mails of each class are used for testing. The size of the database may seem insufficient. The main reason for this limitation is that no comprehensive Turkish e-mail database had been published at the time of this study. Therefore, a database was built for this study.

In this study, the spam e-mail recognition process is based on the words that occur in e-mails. These words give sufficient information about whether an e-mail is spam or not; therefore, it is essential to determine the words that occur in each e-mail. For this purpose, the first step of the training phase is to determine all distinct words written in an e-mail. One of the difficulties encountered in this step is how the program can recognize words. Hence, an algorithm was developed to identify the distinct words.

In order to achieve this, all of the characters in an e-mail are converted to their character codes, and the codes that do not correspond to letters (punctuation marks, numbers, symbols, etc.) are ignored. Then, a matrix containing the distinct words is constructed. As an example, consider the sentence “ali9, akşam Çay iç” in an e-mail. All characters in this sentence are first converted to lowercase, giving “ali9, akşam çay iç”. The character code array of this lowercase sentence is “97 108 105 57 44 32 32 97 107 351 97 109 32 32 231 97 121 32 32 105 231 32”. The matrix of distinct words occurring in this sentence is

ali
akşam
çay
iç

The matrix of character codes of the distinct words occurring in the sentence “ali9, akşam çay iç” is

97 108 105
97 107 351 97 109
231 97 121
105 231
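The word extraction step can be illustrated with the following Python sketch. It is one possible reading of the procedure, not the authors' program: the function names are hypothetical, and the explicit handling of the Turkish dotted and dotless I before lowercasing is an assumption added here, since default Unicode lowercasing does not follow Turkish casing rules.

```python
def tokenize(text):
    """Lowercase the text, drop every non-letter character, split into words."""
    # Assumption: map Turkish capitals explicitly, because str.lower()
    # would turn 'I' into 'i' and 'İ' into 'i' plus a combining dot.
    lowered = text.replace("İ", "i").replace("I", "ı").lower()
    # Replace digits, punctuation and symbols with spaces so that
    # 'ali9,' becomes 'ali'.
    cleaned = "".join(ch if ch.isalpha() else " " for ch in lowered)
    return cleaned.split()


def distinct_words(text):
    """Return the distinct words of the text in order of first appearance."""
    return list(dict.fromkeys(tokenize(text)))


def char_codes(word):
    """Return the numeric code point of each character in the word."""
    return [ord(ch) for ch in word]


if __name__ == "__main__":
    words = distinct_words("ali9, akşam Çay iç")
    print(words)
    print([char_codes(w) for w in words])
```

The code points printed for the example sentence match the rows of the code matrix shown above, for instance 351 for ‘ş’ and 231 for ‘ç’.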

This process was applied to 600 e-mails (300 spam e-mails and 300 normal e-mails), so that 600 matrices, each containing the distinct words of one training e-mail, were obtained. Using these matrices, the matrix containing all of the distinct words in all training e-mails was composed; ultimately, 13651 distinct words were found. Then, the stemmer function was applied to each distinct word. In order to select the roots with the most discriminative power, the concept of Mutual Information (MI) was used [7]. In probability and information theory, the MI of two random variables is a quantity that measures the mutual dependence of the two variables [7]. The formulation of MI is given below:

$$MI(X;Y) = \sum_{y \in Y} \sum_{x \in X} p(x,y)\,\log\frac{p(x,y)}{p(x)\,p(y)} \qquad (1)$$

MI scores of all 13651 different words have been calculated, and then 50 words with the highest MI scores have been selected to form the first type feature vector. Afterwards, 75 words with the highest MI scores have been selected to constitute the second feature vector type. These feature vectors are given, respectively:

The first type of feature vector including 50 words = [tl; fırsa; indir; bülte; tıkla; almak; konak; buray; firsa; sadec; paket; kahva; hotel; dahil; insan; istem; adet; özel; kargo; gece; sipar; seans; kendi; masaj; şarap; bunla; üründ; söyle; geçer; oluşa; eşliğ; cadde; başka; bilet; ama; günü; menü; kdv; ücret; düşün; spa; tatil; büfe; menüs; kisis; resta; içece; ilgil; ürün; sauna].

The second type of feature vector including 75 words = [tl; fırsa; indir; bülte; tıkla; almak; konak; buray; firsa; sadec; paket; kahva; hotel; dahil; insan; istem; adet; özel; kargo; gece; sipar; seans; kendi; masaj; şarap; bunla; üründ; söyle; geçer; oluşa; eşliğ; cadde; başka; bilet; ama; günü; menü; kdv; ücret; düşün; spa; tatil; büfe; menüs; kisis; resta; içece; ilgil; ürün; sauna; istan; bile; ödeye; bunun; seçen; bakım; bağda; saklı; eğiti; bayan; ikram; keyfi; onlar; hediy; tabağ; olmad; anlam; bazı; kişil; hakla; hisse; onu; bunu; ortay; yerin]

The training phase (learning module) was completed by constructing these two types of feature vectors.
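The MI-based selection of feature words can be sketched as follows. The paper does not state which two random variables enter Eq. (1); this sketch assumes, following the treatment in [7], that one variable indicates whether a stem occurs in an e-mail and the other is the class (spam or legitimate). The function names, the base-2 logarithm, and the alphabetical tie-breaking are assumptions made for this example only.

```python
import math
from collections import Counter


def mutual_information(n11, n10, n01, n00):
    """MI between stem presence and class, from document counts.

    n11: spam e-mails containing the stem, n10: legitimate e-mails containing it,
    n01: spam e-mails without it,        n00: legitimate e-mails without it.
    """
    n = n11 + n10 + n01 + n00
    cells = [
        (n11, n11 + n10, n11 + n01),  # present, spam
        (n10, n11 + n10, n10 + n00),  # present, legitimate
        (n01, n01 + n00, n11 + n01),  # absent, spam
        (n00, n01 + n00, n10 + n00),  # absent, legitimate
    ]
    total = 0.0
    for joint, term_marginal, class_marginal in cells:
        if joint > 0:  # a zero joint count contributes nothing to the sum
            total += (joint / n) * math.log2(n * joint / (term_marginal * class_marginal))
    return total


def select_features(spam_docs, legit_docs, k):
    """Return the k stems with the highest MI scores (ties broken alphabetically)."""
    spam_df = Counter(s for doc in spam_docs for s in set(doc))
    legit_df = Counter(s for doc in legit_docs for s in set(doc))
    scores = {}
    for s in set(spam_df) | set(legit_df):
        n11, n10 = spam_df[s], legit_df[s]
        scores[s] = mutual_information(n11, n10, len(spam_docs) - n11, len(legit_docs) - n10)
    return sorted(scores, key=lambda s: (-scores[s], s))[:k]


if __name__ == "__main__":
    spam = [{"fırsa", "indir"}, {"fırsa", "ama"}]
    legit = [{"ama"}, {"ama", "kendi"}]
    print(select_features(spam, legit, 1))
```

With k set to 50 or 75, the same routine would yield feature vectors of the two sizes used in this study, under the stated assumptions.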

In this study, two distinct models of Bayesian classification [8], the binary model and the probabilistic model, were used to perform spam e-mail recognition. These methods can be regarded as “semi-original” models based mainly on the classical methods of discrete Bayesian filtering. The binary model is based on whether or not a word occurs in an e-mail. The score for an e-mail with a feature vector X belonging to class C_i (i = 1, 2) was calculated by the formula:

$$P(C_i \mid X) = \sum_{j=1}^{n} \begin{cases} c\,P_{ij}, & \text{if the } j\text{th word of the feature vector occurs in the e-mail} \\ -P_{ij}, & \text{otherwise} \end{cases} \qquad (2)$$

Here, P_ij is obtained by dividing the number of e-mails in class C_i containing the jth word by the total number of e-mails in class C_i, n is the dimension of the feature vector, and c is the coefficient level. c was taken as 10, 20, 30, 40, and 50, respectively. In e-mails, the occurrence of an input word usually gives stronger evidence for classification than the non-occurrence of that word. One hundred legitimate (normal) test e-mails and one hundred spam test e-mails were classified using the Bayesian classifier, and the recognition rates are presented in Table 1.

Table 1. Recognition rates for binary model based classification (%)

Coefficient level (c)   Feature vector size 50   Feature vector size 75
                        Spam    Normal           Spam    Normal
10                      88      98               88      99
20                      90      94               90      95
30                      95      88               95      92
40                      95      85               95      87
50                      95      81               95      85
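The binary model scoring can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' code: it assumes that the class score is a sum over the n feature words, that a word present in the e-mail contributes c times P_ij, and that an absent word contributes a penalty of minus P_ij. The function names and the rule of assigning ties to the legitimate class are hypothetical.

```python
def estimate_p(class_docs, features):
    """P_ij: fraction of the e-mails of one class that contain each feature word."""
    return [sum(f in doc for doc in class_docs) / len(class_docs) for f in features]


def binary_score(email_stems, p_i, features, c):
    """Score of one class for an e-mail, using only word presence or absence."""
    present = set(email_stems)
    return sum(c * p if f in present else -p for f, p in zip(features, p_i))


def classify_binary(email_stems, p_spam, p_legit, features, c):
    """Assign the class with the higher score (ties go to 'legitimate' by assumption)."""
    spam_score = binary_score(email_stems, p_spam, features, c)
    legit_score = binary_score(email_stems, p_legit, features, c)
    return "spam" if spam_score > legit_score else "legitimate"


if __name__ == "__main__":
    features = ["fırsa", "ama"]
    p_spam = estimate_p([{"fırsa"}, {"fırsa", "ama"}], features)
    p_legit = estimate_p([{"ama"}, {"kendi"}], features)
    print(classify_binary({"fırsa"}, p_spam, p_legit, features, c=10))
    print(classify_binary({"kendi"}, p_spam, p_legit, features, c=10))
```

A larger coefficient level c gives more weight to words that are present, which is one way to read the shift between spam and normal recognition rates across the rows of Table 1.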

In the probabilistic model, the number of occurrences of a word in an e-mail is taken into account. In this case, the score for an e-mail with a feature vector X belonging to class C_i (i = 1, 2) was calculated as follows:

$$P(C_i \mid X) = \sum_{j=1}^{n} \begin{cases} c\,P_{ij}\,H_j, & \text{if the } j\text{th word of the feature vector occurs in the e-mail} \\ -P_{ij}, & \text{otherwise} \end{cases} \qquad (3)$$

Here, H_j is the number of occurrences of the jth word in an e-mail. As in the binary model experiment, one hundred legitimate test e-mails and one hundred spam test e-mails were classified by the Bayesian classifier, and the recognition rates are presented in Table 2.

Table 2. Recognition rates for probabilistic model based classification (%)

Coefficient level (c)   Feature vector size 50   Feature vector size 75
                        Spam    Normal           Spam    Normal
10                      86      87               86      90
20                      88      83               86      81
30                      90      83               89      80
40                      90      83               91      80
50                      90      83               91      80
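The probabilistic model differs from the binary model only in weighting each present word by its occurrence count H_j. The sketch below makes the same assumptions as a sum-based reading of the score (present words add c times P_ij times H_j, absent words subtract P_ij); the function name and the input format, a list of stems with repetitions, are hypothetical.

```python
from collections import Counter


def probabilistic_score(email_stems, p_i, features, c):
    """Score of one class for an e-mail, weighting present words by their counts.

    `email_stems` is the list of stems in the e-mail, with repetitions,
    so that H_j can be counted for every feature word.
    """
    counts = Counter(email_stems)
    score = 0.0
    for f, p in zip(features, p_i):
        h = counts[f]  # H_j: occurrences of the jth feature word in this e-mail
        score += c * p * h if h > 0 else -p
    return score


if __name__ == "__main__":
    features = ["fırsa", "ama"]
    p_spam = [1.0, 0.5]
    email = ["fırsa", "fırsa", "kendi"]
    print(probabilistic_score(email, p_spam, features, c=10))
```

When every feature word occurs at most once in an e-mail, this score coincides with the binary score, so the two models differ only on e-mails that repeat feature words.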

4. Conclusion

In this study, two different models, a binary model and a probabilistic model, were used to perform Turkish spam e-mail recognition. In order to classify Turkish e-mails as spam or legitimate, a Turkish e-mail database containing examples of text-based spam and legitimate e-mails was first constructed. The database, comprising four hundred spam e-mails and four hundred normal e-mails, was collected from different e-mail accounts. Three hundred e-mails of each class (spam or legitimate) were used for training and the remaining one hundred e-mails of each class were used for testing. The size of the database may seem insufficient. The main reason for this limitation is that no comprehensive Turkish e-mail database had been published at the time of this study. Therefore, a database has been presented to researchers who will study not only Turkish spam filtering but also Turkish information retrieval.

Secondly, the content of each e-mail was analyzed and the distinct words appearing in the e-mails were found. A stemmer function was developed to determine the root form of each distinct word. The Mutual Information (MI) score of each stem was calculated, and two different types of feature vectors were constructed according to the MI scores. After feature vector extraction, a Bayesian classifier was used to categorize the e-mails as spam or legitimate using two distinct models. The binary model achieved its best success rates of about 92% when the feature vector size is fifty and about 93% when the feature vector size is seventy-five. The probabilistic model always gave worse recognition rates than the binary model, except for one situation in which the feature vector size is fifty and the coefficient level is 10. The experiments have also shown that the maximum success rates of these two models were obtained when the coefficient level is 10 and 30. In future studies, the constructed database will be expanded, and different recognition methods, such as decision trees and artificial neural networks (ANNs), will be implemented to perform Turkish spam e-mail recognition.

References

[1] Z. Gyongyi and H. Garcia-Molina, Web spam taxonomy, Proc. of the First Int. Workshop on Adversarial Information Retrieval on the Web, Chiba, Japan, (2005).

[2] L. Ozgur, T. Gungor and F. Gurgen, Adaptive anti-spam filtering for agglutinative languages: a special case for Turkish, Pattern Recognition Letters, 25(16), pp. 1819–1831, (2004).

[3] H. Shanmugasundaram, Intelligent e-mail personalization system, Journal of Engineering Science and Technology, 6(1), pp. 50-60, (2011).

[4] L. Ozgur, T. Gungor and F. Gurgen, Spam mail detection using artificial neural network and bayesian filter, Proc. of the Int. Conf. on Intelligent Data Engineering and Automated Learning, pp. 505-510, (2004).

[5] H. Sever and Y. Tonta, Truncation of content terms for Turkish, Proc. of 7th Int. Conf. on Intelligent Text Processing and Computational Linguistics, Mexico City, Mexico, (2006).

[6] W. B. Cavnar and J. M. Trenkle, N-Gram based text categorization, Proc. of 3rd Annual Symposium on Document Analysis and Information Retrieval, pp. 161-175, (1994).

[7] C. D. Manning, P. Raghavan and H. Schütze, Introduction to Information Retrieval, Cambridge University Press, Cambridge, United Kingdom, (2008).

[8] I. Androutsopoulos, J. Koutsias, K. V. Chandrinos, G. Paliouras and G. D. Spyropoulos, An evaluation of naive Bayesian anti-spam filtering, Proc. of Workshop on Machine Learning in the New Information Age, pp. 9-17, Barcelona, (2000).