The Improved Bayesian Algorithm to Spam Filtering

Hongling Wang, Gang Zheng, and Yueshun He

Abstract Although electronic mail is one of the most popular forms of communication in modern society, spam causes considerable inconvenience and harms network security, so resolving this issue has become an urgent task. The existing Bayesian algorithm uses a Bernoulli model to process text features when applied to spam filtering, but it is prone to misjudging normal mail because it does not distinguish the differing degrees of importance of various features. In this paper, an improved Bayesian algorithm is proposed that weights feature words and is based on minimum risk. Experimental results show that this algorithm can reduce the risk of misjudging normal mail and improve the accuracy of mail filtering.

Keywords Spam filtering • Bayesian algorithm • Weighting feature word • Minimum risk

5.1 Introduction

An investigative report on antispam conditions in China in the fourth quarter of 2013 was released recently, according to the Chinese Internet Association. It shows that 92.4 % of users use normal personal e-mail, 5.1 % of users use enterprise e-mail, and only 4.2 % of people do not use e-mail. The report also shows that successful receipt of messages, security, privacy, and antispam functions are the most important features of e-mail. Figures show that Chinese e-mail users receive about 14.6 pieces of spam every week. Spam makes up 37.3 % of the total, rising at an annual rate of 4 % in the fourth quarter of 2013 [1]. Spam ties up network resources, reduces the operating efficiency of networks, and consumes a considerable amount of the time, money, and energy of receivers; sometimes spam contains malicious content such as fraud and sexually explicit images, which have a harmful effect on society [2]. Antispam technologies must therefore be developed. Antispam methods commonly include black- or white-list technology, keyword filtering technology, Decision Tree, boosting technology, and naive Bayesian algorithms [3]. Bayesian algorithms are the most popular method because of their convenience of design, decision features, and low storage requirements [4]. However, they also present some problems; for instance, they cannot differentiate the importance of feature words and may misidentify normal e-mail as spam. To solve these problems and improve their filtering capabilities, a new Bayesian spam filtering algorithm that carries minimum risk and is based on the weighting of feature words is proposed in this paper.

H. Wang (*) • G. Zheng • Y. He

East China Institute of Technology, 330000 Nanchang, Jiangxi, China e-mail: [email protected]

© Springer International Publishing Switzerland 2015

W.E. Wong (ed.), Proceedings of the 4th International Conference on Computer Engineering and Networks, DOI 10.1007/978-3-319-11104-9_5

5.2 Principle of Bayesian Algorithms

A Bayesian algorithm is a classification technique used to predict the probability of a new event according to past events, as proposed by the mathematician Thomas Bayes [5]. When applied to text classification problems, it computes the probability of every category given a text and assigns the text to the category that is most likely to be correct. As its basic principle in spam filtering, the algorithm analyzes common keywords in a collection of spam, obtains a statistical distribution model, and calculates the probability that a particular e-mail is spam [6]. Here $d$ is the text, $P(c_i \mid d)$ is the probability that text $d$ belongs to category $c_i$, $P(c_i)$ is the prior probability of class $c_i$, $P(d \mid c_i)$ is the class-conditional probability of the text, and $P(d)$ is the probability that the text appears. $P(c_i \mid d)$ can be calculated as follows:

$$P(c_i \mid d) = \frac{P(c_i)\,P(d \mid c_i)}{P(d)}, \quad i = 1, 2, 3, \ldots, |C|. \tag{5.1}$$

The probability of the text, $P(d)$, can be calculated as follows:

$$P(d) = \sum_{i=1}^{|C|} P(c_i)\,P(d \mid c_i). \tag{5.2}$$
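As a minimal sketch of Eqs. (5.1) and (5.2), the posterior of each class can be obtained by normalizing the products of prior and class-conditional probability. The function name `posterior` and the numerical values are illustrative assumptions, not taken from the paper.

```python
def posterior(priors, likelihoods):
    """Return P(c_i | d) for every class via Eqs. (5.1) and (5.2).

    priors:      dict mapping class label -> P(c_i)
    likelihoods: dict mapping class label -> P(d | c_i)
    """
    # Eq. (5.2): P(d) is the sum over all classes of P(c_i) * P(d | c_i).
    evidence = sum(priors[c] * likelihoods[c] for c in priors)
    if evidence == 0:
        raise ValueError("P(d) is zero: the text is impossible under every class")
    # Eq. (5.1): divide each product P(c_i) * P(d | c_i) by P(d).
    return {c: priors[c] * likelihoods[c] / evidence for c in priors}


# Illustrative numbers only (assumed, not from the paper).
post = posterior({"spam": 0.4, "ham": 0.6}, {"spam": 0.02, "ham": 0.005})
print(post)
```

Because every posterior shares the same denominator $P(d)$, the ranking of the classes depends only on the numerators; the normalization matters when the result is compared with a fixed threshold.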

$P(c_i)$ can be estimated from the historical experience of the training set. Suppose $N_k$ represents the number of texts of class $c_i$ in the training set, and $N$ represents the total number of texts in the training set. $P(c_i)$ can be calculated as follows:

$$P(c_i) = \frac{N_k}{N}. \tag{5.3}$$

Then we obtain the class-conditional probability of the text, $P(d \mid c_i)$, from the class-conditional probabilities of the feature words in the text. This can be calculated as follows:

$$P(d \mid c_i) = \prod_{t=1}^{n} \Big[ B\,P(w_t \mid c_i) + (1 - B)\big(1 - P(w_t \mid c_i)\big) \Big], \tag{5.4}$$

where $B$ indicates whether or not the feature word $w_t$ appears in the text ($B = 1$ if it appears and $B = 0$ otherwise), and $P(w_t \mid c_i)$ is the probability that the feature word $w_t$ appears in a text of class $c_i$. Let $N_w$ be the number of texts in class $c_i$ in which the feature word $w_t$ appears, and let $N_c$ be the number of texts in class $c_i$. $P(w_t \mid c_i)$ can be calculated as follows:

$$P(w_t \mid c_i) = \frac{N_w}{N_c}. \tag{5.5}$$
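The Bernoulli model of Eqs. (5.4) and (5.5) can be sketched as follows. This is an illustration under stated assumptions: each text is represented as a set of words, the helper names `word_probabilities` and `bernoulli_likelihood` are hypothetical, and the toy texts are invented for the example.

```python
def word_probabilities(class_docs, vocabulary):
    """Eq. (5.5): P(w_t | c_i) = N_w / N_c for the texts of one class.

    class_docs: list of sets of words, one set per text of class c_i.
    """
    n_c = len(class_docs)
    if n_c == 0:
        raise ValueError("the class contains no training texts")
    # N_w is the number of texts of the class that contain the word.
    return {w: sum(1 for doc in class_docs if w in doc) / n_c for w in vocabulary}


def bernoulli_likelihood(doc_words, word_probs):
    """Eq. (5.4): product of one Bernoulli term per feature word."""
    likelihood = 1.0
    for w, p in word_probs.items():
        b = 1 if w in doc_words else 0  # B in Eq. (5.4)
        likelihood *= b * p + (1 - b) * (1 - p)
    return likelihood


# Toy spam class with four texts (assumed data).
spam_docs = [{"discount", "receipt"}, {"discount"}, {"meeting", "discount"}, {"receipt"}]
vocab = ["discount", "receipt", "meeting"]
probs = word_probabilities(spam_docs, vocab)
like = bernoulli_likelihood({"discount"}, probs)
print(probs, like)
```

Note that absent feature words also contribute to the product through the factor $1 - P(w_t \mid c_i)$, which is what distinguishes the Bernoulli model from a model that only multiplies the probabilities of the words present.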

Based on the preceding description, we can see the traditional Bayesian filtering approach does not take into consideration the differences between normal text classification and e-mail filtering; in addition, it does not consider the various features in normal e-mail and spam. Therefore, some improvements with regard to feature word detection are made in this paper by studying the Bayesian filtering process.

5.3 Filtering Algorithm Based on Feature Word Weighting

5.3.1 Filtering Process Using Bayesian Algorithm

Spam filtering is a two-class classification problem, and the final result is that e-mails are divided into two groups, normal e-mail (ham) and spam. When the e-mail system receives a new e-mail, it classifies this e-mail by calculating the probability $P(c_i \mid d)$, $i \in \{\mathrm{spam}, \mathrm{ham}\}$. The whole process includes two steps: training and classification.

Training process:

1. Build spam and ham sets by collecting many e-mails.

2. Extract individual token strings, such as discount and receipt, as feature words from the title and contents of every e-mail in the spam and ham sets. Then count the occurrences of each token and build the feature set $f = \{w_1, w_2, \ldots, w_n\}$.

3. Build individual hash tables for normal e-mail and spam: a ham hash table for normal e-mail and a spam hash table for spam, each storing the mapping from a feature word token string to its word frequency.

4. Calculate the class-conditional probability $P(w_t \mid c_i)$ for each feature word $w_t$ according to Eq. (5.5).

5. Calculate the prior probability $P(c_i)$ of each class based on Eq. (5.3).
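The training steps above can be sketched as follows. Several details are assumptions made for the illustration: the tokenizer (lowercase runs of letters) is not specified in the paper, each token is counted once per e-mail so that the stored frequency matches $N_w$ in Eq. (5.5), and the function names and sample e-mails are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    # Assumed tokenizer: the set of lowercase alphabetic tokens in the e-mail.
    return set(re.findall(r"[a-z]+", text.lower()))


def train(spam_mails, ham_mails):
    if not spam_mails or not ham_mails:
        raise ValueError("both training sets must be non-empty")
    # Step 3: one hash table per class, mapping token -> number of e-mails containing it.
    tables = {"spam": Counter(), "ham": Counter()}
    for label, mails in (("spam", spam_mails), ("ham", ham_mails)):
        for mail in mails:
            tables[label].update(tokenize(mail))  # step 2: extract feature words
    counts = {"spam": len(spam_mails), "ham": len(ham_mails)}
    total = counts["spam"] + counts["ham"]
    features = set(tables["spam"]) | set(tables["ham"])  # feature set f
    # Step 4: P(w_t | c_i) = N_w / N_c, Eq. (5.5).
    cond = {c: {w: tables[c][w] / counts[c] for w in features} for c in counts}
    # Step 5: P(c_i) = N_k / N, Eq. (5.3).
    priors = {c: counts[c] / total for c in counts}
    return priors, cond, tables


# Invented sample e-mails for the illustration.
spam_mails = ["Big discount on your receipt", "Discount discount discount"]
ham_mails = ["Meeting at noon", "Receipt for the meeting", "Lunch today"]
priors, cond, tables = train(spam_mails, ham_mails)
print(priors, cond["spam"]["discount"], cond["ham"]["meeting"])
```

A Python `Counter` is used here as the hash table; a word that never occurs in a class receives probability 0 in this unsmoothed estimate, which forces the whole product in Eq. (5.4) to 0 for any e-mail containing that word.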

Classification process:

1. Extract feature words from new e-mails.

2. Calculate the probability $P(c_{\mathrm{ham}} \mid d)$ of normal e-mail and $P(c_{\mathrm{spam}} \mid d)$ of spam given the extracted feature words $d$.

3. Classify this e-mail based on the results. When the value of $P(c_{\mathrm{spam}} \mid d)$ is greater than $P(c_{\mathrm{ham}} \mid d)$ or the threshold $\lambda$, this e-mail is tagged as spam.
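The classification steps can be sketched as follows. The sketch assumes that the threshold λ is an alternative to the direct comparison of the two posteriors (the text does not state how the two rules are combined), that an e-mail with zero probability under both classes is treated as ham, and that the probability values are invented; the function name `classify` is hypothetical.

```python
def classify(doc_words, priors, cond, threshold=None):
    """Tag an e-mail as 'spam' or 'ham' from its set of feature words.

    priors:    {"spam": P(c_spam), "ham": P(c_ham)}
    cond:      {"spam": {word: P(word | spam)}, "ham": {word: P(word | ham)}}
    threshold: optional lambda; if given, spam requires P(c_spam | d) > lambda.
    """
    joint = {}
    for c in ("spam", "ham"):
        p = priors[c]
        for w, pw in cond[c].items():
            p *= pw if w in doc_words else (1.0 - pw)  # Bernoulli term, Eq. (5.4)
        joint[c] = p
    evidence = joint["spam"] + joint["ham"]  # Eq. (5.2)
    if evidence == 0.0:
        return "ham", 0.0  # assumed fallback: do not flag without evidence
    p_spam = joint["spam"] / evidence  # Eq. (5.1)
    if threshold is None:
        is_spam = p_spam > 0.5  # with two classes, same as P(spam|d) > P(ham|d)
    else:
        is_spam = p_spam > threshold
    return ("spam" if is_spam else "ham"), p_spam


# Invented model parameters for the illustration.
demo_priors = {"spam": 0.4, "ham": 0.6}
demo_cond = {
    "spam": {"discount": 0.8, "meeting": 0.1},
    "ham": {"discount": 0.1, "meeting": 0.6},
}
label_a, p_a = classify({"discount"}, demo_priors, demo_cond)
label_b, p_b = classify({"discount"}, demo_priors, demo_cond, threshold=0.95)
label_c, p_c = classify({"meeting"}, demo_priors, demo_cond)
print(label_a, p_a, label_b, label_c)
```

In this sketch a threshold above 0.5 makes the filter more conservative: the same e-mail that is tagged as spam by direct comparison is left as ham when λ is raised to 0.95. For long feature lists the plain product can underflow, so a practical implementation would likely sum logarithms instead.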

Spam filtering faces two practical questions in its application: what is the consequence if spam is identified as normal e-mail, and what is the consequence if normal e-mail is considered spam? Treating spam as normal e-mail wastes a user’s time and energy; however, treating normal e-mail as spam could cause a user to miss important events, such as meetings.

5.3.2 Improved Algorithm Based on Minimum Risk
