A Bayesian Topic Model for Spam Filtering ⋆

Available at http://www.joics.com

Zhiying Zhang, Xu Yu, Lixiang Shi, Li Peng, Zhixing Huang ∗

School of Computer and Information Science, Southwest University, Chongqing 400715, China

Abstract

Spam is one of the major problems of today’s Internet because it causes financial damage to companies and annoys individual users. Among the approaches developed to detect spam, content-based machine learning algorithms are important and popular. However, these algorithms are trained on statistical representations of the terms that usually appear in e-mails, and they are unable to account for the underlying semantics of terms within the messages. In this paper, we present a Bayesian topic model to address these limitations. We explore the use of semantics in spam filtering by representing e-mails as vectors of topics obtained with a topic model, Latent Dirichlet Allocation (LDA). Based upon this representation, the relationship between the topics and spam can be discovered by using a Bayesian method. We test this model on the Enron-Spam datasets, and the results show that the proposed model performs better than the baseline and can detect the internal semantics of spam messages.

Keywords: Spam Detection; Latent Dirichlet Allocation; Bayesian Topic Model

1 Introduction

Electronic mail (E-mail) is one of the most important and powerful means of modern communication. However, in the past decade E-mail users have been plagued by spam, also known as junk email or Unsolicited Bulk Email (UBE). Spam causes many problems: it makes users waste time looking through and sorting out additional emails [1], causes financial loss to companies through misuse of traffic, storage space and computational power [1], and brings security and legal problems by spreading malicious software and advertising pornography, pyramid schemes, etc. [2].

Many techniques have been proposed to deal with spam. Content-based machine learning algorithms are important and popular; they include algorithms that are considered top performers in text classification, such as Boosting [3], Support Vector Machines [4, 5, 6], and Bayesian methods [7, 8].

⋆ Project supported by Natural Science Foundation Project of CQ CSTC (No. CSTC2012JJB40012), Scientific Research Foundation for the Returned Overseas Chinese Scholars (No. 20091001) and Fundamental Research Funds for the Central Universities (No. SWU1309265).

∗ Corresponding author.

Email address: [email protected] (Zhixing Huang).


Although E-mails are usually represented as a sequence of words, there are relationships between words on a semantic level that also affect E-mails [9]. However, content-based machine learning algorithms are trained on statistical representations of the terms that usually appear in E-mails and are unable to account for the underlying semantics of the E-mails. To address these limitations, Santos et al. [9] proposed representing E-mails with the enhanced Topic-based Vector Space Model (eTVSM) and achieved a satisfactory result on the Ling-Spam dataset. However, eTVSM is an ontology-based method, which may limit its effectiveness when it encounters more complicated unseen messages. Furthermore, Ling-Spam has the disadvantage that its ham messages are more topic-specific, which could lead to over-optimistic estimates of the performance of learning-based spam filters.

In contrast, we present a Bayesian topic model that uses the topic model Latent Dirichlet Allocation (LDA) [10] to mine the semantics of E-mails. LDA is a generative probabilistic model of a corpus and is therefore not limited by the weaknesses of an ontology. LDA models every document as a distribution over topics, and every topic as a distribution over words. These topics can reflect the semantics of a document better than terms do. The basic idea of our approach is to use a previously estimated LDA model to make inference on new unseen E-mails and obtain the topic distribution of each E-mail. Hence, each E-mail can be treated as a vector of topics rather than terms. Because the topics have a deeper relationship with the content of an E-mail, we can then use a Bayesian method to discover the relationship between the topics and spam. A more detailed description is given in Section 3.

Our model resembles the method proposed by Bíró et al. [11] in that both use LDA; however, the model we present differs from their method.

The remainder of this paper is organized as follows. Section 2 introduces the basic theory. Section 3 describes the proposed methodology. Section 4 details the performed experiments and presents the results. Finally, Section 5 concludes and outlines avenues for future work.

2 Basic Theory

The basic theory includes the LDA topic model, which is used to obtain the topic distributions of E-mails, and a Bayesian method that discovers the relationship between words and spam. A modification of this Bayesian method is used in our approach to discover the relationship between topics and spam.

2.1 Latent Dirichlet Allocation

There are D documents of arbitrary length. A document d is a vector of N_d words, W_d, where each word w_{id} is chosen from a vocabulary of size V. The generation of a document collection in LDA is then modeled as a three-step process. First, for each document, a distribution over topics is sampled from a Dirichlet distribution. Second, for each word in the document, a single topic is chosen according to this distribution. Finally, each word is sampled from a multinomial distribution over words specific to the sampled topic.

This generative process corresponds to the hierarchical Bayesian model shown (using plate notation) in Fig. 1.

Fig. 1: The hierarchical Bayesian model for LDA (plate diagram; nodes α, β, ϕ, θ, z, w; plates D, N_d, T)

In this model, ϕ denotes the matrix of topic distributions, with a multinomial distribution over V vocabulary items for each of T topics being drawn independently from a symmetric Dirichlet(β) prior. θ is the matrix of document-specific mixture weights for these T topics, each being drawn independently from a symmetric Dirichlet(α) prior. For each word, z denotes the topic responsible for generating that word, drawn from the θ distribution for that document, and w is the word itself, drawn from the topic distribution ϕ corresponding to z.
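The generative story above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the estimation procedure used in the paper: the function names, the fixed document length (the text allows arbitrary N_d) and the toy parameter values are all hypothetical, and Dirichlet draws are obtained by normalizing Gamma draws from the standard library.

```python
import random


def sample_dirichlet(concentration, size, rng):
    """Draw one sample from a symmetric Dirichlet via normalized Gamma draws."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]


def generate_corpus(num_docs, doc_length, num_topics, vocab_size, alpha, beta, seed=0):
    """Generate documents of word ids by following the LDA generative process."""
    rng = random.Random(seed)
    # phi[t]: word distribution of topic t, drawn from Dirichlet(beta).
    phi = [sample_dirichlet(beta, vocab_size, rng) for _ in range(num_topics)]
    theta, docs = [], []
    for _ in range(num_docs):
        # theta_d: topic mixture of this document, drawn from Dirichlet(alpha).
        theta_d = sample_dirichlet(alpha, num_topics, rng)
        words = []
        for _ in range(doc_length):  # assumption: every document has the same length
            z = rng.choices(range(num_topics), weights=theta_d)[0]  # topic of this word
            w = rng.choices(range(vocab_size), weights=phi[z])[0]   # word drawn from topic z
            words.append(w)
        theta.append(theta_d)
        docs.append(words)
    return phi, theta, docs


phi, theta, docs = generate_corpus(num_docs=3, doc_length=8, num_topics=4,
                                   vocab_size=10, alpha=0.5, beta=0.1)
```

Each row of `theta` sums to one and plays the role of a document's topic vector; estimation runs this process in reverse, recovering ϕ and θ from observed words.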

Estimating ϕ and θ provides information about the topics that participate in a corpus and about the weights of those topics in each document, respectively. A variety of algorithms have been used to estimate these parameters, from basic expectation-maximization [12] to approximate inference methods such as variational EM [10], expectation propagation [13], and Gibbs sampling [14].

2.2 The Bayesian Method

The Bayesian method proposed by Paul Graham [7] is very different from any form of Naive Bayes classifier [15, 16, 17, 18] and is able to greatly improve the false positive rate. In this paper, this method is referred to as the PG Bayesian classifier. The PG Bayesian classifier can discover the relationship between words and spam.

Each word in the E-mail, or only the most interesting words, contributes to the E-mail’s spam probability. The contribution of one word, also called the “spamicity” of the word, is calculated using Bayes’ theorem:

$$p(s \mid w) = \frac{p(w \mid s)\,p(s)}{p(w \mid s)\,p(s) + p(w \mid h)\,p(h)} \qquad (1)$$

In Eq. (1), p(s) is the overall probability that any given E-mail is spam, and p(h) is the overall probability that any given E-mail is ham. p(w|s) is the probability that the given word appears in spam training E-mails, which can be estimated by dividing the number of spam training E-mails that contain this word by the total number of spam training E-mails. p(w|h) is the probability that the given word appears in ham training E-mails, which can be estimated by dividing the number of ham training E-mails that contain this word by the total number of ham training E-mails.

The PG Bayesian classifier assumes that there is no a priori reason for any incoming message to be spam rather than ham, and therefore sets p(s) = p(h) = 0.5. This assumption simplifies Eq. (1) to:

$$p(s \mid w) = \frac{p(w \mid s)}{p(w \mid s) + p(w \mid h)} \qquad (2)$$
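As a minimal sketch of the estimate described above, the word spamicity under equal priors can be computed from per-class document frequencies. The function name, the representation of each E-mail as a set of words, the toy data, and the neutral value returned for a word seen in neither class are assumptions made for illustration; they are not taken from the paper.

```python
def word_spamicity(word, spam_emails, ham_emails):
    """p(s|w) with p(s) = p(h) = 0.5, using per-class document frequencies."""
    # Fraction of spam (ham) training E-mails that contain the word.
    p_w_s = sum(word in email for email in spam_emails) / len(spam_emails)
    p_w_h = sum(word in email for email in ham_emails) / len(ham_emails)
    if p_w_s + p_w_h == 0:
        return 0.5  # assumption: a word seen in neither class is treated as neutral
    return p_w_s / (p_w_s + p_w_h)


# Toy training sets (hypothetical); each E-mail is a set of words.
spam = [{"free", "offer"}, {"free", "money"}, {"meeting", "free"}, {"offer"}]
ham = [{"meeting", "agenda"}, {"free", "lunch"}, {"report"}, {"agenda"}]

print(word_spamicity("free", spam, ham))    # 0.75
print(word_spamicity("agenda", spam, ham))  # 0.0
```

Here "free" occurs in 3 of 4 spam E-mails and 1 of 4 ham E-mails, so its spamicity is 0.75 / (0.75 + 0.25) = 0.75.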

The PG Bayesian classifier also assumes that the words present in an E-mail are independent events. With that assumption, one can derive another equation from Bayes’ theorem to calculate the probability that the E-mail is spam by taking into consideration N words of the E-mail:

$$p = \frac{p_1 p_2 \cdots p_N}{p_1 p_2 \cdots p_N + (1 - p_1)(1 - p_2) \cdots (1 - p_N)} \qquad (3)$$

In Eq. (3), p indicates how sure the filter is that the E-mail is spam, and p_n (n = 1, ..., N) is the probability p(s|w_n). The result p is usually compared to a given threshold to decide whether the message is spam or not. If p is lower than the threshold, the message is considered likely ham; otherwise it is considered likely spam.
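A small worked instance may help to see how the combination behaves. The three spamicity values below are made up for illustration and do not come from the experiments:

```latex
% Hypothetical word spamicities: p_1 = 0.9, p_2 = 0.8, p_3 = 0.3
\begin{align*}
p_1 p_2 p_3 &= 0.9 \times 0.8 \times 0.3 = 0.216 \\
(1 - p_1)(1 - p_2)(1 - p_3) &= 0.1 \times 0.2 \times 0.7 = 0.014 \\
p &= \frac{0.216}{0.216 + 0.014} \approx 0.939
\end{align*}
```

With a threshold of 0.5 this message would be considered likely spam: the two strongly spam-leaning words outweigh the single ham-leaning one.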

3 Proposed Methodology

Assume that there is a training set consisting of S spam E-mails and H ham E-mails, and a test set consisting of N new unseen E-mails. First, a Unified LDA model with T topics is estimated on the overall training set using LDA. Second, this Unified LDA model is used to make inference separately for the spam training set, the ham training set and the new unseen E-mails. This yields three LDA models: the Spam LDA model with the topic distribution θ^(s) of spam E-mails, the Ham LDA model with the topic distribution θ^(h) of ham E-mails, and the New-emails LDA model with the topic distribution θ^(n) of new E-mails. All three models have T topics, which are consistent with the topics of the Unified LDA model. Third, each E-mail e can be represented as a vector ⃗e = ⟨z_1, ..., z_T⟩. Each z_i (i = 1, ..., T) has a value that is the probability that topic z_i occurs in this E-mail, i.e. p(z_i|e). This value can be read directly from the corresponding matrix θ.

It is natural to expect that some of the T topics are more relevant to spam E-mails and some are more relevant to ham E-mails. In other words, the topics that are more relevant to spam E-mails will have a higher probability in each spam training E-mail, and the topics that are more relevant to ham E-mails will have a higher probability in each ham training E-mail. This means that each of the T topics has a “spamicity”, just as words do. Following Eq. (2), the spamicity of a topic z_i (i = 1, ..., T) is:

$$p(s \mid z_i) = \frac{p(z_i \mid s)}{p(z_i \mid s) + p(z_i \mid h)} \qquad (4)$$

The challenge is how to calculate the probabilities p(z_i|s) and p(z_i|h). The spam training set has S spam E-mails in total. Given each spam training E-mail e_j (j = 1, ..., S), the probability of each topic z_i can be read from the matrix θ^(s), i.e. p(z_i|e_j, s) = θ^(s)_{j,i}. We define the probability of each E-mail e_j as p(e_j|s) = 1/S. Hence we can compute p(z_i|s) by using the law of total probability:

$$p(z_i \mid s) = \sum_{j=1}^{S} p(z_i \mid e_j, s)\, p(e_j \mid s) = \frac{1}{S} \sum_{j=1}^{S} \theta^{(s)}_{j,i} \qquad (5)$$

The probability p(z_i|h) can be calculated in the same way, and Eq. (4) then becomes:

$$p(s \mid z_i) = \frac{\frac{1}{S}\sum_{j=1}^{S}\theta^{(s)}_{j,i}}{\frac{1}{S}\sum_{j=1}^{S}\theta^{(s)}_{j,i} + \frac{1}{H}\sum_{j=1}^{H}\theta^{(h)}_{j,i}} \qquad (6)$$
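Eqs. (5) and (6) amount to averaging the columns of the two topic matrices and normalizing. The sketch below assumes that each matrix is given as a list of rows, one per E-mail, with one entry per topic; the function names, the toy matrices and the neutral fallback for a topic with zero mass in both classes are illustrative assumptions.

```python
def topic_class_probability(theta):
    """Eq. (5): average each topic's weight over the E-mails of one class."""
    num_emails = len(theta)
    num_topics = len(theta[0])
    return [sum(row[i] for row in theta) / num_emails for i in range(num_topics)]


def topic_spamicity(theta_spam, theta_ham):
    """Eq. (6): p(s|z_i) for every topic, from the spam and ham topic matrices."""
    p_z_s = topic_class_probability(theta_spam)
    p_z_h = topic_class_probability(theta_ham)
    # Assumption: a topic with zero mass in both classes is treated as neutral.
    return [s / (s + h) if s + h > 0 else 0.5 for s, h in zip(p_z_s, p_z_h)]


# Toy matrices (hypothetical): 2 spam and 2 ham E-mails, T = 3 topics.
theta_spam = [[0.7, 0.2, 0.1], [0.5, 0.3, 0.2]]
theta_ham = [[0.1, 0.3, 0.6], [0.3, 0.2, 0.5]]
spamicity = topic_spamicity(theta_spam, theta_ham)
```

In this toy case the first topic averages 0.6 in spam and 0.2 in ham, giving a spamicity of 0.75, while the second topic is equally common in both classes and lands at 0.5.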

Each E-mail e_j (j = 1, ..., N) of the N new unseen E-mails is likewise represented as ⃗e_j = ⟨z_1, ..., z_T⟩, where the value of each z_i is p(z_i|e_j) = θ^(n)_{j,i}. We then need to select the top k most representative topics of e_j to calculate the probability that E-mail e_j is spam. This is achieved by applying the following algorithm to each E-mail e_j:

(1) For each topic z_i of the T topics, if 0.45 < p(s|z_i) < 0.55, add topic z_i to the CandidateTopicSet (CTS).

(2) Sort the topics of e_j in descending order of p(z_i|e_j), and save the result as the TopicOrderList (TOL).

(3) For each topic in TOL, in order, add the topics that are also in CTS to the AvailableTopicList (ATL).

(4) Select the top k topics from ATL.
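The four steps can be sketched as a single function. The function and parameter names are hypothetical. The candidate test of step (1) is exposed as a parameter whose default reproduces the range as printed; since that range keeps topics whose spamicity is close to 0.5, a reader who interprets the step as keeping topics outside the range can pass a different predicate, as the second call shows.

```python
def select_top_k_topics(email_topics, spamicity, k, is_candidate=None):
    """Return the indices of the top k available topics for one E-mail.

    email_topics[i] is p(z_i|e_j); spamicity[i] is p(s|z_i).
    """
    if is_candidate is None:
        # Default: the condition exactly as printed in step (1).
        is_candidate = lambda p: 0.45 < p < 0.55
    # Step (1): CandidateTopicSet (CTS).
    cts = {i for i, p in enumerate(spamicity) if is_candidate(p)}
    # Step (2): TopicOrderList (TOL), topics by descending p(z_i|e_j).
    tol = sorted(range(len(email_topics)), key=lambda i: email_topics[i], reverse=True)
    # Step (3): AvailableTopicList (ATL), preserving the order of TOL.
    atl = [i for i in tol if i in cts]
    # Step (4): keep the first k entries.
    return atl[:k]


# Toy values (hypothetical) for one E-mail with T = 5 topics.
email_topics = [0.05, 0.40, 0.10, 0.30, 0.15]
spamicity = [0.50, 0.90, 0.48, 0.52, 0.10]

as_printed = select_top_k_topics(email_topics, spamicity, k=2)
outside_range = select_top_k_topics(
    email_topics, spamicity, k=2,
    is_candidate=lambda p: p <= 0.45 or p >= 0.55,
)
print(as_printed)     # [3, 2]
print(outside_range)  # [1, 4]
```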

Then, the probability that the E-mail e_j is spam can be computed by taking into consideration all of the top k topics. Following the form of Eq. (3), we derive the final equation:

$$p(s \mid e_j) = \frac{\prod_{i=1}^{k} p(s \mid z_i)}{\prod_{i=1}^{k} p(s \mid z_i) + \prod_{i=1}^{k} \bigl(1 - p(s \mid z_i)\bigr)} \qquad (7)$$
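A minimal sketch of this final scoring step follows, assuming the spamicities of the k selected topics are already available as a list. The function names, the example values and the handling of a zero denominator are assumptions made for illustration.

```python
import math


def email_spam_probability(selected_spamicities):
    """Combine the spamicities p(s|z_i) of the selected topics into p(s|e_j)."""
    spam_side = math.prod(selected_spamicities)
    ham_side = math.prod(1.0 - p for p in selected_spamicities)
    if spam_side + ham_side == 0:
        return 0.5  # assumption: contradictory 0/1 evidence is treated as neutral
    return spam_side / (spam_side + ham_side)


def classify_email(selected_spamicities, threshold=0.5):
    """Label the E-mail as ham if its probability is below the threshold, else spam."""
    if email_spam_probability(selected_spamicities) < threshold:
        return "ham"
    return "spam"


print(classify_email([0.92, 0.85]))  # spam
print(classify_email([0.10, 0.20]))  # ham
```

Note that an empty selection gives a probability of exactly 0.5, because both empty products equal 1; with a threshold of 0.5 and the rule above, such an E-mail would be labeled spam.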

4 Experiment and Evaluation

4.1 Datasets and Experimental Setup

We use six datasets, collectively called the Enron-Spam datasets, which were developed by Metsis et al. [8] and, like Ling-Spam and SpamAssassin, are publicly available and non-encoded. Each of the six Enron-Spam datasets consists of a ham set and a spam set, with each message in a separate text file. The ham collections of these six datasets were obtained from six Enron users, and each was paired with a spam collection. Hereafter, we refer to the six Enron-Spam datasets as Enron 1, Enron 2, ..., Enron 6, respectively.

GibbsLDA++ by Phan and Nguyen [19] is used for estimation and inference on the datasets. The Dirichlet parameter β is fixed at 0.1 throughout, and α = 50/T throughout, where T is the number of topics of the LDA model. We experiment with T ∈ {10, 20, 50, 100, 200}. Gibbs sampling is stopped after 2000 steps for estimation on the Unified training set, and after 1000 steps for inference on the Ham training set, Spam training set, and Test set.

In the testing phase, the top k most representative topics are selected for each E-mail to calculate the probability that the test E-mail is spam. We experiment with k = 1, ..., Length(ATL) − 1. Each learning model of each dataset is denoted as M_{T,k}. The threshold is 0.5: if the probability of the E-mail is lower than the threshold, it is considered likely ham; otherwise it is considered likely spam. 10-fold cross validation is applied in our experiments.
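For readers who want to reproduce the protocol, a 10-fold split can be sketched as follows. The text does not say how the folds were formed, so the shuffled, interleaved partition, the function name and the seed below are assumptions.

```python
import random


def k_fold_indices(num_items, num_folds=10, seed=0):
    """Yield (train, test) index lists; each item is held out exactly once."""
    indices = list(range(num_items))
    random.Random(seed).shuffle(indices)
    # Interleaved partition: fold i takes every num_folds-th shuffled index.
    folds = [indices[i::num_folds] for i in range(num_folds)]
    for held_out in range(num_folds):
        test = folds[held_out]
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train, test


splits = list(k_fold_indices(num_items=25))
```

In each round the models would be estimated on the `train` E-mails and evaluated on the `test` E-mails, and the scores averaged over the ten rounds.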

During the above experiments, the curves of the topic probability distributions of the different models of each dataset are obtained. A group of curves with T = 20 is shown in Fig. 2. These curves clearly reveal that the probability with which the same topic occurs differs between the two categories, which directly supports the correctness and feasibility of our approach.

Fig. 2: The curves of the topic probabilities p(z_i|h) and p(z_i|s) in the different models for each dataset, with T = 20. Panels: (a) Enron 1, (b) Enron 2, (c) Enron 3, (d) Enron 4, (e) Enron 5, (f) Enron 6. Horizontal axis: number of topic z_i; vertical axis: probability of topic z_i.

4.2 Evaluation and Comparison

We first evaluate each model M_{T,k}. Because the datasets come from different people and each has a different ham-spam ratio, the best performing learning model may also differ between datasets. To evaluate each model M_{T,k} of each dataset, we present the evaluation results by means of F-Measure curves. Because the k of the best model M_{T,k} is less than 7 for every dataset, the F-measure curves are drawn for k = 1, ..., 7 to facilitate comparison.
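The text does not spell out which F-measure is used. The sketch below assumes the balanced F-measure (the harmonic mean of precision and recall) with spam as the positive class; the function name and the toy labels are hypothetical.

```python
def f_measure(true_labels, predicted_labels, positive="spam"):
    """Balanced F-measure of the positive class (assumed here to be spam)."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0  # no true positives: precision or recall is zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


truth = ["spam", "spam", "spam", "ham", "ham"]
guess = ["spam", "spam", "ham", "spam", "ham"]
score = f_measure(truth, guess)
```

In this toy case there are two true positives, one false positive and one false negative, so precision and recall are both 2/3 and so is the F-measure.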

By observing the F-measure curves shown in Fig. 3, the best performing model can be selected for each dataset. Table 1 shows these selected models together with the corresponding F-measure values.

Our method achieves its best result on Enron 4, which is consistent with the view of Metsis et al. [8] that some datasets (e.g., Enron 4) are “easier” than others (e.g., Enron 1). We use just 1 topic for detection on Enron 4, in contrast to 5 topics on Enron 1. We also find that no model with T = 10 or T = 200 reaches the best performance, which suggests that a T that is too small or too large is not appropriate.

Fig. 3: The F-Measure curves of each model for each dataset. Panels: (a) Enron 1, (b) Enron 2, (c) Enron 3, (d) Enron 4, (e) Enron 5, (f) Enron 6. Each panel plots F-Measure (0.5 to 1.0) against the top k number (1 to 7), with one curve for each of T = 10, 20, 50, 100, 200.

Table 1: Best model for each dataset

Dataset   Best model   F-Measure (%)
Enron 1   M_{100,5}    97.52
Enron 2   M_{20,3}     98.01
Enron 3   M_{50,1}     98.59
Enron 4   M_{50,1}     99.47
Enron 5   M_{50,1}     98.53
Enron 6   M_{100,1}    98.46

The prediction results of the best models are taken as the best results of our method. To evaluate the filtering capability of our method, we compare it with two term-based spam filtering methods. One is the PG Bayesian classifier, which is used as a baseline, and the other is the Multinomial Naive Bayes with Boolean attributes (MN Bool), which was shown to be the best Naive Bayes classifier in [8].

The best spam and ham recall of the PG Bayesian method are selected as the baseline. Metsis et al. also evaluated MN Bool on the Enron-Spam datasets; therefore, the experimental results in [8] are used directly as another reference. The threshold in all three methods is 0.5. Tables 2 and 3 list the spam and ham recall, respectively, of the three methods on the six datasets. The tables show that our model achieves the highest average spam recall (98.79%) and ham recall (98.05%). PG Bayesian is comparable to MN Bool, which uses 3000 attributes: its average ham recall is higher (97.35% vs. 97.26%) and its average spam recall is nearly identical (97.52% vs. 97.53%). However, PG Bayesian does not explore semantic relationships. In contrast, our model not only explores the semantic relationships but also obtains better prediction results on average while using at most 5 topics. These results demonstrate the advantage of our model.


Table 2: Spam recall (%) comparisons

Method        Enron 1  Enron 2  Enron 3  Enron 4  Enron 5  Enron 6  Avg
MN Bool       96.00    96.68    96.94    97.79    99.69    98.10    97.53
PG Bayesian   96.53    95.01    96.82    98.91    99.02    98.83    97.52
Our Model     97.67    98.12    98.65    99.52    99.54    99.26    98.79

Table 3: Ham recall (%) comparisons

Method        Enron 1  Enron 2  Enron 3  Enron 4  Enron 5  Enron 6  Avg
MN Bool       95.25    97.83    98.88    99.05    95.65    96.88    97.26
PG Bayesian   96.83    97.16    98.24    99.19    96.46    96.23    97.35
Our Model     97.36    97.89    98.53    99.42    97.48    97.64    98.05

5 Conclusion and Future Work

In this paper, a Bayesian topic model is proposed for spam filtering. By using LDA, each E-mail is represented as a vector of topics, and based upon this representation a Bayesian method is used to discover the relationship between the topics and spam. From tests on the Enron-Spam datasets, we conclude that our model performs better than the baseline and can detect the internal semantics of spam messages. In future work, we will test the Bayesian topic model in other application fields, such as document classification.

References

[1] Mikko T. Siponen, Carl Stucke, Effective anti-spam strategies in companies: An international study, In 39th Hawaii International Conference on Systems Science, Kauai, HI, USA, 2006

[2] Evangelos Moustakas, C. Ranganathan, Penny Duquenoy, Combating spam through legislation: A comparative analysis of US and European approaches, In CEAS 2005 - Second Conference on Email and Anti-Spam, July 21-22, 2005, Stanford University, California, USA, 2005

[3] Xavier Carreras, Lluís Màrquez, Boosting trees for anti-spam email filtering, In Proceedings of the 4th International Conference on Recent Advances in Natural Language Processing, Tzigov Chark, BG, 2001

[4] Harris Drucker, Donghui Wu, Vladimir Vapnik, Support vector machines for spam categorization, IEEE Transactions on Neural Networks, 10(5), 1999, 1048-1054

[5] A. Kolcz, J. Alspector, SVM-based filtering of E-mail spam with content-specific misclassification costs, In Proceedings of the ICDM Workshop on Text Mining, 2001

[6] Yuewu Shen, Guanglu Sun, Haoliang Qi, Xiaoning He, Using feature selection to speed up online SVM based spam filtering, In International Conference on Asian Language Processing, IALP 2010, Harbin, Heilongjiang, China, 2010, 142-145


[8] Vangelis Metsis, Ion Androutsopoulos, Georgios Paliouras, Spam filtering with naive Bayes - which naive Bayes?, In CEAS 2006 - The Third Conference on Email and Anti-Spam, Mountain View, California, USA, July 27-28, 2006

[9] Igor Santos, Carlos Laorden, Borja Sanz, Pablo Garcia Bringas, Enhanced topic-based vector space model for semantics-aware spam filtering, Expert Syst. Appl., 39(1), 2012, 437-444

[10] David M. Blei, Andrew Y. Ng, Michael I. Jordan, Latent Dirichlet allocation, J. Mach. Learn. Res., 3, March 2003, 993-1022

[11] István Bíró, Jácint Szabó, András A. Benczúr, Latent Dirichlet allocation in web spam filtering, In AIRWeb’08: Proceedings of the 4th International Workshop on Adversarial Information Retrieval on the Web, Beijing, China, 2008, 29-32

[12] Thomas Hofmann, Probabilistic latent semantic indexing, In Proceedings of the 22nd International Conference on Research and Development in Information Retrieval, 1999, 50-57

[13] Thomas Minka, John Lafferty, Expectation-propagation for the generative aspect model, In Proceedings of the 18th Conference on Uncertainty in Artificial Intelligence, Morgan Kaufmann, 2002, 352-359

[14] T. L. Griffiths, M. Steyvers, Finding scientific topics, Proceedings of the National Academy of Sciences, 101 (Suppl. 1), April 2004, 5228-5235

[15] M. Sahami, S. Dumais, D. Heckerman, E. Horvitz, A Bayesian approach to filtering junk E-mail, In AAAI-98 Workshop on Learning for Text Categorization, 1998, 55-62

[16] P. Pantel, D. Lin, Spamcop: A spam classification and organization program, In Proceedings of the AAAI Workshop on Learning for Text Categorization, 1998

[17] George H. John, Pat Langley, Estimating continuous distributions in Bayesian classifiers, In UAI, Morgan Kaufmann, 1995, 338-345

[18] Karl-Michael Schneider, On word frequency information and negative evidence in naive Bayes text classification, In EsTAL, Lecture Notes in Computer Science, Vol. 3230, 2004, 474-486

[19] Xuan-Hieu Phan, Cam-Tu Nguyen, GibbsLDA++: A C/C++ implementation of latent Dirichlet allocation (LDA) using Gibbs sampling for parameter estimation and inference, Available at: http://gibbslda.sourceforge.net/