Filtering Spams using the Minimum Description Length Principle

Tiago A. Almeida, Akebo Yamakami

School of Electrical and Computer Engineering University of Campinas – UNICAMP

13083–970, Campinas, SP, Brazil +55 19 3521 3849

{tiago, akebo}@dt.fee.unicamp.br

Jurandy Almeida

Institute of Computing University of Campinas – UNICAMP

13083–970, Campinas, SP, Brazil +55 19 3521 5840

[email protected]

ABSTRACT

Spam has become an increasingly important problem with a large economic impact on society. Spam filtering poses a special problem in text categorization, whose defining characteristic is that filters face an active adversary that constantly attempts to evade filtering. In this paper, we present a novel approach to spam filtering based on the minimum description length principle. The proposed model is fast to construct and incrementally updateable. Additionally, we offer an analysis of the measurements usually employed to evaluate the quality of anti-spam classifiers. In this sense, we present a new measure in order to provide a fairer comparison. Furthermore, we conducted an empirical experiment using six well-known, large and public databases. Finally, the results indicate that our approach outperforms the state-of-the-art spam filters.

Categories and Subject Descriptors

I.5 [Pattern Recognition]: Applications; I.2.7 [Artificial Intelligence]: Natural Language Processing—Text analysis

General Terms

Anti-spam Filtering

Keywords

Minimum description length, spam filter, machine learning

1. INTRODUCTION

E-mail is one of the most popular, fastest and cheapest means of communication. It has become a part of everyday life for millions of people, changing the way we work and collaborate. E-mail is not only used to support conversation but also as a task manager, document delivery system and archive. The downside of this success is the constantly growing volume of e-mail spam we receive. The problem of spam can be quantified in economic terms since many
hours are wasted every day by workers. It is not just the time they waste reading the spam but also the time they spend deleting those messages.

Fortunately, many solutions are being proposed to avoid this “plague”, and one of the most promising is the use of machine learning techniques for automatically filtering e-mail messages [8]. These methods include approaches that are considered top performers in text categorization, like Rocchio [13, 18], Boosting [7], Support Vector Machines (SVM) [10, 14, 12], and naive Bayes classifiers [2, 16]. The latter two currently appear to be the best anti-spam filters presented in the literature [1, 2, 4, 6, 8, 9, 20].

A relatively recent method for inductive inference which is still rarely employed in text categorization tasks is the Minimum Description Length (MDL) principle. It holds that the best explanation, given a limited set of observed data, is the one that permits the greatest compression of the data. MDL methods are particularly well-suited for dealing with model selection, prediction, and estimation problems in situations where the models can be arbitrarily complex, and overfitting the data is a serious concern [3, 11, 17].

In this paper, we present a novel anti-spam classifier based on the minimum description length principle [4]. We conducted an empirical experiment using six well-known, large and public databases. The results indicate that our approach outperforms currently established spam filters. Furthermore, we investigate the performance measurements most often used to compare the quality of anti-spam filters. In this way, we analyze the advantages of using the Matthews correlation coefficient.

The remainder of this paper is organized as follows: Section 2 presents details of our proposed approach. In Section 3, we introduce a brief discussion about the benefits of using the Matthews correlation coefficient as a measure of quality of anti-spam classifiers. Experimental results are shown in Section 4. Finally, Section 5 offers conclusions and directions for future work.

2. SPAM FILTERING BASED ON MINIMUM DESCRIPTION LENGTH

The MDL principle is a formalization of Occam’s Razor in which the best hypothesis for a given set of data is the one that yields compact representations. The traditional MDL principle states that the preferred model results in the shortest description of the model and the data, given this model. In other words, the model that best compresses the data is selected. This model selection criterion naturally balances the complexity of the model and the degree to which this model fits the data.

Let $Z$ be a finite or countable set and let $P$ be a probability distribution on $Z$. Then there exists a prefix code $C$ for $Z$ such that for all $z \in Z$, $L_C(z) = \lceil -\log_2 P(z) \rceil$. $C$ is called the code corresponding to $P$. Similarly, let $C'$ be a prefix code for $Z$. Then there exists a (possibly defective) probability distribution $P'$ such that for all $z \in Z$, $-\log_2 P'(z) = L_{C'}(z)$. $P'$ is called the probability distribution corresponding to $C'$. Thus, large probability according to $P$ means small code length according to the code corresponding to $P$, and vice versa [3, 11, 17].

The goal of statistical inference may be cast as trying to find regularity in the data. Regularity may be identified with ability to compress. MDL combines these two insights by viewing learning as data compression: it tells us that, for a given set of hypotheses $H$ and data set $D$, we should try to find the hypothesis or combination of hypotheses in $H$ that compresses $D$ most [3, 11, 17].

This idea can be applied to all sorts of inductive inference problems, but it turns out to be most fruitful in problems of model selection and, more generally, dealing with overfitting [11]. An important property of MDL methods is that they automatically and inherently protect against overfitting and can be used to estimate both the parameters and the structure of a model. In contrast, to avoid overfitting when estimating the structure of a model, traditional methods such as maximum likelihood must be modified and extended with additional, typically ad hoc principles [11].

Consider the following example. Suppose we flip a coin 1,000 times and we observe the numbers of heads and tails. We consider two model classes: the first consists of a code that represents each outcome with a 0 for heads or a 1 for tails. This code represents the hypothesis that the coin is fair. The code length according to this code is always exactly 1,000 bits. The second model class consists of all codes that are efficient for a coin with some specific bias, representing the hypothesis that the coin is not fair. Say that we observe 510 heads and 490 tails. Then the code length according to the best code in the second model class is shorter than 1,000 bits. For this reason a naive statistical method might put forward this second hypothesis as a better explanation for the data. However, in an MDL approach we would have to construct a single code based on the hypothesis; we cannot just use the best one. A simple way to do it would be to use a two-part code, in which we first specify which element of the model class has the best performance, and then we specify the data using that code. We will need quite a lot of bits to specify which code to use; thus the total code length based on the second model class would be larger than 1,000 bits. Thus, if we follow an MDL approach, the conclusion has to be that there is not enough evidence in support of the hypothesis that the coin is biased, even though the best element of the second model class provides a better fit to the data. Consult Grünwald [11] for further details.
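The coin example can be checked numerically. The following is a minimal sketch, not taken from the original work: it assumes that the biased-coin code is named by sending the observed head count (one of 1,001 equally likely values), which is only one of several possible ways to build the first part of a two-part code.

```python
import math


def binary_entropy(p):
    """Entropy, in bits per flip, of a coin with heads-probability p."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


n, heads = 1000, 510

# Model class 1: fair coin, exactly one bit per flip.
fair_bits = float(n)

# Model class 2, second part: the data encoded with the best-fitting bias.
data_bits = n * binary_entropy(heads / n)

# Model class 2, first part: the cost of naming that bias.
# Assumption (illustrative): the bias is identified by the head count,
# which takes one of n + 1 equally likely values.
param_bits = math.log2(n + 1)

two_part_bits = param_bits + data_bits

print(f"fair-coin code:      {fair_bits:.2f} bits")
print(f"best biased code:    {data_bits:.2f} bits (data only)")
print(f"two-part code total: {two_part_bits:.2f} bits")
```

Under this assumption the best biased code saves well under one bit on the data, while naming the bias costs roughly ten bits, so the two-part code is longer than 1,000 bits and the fair-coin hypothesis is preferred.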

In essence, compression algorithms can be applied to text categorization by building one compression model from the training documents of each class and using these models to evaluate the target document.

The minimum description length principle was first employed in the anti-spam filtering task by Bratko et al. [4]. We extended their original work, offering improvements in the classification criteria and the training process and, especially, proposing a different manner to build the spam and legitimate models.

Assuming that each message $m$ is composed of a set of terms $m = \{t_1, \ldots, t_n\}$, where each term $t_k$ corresponds to a word (“adult”, for example), a set of words (“to be removed”), or a single character (“$”), we can represent each message by a vector $\vec{x} = \langle x_1, \ldots, x_n \rangle$, where $x_1, \ldots, x_n$ are values of the attributes $X_1, \ldots, X_n$ associated with the terms $t_1, \ldots, t_n$. In the simplest case, each term represents a single word and all attributes are Boolean: $X_i = 1$ if the message contains $t_i$, and $X_i = 0$ otherwise.

Given a set of pre-classified training messages $M$, the task is to assign a target e-mail $m$ with an unknown label to one of the classes $c \in \{\text{spam}, \text{legitimate}\}$. To do so, the method measures the increase in the description length of the data set that results from adding the target document. Finally, it chooses the class for which the description length increase is minimal.

Unlike the algorithm proposed by Bratko et al. [4], in this work we consider each class (model) $c$ as a sequence of terms extracted from the messages and inserted into the training set. Each term $t$ from $m$ has a code length $L_t$ based on the sequence of terms present in the messages of the training set of $c$. The length of $m$ when assigned to the class $c$ corresponds to the sum of all code lengths associated with each term of $m$, $L_m = \sum_{i=1}^{|m|} L_{t_i}$. We calculate $L_{t_i} = \lceil -\log_2 P_{t_i} \rceil$, where $P$ is a probability distribution related to the terms of the class. Let $n_c(t_i)$ be the number of times that $t_i$ appears in messages of class $c$; then the probability that any term belongs to $c$ is given by the maximum likelihood estimation:

$$P_{t_i} = \frac{n_c(t_i) + \frac{1}{|\chi|}}{n_c + 1}$$

where $n_c$ corresponds to the sum of $n_c(t_i)$ for all terms which appear in messages that belong to $c$, and $|\chi|$ is the vocabulary size. In this work, we assume that $|\chi| = 2^{32}$, that is, each term in uncompressed form is a symbol of 32 bits. This estimation reserves a “portion” of probability for words which the classifier has never seen before [5].

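Briefly, the proposed MDL anti-spam filter classifies a message by following these steps:

1. Tokenization: the classifier extracts all terms of the new message $m = \{t_1, \ldots, t_{|m|}\}$;

2. Compute the increase of the description length when $m$ is assigned to each class $c \in \{\text{spam}, \text{legitimate}\}$:

$$L_m(\text{spam}) = \sum_{i=1}^{|m|} \left\lceil -\log_2 \left( \frac{n_{\text{spam}}(t_i) + \frac{1}{|\chi|}}{n_{\text{spam}} + 1} \right) \right\rceil$$

$$L_m(\text{legitimate}) = \sum_{i=1}^{|m|} \left\lceil -\log_2 \left( \frac{n_{\text{legitimate}}(t_i) + \frac{1}{|\chi|}}{n_{\text{legitimate}} + 1} \right) \right\rceil$$

3. If $L_m(\text{spam}) < L_m(\text{legitimate})$, then $m$ is classified as spam, since the spam model yields the smaller description length increase; otherwise, $m$ is labeled as legitimate.

4. Training method.

In the following, we offer more details about steps 1 and 4.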
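The classification steps above can be sketched in a few lines of Python. This is an illustrative sketch rather than the authors' implementation: the class name `MDLFilter`, its method names, and the tie-breaking rule are assumptions. The term probability follows the estimate given earlier in this section, with a vocabulary size of $2^{32}$, and the message is assigned to the class with the smaller description length, as stated earlier in this section.

```python
import math
from collections import Counter

VOCAB_SIZE = 2 ** 32  # |chi|: every uncompressed term is a 32-bit symbol


class MDLFilter:
    """Hypothetical minimal MDL spam filter built on per-class term counts."""

    def __init__(self):
        self.term_counts = {"spam": Counter(), "legitimate": Counter()}
        self.total = {"spam": 0, "legitimate": 0}

    def train(self, terms, label):
        # Each distinct term is counted once per message.
        for term in set(terms):
            self.term_counts[label][term] += 1
            self.total[label] += 1

    def term_length(self, term, label):
        # P = (n_c(t) + 1/|chi|) / (n_c + 1); code length = ceil(-log2 P).
        p = (self.term_counts[label][term] + 1 / VOCAB_SIZE) / (self.total[label] + 1)
        return math.ceil(-math.log2(p))

    def description_length(self, terms, label):
        return sum(self.term_length(t, label) for t in set(terms))

    def classify(self, terms):
        l_spam = self.description_length(terms, "spam")
        l_legit = self.description_length(terms, "legitimate")
        # Assumption: ties go to "legitimate", the less costly mistake.
        return "spam" if l_spam < l_legit else "legitimate"


mdl = MDLFilter()
mdl.train(["cheap", "pills", "buy"], "spam")
mdl.train(["meeting", "agenda", "monday"], "legitimate")
print(mdl.classify(["cheap", "pills"]))
print(mdl.classify(["agenda", "monday"]))
```

Because only term counts are stored, the models can be updated one message at a time without keeping the messages themselves.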

2.1 Preprocessing and Tokenization

We did not perform language-specific preprocessing techniques such as word stemming, stop word removal, or case folding, since other researchers found that such techniques tend to hurt spam-filtering accuracy [2, 16, 20]. However, we apply email-specific preprocessing before the classification task, using Jaakko Hyvätti's normalizemime. This program converts the character set to UTF-8, decodes Base64, Quoted-Printable and URL encoding, and adds warning tokens in case of encoding errors. It also appends a copy of HTML/XML message bodies with most tags removed, decodes HTML entities and limits the size of attached binary files.

Tokenization is the first stage in the classification pipeline; it involves breaking the text stream into terms (“words”), usually by means of a regular expression. In this work, we consider that terms start with a printable character, followed by any number of alphanumeric characters, excluding dots, commas and colons from the middle of the pattern. With this pattern, domain names and mail addresses will be split at dots, so the classifier can recognize a domain even if subdomains vary [19]. As proposed by Drucker et al. [10] and Metsis et al. [16], we do not consider the number of times a term appears in each message. In this way, each term is counted only once per message in which it appears.
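A tokenizer along these lines might look as follows. The regular expression is an assumption made for illustration, since the exact pattern is not given in the text: a term starts with any non-whitespace character other than a dot, comma or colon, and continues with alphanumeric characters. Returning a set reflects the rule that each term is counted once per message.

```python
import re

# Assumed pattern (not from the paper): one leading printable character,
# then any number of alphanumerics. Dots, commas and colons never start
# or continue a term, so they act as separators.
TERM_PATTERN = re.compile(r"[^\s.,:][^\W_]*")


def tokenize(text):
    """Return the set of distinct terms in a message (no case folding)."""
    return set(TERM_PATTERN.findall(text))


terms = tokenize("Visit www.example.com now, now: $100 off")
print(sorted(terms))
```

With this pattern, `www.example.com` is split into `www`, `example` and `com`, so a message that mentions a different subdomain of `example.com` still shares the `example` and `com` terms.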

2.2 Training Method

Anti-spam filters generally build their prediction models by learning from examples. A basic training method is to start with an empty model, classify each new sample, and train on it with the correct class if the classification is wrong. This is known as train on error (TOE). An improvement to this method is to train also when the classification is right but the score is near the boundary, that is, train on or near error (TONE) [19].

The advantage of TONE over TOE is that it accelerates the learning process by exposing the filter to additional hard-to-classify samples in the same training period. Therefore, we employ TONE as the training method of the proposed MDL anti-spam filter.

An advantage of the MDL classifier is that we can start with an empty training set, and the classifier builds the models for each class according to user feedback. Moreover, it is not necessary to keep the messages used for training, since the models are built incrementally from the term frequencies [5].
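The TONE decision can be expressed as a small predicate. This is a sketch under stated assumptions: the function name `should_train` and the `margin` threshold (in bits) are hypothetical, since the text does not say how near the boundary a score must be to trigger training.

```python
def should_train(predicted, actual, len_spam, len_legit, margin=8):
    """Decide whether to train on a message under TONE.

    predicted, actual: class labels ("spam" or "legitimate").
    len_spam, len_legit: description lengths of the message under each model.
    margin: hypothetical threshold, in bits, defining "near the boundary".
    """
    if predicted != actual:
        return True  # train on error (this alone would be TOE)
    # Correct, but the two description lengths are close: train anyway.
    return abs(len_spam - len_legit) < margin


print(should_train("spam", "legitimate", 10, 50))  # misclassified
print(should_train("spam", "spam", 40, 44))        # correct but uncertain
print(should_train("spam", "spam", 10, 50))        # correct and confident
```

Setting `margin=0` reduces the rule to plain TOE, which makes the two training methods easy to compare in an experiment.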

3. PERFORMANCE MEASUREMENTS

According to Cormack [8], filters should be judged along four dimensions: autonomy, immediacy, spam identification, and non-spam identification. However, it is not obvious how to measure any of these dimensions separately, nor how to combine these measurements into a single one for the purpose of comparing filters. Reasonable standard measures are useful to facilitate comparison, provided that the goal of optimizing them does not replace that of finding the most suitable filter for the purpose of spam filtering.

Let $S$ and $L$ be the sets of spam and legitimate messages, respectively. The possible prediction results are: true positives ($TP$), the set of spam messages correctly classified; true negatives ($TN$), the set of legitimate messages correctly classified; false negatives ($FN$), the set of spam messages incorrectly classified as legitimate; and false positives ($FP$), the set of legitimate messages incorrectly classified as spam. Some well-known evaluation measurements are: true positive rate ($Tpr$), true negative rate ($Tnr$), spam precision ($Spr$), spam recall ($Sre$), legitimate precision ($Lpr$), legitimate recall ($Lre$), area under the ROC curve ($1-AUC$), LAM [8], precision × recall [16], accuracy rate ($Acc$) and total cost ratio ($TCR$) [2].

Nevertheless, failures to identify legitimate and spam messages have materially different consequences. Misclassified non-spam substantially increases the risk that the information contained in the message will be lost, or at least delayed. Exactly how much risk and delay are incurred is difficult to quantify, as are the consequences, which depend on the nature of the message. On the other hand, failures to identify spam also vary in importance, but are generally less important than failures to identify non-spam. Viruses, worms, and phishing messages may be an exception, as they pose significant risks to the user [8].

In order to take into consideration the asymmetry in the misclassification costs, Androutsopoulos et al. [2] proposed a refinement based on spam recall and precision, to allow the performance evaluation based on a single measure. They consider a false positive as being $\lambda$ times more costly than a false negative, with $\lambda$ equal to 1 or 9. Hence, each false positive is counted as $\lambda$ mistakes, with the weighted accuracy ($Acc_w$) being given by

$$Acc_w = \frac{|TP| + \lambda |TN|}{|S| + \lambda |L|}.$$

The total cost ratio ($TCR$) can be calculated by

$$TCR = \frac{|S|}{\lambda |FP| + |FN|}.$$

It offers an indication of the improvement provided by the filter. Greater $TCR$ indicates better performance, and for $TCR < 1$, not using the filter is better.
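Both measures are straightforward to compute from the confusion counts. The sketch below uses hypothetical function names and made-up counts purely for illustration; returning infinity when there are no errors is a convention chosen here, not something prescribed by [2].

```python
def weighted_accuracy(tp, tn, n_spam, n_legit, lam=1):
    """Acc_w = (|TP| + lam * |TN|) / (|S| + lam * |L|)."""
    return (tp + lam * tn) / (n_spam + lam * n_legit)


def total_cost_ratio(n_spam, fp, fn, lam=1):
    """TCR = |S| / (lam * |FP| + |FN|); infinite when no errors are made."""
    cost = lam * fp + fn
    return float("inf") if cost == 0 else n_spam / cost


# Illustrative counts only: 450 spam and 150 legitimate messages,
# with 2 false positives and 10 false negatives.
n_spam, n_legit, fp, fn = 450, 150, 2, 10
tp, tn = n_spam - fn, n_legit - fp

for lam in (1, 9):
    acc_w = weighted_accuracy(tp, tn, n_spam, n_legit, lam)
    tcr = total_cost_ratio(n_spam, fp, fn, lam)
    print(f"lambda={lam}: Acc_w={acc_w:.4f}, TCR={tcr:.3f}")
```

Raising `lam` from 1 to 9 lowers the $TCR$ of the same filter from 37.5 to about 16.07, which shows how strongly the measure depends on the assumed cost of a false positive.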

3.1 Matthews Correlation Coefficient

According to Carpinter and Hunt [6] and Cormack and Lynam [9], the value of $\lambda$ is very difficult to determine. In particular, it depends on the message, since some messages are more important than others, as previously discussed.

Furthermore, the problem with using $TCR$ is that it does not return a value inside a predefined range. Consider, for example, two classifiers $A$ and $B$ employed to filter 600 messages (450 spam + 150 legitimate, $\lambda = 1$). Suppose that $A$ attained a perfect prediction with $FP_A = FN_A = 0$, and $B$ incorrectly classified only 3 spam messages as legitimate, thus $FP_B = 0$ and $FN_B = 3$. In this way, $TCR_A = +\infty$ and $TCR_B = 150$. Intuitively, we can observe that both classifiers achieved similar performance, with a small advantage for $A$. However, if we analyze only the $TCR$, we would wrongly conclude that $A$ was much better than $B$. Moreover, a $TCR$ value does not support strong conclusions about the performance achieved by a single classifier, because it only indicates the improvement provided by using the filter, and gives no information about how much the classifier could still be improved.

In order to avoid these characteristics, we propose the use of the Matthews correlation coefficient ($MCC$) [15]. $MCC$ is used in machine learning as a measure of the quality of binary classifications which provides much more information than $TCR$. It returns a real value between -1 and +1. A coefficient equal to +1 indicates a perfect prediction; 0, an average random prediction; and -1, an inverse prediction.

$MCC$ provides a more balanced evaluation of the prediction than other measures, such as the proportion of correct predictions, especially if the two classes are of different sizes:

$$MCC = \frac{(|TP| \cdot |TN|) - (|FP| \cdot |FN|)}{\sqrt{(|TP| + |FP|) \cdot (|TP| + |FN|) \cdot (|TN| + |FP|) \cdot (|TN| + |FN|)}}.$$

It provides a fairer evaluation since, in a real situation, the number of spam messages we receive is much higher than the number of legitimate messages; therefore, $MCC$ tends to automatically adjust how much worse a false positive error is than a false negative one. As the ratio between the number of spam and legitimate messages increases, a false positive tends to be much worse than a false negative. Using the previous example, classifier $A$ would achieve $MCC_A = 1.000$ and $MCC_B = 0.987$. Thus, we can draw correct conclusions both about how the classifiers compare and about the performance each one achieved individually.
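The two values quoted for classifiers $A$ and $B$ can be reproduced directly from their confusion counts. In this minimal sketch the function name `mcc` is an assumption, and returning 0 when the denominator vanishes is a common convention rather than part of the definition in [15].

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # assumed convention for a degenerate confusion matrix
    return (tp * tn - fp * fn) / denom


# 600 messages: 450 spam and 150 legitimate.
mcc_a = mcc(tp=450, tn=150, fp=0, fn=0)  # classifier A: no errors
mcc_b = mcc(tp=447, tn=150, fp=0, fn=3)  # classifier B: 3 missed spams

print(f"MCC_A = {mcc_a:.3f}")
print(f"MCC_B = {mcc_b:.3f}")
```

Unlike the jump from 150 to infinity in $TCR$, the two coefficients differ only in the second decimal place, which matches the intuition that the classifiers perform almost equally well.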

Furthermore, we can combineM CCwith other measures, as precision×recall rates, for instance, in order to provide a fairer comparison [1].

4. EXPERIMENTAL RESULTS

We use the six well-known Enron corpora [16] in our experiments. All corpora are composed of real legitimate messages extracted from the mailboxes of six former employees of the Enron Corporation and selected spam messages from different sources. The Enron corpora try to keep the same characteristics as a real user mailbox. The composition of each dataset is shown in Table 1.

Table 1: Enron datasets

Dataset    No. of Legitimate    No. of Spam    Total
Enron 1    3,672                1,500          5,172
Enron 2    4,361                1,496          5,857
Enron 3    4,012                1,500          5,512
Enron 4    1,500                4,500          6,000
Enron 5    1,500                3,675          5,175
Enron 6    1,500                4,500          6,000
Total      16,545               17,171         33,716

Tables 2, 3, 4, 5, 6, and 7 present the performance achieved by each classifier for each Enron dataset. Bold values indicate the highest score. In order to provide a fairer evaluation, we consider as the most important measures the spam recall rate ($Sre$), legitimate recall rate ($Lre$) and Matthews correlation coefficient ($MCC$) achieved by each filter. Additionally, we present other measures, such as the legitimate and spam precision rates, weighted accuracy ($Acc_w$) and total cost ratio ($TCR$). We employed $\lambda = 1$ in order to calculate $Acc_w$ and $TCR$, as described in Section 3.

The results achieved by the proposed MDL classifier are compared with the ones attained by methods considered the current top performers in anti-spam filtering: the Boolean naive Bayes (NB) classifier [2, 16] and the linear support vector machine (SVM) with Boolean attributes [10, 12, 14].

Due to space limitations, we present the results achieved by each evaluated classifier. A comprehensive set of results, including all tables and figures, is available at http://www.dt.fee.unicamp.br/~tiago/Research/Spam/spam.htm.

Table 2: Enron 1 – Results achieved by each filter

Measures     NB       SVM      MDL
Sre (%)      91.33    83.33    92.00
Spr (%)      85.09    87.41    92.62
Lre (%)      93.48    95.11    97.01
Lpr (%)      96.36    93.33    96.75
Accw (%)     92.86    91.70    95.56
TCR          4.054    3.488    6.552
MCC          0.831    0.796    0.892

Regarding the results achieved by the filters, the proposed MDL anti-spam classifier outperforms the other classifiers for the majority of the corpora used in our empirical evaluation. It is important to realize that in some situations MDL performs much better than SVM or NB. For instance, for Enron 1 (Table 2), MDL achieved a spam recall rate of 92% while SVM attained 83.33%, even though MDL also presented better legitimate recall. This means that, for Enron 1, MDL recognized over 8 percentage points more spam than SVM, a relative improvement of 10.40%. In real applications this difference is extremely important. Note that the same result can be found for Enron 5 (Table 6) and Enron 6 (Table 7). MDL and SVM achieved similar performance, with no statistically significant difference, only for Enron 4 (Table 5).
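For reference, the 10.40% figure for Enron 1 is the gain in spam recall of MDL relative to the spam recall of SVM reported in Table 2:

$$\frac{Sre_{MDL} - Sre_{SVM}}{Sre_{SVM}} = \frac{92.00 - 83.33}{83.33} = \frac{8.67}{83.33} \approx 0.1040.$$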

The results indicate that our approach is more effective at distinguishing spam from legitimate messages. It attained an accuracy rate higher than 95% and high precision × recall rates for all datasets, indicating that the proposed filter makes few mistakes. We also verify that the MDL classifier achieved a high $MCC$ score ($\geq 0.878$) for all tested corpora. This indicates that the proposed filter came close to a perfect prediction ($MCC = 1.000$) and is much better than not using a filter ($MCC = 0.000$).

Nevertheless, it is important to note that $TCR$ is not really an informative measurement. For instance, for Enron 4 (Table 5), MDL and SVM achieved similar performance ($MCC_{MDL} = 0.945$ and $MCC_{SVM} = 0.978$). However, their $TCR$ values are very different ($TCR_{MDL} = 34.615$ and $TCR_{SVM} = 90.000$), even though their precision × recall rates are very close.

5. CONCLUSIONS AND FURTHER WORK

In this paper, we have presented an anti-spam classifier based on the minimum description length principle. In particular, we based our research on the filter proposed by Bratko et al. [4]. We have extended their original work, offering several improvements, mainly in the classification criteria and the training process, and proposing a different manner to build the spam and legitimate models.

We have conducted empirical experiments using six well-known, large and public corpora. In addition, we have compared our results with those achieved by methods considered the current top performers in spam filtering, and they indicate that our approach outperforms the state-of-the-art spam filters.

Furthermore, we have proposed the use of the Matthews correlation coefficient ($MCC$) as the evaluation measurement in order to provide a fairer comparison. We have shown that $MCC$ provides a balanced evaluation of the prediction, especially if the two classes are of different sizes. Moreover, $MCC$ returns a value inside a predefined range, which provides more information about the classifiers’ performance than other measures.

Currently, we are conducting more experiments using larger datasets, such as the TREC05, TREC06 and TREC07 corpora [8], in order to reinforce the validation. We also aim to compare our approach with other commercial and open-source anti-spam filters, such as Bogofilter, SpamAssassin and OSBF-Lua, among others.

Future work should take into consideration that spam filtering is a coevolutionary problem: while the filter tries to evolve its prediction capacity, the spammers try to evolve their spam messages in order to evade the classifiers. Hence, an efficient approach should have an effective way to adjust its rules in order to detect changes in spam features. In this way, collaborative filters could be used to assist the classifier by accelerating the adaptation of the rules and increasing the classifiers’ performance.

6. ACKNOWLEDGMENTS

This work is supported by the Brazilian funding agencies CNPq, CAPES and FAPESP.

7. REFERENCES

[1] T. Almeida, A. Yamakami, and J. Almeida. Evaluation of approaches for dimensionality reduction applied with naive bayes anti-spam filters. In Proc. of the 8th IEEE Int. Conf. on Machine Learning and Applications, pp. 1–6, Miami, FL, USA, 2009.

[2] I. Androutsopoulos, G. Paliouras, and E. Michelakis. Learning to filter unsolicited commercial e-mail. Technical Report 2004/2, National Centre for Scientific Research, Athens, Greece, 2004.

[3] A. Barron, J. Rissanen, and B. Yu. The minimum description length principle in coding and modeling. IEEE Trans. on Inf. Theory, 44(6):2743–2760, 1998.

[4] A. Bratko, G. Cormack, B. Filipic, T. Lynam, and B. Zupan. Spam filtering using statistical data compression models. JMLR, 7:2673–2698, 2006.

[5] I. A. Braga and M. Ladeira. Filtragem Adaptativa de Spam com o Principio Minimum Description Length. In Proc. of the 28th Brazilian Computer Society, pp. 11–20, Belem, Brazil, 2008 (In Portuguese).

[6] J. Carpinter and R. Hunt. Tightening the net: A review of current and next generation spam filtering tools. Computers and Security, 25(8):566–578, 2006.

[7] X. Carreras and L. Marquez. Boosting trees for anti-spam email filtering. In Proc. of the 4th Int. Conf. on Recent Advances in Natural Language Processing, pp. 58–64, Tzigov Chark, Bulgaria, 2001.

[8] G. Cormack. Email spam filtering: A systematic review. Foundations and Trends in Information Retrieval, 1(4):335–455, 2008.

[9] G. Cormack and T. Lynam. Online supervised spam filter evaluation. ACM Trans. on Information Systems, 25(3):1–11, 2007.

[10] H. Drucker, D. Wu, and V. Vapnik. Support vector machines for spam categorization. IEEE Trans. on Neural Networks, 10(5):1048–1054, September 1999.

[11] P. Grünwald. A tutorial introduction to the minimum description length principle. In P. Grünwald, I. Myung, and M. Pitt, editors, Advances in Minimum Description Length: Theory and Applications, pp. 3–81. MIT Press, 2005.

[12] J. Hidalgo. Evaluating cost-sensitive unsolicited bulk email categorization. In Proc. of the 17th ACM SAC, pp. 615–620, Madrid, Spain, 2002.

[13] T. Joachims. A probabilistic analysis of the rocchio algorithm with tfidf for text categorization. In Proc. of 14th ICML, pp. 143–151, Nashville, TN, USA, 1997.

[14] A. Kolcz and J. Alspector. Svm-based filtering of e-mail spam with content-specific misclassification costs. In Proc. of the 1st IEEE ICDM, pp. 1–14, San Jose, CA, USA, 2001.

[15] B. Matthews. Comparison of the predicted and observed secondary structure of t4 phage lysozyme. Biochimica et Biophysica Acta, 405(2):442–451, 1975.

[16] V. Metsis, I. Androutsopoulos, and G. Paliouras. Spam filtering with naive bayes - which naive bayes? In Proc. of the 3rd CEAS, pp. 1–5, Mountain View, CA, USA, 2006.

[17] J. Rissanen. Modeling by shortest data description. Automatica, 14:465–471, 1978.

[18] R. Schapire, Y. Singer, and A. Singhal. Boosting and rocchio applied to text filtering. In Proc. of the 21st ACM SIGIR, pp. 215–223, Melbourne, 1998.

[19] C. Siefkes, F. Assis, S. Chhabra, and W. Yerazunis. Combining winnow and orthogonal sparse bigrams for incremental spam filtering. In Proc. of the 8th PKDD, pp. 410–421, Pisa, Italy, 2004.

[20] L. Zhang, J. Zhu, and T. Yao. An evaluation of statistical spam filtering techniques. ACM Trans. on Asian Lang. Information Proc., 3(4):243–269, 2004.