AN EVALUATION OF FILTERING TECHNIQUES IN A NAÏVE BAYESIAN ANTI-SPAM FILTER

by

Vikas P. Deshpande

A report submitted in partial fulfillment of the requirements for the degree of

MASTER OF SCIENCE in Computer Science

Approved:

_________________________
Dr. Robert F. Erbacher, Major Professor

_________________________
Dr. Nicholas Flann, Committee Member

_________________________
Dr. Hugo de Garis, Committee Member

UTAH STATE UNIVERSITY Logan, Utah


ABSTRACT

An Evaluation of Filtering Techniques in a Naïve Bayesian Anti-spam Filter

by

Vikas P. Deshpande, Master of Science Utah State University, 2004

Major Advisor: Dr. Robert F. Erbacher Department: Computer Science

An efficient anti-spam filter that blocks all unsolicited messages (i.e., spam) without blocking any legitimate messages is a growing need. To address this problem, this report takes a statistically based approach, employing a Bayesian anti-spam filter, because it is content-based and self-learning (adaptive) in nature. We train the filter using a large corpus of legitimate messages and spam, and we test it using new incoming personal messages. We evaluate four effective filtering techniques available for a Bayesian filter. We look at the effectiveness of each technique, and we evaluate different configurations for different threshold values in order to find an optimal anti-spam filter configuration. Based on cost-sensitive measures, we conclude that additional safety precautions are needed before a Bayesian anti-spam filter can be put into practice.


ACKNOWLEDGMENTS

I would like to thank my major professor, Dr. Robert F. Erbacher, for his guidance. I would also like to thank Dr. Nick Flann and Dr. Hugo de Garis for being members of my committee.

I am grateful to my parents for their encouragement and moral support.


CONTENTS

ABSTRACT

ACKNOWLEDGMENTS

LIST OF TABLES

LIST OF FIGURES

1. INTRODUCTION

2. ANTI-SPAM FILTERS

2.1 Rule-based Approach

2.2 Blacklist Approach

2.3 Whitelists / Effective Filters Approach

2.4 Signature-based Approach

2.5 Filters Fight Back

3. NAÏVE BAYESIAN BASED CLASSIFICATION

3.1 Bayesian Networks

3.2 Experimental Bayesian Filter

3.3 Cost Evaluation Measures

3.4 Independence Factor of Bayesian Networks

4. FILTERING TECHNIQUES IN NAÏVE BAYESIAN FILTER

4.1 Use of All Tokens

4.2 Use of a Fixed Number of Tokens

4.3 Use of Standard Deviation

4.4 Use of a Relative Number of Tokens

5. EXPERIMENTS

6. EVALUATION RESULTS

7. CONCLUSIONS

REFERENCES

APPENDIX


LIST OF TABLES

Table

6.1 The results (false positives, false negatives and correct classifications) of four filtering techniques for different values of λ

6.2 The results (TCR) of four filtering techniques for different values of λ

6.3 The results (false positives, false negatives and correct classifications) of four configurations (5, 15, 20 and 25) of the fixed token approach for different values of λ

6.4 The results (TCR) of four configurations (5, 15, 20 and 25) of the fixed token approach for different values of λ

6.5 The results (false positives, false negatives and correct classifications) of four configurations (3, 7, 10 and 12) of the fixed token approach for different values of λ

6.6 The results (TCR) of four configurations (3, 7, 10 and 12) of the fixed token approach for different values of λ

LIST OF FIGURES

Figure

6.1 TCR for effective configurations of the fixed token approach at t = 0.999 (λ = 999)

6.2 TCR for effective configurations of the fixed token approach at t = 0.9 (λ = 9)

6.3 TCR for effective configurations of the fixed token approach at t = 0.5 (λ = 1)

CHAPTER 1 INTRODUCTION

Email service is one of the advantages of the Internet. In recent times, however, this service has faced the serious problem of spam. Spam can be defined as unsolicited automated email. It need not be sent solely for commercial purposes; spam can also be used for political and social purposes. Direct marketers exploit the low cost of email to mass-mail their ideas to thousands of recipients. Spam frustrates email users because these messages take up a lot of mailbox space, and users waste time deleting them. Spam clutter costs email service providers millions of dollars. Spam also wastes bandwidth for Internet users on dial-up connections and may involve minors in illegal businesses (e.g., "pay $XXX to get rich instantly") [14].

The basic problem in eliminating spam lies in differentiating spam from legitimate messages. For example, suppose person A is looking for a car, and his neighbor, person B, who is planning to sell his own car, happens to know this. Person B might then send person A an email offering his car at some price. This message is unsolicited and commercial, and thus it can easily be mistaken for spam. One of the features that differentiates spam from legitimate messages is that spam is mass-mailed by automation. Therefore, even legitimate messages can be categorized as spam if blindly mass-mailed. However, the content of spam typically forms a distinct category rarely observed in legitimate messages, making it possible to use text classifiers for anti-spam filtering.


A naïve Bayesian anti-spam filter applies a text categorization technique based on a machine-learning algorithm. Proposed by Sahami et al. [10], the technique shows impressive results on new, unseen incoming messages. The filter requires training, which can be provided by a previous set of spam and legitimate messages. It keeps track of each word that occurs only in spam, only in legitimate messages, or in both. Based on the occurrence statistics of these words (also called tokens), new, unseen incoming messages are processed and classified.

There are many filtering techniques available with a naïve Bayesian approach. This report evaluates four effective filtering techniques currently used in practical naïve Bayesian anti-spam filters. Of these four, our evaluation found the fixed token approach to be the most successful. This report further evaluates the fixed token approach with different configurations to find an optimal one. The results indicate that a content-based filter built on a naïve Bayesian approach gives peak performance when 5 to 12 tokens are used to process a new message, irrespective of the size of the message. Hence, we believe that our filter can make a positive contribution in first-pass filters.

This report is organized as follows. Chapter 2 discusses five significant approaches to anti-spam filtering along with their advantages and disadvantages. Chapters 3 and 4 describe in detail the Bayesian classifier and the four filtering techniques required for this study, respectively. Our experimental set-up and the procedure followed are presented in Chapter 5. Evaluation results and conclusions are presented in Chapters 6 and 7. In the last chapter, Chapter 8, we describe possible future work to extend our study.


CHAPTER 2 ANTI-SPAM FILTERS

Many different spam-filtering approaches have been tried. Most of these have a degree of effectiveness, but they have not attained global popularity because of their drawbacks. The five most significant approaches are discussed below, along with their strengths and weaknesses.

2.1 Rule-based Approach

As the name suggests, in a rule-based approach, each email is compared with a set of rules to determine whether it is spam. A rule set contains rules, each with an assigned weight. Initially, each incoming email message has a score of zero. The email is then parsed to check whether any rule matches. For every rule that matches the message, the rule's weight is added to the score of the email. In the end, if the final score is above some threshold value, the email is declared spam [10].

Rules are simply observations of features that are found more frequently in spam than in legitimate messages. Some examples include a missing sender’s address, use of the color red above some threshold value, etc. Further, some spam features remain constant over time, for example, forged headers and auto-executing JavaScript.
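The scoring procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the rule set of any real filter: the three rules, their patterns and weights, and the threshold of 5.0 are all hypothetical.

```python
import re

# Hypothetical rules: (description, compiled pattern, weight).
# Real rule sets are far larger; these values are illustrative only.
RULES = [
    ("mentions 'viagra'", re.compile(r"viagra", re.IGNORECASE), 3.0),
    ("promises instant riches", re.compile(r"get rich", re.IGNORECASE), 2.5),
    ("auto-executing JavaScript", re.compile(r"<script", re.IGNORECASE), 4.0),
]

THRESHOLD = 5.0  # assumed cut-off; a score above it marks the message as spam


def rule_score(message: str) -> float:
    """Start at zero and add the weight of every rule that matches."""
    score = 0.0
    for _description, pattern, weight in RULES:
        if pattern.search(message):
            score += weight
    return score


def is_spam(message: str) -> bool:
    """Declare the message spam if its final score exceeds the threshold."""
    return rule_score(message) > THRESHOLD
```

With these assumed rules, a message matching the first two rules scores 5.5 and is declared spam, while an obfuscated spelling such as "V*i*a*g*r*a" matches no pattern at all, which illustrates how rigid a fixed rule set is.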

Advantages:

This approach can be very effective with a given set of rules. It can achieve 90 to 95 percent efficiency. The filter is easy to install, as it merely requires copying the rule set. It requires neither training nor any sort of personal tuning. Further, the rule set can be updated by copying an additional set of rules to challenge the current trend of spam.

Disadvantages:

The disadvantage of the rule-based approach is that it is rigid. There is no self-learning facility available for the filter. Spammers with knowledge of the rule set can design spam to deceive it. For example, if a rule classifies a message as spam when it contains the word “Viagra” more than five times, the spammer can easily circumvent the rule by writing “V*i*a*g*r*a” instead of “Viagra.” Rules cannot be kept secret. The best option is to go through every spam message and update the rule set by manually adding newfound rules. Unfortunately, this updating process is never-ending, as spammers continually devise new ways to deceive spam filters. The process requires personal effort, time, and some level of expertise, qualities not found in every email user.

The rule-based approach could be used in an integrated spam filter alongside some other approach. In a rule-based approach, the decision whether to classify an email as spam is binary. This classification process on its own does not give a continuous measure of confidence. Such confidence is critical because the cost of a false positive (classifying a legitimate message as spam) is very high. There is a need for a classification scheme based on probability, wherein all messages near a threshold value can be categorized as legitimate to avoid the danger of false positives. The rule-based approach is faster than the use of blacklists, but it is slower than statistically based approaches. SpamAssassin is the most successful spam filter on the market that uses this approach. The ReadMe file of SpamAssassin states that SpamAssassin differentiates between spam and non-spam mail correctly in 99.94 percent of the cases.

2.2 Blacklist Approach

In this approach, servers that are found to be the sources of spam are blacklisted. The emails coming from blacklisted servers are marked as spam and deleted at the server level. The blacklist can also be maintained at the personal level.

Advantages:

The blacklist approach is helpful in cases in which servers are compromised and used for sending spam to hundreds of thousands of users. This is a better and cheaper option to use at the ISP level along with some other effective filtering technique. Tools like Razor and Pyzor can be used for this purpose.

Disadvantages:

The criterion for any spam filter is not only efficiency in filtering spam but also doing the job with the minimum number of false positives. Marking a legitimate message as spam is a serious mistake and is much more costly than marking spam as legitimate. The blacklist approach generates a large number of false positives. This being the case, its generalized approach of shunning a culprit server forever is not a good idea. A legitimate message arriving from a blacklisted server would always be considered spam. MAPS RBL, probably the best-known blacklist, catches only 24 percent of all spam, with 34 percent false positives. Moreover, there are many ethical issues involved in blacklisting a server. Probably the worst scenario is blacklisting a server without knowing whether that server is a source of spam. In addition, a spammer is a moving target. While a spammer might use a compromised computer to send spam, as soon as he learns that the computer has been detected, he can move to a different computer until that one is detected as well. This can go on and on. The end result is that while servers are shunned, the spammer keeps spamming.

One remedy for these problems has been the use of Distributed Adaptive Blacklists. The basic idea is to detect a spam message and inform all the recipients of that message (which may run into millions) about its status. Digests of spam are maintained at the server level. Whenever a new message is received at the MTA, the adaptive blacklists are consulted to detect whether the message is spam. There are tools that ensure that messages that are different versions of the same spam do not get identified as legitimate. In addition, maintainers of distributed blacklists create “honey-pot” addresses, addresses never used for legitimate purposes. The basic disadvantage of this approach is that it generates a considerable number of false negatives. Thus, it is recommended that this approach be used in conjunction with another effective filtering technique.

2.3 Whitelist / Effective Filters Approach

Whitelists contain legitimate addresses. A whitelist filter is configured with an MTA. Email messages arriving from any of these addresses are allowed to pass into the recipient’s mailbox. Messages from sources that are not whitelisted are considered to be spam. It is difficult to maintain an exhaustive list of all legitimate addresses. A better option would be to share whitelists among friends and relatives. However, this, too, can be an easy route for a spammer to get a big list of legitimate addresses. The challenge-response approach has been integrated with the whitelist approach to avoid such a problem. A sender who is not whitelisted receives a challenge from the recipient for authentication purposes. The challenge might contain an image to decipher or a word to recognize and spell. Such a task is simple for a human, but a machine would not be able to answer it. Once the sender answers the challenge, his address is added to the whitelist, and all his mail reaches the recipient’s mailbox directly in the future.

Advantages:

Once all legitimate addresses are recorded, the whitelist approach coupled with the challenge-response approach has a 100 percent efficiency rate. The challenge-response component ensures that spammers cannot reply to millions of challenges and get registered on whitelists integrated with anti-spam filters; the approach requires human intervention for this very purpose. Spammers who try to respond to such challenges expose their purpose to users seeking legal remedies against them.

Disadvantages:

The main disadvantage of the challenge-response approach is that it generates a significant number of false positives, because some legitimate addresses are not listed on the whitelist. There are many reasons why senders do not reply to a challenge. Doing so means extra effort for the senders. They may have unreliable ISPs or multiple email addresses, or they may not care to reply. Such senders would not be whitelisted, and, hence, their mail would be classified as spam. There are also cases in which users receive mail from automated systems, for example after online purchases, online registrations, web list sign-ups, etc. Such systems would not be able to reply to challenges. In addition, if a user were to add an incorrect address or forget to add an address to his whitelist, he would generate false positives. Further, some people consider the challenge-response system rude.

To avoid false positives, the strict action taken by the above approach can be toned down to a milder one. Emails from unknown sources can be placed in a folder other than the inbox (a low-priority mailbox). This folder could then be checked weekly. All unknown senders would receive replies stating that their emails would not be read for a week. If they wanted their message to be read immediately, they would have to respond to the challenge sent. Thus, instead of using the whitelist approach as a single tool of defense, it is more effective to use it in conjunction with some other anti-spam tool.

2.4 Signature-based Approach

The signature-based approach compares every new incoming email with a known set of spam. The known spam is collected from honey pots and deliberately created fake email addresses. When any new spam message comes to light, all other email services are alerted.

The signature-based approach works in this way. Each character in an email carries a weight. The sum of the weights of all characters gives a final score that is used as the signature of that email. The signature of every new message is compared with the signatures of known spam. If the signatures match, the new email is classified as spam.
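As a rough illustration of this idea, the following Python sketch computes a signature by summing one weight per character. The weighting scheme is an assumption made for the example (the weight of a character is simply its Unicode code point), and the stored spam message is hypothetical; production signature systems use their own, more robust digests.

```python
def signature(message: str) -> int:
    """Sum a per-character weight; here the weight is the code point (assumed)."""
    return sum(ord(ch) for ch in message)


# Hypothetical set of signatures derived from spam caught in a honey pot.
KNOWN_SPAM_SIGNATURES = {signature("Get rich instantly! Visit our site now.")}


def matches_known_spam(message: str) -> bool:
    """Classify a message as spam if its signature matches a known one."""
    return signature(message) in KNOWN_SPAM_SIGNATURES
```

An exact copy of the stored message matches, but appending even a few characters changes the sum, so the altered copy no longer matches any stored signature.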


Advantages:

The signature-based approach rarely generates false positives. Even false negatives generated by this approach are few as compared to other approaches. BrightMail is a successful spam filter that follows this approach.

Disadvantages:

These filters are easy to defeat. Since they are backward-looking, they take action only after they become aware of a spam message. By the time the honey pot has attracted a new spam message, a signature has been assigned to it, and the updates have been sent and installed at all ISPs, the spammer has already sent millions of copies. Even a small change in a message might make the filter useless. Just by adding some random characters to each copy, the spammer makes its signature differ from that of the original spam. Thus, all such spam messages will pass for legitimate messages. Research on the logic behind adding random characters to messages is active, but, even so, spammers are always ahead of these filters. The efficiency of these filters is found to be 50 to 70 percent. In addition, these filters can only be used at the ISP level as first-pass filters.

2.5 Filters Fight Back

The filters fight back approach is the most aggressive of all the approaches adopted for filtering spam. It employs the policy of “attack is the best self-defense.” A spam message usually includes URLs for readers to visit a site. The purpose may be commercial or social. The approach works in this way. Once a message is detected as spam, these filters send a number of requests to the sites specified by those URLs. A user can personally configure the number of requests. If a spam message is sent to thousands of users, there is a high possibility that the server hosting that site would receive millions of requests, increasing its cost and bandwidth use and effectively shutting down all its services. Such filters are also known as auto-retrieving filters [6].

Advantages:

Since spam itself is the reason for the spammer’s loss, spammers would hesitate to send spam to unknown users. More recipients for the spam would create more loss to the spammer’s web server.

Disadvantages:

The job prior to fighting back is to detect a spam. Any URL sent to thousands of users mainly indicates a spam. However, at the bottom of every message, there are many advertisements, such as Yahoo, MSN, etc., many of which are legitimate URLs. If the site turns out to be legitimate, negatively affecting the site might involve legal proceedings. To avoid such confusion, auto-retrieval filters should refer to blacklists for servers that are banned. Further, the servers need to be blacklisted by human intervention, thus ensuring that the auto-retrieval filters send requests only to web servers that are blacklisted.

With this approach, there is an easy way out for spammers. They need only include active unsubscribe links in their messages. In that way, users with auto-retrieval filters will be unsubscribed from the mailing, which is good news for them. However, spam is not reduced globally. There is also the possibility that spammers might include their contact information and an image for marketing purposes instead of their URLs. Doing so would wholly eliminate the danger of auto-retrieval filters.


To make this filter more effective, one needs to fine-tune the filter to each user’s incoming messages. Fine-tuning a filter requires time and expertise, both of which are often hard to come by. Thus, one needs a filter that is adaptive in nature, one that self-learns from the given legitimate messages and spam [12].


CHAPTER 3

BAYESIAN CLASSIFIERS

To understand the workings of Bayesian classifiers, one needs to know the underlying concept of Bayesian networks. A Bayesian classifier is nothing but the application of a Bayesian network to the process of text classification.

3.1 Bayesian Networks

Bayesian networks are probabilistic networks. They are used as problem-solving models in different fields. In a Bayesian network, nodes indicate the variables of the problem, and the directed edges between the nodes indicate the relationships between the variables. A Bayesian network, in our case, is used to represent a probability distribution. In such a graph, a node represents a random variable, and a directed edge indicates a probabilistic dependency from the variable denoted by the parent node to that of the child. Hence, any node in the network is conditionally independent of its non-descendants, given its parents. Each node is associated with a conditional probability table that gives the distribution over that node for every possible assignment of values to its parents [10, 13].

Let’s formulate the Bayesian network to solve our classification problem. Let C be the class variable that indicates to which class (legitimate / spam) a message belongs, and let node Xi denote an attribute (a token, in our case) of the message. We need to determine the class of a message, that is, the probability of each class ck given specific values x = (x1, ..., xn) for the attributes X = (X1, ..., Xn). In our case, each value is either 0 or 1, depending on whether the token is present in the message. This probability can be obtained using Bayes’ theorem:

$$P(C = c_k \mid X = x) = \frac{P(X = x \mid C = c_k)\, P(C = c_k)}{P(X = x)}$$

P(X = x | C = ck) is difficult to calculate, as there is a high chance that an attribute depends on some other set of attributes. The easy solution is to assume that the attributes are conditionally independent of each other given the class. Consequently, the probability becomes:

$$P(X = x \mid C = c_k) = \prod_{i} P(X_i = x_i \mid C = c_k)$$

If an email message is considered to be a set of attributes (i.e., words), then using a Bayesian network, we can calculate the probability of whether a message belongs to a specific class, namely, a legitimate message or a spam.

3.2 Experimental Bayesian Filter

The experimental Bayesian filter is a content-based approach. This attribute gives it an advantage over other approaches. Spammers cannot freely modify the content to deceive the filter, as the content is the only reason to send spam in the first place. Content in this case includes the headers and the message itself.

First, the filter must be trained. A considerable amount of good mail and spam is required to train the filter. Two tables are maintained, one each for legitimate email and spam. Let us call them the good table and the bad table, respectively. The good table contains the tokens that occur in the good emails, along with their number of occurrences. Similarly, the tokens that occur in spam emails, along with their number of occurrences, are maintained in the bad table.

Based on these two tables, another table will be built using the Bayesian formula of probability [4, 5]:

$$P(\text{bad} \mid \text{token}) = \frac{A}{A + B}$$

$$A = P(\text{token} \mid \text{bad}) \cdot P(\text{bad}), \qquad B = P(\text{token} \mid \text{good}) \cdot P(\text{good})$$

P(token | bad) = probability that the token is present, given that the email is spam.

P(token | good) = probability that the token is present, given that the email is good.

P(bad | token) = probability that the email is spam, given that the specific token is present.

Let’s call this table the spam probability table. This table contains all tokens that occur in any of the mails, along with the probability that a mail containing that token is spam. Ideally, the probability of a mail being spam should be calculated given the presence of the whole combination of tokens in the email. But this probability is difficult to calculate, as the number of tokens is huge, giving rise to a very large number of combinations. To make matters simpler, we assume that the tokens are independent of each other. Thus, the probability for an email is merely the combined probability of its tokens. This implementation of the Bayesian formula is therefore known as the naïve Bayesian rule.

For every new email, a fixed number of effective tokens would be collected to calculate the combined probability. This number can vary from 5 to 25 depending on the success of the filter based on one’s personal messages.

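$$A = p(a) \cdot p(b), \qquad B = (1 - p(a)) \cdot (1 - p(b))$$

$$\text{Score} = \frac{A}{A + B}$$

Here p(a) and p(b) are the spam probabilities of two tokens a and b.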

If the score rises above a threshold value, the email is declared spam; otherwise, it is treated as a good email. Selecting only a fixed number of tokens, such as 15, is one of the filtering techniques used with Bayesian filters. Effective tokens are those whose probabilities differ the most, on either side, from the threshold value. These tokens are either significantly good or significantly bad, and they are responsible for deciding the overall status of the message.
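The combination of token probabilities into a score extends naturally from two tokens to any number of tokens. The Python sketch below shows this under stated assumptions: the function names are ours, and the default threshold of 0.9 is only an example value. It also assumes the probabilities never include both an exact 0 and an exact 1, since the denominator would then be zero.

```python
from math import prod


def combined_score(probabilities):
    """Combine per-token spam probabilities as Score = A / (A + B)."""
    a = prod(probabilities)                   # A = p(a) * p(b) * ...
    b = prod(1.0 - p for p in probabilities)  # B = (1 - p(a)) * (1 - p(b)) * ...
    return a / (a + b)


def is_spam(probabilities, threshold=0.9):
    """Declare spam when the combined score rises above the threshold."""
    return combined_score(probabilities) > threshold
```

For example, two tokens with probability 0.9 each combine to 0.81 / 0.82, a score higher than either token alone, while a token of 0.9 and a token of 0.1 cancel out to a score of about 0.5.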

The challenges are speed, efficiency, database size, and the need for training data. The larger the set of tokens, the greater the size of the database and the longer the training time. So, there is a need to consider only those tokens that make an impact in deciding the status of an email. Since training and email classification take place during the same period of time, special care must be taken to make both operations as independent as possible.

If the token is present only in the good table, its probability in the spam probability table would be recorded as 0.1. If the token is present only in the bad table, its probability in the spam probability table would be recorded as 0.9.
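The construction of the spam probability table can be sketched as follows. Several details are assumptions made for this illustration and are not specified in the text: tokens are obtained by lowercasing and splitting on whitespace, P(token | bad) and P(token | good) are estimated as the token's share of all token occurrences in the respective table, and the priors P(bad) and P(good) are the fractions of training messages in each class. Both training sets are assumed to be non-empty.

```python
from collections import Counter


def count_tokens(messages):
    """Build a table of token -> number of occurrences (assumed tokenizer)."""
    counts = Counter()
    for message in messages:
        counts.update(message.lower().split())
    return counts


def spam_probability_table(good_messages, bad_messages):
    """Return token -> P(bad | token), using 0.1 / 0.9 for one-sided tokens."""
    good, bad = count_tokens(good_messages), count_tokens(bad_messages)
    p_bad = len(bad_messages) / (len(good_messages) + len(bad_messages))
    p_good = 1.0 - p_bad
    total_good, total_bad = sum(good.values()), sum(bad.values())

    table = {}
    for token in good.keys() | bad.keys():
        if token not in bad:          # present only in the good table
            table[token] = 0.1
        elif token not in good:       # present only in the bad table
            table[token] = 0.9
        else:
            a = (bad[token] / total_bad) * p_bad      # P(token|bad) * P(bad)
            b = (good[token] / total_good) * p_good   # P(token|good) * P(good)
            table[token] = a / (a + b)
    return table
```

With a small hypothetical corpus in which "free" accounts for 3 of 6 spam token occurrences and 1 of 7 legitimate ones, and equal priors, the table assigns "free" a probability of 7/9.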

3.3 Cost Evaluation Measures

A false positive is mistakenly classifying a legitimate email as a spam, and a false negative is mistakenly classifying a spam as a legitimate email. The cost of a false positive is much higher than that of a false negative. The existence of false positives destroys the faith of the user in his spam filter because email users tend to delete spam from a bulk folder without reading them, and deleting legitimate messages (due to spam filters) is unacceptable. In that case, it is acceptable to allow some false negatives rather than having any false positives.

Let L→S denote the false positive error type and S→L the false negative error type. Assuming that L→S is λ times costlier than S→L, we classify a message as spam if:

$$\frac{P(C = \text{spam} \mid X = x)}{P(C = \text{legitimate} \mid X = x)} > \lambda$$

In our case, we take the independence assumption of the naïve Bayesian filter to hold. Since P(C=spam | X=x) = 1 - P(C=legitimate | X=x), this leads to the following criterion:

$$P(C = \text{spam} \mid X = x) > t, \qquad t = \frac{\lambda}{1 + \lambda}, \qquad \lambda = \frac{t}{1 - t}$$

where t is the threshold value.
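The relation between λ and t is easy to check numerically. The following sketch uses function names of our own choosing and simply evaluates the two formulas and the resulting decision rule.

```python
def threshold_from_lambda(lam: float) -> float:
    """t = lambda / (1 + lambda)."""
    return lam / (1.0 + lam)


def lambda_from_threshold(t: float) -> float:
    """lambda = t / (1 - t); t must be strictly below 1."""
    return t / (1.0 - t)


def classify(p_spam: float, lam: float) -> str:
    """Label a message as spam only if P(spam | x) exceeds the threshold t."""
    return "spam" if p_spam > threshold_from_lambda(lam) else "legitimate"
```

A message with P(spam | x) = 0.95 is classified as spam when λ = 9 (t = 0.9) but as legitimate when λ = 999 (t = 0.999), which shows how a higher cost of false positives makes the filter more conservative.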

Depending on the action taken on messages classified as spam, the threshold value can be altered. If spam is deleted directly once it is classified, then t is held as high as 0.999 (λ = 999), i.e., blocking one legitimate message is as bad as letting 999 spam messages pass the filter. Lower values of λ are acceptable depending on how the spam folder is configured. If the configuration is set up to return the mail to the sender, asking him to resend it to a private, unfiltered email address of the recipient, then λ = 9 (t = 0.9) seems reasonable. Even λ = 1 (t = 0.5) is acceptable if the recipient goes through every email in the bulk folder before manually deleting it.

Two measures can be used in this context to evaluate the performance of a filter, namely, spam recall and spam precision. Let n(L→S) and n(S→L) be the numbers of L→S and S→L errors, and let n(L→L) and n(S→S) count the correctly treated legitimate and spam messages, respectively. Spam recall (SR) and spam precision (SP) are defined as follows:

$$SR = \frac{n(S \to S)}{n(S \to L) + n(S \to S)}, \qquad SP = \frac{n(S \to S)}{n(L \to S) + n(S \to S)}$$
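Both measures follow directly from the four counts. The sketch below uses parameter names of our own choosing; the counts in the example that follows are hypothetical and are not results from our experiments.

```python
def spam_recall(n_s_to_s: int, n_s_to_l: int) -> float:
    """SR: share of all spam messages that the filter actually caught."""
    return n_s_to_s / (n_s_to_l + n_s_to_s)


def spam_precision(n_s_to_s: int, n_l_to_s: int) -> float:
    """SP: share of messages marked as spam that really were spam."""
    return n_s_to_s / (n_l_to_s + n_s_to_s)
```

For instance, if a filter catches 90 spam messages, misses 10, and blocks 2 legitimate messages, spam recall is 90/100 and spam precision is 90/92.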

3.4 Independence Factor of a Bayesian Network

Using a Bayesian network, we can model the complex dependencies between features to infer the solution class. As the number of features increases, it becomes increasingly difficult to classify a message while taking all its dependencies into account. As a result, spam filters implement a naïve Bayesian concept wherein features are assumed to be independent of each other. This assumption is compensated for by setting the threshold to a higher value.

$$P(X = x \mid C = c_k) = \prod_{i} P(X_i = x_i \mid C = c_k)$$

A naïve Bayesian model lies at the most restrictive end of the feature dependence spectrum. Research has been done on the performance of spam filters that allow some degree of dependence between features. This study can be formalized by introducing the notion of k-dependence Bayesian classifiers. A k-dependence Bayesian classifier is a Bayesian network wherein each feature is allowed to have a maximum of k parents. Based on this definition, a naïve Bayesian filter is a 0-dependence Bayesian classifier. Likewise, an ideal Bayesian filter (i.e., a full Bayesian filter with no independence assumptions) is an (N-1)-dependence Bayesian classifier, where N is the number of domain features.

By varying the value of k, one can move step by step along the feature dependence spectrum and analyze the performance of the spam filter at every step. It is also worth noting that as k grows, there are more conditioning variables for the same amount of data. This implies a larger probability space to estimate from the same data, causing inaccuracy in the probability estimates and leading to an overall decrease in performance. This performance problem has been observed in many domains when going from k=2 to k=3.
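To give a feel for how quickly the estimation problem grows with k, the following sketch counts the probabilities that must be estimated. It rests on assumptions that go beyond the text: features and the class are binary, the features are placed in a fixed order so that feature i can only take earlier features as parents, every feature also has the class as a parent, and each feature uses as many feature parents as it is allowed.

```python
def parameter_count(n_features: int, k: int) -> int:
    """Count conditional probabilities to estimate in a k-dependence classifier."""
    total = 0
    for i in range(n_features):
        feature_parents = min(i, k)    # only earlier features can be parents
        parents = feature_parents + 1  # plus the class variable (assumed)
        total += 2 ** parents          # one probability per parent configuration
    return total
```

Under these assumptions, 10 features require 20 estimates for k = 0 (the naïve case), 70 for k = 2, 126 for k = 3, and 2046 for k = 9 (the full model), all from the same training data.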


CHAPTER 4

CLASSIFICATION TECHNIQUES IN A NAÏVE BAYESIAN FILTER

Once the naïve Bayesian filter has been trained using large datasets of spam and non-spam messages, it is ready to perform its basic function of filtering, i.e., classifying new, unseen incoming messages. Currently, there are many classification techniques used in the naïve Bayesian filters available on the market. We discuss four significant techniques in detail in this chapter.

4.1 Use of All Tokens

This technique demands use of all tokens from a new email for classification. As each token is associated with a probability that determines the chances of the email being a spam, tokens from each new email would be used to calculate a combined probability to assign a final score to the email. In the case of a new token in an email (i.e. with no record in the database), it would be assigned a probability of 0.4. This assumption has been practically implemented and been found successful in naïve Bayesian filters. It implies that a new token is considered to be a good token rather than a part of a spam. It also indicates the positive approach adopted by spam filters, since the cost of a false positive is much higher than that of a false negative. However, we turn off this feature for the purposes of our evaluation, because we do not want to favor one technique (by taking a positive approach) over others. This global technique makes sense as we parsed all tokens from training datasets to build a database to be used for classification. Hence, it is logical to use the same technique for classification.

It should be noted that the classification phase is critical because of the heavy cost of a false positive, whereas in the training phase we know exactly whether an email is spam or not. This technique might be deceived by an email that tells a long story of how a person got rich instantly, followed by a link to a spam site. Such an email would contain a large number of good tokens compared to a typical spam message, so there is a high possibility that it would be categorized as a good email. However, spammers tend to avoid writing long stories, as readers are more likely to delete than read a long article from an unknown source. Thus, the all tokens method is found to be effective in practical filters. For example, Bill Yerazunis has used this technique in his Controllable Regex Mutilator (CRM114).
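
As an illustration of the all tokens technique, the following Java sketch combines the probabilities of every token in a message into one score. The combination rule is the one described by Graham [4]; this chapter does not restate the formula, so its use here is an assumption. The class and method names are hypothetical, duplicates are kept as they occur, and the 0.4 default for unseen tokens is the value quoted above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class AllTokensScorer {
    /** Probability for a token with no record in the database (value quoted in the text). */
    static final double UNKNOWN_TOKEN_PROBABILITY = 0.4;

    /**
     * Combines token probabilities p1..pn as
     * (p1*...*pn) / (p1*...*pn + (1-p1)*...*(1-pn)).
     * Sums of logarithms are used so that long messages do not underflow.
     * Probabilities are assumed to lie strictly between 0 and 1.
     */
    static double combine(List<Double> probabilities) {
        double logSpam = 0.0;
        double logGood = 0.0;
        for (double p : probabilities) {
            logSpam += Math.log(p);
            logGood += Math.log(1.0 - p);
        }
        // Algebraically equal to the ratio in the comment above.
        return 1.0 / (1.0 + Math.exp(logGood - logSpam));
    }

    /** Scores a message from every token occurrence it contains. */
    static double score(List<String> tokens, Map<String, Double> spamProbability) {
        List<Double> probabilities = new ArrayList<>();
        for (String token : tokens) {
            probabilities.add(spamProbability.getOrDefault(token, UNKNOWN_TOKEN_PROBABILITY));
        }
        return combine(probabilities);
    }
}
```

A message whose score exceeds the chosen threshold would be treated as spam. An empty message scores a neutral 0.5 under this sketch.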

4.2 Use of a Fixed Number of Tokens

The use of a fixed number of tokens technique, successfully implemented by Paul Graham, takes only a fixed number of tokens from a new email into consideration when assigning its final score. The number can be, for example, 15, 20, or 25, but the tokens chosen are the most effective ones in the given email. An effective token is one whose probability deviates the most from 0.5 in either direction, i.e. it can be a good token or a bad one. The combined probability of these tokens gives the final score of the new email. In that way, the most effective tokens are emphasized. The technique directly targets those words that are found most of the time in either legitimate emails or spam. As a result, the final score will most probably end up near 1 if the email is spam, or near 0 otherwise. Thus, this technique largely avoids the ambiguity that arises when the final score ends up near 0.5. This notion of effectiveness was proposed by Sahami et al., who measured it using the mutual information formula. It is recommended that the same token not be counted more than once when calculating the final score. In that way, no single token dominates the decision, even if it occurs several times in the message. The number of tokens (15/20/25) is a personal decision, based on the success of the spam filter on personal emails. If the number of tokens in a new email happens to be less than a fixed number, say 10, then the use of all tokens is the logical back-up technique for classification.

This technique has some advantages over other techniques.

1) To avoid the problem of false positives, the threshold can be raised from 0.5 to a value near 0.9.

2) In the case of very large emails, the classification would be much faster.
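
The selection step of this technique can be sketched in Java as follows. The sketch only picks the tokens; the combined probability is computed separately. The class name, the method name, and the choice to skip tokens that have no record in the table are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;

final class FixedTokenSelector {
    /**
     * Returns the spam probabilities of the (at most) n distinct tokens whose
     * probability lies farthest from the neutral value 0.5.
     * Tokens missing from the table are skipped in this sketch.
     */
    static List<Double> mostEffective(List<String> tokens,
                                      Map<String, Double> spamProbability,
                                      int n) {
        List<Double> probabilities = new ArrayList<>();
        for (String token : new LinkedHashSet<>(tokens)) { // count each token once
            Double p = spamProbability.get(token);
            if (p != null) {
                probabilities.add(p);
            }
        }
        // Largest deviation from 0.5 first; good and bad tokens are treated alike.
        probabilities.sort(Comparator.comparingDouble((Double p) -> -Math.abs(p - 0.5)));
        return new ArrayList<>(probabilities.subList(0, Math.min(n, probabilities.size())));
    }
}
```

Calling `mostEffective(tokens, table, 15)` would correspond to the 15-token configuration. With fewer than n known tokens, the method simply returns all of them, which matches the back-up behavior described above.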

4.3 Use of a Standard Deviation

This technique, like the previous one, considers only the effective tokens. However, it selects them by their spam probability rather than by a fixed count. If the standard deviation (stddev) has the value x, then all tokens with a spam probability in the range 0.5-x to 0.5+x are discarded. The remaining tokens are the effective ones used to calculate the combined probability and assign a final score to the new email. BogoFilter, a spam filter that is currently available on the market, has adopted this approach. The value of the stddev can be varied based on the filter’s success on one’s personal messages. The recommended value, which has been found successful, is 0.4. Thus, the tokens under consideration are those with probabilities of 0.1 and lower (0.5-0.4) and 0.9 and higher (0.5+0.4).

The distinguishing feature of this technique is that it assigns the score to the email independent of its size. Depending on the content of an email, there might be only ten effective tokens, or there may be more than 100. But for every classification, only effective tokens with probabilities of 0.9 and above or 0.1 and below are considered. As with the previous technique, the score in this case would be near 1 (if spam) or near 0 otherwise. Thus, it is less likely that the score would end up near 0.5, which reduces the possibility of false positives.

The same token should not be considered more than once, to avoid interference from any specific token that occurs several times in the message. The threshold, as in the previous technique, can be raised to 0.9 to reduce the possibility of false positives. The processing time for classification varies with the size of the email.
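
A minimal Java sketch of this selection rule is shown below. The class and method names are hypothetical, tokens without a record in the table are skipped, and the small tolerance on the boundary is an implementation assumption so that probabilities of exactly 0.1 and 0.9 are kept when x = 0.4.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;

final class DeviationSelector {
    /** Guards the boundary cases 0.5 - x and 0.5 + x against floating-point rounding. */
    private static final double EPSILON = 1e-12;

    /**
     * Keeps the probabilities of distinct tokens that lie at least x away from 0.5,
     * i.e. discards everything strictly inside the band (0.5 - x, 0.5 + x).
     */
    static List<Double> outsideNeutralBand(List<String> tokens,
                                           Map<String, Double> spamProbability,
                                           double x) {
        List<Double> kept = new ArrayList<>();
        for (String token : new LinkedHashSet<>(tokens)) { // each token counted once
            Double p = spamProbability.get(token);
            if (p != null && Math.abs(p - 0.5) >= x - EPSILON) {
                kept.add(p);
            }
        }
        return kept;
    }
}
```

Unlike the fixed number technique, the size of the returned list depends on the message: it may be empty for a message with no strongly indicative tokens, in which case a caller would need its own fallback.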

4.4 Use of a Relative Number of Tokens

We propose a technique and evaluate it along with the other techniques that have been successful in practice. Since the naïve Bayesian filter is trained with the contents of email messages, it is logical to apply the same content-based approach for classification as well. In this technique, we select some percentage (say 30 percent) of the total tokens of an email message, choosing the most effective ones. These tokens are used to calculate the combined probability and assign a final score to the email message. The percentage value can be tuned, based on the success of the filter on personal email messages.

This approach combines the two preceding techniques: the use of a fixed number of tokens and the use of a standard deviation. It values both the effectiveness and the number of tokens when classifying a message. So, if an email contains 100 tokens, then the 30 most effective tokens among them are used for classification. There is a high possibility that most of these 30 tokens would also satisfy the stddev of 0.4, i.e. have probabilities of 0.1 and lower or 0.9 and higher. In that way, we utilize the advantages of both techniques.

Because it is a content-based approach, there is a chance that the final score of an email falls near 0.5. To avoid false positives, the threshold can be raised to a higher value. The processing time for classification depends on the size of the email message.
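
The percentage rule can be sketched in Java as follows. The names are hypothetical, and two details are assumptions not fixed by the text: the percentage is applied to the number of distinct tokens that have a record in the table, and the resulting count is rounded up.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;

final class RelativeTokenSelector {
    /**
     * Returns the probabilities of the most effective tokens, where the number
     * of tokens kept is the given percentage of the message's known tokens.
     */
    static List<Double> mostEffectivePercent(List<String> tokens,
                                             Map<String, Double> spamProbability,
                                             int percent) {
        List<Double> probabilities = new ArrayList<>();
        for (String token : new LinkedHashSet<>(tokens)) { // each token counted once
            Double p = spamProbability.get(token);
            if (p != null) {
                probabilities.add(p);
            }
        }
        // Integer ceiling of percent * size / 100: 30 percent of 100 tokens gives 30.
        int n = (percent * probabilities.size() + 99) / 100;
        // Largest deviation from 0.5 first.
        probabilities.sort(Comparator.comparingDouble((Double p) -> -Math.abs(p - 0.5)));
        return new ArrayList<>(probabilities.subList(0, Math.min(n, probabilities.size())));
    }
}
```

Because the count scales with the message, a long message contributes more tokens to the score than a short one, which is the main difference from the fixed number technique.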

CHAPTER 5

EXPERIMENTS

Our experiment comprises two phases: the training phase and the classification phase. In the training phase, the filter is trained using a known corpus of spam and good emails. The tokens appearing in each corpus and their total occurrences are maintained in a database. Based on its occurrences in the spam and good emails, each token is assigned a probability that an email is spam given the token’s presence. Then, using this knowledge of tokens, the filter classifies every new incoming mail in the classification phase. Once the status of a new mail is confirmed, all its tokens are also recorded, thus updating the database. This self-learning function sets our filter apart from other available spam filters. Even if the filter misclassifies a message, the user can correct it, and the filter updates its database accordingly. Thus, the filter learns from its mistakes, too.
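
To make the training step concrete, here is a small Java sketch of how token counts could be turned into a spam probability. The formula is a simplified form of the estimate in Graham [4] (without his weighting of good counts or his minimum-occurrence rule); the report's exact formula is not restated in this chapter, so this is an assumption. The class name, the in-memory maps standing in for the database, and the clamping bounds 0.01 and 0.99 are likewise illustrative.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class TokenTrainer {
    private final Map<String, Integer> spamCounts = new HashMap<>();
    private final Map<String, Integer> goodCounts = new HashMap<>();
    private int spamMails;
    private int goodMails;

    /** Records the tokens of one message whose status (spam or legitimate) is known. */
    void train(List<String> tokens, boolean isSpam) {
        Map<String, Integer> counts = isSpam ? spamCounts : goodCounts;
        for (String token : tokens) {
            counts.merge(token, 1, Integer::sum);
        }
        if (isSpam) {
            spamMails++;
        } else {
            goodMails++;
        }
    }

    /** Estimates the probability that a message is spam given that it contains the token. */
    double spamProbability(String token) {
        int b = spamCounts.getOrDefault(token, 0);
        int g = goodCounts.getOrDefault(token, 0);
        if (b + g == 0) {
            return 0.4; // unseen token: default value mentioned in Section 4.1
        }
        // Occurrence counts can exceed the number of mails, hence the cap at 1.
        double spamFreq = spamMails == 0 ? 0.0 : Math.min(1.0, (double) b / spamMails);
        double goodFreq = goodMails == 0 ? 0.0 : Math.min(1.0, (double) g / goodMails);
        double p = spamFreq / (spamFreq + goodFreq);
        // Keep the estimate strictly between 0 and 1.
        return Math.max(0.01, Math.min(0.99, p));
    }
}
```

Self-learning then amounts to calling `train` again once the status of a newly classified message has been confirmed or corrected by the user.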

We used 1250 legitimate messages and 11350 spam messages. The legitimate messages come from my student webmail account assigned to me by Utah State University. The spam messages were collected from an archive provided by Nik Martin, available at the site hosted by Paul Graham (www.paulgraham.com) [6]. The spam was collected over the last four to five years. The proportion of spam to legitimate messages is very high, making it more likely that legitimate messages will be misclassified as spam. This makes the situation more challenging, as the cost of false positives is much higher than that of false negatives. We feel that by minimizing the false positives in such a situation, we obtain an efficient Bayesian spam filter. Moreover, by recording tokens from such a large number of spam messages, we have covered almost all spam topics and are in a good position to classify new incoming mail.

Each word in each email message is considered to be a token. The whole message, including the header, is parsed for tokens. The token separator is a blank space. Words quoted in double or single quotes, numbers, and all words separated by blank spaces are considered tokens. The number of tokens under study and used for classification is 90930. Since we are not using any type of lemmatizer, we consider different forms of the same word as different tokens. For example, run, running, and runner would all be considered different tokens even though they stem from the single word “run.” There are studies [7] that show the positive effect of a lemmatizer on a filter’s performance. The implementation of a lemmatizer is one of the topics of our future study (see Chapter 8).
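
The tokenization rule can be illustrated with a very small Java sketch that splits on blank space only. It is a simplification: the appendix code relies on `StreamTokenizer`, which also treats quoted strings as single tokens, and that behavior is not reproduced here. The class name is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class WhitespaceTokenizer {
    /** Splits message text on runs of whitespace; no stemming or lemmatizing is applied. */
    static List<String> tokenize(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return new ArrayList<>();
        }
        return new ArrayList<>(Arrays.asList(trimmed.split("\\s+")));
    }
}
```

With this rule, "run", "running", and "runner" remain three separate tokens, as described above.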

Only the message content is used for classification purposes. Doing so prevents tokens present in headers from interfering in determining the status of a message. In that way, there is no bias among the classification techniques under evaluation, as some techniques consider only a few (or a percentage of) tokens when assigning a final score to the message.

Our evaluation was conducted in the classification phase. We evaluated four effective filtering techniques of the Bayesian spam filter for their classification performance. We evaluated these techniques using cost-sensitive measures, as we believe that the cost of a false positive is much higher than that of a false negative. Eighty new incoming messages were tested in two batches (first batch: 50; second batch: 30) to obtain significant evaluation results. These test messages belong to the same email account (i.e. my webmail account) previously used in the training phase. Thus, we avoided any type of erratic behavior from the anti-spam filter. The effective configuration of each technique was used for evaluation purposes. In the standard deviation technique, the value of the standard deviation was set to 0.4, and in the percentage technique, 30 percent of the total tokens were used to calculate the final score. The tabulated results and related graphs are explained in the next chapter.

CHAPTER 6

COST-SENSITIVE EVALUATION

The evaluation measures frequently used in classification are accuracy (Acc) and the error rate (Err = 1 - Acc). Accuracy is the fraction of all messages that are classified correctly, i.e. spam classified as spam and legitimate messages classified as legitimate. The error rate is the fraction of all messages that are false positives or false negatives.

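$$\mathrm{Acc} = \frac{n_{S \to S} + n_{L \to L}}{N_L + N_S}, \qquad \mathrm{Err} = \frac{n_{L \to S} + n_{S \to L}}{N_L + N_S}$$

where $N_L$ and $N_S$ are the numbers of legitimate and spam messages, respectively, and $n_{X \to Y}$ is the number of messages of class X classified as class Y.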

In our cost-sensitive evaluation, we assume that a false positive is much more costly than a false negative. However, the above formulae for accuracy and error rate do not take this cost into account. Let us assume that a false positive is λ times more costly than a false negative, the implication being that we treat each legitimate message as being worth λ messages. So, if a legitimate message is misclassified, it counts as λ errors, and if it is classified correctly, it counts as λ successes. This assumption can be formulated as a weighted accuracy (WAcc) and a weighted error rate (WErr = 1 - WAcc):

$$\mathrm{WAcc} = \frac{\lambda\, n_{L \to L} + n_{S \to S}}{\lambda N_L + N_S}, \qquad \mathrm{WErr} = \frac{\lambda\, n_{L \to S} + n_{S \to L}}{\lambda N_L + N_S}$$

To get a better idea of the filter’s performance in terms of accuracy and error rate, we must compare these measures with a ‘baseline’ approach [7]. In the baseline approach, we assume that no filter is active, i.e. all spam passes through, and legitimate messages are never blocked. The weighted accuracy and error rate of the baseline are:

$$\mathrm{WAcc}^{b} = \frac{\lambda N_L}{\lambda N_L + N_S}, \qquad \mathrm{WErr}^{b} = \frac{N_S}{\lambda N_L + N_S}$$

We calculate TCR (Total Cost Ratio) to compare with the baseline approach [7]:

$$\mathrm{TCR} = \frac{\mathrm{WErr}^{b}}{\mathrm{WErr}} = \frac{N_S}{\lambda\, n_{L \to S} + n_{S \to L}}$$

The higher the value of TCR, the better the performance. With TCR < 1, the baseline is the better option, implying that using no filter gives better results than using the filter. If cost is proportional to wasted time, then TCR compares the time wasted manually deleting all spam messages (when no filter is used) with the sum of the time wasted manually deleting the spam messages misclassified as legitimate (nS->L) and the time wasted recovering the legitimate messages mistakenly classified as spam (λ nL->S).
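
The cost-sensitive measures can be computed directly from the four classification counts. The Java sketch below is illustrative (the class and method names are hypothetical) and treats a legitimate message as worth λ messages, so λ weights nL->L in the weighted accuracy and nL->S in the weighted error.

```java
final class CostSensitiveMetrics {
    /** WAcc = (lambda * nLL + nSS) / (lambda * N_L + N_S). */
    static double weightedAccuracy(double lambda, int nLL, int nSS, int nLS, int nSL) {
        double totalLegit = nLL + nLS; // N_L
        double totalSpam = nSS + nSL;  // N_S
        return (lambda * nLL + nSS) / (lambda * totalLegit + totalSpam);
    }

    /** WErr = (lambda * nLS + nSL) / (lambda * N_L + N_S), which equals 1 - WAcc. */
    static double weightedError(double lambda, int nLL, int nSS, int nLS, int nSL) {
        double totalLegit = nLL + nLS; // N_L
        double totalSpam = nSS + nSL;  // N_S
        return (lambda * nLS + nSL) / (lambda * totalLegit + totalSpam);
    }

    /** TCR = N_S / (lambda * nLS + nSL); the result is infinite if the filter makes no errors. */
    static double totalCostRatio(double lambda, int nSS, int nLS, int nSL) {
        double totalSpam = nSS + nSL; // N_S
        return totalSpam / (lambda * nLS + nSL);
    }
}
```

As a check against the reported results, the all token technique at λ = 1 has nL->S = 11, nS->L = 0, nL->L = 14, and nS->S = 25 (Table 6.1). `weightedAccuracy(1, 14, 25, 11, 0)` gives 39/50 = 0.78 and `totalCostRatio(1, 25, 11, 0)` gives 25/11 ≈ 2.27, matching the 78% and 2.27 listed in Table 6.2.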

Table 6.1 lists the false positives, false negatives, and correct classifications of all four techniques for different threshold configurations. Table 6.2 lists spam recall, spam precision, weighted accuracy, baseline weighted accuracy, and total cost ratio (TCR) for the same configurations. TCR is calculated for all techniques at each threshold value. In a cost-sensitive evaluation, TCR can be used as the measure of performance. Table 6.1 shows that raising the threshold lowers the number of false positives and raises the number of false negatives for all four techniques. Table 6.2 indicates that the fixed token technique outperforms the others for every value of λ. Unlike the other techniques, the fixed token approach gives excellent results for λ=999. The all token approach is the worst of the four. Our percentage approach performs better than the standard deviation approach for λ=1 and λ=999. Based on both tables, we can say that lowering the threshold from 0.999 to 0.5 risks an increase in the number of false positives. At the same time, the evaluation shows an increase in TCR values, indicating that the increase in false positives does not prove costly to us. In practice, however, no user would want to use a threshold of 0.5, since it implies that the user has to look through every spam mail before deleting it. The filter would merely be helping the user locate the spam. An ideal filter would be one in which spam messages are deleted without the supervision of the user and no legitimate message is deleted in the process.

One can observe that nS->L is much smaller than nL->S. This is due to the fact that the number of spam messages used in the training phase is far greater than the number of legitimate messages. Our filter, being a self-learner, should improve its performance in the future and keep nS->L to a minimum. We believe that, after a period of time, our filter would reach its peak performance and remain constant thereafter. The ideal filter should give a spam precision of 100 percent, a spam recall of 100 percent, and a TCR greater than 1 for all values of λ.
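
Spam recall and spam precision are reported in the tables without being restated as formulas. Assuming the usual definitions, which reproduce the reported values, they can be written with the same counts as the accuracy formulas (the abbreviations SR and SP are introduced here only for illustration):

```latex
\[
\mathrm{SR} = \frac{n_{S \to S}}{n_{S \to S} + n_{S \to L}},
\qquad
\mathrm{SP} = \frac{n_{S \to S}}{n_{S \to S} + n_{L \to S}}
\]
```

For the all token technique at λ = 1 (nS->S = 25, nS->L = 0, nL->S = 11), this gives a spam recall of 25/25 = 100% and a spam precision of 25/36 ≈ 69.44%.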

Table 6.1. The results (false positives, false negatives and correct classifications) of four filtering techniques for different values of λ.

Filter Technique         λ      nL->S   nS->L   nL->L   nS->S
a) All Token tech        1      11      0       14      25
b) Fixed Token tech      1      6       0       19      25
c) Std Deviation tech    1      9       1       16      24
d) Percentage tech       1      9       0       16      25
a) All Token tech        9      11      0       14      25
b) Fixed Token tech      9      5       0       20      25
c) Std Deviation tech    9      6       1       19      24
d) Percentage tech       9      9       0       16      25
a) All Token tech        999    9       3       16      22
b) Fixed Token tech      999    0       5       25      20
c) Std Deviation tech    999    4       4       21      21
d) Percentage tech       999    3       6       22      19

Table 6.2. The results (TCR) of four filtering techniques for different values of λ.

Filter Technique         λ      Spam Recall   Spam Precision   Weighted Accuracy   Baseline W.Acc.   TCR
a) All Token tech        1      100%          69.44%           78%                 50%               2.27
b) Fixed Token tech      1      100%          80.65%           88%                 50%               4.17
c) Std Deviation tech    1      96%           72.73%           80%                 50%               2.5
d) Percentage tech       1      100%          73.53%           82%                 50%               2.78
a) All Token tech        9      100%          69.44%           60.4%               90%               0.25
b) Fixed Token tech      9      100%          83.33%           82%                 90%               0.56
c) Std Deviation tech    9      96%           80%              78%                 90%               0.45
d) Percentage tech       9      100%          73.53%           67.6%               90%               0.31
a) All Token tech        999    88%           70.97%           64.02%              99.9%             0.002
b) Fixed Token tech      999    80%           100%             99.98%              99.9%             5
c) Std Deviation tech    999    84%           84%              84%                 99.9%             0.006
d) Percentage tech       999    76%           86.36%           87.99%              99.9%             0.008

The evaluation of the fixed token approach above was conducted with 15 tokens. To obtain an optimal anti-spam filter, we further evaluated the fixed token approach with different numbers of fixed tokens, i.e. 5, 15, 20, and 25. Tables 6.3 and 6.4 list the results.

Table 6.3. The results (false positives, false negatives and correct classifications) of four configurations (5, 15, 20 and 25) of the fixed token approach for different values of λ.

Filter Technique    λ      nL->S   nS->L   nL->L   nS->S
a) Fixed - 5        1      4       1       21      24
b) Fixed - 15       1      7       1       18      24
c) Fixed - 20       1      8       0       17      25
d) Fixed - 25       1      10      0       15      25
a) Fixed - 5        9      4       2       21      23
b) Fixed - 15       9      6       1       19      24
c) Fixed - 20       9      8       0       17      25
d) Fixed - 25       9      10      0       15      25
a) Fixed - 5        999    1       3       24      22
b) Fixed - 15       999    4       1       21      24
c) Fixed - 20       999    7       0       18      25
d) Fixed - 25       999    9       0       16      25

Table 6.4. The results (TCR) of four configurations (5, 15, 20 and 25) of the fixed token approach for different values of λ.

Filter Technique
(Fixed Token tech)  λ      Spam Recall   Spam Precision   Weighted Accuracy   Baseline W.Acc.   TCR
a) Fixed - 5        1      96%           85.71%           90%                 50%               5.0
b) Fixed - 15       1      96%           77.42%           84%                 50%               3.125
c) Fixed - 20       1      100%          75.76%           84%                 50%               3.125
d) Fixed - 25       1      100%          71.43%           80%                 50%               2.5
a) Fixed - 5        9      92%           85.19%           84.8%               90%               0.66
b) Fixed - 15       9      96%           80%              78%                 90%               0.45
c) Fixed - 20       9      100%          75.76%           71.2%               90%               0.35
d) Fixed - 25       9      100%          71.43%           64%                 90%               0.28
a) Fixed - 5        999    88%           95.65%           95.99%              99.9%             0.02
b) Fixed - 15       999    96%           85.71%           84.01%              99.9%             0.006
c) Fixed - 20       999    100%          78.13%           72.03%              99.9%             0.003
d) Fixed - 25       999    100%          73.53%           64.04%              99.9%             0.002

The values of TCR and weighted accuracy show that 5 tokens perform better than the other configurations for each value of λ. The performance degrades as we consider a higher number of tokens for the classification. However, even this most effective configuration cannot be used as a stand-alone first-pass filter for λ=999 and λ=9. It needs the help of other techniques, such as blacklists and whitelists, for effective spam filtering.

To find an optimal number of tokens, we evaluated further configurations (3, 7, 10, and 12 tokens), concentrating on the range of 5 to 15 tokens. Tables 6.5 and 6.6 list the results. The results for 5, 7, and 10 tokens were the same as one another and remained constant across the values of λ. However, the results for 3 fixed tokens were the worst, and the results for 12 fixed tokens were close to those for 5, 7, and 10 fixed tokens. It can be said that, for the fixed token approach, the filter reaches optimal performance in the range of 5 to 12 tokens and degrades thereafter. This observation is confirmed by the plotted graphs (Figures 6.1 through 6.3), which show the peak TCR value in the range of 5 to 12.

Table 6.5. The results (false positives, false negatives and correct classifications) of four configurations (3, 7, 10 and 12) of the fixed token approach for different values of λ.

Filter Technique    λ      nL->S   nS->L   nL->L   nS->S
a) Fixed - 3        1      10      7       5       8
b) Fixed - 7        1      1       1       14      14
c) Fixed - 10       1      1       1       14      14
d) Fixed - 12       1      2       0       13      15
a) Fixed - 3        9      10      7       5       8
b) Fixed - 7        9      1       1       14      14
c) Fixed - 10       9      1       1       14      14
d) Fixed - 12       9      2       0       13      15
a) Fixed - 3        999    7       12      8       3
b) Fixed - 7        999    1       1       14      14
c) Fixed - 10       999    1       1       14      14
d) Fixed - 12       999    1       0       14      15

Table 6.6. The results (TCR) of four configurations (3, 7, 10 and 12) of the fixed token approach for different values of λ.

Filter Technique
(Fixed Token tech)  λ      Spam Recall   Spam Precision   Weighted Accuracy   Baseline W.Acc.   TCR
a) Fixed - 3        1      53.33%        44.44%           43.33%              50%               0.882
b) Fixed - 7        1      93.33%        93.33%           93.33%              50%               7.45
c) Fixed - 10       1      93.33%        93.33%           93.33%              50%               7.45
d) Fixed - 12       1      100%          88.24%           93.33%              50%               7.45
a) Fixed - 3        9      53.33%        44.44%           35.33%              90%               0.155
b) Fixed - 7        9      93.33%        93.33%           93.33%              90%               1.5
c) Fixed - 10       9      93.33%        93.33%           93.33%              90%               1.5
d) Fixed - 12       9      100%          88.24%           88%                 90%               0.833
a) Fixed - 3        999    20%           30%              53.3%               99.9%             0.002
b) Fixed - 7        999    93.33%        93.33%           93.33%              99.9%             0.015
c) Fixed - 10       999    93.33%        93.33%           93.33%              99.9%             0.015
d) Fixed - 12       999    100%          93.75%           93.34%              99.9%             0.015

The plots of TCR versus the number of fixed tokens for different thresholds are shown in Figures 6.1 through 6.3. The figures illustrate that 5 to 12 tokens from a message are enough to indicate its status (spam / legitimate) independent of its content size. However, the number of fixed tokens might differ a bit depending on one’s personal messages.

Figure 6.1. TCR for effective configurations of the fixed token approach at t = 0.999 (λ = 999). Horizontal axis: fixed tokens (3, 5, 7, 10, 12, 15, 20, 25).

Figure 6.2. TCR for effective configurations of the fixed token approach at t = 0.9 (λ = 9). Horizontal axis: fixed tokens (3, 5, 7, 10, 12, 15, 20, 25).

Figure 6.3. TCR for effective configurations of the fixed token approach at t = 0.5 (λ = 1). Horizontal axis: fixed tokens (3, 5, 7, 10, 12, 15, 20, 25).

CHAPTER 7

CONCLUSIONS

Our cost-sensitive evaluation suggests that a content-based filter using a Bayesian approach alone is not sufficient to function as an anti-spam filter, due to the large number of false positives. However, the fixed token approach was found to be the most effective of the four techniques evaluated in this report. The fixed token approach achieves its peak performance when the number of effective tokens selected to classify a message falls in the range of 5 to 12. This configuration performs satisfactorily for t = 0.9 (λ = 9) and t = 0.5 (λ = 1). No configuration performs well enough to be used for t = 0.999 (λ = 999). To obtain an optimal anti-spam filter, we suggest the use of a lemmatizer, a stoplist, and integration with other techniques, such as blacklist and rule-based methods.

The results are based on 80 personal messages, and we expect better results in the future as our filter follows a self-learning algorithm. Due to the low variability in the content of spam messages, only some tokens were able to make an impact on the classification process. This made catching spam messages without blocking legitimate messages somewhat difficult. As a first-pass filter, however, the few tokens technique must be analyzed to give maximum effectiveness. Thus, we believe that our fixed token technique with a configuration of 5 to 12 tokens would be able to make a positive contribution as a first-pass filter. However, this number might differ a bit from person to person, depending on the type of spam messages received.

CHAPTER 8

FUTURE WORK

In every spam filter, keeping the cost of false positives to a minimum is the top priority. Our experiment shows that all techniques of the Bayesian approach allow some false positives; however, some techniques keep the figure to a minimum. Many other techniques have been studied that can be used along with a simple content-based filter to improve its performance. A lemmatizer that converts each word to its base form can be included in our filter. In that way, modifications of the same word would not escape the attention of the anti-spam filter. For example, s*e*x would be treated similarly to sex [8]. A stoplist that removes the 100 most frequent words of the British National Corpus (BNC) from messages is also helpful in cases like the all tokens method, wherein each word contributes to the final score of a message. A user can also add words manually to the stoplist to tune a personal spam filter. There are studies that show how a lemmatizer and a stoplist contribute to improving the efficiency of a naïve Bayesian filter.
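
A stoplist is simple to apply before scoring. The Java sketch below is only illustrative: the class name is hypothetical, the stoplist passed in is whatever set of lower-case words the user chooses (it is not the BNC list itself), and case-insensitive matching is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class StoplistFilter {
    /** Returns the tokens that are not on the stoplist, preserving their order. */
    static List<String> removeStopWords(List<String> tokens, Set<String> stoplist) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            // The stoplist is assumed to hold lower-case words.
            if (!stoplist.contains(token.toLowerCase(Locale.ROOT))) {
                kept.add(token);
            }
        }
        return kept;
    }
}
```

Adding a word to the set passed as `stoplist` is all that is needed for a user to tune the filter manually.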

Our evaluation does not take into consideration nontextual factors such as images and attachments. There is a high probability that a mail with no textual content but only an image or an attachment is spam. Training the filter with a corpus containing nontextual content would improve its effectiveness during the classification phase. In the case of a hyperlink, we can have a web crawler visit the linked site and apply the same Bayesian approach to rate that page. If the score of that page goes above the threshold value, there is a possibility that the message containing the hyperlink is spam. Thus, with the help of a web crawler, hyperlinks would be useful in assigning a final score to a message.

There are some specific traits that help us detect spam. For example, a missing sender’s address and the use of dark red colors are traits commonly found in spam. These traits are termed attributes. Attributes can also be textual phrases like “only above 21 years.” A study has shown that training the filter on such attributes improves results during the classification phase [1].

A Bayesian filter can also be used for classifying messages into different folders [2]. It can suggest to which specific folder a new message belongs. People create folders to organize their messages for archival purposes, for messages that need replies and, of course, a bulk folder for a second look at messages before deleting them as spam. But people with a large number of folders find it difficult to organize their messages. Given that a text classifier can choose the correct folder 85 percent of the time, chances are high that the appropriate folder would be among the first three guesses. Thus, the user only has to choose a folder out of three guesses, as opposed to choosing one out of some 20-odd folders [9].

Experiments have been conducted to study the workings of a Bayesian filter with dependent features. In our study, we assumed that features were independent of each other. But the results for a 1-order dependent Bayesian filter are better than those for the naïve Bayesian filter. This implies that phrases like “Click here” and “Buy Free” are better indicators of spam than the independent words “Click,” “here,” “Buy,” and “Free.” If the order of dependency is increased above three, the results tend to decline. As the order increases, we have more conditional variables with the same amount of data. This complicates the probability estimates and, hence, leads to an overall decrease in predictive accuracy [11].

To achieve an efficiency above 99 percent with less than 1 percent false positives, a content-based filter alone would not be enough. Other approaches can be integrated with the Bayesian filter to get the desired result. For example, the rule-based and blacklist approaches are simple to integrate. Doing so has proved helpful in spam filtering. The signature-based and “filters fight back” approaches are more advanced techniques that have also proved useful when paired with a Bayesian spam filter.

REFERENCES

1. Androutsopoulos, I., Koutsias, J., Chandrinos, K., and Spyropoulos, C. An experimental comparison of naive Bayesian and keyword-based anti-spam filtering with personal email messages. In Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2000.

2. Cohen, W. Learning rules that classify e-mail. In AAAI Spring Symposium on Machine Learning in Information Access, 1996.

3. Domingos, P. and Pazzani, M. Beyond independence: Conditions for the optimality of the simple Bayesian classifier. In Proceedings of the 13th Int. Conference on Machine Learning, 1996.

4. Graham, P. A Plan for Spam. <http://www.paulgraham.com/spam.html> August 2002.

5. Graham, P. Better Bayesian Filtering. <http://www.paulgraham.com/better.html> January 2003.

6. Graham, P. <http://www.paulgraham.com/antispam.html> August 2002.

7. Potamias, G., Moustakis, V., and Van Someren, M. (eds.), An evaluation of naive Bayesian anti-spam filtering. In Proceedings of the Workshop on Machine Learning in the New Information Age, 2000.

8. Provost, J. Naive Bayes vs. Rule Learning in Classification of Email, In Artificial Intelligence Lab, University of Texas at Austin, A technical report,

9. Rennie, J. ifile: An application of machine learning to e-mail filtering. In KDD-2000 Text Mining Workshop.

10. Sahami, M., Dumais, S., Heckerman, D., and Horvitz, E. A Bayesian approach to filtering junk email. In AAAI Workshop on Learning for Text Categorization, 1998, AAAI Technical Report WS-98-05.

11. Sahami, M. Learning limited dependence Bayesian classifiers. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, 1996.

12. Spertus, E. Smokey: Automatic recognition of hostile messages. In Proceedings of the 14th National Conference on AI and the 9th Conference on Innovative Applications of AI, 1997.

13. Langley, P., Wayne, I., and Thompson, K. An analysis of Bayesian classifiers. In Proceedings of the 10th National Conference on AI, 1992.

The appendix contains the code of my classes that are significant to the implementation of my report. Comments explain the code wherever necessary.

ParseMails.java

// The main program contains the thread of a particular execution.
// It can easily be integrated with any email client coded in Java.
import javax.swing.*;
import javax.swing.border.*;
import java.awt.*;
import java.awt.event.*;
import java.util.*;
import java.io.*;
import java.sql.*;
import java.lang.*;

public class ParseMails extends JDialog {

    private static String fSPACE = " ";
    String sPath, tableName;
    String word = null;
    public static Connection connection;
    double totalMails, goodMails, spamMails;

    // A particular object can be executed by creating a separate thread.
    private ParseMails() {
        ParseText pt = new ParseText();
        Thread t1 = new Thread(pt);
        t1.start();
    }

    public static void main(String[] args) {
        new ParseMails();
    }
}

ParseText.java

// The program parses the given set of messages and stores all the tokens in the
// appropriate tables. The program is used in the training phase to train the filter.
import javax.swing.*;
import javax.swing.border.*;
import java.awt.*;
import java.awt.event.*;
import java.util.*;
import java.io.*;
import java.sql.*;
import java.lang.*;

public class ParseText extends JDialog implements Runnable {

    private static String fSPACE = " ";
    String sPath, tableName;
    String word = null;
    private JTextField spampath = null;
    Hashtable tokenHash = new Hashtable();
    int ans;
    double totalMails, goodMails, spamMails;
    public static Connection connection;
    Statement st, s, st1, st2, st3;
    String prevdWord, dWord;

    // Parses the given set of messages with known status (spam / legitimate)
    public void run() {
        String message = "Please Enter YES if you want to train filter using spam emails and No otherwise";
        int answer = JOptionPane.showConfirmDialog(new Frame(), message);
        if (answer == JOptionPane.YES_OPTION) {
            // User clicked YES.
            ans = 1;
        } else if (answer == JOptionPane.NO_OPTION) {
            // User clicked NO.
            ans = 0;
        }

        this.setResizable(false);
        this.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
        spampath = new JTextField(50);
        this.enableInputMethods(true);
        this.setSize(700, 150);
        int x = (this.getToolkit().getScreenSize().width - this.getWidth()) / 2;
        int y = (this.getToolkit().getScreenSize().height - this.getHeight()) / 2;
        this.setLocation(x, y);
        spampath.setEditable(true);
        spampath.setEnabled(true);
        spampath.setRequestFocusEnabled(true);
        spampath.requestDefaultFocus();

        JPanel logPanel = new JPanel();
        JLabel nameLabel = new JLabel();
        nameLabel.setForeground(Color.lightGray);

        ActionListener okActionListener = new ActionListener() {
            public void actionPerformed(ActionEvent e) {
                sPath = spampath.getText();
                getUpdate();
                try {
                    File dir = new File(sPath);
                    File[] files;
                    files = dir.listFiles();

for (int i=0; i<= (files.length - 1); i++) {
totalMails = totalMails + 1;
if (ans == 1) spamMails = spamMails + 1;
else goodMails = goodMails + 1;
break;
case '\'':
// A single-quoted string was found; sval contains the contents
String squoteVal = sToken.sval;
if (squoteVal.length() != 0) storeHash(squoteVal);
break;
case StreamTokenizer.TT_EOL:
// End of line character found
break;
case StreamTokenizer.TT_EOF:
// End of file has been reached
break;
default:
// A regular character was found; the value is the token itself
char ch = (char)sToken.ttype;
break;
}

}
fRead.close();
dataConnection();
}
ExecuteUpdate();
}

catch (IOException ioe) {

System.out.println (ioe.getMessage());

}

catch (Exception ex) {
ex.printStackTrace(System.err);
}
}
};
spampath.registerKeyboardAction(
    okActionListener,
    KeyStroke.getKeyStroke(KeyEvent.VK_ENTER, 0, false),
    JComponent.WHEN_ANCESTOR_OF_FOCUSED_COMPONENT );

ActionListener escapeActionListener = new ActionListener() {

public void actionPerformed(ActionEvent e)

{
System.exit(0);
}
};
spampath.registerKeyboardAction(
    escapeActionListener,
    KeyStroke.getKeyStroke(KeyEvent.VK_ESCAPE, 0, false),
    JComponent.WHEN_ANCESTOR_OF_FOCUSED_COMPONENT );
logPanel.add(nameLabel);
logPanel.add(spampath);

JPanel mainPanel = new BGPanel();

mainPanel.setBackground(new Color(0,0,64));
mainPanel.setBorder(
    BorderFactory.createCompoundBorder(
        BorderFactory.createBevelBorder(BevelBorder.RAISED, Color.gray, Color.white),
        BorderFactory.createEmptyBorder(10,10,10,10) ) );
mainPanel.add(logPanel, BorderLayout.SOUTH);
this.getContentPane().setLayout(new BorderLayout());
this.getContentPane().add(mainPanel, BorderLayout.CENTER);
this.setVisible(true);
this.toFront();

spampath.requestFocus();

}

//Stores the parsed tokens in the hash table
public void storeHash(String tokenKey) {
    if (tokenHash.containsKey(tokenKey)) {
        // Token seen before: increment its count
        Integer n = (Integer)tokenHash.get(tokenKey);
        int nn = n.intValue();
        nn = nn + 1;
        tokenHash.put(tokenKey, new Integer(nn));
    } else {
        // First occurrence of this token
        tokenHash.put(tokenKey, new Integer(1));
    }
}
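
// The counting step that storeHash performs can also be sketched on its own.
// What follows is a minimal sketch, not the report's code: TokenCounter and
// countTokens are hypothetical names, and the sketch assumes that only words
// and quoted strings count as tokens (numbers and single characters are skipped).
// It uses java.io.StreamTokenizer, as the listing does, and a generic HashMap
// in place of the raw Hashtable.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of the original ParseText class.
class TokenCounter {

    // Reads tokens from the reader and returns how often each one occurs.
    static Map<String, Integer> countTokens(Reader in) throws IOException {
        Map<String, Integer> counts = new HashMap<>();
        StreamTokenizer tok = new StreamTokenizer(in);
        while (tok.nextToken() != StreamTokenizer.TT_EOF) {
            switch (tok.ttype) {
                case StreamTokenizer.TT_WORD:
                case '\'':
                case '"':
                    // sval holds the word or the contents of a quoted string
                    if (tok.sval != null && tok.sval.length() != 0) {
                        counts.merge(tok.sval, 1, Integer::sum);
                    }
                    break;
                default:
                    // Assumption: numbers and single characters are ignored here
                    break;
            }
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        Map<String, Integer> counts =
            countTokens(new StringReader("free money free 'act now'"));
        System.out.println(counts);
    }
}
```

// Building one map per message, as this sketch does, keeps the counts of
// different messages separate until they are written to the token tables.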

//Update the appropriate tables with parsed tokens
public void dataConnection()
{
    try {
        String updateString = "";
        if (ans == 1) tableName = "tblSpamTokens";
        else if (ans == 0) tableName = "tblGoodTokens";
        s = connection.createStatement();
        Enumeration keys = tokenHash.keys();
        while ( keys.hasMoreElements() )
        {
            String key = (String) keys.nextElement();
            Integer occ = (Integer) tokenHash.get(key);
            int occur = occ.intValue();
            if (s.execute("SELECT * FROM "+tableName+" WHERE token = '"+key+"'") == true) {
                ResultSet rs = s.getResultSet();
                if ( rs.next()) {
                    Statement stmt = connection.createStatement();
                    if (ans == 1) {
                        updateString = "UPDATE tblSpamTokens SET occurrences = occurrences + '"+occur+"' "
                            +"WHERE token = '"+key+"'";
                    }
                    else if (ans == 0)
                    {
                        updateString = "UPDATE tblGoodTokens SET occurrences = occurrences + '"+occur+"' "
                            +"WHERE token = '"+key+"'";
                    }
                    stmt.executeUpdate(updateString);
                    stmt.close();
                } else {
                    String insertQuery = "INSERT INTO "+tableName+" VALUES('"+key+"', '"+occ+"')";
                    Statement insertStatement = connection.createStatement();
                    insertStatement.executeUpdate(insertQuery);
                    insertStatement.close();
                }
            }
        }
        s.close();
    }
    catch (SQLException sqle) {
        System.err.println("SQL State: " + sqle.getSQLState());
        System.err.println("SQL Error: " + sqle.getErrorCode());
        sqle.printStackTrace(System.err);
        System.err.println(sqle.getMessage());
    } catch (Exception e) {
        e.printStackTrace(System.err);
    }
}
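
// The queries in dataConnection are built by string concatenation, so a token
// that contains a single quote would produce invalid SQL. One possible
// alternative, sketched here under assumptions, binds the values with
// java.sql.PreparedStatement. TokenStore and its methods are hypothetical names;
// the sketch assumes the columns are ordered (token, occurrences), as in the
// INSERT statement of dataConnection, and that occurrences is a numeric column.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

// Hypothetical helper, not part of the original ParseText class.
class TokenStore {

    // Table names cannot be bound as "?" parameters, so only the two
    // table names used by the listing are accepted.
    static String checkedTable(String tableName) {
        if ("tblSpamTokens".equals(tableName) || "tblGoodTokens".equals(tableName)) {
            return tableName;
        }
        throw new IllegalArgumentException("Unknown table: " + tableName);
    }

    static String updateSql(String tableName) {
        return "UPDATE " + checkedTable(tableName)
             + " SET occurrences = occurrences + ? WHERE token = ?";
    }

    static String insertSql(String tableName) {
        return "INSERT INTO " + checkedTable(tableName) + " VALUES (?, ?)";
    }

    // Adds count to an existing row, or inserts a new row if none was updated.
    static void addOccurrences(Connection con, String tableName, String token, int count)
            throws SQLException {
        try (PreparedStatement update = con.prepareStatement(updateSql(tableName))) {
            update.setInt(1, count);
            update.setString(2, token);
            if (update.executeUpdate() > 0) {
                return;
            }
        }
        try (PreparedStatement insert = con.prepareStatement(insertSql(tableName))) {
            insert.setString(1, token);
            insert.setInt(2, count);
            insert.executeUpdate();
        }
    }
}
```

// Trying the UPDATE first and falling back to INSERT also removes the separate
// SELECT that the listing issues for every token.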

//Gets the spam, legitimate, and total message counts from the table tblMailCount
public void getUpdate() {
    try {
        Class.forName ("sun.jdbc.odbc.JdbcOdbcDriver");
        connection = DriverManager.getConnection("jdbc:odbc:DRIVER={Microsoft Access Driver (*.mdb)};DBQ=dbSpamFilter.mdb");
        // st is not assigned anywhere in the listing as shown, so it is created here
        st = connection.createStatement();
        ResultSet rst;
        st.execute("SELECT * FROM tblMailCount WHERE typeOfMails = 'totalMailCount'");
        rst = st.getResultSet();
        rst.next();
        totalMails = rst.getInt(2);
        st.execute("SELECT * FROM tblMailCount WHERE typeOfMails = 'goodMailCount' ");
        rst = st.getResultSet();
        rst.next();
        goodMails = rst.getInt(2);
        st.execute("SELECT * FROM tblMailCount WHERE typeOfMails = 'spamMailCount' ");
        rst = st.getResultSet();
        rst.next();
        spamMails = rst.getInt(2);
        st.close();
        // rateMail();
        // updateTableSpam ();
        // updateTableGood ();
    }
    catch (SQLException sqle) {
        System.err.println("SQL State: " + sqle.getSQLState());
        System.err.println("SQL Error: " + sqle.getErrorCode());
        sqle.printStackTrace(System.err);
        System.err.println(sqle.getMessage());
    }
    catch (Exception ex) {
        ex.printStackTrace(System.err);
    }
}

//A GUI for the user to enter the path of the messages
private class BGPanel extends JPanel
{
public BGPanel() {
super( new BorderLayout() );
}

public void paint( Graphics g ) {

Color col = new Color(239,239,239,40);

Font fnt = new Font("Dialog", Font.BOLD, 32);
g.setColor(col