Email Spam Filtering
A Thesis Paper Submitted to the School of Information Technology In Partial Fulfillment of the Requirements for the Degree of
Bachelor of Science in Computer Science
By
Bello, Lyndsay B.
Genelsa, Rick Joshua L.
Roxas, Edgel John M.
Mapúa Institute of Technology July 2013
TABLE OF CONTENTS
Abstract
Acknowledgements
Chapter 1 Introduction
Research Questions
Scope and Limitation
Significance of the Study
Conceptual Framework
Chapter 2 Review of Related Literature
Chapter 3 Methodology
Header Analysis
STEP 1. Extraction of Header Features
STEP 2. Validate FROM Field
STEP 3. Validate TO Field
STEP 4. Validate DATE Field
STEP 5. Validate X-Mailer, Message-ID, Return-Path, Reply-To, CC, BCC, Sender
Content-based Body Analysis
STEP 1. HTML Tags Removal
STEP 2. Extraction of Body Features
STEP 3. Selection of Attributes
STEP 4. Training of the Classifiers
Testing of the Classifiers
Classification of Email
Accuracy Evaluation
Chapter 4 Results and Discussion
Pre-Experiment
Header Analysis
Body Analysis
Chapter 5 Conclusion and Recommendations
Conclusion
Recommendations
LIST OF FIGURES
Figure 1.1: Bayes Theorem
Figure 1.2: Naïve Bayesian Spam Filtering
Figure 1.3: C4.5 Pseudocode
Figure 1.4: Example of Decision Tree Built by C4.5 Algorithm
Figure 1.5: Parallel Approach Conceptual Framework
Figure 1.6: Sequential Approach Conceptual Framework
Figure 2.1: A Comparison of the Classification Accuracy of Case-based and Bayesian Spam Filtering
Figure 2.2: 10-fold Cross Validation
Figure 2.3: Sample E-mail Header
Figure 2.4: TCR result of SVM on SA corpus. The 9 thresholds for body, header and all (body+header) feature types are 0.63, 0.575 and 0.48 respectively. The corresponding 999 values are 2.31, 1.43 and 1.38.
Figure 2.5: TCR result of SVM on ZH1 corpus. The 9 thresholds for body, header and all (body+header) feature types are 0.771, 0.687 and 0.36 respectively. The corresponding 999 values are 1.47, 1.69 and 1.15.
Figure 3.1: Experimental Groups
Figure 3.2: Sample Extracted Emails
Figure 3.3: Sample Email Header
Figure 3.4: Sample Email Body
Figure 3.5: Header Analysis
Figure 3.6: Different Valid Formats for the From Field
Figure 3.7: Spam Email with Wrong From and To fields Format
Figure 3.8: Empty To Field
Figure 3.9: Sample of an Invalid Date
Figure 3.10: Sample of a Common Mail User Agent
Figure 3.11: Different Domain Names of The From and Message-ID Fields
Figure 3.12: Sample Header Analysis Results
Figure 3.13: Sample Email Body without HTML Tags
Figure 3.14: Sample Attributes/Tokens of an Email
Figure 3.15: Sample Summary of Frequency Table
Figure 3.16: Sample .csv File
Figure 3.17: Sample Arff File
Figure 3.18: Sample .arff File after Feature Selection
Figure 3.19: Sample .arff File of Test Set
Figure 3.20: Sample .arff File of the Test Result
Figure 3.21: Sample Result Tabulation
Figure 4.1: Sample Header Analysis Result
Figure 4.2: Sample Summary of All Feature Selection and Classifier Algorithms Performed
Figure 4.4: Threshold Curve for OneR Attribute Eval Ranker – Naïve Bayes
Figure 4.5: Sample Classification Without Header Analysis
Figure 4.6: Screenshot of Spam Filter
Figure 4.7: Sample Tabulation for Validation
Figure 4.8: Software Architecture for Parallel Evaluation
Figure 4.9: Software Architecture for Sequential Evaluation
LIST OF TABLES
Table 2.1: Comparison of Different Spam Filtering Methods
Table 3.1: Approach and Rules
Table 4.1: Indicators of Spam Email
Table 4.2: Indicators of Nonspam Email
Table 4.3: Parallel Top 3 Result
Table 4.4: Sequential Top 3 Result
Table 4.5: Evaluation of Classifier using Training Set for Spam Class
Table 4.6: Sample Summary of 10-Fold Cross Result for Spam
Table 4.7: Results of Pre-selected Algorithms without Header Analysis
ABSTRACT
Using email is one of the major activities on the Internet. Now considered a major form of communication, it reaches thousands of users worldwide within a couple of seconds. However, spam has threatened the viability of communication via email. To address this problem, the Researchers developed a model based on email header fields and on words (also referred to as attributes or tokens) extracted from the email body. The data were then processed using WEKA, a data mining tool. The model identifies the header fields and attributes that are good indicators of spam emails.
ACKNOWLEDGEMENTS
The Researchers would never have been able to finish this Thesis without the guidance of the technical adviser and research panelists, help from friends, and support from their families and God;
To the adviser, Mr. Ramon L. Rodriguez, for his excellent guidance, care, and patience, and for providing an excellent atmosphere for doing research;
To the Faculty of the School of IT, for their inputs and wise suggestions;
To the beloved Ms. Sally, for going out of her way just to help;
To the Center for Student Advising, Mrs. Pamela A. Roldan and Ms. Mary Anne F. Balmes, for sharing school supplies and a place to study;
To the Friends, especially those who willingly lent their e-mail information for the needed data, and for the countless laughter shared;
Lastly, to the Families, for supporting the Researchers both financially and emotionally;
To God Himself, for without Him not a single word of this Thesis would have been conceived.
Chapter 1 INTRODUCTION
As the Internet continues to grow, it has opened new ways of communication. Using e-mail is one of the major activities on the Internet. [1] This form of communication reaches thousands of users worldwide within seconds; however, this freedom of communication can be misused. In the last couple of years, spam has become a phenomenon that threatens the viability of communication via e-mail. [2] Spam started in the spring of 1978 with a man named Gary Thuerk, who wanted everyone to know about new products from his company, Digital Equipment Corporation (DEC). [3]
Though users can recognize spam e-mail messages easily, it is difficult to define spam precisely. One definition describes spam as unsolicited, usually commercial, e-mail sent to a large number of addresses. [4] Spam is also defined as flooding the Internet with many copies of the same message, in an attempt to force the message on people who would not otherwise choose to receive it. [5] Some companies, organizations or people take advantage of the freedom of using e-mail by sending unsolicited advertising or offensive messages, so many users’ inboxes are populated with unwanted messages. In 2007, a study [6] showed that eighty-five percent (85%) of email traffic was attributed to spam and that, if trends continued, it would increase to ninety percent (90%) by year end.
Because of this increasing problem, several techniques and solutions for spam filtering have already been proposed: decision trees (DT), support vector machines (SVM), K-nearest neighbor algorithm (KNN), Naïve Bayes (NB), neural networks, ensemble decision trees (EDT), boosting, bagging and stacking, and others. [7][8] The problem with some of these techniques is over-blocking, which means filtering out even legitimate or normal e-mails.
Rev. Thomas Bayes, an English Nonconformist, mathematician, and author of “An Essay Towards Solving a Problem in the Doctrine of Chances” (1763), first used probability inductively and established a mathematical basis for probability inference (a means of calculating, from the number of times an event has occurred, the probability that it will occur in future trials). [9][10] He introduced what is known today as Bayes’ Theorem – the probability of event B given A can be found by assuming that event A has occurred and, working under that assumption, calculating the probability that event B will occur. [11]
Figure 1.1: Bayes Theorem
From this theorem comes one of the most successful spam filtering tools available today [12], Bayesian Spam Filtering. The use of Bayesian logic in spam filtering started with Paul Graham’s online article “A Plan for Spam” [13], and was later adopted by many developers. Bayesian spam filtering is based on the idea that the presence of certain words indicates that a message is spam, while other words identify it as legitimate. This method starts by analyzing sets of e-mails that are already sorted as spam and legitimate. The Bayesian filter compares the contents of both sets and builds a database of words (also known as tokens) that can be used to classify future e-mails as spam or not. [14] Naïve Bayesian (NB) is one of the commonly used algorithms based on the described Bayesian spam filtering.
Figure 1.2: Naïve Bayesian Spam Filtering
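The token-based scoring described above can be illustrated with a minimal sketch. It assumes add-one (Laplace) smoothing, a two-class setup labelled "spam" and "ham", and a tiny made-up training set; the function names `train` and `spam_probability` are hypothetical and are not taken from the study or from WEKA.

```python
import math
from collections import Counter

def train(messages):
    """Count token occurrences per class from (tokens, label) pairs."""
    counts = {"spam": Counter(), "ham": Counter()}
    totals = {"spam": 0, "ham": 0}
    for tokens, label in messages:
        counts[label].update(tokens)
        totals[label] += 1
    return counts, totals

def spam_probability(tokens, counts, totals):
    """Return P(spam | tokens) under the naive independence assumption."""
    vocab = set(counts["spam"]) | set(counts["ham"])
    n_docs = totals["spam"] + totals["ham"]
    log_score = {}
    for label in ("spam", "ham"):
        n_tokens = sum(counts[label].values())
        # Start from the log prior P(class).
        score = math.log(totals[label] / n_docs)
        for tok in tokens:
            # Add-one smoothing (an assumption here) avoids zero probabilities.
            p = (counts[label][tok] + 1) / (n_tokens + len(vocab))
            score += math.log(p)
        log_score[label] = score
    # Normalise the two joint scores into a posterior (Bayes' theorem).
    m = max(log_score.values())
    exp = {k: math.exp(v - m) for k, v in log_score.items()}
    return exp["spam"] / (exp["spam"] + exp["ham"])

# Made-up training data, for illustration only.
training = [
    (["buy", "cheap", "pills"], "spam"),
    (["cheap", "offer", "buy"], "spam"),
    (["meeting", "agenda", "today"], "ham"),
    (["project", "meeting", "notes"], "ham"),
]
counts, totals = train(training)
p_spam_like = spam_probability(["buy", "cheap"], counts, totals)
p_ham_like = spam_probability(["meeting", "notes"], counts, totals)
```

With this toy data, a message containing "buy" and "cheap" scores well above 0.5 and one containing "meeting" and "notes" scores well below it. A real filter would need far more training data and a decision threshold chosen to limit false positives.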
Another learning approach used to counter spam is the C4.5 algorithm, a decision tree inducer developed by John Ross Quinlan. C4.5 is Quinlan’s extension of his own ID3 algorithm. He introduced the gain ratio, which the algorithm uses as its splitting criterion. When the number of instances to be split is below a specified threshold, splitting stops. After the decision tree is grown, error-based pruning is performed. [15]
Using a training set, C4.5 builds a decision tree according to the splitting node strategy. At each node, the algorithm selects the single attribute that most effectively splits its set of instances into subsets. It recursively visits each decision node and selects the optimal split until no further splits are possible. [16] The following premises guide the algorithm: (1) If all cases belong to the same class, the tree is a leaf, and the leaf is returned with this class; (2) For each attribute, calculate the potential information provided by a test on the attribute (based on the probabilities of each case having a particular value for the attribute), and calculate the gain in information that would result from a test on the attribute (based on the probabilities of each case with a particular value for the attribute being of a particular class); and (3) Find the best attribute to branch on depending on the current selection criterion. [17]
Figure 1.4: Example of Decision Tree Built by C4.5 Algorithm
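The gain ratio criterion described above can be sketched as follows. This is only an illustration for nominal attributes, assuming entropy measured in bits; it leaves out C4.5's handling of continuous attributes, missing values, stopping thresholds and pruning. The attribute names `has_free` and `has_date` are made up.

```python
import math
from collections import Counter

def entropy(items):
    """Shannon entropy (in bits) of a list of discrete values."""
    total = len(items)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(items).values())

def gain_ratio(values, labels):
    """Gain ratio of splitting `labels` by the attribute `values`."""
    total = len(labels)
    subsets = {}
    for v, y in zip(values, labels):
        subsets.setdefault(v, []).append(y)
    # Information gain: entropy before the split minus weighted entropy after.
    remainder = sum(len(s) / total * entropy(s) for s in subsets.values())
    gain = entropy(labels) - remainder
    # Split information penalises attributes with many distinct values.
    split_info = entropy(values)
    return gain / split_info if split_info > 0 else 0.0

def best_attribute(rows, labels):
    """Pick the attribute name with the highest gain ratio."""
    names = rows[0].keys()
    return max(names, key=lambda a: gain_ratio([r[a] for r in rows], labels))

# Made-up cases: "has_free" separates the classes, "has_date" does not.
rows = [
    {"has_free": "yes", "has_date": "yes"},
    {"has_free": "yes", "has_date": "no"},
    {"has_free": "no", "has_date": "yes"},
    {"has_free": "no", "has_date": "no"},
]
labels = ["spam", "spam", "nonspam", "nonspam"]
chosen = best_attribute(rows, labels)
```

Here `has_free` is selected because it splits the four cases into two pure subsets, while `has_date` leaves both subsets evenly mixed and yields no gain.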
E-mail headers are included in the message by the sender or by a component of the mail system, and also contain transit-handling trace information. [18] Standard e-mail headers include the following fields: (1) Return-Path, (2) Delivery-date, (3) Date, (4) Message-ID, (5) X-Mailer, (6) From, (7) To, and (8) Subject. [19]
According to Trevino [43], “Header analysis still has life”. Results of his tests showed that header analysis is capable of detecting over ninety percent (90%) of current spam with less than one percent (1%) false positives. These tests also require very little training and processing power. Since the analysis focuses only on the header of the e-mail, messages that can trick statistical filters (such as phishing scams or image spam) are still easily detected and eliminated.
A study by Wang and Chen [1], which used the Header Session for anti-spam, also focused on header analysis. Wang and Chen used header fields such as “To”, “CC”, “From”, “X-Mailer”, and “Message-ID” as cues for spam filtering. These fields were the primary basis for analysis in their study; they analyzed the fields and found loopholes and patterns that serve as cues in classifying an email as spam or not spam. Examples of the rules used to classify spam are that the sender address is invalid and that the recipient is not in the email’s “To” or “CC” fields.
As the fight against spam intensifies, spammers find more ways to hide their identities when sending spam e-mails and to bypass filtering methods. Because of this, header analysis is considered one of the main approaches to counter spam attacks with falsified header information. [20]
Research Questions
What other email features, in both the header and the body, can increase the accuracy rate in classifying email messages?
Which is a better approach for detection of spam emails, parallel or sequential evaluation?
Which combination of feature selection and classification algorithms available in WEKA yields the highest accuracy rate?
Scope and Limitation
The training set and test set used in the study come only from the chosen corpora. The emails used are in plain-text and HTML format only, and the study did not cover the analysis of email attachments. The frequency table created consists only of unigrams (a single item from a sequence). The study did not conduct any live system testing throughout the research. Domain Name System (DNS) spoofing detection was not considered, because testing it requires a live mail server.
The email extraction tool (Export Messages to EML Format of OutlookFreeware Utilities) had problems extracting some email messages. As of this writing, the said tool has no documentation.
Significance of the Study
The results of the study will be of importance to future spam filtering developers. By increasing the accuracy of the algorithms with the help of the Header Session, the study can help improve the classification rate of emails, thus decreasing the high email traffic attributed to spam and reducing the heavy load on mail servers.
The study can also help improve the accuracy of filtering spam emails using different classification algorithms. It will show whether header analysis, in sequence or in parallel, has an effect on the accuracy of email spam filtering. The study can show which algorithm works best with header analysis by comparing accuracy and false positive results. The study also aims to discover new email header features and additional rules that could help improve classification accuracy.
Conceptual Framework
[Flowchart: an email is split into its header and body. Header features are extracted to a text (.txt) file and classified by format matching against RFC 1036 rules. Body features are extracted to a text (.txt) file, saved as a .csv file, cleaned, converted to an .arff file, and classified by a classifier algorithm. If the header classification and the body classification give the same result, the email classification is the matching result; if not, the email classification is indeterminate.]
Figure 1.5: Parallel Approach Conceptual Framework
An email consists of two parts, the header and the body. Features of both parts were extracted, cleaned and separated for analysis. Extracted header features were saved in a text (.txt) file. The header part was analyzed using header analysis, while the body part was saved in a .csv file and converted to an .arff file for the content-based analysis.
For the parallel approach, header analysis and content-based body analysis classified the emails simultaneously based on the rules for each method. The results of both analysis methods were compared, and the final email classification is based on the rules indicated in Table 3.2.
[Flowchart: an email is split into its header and body and pre-processed as in the parallel approach. The header classification is evaluated first: if it is spam, the email classification is spam. If not, the body classification is evaluated: if it is spam, the email classification is spam; if not, the email classification is nonspam.]
Figure 1.6: Sequential Approach Conceptual Framework
The sequential approach follows the same pre-processing steps as the parallel approach. However, header analysis processed the header part first, applying a slack filter. Only those emails classified as non-spam by header analysis were then processed using the content-based analysis. Email classification is based on the rules indicated in Table 3.3.
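The difference between the two approaches can be sketched in a few lines. This is only an illustration of the decision flow shown in Figures 1.5 and 1.6; the detailed rules of Tables 3.2 and 3.3 are not reproduced here, and the function names and the "spam"/"nonspam" label strings are assumptions made for the example.

```python
def classify_parallel(header_label, body_label):
    """Parallel approach: both analyses run; agreement decides the result."""
    if header_label == body_label:
        return header_label      # matching result
    return "indeterminate"       # the two analyses disagree

def classify_sequential(header_label, classify_body):
    """Sequential approach: body analysis runs only if the header passes."""
    if header_label == "spam":
        return "spam"            # header analysis already flagged the email
    # classify_body is a callable so that flagged emails skip the body step.
    return classify_body()

results = [
    classify_parallel("spam", "spam"),
    classify_parallel("spam", "nonspam"),
    classify_sequential("spam", lambda: "nonspam"),
    classify_sequential("nonspam", lambda: "spam"),
    classify_sequential("nonspam", lambda: "nonspam"),
]
```

Passing the body classifier as a callable mirrors the point that, in the sequential approach, content-based analysis is never run for emails that header analysis has already classified as spam.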
The research used personal emails collected from people who agreed to share their received emails for the training set. Emails from two corpora were also used as the test data: the SpamAssassin data set, which has 6,047 e-mail messages with a known 31% spam ratio [21]; and Bruce Guenter’s Spam Archive [22]. The header fields that were extracted from the emails are “From”, “To”, “Subject”, “Date”, “Return-Path”, “Reply-To”, “CC”, “BCC”,
Chapter 2
REVIEW OF RELATED LITERATURE
Spam is an issue about consent, not content. Whether the Unsolicited Bulk Email ("UBE") message is an advert, a scam, porn, a begging letter or an offer of a free lunch, the content is irrelevant - if the message was sent unsolicited and in bulk then the message is spam. [23]
Many studies have been published sharing different ways on how to fight spam such as the Rule Based Spam Filtering, Content Hash Based Filtering, Machine Learning techniques, Support Vector Machines (SVM), Collaborative Filtering (CF) and the Content- Based Filtering (CBF) to name a few.
Among these methods, CBF has been the most widely used anti-spam solution because of its freely available and commercial implementations. [24] Current research focuses on improving individual classifier performance through better preprocessing or enhancement of the learning algorithm. Ensembles that combine distinct spam classifiers have also been proposed. [25]
However, both CF and CBF have drawbacks. CF faces problems such as the first-rater problem, sparsity of data, and privacy. The first arises from the difficulty of classifying emails that have not been rated before; the second arises when users rate few messages; and the last depends on what is shared. [25]
One of the strong benefits of CBF is that it reduces error rates: legitimate e-mail is not blocked even if the ISP from which it originated is on a real-time block list. It also needs only occasional refinement, meaning less hassle for the end user. [26]
One popular method under CBF is the Naïve Bayesian, an anti-spam filter text categorization technique. [27] This method is simple and fast because it requires only linear training time. [28]
In a study by Taninpong and Ngamsuriyaroj [28], they used an incremental Naïve Bayesian approach with variant incremental training, evaluated using the Trec05p-1 and Trec06p corpora. Their approach uses the concept of sliding windows in training messages after classification by the filter. Using three schemes (with each scheme adding more data to the current scheme), an increase in the accuracy rate of classifying e-mails was shown. From 78.24% (for Trec05p) and 85.27% (for Trec06p) for the first scheme, it increased to 91.51% and 92.42% for the third scheme.
A study [29] conducted by Pantel and Lin also used the Naïve Bayesian algorithm to classify e-mails as spam or legitimate messages. They presented a spam-filtering program called SpamCop which treats an e-mail message as a multiple set of words. In their experiments, SpamCop was able to classify ninety-two percent (92%) of the spams while having only a 1.16% misclassification of non-spam e-mails. It also showed that a high accuracy rate is achievable using only as few as 32 spam examples.
Another study of Naïve Bayesian as an anti-spam technique [30] used two variants, namely unigrams and bigrams. The idea of unigrams was taken from Paul Graham [13], and the question arose whether unigrams and bigrams would be redundant. The results showed otherwise: of the total number of spam messages detected by either classifier, only about forty percent (40%) were detected by both of them simultaneously. The remaining part is divided in approximately equal shares between them. This suggests that the two versions of the Naïve Bayesian classifier complement each other, without a need for human supervision of the retraining process.
However, another study [31] comparing several forms of Naïve Bayes and linear SVM showed that a different method performs better. Although all the forms tested are good choices for automatic spam filtering, SVM showed the best average performance, with an accuracy rate of more than ninety percent (90%), compared to the Naïve Bayes approaches.
Cunningham, Nowlan, Delany, and Haahr [32] compared the Naïve Bayes approach to a Case-Based approach over a period of time, from May to January. They used cases that contain 30 spam words, 30 non-spam words, and 7 header features to test the classification accuracy of both classifiers. Both classifiers were trained on 200 spam and 200 non-spam cases and evaluated on 150 spam and 150 non-spam cases at each test point. Results show that the case-based approach, with an accuracy greater than ninety-five percent (95%) throughout the testing process, performs better than the Naïve Bayes approach, with an average accuracy of less than ninety-five percent (95%).
Figure 2.1: A Comparison of the Classification Accuracy of Case-based and Bayesian Spam Filtering
Another study [27] that evaluated four different types of Naïve Bayesian filters found the Fixed Token approach to be the most effective of the four. Throughout the tests and evaluation, keeping the cost of false positives to a minimum was the priority. The researchers also found that “Click here” and “Buy free” are better indicators of spam than the independent words “Click,” “here,” “Buy,” and “Free.” It was also noted that a content-based filter alone would not be enough to achieve efficiency above ninety percent (90%) with less than one percent (1%) false positives. Because of this loophole in spam filtering, several other studies have included the e-mail header in the fight against spam.
Another technique used to classify spam is the C4.5 Decision Tree Algorithm, developed by John Ross Quinlan; its open source Java implementation in WEKA is called J48. Abdelghani Bellaachia and Erhan Guven [33] used WEKA to investigate three data mining techniques (Naïve Bayes, the back-propagated neural network, and the C4.5 decision tree) and concluded that the C4.5 algorithm performs much better than the other two techniques.
In a study [34] comparing different decision tree algorithms on data sets of different sizes, C4.5 yielded 95.80% accuracy for a data size of 1000. In the same study, C4.5 and the Naïve Bayesian classifiers performed better compared to the other algorithms, averaging over ninety-five percent (95%) for the precision and recall rates.
Another study [16] showed that although another decision tree (Logistic Model Tree) outperforms the C4.5/J48 algorithm in classification accuracy, C4.5/J48 has the best training time among the decision tree algorithms compared in the study. This makes it the best choice when training time is the critical factor in selecting a filtering algorithm.
A comparative study by Sharma and Sahni [35] analyzed and compared four (4) algorithms, namely C4.5/J48, ID3, ADTree, and SimpleCART. Sharma and Sahni used the 10-fold Cross Validation method for classifier training and results analysis for each algorithm. In 10-fold Cross Validation, the data set is split into ten parts, and each part is used once for testing while the other nine are used for training:
Figure 2.2: 10-fold Cross Validation
Using this validation method reduces overfitting (random error), making the built classifier more reliable. After the comparison, C4.5/J48 had the highest accuracy rate (92.7624%) compared to the other three algorithms.
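The 10-fold procedure can be sketched as follows. This is a generic illustration, not WEKA's implementation: it assumes at least `k` samples, does no shuffling or stratification, and the helper names `k_fold_indices` and `cross_validate` as well as the toy majority-class learner are hypothetical.

```python
from collections import Counter

def k_fold_indices(n_samples, k=10):
    """Yield (train_idx, test_idx) pairs; each sample is tested exactly once."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

def cross_validate(samples, labels, train_fn, predict_fn, k=10):
    """Average accuracy over k folds for any train/predict function pair."""
    accuracies = []
    for train_idx, test_idx in k_fold_indices(len(samples), k):
        model = train_fn([samples[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        correct = sum(predict_fn(model, samples[i]) == labels[i]
                      for i in test_idx)
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / len(accuracies)

# Toy learner for demonstration: always predicts the majority training class.
def train_majority(samples, labels):
    return Counter(labels).most_common(1)[0][0]

def predict_majority(model, sample):
    return model

toy_samples = list(range(20))
toy_labels = ["spam"] * 20
mean_accuracy = cross_validate(toy_samples, toy_labels,
                               train_majority, predict_majority)
```

Because every sample appears in exactly one test fold, the averaged accuracy is computed only on data the model did not see during that round's training.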
Another study [36], which evaluated different data mining techniques for spam filtering, showed that the three (3) tree classification algorithms (C4.5, CS-MC4, and Rnd) produce ninety-five percent (95%) accuracy. In the same study, the best results (above ninety-five percent (95%) accuracy) for runs filtering and stepwise discriminant analysis came from the algorithms C4.5 and CS-MC4.
J48 was also the algorithm most suitable for use with AdaBoostM1 to filter spam, according to a study [37] by Ali and Xiang. They compared three algorithms (Decision Stump, J48, and Naïve Bayes), first without the AdaBoostM1 algorithm and then in combination with AdaBoostM1. In the first experimental setup, J48 showed the highest classification accuracy rate of 92.98%, compared to 91.50% and 79.29% for Decision Stump and Naïve Bayes, respectively. However, based on training time, J48 was the slowest at 94.37 seconds, while Decision Stump took 7.65 seconds and Naïve Bayes 16.04 seconds. For the next experimental setup (all algorithms with AdaBoostM1), J48 again yielded the highest classification accuracy rate, 95.15%. Training time also improved for J48 when AdaBoostM1 was applied, decreasing from 94.37 seconds to 6.84 seconds. The same study also showed that J48 has a high true positive rate and a low false positive rate, and is computationally less expensive than the other algorithms used in the study.
An email has three (3) main elements: the header, the body, and the envelope. [38] The header part is said to be the most fascinating part of the email. [39] It includes details about the message such as the sender, the receiver, the date and the subject. Below is an example of an e-mail header. E-mail headers should always be read from bottom to top. [40]
Figure 2.3: Sample E-mail Header
Header analysis is the breakdown of the header part of the e-mail, separating it into its elements. Header-based email spam filtering represents an efficient and lightweight approach to achieve filtering of spam messages by inspecting email message header information. Typically, a machine learning classifier is applied on features extracted from email header information to distinguish legitimate messages from spam. The email can be filtered based just on the headers, no matter what they say in the body. [41]
Chin-Chien Wang [42] conducted a study to determine whether header fields could be useful in filtering junk emails. He used 3,417 unsolicited emails, of which 60.3% had an invalid sender address and 92.8% did not show the receiver’s address in the “To” or “CC” headers. He concluded that invalid sender addresses and unrelated receiver addresses left in the header section of junk emails could be used by spam filter developers to develop new anti-spam strategies or to improve current anti-spam filters. Four anti-spam filtering methods were named in his study:
1. Filtering by Number of Recipients – This method is used for blocking emails sent to a large number of recipients. A downside of this method is that it could filter non-spam messages, because it considers all bulk emails to be junk.
2. Filtering by Keyword – This method is said to be efficient in filtering junk mails. It uses the probability that the header or the body of the email contains specific words, such as “sell”, “sex”, “buy now”, and other keywords that most people associate with spam. The downside of this method is the high chance that solicited mails are filtered because of the words on the filter list.
3. Filtering by Sender Address – This method filters spam based on the sender address. The downside of this method is that it is easy for the sender to use a new email address or to create a fake email address and use it for spamming.
4. Filtering by Address Validity – This method checks whether the email follows the proper format described in the Request for Comments (RFC). It is said that every mail should have at least a sender and a receiver.
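Two of the header cues reported by Wang, an invalid sender address and a recipient missing from the “To” or “CC” fields, can be checked with Python's standard `email` package. The sketch below is illustrative only: the address pattern is a deliberately simple assumption (real address syntax is more permissive), and the function names and sample messages are made up.

```python
import re
from email import message_from_string
from email.utils import getaddresses, parseaddr

# Simple plausibility pattern (an assumption): something@something.something
ADDRESS_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def sender_is_valid(msg):
    """True if the From field holds a syntactically plausible address."""
    _, address = parseaddr(msg.get("From", ""))
    return bool(ADDRESS_RE.match(address))

def recipient_is_listed(msg, recipient):
    """True if `recipient` appears in the To or CC fields."""
    fields = msg.get_all("To", []) + msg.get_all("Cc", [])
    listed = {addr.lower() for _, addr in getaddresses(fields)}
    return recipient.lower() in listed

def header_flags(raw_email, recipient):
    """Return the two header cues for one raw message."""
    msg = message_from_string(raw_email)
    return {
        "invalid_sender": not sender_is_valid(msg),
        "recipient_not_listed": not recipient_is_listed(msg, recipient),
    }

# Made-up messages for illustration.
suspicious = (
    "From: not-an-address\n"
    "To: someone.else@example.org\n"
    "Subject: hello\n"
    "\n"
    "body\n"
)
legitimate = (
    "From: Alice <alice@example.com>\n"
    "To: user@example.com\n"
    "Subject: notes\n"
    "\n"
    "body\n"
)
flags_suspicious = header_flags(suspicious, "user@example.com")
flags_legitimate = header_flags(legitimate, "user@example.com")
```

Neither cue is conclusive on its own; for example, blind-copied legitimate mail also omits the recipient from “To” and “CC”, which is why such cues are usually combined with other evidence.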
According to Trevino, header analysis still has life. Results of his tests showed that header analysis is capable of detecting over ninety percent (90%) of current spam with less than one percent (1%) false positive. These tests also require very little training and processing power. Since it focused only on the header of the e-mail, messages that can trick statistical filter (such as phishing scams or image spam) are still easily detected and eliminated. [43]
A study conducted by Hu [44] also focused on the header part, using the originator field, destination field, X-Mailer field, sender server IP address, and email subject. It tested 5 different spam classifiers: Random Forest (RF), Decision Tree (DT), Naïve Bayes (NB), Bayesian Network (BN), and Support Vector Machine (SVM). Accuracy, precision, recall, and F-measure were tested on 2 different datasets consisting of 33,209 and 21,725 emails. Using the hybrid spam filtering framework, the Random Forest classifier showed the best performance, with 96.7% accuracy, 92.99% precision, 92.99% recall, and 93.3% F-measure.
Using a two-phase spam filtering method based on a categorized Decision Tree data mining algorithm, Sheu utilized the basic information in header sessions to identify spam or legitimate e-mails. [45] Based on his experimental results, the efficiency was evaluated as follows: 96.5% accuracy, 96.67% precision, and 96.35% recall. He pointed out that these figures are not lower than those of other filtering methods that check email content, even though he checked only the header sessions of e-mails, which reduces the computation cost and the system resources needed. Table 2.1 shows the results obtained by Sheu.
Table 2.1: Comparison of Different Spam Filtering Methods
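The measures quoted in these studies (accuracy, precision, recall, and F-measure) can be computed from the four confusion-matrix counts. The sketch below assumes the standard definitions with spam as the positive class; the counts in the example are made up and are not results from any of the cited studies.

```python
def evaluate(tp, fp, tn, fn):
    """Accuracy, precision, recall and F-measure for the spam class.

    tp: spam classified as spam        fp: non-spam classified as spam
    tn: non-spam classified as non-spam  fn: spam classified as non-spam
    """
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F-measure is the harmonic mean of precision and recall.
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }

# Made-up counts: 100 spam and 100 non-spam messages.
scores = evaluate(tp=90, fp=5, tn=95, fn=10)
```

In spam filtering, false positives (legitimate mail marked as spam) are usually the costlier error, so precision is often read alongside accuracy rather than replaced by it.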
In an experiment conducted by Le Zhang, Jingbo Zhu, and Tianshun Yao [46], the SpamAssassin and ZH1 corpora were processed into 3 versions to determine the contribution of different parts of a message in filtering spam mails: one version used only terms from the message body plus the subject line, another used only tokens occurring in message headers, and the last tokenized both mail body and headers. The result of the experiment showed that the SVM classifier achieved good TCR values using only the information from mail headers. They concluded that using message headers can be more reliable in filtering spam mails than using filters that focus only on the body. They also discovered that using both header and body in filtering mails is better than focusing on either the body or the header alone. They concluded that message headers should not be ignored and should be considered as important as mail bodies in filtering spam. Figures 2.4 and 2.5 show the result of the experiment.
Figure 2.4: TCR result of SVM on SA corpus. The 9 thresholds for body, header and all (body+header) feature types are 0.63, 0.575 and 0.48 respectively. The corresponding 999 values are 2.31, 1.43 and 1.38.
Figure 2.5: TCR result of SVM on ZH1 corpus. The 9 thresholds for body, header and all (body+header) feature types are 0.771, 0.687 and 0.36 respectively. The corresponding 999 values are 1.47, 1.69 and 1.15.
A study [38] by Ahmed Obied used a machine learning approach based on Bayesian analysis to filter spam. This study differs from most anti-spam methods because it evaluated both the header and the body of the email. The filter learns what spam and non-spam messages look like, and it can make binary classification decisions (spam or non-spam) based on what it has learned. The filter does not require heavy maintenance; it only needs to be trained once. After training, the filter becomes capable of filtering spam with high accuracy. The study used a feature extraction step that extracted words from both the header and the body using delimiters (whitespace characters such as space, \n, \f, \r and \t, and punctuation such as . / & % # { } [ ] ! + = - ( ) quotation marks * ? : ; < >); the words were then placed in a hash table. The header and the body of an email message are separated by an empty line. The 4 tests, on 5,000 messages equally divided between spam and non-spam, gave an average accuracy of 97.80%.
Wang and Chen [1] used header fields such as “To”, “CC”, “From”, “X-Mailer”, and “Message-ID” as cues for spam filtering. These fields were the primary basis for analysis in their study; they analyzed the fields and found loopholes and patterns that serve as cues in classifying an email as spam or not spam. Examples of the rules used to classify spam are that the sender address is invalid and that the recipient is not in the email’s “To” or “CC” fields.
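The feature extraction step described for Obied's study [38], splitting a message at the first empty line and tokenizing both parts on delimiters into a hash table, can be sketched as follows. The delimiter set is adapted from the one quoted above, with plain quote characters assumed where the original list is garbled; lowercasing the tokens and the function names are further assumptions made for the example.

```python
import re
from collections import Counter

# Delimiters adapted from the quoted set; the quote characters are assumed.
DELIMITERS = " \n\f\r\t./&%#{}[]!+=-()'\"*?:;<>"
SPLIT_RE = re.compile("[" + re.escape(DELIMITERS) + "]+")

def split_message(raw_email):
    """Split at the first empty line into (header, body).

    Assumes LF line endings; messages using CRLF would need normalising first.
    """
    header, _, body = raw_email.partition("\n\n")
    return header, body

def tokenize(text):
    """Split on delimiters, drop empty strings, and lowercase (an assumption)."""
    return [tok.lower() for tok in SPLIT_RE.split(text) if tok]

def token_table(raw_email):
    """Count tokens from header and body in one hash table (a Counter)."""
    header, body = split_message(raw_email)
    return Counter(tokenize(header) + tokenize(body))

# Made-up message for illustration.
raw = (
    "From: a@b.com\n"
    "Subject: Free offer!\n"
    "\n"
    "Click here for a FREE offer."
)
table = token_table(raw)
```

Because "@" is not in the delimiter set, the address `a@b.com` is split only at the dot, giving the tokens `a@b` and `com`; whether that is desirable depends on how the header tokens are used later.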
Among the many studies on spam filtering, particularly those using the Naïve Bayesian approach, the very influential article [13] by Paul Graham was adopted by many other researchers. In his study, Graham pointed out the importance of message headers. The main difference of this study from [27] is probably the inclusion of message headers in filtering spam e-mails, on the grounds that such data should not be discarded. In his Bayesian analysis of the body and header of the email, he made use of tokens, scores, and hash tables to determine whether an email is spam or not. Based on a corpus of spam mail, each common word (for example “free”, “sex”, “sexy”) is given a score depending on how often it occurs in a mail.
However, another study [47] by Gary Robinson put Paul Graham's approach under scrutiny. According to him, Graham's algorithm is subtly asymmetric with respect to how it handles words that indicate the e-mail is a spam compared to the words that make it a legitimate message. He also pointed out that Graham's technique was based on an anonymous article [48] showing that the probabilities are independent, which is not the case in words found in e-mails.
There are many studies on CBF spam filtering, but some of them ignore the header of the email. Although different approaches applied to the content of the body achieve high accuracy in distinguishing spam from non-spam emails, other studies have shown that combining the body of the email with its header can further increase the accuracy of the filter. Because of these results, the researchers want to know which spam filtering algorithm, with header analysis, performs better using the Parallel or the Sequential approach. Also, the researchers want to discover other email header features and additional rules that could help in improving
Chapter 3
METHODOLOGY
In order to validate the Researchers' point of view, two (2) major experimental groups of data collection were created. Group A worked in parallel with header analysis, while Group B worked in sequence. Using exhaustive search, both groups were trained using the same set of classification algorithms available in WEKA. To ensure fairness in the experiment, all 4,179 training samples were selected from personal emails collected from people who opted to participate. The test set was taken from two existing spam corpora: the CSDMC2010 SPAM corpus, which has 4,327 messages in total with 2,949 non-spam and 1,378 spam messages; and Bruce Guenter's Spam Archive. (Figure 3.1)
[Figure 3.1: flowchart. Recoverable labels: Email; Body and Header Separation; Email's Body; Email's Header; Exhaustive Feature Selection; Exhaustive Search Algorithm; Classify Email; Apply Header Analysis; Evaluate Accuracy.]
Emails for the training set were extracted using the Export Messages to EML Format tool [55] for Microsoft Outlook. Figure 3.2 shows sample extracted emails in .eml format using the said extraction tool.
Figure 3.2: Sample Extracted Emails
Before performing the two (2) methods for analysis, the header and body parts of the email were separated and saved in a .txt file as shown in Figures 3.3 and 3.4. However, the content of the Subject field was included in the body.
Figure 3.3: Sample Email Header
Header Analysis
In performing email header analysis, the Researchers adapted Wang and Chen's [1] method, applying the rules that the Researchers observed from the spam emails collected:
[Flowchart. Recoverable labels: Extracted Header Features; Validate FROM field; Invalid FROM field? (Yes/No); Mark Email as SPAM; Validate TO field; Invalid TO field? (Yes/No); Validate X-MAILER; Validate MESSAGE-ID; Validate RETURN-PATH; Validate REPLY-TO; Validate CC/BCC/SENDER; Header Analysis Result.]
Figure 3.5: Header Analysis
Table 3.1: Approach and Rules

Judgement: Judged as normal emails
Approach: Do not filter out emails with the following characteristics.
Rules: Normal email has the following characteristics:
1 Valid FROM field format.
2 Valid TO field format.
3 Valid DATE range.
4 Mail-User-Agent (X-MAILER) is commonly used.
5 MESSAGE-ID field format is valid.
6 MESSAGE-ID field domain is the same as the FROM field domain.
7 RETURN-PATH field domain is the same as the FROM field domain.
8 REPLY-TO field domain is the same as the FROM field domain.
9 Valid CC field format.
10 Valid BCC field format.
11 Valid SENDER field format.

Judgement: Judged as spam
Approach: Filter out emails that match rule 1, 2, or 3; or match any two of rules 4 to 11.
Rules: Spam has the following characteristics:
1 Invalid FROM field format.
2 Invalid TO field format.
3 Invalid DATE range.
4 Mail-User-Agent (X-MAILER) is not commonly used.
5 MESSAGE-ID field format is invalid.
6 MESSAGE-ID field domain is not the same as the FROM field domain.
7 RETURN-PATH field domain is not the same as the FROM field domain.
8 REPLY-TO field domain is not the same as the FROM field domain.
9 Invalid CC field format.
10 Invalid BCC field format.
11 Invalid SENDER field format.

Judgement: Judged as indeterminate
Approach: Neither normal nor spam emails.
Rules: Neither normal nor spam emails.
All the rules applied in the header analysis were patterned after the Request for Comments (RFC) 1036 – Standard for USENET Messages. [51]
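The judgement in Table 3.1 can be illustrated with a short Python sketch. It is not the Researchers' program. It assumes, following the description in Chapter 4, that a single violation among rules 1 to 3 is enough to mark an email as spam, and it reads "indeterminate" as exactly one violation among rules 4 to 11; the function name `judge_header` is hypothetical.

```python
# Sketch of the Table 3.1 judgement (assumed reading, not the thesis code).
PRIORITY_RULES = {1, 2, 3}            # FROM, TO, DATE
SECONDARY_RULES = set(range(4, 12))   # rules 4 to 11


def judge_header(violated_rules):
    """Return 'spam', 'normal', or 'indeterminate' for a set of violated rule numbers."""
    violated = set(violated_rules)
    if violated & PRIORITY_RULES:
        return "spam"                 # any priority rule violated
    secondary = violated & SECONDARY_RULES
    if len(secondary) >= 2:
        return "spam"                 # two or more lower-priority rules violated
    if len(secondary) == 1:
        return "indeterminate"        # assumption: neither clearly normal nor spam
    return "normal"
```

For example, `judge_header({6, 7})` returns `"spam"` because two lower-priority rules are violated.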
STEP 1. Extraction of Header Features
Email headers are present in every email received via the Internet, and they can provide valuable diagnostic information such as hop delays, anti-spam results, and more. Figure 2.3 shows a sample email header.
The Researchers extracted the header features for both Group A and Group B. Using the JavaMail API [58] in a custom parsing tool, the Researchers:
a. Parsed an email message and extracted the header part.
b. Extracted the selected header fields: From, To, Date, X-Mailer, and Message-ID; Return-Path, Reply-To, CC, BCC, Sender fields were extracted if present.
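The two steps above can be sketched as follows. The study used the JavaMail API; this sketch uses Python's standard `email` package as a stand-in, and the function name `extract_header_features` and the sample message are hypothetical.

```python
from email import message_from_string

# Fields always extracted, and fields extracted only if present.
REQUIRED_FIELDS = ["From", "To", "Date", "X-Mailer", "Message-ID"]
OPTIONAL_FIELDS = ["Return-Path", "Reply-To", "Cc", "Bcc", "Sender"]


def extract_header_features(raw_email):
    """Parse a raw message and return the selected header fields as a dict."""
    msg = message_from_string(raw_email)
    # Required fields are always reported; a missing one is stored as None.
    features = {name: msg.get(name) for name in REQUIRED_FIELDS}
    # Optional fields are reported only when the message contains them.
    for name in OPTIONAL_FIELDS:
        value = msg.get(name)
        if value is not None:
            features[name] = value
    return features


SAMPLE = (
    "From: Alice <alice@example.com>\n"
    "To: bob@example.org\n"
    "Date: Mon, 06 May 2013 10:15:00 +0800\n"
    "Message-ID: <abc123@example.com>\n"
    "Reply-To: alice@example.com\n"
    "Subject: Hello\n"
    "\n"
    "Body text.\n"
)
features = extract_header_features(SAMPLE)
```

Header lookups in the `email` package are case-insensitive, so `Cc` also matches a `CC` header.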
STEP 2. Validate FROM Field
According to RFC 1036, each email message must include the From field, containing the address of the sender who wishes this message to be sent.
Figure 3.6: Different Valid Formats for the From Field
Based on observations from the collected spam emails, the From field of some spam emails does not follow the standard set by RFC 1036. This loophole was used as a criterion in judging the emails.
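A format check of this kind might look like the sketch below. It accepts the three From forms given in RFC 1036 (address only, address followed by a name in parentheses, and name followed by the address in angle brackets). The address pattern is a deliberate simplification rather than the full RFC grammar, and `is_valid_from` is a hypothetical name.

```python
import re

# Simplified address pattern (assumption): local part, "@", dotted domain.
ADDR = r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+"

FROM_PATTERNS = [
    re.compile(rf"^{ADDR}$"),               # mark@cbosgd.ATT.COM
    re.compile(rf"^{ADDR}\s+\([^()]*\)$"),  # mark@cbosgd.ATT.COM (Mark Horton)
    re.compile(rf"^[^<>]*<{ADDR}>$"),       # Mark Horton <mark@cbosgd.ATT.COM>
]


def is_valid_from(value):
    """Return True if the From value matches one of the accepted forms."""
    if value is None:
        return False
    value = value.strip()
    return any(pattern.match(value) for pattern in FROM_PATTERNS)
```

A missing or empty From field is treated as invalid, matching the requirement that every message carry this field.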
STEP 3. Validate TO Field
The recipient address was also used as a cue for judging normal emails. The To field follows a strict structure like the From field. It was also observed that the To field of spam emails is usually empty or contains “undisclosed-recipients”, as shown in the figure below.
Figure 3.8: Empty To Field
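The To-field cue can be sketched as below, using `getaddresses` from Python's standard library to split a recipient list. Treating any entry without an "@" as invalid is an assumption of this sketch, and `is_valid_recipient_field` is a hypothetical name.

```python
from email.utils import getaddresses


def is_valid_recipient_field(value):
    """Return False for empty, 'undisclosed-recipients', or address-less values."""
    if value is None or not value.strip():
        return False                      # empty To field
    if "undisclosed-recipients" in value.lower():
        return False                      # hidden recipient list
    addresses = [addr for _name, addr in getaddresses([value])]
    # Every listed recipient must at least look like an address.
    return bool(addresses) and all("@" in addr for addr in addresses)
```

The same function could be applied to the CC, BCC, and Sender fields when they are present.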
STEP 4. Validate DATE Field
The Date field was also used as a spam indicator. As shown in the figure below, some spam emails contain a date that is in the future. The researchers validated the date the email was sent, setting the range from when email started (1960) [52] up to the present date.
Figure 3.9: Sample of an Invalid Date
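The range check can be sketched as follows. The lower bound of 1 January 1960 is this sketch's reading of "1960", and the `now` parameter is added only so that the check can be reproduced at a fixed date; `is_valid_date` is a hypothetical name.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

EARLIEST = datetime(1960, 1, 1, tzinfo=timezone.utc)  # assumed lower bound


def is_valid_date(value, now=None):
    """Return True if the Date header parses and lies between 1960 and `now`."""
    if not value:
        return False
    now = now or datetime.now(timezone.utc)
    try:
        sent = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return False                      # unparseable date string
    if sent.tzinfo is None:
        sent = sent.replace(tzinfo=timezone.utc)  # assume UTC when no zone is given
    return EARLIEST <= sent <= now


reference = datetime(2013, 7, 1, tzinfo=timezone.utc)
ok = is_valid_date("Mon, 06 May 2013 10:15:00 +0800", now=reference)
future = is_valid_date("Tue, 19 Jan 2038 11:13:00 +0800", now=reference)
```

With the reference date set to July 2013, the 2038 date is rejected as lying in the future.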
STEP 5. Validate X-Mailer, Message-ID, Return-Path, Reply-To, CC, BCC, Sender
MUAs (Mail User Agents) are software applications that are commonly used in sending email (refer to Appendix A). Emails that were not sent using a common MUA were treated as a cue for spam.
The Message-ID also follows a strict format (unique@full_domain_name) as stated in RFC 1036. This format was used to check the validity of the said field. The researchers also compared the Message-ID's domain to the From field's domain. As observed in spam emails, the two domains were different. Figure 3.11 shows the comparison of the two fields' domains.
Figure 3.11: Different Domain Names of The From and Message-ID Fields
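The Message-ID checks can be sketched as below. The regular expression is a simplified stand-in for the unique@full_domain_name format, and the domains are compared as exact case-insensitive strings, which is an assumption (a subdomain would count as different). All function names are hypothetical.

```python
import re
from email.utils import parseaddr

# Simplified pattern (assumption): <unique@dotted.domain>
MESSAGE_ID = re.compile(r"^<([^<>@\s]+)@([A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+)>$")


def message_id_domain(value):
    """Return the lower-cased domain of a well-formed Message-ID, else None."""
    match = MESSAGE_ID.match(value.strip()) if value else None
    return match.group(2).lower() if match else None


def from_domain(value):
    """Return the lower-cased domain of the From address, else None."""
    _name, addr = parseaddr(value or "")
    return addr.rpartition("@")[2].lower() if "@" in addr else None


def check_message_id(message_id, from_value):
    """Report format validity and whether the domain matches the From domain."""
    mid_dom = message_id_domain(message_id)
    return {
        "valid_format": mid_dom is not None,
        "same_domain": mid_dom is not None and mid_dom == from_domain(from_value),
    }
```

The two flags correspond to rules 5 and 6 of Table 3.1 respectively.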
If the email contains Return-Path, Reply-To, CC, BCC, and/or Sender fields, these fields were also used to judge the emails. Most spammers tend to spoof the From field, but leave their real email address in the Return-Path and Reply-To fields so that when the recipient replies it goes to their email. [8] With this, the researchers compared the From field to the Return-Path and Reply-To fields to verify if the said fields are the same. CC, BCC, and Sender fields were also processed the same as the To field.
All possible combinations of the header fields were used to allow the researchers to get the best combination with the body classifier algorithms that would yield the highest accuracy. The different results for all combinations are shown in Figure 3.12.
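Enumerating the header-field combinations can be sketched with `itertools.combinations`. The field list below treats each field separately; whether the study grouped some fields (for example From with To) is not stated here, so the count produced is only illustrative.

```python
from itertools import combinations

HEADER_FIELDS = ["From", "To", "Date", "X-Mailer", "Message-ID",
                 "Return-Path", "Reply-To", "CC", "BCC", "Sender"]


def field_combinations(fields):
    """Yield every non-empty combination of the given header fields."""
    for size in range(1, len(fields) + 1):
        for combo in combinations(fields, size):
            yield combo


# With ten independent fields there are 2**10 - 1 non-empty combinations.
n_combinations = sum(1 for _ in field_combinations(HEADER_FIELDS))
```

Each combination would then be paired with each body classifier and scored.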
Figure 3.12: Sample Header Analysis Results
Content-based Body Analysis
The contents of the body were filtered and classified using the different attribute selection and classifier algorithms available in WEKA that the Researchers selected.
STEP 1. HTML Tags Removal
HTML tags were removed in order to clean and extract only the contents of the email body, without its formatting.
Figure 3.13: Sample Email Body without HTML Tags
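Tag removal of this kind can be sketched with Python's standard `html.parser`. The tool the Researchers actually used for this step is not named, so this is only an illustration; skipping the contents of script and style elements is an added assumption, and `strip_html` is a hypothetical name.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text nodes, ignoring tags and script/style contents."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self):
        # Join text nodes with spaces, then collapse runs of whitespace.
        return " ".join(" ".join(self._chunks).split())


def strip_html(html):
    """Return the visible text of an HTML email body."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


clean = strip_html("<html><body><p>Click <b>here</b> for &amp; more</p>"
                   "<style>p{color:red}</style></body></html>")
```

Text nodes are joined with a space so that words in adjacent elements are not merged before tokenization.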
STEP 2. Extraction of Body Features
Using Apache OpenNLP [53], the body features were extracted from the email messages (shown in Figure 3.14). Applying Ahmed Obied's [38] method, we:
a. Extracted attributes from the body by tokenizing using the delimiter: \n\f\r\t\\ /&%# {}[]! +=-()\'\"*?:;<>@~._
b. Ignored the attributes of size three characters or less
Figure 3.14: Sample Attributes/Tokens of an Email
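The two tokenization steps can be sketched with a regular expression in place of OpenNLP. The delimiter set is copied from step (a), and tokens of three characters or fewer are dropped as in step (b); the function name `tokenize` is hypothetical, and no case folding is applied because none is described.

```python
import re

# Delimiter characters from step (a): whitespace, backslash, and punctuation.
DELIMITERS = "\n\f\r\t\\ /&%#{}[]!+=-()'\"*?:;<>@~._"
_SPLIT = re.compile("[" + re.escape(DELIMITERS) + "]+")


def tokenize(body, min_length=4):
    """Split the body on the delimiters and keep tokens of at least min_length."""
    return [token for token in _SPLIT.split(body) if len(token) >= min_length]


tokens = tokenize("Click here: http://best-prices.example.com/fast_shipping now!")
```

In this example the short tokens `com` and `now` are discarded by the length filter.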
STEP 3. Selection of Attributes
Using the 4,179 samples of pre-classified emails as the training set, two (2) frequency tables were built containing the number of occurrences of words/tokens. Frequency Table 1 contains all the tokens extracted without using a dictionary. On the other hand, Frequency Table 2 contains all the tokens that were considered as valid words by WordNet [54]. The frequency tables contain:
b. Number of times each word occurred that belongs to spam
Figure 3.15: Sample Summary of Frequency Table
Setting the threshold at one hundred (100), the features (from the two frequency tables) with a hundred and more occurrences, totaling to two hundred sixty (260) attributes, were selected as attributes for the body analysis (refer to Appendix D).
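Building a frequency table and applying the threshold can be sketched as below. The sketch assumes the threshold is applied to the total number of occurrences across spam and non-spam; the text does not say whether totals or per-class counts were used. The function names and the toy data are hypothetical.

```python
from collections import Counter


def build_frequency_table(emails):
    """emails: iterable of (tokens, label) pairs, label being 'spam' or 'nonspam'."""
    table = {"spam": Counter(), "nonspam": Counter()}
    for tokens, label in emails:
        table[label].update(tokens)
    return table


def select_attributes(table, threshold=100):
    """Return tokens whose total occurrences reach the threshold (assumed rule)."""
    totals = table["spam"] + table["nonspam"]
    return sorted(token for token, count in totals.items() if count >= threshold)


toy_emails = [
    (["click", "best", "click"], "spam"),
    (["debian", "lists"], "nonspam"),
    (["debian", "click"], "nonspam"),
]
toy_table = build_frequency_table(toy_emails)
toy_selected = select_attributes(toy_table, threshold=2)
```

The toy example uses a threshold of 2 only to keep the data small; the study used 100.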
A .csv file (shown in Figure 3.16) was created containing the computed frequency of each token, and the actual classification of the emails. Each row represents an email in the training set. The frequency is computed using the formula:
Equation 3.1: Frequency Formula
Figure 3.16: Sample .csv File
This file was converted to an .arff file that is readable by WEKA. A sample is shown in the figure below:
In order to select the features that are useful in classifying the email messages, the researchers applied feature selection algorithms available in WEKA, which are divided into two adjustable functions – an attribute evaluator and a search method. Appendix B lists all feature selection algorithms that were used. A sample .arff file of the training set after applying feature selection is shown in Figure 3.18.
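For reference, the general layout of an .arff file (a relation name, one declaration per attribute, and a data section) can be produced with a few lines of Python. This is a minimal sketch and not the conversion tool used in the study. It assumes numeric token attributes with simple alphanumeric names and a nominal class attribute; `to_arff` is a hypothetical helper.

```python
def to_arff(relation, attributes, rows, classes=("spam", "nonspam")):
    """Render rows of (values, label) as ARFF text with numeric attributes."""
    lines = [f"@relation {relation}", ""]
    for name in attributes:
        lines.append(f"@attribute {name} numeric")
    lines.append("@attribute class {" + ",".join(classes) + "}")
    lines += ["", "@data"]
    for values, label in rows:
        lines.append(",".join(str(v) for v in values) + "," + label)
    return "\n".join(lines) + "\n"


arff_text = to_arff("emails", ["click", "debian"], [([0.5, 0.0], "spam")])
```

Attribute names containing spaces or special characters would need quoting, which this sketch does not handle.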
STEP 4. Training of the Classifiers
Using exhaustive search, different classification algorithms in WEKA [49] were used in the study to verify which algorithm works best with header analysis (refer to Appendix B).
Testing of the Classifiers
As shown in Figure 3.19, the test set was composed of 512 emails (155 nonspam and 357 spam). Testing was also done in WEKA using the model created in training each classifier. The result of the test was also saved in an .arff file shown in Figure 3.20.
Figure 3.20: Sample .arff File of the Test Result
Classification of Email
The results of the two (2) analyses were tabulated in an Excel (.xlsx) file as shown in Figure 3.21.
Parallel Classification
The result of the header analysis was compared to the body analysis result. The matched results of the two (2) analyses were considered as the final classifications; otherwise, the email was considered Indeterminate.
Sequential Classification
To avoid mistakenly filtering out normal emails, Slack Filtering was applied. In a slack filter, all emails judged normal are kept and processing continues. Emails classified as spam by the header analysis were automatically considered spam; otherwise, the final classification was based on the body analysis result.
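The two ways of combining the header and body results can be written as two small functions. This is a sketch of the rules as described in this section; the function names and label strings are hypothetical.

```python
def classify_parallel(header_result, body_result):
    """Parallel: keep the result only when both analyses agree."""
    if header_result == body_result:
        return header_result
    return "indeterminate"


def classify_sequential(header_result, body_result):
    """Sequential: header spam is final; otherwise defer to the body result."""
    if header_result == "spam":
        return "spam"
    return body_result
```

For a header result of `"nonspam"` and a body result of `"spam"`, the parallel rule gives `"indeterminate"` while the sequential rule gives `"spam"`.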
Accuracy Evaluation
To evaluate the performance of the experimental models, the following performance measures were used:
1. True Positive Percentage (Correctly classified SPAM)
2. False Positive Percentage (NONSPAM classified as SPAM)
3. True Negative Percentage (Correctly classified NONSPAM)
4. False Negative Percentage (SPAM classified as NONSPAM)
5. Accuracy (What percent of the predictions is correct?)
6. Indeterminate (Percentage of indeterminate emails)
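These measures can be computed as in the sketch below. The denominators are assumptions that appear consistent with how the reported rates add up in Tables 4.3 and 4.4: TP and FN are taken over all actual spam, FP and TN over all actual nonspam, accuracy over the emails that received a classification, and the indeterminate rate over all emails. The function name `evaluate` and the toy data are hypothetical.

```python
def evaluate(pairs):
    """pairs: list of (actual, predicted); predicted may be 'indeterminate'."""
    n_spam = sum(1 for actual, _ in pairs if actual == "spam")
    n_nonspam = sum(1 for actual, _ in pairs if actual == "nonspam")
    tp = sum(1 for a, p in pairs if a == "spam" and p == "spam")
    fn = sum(1 for a, p in pairs if a == "spam" and p == "nonspam")
    fp = sum(1 for a, p in pairs if a == "nonspam" and p == "spam")
    tn = sum(1 for a, p in pairs if a == "nonspam" and p == "nonspam")
    classified = tp + fn + fp + tn

    def pct(numerator, denominator):
        return round(100.0 * numerator / denominator, 2) if denominator else 0.0

    return {
        "TP": pct(tp, n_spam),
        "FP": pct(fp, n_nonspam),
        "TN": pct(tn, n_nonspam),
        "FN": pct(fn, n_spam),
        "A": pct(tp + tn, classified),                 # over classified emails only
        "I": pct(len(pairs) - classified, len(pairs)),
    }


toy = [("spam", "spam"), ("spam", "spam"), ("spam", "nonspam"),
       ("spam", "indeterminate"), ("nonspam", "nonspam"), ("nonspam", "nonspam"),
       ("nonspam", "spam"), ("nonspam", "indeterminate")]
toy_measures = evaluate(toy)
```

In the toy data two of eight emails are indeterminate, so accuracy is computed over the remaining six.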
Chapter 4
RESULTS AND DISCUSSION
This chapter presents the results and discussion of the study. The series of steps and all the accompanying computations are shown in this chapter for both parallel and sequential evaluation. The combinations of header fields that yielded the highest accuracy, the lowest false positive rate, and a low to zero indeterminate rate (both in parallel and in sequence) are presented, together with the best three models for each type of evaluation.
The Researchers looked for volunteers among their peers, colleagues, relatives, and mentors who were willing to provide their emails for the purpose of the research. Before the study was carried out, its significance, rationale, and purpose were explained to the volunteers who provided their e-mail accounts. The volunteers were also assured that all the data they provided would be used solely for the research and that their identities and e-mail data would be kept confidential. Roughly 5,000 emails were collected, from which the researchers selected only those in plain-text or HTML format.
The pre-classified emails collected totaled 4,179. Afterwards, the Researchers used the Export Messages to EML Format [55] tool, which extracts emails from Microsoft Outlook to the .eml format, which is readable in Notepad.
Pre-Experiment
Originally, the Researchers considered only the common header fields from Wang and Chen [1]: Subject, From, To, X-Mailer, and Message-ID. Once the emails were Notepad-readable, the Researchers examined them and deduced that more header fields can be critical in header evaluation, namely:
Return-Path
Reply-To
CC
BCC
Sender
Date
The Researchers wrote a program to separate the header and body parts. The Subject was appended to the body, since the Subject field offers no comparison or cue that can be used to identify spam. Data cleaning was used only in Body Analysis.
Header Analysis
When the Researchers had gathered a reasonable number of emails for the training set, the emails were extracted from Microsoft Outlook. To further analyze the behavior of the fields of each email, the test set was analyzed to show whether any header field is invalid. The Researchers observed different irregularities in the fields of the emails' headers. The following spam characteristics were seen:
From and Return-Path/Reply-To fields are not similar;
From, To, and Sender fields contain no email address;
To field contains “undisclosed-recipients”;
Message-ID contains a dollar ($) sign and does not follow the proper format;
Message-ID domain is not the same as the From field's domain; and
Date is invalid (example: Tue, 19 Jan 2038 11:13:00 +0800).
The JavaMail API was used to access and extract the header fields that were used for the header evaluation. In validating the selected fields, the rules were based on RFC 1036, which contains the valid formats for the header part of an email. [51]
According to RFC 1036, each email message must have the From field. This field contains the address of the sender who wishes this message to be sent and must follow the proper format of an email address. The format of the said field was then validated against the standard set by RFC 1036.
In validating the To, CC, and BCC fields, the recipient address was also used as a cue for judging emails. The To field follows a strict structure like the From field. It was also observed that the To field of spam emails is usually empty or contains “undisclosed-recipients”; the same rule applies to the “CC” and “BCC” fields.
The Date field was also used as a spam indicator. As observed, some spam emails contain a date that is in the future. They validated the date the email was sent, setting the range from when email started (1960) [52] up to the present date.
MUAs (Mail User Agents) are software applications that are commonly used in sending email. Emails that were not sent using a common MUA were treated as a cue for spam: the X-Mailer field of each email was checked against the list of commonly used Mail User Agents (refer to Appendix A).
Most spammers tend to spoof the From field, but leave their real email address in the Return-Path and Reply-To fields so that when the recipient replies it goes to their email. This loophole was also used as a cue for spam. If present, these two fields were compared to the From field to check whether those fields have the same values.
The Researchers set the From, To, and Date fields as priority indicators of a spam email. If one of those three fields is invalid, the email is automatically classified as spam. The Message-ID, X-Mailer, Return-Path, Reply-To, CC, BCC, and Sender fields were treated with less priority, meaning an email must violate two of the rules imposed on the said fields before being classified as spam. Below is a sample result of the Header Analysis showing all the fields that are invalid or have an invalid format.
Figure 4.1: Sample Header Analysis Result
As shown in Figure 4.1, based on the rules set, the header analysis results show that 66.89% of the test emails have an invalid Message-ID domain, which is the most frequently violated rule, followed by an invalid Return-Path (39.84%) and an invalid Message-ID format (33%). Other rules showed low percentages, given that those fields are optional fields as indicated in RFC 1036. This also shows that these fields are good indicators for email classification.
Body Analysis
Before body analysis, HTML tags were removed from all HTML-format emails. The Researchers used OpenNLP, a machine-learning-based toolkit for the processing of natural language text, for tokenization using the delimiter: \n\f\r\t\\ /&%# {}[]! +=-()\'\"*?:;<>@~._; and applied Ahmed Obied's method in parsing the words of the body part of the email. To further polish the tokens, data cleaning was done, such as ignoring words with three characters or fewer and removing stop words (refer to Appendix C).
From these, two frequency tables were created. Frequency Table 1 contains all the tokens extracted. On the other hand, Frequency Table 2 contains tokens considered by WordNet, a large database of English, as valid words. From the two tables, features with one hundred (100) and more occurrences were selected as significant attributes, totaling to two hundred sixty (260) words/tokens. From this frequency table, the following tokens were considered to be good indicators of a spam email:
Table 4.1: Indicators of Spam Email
1. http = 4192 6. fast = 2162
2. click = 3067 7. shipping = 2130
3. best = 2777 8. please = 1696
4. customers= 2565 9. address = 1583
5. delivery = 2255 10. prices = 1511
On the other hand, the following are indicators of a nonspam email:
Table 4.2: Indicators of Nonspam Email
1. debian = 776 6. weblog = 327
2. lists = 577 7. unsubscribe = 301
3. radio = 525 8. blogspot = 284
4. blog = 458 9. postregsql = 267
5. weblogs = 391 10. index = 259
A .csv file was created containing the probability of the frequency of each token in an email. This file was then converted to an .arff file.
Using exhaustive search, all possible combinations of attribute evaluator and search method (feature selection) were applied to at most five (5) classifiers from the different classifier groups (Bayes, Functions, Lazy, Meta, MI, Misc, Rules, and Trees groups) available in WEKA. Models were created using the training set, and the test set was re-evaluated using these models. Predictions of the classifiers were saved to an .arff file and converted to a .csv file for easier tabulation. Results of each train-and-evaluate process were tabulated in an Excel file as shown in Figure 3.21.
For the parallel evaluation, the header result is compared to the body result. If the results of both analyses are the same, the matching result becomes the final classification of the email. On the other hand, sequential evaluation considers the result of the header analysis first. If it results in spam, the email is automatically considered spam. Otherwise, the body analysis result is considered (Slack Filtering). Other results are shown in Appendix E.
Figure 4.2: Sample Summary of All Feature Selection and Classifier Algorithms Performed
Permutations of the header fields were also tried to identify which combination of fields would work best with body analysis. All gathered results were summarized in a table as shown in Figure 4.3. Since several combinations yielded similar results, the top three were chosen based on the header fields used. The combination of header fields with the largest number of required fields was preferred in order to make the header analysis more efficient. Tables 4.3 and 4.4 present the True Positive (TP) rate, False Positive (FP) rate, True Negative (TN) rate, False Negative (FN) rate, Accuracy (A) rate, and Indeterminate (I) rate of the top three classifications for each type of evaluation. The TP rate is the percentage of correctly classified spam emails; the FP rate is the percentage of nonspam emails identified as spam; the TN rate is the percentage of correctly classified nonspam emails; the FN rate is the percentage of spam emails identified as nonspam; the A rate is the percentage of correct predictions; and the I rate is the percentage of unclassified emails.
For both the parallel and the sequential evaluation, the header-feature selection-classifier algorithm combination with the highest accuracy rate, a low false positive rate, and a low indeterminate rate was considered. These performance measures were used as the basis for selection for the following reasons: the accuracy rate is the percentage of correctly classified emails, so a high accuracy rate indicates a high number of correctly classified emails; a low false positive rate means few nonspam emails are classified as spam, so legitimate emails can pass through the filter; and a low indeterminate rate indicates that few emails passed through the filter without being classified. The top three combinations were selected based on these criteria.
Table 4.3: Parallel Top 3 Result (Result with Header Analysis)

Classifier           NB                     J48G                   DT
Feature selection    OneRAttribute-Ranker   SVMAttribute-Ranker    OneRAttribute-Ranker

Header fields used:
NB:   From/To, Date, MID, XMAIL, CC, BCC, Sender
J48G: Date, MID, RPath, XMAIL, CC, BCC
DT:   MID, RPath, RTo, XMAIL, CC, BCC

Performance Measure  NB      J48G    DT
TP                   55.18   53.50   54.06
FP                   1.94    3.87    5.81
TN                   56.77   50.97   47.10
FN                   7.00    3.92    2.80
A                    91.05   93.10   93.33
I                    38.87   43.36   44.34

*TP = True Positive *FP = False Positive *TN = True Negative *FN = False Negative *A = Accuracy *I = Indeterminate
Table 4.4: Sequential Top 3 Result (Result with Header Analysis)

Classifier           NB                            NB                     NB
Feature selection    Relief Attribute Eval-Ranker  SVMAttribute-Ranker    OneRAttribute-Ranker

Header fields used:
NB (Relief): Date, RPath, RTo, XMail
NB (SVM):    Date, RPath, RTo, XMail, BCC
NB (OneR):   Date, RPath, RTo, XMail, CC, BCC

Performance Measure  NB (Relief)  NB (SVM)  NB (OneR)
TP                   78.99        78.99     80.11
FP                   7.10         7.10      7.74
TN                   92.90        92.90     92.26
FN                   21.01        21.01     19.89
A                    83.20        83.20     83.79
I                    0.00         0.00      0.00
The pre-selected models for both types of evaluation are: OneR Attribute Eval – Ranker (with Naïve Bayes), SVM Attribute Eval – Ranker (with J48Graft), OneR Attribute Eval – Ranker (with Decision Table), Relief Attribute Eval – Ranker (with Naïve Bayes), and SVM Attribute Eval – Ranker (with Naïve Bayes). Table 4.5 presents a summary of the Correctly Classified Instances, Incorrectly Classified Instances, Kappa Statistic, Mean Absolute Error, Root Mean Squared Error, TP Rate, FP Rate, and ROC Area of these classifiers when evaluated on the training set.
Table 4.5: Evaluation of Classifier using Training Set for Spam Class
(Columns give the attribute evaluator, used with the Ranker search method, and the classifier.)

Classifier Evaluation                OneR–NB  OneR–DT  SVM–J48G  SVM–NB  Relief–NB
Correctly Classified Instances       75.47    98.66    99.29     75.57   75.47
Incorrectly Classified Instances     24.53    1.34     0.71      24.53   24.53
Kappa Statistics                     0.26     0.89     0.94      0.26    0.26
Mean Absolute Error                  0.25     0.06     0.01      0.25    0.25
Root mean squared error              0.50     0.14     0.08      0.50    0.50
TP Rate                              0.74     1        1         0.74    0.74
FP Rate                              0.10     0.18     0.09      0.10    0.10
ROC Area                             0.88     0.97     0.97      0.88    0.88
The correctly classified instances range from 75.47% to 99.29%. The kappa statistic for the classifiers ranges from 0.26 to 0.94, which means that the classifiers with values of 0.89 and above have better results. The ROC area values are high, ranging from 0.88 to 0.97, which means that these models have very good classifying ability.
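For readers unfamiliar with the kappa statistic, the sketch below computes Cohen's kappa from a two-class confusion matrix: the observed agreement minus the agreement expected by chance, divided by one minus the chance agreement. The counts in the example are hypothetical and are not taken from the study.

```python
def cohen_kappa(tp, fn, fp, tn):
    """Cohen's kappa for a binary confusion matrix."""
    total = tp + fn + fp + tn
    observed = (tp + tn) / total
    # Chance agreement from the row (actual) and column (predicted) totals.
    expected = ((tp + fn) * (tp + fp) + (fp + tn) * (fn + tn)) / total ** 2
    # Note: undefined (division by zero) when expected agreement equals 1.
    return (observed - expected) / (1 - expected)


# Hypothetical counts: 85% observed agreement, 50% expected by chance.
kappa = cohen_kappa(tp=40, fn=10, fp=5, tn=45)
```

A value near 1 indicates agreement well beyond chance, while a value near 0 indicates agreement no better than chance.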
Ten-fold cross validation was performed on the pre-selected classifiers to verify which of the classifiers would perform better with header analysis. This form of validation was done to reduce overfitting (random error), making the classifiers more reliable.
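Ten-fold cross validation partitions the data into ten folds and, in turn, holds each fold out for testing while training on the other nine. The fold construction can be sketched as below. The study ran this inside WEKA, which stratifies folds by class; this simplified sketch does not stratify, and `k_fold_indices` is a hypothetical name.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of nearly equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


splits = list(k_fold_indices(25, k=10))
```

The per-fold results would then be averaged to give the cross-validated figure.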
Table 4.6: Sample Summary of 10-Fold Cross Result for Spam
(Columns give the attribute evaluator, used with the Ranker search method, and the classifier.)

Classifier Evaluation                OneR–NB  OneR–DT  SVM–J48G  SVM–NB  Relief–NB
Correctly Classified Instances       75.31    97.75    97.97     75.47   75.31
Incorrectly Classified Instances     24.69    2.25     2.03      24.53   24.69
Kappa Statistics                     0.25     0.80     0.83      0.26    0.25
Mean Absolute Error                  0.25     0.06     0.03      0.25    0.25
Root mean squared error              0.50     0.16     0.14      0.50    0.50
TP Rate                              0.74     1        1         0.74    0.74
FP Rate                              0.12     0.29     0.26      0.10    0.12
ROC Area                             0.85     0.95     0.85      0.86    0.85
The correctly classified instances range from 75.31% to as much as 97.97%. The 10-fold cross validation divides the data into ten equal sets, trains the classifier on nine of the sets, and tests it on the remaining set; the process is repeated ten times and the outputs of the tests are averaged. As shown in Table 4.6, a noticeable decrease in accuracy can be seen after performing 10-fold cross validation because, unlike the earlier evaluation in which the classifier was trained and tested on the same data set, 10-fold cross validation uses a different training set and test set in each of its ten tests. It is noticeable that OneR Attribute Eval Ranker – DT and SVM Attribute Eval Ranker – J48G (both from parallel evaluation) again showed better results compared to the other models. Values of the ROC area range from 0.85 to 0.97. For all the classifiers, the ROC area decreased after 10-fold cross validation; however, the difference is quite small. All models presented good test results. Again, OneR Attribute Eval Ranker – DT and SVM Attribute Eval Ranker – J48G have better results (0.95 and 0.97, respectively) compared to the other models.
Measures such as the Mean Absolute Error (MAE), which indicates how far a prediction is from the actual value, and the Root Mean Squared Error (RMSE), which measures the difference between the values predicted by a given model and the actual values, were also considered in the evaluation. [57] The results show again that the OneR Attribute Eval Ranker – DT and SVM Attribute Eval Ranker – J48G models are more accurate than the other models. SVM Attribute Eval Ranker – J48 Graft yielded the highest accuracy rate of 97.97% for the parallel evaluation, while OneR Attribute Eval Ranker – Naïve Bayes and Relief Attribute Eval Ranker – Naïve Bayes both yielded 75.31% accuracy for the sequential evaluation.
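The two error measures mentioned above can be written out directly. This sketch uses the standard definitions on plain lists of numbers; WEKA computes them over predicted class probabilities, so its figures are not reproduced here. The sample values are hypothetical.

```python
import math


def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def root_mean_squared_error(actual, predicted):
    """Square root of the average squared difference."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


# Hypothetical example: 1 = spam, 0 = nonspam, predictions as probabilities.
actual = [1, 0, 1, 0]
predicted = [0.5, 0.5, 1.0, 0.0]
mae = mean_absolute_error(actual, predicted)
rmse = root_mean_squared_error(actual, predicted)
```

RMSE penalizes large individual errors more heavily than MAE, which is why it is never smaller than MAE on the same data.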
Figures 4.3 and 4.4 show the threshold curves of SVM Attribute Eval Ranker with J48 Graft and OneR Attribute Eval Ranker – Naïve Bayes. Other graphs are shown in Appendices G and H. In the ROC curves shown, the x-axis corresponds to the false positive rate, whereas the y-axis corresponds to the true positive rate. The graphs for the two selected models show that the true positive rate exceeds the false positive rate at the plotted thresholds. With these results, the two models can be used to predict email classification.
Figure 4.3: Threshold Curve for SVM Attribute Eval Ranker – J48Graft
Figure 4.4: Threshold Curve for OneR Attribute Eval Ranker – Naïve Bayes
Another set of tests was done without the header analysis to verify whether header analysis increases classification accuracy, as shown in Figure 4.5.
Figure 4.5: Sample Classification Without Header Analysis
Table 4.7: Results of Pre-selected Algorithms without Header Analysis

Classifier           NB             J48G          NB            DT             NB
Feature selection    OneRAttribute  SVMAttribute  SVMAttribute  OneRAttribute  ReliefAttribute
                     Ranker         Ranker        Ranker        Ranker         Ranker
TP                   78.43          89.9          78.43         91.60          78.43
FP                   7.10           29.7          7.10          35.48          7.10
TN                   92.90          70.3          92.90         64.52          92.90
FN                   21.57          10.1          21.57         8.40           21.57
A                    82.81          84            82.81         83.40          82.81
Presented in Table 4.7 are the results of the pre-selected algorithms performed without header analysis. Taking OneR Attribute Eval Ranker – Naïve Bayes as an example (see Tables 4.3 and 4.4 for the results with header analysis), the accuracy increased slightly from 82.81% to 83.79% when header analysis was included in the evaluation. This shows that header analysis can be a supplement to body analysis.
Comparing the two types of evaluation, the top three models for the parallel evaluation showed strong results based on the accuracy rate of the classified emails: the accuracy rates for the three models are quite high (above 90%). However, because the indeterminate rates were also quite high (ranging from 38.87% to 44.34%), meaning a large number of emails were not classified, the evaluation is still considered weak. Many emails can still pass through the filter without being classified, which defeats the purpose of a spam filter. On the other hand, the accuracy rates of the top three models of the sequential evaluation are also high, ranging from 83.20% to 83.79%. This type of evaluation outperformed its parallel counterpart because all emails were classified (0% indeterminate rate for all models), making it more reliable when applied to a live system.
Figure 4.6: Screenshot of Spam Filter
Out of all the models selected, SVM Attribute Eval Ranker – J48 Graft yielded the highest accuracy rate (97.97%). Since this model was the most effective in classification, it was embedded in a program (see Figure 4.6) for further evaluation. For the header analysis, the Date, Message-ID, Return-Path, X-Mailer, CC, and BCC fields were used, as these are the fields most commonly used by the top three pre-selected models (for both types of evaluation); these fields were also used with the selected model during the testing phase of the experiment. Sequential evaluation was also used because the results presented showed that it performed better than Parallel evaluation. A set of 100 new emails (50 spam and 50 nonspam), drawn from the researchers' own emails received in May 2013 and from the Enron Spam Dataset, was used for the validation. The result of the validation is shown in the figure below.
Figure 4.7: Sample Tabulation for Validation
After the validation, the performance of the spam filter was computed. The TP rate, FP rate, TN rate, FN rate, and accuracy rate of the validation are shown in Table 4.8. As shown in the