8.3 Materials and Methods
9.4.2 Performance on Personalized Emails
Traditionally, studies on Enron-Spam report the performance of their filters by averaging the scores described in Section 9.3.4 across the six datasets [6], and they have almost always used the arithmetic mean. However, our tests reveal that, like other cutting-edge filters [6,23], our proposed filters behave in opposite ways on Enron 1-3 (hams are missed less often because more hams are in the training data) and Enron 4-6 (spams are missed less often because more spams are in the training data). These extreme trends are exhibited in Figure 9.7, where the reported values vary widely across the six datasets. Such extreme points affect the overall average when it is calculated using the arithmetic mean. A viable alternative is the harmonic mean, which reduces the effect of outliers on the average. Therefore, in this section, we report the harmonic mean of the results obtained from the six datasets.
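To make the difference between the two averages concrete, the following minimal sketch compares them on six per-dataset scores. The score values are hypothetical and chosen only for illustration; they are not results from this study. The harmonic mean is never larger than the arithmetic mean, and a single unusually large score pulls it up far less.

```python
from statistics import fmean, harmonic_mean

# Hypothetical per-dataset scores (illustrative only, not results from this study).
scores = [0.99, 0.98, 0.97, 0.80, 0.96, 0.95]

arithmetic = fmean(scores)        # sum(scores) / n
harmonic = harmonic_mean(scores)  # n / sum(1 / s for s in scores)

print(f"arithmetic mean: {arithmetic:.4f}")
print(f"harmonic mean:   {harmonic:.4f}")
```

Both functions come from the Python standard library, so the comparison can be repeated on any list of positive scores.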
The accuracy and F-score of the classifiers can be found in Figures 9.8a and 9.8b, respectively. The values reported here for personalized emails are significantly better than the values reported in Section 9.4.1 for non-personalized emails. We also compared the accuracy and F-score with those of four benchmark personalized email filters: lc [12], nb-bow [36], svm-bow [36], and icrm [28]. The two ensemble classifiers of sentinel perform about equally well and surpass the rest. The filter that can claim to be a reasonably close second is the lc filter inspired by artificial immune systems [12]. The differences between sentinel’s optimal classifier (bagged rf) and lc are significant at α = 0.05.
The average email misclassification rates of sentinel can be found in Figures 9.8c and 9.8d. bagging and adaboostm1 perform the best: bagging misclassifies hams the least (fpr = 2.1%, see Figure 9.8c) while adaboostm1 misclassifies spams the least (fnr = 0.7%, see Figure 9.8d). In addition, sentinel is compared to personalized filters that are used as yardsticks
[Figure 9.4 plot, "CSDMC2010 Cost Curves": Normalized Expected Cost (0.0 to 0.5) against Probability Cost (0.0 to 1.0) for Random Forest, AdaBoostM1, Bagging, SVM, Naive Bayes, and the Trivial Classifier.]
Figure 9.4: Cost curves of the five classifiers generated by sentinel on the CSDMC2010 dataset.
ensemble classifiers of sentinel outperform the rest. Four of sentinel’s classifiers (all except nb) surpassed the fnr of nb-tf, but the fpr of our second best classifier, adaboostm1, ties with that of nb-tf; the fpr of bagging is higher than that of nb-tf. The difference, however, is statistically significant only at α = 0.10.
Further tests on each of the six datasets reveal that the skewness of the data has a detrimental effect on the training of anti-spam filters. For instance, except for the aberrant trend displayed by the nb classifier, the classifiers misclassify fewer hams in Enron 1-3 (ham skewed) and fewer spams in Enron 4-6 (spam skewed). This experiment also suggests that rf exhibits a similar ability to the meta-learners for spam classification (see Figure 9.7b). However, its ability to identify hams diminishes when more spams are included in its training data (see Figure 9.7a), so much so that for Enron 5 and 6 it is outperformed by nb. Overall, the results indicate that an svm classifier is more sensitive to the skewness of the training data; hence, its training set should be carefully selected. With spam-skewed training data, an svm classifier’s ham misclassification rate is as high as 17%. Similarly, with ham-skewed training data, its spam misclassification rate nears 18%.
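As a reminder of how the two misclassification rates discussed above are obtained, here is a minimal sketch that treats spam as the positive class, so that fpr is the fraction of hams labelled as spam and fnr is the fraction of spams labelled as ham. The function name, the label strings, and the toy labels are assumptions made for illustration; the formal definitions used in this study are those of Section 9.3.4.

```python
def misclassification_rates(y_true, y_pred, ham="ham", spam="spam"):
    """Return (fpr, fnr) with spam as the positive class.

    fpr: fraction of true hams predicted as spam (ham misclassification rate).
    fnr: fraction of true spams predicted as ham (spam misclassification rate).
    """
    ham_preds = [p for t, p in zip(y_true, y_pred) if t == ham]
    spam_preds = [p for t, p in zip(y_true, y_pred) if t == spam]
    if not ham_preds or not spam_preds:
        raise ValueError("both classes must be present in y_true")
    fpr = sum(p == spam for p in ham_preds) / len(ham_preds)
    fnr = sum(p == ham for p in spam_preds) / len(spam_preds)
    return fpr, fnr


# Toy labels (hypothetical): 4 hams followed by 4 spams.
y_true = ["ham"] * 4 + ["spam"] * 4
y_pred = ["ham", "ham", "ham", "spam", "spam", "spam", "ham", "ham"]

fpr, fnr = misclassification_rates(y_true, y_pred)
print(fpr, fnr)
```

With a skewed test set, the two rates are computed over very different numbers of emails, which is one reason they are reported separately rather than folded into a single error rate.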
We have investigated the expected performance of the classifiers on each dataset. Although we have produced cost curves for all the classifiers for the six datasets in the Enron-Spam collection, we present only the cost curves of the two best classifiers: adaboostm1 and bagging. These cost curves can be found in Figures 9.9 and 9.10. From the curves, it is evident that the performance of our classifiers can be expected to vary with the ratio of hams to spams. What is interesting in these curves is that not only the ratio of ham to spam plays a role in the expected performance, but also the spams themselves. Note that the spam-skewed datasets display similar curves for both classifiers and show similar spreads among themselves. Stylometric differences among the spams in these three datasets may be the cause of these observations. The ham-skewed datasets produce similar curves, but they are more tightly bundled. Even though the hams are from personal mailboxes, they are more likely to be stylometrically
[Figure 9.5 plot, "SpamAssassin Cost Curves": Normalized Expected Cost (0.0 to 0.5) against Probability Cost (0.0 to 1.0) for Random Forest, AdaBoostM1, Bagging, SVM, Naive Bayes, and the Trivial Classifier.]
Figure 9.5: Cost curves of the five classifiers generated by sentinel on the SpamAssassin dataset.
[Plot, "LingSpam Cost Curves": Normalized Expected Cost (0.0 to 0.5) against Probability Cost (0.0 to 1.0) for Random Forest, AdaBoostM1, Bagging, SVM, Naive Bayes, and the Trivial Classifier.]
[Figure 9.7 plots: (a) Ham Misclassification Rates (FPR) and (b) Spam Misclassification Rates (FNR), each on a 0 to 0.18 scale, against the Enron-Spam datasets 1 to 6 for RF, AdaBoostM1, Bagged RF, SVM, and NB.]
Figure 9.7: (a) Ham misclassification rates and (b) spam misclassification rates of the five sentinel-generated classifiers on the Enron-Spam collection.
[Figure 9.8 plots: bar charts over the classifiers, with the bar values as printed.
(a) Comparison of accuracies. Accuracy (%): RF 97.286; AdaBoostM1 97.656; Bagged RF 97.746; SVM 94.786; NB 92.134; LC 96.800; NB-BoW 88.410; SVM-BoW 95.130; ICRM 89.000.
(b) Comparison of F-scores. F-score: RF 0.966; AdaBoostM1 0.971; Bagged RF 0.972; SVM 0.927; NB 0.902; LC 0.959; NB-BoW 0.873; SVM-BoW 0.946; ICRM 0.900.
(c) Comparison of FPRs (ham misclassification rates). Classifiers: RF, AdaBoostM1, Bagged RF, SVM, NB, NB-TF, Bayesian, NB-SLWE. Printed values: 0.030 0.027 0.021 0.075 0.033 0.028 0.047 0.1.
(d) Comparison of FNRs (spam misclassification rates). Classifiers: RF, AdaBoostM1, Bagged RF, SVM, NB, NB-TF, Bayesian, NB-SLWE. Printed values: 0.007 0.010 0.010 0.128 0.025 0.078 0.081 0.015.]
Figure 9.8: Comparison of the performances on the Enron-Spam collection of the five classifiers generated by sentinel and classifiers from previous studies.
[Figure 9.9 plot, "Cost Curves for the AdaBoostM1 Classifiers for Enron-Spam": Normalized Expected Cost (0.0 to 0.1) against Probability Cost (0.0 to 1.0), one curve each for Enron 1 to Enron 6.]
Figure 9.9: Cost curves for the adaboostm1 classifiers generated for the six datasets in the Enron-Spam collection.
[Figure 9.10 plot, "Cost Curves for the Bagged RF Classifiers for Enron-Spam": Normalized Expected Cost (0.0 to 0.1) against Probability Cost (0.0 to 1.0), one curve each for Enron 1 to Enron 6.]
Figure 9.10: Cost curves for the bagged rf classifiers generated for the six datasets in the Enron-Spam collection.
[Figure 9.11 plots: (a) Ham Misclassification Rates (FPR), 0.01 to 0.09, and (b) Spam Misclassification Rates (FNR), 0 to 0.18, against the Enron-Spam dataset stacks 1 to 6 for RF, AdaBoostM1, Bagged RF, SVM, and NB.]
Figure 9.11: The incremental (a) ham misclassification rates and (b) spam misclassification rates of the sentinel classifiers on the Enron-Spam collection.
similar because they are business-related. Some influence, albeit diminished because of the underrepresentation of spams, comes from the spams. Four of the curves are in a narrow band when Probability Cost = 0.5.
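For readers unfamiliar with cost curves, the sketch below shows how a single classifier maps to a straight line in cost space, following the standard formulation of Drummond and Holte [35]: the normalized expected cost is fnr × PC + fpr × (1 − PC), where PC is the probability cost of the positive (spam) class. The function names and the example rates are hypothetical and are not values measured in this study.

```python
def normalized_expected_cost(fpr: float, fnr: float, pc: float) -> float:
    """Cost line of one classifier: it runs from (0, fpr) to (1, fnr)."""
    if not 0.0 <= pc <= 1.0:
        raise ValueError("pc must lie in [0, 1]")
    return fnr * pc + fpr * (1.0 - pc)


def trivial_cost(pc: float) -> float:
    """Better of the two trivial classifiers (always-ham costs pc, always-spam costs 1 - pc)."""
    return min(pc, 1.0 - pc)


# Hypothetical rates for one classifier (illustrative only).
example_fpr, example_fnr = 0.02, 0.01

for pc in (0.0, 0.25, 0.5, 0.75, 1.0):
    cost = normalized_expected_cost(example_fpr, example_fnr, pc)
    print(f"PC={pc:.2f}  classifier={cost:.4f}  trivial={trivial_cost(pc):.4f}")
```

At Probability Cost = 0.5 the line sits at the midpoint of fpr and fnr, which is why curves with similar error rates fall into a narrow band there.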
Lastly, we evaluated the performance of sentinel on stacks of the six datasets. We generated two sets of stacks. The first set is generated as follows: starting with the stack composed of Enron 1, one dataset (in numerical order) is added to the previous stack. This process continues until Enron 6 is combined with Enron 1-5. Thus, the number of hams dominates up to the third stack, and the ratio of spams to hams becomes closer to 1 after Enron 6 is added to the stack of Enron 1-5. The second set of stacks is generated as follows: starting with the stack composed of Enron 6, one dataset (in reverse numerical order) is added to the previous stack. This process continues until Enron 1 is combined with Enron 6-2. Thus, the number of spams dominates up to the third stack, and the ratio of spams to hams becomes closer to 1 after Enron 1 is added to the stack of Enron 6-2.
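The stacking procedure described above can be sketched as follows. Each stack is represented only by the names of the datasets it contains; the helper name build_stacks is hypothetical, and loading and merging the actual emails is left out.

```python
def build_stacks(order):
    """Return cumulative stacks: the i-th stack holds the first i datasets of `order`."""
    return [order[:i] for i in range(1, len(order) + 1)]


datasets = [f"Enron {i}" for i in range(1, 7)]

forward_stacks = build_stacks(datasets)        # Enron 1, then Enron 1-2, ..., Enron 1-6
reverse_stacks = build_stacks(datasets[::-1])  # Enron 6, then Enron 6-5, ..., Enron 6-1

for number, stack in enumerate(forward_stacks, start=1):
    print(f"Stack {number}: {', '.join(stack)}")
```

In both sets the sixth stack contains all six datasets, so the two sets differ only in the order in which hams and spams come to dominate.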
Our evaluation of sentinel on the stacks shows that, on a numerically balanced dataset, the best ham misclassification rate is achieved by bagged rf followed by adaboostm1, rf, svm, and nb, while the best spam misclassification rate is achieved by adaboostm1 followed by rf, bagged rf, svm, and nb (see Stack 6 in Figures 9.11 and 9.12). Interestingly, we found that in addition to class imbalance in the datasets, the performance of sentinel depends on two other factors: the sources of the emails and the number of training emails for each class. As a case in point, a notable change in ham misclassification rates can be observed for the first three reverse stacks (Figure 9.12a), even though the ratio of spams to hams in the three stacks is the same (i.e., 3:1). There are three differences among the stacks: the number of spam emails, the number of ham emails, and the email sources. This suggests that, besides the class imbalance problem, the email source and the number of training emails may have an influence on our filter’s performance.
[Figure 9.12 plots: (a) Ham Misclassification Rates (FPR), 0.02 to 0.18, and (b) Spam Misclassification Rates (FNR), 0 to 0.1, against the Enron-Spam dataset stacks (in reverse order) 1 to 6 for RF, AdaBoostM1, Bagged RF, SVM, and NB.]
Figure 9.12: The reverse incremental (a) ham misclassification rates and (b) spam misclassification rates of the sentinel classifiers on the Enron-Spam collection.
9.5 Conclusions
In this paper we describe the development and evaluation of an anti-spam filter named sentinel. The filter uses natural-language attributes, the majority being connected to stylometric aspects of writing. The real-valued, natural-language attributes extracted from the email texts are used to generate binary classifiers. The classifiers explored in this study are induced by five state-of-the-art learning algorithms. We evaluate the filter with benchmark non-personalized email datasets such as CSDMC2010, SpamAssassin, and LingSpam as well as standard personalized emails like those in the six datasets of the Enron-Spam collection. The evidence from extensive experiments implies that the best-performing classifiers are those of two ensemble methods: adaboostm1 and bagging. In general, the performance of sentinel is mixed on non-personalized email data. This result is not unexpected because our findings demonstrate that the filter has limitations for non-personalized email data, mainly due to the absence of unique writing patterns in the randomly collected emails. In contrast, on personalized email data, sentinel surpasses the performance of a number of state-of-the-art personalized anti-spam filters. These outcomes imply that the attributes related to writer stylometry can better capture the imprinted patterns in personalized hams. One limitation of the filter is that its performance is affected by the extreme proportions of spams and hams in non-personalized datasets. On a positive note, the filter is not affected at all by this factor on personalized datasets.
Our work clearly has some limitations. Firstly, several aspects of the filter, viz., its real-time training and response latency, are not considered. Extensive tests will be required to confirm sentinel’s usability as an on-line filter. Secondly, personalized datasets share an interesting phenomenon called concept drift, which is yet to be investigated. The reaction of the proposed filter to this phenomenon can be tested by substituting the spams of the Enron-Spam collection with more recent data.
Because our results suggest that ensemble methods perform the best, further tests should be carried out to assess the performance of the filter when several algorithms are stacked to generate its classifiers. Future studies can extend the work by replacing the supervised algorithms used in this study with semi-supervised learning algorithms.
Bibliography
[1] Commtouch, “Internet threats trend report,” tech. rep., Commtouch, USA, April 2013.
[2] E. Blanzieri and A. Bryl, “A survey of learning-based techniques of email spam filtering,” Artificial Intelligence Review, vol. 29, pp. 63–92, Mar. 2008.
[3] J. Goodman, G. V. Cormack, and D. Heckerman, “Spam and the ongoing battle for the inbox,” Communications of the ACM, vol. 50, pp. 24–33, Feb. 2007.
[4] Y. Hu, C. Guo, E. W. T. Ngai, M. Liu, and S. Chen, “A scalable intelligent non-content-based spam-filtering framework,” Expert Systems with Applications, vol. 37, pp. 8557–8565, Dec. 2010.
[5] J.-J. Sheu, “An efficient two-phase spam filtering method based on e-mails categorization,” International Journal of Network Security, vol. 9, no. 1, pp. 34–43, 2009.
[6] V. Metsis, I. Androutsopoulos, and G. Paliouras, “Spam filtering with naive Bayes — Which naive Bayes?,” in Third Conference on Email and Anti-Spam (CEAS), (USA), p. 9pp, 2006.
[7] C.-C. Lai and M.-C. Tsai, “An empirical performance comparison of machine learning methods for spam e-mail categorization,” in Fourth International Conference on Hybrid Intelligent Systems, HIS ’04, (USA), pp. 44–48, IEEE Computer Society, 2004.
[8] A. Qaroush, I. M. Khater, and M. Washaha, “Identifying spam e-mail based-on statistical header features and sender behavior,” in CUBE International Information Technology Conference, (USA), pp. 771–778, ACM, 2012.
[9] M. Ye, T. Tao, F.-J. Mai, and X.-H. Cheng, “A spam discrimination based on mail header feature and SVM,” in Fourth International Conference on Wireless Communications, Networking and Mobile Computing (WiCom08), pp. 1–4, 2008.
[10] Q. Ma, Z. Qin, F. Zhang, and Q. Liu, “Text spam neural network classification algorithm,” in 2010 International Conference on Communications, Circuits and Systems (ICCCAS), (China), pp. 466–469, 2010.
[11] C. Orăsan and R. Krishnamurthy, “A corpus-based investigation of junk emails,” in Third International Conference on Language Resources and Evaluation (LREC-2002), (Spain), pp. 1773–1780, May 29–30, 2002.
[12] Y. Zhu and Y. Tan, “A local-concentration-based feature extraction approach for spam filtering,” IEEE Transactions on Information Forensics and Security, vol. 6, no. 2, pp. 486–497, 2011.
[13] S. Afroz, M. Brennan, and R. Greenstadt, “Detecting hoaxes, frauds, and deception in writing style online,” in 2012 IEEE Symposium on Security and Privacy (SP), (USA), pp. 461–475, 2012.
[14] F. Iqbal, L. A. Khan, B. C. M. Fung, and M. Debbabi, “E-mail authorship verification for forensic investigation,” in Proceedings of the 2010 ACM Symposium on Applied Computing, SAC ’10, (New York, NY, USA), pp. 1591–1598, ACM, 2010.
[15] J. Yang, Y. Liu, Z. Liu, X. Zhu, and X. Zhang, “A new feature selection algorithm based on binomial hypothesis testing for spam filtering,” Knowledge-Based Systems, vol. 24, pp. 904–914, Aug. 2011.
[16] B. Sirisanyalak and O. Sornil, “Artificial immunity-based feature extraction for spam detection,” in Software Engineering, Artificial Intelligence, Networking, and Parallel/Distributed Computing, 2007. SNPD 2007. Eighth ACIS International Conference on, vol. 3, pp. 359–364, 2007.
[17] A. Bratko, G. V. Cormack, B. Filipič, T. R. Lynam, and B. Zupan, “Spam filtering using statistical data compression models,” Journal of Machine Learning Research, vol. 7, pp. 2673–2698, 2006.
[18] R. Prabhakar and M. Basavaraju, “A novel method of spam mail detection using text based clustering approach,” International Journal of Computer Applications, vol. 5, pp. 15–25, August 2010. Published by Foundation of Computer Science.
[19] G. V. Cormack and A. Bratko, “Batch and online spam filter comparison,” in Conference on Email and Anti-Spam, CEAS 2006, (Mountain View, CA), p. 9pp, July 2006.
[20] M. Sahami, S. Dumais, D. Heckerman, and E. Horvitz, “A Bayesian approach to filtering junk e-mail,” in Learning for Text Categorization: Papers from the 1998 Workshop, (USA), pp. 55–62, AAAI Technical Report WS-98-05, 1998.
[21] B. Issac, W. J. Jap, and J. H. Sutanto, “Improved Bayesian anti-spam filter implementation and analysis on independent spam corpuses,” in 2009 International Conference on Computer Engineering and Technology - Volume 02, (USA), pp. 326–330, IEEE Computer Society, 2009.
[22] P. Graham, “A plan for spam,” Aug. 2003.
[23] A. Kosmopoulos, G. Paliouras, and A. Androutsopoulos, “Adaptive spam filtering using only naive Bayes text classifiers,” in Fifth Conference on Email and Anti-Spam (CEAS 2008), (USA), p. 3pp, 2008.
[24] T. S. Guzella and W. M. Caminhas, “A review of machine learning approaches to spam filtering,” Expert Systems with Applications, vol. 36, pp. 10206–10222, Sept. 2009.
[25] P. Haider, U. Brefeld, and T. Scheffer, “Supervised clustering of streaming data for email batch detection,” in 24th International Conference on Machine Learning, (USA), pp. 345–352, ACM, 2007.
[26] X. Carreras and L. Màrquez, “Boosting trees for anti-spam email filtering,” in RANLP-2001, 4th International Conference on Recent Advances in Natural Language Processing, pp. 58–64, 2001.
[27] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning. Springer Series in Statistics, Springer, 2001.
[28] A. Abi-Haidar and L. M. Rocha, “Adaptive spam detection inspired by a cross-regulation model of immune dynamics: A study of concept drift,” in Artificial Immune Systems, pp. 36–47, Springer, 2008.
[29] A. Abi-Haidar and L. M. Rocha, “Adaptive spam detection inspired by the immune system,” in ALIFE, pp. 1–8, 2008.
[30] I. Androutsopoulos, J. Koutsias, K. V. Chandrinos, and C. D. Spyropoulos, “An experimental comparison of naive Bayesian and keyword-based anti-spam filtering with personal e-mail messages,” in 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, (USA), pp. 160–167, ACM, 2000.
[31] A. Liaw and M. Wiener, “Classification and regression by randomForest,” R News, vol. 2, no. 3, pp. 18–22, 2002.
[32] R. E. Schapire, “A brief introduction to boosting,” in 16th international joint conference on Artificial intelligence - Volume 2, IJCAI’99, (USA), pp. 1401–1406, Morgan Kaufmann Publishers Inc., 1999.
[33] L. Breiman, “Bagging predictors,” Machine Learning, vol. 24, pp. 123–140, 1996.
[34] R. C. Holte and C. Drummond, “Cost-sensitive classifier evaluation using cost curves,” in Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD) (T. Washio, E. Suzuki, K. M. Ting, and A. Inokuchi, eds.), vol. 5012 of Lecture Notes in Computer Science, pp. 26–29, Springer, 2008.
[35] C. Drummond and R. Holte, “Cost curves: An improved method for visualizing classifier performance,” Machine Learning, vol. 65, no. 1, pp. 95–130, 2006.