On the Efficiency of Collecting and Reducing Spam Samples
Pin-Ren Chiou, Po-Ching Lin
Department of Computer Science and Information Engineering
National Chung Cheng University
Chiayi, Taiwan, 62102
{cpj101m,pclin}@cs.ccu.edu.tw
Abstract
Collecting spam samples from the Internet is useful for observing the campaigns of spamming botnets and testing spam-filtering products. Common methods of spam collection from the Internet include setting up trap email addresses, a spam-filtering mail gateway, and an open relay sinkhole. In this work, we empirically evaluate the three methods with respect to their collection efficiency and the variety of the collected spam samples. We find that an open relay sinkhole collects the largest number of spam samples among the three methods, but its samples are likely to be duplicates or highly similar. We therefore design a novel two-level cache mechanism, which can efficiently remove nearly 99% of the spam samples sent to the sinkhole, greatly reducing the storage space and the volume of spam samples for further analysis.
1 Introduction
Spam filtering is a common practice on virtually all mail services, but spamming techniques have also been evolving to evade the filtering. Developers of spam-filtering techniques therefore evaluate or compare various techniques with one or several corpora of mail samples [1, 2]. The evaluation results can guide the developers in improving the filtering accuracy based on real-world samples. Anti-spam researchers can also study the latest spam campaigns from the samples. Several corpora of mail samples are publicly available, e.g., on the website of the Cybersecurity Data Mining Competition (www.csmining.org/index.php/data.html). The existing corpora, however, are outdated and cannot reflect the latest spam campaigns. Considering the fast evolution of spamming techniques, an efficient method for collecting ongoing spam messages is required.
In this work, we empirically evaluate three common methods for spam collection (i.e., trap email addresses, spam-filtering mail gateway, and open relay sinkhole) with respect to their collection efficiency and the variety of the spam samples. Note that collecting normal mail samples depends heavily on the willingness of contributors and is outside the scope of this work. We find that an open relay sinkhole collects the largest number of spam samples among the three methods, but the samples in its collection are likely to be duplicates or highly similar. The excessive number of samples wastes storage space and spam analysis time.
We therefore design a novel two-level cache mechanism to efficiently reduce duplicate or highly similar spam samples. The first level comprises a novel structure of hash tables that efficiently identifies and filters out duplicate or highly similar spam samples arriving in a burst, while the second level reduces more obfuscated samples based on features derived from parsing the samples. The two levels together can remove nearly 99% of the collected spam samples, which are found to be duplicates or highly similar. The reduction saves storage space and lowers the volume of spam samples for further analysis.
The remainder of this paper is organized as follows. Section 2 reviews the methods of spam collection and the techniques to identify similar documents. Section 3 presents the deployment of the three methods of spam collection and the design of the two-level cache mechanism. Section 4 presents the collection results and the efficiency of reducing spam samples. Section 5 concludes this work.
2 Related Work
A corpus of spam samples can be collected in multiple ways: (1) A set of trap email addresses, known as spamtraps, can be exposed to spammers to lure their spam messages. A well-known example of this approach is Project Honey Pot (www.projecthoneypot.org), which distributes a large number of trap email addresses to the websites of volunteers. (2) An open relay sinkhole, which is a mail transfer agent (MTA) that allows forwarding email from any client to any destination, can be deployed for spammers to relay spam messages [3]. (3) A mail gateway equipped with spam-filtering functions can offer the spam messages it detects. (4) A spamming bot program can be executed in a controlled environment to deliver spam based on the bot master's instructions. The spam messages can be collected as samples.
A number of techniques have been developed to identify similar documents such as web pages, files and mail messages [4]. Two common techniques are Broder's shingling algorithm [5] and Charikar's simhash [6], which generate a fingerprint to represent a document, and then compare the similarity between the fingerprints. Henzinger compared the two algorithms in detail [7]. Prior studies [8] and tools (e.g., spamsum, www.samba.org/ftp/unpacked/junkcode/spamsum) assumed spam messages in the same campaign are likely to be similar, and used fingerprinting for spam detection.
Despite the existing techniques, a great challenge in reducing spam samples is to identify similar samples in a huge corpus, on the order of millions of spam samples or more, while new spam messages keep arriving. Thus, an efficient online method is required to (1) identify whether a newly arriving spam message is highly similar to an existing one in a huge corpus, and (2) dynamically update the corpus. While the work in [4] can meet the former requirement, the latter is also essential because the arrival rate of spam messages can be high on an open relay sinkhole [3].
3 Methods of Spam Collection and Reduction
We evaluate three spam collection methods in this work: trap email addresses, spam-filtering mail gateway, and open relay sinkhole. Spam collection from bots is left to future work because our current resources are insufficient.
3.1 Spam Collection
The three methods of spam collection in the comparison are described as follows.
3.1.1 Trap Email Addresses
Spammers can collect email addresses as spamming targets by crawling public web pages for email-like patterns using email harvesting tools (e.g., Email Extractor from emailextractorpro.com). Since the trap email addresses are not created for normal use, the mail delivered to them is presumed to be spam. We therefore first applied for two subdomains under the domain name of our campus, i.e., ccu.edu.tw, and then created fake email addresses in the subdomains by following the naming conventions of the campus to make these addresses look real.
Figure 1: The system architecture of the spam-filtering mail gateway [9].
Figure 2: The system architecture of the open relay sinkhole.
We embedded the trap email addresses on two more websites with high PageRank values of 3 and 5, besides ours, so that they would be harvested in a short time, under the assumption that email addresses are exposed rapidly on a website with a high PageRank.
3.1.2 Spam-filtering Mail Gateway
We implemented a spam-filtering mail gateway at a senior high school. The MX record entry of the domain was modified to redirect the SMTP traffic to a Postfix (www.postfix.org) daemon on the gateway. Figure 1 presents the system architecture.
We integrated Amavisd-new and SpamAssassin with Postfix to filter incoming messages. Before the spam collection, we had tuned the spam-filtering algorithm to avoid collecting normal mail messages by minimizing the false-positive rate. We also built our own real-time blocking list (RBL) to enhance the accuracy of the spam filter on the mail gateway [9].
3.1.3 Open Relay Sinkhole
We built an open relay sinkhole for spam collection based on the method in [10]. Spammers routinely scan the Internet for open relays, especially servers listening on the SMTP port. Based on the statistics of spam activities on the Botlab website (www.botlab.org) [11], we rented a virtual private server (VPS) in the U.S., which ranks second among countries in spam activities. Figure 2 presents the system architecture of the open relay sinkhole.
The open relay sinkhole includes Postfix on the VPS and two multi-threaded Perl programs, pCollector and rCollector, to cope with a large number of spam deliveries in a short time. The latter two programs are described as follows:
• pCollector runs on the backend server, and analyzes the mail messages forwarded from Postfix. For each newly arriving spam message, the entire message is searched for the patterns of test messages we have identified (to be discussed later). If it is a test message, it will be sent back to rCollector to be forwarded to its destination; otherwise, the spam message will be forwarded to the two-level cache for identifying duplicate or highly similar ones, only one copy of which will be stored on the local storage.
• rCollector runs on the VPS. It is responsible for forwarding test mail messages passed from pCollector to the destination.
As observed in [3], we also found that the spammers deliver a small number of test messages through the open relay sinkhole, and check whether the test messages are forwarded successfully. Once a test message is delayed, the spammers keep resending the message at short intervals (about once every 10 minutes for an hour). If the test messages remain delayed, the spammers slow down the retry rate and stop the spam delivery. When the open relay starts to deliver the test messages again, more new test messages are sent to the open relay for verification. The spamming traffic resumes once the open relay passes the test message checking.
The subject or the body of a test message may contain the IP address of the open relay with various keywords, such as test, test123, BC, SM, testuser, Open Relay, as a prefix or a postfix. We used the keywords and the IP address of the open relay as the patterns to look for test messages. The test message bodies are mostly empty, and the sender addresses are also mostly similar.
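The pattern search described above can be sketched as follows. This is only an illustrative sketch, not the pCollector implementation: the regular expression, the function names, the case-sensitive matching, and the example address 203.0.113.7 are all assumptions, and only a subset of the keywords listed above is used.

```python
import re

# A subset of the keywords reported in the text; everything else here
# (regex shape, case sensitivity, function names) is an assumption.
TEST_KEYWORDS = ("test", "test123", "BC", "SM")


def build_test_pattern(relay_ip: str) -> re.Pattern:
    """Match the relay IP with a keyword directly before or after it."""
    ip = re.escape(relay_ip)
    # Longer keywords first so that "test123" is preferred over "test".
    kw = "|".join(re.escape(k) for k in sorted(TEST_KEYWORDS, key=len, reverse=True))
    return re.compile(rf"(?:(?:{kw})\W*{ip}|{ip}\W*(?:{kw}))")


def is_test_message(subject: str, body: str, relay_ip: str) -> bool:
    """Return True if the subject or the body looks like a relay test message."""
    pattern = build_test_pattern(relay_ip)
    return bool(pattern.search(subject) or pattern.search(body))


if __name__ == "__main__":
    relay = "203.0.113.7"  # hypothetical relay address (documentation range)
    print(is_test_message("test123 203.0.113.7", "", relay))
    print(is_test_message("Cheap watches", "Visit our shop today", relay))
```

A message flagged this way would be handed back for forwarding, while all other messages would continue to the two-level cache.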
3.2 Methods of Fingerprinting
The spam samples can be summarized with fingerprints generated from a hash function to rapidly identify the similarity among them. We use Charikar's simhash [6] to generate the fingerprint for each spam sample because a simhash fingerprint as short as 64 bits can achieve good precision [4]. For each spam sample, the subject and the body are first decoded to restore the original content if they are encoded in BASE64 or the like, and the simhash function then reads the decoded sample in units of tokens. A token can be an English word, an ASCII string separated by punctuation or space, or a multi-byte character (e.g., a Chinese character). The mail header except the subject is not involved in the fingerprint calculation because it contains variable yet irrelevant information such as recipients and processing records inserted by the MTA, which would increase undesired disparity among the samples.
To generate the f-bit fingerprint fp for each sample, the components of an f-tuple vector v are first initialized to 0 (f = 64 in this work). Each token in the spam content is sequentially hashed into a 64-bit value. For each hash value, if the i-th bit is 1, the i-th component of v is incremented by a weight (1 by default); otherwise, the i-th component of v is decremented by the weight. Finally, the i-th bit of fp is 1 if the i-th component of v is positive; otherwise, the i-th bit of fp is 0. We determine the similarity between two samples according to the Hamming distance of their fingerprints.
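The fingerprint calculation above can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the text does not name the 64-bit token hash, so BLAKE2b truncated to 8 bytes is used here as a stand-in, and the tokenizer (runs of ASCII letters and digits, or single non-ASCII characters) is only one plausible reading of the token definition. All function names are hypothetical.

```python
import hashlib
import re

F = 64  # fingerprint length in bits (f = 64 in the text)


def tokenize(text):
    """Split text into ASCII alphanumeric runs and single non-ASCII characters.

    Assumed reading of the token definition; ASCII punctuation and
    whitespace act only as separators.
    """
    return re.findall(r"[A-Za-z0-9]+|[^\x00-\x7f]", text)


def hash64(token):
    """Hash a token to a 64-bit integer (the hash choice is an assumption)."""
    digest = hashlib.blake2b(token.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def simhash(text, weight=1):
    """Compute the F-bit simhash fingerprint of the given text."""
    v = [0] * F
    for token in tokenize(text):
        h = hash64(token)
        for i in range(F):
            # Bit i set: add the weight; bit i clear: subtract it.
            if (h >> i) & 1:
                v[i] += weight
            else:
                v[i] -= weight
    fp = 0
    for i in range(F):
        if v[i] > 0:  # only strictly positive components yield a 1 bit
            fp |= 1 << i
    return fp


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")
```

Two samples would then be treated as highly similar when `hamming_distance(simhash(x), simhash(y))` does not exceed the chosen threshold.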
3.3 Two-level Cache for Spam Reduction
It is essential to efficiently judge whether a newly arriving spam message is a duplicate of, or highly similar to, an existing one in a huge spam corpus. The work in [4] formulated this issue as a Hamming distance problem: identify whether an existing fingerprint in a collection of simhash fingerprints differs from the fingerprint of a given document in at most k bits. That work presented an efficient algorithm using multiple sorted tables of fingerprints from a static set of documents, but the algorithm is inapplicable to this work because spam messages keep arriving at a fast pace (e.g., in an open relay sinkhole), so the assumption of a static set of documents does not hold.
We design a two-level cache mechanism, including an L1 cache and an L2 cache, to address the above issue. The former can filter out duplicate or highly similar spam messages arriving in a burst, and the latter can employ three features from the spam content to identify more similar spam messages. The details are described as follows.
3.3.1 L1 cache
Figure 3 illustrates the data structure of the L1 cache. For each newly arriving spam message, we calculate the simhash fingerprint fp from the concatenation of the subject and the body in the L1 cache. Each fingerprint in the cache is divided into k + 1 segments. If an existing fingerprint in the cache differs from fp in at most k bits, then one of its k + 1 segments must be identical to that of fp, because at most k of the segments can contain a differing bit. We set k to 3, which is a good balance between precision and recall [4].
Figure 3: The L1 cache (not including the circular queues) for identifying similar simhash fingerprints.
Each fingerprint in the cache is duplicated in k + 1 hash tables, and its i-th segment is located in the prefix of the i-th hash table. The i-th segment of a new fingerprint fp will be looked up in the prefix of the i-th hash table, for i = 1 . . . k + 1. According to the above observation, if a fingerprint in the cache differs from fp in at most k bits, one of the lookups will result in a hit. That is, the i-th segment of fp is identical to that of an existing fingerprint, for some i in 1 . . . k + 1. fp will then be compared with the fingerprints in the hit entry to verify the similarity. If a similar fingerprint is found in the cache, the new spam message is considered a duplicate or highly similar, and will be dropped right away; otherwise, fp will be inserted into the cache, and the new spam message will be stored. Each hash table has 1,024 entries in memory for efficient queries. A circular queue is maintained for each entry in the hash tables to store the fingerprints with the same hash value, and the oldest fingerprint in a circular queue is overwritten by the latest one if the queue is full. Thus, only k + 1 lookups are required for each new fingerprint, and at most c verifications are required for each hit, where c is the length of a circular queue (c = 8 in this work).
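The L1 cache operation can be sketched as follows. This is a hedged illustration rather than the authors' implementation: it assumes the 64-bit fingerprint is split into four equal 16-bit segments, that a segment is mapped to one of the 1,024 entries by taking it modulo the table size, and that a bounded `collections.deque` stands in for the circular queue. The class and method names are hypothetical.

```python
from collections import deque

F, K = 64, 3
SEGMENTS = K + 1             # k + 1 segments and hash tables
SEG_BITS = F // SEGMENTS     # assumed equal-width segments (16 bits)
TABLE_SIZE = 1024            # entries per hash table
QUEUE_LEN = 8                # c, the length of each circular queue


class L1Cache:
    def __init__(self):
        # One table per segment; each entry is a bounded queue that
        # silently drops its oldest fingerprint when full.
        self.tables = [
            [deque(maxlen=QUEUE_LEN) for _ in range(TABLE_SIZE)]
            for _ in range(SEGMENTS)
        ]

    @staticmethod
    def _segment(fp, i):
        """Extract the i-th SEG_BITS-wide segment of a fingerprint."""
        return (fp >> (i * SEG_BITS)) & ((1 << SEG_BITS) - 1)

    def _slot(self, fp, i):
        # Assumed mapping from a segment value to a table entry.
        return self.tables[i][self._segment(fp, i) % TABLE_SIZE]

    def seen_similar(self, fp):
        """Return True if a cached fingerprint is within K bits of fp.

        On a miss, fp is inserted into all SEGMENTS tables and False
        is returned, meaning the message should be stored.
        """
        for i in range(SEGMENTS):
            for old in self._slot(fp, i):
                if bin(old ^ fp).count("1") <= K:
                    return True
        for i in range(SEGMENTS):
            self._slot(fp, i).append(fp)
        return False
```

Because each lookup touches one entry per table and each entry holds at most eight fingerprints, a query costs at most (k + 1) · c = 32 Hamming-distance checks regardless of how many messages have been seen.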
3.3.2 L2 cache
We deliberately restrict the size of the L1 cache and calculate a single fingerprint from the concatenation of the subject and the body, but this design trades accuracy for efficiency. First, a new fingerprint may be similar to one that was once in the L1 cache but has expired. Second, spammers often obfuscate spam messages to increase the disparity of the messages. Thus, we parse the spam messages that remain after the L1 cache operation to extract three features from each of them, i.e., mail subject, mail body, and URLs, and analyze the features for further filtering. The last feature is skipped if no URLs are in the spam message. The features are described as follows.
1. mail subject and mail body: Both may be encoded using the scheme proposed in RFC 2047 (www.ietf.org/rfc/rfc2047.txt). For example, the mail subject may be represented in a format like =?BIG-5?B?Encoded-text?=. We decode the encoded section and normalize it with the UTF-8 character set before fingerprint calculation to ensure the fingerprints will be consistent across different encodings.
2. URLs: We search for URLs in the mail body, but skip known clean URLs such as www.w3.org and schemas.microsoft.com, which are included in the spam messages because spammers compose the spam content based on the schemas defined by organizations such as W3C (see www.w3schools.com/tags/tag_doctype.asp). If multiple URLs are present, we choose only the first as the representative URL for generating the fingerprints.
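The feature extraction above can be sketched with the Python standard library. This is an illustrative sketch under assumptions: the URL regular expression, the host-based check against the clean list, and the choice to skip clean URLs before picking the first one are not specified in the text, and the function names are hypothetical. `email.header.decode_header` handles RFC 2047 encoded words.

```python
import re
from email.header import decode_header

# Clean hosts named in the text; the list would be longer in practice.
CLEAN_HOSTS = ("www.w3.org", "schemas.microsoft.com")
# Assumed URL pattern: http(s) URLs up to whitespace, a quote or an angle bracket.
URL_RE = re.compile(r"https?://[^\s\"'<>]+", re.IGNORECASE)


def decode_subject(raw):
    """Decode RFC 2047 encoded words into one Unicode string."""
    parts = []
    for chunk, charset in decode_header(raw):
        if isinstance(chunk, bytes):
            # An unknown charset name would raise LookupError here.
            parts.append(chunk.decode(charset or "ascii", errors="replace"))
        else:
            parts.append(chunk)
    return "".join(parts)


def representative_url(body):
    """Return the first URL whose host is not on the clean list, or None."""
    for match in URL_RE.finditer(body):
        url = match.group(0)
        host = url.split("://", 1)[1].split("/", 1)[0].lower()
        if host not in CLEAN_HOSTS:
            return url
    return None
```

The decoded subject, the decoded body, and the representative URL (when present) would then each be fingerprinted separately.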
The similarity comparison in the L2 cache is performed offline on the spam samples regularly (e.g., once per day) to further reduce the volume of spam samples. The following two methods are considered for implementing the L2 cache, and their efficiency will be evaluated in Section 4.2.
• Method 1: The fingerprints are calculated from the aforementioned three features separately, and stored in three separate caches. If no URLs are found in the spam message, only two caches (for the mail subject and the body) are queried. The hash tables in each cache have the same number of entries as those in the L1 cache, and the operation is like that of the L1 cache to identify an existing fingerprint that differs in at most k bits (k = 0 for the fingerprints of URLs, and k = 3 otherwise) from the queried fingerprint for a specific feature. If more than half of the queries result in a hit, the spam message will be discarded.
• Method 2: Like Method 1, we build three caches to store the fingerprints of the three features, but the fingerprints will not expire in the caches. Considering the large number of fingerprints due to the volume of spam messages, we simplify the caches to save memory space by keeping the fingerprints in one hash table per cache, rather than in a complicated data structure like that in the L1 cache. The hash table uses linked lists to handle hash collisions, and can be dynamically expanded to accommodate more fingerprints and reduce the chances of hash collisions. Thus, a fingerprint has to be identical to an existing one in the hash table to encounter a hit. A spam message is discarded if more than half of the total queries result in a hit in the hash tables.
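The majority rule of Method 2 can be sketched as follows. This is a simplified illustration, not the evaluated implementation: Python's built-in `set` is used as the dynamically growing hash table (it resolves collisions by open addressing rather than linked lists), and it is assumed that fingerprints are inserted only for messages that are kept, which the text does not specify. The class and method names are hypothetical.

```python
class L2ExactCache:
    """One exact-match fingerprint table per feature; entries never expire."""

    def __init__(self):
        self.seen = {"subject": set(), "body": set(), "url": set()}

    def is_duplicate(self, fingerprints):
        """Decide whether a message should be discarded.

        `fingerprints` maps a feature name to its fingerprint; the "url"
        key is simply omitted when the message contains no URL.
        """
        hits = sum(1 for feat, fp in fingerprints.items() if fp in self.seen[feat])
        # Discard only if strictly more than half of the queries hit.
        duplicate = hits > len(fingerprints) / 2
        if not duplicate:
            # Assumption: only fingerprints of kept messages are recorded.
            for feat, fp in fingerprints.items():
                self.seen[feat].add(fp)
        return duplicate
```

With three features, two hits are enough to discard a message; with only the subject and the body available, both must hit.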
4 Evaluation
In this section, we first compare the volume and the variety of collected spam samples in the three collection methods, and then study the efficiency of the two-level cache mechanism.
4.1 Comparison of Collection Methods
We deployed the three methods of spam collection described in Section 3.1. The periods of the three collections were different because of the different degrees of complexity in deploying these methods (e.g., requesting permission from the authorities, purchasing equipment, configuration, and implementation). According to Table 1, the average numbers of collected spam messages per day in the three methods are 9, 379, and 1,388,738, respectively. Thus, setting up an open relay sinkhole can collect the largest number of spam messages in a short time, while the other two methods need distributed deployment on a large scale to collect a large volume of spam samples efficiently.
The pairwise Hamming distances between the fingerprints of spam samples in each collection are calculated to evaluate the variety of the spam samples collected in the three methods. The cumulative distribution function (CDF) of the pairwise Hamming distances is presented in Figure 4. For simplicity, the sets of spam samples collected with trap email addresses, the spam-filtering mail gateway, and the open relay sinkhole are denoted as Collection A, Collection B, and Collection C, respectively. Because of the huge number of samples in Collection C, we randomly selected around 200 thousand spam samples over the period of the collection to save analysis time. According to Figure 4, nearly 75% of the pairs of fingerprints differ in at most 33 bits in Collections A and B, while the difference is at most 15 bits in Collection C, meaning the samples are more similar to each other in Collection C. Thus, the variety of the samples in Collection C is the lowest among the three methods.
Figure 4: The CDF of the pairwise Hamming distances between the fingerprints of the spam samples.
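The CDF computation behind Figure 4 can be sketched as follows. This is a minimal sketch with a hypothetical function name; it enumerates all pairs directly, so its cost grows quadratically with the number of fingerprints, which is consistent with the need to sample Collection C.

```python
from itertools import combinations


def pairwise_hamming_cdf(fingerprints, bits=64):
    """Return cdf[d], the fraction of pairs whose Hamming distance is <= d."""
    counts = [0] * (bits + 1)
    for a, b in combinations(fingerprints, 2):
        counts[bin(a ^ b).count("1")] += 1
    total = sum(counts)
    cdf, running = [], 0
    for c in counts:
        running += c
        cdf.append(running / total if total else 0.0)
    return cdf


if __name__ == "__main__":
    # Toy fingerprints; real input would be the 64-bit simhash values.
    cdf = pairwise_hamming_cdf([0b0000, 0b0001, 0b0011])
    print(cdf[:3])
```

Reading the distance at which the returned list first reaches 0.75 gives the kind of figure quoted above for each collection.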
4.2 Analysis of Cache Efficiency
We select the 7,066,226 spam samples collected in the first week from Collection C as the input dataset for evaluating the efficiency of the L1 cache. Note that we do not apply the L1 cache to Collection A and Collection B because the spam messages in the two collections do not arrive in bursts, so the mechanism would be less effective. The result indicates that 6,824,156 spam messages, which amount to 96.57% of the evaluated samples, were found similar and dropped by the L1 cache.
The spam samples left after the L1 cache, as well as those in Collection A and Collection B, were read one by one for evaluating the efficiency of the L2 cache. According to Table 2, Method 1 and Method 2 of the L2 cache can filter out more than 85% of the samples in Collection A and Collection B, and more than 60% of the evaluated samples in Collection C. The results mean that separating the features for fingerprint calculation can effectively reduce more similar yet obfuscated spam samples that cannot be identified by the L1 cache. The two caches together can reduce 98.66% of the spam samples in the evaluated dataset from Collection C (96.57% by the L1 cache, and 60.91% of the remainder by Method 1 of the L2 cache).
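The combined reduction rate can be rechecked from the reported counts: the 7,066,226 evaluated samples, the 6,824,156 samples dropped by the L1 cache, and the 147,445 Method 1 hits for Collection C in Table 2. The short sketch below only redoes that arithmetic; the function name is hypothetical.

```python
def reduction_rates(total, l1_dropped, l2_hits):
    """Return (L1 rate, L2 rate on the remainder, overall rate) as fractions."""
    after_l1 = total - l1_dropped      # samples passed on to the L2 cache
    after_l2 = after_l1 - l2_hits      # samples finally kept
    return l1_dropped / total, l2_hits / after_l1, 1 - after_l2 / total


if __name__ == "__main__":
    l1, l2, overall = reduction_rates(7_066_226, 6_824_156, 147_445)
    print(f"L1: {l1:.2%}, L2 (Method 1): {l2:.2%}, overall: {overall:.2%}")
```

The overall rate is not the sum of the two rates: the L2 rate applies only to the 242,070 samples that survive the L1 cache, so the overall rate equals 1 - (1 - L1 rate)(1 - L2 rate).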
We also use the aforementioned samples from Collection C for evaluating the efficiency of the cache mechanism in Method 1 with different hash table sizes (i.e., the number of entries in the hash tables), besides the default size of 1,024 entries mentioned in Section 3.3. Method 2 is not involved in the evaluation because the hash tables in this method are dynamically expanded in its operation. Table 3 summarizes the numbers of hits with different hash table sizes in Method 1. The results indicate that a larger hash table size can help to detect more duplicate or highly similar samples in the L2 cache.
Table 3: The numbers of hits with different hash table sizes in Method 1.
Hash table size    Number of hits
1,024              147,445 (60.91%)
2,048              160,432 (66.28%)
4,096              171,311 (70.77%)
8,192              174,450 (72.07%)
5 Conclusion and Future Work
We evaluate three common methods of collecting spam samples, and present a novel two-level cache mechanism to efficiently reduce duplicate or highly similar spam samples in this work. We find that an open relay sinkhole can collect the largest number of spam samples (up to nearly 57 million samples over a period of six weeks) among the three methods, but the variety of its samples is also the lowest. The two-level cache mechanism can remove nearly 99% of the spam samples sent to the open relay sinkhole as duplicate or highly similar, and more than 85% of those in the other two collection methods. The reduction can greatly save the storage space and the volume of spam samples for further analysis. This work will be useful to those who want to collect a large corpus of spam samples for various kinds of analysis and filtering. As future work, our next step is to deploy the collection methods in more than one location, and analyze the variety and types of spam samples collected from the different locations.

Table 1: Spam message count in the three methods of spam collection.
Method                         Spam message count    Period
Trap email addresses           3,278                 2012/04/01 - 2013/03/31
Spam-filtering mail gateway    138,666               2012/01/01 - 2012/12/31
Open relay sinkhole            56,938,282            2012/07/14 - 2012/08/24

Table 2: The numbers of hits for the three collections in the L2 cache.
Event                        Count in Collection A    Count in Collection B    Count in Collection C
Number of spam samples       3,278                    138,666                  242,070
Number of hits [Method 1]    2,856 (87.13%)           120,213 (86.65%)         147,445 (60.91%)
Number of hits [Method 2]    2,837 (86.55%)           118,817 (85.69%)         159,393 (65.85%)
References
[1] G. V. Cormack and T. R. Lynam. On-line Supervised Spam Filter Evaluation. ACM Transactions on Information Systems, 25(3), pp. 1-31, July 2007.
[2] L. Zhang, J. Zhu and T. Yao. An Evaluation of Statistical Spam Filtering Techniques. ACM Transactions on Asian Language Information Processing, 3(4), pp. 243-269, Dec. 2004.
[3] A. Pathak, F. Qian, Y. C. Hu, Z. M. Mao and S. Ranjan. Botnet Spam Campaigns Can Be Long Lasting: Evidence, Implications, and Analysis. In Proceedings of the 11th International Joint Conference on Measurement and Modeling of Computer Systems, Aug. 2009.
[4] G. S. Manku, A. Jain and A. D. Sarma. Detecting Near-Duplicates for Web Crawling. In Proceedings of International World Wide Web (WWW) Conference, May 2007.
[5] A. Broder, S. Glassman, M. Manasse and G. Zweig. Syntactic Clustering of the Web. In Proceedings of International World Wide Web (WWW) Conference, Apr. 1997.
[6] M. S. Charikar. Similarity Estimation Techniques from Rounding Algorithms. In Proceedings of 34th Annual ACM Symposium on Theory of Computing, May 2002.
[7] M. Henzinger. Finding Near-Duplicate Web Pages: a Large-Scale Evaluation of Algorithms. In Proceedings of Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Aug. 2006.
[8] A. Kolcz, A. Chowdhury and J. Alspector. Improved Robustness of Signature-based Near-Replica Detection via Lexicon Randomization. In Proceedings of ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Aug. 2004.
[9] Pin-Ren Chiou, Po-Ching Lin and Chung-Ta Li. Blocking Spam Sessions with Greylisting and Block Listing based on Client Behavior. In Proceedings of International Conference on Advanced Communication Technology (ICACT), Jan. 2013.
[10] A. Pathak, Y. C. Hu and Z. M. Mao. Peeking into Spammer Behavior from a Unique Vantage Point. In Proceedings of USENIX LEET, 2008.
[11] J. P. John, A. Moshchuk, S. D. Gribble and A. Krishnamurthy. Studying Spamming Botnets Using Botlab. In Proceedings of the 6th USENIX Symposium on Networked Systems Design and Implementation (NSDI), pp. 291-306, Apr. 2009.
[12] A. Ramachandran, N. Feamster and S. Vempala. Filtering spam with behavioral blacklisting. In Proceedings of the 14th ACM conference on Computer and Communications Security (CCS), pp. 342-351, Oct. 2007.
[13] H. Gao, J. Hu, C. Wilson, Z. Li, Y. Chen and B. Y. Zhao. Detecting and characterizing social spam campaigns. In Proceedings of Internet Measurement Conference (IMC), pp. 35-47, 2010.