Spam Fighting Techniques
Master's Thesis
by
Martin Gräßlin
Ruprecht-Karls-Universität Heidelberg
Supervisors: Prof. Dr. Gerhard Reinelt
Prof. Dr. Felix Freiling
I declare that I have written this master's thesis independently, that I have used only the cited sources and aids, and that I have observed the principles and recommendations "Verantwortung in der Wissenschaft" (Responsibility in Science) of the Universität Heidelberg.
One of the biggest challenges in global communication is to overcome the problem of unwanted emails, commonly referred to as spam. In recent years many approaches to reduce the number of spam emails have been proposed. Most of them have in common that the end-user is still required to verify the filtering results. These approaches are reactive: before mails can be reliably classified as spam, a set of similar mails has to be received.
Spam fighting has to become proactive. Unwanted mails have to be blocked before they are delivered to the end-user's mailbox. In this thesis the implementation of two proactive spam fighting techniques is discussed.
The first concept, called Mail-Shake, introduces an authentication step before a sender is allowed to send emails to a new contact. Computers are unable to authenticate themselves and so all spam messages are automatically blocked. The development of this concept is discussed in this thesis.
The second concept, called Spam Templates, is motivated by the fact that spam messages are generated from a common template. If we gain access to the template we are able to identify spam messages by matching the message against the template. As the template is generated from currently sent spam messages, the template will never match a legitimate mail. In this thesis matching a mail against a template is implemented.
Acknowledgement
First of all I want to thank Professor Gerhard Reinelt and Professor Felix Freiling for making it possible for me to write this thesis at the Laboratory for Dependable Distributed Systems at the University of Mannheim.
I also want to thank my supervisors Jan Göbel and Philipp Trinius. Their suggestions and feedback are very much appreciated and helped to develop the system presented in this thesis.
A special thanks to all my friends and my family for testing the system and providing valuable feedback on its usability. I especially want to thank Arthur Arlt who was always willing to discuss details about the implementation and this document.
I want to thank the KDE community and Qt Development Frameworks for providing such a great and coherent development framework. The KDE community has helped me improve my C++ coding skills during the last years. This was useful during the implementation as many problems were already known and could be solved easily.
In general I want to thank the complete Free and Open Source community. Without their ideas of free software it would not have been possible to realize such a project. The complete project including this document has been implemented and written with the help of Free or Open Source software.
Contents
1 Introduction
1.1 Motivation
1.2 Proactive Spam Fighting Techniques
1.3 Notes About the Implementation
1.4 Structure of This Thesis
2 Proactive Spam Fighting
2.1 Related Work
2.1.1 Bayesian Filtering
2.1.2 DNS Blacklists
2.1.3 URI Blacklist
2.1.4 Greylisting
2.1.5 Conclusion
2.2 The Mail-Shake Concept
2.2.1 Proactive Spam Fighting With Dynamic Whitelists
2.2.2 Limitations of the Mail-Shake Concept
2.2.3 Summary
2.3 The Spam Templates Concept
2.3.1 Template Based Spam Mails
2.3.2 Generation of Templates
2.3.3 Proactive Filtering
2.3.4 Summary
3 Background
3.1 Evaluation of Current CAPTCHA Techniques
3.1.1 Introduction
3.1.2 Simple Obfuscation
3.1.3 Image Based CAPTCHAs
3.1.4 Audio Based CAPTCHAs
3.1.5 Image Recognition CAPTCHAs
3.1.6 Riddle
3.1.7 reCAPTCHA
3.1.8 Conclusion
3.2 Excursus: Breaking a CAPTCHA System
3.2.1 The Scr.im CAPTCHA System
3.2.2 Flaws in the Design of the Scr.im CAPTCHA System
3.2.3 Attack on the CAPTCHA System
3.3 Akonadi
3.3.1 Client Plugins Compared to Central Storage
3.3.2 Akonadi as the Central Storage Solution
3.3.3 Design of Akonadi
3.3.4 Summary
4 Development of the Systems
4.1 Software Requirements for Mail-Shake
4.1.1 Answering Spam Messages
4.1.2 Delivery Status Notifications
4.1.3 Public Mail Address
4.1.4 Sending Mails
4.1.5 Private Mail Address
4.1.6 Summary
4.2 Design of Mail-Shake
4.2.1 Client Independent Library
4.2.2 Akonadi Agent
4.2.3 Client Integration
4.2.4 Summary
4.3 Implementation of Mail-Shake
4.3.1 Mail-Shake Library
4.3.2 Mail-Shake Akonadi Agent
4.3.3 Mail-Shake Integration in Email Clients
4.4 Implementation of Spam Templates
4.4.1 Generating the RSS Feed
4.4.2 Testing a Mail
4.4.3 Summary
5 Evaluation
5.1 Mail-Shake Evaluation Setup
5.2 Results of Mail-Shake Evaluation
5.3 Greylisting
5.4 Results from January
5.5 Results from February
5.6 Summary
6 Retrospection and Future Tasks
6.1 Problems caused by Akonadi
6.2 Future tasks for Spam Templates
6.3 Future Tasks for Mail-Shake
6.3.1 Handling of Delivery Status Notifications
6.3.2 Mail-Shake for Several Addresses
6.3.3 Solving Mail-Shake Challenges in Email Clients
6.3.4 Integrating Mail-Shake Directly Into Email Clients
7 Conclusion
A Examples of Delivery Status Notifications
A.1 RFC Compliant
A.2 Exim
A.3 QMail
A.3.1 MIME Mail
A.3.2 Plain Text Mail
A.4 Google Mail
B Mails from Automated Systems
B.1 Review Board
B.2 Bugzilla
C Mail-Shake API Documentation
C.1 MailShake Namespace Reference
C.1.1 Detailed Description
C.1.2 Typedef Documentation
C.1.3 Enumeration Type Documentation
C.2 MailShake::DSN Class Reference
C.2.1 Detailed Description
C.2.2 Member Function Documentation
C.3 MailShake::DSNPrivate Class Reference
C.4 MailShake::EMail Class Reference
C.4.1 Detailed Description
C.4.2 Member Function Documentation
C.5 MailShake::EMailPrivate Class Reference
C.6 MailShake::Id Class Reference
C.6.1 Detailed Description
C.6.2 Member Function Documentation
C.7 MailShake::IdPrivate Class Reference
C.8 MailShake::MailShake Class Reference
C.8.1 Detailed Description
C.8.2 Member Function Documentation
C.9 MailShake::MailShakePrivate Class Reference
C.9.1 Member Function Documentation
C.9.2 Member Data Documentation
C.10 MailShake::WhiteListEntry Class Reference
C.10.1 Detailed Description
C.10.2 Member Function Documentation
C.11 MailShake::WhiteListEntryPrivate Class Reference
D Mailman Archive Address Harvester
D.1 main.cpp
D.2 mailmanharvester.h
D.4 mailmanharvesterview.h
D.5 mailmanharvesterview.cpp
D.6 mailmanharvesterviewbase.ui
E Automated Scr.im CAPTCHA Solver
E.1 main.cpp
E.2 ScrimCracker.h
E.3 ScrimCracker.cpp
E.4 CMakeLists.txt
F Dialog to Solve a Mail-Shake Challenge
F.1 mailshakedialog.h
F.2 mailshakedialog.cpp
G RSS Generator
G.1 main.cpp
G.2 rssgenerator.h
G.3 rssgenerator.cpp
G.4 CMakeLists.txt
H Spam Templates Library
H.1 template.h
H.2 template.cpp
H.3 templatemanager.h
H.4 templatemanager.cpp
H.5 mail.h
List of Figures
2.1 Overview of the Mail-Shake email process
2.2 Example of a Mail-Shake challenge mail
2.3 Leakage of private Mail-Shake address
2.4 Mail-Shake authentication initiated on private address
2.5 Mail loop triggered by a spam mail with an invalid sender address
2.6 Web Service as relay of a mail
2.7 Template based spamming
2.8 Example of a generated Spam template
3.1 Example of a reCAPTCHA
3.2 CAPTCHA containing email address [email protected]
3.3 Example of an Asirra CAPTCHA
3.4 The words to be used for reCAPTCHA
3.5 The scr.im CAPTCHA system
3.6 Different images for the same scr.im CAPTCHA
3.7 Comparison of original CAPTCHA image and the result of the pixel shader
3.8 Two different applications to handle public and private addresses
3.9 Basic aspects of the Akonadi architecture
3.10 Components of Akonadi
4.1 Abusing Mail-Shake to send spam
4.2 Activity diagram for processing mails sent to the public address
4.3 Activity diagram for sending mails
4.4 Activity diagram for receiving mails on private address
4.5 Classes EMail and DSN of the Mail-Shake library
4.6 Class WhiteListEntry of the Mail-Shake library
4.7 High Level Class Diagram of the Mail-Shake library
4.8 High Level Class Diagram of Mail-Shake's client side implementation
4.9 Communication between Akonadi server, agent and Mail-Shake library
4.10 Class diagram for Mail-Shake email client integration
4.11 Classes EMail and DSN split in interface and implementation classes
4.12 Template of a Mail-Shake challenge
4.13 Dialogs to configure the whitelist
4.14 Notification upon receipt of not whitelisted mail
4.15 Mailody Message View with Mail-Shake challenge mail integration
4.16 Dialogs to solve the Mail-Shake challenge
4.17 Configuration for determining the matching score
5.2 Rejected, bounced and junk mails in January 2010 on the evaluated MTA
6.1 Mail-Shake Agents in the systray
List of Tables
4.1 Examples for subjects containing a Mail-Shake id
4.2 Size of Mail-Shake measured in Source Lines of Code
4.3 Database structure of Mail-Shake agent
4.4 Mail headers used in Mail-Shake challenge and notification mails
4.5 Changed files for Mail-Shake challenge integration in Mailody
4.6 Command line options for the RSS generation tool
5.1 Private and public addresses used during the Mail-Shake evaluation
5.2 Number of mails filtered by Mail-Shake in January 2010 for the different addresses
5.3 Statistics for filtered mail per address in January
List of Listings
3.1 Pixel shader for extracting characters from the scr.im CAPTCHAs
4.1 Matching a string against a whitelist entry
4.2 Trivial algorithm to check if a mail is whitelisted
4.3 Comparing the whitelist entries to a given datum
4.4 Improved algorithm to test if a mail is whitelisted based on a smarter data structure
4.5 Handling the receipt of a mail sent to the public address
4.6 Generating a new unique identifier
4.7 Extracting the Mail-Shake Id from a mail subject
4.8 Checking if received private mail is whitelisted or a DSN
4.9 Checking if mail contains a challenge response Id or is on temporary whitelist
4.10 Move an entry from temporary to permanent whitelist or create a new one
4.11 Adding a whitelist entry for each recipient of a sent mail
4.12 Connecting a slot to the signal with the boost library
4.13 Slot for removing one Id from the storage
4.14 Connecting Signals and Slots with Qt
4.15 Fetching a mail sent to the public address
4.16 Extracting headers from a KMime message
4.17 Extracting the Mail-Shake headers in Mailody
4.18 Displaying Mail-Shake challenge information in Mailody's header widget
4.19 Intercepting a click on a link in order to open the Mail-Shake challenge dialog
4.20 Extracting CAPTCHA from the reCAPTCHA web page
4.21 Testing if the web page contains the revealed mail address
4.22 Generating an RSS item from one template file
4.23 Generated RSS feed containing one template
4.24 Algorithm for matching a mail body against a template
1 Introduction
1.1 Motivation
Unsolicited bulk emails, or in general spam or junk mails, have become one of the greatest challenges of current global communication. About 80 percent of the world's email communication is not legitimate[10]. This includes not only spam mails but also malicious software and phishing mails. These mails cause a global economic loss of EUR 36 billion each year plus EUR 10 billion lost due to fraud[3]. 33 billion kWh are required to process the 62 trillion spam mails each year and 104 billion user hours are required to check and delete these junk mails[42].
Unfortunately sending spam messages is a profitable business: in the year 2002 a study showed that out of 3.5 million sent messages 81 sales were generated in the first week of the campaign, resulting in an income of USD 1,500[48]. These numbers can be confirmed with more recent information disclosed by a former spammer: sending 40 million mails can yield a weekly income of USD 37,440[58].
Spam is also one of the reasons why there is malicious software at all. Next to Distributed Denial of Service (DDoS) attacks, botnets are used to send spam mails[50]. About 10 million zombie computers organized in botnets are actively sending out spam and email-based malicious software. As the zombies are added and removed dynamically to prevent static blacklist solutions from blocking the zombies[9], it can be assumed that there are many more computers controlled by botnets. A single zombie of a Storm botnet sends on average 1.04 spam mails per second, and up to 136,000 mails per day[15].
Current spam fighting techniques like Bayesian filters or Uniform Resource Identifier Blacklists (URIBL), which are discussed in Chapter 2.1, are commonly reactive. They require a large set of received spam messages to extract features such as URIs referenced in a message. With the help of the extracted features the algorithms can distinguish spam from ham messages (valid messages). But this reactive approach has disadvantages because it must first receive the spam messages. As long as new features are not extracted, the techniques cannot identify messages as junk. This is an annoyance for users as the techniques produce false negative results and the users have to delete the unrecognized spam manually. Spam fighting has to become proactive: it has to prevent spam messages from being delivered to the end-users' mailboxes at all, or at least provide spam recognition solutions which are able to remove messages based on a new spam pattern as soon as that pattern is used for the first time.
1.2 Proactive Spam Fighting Techniques
In this thesis the implementation of two proactive spam fighting techniques is discussed. These techniques aim to prevent spam mails from being delivered to users at all and to recognize new spam faster and more reliably.
The first technique, called Mail-Shake, is a concept which prevents spam or at least makes it more difficult for spammers to send spam. To this end, each sender has to authenticate once as a human. Mails sent from unauthenticated senders are dropped automatically, so spammers are unable to deliver their junk. This concept is discussed in more detail in Chapter 2.2.
The second technique helps to identify received spam mails in a more reliable way. By intercepting mails sent by a bot, generic templates are generated and used to identify spam mails even if other techniques are not yet able to recognize the email as spam. The construction and usage of Spam Templates is discussed in more detail in Chapter 2.3.
The hope is that these techniques help reduce the number of spam mails received by users and the time which is required to check for and sort false positives and negatives. The Mail-Shake concept is immune against false positives as only mails sent by computers are classified as spam mails. Spam Templates, on the other hand, will not mark mails sent by humans as spam because the template is constructed in a way to only match mails sent by a bot.
1.3 Notes About the Implementation
The two techniques, Mail-Shake and Spam Templates, are developed independently but use the same libraries and technologies. Both applications are built upon the Personal Information Management (PIM) framework developed and used by the KDE community. This framework, called Akonadi, is completely client and platform independent; the currently supported platforms are Linux (and other Unixes), Microsoft Windows and Mac OS X. As the underlying KDE and Qt libraries are being ported to more platforms such as smart phones, Akonadi will probably become available on those as well.
Although Akonadi has been developed for use in KDE's PIM suite "Kontact", it was designed with client independence in mind. So there are already different KDE applications available which use Akonadi, as well as some prototype applications developed in different programming languages and with different GUI libraries. The Akonadi framework is discussed in Chapter 3.3.
The combination of platform and client independence has the advantage that the applications developed in the scope of this thesis can be used with different email clients. Nevertheless the applications are developed in such a way that their code can easily be reused by other projects to provide a more native integration. For this purpose an abstraction layer is implemented and used.
1.4 Structure of This Thesis
In the current Chapter a short introduction and motivation for implementing proactive spam fighting techniques was presented. The applications whose implementation is discussed in this thesis were named and a short introduction to the framework used to develop the applications was provided.
The following Chapter 2 discusses the proactive spam fighting techniques. First of all related work, in this case other existing but reactive spam fighting techniques, is presented. This motivates the discussion of the two techniques: Mail-Shake and Spam Templates.
Before the implementation can be discussed, an overview of the background of the system is provided in Chapter 3. This includes an evaluation of current CAPTCHA techniques in Chapter 3.1, which is required for implementing Mail-Shake. In Chapter 3.2 an example of an automated solution to break a CAPTCHA system is presented as an excursus, which motivates the chosen solution not to implement its own CAPTCHA, but to rely on existing and tested functionality. Last but not least a closer look at the KDE personal information management framework Akonadi in Chapter 3.3 completes the chapter on the background of the system.
The discussion of the development of the system is encapsulated in Chapter 4. First of all the software requirements (Chapter 4.1) are presented, followed by the design (Chapter 4.2), the actual implementation of Mail-Shake in Chapter 4.3 and of Spam Templates in Chapter 4.4.
The following Chapter 5 evaluates the results. This shows whether the concepts presented in this thesis are able to reduce the number of received spam mails and whether the concepts are usable at all.
The implementation allows easy reuse in different client implementations. Some possibilities for future work and a retrospection are presented in Chapter 6.
2 Proactive Spam Fighting
In this Chapter the two concepts Mail-Shake and Spam Templates are discussed. Both concepts are proactive spam fighting techniques and are able to eliminate spam messages before they are shown to the end-user. This is an important difference from the existing, reactive techniques, some of which are also presented in this Chapter.
2.1 Related Work
In this Section other existing spam fighting techniques are presented. Most of those techniques are reactive and share the disadvantages of reactive approaches. A brief overview of technologies like Bayesian filtering, blacklists and greylisting is provided and their advantages and disadvantages are discussed.
2.1.1 Bayesian Filtering
The most common spam fighting techniques are the Bayesian and rule-based filtering systems as used for example by SpamAssassin. These are examples of reactive spam fighting solutions: a large repository of both spam and ham messages is required to extract features from all messages. These extracted features can be used to distinguish spam from ham messages via a Bayesian model[55].
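How extracted features can feed a Bayesian model may be illustrated with a minimal sketch. It assumes that whitespace-separated words are the only features and that add-one smoothing is used; the class NaiveBayesSketch and its methods are hypothetical names chosen for this illustration and do not correspond to the code of any real filter.

```cpp
#include <cmath>
#include <map>
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical toy classifier: whitespace-separated words are the features.
class NaiveBayesSketch {
public:
    // Learn from one message that is already known to be spam or ham.
    void train(const std::string& text, bool isSpam) {
        Counts& counts = isSpam ? spam_ : ham_;
        for (const std::string& word : tokenize(text)) {
            ++counts.perWord[word];
            ++counts.totalWords;
            vocabulary_.insert(word);
        }
        ++counts.messages;
    }

    // Log-odds of spam versus ham; a positive value means "more likely spam".
    double spamLogOdds(const std::string& text) const {
        // Vocabulary size plus one slot for words never seen in training.
        const double v = vocabulary_.size() + 1.0;
        // Class prior with add-one smoothing.
        double logOdds = std::log((spam_.messages + 1.0) / (ham_.messages + 1.0));
        for (const std::string& word : tokenize(text)) {
            const double pSpam = (spam_.count(word) + 1.0) / (spam_.totalWords + v);
            const double pHam = (ham_.count(word) + 1.0) / (ham_.totalWords + v);
            logOdds += std::log(pSpam / pHam);
        }
        return logOdds;
    }

private:
    struct Counts {
        std::map<std::string, int> perWord;
        int totalWords = 0;
        int messages = 0;
        int count(const std::string& word) const {
            const auto it = perWord.find(word);
            return it == perWord.end() ? 0 : it->second;
        }
    };

    static std::vector<std::string> tokenize(const std::string& text) {
        std::istringstream in(text);
        std::vector<std::string> words;
        for (std::string w; in >> w;) words.push_back(w);
        return words;
    }

    Counts spam_, ham_;
    std::set<std::string> vocabulary_;
};
```

The sketch also shows the reactive nature of the approach: a word that has never been seen in a training message contributes nothing to the decision.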
Rule-based filtering systems are reactive as well. To construct a rule it is first necessary to look at the spam messages. Using rules for spam filtering is rather limited as the logical rule set makes binary decisions whether to classify a given mail as spam[55]. This can easily result in false positives, as seen in January 2010 when dates formerly considered grossly in the future became the present for SpamAssassin[29]. The misbehaving rule tests for dates in the year 2010 or later and each matching message receives an additional score between 2.075 and 3.554. As SpamAssassin classifies a message as spam at a score of 5.0, this rule causes many false positives.
These limitations of rule-based filtering systems can be circumvented by feature extraction and the use of Bayesian filtering systems. Nevertheless a Bayesian system is not a perfect solution either. For example it can only extract features from text messages and is unable to filter image based spam. The number of image based spam mails increased significantly in 2006[8] and the images are distorted by applying techniques used for CAPTCHAs, so that computers are unable to restore the original image[67].
2.1.2 DNS Blacklists
One of the most common techniques to block spam mails directly on the mail server is the use of a DNS blacklist (DNSBL). The name refers to the fact that the blacklist is queried with the help of the Domain Name System (DNS). To test if a given IP address a.b.c.d is listed in a certain blacklist, the mail server just has to query for the A record for the address d.c.b.a.blacklist-name[30]. If the query is successful the mail should be rejected as the sender's IP address is known to send spam.
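The construction of the query name can be sketched as follows. The function name dnsblQueryName and the zone dnsbl.example.org are assumptions made for this illustration; the DNS lookup itself is left out, as it depends on the resolver API of the platform.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Builds the DNS name to query for an IPv4 address a.b.c.d in a given
// blacklist zone: d.c.b.a.<zone>. Returns an empty string if the input is
// not a dotted quad of four numbers in the range 0-255.
std::string dnsblQueryName(const std::string& ipv4, const std::string& zone) {
    if (ipv4.empty() || ipv4.back() == '.') return "";
    std::vector<std::string> octets;
    std::istringstream in(ipv4);
    for (std::string part; std::getline(in, part, '.');) {
        if (part.empty() || part.size() > 3) return "";
        for (char c : part) {
            if (c < '0' || c > '9') return "";
        }
        if (std::stoi(part) > 255) return "";
        octets.push_back(part);
    }
    if (octets.size() != 4) return "";

    // Reverse the order of the octets and append the blacklist zone.
    std::string name;
    for (auto it = octets.rbegin(); it != octets.rend(); ++it) {
        name += *it + ".";
    }
    return name + zone;
}
```

A mail server would then ask its resolver for the A record of the returned name and treat a successful answer as "listed".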
DNSBLs are of course a reactive spam fighting approach. A given IP address has to be verified to be used for spamming. The important question is whether the IP addresses of bots get listed while the bot is actively sending out spam messages. A study from 2005 shows that DNSBLs are not capable of blocking spam sent by botnets. Out of 4,295 IP addresses which were known to be part of the Bobax botnet, only 225 were blacklisted in the DNSBL provided by Spamhaus[52].
On the other hand a blacklist might easily block legitimate senders. For example if the IP address of a bot is assigned dynamically by its Internet Service Provider (ISP), the ISP might have assigned the same IP address to a different customer by the time the DNSBL includes this address. So the actual bot is not blocked, but a legitimate user is. An empirical study showed that 80 % of the IP addresses of possible spammers in February 2004 were still listed in at least one of seven popular DNSBLs two months later. Some of the IP addresses were already present in the DNSBLs in the year 2000[30].
The fact that a DNSBL can block any domain from sending mails is also a great disadvantage as the DNSBL can be abused. In 2007 the popular Spamhaus project demanded that the Austrian Network Information Center "nic.at" take down addresses used for phishing. As the registrar did not react, Spamhaus started to "blackmail" the registrar by listing its domain, so that nic.at could not send mails anymore[47, 51].
DNSBLs no longer seem to be an appropriate method for spam fighting. The reactive approach is unable to cope with the frequently changing IP addresses of spam sending bots, and the chance that legitimate senders are blocked is too high. Especially the incident between Spamhaus and nic.at illustrates that the disadvantages of DNSBLs prevail.
2.1.3 URI Blacklist
A different form of blacklist is the Uniform Resource Identifier Blacklist (URIBL). Instead of blacklisting the IP addresses of senders, domain names referenced in mail bodies more often than a given threshold are included in the blacklist. A given mail is checked for a URI pointing to such a blacklisted domain name, and in that case the mail is classified as spam[33].
In contrast to DNSBLs the complete mail has to be received and the content has to be analyzed. The approach is reactive and requires a large set of both spam and ham messages as the presence of a URL in the message body is not a reliable indicator for spam. Almost 90 % of legitimate mails contain URLs as well[33].
An advantage of URIBLs compared to other techniques is that they only analyze the URLs in the message body. On the other hand the approach easily produces false negative results as it requires the presence of a URL. If a spam message does not contain a URL, as is the case for image spam, the message cannot be classified as spam.
Given the fact that URIBLs produce false negative results, they cannot be used as a standalone spam fighting solution, but have to be combined with other techniques. So the fact that a mail has been classified as spam by a URIBL should only be seen as an indicator for spam.
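The URIBL check described in this Section can be sketched as follows. The sketch assumes that only lower-case http and https URLs are considered and that the blacklist is available as a local set of domain names; the function names and the example domains are hypothetical. A real URIBL is usually queried remotely, which is not shown here.

```cpp
#include <cctype>
#include <set>
#include <string>
#include <vector>

// Collects the host part of every http:// or https:// URL in a mail body.
std::vector<std::string> extractHosts(const std::string& body) {
    std::vector<std::string> hosts;
    const std::string marker = "://";
    for (std::size_t pos = body.find(marker); pos != std::string::npos;
         pos = body.find(marker, pos + marker.size())) {
        // Accept only the schemes http and https directly before "://".
        const bool http = pos >= 4 && body.compare(pos - 4, 4, "http") == 0;
        const bool https = pos >= 5 && body.compare(pos - 5, 5, "https") == 0;
        if (!http && !https) continue;
        std::string host;
        for (std::size_t i = pos + marker.size(); i < body.size(); ++i) {
            const unsigned char c = static_cast<unsigned char>(body[i]);
            if (!std::isalnum(c) && c != '.' && c != '-') break;
            host += static_cast<char>(std::tolower(c));
        }
        if (!host.empty()) hosts.push_back(host);
    }
    return hosts;
}

// True if the host or one of its parent domains is on the blacklist.
bool isListed(std::string host, const std::set<std::string>& blacklist) {
    while (true) {
        if (blacklist.count(host) > 0) return true;
        const std::size_t dot = host.find('.');
        if (dot == std::string::npos) return false;
        host.erase(0, dot + 1);  // strip the leftmost label and try again
    }
}

// The result is only an indicator: a mail without any URL is never matched.
bool containsBlacklistedUri(const std::string& body,
                            const std::set<std::string>& blacklist) {
    for (const std::string& host : extractHosts(body)) {
        if (isListed(host, blacklist)) return true;
    }
    return false;
}
```

Because a mail without a URL can never match, the result is best combined with the verdicts of other techniques.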
2.1.4 Greylisting
Greylisting is a combination of a black- and a whitelist with automatic whitelist management. Each newly received mail is initially rejected on the Mail Transfer Agent (MTA) and the unique triplet of IP address of the sending host, sender address and recipient address in the envelope is stored. If the sending host tries to deliver the mail again after a defined delay, the mail is accepted and the triplet is moved to the whitelist. All further communication from this triplet will not be delayed[26].
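The decision logic described above can be sketched as follows. The class Greylist, its interface and the representation of time as seconds in a long are assumptions made for this illustration; the expiry of old entries, which a production MTA would need, is deliberately left out.

```cpp
#include <map>
#include <set>
#include <string>
#include <tuple>

// (IP address of sending host, envelope sender, envelope recipient)
using Triplet = std::tuple<std::string, std::string, std::string>;

// Hypothetical sketch of the greylisting decision.
class Greylist {
public:
    explicit Greylist(long minDelaySeconds) : minDelay_(minDelaySeconds) {}

    // Returns true if the delivery attempt at time `now` (in seconds) should
    // be accepted, and false if it should be rejected temporarily.
    bool accept(const Triplet& triplet, long now) {
        if (whitelist_.count(triplet) > 0) return true;  // known triplet
        const auto it = firstSeen_.find(triplet);
        if (it == firstSeen_.end()) {                    // first attempt
            firstSeen_[triplet] = now;
            return false;
        }
        if (now - it->second < minDelay_) return false;  // retried too early
        whitelist_.insert(triplet);                      // move to whitelist
        firstSeen_.erase(it);
        return true;
    }

private:
    long minDelay_;
    std::map<Triplet, long> firstSeen_;
    std::set<Triplet> whitelist_;
};
```

A sender that never retries stays in the list of first attempts forever, which is exactly the behaviour greylisting relies on.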
Greylisting is based on the assumption that spam sending applications use a "fire-and-forget" approach. If a spam message cannot be delivered the application does not try to resend the message, although temporary failures are always possible. The first testing of greylisting in mid-2003 showed an effectiveness of 95 %[26]. Unfortunately this effectiveness is based on the fact that spam sending applications do not implement SMTP correctly. By adapting the spam sending applications to circumvent the protection provided by greylisting, the success rate can be decreased. In Chapter 5.3 an evaluation of the current effectiveness of greylisting is provided.
Even when greylisting breaks because spammers adapt their applications, it is useful to continue to use greylisting. Basically greylisting binds resources on the spammer's side. The spammer has to use a mail queue and cannot continue to use a fire-and-forget approach. Because the host's IP address is part of the unique triplet, the same bot has to send the message after the delay. There is a chance that by this time reactive approaches such as DNSBLs include the bot's IP address in their blacklists, so the spam message can be blocked although the greylist has been overcome.
2.1.5 Conclusion
As this Section illustrated, none of the presented existing techniques is able to reliably distinguish spam from ham messages. Most of the existing techniques are reactive and require that a large set of false negative results is generated first. Based on these false negative results the techniques can be improved to identify spam messages in the future. But this is of course an annoyance for the end user, as unfiltered messages appear in the mailbox and the user has to filter those manually.
2.2 The Mail-Shake Concept
In this Section the Mail-Shake concept as described in [19] is discussed. First of all the idea is presented, followed by a discussion of how and why the concept works. Finally some of the limitations are named, together with ways to circumvent them.
2.2.1 Proactive Spam Fighting With Dynamic Whitelists
The basic idea behind Mail-Shake is to block all mails from unauthenticated senders and to provide senders an easy way to authenticate themselves. The process of authentication is designed so that humans are able to participate, while computers, and thereby spam bots, are not. After authentication the sender's address is put on a whitelist. This whitelist is used by Mail-Shake to decide whether a mail is authorized or not. The concept is thereby proactive as it blocks spam before it is read by the user.
Figure 2.1: Overview of the Mail-Shake email process[19] (steps shown: send initial email; reply with challenge and random ID; resend initial email with ID in subject; future communication)
User A first sends a mail to User B's public address and receives a reply containing a random identifier and a challenge, which reveals User B's private address. Now User A can resend the original mail with the identifier in the subject. Mail-Shake compares the identifier and puts User A's address on a whitelist. In the future User A can send mails directly to User B's private address. The authentication step is required only once, and there is no need to include the identifier in every mail. Other mails sent to the private address are discarded if the sender address is not on the whitelist.
The challenge, which reveals the private email address, has to be designed so that it is solvable by a human and not by a computer; that is, it is a kind of CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart). An example of such a challenge mail is presented in Figure 2.2. The actual challenge is implemented by relying on the reCAPTCHA web service, which offers the possibility to protect an email address with a CAPTCHA. A Mail-Shake user can publish the public address openly on the web. If a spam bot obtains this address, spam mails sent to the public address are not read, and the spam bot is unable to obtain the private address from the Mail-Shake challenge mails.
Subject: Mail-Shake challenge
You sent an email to a public Mail-Shake address. The email will not be delivered. You have to send the email to the private address. You can retrieve the private address by visiting the following web address and solving the shown CAPTCHA:
http://mailhide.recaptcha.net/d?k=01CnVIbRzbs1dYsDFRJi_3RQ==&c=6pdRWDUzBNLbqFDUM-P8vMAb9FJMDoP3HqWDsEQZPoI=
The email to the private address will only be delivered if you include the following text in the subject of your email:
Mail-Shake Id: 37735
In future you can send emails directly to the private email address as normal. If you did not send an email to the public address you can ignore this email.
Figure 2.2: Example of a Mail-Shake challenge mail
If a spam bot obtains the private email address, the bot is still unable to send junk to the Mail-Shake user. The sender addresses used in the spam mails are not on the whitelist and the mails are dropped. So a spam bot is unable to send spam mails without going through the authentication steps.
As it might be possible that a spam bot obtains both the public and the private address, the unique identifier is introduced. Without the identifier it would be sufficient to just send a mail to the public address followed by a mail to the private address. Introducing the unique identifier ensures that a Mail-Shake challenge mail has to be received. As a result, spam bots using forged sender addresses are unable to receive the required unique identifier.
the ???allenge mail is dropped. To circumvent this problem, a temporary whitelist is introduced. Whenever a Mail-Shake user sends a mail the recipient address is added to the temporary whitelist. A response mail is therefore not blo???ed and the ???allenge mail can be received. ???e temporary whitelist also guarantees that all communication started by a Mail-Shake user is possible.
Mail-Shake makes it more di???cult for spammers to deliver their junk. It requires to solve one CAPTCHA for ea??? recipient address. ???at increases the costs to send spam as long as the CAPTCHA is secure and has to be solved by a human. Even if a spammer knows both private and public address, he needs to be able to receive the ???allenge mail to get on the whitelist. ???e user will most likely delete the entry from the whitelist and so for ea??? spam mail to be sent, the spammer has to send two mails and receive one. Nevertheless there is one possible way to circumvent the Mail-Shake protection by infecting a system with malware and sending spam to all gathered addresses on this system disguised as the valid sender. In that case a Mail-Shake user will remove the address and Mail-Shake should inform the sender as it is a strong indicator that the system is infected.
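The interplay of whitelist, temporary whitelist and unique identifier described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation discussed later in this thesis: the class `MailShakeAccount`, its method names, the returned action strings and the five-digit identifier format are all hypothetical, and sending the challenge mail itself is left out.

```python
import secrets
from dataclasses import dataclass, field


@dataclass
class MailShakeAccount:
    """Hypothetical in-memory model of one Mail-Shake protected account."""
    public_address: str
    private_address: str
    whitelist: set = field(default_factory=set)
    temporary_whitelist: set = field(default_factory=set)
    pending_ids: dict = field(default_factory=dict)  # sender -> identifier

    def on_outgoing(self, recipient: str) -> None:
        # Every address the user writes to may answer without authentication.
        self.temporary_whitelist.add(recipient.lower())

    def on_incoming(self, sender: str, recipient: str, subject: str) -> str:
        sender = sender.lower()
        if recipient == self.public_address:
            # Mails to the public address are never read; a challenge
            # carrying a fresh identifier is sent back instead.
            self.pending_ids[sender] = f"{secrets.randbelow(100000):05d}"
            return "send_challenge"
        if sender in self.whitelist or sender in self.temporary_whitelist:
            return "deliver"
        expected = self.pending_ids.get(sender)
        if expected is not None and f"Mail-Shake Id: {expected}" in subject:
            # The identifier proves that the challenge mail was received.
            del self.pending_ids[sender]
            self.whitelist.add(sender)
            return "deliver"
        return "drop"
```

A sender who never received a challenge mail, for example because the sender address was forged, has no entry in `pending_ids` and is dropped even if the subject contains a guessed identifier.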
2.2.2 Limitations of the Mail-Shake Concept
2.2.2.1 Leaking of Private Address
To assume that the private address will not be leaked is rather naive. If User A has a Mail-Shake protected setup and User B is authorized to send mails to User A's private address, the address leaks as soon as User B sends one mail to both User A and a third User C. If User C replies to both User B and User A, the mail to User A is dropped, as illustrated in Figure 2.3(a).
Another leakage scenario is that the Mail-Shake User A sends a mail to User B, who is not authorized. Mail-Shake adds User B's address to the temporary whitelist automatically, so that response mails are accepted. In this scenario User B does not even know that User A is using Mail-Shake. By forwarding User A's mail to User C the private address is leaked. If User C tries to send a mail to User A, the mail is dropped. This scenario is illustrated in Figure 2.3(b).
Figure 2.3(a): 1.: User B sends one mail to both the private address of User A and to User C. 2.: User C sends a mail to User A, which is dropped as his address is not on the whitelist.
Figure 2.3(b): 1.: User A sends mail to User B and adds B's address to his temporary whitelist. 2.: User B forwards mail to User C. 3.: User C sends mail to User A, which is dropped as his address is not on the whitelist.
[Diagram: Sender, Receiver (private address, public address) and Storage]
1. Sends mail to private address
2. Drops not whitelisted mail
3. Sends notification mail with challenge for public address
4. Solves challenge, receives public address
5. Sends mail to public address
6. Generates Id and adds it to storage
7. Sends regular challenge mail
8. Resends original mail with Id in subject
9. Updates the whitelist
Figure 2.4: Mail-Shake authentication initiated on private address
Given these two scenarios it is unacceptable to drop mails without any notice to either receiver or sender, as proposed in the Mail-Shake concept paper. While this is an annoyance in private mail communication, it might have legal consequences in corporate or governmental institutions. A financial loss is also possible if, for example, orders are sent to a Mail-Shake protected private address.
Of course a notice to the user of Mail-Shake (User A) does not make sense, as it would require him to manually check all filtered mails, and in that case Mail-Shake does not provide any advantage over existing solutions. Therefore the sender (User B) has to be notified that the mail has been dropped. This notification must offer a way to send mails to User A. Publishing User A's public address in plain text in the notification is not an option, as then both the public and the private address are known to User B: he only needs to generate a unique identifier by sending a mail to the public address in order to send mails to the private address in future, which implies that he does not have to authenticate himself as a human. Therefore the notification must include the public address as a challenge. The workflow for authentication in this scenario is illustrated in Figure 2.4. The challenge reveals User A's public address, and User B has to send a mail to the public address in order to generate an identifier. This identifier has to be sent with the original mail to the already known private address. This workflow requires one more mail to be sent and makes the complete process more complex.
mail to the private address nevertheless. There is also the chance that a user who knows the Mail-Shake concept expects the public address to be the private one.
2.2.2.2 Communication with Automated Systems
There is a usability limitation in conjunction with all kinds of communication with automated systems, such as mailing lists, bulletin boards and online shops. The idea behind Mail-Shake is to only accept mails from senders who authenticate themselves as humans. In case of such automated systems we want to receive mails although the sender is not a human. The way to authenticate an address does not work for, e.g., newsletters. In most cases an automated system does not accept mails at all, and even if it does, there is nobody to solve the challenge and change the address. So the usage of the public address is out of the question for communication with automated systems.
This implies that every time mails from an automated system are expected (e.g. when creating a new account in an online shop) the private address has to be specified. At this point each mail sent by the automated system is dropped, as the whitelist does not yet contain an entry for its address. It is the user's obligation to add the address manually. Unfortunately the address which is used by the system is in general unknown to the user.
A possible solution to this problem is to ignore it. If mails that are not whitelisted are not deleted but just moved to a different mail folder, the user can check this folder for the mail. Of course this is a very suboptimal solution, as the user has to look through a folder full of junk mails. It is also no solution if mails are automatically deleted, as the original approach suggested.
In [19] a proposed solution to this limitation is to allow wildcards in whitelist entries. That is, if the user created an account at the online shop "Foo", he can manually create an entry that matches "*foo*". Of course this does not only match mails sent by the shop, but also junk mail disguised as mail sent from this shop. So either the user has to change the entry as soon as the first mail sent from this shop has been received, or specify a more specific rule like "*@foo.de". But this would not match, for example, mails sent from addresses like "[email protected]". Adding more wildcards to the domain part of the address does not solve this problem, as it opens the door to spammers using domains like "foo.bar.de", which would be matched by an entry like "*@*foo*".
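The trade-off between too broad and too narrow wildcard entries can be reproduced with shell-style pattern matching from the Python standard library. This is a minimal sketch; the function name `is_whitelisted` and all sender addresses are assumptions chosen for illustration and are not taken from [19] or from the implementation.

```python
from fnmatch import fnmatchcase


def is_whitelisted(sender: str, entries: list) -> bool:
    """Return True if the sender matches any (possibly wildcard) entry."""
    sender = sender.lower()
    return any(fnmatchcase(sender, entry.lower()) for entry in entries)


# Hypothetical senders: two legitimate shop addresses and two junk senders.
shop = "orders@foo.de"
shop_subdomain = "orders@shop.foo.de"
junk_local_part = "foo-deals@spam.example"
junk_domain = "offer@foo.bar.de"

# "*foo*" accepts the shop, but also junk that merely contains "foo".
broad = ["*foo*"]
# "*@foo.de" rejects the junk, but also the shop's own subdomain.
narrow = ["*@foo.de"]
# "*@*foo*" accepts the subdomain again, and with it the spammer's domain.
domain_wildcard = ["*@*foo*"]
```

Note that `fnmatchcase` also treats `?` and `[...]` as special characters, so a real implementation would have to decide whether whitelist entries may contain them.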
The implementation, which is discussed later in this thesis, supports these two possibilities to circumvent this limitation of communication with automated systems. But it also supports a third one: instead of dropping the mail directly, a processing delay is introduced. The user receives a notification in the user interface that a mail will be dropped, and he can add the sender to the whitelist before the mail is dropped. To provide better usability the notification can be toggled on and off. So before a user registers himself at a web shop he can activate the notifications, wait until the first mail of this automated system is received and turn off the notifications.
system, so that the entry should not match junk mails using parts of the domain name in their sender address. Unfortunately the assumption that all mails from one web shop use the domain of the first mail does not hold, as the evaluation (see Chapter 5.2 on page 90) showed. If an online shop Foo is a subsidiary of company Baz, the registration mail might be sent from domain "foo.de" while further communication is sent from "baz.de". In such a case a false positive is generated. On the other hand this proves that the Mail-Shake concept works correctly.
2.2.2.3 Sending Notifications in Reply to Spam
A limitation not considered at all in the Mail-Shake paper[19] is that Mail-Shake challenges or notifications are sent in reply to the receipt of spam mails. If the sender address is valid but forged, a challenge is sent to a user who did not request it. If that user is using Mail-Shake himself, the challenge is dropped without bothering him. If he is not using Mail-Shake, he receives one or, in worse cases, many unwanted notifications. These backscattered mails are of course unwanted and can even be considered spam. The consequences could be that rule based spam fighting techniques are trained to filter out Mail-Shake challenges or that the MTA (Mail Transfer Agent) sending the challenges is put on a blacklist. This would mean that a user of Mail-Shake is either unable to send mails or that users who want to send him mails are unable to go through the Mail-Shake authentication, as the challenge mails are dropped automatically.
If the sender address of a spam mail is forged and not valid, the situation is slightly different. Mail-Shake tries to send a challenge to this address, but this cannot succeed as the address does not exist. The MTA responds with a delivery status notification mail informing the end-user that the mail delivery failed. This notification is sent to the sender address of the Mail-Shake challenge, which is the public address. Of course Mail-Shake generates another challenge, sent to the address of the MTA. If the MTA does not accept mails, another delivery status notification is sent to the public address. At that point Mail-Shake is caught in a mail sending loop, as illustrated in Figure 2.5. Each delivery status notification sent by the MTA causes another challenge mail, and each challenge mail causes another delivery status notification.
If notifications are sent in reply to mails received at the private address, the situation becomes more complex. Of course a mail loop can be triggered as well. A problem is that delivery status notifications in general may not be dropped automatically, as they might be valid mails reporting that a mail sent by the user could not be delivered. Therefore Mail-Shake must be able to distinguish delivery status notifications sent in reply to a Mail-Shake mail from those sent in reply to a user mail.
[Diagram: a spam bot sends a spam email with a not valid sender address to the Mail-Shake user; the challenge email is undelivered, and the MTA answers with a Delivery Status Notification, which triggers the next challenge email.]
Figure 2.5: Mail loop triggered by a spam mail with a not valid sender address
With RFC 3464[44] a specification for the format of delivery status notifications (DSN) exists. Unfortunately not all MTAs implement this specification, although the first version (RFC 1894) was published in 1996. During the evaluation (compare Chapter 5) non-compliant delivery status notifications sent from Exim, qmail and the MTA of Google Mail have been received. While the notifications sent by the first one offer a minimal chance to be recognized as a notification, the latter ones do not. The notifications are normal plain text mails with the original, undelivered mail pasted into the text body. A compliant DSN uses a special MIME (Multipurpose Internet Mail Extensions) type and provides the undelivered mail as an attachment. Appendix A contains examples for both compliant and non-compliant delivery status notifications received during the evaluation.
Mail-Shake has to be able to recognize a DSN and must not send challenges or notifications in reply to the receipt of a DSN. Furthermore, at the private address Mail-Shake may only drop DSNs sent in reply to Mail-Shake notifications. The only way to recognize whether a DSN is a reply to a Mail-Shake challenge is the attached undelivered mail, which is specified as optional in RFC 3464. While omitting the original mail is positive in general, as it prevents backscattered spam, for Mail-Shake it is a problem. Fortunately the evaluation showed that all standard compliant DSNs attach either the complete mail or at least the header, which is sufficient to recognize a Mail-Shake mail. In case of non-compliant notifications such as the one sent by Exim, there is only the choice to either drop all notifications or to allow all notifications; in other words, a choice between false positives and false negatives.
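Recognizing a standard compliant DSN and checking whether it refers to a Mail-Shake mail can be sketched with Python's standard `email` package. RFC 3464 defines a DSN as a `multipart/report` message with `report-type=delivery-status`; the returned original is carried as a `message/rfc822` or `text/rfc822-headers` part. How a Mail-Shake mail is marked is not specified here, so the header name `X-Mail-Shake` and both function names are purely hypothetical.

```python
from email import message_from_string
from email.message import Message


def is_compliant_dsn(msg: Message) -> bool:
    """True if the top-level MIME type is the one RFC 3464 prescribes."""
    report_type = str(msg.get_param("report-type", "")).lower()
    return (msg.get_content_type() == "multipart/report"
            and report_type == "delivery-status")


def is_reply_to_mail_shake(msg: Message,
                           marker_header: str = "X-Mail-Shake") -> bool:
    """True if the returned original mail carries the (assumed) marker."""
    if not is_compliant_dsn(msg):
        return False
    for part in msg.walk():
        ctype = part.get_content_type()
        if ctype == "message/rfc822":
            # The complete undelivered mail is attached as a nested message.
            inner = part.get_payload()
            if isinstance(inner, list) and inner and marker_header in inner[0]:
                return True
        elif ctype == "text/rfc822-headers":
            # Only the header section of the undelivered mail is attached.
            headers = message_from_string(part.get_payload())
            if marker_header in headers:
                return True
    return False
```

A plain text bounce, as sent by the non-compliant MTAs mentioned above, fails the first check, which is exactly the case where only the choice between dropping and accepting all notifications remains.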
The handling of delivery status notifications as proposed in this Section weakens the Mail-Shake concept. It is possible to successfully send mails to the private address without having to authenticate: a spammer only has to disguise the spam as a DSN. In case of a standard compliant delivery status notification there is at least the chance that such mails can be recognized with extensions to the email client.
[Diagram: User of Web Service, Web Service and Mail-Shake User]
1. User sends message via Web Service
2. Web Service relays message as mail
3. Mail-Shake discards message
Figure 2.6: Web Service as relay of a mail
2.2.2.4 Web Services
Another limitation in the usability of Mail-Shake, which can be considered a variant of the communication with automated systems, was found during the evaluation: Mail-Shake is unable to handle mails sent from web services such as social networking services. Consider the case that User A is using Mail-Shake and User B is an authorized sender who knows the private address of User A. User B is also a user of social network Foo, while User A is not a user of that network. User B wants to invite User A to join that network. Therefore he gives User A's private address to Foo, and Foo sends an invitation mail to User A. This mail is of course dropped, as it uses an address of Foo that is not whitelisted and not the whitelisted address of User B. The web service is, so to speak, a mail relay which changes the sender address, as illustrated in Figure 2.6.
If User B specifies User A's public address, it fails as well: the challenge is sent to Foo, and as this is an automated system it cannot solve the challenge. In fact the evaluation showed that a bounce mail might be sent stating that mails cannot be sent to that address. As this mail triggers another challenge mail, Mail-Shake and Foo are stuck in a mail sending loop similar to the one seen above in the case of DSNs.
update this predefined whitelist for the case that new services are established or addresses change. Purchases via the popular auction platform eBay fail as well, as the seller sends a mail to the buyer. The seller's address is of course unknown to the buyer and is therefore not whitelisted. A possible workaround is to watch for mails at the time the purchase finishes or to only use the web frontend provided by the platform. A similar problem occurs for Review Board, a web-based code reviewing tool used for example by the KDE community. The web tool knows the addresses of all participants, and if User A opens a review request to User B, a mail is sent to User B from User A's address. The header section does not contain any information which could be used to identify the mail as having been sent from Review Board. In Appendix B an example of such a header section is provided, together with one from a system with useful headers. As Review Board is open source software, the easiest solution is to propose a patch to include a special header in each mail.
2.2.3 Summary
In this Section the Mail-Shake concept has been presented. Mail-Shake protects an email account by using whitelists. All mails with a sender address which is not on the whitelist are blocked. To get an address on the whitelist a sender has to prove that he is a human and not an automated system by solving a CAPTCHA. For the authentication process each user of Mail-Shake has two addresses: a public and a private one. Each mail to the public address is answered with a mail containing the CAPTCHA, which reveals the private address.
The concept is of course not bulletproof, and some limitations of the concept and possible solutions to those were presented. The most severe problems are communication with automated systems and handling of Delivery Status Notifications. The solutions to these limitations are discussed in more detail in the scope of this thesis.
2.3 The Spam Templates Concept
In this Section the concept of proactive spam filtering based on templates, as described in [25], is presented. The general idea is to generate templates by intercepting mails sent by spam bots. These templates are used to identify newly received mails as junk by matching the mail against the templates.
2.3.1 Template Based Spam Mails
Nowadays spam mails are mostly sent by botnets which control a large number of systems infected with malicious software (malware). These controlled systems, which are commonly known as bots or zombies, communicate with a control server to get the order to send out spam. Most large spam sending botnets, like the Storm Worm botnet and its successor the Waledac botnet[60], use a special technique to generate and send spam messages[59], as illustrated in Figure 2.7. The control server passes templates, which describe the structure of the spam messages to be sent, and meta-data such as recipient lists to the bots. The templates contain variable parts, which are filled in by the bots when sending out the messages, for example with URLs that are received from the control server as well[34].
By intercepting the communication between the bots and the mail servers they connect to for sending the spam messages, the templates can be reverse engineered. To intercept the communication, samples of malware are executed in a sandbox, a controlled environment. The malware, which is running on a native Microsoft Windows machine, is allowed to communicate with its control server to receive current templates and the list of target recipients. When the bot tries to open an SMTP (Simple Mail Transfer Protocol) connection, the connection is intercepted and redirected to a local mail server. The local mail server acts as the man-in-the-middle between the bot and the mail server the bot wanted to connect to. To trick the bot into believing that it is communicating with the actual mail server, the local one has to connect to the "real" MTA, grab its banner and relay it to the bot.
As the original template is passed from the control server to the bots, it seems to be a more elegant solution to intercept this communication instead of intercepting the SMTP communication (and reverse engineering the template). But gaining the original template might not be useful. Each botnet uses its own template language, which can be, as for example the Storm botnet illustrates, a fairly elaborate template language with support for formatting macros, generation of random numbers, dates, etc.[34] These languages have to be reverse engineered, and the result has to be adjusted each time the botnet slightly changes the language, which renders the idea of spam fighting based on spam templates reactive. Another reason to reverse engineer the templates is that not all botnets distribute their templates to the bots. There are also botnets using a reverse proxy-based spamming technique[49]. The bot connects to the control server and establishes a reverse SOCKS proxy connection. The control server uses this tunnel to directly send out the spam messages without passing the template to the bot.
Figure 2.7: Template based spamming[25]
By intercepting all SMTP communication only current spam messages are gathered. This has the advantage that when a new spam campaign is started the mails are already present. Existing techniques which rely on feature extraction first have to gather many spam mails to be able to identify a new campaign, resulting in a high false negative rate at the start of a new spam campaign. The idea of the Spam Templates concept is to regenerate the templates used by the bots. To this end a bot is executed for a certain amount of time or until a certain number of messages have been collected. Afterwards the system is reset and a different malware sample is executed to receive messages sent by another botnet.
2.3.2 Generation of Templates
After one bot has been executed, templates can be reverse engineered from the collected spam mails. The messages are sorted so that the longest message is processed first. The longest mail becomes the base template, and by merging it with the other mails a template is generated. If the merge with one mail results in a too generic template, the new one is discarded and the mail is moved back to the list of unprocessed mails. That guarantees that the template does not become too generic and that only mails which were generated from the original template end up in the template. As soon as all mails are processed or only mails are left which would render the template too generic, the template generation process ends[25]. An example of a generated template is provided in Figure 2.8. Subject, X-Mailer header and each line of the message body are replaced by a regular expression.
Subject\:\ ([\!\-\.\???\s\w]){7,137}\
X\-Mailer\:\ Microsoft\ Outlook\ Express\ 6\.00\.2720\.3000\ Body\:\
\#([\=\.\-\&\;\!\???\s\w]){20,152}\!\!\>\>\=09\ \.([A-Za-z]){14,14}Next\ Body\ Part\:\
\<\!DOCTYPE\ HTML\ PUBLIC\ \"\-\/\/W3C\/\/DTD\ HTML\ 4\.0\ Transitional\/\/EN\"\>\ \<HTML\>\<HEAD\>\
\<META\ http\-equiv\=3DContent\-Type\ content\=3D\"text\/html\;\ \=\ charset\=3Diso\-8859\-1\"\>\
\<META\ content\=3D\"MSHTML\ 6\.00\.2720\.3000\"\ name\=3DGENERATOR\>\ \<STYLE\>\<\/STYLE\>\
\<\/HEAD\>\
\<BODY\ bgColor\=3D\#ffffff\>\ \<STYLE\>\
\#([A-Za-z]){12,12}\ \<\/STYLE\>\
\<DIV\ style\=3D\"width\:([\d]){2,3}\%\;\ padding\:1([\d]){1,1}px\;\"\ id\=3D\"([A-Za-z]){14,14}\"\>\ \<H3\ id\=3D\"([A-Za-z]){14,14}\"\>\<A\ \=\
href\=3D\"http\:\/\/clubdetenisdelachar\.carpelo\.es\/index1\.php\"\ \=\
style\=3D\"color\:\#([\w]){2,2}0([\w]){3,3}\;\ font\-size\:14px\;\"\>([\=\.\-\&\;\!\???\s\w]) {7,141}\!\!\>\>\<\/A\>\<\/H3\>\=09\
\<\/DIV\>\ \<STYLE\>\
\.([A-Za-z]){14,14}\
\<\/STYLE\>\<\/BODY\>\<\/HTML\>Next\ Body\ Part\:\
Figure 2.8: Example of a generated Spam template
the mails sent by one bot are always the same or only very few different mails are sent. So if the generated template is used to test mails, the false negative rate can easily become too high, as the mails to be tested differ from the intercepted ones.
2.3.3 Proactive Filtering
In the scope of this thesis a proactive spam filtering system based on the generated Spam templates is implemented. The system has to test new incoming mails against all templates, and if a mail matches one of the templates, it can be classified as spam and discarded. The system can fetch new templates from the servers generating the templates, which makes the approach proactive. Even before spam mails of a new campaign have been received, the template might already have been generated, so that mails of the new campaign are recognized instantly. In the worst case the templates become available after a mail has been received. But even in this case newly fetched templates can be used to test already received but unread mails. Current spam fighting techniques, which test mails when they arrive at the mail server, do not test already received mails again when the rule set is updated. The Spam Templates are used on the client side, so that it is possible to test mails once on receipt and again when new templates become available.
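The two steps described above, testing a mail against all templates and re-testing unread mails once new templates arrive, can be sketched as follows. This is not the algorithm of Chapter 4.4 but a minimal illustration: the class `SpamTemplate`, the dictionary layout of a mail and the function names are assumptions, and `EXAMPLE_TEMPLATE` is a heavily simplified excerpt modelled on Figure 2.8.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class SpamTemplate:
    """One template: a regular expression per inspected part of the mail."""
    subject: re.Pattern
    x_mailer: re.Pattern
    body: re.Pattern

    def matches(self, mail: dict) -> bool:
        # A mail matches only if every part matches its expression completely.
        return all(
            pattern.fullmatch(mail.get(key, "")) is not None
            for key, pattern in (
                ("subject", self.subject),
                ("x_mailer", self.x_mailer),
                ("body", self.body),
            )
        )


def is_spam(mail: dict, templates: list) -> bool:
    """Classify a mail as spam if at least one template matches it."""
    return any(template.matches(mail) for template in templates)


def recheck_unread(unread: list, templates: list) -> list:
    """Return the unread mails that newly fetched templates identify as spam."""
    return [mail for mail in unread if is_spam(mail, templates)]


# Simplified, hypothetical excerpt of a template in the style of Figure 2.8.
EXAMPLE_TEMPLATE = SpamTemplate(
    subject=re.compile(r"[!\-.\s\w]{7,137}"),
    x_mailer=re.compile(r"Microsoft Outlook Express 6\.00\.2720\.3000"),
    body=re.compile(r"#[=.\-&;!\s\w]{20,152}!!>>", re.DOTALL),
)
```

Requiring a complete match of every part keeps the test strict: a legitimate mail that happens to share a subject pattern with a campaign is still accepted unless X-Mailer header and body match as well.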
algorithm is developed which classifies a given mail as spam based on the generated Spam templates, and a system to provide and fetch new templates is elaborated. The algorithm is discussed in more detail in Chapter 4.4.
2.3.4 Summary
3 Background
In this Chapter the essential background for implementing the two proactive concepts is presented. First of all an evaluation of current CAPTCHA techniques is provided, as a secure CAPTCHA is required in Mail-Shake. An excursus on breaking an existing CAPTCHA system illustrates that Mail-Shake has to use an existing and well tested system. Furthermore the framework Akonadi, on which the two systems are built, is presented in this Chapter.
3.1 Evaluation of Current CAPTCHA Techniques
3.1.1 Introduction
One of the most important parts of the Mail-Shake approach is the challenge, which reveals the private email address to the sender. The challenge is only solvable by humans and thereby prevents spam bots from sending spam mails to the receiver using Mail-Shake. This challenge could be some form of CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart), a riddle, or simple text[19].
The most commonly used approach to implement this challenge is the CAPTCHA technique. CAPTCHAs provide automatically generated tests which cannot be passed by current computer programs using the current state of Artificial Intelligence (AI)[66]. The fact that a CAPTCHA does not guarantee that future computer programs will not be able to break it has an implication for the implementation: it has to be possible to easily change the CAPTCHA implementation that is used.
In this Section several currently used CAPTCHA implementations are evaluated. It is important to remember that CAPTCHAs are publicly available. In other words, Kerckhoffs' principle applies to CAPTCHA techniques as well. The strength of the CAPTCHA has to be determined by the difficulty of solving the AI problem and not by "security through obscurity". As Mail-Shake will be published as Open Source, it has to be assumed that an attacker has access to the implementation details.
There are many websites which, to prevent spam comments, use CAPTCHAs that do not require solving a hard AI problem. An example is a CAPTCHA which is just a simple mathematical equation, as provided by the "ConfirmEdit" extension for MediaWiki. These CAPTCHAs can be solved by computer programs. The strength of those CAPTCHAs is either the fact that there are enough websites which do not use CAPTCHAs or the fact that the CAPTCHA is placed at a random position in the HTML source, which makes it more difficult to parse. The CAPTCHA itself does not provide any additional security. For the implementation of Mail-Shake it is important to use only CAPTCHAs which require solving a hard AI problem. If the CAPTCHA does not add any security, there is no need to use a CAPTCHA at all and the idea of separating humans from computers is broken[65].
It is an important requirement for Mail-Shake to use an existing and well-tested CAPTCHA implementation instead of a custom implementation. In the scope of this thesis it cannot be proven that a custom implementation poses a hard AI problem, and therefore a custom implementation has to be considered insecure and unsuitable for Mail-Shake.
3.1.2 Simple Obfuscation
Many people and websites use a simple obfuscation to protect email addresses. This could be used by Mail-Shake as well. To obfuscate the email address, special characters are replaced by their textual representation. That way the email address is still readable, but a bot trying to harvest the address is not able to find it, because it is not a valid address any more and cannot be matched by a regular expression for email addresses as defined in RFC 822. For example the address
could be obfuscated as
firstname [dot] lastname [at] example [dot] tld
The obfuscation does not fulfill the requirements of a strong CAPTCHA. It is based on a simple translation rule to replace special characters with a textual representation. Of course this translation can be reversed. The technique seems to be useful to prevent harvesters from extracting mail addresses from arbitrary websites[21]. If the position of the obfuscated email address is well known, however, the obfuscation does not add any security.
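How little effort reversing such a translation rule takes is shown by the following sketch. It is not the application from Appendix D; the function name `deobfuscate` and the accepted bracket styles are assumptions made for this illustration.

```python
import re

# Translation rule of the obfuscation, applied in reverse.
_RULES = {"at": "@", "dot": "."}

# Matches "[at]", "(dot)" and similar tokens including surrounding spaces.
_TOKEN = re.compile(r"\s*[\[(]\s*(at|dot)\s*[\])]\s*", re.IGNORECASE)


def deobfuscate(obfuscated: str) -> str:
    """Replace every textual token by the special character it stands for."""
    return _TOKEN.sub(lambda match: _RULES[match.group(1).lower()], obfuscated)


print(deobfuscate("firstname [dot] lastname [at] example [dot] tld"))
# prints: firstname.lastname@example.tld
```

Once the position of the obfuscated address in a document is known, a harvester only has to apply such a function to that single string.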
For example the web archive of GNU Mailman just replaces the "@" character by its textual representation, while the email address can always be found at the same position: the first hyperlink of the web page.
This address can be extracted automatically, as both the position in the document and the translation rule are known. In Appendix D the source code for a small application with only 270 lines of code is provided, which is able to extract all email addresses from a Mailman archive. The result of extracting the addresses of one month of a public mailing list is shown in Figure D.1.
GNU Mailman: http://www.gnu.org/software/mailman/index.html
Figure 3.1: Example of a reCAPTCHA
For Mail-Shake the same problem occurs: an attacker would know the position as well as the rules to retrieve the unobfuscated email address. Because of that, obfuscating the email address is not a sufficient protection and cannot be used as a challenge in Mail-Shake.
3.1.3 Image Based CAPTCHAs
The first CAPTCHAs to be used were image based CAPTCHAs. A word or any sequence of letters and digits is distorted and drawn on top of a noisy background. The distortion allows humans to pass the test while most computer programs, including OCR programs, are not able to solve it. Figure 3.1 illustrates an example of the reCAPTCHA system, which was developed at Carnegie Mellon University and has belonged to Google Inc. since September 2009[2].
Although image based CAPTCHAs pose a hard AI problem, it is possible to break them by using image processing algorithms for object recognition. With that approach it is possible to break, for example, CAPTCHAs generated by the EZ-Gimpy implementation 92 % of the time[45]. In Chapter 3.2 an excursus on breaking a current image based CAPTCHA system is provided.
Image based CAPTCHAs have the disadvantage that they cannot be solved by people with impaired vision. An image cannot be presented by a Braille display, and so a user who depends on a Braille display cannot solve the challenge. If the screen reader software were able to present the CAPTCHA to a blind user, the CAPTCHA would not serve its purpose, because bots would be able to figure it out as well[28]. This is a severe limitation of image based CAPTCHAs, and because of that they cannot be used wherever accessibility has to be guaranteed, such as in governmental institutions. For Mail-Shake an email address has to be obfuscated by using an image as shown in Figure 3.2. This reveals some important information to a possible attacker. An email address has a well known structure containing exactly one "@" and at least one dot. The attacker can generate as many CAPTCHAs as needed and knows that each CAPTCHA contains the same obfuscated address. Given these points it does not seem to be a good idea to use an image based CAPTCHA to obfuscate the email address directly.
3.1.4 Audio Based CAPTCHAs
In a similar approa??? to image based CAPTCHAs aural CAPTCHAs can be generated. A sequence of random digits and ???aracters is generated and the spoken tokens are recorded on an audio clip together with some added noise. Unfortunately the amount of noise that has to be added to the clip to prevent Automatic Spee??? Recognition application from breaking the CAPTCHA renders the aural CAPTCHA as hardly usable[56]. Current aural CAPTCHAs can be broken by so???ware with a reliability of 58 %[62].
From an accessibility point of view an aural CAPTCHA does not solve the problems shown for image based CAPTCHAs. Blind people are able to solve the CAPTCHA, but deaf people are not. Although the audio CAPTCHA can be solved by blind people, it is still very difficult for them. Screen readers often speak over the playback of the audio CAPTCHA. This is one of the reasons why blind people achieve a success rate of only 43 % in solving audio CAPTCHAs[4].
For Mail-Shake the email address has to be recorded and distorted. If the listener does not understand the address correctly the Mail-Shake process fails. Given this disadvantage an aural CAPTCHA does not seem to be a solution for Mail-Shake.
3.1.5 Image Recognition CAPTCHAs
A different approach to image based CAPTCHAs are image recognition CAPTCHAs. Instead of presenting an image with distorted text, several images are presented and the user has to identify the common object or an anomaly. The approach has the same limitations for users with impaired vision as the image based CAPTCHAs.
A proposed implementation uses an English pictorial dictionary, so every word is easy to illustrate. For each of the words the first 20 hits from Google's image search are used to build up the database[6]. A disadvantage for the use in Mail-Shake is that it does not provide a way to conceal the email address. It can only be used for the unique identifier. In that case it suffers from spelling mistakes and different languages. While an English user would identify the object and answer with "dog", a German user would answer "Hund". Requiring a specific language would render the challenge unsolvable for people not speaking that language. Moreover, the Mail-Shake implementation cannot distinguish a result submitted in a different language from an incorrect result.
Figure 3.3: Example of an Asirra CAPTCHA
There exists an approach to overcome this problem by providing only images of dogs and cats. This concept is called "Asirra" and an example is provided in Figure 3.3. The user's task is to identify all cats in a set of 12 images. As a database, the images provided by Petfinder.com, the world's largest web site devoted to finding homes for homeless animals, are used. It contains over three million categorized images of cats and dogs, and nearly 10,000 new images are added each day[18]. Unfortunately the proposed implementation is attackable: there exists a classifier which is 82.7 % accurate in telling apart the images of dogs and cats[24]. For Mail-Shake the Asirra approach does not seem suitable, as the central database cannot be shared and the implementation is web service centric.
To solve the accessibility problems of audio and image based CAPTCHAs, a system could be used which combines those techniques. The image recognition could be combined with a characteristic sound of the same object[28]. For example an image could show a dog while the audio clip contains barking.
The disadvantage of this approach is the limited pool of combinations: the system can be broken by generating and solving all possible challenges manually. The paper discussing this approach is aware of the problem and suggests that websites lock out IP addresses which try to solve too many CAPTCHAs in a certain amount of time[28]. For Mail-Shake this does not work, as the attacker can use different email addresses to generate challenge emails. Therefore this CAPTCHA system does not fulfill the requirements implied by the public availability of the test.
3.1.6 Riddle
A riddle is not a commonly used CAPTCHA technique, but the paper discussing the Mail-Shake approach[19] names it as a possible way to encode the challenge. Therefore it has to be discussed in this section as well. The term riddle is very generic, and there are different types of riddles which could be used to tell computers and humans apart. In general a riddle can be considered a hard AI problem and fulfills the requirements for a CAPTCHA.
There are basically two categories of riddles which have to be discussed in this section. The first is a riddle which does not encode the email address, but has a distinct solution like a digit or a word. In the scope of Mail-Shake such a riddle could be used to encode the unique identifier instead of the email address. The second kind of riddle reveals the email address. It therefore has to be a kind of instruction for an algorithm which reveals the address.
The first kind of riddle could be a semantic CAPTCHA system. The human has to show that he understands the semantics of the presented words. For example the names of three animals are presented: two of them are birds, one is a mammal. The user has to recognize which of the three given animals differs from the other two[38]. Of course such a system has to be combined with image based CAPTCHAs: presenting just the words in a non-obfuscated way renders the riddle useless, as solving it is then just a matter of probability.
The semantic CAPTCHA approach has some obvious disadvantages. First of all it does not add any security beyond that of the image based CAPTCHAs. As soon as the words are recognized by an OCR system, solving the semantic CAPTCHA is just a matter of probability. It is obvious that the pool of possible riddles is limited. Each time an attacker breaks one CAPTCHA, the attacker therefore gains information on three words. The attacker knows that two words are of the same category while the third does not belong to the category. Whenever a riddle is presented with two of the three words, the probability to solve the riddle is increased to one half instead of one third. In that way an attacker is able to build a semantic database and solve each new challenge. By using existing semantic databases like the Dublin Core Metadata[68] in combination with existing reasoners it is possible to solve the riddle by using queries[17]. The strength of the system does not rely on the strength of the AI problem but on keeping the semantic database secret. Therefore it does not fulfill the requirements of Kerckhoffs' principle and has to be considered as broken.
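How an attacker could accumulate such a semantic database can be sketched in a few lines. The sketch assumes that the three words have already been recognized, for example by an OCR system, and that the attacker learns the correct answer of every riddle solved once. The class and method names are hypothetical, and only the simplest case is covered, in which both words of a previously seen same-category pair reappear together.

```python
import random
from itertools import combinations


class SemanticAttacker:
    """Collects word pairs leaked by solved odd-one-out riddles."""

    def __init__(self):
        # Each entry is a frozenset of two words known to share a category.
        self.known_pairs = set()

    def learn(self, words, odd_one):
        """Remember that the two words other than odd_one belong together."""
        same_category = [w for w in words if w != odd_one]
        self.known_pairs.add(frozenset(same_category))

    def solve(self, words, rng=random):
        """Return the odd word if a known pair is present, else guess."""
        for a, b in combinations(words, 2):
            if frozenset((a, b)) in self.known_pairs:
                # The remaining word must be the one that differs.
                return next(w for w in words if w not in (a, b))
        # No knowledge yet: a random guess succeeds with probability 1/3.
        return rng.choice(list(words))


attacker = SemanticAttacker()
# One riddle solved (or guessed) earlier: "dog" was the odd one out.
attacker.learn(("sparrow", "eagle", "dog"), odd_one="dog")

# A later riddle reuses the two birds, so no guessing is necessary.
print(attacker.solve(("eagle", "cat", "sparrow")))
```

The sketch ignores transitive knowledge, such as two words that were each seen together with a common third word. An attacker who tracks whole categories instead of pairs would fill the database even faster.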
This CAPTCHA system is not only easy to break but also very difficult for certain groups of p