Spam Study

Erik Neff

CPS Research Paper – A Study on Solutions to Spam 12/10/2003

Abstract

Using e-mail most likely means exposure to junk mail, referred to as Spam, to one degree or another. This paper examines the available and proposed solutions to Spam, studies the feasibility of each, and concludes with the single most effective method of fighting Spam. While there is no perfect solution, some provide more comprehensive protection from the costs associated with junk mail.

I hypothesize that the best solution will be a rule-based solution that is individually configurable by the e-mail subscriber, because the user retains complete responsibility at minimal effort and can modify the solution’s functionality to tailor its capabilities to individual needs.

I. Introduction

Spam is defined as “Unsolicited e-mail, often of a commercial nature, sent indiscriminately to multiple mailing lists, individuals, or newsgroups; junk e-mail.” This definition may seem straightforward; however, Spam is difficult to distinguish from normal or wanted e-mail. For instance, mail sent by an online store where you recently shopped could be considered Spam by some but not by others. There are those e-mails we are all too familiar with that are sent to randomly generated addresses and that, by probability, reach thousands of people. Most common are e-mails advertising drugs such as Viagra that seem to originate from an e-mail address that is simply a string of random letters.

The frequency of Spam attempts is increasing. It is projected that the number of Spam e-mails received per person per year will increase from 450 in 2000 to over 1500 in 2003*. According to a study conducted by Ferris Research, each Spam message takes a user an average of 4 seconds to process, and Spam costs U.S. corporations approximately $9 billion each year*. Another figure suggests that 40% of all Internet e-mail traffic is Spam*.

Spam can flood one’s inbox with dozens and sometimes hundreds of e-mails. This can fill an inbox and prevent those important and needed e-mails from being delivered. In addition, it can make it difficult to sort through and choose e-mails of personal interest and necessity. Spam mails can be used as tunnels by hackers and can contain dangerous viruses.

Junk e-mail is more than a nuisance to the receiver. The delivery of Spam costs Internet Service Providers millions of dollars per year in bandwidth. In order to send Spam anonymously, an untraceable or phony mail server IP address must be used. For this reason, Spammers hack into servers and generate e-mail from these machines. In essence, the need to be anonymous results in security violations for the purpose of sending Spam. In many cases, there is no way to track down these hackers.

The main reason for sending Spam is its use in marketing and advertising to consumers. E-mail is a popular channel for communication. Sending junk mail costs the sender only a small fraction of a penny. Compared with advertising methods such as magazine ads or television commercials, the costs are negligible. Even one sale from a thousand e-mails would often be profitable for the Spammer.

With so many Spam messages filling up inboxes, it is evident that something must be done. This paper looks at a variety of solutions to the problem of Spam. An explanation of each solution is presented, and the criteria used to subjectively grade each potential solution are discussed. I consider the characteristics of Spam, evaluate the effectiveness of each solution, and conclude with the solution that fights Spam best.

II. Brief Description of Available and Proposed Spam Solutions

Numerous solutions have been proposed to eliminate Spam. Spam is an increasing problem with no mainstream solution. Solutions range from defensive (filters) to offensive (penalties for Spam). The flow of Spam from sender to receiver follows a long path that provides several opportunities to attack it. Mail originates at a server, or end. It is channeled through one or more ISPs on its route to a destination mail server and then on to the client’s machine, where the recipient views the e-mail. Several parties are therefore involved, all of which have tools that can affect the flow of Spam, whether directly or indirectly.

The parties involved are the sender or Spammer, the ISP, government, the owner of the receiving mail server, and the addressed recipient of the e-mail. Currently there are Spam solutions, whatever their effectiveness, that can be implemented at each stage of the Spam pathway. In addition, there are solutions that exist only in theory but have a strong following within the Internet community. Each solution has advantages and disadvantages that are discussed further in the following sections.

Solutions considered include current U.S. state and federal laws, rule-based client and server filters such as Spam Assassin, Bayesian filter applications, ISPs charging for e-mails and maintaining whitelists, and reconstructing e-mail protocols. Other solutions to Spam exist as well. The sample chosen for this paper represents each of the predominant groups of solutions. In each case, the pros and cons are representative of other solutions in the same category. For example, there are dozens of available Spam filters, but only Spam Assassin’s solution is considered.

III. Methodology for Comparing Spam Solutions

A feasibility study of the proposed solutions must analyze the costs and benefits of each. The criteria by which I determine the most feasible solution are enforce-ability, dollar costs, deploy-ability, effectiveness, false positives, network traffic, manual intervention required, and user friendliness. Another question to be answered is “Does the solution violate the classic end-to-end arguments set forth by the original designers of the Internet?” The analysis is an in-depth look at each solution against the grading criteria. With the exception of dollar costs and other relevant statistics, this study is a subjective evaluation to determine the most feasible way of eliminating junk mail.

Enforce-ability pertains to how well the use and effectiveness of a particular solution can be governed. This category relates to the ability of the creator to ensure that the solution is implemented and that it maintains its integrity. Enforce-ability answers the question “How much control does one have over the Spam solution?”

Dollar costs involve all monetary transactions related to developing the solution’s infrastructure, maintaining the solution, the costs saved by using the solution, and the costs associated with not using a solution to stop Spam. This category does not include a value placed on intangibles such as the cost of losing an e-mail due to false positives in a filtering solution.

Deploy-ability assesses the ease of setting up a solution and maintaining it. It considers a one-time act versus continual maintenance. It also looks at how much effort is required for a solution as well as what level of knowledge is required to enact a solution.

Effectiveness examines the number of Spam mails stopped by the particular Spam solution. Because this paper is a high-level survey and objective data is lacking, the bulk of this evaluation is subjective analysis.

False positives are the e-mails that are blocked by a solution that should not have been. This category does not apply to every solution examined in this paper. It is considered because of the importance of having no false positives in a solution. Losing just one important e-mail can be extremely costly.
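The effectiveness and false-positive criteria can be made concrete with a small calculation. The sketch below is only an illustration: the function name and the counts in the usage note are hypothetical and do not come from any measurement in this paper.

```python
def filter_metrics(spam_blocked, spam_missed, good_blocked, good_delivered):
    """Return (effectiveness, false_positive_rate) for one filter.

    effectiveness       = share of Spam mails that were stopped
    false_positive_rate = share of good mails that were wrongly blocked
    """
    effectiveness = spam_blocked / (spam_blocked + spam_missed)
    false_positive_rate = good_blocked / (good_blocked + good_delivered)
    return effectiveness, false_positive_rate


# Hypothetical counts, chosen only to show the arithmetic.
effectiveness, fp_rate = filter_metrics(
    spam_blocked=870, spam_missed=130, good_blocked=2, good_delivered=998
)
print(f"effectiveness: {effectiveness:.1%}, false positives: {fp_rate:.1%}")
```

With these made-up counts the filter stops 870 of 1000 Spam mails and wrongly blocks 2 of 1000 good mails. The two numbers are reported separately because a filter can score well on one and poorly on the other.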

The network traffic created and/or eliminated by a given solution is another important characteristic to be considered. Network traffic costs ISPs bandwidth and Internet users time.

IV. Spam Laws in the US

On November 22, 2003, the U.S. House of Representatives voted 392-5 in favor of a federal anti-Spam law that would threaten Spammers with fines and jail time. This law is known as the “Controlling the Assault of Non-Solicited Pornography and Marketing Act”. This legislation allows the Federal Trade Commission to create an “opt-out” list that prevents Spammers from sending mail to people on that list. Further, the bill prohibits falsifying e-mail header information, using deceptive subject lines, not including a functioning return address, address guessing or “harvesting”, using scripts to sign up for e-mail accounts, and sending mail with sexual content unless labeled with the correct FTC label.

With this law comes much debate from Spam lobbyists regarding the First Amendment and the right to freedom of speech. This has been defended in court: [quoted court passage illegible in the source]

A problem with the laws in the U.S. is that Spam is generated all over the world. The Internet is a globally present entity, and Spammers can easily leave U.S. jurisdiction to send their mail. Tracking Spammers is also a challenge that enforcement officials will likely face.

V. Client Mail Filters – Spam Assassin

Spam Assassin is an e-mail filter product that uses a rule base to perform tests on the body and header of an e-mail. For each rule violation, the e-mail is assigned a number that contributes to the e-mail’s score. The higher the score, the more probable it is that the e-mail is Spam. A threshold can be set above which mails are sent to a junk folder. Spam Assassin also uses blacklists such as mail-abuse.org and works with Vipul’s Razor, which is a database of Spam signatures. By allowing recipients to store Spam messages in this database, other users that receive the same e-mail can block it. Spam Assassin requires little configuration; however, a system administrator has the ability to modify or extend the filtering capability.
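The scoring idea described above can be sketched in a few lines. This is a minimal illustration of rule-based scoring in general, not Spam Assassin's implementation: the rule names, weights, and threshold below are all invented for the example.

```python
import re

# Hypothetical rules: (name, test on the message, score added when the test hits).
# None of these are Spam Assassin's real rules or weights.
RULES = [
    ("subject_all_caps", lambda m: m["subject"].isupper(), 1.5),
    ("mentions_viagra", lambda m: "viagra" in m["body"].lower(), 3.0),
    # Sender name made only of consonants, e.g. a string of random letters.
    ("random_sender",
     lambda m: re.fullmatch(r"[b-df-hj-np-tv-z]{8,}@.*", m["from"].lower()) is not None,
     2.0),
    ("no_reply_address", lambda m: not m.get("reply_to"), 0.5),
]

THRESHOLD = 5.0  # assumed cutoff; a real deployment would tune this


def score_message(message):
    """Return the total score and the names of the rules that fired."""
    hits = [(name, weight) for name, test, weight in RULES if test(message)]
    return sum(weight for _, weight in hits), [name for name, _ in hits]


def classify(message):
    """Route a message to 'junk' when its score reaches the threshold."""
    total, _ = score_message(message)
    return "junk" if total >= THRESHOLD else "inbox"


suspect = {"subject": "BUY NOW", "body": "Cheap Viagra here",
           "from": "xkqzvbnm@example.com", "reply_to": ""}
print(score_message(suspect), classify(suspect))
```

Because every rule and weight is plain data, a user or administrator could add, remove, or re-weight rules without touching the scoring logic, which is the kind of configurability the paragraph above describes.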

Blocking mail through filters at either the ISP or the recipient level violates the classic end-to-end arguments of the Internet.

VI. Bayesian Filter Solution

Bayesian filters attack Spam with a statistical approach. Rather than assigning a score to an e-mail based on its characteristics, as Spam Assassin does, a Bayesian filter assigns the e-mail a probability of being Spam. The argument is that a rule-based score is arbitrary and does not measure anything real, whereas a probability does. According to Paul Graham*, out of 1000 Spam messages, Bayesian filters miss fewer than 5 with zero false positives.

Bayesian filters look at a corpus of e-mails, some Spam and some good e-mails. Based on the composition of each e-mail, a hash table is created for Spam and a second is created for the good mails. These tables are created by parsing the words and headers from each mail. A third hash table is then created by mapping each entry to the probability that an e-mail containing it is Spam. When a new mail arrives, it is parsed into tokens and compared against this hash map.
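The three tables can be sketched as follows. This is a simplified illustration under stated assumptions, not the exact procedure from Graham's essay: the tokenizer, the choice to count each token once per mail, and the clamping of probabilities to the range 0.01 to 0.99 are assumptions made for the example.

```python
import re
from collections import Counter


def tokenize(text):
    """Split a mail into lowercase word-like tokens (a deliberately crude parser)."""
    return re.findall(r"[a-z0-9$'-]+", text.lower())


def count_tokens(corpus):
    """Map each token to the number of mails in the corpus that contain it."""
    counts = Counter()
    for mail in corpus:
        counts.update(set(tokenize(mail)))
    return counts


def build_probability_table(spam_corpus, good_corpus):
    """Third table: token -> estimated probability that a mail containing it is Spam."""
    spam_counts = count_tokens(spam_corpus)   # first table
    good_counts = count_tokens(good_corpus)   # second table
    table = {}
    for token in set(spam_counts) | set(good_counts):
        spam_freq = spam_counts[token] / len(spam_corpus)
        good_freq = good_counts[token] / len(good_corpus)
        p = spam_freq / (spam_freq + good_freq)
        # Assumed clamp so that no single token is treated as absolute proof.
        table[token] = min(0.99, max(0.01, p))
    return table


table = build_probability_table(
    spam_corpus=["cheap pills now", "cheap offer"],
    good_corpus=["meeting now", "lunch offer tomorrow"],
)
print(sorted(table.items()))
```

In this toy corpus a token seen only in Spam ends up near 0.99, a token seen only in good mail ends up near 0.01, and a token that appears equally often in both sits at 0.5.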

The advantage of Bayesian filters is that the filter can be tailored to the individual e-mail user. With rule-based filters, the same rules apply to everyone. This allows Spammers to draft their e-mails to work around the rules. Since everybody has their own set of e-mails coming in with different vocabulary, Bayesian filters will mold to that. Bayesian filters assign a probability to each word based on its hash tables. Often, the word “sex” appears in Spam. For most people, this word would be assigned a high probability because it appears frequently in mails designated as Spam and rarely in the good e-mails. If, however, the user used this word frequently in personal e-mails, it would have a neutral probability. The filter considers the highest and lowest 15 or so words on the probability scale to determine the e-mail’s overall Spam probability.
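The final step, combining the most telling words into one probability, might look like the sketch below. It uses the naive Bayes combination described in Graham's essay, the product of the probabilities divided by that product plus the product of their complements. The default of 0.4 for unseen words and the function name are assumptions for illustration.

```python
from math import prod


def spam_probability(tokens, table, n_interesting=15, unknown=0.4):
    """Combine per-token probabilities into one overall Spam probability.

    Only the n_interesting tokens whose probability lies furthest from the
    neutral value 0.5 are used. Table values must lie strictly between 0 and 1.
    """
    probs = [table.get(token, unknown) for token in set(tokens)]
    probs.sort(key=lambda p: abs(p - 0.5), reverse=True)
    top = probs[:n_interesting]
    spam_like = prod(top)
    good_like = prod(1 - p for p in top)
    return spam_like / (spam_like + good_like)


# Hypothetical per-user tables: the same word can be telling for one user
# and neutral for another.
typical_user = {"sex": 0.99, "meeting": 0.01}
other_user = {"sex": 0.5, "meeting": 0.01}
print(spam_probability(["sex"], typical_user))
print(spam_probability(["sex"], other_user))
```

For the first user the single word pushes the mail to a probability of 0.99, while for the second user the same word leaves the mail at the neutral 0.5, which is the per-user adaptation the paragraph above describes.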

Bayesian filters offer a solution that can adapt to the new methods by which a Spammer sends mail. The advantage is that the filter system automatically adapts to the user’s mailing habits.

VII. Charging Money for E-mailing & White Lists

E-mail presents a channel of communication at a near zero cost to the sender. Millions of e-mails can be sent at almost no cost. Bulk e-mail is economically effective even if only a few of those million e-mails generate revenue. The solution: charge the sender of those e-mails and make bulk mailing uneconomical. The cost moves from the recipient to the sender.

The proposed solution, as discussed by Walter Bright*, involves charging $0.01 per e-mail sent. This would make the cost of sending mass e-mail advertisements not worth the sales generated in return. In theory, the cost to the average person would be well worth it considering the time spent sorting through Spam and the money spent on other filtering solutions. To minimize the cost to the user, whitelists can be used. If an e-mail sender is on this list, they would not be required to pay for each e-mail sent.
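A rough calculation shows why a one-cent charge changes the economics. Only the $0.01 charge comes from the proposal above. The response rate, the revenue per sale, and the Spammer's current sending cost are hypothetical figures chosen to illustrate the arithmetic.

```python
def campaign_profit(emails_sent, cost_per_email, response_rate, revenue_per_sale):
    """Profit of one bulk mailing: sales revenue minus sending cost."""
    revenue = emails_sent * response_rate * revenue_per_sale
    cost = emails_sent * cost_per_email
    return revenue - cost


def break_even_response_rate(cost_per_email, revenue_per_sale):
    """Smallest response rate at which a mailing stops losing money."""
    return cost_per_email / revenue_per_sale


EMAILS = 1_000_000
RESPONSE_RATE = 1 / 100_000   # assumed: 10 sales per million mails
REVENUE_PER_SALE = 20.0       # assumed dollars per sale

today = campaign_profit(EMAILS, 0.00001, RESPONSE_RATE, REVENUE_PER_SALE)
charged = campaign_profit(EMAILS, 0.01, RESPONSE_RATE, REVENUE_PER_SALE)
print(f"profit today: ${today:,.0f}; profit with $0.01 charge: ${charged:,.0f}")
print(f"break-even response rate: {break_even_response_rate(0.01, REVENUE_PER_SALE):.4%}")
```

Under these assumptions the mailing earns about $190 today but loses about $9,800 once each message costs a cent, and the Spammer would need roughly one sale per 2,000 mails just to break even.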

To implement such a system, ISPs would need to set up a payment scheme that would charge the sender. The charge would show up on their monthly bill. Charges would be shared among ISPs since not all e-mail originates and ends with users of the same ISP. The ISP would benefit from this system through a decrease in bandwidth used from bulk mailing as well as money generated from the senders of this e-mail.

There seem to be many potential issues with implementing a process that charges the sender of an e-mail. First, a user could forge the return address on the e-mail, which could result in several problems, including charging an innocent party that had no intention of sending mail. Second, if a single ISP were to enforce this system, others would not necessarily follow, users would grow weary of the additional charges, and that particular ISP would lose its customer base. Third, the costs to the ISP for tracking whitelists, billing, settling disputes, and monitoring e-mail traffic and origin would be far too high.

VIII. E-mail Protocol Overhaul: TRIPOLI

TRIPOLI stands for “Empowered E-mail Environment”. It is aimed at solving many problems currently associated with e-mail, including Spam, by implementing a new architecture for transmitting e-mail. TRIPOLI utilizes a payload identification token that facilitates cryptographically linking every e-mail message. Identities would be verified and tokens issued by third parties. When e-mail is delivered, it includes a token that links the mail to the sender.
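The idea of a token that cryptographically links a message to a verified sender can be illustrated with a keyed hash. This is not TRIPOLI's actual token format or protocol, which this paper does not specify. It is a stand-in using HMAC-SHA256 from the Python standard library, with a hypothetical issuing authority that holds a secret key and performs both issuing and verification.

```python
import hashlib
import hmac


def issue_token(authority_key: bytes, sender_id: str, message: bytes) -> str:
    """Hypothetical third party: bind a verified sender identity to one message."""
    digest = hashlib.sha256(message).hexdigest()
    payload = f"{sender_id}:{digest}".encode()
    return hmac.new(authority_key, payload, hashlib.sha256).hexdigest()


def verify_token(authority_key: bytes, sender_id: str, message: bytes, token: str) -> bool:
    """Check that the token matches both the claimed sender and the message body."""
    expected = issue_token(authority_key, sender_id, message)
    return hmac.compare_digest(expected, token)


key = b"demo-key-held-by-the-authority"   # illustrative secret only
token = issue_token(key, "alice@example.com", b"Lunch on Friday?")
print(verify_token(key, "alice@example.com", b"Lunch on Friday?", token))
print(verify_token(key, "mallory@example.com", b"Lunch on Friday?", token))
```

A token issued for one sender and one message fails verification if either the sender address or the message body is changed, so a forged return address would be detectable. A real design would more likely use public-key signatures so that recipients can verify tokens without access to the authority's secret.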

IX. Comparisons of Costs/Benefits of Spam Solutions

In terms of enforce-ability, filters ranked above the other solutions because they are configured by the end user. Enforcing the “Controlling the Assault of Non-Solicited Pornography and Marketing Act”, like most laws, will be challenging to say the least. Establishing a system of charging for sending e-mails would also prove difficult, as clever Spammers would soon find a way around the system.

To implement and maintain a solution, the dollar costs would be greatest for setting up an e-mail charge system, followed by enforcing U.S. laws, TRIPOLI, and filters. Filters, however, are a solution to Spam after it has made its mark on costs to ISPs for delivering Spam. To the user, they’re a low cost solution. If these filters are effective and mainstream, then they may prove the small costs of sending Spam mail too large for Spammers and put an end to it.

A system of e-mail charges would be the most difficult solution to implement. The new architecture proposed by TRIPOLI would also be a costly and challenging implementation. Filters, both Bayesian and rule-based, are normally set up by system administrators with expertise in the area. Finally, Spam laws require no implementation.

False positives occur with rule-based and Bayesian filters. Due to their nature, which involves blocking an already sent message, this risk exists. In other solutions that act as deterrents and stop the Spam before it’s even sent, false positives do not exist. Laws, TRIPOLI, and a charge system would be safe with respect to collecting all important personalized e-mail.

Filters act after the mail has been delivered and do not have any effect on the bandwidth consumed by delivering these messages. This could be disputed by claiming that if people don’t read junk mail, Spammers won’t send it. Without deterring Spammers, the network will be bogged down by this traffic.

Filters also require considerable intervention by the user. A constant eye must be kept on the junk folder to ensure the recovery of any important false positives that may have been accidentally filtered. In addition, changes in the tactics used by Spammers will create new mails that slip through and require new configurations. The Bayesian filter, on the other hand, is theoretically capable of adapting itself to new Spam messages.

Effectiveness is the most important criterion of all. This is difficult to address because two of the Spam solutions considered are not currently available. Spam laws, like most laws, will be broken and circumvented. I speculate that a charge system would also be circumvented. A recent study showed commercial Spam filters to be less than 87 percent effective*.

X. Conclusion

It must be understood that there is no solution that is 100 percent effective, nor is there any proposed solution that is both inexpensive and bulletproof. Solutions are effective to the point where Spam is manageable. In the future, as Spam generation efforts increase, it will be a combination of the filters described here and law enforcement that limits our costly exposure to junk e-mail. There are two ways to prevent Spam: discourage it from being sent in the first place or block it at the end point. A combination of discouragement, which will save time and bandwidth, along with blocking techniques will prove most effective in the battle against Spam. However, the single most effective method, after weighing the variety of characteristics, is the use of Bayesian Spam filters. These filters are capable of adapting to both the end user and the content of junk mail being sent. The effectiveness exceeds 99 percent and the risk of false positives is less than a tenth of a percent. The cost of purchasing one of these filters is low on a per user basis and can be of little concern to the end user. Most importantly, the end user has the ability to decide what to view.

Bibliography

Commentary/Facts:

Ismail, Izwan. Solution for Companies to Fight E-mail Spam. New Straits Times, Landover, MD, 03/09/2003.

Metz, Cade; Seltzer, Larry. More Ways to Slam the Spam. PC Magazine, 5/27/2003.

Romans, Christine; Kiernan, Pat. The Cost of Spam. The Money Gang (CNNfn), 01/06/2003.

http://www.inboxlock.com/Spam.html

Charge for E-mail:

http://www.walterbright.com/Spam.html

Bayesian Filters:

Graham, Paul. A Plan for Spam. August 2002.

http://www.paulgraham.com/Spam.html

Spam Laws:

http://www.Spamlaws.com/

http://msn-cnet.com.com/2100-1024_3-5110622.html?part=msn-cnet&tag=feed_2516&subj=ns_

Spam Blocking Applications:

http://au2.Spamassassin.org/index.html

TRIPOLI: