CAFE - Collaborative Agents for Filtering E-mails
Lorenzo Lazzari, Marco Mari and Agostino Poggi
Dipartimento di Ingegneria dell’Informazione
Università degli Studi di Parma
Parco Area delle Scienze 181/A, 43100 Parma, Italy
{lazzari, mari, poggi}@ce.unipr.it
Abstract
CAFE (Collaborative Agents for Filtering E-mails) is a multi-agent system that collaboratively filters spam from users’ mail streams. CAFE associates a proxy agent with each user; this agent acts as an interface between the user’s e-mail client (e.g. Microsoft Outlook, Eudora, etc.) and the e-mail server. With the support of other types of agents, the proxy agent classifies new messages into three categories: ham (good messages), spam and spam-presumed. The system analyzes every e-mail using essentially three kinds of approach: a first approach based on a hash function, a static approach using DNSBL (DNS-based Black Lists) databases, and a dynamic approach based on a Bayesian algorithm.
1. Introduction
In the past few years, Internet technology has radically affected our daily communication style: electronic mail (e-mail) is now used very extensively. This technology makes it possible to communicate with many people simultaneously so easily and cheaply that it is currently considered the foremost worldwide communication medium in the business sector.
However, the abuse of e-mail has the drawback that the volume of e-mails showing up in mailboxes has been increasing exponentially. Moreover, many e-mails are unsolicited: “spam mail” (or “junk mail” or “bulk mail”) is the general name used to denote these types of e-mail. Spam mails, by definition, are electronic messages posted blindly to thousands of recipients, usually for advertisement, and represent one of the most serious and urgent information overload problems.
As time goes on, a growing percentage of e-mail is spam, which increases the seriousness of the problem. In fact, apart from wasting time, spam costs money to users with dial-up connections, wastes bandwidth, and may expose under-aged recipients to unsuitable content.
In 1998, Cranor and LaMacchia found that about 10% of the incoming e-mails to the network was spam; in 2000, according to WorldTalk Corp., over 60 million business people were overwhelmed, since about 30% of total e-mail was spam [1]. More recently, Eugene Kaspersky, head of antivirus research at Kaspersky Labs, estimated that, in the spring of 2004, about 70% of all e-mail on the Internet was spam, while others estimate an even higher percentage.
Spam filtering is a difficult classification task for a variety of reasons. Spam is constantly changing as spam on new topics emerges. Also, spammers attempt to make their messages as indistinguishable from legitimate e-mail as possible and change the patterns of spam to foil anti-spam filters.
Industrial as well as academic research has addressed this problem with automated filtering methods that distinguish legitimate e-mail from spam.
In this paper, we present a multi-agent system, called CAFE (Collaborative Agents for Filtering E-mails), designed to address the spam problem using a collaborative approach. CAFE is developed as an implementation of a more general system, called RAVE (Remote Assistance Virtual Environment) [13][14], which supports communities of users during shared and personal projects.
The paper is organized as follows: section 2 gives a survey of current anti-spam methods and section 3 presents the CAFE system in detail. Finally, in section 4, we give some concluding remarks and discuss some similarities between CAFE and other existing anti-spam systems.
2. State of the art
Many methods have been proposed to solve the problem of spamming, but they are not completely satisfactory. We can group them into two broad categories: static methods and dynamic methods.
Static methods base their spam mail identification on a predefined address list or on a comparison of every message with the databases of DNSBL (DNS-based Black Lists) suppliers [2].
For instance, the Hotmail mail service allows a person to receive an e-mail only if his/her address is one of the recipient addresses, and otherwise treats the e-mail as spam [3]. Needless to say, most spam mails pass this test and some important mails are treated as spam. Also, some servers collect the addresses of reported spammers (people who send spam messages) and treat the e-mails coming from them as spam. However, spammers are aware of most of these methods. All these solutions fail to account for the dynamic nature of the problem, which greatly limits their effectiveness.
The results obtained with DNSBL databases are not satisfactory either, because these databases are built without considering the type of message sent; in this case too, every e-mail is treated in the same way, regardless of its structure and content.
Some more complex approaches are dynamic in nature. They take the contents of the e-mails into consideration and adapt their spam filtering decisions with respect to these contents. Most of them use general text categorization techniques by implementing machine learning methods. Several machine learning algorithms have been applied to text categorization (e.g. Lewis, 1992 [4]; Apte and Damerau, 1994 [5]; Dagan et al., 1997 [6]).
These algorithms learn to classify documents into fixed categories, based on their content, after being trained on manually categorized documents. Algorithms of this kind have also been used to thread e-mail (Lewis and Knowles, 1997), classify e-mail into folders (Cohen, 1996 [7]; Payne and Edwards, 1997), identify interesting news articles (Lang, 1995), etc. To the best of our knowledge, one of the first attempts to apply a machine learning algorithm to anti-spam filtering was made by Sahami et al. (1998) [8].
Sahami et al. trained a Naive Bayesian classifier (Duda and Hart, 1973; Mitchell, 1997) on manually categorized legitimate and spam messages, reporting impressive precision and recall on unseen messages. It may be surprising that text categorization can be effective in anti-spam filtering: unlike other text categorization tasks, it is the act of blindly mass-mailing a message that makes it spam, not its actual content. Nevertheless, it seems that the language of spam constitutes a distinctive genre, and that spam messages are often about topics rarely mentioned in legitimate messages, making it possible to train a text classifier for anti-spam filtering.
Bayesian algorithms have often been used by training a classifier on manually categorized normal and spam mails (e.g. Androutsopoulos et al., 2000 [9]; McCallum and Nigam, 1998 [10]; Sanchez et al., 2002 [11]), but they present some limitations. A main drawback of Bayesian filters is that they have the hardest time blocking messages that do not lexically look like spam, e.g., messages composed of a single line of text inviting the recipient to check out a URL. Also, Bayesian techniques may exhibit latency both in the initial training and in responding to messages built on previously unknown vocabularies. Finally, Bayesian filters can be bypassed by introducing noise into the most recognizable terms and by adding a relatively high number of random words to reduce detection power.
As an alternative to the content-based approach described above, an innovative collaborative approach has been developed that does not consider the content of the e-mail but depends on the collaboration of groups of users who share information about spam. When a new spam message appears, an early receiver of the spam shares a signature for that spam (typically one or more hash codes) with the rest of the group. If the other users also receive this message, their filters can identify it as spam based on the shared signature.
In this approach there are two key issues: an effective signature mechanism needs to be devised, and a process for sharing these signatures needs to be developed.
Spammers insert random characters into messages to foil hash-based signatures, so flexible and clever signatures are needed. The sharing of these signatures can be centralised through a clearing-house or it can be truly distributed using peer-to-peer techniques. The peer-to-peer solution has achieved considerable success as it overcomes the single point of failure typical of centralized architectures.
The dominant system in this area is Vipul’s Razor [12], also available as SpamNet. Vipul’s Razor uses a centralized clearing-house for sharing signatures and much of the research has focused on developing sophisticated signatures.
Starting from this collaborative approach, we introduce a multi-agent system called CAFE (Collaborative Agents for Filtering E-mails) in order to address the information overload problem caused by spam e-mails. CAFE is an implementation of a more general architecture called RAVE (Remote Assistance Virtual Environment) [13][14], a Web and multi-agent based system to support users during common projects or activities.
3. CAFE System
CAFE is a multi-agent system to collaboratively filter spam from users’ mail stream.
CAFE associates an E-mail Proxy Agent (EPA) with each user; this agent acts as an interface between the e-mail client (e.g. Microsoft Outlook, Eudora, etc.) and the e-mail server. This agent has a basic role in the whole system, because it is responsible for the final classification of the e-mails downloaded from the e-mail server into three categories: ham (good messages), spam and spam-presumed.
To make this decision, EPAs are supported by other types of agents, which analyze every e-mail using essentially three kinds of approach: a first approach based on a hash function, a static approach using the DNSBL databases, and a dynamic approach based on a Bayesian algorithm.
The system users also have a fundamental role, because their careful support makes the system more effective over time.
3.1. System Agents
The system is based on eight different kinds of agents:
E-Mail Proxy Agents, Digest Managers, Analysis Managers, DNSBL Agents, Bayesian Filter Agents, User Profile Managers, Starter Agent, Directory Facilitator.
E-mail Proxy Agents, which sit between the e-mail client and the e-mail server, are the proxy entities of the system. An EPA becomes active when the related user is on-line and opens his e-mail client software. In a traditional system the e-mail client connects to the server and downloads every new message present in the user mailbox; here there is no such direct connection. Instead, the new messages are downloaded by the EPA which, after an analysis phase, classifies them into different types and transfers them to the mail client.
To classify the e-mails, EPAs are supported by other types of agents. In particular, the analysis phase is performed on different levels with the collaboration of Digest Managers, Analysis Managers, Bayesian Filter Agents and DNSBL Agents.
Digest Managers are responsible for the first approach to e-mail classification. This approach is based on comparing the digest of each e-mail with the digests of messages known to be spam, which are stored in a database. A digest of an e-mail is a representation of the message as a single string of digits, created using a one-way hash function. Figure 1 gives a graphical representation of this method.
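The digest comparison can be illustrated with a minimal sketch. The paper does not name the hash function or the database technology, so SHA-256 and the in-memory `SpamDigestStore` class below are assumptions made only for illustration.

```python
import hashlib


def email_digest(raw_message: bytes) -> str:
    """Return a one-way digest of the raw message bytes.

    SHA-256 is an assumption: the paper only requires a one-way hash.
    """
    return hashlib.sha256(raw_message).hexdigest()


class SpamDigestStore:
    """Hypothetical in-memory stand-in for the spam digests database."""

    def __init__(self):
        self._digests = set()

    def add(self, digest: str) -> None:
        self._digests.add(digest)

    def remove(self, digest: str) -> None:
        self._digests.discard(digest)

    def contains(self, digest: str) -> bool:
        return digest in self._digests


# A message reported as spam is stored by digest only.
store = SpamDigestStore()
known_spam = b"Subject: Win a prize\r\n\r\nClick here now!"
store.add(email_digest(known_spam))

# An identical copy arriving later matches; a different message does not.
same_copy_matches = store.contains(email_digest(known_spam))
other_message_matches = store.contains(email_digest(b"Subject: hi\r\n\r\nLunch?"))
```

Note that an exact hash matches only byte-identical messages; a spammer who inserts random characters into each copy would defeat this simple version.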
Analysis Managers are responsible for the second level analysis, performed if the e-mail passes the first phase (i.e. the e-mail digest is not present in the spam digests database). The second level consists of a more “classic” approach based on a static and a dynamic method (see section 2). The Analysis Managers receive e-mails from E-mail Proxy Agents and forward them to DNSBL Agents, for the static analysis, and to Bayesian Filter Agents, for the dynamic Bayesian algorithm-based analysis. On the basis of the results of the two analyses, Analysis Managers give each message a score that indicates the probability that it is spam and communicate it to the related E-Mail Proxy Agent.
DNSBL Agents perform the static analysis of the e-mails. As the agents’ name suggests, the analysis is based on a comparison between the sources of the e-mails and the lists of blacklisted IPs and domains supplied by DNS-based Black Lists databases. After this static analysis, each message is given a score (we call it the “static score”), which the DNSBL Agents communicate to the Analysis Managers. The static score of an e-mail is 1 if the DNSBL Agents find a match, 0 otherwise.
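As an illustration of the static score, the sketch below follows the usual DNSBL convention: the octets of the sender's IPv4 address are reversed and prepended to the blacklist zone, and the address counts as listed if that name resolves. The zone `dnsbl.example.org` and the injectable `resolve` parameter are hypothetical; the paper does not say which DNSBL suppliers CAFE queries or how the lookup is implemented.

```python
import socket


def dnsbl_query_name(ipv4: str, zone: str) -> str:
    """Build the DNS name to look up: reversed IPv4 octets followed by the zone."""
    octets = ipv4.split(".")
    if len(octets) != 4 or not all(o.isdigit() and int(o) <= 255 for o in octets):
        raise ValueError(f"not an IPv4 address: {ipv4!r}")
    return ".".join(reversed(octets)) + "." + zone


def is_listed(ipv4: str, zone: str, resolve=socket.gethostbyname) -> bool:
    """Return True if the query name resolves, i.e. the address is blacklisted."""
    try:
        resolve(dnsbl_query_name(ipv4, zone))
        return True
    except OSError:  # socket.gaierror (name not found) is a subclass of OSError
        return False


def static_score(ipv4: str, zones, resolve=socket.gethostbyname) -> int:
    """Static score as defined in the text: 1 on any match, 0 otherwise."""
    return 1 if any(is_listed(ipv4, zone, resolve) for zone in zones) else 0


def fake_resolve(name: str) -> str:
    """Offline stand-in for a DNS lookup, used only for this example."""
    if name == "2.0.0.127.dnsbl.example.org":
        return "127.0.0.2"
    raise socket.gaierror("name not found")


listed = static_score("127.0.0.2", ["dnsbl.example.org"], fake_resolve)
unlisted = static_score("192.0.2.1", ["dnsbl.example.org"], fake_resolve)
```

Passing the resolver as a parameter keeps the example runnable without network access; a real agent would use an actual DNS lookup.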
Bayesian Filter Agents are responsible for the dynamic analysis. They take into consideration the contents of the e-mails and adapt their spam filtering decisions with respect to these contents. Like every classical Bayesian filter, a Bayesian Filter Agent, which is specific to one system user, needs an initial training to build a vocabulary of terms for that user, and enriches this vocabulary during the system's life. Using Naive Bayes probabilistic algorithms, these agents calculate, for each e-mail, a global probability that the message content is spam and, as in the static analysis, send it to the Analysis Managers. We call this value, which lies between 0 and 1, the “dynamic score”.
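A dynamic score of this kind can be sketched with a small multinomial Naive Bayes classifier. The tokenizer, the Laplace smoothing and the class name `NaiveBayesFilter` are assumptions chosen for illustration; the paper does not describe the exact variant used by the Bayesian Filter Agents.

```python
import math
import re
from collections import Counter


def tokenize(text: str):
    """Lower-case the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


class NaiveBayesFilter:
    """Per-user filter: trained on labelled messages, returns P(spam | text)."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.message_counts = {"spam": 0, "ham": 0}

    def train(self, text: str, label: str) -> None:
        self.word_counts[label].update(tokenize(text))
        self.message_counts[label] += 1

    def dynamic_score(self, text: str) -> float:
        total = sum(self.message_counts.values())
        if total == 0:
            return 0.5  # untrained filter: no evidence either way
        vocabulary = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        log_posterior = {}
        for label in ("spam", "ham"):
            # Laplace-smoothed class prior and word likelihoods.
            log_p = math.log((self.message_counts[label] + 1) / (total + 2))
            n_words = sum(self.word_counts[label].values())
            denominator = n_words + len(vocabulary) + 1  # +1 slot for unseen words
            for word in tokenize(text):
                log_p += math.log((self.word_counts[label][word] + 1) / denominator)
            log_posterior[label] = log_p
        # Normalise in log space to avoid underflow on long messages.
        top = max(log_posterior.values())
        spam = math.exp(log_posterior["spam"] - top)
        ham = math.exp(log_posterior["ham"] - top)
        return spam / (spam + ham)


bayes = NaiveBayesFilter()
bayes.train("cheap pills buy now", "spam")
bayes.train("buy cheap watches now", "spam")
bayes.train("meeting agenda for monday", "ham")
bayes.train("lunch on monday?", "ham")
spam_like = bayes.dynamic_score("buy cheap pills")
ham_like = bayes.dynamic_score("monday meeting")
```

The `train` method corresponds to both the initial training and the later enrichment of the user's vocabulary.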
Fig. 1. E-mail digest.
For each e-mail analyzed, on the basis of the related static and dynamic scores, Analysis Managers calculate a global score and send it to the appropriate EPA for the final classification (see section 3.2 for details).
User Profile Managers are responsible for maintaining the users’ profiles, updating them according to the users’ ratings of received messages.
The Starter Agent is responsible for activating an E-Mail Proxy Agent and the corresponding Bayesian Filter Agent when a user wants to connect to his mailbox.
Finally, the Directory Facilitator is responsible for informing an agent about the addresses of the other agents active in the system (e.g., an E-Mail Proxy Agent can ask for the address of a Digest Manager, of an Analysis Manager, etc.).
Figure 2 gives a graphical representation of the CAFE architecture, focusing on the interactions between agents.
Note that a CAFE platform can be distributed over different computation nodes. Moreover, in figure 2 a group of three users or agents indicates that there can be one or more users or agents of that kind.
3.2. System Behavior
The system behavior can be described fairly completely by following the analysis and categorization process of a single e-mail.
Before describing the system behavior, we have to discuss an important point: as we said in section 3, the final classification divides messages into three types: ham, spam and spam-presumed. We believe that a “binary” classification into ham and spam is too restrictive, especially when the message passes the first level analysis and a more classical analysis based on static and dynamic methods is required, since these methods frequently give unsatisfactory results. For this reason, an e-mail is treated as spam only if it does not pass the first level analysis, that is, if the e-mail’s digest is already present in the spam digests database. If instead an e-mail is flagged as probable spam only by the Analysis Manager, the system treats it as spam-presumed. Once the messages are divided into the three types, the user has the fundamental role of checking the spam-presumed folder and, for each e-mail, notifying the system (by simply forwarding the e-mail) of the correct type of the message (spam or not). Similarly, the user has to inform the system if he finds a spam e-mail in the ham folder or vice versa.
Fig. 2. CAFE platform architecture.
The description of the system behavior can be divided in the following steps:
1) user log-in and E-mail Proxy Agent activation;
2) e-mail digest-based analysis;
3) static/dynamic analysis (if necessary);
4) e-mail classification and user evaluation.
User log-in and E-mail Proxy Agent activation: when an on-line user, through his e-mail client software, requests to download new messages from his mailbox, the E-mail Proxy Agent and the Bayesian Filter Agent are activated by the Starter Agent. The E-mail Proxy Agent, receiving the user’s username and password from the e-mail client, connects to the e-mail server, retrieves the new e-mail (we suppose that the user’s mailbox contains only one new message), and calculates the message digest.
E-mail digest-based analysis: the message digest is given to a Digest Manager, which is responsible for the first level analysis. If a match is found between the e-mail digest and one of the spam digests present in the database, the message is certainly spam, so the E-Mail Proxy Agent labels the e-mail as spam (e.g. by writing “SPAM” in the subject) and sends it to the e-mail client.
Otherwise, if the Digest Manager agent does not report a match, the second level analysis must be performed.
Static/dynamic analysis: the message is sent to an Analysis Manager agent, which is responsible for the second level analysis of the message. At this point the e-mail is subjected to two different analysis methods: a DNSBL-based static method, performed by a DNSBL Agent, and a dynamic method, performed by the Bayesian Filter Agent related to the user.
As we briefly said in the previous section, depending on the analysis results, the DNSBL Agent and the Bayesian Filter Agent report to the Analysis Manager a static and a dynamic score indicating the spamming probability. On the basis of these two values, using a threshold-based approach, the Analysis Manager decides whether the analyzed message is ham or spam-presumed, and communicates this result to the E-mail Proxy Agent responsible for that message. As in the spam case, if the e-mail is classified as spam-presumed, the e-mail subject is changed to “SPAM-PRESUMED”, while if the analyzed message is legitimate, it is sent unchanged to the e-mail client.
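The two-level decision can be sketched as follows. The paper states only that the global score is derived from the static and dynamic scores and compared with a threshold; the weighted average, the weight of 0.5, the threshold of 0.5 and the way the tag is prefixed to the subject are all assumptions made for this illustration.

```python
def global_score(static_score: int, dynamic_score: float,
                 static_weight: float = 0.5) -> float:
    """Combine the two scores; a weighted average is assumed here."""
    return static_weight * static_score + (1.0 - static_weight) * dynamic_score


def classify(digest_is_known_spam: bool, static_score: int,
             dynamic_score: float, threshold: float = 0.5) -> str:
    """Return 'spam', 'spam-presumed' or 'ham' for one message."""
    # First level: only a digest match yields a definite 'spam' verdict.
    if digest_is_known_spam:
        return "spam"
    # Second level: the threshold decides between spam-presumed and ham.
    if global_score(static_score, dynamic_score) >= threshold:
        return "spam-presumed"
    return "ham"


def label_subject(subject: str, category: str) -> str:
    """Tag the subject so that the mail client can sort by a simple rule."""
    if category == "spam":
        return "SPAM: " + subject
    if category == "spam-presumed":
        return "SPAM-PRESUMED: " + subject
    return subject  # legitimate messages are sent unchanged


first_level_hit = classify(True, 0, 0.0)
blacklisted_sender = classify(False, 1, 0.2)
clean_message = classify(False, 0, 0.2)
```

With these hypothetical weights a blacklisted sender alone is enough to reach the threshold; other weightings would make the Bayesian score more or less decisive.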
E-mail classification and user evaluation: as we said, the result of the whole analysis is summarized in the final message subject, so the mail client can sort the different types of e-mail (ham, spam and spam-presumed) into different folders with a simple rule based on the message subject.
A basic aspect of the system is the importance of users’ evaluations. Users have a fundamental role because, if they collaborate and give good support, the system can reach its maximum effectiveness in a very short time. The users’ role is to notify the system about the messages classified as spam-presumed and to report anything wrong in the e-mail classification.
More in detail, we can divide different cases:
4.1) a spam-presumed message is spam;
4.2) a spam-presumed message is ham;
4.3) a ham message is spam;
4.4) a spam message is ham.
A spam-presumed message is spam: the message digest is inserted in the spam digests database, so in the future a similar message will be directly treated as spam.
In addition, the Bayesian Filter Agent related to the user is trained accordingly.
A spam-presumed message is ham: the Bayesian Filter Agent related to the user is notified to treat, in the future, a similar e-mail as ham.
A ham message is spam: as in the first case, the message digest is inserted in the spam digests database and the Bayesian Filter Agent related to the user is trained accordingly.
A spam message is ham: if a message is treated as spam by the system, it means that its digest is present in the spam digests database, so in this case the message digest is immediately deleted from the database. Also in this case, the Bayesian Filter Agent related to the user is notified.
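The four feedback cases can be condensed into a single update routine. This is a minimal sketch: the `FeedbackState` container and the `training_log` list, which stands in for the notifications sent to the user's Bayesian Filter Agent, are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class FeedbackState:
    """Hypothetical stand-in for the spam digests database and the filter."""
    spam_digests: set = field(default_factory=set)
    training_log: list = field(default_factory=list)  # (digest, label) pairs


def apply_feedback(state: FeedbackState, digest: str,
                   system_label: str, user_label: str) -> None:
    """Update the state after the user reports the true type of a message."""
    if user_label not in ("ham", "spam"):
        raise ValueError("user_label must be 'ham' or 'spam'")
    if system_label == user_label:
        return  # the classification was already correct
    if user_label == "spam":
        # Cases 4.1 and 4.3: remember the digest for future first-level hits.
        state.spam_digests.add(digest)
    elif system_label == "spam":
        # Case 4.4: the digest was wrongly listed, so delete it.
        state.spam_digests.discard(digest)
    # All four cases: the user's Bayesian filter learns the correct label.
    state.training_log.append((digest, user_label))


state = FeedbackState()
apply_feedback(state, "d1", "spam-presumed", "spam")  # case 4.1
apply_feedback(state, "d2", "spam-presumed", "ham")   # case 4.2
apply_feedback(state, "d3", "ham", "spam")            # case 4.3
apply_feedback(state, "d1", "spam", "ham")            # case 4.4
```

Case 4.2 is the only one that leaves the digests database untouched, because a spam-presumed message was never stored there.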
Because the system's effectiveness is strongly affected by users’ behaviour in spam reporting, the system (through the User Profile Managers) maintains a profile of each user, in which the user is characterized by a “credibility” percentage. This value changes according to the number of unreliable spam reports made by the user (e.g. if a user reports as spam an e-mail that one hundred other users treat as ham, the report is unreliable).
When this value falls below a certain threshold, every spam report made by the user is ignored, and the related e-mail digest is not inserted in the spam digests database.
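The credibility mechanism might look like the sketch below. The paper gives neither the formula nor the threshold value, so defining credibility as the fraction of reliable reports and using a threshold of 0.5 are assumptions, as is the `UserProfile` class itself.

```python
from dataclasses import dataclass


@dataclass
class UserProfile:
    """Hypothetical profile kept by a User Profile Manager."""
    total_reports: int = 0
    unreliable_reports: int = 0

    @property
    def credibility(self) -> float:
        """Fraction of reliable reports; a new user starts fully credible."""
        if self.total_reports == 0:
            return 1.0
        return 1.0 - self.unreliable_reports / self.total_reports

    def record_report(self, unreliable: bool) -> None:
        self.total_reports += 1
        if unreliable:
            self.unreliable_reports += 1


def accept_spam_report(profile: UserProfile, spam_digests: set,
                       digest: str, threshold: float = 0.5) -> bool:
    """Insert the digest only if the reporting user is credible enough."""
    if profile.credibility < threshold:
        return False  # report ignored, database left unchanged
    spam_digests.add(digest)
    return True


digests = set()
trusted = UserProfile()
careless = UserProfile(total_reports=10, unreliable_reports=8)
trusted_accepted = accept_spam_report(trusted, digests, "d1")
careless_accepted = accept_spam_report(careless, digests, "d2")
```

How a report is judged unreliable (for example, by comparison with how many other users treat the same message as ham) is left outside this sketch.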
4. Conclusions
In this paper, we presented a system called CAFE (Collaborative Agents for Filtering E-mails), whose aim is to filter the e-mails downloaded by the system users from the e-mail server in order to find spam messages. With it we propose a solution to the spam problem, which is becoming one of the most serious information overload problems, that joins together essentially two components: a collaborative approach based on the users’ spam notifications, and a multi-agent architecture, responsible for message analysis performed on different levels.
Taking a look at some related work, we find other systems that try to resolve the spam problem in a similar way. In particular, Vipul’s Razor [12] is a distributed, collaborative, spam detection and filtering network.
Similarly to CAFE, Razor, through users’ contributions, establishes a distributed catalogue of spam in propagation that is consulted by e-mail clients to filter out known spam. Another interesting system that presents some similarity with CAFE is the multi-agent architecture described in [15]. Also in this case, we have a multi-agent system designed to collaboratively filter spam from users’ mail streams. The multi-agent architecture is very simple: there are a personal agent associated with each user and a facilitator. Personal agents communicate with each other as well as with the facilitator in order to share their own knowledge, and the facilitator accumulates all pieces of feature information about spam e-mail posted by personal agents. However, here the shared information is only related to a Bayesian analysis of the messages; no hash function-based method is used.
In these works, we find the same two components (i.e. collaborative approach and multi-agent architecture) present in CAFE; however, none of them integrates different approaches to e-mail analysis, and none of them presents the advanced level of collaboration between agents and users that marks CAFE as a complete and effective anti-spam system.
References
[1] Internet E-mail Corporate Usage Report. Available from http://www.securitymanagement.com/library/worldtalk0200.html, 2000.
[2] Information, commentary and opinion on DNSBL spam blacklists. Available from http://www.dnsbl.com/, 2004.
[3] MSN Hotmail. Available from http://www.hotmail.com.
[4] D. Lewis. Feature selection and feature extraction for text categorization. Proceedings of Workshop on Speech and Natural Language, Harriman, New York, pages 212-217, 1992.
[5] C. Apte, F. Damerau, S.M. Weiss. Automated learning of decision rules for text categorization. ACM Trans. Inf. Syst., 12 (3), pages 233-251, 1994.
[6] I. Dagan, Y. Karov, D. Roth. Mistake-driven learning in text categorization. In C. Cardie, R. Weischedel (Eds.) Proceedings of Conference on Empirical Methods in Natural Language Processing, Rhode Island, pages 55-63, 1997.
[7] W. Cohen. Learning rules that classify e-mail. In M.A. Hearst, H. Hirsh (Eds.) Proceedings of AAAI Spring Symposium on Machine Learning in Information Access, Stanford, CA, pages 18-25, 1996.
[8] M. Sahami, S. Dumais, D. Heckerman, E. Horvitz. A bayesian approach to filtering junk e-mail. AAAI Workshop on Learning for Text Classification, 1998.
[9] I. Androutsopoulos, J. Koutsias, K.V. Chandrinos, G. Paliouras, D. Spyropoulos. An evaluation of Naive Bayesian anti-spam filtering. In G. Potamias, V. Moustakis, M. Van Someren (Eds.) Proceedings of Workshop on Machine Learning in the New Information Age, Barcelona, pages 9-17, 2000.
[10] A. McCallum, K. Nigam. A comparison of event models for Naive Bayes text classification. In M. Sahami (Ed.) Proceedings of AAAI Workshop on Learning for Text Categorization, Madison, WI, pages 41-48, 1998.
[11] S.N. Sanchez, E. Triantaphyllou, D. Kraft. A feature mining based approach for the classification of text documents into disjoint classes. Inf. Process. Manage., 38 (4), pages 583-604, 2002.
[12] Vipul's Razor spam detection and filtering network. Available from http://razor.sourceforge.net/.
[13] M. Mari, A. Negri, A. Poggi. Agent-Based Support for Open Communities. Submitted to AAMAS, Utrecht, the Netherlands, 2005.
[14] M. Mari, L. Lazzari, A. Negri, A. Poggi, P. Turci. A Multi-Agent System to Support Remote Software Development. WOA, Torino, 2004.
[15] J. Jung, G. Jo. Collaborative Junk E-Mail Filtering Based on Multi-agent Systems. Web Communication Technologies and Internet-Related Social Issues - HIS, 2003.