Online Email Classification Using Ant Clustering Algorithm


International Journal of Emerging Technology and Advanced Engineering

Website: www.ijetae.com (ISSN 2250-2459, Volume 2, Issue 9, September 2012)


Neeti Saxena 1, Bharati Verma 2, Nitin Shukla 3

1 Department of CS, SRITM Jabalpur, INDIA.

2 Department of CS, GEC Gwalior, INDIA.

3 Department of CS, SRITM Jabalpur, INDIA.

Abstract— Classification of email messages and web page content is essential to many tasks in web information retrieval, such as maintaining email directories, web directories, and focused crawling. The uncontrolled nature of email and web content presents additional challenges compared with traditional text classification, but the common fields of email messages and the interconnected nature of hypertext also provide features that can assist the process. As we review work in web page classification, we note the importance of these web-specific features and algorithms, describe state-of-the-art practices, and track the underlying assumptions behind the use of information from neighboring pages. Email classification has been studied in the past, but that research has focused on spam filtering. Although spam filtering is a classification task similar to the proposed algorithm, there are several basic differences; primarily, the attributes used by spam filters are quite different, and spam filters do not process the content of emails in order to classify them. Also, ant colony optimization techniques have not been applied to email classification in the past. The aim of the research is to provide a handy tool that lets end users classify their emails into various folders on the server, so that users do not need to separate them manually. This will be beneficial for HR managers, or for any other organization where thousands of emails are received per day and emails are classified manually.

Keywords- Ant Clustering, Email, Optimization, classification, web, web page

I. INTRODUCTION

Email has become one of the fastest and most efficient forms of communication. However, the growing number of email users and the high volume of email messages can lead to unstructured mailboxes, email congestion, email overload, and unprioritised messages, and this has resulted in a dramatic increase in email classification and management tools during the past few years.

Our contributions are the use of empirical analysis to select an optimum, novel collection of features of users' email contents that enables rapid detection of the most important words and phrases in emails, and a demonstration of their effectiveness on two equal sets of emails used as training and testing data.

Email has become an efficient and popular communication mechanism as the number of Internet users increases. Email management has therefore become an important and growing problem for individuals and organizations, because email is prone to misuse. One of the most pressing problems is disordered, congested, and unstructured email in mailboxes.

It may be very hard to find an archived email message, or to search for previous emails with specified contents or features, when the mails are not well structured and organized.

Schuff et al [23] stated that “Emails are widely used to synchronize real-time communication, which is inconsistent with its primary goals”. Email messages are designed to be sent, to accumulate in a repository, and to be periodically collected and read by the recipient, which suits content such as the details of a vacation or a meeting's upcoming agenda. Since most people rely on email for efficient and effective communication, mailboxes may become congested. Messages range from static organizational knowledge to conversations; with such a broad range of messages, users may find it difficult to prioritize and successfully process the contents of new incoming messages. It may also be difficult to find a previously archived message in the mailbox. Kushmerick [24] stated that “the ubiquity of email and its convenience as knowledge management tools make it unlikely that users' behavior will change as falling bandwidth and disk storage prices further reduce the incentive to steer away from using email as a document storage system”. In this work, a new method for managing information in email and reducing email overload is developed by classifying emails based on the importance of words in the email messages and deriving the closest class to which a mail could belong: critical, urgent, very important, important, or not important.


Email classification presents challenges because of the large number and variety of features in the dataset and the large number of mails. The applicability of existing classification techniques to email datasets has been limited because the large number of features makes most mails indistinguishable. In many email datasets, only a small percentage of the total features may be useful in classifying mails, and using all the features may adversely affect performance. The quality of the training dataset determines the performance of both the email classification algorithms and the feature selection algorithms. An ideal training dataset for each category will include all the important terms and their possible distribution in the category. Classification algorithms such as Neural Network (NN), Support Vector Machine (SVM), and Naïve Bayesian (NB) are currently used on various datasets and show good classification results, as reported by Young et al [16]. There have, however, been few studies applying back propagation techniques (BPT) to email classification. The main disadvantage of BPT is that it requires considerable time for parameter selection and training. On the other hand, previous research has shown that back propagation in neural networks (NNs) can achieve very accurate results, sometimes more accurate than those of symbolic classifiers. NNs have been successfully applied in many real-world tasks. In this paper we present a back propagation technique, an NN-based system for automatic email classification.

II. EMAIL CLASSIFICATION CHALLENGES

The characteristics of emails differ significantly from those of ordinary documents, and as a consequence email classification poses certain challenges not often encountered in text or document classification. Some of the differences and challenges are:

1. Each user's mailbox is different and is constantly growing. Email contents vary over time as new messages are added and old messages are deleted. A classification scheme that can adapt to varying email characteristics is important.

2. Manual classification of emails is based on personal preferences, and hence the criteria used may not be as simple as those used for text classification. This distinction needs to be taken into account by any technique proposed for email classification.

3. The information content of emails varies significantly, and other factors, such as the subject field, sender, CC field, BCC field, and the person the email is addressed to, play an important role in classification. This is in contrast to documents, which are richer in content, resulting in easier identification of topics or context.

4. Emails can be classified into folders and could also be classified into subfolders within a folder. The differences between the emails classified into subfolders may be purely semantic (e.g., meetings within a travelling folder, conference expenses within a travelling folder, and many more).

III. RELATED WORK

Rambow et al. apply a machine learning approach to email summarization [3]. Wan et al. study decision-making summarization for email conversations [4]. Several systems for automatic email classification have been developed. Cohen [5] used RIPPER to induce keyword-spotting rules. Bayesian approaches have been used, e.g. [6], as well as nearest-neighbor techniques [7]. Previous studies have been limited because they have failed to explore the use of feature selection and phrase representations. A cross-experiment in [8] compared 14 classification methods, including decision tree, Naïve Bayesian, Neural Network, linear least squares fit, and Rocchio; KNN was one of the top performers, and it performs well in scaling up to very large and noisy classification problems. Approaches to filtering junk email are considered in [9,10,11,12,13], which show that filtering emails involves the deployment of data mining techniques. The hashing trick is a natural fit for spam filtering. Nearly all commonly used spam classifiers, such as Naive Bayes, logistic regression, and support vector machines [14], require that emails be represented in the bag-of-words format. A major disadvantage of the bag-of-words representation is the necessity of a dictionary data structure for mapping words to vector indices.

In collaborative spam filtering, the set of possible word tokens is generally very large. As a result, the dictionary can use up a significant portion of a server's memory.
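The hashing trick mentioned above avoids the dictionary altogether by hashing each token directly to a vector index. The following is a minimal sketch under stated assumptions: the function name `hashed_bag_of_words`, the bucket count, and the choice of CRC32 as the hash are illustrative and are not taken from the cited work.

```python
import zlib

def hashed_bag_of_words(text, num_buckets=1024):
    """Map a text to a fixed-length count vector without storing a dictionary."""
    vector = [0] * num_buckets
    for token in text.lower().split():
        # crc32 is deterministic across runs, unlike Python's built-in hash()
        index = zlib.crc32(token.encode("utf-8")) % num_buckets
        vector[index] += 1
    return vector

vec = hashed_bag_of_words("cheap pills cheap offer", num_buckets=16)
print(len(vec))  # 16: the vector length is fixed in advance
print(sum(vec))  # 4: every token is counted, even if two tokens collide
```

Memory use depends only on `num_buckets`, not on the vocabulary size; the price is that distinct words may occasionally share an index.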


Kiritchenko et al. [15] showed good performance, reducing the classification error by discovering temporal relations in an email sequence in the form of temporal sequence patterns and embedding the discovered information into content-based learning methods.

Cui et al. [16] proposed a model based on a Neural Network to classify personal emails, with Principal Component Analysis used as a preprocessor of the NN to reduce the data in terms of both dimensionality and size. In a classification experiment for spam filtering, the decision tree showed better results than the NB, NN, or SVM classifiers [17].

Aery et al. [26] elaborated further on email classification, stating that “depending upon the mechanism used, email classification schemes can be broadly categorized into: i) Rule based classification, ii) Information Retrieval based classification and iii) Machine Learning based classification techniques.

Rule Based Classification: Rule based classification systems use rules to classify emails into folders. William Cohen [27] uses the RIPPER learning algorithm to induce 'keyword spotting rules' for email classification. i-ems [27] is a rule based classification system that learns rules based only on sender information and keywords. Ishmail [28] is another rule-based classifier integrated with the Emacs mail program Rmail.

Information Retrieval Based Classification: Segal and Kephart [13] have used the TF-IDF classifier as the means for classification in SwiftFile, implemented as an add-on to Lotus Notes. It predicts three likely destination folders for every incoming email. The TF-IDF classifier performs well in the absence of a large training set and also when the amount of training data increases, adding to the heterogeneity of a folder.

Machine Learning Based Classification: The ifile system by Rennie [9] uses the naive Bayes approach for effective training, providing good classification accuracy, and for performing iterative learning. Re:Agent by Boone [2] first uses the TF-IDF measure to extract useful features from emails and then predicts the actions to be performed using the trained data and a set of keywords. Mail Agent Interface (Magi) by Payne and Edwards [10] uses the symbolic rule induction system CN2 [3] to induce a user-profile from observations of user interactions”.
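Several of the systems quoted above rely on TF-IDF term weights. As an illustration only, the following sketch computes one common variant of the weighting (raw term frequency times log(N / document frequency)); it is not the SwiftFile or Re:Agent implementation, and the function name `tf_idf` and whitespace tokenization are assumptions made for the example.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: the number of documents that contain each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

docs = ["meeting agenda monday", "flight booking monday", "meeting room booking"]
weights = tf_idf(docs)
# "agenda" occurs in one document and "monday" in two,
# so "agenda" receives the larger weight in the first email.
print(weights[0]["agenda"] > weights[0]["monday"])  # True
```

Terms that appear in every document receive a weight of zero under this variant, which is why very common words contribute little to the folder prediction.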

IV. EMAIL CLASSIFICATION

Classification of text in an email message is an example of supervised learning that seeks to build a probabilistic model of a function that maps emails to classes. In supervised learning of text in email messages, where each entire email represents one example to be classified, a learning algorithm is presented with a set of already classified, or labeled, examples. This set is called the training set. A number of classified emails are removed from the training set prior to model building, to be used for testing the model's performance.


This process is called n-times cross validation, where “n” is the number of times the example set is partitioned. We produced 1000 models for evaluation using this process, and thus obtained 1000-times cross validation. Once our model had been constructed, it was used to predict the classification of future email messages. The accuracy of our models is largely dependent on:

• The performance of our back propagation algorithm

• The important word selection using information retrieval

• The “representativeness” of the training data with respect to newly acquired email data to be classified.

The more representative the training data, the better the performance. A larger number of training examples is often better, because a larger sample is likely to be more reflective of the actual distribution of the data as a whole.
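One common way to realize the repeated hold-out partitioning described above is k-fold cross validation, in which every labeled example is used for testing exactly once. The sketch below is an assumption about how such a partitioning could be written, not the procedure used in this paper; `k_fold_splits` is a hypothetical helper name.

```python
def k_fold_splits(examples, k):
    """Yield (train, test) pairs; each example lands in exactly one test fold."""
    # Deal the examples into k folds in round-robin order
    folds = [examples[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        yield train, test

emails = [f"email_{i}" for i in range(10)]
for train, test in k_fold_splits(emails, k=5):
    print(len(train), len(test))  # prints "8 2" on each of the 5 iterations
```

A model would be trained on `train` and scored on `test` in each iteration, and the scores averaged to estimate accuracy on unseen email.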

V. EMAIL MESSAGE TRANSFORMATION

Classification machine learning algorithms operate on numerical quantities as inputs. A labeled example is often a vector of numeric attribute values with one or more attached labels.

For some methods, like naïve Bayesian, integer counts of the attribute values are all that is required. Attributes in these cases can be nominal types, but they are still ultimately converted to numeric counts before processing. The conversion of text records to vectors of numeric attributes is a multistage process of feature construction which employs several key transformations. There are many approaches to this process; one of the most common is to treat each email's content as a “bag of words.” Each unique word constitutes an attribute (a position in our example vector). The number of occurrences of a word in an email (frequency of occurrence) is the attribute's value for that email message. Email messages are therefore represented as vectors of numeric attributes, where each attribute value is the frequency of occurrence of a distinct term. This set of email message vectors is often referred to as a vector space. Algorithms that operate on such representations are said to be using vector space models of the data. Furthermore, some types of words, or series of words, may be preferable for learning.

For example, nouns or noun phrases are often preferred. Part-of-speech identification algorithms and lexical/semantic dictionaries are typically used to provide additional information about terms. Also, very common words like “and” and “the” are often filtered out using a stop-word list to improve performance. All of these transformations are performed before any learning takes place. The steps involved in this pre-processing can be quite numerous and often constitute the bulk of the overall model building process [9].
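The bag-of-words transformation with stop-word filtering described above can be sketched as follows. This is a minimal illustration: the stop-word list, the whitespace tokenizer, and the helper names `build_vocabulary` and `to_vector` are assumptions for the example and do not come from the paper.

```python
from collections import Counter

# Illustrative stop-word list; a real system would use a much larger one
STOP_WORDS = {"and", "the", "a", "an", "of", "to", "is", "in"}

def tokenize(email):
    """Lowercase, split on whitespace, and drop stop words."""
    return [t for t in email.lower().split() if t not in STOP_WORDS]

def build_vocabulary(emails):
    """Each unique remaining word becomes one attribute (vector position)."""
    return sorted({term for email in emails for term in tokenize(email)})

def to_vector(email, vocabulary):
    """Attribute value = number of occurrences of the term in the email."""
    counts = Counter(tokenize(email))
    return [counts[term] for term in vocabulary]

emails = ["the meeting is on monday",
          "flight and hotel booking",
          "meeting about the flight"]
vocab = build_vocabulary(emails)
print(vocab)
# ['about', 'booking', 'flight', 'hotel', 'meeting', 'monday', 'on']
print(to_vector(emails[0], vocab))
# [0, 0, 0, 0, 1, 1, 1]
```

The resulting list of vectors is the vector space on which a classifier would then be trained.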

VI. APPLICATIONS OF MESSAGE CLASSIFICATION

Our email message classification has many direct applications such as:

• Classification of email according to its content into classes (Critical, Urgent, Very important, Important, and Not important), or

• Classification of streams of emails, with various important words and phrases, to identify particular items of interest.

We implemented supervised learning for email word extraction, which also supports higher-level objectives like:

• Information extraction: the process of extracting bits of specific information from unstructured email messages.

• Classification methods: these are often used to identify the parts of message content that qualify for extraction, for example extracting the date, time, location, and important words such as meeting, interview, surgery appointment, flight booking, conference booking and venue, credit card deadline, etc.


i) picking up an isolated item, and ii) dropping the item where more similar items are present. Assuming the ants or agents can only handle one item at a time and only one type of item exists in the environment, they defined each action in terms of a probabilistic function.

As such, the item will be moved until the agent reaches a denser region. Deneubourg et al.'s model was extended by Lumer & Faieta to include a distance function, d, between data objects, hence removing the need to assume type homogeneity and making the model more general for the purposes of exploratory data analysis. They provided an algorithm that maps the inherent similarity of data objects onto a lower dimensional space, e.g. 2-D. As in a connectionist's self-organizing map, such a mapping is able to capture the similarity relationships between higher-dimensional objects and project them onto a 2-D grid. Other similar work includes the AntClass clustering algorithm, which combines an ant colony with the partitioned K-Means algorithm. The ant colony of AntClass differs from Lumer & Faieta's model in that ants are allowed to carry more than a single object at a time, and have local memory and other heterogeneous features.
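The probabilistic pick-up and drop actions described above are commonly written in the ant clustering literature as response functions of a local density f. The sketch below follows that common formulation and is not the implementation of this paper: the constants `k1`, `k2`, `alpha`, and the 3x3 neighborhood size `cells` are illustrative assumptions, and the Deneubourg-style squared form is used for both actions.

```python
def neighborhood_density(item, neighbors, distance, alpha=0.5, cells=9):
    """Lumer & Faieta-style density: similarity to nearby items, averaged over
    the cells of the neighborhood and clipped at zero."""
    total = sum(1.0 - distance(item, other) / alpha for other in neighbors)
    return max(0.0, total / cells)

def pick_probability(f, k1=0.1):
    """Close to 1 when the item is isolated (f near 0)."""
    return (k1 / (k1 + f)) ** 2

def drop_probability(f, k2=0.15):
    """Grows towards 1 when many similar items are nearby (f near 1)."""
    return (f / (k2 + f)) ** 2

def dist(a, b):
    """Toy distance between scalar items; emails would use a vector distance."""
    return abs(a - b)

dense = neighborhood_density(1.0, [1.0] * 9, dist)   # surrounded by identical items
sparse = neighborhood_density(1.0, [], dist)         # no neighbors at all
print(dense, sparse)                                 # 1.0 0.0
print(pick_probability(sparse), drop_probability(sparse))  # 1.0 0.0
```

An unladen ant therefore almost always picks up an isolated item, and a laden ant becomes increasingly likely to drop its item as the surrounding items become more similar to it.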

VIII. PROPOSED SYSTEM

We are implementing the above system using the Java language, and a maildir database downloaded from the internet is used for the purpose. The algorithm works in the following phases:

1. Training Phase: The user types a category in the box provided and selects the folder in which the training emails are stored by clicking the select button beside it.

2. Testing Phase: After training, testing is done using emails whose categories are already known. After testing is done, the system lists the documents and the categories found by applying ant clustering.

3. Actual Document Processing Phase: In this phase the user can select any emails (whose categories are not known), and the system applies ant clustering to find their categories.

For simulation purposes, a multithreaded environment is created to process multiple documents. For the implementation of the system, we will use the ant clustering mechanism to classify emails on the basis of the categories learnt in the first phase.
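The multithreaded processing of several documents can be illustrated with a thread pool. The sketch below is written in Python for brevity, whereas the system itself is described as a Java implementation. The `classify_email` function is a hypothetical keyword-overlap placeholder standing in for the ant clustering step, and the category keyword sets are invented for the example.

```python
from concurrent.futures import ThreadPoolExecutor

def classify_email(text, category_keywords):
    """Placeholder classifier: choose the category sharing the most words
    with the email. It stands in for the ant clustering step."""
    words = set(text.lower().split())
    return max(category_keywords, key=lambda c: len(words & category_keywords[c]))

def classify_all(emails, category_keywords, workers=4):
    """Classify many emails concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda e: classify_email(e, category_keywords), emails))

# Hypothetical categories, as if learnt in the training phase
categories = {"travel": {"flight", "hotel", "booking"},
              "hr": {"resume", "interview", "salary"}}
emails = ["flight booking confirmed", "interview schedule and resume"]
print(classify_all(emails, categories))  # ['travel', 'hr']
```

Because `pool.map` returns results in input order, each email can be paired with its predicted category without extra bookkeeping.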

IX. RESULTS & DISCUSSIONS

The aim of the research is to provide a handy tool that lets end users classify their emails into various folders on the server, so that users don't need to separate them manually. This will be beneficial for HR managers, or for any other organization where thousands of emails are received per day and emails are classified manually.

We expect the results of the research to be very favorable. This is distinctive research, as very few people have worked on this scenario of email classification.

Once implemented successfully, the algorithm can be further adapted to give users flexibility according to their needs. Automatic transfer of emails into the classified folders is further work that could provide more convenience to the users of the system.

REFERENCES

[1] 2010 MAAWG Email Security Awareness and Usage Report, Messaging Anti-Abuse Working Group/Ipsos Public Affairs.

[2] Register of Known Spam Operations (ROKSO).

[3] N. Ducheneaut and V. Bellotti. E-mail as habitat: an exploration of embedded personal information management. Interactions v.8, n.5, pp.30-38, 2001.

[4] Stephen Wan and Kathleen McKeown. Generating overview summaries of ongoing email thread discussions. In Proceedings of COLING'04, the 20th International Conference on Computational Linguistics, August 23–27 2004.

[5] Owen Rambow, Lokesh Shrestha, John Chen, and Christy Lauridsen. Summarizing email threads. In HLT/NAACL, May 2–7 2004.

[6] D. J. Cook and L. B. Holder. Graph based data mining. IEEE Intelligent Systems, 15(2):32-41, 2000.

[7] Kamber M, Winstone L and Hanj – Generation and decision Tree Induction – Efficient classification in Data mining

[8] Y. Yang, “An Evaluation of Statistical Approaches to Text Categorization,” Journal of Information Retrieval, Vol 1, No. 1/2, 1999, pp. 67-88.

[9] W. Cohen, “Learning rules that classify e-mail,” In Proc. of the AAAI Spring Symposium on Machine Learning in Information Access, 1996.

[10] Y. Diao, H. Lu, and D. Wu, “A comparative study of classification based personal e-mail filtering,” In Proc. Of fourth PAKDD, 2000.

[11] M. Sahami, S. Dumais, D. Heckerman, and E. Horvitz, “A Bayesian Approach to Filtering Junk E-Mail,” In Proc. of the AAAI Workshop on Learning for Text Categorization.

[12] T. Fawcett, “in vivo spam filtering: A challenge problem for


[13] K. Gee, “Using latent semantic indexing to filter spam,” In Proc. of eighteenth ACM Symposium on Applied Computing, Data Mining Track, 2003.

[14] T. Hastie, R. Tibshirani, and J. Friedman. The elements of statistical learning. Springer New York, 2001.

[15] S. Kiritchenko, S. Matwin, and S. Abu-Hakima, 2004. Email Classification with Temporal Features. Intelligent Information Systems, pp. 523-533.

[16] B. Cui, A. Mondal, J. Shen, G. Cong, and K. Tan, 2005. On effective Email classification via Neural networks. In Proceedings of DEXA, pp. 85-94.

[17] Seongwook Youn and Dennis McLeod. A Comparative Study for Email Classification.

[18] The spambase dataset from http://archive.ics.uci.edu/ml/datasets/Spambase

[19] Mark Hopkins, Erik Reeber, George Forman, Jaap Suermondt Hewlett-Packard Labs.

[20] Introduction to Algorithms (Cormen, Leiserson, and Rivest) 1990, Chapter 16 "Greedy Algorithms" p. 329.

[21] J48 classifier, http://www.d.umn.edu/~padhy005/chapter05.html

[22] WEKA: Data Mining Software in Java (2008), http://www.cs.waikata.ac.nz/ml/weka

[23] C. Ferri, P. Flach, and J. Hernandez-Orallo. Learning Decision Trees using the area under the ROC Curve. In Proc. of the 19th Intl. Conf. on Machine Learning, pages 139-146, Sydney, Australia, July 2002.

[24] Schuff, D., O. Turetke, D. Croson, F 2007, 'Managing Email Overload: Solutions and Future Challenges', IEEE Computer Society, vol. 40, No. 2, pp. 31-36.

[25] Kushmerick, N., Lau, T. 2005, 'Automated Email Activity Management: An Unsupervised learning Approach', Proceedings of 10th International Conference on Intelligent User Interfaces, ACM Press, pp. 67-74.

[26] Based on Structure and Content. In Proceedings of the Fifth IEEE international Conference on Data Mining (pp. 18-25). Washington, DC: IEEE Computer Society.

[27] Boone, G. 1998, 'Concept Features in Re: Agent, An Intelligent Email Agent', Proceedings of 2nd International Conference on autonomous agents, ACM Press, pp. 141-148.

[28] Ramos, J. (2002). Using TF-IDF to Determine Word Relevance in Document Queries, Department of Computer Science, Rutgers University, Piscataway, NJ, 08855.

[29] Yun, F.Y., Cheng, H.L., Wei, S. (2008). Email Classification Using Semantic Feature Space, Proceedings of the 2008 International Conference on Advanced Language Processing and Web Information Technology, IEEE Computer Society Washington, DC, USA, pp.32-37.