AN ENHANCED APPROACH FOR CONTENT FILTERING IN SPAM DETECTION


Shashi Kant Rathore

Department of Computer Science & Engineering, Lovely Professional University,

Jalandhar, Punjab

[email protected]

Jyoti

Department of Computer Science & Engineering, Lovely Professional University,

Jalandhar, Punjab

[email protected]

Amrit Kaur

Department of Computer Science & Engineering, Lovely Professional University,

Jalandhar, Punjab

[email protected]

Abstract- Image spam is a recent variant of spam and poses a great threat to email communication. Initially, spam emails contained only textual messages and were easily detected by text-based spam filters. To avoid detection, spammers adopted a new approach: including their advertisements as part of an embedded image file attachment (.gif, .jpg, .png, etc.). This paper therefore concentrates on identifying and avoiding partial image spam. To detect image spam, a new framework based on email features is derived. The classification module classifies the image content into text, image and special character classes. A trained filter detects spam images on the basis of feature detection, and text is recognized by a content scanning technique. This combined approach enhances the detection capability of the spam filter.

Keywords: E-mail spam, image spam, image spam detection, filtering and classification

1. INTRODUCTION

The bulk of spam messages sent on a daily basis is disturbing and poses a great threat to the utility of email communication. Spam emails are usually classified into the following categories: commercial advertisements, lottery winning announcements, pharmacy and health, online degrees, bank and finance, and adult content. To block email spam, companies and service providers have basically relied on detecting keywords that are frequently used in spam, such as ‘click me’ or ‘earn money’.

Since text-based spam filters are used to detect text-based spam, spammers have come up with a new approach to send their spam: image spam. Text-based spam detection techniques fail to detect image-based email spam.

Image spam is a type of junk email that replaces text with images. The image is generally attached to an email, and the user receives the spam by opening the email or double-clicking on it. When image spam is dispersed over the network, it is a larger drain on network resources than textual spam, because an image file is larger than a text file and therefore requires more bandwidth, causing a greater degradation of transfer rates. Image spam may arrive in different formats such as .gif, .bmp, .jpg, .png and .jpeg. Classifying image spam into groups is difficult because an image has many properties, for example brightness, radian and contrast. There are many methods for image spam detection, but those methods only detect images of text, humans or body appearances.

Image spam can be classified into two categories: pure image and mixed image. Pure image spam contains only images; mixed image spam consists of images together with a text message attached to an email. Image spam also falls into many content categories, such as advertisement image spam and money-making image spam.

Therefore, this paper proposes a new method that enhances image spam detection, so that not only images of text or human pictures but also other images, such as images of commercials, can be detected.

Every day we face new internet spam that is heavily obfuscated, and it has become a serious problem across the network. Various techniques have been recommended to detect the various types of spam. Image spam causes numerous problems because there are many varieties of images to guard against.

Zhe Wang suggested a detection technique for image spam based on near-duplicate detection. This technique introduces a non-spam image repository. When the user receives an email with an image, the image is compared with the images in the image database as part of the spam filtering process. The received image is eliminated when its features differ from those of the images in the database.

G. Frencesso proposed two different image processing techniques for detecting image spam composed of text and images. A component-based method built on SIFT is used to detect the image spam. This method detects image spam that was produced by converting text content into an image and embedding the spam message in an email. Some image spam was identified using a boosting tree, which is a learning-based prototype system. The detection system is called the image spam tracker.

N. Jordan introduced a new method that converts a .jpeg image to ASCII using JP2A and identifies image spam by using the properties of an image as attributes. He applied file properties and a histogram algorithm for image spam detection; this method is called the FH algorithm. The algorithm is one part of a 2-step image spam classification, while the second part is the histogram comparison, in which both gray and color histogram models are used for image testing.

Although image spam is growing rapidly on the internet, image messages that are not spam, called ham, also travel over the network and consume bandwidth.

B. Battista proposed an image spam filtering mechanism that targets content obscuring techniques. The technique is based on the use of image classifiers and aims to distinguish between ham and spam images through low-level characteristics of the image text. Three kinds of image text defects are considered: first, the presence of small fragments around characters; second, the presence of large fragments around characters; and third, large background shapes overlapping with characters.

2. METHODOLOGY ADOPTED

This paper proposes a method to detect spam from the body of the email, known as partial image spam detection using the SVM. It distinguishes uncertified (spam) email from certified email. Whenever an email arrives at the email server, it is sent to the classification module, which separates the content according to its characteristics, namely images and text. The result from the classification module is passed to the evaluation module, where the email is determined to be certified or uncertified. There are two databases: one contains the spam keywords and the other contains the spam images. When an image arrives at the classification module, it is stored and converted to matrix form. On the basis of the data in the matrices, the classification module classifies the image content into text, image and special character classes.

To detect image spam, a database built from previously seen image spam is maintained. The features of an incoming image are compared with those in the image database, and spam is detected on that basis. To compare the features, we use traversing algorithms that traverse the matrices and compare the extracted features with those stored in the database. The following are the image features:

Figure 1: System Architecture of the Image Spam Detection

[Figure 1 blocks: Coming Email; Classification Module (Text, Special Character Feature, Image Feature); Evaluation Module (Text Evaluation, Image Evaluation); Text spam database; Image spam database; outputs: Certified Email, Uncertified Email]

A. Extent of text feature: To detect the spam we need to determine the extent of text in the spam image. This can be interpreted as the proportion of the text region relative to that of the whole image. Text may be an integral part of natural scene images, in the form of road marks, and of synthetic images such as graphic images, in the form of logos (Bagga, 2004).

B. Color saturation features: Color saturation refers to the intensity of the colors across the pixels of an image. The term hue refers to the color itself, while saturation describes the intensity or purity of that color. When the color of an image is fully saturated, the image is considered a spam image. As the saturation increases the colors appear purer, and as the saturation decreases the colors appear paler (Bagga, 2004).
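The two features above can be illustrated with a minimal sketch. It assumes the image has already been converted to matrix form: `pixels` is a list of (r, g, b) values in the range 0 to 255, and `text_mask` is a 0/1 matrix marking which pixels belong to a text region. Both names, and the way the text mask is obtained, are assumptions made for illustration; the paper does not specify them.

```python
import colorsys


def mean_saturation(pixels):
    """Average HSV saturation (0.0 to 1.0) over (r, g, b) tuples in 0..255."""
    if not pixels:
        return 0.0
    total = 0.0
    for r, g, b in pixels:
        # colorsys expects channel values scaled to 0.0..1.0
        _, s, _ = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
        total += s
    return total / len(pixels)


def text_extent(text_mask):
    """Fraction of pixels flagged as text in a 2-D 0/1 matrix (hypothetical input)."""
    total = sum(len(row) for row in text_mask)
    if total == 0:
        return 0.0
    return sum(sum(row) for row in text_mask) / total


# A pure red pixel is fully saturated, a gray pixel is not saturated at all.
print(mean_saturation([(255, 0, 0), (128, 128, 128)]))
# One text pixel out of four.
print(text_extent([[1, 0], [0, 0]]))
```

How these two values are combined into a spam decision is left open here, since the cut-off values are not given in the text.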

2.1. SVM

A Support Vector Machine (SVM) is a supervised classifier studied in pattern recognition, artificial intelligence and computer vision. In this framework the SVM is applied to character recognition: it reads the white and black pixels of an image and distinguishes numerals and alphabetic characters. Recognition of this kind is a basic technology in advanced scanning applications and is widely used to convert documents into electronic files or to publish text on a website. Here the SVM is used to split text and images. Figure 2 illustrates an example of an image spam email.

Figure 2: Sample of the Image Spam

The SVM stage performs the electronic translation of scanned images of handwritten, typewritten or printed text into machine-encoded text. Figure 3 shows the result when Figure 2 is split into text and image using the SVM.

Figure 3: Isolated text and image when SVM is applied

2.2 Classification Module

In this module, when an email arrives at the email server, the SVM extracts the content of the received email according to its features, as shown in Figure 1. The SVM produces two types of outcome, text and images, and the details of each type must be defined separately. The text message may contain special symbols and keywords. Examples of special symbols appearing in a text message are %, !, @ and #. Examples of keywords are win, join now, click me, call anywhere, click here, earn big and easy money.
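The text side of the classification module can be sketched as a simple scan for the special symbols and keywords listed above. The function name `scan_text` and the whole-word matching rule are assumptions made for illustration; the paper does not state how matches are counted.

```python
import re

# Symbols and keywords taken from the examples in the text.
SPECIAL_SYMBOLS = set("%!@#")
SPAM_KEYWORDS = ["win", "join now", "click me", "call anywhere",
                 "click here", "earn big", "easy money"]


def scan_text(text):
    """Return the special symbols and spam keywords found in the extracted text."""
    lowered = text.lower()
    symbols = sorted({ch for ch in text if ch in SPECIAL_SYMBOLS})
    # Whole-word matching (an assumption) so that "winner" does not count as "win".
    keywords = [kw for kw in SPAM_KEYWORDS
                if re.search(r"\b" + re.escape(kw) + r"\b", lowered)]
    return {"symbols": symbols, "keywords": keywords}


result = scan_text("Click here to WIN easy money!!! 100% free")
print(result)
```

The output of such a scan would then be handed to the evaluation module together with the image features.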

2.3. Database

The spam database is divided into two parts: a keyword database and an image spam database. The keyword database records all the keywords of advertised spam. A popular resource for collecting those keywords is the internet, for example the trash box and spam box of web mail services, shopping web sites and internet sites that publish spam lists. The image spam database stores only the images that are counted as spam.

Figure 4: Sample of Keyword Database

C                    D                    E
Call anywhere        Dating               Earn money
came up winner       Debt free            Earn big
casino               Doctor approved      Earn extra money
chartroom            Doctor prescribed    Earn degree
click me             Degree program
Click here           Depression
Career opportunity

Figure 5: Sample of Image Spam Database


The image spam database contains images and their properties, such as RGB color, contrast, radian and brightness. Figure 4 shows a sample of the keyword database.

2.4. Evaluation Module

The evaluation module is the last module in the detection process, where all the text and images obtained from the email are identified as belonging to a certified or an uncertified email, as shown in Figure 1. All the converted texts and images are compared with the data stored in the database. A linear search algorithm is used to search the images in the database and to analyze the true content of the image. The comparison starts by retrieving images from the database using the linear search; every stored image is visited, and the content of the incoming image is analyzed by comparing all its attributes, for example contrast, brightness and radian, with the image spam attributes in the database.
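The linear search described above can be sketched as follows. The record layout, the normalisation of attribute values to the range 0 to 1, the `tolerance` parameter and the sample records are all hypothetical; the paper names the attributes but does not say how closeness is measured.

```python
# Attributes named in the text; values are assumed to be normalised to 0..1.
ATTRIBUTES = ("contrast", "brightness", "radian")


def find_matching_spam(candidate, spam_db, tolerance=0.05):
    """Linear search: return the first stored record whose attributes
    all lie within `tolerance` of the candidate, or None if there is none."""
    for record in spam_db:
        if all(abs(candidate[k] - record[k]) <= tolerance for k in ATTRIBUTES):
            return record
    return None


# Hypothetical image spam database entries.
spam_db = [
    {"name": "ad_01", "contrast": 0.80, "brightness": 0.65, "radian": 0.40},
    {"name": "money_02", "contrast": 0.50, "brightness": 0.90, "radian": 0.70},
]

incoming = {"contrast": 0.52, "brightness": 0.88, "radian": 0.71}
match = find_matching_spam(incoming, spam_db)
verdict = "uncertified" if match else "certified"
print(verdict, match["name"] if match else None)
```

A linear search visits every stored image in the worst case, so its cost grows in proportion to the size of the image spam database.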

3. RESULTS

This section shows the results obtained when uncertified and certified mails are sent to the receiver by the sender.

3.1 Result when the Non-SPAM mails are sent:

In the first experiment, the certified mails are sent as a group, and the outcome for each mail is detailed in Table 1:

Email       Probability   Result   True/False
1st mail    .70           SPAM     TRUE
2nd mail    .41           HAM      FALSE
3rd mail    .53           HAM      FALSE
4th mail    .80           SPAM     TRUE
5th mail    .39           HAM      FALSE
6th mail    .78           SPAM     TRUE
7th mail    .88           SPAM     TRUE
8th mail    .74           SPAM     TRUE
9th mail    .77           SPAM     TRUE
10th mail   .81           SPAM     TRUE

Table 1: Result when the Non-SPAM mails are sent.

• Total number of certified emails sent = 10
• Number of emails found SPAM positive = 3
• Number of emails found SPAM negative = 7
• Overall throughput for certified mails = 76%

3.2 Result when SPAM mails are sent:

In the second experiment, the system was simulated with spam mails. Table 2 shows the result when the spam mails are sent.

E-mail      Probability of whole mail   Result   True/False
1st mail    .82                         SPAM     TRUE
2nd mail    .96                         SPAM     TRUE
3rd mail    .73                         SPAM     TRUE
4th mail    .78                         SPAM     TRUE
5th mail    .56                         HAM      FALSE
6th mail    .60                         SPAM     FALSE
7th mail    .77                         SPAM     TRUE
8th mail    .91                         SPAM     TRUE
9th mail    .43                         HAM      FALSE
10th mail   .69                         SPAM     TRUE

Table 2: Result when SPAM mails are sent.

• Total number of certified mails sent = 10
• Number of emails found spam positive = 2
• Number of emails found spam negative = 8
• Overall throughput for certified mails = 85%
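The SPAM and HAM labels in Tables 1 and 2 are consistent with a cut-off on the probability column lying somewhere between .56 and .60: every mail at .60 or above is labelled SPAM and every mail at .56 or below is labelled HAM. The paper does not state the threshold, so the value 0.60 in the sketch below is an assumption, as is the function name `label`.

```python
def label(probability, threshold=0.60):
    """Label a mail SPAM when its probability reaches the assumed threshold."""
    return "SPAM" if probability >= threshold else "HAM"


# Probability column of Table 2.
table2 = [0.82, 0.96, 0.73, 0.78, 0.56, 0.60, 0.77, 0.91, 0.43, 0.69]
labels = [label(p) for p in table2]
print(labels)
print(labels.count("SPAM"), labels.count("HAM"))
```

Applied to the probabilities of Table 2, this rule reproduces the Result column of that table.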

4. CONCLUSION

Spam is a critical problem across the network because it has progressed from text to images. Some email spam filtering software cannot identify image spam. Image spam erodes limited network resources and creates trouble for people. This paper therefore proposes a new technique to identify and avoid received image spam across the network, using a feature extraction and classification framework to target the image spam currently seen on the internet.

5. REFERENCES

[1]. B. Battista, F. a. (2011). Improving Image Spam Filtering Using Image Text Features. In Proc. 7th International Conference on Email and


[2]. Bagga, J. A. (2004). Categorizing images in web Documents. IEEE Multimedia, pp. 22-30.

[3]. Belding-Royer, I. D. (2004). AODV Routing Protocol Implementation Design. In C.E. Perkins, Ad hoc Networking, 173-219.

[4]. Botvich, J. M. (2008). A Trust Based System for Enhanced Spam Filtering. Journal of Software, Vol. 3, No. 5.

[5]. Christina V, K. S. (2010). A Study on Email Spam Filtering Techniques. International Journal of Computer Applications.

[6]. Community Workshop Series. (n.d.). Retrieved from http://www2.lib.unc.edu/cws/handouts/: http://www2.lib.unc.edu/cws/handouts/emailbasics.pdf

[7]. E. Damiani, S. D. (2003). An Open Digest Based Technique For Spam Detection.

[8]. G. Frencesso, P. a. (2009). Using heterogeneous features for anti spam filters. 19th International Conference on Database and Expert System Application, (pp. 670-674).

[9]. Hasaan T., C. P. (n.d.). Towards Eradication of spam: development and evaluation of an intelligent SPAM filter. Edith Cowan University, Perth, Western Australia.

[10]. introEmail. (n.d.). Retrieved from http://www.albanypubliclibrary.org/: http://www.albanypubliclibrary.org/documents/pcc/IntroEmail.pdf

[11]. Jan Gobel, T. H. (2008). Towards Proactive Spam Filtering.

[12]. Khosri, A. (31, 2007). An Overview of Content Based Spam Filtering Techniques. Informatica.

[13]. N. Jordan, M. N. (2011). Image Spam ASCII to the Rescue! 3rd International Conference on Malicious and Unwanted Software, (pp. 65-68).

[14]. Pour, A. N. (2012). Minimizing the Time of Spam Mail Detection by Relocating Filtering System to the Sender Mail Server. International Journal of Network Security and Its Applications.

[15]. Shashi Kant Rathore, P. J. (August 2011). A New Probability based Analysis for Recognition of Unwanted Emails. International Journal of Computer Applications, 4.

[16]. Sheu, J.-J. (2009). An Efficient Two-Phase Spam Filtering Method Based on E-Mails Categorization. International Journal of Network Security.

[17]. Sunil Taneja, D. A. (2011). End to End Delay Analysis of Prominent On-Demand Routing Protocols. IJCST Vol. 2, Issue 1.

[18]. Tech-FAQ. (n.d.). Retrieved from faq.com/: http://www.tech-faq.com/how-the-email-system-works.html

[19]. Thamarai Subramaniam, H. A. (2010). Overview of textual Anti-Spam Filtering Techniques. International Journal of the Physical Science Vol. 5, pp. 1869-1882.


[20]. V Sathiya, M. T. (2011). Partial Image Spam E-Mail Detection using OCR. IJETT, June.

[21]. Vinit Garg, M. K. (2011). Advanced Survey of Mobile Ad-hoc Network. IJCST Vol. 2.

[23]. W., S. (2004). Data and Computer Communications. Prentice Hall, 7th Edition.

[24]. Zhe Wang, W. J. (n.d.). Filtering Image spam with near duplicate detection. Computer Science Department, Princeton University.