AN ENHANCED APPROACH FOR CONTENT FILTERING IN SPAM
DETECTION
Shashi Kant Rathore
Department of Computer Science & Engineering, Lovely Professional University, Jalandhar, Punjab
Jyoti
Department of Computer Science & Engineering, Lovely Professional University, Jalandhar, Punjab
Amrit Kaur
Department of Computer Science & Engineering, Lovely Professional University, Jalandhar, Punjab
[email protected]
Abstract- Image spam is a recent variant of spam and poses a great threat to email communication. Initially, spam emails contained only textual messages and were easily detected by text-based spam filters. To avoid detection, spammers have adopted a new approach: they include their advertisements in an embedded image file attachment (.gif, .jpg, .png, etc.). This paper concentrates on identifying and avoiding partial image spam. To detect image spam, a new framework based on email features is derived. The classification module classifies the image content into text, image and special character classes. A trained filter detects spam images on the basis of feature detection, and text is recognized by a content scanning technique. This combined approach can enhance the detection capability of a spam filter.
Keywords: E-mail spam, image spam, image spam detection, filtering and classification
1. INTRODUCTION
The bulk of spam messages sent on a daily basis is disturbing and poses a great threat to the utility of email communication. Spam emails are usually classified into the following categories: commercial advertisements, lottery winning announcements, pharmacy and health, online degrees, bank and finance, and adult content. To block email spam, companies and service providers have relied on detecting keywords that are frequently used in email spam, such as 'click me' or 'earn money'.
Since text-based spam filters are used to detect text-based spam, spammers have come up with a new approach to send their spam: image spam. Text-based spam filtering techniques fail to detect image-based email spam.
Image spam is a type of junk email that replaces text with images. The image is generally attached to an e-mail, and the user receives the spam by opening the email or double-clicking on it. When image spam is distributed over the network, it is a larger drain on network resources than textual spam, because an image file is larger than a text file and therefore requires higher bandwidth and causes greater degradation of transfer rates. Image spam may come in different formats such as .gif, .bmp, .jpg, .png, .jpeg, etc. Classifying image spam is a complex process because an image has many properties, for example brightness, radian and contrast. There are many methods for image spam detection, but those methods only detect images of text, humans or body appearances.
Image spam can be classified into two categories: pure image and mixed image. Pure image spam contains only images; mixed image spam consists of images and a text message attached to an email. There are also many categories of image spam by content, such as advertisement image spam and money-making image spam.
Therefore, this paper proposes a new method that enhances image spam detection, so that not only images of text or human pictures but also other images, such as images of commercials, can be detected.
Every day we face new internet spam that is heavily obfuscated, and it has become a serious problem across the network. Various techniques have been recommended to detect the various types of spam. Image spam causes numerous problems because there are many varieties of images that can pose a threat.
(Z he Wang) suggested a detection technique for image spam using near-duplicate detection. This technique introduces a non-spam image repository. When the user receives an email with an image, the image is compared with the images in the image database for the spam filtering process. The received image is eliminated when its features differ from those of the images in the database.
(G.Frencesso) proposed two different image processing techniques to detect image spam composed of text and images. A component-based method built on SIFT is used to detect the image spam. This method detects image spam that was produced by converting content text to an image and embedding the spam message in an email. Some image spam was identified using a boosting tree, which is a learning-based prototype system. The detection system is called the image spam tracker.
(N.Jordan) introduced a new method which converts a .jpeg image to ASCII using JP2A and identifies image spam by using properties of the image as attributes. He applied file properties and a histogram algorithm for image spam detection; this method is called the FH algorithm. The algorithm is one part of a two-step image spam classification, while the second part is the histogram comparison, in which both gray and color histogram models are used for image testing. Although image spam is growing rapidly over the internet, another kind of unwanted image message, called HAM, also causes problems for the internet and slows down the bandwidth.
(B.Battista) proposed an image spam filtering mechanism targeting content obscuring techniques. This technique is based on the use of image classifiers. The method aims to distinguish between ham and spam images through low-level characteristics of the image text. Three types of image text are determined: first, the presence of small fragments around characters; second, the presence of large fragments around characters; and third, large background shapes overlapping with characters.
2. METHODOLOGY ADOPTED
This paper proposes a method to detect spam from the body of the email, known as partial image spam detection using the SVM. The uncertified (spam) email can be distinguished from the certified email. Whenever an email arrives at the email server, it is sent to the classification module, which separates the content according to its characteristics, namely images and text. The result from the classification module ends up at the evaluation module, where the email is determined to be a certified or an uncertified email. There are two types of databases: one contains the spam keywords and the other contains the spam images. When an image arrives at the classification module, the image is stored and converted to matrix form. On the basis of the data in the matrices, the classification module classifies the image content into text, image and special character classes.
To detect image spam, a database based on previous image spam is maintained. The features of an incoming image are compared with the image database, and on that basis spam is detected. To compare the features of the images, we use traversal algorithms that traverse the matrices and compare the features with those stored in the database. The following are the image features:
Figure 1: System Architecture of the Image Spam Detection
[Figure 1 components: Incoming Email; Classification Module (text, special character and image features); Evaluation Module (Text Evaluation, Image Evaluation); Text spam database; Image spam database; outputs: Certified Email, Uncertified Email]
A. Extent of text feature: To detect spam, we need to determine the extent of text in the spam image. This can be interpreted as the ratio of the text region to the whole image, with features extracted from within the text region. Text may be inherently present in natural scene images, in the form of road marks and logos, and in synthetic images such as graphic images (Bagga, 2004).
B. Color saturation features: Color saturation refers to the intensity of a color in the pixels of an image. The term hue refers to the color itself, while saturation describes the intensity or purity of that color. When the colors of an image are fully saturated, the image is considered a spam image. As the saturation increases the colors appear purer, and as the saturation decreases the colors appear paler (Bagga, 2004).
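The two features above can be sketched in code. The following is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the text region is already available as a binary mask (1 = text pixel) and that the image is a list of rows of (R, G, B) tuples with 0-255 channels. The function names `text_extent` and `saturation_stats`, and the 0.99 cutoff used for "fully saturated", are hypothetical.

```python
import colorsys

def text_extent(text_mask):
    """Fraction of image pixels that fall inside the text region.

    text_mask is a list of rows of 0/1 flags (1 = text pixel); how the
    mask is produced is outside the scope of this sketch.
    """
    total = sum(len(row) for row in text_mask)
    if total == 0:
        return 0.0
    text_pixels = sum(sum(row) for row in text_mask)
    return text_pixels / total

def saturation_stats(pixels, full_cutoff=0.99):
    """Return (mean saturation, fraction of fully saturated pixels).

    pixels is a list of rows of (R, G, B) tuples with 0-255 channels.
    full_cutoff is an assumed threshold, not a value from the paper.
    """
    sats = []
    for row in pixels:
        for r, g, b in row:
            # colorsys works on floats in [0, 1]; index 1 of HSV is saturation
            _, s, _ = colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)
            sats.append(s)
    if not sats:
        return 0.0, 0.0
    mean_sat = sum(sats) / len(sats)
    full = sum(1 for s in sats if s >= full_cutoff) / len(sats)
    return mean_sat, full

# Toy inputs: half of the mask is text; red and blue are fully saturated,
# gray and white have zero saturation.
mask = [[1, 1, 0, 0],
        [1, 1, 0, 0]]
image = [[(255, 0, 0), (128, 128, 128)],
         [(0, 0, 255), (255, 255, 255)]]
print(text_extent(mask))
print(saturation_stats(image))
```

Both values could then be stored as attributes of an image record and compared against the image spam database.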
2.1. SVM
The Support Vector Machine (SVM) is a topic of research in pattern recognition, artificial intelligence and computer vision. SVM is capable of reading white and black pixels in any image and can distinguish the exact numeric and alphabetic characters. SVM is a basic technology used in advanced scanning applications. Here, SVM is used to split text and images. SVM is widely used to convert documents into electronic files or to publish text on a website. Figure 2 illustrates an example of an image spam email.
Figure 2: Sample of the Image Spam
SVM is the electronic translation of scanned images of handwritten, typewritten or printed text into machine-encoded text. Figure 3 shows the result when Figure 2 is split into text and image using SVM.
Figure 3: Isolated text and image when SVM is applied
2.2. Classification Module
In this module, when an email arrives at the email server, the SVM translates or extracts the content of the received email based on its features, as shown in Figure 1. There are two types of outcomes from the SVM, text and images, and the details of each type must be defined separately. The text message may contain special symbols and keywords. Examples of special symbols appearing in a text message are %, !, @ and #. Examples of keywords are win, join now, click me, call anywhere, click here, earn big and easy money.
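The text side of this module can be illustrated with a short scan for keywords and special symbols. This is a minimal sketch, not the authors' code: the keyword and symbol lists are copied from the examples in the text, while the function name `scan_text` and the whole-word matching rule are assumptions made for illustration.

```python
import re

# Sample entries taken from the examples in the text; a real keyword
# database would be much larger.
SPAM_KEYWORDS = ["win", "join now", "click me", "call anywhere",
                 "click here", "earn big", "easy money"]
SPECIAL_SYMBOLS = "%!@#"

def scan_text(text):
    """Return (matched keywords, count of special symbols) for extracted text."""
    lowered = text.lower()
    found = []
    for kw in SPAM_KEYWORDS:
        # \b keeps "win" from matching inside words such as "window"
        if re.search(r"\b" + re.escape(kw) + r"\b", lowered):
            found.append(kw)
    symbols = sum(lowered.count(ch) for ch in SPECIAL_SYMBOLS)
    return found, symbols

keywords, symbols = scan_text("Click here to WIN easy money!!! 100% free")
print(keywords, symbols)
```

Matching on whole words is a design choice of this sketch; a plain substring test would flag harmless words that merely contain a keyword.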
2.3. Database
The spam database is divided into two parts: a keyword database and an image spam database. In the keyword database, all the keywords of advertised spam are recorded. A popular resource for collecting those keywords is the internet, for example the trash box and spam box of web mail services, shopping web sites, and internet sites that publish spam lists. The image spam database stores only the images that are counted as spam.
Figure 4: Sample of Keyword Database
C                    D                    E
Call anywhere        Dating               Earn money
came up winner       Debt free            Earn big
casino               Doctor approved      Earn extra money
chartroom            Doctor prescribed    Earn degree
click me             Degree program
Click here           Depression
Career opportunity
Figure 5: Sample of Image Spam Database
The image spam database contains images and their properties, such as RGB color, contrast, radian and brightness. Figure 4 shows a sample of the keyword database.
2.4. Evaluation Module
The evaluation module is the last module in the detection process, where all the text and images obtained from the email are used to identify it as a certified or an uncertified email, as shown in Figure 1. All the converted texts and images are compared with the data stored in the database. A linear search algorithm is used to search the images in the database and to analyze the true content of the image. The comparison starts by retrieving images from the database using linear search; every stored image is visited, and the content of the received image is analyzed by comparing all of its attributes, for example contrast, brightness and radian, with the image spam attributes in the database.
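The linear search described above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the record layout, the attribute values, the `tolerance` parameter and the function names `matches` and `linear_search` are all hypothetical, and only two of the attributes named in the text (contrast and brightness) are used.

```python
# Hypothetical records: each stored spam image is reduced to a few
# numeric attributes. The values below are made up for illustration.
IMAGE_SPAM_DB = [
    {"name": "ad_banner", "contrast": 0.82, "brightness": 0.64},
    {"name": "lottery", "contrast": 0.45, "brightness": 0.91},
]

def matches(candidate, record, tolerance=0.05):
    """True if every attribute of the record is within tolerance of the candidate."""
    return all(abs(candidate[key] - value) <= tolerance
               for key, value in record.items() if key != "name")

def linear_search(candidate, database, tolerance=0.05):
    """Visit records one by one and return the first matching one, else None."""
    for record in database:
        if matches(candidate, record, tolerance):
            return record
    return None

incoming = {"contrast": 0.80, "brightness": 0.66}
hit = linear_search(incoming, IMAGE_SPAM_DB)
print("uncertified" if hit else "certified")
```

Linear search visits every record in the worst case, so the cost of one evaluation grows in proportion to the size of the image spam database.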
3. RESULTS
This section shows the results obtained when uncertified and certified mails are sent to the receiver by the sender.
3.1 Result when the Non-SPAM mails are sent:
In the first experiment, the certified mails are sent as a group; the outcome for each mail in the group is detailed in Table 1:
Email        Probability    Result    True/False
1st mail     .70            SPAM      TRUE
2nd mail     .41            HAM       FALSE
3rd mail     .53            HAM       FALSE
4th mail     .80            SPAM      TRUE
5th mail     .39            HAM       FALSE
6th mail     .78            SPAM      TRUE
7th mail     .88            SPAM      TRUE
8th mail     .74            SPAM      TRUE
9th mail     .77            SPAM      TRUE
10th mail    .81            SPAM      TRUE
Table 1: Result when the Non-SPAM mails are sent.
Total number of certified emails sent = 10
Number of emails found SPAM positive = 3
Number of emails found SPAM negative = 7
Overall throughput for certified mails = 76%
3.2 Result when SPAM mails are sent:
In the second experiment, the system has been simulated with spam mails. Table 2 shows the result when the spam mails are sent.
E-mail       Probability of whole mail    Result    True/False
1st mail     .82                          SPAM      TRUE
2nd mail     .96                          SPAM      TRUE
3rd mail     .73                          SPAM      TRUE
4th mail     .78                          SPAM      TRUE
5th mail     .56                          HAM       FALSE
6th mail     .60                          SPAM      FALSE
7th mail     .77                          SPAM      TRUE
8th mail     .91                          SPAM      TRUE
9th mail     .43                          HAM       FALSE
10th mail    .69                          SPAM      TRUE
Table 2: Result when SPAM mails are sent.
Total number of certified mails sent = 10
Number of emails found spam positive = 2
Number of emails found spam negative = 8
Overall throughput for certified mails = 85%
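The Result column of Tables 1 and 2 is consistent with a simple cutoff on the probability column. The paper does not state such a cutoff; the value 0.60 below is an assumption, chosen only because it reproduces the SPAM/HAM labels of both tables. The sketch makes no claim about how the throughput figures were computed.

```python
SPAM_THRESHOLD = 0.60  # assumed; not stated in the paper

def label(probability, threshold=SPAM_THRESHOLD):
    """Label a mail SPAM when its probability reaches the threshold."""
    return "SPAM" if probability >= threshold else "HAM"

# Probability and Result columns copied from Table 1 and Table 2.
table1 = [(.70, "SPAM"), (.41, "HAM"), (.53, "HAM"), (.80, "SPAM"), (.39, "HAM"),
          (.78, "SPAM"), (.88, "SPAM"), (.74, "SPAM"), (.77, "SPAM"), (.81, "SPAM")]
table2 = [(.82, "SPAM"), (.96, "SPAM"), (.73, "SPAM"), (.78, "SPAM"), (.56, "HAM"),
          (.60, "SPAM"), (.77, "SPAM"), (.91, "SPAM"), (.43, "HAM"), (.69, "SPAM")]

for name, rows in (("Table 1", table1), ("Table 2", table2)):
    # Compare the labels produced by the assumed cutoff with the reported ones
    agrees = all(label(p) == reported for p, reported in rows)
    print(name, "labels reproduced by the assumed cutoff:", agrees)
```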
4. CONCLUSION
Spam is a critical problem across the network because it has progressed from text to images. Some email spam filtering software cannot identify image spam. Image spam erodes limited network resources and creates trouble for people. This paper therefore proposed a new technique to identify and avoid received image spam across the network, using a feature extraction and classification framework to target the image spam currently seen on the internet.
5. REFERENCES
[1]. B.Battista, F. a. (2011). Improving Image Spam Filtering Using Image Text Features. In Proc. 7th International Conference on and
[2]. Bagga, J. A. (2004). Categorizing Images in Web Documents. IEEE Multimedia, pp. 22-30.
[3]. Belding-Royer, I. D. (2004). AODV Routing Protocol Implementation Design. In C.E. Perkins, Ad hoc Networking, 173-219.
[4]. Botvich, J. M. (2008). A Trust Based System for Enhanced Spam Filtering. Journal of Software, Vol. 3, No. 5.
[5]. Christina V, K. S. (2010). A Study on Spam Filtering Techniques. International Journal of Computer Applications.
[6]. Community Workshop Series. (n.d.). Retrieved from http://www2.lib.unc.edu/cws/handouts/: http://www2.lib.unc.edu/cws/handouts/emailbasics.pdf
[7]. E.Damiani, S. D. (2003). An Open Digest Based Technique For Spam Detection.
[8]. G.Frencesso, P. a. (2009). Using Heterogeneous Features for Anti Spam Filters. 19th International Conference on Database and Expert System Application, (pp. 670-674).
[9]. Hasaan T., C. P. (n.d.). Towards Eradication of Spam: Development and Evaluation of an Intelligent SPAM Filter. Edith Cowan University, Perth, Western Australia.
[10]. introEmail. (n.d.). Retrieved from http://www.albanypubliclibrary.org/: http://www.albanypubliclibrary.org/documents/pcc/IntroEmail.pdf
[11]. Jan Gobel, T. H. (2008). Towards Proactive Spam Filtering.
[12]. Khosri, A. (31,2007). An Overview of Content Based Spam Filtering Techniques. Informatica.
[13]. N.Jordan, M. N. (2011). Image Spam ASCII to the Rescue! 3rd International Conference on Malicious and Unwanted Software, (pp. 65-68).
[14]. Pour, A. N. (2012). Minimizing the Time of Spam Mail Detection by Relocating Filtering System to the Sender Mail Server. International Journal of Network Security and Its Applications.
[15]. Shashi Kant Rathore, P. J. (August 2011). A New Probability based Analysis for Recognition of Unwanted Emails. International Journal of Computer Applications, 4.
[16]. Sheu, J.-J. (2009). An Efficient Two-Phase Spam Filtering Method Based on E-Mails Categorization. International Journal of Network Security.
[17]. Sunil Taneja, D. A. (2011). End to End Delay Analysis of Prominent On-Demand Routing Protocols. IJCST Vol. 2, Issue 1.
[18]. Tech-FAQ. (n.d.). Retrieved from faq.com/: http://www.tech-faq.com/how-the-email-system-works.html
[19]. Thamarai Subramaniam, H. A. (2010). Overview of Textual Anti-Spam Filtering Techniques. International Journal of the Physical Science Vol. 5, pp. 1869-1882.
[20]. V Sathiya, M. T. (2011). Partial Image Spam E-Mail Detection using OCR. IJETT, June.
[21]. Vinit Garg, M. K. (2011). Advanced Survey of Mobile Ad-hoc Network. IJCST Vol. 2.
[23]. W., S. (2004). Data and Computer Communications. Prentice Hall, 7th Edition.
[24]. Z he Wang, W. J. (n.d.). Filtering Image Spam with Near Duplicate Detection. Computer Science Department, Princeton University.