• No results found

Feature Subset Selection in Spam Detection

N/A
N/A
Protected

Academic year: 2021

Share "Feature Subset Selection in Spam Detection"

Copied!
16
0
0

Loading.... (view fulltext now)

Full text

(1)

Amir Rajabi Behjat, Universiti Technology MARA, Malaysia

IT Security for the Next Generation”

Asia Pacific & MEA Cup, Hong Kong 14-16 March, 2012

Feature Subset Selection in E-mail Spam

Detection

(2)

Feature Subset Selection in E-mail Spam Detection

Introduction

Problem Statement Objective

Related Works

Proposed Architecture Result and Discussion

Conclusion

(3)

Introduction

Today, e-mails have become a very common and convenient medium for our daily communications. Spam is an unsolicited commercial or bulk e-mails, or an uninterested e-mails. It has created a significant security problem for end users and big organizations through carrying malicious virus and phishing scams.

There are many solutions available to act as a model for automatic spam detection. Most techniques are based on Machine Learning and data mining techniques such as classification, clustering and Genetic Algorithm that classify spam emails and select their relevant features (Sang, 2010).

PAGE 3 | "IT Security for the Next Generation", Asia Pacific & MEA Cup

| 14-16 March, 2012

(4)

Problem Statement

Lack of useful and relevant features that can distinguish between spam and non- spam emails efficiently (Bilal, 2005).

Increase data dimensionality that decreases accuracy (Ren, 2006).

The reduction of classification accuracy within the current methods (Burim , 2005).

A high growth of false positive in different spam detection systems (Ching, 2010).

| 9-11 Марта, 2011

"IT Security for the Next Generation", Россия и СНГ PAGE 4 |

(5)

Objective

To find related and relevant features to distinguish between spam and non-spam emails and decrease data dimensionality.

To increase classification accuracy that shows the accuracy of detection system and MLP classifier.

To decrease false positive using GA as a feature selector and MLP classifier.

| 9-11 Марта, 2011

"IT Security for the Next Generation", Россия и СНГ PAGE 5 |

(6)

Related Works

Author and Year Proposed System

Ruan and Tan (2007) Classified spam and legitimate emails based on support Vector Machine (SVM) and select more relevant features using Genetic Algorithm.

Matthew and Chung Keung (2009)

Selected features as a mathematical method and classified them based on k-NN and Naïve Bayes. The results of proposed detection system show an accuracy near to 98%.

Gaurav and Shashikala (2010)

Detected spam emails based on different techniques such as analyzing email contents by spam keywords knowledge base, finding sender of emails, and cross validation without using feature selection. The lack of feature selection and high computational costs decrease accuracy of classifier.

Lee, Kim and Park (2010)

Developed an optimal spam detection model based on Random Forest (RF).

This model is able to optimize RF parameters and select relevant features simultaneously.

Ruan and Tan (2010)

Selected features of spam emails such as words and symbols by artificial Immune system to detect spam emails based Artificial Neural Network classifier. The result of this study showed 99% accuracy and 0.2% error rate.

PAGE 6 | "IT Security for the Next Generation", Asia Pacific & MEA Cup

| 14-16 March, 2012

(7)

Detection System Architecture

| 9-11 Марта, 2011

"IT Security for the Next Generation", Россия и СНГ PAGE 7 |

(8)

Genetic Algorithm

GA finds an optimal binary vector, where each bit is associated with a feature.

If the ith bit of this vector = 1, then the ith feature is allowed to participate in classification;

if the bit is = 0, then the corresponding feature does not participate.

GA method applied in this study decreases the number of irrelevant features and high dimensionality. Thus, the number of features are decreased from 156 to 78.

| 9-11 Марта, 2011

"IT Security for the Next Generation", Россия и СНГ PAGE 8 |

(9)

The Fitness function of GA

| 9-11 Марта, 2011

"IT Security for the Next Generation", Россия и СНГ PAGE 9 |

(10)

Multi-Layer perceptron (MLP)

Multi-layer perceptron is arranged in different layers, namely input, hidden and output layers.

In fact, these layers are organized to minimize appropriate error functions and increase the accuracy of classification.

This classifier not only increase the accuracy, but increase the speed of detection process.

| 9-11 Марта, 2011

"IT Security for the Next Generation", Россия и СНГ PAGE 10 |

(11)

Dataset

Two benchmark corpora used to test our proposed email spam detection system based MLP classifier are the SpamAssassin corpus and LingSpam corpus.

Name of Data

base Spam Non-spam

LingSpam 481 2412

SpamAssassin 3797 6954

PAGE 11 |

Experiment

"IT Security for the Next Generation", Asia Pacific & MEA Cup

| 14-16 March, 2012

(12)

Performance Measurement

| 9-11 Марта, 2011

"IT Security for the Next Generation", Россия и СНГ PAGE 12 |

TP = spam emails and identified as spam FN = spam emails but identified as non-spam

TN = non-spam emails and identified as non-spam FP = non-spam emails but identified as spam

Measure Expression

Accuracy (TP+TN)/(TP+FP+FN+TN)

Precision TP/(TP+FP)

Recall TP/(TP+FN)

(13)

Table 1 illustrates the performance of GA based feature subset selection with MLP classifier in the spam detection system.

10 runs were used for each dataset to estimate the classifications.

Other techniques from previous studies are compared to the proposed technique i.e MLP

PAGE 13 |

Result

| 14-16 March 2012

"IT Security for the Next Generation", Asia Pacific & MEA Cup

Methods Accuracy (%) Recall (%) Precision (%)

Linear Discriminant 98.55 91.49 99.77

SVM 97.09 97.38 97.02

BP Neural Network 99.13 95.84 98.93

NB 96.41 81.10 96.85

SVM-IG 96.85 81.90 99.0

MLP LingSpam 99.83 93.75 100

MLP SpamAssassin 99.81 100 99.48

Table 1

(14)

PAGE 14 |

Cont. Result

| 14-16 March , 2012

"IT Security for the Next Generation", Asia Pacific & MEA Cup

(15)

The Genetic Algorithm (GA) as a feature selection algorithm is discussed in this study. This algorithm like an automated heuristic method selects significant features of spam emails to decline data-dimensionality and increase MLP classifier.

The results of this paper show the performance near to 100% and 0.01 false positive rate. Integrated experiments of this paper are based on two benchmark corpora LingSpam and SpamAssassin to indicate the strength of GA in feature selection and the ability of MLP classifier for generating high accuracy and decrease the time of computing.

Conclusion

| 9-11 Марта, 2011

"IT Security for the Next Generation", Россия и СНГ PAGE 15 |

(16)

Thank You

Amir Rajabi Behjat, Universiti Technology Mara, Malaysia

“IT Security for the Next Generation”

Asia Pacific & MEA Cup, Hong Kong 14-16 March, 2012

References

Related documents

Figure 2 Associations between MSI screening status, beta-catenin overexpression and survival in all patients and in subgroups according to disease stage... Figure 3 (See legend on

The aim of this study was to estimate the effect of home visits on the use of modern methods of contraception by women of childbearing age in the Kumbo East Health Dis- trict,

Remote control of local power electronic modules and storage devices by the utility is the basis of the so called ‘smart grid’; single point storage and conditioning of renewable

Agroecosystems in North Carolina are typically characterized by a rich diversity of potential host plants (both cultivated and non-cultivated), and these hosts may

Deubiquitinating enzyme USP22 positively regulates c-Myc stability and tumorigenic activity in mammalian and breast cancer cells.. USP28 contributes to the proliferation and

•   Fare clic per modificare gli stili del testo dello schema " –  Secondo livello" •   Terzo livello " –   Quarto livello " »   Quinto livello

5 - Forward all universities to submit its data (indicators and performance measures) during the school year and schedule field visits to universities by a team of