Efficient Spam Classification by Appropriate Feature Selection


© 2013. Prajakta Ozarkar & Dr. Manasi Patwardhan. This is a research/review paper, distributed under the terms of the Creative Commons Attribution-Noncommercial 3.0 Unported License (http://creativecommons.org/licenses/by-nc/3.0/), permitting all non-

Global Journal of Computer Science and Technology

Software & Data Engineering

Volume 13 Issue 5 Version 1.0 Year 2013

Type: Double Blind Peer Reviewed International Research Journal Publisher: Global Journals Inc. (USA)

Online ISSN: 0975-4172 & Print ISSN: 0975-4350


By Prajakta Ozarkar & Dr. Manasi Patwardhan

Vishwakarma Institute of Technology, India

Abstract - Spam is a key problem in electronic communication, including large-scale email systems and the growing number of blogs. Currently a lot of research work is performed on automatic detection of spam emails using classification techniques such as SVM, NB, MLP, KNN, ID3, J48, Random Tree, etc. For a spam dataset it is possible to have a large number of training instances. Based on this fact, we have made use of Random Forest and Partial Decision Trees algorithms to classify spam vs. non-spam emails. These algorithms outperformed the previously implemented algorithms in terms of accuracy and time complexity. As a preprocessing step we have used feature selection methods such as Chi-square, Information gain, Gain ratio, Symmetrical uncertainty, Relief, OneR and Correlation. This allowed us to select a subset of relevant, non-redundant and most contributing features, with the added benefit of improved accuracy and reduced time complexity.

Keywords : feature selection, preprocessing, random forest, part.

GJCST-C Classification : H.4.3


I. Introduction

In this paper we have studied previous approaches used for classifying spam and non-spam emails using distinct classification algorithms. We have also studied the distinct features extracted for classifier training and the feature selection algorithms applied to discard irrelevant features and select the most contributing ones. After studying the current feature selection and classification approaches, we have applied two new classification techniques, viz. Random Forests and Partial Decision Trees, along with distinct feature selection algorithms.

R. Parimala et al. [1] present a new FS (feature selection) technique guided by the FSelector package. They used nine feature selection techniques (Correlation-based feature selection, Chi-square, Entropy, Information Gain, Gain Ratio, Mutual Information, Symmetrical Uncertainty, OneR and Relief) and five classification algorithms (Linear Discriminant Analysis, Random Forest, Rpart, Naïve Bayes and Support Vector Machine) on the spambase dataset. In their evaluation, the results show that the filter methods CFS, Chi-squared, GR, Relief, SU, IG and OneR enable the classifiers to achieve the highest increase in classification accuracy. They conclude that the implemented FS can improve the accuracy of Support Vector Machine classifiers.

Author α : Prajakta Ozarkar, Student, Vishwakarma Institute of Technology, Pune, Maharashtra, India. E-mail : prajaktaozarkar00@gmail.com

Author σ : Manasi Patwardhan, Professor, Vishwakarma Institute of Technology, Pune, Maharashtra, India. E-mail : manasi.patwardhan@vit.edu

R. Kishore Kumar et al. [2] analyze the spam dataset using the Tanagra data mining tool. Initially, feature construction and feature selection are done to extract the relevant features using Fisher filtering, Relief, Runs filtering and StepDisc. Then classification algorithms such as C4.5, C-PLS, C-RT, CS-CRT, CS-MC4, CS-SVC, ID3, K-NN, LDA, Log Reg TRIRLS, Multilayer Perceptron, Multilogical Logistic Regression, Naïve Bayes Continuous, PLS-DA, PLS-LDA, Rnd Tree and SVM are applied over the spambase dataset, and cross validation is done for each of these classifiers. They conclude that the Fisher filtering and Runs filtering feature selection algorithms perform better for many classifiers. The Rnd Tree classification algorithm with the relevant features extracted by Fisher filtering produces more than 99% accuracy in spam detection.

W. A. Awad et al. [3] review the machine learning methods Bayesian classification, k-NN, ANNs, SVMs, Artificial Immune System and Rough Sets on the SpamAssassin spam corpus. They conclude that the Naïve Bayes method has the highest precision among the six algorithms, while k-nearest neighbor has the worst precision percentage. Also, the rough sets method has a very competitive percentage.

V. Christina et al. [4] employ supervised machine learning techniques, namely the C4.5 decision tree classifier, Multilayer Perceptron and Naïve Bayes classifier. Five features of an e-mail, all (A), header (H), body (B), subject (S), and body with subject (B+S), are used to evaluate the performance of the machine learning algorithms. The training dataset, a spam and legitimate message corpus, is generated from the mails that they received from their institute mail server over a period of six months. They conclude that the Multilayer Perceptron classifier outperforms the other classifiers and that its false positive rate is also very low compared to the other algorithms.

Rafiqul Islam et al. [5] have presented an effective and efficient email classification technique based on a data filtering method. In their testing they introduced an innovative filtering technique using an instance selection method (ISM) to remove pointless data instances from the training model and then classify the test data. In their model, tokenization and domain-specific feature selection methods are used for feature extraction. The behavioral features are also included for improving performance, especially for reducing false positive (FP) problems. The behavioral features include the frequency of sending/receiving emails, email attachment, type of attachment, size of attachment and length of the email. In their experiment, they tested five base classifiers (Naive Bayes, SVM, IB1, Decision Table and Random Forest) on 6 different datasets. They also tested adaptive boosting (AdaBoostM1) as a meta-classifier on top of the base classifiers. They achieved overall classification accuracy above 97%.

A comparative analysis is performed by Ms. D. Karthika Renuka et al. [6] of the classification techniques MLP, J48 and Naïve Bayesian for classifying spam messages from e-mail using the WEKA tool. The dataset gathered from the UCI repository had 2788 legitimate and 1813 spam emails received during a period of several months. Using this dataset as a training dataset, models are built for the classification algorithms. The study reveals that the same classifier performed dissimilarly when run on the same dataset but using different software tools. From all perspectives, MLP is the top performer in all cases and can thus be deemed consistent.

The following table summarizes all the previous classification approaches listed above and compares the accuracy (%) they achieved with specific feature selection algorithms.

Table 1 : Comparison of previous approaches of spam detection

| Reference | Classifier used and features % | Feature selection | Acc (%) |
|---|---|---|---|
| R. Parimala et al. [1] | SVM (100%) | - | 91.44 |
| | SVM (16%) | CFS | 93 |
| | SVM (70%) | Chi | 93.00 |
| | SVM (70%) | IG | 93.00 |
| | SVM (70%) | GR | 93.39 |
| | SVM (70%) | SU | 93.33 |
| | SVM (70%) | oneR | 92.65 |
| | SVM (70%) | Relief | 93.15 |
| | SVM (32%) | Lda | 91.90 |
| | SVM (12%) | Rpart | 90.51 |
| | SVM (16%) | SVM | 89.95 |
| | SVM (21%) | RF | 91.23 |
| | SVM (7%) | NB | 80.00 |
| R. Kishore Kumar et al. [2] | C-PLS | Fisher | 99.8976 |
| | C-RT | Fisher | 99.9465 |
| | CS-CRT | Fisher | 99.9465 |
| | CS-MC4 | Fisher | 99.9415 |
| | CS-SVC | Fisher | 99.9685 |
| | ID3 | Fisher | 99.9137 |
| | KNN | Fisher | 99.9391 |
| | LDA | Fisher | 99.8861 |
| | LogReg TRI | Fisher | 99.8552 |
| | MLP | Fisher | 99.9459 |
| | Multilogical LR | Fisher | 99.9311 |
| | NBC | Fisher | 99.8865 |
| | PLS-DA | Fisher | 99.8752 |
| | PLS-LDA | Fisher | 99.8757 |
| | Rnd Tree | Fisher | 99.9911 |
| | SVM | Fisher | 99.9070 |
| | C4.5 | Relief | 99.9487 |
| | C-PLS | Relief | 99.8537 |
| | C-RT | Relief | 99.9261 |
| | CS-CRT | Relief | 99.9261 |
| | CS-MC4 | Relief | 99.9324 |
| | CS-SVC | Relief | 99.8794 |
| | ID3 | Relief | 99.895 |
| | KNN | Relief | 99.9176 |
| | LDA | Relief | 99.8481 |
| | LogReg TRI | Relief | 99.8179 |
| | MLP | Relief | 99.9185 |
| | Multilogical LR | Relief | 99.8883 |
| | NBC | Relief | 99.8587 |
| | PLS-DA | Relief | 99.8474 |
| | PLS-LDA | Relief | 99.8476 |
| | Rnd Tree | Relief | 99.9676 |
| | SVM | Relief | 99.8639 |
| | C4.5 | Runs | 99.9633 |
| | C-PLS | Runs | 99.9102 |
| | C-RT | Runs | 99.9404 |
| | CS-CRT | Runs | 99.9404 |
| | CS-MC4 | Runs | 99.9615 |
| | CS-SVC | Runs | 99.9233 |
| | ID3 | Runs | 99.9137 |
| | KNN | Runs | 99.9404 |
| | LDA | Runs | 99.8887 |
| | MLP | Runs | 99.9607 |
| | LogReg TRI | Runs | 99.8611 |
| | Multilogical LR | Runs | 99.9313 |
| | NBC | Runs | 99.8874 |
| | PLS-DA | Runs | 99.8879 |
| | PLS-LDA | Runs | 99.8879 |
| | Rnd Tree | Runs | 99.9883 |
| | SVM | Runs | 99.9076 |
| | C4.5 | StepDisc | 99.9633 |
| | C-PLS | StepDisc | 99.9081 |
| | C-RT | StepDisc | 99.9341 |
| | CS-CRT | StepDisc | 99.9341 |
| | CS-MC4 | StepDisc | 99.9604 |
| | CS-SVC | StepDisc | 99.9218 |
| | ID3 | StepDisc | 99.9105 |
| | KNN | StepDisc | 99.935 |
| | LDA | StepDisc | 99.8881 |
| | LogReg TRI | StepDisc | 99.8587 |
| | MLP | StepDisc | 99.9481 |
| | Multilogical LR | StepDisc | 99.9294 |
| | NBC | StepDisc | 99.8829 |
| | PLS-DA | StepDisc | 99.8826 |
| | PLS-LDA | StepDisc | 99.8829 |
| | Rnd Tree | StepDisc | 99.99 |
| | SVM | StepDisc | 99.905 |
| W. A. Awad et al. [3] | NBC | - | 99.46 |
| | SVM | - | 96.9 |
| | KNN | - | 96.2 |
| | ANN | - | 96.83 |
| | AIS | - | 96.23 |
| | Rough Sets | - | 97.42 |
| V. Christina et al. [4] | NBC | - | 98.6 |
| | J48 | - | 96.6 |
| | MLP | - | 99.3 |
| Rafiqul Islam et al. [5] | NB | - | 92.3 |
| | SMO | - | 96.4 |
| | IB1 | - | 95.8 |
| | DT | - | 95.9 |
| | RF | - | 96.1 |
| Ms. D. Karthika Renuka et al. [6] | MLP | - | 93 |
| | J48 | - | 92 |
| | NBC | - | 89 |

II. Proposed Work

After a detailed review of the existing techniques used for spam detection, in this section we are illustrating the methodology and techniques we used for spam mail detection.

Figure 1 shows the process we used for spam mail identification and how it is used in conjunction with a machine learning scheme. Feature ranking techniques such as Chi-square, Information gain, Gain ratio, Symmetrical uncertainty, Relief, OneR and Correlation are applied to a copy of the training data. After feature selection, the subset with the highest merit is used to reduce the dimensionality of both the original training data and the testing data. Both reduced datasets may then be passed to a machine learning scheme for training and testing. Results are obtained using the Random Forest and PART classification techniques.

Figure 1 : Stages of Spam Email Classification

In the following subsections we discuss the basic concepts related to our work. It includes a brief background on feature ranking techniques, classification techniques and results.

III. Data Set

The dataset used for our experiment is spambase [13]. The last column of 'spambase.data' denotes whether the e-mail was considered spam (1) or not (0). Most of the attributes indicate the frequency of spam-related term occurrences. The first 48 attributes (1-48) give tf-idf (term frequency and inverse document frequency) values for spam-related words, whereas the next 6 attributes (49-54) provide tf-idf values for spam-related characters. The run-length attributes (55-57) measure the length of sequences of consecutive capital letters: capital_run_length_average, capital_run_length_longest and capital_run_length_total. Thus, our dataset has 57 attributes in total serving as input features for spam detection, and the last attribute represents the class (spam/non-spam).

We have also used one public dataset, Enron [20]. The “preprocessed” subdirectory contains the messages in preprocessed format, with each message in a separate text file. The body of an email contains the actual information. This information needs to be extracted by preprocessing before running a filter process. The purpose of preprocessing is to transform the messages into a uniform format that can be understood by the learning algorithm. The steps involved in preprocessing are as follows:

1. Feature extraction (tokenization): extracting features from the e-mail into a vector space.

2. Stemming: removing the more common morphological and inflectional endings from English words.

3. Stop word removal: removal of non-informative words.

4. Noise removal: removal of obscure text or symbols from features.

5. Representation: tf-idf is a statistical measure used to calculate how significant a word is to a document in a corpus. Word frequency is established by term frequency (tf): the number of times the word appears in a message indicates the significance of the word to that document. The term frequency is then multiplied by the inverse document frequency (idf), which reflects how frequently the word occurs across all messages and shrinks as the word becomes more common.
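The representation step can be illustrated with a minimal Python sketch. It assumes already tokenized messages, a term frequency normalized by message length, and idf = ln(N / df); the paper does not state which tf-idf variant was used, so these choices and the sample messages are assumptions made only for illustration.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: weight} dict per tokenized, non-empty document."""
    n_docs = len(documents)
    # Document frequency: number of messages that contain each term.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append({
            # tf (share of the message) times idf (log of N over document frequency)
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical, already preprocessed messages.
messages = [
    ["free", "money", "free"],
    ["meeting", "money"],
    ["meeting", "agenda"],
]
vectors = tf_idf(messages)
```

In this toy corpus "free" occurs twice in one message and nowhere else, so it receives a larger weight in that message than "money", which also appears in a second message.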

IV. Feature Ranking and Subset Selection

From the above-defined feature vector of 58 attributes in total (57 input features and the class label), we use feature ranking and selection algorithms to select a subset of features. We rank the given set of features using the following distinct approaches.

a) Chi-square

Chi-squared hypothesis tests may be performed on contingency tables in order to decide whether or not effects are present. Effects in a contingency table are defined as relationships between the row and column variables; that is, whether the levels of the row variable are differentially distributed over the levels of the column variable. Significance in this hypothesis test means that interpretation of the cell frequencies is warranted. Non-significance means that any differences in cell frequencies could be explained by chance. Hypothesis tests on contingency tables are based on a statistic called Chi-square [8]:

$$\chi^2 = \sum \frac{(O - E)^2}{E}$$

where O is the observed cell frequency and E is the expected cell frequency.
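As a hedged illustration of the statistic, the sketch below computes Chi-square for a small contingency table, taking the expected frequency of each cell as (row total × column total) / grand total, which is the usual independence assumption. The counts are hypothetical and are not taken from spambase.

```python
def chi_square(table):
    """Chi-square statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected cell frequency if rows and columns were independent.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    return statistic

# Hypothetical counts: rows = word present / absent, columns = spam / non-spam.
counts = [[30, 10],
          [20, 40]]
score = chi_square(counts)
```

A feature whose presence is distributed identically across both classes would score 0, so larger values indicate a stronger feature-class relationship and a higher rank.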

b) Information Gain

Information Gain is the expected reduction in entropy caused by partitioning the examples according to a given attribute. Information gain is a symmetrical measure; that is, the amount of information gained about Y after observing X is equal to the amount of information gained about X after observing Y. The entropy of Y is given by [9]

$$H(Y) = -\sum_{y \in Y} p(y) \log_2 p(y)$$

If the observed values of Y in the training data are partitioned according to the values of a second feature X, and the entropy of Y with respect to the partitions induced by X is less than the entropy of Y prior to partitioning, then there is a relationship between features Y and X. The following equation gives the entropy of Y after observing X:

$$H(Y \mid X) = -\sum_{x \in X} p(x) \sum_{y \in Y} p(y \mid x) \log_2 p(y \mid x)$$

The amount by which the entropy of Y decreases reflects additional information about Y provided by X and is called the information gain or alternatively, mutual information [9]. Information gain is given by

$$Gain = H(Y) - H(Y \mid X)$$

$$= H(X) - H(X \mid Y)$$

$$= H(Y) + H(X) - H(X, Y)$$

c) Gain Ratio

The various selection criteria have been compared empirically in a series of experiments. When all attributes are binary, the gain ratio criterion has been found to give considerably smaller decision trees. When the task includes attributes with large numbers of values, the subset criterion gives smaller decision trees that also have better predictive performance, but can require much more computation. However, when these many-valued attributes are augmented by redundant attributes which contain the same information at a lower level of detail, the gain ratio criterion gives decision trees with the greatest predictive accuracy. All in all, it suggests that the gain ratio criterion does pick a good attribute for the root of the tree [12].
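The entropy-based scores in this section can be sketched in a few lines of Python. The sketch assumes discrete (already discretized) attribute values, uses the joint-entropy form Gain = H(Y) + H(X) − H(X, Y), and computes the gain ratio by dividing the gain by H(X), the entropy of the attribute. The sample data and the function names are hypothetical.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    total = len(values)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(values).values())

def information_gain(x, y):
    """Gain = H(Y) + H(X) - H(X, Y) for paired discrete sequences."""
    return entropy(y) + entropy(x) - entropy(list(zip(x, y)))

def gain_ratio(x, y):
    """Information gain normalized by H(X), the entropy of the attribute."""
    h_x = entropy(x)
    return information_gain(x, y) / h_x if h_x > 0 else 0.0

# Hypothetical discretized attribute (x) and class labels (y).
x = ["high", "high", "low", "low"]
y = [1, 1, 0, 0]
```

Here the attribute determines the class exactly, so both scores reach their maximum for a binary class; an attribute that is independent of the class would score 0.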

The gain ratio is defined as

$$GainRatio = \frac{H(Y) + H(X) - H(X, Y)}{H(X)}$$

d) Symmetrical Uncertainty

Information gain is a symmetrical measure; that is, the amount of information gained about Y after observing X is equal to the amount of information gained about X after observing Y. Symmetry is a desirable property for a measure of feature-feature intercorrelation to have. Unfortunately, information gain is biased in favor of features with more values. Symmetrical uncertainty compensates for information gain’s bias toward attributes with more values and normalizes its value to the range [0, 1] [9]:

$$SymmetricalUncertaintyCoeff = 2.0 \times \frac{Gain}{H(Y) + H(X)}$$

e) Relief

Relief [10] is a feature weighting algorithm that is sensitive to feature interactions. Relief attempts to approximate the following difference of probabilities for the weight of a feature X [9]:

$$W_X = P(\text{different value of } X \mid \text{nearest instance of different class}) - P(\text{different value of } X \mid \text{nearest instance of same class})$$

By removing the context sensitivity provided by the “nearest instance” condition, attributes are treated as independent of one another:

$$Relief_X = P(\text{different value of } X \mid \text{different class}) - P(\text{different value of } X \mid \text{same class})$$

which can be reformulated as

$$Relief_X = \frac{Gini' \times \sum_{x \in X} p(x)^2}{\left(1 - \sum_{c \in C} p(c)^2\right) \sum_{c \in C} p(c)^2}$$

where C is the class variable and

$$Gini' = \left[\sum_{c \in C} p(c)\,(1 - p(c))\right] - \sum_{x \in X} \left(\frac{p(x)^2}{\sum_{x \in X} p(x)^2} \sum_{c \in C} p(c \mid x)\,(1 - p(c \mid x))\right)$$

f) OneR

Like other empirical learning methods, 1R [11] takes as input a set of examples, each with several attributes and a class. The aim is to infer a rule that predicts the class given the values of the attributes. The 1R algorithm chooses the most informative single attribute and bases the rule on this attribute alone. The basic idea is:

For each attribute a, form a rule as follows:
    For each value v from the domain of a:
        Select the set of instances where a has value v.
        Let c be the most frequent class in that set.
        Add the following clause to the rule for a: if a has value v then the class is c.
    Calculate the classification accuracy of this rule.
Use the rule with the highest classification accuracy.

The algorithm assumes that the attributes are discrete. If not, then they must be discretized.
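The 1R steps above can be sketched as follows, assuming attributes that are already discrete. The function name, the tuple-based data layout and the toy data are hypothetical, and ties between classes are broken by first occurrence, which is a choice of this sketch rather than something the description specifies.

```python
from collections import Counter, defaultdict

def one_r(instances, labels):
    """Return (attribute index, rule, accuracy) for the best single-attribute rule."""
    best = None
    for a in range(len(instances[0])):
        # Count the classes observed for each value v of attribute a.
        class_counts = defaultdict(Counter)
        for row, label in zip(instances, labels):
            class_counts[row[a]][label] += 1
        # Clause for each value v: predict the most frequent class c.
        rule = {v: counts.most_common(1)[0][0] for v, counts in class_counts.items()}
        correct = sum(rule[row[a]] == label for row, label in zip(instances, labels))
        accuracy = correct / len(labels)
        if best is None or accuracy > best[2]:
            best = (a, rule, accuracy)
    return best

# Hypothetical discretized data: (contains "free", contains "meeting").
data = [("yes", "no"), ("yes", "no"), ("no", "yes"), ("no", "no"), ("yes", "yes")]
classes = ["spam", "spam", "ham", "ham", "spam"]
attribute, rule, accuracy = one_r(data, classes)
```

When 1R is used for feature ranking rather than classification, the per-attribute accuracy computed inside the loop can serve as the score of that attribute.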

g) Correlation

Feature selection for classification tasks in machine learning can be accomplished on the basis of correlation between features, and such a feature selection procedure can be beneficial to common machine learning algorithms [9]. Features are relevant if their values vary systematically with category membership. In other words, a feature is useful if it is correlated with or predictive of the class; otherwise it is irrelevant. A good feature subset is one that contains features highly correlated with (predictive of) the class, yet uncorrelated with (not predictive of) each other. The acceptance of a feature will depend on the extent to which it predicts classes in areas of the instance space not already predicted by other features. The correlation-based feature selection feature subset evaluation function is [9]:

$$M_S = \frac{k\,\overline{r_{cf}}}{\sqrt{k + k(k-1)\,\overline{r_{ff}}}}$$

where $M_S$ is the heuristic “merit” of a feature subset S containing k features, $\overline{r_{cf}}$ is the mean feature-class correlation, and $\overline{r_{ff}}$ is the average feature-feature inter-correlation.
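To make the behavior of the merit function concrete, the following sketch evaluates it for two hypothetical five-feature subsets that share the same mean feature-class correlation but differ in how strongly their features correlate with each other. The correlation values are invented for illustration.

```python
import math

def cfs_merit(k, mean_class_corr, mean_feature_corr):
    """Merit of a k-feature subset: k * r_cf / sqrt(k + k * (k - 1) * r_ff)."""
    return (k * mean_class_corr) / math.sqrt(k + k * (k - 1) * mean_feature_corr)

# Hypothetical subsets with the same mean feature-class correlation (0.5).
redundant = cfs_merit(k=5, mean_class_corr=0.5, mean_feature_corr=0.8)
diverse = cfs_merit(k=5, mean_class_corr=0.5, mean_feature_corr=0.1)
```

The subset with low inter-correlation obtains the higher merit, which is how the function penalizes redundant features.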

Feature ranking further helps us to:

1. Remove irrelevant features, which might mislead the classifier, decreasing its interpretability and reducing generalization by increasing overfitting.

2. Remove redundant features, which provide no additional information beyond the other features and unnecessarily decrease the efficiency of the classifier.

3. Select high-rank features, which may not contribute much as far as improving precision and recall is concerned, but which reduce time complexity drastically. Selecting such high-rank features reduces the dimensionality of the feature space of the domain. It speeds up the classifier, thereby improving performance and increasing the comprehensibility of the classification result.

We have considered 87%, 77% and 70% of the features; there is a performance improvement when 70% of the features are considered.
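A minimal sketch of selecting the top-ranked share of features is given below. It assumes that each feature already has a ranking score and that the number of retained features is obtained by rounding; the paper does not state the exact counts, so the rounding rule, the feature names f1..f57 and the scores are assumptions.

```python
def select_top_features(scores, fraction):
    """Return the names of the highest-scoring features, keeping the given fraction."""
    keep = max(1, round(len(scores) * fraction))
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:keep]

# Hypothetical ranking scores for 57 features named f1..f57.
scores = {f"f{i}": 1.0 / i for i in range(1, 58)}
for fraction in (0.87, 0.77, 0.70):
    print(fraction, len(select_top_features(scores, fraction)))
```

Under this rounding assumption, 87%, 77% and 70% of the 57 input features correspond to 50, 44 and 40 features respectively.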

IV. Classification Method

Based on the assumption that the given dataset has a sufficient number of training instances, we have chosen the following two classification algorithms. The algorithms work well provided that the dataset is of good quality.

a) Random Forest

Random Forests [14] are a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. The generalization error for forests converges almost surely to a limit as the number of trees in the forest becomes large. The generalization error of a forest of tree classifiers depends on the strength of the individual trees in the forest and the correlation between them. Each tree is grown as follows:

1. If the number of cases in the training set is N, sample N cases at random - but with replacement, from the original data. This sample will be the training set for growing the tree.

2. If there are M input variables, a number m<<M is specified such that at each node, m variables are selected at random out of the M and the best split on these m is used to split the node. The value of m is held constant during the forest growing.

3. Each tree is grown to the largest extent possible. There is no pruning.
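The randomization in steps 1 and 2, together with the voting used to combine trees, can be sketched as below. This is only an illustration of the sampling ingredients and does not grow any trees; the helper names and toy rows are hypothetical, while M = 57 and m = 4 mirror the spambase feature count and the setting reported for the implementation in this section.

```python
import random
from collections import Counter

def bootstrap_sample(rows, labels, rng):
    """Step 1: draw N cases at random, with replacement, from the N originals."""
    indices = [rng.randrange(len(rows)) for _ in rows]
    return [rows[i] for i in indices], [labels[i] for i in indices]

def candidate_features(n_features, m, rng):
    """Step 2: pick m of the M input variables at random for one node."""
    return rng.sample(range(n_features), m)

def majority_vote(predictions):
    """Combine the per-tree predictions for one instance by voting."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
rows = [[0.0, 1.2], [0.3, 0.0], [0.0, 0.0], [0.9, 2.1]]
labels = [1, 0, 0, 1]
sample_rows, sample_labels = bootstrap_sample(rows, labels, rng)
node_features = candidate_features(n_features=57, m=4, rng=rng)
```

Cases left out of a tree's bootstrap sample are the ones commonly used to estimate the out-of-bag error of a forest.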

Random Forest is an ensemble of trees. In our implementation of random forest we selected a vector of 4 randomly chosen features to build each tree in a forest of 10 random trees. Each tree grows to its maximum depth because the depth argument is set to zero, which indicates unlimited depth. Classification is done using bagging and voting. For example, a sample part of the output of the forest (a very small portion of the forest) is shown below:

    Total Random forest Trees: 10
    Numbers of random features: 4
    Out of bag error: 0.1092391304347826
    All the trees in the forest:

This is the case when 100% of the features have been selected for training the model; accordingly, the root node of each tree changes.

b) Partial Decision Tree

Rule learners are prominent representatives of supervised machine learning approaches. Basically, this type of learner tries to induce a set of rules for a collection of training instances. These rules are then applied to the test instances for classification purposes. Two well-known members of the family of rule learners are C4.5 and RIPPER. C4.5 [16], for instance, generates an unpruned decision tree and transforms this tree into a set of rules. For each path from the root node to a leaf a rule is generated. Then, each rule is simplified separately, followed by a rule-ranking strategy. Finally, the algorithm deletes rules from the rule set as long as the rule set’s error rate on the training instances decreases. RIPPER [17] implements a separate-and-conquer strategy for rule induction. Only one rule is generated at a time and the instances from the training set covered by this rule are removed. It iteratively derives new rules for the remaining instances of the training set.

PART (Partial Decision Trees) adopts the separate-and-conquer strategy of RIPPER [17] and combines it with the decision tree approach of C4.5 [16]. PART generates a set of rules according to the separate-and-conquer strategy: it builds a rule, removes all instances from the training collection that are covered by this rule, and proceeds recursively until no instance remains. To generate a single rule, PART builds a partial decision tree for the current set of instances and chooses the leaf with the largest coverage as the new rule. For example, the following shows how rules are formed in our implementation of PART; some of the rules are shown below:

Rule 1:

    word_freq_remove > 0.0 AND char_freq_! > 0.049 AND word_freq_edu <= 0.06: 1 (Instances: 490 and Incorrect: 7)

Now, after Rule 1, the next set of rules is formed excluding these 490 instances from the 4601 total instances of spambase.

Rule 2:

    char_freq_$ > 0.058 AND word_freq_hp <= 0.4 AND capital_run_length_longest > 9.0 AND word_freq_1999 <= 0.0 AND word_freq_edu <= 0.08 AND char_freq_! > 0.107: 1 (Instances: 334 and Incorrect: 2)

The next set of rules is formed on the remaining 3777 instances of spambase.

Rule 3:

    word_freq_money <= 0.03 AND word_freq_000 <= 0.25 AND word_freq_remove <= 0.26 AND word_freq_free <= 0.19 AND word_freq_font <= 0.12 AND char_freq_! <= 0.391 AND char_freq_$ <= 0.172 AND word_freq_george > 0.0: 0 (Instances: 553 and Incorrect: 0)

In total, 42 rules are formulated during training.
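An ordered rule list of this kind is applied by returning the class of the first rule that fires. The sketch below encodes only the three rules quoted above; the remaining rules are not reproduced in the text, so anything that matches none of the three falls through to a hypothetical `default`. Treating attributes that are missing from an e-mail as 0.0 is likewise an assumption of the sketch.

```python
import operator

OPS = {">": operator.gt, "<=": operator.le}

# Ordered rule list; each rule is (tests, predicted class), 1 = spam, 0 = non-spam.
RULES = [
    ([("word_freq_remove", ">", 0.0), ("char_freq_!", ">", 0.049),
      ("word_freq_edu", "<=", 0.06)], 1),
    ([("char_freq_$", ">", 0.058), ("word_freq_hp", "<=", 0.4),
      ("capital_run_length_longest", ">", 9.0), ("word_freq_1999", "<=", 0.0),
      ("word_freq_edu", "<=", 0.08), ("char_freq_!", ">", 0.107)], 1),
    ([("word_freq_money", "<=", 0.03), ("word_freq_000", "<=", 0.25),
      ("word_freq_remove", "<=", 0.26), ("word_freq_free", "<=", 0.19),
      ("word_freq_font", "<=", 0.12), ("char_freq_!", "<=", 0.391),
      ("char_freq_$", "<=", 0.172), ("word_freq_george", ">", 0.0)], 0),
]

def classify(email, rules=RULES, default=None):
    """Return the class of the first rule whose tests all hold, else `default`."""
    for tests, label in rules:
        # Assumption: attributes missing from `email` count as 0.0.
        if all(OPS[op](email.get(name, 0.0), threshold)
               for name, op, threshold in tests):
            return label
    return default

spam_like = {"word_freq_remove": 0.5, "char_freq_!": 0.2}
ham_like = {"word_freq_george": 1.0}
```

Because the rules are tried in order, an e-mail covered by Rule 1 never reaches Rule 2 or Rule 3, mirroring the way covered instances are removed during training.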

V. Results

a) Spambase Results

The spambase dataset was taken from the UCI machine learning repository [13]. It contains 4601 instances and 58 attributes: 57 continuous attributes (1-57) and 1 nominal class label. The email spam classification has been implemented in Eclipse, considered by many to be the best Java development tool available. Feature ranking and feature selection are done using methods such as Chi-square, Information gain, Gain ratio, Symmetrical uncertainty, Relief, OneR and Correlation as a preprocessing step, so as to select a feature subset for building the learning model.

The classification algorithms are from the decision tree family, viz. Random Forest and Partial Decision Trees. Random forest is an effective tool in prediction. Because of the law of large numbers, random forests do not overfit. Random inputs and random features produce good results in classification, less so in regression. For larger data sets, it seems that significantly lower error rates are possible [14]. The feature space can be reduced by an order of magnitude while achieving similar classification results. For example, it takes about 2,000 features to achieve accuracies similar to those obtained with 149 PART features [15].

As a part of our implementation, we divided the dataset into two parts: 80% of the dataset is used for training and 20% for testing. After the preprocessing step, the top 87%, 77% and 70% of the features are considered while building the training model and testing, because there is a significant performance improvement. Prediction accuracy, correctly classified instances, incorrectly classified instances, confusion matrix and time complexity are used as performance measures of the system.
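The split and the accuracy-related measures can be sketched as follows. Shuffling before the cut, the fixed seed and the function names are assumptions of this sketch; the text only states the 80%/20% proportions.

```python
import random

def train_test_split(rows, train_fraction=0.8, seed=0):
    """Shuffle a copy of the rows and cut it into training and testing parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def confusion_matrix(actual, predicted):
    """Return (tp, fp, fn, tn), treating spam (1) as the positive class."""
    tp = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))
    fp = sum(a == 0 and p == 1 for a, p in zip(actual, predicted))
    fn = sum(a == 1 and p == 0 for a, p in zip(actual, predicted))
    tn = sum(a == 0 and p == 0 for a, p in zip(actual, predicted))
    return tp, fp, fn, tn

def accuracy(actual, predicted):
    """Percentage of correctly classified instances."""
    tp, fp, fn, tn = confusion_matrix(actual, predicted)
    return 100.0 * (tp + tn) / len(actual)

# Row indices stand in for the 4601 spambase instances.
train, test = train_test_split(range(4601))
```

With 4601 instances this cut yields 3680 training and 921 testing instances.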

More than 99% prediction accuracy is achieved by Random Forest with all seven feature selection methods under consideration, whereas 97% prediction accuracy is achieved by PART with almost all seven feature selection methods while training the model. Training and testing results when 100% of the features are considered are given in Table 2.

Table 2 : Results of 100% feature selection

| Classifier | Training | Testing | Time (ms) |
|---|---|---|---|
| Random Forest | 99.918 | 94.354 | 1540 |

Both training results and testing results on the spambase dataset after feature ranking and subset selection are shown in Table 3 and Table 4.

Table 3 : Training Results

| FS (%) | FS | RF Acc (%) | RF Time (ms) | Part Acc (%) | Part Time (ms) |
|---|---|---|---|---|---|
| 87% | Chi | 99.891 | 1349 | 98.234 | 3797 |
| | Infogain | 99.837 | 1330 | 98.505 | 3080 |
| | Gainratio | 99.918 | 1386 | 98.315 | 3611 |
| | Relief | 99.891 | 1397 | 96.63 | 3467 |
| | SU | 99.918 | 1367 | 98.505 | 3124 |
| | OneR | 99.918 | 1470 | 96.902 | 4727 |
| | Corr | 99.728 | 1153 | 95.027 | 847 |
| 77% | Chi | 99.918 | 1373 | 97.283 | 2701 |
| | Infogain | 99.891 | 1498 | 97.147 | 3131 |
| | Gainratio | 99.918 | 1604 | 97.006 | 4007 |
| | Relief | 99.864 | 1367 | 97.799 | 3829 |
| | SU | 99.891 | 1294 | 97.147 | 2867 |
| | OneR | 99.891 | 1406 | 94.973 | 3469 |
| | Corr | 99.728 | 1145 | 95.027 | 835 |
| 70% | Chi | 99.891 | 1282 | 97.092 | 2437 |
| | Infogain | 99.918 | 1314 | 97.092 | 2409 |
| | Gainratio | 99.864 | 1383 | 96.821 | 2642 |
| | Relief | 99.81 | 1428 | 96.658 | 2855 |
| | SU | 99.918 | 1276 | 97.092 | 2394 |
| | OneR | 99.918 | 1442 | 95.245 | 2528 |
| | Corr | 99.728 | 1152 | 95.027 | 845 |

Table 4 : Testing Results

| FS (%) | FS | RF Acc (%) | Part Acc (%) |
|---|---|---|---|
| 87% | Chi | 94.788 | 92.291 |
| | Infogain | 94.137 | 93.16 |
| | Gainratio | 93.594 | 94.137 |
| | Relief | 95.114 | 93.185 |
| | SU | 93.16 | 93.16 |
| | OneR | 92.834 | 89.902 |
| | Corr | 93.051 | 92.942 |
| 77% | Chi | 93.485 | 92.508 |
| | Infogain | 94.028 | 93.051 |
| | Gainratio | 94.245 | 92.291 |
| | Relief | 93.485 | 92.617 |
| | SU | 94.028 | 93.051 |
| | OneR | 93.16 | 91.531 |
| | Corr | 93.051 | 92.942 |
| 70% | Chi | 94.245 | 93.268 |
| | Infogain | 94.68 | 93.268 |
| | Gainratio | 94.028 | 94.463 |
| | Relief | 93.811 | 91.965 |
| | SU | 94.137 | 93.268 |
| | OneR | 93.16 | 89.794 |
| | Corr | 93.051 | 92.942 |

VI. Enron Results

More than 96% prediction accuracy is achieved by Random Forest with all seven feature selection methods under consideration, whereas more than 95% prediction accuracy is achieved by PART with almost all seven feature selection methods while training the model. Training and testing results when 100% of the features are considered are given in Table 5.

Table 5 : Results of 100% feature selection

| Classifier | Training | Testing | Time (ms) |
|---|---|---|---|
| Random Forest | 96.181 | 93.623 | 9466 |
| PART | 95.093 | 91.787 | 18558 |

Both training and testing results after feature ranking and subset selection are shown in Table 6 and Table 7.

Table 6 : Training Results

| FS (%) | FS | RF Acc (%) | RF Time (ms) | Part Acc (%) | Part Time (ms) |
|---|---|---|---|---|---|
| 87% | Chi | 96.012 | 4210 | 94.634 | 5961 |
| | Infogain | 96.012 | 4106 | 94.634 | 5839 |
| | Gainratio | 96.012 | 4584 | 94.634 | 5791 |
| | Relief | 96.012 | 4070 | 94.634 | 5806 |
| | SU | 96.012 | 4170 | 94.634 | 5854 |
| | OneR | 96.012 | 4085 | 94.634 | 5856 |
| | Corr | 96.012 | 4147 | 94.634 | 5821 |

Table 7 : Testing Results

From the results above, it can be observed that for Random Forest, after using 87% of the extracted feature set, the training accuracy is 96.012% whereas the computation time is reduced by 51.574% (from 9466 ms to 4584 ms). This shows that the remaining 13% of the features were not contributing towards the classification. Also, it can be observed that for PART, after using 87% and 77% of the extracted feature set, the training accuracy is increased. There is a significant improvement of 1% with 87% feature selection, and the computation time is reduced by 67.879% (from 18558 ms to 5961 ms). This shows that the remaining 30% of the features were not only redundant but were also misleading the classification.
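The reported time reductions follow from the relative difference between the timings with 100% and with 87% of the features. A short check of the arithmetic, using only the timings quoted above (the function name is hypothetical):

```python
def percent_reduction(before_ms, after_ms):
    """Relative reduction in computation time, as a percentage."""
    return 100.0 * (before_ms - after_ms) / before_ms

rf = percent_reduction(9466, 4584)      # Random Forest: 100% vs. 87% of the features
part = percent_reduction(18558, 5961)   # PART: 100% vs. 87% of the features
print(f"{rf:.3f} {part:.3f}")
```

Rounded to three decimals, the two values agree with the 51.574% and 67.879% stated in the text.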

VI. Conclusion

In this paper we have studied previous approaches of spam email detection using machine learning methodologies. We have compared and evaluated the approaches based on the factors such as dataset used; features extracted, ranked and selected; feature selection algorithms used and the results received in terms of accuracy (precision, recall and error rate) and performance (time required).

classifiers to classify spam email detection. For spambase dataset, we acquired the best percentage

1.521% and computation time is reduced by 52% (from 4938 ms – to 2409ms).

R

eferences

R

éférences

R

eferencias

1. “A Study of Spam E-mail classification using Feature Selection package”, R.Parimala, Dr. R. Nallaswamy, National Institute of Technology, Global Journal of Computer Science and Technology, Volume 11 Issue 7 Version 1.0 May 2011.

2. “Comparative Study on Email Spam Classifier using Data Mining Techniques”, R. Kishore Kumar, G. Poonkuzhali, P. Sudhakar, Member, IAENG, Proceedings of the International Multiconference of Engineers and Computer Scientists 2012 Vol I, IMECS 2012, March 14-16, Hong Kong.

3. “Machine Learning Methods for Spam E-mail Classification”, W.A. Awad and S.M. ELseuofi, International Journal of Computer Applications (0975 – 8887) Volume16– No.1, February 2011. 4. “Email Spam Filtering using Supervised Machine

Learning Techniques”, V.Christina, S.Karpagavalli, G.Suganya, (IJCSE) International Journal on Computer Science and EngineeringVol. 02, No. 09, 2010, 3126-3129.

5. “Email Classification Using Data Reduction Method”, Rafiqul Islam and Yang Xiang, member IEEE, School of Information Technology Deakin University, Burwood 3125, Victoria, Australia.

6. “Spam Classification based on Supervised Learning using Machine Learning Techniques”, Ms.D Karthika Renuka, Dr.T.Hamsapriya, Mr.M.Raja Chakkaravarthi, Ms. P. Lakshmi Surya, 978-1-61284-764-1/11/$26.00 ©2011 IEEE.

7. “An Empirical Performance Comparison of Machine Learning Methods for Spam E-mail Categorization”, Chih-Chin Lai, Ming-Chi Tsai, Proceedings of the Fourth International Conference on Hybrid Intelligent Systems (HIS’04) 0-7695-2291-2/04 $ 20.00 IEEE.

2 Year 013 2 56

(DDDDDDDD

)

C

FS

87% Chi 93.43 90.725 Infogain 93.43 90.725 Gainratio 93.43 90.725 Relief 93.43 90.725 SU 93.43 90.725 OneR 93.43 90.725 Corr 93.43 90.725

8. “Introductory Statistics: Concepts, Models, and Applications”, David W. Stockburger.

9. “Feature Subset Selection: A Correlation Based Filter Approach”, Hall, M. A., Smith, L. A., 1997,

G-mail Dataset Test Results

Further, we have tested our Enron model on the dataset created by using emails we have received in our Gmail accounts during the period of last 3 months. The results are shown in the Table 8. In this, experiment we test dataset is completely non-overlapping with the training set allowing us to truly evaluate the performance of our system.

Table 8 :Personal Email Dataset Testing Results

Classifier Testing Accuracy (%)

Random Forest 96

PART 97.33

The datasets available for spam detection are large in number and for such larger datasets Random Forest and Part tend to produce better results with lower error rates and higher precision. So, we used these two

accuracy of 99.918% with Random Forest which is 9% better than previous spambase approaches and 96.416% with Part. For enron dataset, we acquired the best percentage accuracy of 96.181% with Random Forest and 95.093% with Part. Enron dataset is used by [21] in an unsupervised spam learning and detection scheme. The feature selection algorithms used also contributed to achieve better accuracy with lower time complexity due to dimensionality reduction. For Random Forest, after using 70% of the feature set extracted, for spambase data set, the training accuracy remained the same (99.918%) whereas the computation time reduced by 20% (from 1540ms – to 1276ms), whereas for PART, the training accuracy is increased by

(10)

University of California, Irvine, CA, http://www.ics.uci.edu/~mlearn/MLRepository.html, Hettich, S., Blake, C. L., and Merz, C. J.,1998.

14. “Random Forests”, Leo Breiman, Statistics Department University of California Berkeley, CA 94720, January 2001.

15. “Exploiting Partial Decision Trees for Feature Subset Selection in eMail Categorization”, Helmut Berger, Dieter Merkl, Michael Dittenbach, SAC’06 April 2327, 2006, Dijon, France Copyright 2006 ACM 1595931082/06/0004.

16. “C4.5: Programs for Machine Learning”, J. R. Quinlan, Morgan Kaufmann Publishers Inc., 1993. 17. “Fast effective rule induction”, W. W. Cohen, In

Proc. of the Int’l Conf. on Machine Learning, pages 115–123. Morgan Kaufmann, 1995.

18. “Toward optimal feature selection using Ranking methods and classification Algorithms”, Jasmina Novaković, PericaStrbac, DusanBulatović, March

2011.

19. “SpamAssassin”, http://spamassassin.apache.org. 20. The enron spam dataset http://www.aueb.gr/users/

ion/data/enron-spa

21. “A Case for Unsupervised-Learning-based Spam Filtering”, Feng Qian, Abhinav Pathak, Y. Charlie Hu, Z. Morley Mao, Yinglian Xie.

57 Year 013 2

(DDDDDDDD

)

C

International Conference on Neural Information Processing and Intelligent Information Systems, Springer, p855-858.

10. “A practical approach to feature selection”, K. Kira and L. A. Rendell, Proceedings of the Ninth International Conference, 1992.

11. “Very simple classification rules perform well on most commonly used datasets”, Holte, R.C.(1993) Machine Learning, Vol. 11, 63–91.

12. “Induction of decision trees”, J.R. Quinlan, Machine Learning 1, 81-106, 1986.

13. “UCI repository of Machine learning Databases”, Department of Information and Computer Science,
