E-MAIL SPAM DETECTION USING MODIFIED
MACHINE LEARNING ALGORITHM
SITI HAJAR SALMA BT MOHD SALIM
BACHELOR OF COMPUTER SCIENCE (COMPUTER
NETWORK SECURITY) WITH HONOURS
Universiti Sultan Zainal Abidin 2021
DECLARATION
I hereby declare that this report is based on my original work except for quotations and citations, which have been duly acknowledged. I also declare that it has not been previously or concurrently submitted for any other degree at Universiti Sultan Zainal Abidin or other institutions.
________________________________
Name : Siti Hajar Salma bt Mohd Salim
CONFIRMATION
This is to confirm that:
The research conducted and the writing of this report were carried out under my supervision.
__________________________
Name: Miss Nazirah bt Abd Hamid
DEDICATION
First of all, all praise to Allah SWT for giving me life to start and finish this Final Year Project 1. To my parents, Mama and Baba, thank you for always lifting my spirits, supporting me when I was about to stop, and reminding me how far I have come to get here. I would also like to thank my supervisor, Miss Nazirah bt Abd Hamid, for always being a helpful and positive supervisor. I am lucky to have her as my supervisor because she always tries her best to make sure her students do well in finishing this project. A big thank you also to my friends, my housemates, for always being there when I needed them. God knows how stressed I was when starting this project. I was even about to stop everything, but the people around me made me feel that I can and I will succeed. So here I am, dedicating this project to all the people involved in making it happen. I am blessed to have such good people around me. Thank you Allah, thank you for everything.
ABSTRACT
E-mail is one of the most important forms of communication worldwide. Communicating by e-mail is almost instantaneous, which enhances communication by quickly disseminating information and providing fast responses to customer inquiries. Spam has emerged as a major problem in recent years, and the most widely recognised form of spam is e-mail spam. Spam e-mails not only affect organisations financially but also exasperate individual e-mail users and often cause them inconvenience. The objectives of this project are to study how machine learning techniques can be used to classify an e-mail as spam or ham, to design and implement a modified machine learning technique, and to test the efficiency of the algorithm for e-mail spam detection. E-mails are classified as spam or ham: unwanted e-mails are called spam and genuine e-mails are called ham. Many techniques are used to detect spam e-mails, but the accuracy and performance of the algorithms differ from one another. Machine learning is an application of artificial intelligence (AI) that gives systems the ability to automatically learn and improve from experience without being explicitly programmed. This project is designed and developed to show that the modified algorithm can improve the effectiveness and efficiency of detecting an e-mail as spam or ham.
ABSTRAK
E-mail is one of the most important forms of communication around the world, and communication by e-mail is fast, which improves communication by spreading information quickly and giving fast responses to customer inquiries. Spam has emerged as a major problem in the past few years, and the best-known form of spam is e-mail spam. E-mail spam not only affects organisations financially but also disturbs e-mail users and often causes them difficulties. The objectives of this project are to study how to use Machine Learning techniques to classify and decide whether an e-mail is spam or ham, to design and implement a modified Machine Learning technique, and to test the efficiency of the algorithm for E-Mail Spam Detection Using Machine Learning Algorithm. E-mails are classified as spam or ham. Unwanted e-mails are called spam and genuine e-mails are called ham. Many techniques are used to detect spam e-mails, but the accuracy and performance of the algorithms differ from one another. Machine Learning is an application of Artificial Intelligence (AI) that gives systems the ability to learn and improve automatically from experience without being explicitly programmed. This project is designed and developed to show that the modified algorithm makes it possible to improve the effectiveness and efficiency of detecting an e-mail as spam or ham.
TABLE OF CONTENTS
CONFIRMATION
DECLARATION
DEDICATION
ABSTRACT
ABSTRAK
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
LIST OF ABBREVIATIONS
LIST OF APPENDICES
CHAPTER 1 INTRODUCTION
1.1 Background
1.2 Problem Statement
1.3 Objective
1.4 Scope
1.5 Limitation of Work
1.6 Structure Of Thesis
1.7 Summary
CHAPTER 2 LITERATURE REVIEW
2.1. Introduction
2.2. Machine Learning
2.3. Related Work
2.4. Expected Result
2.5. Summary
CHAPTER 3 METHODOLOGY
3.1 Introduction
3.2 Framework
3.2.1 Data Source
3.2.2 Data Sets
3.3 Flowchart of The Project
3.4 Performance Evaluation
3.5 Model Evaluation Results
3.6 System Design
3.6.1 Data Pre-processing
3.7 Summary
CHAPTER 4
4.1 Conclusion
REFERENCES
LIST OF TABLES
LIST OF FIGURES
Figure 3.2.1: Framework
Figure 3.2.2: Kaggle Website
Figure 3.2.3: UCI Machine Learning Website
Figure 3.2.4: GitHub Website
Figure 3.2.5: Dataset in arff format
Figure 3.3.1: Flowchart
Figure 3.4.1: Naïve Bayes’s Result
Figure 3.4.2: Decision Table’s Result
Figure 3.4.3: Multilayer Perceptron’s Result
Figure 3.6.1: Download dataset UCI Machine Learning dataset
Figure 3.6.2: Choose raw dataset
Figure 3.6.3: Go to Pre-process tab
Figure 3.6.4: Click unsupervised and choose remove percentage
Figure 3.6.5: Click apply
Figure 3.6.6: Save as training.arff
Figure 3.6.7: Click Undo tab
Figure 3.6.8: Change invertSelection to ‘True’
Figure 3.6.9: Save as Testing
Figure 3.6.10: Go to preprocess and insert training.arff
Figure 3.6.11: Go to classify and choose Naïve Bayes algorithm
Figure 3.6.12: At the test option, choose supplied test set
Figure 3.6.13: Click open file
Figure 3.6.14: Choose testing.arff and run by clicking ‘Start’
Figure 3.6.15: The result for Naïve Bayes algorithm
LIST OF ABBREVIATIONS
AI Artificial Intelligence
SVM Support Vector Machine
E-Mail Electronic Mail
WEKA Waikato Environment for Knowledge Analysis
T True
F False
MLP Multilayer Perceptron
LIST OF APPENDICES
CHAPTER 1
INTRODUCTION
1.1 Background
Nowadays, spam is no longer a new issue, especially for people who use electronic devices every day to complete their tasks. There are many definitions of spam, but the simplest is the mass transmission of unwanted messages. Spam is usually sent to a large number of users for a variety of purposes such as advertising, phishing and spreading malware, in other words for inappropriate purposes. Spam has emerged as a major problem in recent years, and the most widely recognised form of spam is e-mail spam [1].
E-mail is one of the most important forms of communication worldwide. Communicating by e-mail is almost instantaneous, which enhances communication by quickly disseminating information and providing fast responses to customer inquiries. However, e-mail spam affects not only individuals but also big organisations. This is because the large amount of spam traffic between the e-mail server and the host can delay the sending of important e-mails.
Spam is unsolicited commercial e-mail. While many consider spam a trivial
nuisance rather than an attack, it has been used as a means of enhancing malicious code
attacks. In March 2002, there were reports of malicious code embedded in MP3 files
that were included as attachments to spam. The most significant consequence of spam,
however, is the waste of computer and human resources. Many organizations attempt
to cope with the flood of spam by using e-mail filtering technologies. Other
organizations simply tell users of the mail system to delete unwanted messages [2].
E-mails fall into two classes, spam and ham. Unwanted e-mails are called spam and genuine e-mails are called ham. Many techniques are used to detect spam e-mails, but the accuracy and performance of the algorithms differ from one another [3]. Machine learning is an application of artificial intelligence (AI) that gives systems the ability to automatically learn and improve from experience without being explicitly programmed. There are many techniques in AI, such as heuristics, Support Vector Machines (SVM) and Artificial Neural Networks. This project is designed and developed to show that the modified algorithm can improve the effectiveness and efficiency of detecting an e-mail as spam or ham.
1.2 Problem Statement
E-mail spam is still a problem today. It is not only annoying but also dangerous to users. Even though antivirus software has come a long way, PCs infected with trojans and bots are still the major sources of spam. In the time it takes for spam filters to analyse the content of a message, e-mail spam might already have been allowed into the system. Besides, e-mail spam affects not only individuals but also big organisations, because the large amount of spam traffic between the e-mail server and the host can delay the sending of important e-mails. Other than that, many techniques are used to detect spam e-mails, but the accuracy and performance of the algorithms differ from one another.
1.3 Objective
There are three objectives that will be achieved in this project:
1. To study the use of machine learning techniques for classifying an e-mail as spam or ham.
2. To implement e-mail spam detection using a modified machine learning technique.
3. To test the effectiveness and efficiency of the modified algorithm for E-Mail Spam Detection.
1.4 Scope
The project scope is one of the most important parts of this project. It involves determining and documenting a list of specific project goals. The project scope has two phases. Phase one is to find the best algorithm: Weka is used as a tool to find the best machine learning algorithm for classifying e-mail spam. Phase two is to modify the algorithm: based on the results from the previous phase, the algorithm is modified to improve the effectiveness and efficiency of detecting whether an e-mail is spam or ham.
1.5 Limitation of Work
A limitation of work is a constraint on what the project can do. The limitation of this project is that it applies only to e-mail users. Besides, this project classifies e-mail spam only and cannot address problems other than spam.
1.6 Structure Of Thesis
This report is organised into the following chapters.
Chapter 1 introduces the project, covering its background, objectives and scope. Chapter 2 addresses the related works and studies on spam classification carried out by other researchers. Chapter 3 presents the flow of the project's work, including the framework and flowchart. Finally, Chapter 4 concludes the project.
1.7 Summary
The discussion on the context of the project, the problem statement, aims and
CHAPTER 2
LITERATURE REVIEW
2.1. Introduction
The literature review is a chapter that discusses resources or existing projects that are adapted for the ongoing project. Some articles, journals and theses are used as
2.2. Machine Learning
In this project, the latest machine learning algorithm is used and updated to suit the project's needs. The reason is that machine learning algorithms are adept at analysing vast quantities of data. Because the amount of data processed keeps increasing, the algorithm normally improves over time: more data gives the algorithm more information, which can be used to make better forecasts.
Machine learning enables instantaneous adaptation without human interference. It detects emerging threats and patterns and takes the necessary steps. It also saves time
2.3. Related Work
Much research has been performed using a range of techniques to identify and recognise spam e-mails.
Nur Syamimi bt Aziz (2020) performed research on the Support Vector Machine algorithm. In this paper, she used a Support Vector Machine (SVM) to explore how text classification techniques can be used to dig into data.
Nur Syaidatul Amirah Bt Abdul Hamid (2019) conducted research on a binary classifier. In this paper, she found that a Machine Learning platform is able to speed up the analysis of very large data.
P. Priyatharsini and C. Chandrasekar (2017) performed research in data mining. In this paper, they used a decision tree algorithm to classify e-mail spam.
Priti Sharma and Uma Bhardwaj (2017) conducted research on spam e-mail detection. In this paper, they found that a hybrid bagged approach based SMD system achieved an overall accuracy of 87.5%.
Sumant Sharma and Amit Arora (2013) performed research using an adaptive approach. In this paper, they used a Support Vector Machine (SVM) to identify spam e-mail accounts. They also obtained their dataset from the UCI Machine Learning Repository.
Table 2.3.1: Related Work

| No | Year | Title | Author | Objective | Methodology | Other Findings |
|----|------|-------|--------|-----------|-------------|----------------|
| 1 | 2020 | Classifying Sentiments In Social Media: Integrating Text-Based Affective Method With Support Vector Machine Algorithm | Nur Syamimi Binti Aziz | To explore how text classifying techniques can be used to dig into some data | Support Vector Machine (SVM) | This project used Rapid Miner as open-source software for classifying sentiments in social media |
| 2 | 2019 | Spam Detection Using Machine Learning Based Binary Classifier | Nur Syaidatul Amirah Bt Abdul Hamid | To detect spam using a machine learning based binary classifier | Binary Classifier | Machine Learning platform has abilities to speed up analysing of gigantic data |
| 3 | 2017 | Email Spam Filtering Using Classifiers in Data Mining | P. Priyatharsini, C. Chandrasekar | To classify the email spam | Decision Tree | The machine learner's task is to search for patterns and construct mathematical models |
| 4 | 2017 | Machine Learning Based Spam E-Mail Detection | Priti Sharma, Uma Bhardwaj | To propose machine learning based hybrid | Naïve Bayes | The overall accuracy of 87.5% achieved by the hybrid bagged approach based SMD system |
| 5 | 2013 | Adaptive Approach for Spam Detection | Sumant Sharma, Amit Arora | To identify spam email account | Support Vector Machine (SVM) | The dataset used for the project is the SPAMBASE dataset downloaded from the UCI Machine Learning Repository |
2.4. Expected Result
By the end of this project, the Spam Email Detection Using Machine Learning system is expected to be completed and to work successfully so that it benefits its users.
2.5. Summary
The system and model used in the suggested project were discussed in this chapter.
Based on past research papers and journals, the process and model are selected. There
CHAPTER 3
METHODOLOGY
3.1 Introduction
This chapter describes the flow of the project from start to finish to achieve the goal of
the project. Many methodologies for detecting email spam have been used. So, finding
and choosing the right methodology is important so that the project goes well with the
3.2 Framework
This project comprises two phases. The first phase is to find the best algorithm: Weka is used as a tool to find the best machine learning algorithm for classifying e-mail spam. The second phase is to modify the algorithm: based on the results from the previous phase, the algorithm is modified to improve the effectiveness and efficiency of detecting whether an e-mail is spam or ham.
Figure 3.2.1: Framework. Stages: select .arff dataset; data preprocessing; insert dataset into Weka tool to find the best algorithm; evaluate the model; modify the best algorithm; detect whether an email is ham or spam.
3.2.1 Data Source
Data collection is difficult because of numerous limitations, for example the volume of data and the throughput required for proper and timely ingestion. The dataset used in this project is real data that can be downloaded from a repository site for machine learning data. Websites that can be consulted to obtain a dataset for this project include Kaggle (Figure 3.2.2), the UCI Machine Learning Repository (Figure 3.2.3) and GitHub (Figure 3.2.4).
Figure 3.2.2: Kaggle Website
Figure 3.2.3: UCI Machine Learning Website
3.2.2 Data Sets
This dataset consists of e-mails originating from filed work and private e-mails.
3.3 Flowchart of The Project
A flowchart is important for illustrating the sequence of operations needed to finish the project. It uses symbols to represent a process: each step in the process is represented by a different symbol and contains a short description of that step. The flowchart begins at the top with Start.
The data model flow is important for this project because it explains how the project should be designed and how its methods are connected to each other. It helps to structure the project process seamlessly and clearly.
Following the framework in Figure 3.2.1, WEKA has been used as the platform to develop this project. Firstly, the functions of WEKA were studied and explored to make sure the project succeeds. The input data of this project is the dataset from the UCI Machine Learning Repository, which was already in ARFF format.
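To make the ARFF input format concrete, the following is a minimal sketch of how such a file is laid out and how it could be read outside WEKA. It assumes a simplified ARFF file (a header of `@attribute` lines followed by comma-separated rows after `@data`); the function name `parse_arff`, the sample text and its attribute names are hypothetical and are not taken from the project's dataset.

```python
def parse_arff(text):
    """Parse a simplified ARFF document into (attribute_names, rows)."""
    attributes, rows, in_data = [], [], False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):    # skip blank lines and comments
            continue
        lower = line.lower()
        if lower.startswith("@attribute"):
            attributes.append(line.split()[1])  # second token is the attribute name
        elif lower.startswith("@data"):
            in_data = True                      # everything after this is data
        elif in_data:
            rows.append([value.strip() for value in line.split(",")])
    return attributes, rows


# Hypothetical three-attribute excerpt, not the real dataset
SAMPLE = """\
% illustrative sample only
@relation spam_sample
@attribute word_freq_free numeric
@attribute char_freq_dollar numeric
@attribute class {0,1}
@data
0.32,0.18,1
0.00,0.00,0
"""

names, data = parse_arff(SAMPLE)
print(names)
print(data)
```

The last attribute plays the role of the class label (spam or ham), which is the value the classifiers in WEKA are asked to predict.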
Data pre-processing begins with reformatting the dataset into two different files, one for training and one for testing. Next, the formatted data is imported into WEKA. Then, in the next phase, the formatted data is trained and tested in WEKA by running it with existing algorithms. The algorithms that have been used are Decision Table, Naïve Bayes and Multilayer Perceptron.
Next in the evaluate model phase, the results from the training and testing would
be evaluated and analysed and the process for this phase will be repeated by using
After that, the next phase is to modify the algorithm. Based on the results from the previous phase, the algorithm is modified to improve the effectiveness and efficiency of detecting whether an e-mail is spam or ham.
Figure 3.3.1: Flowchart. Elements: Start; Input; Data processing; Insert dataset into WEKA; Classify algorithm; Evaluate the model; Modify the best algorithm; Process for detecting the email; Spam or Ham; End.
3.4 Performance Evaluation
A preliminary experiment was carried out to test the accuracy of the algorithms. The result for Naïve Bayes is captured in Figure 3.4.1, Figure 3.4.2 is for Decision Table and Figure 3.4.3 is for Multilayer Perceptron.
3.4.1 Naïve Bayes: 77.983%
3.4.2 Decision Table: 87.0246%
3.4.3 Multilayer Perceptron: 88.2634%
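As an illustration of how such accuracy figures are obtained and compared, the sketch below computes accuracy from confusion-matrix counts and picks the algorithm with the highest reported value. The three percentages are the ones reported in this section; the counts passed to `accuracy` are hypothetical and serve only to show the formula.

```python
def accuracy(tp, tn, fp, fn):
    """Percentage of e-mails classified correctly (spam as spam, ham as ham)."""
    return 100.0 * (tp + tn) / (tp + tn + fp + fn)


# Accuracies reported in this section (percent)
reported = {
    "Naive Bayes": 77.983,
    "Decision Table": 87.0246,
    "Multilayer Perceptron": 88.2634,
}
best = max(reported, key=reported.get)

# Hypothetical confusion-matrix counts, for illustration only
example = accuracy(tp=40, tn=50, fp=6, fn=4)

print(best)
print(example)
```

Accuracy alone does not distinguish between ham wrongly flagged as spam and spam that slips through, so the confusion-matrix counts themselves are also worth inspecting when choosing the algorithm to modify.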
3.5 Model Evaluation Results
Naïve Bayes
It is a classification technique with an assumption of independence among
predictors based on Bayes' Theorem. A Naive Bayes classifier assumes, in simple terms,
that the existence of a certain feature in a class is unrelated to the presence of any other
feature.[7]
The Naive Bayes model is easy to construct and especially helpful for very large data sets. Along with its simplicity, Naive Bayes is known to outperform even very sophisticated classification methods.[7]
Bayes' theorem provides a way to calculate the posterior probability P(c|x) from P(c), P(x) and P(x|c):

$$P(c \mid x) = \frac{P(x \mid c)\,P(c)}{P(x)}$$

where
P(c|x) is the posterior probability of class (c, target) given predictor (x, attributes).
P(c) is the prior probability of class.
P(x|c) is the likelihood, which is the probability of predictor given class.
P(x) is the prior probability of predictor.
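A short worked example may help to show how Bayes' theorem turns these quantities into a spam probability. All numbers below are hypothetical: it is assumed that 40% of e-mails are spam and that the word "free" appears in 50% of spam and 5% of ham.

```python
def posterior(prior_c, likelihood_x_given_c, evidence_x):
    """Bayes' theorem: P(c|x) = P(x|c) * P(c) / P(x)."""
    return likelihood_x_given_c * prior_c / evidence_x


# Hypothetical priors and likelihoods, for illustration only
p_spam, p_ham = 0.4, 0.6
p_free_given_spam, p_free_given_ham = 0.5, 0.05

# P(x) by the law of total probability over both classes
p_free = p_free_given_spam * p_spam + p_free_given_ham * p_ham

p_spam_given_free = posterior(p_spam, p_free_given_spam, p_free)
print(round(p_spam_given_free, 4))
```

With these assumed figures, an e-mail containing the word is far more likely to be spam than the 40% prior alone suggests, which is the kind of evidence update the classifier relies on.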
Advantages
It is easy and fast to predict the class of a test data set. It also performs well in multi-class prediction.
When the assumption of independence holds, a Naive Bayes classifier performs better compared to other models such as logistic regression, and less training data is needed.
It performs well with categorical input variables compared to numerical variables. For numerical variables, a normal distribution is assumed (bell curve, which is a strong assumption).
Drawbacks
If a categorical variable has a category in the test data set that was not observed in the training data set, the model will assign it a probability of 0 (zero) and will be unable to make a prediction. This is often known as “Zero Frequency”. To solve this, a smoothing technique can be used. One of the simplest smoothing techniques is called Laplace estimation.
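The zero-frequency problem and its fix can be sketched in a few lines. The example assumes add-one (Laplace) smoothing over a known list of categories; the function name `smoothed_likelihood`, the category values and the observations are all hypothetical.

```python
from collections import Counter


def smoothed_likelihood(value, observed_values, categories, alpha=1):
    """Estimate P(value | class) with add-alpha (Laplace) smoothing."""
    counts = Counter(observed_values)
    return (counts[value] + alpha) / (len(observed_values) + alpha * len(categories))


categories = ["free", "meeting", "invoice"]   # hypothetical feature values
seen_in_spam = ["free", "free", "invoice"]    # "meeting" never seen in spam

# Without smoothing, the unseen category gets probability zero
unsmoothed = Counter(seen_in_spam)["meeting"] / len(seen_in_spam)

# With smoothing, every category keeps a small non-zero probability
smoothed = smoothed_likelihood("meeting", seen_in_spam, categories)
total = sum(smoothed_likelihood(c, seen_in_spam, categories) for c in categories)

print(unsmoothed, smoothed, total)
```

Adding the same constant to every count keeps the estimates a valid probability distribution while preventing a single unseen value from forcing the whole product of likelihoods to zero.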
On the other hand, Naive Bayes is also known to be a bad estimator, so the probability outputs from predict_proba are not to be taken too seriously.
Another limitation of Naive Bayes is the assumption of independent predictors. In real life, it is almost impossible to get a set of predictors that are completely independent.[7]
Decision Table
A Decision Table is a tabular representation of inputs versus rules, cases or test conditions. It is a very effective tool for both complex software testing and requirements management. The Decision Table helps to check all possible combinations of conditions for testing, and testers can also easily identify missed conditions. The conditions are indicated as True (T) and False (F) values.[8]
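The idea of listing every True/False combination of conditions against an outcome can be sketched as follows. The two conditions and the rule in `action` are hypothetical and chosen only to illustrate the table layout; they are not the rules learned by WEKA's Decision Table classifier.

```python
from itertools import product

conditions = ["contains_link", "unknown_sender"]   # hypothetical conditions


def action(contains_link, unknown_sender):
    """Hypothetical rule: flag the e-mail only when both conditions are True."""
    return "spam" if contains_link and unknown_sender else "ham"


# One row per combination of True (T) / False (F) condition values
table = [
    (combo, action(*combo))
    for combo in product([True, False], repeat=len(conditions))
]

for combo, outcome in table:
    flags = " ".join("T" if value else "F" for value in combo)
    print(flags, outcome)
```

Because the table enumerates all combinations, a missing or contradictory rule becomes visible as a row without a sensible outcome, which is the coverage benefit described above.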
Advantages
When the system behaviour is different for different inputs and not the same for a range of inputs, equivalence partitioning and boundary value analysis will not help, but a decision table can be used.[8]
The representation is simple, so it can be easily understood, and it is used for both development and business purposes. The table helps to build effective combinations and can ensure better test coverage.[8]
Any complex business conditions can easily be turned into decision tables. When 100 percent coverage is the goal, which is typical when the number of input combinations is low, this technique can ensure that coverage.[8]
Drawbacks
The primary downside is that the table can become more complex as the number of inputs increases.
Multilayer Perceptron
Multilayer Perceptrons, or MLPs for short, are the classical type of neural network. They are made up of one or more layers of neurons. Data is fed to the input layer, there may be one or more hidden layers providing levels of abstraction, and predictions are made on the output layer, also called the visible layer.[9]
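The flow of data from the input layer through a hidden layer to the output layer can be sketched as a single forward pass. This is a minimal illustration with hypothetical, hand-picked weights and a sigmoid activation; it is not the network trained in WEKA, where the weights are learned from the training data.

```python
import math


def sigmoid(z):
    """Squash a weighted sum into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def layer(inputs, weights, biases):
    """One fully connected layer: weighted sum per neuron, then sigmoid."""
    return [
        sigmoid(sum(w * x for w, x in zip(neuron_weights, inputs)) + bias)
        for neuron_weights, bias in zip(weights, biases)
    ]


# Hypothetical weights: 2 inputs -> 2 hidden neurons -> 1 output neuron
hidden_w, hidden_b = [[0.5, -0.5], [1.0, 1.0]], [0.0, -1.0]
output_w, output_b = [[1.0, -1.0]], [0.0]

features = [1.0, 1.0]                        # values fed to the input layer
hidden = layer(features, hidden_w, hidden_b)
output = layer(hidden, output_w, output_b)   # prediction from the output layer
label = "spam" if output[0] >= 0.5 else "ham"

print(hidden, output, label)
```

Training an MLP consists of adjusting these weights and biases so that the output matches the known labels; the forward pass itself stays exactly as shown.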
MLPs are suitable for classification prediction problems where inputs are assigned a class or label.
They are also suitable for regression prediction problems where a real-valued quantity is predicted given a set of inputs. Data is often provided in a tabular format, as in a spreadsheet or a CSV file.[10]
Use MLPs For
- Tabular datasets
- Classification prediction problems
- Regression prediction problems
Advantages
They are very flexible and can be used generally to learn a mapping from inputs to outputs.
3.6 System Design
The dataset is downloaded from the UCI Machine Learning Repository. The input data of this project is this dataset, which was already in ARFF format.
Figure 3.6.1: Download dataset UCI Machine Learning dataset
3.6.1 Data Pre-processing
Data pre-processing begins with reformatting the dataset into two different files, one for training and one for testing. Next, the formatted data is imported into WEKA. Then, in the next phase, the formatted data is trained and tested in WEKA by running it with existing algorithms. The algorithms that have been used are Decision Table, Naïve Bayes and Multilayer Perceptron.
29
Figure 3.6.3: Go to Pre-process tab
30
Figure 3.6.5: Click apply
31
Figure 3.6.7: Click Undo tab
32
Figure 3.6.9: Save as Testing
33
Figure 3.6.11: Go to classify and choose Naïve Bayes algorithm
34
Figure 3.6.13: Click open file
35
36
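The training and testing split performed with the remove percentage filter can be sketched outside WEKA as follows. This is an illustration under stated assumptions: the 30% figure is hypothetical (the percentage used is not stated in this report), the helper `remove_percentage` is a stand-in rather than WEKA's own code, and it is assumed that the filter drops rows from the start of the dataset and that inverting the selection keeps exactly those rows instead.

```python
def remove_percentage(rows, percentage, invert_selection=False):
    """Drop the first `percentage` percent of rows, or keep only them when inverted."""
    cut = round(len(rows) * percentage / 100)
    return rows[:cut] if invert_selection else rows[cut:]


rows = list(range(10))   # stand-in for 10 dataset instances

# Filter applied once: the remaining rows are saved as training.arff
training = remove_percentage(rows, 30)

# After undo, invertSelection set to True: the removed rows are saved as the testing file
testing = remove_percentage(rows, 30, invert_selection=True)

print(training)
print(testing)
```

Applying the same percentage twice, once normally and once inverted, guarantees that the two files do not overlap and together cover the whole dataset.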
3.7 Summary
Methodology is one of the most significant parts of system and application development. A number of different software development methodologies are available that can be used to create any form of application. The correct approach allows the project to be carried out within the defined time.
The activities in the technique at each point are clarified so that it can be easily
CHAPTER 4
4.1 Conclusion
In conclusion, there are many techniques for classifying spam. However, no technique yet gives a fully efficient and accurate result.
Each algorithm has its own benefits and drawbacks. The experiments were performed in Weka to find the best algorithm. The algorithm that provides the most accurate result will be chosen. In order to make it more efficient and accurate, it is
REFERENCES
[1] Sharma, S. and Arora, A., 2013. Adaptive approach for spam
detection. International Journal of Computer Science Issues (IJCSI), 10(4), p.23.
[2] Whitman, M.E. and Mattord, H.J., 2011. Principles of information security.
Cengage Learning.
[3] Priyatharsini, P. and Chandrasekar, C., 2017. Email Spam Filtering Using Classifiers in Data Mining.
[4] Sharma, P. and Bhardwaj, U., 2018. Machine Learning based Spam E-Mail
Detection. International Journal of Intelligent Engineering & Systems, 11(3),
pp.1-10.
[5] Nur Syamimi Binti Aziz, 2020. Classifying Sentiments In Social Media: Integrating Text-Based Affective Method With Support Vector Machine Algorithm. Universiti Sultan Zainal Abidin.
[6] Nur Syaidatul Amirah Binti Abdul Hamid, 2019. Spam Detection Using Machine Learning Based Binary Classifier. Universiti Sultan Zainal Abidin.
[7] Analytics Vidhya (2017). Learn Naive Bayes Algorithm | Naive Bayes Classifier
Examples. [online] Analytics Vidhya. Available at:
https://www.analyticsvidhya.com/blog/2017/09/naive-bayes-explained/ [Accessed
[8] Rungta, K. (2020). Decision Table Testing: Learn with Example. [online]
Guru99.com. Available at: https://www.guru99.com/decision-table-testing.html
[Accessed 19 Jan. 2021].
[9] MissingLink.ai. (2017). MissingLink.ai. [online] Available at:
https://missinglink.ai/guides/neural-network-concepts/perceptrons-and-multi-layer-perceptrons-the-artificial-neuron-at-the-core-of-deep-learning/ [Accessed 19
Jan. 2021].
[10] Brownlee, J. (2018). When to Use MLP, CNN, and RNN Neural Networks. [online] Machine Learning Mastery.
Available at: