Volume 6, Issue 02, February 2020 (ISSN: 2394 – 6598)
©IJETIE 2020
A Competent Spam Prediction Technique by Supervised Deep Learning Classifiers
Durga Maariappan K1, Dhanush Rao1, Giridhar B1, T. Sangeetha2
1Department of IT, Sri Krishna College of Technology
2Assistant Professor, Department of IT, Sri Krishna College of Technology
ABSTRACT
Over the last couple of decades, the number of mobile phone users has increased rapidly, since SMS is easy to use, inexpensive and independent of the phone's operating system. Short Message Service (SMS) has become one of the most popular communication media in the world: 3.5 billion users, or 80% of active users worldwide, use mobile SMS as a communication medium. Out of this huge number of SMS, a large share is spam, generated by offenders for a number of reasons. Spam SMS helps criminal gangs grow stronger and carry out different criminal activities. The detection of spam SMS has received growing attention from researchers in recent times and has been treated with a number of different machine learning approaches. The traditional methods do not produce good results. In this work, the proposed approach uses deep learning classifiers to predict spam messages.
Keywords: Deep Learning, Communication Medium, Short Message Service, Classifiers
1. INTRODUCTION
Data Analytics refers to the set of quantitative and qualitative approaches for deriving valuable insights from data. It involves many processes that include extracting data and categorizing it in order to derive various patterns, relations, connections, and other such valuable insights from it. Today, almost every organization has morphed itself into a data-driven organization, and this means that they are deploying an approach to collect more data that is related to their customers, markets, and business processes. This data is then categorized, stored, and analysed to make sense out of it and derive valuable insights from it.
The term ‘Data Analytics’ is not as simple as it appears to be. It is among the most complex terms when it comes to big data applications. Predictive analytics can also ensure that the domain of big data can be deployed for predicting the future based on the present data. A good example of predictive analytics is the deployment of analytical aspects to the sales cycle of an enterprise. It starts with lead source analysis, analysing the type of communication, the number of communications and the channels of communication, along with sentiment analysis through heightened use of Machine Learning algorithms, in order to arrive at a sound predictive analysis methodology for the enterprise. The use of the Internet has increased extensively over the past decade and continues to be on the ascent. Hence it is apt to say that the Internet is gradually becoming an integral part of everyday life. Internet usage is expected to continue growing, and e-mail has become a powerful tool for exchanging ideas and information. Negligible time delay
during transmission, security of the data being
transferred, and low cost are a few of the many advantages that e-mail enjoys over physical methods. However, there are a few issues that spoil the efficient usage of email. Spam email is one among them [3]. In recent years, spam email or, more properly, Unsolicited Bulk Email (UBE) has become a widespread problem on the Internet. Spam email is so cheap to send that unsolicited messages are sent to a large number of users indiscriminately. When a large number of spam messages are received, it takes a long time to separate spam from non-spam email, and the volume of messages may cause the mail server to crash. Mobile or SMS spam is a real and growing problem, primarily due to the availability of very cheap bulk pre-pay SMS packages and the fact that SMS engenders higher response rates as it is a trusted and personal service. SMS spam filtering is a relatively new task which inherits many issues and solutions from email spam filtering. However, it poses its own specific challenges.
2. RELATED WORKS
Spam mail, also called unsolicited bulk e-mail or junk mail, is mail that is sent to a group of recipients who have not requested it. The task of spam filtering is to rule out unsolicited e-mails automatically from a user's mail stream. These unsolicited mails have already caused many problems such as filling mailboxes, engulfing important personal mail, wasting network bandwidth, and consuming users' time and energy to sort through them, not to mention all the other problems associated with spam (crashed mail servers, pornography adverts sent to children, and so on) [4]. According to a series of surveys conducted by CAUBE.AU, the total number of spams received by 41 email addresses has increased by a factor of six in two years (from 1753 spams in 2000 to 10,847 spams in 2001) [7].
Therefore, it is challenging to develop spam filters that can effectively eliminate the increasing volumes of unwanted mails automatically before they enter a user's mailbox. D. Puniskis [5] applied the neural network approach to the classification of spam. His method employs attributes composed of descriptive characteristics of the evasive patterns that spammers employ, rather than using the context or frequency of keywords in the message. The data used is a corpus of 2788 legitimate and 1812 spam emails received during a period of several months. The results show that the ANN performs well but is not suitable for use alone as a spam filtering tool. In [6], email data was classified using four different classifiers (Neural Network, SVM classifier, Naïve Bayesian Classifier, and J48 classifier). The experiment was performed with different data sizes and different feature sizes.
The final classification result should be ‘1’ if the message is judged to be spam and ‘0’ otherwise. Although supervised learning techniques feature in the majority of recent work in SMS spam filtering, other machine learning approaches have also been investigated. An early centralised spam filtering solution was suggested by Dixit, Gupta, and Ravishankar (2005) where, rather than using the more standard text classification approaches, the SMS messages were represented as a character-based vector which was projected into a smaller normalised feature space and clustered to identify clusters of spam and non-spam messages. New messages are classified based on their distance from the known spam and non-spam clusters. This approach is motivated by the lack of keywords available for the normal classification algorithms due to the short length of the SMS messages, but the efficacy of this approach at classifying spam was not evaluated. The behaviour of spam senders over time can be indicative of whether a given message is spam or not. Hu and Yan (2010) add a frequency analysis of SMS traffic to an existing spam filter with the goal of improving the central system’s real-time processing speed. By considering the frequency of spam messages received during
different time periods and at different locations, they
focus filtering on specific time periods and locations.
Their approach improves the throughput of the system greatly, but at the cost of a large decrease in spam detection and a significant rise to 2.5% in the false positive rate. Non-content-based technologies such as social network analysis have become popular in the email filtering area (Boykin & Roychowdhury, 2005; Lam & Yeung, 2007; Tseng & Chen, 2009). Network analysis approaches are address-based filtering approaches which aim to predict whether a sender is a potential spammer or not. This is different from the objective of the content and collaborative filtering techniques, which is to predict whether the message itself is spam or not. There is some evidence that these techniques are starting to be used for SMS filtering. Wang et al. (2010) presented an interesting solution for point-to-point SMS messages, those sent from one mobile device to another, which combines social network analysis with spectral analysis of message submission behaviour. They generate a directed graph from message logs and suggest two kinds of filters, an offline filter and an online filter. The offline filter uses features from a one-hop social network that models longer-term sender behaviour, while the online filter focuses on how many receivers a sender has sent to in a given time period, which is extracted from a two-hop social network and combined with temporal spectral analysis of submission behaviour.
They suggest that their approaches can be combined with content-based approaches either serially, where results of independent filter systems can be combined, or sequentially, where the behaviour-based filter can provide input to the content-based system or vice versa.
Spam filtering has also been studied recently in related short text message domains. There is significant evidence of spam in social networks, including instant message spam (aka spim) and Twitter spam. Typically, fake or bot accounts are used to automatically send messages or tweets that contain links that can be used to gather marketing information or for more malicious or phishing purposes. It has been recently reported that just 35% of the average Twitter user’s followers are real people.
2.1. DATASET
The SMS Spam Collection is a set of tagged SMS messages that have been collected for SMS spam research. It contains one set of 5,574 SMS messages in English, each tagged as ham (legitimate) or spam. Format: .csv. Each line is composed of two columns: v1 contains the label (ham or spam) and v2 contains the raw text. Source: UCI Machine Learning Repository. Our dataset consists of one large text file in which each line corresponds to a text message. Therefore, pre-processing of the data, extraction of features, and tokenization of each message are required.
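The loading and tokenization steps can be sketched as follows. This is only an illustration under stated assumptions: the column names v1 and v2 come from the dataset description above, while the helper names (load_sms, tokenize, bag_of_words), the regular expression and the two sample rows are hypothetical and are not taken from the authors' pipeline.

```python
import csv
import io
import re

def load_sms(file_obj):
    """Read (label, text) pairs from a CSV with columns v1 (label) and v2 (text)."""
    reader = csv.DictReader(file_obj)
    return [(row["v1"].strip().lower(), row["v2"]) for row in reader]

def tokenize(text):
    """Lowercase a message and split it into simple word/number tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def bag_of_words(tokens):
    """Count how often each token occurs in one message."""
    counts = {}
    for token in tokens:
        counts[token] = counts.get(token, 0) + 1
    return counts

# Two made-up rows standing in for the real file (hypothetical sample data).
SAMPLE = io.StringIO(
    "v1,v2\n"
    "ham,See you at lunch?\n"
    'spam,"WINNER!! Claim your FREE prize now, call 0800 000 000"\n'
)

messages = load_sms(SAMPLE)
features = [(label, bag_of_words(tokenize(text))) for label, text in messages]
```

With the real file, an opened file handle would be passed to load_sms in place of the in-memory sample. A bag-of-words count is only one possible feature representation; the paper does not state which features were extracted.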
3. MACHINE LEARNING METHODS
3.1. DECISION TREE
The Decision Tree algorithm belongs to the family of supervised learning algorithms. Unlike many other supervised learning algorithms, the decision tree algorithm can be used for solving both regression and classification problems. The general motive of using a Decision Tree is to create a training model which can be used to predict the class or value of target variables by learning decision rules inferred from prior data (training data). Decision Trees are easy to understand compared with other classification algorithms. The decision tree algorithm tries to solve the problem by using a tree representation. Each internal node of the tree corresponds to an attribute, and each leaf node corresponds to a class label.
3.2. DECISION TREE ALGORITHM PSEUDOCODE
Step 1: Place the best attribute of the dataset at the root of the tree.
Step 2: Split the training set into subsets. Subsets should be made in such a way that each subset contains data with the same value for an attribute.
Step 3: Repeat step 1 and step 2 on each subset until you find leaf nodes in all the branches of the tree.
In decision trees, to predict a class label for a record we start from the root of the tree. We compare the value of the root attribute with the record’s attribute. On the basis of this comparison, we follow the branch corresponding to that value and jump to the next node. We continue comparing the record’s attribute values with other internal nodes of the tree until we reach a leaf node with the predicted class value. This is how the modelled decision tree can be used to predict the target class or value.
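The three steps and the prediction walk can be sketched in a few lines of Python. The pseudocode does not say how the "best" attribute is chosen; the sketch below assumes information gain (as in ID3), which is one common choice and not necessarily the one used in this work. The attribute names has_link and mentions_prize and the five toy records are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def best_attribute(rows, labels, attributes):
    """Assumed criterion: pick the attribute with the largest information gain."""
    base = entropy(labels)

    def gain(attr):
        remainder = 0.0
        for value in set(row[attr] for row in rows):
            subset = [lab for row, lab in zip(rows, labels) if row[attr] == value]
            remainder += len(subset) / len(labels) * entropy(subset)
        return base - remainder

    return max(attributes, key=gain)

def build_tree(rows, labels, attributes):
    majority = Counter(labels).most_common(1)[0][0]
    # Leaf: every record has the same class, or no attribute is left to split on.
    if len(set(labels)) == 1 or not attributes:
        return majority
    attr = best_attribute(rows, labels, attributes)            # step 1
    node = {"attr": attr, "children": {}, "default": majority}
    remaining = [a for a in attributes if a != attr]
    for value in set(row[attr] for row in rows):               # step 2
        idx = [i for i, row in enumerate(rows) if row[attr] == value]
        node["children"][value] = build_tree(                  # step 3
            [rows[i] for i in idx], [labels[i] for i in idx], remaining)
    return node

def predict(tree, record):
    """Walk from the root to a leaf, following the branch for each attribute value."""
    while isinstance(tree, dict):
        tree = tree["children"].get(record[tree["attr"]], tree["default"])
    return tree

# Hypothetical toy records, not drawn from the SMS Spam Collection.
rows = [
    {"has_link": "yes", "mentions_prize": "yes"},
    {"has_link": "yes", "mentions_prize": "no"},
    {"has_link": "no", "mentions_prize": "yes"},
    {"has_link": "no", "mentions_prize": "no"},
    {"has_link": "no", "mentions_prize": "no"},
]
labels = ["spam", "spam", "spam", "ham", "ham"]
tree = build_tree(rows, labels, ["has_link", "mentions_prize"])
```

An attribute value that was never seen during training falls back to the majority class stored at that node; this fallback is a design choice of the sketch and is not part of the pseudocode.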
4. DEEP LEARNING METHOD
The convolution layers are the main powerhouse of a CNN model. Automatically detecting meaningful features given only an image and a label is not an easy task. The convolution layers learn such complex features by building on top of each other. The first layers detect edges, the next layers combine them to detect shapes, and the following layers merge this information to infer, for example, that the image shows a nose. To be clear, the CNN doesn’t know what a nose is. By seeing a lot of them in images, it learns to detect that as a feature. The fully connected layers learn how to use the features produced by the convolutions in order to correctly classify the images. A layer is nothing but a collection of neurons which take in an input and provide an output. Inputs to each of these neurons are processed through the activation functions assigned to the neurons. For example, here is a small neural network.
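The way a convolution layer, an activation function and a pooling step work together can be illustrated with a one-dimensional toy example. This is a minimal sketch, not the architecture used in this work: the kernel is hand-picked to respond to a rising edge, whereas in a trained CNN the kernel weights are learned, and the function names and the input signal are hypothetical.

```python
def conv1d(sequence, kernel):
    """Valid 1-D convolution (cross-correlation, as in most deep learning libraries)."""
    k = len(kernel)
    return [
        sum(sequence[i + j] * kernel[j] for j in range(k))
        for i in range(len(sequence) - k + 1)
    ]

def relu(values):
    """ReLU activation: negative responses are set to zero."""
    return [max(0.0, v) for v in values]

def max_pool(values, size):
    """Non-overlapping max pooling; a trailing partial window is dropped."""
    return [max(values[i:i + size]) for i in range(0, len(values) - size + 1, size)]

signal = [0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 1.0]
edge_kernel = [-1.0, 1.0]  # hand-picked: fires where the signal steps upward

feature_map = relu(conv1d(signal, edge_kernel))  # [0, 1, 0, 0, 0, 1, 0]
pooled = max_pool(feature_map, 2)                # [1, 0, 1]
```

The same two kernel weights are reused at every position of the input, and the feature map marks where the pattern occurs. For text, the input sequence would typically be a sequence of word or character embeddings rather than raw numbers.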
Convolutional Neural Networks (CNNs) are everywhere. The CNN is arguably the most popular deep learning architecture. The recent surge of interest in deep learning is due in large part to the popularity and effectiveness of convnets. The interest in CNNs started with AlexNet in 2012 and has grown rapidly ever since. In just three years, researchers progressed from the 8-layer AlexNet to the 152-layer ResNet. The CNN is now the go-to model for image-related problems.
In terms of accuracy, CNNs clearly outperform competing approaches on such problems. They have also been successfully applied to recommender systems, natural language processing and more. The main advantage of the CNN compared to its predecessors is that it automatically detects the important features without any human supervision. For example, given many pictures of cats and dogs, it learns distinctive features for each class by itself. The CNN is also computationally efficient. It uses special convolution and pooling operations and performs parameter sharing. This helps CNN models run on a wide range of devices, making them broadly attractive.
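The effect of parameter sharing can be made concrete by counting weights. The sketch below compares a convolution layer with a fully connected layer that maps the same input to an output of the same size. The layer sizes (a 32x32 input with 3 channels, 16 filters of size 3x3) are hypothetical values chosen for illustration and are not taken from the model used in this work.

```python
def dense_params(n_inputs, n_outputs):
    """Weights plus biases of a fully connected layer."""
    return n_inputs * n_outputs + n_outputs

def conv2d_params(kernel_h, kernel_w, in_channels, out_channels):
    """Weights plus biases of a 2-D convolution layer.

    The count does not depend on the image size, because the same
    kernel is reused at every spatial position (parameter sharing).
    """
    return kernel_h * kernel_w * in_channels * out_channels + out_channels

# Hypothetical sizes, for illustration only.
height, width, channels = 32, 32, 3
filters = 16

conv_count = conv2d_params(3, 3, channels, filters)
dense_count = dense_params(height * width * channels, height * width * filters)

print(conv_count)   # 448
print(dense_count)  # 50348032
```

Under these assumptions the convolution layer needs 448 parameters, while a dense layer producing an output of the same size would need about 50 million.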
All in all, this sounds like pure magic. We are dealing with a very powerful and efficient model which performs automatic feature extraction and can reach very high accuracy (on some image classification benchmarks, CNN models have been reported to outperform humans).
5. RESULTS
Fig. Word cloud of the UCI SMS spam dataset
TABLE 1.1 SUMMARY OF RESULTS
Classifier       Technique           Accuracy
Decision Tree    Machine Learning    86%
CNN              Deep Learning       98%
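Accuracy here is understood as the fraction of messages whose predicted label matches the true label. The following sketch shows that computation on made-up labels, using the ‘1’ = spam and ‘0’ = ham convention; the label lists are hypothetical and do not reproduce the figures in the table.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# Hypothetical labels: 1 = spam, 0 = ham.
y_true = [1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 1, 0]

score = accuracy(y_true, y_pred)  # 4 of 5 predictions are correct
```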
6. CONCLUSION
Short Message Service (SMS) has grown into a multi-billion-dollar industry. At the same time, reduction in the cost of messaging services has resulted in growth in unsolicited commercial advertisements (spam) being sent to mobile phones. In this project, a database of real SMS spam from the UCI Machine Learning Repository is used, and after pre-processing and feature extraction, different machine learning techniques are applied to the database. Finally, the results are compared and the best algorithm for spam filtering of text messages is identified.
REFERENCES
[1] J. Miranda, N. Makitalo, J. Garcia-Alonso, J. Berrocal, T. Mikkonen, C. Canal, and J. M. Murillo, “From the internet of things to the internet of people,” IEEE Internet Computing, vol. 19, no. 2, pp. 40–47, 2015.
[2] T. Qiu, A. Zhao, F. Xia, W. Si, and D. O. Wu, “Rose: Robustness strategy for scale-free wireless sensor networks,” IEEE/ACM Transactions on Networking, vol. 25, no. 5, pp. 2944–2959, 2017.
[3] L. Yao, Q. Z. Sheng, and S. Dustdar, “Web-based management of the internet of things,” IEEE Internet Computing, vol. 19, no. 4, pp. 60–67, 2015.
[4] T. Qiu, R. Qiao, and D. O. Wu, “Eabs: An event-aware backpressure scheduling scheme for emergency internet of things,” IEEE Transactions on Mobile Computing, vol. 17, no. 1, pp. 72–84, 2017.
[5] T. Qiu, K. Zheng, H. Song, M. Han, and B. Kantarci, “A local-optimization emergency scheduling scheme with self-recovery for smart grid,” IEEE Transactions on Industrial Informatics, vol. 13, no. 6, pp. 3195–3205, 2017.
[6] S. Lu, V. H. Nascimento, J. Sun, and Z. Wang, “Sparsity aware adaptive link combination approach over distributed networks,” Electronics Letters, vol. 50, no. 18, pp. 1285–1287, 2014.
[7] E. Tan, L. Guo, S. Chen, X. Zhang, and Y. Zhao, “Spammer behaviour analysis and detection in user generated content on social networks,” in IEEE International Conference on Distributed Computing Systems, May 16-18, 2012, pp. 305–314.
[8] P. Heymann, G. Koutrika, and H. Garcia-Molina, “Fighting spam on social web sites,” IEEE Internet Computing, vol. 11, no. 6, pp. 36–45, 2007.
[9] M. Al Hasan, V. Chaoji, S. Salem, and M. Zaki, “Link prediction using supervised learning,” in Proc. of SDM Workshop on Link Analysis, Counterterrorism and Security, Apr. 26-28, 2006, pp. 798–805.
[10] F. Pedregosa, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, and J. Vanderplas, “Scikit-learn: Machine learning in Python,” Journal of Machine Learning Research, vol. 12, no. 10, pp. 2825–2830, 2011.
[11] X. Zhu, Y. Nie, S. Jin, A. Li, and Y. Jia, “Spammer detection on online social networks based on logistic regression,” in International Conference on Web-Age Information Management, Aug. 11-13, 2015, pp. 29–40.
[12] H. Jia, Y. M. Cheung, and J. Liu, “A new distance metric for unsupervised learning of categorical data,” IEEE Transactions on Neural Networks and Learning Systems, vol. 27, no. 5, pp. 1065–1079, 2016.
[13] S. D. Xenaki, K. D. Koutroumbas, and A. A. Rontogiannis, “A novel adaptive possibilistic clustering algorithm,” IEEE Transactions on Fuzzy Systems, vol. 24, no. 4, pp. 791–810, 2015.
[14] K. V. Laerhoven, “Combining the self-organizing map and k-means clustering for on-line classification of sensor data,” in International Conference on Artificial Neural Networks, Aug. 21-25, 2001, pp. 464–469.
[15] X. Yang, Y. Wang, D. Wu, and A. Ma, “K-means based clustering on mobile usage for social network analysis purpose,” in International Conference on Advanced Information Management and Service, Nov. 30-Dec. 2, 2010, pp. 223–228.
[16] S. Yang, J. Kim, and M. Chung, “A prediction model based on big data analysis using hybrid fcm clustering,” in International Journal of Internet Technology and Secured Transactions, Dec. 14-16, 2015, pp. 337–339.
[17] M. A. Wiering and H. V. Hasselt, “Ensemble algorithms in reinforcement learning,” IEEE Transactions on Systems Man and Cybernetics Part B Cybernetics, vol. 38, no. 4, pp. 930–936, 2008.
[18] F. Peyravi, V. Derhami, and A. Latif, “Reinforcement learning based search (rls) algorithm in social networks,” in International Symposium on Artificial Intelligence and Signal Processing, Mar. 3-5, 2015, pp. 206–210.
[19] S. Bhagat, G. Cormode, and S. Muthukrishnan, “Node classification in social networks,”
[21] S. Lu and Z. Wang, “Accelerated algorithms for eigenvalue decomposition with application to spectral clustering,” in The 49th Asilomar Conference on Signals, Systems and Computers (Asilomar), Pacific Grove, CA, USA, 2015, pp. 355–359.
[22] T. Chen, C. Huang, E. Chang, and J. Wang, “Automatic accent identification using gaussian mixture models,” in Automatic Speech Recognition and Understanding, 2001. ASRU ’01. IEEE Workshop on, Dec. 9-13, 2001, pp. 343–346.
[23] A. Rasmus, M. Berglund, M. Honkala, H. Valpola, and T. Raiko, “Semi-supervised learning with ladder networks,” in Advances in Neural Information Processing Systems, Dec. 7-12, 2015, pp. 3546–3554.
[24] P. K. Mallapragada, R. Jin, A. K. Jain, and Y. Liu, “Semiboost: Boosting for semi-supervised learning,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 11, pp. 2000–2014, 2009.
[25] S. J. Roberts, D. Husmeier, I. Rezek, and W. Penny, “Bayesian approaches to gaussian mixture modelling,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 11, pp. 1133–1142, 1998.