Detecting Spam Classification on Twitter Using
URL Analysis, Natural Language Processing
and Machine Learning
Ms. Sonika A. Chorey
P.R.M.I.T.&R. Badnera, Amravati, India
Ms.Prof R.N.Sawade
P.R.M.I.T.&R. Badnera, Amravati, India
Ms Priyanka Chorey
P.R.M.I.T.&R. Badnera, Amravati, India
Ms.P.V.Mamanka
P.R.M.I.T.&R. Badnera, Amravati, India
Abstract- Social networks are now part of everyday life, which makes them an easy channel for spreading spam. Personal details of users are often easy to obtain through these sites, so users are exposed to abuse. In this paper we propose an application that uses an integrated approach to spam classification on Twitter. The approach is a three-step process that combines URL analysis, natural language processing and supervised machine learning. We consider the problem of detecting spammers on Twitter: we construct a large labeled collection of users, manually classified into spammers and non-spammers, and identify a number of characteristics related to tweet content and user social behavior that could potentially be used to detect spammers.
Keywords—natural language processing; tweets; machine learning, URLs.
I. INTRODUCTION
Online Social Networks (OSNs) are becoming very popular these days. Some of the popular OSNs are Twitter, Facebook, MySpace and LinkedIn. They are platforms through which people can share their ideas and thoughts. With the increasing popularity of these sites, attacks on them are also increasing. These sites have millions of users, and not all of them are legitimate: each of these OSNs hosts many illegitimate (spam) accounts.
Over the last few years, social networking sites have become one of the main ways for users to keep track of and communicate with their friends online. Sites such as Facebook, MySpace, and Twitter are consistently among the top 20 most-viewed web sites of the Internet. Moreover, statistics show that, on average, users spend more time on popular social networking sites than on any other site [1]. Most social networks provide mobile platforms that allow users to.
In this paper, we concentrate more on spammers in Twitter. Twitter is a microblogging site, which allows only a maximum of 140 characters in each tweet (message). The four major types of spammers on Twitter that we have considered in this paper are:
1) Malware Propagators: Malware propagators are users who tweet malicious links which, when clicked, lead to the download of malware. Malware is malicious software.
2) Phishers: These are account users who spread malicious URLs through their tweets. Legitimate users who click on these links end up on the wrong websites. This leads to the theft of passwords, credit card information and so on.
3) Adult Content Propagators: These spammers broadcast links to adult content. On clicking those links, the user is redirected to malicious sites.
4) Marketers: These are spammers who concentrate on spreading advertisements. They try to make different products trend. Marketers are normally harmless because all they do is popularize their products, but sometimes these users can mislead legitimate users.
In this paper we propose an application which can classify a Twitter user as spam or legitimate. To achieve this, an integrated approach is used, which combines URL analysis, Natural Language Processing and Machine Learning techniques. These techniques are applied in the order given above.
A tweet may or may not contain a URL. Since Twitter supports only 140 characters in a tweet, a long URL is normally shortened. Many URL shortening services are available these days, for example the Google URL shortener, Bitly and the Twitter URL shortener. These services generate short URLs on domains such as goo.gl, bit.ly and t.co. Moreover, Twitter has features such as @ mentions, # tags and RT. @ mentions are used to address a user, # tags are used to make a topic trend, and RT shows that the tweet is a retweet.
experiment results. Finally, the paper is concluded in Section VI.
II. RELATED WORK AND BACKGROUND
Social networks offer a way for users to keep track of their friends and communicate with them. This network of trust typically regulates which personal information is visible to whom. In our work, we looked at the different ways in which social networks manage the network of trust and the visibility of information between users. This is important because the nature of the network of trust provides spammers with different options for sending spam messages, learning information about their victims, or befriending someone (to appear trustworthy and make it more difficult to be detected as a spammer).
2.1 The MySpace Social Network
MySpace was the first social network to gain significant popularity among Internet users. The basic idea of this network is to provide each user with a web page, which the user can then personalize with information about herself and her interests. Even though MySpace also has the concept of “friendship,” like Facebook, MySpace pages are public by default. Therefore, it is easier for a malicious user to obtain sensitive information about a user on MySpace than on Facebook.
2.2 The Twitter Social Network
Twitter is a much simpler social network than Facebook and MySpace. It is designed as a microblogging platform, where users send short text messages (i.e., tweets) that appear on their friends’ pages. Unlike Facebook and MySpace, no personal information is shown on Twitter pages by default. Users are identified only by a username and, optionally, by a real name. To profile a user, it is possible to analyze the tweets she sends and the feeds to which she is subscribed. However, this is significantly more difficult than on the other social networks.
A Twitter user can start “following” another user. As a consequence, she receives the user’s tweets on her own page. The user who is “followed” can, if she wants, follow the other one back. Tweets can be grouped by hashtags, which are popular words beginning with a “#” character. This allows users to efficiently search for who is posting on topics of interest at a certain time. When a user likes someone’s tweet, she can decide to retweet it. As a result, that message is shown to all her followers. By default, profiles on Twitter are public, but a user can decide to protect her profile. By doing that, anyone wanting to follow the user needs her permission. According to the same statistics, Twitter is the social network with the fastest growth rate on the Internet. During the last year, it reported a 660% increase in visits [2].
2.3 The Facebook Social Network
Facebook is currently the largest social network on the Internet. On their website, the Facebook administrators claim to have more than 400 million active users all over the world, with over 2 billion media items (videos and pictures) shared every week [3].
Usually, user profiles are not public, and the right to view a user’s page is granted only after having established a relationship of trust (paraphrasing the Facebook terminology, becoming friends) with the user. When a user A wants to become friends with another user B, the platform first sends a request to B, who has to acknowledge that she knows A. When B confirms the request, a friendship connection with A is established. However, the users’ perception of Facebook friendship is different from their perception of a relationship in real life. Most of the time, Facebook users accept friendship requests from persons they barely know, while in real life the person asking to be a friend would undergo more scrutiny.
A lot of research has been done in this field. The authors have used the concept of social honeypots in [1], along with machine learning for spam detection in OSNs. Social honeypots are fake profiles or accounts which are created deliberately to gain the attention of a spammer.
A method for detecting pharmaceutical spam on Twitter is discussed in [2]. It applies text mining techniques and data mining tools, and addresses how to classify a new incoming tweet as pharmaceutical spam or not. The authors used the decision tree (J48) algorithm and the Naïve-Bayes algorithm, and compared the output obtained by both classifiers. A set of 65 words related to pharmaceuticals was used as the training set. If at least one of these 65 words was present in a tweet in the test set, the tweet was classified as spam.
Online spam filtering is presented in [3]. This is a real-time system that can inspect a message (a tweet in the case of Twitter and a post in the case of Facebook) and drop it if it is found to be spam. Spam messages are dropped before the intended recipient receives them, and such messages are not stored in the database. The paper uses machine learning techniques. Millions of tweets and posts were collected from both Twitter and Facebook as datasets. The authors used two supervised machine learning algorithms, namely Support Vector Machine (SVM) and Decision Tree.
The evaluation of context-aware spam that could result from information shared on social networks is dealt with in [4], which also discusses mitigation techniques. The authors analysed Facebook and concluded that context-aware e-mail attacks have a high rate of success. The paper also mentions the defence strategies taken by other social networks such as LinkedIn and MySpace.
A harvested Twitter dataset and its links are examined in [6]. Here the authors found features with which content polluters can be easily identified. The authors proposed a long-term study of protecting social networks using honeypots. Almost 60 honeypots were deployed for seven months, which resulted in the harvesting of more than 30,000 spam items. The spam classification was done using machine learning algorithms.
algorithm. The authors manually labelled 500 accounts as spam and non-spam for the training set. All the algorithms used were compared with each other, and Naïve-Bayes was found to be the best.
Our application is a combination of the tasks discussed here so far.
III. SYSTEM DESIGN AND METHODOLOGY
We have developed an application which can classify a Twitter user as spam or legitimate. The user of this application enters the username of the account to be checked into the interface provided. The input is therefore the username, and the output is either spam or legitimate. The last 10 tweets of the account are used for the whole process.
The entire work uses three techniques:
A. URL Analysis
The first step in this application is URL analysis. URL analysis has been done in [11] and [12]. For this, the URLs are extracted from the tweets. The extracted URLs are normally shortened ones, so they are converted to their long form. For this we use the HttpURLConnection class, which helps in finding the page to which a particular URL is redirected. When a URL is redirected to another, the response code is 301. If the response status is 301, the location given in the response header is taken as the long URL. URL analysis involves 2 steps:
1) Comparison with a set of blacklisted URLs
A set of blacklisted URLs was downloaded from http://urlblacklist.com. This data set consists of hundreds of thousands of URLs from different categories. We chose 4 categories: URLs related to advertisements, malware, adult content and phishing. The resulting set contains almost 15,000 URLs.
The URLs extracted from the tweets are compared with the blacklisted URLs. If n URLs are extracted from the tweets and even one of them is present in the blacklist, the user is regarded as spam. This rests on the assumption that a legitimate user will never tweet a blacklisted URL. If the user is classified as spam, the whole process stops here. Otherwise, the process continues.
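The two operations described so far, expanding a shortened URL and comparing it with the blacklist, can be illustrated with a minimal Python sketch. It is not the implementation used in this work, which relies on Java's HttpURLConnection. The names `expand_url`, `fetch_headers`, `host_of` and `is_blacklisted`, the example hosts, and the choice to compare by host name rather than by full URL are all assumptions made for illustration. The network call is injected as a function so that the logic can be run without network access.

```python
from urllib.parse import urljoin, urlsplit


def expand_url(url, fetch_headers, max_hops=5):
    """Follow HTTP 301 redirects and return the final (long) URL.

    fetch_headers is a hypothetical callable that returns
    (status_code, headers_dict) for a request that does NOT
    follow redirects automatically.
    """
    for _ in range(max_hops):
        status, headers = fetch_headers(url)
        location = headers.get("Location")
        if status != 301 or not location:
            break
        # urljoin also copes with relative Location values.
        url = urljoin(url, location)
    return url


def host_of(url):
    """Lower-cased host name of a URL, without a leading 'www.'."""
    host = (urlsplit(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def is_blacklisted(urls, blacklist_hosts):
    """True if at least one URL points to a blacklisted host."""
    return any(host_of(u) in blacklist_hosts for u in urls)


# Made-up redirect table standing in for real HTTP requests.
REDIRECTS = {
    "http://sho.rt/abc": (301, {"Location": "http://www.bad-example.test/win"}),
}


def fake_fetch(url):
    return REDIRECTS.get(url, (200, {}))


blacklist = {"bad-example.test"}  # hypothetical blacklist entry
long_url = expand_url("http://sho.rt/abc", fake_fetch)
print(long_url)
print(is_blacklisted([long_url], blacklist))
```

Storing the blacklist in a set keeps each lookup at constant expected cost, which matters when the list holds thousands of entries.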
2) Comparison with a set of already identified expressions
The next task is to identify a set of expressions or words in a URL which indicate that the URL is spam. Some of the expressions were obtained from www.urlblacklist.com. The rest were identified after thorough research. A total of 33 words were identified.
The set of 33 words is: /ads, /realmedia/ads/, /pics/banner/, adultos, adultsight, adultsite, adultsonly, adultweb, blowjob, bondage, centerfold, cumshot, cyberlust, cybercore, hardcore, incest, masturbate, obscene, pedophil, pedofil, playmate, pornstar, sexdream, showgirl, softcore, striptease, adultsight, adultsite, adultsonly, adultweb, penis, vagina, xxx.
The presence of even one of these expressions is taken to mean that the URL is spam. If the URL is spam, the user is classified as spam. If the user is not classified as spam in this step, or if he hasn’t tweeted any URLs, then the next technique, natural language processing, is applied.
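The expression check can be sketched in a few lines of Python. This is an illustration under assumptions, not the code of the application: the function name `url_has_spam_expression` is hypothetical, only a few entries of the expression list are reproduced, and plain case-insensitive substring matching is assumed because the text does not state how the comparison is done.

```python
# A small excerpt of the expression list, for illustration only.
SPAM_URL_EXPRESSIONS = ("/ads", "/realmedia/ads/", "/pics/banner/", "adultsite", "xxx")


def url_has_spam_expression(url, expressions=SPAM_URL_EXPRESSIONS):
    """True if the URL contains at least one listed expression."""
    lowered = url.lower()
    return any(expression in lowered for expression in expressions)


print(url_has_spam_expression("http://example.test/ads/banner.gif"))
print(url_has_spam_expression("http://example.test/news/today"))
```

Substring matching is simple but coarse: a short expression such as xxx may also occur in harmless URLs, so a stricter variant could match whole path segments instead.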
B. Natural Language Processing
Natural Language Processing (NLP) is a set of techniques which enables a machine to process a natural language (like English) and so automate tasks that would otherwise require a human reader. A similar approach is used in [8], [9] and [10]. Extracting information from unstructured data using NLP is discussed in [8]. Malicious tweets are identified in [9], again using NLP. In [10], NLP is used in sentiment analysis of a subject.
Before applying NLP techniques, a set of phrases which commonly appear in spam tweets is identified. After studying Twitter, 11 common phrases in spam tweets were found. They are ‘add me at’, ‘take me on a date’, ‘you'll laugh when you see this pic of you’, ‘You look different in this photo’, ‘my friend sent me this pic with you in it’, ‘my friend showed me this pic of you’, ‘follow me back’, ‘discount drugs’, ‘I found you in this video’, ‘Is that you in this picture’, ‘buy now’. If any of these expressions is found in a tweet, the user is classified as spam.
In this paper, two concepts of NLP have been used: removal of stop words and stemming. Stop words such as I, about and above carry little meaning for this task, so they are removed and only the keywords are kept. The next step is to find the root word, or stem, of each keyword. For this, stemming techniques are used.
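The phrase check can be illustrated with the following Python sketch. The 11 phrases are those listed above; the function name `contains_spam_phrase` and the normalisation (lower-casing, collapsing whitespace and replacing a curly apostrophe by a straight one) are assumptions made for illustration.

```python
SPAM_PHRASES = (
    "add me at",
    "take me on a date",
    "you'll laugh when you see this pic of you",
    "you look different in this photo",
    "my friend sent me this pic with you in it",
    "my friend showed me this pic of you",
    "follow me back",
    "discount drugs",
    "i found you in this video",
    "is that you in this picture",
    "buy now",
)


def contains_spam_phrase(tweet, phrases=SPAM_PHRASES):
    """True if the tweet contains one of the known spam phrases."""
    # Normalise case, apostrophes and runs of whitespace before matching.
    normalized = " ".join(tweet.lower().replace("\u2019", "'").split())
    return any(phrase in normalized for phrase in phrases)


print(contains_spam_phrase("Hey!  FOLLOW me   back please"))
print(contains_spam_phrase("Lovely weather today"))
```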
Examples of stemming:
Complexity --- > Complex
Possessive --- > Possess
A simple stemming algorithm has been used in this paper. A set of spam words that can appear in a tweet is identified, such as ‘porn’ and ‘Viagra’. The stemmed keywords are compared with this set of spam words. If any word matches, the user is regarded as spam. If the user is not found to be spam at this stage, the third technique, Machine Learning, is used.
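Stop-word removal, stemming and the comparison with spam words can be sketched in Python as follows. The stemming algorithm actually used is not specified in the text, so the suffix-stripping rule below is a hypothetical stand-in chosen so that it reproduces the two stemming examples given above. The stop-word list, the suffix list and the function names are likewise assumptions.

```python
import re

# Tiny illustrative lists; real lists would be much longer.
STOP_WORDS = {"i", "about", "above", "a", "the", "is", "to", "for", "and", "of"}
SUFFIXES = ("ity", "ive", "ing", "ed", "s")  # hypothetical suffix rules
SPAM_WORDS = {"porn", "viagra"}


def stem(word):
    """Strip the first matching suffix, keeping at least 3 letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def keywords(tweet):
    """Lower-case the tweet, drop stop words and stem the remainder."""
    tokens = re.findall(r"[a-z]+", tweet.lower())
    return [stem(token) for token in tokens if token not in STOP_WORDS]


def has_spam_word(tweet):
    """True if any stemmed keyword is a known spam word."""
    return any(word in SPAM_WORDS for word in keywords(tweet))


print(stem("complexity"), stem("possessive"))
print(has_spam_word("Cheap Viagra for the masses"))
```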
C. Machine Learning Techniques
Using machine learning techniques, a machine can learn from examples, so no human intervention is required at classification time. These algorithms use a training set, which consists of labelled examples obtained by analysing data manually. Each training example is of the form (a1, a2, ..., an, L), where a1, ..., an are attributes and L is the label. Each test example contains only the n attributes, in the form {a1, a2, ..., an}.
here were first tested using two algorithms: Naïve-Bayes and SVM. Of these, Naïve-Bayes was found to be more accurate, which is why we chose it.
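A confusion matrix is drawn as follows (rows give the actual class, columns the class assigned by the classifier):
              Spam    Legitimate
Spam          a       b
Legitimate    c       d
where a, b, c and d are the number of spam classified as spam, number of spam classified as legitimate, number of legitimate classified as spam and number of legitimate classified as legitimate respectively.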
The accuracy, true positive rate and false positive rate are calculated as follows:
Accuracy = (a+d)/(a+b+c+d)
True Positive = (a)/(a+b)
False Positive= (c)/(c+d)
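As a worked illustration, the three formulas can be evaluated with a short Python function. The counts used below are hypothetical and are not the results of this paper; the function name `metrics` is also an assumption.

```python
def metrics(a, b, c, d):
    """Accuracy, true positive rate and false positive rate.

    a: spam classified as spam         b: spam classified as legitimate
    c: legitimate classified as spam   d: legitimate classified as legitimate
    """
    accuracy = (a + d) / (a + b + c + d)
    true_positive = a / (a + b)
    false_positive = c / (c + d)
    return accuracy, true_positive, false_positive


# Hypothetical counts for 50 spam and 50 legitimate accounts.
accuracy, true_positive, false_positive = metrics(a=45, b=5, c=2, d=48)
print(accuracy, true_positive, false_positive)
```

With these made-up counts, 93 of 100 accounts are classified correctly, 45 of the 50 spam accounts are caught, and 2 of the 50 legitimate accounts are wrongly flagged.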
The results obtained using SVM and Naïve-Bayes are given below:
               Accuracy    True Positive    False Positive
Naïve-Bayes    94%         0.9              0.03
SVM            92%         0.875            0.05
Naïve-Bayes
Naïve-Bayes is a probabilistic classifier based on Bayes' theorem. It assumes that the features are conditionally independent of each other given the label. Consider a test instance T with attributes (features) a1, a2, ..., an:
T = {a1, a2, ..., an}, and a set of labels {Spam, Legitimate}.
Then, for each label L,
$$P(L \mid a_1, a_2, \dots, a_n) \propto P(L) \prod_{i=1}^{n} P(a_i \mid L)$$
The normalising constant is the same for both labels, so it can be ignored when comparing them. Whichever label has the higher probability is assigned to the test instance T.
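The decision rule can be made concrete with a small Python sketch. The text does not state how the per-feature likelihoods P(ai | L) are modelled, so a Gaussian model per feature is assumed here purely for illustration. The training rows are made up, and the names `train_gaussian_nb`, `log_scores` and `predict` are hypothetical. Logarithms are used so that the product becomes a sum and does not underflow.

```python
import math
from collections import defaultdict


def train_gaussian_nb(rows, labels, var_floor=1e-3):
    """Estimate log prior, per-feature mean and variance for each label."""
    by_label = defaultdict(list)
    for features, label in zip(rows, labels):
        by_label[label].append(features)
    model = {}
    for label, items in by_label.items():
        n = len(items)
        columns = list(zip(*items))
        means = [sum(col) / n for col in columns]
        # A variance floor avoids division by zero for constant features.
        variances = [
            max(sum((x - m) ** 2 for x in col) / n, var_floor)
            for col, m in zip(columns, means)
        ]
        model[label] = (math.log(n / len(rows)), means, variances)
    return model


def log_scores(model, features):
    """log P(L) + sum_i log P(a_i | L) for every label L."""
    scores = {}
    for label, (log_prior, means, variances) in model.items():
        score = log_prior
        for x, m, v in zip(features, means, variances):
            score += -0.5 * math.log(2 * math.pi * v) - (x - m) ** 2 / (2 * v)
        scores[label] = score
    return scores


def predict(model, features):
    """Return the label with the highest score."""
    scores = log_scores(model, features)
    return max(scores, key=scores.get)


# Made-up rows of six count features (not data from this paper).
rows = [
    [12, 11, 8, 2, 10, 1], [15, 14, 9, 3, 9, 2], [10, 10, 7, 2, 10, 1],
    [3, 2, 1, 1, 1, 1], [4, 3, 2, 2, 0, 0], [2, 2, 0, 0, 2, 2],
]
labels = ["spam", "spam", "spam", "legitimate", "legitimate", "legitimate"]
model = train_gaussian_nb(rows, labels)
print(predict(model, [11, 10, 8, 2, 9, 1]))
print(predict(model, [3, 3, 1, 1, 1, 1]))
```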
The implementation requires a training set and a test set. The most important step in developing an effective classifier is constructing a good training set: the success of the classifier depends on the quality of the training set, and a poor training set leads to a classifier with low accuracy.
IV. EXPERIMENT
Our experiment uses a dataset crawled from Twitter.
A. Training Set
The training set was obtained from 10 recent Tweets of 100 users. Six features were used for classification. The features are: number of @ mentions, number of unique @ mentions, number of # tags, number of unique # tags, number of URLs and number of unique URLs.
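The six features can be computed from a list of tweets as in the following Python sketch. The regular expressions are simplified assumptions and do not reproduce Twitter's exact rules for mentions, hashtags and links. The function name `extract_features` and the example tweets are hypothetical.

```python
import re

MENTION_RE = re.compile(r"@(\w+)")
HASHTAG_RE = re.compile(r"#(\w+)")
URL_RE = re.compile(r"https?://\S+")


def extract_features(tweets):
    """Return [mentions, unique mentions, hashtags, unique hashtags,
    URLs, unique URLs] counted over all given tweets."""
    mentions, hashtags, urls = [], [], []
    for tweet in tweets:
        urls += URL_RE.findall(tweet)
        # Remove URLs first so that a '#fragment' is not counted as a tag.
        text = URL_RE.sub(" ", tweet)
        mentions += [m.lower() for m in MENTION_RE.findall(text)]
        hashtags += [h.lower() for h in HASHTAG_RE.findall(text)]
    return [
        len(mentions), len(set(mentions)),
        len(hashtags), len(set(hashtags)),
        len(urls), len(set(urls)),
    ]


tweets = [
    "@alice check http://t.co/abc #deal",
    "@alice @bob #deal #sale http://t.co/abc",
    "plain text",
]
print(extract_features(tweets))
```

Comparing the total and the unique counts is what makes the features informative: an account that repeats the same URL or hashtag in every tweet has a high total and a low unique count.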
The training set has 100 instances with 6 features and a label, i.e. the set contains 100 rows and 7 columns. This set is read as a CSV file into the program.
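Reading such a file can be sketched in Python as follows. The exact file layout is not given in the text, so this sketch assumes one instance per line, six integer features followed by the label, and no header row. The function name `load_training_set` and the two sample rows are hypothetical.

```python
import csv
import io


def load_training_set(csv_file):
    """Read rows of six integer features followed by a label."""
    rows, labels = [], []
    for record in csv.reader(csv_file):
        if not record:
            continue  # skip blank lines
        if len(record) != 7:
            raise ValueError(f"expected 7 columns, got {len(record)}")
        *features, label = record
        rows.append([int(value) for value in features])
        labels.append(label.strip())
    return rows, labels


# In-memory stand-in for the CSV file, with made-up values.
sample = io.StringIO("12,11,8,2,10,1,spam\n3,2,1,1,1,1,legitimate\n")
rows, labels = load_training_set(sample)
print(rows, labels)
```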
B. Test Set
The test set contains the same 6 features but no label; the aim is to predict the label. It is read as a text file.
Description of Features Used
Suppose the Naïve-Bayes algorithm were applied first. Since it is a machine learning algorithm whose accuracy depends solely on the quality of the training set, an error rate of 2-10% can be expected, and it is unwise to introduce such errors at the earliest stage. Classifying a user as spam by comparison with the URL blacklist is more accurate, and the expected error is very small. NLP is also a strong method for accurate spam classification. The order of URL analysis, natural language processing and then machine learning is therefore significant in this application.
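The ordering argument can be illustrated with a minimal Python sketch of a staged classifier in which each stage either returns a verdict or passes the user on to the next stage. The three stage functions below are simplified stand-ins with hypothetical rules, not the checks of the actual application, and the name `classify_user` is an assumption.

```python
def classify_user(tweets, stages):
    """Run the stages in order and stop at the first verdict."""
    for name, stage in stages:
        verdict = stage(tweets)
        if verdict is not None:
            return verdict, name
    return "legitimate", "default"


def url_stage(tweets):
    # Stand-in for blacklist and expression checks on URLs.
    return "spam" if any("bad-example.test" in t for t in tweets) else None


def nlp_stage(tweets):
    # Stand-in for phrase matching and stemmed spam words.
    return "spam" if any("buy now" in t.lower() for t in tweets) else None


def ml_stage(tweets):
    # Stand-in for the trained classifier; always returns a label.
    hashtags = sum(t.count("#") for t in tweets)
    return "spam" if hashtags > 20 else "legitimate"


STAGES = [("url", url_stage), ("nlp", nlp_stage), ("ml", ml_stage)]

print(classify_user(["see http://bad-example.test/x buy now"], STAGES))
print(classify_user(["Buy now!"], STAGES))
print(classify_user(["hello world"], STAGES))
```

Placing the rule-based stages first means the statistical classifier, which is the only stage with a non-negligible expected error, is consulted only for users that the earlier stages could not decide.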
VI. CONCLUSION
In this paper we have proposed an integrated approach for classifying a Twitter user as spam or legitimate. The combined approach, which includes URL analysis, natural language processing and machine learning techniques, could successfully perform the classification, and it is found to be more accurate than any of these methods applied alone, including machine learning. We have also identified different sets of expressions, tweets, words and other features which can show whether a user is spam or legitimate.
REFERENCES
[2] Uncovering Social Spammers: Social Honeypots + Machine Learning. Kyumin Lee. 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval.
[3] The Impact of Natural Language Processing Based Textual Analysis of Social Media Interaction on Decision Making. Keri Larson, Richard T. Watson. Proceedings of the 21st European Conference on Information Systems.
[4] Detecting Malicious Tweets in Trending Topics Using a Statistical Analysis of Language. Juan Martinez-Romo, Lourdes Araujo. Expert Systems with Applications: An International Journal.
[5] Sentiment Analyzer: Extracting Sentiments about a Given Topic Using Natural Language Processing Techniques. Jeonghee Yi (IBM Almaden Res. Center, San Jose, CA, USA), T. Nasukawa, R. Bunescu, W. Niblack. Data Mining, 2003. ICDM 2003. Third IEEE International Conferenc
[6] Information Assurance: Detection of Web Spam Attacks in Social Media. Pang-Ning Tan, Feilong Chen, Anil K. Jain. Proceedings of the 27th Army Science Conference, Orlando, Florida (2010).
[7] Design and Evaluation of a Real-Time URL Spam Filtering Service. Kurt Thomas, Chris Grier, Justin Ma, Vern Paxson, Dawn Song. Proceedings of the IEEE Symposium on Security and Privacy.
[8] Detecting Spammers on Twitter. Fabrício Benevenuto, Gabriel Magno, Tiago Rodrigues, Virgílio Almeida. Anti-Abuse and Spam Conference (CEAS), July 2010.
[9] Seven Months with the Devils: A Long-Term Study of Content Polluters on Twitter. Kyumin Lee, Brian David Eoff, James Caverlee. Fifth International AAAI Conference on Weblogs and Social Media, July 2011.
[10] Detecting Spammers on Social Networks. Gianluca Stringhini, Christopher Kruegel, Giovanni Vigna. Annual Computer Security Applications Conference, 2010.
[11] Design and Evaluation of a Real-Time URL Spam Filtering Service. Kurt Thomas, Chris Grier, Justin Ma, Vern Paxson, Dawn Song.