Communication through email and the internet has become an effective way for people to contact each other in today’s fast-paced world. People access social, educational, informative and other kinds of websites, and exchange a huge quantity of email messages daily to share texts, files, photos and video links. Phishing is one of the active social engineering practices used to take advantage of website users who are unaware of phishing traps (Toolan and Carthy, 2010).
Shortcomings in email and web security technologies also allow abusers to take advantage of unaware web and email users. In computing, "phishing" is an activity that uses social engineering techniques to try to obtain confidential information from the email or web user, such as identity, usernames and passwords, PIN codes, bank account details and credit card information, by pretending to be someone else and hiding the attacker's real identity (Ma et al., 2009).
As the number of internet users is increasing rapidly, the number of phishing attacks is also increasing. Phishing attack volumes in 2012 were reported to be 59% higher than in 2011, and global losses due to phishing were estimated at $1.5 billion in 2012, an increase of 22% over 2011 (RSA’s Report, 2013). The countries most affected by attacks are the UK, USA, Canada, South Africa, Germany, France, Colombia and Brazil. Phishers mostly attack financial activities and, because of Canada's better economic growth, phishing attacks there increased by 400% in 2012.
In 2012, cyber criminals used simple hosting tactics and targeted hijacked websites to launch phishing attacks. Web shells, smarter web analytics tools and automated toolkits were used to hack huge numbers of websites (RSA Report, 2013). RSA analysts noted that combined attack schemes were in use to phish users and redirect them to infection points.
In 2013, with the launch of 4G channels in mobile communication, the growing use of mobiles in personal and office life, and the increasingly pivotal need for mobile applications, phishing attacks were forecast to be directed more at mobile and smartphone users. The expected attacks would come by voice (vishing), mobile applications, SMS (smishing) and spammed emails that users open on their mobiles. The use of social networking, shopping and gaming applications is very common. From the Google and Apple stores alone, 25 billion apps were downloaded in 2012, and by 2015 the number may grow up to 185 billion (RSA’s Report, 2013).
2.11.1 What is Phishing
The idea behind phishing is that abusers throw out bait on the web in the hope that a less aware user will grab it and bite, just like a fish. In most cases, the bait is floated either through an email or through a disguised web page, hosted by a phishing website, that requires the user to input data. Phishers commonly camouflage themselves as a well-known bank, an online tradesman, a credit card company, people who say that their deceased parents have left huge sums of money and offer a share of the sum in return for help in obtaining it, or senders of emails congratulating the recipient on winning a lottery and requesting personal information to redeem the cash. Victims of phishing websites may lose their identity, PIN codes, passwords, and bank account and credit card details to the phishers.
Phishing is sometimes confused with spam, and conventional spam blocking techniques were tried against it, but they were not as effective because of the structural variance of phishing websites and their close resemblance to legitimate websites. The main difference between spam and phishing is that spam is used for mass advertisement to sell a product, whereas phishing is employed to collect information that can be further used for some illegal activity (Irani et al., 2008).
As the world becomes a global village, quick access to useful and actionable information has become vital for accurate decision making and for survival in a very competitive market. Although phishers now employ several techniques in creating phishing websites to fool and lure users, they all use a set of common features, since without those features they lose the advantage of deception (Sophie et al., 2011). This helps us to differentiate between legitimate and phishy websites based on the features extracted from the visited website.
Overall, two approaches are employed in identifying phishing websites. The first is based on a blacklist (Sanglerdsinlapachai and Rungsawang, 2010), in which the requested URL is compared with those in the list. The downside of this approach is that the blacklist usually cannot cover all phishing websites, since a new fraudulent website can be launched within seconds. The second approach is known as the heuristic-based method (Sophie et al., 2011), in which several features are collected from the website to classify it as either phishing or legitimate. In contrast to the blacklist method, a heuristic-based solution can recognize freshly created phishing websites in real time (Miyamoto et al., 2008). The efficiency of the heuristic-based method, sometimes called the feature-based method, depends on picking a set of discriminative features that help distinguish phishing websites from legitimate ones (Guang et al., 2011). The way in which the features are processed also plays an extensive role in classifying websites accurately.
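The contrast between the two approaches can be illustrated with a minimal Python sketch. It is not taken from any of the cited systems: the blacklist entry, the four URL features and the voting threshold are all hypothetical choices made purely for illustration, and a practical heuristic-based detector would use a richer feature set and a trained classifier.

```python
from urllib.parse import urlparse

# Hypothetical blacklist entry; a real list would come from a maintained feed.
BLACKLIST = {"http://secure-login.example.net/verify"}


def is_blacklisted(url, blacklist=BLACKLIST):
    """Blacklist approach: exact lookup of the requested URL in the list."""
    return url in blacklist


def extract_features(url):
    """Heuristic approach, step 1: derive illustrative (assumed) URL features."""
    host = urlparse(url).hostname or ""
    return {
        "has_ip_host": host.replace(".", "").isdigit(),  # host is a bare IP address
        "has_at_symbol": "@" in url,                     # '@' appears in the URL
        "long_url": len(url) > 75,                       # assumed length cut-off
        "many_subdomains": host.count(".") > 3,          # assumed dot-count cut-off
    }


def heuristic_label(url, threshold=2):
    """Heuristic approach, step 2: flag the URL when enough features fire."""
    score = sum(extract_features(url).values())
    return "phishing" if score >= threshold else "legitimate"
```

A freshly created URL such as `http://192.168.10.5/paypal/login@update` is absent from the blacklist, so `is_blacklisted` misses it, whereas `heuristic_label` still flags it because two of the assumed features fire.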
Data mining is the process of extracting meaningful information from a large data bank (Fayyad et al., 1998). Data mining and knowledge discovery techniques have been employed in different capacities, including financial analysis, market forecasting, the retail industry and decision support systems (Toolan and Carthy, 2010). The two important data mining techniques discussed in Chapter 3 are association rule mining and classification rule mining. The two are alike, except that classification rule mining predicts a single attribute, namely the class, whereas association rule discovery can describe relationships among any of the attributes in the data set (Thabtah et al., 2010).
As mentioned in this chapter, AC integrates association rule mining with classification to find additional knowledge missed by traditional classification techniques (Furnkranz, 1999).
Many experimental studies, e.g. (Hao et al., 2009), (Sangsuriyun et al., 2010) and (W-C et al., 2012), showed that AC is a highly promising technique that produces more predictive and accurate classification systems than traditional classification methods such as decision trees (Abu-Nimeh et al., 2009). This is to be expected, since AC finds hidden correlations among the different features. Moreover, many of the rules found by AC methods cannot be found by traditional classification techniques such as decision trees, PART (Frank and Witten, 1998), RIPPER (Cohen, 1995) and Prism (Cendrowska, 1987), since the AC algorithm discovers all relationships between the class attribute and the other attributes in the training data set.
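The idea that AC discovers relationships between the class attribute and combinations of the other attributes can be sketched in Python as follows. This is an illustrative brute-force enumeration only, not the AC algorithm developed in this thesis: the toy training rows, the function name `class_association_rules`, and the use of minimum support and minimum confidence thresholds with these particular values are assumptions introduced for the example.

```python
from collections import Counter
from itertools import combinations

# Toy training set (hypothetical): each row is (feature=value items, class label).
ROWS = [
    ({"ip_host=yes", "at_symbol=yes"}, "phishing"),
    ({"ip_host=yes", "at_symbol=no"}, "phishing"),
    ({"ip_host=no", "at_symbol=yes"}, "phishing"),
    ({"ip_host=no", "at_symbol=no"}, "legitimate"),
    ({"ip_host=no", "at_symbol=no"}, "legitimate"),
]


def class_association_rules(rows, min_support=0.2, min_confidence=0.8):
    """Enumerate rules 'itemset -> class' that pass both (assumed) thresholds."""
    body_counts = Counter()  # how often each itemset occurs in the data
    rule_counts = Counter()  # how often each itemset occurs together with a class
    for items, label in rows:
        for size in range(1, len(items) + 1):
            for body in combinations(sorted(items), size):
                body_counts[body] += 1
                rule_counts[(body, label)] += 1

    n = len(rows)
    rules = []
    for (body, label), count in rule_counts.items():
        support = count / n                     # share of rows matching body and class
        confidence = count / body_counts[body]  # share of body matches with this class
        if support >= min_support and confidence >= min_confidence:
            rules.append((body, label, support, confidence))
    # Most confident rules first, then most frequent, then shortest.
    return sorted(rules, key=lambda r: (-r[3], -r[2], len(r[0])))
```

On this toy data, neither `ip_host=no` nor `at_symbol=no` alone predicts the legitimate class with enough confidence, yet their combination does. Rules of this kind, which only emerge when features are considered together, are the hidden correlations referred to above.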
In this part of the chapter, the intention is to discuss intelligent data mining techniques for solving the complex problem of detecting phishing. This will show the usability of the developed AC algorithm in practical applications. Common machine learning and data mining approaches currently used to detect phishing are reviewed in detail, and the methods used to extract the features from the website are described in detail in Chapter 5.