Alternative representations - Real time detection of malicious webpages using machine learning

(TF) in each webpage used as part of the weighting function alongside the inverse webpage frequency (IDF) of each term in the entire collection. IDF is usually defined as

IDF(t) = \log\left(\frac{N}{N_t}\right), (3.1)

where N is the total number of webpages in the collection and N_t is the number of webpages in which term t appears at least once. The TFIDF weight for a term t in a webpage d is the product of the term frequency and the inverse webpage frequency for that term:

TFIDF(t, d) = \xi(t, d) \cdot IDF(t). (3.2)
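The weighting in Equations (3.1) and (3.2) can be illustrated with a minimal sketch. It assumes that the term frequency ξ(t, d) is the raw count of t in d and that the logarithm is natural; neither choice is stated in the text above, and the function name `tfidf` and the toy pages are hypothetical.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document.

    docs is a list of token lists, one list per webpage.
    """
    n_docs = len(docs)  # N in Equation (3.1)

    # N_t: number of documents in which term t appears at least once.
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))

    # Equation (3.1), using the natural logarithm (an assumption).
    idf = {t: math.log(n_docs / n_t) for t, n_t in doc_freq.items()}

    weights = []
    for tokens in docs:
        tf = Counter(tokens)  # raw count used as xi(t, d) (an assumption)
        # Equation (3.2): term frequency times inverse webpage frequency.
        weights.append({t: count * idf[t] for t, count in tf.items()})
    return weights


# Toy collection of three tokenised webpages (illustrative data only).
pages = [
    ["free", "prize", "click", "click"],
    ["news", "weather", "click"],
    ["free", "news"],
]
weights = tfidf(pages)
```

Under these assumptions, a term that occurs in every webpage receives a weight of zero, because log(N / N_t) = log(1) = 0.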

3.3 Alternative representations

Although TFIDF weighting has been used in the past primarily for retrieval, connections between this weighting scheme and probabilistic classification using the naïve Bayes algorithm have recently been explored [66]. Parametric distributions such as the bounded Gaussian or Poisson distribution can capture the probability of words appearing different numbers of times in webpages [67, 68]. Alternatively, a simple Boolean representation of webpages can be used, which records whether or not a given term appears in a webpage. In this case, each term is represented as

f(n) = \begin{cases} 1 & \alpha \geq 1 \\ 0 & \text{otherwise} \end{cases} (3.3)

Most rule-based methods [29, 69] use an underlying Boolean model, as the antecedents of the classification rules only consider word presence and absence in webpages. Boolean vector representations have also been used in probabilistic classification models [70]. Although the Boolean representation discards how often a term occurs, it remains useful in some applications. Neither the parametric distributions nor the Boolean representation have been used in this thesis, however.
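For comparison with the TFIDF weights, the Boolean representation can be sketched as follows. This is only an illustration of the idea, since the thesis does not use this representation; the function name `boolean_vector`, the vocabulary and the example page are hypothetical.

```python
def boolean_vector(tokens, vocabulary):
    """Return 1 for each vocabulary term present in the page, else 0."""
    present = set(tokens)
    return [1 if term in present else 0 for term in vocabulary]


# Hypothetical vocabulary and tokenised webpage.
vocabulary = ["click", "free", "news", "prize"]
page = ["free", "prize", "click", "click"]
vector = boolean_vector(page, vocabulary)
```

Note that "click" occurs twice in the page but still maps to 1, which is exactly the frequency information that this representation gives up.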

3.4 Reduce dimensionality

The World Wide Web is one of the most diverse environments and is filled with content. Deriving vector space representations from all of this content would be an enormous task. The problem calls for a scalable solution, and reducing dimensionality is one that has been used extensively by many researchers. One way to tackle it is to note that in English many words, such as prepositions, conjunctions and pronouns, provide structure in the language rather than content. These words can thus be removed from the vectors. Words listed in [71] can also be removed. Words that appear only once or twice across the webpages play an insignificant role and can also be removed [72].
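The two pruning steps described above can be sketched as follows. The stop-word set here is a small illustrative sample and not the list from [71], and the threshold `min_count=3` is an assumed reading of "once or twice" as a total count across the collection; the function name `reduce_vocabulary` is hypothetical.

```python
from collections import Counter

# Illustrative sample only; this is NOT the word list referenced as [71].
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "in", "on", "to", "it", "is"}


def reduce_vocabulary(docs, stop_words=STOP_WORDS, min_count=3):
    """Drop stop words and terms seen fewer than min_count times overall."""
    # Total occurrences of every term across the whole collection.
    total = Counter(token for tokens in docs for token in tokens)
    keep = {
        term
        for term, count in total.items()
        if count >= min_count and term not in stop_words
    }
    # Preserve token order within each page while filtering.
    return [[token for token in tokens if token in keep] for tokens in docs]


# Toy collection of tokenised webpages (illustrative data only).
docs = [
    ["the", "bank", "login", "the"],
    ["login", "to", "the", "bank"],
    ["bank", "login", "offer"],
]
reduced = reduce_vocabulary(docs)
```

In this toy collection only "bank" and "login" survive: "the" and "to" are stop words, and "offer" appears just once.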

3.5 Features

3.5.1 Webpage content

The semantic features included the TFIDF vectors derived from the webpages; the whole process has been described in detail in Sections 3.1-3.4. In summary, all HTML tags were removed from the webpages, then the stop words were removed, and the remaining text was used in the simulations.
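The preprocessing summarised above might look like the following sketch, which uses only the Python standard library. The thesis does not name its parser, so the class `TextExtractor`, the function `page_to_tokens`, the decision to discard the contents of script and style elements, and the lower-case alphabetic tokenisation are all assumptions made for illustration.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping script and style contents (assumption)."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def page_to_tokens(html, stop_words):
    """Strip tags, lower-case the text and drop stop words."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks).lower()
    # Simple alphabetic tokenisation (an assumption, not the thesis method).
    return [word for word in re.findall(r"[a-z]+", text) if word not in stop_words]


html = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><p>Click the link to claim a prize</p>"
    "<script>var x = 1;</script></body></html>"
)
tokens = page_to_tokens(html, {"the", "to", "a"})
```

The resulting token lists are the kind of input from which TFIDF vectors can then be computed.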

3.5.2 URLs

URLs identify webpages and have been used in the thesis as unique identifiers. Many malicious webpages have suspicious-looking characters in their URLs and in their contents. Sometimes the URLs contain spelling mistakes too. The lexical features of URLs were fed into the machine learning techniques. If a URL contained spelling mistakes or suspicious characters, it was regarded as suspicious.
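As a rough illustration of what lexical URL features could look like, the sketch below counts a few simple properties of a URL. The thesis does not list its exact features, so the feature names, the set `SUSPICIOUS_CHARS` and the example URL are all hypothetical; spelling-mistake detection is not covered here.

```python
from urllib.parse import urlsplit

# Hypothetical choice of characters to treat as suspicious.
SUSPICIOUS_CHARS = set("@%-_=~")


def url_lexical_features(url):
    """Return a dict of simple lexical features for one URL."""
    host = urlsplit(url).hostname or ""
    return {
        "url_length": len(url),
        "host_dots": host.count("."),
        "digit_count": sum(ch.isdigit() for ch in url),
        "suspicious_char_count": sum(ch in SUSPICIOUS_CHARS for ch in url),
        "host_has_digit": any(ch.isdigit() for ch in host),
    }


# Made-up URL that imitates a brand name by swapping a letter for a digit.
features = url_lexical_features("http://paypa1-secure.example.com/login?id=42")
```

A dictionary of this kind can be turned into a numeric vector and combined with the other features before being passed to a model.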

3.5.3 Webpage links

Webpages contain many links, and these provide further information; for example, webpages that link to malicious webpages are likely to be malicious themselves. The simulations extracted all the links from each webpage and fed them into the models too.
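Link extraction of this kind can be sketched with the standard library alone. The crawler and parser actually used are not named in this section, so the class `LinkExtractor`, the function `extract_links`, the restriction to anchor tags and the resolution of relative links against a base URL are assumptions.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect the href of every anchor tag as an absolute URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, href))


def extract_links(html, base_url):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


links = extract_links(
    '<a href="/login">Sign in</a> '
    '<a href="http://other.example/x">x</a> '
    '<a name="top">top</a>',
    "http://site.example/index.html",
)
```

Anchors without an href attribute, such as the third one in the example, are ignored.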

3.5.4 Visual features

All the features mentioned so far are text based, e.g. source code, stripped HTML, domain names and URLs. Image-based features were used as a complementary source of information. First, screenshots of webpages were captured by passing the URLs to PhantomJS (a headless WebKit browser with a JavaScript API), which loaded each URL and saved a screenshot of the webpage in PNG format. The images were then converted into a representation that the models could use. Two popular techniques for this are Speeded Up Robust Features (SURF) and Scale Invariant Feature Transform (SIFT) [73]. The simulations used SURF because it had less stringent licensing terms than SIFT [74]. The idea is that malicious webpages will look similar to one another because they are likely to have had less input from designers, whereas safe webpages will have better designs.

3.6 Summary

In this Chapter, a framework has been built to extract various features from webpages to feed into the models. The features include content, source code, URLs and visual features, which were extracted from the webpages using crawlers and parsers. This framework lays the foundation for the next two Chapters, where various machine learning techniques (both supervised and unsupervised) use these features.

CHAPTER 4

Computer simulation results using supervised models to detect malicious

Figure 4.1: Architecture for classifying webpages using supervised techniques

4.1 Introduction

This Chapter focuses on the supervised models and discusses the results from these models. These models are popular in text classification.

Chapter 3 mentioned that this thesis used the bag of words approach, due to its simplicity and to avoid computational complexity. Figure 4.1 shows how supervised models have been used to classify the webpages. It is very similar to
