4.7 A classifier client implementation: Net Defence

4.7.4 Algorithm

Within Net Defence there are several distinct pieces of code that perform discrete operations. The main algorithm, however, is responsible for making the actual classifications each time the user requests a URL and is depicted in the flow diagram shown in Figure 4.11.

When Google Chrome starts up and loads the Net Defence extension, it executes the background JavaScript file. Within this extension, this file is called core.js and is responsible for registering an event callback (a method to be executed when the event fires) for the onBeforeRequest event. A high-level overview, taken from core.js, is shown in Listing C.2.

This listener is triggered every time a URL is requested, including URLs requested by extensions and by JavaScript executing on web pages. One benefit is that URLs cannot be hidden from the engine by making the calls asynchronously within a JavaScript procedure. A second benefit is that this callback is executed in a blocking fashion. This means that URLs will not be fetched until after they have been classified, and can be cancelled (blocked) altogether, further protecting users from drive-by attacks. The checkURL method is the callback which is executed whenever a URL is requested.
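As a rough illustration of the blocking registration described above, the following sketch uses Chrome's webRequest API in its Manifest V2 style. It is not the code from core.js (Listing C.2): the body of checkURL is a placeholder, and isMalicious is a hypothetical stand-in for the classification pipeline described in the rest of this section.

```javascript
// Hypothetical stand-in for the classification pipeline (assumption).
function isMalicious(url) {
  return url.includes("malicious.example");
}

// Callback run before each request; returning { cancel: true } blocks it.
function checkURL(details) {
  return { cancel: isMalicious(details.url) };
}

// Register only when running inside a Chrome extension that holds the
// webRequest and webRequestBlocking permissions.
if (typeof chrome !== "undefined" && chrome.webRequest) {
  chrome.webRequest.onBeforeRequest.addListener(
    checkURL,
    { urls: ["<all_urls>"] }, // listen to every requested URL
    ["blocking"]              // hold the request until the callback returns
  );
}
```

Because the listener is registered with the "blocking" option, the browser waits for the return value of checkURL before fetching anything, which is what allows a request to be cancelled before any content is downloaded.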

An overview of the algorithm used in the checkURL method is depicted in Figure 4.11.

First, the URL is checked against the exceptions rule list; any URL that matches one of these rules is allowed. For usability and performance reasons, the next check performed is against a user-defined whitelist. If a URL is found in this list, it is automatically allowed and the algorithm does not proceed any further. If the requested URL does not appear in this whitelist, it is checked against a user-defined blacklist. This is not a large blacklist obtained from a resource on the Internet, but a list of URLs that were classified as benign in the past and that the user has flagged as false negatives. Similarly, if a URL has previously been classified as malicious and the user has flagged it as a false positive, it is added to the whitelist. If the URL does not match any exception, whitelist or blacklist entry, it is processed by the remainder of the algorithm.
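The order of these checks can be sketched as follows. This is a minimal illustration under stated assumptions: exception rules are modelled as regular expressions, the whitelist and blacklist as sets of exact URLs, a blacklist match is taken to mean that the request is blocked, and the names preClassify and lists are hypothetical rather than taken from the extension.

```javascript
// Decide what to do with a URL before the classifier is consulted.
// Returns "allow", "block", or "classify".
function preClassify(url, lists) {
  // 1. Exception rules: any matching rule allows the URL.
  if (lists.exceptions.some((rule) => rule.test(url))) return "allow";
  // 2. User-defined whitelist (URLs the user flagged as false positives).
  if (lists.whitelist.has(url)) return "allow";
  // 3. User-defined blacklist (URLs the user flagged as false negatives).
  if (lists.blacklist.has(url)) return "block";
  // 4. No match: hand the URL over to the classifier.
  return "classify";
}

// Illustrative data only.
const lists = {
  exceptions: [/^chrome-extension:\/\//],
  whitelist: new Set(["http://intranet.example/"]),
  blacklist: new Set(["http://bad.example/login"]),
};
```

Placing the cheap list lookups first means that a URL the user has already ruled on never reaches the more expensive feature extraction and classification steps.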

The following process encapsulates the ANN functionality around which this research is based. First, the algorithm checks whether the classifier has already been instantiated by a previous request. If there is no instance, it checks for the data that are used to instantiate the object. If they are not found, they are requested from the CDS through the update functionality shown in Figure 4.11. On the first execution, the classifier object is created as an instance of the OnlinePerceptron class, which is stored in the library written for this extension. Once the classifier is instantiated, the data store is checked for the network definition. If this definition is not present, the update service is invoked and the definition is downloaded and stored. This definition contains information regarding how many inputs are used, their weightings, and the bias that the network should use. Also contained within this data are the BOW definition and the normalisation values required.
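The lazy instantiation described above might look roughly like the sketch below. It collapses the two lookups into one for brevity, and everything apart from the class name OnlinePerceptron is an assumption: the in-memory Map stands in for the extension's data store, fetchDefinitionFromCDS is a hypothetical placeholder for the update service, and the shape of the definition object is invented for illustration.

```javascript
// In-memory stand-in for the extension's data store (assumption).
const store = new Map();

// Hypothetical update service; the real one requests the data from the CDS.
function fetchDefinitionFromCDS() {
  return { weights: [0.5, -0.25], bias: 0.1 };
}

// Minimal placeholder for the OnlinePerceptron class named in the text.
class OnlinePerceptron {
  constructor(definition) {
    this.weights = definition.weights;
    this.bias = definition.bias;
  }
}

let classifier = null;

function getClassifier() {
  // Reuse the instance created by a previous request, if any.
  if (classifier !== null) return classifier;
  // Otherwise look for the stored definition and update if it is missing.
  if (!store.has("definition")) {
    store.set("definition", fetchDefinitionFromCDS());
  }
  classifier = new OnlinePerceptron(store.get("definition"));
  return classifier;
}
```

Only the first request pays the cost of loading or downloading the definition; every later call to getClassifier returns the cached instance.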

Once the classifier is instantiated and initialised, data extraction begins. The first step of this process is to extract every word from each section of the URL. A vector of boolean values is created for each section of the URL; it is as long as the BOW for that section, with each value set to false. Each word present in that section of the requested URL is checked against the BOW and, if present, the corresponding index of the vector is flipped to true. This process is repeated for each section of the URL. Once each BOW vector has been created, the vectors are merged in the order of the URL structure, creating the large vector that will later be merged with another vector to form the input data for the classifier.
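A minimal sketch of this bag-of-words step is given below. The choice of sections (host, path, query), the vocabularies, and the word-splitting rule are all assumptions made for illustration; the actual BOW definition is supplied by the update data.

```javascript
// Hypothetical per-section bags of words (assumption).
const bow = {
  host: ["paypal", "login", "secure"],
  path: ["admin", "update", "login"],
  query: ["id", "token"],
};

// Split one URL section into lowercase alphanumeric words.
function words(section) {
  return section.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean);
}

// One boolean vector per section, as long as that section's BOW.
function sectionVector(section, vocabulary) {
  const vector = new Array(vocabulary.length).fill(false);
  for (const word of words(section)) {
    const index = vocabulary.indexOf(word);
    if (index !== -1) vector[index] = true; // word is in the BOW
  }
  return vector;
}

// Merge the section vectors in URL order: host, path, query.
function bowVector(url) {
  const u = new URL(url);
  return [
    ...sectionVector(u.hostname, bow.host),
    ...sectionVector(u.pathname, bow.path),
    ...sectionVector(u.search, bow.query),
  ];
}
```

For the URL http://secure-login.example.com/update/account?id=7 this yields [false, true, true, false, true, false, true, false]: "login" and "secure" match the host vocabulary, "update" matches the path vocabulary, and "id" matches the query vocabulary.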

The final data required as input to the classifier are the lexical features of the URL. The URL is analysed and a set of 20 metrics is calculated, as described in Section 3.1.1. These values are inserted into another vector and passed to an object which uses the update data to normalise the values for each field. After normalisation of this vector is complete, it is appended to the end of the BOW vector, and the result constitutes the final input data passed to the classifier.
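The normalisation and merge can be illustrated with the sketch below. It uses only three illustrative metrics in place of the 20 defined in Section 3.1.1, and it assumes min-max scaling with hypothetical per-field ranges; the text does not state which normalisation scheme the extension applies, so both are assumptions.

```javascript
// Three illustrative metrics; the thesis uses 20 (Section 3.1.1).
function lexicalFeatures(url) {
  return [
    url.length,                         // total length of the URL
    (url.match(/\./g) || []).length,    // number of dots
    (url.match(/[0-9]/g) || []).length, // number of digits
  ];
}

// Hypothetical normalisation values, one { min, max } pair per field.
const ranges = [
  { min: 0, max: 200 },
  { min: 0, max: 10 },
  { min: 0, max: 50 },
];

// Min-max scaling, clamped to [0, 1] (an assumed scheme).
function normalise(values, fieldRanges) {
  return values.map((v, i) => {
    const { min, max } = fieldRanges[i];
    return Math.min(1, Math.max(0, (v - min) / (max - min)));
  });
}

// Append the normalised metrics to the BOW bits (booleans become 0 or 1).
function buildInput(bowBits, url) {
  return [...bowBits.map(Number), ...normalise(lexicalFeatures(url), ranges)];
}
```

Scaling every metric to a common range keeps a large raw value, such as the URL length, from dominating the binary BOW inputs in the weighted sum.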

The last step within this algorithm is to pass control to the classifier, which then flags the input data, and therefore the requested URL, as malicious or benign. A benign classification results in the extension allowing the request to continue uninterrupted, while a malicious classification results in the request being blocked and the user being notified, depending on the extension’s settings. The options available are discussed in Section 4.7.5. In Figure 4.11, the malicious request is simply logged to the console rather than being blocked outright. This is useful in research environments or during the testing of the classifier.
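The final decision can be sketched as a single perceptron evaluation followed by a settings check. The threshold at zero, the function names, and the blockMalicious setting are assumptions for illustration; the actual options are those discussed in Section 4.7.5.

```javascript
// Weighted sum plus bias; a positive activation is treated as malicious.
function classify(input, weights, bias) {
  const activation = input.reduce((sum, x, i) => sum + x * weights[i], bias);
  return activation > 0 ? "malicious" : "benign";
}

// Translate a label into a webRequest-style response, honouring a
// hypothetical "blockMalicious" setting.
function decide(label, url, settings, log = console.log) {
  if (label === "benign") return { cancel: false };
  if (settings.blockMalicious) return { cancel: true };
  // Log-only mode, as in research or testing environments.
  log("Malicious URL detected (not blocked): " + url);
  return { cancel: false };
}
```

With blockMalicious disabled, a malicious classification is only logged and the request proceeds, which corresponds to the log-only behaviour described for research and testing.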
