Vector extraction - A framework for high speed lexical classification of malicious URLs

During training, both positive and benign data sets are loaded. They are then broken down into the constituent sections which make up a valid URL. Each section has an associated list which is used to store the words which will make up the Bag-of-Words (BOW) used as input to the ANN. Each section of each URL is separated into the tokens which make it up. These words are then inserted into their associated lists, even if they are duplicates. Only once this process is complete are the lists checked for duplicates inside themselves. This means that a word will appear once within any given list, but may be present in multiple lists. This distinction is important as lists may have words which are duplicates of words from other lists. This allows the classifier to assign different

weights to a word appearing in different sections of a URL. The lists generated by this process are then used as a mapping to create a bag of words for any specific URL. When a URL is being used to create an input vector for the ANN, the BOW is used. A vector with the same single dimension as the BOW is created which will represent a set of binary inputs. Each value in the vector is assigned a 0 at initialisation time, indicating that there are no words present within the sample. Then, each word within each section of the URL is checked against the list of the BOW which is associated with that section. If that word is found at a particular index within that list, that index within the input vector is set to 1. For further lists, an offset is used to find the index for the associated list within the input vector. In addition to these binary inputs, there are 20 ‘obfuscation resistant’ features which are used, as described by the authors. These features are continuous values which represent various metrics regarding a URL. These values are described as follows:

• The URL as a whole has two associated metrics: the number of dots present as well

as the total length.

• The domain name length is recorded, whether or not it represents an IP address or

uses a port number. The number of tokens and hyphens that are used are recorded, as well as the length of the longest token that appears within the domain name.

• The directory section of the URL is analysed for length, the number of directories

traversed. Additionally, the longest directory name, the highest number of dots that appear within a directory name as well as the highest number of URL delimiters used in a directory name are calculated.

• The length, number of dots and the number of delimiters used within the filename

are calculated and stored within the input vector.

• Within the argument portion of the sample, the length and number of variables

defined are stored. Finally, the length of the longest variable is calculated and stored, as well as the highest number of delimiters used within a variable value.

Finally, where relevant, there are flag features which accompany some of these continuous values. These flags will be discussed in Section 3.4.1.

3.4. VECTOR EXTRACTION 31

Table 3.1: Example of vector data pre and post normalisation.

x 71 3 26 3 0 18 0 0 0 0 0 0

normalised 0.034 0.085 0.111 0.142 0 0 0 0 0 0 0 1

3.4.1 Normalisation

Once the input vector has been constructed, the continuous variables which describe various metrics regarding the URL are normalised. Examples of the continuous variables include the length of the sample or the number of tokens in the domain section of the URL. These variables differ to binary variables which only have two possible values: 0 and 1. Some of these metrics have naturally higher values than others, such as length versus the number of dots used in the path section. As a result, within this example, length would have a naturally higher bias associated with it at the start of training. While this would eventually be trained out; it adds training time through iterations required for the ANN to correct this natural error. As a result, these values are all normalised to a range of between 0 and 1, where 1 represents a value equivalent to the highest value encountered for that field within the training data set. It is important to note that 0 does not imply a 0 value, but a value that is the lowest in the encountered data set. For this reason, for fields that are relevant, a flag value is associated, as mentioned in Section 3.1.1. This is a binary value where a 1 indicates that a continuous value field with a 0 value represents an equivalence to the lowest number encountered within the training data set. A 0 value for this binary field indicates that the associated continuous field has no input.

xn= x−l

h−l (3.1)

The normalisation values for the input vectors are calculated before training begins. Firstly, a pair of vectors which represent maximum and minimum values for each field is initialised to a 0 value for each field. Then each extracted sample vector is iterated over, checking each field for a value that is higher than the highest value encountered thus far, or lower than the lowest value encountered. When an input vector is extracted from a sample URL, these maximum and minimum values are used to calculate the normalised value of the field. The method by which these values are used is shown in Equation 3.1

wherexnrepresents the normalised value ofx, hrepresents the highest value encountered

in the training data set andl represents the lowest value for that field encountered in the

data set.

in this table are continuous value variables, while the remaining six are the associated zero-value flags for each field. The first four fields are normalised as values which are somewhere inside the range of value encountered. The fifth value stays at 0, with the eleventh flag also set at zero, indicating a 0 value pre-normalisation. The sixth field is normalised from 18 to 0, with its zero-value flag set to 1, indicating that 18 is equivalent to the smallest value encountered during normalisation for that field.

In document A framework for high speed lexical classification of malicious URLs (Page 47-50)