Similarity Graphs - Assessing Code Obfuscation of Metamorphic JavaScript

The method employed by Mishra et al. [2] will be used to assess the extent of metamorphism done by the engine and to compare two files of code. This program compares code at a bytecode level to quantify the degree of metamorphism after a file is passed through the metamorphic engine. To do this, the code needs to be converted to an appropriate form. Figure 10 illustrates the process mentioned below.

• The first step in the process is to compile the JavaScript file to a class file using Rhino. The following command is used.

Figure 10: Converting file to appropriate format

java −cp j s . j a r org . m o z i l l a . j a v a s c r i p t . t o o l s . j s c . Main • The next step is to disassemble the class file and convert it to human readable

language,

javap −c <filename >

• The output from the command line is saved as a txt file.

• The file is parsed using python to remove all the leading white spaces, function names, comments and the number preceding each instruction.

The java -cp command is used to compile the file from a JavaScript file to a class file and the javap command takes the class file as an input and gives us a more human readable file. A script is written to parse through a directory, retrieve all the JS files and perform the required operation thereby removing the manual work required for this process. A sample file in Figure 11 is shown below:

Figure 11: A sample file after converting to bytecode.

To compare two files we look to Mishra’s method. File X is the original code and File Y is the code that is passed through the metamorphic engine. The method put forward by Mishra consists of the following steps. Figure 12 illustrates the process mentioned below.

Figure 12: Assessing extent of metamorphism.[2]

• Given two pieces of code X and Y, X before the file is passed through the engine and Y after the metamorphism is done, the opcode sequences are extracted from each of the files. Redundant commands such as comments, blank lines, and such are removed from this sequence.

• The resultant file is an opcode sequence of length n and m wheren and m are

the number of opcodes in each file.

• Two opcode sequences are then compared by considering subsequences of three opcode. A match is found is all the three opcodes are similar regardless of their position in the subsequence. The coordinates (x,y) are marked on the graph where x is the opcode number of the first opcode of the three opcode sequence in program X and y is the opcode number of the opcode subsequence in program Y.

a n*m graph is obtained.

• If a line segment falls on the main diagonal of the graph, it means that the opcodes are at the same location in both the files. A line of the diagonal indicates that the match appear at different locations in the two files.

A number of experiments are run using this method to verify that the implemen- tation of JavaScript metamorphic techniques is effective.

CHAPTER 5

Support vector machines.

Support Vector Machines (SVM) proposed in the 1990s is a discriminative classifier. It is a supervised machine learning algorithm and a preferred model for classification as it has high accuracy with less computation power. It was mainly designed for pattern recognition but is also applied in many classification problems such as image, speech recognition, text categorization, and face detection.

Given training data which is labeled i.e each data belonging to a specific class, a SVM is capable of building a model that can predict the class of new test data. The main objective of a SVM is to find a separating hyperplane that can clearly classify data points. The hyperplane is a line dividing a plane into two parts where each data point falls into one category. Figure 13 below shows an example of the hyperplane. For a set of independent and identically distributed training sample

Figure 13: A HyperPlane [3]

The aim is to find a hyperplane represented as

𝑤𝑇.𝑥 +𝑏 = 0 (2)

which can be used to separate the two different samples accurately w is the weights vector, (x,y) is a set of training data, and b = -w.x0.

The optimal hyperplane is one that maximizes the distance from the hyperplane to nearest point of each class. This hyperplane is selected from a number of hyperplanes or the classifying patterns. The figure below 14 gives a basic idea.

Figure 14: Set of Possible Hyperplanes[3]

The dimension of the hyperplane is dependant on the number of features fed into the model. The model becomes more complex when the number of features exceeds 3. To build an SVM, Support vectors play a vital role. Support vectors are data points that are closer to the hyperplane and influence the position and orientation of the hyperplane.

Using Support vectors, we maximize the margin of the classifier. The bilateral blank area (2/ || w||) must be maximized to obtain the partition hyperplane and this is done by maximizing the weight of the margin. Formula 3 [3] is used to maximize the weight of the margin.

𝑀 𝑖𝑛Φ(𝑤) = 1/2||𝑤||2 = 1/2(𝑤, 𝑤) [3] (3) such that

𝑦𝑖(𝑤.𝑥𝑖+𝑏)>= 1 [3] (4) where w is the weights vector, and Φis the kernel function. The kernel function is used to map a pattern onto a higher dimension such that the pattern can become linearly separable. 1/2 is used in the formula as it is convenient for taking the derivative. 4 are the constraints to ensure that all instances are correctly classified. Maximizing the margin of the classifier is done to give better classification. Larger the margin, better is the classification of the model and thereby lesser percentage of error.

In document Assessing Code Obfuscation of Metamorphic JavaScript (Page 31-38)