Jakobsen Algorithm - Simple Substitution Distance

5.3 Simple Substitution Distance

5.3.1 Jakobsen Algorithm

Basic assumption in Jakobsen’s algorithm is that the plain text is in English language and Ciphertext contains 26 different symbols. Each symbol maps to one of the 26 different letters in English. Algorithm starts with calculating the frequency of the symbols in the ciphertext and arranging them in the decreasing order of their occurrence frequency. Putative key is then calculated by mapping these symbols from the ciphertext to the letters in English arranged in the standard frequency order. This key is altered in each iteration after looking at the obtained plain text. Suppose key ‘K’ is initial key, then it can be represented as

K = 𝑘1, 𝑘2, 𝑘3, . . . , 𝑘26

where each 𝑘𝑛 is mapping of one symbol from cipher text on some English letter. In each iteration algorithm modifies key ‘K’ slightly. This new key is used to decrypt ciphertext and then the newly generated putative plain text is checked. If putative plain text is closer to the original plain text, then the change in key is retained and algorithm starts again considering the obtained modified key as the putative key. A scheme is followed to modify the key. Initially, symbols adjacent to each other are swapped and cipher text is decrypted to check correctness of the key. Suppose, we have the putative key as shown above, then initially swapping is done in 𝑘1 and

𝑘2. In next iteration, swapping will be done in 𝑘2 and 𝑘3 and so on. This can be

viewed as first round of swapping in which we swap adjacent element. In next round, elements at places 𝑖𝑡ℎ _and ₍_𝑖_{+ 2)}𝑡ℎ _{are swapped. In the successive rounds, distance} between to be swapped elements is increased by 1. In all, there can be (︀26₂)︀ different

ways in which the key can be altered. This scheme of swapping can be visualized as round 1: 𝑘1|𝑘2 𝑘2|𝑘3 𝑘3|𝑘4 . . . 𝑘23|𝑘24 𝑘24|𝑘25 𝑘25|𝑘26 round 2: 𝑘1|𝑘3 𝑘2|𝑘4 𝑘3|𝑘5 . . . 𝑘23|𝑘25 𝑘24|𝑘26 round 3: 𝑘1|𝑘4 𝑘2|𝑘5 𝑘3|𝑘6 . . . 𝑘23|𝑘26 .. . ... . .. round 24: 𝑘1|𝑘24 𝑘2|𝑘25 𝑘3|𝑘26 round 25: 𝑘1|𝑘25 𝑘2|𝑘26 round 26: 𝑘1|𝑘26

To test the correctness of the key, algorithm creates a digram matrix of English letters from plain text. Each entry in matrix represents frequency count of that digram in plain text. This matrix is referred as 𝐸 matrix in the algorithm. Similarly, another digram matrix is created from decrypted text. This matrix is called as 𝐷 matrix in the algorithm. Correctness of the key is calculated as

score (k)=∑︁

𝑖,𝑗

|𝑑𝑖,𝑗−𝑒𝑖,𝑗|

where, 𝑑𝑖,𝑗 and 𝑒𝑖,𝑗 refers to 𝑖𝑡ℎ and 𝑗𝑡ℎ element in 𝐷 and 𝐸 matrices respectively. In each iteration of the algorithm, the score is calculated. If the score is smaller than the score obtained in previous iteration, it is retained else discarded. In other words, lower the score, better is the decrypted text. At the end of the algorithm, if the obtained decrypted text is the same as the original text, then the score will become zero. As a result, we will get the key and decrypted text, which will be similar to the original plain text.

5.3.2 Implementation

In this method, we extracted the opcode sequence from generated morphed ver- sions of the given malware JavaScript. Then the frequency of each opcode is counted. All opcodes are arranged in the decreasing order of their count. Consider we get the

opcode order as

MOV,ADD,LOAD,SUB,JMP,CMP

This will be called as standard opcode order for the given malware. Then digram frequency graph is constructed by using the same opcode sequence. In this graph, each entry will represent frequency count of a digram. Adjacency matrix is used to represent this graph. This adjacency matrix will be analogous to 𝐸 matrix in Jakobsen’s algorithm [13]. The opcode sequence is then extracted from the file to be tested. Frequency of each opcode is counted and arranged in decreasing order of their frequency count. Suppose we get the opcode sequence as

ADD,SUB,CMP,LOAD,MOV,MUL,JMP

This opcode order is mapped to previously calculated standard opcode order. This mapping will give us the putative key. Initial putative key is shown in Table 4. After

Table 4: Opcode Mapping (Putative Key)

Standard Opcode Order MOV ADD LOAD SUB JMP CMP OTHER

Test Opcode Order ADD SUB CMP LOAD MOV MUL JMP

doing one to one mapping, if testing file contains more unique opcodes, then they will be mapped to dummy opcodeOTHER. Then all opcodes in testing file are replaced with corresponding opcode in the putative key. A digram matrix is created from the replaced opcode sequence. This matrix is analogous to 𝐷 matrix in Jakobsen’s algorithm. A score is calculated by taking Manhattan distance between the𝐷and 𝐸 matrices.

score=∑︁

𝑖,𝑗

|𝑑𝑖,𝑗 −𝑒𝑖,𝑗|

Key swapping is then done in the putative key by swapping the adjacent elements. For example, we have the putative key as shown in Table 5. First swapping occurs

between ADDand SUB to give a new putative key as shown in Table 6. Table 5: Key Before Swapping

Standard Opcode Order MOV ADD LOAD SUB JMP CMP OTHER

Test Opcode Order ADD SUB CMP LOAD MOV MUL JMP

Table 6: Key After Swapping

Standard Opcode Order MOV ADD LOAD SUB JMP CMP OTHER

Test Opcode Order SUB ADD CMP LOAD MOV MUL JMP

By using this new key, we calculate scores for the given test file again. To reflect the changes made in the key, we need to construct a new𝐷matrix. The new𝐷matrix will be constructed by swapping the rows and columns in the previously constructed 𝐷 matrix. Suppose we have a 𝐷 matrix as shown in the Table 7.

Table 7: D Matrix before Swapping

MOV ADD LOAD SUB JMP CMP OTHER

MOV 9 7 8 6 3 2 1 ADD 9 7 5 8 5 8 3 LOAD 5 8 6 4 3 7 1 SUB 4 8 4 7 4 8 3 JMP 5 3 8 5 3 8 6 CMP 3 2 5 6 3 7 3 OTHER 9 6 0 2 5 9 7

As we swap entries corresponding toMOVand ADDin the putative key, to achieve the same effect on 𝐷 matrix, we need to swap the rows and columns corresponding toMOV and ADD. After swapping, we will get a new 𝐷 matrix as shown in Table 8.

Thereafter, score is calculated by taking the Manhattan distance between the newly created 𝐷 matrix and 𝐸 matrix. If the new score is greater than the previous score, then we discard the swapping and move towards next swapping. Here in this

Table 8: D Matrix after Swapping

MOV ADD LOAD SUB JMP CMP OTHER

MOV 7 9 5 8 5 8 3 ADD 7 9 8 6 3 2 1 LOAD 8 5 6 4 3 7 1 SUB 8 4 4 7 4 8 3 JMP 3 5 8 5 3 8 6 CMP 2 3 5 6 3 7 3 OTHER 6 9 0 2 5 9 7

example, next swapping will be done between MOVand LOAD. If the new score is less than the previously calculated score, then we will retain the swapping and continue algorithm from beginning, keeping swapped key as the putative key.

In the first round, swapping will be done between adjacent elements in the key. In next round, swapping will be done between elements on the𝑖𝑡ℎ_and _𝑖_{+ 2}𝑡ℎ _position in the key. In successive rounds, distance between elements to be swapped is increased by one. This swapping scheme can be visualized as

round 1: 𝑘1|𝑘2 𝑘2|𝑘3 𝑘3|𝑘4 . . . 𝑘𝑛−3|𝑘𝑛−2 𝑘𝑛−2|𝑘𝑛−1 𝑘𝑛−1|𝑘𝑛 round 2: 𝑘1|𝑘3 𝑘2|𝑘4 𝑘3|𝑘5 . . . 𝑘𝑛−3|𝑘𝑛−2 𝑘𝑛−2|𝑘𝑛 round 3: 𝑘1|𝑘4 𝑘2|𝑘5 𝑘3|𝑘6 . . . 𝑘𝑛−3|𝑘𝑛 .. . ... . .. round n-2: 𝑘1|𝑘𝑛−2 𝑘2|𝑘𝑛−1 𝑘3|𝑘𝑛 round n-1: 𝑘1|𝑘𝑛−1 𝑘2|𝑘𝑛 round n: 𝑘1|𝑘𝑛

The smallest score achieved in this process is the score of similarity. To classify a file, this score is compared against a predefined threshold. If the score is less than the threshold value, then file under consideration belongs to the given malware family.

In document Hunting For Metamorphic JavaScript Malware (Page 46-51)