5.3 Simple Substitution Distance
5.3.1 Jakobsen Algorithm
The basic assumption in Jakobsen's algorithm is that the plaintext is English and the ciphertext contains 26 different symbols, each of which maps to one of the 26 letters of the English alphabet. The algorithm starts by calculating the frequency of each symbol in the ciphertext and arranging the symbols in decreasing order of frequency. A putative key is then obtained by mapping these symbols to the English letters arranged in their standard frequency order. This key is altered in each iteration after examining the resulting putative plaintext. Suppose "K" is the initial key; then it can be represented as
$$K = k_1, k_2, k_3, \ldots, k_{26}$$
where each $k_i$ is the mapping of one ciphertext symbol to an English letter. In each iteration, the algorithm modifies the key "K" slightly. The new key is used to decrypt the ciphertext, and the newly generated putative plaintext is checked. If the putative plaintext is closer to the original plaintext, the change in the key is retained and the algorithm starts again, taking the modified key as the putative key. The key is modified according to a fixed scheme. Initially, adjacent elements are swapped and the ciphertext is decrypted to check the correctness of the key. For the putative key shown above, the first swap is between $k_1$ and $k_2$, the next between $k_2$ and $k_3$, and so on. This can be viewed as the first round of swapping, in which adjacent elements are swapped. In the next round, the elements at positions $i$ and $i+2$ are swapped. In each successive round, the distance between the swapped elements is increased by 1. In all, there are $\binom{26}{2} = 325$ different ways in which the key can be altered. This swapping scheme can be visualized as

$$
\begin{array}{llllllll}
\text{round 1:} & k_1|k_2 & k_2|k_3 & k_3|k_4 & \ldots & k_{23}|k_{24} & k_{24}|k_{25} & k_{25}|k_{26} \\
\text{round 2:} & k_1|k_3 & k_2|k_4 & k_3|k_5 & \ldots & k_{23}|k_{25} & k_{24}|k_{26} & \\
\text{round 3:} & k_1|k_4 & k_2|k_5 & k_3|k_6 & \ldots & k_{23}|k_{26} & & \\
\vdots & \vdots & \vdots & \vdots & & & & \\
\text{round 23:} & k_1|k_{24} & k_2|k_{25} & k_3|k_{26} & & & & \\
\text{round 24:} & k_1|k_{25} & k_2|k_{26} & & & & & \\
\text{round 25:} & k_1|k_{26} & & & & & &
\end{array}
$$
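The order in which pairs of key positions are visited can be sketched in a few lines of Python. This is only an illustration of the swapping scheme above: the function name `swap_schedule` is hypothetical, and positions are 0-based here, whereas the text numbers key elements from 1.

```python
def swap_schedule(n):
    """Yield index pairs (a, a + b) round by round.

    In round b (b = 1 .. n-1) every position a is paired with the
    position b places further along the key. Indices are 0-based.
    """
    for b in range(1, n):        # distance between the swapped positions
        for a in range(n - b):   # starting position within the round
            yield a, a + b


pairs = list(swap_schedule(26))
print(len(pairs))   # 325, i.e. 26 choose 2
print(pairs[:3])    # [(0, 1), (1, 2), (2, 3)]
print(pairs[-1])    # (0, 25)
```

Each unordered pair of positions appears exactly once, which is why the total equals the number of ways of choosing 2 positions out of 26.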
To test the correctness of the key, the algorithm creates a digram matrix of English letters from the plaintext. Each entry in the matrix is the frequency count of the corresponding digram in the plaintext. This matrix is referred to as the $E$ matrix in the algorithm. Similarly, another digram matrix, called the $D$ matrix, is created from the decrypted text. The correctness of the key is calculated as
$$\text{score}(K) = \sum_{i,j} \left| d_{i,j} - e_{i,j} \right|$$
where $d_{i,j}$ and $e_{i,j}$ are the elements in row $i$ and column $j$ of the $D$ and $E$ matrices, respectively. The score is calculated in each iteration of the algorithm. If the score is smaller than the score obtained in the previous iteration, the modified key is retained; otherwise it is discarded. In other words, the lower the score, the better the decrypted text. If the decrypted text obtained at the end of the algorithm is the same as the original text, the score is zero. As a result, we obtain the key and a decrypted text that is similar to the original plaintext.
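As a small illustration of the score, the sketch below counts digrams with a `Counter` instead of an explicit 26 x 26 matrix; summing absolute differences over every digram seen in either text gives the same value as the matrix formula, since all other cells are zero in both. The helper names and the toy strings are assumptions made for this example only.

```python
from collections import Counter


def digram_counts(text):
    """Count adjacent letter pairs; anything that is not A-Z is dropped first."""
    letters = [c for c in text.upper() if "A" <= c <= "Z"]
    return Counter(zip(letters, letters[1:]))


def score(d, e):
    """Sum of |d[i,j] - e[i,j]| over every digram present in either table."""
    return sum(abs(d[g] - e[g]) for g in set(d) | set(e))


e = digram_counts("the cat")               # stands in for the E matrix
print(score(digram_counts("the cat"), e))  # 0: identical text
print(score(digram_counts("teh cat"), e))  # 6: two letters transposed
```

The second call shows how a single wrong pair of letters already moves the score away from zero.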
5.3.2 Implementation
In this method, we extract the opcode sequence from the generated morphed versions of the given malware JavaScript. The frequency of each opcode is then counted, and all opcodes are arranged in decreasing order of their count. Suppose we get the opcode order
MOV,ADD,LOAD,SUB,JMP,CMP
This is called the standard opcode order for the given malware. A digram frequency graph is then constructed from the same opcode sequence; each entry in this graph represents the frequency count of a digram. The graph is represented by its adjacency matrix, which is analogous to the $E$ matrix in Jakobsen's algorithm [13]. The opcode sequence is then extracted from the file to be tested. The frequency of each opcode is counted, and the opcodes are arranged in decreasing order of frequency. Suppose we get the opcode order
ADD,SUB,CMP,LOAD,MOV,MUL,JMP
This opcode order is mapped to the previously calculated standard opcode order. This mapping gives us the putative key. The initial putative key is shown in Table 4.

Table 4: Opcode Mapping (Putative Key)

| Standard Opcode Order | MOV | ADD | LOAD | SUB | JMP | CMP | OTHER |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Test Opcode Order | ADD | SUB | CMP | LOAD | MOV | MUL | JMP |

After the one-to-one mapping, if the test file contains more unique opcodes than the standard opcode order, the extra opcodes are mapped to the dummy opcode OTHER. Then every opcode in the test file is replaced with the corresponding opcode in the putative key. A digram matrix is created from the replaced opcode sequence. This matrix is analogous to the $D$ matrix in Jakobsen's algorithm. A score is calculated by taking the Manhattan distance between the $D$ and $E$ matrices.
$$\text{score} = \sum_{i,j} \left| d_{i,j} - e_{i,j} \right|$$
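The steps up to this point (frequency ordering, building the putative key, substituting opcodes, and scoring) can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in this work: all function names are hypothetical, ties in opcode frequency are broken by first appearance, and the two short opcode sequences are made up. Only the two opcode orders are taken from Table 4.

```python
from collections import Counter


def opcode_order(opcodes):
    """Distinct opcodes by decreasing count (ties keep first-appearance order)."""
    return [op for op, _ in Counter(opcodes).most_common()]


def putative_key(standard_order, test_order):
    """Map the i-th test opcode to the i-th standard opcode.

    Test opcodes beyond the end of the standard order map to "OTHER".
    """
    key = {}
    for i, op in enumerate(test_order):
        key[op] = standard_order[i] if i < len(standard_order) else "OTHER"
    return key


def digram_matrix(opcodes, labels):
    """Square matrix of counts of adjacent opcode pairs, indexed by `labels`."""
    index = {op: i for i, op in enumerate(labels)}
    m = [[0] * len(labels) for _ in labels]
    for x, y in zip(opcodes, opcodes[1:]):
        m[index[x]][index[y]] += 1
    return m


def manhattan(d, e):
    """Sum of absolute element-wise differences between two matrices."""
    return sum(abs(x - y) for dr, er in zip(d, e) for x, y in zip(dr, er))


standard = ["MOV", "ADD", "LOAD", "SUB", "JMP", "CMP"]           # Table 4, row 1
test_order = ["ADD", "SUB", "CMP", "LOAD", "MOV", "MUL", "JMP"]  # Table 4, row 2
key = putative_key(standard, test_order)
print(key["ADD"], key["JMP"])                 # MOV OTHER

labels = standard + ["OTHER"]
malware_seq = ["MOV", "ADD", "MOV", "ADD"]    # made-up malware opcodes
test_seq = ["ADD", "SUB", "ADD", "JMP"]       # made-up test-file opcodes
print(opcode_order(malware_seq))              # ['MOV', 'ADD']
E = digram_matrix(malware_seq, labels)
D = digram_matrix([key[op] for op in test_seq], labels)
print(manhattan(D, E))                        # 2
```

In the toy data the substituted test sequence becomes MOV, ADD, MOV, OTHER, so it differs from the malware sequence in two digram cells.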
Key swapping is then performed on the putative key, starting with the adjacent elements. For example, suppose we have the putative key shown in Table 5. The first swap is between ADD and SUB, which gives the new putative key shown in Table 6.

Table 5: Key Before Swapping

| Standard Opcode Order | MOV | ADD | LOAD | SUB | JMP | CMP | OTHER |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Test Opcode Order | ADD | SUB | CMP | LOAD | MOV | MUL | JMP |

Table 6: Key After Swapping

| Standard Opcode Order | MOV | ADD | LOAD | SUB | JMP | CMP | OTHER |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Test Opcode Order | SUB | ADD | CMP | LOAD | MOV | MUL | JMP |
Using this new key, we calculate the score for the given test file again. To reflect the change made in the key, we need a new $D$ matrix. The new $D$ matrix is constructed by swapping the corresponding rows and columns of the previously constructed $D$ matrix. Suppose we have the $D$ matrix shown in Table 7.
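Table 7: D Matrix before Swapping

|       | MOV | ADD | LOAD | SUB | JMP | CMP | OTHER |
| ----- | --- | --- | ---- | --- | --- | --- | ----- |
| MOV   | 9   | 7   | 8    | 6   | 3   | 2   | 1     |
| ADD   | 9   | 7   | 5    | 8   | 5   | 8   | 3     |
| LOAD  | 5   | 8   | 6    | 4   | 3   | 7   | 1     |
| SUB   | 4   | 8   | 4    | 7   | 4   | 8   | 3     |
| JMP   | 5   | 3   | 8    | 5   | 3   | 8   | 6     |
| CMP   | 3   | 2   | 5    | 6   | 3   | 7   | 3     |
| OTHER | 9   | 6   | 0    | 2   | 5   | 9   | 7     |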
Since we swapped the key entries corresponding to MOV and ADD, to achieve the same effect on the $D$ matrix we need to swap the rows and the columns corresponding to MOV and ADD. After swapping, we get the new $D$ matrix shown in Table 8.
Table 8: D Matrix after Swapping

|       | MOV | ADD | LOAD | SUB | JMP | CMP | OTHER |
| ----- | --- | --- | ---- | --- | --- | --- | ----- |
| MOV   | 7   | 9   | 5    | 8   | 5   | 8   | 3     |
| ADD   | 7   | 9   | 8    | 6   | 3   | 2   | 1     |
| LOAD  | 8   | 5   | 6    | 4   | 3   | 7   | 1     |
| SUB   | 8   | 4   | 4    | 7   | 4   | 8   | 3     |
| JMP   | 3   | 5   | 8    | 5   | 3   | 8   | 6     |
| CMP   | 2   | 3   | 5    | 6   | 3   | 7   | 3     |
| OTHER | 6   | 9   | 0    | 2   | 5   | 9   | 7     |

Thereafter, the score is calculated by taking the Manhattan distance between the newly created $D$ matrix and the $E$ matrix. If the new score is greater than the previous score, we discard the swap and move on to the next one; in this example, the next swap, between the second and third key entries, is the one for ADD and LOAD. If the new score is less than the previously calculated score, we retain the swap and restart the algorithm from the beginning, keeping the swapped key as the putative key.
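The row-and-column update of the $D$ matrix can be checked directly against Tables 7 and 8. The sketch below is only an illustration; the function name `swap_rows_cols` is hypothetical, and the matrix is stored as a plain list of lists in the row and column order MOV, ADD, LOAD, SUB, JMP, CMP, OTHER.

```python
def swap_rows_cols(matrix, i, j):
    """Return a copy of `matrix` with rows i and j, then columns i and j, swapped."""
    m = [row[:] for row in matrix]   # copy so the original matrix is untouched
    m[i], m[j] = m[j], m[i]          # swap the two rows
    for row in m:                    # swap the two columns in every row
        row[i], row[j] = row[j], row[i]
    return m


table7 = [
    [9, 7, 8, 6, 3, 2, 1],  # MOV
    [9, 7, 5, 8, 5, 8, 3],  # ADD
    [5, 8, 6, 4, 3, 7, 1],  # LOAD
    [4, 8, 4, 7, 4, 8, 3],  # SUB
    [5, 3, 8, 5, 3, 8, 6],  # JMP
    [3, 2, 5, 6, 3, 7, 3],  # CMP
    [9, 6, 0, 2, 5, 9, 7],  # OTHER
]

table8 = swap_rows_cols(table7, 0, 1)   # swap MOV (index 0) and ADD (index 1)
print(table8[0])   # [7, 9, 5, 8, 5, 8, 3]
print(table8[1])   # [7, 9, 8, 6, 3, 2, 1]
```

Updating the matrix this way avoids re-substituting the whole opcode sequence and recounting digrams after every swap.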
In the first round, adjacent elements in the key are swapped. In the next round, the elements at positions $i$ and $i+2$ in the key are swapped. In successive rounds, the distance between the elements to be swapped is increased by one. For a key of $n$ elements, this swapping scheme can be visualized as
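$$
\begin{array}{llllllll}
\text{round } 1\text{:} & k_1|k_2 & k_2|k_3 & k_3|k_4 & \ldots & k_{n-3}|k_{n-2} & k_{n-2}|k_{n-1} & k_{n-1}|k_n \\
\text{round } 2\text{:} & k_1|k_3 & k_2|k_4 & k_3|k_5 & \ldots & k_{n-3}|k_{n-1} & k_{n-2}|k_n & \\
\text{round } 3\text{:} & k_1|k_4 & k_2|k_5 & k_3|k_6 & \ldots & k_{n-3}|k_n & & \\
\vdots & \vdots & \vdots & \vdots & & & & \\
\text{round } n-3\text{:} & k_1|k_{n-2} & k_2|k_{n-1} & k_3|k_n & & & & \\
\text{round } n-2\text{:} & k_1|k_{n-1} & k_2|k_n & & & & & \\
\text{round } n-1\text{:} & k_1|k_n & & & & & &
\end{array}
$$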
The smallest score achieved in this process is the similarity score. To classify a file, this score is compared against a predefined threshold. If the score is less than the threshold, the file under consideration is classified as belonging to the given malware family.
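Putting the pieces together, the search for the smallest score might look like the sketch below. It is an illustration under stated assumptions rather than the exact implementation: the function names, the 2 x 2 toy matrices and the threshold value are all hypothetical, and a swap that leaves the score unchanged is treated as not an improvement, a case the text does not specify.

```python
def manhattan(d, e):
    """Sum of absolute element-wise differences between two matrices."""
    return sum(abs(x - y) for dr, er in zip(d, e) for x, y in zip(dr, er))


def swap_rows_cols(matrix, i, j):
    """Return a copy of `matrix` with rows i and j, then columns i and j, swapped."""
    m = [row[:] for row in matrix]
    m[i], m[j] = m[j], m[i]
    for row in m:
        row[i], row[j] = row[j], row[i]
    return m


def smallest_score(d, e):
    """Hill-climb over key swaps and return the smallest score reached."""
    n = len(d)
    best = manhattan(d, e)
    a, b = 0, 1                    # a: position, b: round (swap distance)
    while b < n:
        candidate = swap_rows_cols(d, a, a + b)
        s = manhattan(candidate, e)
        if s < best:               # improvement: keep the swap and restart
            d, best = candidate, s
            a, b = 0, 1
        else:                      # no improvement: try the next swap
            a += 1
            if a + b >= n:         # end of this round: widen the distance
                a, b = 0, b + 1
    return best


E = [[0, 2], [1, 0]]               # toy matrices, not real opcode data
D = [[0, 1], [2, 0]]
THRESHOLD = 1                      # hypothetical threshold value
result = smallest_score(D, E)
print(result)                      # 0
print(result < THRESHOLD)          # True: classified as the same family
```

The loop terminates because the score is a non-negative integer that strictly decreases on every restart.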