Software Fingerprinting for Automated Malicious Code Analysis

(1)

Philippe Charland

Mission Critical Cyber Security Section

October 25, 2012

Terms of Release: This document is approved for release to Defence Departments, Defence Contractors, and Governments. Any further distribution requires the prior approval of the Defence R&D Canada – Valcartier Document Review Panel

(2)

Outline

• Software Reverse Engineering

• Motivation

• Research Objective

• Prototypes

• Future Work

(3)

Why Software Reverse Engineering?

• To develop a solid understanding of software for which there is no

– Documentation

– Source code

• Malicious software (malware) falls into this category…

(4)

Malware Figures

• SophosLabs

– Analyzed 95,000 malware pieces every day in 2010

• Panda Security

– 26 million new malware samples were identified in 2011

– 73,000 strains per day

(5)

Targeted Attacks

(6)

Software Reverse Engineering Process

• IDA Pro

• OllyDbg

• Unpacking

• Debugger

• Experience based

• Newbies have trouble with this

Increase in Complexity

Deobfuscation / Software Dearmoring

Disassembly / Code‐level Analysis

Relevant and Interesting Feature Identification

(7)

Assembly Code Analysis

• Most nebulous portion of the process

• Largely depends on intuition and experience

• Looking at assembly is tedious

• Not seeing the forest for the trees

• Analyst fatigue – High level of attention required

(8)

Assembly Code Analysis

• Question:

lea eax, DWORD PTR [edx+edx]

add eax, eax

add eax, eax

add eax, eax

add eax, eax

(9)

Assembly Code Analysis

• Question:

lea eax, DWORD PTR [edx+edx]

add eax, eax

add eax, eax

add eax, eax

add eax, eax

• Answer:

y = x * 32
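The arithmetic behind this answer can be checked by emulating the five instructions on plain integers. This is a minimal sketch: the function name `emulate` is hypothetical, x is taken to be the value in EDX and y the result in EAX, and 32-bit wraparound is modelled with a mask.

```python
MASK32 = 0xFFFFFFFF  # registers are 32 bits wide


def emulate(edx: int) -> int:
    """Emulate the lea/add sequence from the slide and return EAX."""
    eax = (edx + edx) & MASK32      # lea eax, [edx+edx]  -> 2 * x
    for _ in range(4):              # four times: add eax, eax
        eax = (eax + eax) & MASK32  # each add doubles EAX
    return eax                      # 2 * 2**4 = 32 times x (mod 2**32)


if __name__ == "__main__":
    for x in (0, 1, 7, 1000):
        print(x, emulate(x), x * 32)
```

The `lea` doubles x once and each of the four `add` instructions doubles it again, giving 2^5 = 32.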

(10)

Assembly Code Analysis

• Doing everything manually is unsustainable...

• Throwing more reverse engineers at the problem is not possible...

(11)

Assembly Code Analysis

• Automate some of the assembly code analysis process!

(12)

Motivation

• Malware authors

– Develop huge numbers of variants to bypass antivirus

– Exchange source code among themselves

– Reuse open source code

• Reverse engineers

– Leverage the code reuse in malware

– Reduce redundant analysis efforts

– Accelerate the reverse engineering process

(13)

Research Objective

• Automatically identify code fragments that reuse

1. Open source code

2. Previously analyzed assembly code

(14)

Assembly and Source Code Matching

• RE‐Google – regoogle.carnivore.it

– IDA Pro plug‐in

• Python script

– Enumerate all functions and extract

• Strings

• Constants

• Imported function names

– Perform a Google Code Search

– Add top results as function comments
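The feature extraction step can be sketched on a plain-text listing. This is not the RE‐Google script itself: it uses no IDA Pro or search API, and the sample listing, the regular expressions, the `min_const` cut-off and the function names are all assumptions made for illustration.

```python
import re

# Hypothetical disassembly listing for one function (text only, no IDA API).
SAMPLE = [
    'push offset aMsg ; "digest mismatch"',
    "mov eax, 6A09E667h",
    "xor eax, 0BB67AE85h",
    "call ds:CryptAcquireContextA",
    "add esp, 4",
]

STRING_RE = re.compile(r'"([^"]+)"')                # quoted string in a comment
CONST_RE = re.compile(r"\b([0-9][0-9A-Fa-f]*)h\b")  # hex immediates such as 0D0h
IMPORT_RE = re.compile(r"\bcall\s+ds:(\w+)")        # calls through the import table


def extract_features(lines, min_const=0x10000):
    """Collect strings, large constants and imported names from a listing."""
    strings, constants, imports = [], [], []
    for line in lines:
        strings += STRING_RE.findall(line)
        imports += IMPORT_RE.findall(line)
        for text in CONST_RE.findall(line):
            value = int(text, 16)
            if value >= min_const:  # small constants are too common to be useful
                constants.append(f"0x{value:x}")
    return {"strings": strings, "constants": constants, "imports": imports}


def build_query(features):
    """Join the features into a single search string."""
    quoted = [f'"{s}"' for s in features["strings"]]
    return " ".join(quoted + features["constants"] + features["imports"])


if __name__ == "__main__":
    print(build_query(extract_features(SAMPLE)))
```

Distinctive constants tend to be the most selective feature, which is presumably why the deck shows a separate results slide for them.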

(15)

Constants Results

(16)

(17)

RE‐Google

• Relies on the Google Code Search API

– Shut down on January 15, 2012...

• Look for alternatives...

(18)

Google Code Search Alternatives

• As suggested on the Google Code Search Group:

– www.antepedia.com

– www.grepcode.com

– www.koders.com

– www.krugle.org

– ...

(19)

Koders

• Merging with Ohloh (code.ohloh.net)

• Index and search 10+ billion lines of code (BLOCs), 3x the amount indexed by Koders

• Support 43 programming languages

(20)

Koders

(21)

Koders

(22)

Koders

(23)

Koders – Search for SHA‐512

(24)

(25)

RE‐Source IDA Pro Plug‐in

• Based on the original RE‐Google Python script

Workflow (diagram): Assembly File → RE‐Source (Extract Features → Build Query) → HTML Page → RE‐Source (Parse Results Page → Comment Functions)
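The last two steps, parsing the results page and turning the top hits into a function comment, might look roughly like the sketch below. The markup it expects (links carrying `class="result"`) is purely hypothetical, as are the class and function names; a real results page would need its own selectors.

```python
from html.parser import HTMLParser


class ResultParser(HTMLParser):
    """Collect (link text, href) pairs from <a class="result"> elements."""

    def __init__(self):
        super().__init__()
        self.results = []
        self._href = None  # href of the result link currently open, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("class") == "result":
            self._href = attrs.get("href", "")

    def handle_data(self, data):
        if self._href is not None and data.strip():
            self.results.append((data.strip(), self._href))
            self._href = None

    def handle_endtag(self, tag):
        if tag == "a":
            self._href = None


def top_results_as_comment(html, limit=3):
    """Return the first `limit` results as a multi-line comment string."""
    parser = ResultParser()
    parser.feed(html)
    parser.close()
    return "\n".join(f"{name} <{url}>" for name, url in parser.results[:limit])
```

The returned string is what a plug-in would attach to the function as a comment; the attachment itself depends on the disassembler API and is left out here.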

(26)

RE‐Source – Precise Calculator Case study

(27)

Precise Calculator

• Open source programmable scientific calculator

• Has more than 150 mathematical and statistical functions

• Written in C++ and assembly

– 9 .h files

– 7 .cpp files

– 2 assembly files

(28)

RE‐Source – Precise Calculator Case study

• Disassembled executable contains 533 functions

• Features extracted for 67 functions

• Identified 5 of the 7 .cpp files with 100% accuracy

– 70% of the original source code

• Detected functions

– Mathematical, geometrical and statistical

– Parsing, editing, GUI

(29)

RE‐Source – Precise Calculator Case study

(30)

RE‐Source – Precise Calculator Case study

(31)

Clone Detection

• Technique to identify duplicate code fragments in a code base

• Most algorithms operate on source code

– Decrease code size by consolidating it

– Facilitate program comprehension and software maintenance

• Commercial off‐the‐shelf software

– Copyright infringements

– Plagiarism detection

(32)

Clone Detection vs. Clone Search

• Clone Detection

– Identify all the similar code fragments within a code base

– Compare every code fragment pair

• Clone Search

– Identify only the code fragments similar to a target fragment
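The difference in cost between the two tasks can be made concrete with a small sketch. The Jaccard measure over token sets, the threshold value and the function names are assumptions for illustration and are not taken from the prototype.

```python
from itertools import combinations


def similarity(a, b):
    """Jaccard similarity of two fragments given as token lists (assumed measure)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0


def detect_clones(fragments, threshold=0.8):
    """Clone detection: compare every fragment pair, n*(n-1)/2 comparisons."""
    return [
        (i, j)
        for (i, fa), (j, fb) in combinations(enumerate(fragments), 2)
        if similarity(fa, fb) >= threshold
    ]


def search_clones(target, fragments, threshold=0.8):
    """Clone search: compare one target against each fragment, n comparisons."""
    return [i for i, f in enumerate(fragments) if similarity(target, f) >= threshold]
```

Detection returns pairs of indices into the code base, while search returns only the fragments that resemble the one target of interest.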

(33)

Clone Types

• Syntactic Clones

– Textual similarity

• Type I, II, III clones

• Semantic Clones

– Functional similarity

• Type IV clones

(34)

Syntactic Clones – Type I

• Identical code fragments except for variations in whitespace, layout and comments

push eax ; Memory
call ds:_aligned_free
and dword ptr [esi], 0
pop ecx

push eax
call ds:_aligned_free
and dword ptr [esi], 0
pop ecx

(35)

Syntactic Clones – Type II

• Structurally/syntactically identical fragments except for variations in identifiers, literals, types, layout and comments

push edi ; Size
call _malloc
mov edx, eax
mov ecx, edi
mov [esp+24h+var_C], edx
mov edi, edx
mov edx, ecx
xor eax, eax
shr ecx, 2
rep stosd
mov ecx, edx
add esp, 4
and ecx, 3
rep stosb
mov eax, [esp+20h+var_C]
test eax, eax
jnz loc_10001A97
mov eax, [ebx]
push eax

push edi ; Size
call _malloc
mov edx, eax
mov ecx, edi
mov [esp+20h+InBuffer], edx
mov edi, edx
mov edx, ecx
xor eax, eax
shr ecx, 2
rep stosd
mov ecx, edx
add esp, 4
and ecx, 3
rep stosb
mov eax, [esp+1Ch+InBuffer]
test eax, eax
jnz loc_10001493
mov eax, [ebx]
push eax

(36)

Syntactic Clones – Type III

• Copied fragments with further modifications

• Statements can be changed, added or removed in addition to variations in identifiers, literals, types, layout and comments

mov esi, [ebp+arg_0]
mov edx, [esi+214h]
mov edi, [esi+220h]
mov [ebp+var_4], edx
cmp [esi+21Ch], edi
jl short loc_76641044
lea ebx, [edx+edi*8]

mov esi, [ebp+arg_0]
mov edx, [esi+214h]
mov [ebp+var_4], edx
mov edi, [esi+220h]
cmp [esi+21Ch], edi
jl short loc_76641044
lea ebx, [edx+edi*8]

(37)

Semantic Clones – Type IV

• Two or more code fragments that perform the same computation implemented through different syntactic variants

strlen1 proc near
arg_0 = dword ptr 4
mov eax, [esp+arg_0]
loc_401004:
cmp byte ptr [eax], 0
jz short done
inc eax
jmp short loc_401004
done:
sub eax, [esp+arg_0]
retn
strlen1 endp

strlen3 proc near
arg_0 = dword ptr 4
push edi
mov edi, [esp+4+arg_0]
xor ecx, ecx
not ecx
xor al, al
cld
repne scasb
not ecx
lea eax, [ecx-1]
pop edi
retn
strlen3 endp

(38)

Clone Detector – Overview

• Extended from A. Saebjornsen, et al. (2009), University of California, Davis

• Components shown in the overview diagram

– Binary Files

– Disassembler

– Assembly Files

– Regionizer

– Normalizer

– Token Indexer

– Exact Clone Detector

– Inexact Clone Detector

– Duplicate Clone Merger

– Maximal Clone Merger

– XML File

– Visualizer

(39)

Clone Detector – Regionizer

sub_402D5F proc near ; CODE XREF: sub_402FC1+12p
mov edi, edi
push esi
push edi
mov edi, ecx
lea esi, [edi+0D0h]
mov eax, [esi]
test eax, eax
jz short loc_402D7C
push eax ; Memory
call ds:_aligned_free
and dword ptr [esi], 0
pop ecx
loc_402D7C: ; CODE XREF: sub_402D5F+10j
and dword ptr [edi+0D4h], 0
push 90h ; Size
push 0 ; Val
add edi, 40h
push edi ; Dst
call memset
add esp, 0Ch
pop edi
pop esi
retn
sub_402D5F endp

(40)

Clone Detector – Regionizer

sub_402D5F proc near
mov edi, edi
push esi
push edi
mov edi, ecx
lea esi, [edi+0D0h]
mov eax, [esi]
test eax, eax
jz short loc_402D7C
push eax
call ds:_aligned_free
and dword ptr [esi], 0
pop ecx
and dword ptr [edi+0D4h], 0
push 90h
push 0
add edi, 40h
push edi
call memset
add esp, 0Ch
pop edi
pop esi
retn
sub_402D5F endp

Window Size = 10 instructions

Step Size = 4 instructions

Regions marked on the listing: Region 0, Region 1, Region 2, Region 3, Region 4, Region 5
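The regionizer can be read as a sliding window over the instruction list. The sketch below assumes that shorter windows at the end of a function are kept; the slides do not state this, although it would account for six regions over the 22 instructions of sub_402D5F. The function name and the `keep_partial` flag are hypothetical.

```python
def regionize(instructions, window=10, step=4, keep_partial=True):
    """Split a function into overlapping regions of `window` instructions.

    A new region starts every `step` instructions. Returns a list of
    (start_index, instructions_in_region) tuples.
    """
    regions = []
    for start in range(0, len(instructions), step):
        region = instructions[start:start + window]
        if len(region) < window and not keep_partial:
            break  # every later window would be short as well
        regions.append((start, region))
    return regions


if __name__ == "__main__":
    listing = [f"insn_{n}" for n in range(22)]  # stand-in for sub_402D5F
    for index, (start, region) in enumerate(regionize(listing)):
        print(f"Region {index}: instructions {start}..{start + len(region) - 1}")
```

A step smaller than the window makes consecutive regions overlap, so a clone that straddles a region boundary is still fully covered by at least one neighbouring window.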

(41)

Clone Detector – Normalization

• Registers, constants and memory addresses are normalized

• Constants

– VAL or VALx, where x is an index number

• Memory addresses

– MEM or MEMx, where x is an index number

• Registers

– Different normalization levels are available

(42)

Clone Detector – Normalization

• REG

– EAX → REG

– CS → REG

– EDI → REG

• REGSeg, REGGen, REGIdxPtr

– EAX → REGGen

– CS → REGSeg

– EDI → REGIdxPtr

• REGGen8, REGGen16, REGGen32

– EAX → REGGen32

– AX → REGGen16

– AH → REGGen8

• REGx

– EAX → REG#0

– AX → REG#1

– AH → REG#2

Level hierarchy (diagram): REG; REGSeg, REGGen, REGIdxPtr; REGGen8, REGGen16, REGGen32; REGx
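The normalization levels can be sketched as a lookup on the register name. The level names `"REG"`, `"TYPE"`, `"SIZE"` and `"IDX"`, the register sets and the function name are hypothetical; only the output tokens come from the slide. How the prototype treats index and pointer registers at the size level is not shown, so they are left as REGIdxPtr here.

```python
SEGMENT = {"cs", "ds", "es", "fs", "gs", "ss"}
INDEX_POINTER = {"esi", "edi", "ebp", "esp", "si", "di", "bp", "sp"}
GENERAL = {  # general-purpose registers and their width in bits
    "eax": 32, "ebx": 32, "ecx": 32, "edx": 32,
    "ax": 16, "bx": 16, "cx": 16, "dx": 16,
    "al": 8, "ah": 8, "bl": 8, "bh": 8,
    "cl": 8, "ch": 8, "dl": 8, "dh": 8,
}


def normalize_register(reg, level, index=None):
    """Map a register name to its token at the given normalization level.

    level "REG":  every register becomes REG
    level "TYPE": REGSeg / REGGen / REGIdxPtr
    level "SIZE": as TYPE, but general registers carry their width
    level "IDX":  REG#n, numbered by first appearance (pass a shared dict)
    """
    reg = reg.lower()
    if level == "REG":
        return "REG"
    if level == "IDX":
        return f"REG#{index.setdefault(reg, len(index))}"
    if reg in SEGMENT:
        return "REGSeg"
    if reg in INDEX_POINTER:
        return "REGIdxPtr"
    if reg not in GENERAL:
        raise ValueError(f"unknown register: {reg}")
    return "REGGen" if level == "TYPE" else f"REGGen{GENERAL[reg]}"
```

A coarser level makes more regions hash to the same value, which should raise recall at the cost of precision; a finer level does the opposite.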

(43)

Clone Detector – Normalization

• Assembly code

mov edi, edi
push ebp
mov ebp, esp
mov eax, dword ptr [ebp+8]

• Normalized assembly code

mov REG, REG
push REG
mov REG, REG
mov REG, MEM
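A whole-instruction normalizer at the coarsest level (REG, MEM, VAL) might look like the sketch below. The operand classification rules and the function names are assumptions; in particular, any operand containing a bracket is treated as a memory reference, and symbolic operands such as labels are left unchanged.

```python
import re

REGISTERS = {"eax", "ebx", "ecx", "edx", "esi", "edi", "ebp", "esp"}
CONSTANT_RE = re.compile(r"^(?:[0-9][0-9a-f]*h|[0-9]+)$", re.IGNORECASE)


def normalize_operand(operand):
    """Replace one operand by REG, MEM or VAL; leave anything else as is."""
    operand = operand.strip().lower()
    if "[" in operand:
        return "MEM"          # e.g. dword ptr [ebp+8]
    if operand in REGISTERS:
        return "REG"
    if CONSTANT_RE.match(operand):
        return "VAL"          # e.g. 90h or 0
    return operand            # labels and symbols are not handled here


def normalize_instruction(line):
    """Normalize one disassembly line, dropping any trailing comment."""
    code = line.split(";", 1)[0].strip()
    mnemonic, _, rest = code.partition(" ")
    if not rest.strip():
        return mnemonic
    operands = [normalize_operand(op) for op in rest.split(",")]
    return f"{mnemonic} " + ", ".join(operands)
```

For example, `normalize_instruction("mov eax, dword ptr [ebp+8]")` yields `mov REG, MEM`, which is the form shown on the slide.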

(44)

Clone Detector – Exact Clones

• Compare statements between regions

• Two regions are considered an exact clone if all their normalized statements are identical (i.e. same hash value)
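Grouping regions by the hash of their normalized statements can be sketched as follows. The choice of SHA-1 and the function names are assumptions; the slides only say that exact clones share a hash value.

```python
import hashlib
from collections import defaultdict


def region_hash(normalized_statements):
    """Hash the normalized statements of one region (hash function is assumed)."""
    joined = "\n".join(normalized_statements).encode("utf-8")
    return hashlib.sha1(joined).hexdigest()


def exact_clone_clusters(regions):
    """Group region ids whose normalized statements hash to the same key.

    `regions` maps a region id to its list of normalized statements.
    Only clusters with at least two members are returned.
    """
    table = defaultdict(list)
    for region_id, statements in regions.items():
        table[region_hash(statements)].append(region_id)
    return [ids for ids in table.values() if len(ids) > 1]
```

Because every region is hashed once and looked up in a table, no pairwise comparison is needed to find the exact clones.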

(45)

Clone Detector – Exact Clones

sub_402D5F proc near
mov REGGen32, REGGen32
push REGIdxPtr
push REGIdxPtr
mov REGIdxPtr, REGGen32
lea REGIdxPtr, VAL
mov REGGen32, MEM
test REGGen32, REGGen32
jz short VAL
push REGGen16
...
retn
sub_402D5F endp

sub_579AEG proc near
...
mov REGIdxPtr, REGGen32
lea REGIdxPtr, VAL
mov REGGen32, MEM
test REGGen32, REGGen32
jz short VAL
push REGGen16
...
retn
sub_579AEG endp

Hash table (Hash Key → Clone Cluster), keys shown: 4464394, 6486468, 1561898

sub_579AEG proc near
mov REGIdxPtr, REGGen32
lea REGIdxPtr, VAL
mov REGGen32, MEM
test REGGen32, REGGen32
push REGGen16
mov REGIdxPtr, REGGen32
lea REGIdxPtr, VAL
...
retn
sub_579AEG endp

(46)

Clone Detector – Exact Clones

(47)

Clone Detector – Exact Clones

(48)

Clone Detector – Exact Clones

(49)

Clone Detector – Inexact Clones

• Compute a feature vector for each region

• Feature vectors are constructed based on

– Mnemonics of instructions

– Types of operands in instructions

– Combination of mnemonic and operands

– …

• Two regions are considered an inexact clone if the similarity between their feature vectors meets a minimum similarity threshold
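One way to realize this is to count the three kinds of features per region and compare the resulting vectors. Cosine similarity, the threshold of 0.8 and the function names are assumptions; the slides do not name the similarity measure. The input is taken to be already normalized statements such as `mov REG, MEM`.

```python
import math
from collections import Counter


def feature_vector(region):
    """Count mnemonics, operand types and mnemonic/operand combinations."""
    features = Counter()
    for line in region:
        mnemonic, _, rest = line.partition(" ")
        operands = [op.strip() for op in rest.split(",") if op.strip()]
        features[f"mnem:{mnemonic}"] += 1
        for operand in operands:
            features[f"type:{operand}"] += 1
        features[f"combo:{mnemonic} {','.join(operands)}"] += 1
    return features


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def is_inexact_clone(region_a, region_b, threshold=0.8):
    """True if the two regions are at least `threshold` similar."""
    similarity = cosine_similarity(feature_vector(region_a), feature_vector(region_b))
    return similarity >= threshold
```

Because the vectors ignore statement order, two regions that differ only by a reordered instruction, as in the Type III example, receive the same vector and therefore maximal similarity.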

(50)

Clone Detector – Inexact Clones

(51)

Clone Detector – Inexact Clones

(52)

Clone Detector – Inexact Clones

(53)

Clone Detector – Duplicate Clone Merger

• Remove clones that are highly overlapping regions in the same function

push edi
call ds:GetTickCount
sub eax, dword_1000D22C
push eax
lea eax, [esp+0Ch]
push offset a9lu
push eax
call _sprintf
lea ecx, [esp+14h]
lea edx, [esp+24h]
push ecx
push offset Dest
push offset a8sS
push edx
call _sprintf
mov eax, dword_1000D218
mov ecx, dword_1000A044
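Removing highly overlapping regions can be sketched as a filter over (function, start, end) triples. The slides do not define "highly overlapping", so the 50% cut-off, the overlap measure and the function names below are assumptions.

```python
def overlap_ratio(a, b):
    """Shared instructions divided by the length of the shorter range.

    Ranges are (start, end) instruction indices, both inclusive.
    """
    shared = min(a[1], b[1]) - max(a[0], b[0]) + 1
    shorter = min(a[1] - a[0], b[1] - b[0]) + 1
    return max(shared, 0) / shorter


def remove_duplicates(clones, max_overlap=0.5):
    """Drop regions that overlap an already kept region of the same function.

    `clones` is a list of (function, start, end) triples; the earliest
    region of each overlapping group is the one that is kept.
    """
    kept = []
    for func, start, end in sorted(clones):
        duplicate = any(
            f == func and overlap_ratio((s, e), (start, end)) > max_overlap
            for f, s, e in kept
        )
        if not duplicate:
            kept.append((func, start, end))
    return kept
```

With a window of 10 and a step of 4, neighbouring regions share 6 of 10 instructions, so under this cut-off every second region of a cloned run is dropped.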

(54)

Clone Detector – Maximal Clone Merger

• Merge consecutive cloned regions

push edi
call ds:GetTickCount
sub eax, dword_1000D22C
push eax
lea eax, [esp+0Ch]
push offset a9lu
push eax
call _sprintf
lea ecx, [esp+14h]
lea edx, [esp+24h]
push ecx
push offset Dest
push offset a8sS
push edx
call _sprintf
mov eax, dword_1000D218
mov ecx, dword_1000A044
lea edi, [esp+34h]
add esp, 1Ch
lea edx, [ecx+eax+0Ah]

push edi
call ds:GetTickCount
sub eax, dword_1000D22C
push eax
lea eax, [esp+0Ch]
push offset a9lu
push eax
call _sprintf
lea ecx, [esp+14h]
lea edx, [esp+24h]
push ecx
push offset Dest
push offset a8sS
push edx
call _sprintf
mov eax, dword_1000D218
mov ecx, dword_1000A044
lea edi, [esp+34h]
add esp, 1Ch
lea edx, [ecx+eax+0Ah]
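Merging consecutive cloned regions is essentially an interval merge. The sketch below works on the regions of a single function only and treats adjacent or overlapping ranges as consecutive; a full implementation would presumably also have to check that the matching regions on the other side of each clone pair are consecutive too. The function name is hypothetical.

```python
def merge_consecutive(regions):
    """Merge overlapping or adjacent (start, end) ranges, both ends inclusive."""
    merged = []
    for start, end in sorted(regions):
        if merged and start <= merged[-1][1] + 1:
            # Extends the previous region: grow it instead of adding a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged
```

For instance, the window regions (0, 9), (4, 13) and (8, 17) collapse into the single maximal clone (0, 17).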

(55)

Clone Detector – Clone Search

(56)

Clone Detector – Clone Search

(57)

Case Study

• 18 open source dynamic link libraries (DLLs)

– Recall and precision consistently above 80%

• Zeus and Blaster malware

– Precision over 96%

– Efficiency is not sensitive to the window size
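For reference, precision and recall over a set of reported clones can be computed as below. The function name and the toy data in the usage note are illustrative only and are unrelated to the case study figures.

```python
def precision_recall(reported, actual):
    """Return (precision, recall) for reported clones against the true clones.

    precision = true positives / all reported clones
    recall    = true positives / all actual clones
    """
    reported, actual = set(reported), set(actual)
    true_positives = len(reported & actual)
    precision = true_positives / len(reported) if reported else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall
```

If 5 clones are reported, 4 of them are genuine and 8 genuine clones exist in total, precision is 4/5 and recall is 4/8.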

(58)

Future Work

• Proof‐of‐concept prototypes

• RE‐Source

– Automatically identify a larger proportion of libraries

• Clone Detector

– Improve the precision and recall of inexact clone detection

– Support semantic clones

• Conduct additional case studies

(59)

Questions

philippe.charland@drdc‐rddc.gc.ca
