Asm2Vec: Boosting Static Representation Robustness for Binary Clone Search against Code Obfuscation and Compiler Optimization

(1)

Asm2Vec: Boosting Static Representation Robustness for Binary Clone Search against Code Obfuscation and Compiler Optimization

Steven H. H. Ding^∗, Benjamin C. M. Fung^∗, and Philippe Charland^†

∗Data Mining and Security Lab, School of Information Studies, McGill University, Montreal, Canada.

Emails: [email protected], [email protected]

†Mission Critical Cyber Security Section, Defence R&D Canada - Valcartier, Quebec, QC, Canada.

Abstract—Reverse engineering is a manually intensive but necessary technique for understanding the inner workings of new malware, finding vulnerabilities in existing systems, and detecting patent infringements in released software. An assembly clone search engine facilitates the work of reverse engineers by identifying those duplicated or known parts.

However, it is challenging to design a robust clone search engine, since there exist various compiler optimization options and code obfuscation techniques that make logically similar assembly functions appear to be very different.

A practical clone search engine relies on a robust vector representation of assembly code. However, the existing clone search approaches, which rely on a manual feature engineering process to form a feature vector for an assembly function, fail to consider the relationships between features and identify those unique patterns that can statistically distinguish assembly functions. To address this problem, we propose to jointly learn the lexical semantic relationships and the vector representation of assembly functions based on assembly code. We have devel- oped an assembly code representation learning model Asm2Vec.

It only needs assembly code as input and does not require any prior knowledge such as the correct mapping between assembly functions. It can find and incorporate rich semantic relationships among tokens appearing in assembly code. We conduct extensive experiments and benchmark the learning model with state-of-the-art static and dynamic clone search approaches. We show that the learned representation is more robust and significantly outperforms existing methods against changes introduced by obfuscation and optimizations.

1. Introduction

Software developments mostly do not start from scratch.

Due to the prevalent and commonly uncontrolled reuse of source code in the software development process [1], [2], [3], there exist a large number of clones in the underlying assembly code as well. An effective assembly clone search engine can significantly reduce the burden of the manual analysis process involved in reverse engineering. It addresses the information needs of a reverse engineer by taking ad- vantage of existing massive binary data.

Assembly code clone search is emerging as an Infor- mation Retrieval (IR) technique that helps address security- related problems. It has been used for differing binaries to locate the changed parts [4], identifying known library functions such as encryption [5], searching for known program-

ming bugs or zero-day vulnerabilities in existing software or Internet of Things (IoT) devices firmware [6], [7], as well as detecting software plagiarism or GNU license infringements when the source code is unavailable [8], [9]. However, designing an effective search engine is difficult, due to vari- eties of compiler optimizations and obfuscation techniques that make logically similar assembly functions appear to be dramatically different. Figure 1 shows an example. The optimized or obfuscated assembly function breaks control flow and basic block integrity. It is challenging to identify these semantically similar, but structurally and syntactically different assembly functions as clones.

Developing a clone search solution requires a robust vector representation of assembly code, by which one can measure the similarity between a query and the indexed functions. Based on the manually engineered features, rel- evant studies can be categorized into static or dynamic approaches. Dynamic approaches model the semantic similarity by dynamically analyzing the I/O behavior of assembly code [10], [11], [12], [13]. Static approaches model the similarity between assembly code by looking for their static differences with respect to the syntax or descriptive statistics [6], [7], [8], [14], [15], [16], [17], [18]. Static approaches are more scalable and provide better coverage than the dynamic approaches. Dynamic approaches are more robust against changes in syntax but less scalable. We identify two problems which can be mitigated to boost the semantic richness and robustness of static features. We show that by considering these two factors, a static approach can even achieve better performance than the state-of-the-art dynamic approaches.

P1: Existing state-of-the-art static approaches fail to consider the relationships among features. LSH-S [16], n- gram [8], n-perm [8], BinClone [15] and Kam1n0 [17]

model assembly code fragments as frequency values of operations and categorized operands. Tracelet [14] models assembly code as the editing distance between instruction sequences. Discovre [7] and Genius [6] construct descriptive features, such as the ratio of arithmetic assembly instructions, the number of transfer instructions, the number of basic blocks, among others. All these approaches assume each feature or category is an independent dimension. How- ever, a xmm0 Streaming SIMD Extensions (SSE) register is related to SSE operations such as movaps. A fclose libc function call is related to other file-related libc calls such as fopen. A strcpy libc call can be replaced with memcpy.

These relationships provide more semantic information than 2019 IEEE Symposium on Security and Privacy

(2)

Figure 1: Different assembly functions compiled from the same source code of gmpz tdiv r 2exp in libgmp. From left to right, the assembly functions are compiled with gcc O0 option, gcc O3 option, LLVM obfuscator Control Flow Graph Flattening option, and LLVM obfuscator Bogus Control Flow Graph option. Asm2Vec can statically identify them as clones.

individual tokens or descriptive statistics.

To address this problem, we propose to incorporate lexical semantic relationship into the feature engineering process. Manually specifying all the potential relationships from prior knowledge of assembly language is time- consuming and infeasible in practice. Instead, we propose to learn these relationships directly from plain assembly code.

Asm2Vecexplores co-occurrence relationships among tokens and discovers rich lexical semantic relationships among tokens (see Figure 2). For example, memcpy, strcpy, memncpy and mempcpy appear to be semantically similar to each other. SSE registers relate to SSE operands. Asm2Vec does not require any prior knowledge in the training process.

P2: The existing static approaches assume that features are equally important [14], [15], [16], [17] or require a mapping of equivalent assembly functions to learn the weights [6], [7]. The chosen weights may not embrace the important patterns and diversity that distinguishes one assembly function from another. An experienced reverse engineer does not identify a known function by equally looking through the whole content or logic, but rather pinpoints critical spots and important patterns that identify a specific function based on past experience in binary analysis. One also does not need mappings of equivalent assembly code.

To solve this problem, we find that it is possible to simulate the way in which an experienced reverse engineer works. Inspired by recent development in representation learning [19], [20], we propose to train a neural network model to read many assembly code data and let the model identify the best representation that distinguishes one function from the rest. In this paper, we make the following contributions:

• We propose a novel approach for assembly clone de- tection. It is the first work that employs representation learning to construct a feature vector for assembly code,

as a way to mitigate problems P1 and P2 in current hand- crafted features. All previous research on assembly clone search requires a manual feature engineering process. The clone search engine is part of an open source platform¹.

• We develop a representation learning model, namely Asm2Vec, for assembly code syntax and control flow graph. The model learns latent lexical semantics between tokens and represents an assembly function as an inter- nally weighted mixture of collective semantics. The learning process does not require any prior knowledge about assembly code, such as compiler optimization settings or the correct mapping between assembly functions. It only needs assembly code functions as inputs.

• We show that Asm2Vec is more resilient to code obfuscation and compiler optimizations than state-of-the-art static features and dynamic approaches. Our experiment covers different configurations of compiler and a strong obfuscator which substitutes instructions, splits basic blocks, adds bogus logics, and completely destroys the original control flow graph. We also conduct a vulnerability search case study on a publicly available vulnerability dataset, where Asm2Vecachieves zero false positives and 100% recalls.

It outperforms a dynamic state-of-the-art vulnerability search method.

Asm2Vecas a static approach cannot completely defeat code obfuscation. However, it is more resilient to code obfuscation than state-of-the-art static features. This paper is organized as follows: Section 2 formally defines the search problem. Section 3 systematically integrates representation learning into a clone search process. Section 4 describes the model. Section 5 presents our experiment. Section 6 discusses the literature. Section 7 discusses the limitations and concludes the paper.

1. https://github.com/McGill-DMaS/Kam1n0-Plugin-IDA-Pro

(3)

dec ah

rax

mfence fsub

al

fstp

rbp

r10b ec ecbh rbx r10d

10 bl10b

fxam

xor

ffs ffffs ff fld

c c cbhhh c c cbbhhhch

rcx

cl rdi

aah dh r12d

ax rdxax

fxch

ds dl

div

r12b

fldcw subsub

faddp

add adc

test

r r10 r1blb r101l r11b r11dr1

shl shr fs

r14b r14d r11 r13d

gs

cmovbe

r13b

fucomi

h hxadd 0d 0 0 13 0 0 0dd 0 13 0 0xchg

ja ra ra rarrrdd raaadd

jb r15br144

r15dr1

jg

jl

tp stp s s fst

cmp

js s jzs

fild

l cl l cl l clddddddd c clll c c cldddddddd c cll l l cpuidl

r1 r14 r11 hhgggr

r1 r1 hhhgr1141 hhggg

bswapr1

xxoo esi gsgs

esp

jecxz

bbbllllll 11bbbbbbbbbbbbbbbbbbbbbbll 11bbbbbbb neg

mfefflffdddfffffffff ttff mfefmulfflffdddfff ttff

rolllror

r1 c c c 1 1 l r1 cl cl cll c c c c c c 1 and1 cp orcp r8

r9

lea r10 r12

10 10 r11 r14 12 1 12 r1 r13444 111333 1113r15

h hhh

ah dhhhaahhh hhhh

ah dhhhaahh cmpxchg

j s

jja jbe

cmovle add subadd

fdiv

rsi

movsx rspr

ence t enc dffsssffff t enc uuuuuuuuuuuuuuufffflllllllllffffffffsssssst ence

t e dfffssffff t e

faddt ss

cmovnb cm callcm

cmovns cmovnz

imul

ffs ffffs ffffs ff ffaddddsss adddd fchs fdivp

ds dss ds j syscall

jjjl jgejjjl nop

not

fnstcw

cmoo cmovg

mov

ovleocmova cmoo cmovb

r151 r8br15

cmovl

r8d

p p p xx mmmo xxppp xxpushpo

r8bbb r8bbb r9b 1 1 10dx00x00xcdd r8d r10d0000x00x00cddh r9dr8d

vp vpfsubrp

fiffld fiffld fistp fdff iv

fmulp

movl mcmovge

hl llllll hl allmovsxd

eax

jjjg jlejjjg

prefetcht2

imul imul cmovs

jjjecxzzzpzzp prefetcht0

allllll movzx cmm cmovz

ebpgs eeessspppgg eg ee bp bpgs

ppp bebx

d dud2

jmp

ebbbebb ebbbebb rbb ecx rb cx

r r dddx rdd bbrbrbrb x rb r r xppp x xb cx

ddxx rrdddd bbxxbbb

xxbpppp x x x x u jnb

or oedi

jnsjjjs tte tt s tte tt s tte jjjjtt stttt jzsss jjjjjz

jnz edxecc

e cm ge

cm tzcnt ecm ecm geeeidiv

ffa ff vffaffffaff ulll v v v v v v llff v v ulffaaa v vfld1

div div div d d d d d nttdvvv nttmul

d incd

sar

p p ddd111p p ddd111 fucomip

rr999bbbr8b r8bbbb sbb

fd ff ivff

v v fd ff ivf

v fldzv

ungetc

htons htonl

openlog rand

strncmp

bind

fileno_unlocked getmntent

ctime

mkdir ac

execvp

getsid fileno

ax sigemptyset

k execlk

fcntl

bp sleep

bpp bx close

connect

getenv

putchar_unlocked

fnmatch

fchmod ntohl

ntohs fscanf

cx memcpy

localtime

di realloc

sprintf

strlen

r11w memcmp

c dxc vfork

_exit

getrlimit64 signal

getsockname

kdir kd kd kd kd k k kchown strerror

dii r10wdi getuid

strstr execcv strptime setbuf

getchar_unlocked setmntent

b bxxxbb bbb bxxxbb r13wbbb getppid ffs

ff caa clearerr

mmap64 popen

semget

r12wax fork

fgetc

select

cfsetispeed pen

ppe p nn ppfopen64

w wr15w ntott

ntott cld e ewrite

fnstsw g g getgid exit

system

b b bb w w w w w000000000www

b b bb bbb bb w ww w w w w w w w w w b w00www

b bb bb w r14ww strtoll

htott ohl ohlll ho oowait

setpriority

len rlenfputc

mk mk exeeeccclllmk exkkk open64lll

ct cttsetsockopt

nlogmmm

n m

getpgrp

getgroups

atoi malloc

setprioppp setprio tcsetattr set

st fputs

memchr

p waitpid

fflush

p 664 psrand

t id getgtt t id getgtt r setgroups

raise te tt t read

geteuid

mem memsetmem

connectnnneect ftruncate64

ge ge regfree

lseek64

fgets_unlocked

sigaddset

utimes

ffc ff n ffc fsyncff n

recv

ungetctt ungetctt putc_unlocked

atott i readdir64

k kk k k kc k k k k chmod vsnprintf getmntent_r

ts tt id ts tt dsemop

64 b 64 64 b 6socket

getstt oc t

ggge c ttcsetpgrp r dupr rec sendtorec

mempcpy

strchr

rmdir t

setttb sett fwrite_unlocked

getctt har

kk h r

ksetrlimit64 g orkgtstt or o tstt getopt

id ggidgetegid ran ran n semctlran

snp unamevsnp

n wngettimeofday

recvfrom

nv strtr

nstrtok_r mount

ep sselecccscc epp ss cscc eaccept

sprin sprint sscanf

slee k646464 ep ea e eppcpp c 4 4 slee k6646 p e ea e eppcpp c dup2

moddd mooddumask

access endmntent

lseee syslog

b 1w b b b 133wbbbbw b b b w b b b rrr w w w w w w 1

1w b 1 wbbbbw b b b w b b b r w w w w w siw ntottoho

tcgetattr ffg

ffff freeffg

ctime ct e cct c meee setutxent

u fdatasyncu fopen

fprintf

llo printftt p

p o

p p epstrftime

ldd s dttttttccctttttt seeettt ld

s d

sttt cfmakeraw

ch opench sscan localtime_rss

stre strpbrkstre regggggg tzset

get gt getutxent

utimim inet_pton

r r upr duuupuurrr sync ffo

ff pe ferror

etstt ge gtt get ge ge tage tt ttttt rrr

ts t t g gtt tabort

vv vvs unaaamvvvvvsvsmm m m closedirm

setutttccc s ctccttcccc opendir

putenv

lseeksys

l h cl cl

h cllreadlink memrchr

sl syn l dd ncccdd l nssss n nccccyyyyyyycc sle asioctldddyyyyyyy

printf

ssseeell ht iiihhss ta tt tttt r t t ll ht h aiiihh ta tt tttt r ettfchdir

getcwd clea

p a p fclose

set w wa w w it set tcflushit

nl nlog nlssemmmsssssss n nlog nsetpgidmcccss

strcasestr strtr rm

st m

execlp __ oop

_ p

_time

a

strchrnula dirnameputctttth strcpy

inet_aton een

fread

getpagesize cv ecc tttorree tt

i cv ecc tttorree ttsetsid

d 1

d rrrr1ddd1 r111 rr111 d 1

d d1 d d1 111 111 r8w anf setmntett

a a

g anf setmntett a aferror_unlocked

yreadd ystrspn

htotto strnlen memcmm

strncasecmp

ffc ff dir asssk ge

ffc ff di asssk ge setenv getmm

pclose

strtt strtoull

r111 1ss 1ss r9w strtol

mstrpbbbs sss msymlinkbbbsss

onl setggtt rrrooog s ooops onl

gooo moooplongjmp

o o kill

sigprocmask

kkkeeddd ssss kkkc

putsssc

ritytt ttctt seeegegggg rmunmapttctt seeegegggggg

strs st strrsrsrrseesssss rename

mem trchr

mem trchr strrchr

opeeeeeenee ope opeeenope en e e opeeen mkfifo ffiffleno unl ffiffleno unl fputs_unlocked

en ssseen isatty

getutxtt egggggg cfsetospeed strcasecmp

ffn ff maa regexec

strcmp

nv b b b b nv

b b strtok

ffo ff rkkkkks ff ffkkkkks

g p

cfsetspeed

error e e ssst e e echdirt

sigaction

g g ffrg ff eeee

printftt g g ffrg ff eeee

printftt getc_unlocked

mcpypp mc mctt

ssstrcpypp m m mttt

ssstrcpypp memmove

sigfillset

d d d d seyee d d d d d e isattssstssssssyee send

vp mall vexecvl ok_rrr oookk ok r ooolink

stpcpyge nt n ll if_nametoindex o

nfdopen

tt t strcspn

bi d b cccketb pp222222222

bi d b cccketb ppp2222222tcgetpgrp

t l to strtoultt l

romsigg ralarm

xdddx r13www x x133w13ww

dddx r1 ww x133w1 w cwde tftttftt

t snprintf fff nffirrrr oprrrpep n p

pe opp irr

snnnp qsort ffr ff eadd fwrite

sestr sfchown

cfsfffs inet_ntoa b f ge

poppenpppppoppepppp penneenn n buf ge

popenpppppoppepppp pennenn n fseeko64

g s ssttttt g

sttttt unlink memm strncpy

renn inet_ntop

ddd dddgetpid

y inet y unsetenvinet cvttss2si

pslld

ucomiss psllq

cs ucomisd

movhps

cvtsi2sd pmovmskb

cvtsi2ss maxsd

2si punpckhbw2s p

pmuludqp

om pminubom

tanh

vpcmpeqq sinfmodtattnan

punpckhdq punpckhqdq

psubb psubd

sqrt atan2 ceil

sqrtss xorpd

jp

cvtstt i2 ckhbwk

i2 ckhbw pcmpgtw pcmpgtd

psrldq

pcmpeqd

punp pcmpeqbpunp

f dttdttatttta cos

ppp sfenceppp

pcmpgtb ceilf

pmommo movdqu

movdqa

ckhdq ckhdqpunpcklbw

vvpcmm movupd pmaxub

qr qr sqrtf

times

sqrtsd 2 ce

2 e

floor

pxor

pcm

pm

punpckldq cvtdq2pd floorf pgtdpgmovupsttmmmovvvpppdpppppd

cvtdtt q2 ccvvtd2 unpckhps minssceii

paddd shufps paddbsqrrrssq fqsstssfqttqqq

minsd jp d jjjp d subsd

ffffccc fccc exp

fflffoo ffflffooo palignr

ld fp ff s ld fp

ff ssubss dq2ppddsdq2ppddpunpckhwdqq2pd22pd2 punpp psllqq

psllq paddq

cvt c i movsst

s log10s

vms ovvskbssskmkbbobbv omovsdv por

t ffm ff ootttttttt fm ff oottttttt fm ff ootttttt fffott 2 tt f tt f tt ffffffff hypot 10 10 pow

sd pc

sdd ppc sandnpd

2ss sssfeffn uuunpccc i2ss sssfeffn uuunpccccc andnps

pd pdmulpd

bw bwvpmovmskb pshufd

ceilf ma cei ceii ma cmpltpsceii

p t d jjjp bsdjjjjjjpjjjp t d jpp bsdjjjjjj subpd

mo anddd mo anddd movaps

cvtsd2ss f orfrrf r o o f o noosubps

uco pmiiuunuuccuccouoooo rtstt d cvttsd2si mo mo movapd ddddddq cccm

dm

qddd q qpsrldddd

ddd d

d pad

ddd ddddd ppa dmulpsdd mulud

pmuludpsrlq psrlw dqadunpcklpd

qucommmov qmmmmmo qdivssucomv

pdmin dddmmi pddivsdn

bbbbbbssss pa m m mulppppppsppaa ssssssmmmov ssssd s sssiissss bbbbb

ad p m m mulppppppsppaa

m m diimov mulss pa orpsppal

ttimes ttimes sysconf

lw lw xmm0 xmm1

cvtttt stt s miss s2spss2222 misspunpcklwd222 rl rl rl xxmm xxmmrlll xmm2m

p p xmm3

m m m mulsdmmuuumm m m m m m puuu m orpdm

pcm pm xmm4

cvt cvttt ppsxxxxxxmmxmm5 bbpxxmmm xmm6 xxxxxxm xmm7xxm

pad pad pad h fff or ddd

h ffd porpslldqdddddd

nsdsq pd

inn md miinsdssq pd

inn md mii addsd

ufdffdorpppppp uaddsspppppp

d d ooosssd

d tt d t t oood

ttt t log

punpck 2pd 2 2

ss k

2pd 2 2 2ss

packuswb

uc uc cvtss2sd

px pp ororor

s ps ps pss p2 p p p ps ps ps px pp or

2 p pandp hps psusu hpspunpcklqdq

nh ann d t taa tt tt 2 ceil 2 ce 2 ce t ooottttdttdtt 2 tttttattttananhn 2 eil 2 ce 2 e t ootttttttt 2 tttperror

b ddd b ddd andpd

b ubbb duuquubub duuandps

atatt n 0 0 w w

w 2

0 0 0 w w wsincos

ps ufdff p ufdff p ufdff p ps

u p

ufdff p umovlpdp

tan

w m w m000www m mvmovdqa

sinnneeexxx sinx strtod

nsss mindn m miins mindn m miijnp

lf lf lf xxmmmmmm

p p p p p p p pppppppppppaaaaannnnnnnnnppnnnpnnnppppdddpppppp lf xxmmmmmmxxxx mm3 m

p p p p o p pdddpppppp movlhps axub ax axsss a a asss xxm pandn

cs r

cs csm m muuusssssso

or uuu

o movntdq

q d

q cvt q q dsdddq dddd ddd t q dsdddq addpd

lld pmd pmd pm lld ddddddddddddlldl m

liii pmm dddddddddaddps dq pa dq pa movd

dddiiivvv divvv movq bdxxxxorppp mp p p ooorrrpppmmmm bxxorppp mp orrrpppmmmm divpd psu

ps mov

psu pcvtpd2psmov

!"#$%&'()*' Calls

+,("*-$./,-!)*'0

!'1$223$4,560",-0

789:6"$4,560",-0

;&</

./,-!)*'0

=*'16)*',1 ./,-!)*'0

>9:6"$4,560",-0

?@9:6"$4,560",-0

>9:6"$4,560",-0

%A*!)'59/*6'"$./,-!)*'0

%6A,$./,-!)*'0

,<*-B$=*</!-,

2"-6'5$2,!-(#

C79:6"$4,560",-0 ,<*-B

./,-!)*'0

D,"E*-F6'5G H((,00$!'1$I-*(,00$=*'"-*A

J'0"-&()*'0 4,560",-0 K6:($%&'()*'$=!AA0

Figure 2: T-SNE clustering visualization of tokens appearing in assembly code. There are three categories of tokens: operation, operand, and libc function call. Each token is represented as a 200-dimensional numeric vector. They are learned by Asm2Vec on plain assembly code without any prior knowledge of the assembly language. The training assembly code does not contain the libc callee functions’ content. For visualization, T-SNE reduces the vectors to two dimensions by nearest neighbor approximation. A smaller geometric distance indicates a higher lexical semantic similarity.

2. Problem Definition

In the assembly clone search literature, there are four types of clones [15], [16], [17]: Type I: literally identical;

Type II: syntactically equivalent; Type III: slightly modified;

and Type IV: semantically similar. We focus on Type IV clones, where assembly functions may appear syntactically different, but share similar functional logic in their source code. For example, the same source code with and without obfuscation, or a patched source code between different releases. We use the following notions: function denotes an assembly function; source function represents the original function written in source code, such as C++; repository functionstands for the assembly function that is indexed inside the repository; and target function denotes the assembly function query. Given an assembly function, our goal is to search for its semantic clones from the repository RP. We formally define the search problem as follows:

Definition 1. (Assembly function clone search) Given a target functionft, the search problem is to retrieve the top- k repository functions f_s ∈ RP, ranked by their semantic similarity, so they can be considered as Type IV clones.

3. Overall Workflow

Figure 3 shows the overall workflow. There are four steps: Step 1: Given a repository of assembly functions, we first build a neural network model for these functions.

We only need their assembly code as training data without

Figure 3: The overall work flow of Asm2Vec.

any prior knowledge. Step 2: After the training phase, the model produces a vector representation for each repository function. Step 3: Given a target function ft that was not trained with this model, we use the model to estimate its vector representation. Step 4: We compare the vector of ft

against the other vectors in the repository by using cosine similarity to retrieve the top-k ranked candidates as results.

The training process is a one-time effort and is efficient to learn representation for queries. If a new assembly function is added to the repository, we follow the same procedure in Step 3 to estimate its vector representation. The model can be retrained periodically to guarantee the vectors’ quality.

4. Assembly Code Representation Learning

In this section, we propose a representation learning model for assembly code. Specifically, our design is based on the PV-DM model [20]. PV-DM model learns document representation based on the tokens in the document. How-

(4)

ever, a document is sequentially laid out, which is different than assembly code, as the latter can be represented as a graph and has a specific syntax. First, we describe the original PV-DM neural network, which learns a vectorized representation of text paragraph. Then, we formulate our Asm2Vecmodel and describe how it is trained on instruction sequences for a given function. After, we elaborate how to model a control flow graph as multiple sequences.

4.1. Preliminaries

The PV-DM model is designed for text data. It is an extension of the original word2vec model. It can jointly learn vector representations for each word and each paragraph.

Figure 4 shows its architecture.

Figure 4: The PV-DM model.

Given a text paragraph which contains multiple sentences, PV-DM applies a sliding window over each sentence.

The sliding window starts from the beginning of the sentence and moves forward a single word at each step. For example, in Figure 4, the sliding window has a size of 5.

In the first step, the sliding window contains the five words

‘the’, ‘cat’, ‘sat’, ‘on’ and ‘a’. The word ‘sat’ in the middle is treated as the target and the surrounding words are treated as the context. In the second step, the window moves forward a single word and contains ‘cat’, ‘sat’, ‘on’, ‘a’ and ‘mat’, where the word ‘on’ is the target.

At each step, the PV-DM model performs a multi-class prediction task (see Figure 4). It maps the current paragraph into a vector based on the paragraph ID and maps each word in the context into a vector based on the word ID. The model averages these vectors and predicts the target word from the vocabulary through a softmax classification. The back-propagated classification error will be used to update these vectors. Formally, given a text corpus T that contains a list of paragraphs p ∈ T, each paragraph p contains a list of sentences s ∈ p, and each sentence is a sequence of |s|

words wt∈ s. PV-DM maximizes the log probability:

T

X

p p

X

s

|s|−k

X

t=k

log P(wt|p, wt−k, ..., wt+k) (1)

The sliding window size is 2k + 1. The paragraph vector captures the information that is missing from the context to predict the target. It is interpreted as topics [20]. PV- DM is designed for text data that is sequentially laid out.

However, assembly code carries richer syntax than plaintext. It contains operations, operands, and control flow that are structurally different than plaintext. These differences require a different model architecture design that cannot be addressed by PV-DM. Next, we present a representation learning model that integrates the syntax of assembly code.

4.2. The Asm2Vec Model

An assembly function can be represented as a control flow graph (CFG). We propose to model the control flow graph as multiple sequences. Each sequence corresponds to a potential execution trace that contains linearly laid-out assembly instructions. Given a binary file, we use the IDA Pro² disassembler to extract a list of assembly functions, their basic blocks, and control flow graphs.

This section corresponds to Step 1 and 2 in Figure 3.

In these steps, we train a representation model and produce a numeric vector for each repository function fs ∈ RP.

Figure 5 shows the neural network structure of the model.

It is different than the original PV-DM model.

First, we map each repository function fs to a vector θ~fs ∈ R^2×d. ~θfs is the vector representation of function fs to be learned in training. d is a user chosen parameter.

Similarly, we collect all the unique tokens in the repository RP. We treat operands and operations in assembly code as tokens. We map each token t into a numeric vector ~vt∈ R^d and another numeric vector ~v⁰t ∈ R^2×d. ~vt is the vector representations of token t. After training, it represents a token’s lexical semantics. ~vtvectors are used in Figure 2 to visualize the relationship among tokens. ~v⁰tis used for token prediction. All ~θfs and ~vt are initialized to small random value around zero. All ~v⁰_t are initialized to zeros. We use 2 × d for fs since we concatenate the vectors for operation and operands to represent an instruction.

We treat each repository function fs ∈ RP as multiple sequences S(fs) = seq[1 : i], where seqi is one of them.

We assume that the order of sequences is randomized. A sequence is represented as a list of instructions I(seqi) = in[1 : j], where in_j is one of them. An instruction inj

contains a list of operands A(inj) and one operation P(inj).

Their concatenation is denoted as its list of tokens T (inj) = P(inj) || A(inj), where || denotes concatenation. Constants tokens are normalized into their hexadecimal form.

For each sequence seqi in function fs, the neural network walks through the instructions from its beginning. We collect the current instruction inj, its previous instruction inj−1, and its next instruction inj+1. We ignore the instructions that are out-of-boundary. The proposed model tries to

2. IDA Pro, available at: http://www.hex-rays.com/

(5)

Figure 5: The proposed Asm2Vec neural network model for assembly code.

maximize the following log probability across the repository RP:

RP

X

fs

S(fs)

X

seqi

I(seqi)

X

inj

T (inj)

X

tc

log P(tc|fs, inj−1, inj+1) (2) It maximizes the log probability of seeing a token tc at the current instruction, given the current assembly function fs and neighbor instructions. The intuition is to use the current function’s vector and the context provided by the neighbor instructions to predict the current instruction. The vectors provided by neighbor instructions capture the lexical semantic relationship. The function’s vector remembers what cannot be predicted given the context. It models the instructions that distinguish the current function from the others.

For a given function fs, we first look-up its vector representation ~θ_f_s through the previously built dictionary.

To model a neighbor instruction in as CT (in) ∈ R^2×d, we average the vector representations of its operands (∈ R^d) and concatenate the averaged vector (∈ R^d) with the vector representation of the operation. It can be formulated as:

CT (in) = ~v_P(in)|| 1

|A(in)|

A(in)

X

t

~

vtb (3)

Recall that P(∗) denotes an operation and it is a single token. By averaging fs with CT (inj− 1) and CT (inj+ 1), δ(in, f_s) models the joint memory of neighbor instructions:

δ(inj, fs) =1

3(~θfs+ CT (inj−1) + CT (inj+1)) (4) Example 1. Consider a simple assembly code function fs and one of its sequence in Figure 5. Take the third instruction where j = 3 for example. T (in3) = {’push’,

’rbx’}. A(in3−1) = {’rbp’, ’rsp’}. P(in₃₋₁) = {’mov’}. We collect their respective vectors ~vrbp, ~vrsp, ~vmovand calculate CT (in3−1) = ~vmov||(~vrbp+ ~vrsp)/2. Following the same

procedure, we calculate CT (in3+1). With Equation 4 and

~θfs we have δ(in3, fs).

Given δ(in, fs), the probability term in Equation 2 can be rewritten as follows:

P(tc|fs, inj−1, inj+1) = P(tc|δ(inj, fs)) (5) Recall that we map each token into two vectors ~v and ~v⁰. For each target token tc ∈ T (inj), which belongs to the current instruction, we look-up its output vector ~v⁰tc. The probability in Equation 5 can be modeled as a softmax multi- class regression problem:

P(tc|δ(inj, fs)) = P(~v⁰tc|δ(inj, fs))

= f (~v⁰tc, δ(inj, fs)) PD

d f (~v⁰_t_d, δ(in_j, f_s)) f (~v⁰tc, δ(inj, fs)) = U h((~v⁰tc)^T× δ(inj, fs)) D denotes the whole vocabulary constructed upon the repository RP. U h(·) denotes a sigmoid function applied to each value of a vector. The total number of parameters to be estimated is (|D| + 1) × 2 × d for each pass of the softmax layout. The term |D| is too large for the softmax classification. Following [20], [21], we use the k negative sampling approach to approximate the log probability as:

log P(tc|δ(inj, fs)) ≈ log f (~v⁰tc|δ(inj, fs)) +

k

X

i=1

EtdvPn(tc) Jtd6= tcKlog f (−1 ×v~⁰_t_d, δ(in_j, f_s)) (6) J·K is an identity function. If the expression inside this function is evaluated to be true, then it outputs 1; otherwise 0. For example,J1 + 2 = 3K = 1 and J1 + 1 = 3K = 0. The negative sampling algorithm distinguishes the correct guess tc with k randomly selected negative samples {td|td6= tc}