Asm2Vec: Boosting Static Representation Robustness for Binary Clone Search against Code Obfuscation and Compiler Optimization
Steven H. H. Ding∗, Benjamin C. M. Fung∗, and Philippe Charland†
∗Data Mining and Security Lab, School of Information Studies, McGill University, Montreal, Canada.
Emails: [email protected], [email protected]
†Mission Critical Cyber Security Section, Defence R&D Canada - Valcartier, Quebec, QC, Canada.
Email: [email protected]
Abstract—Reverse engineering is a manually intensive but necessary technique for understanding the inner workings of new malware, finding vulnerabilities in existing systems, and detecting patent infringements in released software. An assembly clone search engine facilitates the work of reverse engineers by identifying those duplicated or known parts.
However, it is challenging to design a robust clone search engine, since there exist various compiler optimization options and code obfuscation techniques that make logically similar assembly functions appear to be very different.
A practical clone search engine relies on a robust vector representation of assembly code. However, the existing clone search approaches, which rely on a manual feature engineering process to form a feature vector for an assembly function, fail to consider the relationships between features and identify those unique patterns that can statistically distinguish assembly functions. To address this problem, we propose to jointly learn the lexical semantic relationships and the vector representation of assembly functions based on assembly code. We have devel- oped an assembly code representation learning model Asm2Vec.
It only needs assembly code as input and does not require any prior knowledge such as the correct mapping between assembly functions. It can find and incorporate rich semantic relationships among tokens appearing in assembly code. We conduct extensive experiments and benchmark the learning model with state-of-the-art static and dynamic clone search approaches. We show that the learned representation is more robust and significantly outperforms existing methods against changes introduced by obfuscation and optimizations.
1. Introduction
Software developments mostly do not start from scratch.
Due to the prevalent and commonly uncontrolled reuse of source code in the software development process [1], [2], [3], there exist a large number of clones in the underlying assembly code as well. An effective assembly clone search engine can significantly reduce the burden of the manual analysis process involved in reverse engineering. It addresses the information needs of a reverse engineer by taking ad- vantage of existing massive binary data.
Assembly code clone search is emerging as an Infor- mation Retrieval (IR) technique that helps address security- related problems. It has been used for differing binaries to locate the changed parts [4], identifying known library func- tions such as encryption [5], searching for known program-
ming bugs or zero-day vulnerabilities in existing software or Internet of Things (IoT) devices firmware [6], [7], as well as detecting software plagiarism or GNU license infringements when the source code is unavailable [8], [9]. However, designing an effective search engine is difficult, due to vari- eties of compiler optimizations and obfuscation techniques that make logically similar assembly functions appear to be dramatically different. Figure 1 shows an example. The optimized or obfuscated assembly function breaks control flow and basic block integrity. It is challenging to identify these semantically similar, but structurally and syntactically different assembly functions as clones.
Developing a clone search solution requires a robust vector representation of assembly code, by which one can measure the similarity between a query and the indexed functions. Based on the manually engineered features, rel- evant studies can be categorized into static or dynamic ap- proaches. Dynamic approaches model the semantic similar- ity by dynamically analyzing the I/O behavior of assembly code [10], [11], [12], [13]. Static approaches model the similarity between assembly code by looking for their static differences with respect to the syntax or descriptive statistics [6], [7], [8], [14], [15], [16], [17], [18]. Static approaches are more scalable and provide better coverage than the dynamic approaches. Dynamic approaches are more robust against changes in syntax but less scalable. We identify two problems which can be mitigated to boost the semantic richness and robustness of static features. We show that by considering these two factors, a static approach can even achieve better performance than the state-of-the-art dynamic approaches.
P1: Existing state-of-the-art static approaches fail to consider the relationships among features. LSH-S [16], n- gram [8], n-perm [8], BinClone [15] and Kam1n0 [17]
model assembly code fragments as frequency values of operations and categorized operands. Tracelet [14] models assembly code as the editing distance between instruction sequences. Discovre [7] and Genius [6] construct descriptive features, such as the ratio of arithmetic assembly instruc- tions, the number of transfer instructions, the number of basic blocks, among others. All these approaches assume each feature or category is an independent dimension. How- ever, a xmm0 Streaming SIMD Extensions (SSE) register is related to SSE operations such as movaps. A fclose libc function call is related to other file-related libc calls such as fopen. A strcpy libc call can be replaced with memcpy.
These relationships provide more semantic information than 2019 IEEE Symposium on Security and Privacy
Figure 1: Different assembly functions compiled from the same source code of gmpz tdiv r 2exp in libgmp. From left to right, the assembly functions are compiled with gcc O0 option, gcc O3 option, LLVM obfuscator Control Flow Graph Flattening option, and LLVM obfuscator Bogus Control Flow Graph option. Asm2Vec can statically identify them as clones.
individual tokens or descriptive statistics.
To address this problem, we propose to incorporate lexical semantic relationship into the feature engineering process. Manually specifying all the potential relation- ships from prior knowledge of assembly language is time- consuming and infeasible in practice. Instead, we propose to learn these relationships directly from plain assembly code.
Asm2Vecexplores co-occurrence relationships among tokens and discovers rich lexical semantic relationships among to- kens (see Figure 2). For example, memcpy, strcpy, memncpy and mempcpy appear to be semantically similar to each other. SSE registers relate to SSE operands. Asm2Vec does not require any prior knowledge in the training process.
P2: The existing static approaches assume that features are equally important [14], [15], [16], [17] or require a mapping of equivalent assembly functions to learn the weights [6], [7]. The chosen weights may not embrace the important patterns and diversity that distinguishes one assembly function from another. An experienced reverse en- gineer does not identify a known function by equally looking through the whole content or logic, but rather pinpoints critical spots and important patterns that identify a specific function based on past experience in binary analysis. One also does not need mappings of equivalent assembly code.
To solve this problem, we find that it is possible to simulate the way in which an experienced reverse engineer works. Inspired by recent development in representation learning [19], [20], we propose to train a neural network model to read many assembly code data and let the model identify the best representation that distinguishes one func- tion from the rest. In this paper, we make the following contributions:
• We propose a novel approach for assembly clone de- tection. It is the first work that employs representation learning to construct a feature vector for assembly code,
as a way to mitigate problems P1 and P2 in current hand- crafted features. All previous research on assembly clone search requires a manual feature engineering process. The clone search engine is part of an open source platform1.
• We develop a representation learning model, namely Asm2Vec, for assembly code syntax and control flow graph. The model learns latent lexical semantics between tokens and represents an assembly function as an inter- nally weighted mixture of collective semantics. The learn- ing process does not require any prior knowledge about assembly code, such as compiler optimization settings or the correct mapping between assembly functions. It only needs assembly code functions as inputs.
• We show that Asm2Vec is more resilient to code obfusca- tion and compiler optimizations than state-of-the-art static features and dynamic approaches. Our experiment covers different configurations of compiler and a strong obfusca- tor which substitutes instructions, splits basic blocks, adds bogus logics, and completely destroys the original control flow graph. We also conduct a vulnerability search case study on a publicly available vulnerability dataset, where Asm2Vecachieves zero false positives and 100% recalls.
It outperforms a dynamic state-of-the-art vulnerability search method.
Asm2Vecas a static approach cannot completely defeat code obfuscation. However, it is more resilient to code obfuscation than state-of-the-art static features. This paper is organized as follows: Section 2 formally defines the search problem. Section 3 systematically integrates representation learning into a clone search process. Section 4 describes the model. Section 5 presents our experiment. Section 6 discusses the literature. Section 7 discusses the limitations and concludes the paper.
1. https://github.com/McGill-DMaS/Kam1n0-Plugin-IDA-Pro
dec ah
rax
mfence fsub
al
fstp
rbp
r10b ec ecbh rbx r10d
10 bl10b
fxam
xor
ffs ffffs ff fld
c c cbhhh c c cbbhhhch
rcx
cl rdi
aah dh r12d
ax rdxax
fxch
ds dl
div
r12b
fldcw subsub
faddp
add adc
test
r r10 r1blb r101l r11b r11dr1
shl shr fs
r14b r14d r11 r13d
gs
cmovbe
r13b
fucomi
h hxadd 0d 0 0 13 0 0 0dd 0 13 0 0xchg
ja ra ra rarrrdd raaadd
jb r15br144
r15dr1
jg
jl
tp stp s s fst
cmp
js s jzs
fild
l cl l cl l clddddddd c clll c c cldddddddd c cll l l cpuidl
r1 r14 r11 hhgggr
r1 r1 hhhgr1141 hhggg
bswapr1
xxoo esi gsgs
esp
jecxz
bbbllllll 11bbbbbbbbbbbbbbbbbbbbbbll 11bbbbbbb neg
mfefflffdddfffffffff ttff mfefmulfflffdddfff ttff
rolllror
r1 c c c 1 1 l r1 cl cl cll c c c c c c 1 and1 cp orcp r8
r9
lea r10 r12
10 10 r11 r14 12 1 12 r1 r13444 111333 1113r15
h hhh
ah dhhhaahhh hhhh
ah dhhhaahh cmpxchg
j s
jja jbe
cmovle add subadd
fdiv
rsi
movsx rspr
ence t enc dffsssffff t enc uuuuuuuuuuuuuuufffflllllllllffffffffsssssst ence
t e dfffssffff t e
faddt ss
cmovnb cm callcm
cmovns cmovnz
imul
ffs ffffs ffffs ff ffaddddsss adddd fchs fdivp
ds dss ds j syscall
jjjl jgejjjl nop
not
fnstcw
cmoo cmovg
mov
ovleocmova cmoo cmovb
r151 r8br15
cmovl
r8d
p p p xx mmmo xxppp xxpushpo
r8bbb r8bbb r9b 1 1 10dx00x00xcdd r8d r10d0000x00x00cddh r9dr8d
vp vpfsubrp
fiffld fiffld fistp fdff iv
fmulp
movl mcmovge
hl llllll hl allmovsxd
eax
jjjg jlejjjg
prefetcht2
imul imul cmovs
jjjecxzzzpzzp prefetcht0
allllll movzx cmm cmovz
ebpgs eeessspppgg eg ee bp bpgs
ppp bebx
d dud2
jmp
ebbbebb ebbbebb rbb ecx rb cx
r r dddx rdd bbrbrbrb x rb r r xppp x xb cx
ddxx rrdddd bbxxbbb
xxbpppp x x x x u jnb
or oedi
jnsjjjs tte tt s tte tt s tte jjjjtt stttt jzsss jjjjjz
jnz edxecc
e cm ge
cm tzcnt ecm ecm geeeidiv
ffa ff vffaffffaff ulll v v v v v v llff v v ulffaaa v vfld1
div div div d d d d d nttdvvv nttmul
d incd
sar
p p ddd111p p ddd111 fucomip
rr999bbbr8b r8bbbb sbb
fd ff ivff
v v fd ff ivf
v fldzv
ungetc
htons htonl
openlog rand
strncmp
bind
fileno_unlocked getmntent
ctime
mkdir ac
execvp
getsid fileno
ax sigemptyset
k execlk
fcntl
bp sleep
bpp bx close
connect
getenv
putchar_unlocked
fnmatch
fchmod ntohl
ntohs fscanf
cx memcpy
localtime
di realloc
sprintf
strlen
r11w memcmp
c dxc vfork
_exit
getrlimit64 signal
getsockname
kdir kd kd kd kd k k kchown strerror
dii r10wdi getuid
strstr execcv strptime setbuf
getchar_unlocked setmntent
b bxxxbb bbb bxxxbb r13wbbb getppid ffs
ff caa clearerr
mmap64 popen
semget
r12wax fork
fgetc
select
cfsetispeed pen
ppe p nn ppfopen64
w wr15w ntott
ntott cld e ewrite
fnstsw g g getgid exit
system
b b bb w w w w w000000000www
b b bb bbb bb w ww w w w w w w w w w b w00www
b bb bb w r14ww strtoll
htott ohl ohlll ho oowait
setpriority
len rlenfputc
mk mk exeeeccclllmk exkkk open64lll
ct cttsetsockopt
nlogmmm
n m
getpgrp
getgroups
atoi malloc
setprioppp setprio tcsetattr set
st fputs
memchr
p waitpid
fflush
p 664 psrand
t id getgtt t id getgtt r setgroups
raise te tt t read
geteuid
mem memsetmem
connectnnneect ftruncate64
ge ge regfree
lseek64
fgets_unlocked
sigaddset
utimes
ffc ff n ffc fsyncff n
recv
ungetctt ungetctt putc_unlocked
atott i readdir64
k kk k k kc k k k k chmod vsnprintf getmntent_r
ts tt id ts tt dsemop
64 b 64 64 b 6socket
getstt oc t
ggge c ttcsetpgrp r dupr rec sendtorec
mempcpy
strchr
rmdir t
setttb sett fwrite_unlocked
getctt har
kk h r
ksetrlimit64 g orkgtstt or o tstt getopt
id ggidgetegid ran ran n semctlran
snp unamevsnp
n wngettimeofday
recvfrom
nv strtr
nstrtok_r mount
ep sselecccscc epp ss cscc eaccept
sprin sprint sscanf
slee k646464 ep ea e eppcpp c 4 4 slee k6646 p e ea e eppcpp c dup2
moddd mooddumask
access endmntent
lseee syslog
b 1w b b b 133wbbbbw b b b w b b b rrr w w w w w w 1
1w b 1 wbbbbw b b b w b b b r w w w w w siw ntottoho
tcgetattr ffg
ffff freeffg
ctime ct e cct c meee setutxent
u fdatasyncu fopen
fprintf
llo printftt p
p o
p p epstrftime
ldd s dttttttccctttttt seeettt ld
s d
sttt cfmakeraw
ch opench sscan localtime_rss
stre strpbrkstre regggggg tzset
get gt getutxent
utimim inet_pton
r r upr duuupuurrr sync ffo
ff pe ferror
etstt ge gtt get ge ge tage tt ttttt rrr
ts t t g gtt tabort
vv vvs unaaamvvvvvsvsmm m m closedirm
setutttccc s ctccttcccc opendir
putenv
lseeksys
l h cl cl
h cllreadlink memrchr
sl syn l dd ncccdd l nssss n nccccyyyyyyycc sle asioctldddyyyyyyy
printf
ssseeell ht iiihhss ta tt tttt r t t ll ht h aiiihh ta tt tttt r ettfchdir
getcwd clea
p a p fclose
set w wa w w it set tcflushit
nl nlog nlssemmmsssssss n nlog nsetpgidmcccss
strcasestr strtr rm
st m
execlp __ oop
_ p
_time
a
strchrnula dirnameputctttth strcpy
inet_aton een
fread
getpagesize cv ecc tttorree tt
i cv ecc tttorree ttsetsid
d 1
d rrrr1ddd1 r111 rr111 d 1
d d1 d d1 111 111 r8w anf setmntett
a a
g anf setmntett a aferror_unlocked
yreadd ystrspn
htotto strnlen memcmm
strncasecmp
ffc ff dir asssk ge
ffc ff di asssk ge setenv getmm
pclose
strtt strtoull
r111 1ss 1ss r9w strtol
mstrpbbbs sss msymlinkbbbsss
onl setggtt rrrooog s ooops onl
gooo moooplongjmp
o o kill
sigprocmask
kkkeeddd ssss kkkc
putsssc
ritytt ttctt seeegegggg rmunmapttctt seeegegggggg
strs st strrsrsrrseesssss rename
mem trchr
mem trchr strrchr
opeeeeeenee ope opeeenope en e e opeeen mkfifo ffiffleno unl ffiffleno unl fputs_unlocked
en ssseen isatty
getutxtt egggggg cfsetospeed strcasecmp
ffn ff maa regexec
strcmp
nv b b b b nv
b b strtok
ffo ff rkkkkks ff ffkkkkks
g p
cfsetspeed
error e e ssst e e echdirt
sigaction
g g ffrg ff eeee
printftt g g ffrg ff eeee
printftt getc_unlocked
mcpypp mc mctt
ssstrcpypp m m mttt
ssstrcpypp memmove
sigfillset
d d d d seyee d d d d d e isattssstssssssyee send
vp mall vexecvl ok_rrr oookk ok r ooolink
stpcpyge nt n ll if_nametoindex o
nfdopen
tt t strcspn
bi d b cccketb pp222222222
bi d b cccketb ppp2222222tcgetpgrp
t l to strtoultt l
romsigg ralarm
xdddx r13www x x133w13ww
dddx r1 ww x133w1 w cwde tftttftt
t snprintf fff nffirrrr oprrrpep n p
pe opp irr
snnnp qsort ffr ff eadd fwrite
sestr sfchown
cfsfffs inet_ntoa b f ge
poppenpppppoppepppp penneenn n buf ge
popenpppppoppepppp pennenn n fseeko64
g s ssttttt g
sttttt unlink memm strncpy
renn inet_ntop
ddd dddgetpid
y inet y unsetenvinet cvttss2si
pslld
ucomiss psllq
cs ucomisd
movhps
cvtsi2sd pmovmskb
cvtsi2ss maxsd
2si punpckhbw2s p
pmuludqp
om pminubom
tanh
vpcmpeqq sinfmodtattnan
punpckhdq punpckhqdq
psubb psubd
sqrt atan2 ceil
sqrtss xorpd
jp
cvtstt i2 ckhbwk
i2 ckhbw pcmpgtw pcmpgtd
psrldq
pcmpeqd
punp pcmpeqbpunp
f dttdttatttta cos
ppp sfenceppp
pcmpgtb ceilf
pmommo movdqu
movdqa
ckhdq ckhdqpunpcklbw
vvpcmm movupd pmaxub
qr qr sqrtf
times
sqrtsd 2 ce
2 e
floor
pxor
pcm
pm
punpckldq cvtdq2pd floorf pgtdpgmovupsttmmmovvvpppdpppppd
cvtdtt q2 ccvvtd2 unpckhps minssceii
paddd shufps paddbsqrrrssq fqsstssfqttqqq
minsd jp d jjjp d subsd
ffffccc fccc exp
fflffoo ffflffooo palignr
ld fp ff s ld fp
ff ssubss dq2ppddsdq2ppddpunpckhwdqq2pd22pd2 punpp psllqq
psllq paddq
cvt c i movsst
s log10s
vms ovvskbssskmkbbobbv omovsdv por
t ffm ff ootttttttt fm ff oottttttt fm ff ootttttt fffott 2 tt f tt f tt ffffffff hypot 10 10 pow
sd pc
sdd ppc sandnpd
2ss sssfeffn uuunpccc i2ss sssfeffn uuunpccccc andnps
pd pdmulpd
bw bwvpmovmskb pshufd
ceilf ma cei ceii ma cmpltpsceii
p t d jjjp bsdjjjjjjpjjjp t d jpp bsdjjjjjj subpd
mo anddd mo anddd movaps
cvtsd2ss f orfrrf r o o f o noosubps
uco pmiiuunuuccuccouoooo rtstt d cvttsd2si mo mo movapd ddddddq cccm
dm
qddd q qpsrldddd
ddd d
d pad
ddd ddddd ppa dmulpsdd mulud
pmuludpsrlq psrlw dqadunpcklpd
qucommmov qmmmmmo qdivssucomv
pdmin dddmmi pddivsdn
bbbbbbssss pa m m mulppppppsppaa ssssssmmmov ssssd s sssiissss bbbbb
ad p m m mulppppppsppaa
m m diimov mulss pa orpsppal
ttimes ttimes sysconf
lw lw xmm0 xmm1
cvtttt stt s miss s2spss2222 misspunpcklwd222 rl rl rl xxmm xxmmrlll xmm2m
p p xmm3
m m m mulsdmmuuumm m m m m m puuu m orpdm
pcm pm xmm4
cvt cvttt ppsxxxxxxmmxmm5 bbpxxmmm xmm6 xxxxxxm xmm7xxm
pad pad pad h fff or ddd
h ffd porpslldqdddddd
nsdsq pd
inn md miinsdssq pd
inn md mii addsd
ufdffdorpppppp uaddsspppppp
d d ooosssd
d tt d t t oood
ttt t log
punpck 2pd 2 2
ss k
2pd 2 2 2ss
packuswb
uc uc cvtss2sd
px pp ororor
s ps ps pss p2 p p p ps ps ps px pp or
2 p pandp hps psusu hpspunpcklqdq
nh ann d t taa tt tt 2 ceil 2 ce 2 ce t ooottttdttdtt 2 tttttattttananhn 2 eil 2 ce 2 e t ootttttttt 2 tttperror
b ddd b ddd andpd
b ubbb duuquubub duuandps
atatt n 0 0 w w
w 2
0 0 0 w w wsincos
ps ufdff p ufdff p ufdff p ps
u p
ufdff p umovlpdp
tan
w m w m000www m mvmovdqa
sinnneeexxx sinx strtod
nsss mindn m miins mindn m miijnp
lf lf lf xxmmmmmm
p p p p p p p pppppppppppaaaaannnnnnnnnppnnnpnnnppppdddpppppp lf xxmmmmmmxxxx mm3 m
p p p p o p pdddpppppp movlhps axub ax axsss a a asss xxm pandn
cs r
cs csm m muuusssssso
or uuu
o movntdq
q d
q cvt q q dsdddq dddd ddd t q dsdddq addpd
lld pmd pmd pm lld ddddddddddddlldl m
liii pmm dddddddddaddps dq pa dq pa movd
dddiiivvv divvv movq bdxxxxorppp mp p p ooorrrpppmmmm bxxorppp mp orrrpppmmmm divpd psu
ps mov
psu pcvtpd2psmov
!"#$%&'()*' Calls
+,("*-$./,-!)*'0
!'1$223$4,560",-0
789:6"$4,560",-0
;&</
./,-!)*'0
=*'16)*',1 ./,-!)*'0
>9:6"$4,560",-0
?@9:6"$4,560",-0
?@9:6"$4,560",-0
>9:6"$4,560",-0
%A*!)'59/*6'"$./,-!)*'0
%6A,$./,-!)*'0
,<*-B$=*</!-,
2"-6'5$2,!-(#
C79:6"$4,560",-0 ,<*-B
./,-!)*'0
D,"E*-F6'5G H((,00$!'1$I-*(,00$=*'"-*A
J'0"-&()*'0 4,560",-0 K6:($%&'()*'$=!AA0
Figure 2: T-SNE clustering visualization of tokens appearing in assembly code. There are three categories of tokens: operation, operand, and libc function call. Each token is represented as a 200-dimensional numeric vector. They are learned by Asm2Vec on plain assembly code without any prior knowledge of the assembly language. The training assembly code does not contain the libc callee functions’ content. For visualization, T-SNE reduces the vectors to two dimensions by nearest neighbor approximation. A smaller geometric distance indicates a higher lexical semantic similarity.
2. Problem Definition
In the assembly clone search literature, there are four types of clones [15], [16], [17]: Type I: literally identical;
Type II: syntactically equivalent; Type III: slightly modified;
and Type IV: semantically similar. We focus on Type IV clones, where assembly functions may appear syntactically different, but share similar functional logic in their source code. For example, the same source code with and without obfuscation, or a patched source code between different releases. We use the following notions: function denotes an assembly function; source function represents the original function written in source code, such as C++; repository functionstands for the assembly function that is indexed in- side the repository; and target function denotes the assembly function query. Given an assembly function, our goal is to search for its semantic clones from the repository RP. We formally define the search problem as follows:
Definition 1. (Assembly function clone search) Given a target functionft, the search problem is to retrieve the top- k repository functions fs ∈ RP, ranked by their semantic similarity, so they can be considered as Type IV clones.
3. Overall Workflow
Figure 3 shows the overall workflow. There are four steps: Step 1: Given a repository of assembly functions, we first build a neural network model for these functions.
We only need their assembly code as training data without
Figure 3: The overall work flow of Asm2Vec.
any prior knowledge. Step 2: After the training phase, the model produces a vector representation for each repository function. Step 3: Given a target function ft that was not trained with this model, we use the model to estimate its vector representation. Step 4: We compare the vector of ft
against the other vectors in the repository by using cosine similarity to retrieve the top-k ranked candidates as results.
The training process is a one-time effort and is efficient to learn representation for queries. If a new assembly func- tion is added to the repository, we follow the same procedure in Step 3 to estimate its vector representation. The model can be retrained periodically to guarantee the vectors’ quality.
4. Assembly Code Representation Learning
In this section, we propose a representation learning model for assembly code. Specifically, our design is based on the PV-DM model [20]. PV-DM model learns document representation based on the tokens in the document. How-
ever, a document is sequentially laid out, which is different than assembly code, as the latter can be represented as a graph and has a specific syntax. First, we describe the original PV-DM neural network, which learns a vectorized representation of text paragraph. Then, we formulate our Asm2Vecmodel and describe how it is trained on instruction sequences for a given function. After, we elaborate how to model a control flow graph as multiple sequences.
4.1. Preliminaries
The PV-DM model is designed for text data. It is an extension of the original word2vec model. It can jointly learn vector representations for each word and each paragraph.
Figure 4 shows its architecture.
Figure 4: The PV-DM model.
Given a text paragraph which contains multiple sen- tences, PV-DM applies a sliding window over each sentence.
The sliding window starts from the beginning of the sen- tence and moves forward a single word at each step. For example, in Figure 4, the sliding window has a size of 5.
In the first step, the sliding window contains the five words
‘the’, ‘cat’, ‘sat’, ‘on’ and ‘a’. The word ‘sat’ in the middle is treated as the target and the surrounding words are treated as the context. In the second step, the window moves forward a single word and contains ‘cat’, ‘sat’, ‘on’, ‘a’ and ‘mat’, where the word ‘on’ is the target.
At each step, the PV-DM model performs a multi-class prediction task (see Figure 4). It maps the current paragraph into a vector based on the paragraph ID and maps each word in the context into a vector based on the word ID. The model averages these vectors and predicts the target word from the vocabulary through a softmax classification. The back-propagated classification error will be used to update these vectors. Formally, given a text corpus T that contains a list of paragraphs p ∈ T, each paragraph p contains a list of sentences s ∈ p, and each sentence is a sequence of |s|
words wt∈ s. PV-DM maximizes the log probability:
T
X
p p
X
s
|s|−k
X
t=k
log P(wt|p, wt−k, ..., wt+k) (1)
The sliding window size is 2k + 1. The paragraph vector captures the information that is missing from the context to predict the target. It is interpreted as topics [20]. PV- DM is designed for text data that is sequentially laid out.
However, assembly code carries richer syntax than plain- text. It contains operations, operands, and control flow that are structurally different than plaintext. These differences require a different model architecture design that cannot be addressed by PV-DM. Next, we present a representation learning model that integrates the syntax of assembly code.
4.2. The Asm2Vec Model
An assembly function can be represented as a control flow graph (CFG). We propose to model the control flow graph as multiple sequences. Each sequence corresponds to a potential execution trace that contains linearly laid-out assembly instructions. Given a binary file, we use the IDA Pro2 disassembler to extract a list of assembly functions, their basic blocks, and control flow graphs.
This section corresponds to Step 1 and 2 in Figure 3.
In these steps, we train a representation model and produce a numeric vector for each repository function fs ∈ RP.
Figure 5 shows the neural network structure of the model.
It is different than the original PV-DM model.
First, we map each repository function fs to a vector θ~fs ∈ R2×d. ~θfs is the vector representation of function fs to be learned in training. d is a user chosen parameter.
Similarly, we collect all the unique tokens in the repository RP. We treat operands and operations in assembly code as tokens. We map each token t into a numeric vector ~vt∈ Rd and another numeric vector ~v0t ∈ R2×d. ~vt is the vector representations of token t. After training, it represents a token’s lexical semantics. ~vtvectors are used in Figure 2 to visualize the relationship among tokens. ~v0tis used for token prediction. All ~θfs and ~vt are initialized to small random value around zero. All ~v0t are initialized to zeros. We use 2 × d for fs since we concatenate the vectors for operation and operands to represent an instruction.
We treat each repository function fs ∈ RP as multiple sequences S(fs) = seq[1 : i], where seqi is one of them.
We assume that the order of sequences is randomized. A sequence is represented as a list of instructions I(seqi) = in[1 : j], where inj is one of them. An instruction inj
contains a list of operands A(inj) and one operation P(inj).
Their concatenation is denoted as its list of tokens T (inj) = P(inj) || A(inj), where || denotes concatenation. Constants tokens are normalized into their hexadecimal form.
For each sequence seqi in function fs, the neural net- work walks through the instructions from its beginning. We collect the current instruction inj, its previous instruction inj−1, and its next instruction inj+1. We ignore the instruc- tions that are out-of-boundary. The proposed model tries to
2. IDA Pro, available at: http://www.hex-rays.com/
Figure 5: The proposed Asm2Vec neural network model for assembly code.
maximize the following log probability across the repository RP:
RP
X
fs
S(fs)
X
seqi
I(seqi)
X
inj
T (inj)
X
tc
log P(tc|fs, inj−1, inj+1) (2) It maximizes the log probability of seeing a token tc at the current instruction, given the current assembly function fs and neighbor instructions. The intuition is to use the current function’s vector and the context provided by the neighbor instructions to predict the current instruction. The vectors provided by neighbor instructions capture the lexi- cal semantic relationship. The function’s vector remembers what cannot be predicted given the context. It models the instructions that distinguish the current function from the others.
For a given function fs, we first look-up its vector representation ~θfs through the previously built dictionary.
To model a neighbor instruction in as CT (in) ∈ R2×d, we average the vector representations of its operands (∈ Rd) and concatenate the averaged vector (∈ Rd) with the vector representation of the operation. It can be formulated as:
CT (in) = ~vP(in)|| 1
|A(in)|
A(in)
X
t
~
vtb (3)
Recall that P(∗) denotes an operation and it is a single token. By averaging fs with CT (inj− 1) and CT (inj+ 1), δ(in, fs) models the joint memory of neighbor instructions:
δ(inj, fs) =1
3(~θfs+ CT (inj−1) + CT (inj+1)) (4) Example 1. Consider a simple assembly code function fs and one of its sequence in Figure 5. Take the third instruction where j = 3 for example. T (in3) = {’push’,
’rbx’}. A(in3−1) = {’rbp’, ’rsp’}. P(in3−1) = {’mov’}. We collect their respective vectors ~vrbp, ~vrsp, ~vmovand calculate CT (in3−1) = ~vmov||(~vrbp+ ~vrsp)/2. Following the same
procedure, we calculate CT (in3+1). With Equation 4 and
~θfs we have δ(in3, fs).
Given δ(in, fs), the probability term in Equation 2 can be rewritten as follows:
P(tc|fs, inj−1, inj+1) = P(tc|δ(inj, fs)) (5) Recall that we map each token into two vectors ~v and ~v0. For each target token tc ∈ T (inj), which belongs to the current instruction, we look-up its output vector ~v0tc. The probability in Equation 5 can be modeled as a softmax multi- class regression problem:
P(tc|δ(inj, fs)) = P(~v0tc|δ(inj, fs))
= f (~v0tc, δ(inj, fs)) PD
d f (~v0td, δ(inj, fs)) f (~v0tc, δ(inj, fs)) = U h((~v0tc)T× δ(inj, fs)) D denotes the whole vocabulary constructed upon the repos- itory RP. U h(·) denotes a sigmoid function applied to each value of a vector. The total number of parameters to be estimated is (|D| + 1) × 2 × d for each pass of the softmax layout. The term |D| is too large for the softmax classification. Following [20], [21], we use the k negative sampling approach to approximate the log probability as:
log P(tc|δ(inj, fs)) ≈ log f (~v0tc|δ(inj, fs)) +
k
X
i=1
EtdvPn(tc) Jtd6= tcKlog f (−1 ×v~0td, δ(inj, fs)) (6) J·K is an identity function. If the expression inside this function is evaluated to be true, then it outputs 1; otherwise 0. For example,J1 + 2 = 3K = 1 and J1 + 1 = 3K = 0. The negative sampling algorithm distinguishes the correct guess tc with k randomly selected negative samples {td|td6= tc}