• No results found

Classifying malware using network traffic analysis. Or how to learn Redis, git, tshark and Python in 4 hours.

N/A
N/A
Protected

Academic year: 2021

Share "Classifying malware using network traffic analysis. Or how to learn Redis, git, tshark and Python in 4 hours."

Copied!
32
0
0

Loading.... (view fulltext now)

Full text

(1)

Classifying malware using network traffic analysis.

Or how to learn Redis, git, tshark and Python in 4 hours.

Alexandre Dulaunoy

(2)

Problem Statement

We have more 5000 pcap files generated per day for each malware

execution in a sandbox

• We need to classify1 the malware into various sets

The project needs to be done in less than a day and the code

(3)

File Format and Filename

... 0580c82f6f90b75fcf81fd3ac779ae84.pcap 05a0f4f7a72f04bda62e3a6c92970f6e.pcap 05b4a945e5f1f7675c19b74748fd30d1.pcap 05b57374486ce8a5ce33d3b7d6c9ba48.pcap 05bbddc8edac3615754f93139cf11674.pcap 05bf1ff78685b5de06b0417da01443a9.pcap 05c3bccc1abab5c698efa0dfec2fd3a4.pcap ...

<MD5 hash of the malware).pcap

MD5 values2 of malware samples are used by A/V vendors, security

(4)

MapReduce and Network Forensic

• MapReduce is an old concept in computer science

◦ Themap stage to perform isolated computation on independent

problems

◦ Thereducestage to combine the computation results

Network forensic computations can easily be expressed in map and

reduce steps:

◦ parsing, filtering, counting, sorting, aggregating, anonymizing,

(5)

Processing and Reading pcap files

ls -1 | parallel --gnu ’tcpdump -s0 -A -n -r {1}’

Nice for processing the files but...

• How do we combine the results?

• How do we extract the classification parameters? (e.g. sed, awk,

(6)

tshark

tshark -G fields

• Wireshark is supporting a wide range of dissectors

tshark allows to use the dissectors from the command line

(7)

Concurrent Network Forensic Processing

• To allow concurrent processing, a non-blocking data store is

required

• To allow flexibility, a schema-free data store is required

• To allow fast processing, you need to scale horizontally and to

know the cost of querying the data store

• To allow streaming processing, write/cost versus read/cost should

(8)

Redis: a key-value/tuple store

• Redis is key store written in C with an extended set of data types

like lists, sets, ranked sets, hashes, queues

• Redis is usually in memory with persistence achieved by regularly

saving on disk

• Redis API is simple (telnet-like) and supported by a multitude of

programming languages

(9)

Redis: installation

Download Redis 3.0.6 (stable version)

• tar xvfz redis-3.0.6.tar.gz

• cd redis-3.0.6

(10)

Keys

• Keys are free text values (up to 231 bytes) - newline not allowed

• Short keys are usually better (to save memory)

(11)

Value and data types

• binary-safe strings

• lists of binary-safe strings

• sets of binary-safe strings

• hashes (dictionary-like)

(12)

Running redis and talking to redis...

• screen

cd ./src/ && ./redis-server

• new screen session (crtl-a c)

• redis-cli

(13)

Commands available on all keys

Those commands are available on all keys regardless of their type

TYPE [key]→ gives you the type of key (from string to hash)

• EXISTS [key] →does the key exist in the current database

• RENAME [old new]

• RENAMENX [old new]

• DEL [key]

RANDOMKEY → returns a random key

• TTL [key] → returns the number of sec before expiration

• EXPIRE [key ttl] or EXPIRE [key ts]

• KEYS [pattern] → returns all keys matching a pattern (!to use

(14)

Commands available for strings type

• SET [key] [value]

• GET [key]

MGET [key1] [key2] [key3]

• MSET [key1] [valueofkey1] ...

• INCR [key] — INCRBY [key] [value] →! string interpreted as

integer

• DECR [key] — INCRBY [key] [value] → ! string interpreted as

(15)

Commands available for sets type

• SADD [key] [member] →adds a member to a set named key

• SMEMBERS [key] → return the member of a set

• SREM [key] [member] →removes a member to a set named key

• SCARD [key] → returns the cardinality of a set

SUNION [key ...] → returns the union of all the sets

• SINTER [key ...] →returns the intersection of all the sets

• SDIFF [key ...] → returns the difference of all the sets

(16)

Commands available for list type

• RPUSH - LPUSH [key] [value]

• LLEN [key]

• LRANGE [key] [start] [end]

LTRIM [key] [start] [end]

• LSET [key] [index] [value]

(17)

Sorting

• SORT [key]

(18)

Commands availabled for sorted set type

• ZADD [key] [score] [member]

• ZCARD [key]

• ZSCORE [key] [member]

ZRANK [key] [member]→ get the rank of a member from bottom

(19)

Atomic commands

• GETSET [key] [newvalue] →sets newvalue and return previous

value

• (M)SETNX [key] [newvalue]→ sets newvalue except if key exists

(useful for locking)

MSETNX is very useful to update a large set of objects without race condition.

(20)

Database commands

• SELECT [0-15] → selects a database (default is 0)

• MOVE [key] [db]→ move key to another database

FLUSHDB →delete all the keys in the current database

• FLUSHALL→ delete all the keys in all the databases

• SAVE - BGSAVE →save database on disk (directly or in

background)

• DBSIZE

(21)

Redis from shell?

ret=$(redis-cli SADD dns:${md5} ${rdata}) num=$(redis-cli SCARD dns:${md5})

• Why not Python?

import redis

r = redis.StrictRedis(host=’localhost’, port=6379, db=0) r.set(’foo’, ’bar’)

(22)

How do you integrate it?

ls -1 ./pcap/*.pcap | parallel --gnu "cat {1} |

tshark -E header=yes -E separator=, -Tfields -e http.server -r {1} | python import.py -f {1} "

(23)

Structure and Sharing

• As you start to work on a project and you are willing to share it

• One approach is to start with a common directory structure for the

software to be shared

/PcapClassifier/bin/ -> where to store the code /etc/ -> where to store configuration /data/

(24)

import.py (first iteration)

# N e e d to c a t c h the MD5 f i l e n a m e of the m a l w a r e ( f r o m the p c a p f i l e n a m e ) i m p o r t a r g p a r s e i m p o r t sys a r g P a r s e r = a r g p a r s e . A r g u m e n t P a r s e r ( d e s c r i p t i o n =’ P c a p c l a s s i f i e r ’) a r g P a r s e r . a d d _ a r g u m e n t (’ - f ’, a c t i o n =’ a p p e n d ’, h e l p=’ F i l e n a m e ’) a r g s = a r g P a r s e r . p a r s e _ a r g s () if a r g s . f is not N o n e : f i l e n a m e = a r g s . f for l i n e in sys . s t d i n : p r i n t ( l i n e . s p l i t (" , ") [ 0 ] ) e l s e: a r g P a r s e r . p r i n t _ h e l p ()

(25)

Storing the code and its revision

Initialize a git repository (one time)

cd P c a p C l a s s i f i e r git i n i t

Add a file to the index (one time)

git add ./bin/i m p o r t. py

Commit a file to the index (when you do changes)

(26)

import.py (second iteration)

# N e e d to k n o w and s t o r e the MD5 f i l e n a m e p r o c e s s e d i m p o r t a r g p a r s e i m p o r t sys i m p o r t r e d i s a r g P a r s e r = a r g p a r s e . A r g u m e n t P a r s e r ( d e s c r i p t i o n =’ P c a p c l a s s i f i e r ’) a r g P a r s e r . a d d _ a r g u m e n t (’ - f ’, a c t i o n =’ a p p e n d ’, h e l p=’ F i l e n a m e ’) a r g s = a r g P a r s e r . p a r s e _ a r g s () r = r e d i s . S t r i c t R e d i s ( h o s t =’ l o c a l h o s t ’, p o r t =6379 , db =0) if a r g s . f is not N o n e : md5 = a r g s . f [ 0 ] . s p l i t (" . ") [ 0 ] r . s a d d (’ p r o c e s s e d ’, md5 ) for l i n e in sys . s t d i n : p r i n t ( l i n e . s p l i t (" , ") [ 0 ] )

(27)

Redis datastructure

• Now we have a set of the file processed (including the MD5 hash

of the malware):

{processed}={abcdef...,deadbeef...,0101aba....}

• A type set to store the field type decoded from the pcap (e.g.

http.user agent)

{type}={”http.user agent,http.server”}

• A unique set to store all the values seen per type

(28)

import.py (third iteration)

i m p o r ta r g p a r s e i m p o r tsys i m p o r tr e d i s a r g P a r s e r = a r g p a r s e . A r g u m e n t P a r s e r ( d e s c r i p t i o n =’ P c a p c l a s s i f i e r ’) a r g P a r s e r . a d d _ a r g u m e n t (’ - f ’, a c t i o n =’ a p p e n d ’,h e l p=’ F i l e n a m e ’) a r g s = a r g P a r s e r . p a r s e _ a r g s () r = r e d i s . S t r i c t R e d i s ( h o s t =’ l o c a l h o s t ’, p o r t =6379 , db =0) ifa r g s . fis notN o n e : md5 = a r g s . f [ 0 ] . s p l i t (" . ") [ 0 ] r . s a d d (’ p r o c e s s e d ’, md5 ) l n u m b e r =0 f i e l d s = N o n e forl i n einsys . s t d i n : ifl n u m b e r == 0: f i e l d s = l i n e . r s t r i p (). s p l i t (" , ") forf i e l dinf i e l d s : r . s a d d (’ t y p e ’, f i e l d ) e l s e: e l e m e n t s = l i n e . r s t r i p (). s p l i t (" , ") i =0 fore l e m e n tine l e m e n t s : try:

(29)

How to link a type to a MD5 malware hash?

Now, we have set of value per type

{e :http.user agent}={”Mozilla...,Curl...”}

• Using MD5 to hash type value and create a set of type value for

each malware

(30)

import.py (fourth iteration)

i m p o r ta r g p a r s e i m p o r tsys i m p o r tr e d i s i m p o r th a s h l i b a r g P a r s e r = a r g p a r s e . A r g u m e n t P a r s e r ( d e s c r i p t i o n =’ P c a p c l a s s i f i e r ’) a r g P a r s e r . a d d _ a r g u m e n t (’ - f ’, a c t i o n =’ a p p e n d ’,h e l p=’ F i l e n a m e ’) a r g s = a r g P a r s e r . p a r s e _ a r g s () r = r e d i s . S t r i c t R e d i s ( h o s t =’ l o c a l h o s t ’, p o r t =6379 , db =0) ifa r g s . fis notN o n e : md5 = a r g s . f [ 0 ] . s p l i t (" . ") [ 0 ] r . s a d d (’ p r o c e s s e d ’, md5 ) l n u m b e r =0 f i e l d s = N o n e forl i n einsys . s t d i n : ifl n u m b e r == 0: # p r i n t ( l i n e . s p l i t ( " , " ) [ 0 ] ) f i e l d s = l i n e . r s t r i p (). s p l i t (" , ") forf i e l dinf i e l d s : r . s a d d (’ t y p e ’, f i e l d ) e l s e: e l e m e n t s = l i n e . r s t r i p (). s p l i t (" , ") i =0 fore l e m e n tine l e m e n t s : try: r . s a d d (’ e : ’+ f i e l d s [ i ] , e l e m e n t ) e h a s h = h a s h l i b . md5 () e h a s h . u p d a t e ( e l e m e n t . e n c o d e (’ utf -8 ’))

(31)

graph.py - exporting graph from Redis

i m p o r t r e d i s i m p o r t n e t w o r k x as nx i m p o r t h a s h l i b r = r e d i s . S t r i c t R e d i s ( h o s t =’ l o c a l h o s t ’, p o r t =6379 , db =0) g = nx . G r a p h () for m a l w a r e in r . s m e m b e r s (’ p r o c e s s e d ’): g . a d d _ n o d e ( m a l w a r e ) for f i e l d t y p e in r . s m e m b e r s (’ t y p e ’): g . a d d _ n o d e ( f i e l d t y p e ) for v in r . s m e m b e r s (’ e : ’+ f i e l d t y p e . d e c o d e (’ utf -8 ’)): g . a d d _ n o d e ( v ) e h a s h = h a s h l i b . md5 () e h a s h . u p d a t e ( v ) e h h e x = e h a s h . h e x d i g e s t () for m in r . s m e m b e r s (’ v : ’+ e h h e x ): p r i n t ( m ) g . a d d _ e d g e ( v , m )

(32)

References

Related documents

it becomes more difficult to imagine why a difference between groups can occur, what alternative interpretations may exist for reported findings about an association detected

wafer prober (unpackaged chip test) or package handler (packaged chip test), touches chips through a socket (contactor).  Uses

Since visual reasoning skills are of great importance to mathematicians and non- mathematicians alike; since visual reasoning skills are part of teachers’ content knowledge

-multiple styles -fast thinking -expanded technique -experimentation -confident -structure problems -reduced understanding -diminished communication.. This will help sharpen your

more, the authors of Live Model Books have used these improvements to produce images of the human form that come as close as possible to what you would see, if you were observing

Understanding the effects of manipulations in the absence of pilot studies or previous research is important to the increasing number of studies using GC

*We who have ticked the Gift Aid column above want Cool Earth to reclaim the tax back on any donations we have made over the last six years and all future donations until we notify

Regarding empirical hypothesis, table 6 shows that the coefficient of the variable credit quality (which proxies for private information, and defines the