• No results found

Mining for Bugs with Graph Database Queries

N/A
N/A
Protected

Academic year: 2021

Share "Mining for Bugs with Graph Database Queries"

Copied!
63
0
0

Loading.... (view fulltext now)

Full text

(1)

GEORG-AUGUST-UNIVERSITÄT GÖTTINGEN

Mining for Bugs with Graph Database Queries

Fabian Yamaguchi (@fabsx00) 31C3

(2)

GEORG-AUGUST-UNIVERSITÄT GÖTTINGEN

Hello 31C3!

2

» It’s been 5 years since my last CCC talk 

» Talk back then: mostly about hard earned bugs

» Talk today: my attempts since then to make bug hunting a little less painful.

(3)

» Finding tiny problems in huge code bases

» Move away from precise but hard to scale methods typical for program

analysis

» Engineering perspective Robust (inexact) analysis that recognizes bugs in

noisy code bases

» Assist auditors in their daily work, don’t try to replace them.

(4)

GEORG-AUGUST-UNIVERSITÄT GÖTTINGEN

Combining two areas of research

4

Graph Databases

Good old CS

(5)

Stefan Esser’s libssh2 Bug (Syscan’13)

(6)

GEORG-AUGUST-UNIVERSITÄT GÖTTINGEN

Stefan Esser’s libssh2 Bug

6

Summation inside allocation

(7)

Stefan Esser’s libssh2 Bug

(8)

GEORG-AUGUST-UNIVERSITÄT GÖTTINGEN 8

» Q: How did he find the bug?

» Whitebox fuzzer enhanced with symbolic execution?

» Machine learning powered anomaly detector?

(9)

» Q: How did he find the bug?

» Whitebox fuzzer enhanced with symbolic execution?

» Machine learning powered anomaly detector?

» Theorem proving? Model checking?

» A: regular expression for grep.

(10)

GEORG-AUGUST-UNIVERSITÄT GÖTTINGEN

What I think this should tell us

» If used right, even primitive tools like grep are powerful

Allow auditor knowledge to guide the analysis!

» Query is a simple model making auditor knowledge explicit » High false positive rates are often tolerated in practice

(11)

Goal: A Robust Search Engine for Source Code

Query Response Robust Parser Source Code Pattern Recognition

(12)

GEORG-AUGUST-UNIVERSITÄT GÖTTINGEN

Aspects of code a query should be able to model

» What do statements look like?

» Can we get from one statement to another? » How do statements affect each other?

» Let’s see how compilers deal with this

(13)

Typical graph representations for code analysis

AST

CFG

PDG

(14)

GEORG-AUGUST-UNIVERSITÄT GÖTTINGEN

Representations have different strengths

14

» None of these representations can “do it all”

» Typical query: Find a call to foo in a statement where data flow exists to a statement that contains a call to bar.

» Creates the need to transition from one graph

representation to the other. (here: AST -> PDG -> AST) » Can’t we get a representation to account for all of these

(15)

Idea: Merge representations at statement nodes

» Sub-graphs for statements in AST, CFG and PDG

AST

CFG

(16)

GEORG-AUGUST-UNIVERSITÄT GÖTTINGEN

Code Property Graph

16

(17)

Now, let’s just put it in a database and we’re done

» A number of great RE tools use RDBMS backends » Tried to map graphs to tables… failed

» Tried to map graphs to documents … failed

(18)

GEORG-AUGUST-UNIVERSITÄT GÖTTINGEN

Native storage format of RDBMs: Tables

18

Image from: http://neo4j.com/

» Relationships can be expressed via “JOIN tables”

» Lookup time proportional to size of JOIN table => inefficient lookups for data with lots of many-to-many relationships

(19)

Bid data! Cloud Era! NoSQL!

» Ignore the guy in the suit, this stuff actually makes sense from a technical point of view

» “NoSQL”: Storing and querying databases differently

» A search for alternatives to the relational database model

» Graph databases: NoSQL database that uses graphs as a native storage format

(20)

GEORG-AUGUST-UNIVERSITÄT GÖTTINGEN

Native storage format of GraphDBs: Property Graphs

20

(21)

Querying property graphs with

Traversals

» Choose a set of start nodes

» Describe where to walk from there

» If it’s possible to walk the graph according to your

description, nodes reached by the walk are returned.

» Ex.: Return all objects created by people over the age of 30 » Start nodes: all nodes where “age > 30”

» Follow outgoing ‘created’ edges

(22)

GEORG-AUGUST-UNIVERSITÄT GÖTTINGEN

Start at all nodes. Filter for “age > 30”

22

Image from: http://markorodriguez.com/2011/08/03/on-the-nature-of-pipes/

Gremlin traversal:

g.V.filter{ it.age > 30} .out(“created”)

(23)

Following outgoing “created” edges

Gremlin traversal:

g.V.filter{ it.age > 30 }. .out(“created”)

(24)

GEORG-AUGUST-UNIVERSITÄT GÖTTINGEN

“Stored procedures”: Custom Steps:

24 Gremlin.defineStep(“PeopleWeCanFireNow”, [Vertex, Pipe], {

_().filter{ it.age > 30 }. .out(“created”).back(1) })

PeopleWeCanFireNow

out(“created”) filter{ it.age > 30}

=:

(25)

Graph databases: a perfect match for social networks!

» Example: Facebook Graph Search

(26)

GEORG-AUGUST-UNIVERSITÄT GÖTTINGEN

Graph databases: a perfect match for social networks!

26

» Example: Facebook Graph Search

(27)

Graph databases: a perfect match for social networks!

» Example: Facebook Graph Search

(28)

GEORG-AUGUST-UNIVERSITÄT GÖTTINGEN

But they are not limited to storage of useless crap!

» Code property graphs are property graphs by design!

» You can store code property graphs in graph databases!

(29)

Gremlin.defineStep(“PeopleWeCanFireNow”, [Vertex, Pipe], { _().filter{ it.age > 30 }. .out(“created”)

PeopleWeCanFireNow

filter{ it.age > 30}

=:

Allows us to define a domain specific query

language for vulnerability discovery!

out(“created”) filter{ it.age > 30}

(30)

GEORG-AUGUST-UNIVERSITÄT GÖTTINGEN

» A platform for code analysis based on code property graphs stored in a graph database

» Extensible query language

» Scriptable via a python interface (python-joern)

» Shell tools for common operations (joern-tools)

» Known to work on other people’s machines! ;) » http://mlsec.org/joern

(31)

Architecture

python-joern Filesystem: “.joernIndex/” Importer REST API Port: 7474

(1) Import code

(2) Start database server

from joern.all import JoernSteps query = “”” … “””

j = JoernSteps()

j.connectToDatabase()

res = j.runGremlinQuery(query) for r in res: print r

(3) Run your scripts

joern-tools

(4) Or use the

shell utils

(32)

GEORG-AUGUST-UNIVERSITÄT GÖTTINGEN

Our first test subject: The VLC Media Player

32

Disclaimer:

» Auditing code for vulnerabilities is NOT about making fun of people. » VLC developers are doing an

exceptionally good job at coding. » The fact that this is a popular

program with a lot of functionality makes it interesting for us.

» Bugs have been reported and

(33)

Importing code and starting the server

$ joern /home/fabs/sourceCode/vlc-2.1.5/ $ neo4j console

(34)

GEORG-AUGUST-UNIVERSITÄT GÖTTINGEN

Selecting nodes using Apache Lucene

34 ls -lh .joernIndex total 473M … […] 4.0K Dec 14 14:45 index […] 36M Dec 14 14:45 neostore.nodestore.db […] 233M Dec 14 14:45 neostore.propertystore.db […] 44M Dec 14 14:45 neostore.propertystore.db.strings […] 161M Dec 14 14:45 neostore.relationshipstore.db …

Integrated Text Search Engine

for start node selection

(35)

Lucene Queries with

joern-lookup

echo "type:File AND filepath:*demux*" | joern-lookup -c

1015093 filepath:…/modules/demux/tta.c type:File 1016218 filepath:…/modules/demux/ty.c type:File 1024673 filepath:…/modules/demux/vc1.c type:File …

Lucene Query Language

Node id

Attributes

(36)

GEORG-AUGUST-UNIVERSITÄT GÖTTINGEN

Gremlin Queries with

joern-lookup -g

36 echo 'queryNodeIndex("type:File AND filepath:*demux*") .out().filter{it.type == "Function"}.name' |

joern-lookup -g

Custom step

Lucene Query Language

Get names of all functions in files matching *demux*

Control Demux Close Open parse_chunk_headers get_chunk_header …

(37)

“But how do I know what is in the database?”

echo

'getFunctionsByName("GetAoutBuffer").id‘ | joern-lookup -g | tail -n 1

| joern-plot-proggraph -ddg -cfg

Plot control flow

Plot data flow

Plot control and data flow for function GetAoutBuffer

Big thanks to Alwin Maier

for writing the plotting tools!

(38)

GEORG-AUGUST-UNIVERSITÄT GÖTTINGEN

Let’s look for a variation of the libssh bug in VLC

38 echo 'getCallsTo("malloc").ithArguments("0")

.sideEffect{cnt = it.code }

.match{ it.type =="AdditiveExpression"}.statements() .out("REACHES")

.match{ it.type == "CallExpression" &&

it.code.startsWith("memcpy")}.ithArguments("2") .filter{it.code != cnt }

.match{it.type == "AdditiveExpression"}.id‘

| joern-lookup -g | sort | uniq | joern-location > foo

Get calls to malloc where first argument contains and additive expression and a call to memcpy is reached by data flow where the third argument also contains an additive expression and the two additive expression are not equal.

(39)

/src/input/vlmshell.c:844:0:26302:29833

/modules/audio_output/pulse.c:85:0:2962:3621 /modules/codec/quicktime.c:667:0:22561:29957 /modules/demux/mp4/libmp4.c:2595:0:81633:82214

$ cat foo

$ grep ‘mp4’ foo | joern-code

(40)

GEORG-AUGUST-UNIVERSITÄT GÖTTINGEN

p_box->i_size = 7

triggers the overflow

40

static int MP4_ReadBox_name( stream_t *p_stream, MP4_Box_t *p_box ) {

MP4_READBOX_ENTER( MP4_Box_data_name_t ); p_box->data.p_name->psz_text =

malloc( p_box->i_size + 1 - 8 ); /* +\0, -name, -size */ if( p_box->data.p_name->psz_text == NULL )

MP4_READBOX_EXIT( 0 );

memcpy( p_box->data.p_name->psz_text, p_peek, p_box->i_size - 8 ); p_box->data.p_name->psz_text[p_box->i_size - 8] = '\0';

MP4_READBOX_EXIT( 1 ); }

(41)

From “The Art of Software Security Assessment”

“In fact, two of your authors typically start reviewing a new codebase by finding the equivalent of the util/ directory and reading the framework and glue code line by line.”, Dowd, McDonald, Schuh (2006)

» Finding fundamental difficulties in using internal and external APIs can help

uncover a lot of vulnerabilities

» Let’s checkout the API for processing of data streams in VLC

(42)

GEORG-AUGUST-UNIVERSITÄT GÖTTINGEN

Difficulty: Handling files larger than 4GB

42

» Size of a stream can be larger than UINT32_MAX

» sizeof(size_t) == sizeof(uint32_t) on 32 bit platforms

» Any allocation that depends on stream_Size will need to ensure that the stream_Size fits into a size_t!

(43)

Difficulty: Handling files larger than 4GB

» Size of a stream can be larger than UINT32_MAX

» sizeof(size_t) == sizeof(uint32_t) on 32 bit platforms

» Any allocation that depends on stream_Size will need to ensure that the stream_Size fits into a size_t!

(44)

GEORG-AUGUST-UNIVERSITÄT GÖTTINGEN

Truncations in allocations

44

echo 'getCallsTo("stream_Size").statements()

.filter{ it.code.contains("int64_t") }.out("REACHES")

.filter{ it.code.contains("malloc")}.code' | joern-lookup -g

chunk -> data = malloc ( size )

uint8_t * data = ( uint8_t * ) malloc ( size ) ;

char * p_buffer = ( i_size > 0 ) ? malloc ( i_size ) : NULL ;

psz_update_data = malloc ( i_read + 1 )

uint8_t * p_buf = ( uint8_t * ) malloc ( i_size ) ;

uint8_t * p_unarmored = ( uint8_t * ) malloc ( ( i_size * 3 ) / 4 + 1 ) ; uint8_t * p_buf = ( uint8_t * ) malloc ( i_size ) ;

Give me statements containing calls to stream_Size and the symbol

“int64_t” where data flow exists to statements containing the symbol malloc.

(45)
(46)

GEORG-AUGUST-UNIVERSITÄT GÖTTINGEN

Side-note: Signatures are used to verify updates

(47)
(48)

GEORG-AUGUST-UNIVERSITÄT GÖTTINGEN

Let’s look at the binary straight away

(49)
(50)

GEORG-AUGUST-UNIVERSITÄT GÖTTINGEN

» Edited ...\etc\hosts: update.videolan.org => my web server » Created a 4294967295 Byte of “A”s (vlc/status-win-x86)

(51)

lolwut?!?

» Edited ...\etc\hosts: update.videolan.org => my web server

» Created a 4294967295 Byte of “A”s (vlc/status-win-x86) » Attached Debugger and checked for updates in VLC…

(52)

GEORG-AUGUST-UNIVERSITÄT GÖTTINGEN

lolwut?!?

52

» Edited ...\etc\hosts: update.videolan.org => my web server » Created a 4294967295 Byte of “A”s (vlc/status-win-x86)

» Attached Debugger and checked for updates in VLC…

Arguing in favor

of patching just

got a lot easier.

(53)

Notes on exploitation

» stream_Read is our friend 

» Header “Content-Length: 4294967295” triggers bug » Attacker fully controls amount of data to actually copy

» Data is copied in blocks by stream_Read

» In between copy-operations, stream_Read uses dereferences function pointers!

» ASLR/DEP enabled but limited amount of position dependant code allows for custom ROP chain

» Downsides: multiple threads make stable exploitation difficult » Demo.

(54)

GEORG-AUGUST-UNIVERSITÄT GÖTTINGEN

Our second test subject: The Linux Kernel

54

» Large code base with a lot of users » Well suited for static analysis as

hardware for drivers is typically not available

(55)

Defining steps for more complex traversals

» Unsanitized

» The traversal returns attacker-controlled sources only if they meet the following two conditions.

» There exists a path from the source statement to the sink statement in the control flow graph such that no node on the path matches any of the sanitizer descriptions.

» A variable defined by the source and used by the sink

reaches the sink via the control flow path, i.e., the variable is not redefined by any node on the path.

» Pretty complex, so let’s implement it once and the re-use it 

(56)

GEORG-AUGUST-UNIVERSITÄT GÖTTINGEN

Example: Query for buffer overflows in write handlers

56 sink sanitizers source query1 = “”” getFunctionASTsByName('*_write*') .getArguments('(copy_from_user OR memcpy)', '2') .sideEffect{ paramName = 'c(ou)?nt'; }

.filter{ it.code.matches(paramName) } .unsanitized( { it._().or( _().isCheck('.*' + paramName + '.*'), _().codeContains('.*alloc.*' + paramName + '.*'), _().codeContains('.*min.*') )} ) .param( '.*c(ou)?nt.*' ) .locations() “””

(57)

Slight variation of the query

query2 = “””

getArguments('(copy_from_user OR memcpy)', '2')

.filter{ !it.argToCall().toList()[0].code.matches('.*(sizeof|min).*') } .sideEffect{ argument = it.code; } /* store argument */

.sideEffect{ dstId = it.statements().toList()[0].id; } /* store id of sink */

.unsanitized({ it._().or( _().isCheck('.*' + Pattern.quote(argument) + '.*') , _().codeContains('.*alloc.*' + Pattern.quote(argument) + '.*'), _().codeContains('.*min.*') ) }, { it._().filter{ it.code.contains('copy_from_user')} } )

.filter{ it.id != dstId } /* filter flows from node to self */

(58)

GEORG-AUGUST-UNIVERSITÄT GÖTTINGEN

Profit: Seven 0-days in eleven hits

58

(59)
(60)

GEORG-AUGUST-UNIVERSITÄT GÖTTINGEN

Practical Evaluation with Nico Golde (Qualcomm)

» Get somebody working in the industry

to actually try this!

» Found about 100 issues in an

internal audit at Qualcomm in their kernel code (9 queries)

» Evaluation for paper: audit of the Linux Kernel Mainline

» Buffer overflows (2 queries)

» Zero-byte allocation (1 query)

» Memory mapping bugs (1 query)

» Memory disclosure (1 query)

(61)

Kernel 0-day 

(62)

GEORG-AUGUST-UNIVERSITÄT GÖTTINGEN

Conclusion

» Introduced a system to mine code bases for vulnerabilities » Built a bridge between program analysis and graph DBs » Found real exploitable bugs in popular code bases

» Part of a larger effort to explore what we can do with pattern recognition and data mining techniques to discover

vulnerabilities

(63)

Thank you! Questions?

» Implementation

» http://mlsec.org/joern (webpage + documentation)

» http://github.com/fabsx00/joern (main system)

» http://github.com/fabsx00/python-joern (traversals)

» http://github.com/fabsx00/joern-tools (shell utils)

» Fabian Yamaguchi » fabs@goesec.de

» http://codeexploration.blogspot.de (my webpage)

References

Related documents

The model for these cooperatives is characterized by their commitment to people, modern management techniques, quality and process excellence, innovation, significant

To study the role of the PNS in absorptive phase insulin release, we measured plasma con- centrations of glucose as well as islet hormones and incretins in six healthy rhesus

link task task Server components Client components Screen sharing Data: Cockpit parts Video image(s) Cockpit parts Traces, parameters, settings Simulation Real-world Video

Moreover, some control schemes such as the computed torque controller rely directly on the dynamics model to predict the desired actuator force to be used in a feedforward

To provide the forest managers and nature conservation personnel with the first-hand experience on the nature conservation oriented management of the floodplain forests,

Finally, interpreting students are often exposed to intense student competition during their education and experience an overly critical perspective of the field during

7 In sensitivity analysis, the model is calibrated for nominal portfolios in Canada as a check on the importance of portfolio structure and, subsequently, for

didates and I don't think it means the search won': go. ' Haeger responded to the crowd of predominantly June. He said his decision ~o apply,~or S!UC's top spot, She said