Forensic Clusters: Advanced Processing with Open Source Software. Jon Stewart Geoff Black

(1)

Forensic Clusters:

Advanced Processing with

Open Source Software

Jon Stewart Geoff Black

(2)

Who We Are

Lightbox Lightbox Guidance alum Mr. EnScript C++ & Java Search Mac Linux Developer Optimist Python PSTs Fortune 100 Financial Guidance alum NCIS (DDK/ManTech) Louisiana DOJ ACEDS Examiner Pessimist EnScript NSFs

Jon Stewart Geoff Black

Like many of you, we've dealt with a lot of data.

C#

(3)

(4)

(5)

The Problem

• CPU clock speed stopped doubling • Hard drive capacity kept doubling • Multicore CPUs to the rescue!

• ...but they're wasted on single-threaded apps • ...and hard drive transfer speeds might be too

(6)

Solution:

Distributed Processing

• Split the data up

• Process it in parallel • Scale out as needed

(7)

I know, let's use Oracle!

...

(8)

Storage quickly becomes the bottleneck

(9)

These guys have a lot of data... ...how do they do their processing?

(10)

What about these guys?

(11)

Apache Hadoop

Apache Hadoop is an open source software

suite for distributed processing and storage,

primarily inspired by Google's in-house systems.

(12)

The Foundation

• Apache Hadoop created at Yahoo!

• Based on Google MapReduce and File System • Apache HBase based on Google BigTable

(13)

Apache Hadoop:

Scalable, Fault-tolerant, Open Source

Source: http://www.cloudera.com/what-is-hadoop/hadoop-overview/

(14)

Apache Hadoop:

(15)

HDFS

Just like any other file system, but better

• Metadata master server (“NameNode”)

– One server acts as FAT/MFT

– Knows where files & blocks are on cluster

• Blocks on separate machines (“DataNodes”)

– Automatic replication of blocks(default 3x) – All blocks have checksums

– Rack-aware

– DataNodes form p2p network

– Clients contact directly for reading/writing file data

• Not general purpose

– Files are read-only once written

(16)

Apache Hadoop:

(17)

MapReduce

Bring the application to the data • Code is sent to every node in the cluster • Local data is then processed as "tasks"

– Support for custom file formats

– Failures are restarted elsewhere, or skipped

• One nodes can process several tasks at once

– Idle CPUs are bad

• Output is automatically collated

(18)

MapReduce Example

• Yahoo! has over 500M unique users / month • Billions of transactions per day

• Search Assist (type ahead)

– Based on years of log data

– Before Hadoop: 26 days processing time – After Hadoop: 20 minutes

(19)

HBase

Random access, updates, versioning

Conventional RDBMS (MySQL)

• Limited throughput • Not distributed

• Fields always allocated

– varchar[255]

• Strict schemas

• ACID integrity/transactions • Support for joins

• Indices speed things up

HBase

• High write throughput • Distributed automatically • Null values not stored

– A row is Name:Value list

• Schemas not imposed • Rows sorted by key

• No joins, no transactions • Good at table scans

(20)

Source: http://www.slideshare.net/cloudera/hadoop-world-2011-advanced-hbase-schema-design-lars-george-cloudera

HBase Auto-Sharding

A B C D E F G H I J K L M N O A B C D E F G H I J K L M N O Client Logical Table Regions Region Servers

(21)

So, let's do forensics on Hadoop!

• We'd done some work on proof-of-concept and design

• Army Intelligence Center of Excellence funded an open source prototype effort

• In collaboration with 42six Solutions, Basis Technologies, and Dapper Vision

(22)

Open Source Stack for

Distributed Forensic Processing

+ St or ag e & Jo b M an ag eme nt Lib rarie s The Sleuth Kit Pr oc essi ng Dapper Vision Picarus

fsrip plugins MRCoffee

Document Analysis

(23)

Pipeline

(24)

Ingest:

File Extraction / Hash Calculation

dd,E01 / The Sleuth Kit fsrip + md5sum sha1sum

(25)

Ingest

Pipeline

(26)

Processing Tasks

• Hash set analysis • Keyword searching • Text extraction • Document clustering • Face detection • Graphics clustering • Video analysis

• Not limited – parse *anything* with plugin interface

(27)

Processing

Dapper Vision Picarus MRCoffee plugins Document Analysis +

(28)

Processing Plugins

• Plugin interface for community members to extend functionality

• Python + <other languages>

• Plugins can return data to the system:

– PST file -> Emails ->HBase rows

– Internet History records ->HBase rows – Extracting zip files ->HBase rows

(29)

Processing Flow

plugin Parsed

(30)

Processing with MRCoffee

• Sandboxed virtual processing

• Creates supermin VM on the fly with kvm

– 100s of KB. No network stack, no gcc, no nothin'.

• VM uses host kernel, glibc, etc.

• Includes whatever support files you give it • Uses character device for data transfer

• Inside VM, MRCoffee daemon pumps data to

your own utility on stdin, pipes data from stdout back to host HBase

(31)

Pipeline

(32)

Review:

Reporting

(33)

Review:

(34)

Review:

(35)

Wrap Up

• Yes, we can scale forensics

– We can do it without dongles

– We can do it without fancy hardware

• We're CPU rich, finally

– We can add machines and do more/better analysis

• Not quite ready for primetime... • ...some lab will need to be first

(36)

Jon Stewart

Geoff Black | [email protected]| [email protected] https://github.com/jonstewart/tp