Forensic Clusters:
Advanced Processing with
Open Source Software
Jon Stewart Geoff Black
Who We Are
Lightbox Lightbox Guidance alum Mr. EnScript C++ & Java Search Mac Linux Developer Optimist Python PSTs Fortune 100 Financial Guidance alum NCIS (DDK/ManTech) Louisiana DOJ ACEDS Examiner Pessimist EnScript NSFsJon Stewart Geoff Black
Like many of you, we've dealt with a lot of data.
C#
The Problem
• CPU clock speed stopped doubling • Hard drive capacity kept doubling • Multicore CPUs to the rescue!
• ...but they're wasted on single-threaded apps • ...and hard drive transfer speeds might be too
Solution:
Distributed Processing
• Split the data up
• Process it in parallel • Scale out as needed
I know, let's use Oracle!
...
Storage quickly becomes the bottleneck
These guys have a lot of data... ...how do they do their processing?
What about these guys?
Apache Hadoop
Apache Hadoop is an open source software
suite for distributed processing and storage,
primarily inspired by Google's in-house systems.
The Foundation
• Apache Hadoop created at Yahoo!
• Based on Google MapReduce and File System • Apache HBase based on Google BigTable
Apache Hadoop:
Scalable, Fault-tolerant, Open Source
Source: http://www.cloudera.com/what-is-hadoop/hadoop-overview/
Apache Hadoop:
Scalable, Fault-tolerant, Open Source
Source: http://www.cloudera.com/what-is-hadoop/hadoop-overview/
HDFS
Just like any other file system, but better
• Metadata master server (“NameNode”)
– One server acts as FAT/MFT
– Knows where files & blocks are on cluster
• Blocks on separate machines (“DataNodes”)
– Automatic replication of blocks(default 3x) – All blocks have checksums
– Rack-aware
– DataNodes form p2p network
– Clients contact directly for reading/writing file data
• Not general purpose
– Files are read-only once written
Apache Hadoop:
Scalable, Fault-tolerant, Open Source
Source: http://www.cloudera.com/what-is-hadoop/hadoop-overview/
MapReduce
Bring the application to the data • Code is sent to every node in the cluster • Local data is then processed as "tasks"
– Support for custom file formats
– Failures are restarted elsewhere, or skipped
• One nodes can process several tasks at once
– Idle CPUs are bad
• Output is automatically collated
MapReduce Example
• Yahoo! has over 500M unique users / month • Billions of transactions per day
• Search Assist (type ahead)
– Based on years of log data
– Before Hadoop: 26 days processing time – After Hadoop: 20 minutes
HBase
Random access, updates, versioning
Conventional RDBMS (MySQL)
• Limited throughput • Not distributed
• Fields always allocated
– varchar[255]
• Strict schemas
• ACID integrity/transactions • Support for joins
• Indices speed things up
HBase
• High write throughput • Distributed automatically • Null values not stored
– A row is Name:Value list
• Schemas not imposed • Rows sorted by key
• No joins, no transactions • Good at table scans
Source: http://www.slideshare.net/cloudera/hadoop-world-2011-advanced-hbase-schema-design-lars-george-cloudera
HBase Auto-Sharding
A B C D E F G H I J K L M N O A B C D E F G H I J K L M N O Client Logical Table Regions Region ServersSo, let's do forensics on Hadoop!
• We'd done some work on proof-of-concept and design
• Army Intelligence Center of Excellence funded an open source prototype effort
• In collaboration with 42six Solutions, Basis Technologies, and Dapper Vision
Open Source Stack for
Distributed Forensic Processing
+ St or ag e & Jo b M an ag eme nt Lib rarie s The Sleuth Kit Pr oc essi ng Dapper Vision Picarus
fsrip plugins MRCoffee
Document Analysis
Pipeline
Ingest:
File Extraction / Hash Calculation
dd,E01 / The Sleuth Kit fsrip + md5sum sha1sum
Ingest
Pipeline
Processing Tasks
• Hash set analysis • Keyword searching • Text extraction • Document clustering • Face detection • Graphics clustering • Video analysis
• Not limited – parse *anything* with plugin interface
Processing
Dapper Vision Picarus MRCoffee plugins Document Analysis +Processing Plugins
• Plugin interface for community members to extend functionality
• Python + <other languages>
• Plugins can return data to the system:
– PST file -> Emails ->HBase rows
– Internet History records ->HBase rows – Extracting zip files ->HBase rows
Processing Flow
plugin Parsed
Processing with MRCoffee
• Sandboxed virtual processing
• Creates supermin VM on the fly with kvm
– 100s of KB. No network stack, no gcc, no nothin'.
• VM uses host kernel, glibc, etc.
• Includes whatever support files you give it • Uses character device for data transfer
• Inside VM, MRCoffee daemon pumps data to
your own utility on stdin, pipes data from stdout back to host HBase
Pipeline
Review:
Reporting
Review:
Review:
Wrap Up
• Yes, we can scale forensics
– We can do it without dongles
– We can do it without fancy hardware
• We're CPU rich, finally
– We can add machines and do more/better analysis
• Not quite ready for primetime... • ...some lab will need to be first
Jon Stewart
Geoff Black | [email protected]| [email protected] https://github.com/jonstewart/tp