Big Data for the JVM developer
Costin Leau, Elasticsearch
@costinl
Agenda
Data Trends
Data Pipelines
JVM and Big Data
Tool Eco-system
Data Landscape
Data Trends
http://www.emc.com/leadership/programs/digital-universe.htm
Enterprise Data Trends
Enterprise Data Trends
Enterprise Data Trends
Enterprise Data Trends
Unstructured data
No predefined model
Often doesn’t fit in
RDBMS
Pre-Aggregated Data
Computed during data
collection
Counters
Running Averages
Cost Trends
Hardware cost halving
every 18 months
Big Iron: $40k/CPU
Commodity Cluster: $1k/CPU
Cost Trends
Hardware cost halving
every 18 months
Big Iron: $40k/CPU
Commodity Cluster: $1k/CPU
Value of Data
Value from Data Exceeds
Hardware & Software costs
US retail
•60+% increase in net margin possible
•0.5-1.0% annual productivity growth
Big Data
“Big data” refers to datasets whose size is
beyond the ability of typical database
software tools to capture, store, manage,
and analyze
A subjective and moving target
Big data in many sectors today range from
10’s of TB to multiple PB
(Big) Data Pipeline
Big Data Pipeline
ETL
Real Time
Streams
Unstructured Data (HDFS) Real Time
Structured Database
(hBase, Gemfire, Cassandra)
Big SQL
(Greenplum, AsterData,
Etc…)
Batch Processing
Real-Time
Processing
(s4, storm)
AnalyticsData Pipeline
Interactive Processing
Collect Transform RT Analysis Ingest Batch Analysis Distribute Use
Unstructured Data in Big Data Filesystem
HDFSHDFS
Transform
Elasticsearch Elasticsearch
BIG SQL BIG SQL SQLSQL
HBase
HBase CassandraCassandra
Data Grid Data Grid
RT Processing
Collect
Data Pipeline
Collect Transform RT Analysis Ingest Batch Analysis Distribute Use
Batch Processing
(Hadoop)
Unstructured Data in Big Data Filesystem
HDFSHDFS
Interactive Processing
Elasticsearch Elasticsearch
BIG SQL BIG SQL SQLSQL
HBase
HBase CassandraCassandra
Data Grids Data Grids
Data Analytics
Data Presentation
Taming Big Data
JVM as the platform
Portable
Fast
Secure
Rich eco-system
Massive adoption in the enterprise
Hadoop Distributed File System (HDFS)
Map Reduce Framework (M/R)
JVM as the platform
Unstructured Data (HDFS)
Storage - HDFS
Distributed
Scalable
Portable
Data Aware
Commodity
hardware
Computation – Map/Reduce
Counting Words
aka ‘Hello World’
public class TokenizerMapper extends
Mapper<Object, Text, Text, IntWritable> {
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
StringTokenizer itr = new StringTokenizer(value.toString());
while (itr.hasMoreTokens()) {
word.set(itr.nextToken());context.write(word, one);
}}}
public class IntSumReducer extends
Reducer<Text,IntWritable,Text,IntWritable> { private IntWritable result = new IntWritable();
public void reduce(Text key, Iterable<IntWritable> values,
Context context) throws IOException, InterruptedException { int sum = 0;
for (IntWritable val : values) { sum += val.get(); } result.set(sum); context.write(key, result); }
}
Computation – Map/Reduce
Hadoop Streaming
$HADOOP_HOME/bin/hadoop jar \
hadoop-streaming.jar \
-input myInputDirs \
-output myOutputDir \
-mapper /bin/cat \
-reducer /bin/wc
Hadoop Streaming
$HADOOP_HOME/bin/hadoop jar \
hadoop-streaming.jar \
-input myInputDirs \
-output myOutputDir \
-mapper parseLine.py \
-reducer /bin/wc
Cascading
Mid-level abstraction on top of M/R
Hides M/R plumbing through building blocks
Handles process planning and scheduling
JVM based (Java, Clojure* and Scala*)
* External projects to Cascading
Cascading – Counting Words
Scheme sourceScheme = new TextLine(new Fields("line"));
Tap source = new Hfs(sourceScheme, inputPath);
Scheme sinkScheme = new TextLine(new Fields("word", "count"));
Tap sink = new Hfs(sinkScheme, outputPath, SinkMode.REPLACE);
Pipe assembly = new Pipe("wordcount");
String regex = "(?<!\\pL)(?=\\pL)[^ ]*(?<=\\pL)(?!\\pL)";
Function function = new RegexGenerator(new Fields("word"), regex);
assembly = new Each(assembly, new Fields("line”), function );
assembly = new GroupBy(assembly, new Fields("word”) );
Aggregator count = new Count(new Fields("count”));
assembly = new Every(assembly, count);
Scalding – Counting Words
package com.twitter.scalding.examples import com.twitter.scalding._
class WordCountJob(args : Args) extends Job(args) { TextLine( args("input") )
.flatMap('line -> 'word) { line : String => tokenize(line) } .groupBy('word) { _.size }
.write( Tsv( args("output") ) )
def tokenize(text : String) : Array[String] = {
text.toLowerCase.replaceAll("[^a-zA-Z0-9\\s]", "").split("\\s+") }
}
Cascalog – Counting Words
(ns count-words.core (:use cascalog.api)
(:require [cascalog.ops :as c])) (defmapcatop split
[^String sentence]
(.split sentence "\\s+"))
(defn wordcount-query [src]
(<- [?word ?count]
(src ?textline)
(split ?textline :> ?word) (c/count ?count)))
Apache Pig
High-level abstraction on top of M/R
Procedural ETL scripting language
Extensible (Java, Python, Ruby or Groovy)
input_lines = LOAD '/tmp/books' AS (line:chararray);
-- Extract words from each line and put them into a pig bag -- datatype, then flatten the bag to get one word on each row
words = FOREACH input_lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
-- filter out any words that are just white spaces filtered_
words = FILTER words BY word MATCHES '\\w+';
-- create a group for each word
word_groups = GROUP filtered_words BY word;
-- count the entries in each group
word_count = FOREACH word_groups GENERATE COUNT(filtered_words) AS count, group AS word;
-- order the records by count
ordered_word_count = ORDER word_count BY count DESC;
STORE ordered_word_count INTO '/tmp/number-of-words';
Apache Pig
Apache Hive
SQL-like abstraction on top of M/R
Allows basic ETL
Extensible (Java, Python, Ruby or Groovy)
Counting Words – Hive
-- import the file as lines
CREATE EXTERNAL TABLE lines(line string)
LOAD DATA INPATH ‘books’ OVERWRITE INTO TABLE lines;
-- create a virtual view that splits the lines
SELECT word, count(*) FROM lines
LATERAL VIEW explode(split(text, ‘ ‘ )) lTable as word
GROUP BY word;
Eco-system
Oozie
HBase
Mahout
Spring for Apache Hadoop
Kafka
Elasticsearch
Storm
Big Data Pipeline
ETL
Real Time
Streams
Unstructured Data (HDFS) Real Time
Structured Database
(hBase, Gemfire, Cassandra)
Big SQL
(Greenplum, AsterData,
Etc…)
Batch Processing