Big Data for the JVM developer. Costin Leau

(1)

Big Data for the JVM developer

Costin Leau, Elasticsearch

@costinl

(2)

Agenda

Data Trends

Data Pipelines

JVM and Big Data

Tool Eco-system

(3)

Data Landscape

(4)

Data Trends

http://www.emc.com/leadership/programs/digital-universe.htm

(5)

Enterprise Data Trends

Unstructured data

No predefined model

Often doesn’t fit in an RDBMS

Pre-Aggregated Data

Computed during data collection

Counters

Running Averages
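A minimal sketch of pre-aggregation, assuming a single-threaded collector. The class name `RunningStats` is hypothetical and not from the talk. It updates a counter and a running average as each value arrives, so the raw values never need to be stored.

```java
// Hypothetical helper for illustration; not part of any library named in the talk.
public class RunningStats {
    private long count = 0;     // number of values seen so far
    private double mean = 0.0;  // running average of those values

    // Fold one new value into the aggregates without storing it.
    public void add(double value) {
        count++;
        mean += (value - mean) / count;  // incremental mean update
    }

    public long getCount() { return count; }

    public double getMean() { return mean; }

    public static void main(String[] args) {
        RunningStats stats = new RunningStats();
        for (double v : new double[] {10, 20, 30, 40}) {
            stats.add(v);
        }
        System.out.println(stats.getCount() + " values, average " + stats.getMean());
    }
}
```

The incremental update keeps memory use constant regardless of how many values are collected, which is the point of aggregating at collection time.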

(9)

Cost Trends

Hardware cost halving every 18 months

Big Iron: $40k/CPU

Commodity Cluster: $1k/CPU

(10)

Value of Data

Value from Data Exceeds Hardware & Software costs

US retail

• 60+% increase in net margin possible

• 0.5-1.0% annual productivity growth

(12)

Big Data

“Big data” refers to datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze

A subjective and moving target

Big data in many sectors today ranges from tens of TB to multiple PB

(13)

(Big) Data Pipeline

(14)

Big Data Pipeline

[Diagram: components of the big data pipeline]

ETL

Real-Time Streams

Unstructured Data (HDFS)

Real-Time Structured Database (HBase, GemFire, Cassandra)

Big SQL (Greenplum, Aster Data, etc.)

Batch Processing

Real-Time Processing (S4, Storm)

Analytics

(15)

Data Pipeline

Stages: Collect, Transform, RT Analysis, Ingest, Batch Analysis, Distribute, Use

Collect

Transform

RT Processing

Unstructured Data in Big Data Filesystem (HDFS)

Interactive Processing

Elasticsearch

Big SQL

SQL

HBase

Cassandra

Data Grid

(16)

Data Pipeline

Stages: Collect, Transform, RT Analysis, Ingest, Batch Analysis, Distribute, Use

Batch Processing (Hadoop)

Unstructured Data in Big Data Filesystem (HDFS)

Interactive Processing

Elasticsearch

Big SQL

SQL

HBase

Cassandra

Data Grids

Data Analytics

Data Presentation

(17)

Taming Big Data

(18)

JVM as the platform

Portable

Fast

Secure

Rich eco-system

Massive adoption in the enterprise

(19)

JVM as the platform

Hadoop Distributed File System (HDFS)

MapReduce Framework (M/R)

(20)

Unstructured Data (HDFS)

Storage - HDFS

Distributed

Scalable

Portable

Data Aware

Commodity hardware

(21)

Computation – Map/Reduce

(22)

Counting Words

aka ‘Hello World’

(23)

Computation – Map/Reduce

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: emits (word, 1) for every token in the input line
public class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
        }
    }
}

// Reducer: sums the counts emitted for each word
public class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}
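The mapper and reducer only run inside the Hadoop runtime. As a rough sketch of the same data flow on a single JVM, the plain-Java program below walks through the map, shuffle and reduce phases by hand. `LocalWordCount` and its method names are made up for this illustration; it uses no Hadoop classes and says nothing about how Hadoop implements these phases internally.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical single-JVM illustration of the map -> shuffle -> reduce flow.
public class LocalWordCount {

    // Map phase: emit a (word, 1) pair for every token in a line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            pairs.add(Map.entry(itr.nextToken(), 1));
        }
        return pairs;
    }

    // Shuffle phase: group all emitted values by key, sorted by key.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                   .add(pair.getValue());
        }
        return grouped;
    }

    // Reduce phase: sum the values collected for one key.
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Run all three phases over the given lines.
    static Map<String, Integer> run(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        Map<String, Integer> counts = new TreeMap<>();
        shuffle(pairs).forEach((word, ones) -> counts.put(word, reduce(ones)));
        return counts;
    }

    public static void main(String[] args) {
        // prints {be=2, do=1, not=1, or=1, to=3}
        System.out.println(run(List.of("to be or not to be", "to do")));
    }
}
```

In a real job the map calls run in parallel across the cluster and the shuffle moves data between machines; here everything happens in one list and one map.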

(24)

Hadoop Streaming

$HADOOP_HOME/bin/hadoop jar \
    hadoop-streaming.jar \
    -input myInputDirs \
    -output myOutputDir \
    -mapper /bin/cat \
    -reducer /bin/wc

(25)

Hadoop Streaming

$HADOOP_HOME/bin/hadoop jar \
    hadoop-streaming.jar \
    -input myInputDirs \
    -output myOutputDir \
    -mapper parseLine.py \
    -reducer /bin/wc
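A streaming mapper is any executable that reads input lines from stdin and writes tab-separated key/value records to stdout. The contents of parseLine.py are not shown in the talk, so the sketch below is only an assumed stand-in: a hypothetical `StreamingWordMapper` that emits one "word<TAB>1" record per token.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a streaming mapper; the logic is an assumption,
// not a translation of the parseLine.py script named on the slide.
public class StreamingWordMapper {

    // Turn one input line into "word<TAB>1" records.
    static List<String> mapLine(String line) {
        List<String> records = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                records.add(token + "\t1");
            }
        }
        return records;
    }

    // Read lines from stdin and write the mapped records to stdout.
    public static void main(String[] args) throws IOException {
        BufferedReader in = new BufferedReader(
                new InputStreamReader(System.in, StandardCharsets.UTF_8));
        String line;
        while ((line = in.readLine()) != null) {
            for (String record : mapLine(line)) {
                System.out.println(record);
            }
        }
    }
}
```

Because the contract is just text over stdin and stdout, the same mapper logic could be written in Python, shell or any other language available on the cluster nodes.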

(26)

Cascading

Mid-level abstraction on top of M/R

Hides M/R plumbing through building blocks

Handles process planning and scheduling

JVM based (Java, Clojure* and Scala*)

* External projects to Cascading

(27)

Cascading – Counting Words

// source tap: read each input line into the "line" field
Scheme sourceScheme = new TextLine(new Fields("line"));
Tap source = new Hfs(sourceScheme, inputPath);

// sink tap: write (word, count) pairs, replacing any previous output
Scheme sinkScheme = new TextLine(new Fields("word", "count"));
Tap sink = new Hfs(sinkScheme, outputPath, SinkMode.REPLACE);

// pipe assembly: split lines into words, group by word, count each group
Pipe assembly = new Pipe("wordcount");
String regex = "(?<!\\pL)(?=\\pL)[^ ]*(?<=\\pL)(?!\\pL)";
Function function = new RegexGenerator(new Fields("word"), regex);
assembly = new Each(assembly, new Fields("line"), function);
assembly = new GroupBy(assembly, new Fields("word"));
Aggregator count = new Count(new Fields("count"));
assembly = new Every(assembly, count);
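The word-splitting regex is the least obvious part of the Cascading example. The sketch below applies the same pattern with plain java.util.regex to show which tokens it extracts; the class `WordRegexDemo` is hypothetical and only the regex itself comes from the slide.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; only the regex is taken from the Cascading example.
public class WordRegexDemo {

    // A run of non-space characters that starts and ends with a Unicode
    // letter (\pL) and is not directly adjacent to another letter.
    static final Pattern WORD =
            Pattern.compile("(?<!\\pL)(?=\\pL)[^ ]*(?<=\\pL)(?!\\pL)");

    // Collect every match of the pattern in the given line.
    static List<String> words(String line) {
        List<String> result = new ArrayList<>();
        Matcher m = WORD.matcher(line);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [Hello, world, it's]
        System.out.println(words("Hello, world! it's"));
    }
}
```

Leading and trailing punctuation is dropped, while punctuation inside a word (such as the apostrophe in "it's") is kept.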

(28)

Scalding – Counting Words

package com.twitter.scalding.examples

import com.twitter.scalding._

class WordCountJob(args : Args) extends Job(args) {
  TextLine( args("input") )
    .flatMap('line -> 'word) { line : String => tokenize(line) }
    .groupBy('word) { _.size }
    .write( Tsv( args("output") ) )

  def tokenize(text : String) : Array[String] = {
    text.toLowerCase.replaceAll("[^a-zA-Z0-9\\s]", "").split("\\s+")
  }
}

(29)

Cascalog – Counting Words

(ns count-words.core
  (:use cascalog.api)
  (:require [cascalog.ops :as c]))

(defmapcatop split
  [^String sentence]
  (.split sentence "\\s+"))

(defn wordcount-query [src]
  (<- [?word ?count]
      (src ?textline)
      (split ?textline :> ?word)
      (c/count ?count)))

(30)

Apache Pig

High-level abstraction on top of M/R

Procedural ETL scripting language

Extensible (Java, Python, Ruby or Groovy)

(31)

Apache Pig

input_lines = LOAD '/tmp/books' AS (line:chararray);

-- Extract words from each line and put them into a pig bag
-- datatype, then flatten the bag to get one word on each row
words = FOREACH input_lines GENERATE FLATTEN(TOKENIZE(line)) AS word;

-- filter out any words that are just white spaces
filtered_words = FILTER words BY word MATCHES '\\w+';

-- create a group for each word
word_groups = GROUP filtered_words BY word;

-- count the entries in each group
word_count = FOREACH word_groups GENERATE COUNT(filtered_words) AS count, group AS word;

-- order the records by count
ordered_word_count = ORDER word_count BY count DESC;

STORE ordered_word_count INTO '/tmp/number-of-words';

(32)

Apache Hive

SQL-like abstraction on top of M/R

Allows basic ETL

Extensible (Java, Python, Ruby or Groovy)

(33)

Counting Words – Hive

-- import the file as lines
CREATE EXTERNAL TABLE lines(line string);
LOAD DATA INPATH 'books' OVERWRITE INTO TABLE lines;

-- create a virtual view that splits the lines into words, then count each word
SELECT word, count(*) FROM lines
  LATERAL VIEW explode(split(line, ' ')) lTable AS word
GROUP BY word;

(34)

Eco-system

Oozie

HBase

Mahout

Spring for Apache Hadoop

Kafka

Elasticsearch

Storm

(35)

Big Data Pipeline

[Diagram: components of the big data pipeline]

ETL

Real-Time Streams

Unstructured Data (HDFS)

Real-Time Structured Database (HBase, GemFire, Cassandra)

Big SQL (Greenplum, Aster Data, etc.)

Batch Processing

Real-Time Processing (S4, Storm)

Analytics

(36)

Wrapping up

(37)

Wrap-up

Rich eco-system

Variety of tools/frameworks/solutions

they all run on the JVM

both good & bad

Be agile – start small and grow organically

Iterate over your design … a lot

Focus on data, not the tools

(38)