(1)

Simplifying Big Data with Apache Crunch

Micah Whitacre @mkwhit

(2)

(3)

(4)

(5)

(6)

(7)

(8)

(9)

(10)

(11)

Semantic Chart Search

Cloud Based EMR

Medical Alerting System

(12)

Problem moves from scaling architecture...

(13)

Problem moves from not only scaling architecture...

(14)

(15)

(16)

Battling the 3 V’s

(17)

Battling the 3 V’s

Daily, weekly, monthly uploads

60+ different data formats

(19)

Battling the 3 V’s

Daily, weekly, monthly uploads

2+ TB of streaming data daily

60+ different data formats

(20)

Population Health

Normalize Data

Apply Algorithms

Load Data for Displays

Data stores and formats shown: HBase, Solr, Vertica, Avro, CSV, Vertica, HBase

(21)

Process Reference Data

Process Raw Person Data

Process Raw Data using Reference

Filter Out Invalid Data

Group Data By Person

Create Person Record

Data formats shown: Avro, CSV, CSV

(22)

Mapper

Reducer

(23)

(24)

Struggle to fit into single MapReduce job

Integration done through persistence

(25)

(26)

Custom impls of common patterns

Evolving Requirements

(27)

Process Reference Data

Process Raw Person Data

Process Raw Data using Reference

Filter Out Invalid Data

Group Data By Person

Create Person Record

Anonymize Data

Prep for

Data formats shown: Avro, CSV, CSV, Avro

(28)

Easy integration between teams

Focus on processing steps

Shallow learning curve

(29)

Apache Crunch

Compose processing into pipelines

Open Source FlumeJava impl

Utilizes POJOs (hides serialization)

(30)

(31)

Processing

Pipeline

(32)

Pipeline

Programmatic description of DAG

Supports lazy execution

MapReduce, Spark, Memory
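The lazy-execution point can be illustrated with a toy model in plain Java. This is only a sketch under stated assumptions: `LazyPipelineSketch`, `then`, and `executedSteps` are hypothetical stand-ins invented for illustration, not Crunch classes or methods, and the plan here is a simple linear chain (the simplest kind of DAG). The idea it shows is that describing steps only records them, and work happens when `done` is called.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

/** Toy model of a lazily executed pipeline. Hypothetical; not a Crunch class. */
final class LazyPipelineSketch {
    private final List<String> stepNames = new ArrayList<>();
    private final List<Function<List<String>, List<String>>> steps = new ArrayList<>();
    private final List<String> executed = new ArrayList<>();

    /** Records a step in the plan. Nothing runs yet. */
    LazyPipelineSketch then(String name, Function<List<String>, List<String>> step) {
        stepNames.add(name);
        steps.add(step);
        return this;
    }

    /** Runs the recorded steps in order and returns the final output. */
    List<String> done(List<String> input) {
        List<String> current = input;
        for (int i = 0; i < steps.size(); i++) {
            executed.add(stepNames.get(i));
            current = steps.get(i).apply(current);
        }
        return current;
    }

    /** Names of the steps that have actually run so far. */
    List<String> executedSteps() {
        return executed;
    }

    public static void main(String[] args) {
        LazyPipelineSketch pipeline = new LazyPipelineSketch()
            .then("trim", in -> in.stream().map(String::trim).collect(Collectors.toList()))
            .then("upper", in -> in.stream().map(String::toUpperCase).collect(Collectors.toList()));
        // The plan is described, but no step has run before done() is called.
        System.out.println(pipeline.executedSteps());
        System.out.println(pipeline.done(Arrays.asList(" a ", "b ")));
        System.out.println(pipeline.executedSteps());
    }
}
```

Because the whole plan is known before anything runs, an engine is free to decide how to execute it, which is what lets the same description target MapReduce, Spark, or memory.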

(33)

Pipeline pipeline = MemPipeline.getInstance();

Pipeline pipeline = new MRPipeline(Driver.class, conf);

Pipeline pipeline =

(34)

Source

Reads various inputs

At least one required per pipeline

Custom implementations

(35)

Source

Sequence Files, Avro, Parquet, HBase, JDBC, HFiles, Text, CSV

Strings, AvroRecords, Results, POJOs, Protobufs, Thrift, Writables

(36)

pipeline.read(

(38)

PType<String> ptype = …;

pipeline.read(

(39)

PType

Hides serialization

Exposes data in native Java forms

Avro, Thrift, and Protocol Buffers

(40)

Multiple Serialization Types

Serialization Type = PTypeFamily

Can’t mix families in single type

Avro & Writable available

(41)

PType<Integer> intTypes = Writables.ints();

PType<String> stringType = Avros.strings();

PType<Person> personType =

(42)

PType<Pair<String, Person>> pairType = Avros.pairs(stringType, personType);

(43)

PTableType<String, Person> tableType = Avros.tableOf(stringType, personType);

(44)

PType<String> ptype = …;

PCollection<String> strings =
    pipeline.read(new TextFileSource(path, ptype));

(45)

PCollection

Immutable

Not created directly; only read or transformed

Unsorted

(46)

(47)

Process Reference

Data

(48)

DoFn

Simple API to implement

Location for custom logic

Transforms PCollection between forms

(49)

For each item emits 0:M items

FilterFn - returns boolean

MapFn - emits 1:1
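The relationship between the three function types can be sketched in plain Java. This is a minimal model, not the Crunch API: `DoFnSketch`, its nested `Emitter` and `Fn` interfaces, and the `filter`, `map`, and `parallelDo` helpers are hypothetical stand-ins written for illustration. It shows that a general function may emit any number of outputs per input, while a filter (0 or 1 outputs) and a map (exactly 1 output) are special cases of it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;

/** Toy model of emit-based functions. Hypothetical stand-ins; not Crunch classes. */
final class DoFnSketch {
    /** Receives whatever a function chooses to emit. */
    interface Emitter<T> {
        void emit(T value);
    }

    /** General form: may emit zero, one, or many outputs per input. */
    interface Fn<S, T> {
        void process(S input, Emitter<T> emitter);
    }

    /** A filter emits the input unchanged when the predicate accepts it, otherwise nothing. */
    static <T> Fn<T, T> filter(Predicate<T> accept) {
        return (input, emitter) -> {
            if (accept.test(input)) {
                emitter.emit(input);
            }
        };
    }

    /** A map emits exactly one output per input. */
    static <S, T> Fn<S, T> map(Function<S, T> f) {
        return (input, emitter) -> emitter.emit(f.apply(input));
    }

    /** Applies fn to every input and gathers everything it emitted, in order. */
    static <S, T> List<T> parallelDo(List<S> inputs, Fn<S, T> fn) {
        List<T> out = new ArrayList<>();
        for (S input : inputs) {
            fn.process(input, out::add);
        }
        return out;
    }

    public static void main(String[] args) {
        // 0:M example: one line may yield no words or several.
        Fn<String, String> splitWords = (line, emitter) -> {
            for (String word : line.split(" ")) {
                if (!word.isEmpty()) {
                    emitter.emit(word);
                }
            }
        };
        System.out.println(parallelDo(Arrays.asList("a b", "", "c"), splitWords));
        System.out.println(parallelDo(Arrays.asList(1, 5, 2, 7), DoFnSketch.<Integer>filter(v -> v > 3)));
        System.out.println(parallelDo(Arrays.asList(1, 2), DoFnSketch.<Integer, String>map(v -> "n" + v)));
    }
}
```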

(50)

DoFn API

class ExampleDoFn extends DoFn<String, RefData>{

...

(51)

public void process(String s, Emitter<RefData> emitter) {
  RefData data = …;
  emitter.emit(data);
}

(52)

PCollection<String> refStrings

PCollection<RefData> refs = refStrings.parallelDo(fn,

(53)

PCollection<String> dataStrs...

PCollection<Data> data =
    dataStrs.parallelDo(diffFn, Avros.records(Data.class));

(54)

(55)

Hmm now I need to join...

We need a PTable

(56)

PTable<K, V>

Immutable & Unsorted

Variation of PCollection<Pair<K, V>>

Multimap of Keys and Values

(57)

class ExampleDoFn extends DoFn<String, RefData>{

... }

(58)

class ExampleDoFn extends DoFn<String,

Pair<String, RefData>>{ ...

(59)

PCollection<String> refStrings

PTable<String, RefData> refs =
    refStrings.parallelDo(fn,
        Avros.tableOf(Avros.strings(), Avros.records(RefData.class)));

(60)

PTable<String, RefData> refs…;

PTable<String, Data> data…;

(61)

data.join(refs);

(62)

PTable<String, Pair<Data, RefData>>

(63)

Joins

right, left, inner, outer

Mapside, BloomFilter, Sharded
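What an inner join produces can be sketched in plain Java, with `Map.Entry` standing in for a key/value pair. This is an illustrative model only: `InnerJoinSketch` and `innerJoin` are hypothetical names, and the hash-index strategy used here is an assumption chosen for brevity, not a description of how Crunch executes any of its join strategies. The result shape mirrors the slides: each key maps to a pair of (left value, right value).

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Map.Entry;

/** Toy in-memory inner join over keyed pairs. Hypothetical; not a Crunch class. */
final class InnerJoinSketch {
    static <K, L, R> List<Entry<K, Entry<L, R>>> innerJoin(
            List<Entry<K, L>> left, List<Entry<K, R>> right) {
        // Index the right side by key; a key may carry several values (multimap).
        Map<K, List<R>> rightByKey = new HashMap<>();
        for (Entry<K, R> entry : right) {
            rightByKey.computeIfAbsent(entry.getKey(), k -> new ArrayList<>()).add(entry.getValue());
        }
        // Emit one output per matching (left, right) combination; unmatched keys are dropped.
        List<Entry<K, Entry<L, R>>> joined = new ArrayList<>();
        for (Entry<K, L> l : left) {
            List<R> matches = rightByKey.getOrDefault(l.getKey(), Collections.<R>emptyList());
            for (R r : matches) {
                Entry<L, R> values = new SimpleEntry<>(l.getValue(), r);
                joined.add(new SimpleEntry<>(l.getKey(), values));
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        List<Entry<String, String>> data = new ArrayList<>();
        data.add(new SimpleEntry<>("p1", "visit"));
        data.add(new SimpleEntry<>("p2", "lab"));
        List<Entry<String, Integer>> refs = new ArrayList<>();
        refs.add(new SimpleEntry<>("p1", 10));
        refs.add(new SimpleEntry<>("p3", 12));
        System.out.println(innerJoin(data, refs));
    }
}
```

A left or right outer join would differ only in also keeping the unmatched keys from one side, paired with an empty value.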

(64)

(65)

(66)

FilterFn API

class MyFilterFn extends FilterFn<...>{

...

(67)

public boolean accept(... value) {
  return value > 3;
}

(68)

PCollection<Model> values = …;

PCollection<Model> filtered = values.filter(new

(69)

(70)

PTable<String,Model> models = …;

(71)

PTable<String,Model> models = …;

PGroupedTable<String, Model> groupedModels =

(72)

PGroupedTable<K, V>

Immutable & Sorted
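Grouping a keyed table can be modeled in plain Java as collecting every value under its key. This is a sketch under assumptions: `GroupByKeySketch` and its `groupByKey` helper are hypothetical names for illustration, not the Crunch implementation, and it assumes "Sorted" on the slide refers to the order of the keys.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

/** Toy in-memory grouping of keyed pairs. Hypothetical; not a Crunch class. */
final class GroupByKeySketch {
    /** Collects all values that share a key; keys come back in sorted order. */
    static <K extends Comparable<K>, V> SortedMap<K, List<V>> groupByKey(List<Map.Entry<K, V>> table) {
        SortedMap<K, List<V>> grouped = new TreeMap<>();
        for (Map.Entry<K, V> entry : table) {
            grouped.computeIfAbsent(entry.getKey(), k -> new ArrayList<>()).add(entry.getValue());
        }
        return grouped;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, String>> models = new ArrayList<>();
        models.add(new SimpleEntry<>("p2", "lab"));
        models.add(new SimpleEntry<>("p1", "visit"));
        models.add(new SimpleEntry<>("p1", "refill"));
        System.out.println(groupByKey(models));
    }
}
```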

(73)

(74)

(75)

(76)

PCollection<Person> persons = …;

pipeline.write(persons,

(78)

Target

Persists PCollection

At least one required per pipeline

(79)

Target

Sequence Files, Avro, Parquet, HBase, JDBC, HFiles, Text, CSV

Strings, AvroRecords, Results, POJOs, Protobufs, Thrift, Writables

(80)

(81)

Execution

Pipeline pipeline = …;
...
pipeline.write(...);
PipelineResult result = pipeline.done();

(82)

(83)

Tuning

GroupingOptions/ParallelDoOptions

Scale factors

(84)

Focus on the transformations

Smaller learning curve

Functionality first

(85)

Extend pipeline for new features

Iterate with confidence

(86)

Links

http://crunch.apache.org/

http://www.quora.com/Apache-Hadoop/What-are-the-differences-between-Crunch-and-Cascading