• No results found

Simplifying Big Data with Apache Crunch. Micah

N/A
N/A
Protected

Academic year: 2021

Share "Simplifying Big Data with Apache Crunch. Micah"

Copied!
86
0
0

Loading.... (view fulltext now)

Full text

(1)

Simplifying Big Data with

Apache Crunch

Micah Whitacre @mkwhit

(2)
(3)
(4)
(5)
(6)
(7)
(8)
(9)
(10)
(11)

Semantic Chart Search

Cloud Based EMR

Medical Alerting System

(12)

Problem moves from scaling architecture...

(13)

Problem moves from not only scaling architecture...

(14)
(15)
(16)

Battling the 3 V’s

(17)

Battling the 3 V’s

Daily, weekly, monthly uploads 60+ different data formats

(18)

Battling the 3 V’s

Daily, weekly, monthly uploads 60+ different data formats

(19)

Battling the 3 V’s

Daily, weekly, monthly uploads

2+ TB of streaming data daily 60+ different data formats

(20)

Normalize Data Apply Algorithms Load Data for Displays HBase Solr Vertica Avro CSV Vertica HBase

Population Health

(21)

Process Reference Data Process Raw Person Data Process Raw Data using Reference Filter Out Invalid Data Group Data By Person Create Person Record Avro CSV CSV

(22)

M

a

p

p

e

r

R

e

d

u

c

e

r

(23)
(24)

Struggle to fit into single MapReduce job Integration done through persistence

(25)

Struggle to fit into single MapReduce job Integration done through persistence

(26)

Struggle to fit into single MapReduce job Integration done through persistence

Custom impls of common patterns Evolving Requirements

(27)

Process Reference Data Process Raw Person Data Process Raw Data using Reference Filter Out Invalid Data Group Data By Person Create Person Record Avro CSV CSV Anonymize Data Avro Prep for

(28)

Easy integration between teams Focus on processing steps

Shallow learning curve

(29)

Apache Crunch

Compose processing into pipelines

Open Source FlumeJava impl

Utilizes POJOs (hides serialization)

(30)

Process Reference Data Process Raw Person Data Process Raw Data using Reference Filter Out Invalid Data Group Data By Person Create Person Record Avro CSV CSV

(31)

Process Reference Data Process Raw Person Data Process Raw Data using Reference Filter Out Invalid Data Group Data By Person Create Person Record Avro CSV CSV

Processing

Pipeline

(32)

Pipeline

Programmatic description of DAG Supports lazy execution

MapReduce, Spark, Memory

(33)

Pipeline pipeline =

MemPipeline.getIntance(); Pipeline pipeline =

new MRPipeline(Driver.class, conf);

Pipeline pipeline =

(34)

Source

Reads various inputs

At least one required per pipeline

Custom implementations

(35)

Source

Sequence Files Avro Parquet HBase JDBC HFiles Text CSV Strings AvroRecords Results POJOs Protobufs Thrift Writables

(36)

pipeline.read(

(37)

pipeline.read(

(38)

PType<String> ptype = …;

pipeline.read(

(39)

PType

Hides serialization

Exposes data in native Java forms

Avro, Thrift, and Protocol Buffers

(40)

Multiple Serialization Types

Serialization Type = PTypeFamily

Can’t mix families in single type

Avro & Writable available

(41)

PType<Integer> intTypes = Writables.ints();

PType<String> stringType = Avros.strings();

PType<Person> personType =

(42)

PType<Pair<String, Person>> pairType = Avros.pairs(stringType, personType);

(43)

PTableType<String, Person> tableType = Avros.tableOf(stringType,personType);

(44)

PType<String> ptype = …;

PCollection<String> strings = pipeline.read( new TextFileSource(path, ptype));

(45)

PCollection

Immutable

Not created only read or transformed

Unsorted

(46)

Process Reference Data Process Raw Person Data Process Raw Data using Reference Filter Out Invalid Data Group Data By Person Create Person Record Avro CSV CSV

(47)

Process Reference

Data

(48)

DoFn

Simple API to implement

Location for custom logic

Transforms PCollection between forms

(49)

For each item emits 0:M items

FilterFn - returns boolean MapFn - emits 1:1

(50)

DoFn API

class ExampleDoFn extends DoFn<String, RefData>{

...

(51)

public void process (String s, Emitter<RefData> emitter) { RefData data = …; emitter.emit(data); }

(52)

PCollection<String> refStrings

PCollection<RefData> refs = refStrings.parallelDo(fn,

(53)

PCollection<String> dataStrs... PCollection<RefData> refs =

dataStrs.parallelDo(diffFn, Avros.records(Data.class));

(54)

Process Reference Data Process Raw Person Data Process Raw Data using Reference Filter Out Invalid Data Group Data By Person Create Person Record Avro CSV CSV

(55)

Hmm now I need to join...

We need a PTable

(56)

PTable<K, V>

Immutable & Unsorted

Variation PCollection<Pair<K, V>>

Multimap of Keys and Values

(57)

class ExampleDoFn extends DoFn<String, RefData>{

... }

(58)

class ExampleDoFn extends DoFn<String,

Pair<String, RefData>>{ ...

(59)

PCollection<String> refStrings

PTable<String, RefData> refs =

refStrings.parallelDo(fn,

Avros.tableOf(Avros.strings(), Avros.records(RefData.class)));

(60)

PTable<String, RefData> refs…; PTable<String, Data> data…;

(61)

data.join(refs);

(62)

PTable<String,

Pair<Data, RefData>>

(63)

Joins

right, left, inner, outer

Mapside, BloomFilter, Sharded

(64)

Process Reference Data Process Raw Person Data Process Raw Data using Reference Filter Out Invalid Data Group Data By Person Create Person Record Avro CSV CSV

(65)

Process Reference Data Process Raw Person Data Process Raw Data using Reference Filter Out Invalid Data Group Data By Person Create Person Record Avro CSV CSV

(66)

FilterFn API

class MyFilterFn extends FilterFn<...>{

...

(67)

public boolean accept (... value){

return value > 3; }

(68)

PCollection<Model> values = …; PCollection<Model> filtered = values.filter(new

(69)

Process Reference Data Process Raw Person Data Process Raw Data using Reference Filter Out Invalid Data Group Data By Person Create Person Record Avro CSV CSV

(70)

PTable<String,Model> models = …;

(71)

PTable<String,Model> models = …; PGroupedTable<String, Model>

groupedModels =

(72)

PGroupedTable<K, V>

Immutable & Sorted

(73)

Process Reference Data Process Raw Person Data Process Raw Data using Reference Filter Out Invalid Data Group Data By Person Create Person Record Avro CSV CSV

(74)

Process Reference Data Process Raw Person Data Process Raw Data using Reference Filter Out Invalid Data Group Data By Person Create Person Record Avro CSV CSV

(75)
(76)

PCollection<Person> persons = …; pipeline.write(persons,

(77)

PCollection<Person> persons = …; pipeline.write(persons,

(78)

Target

Persists PCollection

At least one required per pipeline

(79)

Target

Sequence Files Avro Parquet HBase JDBC HFiles Text CSV Strings AvroRecords Results POJOs Protobufs Thrift Writables

(80)

Process Reference Data Process Raw Person Data Process Raw Data using Reference Filter Out Invalid Data Group Data By Person Create Person Record Avro CSV CSV

(81)

Pipeline pipeline = …; ... pipeline.write(...); PipelineResult result = pipeline.done(); Execution

(82)

Process Reference Data Process Raw Person Data Process Raw Data using Reference Filter Out Invalid Data Group Data By Person Create Person Record Avro CSV CSV

(83)

Tuning

GroupingOptions/ParallelDoOptions Scale factors

(84)

Focus on the transformations Smaller learning curve

Functionality first

(85)

Extend pipeline for new features

Iterate with confidence

(86)

Links

http://crunch.apache.org/

http://www.quora.com/Apache-Hadoop/What-are-the-differences-between-Crunch-and-Cascading

References

Related documents

Besides Ahyarudin ’ s research, Kartika (2013) conducted study in SMAN 8 Bandar Lampung and found that there was a significant increase of the students ’ reading

En los individuos alcohólicos existe 65.5 veces más la probabilidad de ser afectado por el VIH/Sida que los que no lo son, la ocupación se asocia como factor de riego 9 veces mayor

Others are requirements contained in new Bill 198 (not yet proclaimed in force and discussed in greater detail in section C below), which makes amendments to the Securities

With regard to paragraph 17 and related text in the Respondent's submissions, the shape of the federal on-reserve child welfare program is not subject to

Naloxone is a potent opioid antagonist and is used in reversal of opioid- induced respiratory depression, either in overdose or in those patients who have suffered exaggerated

PLACE OF WORK: AEROPORTO DI PONTECAGNANO (SALERNO) PLACE OF WORK: AEROPORTO DI BARI E DI BRINDISI Refurbish of old equipment on arrival area. arrivi riveniente dalla

• Study 2: We hypothesize that individuals (with no gambling addiction) who are High Cortisol Responders (HCR) to stress will show an altered reward activity to gambling, with

The guideline development group recognises that the interval between seeking advice and initiation of disease modifying anti-rheumatic drugs (DMARD) treatment has continued to