• No results found

I Logs. Apache Kafka, Stream Processing, and Real-time Data Jay Kreps

N/A
N/A
Protected

Academic year: 2021

Share "I Logs. Apache Kafka, Stream Processing, and Real-time Data Jay Kreps"

Copied!
42
0
0

Loading.... (view fulltext now)

Full text

(1)

Apache Kafka, Stream Processing, and Real-time Data

Jay Kreps

(2)

The Plan

1.  What is Data Integration?

2.  What is Apache Kafka?

3.  Logs and Distributed Systems

4.  Logs and Data Integration

(3)
(4)

Maslow’s Hierarchy

Physiological Safety

Love & Belonging Esteem

(5)

For Data

Acquisition/Collection Semantics

Understanding Automation

(6)

New Types of Data

•  Database data

– Users, products, orders, etc

•  Events

– Clicks, Impressions, Pageviews, etc

•  Application metrics

– CPU usage, requests/sec

•  Application logs

(7)

New Types of Systems

•  Live Stores

–  Voldemort

–  Espresso

–  Graph

–  OLAP

–  Search

–  InGraphs •  Offline

–  Hadoop

(8)

Bad

Oracle Oracle

Oracle User Tracking

Hadoop Log Search Monitoring Data Warehouse Social Graph Rec.

Engine Search Email

Voldemort Voldemort Voldemort Espresso Espresso Espresso Operational Logs Operational Metrics Production Services ... Security

(9)

Good

Oracle Oracle

Oracle User Tracking

Hadoop Log Search Monitoring Data Warehouse Social Graph Rec

Engine Search Email Voldemort Voldemort Voldemort Espresso Espresso Espresso Operational Logs Operational Metrics Production Services ... Security Log

(10)

The Plan

1.  What is Data Integration?

2.  What is Apache Kafka?

3.  Logs and Distributed Systems

4.  Logs and Data Integration

(11)

Apache Kafka

producer kafka cluster producer producer producer producer producer producer producer producer consumer consumer consumer consumer consumer consumer consumer consumer consumer

(12)
(13)

A

"

brief

"

history

"

of

"

(14)

Three design principles

1.  One pipeline to rule them all

2.  Stream processing >> messaging

(15)

Characteristics

•  Scalability of a filesystem

– Hundreds of MB/sec/server throughput

– Many TB per server

•  Guarantees of a database

– Messages strictly ordered

– All data persistent

•  Distributed by default

– Replication

(16)

Kafka At LinkedIn

•  175 TB of in-flight log data per colo

•  Low-latency: ~1.5 ms

•  Replicated to each datacenter

•  Tens of thousands of data producers

•  Thousands of consumers

•  7 million messages written/sec

•  35 million messages read/sec

(17)

The Plan

1.  What is Data Integration?

2.  What is Apache Kafka?

3.  Logs and Distributed Systems

4.  Logs and Data Integration

(18)
(19)
(20)
(21)

0 1 2 3 4 5 6 7 8 9 10 11

Next Record Written

12 1st Record

(22)

Partitioning

0 1 2 3 4 5 6 7 8 9 1 0

1 1

1 2

0 1 2 3 4 5 6 7 8 9

0 1 2 3 4 5 6 7 8 9 1 0 1 1 Partition 0 Partition 1 Partition 2

(23)

Logs: pub/sub done right

0 1 2 3 4 5 6 7 8 9 1 0 1 1 1 2 Log Data Source writes Destination System A

(time = 7)

reads

Destination System B (time = 11)

(24)
(25)

Example:

"

(26)

Replica 1 Replica 2

PUT('microsoft', 'bill gates') PUT('apple', 'steve jobs')

PUT('microsoft', 'steve ballmer') PUT('google', 'larry page')

PUT('yahoo', 'terry semel') PUT('google', 'eric schmidt') PUT('yahoo', 'jerry yang') PUT('yahoo', 'carol bartz') PUT('apple', 'tim cook') PUT('google', 'larry page') PUT('yahoo', 'scott thompson') PUT('yahoo', 'marissa mayer') PUT('microsoft', 'satya nadella')

{

'microsoft': 'satya nadella', 'apple': 'tim cook',

'google': 'larry page',

'yahoo': 'marissa mayer' }

(27)

0 1 2 3 4 5 6 7 8

PUT(micr

osoft, bill gates)

PUT(apple, steve jobs) PUT(micr

osoft, steve ballmer)

PUT(google, larr

y page)

PUT(yahoo, terr

y semel)

PUT(google, eric schmidt) PUT(yahoo, jerr

y yang)

PUT(yahoo, car

ol bar

tz)

PUT(apple, tim cook)

Replica 1 (offset=10) Replica 2 (offset=12) PUT(google, larr y page)

PUT(yahoo, scott thompson)

9 10 11 12

PUT(yahoo, marissa mayer) PUT(micr

(28)

State-machine

Replication

Peer Peer Peer

The Log

Writes Reads

Master Slave Slave

The Log

state changes

Requests

Primary-backup

(29)

The Plan

1.  What is Data Integration?

2.  What is Apache Kafka?

3.  Logs and Distributed Systems

4.  Logs and Data Integration

(30)

Example: User views job

Jobs Frontend

Kafka

Job Views

Hadoop Security Job Poster

Analytics

Rec.

Engine Monitoring

(31)

It’s all one big distributed system

Oracle Oracle

Oracle User Tracking

Hadoop SearchLog Monitoring WarehouseData Social Graph

Rec

Engine Search Email Voldemort Voldemort Voldemort Espresso Espresso Espresso Operational Logs Operational Metrics Production Services ... Security Log

(32)

Comparing Data Transfer

Mechanisms

(33)

The Plan

1.  What is Data Integration?

2.  What is Apache Kafka?

3.  Logs and Distributed Systems

4.  Logs and Data Integration

(34)
(35)
(36)

Stream Processing = Logs + Jobs

Job 1

Job 3

Log A Log B Log C

Job 2

Log E Log D

(37)

Stream processing is a

"

generalization"

(38)

Examples

•  Monitoring

•  Security

•  Content processing

•  Recommendations

•  Newsfeed

(39)
(40)

Samza Architecture

Job Job Job Job

Samza

(41)

Log-centric Architecture

Key-Value Query Layer Search Query Layer Stream Proces sing Log Hadoop Monitoring & Graphs Graph DB, OLAP Store, Etc

(42)

Kafka! http://kafka.apache.org! ! Samza! http://samza.incubator.apache.org! !

Log Blog!

http://linkd.in/199iMwY! ! Me! http://www.linkedin.com/in/jaykreps! @jaykreps! !

References

Related documents

To refute intelligent design, it is enough to display specific, fully articulated Darwinian pathways for the complex systems that, according to intelligent design, lie beyond

The protocol was implemented over an 8-month pe- riod and included the following: (a) analysis of a set of parameters concerning the operating team’s behav- ior (Tab. I);

Innovation Ecosystem Industrie 4.0 Robotics Sensors Additive Manufacturing Suppliers Autonomous Vehicles Drones Advanced Manufacturing Systems Advanced Materials

In the section 2 of the questionnaire, the targeted respondents rated the current practices of Green Procurement in furniture manufacturing companies along a 5

Where R jt is the average equally-weighted return for portfolio j in month t as calculated in specification 3a, 3b, 4a and 4b (see section 3.3, portfolio j refers to

The Current State of Performance Management; Proactive Performance Management; Developing a Baseline; Online Utilization Reports; Automated Trend Analysis; Trouble Ticket Reports;

Using data derived from a survey of just under 2,000 prospective students, it shows how those from low social classes are more debt averse than those from other social classes, and

Hahn and Kuersteiner (2004) considered a bias corrected estimator for a general dynamic model with a scalar fixed effect using sample likelihood derivative quantities evaluated