Apache Kafka, Stream Processing, and Real-time Data
Jay Kreps
The Plan
1. What is Data Integration?
2. What is Apache Kafka?
3. Logs and Distributed Systems
4. Logs and Data Integration
Maslow’s Hierarchy
Physiological Safety
Love & Belonging Esteem
For Data
Acquisition/Collection Semantics
Understanding Automation
New Types of Data
• Database data
– Users, products, orders, etc
• Events
– Clicks, Impressions, Pageviews, etc
• Application metrics
– CPU usage, requests/sec
• Application logs
New Types of Systems
• Live Stores
– Voldemort
– Espresso
– Graph
– OLAP
– Search
– InGraphs • Offline
– Hadoop
Bad
Oracle Oracle
Oracle User Tracking
Hadoop Log Search Monitoring Data Warehouse Social Graph Rec.
Engine Search Email
Voldemort Voldemort Voldemort Espresso Espresso Espresso Operational Logs Operational Metrics Production Services ... Security
Good
Oracle Oracle
Oracle User Tracking
Hadoop Log Search Monitoring Data Warehouse Social Graph Rec
Engine Search Email Voldemort Voldemort Voldemort Espresso Espresso Espresso Operational Logs Operational Metrics Production Services ... Security Log
The Plan
1. What is Data Integration?
2. What is Apache Kafka?
3. Logs and Distributed Systems
4. Logs and Data Integration
Apache Kafka
producer kafka cluster producer producer producer producer producer producer producer producer consumer consumer consumer consumer consumer consumer consumer consumer consumerA
"
brief
"
history
"
of
"
Three design principles
1. One pipeline to rule them all
2. Stream processing >> messaging
Characteristics
• Scalability of a filesystem
– Hundreds of MB/sec/server throughput
– Many TB per server
• Guarantees of a database
– Messages strictly ordered
– All data persistent
• Distributed by default
– Replication
Kafka At LinkedIn
• 175 TB of in-flight log data per colo
• Low-latency: ~1.5 ms
• Replicated to each datacenter
• Tens of thousands of data producers
• Thousands of consumers
• 7 million messages written/sec
• 35 million messages read/sec
The Plan
1. What is Data Integration?
2. What is Apache Kafka?
3. Logs and Distributed Systems
4. Logs and Data Integration
0 1 2 3 4 5 6 7 8 9 10 11
Next Record Written
12 1st Record
Partitioning
0 1 2 3 4 5 6 7 8 9 1 0
1 1
1 2
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9 1 0 1 1 Partition 0 Partition 1 Partition 2
Logs: pub/sub done right
0 1 2 3 4 5 6 7 8 9 1 0 1 1 1 2 Log Data Source writes Destination System A
(time = 7)
reads
Destination System B (time = 11)
Example:
"
Replica 1 Replica 2
PUT('microsoft', 'bill gates') PUT('apple', 'steve jobs')
PUT('microsoft', 'steve ballmer') PUT('google', 'larry page')
PUT('yahoo', 'terry semel') PUT('google', 'eric schmidt') PUT('yahoo', 'jerry yang') PUT('yahoo', 'carol bartz') PUT('apple', 'tim cook') PUT('google', 'larry page') PUT('yahoo', 'scott thompson') PUT('yahoo', 'marissa mayer') PUT('microsoft', 'satya nadella')
{
'microsoft': 'satya nadella', 'apple': 'tim cook',
'google': 'larry page',
'yahoo': 'marissa mayer' }
0 1 2 3 4 5 6 7 8
PUT(micr
osoft, bill gates)
PUT(apple, steve jobs) PUT(micr
osoft, steve ballmer)
PUT(google, larr
y page)
PUT(yahoo, terr
y semel)
PUT(google, eric schmidt) PUT(yahoo, jerr
y yang)
PUT(yahoo, car
ol bar
tz)
PUT(apple, tim cook)
Replica 1 (offset=10) Replica 2 (offset=12) PUT(google, larr y page)
PUT(yahoo, scott thompson)
9 10 11 12
PUT(yahoo, marissa mayer) PUT(micr
State-machine
Replication
Peer Peer Peer
The Log
Writes Reads
Master Slave Slave
The Log
state changes
Requests
Primary-backup
The Plan
1. What is Data Integration?
2. What is Apache Kafka?
3. Logs and Distributed Systems
4. Logs and Data Integration
Example: User views job
Jobs Frontend
Kafka
Job Views
Hadoop Security Job Poster
Analytics
Rec.
Engine Monitoring
It’s all one big distributed system
Oracle Oracle
Oracle User Tracking
Hadoop SearchLog Monitoring WarehouseData Social Graph
Rec
Engine Search Email Voldemort Voldemort Voldemort Espresso Espresso Espresso Operational Logs Operational Metrics Production Services ... Security Log
Comparing Data Transfer
Mechanisms
The Plan
1. What is Data Integration?
2. What is Apache Kafka?
3. Logs and Distributed Systems
4. Logs and Data Integration
Stream Processing = Logs + Jobs
Job 1
Job 3
Log A Log B Log C
Job 2
Log E Log D
Stream processing is a
"
generalization"
Examples
• Monitoring
• Security
• Content processing
• Recommendations
• Newsfeed
Samza Architecture
Job Job Job Job
Samza
Log-centric Architecture
Key-Value Query Layer Search Query Layer Stream Proces sing Log Hadoop Monitoring & Graphs Graph DB, OLAP Store, EtcKafka! http://kafka.apache.org! ! Samza! http://samza.incubator.apache.org! !
Log Blog!
http://linkd.in/199iMwY! ! Me! http://www.linkedin.com/in/jaykreps! @jaykreps! !