Putting Apache Kafka to Use!

(1)

Putting Apache Kafka to Use!

Building a Real-time Data Platform for Event Streams!

JAY KREPS, CONFLUENT!

(2)

A Couple of Themes!

(3)

Theme 1: Rise of Events!

(4)

Theme 2: Immutability Everywhere!

Level! Example! Immutable Alternative!

Mutable local state! Counter in a for loop! Functional Programming!

Mutable process-wide state! ConcurrentHashMap! Functional Programming!

!

Mutable on disk structures! B-Tree! LSM!

Distributed systems! Dynamo-like key-value store! State machine replication!

Mutability in databases! RDBMS! Event Sourcing!

Company-wide data flow! Double write! Kafka!

(5)

Theme 3: Datacenter-Level Thinking!

(6)

Experience at LinkedIn!

(7)

2009: We want all our data in Hadoop!!

(8)

What is all our data?!

(9)

Initial approach: “gut it out”!

(10)

Problems!

•  Data coverage!

•  Many source systems!

•  Relational DBs!

•  Log files!

•  Metrics!

•  Messaging systems!

•  Many data formats!

•  Constant change!

•  New schemas!

•  New data sources!

(11)

Needed: organizational scalability!

Θ(N) => Θ(1)!

(12)

How does everything else work?!

?!

(13)

Relational database changes!

Relational Databases

App App App

Cache

Poll For Changes

Caches &

Derived Stores

Relational Data Warehouse

ODS Data Guard

Hadoop CSV Dump

Transforms

Transforms Apps and Services

OLTP Queries

(14)

NoSQL!

Key-value Store

Hadoop ETL Load

App App App

(15)

User events!

Apps and Services

Log Aggregation HTTP

NFS

Hadoop

Relational Data Warehouse rsync

Transform & Load Load

Transform

(16)

Application Logs!

Apps and Services

Splunk

(17)

Messaging!

App App App

Broker

Processor Processor

App App

Broker

Processor Processor

App App

Broker

Processor Processor

App

Processor Processor

(18)

Metrics and operational data!

App App App

Monitoring

(19)

This is a giant mess!

Relational Databases

App App App

Cache

Poll For Changes

Caches &

Derived Stores

Relational Data Warehouse

ODS Data Guard

Hadoop CSV Dump

Transforms

Transforms Apps and Services

Splunk

ActiveMQ

Apps

ActiveMQ

Apps Apps

Log Aggregation HTTP

NFS

NFS rsync

Transform & Load Load Monitoring

Apps and Services Apps and Services

HTTP

Key-value Store Apps

OLTP Queries

(20)

Impossible ideas!

•  Publish data from Hadoop to a search index!

•  Run a SQL query to find the biggest latency bottleneck!

•  Run a SQL query to find common error patterns!

•  Low latency monitoring of database changes or user activity!

•  Incorporate popularity in real-time display and relevance algorithms!

•  Products that incorporate user activity!

(21)

An infrastructure solution?!

(22)

Idea: Stream Data Platform!

Stream Data Platform:

?

HADOOP:

Offline Data

Spark

Synchronous

Req/Response Near real time

0 - 100s ms > 100s ms

Offline batch

> 1 hour Real-time

Analytics

Impala

Map- Reduce Hive

RDBMS DWH

NoSQL

Monitoring Search

Apps

Stream Processing

(23)

First Attempt: Messaging systems!!

(24)

Problems!

•  Throughput!

•  Batch systems!

•  Persistence!

•  Stream Processing!

•  Ordering guarantees!

•  Partitioning!

(25)

Second Attempt: Build Kafka!!

(26)

What does it do?!

Kafka Cluster

Consumer Consumer Consumer Consumer Consumer Producer Producer Producer Producer Producer

(27)

Commit Log Abstraction!

0 1 2 3 4 5 6 7 8 9 1 0

1 1

1 2

Old New

Writes Reader 1 Reader 2

(28)

Logs & Publish-Subscribe Messaging!

0 1 2 3 4 5 6 7 8 9 1 0

1 1

1 Log 2

Source System

writes

Destination System A

reads

Destination System B

reads

(29)

A Kafka Topic!

0 1 2 3 4 5 6 7 8 9 1 0

1 1

1 2 Partition

0

0 1 2 3 4 5 6 7 8 9

0 1 2 3 4 5 6 7 8 9 1 0

1 1

1 2 Partition

1

Partition 2

Old New

Writes

(30)

Replication!

Server 1 Server 2 Server 3

A:0 A:0 A:0

A:1 A:1

A:1

B:0 B:0 ^Controller

(31)

Scaling Consumers!

Server 1 P0

Server 2

P3 P1 P2

Kafka Cluster

Consumer Group A

C1 C2

Consumer Group B

C3 C4 C5 C6

(32)

Kafka: A Modern Distributed System for Streams!

 Scalability of a filesystem!

◦ Hundreds of MB/sec/server throughput!

◦ Many TB per server!

 Guarantees of a database!

◦ Messages strictly ordered!

◦ All data persistent!

 Distributed by default!

◦ Replication!

◦ Partitioning model!

 Producers, Consumers, and Brokers all fault tolerant and horizontally scalable!

(33)

Stream Data Platform!

KAFKA:

Stream Data Platform

HADOOP:

Offline Data

Spark

Synchronous

Req/Response Near real time

0 - 100s ms > 100s ms

Offline batch

> 1 hour Real-time

Analytics

Impala

Map- Reduce Hive

RDBMS DWH

NoSQL

Monitoring Search

Apps

Stream Processing

(34)

Batch Data => Batch Processing!

(35)

Stream processing is a!

generalization!

of batch processing !

and request/response processing!

(36)

Request/Response processing: !

One input => One output!

(37)

Batch processing: !

All inputs => All outputs!

(38)

Stream Processing: !

Some inputs => some outputs!

(you choose how much “some” is)!

(39)

Stream Processing a la carte!

Transform Transform Transform Input Kafka Topic

Live Data Store

Hadoop

Transform Transform Transform Intermediate

Kafka Topic

Your code

Output Kafka Topic

cat input | grep “foo” | wc -l

(40)

Stream Processing with Frameworks!

+! =! Processing! ^Stream

(41)

Unix Pipes, Modernized!

cat /usr/share/dict/words | wc -l

(42)

On Schemas!

Bad Schemas < No Schemas < Good Schemas!

(43)

Put it all together!

Kafka Log

Search Monitoring

Real-time Analytics Social

Graph Search Newsfeed OLAP

Samza Apps Key-Value

Storage Oracle

Apps Apps

Security &

Fraud

Hadoop Teradata

Apps

(44)

At LinkedIn!

•  Everything in the company is a real-time stream!

•  > 800 billion messages written per day!

•  > 2.9 trillion messages read per day!

•  ~ 1 PB of stream data!

•  Tens of thousands of producer processes!

•  Backbone for data stores!

•  Search!

•  Social Graph!

•  Newsfeed!

•  Primary storage (in progress)!

•  Basis for stream processing!

(45)

Elsewhere!

(46)

Why this is the future!

1. System diversity is increasing!

2. Data diversity and volume is

increasing!

3.  The world is getting faster!

4.  The technology exists!

(47)

•  Mission: Make this a practical reality everywhere!

•  Product!

•  Apache Kafka!

•  Schemas and metadata management!

•  Connectors for common systems!

•  Monitor data flow end-to-end!

•  Stream processing integration

!

(48)

Questions?!

•  Confluent!

•  @confluentinc!

•  http://confluent.io !

•  http://blog.confluent.io/2015/02/25/

stream-data-platform-1 !

•  Apache Kafka!

•  @apachekafka!

•  http://kafka.apache.org!

•  http://linkd.in/199iMwY !

•  Me!

•  @jaykreps!

Putting Apache Kafka to Use!