• No results found

Putting Apache Kafka to Use!

N/A
N/A
Protected

Academic year: 2021

Share "Putting Apache Kafka to Use!"

Copied!
48
0
0

Loading.... (view fulltext now)

Full text

(1)

Putting Apache Kafka to Use!

Building a Real-time Data Platform for Event Streams!

JAY KREPS, CONFLUENT!

(2)

A Couple of Themes!

(3)

Theme 1: Rise of Events!

(4)

Theme 2: Immutability Everywhere!

Level! Example! Immutable Alternative!

Mutable local state! Counter in a for loop! Functional Programming!

Mutable process-wide state! ConcurrentHashMap! Functional Programming!

!

Mutable on disk structures! B-Tree! LSM!

Distributed systems! Dynamo-like key-value store! State machine replication!

Mutability in databases! RDBMS! Event Sourcing!

Company-wide data flow! Double write! Kafka!

(5)

Theme 3: Datacenter-Level Thinking!

(6)

Experience at LinkedIn!

(7)

2009: We want all our data in Hadoop!!

(8)

What is all our data?!

(9)

Initial approach: “gut it out”!

(10)

Problems!

•  Data coverage!

•  Many source systems!

•  Relational DBs!

•  Log files!

•  Metrics!

•  Messaging systems!

•  Many data formats!

•  Constant change!

•  New schemas!

•  New data sources!

(11)

Needed: organizational scalability!

Θ(N) => Θ(1)!

(12)

How does everything else work?!

?!

(13)

Relational database changes!

Relational Databases

App App App

Cache

Poll For Changes

Caches &

Derived Stores

Relational Data Warehouse

ODS Data Guard

Hadoop CSV Dump

Transforms

Transforms Apps and Services

OLTP Queries

(14)

NoSQL!

Key-value Store

Hadoop ETL Load

App App App

(15)

User events!

Apps and Services

Apps and Services

Apps and Services

Log Aggregation HTTP

NFS

NFS

Hadoop

Relational Data Warehouse rsync

Transform & Load Load

Transform

(16)

Application Logs!

Apps and Services

Apps and Services

Apps and Services

Splunk

(17)

Messaging!

App App App

Broker

Processor Processor

App App

Broker

Processor Processor

App App

Broker

Processor Processor

App

Processor Processor

(18)

Metrics and operational data!

App App App

Monitoring

(19)

This is a giant mess!

Relational Databases

App App App

Cache

Poll For Changes

Caches &

Derived Stores

Relational Data Warehouse

ODS Data Guard

Hadoop CSV Dump

Transforms

Transforms Apps and Services

Splunk

ActiveMQ

Apps

ActiveMQ

Apps Apps

Log Aggregation HTTP

NFS

NFS rsync

Transform & Load Load Monitoring

Apps and Services Apps and Services

HTTP

Key-value Store Apps

OLTP Queries

(20)

Impossible ideas!

•  Publish data from Hadoop to a search index!

•  Run a SQL query to find the biggest latency bottleneck!

•  Run a SQL query to find common error patterns!

•  Low latency monitoring of database changes or user activity!

•  Incorporate popularity in real-time display and relevance algorithms!

•  Products that incorporate user activity!

(21)

An infrastructure solution?!

(22)

Idea: Stream Data Platform!

Stream Data Platform:

?

HADOOP:

Offline Data

Spark

Synchronous

Req/Response Near real time

0 - 100s ms > 100s ms

Offline batch

> 1 hour Real-time

Analytics

Impala

Map- Reduce Hive

RDBMS DWH

NoSQL

Monitoring Search

Apps

Stream Processing

(23)

First Attempt: Messaging systems!!

(24)

Problems!

•  Throughput!

•  Batch systems!

•  Persistence!

•  Stream Processing!

•  Ordering guarantees!

•  Partitioning!

(25)

Second Attempt: Build Kafka!!

(26)

What does it do?!

Kafka Cluster

Consumer Consumer Consumer Consumer Consumer Producer Producer Producer Producer Producer

(27)

Commit Log Abstraction!

0 1 2 3 4 5 6 7 8 9 1 0

1 1

1 2

Old New

Writes Reader 1 Reader 2

(28)

Logs & Publish-Subscribe Messaging!

0 1 2 3 4 5 6 7 8 9 1 0

1 1

1 Log 2

Source System

writes

Destination System A

reads

Destination System B

reads

(29)

A Kafka Topic!

0 1 2 3 4 5 6 7 8 9 1 0

1 1

1 2 Partition

0

0 1 2 3 4 5 6 7 8 9

0 1 2 3 4 5 6 7 8 9 1 0

1 1

1 2 Partition

1

Partition 2

Old New

Writes

(30)

Replication!

Server 1 Server 2 Server 3

A:0 A:0 A:0

A:1 A:1

A:1

B:0 B:0 Controller

(31)

Scaling Consumers!

Server 1 P0

Server 2

P3 P1 P2

Kafka Cluster

Consumer Group A

C1 C2

Consumer Group B

C3 C4 C5 C6

(32)

Kafka: A Modern Distributed System for Streams!

  Scalability of a filesystem!

◦ Hundreds of MB/sec/server throughput!

◦ Many TB per server!

  Guarantees of a database!

◦ Messages strictly ordered!

◦ All data persistent!

  Distributed by default!

◦ Replication!

◦ Partitioning model!

  Producers, Consumers, and Brokers all fault tolerant and horizontally scalable!

(33)

Stream Data Platform!

KAFKA:

Stream Data Platform

HADOOP:

Offline Data

Spark

Synchronous

Req/Response Near real time

0 - 100s ms > 100s ms

Offline batch

> 1 hour Real-time

Analytics

Impala

Map- Reduce Hive

RDBMS DWH

NoSQL

Monitoring Search

Apps

Stream Processing

(34)

Batch Data => Batch Processing!

(35)

Stream processing is a!

generalization!

of batch processing !

and request/response processing!

(36)

Request/Response processing: !

One input => One output!

(37)

Batch processing: !

All inputs => All outputs!

(38)

Stream Processing: !

Some inputs => some outputs!

(you choose how much “some” is)!

(39)

Stream Processing a la carte!

Transform Transform Transform Input Kafka Topic

Live Data Store

Hadoop

Transform Transform Transform Intermediate

Kafka Topic

Your code

Output Kafka Topic

cat input | grep “foo” | wc -l

(40)

Stream Processing with Frameworks!

+! =! Processing! Stream

(41)

Unix Pipes, Modernized!

cat /usr/share/dict/words | wc -l

(42)

On Schemas!

Bad Schemas < No Schemas < Good Schemas!

(43)

Put it all together!

Kafka Log

Search Monitoring

Real-time Analytics Social

Graph Search Newsfeed OLAP

Samza Apps Key-Value

Storage Oracle

Apps Apps

Apps Apps

Security &

Fraud

Hadoop Teradata

Apps

(44)

At LinkedIn!

•  Everything in the company is a real-time stream!

•  > 800 billion messages written per day!

•  > 2.9 trillion messages read per day!

•  ~ 1 PB of stream data!

•  Tens of thousands of producer processes!

•  Backbone for data stores!

•  Search!

•  Social Graph!

•  Newsfeed!

•  Primary storage (in progress)!

•  Basis for stream processing!

(45)

Elsewhere!

(46)

Why this is the future!

1. System diversity is increasing!

2. Data diversity and volume is

increasing!

3.  The world is getting faster!

4.  The technology exists!

(47)

•  Mission: Make this a practical reality everywhere!

•  Product!

•  Apache Kafka!

•  Schemas and metadata management!

•  Connectors for common systems!

•  Monitor data flow end-to-end!

•  Stream processing integration

!

(48)

Questions?!

•  Confluent!

•  @confluentinc!

•  http://confluent.io !

•  http://blog.confluent.io/2015/02/25/

stream-data-platform-1 !

•  Apache Kafka!

•  @apachekafka!

•  http://kafka.apache.org!

•  http://linkd.in/199iMwY !

•  Me!

•  @jaykreps!

References

Related documents

It involves a mixed- methods approach in which an both an online survey and individual focus groups are used to assess whether these young people would feel confident in

Wholesale and retail trade and repair of motor vehicles and motorcycles.. Comert cu ridicata si cu amanuntul, intretinerea si repararea autovehiculelor si

14 When black, Latina, and white women like Sandy and June organized wedding ceremonies, they “imagine[d] a world ordered by love, by a radical embrace of difference.”

Central to this thesis reports is the research conducted in case studies carried out in two white goods manufacturers in Anhui Province. The aim of this

•  Option 2: Write a custom NetCDF to Hadoop sequencer and keep the files together. »  Basically puts indexes into the files so Hadoop can parse by key »  Maintains the

  Trening za pripremu certifikata  Microsoft Certified Technology  Specialist (MCTS): SQL Server 2005   

c. Trace deposits in the cash receipts +ournal to the cash balance in the general ledger . d. Compare the cash balance in the general ledger with the bank

            The state diagram above shows the general logic of the controller. So at S0 the only