Real Time Data Processing using Spark Streaming
(1)

Hari Shreedharan, Software Engineer @ Cloudera Committer/PMC Member, Apache Flume

Committer, Apache Sqoop

Real Time Data Processing

using Spark Streaming

(2)

Motivation for Real-Time Stream Processing

Data is being created at unprecedented rates

Exponential data growth from mobile, web, social

Connected devices: 9B in 2012 to 50B by 2020

Over 1 trillion sensors by 2020

Datacenter IP traffic growing at CAGR of 25%

How can we harness this data in real time?

Value can quickly degrade → capture value immediately

From reactive analysis to direct operational impact

Unlocks new competitive advantages

Requires a completely new approach...

(3)

Use Cases Across Industries

Credit: Identify fraudulent transactions as soon as they occur.

Transportation: Dynamic re-routing of traffic or vehicle fleets.

Retail: Dynamic inventory management; real-time in-store offers and recommendations.

Consumer Internet & Mobile: Optimize user engagement based on the user’s current behavior.

Healthcare: Continuously monitor patient vital stats and proactively identify at-risk patients.

Manufacturing: Identify equipment failures and react instantly.

Surveillance: Perform proactive surveillance; identify threats and intrusions in real time.

Digital Advertising & Marketing: Optimize and personalize content in real time.

(4)

From Volume and Variety to Velocity

Big Data has evolved:

Past: Big Data = Volume + Variety. Batch processing, with time to insight measured in hours.

Present: Big Data = Volume + Variety + Velocity. Batch + stream processing.

The Hadoop ecosystem evolves as well…

(5)

Key Components of Streaming Architectures

Data Ingestion & Transportation Service (e.g., Kafka, Flume)

Real-Time Stream Processing Engine

Real-Time Data Serving

System Management

Security

(6)

Canonical Stream Processing Architecture

[Diagram: data sources feed Kafka/Flume for data ingest; stream processing applications (App 1, App 2, …) consume the streams and write results to HDFS and HBase.]

(7)

Spark: Easy and Fast Big Data

Easy to develop: rich APIs in Java, Scala, and Python; an interactive shell; 2-5× less code.

Fast to run: general execution graphs and in-memory storage; up to 10× faster on disk, 100× in memory.

(8)

Spark Architecture

[Diagram: a Driver coordinates multiple Workers; each Worker holds Data cached in RAM.]

(9)

RDDs

RDD = Resilient Distributed Dataset

Immutable representation of data; operations on one RDD create a new one

Memory caching layer that stores data in a distributed, fault-tolerant cache

Created by parallel transformations on data in stable storage

Lazy materialization (sketched below)

Two observations:

a. Can fall back to disk when the data set does not fit in memory

b. Provides fault tolerance through the concept of lineage
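
A minimal sketch of these properties (the input path is a placeholder):

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("rdd-sketch"))

val lines = sc.textFile("hdfs://...")            // RDD over data in stable storage
val errors = lines.filter(_.contains("ERROR"))   // transformation: creates a new, immutable RDD
errors.cache()                                   // distributed cache; falls back to disk if it does not fit

errors.count()                                   // action: triggers the lazy materialization
// If a cached partition is lost, Spark recomputes it from its lineage (textFile → filter).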

(10)

Spark Streaming

An extension of Apache Spark’s core API for stream processing.

The framework provides:

Fault tolerance

Scalability

High throughput

(11)

Spark Streaming

Incoming data represented as Discretized Streams (DStreams)

Stream is broken down into micro-batches

Each micro-batch is an RDD – can share code between batch and streaming

(12)

val tweets = ssc.twitterStream()
val hashTags = tweets.flatMap(status => getTags(status))
hashTags.saveAsHadoopFiles("hdfs://...")

[Diagram: each micro-batch of the tweets DStream (batch @ t, batch @ t+1, batch @ t+2) is transformed by flatMap into a batch of the hashTags DStream, and each resulting batch is saved.]

“Micro-batch” architecture: the stream is composed of small (1-10 s) batch computations.

(13)

Use DStreams for Windowing Functions
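
A minimal sketch of a windowing function, assuming the hashTags DStream from the previous slide: count tags over a sliding 60-second window that advances every 10 seconds.

import org.apache.spark.streaming.Seconds

val tagCounts = hashTags
  .map(tag => (tag, 1))
  .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(60), Seconds(10))
// Each emitted RDD covers the last 60 seconds of data, recomputed every 10 seconds.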

(14)

Spark Streaming

Runs as a Spark job

YARN or standalone for scheduling

YARN has KDC integration

Use the same code for real-time Spark Streaming and for batch Spark jobs.

Integrates natively with messaging systems such as Flume, Kafka, ZeroMQ, …

Easy to write “Receivers” for custom messaging systems (a sketch follows).
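
A minimal sketch of a custom receiver for a hypothetical line-oriented socket source (the class name and endpoint are illustrative):

import java.io.{BufferedReader, InputStreamReader}
import java.net.Socket
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

class SocketLineReceiver(host: String, port: Int)
    extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  def onStart(): Unit = {
    // Receive on a separate thread so onStart() can return immediately.
    new Thread("Socket Line Receiver") {
      override def run(): Unit = receive()
    }.start()
  }

  def onStop(): Unit = { /* the receive loop exits once isStopped() is true */ }

  private def receive(): Unit = {
    val socket = new Socket(host, port)
    val reader = new BufferedReader(new InputStreamReader(socket.getInputStream))
    var line = reader.readLine()
    while (!isStopped() && line != null) {
      store(line)                 // hand the record to Spark Streaming
      line = reader.readLine()
    }
    reader.close()
    socket.close()
    restart("Reconnecting")       // ask Spark to restart this receiver
  }
}

// Usage: val lines = ssc.receiverStream(new SocketLineReceiver("host", 9999))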

(15)

Sharing Code between Batch and Streaming

import org.apache.spark.rdd.RDD

def filterErrors(rdd: RDD[String]): RDD[String] = {
  rdd.filter(s => s.contains("ERROR"))
}

A library function that filters “ERROR” records.

Streaming generates RDDs periodically, so any code that operates on RDDs can be used in streaming as well.

(16)

Sharing Code between Batch and Streaming

Spark:

val lines = sc.textFile(…)
val filtered = filterErrors(lines)
filtered.saveAsTextFile(...)

Spark Streaming:

val dStream = FlumeUtils.createStream(ssc, "34.23.46.22", 4435)
// transform applies the same RDD-to-RDD function to each micro-batch
val filtered = dStream.transform((rdd: RDD[String], time: Time) => filterErrors(rdd))
filtered.saveAsTextFiles(…)

(17)

Reliability

Received data is automatically persisted to a write-ahead log (WAL) on HDFS to prevent data loss

Set spark.streaming.receiver.writeAheadLog.enable=true in the Spark conf (sketched below)

When AM dies, the application is restarted by YARN

Received, ack-ed and unprocessed data replayed from WAL (data that made it into blocks)

Reliable Receivers can replay data from the original source, if required

Un-acked data replayed from source.

Kafka, Flume receivers bundled with Spark are examples

Reliable Receivers + WAL = No data loss on driver or receiver failure!
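
A minimal sketch of enabling the WAL (app name and paths are placeholders; the WAL lives under the checkpoint directory, which must be on a fault-tolerant filesystem such as HDFS):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("wal-sketch")
  .set("spark.streaming.receiver.writeAheadLog.enable", "true")

val ssc = new StreamingContext(conf, Seconds(2))
ssc.checkpoint("hdfs://...")   // received blocks are logged under the checkpoint directory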

(18)

Kafka Connectors (both sketched below)

Reliable Kafka DStream:

Stores received data to a write-ahead log on HDFS for replay

No data loss

Stable and supported!

Direct Kafka DStream:

Uses the low-level API to pull data from Kafka

Replays from Kafka on driver failure

No data loss

Experimental
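
A minimal sketch of both styles as they looked around Spark 1.3 (ZooKeeper quorum, broker list, group, and topic names are placeholders):

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

// Receiver-based reliable DStream; pair it with the WAL for no data loss.
val receiverStream =
  KafkaUtils.createStream(ssc, "zk1:2181", "my-consumer-group", Map("events" -> 1))

// Direct (receiver-less) DStream; tracks offsets itself and replays from Kafka.
val directStream =
  KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
    ssc, Map("metadata.broker.list" -> "broker1:9092"), Set("events"))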

(19)

Flume Connector

Flume Polling DStream (sketched below)

Add the Spark sink (available from Maven) to Flume’s plugin directory

Flume Polling Receiver polls the sink to receive data

Replays received data from WAL on HDFS

No data loss

Stable and Supported!
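
A minimal sketch of the polling receiver (the agent host and port are placeholders; the Flume agent must be running the Spark sink):

import org.apache.spark.streaming.flume.FlumeUtils

// Poll the Spark sink running inside the Flume agent.
val flumeStream = FlumeUtils.createPollingStream(ssc, "flume-agent-host", 9988)
val bodies = flumeStream.map(event => new String(event.event.getBody.array()))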

(20)

Spark Streaming Use-Cases

Real-time dashboards

Show approximate results in real-time

Reconcile periodically with source-of-truth using Spark

Joins of multiple streams (sketched below)

Time-based or count-based “windows”

Combine multiple sources of input to produce composite data

Re-use RDDs created by Streaming in other Spark jobs.
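
A minimal sketch of a stream-stream join, assuming two hypothetical line-oriented sources keyed by their first comma-separated field:

import org.apache.spark.streaming.Seconds

val left = ssc.socketTextStream("host1", 9999).map(l => (l.split(",")(0), l))
val right = ssc.socketTextStream("host2", 9999).map(l => (l.split(",")(0), l))

// Join per micro-batch; windowing one side joins each batch of `left`
// against the last 30 seconds of `right` (a time-based window).
val joined = left.join(right.window(Seconds(30)))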

(21)

What is coming?

Run on Secure YARN for more than 7 days!

Better Monitoring and alerting

Batch-level and task-level monitoring

SQL on Streaming

Run SQL-like queries on top of Streaming (medium – long term)

Python!

Limited support coming in Spark 1.3

(22)

Current Spark project status

400+ contributors and 50+ companies contributing

Including Databricks, Cloudera, Intel, Yahoo!, etc.

Dozens of production deployments

Spark Streaming survived the Netflix Chaos Monkey – production ready!

Included in CDH!

(23)

More Info..

CDH Docs: http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH5/latest/CDH5-Installation-Guide/cdh5ig_spark_installation.html

Cloudera Blog: http://blog.cloudera.com/blog/category/spark/

Apache Spark homepage: http://spark.apache.org/

Github: https://github.com/apache/spark

(24)

Thank you

hshreedharan@cloudera.com

15% discount code for Cloudera training: PNWCUG_15
