Real-time Big Data Analytics with Storm

Full text

(1)

Real-time Big Data Analytics with Storm

June 2013

Ron Bodkin

Founder & CEO, Think Big

(2)

No SQL Roadshow| 2

Accelerating Your Time to Value

Strategy and Roadmap

IMAGINE

Training and Education

ILLUMINATE

Hands-On Data Science and

Data Engineering

IMPLEMENT

Leading Provider of

Data Science and Engineering Services

(3)

Agenda

Ÿ 

What is Storm?

Ÿ 

When is Storm appropriate?

Ÿ 

API Options

Ÿ 

Use Cases

(4)

No SQL Roadshow| 4

Ÿ 

DAG Processing of never ending streams of data

-  Open Source: https://github.com/nathanmarz/storm/wiki -  Used at Twitter plus > 24 other companies

-  Reliable - At Least Once semantics -  Think MapReduce for data streams -  Java / Clojure based

-  Bolts in Java and ‘Shell Bolts’

-  Not a queue, but usually reads from a queue.

Ÿ 

Related:

-  S4, CEP

Ÿ 

Compromises

-  Static topologies & cluster sizing – Avoid messy dynamic rebalancing -  Nimbus – SPOF

Ÿ 

Strong Community Support, emerging commercial support

Storm Overview

(5)

Ÿ 

Cluster

- Nimbus - Zookeeper

- Supervisor, Workers

Ÿ 

Topology

Ÿ 

Streams

Ÿ 

Tuple

Ÿ 

Spout

Ÿ 

Bolt

Ÿ 

Stream Groupings

- Shuffle, Field

Ÿ 

Trident

Ÿ 

DRPC

Storm Concepts

(6)

No SQL Roadshow| 6

What is Real-Time?

Ÿ 

Low latency

-  Query response -  Data refresh

-  End-to-end response

Ÿ 

… nanoseconds, milliseconds, seconds, or minutes depending on your problem

(7)

Ÿ 

Better end-user experience

-  Ex: View an ad, see the counter move.

Ÿ 

Operational intelligence

-  Low latency analysis -  Operational intelligence

Ÿ 

Event response

-  Content half life measured at 3 hours (H Mason: http://bit.ly/nu7IDw) -  Path to additional real-time capabilities

-  Scalable analysis

-  Example: Trend analysis to recommend ‘hot’ articles.

Why Real-time?

(8)

No SQL Roadshow| 8

Ÿ 

Scalable way to manage queue readers and logic pipeline

Ÿ 

Much better than roll your own

Ÿ 

Reliable (Message guarantees, fault tolerant)

Ÿ 

Multi-node scaling (1MM messages / 10 nodes)

Ÿ 

It works

Ÿ 

Open source community

Ÿ 

See also: https://github.com/nathanmarz/storm/wiki/Rationale

Why Storm?

(9)

Ÿ 

Raw Java API (tuples and raw computation)

Ÿ 

Trident – Java DSL (SQL-like primitives)

Ÿ 

Python adapter

Ÿ 

Closure DSL

Ÿ 

RedStorm – Ruby

Ÿ 

Wukong – Ruby

Ÿ 

Trident-ML

Ÿ 

Storm-Esper

Programming Options

(10)

No SQL Roadshow| 10

Use Cases

Ÿ  Scale analysis pipeline

Ÿ  Lively stats

Ÿ  Recommendations

Ÿ  Better predictions

Ÿ  Realtime analytics

Ÿ  Online machine learning

Ÿ  Distributed RPC

(11)

Overall Architecture

NoSQL

Memcache (Tuple fail tracking)

Queue

Hadoop

Ad Serving

LB

Edge

Edge Impression

S3S3

DFSS3 Archive Logs

Management Server

LB

Edge

Edge

Relational Store

Ad Management Ad Selling

Storm

- Queue Management - Simple Bot Filtering - Real-time Bucketization - Performance Counters - Event Logging

View Ad

Cleansing Model Training Recommendations Events

Monitoring & Alerting (Metrics, Alarms, Notifications)

Model Parameters getPrediction

Performance Counters Impression Buckets

(12)

No SQL Roadshow| 12

Integration Patterns

(13)

Analytics Architecture

Storm

Web Server

Time Series BucketBolt Simple Bot

Annotator

DFS Adapter Impression

Spout

Time Series Buckets

(Batch)

Time Series Buckets (Realtime)

Impression Prediction Predictive

Model Parameters Impressions

Impressions Impressions

Hadoop

Impression Bucketization

Predictive Model Training

NoSQL Bolt

Time

(14)

No SQL Roadshow| 14

Analyze

Massive Historical

Data Set

Analyze

Recent

Past

Realtime

Prediction

Solution Approach

Historical Data Set = S3 Analyze = Hadoop + Pig + R

Recent Past = Storm + NoSQL Analyze = R + Web Service

(15)

Example: Real-Time Recommendation Updates

Web Servers

Read &

Update

•  Requests read single user profile

•  May update that profile record

•  Classic key/value store problem

•  Can split transient state into read/write NoSQL with batch view NoSQL

Batch Log Feed

Batch Model Feed

Hadoop Cluster

Batch Export

Classic BI Access Ad Hoc

Queries

MPP Database NoSQL

Database

(16)

No SQL Roadshow| 16

Real-Time Recommendation Updates w/ Storm

Web Servers

•  More complex

recommendation logic

•  Parallelize analysis across topology

•  Processing benefits from

distributed logic of streaming

Batch Log Feed

Batch Model Feed

Hadoop Cluster

Batch Export

Classic BI Access Ad Hoc

Queries Storm

MPP Database NoSQL

Database

(17)

Social Updates (real-time news feed)

Web Servers

•  Pages now integrate many profile records

•  Fan-out-on-read or Fan-out-on-write

•  Processing benefits from

distributed logic of streaming

Batch Log Feed

Batch Model Feed

Hadoop Cluster

Batch Export

Classic BI Access Ad Hoc

Queries Storm

MPP Database NoSQL

Database

(18)

No SQL Roadshow| 18

Finance: Trade Compliance Monitoring

Order Mgmt System

•  Storm reduces market data to usable subset

•  Integrates high, medium and low frequency data

•  Real-time Storm filter triggers deeper search via NoSQL

Batch Log Feed

Batch Model Feed

Hadoop Cluster

Batch Export

Classic BI Access Ad Hoc

Queries Storm

MPP Database NoSQL

Database Market Data

Feed

(19)

Operational Intelligence: Stats and Search

Web Servers

•  Cache and microbatch

updates to multidimensional stats

•  Support distributed queries

•  Cache updates

Hadoop Cluster

Ad Hoc Queries Storm

Sys Logs

Batch System Log Feed

Batch Summary Views

Search &

NoSQL DB

(20)

No SQL Roadshow| 20

Ÿ 

Easy to develop, hard to debug – start locally

-  Timeouts

Ÿ 

Storm infinite loop of failures

-  Use Memcached to count # tuple failures

Ÿ 

At Least Once processing

-  Hadoop based read-repair job

Ÿ 

Performance Counters not getting flushed

-  Tick Tuples

Ÿ 

Always ACK

Ÿ 

Batching to S3

-  Run a compaction & event de-duplication job in Hadoop

Lessons

(21)

Conclusions

Ÿ  There’s many kinds of real-time problems

Ÿ  Use of Hadoop and/or NoSQL can solve

-  Low latency queries

-  Event response with localized intelligence -  Operational intelligence

Ÿ  Storm is valuable for

-  Ingesting data within seconds

-  Complex real-time distributed logic -  Operational intelligence

Ÿ  We’re Hiring!

Figure

Updating...

References

Updating...