Real-time Big Data Analytics with Storm
June 2013
Ron Bodkin
Founder & CEO, Think Big
No SQL Roadshow| 2
Accelerating Your Time to Value
Strategy and Roadmap
IMAGINE
Training and Education
ILLUMINATE
Hands-On Data Science and
Data Engineering
IMPLEMENT
Leading Provider of
Data Science and Engineering Services
Agenda
What is Storm?
When is Storm appropriate?
API Options
Use CasesNo SQL Roadshow| 4
DAG Processing of never ending streams of data- Open Source: https://github.com/nathanmarz/storm/wiki - Used at Twitter plus > 24 other companies
- Reliable - At Least Once semantics - Think MapReduce for data streams - Java / Clojure based
- Bolts in Java and ‘Shell Bolts’
- Not a queue, but usually reads from a queue.
Related:- S4, CEP
Compromises- Static topologies & cluster sizing – Avoid messy dynamic rebalancing - Nimbus – SPOF
Strong Community Support, emerging commercial supportStorm Overview
Cluster- Nimbus - Zookeeper
- Supervisor, Workers
Topology
Streams
Tuple
Spout
Bolt
Stream Groupings- Shuffle, Field
Trident
DRPCStorm Concepts
No SQL Roadshow| 6
What is Real-Time?
Low latency- Query response - Data refresh
- End-to-end response
… nanoseconds, milliseconds, seconds, or minutes depending on your problem
Better end-user experience- Ex: View an ad, see the counter move.
Operational intelligence- Low latency analysis - Operational intelligence
Event response- Content half life measured at 3 hours (H Mason: http://bit.ly/nu7IDw) - Path to additional real-time capabilities
- Scalable analysis
- Example: Trend analysis to recommend ‘hot’ articles.
Why Real-time?
No SQL Roadshow| 8
Scalable way to manage queue readers and logic pipeline
Much better than roll your own
Reliable (Message guarantees, fault tolerant)
Multi-node scaling (1MM messages / 10 nodes)
It works
Open source community
See also: https://github.com/nathanmarz/storm/wiki/RationaleWhy Storm?
Raw Java API (tuples and raw computation)
Trident – Java DSL (SQL-like primitives)
Python adapter
Closure DSL
RedStorm – Ruby
Wukong – Ruby
Trident-ML
Storm-EsperProgramming Options
No SQL Roadshow| 10
Use Cases
Scale analysis pipeline
Lively stats
Recommendations
Better predictions
Realtime analytics
Online machine learning
Distributed RPC
Overall Architecture
NoSQL
Memcache (Tuple fail tracking)
Queue
Hadoop
Ad Serving
LB
Edge
Edge Impression
S3S3
DFSS3 Archive Logs
Management Server
LB
Edge
Edge
Relational Store
Ad Management Ad Selling
Storm
- Queue Management - Simple Bot Filtering - Real-time Bucketization - Performance Counters - Event Logging
View Ad
Cleansing Model Training Recommendations Events
Monitoring & Alerting (Metrics, Alarms, Notifications)
Model Parameters getPrediction
Performance Counters Impression Buckets
No SQL Roadshow| 12
Integration Patterns
Analytics Architecture
Storm
Web Server
Time Series BucketBolt Simple Bot
Annotator
DFS Adapter Impression
Spout
Time Series Buckets
(Batch)
Time Series Buckets (Realtime)
Impression Prediction Predictive
Model Parameters Impressions
Impressions Impressions
Hadoop
Impression Bucketization
Predictive Model Training
NoSQL Bolt
Time
No SQL Roadshow| 14
Analyze
Massive Historical
Data Set
Analyze
Recent
Past
Realtime
Prediction
Solution Approach
Historical Data Set = S3 Analyze = Hadoop + Pig + R
Recent Past = Storm + NoSQL Analyze = R + Web Service
Example: Real-Time Recommendation Updates
Web Servers
Read &
Update
• Requests read single user profile
• May update that profile record
• Classic key/value store problem
• Can split transient state into read/write NoSQL with batch view NoSQL
Batch Log Feed
Batch Model Feed
Hadoop Cluster
Batch Export
Classic BI Access Ad Hoc
Queries
MPP Database NoSQL
Database
No SQL Roadshow| 16
Real-Time Recommendation Updates w/ Storm
Web Servers
• More complex
recommendation logic
• Parallelize analysis across topology
• Processing benefits from
distributed logic of streaming
Batch Log Feed
Batch Model Feed
Hadoop Cluster
Batch Export
Classic BI Access Ad Hoc
Queries Storm
MPP Database NoSQL
Database
Social Updates (real-time news feed)
Web Servers
• Pages now integrate many profile records
• Fan-out-on-read or Fan-out-on-write
• Processing benefits from
distributed logic of streaming
Batch Log Feed
Batch Model Feed
Hadoop Cluster
Batch Export
Classic BI Access Ad Hoc
Queries Storm
MPP Database NoSQL
Database
No SQL Roadshow| 18
Finance: Trade Compliance Monitoring
Order Mgmt System
• Storm reduces market data to usable subset
• Integrates high, medium and low frequency data
• Real-time Storm filter triggers deeper search via NoSQL
Batch Log Feed
Batch Model Feed
Hadoop Cluster
Batch Export
Classic BI Access Ad Hoc
Queries Storm
MPP Database NoSQL
Database Market Data
Feed
Operational Intelligence: Stats and Search
Web Servers
• Cache and microbatch
updates to multidimensional stats
• Support distributed queries
• Cache updates
Hadoop Cluster
Ad Hoc Queries Storm
Sys Logs
Batch System Log Feed
Batch Summary Views
Search &
NoSQL DB
No SQL Roadshow| 20
Easy to develop, hard to debug – start locally- Timeouts
Storm infinite loop of failures- Use Memcached to count # tuple failures
At Least Once processing- Hadoop based read-repair job
Performance Counters not getting flushed- Tick Tuples
Always ACK
Batching to S3- Run a compaction & event de-duplication job in Hadoop
Lessons
Conclusions
There’s many kinds of real-time problems
Use of Hadoop and/or NoSQL can solve
- Low latency queries
- Event response with localized intelligence - Operational intelligence
Storm is valuable for
- Ingesting data within seconds
- Complex real-time distributed logic - Operational intelligence
We’re Hiring!