SINGLE PLATFORM. COMPLETE SCALABILITY.
Real Time Analy:cs for Big Data
About Me
MTBK Junky Technology addict A Proud Dad Head of Product @ GigaSpaces•
Ecommerce – Auc=on monitoring, addwards
•
Search engines
•
Real-‐=me Marke=ng – Improving conversion rate
•
Weather repor=ng
•
Traffic analysis
•
Call Center Management
•
Supply-‐Chain Op=miza=on
•
Quality Management in Manufacturing
•
SLA Monitoring and Maintenance
•
Global Shipment & Delivery Monitoring
•
Fraud Detec=on in Financial Companies
Real Time Analy:cs Use Cases
Analy:cs @ TwiJer
• How many request/day?
• What’s the average latency?
• How many signups, sms, tweets?
Counting
• Desktop vs Mobile user ?
• What devices fail at the same time?
• What features get user hooked?
Correlating
• Duplicate detection
• Sentiment analysis
• Patterns and trends
Note the Time dimension
• Real time (msec/sec)
Counting
• Near real time(Min/Hours)
Correlating
• Batch (Days..)
The data resolu:on & processing models
• Mostly Event Driven
• High resolution – every tweet counts
Counting
• Ad-hoc queries
• Mid resolution - Aggregated counters
Correlating
• Pre generated reports
• Cross grain resolution – trends,..
Tradi:onal analy:cs applica:ons
•
Scale-‐up Database
– Use tradi=onal SQL database
– Use stored procedure for event driven reports – Use flash memory disks to reduce disk I/O – Use read only replica to scale-‐out read queries
•
Limita=ons
– Doesn’t scale on write
CEP – Complex Event Processing
•
Process the data as it comes
•
Maintain a window of the data in-‐memory
•
Pros:
– Extremely low-‐latency – Rela=vely low-‐cost
•
Cons
– Hard to scale (Mostly limited to scale-‐up) – Not agile -‐ Queries must be pre-‐generated – Fairly complex
8
In Memory Data Grid
•
Distributed in-‐memory database
•
Scale out
•
Pros
– Scale on write/read
– Fits to event driven (CEP style) , ad-‐hoc query model
•
Cons
-‐ Cost of memory vs disk
NoSQL
•
Use distributed database
– Hbase, Cassandra, MongoDB
•
Pros
– Scale on write/read – Elas=c
•
Cons
– Read latency
– Consistency tradeoffs are hard – Maturity – fairly young technology
10
Hadoop MapReudce
•
Distributed batch processing
•
Pros
– Designed to process massive amount of data – Mature
– Low cost
•
Cons
Hadoop Map/Reduce – Reality check..
“With the paths that go through Hadoop [at Yahoo!], the latency is about fifteen minutes. … [I]t will never be true
real-time..” (Yahoo CTO Raymie Stata)
Hadoop/Hive..Not realtime. Many dependencies. Lots of points of failure. Complicated system. Not dependable
enough to hit realtime goals ( Alex Himel, Engineering
Manager at Facebook.)
"MapReduce and other batch-processing systems cannot process small updates individually as they rely on creating large batches for efficiency,“ (Google senior director of
engineering Eisar Lipkovitz)
12
So what’s the boJom line?
One size doesn’t fit all..
The solution has to be a
combination of several
FACEBOOK REAL-‐TIME
ANALYTICS SYSTEM
Goals
•
Show why plugins are valuable
– What value is your business deriving from it?
•
Make the data more ac=onable
– Help users take ac=on to make their content more valuable.
– How many people see a plugin, how many people take ac=on on it, and how many are converted to traffic back on your site.
•
Make the data more =mely
– Went from a 48-‐hour turn around to 30 seconds.
– Mul=ple points of failure were removed to make this goal.
•
Handle massive load
The actual analy:cs..
•
Like buJon analy:cs
•
Comments box analy:cs
16
Technology Evalua:on
•
MySQL DB Counters
•
In-‐Memory Counters
•
MapReduce
•
Cassandra
•
HBase
PTail Scribe Puma Hbase FACEBOOK Log FACEBOOK Log FACEBOOK Log HDFS
Real Time Long Term
Batch 1.5 Sec
The solu:on..
10,000 write/sec per serverChecking the assump:ons..
Memory is still core
• “(We) write extremely lean log lines. The more compact the log lines the more can be stored in memory..”
• “(We) batch for 1.5 seconds on average. Would like to batch longer but they have so many URLs that they run out of
memory when creating a hashtable”
The NoSQL space is very dynamic..
• “When Facebook engineers started the project 6 months ago, Cassandra did not have distributed counters which is now committed in trunk.”. (Eric Hauser Senior
Facebook Analy:cs.Next..
•
What if..
20
® Copyright 2011 Gigaspaces Ltd. All Rights Reserved
We can rely on memory as
a reliable store?
We can’t decide on a
particular NoSQL
database?
We need to package the
solution as a product?
Step 1: Use memory..
•
We rely on memory
anyway to get 10k msg/
sec..
•
Why not use memory to
store the events
•
Reliability is achieved
through redundancy and
replica=on
Events FACEBOOK FACEBOOK FACEBOOK Memory Grid Data Grid Data Grid Data GridStep 1: Use memory..
•
We rely on memory
anyway to get 10k msg/
sec..
•
Why not use memory to
store the events
•
Reliability is achieved
through redundancy and
replica=on
22
® Copyright 2011 Gigaspaces Ltd. All Rights Reserved
Events FACEBOOK FACEBOOK FACEBOOK Any API Data Grid
Step 2 – Collocate
Events FACEBOOK FACEBOOK FACEBOOK Processing Grid Data Grid Data Grid Data Grid•
Pulng the code
together with the
data.
Step 2 – Collocate
Events FACEBOOK FACEBOOK FACEBOOK Processing Grid Data Grid Data Grid Data Grid•
Pulng the code
together with the
data.
@EventDriven @Polling
public class SimpleListener {
@EventTemplate
Data unprocessedData() {
Data template = new Data();
template.setProcessed(false);
return template;
}
@SpaceDataEvent
public Data eventListener(Data event) {
//process Data here
}
Step 3 – Write behind to SQL/NoSQL
Events FACEBOOK FACEBOOK FACEBOOK Processing Grid Data Grid Data Grid Data GridData
Source
Adaptor
MySQL HBase CassandraOpen Long Term persistency
Write Behind
Economic Data Scaling
High Memory
Memory 192GB
Cores 12 cores
Clock speed 3.2 GHhz
Dell Price $367/month
$1.9/GB TB (~960GB)/
Month
5x Blades = $1835/month
•
Combine memory and disk
– Memory is x10, x100 lower than disk for high data access rate (Stanford research)
– Disk is lower at cost for high capacity lower access rate. – Solu=on:
• Memory -‐ short-‐term data,
• Disk -‐ long term. data
– Only ~16G required to store the
log in memory ( 500b messages at 10k/h ) at a cost of ~32$ month per server.
Economic Opera:ons Scaling
• Automa=on -‐ reduce opera=onal cost
• Elas=c Scaling – reduce over provisioning cost
• Cloud portability (JClouds) – choose the right cloud for the job
Pu_ng it all together
Analytic Application Event Sources Write behind- In Memory Data Grid - RT Processing Grid
• Light Event Processing
• Map-reduce
• Event driven
• Execute code with data
• Transactional
• Secured
• Elastic
NoSQL DB
• Low cost storage
• Write/Read
scalability
• Dynamic scaling
• Raw Data and
aggregated Data
Generate Patterns
Pu_ng it all together
Analytic Application Event Sources Write behind- In Memory Data Grid - RT Processing Grid
• Light Event Processing
• Map-reduce
• Event driven
• Execute code with data
• Transactional
• Secured
• Elastic
NoSQL DB
• Low cost storage
• Write/Read
scalability
• Dynamic scaling
• Raw Data and
aggregated Data
Generate Patterns
Script script = new
StaticScritpt(“groovy”,”println hi; return 0”) Query q = em.createNativeQuery(“execute ?”); q.setParamter(1, script); Integer result =
5x beJer performance per server!
• Hardware – Linux
– HP DL380 G6 servers -‐ each has:
– 2 Intel quad-‐core Xeon X5560 processors (2.8 Ghz Nehalem)
– 32 Gb RAM (4GB per core)
– 6 * 146 Gb 15K RPM SAS disks – Red Hat 5.2 Event injector Up to 128 threads GigaSpaces/ (Other Msg Server) App Services Up to 128 threads 0 10,000 20,000 30,000 40,000 50,000 60,000 Event injection throughput Event injection throughput with write multiple EJB/Remoting service invocation throughput GS WLS Other Giga 50,000 write/sec per server
Pu_ng it all together – Elas:c Big Data Plaborm
• The best of both worlds
• Support Real Time and Batch
• Fully managed stack
• Makes the development and
deployment of Big Data
applica=on significantly simpler • Extremely cost effec=ve
– Best ra=o of Disk + Memory
– Run on any cloud
Other benefits
• Built-in Pub/Sub • Built-in CEP
Designed for real
time event
processing
• Standard Query • Any database
Open
• Transactional, consistent
• Survive complete database failure
Reliable
• Can be packaged into a single product • Fully automated deployment
• End to end management and monitoring
Simple
32
Further reading..
natishalom.typepad.com
• Real Time Analytics for Big Data: An Alternative Approach
GigaOM
• Big data in real time is no fantasy
Highscalability.com
• Facebook's New Realtime Analytics System: HBase To Process 20 Billion Events Per Day
THANK YOU!
@uri1803
hJp://blog.gigaspaces.com
34Economic Scaling
JClouds Cloud Driver
Worker Role
Load Balancer
Cloudify Application Cluster
VM Instance
Storage VM Instance
Controll er
Cloudify Agent Cloudify Agent
Console
Controller
Scale-in Scale-out