• No results found

Real Time Analy:cs for Big Data Lessons Learned from Facebook

N/A
N/A
Protected

Academic year: 2021

Share "Real Time Analy:cs for Big Data Lessons Learned from Facebook"

Copied!
36
0
0

Loading.... (view fulltext now)

Full text

(1)

SINGLE  PLATFORM.  COMPLETE  SCALABILITY.  

 

Real  Time  Analy:cs  for  Big  Data  

(2)

About  Me  

MTBK Junky Technology addict A Proud Dad Head of Product @ GigaSpaces

(3)

Ecommerce  –  Auc=on  monitoring,  addwards  

Search  engines    

Real-­‐=me  Marke=ng  –  Improving  conversion  rate  

Weather  repor=ng  

Traffic  analysis    

Call  Center  Management  

Supply-­‐Chain  Op=miza=on  

Quality  Management  in  Manufacturing  

SLA  Monitoring  and  Maintenance  

Global  Shipment  &  Delivery  Monitoring  

Fraud  Detec=on  in  Financial  Companies  

Real  Time  Analy:cs  Use  Cases  

(4)

Analy:cs  @  TwiJer  

•  How many request/day?

•  What’s the average latency?

•  How many signups, sms, tweets?

Counting

•  Desktop vs Mobile user ?

•  What devices fail at the same time?

•  What features get user hooked?

Correlating

•  Duplicate detection

•  Sentiment analysis

•  Patterns and trends

(5)

Note  the  Time  dimension  

•  Real time (msec/sec)

Counting

•  Near real time(Min/Hours)

Correlating

•  Batch (Days..)

(6)

The  data  resolu:on  &  processing  models  

•  Mostly Event Driven

•  High resolution – every tweet counts

Counting

•  Ad-hoc queries

•  Mid resolution - Aggregated counters

Correlating

•  Pre generated reports

•  Cross grain resolution – trends,..

(7)

Tradi:onal  analy:cs  applica:ons  

• 

Scale-­‐up  Database    

–  Use  tradi=onal  SQL  database  

–  Use  stored  procedure  for  event  driven  reports   –  Use    flash  memory  disks  to  reduce  disk  I/O   –  Use  read  only  replica  to  scale-­‐out  read  queries  

 

• 

Limita=ons  

–  Doesn’t  scale  on  write  

(8)

CEP  –  Complex  Event  Processing  

• 

Process  the  data  as  it  comes  

• 

Maintain  a  window  of  the  data  in-­‐memory  

• 

Pros:  

–  Extremely  low-­‐latency   –  Rela=vely  low-­‐cost  

• 

Cons  

–  Hard  to  scale  (Mostly  limited  to  scale-­‐up)   –  Not  agile  -­‐  Queries  must  be  pre-­‐generated   –  Fairly  complex    

8  

(9)

In  Memory  Data  Grid  

• 

Distributed  in-­‐memory  database  

• 

Scale  out  

• 

Pros  

–  Scale  on  write/read  

–  Fits  to  event  driven  (CEP  style)  ,  ad-­‐hoc  query  model  

• 

Cons  

-­‐  Cost  of  memory  vs  disk  

(10)

NoSQL  

• 

Use  distributed  database  

–  Hbase,  Cassandra,  MongoDB    

• 

Pros  

–  Scale  on  write/read   –  Elas=c  

• 

Cons  

–  Read  latency  

–  Consistency  tradeoffs  are  hard   –  Maturity  –  fairly  young  technology  

10  

(11)

Hadoop  MapReudce  

• 

Distributed  batch  processing  

• 

Pros  

–  Designed  to  process  massive  amount  of  data   –  Mature  

–  Low  cost  

• 

Cons  

(12)

Hadoop  Map/Reduce  –  Reality  check..  

“With the paths that go through Hadoop [at Yahoo!], the latency is about fifteen minutes. … [I]t will never be true

real-time..” (Yahoo CTO Raymie Stata)

Hadoop/Hive..Not realtime. Many dependencies. Lots of points of failure. Complicated system. Not dependable

enough to hit realtime goals ( Alex Himel, Engineering

Manager at Facebook.)

"MapReduce and other batch-processing systems cannot process small updates individually as they rely on creating large batches for efficiency,“ (Google senior director of

engineering Eisar Lipkovitz)

12  

(13)

So  what’s  the  boJom  line?  

One size doesn’t fit all..

The solution has to be a

combination of several

(14)

FACEBOOK  REAL-­‐TIME  

ANALYTICS  SYSTEM  

(15)

Goals  

• 

Show  why  plugins  are  valuable    

–  What  value  is  your  business  deriving  from  it?  

• 

Make  the  data  more  ac=onable    

–  Help  users  take  ac=on  to  make  their  content  more  valuable.  

–  How  many  people  see  a  plugin,  how  many  people  take  ac=on  on   it,  and  how  many  are  converted  to  traffic  back  on  your  site.      

• 

Make  the  data  more  =mely  

–  Went  from  a  48-­‐hour  turn  around  to  30  seconds.  

–  Mul=ple  points  of  failure  were  removed  to  make  this  goal.    

• 

Handle  massive  load  

(16)

The  actual  analy:cs..  

• 

Like  buJon  analy:cs  

• 

Comments  box  analy:cs  

16  

(17)

Technology  Evalua:on  

• 

MySQL  DB  Counters  

• 

In-­‐Memory  Counters  

• 

MapReduce  

• 

Cassandra  

• 

HBase  

(18)

PTail Scribe Puma Hbase FACEBOOK Log FACEBOOK Log FACEBOOK Log HDFS

Real Time Long Term

Batch 1.5 Sec

The  solu:on..  

10,000 write/sec per server

(19)

Checking  the  assump:ons..  

Memory is still core

•  “(We) write extremely lean log lines. The more compact the log lines the more can be stored in memory..”

•  “(We) batch for 1.5 seconds on average. Would like to batch longer but they have so many URLs that they run out of

memory when creating a hashtable”

The NoSQL space is very dynamic..

•  “When Facebook engineers started the project 6 months ago, Cassandra did not have distributed counters which is now committed in trunk.”. (Eric Hauser Senior

(20)

Facebook  Analy:cs.Next..  

• 

What  if..  

20  

®  Copyright  2011  Gigaspaces  Ltd.  All  Rights  Reserved    

We can rely on memory as

a reliable store?

We can’t decide on a

particular NoSQL

database?

We need to package the

solution as a product?

(21)

Step  1:  Use  memory..  

We  rely  on  memory  

anyway  to  get  10k  msg/

sec..  

Why  not  use  memory  to  

store  the  events  

Reliability  is  achieved  

through  redundancy  and  

replica=on  

Events FACEBOOK FACEBOOK FACEBOOK Memory Grid Data Grid Data Grid Data Grid

(22)

Step  1:  Use  memory..  

We  rely  on  memory  

anyway  to  get  10k  msg/

sec..  

Why  not  use  memory  to  

store  the  events  

Reliability  is  achieved  

through  redundancy  and  

replica=on  

22  

®  Copyright  2011  Gigaspaces  Ltd.  All  Rights  Reserved    

Events FACEBOOK FACEBOOK FACEBOOK Any API Data Grid

(23)

Step  2  –  Collocate  

Events FACEBOOK FACEBOOK FACEBOOK Processing Grid Data Grid Data Grid Data Grid

• 

Pulng  the  code  

together  with  the  

data.  

(24)

Step  2  –  Collocate  

Events FACEBOOK FACEBOOK FACEBOOK Processing Grid Data Grid Data Grid Data Grid

• 

Pulng  the  code  

together  with  the  

data.  

 

@EventDriven @Polling

public class SimpleListener {

@EventTemplate

Data unprocessedData() {

Data template = new Data();

template.setProcessed(false);

return template;

}

@SpaceDataEvent

public Data eventListener(Data event) {

//process Data here

}

(25)

Step  3  –  Write  behind  to  SQL/NoSQL  

Events FACEBOOK FACEBOOK FACEBOOK Processing Grid Data Grid Data Grid Data Grid

Data

Source

Adaptor

MySQL HBase Cassandra

Open Long Term persistency

Write Behind

(26)

Economic  Data  Scaling  

High Memory

Memory 192GB

Cores 12 cores

Clock speed 3.2 GHhz

Dell Price $367/month

$1.9/GB TB (~960GB)/

Month

5x Blades = $1835/month

Combine  memory  and  disk  

–  Memory  is  x10,  x100  lower  than   disk  for  high  data  access  rate   (Stanford  research)  

–  Disk  is  lower  at  cost  for  high   capacity  lower  access  rate.   –  Solu=on:    

•  Memory  -­‐  short-­‐term  data,    

•  Disk  -­‐  long  term.  data  

 

–  Only  ~16G  required  to  store  the  

log  in  memory  (  500b  messages   at  10k/h  )  at  a  cost  of    ~32$   month  per  server.  

(27)

Economic  Opera:ons  Scaling  

•  Automa=on    -­‐  reduce  opera=onal  cost  

•  Elas=c  Scaling  –  reduce  over  provisioning  cost  

•  Cloud  portability  (JClouds)  –  choose  the  right  cloud  for  the  job  

(28)

Pu_ng  it  all  together  

Analytic Application Event Sources Write behind

- In Memory Data Grid - RT Processing Grid

•  Light Event Processing

•  Map-reduce

•  Event driven

•  Execute code with data

•  Transactional

•  Secured

•  Elastic

NoSQL DB

•  Low cost storage

•  Write/Read

scalability

•  Dynamic scaling

•  Raw Data and

aggregated Data

Generate Patterns

(29)

Pu_ng  it  all  together  

Analytic Application Event Sources Write behind

- In Memory Data Grid - RT Processing Grid

•  Light Event Processing

•  Map-reduce

•  Event driven

•  Execute code with data

•  Transactional

•  Secured

•  Elastic

NoSQL DB

•  Low cost storage

•  Write/Read

scalability

•  Dynamic scaling

•  Raw Data and

aggregated Data

Generate Patterns

Script script = new

StaticScritpt(“groovy”,”println hi; return 0”) Query q = em.createNativeQuery(“execute ?”); q.setParamter(1, script); Integer result =

(30)

5x  beJer  performance  per  server!

 

•  Hardware  –  Linux  

–  HP  DL380  G6  servers  -­‐  each  has:  

–  2  Intel  quad-­‐core  Xeon  X5560  processors  (2.8  Ghz  Nehalem)    

–  32  Gb  RAM  (4GB  per  core)    

–  6  *  146  Gb  15K  RPM  SAS  disks     –  Red  Hat  5.2   Event injector Up to 128 threads GigaSpaces/ (Other Msg Server) App Services Up to 128 threads 0 10,000 20,000 30,000 40,000 50,000 60,000 Event injection throughput Event injection throughput with write multiple EJB/Remoting service invocation throughput GS WLS Other Giga 50,000 write/sec per server

(31)

Pu_ng  it  all  together  –  Elas:c  Big  Data  Plaborm    

•  The  best  of  both  worlds  

•  Support  Real  Time  and  Batch  

•  Fully  managed  stack  

•  Makes  the  development  and  

deployment  of  Big  Data  

applica=on  significantly  simpler   •  Extremely  cost  effec=ve  

–  Best  ra=o  of  Disk  +  Memory  

–  Run  on  any  cloud  

(32)

Other  benefits  

•  Built-in Pub/Sub •  Built-in CEP

Designed for real

time event

processing

•  Standard Query •  Any database

Open

•  Transactional, consistent

•  Survive complete database failure

Reliable

•  Can be packaged into a single product •  Fully automated deployment

•  End to end management and monitoring

Simple

32  

(33)

Further  reading..  

natishalom.typepad.com

• Real Time Analytics for Big Data: An Alternative Approach

GigaOM

• Big data in real time is no fantasy

Highscalability.com

• Facebook's New Realtime Analytics System: HBase To Process 20 Billion Events Per Day

(34)

 

THANK  YOU!  

 

@uri1803  

hJp://blog.gigaspaces.com  

34  

(35)

Economic  Scaling  

JClouds Cloud Driver

Worker Role

Load Balancer

Cloudify Application Cluster

VM Instance

Storage VM Instance

Controll er

Cloudify Agent Cloudify Agent

Console

Controller

Scale-in Scale-out

(36)

References

Related documents