Realtime Big Data: Sources
Attribution: flickr users kenteegardin, fguillen, torkildr, Docklandsboy, brewbooks, ellbrown, JasonAHowie
Finance
Finance
Gaming
Gaming
Monitoring
Monitoring
Social Media
Social Media
Sensor Networks
Sensor Networks
Advertisment
Advertisment
Tasks by Complexity
●
Counting and Averages (over time
windows), Count Distinct
●
Profiles and Histograms
●
Trends
●
Outliers and Fraud detection
●
Prediction (churn, failure)
C
om
pl
ex
Tasks by Latency
●
Reporting
●
Visualization and Monitoring
●
Optimizing, Personalization
●
Control
F
as
t
re
po
ns
es
What makes Data Big?
●
Many Events
–
100 events /
second
–
360k per hour
–
8.6M per day
–
260M per month
–
3.2B per year
●
Many Objects
http://www.flickr.com/photos/arenamontanus/269158554/Current approach: Scaling
●
Batch (MapReduce)
●
Stream (Storm,
Spark)
Scaling? Approximate!
●
Scaling is nice, but:
–
Scaling is expensive
–
Data is noisy
–
Not every data point is important
–
Methods are noisy, too
Scaling vs. Approximation
Scaling
need raw processing
power to get fast
may compute
results you don't
need
practically requires
a cluster setup
Approximation
approximate more
to get fast
focusses on data
you are interested in
already consumes
whole stream with
one node
Heavy Hitters (a.k.a. Top-k)
●
Count activities over large item sets (millions,
even more, e.g. IP addresses, Twitter users)
●
Interested in most active elements only.
Metwally, Agrawal, Abbadi, Efficient computation of Frequent and Top-k Elements in Data Streams, Internation Conference on Database Theory, 2005
frank
paul
jan
felix
leo
alex
15
12
8
5
3
2
Fixed tables of counts
Case 1: element already in data base
Case 2: new element
paul
142
12
13
nico
alex
2
Wait a minute? Only Counting?
●
Well, getting the top most active items is
already useful.
–
Web analytics, Users, Trending Topics
Counting is Statistics
●
Empirical mean:
●
Correlations:
More: Maximum-Likelihood
●
Estimate probabilistic models
based on
which is slightly
biased, but
simpler
Outlier detection
●
Once you have a model, you can compute
So much more to do with trends
●
Least Recently Used Caches
●
Sparse Vectors
●
Sparse Matrices
●
Conditional Probabilities (Histograms)
●
Accumulators
streamdrill
●
Core Engine:
–
Heavy Hitters counting + exponential decay
–
Instant
counts & top-k results over time windows
●