Streamdrill: Analyzing Big Data Streams in Realtime

(1)

Mikio L. Braun

@mikiobraun

(2)

Realtime Big Data: Sources

Attribution: flickr users kenteegardin, fguillen, torkildr, Docklandsboy, brewbooks, ellbrown, JasonAHowie

Finance

Gaming

Monitoring

Social Media

Sensor Networks

Advertisement

(3)

Tasks by Complexity

● Counting and Averages (over time windows), Count Distinct

● Profiles and Histograms

● Trends

● Outliers and Fraud detection

● Prediction (churn, failure)

(ordered by increasing complexity)

(4)

Tasks by Latency

● Reporting

● Visualization and Monitoring

● Optimizing, Personalization

● Control

(ordered by increasingly fast response requirements)

(5)

What makes Data Big?

● Many Events
– 100 events / second
– 360k per hour
– 8.6M per day
– 260M per month
– 3.2B per year

● Many Objects

http://www.flickr.com/photos/arenamontanus/269158554/
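The event counts listed under "Many Events" follow directly from the rate of 100 events per second. A quick check, assuming a 30-day month and a 365-day year (neither is stated on the slide; the function name is illustrative):

```python
RATE_PER_SECOND = 100  # event rate from the slide

SECONDS = {
    "hour": 3600,
    "day": 24 * 3600,
    "month": 30 * 24 * 3600,  # assumption: 30-day month
    "year": 365 * 24 * 3600,  # assumption: 365-day year
}


def events_per(period: str, rate: int = RATE_PER_SECOND) -> int:
    """Number of events accumulated over one period at a constant rate."""
    return rate * SECONDS[period]


for period in SECONDS:
    print(f"{events_per(period):>13,} events per {period}")
```

Even a modest rate therefore adds up to billions of events per year, which is why storing every raw event quickly becomes costly.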

(6)

Current approach: Scaling

● Batch (MapReduce)

● Stream (Storm, Spark)

(7)

Scaling? Approximate!

● Scaling is nice, but:
– Scaling is expensive
– Data is noisy
– Not every data point is important
– Methods are noisy, too

(8)

Scaling vs. Approximation

Scaling
– needs raw processing power to get fast
– may compute results you don't need
– practically requires a cluster setup

Approximation
– approximates more to get fast
– focuses on the data you are interested in
– already consumes the whole stream with one node

(9)

Heavy Hitters (a.k.a. Top-k)

● Count activities over large item sets (millions or even more, e.g. IP addresses, Twitter users)

● Interested in the most active elements only.

Metwally, Agrawal, Abbadi, Efficient Computation of Frequent and Top-k Elements in Data Streams, International Conference on Database Theory, 2005

Fixed table of counts:

frank  15
paul   12
jan     8
felix   5
leo     3
alex    2

Case 1: element already in database (e.g. paul)

Case 2: new element (e.g. nico, with alex holding the smallest count, 2)
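The two cases above can be sketched in a few lines. This is a minimal illustration of the Space-Saving idea from the cited paper, not streamdrill's implementation; the class and method names are assumptions, and the linear scan for the minimum stands in for the more efficient summary structure described in the paper.

```python
class SpaceSaving:
    """Fixed-size table of approximate counts (Space-Saving idea)."""

    def __init__(self, capacity: int) -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.counts: dict[str, int] = {}

    def add(self, item: str) -> None:
        if item in self.counts:
            # Case 1: element already in the table, just increment.
            self.counts[item] += 1
        elif len(self.counts) < self.capacity:
            # Table not yet full: start a new counter.
            self.counts[item] = 1
        else:
            # Case 2: new element and the table is full. Evict the element
            # with the smallest count; the newcomer inherits that count + 1.
            victim = min(self.counts, key=self.counts.get)
            self.counts[item] = self.counts.pop(victim) + 1

    def top(self, k: int) -> list[tuple[str, int]]:
        """The k entries with the largest (approximate) counts."""
        return sorted(self.counts.items(), key=lambda kv: -kv[1])[:k]


# Start from the table shown on the slide (capacity 6, already full).
table = SpaceSaving(capacity=6)
table.counts = {"frank": 15, "paul": 12, "jan": 8,
                "felix": 5, "leo": 3, "alex": 2}
table.add("paul")  # Case 1: paul goes from 12 to 13
table.add("nico")  # Case 2: nico replaces alex and gets 2 + 1 = 3
print(table.top(3))
```

Inheriting the evicted count means a newcomer's count can overestimate its true frequency, but never underestimate it, and memory stays bounded by the table size regardless of how many distinct items appear in the stream.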

(10)

Wait a minute? Only Counting?

● Well, getting the top most active items is already useful.
– Web analytics, Users, Trending Topics

(11)

Counting is Statistics

● Empirical mean:

● Correlations:
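A minimal sketch of why counting suffices here, assuming the standard definitions of the empirical mean and the Pearson correlation: both can be computed from a handful of running sums, each of which is just an accumulated count. The class and method names are illustrative.

```python
import math


class PairStats:
    """Running sums over a stream of (x, y) pairs."""

    def __init__(self) -> None:
        self.n = 0
        self.sx = self.sy = 0.0
        self.sxx = self.syy = self.sxy = 0.0

    def add(self, x: float, y: float) -> None:
        # Every statistic below is derived from these six accumulators.
        self.n += 1
        self.sx += x
        self.sy += y
        self.sxx += x * x
        self.syy += y * y
        self.sxy += x * y

    def mean_x(self) -> float:
        """Empirical mean of x: sum of the values divided by their number."""
        return self.sx / self.n

    def correlation(self) -> float:
        """Pearson correlation expressed through the running sums only.

        Undefined (division by zero) if x or y is constant.
        """
        cov = self.n * self.sxy - self.sx * self.sy
        var_x = self.n * self.sxx - self.sx ** 2
        var_y = self.n * self.syy - self.sy ** 2
        return cov / math.sqrt(var_x * var_y)


stats = PairStats()
for x, y in [(1, 2), (2, 4), (3, 6)]:
    stats.add(x, y)
print(stats.mean_x(), stats.correlation())
```

No raw data point has to be kept after it has been added, so the memory needed is constant per tracked pair of quantities.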

(12)

More: Maximum-Likelihood

● Estimate probabilistic models

based on

which is slightly biased, but simpler

(13)

Outlier detection

● Once you have a model, you can compute

(14)

(15)

So much more to do with trends

● Least Recently Used Caches

● Sparse Vectors

● Sparse Matrices

● Conditional Probabilities (Histograms)

● Accumulators

(16)

streamdrill

● Core Engine:
– Heavy Hitters counting + exponential decay
– Instant counts & top-k results over time windows

● Modules for specific use cases

● Features
– In-Memory, with snapshots to disk
– Written in Scala
– Interface: Query by REST, push data by REST or UDP
– Single node performance: up to 20k events/s, about
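The combination of counting and exponential decay named under "Core Engine" can be illustrated as follows. This is a sketch of the general technique only, not streamdrill's actual code; the class name, the half-life parameter, and the use of explicit timestamps in seconds are all assumptions.

```python
import math


class DecayingCounter:
    """Counter whose value halves every `half_life` seconds without updates.

    Assumes timestamps are passed in non-decreasing order.
    """

    def __init__(self, half_life: float) -> None:
        self.rate = math.log(2) / half_life  # decay rate per second
        self.value = 0.0
        self.last_update = 0.0

    def add(self, now: float, weight: float = 1.0) -> None:
        # Decay the stored value up to `now`, then count the new event.
        self.value = self.read(now) + weight
        self.last_update = now

    def read(self, now: float) -> float:
        """Decayed value at time `now`, without modifying the counter."""
        elapsed = now - self.last_update
        return self.value * math.exp(-self.rate * elapsed)


counter = DecayingCounter(half_life=60.0)
counter.add(now=0.0)
counter.add(now=0.0)
print(counter.read(now=0.0))   # two events, no time elapsed
print(counter.read(now=60.0))  # one half-life later
```

Because old events fade out on their own, a single number per item approximates activity over a recent time window, and the decayed values can serve as the counts in a heavy-hitters table.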

(17)

(18)

streamdrill modules

streamdrill core
– recommendation
– profiling
– fraud detection

Ready-made modules to cover core business applications.

(19)

Use Case: Realtime user profiles

Objective
– Track user activity in different categories in realtime

● Event
– (user, category)

● Output

(20)

Use Case: Realtime recommendations

Objective
– Recommend items to users based on user interests, item popularity

● Event:
– (user, item, categories)

● Output:
– User profiles to find

(21)

Use Case: Realtime fraud and rate limiting

Objective:
– Identify unusually active users/IPs
– Unusually high co-occurrence

● Event
– (id) or (id, device)

● Output
– trend for ids, or
– look at size of (id, *) or (*, device) above threshold
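The "size of (id, *) or (*, device) above threshold" check can be sketched as below. This version keeps exact sets in memory for clarity, whereas an approximate engine would bound the tables; the function name and the example events are hypothetical.

```python
from collections import defaultdict


def suspicious_keys(events, threshold):
    """Return ids and devices paired with more than `threshold` distinct partners.

    `events` is an iterable of (id, device) tuples.
    """
    devices_per_id = defaultdict(set)  # size of (id, *)
    ids_per_device = defaultdict(set)  # size of (*, device)
    for id_, device in events:
        devices_per_id[id_].add(device)
        ids_per_device[device].add(id_)
    hot_ids = {i for i, devs in devices_per_id.items() if len(devs) > threshold}
    hot_devices = {d for d, ids in ids_per_device.items() if len(ids) > threshold}
    return hot_ids, hot_devices


events = [
    ("u1", "d1"), ("u1", "d2"), ("u1", "d3"),  # one id on many devices
    ("u2", "d1"), ("u3", "d1"), ("u4", "d1"),  # one device used by many ids
]
hot_ids, hot_devices = suspicious_keys(events, threshold=2)
print(hot_ids, hot_devices)
```

An id seen on many devices, or a device shared by many ids, is flagged as soon as its number of distinct partners exceeds the threshold.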

(22)

Summary

● Streamdrill: Big Data through approximation

● Counts are the basis for (nearly) everything