Streamdrill: Analyzing Big Data Streams in Realtime



Mikio L. Braun


@mikiobraun

(2)

Realtime Big Data: Sources

Attribution: flickr users kenteegardin, fguillen, torkildr, Docklandsboy, brewbooks, ellbrown, JasonAHowie

Finance

Gaming

Monitoring

Social Media

Sensor Networks

Advertisement

(3)

Tasks by Complexity

Counting and Averages (over time windows), Count Distinct

Profiles and Histograms

Trends

Outliers and Fraud detection

Prediction (churn, failure)

[Axis label: Complex]

(4)

Tasks by Latency

Reporting

Visualization and Monitoring

Optimizing, Personalization

Control

[Axis label: Fast responses]

(5)

What makes Data Big?

Many Events

100 events / second

360k per hour

8.6M per day

260M per month

3.2B per year
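The rounded figures above follow from simple multiplication. A minimal sketch of the arithmetic, assuming a 30-day month and a 365-day year (the slide does not state which it uses):

```python
# Event volume at a steady 100 events per second.
EVENTS_PER_SECOND = 100

per_hour = EVENTS_PER_SECOND * 60 * 60
per_day = per_hour * 24
per_month = per_day * 30   # assumption: 30-day month
per_year = per_day * 365   # assumption: 365-day year

for label, value in [
    ("per hour", per_hour),
    ("per day", per_day),
    ("per month", per_month),
    ("per year", per_year),
]:
    print(f"{label}: {value:,}")
```

The exact values are 360,000, 8,640,000, 259,200,000, and 3,153,600,000, which round to the numbers on the slide.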

Many Objects

http://www.flickr.com/photos/arenamontanus/269158554/

(6)

Current approach: Scaling

Batch (MapReduce)

Stream (Storm, Spark)

(7)

Scaling? Approximate!

Scaling is nice, but:

Scaling is expensive

Data is noisy

Not every data point is important

Methods are noisy, too

(8)

Scaling vs. Approximation

Scaling

need raw processing power to get fast

may compute results you don't need

practically requires a cluster setup

Approximation

approximate more to get fast

focusses on data you are interested in

already consumes whole stream with one node

(9)

Heavy Hitters (a.k.a. Top-k)

Count activities over large item sets (millions, even more, e.g. IP addresses, Twitter users)

Interested in most active elements only.

Metwally, Agrawal, Abbadi, Efficient Computation of Frequent and Top-k Elements in Data Streams, International Conference on Database Theory, 2005

frank  15
paul   12
jan    8
felix  5
leo    3
alex   2

Fixed tables of counts

Case 1: element already in database

Case 2: new element

[Figure: example table updates involving paul, nico, and alex]
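The two cases can be sketched with the Space-Saving scheme from the Metwally et al. paper cited on this slide: a known element has its counter incremented, while a new element evicts the entry with the smallest count and inherits that count plus one. The class below is a minimal illustration, not streamdrill's implementation; the names `SpaceSaving`, `capacity`, and `errors` are hypothetical, and the minimum is found by a linear scan for clarity rather than with a faster data structure.

```python
class SpaceSaving:
    """Fixed-size table of counts for approximate top-k (illustrative sketch)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.counts = {}   # element -> estimated count
        self.errors = {}   # element -> maximum possible overestimate

    def add(self, item):
        if item in self.counts:
            # Case 1: element already in the table, just increment.
            self.counts[item] += 1
        elif len(self.counts) < self.capacity:
            # Table not yet full: start a fresh counter.
            self.counts[item] = 1
            self.errors[item] = 0
        else:
            # Case 2: new element, table full. Evict the smallest entry
            # and let the newcomer inherit its count plus one.
            victim = min(self.counts, key=self.counts.get)
            floor = self.counts.pop(victim)
            self.errors.pop(victim)
            self.counts[item] = floor + 1
            self.errors[item] = floor

    def top(self, k):
        return sorted(self.counts.items(), key=lambda kv: -kv[1])[:k]


# Rebuild the table of counts shown on the slide, then apply both cases.
table = SpaceSaving(capacity=6)
for name, n in [("frank", 15), ("paul", 12), ("jan", 8),
                ("felix", 5), ("leo", 3), ("alex", 2)]:
    for _ in range(n):
        table.add(name)

table.add("paul")   # Case 1: paul goes from 12 to 13
table.add("nico")   # Case 2: nico replaces alex, the smallest entry
print(table.top(3))
```

Under this scheme a newcomer's count can overestimate its true frequency by at most the evicted count, which is what the `errors` entry records.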

(10)

Wait a minute? Only Counting?

Well, getting the top most active items is already useful.

Web analytics, Users, Trending Topics

(11)

Counting is Statistics

Empirical mean:

Correlations:
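The slide's own formulas are not preserved in this text. As an illustration of the idea, the sketch below uses the standard definitions of the empirical mean and the Pearson correlation and shows that both can be maintained from a handful of running sums, which are themselves just counters. The class name `PairStats` and its methods are hypothetical.

```python
import math


class PairStats:
    """Running sums over (x, y) pairs; mean and correlation are derived on demand."""

    def __init__(self):
        self.n = 0
        self.sx = self.sy = 0.0
        self.sxx = self.syy = self.sxy = 0.0

    def add(self, x, y):
        self.n += 1
        self.sx += x
        self.sy += y
        self.sxx += x * x
        self.syy += y * y
        self.sxy += x * y

    def mean_x(self):
        return self.sx / self.n

    def mean_y(self):
        return self.sy / self.n

    def correlation(self):
        # Pearson correlation from sums: cov(x, y) / (std(x) * std(y)).
        mx, my = self.mean_x(), self.mean_y()
        cov = self.sxy / self.n - mx * my
        var_x = self.sxx / self.n - mx * mx
        var_y = self.syy / self.n - my * my
        return cov / math.sqrt(var_x * var_y)


stats = PairStats()
for x, y in [(1, 2), (2, 4), (3, 6)]:
    stats.add(x, y)
print(stats.mean_x(), stats.correlation())
```

Because only sums are stored, each new event costs a constant amount of work regardless of how much data has already been seen.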

(12)

More: Maximum-Likelihood

Estimate probabilistic models based on

[equation missing]

which is slightly biased, but simpler

(13)

Outlier detection

Once you have a model, you can compute

(14)
(15)

So much more to do with trends

Least Recently Used Caches

Sparse Vectors

Sparse Matrices

Conditional Probabilities (Histograms)

Accumulators

(16)

streamdrill

Core Engine:

Heavy Hitters counting + exponential decay

Instant counts & top-k results over time windows

Modules for specific use cases
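One common way to combine counting with exponential decay is to store, for each counter, a value and the time of its last update, and to scale the value down by the elapsed time whenever it is touched. The sketch below illustrates that idea only; it is not taken from streamdrill, and `DecayingCounter`, `half_life`, and the method names are hypothetical. It assumes timestamps arrive in non-decreasing order.

```python
class DecayingCounter:
    """Counter whose value halves every `half_life` time units (illustrative sketch)."""

    def __init__(self, half_life):
        self.half_life = half_life
        self.value = 0.0
        self.last = None  # timestamp of the most recent update

    def _factor(self, now):
        # Fraction of the stored value that survives until `now`.
        return 0.5 ** ((now - self.last) / self.half_life)

    def add(self, now, weight=1.0):
        if self.last is not None:
            self.value *= self._factor(now)
        self.last = now
        self.value += weight

    def read(self, now):
        # Decayed value at `now`, without modifying the counter.
        if self.last is None:
            return 0.0
        return self.value * self._factor(now)


counter = DecayingCounter(half_life=60)
counter.add(0)
counter.add(0)
print(counter.read(60))   # two events, one half-life later
counter.add(60)
print(counter.read(120))
```

With this scheme recent events dominate the count, so a ranking by decayed value behaves like a trend over a sliding time window without storing individual events.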

Features

In-Memory, with snapshots to disk

written in Scala

Interface: Query by REST, push data by REST or UDP

Single node performance: up to 20k events/s, about

(17)
(18)

streamdrill modules

streamdrill core

recommendation

profiling

fraud detection

Ready-made modules to cover core business applications.

(19)

Use Case: Realtime user profiles

Objective

Track user activity in different categories in realtime

Event

(user, category)

Output
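The output description is not preserved in this text. As one plausible reading, a profile could be the most active categories per user, derived from counts of (user, category) events. The sketch below uses exact counts for simplicity, whereas the talk's approach would use approximate, bounded-size counting; `ProfileTracker` and its methods are hypothetical names.

```python
from collections import Counter


class ProfileTracker:
    """Counts (user, category) events and reports each user's top categories."""

    def __init__(self):
        self.counts = {}  # user -> Counter of categories

    def record(self, user, category):
        self.counts.setdefault(user, Counter())[category] += 1

    def profile(self, user, k=3):
        # Top-k categories for the user, most active first.
        return self.counts.get(user, Counter()).most_common(k)


tracker = ProfileTracker()
for user, category in [("frank", "sports"), ("frank", "news"),
                       ("frank", "sports"), ("paul", "music")]:
    tracker.record(user, category)
print(tracker.profile("frank", 2))
```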

(20)

Use Case: Realtime recommendations

Objective

Recommend items to users based on user interests, item popularity

Event:

(user, item, categories)

Output:

User profiles to find

(21)

Use Case: Realtime fraud and rate limiting

Objective:

Identify unusually active users/IPs

Unusually high co-occurrence

Event

(id) or (id, device)

Output

trend for ids, or look at size of (id, *) or (*, device) above threshold
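The threshold check can be sketched as follows, assuming that the "size of (id, *)" means the number of distinct devices seen with an id, and the "size of (*, device)" the number of distinct ids seen on a device. That interpretation, the function name `flag_cooccurrence`, and the use of exact sets instead of approximate counting are all assumptions made for illustration.

```python
from collections import defaultdict


def flag_cooccurrence(events, threshold):
    """Return ids and devices whose number of distinct partners exceeds threshold."""
    devices_by_id = defaultdict(set)
    ids_by_device = defaultdict(set)
    for id_, device in events:
        devices_by_id[id_].add(device)
        ids_by_device[device].add(id_)
    # Size of (id, *): distinct devices per id.
    hot_ids = sorted(i for i, devs in devices_by_id.items() if len(devs) > threshold)
    # Size of (*, device): distinct ids per device.
    hot_devices = sorted(d for d, ids in ids_by_device.items() if len(ids) > threshold)
    return hot_ids, hot_devices


events = [("u1", "d1"), ("u1", "d2"), ("u1", "d3"),
          ("u2", "d1"), ("u3", "d1"), ("u4", "d1")]
hot_ids, hot_devices = flag_cooccurrence(events, threshold=2)
print(hot_ids, hot_devices)
```

Here `u1` is flagged for appearing on three devices and `d1` for being shared by four ids.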

(22)

Summary

Streamdrill: Big Data through approximation

Counts are the basis for (nearly) everything
