• No results found

General Incremental Sliding-Window Aggregation

N/A
N/A
Protected

Academic year: 2021

Share "General Incremental Sliding-Window Aggregation"

Copied!
18
0
0

Loading.... (view fulltext now)

Full text

(1)

General Incremental

Sliding-Window Aggregation

Kanat Tangwongsan

Mahidol University International College, Thailand

(part of this work done at IBM Research, Watson)

Joint work with Martin Hirzel, Scott Schneider, and Kun-Lung Wu

(2)

Dynamic data is everywhere

caller_id: __   len: __   timestamp:_ caller_id: __   len: __   timestamp:_ caller_id: __   len: __   timestamp:_ caller_id: __   len: __   timestamp:_ caller_id: __   len: __   timestamp:_

t

now

often interested in a sliding window

Compute some aggregation on this data

(3)

caller_id: __   len: __   timestamp:_

in stream out stream

Aggregation Window caller_id: __   len: __   timestamp:_ caller_id: __   len: __   timestamp:_

Build a system to…

Answer the following questions every minute about the

past 24 hrs:

How long was the longest call?

How many calls have that duration?

Who is a caller with that duration?

(4)

Answer the following questions every

minute

about the

past 24 hrs:

How long was the longest call?

How many calls have that duration?

Who is a caller with that duration?

Basic Solution:

Maintain a sliding window (past 24 hr)

Walk through the window every minute

Simple but slow:

(5)

Improvement Opportunities

Idea: When window slides, lots of common contents with the most recent query.

How to Reuse?

If invertible, keep a running sum: add on arrival, subtract

on departure

sum

Partial sum: bundle items that arrive and depart together to reduce # of items in the window.

(6)

This Work:

How to engineer sliding-window aggregation so that

can add new aggregation operations easily

can get good performance with little hassle

(7)

Performance Generality

This Work

Require invertibility or

commutativity or assoc.

Require FIFO windows

Prior Work

Good for small updates

Good for large updates

Prior Work

(e.g., Arasu-Widom VLDB’04, Moon et al. ICDE’00)

(e.g., Cranor et al. SIGMOD’03, Krishnamurthi et al. SIGMOD’06)

This Work

Require associativity (not but invertibility nor commutativity)

OK to be non-FIFO

Good for Large & Small:

If m updates are made to a window of size n, use O(mlog(n/m)) time to compute aggregate.

(8)

Our Approach

High-level Idea Aggregation Interface

map-assoc. reduce-map w1 w2 w3

wn t1 t2 t3

tn lift a combine reduce using lower final answer User declares query

Data Structure

Template

injected into

Window-Management

Library

linked with User writes aggregation code following an interface
(9)

Example

StdDev

lift

combine component-wise add lower x 7! {c : 1, ⌃ : x, : x2} q 1 c ( ⌃2/c)

v

u

u

t

1 n n

X

i=1

(

w

i

x

¯

)

2 count sum of values sum of squares
(10)

Perfect Binary Tree

Depth: log n

Changing k leaves affects

O(k log(n/k)) nodes

comb(.,.) comb(.,.) comb(.,.) comb(.,.)

comb(.,.) comb(.,.)

comb(.,.)

0⍟1 2⍟3 4⍟5 6⍟7

0⍟1⍟2⍟3 4⍟5⍟6⍟7

0⍟1⍟2⍟…⍟7

(11)

Data Structure Engineering

User writes code

following an

interface

Data Structure

Template

injected into

Window-Management

Library

linked with

Minimize pointer chasing

Allocate memory in bulk

Place data to help cache

(12)

Main idea:

keep a perfect binary tree, treating leaves as a circular buffer, all laid out as one array

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 logical 8 9 15 14 13 12 11 10 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 physical binary heap encoding

Q: Non power of 2? Dynamic window size? non-FIFO?

(13)

But…

Circular Buffer Leaves != Window-Ordered Data

1 1 2 1 2 3 1 2 3 4 2 3 4 2 3 4 5 F B F F F F F B B B B B

leaf view

window view

1 1 1 1 2 2 2 2 2 3 3 3 3 4 4 4 5

Fix?

When inverted:

i j B F e f g h

suffix prefix

(14)

Experimental Analysis

What’s the throughput relative to non-incremental?

1

What’s the performance trend under updates of

different sizes?

2

How does wildly-changing window size affect the

overall throughput?

(15)

our scheme crossover for StdDev (~ 10) crossover for Max (~ 260) 0 0.5 1 1.5 2 2.5 3 1 4 16 64 256 1024 Million Tuples/Second Window Size

ISS Max RA Max

Small windows:

slow down ≤ 10%

Break 10x as early as

n in the thousands

What’s the throughput relative to non-incremental?

1

faster

(16)

0 0.5 1 1.5 2 1 4 16 64 256 1K 4K 16K 64K 256K 1M 4M

Avg. Per-tuple Cost (Microseconds)

m = 1 m = 4 m = 64 m = n/16 m = n/4 m = n

Our Theory: Process k events in

O(1 + log (n/k)) time per event.

What’s the performance trend under updates of

different sizes?

2

(17)

0 0.5 1 1.5 2 1 4 16 64 256 1K 4K 16K 64K 256K 1M 4M

Avg Cost Per Tuple (Microseconds)

Window Size RA StdDev (fixed)

RA Max (fixed)

RA StdDev (osc.) RA Max (osc.)

How does wildly-changing window size affect the

overall throughput?

3

(18)

Take-Home Points

More details: see paper/come to the poster session.

This work: sliding-window aggregation

easily extendible by users and has good

performance

careful systems + algorithmic design and

engineering (blend of known ideas)

general (non-FIFO, only need associativity)

and fast for large & small windows

If m updates have been made on a window of size n, use

References

Related documents

The studies that explicitly incorporate income and/or power distribution as a mediating factor of the income-environment relationship seem to agree on at least two conclusions:

We might say that the IPCC performed an important legitim- ating function for the speculative technology of BECCS, pulling it into the political world, making previously

The high rates of resistance to cephalosporins in gram-negative BSI isolates of both the neonatal and pediatric population is concerning ( Klebsiella pneumoniae /cefotaxime 62.6%

Emirates NBD Retail Banking & Wealth Management (RBWM) division continued to outperform the market in 2014, gaining market share growth across most products and customer

In all, the results of Table 5 show that environmentally friendly portfolios in the Nordic region achieved higher returns with lower risk measures than the benchmark.. That is,

Overall, the research aim, reducing the likelihood and severity of construction disputes, is addressed through the development of the interactive exhibit, which can be used

It especially gives arguments to disentangle the two major types of relations at work in food supply chains and especially SSFCs: namely distance relations on

Twelve mature Merino wethers (average body weight 53.62 + 3.44 kg) were separated into 4 groups based on their live weight with each groups assigned three diets, that are: diet