• No results found

I/O-Efficient Data Structures for Colored Range and Prefix Reporting

N/A
N/A
Protected

Academic year: 2021

Share "I/O-Efficient Data Structures for Colored Range and Prefix Reporting"

Copied!
25
0
0

Loading.... (view fulltext now)

Full text

(1)

I/O-Efficient Data

Structures for

C

o

l

o

r

e

d

Range and Prefix Reporting

Kasper Green Larsen, MADALGO, Aarhus University Rasmus Pagh, IT University of Copenhagen

(2)

Motivating example

• Store collection of documents (= sets of words). • Given a string p, return the IDs of all

documents that contain p as a word. • Optimal solution: “Inverted index”,

precomputing the answer for each p. Sanningen finns

på marken/men ingen vågar ta den./Sanningen ligger på gatan./ Ingen gör den till sin.

Jag ligger och ska somna, jag ser okända bilder/ och tecken klottrande sig själva bakom ögonlocken/på mörkrets vägg. I springan mellan vakenhet och dröm/försöker ett stort brev tränga sig in förgäves. Jag öppnar dörr nummer två./ Vänner! Ni drack mörkret/och blev synliga.

(3)

Motivating example

• Store collection of documents (= sets of words). • Given a string p, return the IDs of all

documents that contain p as prefix of a word. • Optimal solution: Topic of this talk.

. Sanningen finns på marken/men ingen vågar ta den./Sanningen ligger på gatan./ Ingen gör den till sin.

Jag ligger och ska somna, jag ser okända bilder/ och tecken klottrande sig själva bakom ögonlocken/på mörkrets vägg. I springan mellan vakenhet och dröm/försöker ett stort brev tränga sig in förgäves. Jag öppnar dörr nummer två./ Vänner! Ni drack mörkret/och blev synliga. “colored prefix reporting”

(4)

Simple data structures

Inverted list for each prefix:

Too much space for fixed alphabet size. • Inverted list for each word:

Too much time if many documents have many different words with prefix p.

Sanningen finns på marken/men ingen vågar ta den./Sanningen ligger på gatan./ Ingen gör den till sin.

Jag ligger och ska somna, jag ser okända bilder/ och tecken klottrande sig själva bakom ögonlocken/på mörkrets vägg. I springan mellan vakenhet och dröm/försöker ett stort brev tränga sig in förgäves. Jag öppnar dörr nummer två./ Vänner! Ni drack mörkret/och blev synliga.

(5)

Relation to range reporting

[GHJS ’03]

words in lexicographic order p*

(6)

Relation to range reporting

[GHJS ’03]

words in lexicographic order p*

(7)

Relation to range reporting

[GHJS ’03]

words in lexicographic order p*

y-coord = x-coord of

(8)

Relation to range reporting

[GHJS ’03]

words in lexicographic order p*

3-sided range query captures only first occurrence of color

(9)

We can store a subset of n points from [n]×[n], using linear space, such that 3-sided range

queries can be answered in O(1+k/B) I/Os.

New result

• Parameters:

k = number of points returned

B = number of points in a memory block (assume query string fits in one block)

(10)

We can store a subset of n points from [n]×[n], using linear space, such that 3-sided range

queries can be answered in O(1+k/B) I/Os.

New result

Optimal time and space

• Parameters:

k = number of points returned

B = number of points in a memory block (assume query string fits in one block)

(11)

We can store a subset of n points from [n]×[n], using linear space, such that 3-sided range

queries can be answered in O(1+k/B) I/Os.

New result

Optimal time and space Implies optimal colored prefix reporting • Parameters:

k = number of points returned

B = number of points in a memory block (assume query string fits in one block)

(12)

Model of computation

• I/O model with Ɵ(B log n) bits per block. • Memory is a sequence of blocks.

Cost model: Number of block retrievals (I/Os). • O(1) blocks can be stored in “cache” and

(13)

Model of computation

• I/O model with Ɵ(B log n) bits per block. • Memory is a sequence of blocks.

Cost model: Number of block retrievals (I/Os). • O(1) blocks can be stored in “cache” and

accessed at no cost.

Many other p

apers: Block stores B

(14)

Selected previous results

Reference Space

overhead Search time Model

Arge et al. ’99 O(1) O(logB(n)) Comparison

Nekrich ’07 O(1) logBlogB...(n) Unrestricted

Nekrich ’07 logB*(n) O(logB*(n)) Unrestricted

Nekrich ’07 (logB(n))2 O(1) Unrestricted

Here O(1) O(1) Unrestricted

(15)

High-level description

1. Place points in a binary priority search tree

points to report

(16)

High-level description

points to report

points in range 2. Search for points near the “fringe” using O(1)

(17)

High-level description

points to report

points in range 3. Read blocks containing remaining points (easy).

(18)

Base data structure

• Core technical contribution of paper.

• Able to handle point sets of size poly(B log n) optimally.

Main technique: Use of tabulation and fusion trees allow us to make a dynamic data

structure partially persistent free of charge

(19)

Indivisibility assumption

• Assumption often used to show lower bounds: Each block contains at most B points/items;

reading a block is required to report an item. • Our data structure is among few to break this

assumption (with the dictionary of Iacono and Pǎtraşcu, previous presentation).

(20)

Indivisibility assumption

• Assumption often used to show lower bounds: Each block contains at most B points/items;

reading a block is required to report an item. • Our data structure is among few to break this

assumption (with the dictionary of Iacono and Pǎtraşcu, previous presentation).

Open problem: Is there a nontrivial lower bound under the indivisibility assumption?

(21)

Memory models

• There may be alternatives to the cache-oriented I/O model:

Unlike conventional processors that rely on the hardware to automatically bring data and

instructions close to the processor with a hierarchy of hardware caches, the Cell Broadband Engine

requires the programmer to create a “shopping list” of the data that the program requires.

(22)

Scatter-I/O model

• One I/O operation can read or write any set of

B words in memory.

• In this stronger model, we give a much simpler data structure for colored prefix search.

(23)

Scatter-I/O model

• One I/O operation can read or write any set of

B words in memory.

• In this stronger model, we give a much simpler data structure for colored prefix search.

Open problem: Can notoriously I/O-difficult graph problems such as BFS and DFS be

(24)

Conclusion

• Theoretically optimal solutions in the I/O model for colored prefix/range reporting. • In fact, optimal solution to 3-sided range

(25)

Conclusion

• Theoretically optimal solutions in the I/O model for colored prefix/range reporting. • In fact, optimal solution to 3-sided range

reporting in two dimensions.

Main open problem: Efficient extension to top-k searches, where only the k highest

References

Related documents

At the level of practice, I have proposed two legal tests for degradation and humiliation that could be used by courts to interpret the more abstract tests

In the paper, we discussed a specific scenario where airlines can reduce their delays by specifying relative slot cost as RTC and using subbing

217: Precourse 04: Prescribing Hand therapy after tendon repair ( lectures on zone 1 Flexor; zones III-IV extensor, panel participation on “difficult cases” : American Society

The School of Social Work recognizes its responsibility to the social work profession to uphold standards of academic and professional excellence and to operate within the

(b) The phenomenon of emission of electrons from metal surface by the incidence of light photon is called photo electric effect.. (c) Emitted e - is called photo

What is required? how is this element assessed? 0 5 10 Why did you give this score? 1) Termination of pregnancy is integrated into a wider reproductive package of services

A new algorithm for mining fuzzy association rules in the large databases based on ontology, Proceedings of the Sixth IEEE International Conference on Data Mining, pp. From data

This packet is being sent home to you so that you man have a better understand of the program procedures, Rules, and regulations of the Automotive Technology classes at highland