Efficient Processing of Top-k Spatial Keyword Queries

João B. Rocha-Junior, Orestis Gkorgkas, Simon Jonassen, and Kjetil Nørvåg

(2)

Outline

• Top-k spatial keyword queries

• Current approaches

• Spatial inverted index

• Single-keyword queries

• Multiple-keyword queries

• Experimental evaluation

• Conclusion

(3)

Motivation

• More and more documents on the Internet are being associated with a spatial location

– E.g., tweets, images (Flickr), Wikipedia pages, OpenStreetMap objects, …

• Most of these geotagged objects are associated with a text (description)

(4)

Top-k spatial keyword queries

• Query

– Spatial location

– Query keywords (e.g., "Italian food")

• Returns the k best spatio-textual objects ranked in terms of both

– Spatial distance to the query location

– Textual relevance to the query keywords

(5)

Another example…

[Figure: objects plotted around the query location q, with the distance from q to an object marked.]

(6)

Ranking objects

• Score: $\tau(p, q) = \alpha \cdot \delta(p, q) + (1 - \alpha) \cdot \theta(p, q)$

• The spatial proximity (δ) is derived from the normalized Euclidean distance between p and q

• The textual relevance (θ) is the cosine similarity between the description of p and the query keywords

• The query preference parameter (α) defines the importance of one measure over the other
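The ranking function can be sketched in Java as below. This is a minimal illustration, not the authors' code: the class and method names are hypothetical, and the proximity is assumed to be `1 - d / dMax`, where `dMax` is an assumed maximum distance used for normalization, so that closer objects score higher.

```java
// Illustrative sketch only: names and the proximity normalization are assumptions.
final class SpatioTextualScore {
    /** Spatial proximity in [0, 1]: 1 at the query location, 0 at distance dMax or beyond. */
    static double proximity(double px, double py, double qx, double qy, double dMax) {
        double d = Math.hypot(px - qx, py - qy); // Euclidean distance between p and q
        return 1.0 - Math.min(d, dMax) / dMax;
    }

    /** tau(p, q) = alpha * delta(p, q) + (1 - alpha) * theta(p, q). */
    static double score(double alpha, double delta, double theta) {
        if (alpha < 0.0 || alpha > 1.0) {
            throw new IllegalArgumentException("alpha must be in [0, 1]");
        }
        return alpha * delta + (1.0 - alpha) * theta;
    }
}
```

With α close to 1 the ranking is driven mostly by distance; with α close to 0 it is driven mostly by textual relevance.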

(7)

Current approaches

• Employ a modified R-tree [1,2]

– Each node keeps an abstract document representing all documents in the node sub-tree

• Abstract document

– Pairs (term, weight), one pair per term

– The weight permits computing an upper-bound score for the objects in the node sub-tree
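One way to picture an abstract document is as a map that keeps, for each term, the largest weight found anywhere below the node. The sketch below assumes that rule (maximum weight per term) and integer weights; the class and method names are hypothetical and are not taken from [1] or [2].

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: assumes the abstract document keeps the max weight per term.
final class AbstractDocument {
    /** Merges child (term -> weight) maps, keeping the maximum weight seen for each term. */
    static Map<String, Integer> merge(List<Map<String, Integer>> children) {
        Map<String, Integer> result = new HashMap<>();
        for (Map<String, Integer> child : children) {
            for (Map.Entry<String, Integer> e : child.entrySet()) {
                // Integer::max keeps the larger of the stored and the incoming weight.
                result.merge(e.getKey(), e.getValue(), Integer::max);
            }
        }
        return result;
    }
}
```

Because no object below the node can have a higher weight for a term than the merged value, the merged map yields an upper bound on the textual part of the score for the whole sub-tree.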

[1] Cong, G., Jensen, C.S., Wu, D.: “Efficient retrieval of the top-k most relevant spatial web objects”, VLDB, 2009.

[2] Li, Z., Lee, K.C., Zheng, B., Lee, W., Lee, D., Wang, X.: “IR-tree: an efficient index for geographic document search”, TKDE, 2010.

(8)

Example

[Figure: an R-tree over objects p1 to p7 with a query point q. The root has entries e1, e2, and e3 (e1 covers p1, p2, and p3), and the root and each entry carry an abstract document of (term, weight) pairs over the terms bar, pop, pub, rock, and samba.]

For simplicity, we assume that the impact of a term is defined by its frequency.

(9)

Current approaches

• There are several variations

– Incorporating document similarity

– Clustering the nodes

• Main problems

– Frequent and infrequent terms are stored in the same way (have the same cost)

– Accesses several nodes due to text dimensionality

– Complex management of inverted files and/or vectors, one per node

(10)

Spatial inverted index (S2I)

• Like an inverted index, S2I maps each term to the objects that contain it

– The most frequent terms are stored in aggregated R-trees (aR-trees)

– The less frequent terms are stored in blocks in a file

• The aR-tree permits accessing the objects in decreasing order of term relevance

• The blocks permit storing the less frequent terms efficiently
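The split between blocks and trees can be sketched as follows. This is a simplified illustration under stated assumptions: the `blockCapacity` threshold, the class names, and the use of a plain list as a stand-in for both a block and an aR-tree are all hypothetical, and the sketch only decides where a term would be stored.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: class names and the block-capacity rule are assumptions.
final class S2ISketch {
    /** One entry of a term's posting list. */
    static final class Posting {
        final int objectId;
        final double x, y;    // object location
        final double impact;  // impact of the term in the object's description

        Posting(int objectId, double x, double y, double impact) {
            this.objectId = objectId;
            this.x = x;
            this.y = y;
            this.impact = impact;
        }
    }

    enum Storage { BLOCK, TREE }

    private final int blockCapacity; // hypothetical: max postings a block can hold
    private final Map<String, List<Posting>> postings = new HashMap<>();

    S2ISketch(int blockCapacity) {
        this.blockCapacity = blockCapacity;
    }

    /** Adds the object to the posting list of the given term. */
    void insert(String term, Posting p) {
        postings.computeIfAbsent(term, t -> new ArrayList<>()).add(p);
    }

    /** Terms that outgrow one block are treated as frequent and would live in a tree. */
    Storage storageOf(String term) {
        return postingsOf(term).size() > blockCapacity ? Storage.TREE : Storage.BLOCK;
    }

    List<Posting> postingsOf(String term) {
        return postings.getOrDefault(term, Collections.<Posting>emptyList());
    }
}
```

A real index would replace the list with a disk block for infrequent terms and an aggregated R-tree for frequent ones; the point here is only that each term is stored on its own and the storage choice depends on its frequency.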

(11)

Distribution of terms

• The distribution of terms is very skewed

• A few hundred terms account for 50% of the text

[Figure: frequency plotted against terms.]

(12)

Example

(13)

Aggregated R-tree (max) for frequent terms (e.g., pub)

• Only relevant objects are evaluated

• The objects are accessed in decreasing order of score

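[Figure: an aggregated R-tree for one term. The root e0 holds entries e1 (max=1) and e2 (max=2); the leaf entries are p1(1), p2(1), p5(2), p6(2), and p7(1). The number in parentheses is the term impact, and each node stores the max value of its sub-tree.]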

(14)

Single-keyword queries

• Only a single block or tree is accessed

• Block

– All the objects are read and the k best are reported

• Tree

– The nodes are accessed in decreasing order of score

– The algorithm terminates when the score of the k-th object is higher than the score of any unvisited node
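The tree case can be sketched as a best-first search driven by a max-heap. The code below is an illustration under assumptions, not the authors' implementation: the `Entry` layout is hypothetical, impacts are assumed to be normalized to [0, 1] and used directly as textual relevance, and proximity is assumed to be `1 - d / dMax`.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Illustrative sketch only: the Entry layout and the score normalization are assumptions.
final class ARTreeSearch {
    static final class Entry {
        final String id;
        final double minX, minY, maxX, maxY; // bounding box (a single point for objects)
        final double maxImpact;              // term impact, or the max impact in the sub-tree
        final List<Entry> children;          // empty for objects

        private Entry(String id, double minX, double minY, double maxX, double maxY,
                      double maxImpact, List<Entry> children) {
            this.id = id;
            this.minX = minX; this.minY = minY; this.maxX = maxX; this.maxY = maxY;
            this.maxImpact = maxImpact;
            this.children = children;
        }

        static Entry object(String id, double x, double y, double impact) {
            return new Entry(id, x, y, x, y, impact, new ArrayList<Entry>());
        }

        static Entry node(String id, List<Entry> children) {
            if (children.isEmpty()) throw new IllegalArgumentException("node needs children");
            double minX = Double.POSITIVE_INFINITY, minY = Double.POSITIVE_INFINITY;
            double maxX = Double.NEGATIVE_INFINITY, maxY = Double.NEGATIVE_INFINITY;
            double max = Double.NEGATIVE_INFINITY;
            for (Entry c : children) {
                minX = Math.min(minX, c.minX); minY = Math.min(minY, c.minY);
                maxX = Math.max(maxX, c.maxX); maxY = Math.max(maxY, c.maxY);
                max = Math.max(max, c.maxImpact); // aggregate: max impact below this node
            }
            return new Entry(id, minX, minY, maxX, maxY, max, children);
        }

        boolean isObject() { return children.isEmpty(); }
    }

    /** Exact score for an object; an upper bound for every object below a node. */
    static double upperBound(Entry e, double qx, double qy, double alpha, double dMax) {
        double dx = Math.max(0.0, Math.max(e.minX - qx, qx - e.maxX));
        double dy = Math.max(0.0, Math.max(e.minY - qy, qy - e.maxY));
        double minDist = Math.hypot(dx, dy); // minimum distance from q to the box
        double proximity = 1.0 - Math.min(minDist, dMax) / dMax;
        return alpha * proximity + (1.0 - alpha) * e.maxImpact;
    }

    /** Returns the ids of the k best objects, best first. */
    static List<String> topK(Entry root, int k, double qx, double qy, double alpha, double dMax) {
        Comparator<Entry> byBound =
            Comparator.comparingDouble((Entry e) -> upperBound(e, qx, qy, alpha, dMax)).reversed();
        PriorityQueue<Entry> heap = new PriorityQueue<>(byBound);
        heap.add(root);
        List<String> result = new ArrayList<>();
        while (!heap.isEmpty() && result.size() < k) {
            Entry e = heap.poll();
            if (e.isObject()) {
                result.add(e.id);        // no unvisited entry can score higher
            } else {
                heap.addAll(e.children); // expand the most promising node
            }
        }
        return result;
    }
}
```

When an object reaches the top of the heap, its score is at least the upper bound of every entry still in the heap, which is the termination condition stated above.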

(15)

Example, processing top-1

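[Figure: top-1 search over the aggregated R-tree. The root e0 holds e1 (max=1) and e2 (max=2); the leaf entries are p1(1), p2(1), p5(2), p6(2), and p7(1). Entries are kept in a max-heap and scored using their minimum distance to q; the heap states shown include <e2, e1>, <p5, p6, e1, p7>, and <e1>.]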

(16)

Multiple-keyword queries

• Requires aggregating the partial scores of the objects for each term t of the query keywords

• Similar to Fagin’s algorithm (NRA)

– Different bounds

• Score: $\tau(p, q) = \sum_{t \in q.d} \tau_t(p, q)$, where each $\tau_t(p, q)$ is a partial score

(17)

Multiple-keyword algorithm

• For each term t in q, access the objects p in S2I in decreasing order of partial score

– The objects are retrieved from a tree or block

• Update the lower bound score of p

– Sum of the known partial scores plus the lowest possible partial score for each remaining term (using the spatial distance)

• Update the upper bound score of the visited objects

• Return the objects whose lower bound score cannot be exceeded by the remaining objects
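The bound bookkeeping can be sketched as below. This is a hedged illustration: the lower bound follows the description above, while the upper bound uses the textbook NRA rule (for a term not yet seen for the object, use the partial score last read from that term's source). The slides state that the authors use different bounds, so the upper bound, the method names, and the `frontier` map are assumptions.

```java
import java.util.List;
import java.util.Map;

// Illustrative sketch only: method names and the way bounds are supplied are assumptions.
final class PartialScoreBounds {
    /**
     * Lower bound: known partial scores plus, for every query term not yet seen
     * for this object, the lowest partial score the object could still receive.
     */
    static double lowerBound(Map<String, Double> known, List<String> queryTerms,
                             double minPartial) {
        double sum = 0.0;
        for (String t : queryTerms) {
            Double s = known.get(t);
            sum += (s != null) ? s : minPartial;
        }
        return sum;
    }

    /**
     * Upper bound (textbook NRA rule): known partial scores plus, for unseen terms,
     * the partial score last read from that term's source. Sources are read in
     * decreasing order, so no later object can score higher for that term.
     * The frontier map must contain every query term.
     */
    static double upperBound(Map<String, Double> known, List<String> queryTerms,
                             Map<String, Double> frontier) {
        double sum = 0.0;
        for (String t : queryTerms) {
            Double s = known.get(t);
            sum += (s != null) ? s : frontier.get(t);
        }
        return sum;
    }
}
```

An object can be reported once its lower bound is at least the upper bound of every other candidate, including objects that have not been seen in any source yet.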

(18)

Experimental evaluation

• We compare our approach (S2I) with the DIR-tree proposed by Cong et al. [1]

• Both approaches are implemented in Java

• Measures: response time, I/O, update time, and index size

• Size of tree nodes and blocks: 4KB

[1] Cong, G., Jensen, C.S., Wu, D.: “Efficient retrieval of the top-k most relevant spatial web objects”, VLDB, 2009.

(19)

Datasets

Dataset         Total no. of objects   Avg. no. of unique terms per object   Total no. of terms
Twitter1        1M                     11.94                                 12.5M
Twitter2        2M                     12.00                                 25M
Twitter3        3M                     12.26                                 38.6M
Twitter4        4M                     12.27                                 51.6M
Data1           0.1M                   131.70                                32.6M
Wikipedia       0.4M                   163.65                                169.4M
Flickr          1.4M                   14.49                                 25.4M
OpenStreetMap   3M                     8.76                                  31.5M

(20)

Variables studied

• Number of results

– 10, 20, 30, 40, 50

• Number of query keywords

– 1, 2, 3, 4, and 5

• Query preference rate (α)

– 0.1, 0.3, 0.5, 0.7, 0.9

• Scalability (Twitter datasets)

– 1M, 2M, 3M, 4M

(21)

Number of results (k)

• The response time of S2I is one order of magnitude better than that of the DIR-tree because S2I performs fewer disk accesses

– The DIR-tree reads several nodes before finding the top-k results due to text dimensionality

(22)

Number of query keywords

• S2I is one order of magnitude better in both I/O and response time

(23)

Insertion time and index size

• S2I does not require updating inverted files (and vectors) or computing document similarity

• S2I requires more space

(24)

Conclusions

• Top-k spatial keyword queries are intuitive and have several applications

• We propose a new index

– Terms with different frequencies are stored differently

• We propose algorithms for single- and multiple-keyword queries

• The efficiency of our approach is verified through experiments on synthetic and real datasets

(25)

More information…

João B. Rocha-Junior

http://www.idi.ntnu.no/~joao

Thanks!

(26)

Scalability

• The improvement of S2I over the DIR-tree increases with the cardinality of the dataset

(27)

Different datasets

• The advantage of S2I over the DIR-tree is higher for datasets with few terms per document

(28)

Terms removal

• Terms of length 1

• Terms that contain no letter character

– i.e., !Character.isLetter(token.charAt(i)) holds for every position i
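The two removal rules can be written as a small Java predicate. This is a minimal sketch: the class and method names are hypothetical, and only the `Character.isLetter(token.charAt(i))` check comes from the slide.

```java
// Illustrative sketch only: the class and method names are hypothetical.
final class TermFilter {
    /** Returns true when the token should be dropped before indexing. */
    static boolean shouldRemove(String token) {
        if (token.length() == 1) {
            return true; // terms with length = 1
        }
        for (int i = 0; i < token.length(); i++) {
            if (Character.isLetter(token.charAt(i))) {
                return false; // at least one letter: keep the term
            }
        }
        return true; // no letter character at all (also covers the empty string)
    }
}
```

Under these rules a token such as "42" or "a" would be removed, while "pub" or "r2" would be kept.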
