
In this chapter, we introduced a new approach for accelerating spatial query processing for relational index structures. We presented gray containers as a new and general concept and showed how we can efficiently store them by means of data compression techniques within ORDBMSs. In particular, we introduced a quick spatial data compressor, QSDC, in order to emphasize those packer characteristics which are important for efficient spatial query processing, namely good compression ratio and high unpack speed. Furthermore, we introduced a cost-based decompositioning algorithm for complex spatial objects, called GroupCon. GroupCon takes decompression cost and access probabilities of gray containers into account. This decompositioning algorithm is applicable for different spatial index structures, data space resolutions and compression algorithms. We showed in a broad experimental evaluation that the combination of GroupCon and QSDC accelerates the RI-tree, the RQ-tree and the RR-tree by up to two orders of magnitude. Furthermore, we showed that the combination of a slightly altered GroupCon algorithm, called JoinGroupCon, and QSDC accelerates spatial join processing of complex objects by more than one order of magnitude compared to the use of uncompressed one-value approximations. The main difference between GroupCon and JoinGroupCon is that the latter does not assume a potential query distribution, but exploits available statistics of the join partner relation as input parameter for the grouping process.

[Figure 92 plot: processing time [s] versus memory size in kb for JoinGroupCon and one-value approximations, each with NOOPT, BZIP2 and QSDC]

Figure 92: Overall sort-merge join performance. (CAR data set; different cache sizes of the sweep-line status)

Note that for spatial indexing a similar approach is conceivable by combining the results of this chapter with the results of the foregoing chapter. In Chapter 5, we concentrated on the acceleration of relational indexing by means of statistics, whereas in this chapter we looked at the decompositioning of complex spatial objects based on an assumed query distribution. Combining these two techniques allows relational index structures to be accelerated to the point where interactive response times for digital mockup and other application areas of virtual engineering become feasible.


Part III

Database Support for Similarity Search


Chapter 7

Foundations of Similarity Search

Similarity search has gained increasing importance in many different applications, including medical imaging [KSF+ 96], molecular biology [AKKS 99], multimedia [Gud 95], and computer aided design [BKK 97a] [BKK 97b]. The search for database objects similar to a given query object is typically performed by following a feature-based approach. The basic idea is to extract important properties from the original data objects and to map these features into high-dimensional feature vectors, i.e. points in the feature space. Since the choice of which features to extract mainly depends on the considered application, numerous feature transformations have been proposed. The result of such a transformation is a feature vector which is stored in a feature database, e.g. spatial databases storing feature transformed landuse maps, multimedia databases storing feature transformed audio sequences, and CAD databases storing feature transformed industrial parts.

This chapter is dedicated to the foundations of similarity search, with a strong emphasis on related work. It is organized as follows. In Section 7.1, we formally introduce the basic similarity query types, and discuss, in Section 7.2, how we can integrate them into an ORDBMS. In Sections 7.3 to 7.5, we present different access methods and algorithms from the literature which are used for efficient similarity search. In Section 7.6, we discuss existing approaches for effective similarity search.

7.1 Similarity Query Types

There are some specific query types that occur in the context of similarity search in CAD databases. The most important ones are: range queries, k-nearest neighbor queries, and incremental ranking queries. Whereas for range queries the number of results is typically unknown in advance, k-nearest neighbor queries specify the retrieval of those k objects from the database that have the smallest distances to a query object q. Finally, similarity ranking queries support incremental fetching of the database objects.

In this section, we provide formal definitions for these fundamental similarity query types. Let $O$ be the domain of all objects that may occur as database objects or query objects. For every type of similarity search, a distance function $simdist: O \times O \rightarrow \mathbb{R}_0^+$ has to be provided that measures the (dis-)similarity of two objects $o_1$ and $o_2$ by $simdist(o_1, o_2)$. Often we abbreviate simdist by d. By $DB \subseteq O$, let us denote a database containing $N = |DB|$ objects.

7.1.1 Similarity Range Queries

Range queries are specified by a query object q and a range value ε by which the answer set is defined to contain all the objects o from the database that have a distance to the query object q of less than or equal to ε:

Definition 19 (Similarity Range Query).

For a query object $q \in O$ and a query range $\varepsilon \in \mathbb{R}_0^+$, the similarity range query $simrange: O \times \mathbb{R}_0^+ \rightarrow 2^{DB}$ returns the set

$simrange(q, \varepsilon) = \{ o \in DB \mid simdist(o, q) \le \varepsilon \}$.

Note that for the similarity range query, the distance values of the resulting objects are bounded by the query range ε, but the number of answers is not known in advance. The result may be empty if no object has a similarity distance to the query object that is less than or equal to the query range, and it may comprise the entire database if no object has a distance to the query object that is greater than the query range (cf. Figure 93). A user may thus be forced to start several queries iteratively before getting a feeling for an appropriate value of ε. For the query range ε = 0, the similarity range query is equivalent to a point query (i.e. searching for identical database objects). However, the point query is a seldom used query type in the context of similarity search.
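The range query of Definition 19 can be illustrated by a minimal sequential-scan sketch. The function name `sim_range`, the tuple representation of feature vectors and the choice of the Euclidean distance as simdist are assumptions made for illustration only; they are not prescribed by the definition, and no index structure is involved.

```python
import math
from typing import Callable, Iterable, List, Sequence

Vector = Sequence[float]


def sim_range(
    db: Iterable[Vector],
    q: Vector,
    eps: float,
    simdist: Callable[[Vector, Vector], float] = math.dist,  # assumed: Euclidean
) -> List[Vector]:
    """Return all objects o in db with simdist(o, q) <= eps (sequential scan)."""
    if eps < 0:
        raise ValueError("the query range must be non-negative")
    return [o for o in db if simdist(o, q) <= eps]


# Hypothetical two-dimensional feature vectors.
db = [(0.0, 0.0), (1.0, 0.0), (3.0, 4.0)]
q = (0.0, 0.0)

print(sim_range(db, q, 0.0))   # eps = 0 degenerates to a point query
print(sim_range(db, q, 1.0))   # a moderate range
print(sim_range(db, q, 10.0))  # a too large range returns the whole database
```

The three calls mirror the situation discussed above: the size of the answer set depends entirely on ε and cannot be controlled directly by the user.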

7.1.2 Similarity k-nn Queries

The k-nearest neighbor query overcomes the problem of the similarity range query by giving the user the possibility to specify the size k of the answer set. This query type does not require a user to provide a query range and is therefore far easier to use than the similarity range query. The k-nearest neighbor query returns the k most similar feature vectors from the database and is defined as follows:

Definition 20 (Similarity k-Nearest Neighbor Query).

For a query object $q \in O$ and a query parameter $k \in \mathbb{N}$, the k-nearest neighbor query $simknn: O \times \mathbb{N} \rightarrow 2^{DB}$ returns the set $NN_q(k) \subseteq DB$ that contains k objects from the database, and for which the following condition holds:

$\forall o_1 \in NN_q(k), \forall o_2 \in DB \setminus NN_q(k): simdist(o_1, q) \le simdist(o_2, q)$

If there exist several database objects with the same distance as the k-th object in the answer set, denoted as $simdist_{q,k}$, this k-th object is a non-deterministic selection of one of those equally distanced objects. If the query parameter k is equal to 1, we have the special case of a nearest neighbor query, i.e. finding the most similar object in the database. Obviously, the value of k depends on the performed task, but in general, the value for this query parameter is small ($k < 100$). Examples for k-nearest neighbor queries with several values of k are given in Figure 94. As depicted, $simdist_{q,k}$ grows monotonically for an increasing value of k.

Figure 93: Similarity range query. a) a reasonable query range ε1; b) a too small query range ε2; c) a too large query range ε3.

When considering the k-nearest neighbor query as defined above, we find three aspects which may be considered as a disadvantage for CAD applications. First, although the query parameter k is comparatively easy to select, it may still be difficult to provide one single value. Rather, the user may be interested in starting with a very small value, e.g. k = 3, and if the answer set does not meet his expectation, the similarity system should be able to generate further similar objects in an incremental “give-me-more” manner. Using the k-nearest neighbor query type for this purpose, the user is forced to increase the value of k and to start another query. This is obviously inefficient since the already generated similar objects are computed once again. Second, a user may not want to see answer objects which he already knows from previous queries. Third, even if a user chooses a rather high value of k, he would like to get the first results soon, i.e. we need pipelined query processing.
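The k-nearest neighbor query of Definition 20 can be sketched in the same sequential-scan style. The name `sim_knn`, the tuple representation and the Euclidean distance are illustrative assumptions. Ties at the k-th position are resolved here by input order, which is one admissible choice among the non-deterministic selections the definition permits; if the database holds fewer than k objects, this sketch simply returns all of them.

```python
import heapq
import math
from typing import Callable, Iterable, List, Sequence

Vector = Sequence[float]


def sim_knn(
    db: Iterable[Vector],
    q: Vector,
    k: int,
    simdist: Callable[[Vector, Vector], float] = math.dist,  # assumed: Euclidean
) -> List[Vector]:
    """Return k objects of db with the smallest distances to q, closest first."""
    if k < 1:
        raise ValueError("k must be a positive integer")
    return heapq.nsmallest(k, db, key=lambda o: simdist(o, q))


# Hypothetical two-dimensional feature vectors.
db = [(3.0, 4.0), (0.0, 0.0), (0.0, 2.0), (1.0, 0.0)]
q = (0.0, 0.0)

print(sim_knn(db, q, 1))  # nearest neighbor query
print(sim_knn(db, q, 3))

# The distance of the k-th answer never decreases as k grows.
kth_distances = [math.dist(sim_knn(db, q, k)[-1], q) for k in range(1, 5)]
print(kth_distances)
```

Note that asking for k = 3 after k = 1 recomputes every distance from scratch, which is the inefficiency of the "give-me-more" scenario described above.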

7.1.3 Similarity Ranking Queries

An incremental similarity search is achieved by the so-called similarity ranking query. The basic idea of this query type is to rank the database objects in order of their similarity distance.

For reasons of efficiency, the ranking procedure should not be performed and completed in advance at query initialization time. In view of very large databases and of expensive similarity distance functions for complex objects, this approach would take too long before the user receives the first answer. While incrementally proceeding in the ranking procedure, the next object should be reported shortly after the corresponding user request, as soon as its correct ranking is ensured. Another reason for deferring as much of the ranking procedure as possible is that the user may often be satisfied with only a few answers. In this case, the system would have spent too much effort ranking all the remaining objects in vain.
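The idea of deferring the ranking work can be sketched with a generator that reports one object per request. This is only a simplified illustration under stated assumptions: the name `sim_rank`, the tuple representation and the Euclidean distance are hypothetical, and the sketch still computes all N distances on the first request. It defers only the ordering work by using a binary heap instead of a full sort; an index-based method would be needed to defer the distance computations as well.

```python
import heapq
import math
from itertools import islice
from typing import Callable, Iterable, Iterator, Sequence

Vector = Sequence[float]


def sim_rank(
    db: Iterable[Vector],
    q: Vector,
    simdist: Callable[[Vector, Vector], float] = math.dist,  # assumed: Euclidean
) -> Iterator[Vector]:
    """Yield the objects of db in order of non-decreasing distance to q."""
    # The position i breaks ties, so objects themselves are never compared.
    heap = [(simdist(o, q), i, o) for i, o in enumerate(db)]
    heapq.heapify(heap)  # linear time, no complete sort up front
    while heap:
        _, _, o = heapq.heappop(heap)  # logarithmic time per reported object
        yield o


# Hypothetical two-dimensional feature vectors.
db = [(3.0, 4.0), (0.0, 0.0), (0.0, 2.0), (1.0, 0.0), (6.0, 8.0)]
q = (0.0, 0.0)

ranking = sim_rank(db, q)
first_three = list(islice(ranking, 3))  # initial request
two_more = list(islice(ranking, 2))     # "give-me-more" continues where it stopped
print(first_three)
print(two_more)
```

Because the generator keeps its state between requests, the second request neither recomputes nor repeats the objects already reported.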


Figure 94: Similarity k-nearest neighbor query.


Definition 21 (Similarity Ranking Query).

Let $q \in O$ be a query object and $DB \subseteq O$ a database. Let $ranked_q: \{1, \ldots, |DB|\} \rightarrow DB$ be a bijection which ranks our database DB w.r.t. the query object q as follows: $\forall i, j \in \{1, \ldots, |DB|\}: i < j \Rightarrow simdist(ranked_q(i), q) \le simdist(ranked_q(j), q)$. Then, the similarity ranking function $simrank: O \rightarrow (\{1, \ldots, |DB|\} \rightarrow DB)$ is defined as:

$simrank(q) = ranked_q$

We write $ranked_q(i) = o_i$ for the object $o_i$ that is ranked at position i. Note that in most cases, the ranking is uniquely characterized by this definition. However, if there are several objects in the database that have the same distance to the query object, i.e. $simdist(o_i, q) = simdist(o_j, q)$ for some $i, j \in \{1, \ldots, |DB|\}$ with $i \neq j$, the order of $o_i$ and $o_j$ is not determined, and there is not a single $ranked_q$-function but a family of ranking functions which we denote by $RANK_q$. Figure 95 provides two examples of similarity ranking queries. In the examples, the top k objects are marked for k = 3.

Figure 95: Examples of a q-ranking for two query points q’ and q”.

7.1.4 Further Similarity Queries

Besides the three mentioned query types, there exist other similarity queries, for instance approximate nearest neighbor queries and inverse nearest neighbor queries. In approximate k-nearest neighbor queries the user also specifies a query point and a number k of answers to be reported. In contrast to exact nearest neighbor queries, the user is not interested in exactly the closest points, but is satisfied with points which are not much further away from the query point than the exact nearest neighbors. The degree of inexactness is a parameter which is also decisive for the efficiency improvement of the query processing.

In inverse nearest neighbor queries the user only specifies a query point. Given this query point within a data set, an inverse nearest neighbor query finds all points for which the query point is a nearest neighbor.
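The inverse nearest neighbor query can be illustrated by a brute-force sketch with quadratic cost. The name `inverse_nn` and the Euclidean distance are illustrative assumptions, and the sketch assumes that the query point q is passed separately and is not itself stored in `db`. A point p is reported if no other database point is strictly closer to p than q is, so ties count in favor of q.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]


def inverse_nn(
    db: List[Vector],
    q: Vector,
    simdist: Callable[[Vector, Vector], float] = math.dist,  # assumed: Euclidean
) -> List[Vector]:
    """Return all points p of db for which q is a nearest neighbor of p."""
    result = []
    for i, p in enumerate(db):
        d_pq = simdist(p, q)
        # q is a nearest neighbor of p if every other point is at least as far away.
        others = (o for j, o in enumerate(db) if j != i)
        if all(simdist(p, o) >= d_pq for o in others):
            result.append(p)
    return result


# Hypothetical points on a line; q sits at the origin.
db = [(1.0, 0.0), (3.0, 0.0), (10.0, 0.0), (-4.0, 0.0)]
q = (0.0, 0.0)

print(inverse_nn(db, q))
```

In this example (3, 0) is not reported because (1, 0) is closer to it than q, even though (3, 0) would appear early in a ranking around q. This shows that the inverse nearest neighbor relation is not symmetric to the nearest neighbor relation.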