4.4 Experimental Evaluation

4.4.3 Partial Similarity Search

In this section, we test the norm vector filter and the closest pair filter for partial similarity search. Note that detecting partial similarity is a very expensive operation. Furthermore, we cannot apply the M-tree, as the distance function is not a metric (cf. Definition 13).

Range queries. Figure 4.7 shows the average of 10 range queries for varying ε-values on vector sets of 7 vectors each. The partial similarity parameter s was set to 2. Again, the closest pair filter is very selective. As the exact distance function is very expensive, the closest pair filter can be used beneficially for small ε-values. For higher ε-values, the rather high evaluation cost of the closest pair filter itself becomes noticeable. On the other hand, the norm vector filter can safely be used for all values of ε, as it causes no noteworthy overhead. For rather small ε-values, it even outperforms the closest pair filter, although the norm vector filter has a lower selectivity than the closest pair filter. This is because the lower computational cost of the norm vector filter outweighs the slightly larger number of exact distance computations that have to be carried out.
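The trade-off between filter cost and selectivity can be illustrated with a minimal sketch of a filter-refinement range query. The sketch does not reproduce the filters evaluated here: as an assumption, a plain Euclidean distance stands in for the expensive exact distance, and the difference of the vector norms stands in for a cheap lower-bounding filter. The names `exact_dist`, `filter_dist` and `range_query` are hypothetical.

```python
import math

def exact_dist(x, y):
    """Stand-in for an expensive exact distance (assumption: Euclidean)."""
    return math.dist(x, y)

def filter_dist(x, y):
    """Cheap lower bound (assumption): | ||x|| - ||y|| | <= ||x - y||."""
    return abs(math.hypot(*x) - math.hypot(*y))

def range_query(db, q, eps):
    """Return (results, number of exact distance computations)."""
    results, exact_calls = [], 0
    for obj in db:
        # Filter step: the lower bound already exceeds eps, so the
        # exact distance must exceed eps as well and can be skipped.
        if filter_dist(obj, q) > eps:
            continue
        # Refinement step: only candidates pay for the exact distance.
        exact_calls += 1
        if exact_dist(obj, q) <= eps:
            results.append(obj)
    return results, exact_calls

db = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (3.0, 4.0), (6.0, 8.0)]
results, exact_calls = range_query(db, (1.0, 0.0), 1.0)
print(results, exact_calls)
```

In this toy setting, a more selective filter would lower `exact_calls` but cost more per call of `filter_dist`, which is the balance discussed above.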


Figure 4.7: Partial range queries for s = 2, CAD dataset, cardinality 7, dimensionality 6.

68 4 Efficient Similarity Search on Vector Sets

Figure 4.8: Partial k-nn queries for s = 3, CAD dataset, cardinality 5, dimensionality 6 (the sequential scan took about 2123 sec. for each k).


k-nn queries. Figure 4.8 shows the average of 10 k-nn queries for vector sets of 5 vectors, each having a dimensionality of 6, and a partial similarity parameter s = 3. For small values of k, the norm vector filter outperforms the exact distance computation by almost one order of magnitude. For higher values of k, the selectivity of the norm vector filter decreases and thus the overall response time increases. For k = 100, the norm vector filter still accelerates the query process by 100%. As already mentioned, the closest pair filter is rather expensive. Although the closest pair filter has an excellent selectivity, the norm vector filter is better for rather small values of k. For increasing values of k, the closest pair filter outperforms the norm vector filter because of its much better selectivity and the very expensive exact distance calculations.
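The dependence of the response time on k can be made concrete with a minimal sketch of a multi-step k-nn query. This section does not state which k-nn strategy the experiments use; the sketch assumes the common scheme of ranking all objects by a lower-bounding filter distance and refining until the next filter distance exceeds the current k-th exact distance. The distance functions are hypothetical stand-ins (Euclidean distance and difference of norms), not the filters evaluated above.

```python
import heapq
import math

def exact_dist(x, y):
    """Stand-in for an expensive exact distance (assumption: Euclidean)."""
    return math.dist(x, y)

def filter_dist(x, y):
    """Cheap lower bound (assumption): | ||x|| - ||y|| | <= ||x - y||."""
    return abs(math.hypot(*x) - math.hypot(*y))

def knn_multistep(db, q, k):
    """Return (list of (distance, object), number of exact computations)."""
    # Rank all objects by their filter distance to the query.
    ranking = sorted((filter_dist(o, q), i) for i, o in enumerate(db))
    best = []          # max-heap of the k best, stored as (-distance, index)
    exact_calls = 0
    for lower_bound, i in ranking:
        # Stop once no remaining object can beat the current k-th neighbour.
        if len(best) == k and lower_bound > -best[0][0]:
            break
        d = exact_dist(db[i], q)
        exact_calls += 1
        if len(best) < k:
            heapq.heappush(best, (-d, i))
        elif d < -best[0][0]:
            heapq.heapreplace(best, (-d, i))
    neighbours = sorted((-neg_d, db[i]) for neg_d, i in best)
    return neighbours, exact_calls

db = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (3.0, 4.0), (6.0, 8.0)]
neighbours, exact_calls = knn_multistep(db, (1.0, 0.0), 2)
print(neighbours, exact_calls)
```

With a larger k, the k-th exact distance grows, fewer objects are pruned by the stopping condition, and more exact distances have to be computed.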

4.5 Summary

In this chapter, we motivated the use of vector set data by pointing out the different application areas of this promising representation technique. We introduced a suitable distance function on vector sets, which reflects the intuitive notion of similarity for the presented application ranges. Furthermore, we presented different filtering techniques with different runtime complexities. Our experimental evaluation and our analytical reasoning showed that the closest pair filter is the most selective filter. As this filter is rather expensive, it only pays off for partial similarity queries, which are extremely expensive themselves. For complete similarity queries, the combination of the norm vector filter and the centroid filter is the method of choice for many different data distributions, as it can be computed efficiently and takes the information of each vector and each dimension into consideration. The experimental evaluation on real-world datasets demonstrates that the presented filtering techniques accelerate similarity range queries and k-nn queries by up to one order of magnitude compared to metric index structures and the sequential scan.

Chapter 5

Multi-Step Density-Based Clustering

In recent years, the research community has devoted a lot of attention to the clustering problem, resulting in a large variety of different clustering algorithms [JMF99]. One important class of clustering algorithms is density-based clustering, which can be used for clustering all kinds of metric data and is not confined to vector spaces. Density-based clustering is rather robust with respect to outliers [EKSX96] and is very effective in clustering all sorts of data, e.g. multi-represented objects [KKPS04a]. Furthermore, the reachability plot created by the density-based hierarchical clustering algorithm OPTICS serves as a starting point for an effective data mining tool described in Chapter 7, which helps to visually analyze cluster hierarchies.

Density-based clustering algorithms like DBSCAN and OPTICS, which were introduced in Chapter 3, are based on ε-range queries for each database object. Each range query requires a lot of distance calculations, especially when high ε-values are used. Therefore, these algorithms are only applicable to large collections of complex objects, e.g. trees, point sets, and graphs (cf. Figure 1), if those range queries are supported efficiently. When working with complex objects, the necessary distance calculations are the time-limiting factor. Thus, the ultimate goal is to save as many of these complex distance calculations as possible.


In this chapter, we present an approach which helps to compute density-based clusterings efficiently. The core idea of our approach is to integrate the multi-step query processing paradigm directly into the clustering algorithm rather than using it “only” for accelerating range queries. Our clustering approach itself exploits the information provided by simple distance measures that lower-bound complex and expensive exact distance functions. Expensive exact distance computations are only performed when the information provided by simple distance computations, which are often based on simple object representations, is not enough to compute the exact clustering. Furthermore, we show how our approach can be used for approximated clustering, where the result might be slightly different from the one we compute based on the exact information. In order to measure the dissimilarity between the resulting clusterings, we introduce suitable quality measures.
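To give a first intuition of how lower-bounding distances can save exact computations inside a density-based algorithm, the following minimal sketch tests whether an object is a core object, i.e. whether at least MinPts objects lie in its ε-neighborhood. It is a simplified illustration under stated assumptions, not the algorithm developed in this chapter: the Euclidean distance stands in for the expensive exact distance, the difference of norms stands in for a lower-bounding filter, and the function names are hypothetical.

```python
import math

def exact_dist(x, y):
    """Stand-in for an expensive exact distance (assumption: Euclidean)."""
    return math.dist(x, y)

def filter_dist(x, y):
    """Cheap lower bound (assumption): | ||x|| - ||y|| | <= ||x - y||."""
    return abs(math.hypot(*x) - math.hypot(*y))

def is_core_object(db, q, eps, min_pts):
    """Return (is_core, number of exact distance computations)."""
    # Filter step: only objects whose lower bound is within eps can be
    # eps-neighbours of q.
    candidates = sorted(
        (lb, i)
        for i, lb in enumerate(filter_dist(o, q) for o in db)
        if lb <= eps
    )
    # Too few candidates: q cannot be a core object, no refinement needed.
    if len(candidates) < min_pts:
        return False, 0
    confirmed = exact_calls = 0
    for pos, (_, i) in enumerate(candidates):
        # Give up as soon as the remaining candidates cannot reach min_pts.
        if confirmed + (len(candidates) - pos) < min_pts:
            return False, exact_calls
        exact_calls += 1
        if exact_dist(db[i], q) <= eps:
            confirmed += 1
            # Stop refining once the core condition is established.
            if confirmed >= min_pts:
                return True, exact_calls
    return False, exact_calls

db = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (3.0, 4.0), (6.0, 8.0)]
dense = is_core_object(db, (1.0, 0.0), 1.0, 2)
sparse = is_core_object(db, (6.0, 8.0), 1.0, 2)
print(dense, sparse)
```

In the second call, the filter alone shows that the object cannot be a core object, so no exact distance is computed at all.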

The remainder of this chapter is organized as follows. In Section 5.1, we look at different approaches presented in the literature for efficiently computing density-based clusterings. We will explain why these approaches are not suitable for expensive distance computations if we are interested in the exact clustering structure. In Section 5.2, we will present our new approach, which tries to use lower-bounding distance functions before computing the expensive exact distances. The new approach integrates the multi-step query processing paradigm directly into the clustering algorithms rather than using it independently. As our approach can also be used for generating approximated clusterings, we introduce objective quality measures in Section 5.3 which allow us to assess the quality of approximated clusterings. In Section 5.4, we present a detailed experimental evaluation showing that the presented approach can accelerate the generation of density-based clusterings on complex objects by more than one order of magnitude. We show that for approximated clustering the achieved quality is scalable w.r.t. the overall runtime. Section 5.5 summarizes the chapter.


5.1 Related Work

DBSCAN and OPTICS determine the local densities by performing repeated range queries. In this section, we will sketch different approaches from the literature to accelerate these density-based clustering algorithms and discuss their unsuitability for complex object representations.