

2.2.3 Classification and Clustering of Multi-Instance Objects

Classification of Multi-Instance Objects. Data mining on multi-instance or set-valued data objects has so far been examined predominantly in the context of classification. In [DLLP97a], Dietterich et al. defined the problem of multi-instance learning for drug prediction and provided a specialized algorithm to solve this particular task by learning axis-parallel rectangles. In the following years, new algorithms that improve the performance on this special task were introduced [Zho04]. In [WFP03], a more general method for handling multi-instance objects was introduced, which is applicable to a wider variety of multi-instance problems. This model considers several concepts for each class and requires certain cardinalities for the instances belonging to the concepts in order to specify a class of multi-instance objects. In addition to this model, [GFKS02b] proposes more general kernel functions for comparing multi-instance objects.

Clustering of Multi-Instance Objects. For clustering multi-instance objects, it is possible to use distance functions for sets of objects such as those in [EM97, RB01]. Given such a distance measure, multi-instance objects can be clustered with k-medoid methods like PAM and CLARANS [NH94] or with density-based clustering approaches like DBSCAN. Although this approach makes it possible to partition multi-instance objects into clusters, the clustering model consists, in the best case, of representative objects. Another problem of this approach is that the selection of a meaningful distance measure has an important impact on the resulting clustering. For example, the netflow distance [RB01] demands that all instances within two compared objects are somehow similar, whereas for the minimal Hausdorff distance [WZ00] the indication of similarity depends only on the closest pair.
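To illustrate how strongly the choice of set distance matters, the following minimal sketch compares the minimal Hausdorff distance (closest pair only) with the classical Hausdorff distance. The classical variant is included only for contrast and is not taken from the text above; the netflow distance is not implemented here. The function names, the Euclidean instance distance, and the example sets X and Y are assumptions made for illustration.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two instances (assumed instance distance)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def minimal_hausdorff(A, B, dist=euclidean):
    """Distance of the closest pair of instances across the two sets."""
    return min(dist(a, b) for a in A for b in B)

def hausdorff(A, B, dist=euclidean):
    """Classical Hausdorff distance: the largest nearest-neighbor distance
    from an instance of one set to the other set (shown for contrast only)."""
    d_ab = max(min(dist(a, b) for b in B) for a in A)
    d_ba = max(min(dist(a, b) for a in A) for b in B)
    return max(d_ab, d_ba)

# Two hypothetical multi-instance objects with one close pair of instances
# and one pair of instances that are far apart.
X = [(0.0, 0.0), (10.0, 10.0)]
Y = [(0.0, 1.0), (50.0, 50.0)]

print(minimal_hausdorff(X, Y))  # determined by the closest pair only
print(hausdorff(X, Y))          # dominated by the poorly matched instance
```

Under the minimal Hausdorff distance, X and Y appear very similar because a single pair of instances is close, whereas a distance that takes all instances into account rates them as far apart. A clustering built on top of either distance inherits this behavior.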


2.2.4 Evaluation Techniques

Measuring the effectiveness of a clustering method is a frequent task in this work. Thus, we describe several approaches for this task here. Often, we consider the agreement of the calculated clusterings with the given class systems. To do so, we can calculate different quality measures, e.g. precision, recall, F-measure, and average entropy.

In order to calculate the precision and F-Measure, we proceed as follows. For each cluster $c_i$ found by a clustering algorithm, its class assignment $Class(c_i)$ is determined by the class label of the objects belonging to $c_i$ that are in the majority. Then, we calculate the precision $P$, recall $R$, or F-Measure within all clusters w.r.t. the determined class assignments by using the following formulas:

\[ P = \frac{\sum_{c_i \in C} Card(\{o \mid Class(o) = Class(c_i)\})}{Card(DB)} \]

\[ R = \frac{\sum_{c_i \in C} Card(\{o \mid Class(o) \neq Class(c_i)\})}{Card(DB)} \]

\[ \text{F-Measure} = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall} \]
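The majority-based class assignment and the precision formula can be sketched as follows. This is a minimal illustration, not the evaluation code used in this work. It assumes that each cluster is given as a list of the class labels of its members, that the set in the numerator ranges over the objects of the respective cluster, and that Card(DB) equals the total number of clustered objects. The function names and the example clustering are hypothetical.

```python
from collections import Counter

def majority_class(labels):
    """Class assignment of a cluster: the most frequent class label in it."""
    return Counter(labels).most_common(1)[0][0]

def precision(clusters):
    """Fraction of objects whose class label equals the class assigned to
    their cluster. `clusters` is a list of lists of class labels."""
    db_size = sum(len(labels) for labels in clusters)  # assumed Card(DB)
    agreeing = 0
    for labels in clusters:
        assigned = majority_class(labels)
        agreeing += sum(1 for label in labels if label == assigned)
    return agreeing / db_size

def f_measure(p, r):
    """Harmonic mean of precision p and recall r."""
    return 2 * p * r / (p + r)

# Hypothetical clustering of seven objects with class labels "a" and "b".
example_clusters = [["a", "a", "b"], ["b", "b", "b", "a"]]
print(precision(example_clusters))
```

In the example, the first cluster is assigned class "a" and the second class "b", so five of the seven objects agree with the class of their cluster.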

In addition, we can measure the average entropy over all clusters. This quality measure is based on the impurity of a cluster $c_i$ w.r.t. the class labels of the objects belonging to $c_i$. Let $p_{j,i}$ be the relative frequency of the class label $Class_j$ in the cluster $c_i$. We calculate the average entropy as follows:

\[ Avg.Entropy = \frac{\sum_{c_i \in C} Card(c_i) \cdot \left( - \sum_{Class_j} p_{j,i} \log(p_{j,i}) \right)}{Card(DB)} \]
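A minimal sketch of the average entropy computation is given below. It assumes that each cluster is represented as a list of class labels, that the natural logarithm is used (the base is not fixed by the formula), and that Card(DB) defaults to the number of clustered objects. The function name and the parameter db_size are hypothetical.

```python
import math
from collections import Counter

def average_entropy(clusters, db_size=None):
    """Cluster-size-weighted entropy of the class labels, divided by Card(DB).
    `clusters` is a list of lists of class labels."""
    if db_size is None:
        db_size = sum(len(labels) for labels in clusters)  # assumed Card(DB)
    weighted_sum = 0.0
    for labels in clusters:
        size = len(labels)
        # Relative frequency p_{j,i} of each class label within the cluster.
        frequencies = [count / size for count in Counter(labels).values()]
        entropy = -sum(p * math.log(p) for p in frequencies)
        weighted_sum += size * entropy
    return weighted_sum / db_size

# A pure cluster contributes no entropy; a mixed cluster increases the value.
print(average_entropy([["a", "a"], ["b", "b"]]))
print(average_entropy([["a", "b"], ["a", "a"]]))
```

Lower values indicate purer clusters. A clustering in which every cluster contains objects of a single class yields an average entropy of zero.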

Furthermore, we can measure the agreement between the reference clustering and the result of a clustering algorithm using the Rand Index [HBV01], also known as the Rand statistic.
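For illustration, the Rand Index can be computed by counting the pairs of objects on which two clusterings agree, i.e. pairs that are placed together in both clusterings or separated in both. The sketch below uses this standard pair-counting definition. It assumes that both clusterings are given as lists of cluster identifiers in the same object order and that there are at least two objects. The function name is hypothetical.

```python
from itertools import combinations

def rand_index(reference, result):
    """Fraction of object pairs treated consistently by both clusterings.
    `reference` and `result` hold one cluster identifier per object."""
    pairs = list(combinations(range(len(reference)), 2))
    agreements = sum(
        1
        for i, j in pairs
        if (reference[i] == reference[j]) == (result[i] == result[j])
    )
    return agreements / len(pairs)

# Hypothetical example: the result splits the second reference cluster.
print(rand_index([0, 0, 1, 1], [0, 0, 1, 2]))
```

The quadratic enumeration of pairs keeps the sketch close to the definition. For large databases, a contingency-table formulation would be preferable.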

Part II

Similarity Search Techniques

Chapter 3

Efficient Object Identification

Object identification is a very important task in advanced database systems such as biometric and multimedia database systems. This chapter begins with the introduction of object identification in Section 3.1. Section 3.2 briefly discusses related work. In Section 3.3, we introduce the Gaussian uncertainty model for the identification task. Based on this model, two novel query types are defined. The algorithms used to determine the exact results for both query types are described in Section 3.4. These algorithms can either be used on top of a sequential scan of the complete database or be used in the refinement step for the candidate set generated by our index structure, the Gauss-tree. Section 3.5 defines the Gauss-tree along with the methods for query processing and tree construction. In Section 3.6, we give a detailed experimental evaluation of both the effectiveness and the efficiency of our technique. Section 3.7 concludes this chapter.

3.1 Introduction

In many applications like face recognition [ZCPR03, CWS95], fingerprint analysis [oI84], or voice recognition [Cam97], data objects are represented by feature vectors with a varying degree of exactness or uncertainty (see Section 1.1 of Chapter 1 for details). Therefore, the observed feature values cannot be considered to be known exactly, and two feature vectors describing the same object can be significantly different from each other. The degree of similarity between observed and exact values can vary from feature to feature because some features cannot be determined as exactly as others. For example, it is easier to determine the proportions of a face than the breadth of a nose. In addition to the varying uncertainties between the features, we have to consider individual uncertainties for the objects as well, because the circumstances under which a given data object is transformed into a feature vector may vary strongly. For example, most data collections consisting of facial images do not contain only images that were taken under the same illumination and at exactly the same distance between camera and face.

Due to these uncertainties, we are facing new problems. An object that is observed more than once under different circumstances will most likely generate a different feature vector for each of these observations. Thus, object identification, i.e. determining whether two feature vectors belong to the same object, becomes much more complicated. For example, we might have a database of facial features. When observing one of the persons stored in this database, we cannot simply search for the observed feature vector in the database.

To solve identification problems, the simplest solution is to employ feature-based similarity search. By defining a distance function like the Euclidean distance on feature vectors, we can assume that the distance between the feature vectors corresponds to the dissimilarity of the objects. Thus, to identify an object, we could retrieve the nearest neighbor in the database. To speed up query processing for large databases, a variety of index structures for feature spaces of medium to high dimensionality has been proposed, e.g. the TV-tree [LJF94] and the X-tree [BKK96].
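The nearest-neighbor approach to identification can be sketched as a sequential scan over the database. This is a minimal illustration that uses no index structure such as the TV-tree or the X-tree. The database layout (a dictionary from object identifiers to feature vectors), the function names, and the example values are assumptions.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def identify(query, database):
    """Return (object_id, distance) of the stored feature vector that is
    closest to the observed query vector, using a sequential scan."""
    best_id, best_vector = min(
        database.items(), key=lambda item: euclidean(query, item[1])
    )
    return best_id, euclidean(query, best_vector)

# Hypothetical database of two persons described by two features each.
faces = {"alice": (1.0, 2.0), "bob": (4.0, 6.0)}
observed = (1.5, 2.5)  # a new, slightly different observation
print(identify(observed, faces))
```

Note that this scheme treats every feature and every object as equally reliable. It always returns some nearest neighbor, even if the observed vector deviates strongly from all stored vectors.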