As mentioned above, one reason for the occurrence of multi-represented objects is the existence of several useful feature transformations that model different, important aspects of the same data objects, e.g. the shape and color of an image. Another reason is the existence of different measuring techniques for an object. A satellite might offer several pictures of the same area for varying frequency spectra like infrared or ultraviolet. A final reason for the occurrence of multi-represented objects is that several databases store the same data object independently. If these databases are integrated into a global data collection, the global view contains a representation from each of the source databases. For example, the efforts to build up integrated databases for terror prevention provide a picture of various facets of a person after data linkage is done. Each data source contributes another type of information about a potential terrorist. Thus, this highly delicate application uses multi-represented data as well.
Important applications providing multi-represented objects are:
[Figure 7.1 panels: "proteins" (text annotation, amino acid sequence) and "images" (keywords such as rider, horse, equestrian, hill, forest, …)]
Figure 7.1: Proteins and images can be described by multi-represented objects.
• Biomolecules
Biomolecules are described by various representations such as text annotations, sequential data (e.g. genes or amino acid sequences), or structural features (e.g. the secondary structure or the three-dimensional structure).
• General Images
As mentioned before, for images there exists a large variety of possible feature transformations that try to express different kinds of content of an image like shapes, textures and colors.
• CAD-Parts
CAD-parts can be transformed into feature representations by a variety of transformations. Furthermore, CAD-parts are often described by text containing structural, functional and commercial information.
• Biometry
Biometric applications are another interesting area providing multi-represented data. A person can identify herself by her fingerprints, her iris pattern, her voice or her face.
• Satellite Images
Satellites often take several pictures of an area using different frequency spectra like infrared or ultraviolet.
This enumeration lists only some of the applications of multi-represented objects and is far from complete. In the following, we will concentrate on the first two applications: data mining in molecular biological databases and in image databases. For those two applications, we will introduce techniques that are capable of exploiting the more meaningful input space. Figure 7.1 illustrates the first two examples.
Clustering of Multi-Represented Objects
Clustering is one of the most important data mining tasks, and many clustering algorithms have been introduced by the research community so far. Usually, these methods are targeted at finding groups of similar objects in one type of object representation using one distance function. In this chapter, density-based clustering of multi-represented objects is examined and two clustering methods that are based on DBSCAN are introduced. The introduced methods are applied to two important types of multi-represented data, protein data and images.
8.1 Introduction
In recent years, the research community has paid a lot of attention to clustering, resulting in a large variety of different clustering algorithms [DLR77, EKSX96, ZRL96, WYM97, AGGR98, GRS98, ABKS99, HK01]. However, all those methods are based on one representation space, usually a vector space of features and a corresponding distance measure. For a variety of modern applications such as biomolecular data, CAD-parts or multimedia files mined from the internet, it is problematic to find a common feature space that incorporates all given information. In this chapter, we therefore introduce a clustering method that is capable of handling multiple representations.
To cluster multi-represented data with the established clustering methods, it would be necessary either to restrict the analysis to a single representation or to construct a feature space comprising all representations. However, the restriction to a single feature space would not consider all available information, and the construction of a combined feature space demands great care when constructing a combined distance function. Since the distance functions best suited for each representation might not even provide the same range of values, it is difficult to find a proper combination that gives a meaningful distance. Another important problem is that several data objects might not provide all possible representations. For example, finding all representations of a protein is expensive and time consuming. Thus, far fewer three-dimensional models of proteins than amino acid sequences are available so far. In these cases, the combined distance function would need to handle missing representations adequately. A last drawback of combined feature spaces is the following. Since many clustering algorithms are based on similarity queries, the use of index structures is usually very beneficial to increase the efficiency, especially for large data sets. For the design of a proper combined distance measure, this is another important constraint to consider, since the combination of the distance functions needs to be at least metric to allow the use of an index structure.
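To make these difficulties concrete, the following minimal sketch shows one naive combined distance: a weighted average over the representations that both objects provide. It is an illustration only and not a method proposed in this chapter; the names `combined_distance`, `DISTANCES` and `WEIGHTS`, the choice of Euclidean and Hamming distance, and the example objects are all hypothetical assumptions.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length numeric tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def hamming(s, t):
    """Number of differing positions of two equal-length strings."""
    if len(s) != len(t):
        raise ValueError("hamming distance needs equal-length strings")
    return sum(c1 != c2 for c1, c2 in zip(s, t))

# Hypothetical setup: one distance function and one weight per representation.
DISTANCES = {"features": euclidean, "sequence": hamming}
WEIGHTS = {"features": 1.0, "sequence": 1.0}

def combined_distance(o1, o2, distances=DISTANCES, weights=WEIGHTS):
    """Weighted average over the representations both objects provide.

    A representation stored as None counts as missing. Returns None if
    the two objects share no representation at all.
    """
    total, used = 0.0, 0.0
    for name, dist in distances.items():
        if o1.get(name) is None or o2.get(name) is None:
            continue  # skip representations missing in either object
        total += weights[name] * dist(o1[name], o2[name])
        used += weights[name]
    return total / used if used > 0 else None

# Three toy objects; b lacks the "features" representation.
a = {"features": (0.0, 0.0), "sequence": "AAAA"}
b = {"features": None, "sequence": "AAAA"}
c = {"features": (1.0, 0.0), "sequence": "AAAA"}

d_ab = combined_distance(a, b)  # only the sequences are compared
d_bc = combined_distance(b, c)  # only the sequences are compared
d_ac = combined_distance(a, c)  # both representations are compared
print(d_ab, d_bc, d_ac)
```

In this toy example, a and c are both at distance 0 from b, because only the identical sequences can be compared, yet they are at a positive distance from each other. The triangle inequality is therefore violated, so this naive way of skipping missing representations does not yield a metric. The two partial distances also live on different scales (the Hamming distance grows with the sequence length), so the weights would have to be tuned for every new application.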
In this chapter, we propose a method to integrate multiple representations directly into the clustering algorithm. Our method is based on the density-based clustering algorithm DBSCAN [EKSX96], which provides several advantages over other algorithms, especially when analyzing noisy data. Since our method employs a separate feature space for each representation, it is not necessary to design a new suitable distance measure for each new application. Additionally, the handling of objects that do not provide all possible representations is integrated naturally without defining dummy values to compensate for the missing representations. Last but not least, our method does not require a combined index structure, but benefits from each index that is provided for a single representation. Thus, it is possible to employ highly specialized index structures and filters for each representation. We evaluate our method for two example applications. The first is a data set consisting of protein sequences and text descriptions. Additionally, we applied our method to the clustering of images retrieved from the internet. For this second data set, we employed two different similarity models. The introduced solutions were published in [KKPS04a].
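The idea of keeping a separate feature space for each representation can be illustrated with a small sketch. It is not the clustering method formalized in section 8.3; it only shows, under simplifying assumptions, how an ε-range query might be answered per representation with that representation's own distance function, and how objects lacking a representation can simply be skipped. All names (`local_neighborhoods`, `eps`, the toy objects) and the chosen distance functions are hypothetical, and the linear scan stands in for whatever index structure is available for a representation.

```python
def jaccard_distance(a, b):
    """Jaccard distance between two sets of keywords."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)

def hamming(s, t):
    """Number of differing positions of two equal-length strings."""
    if len(s) != len(t):
        raise ValueError("hamming distance needs equal-length strings")
    return sum(c1 != c2 for c1, c2 in zip(s, t))

def local_neighborhoods(query_id, objects, distances, eps):
    """Return one eps-neighborhood per representation of the query object.

    objects:   dict id -> dict representation name -> value (None = missing)
    distances: dict representation name -> distance function
    eps:       dict representation name -> range threshold
    """
    query = objects[query_id]
    result = {}
    for name, dist in distances.items():
        if query.get(name) is None:
            continue  # the query has no such representation
        # Linear scan as a stand-in for a per-representation index.
        result[name] = {
            oid
            for oid, obj in objects.items()
            if obj.get(name) is not None
            and dist(query[name], obj[name]) <= eps[name]
        }
    return result

# Toy data: p2 has a text annotation but no sequence.
objects = {
    "p1": {"text": {"kinase", "binding"}, "sequence": "MDYQ"},
    "p2": {"text": {"kinase"}, "sequence": None},
    "p3": {"text": {"transport"}, "sequence": "MDYK"},
}
distances = {"text": jaccard_distance, "sequence": hamming}
eps = {"text": 0.5, "sequence": 1}

print(local_neighborhoods("p1", objects, distances, eps))
print(local_neighborhoods("p2", objects, distances, eps))
```

Each representation keeps its own distance function and its own threshold, so no common value range has to be constructed, and no dummy values are needed for p2. How such local neighborhoods are combined into a clustering decision is deliberately left open in this sketch.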
The rest of the chapter is organized as follows. After this introduction, we present related work. Section 8.3 formalizes the problem and introduces our new clustering method. In our experimental evaluation that is given in section 8.4, we introduce a new quality measure to judge the quality of a clustering with respect to a reference clustering and display the results achieved by our method in comparison with the other mentioned approaches. The last section summarizes the chapter.
8.2 Related Work
As mentioned in the introduction, the research community has developed a variety of algorithms to cluster data for various applications [ABKS99, EKSX96, HK98, GRS98, ZRL96, XEKS98]. Most of these algorithms are designed for one feature space and one distance function to represent the data objects. Thus, to apply these algorithms to multi-represented data, it is necessary to unite the representations into one common feature space.
A setting similar to the clustering of multi-represented objects is the clustering of heterogeneous or multi-typed objects [WZC+03, ZCM02] in web mining. In this setting, there are also multiple databases, each yielding objects in a separate data space. Each object within these data spaces may be related to an arbitrary number of data objects within the other data spaces. The framework of reinforcement clustering employs an iterative process based on an arbitrary clustering algorithm. It clusters one dedicated data space while employing the other data spaces for additional information. It is also applicable to multi-represented objects. However, due to its dependency on the data space, it is not well suited to solve our task. Since, to the best of our knowledge, reinforcement clustering is the only other clustering algorithm directly applicable to multi-represented objects, we use it for comparison in our evaluation section.
The goal of clustering multi-represented objects is to find a global clustering for data objects that might have representations in multiple data spaces. The setting of reinforcement clustering, in contrast, is to cluster the data within one data space while using the related data spaces for additional information. Since the results may vary for different starting representations, the application of reinforcement clustering is problematic. It is unclear how many iterations are needed until a common clustering for all representations is found, and whether the algorithm reaches a common clustering at all for an arbitrary number of iterations. Let us note that this is not a problem in the original use of reinforcement clustering, but it causes a major problem when applying it to multi-represented objects.
Our method is based on the density-based clustering algorithm DBSCAN, which was introduced in section 2.2.3.