The above contributions lead us to some general conclusions about multi-instance data mining. Several distance measures on multi-instance objects have been introduced that are suitable for different applications. The minimal Hausdorff distance is known to provide a suitable kNN classifier for classical multi-instance learning [WZ00]. In chapter 4, we use the minimal matching distance for clustering, and in chapter 5 we use the sum of minimum distances for kNN classification. Thus, distance-based data mining proves to be a valuable approach for multi-instance data mining. However, this approach still suffers from the following drawbacks:
• Though a suitable distance measure was found for some of the mentioned problems, it is not clear which of the distance measures is best suited for a new application. This drawback is a general problem of distance-based data mining. However, for multi-instance problems it is especially critical because the ideas behind the distance measures vary strongly. For example, the minimal Hausdorff distance defines the distance of two objects as the distance of their closest pair of instances, while the minimal matching distance compares disjoint pairs of all instances.
• Another problem of distance-based data mining is efficiency. Many data mining algorithms need large numbers of distance calculations, and most distance measures for multi-instance objects are very expensive, i.e. one distance calculation has quadratic or even cubic time complexity with respect to the maximum number of instances in the compared objects. Furthermore, similarity queries are often difficult to speed up because the similarity measures used might not even be metric. To avoid this problem, we introduced a filter step and kNN classification based on centroid sets. However, a general approach for speeding up distance-based multi-instance data mining is unlikely to exist due to the strongly varying notions of multi-instance distance functions.
Besides distance-based data mining, we applied an aggregation-based approach to handle multi-instance classification. Aggregation can be considered as an additional preprocessing step that transforms a multi-instance object into a feature vector. The resulting feature vector is used to represent the multi-instance object when applying standard data mining algorithms. Though we found a solution for the given application, aggregation is not generally applicable. Finding a suitable aggregation function is strongly application dependent, just like the selection of a suitable distance measure. Furthermore, not all relationships between sets of instances are expressible by a single feature vector.
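To illustrate how strongly the notions discussed above can differ, the following minimal sketch compares two of the mentioned distance measures and one possible aggregation on a toy example. The function names are hypothetical, the Euclidean distance between instances is an assumption, and the factor 1/2 in the sum of minimum distances is one common normalization that may differ from the definition used in chapter 5. The component-wise mean is only an example of an aggregation function, not necessarily the one used in this work.

```python
import math
from statistics import fmean

def min_hausdorff(a, b):
    """Minimal Hausdorff distance: distance of the closest pair of instances."""
    return min(math.dist(x, y) for x in a for y in b)

def sum_of_min_distances(a, b):
    """Sum of minimum distances in both directions (normalization by 1/2 is assumed)."""
    a_to_b = sum(min(math.dist(x, y) for y in b) for x in a)
    b_to_a = sum(min(math.dist(x, y) for x in a) for y in b)
    return 0.5 * (a_to_b + b_to_a)

def mean_aggregate(a):
    """Aggregation: map a set of instances to one feature vector (component-wise mean)."""
    return tuple(fmean(column) for column in zip(*a))

# Two toy multi-instance objects, each a set of 2-dimensional instances.
obj_a = [(0.0, 0.0), (3.0, 4.0)]
obj_b = [(0.0, 1.0), (6.0, 8.0)]

d_hausdorff = min_hausdorff(obj_a, obj_b)   # only the closest pair matters
d_smd = sum_of_min_distances(obj_a, obj_b)  # every instance contributes
vec_a = mean_aggregate(obj_a)               # single-vector representation
```

On this example the minimal Hausdorff distance depends only on the pair (0, 0) and (0, 1), whereas the sum of minimum distances also reflects the instances that lie far apart. Both functions compare all pairs of instances, which is the quadratic cost per distance calculation mentioned above.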
The internal crawler used a statistical process to model a class of multi-instance objects. This approach is the most general of the introduced techniques for multi-instance data mining because it can handle different kinds of relationships between the instances of two objects. Depending on the distribution function used for each process, this approach can determine how specific a single instance is for a class. However, this approach is suitable for classification only and is based on the assumption of independence between the instances of one object. Another important problem is the selection of the distribution function that is used to generate the instances of a class. Last but not least, the number of instances of an object is treated as equally distributed for each class, which might not be realistic.
To conclude, classification and clustering of multi-instance objects require a careful examination of the given application. Depending on the relationships between the instances of two compared objects and the degree to which the instances within an object are correlated, different methods or similarity measures are applicable. Thus, finding a proper solution depends on the given application to a very high degree.
Data Mining in Multi-Represented Objects
Multi-Represented Objects
Multi-represented objects are the second basic type of compound objects besides multi-instance objects. A multi-represented object consists of a tuple of feature representations. Each feature representation belongs to a different feature space and represents another view on the object. In this chapter, we give a brief introduction to multi-represented objects and survey important applications.
7.1 Multi-Represented Objects
Many important areas of KDD are concerned with finding useful patterns in large collections of complex objects. Images, biomolecules and CAD parts are only some examples of complex objects that are at the center of interest of many researchers. However, the more complex a type of object is, the more feature transformations exist that try to extract relevant features and construct a meaningful object representation. For example, [VT00] surveys a variety of systems for content-based image retrieval and their various feature transformations. Other examples are the feature transformations of CAD parts described in chapter 4.3. Each of these feature transformations is well suited for different applications and treats a data object from a different point of view. For example, shape descriptors are well suited for spotting certain objects in images, whereas color histograms are better suited to compare complete sceneries.
For data mining, the existence of multiple feature transformations is often problematic because it is not clear which of the representations contains the features that are needed to achieve the desired results. Thus, the selection of a feature transformation is often a difficult and crucial decision that strongly influences the resulting patterns. Clearly, incorporating all available feature transformations offers a more complete view of a data object and minimizes the risk that the information necessary to derive meaningful patterns is not contained in the object representation. On the other hand, considering too many aspects of an object is often problematic as well. The patterns found are often very complicated and lose generality. Furthermore, the efficiency of the data mining algorithms suffers strongly since many more features have to be processed. Thus, integrating multiple representations offers opportunities as well as drawbacks.
In this chapter, we introduce techniques that enable data mining for multi-represented objects. The idea of multi-represented data mining is to use compound objects, i.e. the tuples of all available object representations, as input for the data mining algorithms. The use of this object representation yields solutions that can take advantage of the additional information, as we will see in the next two chapters.
Formally, we define a multi-represented object $o$ as a tuple $(r_1, \ldots, r_k) \in R_1 \times \ldots \times R_k$. The representation space $R_i = F_i \cup \{\text{"-"}\}$ consists of a feature space $F_i$ for a given representation and a symbol "-" that denotes a missing representation vector. Missing representations are an important case that must be considered to solve real-world problems. For example, in image databases new images often lack a text annotation, or in protein databases the three-dimensional structure of an already known protein has not been explored yet.
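The formal definition can be mirrored in code as a tuple with one slot per representation. The following is a minimal sketch under stated assumptions: the names `make_object` and `available_representations` are hypothetical, feature vectors are assumed to be tuples of floats, and Python's `None` plays the role of the missing-representation symbol.

```python
from typing import Optional, Sequence, Tuple

# One slot per representation space R_i; None stands for the missing symbol.
Representation = Optional[Tuple[float, ...]]
MultiRepresentedObject = Tuple[Representation, ...]

def make_object(*representations: Optional[Sequence[float]]) -> MultiRepresentedObject:
    """Build the tuple (r_1, ..., r_k), keeping missing representations as None."""
    return tuple(None if r is None else tuple(r) for r in representations)

def available_representations(obj: MultiRepresentedObject) -> list:
    """Return the indices i for which r_i is an actual feature vector."""
    return [i for i, r in enumerate(obj) if r is not None]

# An image with a color histogram but no text annotation yet (k = 2).
image = make_object([0.2, 0.5, 0.3], None)
present = available_representations(image)
```

Keeping the missing slot in the tuple, rather than dropping it, preserves the position of each representation, so an algorithm can always tell which feature space a given vector belongs to.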