
4.4 Implementation of RIS

4.5.3 Effectiveness Evaluation

Synthetic Data Sets. We evaluated the effectiveness of RIS using several synthetic data sets of varying dimensionality. The data sets contained between two and five overlapping clusters in varying subspaces. In all experiments, RIS detected the correct subspaces in which clusters exist and assigned the highest quality values to them. All higher-dimensional subspaces that were generated were removed by the upward pruning procedure.

Gene Expression Data. We also applied RIS to the Spellman data set (cf. Section 3.5.1). The two top-ranked subspaces were the subspace spanned by the time slots 90, 110, 130, and 190 and the subspace spanned by the time slots 190, 270, and 290. Both subspaces also played a central role in the evaluation of the algorithm SUBCLU (cf. Section 3.5.3). A clustering using OPTICS in these two top-ranked subspaces provided several clusters and in fact more information than SUBCLU yielded. This is due to the use of a hierarchical clustering algorithm in the detected subspaces. For example, SUBCLU clustered the genes MRPL17, MRPL31, MRPL32, and MRPL33 (four mitochondrial large ribosomal subunits) together with other mitochondrial proteins SNF7 and VPS4 (which are direct interaction partners). However, several other genes that code for mitochondrial proteins, e.g. MEF1, PHB1, CYC1, MGE1, and ATP12, could be added to this cluster because of the information OPTICS yielded in this subspace. Figure 4.6 illustrates the part of the cluster ordering generated by OPTICS in this particular subspace. It can be seen that the additional genes lie in a less dense region than the core part of the cluster. The global parameter setting for the SUBCLU run in Section 3.5.3 was too strict to detect the entire nested cluster, i.e. the ε-value was too small. However, running SUBCLU with a higher ε-value also adds unrelated genes, i.e. noise points, to the cluster.

[Figure 4.6 (reachability plot); genes labeled along the cluster ordering: MRPL17, MRPL31, MRPL32, MRPL33, UBC1, UBC4, VPS4, SNF7, ..., CYC1, MGE1, PHB1, MEF1, ATP12, MCR1]

Figure 4.6: Part of the reachability plot generated by OPTICS for the subspace which was ranked second by RIS.
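The sensitivity to a single global ε can be illustrated with a small sketch. It is not the SUBCLU implementation: it uses a plain DBSCAN-style clustering as a stand-in, and the one-dimensional toy data (a dense core, a sparser shell, and still sparser noise), the function names, and the parameter values are all assumptions made for illustration.

```python
from collections import deque

# Hypothetical 1-D toy data: dense core, sparser shell, even sparser noise.
CORE = list(range(10))        # spacing 1
SHELL = [12, 15, 18, 21]      # spacing 3
NOISE = [26, 31, 36, 41]      # spacing 5
POINTS = CORE + SHELL + NOISE


def region_query(points, i, eps):
    """Indices of all points within distance eps of point i (including i)."""
    return [j for j, q in enumerate(points) if abs(points[i] - q) <= eps]


def dbscan_1d(points, eps, min_pts):
    """DBSCAN-style labeling: cluster id (0, 1, ...) per point, -1 for noise."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = -1            # provisionally noise
            continue
        cluster += 1
        labels[i] = cluster
        queue = deque(neighbors)
        while queue:
            j = queue.popleft()
            if labels[j] == -1:
                labels[j] = cluster   # former noise becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            expansion = region_query(points, j, eps)
            if len(expansion) >= min_pts:
                queue.extend(expansion)   # j is a core point: keep growing
    return labels


if __name__ == "__main__":
    for eps in (1, 3, 5):
        labels = dbscan_1d(POINTS, eps, min_pts=3)
        print(f"eps={eps}: clustered={labels.count(0)}, noise={labels.count(-1)}")
```

On this toy data, ε = 1 recovers only the 10 core points, ε = 3 adds the 4 shell points, and ε = 5 also absorbs all 4 noise points. Each fixed ε exposes only one density level, whereas a hierarchical method such as OPTICS represents all levels in a single cluster ordering.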

Additionally, RIS combined with OPTICS found some clusters which were not detected by SUBCLU. An excerpt of such a cluster is depicted in Table 4.1. This cluster was again found in the subspace spanned by the time slots 90, 110, 130, and 190 and contains several transcription-related genes that directly interact with each other. It was not detected by SUBCLU because it does not satisfy the density threshold used for the SUBCLU run. However, it forms a significant valley in the reachability plot generated by OPTICS for that subspace. The functional relationship of the contained genes is biologically meaningful and important.

In summary, RIS detects several subspaces that contain biologically relevant co-expressions. All significant clusters that SUBCLU found were reproduced by the combined application of RIS and OPTICS. Furthermore, the application of the hierarchical algorithm OPTICS yielded new information such as extended nested clusters and additional clusters showing different densities. Because it outperforms SUBCLU, the combined application of RIS and OPTICS is also more accurate than CLIQUE.

Gene Name   Function
RRP3        RNA splicing, builds complex with NPL3
NPL3        RNA splicing, builds complex with RRP3
TFA1        transcription elongation factor
SPT5        part of transcription elongation factor complex (TEFC)
CDC73       part of TEFC, builds complex with CKB1
CKB1        cell cycle transition gene, builds complex with CDC73

Table 4.1: A cluster missed by SUBCLU but detected by RIS/OPTICS.

4.6 Summary

In this chapter, we introduced a preprocessing step for clustering high-dimensional data. Based on a quality criterion for the interestingness of a subspace, we presented an efficient algorithm called RIS to compute all interesting subspaces containing dense regions of arbitrary shape and size. Furthermore, the well-established technique of random sampling can be applied to RIS in order to reduce the runtime of the algorithm significantly with a minimum loss of quality. The effectiveness evaluation shows that a combination of RIS and OPTICS can be successfully applied to high-dimensional real-world data, such as gene expression data, in order to find co-regulated genes.

Chapter 5

Advanced Subspace Selection for Clustering

The previous chapter showed that the combination of the subspace selection technique RIS and the hierarchical clustering algorithm OPTICS is superior to subspace clustering algorithms which are based on a global density threshold. The problem that still remains is that RIS itself is again based on a global density threshold. In this chapter, we present a feature selection technique called SURFING (SUbspaces Relevant For clusterING) that finds all subspaces interesting for clustering, ranks them, and is independent of any global density threshold. The ranking is based on a quality criterion that uses the k-nearest neighbor distances of the points to measure the hierarchical clustering structure of a subspace. A broad evaluation based on synthetic and real-world data sets demonstrates that SURFING is suitable to find all relevant subspaces in large, high-dimensional, sparse data sets and produces better results than comparable methods.

5.1 Introduction

Recent density-based approaches to subspace clustering and comparable subspace selection methods (such as RIS) use a global density threshold for the definition of clusters for efficiency reasons. However, applying one global density threshold to subspaces of different dimensionality, as well as to all clusters in one subspace, is problematic. The volume of the data space grows exponentially with each dimension that is added to a subspace, so a threshold that suits low-dimensional subspaces is too strict for higher-dimensional ones. Moreover, the clusters in the same subspace may have different densities or exhibit a nested hierarchical clustering structure. Therefore, for subspace clustering, it would be highly desirable to adapt the density threshold to the dimensionality of the subspaces or, even better, to rely on a hierarchical clustering notion that is independent of a globally fixed threshold.
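The effect of dimensionality on a fixed threshold can be made concrete with a short calculation. The sketch below assumes points drawn uniformly from the unit hypercube and ignores boundary effects; the function names and the values of n and ε are hypothetical and chosen only for illustration. It uses the standard formula for the volume of a d-dimensional ball, V_d(ε) = π^(d/2) ε^d / Γ(d/2 + 1).

```python
import math


def ball_volume(d, eps):
    """Volume of a d-dimensional Euclidean ball with radius eps."""
    return math.pi ** (d / 2) * eps ** d / math.gamma(d / 2 + 1)


def expected_neighbors(n, d, eps):
    """Expected number of points inside an eps-neighborhood when n points are
    drawn uniformly from the unit hypercube [0, 1]^d (boundary effects ignored)."""
    return n * ball_volume(d, eps)


if __name__ == "__main__":
    n, eps = 1000, 0.1  # hypothetical data set size and global threshold
    for d in (1, 2, 3, 5, 10):
        print(f"d={d:2d}  expected neighbors = {expected_neighbors(n, d, eps):.6g}")
```

Under these assumptions the expected neighborhood size drops from about 200 points in one dimension to fewer than one point in five dimensions, so a minimum-points requirement that is easily met in a low-dimensional subspace can become unreachable in a higher-dimensional one.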

In this chapter, we introduce SURFING (SUbspaces Relevant For clusterING), a feature selection method for clustering which does not rely on a global density parameter. Our approach explores all subspaces exhibiting an interesting hierarchical clustering structure and ranks them according to a quality criterion based on the k-nearest neighbor distances of the points. SURFING does not require the user to specify parameters that are hard to anticipate, such as the number of clusters, the (average) dimensionality of subspace clusters, or a global density threshold.
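To give an intuition for why k-nearest neighbor distances carry information about the clustering structure of a subspace, the following sketch computes them for individual subspaces of a small toy data set. It is not the SURFING quality criterion, which is developed in Section 5.2. The coefficient of variation used here is only an assumed stand-in, and the data, the function names, and the choice k = 2 are hypothetical.

```python
import math
import statistics


def knn_distances(points, dims, k):
    """k-th nearest neighbor distance of every point, measured in subspace dims."""
    projected = [tuple(p[d] for d in dims) for p in points]
    result = []
    for i, p in enumerate(projected):
        dists = sorted(math.dist(p, q) for j, q in enumerate(projected) if j != i)
        result.append(dists[k - 1])
    return result


def knn_spread(points, dims, k):
    """Stand-in score (an assumption, not the SURFING criterion): coefficient of
    variation of the k-nn distances. Evenly spread data give a low value."""
    dists = knn_distances(points, dims, k)
    return statistics.pstdev(dists) / statistics.mean(dists)


# Hypothetical toy data with two attributes:
# attribute 0 holds a dense group (0..9) plus sparse points (20, 30, ..., 60),
# attribute 1 is an evenly spaced grid in scrambled order.
ATTR0 = list(range(10)) + [20, 30, 40, 50, 60]
ATTR1 = [(i * 7 % 15) * 4 for i in range(15)]
DATA = list(zip(ATTR0, ATTR1))

if __name__ == "__main__":
    for dims in [(0,), (1,), (0, 1)]:
        print(dims, round(knn_spread(DATA, dims, k=2), 3))
```

In the evenly spaced attribute almost all points have the same k-nn distance, so the spread is small. In the attribute that mixes a dense group with sparse points the k-nn distances differ strongly, which is the kind of signal a ranking based on k-nn distances can exploit without any global density threshold.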

The remainder of this chapter is organized as follows. A quality criterion for ranking the interestingness of subspaces is developed in Section 5.2. In Section 5.3, we present our algorithm SURFING to rank all subspaces that are relevant for clustering. A thorough experimental evaluation of the performance of SURFING, including a comparison with comparable subspace clustering methods, is presented in Section 5.4. Section 5.5 concludes the chapter.