

5.2 Clustering Effort Data

5.2.3 Parallel Clustering Techniques

Clustering techniques are used heavily for statistical analysis, data mining, and other data-intensive applications. Many parallel clustering implementations have been attempted. We summarize them below.

Parallel K-Means Clustering

A number of distributed-memory parallel K-Means algorithms have been developed. Ranka and Sahni (Ranka and Sahni, 1991) describe a K-Means clustering algorithm for use in image-analysis applications. Their algorithm runs on supercomputers with hypercube interconnect topologies. The authors showed that this algorithm scales to 64 cores with roughly 50% parallel efficiency when clustering 16,384 data points. Stoffel and Belkoniene (Stoffel and Belkoniene, 1999) present another parallel K-Means implementation that achieves 90% parallel efficiency on as many as 32 processors for data sets of 100,000 objects. Forman and Zhang (Forman and Zhang, 2000b; Forman and Zhang, 2000a) present a similar K-Means algorithm that achieves nearly linear speedup on as many as 128 processors for object counts of up to 10 million. Finally, Kraj et al. present ParaKMeans, a publicly available parallel K-Means implementation that can be used through a web interface. Their algorithm achieves 5x speedup with 7 processors and 10,000 data points. Depending on k, the speedup of ParaKMeans levels off between 4 and 7 processes, and there is no marginal speedup after this point.
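To make the distributed-memory setting concrete, the following is a minimal sketch of one iteration of the data-parallel pattern commonly used for K-Means: each processor assigns its local points to the nearest centroid and accumulates partial sums, and a global reduction combines those sums into new centroids. This is an illustration only and is not claimed to be the method of any of the papers cited above. The function names (`local_stats`, `reduce_stats`, `kmeans_step`) are hypothetical, and the partitions are processed in a plain loop where a real implementation would run them on separate processors.

```python
from math import dist


def local_stats(points, centroids):
    """Per-processor step: assign local points, accumulate per-cluster sums."""
    k, d = len(centroids), len(centroids[0])
    sums = [[0.0] * d for _ in range(k)]
    counts = [0] * k
    for p in points:
        # Index of the nearest centroid (Euclidean distance).
        j = min(range(k), key=lambda c: dist(p, centroids[c]))
        counts[j] += 1
        for a in range(d):
            sums[j][a] += p[a]
    return sums, counts


def reduce_stats(partials, centroids):
    """Global step: combine partial sums (an all-reduce in a message-passing setting)."""
    k, d = len(centroids), len(centroids[0])
    new_centroids = []
    for j in range(k):
        n = sum(counts[j] for _, counts in partials)
        if n == 0:
            # Assumption: an empty cluster keeps its previous centroid.
            new_centroids.append(list(centroids[j]))
            continue
        new_centroids.append(
            [sum(sums[j][a] for sums, _ in partials) / n for a in range(d)]
        )
    return new_centroids


def kmeans_step(partitions, centroids):
    """One K-Means iteration over data that is already split across processors."""
    partials = [local_stats(part, centroids) for part in partitions]
    return reduce_stats(partials, centroids)
```

In this pattern only the k partial sums and counts cross processor boundaries, so communication per iteration depends on k and the dimensionality rather than on the number of data points. That is one plausible reason implementations of this kind scale well when each processor holds many points.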

Parallel Hierarchical Clustering

Parallel hierarchical clustering has been studied extensively. Olson (Olson, 1993) presents optimal hierarchical clustering algorithms. In addition to shared-memory algorithms, he presents an algorithm for butterfly networks of n processors that runs in O(n log(n)) time. Rajasekaran (Rajasekaran, 2005) improves on this work by defining tighter bounds for the expected running time. His algorithm runs in O(log(n)) time on a system with n processors in the average case, but it assumes that the data points to be clustered are distributed uniformly among processors, which does not always hold in practice.

Du and Lin (Du and Lin, 2005) present a hierarchical clustering implementation for machines running MPI. They perform experiments with three biological gene-expression data sets representing 7,452, 9,217, and 11,017 genes, respectively. For the smallest data set, they achieve 25x speedup with 48 processors. For the larger data sets, speedup is around a factor of 15 with the same number of processors. The authors report that this speedup increases linearly with system size, and marginal speedup does not decrease before 48 processors.

Wang et al. (Wang et al., 2008) present another hierarchical clustering implementation using MPI, but they conduct experiments on financial data. They show an 8x speedup using 16 processors and a 9x speedup using 32 processors, but marginal speedup diminishes after this point.
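The speedup figures quoted above for the two MPI implementations can be put on a common footing by converting them to parallel efficiency, defined as speedup divided by processor count. The short sketch below does that arithmetic for the reported numbers and also computes the marginal speedup between two runs. The helper names and the `reported` table are illustrative only, and the "around a factor of 15" figure is entered as exactly 15.

```python
def parallel_efficiency(speedup, processors):
    """Fraction of ideal (linear) speedup actually achieved."""
    return speedup / processors


def marginal_speedup(s1, p1, s2, p2):
    """Additional speedup gained per added processor between two runs."""
    return (s2 - s1) / (p2 - p1)


# (speedup, processors) pairs as quoted in the text above.
reported = {
    "Du and Lin, smallest data set": (25, 48),
    "Du and Lin, larger data sets": (15, 48),  # "around a factor of 15"
    "Wang et al., 16 processors": (8, 16),
    "Wang et al., 32 processors": (9, 32),
}

for label, (speedup, procs) in reported.items():
    print(f"{label}: efficiency = {parallel_efficiency(speedup, procs):.2f}")

# Gain per extra processor when Wang et al. go from 16 to 32 processors.
print(f"Wang et al., 16 -> 32: {marginal_speedup(8, 16, 9, 32):.4f} per processor")
```

Under this definition, doubling the processor count from 16 to 32 in the Wang et al. experiments adds only one unit of speedup, which is the diminishing marginal return noted above.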

Parallel Subspace Clustering

Nagesh et al. have implemented a parallel subspace clustering algorithm, pMAFIA, in MPI. pMAFIA partitions the space to be clustered evenly among processors. The authors conduct experiments on an IBM SP2 cluster with 16 processors. With a large data set containing 8.3 million records, the communication required for pMAFIA is negligible compared to the amount of computation. The authors report near-linear speedup when clustering this data on as many as 16 processors.

Parallel K-Medoids Clustering

Kaufman and Hopke (Kaufman et al., 1988; Hopke, 1990) developed and ran a parallel version of the CLARA K-Medoids clustering algorithm described in §5.2.2. Rather than parallelize the full algorithm, they presented two approaches that take advantage of the sampled nature of CLARA. Since CLARA executes multiple sampled trials of PAM, their approach distributes the sampled trials across a cluster of 10 processors. Good performance was reported, but specific numbers were not provided.
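The idea of distributing independent sampled trials can be sketched as follows. This is a minimal illustration under several assumptions, not a reconstruction of the cited implementation: a thread pool stands in for the cluster of processors, an exhaustive search over the sample stands in for PAM (feasible only for very small samples), and all function names and default parameters are hypothetical.

```python
import random
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations
from math import dist


def cost(data, medoids):
    """Total distance from each point to its nearest medoid."""
    return sum(min(dist(p, m) for m in medoids) for p in data)


def best_medoids_on_sample(sample, k):
    """Stand-in for PAM: exhaustive search over all k-subsets of the sample."""
    return min(combinations(sample, k), key=lambda ms: cost(sample, ms))


def clara_trial(data, k, sample_size, seed):
    """One sampled trial: cluster a random sample, score it on the full data."""
    rng = random.Random(seed)  # per-trial seed keeps trials independent
    sample = rng.sample(data, sample_size)
    medoids = best_medoids_on_sample(sample, k)
    return cost(data, medoids), medoids


def parallel_clara(data, k, sample_size=8, trials=10, workers=4):
    """Run the trials concurrently and keep the medoid set with the lowest cost."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(
            lambda seed: clara_trial(data, k, sample_size, seed), range(trials)
        )
        return min(results, key=lambda r: r[0])
```

Because the trials share no state, the only coordination needed is the final comparison of costs. Threads are used here purely to keep the sketch self-contained; in CPython they would not speed up this CPU-bound work, and a process pool or message-passing job would be the closer analogue of a multi-processor run.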

Using Parallel Clustering with Effort Data

A natural solution to the data-volume problem would be to investigate parallel clustering algorithms for effort data. As discussed above, there is much existing work on parallel clustering, some of which has achieved linear speedup. However, the focus of most existing work has been on partitioning very large data sets among a small number of processors. Effort data, on the other hand, is nearly completely distributed, and we would need to design a clustering algorithm with much more communication and with smaller amounts of data per processor than those described here.

Clustering algorithms for entirely distributed data have been developed. Bandyopadhyay and Coyle (Bandyopadhyay and Coyle, 2003) present a distributed hierarchical clustering algorithm aimed at reducing power consumption in distributed sensor networks. Their algorithm groups sensors into a single-level hierarchy in which each sensor sends data to a clusterhead. When all sensors report to their clusterheads instead of directly to a centralized data sink, energy consumption in the sensor network is minimized.

The problem of conducting a full parallel clustering of effort data is more similar to the problem of clustering data on sensor networks than to traditional data-intensive clustering approaches. However, existing work in sensor networks has focused on energy reduction, while we are interested in conducting intensive analysis within our network of processors. Adapting existing algorithms for this purpose is a difficult problem that lies beyond the scope of this dissertation. In this work, we have investigated clustering strategies for single-node machines using effort approximations to reduce the volume of data to be clustered.