DYNAMIC DATA CLUSTERING UNDER DISTRIBUTED ENVIRONMENT

Ms. N. Anusha1, Dr. T. Senthil Prakash2, Mrs. P.V. Jothi Kantham3, Mr. N. Vipin Raj4
II Year M.E (CSE)1, Professor & HOD2, Assistant Professor3, II Year M.E (CSE)4

Shree Venkateshwara Hi-Tech Engg. College, Gobi, Tamilnadu, India1,2,3,4
Anusha2007@gmail.com1, jtyesp@yahoo.co.in2, vipinrajn@gmail.com4

Abstract

Clustering techniques group transactions based on relevance. Hierarchical and partitioning techniques are used for the clustering process. A distance measure is used to estimate the relationship between transactions. Data point clustering is carried out using geometrical structures. Huge volumes of data values are distributed across multiple systems in a peer-to-peer environment, where processing, storage and transmission costs are the key issues of distributed data processing. The General Decentralized Clustering (GDCluster) algorithm is applied to perform clustering on dynamic and distributed data sets. A summarized view of the data set is used in the clustering process. GDCluster is customized to execute partition-based and density-based clustering methods on the summarized views. The clustering model is tuned to adapt to dynamic data values. The GDCluster model supports both partition-based and density-based clustering tasks. A weighted K-means algorithm is tuned to perform partition-based clustering in the distributed environment.

1. Introduction

Clustering is the task of grouping a set of objects so that the objects in the same cluster are more similar to each other than to those in other clusters. Clustering is a main task of exploratory data mining and a common technique for statistical data analysis used in many fields, including machine learning, pattern recognition, image analysis, information retrieval and bioinformatics.

The notion of a cluster varies between algorithms and is one of the many decisions to make when choosing the appropriate algorithm for a particular problem. At first the notion of a cluster seems obvious: a group of data objects. However, the clusters found by different algorithms vary significantly in their properties, and understanding these cluster models is key to understanding the differences between the various algorithms. Typical cluster models include:

 Connectivity models: for example hierarchical clustering builds models based on distance connectivity.

 Centroid models: for example the K-means algorithm represents each cluster by a single mean vector.

 Distribution models: clusters are modeled using statistical distributions, such as the multivariate normal distributions used by the Expectation-Maximization algorithm.

 Density models: for example DBSCAN and OPTICS define clusters as connected dense regions in the data space.

 Subspace models: in Biclustering, clusters are modeled with both cluster members and relevant attributes.

 Group models: some algorithms do not provide a refined model for their results and just provide the grouping information.

A clustering is essentially a set of such clusters, usually containing all objects in the data set. Additionally, it may specify the relationship of the clusters to each other, for example a hierarchy of clusters embedded in each other. Clusterings can be roughly distinguished into hard clustering, soft clustering, strict partitioning clustering, strict partitioning clustering with outliers, overlapping clustering, hierarchical clustering and subspace clustering.

2. Related work

Large-scale graph computing. To meet the prohibitive requirements of processing large-scale graphs, many distributed methods and frameworks have been proposed and have become appealing. Pregel [8] and GraphLab both use a vertex-centric computing model and run a user-defined program at each worker node in parallel. Giraph [2] is an open source project which adopts Pregel's programming model and adapts it to HDFS. In these parallel graph processing systems, it is important to partition a large graph into several balanced subgraphs so that parallel workers can process them in a coordinated fashion. However, most current systems simply use hash-based partitioning.

Graph partition. Graph partitioning is a combinatorial optimization problem which has been studied for decades and is widely used in many fields, such as parallel subgraph listing [4]. The widely used objective function, k-balanced graph partitioning, aims to minimize the number of edges cut between partitions while balancing the number of vertices. Although the k-balanced graph partitioning problem is NP-complete, several solutions have been proposed to tackle this challenge.
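A standard formalization of this objective reads as follows (the notation here is supplied for illustration and is not taken from the cited works):

```latex
\min_{\pi:\,V\to\{1,\dots,k\}} \;\bigl|\{(u,v)\in E \,:\, \pi(u)\neq\pi(v)\}\bigr|
\quad \text{subject to} \quad |V_i| \le (1+\varepsilon)\,\frac{|V|}{k}, \quad i=1,\dots,k,
```

where \(V_i=\pi^{-1}(i)\) is the i-th partition and \(\varepsilon\ge 0\) is the allowed imbalance.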

Andreev and Räcke presented a bicriteria approximation algorithm which guarantees polynomial running time with an approximation ratio of O(log n). Another solution was proposed by Even et al. Besides approximation algorithms, Karypis and Kumar proposed a parallel multi-level graph partitioning algorithm that minimizes the bisection on each level. There are also heuristic implementations such as METIS and the parallel and multiple-constraint versions of METIS, which are widely used in many existing systems. Pellegrini and Roman proposed Scotch, which takes the network topology into account. Although these heuristics cannot provide precise performance guarantees, they are quite effective. More heuristic approaches are summarized in [3].

Streaming partitioning algorithms. The methods mentioned above are offline and expensive to run. Stanton and Kliot [7] proposed a series of online streaming partitioning methods using heuristics. Tsourakakis et al. [9] extended this work by proposing a streaming partitioning framework which combines other heuristic methods. Tsourakakis [1] used higher-length walks to improve the quality of the graph partition. Nishimura and Ugander further proposed Restreaming LDG and Restreaming Fennel, which generate an initial graph partitioning from the last streaming partitioning result. LogGP [5] uses hypergraphs to optimize the initial partitioning result. Although there is no mathematical proof, experiments show that these one-pass streaming partitioning algorithms achieve performance comparable to multi-level ones with much shorter partitioning time. They also adapt to dynamic graphs, where offline methods become inefficient due to the expensive computational cost of repartitioning the graph.

Most offline methods, in contrast, are based on multi-level algorithms that are not fast enough for current large-scale graph scenarios.
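To illustrate the one-pass streaming idea, the following is a minimal Python sketch in the spirit of the linear deterministic greedy (LDG) heuristic of Stanton and Kliot [7]; the function name, the capacity formula and the tie-breaking rule are our assumptions, not code from the cited systems.

```python
def ldg_partition(stream, k, num_vertices):
    """One-pass linear deterministic greedy (LDG) partitioning sketch.

    `stream` yields (vertex, neighbor_list) pairs in arrival order;
    a plain hash baseline would instead use hash(vertex) % k.
    """
    capacity = num_vertices / k                 # balanced-size target per part
    parts = [set() for _ in range(k)]
    assignment = {}
    for v, neighbors in stream:
        nbrs = set(neighbors)
        # reward parts that already hold v's neighbors, damped by fullness
        scores = [len(parts[i] & nbrs) * (1.0 - len(parts[i]) / capacity)
                  for i in range(k)]
        # break ties toward the least loaded part
        best = max(range(k), key=lambda i: (scores[i], -len(parts[i])))
        parts[best].add(v)
        assignment[v] = best
    return assignment
```

Unlike hash partitioning, each vertex is placed with the neighbors it has already seen, which tends to reduce the edge cut at essentially the same streaming cost.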

3. Clustering Algorithms

Clustering algorithms can be categorized based on their cluster model, as listed above. The following overview lists only the most prominent examples, as there are many dozens of published clustering algorithms. Not all of them provide models for their clusters and thus cannot easily be categorized.

3.1. Connectivity Based Clustering

Connectivity based clustering, also known as hierarchical clustering, is based on the core idea of objects being more related to nearby objects than to objects farther away. As such, these algorithms connect "objects" to form "clusters" based on their distance. A cluster can be described largely by the maximum distance needed to connect parts of the cluster. At different distances, different clusters will form, which can be represented using a dendrogram; this explains where the common name "hierarchical clustering" comes from: these algorithms do not provide a single partitioning of the data set, but instead an extensive hierarchy of clusters that merge with each other at certain distances. In a dendrogram, the y-axis marks the distance at which the clusters merge, while the objects are placed along the x-axis such that the clusters don't mix.

Connectivity based clustering is a whole family of methods that differ in the way distances are computed. Apart from the usual choice of distance functions, the user also needs to decide on the linkage criterion to use. Popular choices are known as single-linkage clustering, complete-linkage clustering, and UPGMA (Unweighted Pair Group Method with Arithmetic Mean). Furthermore, hierarchical clustering can be computed agglomeratively or divisively.
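Concretely, a single-linkage clustering and a cut through the resulting dendrogram can be sketched with SciPy as follows (the data, the linkage method and the cut threshold are illustrative choices, not values from the paper):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (20, 2)),   # two well-separated groups
               rng.normal(3, 0.3, (20, 2))])

Z = linkage(X, method='single')   # merge tree; 'complete' gives complete-linkage
labels = fcluster(Z, t=1.0, criterion='distance')  # cut the dendrogram at distance 1.0
```

Cutting at a different distance t yields a coarser or finer partitioning from the same hierarchy, which is exactly why these methods return a hierarchy rather than a single clustering.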

While these methods are fairly easy to understand, the results are not always easy to use, as they do not produce a unique partitioning of the data set, but a hierarchy from which the user still needs to choose appropriate clusters. The methods are not very robust towards outliers, which will either show up as additional clusters or even cause other clusters to merge. In the general case, the complexity is O(n³), which makes these methods too slow for large data sets. For some special cases, optimal efficient methods are known: SLINK for single-linkage and CLINK for complete-linkage clustering. In the data mining community these methods are recognized as a theoretical foundation of cluster analysis, but are often considered obsolete. They did, however, provide inspiration for many later methods such as density-based clustering.

3.2. Centroid-based clustering

In centroid-based clustering, clusters are represented by a central vector, which need not be a member of the data set. When the number of clusters is fixed to k, K-means clustering gives a formal definition as an optimization problem: find the k cluster centers and assign the objects to the nearest cluster center, such that the squared distances from the cluster centers are minimized. The optimization problem itself is known to be NP-hard, and thus the common approach is to search only for approximate solutions. A particularly well known approximate method is Lloyd's algorithm, often simply referred to as the "K-means algorithm". It does, however, only find a local optimum, and is commonly run multiple times with different random initializations. Variations of K-means include optimizations such as choosing the best of multiple runs, restricting the centroids to members of the data set (k-medoids), choosing medians (k-medians clustering), choosing the initial centers less randomly (k-means++), or allowing a fuzzy cluster assignment (fuzzy c-means).
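The Lloyd iteration itself is short; the following is a minimal NumPy sketch (the initialization and stopping rule are simplified, and in practice the run is repeated with several seeds):

```python
import numpy as np

def lloyd_kmeans(X, k, iters=100, seed=0):
    """Plain Lloyd iteration: assign each point to its nearest center, re-average."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)].astype(float)  # random init
    for _ in range(iters):
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)  # squared distances
        labels = d2.argmin(axis=1)                                 # nearest center
        new = np.array([X[labels == j].mean(axis=0) if (labels == j).any()
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):   # converged to a local optimum
            break
        centers = new
    return centers, labels
```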

Most K-means-type algorithms require the number of clusters, k, to be specified in advance, which is considered one of the biggest drawbacks of these algorithms. Furthermore, the algorithms prefer clusters of approximately similar size, as they will always assign an object to the nearest centroid. This often leads to incorrectly cut borders between clusters.

K-means has a number of interesting theoretical properties. On one hand, it partitions the data space into a structure known as a Voronoi diagram. On the other hand, it is conceptually close to nearest-neighbor classification, and as such is popular in machine learning. Third, it can be seen as a variation of model-based classification, and Lloyd's algorithm as a variation of the Expectation-Maximization algorithm for this model.

3.3. Distribution-based clustering

The clustering model most closely related to statistics is based on distribution models. Clusters can then easily be defined as objects most likely belonging to the same distribution. A convenient property of this approach is that it closely resembles the way artificial data sets are generated: by sampling random objects from a distribution. While the theoretical foundation of these methods is excellent, they suffer from one key problem known as overfitting, unless constraints are put on the model complexity. A more complex model will usually be able to explain the data better, which makes choosing the appropriate model complexity inherently difficult.

The most prominent method is known as the expectation-maximization (EM) algorithm. Here, the data set is usually modeled with a fixed number of Gaussian distributions that are initialized randomly and whose parameters are iteratively optimized to better fit the data set. This converges to a local optimum, so multiple runs may produce different results. To obtain a hard clustering, objects are then assigned to the Gaussian distribution they most likely belong to; for soft clusterings this is not necessary.
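In practice this amounts to fitting a Gaussian mixture; a minimal sketch with scikit-learn (the data and parameter values are illustrative):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)),    # two overlapping Gaussian blobs
               rng.normal(4, 1, (100, 2))])

gm = GaussianMixture(n_components=2, n_init=5, random_state=0).fit(X)
hard = gm.predict(X)         # hard clustering: most likely component per object
soft = gm.predict_proba(X)   # soft clustering: per-component responsibilities
```

Here n_init=5 re-runs the random initialization several times and keeps the best local optimum, matching the multiple-runs advice above.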

Distribution-based clustering is a semantically strong method, as it not only provides clusters, but also produces complex models for the clusters that can capture the correlation and dependence of attributes. However, using these algorithms puts an extra burden on the user: an appropriate data model must be chosen to optimize, and for many real data sets there may be no mathematical model available that the algorithm is able to optimize.

3.4. Density-based clustering

In density-based clustering, clusters are defined as areas of higher density than the remainder of the data set. Objects in the sparse areas that are required to separate clusters are usually considered to be noise or border points. The most popular density-based clustering method is DBSCAN. In contrast to many newer methods, it features a well-defined cluster model called "density-reachability". Similar to linkage-based clustering, it is based on connecting points within certain distance thresholds. However, it only connects points that satisfy a density criterion, in the original variant defined as a minimum number of other objects within this radius. A cluster consists of all density-connected objects plus all objects that are within these objects' range. Another interesting property of DBSCAN is that its complexity is fairly low (it requires a linear number of range queries on the database) and that it discovers essentially the same results in each run, so there is no need to run it multiple times. OPTICS is a generalization of DBSCAN that removes the need to choose an appropriate value for the range parameter ε and produces a hierarchical result related to that of linkage clustering. DeLi-Clu (Density-Link-Clustering) combines ideas from single-linkage clustering and OPTICS, eliminating the ε parameter entirely and offering performance improvements over OPTICS by using an R-tree index.
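A minimal DBSCAN run with scikit-learn looks as follows (eps and min_samples are illustrative values that must be tuned per data set):

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.2, (50, 2)),   # two dense regions
               rng.normal(2, 0.2, (50, 2)),
               rng.uniform(-1, 3, (10, 2))])  # sparse noise between them

labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)
# label -1 marks noise; the other labels are density-connected clusters
```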

The key drawback of DBSCAN and OPTICS is that they expect some kind of density drop to detect cluster borders. On data sets consisting, for example, of mixtures of Gaussians, they will nearly every time be outperformed by methods such as EM clustering that are able to precisely model this kind of data.

4. General Decentralized Clustering Algorithm

Clustering, or unsupervised learning, is important for analyzing large data sets. Clustering partitions data into groups of similar objects with high intra-cluster similarity and low inter-cluster similarity. With the growth of large-scale distributed systems, huge amounts of data increasingly originate from dispersed sources. Analyzing this data using centralized processing is often infeasible due to communication, storage and computation overheads. Distributed data mining (DDM) focuses on the adaptation of data-mining algorithms for distributed computing environments and aims to derive a global model which presents the characteristics of a data set distributed across many nodes.

In fully distributed clustering algorithms, the data set as a whole remains dispersed, and the participating distributed processes gradually discover the various clusters. Communication complexity and overhead, the accuracy (AC) of the derived model, and data privacy are among the concerns of DDM. Typical applications requiring distributed clustering include: clustering different media metadata from different machines; clustering nodes' activity history data; clustering books in a distributed network of libraries; and clustering scientific achievements from different institutions and publishers.

A common approach in distributed clustering is to combine and merge local representations at a central node, or to aggregate local models in a hierarchical structure. Some recent proposals, although completely decentralized, include synchronization at the end of each round and/or require nodes to maintain a history of the clustering.

In this paper, a general distributed clustering algorithm (GDCluster) is proposed and instantiated with two popular classes of clustering methods, partition-based and density-based. We first introduce a basic method in which nodes gradually build a summarized view of the data set by continuously exchanging information on data items and data representatives using gossip-based communication. Gossiping is used as a simple, robust and efficient dissemination technique which assumes no predefined structure in the network. The summarized view is the basis for executing weighted versions of the clustering algorithms to produce approximations of the final clustering results.

GDCluster can cluster a data set which is dispersed among a large number of nodes in a distributed environment. It can handle two classes of clustering, namely partition-based and density-based, while being fully decentralized, asynchronous, and adaptable to churn. The general design principles employed in the proposed algorithm also allow customization for other classes of clustering, which are left out of the current paper. We also discuss enhancements to the algorithm particularly aimed at reducing communication costs.

The simulation results, obtained on real and synthetic data sets, show that GDCluster is able to achieve a high-quality global clustering solution which approximates centralized clustering. We also explain the effects of various parameters on the accuracy and overhead of the algorithm. We compare our proposal with central clustering and with the LSP2P algorithm, and show its superiority in achieving higher-quality clusters. The main contributions of this paper are as follows:

 Proposing a new fully distributed clustering algorithm that can be instantiated for at least two categories of clustering algorithms.

 Dealing with dynamic data and evolving the clustering model.

 Empowering nodes to construct a summarized view of the global data set.

5. Problem Statement

Huge volumes of data values are distributed across multiple systems in a peer-to-peer environment, where processing, storage and transmission costs are the key issues of distributed data processing. The General Decentralized Clustering (GDCluster) algorithm is applied to perform clustering on dynamic and distributed data sets. A summarized view of the data set is used in the clustering process. GDCluster is customized to execute partition-based and density-based clustering methods on the summarized views. The clustering model is tuned to adapt to dynamic data values. The GDCluster model supports both partition-based and density-based clustering tasks. A weighted K-means algorithm is tuned to perform partition-based clustering in the distributed environment. The following problems are identified in the existing system.

• The GDCluster method is not customized for all cluster types
• Summarized view construction is not optimized
• Cluster accuracy levels are limited
• Communication and computational complexity is high

6. Dynamic Data Clustering Scheme

The General Decentralized Clustering (GDCluster) scheme is enhanced to support hierarchical and grid-based clustering methods. Summarized view construction is tuned for hierarchical and grid data models. Priority factors are adapted for the relationship identification process. An age-based data and representative elimination process is integrated into the system.

The GDCluster scheme is designed to support hierarchical and grid-based clustering. The weight estimation process is enhanced with priority values, and the summarized view is constructed with hierarchical properties. The system is divided into four major modules: Data Preprocess, Summarized View Construction, Partition-based Clustering Process, and Hierarchical Clustering Process.

The data preprocess module performs data cleaning. The summarized view construction process groups similar data values. The partition-based clustering process performs clustering with weight values. Hierarchical and grid-based clusters are constructed with hierarchical relationships.

6.1. Data Preprocess

Data cleaning is performed in the data preprocess module. Data values are parsed and updated into the database. Redundant data values are removed from the data set, and missing values are filled with suitable estimates.
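A minimal sketch of this step with pandas (the file name and the mean-imputation choice are assumptions for illustration, not prescribed by the paper):

```python
import pandas as pd

df = pd.read_csv('transactions.csv')   # hypothetical input data set
df = df.drop_duplicates()              # remove redundant data values
num_cols = df.select_dtypes('number').columns
df[num_cols] = df[num_cols].fillna(df[num_cols].mean())  # fill missing values
df.to_csv('transactions_clean.csv', index=False)
```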

6.2. Summarized View Construction

Summarized views are constructed by continuously exchanging information on data items and data representatives. Data transmission is carried out using gossip-based communication; gossiping is a simple, robust and efficient dissemination technique. The summarized views are then used in the weighted clustering process, as sketched below.
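The paper gives no pseudocode for the exchange, so the following is only a toy sketch of a push-pull gossip round over summarized views; the class layout, the weight merging and the size cap are our assumptions:

```python
import random

class Node:
    """Toy gossip node; 'summary' maps a data representative to its weight."""
    def __init__(self, node_id, data):
        self.id = node_id
        self.summary = {rep: 1 for rep in data}   # each item starts as its own representative

    def gossip_round(self, peers, fanout=1, cap=50):
        for peer in random.sample(peers, fanout):  # random peers: no fixed topology
            merged = dict(self.summary)
            for rep, w in peer.summary.items():
                merged[rep] = merged.get(rep, 0) + w
            # keep only the heaviest representatives to bound the summarized view
            top = dict(sorted(merged.items(), key=lambda kv: -kv[1])[:cap])
            self.summary = dict(top)
            peer.summary = dict(top)               # push-pull: both sides converge
```

Capping the view size is what keeps the per-node storage and transmission cost bounded as the data set grows.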

6.3. Partition based Clustering Process

Partition-based clustering is carried out in the distributed environment using the General Decentralized Clustering (GDCluster) scheme. Communication in the system is decentralized and asynchronous. The weighted K-means clustering algorithm is used, as sketched below.
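A minimal sketch of weighted K-means over the summarized view (representatives plus weights); the centroid update is the standard weighted mean, while the function name and defaults are ours:

```python
import numpy as np

def weighted_kmeans(reps, weights, k, iters=50, seed=0):
    """K-means on representatives, where weights bias the centroid updates."""
    rng = np.random.default_rng(seed)
    centers = reps[rng.choice(len(reps), k, replace=False)].astype(float)
    for _ in range(iters):
        d2 = ((reps[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d2.argmin(axis=1)               # nearest center per representative
        for j in range(k):
            mask = labels == j
            if mask.any():
                w = weights[mask]
                centers[j] = (w[:, None] * reps[mask]).sum(0) / w.sum()
    return centers, labels
```

Each representative thus pulls its centroid in proportion to how many raw data items it stands for, which is what lets the summarized view approximate centralized K-means.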

6.4. Hierarchical Clustering Process

The GDCluster scheme is enhanced with hierarchical and grid clustering support. Summarized view construction is improved to handle hierarchical and grid-based data values. Data representatives are organized in a hierarchical manner, and statistical operations are applied to the approximated grid cells.
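A toy sketch of such a grid-based summary (the cell size, key layout and running-mean update are illustrative assumptions):

```python
import numpy as np
from collections import defaultdict

def grid_summary(points, cell_size):
    """Collapse points into grid cells, keeping (count, mean) per cell."""
    cells = defaultdict(lambda: [0, None])
    for p in points:
        key = tuple(np.floor(p / cell_size).astype(int))  # integer cell coordinates
        cnt, mean = cells[key]
        # incremental mean: new_mean = mean + (p - mean) / (count + 1)
        mean = p.astype(float).copy() if mean is None else mean + (p - mean) / (cnt + 1)
        cells[key] = [cnt + 1, mean]
    return {key: (cnt, mean) for key, (cnt, mean) in cells.items()}
```

Cells whose counts exceed a density threshold can then seed grid or density clusters, with the per-cell statistics standing in for the raw points.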

7. Conclusion

Clustering techniques are applied to group data values based on relevance. The General Decentralized Clustering (GDCluster) scheme is used for the distributed data clustering process. Partition-based and density-based clustering operations are supported by the GDCluster scheme, which is enhanced here to also support hierarchical and grid-based clustering. Hierarchical and grid-based clustering operations are carried out on dynamic and distributed data sets. Summarized views are constructed with hierarchical data relationships. The system reduces transmission and computational overhead and supports high scalability.

REFERENCES

[1] C. E. Tsourakakis, "Streaming Graph Partitioning in the Planted Partition Model," CoRR abs/1406.7570, 2014.

[2] Apache Giraph. [Online]. Available: https://github.com/apache/giraph/, 2014.

[3] Graph Archive Dataset. [Online]. Available: http://staffweb.cms.gre.ac.uk/wc06/partition/, 2014.

[4] Y. Shao, B. Cui, L. Chen, L. Ma, J. Yao and N. Xu, "Parallel Subgraph Listing in a Large-Scale Graph," in Proc. SIGMOD, 2014.

[5] N. Xu, L. Chen and B. Cui, "LogGP: A Log-Based Dynamic Graph Partitioning Method," in Proc. VLDB, 2014.

[6] N. Xu, B. Cui, L. Chen, Z. Huang and Y. Shao, "Heterogeneous Environment Aware Streaming Graph Partitioning," IEEE Transactions on Knowledge and Data Engineering, vol. 27, no. 6, June 2015.

[7] I. Stanton and G. Kliot, "Streaming Graph Partitioning for Large Distributed Graphs," in Proc. KDD, 2012, pp. 1222–1230.

[8] G. Malewicz, M. H. Austern, A. J. C. Bik, J. C. Dehnert, I. Horn, N. Leiser and G. Czajkowski, "Pregel: A System for Large-Scale Graph Processing," in Proc. SIGMOD, 2010, pp. 135–146.

[9] C. E. Tsourakakis, C. Gkantsidis, B. Radunovic and M. Vojnovic, "Fennel: Streaming Graph Partitioning for Massive Scale Graphs," Tech. Rep. MSR-TR-2012-113, Microsoft, 2012.
