EMERGING CLUSTERING TECHNIQUES ON BIG DATA

Pooja Batra Nagpal1, Sarika Chaudhary2, Preetishree Patnaik3

1,2,3 Computer Science, Amity University, India

ABSTRACT

The term "Big Data" refers to enormous data sets with a large, diverse and complex structure that makes them difficult to store, analyze, search and visualize. The process of examining such massive data sets to reveal hidden patterns and correlations is called "Big Data Mining"; it applies the principles of Data Mining to discover hidden and relevant information. Clustering algorithms have emerged as a powerful alternative meta-learning tool for analyzing the massive volumes of data (Big Data) generated by many applications. Big Data is difficult to categorize, so the most suitable clustering algorithm must be chosen to group it. In this paper we explain the challenges that Big Data faces today and survey the clustering techniques used in Big Data analytics.

KEYWORDS - BIG DATA, CLUSTERING, DATA MINING, CLUSTERING ALGORITHMS.

1. INTRODUCTION

The term Big Data encompasses all forms of data, including Web logs, data from social networking sites, sensor data, tweets, blogs, user reviews, and SMS messages. Big Data and Big Data analytics are active topics of current study in information technology and business intelligence. These data are generated by social networking sites such as Facebook and Twitter, online transactions, emails, videos, audio, images, click streams, logs, posts, search queries, health records, scientific data, sensors, and smartphones and their applications [1]. Because the data arrive in different formats, they are difficult to store in databases, analyze, and visualize with typical database software tools.

Today we live in an era of digitalization marked by rapid progress in technology, web media, social networking sites, Internet-based services, smartphones and so on, in which every user accesses massive quantities of data from various sources. Such enormous data sets, with their massive, diverse and complex structure, are termed “Big Data”. This volume of data can be useful to users in many ways, but it also creates considerable difficulty in storage, analysis, search and visualization. Big Data therefore needs to be stored in an effective and efficient manner that supports various kinds of operations (e.g. analytical operations, process operations, retrieval, and data reliability). It is thus important to organize these massive data sets into hidden correlation, pattern or cluster models that make them easier to use, by applying various clustering techniques and data mining methods.

2. CATEGORIZATION OF BIG DATA

Although the huge volume of data (Big Data) can be genuinely useful to users, it also creates many problems in storage and analysis. Big Data needs large storage, and its volume makes analytical, process and retrieval operations very difficult and hugely time consuming. One way to overcome these problems is to cluster Big Data into a compact format that is still an informative version of the entire data. Such clustering techniques aim to produce good-quality clusters or summaries. They would therefore hugely benefit everyone from ordinary users to researchers and people in the corporate world, as they provide an efficient tool for dealing with large data, for example in critical systems (to detect cyber attacks). The figure below depicts the categorization of Big Data. [1][2]

2.1 VARIETY

Big Data comes from a great range of sources, and the data fall into three types: structured, semi-structured and unstructured.

2.1.1 STRUCTURED DATA

Structured data are organized in a manner that is easily sorted and stored in a database. This variety of data includes abstract data types, web links, pointers, etc.

2.1.2 UNSTRUCTURED DATA

2.1.3 SEMI-STRUCTURED DATA

Semi-structured data combine structured and unstructured data and do not conform to a fixed set of tags or other semantic structure. [4]

2.2 VOLUME

The volume, or size, of the data now exceeds terabytes and petabytes. The grand scale and growth of data outstrip traditional storage and analysis techniques. Because Big Data is so massive, designing large databases for its effective storage and visualization is one of the biggest challenges for data scientists. [1][4]

2.3 VELOCITY

Velocity refers to the rate at which data are generated and used, which is very high for Big Data. Velocity is a necessary parameter not only for Big Data but for all processes. For time-limited processes, Big Data should be used as it streams into the organization in order to maximize its value. [1][4]

2.4 VERACITY

Veracity concerns the uncertainty of data caused by inconsistency, ambiguity and latency.

FIGURE 2. THE FOUR V’S OF BIG DATA [1][2].

3. TAXONOMY OF CLUSTERING

cluster formation and visualization vary from one algorithm to another according to the properties of each algorithm. Despite the huge number of surveys of clustering algorithms available for various domains (e.g. machine learning, information retrieval, pattern recognition, bioinformatics and semantic medical sciences), it is difficult for users to decide which algorithm is appropriate for analyzing massive data sets. We therefore adopt a taxonomy of clustering algorithms and propose this classification as a framework that covers the major factors in selecting suitable algorithms for massive data sets. Clustering algorithms are broadly classified into four categories, as follows.

3.1. Partitioning based clustering

In this type of algorithm, all clusters are determined at once. Initial groups are specified and objects are iteratively reallocated until the groups converge. In other words, partitioning algorithms divide the data objects into a number of partitions, where each partition represents a cluster. These clusters must fulfill two requirements: each group must contain at least one object, and each object must belong to exactly one group. There are many partitioning algorithms, such as K-modes, PAM, CLARA, CLARANS and FCM.
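The reallocation loop described above can be sketched with k-means, which is used here only as a familiar example of a partitioning method; the section does not single it out. The function name `kmeans`, the use of the first k points as initial centers, and the iteration cap are assumptions made for illustration.

```python
import math


def kmeans(points, k, iterations=20):
    """Plain k-means: every point ends up in exactly one of k groups."""
    # Assumption: the first k points serve as the initial group centers.
    centers = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Reallocation step: assign each point to its nearest center.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == c]
            if members:  # keep the old center if a group became empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
        if new_labels == labels:
            break  # assignments no longer change: converged
        labels = new_labels
    return labels, centers


# Hypothetical toy data: two well-separated groups in the plane.
points = [(1.0, 1.0), (8.0, 8.0), (1.2, 0.8), (0.9, 1.1), (8.2, 7.9), (7.8, 8.1)]
labels, centers = kmeans(points, k=2)
```

Each pass costs one distance computation per point per center, which is why partitioning methods are often considered for large data sets; the result still depends on the initial centers.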

3.2. Hierarchical-based clustering

This type of clustering method is also known as connectivity-based clustering. Data are organized hierarchically according to a measure of proximity. Proximities are obtained from the intermediate nodes. A dendrogram (from the Greek for a tree structure) represents the dataset, with individual data objects as leaf nodes. The initial cluster gradually divides into several clusters as the hierarchy continues. Hierarchical clustering methods are of two types: a) agglomerative (bottom-up) and b) divisive (top-down). Agglomerative clustering starts with one object per cluster and recursively merges the two or more most appropriate clusters. Divisive clustering starts with the whole dataset as one cluster and recursively splits the most appropriate cluster. The process continues until a stopping criterion is reached (frequently, the requested number k of clusters).
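As a minimal sketch of the agglomerative (bottom-up) variant, the code below starts with one cluster per object and repeatedly merges the two closest clusters until k remain. The use of single linkage (closest pair of members) as the proximity measure and the name `agglomerative` are assumptions for illustration; other linkage criteria would work the same way.

```python
import math


def agglomerative(points, k):
    """Single-linkage agglomerative clustering down to k clusters."""
    # Start with one cluster per object (bottom-up).
    clusters = [[i] for i in range(len(points))]

    def linkage(a, b):
        # Single linkage: distance between the closest pair of members.
        return min(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > k:
        # Find the two closest clusters and merge them.
        i, j = min(
            (
                (i, j)
                for i in range(len(clusters))
                for j in range(i + 1, len(clusters))
            ),
            key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return clusters


# Hypothetical toy data: indices 0-1 and 2-3 are close pairs, 4 is isolated.
points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
clusters = agglomerative(points, k=3)
```

Recording the sequence of merges instead of stopping at k would give the dendrogram; this naive version recomputes all pairwise linkages at every step and is meant only to show the control flow.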

3.3. Density-based clustering

In density-based clustering, clusters are defined as areas of higher density than the remainder of the data set. Objects in these sparse areas that are required to separate clusters are usually considered to be noise and border points.

discovering clusters of arbitrary shapes. This also provides natural protection against outliers. The overall density at a point is thus analyzed to determine which functions of the dataset influence that particular data point. DBSCAN, OPTICS, DBCLASD and DENCLUE are algorithms that use such a method to filter out noise and discover clusters of arbitrary shape.

3.4. Grid-based clustering

The space of the data objects is divided into grids (cells). The main advantage of this approach is its fast processing time, because it goes through the dataset once to compute the statistical values for the grids. The accumulated grid data make grid-based clustering techniques independent of the number of data objects: they employ a uniform grid to collect regional statistical data and then perform the clustering on the grid instead of on the database directly. The performance of a grid-based method depends on the size of the grid, which is usually much smaller than the size of the database. However, for highly irregular data distributions, a single uniform grid may not be sufficient to obtain the required clustering quality or to fulfill the time requirement. Wave-Cluster and STING are typical examples of this category.
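The single pass that collects per-cell statistics can be sketched as follows. This is not Wave-Cluster or STING; it only illustrates the shared idea of a uniform grid. The names `grid_statistics` and `dense_cells`, the cell size and the density threshold `min_count` are assumptions for illustration.

```python
from collections import defaultdict


def grid_statistics(points, cell_size):
    """One pass over the data: per-cell point count and coordinate sums."""
    cells = defaultdict(lambda: {"count": 0, "sum": None})
    for p in points:
        # Map the point to the integer index of the cell that contains it.
        key = tuple(int(x // cell_size) for x in p)
        cell = cells[key]
        cell["count"] += 1
        if cell["sum"] is None:
            cell["sum"] = list(p)
        else:
            cell["sum"] = [s + x for s, x in zip(cell["sum"], p)]
    return dict(cells)


def dense_cells(cells, min_count):
    """Cells whose point count reaches the (assumed) density threshold."""
    return sorted(key for key, stats in cells.items() if stats["count"] >= min_count)


# Hypothetical toy data on a grid with unit cells.
points = [(0.1, 0.2), (0.4, 0.9), (0.8, 0.3), (5.5, 5.1), (5.9, 5.7), (9.0, 1.0)]
cells = grid_statistics(points, cell_size=1.0)
dense = dense_cells(cells, min_count=2)
```

Any later clustering step would work on `cells`, whose size depends on the number of occupied cells rather than on the number of data objects.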

The following section explains in detail the criteria used to compare clustering methods for Big Data, one criterion for each property of Big Data [13].

4. SELECTION CRITERIA

4.1. TYPES OF DATASET

Most traditional clustering algorithms are designed to work either on numeric data or on categorical data. Data collected in the real world, however, often contain both numeric and categorical attributes, which makes it difficult to apply traditional clustering algorithms directly to them. Clustering algorithms work effectively on purely numeric or purely categorical data; most of them perform poorly on mixed categorical and numerical data types.

4.2. SIZE OF DATASET

The size of the dataset has a major effect on clustering quality. Some clustering methods are more efficient than others when the data size is small, and vice versa.

4.3. INPUT PARAMETER

A desirable feature of a “practical” clustering algorithm is that it has few parameters, since a large number of parameters may affect cluster quality, which then depends on the values chosen for them.

4.4. HANDLING OUTLIERS/NOISY DATA

A successful algorithm will often be able to handle outliers and noisy data, because the data in most real applications are not pure. Noise also makes it difficult for an algorithm to assign an object to a suitable cluster, and it therefore affects the results provided by the algorithm.

4.5. TIME COMPLEXITY

Most clustering methods must be run several times to improve clustering quality. If a single run takes too long, the method can therefore become impractical for applications that handle Big Data.

4.6. STABILITY

One important feature of any clustering algorithm is the ability to generate the same partition of the data irrespective of the order in which the patterns are presented to the algorithm.

4.7. HANDLING HIGH DIMENSIONALITY

This is a particularly important feature in cluster analysis because many applications require the analysis of objects containing a large number of features (dimensions). For example, text documents may contain thousands of terms or keywords as features. High dimensionality is challenging because of the curse of dimensionality. Many dimensions may not be relevant. As the number of dimensions increases, the data become increasingly sparse, so that the distance measurement between pairs of points becomes meaningless and the average density of points anywhere in the data is likely to be low.
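The loss of meaning in distance measurements can be illustrated numerically. The sketch below draws uniformly random points (an assumption; real data are rarely uniform) and measures how much farther the farthest point is from a query than the nearest one. The name `relative_contrast` and the fixed random seed are illustrative choices.

```python
import math
import random


def relative_contrast(dim, n_points=200, seed=0):
    """(farthest - nearest) / nearest distance from a random query point."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    query = [rng.random() for _ in range(dim)]
    dists = [
        math.dist(query, [rng.random() for _ in range(dim)])
        for _ in range(n_points)
    ]
    return (max(dists) - min(dists)) / min(dists)


# Compare a low-dimensional and a high-dimensional setting.
low = relative_contrast(dim=2)
high = relative_contrast(dim=200)
```

With these settings `low` comes out much larger than `high`: in many dimensions the nearest and farthest points lie at almost the same distance, so a distance-based notion of "close" separates points poorly.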

4.8. CLUSTER SHAPE

A good clustering algorithm should be able to handle real data and their wide variety of data types, which produce clusters of arbitrary shape.

5. CLUSTERING ALGORITHMS

The section below discusses each of the selected algorithms in detail, with its pseudo-code, a survey analysis, and its strengths and weaknesses.

5.1 BIRCH

BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) is a data clustering method and an example of hierarchical clustering.

The algorithm builds a dendrogram called the CF-tree (clustering feature tree). Steps of the BIRCH algorithm:

each level is identified, and a test is then performed on the candidate datasets. BIRCH can typically discover a good clustering with a single scan of the dataset and improve the quality further with a few additional scans. It can also handle noise effectively. However, BIRCH may not work well when clusters are not spherical, because it uses the concept of radius or diameter to control the boundary of a cluster. In addition, it is order-sensitive and may generate different clusters for different orders of the same input data. The details of the algorithm are given in the figure below.

FIGURE [3]: BIRCH ALGORITHM PSEUDO-CODE. [13]
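The clustering feature stored in each CF-tree entry, and the radius test that bounds a cluster, can be sketched as follows. This is a minimal illustration of the standard CF triple (number of points, linear sum, sum of squares), not the full tree construction; the class name, the helper `can_absorb` and the threshold value are assumptions for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class ClusteringFeature:
    n: int      # number of points summarized
    ls: list    # linear sum of the points, per dimension
    ss: float   # sum of squared norms of the points

    @classmethod
    def from_point(cls, p):
        return cls(1, list(p), sum(x * x for x in p))

    def merge(self, other):
        # CF additivity: merging two sub-clusters adds their summaries.
        return ClusteringFeature(
            self.n + other.n,
            [a + b for a, b in zip(self.ls, other.ls)],
            self.ss + other.ss,
        )

    def centroid(self):
        return [x / self.n for x in self.ls]

    def radius(self):
        # Root-mean-square distance of the members from the centroid.
        c2 = sum(x * x for x in self.centroid())
        return math.sqrt(max(self.ss / self.n - c2, 0.0))


def can_absorb(cf, point, threshold):
    """True if adding the point keeps the radius within the threshold."""
    return cf.merge(ClusteringFeature.from_point(point)).radius() <= threshold


# Hypothetical usage: summarize two points, then test two candidates.
cf = ClusteringFeature.from_point((0.0, 0.0)).merge(
    ClusteringFeature.from_point((2.0, 0.0))
)
near_ok = can_absorb(cf, (1.0, 0.0), threshold=1.5)
far_ok = can_absorb(cf, (10.0, 0.0), threshold=1.5)
```

Because the summary is updated without revisiting the raw points, a single scan suffices to build it; the radius bound is also what biases the method towards spherical clusters.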

5.2 DBSCAN

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based clustering method. The algorithm grows regions of sufficiently high density into clusters and can discover clusters of arbitrary shape in spatial databases with noise. The main idea of density-based clustering is that, for each object of a cluster, the neighborhood of a given radius (Eps) has to contain at least a minimum number of objects (MinPts); that is, the cardinality of the neighborhood must exceed a threshold.
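The Eps and MinPts test can be sketched as a small, unoptimized DBSCAN. The parameter names `eps` and `min_pts` mirror Eps and MinPts above; the brute-force neighbor search and the convention that a point counts as its own neighbor are simplifying assumptions, and a practical implementation would use a spatial index.

```python
import math

NOISE = -1


def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or NOISE."""
    labels = [None] * len(points)

    def neighbors(i):
        # Brute-force Eps-neighborhood; includes the point itself.
        return [
            j for j in range(len(points))
            if math.dist(points[i], points[j]) <= eps
        ]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins but does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:  # j is a core point: keep expanding
                queue.extend(nbrs)
    return labels


# Hypothetical toy data: two small squares and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
```

The number of clusters is not an input: it follows from the density parameters, and points that no dense region reaches stay labelled as noise.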

point within its neighborhood. Clusters in this method are formed around density attractors, the local maxima of the overall density function. With kernel density functions, clusters of arbitrary shape can be described by a simple equation. However, DENCLUE requires careful selection of its input parameters (σ and ξ), as these parameters play an important role in cluster formation and in the quality of the output.

It has several advantages over other clustering algorithms:

a) It has a solid mathematical foundation and generalizes other clustering methods, such as partitioning and hierarchical methods;

b) It has good clustering properties for datasets with large amounts of noise;

c) It allows a compact mathematical description of arbitrarily shaped clusters in high-dimensional datasets; and

FIGURE [3]: DENCLUE ALGORITHM PSEUDO-CODE. [13]
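The roles of σ and ξ can be illustrated with a small sketch that assumes a Gaussian kernel as the influence function. The hill-climbing step below uses a fixed-point (mean-shift style) update in place of gradient ascent, which is a substitution made for brevity; the names `density` and `density_attractor` and the toy values of `sigma` and `xi` are assumptions for illustration.

```python
import math


def density(x, points, sigma):
    """Overall density at x: sum of Gaussian influences of all points."""
    return sum(
        math.exp(-math.dist(x, p) ** 2 / (2 * sigma ** 2)) for p in points
    )


def density_attractor(x, points, sigma, steps=100, tol=1e-6):
    """Climb from x to a local maximum of the density function.

    Assumption: x starts at or near a data point, so the weights
    below never all underflow to zero.
    """
    x = list(x)
    for _ in range(steps):
        weights = [
            math.exp(-math.dist(x, p) ** 2 / (2 * sigma ** 2)) for p in points
        ]
        total = sum(weights)
        # Fixed-point update: move to the kernel-weighted mean of the data.
        new_x = [
            sum(w * p[d] for w, p in zip(weights, points)) / total
            for d in range(len(x))
        ]
        if math.dist(x, new_x) < tol:
            return new_x
        x = new_x
    return x


# Hypothetical toy data: a tight group, a pair, and one isolated point.
points = [(0.0, 0.0), (0.2, 0.0), (0.0, 0.2), (5.0, 5.0), (5.2, 5.0), (20.0, 20.0)]
sigma, xi = 1.0, 1.5
attractor = density_attractor(points[0], points, sigma)
is_noise = density(density_attractor(points[5], points, sigma), points, sigma) < xi
```

Here σ controls how far each point's influence reaches, and ξ is the density an attractor must reach for its points to count as a cluster rather than noise; both have to be tuned to the data, which is the sensitivity noted above.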

5.3.1 GMDBSCAN

GMDBSCAN is a combined density-based and grid-based algorithm that can work on high-dimensional datasets and form clusters of arbitrary shape. Its name stands for grid multi-density based clustering with noise. [14]

Steps of GMDBSCAN algorithm:

a) Take the input data set and standardize it.

b) Divide the data space into a grid.

c) Compute the density statistics of the grid and construct the SP-tree.

d) Form the bitmap.

The figure below gives the detailed algorithm for GMDBSCAN.

FIGURE [3]: GMDBSCAN ALGORITHM PSEUDO-CODE.[14]

CONCLUSION

In this paper we have presented a comprehensive overview of Big Data and its categorization on the basis of data accessibility, and described the challenges that Big Data faces today in storage, sorting and analysis. We have also described different types of clustering algorithms. As future work, we propose to investigate different types of data clustering algorithms and their implementation on Big Data, and to optimize and measure the efficiency of algorithms suitable for handling massive Big Data and multi-dimensional data sets.

REFERENCES

[1] Seref Sagiroglu and Duygu Sinanc, Gazi University. Big Data: A Review.

[2] Marko Grobelnik, Jozef Stefan Institute, Ljubljana, Slovenia. Big Data Tutorial, May 8th 2012.

[3] https://www.google.co.in/search?q=big+data images

[5] http://www.ericsson.com/res/thecompany/docs/publications/ericsson_review/2012/er-big-data-technologies.pdf. Last accessed 29th January 2015.

[6] A. A. Abbasi and M. Younis. A survey on clustering algorithms for wireless sensor networks. Computer communications, 30(14):2826–2841, 2007.

[7] C. C. Aggarwal and C. Zhai. A survey of text clustering algorithms. In Mining Text Data, pp. 77–128. Springer, 2012.

[8] Ku Ruhana Ku-Mahamud, Universiti Utara Malaysia, Malaysia. Big Data Clustering Using Grid Computing and Ant-Based Algorithm.

[9] Seref Sagiroglu and Duygu Sinanc, Gazi University, Department of Computer Engineering, Faculty of Engineering, Ankara, Turkey. Big Data, A Survey.

[10] Jerzy Stefanowski, Institute of Computing Sciences, Poznan University of Technology, Poznan, Poland. Data Mining clustering techniques lectures, Lecture 7, SE Master Course 2008/2009.

[11] Xiao Cai, Feiping Nie and Heng Huang, University of Texas at Arlington, Arlington, Texas, 76092. Multi-View K-means Clustering on Big Data.

[12] Wei Fan (Huawei Noah’s Ark Lab, Hong Kong Science Park, Shatin, Hong Kong) and Albert Bifet (Yahoo! Research Barcelona, Av. Diagonal 177, Barcelona, Catalonia, Spain). Mining Big Data: Current Status, and Forecast to the Future.

[13] A. Fahad, N. Alshatri, Z. Tari, A. Alamri, I. Khalil, A. Zomaya, S. Foufou, and A. Bouras. A Survey of Clustering Algorithms for Big Data: Taxonomy & Empirical Analysis.