Authors. Data Clustering: Algorithms and Applications

(1)

Data Clustering: Algorithms

and Applications

(2)

(3)

1 Grid-based Clustering 1 Wei Cheng, Wei Wang, and Sandra Batista

1.1 Introduction . . . 1

1.2 The Classical Algorithms . . . 5

1.2.1 Earliest Approaches: GRIDCLUS and BANG . . . 5

1.2.2 STING and STING+: The Statistical Information Grid Approach . 7 1.2.3 WaveCluster: Wavelets in Grid-based Clustering . . . 8

1.3 Adaptive Grid-based Algorithms . . . 9

1.3.1 AMR: Adaptive Mesh Reﬁnement Clustering . . . 9

1.4 Axis-shifting Grid-based Algorithms . . . 10

1.4.1 NSGC: New Shifting Grid Clustering Algorithm . . . 10

1.4.2 ADCC: Adaptable Deﬂect and Conquer Clustering . . . 11

1.4.3 ASGC: Axis-Shifted Grid-Clustering . . . 12

1.4.4 GDILC: Grid-based Density-IsoLine Clustering Algorithm . . . 13

1.5 High Dimensional Algorithms . . . 14

1.5.1 CLIQUE: The Classical High-Dimensional Algorithm . . . 14

1.5.2 Variants of CLIQUE . . . 15

1.5.2.1 ENCLUS: Entropy-based Approach . . . 15

1.5.2.2 MAFIA: Adaptive Grids in High Dimensions . . . 15

1.5.3 OptiGrid: Density-based Optimal Grid Partitioning . . . 16

1.5.4 Variants of the OptiGrid Approach . . . 18

1.5.4.1 O-Cluster: A Scalable Approach . . . 18

1.5.4.2 CBF: Cell-based Filtering . . . 19

1.6 Conclusions and Summary . . . 20

Bibliography 23

(4)

(5)

Chapter 1

Grid-based Clustering

Wei Cheng

Computer Science Department

University of North Carolina at Chapel Hill Chapel Hill, NC 27599

Wei Wang

Computer Science Department University of California, Los Angeles Los Angeles, CA 90095

Sandra Batista

Statistical Science Department Duke University

Durham, NC 27710

1.1 Introduction . . . . 1

1.2 The Classical Algorithms . . . . 5

1.2.1 Earliest Approaches: GRIDCLUS and BANG . . . . 5

1.2.2 STING and STING+: The Statistical Information Grid Approach . . . . 7

1.2.3 WaveCluster: Wavelets in Grid-based Clustering . . . . 8

1.3 Adaptive Grid-based Algorithms . . . . 9

1.3.1 AMR: Adaptive Mesh Refinement Clustering . . . . 9

1.4 Axis-shifting Grid-based Algorithms . . . . 10

1.4.1 NSGC: New Shifting Grid Clustering Algorithm . . . . 10

1.4.2 ADCC: Adaptable Deflect and Conquer Clustering . . . . 11

1.4.3 ASGC: Axis-Shifted Grid-Clustering . . . . 12

1.4.4 GDILC: Grid-based Density-IsoLine Clustering Algorithm . . . . 12

1.5 High Dimensional Algorithms. . . . 13

1.5.1 CLIQUE: The Classical High-Dimensional Algorithm . . . . 14

1.5.2 Variants of CLIQUE . . . . 15

1.5.2.1 ENCLUS: Entropy-based Approach . . . . 15

1.5.2.2 MAFIA: Adaptive Grids in High Dimensions . . . . 15

1.5.3 OptiGrid: Density-based Optimal Grid Partitioning . . . . 16

1.5.4 Variants of the OptiGrid Approach . . . . 17

1.5.4.1 O-Cluster: A Scalable Approach . . . . 18

1.5.4.2 CBF: Cell-based Filtering . . . . 19

1.6 Conclusions and Summary . . . . 19

1.1 Introduction

Grid-based clustering algorithms are eﬃcient in mining large multidimensional data sets. These algorithms partition the data space into a ﬁnite number of cells to form a grid structure and then form clusters from the cells in the grid structure. Clusters correspond to regions that are more dense in data points than their surroundings. Grids were initially 1

(6)

TABLE 1.1: Grid-based algorithms that use hierarchical clustering or subspace clustering hierarchical clustering GRIDCLUS, BANG-clustering, AMR, STING, STING+ subspace clustering MAFIA, CLIQUE, ENCLUS

proposed by Warnekar and Krishna [29] to organize the feature space, e.g., in GRIDCLUS [25], and increased in popularity after STING [28], CLIQUE [1], and WaveCluster [27] were introduced. The great advantage of grid-based clustering is a significant reduction in time complexity, especially for very large data sets. Rather than clustering the data points directly, grid-based approaches cluster the neighborhood surrounding the data points represented by cells. In most applications since the number of cells is significantly smaller than the number of data points, the performance of grid-based approaches is significantly improved. Grid-based clustering algorithms typically involve the following five steps [9, 10]: 1. Creating the grid structure, i.e., partitioning the data space into a finite number of

cells.

2. Calculating the cell density for each cell. 3. Sorting of the cells according to their densities. 4. Identifying cluster centers.

5. Traversal of neighbor cells.

Since cell density often needs to be calculated in order to sort cells and select cluster centers, most grid-based clustering algorithms may also be considered density-based. Some grid-based clustering algorithms also combine hierarchical clustering or subspace clustering in order to organize cells based on their density. Table 1.1 lists several representative grid-based algorithms which also use hierarchical clustering or subspace clustering.

Grid-based clustering is susceptible to the following data challenges:

1. Non-Uniformity: Using a single inflexible, uniform grid may not be sufficient to achieve desired clustering quality or efficiency for highly irregular data distributions.

2. Locality: If there are local variations in the shape and density of the distribution of data points, the effectiveness of grid-based clustering is limited by predefined cell sizes, cell borders, and the density threshold for significant cells.

3. Dimensionality: Since performance depends on the size of the grid structures and the size of grid structures may increase significantly with more dimensions, grid-based approaches may not be scalable for clustering very high-dimensional data. In addition, there are aspects of the “curse of dimensionality” including filtering noise and selecting the most relevant attributes that are increasingly difficult with more dimensions in a grid-based clustering approach.

To overcome the challenge of non-uniformity,adaptive grid-based clustering algorithms

that divide the feature space at multiple resolutions, e.g., AMR [14] and MAFIA [21], were proposed. The varying grid sizes can cluster well data with non-uniform distributions. For example, as illustrated in Figure 1.1(a), the data is dispersed throughout the spatial domain with several more dense nested regions in the shape of a circle, square, and rectangle. A single resolution uniform grid would have diﬃculty identifying those more dense, nested regions as clusters as shown in Figure 1.1(b). In contrast, an adaptive algorithm, such as

(7)

(a) Original data (b) Uniform grid

FIGURE 1.1: Non-uniformity example with nested clusters

TABLE 1.2: Grid-based Algorithms Addressing Non-uniformity (Adaptive), Locality (Axis-Shifting), and Dimensionality

Adaptive MAFIA, AMR

Axis-shifting NSGC, ADCC, ASGC, GDILC

High-dimension CLIQUE, MAFIA, ENCLUS, OptiGrid, O-cluster, CBF

AMR, that permits higher resolution throughout the space can recognize those nested, more dense clusters with centers at the most clear, dense shapes. (Figure 1.1 is adapted from Figure 1 from Liao et al.[14] and is only illustrative, not based on real data.)

To address locality,axis-shiftingalgorithms were introduced. These methods adopt

axis-shifted partitioning strategies to identify areas of high density in the feature space. For instance, in Figure 1.2(a), traditional grid-based algorithms will have difficulty adhering to the border and continuity of the most dense regions because of the predefined grids and the threshold of significant cells. The clustering from using a single uniform grid, shown in Figure 1.2(b), demonstrates that some clusters are divided into several smaller clusters because the continuity of the border of the dense regions is disturbed by cells with low density. To remedy this, axis-shifting algorithms, such as ASGC [16], shift the coordinate axis by half a cell width in each dimension creating a new grid structure. This shifting yields a clustering that recognizes more dense regions adjacent to lower density cells as shown in Figure 1.2(c). By combining the clustering from both axes, algorithms can recognize dense regions as clusters as shown in Figure 1.2(d). (Figures 1.2(a), 1.2(b), 1.2(c), and 1.2(d) are adapted from Figures 11, 12, 14, and 18 respectively from Lin et al. [16] and are only illustrative, not based on real data or real clustering algorithm results.)

For handling high dimensional data, there are several grid-based approaches. For ex-ample, the CLIQUE algorithm selects appropriate subspaces rather than the whole feature space for ﬁnding the dense regions. In contrast, the OptiGrid algorithm uses density estima-tions. A summary of grid-based algorithms that address these three challenges is presented in Table 1.2.

(8)

v

(a) original data

v

Cluster 1

Cluster 2

Cluster 3

(b) 3 clusters found by 31×38 cells

v

Cluster 1

Cluster 2

Cluster 3

(c) 3 clusters found by 31×38 cells after axis-shifting

v

Cluster 1 Cluster 2

(d) final clusters by combiningbandc

(9)

In the remainder of this chapter we survey classical grid-based clustering algorithms as well as those algorithms that directly address the challenges of non-uniformity, locality, and high dimensionality. First, we discuss some classical grid-based clustering algorithms in Section 1.2. These classical grid-based clustering algorithms include the earliest approaches: GRIDCLUS, STING, WaveCluster, and variants of them. We present an adaptive grid-based algorithm, AMR, in Section 1.3. Several axis-shifting algorithms are evaluated in Section 1.4. In Section 1.5, we discuss high dimensional grid-based algorithms, including CLIQUE, OptiGrid, and their variants. We oﬀer our conclusions and summary in Section 1.6.

1.2 The Classical Algorithms

In this section, we introduce three classical grid-based clustering algorithms together with their variants: GRIDCLUS, STING, and WaveCluster.

1.2.1 Earliest Approaches: GRIDCLUS and BANG

Schikuta et al. [25] introduced the ﬁrst GRID-based hierarchical CLUStering algorithm called GRIDCLUS. The algorithm partitions the data space into a grid structure comprised of disjoint d-dimensional hyper rectangles or blocks. Data points are considered points in d-dimensional space and are designated to blocks in the grid structure such that their topo-logical distributions are maintained. Once the data is assigned to blocks, clustering is done by a neighbor search algorithm. In some respects, GRIDCLUS is the canonical grid-based clustering algorithm and its basic steps coincide with those given for grid-based algorithms in Section 1.1. Namely, GRIDCLUS inserts points into blocks in its grid structure, calculates the resultant density of the blocks, sorts the blocks according to their density, recognizes the most dense blocks as cluster centers, and constructs the rest of clusters using a neighbor search on the blocks.

The grid structure has a scale for each dimension, a grid directory, and the set of data blocks. Each scale is used to partition the entire d-dimensional space and this partitioning is stored in the grid directory. The data blocks contain the data points and there is an upper bound on the number of points per block. The blocks must be non-empty, cover all the data points, and not have any data points in common. Hinrichs oﬀers a more thorough discussion of the grid ﬁle structure used[15].

The density index of a block,B, is deﬁned as the number of points in the block divided

by the spatial volume of the block, i.e., DB=

pB

VB

, (1.1)

where pB is the number of data points in the blockB and VB is the spatial volume of the

block B, i.e., VB= d ∏ i=1 eBi. (1.2)

wheredis the number of dimensions andeBi is the extent of the block in theidimension.

GRIDCLUS sorts the blocks according to their density and those with the highest sity are chosen as the cluster centers. The blocks are clustered in order of descending den-sity iteratively to create a nested sequence of nonempty, disjoint clusters. Starting from cluster centers only neighboring blocks are merged into clusters. The neighbor search is

(10)

Algorithm 1GRIDCLUS Algorithm

1: Setu:= 0,W[] :={},C[] :={}{initialization};

2: Create the grid structure and calculate the block density indices;

3: Generate a sorted block sequenceB1′, B2′, ..., Bb′ and mark all blocks “not active” and

“not clustered”;

4: whilea “not active” block existsdo

5: u⇐u+ 1;

6: mark ﬁrstB1′, B2′, ..., Bj′ with equal density index “active”;

7: foreach “not clustered” blockBl′ :=B1′, B2′, ..., Bj′ do

8: Create a new cluster setC[u];

9: W[u]⇐W[u] + 1,C[u, W[u]]←Bl′;

10: Mark block Bl′ clustered;

11: N EIGHBOR SEARCH(Bl′,C[u, W[u]]);

12: end for

13: foreach “not active” blockB do

14: W[u]⇐W[u] + 1,C[u, W[u]]←B;

15: end for

16: Mark all blocks “not clustered”;

17: end while

Algorithm 2Procedure NEIGHBOR SEARCH(B,C)

1: foreach “active” and “not clustered” neighborB′ ofB do

2: C←B′;

3: Mark blockB′ “clustered”;

4: N EIGHBOR SEARCH(B′, C);

5: end for

done recursively starting at the cluster center, checking for adjacent blocks that should be added to the cluster, and for only those neighboring blocks added to the cluster, contin-uing the search. The GRIDCLUS algorithm is described in Algorithm 1 and the function N EIGHBOR SEARCH is the recursive procedure described in Algorithm 2 [10, 25].

While no explicit time complexity analysis is given for GRIDCLUS in the original paper, the algorithm may not have time complexity much better than other hierarchical clustering

algorithms in the worst case. The number of blocks in the worst case is O(n) where n is

the number of data points and sorting the blocks by density is O(nlogn). However, this

complexity would still be better than hierarchical clustering. The problem is that step 4

can also require O(n) if all the blocks have diﬀerent densities and step 7 can also require

O(n) if all the blocks have the same density. In addition, while the number of neighbors of

any block is a function of the number of dimensions, the depth of the recursive calls to the

neighbor search function can also beO(n). This can occur if the blocks are adjacent in a

single place analogous to a spanning tree that is a straight line. Without any discriminatory density thresholds, the pathological case of step 7 could also apply and the complexity would

be O(n2_{). (Granted average case complexity for several distributions may be signiﬁcantly}

better (i.e.,O(n)) and that may be a better analysis to consider.)

The BANG algorithm introduced by Schikuta and Erhart [26] is an extension of the GRIDCLUS algorithm. It addresses some of the ineﬃciencies of the GRIDCLUS algorithm in terms of grid structure size, searching for neighbors, and managing blocks by their density. BANG also places data points in blocks and uses a variant of the grid directory called a BANG structure to maintain blocks. Neighbor search and processing the blocks in decreasing

(11)

Algorithm 3BANG-clustering Algorithm

1: Partition the feature space into rectangular blocks which contains up to a maximum of

pmax data points.

2: Build a binary tree to maintain the populated blocks, in which the partition level

cor-responds to the node depth in the tree.

3: Calculate the dendrogram in which the density indices of all blocks are calculated and

sorted in decreasing order.

4: Starting with the highest density index, all neighbor blocks are determined and classiﬁed

in decreasing order. BANG-clustering places the found regions in the dendrogram to the right of the original blocks.

5: Repeat step 4 for the remaining blocks of the dendrogram.

TABLE 1.3: Statistical Information in STING

n number of objects (points) in the cell

mean mean of each dimension in this cell

std standard deviation of each dimension in this cell

min the minimum value of each dimension in this cell

max the maximum value of each dimension in this cell

dist the distribution of points in this cell

order of density are also used for clustering blocks. Nearness of neighbors is determined by the maximum dimensions shared by a common face between blocks. A binary tree is used to store the grid structure, so that neighbor searching can be done more efficiently. From this tree in the grid directory and the sorted block densities, the dendrogram is calculated. Centers of clusters are still the most highly dense blocks in the clustering phase. The BANG algorithm is summarized in Algorithm 3 [10]. While both GRIDCLUS and BANG can discern nested clusters efficiently, BANG has been shown to be more efficient than GRIDCLUS on large data sets because of its significantly reduced growth of grid structure size [26].

1.2.2 STING and STING+: The Statistical Information Grid Approach

Wang et al. [28] proposed a STatistical INformation Grid-based clusterinG method (STING) to cluster spatial databases and to facilitate region oriented queries. STING divides the spatial area into rectangular cells and stores the cells in a hierarchical grid structure tree. Each cell (except leaves in the tree) is partitioned into 4 child cells at the next level with each child corresponding to a quadrant of the parent cell. A parent cell is the union of its children; the root cell at level 1 corresponds to the whole spatial area. The leaf lev-el clev-ells are of uniform size, determined globally from the average density of objects. For each cell, both attribute-dependent and attribute-independent parameters of the statistical information are maintained. These parameters are deﬁned in Table 1.3.

STING maintains summary statistics for each cell in its hierarchical tree. As a result, statistical parameters of parent cells can easily be computed from the parameters of child cells. Note that the distribution types may be normal, uniform, exponential and none. Value

of dist may either be assigned by the user or obtained by hypothesis tests such as theχ2

test. Even though measures of these statistical parameters are calculated in a bottom-up fashion from any leaf node, the STING algorithm adopts a top-down approach for clustering

(12)

Algorithm 4STING Algorithm

1: Determine a level to begin with.

2: For each cell of this level, we calculate the conﬁdence interval (or estimated range) of

probability that this cell is relevant to the query.

3: From the interval calculated above, we label the cell asrelevant ornot relevant.

4: If this level is the leaf level, go to Step 6; otherwise, go to Step 5.

5: We go down the hierarchy structure by one level. Go to Step 2 for those cells that form

the relevant cells of the higher level.

6: If the speciﬁcation of the query is met, go to Step 8; otherwise, go to Step 7.

7: Retrieve those data fall into the relevant cells and do further processing. Return the

result that meet the requirement of the query. Go to Step 9.

8: Find the regions of relevant cells. Return those regions that meet the requirement of

the query. Go to Step 9.

9: Stop.

and query by starting from the root of its hierarchical grid structure tree. The algorithm is summarized in Algorithm 4 [10, 28].

The tree can be constructed in O(N) time, whereN is the total number of data points.

Dense cells are identiﬁed and clustered by examining the density of these cells in a similar

vein to the density-based DBSCAN algorithm[7]. If the cell tree has K leaves, then the

complexity of spatial querying and clustering for STING is O(K), which isO(N) in the worst

case since cells that would be empty never need to be materialized and stored in the tree. A

common misconception is thatKwould beO(2d_{) where}_d_{is the number of dimensions and}

that this would be problematic in high dimensions. STING may have problems with higher dimensional data common to all grid-based algorithms (e.g., handling noise and selecting most relevant attributes) [11], but scalability of the grid structure is not one of them.

There are several advantages of STING. First, it is a query-independent approach since the statistical information exists independent of queries. The computational complexity of

STING for clustering is O(K) and this is quite eﬃcient in clustering large data sets especially

whenK≪N. The algorithm is readily parallelizable and allows for multiple resolutions for

examining the data in its hierarchical grid structure. In addition, incremental data updating is supported, so there is lower overhead for incorporating new data points.

Wang et al. extended STING to STING+ so that it is able to process dynamically evolving spatial databases. In addition, STING+ enables active data mining by supporting user-deﬁned trigger conditions.

1.2.3 WaveCluster: Wavelets in Grid-based Clustering

Sheikholeslami et al. [27] proposed a grid-based and density-based clustering approach that uses wavelet transforms: WaveCluster. This algorithm applies wavelet transforms to the data points and then uses the transformed data to ﬁnd clusters. A wavelet transform is a signal processing technique that decomposes a signal into diﬀerent frequency subbands.

The insight to using the wavelet transforms is that data points are consideredd-dimensional

signals wheredis the number of dimensions. The high-frequency parts of a signal correspond

to the more sparse data regions such as boundaries of the clusters whereas the low-frequency high-amplitude parts of a signal correspond to the more dense data regions such as clus-ter inclus-teriors [3]. By examining in different frequency subbands, clusclus-tering results may be achieved at different resolutions and scales from fine to coarse. Data are transformed to p-reserve relative distance between objects at different levels of resolution. A hat-shaped filter

(13)

Algorithm 5WaveCluster Algorithm

INPUT:Multidimensional data objects’ feature vectors OUTPUT:cluster objects

1: First bin the feature space, then assign objects to the units, and compute unit

sum-maries.

2: Apply wavelet transform on the feature space.

3: Find connected components (clusters) in the subbands of transformed feature space, at

multiple levels.

4: Assign labels to the units in the connected components.

5: Make the lookup table.

6: Map the objects to the clusters.

is used to emphasize regions where points cluster and to suppress weaker information in their boundaries. This makes natural clusters more distinguishable and eliminates outliers simultaneously. As input the algorithm requires as parameters the number of grid cells for each dimension, the wavelet, and the number of applications of the wavelet transform. The algorithm is summarized in Algorithm 5 [27].

WaveCluster oﬀers several advantages. The time complexity is O(N) where N is the

number of data points; this is very eﬃcient for large spatial databases. The clustering results are insensitive to outliers and the data input order. The algorithm can accurately discern arbitrarily shaped clusters such as those with concavity and nesting. The wavelet transformation permits multiple levels of resolution, so that clusters may be detected more accurately. This algorithm is primarily only suited for low dimensional data. However, in the case of very high dimensional data, PCA may be applied to the data to reduce the number

of dimensions, so that N > mf _where _m_{is the number of intervals in each dimension and}

f is the number of dimensions selected after PCA. After this, WaveCluster may be applied

to the data to cluster it and still achieve linear time eﬃciency [27].

1.3 Adaptive Grid-based Algorithms

When a single inflexible, uniform grid is used, it may be difficult to achieve the desired clustering quality or efficiency for highly irregular data distributions. In such instances, adaptive algorithms that modify the uniform grid may be able to overcome this weakness in uniform grids. In this section, we introduce an adaptive grid-based clustering algorithm: AMR. Another adaptive algorithm (MAFIA) will be discussed in Section 1.5.2.2.

1.3.1 AMR: Adaptive Mesh Refinement Clustering

Liao et al. [14] proposed a grid-based clustering algorithm AMR using a Adaptive Mesh Refinement technique that applies higher resolution grids to the localized denser regions. Different from traditional grid-based clustering algorithms, such as CLIQUE and GRID-CLUS, which use a single resolution mesh grid, AMR divides the feature space at multiple resolutions. While STING also offers multiple resolutions, it does so over the entire space, not localized regions. AMR creates a hierarchical tree constructed from the grids at multiple resolutions. Using this tree, this algorithm can discern clusters, especially nested ones, that

(14)

may be diﬃcult to discover without clustering several levels of resolutions at once. AMR is very ﬁt for data mining problems with highly irregular data distributions.

AMR clustering algorithm mainly contains two steps summarized here from [13]: 1. Grid Construction: First grids are created at multiple resolutions based on regional

density. The grid hierarchy tree contains nested grids of increasing resolution since the grid construction is done recursively. The construction of the AMR tree starts with a uniform grid covering the entire space and for those cells that exceed a density threshold, the grid is reﬁned into higher resolution grids at each recursive step. The new child grids created as part of the reﬁnement step are connected in the tree to parent grid cells whose density exceeds the threshold.

2. Clustering: To create clusters, each leaf node is considered to be the center of an individual cluster. The algorithm recursively assigns objects in the parent nodes to clusters until the root node is reached. Cells are assigned to clusters based on the minimum distance to the clusters under the tree branch.

The overall complexity for constructing the AMR tree is O(dtN1₁−₋p_ph+ (dtk+ 6d₎_r1−qh

1−q ),

whereN is the number of data points,dis the dimensionality,tis the number of attributes

in each dimension, his the AMR tree height, and prepresents the average percentage of

data points to be reﬁned at each level, ris the mesh size at the root, and qis the average

ratio of mesh sizes between two grid levels [14].

Like most grid-based methods, AMR exhibits insensitivity to the order of input data. The AMR clustering algorithm may be applied to to any collection of attributes with numerical values even those with very irregular or very concentrated data distributions. However, like GDILC, it cannot be scaled to high-dimensional databases because of its overall complexity.

1.4 Axis-shifting Grid-based Algorithms

The effectiveness of a grid-based clustering algorithm is seriously limited by the size of the predefined grids, the borders of the cells, and the density threshold of the significant cells in the face of local variations in shape and density in a data space. These challenges motivate another kind of grid-based algorithms: axis-shifting algorithms. In this section, we introduce four axis-shifting algorithms: NSGC, ADCC, ASGC, and GDILC.

1.4.1 NSGC: New Shifting Grid Clustering Algorithm

Fixed grids may suffer from the boundary effect. To alleviate this, Ma and Chow [18] proposed a New Shifting Grid Clustering algorithm (NSGC). NSGC is both density-based and grid-based. To form its grid structure, the algorithm divides each dimension of the space into an equal number of intervals. NSGC shifts the whole grid structure and uses the shifted grid along with the original grid to determine the density of cells. This reduces the influence of the size and borders of the cells. It then clusters the cells rather than the points. Specifically, NSGC consists of four main steps summarized in Algorithm 6 [18].

NSGC repeats the steps above until the result of the previous iteration and that of the current iteration are smaller than a speciﬁed accepted error threshold. The complexity of

NSGC isO((2w)d_{), where} _d_{is the dimensionality and}_w_{is the number of iterations of the}

algorithm. While it is claimed that this algorithm is non-parametric, its performance is

(15)

Algorithm 6NSGC Algorithm

1: Cell construction: It divides each dimension of the space into 2wintervals, wherew is

the number of iterations.

2: Cell assignment: It ﬁrst ﬁnds the data points belonging to a cell, then shifts by half

cell-size of the corresponding dimension, and ﬁnds the data points belonging to shifted cells.

3: Cell density computation: It uses both the density of the cell itself and its nearest

neighborhood to obtain a descriptive density proﬁle.

4: Group assignment(clustering): It starts when the considered cell or one of its neighbor

cells has no group assigned. Otherwise, the next cell is considered until all non-empty cells are assigned.

Algorithm 7ADCC Algorithm

1: Generate the ﬁrst grid structure.

2: Identify the signiﬁcant cells.

3: Generate the ﬁrst set of clusters.

4: Transform the grid structure.

5: Generate the second set of clusters.

6: Revise the original clusters. In this case, the ﬁrst set of clusters and second set of clusters

are combined recursively.

7: Generate the ﬁnal clustering result.

Ifwis set too low (or high) or the error threshold too high, then clustering results may not

be accurate; there is no a priori way to know the best values of these parameters for speciﬁc data. NSGC is susceptible to errors caused by cell sizes that are too small also. As the size of cells decreases (and the number of iterations increases), the total number of cells and the number of clusters reported both increase. The reported clusters may be too small and not correspond to clusters in the original data. The strongest advantage of NSGC is that its grid shifting strategy permits it to recognize clusters of very arbitrary boundary shapes with great accuracy.

1.4.2 ADCC: Adaptable Deflect and Conquer Clustering

The clustering quality of grid-based clustering algorithms often depends on the size of the predefined grids and the density threshold. To reduce their influence, Lin et al. adopted “deflect and conquer techniques” to propose a new grid-based clustering algorithm ADCC (Adaptable Deflect and Conquer Clustering) [17]. Very similar to NSGC, the idea of ADCC is to utilize the predefined grids and predefined threshold to identify significant cells. Nearby cells that are also significant can be merged to develop a cluster. Next, the grids are deflected half a cell size in all directions and the significant cells are identified again. Finally, the newly generated significant cells and the initial set of significant cells are merged to improve the clustering of both phases. Specifically, ADCC is summarized in Algorithm 7.

The overall complexity of ADCC is O(md₊_N_{), where} _m _{is the number of intervals}

in each dimension, d is the dimensionality of data, and N is the number of data points.

While ADCC is very similar to NSGC in its axis-shifting strategy, it is quite different in how it constructs clusters from the sets of grids. Rather than examining a neighborhood of the two grids at once as NSGC does, ADCC examines the two grids recursively looking for consensus in the significance of cells in both clusterings especially those that overlap a previous clustering to make a determination about the final clustering. This step can

(16)

Algorithm 8ASGC Algorithm

1: Generate the ﬁrst grid structure: the entire feature space is divided into non overlapping

cells thus forming the ﬁrst grid structure.

2: Identify the signiﬁcant cells: These are cells whose density is more than a predeﬁned

threshold.

3: Generate the ﬁrst set of clusters: all neighboring signiﬁcant cells are grouped together

to form clusters.

4: Transform the grid structure: the original coordinate origin is shifted by distanceξ in

each dimension of the feature space to obtain a new grid structure.

5: Generate the second set of clusters: new clusters are generated using steps 2 and 3.

6: Revise the original clusters: the clusters generated from the shifted grid structures can

be used to revise the clusters generated from the original grid structure.

7: Generate the ﬁnal clustering result.

actually help to separate clusters more eﬀectively especially if there is only a small distance with very little data between them. Both methods are suspectible to errors caused by small cell sizes, but can for the most part handle arbitrary borders and shapes in clusters very well. ADCC is not dependent on many parameters to determine its termination. It is only

dependent on the choice of the number of intervals per dimension,m.

1.4.3 ASGC: Axis-Shifted Grid-Clustering

Another attempt by Chang et al. [16] to minimize the impact of the size and borders

of the cells is ASGC (Axis-Shifted Grid-Clustering) (also referred to as ACICA+). After

creating an original grid structure and initial clustering from that grid structure, the original grid structure is shifted in each dimension and another clustering is done. The shifted grid structure can be translated an arbitrary distance to be specified. The effect of this is to implicitly change the size of the original cells. It also offers greater flexibility to adjust to boundaries of clusters in the original data and minimize the effect of the boundary cells. The clusters generated from this shifted grid structure can be used to revise the original clusters. Specifically, the ASGC algorithm involves 7 steps, and is summarized in Algorithm 8 from [13].

The complexity of ASGC is the same as that of ADCC, which isO(md₊_N_{), where}

N is the number of data points, dis the dimensionality of the data, andmis the number

of intervals in each dimension. The main difference between ADCC and ASGC is that the consensus method to revise clusters is bi-directional in ASGC: using the overlapping cells the clusters from the first phase can be used to modify the clusters of the second phase and vice versa. When a cluster of the first clustering overlaps a cluster of the second clustering, the combined cluster of union of both can then be modified in order to generate the final clustering. This permits great flexibility in handling arbitrary shapes of clusters in the original data and minimizes the extent to which either grid structure will separate clusters. By essentially translating the original grid structure an arbitrary distance to create the second grid and overlapping it with the original grid structure, a different resolution (and implicitly different cell size) is also achieved by this translation. While this method is less susceptible to the effects of cell sizes and cell density thresholds than other axis-shifting grid clustering methods, it still requires careful initial choice of cell size and cell density threshold.

(17)

Algorithm 9GDILC Algorithm

1: Cells are initialized by dividing each dimension intomintervals.

2: The distances between sample points and those in neighboring cells are calculated. The

distance thresholdT is computed.

3: The density vector and density thresholdτ are computed.

4: At ﬁrst, GDILC takes each data point whose density is more than the density threshold

τ as a cluster. Then, for each data point x, check, for every data point whose density

is more than the density thresholdτ in the neighbor cells ofx, whether its distance to

xis less than the distance threshold T. If so, GDILC then combines the two clusters

containing those two data points. The algorithm continues until all point pairs have been checked.

5: Outliers are removed.

1.4.4 GDILC: Grid-based Density-IsoLine Clustering Algorithm

Zhao and Song [31] proposed a Grid-based Density-IsoLine Clustering algorithm (GDIL-C) to perform clustering by making use of the density-isoline ﬁgure. It assumes that all data

samples have been normalized. All attributes are numerals and are in the range of [0,1]. This

is for the convenience of distance and density calculation. GDILC first implicitly calculates a density-isoline figure, the contour figure of the density of data points. Then clusters are discovered from the density-isoline figure.

GDILC computes the density of a data point by counting the number of points in its

neighbor region. Speciﬁcally, the density of a data pointxis deﬁned as follows:

Density(x) =|{y:Dist(x, y)≤T}|, (1.3)

whereT is a given distance threshold andDist(x, y) is a distance function (e.g., Euclidean

distance) used to measure the dissimilarity between data points x and y. The

density-isoline ﬁgure is never drawn, but is obtained from the density vectors. The density vectors are computed by counting the elements of each row of the distance matrix that are less than

the radius of the neighbor region,T. To avoid enumerating all data points for calculating the

density vector, GDILC employs a grid-based method. The grid-based method ﬁrst partitions each dimension into several intervals creating hyper-rectangular cells. Then, to calculate the

density of data pointx, GDILC only considers data points in the same cell withxand those

data points in its neighbor cells; this is identical to axis shifting. The GDILC algorithm is shown in Algorithm 9 [10].

For many data sets, this grid-based method signiﬁcantly reduces the search space of calculating the point pair distances; the complexity may appear nearly linear. In the worst

case the time complexity of GDILC remainsO(N2) (i.e., for the pathological case when all

the points cluster in the neighborhood of a constant number of cells). However, it cannot be

scaled to high-dimensional data because the space is divided into md _{cells, where}_m_{is the}

number of intervals in each dimension, anddis the dimensionality. When the dimensionality

dis very large,md _{is signiﬁcantly large, and the data points in each cell is very sparse, the}

GDILC algorithm will no longer work. (There will be diﬃculty computing any distances or thresholds.)

There are two significant advantages to this algorithm. First, it can handle outliers explicitly and this can be refined as desired. Second, it computes necessary thresholds such as those for density and distance directly from the data. These can be fined tuned as needed (i.e., they don’t need to be guessed at any point). In essence this algorithm dynamically learns the data distribution of samples while learning the parameters for thresholds in addition to discerning the clustering in the data.

(18)

1.5 High Dimensional Algorithms

The scalability of grid-based approaches is a significant problem in higher dimension-al data because of the increase in the size of the grid structure and the resultant time complexity increase. Moreover, inherent issues in clustering high dimensional data such as filtering noise and identifying the most relevant attributes or dimensions that represent the most dense regions must be addressed inherently in the grid structure creation as well as the actual clustering algorithm. In this section, we examine carefully a subspace clustering approach presented in CLIQUE and a density estimation approach presented by OptiGrid. This section is greatly influenced by the survey of Berkhin [3] with additional insights on the complexity, strengths, and weaknesses of each algorithm presented.

1.5.1 CLIQUE: The Classical High-Dimensional Algorithm

Agrawal et al. [1] proposed a hybrid density-based, grid-based clustering algorithm, CLIQUE (Clustering In QUEst), to ﬁnd automatically subspace clustering of high dimen-sional numerical data. It locates clusters embedded in subspaces of high dimendimen-sional data without much user intervention to discern signiﬁcant subclusters. In order to present the clustering results in an easily interpretable format, each cluster is given a minimal descrip-tion as a disjunctive normal form (DNF) expression.

CLIQUE ﬁrst partitions its numerical space into units for its grid structure. More

specif-ically, letA={A1, A2, ..., Ad}be a set of bounded, totally ordered domains (attributes) and

S =A1×A2×...,×Ad be ad-dimensional numerical space. By partitioning every

dimen-sion Ai (1 ≤i ≤ d) into m intervals of equal length, CLIQUE divides the d-dimensional

data space into md _{non-overlapping rectangular units. A} _d_{-dimensional data point,} _v_{, is}

considered in a unit,u, if the value ofv in each attribute, is greater than or equal to the

left boundary of that attribute inuand less than the right boundary of that attribute inu.

The selectivity of a unit is deﬁned to be the fraction of total data points in the unit. Only

units whose selectivity is greater than a parameterτ are viewed as dense and retained. The

deﬁnition of dense units applies to all subspaces of the originald-dimensional space.

To identify dense units to retain and subspaces that contain clusters, CLIQUE considers projections of the subspaces from the bottom up (i.e., the least dimensional subspaces to

those of increasing dimension). Given a projection subspace,At1×At2×...×Atp, wherep < d

andti< tj ifi < j, a unit is the intersection of an interval in each dimension. By leveraging

theApriori algorithm, CLIQUE employs a bottom-up scheme because monotonicity holds:

if a collection of points is a cluster in ap-dimensional space, then this collection of points is

also part of a cluster in any (p−1)-dimensional projections of this space. In CLIQUE, the

recursive step from (p−1)-dimensional units top-dimensional units involves a self-join of

thep−1 units sharingfirst common (p−2)-dimensions [3]. To reduce the time complexity of

theApriori process, CLIQUE prunes the pool of candidates, only keeping the set of dense units to form the candidate units in the next level of the dense unit generation algorithm. To prune the candidates, all the subspaces are sorted by their coverage, i.e., the fraction of the database that is covered by the dense units in it. The less covered subspaces are pruned. The cut point between retained and pruned subspaces is selected based on the MDL [24] principle in information theory.

CLIQUE then forms clusters from the remaining candidate units. Two p-dimensional

unitsu1,u2are connected if they have a common face or if there exists anotherp-dimensional

unitussuch thatu1is connected tousandu2is connected tous. A cluster is a maximal set of

(19)

components in the graph deﬁned to represent the dense units as the vertices and edges between vertices existing if and only if the units share a common face. In the worst case, this can be done in quadratic time in the number of dense units. After ﬁnding all the clusters,

CLIQUE uses a DNF expression to specify a ﬁnite set of maximal segments (regions) whose

union is the cluster. Finding the minimal descriptions for the clusters is equivalent to ﬁnding an optimal cover of the clusters; this is NP-hard. In light of this, instead, CLIQUE adopts a greedy approach to cover the cluster in regions and then discards redundant regions.

By integrating density-based, grid-based, and subspace clustering, CLIQUE discovers clusters embedded in subspaces of high dimensional data without requiring users to select subspaces of interest. The DNF expressions for the clusters give a clear representation of

clustering results. The time complexity of CLIQUE is O(cp₊_pN_{), where} _p_{is the highest}

subspace dimension selected, N is the number of input points, and c is a constant; this

grows exponentially with respect to p. The algorithm offers an effective, efficient method

of pruning the space of dense units in order to counter the inherent exponential nature of the problem. However, there is a trade-oﬀ for the pruning of dense units in the subspaces with low coverage. While the algorithm is faster, there is an increased likelihood of missing clusters. In addition, while CLIQUE does not require users to select subspaces of interest, its susceptibility to noise and ability to identify relevant attributes is highly dependent on

the user’s choice of unit intervals,m, and sensitivity threshold,τ.

1.5.2 Variants of CLIQUE

There are two aspects of the CLIQUE algorithm that can be improved. The ﬁrst one is the criterion for the subspace selection. The second is the size and resolution of the grid structure. The former is addressed by the ENCLUS algorithm by using entropy as subspace selection criterion. The latter is addressed by the MAFIA algorithm by using adaptive grids for fast subspace clustering.

1.5.2.1 ENCLUS: Entropy-based Approach

The algorithm ENCLUS (ENtropy-based CLUStering) [6] is an adaptation of the CLIQUE that uses a diﬀerent, entropy-based criterion for subspace selection. Rather than using the fraction of total points in a subspace as a criterion to select subspaces, ENCLUS

uses an entropy criteria and only those subspaces spanned by attributes A1, ..., Ap with

entropyH(A1, ..., Ap)< ϖ(a threshold) are selected for clustering. A low-entropy subspace

corresponds to a more dense region of units. An analogous monotonicity condition or

Apri-ori property also exists in terms of entropy. If a p-dimensional subspace has low entropy,

then so does any (p−1)-dimensional projections of this subspace:

H(A1, ..., Ap−1) =H(A1, ..., Ap)−H(Ap|A1, ..., Ap−1)< ϖ. (1.4) A signiﬁcant limitation of ENCLUS is its extremely high computational cost, especially in terms of computing the entropy of subspaces. However, this cost also yields the beneﬁt that this approach has increased sensitivity to detect clusters especially extremely dense small ones.

1.5.2.2 MAFIA: Adaptive Grids in High Dimensions

MAFIA (Merging of Adaptive Finite IntervAls) proposed by Goil et al. [21] is a descen-dant of CLIQUE. Instead of using a ﬁxed size cell grid structure with an equal number of bins in each dimension, MAFIA constructs adaptive grids to improve subspace cluster-ing and also uses parallelism on a shared-nothcluster-ing architecture to handle massive data sets.

(20)

Algorithm 10MAFIA Algorithm

1: Do one scan of the data to constructadaptive grids in each dimension.

2: Compute the histograms by reading blocks of data into memory using bins.

3: Using the histograms to merge bins into a smaller number of adaptive variable-size bins,

where adjacent bins with similar histogram values are combined to form larger bins. The bins that have low density of data are pruned.

4: Select bins that areα-times (αis a parameter called the cluster dominance factor) more

densely populated than average asp(p= 1 now) candidate dense units (CDUs).

5: Iteratively scan data for higher dimensions, and construct newp-CDU from two (p−

1)-CDUs if they share any (p−2)-face, and merge adjacent CDUs into clusters.

6: Generate minimal DNF expressions for each cluster

MAFIA proposes an adaptive grid of bins in each dimension. Then using an Apriori

algo-rithm, dense intervals are merged to create clusters in the higher dimensional space. The adaptive grid is created by partitioning each dimension independently based on the distribu-tion (i.e., the histogram) observed in that dimension, merging intervals that have the same observed distribution, and pruning those intervals with low density. This pruning during the construction of the adaptive grid reduces the overall computation of the clustering step. The steps of MAFIA are summarized from [3] in Algorithm 10.

Ifpis the highest dimensionality of a candidate dense unit (CDU),N is the number of

data points, andmis a constant, the algorithm’s complexity is O (mp+pN), still exponential

in the dimension as CLIQUE also is. However, the performance results on real data sets show that MAFIA is 40 to 50 times faster than CLIQUE because of the use of adaptive grids and their ability to select a smaller set of interesting CDUs [6]. Parallel MAFIA further oﬀers the ability to obtain a highly scalable clustering for large data sets. Since the adaptive grid permits not only variable resolution because of the variable bin size, but also variable, adaptive grid boundaries, MAFIA yields with greater accuracy cluster boundaries that are very close to grid boundaries and are readily expressed as minimal DNF expressions.

1.5.3 OptiGrid: Density-based Optimal Grid Partitioning

Hinneburg and Keim proposed OptiGrid (OPTimal GRID-Clustering) [12] to address several aspects of the “curse of dimensionality”: noise, scalability of the grid construction, and selecting relevant attributes by optimizing the density function over the data space. OptiGrid uses density estimations to determine the centers of clusters as the clustering was done for the DENCLUE algorithm [11]. A cluster is a region of concentrated density centered around a strong density attractor or local maximum of the density function with density above the noise threshold. Clusters may also have multiple centers if the centers are strong density attractors and there exists a path between them above the noise threshold. By recursively partitioning the feature space into multidimensional grids, OptiGrid creates an optimal grid-partition by constructing the best cutting hyperplanes of the space. These cutting planes cut the space in areas of low density (i.e. local minima of the density func-tion) and preserve areas of high density or clusters, speciﬁcally the cluster centers (i.e. local

maxima of the density function). The cutting hyperplanes are found using a set of

contract-ing linear projections of the feature space. The contracting projections create upper bounds

for the density of the planes orthogonal to them. Namely, for any point,x, in a contracting

projection, P, then for any point y such that P(y) = x, the density of y is at most the

density ofx.

(21)

in [10]. A cutting plane is a (d−1)-dimensional hyperplane consisting of all pointsy that

satisfy the equation ∑d_i₌₁wiyj = 1. The cutting plane partitionsRd into two half spaces.

The decision functionH(x) determines the half space, where a pointx∈Ris located:

H(x) = {

1 if ∑d_i₌₁wixj ≥1,

0 otherwise. (1.5)

Then, a multi-dimensional grid G for the feature space S is deﬁned to be a set H =

{H1, H2, ..., Hk} of (d−1)-dimensional cutting planes. The coding functioncG :S →Nis deﬁned as: x∈S, c(x) = k ∑ i=1 2i·Hi(x). (1.6)

OptiGrid uses a density function to determine the best cutting places and to locate

clusters. The density function ˆfD _{is deﬁned as}

ˆ fD= 1 nh n ∑ i=1 KD(x−xi h ), (1.7)

whereDis a set ofN d-dimensional points,his the smoothness level, andKDis the kernel

density estimator. Clusters are deﬁned as the maxima of the density function, which are

above a certain noise levelξ.

A center-based cluster for a maximumx∗of the density function ˆfD_{is the subset}_C_⊆_D_,

withx∈Cbeing density-attracted byx∗ and ˆfD₍_x∗₎_≥_ξ_{. Points}_x_∈_D _{are called outliers}

if they are density-attracted by a local maximumx∗0with ˆfD(x∗0)< ξ.

OptiGrid selects a set of contracting projections. These projections are then used to find the optimal cutting planes. The projections are useful because they concentrate the density of points and the cutting planes, in contrast, will have low density. Each cutting plane is selected to have minimal point density and to separate two dense half spaces. After each step of constructing a multi-dimensional grid defined by the best cutting planes, OptiGrid finds the clusters using the density function. The algorithm is then applied recursively to the clusters. In each round of recursion, OptiGrid only maintains data objects in the dense grids from the previous round of recursion. The algorithm OptiGrid is described in Algorithm 11 [10, 12].

The time complexity of OptiGrid is O(d·N·logN), where N is the number of data

points anddis the dimensionality of the data. This is very eﬃcient for clustering large

high-dimensional databases. However, it may perform poorly in locating clusters embedded in a low-dimensional subspace of a very high-dimensional database, because its recursive method only reduces the dimensions by one at every step [10]. In addition, it suffers sensitivity to parameter choice and does not efficiently handle grid sizes that exceed available memory [20]. Moreover, OptiGrid requires very careful selection of the projections, density estimate, and determination of what constitutes a best or optimal cutting plane from users. The difficulty of this is only determined on a case by case basis on the data being studied. However, a special case of applying this algorithm can be considered a more efficient variant of CLIQUE and MAFIA. Namely, if the projections used are the projection maps in each dimension, the density estimate is uniform, and there are sufficient cutting planes to separate each density attractor on each dimension, then a more efficient and accurate clustering can be achieved that circumvents the difficulties of CLIQUE, i.e, the time complexity is no longer exponential in the dimensions and clusters are not missed.

(22)

Algorithm 11OptiGrid Algorithm INPUT:data setD,q,min cut score

1: Determine a set of contracting projections P = {P0, P1, ..., Pk} and calculate all the

projections of the data setD:Pi(D), i= 1,2, ..., k;

2: Initialize a list of cutting planesBEST CU T ⇐Φ,CU T ⇐Φ;

3: fori=0 tokdo

4: CU T ⇐best local cutsPi(D);

5: CU T SCORE⇐Score best local cutsPi(D);

6: Insert all the cutting planes with a score≥min cutscoreintoBEST CU T;

7: if BEST CU T = Φthen

8: return D as a cluster;

9: else

10: Select theqcutting planes of the highest score fromBEST CU T and construct a

multidimensional gridGusing theqcutting planes;

11: Insert all data points inD intoGand determine the highly populated grid cells in

G; add these cells to the set of clusters C;

12: ReﬁneC;

13: for allclustersCi inC do

14: Do the same process with data setCi;

15: end for 16: end if 17: end for

1.5.4 Variants of the OptiGrid Approach

In this section, we consider variants of OptiGrid that were introduced to address the issues of the scalabity of the grid structure, especially with respect to available memory, and a clear criterion for the selection of cutting planes.

1.5.4.1 O-Cluster: A Scalable Approach

Milenova et al. proposed a O-cluster (Orthogonal partitioning CLUSTERing) [20] to address three limitations of OptiGrid: scalability in terms of data relative to memory size, lack of clear criterion to determine if a cutting plane is optimal or not, and sensitivity to threshold parameters for noise and cut plane density. O-Clusters addresses the first limitation by using a random sampling technique on the original data and a small buffer size. Only partitions that are not resolved (i.e., ambiguous) have data points maintained in the buffer. As a variant of OptiGrid, O-Cluster uses an axis-parallel partitioning strategy to locate high density areas in the data. To do so, O-Cluster uses contracting projections, but also proposes the use of a statistical test to validate the quality of a cutting plane. The statistical test checks for statistical significance between the difference in the density of the

peaks and a valley when the valley separates the two peaks using a standard χ2 _{test. If}

statistical signiﬁcance is found, the cutting plane would then be through such a valley. O-Cluster is also a recursive method. After testing the splitting points for all projections in a partition, the optimal one is chosen to partition the data. The algorithm then searches for cutting planes in the new partitions. A hierarchical tree structure is used to divide the data into rectangular regions. The main steps are summarized in Algorithm 12.

The time complexity of O-Cluster is approximated to be O(N d), whereNis the number

of data points and dis the number of dimensions. However, Hinneburg and Keim claim a

superlinear lower bound for the time complexity of clustering high dimensional data with

(23)

Algorithm 12O-cluster Algorithm

1: Load data buﬀer.

2: Compute histograms for active partitions.

3: Find “best” splits for active partitions.

4: Flag ambiguous and “frozen” partitions.

5: Split active partitions.

6: Reload buﬀer.

that has been read and must be searched, there is not a constant time way to do that even for random noise. Moreover, the time complexity of the OptiGrid algorithm is dominated by the insertion time into the grid structure. They assert that the insertion time for axis

parallel planes is O(N qI) where qis the number of planes andI is the insertion time and

that the insertion time is the minimum ofqand logN. Since the O-Cluster algorithm uses

a binary clustering tree to load the buﬀer, it is possible that its running time is dominated

by this and is O(N s) where s is the depth of the binary clustering tree, but clear worst

case analysis of that depth was not oﬀered. However, empirically O-Cluster has shown good scalability results even using a small buﬀer size.

While O-Cluster handles uniform noise well, its performance degrades with increasing noise. O-Cluster, unlike OptiGrid, also has a tendency to split clusters because it tends to oversplit the data space and uses histograms that do not suﬃciently smooth the distribution density. However, like OptiGrid, it may have diﬃculty with clusters embedded in a low-dimensional subspace of a very high-low-dimensional feature space because of its recursion step [10].

1.5.4.2 CBF: Cell-based Filtering

CBF (Cell-Based Filtering) [4] proposed by Chang and Jin focuses on the scalability of the grid structure, handling large data sets in memory, and the efficiency of insertion and retrieval of clusters from the grid structure. It also offers a clear criteria for a cutting plane. CBF creates its grid structure by splitting each dimension into a set of partitions using a split index. Once a dimension is split into classes, the split index is equal to the sum of the relative density squared of each class. CBF finds the optimal split section in each dimension by repeatedly examining the the split value index for partitions along the potential split points in the dimension until the maximum value of a split index for a partitioning is found. Cells are created from the overlapping regions of the partitions in each dimension. This cell creation algorithm results in many fewer cells in the grid structure than other high dimensional grid based algorithms we have discussed.

Cells with higher density than a given threshold are inserted into the clusters in the index structure. There is a second layer of the index structure built upon the cluster infor-mation file that provides approxiinfor-mation inforinfor-mation of the clusters by grouping them into sections. For sections of clusters, if the density of the section is greater than a threshold, then this secondary index is set to true for that section and false otherwise. By using this approximation technique, the index structure is able to filter queries by first examining the section information and then accessing the cell and cluster information as necessary. This filtering-based index structure achieves good retrieval performance on clusters by min-imizing the I/O access to acluster information file. While CBF is very efficient in its cell grid creation and query retrieval time, the tradeoff for such efficiency is lower precision for clustering.

(24)

1.6 Conclusions and Summary

The eﬃciency of grid-based clustering algorithms comes from how data points are grouped into cells and clustered collectively rather than individually. This results in drastic time complexity improvements because often data is grouped into far few cells than there are data points. The general approach of these algorithms is to divide the data space into grid data structures, summarize each cell in the grids by a statistic such as density, and then cluster the grid cells. The greatest challenge for using these algorithms is then determin-ing the best strategy of constructdetermin-ing the grid structure. The size of the grid structure and time for its construction largely determines the time complexity of grid-based algorithms. The size of the grid structure is directly inﬂuenced by the size of the cell and the size of the cell determines the resolution at which data may be clustered. The resolution of the grid determines the clustering accuracy and the ability to recognize diverse data cluster boundaries.

We have presented classical grid-based algorithms that use uniform grids and several variant classes that deal with specific challenges of using uniform grid structures. Adaptive algorithms permit grid structures that offer finer resolutions over some regions of the data space in order to deal with data that has highly irregular or concentrated data distributions. Axis shifting algorithms deal with local variations in data shape clusters and density of the data by translating the original grid structure across the data space. Grid-based algorithms for high dimensional data deal with a problem that is inherently exponential in nature; constructing a grid structure that partitions the data in each dimension will create one that is exponential in the number of dimensions. These algorithms choose ways to filter the grid structure to only investigate relevant subspaces or to filter the original subspace in order to only select the most relevant attributes and filter noise.

The time complexities of several grid-based algorithms discussed in this chapter are summarized in Table 1.4. Figure 1.6 summarizes the grid-based algorithms we have discussed in this chapter and their relationships. CGA is a low dimensional grid-based clustering algorithm that can be applied to high dimensional data after PCA is applied to the data [32]. GCHL is a variant of GRIDCLUS that is suitable for high dimensional data [23]. Some current research on grid-based clustering focuses on parallel implementation of grid-based clustering such as PGMCLU[30] and GridDBSCAN [19], grid-based clustering of fuzzy query results[2], and domain-speciﬁc applications such as grid-based clustering on data streams such as SGCS, D-Stream , and DUCstream [22, 5, 8].

(25)

FIGURE 1.3: Relationships among grid-based algorithms

TABLE 1.4: Time complexity of grid based algorithms

Algorithm Name Time Complexity

STING O(K),Kis the number of cells at bottom layer WaveCluster O(N),N is the number of data objects

NSGC O((2w)d_),_d_{is the dimensionality,}_w_{is number of iterations}

ADCC,ASGC O(md) +O(N),mis the number of intervals in each dimension

GDILC O(N)

CLIQUE,MAFIA O(cp₊_pN_),_p_{is the highest subspace dimension selected,}_c_{is a constant}

OptiGrid O(d·N·logN)

(26)

(27)

[1] Rakesh Agrawal, Johannes Gehrke, Dimitrios Gunopulos, and Prabhakar Raghavan. Automatic subspace clustering of high dimensional data for data mining applications. In Proceedings of the 1998 ACM SIGMOD international conference on Management of data, SIGMOD ’98, pages 94–105, New York, NY, USA, 1998. ACM.

[2] Mounir Bechchi, Amenel Voglozin, Guillaume Raschia, and Noureddine Mouaddib.

Multi-Dimensional Grid-Based Clustering of Fuzzy Query Results. Rapport de

recherche, INRIA, 2008.

[3] Pavel Berkhin. Survey of clustering data mining techniques. Technical report, Accrue Software, Inc., San Jose, CA, 2002.

[4] Jae-Woo Chang and Du-Seok Jin. A new cell-based clustering method for large,

high-dimensional data in data mining applications. In Proceedings of the 2002 ACM

sym-posium on Applied computing, SAC ’02, pages 503–507. ACM, 2002.

[5] Yixin Chen and Li Tu. Density-based clustering for real-time stream data. In

Proceed-ings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, KDD ’07, pages 133–142, 2007.

[6] Chun-Hung Cheng. ENCLUS: Entropy-based Subspace Clustering for Mining Numer-ical Data. 1999.

[7] Martin Ester, Hans-Peter Kriegel, Jorg Sander, and Xiaowei Xu. A density-based

algorithm for discovering clusters in large spatial databases with noise. In KDD’96,

pages 226–231, 1996.

[8] Jing Gao, Jianzhong Li, Zhaogong Zhang, and Pang-Ning Tan. An incremental data

stream clustering algorithm based on dense units detection. In Proceedings of the

9th Pacific-Asia conference on Advances in Knowledge Discovery and Data Mining, PAKDD’05, pages 420–425, Berlin, Heidelberg, 2005. Springer-Verlag.

[9] Peter Grabusts and Arkady Borisov. Using grid-clustering methods in data classiﬁca-tion, 2002.

[10] Chaoqun Ma Guojun Gan and Jianhong Wu. Data Clustering: Theory, Algorithms,

and Applications. SIAM, 2007.

[11] Alexander Hinneburg, Er Hinneburg, and Daniel A. Keim. An eﬃcient approach to

clustering in large multimedia databases with noise. InProc. 4rd Int. Conf. on

Knowl-edge Discovery and Data Mining, pages 58–65. AAAI Press, 1998.

[12] Alexander Hinneburg and Daniel A. Keim. Optimal grid-clustering: Towards breaking

the curse of dimensionality in high-dimensional clustering. InVLDB’99, pages 506–517.

Morgan Kaufmann, 1999.

(28)

[13] MR ILANGO and Dr V MOHAN. A survey of grid based clustering algorithms. In International Journal of Engineering Science and Technology, pages 3441–3446, 2010. [14] Wei keng Liao, Ying Liu, and Alok Choudhary. A grid-based clustering algorithm using

adaptive mesh reﬁnement, 2004.

[15] Hinrichs Klaus Helmer. The grid ﬁle system: implementation and case studies of ap-plications. Technical report, ETH Zurich, 1985.

[16] Nancy P. Lin, Chung-I Chang, Nien-yi Jan, Hao-En Chueh, Hung-Jen Chen, and

Wei-Hua Hao. An axis-shifted crossover-imaged clustering algorithm.WTOS, 7(3):175–184,

March 2008.

[17] Nancy P. Lin, Chung-I Chang, and Chao-Lung Pan. An adaptable deﬂect and conquer

clustering algorithm. In Proceedings of the 6th Conference on WSEAS International

Conference on Applied Computer Science - Volume 6, ACOS’07, pages 155–159, 2007. [18] Eden W.M. Ma and Tommy W.S. Chow. A new shifting grid clustering algorithm.

Pattern Recognition, 37(3):503–514, 2004.

[19] S. Mahran and K. Mahar. Using grid for accelerating density-based clustering. In 8th IEEE International Conference on Computer and Information Technology, pages 35–40, 2008.

[20] Boriana L. Milenova and Marcos M. Campos. O-cluster: scalable clustering of large

high dimensional data sets. InProceedings from the IEEE International Conference on

Data Mining, pages 290–297, 2002.

[21] H. Nagesh, S. Goil, and A. Choudhary. MAFIA: Eﬃcient and scalable subspace clus-tering for very large data sets, 1999.

[22] Nam Hun Park and Won Suk Lee. Statistical grid-based clustering over data streams. SIGMOD Rec., 33(1):32–37, March 2004.

[23] Abdol Hamid Pilevar and M. Sukumar. Gchl: A grid-clustering algorit hm for

high-dimensional very large spatial data bases.Pattern Recognition Letters, pages 999–1010,

2005.

[24] Jorma Rissanen. Stochastic Complexity in Statistical Inquiry Theory. World Scientiﬁc

Publishing Co., Inc., River Edge, NJ, USA, 1989.

[25] E. Schikuta. Grid-clustering: an eﬃcient hierarchical clustering method for very large

data sets. Proceedings of the 13th International Conference on Pattern Recognition,

2:101–105 vol.2, 1996.

[26] Erich Schikuta and Martin Erhart. The bang-clustering system: Grid-based data

anal-ysis. InProc. Sec. Int. Symp. IDA-97, pages 513–524. Springer-Verlag, 1997.

[27] Gholamhosein Sheikholeslami, Surojit Chatterjee, and Aidong Zhang. Wavecluster:

A multi-resolution clustering approach for very large spatial databases. In VLDB’98,

pages 428–439, 1998.

[28] Wei Wang, Jiong Yang, and Richard R. Muntz. Sting: A statistical information grid

approach to spatial data mining. InVLDB’97, Proceedings of 23rd International

Con-ference on Very Large Data Bases, August 25-29, 1997, Athens, Greece, pages 186–195. Morgan Kaufmann, 1997.