
WWW.JPRR.ORG

Grid Clustering with Genetic Algorithm and Tabu Search Process

Computer Vision and Pattern Recognition Unit, Indian Statistical Institute 203, B.T. Road, Kolkata-700108, India

Computer Division, Saha Institute of Nuclear Physics Bidhannagar, Kolkata-700064, India

Abstract

In this paper we present an effective hybrid genetic algorithm for solving clustering problems with a multi-dimensional grid structure. The algorithm is basically a combination of Genetic Algorithm (GA) and Tabu Search (TS), so that we can efficiently utilize the stochastic search ability of GA and the hill climbing as well as the local search capabilities of TS. Such hybridization helps to enhance the capability of both search techniques and to reduce their disadvantages. The application of TS along with GA also greatly reduces the possibility of the stochastic search process being trapped in a locally optimal solution. The proposed grid structure based clustering method is a two-step process. It starts by decomposing the data set into a finite number of grid cells. At the end of the grid partitioning process, each non-empty grid cell is considered to be a small cluster or sub-cluster.

In the second step, the hybrid genetic algorithm is invoked to merge the sub-clusters hierarchically and iteratively until the expected set of k clusters is obtained. The performance of this new technique has been tested on various multi-dimensional synthetic and real data sets. A comparison with related clustering and classification techniques has also been made on some data sets.

Keywords: Hybrid technique, genetic algorithm, tabu search, clustering, split and merge approach, optimization.

1. Introduction

Clustering and classification are two of the most useful approaches for pattern recognition, image processing, decision making, data mining and knowledge discovery in databases [1-5]. Clustering tries to identify groups or trends as well as distribution patterns of the data when there are no a priori training samples and little prior knowledge. As a tool, clustering has wide applications in many applied fields such as biomedicine, signal analysis, life science taxonomy, remote sensing, demography and social sciences, geology and anthropology, and economics and planning.

Traditionally, clustering approaches can be either hierarchical or non-hierarchical. Given N objects, a hierarchical approach makes a hierarchy of N, N-1, ..., 2, 1 clusters. If the hierarchy goes from N clusters to 1, it is an agglomerative approach. If the hierarchy goes from 1 to N clusters, it is a divisive approach. Non-hierarchical approaches assume the number of final clusters is known a priori (say, k) and find only one clustering of k groups. The final clustering is expected to possess some desirable properties like proximity, similarity, homogeneity and good continuation among members of each cluster, along with specialized properties that a specific problem may demand. In order to achieve this goal, approaches like split and merge are employed [1, 2]. Some optimization techniques are also useful to get a stable solution to the problem.

For pattern features computed in a metric space, concepts like proximity, similarity and homogeneity are estimated in terms of some form of distance function on the feature space. The mean value denotes the position from which the sum of squared distances to all data in the feature space is minimum. This property is utilized in the iterative K-means clustering approach, which is the forerunner of a wide variety of clustering approaches [1, 2]. Further progress was made by utilizing the split and merge procedure, the pioneer being the ISODATA technique [2]. However, simple distance or density based approaches are suitable only for convex cluster detection, so link-clustering, multi-seed clustering and graph-based clustering were advanced to capture non-convex clusters. An example of multi-seed clustering is CURE (Clustering Using REpresentatives) [3]. Another approach named ROCK (RObust Clustering using linKs) is based on links between pairs of data objects [4]. Similarly, in a recent graph based scheme called Chameleon [5], splitting is done by graph partition while the min-cut bisection method is used to decide the pair of clusters for the merging process. The approach becomes computationally expensive because the min-cut bisection of a graph is time-consuming. The situation is aggravated by an increase in the dimensionality of the data.

A generalization of the approach is soft computing with fuzzy boundaries, for which fuzzy set theory is very effective. The most popular example is fuzzy c-means and its many variants [6]. Use of rough set theory also gives some kind of overlapped partitions. In some competitive frameworks, genetic algorithms have been utilized to obtain semi-optimum clustering of data [7]-[9]. Genetic algorithms have also been used to optimize other functions in, say, a fuzzy framework [10]-[12].

The split-and-merge principle leads to a class of mixed clustering methods. Some researchers have used GA for the merging process in their split-and-merge technique. The grid-partitioning procedure is also based on the split and merge principle and is advantageous due to its fast processing speed. Here, the computation time is independent of the number of data points and depends only on the number of grid cells in the data space. WaveCluster [13] is a grid-based process which maps the data onto a multi-dimensional grid and applies a wavelet transformation in the feature space. The technique may be less reliable at identifying clusters when they are connected by a bridge of data points or outliers. DENCLUE (DENsity based CLUstEring) [14] is another grid-based approach where the partitioning is locality-based and the clusters are determined mathematically by identifying the density attractors. The density attractors are local maxima of the overall density function. However, DENCLUE requires a careful choice of the clustering parameters. Similarly, CLIQUE (CLustering In QUEst) [15] is another grid-based clustering approach that identifies the sparse and crowded areas in space, thereby discovering the overall distribution patterns of the data set. A cluster is then defined as a maximal set of connected dense units. The accuracy of the clustering result may, however, be degraded at the expense of simplicity of the method.

Though there are various principles and techniques, they are not sufficient for clustering data of diverse shape, density and size. In this paper we propose a hybrid approach that aims to a) identify clusters of irregular shape and size (which contain concavity and nested shapes); b) cluster data with non-uniform density; c) be independent of data order and handle high dimensional data; and d) be insensitive to noise or outlier data. The proposed algorithm is basically a split-and-merge based grid-clustering approach and is named Genetically Guided Grid Clustering (GGGC).

Here, the hybridization has been attained by a combination of the genetic algorithm and the Tabu Search (TS) method. Both algorithms depend on a population based search technique where the population is represented by a set of individuals. Each individual is a string of binary values chosen from {0, 1}. The search process advances to a solution depending on the individuals selected using the fitness score. The fitness score of an individual determines its survival strength in a population.

We have tested the hybrid grid clustering technique on two classes of data. One class consists of two-dimensional synthetic data. Another class of real multi-dimensional data (chosen from UCI Machine Learning Database Repository [16]) is used to test the approach for pattern classification in data mining. The performance of the hybrid algorithm has been estimated by comparing with other relevant and recently developed clustering and classification techniques.


The rest of the paper is organized as follows. The basic concepts of Simple Genetic Algorithm (SGA) and Tabu Search (TS) method are briefly presented in the following section. The proposed hybrid Genetically Guided Grid Clustering (GGGC) is narrated in Section 3. Section 4 describes the experimental results on various data sets in two and multi-dimensional spaces. Finally, the conclusion is presented in Section 5.

2. Basic concepts of SGA and TS method

Both GA and TS are iterative search procedures used for searching optimum solutions in a multi-dimensional data space. However, GA is a class of stochastic search procedure and TS is a local search method. The objective of the stochastic search procedure is to find an optimal solution over a wide search space. On the other hand, a major task of the TS method is to emerge from a local optimum and eventually converge to a global solution.

2.1 Simple Genetic Algorithm

The Simple Genetic Algorithm (SGA) is capable of adaptive and robust search over a wide range of data space. The process is inspired by the Darwinian principle of survival of the fittest individuals and natural selection. The technique was first introduced by Holland [17] for use in adaptive systems. Later on, it was employed by several researchers in solving various optimization problems effectively and efficiently.

The procedure starts with the initialization of a few parameters. The operation of SGA is dependent on three basic operators namely, the reproduction operator, the crossover operator and the mutation operator. The detailed operations by these operators are lucidly described in [18]. The step-wise operation of SGA can be described algorithmically as follows.

Step 1. Randomly generate the initial population of µ individuals. Initialize the parameters δ and η, where δ is the crossover probability and η is the mutation probability. Start the process with g = 1.

Step 2. Evaluate the fitness score of each individual p_i, ∀i ∈ {1, ..., µ}, of the entire population based on the objective function f(p_i), where the p_i's denote objective variables.

Step 3. Select a pair of individuals, say p_a and p_b, at random depending on their fitness values (using the roulette wheel method) from the population of µ individuals.

Step 4. Conduct crossover between the selected individuals p_a and p_b with probability δ and mutate each bit of each parent with mutation probability η.

Step 5. Each pair of parents (p_a and p_b) creates a pair of offspring (p′_a, p′_b) as new individuals. In this way, generate a new population p′_j, ∀j ∈ {1, ..., µ}, for the next generation.

Step 6. Terminate the process if either the population converges on the basis of fitness criteria or g = G_max is satisfied, where G_max is the maximum number of allowed generations. Otherwise, make g = g + 1 and go to Step 2.
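The six steps above can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the function names (`roulette_select`, `simple_ga`), the default parameter values and the use of a non-negative fitness to be maximised are all assumptions, and the convergence test of Step 6 is simplified to a fixed number of generations.

```python
import random

def roulette_select(population, fitness, rng):
    """Step 3: pick one individual with probability proportional to its fitness."""
    total = sum(fitness)
    if total <= 0:                       # degenerate case: fall back to a uniform pick
        return rng.choice(population)
    pick = rng.uniform(0, total)
    running = 0.0
    for individual, score in zip(population, fitness):
        running += score
        if running >= pick:
            return individual
    return population[-1]

def simple_ga(objective, n_bits, mu=20, delta=0.8, eta=0.01, g_max=50, seed=0):
    """Maximise a non-negative `objective` over bit strings (hypothetical helper)."""
    rng = random.Random(seed)
    # Step 1: random initial population of mu bit strings.
    population = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(mu)]
    best = max(population, key=objective)
    for _ in range(g_max):               # Step 6, simplified to a generation limit
        # Step 2: fitness score of every individual.
        fitness = [objective(p) for p in population]
        new_population = []
        while len(new_population) < mu:
            pa = roulette_select(population, fitness, rng)[:]
            pb = roulette_select(population, fitness, rng)[:]
            # Step 4: single-point crossover with probability delta ...
            if n_bits > 1 and rng.random() < delta:
                cut = rng.randint(1, n_bits - 1)
                pa, pb = pa[:cut] + pb[cut:], pb[:cut] + pa[cut:]
            # ... followed by bitwise mutation with probability eta.
            for child in (pa, pb):
                for i in range(n_bits):
                    if rng.random() < eta:
                        child[i] ^= 1
                new_population.append(child)   # Step 5: offspring join the next pool
        population = new_population[:mu]
        best = max(population + [best], key=objective)
    return best
```

For example, `simple_ga(sum, 12)` searches for the 12-bit string with the largest number of ones. Keeping the best individual seen so far is a convenience of this sketch and is not part of the steps listed above.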

The SGA has been modified by introducing adaptive mutation when the algorithm is used in the merging process of the proposed method (GGGC). The modified approach is named as modified GA (MGA).

2.2 Tabu Search Method

Tabu search (TS) is a metaheuristic local search approach that can be used to solve combinatorial optimization problems [19]. Unlike simple hill climbing search techniques, it guards against getting trapped in a locally optimal solution. Also, compared to GA, TS has a higher operation speed, but its performance depends on the initialization process [20].

This iterative search procedure starts with a set of probable or feasible solutions. Each solution is a string of bits, chosen from {1, 0}, which is called the array. Let A_t, A_c and A_b denote the trial, current and best array(s) and O_t, O_c and O_b be the corresponding trial, current and best objective function value(s), respectively. The process takes the current solution A_c as the starting point of its operation. Then, the trial solutions A_t are generated through some moves. After each iteration, a best solution A_b is found. As the search progresses, it may be found that the best solution is tabu but satisfies the aspiration criteria. It is then considered to be the new current solution. In TS, the aspiration criteria are the rules that override tabu restrictions: if a certain move is forbidden by tabu restrictions during the search, the move is still allowed when the aspiration criteria are satisfied.

Thus, the algorithm proceeds as follows.

Step 1: Initialize the parameters MTLS (Maximum Tabu List Size), the number of trial solutions λ, the maximum number of iterations I_max and the probability threshold value. Let A_c be an arbitrary solution and O_c be the corresponding objective function value. Initially, let A_b = A_c, O_b = O_c, TTL (Tabu List Length) = 0 and I = 1.

Step 2: Using A_c, generate λ trial solutions A^1_t, A^2_t, ..., A^λ_t and evaluate their corresponding objective function values O^1_t, O^2_t, ..., O^λ_t. Given a current solution A_c, one can generate a trial solution using several strategies. In our case, given A_c, we have flipped a bit of A_c if the probability threshold is higher than a randomly generated value between 0 and 1. Otherwise, the corresponding bit is kept unchanged.

Step 3: Arrange the objective function values O^1_t, O^2_t, ..., O^λ_t in ascending order and denote them as O^1′_t, O^2′_t, ..., O^λ′_t. If O^1′_t is not tabu, or if it is tabu but O^1′_t < O_b (in case of minimization), then make A_c = A^1′_t and O_c = O^1′_t and go to Step 4. Otherwise, let A_c = A^L_t and O_c = O^L_t, where O^L_t is the best objective function value among O^1′_t, O^2′_t, ..., O^λ′_t that is not tabu, and go to Step 4. If O^1′_t, O^2′_t, ..., O^λ′_t are all tabu, then go to Step 2.

Step 4: Insert A_c at the bottom of the tabu list and increment TTL by 1. If TTL = MTLS + 1 then delete the first element in the list and make TTL = TTL - 1. If O_b > O_c then A_b = A_c and O_b = O_c. Terminate the process if I = I_max with A_b as the best solution and O_b as the corresponding best objective function value. Otherwise, make I = I + 1 and go to Step 2.
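A minimal Python sketch of Steps 1 to 4 for a minimisation problem is given below. It is an illustration under assumptions rather than the authors' code: the name `tabu_search`, the argument names and the default values are hypothetical, the tabu list stores whole bit strings, and a round in which every trial is tabu is counted as one iteration so that the loop always terminates.

```python
import random
from collections import deque

def tabu_search(objective, start, lam=20, mtls=10, i_max=10, flip_prob=0.1, seed=0):
    """Minimise `objective` over bit strings, following Steps 1-4 above."""
    rng = random.Random(seed)
    # Step 1: the start solution is both the current and the best solution.
    current = list(start)
    best, best_value = list(current), objective(current)
    tabu = deque(maxlen=mtls)   # the oldest entry is dropped once MTLS is exceeded
    for _ in range(i_max):
        # Step 2: flip a bit when the threshold exceeds a random draw in [0, 1).
        trials = [[b ^ 1 if flip_prob > rng.random() else b for b in current]
                  for _ in range(lam)]
        # Step 3: ascending order of objective value (minimisation).
        trials.sort(key=objective)
        chosen = None
        for rank, trial in enumerate(trials):
            is_tabu = tuple(trial) in tabu
            # Aspiration: the top-ranked trial may override its tabu status.
            aspiration = rank == 0 and objective(trial) < best_value
            if not is_tabu or aspiration:
                chosen = trial
                break
        if chosen is None:
            continue            # all trials are tabu: generate a new set of trials
        current = chosen
        # Step 4: record the move and update the best solution found so far.
        tabu.append(tuple(current))
        value = objective(current)
        if value < best_value:
            best, best_value = list(current), value
    return best, best_value
```

Calling `tabu_search(sum, [1] * 10)` looks for a 10-bit string with as few ones as possible; the returned value can never be worse than that of the start solution because the best solution is only replaced by a strictly better one.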

3. Grid Clustering

The grid-clustering approach differs from other clustering methods in that it uses a multi-resolution grid data structure. It quantizes the space into a finite number of cells that form a grid structure on which all clustering operations are performed. In the proposed two-stage clustering technique (GGGC), the entire feature space is first partitioned hierarchically by the multi-dimensional grid structure into a number of cells by the Grid based Decomposition Algorithm (GDA). The cells containing data points are then considered as sub-clusters. In the second stage, the sub-clusters are merged hierarchically using the hybrid method, namely Cluster Merging with Hybrid Algorithm (CMHA). CMHA is based on the combination of a modified SGA (MGA) and a tabu search method. The MGA and TS processes are invoked sequentially to run iteratively, and in each run some sub-clusters are merged. The MGA is started first. When the MGA terminates with an optimal solution after G_max iterations, or when the solution has not changed for a long time, TS is invoked to run iteratively. On completion of the TS procedure, another round of MGA followed by TS is executed. The completion of one MGA and TS cycle is referred to as a single epoch. This sequence is repeated until the expected number of k clusters is found. The use of TS is effective at this juncture since its inherent local search capability helps the genetic method to escape from local optima. Thus, TS enhances the performance of the genetic method by reducing the overall number of iterations.
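The epoch structure described above can be summarised by the following control-flow sketch. It is an assumption-laden outline, not the authors' code: `run_mga`, `run_ts` and `merge` are hypothetical callables supplied by the caller, and the `max_epochs` limit and the no-progress check are safeguards added for this illustration.

```python
def run_epochs(subclusters, k, run_mga, run_ts, merge, max_epochs=100):
    """Alternate MGA and TS, merging after each epoch, until k clusters remain."""
    epochs = 0
    while len(subclusters) > k and epochs < max_epochs:
        ga_solution = run_mga(subclusters)               # the MGA is started first
        ts_solution = run_ts(subclusters, ga_solution)   # TS starts from the MGA output
        merged = merge(subclusters, ts_solution)         # some sub-clusters are merged
        epochs += 1
        if len(merged) >= len(subclusters):
            break                                        # no merge happened: stop (safeguard)
        subclusters = merged
    return subclusters, epochs
```

Each pass through the loop corresponds to one epoch, i.e., one MGA run followed by one TS run.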


Let X = (x_1, x_2, ..., x_n) be a set of n patterns, where x_i is the i-th pattern consisting of a tuple of features (a_i1, a_i2, ..., a_id) and d is the dimension of the feature space. A block is a d-dimensional rectangular shaped cube containing up to a maximum of bs patterns (bs = block size). Let φ denote the empty set, let k be the expected number of clusters, and let m (k < m ≤ n) be the finite number of blocks B_j, ∀j ∈ {1, 2, ..., m}, containing data point(s) created by the multi-dimensional grid structure. The blocks satisfy the following properties.

• x_i ∈ B_j for some j ∈ {1, 2, ..., m}

• B_j1 ∩ B_j2 = φ if j1 ≠ j2

• B_j ≠ φ

• ∪ B_j = X.
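The four properties simply state that the non-empty blocks form a partition of X. A small check, written as a sketch in which blocks hold pattern indices 0..n-1 (an assumption made here so that duplicate patterns cause no ambiguity; the name `is_valid_partition` is hypothetical), could look like this:

```python
def is_valid_partition(blocks, n):
    """Check the block properties for blocks given as collections of pattern indices."""
    if any(len(block) == 0 for block in blocks):
        return False                                  # every block must be non-empty
    flat = [i for block in blocks for i in block]
    # Exactly n entries covering 0..n-1 means the blocks are disjoint and their union is X.
    return len(flat) == n and set(flat) == set(range(n))
```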

3.1 Data Decomposition with Grid Structure

The process GDA is independent of the hybrid genetic algorithm and partitions the data/feature space into a reasonably large number of blocks or grid cells. Each cell either contains a finite number of data points or remains empty. The space is initially partitioned into p (= 2^d) blocks depending on the dimension d of the feature space. Such a choice automatically increases the number of initial partitions as the dimensionality of the data increases. Usually, for higher dimensional problems, the data size should be larger to get good clustering. However, finding a bound for choosing the optimum initial clusters may be an interesting research problem. Each block is then examined, and if any side length of a non-empty block is larger than a prespecified threshold value, the same partitioning process is reinvoked on that block. Otherwise, the process moves on to other untested blocks until all blocks have been examined. The partitioning process advances in a hierarchical way. The progress of the process is depicted in Fig. 1.

At the end of the entire GDA, the grid cells that contain data point(s) form the sub-clusters of size bs. Here, bs = |B_j|, where |B_j| stands for the size (the number of data points) of the cluster.

The GDA is implemented as follows.

Step 1: For each object/pattern x_i, ∀i ∈ {1, 2, ..., n}, find the distance d_min between x_i and its nearest neighbor in the data set X, as follows.

$$ d_{min}(x_i) = \min_{j \ne i} \| x_i - x_j \| \tag{1} $$

where j ∈ {1, 2, ..., n} and $\| x_i - x_j \| = \sqrt{\sum_{l=1}^{d} (x_{il} - x_{jl})^2}$.

Step 2: Compute d_av, the average of the minimum distances d_min(x_i), ∀i ∈ {1, 2, ..., n}, given by

$$ d_{av} = \frac{1}{n} \sum_{i=1}^{n} d_{min}(x_i) \tag{2} $$

Step 3: Choose the threshold value T_s of a side of the rectangular shaped block, according to the following equation, for deciding whether to continue the partitioning process for the corresponding block.

$$ T_s = u * d_{av} \tag{3} $$

where the value of u lies between 5 and 10 so that a block contains at least 5 data points.

Fig. 1: Progress of the decomposition process using GDA in 2-D data space. (a) Data space before start of the partitioning process. (b) Initial partitioning of the data space. (c) Repartition of the non-empty blocks in the intermediate stage of GDA. (d) Complete partitioning after checking all non-empty blocks.

Step 4: Initially, consider the entire feature space as a single block and start the partitioning process using d-dimensional grid structure.

Step 5: Partition a d-dimensional block which satisfies the partitioning conditions, into p number of equal sized (side length wise) blocks, where p= 2^d.

Step 6: Examine each block with the partitioning conditions. The conditions are that the block must be non-empty and at least one side of the corresponding block must be greater than Ts. If the partitioning conditions are satisfied for a block, go to step 5. Otherwise, go to step 7.

Step 7: Terminate the process if all non-empty blocks are checked with the partitioning conditions.
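Steps 1 to 7 of the GDA can be sketched in pure Python as shown below. This is a hedged illustration, not the authors' implementation: the function names are hypothetical, the "entire feature space" of Step 4 is assumed to be the bounding box of the data, at least two distinct points are assumed so that d_av > 0, and an explicit stack replaces the recursive description.

```python
import math

def average_nn_distance(points):
    """d_av of equations (1)-(2): mean distance to the nearest neighbour (needs >= 2 points)."""
    nearest = [min(math.dist(p, q) for j, q in enumerate(points) if j != i)
               for i, p in enumerate(points)]
    return sum(nearest) / len(nearest)

def grid_decompose(points, u=5.0):
    """Return the non-empty grid cells (lists of points) produced by the splitting rule."""
    d = len(points[0])
    t_s = u * average_nn_distance(points)            # equation (3)
    if t_s <= 0:
        raise ValueError("d_av must be positive for the partitioning to terminate")
    # Step 4: start from a single block (assumed here: the bounding box of the data).
    lower = [min(p[k] for p in points) for k in range(d)]
    upper = [max(p[k] for p in points) for k in range(d)]
    stack = [(lower, upper, list(points))]
    cells = []
    while stack:
        lo, hi, members = stack.pop()
        # Step 6: a non-empty block is split only if some side is longer than T_s.
        if all(hi[k] - lo[k] <= t_s for k in range(d)):
            cells.append(members)
            continue
        # Step 5: split into up to 2^d equal-sized blocks; empty ones are never created.
        mid = [(lo[k] + hi[k]) / 2 for k in range(d)]
        children = {}
        for p in members:
            key = tuple(p[k] >= mid[k] for k in range(d))
            children.setdefault(key, []).append(p)
        for key, sub in children.items():
            sub_lo = [mid[k] if key[k] else lo[k] for k in range(d)]
            sub_hi = [hi[k] if key[k] else mid[k] for k in range(d)]
            stack.append((sub_lo, sub_hi, sub))
    return cells                                     # Step 7: all blocks checked
```

The quadratic nearest-neighbour loop mirrors Step 1 directly; a spatial index could replace it, but that would be a different technique from the one described here.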

3.2 Cluster Merging with Hybrid Algorithm

Let the number of grid cells or sub-clusters be m after the termination of GDA. The sub-clusters are denoted by B_i, ∀i ∈ {1, 2, ..., m}, and are individually homogeneous in nature. At this stage, the process CMHA (a combination of MGA and TS) is invoked over the B_i's for merging.

The MGA of CMHA starts with a set of µ individuals called the population and, after passing through all three basic phases of SGA (discussed in Section 2), terminates when the stopping criterion for merging sub-clusters is fulfilled. Each individual p_j, ∀j ∈ {1, 2, ..., µ}, is a string created randomly with a uniform distribution of 0's and 1's. The length of each string p_j is m bits. If the i-th bit of p_j is 1, it denotes the presence of a small cluster B_i. If the corresponding bit is 0, it indicates the absence of B_i. During the merging process, the B_i's represented by 1 are considered as candidate clusters with which one or more clusters denoted by 0 are merged.


According to the basic phases of SGA, the MGA chooses two individuals p_a and p_b randomly from the pool of µ individuals. They are then crossed over using the single point crossover operation with probability δ to generate two offspring p′_a and p′_b, respectively, for the next generation. The third operation, called mutation, is performed bitwise over p′_a and p′_b with the adaptive mutation probability η_adap to produce p″_a and p″_b, respectively. The value of η_adap for the τ-th run is evaluated as follows.

$$ \eta_{adap} = \eta^{0} * \tau \tag{4} $$

where η⁰ is the initial mutation probability.
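The crossover and adaptive mutation operators can be illustrated with the short sketch below. The function names are hypothetical, and clamping the rate to 1.0 is a safeguard added here; equation (4) itself states no upper limit.

```python
import random

def adaptive_mutation_rate(eta0, tau):
    """Equation (4): the mutation probability grows linearly with the run index tau."""
    return min(1.0, eta0 * tau)          # the clamp is an assumption of this sketch

def single_point_crossover(pa, pb, cut):
    """Exchange the tails of two equal-length bit strings after position `cut`."""
    return pa[:cut] + pb[cut:], pb[:cut] + pa[cut:]

def mutate(bits, rate, rng):
    """Flip each bit independently with probability `rate`."""
    return [b ^ 1 if rng.random() < rate else b for b in bits]
```

With η⁰ = 0.002, for instance, the third run (τ = 3) uses a mutation probability of 0.006.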

Now a new pool of µ individuals is generated with p″_i, ∀i ∈ {1, 2, ..., µ}. The fitness value of each individual p″_i is calculated using equation (9) and the genetic optimization process is re-initiated. At the termination of the process, the MGA provides an optimal solution as output.

TS then starts (after termination of the MGA, as discussed before) with the optimal solution of the MGA as its current solution A_c and randomly generates a pool of λ individuals, where λ < µ. The individuals in this pool are known as trial solutions and are denoted by A^j_t, ∀j ∈ {1, 2, ..., λ}. Each A^j_t consists of m bits chosen from {0, 1}. The fitness values of the λ trial solutions are evaluated and represented by O^j_t, ∀j ∈ {1, 2, ..., λ}. The best solution A_b is then picked by inspecting the fitness values of the λ trial solutions. If A_b satisfies the conditions defined in the tabu search algorithm, it is taken as the current solution (i.e., A_c = A_b) of the next generation. The process is continued for I_max (I_max << G_max) iterations to achieve the final best solution of the tabu search process.

At the end of each run (i.e., each epoch) of the two iterative processes (MGA and TS), some among the m clusters are merged to obtain m′ (m′ < m) clusters. The m sub-clusters are represented as a set of strings {B_1, B_2, ..., B_m}. Some (say m_0) of these B_i, ∀i ∈ {1, ..., m}, are labelled by 0's and the rest, i.e., m − m_0 (= m_1, say), are labelled by 1's. Let the subset of B_i's labelled 0 be written as B^0 = {B^0_1, B^0_2, ..., B^0_{m_0}} and the rest be defined as B^1 = {B^1_1, B^1_2, ..., B^1_{m_1}}.

Each B^1_i, ∀i ∈ {1, 2, ..., m_1}, is considered to be a candidate cluster with which one or more sub-clusters selected from the array B^0 will be merged (following some conditions). During the run, the combined search processes (MGA and TS) find an object B_e from the B^0_j's, ∀j ∈ {1, ..., m_0}, for which both the adjacency and density conditions are satisfied. According to the adjacency condition, B_e is chosen for the single B^1_i, ∀i ∈ {1, 2, ..., m_1}, for which the number of adjacent data points of the two clusters (B_e and one sub-cluster of the B^1_i's) is maximum. The density condition is satisfied iff the density difference of the selected clusters is below a threshold value T_d.

The adjacency of two fragmented clusters is verified as follows.

Step 1: Define the radius R of the circular region considering a boundary point of the fragmented cluster as its centre [Here, R = k_1 ∗ d_av where 0 < k_1 ≤ 12 for the data sets selected for the experiment].

Step 2: Select any two fragmented clusters B^0_c and B^1_d from B^0 and B^1, respectively.

Step 3: Count the number of boundary points of B^0_c and B^1_d which reside within a radius R. Let it be N_b, and let the object densities of B^0_c and B^1_d be D^0_c and D^1_d, respectively.

Step 4: Consider B^0_c as the candidate cluster for merging with B^1_d if abs(D^0_c − D^1_d) ≤ T_d, where T_d is the threshold value of the density [Here, T_d = k_2 ∗ d_av where 0 < k_2 ≤ 6 for the problems considered in the experiment].

Step 5: Terminate the algorithm.

In the above algorithm the values of both user defined parameters R and T_d depend on d_av. The value of d_av basically represents the density of the data set points. This dependence is natural, since the objective of the algorithm is to choose two adjacent clusters which are close to each other (measured by R in Step 1) and very similar in density (measured in Step 4). It is noted that if the density is high, then the values of k_1 and k_2 are low, and if the density is low, i.e., the data points are scattered, then the values of k_1 and k_2 tend to 12 and 6, respectively.
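The adjacency and density tests can be sketched as follows. Several details are assumptions made only for this illustration: each cluster is a dictionary with a precomputed list of `boundary` points and a scalar `density` (the text does not specify how either is obtained), N_b is interpreted as the number of boundary points of either cluster lying within R of a boundary point of the other, and the function names are hypothetical.

```python
import math

def adjacency_count(boundary_a, boundary_b, radius):
    """N_b under the stated interpretation: boundary points within `radius` of the other cluster."""
    near_a = sum(any(math.dist(p, q) <= radius for q in boundary_b) for p in boundary_a)
    near_b = sum(any(math.dist(p, q) <= radius for p in boundary_a) for q in boundary_b)
    return near_a + near_b

def best_partner(free, candidates, d_av, k1, k2):
    """Index of the candidate cluster (label 1) that a 0-labelled cluster should join, or None."""
    radius = k1 * d_av                     # R = k_1 * d_av
    t_d = k2 * d_av                        # T_d = k_2 * d_av
    best_index, best_count = None, 0
    for index, cand in enumerate(candidates):
        if abs(free["density"] - cand["density"]) > t_d:
            continue                       # density condition violated
        n_b = adjacency_count(free["boundary"], cand["boundary"], radius)
        if n_b > best_count:               # adjacency condition: keep the maximum N_b
            best_index, best_count = index, n_b
    return best_index
```

A candidate with N_b = 0 is never selected in this sketch, which reflects the requirement that the two clusters be adjacent.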

Once an object B_e from the B^0_j's (for which N_b is maximum) is selected, it is included as a member of cluster C_i, ∀i ∈ {1, 2, ..., m_1}, and excluded from the list B^0. Thus, the cluster merging technique (CMHA) is continued for a string B until all B^0_j's, ∀j ∈ {1, ..., m_0}, are exhausted to generate the clusters C_i, ∀i ∈ {1, ..., m_1}. Each cluster C_i, ∀i ∈ {1, ..., m_1}, is represented as

$$ C_i = \bigcup_{l=1}^{q} B_l \tag{5} $$

where 1 ≤ q < m and m = m_0 + m_1.

Further, we have to calculate the seeds/centroids of all existing clusters. Let the seed/centroid of the fragmented cluster B_i be β_i, ∀i ∈ {1, ..., m}, and that of the newly generated cluster C_j (a collection of one sub-cluster from B^1 and one or more from B^0) be S_j, ∀j ∈ {1, 2, ..., m_1}. The centroid S_j of each C_j is computed with the following equation:

$$ S_j = \frac{1}{q} \sum_{l=1}^{q} \beta_l \tag{6} $$

where 1 ≤ q < m.

In CMHA each individual p_i, ∀i ∈ {1, 2, ..., µ}, of MGA and each individual (trial solution) A^i_t, ∀i ∈ {1, 2, ..., λ}, of TS is a string of {B_1, B_2, ..., B_m}. The fittest individual is extracted from the pool of the respective process to create a new pool of equal size for the next generation of the corresponding process. It is, therefore, required to evaluate the fitness function F(p_i) and F(A^i_t) of an individual of the MGA and TS process, respectively. The fitness depends on two important functions of a cluster C_α, ∀α ∈ {1, 2, ..., m_1}, namely D_intra(C_α) and D_inter(C_α). The function D_intra(C_α) represents the intra-distance within the cluster C_α, while D_inter(C_α) stands for the inter-distance between the C_α's. These two functions are defined by the following two equations.

$$ D_{inter}(C_\alpha) = \max_{\gamma \ne \alpha} \| S_\alpha - S_\gamma \| * | C_\alpha | \tag{7} $$

where α, γ ∈ {1, 2, ..., m_1} and C_α ∩ C_γ = φ, and

$$ D_{intra}(C_\alpha) = \sum_{\gamma=1}^{q} \frac{\| S_\alpha - \beta_\gamma \| * | B_\gamma |}{| X |} \tag{8} $$

where α ∈ {1, 2, ..., m_1} and B_γ ⊂ C_α.

Now we can define the fitness function F(p_i) or F(A^i_t) of a string or an individual as follows.

$$ F(p_i) \text{ or } F(A^i_t) = \left( \sum_{\alpha=1}^{m_1} D_{inter}(C_\alpha) - \sum_{\alpha=1}^{m_1} D_{intra}(C_\alpha) \right) * | B^1 | \tag{9} $$

where i ∈ {1, 2, ..., µ} or i ∈ {1, 2, ..., λ}.
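Equations (6) to (9) can be evaluated as in the sketch below, which is an illustration under assumptions rather than the authors' code. Each merged cluster C_α is given as a list of sub-clusters B_γ, each sub-cluster being a list of points; |C_α| and |B_γ| are taken to be numbers of data points, |X| the total number of points, and |B^1| the number m_1 of merged clusters. The names `centroid` and `fitness` are hypothetical.

```python
import math

def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    d = len(points[0])
    return tuple(sum(p[k] for p in points) / len(points) for k in range(d))

def fitness(clusters):
    """Equation (9) for a clustering given as a list of clusters of sub-clusters of points."""
    n_total = sum(len(b) for c in clusters for b in c)            # |X|
    betas = [[centroid(b) for b in c] for c in clusters]          # beta_gamma of each B_gamma
    seeds = [centroid(bs) for bs in betas]                        # equation (6)
    inter = 0.0
    intra = 0.0
    for a, cluster in enumerate(clusters):
        size = sum(len(b) for b in cluster)                       # |C_alpha|
        farthest = max((math.dist(seeds[a], s)
                        for g, s in enumerate(seeds) if g != a), default=0.0)
        inter += farthest * size                                  # equation (7)
        intra += sum(math.dist(seeds[a], beta) * len(b)
                     for beta, b in zip(betas[a], cluster)) / n_total   # equation (8)
    return (inter - intra) * len(clusters)                        # equation (9)
```

As a usage example, two clusters with seeds (1, 0) and (10, 0), the first built from the single-point sub-clusters (0, 0) and (2, 0), give an inter term of 9·2 + 9·1 = 27, an intra term of 2/3, and hence a fitness of (27 − 2/3)·2 = 158/3.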

Thus, the merging process CMHA is completed after merging m_0 small clusters in a single run. On termination of each run, the clusters C_α, ∀α ∈ {1, 2, ..., m_1}, are considered as the B_i's, with m = m_1, for the next run. Thus, the twin iterative processes are continued for multiple runs until the predefined number of clusters k (k > 1) is found. In a single run, MGA is iterated at most G_max times and TS is repeated at most I_max times.

Fig. 2: (a) The original dataset with Gaussian distribution of data points in the space. (b) Isolated two clusters after applying GGGC.

The time complexity of the GGGC technique is analyzed as follows. Let the size of the data set be n. In the first stage, i.e., for GDA, the significant time, which is O(n²), is taken by Step 1 for finding the nearest neighbor of each point of the data set, and O(n) to calculate the minimum. Therefore, the time spent by GDA is O(n²). In the second stage, the size of the population is µ and m denotes the string length. The GA is terminated after G_max iterations and the TS is continued for I_max iterations. Since I_max << G_max, the time taken by CMHA in the worst case is O(G_max µ m²). Therefore, the overall time complexity of the grid clustering algorithm is O(n² + G_max µ m²).
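As a rough worked illustration of this bound, assume the parameter values reported in Section 4.1 (G_max = 100, µ = 50) and a data set of n = 10000 points such as the one in Fig. 6. The number of sub-clusters m is not reported, so it is left symbolic, and constant factors hidden by the O-notation are ignored.

```latex
\begin{align*}
n^2 &= (10^4)^2 = 10^8, \\
G_{max}\,\mu\,m^2 &= 100 \cdot 50 \cdot m^2 = 5000\,m^2, \\
5000\,m^2 < 10^8 &\iff m < \sqrt{20000} \approx 141.
\end{align*}
```

Under these assumptions the GDA term n² dominates the bound whenever the decomposition yields fewer than about 141 sub-clusters, and the CMHA term dominates otherwise.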

4. Experimental Results

To conduct experiments on the proposed technique we have considered various types of data sets. The description of the data sets is provided in Sections 4.2 and 4.3. The objects of the synthetic data sets depicted in Fig. 2 to Fig. 9 are in R² feature space. The data sets taken for testing differ from each other in terms of the number and the shape of clusters as well as the data density. Some data sets also contain clusters in R² feature space along with some outliers. The proposed method was tested for pattern classification with real data sets in multi-dimensional feature space obtained from the UCI Machine Learning Repository [16] as well as with multi-class OCR data from which high dimensional features are extracted.

4.1 Parameter Initialization

We have chosen the parameters of the MGA and TS processes based on the values suggested in the literature [10-12, 20]. However, choosing them automatically in a data-driven manner is an interesting problem for future research. In our experiment, the self-adaptive method (see equation 4) and a population size of µ = 50 are used for the MGA in CMHA. The initial population is generated uniformly at random depending on the number of clusters B_i, ∀i ∈ {1, 2, ..., m}. The number of ultimate clusters k is defined by the user. All data sets are tested for 30 runs and in each run, the crossover probability δ and the initial mutation probability η⁰ lie in the range [0.5-0.9] and [0.002-0.005], respectively. The MGA is iterated for G_max = 100 times. For the TS process, I_max = 10 is chosen. The maximum tabu list size MTLS is 10 and the population size of the trial solutions is λ = 20. The number of test runs for the hybrid approach in merging fragmented clusters for each data set is 30.

Fig. 3: (a) The original dataset where one cluster is confined within another cluster. (b) Results obtained by the proposed approach with two separate clusters.

Fig. 4: (a) The original dataset with two equal shaped clusters before processing. (b) Isolated clusters obtained after the application of GGGC.

4.2 Cluster Identification

In our experiments, eight unique data sets with special characteristics are selected for evaluating the performance of the proposed algorithm (GGGC) in R² feature space. Among them, the data sets in Figs. 7 to 9 differ from the others due to the presence of random noise within the cluster space. The CURE [3] and ROCK [4] algorithms have been tested on some data sets depicted in Figs. 8 and 9 to compare the efficiency of the proposed clustering algorithm GGGC.

Fig. 2 shows an important and interesting data set where both clusters are close to each other. They are equal in size and density. The data points in both clusters follow the Gaussian distribution in R² space. The proposed algorithm can distinctly identify the clusters (shown in Fig. 2(b)).

Each of the data sets in Figs. 3 and 4 comprises two clusters with uniform data density. In Fig. 3 the shape of one cluster is different from that of the other. Also, the smaller one is entirely surrounded by the bigger cluster. The isolated clusters obtained after applying the proposed grid clustering algorithm are shown in Fig. 3(b). The other data set in Fig. 4 consists of two identically shaped clusters, which are partly surrounded by each other. GGGC can identify the right clusters as in Fig. 4(b).

Figs. 5 to 9 consist of special types of synthetic data. Among them, Fig. 5 contains two clusters of arbitrary shapes and different sizes with variable density. One of the clusters is very small compared to the other, and the smaller one lies partially within and quite near to the larger cluster. Nevertheless, the clusters are detected correctly, as shown in Fig. 5(b).

Fig. 5: (a) The original dataset with two unequal sized clusters before the application of GGGC. (b) Isolated clusters obtained after processing.

Fig. 6: (a) The original dataset with six clusters. (b) The six clusters identified separately after using the proposed method.

Fig. 6(a) is a collection of six clusters with 10000 synthetic data points in 2-D space. Two of the six clusters are of identical elliptical shape and are connected to each other by a string of dense data points. Shape-wise, however, this connected group forms three clusters (the two ellipses and the connecting string), which have been identified accurately by GGGC. Two of the remaining three clusters in Fig. 6(a) are almost identical in shape and equal in density, while the third one is the largest cluster, with lower density than the other clusters. All six are identified accurately by GGGC, as shown in Fig. 6(b).

The data set in Fig. 7(a) in 2-D space is of a very special type, as it comprises two clusters of identical shape and different data density in the presence of random noise scattered all over the feature space. At the first stage, the data set (containing 8500 points) is segmented into a number of sub-clusters, and then CMHA is used for merging them. The two clusters are identified correctly in the presence of noise. However, it is noticed that the noise is also grouped into small arbitrary clusters, probably due to randomness. All such small groups are merged into one cluster of noise, as given in Fig. 7(b).

The data sets in Figs. 8(a) and 9(a) (containing 12000 points each) are critical compared to the other data sets in the experiment, as both of them consist of the maximum number of clusters along with noisy outliers. The clusters also differ in shape, size and data density. The noise scattered over the feature space is random in nature. These two data sets have been included in the experiment to test the effectiveness and efficiency of GGGC in identifying clusters compared to the results of the CURE and ROCK approaches, shown in www-users.cs.umn.edu/han/chameleon.html. The algorithm (GGGC) clearly identifies all nine clusters in the presence of random noise, as depicted in Fig. 8(b). Similarly, all eight clusters of Fig. 9(a) are separated from the random noise with the proposed hybrid method, as shown in Fig. 9(b).

Fig. 7: (a) The original dataset with two clusters and random noise scattered over the entire space. (b) Isolated clusters with separated noise after the processing of GGGC.

Fig. 8: (a) The original dataset with nine clusters and random noise scattered over the entire space. (b) Nine clusters separated from random noise in the feature space after the application of GGGC. (c) Eleven clusters identified by the CURE algorithm. (d) Twenty clusters found by the ROCK method. Source of Figs. 8(c) and 8(d) is www-users.cs.umn.edu/han/chameleon.html.

Fig. 9: (a) The original dataset with eight clusters and random noise scattered over the entire space. (b) All eight clusters properly separated from noise at the end of the processing of GGGC. (c) Eight clusters identified arbitrarily by the CURE method. (d) Twenty clusters found by the ROCK algorithm. Source of Figs. 9(c) and 9(d) is www-users.cs.umn.edu/han/chameleon.html.

From Figs. 8(c) and 9(c) it can be seen that CURE fails to choose the right clusters. This is because, during merging, this method draws a random sample as its representative points. Experiments show that the random sampling method may lead to loss of information about the geometry of the clusters. ROCK, on the other hand, does not work by choosing a random sample; its merging process depends on the interconnectivity of objects in the data set. Such models are inflexible and can easily lead to incorrect merging decisions when different clusters exhibit different interconnectivity characteristics. This restricts the ROCK method in identifying the clusters correctly, as shown in the results of Figs. 8(d) and 9(d).

All data sets in our experiment have been tested at least 30 times. Although all figures shown here illustrate the best case in cluster identification, in some test runs the algorithm fails for some data sets. This is particularly true for Fig. 6(a), which is quite different from the remaining data sets. The algorithm wrongly represents the pair of elliptical clusters together with the dense thin line (Fig. 6(a)) in nearly 10% of the test runs. For the remaining data sets the proposed method has always chosen the right clusters.

Table 1: Data sets with numerical attribute values

| Data set name | Number of patterns | Dimension | Number of classes |
|---|---|---|---|
| Iris Flower | 150 | 4 | 3 |
| New Thyroid | 215 | 5 | 3 |
| Italian Wine | 178 | 13 | 3 |
| Breast Cancer | 683 | 9 | 2 |
| OCR data | 5000 | 100 | 49 |

Table 2: Confusion Matrix for Iris flower classification

| Actual class | Classified as Setosa | Classified as Versicolor | Classified as Virginica |
|---|---|---|---|
| Setosa | 50 | 0 | 0 |
| Versicolor | 0 | 48 | 2 |
| Virginica | 0 | 4 | 46 |

4.3 Pattern Classification

The hybrid approach has also been tested for classification of patterns from multi-dimensional data. For this purpose we have selected four well-known data sets with numerical attribute values from the UCI Machine Learning Database Repository, together with one multi-class, high-dimensional data set of OCR features of Bengali characters (see Table 1). We have tabulated the average classification result of each data set for 30 runs by a confusion matrix. The data sets differ in number of data points and dimensionality. The accuracy of the proposed method is compared with that of three relevant classification techniques, two of which are genetic algorithm based (shown in Table 6). The first is an evolutionary approach to designing accurate classifiers based on fuzzy rules using a scatter partition of the feature space [21]. The second is an evolutionary approach to finding useful fuzzy concepts for pattern classification using a genetic algorithm [22]. The third, C4.5, is a very common classification technique that builds decision trees from a set of given examples [23].

The Iris data set, with four features, is one of the most popular databases reported in the pattern recognition literature. It contains three classes, named Iris Setosa, Iris Versicolor and Iris Virginica, of 50 instances each, where each class refers to a type of Iris flower. The feature attributes of the data are sepal length, sepal width, petal length and petal width. Among the three classes, one is linearly separable from the other two.

First, we have used GDA on the Iris data to create m (= 30) fragmented clusters Bi’s, ∀i ∈ {1, ..., m}. CMHA is then employed on the Bi’s until the number of final clusters is 3. In our experiment, one cluster has always been separated clearly from the other two. However, the second and third clusters are not perfectly separable from each other. We describe the best result by a Confusion Matrix in Table 2. GGGC yields a classification accuracy of 96%, as shown in Table 6.
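As a worked check, the 96% figure follows directly from the entries of Table 2: the accuracy is the sum of the diagonal divided by the total number of patterns, (50 + 48 + 46) / 150 = 0.96. The short sketch below reproduces this with NumPy; the variable names are illustrative only, and the per-class rates are an addition not reported in the paper.

```python
import numpy as np

# Rows: actual class, columns: assigned class (values copied from Table 2).
labels = ["Setosa", "Versicolor", "Virginica"]
confusion = np.array([
    [50, 0, 0],
    [0, 48, 2],
    [0, 4, 46],
])

# Overall accuracy: correctly classified patterns over all patterns.
accuracy = np.trace(confusion) / confusion.sum()

# Fraction of each actual class that was assigned to the right cluster.
per_class_rate = np.diag(confusion) / confusion.sum(axis=1)

for name, rate in zip(labels, per_class_rate):
    print(f"{name}: {rate:.2f}")
print(f"accuracy: {accuracy:.2f}")
```

The per-class rates make the text's observation concrete: Setosa is separated perfectly, while the confusion is confined to Versicolor and Virginica.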

The second data set, New Thyroid, represents three classes of thyroid gland condition, namely euthyroidism, hypothyroidism and hyperthyroidism. It comprises 215 unique data points with five different attributes. We have decomposed this data set by GDA, generating 68 sub-clusters in the first step. Next, CMHA is applied to merge the fragmented clusters into three classes. Here, none of the three classes is linearly separable from the others. Very few data points from each class are confused with either of the remaining classes. The classification scenario is described by the Confusion Matrix of Table 3. GGGC achieves a classification accuracy of 95.3% (see Table 6).

Table 3: Confusion Matrix for Thyroid gland disease classification

| Actual class | Classified as Euthyroidism | Classified as Hyperthyroidism | Classified as Hypothyroidism |
|---|---|---|---|
| Euthyroidism | 145 | 4 | 1 |
| Hyperthyroidism | 2 | 30 | 3 |
| Hypothyroidism | 5 | 0 | 25 |

Table 4: Confusion Matrix for Italian Wine classification

| Actual class | Classified as Class 1 | Classified as Class 2 | Classified as Class 3 |
|---|---|---|---|
| Class 1 | 57 | 0 | 2 |
| Class 2 | 2 | 62 | 7 |
| Class 3 | 0 | 6 | 42 |

The third data set contains 178 samples from a chemical analysis of wines grown in the same region of Italy but derived from three different cultivars. The analysis has determined the quantities of 13 constituents found in each type of wine. As with the previous two data sets, this one is first split into m (= 51) fragmented clusters, followed by the merging process (CMHA). None of the three classes is completely separated from the others, as shown by the Confusion Matrix of Table 4. The average classification accuracy of GGGC is 96.6% (see Table 6).

The fourth multi-dimensional data set is the real Wisconsin Diagnostic Breast Cancer (WDBC) data. It contains two classes, namely malignant and benign, with 683 unique data points and 9 attributes. Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image. The data set is initially fragmented into 33 sub-clusters, which are then merged by CMHA for classification into two classes. The best classification result for the two classes is shown by the Confusion Matrix in Table 5. Table 6 shows that the classification accuracy of GGGC is 95.8%.

The fifth data set is a multi-class, high-dimensional data set where the problem is to recognize printed characters of an Indian script named Bangla. There are 49 character classes. Each character image bounding box is partitioned into 5 × 5 windows. In each window, the frequencies of four border-following directions (two for the diagonals and one each for horizontal and vertical) are accumulated as four features. Thus, 4 × 25 = 100 features are computed. In other words, it is a 49-class problem with 100-dimensional features, so both the number of classes and the feature dimensionality are much larger than those of the other data sets. The number of data points tested was 5000. During testing, this data set was initially fragmented into 109 sub-clusters, which were then merged into 49 classes by CMHA. It was found that GGGC attained a classification accuracy of 93.6%.
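The 4 × 25 = 100 feature layout described above can be illustrated with the following minimal sketch. It assumes that a border-following step has already labelled each pixel of the bounding box with a direction code (0 to 3) or with -1 for non-border pixels; that labelling step, the code assignment and the function name `window_direction_features` are hypothetical and not taken from the paper.

```python
import numpy as np


def window_direction_features(direction_map, grid=5, n_directions=4):
    """Histogram of direction codes per window, concatenated row by row.

    direction_map: 2-D integer array covering the character bounding box;
    entries 0..n_directions-1 are direction codes of border pixels and
    -1 marks pixels that are not on a border (an assumed encoding).
    """
    height, width = direction_map.shape
    # Window boundaries that split the box into grid x grid parts.
    row_edges = np.linspace(0, height, grid + 1).astype(int)
    col_edges = np.linspace(0, width, grid + 1).astype(int)

    features = []
    for r in range(grid):
        for c in range(grid):
            window = direction_map[row_edges[r]:row_edges[r + 1],
                                   col_edges[c]:col_edges[c + 1]]
            codes = window[window >= 0]  # keep border pixels only
            counts = np.bincount(codes, minlength=n_directions)
            features.extend(counts[:n_directions])
    return np.asarray(features)


# Toy 20 x 20 bounding box with two labelled border pixels.
toy_map = np.full((20, 20), -1)
toy_map[0, 0] = 2      # first window, direction code 2
toy_map[19, 19] = 0    # last window, direction code 0
toy_features = window_direction_features(toy_map)
```

Each window contributes four consecutive entries, so the feature vector has 25 × 4 = 100 components regardless of the size of the bounding box.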

It may be noted that the classification accuracy of the proposed approach is reasonably stable with respect to the feature dimension and the number of classes. In Table 6 the average classification accuracy ranges between 96.6% for Italian wine with 13 features and 95.3% for thyroid data with 5 features. For the OCR data, with a high number of classes as well as features, the accuracy remains high at 93.6%. These results indicate that, more than the data dimensionality, it is the discriminant power of the features that is important in obtaining good classification accuracy. The accuracy of the method does not drop merely because of an increase in dimensionality.

Table 5: Confusion Matrix for Wisconsin Diagnostic Breast Cancer

| Actual class | Classified as Benign | Classified as Malignant |
|---|---|---|
| Benign | 424 | 20 |
| Malignant | 22 | 230 |

Table 6: Test performance (accuracy in %) of various GA-based classification algorithms

| Data set name | Ref. [21] | Ref. [22] | C4.5 [23] | Proposed method |
|---|---|---|---|---|
| Iris Flower | 95.1 | 95.3 | 94.7 | 96.0 |
| New Thyroid | 94.0 | 94.9 | 94.0 | 95.3 |
| Italian Wine | 93.7 | 91.6 | 90.1 | 96.6 |
| Breast Cancer | 95.3 | 94.9 | 94.8 | 95.8 |

5. Conclusion

The two-step split-and-merge procedure (GGGC), composed of the algorithms GDA and CMHA, has been employed for solving clustering and pattern classification problems. Both algorithms are quite simple to implement and terminate after identifying the expected number of clusters k over multiple runs. The parameters to be provided by the user are also few in number. The performance of GGGC is compared with some recent split-and-merge based clustering techniques as well as genetic algorithm based classification techniques.

As stated before, GDA splits the original data set into m fragmented clusters with grid structure in a multi-dimensional space. The number of sub-clusters m varies with the values of input parameters for a given data set. In the second stage, the CMHA starts to merge the sub-clusters in multiple runs. This algorithm is based on a hybrid iterative procedure of MGA and TS. The algorithms are combined to exploit the advantages of both procedures. In each epoch of this hybrid algorithm, MGA is employed followed by TS whereby some sub-clusters among Bi’s, ∀i ∈ {1, ..., m} are merged. Thus, CMHA is reinvoked several times to obtain k clusters.
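The splitting step summarized above can be illustrated with a minimal sketch that places points on a grid and treats every non-empty cell as one sub-cluster. The uniform grid with the same number of cells along each dimension, and the names `grid_subclusters` and `cells_per_dim`, are assumptions made for illustration; the input parameters of GDA itself are not restated in this section, and the merging stage (CMHA) is not shown.

```python
import numpy as np
from collections import defaultdict


def grid_subclusters(points, cells_per_dim):
    """Map each non-empty grid cell to the indices of the points it holds.

    Assumes a uniform grid over the bounding box of the data, with
    cells_per_dim cells along every dimension (an illustrative choice).
    """
    points = np.asarray(points, dtype=float)
    lower = points.min(axis=0)
    span = points.max(axis=0) - lower
    span[span == 0] = 1.0  # avoid division by zero on constant dimensions

    # Integer cell coordinates; points on the upper boundary are clipped
    # into the last cell.
    cell_idx = np.floor((points - lower) / span * cells_per_dim).astype(int)
    cell_idx = np.clip(cell_idx, 0, cells_per_dim - 1)

    cells = defaultdict(list)
    for point_id, key in enumerate(map(tuple, cell_idx.tolist())):
        cells[key].append(point_id)
    return dict(cells)


# Four 2-D points falling into two opposite cells of a 2 x 2 grid.
toy_points = [[0.0, 0.0], [0.1, 0.1], [1.0, 1.0], [0.9, 0.95]]
toy_cells = grid_subclusters(toy_points, cells_per_dim=2)
m = len(toy_cells)  # number of sub-clusters passed to the merging stage
```

In this toy case m = 2; with real data m depends on the grid resolution, which matches the remark that the number of sub-clusters varies with the input parameters.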

The algorithm GGGC is tested on several data sets comprising synthetic data in R² Euclidean space. Among these data sets, four (Figs. 6 to 9) possess special features. In Fig. 6, three isolated clusters and one connected cluster (shape-wise, three clusters) are identified by GGGC. The data sets in Figs. 7 to 9 consist of two or more clusters with outliers; nevertheless, the clusters are detected correctly. From the results on a variety of data sets it may be inferred that GGGC performs better than the CURE and ROCK clustering methods.

We have also tested our approach on some data sets in multi-dimensional feature space obtained from the UCI Machine Learning Database Repository [16], as well as on multi-class and multi-feature OCR data. The number of classes and attributes differs from one data set to another. The average classification result of each multi-dimensional data set is shown by the respective confusion matrix. For each data set listed in Table 6, GGGC detects the classes with an accuracy of at least 95.3%. It is also observed from the experiment that the classification accuracy does not depend on data dimensionality. Instead, it depends on the presence of important features in the data set. The proposed approach is thus numerically promising, and it will be of interest to test this procedure on more problems of practical interest in future. It will also be helpful to determine the GA parameters automatically in a data-driven manner.

Acknowledgments

The authors would like to thank Mr. Rahul Banerjee of C&MB Division and Mr. S. Majumdar of Computer Division, SINP for their help.

References

[1] Anderberg, M.R., Cluster Analysis for Applications. Academic Press, New York. 1973.

[2] Han, J. and Kamber, M., Data Mining: Concepts and Techniques. Morgan Kaufmann, Los Altos. 2001.

[3] Guha, S., Rastogi, R., Shim, K., CURE: An Efficient Clustering Algorithm for Large Databases. Proc. ACM SIGMOD Int. Conf. Management of Data. ACM Press. New York. 1998, pp. 73-84.

[4] Guha, S., Rastogi, R. and Shim, K., ROCK: A robust clustering algorithm for categorical attributes. Proc. of IEEE Conf. on Data Engg. Sydney, Australia. 1999, pp. 512-521.

[5] Karypis, G., Han, E.-H., Kumar, V., Chameleon: Hierarchical Clustering Using Dynamic Modeling. IEEE Computer. Aug., 1999, pp. 68-75.

[6] Bezdek, J.C., Pattern Recognition with Fuzzy Objective Function Algorithms. Plenum Press, New York. 1981.

[7] Niesse, J. and Mayne, G., Global geometry optimization of atomic clusters using a modified genetic algorithm in space-fixed coordinates. Journal of Chem. Phys., Vol. 87, No. 10, 1996, pp. 6166-6177.

[8] Iwamatsu, M., Global geometry optimization of silicon clusters using the space-fixed genetic algorithm. Journal Chem. Phys., Vol. 112, No. 24, 2000, pp. 10976-10983.

[9] Hall, L.O., Ozyurt, I.B., Bezdek, J.C., Clustering with a genetically optimized approach. IEEE Trans. Evolu. Compu. 3(2), 1999, pp. 103-112.

[10] Chang, X.G., John, L., Evolutionary design of a fuzzy classifier from data, IEEE Transactions on Systems Man and Cybernetics 34 (4), 2004, pp. 1894-1906.

[11] Roubos, H., Setnes, M., Compact and transparent fuzzy models and classifiers through iterative complexity reduction, IEEE Transaction on Fuzzy Systems 9, 2001, pp. 516-524.

[12] Tseng, L.Y., Yang, S. B., A genetic approach to the automatic clustering problem. Pattern Recognition. 34, 2001, pp. 415-424.

[13] Sheikholeslami, G., Chatterjee, S. and Zhang, A., WaveCluster: A multi-resolution clustering approach for very large spatial databases. Proc. Int. Conf. Very Large Data Bases (VLDB’98), New York. 1998, pp. 428-439.

[14] Hinneburg, A. and Keim, D.A., An efficient approach to clustering in large multimedia databases with noise. Proc. 1998 Int. Conf. Knowledge Discovery and Data Mining (KDD’98), New York. 1998, pp. 58-65.

[15] Agrawal, R., Gehrke, J., Gunopulos, D. and Raghavan, P., Automatic subspace clustering of high dimensional data for data mining applications. Proc. 1998 ACM-SIGMOD Int. Conf. Management of Data (SIGMOD’98), Seattle, WA, 1998, pp. 94-105.

[16] Newman, D.J., Hettich, S., Blake, C.L. and Merz, C.J., UCI Repository of machine learning databases. Univ. of California, Irvine, Dept. of Information and Computer Sciences. http://www.ics.uci.edu/~mlearn/MLRepository.html. 1998.

[17] Holland, J.H., Adaptation in Natural and Artificial Systems. Ann Arbor, MI: Univ Michigan Press. 1975.

[18] Goldberg, D., Genetic algorithms in Search, Optimization and Machine Learning. Reading, MA: Addison-Wesley Publishing. 1989.

[19] Glover, F., Tabu search - Part I. ORSA Journal of Computing. 1(3), 1989, pp. 190-206.

[20] Al-Sultan, K.S., A tabu search approach to the clustering problem. Pattern Recognition. 28, 1995, pp. 1443-1451.

[21] Ho, S.Y., Chen, H.M., Ho, S.J., Design of accurate classifiers with a compact fuzzy-rule base using an evolutionary scatter partition of feature space, IEEE Transactions on Systems Man and Cybernetics, Part B 34 (2), 2004, pp. 1031-1043.

[22] Hu, Y.C., Finding useful fuzzy concepts for pattern classification using genetic algorithm, Information Sciences 175 (1), 2005, pp. 1-19.

[23] Quinlan, J.R., C4.5: Programs for Machine Learning, Morgan Kaufmann, San Mateo, CA, 1993.