like machine learning, data mining, pattern recognition, image processing and bioinformatics. Clustering is the process of partitioning or grouping a given set of data into disjoint clusters. There are two basic types of clustering approach: hierarchical and partitional. K-means clustering is a partitional method, and it suffers from the fact that it may not be easy to identify good initial K elements. Genetic Algorithm (GA) and Particle Swarm Optimization (PSO) techniques have been applied to overcome this problem in K-means. A Genetic Algorithm (GA) is a heuristic optimization technique based on the mechanics of natural selection and genetics. Particle Swarm Optimization (PSO) is also a heuristic search method, whose mechanics are inspired by swarming behavior. The PSO algorithm is simple and can be implemented in a few lines of code, whereas GAs have difficulty pinpointing an exact solution but are good at reaching the global region. GA and PSO each have their own strengths, but they have weaknesses too. A hybrid approach (GA-PSO) that combines the advantages of GA and PSO is therefore proposed to obtain better performance. The hybrid method merges the standard velocity and update rules of PSO with the ideas of selection, crossover and mutation from GAs. A comparative study is carried out by comparing results such as fitness value and elapsed time of GA-PSO with those of the standard GA and PSO.
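The building blocks that such a hybrid combines can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function names, the inertia and acceleration parameters (`w`, `c1`, `c2`), single-point crossover and Gaussian mutation are all assumptions chosen because they are common textbook forms.

```python
import random

def pso_update(position, velocity, personal_best, global_best,
               w=0.7, c1=1.5, c2=1.5, rng=random):
    """Standard PSO velocity and position update for one particle.

    w, c1 and c2 are assumed illustrative values, not taken from the source.
    """
    new_position, new_velocity = [], []
    for x, v, p, g in zip(position, velocity, personal_best, global_best):
        r1, r2 = rng.random(), rng.random()
        # Inertia term + pull toward personal best + pull toward global best.
        v_next = w * v + c1 * r1 * (p - x) + c2 * r2 * (g - x)
        new_velocity.append(v_next)
        new_position.append(x + v_next)
    return new_position, new_velocity

def crossover(parent_a, parent_b, rng=random):
    """GA-style single-point crossover (assumed operator)."""
    if len(parent_a) < 2:
        return list(parent_a)
    cut = rng.randint(1, len(parent_a) - 1)
    return list(parent_a[:cut]) + list(parent_b[cut:])

def mutate(individual, rate=0.1, scale=0.5, rng=random):
    """GA-style mutation: perturb each coordinate with probability `rate`."""
    return [x + rng.gauss(0.0, scale) if rng.random() < rate else x
            for x in individual]
```

In a hybrid loop, one plausible arrangement is to apply `pso_update` to every particle and then replace the worst individuals with mutated offspring of the best ones; the exact schedule used in the study is not stated in the abstract.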
The tools and techniques of data mining help to empower decision making, pattern recognition and forecasting services. All these applications are made feasible by analyzing data and extracting hidden patterns from raw data. According to their applications and the nature of the data analysis, data mining algorithms are classified into supervised and unsupervised learning techniques. The presented work focuses mainly on unsupervised learning. In this context the k-means algorithm is explored and a detailed analysis is derived. During the investigation of the k-means algorithm, two key issues are identified. The algorithm initially selects random centroids and then iterates to satisfy its objective function; as a result, two issues are common: first, the long running time of the algorithm, and second, its fluctuating accuracy.
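The dependence on random initial centroids can be seen in a small sketch. The code below is a plain Lloyd-style k-means written for illustration only; the function names and the `seed` parameter are assumptions, and it is not the implementation analyzed in the work above.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, seed, max_iter=100):
    """Basic k-means with random initial centroids drawn from the data.

    Returns the final centroids and the sum of squared errors (SSE).
    """
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda j: squared_distance(p, centroids[j]))
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its members.
        new_centroids = [
            tuple(sum(c) / len(members) for c in zip(*members))
            if members else centroids[j]
            for j, members in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    sse = sum(min(squared_distance(p, c) for c in centroids) for p in points)
    return centroids, sse
```

For the four points `(0, 0)`, `(0, 1)`, `(10, 0)`, `(10, 1)` with `k=2`, the run converges either to the left/right split (SSE 1.0) or, when both initial centroids fall on the same side, to a poor top/bottom split (SSE 100.0). Which one is reached depends only on the seed.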
With the rapid advance of Internet technology, new information appears every day. In order to capture the transfer of the information expressed by data collected in different time phases, a novel topology adaptive clustering algorithm, abbreviated as TSAC, is proposed in this paper. This algorithm does not need to make any assumption about neuron topology in advance, and can form the topology dynamically to simulate the distribution of the input data. To keep neurons from being located outside the area where the input data are distributed, it uses local density to construct new neurons. In addition, a minimum spanning tree is introduced to perform competitive learning and further enhance its performance. Experimental results demonstrate that TSAC works better than most traditional clustering algorithms. Another ability of TSAC is that it can cluster dynamic data. To illustrate this, TSAC is used to display the transfer of the information expressed by news crawled from a website throughout the entire year of 2011. To measure the transfer quantitatively, the data space is partitioned into several grids and density is adopted as the measurement criterion.
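A grid-based density measure of the kind mentioned above might look like the following sketch. The abstract does not give the exact definition, so the uniform cell size and the normalization by the total number of points are assumptions, and `grid_density` is a hypothetical name.

```python
from collections import Counter

def grid_density(points, cell_size):
    """Fraction of points falling into each grid cell.

    Each point is mapped to the integer index of the cell that contains it;
    the density of a cell is its share of all points (assumed definition).
    """
    counts = Counter(
        tuple(int(coord // cell_size) for coord in p) for p in points
    )
    total = len(points)
    return {cell: n / total for cell, n in counts.items()}
```

Comparing the density maps computed for two time phases would then give one way to quantify how the data have moved between them.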
Data mining is the process of extracting useful information from large databases. Data mining is most useful in exploratory analysis because of the nontrivial information contained in large volumes of data. Data mining techniques are useful for predicting various diseases in the medical field. Cardiovascular diseases are among the most prominent diseases of the modern world [1]. According to the World Health Organization, more than 12 million deaths occur worldwide every year due to heart problems. It is also one of the fatal diseases in India, where it causes the most casualties. The diagnosis of this disease is an intricate process, and it should be diagnosed accurately and correctly. The limited capacity of medical experts and their unavailability in certain places put patients at high risk. Normally, the disease is diagnosed using the intuition of the medical specialist. It would be highly advantageous if these techniques were integrated with the medical information system.
Incremental K-means (IKM) [Ordonez, 2003, Pham et al., 2004, Sanjay et al., 2011, Sanjay Chakraborty, 2011] was proposed to improve the performance of the standard K-means with different objectives. Chakraborty and Nagwani, Sanjay et al., and Sanjay Chakraborty proposed an IKM in which new data are clustered by comparing their smallest distance from the means of the existing clusters. The result is the same as if a standard K-means had been run on the whole data set, but the proposed method needs less computation time. The proposed IKM showed better performance than the standard K-means. However, when the threshold value or the % of δ change in the database exceeded 57%, the standard K-means outperformed IKM.
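The nearest-mean assignment described above can be sketched as follows. This is an illustrative reading of the idea, not the cited authors' algorithm: the running-mean update and the name `assign_incrementally` are assumptions, and the sketch makes no claim to reproduce a full K-means rerun exactly.

```python
import math

def assign_incrementally(means, counts, new_points):
    """Assign each new point to the nearest existing cluster mean.

    means  : list of cluster means (one coordinate sequence per cluster)
    counts : number of points already in each cluster
    Returns updated means, updated counts and the label of each new point.
    """
    means = [list(m) for m in means]
    counts = list(counts)
    labels = []
    for p in new_points:
        # Smallest Euclidean distance to any existing mean decides the cluster.
        j = min(range(len(means)), key=lambda i: math.dist(p, means[i]))
        counts[j] += 1
        # Running-mean update, so old points never need to be revisited.
        means[j] = [m + (x - m) / counts[j] for m, x in zip(means[j], p)]
        labels.append(j)
    return means, counts, labels
```

Because only the new points are touched, the cost grows with the size of the update rather than with the size of the whole database, which is consistent with the reduced computation time reported above.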
In this research paper a sequential hybridization of two popular data clustering approaches (PSO and K-Means) has been proposed. The proposed approach and both individual algorithms have been implemented in Matlab Version 7.6.0 (R200a), and their performance is evaluated on an Intel® Core i3-2310 @ 2.10 GHz with 4 GB of RAM running a 32-bit OS (Windows 8). Comparative analysis shows that the proposed approach converges better, to lower quantization errors, larger inter-cluster distances and smaller intra-cluster distances, with approximately the same execution time. Accuracy measurement shows the real impact of the proposed algorithm: the proposed approach is 6% more accurate than PSO and 15.5% more accurate than the K-Means algorithm. The comparison concludes that the difficulty K-Means has in finding the optimal solution can be reduced by combining it with PSO. Variations of the PSO algorithm, and a more efficient sequential hybridization with the K-Means algorithm that could reduce execution time, are proposed for future research.
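The evaluation measures named above are not defined in the abstract. The sketch below assumes definitions that are common in the PSO clustering literature: quantization error as the average, over clusters, of the mean point-to-centroid distance, and inter-cluster distance as the smallest distance between any two centroids. Both function names are hypothetical.

```python
import math
from itertools import combinations

def quantization_error(clusters, centroids):
    """Average over non-empty clusters of the mean point-to-centroid distance."""
    per_cluster = [
        sum(math.dist(p, c) for p in members) / len(members)
        for members, c in zip(clusters, centroids)
        if members
    ]
    return sum(per_cluster) / len(per_cluster)

def min_inter_cluster_distance(centroids):
    """Smallest Euclidean distance between any pair of centroids."""
    return min(math.dist(a, b) for a, b in combinations(centroids, 2))
```

Under these definitions a good clustering has a low quantization error (compact clusters) and a high minimum inter-cluster distance (well-separated clusters).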
Feature selection has long been a research area within statistics and pattern detection. It is not surprising that feature selection is as much of an issue for machine learning as it is for pattern detection, as both fields share the common task of categorization. Many issues were found while the research work was going on, including memory management, compact data structures, multilingual text refining and domain knowledge integration. The Genetic and PSO algorithms are used to reduce the features in the selected data. For the classification process the decision tree classifier has been used, and crossover and mutation are applied. The data selected by PSO is efficient and very close to the original data. The neural network has been replaced by a decision tree classifier, which improved the efficiency of the work, and PSO is used to apply crossover. The use of clusters has been avoided to improve the results. Other optimization algorithms, such as learning algorithms, can be used to improve the accuracy on the data. As future work, clustering can be done and its validity can be measured and tested on the basis of the selected data.
The Internet continues to grow at a phenomenal rate and the amount of information on the web is overwhelming. This web data is growing exponentially and needs to be handled properly. Thus, text mining and clustering of this huge volume of data is the main challenge for web data. To handle such a huge amount of data, we apply preprocessing technologies and perform clustering on the preprocessed data using the PSO algorithm. Clustering is defined as grouping objects (particles) of a similar type into the same group (called a cluster). Clustering is one of the main tasks of data mining and a common technique for statistical data analysis, but the data need to be preprocessed first. Preprocessing is a term from the data mining domain; here it serves to remove stop words and stem word suffixes for topic detection. To perform clustering we focus on an evolutionary clustering technology from the soft computing domain known as Particle Swarm Optimization (PSO). PSO is a bio-inspired swarm intelligence algorithm introduced by Kennedy and Eberhart in 1995 as a population-based stochastic search and optimization process. It originated from the computer simulation of individuals (particles or living organisms) in a bird flock or fish school, which show a natural behavior when they search for some target (e.g., food). The goal is,
Osama Abu Abbas (2008) presented comparisons between data clustering algorithms. Clustering is the division of data into groups of similar objects. Each group, called a cluster, consists of objects that are similar among themselves and dissimilar to objects of other groups. Various clustering algorithms are used for clustering data. The study concluded that as the number of clusters, k, becomes greater, the performance of SOM becomes lower. The performance of the k-means and EM algorithms is better than that of hierarchical clustering algorithms. However, there is no comparison between the k-means and EM algorithms in this research.
Agustin Blas et al. described the performance of the grouping genetic algorithm in clustering. They started with a proposed encoding and different modifications of the crossover and mutation operations, and also introduced a local search together with an island model to improve performance in difficult situations. Real data sets such as Iris and Wine were used, and the results were compared with classical approaches such as DBSCAN and K-means; the proposed grouping-based evolutionary methodology obtained excellent results. The performance of the algorithm was measured using different fitness functions.
We compared ACTS with the biclustering method proposed by Madeira et al. Unlike clustering, biclustering is a process in which rows and columns of a matrix are clustered simultaneously. BiGGEsTS created a total of 679,107 biclusters for our time-series dataset. BiGGEsTS selects specific intervals where a group of genes tends to over-express and clusters them together. The time intervals are chosen from anywhere in the dataset (beginning, middle or end) without changing the order of the time series. In contrast, ACTS always clusters genes from the beginning (time-point 1) and continues until the end based on the step size. Thus, genes that have a similar trend from the beginning are clustered together in ACTS, and genes that have a similar expression trend in a specific time interval are clustered together in BiGGEsTS. From the results, we could see that ACTS has better performance than BiGGEsTS. Let us consider the following example. Genes SCGB2A2, ANKRD30A, SCGB1D2 and SCGB2A1 follow a similar trend from time-point 1 to time-point 8 in the given dataset. The results of BiGGEsTS and ACTS are as follows:
Pallavi Purohit and Ritesh Joshi et al. proposed an enhanced approach to the traditional K-means clustering algorithm because of its certain limitations. The poor performance of the traditional K-means clustering algorithm is due to the random selection of initial centroid points. The proposed algorithm deals with this problem and improves the performance and cluster quality of the traditional k-means algorithm. The enhanced algorithm selects the k initial centroids in an efficient manner rather than selecting them randomly. It first finds the closest data objects by calculating the Euclidean distance between each pair of data objects; these data points are then deleted from the population and form a new data set. The enhanced algorithm provides more precise results and also reduces the mean square distance. However, the proposed algorithm works better for dense data sets than for sparse data sets.
nearly the same (using 2, 3, or 4 principal components (PCs)). Therefore, the clustering results of HAC for the dataset preprocessed by PCA are at best the same as those for the original dataset, depending on the number of PCs used (aRI score ranged from 0.566 to 0.759). Table 3 shows the aRI scores of the clustering results of HAC on the original datasets and on the datasets preprocessed by D-IMPACT and CLUES. The effectiveness was dependent on the dataset. In the case of Iris, D-IMPACT greatly improved the dataset, particularly as compared with CLUES. However, for the Wine dataset, CLUES achieved the better result. This is because the overlapping clusters in the Wine dataset are indistinguishable using the affinity function. In addition, we calculated aRI scores to compare the clustering results obtained by the clustering algorithms IMPACT and D-IMPACT. For the Iris dataset, the best aRI score achieved by IMPACT was 0.716, which was much lower than the best aRI score of D-IMPACT (0.835). For the Wine dataset, the best aRI score of IMPACT was 0.897, which was slightly lower than the best aRI score of D-IMPACT (0.899). These results show that the movement of the data points was improved in D-IMPACT compared to the IMPACT algorithm. The GSE9712 dataset is high-dimensional and has a small number of samples. Due to the curse of dimensionality and the noise included in microarray data, it is very difficult to distinguish clusters based on the distance matrix. We applied D-IMPACT and CLUES to this dataset to improve the distance matrix, and then applied the clustering algorithm HAC. D-IMPACT clearly outperformed CLUES, since CLUES greatly decreased the quality of the cluster analysis.
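For readers unfamiliar with the aRI scores quoted above, the adjusted Rand index can be computed from the contingency table of two labelings. The sketch below follows the standard Hubert and Arabie formula; the function name is an assumption, and it is not the evaluation code used in the study. It expects at least two samples.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two labelings of the same samples."""
    n = len(labels_true)
    # Pairs of samples that share a cluster in both labelings.
    cells = Counter(zip(labels_true, labels_pred))
    sum_cells = sum(comb(c, 2) for c in cells.values())
    # Pairs that share a cluster in each labeling separately.
    sum_rows = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        # Degenerate case, e.g. both labelings put everything in one cluster.
        return 1.0
    return (sum_cells - expected) / (max_index - expected)
```

A score of 1.0 means the two partitions agree up to a renaming of the clusters, while values near 0 indicate agreement no better than chance; the score can be negative for partitions that disagree more than chance would predict.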
One of the most popular clustering algorithms is K-means, which is computationally efficient. However, it still has a few critical weaknesses: it is totally dependent on the initial cluster centres and very sensitive to outliers. [2,3,4] found that K-means can easily be trapped in a local minimum and is time consuming when applied to large volumes of data. To improve the performance of K-means, [3, 4, 5, 6] proposed a new algorithm, Incremental K-means (IKM).
Abstract: Clustering is a primary data description method in data mining which groups the most similar data. Data clustering is an important problem in a wide variety of fields, including data mining, pattern recognition, and bioinformatics. There are various algorithms used to solve this problem. This paper presents a performance analysis of the Fuzzy C-Means (FCM) clustering algorithm and compares it with the Hard C-Means (HCM) algorithm on the Iris flower data set. We measure the time complexity and space complexity of FCM and HCM on the Iris data set. FCM clustering [2, 3] is a clustering technique that differs from Hard C-Means, which employs hard partitioning. FCM employs fuzzy partitioning, such that a point can belong to all groups with different membership grades between 0 and 1.
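The difference between hard and fuzzy partitioning can be illustrated with the usual FCM membership formula, u_i = 1 / Σ_k (d_i / d_k)^(2/(m-1)), where d_i is the distance from the point to center i. The sketch below is illustrative only: the fuzzifier default `m=2.0` and both function names are assumptions, and the paper's own implementation is not shown in the abstract.

```python
import math

def fcm_memberships(point, centers, m=2.0):
    """Fuzzy C-Means membership grades of one point for all centers.

    m > 1 is the fuzzifier; the returned grades lie in [0, 1] and sum to 1.
    """
    distances = [math.dist(point, c) for c in centers]
    if 0.0 in distances:
        # The point coincides with one or more centers: share membership there.
        hits = [1.0 if d == 0.0 else 0.0 for d in distances]
        return [h / sum(hits) for h in hits]
    power = 2.0 / (m - 1.0)
    return [
        1.0 / sum((d_i / d_k) ** power for d_k in distances)
        for d_i in distances
    ]

def hcm_memberships(point, centers):
    """Hard C-Means membership: 1 for the nearest center, 0 for the rest."""
    distances = [math.dist(point, c) for c in centers]
    nearest = distances.index(min(distances))
    return [1.0 if i == nearest else 0.0 for i in range(len(centers))]
```

For the point `(1, 0)` and centers `(0, 0)` and `(3, 0)`, the fuzzy grades with `m=2` are 0.8 and 0.2, whereas the hard assignment is simply 1 and 0.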
GGC uses the same graph representation as SC and also improves the robustness of the clustering results with respect to the metric used to measure data similarity. However, this algorithm has the same memory usage problems as SC: it generates a matrix comparing all data instances pairwise, so when the problem involves large datasets this matrix becomes extremely big and it is difficult to store (and therefore to compute) all its information. After the introduction of GGC, we propose a new algorithm named Multi-Objective Genetic Graph-based Clustering Algorithm (MOGGC). It is based on GGC and combines Multi-Objective Genetic Algorithms (MOGA) with graph-continuity metrics to achieve two goals: lower memory consumption and increased solution quality in comparison to GGC. In order to assess MOGGC performance, we compare it against three classical clustering algorithms (K-means, EM and SC) and the original GGC. The experiments carried out involve synthetic and well-known UCI datasets.
(a) Simulation Tool: MATLAB version (R2013a), which is used for the implementation in this thesis, provides processor-optimized libraries for fast execution and computation; the performance analysis was performed on the input cancer dataset. It uses JIT (just-in-time) compilation technology to provide execution speeds that rival traditional programming languages. It can also take advantage of multicore and multiprocessor computers: MATLAB provides many multithreaded linear algebra and numerical functions. These functions automatically execute on multiple computational threads in a single MATLAB session, to execute faster on multicore computers. In this thesis, all enhanced efficient data retrieval experiments were performed in MATLAB (R2013b) to get an enhanced result using fuzzy clustering. MATLAB is the high-level language and interactive environment used by millions of engineers and scientists worldwide. It lets you explore and visualize ideas and collaborate across different disciplines, including signal and image processing, communication and computation of results. MATLAB provides tools to acquire, analyze, and visualize data, enabling you to get insight into your data in a fraction of the time it would take using spreadsheets or traditional programming languages. You can also document and share the results through plots and reports or as published MATLAB code.
In this paper, energy-efficient data aggregation using a Voronoi Based Genetic Clustering (VBGC) algorithm was proposed to reduce the number of transmissions from cluster members to the cluster head (CH) and from the CH to the base station, and to increase the WSN lifetime. The Voronoi diagram in VBGC was employed to determine the sensing range of each sensor, since the total energy consumption is closely related to the number of cluster heads and their positions. Once the Voronoi diagram was applied, a GA was applied to group the sensor nodes; here the proposed work tried to generate an optimal number of cluster heads and to optimize the number of cluster members of each cluster head for the data aggregation process. The Euclidean distance between a sensor node and its CH was used to evaluate the fitness function of the genetic algorithm in the network. With this function, the proposed method minimized the cost of transmission in the network. The data mining process further reduces the number of data transmissions from the CH to the sink node through data aggregation functions such as MIN, MAX and AVG, as an additional benefit of clustering in the network. To validate the algorithm, simulations were carried out using MATLAB. Simulation results show better performance of VBGC compared to the basic GA in terms of performance metrics such as number of transmissions, coverage area and total energy dissipation in the system.
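A distance-based fitness and the aggregation step described above might be sketched as follows. The paper's exact fitness function is not given in this summary, so the binary chromosome encoding (1 marks a cluster head), the sum-of-distances cost and both function names are assumptions made for illustration.

```python
import math

def clustering_cost(nodes, chromosome):
    """Total Euclidean distance from each member node to its nearest CH.

    chromosome[i] is 1 if nodes[i] acts as a cluster head (assumed encoding).
    A GA would minimize this cost; lower values mean cheaper transmissions.
    """
    heads = [n for n, is_head in zip(nodes, chromosome) if is_head]
    if not heads:
        return math.inf  # a layout without any cluster head is infeasible
    return sum(
        min(math.dist(n, h) for h in heads)
        for n, is_head in zip(nodes, chromosome)
        if not is_head
    )

def aggregate(readings):
    """Summaries a CH could forward to the sink instead of raw readings."""
    return {
        "MIN": min(readings),
        "MAX": max(readings),
        "AVG": sum(readings) / len(readings),
    }
```

Forwarding the three aggregate values rather than every member reading is what reduces the number of CH-to-sink transmissions.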
Abstract Metabolomics and other omics tools are generally characterized by large data sets with many variables obtained under different environmental conditions. Clustering methods and more specifically two-mode clustering methods are excellent tools for analyzing this type of data. Two-mode clustering methods allow for analysis of the behavior of subsets of metabolites under different experimental conditions. In addition, the results are easily visualized. In this paper we introduce a two-mode clustering method based on a genetic algorithm that uses a criterion that searches for homogeneous clusters. Furthermore we introduce a cluster stability criterion to validate the clusters and we provide an extended knee plot to select the optimal number of clusters in both experimental and metabolite modes. The genetic algorithm-based two-mode clustering gave biologically relevant results when it was applied to two real-life metabolomics data sets. It was, for
As one of the most important tasks of spatial data mining, cluster analysis has been widely used in several domains, such as biology, system engineering and social sciences, in order to identify natural groups in large amounts of data [1-3]. A clustering algorithm assigns a large number of data points to a smaller number of groups (or clusters) such that data points in the same group share the same properties (are similar) while data points in different groups are dissimilar. Data modeling puts clustering in a historical perspective rooted in mathematics, statistics, and numerical analysis. Data mining adds to clustering the complications of very large datasets with very many attributes of different types. This imposes unique computational requirements on relevant clustering algorithms.