An Ensemble Based Framework for Temporal Data Clustering

K. Ratna Prashanthi, IJRIT 164

IJRIT International Journal of Research in Information Technology, Volume 1, Issue 8, August, 2013, Pg. 164-170

International Journal of Research in Information Technology (IJRIT) www.ijrit.com

ISSN 2001-5569

¹K. Ratna Prashanthi, ²Dr. R. V. Krishnaiah

¹Research Scholar, Dept. of CSE, DRK Institute of Science & Technology, Hyderabad, AP, India

²Principal, Department of CSE, DRK Group of Institutions, Hyderabad, Andhra Pradesh, India

1[email protected], ²[email protected]

Abstract

Clustering is a data mining technique that groups similar objects. Temporal data clustering applies it to data in the time domain and is useful in real-world applications such as temporal data mining and multimedia information processing. Yang and Chen recently proposed a framework for temporal data clustering, with a novel algorithm that includes validation criteria to improve the quality of clusters. The algorithm clusters effectively by using different representations, which also avoids information loss. In this paper we implemented the algorithm and built a prototype on the Java platform to demonstrate the proof of concept. We tested the prototype with various benchmark datasets. The empirical results revealed that the performance of the application with multiple data sources containing temporal data is encouraging.

Index Terms – Data mining, temporal data clustering, ensemble, weighted consensus function

1. Introduction

Temporal data is everywhere, and processing it is essential in many applications. Multimedia data processing and other temporal data mining tasks are the main application areas where temporal data clustering is used. Temporal data is highly dynamic in nature, yet it has a high level of dependency among its attributes and among the values within an attribute. Temporal data clustering analyzes these dependencies and discovers underlying patterns in the data. The aim is to divide the data into clusters in such a way that highly coherent objects are grouped in the same cluster: objects within a cluster are highly similar, and objects that belong to different clusters are highly dissimilar.

High dimensionality is one of the challenges in temporal data clustering [1]. Clustering analysis is a difficult task and an ill-posed problem, and the common assumptions associated with it are violated by existing solutions [2]. With respect to how data dependency is processed, the existing clustering algorithms that act upon temporal data are classified into three types: representation-based, model-based and temporal-proximity-based clustering algorithms. Model-based clustering algorithms were explored in [3], [4] and [5], while temporal-proximity-based algorithms were explored in [1], [6] and [7]. These two categories work on temporal data directly and also deal with temporal correlation among the data objects directly. The temporal-proximity-based algorithms [1], [6], [7] use temporal similarity measures such as dynamic time warping, while the model-based algorithms [3], [4], [5] use dynamic models such as the Hidden Markov Model. A representation-based algorithm, in contrast, converts temporal data clustering into static data clustering while still capturing the dependency. Because such a temporal data representation has low dimensionality, clustering can be performed with computational efficiency.
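Two of the ideas in this paragraph can be made concrete with a small Java sketch: a temporal similarity measure (dynamic time warping) and a low-dimensional representation (piecewise aggregate approximation, used here only as one illustrative choice of representation). The class and method names are hypothetical, and the code is a minimal illustration under these assumptions rather than the implementation used in the prototype.

```java
import java.util.Arrays;

/** Illustrative helpers; class and method names are hypothetical, not from the paper. */
final class TemporalSketch {

    /** Dynamic time warping distance with absolute difference as the local cost. */
    static double dtw(double[] a, double[] b) {
        int n = a.length, m = b.length;
        double[][] cost = new double[n + 1][m + 1];
        for (double[] row : cost) Arrays.fill(row, Double.POSITIVE_INFINITY);
        cost[0][0] = 0.0;
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                double local = Math.abs(a[i - 1] - b[j - 1]);
                // Cheapest way to reach (i, j): match, or stretch either series.
                double best = Math.min(cost[i - 1][j - 1],
                        Math.min(cost[i - 1][j], cost[i][j - 1]));
                cost[i][j] = local + best;
            }
        }
        return cost[n][m];
    }

    /** Piecewise aggregate approximation: the mean of each of `segments` frames. */
    static double[] paa(double[] series, int segments) {
        if (segments <= 0 || segments > series.length) {
            throw new IllegalArgumentException("segments must be in 1..series.length");
        }
        double[] out = new double[segments];
        for (int s = 0; s < segments; s++) {
            int start = s * series.length / segments;
            int end = (s + 1) * series.length / segments; // exclusive
            double sum = 0.0;
            for (int i = start; i < end; i++) sum += series[i];
            out[s] = sum / (end - start);
        }
        return out;
    }

    public static void main(String[] args) {
        double[] x = {0, 1, 2, 3, 2, 1};
        double[] y = {0, 0, 1, 2, 3, 2, 1}; // x with its first sample repeated
        System.out.println("DTW(x, y) = " + dtw(x, y));
        System.out.println("PAA(x, 3) = " + Arrays.toString(paa(x, 3)));
    }
}
```

The warping distance works on series of different lengths, which is why proximity-based methods can operate on the raw data, whereas the fixed-length output of `paa` can be passed to any static clustering algorithm.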

Many temporal data representations are found in the literature [8], [9], [10], [11], [12], [13], [14], [15]. However, no representation of temporal data is universal in nature, and it is not easy to choose a suitable representation without careful analysis of the temporal data. That is the reason the representation-based approaches suffer from problems.

Clustering ensemble algorithms have been studied in many areas such as machine learning, and from various perspectives: with graph partitioning in [16] and [17], with evidence aggregation in [18], [19] and [20], and with semidefinite programming in [21]. The purpose of a clustering ensemble is to make use of multiple partitions in order to obtain the best partition. Research has revealed that a clustering ensemble can be used to detect various possibilities with clustering, as explored in [17], [19], [20] and [22]. Formal analysis of clustering ensembles shows the interesting fact that a proper consensus can discover the underlying structure of a given temporal dataset. Thus a clustering ensemble makes it possible to use various representations for clustering temporal data. Yang and Chen [23] explored such a temporal data clustering technique. It begins with an initial clustering analysis that acts on various representations to produce clusters; any clustering algorithm can give the initial clusters. They proposed a two-stage reconciliation process as part of their algorithm in order to ensure the quality of clusters. The partitions are then processed to obtain the final partition. This technique was tested against various benchmark datasets using different representations.
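To illustrate how multiple partitions can be combined, the sketch below builds a weighted co-association matrix in the spirit of evidence accumulation: entry (i, j) is the weighted fraction of partitions that place objects i and j in the same cluster. This is a generic illustration, not the weighted consensus function of [23]; the class name, the method name and the assumption that partition weights are already available are all hypothetical.

```java
/** Hypothetical sketch of a weighted co-association matrix; not the function of [23]. */
final class CoAssociationSketch {

    /**
     * partitions[p][i] is the cluster label of object i in partition p.
     * weights[p] is a non-negative weight for partition p (assumed to be given,
     * for example derived from a clustering validation index).
     */
    static double[][] coAssociation(int[][] partitions, double[] weights) {
        if (partitions.length == 0 || partitions.length != weights.length) {
            throw new IllegalArgumentException("need one weight per partition");
        }
        double total = 0.0;
        for (double w : weights) total += w;
        if (total <= 0.0) throw new IllegalArgumentException("weights must sum to > 0");

        int n = partitions[0].length;
        double[][] matrix = new double[n][n];
        for (int p = 0; p < partitions.length; p++) {
            int[] labels = partitions[p];
            for (int i = 0; i < n; i++) {
                for (int j = 0; j < n; j++) {
                    // Objects that share a label in this partition gain its normalized weight.
                    if (labels[i] == labels[j]) matrix[i][j] += weights[p] / total;
                }
            }
        }
        return matrix;
    }

    public static void main(String[] args) {
        int[][] partitions = {
            {0, 0, 1, 1},
            {0, 0, 0, 1},
            {0, 1, 1, 1}
        };
        double[] weights = {1.0, 1.0, 2.0};
        for (double[] row : coAssociation(partitions, weights)) {
            System.out.println(java.util.Arrays.toString(row));
        }
    }
}
```

A final partition can then be obtained by running any similarity-based clustering algorithm on this matrix, so pairs that most partitions agree on end up together.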

In this paper we implement the temporal data clustering framework presented by Yang and Chen [23]. We built a prototype application to demonstrate the proof of concept and ran many experiments with benchmark datasets. The results revealed that the application can be used in many real world scenarios. The remainder of this paper is structured as follows. Section 2 provides details of the framework. Section 3 describes the prototype implementation. Section 4 presents experimental results, while Section 5 concludes the paper.

2. Framework for Temporal Data Clustering

The framework we implemented is taken from [23]. It is used to cluster temporal data that comes from various data sources in the real world. Temporal data mining, and clustering in particular, has utility in multimedia processing and other temporal data mining applications. The framework considers various representations of the data and performs clustering in multiple phases. An overview of the framework is shown in fig. 1.

Fig. 1 – Overview of clustering framework

As can be seen in fig. 1, the framework takes a temporal data set as input and extracts various representations from it. Afterwards, the initial clustering analysis is made. Then the weighted clustering function proposed in [23] is applied, which results in a weighted clustering ensemble. Finally an agreement function is used to generate the final partition. The initial clustering is made using the K-means algorithm, which is one of the data mining algorithms best suited to grouping similar objects. Clusters are validated using the validation criteria presented in [25]. More precisely, the K-means variant named Bisecting K-means is used in the initial clustering process. It is a combination of hierarchical clustering and K-means. Initially all objects are kept in a single cluster, which the algorithm then repeatedly splits into further clusters.
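The splitting procedure described above can be sketched in Java as follows. The sketch starts with all objects in one cluster and repeatedly bisects the cluster with the largest sum of squared errors using plain 2-means. The choice of which cluster to split, the deterministic seeding rule and all names are assumptions made for illustration; the paper does not specify these details of the prototype.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of bisecting K-means; names and seeding rule are assumptions. */
final class BisectingKMeansSketch {

    static double sqDist(double[] a, double[] b) {
        double s = 0.0;
        for (int d = 0; d < a.length; d++) s += (a[d] - b[d]) * (a[d] - b[d]);
        return s;
    }

    static double[] centroid(double[][] data, List<Integer> members) {
        double[] c = new double[data[0].length];
        for (int i : members)
            for (int d = 0; d < c.length; d++) c[d] += data[i][d] / members.size();
        return c;
    }

    /** Sum of squared distances of the members to their centroid. */
    static double sse(double[][] data, List<Integer> members) {
        double[] c = centroid(data, members);
        double s = 0.0;
        for (int i : members) s += sqDist(data[i], c);
        return s;
    }

    /** Row index of the member farthest from `from` (the first one wins ties). */
    static int farthest(double[][] data, List<Integer> members, double[] from) {
        int best = members.get(0);
        for (int i : members)
            if (sqDist(data[i], from) > sqDist(data[best], from)) best = i;
        return best;
    }

    /** Splits one cluster in two with plain 2-means and deterministic seeds. */
    static List<List<Integer>> bisect(double[][] data, List<Integer> members, int maxIter) {
        double[] c0 = data[farthest(data, members, centroid(data, members))];
        double[] c1 = data[farthest(data, members, c0)];
        List<Integer> left = new ArrayList<>(), right = new ArrayList<>();
        for (int it = 0; it < maxIter; it++) {
            List<Integer> l = new ArrayList<>(), r = new ArrayList<>();
            for (int i : members) {
                if (sqDist(data[i], c0) <= sqDist(data[i], c1)) l.add(i); else r.add(i);
            }
            boolean converged = l.equals(left) && r.equals(right);
            left = l;
            right = r;
            if (converged || left.isEmpty() || right.isEmpty()) break;
            c0 = centroid(data, left);
            c1 = centroid(data, right);
        }
        List<List<Integer>> halves = new ArrayList<>();
        halves.add(left);
        halves.add(right);
        return halves;
    }

    /** Returns up to k clusters, each a list of row indices into data. */
    static List<List<Integer>> cluster(double[][] data, int k, int maxIter) {
        List<Integer> all = new ArrayList<>();
        for (int i = 0; i < data.length; i++) all.add(i);
        List<List<Integer>> clusters = new ArrayList<>();
        clusters.add(all);
        while (clusters.size() < k) {
            int worst = -1;
            double worstSse = 0.0;
            for (int c = 0; c < clusters.size(); c++) {
                double s = sse(data, clusters.get(c));
                if (clusters.get(c).size() > 1 && s > worstSse) { worst = c; worstSse = s; }
            }
            if (worst < 0) break; // no cluster left that can be split further
            List<Integer> target = clusters.remove(worst);
            List<List<Integer>> halves = bisect(data, target, maxIter);
            if (halves.get(0).isEmpty() || halves.get(1).isEmpty()) { clusters.add(target); break; }
            clusters.addAll(halves);
        }
        return clusters;
    }

    public static void main(String[] args) {
        double[][] data = {{0, 0}, {0, 1}, {1, 0}, {10, 10}, {10, 11}, {11, 10},
                           {20, 0}, {21, 0}, {20, 1}};
        System.out.println(cluster(data, 3, 100));
    }
}
```

In the framework, a routine of this kind would be run once per representation, with each row of `data` holding the feature vector of one temporal object under that representation.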

3. Prototype Application

The prototype application is built on the Java platform. It has a Graphical User Interface (GUI) to make it user-friendly, built using the Swing API in Java Standard Edition. The IDE used for development is NetBeans. The application was built on a PC with 4 GB RAM and a Core 2 Duo processor running the Windows 7 operating system.

4. Results

The experiments are made using various benchmark datasets, all of which contain temporal data. The data is highly dynamic in nature and shows correlations among the data objects in the time domain. The benchmark datasets used for the experiments are described in table 1.

Dataset         Number of Classes (K*)   Size of Dataset (Training + Testing)   Length
Syn Control     6                        300 + 300                              60
Gun-Point       2                        50 + 150                               150
CBF             3                        30 + 900                               128
Face (all)      14                       560 + 1690                             131
OSU Leaf        6                        200 + 242                              427
Swedish Leaf    15                       500 + 625                              128
50Words         50                       450 + 455                              270
Trace           4                        100 + 100                              275
Two Patterns    4                        1000 + 4000                            128
Wafer           2                        1000 + 6174                            152
Face (four)     4                        24 + 88                                350
Lightning-2     2                        60 + 61                                637
Lightning-7     7                        70 + 73                                319
ECG             2                        100 + 100                              96
Adiac           37                       390 + 391                              176
Yoga            2                        300 + 3000                             426

Table 1 – Data sets and their details

Table 1 lists the data sets together with their number of classes, size and length. These data sets are used in the experiments. They exhibit various representations, which is essential to the successful demonstration of temporal data clustering. Experiments were also made on the CAVIAR database, both without noise and with noise, and the classification accuracy is presented in fig. 2.

Fig. 2 – Classification accuracy on CAVIAR dataset

As can be seen in fig. 2, the algorithm [23] implemented in our prototype application shows higher performance on the CAVIAR database without noise. With the noisy database the clustering accuracy is reduced.
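The paper does not state how the classification rate in fig. 2 was computed. One common way to score a clustering against known class labels, shown here purely as an assumption, is to map each cluster to its majority class and report the percentage of objects that agree with that mapping. The class and method names below are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: majority-vote classification rate of a clustering. */
final class ClusterAccuracySketch {

    /** Percentage of objects whose true class is the majority class of their cluster. */
    static double classificationRate(int[] clusterLabels, int[] trueLabels) {
        if (clusterLabels.length == 0 || clusterLabels.length != trueLabels.length) {
            throw new IllegalArgumentException("label arrays must be non-empty and equal in length");
        }
        // counts.get(cluster).get(cls) = number of objects of class cls in that cluster
        Map<Integer, Map<Integer, Integer>> counts = new HashMap<>();
        for (int i = 0; i < clusterLabels.length; i++) {
            counts.computeIfAbsent(clusterLabels[i], c -> new HashMap<>())
                  .merge(trueLabels[i], 1, Integer::sum);
        }
        int correct = 0;
        for (Map<Integer, Integer> perClass : counts.values()) {
            int best = 0;
            for (int v : perClass.values()) best = Math.max(best, v);
            correct += best; // objects that match the cluster's majority class
        }
        return 100.0 * correct / clusterLabels.length;
    }

    public static void main(String[] args) {
        int[] clusters = {0, 0, 0, 1, 1, 1, 2, 2};
        int[] classes  = {5, 5, 6, 6, 6, 6, 7, 5};
        System.out.println(classificationRate(clusters, classes) + " %");
    }
}
```

Under this measure, noise that moves objects into clusters dominated by another class lowers the rate directly, which matches the kind of drop reported for the noisy database.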

4.1. Difference between Our Approach and Batch Hierarchical Clustering

The results of the batch hierarchical clustering algorithm (BHC) [3] have been compared with the results of our approach, using the data streams with user IDs 6 and 25. The results of our approach are similar to those of BHC except for a slight change with respect to sensors 2 and 7. Fig. 3 shows the BHC results for user ID 6.

Fig. 3 – Results of BHC for user ID 6

[Fig. 2 plot: Classification Rate (%) versus Missing Data (%), for the series Without Noise and Noise with = 0.1]

Fig. 4 – Results of Our Approach for user ID 6

Fig. 5 – Results of BHC for user ID 25

Fig. 6 – Results of Our Approach for user ID 25

As can be seen in fig. 3 and 4 and in fig. 5 and 6, the results of BHC and of our approach are comparable. They are almost identical, except that our approach groups sensor 2 and sensor 7 into a single cluster in the case of user ID 25. However, our approach leads to higher performance with low computational cost.

5. Conclusion

In this paper we implemented the temporal data clustering framework presented by Yang and Chen [23], which allows multiple data sources to be taken as input and performs efficient clustering on temporal data. This approach is based on the formal clustering ensemble analysis explored in [24]. We built a prototype application on the Java platform to realize the temporal data clustering framework, which works on a variety of temporal data sources. The validation criteria for clusters are the same as used in [25]. We have made experiments on various benchmark datasets. The empirical results revealed that the prototype is effective and can be used in real world applications that need to process temporal data.

6. References

[1] E. Keogh and S. Kasetty, “On the Need for Time Series Data Mining Benchmarks: A Survey and Empirical Study,” Knowledge and Data Discovery, vol. 6, pp. 102-111, 2002.

[2] J. Kleinberg, “An Impossibility Theorem for Clustering,” Advances in Neural Information Processing Systems, vol. 15, 2002.

[3] P. Smyth, “Probabilistic Model-Based Clustering of Multivariate and Sequential Data,” Proc. Int’l Workshop Artificial Intelligence and Statistics, pp. 299-304, 1999.

[4] K. Murphy, “Dynamic Bayesian Networks: Representation, Inference and Learning,” PhD thesis, Dept. of Computer Science, Univ. of California, Berkeley, 2002.

[5] Y. Xiong and D. Yeung, “Mixtures of ARMA Models for Model-Based Time Series Clustering,” Proc. IEEE Int’l Conf. Data Mining, pp. 717-720, 2002.

[6] A. Jain, M. Murty, and P. Flynn, “Data Clustering: A Review,” ACM Computing Surveys, vol. 31, pp. 264-323, 1999.

[7] R. Xu and D. Wunsch, II, “Survey of Clustering Algorithms,” IEEE Trans. Neural Networks, vol. 16, no. 3, pp. 645-678, May 2005.

[8] N. Dimitrova and F. Golshani, “Motion Recovery for Video Content Classification,” ACM Trans. Information Systems, vol. 13, pp. 408-439, 1995.

[9] W. Chen and S. Chang, “Motion Trajectory Matching of Video Objects,” Proc. SPIE/IS&T Conf. Storage and Retrieval for Media Database, 2000.

[10] C. Faloutsos, M. Ranganathan, and Y. Manolopoulos, “Fast Subsequence Matching in Time-Series Databases,” Proc. ACM SIGMOD, pp. 419-429, 1994.

[11] E. Sahouria and A. Zakhor, “Motion Indexing of Video,” Proc. IEEE Int’l Conf. Image Processing, vol. 2, pp. 526-529, 1997.

[12] C. Cheong, W. Lee, and N. Yahaya, “Wavelet-Based Temporal Clustering Analysis on Stock Time Series,” Proc. Int’l Conf. Quantitative Sciences and Its Applications, 2005.

[13] E. Keogh, K. Chakrabarti, M. Pazzani, and S. Mehrotra, “Locally Adaptive Dimensionality Reduction for Indexing Large Scale Time Series Databases,” Proc. ACM SIGMOD, pp. 151-162, 2001.

[14] F. Bashir, “MotionSearch: Object Motion Trajectory-Based Video Database System—Index, Retrieval, Classification and Recognition,” PhD thesis, Dept. of Electrical Eng., Univ. of Illinois, Chicago, 2005.

[15] E. Keogh and M. Pazzani, “A Simple Dimensionality Reduction Technique for Fast Similarity Search in Large Time Series Databases,” Proc. Pacific-Asia Conf. Knowledge Discovery and Data Mining, pp. 122-133, 2001.

[16] A. Strehl and J. Ghosh, “Cluster Ensembles—A Knowledge Reuse Framework for Combining Multiple Partitions,” J. Machine Learning Research, vol. 3, pp. 583-617, 2002.

[17] X. Fern and C. Brodley, “Solving Cluster Ensemble Problem by Bipartite Graph Partitioning,” Proc. Int’l Conf. Machine Learning, pp. 36-43, 2004.

[18] S. Monti, P. Tamayo, J. Mesirov, and T. Golub, “Consensus Clustering: A Resampling-Based Method for Class Discovery and Visualization of Gene Expression Microarray Data,” Machine Learning, vol. 52, pp. 91-118, 2003.

[19] A. Fred and A. Jain, “Combining Multiple Clusterings Using Evidence Accumulation,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 27, no. 6, pp. 835-850, June 2005.

[20] N. Ailon, M. Charikar, and A. Newman, “Aggregating Inconsistent Information: Ranking and Clustering,” Proc. ACM Symp. Theory of Computing (STOC ’05), pp. 684-693, 2005.

[21] V. Singh, L. Mukerjee, J. Peng, and J. Xu, “Ensemble Clustering Using Semidefinite Programming,” Advances in Neural Information Processing Systems, pp. 1353-1360, 2007.

[22] A. Gionis, H. Mannila, and P. Tsaparas, “Clustering Aggregation,” ACM Trans. Knowledge Discovery from Data, vol. 1, no. 1, article no. 4, Mar. 2007.

[23] Y. Yang and K. Chen, “Temporal Data Clustering via Weighted Clustering Ensemble with Different Representations,” IEEE Trans. Knowledge and Data Engineering, vol. 23, no. 2, Feb. 2011.

[24] A. Topchy, M. Law, A. Jain, and A. Fred, “Analysis of Consensus Partition in Cluster Ensemble,” Proc. IEEE Int’l Conf. Data Mining, pp. 225-232, 2004.

[25] M. Halkidi, Y. Batistakis, and M. Vazirgiannis, “On Clustering Validation Techniques,” J. Intelligent Information Systems, vol. 17, pp. 107-145, 2001.

7. Authors Biography

K. Ratna Prashanthi completed her MCA at Hindu PG College and is pursuing an M.Tech (C.S.E) at DRK Institute of Science and Technology, JNTUH, Hyderabad, Andhra Pradesh, India. Her main research interests include Data Mining and Databases.

Dr. R. V. Krishnaiah received his M.Tech (EIE) from NIT Warangal, his M.Tech (CSE) from JNTU and his Ph.D from JNTU Ananthapur. He holds memberships in the professional bodies MIE, MIETE and MISTE. His main research interests include Image Processing, Security systems, Sensors, Intelligent Systems, Computer networks, Data mining, Software Engineering, network protection and security control. He has published many papers and is an Editorial Member and Reviewer for some national and international journals.