An Ensemble Clustering Model for Mining Concept Drifting
Stream Data in Emergency Management
Yong Zhang
School of Management and Economics
University of Electronic Science and Technology of China
Chengdu, P.R. China
[email protected]
Yi Peng
School of Management and Economics
University of Electronic Science and Technology of China
Chengdu, P.R. China
[email protected]
Jun Li
School of Management and Economics
University of Electronic Science and Technology of China
Chengdu, P.R. China
[email protected]
Gang Kou
School of Management and Economics
University of Electronic Science and Technology of China
Chengdu, P.R. China
[email protected]
Yong Shi
∗College of Information Science & Technology
University of Nebraska at Omaha
Omaha, NE 68182, USA
[email protected]
ABSTRACT
Mining data streams with concept drift is an important and challenging task in both theory and applications such as emergency management. Existing classifiers and their ensembles require massive amounts of labeled training data, which makes the work hard, time-consuming and sometimes impossible. To resolve this issue, we propose an ensemble clustering model for mining concept-drifting stream data in emergency management. Motivated by classifiers, the model mines the data in two steps, “training” and “testing”, using only a small training set. Experimental results demonstrate the effectiveness and performance of the proposed model in mining data streams with concept drift.
Categories and Subject Descriptors
H.2.8 [Information Systems]: Database Applications—
Data Mining
General Terms
Algorithms
Keywords
Ensemble Clustering, Clustering Validity, Stream Data, Concept Drift, Emergency Management
∗Research Center on Fictitious Economy and Data Sciences, Chinese Academy of Sciences, Beijing 100190, China.
1. INTRODUCTION
Records in a data stream are generated continuously over time. Some historical records may carry labels for known outcomes, but the rest, together with newly arriving objects, do not; recognizing or predicting them is the task of data mining and knowledge discovery. Examples include personal credit risk assessment in emergency management and online purchasing. For such stream data, it is beneficial to dynamically and incrementally compute and store two critical layers, the observation layer and the minimal interest layer, which are determined by their conceptual and computational importance in stream data analysis [6]. In the observation layer, the data can be described with synopsis data structures and stored for repeated access at that time granularity, so that data mining algorithms can use them for intelligence analysis, knowledge discovery and prediction [10]. For example, the risk of an earthquake can be assessed from years of geologic activity rather than from daily transactions.
Datasets that contain both labeled and unlabeled instances are commonly handled with classifiers, which process the data through training and testing. However, traditional classifiers are designed for static data and must be modified before they can mine stream data. When stream data are generated by a non-stationary process, machine learning algorithms face a further problem, concept drift. A difficult problem with learning in many real-world domains is that the concept of interest may depend on some hidden context, not given explicitly in the form of predictive features. Changes in the hidden context can induce more or less radical changes in the target concept, which is generally known as concept drift [18].
To address this problem, researchers have presented improved classifiers, but most of them are global eager learners that cannot update their local parts incrementally when needed and therefore cannot adapt to local concept drift, even though concept drift is often local in the real world [15]. For example, when a new potential type of consumer arises in market segmentation, an analyst should update the customer structure based on current results rather than rebuild the model. A further problem for classifiers, including their ensemble versions, is that building them requires a labor-intensive labeling process; often only a small number of labeled samples are available to train a few classifiers, whereas a large number of unlabeled samples are available to build clusters from data streams [19] [12]. In addition, the data structure or distribution may vary noticeably over time (e.g., a new class arises in the test datasets), but classifiers cannot recognize this because they depend on the training datasets [11].
To resolve this problem, we propose a new ensemble clustering model for mining stream data with concept drift. Motivated by classifiers, the fundamental idea is to cluster in two stages by adding a preprocessing step before mining. Unlike the training stage of a classifier, this step selects the optimal clustering algorithm and assesses the characteristics of the datasets. The model can thus be trained and optimized with a small number of representative data chunks with class labels. The optimal parameters and optimal results are then obtained in the second step, “testing”, so the whole procedure resembles a specific kind of “classifier”. Meanwhile, the model partitions the datasets according to the similarity between objects and clusters, which preserves the advantage of unsupervised learning of being independent of labeled training sets; hence the results remain stable.
The main contributions of this paper are as follows:
• We put forward a new model, based on ensemble clustering algorithms, for mining stream data with concept drift.
• Compared with a single clustering algorithm, the proposed model greatly improves accuracy and is widely applicable because it needs no predefined parameter K.
• Following the fundamental idea of classification algorithms, the whole process is divided into two stages, “Training” and “Testing”, which lets the proposed model deal with stream data with concept drift.
The remainder of this paper is organized as follows. Section 2 reviews related work. Section 3 formulates the problem of mining stream data with concept drift and proposes the new Ensemble Clustering Model (ECM), with a detailed description of its methods and theory. Section 4 presents the experiments and results. Finally, Section 5 concludes the paper.
2. RELATED WORK
The proposed framework combines ensemble clustering algorithms and clustering validity criterion functions with MCDM theory, and is used for mining stream data with concept drift. The basic idea is transferred from supervised learning.
A large body of literature addresses clustering validity. Maria Halkidi et al. (2002) surveyed clustering evaluation methods, describing their development history and related definitions; they summarize three main types of clustering validity criteria (external, internal and relative validity indices) and introduce the classical criteria of each type in detail [4, 5]. Two fundamental issues that need to be addressed in cluster validation are: 1) evaluating clustering algorithms and 2) determining the correct number of clusters [13]. Zhao et al. (2005) compared various partitional and agglomerative clustering algorithms for large collections of documents [21].
Analysis of stream data is also an active research topic in data mining, artificial intelligence, knowledge discovery, etc. Haixun Wang et al. (2003) proposed a general framework for mining concept-drifting data streams using weighted ensemble classifiers; the ensemble approach improved both the efficiency of learning the model and the accuracy of classification [17]. Alexey Tsymbal (2004) considered different types of concept drift and the peculiarities of the problem, and gave a critical review of existing approaches to concept drift [15]. Jürgen Beringer and Eyke Hüllermeier (2006) considered the problem of clustering parallel streams of real-valued data, that is, continuously evolving time series; the efficiency of their method was mainly due to a scalable online transformation of the original data, which allowed fast computation of approximate distances between streams [1]. Peng Zhang et al. (2010) proposed a new ensemble model that combined both classifiers and clusters for mining data streams [19]. However, no existing research combines clustering validity with stream data mining.
3. PROBLEM FORMULATION AND THE PROPOSED MODEL
Assume that a stream dataset S consists of an infinite sequence of data records $(x_i, l_i)$, with new objects arriving incrementally over time. Here $x_i \in R^d$ (a data space with d-dimensional attributes) and $l_i$ is the class label of the instance (note that only a small number of objects are labeled correctly and actually carry the labeling information $l_i$) [19]. To simplify the problem, we suppose that the stream data are partitioned into a series of data chunks, each of which can be regarded as a time window (or a sliding window on a timed interval). The most interesting information in a data stream is contained in the most recent instances (some data chunks or windows). These chunks are of two types: objects with class labels and objects without class labels. The fundamental problem in mining stream data with drifting concepts is how to recognize the interesting patterns accurately and in a timely manner when the data in the test sets are no longer consistent with the concept (such as the number of classes) in the training set. To solve this problem, consider the current n data chunks; our goal is to find an appropriate clustering model that mines the potential cluster patterns precisely, as shown in Fig. 1.
Figure 1: A description of the Ensemble Clustering Model.
In general, we denote the data chunk with class labels as $S_1$, and the chunks without class labels (those to be mined) as $S_2, \dots, S_q$ according to the time window. The data chunks $S_2, \dots, S_q$ may overlap or be adjacent (see Section 3.4 for details); note that q < n and $N_{S_1} + N_{S_2,\dots,S_q} = n$, where N is the number of data records. Because of the internal correlation within the same data stream S, there is a close relationship (e.g., in data distribution, shape, structural features, probability density, etc.) between $S_2, \dots, S_q$ and $S_1$, so we can learn the cluster distribution and discover the knowledge behind them through the same clustering model even if there are drifting concepts. To solve the key problem of mining stream data with drifting concepts, we therefore propose an Ensemble Clustering Model for mining stream data (ECM) in this paper. The proposed model also introduces an expert decision support system, which combines the experts' domain knowledge and experience to guide the unsupervised learning through feedback and interactive processing, improving the overall quality of the evaluation results.
As shown in Fig. 2, the proposed MCDM-based cluster validation framework for mining stream data consists of two major parts. The main middle portion of the framework is designed to assess the performance of clustering algorithms on datasets with class labels. As suggested by Brun et al. [2], a validity measure must be closely related to the error rate in order to assess the scientific validity of a clustering algorithm. This study chose external measures to evaluate clustering algorithms because they correspond to error measurement and performed well in predicting the clustering error in previous studies [2]. In the proposed framework, external measures provide inputs to MCDM methods, which rank clustering algorithms using data sets with labels. Top-ranked clustering algorithms are then recommended for data sets that share similar structures but have no class labels. MCDM methods treat clustering algorithms as alternatives and external measures as criteria. Clustering algorithms are ranked according to their performance on the external measures.
Table 1: Table of notations
Symbol     Meaning
N          The number of objects in a data set.
$N_k$      The number of objects in the kth cluster.
$N_t$      The number of objects in the tth class, t ∈ {1, 2, ..., T}.
$N_{tk}$   The number of objects judged to be class t in cluster k.
K          The number of clusters.
T          The number of true classes.
$C_k$      The kth cluster, k ∈ {1, 2, ..., K}.
$x_i$      An object, i ∈ {1, 2, ..., n}.
3.1 Optimal Clustering Algorithm
External criteria are used either (a) for the comparison of a clustering structure C, produced by a clustering algorithm, with a partition P of X drawn independently from C, or (b) for measuring the degree of agreement between a predetermined partition P and the proximity matrix of X [14]. In this situation, the evaluation of clustering results becomes simple and direct: it ignores the expected characteristics of the partition and focuses on the effectiveness of the cluster distribution.
Evaluated with external measures on a data set with class labels, the quality of the clustering results and the performance of each clustering algorithm can be presented clearly and intuitively. Thus, the clustering algorithm best suited to the current data set can be chosen from the candidate algorithms. This is similar to the “training” step of the two-stage learning of classifiers; following this idea, we can “train” a clustering algorithm just as we would a classifier. As a data mining function, cluster analysis can be used as a standalone tool to gain insight into the distribution of data, to observe the characteristics of each cluster, and to focus on a particular set of clusters for further analysis [6]. Table 1 lists the notation used by the classical external validity indices. This study chooses seven external measures for the experiment, which are defined as follows.
3.1.1 Purity
Purity [21] is a simple measure of the number of correctly assigned objects in a clustering. The correct class for each cluster is taken to be its most frequent one. The purity of the kth cluster ($C_k \in C$) is defined as
\[ \mathrm{Purity}(C_k) = \max_{t \in \{1,2,\dots,T\}} \frac{N_{tk}}{N_k} \tag{1} \]
where $t \in \{1,2,\dots,T\}$ is the true class label. The purity P(C) of the whole dataset is the mean of the purities of its clusters; its value ranges over [0,1], and larger is better.
3.1.2 Entropy
Entropy is an information theory concept which measures the information content of messages [8]. In clustering evaluation, entropy measures how the different classes of objects are distributed within each cluster and is defined as [21]:
\[ \mathrm{Entropy}(C_k) = -\frac{1}{\log(N)} \sum_{t=1}^{T} \frac{N_{tk}}{N_k} \log\left(\frac{N_{tk}}{N_k}\right) \tag{2} \]
Figure 2: Stream data mining model based on ensemble clustering.
The entropy E(C) of the whole dataset is the mean of the entropies of its clusters, and its value ranges over [0,1]: 1 means that the classes are uniformly distributed within the clusters, and 0 means that every cluster contains objects of a single class. The value should be minimized.
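3.1.3 F-measure
The F-measure [9] uses the precision and recall concepts from information retrieval. For a class t and a cluster $C_k$:
\[ \mathrm{Prec}(t, C_k) = N_{tk} / N_k \tag{3} \]
\[ \mathrm{Rec}(t, C_k) = N_{tk} / N_t \tag{4} \]
\[ F_{\mathrm{measure}}(t, C_k) = \frac{(b^2 + 1) \cdot \mathrm{Prec}(t, C_k) \cdot \mathrm{Rec}(t, C_k)}{b^2 \cdot \mathrm{Prec}(t, C_k) + \mathrm{Rec}(t, C_k)} \tag{5} \]
If b = 1, the weights of $\mathrm{Prec}(t, C_k)$ and $\mathrm{Rec}(t, C_k)$ are equal. The overall F-measure is the weighted average, over the true classes, of the best F-measure that each class attains in any cluster:
\[ F(C) = \sum_{t=1}^{T} \frac{N_t}{N} \max_{C_k \in C} \left( F_{\mathrm{measure}}(t, C_k) \right) \tag{6} \]
The value ranges over [0,1] and should be maximized.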
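As an illustration of Eqs. (1) to (6), the following Python sketch computes purity, entropy and the overall F-measure from a list of true class labels and a list of cluster assignments. It is not the implementation used in the experiments: the function names are hypothetical, the dataset-level purity and entropy are taken as plain means over clusters as stated in the text, and terms with $N_{tk} = 0$ are assumed to contribute nothing to the entropy sum.

```python
import math
from collections import Counter


def contingency(classes, clusters):
    """Count N_tk: objects of true class t that were placed in cluster k."""
    return Counter(zip(classes, clusters))


def purity(classes, clusters):
    """Mean over clusters of max_t N_tk / N_k (Eq. 1)."""
    n_tk = contingency(classes, clusters)
    n_t = Counter(classes)
    n_k = Counter(clusters)
    per_cluster = [max(n_tk[(t, k)] for t in n_t) / n_k[k] for k in n_k]
    return sum(per_cluster) / len(per_cluster)


def entropy(classes, clusters):
    """Mean over clusters of Eq. 2, normalised by log(N) as in the text."""
    n_tk = contingency(classes, clusters)
    n_t = Counter(classes)
    n_k = Counter(clusters)
    per_cluster = []
    for k in n_k:
        h = 0.0
        for t in n_t:
            p = n_tk[(t, k)] / n_k[k]
            if p > 0:  # assumption: empty cells contribute 0
                h -= p * math.log(p)
        per_cluster.append(h / math.log(len(classes)))
    return sum(per_cluster) / len(per_cluster)


def f_measure(classes, clusters, b=1.0):
    """Class-size-weighted best per-cluster F-measure (Eqs. 3 to 6)."""
    n_tk = contingency(classes, clusters)
    n_t = Counter(classes)
    n_k = Counter(clusters)
    total = 0.0
    for t in n_t:
        best = 0.0
        for k in n_k:
            prec = n_tk[(t, k)] / n_k[k]
            rec = n_tk[(t, k)] / n_t[t]
            if prec + rec > 0:
                f = (b * b + 1) * prec * rec / (b * b * prec + rec)
                best = max(best, f)
        total += n_t[t] / len(classes) * best
    return total


if __name__ == "__main__":
    truth = [1, 1, 1, 2, 2, 2]
    found = ["a", "a", "b", "b", "b", "b"]
    print(purity(truth, found), entropy(truth, found), f_measure(truth, found))
```

In this toy input, cluster "a" is pure and cluster "b" mixes one object of class 1 with three of class 2, so purity is below 1 and entropy is above 0.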
3.1.4 Other Indexes
Consider a clustering structure $C = \{C_1, \dots, C_K\}$ of a data set X and a defined partition $G = \{G_1, \dots, G_T\}$ of the data. We refer to a pair of points $(x_i, x_j)$ from the data set using the following terms [4]:
• a: both points belong to the same cluster of the clustering structure C and to the same group of partition G.
• b: the points belong to the same cluster of C and to different groups of G.
• c: the points belong to different clusters of C and to the same group of G.
• d: both points belong to different clusters of C and to different groups of G.
Letting a, b, c and d also denote the number of pairs of each type, a + b + c + d = M, which is the total number of pairs in the data set (that is, M = N(N−1)/2, where N is the total number of points in the data set). Now we can define the following indices to measure the degree of similarity between C and G. The Rand index measures the similarity between the two partitions and is defined as:
\[ R = (a + d) / M \tag{7} \]
The Jaccard coefficient is a frequently used external clustering validity measure:
\[ J = a / (a + b + c) \tag{8} \]
The above two indices range between 0 and 1, and are maximized when K = T. A binary variable is symmetric if both of its states are equally valuable and carry the same weight; that is, there is no preference on which outcome should be coded as 0 or 1. One such example could be the attribute gender having the states male and female. Dissimilarity that is based on symmetric binary variables is called symmetric binary dissimilarity; it is perfectly negatively correlated with the Rand index [6]. The Fowlkes and Mallows index measures agreement between partitions:
\[ FM = \sqrt{\frac{a}{a+b} \cdot \frac{a}{a+c}} \tag{9} \]
For the previous three indices it has been proven that high values indicate great similarity between C and G. The adjusted Rand index [7] was developed to correct the Rand index for chance:
\[ \omega_A = \frac{a - (a+c)(a+b)/M}{\frac{(a+c) + (a+b)}{2} - (a+c)(a+b)/M} \tag{10} \]
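The pair-counting indices above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the pair counts are obtained by brute force over all M pairs (quadratic in N), and the adjusted Rand index is computed in the standard Hubert and Arabie pair-count form [7], with the numerator a − (a+c)(a+b)/M divided by ((a+c) + (a+b))/2 − (a+c)(a+b)/M.

```python
from itertools import combinations
from math import sqrt


def pair_counts(clusters, groups):
    """Return (a, b, c, d) for a clustering C and a reference partition G."""
    a = b = c = d = 0
    for i, j in combinations(range(len(clusters)), 2):
        same_cluster = clusters[i] == clusters[j]
        same_group = groups[i] == groups[j]
        if same_cluster and same_group:
            a += 1
        elif same_cluster:
            b += 1
        elif same_group:
            c += 1
        else:
            d += 1
    return a, b, c, d


def external_indices(clusters, groups):
    """Rand, Jaccard, Fowlkes-Mallows and adjusted Rand from pair counts.

    Degenerate inputs (for example a == 0, or both partitions trivial)
    make some denominators zero; this sketch does not guard against that.
    """
    a, b, c, d = pair_counts(clusters, groups)
    m = a + b + c + d
    expected = (a + b) * (a + c) / m      # chance-level agreement
    maximum = ((a + b) + (a + c)) / 2     # largest attainable value of a
    return {
        "rand": (a + d) / m,
        "jaccard": a / (a + b + c),
        "fm": sqrt(a / (a + b) * a / (a + c)),
        "adjusted_rand": (a - expected) / (maximum - expected),
    }


if __name__ == "__main__":
    print(external_indices([0, 0, 1, 1, 1, 1], [1, 1, 1, 2, 2, 2]))
```

For identical partitions b = c = 0, so all four indices equal 1.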
3.2 Optimal Cluster Number
Besides evaluating the quality of clustering results, clustering validity criterion functions have another major goal: to choose the best clustering scheme from a set of defined schemes according to a pre-specified criterion (such as the cluster number) on a particular data set, and thereby to optimize the clustering algorithms [5]. A relative criterion selects the optimal parameters and clustering models by applying a predefined evaluation standard to the algorithm, and does not involve statistical tests [21]. Using this idea, we can optimize the input settings of the clustering algorithms recommended by the main middle portion of the framework. In this paper, 10 classical relative validity criteria are chosen for the experiments in Section 4: Hubert, normalized Hubert (nHubert), Dunn, DB, SD, S_Dbw, CS, silhouette (sil), PBM and C-index (CI) [16]; the names in brackets are short names. The concrete model process and method can be found in [20], of which this paper is an extension.
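To make the role of a relative criterion concrete, the sketch below scores candidate partitions with the Dunn index, one of the ten criteria listed above, and keeps the K with the highest score. It is a simplified illustration, not the procedure of this paper, which combines all ten criteria through MCDM. The function names and the toy points are hypothetical, the candidate partitions are assumed to have been produced beforehand by a clustering algorithm, and the Dunn index is taken in its common form (smallest between-cluster distance divided by largest cluster diameter).

```python
from itertools import combinations
from math import dist


def dunn_index(points, labels):
    """Smallest between-cluster distance over largest within-cluster distance.

    Assumes at least two clusters and at least one cluster with two points,
    otherwise the ratio is undefined.
    """
    min_between = float("inf")
    max_diameter = 0.0
    for (p, lp), (q, lq) in combinations(list(zip(points, labels)), 2):
        d = dist(p, q)
        if lp == lq:
            max_diameter = max(max_diameter, d)
        else:
            min_between = min(min_between, d)
    return min_between / max_diameter


def best_k(points, candidates):
    """Pick the K whose partition maximises the Dunn index.

    `candidates` maps each K to the label list obtained with that K.
    """
    return max(candidates, key=lambda k: dunn_index(points, candidates[k]))


if __name__ == "__main__":
    pts = [(0, 0), (0, 1), (5, 0), (5, 1), (10, 0), (10, 1)]
    parts = {2: [0, 0, 0, 0, 1, 1], 3: [0, 0, 1, 1, 2, 2]}
    print(best_k(pts, parts))
```

Criteria that should be minimized (such as DB) would need `min` in place of `max`.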
3.3 MCDM
MCDM (Multi-Criteria Decision Making) is a branch of operational research and optimization methods. In decision making, MCDM methods produce compromise solutions that best satisfy the decision makers' requirements when there are several inconsistent or even contradictory objectives (or attributes of the decision matrix). Common forms of MCDM are Multiple Attribute Decision Making (MADM), Multi-Criteria Decision Analysis (MCDA) and multi-objective programming; decision makers or experts are needed in the solving process to provide decision support. To achieve the expected decision results and control the real-time decision information, some artificial constraints and standard rules must be added to the decision model, and weighting the decision criteria is a common way of expressing them. Hence, combined with the weights given by the expert support system, the framework can work out the optimal algorithm and the correct cluster number by treating the external and relative validity criteria as attribute measures. The MCDM algorithm adopted in the experiments is Promethee II [3].
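The ranking step can be sketched as follows. This is a minimal Promethee II illustration, not the MATLAB implementation used in the experiments: the text does not state which preference function was applied, so the sketch assumes the "usual" criterion (an alternative is fully preferred on a criterion as soon as it is strictly better), and the function names and the toy decision matrix are hypothetical. Rows are alternatives (clustering algorithms or K values), columns are validity criteria, and `maximize` marks whether larger values are better for each criterion.

```python
def promethee_ii(matrix, weights, maximize):
    """Return the net outranking flow of every alternative (row)."""
    n = len(matrix)

    def preference(x, y):
        # Weighted share of criteria on which x is strictly better than y
        # (usual-criterion preference function, an assumption of this sketch).
        total = 0.0
        for fx, fy, w, is_max in zip(x, y, weights, maximize):
            diff = fx - fy if is_max else fy - fx
            if diff > 0:
                total += w
        return total

    flows = []
    for i in range(n):
        leaving = sum(preference(matrix[i], matrix[j]) for j in range(n) if j != i)
        entering = sum(preference(matrix[j], matrix[i]) for j in range(n) if j != i)
        flows.append((leaving - entering) / (n - 1))
    return flows


def rank(flows):
    """Indices of the alternatives ordered from best to worst net flow."""
    return sorted(range(len(flows)), key=lambda i: flows[i], reverse=True)


if __name__ == "__main__":
    # Hypothetical scores: column 0 is purity (maximize), column 1 is entropy (minimize).
    scores = [[0.9, 0.1], [0.8, 0.2], [0.5, 0.5]]
    flows = promethee_ii(scores, weights=[0.5, 0.5], maximize=[True, False])
    print(flows, rank(flows))
```

With weights that sum to 1, the net flow lies in [−1, 1]; an alternative that beats every other one on every criterion receives 1.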
3.4 Stream Data
For the kind of stream data described above, the data chunks used in the optimal algorithm selection (“training”) step must be representative and reflect the data features, which improves the overall accuracy. Meanwhile, the “test datasets” $S_i$ (i = 2, ..., q) should satisfy the minimum number of instances per data chunk that the clustering algorithms require. These are some differences from classification.
Compared with classification, clustering is more sensitive to noisy data. In most cases, noisy data exist in data analysis tasks and are usually meaningless, dirty or even harmful for the mining process. But some outliers are interesting: when homogeneous outlier data increase substantially over time, new classes come into being and concept drift emerges in the stream data. Clustering algorithms can handle this situation because they partition large data sets into groups according to their internal similarity [6]. As new data constantly emerge, the sliding window updates its content, and the model can process the new objects together with the mined instances, as shown in Fig. 3 (a). On the other hand, in the discontinuous situation, the model can also perform the mining analysis and discover knowledge after the content of the sliding window is entirely renewed, as shown in Fig. 3 (b). That is to say, the model does not have to rely on historical data, which is one of the main advantages of the proposed framework that STREAM does not have [6]. The process details are described in the experiment part of this paper.
Figure 3: Stream data in the sliding window: (a) there is an overlap between two windows, (b) adjacent condition.
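The two window conditions can be sketched with a small Python generator. The function name and its `size` and `step` parameters are hypothetical and only serve the illustration: a step smaller than the window size gives overlapping windows, where mined records are processed again together with new ones, and a step equal to the size gives adjacent windows, where the content is entirely renewed.

```python
def sliding_windows(stream, size, step):
    """Yield lists of `size` consecutive records, advancing by `step` records.

    step < size  -> consecutive windows overlap by size - step records
    step == size -> consecutive windows are adjacent
    """
    if not 1 <= step <= size:
        raise ValueError("step must be between 1 and size")
    window = []
    for record in stream:
        window.append(record)
        if len(window) == size:
            yield list(window)
            window = window[step:]  # keep only the records shared with the next window


if __name__ == "__main__":
    print(list(sliding_windows(range(6), size=4, step=2)))  # overlapping windows
    print(list(sliding_windows(range(6), size=3, step=3)))  # adjacent windows
```

Because the generator only buffers one window, it can consume an unbounded stream.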
4. EXPERIMENTS AND RESULTS
In this section, we conduct experiments to present the whole process of the MCDM-based ensemble clustering framework. Our goal is to demonstrate the effectiveness of the proposed model on clustering stream data with concept drift. All the tests were performed on a computer with a 2.50GHz Intel(R) Celeron(R) processor and 2.0GB of memory, running Windows 7 Ultimate. The data sets and the experimental results are discussed in sequence.
4.1 Data Sets
In this paper, we generate synthetic datasets to get an impression of the practical performance and applicability of the proposed framework. An important advantage of synthetic data is that it allows experiments to be conducted in a controlled way and, hence, specific questions to be answered concerning the performance of our model and its behavior under particular conditions [1]. The mining results can also be verified against the class labels of the “Test sets”.
The synthetic datasets contain data points in random circles of fixed radius, generated by MATLAB functions; the objects in each circle can be regarded as belonging to the same class. This gives a simple artificial data stream that produces new instances continuously over time. There are three data sets, each with three main random circles, and, as a control, another class is generated in addition to simulate gradual concept drift [15]. The main characteristics of the three artificial datasets are shown in Table 2. We assume that the data in Random Circle are labeled and that the data in the other two datasets are not labeled during clustering; their true labels are used only for verifying the results. The three datasets are generated in sequence. Random Circle1 contains five records labeled as class 4, which, given the overall amount of data, can be treated as outliers or “noisy data” of this data set. Random Circle2 contains about twice as many data points as Random Circle1 and more outliers labeled as class 4, so class 4 can clearly be recognized as a new class. The data distributions of the three sets are shown in Fig. 4.
Table 2: Description of the datasets
Datasets         Instances Size   Dimensions   Classes   Description
Random Circle    1300             2            3         “TrainingSet” with labels
Random Circle1   1305             2            3         “TestSet1” without labels
Random Circle2   2650             2            4         “TestSet2” without labels
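A data chunk of this kind can be imitated with the short Python sketch below. It is a stand-in for the MATLAB generation described above, not the original script: the function name, the circle centers, the radius and the per-class counts are all hypothetical, since the text reports only the total sizes in Table 2 and the five class-4 records in Random Circle1. Points are drawn uniformly inside each circle, and the label of a point is the index of its circle.

```python
import math
import random


def random_circle_chunk(centers, radius, per_class, seed=0):
    """Return a list of ((x, y), label) pairs, one class per circle.

    centers   -- circle centers, one per class (hypothetical values)
    radius    -- fixed radius shared by all circles
    per_class -- number of points to draw for each class
    """
    rng = random.Random(seed)
    chunk = []
    for label, (center, count) in enumerate(zip(centers, per_class), start=1):
        cx, cy = center
        for _ in range(count):
            # sqrt makes the points uniform over the disc area, not the radius
            r = radius * math.sqrt(rng.random())
            theta = rng.uniform(0.0, 2.0 * math.pi)
            chunk.append(((cx + r * math.cos(theta), cy + r * math.sin(theta)), label))
    return chunk


if __name__ == "__main__":
    centers = [(2.5, 5.0), (5.0, 4.0), (6.5, 6.0), (4.0, 2.5)]  # hypothetical
    # A chunk with a few class-4 outliers, imitating slight concept drift.
    drifted = random_circle_chunk(centers, 1.0, [435, 435, 430, 5], seed=1)
    print(len(drifted))
```

Raising the last count in `per_class` from chunk to chunk imitates the gradual emergence of a new class.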
4.2 Results and Discussions
In this experiment, six classical clustering algorithms are adopted: DBScan, DensityBasedKmeans, EM, HierarchicalCluster, FarthestFirst and Kmeans, which represent density-based, model-based, hierarchical and partitioning methods. All clustering algorithms are run in the KNIME software with the Weka Data Mining Integration, and the MCDM algorithm is implemented in MATLAB 7.0. The weight of each measure is assigned by the decision support system with twelve domain experts; the values are shown in Table 3. According to the framework, the optimal clustering algorithm is EM, based on the “training” results on dataset Random Circle. Then Random Circle1 and Random Circle2 are clustered by EM with different values of the input parameter K, which varies from 2 to 10 in these experiments. All clustering results are evaluated with the relative validity criteria, and the decision matrices are formed from these evaluation results. Finally, the optimal K values and clustering results are selected with the Promethee II algorithm. The complete results are shown in Table 4, where the top-ranked algorithm and K values are those with Order 1. The results show that, when slight concept drift happens, the five new outlier records in Random Circle1 are assigned to other clusters by the proposed model (the selected K is 3). As time goes by, the gradual concept drift may change the structure of the data stream substantially, and for Random Circle2 the selected K is 4. So the proposed model can recognize the cluster partitions of the data chunks accurately when there are more outlier instances and the data distribution has actually changed.
Figure 4: Data distributions of the three datasets: (a) Random Circle, (b) Random Circle1, (c) Random Circle2.
Table 3: Weights of All Validity Criterions
Externals   P       F       R       adjusted R   J       FM      E
weight      0.162   0.119   0.171   0.118        0.134   0.138   0.158
Relatives   Dunn      sil       PBM       Hubert    nHubert   DB        SD        SDbw      CS        CI
weight      0.072     0.042     0.133     0.069     0.058     0.185     0.159     0.057     0.165     0.060
Table 4: Results of the experiments
Random Circle                           Random Circle1             Random Circle2
Algorithms            Value     Order   K      Value      Order    K      Value      Order
DBScan                -0.7752   5       K=2    0.43925    2        K=2    0.18975    4
DensityKmeans         0.6       2       K=3    0.732      1        K=3    0.1715     5
EM                    1         1       K=4    0.09375    4        K=4    0.55225    1
FarthestFirst         -0.8248   6       K=5    0.3805     3        K=5    0.279      3
HierarchicalCluster   -0.2      4       K=6    -0.58225   9        K=6    0.326      2
Kmeans                0.2       3       K=7    -0.047     5        K=7    -0.37225   7
                                        K=8    -0.212     6        K=8    -0.66975   9
                                        K=9    -0.31575   7        K=9    -0.37725   8
                                        K=10   -0.4885    8        K=10   -0.09925   6
5. CONCLUSIONS AND FUTURE WORKS
In this paper, we proposed a new Ensemble Clustering Model for mining stream data (ECM) after an analysis of data stream mining. The framework combines MCDM theory with clustering validity criteria to achieve unsupervised learning, and additionally adopts an expert support system. With the knowledge and experience of domain experts, the proposed framework can guide the mining process interactively according to feedback and improve the overall results. The fundamental idea of this model is to divide the clustering into two stages: “Training” and “Testing”. A preprocessing step before clustering and a validation of the results are added to the framework, and the optimal algorithm and parameter are obtained to get the optimal clustering results when mining stream data with concept drift.
The proposed model has limitations in some situations. For example, when the number of clusters in the data set is large, the range of the K value must be roughly determined in advance to reduce the number of “testing” clustering iterations. In addition, we only conducted experiments on artificial data sets in this article; it is necessary to extend them to real data sets and to compare the results with classification algorithms or other models. These are our next tasks.
6. ACKNOWLEDGMENTS
This research has been partially supported by grants from the Foxconn “ZhuoCai” Funds (11F81210102).
7. REFERENCES
[1] J. Beringer and E. Hüllermeier. Online clustering of parallel data streams. Data and Knowledge Engineering, 58(2):180–204, August 2006.
[2] M. Brun, C. Sima, J. Hua, J. Lowey, B. Carroll, E. Suh, and E. R. Dougherty. Model-based evaluation of clustering validation measures. Pattern Recognition, 40(3):807–824, 2007.
[3] J. Figueira, S. Greco, M. Ehrgott, J.-P. Brans, and B. Mareschal. Promethee methods. In Multiple Criteria Decision Analysis: State of the Art Surveys, volume 78 of International Series in Operations Research and Management Science, pages 163–186. Springer New York, 2005.
[4] M. Halkidi, Y. Batistakis, and M. Vazirgiannis. Cluster validity methods: part i. SIGMOD Rec., 31(2):40–45, 2002.
[5] M. Halkidi, Y. Batistakis, and M. Vazirgiannis. Clustering validity checking methods: part ii. SIGMOD Rec., 31(3):19–27, September 2002.
[6] J. Han and M. Kamber. Data Mining: Concepts and Techniques, 2nd edition. Morgan Kaufmann, 2006.
[7] L. Hubert and P. Arabie. Comparing partitions. Journal of Classification, 2:193–218, 1985. 10.1007/BF01908075.
[8] A. K. Jain and R. C. Dubes. Algorithms for clustering data. Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 1988.
[9] B. Larsen and C. Aone. Fast and effective text mining using linear-time document clustering. KDD-99, pages 16–22, 1999.
[10] Y. Peng, G. Kou, Y. Shi, and Z. Chen. A descriptive framework for the field of data mining and knowledge discovery. International Journal of Information Technology & Decision Making, 7(4):639–682, 2008.
[11] Y. Peng, G. Kou, G. Wang, and Y. Shi. Famcdm: A fusion approach of mcdm methods to rank multiclass classification algorithms. Omega, 39(6):677–689, 2011.
[12] Y. Peng, G. Kou, G. Wang, W. Wu, and Y. Shi. Ensemble of software defect predictors: an ahp-based evaluation method. International Journal of Information Technology & Decision Making, 10(1):187–206, 2011.
[13] P. Tan, M. Steinbach, and V. Kumar. Introduction to Data Mining. Addison-Wesley, 2005.
[14] S. Theodoridis and K. Koutroumbas. Pattern recognition, Fourth edition. Academic Press, 2008.
[15] A. Tsymbal. The problem of concept drift: Definitions and related work. Technical report, Department of Computer Science, Trinity College: Dublin, Ireland, 2004.
[16] L. Vendramin, R. Campello, and E. Hruschka. Relative clustering validity criteria: A comparative overview. Statistical Analysis and Data Mining, 3(4):209–235, 2010.
[17] H. Wang, W. Fan, P. S. Yu, and J. Han. Mining concept-drifting data streams using ensemble classifiers. Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 24–27, August 2003.
[18] G. Widmer and M. Kubat. Learning in the presence of concept drift and hidden contexts. Machine Learning, 23(1):69–101, 1996.
[19] P. Zhang, X. Zhu, J. Tan, and L. Guo. Classifier and cluster ensembles for mining concept drifting data streams. 2010 IEEE International Conference on Data Mining, pages 1175–1180, 2010.
[20] Y. Zhang, Y. Peng, J. Li, and Y. Shi. A clustering validity model based on multiple criteria decision making. The Sixth Chinese Academy of Management Annual Meeting, 2011.
[21] Y. Zhao, G. Karypis, and U. Fayyad. Hierarchical clustering algorithms for document datasets. Data Mining and Knowledge Discovery, 10(2):141–168, 2005.