Data Mining:
Concepts and Techniques
(3rd ed.)
— Chapter 10 —
Jiawei Han, Micheline Kamber, and Jian Pei University of Illinois at Urbana-Champaign &
Simon Fraser University
Chapter 10. Cluster Analysis: Basic Concepts and Methods
Cluster Analysis: Basic Concepts
Partitioning Methods
Hierarchical Methods
Density-Based Methods
Grid-Based Methods
Evaluation of Clustering
Summary
What is Cluster Analysis?
Cluster: A collection of data objects
similar (or related) to one another within the same group
dissimilar (or unrelated) to the objects in other groups
Cluster analysis
Finding similarities between data according to the
characteristics found in the data and grouping similar data objects into clusters
Unsupervised learning: no predefined classes
Typical applications
As a stand-alone tool to get insight into data distribution
As a preprocessing step for other algorithms
What is Clustering?
Clustering is the classification of objects into different groups, or more precisely, the partitioning of a data set into subsets (clusters), so that the data in each subset (ideally) share some common trait, often according to some defined distance measure.
Considerations for Cluster Analysis
Partitioning criteria
Single level vs. hierarchical partitioning (often, multi-level hierarchical partitioning is desirable)
Separation of clusters
Exclusive (e.g., one customer belongs to only one region) vs. non-exclusive (e.g., one document may belong to more than one class)
Similarity measure
Distance-based (e.g., Euclidean) vs. connectivity-based (e.g., density)
Clustering space
Full space (often when low dimensional) vs. subspaces (often in high-dimensional clustering)
Quality: What Is Good Clustering?
A good clustering method will produce high quality clusters
high intra-class similarity: cohesive within clusters
low inter-class similarity: distinctive between clusters
The quality of a clustering method depends on
the similarity measure used by the method
its implementation, and
its ability to discover some or all of the hidden patterns
Types of Clustering
1. Partitioning approach: Construct various partitions and then evaluate them by some criterion
2. Hierarchical approach: Create a hierarchical decomposition of the set of data (or objects) using some criterion
3. Density-based approach: Based on connectivity and density functions
4. Grid-based approach: Based on a multiple-level granularity structure
1) Partitioning Method
December 16, 2021, Data Mining: Concepts and Techniques
Algorithms for Partitioning Methods
K-Means Algorithm
K-Medoids Algorithm
CLARANS (Clustering Large Applications based upon RANdomized Search)
2) Hierarchical Methods
Algorithms for Hierarchical Methods
AGNES (AGglomerative NESting)
DIANA (DIvisive ANAlysis)
BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies)
CHAMELEON (Hierarchical Clustering Using Dynamic Modeling)
3) Density-Based Methods
Algorithms for Density-Based Methods
DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
OPTICS (Ordering Points To Identify the Clustering Structure)
DENCLUE (DENsity-based CLUstEring)
4) Grid-Based Methods
Algorithms for Grid-Based Methods
STING (STatistical INformation Grid)
CLIQUE: An Apriori-like Subspace Clustering Method
WaveCluster
Common Distance measures:
The distance measure determines how the similarity of two elements is calculated, and it influences the shape of the clusters.
They include:
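1. The Euclidean distance (also called the 2-norm distance) between two points x = (x_1, ..., x_m) and y = (y_1, ..., y_m) is given by:
$$d(x, y) = \sqrt{\sum_{i=1}^{m} (x_i - y_i)^2}$$
2. The Manhattan distance (also called the taxicab norm or 1-norm) is given by:
$$d(x, y) = \sum_{i=1}^{m} |x_i - y_i|$$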
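A minimal Python sketch of the two distance measures named above. The function names and the sample points are illustrative assumptions, not part of the slides.

```python
from math import sqrt

def euclidean(x, y):
    """2-norm distance between two equal-length coordinate sequences."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    """1-norm (taxicab) distance between two equal-length coordinate sequences."""
    return sum(abs(a - b) for a, b in zip(x, y))

# Hypothetical points chosen so the arithmetic is easy to check by hand.
p, q = (1.0, 1.0), (4.0, 5.0)
print(euclidean(p, q))  # 5.0  (sqrt(3^2 + 4^2))
print(manhattan(p, q))  # 7.0  (3 + 4)
```

The Manhattan distance is never smaller than the Euclidean distance for the same pair of points, which is one reason the choice of measure changes the shape of the resulting clusters.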
Partitioning Algorithms: Basic Concept
Partitioning method: Partition a database D of n objects into a set of k clusters such that the sum of squared distances is minimized (where c_i is the centroid or medoid of cluster C_i)
Global optimal: exhaustively enumerate all partitions
Heuristic methods: k-means and k-medoids algorithms
$$E = \sum_{i=1}^{k} \sum_{p \in C_i} \left( d(p, c_i) \right)^2$$
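The criterion E, the sum over all clusters of the squared distance from each object p to its cluster representative c_i, can be computed directly. The sketch below assumes Euclidean distance and uses hypothetical clusters and representatives.

```python
from math import dist  # Euclidean distance, Python 3.8+

def sse(clusters, representatives):
    """Sum of squared distances from every object to its cluster's representative.

    clusters[i] is the list of points in cluster C_i and
    representatives[i] is its centroid or medoid c_i.
    """
    return sum(
        dist(p, c) ** 2
        for members, c in zip(clusters, representatives)
        for p in members
    )

# Hypothetical example: two clusters with their centroids.
clusters = [[(0.0, 0.0), (2.0, 0.0)], [(5.0, 5.0)]]
centroids = [(1.0, 0.0), (5.0, 5.0)]
print(sse(clusters, centroids))  # 2.0  (1 + 1 + 0)
```

A partitioning method searches for the assignment of objects to clusters that makes this value as small as possible.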
Distance formula (2-D): $d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}$
K-MEANS CLUSTERING
The k-means algorithm clusters n objects, based on their attributes, into k partitions, where k < n.
It is similar to the expectation-maximization algorithm for mixtures of Gaussians in that they both attempt to find the centers of natural clusters in the data.
It assumes that the object attributes form a vector space.
It is an algorithm for partitioning (or clustering) N data points into K disjoint subsets S_j containing data points so as to minimize the sum-of-squares criterion
$$J = \sum_{j=1}^{K} \sum_{n \in S_j} \| x_n - \mu_j \|^2$$
where x_n is a vector representing the n-th data point and \mu_j is the geometric centroid of the data points in S_j.
Simply put, k-means clustering is an algorithm that groups objects into K groups based on their attributes/features.
K is a positive integer.
The grouping is done by minimizing the sum of squared distances between the data points and the corresponding cluster centroid.
An Example of K-Means Clustering
[Figure, K = 2: the initial data set; arbitrarily partition objects into k groups; update the cluster centroids; reassign objects; update the cluster centroids; loop if needed]
Partition objects into k nonempty subsets
Repeat
Compute centroid (i.e., mean point) for each partition
Assign each object to the cluster of its nearest centroid
Until no change
How Does the K-Means Clustering Algorithm Work?
Step 1: Decide on the value of k, the number of clusters.
Step 2: Choose any initial partition that classifies the data into k clusters. You may assign the training samples randomly, or systematically as follows:
1. Take the first k training samples as single-element clusters.
2. Assign each of the remaining (N-k) training samples to the cluster with the nearest centroid. After each assignment, recompute the centroid of the gaining cluster.
Step 3: Take each sample in sequence and compute its distance from the centroid of each cluster. If a sample is not currently in the cluster with the closest centroid, switch it to that cluster and update the centroids of both the cluster gaining the sample and the cluster losing it.
Step 4: Repeat Step 3 until convergence is achieved, that is, until a pass through the training samples causes no new assignments.
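The four steps above can be sketched in Python roughly as follows. This is an illustrative reading of the procedure, not a reference implementation: the function names, the `max_passes` safety limit, the rule that a cluster is never emptied, and the sample data are all assumptions.

```python
from math import dist  # Euclidean distance, Python 3.8+

def centroid(points):
    """Mean point of a non-empty list of equal-length tuples."""
    return tuple(sum(c) / len(points) for c in zip(*points))

def sequential_kmeans(samples, k, max_passes=100):
    # Step 2.1: the first k samples start as single-element clusters.
    members = [[i] for i in range(k)]
    labels = list(range(k)) + [None] * (len(samples) - k)
    centroids = [tuple(samples[i]) for i in range(k)]

    def nearest(p):
        return min(range(k), key=lambda j: dist(p, centroids[j]))

    def refresh(j):
        centroids[j] = centroid([samples[m] for m in members[j]])

    # Step 2.2: assign the remaining samples, updating the gaining cluster.
    for i in range(k, len(samples)):
        j = nearest(samples[i])
        labels[i] = j
        members[j].append(i)
        refresh(j)

    # Steps 3 and 4: pass over the samples until no sample switches cluster.
    for _ in range(max_passes):
        changed = False
        for i, p in enumerate(samples):
            old, new = labels[i], nearest(p)
            # Assumption: never empty a cluster by moving its last member.
            if new != old and len(members[old]) > 1:
                members[old].remove(i)
                members[new].append(i)
                labels[i] = new
                refresh(old)
                refresh(new)
                changed = True
        if not changed:
            break
    return labels, centroids

# Hypothetical data: two well-separated groups of three points each.
samples = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids = sequential_kmeans(samples, k=2)
print(labels)  # [0, 0, 0, 1, 1, 1]
```

Note that the second sample starts as its own cluster and only joins the first group during the Step 3 pass, which shows why the repeated passes are needed.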
A simple example showing the implementation of the k-means algorithm (using K=2)
Step 1:
Initialization: We randomly choose the following two centroids (k=2) for the two clusters.
In this case the two centroids are: m1=(1.0,1.0) and m2=(5.0,7.0).
Step 2:
Thus, we obtain two clusters containing:
{1,2,3} and {4,5,6,7}.
Their new centroids are:
Step 3:
Now using these centroids we compute the Euclidean distance of each object, as shown in table.
Therefore, the new clusters are:
{1,2} and {3,4,5,6,7}
Next centroids are:
m1=(1.25,1.5) and m2=(3.9,5.1)
Step 4 :
The clusters obtained are:
{1,2} and {3,4,5,6,7}
Therefore, there is no change in the clusters.
Thus, the algorithm halts here and the final result consists of 2 clusters:
{1,2} and {3,4,5,6,7}.
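The iterations above can be replayed in a few lines of Python. The data table did not survive in this text, so the seven points below are an assumption, chosen because they reproduce the centroids m1=(1.25,1.5) and m2=(3.9,5.1) quoted in Step 3. The helper names are also illustrative.

```python
from math import dist  # Euclidean distance, Python 3.8+

# ASSUMED data for objects 1..7 (not given in the text above).
points = [(1.0, 1.0), (1.5, 2.0), (3.0, 4.0), (5.0, 7.0),
          (3.5, 5.0), (4.5, 5.0), (3.5, 4.5)]

def assign(points, centroids):
    """Index of the nearest centroid for every point (ties go to the first)."""
    return [min(range(len(centroids)), key=lambda j: dist(p, centroids[j]))
            for p in points]

def update(points, labels, k):
    """Mean point of each cluster."""
    result = []
    for j in range(k):
        members = [p for p, lab in zip(points, labels) if lab == j]
        result.append(tuple(sum(c) / len(members) for c in zip(*members)))
    return result

centroids = [(1.0, 1.0), (5.0, 7.0)]  # Step 1: initial centroids m1, m2
labels = None
while True:
    new_labels = assign(points, centroids)
    if new_labels == labels:          # no object changed cluster: stop
        break
    labels = new_labels
    centroids = update(points, labels, k=2)

clusters = [[i + 1 for i, lab in enumerate(labels) if lab == j] for j in range(2)]
print(clusters)  # [[1, 2], [3, 4, 5, 6, 7]]
```

Under these assumed points the loop stops with the same two clusters as the example, {1,2} and {3,4,5,6,7}.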
[Plot of the resulting clusters]
[Plot of the clustering with K=3]
Exercise
Consider the 1-D data set {1,2,3,4,7,9} with K=2.
Identify the clusters and their centroids.
Tip use
Homework
Use the k-means algorithm and Euclidean distance to cluster the following 8 examples into 3 clusters:
A1=(2,10), A2=(2,5), A3=(8,4), A4=(5,8), A5=(7,5), A6=(6,4), A7=(1,2), A8=(4,9).
Plot the resulting clusters in a scatter plot.
What Is the Problem of the K-Means Method?
The k-means algorithm is sensitive to outliers, since an object with an extremely large value may substantially distort the distribution of the data.
K-Medoids: Instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be used, which is the most centrally located object in a cluster.
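A small sketch of why a medoid is less sensitive to outliers than a mean. The 1-D values are hypothetical, and the medoid is taken here as the object with the smallest total absolute distance to all other objects in the cluster.

```python
def mean(values):
    """Arithmetic mean, the reference point used by k-means."""
    return sum(values) / len(values)

def medoid(values):
    """The object with the smallest total distance to all objects in the cluster."""
    return min(values, key=lambda v: sum(abs(v - w) for w in values))

# Hypothetical cluster with one extreme outlier (100).
values = [1, 2, 3, 4, 100]
print(mean(values))    # 22.0 -> pulled far away from most of the data
print(medoid(values))  # 3    -> still an actual, centrally located object
```

The mean lands at a value that no object is close to, while the medoid remains one of the objects in the dense part of the cluster.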
Determine the Number of Clusters
Empirical method
Number of clusters: k ≈ √(n/2) for a dataset of n points,
e.g., n = 200, k = 10
How many clusters for n = 900?
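The rule of thumb can be evaluated directly. The sketch below reads the formula as the square root of n/2, which is the reading that matches the n = 200, k = 10 example; rounding to the nearest integer is an added assumption, as is the function name.

```python
from math import sqrt

def rule_of_thumb_k(n):
    """Empirical estimate k ~ sqrt(n / 2), rounded to the nearest integer."""
    return round(sqrt(n / 2))

print(rule_of_thumb_k(200))  # 10
print(rule_of_thumb_k(900))  # 21  (sqrt(450) is about 21.2)
```

This is only a starting point; the estimate ignores the actual structure of the data.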
Measuring Clustering Quality
3 kinds of measures: External, internal and relative
External: supervised, employ criteria not inherent to the dataset
Compare a clustering against prior or expert-specified knowledge using certain clustering quality measure
Internal: unsupervised, criteria derived from data itself
Evaluate the goodness of a clustering by considering how well the clusters are separated and how compact the clusters are, e.g., Silhouette coefficient
Relative: directly compare different clusterings, usually those obtained via different parameter settings for the same algorithm
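As an illustration of an internal measure, the sketch below computes the mean silhouette coefficient. For each point, a is the mean distance to the other members of its own cluster, b is the smallest mean distance to the members of any other cluster, and the score is (b - a) / max(a, b). Scoring singleton clusters as 0 and the sample data are assumptions of this sketch.

```python
from math import dist  # Euclidean distance, Python 3.8+

def silhouette(points, labels):
    """Mean silhouette coefficient over all points; values near 1 are better."""
    scores = []
    for i, p in enumerate(points):
        # Distances from p to every other point, grouped by cluster label.
        by_cluster = {}
        for j, q in enumerate(points):
            if i != j:
                by_cluster.setdefault(labels[j], []).append(dist(p, q))
        own = by_cluster.pop(labels[i], None)
        if not own or not by_cluster:
            scores.append(0.0)  # singleton cluster or only one cluster
            continue
        a = sum(own) / len(own)                                 # compactness
        b = min(sum(d) / len(d) for d in by_cluster.values())   # separation
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Hypothetical data: two tight pairs of points far apart.
points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = silhouette(points, [0, 0, 1, 1])  # natural grouping, about 0.90
bad = silhouette(points, [0, 1, 0, 1])   # grouping that splits each pair, negative
print(good > bad)  # True
```

Comparing the score of two labelings of the same data, as done here, is also a simple instance of a relative comparison.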
Visualization of Clustering
Summary
Cluster analysis groups objects based on their similarity and has wide applications
Clustering algorithms can be categorized into partitioning methods, hierarchical methods, density-based methods, grid-based methods, and model-based methods
K-means and K-medoids algorithms are popular partitioning-based clustering algorithms
BIRCH and CHAMELEON are interesting hierarchical clustering algorithms, and there are also probabilistic hierarchical clustering algorithms
DBSCAN, OPTICS, and DENCLUE are interesting density-based algorithms
STING and CLIQUE are grid-based methods, where CLIQUE is also a subspace clustering algorithm
Quality of clustering results can be evaluated in various ways