Data Mining:
Concepts and Techniques
(3rd ed.)
Chapter 10
Jiawei Han, Micheline Kamber, and Jian Pei
University of Illinois at Urbana-Champaign & Simon Fraser University
Chapter 10. Cluster Analysis: Basic
Concepts and Methods
Cluster Analysis: Basic Concepts
Partitioning Methods
Hierarchical Methods
Density-Based Methods
Grid-Based Methods
Evaluation of Clustering
Summary
What is Cluster Analysis?
Cluster: A collection of data objects
similar (or related) to one another within the same group
dissimilar (or unrelated) to the objects in other groups
Cluster analysis
Finding similarities between data according to the characteristics found in the data and grouping similar data objects into clusters
Unsupervised learning: no predefined classes
Typical applications
As a stand-alone tool to get insight into data distribution
As a preprocessing step for other algorithms
What is Clustering
Clustering is the classification of objects into different groups, or more precisely, the partitioning of a data set into subsets (clusters), so that the data in each subset (ideally) share some common trait - often according to some
Considerations for Cluster Analysis
Partitioning criteria
Single level vs. hierarchical partitioning (often, multi-level hierarchical partitioning is desirable)
Separation of clusters
Exclusive (e.g., one customer belongs to only one region) vs. non-exclusive (e.g., one document may belong to more than one class)
Similarity measure
Distance-based (e.g., Euclidean) vs. connectivity-based (e.g., density)
Clustering space
Full space (often when low dimensional) vs. subspaces (often in high-dimensional clustering)
Quality: What Is Good Clustering?
A good clustering method will produce high quality clusters
high intra-class similarity: cohesive within clusters
low inter-class similarity: distinctive between clusters
The quality of a clustering method depends on
the similarity measure used by the method,
its implementation, and
its ability to discover some or all of the hidden patterns
Types of Clustering
1. Partitioning approach:
Construct various partitions and then evaluate them by some criterion
2. Hierarchical approach:
Create a hierarchical decomposition of the set of data (or objects) using some criterion
3. Density-based approach:
Based on connectivity and density functions
4. Grid-based approach:
Based on a multiple-level granularity structure
1) Partitioning Method
May 29, 2019 Data Mining: Concepts and Techniques
Algorithms for partitioning methods
K-Means algorithm
K-Medoids algorithm
CLARANS (Clustering Based Algorithm for
2) Hierarchical Methods
Algorithms for hierarchical methods
AGNES (AGglomerative NESting)
DIANA (DIvisive ANAlysis)
BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies)
CHAMELEON (Clustering Using Dynamic Modeling)
3) Density-Based Methods
Algorithms for density-based methods
DBSCAN (Density-Based Clustering Based on Connected Regions with High Density)
OPTICS (Ordering Points to Identify the Clustering Structure)
4) Grid-Based Methods
Algorithms for grid-based methods
STING (STatistical INformation Grid)
CLIQUE: An Apriori-like Subspace Clustering Method
Common distance measures:
The distance measure determines how the similarity of two elements is calculated, and it influences the shape of the clusters.
They include:
1. The Euclidean distance (also called the 2-norm distance), given by: $d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}$
2. The Manhattan distance (also called the taxicab norm or 1-norm), given by: $d(p, q) = \sum_{i=1}^{n} |p_i - q_i|$
Distance formula (2-D): $d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}$
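The two measures above can be sketched in a few lines of Python. This is only an illustration: the function names `euclidean` and `manhattan` are chosen here and are not part of the slides, and the two sample points are the initial centroids m1 and m2 from the worked example later in the deck.

```python
import math

def euclidean(p, q):
    """2-norm distance: square root of the summed squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    """1-norm (taxicab) distance: sum of the absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))

# Two 2-D points taken from the worked k-means example (m1 and m2)
m1, m2 = (1.0, 1.0), (5.0, 7.0)
print(euclidean(m1, m2))  # sqrt(4^2 + 6^2) = sqrt(52), about 7.21
print(manhattan(m1, m2))  # 4 + 6 = 10.0
```

The same pair of points is farther apart under the Manhattan distance than under the Euclidean distance, which is one reason the choice of measure changes the shape of the resulting clusters.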
How does the K-Means clustering algorithm work?
An Example of K-Means Clustering
[Figure: K=2. The initial data set; arbitrarily partition objects into k groups; update the cluster centroids; reassign objects; loop if needed]
Partition objects into k nonempty subsets
Repeat
Compute centroid (i.e., mean point) for each partition
Assign each object to the cluster of its nearest centroid
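The partition-and-repeat loop above might look roughly like the following Python sketch. It is a minimal batch version under stated assumptions: the six sample points, the initial labels, the `max_iter` cap, and the helper names (`centroid`, `nearest`, `kmeans`) are all made up for illustration, and the loop stops when no object changes cluster.

```python
import math

def centroid(points):
    """Mean point of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def nearest(point, centroids):
    """Index of the centroid closest to point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda j: math.dist(point, centroids[j]))

def kmeans(points, labels, max_iter=100):
    """Batch k-means starting from an initial partition given as cluster labels.

    Assumption: every cluster in the initial partition is non-empty.
    If a cluster becomes empty later, its previous centroid is kept.
    """
    k = max(labels) + 1
    centroids = None
    for _ in range(max_iter):
        # Compute the centroid (mean point) of each partition
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            new_centroids.append(centroid(members) if members else centroids[j])
        centroids = new_centroids
        # Assign each object to the cluster of its nearest centroid
        new_labels = [nearest(p, centroids) for p in points]
        if new_labels == labels:  # nothing moved, so stop looping
            break
        labels = new_labels
    return labels, centroids

# Hypothetical data: two visually obvious groups, arbitrarily partitioned at first
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
initial = [0, 1, 0, 1, 0, 1]
labels, centroids = kmeans(points, initial)
print(labels)  # [0, 0, 0, 1, 1, 1]
```

Starting from a deliberately poor partition, one reassignment pass is enough here to separate the two groups; the second pass changes nothing and ends the loop.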
Step 1: Begin with a decision on the value of k = number of clusters.
Step 2: Put any initial partition that classifies the data into k clusters. You may assign the training samples randomly, or systematically as follows:
1. Take the first k training samples as single-element clusters.
2. Assign each of the remaining (N-k) training samples to the cluster with the nearest centroid. After each assignment, recompute the centroid of the gaining cluster.
Step 3: Take each sample in sequence and compute its distance from the centroid of each of the clusters. If a sample is not currently in the cluster with the closest centroid, switch this sample to that cluster and update the centroids of the cluster gaining the sample and the cluster losing the sample.
Step 4: Repeat step 3 until convergence is achieved, that is until a pass through the
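Steps 1 to 4 describe a sequential variant in which centroids are updated after every single move. A possible Python sketch follows, with several assumptions labeled in the comments: convergence is taken to mean a full pass in which no sample switches cluster, a sample is never moved out of a single-element cluster (so no cluster becomes empty), and the sample data and the name `sequential_kmeans` are hypothetical.

```python
import math

def mean(points):
    """Mean point of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))

def sequential_kmeans(samples, k, max_passes=100):
    # Step 2.1: the first k samples start as single-element clusters
    clusters = [[i] for i in range(k)]            # lists of sample indices
    centroids = [samples[i] for i in range(k)]
    member_of = list(range(k)) + [None] * (len(samples) - k)

    def closest(i):
        return min(range(k), key=lambda j: math.dist(samples[i], centroids[j]))

    def recompute(j):
        centroids[j] = mean([samples[m] for m in clusters[j]])

    # Step 2.2: assign each remaining sample, recomputing the gaining centroid
    for i in range(k, len(samples)):
        j = closest(i)
        clusters[j].append(i)
        member_of[i] = j
        recompute(j)

    # Steps 3 and 4: repeat passes until no sample switches (assumed convergence test)
    for _ in range(max_passes):
        switched = False
        for i in range(len(samples)):
            old, new = member_of[i], closest(i)
            # Assumption: never empty a cluster by moving its last sample
            if new != old and len(clusters[old]) > 1:
                clusters[old].remove(i)
                clusters[new].append(i)
                member_of[i] = new
                recompute(old)   # cluster losing the sample
                recompute(new)   # cluster gaining the sample
                switched = True
        if not switched:
            break
    return clusters, centroids

# Hypothetical samples; the first k=2 of them seed the clusters
samples = [(1, 1), (8, 8), (1, 2), (2, 1), (9, 8), (8, 9)]
clusters, centroids = sequential_kmeans(samples, k=2)
print(clusters)  # [[0, 2, 3], [1, 4, 5]]
```

Because the seeds are simply the first k samples, the result can depend on the order of the data, which is worth keeping in mind when comparing runs.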
A simple example showing the implementation of the k-means algorithm (using K=2)
Step 1:
Initialization: We randomly choose the following two centroids (k=2) for the two clusters.
In this case the two centroids are: m1=(1.0,1.0) and m2=(5.0,7.0).
Step 2:
Thus, we obtain two clusters containing: {1,2,3} and {4,5,6,7}.
Step 3:
Now, using these centroids, we compute the Euclidean distance of each object, as shown in the table.
Therefore, the new clusters are: {1,2} and {3,4,5,6,7}
The next centroids are: m1=(1.25,1.5) and m2=(3.9,5.1)
Step 4:
The clusters obtained are: {1,2} and {3,4,5,6,7}
Therefore, there is no change in the clusters.
Thus, the algorithm comes to a halt here, and the final result consists of 2 clusters: {1,2} and {3,4,5,6,7}.
Class work
Use the k-means algorithm and Euclidean distance to cluster the following 8 examples into 3 clusters:
A1=(2,10), A2=(2,5), A3=(8,4), A4=(5,8), A5=(7,5), A6=(6,4), A7=(1,2), A8=(4,9).
Measuring Clustering Quality
3 kinds of measures: external, internal, and relative
External: supervised, employ criteria not inherent to the dataset
Compare a clustering against prior or expert-specified knowledge using a certain clustering quality measure
Internal: unsupervised, criteria derived from the data itself
Evaluate the goodness of a clustering by considering how well the clusters are separated and how compact the clusters are, e.g., the silhouette coefficient
Relative: directly compare different clusterings, usually those obtained via different parameter settings for the same algorithm
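As an illustration of an internal measure, the silhouette coefficient can be sketched as below. The slides only name the measure; the formula used here is the standard definition, s(i) = (b - a) / max(a, b), where a is the mean distance from object i to the other members of its own cluster and b is the smallest mean distance from i to the members of any other cluster. The sample points, the function name `silhouette`, and the convention of scoring singleton clusters as 0 are assumptions of this sketch, which also expects at least two clusters.

```python
import math

def silhouette(points, labels):
    """Return the silhouette coefficient s(i) for every point.

    Assumes at least two distinct cluster labels are present.
    """
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)

    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # assumed convention for single-element clusters
            continue
        # a: compactness, mean distance to the other members of the same cluster
        # (the point's zero distance to itself adds nothing to the sum)
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: separation, smallest mean distance to the members of another cluster
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for other_lab, other in clusters.items() if other_lab != lab)
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return scores

# Hypothetical data: two compact, well-separated clusters
points = [(0, 0), (0, 1), (10, 0), (10, 1)]
labels = [0, 0, 1, 1]
scores = silhouette(points, labels)
print(sum(scores) / len(scores))  # about 0.90
```

Values close to 1 indicate compact, well-separated clusters, while values near 0 or below suggest overlapping clusters or misassigned objects.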
Summary
Cluster analysis groups objects based on their similarity and has wide applications
Clustering algorithms can be categorized into partitioning methods, hierarchical methods, density-based methods, grid-based methods, and model-based methods
K-means and K-medoids algorithms are popular partitioning-based clustering algorithms
BIRCH and Chameleon are interesting hierarchical clustering algorithms, and there are also probabilistic hierarchical clustering algorithms
DBSCAN, OPTICS, and DENCLUE are interesting density-based algorithms
STING and CLIQUE are grid-based methods, where CLIQUE is also a subspace clustering algorithm