(1)

Data Mining: Concepts and Techniques (3rd ed.)

Chapter 10

Jiawei Han, Micheline Kamber, and Jian Pei
University of Illinois at Urbana-Champaign &

(2)

Chapter 10. Cluster Analysis: Basic Concepts and Methods

■ Cluster Analysis: Basic Concepts
■ Partitioning Methods
■ Hierarchical Methods
■ Density-Based Methods
■ Grid-Based Methods
■ Evaluation of Clustering
■ Summary

(3)

What is Cluster Analysis?

■ Cluster: a collection of data objects
  ■ similar (or related) to one another within the same group
  ■ dissimilar (or unrelated) to the objects in other groups
■ Cluster analysis
  ■ Finding similarities between data according to the characteristics found in the data and grouping similar data objects into clusters
■ Unsupervised learning: no predefined classes
■ Typical applications
  ■ As a stand-alone tool to get insight into data distribution

(4)

(5)

What Is Clustering?

■ Clustering is the classification of objects into different groups, or more precisely, the partitioning of a data set into subsets

(6)

Considerations for Cluster Analysis

■ Partitioning criteria
  ■ Single-level vs. hierarchical partitioning (often, multi-level hierarchical partitioning is desirable)
■ Separation of clusters
  ■ Exclusive (e.g., one customer belongs to only one region) vs. non-exclusive (e.g., one document may belong to more than one class)
■ Similarity measure
  ■ Distance-based (e.g., Euclidean) vs. connectivity-based (e.g., density)
■ Clustering space
  ■ Full space (often when low-dimensional) vs. subspaces (often in high-dimensional clustering)

(7)

Quality: What Is Good Clustering?

■ A good clustering method will produce high-quality clusters
  ■ high intra-class similarity: cohesive within clusters
  ■ low inter-class similarity: distinctive between clusters
■ The quality of a clustering method depends on
  ■ the similarity measure used by the method
  ■ its implementation, and

(8)

Types of Clustering

1. Partitioning approach:
  ■ Construct various partitions and then evaluate them by some criterion
2. Hierarchical approach:
  ■ Create a hierarchical decomposition of the set of data (or objects) using some criterion
3. Density-based approach:
  ■ Based on connectivity and density functions
4. Grid-based approach:
  ■ Based on a multiple-level granularity structure

(9)

(10)

1) Partitioning Method

(11)

Algorithms for Partitioning Methods

■ K-Means algorithm
■ K-Medoids algorithm
■ CLARANS (Clustering Large Applications based upon RANdomized Search)

(12)

2) Hierarchical Methods

(13)

Algorithms for Hierarchical Methods

■ AGNES (AGglomerative NESting)
■ DIANA (DIvisive ANAlysis)
■ BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies)
■ CHAMELEON (Clustering Using Dynamic Modeling)

(14)

3) Density-Based Method

(15)

Algorithms for Density-Based Methods

■ DBSCAN (Density-Based Clustering Based on Connected Regions with High Density)
■ OPTICS (Ordering Points to Identify the Clustering Structure)

(16)

4) Grid-Based Clustering

(17)

Algorithms for Grid-Based Methods

■ STING (STatistical Information Grid)

■ CLIQUE: An Apriori-like Subspace Clustering Method

(18)

(19)

Common Distance Measures

■ The distance measure determines how the similarity of two elements is calculated, and it influences the shape of the clusters.

Common measures, for two objects x and y with m attributes each, include:

1. The Euclidean distance (also called the 2-norm distance), given by:

$$d(x, y) = \sqrt{\sum_{i=1}^{m} (x_i - y_i)^2}$$

2. The Manhattan distance (also called the taxicab norm or 1-norm), given by:

$$d(x, y) = \sum_{i=1}^{m} |x_i - y_i|$$
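As a minimal sketch, the two distance measures listed above could be computed as follows. The function names `euclidean_distance` and `manhattan_distance` are illustrative choices, not names from the slides, and each point is assumed to be an equal-length sequence of numbers.

```python
import math


def euclidean_distance(x, y):
    """2-norm distance: square root of the summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def manhattan_distance(x, y):
    """1-norm (taxicab) distance: sum of the absolute differences."""
    return sum(abs(a - b) for a, b in zip(x, y))


# Illustrative points (hypothetical, chosen so the results are easy to check).
p, q = (0.0, 0.0), (3.0, 4.0)
print(euclidean_distance(p, q))  # 5.0
print(manhattan_distance(p, q))  # 7.0
```

The Manhattan distance is never smaller than the Euclidean distance for the same pair of points, which is one reason the choice of measure can change which centroid counts as "nearest".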

(20)

Partitioning Algorithms: Basic Concept

■ Partitioning method: partition a database D of n objects into a set of k clusters such that the sum of squared distances is minimized (where c_i is the centroid or medoid of cluster C_i):

$$E = \sum_{i=1}^{k} \sum_{p \in C_i} \left(d(p, c_i)\right)^2$$

■ Global optimal: exhaustively enumerate all partitions
■ Heuristic methods: k-means and k-medoids algorithms
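To illustrate why exhaustive enumeration is only practical for tiny inputs, here is a hedged sketch that tries every assignment of n one-dimensional points to k non-empty clusters and keeps the one with the smallest sum of squared distances. It assumes c_i is the cluster mean (the centroid case); the names `sum_squared_error` and `best_partition` and the sample data are hypothetical. The loop visits k^n labelings, which is what motivates the heuristic methods.

```python
from itertools import product


def sum_squared_error(points, labels, k):
    """Sum of squared distances of 1-D points to their cluster means."""
    total = 0.0
    for c in range(k):
        members = [p for p, lab in zip(points, labels) if lab == c]
        mean = sum(members) / len(members)
        total += sum((p - mean) ** 2 for p in members)
    return total


def best_partition(points, k):
    """Brute force: evaluate every labeling that uses all k clusters."""
    best_cost, best_labels = None, None
    for labels in product(range(k), repeat=len(points)):
        if len(set(labels)) < k:  # skip labelings that leave a cluster empty
            continue
        cost = sum_squared_error(points, labels, k)
        if best_cost is None or cost < best_cost:
            best_cost, best_labels = cost, labels
    return best_cost, best_labels


# Hypothetical data: two obvious groups.
cost, labels = best_partition([1.0, 2.0, 10.0, 11.0], k=2)
print(cost, labels)  # 1.0 (0, 0, 1, 1)
```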

(21)

(22)

K-MEANS CLUSTERING

■ The k-means algorithm clusters n objects, based on their attributes, into k partitions, where k < n.
■ It is similar to the expectation-maximization algorithm for mixtures of Gaussians in that both attempt to find the centers of natural clusters in the data.
■ It assumes that the object attributes form a vector

(23)

■ An algorithm for partitioning (or clustering) N data points into K disjoint subsets S_j of data points so as to minimize the sum-of-squares criterion

$$J = \sum_{j=1}^{K} \sum_{n \in S_j} \lVert x_n - \mu_j \rVert^2$$

where x_n is a vector representing the nth data point and \mu_j is the geometric centroid of the data points in S_j.

(24)

■ Simply put, k-means clustering is an algorithm that classifies or groups objects, based on their attributes or features, into K groups.
■ K is a positive integer.
■ The grouping is done by minimizing the sum of squared distances between the data points and the corresponding cluster centroids.

(25)

An Example of K-Means Clustering

[Figure: k-means with K=2. Panels: the initial data set; arbitrarily partition objects into k groups; update the cluster centroids; reassign objects; loop if needed.]

■ Partition objects into k nonempty subsets
■ Repeat
  ■ Compute the centroid (i.e., mean point) of each partition
  ■ Assign each object to the cluster with the nearest centroid

(26)

How Does the K-Means Clustering Algorithm Work?

(27)

■ Step 1: Decide on the value of k, the number of clusters.
■ Step 2: Choose any initial partition that classifies the data into k clusters. You may assign the training samples randomly, or systematically as follows:
  1. Take the first k training samples as single-element clusters.
  2. Assign each of the remaining (N-k) training samples to the cluster with the nearest centroid. After each assignment, recompute the centroid of the gaining cluster.

(28)

■ Step 3: Take each sample in sequence and compute its distance from the centroid of each of the clusters. If a sample is not currently in the cluster with the closest centroid, switch it to that cluster and update the centroids of both the cluster gaining the sample and the cluster losing it.
■ Step 4: Repeat Step 3 until convergence is achieved, that is, until a pass through the training samples causes no new assignments.
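Steps 1 to 4 can be sketched in Python as below. This is one possible reading of the procedure under stated assumptions, not a reference implementation: samples are tuples of numbers, the systematic initialization from Step 2 is used, ties go to the lowest cluster index, and a sample moves only when another centroid is strictly closer (which also keeps every cluster non-empty). The names `centroid`, `nearest`, and `kmeans_sequential`, and the sample data, are hypothetical.

```python
import math


def centroid(points):
    """Mean point of a non-empty list of equal-length tuples."""
    return tuple(sum(axis) / len(points) for axis in zip(*points))


def nearest(x, centroids):
    """Index of the centroid closest to x (ties go to the lowest index)."""
    return min(range(len(centroids)), key=lambda c: math.dist(x, centroids[c]))


def kmeans_sequential(samples, k):
    """Return (labels, centroids) after running Steps 2 to 4."""
    # Step 2.1: the first k samples start as single-element clusters.
    members = [[i] for i in range(k)]
    labels = list(range(k))
    centroids = [tuple(samples[i]) for i in range(k)]
    # Step 2.2: each remaining sample joins the nearest cluster, and the
    # centroid of the gaining cluster is recomputed immediately.
    for i in range(k, len(samples)):
        j = nearest(samples[i], centroids)
        members[j].append(i)
        labels.append(j)
        centroids[j] = centroid([samples[m] for m in members[j]])
    # Steps 3 and 4: sweep the samples until a full pass changes nothing.
    changed = True
    while changed:
        changed = False
        for i, x in enumerate(samples):
            cur, best = labels[i], nearest(x, centroids)
            if math.dist(x, centroids[best]) < math.dist(x, centroids[cur]):
                members[cur].remove(i)
                members[best].append(i)
                labels[i] = best
                # Update both the losing and the gaining cluster.
                centroids[cur] = centroid([samples[m] for m in members[cur]])
                centroids[best] = centroid([samples[m] for m in members[best]])
                changed = True
    return labels, centroids


# Hypothetical data set with two visible groups.
data = [(1.0, 1.0), (1.5, 2.0), (5.0, 7.0), (4.5, 5.0), (1.0, 0.5), (5.5, 6.0)]
labels, centroids = kmeans_sequential(data, k=2)
print(labels)  # [0, 0, 1, 1, 0, 1]
```

Because the first k samples seed the clusters, the result depends on the order of the data; shuffling the samples or choosing the seeds randomly can lead to a different final partition.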

(29)

A Simple Example Showing the Implementation of the K-Means Algorithm (Using K=2)

(30)

Step 1:

Initialization: we randomly choose the following two centroids (k=2) for the two clusters.

In this case the two centroids are m1=(1.0,1.0) and m2=(5.0,7.0).

(31)

Step 2:

■ Thus, we obtain two clusters containing {1,2,3} and {4,5,6,7}.
■ Their new centroids are:

(32)

Step 3:

■ Now, using these centroids, we compute the Euclidean distance of each object, as shown in the table.
■ Therefore, the new clusters are {1,2} and {3,4,5,6,7}.
■ The next centroids are m1=(1.25,1.5) and m2=(3.9,5.1).

(33)

Step 4:

■ The clusters obtained are {1,2} and {3,4,5,6,7}.
■ Therefore, there is no change in the clusters.
■ Thus, the algorithm halts here, and the final result consists of 2 clusters: {1,2} and {3,4,5,6,7}.

(34)

(35)

(36)

(37)

Exercise

■ Consider the 1D data set {1, 2, 3, 4, 7, 9} with K=2. Identify the clusters and their centroids. Tip: use

(38)

Homework

■ Use the k-means algorithm and Euclidean distance to cluster the following 8 examples into 3 clusters:
■ A1=(2,10), A2=(2,5), A3=(8,4), A4=(5,8), A5=(7,5), A6=(6,4), A7=(1,2), A8=(4,9).
■ Plot the resulting clusters in a scatter plot.

(39)

What Is the Problem of the K-Means Method?

■ The k-means algorithm is sensitive to outliers!
  ■ An object with an extremely large value may substantially distort the distribution of the data
■ K-Medoids: instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be used, which is the most centrally located object in the cluster
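The outlier sensitivity described above can be seen in a small, hedged example. It compares the mean of a one-dimensional cluster with its medoid, taken here to be the member with the smallest total absolute distance to all other members. The function names and the data, including the outlier value 100, are hypothetical.

```python
def mean_1d(values):
    """Arithmetic mean, the reference point used by k-means."""
    return sum(values) / len(values)


def medoid_1d(values):
    """Member with the smallest total distance to all other members."""
    return min(values, key=lambda c: sum(abs(c - v) for v in values))


# Hypothetical cluster with a single extreme value.
cluster = [1.0, 2.0, 3.0, 4.0, 100.0]
print(mean_1d(cluster))    # 22.0, pulled far toward the outlier
print(medoid_1d(cluster))  # 3.0, still an actual, central member
```

The medoid is always one of the data objects, so a single extreme value cannot drag it into empty space the way it drags the mean.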

(40)

Determine the Number of Clusters

■ Empirical method
  ■ # of clusters: k ≈ √(n/2) for a dataset of n points
  ■ e.g., n = 200, k = 10
  ■ How many for n = 900?
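The empirical rule above is easy to evaluate in code. The sketch below assumes the result is rounded to the nearest integer, which the slide does not specify; `rule_of_thumb_k` is a hypothetical name.

```python
import math


def rule_of_thumb_k(n):
    """Empirical estimate k ≈ sqrt(n / 2), rounded to the nearest integer."""
    return round(math.sqrt(n / 2))


print(rule_of_thumb_k(200))  # 10, matching the example on the slide
```

Calling the same function with n = 900 gives the value asked for in the last bullet.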

(41)

Measuring Clustering Quality

■ 3 kinds of measures: external, internal, and relative
■ External: supervised, employ criteria not inherent to the dataset
  ■ Compare a clustering against prior or expert-specified knowledge using a certain clustering quality measure
■ Internal: unsupervised, criteria derived from the data itself
  ■ Evaluate the goodness of a clustering by considering how well the clusters are separated and how compact the clusters are, e.g., the silhouette coefficient
■ Relative: directly compare different clusterings, usually those
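As an illustration of an internal measure, the sketch below computes the mean silhouette coefficient. For each object, a is its mean distance to the other members of its own cluster, b is the smallest mean distance to the members of any other cluster, and the score is (b - a) / max(a, b). It assumes Euclidean distance, at least two clusters, and a score of 0 for singleton clusters; the function name and the data are hypothetical.

```python
import math


def silhouette(points, labels):
    """Mean silhouette coefficient over all points (needs >= 2 clusters)."""
    clusters = set(labels)
    scores = []
    for i, p in enumerate(points):
        own = [math.dist(p, q) for j, q in enumerate(points)
               if j != i and labels[j] == labels[i]]
        if not own:  # singleton cluster: score 0 by convention
            scores.append(0.0)
            continue
        a = sum(own) / len(own)  # compactness of the point's own cluster
        b = min(                 # separation from the nearest other cluster
            sum(math.dist(p, q) for q, lab in zip(points, labels) if lab == c)
            / labels.count(c)
            for c in clusters if c != labels[i]
        )
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


# Hypothetical 1-D data written as 1-tuples.
pts = [(0.0,), (1.0,), (10.0,), (11.0,)]
good = silhouette(pts, [0, 0, 1, 1])  # compact, well-separated clusters
bad = silhouette(pts, [0, 1, 0, 1])   # clusters that mix the two groups
print(good > bad)  # True
```

Scores close to 1 indicate compact, well-separated clusters, while negative scores suggest that objects sit closer to another cluster than to their own.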

(42)

(43)

(44)

(45)

(46)

(47)

(48)

Summary

■ Cluster analysis groups objects based on their similarity and has wide applications
■ Clustering algorithms can be categorized into partitioning methods, hierarchical methods, density-based methods, grid-based methods, and model-based methods
■ K-means and K-medoids algorithms are popular partitioning-based clustering algorithms
■ BIRCH and CHAMELEON are interesting hierarchical clustering algorithms, and there are also probabilistic hierarchical clustering algorithms
■ DBSCAN, OPTICS, and DENCLUE are interesting density-based algorithms
■ STING and CLIQUE are grid-based methods, where CLIQUE is also a subspace clustering algorithm
■ Quality of clustering results can be evaluated in various ways