K-Means Cluster Analysis. Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 1

(1)

K-Means Cluster Analysis

Chapter 3 Chapter 3 PPDM Cl

PPDM Class

(2)

What is Cluster Analysis?

z Finding groups of objects such that the objects in a group will be similar (or related) to one another and different

from (or unrelated to) the objects in other groups

Inter-cluster Inter cluster distances are

maximized Intra-cluster

distances are

minimized

(3)

Applications of Cluster Analysis

z Understanding

– Group related documents

Discovered Clusters Industry Group

1

Applied-Matl-DOWN,Bay-Network-Down,3-COM-DOWN, Cabletron-Sys-DOWN,CISCO-DOWN,HP-DOWN, DSC-Comm-DOWN,INTEL-DOWN,LSI-Logic-DOWN,

Micron-Tech-DOWN,Texas-Inst-Down,Tellabs-Inc-Down, Technology1-DOWN

p

for browsing, group genes and proteins that have similar functionality, or

Natl-Semiconduct-DOWN,Oracl-DOWN,SGI-DOWN, Sun-DOWN

2

Apple-Comp-DOWN,Autodesk-DOWN,DEC-DOWN, ADV-Micro-Device-DOWN,Andrew-Corp-DOWN,

Computer-Assoc-DOWN,Circuit-City-DOWN, Compaq-DOWN, EMC-Corp-DOWN, Gen-Inst-DOWN, Motorola-DOWN,Microsoft-DOWN,Scientific-Atl-DOWN

Technology2-DOWN

Fannie Mae DOWN Fed Home Loan DOWN

group stocks with similar

price fluctuations 3

Fannie-Mae-DOWN,Fed-Home-Loan-DOWN,

MBNA-Corp-DOWN,Morgan-Stanley-DOWN Financial-DOWN

4

Baker-Hughes-UP,Dresser-Inds-UP,Halliburton-HLD-UP, Louisiana-Land-UP,Phillips-Petro-UP,Unocal-UP,

Schlumberger-UP Oil-UP

z Summarization

– Reduce the size of large data sets

Clustering precipitation

C uste g p ec p tat o

in Australia

(4)

What is not Cluster Analysis?

z Supervised classification

– Have class label information

z Simple segmentation

Di idi t d t i t diff t i t ti

– Dividing students into different registration groups alphabetically, by last name

z Results of a query

– Groupings are a result of an external specification

z Graph partitioning

– Some mutual relevance and synergy, but areas are not

identical

(5)

Notion of a Cluster can be Ambiguous

How many clusters? Six Clusters

Four Clusters

Two Clusters

(6)

Types of Clusterings

z A clustering is a set of clusters

z Important distinction between hierarchical and partitional sets of clusters

z Partitional Clustering

– A division data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset

z Hierarchical clustering

– A set of nested clusters organized as a hierarchical tree

(7)

Partitional Clustering

Original Points g A Partitional Clustering g

(8)

Hierarchical Clustering

p4 p1

p3 p2

p4 p1 p2 p3 p4 p1 p2 p3

Traditional Hierarchical Clustering Traditional Dendrogram

p4 p1

p3 p4 p3 p2

p1 p2 p3 p4

Non-traditional Hierarchical Clustering Non-traditional Dendrogram

(9)

Other Distinctions Between Sets of Clusters

z Exclusive versus non-exclusive

– In non-exclusive clusterings, points may belong to multiple clusters

clusters.

– Can represent multiple classes or ‘border’ points z Fuzzy versus non-fuzzy y y

– In fuzzy clustering, a point belongs to every cluster with some weight between 0 and 1

– Weights must sum to 1

– Probabilistic clustering has similar characteristics z Partial versus complete

I l t t l t f th d t

– In some cases, we only want to cluster some of the data z Heterogeneous versus homogeneous

– Cluster of widely different sizes, shapes, and densities Cluster of widely different sizes, shapes, and densities

(10)

Types of Clusters

z Well-separated clusters

z Center-based clusters

z Contiguous clusters

z Density-based clusters

P t C t l

z Property or Conceptual

z Described by an Objective Function

(11)

Types of Clusters: Well-Separated

z Well-Separated Clusters:

– A cluster is a set of points such that any point in a cluster is closer (or more similar) to every other point in the cluster than to any point not in the cluster.

3 well-separated clusters

(12)

Types of Clusters: Center-Based

z Center-based

– A cluster is a set of objects such that an object in a cluster is closer (more similar) to the “center” of a cluster, than to the center of any other cluster

– The center of a cluster is often a centroid, the average of all

th i t i th l t d id th t “ t ti ”

the points in the cluster, or a medoid, the most “representative”

point of a cluster

4 center-based clusters

(13)

Types of Clusters: Contiguity-Based

z Contiguous Cluster (Nearest neighbor or Transitive)

– A cluster is a set of points such that a point in a cluster is closer (or more similar) to one or more other points in the cluster than to any point not in the cluster.

8 contiguous clusters

(14)

Types of Clusters: Density-Based

z Density-based

– A cluster is a dense region of points, which is separated by low-density regions, from other regions of high density.

– Used when the clusters are irregular or intertwined, and when noise and outliers are present.

6 density-based clusters

(15)

Types of Clusters: Conceptual Clusters

z Shared Property or Conceptual Clusters

– Finds clusters that share some common property or represent a particular concept.

.

2 Overlapping Circles

(16)

Types of Clusters: Objective Function

z Clusters Defined by an Objective Function

– Finds clusters that minimize or maximize an objective function. Finds clusters that minimize or maximize an objective function.

– Enumerate all possible ways of dividing the points into clusters and evaluate the `goodness' of each potential set of clusters by using the given objective function (NP Hard)

the given objective function. (NP Hard) – Can have global or local objectives.

Hierarchical clustering algorithms typically have local objectives P titi l l ith t i ll h l b l bj ti

Partitional algorithms typically have global objectives

– A variation of the global objective function approach is to fit the data to a parameterized model.

Parameters for the model are determined from the data.

Mixture models assume that the data is a ‘mixture' of a number of

statistical distributions.

(17)

Types of Clusters: Objective Function …

z Map the clustering problem to a different domain and solve a related problem in that domain

and solve a related problem in that domain

– Proximity matrix defines a weighted graph, where the nodes are the points being clustered, and the

weighted edges represent the proximities between points

– Clustering is equivalent to breaking the graph into connected components, one for each cluster.

– Want to minimize the edge weight between clusters and maximize the edge weight within clusters

and maximize the edge weight within clusters

(18)

Characteristics of the Input Data Are Important

z Type of proximity or density measure

– This is a derived measure, but central to clustering z Sparseness

– Dictates type of similarity – Adds to efficiency

z Attribute type

– Dictates type of similarity z Type of Data

z Type of Data

– Dictates type of similarity

– Other characteristics, e.g., autocorrelation

Di i lit

z Dimensionality

z Noise and Outliers

z Type of Distribution

(19)

Clustering Algorithms

z K-means and its variants

z Hierarchical clustering

z Density-based clustering

(20)

K-means Clustering

z Partitional clustering approach

z Each cluster is associated with a centroid (center point)

z Each point is assigned to the cluster with the closest centroid

z Number of clusters K must be specified

z Number of clusters, K, must be specified

z The basic algorithm is very simple

(21)

K-means Clustering – Details

z Initial centroids are often chosen randomly.

– Clusters produced vary from one run to another.

z The centroid is (typically) the mean of the points in the

z The centroid is (typically) the mean of the points in the cluster.

z ‘Closeness’ is measured by Euclidean distance, cosine similarity correlation etc

similarity, correlation, etc.

z K-means will converge for common similarity measures mentioned above.

z Most of the convergence happens in the first few iterations.

– Often the stopping condition is changed to ‘Until relatively few pp g g y points change clusters’

z Complexity is O( n * K * I * d )

– n = number of points, K = number of clusters, n number of points, K number of clusters,

I = number of iterations, d = number of attributes

(22)

Two different K-means Clusterings

1 5 2 2.5 3

Original Points

0 0.5 1 1.5

y

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

x

3 3

1 1.5 2 2.5

y

1 1.5 2 2.5

y

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

0 0.5 1

x

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

0 0.5 1

x x

Sub-optimal Clustering

x

Optimal Clustering

(23)

Importance of Choosing Initial Centroids

3

Iteration 1

3

Iteration 2

3

Iteration 3

3

Iteration 4

3

Iteration 5

3

Iteration 6

2 2.5

1 1.5

y

1 1.5

y

1 1.5

y

1 1.5

y

1 1.5

y

1 1.5

y

0 0.5

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

x

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

x

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

x

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

x

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

x

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

x

(24)

Importance of Choosing Initial Centroids

2 2.5 3

Iteration 1

2 2.5 3

Iteration 2

2 2.5 3

Iteration 3

0 0.5 1 1.5

y

0 0.5 1 1.5

y

0 0.5 1 1.5

y

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

0

x

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

0

x

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

0

x

It ti 4 It ti 5 It ti 6

1.5 2 2.5 3

Iteration 4

1.5 2 2.5 3

Iteration 5

1.5 2 2.5 3

Iteration 6

0 0.5 1 1.5

y

0 0.5 1 1.5

y

0 0.5 1 1.5

y

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

x

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

x

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

x

(25)

Evaluating K-means Clusters

z Most common measure is Sum of Squared Error (SSE)

– For each point, the error is the distance to the nearest cluster – To get SSE, we square these errors and sum them.

∑ ∑

= ^K dist m _i x

SSE ² ( , )

– x is a data point in cluster C

_i

and m

_i

is the representative point for cluster C

_i

∑ ∑ i = x ∈ C

i 1

i

) , (

i

can show that m

_i

corresponds to the center (mean) of the cluster – Given two clusters, we can choose the one with the smallest

error

– One easy way to reduce SSE is to increase K, the number of clusters

A good clustering with smaller K can have a lower SSE than a poor l t i ith hi h K

clustering with higher K

(26)

Importance of Choosing Initial Centroids …

3

Iteration 1

3

Iteration 2

3

Iteration 3

3

Iteration 4

3

Iteration 5

2 2.5

1 1.5

y

1 1.5

y

1 1.5

y

1 1.5

y

1 1.5

y

0 0.5

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

x

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

x

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

x

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

x

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

x

(27)

Importance of Choosing Initial Centroids …

2 2.5 3

Iteration 1

2 2.5 3

Iteration 2

0.5 1 1.5 2

y

0.5 1 1.5 2

y

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

0

x

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

0

x

Iteration 3 Iteration 4 Iteration 5

1.5 2 2.5 3

y

3

1.5 2 2.5 3

y

1.5 2 2.5 3

y

2 1 5 1 0 5 0 0 5 1 1 5 2

0 0.5 1

y

2 1 5 1 0 5 0 0 5 1 1 5 2

0 0.5 1

y

2 1 5 1 0 5 0 0 5 1 1 5 2

0 0.5 1

y

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

x

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

x

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

x

(28)

Problems with Selecting Initial Points

z If there are K ‘real’ clusters then the chance of selecting one centroid from each cluster is small.

Ch i l ti l ll h K i l

– Chance is relatively small when K is large – If clusters are the same size, n, then

– For example, if K = 10, then probability = 10!/10

¹⁰

= 0.00036 – Sometimes the initial centroids will readjust themselves in

‘ i ht’ d ti th d ’t

‘right’ way, and sometimes they don’t

– Consider an example of five pairs of clusters

(29)

10 Clusters Example

6 8

Iteration 1

6 8

Iteration 2

6 8

Iteration 3

6 8

Iteration 4

2 4 6

0 2

y

₀

2

y

₀

2

y

₀

2

y

-4 -2 -4 -2 -4 -2 -4 -2

0 5 10 15 20

-6

x

0 5 10 15 20

-6

x

0 5 10 15 20

-6

x

0 5 10 15 20

-6

x

(30)

10 Clusters Example

4 6 8

Iteration 1

4 6 8

Iteration 2

-4 -2 0 2

y

-4 -2 0 2

y

0 5 10 15 20

-6

x

0 5 10 15 20

-6

x

8

Iteration 3

8

Iteration 4

0 2 4 6

y ₀

2 4 6

y

0 5 10 15 20

-6 -4 -2

0 5 10 15 20

-6 -4 -2

x x

(31)

10 Clusters Example

6 8

Iteration 1

6 8

Iteration 2

6 8

Iteration 3

6 8

Iteration 4

2 4 6

0 2

y

₀

2

y

₀

2

y

₀

2

y

-4 -2 -4 -2 -4 -2 -4 -2

0 5 10 15 20

-6

x

0 5 10 15 20

-6

x

0 5 10 15 20

-6

x

0 5 10 15 20

-6

x

Starting with some pairs of clusters having three initial centroids, while other have only one.

x

(32)

10 Clusters Example

2 4 6 8

Iteration 1

2 4 6 8

Iteration 2

-4 -2 0 2

y

-4 -2 0 2

y

0 5 10 15 20

-6

x

0 5 10 15 20

-6

x

6 8

Iteration 3

6 8

Iteration 4

-2 0 2 4

y

-2 0 2 4

y

0 5 10 15 20

-6 -4

x

0 5 10 15 20

-6 -4

x

Starting with some pairs of clusters having three initial centroids, while other have only one.

(33)

Solutions to Initial Centroids Problem

z Multiple runs

– Helps, but probability is not on your side p , p y y

z Sample and use hierarchical clustering to determine initial centroids

z Select more than k initial centroids and then select among these initial centroids

Select most widely separated – Select most widely separated

z Postprocessing

z Bisecting K-means

– Not as susceptible to initialization issues

(34)

Handling Empty Clusters

z Basic K-means algorithm can yield empty clusters

clusters

z Several strategies

– Choose the point that contributes most to SSE

– Choose a point from the cluster with the highest SSE – If there are several empty clusters, the above can be

repeated several times.

(35)

Updating Centers Incrementally

z In the basic K-means algorithm, centroids are

updated after all points are assigned to a centroid updated after all points are assigned to a centroid

z An alternative is to update the centroids after each assignment (incremental approach)

– Each assignment updates zero or two centroids – More expensive

– Introduces an order dependency

N t t l t

– Never get an empty cluster

– Can use “weights” to change the impact

(36)

Pre-processing and Post-processing

z Pre-processing

– Normalize the data Normalize the data – Eliminate outliers

z Post processing

z Post-processing

– Eliminate small clusters that may represent outliers Split ‘loose’ clusters i e clusters with relatively high – Split loose clusters, i.e., clusters with relatively high