Machine Learning, STOR 565 Clustering: Overview and Basic Methods

(1)

Machine Learning, STOR 565

Clustering: Overview and Basic Methods

Andrew Nobel

(2)

Overview of Clustering

a. The basic problem

(3)

(4)

(5)

General Setting

Given: Objects x1, . . . , xnin feature space X

I Dissimilarity or distance d(xi, xj)between pairs of objects

Goal: Divide x1, . . . , xninto disjoint groups C1, . . . , Ck, called clusters, s.t.

I Objects in same cluster are close together

I Objects in different clusters are far apart

I Number of clusters k is small

Distinction: Clustering is complete if it partitions X and incomplete if it

(6)

Clustering: Areas of Application

Genomics, Biology Data Compression Psychology Computer Science

(7)

Feature Vectors

Objects x ∈ X typically represented by a feature vector x = (x1, . . . , xp)

where xiis a numerical/categorical measurement of interest:

I xi∈ R numerical feature

(8)

Examples

Medicine

I Object = patient

I Feature xi=outcome of a diagnostic test on patient

Microarrays (Genomics) I Object = tissue sample

I Feature xi=measured expression level of gene i in that sample

Data Mining

I Object = consumer

(9)

Dissimilarities/Distances Between Feature Vectors

Euclidean d(u, v) =pP

i(ui− vi)2

Manhattan d(u, v) =P

i|ui− vi|

Correlation d(u, v) = 1 − corr(u, v)

Hamming d(u, v) =P

iI{ui6= vi}

(10)

Basic Steps in Clustering

Acquisition of Objects x1, . . . , xn

⇓

Selection and Extraction of Features ⇓ Dissimilarity matrix D = {d(xi, xj) : 1 ≤ i, j ≤ n} ⇓ Clustering Algorithm ⇓ Partition π = {C1, . . . , Ck} of x1, . . . , xn.

(11)

Some Clustering Methods

Hierarchical: Candidate divisions of data described by a binary tree I Agglomerative (bottom-up)

I Divisive (top-down)

Iterative: Search for local minimum of simple cost function I k-means and variants

I partitioning around medioids, self organizing maps

Model-based: Fit feature vectors with a finite mixture model

(12)

Centroids

Definition: The centroid of vectors x1, . . . , xn∈ Rpis their average

c = 1 n n X i=1 xi

The centroid c is the center of mass of the point configuration x1, . . . , xn, and

is an optimal representative for the configuration in the sense that

n X i=1 ||xi− c||2 ≤ n X i=1 ||xi− v||2

(13)

Nearest Neighbor Partitions

Definition: Voronoi (nearest neighbor) partition of vectors c1, . . . , ck∈ Rpis

a collection π = {A1, . . . , Ak} where the cell

Aj= {x : ||x − cj|| ≤ ||x − cs|| all s 6= j}

contains vectors that are closer to cjthan any other cs.

Structure of Cells: Note that

Aj=

\

s6=j

{x : ||x − cj|| ≤ ||x − cs||}

is an intersection of half-spaces. Thus it is a polytope in Rp_{with at most n − 1}

(14)

The k-Means Algorithm

Clustering Problem: Divide x1, . . . , xn∈ Rpinto k clusters.

Optimization: Find centers c01, . . . , c0kto minimize sum of squares (SoS) cost

function Cost(c1, . . . , ck) = n X i=1 min 1≤j≤k||xi− cj|| 2

i.e., sum of squared distances from each point to its nearest center. Then the clusters are the Voronoi partition of x1, . . . , xnwith centers c01, . . . , c

0 k

Problem: Exact solution of optimization problem is not computationally

(15)

The k-Means Algorithm

Fix in advance

I Number of clusters k

I Initial centers C0= {c1, . . . , ck}

Iterate: For m = 1, 2, . . . do:

I Let πmbe the nearest neighbor (Voronoi) partition of the centers Cm−1.

I Let Cmbe the centroids (averages) of the vectors in each cell of πm

(16)

The k-Means Algorithm

Recall: Sum of Squares (SoS) cost function

Cost(c1, . . . , ck) = n X i=1 min 1≤j≤k||xi− cj|| 2

Note: Cost function decreases at each stage of the k-means algorithm. In practice

I Choose multiple initial sets of representative vectors C0= {c1, . . . , ck}

I Run the iterative k-means procedure

I Choose the partition associated with the smallest final cost Example:http://www.onmyphd.com/?p=k-means.clustering.

(17)

Features of Clusters

If clusters are present, their features can affect performance of different clustering procedures.

I Spherical or elliptical in shape

I Similar in overall variance/spread

I Similar in size (number of points)

K-Means tends to perform best when clusters are spherical, similar in

(18)

Binary Trees

1. Distinguished node called theroot with zero or two children but no parent

2. Every other node has one parent and zero or two children

I Nodes with no children are calledleaves I Nodes with two children are calledinternal

(19)

Agglomerative Clustering

Stage 0: Assign each object xito its own cluster

Stage k:

I Find the two closest clusters at stage k − 1

I Combine them into a single cluster

Stop: When all objects xibelong to a single cluster

Output: Binary tree T called a dendrogram

Note: Closeness of clusters C, C0

(20)

Distances Between Clusters

Single Linkage ds(C, C 0 ) = min xi∈C, xj∈C0 d(xi, xj) Average Linkage da(C, C 0 ) = 1 |C| |C0_| X xi∈C, xj∈C0 d(xi, xj) Total Linkage dt(C, C0) = max xi∈C, xj∈C0 d(xi, xj)

(21)

Dendrogram

Binary tree associated with the agglomerative clustering procedure: it is a graphical record of the clustering process

Initialize: Each singleton cluster {xi} corresponds to a node at height 0

Update: If two clusters C, C0are combined, their respective nodes are joined

to a parent node at height d(C, C0)

Each node of dendrogram corresponds to a set of objects. Objects associated with two nodes are merged when forming their parent.

I Leaves correspond to individual objects

(22)

(23)

(24)

Dendrogram, cont.

Note: Dendrogram T represents many possible clusterings, one for each

(rooted) subtree.

Methods for selecting a clustering/subtree I Ad hoc selection (by eye)

I “Cutting” dendrogram at fixed level

I Penalized pruning

Visualization of clustering structure

I Order objects in the same way as the leaves of the dendrogram

(25)

Cars Data

I Samples: 32 unique cars

I Variables: 11 descriptive variables, including gas mileage, horsepower,

number of cylinders, etc.

(26)

Ma se ra ti Bo ra F ord Pa nt era L D ust er 36 0 C ama ro Z 28 C hrysl er Imp eri al C ad ill ac F le et w oo d Li nco ln C on tin en ta l H orn et Sp ort ab ou t Po nt ia c F ire bi rd Me rc 45 0SL C Me rc 45 0SE Me rc 45 0SL Dodge Challenger AMC Ja ve lin H orn et 4 D rive Va lia nt F erra ri D in o H on da C ivi c T oyo ta C oro lla Fiat 128 Fiat X1 -9 Me rc 24 0D Ma zd a R X4 Ma zd a R X4 W ag Me rc 28 0 Me rc 28 0C Lo tu s Eu ro pa Me rc 23 0 D at su n 71 0 Vo lvo 1 42 E T oyo ta C oro na Po rsch e 91 4-2 0 20 40 60 80

Single Linkage Clustering on Cars data

hclust (*, "single")dist(mtcars)

D

ist

an

(27)

F erra ri D in o H on da C ivi c T oyo ta C oro lla Fiat 128 Fiat X1 -9 Ma zd a R X4 Ma zd a R X4 W ag Me rc 28 0 Me rc 28 0C Me rc 24 0D Lo tu s Eu ro pa Me rc 23 0 Vo lvo 1 42 E D at su n 71 0 T oyo ta C oro na Po rsch e 91 4-2 Ma se ra ti Bo ra H orn et 4 D rive Va lia nt Me rc 45 0SL C Me rc 45 0SE Me rc 45 0SL Dodge Challenger AMC Ja ve lin C hrysl er Imp eri al C ad ill ac F le et w oo d Li nco ln C on tin en ta l F ord Pa nt era L D ust er 36 0 C ama ro Z 28 H orn et Sp ort ab ou t Po nt ia c F ire bi rd 0 50 100 150 200 250

Average Linkage Clustering on Cars data

hclust (*, "average")dist(mtcars)

D

ist

an

(28)

TCGA Data

Gene expression data from The Cancer Genome Atlas (TCGA)

I Samples

I 95 Luminal A breast tumors

I 122 Basal breast tumors

(29)

TCGA Data

I Clustered samples (breast tumor subtype)

(30)

Important Questions

I What is the right number of clusters?

I What is right measure of distance?

I What is the best clustering method for the data?

I How robust is an observed clustering to small perturbations of the data?