SoSe 2014: M-TANI:
Big Data Analytics
Lecture 4 – 21/05/2014
Sead Izberovic
Agenda
•
Recap from the previous session
•
Clustering
Introduction
Distance mesures
Hierarchical Clustering
DBIS
Introduction
•
Given a set of objects, with a notion of distance between
those objects. The task of a clustering algorithm is to
group those objects into some number of clusters, so
that:
Members of a cluster are similar to each other
Members of different clusters are dissimilar
•
High dimensionality may be hard to interpret
Introduction
•
Clustering
≠
Classification
•
Classification
Assigning objects to predifined classes
Requires supervised learning
•
Clustering
No predifined classes
DBIS
Introduction
•
Clustering applications
Clustering DNA sequences
Image segmentation
Customer segmentation
Distance mesures
•
Calculate similarity
Large distance = low similarity
Small distance = high similarity
•
Conditions
1. 𝑑𝑑𝑑𝑑 𝑥,𝑦 = 𝑑 with 𝑑:𝑋 ×𝑋 → 𝑹 and x, y, z ϵ𝑋 2. 𝑑𝑑𝑑𝑑 𝑥,𝑦 ≥ 0 3. 𝑑𝑑𝑑𝑑 𝑥,𝑦 = 0 if 𝑥 = 𝑦 4. 𝑑𝑑𝑑𝑑 𝑥,𝑦 = 𝑑𝑑𝑑𝑑 𝑦,𝑥 5. 𝑑𝑑𝑑𝑑 𝑥,𝑧 ≤ 𝑑𝑑𝑑𝑑 𝑥,𝑦 +𝑑𝑑𝑑𝑑(𝑦,𝑧) from [1] and [3]DBIS
Distance mesures
•
Euclidian distance:
𝑑𝑑𝑑𝑑
𝑒𝑒𝑒𝑒𝑒𝑒𝑋
,
𝑌
=
∑
𝑛𝑒=1(
𝑥
𝑒− 𝑦
𝑒)
2•
Jaccard distance:
𝑑𝑑𝑑𝑑
𝑗𝑗𝑒𝑒𝑗𝑗𝑒𝑥
,
𝑦
= 1
− 𝑆𝑆𝑆 𝑥
,
𝑦
•
Cosine distance:
𝑑𝑑𝑑𝑑
𝑒𝑐𝑐𝑋
,
𝑌
=
∑𝑛𝑖=1 𝑥𝑖 ×𝑦𝑖 ∑𝑛𝑖=1(𝑥𝑖)2× ∑𝑛𝑖=1(𝑥𝑖)2•
Edit distance:
smallest number of insertions and
deletions of single characters that will convert string x to
string y
Distance between clusters
DBIS
Distance between clusters
Single-linkageDistance between clusters
Complete-linkageDBIS
Distance between clusters
Average-linkageClustering approches
Clustering
DBIS
Hierarchical Agglomerative
1.
Every Object is a cluster2.
Calculate the minimaldistance between to clusters
3.
Merge clusters with minimal distance4.
Repeat 2. and 3. until we have one big clusterSingle Link
Hierarchical Agglomerative
1.
Every Object is acluster
2.
Calculate the minimal distance between to clusters3.
Merge clusters with minimal distance4.
Repeat 2. and 3. until we have one big clusterDBIS
Hierarchical Agglomerative
1.
Every Object is a cluster2.
Calculate the minimaldistance between to clusters
3.
Merge clusters with minimal distance4.
Repeat 2. and 3. until we have one big clusterSingle Link
Hierarchical Agglomerative
1.
Every Object is a cluster2.
Calculate the minimaldistance between to clusters
3.
Merge clusters with minimal distance4.
Repeat 2. and 3. until we have one big clusterDBIS
Hierarchical Agglomerative
1.
Every Object is a cluster2.
Calculate the minimaldistance between to clusters
3.
Merge clusters withminimal distance
4.
Repeat 2. and 3. until we have one big clusterSingle Link
Hierarchical Agglomerative
1.
Every Object is a cluster2.
Calculate the minimaldistance between to clusters
3.
Merge clusters with minimal distance4.
Repeat 2. and 3. untilwe have one big cluster
DBIS
Hierarchical Agglomerative
1.
Every Object is a cluster2.
Calculate the minimaldistance between to clusters
3.
Merge clusters with minimal distance4.
Repeat 2. and 3. untilwe have one big cluster
Single Link
Hierarchical Agglomerative
1.
Every Object is a cluster2.
Calculate the minimaldistance between to clusters
3.
Merge clusters with minimal distance4.
Repeat 2. and 3. untilwe have one big cluster
DBIS
Hierarchical Agglomerative
1.
Every Object is a cluster2.
Calculate the minimaldistance between to clusters
3.
Merge clusters with minimal distance4.
Repeat 2. and 3. untilwe have one big cluster
Single Link
Hierarchical Agglomerative
1.
Every Object is a cluster2.
Calculate the minimaldistance between to clusters
3.
Merge clusters with minimal distance4.
Repeat 2. and 3. untilwe have one big cluster
DBIS
Hierarchical Agglomerative
1.
Every Object is a cluster2.
Calculate the minimaldistance between to clusters
3.
Merge clusters with minimal distance4.
Repeat 2. and 3. untilwe have one big cluster
Single Link
Hierarchical Divisive
1.
Find max. distance between objects2.
Split those clusters3.
Repeat 1. and 2. until we have clusters whichcontain just one object
DBIS
Hierarchical Divisive
1.
Find max. distance between objects2.
Split those clusters3.
Repeat 1. and 2. until we have clusters whichcontain just one object
Single Link
Hierarchical Divisive
1.
Find max. distance between objects2.
Split those clusters3.
Repeat 1. and 2. until we have clusters whichcontain just one object
DBIS
Hierarchical Divisive
1.
Find max. distance between objects2.
Split those clusters3.
Repeat 1. and 2. until we have clusters whichcontain just one object
Single Link
Hierarchical Divisive
1.
Find max. distance between objects2.
Split those clusters3.
Repeat 1. and 2. until we have clusters whichcontain just one object
DBIS
Hierarchical Divisive
1.
Find max. distance between objects2.
Split those clusters3.
Repeat 1. and 2. until we have clusters whichcontain just one object
Single Link
Hierarchical Divisive
1.
Find max. distance between objects2.
Split those clusters3.
Repeat 1. and 2. until we have clusters whichcontain just one object
DBIS
Hierarchical Divisive
1.
Find max. distance between objects2.
Split those clusters3.
Repeat 1. and 2. until we have clusters whichcontain just one object
Single Link
Cluster Visualization
D is tanc e DendrogramDBIS
Cluster Visualization
D is tanc e Dendrogram adopted from [3]Cluster Visualization
D is tanc e DendrogramDBIS
Cluster Visualization
D is tanc e Dendrogram adopted from [3]Cluster Visualization
D is tanc e DendrogramDBIS
Cluster Visualization
D is tanc e Dendrogram adopted from [3]Cluster Visualization
D is tanc e DendrogramDBIS
Cluster Visualization
D is tanc e Dendrogram adopted from [3]Clustering approches
Clustering
DBIS
K-Means
1.
Place k centroids randomly2.
If distance to centroid min. merge to cluster3.
Move the k centroids tothe new cluster center
4.
Repeat 2. and 3. until wefulfil the stop criteria
K-Means
1.
Place k centroidsrandomly
2.
If distance to centroid min. merge to cluster3.
Move the k centroids tothe new cluster center
4.
Repeat 2. and 3. until weDBIS
K-Means
1.
Place k centroids randomly2.
If distance to centroidmin. merge to cluster
3.
Move the k centroids tothe new cluster center
4.
Repeat 2. and 3. until wefulfil the stop criteria
K-Means
1.
Place k centroids randomly2.
If distance to centroid min. merge to cluster3.
Move the k centroidsto the new cluster center
4.
Repeat 2. and 3. until we fulfil the stop criteriaDBIS
K-Means
1.
Place k centroids randomly2.
If distance to centroid min. merge to cluster3.
Move the k centroids tothe new cluster center
4.
Repeat 2. and 3. untilwe fulfil the stop criteria
K-Means
•
How to choose the right k?
Try different k, looking at the change in the average
distance to centroid as k increases
Average falls rapidly until right k, then changes little
from [2] A v er age di s tanc e t o c ent roi d k
DBIS
Clustering on graphs
•
Create the MST•
Remove k-1 edges with the highest weights•
Create the clustersClustering on graphs
•
Create the MST•
Remove k-1 edges with the highest weights•
Create the clustersk=3 1 2 3 4 5 6
DBIS
Clustering on graphs
•
Create the MST•
Remove k-1 edges withthe highest weights
•
Create the clustersk=3 1 2 3 4 adapted from [3]
Clustering on graphs
•
Create the MST•
Remove k-1 edges with the highest weights•
Create the clustersk=3
1 2
3
DBIS
Literature
1.
Anand Rajaraman, Jeffrey D. Ullman, Jure Leskovec. 2014 Mining of Massive DatasetsCambridge University Press
2.
Jure Leskovec. 2014Slides: Mining Massive Data Sets
URL: http://www.stanford.edu/class/cs246/slides/05-clustering.pdf
3.
Martin Ester, Jörg Sander. 2000Knowledge Discovery in Databases Springer – Verlag Berlin Heidelberg
4.
A. K. Jain, M. N. Murty, and P. J. Flynn. 1999 Data clustering: a review