Clustering and mapper

(1)

Clustering and mapper

June 17th, 2014

(2)

Overview

Goal of talk

Explain Mapper, which is the most widely used and most successful TDA technique. (At core of Ayasdi, TDA company founded by Gunnar Carlsson.)

Basic idea: perform clustering at different “scales”, track how clusters change as scale varies.

Motivation:

1 Coarser than manifold learning, but still works in nonlinear

situations.

2 _{Still retains meaningful geometric information about data set.} 3 Efficiently computable (and so can apply to very large data

sets).

(3)

Overview

Goal of talk

Motivation:

situations.

sets).

(4)

Overview

Goal of talk

Motivation:

situations.

sets).

(5)

Overview

Goal of talk

Motivation:

situations.

sets).

(6)

Overview

Goal of talk

Motivation:

situations.

sets).

(7)

Overview

Goal of talk

Motivation:

situations.

sets).

(8)

Morse theory

Basic idea

Describe topology of a smooth manifoldM using levelsets of a suitable functionh:M →_R.

We recover M by looking ath−1((∞,t]), as t scans over the range of h.

Topology ofM changes at critical points of h.

(9)

Morse theory

Basic idea

(10)

Morse theory

Basic idea

(11)

(12)

(13)

Reeb graphs

Convenient simplification:

1 _{For each} _t ∈_R_{, contract each component of} _f−1₍_t_{) to a}

point.

2 _{Resulting structure is a graph.}

(14)

Reeb graphs

1 _{For each} _t ∈_{R, contract each component of} _f−1₍_t_{) to a}

point.

(15)

Reeb graphs

1 _{For each} _t ∈_{R, contract each component of} _f−1₍_t_{) to a}

point.

(16)

(17)

Mapper

The mapper algorithm is a generalization of this procedure. [Singh-Memoli-Carlsson]

Assume given a data setX.

1 Choose a filter functionf :X →_R. 2 Choose a cover U_α of X.

3 _{Cluster each inverse image}_f−1₍_U_α_). 4 Form a graph where:

1 Clusters are vertices.

2 An edge connects two clustersC andC0 if bothUα∩Uα0 6=∅ andC∩C06=∅.

5 _{Color vertices according to average value of} _f _{in the cluster.}

(18)

Mapper

(19)

Mapper

(20)

Mapper

(21)

Mapper

3 _{Cluster each inverse image}_f−1₍_Uα_). 4 Form a graph where:

(22)

Mapper

(23)

Mapper

(24)

Mapper

2 An edge connects two clustersC andC0 if bothUα∩Uα0 6=∅

andC∩C06=∅.

(25)

Mapper

2 An edge connects two clustersC andC0 if bothUα∩Uα0 6=∅

andC∩C06=∅.

(26)

Filter functions

Clearly, choice of filter function is essential.

Some kind of density measure.

A score measure difference (distance) from some baseline. An eccentricity measure.

(27)

Filter functions

(28)

Filter functions

A score measure difference (distance) from some baseline.

An eccentricity measure.

(29)

Filter functions

(30)

(31)

Breast cancer example

Highly successful example of real data analysis. [Nicolau, Carlsson, Levine]

Working with vectors of gene expression data. Distance metric is correlation.

Filter is a measure of (unsigned) deviation of expression from normal tissue.

Results identify previously unknown c-MYB+ region, which are very different from normal tissue but have very high survival rates.

(32)

Breast cancer example

Working with vectors of gene expression data.

Distance metric is correlation.

(33)

Breast cancer example

(34)

Breast cancer example

(35)

Breast cancer example

(36)

(37)

NBA example

Clever example of application to sports analytics. [Alagappan]

Data set consists of vectors of statistics (points scored, rebounds, etc.).

Distance metric is Euclidean. Filter is points per minute.

Results identify many “new” positions.

(38)

NBA example

(39)

NBA example

Distance metric is Euclidean.

Filter is points per minute.

(40)

NBA example

(41)

NBA example

(42)

Role Player Scoring Rebounder One-of-a-kind Role-Playing Ball-Handler Shooting Ball-Handler Combo Ball-Handler Offensive Ball-Handler Defensive Ball-Handler All-NBA 1st Team All-NBA 2nd Team

Scoring Paint Protector 3PT Rebounder

Paint Protector

(43)

Summary

Claim

Mapper can be successfully applied to analysis of geometric structures in large data sets from a wide variety of domains.

Key idea: clustering across “scales”, represent relationships between clusters as scale varies

Choice of filter function(s) is critical to successful aplication.

(44)

Summary

Claim

(45)

Summary

Claim