Clustering and mapper


Andrew J. Blumberg

June 17th, 2014


Overview

Goal of talk

Explain Mapper, which is the most widely used and most successful technique in topological data analysis (TDA). (It is at the core of Ayasdi, a TDA company co-founded by Gunnar Carlsson.)

Basic idea: perform clustering at different “scales”, track how clusters change as scale varies.

Motivation:

1 Coarser than manifold learning, but still works in nonlinear situations.

2 Still retains meaningful geometric information about data set.

3 Efficiently computable (and so can apply to very large data sets).


Morse theory

Basic idea

Describe topology of a smooth manifold $M$ using level sets of a suitable function $h \colon M \to \mathbb{R}$.

We recover $M$ by looking at the sublevel sets $h^{-1}((-\infty, t])$, as $t$ scans over the range of $h$.

Topology of the sublevel set changes only when $t$ passes a critical value of $h$.
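As a rough illustration, here is a minimal Python sketch of a discrete, one-dimensional stand-in for this picture. It is not a smooth manifold: the sample heights `h` and the helper `sublevel_components` are hypothetical names introduced only for this example. It counts the connected components of each sublevel set of a function sampled along a curve.

```python
def sublevel_components(values, t):
    """Count connected components of {i : values[i] <= t} along a path.

    Consecutive samples are treated as neighbours, so a component is a
    maximal run of consecutive indices whose value is at most t.
    """
    count = 0
    inside = False
    for v in values:
        if v <= t:
            if not inside:  # a new run starts here
                count += 1
            inside = True
        else:
            inside = False
    return count


# Hypothetical heights sampled along a curve.
h = [0.0, 2.0, 1.0, 3.0, 0.5, 4.0]

thresholds = sorted(set(h))
counts = [sublevel_components(h, t) for t in thresholds]
for t, c in zip(thresholds, counts):
    print(f"t = {t}: {c} component(s)")
# counts == [1, 2, 3, 2, 1, 1]
```

In this toy case a new component appears each time $t$ passes a local minimum (0.0, 0.5, 1.0), and two components merge each time $t$ passes an interior local maximum (2.0, 3.0). Between those values the count does not change.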


Reeb graphs

Convenient simplification:

1 For each $t \in \mathbb{R}$, contract each component of $f^{-1}(t)$ to a point.

2 Resulting structure is a graph.
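In symbols, the contraction can be written as the standard quotient construction. The notation $\sim$ and $R_f$ is introduced here for illustration and does not appear on the slides:

```latex
x \sim y
\;\iff\;
f(x) = f(y) = t
\ \text{and}\ x, y \ \text{lie in the same connected component of}\ f^{-1}(t),
\qquad
R_f = M / \sim .
```

Each point of $R_f$ is one connected component of one level set, and $f$ descends to a well-defined function on $R_f$.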


Mapper

The mapper algorithm is a generalization of this procedure. [Singh-Memoli-Carlsson]

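Assume given a data set $X$.

1 Choose a filter function $f \colon X \to \mathbb{R}$.

2 Choose a cover $\{U_\alpha\}$ of the image $f(X) \subseteq \mathbb{R}$.

3 Cluster each inverse image $f^{-1}(U_\alpha)$.

4 Form a graph where:

   1 Clusters are vertices.

   2 An edge connects two clusters $C \subseteq f^{-1}(U_\alpha)$ and $C' \subseteq f^{-1}(U_{\alpha'})$ if both $U_\alpha \cap U_{\alpha'} \neq \emptyset$ and $C \cap C' \neq \emptyset$.

5 Color vertices according to average value of $f$ in the cluster.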
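The steps above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the reference implementation of [Singh-Memoli-Carlsson]. The cover is taken to be equal-length overlapping intervals, the clustering is single linkage cut at a fixed threshold `eps`, and all function and parameter names (`interval_cover`, `single_linkage`, `mapper`, `n_intervals`, `overlap`, `eps`) are hypothetical choices made for this example.

```python
from itertools import combinations
from math import cos, dist, pi, sin


def interval_cover(lo, hi, n_intervals, overlap):
    """Cover [lo, hi] by equal-length intervals; neighbours share a
    fraction `overlap` of their length (assumed cover shape)."""
    length = (hi - lo) / (n_intervals - (n_intervals - 1) * overlap)
    step = length * (1 - overlap)
    cover = [(lo + i * step, lo + i * step + length) for i in range(n_intervals)]
    cover[-1] = (cover[-1][0], hi)  # guard against floating-point shortfall
    return cover


def single_linkage(points, indices, eps):
    """Cluster `indices` as connected components of the graph joining
    points at distance <= eps (assumed clustering method)."""
    unvisited = set(indices)
    clusters = []
    while unvisited:
        stack = [unvisited.pop()]
        component = set(stack)
        while stack:
            i = stack.pop()
            near = {j for j in unvisited if dist(points[i], points[j]) <= eps}
            unvisited -= near
            component |= near
            stack.extend(near)
        clusters.append(frozenset(component))
    return clusters


def mapper(points, f, n_intervals=4, overlap=0.5, eps=0.5):
    values = [f(p) for p in points]                                   # step 1
    cover = interval_cover(min(values), max(values), n_intervals, overlap)  # step 2
    vertices = []  # each vertex is a cluster: a frozenset of point indices
    for a, b in cover:                                                # step 3
        preimage = [i for i, v in enumerate(values) if a <= v <= b]
        vertices.extend(single_linkage(points, preimage, eps))
    # Step 4: clusters from the same interval are disjoint, so a shared
    # point already implies that the two intervals overlap.
    edges = [(i, j) for i, j in combinations(range(len(vertices)), 2)
             if vertices[i] & vertices[j]]
    # Step 5: colour each vertex by the mean filter value of its cluster.
    colors = [sum(values[k] for k in c) / len(c) for c in vertices]
    return vertices, edges, colors


# Toy data: 40 points on the unit circle, filtered by height (y-coordinate).
circle = [(cos(2 * pi * k / 40), sin(2 * pi * k / 40)) for k in range(40)]
vertices, edges, colors = mapper(circle, lambda p: p[1],
                                 n_intervals=4, overlap=0.5, eps=0.2)
print(len(vertices), len(edges))  # 6 6
```

On this toy input the output graph is a single cycle with six vertices, so the loop in the data survives the simplification. Different choices of cover and clustering threshold can give different graphs for the same data.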


Filter functions

Clearly, choice of filter function is essential.

Some kind of density measure.

A score measuring difference (distance) from some baseline.

An eccentricity measure.
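For concreteness, the three kinds of filter listed above might look as follows in Python. This is a sketch only. The specific formulas (a Gaussian kernel for density, Euclidean distance to a baseline point, mean distance to all points for eccentricity) and the names `density`, `distance_to_baseline`, `eccentricity` and `bandwidth` are assumptions made for this example; the slides do not fix them.

```python
from math import dist, exp


def density(points, bandwidth=1.0):
    """Unnormalised Gaussian kernel density estimate at each data point."""
    n = len(points)
    return [
        sum(exp(-dist(p, q) ** 2 / (2 * bandwidth ** 2)) for q in points) / n
        for p in points
    ]


def distance_to_baseline(points, baseline):
    """Euclidean distance from each point to a fixed baseline point."""
    return [dist(p, baseline) for p in points]


def eccentricity(points):
    """Mean distance from each point to all points in the data set.

    Points near the 'centre' of the data get small values and outlying
    points get large ones.
    """
    n = len(points)
    return [sum(dist(p, q) for q in points) / n for p in points]


# Toy data: three nearby points and one outlier.
pts = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (10.0, 10.0)]
dens = density(pts)
ecc = eccentricity(pts)
base = distance_to_baseline(pts, (0.0, 0.0))
print(dens.index(min(dens)), ecc.index(max(ecc)), base[1])
```

On the toy data the outlier has the lowest density and the highest eccentricity, which is the sort of separation a filter is meant to expose.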


Breast cancer example

Highly successful example of real data analysis. [Nicolau, Carlsson, Levine]

Working with vectors of gene expression data.

Distance metric is based on correlation.

Filter is a measure of (unsigned) deviation of expression from normal tissue.

Results identify a previously unknown c-MYB+ subgroup of tumors, which is very different from normal tissue but has a very high survival rate.
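To make the two ingredients concrete, here is a small Python sketch. It is not the method of [Nicolau, Carlsson, Levine]. It assumes the common convention of 1 minus the Pearson correlation as the dissimilarity, and it uses a purely hypothetical stand-in for the filter, namely the Euclidean norm of a sample's deviation from a mean normal-tissue profile. All names and numbers below are invented for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def correlation_distance(x, y):
    """Assumed convention: 0 for perfectly correlated profiles,
    2 for perfectly anti-correlated ones."""
    return 1.0 - pearson(x, y)


def deviation_from_normal(sample, normal_mean):
    """Hypothetical filter: unsigned size of the deviation of a sample
    from the mean normal-tissue expression profile."""
    return sqrt(sum((s - m) ** 2 for s, m in zip(sample, normal_mean)))


# Invented expression vectors, for illustration only.
normal_mean = [0.0, 0.0, 0.0]
tumor_a = [1.0, 2.0, 3.0]
tumor_b = [2.0, 4.0, 6.0]   # same pattern as tumor_a, larger amplitude
tumor_c = [3.0, 2.0, 1.0]   # reversed pattern

print(correlation_distance(tumor_a, tumor_b))
print(correlation_distance(tumor_a, tumor_c))
print(deviation_from_normal([1.0, 2.0, 2.0], normal_mean))
```

Under this convention, `tumor_a` and `tumor_b` are close because they share an expression pattern, even though the filter would assign them different deviation values.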


NBA example

Clever example of application to sports analytics. [Alagappan]

Data set consists of vectors of statistics (points scored, rebounds, etc.).

Distance metric is Euclidean.

Filter is points per minute.

Results identify many “new” positions.


[Figure: Mapper graph of NBA players. Legend: Role Player; Scoring Rebounder; One-of-a-kind; Role-Playing Ball-Handler; Shooting Ball-Handler; Combo Ball-Handler; Offensive Ball-Handler; Defensive Ball-Handler; All-NBA 1st Team; All-NBA 2nd Team; Scoring Paint Protector; 3PT Rebounder; Paint Protector]


Summary

Claim

Mapper can be successfully applied to analysis of geometric structures in large data sets from a wide variety of domains.

Key idea: clustering across “scales”, representing relationships between clusters as scale varies.

Choice of filter function(s) is critical to successful application.
