• No results found

Parallel Deterministic Annealing Clustering and its Application to LC MS Data Analysis

N/A
N/A
Protected

Academic year: 2020

Share "Parallel Deterministic Annealing Clustering and its Application to LC MS Data Analysis"

Copied!
44
0
0

Loading.... (view fulltext now)

Full text

(1)

https://portal.futuregrid.org

Parallel Deterministic Annealing

Clustering and its Application

to LC-MS Data Analysis

October 7 2013

IEEE International Conference on Big Data 2013 (IEEE BigData 2013) Santa Clara CA

Geoffrey Fox, D. R. Mani, Saumyadipta Pyne

[email protected] http://www.infomall.org

(2)

https://portal.futuregrid.org

Challenge

Deterministic Annealing introduced ~1990 for clustering but no broadly available implementations although most tests rate well

• 2000 Extended to non metric spaces (Hofmann and Buhmann)

• 2010-2013 Applied to Dimension reduction, PLSI, Gaussian mixtures etc. (Fox et al.)

• 2011 Applied to model dependent single cluster at a time “peak matching" problem of the precise identification of the common LC-MS peaks across a cohort of multiple biological samples in proteomic biomarker discovery for data from a human tuberculosis cohort.

(Frühwirth, Mani and Pyne)

• Here apply “multi-cluster”, “annealed models”, “Continuous Clustering”, “parallelism” to proteomics case giving high

performance automatic robust method to quickly analyze

proteomics samples such as those taken in rapidly spreading epidemics

(3)

https://portal.futuregrid.org

Deterministic Annealing

Algorithms

(4)

https://portal.futuregrid.org

Some Motivation

• Big Data requires high performance – achieve with parallel computing

• Big Data sometimes requires robust algorithms as more opportunity to make mistakes

Deterministic annealing (DA) is one of better approaches to robust optimization

– Started as “Elastic Net” by Durbin for Travelling Salesman Problem TSP

– Tends to remove local optima

– Addresses overfitting

– Much Faster than simulated annealing

• Physics systems find true lowest energy state if you anneal i.e. you equilibrate at each

temperature as you cool

(5)

https://portal.futuregrid.org

(Deterministic) Annealing

• Find minimum at high temperature when trivial

• Small change avoiding local minima as lower temperature

• Typically gets better answers than standard libraries- R and Mahout

• And can be parallelized and put on GPU’s etc.

(6)

https://portal.futuregrid.org

Basic Deterministic Annealing

H() is objective function to be minimized as a function of parameters 

• Gibbs Distribution at Temperature T

P() = exp( - H()/T) /  d exp( - H()/T)

• Or P() = exp( - H()/T + F/T )

• Minimize Free Energy combining Objective Function and Entropy F = < H - T S(P) > =  d {P()H + T P() lnP()}

• Simulated annealing performs these integrals by Monte Carlo

• Deterministic annealing corresponds to doing integrals analytically (by mean field approximation) and is much much faster

• In each case temperature is lowered slowly – say by a factor 0.95 to 0.9999 at each iteration

(7)

https://portal.futuregrid.org

Implementation of DA Central Clustering

Here points to be clustered are in a

metric space

Clustering variables are

M

i

(

k

)

where this is probability that

point

i

belongs to cluster

k

and

k=1K

M

i

(

k

) = 1

In

Central

or

PW

Clustering, take

H

0

=

i=1N

k=1K

M

i

(

k

)

i

(

k

)

Linear form allows DA integrals to be done analytically

Central clustering

has

i

(

k

) = (X(i)- Y(

k

))

2

and M

i

(

k

)

determined by Expectation E step

H

Central

=

i=1N

k=1K

M

i

(

k

) (X(i)- Y(

k

))

2

<M

i

(

k

)> = exp( -

i

(

k

)/T ) /

k=1K

exp( -

i

(

k

)/T )

Centers Y(k) are determined in M step of EM method

(8)

https://portal.futuregrid.org

Trimmed Clustering

Clustering with position-specific constraints on variance: Applying redescending M-estimators to label-free LC-MS data analysis

(Rudolf Frühwirth , D R Mani and Saumyadipta Pyne) BMC Bioinformatics 2011, 12:358

• HTCC = k=0K i=1N Mi(k) f(i,k)

– f(i,k) = (X(i) - Y(k))2/2(k)2 k > 0

– f(i,0) = c2 / 2 k = 0

• The 0’th cluster captures (at zero temperature) all points outside clusters (background)

• Clusters are trimmed

(X(i) - Y(k))2/2(k)2 < c2 / 2

• Relevant when well defined errors

T ~ 0

T = 1 T = 5

(9)

https://portal.futuregrid.org

Key Features of Proteomics

2-dimensional data arising from a list of peaks

specified by points (m/Z, RT), where m/Z is the

mass to charge ratio, and RT the retention time for

the peak representing by a peptide.

Measurement errors

(m/Z) = 5.98 10

-6

m/Z and

(RT) = 2.35

Ratio of errors drastically different from ratio of

dimensions of (m/Z, RT) space which could distort high

temperature limit – solve by annealing

(m/Z)

2D

(

x

) = ((m/Z|

cluster center

– m/Z|

x

)/

(m/Z))

2

+

((RT|

cluster center

– RT|

x

)/

(RT))

2

is model

(10)

https://portal.futuregrid.org

General Features of DA

In many problems, decreasing temperature is classic

multiscale

– finer resolution (√T is “just” distance scale)

– We have factors like (X(i)- Y(k))2 / T

In clustering, one then looks at

second derivative matrix

(can derive analytically) of Free Energy wrt

each cluster

position

and as temperature is lowered this develops

negative eigenvalue

– Or have multiple clusters at each center and perturb

– Trade-off depends on problem – high dimension takes time to find eigenvector; we use eigenvectors here as 2D

This is a

phase transition

and one splits cluster into two

and continues EM iteration till desired resolution reached

One can start with just one cluster

(11)

https://portal.futuregrid.org 11

System becomes unstable as Temperature lowered and

there

is a

phase transition

and one splits cluster into two

and continues EM iteration

One can start with just one cluster and need NOT specify

desired # clusters; rather specify cluster resolution

Rose, K., Gurewitz, E., and Fox, G. C.

``Statistical mechanics and phase transitions in clustering,'' Physical Review Letters,

65(8):945-948, August 1990.

(12)

https://portal.futuregrid.org

Proteomics 2D DA Clustering T= 25000

(13)

https://portal.futuregrid.org

The

brownish triangles

are sponge peaks outside any

cluster.

The colored hexagons are peaks inside clusters with the

white hexagons

being determined cluster center

13

(14)

https://portal.futuregrid.org

Continuous Clustering

• This is a very useful subtlety introduced by Ken Rose but not widely known although it greatly improves algorithm

• Take a cluster k to be split into 2 with centers Y(k)A and Y(k)Bwith initial

values Y(k)A = Y(k)Bat original center Y(k)

• Then typically if you make this change and perturb the Y(k)A and Y(k)B, they will

return to starting position

as F at stable minimum (positive eigenvalue)

• But instability (the negative eigenvalue) can develop and one finds

• Implement by adding arbitrary number p(k) of centers for each cluster Zi =k=1K p(k) exp(-i(k)/T) and M step gives p(k) = C(k)/N

Halve p(k) at splits; can’t split easily in standard case p(k) = 1

• Show weighting in sums like Zi now equipoint not equicluster as p(k) proportional to points C(k) in cluster

14

Free Energy F

Y(k)A and Y(k)B

Y(k)A + Y(k)B

Free Energy F Free Energy F

(15)

https://portal.futuregrid.org

Deterministic Annealing

for Proteomics

(16)

https://portal.futuregrid.org

Proteomics Clustering Methods

DAVS(c) is Parallel Trimmed DA clustering with clusters satisfying

2D(x) ≤ c2 ; c annealed from large value to given value at T~2

DA2D scales m/Z and RT so clusters “circular” but does NOT trim them; there are no “sponge points”; there are clusters with 1 point

• All use start with 1 cluster center and

– “Continuous Clustering

Anneal(m/Z) from ~1 at T = ∞ to 5.98 10-6 m/Z at T~10

Anneal c from T=40 to 2 (only introduce trimming at low temperatures)

Mclust uses standard model-based non deterministic annealing clustering

Landmarks are a collection of reference peaks (obtained by

identifying a subset of peaks using MS/MS peptide sequencing).

(17)

https://portal.futuregrid.org

Temperature1.00E+01 1.00E+00 1.00E-01 1.00E-02 1.00E-03

1.00E+02 1.00E+03 1.00E+04 1.00E+05 1.00E+06 Clus ter Count 0 10000 20000 30000 40000 50000 60000 DAVS(2) DA2D

Start Sponge DAVS(2)

Add Close Cluster Check

Sponge Reaches final value

Cluster Count v. Temperature for 2

Runs

• All start with one cluster at far left

• T=1 special as measurement errors divided out

(18)

https://portal.futuregrid.org 18 Landmark

Histograms of number of peaks in clusters for 4 clustering

(19)

https://portal.futuregrid.org

Basic Statistics

• Error is mean squared deviation of points from center in each dimension – sensitive to cut in 2D(x)

• DAVS(3) produces most large clusters; Mclust fewest

• Mclust has many more clusters with just one member

• DA2D similar to DAVS(2) except has some (probably false) associations of points far from center to cluster

19

Charge Method NumberClusters Number of Clusters with occupationcount Scaled ErrorCount > 1 Scaled ErrorCount > 30 1

(Sponge) 2 >2 >30 m/z RT m/z RT

(20)

https://portal.futuregrid.org

DAVS(2) and

DA2D discover

1020 of 1023

Landmark

peaks with

modest error

(21)

https://portal.futuregrid.org 21

Histograms of2D(x) for 4 different clusters methods, and the landmark set plus expectation for a

Gaussian distribution with standard deviations given as(m/z)/3 and(RT)/3 in two directions. The “Landmark” distribution correspond to previously identified peaks used as a control set. Note DAVS(1) and DAVS(2) have sharp cut offs at2D(x) = 1 and 4 respectively. Only clusters

(22)

https://portal.futuregrid.org

Basic Equations

• N Number of Points and K Clusters

• NK Unknowns <M

i

(

k

)> determined by

i

(

k

) = (X

i

- Y(

k

))

2

• <Mi(k)> = p(k) exp( -i(k)/T ) / k=1K p(k) exp( -i(k)/T )

• C(k) =

i=1N

<M

i

(k)> Number of points in Cluster

k

• Y(k) =

i=1N

<M

i

(k)> X

i

/ C(k)

• p(k) = C(k) / N

• Iterate T = “∞” to 0.025

• <M

i

(k)> is probability that point

i

in cluster

k

(23)

https://portal.futuregrid.org

Simple Parallelism as in k-means

Decompose points

i

over processors

Equations either pleasingly parallel “maps” over

i

Or “All-Reductions” summing over

i

for each

cluster

Parallel Algorithm:

Each process holds all clusters

and calculates

contributions to clusters from points in node

e.g.

Y(k) =

i=1N

<M

i

(k)> X

i

/ C(k)

Runs well in MPI or MapReduce

See all the MapReduce k-means papers

(24)

https://portal.futuregrid.org

Better Parallelism

The previous model is correct at start but each point does

not really contribute to each cluster as damped

exponentially by exp( -

(X

i

- Y(k))

2

/T )

For Proteomics problem, on average

only 6.45 clusters

needed per point if require

(X

i

- Y(k))

2

/T ≤ ~40 (as exp(-40)

small)

So only need to keep nearby clusters for each point

As

average number of Clusters ~ 20,000

, this gives a factor

of ~3000 improvement

Further communication is no longer all global; it has

nearest neighbor components and calculated by

parallelism over clusters

(25)

https://portal.futuregrid.org 25

(26)

https://portal.futuregrid.org

Online/Realtime Clustering

Given a existing clustering, one can add new

data in two ways

Simplest is of course to interpolate new points

to nearest existing cluster

Better is to add new points and rerun full

algorithm starting at T~1 where

“convergence” is in range T=0.1 to 0.01

Takes 20% to 30% original execution time

(27)

https://portal.futuregrid.org

Summary

Deterministic Annealing provides quality results keeping us healthy and running in model DAVS(c) or unconstrained fashion DA2D

– User can choose trade offs given by cut off c

• Parallel version gives a fast automatic initial analysis of LC-MS peaks with no user input needed including no input on final number of

clusters

• Little known “Continuous Clustering” useful

• Current open source code available but best wait till we finish conversion from C# to Java

Parallel approach subtle as like particle in cell codes, have parallelism over clusters (cells) and/or points (particles)

– ? Useful different benchmark for compilers etc.

• Similar ideas relevant for other clustering and deterministic annealing fields such as non metric spaces, MDS

(28)

https://portal.futuregrid.org

Extras

(29)

https://portal.futuregrid.org 29

Start at T= “” with 1

Cluster

(30)
(31)
(32)

https://portal.futuregrid.org

Clusters v. Regions

• In Lymphocytes clusters are distinct

• In Pathology, clusters divide space into regions and

sophisticated methods like deterministic annealing are probably unnecessary

32

Pathology 54D

(33)

https://portal.futuregrid.org

Protein Universe Browser for COG Sequences with a

few illustrative biologically identified clusters

(34)

https://portal.futuregrid.org

Proteomics 2D DA Clustering T=0.1

small sample of ~30,000 Clusters Count >=2

34

Orange sponge points Outliers not in cluster

(35)

https://portal.futuregrid.org

Remarks on Clustering and MDS

• The standard data libraries (R, Matlab, Mahout) do not have best algorithms/software in either functionality or scalable parallelism

• A lot of algorithms are built around “classic full matrix” kernels

Clustering, Gaussian Mixture Models, PLSI (probabilistic latent semantic indexing), LDA (Latent Dirichlet Allocation) similar

Multi-Dimensional Scaling (MDS) classic information visualization algorithm for high dimension spaces (map preserving distances)

Vector O(N) and Non Vector semimetric O(N2) space cases for N

points; “all” apps are points in spaces – not all “Proper linear spaces”

• Trying to release ~most powerful (in features/performance) available Clustering and MDS library although unfortunately in C#

Supported Features: Vector, Non-Vector, Deterministic annealing, Hierarchical, sharp (trimmed) or general cluster sizes, Fixed points and general weights for MDS, (generalized Elkans algorithm)

(36)

https://portal.futuregrid.org

General Deterministic Annealing

For some cases such as vector clustering and Mixture

Models

one can do integrals by hand

but usually that will be

impossible

So introduce Hamiltonian

H0(

,

)

which by choice of

can be made similar to real Hamiltonian H

R

(

) and which

has

tractable integrals

P0(

) = exp( - H0(

)/T + F0/T ) approximate Gibbs for HR

FR

(P0) = < HR

- T S0(P0) >|0

= < HR

– H0> |0

+ F0(P0)

Where

<…>|0

denotes

d

Po(

)

Easy to show that real Free Energy (the Gibb’s inequality)

FR

(PR) ≤ FR

(P0)

(Kullback-Leibler divergence)

Expectation step E

is find

minimizing FR

(P0) and

Follow with

M step (of EM) setting

= <

> |0

=

d

 

Po(

)

(mean field) and one follows with a traditional

minimization of remaining parameters

36

Note 3 types of variables

used to approximate real Hamiltonian

subject to annealing

(37)

https://portal.futuregrid.org

Implementation of DA-PWC

Clustering variables are again

M

i

(

k

)

(these are

in general

approach) where this is probability point

i

belongs to cluster

k

Pairwise Clustering

Hamiltonian given by nonlinear form

HPWC

= 0.5

i=1N

j=1N

(

i

,

j

)

k=1K

M

i

(

k

) M

j

(

k

) / C(

k

)

(

i

,

j

)

is pairwise distance between points

i

and

j

with

C(k) =

i=1N

M

i

(

k

)

as number of points in Cluster k

Take same form

H0

=

i=1N

k=1K

M

i

(

k

)

i

(

k

)

as for central

clustering

i

(

k

)

determined to minimize

FPWC

(P0) = < HPWC

- T S0(P0) >|0

where integrals can be easily done

And now linear (in

M

i

(

k

)

) H

0

and quadratic H

PC

are different

Again

<M

i

(

k

)> = exp( -

i

(

k

)/T ) /

k=1K

exp( -

i

(

k

)/T )

(38)

https://portal.futuregrid.org

Some Ideas

§

Deterministic annealing

is better than many well-used

optimization problems

§ Started as “Elastic Net” by Durbin for Travelling Salesman Problem TSP

§

Basic idea behind deterministic annealing is

mean field

approximation, which is also used in “Variational Bayes”

and “Variational inference”

§

Markov chain Monte Carlo (MCMC) methods are roughly

single temperature

simulated annealing

• Less sensitive to initial conditions

• Avoid local optima

(39)

https://portal.futuregrid.org

Some Uses of Deterministic Annealing

• Clustering

– Vectors: Rose (Gurewitz and Fox)

– Clusters with fixed sizes and no tails (Proteomics team at Broad)

– No Vectors: Hofmann and Buhmann (Just use pairwise distances)

• Dimension Reduction for visualization and analysis

– Vectors: GTM Generative Topographic Mapping

– No vectors SMACOF: Multidimensional Scaling) MDS (Just use pairwise distances)

• Can apply to HMM & general mixture models (less study)

– Gaussian Mixture Models

(40)

https://portal.futuregrid.org 40

Histograms of2D(x) for 4 different clusters methods, and the landmark set plus expectation for

a Gaussian distribution with standard deviations given as 1/3 in the two directions. The

“Landmark” distribution correspond to previously identified peaks used as a control set. Note DAVS(1) and DAVS(2) have sharp cut offs at2D(x) = 1 and 4 respectively. Only clusters with

(41)

https://portal.futuregrid.org

Some Problems

Analysis of Mass Spectrometry data to find peptides by

clustering peaks (Broad Institute)

– ~0.5 million points in 2 dimensions (one experiment) -- ~ 50,000 clusters summed over charges

Metagenomics – 0.5 million (increasing rapidly) points NOT

in a vector space – hundreds of clusters per sample

Pathology Images >50 Dimensions

Social image analysis is in a highish dimension vector space

– 10-50 million images; 1000 features per image; million clusters

Finding communities from network graphs coming from

Social media contacts etc.

– No vector space; can be huge in all ways

(42)

https://portal.futuregrid.org 42

(43)

https://portal.futuregrid.org 43

Parallelism within a Single Node of Madrid Cluster. A set of runs on 241605 peak data with a single node with 16 cores with either threads or MPI giving parallelism.

Parallelism is either number threads or number of MPI processes.

(44)

https://portal.futuregrid.org

Proteomics 2D DA Clustering T=0.1

small sample of ~30,000 Clusters Count >=2

44

Sponge Peaks

References

Related documents

In these cases, you have two options: apply your desired effects directly to the photo using image editing software, or use CSS background images and borders to style the

Come accennato in precedenza, sul sito Xilinx sono disponibili molti altri esempi di riferimento per questa scheda oltre alla demo di base; due dei più interessanti

The specific high energy and power capacities of lithium metal (Li 0 ) batteries are ideally suited to portable devices and are valuable as storage units for intermittent

Because of the large movement is mainly in the 4th joint (elbow joint), the load torque at the 4th joint is the best option to see the rehabilitation progress as shown in Fig.

Based on the study of variables within the scope of this research, the following conclusions can be drawn. 2) For Unsymmetrical I-Beams, the maximum prestress and eccentricity that

In this study zeolites LTL (potassium form (K_LTL) and proton form (H_LTL)) were used as an alternative adsorbent for adsorption of paraquat in aqueous

[r]

(invited oral presentation, 51th annual meeting of the Biophysical Society, Baltimore, MD,USA, 2007... 51th annual meeting of the Biophysical Society, Baltimore, MD,USA,