https://portal.futuregrid.org
Parallel Deterministic Annealing
Clustering and its Application
to LC-MS Data Analysis
October 7 2013
IEEE International Conference on Big Data 2013 (IEEE BigData 2013) Santa Clara CA
Geoffrey Fox, D. R. Mani, Saumyadipta Pyne
[email protected] http://www.infomall.org
https://portal.futuregrid.org
Challenge
• Deterministic Annealing introduced ~1990 for clustering but no broadly available implementations although most tests rate well
• 2000 Extended to non metric spaces (Hofmann and Buhmann)
• 2010-2013 Applied to Dimension reduction, PLSI, Gaussian mixtures etc. (Fox et al.)
• 2011 Applied to model dependent single cluster at a time “peak matching" problem of the precise identification of the common LC-MS peaks across a cohort of multiple biological samples in proteomic biomarker discovery for data from a human tuberculosis cohort.
(Frühwirth, Mani and Pyne)
• Here apply “multi-cluster”, “annealed models”, “Continuous Clustering”, “parallelism” to proteomics case giving high
performance automatic robust method to quickly analyze
proteomics samples such as those taken in rapidly spreading epidemics
https://portal.futuregrid.org
Deterministic Annealing
Algorithms
https://portal.futuregrid.org
Some Motivation
• Big Data requires high performance – achieve with parallel computing
• Big Data sometimes requires robust algorithms as more opportunity to make mistakes
• Deterministic annealing (DA) is one of better approaches to robust optimization
– Started as “Elastic Net” by Durbin for Travelling Salesman Problem TSP
– Tends to remove local optima
– Addresses overfitting
– Much Faster than simulated annealing
• Physics systems find true lowest energy state if you anneal i.e. you equilibrate at each
temperature as you cool
https://portal.futuregrid.org
(Deterministic) Annealing
• Find minimum at high temperature when trivial
• Small change avoiding local minima as lower temperature
• Typically gets better answers than standard libraries- R and Mahout
• And can be parallelized and put on GPU’s etc.
https://portal.futuregrid.org
Basic Deterministic Annealing
• H() is objective function to be minimized as a function of parameters
• Gibbs Distribution at Temperature T
P() = exp( - H()/T) / d exp( - H()/T)
• Or P() = exp( - H()/T + F/T )
• Minimize Free Energy combining Objective Function and Entropy F = < H - T S(P) > = d {P()H + T P() lnP()}
• Simulated annealing performs these integrals by Monte Carlo
• Deterministic annealing corresponds to doing integrals analytically (by mean field approximation) and is much much faster
• In each case temperature is lowered slowly – say by a factor 0.95 to 0.9999 at each iteration
https://portal.futuregrid.org
Implementation of DA Central Clustering
•
Here points to be clustered are in a
metric space
•
Clustering variables are
M
i(
k
)
where this is probability that
point
i
belongs to cluster
k
and
k=1KM
i(
k
) = 1
•
In
Central
or
PW
Clustering, take
H
0=
i=1N
k=1KM
i(
k
)
i(
k
)
–
Linear form allows DA integrals to be done analytically
•
Central clustering
has
i(
k
) = (X(i)- Y(
k
))
2and M
i(
k
)
determined by Expectation E step
–
H
Central=
i=1N
k=1KM
i(
k
) (X(i)- Y(
k
))
2•
<M
i(
k
)> = exp( -
i(
k
)/T ) /
k=1Kexp( -
i(
k
)/T )
•
Centers Y(k) are determined in M step of EM method
https://portal.futuregrid.org
Trimmed Clustering
• Clustering with position-specific constraints on variance: Applying redescending M-estimators to label-free LC-MS data analysis
(Rudolf Frühwirth , D R Mani and Saumyadipta Pyne) BMC Bioinformatics 2011, 12:358
• HTCC = k=0K i=1N Mi(k) f(i,k)
– f(i,k) = (X(i) - Y(k))2/2(k)2 k > 0
– f(i,0) = c2 / 2 k = 0
• The 0’th cluster captures (at zero temperature) all points outside clusters (background)
• Clusters are trimmed
(X(i) - Y(k))2/2(k)2 < c2 / 2
• Relevant when well defined errors
T ~ 0
T = 1 T = 5
https://portal.futuregrid.org
Key Features of Proteomics
•
2-dimensional data arising from a list of peaks
specified by points (m/Z, RT), where m/Z is the
mass to charge ratio, and RT the retention time for
the peak representing by a peptide.
•
Measurement errors
(m/Z) = 5.98 10
-6m/Z and
(RT) = 2.35
–
Ratio of errors drastically different from ratio of
dimensions of (m/Z, RT) space which could distort high
temperature limit – solve by annealing
(m/Z)
•
2D(
x
) = ((m/Z|
cluster center– m/Z|
x)/
(m/Z))
2+
((RT|
cluster center– RT|
x)/
(RT))
2is model
https://portal.futuregrid.org
General Features of DA
•
In many problems, decreasing temperature is classic
multiscale
– finer resolution (√T is “just” distance scale)
– We have factors like (X(i)- Y(k))2 / T
•
In clustering, one then looks at
second derivative matrix
(can derive analytically) of Free Energy wrt
each cluster
position
and as temperature is lowered this develops
negative eigenvalue
– Or have multiple clusters at each center and perturb
– Trade-off depends on problem – high dimension takes time to find eigenvector; we use eigenvectors here as 2D
•
This is a
phase transition
and one splits cluster into two
and continues EM iteration till desired resolution reached
•
One can start with just one cluster
https://portal.futuregrid.org 11
•
System becomes unstable as Temperature lowered and
there
is a
phase transition
and one splits cluster into two
and continues EM iteration
•
One can start with just one cluster and need NOT specify
desired # clusters; rather specify cluster resolution
Rose, K., Gurewitz, E., and Fox, G. C.
``Statistical mechanics and phase transitions in clustering,'' Physical Review Letters,
65(8):945-948, August 1990.
https://portal.futuregrid.org
Proteomics 2D DA Clustering T= 25000
https://portal.futuregrid.org
The
brownish triangles
are sponge peaks outside any
cluster.
The colored hexagons are peaks inside clusters with the
white hexagons
being determined cluster center
13
https://portal.futuregrid.org
Continuous Clustering
• This is a very useful subtlety introduced by Ken Rose but not widely known although it greatly improves algorithm
• Take a cluster k to be split into 2 with centers Y(k)A and Y(k)Bwith initial
values Y(k)A = Y(k)Bat original center Y(k)
• Then typically if you make this change and perturb the Y(k)A and Y(k)B, they will
return to starting position
as F at stable minimum (positive eigenvalue)
• But instability (the negative eigenvalue) can develop and one finds
• Implement by adding arbitrary number p(k) of centers for each cluster Zi = k=1K p(k) exp(-i(k)/T) and M step gives p(k) = C(k)/N
• Halve p(k) at splits; can’t split easily in standard case p(k) = 1
• Show weighting in sums like Zi now equipoint not equicluster as p(k) proportional to points C(k) in cluster
14
Free Energy F
Y(k)A and Y(k)B
Y(k)A + Y(k)B
Free Energy F Free Energy F
https://portal.futuregrid.org
Deterministic Annealing
for Proteomics
https://portal.futuregrid.org
Proteomics Clustering Methods
• DAVS(c) is Parallel Trimmed DA clustering with clusters satisfying
2D(x) ≤ c2 ; c annealed from large value to given value at T~2
• DA2D scales m/Z and RT so clusters “circular” but does NOT trim them; there are no “sponge points”; there are clusters with 1 point
• All use start with 1 cluster center and
– “Continuous Clustering”
– Anneal (m/Z) from ~1 at T = ∞ to 5.98 10-6 m/Z at T~10
– Anneal c from T=40 to 2 (only introduce trimming at low temperatures)
• Mclust uses standard model-based non deterministic annealing clustering
• Landmarks are a collection of reference peaks (obtained by
identifying a subset of peaks using MS/MS peptide sequencing).
https://portal.futuregrid.org
Temperature1.00E+01 1.00E+00 1.00E-01 1.00E-02 1.00E-03
1.00E+02 1.00E+03 1.00E+04 1.00E+05 1.00E+06 Clus ter Count 0 10000 20000 30000 40000 50000 60000 DAVS(2) DA2D
Start Sponge DAVS(2)
Add Close Cluster Check
Sponge Reaches final value
Cluster Count v. Temperature for 2
Runs
• All start with one cluster at far left
• T=1 special as measurement errors divided out
https://portal.futuregrid.org 18 Landmark
Histograms of number of peaks in clusters for 4 clustering
https://portal.futuregrid.org
Basic Statistics
• Error is mean squared deviation of points from center in each dimension – sensitive to cut in 2D(x)
• DAVS(3) produces most large clusters; Mclust fewest
• Mclust has many more clusters with just one member
• DA2D similar to DAVS(2) except has some (probably false) associations of points far from center to cluster
19
Charge Method NumberClusters Number of Clusters with occupationcount Scaled ErrorCount > 1 Scaled ErrorCount > 30 1
(Sponge) 2 >2 >30 m/z RT m/z RT
https://portal.futuregrid.org
•
DAVS(2) and
DA2D discover
1020 of 1023
Landmark
peaks with
modest error
https://portal.futuregrid.org 21
Histograms of 2D(x) for 4 different clusters methods, and the landmark set plus expectation for a
Gaussian distribution with standard deviations given as (m/z)/3 and (RT)/3 in two directions. The “Landmark” distribution correspond to previously identified peaks used as a control set. Note DAVS(1) and DAVS(2) have sharp cut offs at 2D(x) = 1 and 4 respectively. Only clusters
https://portal.futuregrid.org
Basic Equations
• N Number of Points and K Clusters
• NK Unknowns <M
i(
k
)> determined by
•
i(
k
) = (X
i- Y(
k
))
2• <Mi(k)> = p(k) exp( -i(k)/T ) / k=1K p(k) exp( -i(k)/T )
• C(k) =
i=1N<M
i(k)> Number of points in Cluster
k
• Y(k) =
i=1N<M
i(k)> X
i/ C(k)
• p(k) = C(k) / N
• Iterate T = “∞” to 0.025
• <M
i(k)> is probability that point
i
in cluster
k
https://portal.futuregrid.org
Simple Parallelism as in k-means
•
Decompose points
i
over processors
•
Equations either pleasingly parallel “maps” over
i
•
Or “All-Reductions” summing over
i
for each
cluster
•
Parallel Algorithm:
–
Each process holds all clusters
and calculates
contributions to clusters from points in node
–
e.g.
Y(k) =
i=1N<M
i(k)> X
i/ C(k)
•
Runs well in MPI or MapReduce
–
See all the MapReduce k-means papers
https://portal.futuregrid.org
Better Parallelism
•
The previous model is correct at start but each point does
not really contribute to each cluster as damped
exponentially by exp( -
(X
i- Y(k))
2/T )
•
For Proteomics problem, on average
only 6.45 clusters
needed per point if require
(X
i- Y(k))
2/T ≤ ~40 (as exp(-40)
small)
•
So only need to keep nearby clusters for each point
•
As
average number of Clusters ~ 20,000
, this gives a factor
of ~3000 improvement
•
Further communication is no longer all global; it has
nearest neighbor components and calculated by
parallelism over clusters
https://portal.futuregrid.org 25
https://portal.futuregrid.org
Online/Realtime Clustering
•
Given a existing clustering, one can add new
data in two ways
•
Simplest is of course to interpolate new points
to nearest existing cluster
•
Better is to add new points and rerun full
algorithm starting at T~1 where
“convergence” is in range T=0.1 to 0.01
–
Takes 20% to 30% original execution time
https://portal.futuregrid.org
Summary
• Deterministic Annealing provides quality results keeping us healthy and running in model DAVS(c) or unconstrained fashion DA2D
– User can choose trade offs given by cut off c
• Parallel version gives a fast automatic initial analysis of LC-MS peaks with no user input needed including no input on final number of
clusters
• Little known “Continuous Clustering” useful
• Current open source code available but best wait till we finish conversion from C# to Java
• Parallel approach subtle as like particle in cell codes, have parallelism over clusters (cells) and/or points (particles)
– ? Useful different benchmark for compilers etc.
• Similar ideas relevant for other clustering and deterministic annealing fields such as non metric spaces, MDS
https://portal.futuregrid.org
Extras
https://portal.futuregrid.org 29
•
Start at T= “” with 1
Cluster
https://portal.futuregrid.org
Clusters v. Regions
• In Lymphocytes clusters are distinct
• In Pathology, clusters divide space into regions and
sophisticated methods like deterministic annealing are probably unnecessary
32
Pathology 54D
https://portal.futuregrid.org
Protein Universe Browser for COG Sequences with a
few illustrative biologically identified clusters
https://portal.futuregrid.org
Proteomics 2D DA Clustering T=0.1
small sample of ~30,000 Clusters Count >=2
34
Orange sponge points Outliers not in cluster
https://portal.futuregrid.org
Remarks on Clustering and MDS
• The standard data libraries (R, Matlab, Mahout) do not have best algorithms/software in either functionality or scalable parallelism
• A lot of algorithms are built around “classic full matrix” kernels
• Clustering, Gaussian Mixture Models, PLSI (probabilistic latent semantic indexing), LDA (Latent Dirichlet Allocation) similar
• Multi-Dimensional Scaling (MDS) classic information visualization algorithm for high dimension spaces (map preserving distances)
• Vector O(N) and Non Vector semimetric O(N2) space cases for N
points; “all” apps are points in spaces – not all “Proper linear spaces”
• Trying to release ~most powerful (in features/performance) available Clustering and MDS library although unfortunately in C#
• Supported Features: Vector, Non-Vector, Deterministic annealing, Hierarchical, sharp (trimmed) or general cluster sizes, Fixed points and general weights for MDS, (generalized Elkans algorithm)
https://portal.futuregrid.org
General Deterministic Annealing
•
For some cases such as vector clustering and Mixture
Models
one can do integrals by hand
but usually that will be
impossible
•
So introduce Hamiltonian
H0(
,
)
which by choice of
can be made similar to real Hamiltonian H
R(
) and which
has
tractable integrals
•
P0(
) = exp( - H0(
)/T + F0/T ) approximate Gibbs for HR
•
FR
(P0) = < HR
- T S0(P0) >|0
= < HR
– H0> |0
+ F0(P0)
•
Where
<…>|0
denotes
d
Po(
)
•
Easy to show that real Free Energy (the Gibb’s inequality)
FR
(PR) ≤ FR
(P0)
(Kullback-Leibler divergence)
•
Expectation step E
is find
minimizing FR
(P0) and
•
Follow with
M step (of EM) setting
= <
> |0
=
d
Po(
)
(mean field) and one follows with a traditional
minimization of remaining parameters
36
Note 3 types of variables
used to approximate real Hamiltonian
subject to annealing
https://portal.futuregrid.org
Implementation of DA-PWC
•
Clustering variables are again
M
i(
k
)
(these are
in general
approach) where this is probability point
i
belongs to cluster
k
•
Pairwise Clustering
Hamiltonian given by nonlinear form
•
HPWC
= 0.5
i=1N
j=1N
(
i
,
j
)
k=1KM
i(
k
) M
j(
k
) / C(
k
)
•
(
i
,
j
)
is pairwise distance between points
i
and
j
•
with
C(k) =
i=1NM
i(
k
)
as number of points in Cluster k
•
Take same form
H0
=
i=1N
k=1KM
i(
k
)
i(
k
)
as for central
clustering
•
i(
k
)
determined to minimize
FPWC
(P0) = < HPWC
- T S0(P0) >|0
where integrals can be easily done
•
And now linear (in
M
i(
k
)
) H
0and quadratic H
PCare different
•
Again
<M
i(
k
)> = exp( -
i(
k
)/T ) /
k=1Kexp( -
i(
k
)/T )
https://portal.futuregrid.org
Some Ideas
§
Deterministic annealing
is better than many well-used
optimization problems
§ Started as “Elastic Net” by Durbin for Travelling Salesman Problem TSP
§
Basic idea behind deterministic annealing is
mean field
approximation, which is also used in “Variational Bayes”
and “Variational inference”
§
Markov chain Monte Carlo (MCMC) methods are roughly
single temperature
simulated annealing
• Less sensitive to initial conditions
• Avoid local optima
https://portal.futuregrid.org
Some Uses of Deterministic Annealing
• Clustering
– Vectors: Rose (Gurewitz and Fox)
– Clusters with fixed sizes and no tails (Proteomics team at Broad)
– No Vectors: Hofmann and Buhmann (Just use pairwise distances)
• Dimension Reduction for visualization and analysis
– Vectors: GTM Generative Topographic Mapping
– No vectors SMACOF: Multidimensional Scaling) MDS (Just use pairwise distances)
• Can apply to HMM & general mixture models (less study)
– Gaussian Mixture Models
https://portal.futuregrid.org 40
Histograms of 2D(x) for 4 different clusters methods, and the landmark set plus expectation for
a Gaussian distribution with standard deviations given as 1/3 in the two directions. The
“Landmark” distribution correspond to previously identified peaks used as a control set. Note DAVS(1) and DAVS(2) have sharp cut offs at 2D(x) = 1 and 4 respectively. Only clusters with
https://portal.futuregrid.org
Some Problems
•
Analysis of Mass Spectrometry data to find peptides by
clustering peaks (Broad Institute)
– ~0.5 million points in 2 dimensions (one experiment) -- ~ 50,000 clusters summed over charges
•
Metagenomics – 0.5 million (increasing rapidly) points NOT
in a vector space – hundreds of clusters per sample
•
Pathology Images >50 Dimensions
•
Social image analysis is in a highish dimension vector space
– 10-50 million images; 1000 features per image; million clusters
•
Finding communities from network graphs coming from
Social media contacts etc.
– No vector space; can be huge in all ways
https://portal.futuregrid.org 42
https://portal.futuregrid.org 43
Parallelism within a Single Node of Madrid Cluster. A set of runs on 241605 peak data with a single node with 16 cores with either threads or MPI giving parallelism.
Parallelism is either number threads or number of MPI processes.
https://portal.futuregrid.org
Proteomics 2D DA Clustering T=0.1
small sample of ~30,000 Clusters Count >=2
44
Sponge Peaks