Towards High Performance Data Analytics with Java
SALIYA EKANAYAKE
4/1/2013 SALSA PRESENTATION
A Bit of Background
Gene Sequence Clustering and Visualization
◦ Projects
  ◦ Million sequence project
    http://salsahpc.indiana.edu/millionseq/
  ◦ Work on COG (Protein) sequences
    http://salsacog.blogspot.com/
  ◦ Work on phylogenetic trees
    http://salsafungiphy.blogspot.com/
◦ Publications
  ◦ G. L. H. Yang Ruan, Saliya Ekanayake, Ursel Schütte, James D. Bever, Haixu Tang, Geoffrey Fox, “Integration of Clustering and Multidimensional Scaling to Determine Phylogenetic Trees as Spherical Phylograms Visualized in 3 Dimensions,” in C4Bio 2014 of IEEE/ACM CCGrid 2014, Chicago, USA, 2014
  ◦ L. Stanberry, R. Higdon, W. Haynes, N. Kolker, W. Broomall, S. Ekanayake, A. Hughes, Y. Ruan, J. Qiu, E. Kolker, and G. Fox, “Visualizing the protein sequence universe,” in Proceedings of the 3rd international workshop on Emerging computational methods for the life sciences, Delft, The Netherlands, 2012, pp. 13-22
  ◦ Y. Ruan, S. Ekanayake, M. Rho, H. Tang, S.-H. Bae, J. Qiu, and G. Fox, “DACIDR: deterministic annealed clustering with interpolative dimension reduction using a large collection of 16S rRNA sequences,” in Proceedings of the ACM Conference on Bioinformatics, Computational Biology and Biomedicine, Orlando, Florida, 2012, pp. 329-336
  ◦ A. Hughes, Y. Ruan, S. Ekanayake, S. H. Bae, Q. Dong, M. Rho, J. Qiu, and G. Fox, “Interpolative multidimensional scaling techniques for the identification of clusters in very large sequence sets,” BMC Bioinformatics, vol. 13 Suppl 2, pp. S9, 2012
Under the Hood
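[Figure: processing pipeline with four stages (Alignment and Distance Calculation, Dimension Reduction, Clustering, Visualization) and data items labeled D1 to D5]
Sample sequence data:
>G0H13NN01D34CL
GTCGTTTAAGCCATTACGTC …
>G0H13NN01DK2OZ
GTCGTTAAGCCATTACGTC …
Sample coordinate data:
# X Y Z
0 0.358 0.262 0.295
1 0.252 0.422 0.372
Sample cluster data:
# Cluster
0 1
1 3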
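As an illustration of how the coordinate and cluster samples above could be combined before visualization, here is a minimal Java sketch. It assumes each coordinate line is "index x y z", each cluster line is "index cluster", and lines starting with "#" are headers. That layout is inferred from the samples on this slide, not from a documented file format. The class and method names (PlotPoints, join) are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: names and file layout are assumptions for illustration.
class PlotPoints {
    /** One point ready for plotting: row index, 3D coordinates, cluster label. */
    static final class Point {
        final int index;
        final double x, y, z;
        final int cluster;

        Point(int index, double x, double y, double z, int cluster) {
            this.index = index;
            this.x = x;
            this.y = y;
            this.z = z;
            this.cluster = cluster;
        }
    }

    /** Splits a data line on whitespace; returns null for blank lines and '#' headers. */
    private static String[] fields(String line) {
        String t = line.trim();
        return (t.isEmpty() || t.startsWith("#")) ? null : t.split("\\s+");
    }

    /** Joins coordinates with cluster labels by index; a missing label becomes -1. */
    static List<Point> join(List<String> coordLines, List<String> clusterLines) {
        Map<Integer, Integer> clusterOf = new HashMap<>();
        for (String line : clusterLines) {
            String[] f = fields(line);
            if (f != null) {
                clusterOf.put(Integer.parseInt(f[0]), Integer.parseInt(f[1]));
            }
        }
        List<Point> points = new ArrayList<>();
        for (String line : coordLines) {
            String[] f = fields(line);
            if (f == null) {
                continue;
            }
            int index = Integer.parseInt(f[0]);
            points.add(new Point(index,
                    Double.parseDouble(f[1]),
                    Double.parseDouble(f[2]),
                    Double.parseDouble(f[3]),
                    clusterOf.getOrDefault(index, -1)));
        }
        return points;
    }

    public static void main(String[] args) {
        List<Point> points = join(
                Arrays.asList("# X Y Z", "0 0.358 0.262 0.295", "1 0.252 0.422 0.372"),
                Arrays.asList("# Cluster", "0 1", "1 3"));
        for (Point p : points) {
            System.out.println(p.index + ": (" + p.x + ", " + p.y + ", " + p.z + ") cluster " + p.cluster);
        }
    }
}
```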
Reality Is More Complex
◦ Study of Biological Sequence Structure
  ◦ http://salsahpc.blogspot.com/2013/05/study-of-biological-sequence-structure.html
◦ Million Sequence Processes
  ◦ http://salsahpc.indiana.edu/millionseq/fungi2/fungi2_index.html
Runs On
◦ Tempest
  Windows HPC Cluster
◦ FutureGrid, BigRed II, Quarry
  Traditional Linux Based HPC Clusters
Algorithms
◦ Alignment and Distance Calculation
  ◦ SALSA-SWG: C# MPI
  ◦ SALSA-SWG-MBF: C# MPI
  ◦ SALSA-NW-MBF: C# MPI
  ◦ SALSA-SWG-MBF2Java: Java MapReduce
  ◦ SALSA-NW-BioJava: Java MapReduce
◦ Dimension Reduction
  ◦ MDSasChisq: C# MPI
  ◦ DA-SMACOF: C# MPI
  ◦ Twister DA-SMACOF: Java Iterative MapReduce
  ◦ WDA-SMACOF: Java Iterative MapReduce
◦ Clustering
  ◦ DAPWC: C# MPI
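To make the alignment step concrete, the sketch below computes a global alignment score in plain Java. It assumes that "NW" in the names above stands for Needleman-Wunsch global alignment; the slide does not expand the abbreviation. The scoring values (match, mismatch, linear gap penalty) are illustrative choices, and this is not the SALSA implementation, which may use different scoring and derive a distance from the alignment.

```java
// Illustrative sketch of Needleman-Wunsch scoring with a linear gap penalty.
// Class name and scoring parameters are hypothetical, not taken from SALSA.
class NeedlemanWunsch {
    /** Returns the optimal global alignment score of a and b. */
    static int score(String a, String b, int match, int mismatch, int gap) {
        int[] prev = new int[b.length() + 1];
        int[] cur = new int[b.length() + 1];
        // Aligning an empty prefix of a against j characters of b costs j gaps.
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j * gap;
        }
        for (int i = 1; i <= a.length(); i++) {
            cur[0] = i * gap;
            for (int j = 1; j <= b.length(); j++) {
                int pair = (a.charAt(i - 1) == b.charAt(j - 1)) ? match : mismatch;
                int diagonal = prev[j - 1] + pair; // align a[i-1] with b[j-1]
                int up = prev[j] + gap;            // gap in b
                int left = cur[j - 1] + gap;       // gap in a
                cur[j] = Math.max(diagonal, Math.max(up, left));
            }
            // Reuse the two rows instead of allocating a full matrix.
            int[] tmp = prev;
            prev = cur;
            cur = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        // Sequence fragments as shown on the "Under the Hood" slide.
        String s1 = "GTCGTTTAAGCCATTACGTC";
        String s2 = "GTCGTTAAGCCATTACGTC";
        System.out.println("score = " + score(s1, s2, 1, -1, -1));
    }
}
```

Only two rows of the dynamic programming table are kept, so memory grows with the length of one sequence rather than with the product of both lengths.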
Towards Java
Motivation
◦ Immediate
  Limited Windows HPC Clusters
◦ Future
  Integrate with Apache Big Data Stack (ABDS)
Options
◦ Keep C#
  ◦ Run on Azure cloud
    Not the best for MPI because of high latency and low bandwidth
  ◦ Run on Mono
    We tried it and it worked, but performance was poor
◦ Convert to Java
  ◦ Time consuming, but gave good results
“Java Ready” Applications
◦ Deterministic Annealing Vector Sponge (DAVS)
Evaluations
MPI Frameworks
◦ MPI.NET
  A high performance message passing interface for .NET environment
◦ FastMPJ
  A pure Java implementation of mpiJava 1.2 specification
◦ OpenMPI
  Java wrapper for native MPI implementation
  ◦ Nightly snapshot 1.9a1r28881 (OMPI-nightly), conforms to the mpiJava 1.2 specification
  ◦ Source tree revision 30301 (OMPI-trunk)
  ◦ Release candidate version 1.7.5rc5 (OMPI-175rc5), latest of the three
Kernel Benchmarks
◦ OSU Micro-Benchmarks (OMB) Suite
  ◦ Send and receive
  ◦ Allreduce
Application Benchmarks
◦ DAVS and DAPWC on Real Data
◦ Parallel Patterns of T x P x N
  ◦ T - # threads per process
  ◦ P - # MPI processes per node
  ◦ N - # nodes
◦ Threads from Habanero Java Library
  Mainly for Parallel Loops
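The TxPxN notation can be read mechanically. The sketch below parses a label such as 2x4x8 into its three factors and computes the totals they imply. The class and method names are hypothetical. Treating T x P x N as the total parallelism is an assumption, although it matches the way later plots group patterns by equal products.

```java
// Hypothetical helper for reading TxPxN labels; not part of any SALSA code.
class ParallelPattern {
    final int threadsPerProcess; // T
    final int processesPerNode;  // P
    final int nodes;             // N

    ParallelPattern(int threadsPerProcess, int processesPerNode, int nodes) {
        if (threadsPerProcess < 1 || processesPerNode < 1 || nodes < 1) {
            throw new IllegalArgumentException("T, P, and N must be positive");
        }
        this.threadsPerProcess = threadsPerProcess;
        this.processesPerNode = processesPerNode;
        this.nodes = nodes;
    }

    /** Parses a label of the form "TxPxN", for example "2x4x8". */
    static ParallelPattern parse(String label) {
        String[] parts = label.trim().split("x");
        if (parts.length != 3) {
            throw new IllegalArgumentException("expected TxPxN, got: " + label);
        }
        return new ParallelPattern(
                Integer.parseInt(parts[0]),
                Integer.parseInt(parts[1]),
                Integer.parseInt(parts[2]));
    }

    /** Number of MPI processes across all nodes (P x N). */
    int totalProcesses() {
        return processesPerNode * nodes;
    }

    /** Number of concurrent workers overall (T x P x N). */
    int totalParallelism() {
        return threadsPerProcess * processesPerNode * nodes;
    }

    public static void main(String[] args) {
        ParallelPattern p = parse("2x4x8");
        System.out.println(p.totalProcesses() + " MPI processes, "
                + p.totalParallelism() + " workers in total");
    }
}
```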
Kernel Benchmarks
MPI Send and Receive
[Figure: average time (us, log scale) vs. message size (0B to 1MB). Series: MPI.NET C# in Tempest, FastMPJ Java in FG, OMPI-nightly Java FG, OMPI-trunk Java FG, OMPI-trunk C FG, OMPI-nightly C FG]
[Figure: average time (us, log scale) vs. message size (0B to 1MB). Series: OMPI-trunk C Madrid, OMPI-trunk Java Madrid, OMPI-trunk C FG, OMPI-trunk Java FG]
Kernel Benchmarks
MPI Allreduce
[Figure: Performance with Different MPI Frameworks. Average time (us, log scale) vs. message size (4B to 8MB). Series: MPI.NET C# in Tempest, FastMPJ Java in FG, OMPI-nightly Java FG, OMPI-trunk Java FG, OMPI-trunk C FG, OMPI-nightly C FG]
[Figure: OMPI-trunk Performance with and without Infiniband. Average time (us, log scale) vs. message size (4B to 8MB)]
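For readers unfamiliar with the operation being timed, the following sketch shows what a sum allreduce computes: every rank contributes a buffer, and every rank receives the element-wise sum of all buffers. It deliberately uses no MPI library; ranks are simulated as rows of an array, so it illustrates the semantics only and says nothing about how any of the benchmarked frameworks implement or time the call. The class and method names are hypothetical.

```java
import java.util.Arrays;

// Semantics-only illustration of a sum allreduce; no MPI library is involved.
class AllreduceSketch {
    /**
     * sendBuffers[r] is the buffer contributed by simulated rank r.
     * Returns one receive buffer per rank, each holding the element-wise sum.
     */
    static double[][] allreduceSum(double[][] sendBuffers) {
        int ranks = sendBuffers.length;
        if (ranks == 0) {
            return new double[0][];
        }
        int count = sendBuffers[0].length;
        double[] reduced = new double[count];
        for (double[] buffer : sendBuffers) {
            if (buffer.length != count) {
                throw new IllegalArgumentException("all buffers must have equal length");
            }
            for (int i = 0; i < count; i++) {
                reduced[i] += buffer[i];
            }
        }
        // Every rank gets its own copy of the reduced result.
        double[][] receiveBuffers = new double[ranks][];
        for (int r = 0; r < ranks; r++) {
            receiveBuffers[r] = reduced.clone();
        }
        return receiveBuffers;
    }

    public static void main(String[] args) {
        double[][] result = allreduceSum(new double[][] {{1, 2}, {3, 4}, {5, 6}});
        for (int r = 0; r < result.length; r++) {
            System.out.println("rank " + r + " receives " + Arrays.toString(result[r]));
        }
    }
}
```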
DAVS Performance
Mode – Charge5
[Figure: speedup vs. TxPxN (1x1x1, 1x1x2, 1x2x1, 1x1x4, 1x4x1, 1x1x8, 1x2x4, 1x4x2, 1x8x1). Series: MPI.NET, OMPI-nightly, OMPI-trunk]
[Figure: time (hours) vs. TxPxN (2x1x8, 4x1x8, 8x1x8, 1x2x8, 4x2x8, 1x4x8, 2x4x8). Series: MPI.NET, OMPI-nightly, OMPI-trunk]
[Figure: time (hours) vs. TxPxN (1x1x1, 1x1x2, 1x2x1, 1x1x4, 1x4x1, 1x1x8, 1x2x4, 1x4x2, 1x8x1). Series: MPI.NET, OMPI-nightly, OMPI-trunk]
DAVS Performance
Mode – Charge2
[Figure panels: Pure MPI, MPI with Threads, Pure MPI Speedup]
[Figure: time (hours) vs. TxPxN (1x1x1, 1x1x2, 1x2x1, 1x1x4, 1x4x1, 1x1x8, 1x2x4, 1x4x2). Series: MPI.NET, OMPI-nightly, OMPI-trunk]
[Figure: speedup vs. TxPxN (1x1x1, 1x1x2, 1x2x1, 1x1x4, 1x4x1, 1x1x8, 1x2x4, 1x4x2). Series: MPI.NET, OMPI-nightly, OMPI-trunk]
[Figure: TxPxN (2x1x8, 4x1x8, 8x1x8, 1x2x8, 4x2x8, 1x4x8, 2x4x8, 1x8x8)]
DAVS Performance
Single Node Charge 2, Charge 5 and Charge 6
Points
◦ OMPI-trunk performed best, with OMPI-nightly close behind
◦ MPI.NET may be suffering from poor Infiniband performance
◦ FastMPJ had issues that prevented it from running the applications
◦ Performance with threads is below expectations for Java
DAPWC Performance
OMPI-175 Only (Chosen over OMPI-trunk)
[Figure: time (hours) vs. TxPxN, patterns 1x1x1 through 8x1x32, plus 1x8x43]
DAPWC Performance
Parallelism
16
[Figure: time (hours) vs. TxPxN, patterns 1x1x16 through 8x1x32]
DAPWC Performance
Speedup
Points
◦ Performance with threads is better than DAVS, but Tx1xN is peculiar
◦ FastMPJ failed as before
◦ MPI.NET and OMPI-nightly runs are yet to be performed
[Figure: speedup vs. TxPxN, patterns 1x1x1 through 8x1x32]
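The speedup plots can be related to the timing plots with a short calculation. The sketch below assumes speedup is measured against the serial 1x1x1 run, which the slides do not state explicitly, and adds parallel efficiency as speedup divided by total parallelism (T x P x N). The timings in main are made-up values for illustration, not results from this presentation, and the class name is hypothetical.

```java
// Illustrative only: baseline choice and all numbers are assumptions.
class SpeedupSketch {
    /** Speedup of a parallel run relative to a baseline run. */
    static double speedup(double baselineHours, double parallelHours) {
        if (baselineHours <= 0 || parallelHours <= 0) {
            throw new IllegalArgumentException("times must be positive");
        }
        return baselineHours / parallelHours;
    }

    /** Parallel efficiency: speedup divided by the number of workers used. */
    static double efficiency(double baselineHours, double parallelHours, int parallelism) {
        if (parallelism < 1) {
            throw new IllegalArgumentException("parallelism must be at least 1");
        }
        return speedup(baselineHours, parallelHours) / parallelism;
    }

    public static void main(String[] args) {
        double baselineHours = 4.0; // hypothetical 1x1x1 time
        double parallelHours = 0.5; // hypothetical 2x2x4 time
        int parallelism = 2 * 2 * 4;
        System.out.println("speedup    = " + speedup(baselineHours, parallelHours));
        System.out.println("efficiency = " + efficiency(baselineHours, parallelHours, parallelism));
    }
}
```

An efficiency near 1 would mean the added threads, processes, and nodes are being used almost fully; lower values indicate overhead such as communication or thread contention.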