Towards High Performance Data Analytics with Java

(1)


SALIYA EKANAYAKE

4/1/2013 SALSA PRESENTATION


(2)

A Bit of Background

Gene Sequence Clustering and Visualization

◦ Projects
  ◦ Million sequence project: http://salsahpc.indiana.edu/millionseq/
  ◦ Work on COG (protein) sequences: http://salsacog.blogspot.com/
  ◦ Work on phylogenetic trees: http://salsafungiphy.blogspot.com/

◦ Publications
  ◦ G. L. H. Yang Ruan, Saliya Ekanayake, Ursel Schütte, James D. Bever, Haixu Tang, Geoffrey Fox, “Integration of Clustering and Multidimensional Scaling to Determine Phylogenetic Trees as Spherical Phylograms Visualized in 3 Dimensions,” in C4Bio 2014 of IEEE/ACM CCGrid 2014, Chicago, USA, 2014
  ◦ L. Stanberry, R. Higdon, W. Haynes, N. Kolker, W. Broomall, S. Ekanayake, A. Hughes, Y. Ruan, J. Qiu, E. Kolker, and G. Fox, “Visualizing the protein sequence universe,” in Proceedings of the 3rd international workshop on Emerging computational methods for the life sciences, Delft, The Netherlands, 2012, pp. 13-22
  ◦ Y. Ruan, S. Ekanayake, M. Rho, H. Tang, S.-H. Bae, J. Qiu, and G. Fox, “DACIDR: deterministic annealed clustering with interpolative dimension reduction using a large collection of 16S rRNA sequences,” in Proceedings of the ACM Conference on Bioinformatics, Computational Biology and Biomedicine, Orlando, Florida, 2012, pp. 329-336
  ◦ A. Hughes, Y. Ruan, S. Ekanayake, S. H. Bae, Q. Dong, M. Rho, J. Qiu, and G. Fox, “Interpolative multidimensional scaling techniques for the identification of clusters in very large sequence sets,” BMC Bioinformatics, vol. 13 Suppl 2, pp. S9, 2012

(3)

Under the Hood


Pipeline: D1 → Alignment and Distance Calculation → D2 → Dimension Reduction → D3 → Clustering → D4 → Visualization → D5

Sample sequences:

>G0H13NN01D34CL GTCGTTTAAGCCATTACGTC …
>G0H13NN01DK2OZ GTCGTTAAGCCATTACGTC …

Sample 3D coordinates:

# X Y Z
0 0.358 0.262 0.295
1 0.252 0.422 0.372

Sample cluster assignments:

# Cluster
0 1
1 3
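The sample coordinate rows suggest a whitespace-separated layout: a point index followed by X, Y and Z values. As a minimal sketch, assuming that layout and assuming lines starting with `#` are headers, such rows could be read as follows. `PointParser` and `Point3D` are hypothetical names and are not taken from the SALSA tools.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical reader for "index x y z" rows; not part of the SALSA code. */
final class PointParser {

    /** One indexed point in 3D space. */
    static final class Point3D {
        final int index;
        final double x;
        final double y;
        final double z;

        Point3D(int index, double x, double y, double z) {
            this.index = index;
            this.x = x;
            this.y = y;
            this.z = z;
        }
    }

    /** Parses data rows, skipping blank lines and '#' header lines (an assumed convention). */
    static List<Point3D> parse(List<String> lines) {
        List<Point3D> points = new ArrayList<>();
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue;
            }
            String[] fields = line.split("\\s+");
            if (fields.length != 4) {
                throw new IllegalArgumentException("Expected 4 fields but got: " + line);
            }
            points.add(new Point3D(
                    Integer.parseInt(fields[0]),
                    Double.parseDouble(fields[1]),
                    Double.parseDouble(fields[2]),
                    Double.parseDouble(fields[3])));
        }
        return points;
    }
}
```

Rejecting rows that do not have exactly four fields is a deliberate choice here: it surfaces damaged rows early instead of silently shifting coordinates.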

Reality Is More Complex

◦ Study of Biological Sequence Structure: http://salsahpc.blogspot.com/2013/05/study-of-biological-sequence-structure.html
◦ Million Sequence Processes: http://salsahpc.indiana.edu/millionseq/fungi2/fungi2_index.html

Runs On

◦ Tempest: Windows HPC cluster
◦ FutureGrid, BigRed II, Quarry: traditional Linux-based HPC clusters

Algorithms

◦ Alignment and Distance Calculation
  ◦ SALSA-SWG: C# MPI
  ◦ SALSA-SWG-MBF: C# MPI
  ◦ SALSA-NW-MBF: C# MPI
  ◦ SALSA-SWG-MBF2Java: Java MapReduce
  ◦ SALSA-NW-BioJava: Java MapReduce
◦ Dimension Reduction
  ◦ MDSasChisq: C# MPI
  ◦ DA-SMACOF: C# MPI
  ◦ Twister DA-SMACOF: Java Iterative MapReduce
  ◦ WDA-SMACOF: Java Iterative MapReduce
◦ Clustering
  ◦ DAPWC: C# MPI

(4)

Towards Java

Motivation

◦ Immediate: limited number of Windows HPC clusters
◦ Future: integrate with the Apache Big Data Stack (ABDS)

Options

◦ Keep C#
  ◦ Run on the Azure cloud: not the best fit for MPI because of high latency and low bandwidth
  ◦ Run on Mono: we tried it and it worked, but performance was poor
◦ Convert to Java
  ◦ Time-consuming, but it gave good results

“Java Ready” Applications

◦ Deterministic Annealing Vector Sponge (DAVS)

(5)

Evaluations

MPI Frameworks

◦ MPI.NET: a high-performance message passing interface for the .NET environment
◦ FastMPJ: a pure Java implementation of the mpiJava 1.2 specification
◦ OpenMPI: a Java wrapper for the native MPI implementation
  ◦ Nightly snapshot 1.9a1r28881 (OMPI-nightly), which conforms to the mpiJava 1.2 specification
  ◦ Source tree revision 30301 (OMPI-trunk)
  ◦ Release candidate version 1.7.5rc5 (OMPI-175rc5), the latest of the three

Kernel Benchmarks

◦ Ohio MicroBenchmark (OMB) Suite
  ◦ Send and receive
  ◦ Allreduce
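Allreduce combines a contribution from every process with a reduction operator and leaves the combined result on all of them. The sketch below only illustrates that semantics inside a single JVM, with plain arrays standing in for per-rank buffers and sum as the operator. It does not call MPI.NET, FastMPJ or OpenMPI, and `AllreduceSketch` is a hypothetical name.

```java
/** In-process illustration of allreduce semantics; no MPI library is involved. */
final class AllreduceSketch {

    /**
     * Replaces every rank's buffer with the element-wise sum over all ranks.
     * rankBuffers[r] plays the role of the send/receive buffer of rank r.
     */
    static void allreduceSum(double[][] rankBuffers) {
        if (rankBuffers.length == 0) {
            return;
        }
        int length = rankBuffers[0].length;
        double[] sum = new double[length];

        // Reduce step: accumulate every rank's contribution.
        for (double[] buffer : rankBuffers) {
            if (buffer.length != length) {
                throw new IllegalArgumentException("All ranks must contribute equal-length buffers");
            }
            for (int i = 0; i < length; i++) {
                sum[i] += buffer[i];
            }
        }

        // Broadcast step: every rank ends up with the same reduced values.
        for (double[] buffer : rankBuffers) {
            System.arraycopy(sum, 0, buffer, 0, length);
        }
    }
}
```

A real benchmark times this exchange across processes and nodes for each message size, so the network and the language binding dominate the cost rather than the arithmetic shown here.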

Application Benchmarks

◦ DAVS and DAPWC on real data
◦ Parallel patterns of T x P x N
  ◦ T: number of threads per process
  ◦ P: number of MPI processes per node
  ◦ N: number of nodes
◦ Threads from the Habanero Java library, mainly for parallel loops
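To make the role of T (threads per process) concrete, here is a minimal sketch of a parallel loop inside one process. It uses the standard `java.util.concurrent` executor API as a stand-in and is not the Habanero Java library API. `ParallelLoopSketch` and the sum-of-squares workload are hypothetical choices for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Stand-in for a threaded parallel loop; uses java.util.concurrent, not Habanero Java. */
final class ParallelLoopSketch {

    /** Sums the squares of values using the given number of threads, one contiguous chunk per thread. */
    static double sumOfSquares(final double[] values, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            int chunk = (values.length + threads - 1) / threads; // ceiling division
            List<Future<Double>> parts = new ArrayList<>();
            for (int t = 0; t < threads; t++) {
                final int start = Math.min(values.length, t * chunk);
                final int end = Math.min(values.length, start + chunk);
                Callable<Double> task = () -> {
                    double local = 0.0;
                    for (int i = start; i < end; i++) {
                        local += values[i] * values[i];
                    }
                    return local;
                };
                parts.add(pool.submit(task));
            }
            // Combine the per-thread partial sums in a fixed order.
            double total = 0.0;
            for (Future<Double> part : parts) {
                total += part.get();
            }
            return total;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("Parallel loop failed", e);
        } finally {
            pool.shutdown();
        }
    }
}
```

Each thread accumulates into its own local variable and the partial sums are combined at the end, which avoids shared mutable state inside the loop body.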

(6)

Kernel Benchmarks

MPI Send and Receive

[Figure, panel 1: average time (µs, log scale from 1 to 10000) versus message size (0 B to 1 MB). Series: MPI.NET C# in Tempest; FastMPJ Java in FG; OMPI-nightly Java FG; OMPI-trunk Java FG; OMPI-trunk C FG; OMPI-nightly C FG.]

[Figure, panel 2: average time (µs, log scale from 1 to 10000) versus message size (0 B to 1 MB). Series: OMPI-trunk C Madrid; OMPI-trunk Java Madrid; OMPI-trunk C FG; OMPI-trunk Java FG.]

(7)

Kernel Benchmarks

MPI Allreduce

[Figure, “Performance with Different MPI Frameworks”: average time (µs, log scale from 10 to 100000) versus message size (4 B to 8 MB). Series: MPI.NET C# in Tempest; FastMPJ Java in FG; OMPI-nightly Java FG; OMPI-trunk Java FG; OMPI-trunk C FG; OMPI-nightly C FG.]

[Figure, “OMPI-trunk Performance with and without Infiniband”: average time (µs, log scale from 1 to 1000000) versus message size (4 B to 8 MB).]

(8)

DAVS Performance

Mode: Charge5

[Figure, panel 1: speedup (1 to 6) versus TxPxN for 1x1x1, 1x1x2, 1x2x1, 1x1x4, 1x4x1, 1x1x8, 1x2x4, 1x4x2, 1x8x1. Series: MPI.NET, OMPI-nightly, OMPI-trunk.]

[Figure, panel 2: time (hours, 0 to 0.35) versus TxPxN for 2x1x8, 4x1x8, 8x1x8, 1x2x8, 4x2x8, 1x4x8, 2x4x8. Series: MPI.NET, OMPI-nightly, OMPI-trunk.]

[Figure, panel 3: time (hours, 0 to 1.2) versus TxPxN for 1x1x1, 1x1x2, 1x2x1, 1x1x4, 1x4x1, 1x1x8, 1x2x4, 1x4x2, 1x8x1. Series: MPI.NET, OMPI-nightly, OMPI-trunk.]

(9)

DAVS Performance

Mode: Charge2

[Figure with three panels titled “Pure MPI”, “MPI with Threads” and “Pure MPI Speedup”. Recoverable axes: time (hours, 0 to 30) versus TxPxN for 1x1x1, 1x1x2, 1x2x1, 1x1x4, 1x4x1, 1x1x8, 1x2x4, 1x4x2; and speedup (1 to 5.5) versus TxPxN. Series: MPI.NET, OMPI-nightly, OMPI-trunk.]

(10)

DAVS Performance

Single Node Charge 2, Charge 5 and Charge 6

Points

◦ OMPI-trunk performed best, with OMPI-nightly close behind
◦ MPI.NET may be held back by poor Infiniband performance
◦ FastMPJ had issues that prevented it from running the applications
◦ Threading performance in Java fell short of expectations

(11)

DAPWC Performance

OMPI-175 only (chosen over OMPI-trunk)

[Figure: time (hours, 0 to 5) versus TxPxN, for patterns from 1x1x1 through 8x1x32, plus 1x8x43.]

(12)

DAPWC Performance

Parallelism

[Figure: time (hours, 0 to 0.8) versus TxPxN, for patterns from 1x1x16 through 8x1x32. The T x P x N products of the listed patterns range from 16 to 256.]
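The axis labels use the TxPxN notation defined earlier: T threads per process, P MPI processes per node, N nodes. As a small sketch, a label such as `2x4x8` can be split into its three factors, and their product gives the total number of concurrent workers. `ParallelPattern` is a hypothetical helper, and reading the product as "total parallelism" is an interpretation rather than something stated on the slides.

```java
/** Hypothetical helper for TxPxN labels such as "2x4x8". */
final class ParallelPattern {
    final int threadsPerProcess; // T
    final int processesPerNode;  // P
    final int nodes;             // N

    ParallelPattern(int threadsPerProcess, int processesPerNode, int nodes) {
        this.threadsPerProcess = threadsPerProcess;
        this.processesPerNode = processesPerNode;
        this.nodes = nodes;
    }

    /** Parses a label of the form TxPxN; rejects anything without exactly three parts. */
    static ParallelPattern parse(String label) {
        String[] parts = label.trim().split("x");
        if (parts.length != 3) {
            throw new IllegalArgumentException("Expected TxPxN but got: " + label);
        }
        return new ParallelPattern(
                Integer.parseInt(parts[0]),
                Integer.parseInt(parts[1]),
                Integer.parseInt(parts[2]));
    }

    /** Total concurrent workers across the cluster: T * P * N. */
    int totalParallelism() {
        return threadsPerProcess * processesPerNode * nodes;
    }
}
```

For example, `ParallelPattern.parse("1x1x16")` and `ParallelPattern.parse("4x2x2")` both give a total of 16, which is why such patterns can be compared side by side on one chart.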

(13)

DAPWC Performance

Speedup

Points

◦ Performance with threads is better than for DAVS, but the Tx1xN case is peculiar
◦ FastMPJ failed as before
◦ MPI.NET and OMPI-nightly runs are yet to be performed

[Figure: speedup (axis from 1 to 121) versus TxPxN, for patterns from 1x1x1 through 8x1x32.]
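Speedup is conventionally the run time of a baseline configuration divided by the run time of the parallel configuration, and parallel efficiency is that speedup divided by the number of workers. The slides do not state which baseline was used, so the sketch below takes it as a parameter. `SpeedupSketch` is a hypothetical name.

```java
/** Conventional speedup and efficiency formulas; the baseline choice is left to the caller. */
final class SpeedupSketch {

    /** Speedup = baseline time / parallel time (both in the same unit, e.g. hours). */
    static double speedup(double baselineTime, double parallelTime) {
        if (baselineTime <= 0.0 || parallelTime <= 0.0) {
            throw new IllegalArgumentException("Times must be positive");
        }
        return baselineTime / parallelTime;
    }

    /** Efficiency = speedup / number of workers (for TxPxN patterns, T * P * N). */
    static double efficiency(double speedup, int totalParallelism) {
        if (totalParallelism <= 0) {
            throw new IllegalArgumentException("Parallelism must be positive");
        }
        return speedup / totalParallelism;
    }
}
```

With made-up numbers, a 4.0 hour baseline and a 0.5 hour run on 16 workers give a speedup of 8 and an efficiency of 0.5. These values are illustrative only and are not read from the charts.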

(14)

Current Tasks and Future

Current

◦ Complete migration of applications to Java
◦ Evaluate performance
◦ Investigate the “not so great” thread performance

Future

◦ How to integrate with ABDS?

(15)

Thank you!