Java for High Performance Cloud Computing
Guillermo López Taboada
Computer Architecture Group
University of A Coruña (Spain)
[email protected]
October 31st, 2012, Future Networks and Distributed SystemS (FUNDS),
Manchester Metropolitan University, Manchester (UK)
Outline
1 Java for High Performance Computing
    Java Shared Memory Programming
    Java Message-Passing
    Java GPGPU
    Development of Efficient HPC Benchmarks
2 High Performance Cloud Computing
    AWS IaaS for HPC and Big Data
3 Performance Evaluation
    Evaluation of Java HPC
    Evaluation of HPC in the Cloud
    Case study: ProtTest-HPC
Java is an Alternative for HPC in the Multi-core Era
Language popularity (% skilled developers):
#1 C (19.8%)
#2 Java (17.2%)
#3 Objective-C (9.48%)
#4 C++ (9.3%)
#19 Matlab (0.59%)
#27 Fortran (0.35%)
C and Java are the most popular languages. Fortran, a popular language in HPC, is #27.
Java is an Alternative for HPC in the Multi-core Era
Interesting features:
Built-in networking
Built-in multi-threading
Portable, platform independent
Object Oriented
Main training language
Many productive parallel/distributed programming libs:
Java shared memory programming (high level facilities: Concurrency framework)
Java Sockets
Java RMI
Message-Passing in Java (MPJ) libraries
Apache Hadoop
Java Adoption in HPC
HPC developers and users usually want to use Java in their projects.
Java code is no longer slow (Just-In-Time compilation)!
But there are still performance penalties in Java communications.
Pros and cons:
High programming productivity, but they are highly concerned about performance.
JIT Performance:
Like native performance.
Java can even outperform native languages (C, Fortran) thanks to dynamic compilation!
High Java Communications Overhead:
Poor support for high-speed networks.
Data copies between the Java heap and native code through JNI.
Costly data serialization.
The use of communication protocols unsuitable for HPC.
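To make the serialization point above more concrete, here is a minimal sketch (not taken from the talk) that encodes the same double array in two ways: with generic Java object serialization and with a bulk copy into a byte buffer. The class and method names are hypothetical and only standard java.io and java.nio calls are used. It compares encoded sizes only, not timings.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.nio.ByteBuffer;

// Hypothetical helper class, written only for illustration.
class SerializationCost {
    // Generic object serialization: writes stream and class metadata
    // in addition to the raw values.
    static byte[] viaObjectStream(double[] data) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(data);
            }
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    // Bulk copy of the primitive values: exactly 8 bytes per double.
    static byte[] viaByteBuffer(double[] data) {
        ByteBuffer buf = ByteBuffer.allocate(data.length * Double.BYTES);
        buf.asDoubleBuffer().put(data);
        return buf.array();
    }

    // Inverse of viaByteBuffer.
    static double[] fromByteBuffer(byte[] raw) {
        double[] data = new double[raw.length / Double.BYTES];
        ByteBuffer.wrap(raw).asDoubleBuffer().get(data);
        return data;
    }

    public static void main(String[] args) {
        double[] data = new double[1024];
        for (int i = 0; i < data.length; i++) data[i] = i * 0.5;
        System.out.println("ObjectOutputStream bytes: " + viaObjectStream(data).length);
        System.out.println("ByteBuffer bytes:         " + viaByteBuffer(data).length);
    }
}
```

For arrays of primitives the size overhead is small; the cost usually grows with arrays of objects, where every element has to be traversed and encoded.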
Java for HPC
Current options in Java for HPC:
Java Shared Memory Programming (popular)
Java RMI (poor performance)
Java Sockets (low level programming)
Message-Passing in Java (MPJ) (widespread, easy to use and reasonably good performance)
Java for HPC
Java Shared Memory Programming:
Java Threads
Concurrency Framework (ThreadPools, Tasks ...)
Parallel Java (PJ)
Listing 1: Java Threads
class BasicThread extends Thread {
    // This method is called when the thread runs
    public void run() {
        for (int i = 0; i < 1000; i++)
            Test.increaseCount();
    }
}

class Test {
    static int counter = 0;

    // synchronized, so that concurrent increments are not lost
    static synchronized void increaseCount() {
        counter++;
    }

    public static void main(String argv[]) throws InterruptedException {
        // Create and start the threads
        Thread t1 = new BasicThread();
        Thread t2 = new BasicThread();
        t1.start();
        t2.start();
        // Wait for both threads before reading the counter
        t1.join();
        t2.join();
        System.out.println("Counter=" + counter);
    }
}
Listing 2: Java Concurrency Framework highlights
class Test {
    public static void main(String argv[]) {
        // create the thread pool
        ThreadPool threadPool = new ThreadPool(numThreads);
        // run example tasks (tasks implement Runnable)
        for (int i = 0; i < numTasks; i++) {
            threadPool.runTask(createTask(i));
        }
        // close the pool and wait for all tasks to finish.
        threadPool.join();
    }
    // Other interesting classes from the concurrency framework:
    // CyclicBarrier
    // ConcurrentHashMap
    // PriorityBlockingQueue
    // Executors ...
}
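The same pool-of-tasks pattern can be sketched with the standard java.util.concurrent classes. This is an illustrative example, not code from the talk: the class PoolSum and its method are hypothetical names, and summing a range of integers is only a placeholder workload.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical example class: sums 1..n with a fixed-size thread pool.
class PoolSum {
    static long parallelSum(int n, int numThreads, int numTasks) {
        // create the thread pool
        ExecutorService pool = Executors.newFixedThreadPool(numThreads);
        try {
            List<Future<Long>> parts = new ArrayList<>();
            int chunk = (n + numTasks - 1) / numTasks; // ceiling division
            for (int t = 0; t < numTasks; t++) {
                final int lo = t * chunk + 1;
                final int hi = Math.min(n, (t + 1) * chunk);
                // each task sums its own sub-range [lo, hi]
                Callable<Long> task = () -> {
                    long s = 0;
                    for (int i = lo; i <= hi; i++) s += i;
                    return s;
                };
                parts.add(pool.submit(task));
            }
            // wait for every task and combine the partial results
            long total = 0;
            for (Future<Long> f : parts) total += f.get();
            return total;
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            // close the pool
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(parallelSum(1000, 4, 16));
    }
}
```

Using Future.get() to collect results also acts as the synchronization point, so no explicit join on the pool is needed in this sketch.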
JOMP is the Java OpenMP binding
Listing 3: JOMP example
public static void main(String argv[]) {
    int myid;
    // example data (size is illustrative)
    int i, n = 1000;
    double[] a = new double[n], b = new double[n];

    //omp parallel private(myid)
    {
        myid = OMP.getThreadNum();
        System.out.println("Hello from " + myid);
    }

    //omp parallel for
    for (i = 1; i < n; i++) {
        b[i] = (a[i] + a[i - 1]) * 0.5;
    }
}
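As a rough illustration of what the parallel for directive in Listing 3 asks for, the loop can be split by hand across plain Java threads. This is a sketch under assumptions: the class name ParallelAverage is hypothetical, and the static block partition shown here is one plausible schedule, not necessarily the one JOMP generates.

```java
// Hypothetical example class: hand-written equivalent of a parallel loop.
class ParallelAverage {
    // Computes b[i] = (a[i] + a[i-1]) * 0.5 for i in [1, n) with numThreads threads.
    static double[] smooth(final double[] a, int numThreads) {
        final int n = a.length;
        final double[] b = new double[n];
        Thread[] workers = new Thread[numThreads];
        for (int t = 0; t < numThreads; t++) {
            // static block partition of the iterations 1..n-1
            final int lo = 1 + (int) ((long) (n - 1) * t / numThreads);
            final int hi = 1 + (int) ((long) (n - 1) * (t + 1) / numThreads);
            workers[t] = new Thread(() -> {
                // iterations are independent: each one writes a distinct b[i]
                for (int i = lo; i < hi; i++) {
                    b[i] = (a[i] + a[i - 1]) * 0.5;
                }
            });
            workers[t].start();
        }
        // wait for all workers, like the implicit barrier at the end of the loop
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new RuntimeException(e);
            }
        }
        return b;
    }

    public static void main(String[] args) {
        double[] b = smooth(new double[] {0, 2, 4, 6}, 2);
        System.out.println(java.util.Arrays.toString(b));
    }
}
```

The directive-based version keeps the sequential loop intact, which is the productivity argument for JOMP; the manual version makes the partitioning and the final synchronization explicit.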
MPJ is the Java binding of MPI
Listing 4: MPJ example
import mpi.*;

public class Hello {
    public static void main(String[] args) throws Exception {
        MPI.Init(args);
        int rank = MPI.COMM_WORLD.Rank();
        int msg_tag = 13;
        if (rank == 0) {
            // rank 0 sends one String object to rank 1
            int peer_process = 1;
            String[] msg = new String[1];
            msg[0] = new String("Hello");
            MPI.COMM_WORLD.Send(msg, 0, 1, MPI.OBJECT, peer_process, msg_tag);
        } else if (rank == 1) {
            // rank 1 receives the object and prints it
            int peer_process = 0;
            String[] msg = new String[1];
            MPI.COMM_WORLD.Recv(msg, 0, 1, MPI.OBJECT, peer_process, msg_tag);
            System.out.println(msg[0]);
        }
        MPI.Finalize();
    }
}
Columns: Pure Java Impl.; Socket impl. (Java IO, Java NIO); High-speed network support (Myrinet, InfiniBand, SCI); API (mpiJava 1.2, JGF MPJ, Other APIs)
MPJava: X X X
Jcluster: X X X
Parallel Java: X X X
mpiJava: X X X X
P2P-MPI: X X X X
MPJ Express: X X X X
MPJ/Ibis: X X X X
JMPI: X X X
F-MPJ: X X X X X X
FastMPJ
MPJ implementation developed at the Computer Architecture Group of the University of A Coruña.
Features of FastMPJ include:
High performance intra-node communication
Efficient RDMA transfers over InfiniBand and RoCE
Fully portable, as Java
Scalability up to thousands of cores
Highly productive development and maintenance
Ideal for multicore servers, cluster and cloud computing
FastMPJ Supported Hardware
FastMPJ runs on InfiniBand through IB Verbs, can rely on TCP/IP and has shared memory support through Java threads:
MPJ Applications
MPJ API (mpiJava 1.2)
FastMPJ Library: MPJ Collective Primitives, MPJ Point-to-Point Primitives
Communication devices:
niodev/iodev: Java Sockets, TCP/IP, Ethernet
smdev: Java Threads, Shared Memory
ibvdev: Java Native Interface (JNI), IBV API, InfiniBand
mxdev: Java Native Interface (JNI), MX/Open-MX, Myrinet/Ethernet
psmdev: Java Native Interface (JNI), InfiniPath PSM, InfiniBand
Java GPGPU
Table: Available solutions for GPGPU computing in Java
        | Java bindings | User-friendly
CUDA    | JCUDA, jCuda  | Java-GPU, Rootbeer
OpenCL  | JOCL          | Aparapi
Java GPGPU
Table: Description of the GPU-based testbed
CPU: 1 × Intel(R) Xeon hexacore X5650 @ 2.67GHz
CPU performance: 64.08 GFLOPS DP (10.68 GFLOPS DP per core)
GPU: 1 × NVIDIA Tesla “Fermi” M2050
GPU performance: 515 GFLOPS DP
Memory: 12 GB DDR3 (1333 MHz)
OS: Debian GNU/Linux, kernel 3.2.0-3
CUDA: version 4.2 SDK Toolkit
[Figure: Matrix Multiplication performance (single and double precision). GFLOPS vs. problem size (2048x2048, 4096x4096, 8192x8192). Series: CUDA, jCuda, Aparapi, Java.]
[Figure: Stencil2D performance (single and double precision). Runtime (seconds) vs. problem size (2048x2048, 4096x4096, 8192x8192). Series: Java, Aparapi, jCuda, CUDA.]
[Figure: Stencil2D performance (single and double precision). GFLOPS vs. problem size (2048x2048, 4096x4096, 8192x8192). Series: CUDA, jCuda, Aparapi, Java.]
[Figure: FFT performance (single and double precision) vs. problem size (2048x2048, 4096x4096, 8192x8192). Series: Java, Aparapi, jCuda, CUDA.]
[Figure: FFT performance (single and double precision). GFLOPS vs. problem size (2048x2048, 4096x4096, 8192x8192). Series: CUDA, jCuda, Aparapi, Java.]
NPB-MPJ Characteristics (10,000 SLOC (Source LOC))
Name | Operation               | SLOC | Communication intensiveness | Kernel | Applic.
CG   | Conjugate Gradient      | 1000 | Medium                      | X      |
EP   | Embarrassingly Parallel | 350  | Low                         | X      |
FT   | Fourier Transformation  | 1700 | High                        | X      |
IS   | Integer Sort            | 700  | High                        | X      |
MG   | Multi-Grid              | 2000 | High                        | X      |
Java HPC optimization: NPB-MPJ success case
NPB-MPJ Optimization:
JVM JIT compilation of heavy and frequent methods with runtime information
Structured programming is the best option
Small frequent methods are better.
Mapping elements from multidimensional to one-dimensional arrays (array flattening technique: arr3D[x][y][z] → arr3D[pos3D(lengthx,lengthy,x,y,z)])
NPB-MPJ code refactored, obtaining significant improvements (up to 2800% performance increase)
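The array flattening technique can be sketched as follows. The pos3D signature follows the slide; the index formula itself (x varying fastest) is an assumption, since the slide does not give it, and the class name Flatten3D is hypothetical. The actual NPB-MPJ code may use a different layout.

```java
// Hypothetical example class illustrating array flattening.
class Flatten3D {
    // Assumed mapping, consistent with the signature on the slide:
    // x varies fastest, then y, then z.
    static int pos3D(int lengthx, int lengthy, int x, int y, int z) {
        return x + lengthx * (y + lengthy * z);
    }

    // Copies a rectangular 3D array into a single contiguous 1D array.
    static double[] flatten(double[][][] arr3D) {
        int lx = arr3D.length, ly = arr3D[0].length, lz = arr3D[0][0].length;
        double[] flat = new double[lx * ly * lz];
        for (int x = 0; x < lx; x++)
            for (int y = 0; y < ly; y++)
                for (int z = 0; z < lz; z++)
                    flat[pos3D(lx, ly, x, y, z)] = arr3D[x][y][z];
        return flat;
    }

    public static void main(String[] args) {
        double[][][] arr3D = new double[2][3][4];
        arr3D[1][2][3] = 42.0;
        double[] flat = flatten(arr3D);
        System.out.println(flat[pos3D(2, 3, 1, 2, 3)]);
    }
}
```

A Java double[][][] is an array of arrays of arrays, so each access follows several references and bounds checks. The flat version needs one index computation and one access, and keeps the data contiguous in memory.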
Experimental Results on One Core (relative perf.)
[Figure: NPB Kernels Serial Performance on Amazon EC2 (C Class). MOPS for the CG, FT, IS and MG kernels on cc1.4xlarge and cc2.8xlarge. Series: GNU compiler, Intel compiler, JVM.]
IaaS: Infrastructure-as-a-Service
IaaS
The access to infrastructure supports the execution of
computationally intensive tasks in the cloud. There are many
IaaS providers:
Amazon Web Services (AWS)
Google Compute Engine (beta)
IBM cloud
HP cloud
Rackspace
ProfitBricks
HPC and Big Data in AWS
AWS
Provides a set of instances for HPC
CC1: dual socket, quad-core Nehalem Xeon (93 GFlops)
CG1: dual socket, quad-core Nehalem Xeon with 2 GPUs (1123 GFlops)
CC2: dual socket, octa-core Sandy Bridge Xeon (333 GFlops)
Up to 63 GB RAM
HPL Linpack in AWS (TOP500)
Table: Configuration of our EC2 AWS cluster
CPU: 2 × Xeon quadcore [email protected] (46.88 GFlops DP each)
EC2 Compute Units: 33.5
GPU: 2 × NVIDIA Tesla “Fermi” M2050 (515 GFlops DP each)
Memory: 22 GB DDR3
Storage: 1690 GB
Virtualization: Xen HVM 64-bit platform (PV drivers for I/O)
Number of CG1 nodes: 32
Interconnection network: 10 Gigabit Ethernet
Total EC2 Compute Units: 1072
Total CPU Cores: 256 (3 TFLOPS DP)
Total GPUs: 64 (32.96 TFLOPS DP)
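The totals in the table follow from the per-node figures: 32 nodes with 2 CPUs each at 46.88 GFLOPS give 3000.32 GFLOPS (rounded to 3 TFLOPS), and 64 GPUs at 515 GFLOPS give 32.96 TFLOPS. A small sketch of that arithmetic, with a hypothetical class name and all numbers taken from the table:

```java
// Hypothetical helper, written only to check the totals in the table.
class ClusterPeak {
    // Aggregate peak: number of identical units times the peak of one unit.
    static double aggregate(int units, double perUnit) {
        return units * perUnit;
    }

    public static void main(String[] args) {
        int nodes = 32, cpusPerNode = 2, coresPerCpu = 4, gpusPerNode = 2;
        double cpuGflops = aggregate(nodes * cpusPerNode, 46.88);
        double gpuGflops = aggregate(nodes * gpusPerNode, 515.0);
        System.out.println("CPU cores: " + nodes * cpusPerNode * coresPerCpu);
        System.out.println("EC2 Compute Units: " + aggregate(nodes, 33.5));
        System.out.println("CPU peak (GFLOPS DP): " + cpuGflops);
        System.out.println("GPU peak (GFLOPS DP): " + gpuGflops);
    }
}
```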
HPL Linpack in AWS (TOP500)
HPL Linpack ranks the performance of supercomputers in the TOP500 list.
In AWS, 14 TFlops are available for everybody! (67 USD/h on demand, 21 USD/h spot)
[Figure: HPL Performance on Amazon EC2. GFLOPS vs. number of instances (1 to 32). Series: CPU, CPU+GPU (CUDA).]
[Figure: HPL Scalability on Amazon EC2. Speedup vs. number of instances (1 to 32). Series: CPU, CPU+GPU (CUDA).]
HPL Linpack in AWS (TOP500)
References
Finis Terrae (CESGA): 14.01 TFlops
Our AWS Cluster: 14.23 TFlops
AWS (06/11): 41.82 TFlops (#451 TOP500 06/11)
MareNostrum (BSC): 63.83 TFlops (#299 TOP500)
MinoTauro (BSC): 103.2 TFlops (#114 TOP500)
Image processing with Monte Carlo (MC-GPU)
Time: originally 187 minutes, in AWS with HPC 6 seconds!
Cost: processing 500 radiographs in AWS: approx. 70 USD, 0.14 USD per unit
[Figure: MC-GPU Execution Time on Amazon EC2. Runtime (s). Series: CPU, CPU+GPU (CUDA).]
[Figure: MC-GPU Scalability on Amazon EC2. Speedup. Series: CPU, CPU+GPU (CUDA).]
AWS HPC and Big Data Remarks
HPC and Big Data in AWS
High computational power (new Sandy Bridge processors and GPUs)
High performance communications over 10 Gigabit Ethernet
Efficient access to network file systems (e.g., NFS, OrangeFS)
No waiting time to access resources
Easy deployment of applications and infrastructure, with ready-to-run applications in AMIs
Java HPC in an InfiniBand cluster
[Figure: Point-to-point Performance on DAS-4 (IB-QDR). Latency (µs) for message sizes from 1 to 1K and bandwidth (Gbps) for message sizes from 1K to 16M. Series: MVAPICH2, Open MPI, FastMPJ (ibvdev).]
[Figure: NPB CG Kernel Performance on DAS-4 (IB-QDR). MOPS vs. number of cores (1 to 512). Series: MVAPICH2, Open MPI, FastMPJ (ibvdev).]
[Figure: NPB FT Kernel Performance on DAS-4 (IB-QDR). MOPS vs. number of cores (1 to 512). Series: MVAPICH2, Open MPI, FastMPJ (ibvdev).]
[Figure: NPB IS Kernel Performance on DAS-4 (IB-QDR). MOPS vs. number of cores (1 to 512). Series: MVAPICH2, Open MPI, FastMPJ (ibvdev).]
[Figure: NPB MG Kernel Performance on DAS-4 (IB-QDR). MOPS vs. number of cores (1 to 512). Series: MVAPICH2, Open MPI, FastMPJ (ibvdev).]
NPB Performance Evaluation in AWS
[Figure: Point-to-point Performance on 10 GbE (cc1.4xlarge). Latency (µs) and bandwidth (Gbps) vs. message size (bytes). Series: MPICH2, OpenMPI, FastMPJ.]
[Figure: Point-to-point Performance on 10 GbE (cc2.8xlarge). Latency (µs) and bandwidth (Gbps) vs. message size (bytes). Series: MPICH2, OpenMPI, FastMPJ.]
[Figure: Point-to-point Performance on SHMEM (cc1.4xlarge). Latency (µs) and bandwidth (Gbps) vs. message size (bytes). Series: MPICH2, OpenMPI, FastMPJ.]
[Figure: Point-to-point Performance on SHMEM (cc2.8xlarge). Latency (µs) and bandwidth (Gbps) vs. message size (bytes). Series: MPICH2, OpenMPI, FastMPJ.]
[Figure: CG C Class Performance (cc1.4xlarge). MOPS vs. number of cores (1 to 512). Series: MPICH2, OpenMPI, FastMPJ.]
[Figure: FT C Class Performance (cc1.4xlarge). MOPS vs. number of cores (1 to 512). Series: MPICH2, OpenMPI, FastMPJ.]
[Figure: IS C Class Performance (cc1.4xlarge). MOPS vs. number of cores (1 to 512). Series: MPICH2, OpenMPI, FastMPJ.]
[Figure: MG C Class Performance (cc1.4xlarge). MOPS vs. number of cores (1 to 512). Series: MPICH2, OpenMPI, FastMPJ.]
[Figure: CG C Class Performance (cc2.8xlarge). MOPS vs. number of cores (1 to 512). Series: MPICH2, OpenMPI, FastMPJ.]
[Figure: FT C Class Performance (cc2.8xlarge). MOPS vs. number of cores (1 to 512). Series: MPICH2, OpenMPI, FastMPJ.]
[Figure: IS C Class Performance (cc2.8xlarge). MOPS vs. number of cores (1 to 512). Series: MPICH2, OpenMPI, FastMPJ.]
[Figure: MG C Class Performance (cc2.8xlarge). MOPS vs. number of cores (1 to 512). Series: MPICH2, OpenMPI, FastMPJ.]
HPC in AWS
Efficient Cloud Computing support
[Figure: CG Kernel OpenMPI Performance on Amazon EC2. MOPS, x axis from 1 to 128. Series: Default, Default (Fillup), Default (Tuned), 16 Instances (Fillup), 16 Instances (Tuned).]
[Figure: CG Kernel OpenMPI Performance on Amazon EC2. MOPS, x axis from 1 to 512. Series: Default, 16 Instances, 32 Instances, 64 Instances.]