Procedia Computer Science 70 ( 2015 ) 769 – 777
1877-0509 © 2015 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
Peer-review under responsibility of the Organizing Committee of ICECCS 2015 doi: 10.1016/j.procs.2015.10.116
Design and Development of Grid Enabled, GPU Accelerated Java Application (Protein Sequence Study) for Grid Performance Analysis
Subrata Sinha a,*, Smriti Priya Medhi a, G.C. Hazarika b
a Centre for Bioinformatics Studies, Dibrugarh University, Dibrugarh-786004, India
b Department of Mathematics, Dibrugarh University, Dibrugarh-786004, India
Abstract
Recent advances in structural biology have generated huge volumes of data. Analyzing these data is vital to uncovering the hidden truths of life, but such analysis is compute intensive and requires huge computational power, which has led to extensive use of high performance computing (HPC) models (multi-core computing, GPU computing, CPU-GPU hybrid computing, cluster and grid). Grid is one of the widely used HPC models in computational biology, and to execute a compute-intensive task on a grid the application must also be grid enabled. The performance of a grid depends on the appropriate selection of a load balancing strategy (at server and node level) and on maximum utilization of the computational resources of the nodes (multiple CPU cores and GPU execution units) by the grid-enabled applications. In this paper an attempt has been made to establish an experimental grid, to develop a grid-enabled application, and to operate the grid at its highest possible performance with that application.
© 2014 The Authors. Published by Elsevier B.V.
Peer-review under responsibility of organizing committee of the International Conference on Eco-friendly Computing and Communication Systems (ICECCS 2015).
Keywords: Grid Computing; Load Balancing; Node Optimization; CPU-GPU Hybrid Computing; Multi Core CPU; JPPF Grid
* Corresponding author. Tel.: +91-848-646-8313 E-mail address: [email protected]
1. Introduction
The rapid growth of scientific applications has led to the development of a new generation of distributed systems such as grid computing [1]. Grids are used extensively in computational biology and bioinformatics [2], mainly in molecular dynamics, comparative genomic analysis and annotation, mass spectrometry data analysis, and protein stability studies [3-6]. A grid architecture consists of the grid fabric, the grid middleware and grid-enabled applications. Grid-enabled applications are software applications that use the grid infrastructure through the grid middleware. Many grid middleware packages have been used in various grid projects, viz. UNICORE (Java), Globus (C and Java), Legion (C++), Gridbus (C, Java, C# and Perl) and JPPF (Java) [7-13]. Among these, JPPF was used as the middleware for this study. Its performance was tested for every load balancing algorithm (Static, Adaptive, Deterministic, Reinforcement Learning and Node threads) against task granularities from 0.1 MB to 51.6 MB, using a grid-enabled protein sequence analysis program.
Apart from selecting the best load balancing algorithm at the server, performance at the nodes can be improved by enabling grid-enabled applications to use both the CPU and the GPU to process a task. The rapid growth in the programmability and capability of GPUs (Graphics Processing Units) has accelerated their use in general-purpose computing [13], and GPUs have proved to be a promising solution for compute-intensive tasks in various domains of computational biology [14-24]. Hybrid computing (CPU+GPU) is a promising future for computational biology [25-27], and in the field of comparative genomics a grid with the hybrid computing model has been used [28]. Some grids use GPU processing to improve performance [29]. The hybrid architecture dominates the traditional CPU-only architecture through massive computational power with reduced processing time [30], lower energy consumption and lower economic cost [31]. Making GPUs programmable like CPUs requires specific frameworks such as CUDA [32] and OpenCL [33]. Both CUDA and OpenCL are low-level libraries that require time-consuming programming; APARAPI simplifies GPU programming [34].
In this study we constructed a JPPF grid and evaluated the performance gain obtained from grid server and node optimization (attained by selecting an appropriate load balancing strategy) and from sharing the computational load between the CPU and GPU of each node. To perform the experiment we established an experimental grid, with the Java Parallel Processing Framework (JPPF) as middleware, in the Computer Laboratory of the Centre for Bioinformatics Studies.
2. Materials and Method
2.1. Material
2.1.1 Hardware Components:
A) Hardware of Grid Server - i) GPU: Intel HD 4000 D, Ivy Bridge GT2, 16 Execution Units at 350 MHz - 1350 MHz (Turbo), memory size 1556 MB; ii) CPU (4 cores): Intel(R) Core™ i7-337 3.4 GHz, 8MB RAM.
Hardware of Grid Nodes - i) GPU: Intel HD, Ivy Bridge GT1, 6 Execution Units at 350 - 1100 MHz (Boost) clock frequency; ii) CPU (2 cores): Intel(R) Core 2 Duo 3.4 GHz, 2GB RAM.
B) Interconnection Network: Gigabit Ethernet (1000 Mbps), D-Link 24-port 10/100/1000 switch.
2.1.2 Software Components: OS: Windows 7; Grid Middleware: JPPF 5.0; GPU Library: Aparapi; Standard Library: JDK 1.7, Apache Ant version 1.9.4; Biological Library: BioJava 3.0.5.
2.1.3 Data: 51.6 MB of protein sequence data.
2.2 Method
Driver configuration: The parameters for each of the algorithms were set as follows:
Static algorithm (strategy.manual_profile.size = 1);
Adaptive algorithm (minSamplesToAnalyse = 100, minSamplesToCheckConvergence = 50, maxDeviation = 0.2, maxGuessToStable = 50, sizeRatioDeviation = 1.5, decreaseRatio = 0.2);
Deterministic algorithm (initialSize = 5, performanceCacheSize = 1000, proportionalityFactor = 1);
Reinforcement Learning algorithm (performanceCacheSize = 3000, performanceVariationThreshold = 0.001);
Node threads algorithm (multiplicator = 1).
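The parameter values above can be collected in one place for reproducibility. The following is a minimal sketch, not the authors' configuration file: it reuses the key pattern of the single full key quoted in the text (strategy.manual_profile.size), and every profile name other than manual_profile is a placeholder chosen for illustration, not a confirmed JPPF name.

```java
import java.util.Properties;
import java.util.TreeSet;

final class DriverLoadBalancingSettings {

    // Key pattern copied from the one full key quoted in the text:
    // strategy.manual_profile.size
    static void put(Properties props, String profile, String parameter, String value) {
        props.setProperty("strategy." + profile + "." + parameter, value);
    }

    // Values are the ones listed in Section 2.2.
    static Properties fromPaper() {
        Properties p = new Properties();
        put(p, "manual_profile", "size", "1");
        // ASSUMPTION: the profile names below are illustrative placeholders.
        put(p, "adaptive_profile", "minSamplesToAnalyse", "100");
        put(p, "adaptive_profile", "minSamplesToCheckConvergence", "50");
        put(p, "adaptive_profile", "maxDeviation", "0.2");
        put(p, "adaptive_profile", "maxGuessToStable", "50");
        put(p, "adaptive_profile", "sizeRatioDeviation", "1.5");
        put(p, "adaptive_profile", "decreaseRatio", "0.2");
        put(p, "deterministic_profile", "initialSize", "5");
        put(p, "deterministic_profile", "performanceCacheSize", "1000");
        put(p, "deterministic_profile", "proportionalityFactor", "1");
        put(p, "rl_profile", "performanceCacheSize", "3000");
        put(p, "rl_profile", "performanceVariationThreshold", "0.001");
        put(p, "nodethreads_profile", "multiplicator", "1");
        return p;
    }

    public static void main(String[] args) {
        Properties p = fromPaper();
        // Print the settings in sorted key order.
        for (String key : new TreeSet<String>(p.stringPropertyNames())) {
            System.out.println(key + " = " + p.getProperty(key));
        }
    }
}
```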
Node configuration: A node's configuration file has a property, processing.threads, that specifies the size of the execution thread pool: processing.threads > n, where n is the number of CPU cores in the node.
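As a small illustration of tying the thread pool size to the hardware, the sketch below derives a processing.threads line from the core count. It assumes one thread per core, which is the setting Section 3.2 reports as fastest on the two-core nodes; the class and method names are hypothetical, and only the property name comes from the text.

```java
final class NodeThreadSetting {

    // One execution thread per CPU core (assumption based on Section 3.2).
    static int processingThreads(int cpuCores) {
        if (cpuCores < 1) {
            throw new IllegalArgumentException("cpuCores must be at least 1");
        }
        return cpuCores;
    }

    // Formats the line as it would appear in a node configuration file.
    static String configLine(int cpuCores) {
        return "processing.threads = " + processingThreads(cpuCores);
    }

    public static void main(String[] args) {
        int cores = Runtime.getRuntime().availableProcessors();
        System.out.println(configLine(cores));
    }
}
```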
Development of Grid Enabled Application: A gridified bioinformatics application was developed that analysed 51.6 MB of protein sequence data to compute the instability index, net charge, hydrophobicity and aliphatic index of the retrieved protein sequences. The application was written to evaluate the performance of the established grid.
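To make one of these per-sequence computations concrete, here is a minimal plain-Java sketch of the aliphatic index using the standard definition (mole percent of Ala, plus 2.9 times that of Val, plus 3.9 times that of Ile and Leu combined). It is an illustration only and is not taken from the authors' application, which relies on BioJava; the class and method names are hypothetical.

```java
final class AliphaticIndexSketch {

    // Aliphatic index = X(Ala) + 2.9 * X(Val) + 3.9 * (X(Ile) + X(Leu)),
    // where X is the mole percent of each residue in the sequence.
    static double aliphaticIndex(String sequence) {
        if (sequence == null || sequence.isEmpty()) {
            throw new IllegalArgumentException("sequence must not be empty");
        }
        int ala = 0, val = 0, ile = 0, leu = 0;
        for (char residue : sequence.toUpperCase().toCharArray()) {
            switch (residue) {
                case 'A': ala++; break;
                case 'V': val++; break;
                case 'I': ile++; break;
                case 'L': leu++; break;
                default: break; // other residues only count toward the length
            }
        }
        double length = sequence.length();
        double xAla = 100.0 * ala / length;
        double xVal = 100.0 * val / length;
        double xIle = 100.0 * ile / length;
        double xLeu = 100.0 * leu / length;
        return xAla + 2.9 * xVal + 3.9 * (xIle + xLeu);
    }

    public static void main(String[] args) {
        // Toy sequence with 25% each of Ala, Val, Ile and Leu.
        System.out.println(aliphaticIndex("AVIL"));
    }
}
```

Because each sequence is scored independently, a computation of this kind splits naturally into grid tasks of any granularity.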
Method for executing task on GPU: For an APARAPI program to execute, the directories containing the OpenCL native library (OpenCL.dll) and the APARAPI native library (aparapi_x86.dll) must be added to the PATH environment variable.
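A quick pre-flight check can confirm that both native libraries are reachable before a node starts. The sketch below scans the directories on PATH for the two file names given above; the class and method names are hypothetical, and it only checks that the files exist, not that they load.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

final class NativeLibraryCheck {

    // Returns the library file names that are not found in any PATH directory.
    static List<String> missingFromPath(String pathValue, List<String> libraryNames) {
        List<String> missing = new ArrayList<String>(libraryNames);
        if (pathValue == null) {
            return missing;
        }
        for (String dir : pathValue.split(Pattern.quote(File.pathSeparator))) {
            if (dir.isEmpty()) {
                continue;
            }
            for (String name : libraryNames) {
                if (new File(dir, name).isFile()) {
                    missing.remove(name);
                }
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        List<String> required = Arrays.asList("OpenCL.dll", "aparapi_x86.dll");
        List<String> missing = missingFromPath(System.getenv("PATH"), required);
        System.out.println(missing.isEmpty()
                ? "All native libraries found on PATH"
                : "Missing from PATH: " + missing);
    }
}
```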
CPU-GPU Load Balancing: The load is shared in proportion to the capability of the CPU and the GPU. In our experiment 80% of the load was given to the CPU and 20% to the GPU.
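The proportional split can be sketched as a simple partition of the work items. This is an illustrative sketch under the assumption that the load is divided by item count; the text does not say how the authors' application measures load, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class CpuGpuSplit {

    // Number of items assigned to the CPU for a given share in [0, 1].
    static int cpuCount(int total, double cpuShare) {
        if (total < 0 || cpuShare < 0.0 || cpuShare > 1.0) {
            throw new IllegalArgumentException("invalid total or share");
        }
        return (int) Math.round(total * cpuShare);
    }

    // Returns two lists: index 0 holds the CPU portion, index 1 the GPU portion.
    static <T> List<List<T>> split(List<T> items, double cpuShare) {
        int cut = cpuCount(items.size(), cpuShare);
        List<List<T>> parts = new ArrayList<List<T>>();
        parts.add(new ArrayList<T>(items.subList(0, cut)));
        parts.add(new ArrayList<T>(items.subList(cut, items.size())));
        return parts;
    }

    public static void main(String[] args) {
        List<Integer> sequenceIds = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9, 10);
        List<List<Integer>> parts = split(sequenceIds, 0.8); // 80% CPU, 20% GPU
        System.out.println("CPU: " + parts.get(0));
        System.out.println("GPU: " + parts.get(1));
    }
}
```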
Fig 1 - Simplified Architecture of GPU Enabled Grid Enabled Application Class Hierarchy and underlying
hardware utilization
The grid enabled application divides the task between the cores of GPU and CPU by adopting the above class
hierarchy. The client application uses standard JPPF classes and APARAPI.
3. Results
3.1 Results of server optimization and performance analysis
Table 1. Server load balancing performance for various algorithms: time of execution (in milliseconds)

Sl. No.  Granularity of   Static     Adaptive   Deterministic  Reinforcement       Node threads
         task (in MB)     algorithm  algorithm  algorithm      Learning algorithm  algorithm
1        0.1              1532       1893       2193           2043                2043
2        0.2              1975       2100       2500           2300                2300
3        0.4              2476       2248       2748           2498                2498
4        0.8              2688       3432       3732           3582                3582
5        1.6              3704       4397       4997           4697                4697
6        3.2              3988       4762       5062           4912                4912
7        6.4              4489       5927       6727           6327                6327
8        12.8             4997       6093       6393           6243                6243
9        25.6             5499       7358       8258           7808                7808
10       51.6             6009       8123       9123           8623                8623
Fig 2. Comparison of Load Balancing Algorithms on Proposed Grid for various level of granularity of task
Figure 2 shows that static load sharing is more efficient than the other algorithms: in Table 1 it gives the lowest execution time at every granularity except 0.4 MB, where the adaptive algorithm is faster. All the other algorithms require communication between the nodes and the server. The total execution time of the grid-enabled application also increases with the granularity of the task, because of the greater network communication between the server and the nodes. We therefore chose the static load balancing algorithm for the server in our grid, since all the nodes used in the grid have the same configuration.
3.2 Results of node performance by node threads
Table 2. Node load balancing performance for different numbers of processing threads: execution time of grid nodes (in milliseconds) for T processing threads

Sl.  Granularity   Node 1            Node 2            Node 3            Node 4
No.  (in MB)       T=1   T=2   T=4   T=1   T=2   T=4   T=1   T=2   T=4   T=1   T=2   T=4
1    0.1           760   381   395   765   384   399   766   384   399   764   383   398
2    0.2           981   492   506   987   495   510   988   495   510   985   494   509
3    0.4           1232  617   631   1237  620   635   1238  620   635   1236  619   634
4    0.8           1337  670   684   1344  673   688   1345  674   689   1342  672   687
5    1.6           1844  923   937   1852  927   942   1853  928   943   1850  926   941
6    3.2           1988  995   1009  1993  998   1013  1994  998   1013  1992  997   1012
7    6.4           2240  1121  1135  2243  1123  1138  2244  1123  1138  2243  1123  1138
8    12.8          2493  1248  1262  2498  1250  1265  2499  1251  1266  2496  1249  1264
9    25.6          2745  1374  1388  2748  1375  1390  2749  1376  1391  2748  1375  1390
10   51.6          2998  1500  1514  3004  1503  1518  3005  1504  1519  3002  1502  1517
[Fig 3(a)-3(d): figure panels a, b, c and d not reproduced]
We set the processing.threads attribute to 1, 2 and 4 in the node configuration file of each node, for each task granularity. For each granularity and each thread value, the grid-enabled application was executed three times and the mean execution time of the runs was taken.
Table 2 and Fig 3(a)-3(d) show that setting the number of processing threads to 2 gives the best performance, since every node in our experimental grid has a 2-core CPU. A thread count equal to the number of cores is therefore the best setting for these nodes.
3.3 Node performance analysis of CPU+GPU hybrid computing model
Table 3. Processing time readings and performance improvement when load is shared/not shared between CPUs and GPUs

Sl. No.  Granularity of  CPU-GPU, load sharing   CPU only, 100% load on  Difference in      Improvement
         task (in MB)    80% CPU / 20% GPU (ms)  CPU of each node (ms)   execution time     (in %)
                                                                         (in ms)
1        0.1             281                     381                     100                26.25
2        0.2             292                     492                     200                40.65
3        0.4             242                     617                     375                60.78
4        0.8             265                     670                     405                60.45
5        1.6             363                     923                     560                60.67
6        3.2             395                     995                     600                60.30
7        6.4             440                     1121                    681                60.75
8        12.8            498                     1248                    750                60.10
9        25.6            544                     1374                    830                60.41
10       51.6            590                     1500                    910                60.67
From the above readings and calculations it is evident that when the CPU and GPU process data together, the performance of each node improves. The data are represented graphically in the figure below.
Figure 4. (a) Execution time and (b) improvement of the CPU-GPU hybrid model
A pattern was observed: beyond a certain task granularity the improvement remains within a range of ~60%, because performance can be enhanced only up to the processing capability of the GPU. With a higher number of execution units in the GPU, the performance is expected to scale to a much higher level.
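The improvement column of Table 3 is the time saved by the hybrid run expressed as a percentage of the CPU-only time. A minimal sketch of that arithmetic, using two rows of the table as input (the class and method names are hypothetical):

```java
import java.util.Locale;

final class HybridImprovement {

    // Improvement (%) = (CPU-only time - hybrid time) / CPU-only time * 100
    static double improvementPercent(double cpuOnlyMs, double hybridMs) {
        if (cpuOnlyMs <= 0) {
            throw new IllegalArgumentException("cpuOnlyMs must be positive");
        }
        return (cpuOnlyMs - hybridMs) / cpuOnlyMs * 100.0;
    }

    public static void main(String[] args) {
        // Rows 1 and 10 of Table 3: CPU-only time, then hybrid time.
        System.out.println(String.format(Locale.ROOT, "%.2f", improvementPercent(381, 281)));  // 26.25
        System.out.println(String.format(Locale.ROOT, "%.2f", improvementPercent(1500, 590))); // 60.67
    }
}
```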
3.4 Results of grid performance analysis
The processing times of the grid-enabled application, T(G), and of the sequential application, T(S), were recorded in milliseconds for various task granularities. The results show that the performance improvement [T(S)/T(G)] becomes stable after reaching a certain level. The results are shown in Table 4.
Table 4. Comparative processing time performance calculations between T(S) & T(G)
Figure 5. Execution Time of Grid Enabled and Standalone Application
The grid-enabled application runs up to 4.8 times faster than the sequential version of the same application (at 0.4 MB), and the improvement stabilises at 3.4 times for task granularities of 0.8 MB and above.
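The percentage reported in Table 4 follows directly from the two recorded times. Written out, with PE used here as shorthand for the table's "% PE" column, and checked against the first and last rows of the table:

```latex
\[
\mathrm{PE} = \frac{T(S)}{T(G)} \times 100
\]
\[
\text{0.1 MB: } \frac{3064}{1532} \times 100 = 200,
\qquad
\text{51.6 MB: } \frac{20430.6}{6009} \times 100 = 340
\]
```

A PE of 340 therefore corresponds to the grid-enabled run being 3.4 times faster than the sequential run.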
4. Conclusion
The performance of a grid depends on its configuration, in particular on the appropriate selection of a load balancing strategy. A grid-enabled application that uses the underlying hardware resources of the nodes (CPU and GPU cores) efficiently makes compute-intensive tasks considerably faster. The study shows how to achieve maximum performance from a grid-enabled application through appropriate server and node optimization and use of the hybrid computing model. The experiment was conducted on a mini experimental grid of 4 nodes, and the grid-enabled application performed 3.4 times better than the non-grid-enabled application carrying out the same analysis. In addition, each node tested with the CPU-GPU hybrid model showed a further improvement of ~60%. This approach is a promising model for almost all compute-intensive tasks in bioinformatics, such as molecular modeling, docking and molecular dynamics studies for large numbers of protein structures.
Sl. No.  Granularity of task (in MB)  T(S) in ms  T(G) in ms  % PE = T(S)/T(G) * 100
1        0.1                          3064        1532        200
2        0.2                          6912.5      1975        350
3        0.4                          11884.8     2476        480
4        0.8                          9139.2      2688        340
5        1.6                          12593.6     3704        340
6        3.2                          13559.2     3988        340
7        6.4                          15262.6     4489        340
8        12.8                         16989.8     4997        340
9        25.6                         18696.6     5499        340
10       51.6                         20430.6     6009        340
Acknowledgements
We acknowledge Prof. Subrata Chakraborty, Director i/c, Centre for Bioinformatics Studies, for providing the infrastructure to construct the grid and perform the experiments. We also thank Dr. K Narain, Dy. Director of RMRC-ICMR, Dibrugarh, and Dr. Ashwani Sharma, Indian Science and Technology Foundation, for their constant support and help.
References
1. Foster I , Yong Z , Raicu I , Shiyong L . Cloud and Grid Computing 360-degree compared. In Proc. Grid Computing Environments Workshop, 2008. GCE '08, 12-16 Nov. 2008, p. 1 - 10
2. Payne JL, Sinnott Armstrong NA, Moore JH. Exploiting graphics processing units for computational biology and bioinformatics. Interdiscip Sci. 2010; 2(3):213-20.
3. Hunter A, Australia WA, Schibeci D , Hiew HL, Bellgard M; GRID-enabled bioinformatics applications for comparative genomic analysis at the CBBC. In Proc. IEEE International Conference on Cluster Computing;20-23 Sept. 2004
4. Thrasher A, Musgrave Z, Kachmarck B, Thain D, Emrich S. Scaling up genome annotation using MAKER and work queue; Int J Bioinform Res Appl 2014;10(4-5):447-60
5. Carapito C, Burel A, Guterl P, Walter A, Varrier F, Bertile F, Van DA. MSDA, a proteomics software suite for in-depth Mass Spectrometry Data Analysis using grid computing; Proteomics. 2014; 14(9):1014-9
6. Wu CC, Lai LF, Gromiha MM, Huang LT. High throughput computing to improve efficiency of predicting protein stability change upon mutation; Int J Data Min Bioinform. 2014; 10(2):206-24.
7. Menday R, Wieder P. GRIP: The Evolution of UNICORE towards a Service-Oriented Grid. Proceedings of the Cracow Grid Workshop 2003, Cracow, Poland, p. 142-150
8. Venugopal S, Buyya R, Winton L. A Grid Service Broker for Scheduling Distributed Data-Oriented Applications on Global Grids, Technical Report, In Proc. 2nd workshop on Middleware for grid computing, Toronto, Ontario, Canada 2004; p. 75-80
9. Hey T, Trefethen AE. The UK e-Science Core Programme and the Grid, Future Generation Computer Systems 2002; 18(8):1017-1031.
10. Johnston W, Gannon D, Nitzberg B. Grids as production computing environments: The engineering aspects of NASA’s Information Power Grid. In Proc. Eighth IEEE International Symposium on High Performance Distributed Computing, Redondo Beach, CA, USA, 1999; p. 197-204
11. Natrajan A, Crowley M, Wilkins Diehr N, Humphrey M, Fox AD, Grimshaw A, Brooks CL. Studying Protein Folding on the Grid: Experiences using CHARMM on NPACI Resources under Legion, In Proc. the 10th International Symposium on High Performance Distributed Computing, August 7-9, 2001; p. 14–21.
12. Buyya R, Date S, Mizuno Matsumoto Y, Venugopal S, Abramson D. Composition of Distributed Brain Activity Analysis and its On-Demand Deployment on Global Grids, In Proc. 10th International Conference on High Performance Computing 2003,Hyderabad
13. Huang Q, Huang Z, Werstein P, Purvis M. GPU as a General Purpose Computing Resource; In Proc. Ninth International Conference on Parallel and Distributed Computing, Applications and Technologies, 2008; p. 151 - 158
14. Guerrero GD, Imbernon B, Perez Sanchez H, Sanz F, Garcia JM, Cecilia JM. A performance/ cost evaluation for a GPU-based drug discovery application on volunteer computing; Biomed Res Int. 2014; 2014:474219
15. Goga N, Marrink S, Cioromela R, Moldoveanu F. GPU-SD and DPD parallelization for Gromacs tools for molecular dynamics simulations. In Proc IEEE 12th International Conference on Bioinformatics & Bioengineering (BIBE) 2012. p. 251-254
16. Demchuk O, Karpov P, Blume Y. Docking Small Ligands to Molecule of the Plant FtsZ Protein: Application of the CUDA Technology for Faster Computations. Cytology and Genetics 2012; 46(3):172-179.
17. Shimoda T, Suzuki S, Ohue M, Ishida T, Akiyama Y. Protein-protein docking on hardware accelerators: comparison of GPU and MIC architectures; BMC Syst Biol. 2015; 9(Suppl) 1:S6.
18. Wu J, Hong B, Takeda T, Guo JT. High performance transcription factor-DNA docking with GPU computing; Proteome Sci. 2012 ;10 (Suppl) 1:S17
19. Manavski Svetlin A, Giorgio V. CUDA compatible GPU cards as efficient hardware accelerators for Smith-Waterman sequence alignment; BMC Bioinformatics 2008; 9: S10.
20. Zhao K, Chu X. G-BLASTN: accelerating nucleotide alignment by graphics processors; Bioinformatics. 2014; 30(10):1384-91
21. Pang B, Zhao N, Becchi M, Korkin D, Shyu CR. Accelerating large-scale protein structure alignments with graphics processing units; BMC Res Notes 2012; 22(5):116.
22. Yang G, Jiang W, Yang Q, Yu W. PBOOST: a GPU-based tool for parallel permutation tests in genome-wide association studies; Bioinformatics 2015; 31(9):1460-2.
23. Putz B, Kam Thong T, Karbalai N, Altmann A, Muller Myhsok B. Cost-effective GPU-grid for genome-wide epistasis calculations; Methods Inf Med 2013; 52(1):91-5.
24. Papadopoulos A, Kirmitzoglou I, Promponas VJ, Theocharides T. GPU technology as a platform for accelerating local complexity analysis of protein sequences. In Proc IEEE Eng Med Biol Soc. 2013;2013:2684-7.
25. Wang Y, Du H, Xia M, Ren L, Xu M, Xie T. A Hybrid CPU-GPU Accelerated Framework for Fast Mapping of High-Resolution Human Brain Connectome. PLoS ONE 2008; 8(5): e62789.
26. Nikolay P, Karpov P, Yaroslav B. Hybrid CPU-GPU calculations – a promising future for computational biology. In Proc. Third International Conference on High Performance Computing HPC-UA 2013,Ukraine, Kyiv; p. 330–335
27. Liao T, Zhang Y, Kekenes Huskey PM, Cheng Y, Michailova A, McCulloch AD, Holst M , McCammon JA; Multi-core CPU or GPU-accelerated Multiscale Modeling for Biomolecular Complexes; Mol Based Math Biol. 2013; p. 1-27
28. Singh A, Chen C, Liu W, Mitchell W, Schmidt B. A hybrid computational grid architecture for comparative genomics; IEEE Trans Inf Technol Biomed. 2008; 12(2) : 218-25
29. Garza J. Grid-based algorithm to search critical points, in the electron density, accelerated by graphics processing units; J Comput Chem. 2014 Dec 5;35(31):2272-8.
30. Agulleiro JI , Vazquez F, Garzon EM, Fernandez JJ. Hybrid computing: CPU+GPU co-processing and its application to tomographic reconstruction; Ultramicroscopy 2012; 115:109-14.
31. Tyng Yeu L, Hung Fu L, Jun Yao C. Proceedings of IEEE 26th International on Parallel and Distributed Processing Symposium Workshops & PhD Forum (IPDPSW); 21-25 May 2012, Shanghai; p. 2369 – 2377
32. Ghorpade J, Parande J, Kulkarni M, Bawaskar A. GPGPU Processing In CUDA Architecture; Advanced Computing: An International Journal 2012 ; 3(1): 105-120
33. Lee H, Munshi A. The OpenCL Specification, Version 2.0, Document Revision 26 (PDF). Khronos Group. Available from https://www.khronos.org/registry/cl/specs/opencl-2.0.pdf Updated on 17 October 2014. Retrieved 16 April 2015.
34. Gary Frost, Aparapi - an open source API for expressing data parallel workloads in Java™. Available from http://developer.amd.com/tools-and-sdks/opencl-zone/aparapi/