Design and Development of Grid Enabled, G2PU Accelerated Java Application (Protein Sequence Study) for Grid Performance Analysis

(1)

Procedia Computer Science 70 ( 2015 ) 769 – 777

Peer-review under responsibility of the Organizing Committee of ICECCS 2015 doi: 10.1016/j.procs.2015.10.116

ScienceDirect

Design and Development of Grid Enabled, G2PU Accelerated Java

Application (Protein Sequence Study) for Grid Performance

Analysis

Subrata Sinha

a*

, Smriti Priya Medhi

a

, G.C Hazarika

b

a

Centre for Bioinformatics Studies, Dibrugarh University,Dibrugarh-786004,India

b

Department of Mathematics, Dibrugarh University, Dibrugarh-786004,India

Abstract

Recent advancement in the field of structural biology has generated huge volume of data and analyzing such data is vital to know the hidden truths of life but such analysis is compute intensive in nature and requires huge computational power resulting in extensive use of high performance computing (Multi Core Computing, G2PU Computing, CPU-GPU Hybrid computing, Cluster, Grid) models. Grid is one of the widely used HPC model which is commonly used in computational biology and to execute a compute intensive tasks on grid , the applications must also be grid enabled. Performance of grid depends on appropriate selection of load balancing strategy (at server and node level) and maximum utilization of computational resources of nodes (multiple cores of CPU and execution units of GPU) by the grid enabled applications. In this paper an attempt has been made to establish an experimental grid, develop a grid enabled application and to operate the grid with its highest possible performance by the grid enabled applications.

Peer-review under responsibility of organizing committee of the International Conference on Eco-friendly Computing and Communication Systems (ICECCS 2015).

Keywords:

Grid Computing; Load Balancing; Node Optimization; CPU-G2PU Hybrid Computing; Multi Core CPU; JPPF Grid

* Corresponding author. Tel.: +91-848-646-8313 E-mail address: [email protected]

(2)

1. Introduction

The rapid magnification of scientific applications has led to the development of incipient generation of distributed

systems such as Grid Computing

1

. Grid is extensively used in computational biology and bioinformatics

2

. mainly in

molecular dynamics, comparative genomic analysis and annotation, Mass Spectrometry Data Analysis, Protein

Stability studies

3-6

. Grid Architecture consists of Grid fabric, grid middleware and grid enabled application.

Grid-enabled applications are specific software applications that can utilize grid infrastructure by the use of grid

middleware. There are many grid middleware viz. UNICORE (Java), Globus (C and Java), Legion (C++), Gridbus

(C, Java, C# and Perl), JPPF (Java) used in various grid projects

7-13

. Among all middleware JPPF Grid middleware

has been used for proposed study and its performance was tested for every load balancing algorithm(Static ,

Adaptive, Deterministic, Reinforcement Learning and Node threads Algorithm) against various granularity of

tasks(.1 MB to 51.6 MB) and performance of grid was tested by grid enabled protein sequences analysis program.

Apart from selecting best load balancing algorithm at server, performance enhancement at nodes can be made

possible by enabling grid enabled applications to utilize CPU & GPU for processing a task. Rapid growth of

programmability and capability of GPU (Graphics Processing Units) has accelerated its use in general purpose

computing

13

and proved to be a promising solution to compute intensive tasks in various domains of computational

biology

14-24

and hybrid computing (CPU+GPU) is a promising future for computational biology

25-27

and in the field

of comparative genomics

28

, grid with the hybrid computing model is used. Some grids utilize GPU processing to

improve the performance

29

. Massive computational power with reduce the processing time

30

, lower energy

consumption and less economic cost is the point where this hybrid architecture dominates over the traditional one

using only CPUs

31

. To make GPUs programmable as CPUs specific frameworks are required like CUDA

32

,

OpenCL

33

, both CUDA and OpenCL are low-level libraries and require time consuming programming however

GPU Programming made simplified by APARAPI

34

.

In this study we constructed a JPPF grid to analyze and evaluated the performance gain provided by Grid Server

and node performance optimization (attained by selection of appropriate load balancing strategy) and sharing

computational load among CPU and GPU of nodes. To perform the experiment we have established an experimental

grid having Java Parallel Programming Framework as middleware in the Computer Laboratory of Centre for

Bioinformatics Studies.

2. Materials and Method

2.1. Material

2.1.1 Hardware Components: A) Hardware of Grid Server - i) GPU: Intel HD 4000 D, IvyBridge GT2 16

Execution Units with 350 MHz – 1350(Turbo) MHz. Memory Size: 1556 MB ii) CPU (4 Cores): Intel (R) Core ™

i7-337 3.4 GHz, 8MB RAM. Hardware of Grid Nodes - i) GPU: Intel HD Ivy Bridge GT1 with 6 Execution Units

of 6 @ 350 - 1100 (Boost) MHz clock frequency ii) CPU (2 Cores): Intel (R) Core 2 Duo 3.4GHz, 2GB RAM

B) Interconnection Network: Gigabit Ethernet 1000Mbps, D-Link Switch 24-Port 10/100/1000

2.1.2. Software Components: OS - Windows 7 ,Grid Middleware: JPPF 5.0, GPU Libraries : Aparapi, Standard

Library : JDK 1.7, Apache Ant, version 1.9.4, Biological Libraries : Bio Java 3.0.5.

2.1.3 Data : 51.6 MB of Protein Sequence Data

2.2 Method

Driver configuration: The parameters for each of the algorithms were set as: Static algorithm

(strategy.manual_profile.size = 1), Adaptive algorithm (minSamplesToAnalyse = 100, minSamples

ToCheckConvergence = 50, maxDeviation = 0.2, maxGuessToStable = 50, sizeRatioDeviation = 1.5, decreaseRatio

= 0.2), Deterministic Algorithm( initialSize = 5, performanceCacheSize = 1000, proportionalityFactor = 1),

(3)

Reinforcement Learning Algorithm( performanceCacheSize = 3000, performanceVariationThreshold = 0.001),

Nodethreads Algorithm( multiplicator = 1)

Node configuration:. A node's configuration file has a property, processing.threads to specify the size of the

execution thread pool.processing.threads > n , where n is number of CPU cores in nodes.

Development of Grid Enabled Application: A Gridified Bioinformatics Application which analysed 51.6 MB of

protein sequence to compute Instability Index, net charge, hydrophobicity and Aliphatic Index of retrieved protein

sequences. The application was written to evaluate the performance of the established Grid.

Method for executing task on GPU: To enable APARAPI program to execute Open CL native library

(OpenCL.dll) and APARAPI native library (aparapi_x86.dll) must be set to the environment variable PATH.

CPU-GPU Load Balancing: Load shared proportionately as per the capability of CPU and GPU. In our experiment

80% load is given to CPU and 20% load given on GPU.

Fig 1 - Simplified Architecture of GPU Enabled Grid Enabled Application Class Hierarchy and underlying

hardware utilization

(4)

The grid enabled application divides the task between the cores of GPU and CPU by adopting the above class

hierarchy. The client application uses standard JPPF classes and APARAPI.

3. Results

3.1 Results of server optimization and performance analysis

Table 1. Server Load Balancing Performance for various algorithms

S l. O f obs . G ranu lar it y O f Tas k ( in MB)

Time of Execution (in milliseconds) for various Server Load Balancing Algorithm

Static Algorithm Adaptive Algorithm Deterministic Algorithm Reinforcement Learning Algorithm Node threads Algorithm 1 0.1 1532 1893 2193 2043 2043 2 0.2 1975 2100 2500 2300 2300 3 0.4 2476 2248 2748 2498 2498 4 0.8 2688 3432 3732 3582 3582 5 1.6 3704 4397 4997 4697 4697 6 3.2 3988 4762 5062 4912 4912 7 6.4 4489 5927 6727 6327 6327 8 12.8 4997 6093 6393 6243 6243 9 25.6 5499 7358 8258 7808 7808 10 51.6 6009 8123 9123 8623 8623

Fig 2. Comparison of Load Balancing Algorithms on Proposed Grid for various level of granularity of task

From figure 2 it can be seen that static load sharing is much efficient than other algorithms. All other algorithms

requires communication between nodes and server. We can see along with increase in granularity of task, the total

(5)

execution time of the grid enabled application also increased; this is because of more network communication

between servers and nodes. So we have chosen static load balancing algorithm for server in our grid, because all the

nodes used in the grid are of same configuration.

3.2 Results of node performance by node threads

Table 2. Node Load Balancing Performance for different numbers of processing threads

Sl . N o G ranu lar it y of Task ( in K B

) Performance of grid nodes for various node threads in milli seconds

Node 1 Node 2 Node 3 Node 4

T = 1 T = 2 T = 4 T = 1 T = 2 T = 4 T = 1 T = 2 T = 4 T = 1 T = 2 T = 4 1 0.1 760 381 395 765 384 399 766 384 399 764 383 398 2 0.2 981 492 506 987 495 510 988 495 510 985 494 509 3 0.4 1232 617 631 1237 620 635 1238 620 635 1236 619 634 4 0.8 1337 670 684 1344 673 688 1345 674 689 1342 672 687 5 1.6 1844 923 937 1852 927 942 1853 928 943 1850 926 941 6 3.2 1988 995 1009 1993 998 1013 1994 998 1013 1992 997 1012 7 6.4 2240 1121 1135 2243 1123 1138 2244 1123 1138 2243 1123 1138 8 12.8 2493 1248 1262 2498 1250 1265 2499 1251 1266 2496 1249 1264 9 25.6 2745 1374 1388 2748 1375 1390 2749 1376 1391 2748 1375 1390 10 51.6 2998 1500 1514 3004 1503 1518 3005 1504 1519 3002 1502 1517

a.

b.

c.

d.

(6)

We have set processing.threads attribute value to 1, 2 and 4 in Node Configuration file for each node in each

granularity of tasks. For each granularity for each thread value we have executed the grid enabled application in 3

runs and a mean of execution time of all run were considered for each thread.

It is evident from Table 2 and Fig 3(a) – 3(d) that if the node is having 2 cores then setting the number of processing

threads to 2 gives best performance because in our experimental grid all nodes were having 2 cores CPUs.

3.3 Node performance analysis of CPU+GPU hybrid computing model

Table 3. Processing time readings and performance improvement studies when load is shared/not shared between CPUs and GPUs

S l. O f obs . G ranu lar it y of T as k ( in KB ) CPU-GPU

(Load Sharing : 80% in CPU, 20% in GPU) CPU Only 100 % Load on CPU of Each Node Difference in Execution Time(in ms) Improvement (in %) 1 0.1 281 381 100 26.25 2 0.2 292 492 200 40.65 3 0.4 242 617 375 60.78 4 0.8 265 670 405 60.45 5 1.6 363 923 560 60.67 6 3.2 395 995 600 60.30 7 6.4 440 1121 681 60.75 8 12.8 498 1248 750 60.10 9 25.6 544 1374 830 60.41 10 51.6 590 1500 910 60.67

From the above readings and calculations it is evident that if CPU and GPU process data together then we get better

improvement in performance in each node. The data is graphically represented in Figure below-

a

b

Figure 4(a). Execution Time and 4(b) Improvement of CPU-GPU Hybrid model

A performance pattern has noticed that after a granularity of task, the performance remain within a range of ~ 60%

because the performance will get enhanced upto the processing capability of GPU. With higher number of execution

units in GPU the performance will be scaled to much higher level.

3.4 Results of grid performance analysis

The processing time of the grid enabled application T(G) and sequential application T(S) has been recorded

for various granularity of task in milli seconds. The result shows that performance improvement [T(S)/T(G)] after

reaching certain level becomes stable . The results generated are shown in Table 4.

(7)

Table 4. Comparative processing time performance calculations between T(S) & T(G)

Figure 5. Execution Time of Grid Enabled and Standalone Application

Performance of Grid Enabled application has been achieved upto 3.4 times in comparison to sequential version of

the same application.

4. Conclusion

Performance of grid depends on the configuration of grid by the appropriate selection of load balancing strategy. A

grid enabled application which can utilize the underlying hardware resources of nodes (CPU-GPU Cores) efficiently

makes the compute intensive tasks quite faster. The study shows how to achieve maximum performance from a grid

enabled application by the appropriated server, node optimization and utilization of the hybrid computing model.

The experiment conducted above was on a mini experimental grid of 4 nodes and the grid enabled application shows

3.4 times better than non grid enabled application performing the same analysis. In addition to that each node when

tested for CPU-GPU hybrid model has shown improvement of ~ 60% per node additionally. This approach is a

promising model for all most all compute intensive tasks in bioinformatics like molecular modeling, docking and

molecular dynamics studies for large number of protein structures.

Sl. Of obs. Granularity of Task(in MB) T(S) in ms T(G) in ms % PE=T(S)/T(G) * 100

1 0.1 3064 1532 200 2 0.2 6912.5 1975 350 3 0.4 11884.8 2476 480 4 0.8 9139.2 2688 340 5 1.6 12593.6 3704 340 6 3.2 13559.2 3988 340 7 6.4 15262.6 4489 340 8 12.8 16989.8 4997 340 9 25.6 18696.6 5499 340 10 51.6 20430.6 6009 340

(8)

Acknowledgements

We acknowledge Prof. Subrata Chakraborty, Director i/c Centre for Bioinformatics Studies for facilitating with

the infrastructure to construct the Grid and perform the experiments. Dr. K Narain, Dy. Director of RMRC-ICMR,

Dibrugarh and Dr. Ashwani Sharma, Indian Science and Technology Foundation for their constant support and help.

References

1. Foster I , Yong Z , Raicu I , Shiyong L . Cloud and Grid Computing 360-degree compared. In Proc. Grid Computing Environments Workshop, 2008. GCE '08, 12-16 Nov. 2008, p. 1 - 10

2. Payne JL, Sinnott Armstrong NA, Moore JH. Exploiting graphics processing units for computational biology and bioinformatics. Interdiscip Sci. 2010; 2(3):213-20.

3. Hunter A, Australia WA, Schibeci D , Hiew HL, Bellgard M; GRID-enabled bioinformatics applications for comparative genomic analysis at the CBBC. In Proc. IEEE International Conference on Cluster Computing;20-23 Sept. 2004

4. Thrasher A, Musgrave Z, Kachmarck B, Thain D, Emrich S. Scaling up genome annotation using MAKER and work queue; Int J Bioinform Res Appl 2014;10(4-5):447-60

5. Carapito C, Burel A, Guterl P, Walter A, Varrier F, Bertile F, Van DA. MSDA, a proteomics software suite for in-depth Mass Spectrometry Data Analysis using grid computing; Proteomics. 2014; 14(9):1014-9

6. Wu CC, Lai LF, Gromiha MM, Huang LT. High throughput computing to improve efficiency of predicting protein stability change upon mutation; Int J Data Min Bioinform. 2014; 10(2):206-24.

7. Menday R, Wieder P. GRIP: The Evolution of UNICORE towards a Service-Oriented Grid. Proceedings of the Cracow Grid Workshop 2003, Cracow, Poland, p. 142-150

8. Venugopal S, Buyya R, Winton L. A Grid Service Broker for Scheduling Distributed Data-Oriented Applications on Global Grids, Technical Report, In Proc. 2nd workshop on Middleware for grid computing, Toronto, Ontario, Canada 2004; p. 75-80

9. Hey T,Trefethen AE. The UK e-Science Core Programme and the Grid, Future Generation Computer Systems 2002; 18(8):1017-1031. 10. Johnston W, Gannon D, Nitzberg B. Grids as production computing environments: The engineering aspects of NASA’s Information Power Grid. In Proc. Eighth IEEE International Symposium on High Performance Distributed Computing, Redondo Beach, CA, USA, 1999; p. 197-204 11. Natrajan A, Crowley M, Wilkins Diehr N, Humphrey M, Fox AD, Grimshaw A, Brooks CL. Studying Protein Folding on the Grid: Experiences using CHARMM on NPACI Resources under Legion, In Proc. the 10th International Symposium on High Performance Distributed Computing, August 7-9, 2001; p. 14–21.

12. Buyya R, Date S, Mizuno Matsumoto Y, Venugopal S, Abramson D. Composition of Distributed Brain Activity Analysis and its On-Demand Deployment on Global Grids, In Proc. 10th International Conference on High Performance Computing 2003,Hyderabad

13. Huang Q, Huang Z, Werstein P, Purvis M. GPU as a General Purpose Computing Resource; In Proc. Ninth International Conference on Parallel and Distributed Computing, Applications and Technologies, 2008; p. 151 - 158

14. Guerrero GD, Imbernon B, Perez Sanchez H, Sanz F, Garcia JM, Cecilia JM. A performance/ cost evaluation for a GPU-based drug discovery application on volunteer computing; Biomed Res Int. 2014; 2014:474219

15. Goga N, Marrink S, Cioromela R, Moldoveanu F. GPU-SD and DPD parallelization for Gromacs tools for molecular dynamics simulations. In Proc IEEE 12th International Conference on Bioinformatics & Bioengineering (BIBE) 2012. p. 251-254

16. Demchuk O, Karpov P, Blume Y. Docking Small Ligands to Molecule of the Plant FtsZ Protein: Application of the CUDA Technology for Faster Computations. Cytology and Genetics 2012; 46(3):172-179.

17. Shimoda T, Suzuki S, Ohue M, Ishida T, Akiyama Y. Protein-protein docking on hardware accelerators: comparison of GPU and MIC architectures; BMC Syst Biol. 2015; 9(Suppl) 1:S6.

18. Wu J, Hong B, Takeda T, Guo JT. High performance transcription factor-DNA docking with GPU computing; Proteome Sci. 2012 ;10 (Suppl) 1:S17

19. Manavski Svetlin A, Giorgio V. CUDA compatible GPU cards as efficient hardware accelerators for Smith-Waterman sequence alignment; BMC Bioinformatics 2008; 9: S10.

20. Zhao K, Chu X. G-BLASTN: accelerating nucleotide alignment by graphics processors; Bioinformatics. 2014; 30(10):1384-91

21. Pang B, Zhao N, Becchi M, Korkin D, Shyu CR. Accelerating large-scale protein structure alignments with graphics processing units; BMC Res Notes 2012; 22(5):116.

22. Yang G, Jiang W, Yang Q, Yu W. PBOOST: a GPU-based tool for parallel permutation tests in genome-wide association studies; Bioinformatics 2015; 31(9):1460-2.

23. Putz B, Kam Thong T, Karbalai N, Altmann A, Muller Myhsok B. Cost-effective GPU-grid for genome-wide epistasis calculations; Methods Inf Med 2013; 52(1):91-5.

24. Papadopoulos A, Kirmitzoglou I, Promponas VJ, Papadopoulos A, Kirmitzoglou I, Promponas VJ, Theocharides T. GPU technology as a platform for accelerating local complexity analysis of protein sequences. In Proc IEEE Eng Med Biol Soc. 2013;2013:2684-7.

25. Wang Y, Du H, Xia M, Ren L, Xu M, Xie T. A Hybrid CPU-GPU Accelerated Framework for Fast Mapping of High-Resolution Human Brain Connectome. PLoS ONE 2008; 8(5): e62789.

26. Nikolay P, Karpov P, Yaroslav B. Hybrid CPU-GPU calculations – a promising future for computational biology. In Proc. Third International Conference on High Performance Computing HPC-UA 2013,Ukraine, Kyiv; p. 330–335

27. Liao T, Zhang Y, Kekenes Huskey PM, Cheng Y, Michailova A, McCulloch AD, Holst M , McCammon JA; Multi-core CPU or GPU-accelerated Multiscale Modeling for Biomolecular Complexes; Mol Based Math Biol. 2013; p. 1-27

28. Singh A, Chen C, Liu W, Mitchell W, Schmidt B. A hybrid computational grid architecture for comparative genomics; IEEE Trans Inf Technol Biomed. 2008; 12(2) : 218-25

(9)

Garza J. Grid-based algorithm to search critical points, in the electron density, accelerated by graphics processing units; J Comput Chem. 2014 Dec 5;35(31):2272-8.

30. Agulleiro JI , Vazquez F, Garzon EM, Fernandez JJ. Hybrid computing: CPU+GPU co-processing and its application to tomographic reconstruction; Ultramicroscopy 2012; 115:109-14.

31. Tyng Yeu L, Hung Fu L, Jun Yao C. Proceedings of IEEE 26th International on Parallel and Distributed Processing Symposium Workshops & PhD Forum (IPDPSW); 21-25 May 2012, Shanghai; p. 2369 – 2377

32. Ghorpade J, Parande J, Kulkarni M, Bawaskar A. GPGPU Processing In CUDA Architecture; Advanced Computing: An International Journal 2012 ; 3(1): 105-120

33. Lee H, Munshi A. The OpenCL Specification Version: 2.0 Document Revision: 26" (PDF). Khronos Group. Available from https://www.khronos.org/registry/cl/specs/opencl-2.0.pdf Updated on 17 October 2014. Retrieved 16 April 2015.

34. Gary Frost, Aparapi - an open source API for expressing data parallel workloads in Java™. Available from http://developer.amd.com/tools-and-sdks/opencl-zone/aparapi/