Development of Bio-Cloud Service for Genomic Analysis Based on Virtual Infrastructure

Jung-Ho Um (1), Sang Bae Park (2), Hoon Choi (*3), Hanmin Jung (4)

1, First Author: Korea Institute of Science and Technology Information, [email protected]

2: Korea Institute of Science and Technology Information, [email protected]

*3, Corresponding Author: Korea Institute of Science and Technology Information, [email protected]

4: Korea Institute of Science and Technology Information, [email protected]

Abstract

The importance of genomic data analysis is growing as a means of realizing personalized treatment of human cancers. The next-generation sequencing (NGS) technique is a cost-effective way to obtain the data sets needed for such analysis. Because NGS produces data sets consisting of very large numbers of short reads, analyzing them requires substantial computing resources. Cloud computing is a promising solution to this issue because it can manage computing resource requirements elastically. In this paper, we propose a bio-cloud service for large-scale NGS data analysis based on virtualized computing infrastructure. It has been developed in collaboration between KISTI and KOBIC to enhance the productivity of KOBIC’s genomic data re-sequencing studies.

Keywords: Next-Generation Sequencing, Cloud Computing, Genomic Analysis, Virtual Cluster Management, NGS Analysis Pipeline

1. Introduction

Genome sequence data analysis is an essential part of research on personalized medical treatment of human cancers [1]. The next-generation sequencing (NGS) technique is now prevalent in genome analysis because it saves biologists both cost and time in generating genome data. Hence, most bioinformatics research groups use the NGS technique to analyze genome sequences [2, 3].

The Korean Bioinformation Center (KOBIC), which is the primary research institute for bioinformatics research on human cancers, constructed a sequence data analysis system that consists of 100 physical computing servers and one storage server. However, the system has several issues that stem from the direct use of physical computing resources. First, latency increases with the workload, so job execution is severely delayed. Second, it is very difficult to use computing resources efficiently because even a single bioinformatics application consists of both sequential and parallel processing components; as a result, bioinformatics applications tend to under-utilize physical computing resources. Finally, the existing system only stores the output data of executed applications on a storage server, so data management is left to each user who executes applications. Since the system manages data from an application’s viewpoint rather than a user’s viewpoint, users must make additional efforts to manage their data.

These problems cause serious inefficiency in dealing with large-scale data analysis like RNA and DNA re-sequencing. To enhance the productivity of such analysis, we developed the Bio-Cloud system, which is based on virtualized computing resources and is specific to KOBIC’s requirements.

In this paper, we describe the Bio-Cloud system which can elastically allocate virtualized computing resources to bioinformatics applications on demand. We exploit Xen and OpenNebula to construct a virtual infrastructure management system for Bio-Cloud.

2. Proposed Bio-Cloud System

We designed the Bio-Cloud system by analyzing the above problems and developing solutions to resolve them. To reduce job execution time, we constructed a virtual infrastructure management system that can dynamically allocate a virtual cluster to a user’s job on demand. To improve the utilization of computing resources, a virtual cluster is allocated on the basis of CPU usage frequency, memory usage patterns, and characteristics of applications including sequential and parallel processing components. Data management functions are provided to support NGS analysis pipelines and make it easy for cloud users to handle their data.

Figure 1. Bio-Cloud system architecture

Figure 1 shows the overall system architecture of Bio-Cloud. The system consists of a virtual infrastructure, a resource management server, and a web interface that receives and responds to KOBIC’s requests. The users of the system and their roles are described as follows.

- End User: End users execute and monitor their tasks, and upload/download their data by using an App Server provided by the Bio App Manager.

- Bio App Manager (SaaS Provider): The Bio App Manager (KOBIC personnel in our project) is primarily responsible for submitting requests for a virtual cluster to the Resource Manager through a web interface. The capacity of the virtual cluster provided by the Resource Manager is based on the end user’s requests and the status of the virtual infrastructure. The Bio App Manager is also responsible for maintaining the pipelines of the NGS application. In addition, the Bio App Manager manually or automatically installs bioinformatics applications on the virtual cluster provided by the Resource Manager.

- Resource Manager: The Resource Manager manages and monitors the physical and virtual infrastructure for the Bio-Cloud system.

We designed a virtual infrastructure management system to provide a virtual cluster (or virtual machine) to bioinformatics applications. Figure 2 shows components of the virtual infrastructure management system.

Figure 2. Virtual Infrastructure Management System

- Virtual Infrastructure: The Virtual Infrastructure is considered as a pool of virtual clusters, virtual machines, and storage servers. The diverse capacity (i.e., the number of vCPUs, the size of virtual memory, the number of virtual machines for a virtual cluster, etc.) of virtual clusters and virtual machines is created by the SaaS Provider’s requests. The virtual resources in the Virtual Infrastructure are managed by the Computing Resource Management Server and the Data Resource Management Server.

- Computing Resource Management Server: The Computing Resource Management Server manages virtual computing resources, such as virtual clusters and virtual machines. To allocate a virtual cluster to a job and execute it on the virtual cluster, the Computing Resource Management Server provides a Virtual Cluster Management module, which provides functions such as allocation, release, and monitoring of a virtual cluster. Virtual Cluster Management calculates the number of virtual CPUs, the size of the memory, and the number of virtual machines in a virtual cluster. Based on the calculation, the Virtual Cluster Management module creates a virtual cluster or machine from the Virtual Infrastructure. In addition, it also manages the execution of a job on a virtual cluster.

- Data Resource Management Server: The Data Resource Management Server manages input/output and temporary data required to execute a job on a virtual cluster. It provides end users with data management functions to handle their data according to NGS analysis pipelines. Figure 3 shows an example of an NGS analysis pipeline frequently used in DNA and RNA re-sequencing. The pipeline consists of BWA[4], BowTie[5], SSAHA[6], SamTools[7] and GATK[8]. The data requirement for each step of the pipeline is shown in Table 1.
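The sizing calculation attributed above to the Virtual Cluster Management module can be illustrated with a small sketch. This is not the module's actual algorithm, which the paper does not specify: the `JobProfile` fields, the per-machine vCPU limit, and the packing rule are all assumptions introduced here for illustration.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class JobProfile:
    """Hypothetical description of a job; the field names are illustrative."""
    parallel_tasks: int      # number of tasks that can run concurrently
    memory_gb_per_task: int  # memory each task is expected to need


@dataclass(frozen=True)
class ClusterSpec:
    num_vms: int
    vcpus_per_vm: int
    memory_gb_per_vm: int


def size_cluster(job: JobProfile, max_vcpus_per_vm: int = 8) -> ClusterSpec:
    """Derive a virtual cluster size from a job profile (assumed rule).

    A sequential job (one task) gets a single one-vCPU machine. A parallel
    job gets one vCPU per task, packed into as few machines as possible.
    """
    if job.parallel_tasks < 1 or max_vcpus_per_vm < 1:
        raise ValueError("task count and vCPU limit must be positive")
    vcpus_per_vm = min(job.parallel_tasks, max_vcpus_per_vm)
    num_vms = math.ceil(job.parallel_tasks / vcpus_per_vm)
    memory_gb_per_vm = vcpus_per_vm * job.memory_gb_per_task
    return ClusterSpec(num_vms, vcpus_per_vm, memory_gb_per_vm)
```

Under these assumptions, a 20-task job with an 8-vCPU limit is packed into three machines, while a sequential step of the same pipeline gets a single small machine, which is the kind of per-application allocation the design aims for.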

Figure 3. An example NGS analysis pipeline

Table 1. The data size generated by the example NGS analysis pipeline

file type | file name | size (Gigabyte)
input file | ref.fa | 3.460
input file | gen.fq | 433.625
input file | snp.rod | 3.831
output file | gen.sam | 640.920
output file | gen.addRG.sam | 682.955
output file | gen.bam | 205.829
output file | gen.sorted.bam | 164.111
output file | gen.snpCalls.vcf | 668.136
index file | ref.fa.amb | 0.036
index file | ref.fa.ann | 0.294
index file | ref.fa.bwt | 1.281
index file | ref.fa.pac | 0.854
index file | ref.fa.rbwt | 1.281
index file | ref.fa.rpac | 0.854
index file | ref.fa.rsa | 0.427
index file | ref.fa.sa | 0.427
index file | gen.sai | 49.495
index file | ref.fa.fai | 0.21
index file | gen.snpCalls.vcf.idx | 1.125
Total | | 2,190.025

The Data Resource Management Server must store and maintain the whole data set of 2 TB produced from the above analysis pipeline, which is reusable to NGS researchers. To do this, it annotates the metadata shown in Table 2 to the data products. It also creates a folder for each step of the pipeline and stores data in the folder. As a result, the data product for a pipeline consists of a hierarchy of folders, and the folders hold data from each step with the metadata.

Table 2. Metadata table

Field Name | Description
UserID | unique user identifier
Storage Server Address | address of the storage server where the data is stored
Path | folder path to store user’s data
Filename | file name of data
Datasize | size of data
Creation date | creation date of data
Modified date | last modification date of data
Application type | application type if data is output data
pipelineID | pipeline Id for genome sequencing
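A row of the metadata table and the per-step folder layout could be represented along the following lines. The field names follow Table 2, but the types, the byte unit for the data size, the `/data/<user>/<pipeline>/<step>` layout, and the helper names `step_folder` and `annotate_output` are assumptions made for this sketch.

```python
from dataclasses import dataclass
from datetime import datetime
from pathlib import PurePosixPath
from typing import Optional


@dataclass
class MetadataRecord:
    """One row of the metadata table; fields follow Table 2."""
    user_id: str
    storage_server_address: str
    path: str
    filename: str
    datasize: int                    # bytes (the unit is an assumption)
    creation_date: datetime
    modified_date: datetime
    application_type: Optional[str]  # set only when the data is application output
    pipeline_id: str


def step_folder(user_id: str, pipeline_id: str, step: int, tool: str) -> PurePosixPath:
    """Return the folder for one pipeline step (hypothetical layout)."""
    return PurePosixPath("/data", user_id, pipeline_id, f"{step:02d}_{tool}")


def annotate_output(user_id: str, server: str, pipeline_id: str, step: int,
                    tool: str, filename: str, datasize: int,
                    now: datetime) -> MetadataRecord:
    """Build the metadata record for a file produced by one pipeline step."""
    folder = step_folder(user_id, pipeline_id, step, tool)
    return MetadataRecord(
        user_id=user_id,
        storage_server_address=server,
        path=str(folder),
        filename=filename,
        datasize=datasize,
        creation_date=now,
        modified_date=now,  # a new file has not been modified yet
        application_type=tool,
        pipeline_id=pipeline_id,
    )
```

Keeping the step index in the folder name makes the hierarchy sort in pipeline order, so the data product of a whole pipeline can be browsed step by step.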

The Data Resource Management Server supports Project folders, File folders and Script folders to make the management of data products easy. The folders store output data from applications, end users’ input data, and user-defined scripts, respectively. Each user owns and manages these folders on a storage server such as NFS [9]. Data updating consists of three steps, as shown in Figure 4.

1. Meta-data related to the update request is stored to the meta-data table.

2. User submits the update request to storage server.

3. Updated data is transferred to storage server.

Figure 4. Data management flow

The storage server contains a Data Management module and a Data Transfer module to handle user requests. The Data Management module maintains input/output data and provides creation, deletion, renaming, and copying of folders and data. In addition, it maintains a metadata table for efficient data management. The Data Transfer module uploads and downloads data between clients and a storage server.
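The three-step update flow of Figure 4 can be sketched as below. This is only an illustration under stated assumptions: the storage server is modeled as a local directory, the metadata table as a dictionary, and the names `StorageServer`, `submit_update`, `receive`, and `update_data` are hypothetical rather than part of Bio-Cloud.

```python
import shutil
from pathlib import Path


class StorageServer:
    """Stand-in for the storage server: a local directory (assumption)."""

    def __init__(self, root: Path) -> None:
        self.root = root
        self.pending = set()  # relative paths with an accepted update request

    def submit_update(self, relative_path: str) -> None:
        """Step 2: register the user's update request."""
        self.pending.add(relative_path)

    def receive(self, relative_path: str, source: Path) -> None:
        """Step 3: accept the data, but only for a registered request."""
        if relative_path not in self.pending:
            raise PermissionError(f"no update request for {relative_path}")
        target = self.root / relative_path
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(source, target)
        self.pending.discard(relative_path)


def update_data(metadata_table: dict, server: StorageServer,
                user_id: str, relative_path: str, source: Path) -> None:
    """Run the three update steps in the order given in the text."""
    # Step 1: store the metadata related to the update request.
    metadata_table[(user_id, relative_path)] = {"datasize": source.stat().st_size}
    # Step 2: submit the update request to the storage server.
    server.submit_update(relative_path)
    # Step 3: transfer the updated data to the storage server.
    server.receive(relative_path, source)
```

Recording the metadata before the transfer means the table always knows about data that is in flight; a production system would also need to roll the entry back if the transfer fails, which this sketch does not attempt.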

3. Implementation

The Bio-Cloud system designed in the previous section is implemented using Xen and OpenNebula, which are well suited to the construction of a virtual infrastructure management system for bioinformatics applications. The user environment of Bio-Cloud is kept similar to KOBIC’s current system to reduce the effort users need to adapt. That is, we use the same system components, such as the job scheduler, the operating system of the virtual machines, and the web server.


- Virtual Cluster Management module: We extended OpenNebula [10] to develop this module. OpenNebula provides the OpenNebula Cloud API (OCA) to manage virtual infrastructure. OCA supports the creation, deletion, and monitoring of virtual machines, virtual networks, hosts, physical clusters, users, and images, but it does not provide functions for managing virtual clusters. We used OCA to add virtual cluster management functions to OpenNebula. The added functions are creation, deletion, and monitoring of a virtual cluster (Table 3). In addition, KOBIC uses Sun Grid Engine (SGE) [11] to execute bioinformatics applications because SGE is a stable open-source program that is updated consistently. We therefore also adopted SGE as the job scheduler in Bio-Cloud and execute KOBIC’s jobs on virtual clusters.

Table 3. Virtual Cluster Management APIs

Allocate : allocation of virtual cluster
Input | Description
Client client | connect to a client
String description | virtual machine template to allocate virtual cluster
int numVM | the number of virtual machines for virtual cluster
Output | Description
OneResponse | communication acknowledgement

finalizeCluster : deletion of virtual cluster

saveCluster : saving a virtual machine image in virtual cluster
Input | Description
int vmId[] | storing virtual machine identifier
int diskId | storing disk identifier
int imageId | storing image identifier

getClusterInfo : monitoring virtual cluster
Output | Description
VCInfo | store virtual cluster information
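The shape of the virtual cluster operations in Table 3 can be illustrated with an in-memory sketch. It does not call OpenNebula or OCA: where the real module would create and delete virtual machines through OCA, this sketch only hands out identifiers. The class name `VirtualClusterManager`, the fields of `VCInfo`, and the identifier scheme are assumptions, and `saveCluster` is omitted.

```python
from dataclasses import dataclass
from itertools import count
from typing import Dict, List


@dataclass
class VCInfo:
    """Virtual cluster information (fields are illustrative)."""
    cluster_id: int
    template: str
    vm_ids: List[int]
    state: str = "RUNNING"


class VirtualClusterManager:
    """In-memory stand-in for the virtual cluster functions of Table 3."""

    def __init__(self) -> None:
        self._clusters: Dict[int, VCInfo] = {}
        self._cluster_ids = count(1)
        self._vm_ids = count(100)

    def allocate(self, template: str, num_vm: int) -> int:
        """Create a cluster of num_vm machines from one VM template."""
        if num_vm < 1:
            raise ValueError("a virtual cluster needs at least one machine")
        cluster_id = next(self._cluster_ids)
        # A real implementation would create each machine through OCA here.
        vm_ids = [next(self._vm_ids) for _ in range(num_vm)]
        self._clusters[cluster_id] = VCInfo(cluster_id, template, vm_ids)
        return cluster_id

    def get_cluster_info(self, cluster_id: int) -> VCInfo:
        """Return monitoring information for one cluster."""
        return self._clusters[cluster_id]

    def finalize_cluster(self, cluster_id: int) -> None:
        """Delete the cluster and forget its machines."""
        del self._clusters[cluster_id]
```

Grouping machine identifiers under a single cluster identifier is the essential addition over a per-machine API: callers allocate, monitor, and release the cluster as one unit.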

- Data Transfer module: KOBIC’s current system uses innoDS [12] for transferring data. However, innoDS requires a license fee, which is a major obstacle to the deployment of Bio-Cloud. Instead, the open-source code Rapident [13] is used to implement the Data Transfer module. Data Transfer includes upload/download of data and folders (Table 4).


Table 4. APIs for data transfer

Upload : Upload data
Input | Description
String srcPath | path of user’s data source
String destPath | path to store data to server
Output | Description
boolean | success or failure of upload (true/false)

Download : Download data
Input | Description
String srcPath | stored data path on the server
String destPath | path to download data
Output | Description
boolean | success or failure of download (true/false)

- Data Management module: KOBIC users want to manage data for their own purposes. This module provides data management functions from a user’s viewpoint. They include creation, deletion, rename and copy of folders and files (Table 5).

Table 5. Data Management APIs

create : creation of folder or file
Input | Description
String path | path including the folder (or file) name to create
Output | Description
boolean | success or failure of creation (true/false)

delete : deletion of folder or file
Input | Description
String path | path including the folder (or file) name to delete
Output | Description
boolean | success or failure of delete (true/false)

rename : rename folder or file
Input | Description
String srcPath | source path including the folder (or file) name to rename
String destPath | destination path to rename
Output | Description
boolean | success or failure of rename (true/false)

copy : copy folder or file
Input | Description
String srcPath | source path including the folder (or file) name to copy
String destPath | destination path to copy
Output | Description
boolean | success or failure of copy (true/false)
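The four operations of Table 5 map naturally onto a local file system, which gives a compact way to see the boolean success/failure convention at work. The sketch below is an illustration only, not the Bio-Cloud module: it operates on local paths with Python's standard library, and the `is_folder` parameter of `create` is an assumption added here because Table 5 does not say how a folder is distinguished from a file.

```python
import shutil
from pathlib import Path


def create(path: str, is_folder: bool = True) -> bool:
    """Create a folder or an empty file; fail if it already exists."""
    target = Path(path)
    try:
        if is_folder:
            target.mkdir(parents=True, exist_ok=False)
        else:
            target.parent.mkdir(parents=True, exist_ok=True)
            target.touch(exist_ok=False)
        return True
    except OSError:
        return False


def delete(path: str) -> bool:
    """Delete a file, or a folder together with its contents."""
    target = Path(path)
    try:
        if target.is_dir():
            shutil.rmtree(target)
        else:
            target.unlink()
        return True
    except OSError:
        return False


def rename(src_path: str, dest_path: str) -> bool:
    """Rename (move) a folder or file."""
    try:
        Path(src_path).rename(dest_path)
        return True
    except OSError:
        return False


def copy(src_path: str, dest_path: str) -> bool:
    """Copy a file, or a folder recursively."""
    source = Path(src_path)
    try:
        if source.is_dir():
            shutil.copytree(source, dest_path)
        else:
            shutil.copy2(source, dest_path)
        return True
    except OSError:
        return False
```

Returning a boolean mirrors the output column of Table 5; the cost is that the caller cannot tell why an operation failed, so an implementation that needs diagnostics would log the caught exception before returning.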


4. Software as a Service on Bio-Cloud: NEUMA Web Portal

The NEUMA [14] web portal service was developed on top of the Bio-Cloud system described in Sections 2 and 3. NEUMA is the fastest RNA re-sequencing application for quantifying the volume of genomic expression data. The NEUMA analysis pipeline comprises the NGS applications shown in Table 6.

Table 6. NEUMA pipeline

steps | function
1 | make bowtie index file of Reference sequence by bowtie (bowtie-0.12.7 [5])
2 | make mapping statistics of each RNA-Seq
3 | find length distribution
4 | build indexed table for transcriptome model
5 | build suffix array table for transcriptome model
6 | print suffix array table for transcriptome model
7 | build gU, iU table for transcriptome model
8 | gene and isoform quantification
9 | merge

Bio-Cloud provides the NEUMA web portal with the following features. First, it is responsible for allocating virtual clusters to NEUMA application services. Each virtual cluster is configured with the number of vCPUs, the memory size, and the storage volume specified in the Bio App Manager’s request (Figure 5).

Figure 5. Request of virtual clusters

Second, Bio-Cloud provides a feature that summarizes the requests for virtual clusters from users and monitors the current status of the virtual clusters allocated to users (Figures 6 and 7).

Figure 6. Summary of requested virtual clusters

Figure 7. Monitoring allocated virtual clusters

Third, Bio-Cloud executes NEUMA application jobs through Sun Grid Engine and monitors their status executed on virtual clusters (Figure 8).

Figure 8. Monitoring user’s job status

Finally, Bio-Cloud provides a data management feature for creating, deleting, and renaming files and folders. It also provides editing, uploading, and downloading of data files (Figure 9).

Figure 9. Data management interface

5. Related works

Cloud computing services for bioinformatics have been studied by many research and industry organizations. CloudBurst [15] is a read-mapping application for DNA and RNA re-sequencing running on Amazon EC2. It drastically reduces the execution time for re-sequencing by using the MapReduce programming framework and runtime system. CloudBLAST [16] also parallelized BLAST [17] with MapReduce on the public cloud Amazon EC2.

OBIWEE [18] has proposed an open-source computing environment to support workflows among bioinformatics applications on a private cloud. Kim [19] has also presented a private cloud computing environment for parallel processing of bioinformatics applications consisting of multiple tasks.

These studies focus on the efficient processing of bioinformatics applications in public and private cloud environments. However, none of them supports the analysis pipelines that are essential for analyzing genomic data. The Bio-Cloud system we have proposed in this paper deals with analysis pipelines and data management issues for NGS analysis on a private cloud.

6. Conclusions

In this paper, we have proposed and developed the Bio-Cloud service for genomic data analysis, which can dynamically allocate a virtual cluster depending on a user’s job. The advantages of Bio-Cloud may be summarized as follows. First, it reduces a user’s waiting time and job execution time by creating a virtual cluster suited to the user’s jobs. Second, it allocates a virtual cluster to a job based on the characteristics of bioinformatics applications (sequential or parallel processing) and the number of vCPUs or the size of memory. Finally, since each user’s data is separately stored and maintained according to his/her own analysis viewpoint, the effort a user spends managing data is drastically reduced.

Bio-Cloud is used to deploy NEUMA, a web tool for KOBIC’s RNA re-sequencing [20]. Bio-Cloud and the NEUMA web portal service are very useful for small labs that do not operate computing systems with high enough performance to handle their research. We are still working on the development of an adaptive scheduler for the Virtual Cluster Management in Bio-Cloud. The scheduler dynamically allocates a virtual cluster to a cluster of physical hosts based on the application’s characteristics to improve performance and scalability. We are also developing a data provenance management module for bioinformatics data management.

7. References

[1] A. H. Chen, and M. C. Lee, “Novel Approaches for the Prediction of Cancer Classification”, In Proceedings of IJACT, Vol. 3, No. 3, pp. 30-39, 2011.

[2] H. Liu, Z. Lu, L Guo, Q. Wu, Q. Ge, and J. Lu, “Next generation sequencing, an effective method for genomic profiling of circulating miRNA”, In Proceedings of JCIT, Vol. 6, No. 12, pp. 434-441, 2011.

[3] M. Xiong, Z. Zhao, J. Arnold, and F. Yu, “Next-Generation Sequencing”, In Proceedings of Journal of Biomedicine and Biotechnology, 2010.

[4] http://bio-bwa.sourceforge.net/

[5] http://bowtie-bio.sourceforge.net/

[6] http://www.sanger.ac.uk/resources/software/ssaha/

[7] http://samtools.sourceforge.net/

[8] http://www.broadinstitute.org/gsa/wiki/index.php/The_Genome_Analysis_Toolkit

[9] R. Sandberg, D. Goldberg, S. Kleiman, D. Walsh, and B. Lyon, “Design and Implementation of the Sun Network Filesystem”, In Proceedings of USENIX, 1985.

[10] http://opennebula.org/

[11] http://wikis.sun.com/display/GridEngine/Home

[12] http://www.innorix.com/en/DS

[13] http://sourceforge.net/apps/mediawiki/rapidant/

[14] http://neuma.kobic.re.kr/

[15] M. Schatz, “CloudBurst: Highly Sensitive Short Read Mapping with MapReduce”, In Proceedings of BioInformatics, 2009.

[16] A. Matsunaga, M. Tsugawa, and J. Fortes, “CloudBLAST: Combining MapReduce and Virtualization on Distributed Resources for Bioinformatics Applications”, In Proceedings of 4th IEEE International Conference on eScience, 2009.

[17] http://blast.ncbi.nlm.nih.gov/Blast.cgi

[18] F. Moreews, J. Piat, and O. Sallou, “OBIWEE : an open source bioinformatics cloud environment”, In Proceedings of 12th Annual Bioinformatics Open Source Conference, 2011.

[19] T. K. Kim, B. K. Hou, and W. S. Cho, “Private Cloud Computing Techniques for Inter-processing Bioinformatics Tools”, In Proceedings of Convergence and Hybrid Information Technology 2011.

[20] S. Lee, C. H. Seo, B. Lim, J. O. Yang, J. Oh, M. Kim, S. Lee, B. Lee, C. Kang, and S. Lee, “Accurate quantification of transcriptome from RNA-Seq data by effective length normalization”, In Proceedings of Nucleic Acids Res., 2011.