Enabling Large Scale Scientific Computations for Expressed Sequence Tag Sequencing over Grid and Cloud Computing Clusters

(1)

Enabling Large Scale Scientific

Computations for Expressed

Sequence Tag Sequencing over Grid

and Cloud Computing Cluster

Sangmi Lee Pallickara, Marlon Pierce,

Qunfeng Dong, Chin Hua Kong

Indiana University, Bloomington IN,

USA

(2)

(3)

The EST Pipeline

• The goal is to cluster mRNA sequences

– Overlapping sequences are grouped together into different clusters and then

– A consensus sequence is derived from each cluster.

– CAP3 is one program to assemble contiguous sequences.

• Data sources: NCBI GenBank, short read gene sequencers in the lab, etc.

– Too large to do with serial codes like CAP3

• We use PaCE (S. Aluru) to do a pre-clustering step for large sequences (parallel problem).

– 1 large data set --> many smaller clusters

– Each individual cluster can be fed into CAP3.

– We replaced the memory problem with the many-task problem.

– This is data-file parallel.

• Next step: do the CAP3 consensus sequences match any known sequences?

(4)

http://swarm.cgb.indiana.edu

• Our goal is to provide a

Web service-based

science portal

that can

handle the largest

mRNA clustering

problems.

• Computation is

outsourced to Grids

(TeraGrid) and Clouds

(Amazon)

– Not provided by in-house clusters.

• This is an open service,

open architecture

(5)

(6)

(7)

Swarm: Large scale job submission

infrastructure over the distributed clusters

• Web Service to submit and monitor 10,000’s (or more)

serial or parallel jobs.

• Capabilities:

– Scheduling large number of jobs over distributed HPC clusters (Grid clusters, Cloud cluster and MS Windows HPC cluster)

– Monitoring framework for the large scale jobs

– Standard Web service interface for web application

– Extensible design for the domain specific software logics

– Brokers both Grid and Cloud submissions

• Other applications:

– Calculate properties of all drug-like molecules in PubChem (Gaussian)

(8)

(Revised) Architecture of Swarm Service

Windows Server Cluster

Swarm-Grid Swarm-_Dryad

Local RDMBS

Swarm-Analysis

Standard Web Service Interface Large Task Load Optimizer

Swarm-Grid

Connector Swarm-DryadConnector Swarm-HadoopConnector

Cloud Comp. Cluster Grid HPC/

Condor Cluster

(9)

Swarm-Grid

• Swarm considers

traditional Grid HPC

cluster are suitable for

the high-throughput

jobs.

– Parallel jobs (e.g. MPI jobs)

– Long running jobs

• Resource Ranking

Manager

– Prioritizes the resources with QBETS, INCA

• Fault Manager

– Fatal faults

– Recoverable faults

Resource Ranking Manager

Grid HPC/Condor pool Resource Connector

Condor(Grid/Vanilla) with Birdbath

Grid HPC ClustersGrid HPC_ClustersGrid HPC

ClustersGrid HPC_Clusters

Condor Cluster Standard Web Service Interface

Swarm-Grid QBETS Web Service Local RDMBS MyProxy Server Hosted by TeraGrid Project Hosted by UCSB

Request Manager

Job Distributor Job Queue Data Model

(10)

Swarm-Hadoop

• Suitable for short running serial job collections

• Submit jobs to the cloud computing clusters:

Amazon’s EC2 or Eucalyptus

• Uses Hadoop map-reduce engine.

• Each job processed as a single Map function:

• Input/output location is determined by the Data Model Manager

– Easy to modify for the domain specific

requirements.

Swarm-Hadoop

Hadoop Map Reduce Programming interface

Cloud Computing Cluster

Hadoop Resource Connector Job Producer

DataModel Manager Fault Manager

Local RDMBS

Standard WebService Interface

Job Buffer

Request Manager

(11)

Performance Evaluation

• Java JDK 1.6 or higher, Apache Axis2

• Server: 3.40 GHz Inter Pentium 4 CPU, 1GB RAM

• Swarm Grid:

– Backend TeraGrid machines: Big Red (Indiana University), Ranger (Texas Advanced Computing Center), and NSTG (Oak Ridge National Lab)

• Swarm-Hadoop:

– Computing Nodes: Amazon Web Service EC2 cluster with m1.small instance (2.5 GHz Dual-core AMD Opteron with 1.7GB RAM)

• Swarm-Windows HPC:

– Microsoft Windows HPC cluster, 2.39GHz CPUs, 49.15GB RAM, 24 cores, 4 sockets

• Dataset: partial set of the human EST fragments (published by NCBI GenBank)

– 4.6 GB total

(12)

Total Execution time of CAP3 execution for the

various numbers of jobs (~1 minute) with

(13)

(14)

Conclusions

• Bioinformatics needs both

computing Grids

and

scientific Clouds

–

Problem sizes range over many orders of magnitude

• Swarm is designed to bridge the gap between the

two, while supporting

10,000’s or more jobs

per

user per problem.

• Smart scheduling is an issue in data-parallel

computing

–

Small Jobs(~1min) were processed more efficiently by

Swarm-Hadoop and Swarm-Dryad.

–

Grid style HPC clusters adds minutes (or even longer) of

overhead to each of jobs.

(15)

More Information

• Email:

leesangm AT cs.indiana.edu mpierce AT

cs.indiana.edu

• S

warm Web Site: h

ttp://www.collab-ogce.org/ogce/index.php/Swarm

• Sw

arm on SourceForge:

ht

tp://ogce.svn.sourceforge.net/viewvc/ogce/

(16)

Computational Challenges in the EST

Sequencing

• Challenge 1: Executing tens of thousands of jobs.

– More than 100 plant species have at least 10,000 EST

sequences; tens of thousand assembly jobs are processed.

– Standard queuing systems used by Grid based clusters do NOT allow users to submit 1000s of jobs concurrently to batch queue systems.

• Challenge 2: Requirement of job processing is various

– To complete EST assembly process, various types of computation jobs must be processed. E.g. large scale

parallel processing, serial processing, and embarrassingly parallel jobs.

(17)

Tools for EST Sequence Assembly

Cleaning sequence

reads RepeatMasker

SEG, XNU, RepeatRunner,

PILER

Clustering

sequence reads PaCE

Cap3 Clustering, BAG, Clusterer, CluSTr, UI Cluster

and many more

Assemble reads Cap3 PHRAP, TIGRFAKII, GAP4, Assembler

(18)

Swarm-Grid:

Submitting High-throughput jobs-2

• User(personal account,

community account) based job management: policies in the Gird clusters are based on the user. • Job Distributor: matchmaking

available resources and submitted jobs.

• Job Execution Manager: submit jobs through CondorG using

birdbath WS interface

• Condor resource connector

manages to job to be submitted to the Grid HPC clusters or

traditional Condor cluster.

Resource Ranking Manager

Grid HPC/Condor pool Resource Connector

Condor(Grid/Vanilla) with Birdbath

Grid HPC ClustersGrid HPC_ClustersGrid HPC

ClustersGrid HPC_Clusters

Condor Cluster Standard WebService InterfaceSwarm-Grid QBET Web Service Local RDMBS MyProxy Server Hosted by TeraGrid Project Hosted by UCSB

Request Manager

Job Distributor Job Queue DataModel

(19)

(20)

(21)

EST Sequencing Pipeline

• EST (Expressed Sequence Tag): A fragment of Messenger RNAs (mRNAs) which is transcribed from the genes residing on chromosomes.

• EST Sequencing: Re-constructing full length of mRNA sequences for each expressed gene by means of

assembling EST fragments.

• EST sequencing is a standard practice for gene discovery, especially for the genomes of many organisms which may be too complex for whole-genome sequencing. (e.g. wheat) • EST contigs are important data for accurate gene

annotation.

• A pipeline of computational steps is required:

(22)

Computing resources for computing

intensive Biological Research

• Biologically based researches require substantial

amount of computing resources.

• Many of current computing is based on the limited

local computing infrastructure.

• Available computing resources include:

– US national cyberinfrastructure (e.g. TeraGrid) good fit for closely coupled parallel application

– Cloud computing clusters (e.g. Amazon EC2, Eucalyptus) : good for on-demand jobs that individually last a few

seconds or minutes

– Microsoft Windows based HPC cluster(DryAd) : Job