Enabling Large Scale Scientific
Computations for Expressed
Sequence Tag Sequencing over Grid
and Cloud Computing Cluster
Sangmi Lee Pallickara, Marlon Pierce,
Qunfeng Dong, Chin Hua Kong
Indiana University, Bloomington IN,
USA
The EST Pipeline
• The goal is to cluster mRNA sequences
– Overlapping sequences are grouped together into different clusters and then
– A consensus sequence is derived from each cluster.
– CAP3 is one program to assemble contiguous sequences.
• Data sources: NCBI GenBank, short read gene sequencers in the lab, etc.
– Too large to do with serial codes like CAP3
• We use PaCE (S. Aluru) to do a pre-clustering step for large sequences (parallel problem).
– 1 large data set --> many smaller clusters
– Each individual cluster can be fed into CAP3.
– We replaced the memory problem with the many-task problem.
– This is data-file parallel.
• Next step: do the CAP3 consensus sequences match any known sequences?
http://swarm.cgb.indiana.edu
•
Our goal is to provide a
Web service-based
science portal
that can
handle the largest
mRNA clustering
problems.
•
Computation is
outsourced to Grids
(TeraGrid) and Clouds
(Amazon)
– Not provided by in-house clusters.
•
This is an open service,
open architecture
Swarm: Large scale job submission
infrastructure over the distributed clusters
•
Web Service to submit and monitor 10,000’s (or more)
serial or parallel jobs.
•
Capabilities:
– Scheduling large number of jobs over distributed HPC clusters (Grid clusters, Cloud cluster and MS Windows HPC cluster)
– Monitoring framework for the large scale jobs
– Standard Web service interface for web application
– Extensible design for the domain specific software logics
– Brokers both Grid and Cloud submissions
•
Other applications:
– Calculate properties of all drug-like molecules in PubChem (Gaussian)
(Revised) Architecture of Swarm Service
Windows Server Cluster
Swarm-Grid Swarm-Dryad
Local RDMBS
Swarm-Analysis
Standard Web Service Interface Large Task Load Optimizer
Swarm-Grid
Connector Swarm-DryadConnector Swarm-HadoopConnector
Cloud Comp. Cluster Grid HPC/
Condor Cluster
Swarm-Grid
•
Swarm considers
traditional Grid HPC
cluster are suitable for
the high-throughput
jobs.
– Parallel jobs (e.g. MPI jobs)
– Long running jobs
•
Resource Ranking
Manager
– Prioritizes the resources with QBETS, INCA
•
Fault Manager
– Fatal faults– Recoverable faults
Resource Ranking Manager
Grid HPC/Condor pool Resource Connector
Condor(Grid/Vanilla) with Birdbath
Grid HPC ClustersGrid HPCClustersGrid HPC
ClustersGrid HPCClusters
Condor Cluster Standard Web Service Interface
Swarm-Grid QBETS Web Service Local RDMBS MyProxy Server Hosted by TeraGrid Project Hosted by UCSB
Request Manager
Job Distributor Job Queue Data Model
Swarm-Hadoop
• Suitable for short running serial job collections
• Submit jobs to the cloud computing clusters:
Amazon’s EC2 or Eucalyptus
• Uses Hadoop map-reduce engine.
• Each job processed as a single Map function:
• Input/output location is determined by the Data Model Manager
– Easy to modify for the domain specific
requirements.
Swarm-Hadoop
Hadoop Map Reduce Programming interface
Cloud Computing Cluster
Hadoop Resource Connector Job Producer
DataModel Manager Fault Manager
Local RDMBS
Standard WebService Interface
Job Buffer
Request Manager
Performance Evaluation
• Java JDK 1.6 or higher, Apache Axis2
• Server: 3.40 GHz Inter Pentium 4 CPU, 1GB RAM
• Swarm Grid:
– Backend TeraGrid machines: Big Red (Indiana University), Ranger (Texas Advanced Computing Center), and NSTG (Oak Ridge National Lab)
• Swarm-Hadoop:
– Computing Nodes: Amazon Web Service EC2 cluster with m1.small instance (2.5 GHz Dual-core AMD Opteron with 1.7GB RAM)
• Swarm-Windows HPC:
– Microsoft Windows HPC cluster, 2.39GHz CPUs, 49.15GB RAM, 24 cores, 4 sockets
• Dataset: partial set of the human EST fragments (published by NCBI GenBank)
– 4.6 GB total
Total Execution time of CAP3 execution for the
various numbers of jobs (~1 minute) with
Conclusions
•
Bioinformatics needs both
computing Grids
and
scientific Clouds
–
Problem sizes range over many orders of magnitude
•
Swarm is designed to bridge the gap between the
two, while supporting
10,000’s or more jobs
per
user per problem.
•
Smart scheduling is an issue in data-parallel
computing
–
Small Jobs(~1min) were processed more efficiently by
Swarm-Hadoop and Swarm-Dryad.
–
Grid style HPC clusters adds minutes (or even longer) of
overhead to each of jobs.
More Information
•
Email:
leesangm AT cs.indiana.edu mpierce AT
cs.indiana.edu
•
S
warm Web Site: h
ttp://www.collab-ogce.org/ogce/index.php/Swarm
•
Sw
arm on SourceForge:
ht
tp://ogce.svn.sourceforge.net/viewvc/ogce/
Computational Challenges in the EST
Sequencing
•
Challenge 1: Executing tens of thousands of jobs.
– More than 100 plant species have at least 10,000 EST
sequences; tens of thousand assembly jobs are processed.
– Standard queuing systems used by Grid based clusters do NOT allow users to submit 1000s of jobs concurrently to batch queue systems.
•
Challenge 2: Requirement of job processing is various
– To complete EST assembly process, various types of computation jobs must be processed. E.g. large scale
parallel processing, serial processing, and embarrassingly parallel jobs.
Tools for EST Sequence Assembly
Cleaning sequence
reads RepeatMasker
SEG, XNU, RepeatRunner,
PILER
Clustering
sequence reads PaCE
Cap3 Clustering, BAG, Clusterer, CluSTr, UI Cluster
and many more
Assemble reads Cap3 PHRAP, TIGRFAKII, GAP4, Assembler
Swarm-Grid:
Submitting High-throughput jobs-2
• User(personal account,
community account) based job management: policies in the Gird clusters are based on the user. • Job Distributor: matchmaking
available resources and submitted jobs.
• Job Execution Manager: submit jobs through CondorG using
birdbath WS interface
• Condor resource connector
manages to job to be submitted to the Grid HPC clusters or
traditional Condor cluster.
Resource Ranking Manager
Grid HPC/Condor pool Resource Connector
Condor(Grid/Vanilla) with Birdbath
Grid HPC ClustersGrid HPCClustersGrid HPC
ClustersGrid HPCClusters
Condor Cluster Standard WebService InterfaceSwarm-Grid QBET Web Service Local RDMBS MyProxy Server Hosted by TeraGrid Project Hosted by UCSB
Request Manager
Job Distributor Job Queue DataModel
EST Sequencing Pipeline
• EST (Expressed Sequence Tag): A fragment of Messenger RNAs (mRNAs) which is transcribed from the genes residing on chromosomes.
• EST Sequencing: Re-constructing full length of mRNA sequences for each expressed gene by means of
assembling EST fragments.
• EST sequencing is a standard practice for gene discovery, especially for the genomes of many organisms which may be too complex for whole-genome sequencing. (e.g. wheat) • EST contigs are important data for accurate gene
annotation.
• A pipeline of computational steps is required:
Computing resources for computing
intensive Biological Research
•
Biologically based researches require substantial
amount of computing resources.
•
Many of current computing is based on the limited
local computing infrastructure.
•
Available computing resources include:
– US national cyberinfrastructure (e.g. TeraGrid) good fit for closely coupled parallel application
– Cloud computing clusters (e.g. Amazon EC2, Eucalyptus) : good for on-demand jobs that individually last a few
seconds or minutes
– Microsoft Windows based HPC cluster(DryAd) : Job