HPC pipeline and cloud-based solutions for Next Generation Sequencing data analysis
HPC4NGS 2012, Valencia
Ignacio Medina
Scientific Computing Unit
Bioinformatics and Genomics Department
Centro de Investigación Príncipe Felipe (CIPF)
Index
Introduction
High Performance Computing (HPC)
Distributed and cloud computing (cloud)
NGS Data Analysis Pipeline
Next Steps
Conclusions
Introduction
New scenario in Biology
Next Generation Sequencing (NGS) technology is changing the way researchers perform experiments. Many new experiments are being conducted by sequencing: exome re-sequencing, RNA-seq, Meth-seq, ChIP-seq, ...
NGS is allowing researchers to:
● Find exome variants responsible for diseases
● Study the whole transcriptome of a phenotype
● Establish the methylation state of a condition
● Locate DNA binding proteins
But these new experiments have increased data size by 1000x compared with microarrays, e.g. from MB to GB in expression analysis
Data processing and analysis are becoming a bottleneck and a nightmare: from days or weeks with microarrays to months with NGS, and it will get worse as more data become available
Introduction
The NGS data
NGS throughput is amazing: 30x human genomes in less than a week, and it's becoming faster!
A single run can generate hundreds of GB of data; a typical NGS experiment today can produce 1TB of data in 1 week
The good and bad news: prices are falling fast!
The good news: parallelization is possible! Much of the analysis can be done independently for each short read
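Because each short read can be handled without looking at any other read, a batch of reads can be split into chunks and given to a pool of workers. The following is only a minimal sketch of that decomposition: the per-read task (`gc_fraction`), the chunk size and the use of a Python thread pool are assumptions made for illustration, not part of the pipeline described in these slides.

```python
from concurrent.futures import ThreadPoolExecutor


def gc_fraction(read):
    """Fraction of G/C bases in one read (hypothetical per-read task)."""
    if not read:
        return 0.0
    return sum(base in "GC" for base in read) / len(read)


def process_chunk(chunk):
    # Each chunk is processed without any knowledge of the other chunks.
    return [gc_fraction(read) for read in chunk]


def parallel_gc(reads, workers=4, chunk_size=2):
    chunks = [reads[i:i + chunk_size] for i in range(0, len(reads), chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() keeps chunk order, so results line up with the input reads.
        results = pool.map(process_chunk, chunks)
    return [value for chunk in results for value in chunk]


reads = ["ACGT", "GGCC", "ATAT", "GCGA", "TTTT"]
gc_values = parallel_gc(reads)
print(gc_values)
```

In CPython, threads will not speed up CPU-bound work like this; the point is the chunk-per-worker structure, which is the same one an OpenMP loop or a CUDA kernel would exploit.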
This new scenario demands more advanced software solutions that exploit the current computer hardware properly. We need new technologies like HPC or cloud-based applications
In just 10 years!
Introduction
Current computer hardware
Multi-core CPUs
● Currently: quad-core and hexa-core processors, with hyper-threading
● Coming soon: Intel MIC (Many Integrated Core) architecture with more than 50 processing cores
GPU computing solutions
● Desktop: Nvidia GTX580 Fermi architecture (512 cores performing 1.5TFlops)
● HPC: Nvidia Tesla M2090 Fermi architecture (6GB memory, 512 cores, 1.3TFlops SP, 665 GFlops DP, memory bandwidth 177GB/sec)
AVX, a new version of SSE, with SIMD extended to 256 bits. New instruction set and registers
Introduction
Poor software performance
But software in Bioinformatics is in general slow and not designed for running on clusters or clouds; even new NGS software is still being developed in Perl, R and Java (Disclaimer: I usually code in Java, R and Perl, but they are not recommended for NGS data analysis)
New NGS technologies push computing needs, so new solutions and technologies become necessary: more analysis, performance, scalability, cloud
Computer performance vs data analysis
● As a computer scientist I want high performance software
– Fast, very fast! Speed matters, low memory footprint!
– Scalable solutions: clusters, cloud
● As a biologist I need useful data analysis tools
– Efficient and sensitive
– Extract relevant information
Introduction
Motivation and goals
As a data analysis group we focus on analysis: It's analysis, stupid!
● ~1-2 weeks of sample sequencing
● ~2-4 weeks of preprocessing and read mapping; this needs to be improved, but that is not enough!
● Several months of data analysis, due to a lack of software for biological analysis, e.g. which genes with a coverage >30x are differentially expressed in a dataset of 50 BAMs accounting for 500GB of data with a p-value<0.05?
To develop a new generation of bioinformatics software that is
● fast, actually very fast, with low memory consumption
● able to handle and analyze TB of data
● able to store data efficiently for querying
● able to distribute computation, not data
● focused on and useful for analysis!
Software must exploit current hardware and be fast and efficient in
High Performance Computing (HPC)
Technologies Overview
OpenMP
● multi-platform shared-memory parallel programming in C/C++ and Fortran on all architectures
● flexible interface for developing parallel applications for platforms ranging from the desktop to the supercomputer
GPU computing
● Frameworks: Nvidia CUDA, Khronos OpenCL, OpenACC (in development)
Streaming SIMD Extensions (SSE)
● SIMD instruction set extension to x86 architecture
● AVX (Advanced Vector Extensions) is an advanced version of SSE, with SIMD registers extended from 128 bits to 256 bits and new instructions
Message Passing Interface (MPI)
● de facto standard for communication in a parallel program running on a distributed-memory system
FPGAs
High Performance Computing (HPC)
Software architecture
Common “mistakes” of new HPC developers
● Not all algorithms or problems are suitable for GPUs
● Multi-core CPUs must also be exploited, including hyper-threading
● SSE/AVX SIMD instructions can also improve algorithms (e.g. a 6-8x speed-up of Smith-Waterman, Farrar 2007)
Heterogeneous HPC computation, in shared-memory
● CPU (OpenMP + SSE/AVX) + GPU (Nvidia CUDA)
Hybrid approach
[Figure: hybrid approach diagram with labels: CPU (OpenMP + SSE/AVX), GPU (Nvidia CUDA), MPI, Hadoop MapReduce, hundreds of samples, TB of data]
Distributed and cloud computing
Apache Hadoop framework
It's a Java framework library that allows for the distributed processing of large data sets across clusters of computers using a simple programming model
http://hadoop.apache.org/
Develops open-source software for reliable, scalable, distributed computing
Distributed and cloud computing
Apache Hadoop framework
MapReduce implementation is a software framework for easily writing applications which process vast amounts of data (multi-terabyte data-sets) in-parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner
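The MapReduce programming model can be sketched on a single machine as three steps: map each record to key/value pairs, group the pairs by key, and reduce each group. The example below is a toy illustration in plain Python, not Hadoop code; the task (counting aligned reads per chromosome), the record layout and the function names are assumptions chosen for the example.

```python
from collections import defaultdict


def map_phase(record):
    """Emit (key, value) pairs: here one (chromosome, 1) pair per aligned read."""
    chromosome, _position = record
    yield chromosome, 1


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values of one key into a single result."""
    return key, sum(values)


def map_reduce(records):
    pairs = (pair for record in records for pair in map_phase(record))
    groups = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in groups.items())


# Made-up alignment records: (chromosome, position).
alignments = [("chr1", 100), ("chr2", 250), ("chr1", 900), ("chrX", 42), ("chr1", 5)]
reads_per_chromosome = map_reduce(alignments)
print(reads_per_chromosome)
```

In a real cluster the map and reduce calls run on different nodes and the grouping step moves data between them; that distribution and the fault tolerance are what the framework adds on top of this model.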
HBase is a sparse, distributed, persistent multidimensional sorted map. The map is indexed by a row key, column key, and a timestamp; each value in the map is an uninterpreted array of bytes. No SQL schema, different number of columns is normal, very flexible
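The data model described above can be pictured as a map from (row key, column key, timestamp) to bytes, kept in sorted key order. The class below is a toy in-memory stand-in written for illustration only; it is not the HBase API, and the class name, method names and the row/column naming scheme are all assumptions.

```python
class SparseSortedMap:
    """Toy model of a (row, column, timestamp) -> bytes map; not the HBase API."""

    def __init__(self):
        self._cells = {}

    def put(self, row, column, timestamp, value):
        if not isinstance(value, bytes):
            raise TypeError("values are uninterpreted byte arrays")
        self._cells[(row, column, timestamp)] = value

    def get(self, row, column):
        """Return the value with the newest timestamp, or None if absent."""
        versions = [(ts, value) for (r, c, ts), value in self._cells.items()
                    if r == row and c == column]
        if not versions:
            return None
        return max(versions, key=lambda item: item[0])[1]

    def scan(self, start_row, stop_row):
        """Yield cells with start_row <= row < stop_row in sorted key order."""
        for key in sorted(self._cells):
            if start_row <= key[0] < stop_row:
                yield key, self._cells[key]


table = SparseSortedMap()
table.put("chr1:1000", "info:ref", 1, b"A")
table.put("chr1:1000", "info:alt", 1, b"G")
table.put("chr1:1000", "info:alt", 2, b"T")   # newer version of the same cell
table.put("chr2:500", "info:ref", 1, b"C")    # this row has no "info:alt" column

latest_alt = table.get("chr1:1000", "info:alt")
chr1_keys = [key for key, _ in table.scan("chr1", "chr2")]
print(latest_alt, chr1_keys)
```

Rows may carry different columns (the map is sparse), and because keys are sorted, a range scan over a row-key prefix only touches neighbouring entries.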
Who is using it in Bioinformatics?
• GATK
• Crossbow
• 1000genomes
http://aws.amazon.com/1000genomes/
Distributed and cloud computing
Amazon AWS
http://aws.amazon.com/
“In 2006, Amazon Web Services (AWS) began offering IT infrastructure services to businesses in the form of web services -- now commonly known as cloud computing. One of the key benefits of cloud computing is the opportunity to replace up-front capital infrastructure expenses with low variable costs that scale with your business”
NGS data analysis pipeline
Global overview
[Figure: pipeline diagram, stages in order]
● NGS sequencer: Fastq file, up to hundreds of GB per run
● QC and preprocessing: QC stats, filtering and preprocessing options
● Short read aligner: double mapping strategy, Burrows-Wheeler Transform (GPU Nvidia CUDA) + Smith-Waterman (CPU OpenMP+SSE/AVX)
● SAM/BAM file
● QC and preprocessing: QC stats, filtering and preprocessing options
● Variant calling analysis: GATK and SAM mPileup HPC implementation
● VCF file
● QC and preprocessing: QC stats, filtering and preprocessing options
● Statistics genomic tests
● Variant effect analysis: consequence type, regulatory variants and system biology information
● Variant VCF viewer: HTML5+SVG web based viewer
More info
http://docs.bioinfo.cipf.es/projects/ngs-gpu-pipeline/wiki
Other analysis (HPC for genomics consortium)
● RNA-seq (mRNA sequenced)
● DNA assembly (not a real analysis)
● Meth-seq
● Copy Number
● Transcript isoform
● ...
HPG suite
NGS data analysis pipeline
HPG-Fastq tool
Features:
● QC stats: read length and quality average, nucleotides %, GC% content, quality per nucleotide, k-mers, ...
● Filtering: filter by length range, filter by quality range, num. nt out of range, N's, ...
● Preprocessing: quality ends trim, remove cloning vectors, ...
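A few of the QC statistics listed above (mean read length, mean base quality, GC%, number of N's) can be computed in a single pass over a FASTQ file. This is a plain-Python sketch for illustration, not the HPG-Fastq implementation; it assumes 4-line FASTQ records and Phred+33 quality encoding, and the function names and sample reads are made up.

```python
import io


def fastq_records(handle):
    """Yield (sequence, quality_string) from 4-line FASTQ records."""
    while True:
        header = handle.readline()
        if not header:
            return              # end of file
        if not header.strip():
            continue            # skip blank lines between records
        sequence = handle.readline().strip()
        handle.readline()       # '+' separator line
        quality = handle.readline().strip()
        yield sequence, quality


def qc_stats(handle, phred_offset=33):
    """Single-pass QC summary; assumes Phred+33 qualities by default."""
    n_reads = total_len = total_qual = gc = n_bases = 0
    for sequence, quality in fastq_records(handle):
        n_reads += 1
        total_len += len(sequence)
        total_qual += sum(ord(ch) - phred_offset for ch in quality)
        upper = sequence.upper()
        gc += upper.count("G") + upper.count("C")
        n_bases += upper.count("N")
    if total_len == 0:
        return None
    return {
        "reads": n_reads,
        "mean_length": total_len / n_reads,
        "mean_quality": total_qual / total_len,
        "gc_percent": 100 * gc / total_len,
        "n_bases": n_bases,
    }


# Two made-up reads; 'I' encodes Q40 and '#' encodes Q2 in Phred+33.
sample = "@r1\nACGT\n+\nIIII\n@r2\nGGNNAT\n+\nII##II\n"
stats = qc_stats(io.StringIO(sample))
print(stats)
```

Since every read contributes to the totals independently, the same accumulation can be split across threads or GPU blocks and the partial sums merged at the end.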
Hybrid approach implementation: Nvidia CUDA + OpenMP
example:
hpg-fastq qc –fq fastq_file.fastq -t -v
Paired-end Fastq files of 15GB each QC'ed and preprocessed in only one execution in less than 4 minutes in a standard
NGS data analysis pipeline
HPG-aligner, an HPC DNA short read aligner
Current read aligner software tends to fit in one of these groups:
● Very fast, but not too sensitive: no gaps, no indels, rna-seq...
● Slow, but very sensitive: up to 1 day per sample
Read Aligner algorithms
● Burrows-Wheeler Transform (BWT): very fast, but not sensitive
● Smith-Waterman (SW): very sensitive but very slow
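To make the cost of Smith-Waterman concrete, here is a minimal scalar sketch that computes the best local alignment score with a linear gap penalty. It fills one cell per (query base, target base) pair, which is why it is sensitive but slow. The scoring values are assumptions chosen for the example, and this is neither the SSE/AVX-vectorized version nor the HPG-aligner code.

```python
def smith_waterman_score(query, target, match=2, mismatch=-1, gap=-2):
    """Best local alignment score (linear gap penalty), O(len(query) * len(target))."""
    previous = [0] * (len(target) + 1)   # previous row of the DP matrix
    best = 0
    for q in query:
        current = [0]                    # column 0 is always 0
        for j, t in enumerate(target, start=1):
            diagonal = previous[j - 1] + (match if q == t else mismatch)
            up = previous[j] + gap       # gap in the target
            left = current[j - 1] + gap  # gap in the query
            score = max(0, diagonal, up, left)   # 0 restarts the local alignment
            current.append(score)
            best = max(best, score)
        previous = current
    return best


print(smith_waterman_score("ACGT", "ACGT"))
print(smith_waterman_score("TTACGTTT", "GGACGTGG"))
print(smith_waterman_score("AAAA", "TTTT"))
```

Only two rows of the matrix are kept because the score alone is needed; recovering the alignment itself would require storing the full matrix or traceback pointers.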
Hybrid approach (paper sent):
● BWT +1 error implemented in Nvidia CUDA
● SW implemented using OpenMP and SSE/AVX
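The speed of BWT-based mapping comes from backward search: each base of the read narrows an interval of the suffix array in constant time, independent of the reference length. The sketch below builds a naive index and performs exact matching only; the one-error search and the CUDA implementation mentioned above are not shown, and the function names and toy reference are assumptions.

```python
def bwt_index(text):
    """Build suffix array, first-column offsets and occurrence counts for text + '$'."""
    text += "$"
    # Naive suffix array construction; fine for a toy example only.
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in suffix_array)
    alphabet = sorted(set(text))
    first, total = {}, 0
    for c in alphabet:
        first[c] = total            # number of characters smaller than c
        total += text.count(c)
    occ = {c: [0] for c in alphabet}
    for ch in bwt:
        for c in alphabet:          # occ[c][i] = count of c in bwt[:i]
            occ[c].append(occ[c][-1] + (ch == c))
    return suffix_array, first, occ


def exact_match_positions(pattern, index):
    """Backward search: one interval update per pattern character."""
    suffix_array, first, occ = index
    low, high = 0, len(suffix_array)     # half-open suffix array interval
    for ch in reversed(pattern):
        if ch not in first:
            return []
        low = first[ch] + occ[ch][low]
        high = first[ch] + occ[ch][high]
        if low >= high:
            return []
    return sorted(suffix_array[low:high])


index = bwt_index("ACGTACGT")
print(exact_match_positions("CGT", index))
print(exact_match_positions("GTT", index))
```

Reads that cannot be placed this way are the ones a hybrid design would pass on to the slower Smith-Waterman stage.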
[Figure: aligner architecture with labels: Fastq file, BWT-GPU, < 2 errors, >= 2 errors, find CALs with BWT-GPU (seeds), SW OpenMP/SSE, SAM file]
Testing: 10x compared to BFAST, ~50x with SSE
NGS data analysis pipeline
HPG-BAM tool
Features
● BAM Sort implemented in Nvidia CUDA (2x compared to SAM tools)
● Quality control: % reads with N errors, % reads with multiple mappings, strand bias, paired-end insert, ...
● Coverage implemented in OpenMP (2x faster): by gene, region, ...
● Filtering: by number of errors, number of hits, …
● Comparator: stats, intersection, ...
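Coverage over a region can be computed without touching every base of every read by recording +1 at each read start and -1 at each read end, then taking a running sum. The sketch below shows that idea in plain Python; it is not the HPG-BAM implementation, and it assumes reads are given as half-open (start, end) intervals on a single chromosome.

```python
from itertools import accumulate


def per_base_coverage(reads, region_start, region_end):
    """Depth at each position of [region_start, region_end) from (start, end) reads."""
    length = region_end - region_start
    delta = [0] * (length + 1)
    for start, end in reads:
        # Clip the read to the region and shift to region-relative coordinates.
        lo = max(start, region_start) - region_start
        hi = min(end, region_end) - region_start
        if lo < hi:
            delta[lo] += 1      # coverage rises where the read begins
            delta[hi] -= 1      # and falls just after it ends
    return list(accumulate(delta[:-1]))


def mean_coverage(reads, region_start, region_end):
    depth = per_base_coverage(reads, region_start, region_end)
    return sum(depth) / len(depth) if depth else 0.0


# Made-up read intervals over a 10-base region (e.g. one small gene).
reads = [(0, 5), (3, 8), (7, 12)]
depth = per_base_coverage(reads, 0, 10)
print(depth, mean_coverage(reads, 0, 10))
```

Each region (gene, exon, window) can be processed independently, so a loop over regions is a natural candidate for an OpenMP-style parallel for.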
On a standard workstation it runs on a scale of minutes
Not too much performance improvement, but many new
NGS data analysis pipeline
HPG-variant, suite of tools for variant analysis
HPG-VCF tools: to analyze large VCF files with a low memory footprint: stats, filter, split, merge, …
HPG-Variant: suite of tools for genome-scale variant analysis
Variant calling: a critical step; two HPC implementations are being developed: GATK, SAMtools mpileup
Genomic analysis: association, TDT, Hardy-Weinberg, LOH, … (~Plink)
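As an example of the genomic tests listed above, a Hardy-Weinberg check for one biallelic variant compares the observed genotype counts with the counts expected from the allele frequencies (p², 2pq, q²). The sketch below computes the standard chi-square statistic; it is an illustration under that textbook formulation, not the HPG-Variant or Plink implementation, and the function name and counts are made up.

```python
def hardy_weinberg_chi2(n_aa, n_ab, n_bb):
    """Chi-square statistic (1 degree of freedom) for deviation from HWE."""
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotypes observed")
    p = (2 * n_aa + n_ab) / (2 * n)      # frequency of allele A
    q = 1 - p                            # frequency of allele B
    expected = (p * p * n, 2 * p * q * n, q * q * n)
    observed = (n_aa, n_ab, n_bb)
    # Skip classes with zero expectation (monomorphic variants).
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)


print(hardy_weinberg_chi2(25, 50, 25))   # genotype counts exactly at equilibrium
print(hardy_weinberg_chi2(50, 0, 50))    # no heterozygotes at all
```

With one degree of freedom, a statistic above about 3.84 corresponds to p < 0.05. The test is independent per variant, so a genome-scale VCF can be processed variant by variant in parallel with a low memory footprint.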
VCF WEB Viewer and VARIANT software (later)
NGS data analysis pipeline
CellBase, an integrative database and RESTful WS API
Based on CellBase (accepted in NAR), a comprehensive integrative database and RESTful Web Services API, with more than 120GB of data and 100 SQL tables exported in TXT and JSON:
● Core features: genes, transcripts, exons, cytobands, proteins (UniProt), ...
● Variation: dbSNP and Ensembl SNPs, HapMap, 1000Genomes, Cosmic, ...
● Functional: 40 OBO ontologies (Gene Ontology), Interpro, ...
● Regulatory: TFBS, miRNA targets, conserved regions, …
● System biology: Interactome (IntAct), Reactome database, co-expressed genes, ...
NGS data analysis pipeline
Genome Maps, a HTML5+SVG genome browser
Visualization is an important part of the data analysis: heavy data issue! Do not move data!
Features of Genome Maps (www.genomemaps.org, paper sent)
● First 100% web based: HTML5+SVG (Google Maps inspired)
● Always updated, no browser plugins or installation
● Data from CellBase database and DAS servers: genes, transcripts, exons, SNPs, TFBS, miRNA targets, …
NGS data analysis pipeline
VARIANT, a genomic VARIant ANalysis Tool
A genomic variant effect predictor tool, recursive acronym:
VARIANT = VARIant ANalysis Tool (http://variant.bioinfo.cipf.es, accepted in NAR)
VARIANT combines CellBase Web Services and Genome Maps to build a cloud variant effect predictor, no installation, always updated
Variant effects:
● Genomic location: intronic, non-synonymous, upstream, …
● Regulatory location: TFBS, miRNA targets, conserved regions, …
Three interfaces: RESTful, web application and CLI
[Figure: architecture with labels: Java RESTful WS API; MySQL clusters, +120GB of data; three clients available: web browser (AJAX), web application, CLI program; Fast! GZip, multi-thread]
NGS data analysis pipeline
Moving data analysis pipeline forward
HPC for Genomics Consortium: CIPF and Bull are leading a consortium with the UV, UPV and UJI universities. More than 15 HPC computational researchers and developers have joined
Extending the data analysis pipeline to
● DNA assembly: a new assembly algorithm and HPC implementation are being developed
● RNA-seq: RNA mapping, quantification, diff. expr. analysis, isoform detection, ...
● Meth-seq: new bisulfite-seq mapping algorithm, diff. methylated region analysis, …
● Copy Number: detect amplifications and deletions, LOH, ...
Next sub-projects
● ChIP-seq
● Data analysis integration: eQTL (variant+expr), expr+methylation
● Distributing: MPI-CUDA, rCUDA, Hadoop MapReduce
Next steps
Variant NoSQL Database
Currently thousands of VCFs are being generated and stored in repositories... but that is not enough
Motivation: variant analysis! Biological questions still cannot be answered easily, e.g.:
● Which variants with more than 20x are shared in a single gene in more than 80% of disease samples?
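The question above can be written down as a small in-memory query, which makes clear what a variant database has to support: filtering by depth, grouping by gene and variant, and counting distinct samples. This is a sketch under one possible reading of the question (a variant qualifies when it has depth above 20x in more than 80% of the disease samples); the record layout, function name, sample names and gene names are all made up for illustration.

```python
from collections import defaultdict


def shared_variants(calls, disease_samples, min_depth=20, min_fraction=0.8):
    """Return {gene: [variants]} seen above min_depth in > min_fraction of disease samples."""
    if not disease_samples:
        return {}
    carriers = defaultdict(set)      # (gene, variant) -> disease samples carrying it
    for sample, gene, variant, depth in calls:
        if sample in disease_samples and depth > min_depth:
            carriers[(gene, variant)].add(sample)
    by_gene = defaultdict(list)
    for (gene, variant), samples in carriers.items():
        if len(samples) / len(disease_samples) > min_fraction:
            by_gene[gene].append(variant)
    return {gene: sorted(variants) for gene, variants in by_gene.items()}


disease = {"d1", "d2", "d3", "d4", "d5"}
calls = (
    [(s, "GENE_A", "v1", 30) for s in ("d1", "d2", "d3", "d4", "d5")]    # 5 of 5 samples
    + [(s, "GENE_A", "v2", 25) for s in ("d1", "d2", "d3", "d4")]        # 4 of 5, not > 80%
    + [(s, "GENE_B", "v3", 40) for s in ("d1", "d2", "d3", "d4")]
    + [("d5", "GENE_B", "v3", 10)]                                       # depth too low
    + [("c1", "GENE_A", "v2", 50)]                                       # control sample, ignored
)
result = shared_variants(calls, disease)
print(result)
```

On thousands of VCFs the list of calls no longer fits in memory, which is the motivation for pushing this kind of filter-and-group query into a distributed store.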
Too many variants for a relational SQL database; we need new solutions, and new technology is available: NoSQL distributed databases
A NoSQL database with Java and RESTful WS APIs to store and analyze terabytes of variant data is being implemented using the Hadoop Framework, release in a few months
[Figure: architecture with labels: Hadoop Java API, Variant Analysis Java API, RESTful WS API; clients: web browser, Perl, Python, Java, ...]
Next steps
Browsing NGS data, HTML5 web based solution
Postulate: “As data size increases, people tend to develop desktop software for visualization and analysis”, e.g. GBrowse, Cytoscape, snpEff, etc.
We took the opposite direction (Google and cloud approach):
● Moving high volumes of data is not scalable (~TB of data)
● Cloud distributed and collaborative solutions
Based on Genome Maps visualization technology:
● Browse mapped reads
● Examine variants from a family
● Analysis widgets integrated
[Figure: architecture with labels: client; RESTful WS API specification, Java implementation provided; SAM tools API, VCF homemade API; SAM/BAM data, VCF/NoSQL data; thousands of files, ~TBs of data]
Next steps
HPC for Genomics Consortium
HPC for Genomics Consortium: CIPF and Bull are leading a consortium with the UV, UPV and UJI universities. More than 15 HPC computational researchers and developers have joined
In recent weeks some computational researchers from Germany and from the Chinese BGI have shown interest in this project
Motivation:
● Computation is a bottleneck, and the current development model based on many small solutions does not solve big problems
● As sequencing technology throughput improves and prices go down, this problem will become much worse
● Focus on analysis: many more analyses are needed
So, bioinformatician with strong background in
Next Steps
Future Clinic project
The Future Clinic project aims to introduce personalized medicine in the Valencian health system
Consortium formed by Indra, Bull, Valencian Health System, Universitat Politecnica, … and led by CIPF
Specific goals:
● Develop algorithms and software to process, store and analyze genomic information
● Integrate this information into the health system: Digital Clinic History
● Develop software and technology to aid diagnosis
This project is funded by the
Conclusions
NGS technology is pushing computing needs; new solutions are necessary
The huge data volume demands faster and more efficient software solutions; HPC can be applied
New architecture model, more cloud solutions
It's not only a hardware problem, we need better software
Analysis software to handle this data is being
Acknowledgements
Joaquín Dopazo, Head of Department
Francisco Salavert, Ruben Sánchez: Web and Java developers
Joaquín Tárraga, Cristina Y. González, Víctor Requena (BULL): HPC developers
HPC Consortium:
- Ignacio Blanquer (UPV)
- José Salavert (UPV)
- Jose Duato (UPV)
- Juan M. Orduña (UV)
- Vicente Arnau (UV)
- Enrique Quintana (UJI)
- Mario Gómez (BULL)
- ...
And the rest of the department!!
Bull Extreme Computing (BUX)
Other developers: Alejandro de María, Roberto Alonso, Jose Carbonell