HPC pipeline and cloud-based solutions for Next Generation Sequencing data analysis

(1)

HPC pipeline and

cloud

-based solutions for

Next Generation Sequencing data analysis

HPC4NGS 2012, Valencia

Ignacio Medina

[email protected]

Scientific Computing Unit

Bioinformatics and Genomics Department

Centro de Investigación Príncipe Felipe (CIPF)

(2)

Index



Introduction



High Performance Computing (

HPC

)



Distributed and

cloud

computing (

cloud

)



NGS Data Analysis Pipeline



Next Steps



Conclusions

(3)

Introduction

New scenario in Biology



Next Generation Sequencing

(

NGS

) technology is changing the

way how researchers perform experiments. Many new experiments

are being conducted by sequencing:

exome re-sequencing, RNA-seq,

Meth-seq, ChIP-seq, ...



NGS is allowing researches to:

● Find exome variants responsible of diseases ● Study the whole transcriptome of a phenotype ● Establish the methylation state of a condition ● Locate DNA binding proteins



But

, this new experiments

have increased data size by 1000x

when

compared with microarrays, i.e. from MB to GB in expression analysis



Data processing and analysis

are becoming a

bottleneck

and a

nightmare,

from days or weeks with microarrays to months with

NGS,

and it will be worse as more data become available

(4)

Introduction

The NGS data



NGS throughput is amazing

, 30x human

genomes in less than a week,

and it's

becoming faster!



A single run can generate hundreds of

GB of data,

a typical NGS experiment

today can produce 1TB of data in 1

week



The good and bad news:

Prices falling

down quick!



The good news:

Parallelization is

possible!

Many of the analysis is short

read

independent



This new scenario

demands more

advanced software solutions

that

exploit the current computer hardware

properly

. We need new technologies like

HPC or cloud-based applications

In just 10 years!

(5)

Introduction

Current computer hardware



Multi-core CPUs

● Currently: quad-core and hexa-core processors, with hyper-threading

● Coming soon: Intel MIC (Many Integrated Core) architecture with more than 50

processing cores



GPU computing solutions

● Desktop: Nvidia GTX580 Fermi architecture (512 cores performing 1.5TFlops) ● HPC: Nvidia Tesla M2090 Fermi architecture (6GB memory, 512 cores,

1.3TFlops SP, 665 GFlops DP, memory bandwidth 177GB/sec)



AVX, a new version of SSE, SIMD extended to 256 bits. New

instruction set and registers

(6)

Introduction

Poor software performance



But,

software

in Bioinformatics is in general slow and not designed

for running in clusters or

clouds

, even new NGS software is still being

developed in Perl, R and Java (

Disclaimer: I usually code in Java, R

and Perl, but they are not recommendable for NGS data analysis

)



New NGS technologies push computer needs, new solutions and

technologies become necessary: more

analysis

, performance,

scalable, cloud



Computer performance vs data analysis

●

As a

Computer scientist

I want

high performance software

–

Fast, very fast! Speed matters, low memory footprint!

–

Scalable solutions: clusters, cloud

●

As a

Biologist

I need

useful data analysis tools

–

Efficient and sensitive

–

Extract relevant information

(7)

Introduction

Motivation and goals



As data analysis group we focus in

analysis

:

It's analysis

,

stupid

!

● ~1-2 weeks sample sequencing

● ~2-4 weeks preprocessing and read mapping, this is necessary to be

improved but is not enough!

● Several months of data analysis, lack of software for biological analysis, i.e.

which are the genes with a coverage >30x that are differentially expressed in a dataset of 50 BAMs accounting for 500GB of data with a p-value<0.05?



To develop new generation of software in bioinformatics to be

● fast, actually, very fast software! With a low memory consumption! ● be able to handle and analyze TB of data

● store data efficiently to query ● distribute computation, no data ● focused and useful in analysis!



Software

must

exploit current hardware and be

fast and efficient in

(8)

High Performance Computing (

HPC

)

Technologies Overview



OpenMP

● multi-platform shared-memory parallel programming in C/C++ and Fortran on all

architectures

● flexible interface for developing parallel applications for platforms ranging from the

desktop to the supercomputer



GPU computing

● Frameworks: Nvidia CUDA, Khronos OpenCL, OpenACC (in development)



Streaming SIMD Extensions

(

SSE

)

● SIMD instruction set extension to x86 architecture

● AVX (Advanced Vector Extensions) is an advanced version of SSE, SIMD register

extended from 128 bits to 256 bits with new instructions



Message Passing Interface (MPI)

● de facto standard for communication in a parallel program running on a

distributed-memory system



FPGAs

(9)

High Performance Computing (

HPC

)

Software architecture



Common

“mistakes”

in new

HPC

developers

●

Not all algorithms or problems are suitable for GPUs

●

CPUs multi-core must be also exploited, hyper-threading

●

SSE/AVX SIMD instructions can also improve algorithms, i.e. speed-up

6-8x of Smith-Waterman,

Farrar 2007

)



Heterogeneous

HPC

computation, in

shared-memory

●

CPU

(

OpenMP + SSE/AVX

) +

GPU

(

Nvidia CUDA

)



Hybrid approach

OpenMP + SSE/AVX Nvidia CUDA Hundreds of samples TB of data CPU GPU MPI Hadoop MapReduce

(10)

Distributed and

cloud

computing

Apache Hadoop framework



It's a Java framework library that allows for the distributed processing

of

large data sets

across clusters of computers using a simple

programming model



http://hadoop.apache.org/



Develops open-source software for reliable, scalable, distributed

computing

(11)

Distributed and

cloud

computing

Apache Hadoop framework

MapReduce implementation is a software framework for easily writing applications which process vast amounts of data (multi-terabyte data-sets) in-parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner

HBase is a sparse, distributed, persistent multidimensional sorted map. The map is indexed by a row key, column key, and a

timestamp; each value in the map is an uninterpreted array of bytes. No SQL schema, different number of columns is normal, very flexible

Who is using in Bioinformatics?

• GATK • Crossbow • 1000genomes

http://aws.amazon.com/1000genomes/

(12)

Distributed and

cloud

computing

Amazon AWS

http://aws.amazon.com/ “In 2006, Amazon Web Services (AWS) began offering

IT infrastructure services to businesses in the form of web services -- now commonly known as cloud

computing. One of the key benefits of cloud computing is the opportunity to replace up-front capital

infrastructure expenses with low variable costs that scale with your business”

(13)

NGS data analysis pipeline

Global overview

NGS

sequencer Fastq file, up to hundreds of GB per run QC and preprocessing

Short read aligner

QC and preprocessing

SAM/BAM file

VCF file

QC and preprocessing

Variant VCF viewer Variant effect analysis

QC stats, filtering and preprocessing options

Variant Calling analysis

Double mapping strategy:

Burrows-Wheeler Transform (GPU Nvidia CUDA) +

Smith-Waterman (CPU OpenMP+SSE/AVX)

QC stats, filtering and preprocessing options

GATK and SAM mPileup HPC Implementation.

Statistics genomic tests

Consequence type, regulatory variants and system biology information

QC stats, filtering and preprocessing options HTML5+SVG Web based viewer

More info

http://docs.bioinfo.cipf.es/projects/ngs-gpu-pipeline/wiki

Other analysis (HPC for genomics consortium)

RNA-seq (mRNA sequenced) DNA assembly (not a real analysis) Meth-seq Copy Number Transcript isoform ...

HPG suite

(14)

NGS data analysis pipeline

HPG-Fastq tool



Features:

● QC stats: read length and quality average, nucleotides %, GC% content, quality

per nucleotide, k-mers, ...

● Filtering: filter by length range, filer by quality range, num. nt out of range, N's, ... ● Preprocessing: quality ends trim, remove cloning vectors, ...



Hybrid approach implementation: Nvida CUDA + OpenMP



example:

hpg-fastq qc –fq fastq_file.fastq -t -v



Paired-end Fastq files of 15GB each QC'ed and preprocessed

in

only one execution in less than 4 minutes

in a standard

(15)

NGS data analysis pipeline

HPG-aligner, a HPC DNA short read aligner



Current read aligners software tend to fit in one of these groups:

● Very fast, but no too sensitive: no gaps, no indels, rna-seq... ● Slow, but very sensitive: up to 1 day by sample



Read Aligner algorithms

● Burrows-Wheeler Transform (BWT): very fast! No sensitive ● Smith-Waterman (SW): very sensitive but very slow



Hybrid approach (

paper sent

):

● BWT +1 error implemented in Nvidia CUDA ● SW implemented using OpenMP and SSE/AVX

Fastq file BWT-GPU < 2 errors SW OpenMP/SSE Find CALs with BWT-GPU (seeds) SAM file >= 2 errors Testing BWT-GPU Architecture 10x compared to BFAST ~50x with SSE

(16)

NGS data analysis pipeline

HPG-BAM tool



Features

●

BAM Sort implemented in Nvidia CUDA

(2x compared to

SAM tools)

●

Quality control

: % reads with N errors, % reads with multiple

mappings, strand bias, paired-end insert, ...

●

Coverage implemented in OpenMP (2x faster)

: by gene,

region, ...

●

Filtering

: by number of errors, number of hits, …

●

Comparator

: stats, intersection, ...



In a standard workstation works in minute scale



Not too much performance improvement, but many new

(17)

NGS data analysis pipeline

HPG-variant, suite of tools for variant analysis



HPG-VCF tools

: to analyze large VCFs files with a low memory

footprint: stats, filter, split, merge, …



HPG-Variant

: suite of tools for genome-scale variant analysis



Variant calling:

is a critical step, two HPC implementation

are being developed: GATK, SAMtools mpileup



Genomic analysis

: association, TDT, Hardy-Weinberg, LOH,

… (

~Plink

)



VCF WEB Viewer and VARIANT software (

later

)

(18)

NGS data analysis pipeline

CellBase, an integrative database and RESTful WS API



Based on

CellBase

(

accepted in

NAR

), a comprehensive integrative

database and

RESTful Web Services

API

, more than 120GB of data and

100 SQL tables exported in TXT and

JSON:

● Core features: genes, transcripts, exons,

cytobands, proteins (UniProt),...

● Variation: dbSNP and Ensembl SNPs,

HapMap, 1000Genomes, Cosmic, ...

● Functional: 40 OBO ontologies(Gene

Ontology), Interpro, ...

● Regulatory: TFBS, miRNA targets,

conserved regions, …

● System biology: Interactome (IntAct),

Reactome database, co-expressed genes, ...

(19)

NGS data analysis pipeline

Genome Maps, a HTML5+SVG genome browser



Visualization is an important part of the data analysis:

heavy data

issue! Do not move data!



Features of

Genome Maps

(

www.genomemaps.org

,

paper sent

)

● First 100% web based: HTML5+SVG (Google Maps inspired) ● Always updated, no browser plugins or installation

● Data from CellBase database and DAS servers: genes, transcripts, exons,

SNPs, TFBS, miRNA targets, …

(20)

NGS data analysis pipeline

VARIANT, a genomic

VARI

ant

AN

alysis

T

ool



A genomic variant effect predictor tool, recursive acronym:

 VARIANT = VARIant ANalysis Tool (http://variant.bioinfo.cipf.es, accepted in NAR) 

VARIANT combines

CellBase

Web Services and

Genome Maps

to

build a

cloud

variant effect predictor,

no installation, always updated



Variant effects:

● Genomic location: intronic, non-synonymous, upstream, …

● Regulatory location: TFBS, miRNA targets, conserved regions, …



Three interfaces: RESTful, web application and CLI

Java RESTful WS API

MySQL clusters +120GB of data Three clients available WEB browser (AJAX) WEB

application CLI program

Fast! - GZip - Multi Thread

(21)

NGS data analysis pipeline

Moving data analysis pipeline forward



HPC for Genomics Consortium

, CIPF and Bull leading a

Consortium with UV, UPV and UJI Universities. More than

15 HPC computational researches

and developers joined



Extending data analysis pipeline to

●

DNA assembly

: new assembly algorithm and HPC implementation is

being developed

●

RNA-seq

: RNA mapping, quantification, diff. expr. analysis, isoform

detection, ...

●

Meth-seq

: new bisulfite-seq mapping algorithm, diff. methylated region

analysis, …

●

Copy Number

: detect amplification and deletions, LOH, ...



Next sub-projects

●

ChIP-seq

●

Data analysis integration: eQTL(variant+expr), expr+methylation

●

Distributing: MPI-CUDA, rCUDA, Hadoop MapReduce

(22)

Next steps

Variant NoSQL Database



Currently thousands of

VCFs

are being

generated, VCF files are being stored in

repositories...

not enough



Motivation:

variant analysis

! Biological

questions remain to be answered easily, ie:

● Which variants with more than 20x are shared in a

single gene in more than 80% of disease samples?



Too many variants for a relational SQL

database

, we need new solutions, new

technology available:

NoSQL distributed

databases



A

NoSQL database and Java and RESTful

WS API

to store and analyze TeraBytes of

variant data is being implemented using

Hadoop Framework

, release in a few months

Hadoop Java API Variant Analysis Java API

RESTful WS API

WEB browser Perl, python, Java, ...

(23)

Next steps

Browsing NGS data, HTML5 web based solution



Postulate: “As data size increase

people tend to develop desktop

software for visualization and

analysis“ i.e. GBrowse,

Cytoscape, snpEff, etc.



We took the opposite direction

(Google and cloud approach):

● Moving high volumes of data is not

scalable (~TB of data)

● Cloud distributed and collaborative

solutions



Based on

Genome Maps

visualization technology:

● Browse mapped reads

● Examine variants from a familiy ● Analysis widgets integrated

Client

SAM/BAM data VCF/NoSQL data Thousands of files

~TBs of data

RESTful WS API Specification

Java implementation provided

SAM tools API VCF homemade API

(24)

Next steps

HPC for Genomics Consortium



HPC for Genomics Consortium

, CIPF and Bull leading a

Consortium with UV, UPV and UJI Universities. More than

15 HPC computational researches

and developers joined



Last weeks some computational researchers from Germany or

Chinese BGI have raised interest so far in this project



Motivation:

●

Computation is a bottleneck and the current development model based

in many small solutions does not solve big problems

●

As sequencing technology throughput improve and prices go down this

problem will become much worse

●

Focus in analysis, many more analysis are needed



So,

bioinformatician

with

strong background in

(25)

Next Steps

Future Clinic

project



Future Clinic project aims to introduce

personalized medicine in the Valencian

health system



Consortium formed by Indra, Bull,

Valencian Health System, Universitat

Politecnica, … and led by CIPF



Specific goals:

● Develop algorithms and software to

process, store and analyze genomic information

● Integrate this information into health

system: Digital Clinic History

● Develop software and technology to aid

to diagnosis



This project is funded by the

(26)

Conclusions



NGS technology is pushing computer needs,

new solutions are necessary



The huge data demands faster and more

efficient software solution, HPC can be applied



New architecture model, more cloud solutions



It's not only a hardware problem, we need better

software



Analysis software to handle this data is being

(27)

Acknowledgements

Joaquín Dopazo Head of Department

Franscisco Salavert Ruben Sánchez Web and Java

developers

Joaquín Tárraga Cristina Y. González Víctor Requena (BULL) HPC developers

HPC Consortium:

- Ignacio Blanquer (UPV) - José Salavert (UPV) - Jose Duato (UPV) - Juan M. Orduña (UV) - Vicente Arnau (UV) - Enrique Quintana (UJI) - Mario Gómez (BULL) - ...

And the rest of the department!!

Bull Extreme Computing (BUX)

Others developers Alejandro de María_{Roberto Alonso} Jose Carbonell