• No results found

HPC pipeline and cloud-based solutions for Next Generation Sequencing data analysis

N/A
N/A
Protected

Academic year: 2021

Share "HPC pipeline and cloud-based solutions for Next Generation Sequencing data analysis"

Copied!
27
0
0

Loading.... (view fulltext now)

Full text

(1)

HPC pipeline and

cloud

-based solutions for

Next Generation Sequencing data analysis

HPC4NGS 2012, Valencia

Ignacio Medina

[email protected]

Scientific Computing Unit

Bioinformatics and Genomics Department

Centro de Investigación Príncipe Felipe (CIPF)

(2)

Index

Introduction

High Performance Computing (

HPC

)

Distributed and

cloud

computing (

cloud

)

NGS Data Analysis Pipeline

Next Steps

Conclusions

(3)

Introduction

New scenario in Biology

Next Generation Sequencing

(

NGS

) technology is changing the

way how researchers perform experiments. Many new experiments

are being conducted by sequencing:

exome re-sequencing, RNA-seq,

Meth-seq, ChIP-seq, ...

NGS is allowing researches to:

● Find exome variants responsible of diseases ● Study the whole transcriptome of a phenotype ● Establish the methylation state of a condition ● Locate DNA binding proteins

But

, this new experiments

have increased data size by 1000x

when

compared with microarrays, i.e. from MB to GB in expression analysis

Data processing and analysis

are becoming a

bottleneck

and a

nightmare,

from days or weeks with microarrays to months with

NGS,

and it will be worse as more data become available

(4)

Introduction

The NGS data

NGS throughput is amazing

, 30x human

genomes in less than a week,

and it's

becoming faster!

A single run can generate hundreds of

GB of data,

a typical NGS experiment

today can produce 1TB of data in 1

week

The good and bad news:

Prices falling

down quick!

The good news:

Parallelization is

possible!

Many of the analysis is short

read

independent

This new scenario

demands more

advanced software solutions

that

exploit the current computer hardware

properly

. We need new technologies like

HPC or cloud-based applications

In just 10 years!

(5)

Introduction

Current computer hardware

Multi-core CPUs

Currently: quad-core and hexa-core processors, with hyper-threading

Coming soon: Intel MIC (Many Integrated Core) architecture with more than 50

processing cores

GPU computing solutions

Desktop: Nvidia GTX580 Fermi architecture (512 cores performing 1.5TFlops) ● HPC: Nvidia Tesla M2090 Fermi architecture (6GB memory, 512 cores,

1.3TFlops SP, 665 GFlops DP, memory bandwidth 177GB/sec)

AVX, a new version of SSE, SIMD extended to 256 bits. New

instruction set and registers

(6)

Introduction

Poor software performance

But,

software

in Bioinformatics is in general slow and not designed

for running in clusters or

clouds

, even new NGS software is still being

developed in Perl, R and Java (

Disclaimer: I usually code in Java, R

and Perl, but they are not recommendable for NGS data analysis

)

New NGS technologies push computer needs, new solutions and

technologies become necessary: more

analysis

, performance,

scalable, cloud

Computer performance vs data analysis

As a

Computer scientist

I want

high performance software

Fast, very fast! Speed matters, low memory footprint!

Scalable solutions: clusters, cloud

As a

Biologist

I need

useful data analysis tools

Efficient and sensitive

Extract relevant information

(7)

Introduction

Motivation and goals

As data analysis group we focus in

analysis

:

It's analysis

,

stupid

!

● ~1-2 weeks sample sequencing

● ~2-4 weeks preprocessing and read mapping, this is necessary to be

improved but is not enough!

Several months of data analysis, lack of software for biological analysis, i.e.

which are the genes with a coverage >30x that are differentially expressed in a dataset of 50 BAMs accounting for 500GB of data with a p-value<0.05?

To develop new generation of software in bioinformatics to be

fast, actually, very fast software! With a low memory consumption! ● be able to handle and analyze TB of data

● store data efficiently to query ● distribute computation, no data ● focused and useful in analysis!

Software

must

exploit current hardware and be

fast and efficient in

(8)

High Performance Computing (

HPC

)

Technologies Overview

OpenMP

● multi-platform shared-memory parallel programming in C/C++ and Fortran on all

architectures

● flexible interface for developing parallel applications for platforms ranging from the

desktop to the supercomputer

GPU computing

● Frameworks: Nvidia CUDA, Khronos OpenCL, OpenACC (in development)

Streaming SIMD Extensions

(

SSE

)

SIMD instruction set extension to x86 architecture

AVX (Advanced Vector Extensions) is an advanced version of SSE, SIMD register

extended from 128 bits to 256 bits with new instructions

Message Passing Interface (MPI)

de facto standard for communication in a parallel program running on a

distributed-memory system

FPGAs

(9)

High Performance Computing (

HPC

)

Software architecture

Common

“mistakes”

in new

HPC

developers

Not all algorithms or problems are suitable for GPUs

CPUs multi-core must be also exploited, hyper-threading

SSE/AVX SIMD instructions can also improve algorithms, i.e. speed-up

6-8x of Smith-Waterman,

Farrar 2007

)

Heterogeneous

HPC

computation, in

shared-memory

CPU

(

OpenMP + SSE/AVX

) +

GPU

(

Nvidia CUDA

)

Hybrid approach

OpenMP + SSE/AVX Nvidia CUDA Hundreds of samples TB of data CPU GPU MPI Hadoop MapReduce

(10)

Distributed and

cloud

computing

Apache Hadoop framework

It's a Java framework library that allows for the distributed processing

of

large data sets

across clusters of computers using a simple

programming model

http://hadoop.apache.org/

Develops open-source software for reliable, scalable, distributed

computing

(11)

Distributed and

cloud

computing

Apache Hadoop framework

MapReduce implementation is a software framework for easily writing applications which process vast amounts of data (multi-terabyte data-sets) in-parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner

HBase is a sparse, distributed, persistent multidimensional sorted map. The map is indexed by a row key, column key, and a

timestamp; each value in the map is an uninterpreted array of bytes. No SQL schema, different number of columns is normal, very flexible

Who is using in Bioinformatics?

• GATK • Crossbow • 1000genomes

http://aws.amazon.com/1000genomes/

(12)

Distributed and

cloud

computing

Amazon AWS

http://aws.amazon.com/ “In 2006, Amazon Web Services (AWS) began offering

IT infrastructure services to businesses in the form of web services -- now commonly known as cloud

computing. One of the key benefits of cloud computing is the opportunity to replace up-front capital

infrastructure expenses with low variable costs that scale with your business”

(13)

NGS data analysis pipeline

Global overview

NGS

sequencer Fastq file, up to hundreds of GB per run QC and preprocessing

Short read aligner

QC and preprocessing

SAM/BAM file

VCF file

QC and preprocessing

Variant VCF viewer Variant effect analysis

QC stats, filtering and preprocessing options

Variant Calling analysis

Double mapping strategy:

Burrows-Wheeler Transform (GPU Nvidia CUDA) +

Smith-Waterman (CPU OpenMP+SSE/AVX)

QC stats, filtering and preprocessing options

GATK and SAM mPileup HPC Implementation.

Statistics genomic tests

Consequence type, regulatory variants and system biology information

QC stats, filtering and preprocessing options HTML5+SVG Web based viewer

More info

http://docs.bioinfo.cipf.es/projects/ngs-gpu-pipeline/wiki

Other analysis (HPC for genomics consortium)

RNA-seq (mRNA sequenced) DNA assembly (not a real analysis) Meth-seq Copy Number Transcript isoform ...

HPG suite

(14)

NGS data analysis pipeline

HPG-Fastq tool

Features:

QC stats: read length and quality average, nucleotides %, GC% content, quality

per nucleotide, k-mers, ...

Filtering: filter by length range, filer by quality range, num. nt out of range, N's, ... ● Preprocessing: quality ends trim, remove cloning vectors, ...

Hybrid approach implementation: Nvida CUDA + OpenMP

example:

hpg-fastq qc –fq fastq_file.fastq -t -v

Paired-end Fastq files of 15GB each QC'ed and preprocessed

in

only one execution in less than 4 minutes

in a standard

(15)

NGS data analysis pipeline

HPG-aligner, a HPC DNA short read aligner

Current read aligners software tend to fit in one of these groups:

Very fast, but no too sensitive: no gaps, no indels, rna-seq... ● Slow, but very sensitive: up to 1 day by sample

Read Aligner algorithms

Burrows-Wheeler Transform (BWT): very fast! No sensitive ● Smith-Waterman (SW): very sensitive but very slow

Hybrid approach (

paper sent

):

● BWT +1 error implemented in Nvidia CUDA ● SW implemented using OpenMP and SSE/AVX

Fastq file BWT-GPU < 2 errors SW OpenMP/SSE Find CALs with BWT-GPU (seeds) SAM file >= 2 errors Testing BWT-GPU Architecture 10x compared to BFAST ~50x with SSE

(16)

NGS data analysis pipeline

HPG-BAM tool

Features

BAM Sort implemented in Nvidia CUDA

(2x compared to

SAM tools)

Quality control

: % reads with N errors, % reads with multiple

mappings, strand bias, paired-end insert, ...

Coverage implemented in OpenMP (2x faster)

: by gene,

region, ...

Filtering

: by number of errors, number of hits, …

Comparator

: stats, intersection, ...

In a standard workstation works in minute scale

Not too much performance improvement, but many new

(17)

NGS data analysis pipeline

HPG-variant, suite of tools for variant analysis

HPG-VCF tools

: to analyze large VCFs files with a low memory

footprint: stats, filter, split, merge, …

HPG-Variant

: suite of tools for genome-scale variant analysis

Variant calling:

is a critical step, two HPC implementation

are being developed: GATK, SAMtools mpileup

Genomic analysis

: association, TDT, Hardy-Weinberg, LOH,

… (

~Plink

)

VCF WEB Viewer and VARIANT software (

later

)

(18)

NGS data analysis pipeline

CellBase, an integrative database and RESTful WS API

Based on

CellBase

(

accepted in

NAR

), a comprehensive integrative

database and

RESTful Web Services

API

, more than 120GB of data and

100 SQL tables exported in TXT and

JSON:

● Core features: genes, transcripts, exons,

cytobands, proteins (UniProt),...

● Variation: dbSNP and Ensembl SNPs,

HapMap, 1000Genomes, Cosmic, ...

● Functional: 40 OBO ontologies(Gene

Ontology), Interpro, ...

● Regulatory: TFBS, miRNA targets,

conserved regions, …

● System biology: Interactome (IntAct),

Reactome database, co-expressed genes, ...

(19)

NGS data analysis pipeline

Genome Maps, a HTML5+SVG genome browser

Visualization is an important part of the data analysis:

heavy data

issue! Do not move data!

Features of

Genome Maps

(

www.genomemaps.org

,

paper sent

)

● First 100% web based: HTML5+SVG (Google Maps inspired) ● Always updated, no browser plugins or installation

● Data from CellBase database and DAS servers: genes, transcripts, exons,

SNPs, TFBS, miRNA targets, …

(20)

NGS data analysis pipeline

VARIANT, a genomic

VARI

ant

AN

alysis

T

ool

A genomic variant effect predictor tool, recursive acronym:

VARIANT = VARIant ANalysis Tool (http://variant.bioinfo.cipf.es, accepted in NAR) 

VARIANT combines

CellBase

Web Services and

Genome Maps

to

build a

cloud

variant effect predictor,

no installation, always updated

Variant effects:

Genomic location: intronic, non-synonymous, upstream, …

Regulatory location: TFBS, miRNA targets, conserved regions, …

Three interfaces: RESTful, web application and CLI

Java RESTful WS API

MySQL clusters +120GB of data Three clients available WEB browser (AJAX) WEB

application CLI program

Fast! - GZip - Multi Thread

(21)

NGS data analysis pipeline

Moving data analysis pipeline forward

HPC for Genomics Consortium

, CIPF and Bull leading a

Consortium with UV, UPV and UJI Universities. More than

15

HPC computational researches

and developers joined

Extending data analysis pipeline to

DNA assembly

: new assembly algorithm and HPC implementation is

being developed

RNA-seq

: RNA mapping, quantification, diff. expr. analysis, isoform

detection, ...

Meth-seq

: new bisulfite-seq mapping algorithm, diff. methylated region

analysis, …

Copy Number

: detect amplification and deletions, LOH, ...

Next sub-projects

ChIP-seq

Data analysis integration: eQTL(variant+expr), expr+methylation

Distributing: MPI-CUDA, rCUDA, Hadoop MapReduce

(22)

Next steps

Variant NoSQL Database

Currently thousands of

VCFs

are being

generated, VCF files are being stored in

repositories...

not enough

Motivation:

variant analysis

! Biological

questions remain to be answered easily, ie:

● Which variants with more than 20x are shared in a

single gene in more than 80% of disease samples?

Too many variants for a relational SQL

database

, we need new solutions, new

technology available:

NoSQL distributed

databases

A

NoSQL database and Java and RESTful

WS API

to store and analyze TeraBytes of

variant data is being implemented using

Hadoop Framework

, release in a few months

Hadoop Java API Variant Analysis Java API

RESTful WS API

WEB browser Perl, python, Java, ...

(23)

Next steps

Browsing NGS data, HTML5 web based solution

Postulate: “As data size increase

people tend to develop desktop

software for visualization and

analysis“ i.e. GBrowse,

Cytoscape, snpEff, etc.

We took the opposite direction

(Google and cloud approach):

● Moving high volumes of data is not

scalable (~TB of data)

● Cloud distributed and collaborative

solutions

Based on

Genome Maps

visualization technology:

● Browse mapped reads

● Examine variants from a familiy ● Analysis widgets integrated

Client

SAM/BAM data VCF/NoSQL data Thousands of files

~TBs of data

RESTful WS API Specification

Java implementation provided

SAM tools API VCF homemade API

(24)

Next steps

HPC for Genomics Consortium

HPC for Genomics Consortium

, CIPF and Bull leading a

Consortium with UV, UPV and UJI Universities. More than

15

HPC computational researches

and developers joined

Last weeks some computational researchers from Germany or

Chinese BGI have raised interest so far in this project

Motivation:

Computation is a bottleneck and the current development model based

in many small solutions does not solve big problems

As sequencing technology throughput improve and prices go down this

problem will become much worse

Focus in analysis, many more analysis are needed

So,

bioinformatician

with

strong background in

(25)

Next Steps

Future Clinic

project

Future Clinic project aims to introduce

personalized medicine in the Valencian

health system

Consortium formed by Indra, Bull,

Valencian Health System, Universitat

Politecnica, … and led by CIPF

Specific goals:

● Develop algorithms and software to

process, store and analyze genomic information

● Integrate this information into health

system: Digital Clinic History

● Develop software and technology to aid

to diagnosis

This project is funded by the

(26)

Conclusions

NGS technology is pushing computer needs,

new solutions are necessary

The huge data demands faster and more

efficient software solution, HPC can be applied

New architecture model, more cloud solutions

It's not only a hardware problem, we need better

software

Analysis software to handle this data is being

(27)

Acknowledgements

Joaquín Dopazo Head of Department

Franscisco Salavert Ruben Sánchez Web and Java

developers

Joaquín Tárraga Cristina Y. González Víctor Requena (BULL) HPC developers

HPC Consortium:

- Ignacio Blanquer (UPV) - José Salavert (UPV) - Jose Duato (UPV) - Juan M. Orduña (UV) - Vicente Arnau (UV) - Enrique Quintana (UJI) - Mario Gómez (BULL) - ...

And the rest of the department!!

Bull Extreme Computing (BUX)

Others developers Alejandro de MaríaRoberto Alonso Jose Carbonell

References

Related documents

This case study gathered the parents’ perceptions about their role in educating children in an attempt to make a recommendation of strategies to help further improve

From a psychological perspective which focuses on interpersonal rela tionships, trust is seen as: “expectancy held by an individual or a group that the word, promise, verbal or

Agricultural subsidies in developed countries are an example of this case.. this, developing country governments often nationalize the enterprises in the CAD industries in order

List of Appendices Appendix 1: An overview of typologies of welfare states Appendix 2: Population in Mongolia, thousand persons Appendix 3: The Social Security Fund in the USSR:

tourism policy planning represents for European Union countries, Norway and in the case of.. the present master thesis analysis Stavanger planners, must consider the needs

In this study, a number of variables were discussed from different point of view, including the favour rating of the hotel in region, the rank of hotel in

This research project should be of use to companies in the North West when trying to establish a Social Media Marketing program, it will identify methods to be used. M Sc in M

Beebe &amp; Beebe (2012), said in persuasive strategy there are some types, using evidence consists of using credible evidence, using new evidence, using specific evidence,