• No results found

Hadoopizer : a cloud environment for bioinformatics data analysis

N/A
N/A
Protected

Academic year: 2021

Share "Hadoopizer : a cloud environment for bioinformatics data analysis"

Copied!
24
0
0

Loading.... (view fulltext now)

Full text

(1)

Hadoopizer : a cloud environment for 

bioinformatics data analysis

Anthony Bretaudeau (1), Olivier Sallou (2), Olivier Collin (3)

(1) [email protected], INRIA/Irisa, Campus de Beaulieu, 35042, RENNES, Cedex, France (2) [email protected], INRIA/Irisa, Campus de Beaulieu, 35042, RENNES, Cedex, France

(3) [email protected], INRIA/Irisa, Campus de Beaulieu, 35042, RENNES, Cedex, France

Journées scientifiques mésocentres et France Grilles 3 Octobre 2012

(2)

Data deluge in biology

Next generation sequencing (NGS)

Very cheap DNA sequencing Compute and storage grow slower than sequence  production “Data deluge” This is the first wave ● Proteomic ● Imaging ● ... Kahn. On the future of genomic data. Science (2011) vol. 331 (6018) pp. 728-9
(3)

Next Generation Sequencing technologies

DNA

Alphabet = A, T, G, C (1 letter = 1 “base”) Human genome = 3 Gb

Next generation sequencing (NGS)

1 run (11 days) = 6 Billion reads of 100 b

Many methods

De novo assembly of a genome Mapping on a genome ...
(4)

NGS application: example of mapping

Many algorithms (101*)

Sequencing errors, mutations, repetitions, gaps

Parallelization

Indexed genome Each read can be mapped individually Reads (short sequences) * source: seqanswers.com wiki
(5)

Data management

Production facilities

Sequencing centers Genoscope, BGI, companies, ... Sequencers in labs => disseminated

Computing resources

Disseminated too => transfers in labs, on platforms (or none) Peaks of activity

      => Engineering challenges

(6)

Why the cloud

Extensible computing resources

Amazon, Azure, private clouds

Parallelization

SGE or Hadoop clusters

Reproducible results

Shareable virtual machines
(7)

Hadoop

MapReduce framework

Distributed processing of large datasets

Scalability

1 to thousands of nodes

Fault tolerance

Detect node failure and retry

Hadoop-related projects

Hbase, Hive, Cassandra, Pig, ...
(8)

Hadoop

Hadoop in a few words

1. Load input data as key/values 2. Distribute them to computing node 3. Map(): transform to new key/values pairs 4. Reduce(): combine values having the same key 5.Write to output file
(9)

Hadoop

Node 1 Node 2 Node 3 Node 4 MAP

REDUCE

Big input file

Complete result file Node 2

(10)

The project

Cloud computing and bioinformatics data:

Does it work? Do we miss some tools? Will it be efficient?

Private cloud setup: GenoCloud

Inventory of existing solutions

Development of Hadoopizer

(11)

Private cloud

Private cloud for test purpose

Based on OpenNebula

240 cores, 940 Gb memory, 8 Tb storage EC2 compatibility

Ready to use images

Sge cluster Hadoop cluster

       http://genocloud.genouest.org

(12)

Inventory

Already existing tools:

CloudAligner Mapping (specific algorithm) CloudBurst Mapping (RMAP algorithm) Contrail De novo assembler (specific algorithm) Crossbow SNPs detection (uses Bowtie and SOAPsnp) Myrna RNAseq, differential expression (bowtie, R/Bioconductor)
(13)

Inventory

Less specific tools:

Eoulsan Filtering, mapping, differential expression Pipelines Extendable (but not simple) CloudMan SGE cluster (with console) + Galaxy frontend
(14)

Bioinfo applications in the cloud

Main limitations

Fast evolution of both data & algorithms Obsolescence of algorithms (myrna, cloudburst, contrail) Evolution cost Test new algorithms Custom code/scripts (very common) Incompatible dependencies

Missing

Launching custom command line Splitters adapted to bioinformatics data formats Some glue to make life easier

       => Hadoopizer

(15)

Non parallelized treatment >a sequence AAAATGCGTCGTACCGT >another sequence TGTCGTACTGGTAC >third sequence TGTCGTACAAACGTTCGA ... EXECUTION of command line on all data Big input file Complete result file

(16)

Hadoopizer: how it works >a sequence AAAATGCGTCGTACCGT >another sequence TGTCGTACTGGTAC >third sequence TGTCGTACAAACGTTCGA ... Input chunks SPLITTING with specific splitter EXECUTION of command line on received data REDUCING with specific reducer Big input file Output chunks Complete result file

(17)

Hadoopizer: using it

Xml example

Command line example

hadoop jar hadoopizer.jar -c job_config.xml -w hdfs://master_node_ip/bowtie_tmp/ <?xml version="1.0" encoding="utf-8"?> <job> <command>

bowtie -m 1 --best --strata -S ${genome} ${reads} > ${mapped} </command>

<input id="genome">

<url autocomplete="true">/home/example/indexed_genome</url> </input>

<input id="reads" split="true">

<url splitter="fastq">/home/example/reads.fastq</url> </input>

<outputs>

<url>/home/example/output_mapping/</url> <output id="mapped" reducer="sam" />

</outputs> </job>

(18)

Hadoopizer: using it

Xml example

Command line example

hadoop jar hadoopizer.jar -c job_config.xml -w hdfs://master_node_ip/bowtie_tmp/ <?xml version="1.0" encoding="utf-8"?> <job> <command>

bowtie -m 1 --best --strata -S ${genome} ${reads} > ${mapped} </command>

<input id="genome">

<url autocomplete="true">/home/example/indexed_genome</url> </input>

<input id="reads" split="true">

<url splitter="fastq">/home/example/reads.fastq</url> </input>

<outputs>

<url>/home/example/output_mapping/</url> <output id="mapped" reducer="sam" />

</outputs> </job>

(19)

Hadoopizer: other features

Software deployment

hadoop jar hadoopizer.jar -b binaries.tar.gz [...] Extracted in work dir on each node

Supported formats

Fasta, Fastq, Sam, (Bam)

Compression

Input and output

Multiple input+output

Support of paired sequences
(20)

Mapping example: first results

First results on mapping:

Reference genome: 400 Mb Reads: 11 Gb

With hadoopizer

343 splits on 10 nodes 55 min for map step, 3h for reduce step This is a test cloud, not optimized for performances Configuration tuning, code improvements, ...

Same command on 1 machine (same config)

4h30
(21)

Hadoopizer: performances

Benchmarks

The next step of the project Comparison with: non parallelized SGE parallelized other implementations using Hadoop Take into account the transfer time

Expected results:

Overhead due to I/O on computing nodes
(22)

Hadoopizer conclusion

Coarse grain parallelism

For embarrassingly parallel problems

Works with any command line

Perspectives

Support more data formats (bed, wiggle, gff, ...) Performance issues
(23)

Perspectives

6 months work

1.0 released on github

https://github.com/genouest/hadoopizer

Open position to continue the project

Benchmarks Public clouds Support other formats, new features Real life applications
(24)

Thank you!

www.genouest.org genocloud.genouest.org

github.com/genouest/hadoopizer [email protected]

References

Related documents

In this work, we demonstrate the in situ formation of a solid ice water layer on the silica and detail the mechanism of interfacial Li-ion conduction by the increased

The relational database model in this research allows complete information and incomplete information to exist in the same relation variable, but places missing data values in

In order to complement the experimental result, a model was derived and used as a tool for evaluating the functional influence of the two process parameters; copper input and

Important challenges for the future of Austrian well-being arise from demographic and environmental trends, which make both synergies and trade-offs between well- being dimensions

In England and Wales this campaign resulted in the 2011 introduction of the domestic violence disclosure scheme (Clare’s Law), a scheme aimed at preventing the

Abstract : At the moment 51 genus, 180 species of aquatic beetles (13 Haliplidae, 78 Dytiscidae, 2 Noteridae, 6 Gyrinidae, 1 Spercheidae, 5 Hydrochidae, 8 Helophoridae,

Standard single-particle reconstruction methodologies in- volve multiple steps, including grouping of images based on similarity, generation of averages with improved SNR, calcula-

In a 3-year respite care demonstration project conducted by Duke University in North Caro- lina, families of people with dementia were offered two types of respite care: in-home