
Bioinformatics on the Cloud: use case with the Galaxy portal


(1)

Christophe Blanchet

Institute of Biology and Chemistry of Proteins

Head of Service “Infrastructure for Biology - IDB”

CNRS-IBCP FR3302 - LYON - FRANCE - http://idee-b.ibcp.fr

Bioinformatics on the Cloud

(2)

Ecole Bioinformatique Aviesan, 18 October 2013

Bioinformatics Today

Biological data are big data

1512 online databases (NAR Database Issue 2013)

Sanger Institute, UK, 5 PB

Beijing Genome Institute, China, 5 sites, 12.6 PB

Big data in many places

Analysing such data has become difficult

Scale-up of the analyses: from gene/protein to complete genome/proteome, ...

Many different tools used daily

That need to be combined in workflows

Usual interfaces: portals, Web services, federation, ...

Datacenters with ease of access/use

Distributed resources

Experimental platforms: NGS, imaging, ...

Bioinformatics platforms

Federation of datacenters

[Figure: distributed resources, with sites labelled ADN, BI, M and CC]

(3)

IDB Cloud and Bioinformatics Appliances

Cloud workbench for Biology

Running since Sept. 2011

CNRS-IBCP FR3302, Lyon, France

open to the biology community

14 bioinformatics appliances: Galaxy portal,

standard compute nodes, proteomics, virtual desktop, structural biology, ...

+40 users from all IFB regional centers

PRABI 15, APLIBIO 14, RENABI-NE 8, -SO 2, -GS 1, -GO 1

VMs up to 32 cores - 768 GB RAM

Infrastructure

Compute: 900+ cores, 4+ TB RAM

Standard nodes (32 cores, 128 GB)

Bigmem nodes (64 cores, 768 GB)

Powered by StratusLab

Storage: 250+ TB

Virtual disks, object storage (S3)

[Figure: IDB Cloud overview. Bioinformatics Marketplace (Sequences, Structures, Proteomics, Galaxy, ...) + Virtual Machines. Tools: BLAST, TopHat, FastA, SSearch, R, ClustalW2, samtools, BWA, OMSSA, PeptideShaker, HMMer, Muscle, X!Tandem, ARIA, FastQC, Clustal Omega, Galaxy tools VM. Data: UniProt, PDB, EMBL, PROSITE, genomes (public data) and user data. Actions: create new cloud services, move cloud virtual machines.]

(4)

Cloud extended services

Bioinformatics Marketplace

find appropriate appliances more easily.

reduce “noise” in the central Marketplace

respect visibility constraints for the bioinformatics appliances, such as confidentiality

Bioinformatics metadata “bio:tool”

additional elements related to bioinformatics tools

to annotate appliances

help users to search for the tools themselves or the type of analysis

select suitable bioinformatics appliances containing the required tools

Integrated Web interface

VM & virtual disk management

browse bioinformatics appliances with “bio:tool” metadata

Native cloud services

Authentication

Virtual machine management

Persistent disk service

Client CLI

etc.
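
The idea of selecting appliances by the tools they contain, as described for the “bio:tool” metadata on this slide, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Appliance` class, the `find_appliances` function and the catalogue contents are hypothetical and do not reflect the actual Marketplace schema or API.

```python
from dataclasses import dataclass, field


@dataclass
class Appliance:
    """Hypothetical catalogue entry; `tools` stands in for "bio:tool" annotations."""
    name: str
    tools: set[str] = field(default_factory=set)


def find_appliances(catalog, required_tools):
    """Return the names of appliances whose annotations cover every required tool."""
    wanted = {t.lower() for t in required_tools}
    return [a.name for a in catalog if wanted <= {t.lower() for t in a.tools}]


# Illustrative catalogue (assumed contents, not the real Marketplace listing).
catalog = [
    Appliance("biocompute", {"BLAST", "ClustalW2", "BWA", "samtools"}),
    Appliance("proteomics", {"OMSSA", "X!Tandem", "PeptideShaker"}),
    Appliance("galaxy", {"BLAST", "ClustalW2", "TopHat"}),
]

print(find_appliances(catalog, ["blast", "clustalw2"]))
```

Matching is case-insensitive and requires all requested tools to be present, which mirrors the goal of finding a suitable appliance rather than browsing the whole list.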

(5)
(6)

Run your Bioinformatics Cloud Instances

Bioinformatics Marketplace

NGS

[Figure: launching instances from the Bioinformatics Marketplace (NGS, Structure, Sequence, Galaxy, ARIA, ...) on IBCP's cloud resources (IaaS). Labels: Master & Storage VM with the ARIA portal; Workers VM (PaaS) with BLAST, Clustal, CNS, etc.; shared FS; launch jobs over ssh.]

(7)

[Figure: Bioinformatics Cloud (IaaS). Public data sources (UniProt, PDB, EMBL, PROSITE, genomes) are shared over NFS; a Master & Storage VM hosts the ARIA portal; Workers VM (PaaS) run BLAST, Clustal, CNS, etc. through a shared FS, with jobs launched over ssh; user persistent data are kept on a pdisk (iSCSI).]

Biological Data in Cloud

Upload your data

Get your results

sftp/http/S3

(8)

Examples of Cloud

(9)

Standard Bioinformatics node

‘Biocompute’ appliance

Use your own instance(s)

With pre-installed standard bioinformatics tools

BLAST, FastA, SSearch, HMM, ...

ClustalW2, Clustal-Omega, Muscle, ...

Bowtie(2), BWA, samtools, ...

MEME, R, etc.

Connected to public reference data

UniProt, EMBL, genomes, PDB, etc.

(10)

Structural Biology

TOwards StruCtural AssignmeNt Improvement

To improve the determination of protein structures based on Nuclear Magnetic Resonance (NMR) information with the ARIA software

Large computational needs.

An NMR laboratory will not necessarily invest in building a cluster of about 100 nodes to be able to run such NMR structure calculations.

The flexibility of the cloud to deploy the different required bioinformatics tools can accelerate such a procedure.

Commercial interest in providing such tools to structural biologists on a “pay as you go” basis.

Endorsers: Institut Pasteur Paris and CNRS IBCP

(11)

Proteomics desktop

Motivation

Collaboration with a mass spectrometry platform

Running out of space on their local resources

Protein identification

Mass experimental data

Reference databases : nr, Swiss-Prot

Reference screening tools:

OMSSA, X!Tandem

User interface

Remote display

NX

Reference GUIs

SearchGUI

PeptideShaker

(12)

MapReduce Biology

Provide turnkey virtual machines with a pre-configured MapReduce framework

Accelerate big data analysis with the two-step map & reduce paradigm

Hadoop MapReduce 1.0.4

Appliances (2)

standard Hadoop MapReduce

bioinformatics software integrated in Hadoop

Sequence similarity with the MapReduce paradigm

FastA & SSearch

deploy database of sequences in HDFS

compare each structure to others

Developed in the context of the French project MapReduce, ANR ARPEGE

FastAMR workflow:

Users run the FastAMR script with their sequences

FastAMR splits the databank into subsets (subset #01, subset #02, ...) and puts them in the DFS along with the sequences file

Each mapper (FastA #01, FastA #02, ...) sends the scores and sequences to the reducers

Reducers copy the ...

Results: score, sequence, ...
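
The split / map / reduce flow described for FastAMR can be illustrated with a small in-memory Python sketch. It is a toy under explicit assumptions: it does not use Hadoop or FastA, the scoring function is a simple shared k-mer count standing in for a real alignment score, and all function names and sequences are hypothetical.

```python
from collections import defaultdict


def kmers(seq, k=3):
    """Set of all overlapping k-mers of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def toy_score(query, target, k=3):
    """Toy similarity: number of shared k-mers (a stand-in, NOT the FastA score)."""
    return len(kmers(query, k) & kmers(target, k))


def split_databank(databank, n_subsets):
    """Split the list of (id, sequence) entries into roughly equal subsets."""
    return [databank[i::n_subsets] for i in range(n_subsets)]


def mapper(queries, subset):
    """Compare every query with every entry of one subset; emit (query_id, (score, target_id))."""
    for qid, qseq in queries:
        for tid, tseq in subset:
            yield qid, (toy_score(qseq, tseq), tid)


def reducer(pairs, top=2):
    """Group the emitted hits by query and keep the best-scoring targets."""
    hits = defaultdict(list)
    for qid, hit in pairs:
        hits[qid].append(hit)
    return {qid: sorted(h, key=lambda x: (-x[0], x[1]))[:top] for qid, h in hits.items()}


# Made-up sequences, for illustration only.
databank = [("d1", "MKTAYIAKQR"), ("d2", "MKTAYLLKQR"), ("d3", "GGGGSGGGGS"), ("d4", "AYIAKQRMKT")]
queries = [("q1", "MKTAYIAKQR")]

# Each subset would be handled by a separate mapper in a real MapReduce job.
emitted = [pair for subset in split_databank(databank, 2) for pair in mapper(queries, subset)]
results = reducer(emitted)
print(results)
```

In a real deployment the subsets would live in the distributed file system and the mappers would run in parallel on different nodes; here they run sequentially so the data flow stays easy to follow.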

(13)
(14)

IDB Cloud account

Log in

Fill in the various fields

institutional e-mail address

Creating the request implies acceptance of the terms of use!

https://idee-b.ibcp.fr/cloud.html

(15)

Available appliances

List of existing appliances

Appliance-specific documentation

Direct creation

‘Power’ button

(16)

Create my Galaxy portal

Also called an ‘instance’

Fill in the various parameters

give it a name

number of CPUs

memory size

attach a virtual disk

VM cluster

fill in the number of VMs
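
The parameters listed on this slide (name, number of CPUs, memory size, virtual disk, number of VMs) can be pictured as a small request record that is checked before submission. The sketch below is hypothetical: `InstanceRequest` and its fields are illustrative names, not the portal's form or API, and the limits are taken from the "VMs up to 32 cores - 768 GB RAM" figure quoted earlier in the slides.

```python
from dataclasses import dataclass
from typing import List, Optional

MAX_CPUS = 32      # assumed per-VM limit, from the infrastructure slide
MAX_RAM_GB = 768   # assumed per-VM limit, from the infrastructure slide


@dataclass
class InstanceRequest:
    """Hypothetical record of the fields filled in when creating an instance."""
    name: str
    cpus: int
    ram_gb: int
    vdisk_id: Optional[str] = None  # virtual disk to attach, if any
    vm_count: int = 1               # more than 1 for a cluster of VMs

    def problems(self) -> List[str]:
        """Return a list of human-readable issues; empty if the request looks consistent."""
        issues = []
        if not self.name.strip():
            issues.append("instance name is empty")
        if not 1 <= self.cpus <= MAX_CPUS:
            issues.append(f"cpus must be between 1 and {MAX_CPUS}")
        if not 1 <= self.ram_gb <= MAX_RAM_GB:
            issues.append(f"ram_gb must be between 1 and {MAX_RAM_GB}")
        if self.vm_count < 1:
            issues.append("vm_count must be at least 1")
        return issues


galaxy = InstanceRequest(name="my-galaxy", cpus=8, ram_gb=64, vdisk_id="vdisk-ngs-data")
print(galaxy.problems())  # an empty list means the request looks consistent
```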

(17)
(18)

Virtual hard disks

A virtual disk lets you keep your data independently of the execution of the VMs

retrieve your data from one VM to the next.

Actions

create a vdisk

manage your vdisks

Use a vdisk

when creating the VM

(19)

Exchange data with my Galaxy portal

sftp / scp

graphical client: Cyberduck, Transmit, FileZilla, ...

Web: Galaxy - Get Data - download link
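
For scripted transfers, the scp option mentioned on this slide can be driven from Python. The sketch below only builds and prints the command; the user name, host name and remote directory are placeholders (assumptions), to be replaced by the values of your own instance.

```python
import shlex


def scp_upload_command(local_path, user, host, remote_dir, port=22):
    """Build an `scp` command (as an argument list) that copies a local file to a VM."""
    cmd = ["scp"]
    if port != 22:
        cmd += ["-P", str(port)]  # scp uses a capital -P for the port
    cmd += [local_path, f"{user}@{host}:{remote_dir}"]
    return cmd


# Placeholder values, for illustration only.
cmd = scp_upload_command("reads.fastq.gz", "alice", "galaxy-vm.example.org", "/data/uploads/")
print(shlex.join(cmd))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` to perform the actual copy once the placeholders point at a real instance.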

(20)

Conclusion

Added value of cloud, e.g. NGS with Galaxy

for scientific analyses: user-specific resources, isolated, different instances together

for training: Oct 2012 Bordeaux, May 2013 Galaxy Lille, (next) 2014 Galaxy Jouy

for tools integration: semantic annotation, solve software dependencies

for development & operations (DevOps): different versions at the same time

Provide turnkey bioinformatics appliances

Standard tools and pipelines

New developments

Ready to run on clouds

Public bioinformatics cloud (e.g. IDB)

Tightly connected to existing bioinformatics resources

Linked to public biological databases

In collaboration with the French Institute of Bioinformatics

(21)

IFB - French Institute of Bioinformatics

Mission: to make available core bioinformatics resources to the national/international life science research community.

To provide support for biology programs

supporting projects

training users

To provide an IT infrastructure devoted to management and analysis of biological data

material resources: CPUs, disks, etc.

availability of biology data collections

deployment of bioinformatics tools

To act as a middleman between the life science community and the bioinformatics/computer science research community

(22)

IFB - Infrastructure

IFB-Core resources

Academic cloud for life science

Will be hosted at the CNRS IDRIS supercomputing center (Paris)

A pilot infrastructure (2014-Q1)

Production infrastructure: 5,000+ cores, 1 PB (2014-S2)

+

Regional resources

6 regional bioinformatics centers

6,000+ cores, ~1 PB

2 existing clouds: PRABI-IBCP IDB cloud (Lyon) & Genouest genocloud (Rennes)

Deploy a federation of clouds

[Figure: IFB (French Institute of Bioinformatics) infrastructure: regional centers RENABI-GO, APLIBIO, PRABI, RENABI-SO, RENABI-NE, RENABI-GS, and the IFB-core IT at CNRS-IDRIS, Paris]

(23)

Acknowledgment

Clément Gauthey (IDB)

StratusLab members

Co-funding by the European Community's Seventh Framework Programme (INFSO-RI-261552) and by the French National Research Agency's Arpege Programme (ANR-10-SEGI-001).

Questions ?
