Christophe Blanchet
Institute of Biology and Chemistry of Proteins
Head of Service “Infrastructure for Biology - IDB”
CNRS-IBCP FR3302 - LYON - FRANCE - http://idee-b.ibcp.fr
Bioinformatics on the Cloud
Ecole Bioinformatique Aviesan (Aviesan Bioinformatics School), 18 October 2013
Bioinformatics Today
•
Biological data are big data
•
1512 online databases (NAR Database Issue 2013)
•
Sanger Institute, UK, 5 PB
•
Beijing Genome Institute, China, 5 sites, 12.6 PB
➡
Big data in many places
•
Analysing such data has become difficult
•
Scale-up of the analyses: from a single gene/protein to a complete genome/proteome, ...
•
Many different tools in daily use
•
that need to be combined in workflows
•
Usual interfaces: portals, Web services, federation,...
➡
Datacenters with ease of access/use
•
Distributed resources
•
Experimental platforms: NGS, imaging, ...
•
Bioinformatics platforms
➡
Federation of datacenters
IDB Cloud and Bioinformatics Appliances
•
Cloud workbench for Biology
•
Running since Sept. 2011 at CNRS-IBCP FR3302, Lyon, France
•
Open to the biology community
•
14 bioinformatics appliances: Galaxy portal, standard compute nodes, proteomics, virtual desktop, structural biology, ...
•
40+ users from all IFB regional centers: PRABI 15, APLIBIO 14, RENABI-NE 8, RENABI-SO 2, RENABI-GS 1, RENABI-GO 1
•
VMs up to 32 cores and 768 GB RAM
•
Infrastructure
•
Compute: 900+ cores, 4+ TB RAM
•
Standard nodes (32 cores, 128 GB)
•
Bigmem nodes (64 cores, 768 GB)
•
Powered by StratusLab
•
Storage: 250+ TB
•
Virtual disks, object storage (S3)
[Figure: IDB Cloud. The Bioinformatics Marketplace (Structures, Sequences, Proteomics, Galaxy, ...) offers virtual machines built on a Linux system with tools (BLAST, TopHat, FastA, SSearch, R, ClustalW2, samtools, BWA, OMSSA, PeptideShaker, HMMer, Muscle, X!Tandem, ARIA, FastQC, Clustal Omega; Galaxy tools VM: BLAST, ClustalW2, etc.), connected to public data (UniProt, PDB, EMBL, PROSITE, genomes) and user data. Users can create new cloud services and move cloud virtual machines.]
Cloud extended services
•
Bioinformatics Marketplace
•
find appropriate appliances more easily
•
reduce “noise” in the central Marketplace
•
respect visibility constraints for the bioinformatics appliances, such as confidentiality
•
Bioinformatics metadata ‘bio:tool’
•
additional elements related to bioinformatics tools
•
to annotate appliances
•
help users to search for the tools themselves or the type of analysis
•
select suitable bioinformatics appliances containing the required tools
•
Integrated Web interface
•
VM & virtual disk management
•
browse bioinformatics appliances with ‘bio:tool’ metadata
Native cloud services
•
Authentication
•
Virtual machine management
•
Persistent disk service
•
Client CLI
•
etc.
Run your Bioinformatics Cloud Instances
[Figure: running instances on the Bioinformatics Cloud. Instances are launched from the Bioinformatics Marketplace (NGS, Structure, Sequence, Galaxy, ARIA, ...) on IBCP's cloud resources (IaaS). A master & storage VM (ARIA portal) launches jobs over ssh on worker VMs (CNS; PaaS: BLAST, Clustal, etc.) through a shared file system (NFS), with access to public data sources (UniProt, PDB, EMBL, PROSITE, genomes) and to user persistent data on pdisk (iSCSI).]
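Choosing an appliance from the Bioinformatics Marketplace by the tools it contains, which is what the ‘bio:tool’ metadata is meant to allow, might look like the minimal sketch below. The `Appliance` class, the `find_appliances` helper, the catalog layout and the tool-to-appliance assignment are all illustrative assumptions, not the Marketplace's actual schema or API.

```python
from dataclasses import dataclass, field


@dataclass
class Appliance:
    """A Marketplace entry annotated with the tools it ships (hypothetical schema)."""
    name: str
    tools: set = field(default_factory=set)


def find_appliances(catalog, required_tools):
    """Return the names of appliances that contain every required tool (case-insensitive)."""
    wanted = {t.lower() for t in required_tools}
    return [a.name for a in catalog if wanted <= {t.lower() for t in a.tools}]


# Illustrative catalog: which tool sits in which appliance is invented for the example.
CATALOG = [
    Appliance("biocompute", {"BLAST", "ClustalW2", "BWA", "samtools"}),
    Appliance("proteomics-desktop", {"OMSSA", "X!Tandem", "SearchGUI", "PeptideShaker"}),
    Appliance("galaxy-portal", {"BLAST", "ClustalW2", "FastQC"}),
]

if __name__ == "__main__":
    print(find_appliances(CATALOG, ["blast", "clustalw2"]))
```

Matching on a set of tool names keeps the search independent of how an appliance is named or categorised, which is the point of annotating appliances with the tools themselves.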
Biological Data in Cloud
Upload your data
Get your results
sftp/http/S3
Examples of Cloud
Standard Bioinformatics node
•
‘Biocompute’ appliance
•
Use your own instance(s)
•
With pre-installed standard bioinformatics tools
•
BLAST, FastA, SSearch, HMM, ...
•
ClustalW2, Clustal-Omega, Muscle, ...
•
Bowtie(2), BWA, samtools, ...
•
MEME, R, etc.
•
Connected to public reference data
•
Uniprot, EMBL, genomes, PDB, etc.
Structural Biology
•
TOwards StruCtural AssignmeNt Improvement
•
To improve the determination of protein structures based on
Nuclear Magnetic Resonance (NMR) information with ARIA
software
•
Large computational needs.
•
An NMR laboratory will not necessarily invest in building a cluster of
about 100 nodes to be able to run such NMR structure calculations.
•
Flexibility of the cloud to deploy the different required
bioinformatics tools can accelerate such a procedure.
•
Commercial interest in providing such tools to structural biologists
on a “pay as you go” basis.
•
Endorsers:
Institut Pasteur Paris
and CNRS IBCP
Proteomics desktop
•
Motivation
•
Collaboration with a mass spectrometry platform
•
Running out of space on their local resources
•
Protein identification
•
Mass experimental data
•
Reference databases: nr, Swiss-Prot
•
Reference screening tools: OMSSA, X!Tandem
•
User interface
•
Remote display
•
NX
•
Reference GUIs
•
SearchGUI
•
PeptideShaker
MapReduce Biology
•
Provide turnkey virtual machines with a pre-configured MapReduce framework
•
Accelerate big data analysis with the two-step map & reduce paradigm
•
Hadoop MapReduce 1.0.4
•
Appliances (2)
•
standard Hadoop MapReduce
•
bioinformatics software integrated in Hadoop
•
Sequence similarity with the MapReduce paradigm
•
FastA & SSearch
•
deploy the database of sequences in HDFS
•
compare each sequence to the others
Developed in the context of the French project MapReduce, ANR ARPEGE
[Figure: FastAMR workflow. Mappers (FastA #01, FastA #02, ...) each process one databank subset (subset #01, subset #02, ...); reducers gather the results as (score, sequence) pairs.]
•
Users run the FastAMR script with their sequences
•
FastAMR splits the databank into subsets and puts them in the DFS along with the sequences file
•
Each mapper ...
•
Each mapper sends the score and sequences to reducers
•
Reducers copy the ...
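The split / map / reduce steps described for FastAMR can be illustrated with a small in-process sketch. It is not the FastAMR code: the function names are hypothetical, `toy_score` is a stand-in that counts shared k-mers instead of calling FastA or SSearch, and the subsets are plain dictionaries processed one after another rather than files in HDFS handled by parallel Hadoop mappers.

```python
from collections import defaultdict


def split_databank(databank, n_subsets):
    """Split a {seq_id: sequence} databank into n_subsets roughly equal parts."""
    items = sorted(databank.items())
    return [dict(items[i::n_subsets]) for i in range(n_subsets)]


def toy_score(query, subject, k=3):
    """Stand-in for FastA: number of shared k-mers. NOT a real alignment score."""
    def kmers(seq):
        return {seq[i:i + k] for i in range(len(seq) - k + 1)}
    return len(kmers(query) & kmers(subject))


def mapper(queries, subset):
    """Map step: score every query against every sequence of one subset."""
    for q_id, q_seq in queries.items():
        for s_id, s_seq in subset.items():
            yield q_id, (toy_score(q_seq, s_seq), s_id)


def reducer(pairs, top=2):
    """Reduce step: group hits by query and keep the best-scoring ones."""
    hits = defaultdict(list)
    for q_id, hit in pairs:
        hits[q_id].append(hit)
    return {q: sorted(h, key=lambda x: (-x[0], x[1]))[:top] for q, h in hits.items()}


def run(queries, databank, n_subsets=2, top=2):
    """Run one mapper per subset (sequentially here), then a single reduce."""
    pairs = []
    for subset in split_databank(databank, n_subsets):
        pairs.extend(mapper(queries, subset))
    return reducer(pairs, top)


if __name__ == "__main__":
    databank = {"s1": "MKTAYIAKQR", "s2": "MKTAYIAGGG", "s3": "PPPPPPPPPP"}
    queries = {"q1": "MKTAYIAKQR"}
    print(run(queries, databank))
```

Because each mapper only sees its own subset and the reducer merges by query identifier, the result does not depend on how many subsets the databank is split into, which is what lets the map step scale out.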
IDB Cloud account
•
Log in
•
Fill in the fields
•
institutional e-mail address
•
Submitting the request implies acceptance of the terms of use!
https://idee-b.ibcp.fr/cloud.html
Available appliances
•
List of existing appliances
•
Appliance-specific documentation
•
Direct creation
•
‘Power’ button
Create my Galaxy portal
•
Also called an ‘instance’
•
Fill in the parameters
•
give it a name
•
number of CPUs
•
memory size
•
attach a virtual disk
•
VM cluster
•
enter the number of VMs
Virtual hard disks
•
A virtual disk keeps your data independently of the execution of the VMs
•
your data carries over from one VM to the next.
•
Actions
•
create a vdisk
•
manage your vdisks
•
Use a vdisk
•
when creating the VM
Exchange data with my Galaxy portal
•
sftp / scp
•
graphical client: Cyberduck, Transmit, FileZilla, ...
•
Web: Galaxy - Get Data - download link
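For the sftp / scp route, a transfer can be scripted. The sketch below only builds the `scp` command line for copying one file to a running instance; the helper name, the user name, the host name and the remote directory are placeholders to replace with the values of your own instance.

```python
import shlex


def scp_upload_command(local_path, user, host, remote_dir, port=22):
    """Build the argument list for copying one file to an instance with scp."""
    return ["scp", "-P", str(port), local_path, f"{user}@{host}:{remote_dir}"]


if __name__ == "__main__":
    # Placeholder values: user, host and remote directory are hypothetical.
    cmd = scp_upload_command("reads.fastq", "galaxy", "my-instance.example.org", "/data/upload")
    # Print the command; pass cmd to subprocess.run(cmd, check=True) to execute it.
    print(shlex.join(cmd))
```

Building the command as a list rather than a single string avoids quoting problems with file names that contain spaces.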
Conclusion
•
Added value of cloud, e.g. NGS with Galaxy
•
for scientific analyses: user-specific resources, isolated, different instances together
•
for training: Oct 2012 Bordeaux, May 2013 Galaxy Lille, (next) 2014 Galaxy Jouy
•
for tools integration: semantic annotation, solving software dependencies
•
for development & operations (DevOps): different versions at the same time
•
Provide turnkey bioinformatics appliances
•
Standard tools and pipelines
•
New developments
•
Ready to run on clouds
•
Public bioinformatics cloud (e.g. IDB)
•
Tightly connected to existing bioinformatics resources
•
Linked to public biological databases
•
In collaboration with the French Institute of Bioinformatics
IFB - French Institute of Bioinformatics
Mission: to make core bioinformatics resources available to the
national/international life science research community.
•
To provide support for biology programs
•
supporting projects
•
training users
•
To provide an IT infrastructure devoted to the management and analysis of biological data
•
material resources: CPUs, disks, etc.
•
availability of biology data collections
•
deployment of bioinformatics tools
•
To act as a middleman between the life science community and the
bioinformatics/computer science research community
IFB - Infrastructure
•
IFB-Core resources
•
Academic cloud for life science
•
Will be hosted at the CNRS IDRIS supercomputing center (Paris)
•
A pilot infrastructure (2014-Q1)
•
Production infrastructure: 5,000+ cores, 1 PB (2014-S2)
•
+ Regional resources
•
6 regional bioinformatics centers
•
6,000+ cores, ~1 PB
•
2 existing clouds: PRABI-IBCP IDB cloud (Lyon) & GenOuest genocloud (Rennes)
•
Deploy a cloud federation
[Figure: RENABI / IFB (French Institute of Bioinformatics): regional centers RENABI-GO, APLIBIO, PRABI, RENABI-SO, RENABI-NE, RENABI-GS, and IFB-core IT at CNRS-IDRIS, Paris]