EGEE-II INFSO-RI-031688
Enabling Grids for E-sciencE
www.eu-egee.org
Grid computing for biomedicine
BioSTIC Day (Journée BioSTIC), July 10, 2008
Johan Montagnat, Diane Lingrand
EGEE Grid Infrastructure
Flagship European grid infrastructure project
Now in its 3rd phase, with more than 100 partners
Size of the infrastructure today:
• > 250 sites in 48 countries
• > 70 000 CPU cores
• ~ 5 PB disk + tape MSS
• > 150 000 jobs/day
DICOM Image Management, J. Montagnat, HealthGrid, June 2, 2008
Middleware components
[Figure: middleware components. Labels: Site X (computing resources, storage resources); Information Service (sites resources, dynamic evolution, publication of resources info); Workload Management (resources allocation, queries); Data Management (datasets info, queries); Authorization & Authentication; user requests; logging, real-time monitoring]
gLite middleware
•
Complex middleware stack
– On top of Globus Toolkit 2, VDT, CONDOR, batch schedulers...
•
Foundational services
– User authentication and authorization
– Information Index: registration of Computing Elements (CEs), Storage Elements (SEs), etc.
– Workload Management System: workload distribution over sites
– Data Management System: virtual file hierarchy, standard (SRM) interface to storage resources
•
Batch oriented
– EGEE is a federation of computing centers
Medical image analysis on EGEE, J. Montagnat, BioGrid, June 2, 2008
Why grids for medical imaging?
•
Sharing computing resources and algorithms
– Research (population studies, model design, validation, statistics)
– Complex analysis (compute-intensive image processing, time constraints...)
[Figure: shared resources. Labels: data (sequences Seq 1 ... Seq n), algorithms, procedures, computing power]
Building on Production Grids
[Figure: layered view of a production grid. Labels: resources; communication layer (GEANT, Internet...); (low-level) middleware; production grid infrastructure level; application-level services (high-level middleware); applications]
The Biomed Virtual Organization
•
Fostering scientific communities
– EGEE is an international, multi-disciplinary research infrastructure
– Scientific communities are expanding beyond administrative boundaries
•
Virtual Organizations
– Authentication and Authorization management
– Share resources (computers, data, algorithms...) inside a VO
– Application areas identification and support unit
•
The Biomed VO is now divided into 3 sectors
– Medical imaging
– Bioinformatics
Biomed Virtual Organization
Size of the infrastructure today:
• > 250 sites in 48 countries
• > 70 000 CPU cores
• ~ 5 PB disk + tape MSS
• > 150 000 jobs/day
• Out of which, Biomed VO: > 9000 registered users
• > 100 sites in 30 countries (170 CEs, 130 SEs)
• ~ 17 000 CPUs
Irregular use through time
Year 2007 statistics collected per month
Total production in 2007
~800 CPU years (1000 SpecInt 2000 normalised CPU time)
Computing resources provision
Computing hours per EGEE
Life sciences compared to other sciences
• Biomed VO share
[Figure: Biomed VO share of the number of jobs and of CPU time]
Application to life sciences, J. Montagnat, June 18, 2008
WISDOM In silico Drug Discovery
•
WISDOM: http://wisdom.healthgrid.org/
•
Goal: find new drugs for neglected and emerging diseases
– Neglected diseases lack R&D
– Emerging diseases require very rapid response time
•
Need for an optimized environment
– To achieve production in a limited time
– To optimize performance
•
Method: grid-enabled virtual docking
– Cheaper than in vitro tests
– Faster than in vitro tests
High Throughput Virtual Docking
• Millions of chemical compounds available in laboratories
• High Throughput Screening
– 1-10$/compound, nearly impossible
• Molecular docking (FlexX, Autodock)
– ~80 CPU years, 1 TB data
• Computational data challenge
– ~6 weeks on ~1000/1600 computers
• Hits
– Screening using assays performed on living cells
• Leads
• Clinical testing
• Drug
Inputs to the docking step:
• Chemical compounds: ZINC
– Chembridge: 500,000
– Drug like: 500,000
• Molecular docking: FlexX, Autodock
• Target structures: PDB
– Plasmepsin II (1lee, 1lf2, 1lf3)
– Plasmepsin IV (1ls5)
• Grid infrastructure: EGEE
Healthgrid … – March 8th, 2007 – V. Breton
INFSO-RI-508833
The new WISDOM architecture
Credit: J. Salzeman, V. Bloch
Major new features:
- improved data management
- secure storage
WISDOM-II: A huge international effort
[Pie chart: share of the computing contributed by each grid infrastructure. Shares shown: 1%, 2%, 2%, 3%, 3%, 3%, 3%, 5%, 6%, 7%, 12%, 15%, 39%. Legend: EGEE Germany Switzerland, EGEE Asia Pacific, EGEE Russia, Auvergrid, EuChinaGrid, EELA, EGEE South Western Europe, EGEE Central Europe, EGEE Northern Europe, EGEE Italy, EGEE South Eastern Europe, EGEE France, EGEE UKI]
Significant contributions from several International grid infrastructures
Over 420 CPU years in 10 weeks
The grid added value
Partners and their contributions:
– Univ. Los Andes: Biological targets, Malaria biology
– LPC Clermont-Ferrand: Biomedical grid
– SCAI Fraunhofer: Knowledge extraction, Chemoinformatics
– Univ. Modena: Biological targets, Molecular Dynamics
– ITB CNR: Bioinformatics, Molecular modelling
– Univ. Pretoria: Bioinformatics, Malaria biology
– Academia Sinica: Grid user interface, Biological targets, In vitro testing
– HealthGrid: Biomedical grid, Dissemination
– CEA, Acamba project: Biological targets, Chemogenomics
– Chonnam nat. univ.: In vitro testing (new)
• The grid provides the centuries of CPU cycles required on demand
• The grid provides the reliable and secure data management services to store and replicate the biochemical inputs and outputs
• The grid offers a collaborative environment for the sharing of data in the research community on avian flu and malaria
Statistics of deployment
•
First Data Challenge: July 1st - August 15th 2005
– Target: malaria
– 80 CPU years
– 1 TB of data produced
– 1700 CPUs used in parallel
– 1st large-scale docking deployment world-wide on an e-infrastructure
•
Second Data Challenge: April 15th - June 30th 2006
– Target: avian flu
– 100 CPU years
– 800 GB of data produced
– 1700 CPUs used in parallel
– Collaboration initiated on March 1st: deployment preparation achieved in 45 days
•
Third Data Challenge: October 1st - December 15th 2006
– Target: malaria
– 400 CPU years
– 1.6 TB of data produced
– Up to 5000 CPUs used in parallel
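The figures above can be cross-checked with some simple arithmetic. The sketch below is not from the original talk: it assumes a 365-day CPU year and treats the stated "CPUs used in parallel" as a peak value, then compares the calendar length of each challenge with the time the same work would take if that peak were sustained throughout. The labels "DC1", "DC2", "DC3" are shorthand introduced here.

```python
from datetime import date

# Figures copied from the slide. "peak_cpus" is the stated number of CPUs
# used in parallel, interpreted here (assumption) as a peak, not an average.
CHALLENGES = [
    # name,           start,             end,                CPU years, peak CPUs
    ("DC1 malaria",   date(2005, 7, 1),  date(2005, 8, 15),  80,        1700),
    ("DC2 avian flu", date(2006, 4, 15), date(2006, 6, 30),  100,       1700),
    ("DC3 malaria",   date(2006, 10, 1), date(2006, 12, 15), 400,       5000),
]

DAYS_PER_CPU_YEAR = 365  # assumption: one CPU year = 365 CPU days


def summarize(name, start, end, cpu_years, peak_cpus):
    """Return calendar duration, ideal duration at peak, and average busy CPUs."""
    wall_days = (end - start).days
    cpu_days = cpu_years * DAYS_PER_CPU_YEAR
    return {
        "name": name,
        "wall_days": wall_days,
        # Time needed if every one of the peak CPUs were busy all the time.
        "ideal_days": cpu_days / peak_cpus,
        # Average number of CPUs that must have been busy over the period.
        "avg_busy_cpus": cpu_days / wall_days,
    }


if __name__ == "__main__":
    for challenge in CHALLENGES:
        row = summarize(*challenge)
        print(f"{row['name']}: {row['wall_days']} days elapsed, "
              f"{row['ideal_days']:.1f} days at peak, "
              f"~{row['avg_busy_cpus']:.0f} CPUs busy on average")
```

Under these assumptions the first challenge spans 45 calendar days, against roughly 17 days if 1700 CPUs had been busy continuously, which is consistent with the irregular, batch-oriented usage described elsewhere in the talk.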
GPSA: Bioinformatics Grid Portal
• Scientific objectives
– Molecular bioinformatics: protein sequence analysis
– Analyse data from high-throughput biology: complete genome projects, EST, complete proteomes, structural biology, ...
– Integration of biological data and tools
• Method
– Provide biologists with a familiar Web interface: NPS@
NPS@ Web portal online since 1998
46 tools & 12 updated databases
+ 10,000,000 jobs & 5,000 jobs/day
– Ease access to updated databases and algorithms.
– Protein databases are stored on grid storage as flat files.
– Legacy bioinformatics applications
Wrapping the usual binaries in the grid environment
Transparent remote access with the local filesystem
– Display results in a graphical Web interface.
• Status: Production
• Contact: Christophe.Blanchet@ibcp.fr
• Grid-enabled bioinformatics resources
– 9 algorithms
– 3 protein databases
• Bioinformatics descriptors
– XML framework, Web portal, Web Services
• Encryption system: EncFile
– Security: AES, key sharing, M-of-N
• Transparent access to grid files: Parrot
• http://gpsa-pbil.ibcp.fr
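The slide only says that EncFile uses key sharing with an M-of-N scheme; it does not say which one. As an illustration of the idea, here is a minimal sketch of Shamir secret sharing, one common way to split a key into N shares so that any M of them recover it. The function names, the choice of prime field, and the use of Shamir's scheme itself are assumptions made for this example, not a description of EncFile.

```python
import secrets

# Assumption: arithmetic in the prime field GF(2**521 - 1), a Mersenne prime
# comfortably larger than a 128- or 256-bit AES key.
PRIME = 2**521 - 1


def split_secret(secret: int, m: int, n: int):
    """Split `secret` into n shares; any m of them recover it."""
    if not 1 <= m <= n:
        raise ValueError("need 1 <= m <= n")
    if not 0 <= secret < PRIME:
        raise ValueError("secret out of range")
    # Random polynomial of degree m-1 whose constant term is the secret.
    coeffs = [secret] + [secrets.randbelow(PRIME) for _ in range(m - 1)]
    shares = []
    for x in range(1, n + 1):
        y = 0
        for c in reversed(coeffs):  # Horner evaluation modulo PRIME
            y = (y * x + c) % PRIME
        shares.append((x, y))
    return shares


def recover_secret(shares):
    """Lagrange interpolation at x = 0 over the given shares."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = (num * -xj) % PRIME
                den = (den * (xi - xj)) % PRIME
        # pow(den, -1, PRIME) is the modular inverse (Python 3.8+).
        secret = (secret + yi * num * pow(den, -1, PRIME)) % PRIME
    return secret
```

With `shares = split_secret(key, m=3, n=5)`, any three of the five shares passed to `recover_secret` give back `key`, while fewer than three reveal nothing about it.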
EGEE Medical imaging sector
•
Pharmacokinetics
– UPV (Polytechnic University of Valencia)
•
GATE
– LPC (CNRS – Clermont Ferrand)
•
SiMRI3D, ThIS, CAVIAR
– CREATIS (CNRS - Lyon)
•
gPTM3D
– LRI-LAL-LIMSI (CNRS - Orsay)
•
Bronze Standard
– I3S (CNRS – Sophia), INRIA
•
SPM Alzheimer's
– U. of Genova / DEST
Pharmacokinetics: contrast agent diffusion
•
Scientific objectives
– Study of contrast agent diffusion
•
Computations
– Image sequences co-registration (independent computations)
– Parallel contrast agent diffusion modeling
•
Features
– Application portal for use in clinical environment
– Enable on-line analysis of exams acquired daily
Examination time ~15 minutes
SiMRI3D: MRI simulator
•
Scientific objectives
– Better understand MR physics.
– Study MR sequences in-silico.
– Study MR artefacts.
– Validate MR image processing algorithms on synthetic yet realistic images.
•
Method
– Simulate the Bloch equations.
– Parallel (MPI) implementation to speed up computations.
1024³ 3D image simulation: ~100 CPU years.
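To give a feel for what is being simulated per voxel, the sketch below evaluates the closed-form solution of the Bloch equations for free relaxation in the rotating frame (no RF pulse applied during the interval). It is not SiMRI3D code: the function name, the sign convention for off-resonance precession, and the T1/T2 values in the usage note are illustrative assumptions.

```python
import cmath
import math


def relax(mxy0: complex, mz0: float, t: float, t1: float, t2: float,
          m0: float = 1.0, delta_omega: float = 0.0):
    """Magnetization after free relaxation for a time t (seconds).

    mxy0, mz0   : transverse (complex) and longitudinal magnetization at t = 0
    t1, t2      : longitudinal and transverse relaxation times (seconds)
    m0          : equilibrium magnetization
    delta_omega : off-resonance angular frequency (rad/s); the sign of the
                  rotation is a convention chosen for this sketch
    """
    # Transverse component decays with T2 and precesses at delta_omega.
    mxy = mxy0 * cmath.exp(complex(-t / t2, -delta_omega * t))
    # Longitudinal component recovers towards m0 with T1.
    mz = m0 + (mz0 - m0) * math.exp(-t / t1)
    return mxy, mz
```

For example, right after a 90° pulse (`mxy0 = 1`, `mz0 = 0`) with assumed values `t1 = 0.9` and `t2 = 0.1`, calling `relax(1 + 0j, 0.0, t=0.1, t1=0.9, t2=0.1)` leaves a transverse magnitude of exp(-1) of its initial value. A full simulator repeats this kind of update, interleaved with RF pulses and gradients, for every voxel of the volume, which is where the CPU cost comes from.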
Open GATE collaboration
•
Scientific objectives
– Geant4 Application for Tomographic Emission
– Medical physics applications: PET camera simulation, radiotherapy, ocular brachytherapy treatment
– http://www.opengatecollaboration.org
•
Method
– GEANT4 base software to model the physics of nuclear medicine.
– Use Monte Carlo simulation to improve the accuracy of computations (as compared to the deterministic classical approach)
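As a toy illustration of the Monte Carlo principle (and not of GATE or Geant4 themselves), the sketch below estimates the fraction of photons that cross a homogeneous slab without interacting, by sampling each photon's free path from an exponential distribution. The attenuation coefficient, slab thickness, and function name are assumptions chosen so that the result can be checked against the analytical Beer-Lambert value.

```python
import math
import random


def transmitted_fraction(mu: float, thickness: float, n_photons: int,
                         seed: int = 0) -> float:
    """Monte Carlo estimate of unscattered transmission through a slab.

    mu        : linear attenuation coefficient (1/cm), assumed constant
    thickness : slab thickness (cm)
    """
    rng = random.Random(seed)  # seeded for reproducibility
    passed = 0
    for _ in range(n_photons):
        # Distance to the first interaction follows an exponential law
        # with mean 1/mu.
        free_path = rng.expovariate(mu)
        if free_path > thickness:
            passed += 1
    return passed / n_photons


if __name__ == "__main__":
    mu, thickness = 0.2, 5.0  # illustrative values, not tissue data
    estimate = transmitted_fraction(mu, thickness, n_photons=200_000)
    exact = math.exp(-mu * thickness)  # Beer-Lambert law
    print(f"Monte Carlo: {estimate:.4f}  analytical: {exact:.4f}")
```

Each photon history is independent of the others, which is why this class of simulation splits naturally into many grid jobs. Real detector and patient geometries have no closed-form answer, which is where the Monte Carlo approach earns its cost.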
GATE application to radiotherapy
Scanner slices: DICOM or DICOM-RT format
gPTM3D: radiology analysis
•
Scientific objectives
– User-assisted reconstruction of 3D anatomical structures
– Quantification (e.g. volume measurements)
– Use models for augmented reality
•
Computations
– Parallelization of segmentation processes
– Large number of short jobs
•
Features
– Highly interactive (almost real-time)
– Used in clinical environment
Software technologies for integration of process, data and knowledge in medical imaging
NeuroLOG ANR-06-TLOG-024
Use case: multiple sclerosis
[Figure: multimodal MRI processing pipeline]
• MR images collection (T1, T2, PD)
• Preprocessing
– Non-uniformity correction
– Intensity normalisation
• Registration and resampling
– Stereotaxic registration
– Resampling
• Segmentation
– Brain segmentation
– Classification (4 classes: white matter, gray matter, CSF, lesions)
• Analysis
– Quantification
– Statistical tests
Experiments requirements
•
Large databases
– 256 3D images (test database)
– 120 3D images (mono-site clinical database)
– 2 400 3D images (multi-site clinical trial, TBs of data)
•
Various computing tools
– 10 to 15 processing stages in the pipeline
•
Computing power
– > 2 months sequential execution time
•
Pipeline description and execution
– Workflow description
Application to rigid registration algorithms evaluation
[Figure: unregistered and registered image pairs]
Bronze standard workflow
[Figure: workflow graph. Labels: inputs A and B; services GetFromEGEE, FormatConv, CrestLines, PFMatchICP, PFRegister, Yasmina, Baladin, MethodToTest, MultiTransfoTest, WriteResults; Params inputs attached to the services]
~100 image pairs
~800 EGEE jobs
Application specific requirements
•
“Embarrassingly” parallel problem
•
Fast turnover of short jobs
•
Data confidentiality
•
Interactivity
•
Fine grain parallelism (MPI)
•
Workflow-based
WAN medical image exchange
•
Cross-enterprise exchange of radiology reports and images
[Figure: Radiology Enterprise A (PACS A), Radiology Enterprise B (PACS B) and a physician office connected through a wide-area health information network; radiology-to-radiology and radiology-to-physicians exchanges]
WAN medical image exchange
•
Grid technologies
– Cross-enterprise authentication
– Access control
– Secured communications
Medical Data Manager
•
Objectives
– Expose a standard grid interface (SRM) for medical image servers (DICOM)
– Use native DICOM storage format
– Fulfill medical applications security requirements
– Do not interfere with clinical practice
[Figure: Medical Data Manager interfaces. Labels: user interfaces, worker nodes, SRM; DICOM clients, DICOM interface]
Medical Data Registration
Components: LCG File Catalog, AMGA Metadata, Hydra Key store, GFAL API
1. Image is acquired
2. Image is stored in DICOM server
3. lcg client
3a. Image is registered (a GUID is associated)
3b. Image key is produced and registered
4. image metadata
Medical Data Retrieval
Components: DICOM server, SRM-DICOM interface, LCG File Catalog, AMGA Metadata, Hydra Key store, User Interface, Worker Node, GFAL
1. get GUID from metadata
2. lcg client
3. get SURL from GUID
4. request file
5. get file key
6. on-the-fly encryption and anonymization, return encrypted file
7. get file key and decrypt file locally
Access control points: file ACL control, metadata ACL control, key ACL control
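The retrieval steps can be sketched as a small in-memory model. This is only an illustration of the sequence of lookups and ACL checks: the class and method names are hypothetical, Python dictionaries stand in for the AMGA, LCG File Catalog, Hydra and DICOM services, the SURL and GUID values are made up, and the cipher is a deliberately simple placeholder that must not be used for real data.

```python
import hashlib
import json
import secrets


def keystream_xor(key: bytes, data: bytes) -> bytes:
    """Toy stream cipher (SHA-256 in counter mode). Placeholder, NOT secure."""
    out = bytearray()
    for offset in range(0, len(data), 32):
        pad = hashlib.sha256(key + offset.to_bytes(8, "big")).digest()
        out.extend(b ^ p for b, p in zip(data[offset:offset + 32], pad))
    return bytes(out)


class MedicalDataManagerSketch:
    """Hypothetical stand-in for the services named on the slide."""

    def __init__(self):
        self.dicom_server = {}  # SURL -> image record
        self.file_catalog = {}  # GUID -> SURL      (LCG File Catalog role)
        self.metadata = {}      # GUID -> metadata  (AMGA role)
        self.key_store = {}     # GUID -> key       (Hydra role)
        self.acl = {"metadata": set(), "file": set(), "key": set()}

    def register(self, guid, surl, image, metadata, allowed_users):
        self.dicom_server[surl] = image
        self.file_catalog[guid] = surl
        self.key_store[guid] = secrets.token_bytes(32)
        self.metadata[guid] = metadata
        for users in self.acl.values():
            users.update(allowed_users)

    def _check(self, kind, user):
        if user not in self.acl[kind]:
            raise PermissionError(f"{user} denied by {kind} ACL")

    def find_guids(self, user, **query):
        """Step 1: get GUIDs from metadata (metadata ACL)."""
        self._check("metadata", user)
        return [g for g, md in self.metadata.items()
                if all(md.get(k) == v for k, v in query.items())]

    def fetch_encrypted(self, user, guid):
        """Steps 3-6: resolve SURL, anonymize and encrypt on the fly."""
        self._check("file", user)
        image = dict(self.dicom_server[self.file_catalog[guid]])
        image.pop("PatientName", None)  # anonymization (illustrative field)
        payload = json.dumps(image, sort_keys=True).encode()
        return keystream_xor(self.key_store[guid], payload)

    def get_key(self, user, guid):
        """Step 7: fetch the key so the client can decrypt locally (key ACL)."""
        self._check("key", user)
        return self.key_store[guid]
```

A client would call `find_guids`, then `fetch_encrypted`, then decrypt locally with `keystream_xor(mdm.get_key(user, guid), blob)`. Keeping the three ACL checks separate mirrors the slide: being allowed to query metadata does not by itself grant access to the file or to its key.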
Grid Workflow Efficient Enactment for Data Intensive Applications
•
Application workflows representation and efficient enactment on grids
GWENDIA ANR-06-MDCA-009
Workflow management
•
MOTEUR, I3S laboratory, CNRS
– http://egee1.unice.fr/MOTEUR
•
High level interface
– Hides grid complexity from the user
•
Service-based approach
– Legacy code service wrapper
•
Scufl language (myGrid / Taverna)
– Pure data flow approach
•
Grid submission interfaces
– EGEE (LCG2, gLite)
– Grid5000 (OAR, DIET)
•
Transparent parallelism exploitation
– Code and data parallelism
Efficient parallel execution
•
A workflow naturally provides application parallelization
•
MOTEUR transparently exploits 3 kinds of parallelism
•
Jobs grouping strategy in sequential branches in order to reduce grid latency
[Figure: services S1, S2, S3 processing data items D0, D1 under workflow parallelism, data parallelism and service parallelism]
http://egee1.unice.fr/MOTEUR
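The three kinds of parallelism and the job grouping idea can be sketched with a thread pool standing in for the grid. This is not MOTEUR code: the service bodies, the workflow shape (S1 feeding two independent branches S2 and S3), and the function names are assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


# Placeholder services; on a grid each call would be a submitted job.
def s1(x):
    return x + 1


def s2(x):
    return x * 2


def s3(x):
    return x * 3


def enact(items, max_workers=4):
    """Run S1 then the independent branches S2 and S3 on every data item."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Data parallelism: S1 is submitted for all items at once.
        s1_futures = {pool.submit(s1, data): name for name, data in items.items()}
        branches = {}
        # Service parallelism (pipelining): as soon as S1 finishes on one item,
        # its successors start, even if S1 is still running on other items.
        for done in as_completed(s1_futures):
            out = done.result()
            # Workflow parallelism: S2 and S3 do not depend on each other.
            branches[s1_futures[done]] = (pool.submit(s2, out), pool.submit(s3, out))
        return {name: (f2.result(), f3.result())
                for name, (f2, f3) in branches.items()}


def group(*services):
    """Job grouping: chain sequential services into one submittable unit."""
    def grouped(x):
        for service in services:
            x = service(x)
        return x
    return grouped
```

Calling `enact({"D0": 0, "D1": 1})` processes both items through the whole graph. `group(s1, s2)` returns a single callable that can be submitted as one job instead of two, so the per-job submission latency of the grid is paid once for the sequential branch.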
Conclusions
•
Using the European grid infrastructure for production
– Regular usage in medical imaging community
– Sustained production since 2004
– Closer to clinical end-users
RSNA'07 demonstration (EGEE-MEDICUS)
Molecular docking for drug discovery
•
Higher level services
– More data protection (Role-based access control)
Considering multiple actors involved in clinical procedures (patient, doctor, surgeon, radiologist, nurse, anesthetist...)
– Data semantics
Metadata associated to data, ontologies, content-based data identification, indexing, mining...
– User interfaces