PARIS*: Programming pArallel and distRibuted
systems for large scale numerical sImulation
applicationS
Kerrighed, Vigne
Christine Morin
IRISA/INRIA
Members of PARIS Project (sept 05)
Scientific leader
T. Priol (DR INRIA)
Researchers
F. André (Prof IFSIC), G. Antoniu (CR INRIA), J-P. Banâtre (Prof IFSIC), M. Bertier (MdC INSA), L. Bougé (Prof ENS), Y. Jégou (CR INRIA),
A-M. Kermarrec (DR INRIA), C. Morin (DR INRIA), J.L. Pazat (MdC INSA), C. Perez (CR INRIA)
Post-docs
A. Viana, A. Ribes, G. Vallée
Engineers
D. Margery (IR INRIA), P. Morillon (IE IFSIC),
P. Gallard (DGA), V. Lefèvre (G5K - IFSIC), R. Lottiaux (DGA), G. Mornet (G5K - PRIR), P. Palosaari (CoreGRID), J. Parpaillon (Ing. Associé)
PhD candidates
H-L. Bouziane (INRIA 2), J. Buisson (MENRT 3), Y. Busnel (ENS 1), L. Cudennec (INRIA-Région 1), M. Fertré (MENRT 1), M. Jan (MENRT 3), E. Jeanvoine (CIFRE 2),
S. Lacour (INRIA 3), E. Le Merrer (CIFRE FT), S. Monnet (INRIA-Région 3), Y. Radenac (MENRT 3), E. Riviere (MENRT 2), L. Rilling (ENS 3)
Studied Systems
Clusters
A set of interconnected PCs used as a single computing
resource
Grid
A set of resources (processor, memory, disk, …)
interconnected via Internet
P2P systems
Research Directions
Single system image operating system
Problem: clusters are difficult to program/use
Challenge: giving the illusion that a cluster is a single machine
Component based middleware
Problem: code coupling applications are complex
Challenge: how to facilitate the design of such applications while providing high
performance?
Advanced programming models
Problem: current programming models are not adequate for highly dynamic systems
Challenge: how to express computing/coordination in such an environment?
Data sharing service
Problem: data sharing in large scale grids
Challenge: sharing mutable data
P2P systems
Problem: mastering and optimizing a P2P system
Challenge: characterizing a P2P system and searching for relevant information
Experimental grid platform
Problem: need to experiment to validate our research results
Challenge: building a reconfigurable grid
Grid 5000 Experimental Platform
Contribution to the construction of
Grid 5000
9 sites, 5000 processors
Rennes site
500 processors (PowerPC, Xeon, Opteron)
Dual processor nodes
Participants
Researcher: Y. Jégou
Engineers: V. Lefèvre, D. Margery, P. Morillon, G. Mornet
[Figure: Grid-5000 sites with processor counts (one site with 1000, eight with 500); labels: Grid-5000, Rennes]
International Collaborations (funded)
Cluster OS
University of Ulm (Germany), ORNL (USA), Rutgers
University (USA)
Grids
Pisa University (Italy), SNU (Korea)
Large scale data management
UIUC (USA)
P2P systems
OS for Clusters and Grids
Kerrighed
Single System Image (SSI) operating system for high
performance computing on clusters
Vigne
Operating system to ease the use and programming of very large grids
Kerrighed
Objectives
Virtual shared memory multiprocessor
Global and transparent resource management
Tolerating node failures transparently for the applications
High performance
Approach
Design of distributed OS mechanisms within an
Kerrighed
Achievements
Customizable efficient full SSI operating system for high performance
computing on clusters
Small clusters (up to 256 nodes)
Advanced research prototype
Integration of the work of 3 Ph.D. students (R. Lottiaux (2001), Geoffroy
Vallée (2004), P. Gallard (2004))
Robust prototype able to execute real applications provided by EDF R&D and
DGA
Open source software
www.kerrighed.org
Stable version (K V1.0.2) based on Linux 2.4.29
Demo LiveCD based on Knoppix
Integrated in OSCAR: ssi-oscar.irisa.fr
OSCAR is a snapshot of methods for building, programming, and using
clusters. It consists of a fully integrated and easy to install software bundle designed for high performance cluster computing.
Efficient Operating System
Comparison with other SSI for clusters
OpenSSI, openMosix
Results published in CC-GRID 2005
Internship of Benoît Boissinot
Efficient communication system
Highly reactive communication system to support Kerrighed
distributed operating system services
Compatibility of Kerrighed with efficient communication drivers
Collaborations
EDF R&D (since 2000)
PhD and post-doc grants
DGA (2003-2005)
Funding for research engineers
ORNL (S. Scott)
Integration of Kerrighed in OSCAR
University of Ulm (M. Schöttner)
Fault tolerant SDSM
Rutgers University (L. Iftode)
High availability
Invited researchers
R. Badrinath (IIT Kharagpur)
Isaac Scherson (UCI)
Current Research Directions
Fault tolerance
Large scale parallel application checkpointing
System initiated checkpoints
Checkpointing grid applications
Master & PhD Thesis of Matthieu Fertré
High availability
Current work of Pascal Gallard and Renaud Lottiaux
Tolerating hot node addition and eviction
Phenix
Investigating the backdoor approach in the context of Kerrighed
Master Thesis of Benoît Boissinot
[Figure: diagram labeled Application, SSI cluster OS, Node 1, Node 2, Node 3]
Technology Transfer
KerLabs (http://www.kerlabs.com)
Start-up founded by Pascal Gallard and Renaud Lottiaux
Software suite based on Kerrighed technologies
EasyAdmin: Global cluster management
EasyCheckpoint: Checkpoint/restart of parallel applications
EasyRun: Application deployment & scheduling on clusters
EasyCluster: the whole Kerrighed SSI solution
Optimized support for high performance networking technologies
Vigne: a Grid OS
Design and implementation of a Grid OS to ease
the use and programming of very large grids
Highly decentralized system
Algorithms based on local knowledge
Self-healing system
Dealing with multiple quasi-simultaneous reconfigurations
Single System Image
Flexible system
Vigne
Infrastructure based on decentralized overlays
Structured and unstructured overlays
Application manager for reliable application execution
Resource discovery & allocation service
On-going work (PhD Thesis of Emmanuel Jeanvoine)
Volatile data sharing service
PhD Thesis of Louis Rilling
Complex application deployment
PhD Thesis of Boris Daix (co-advised with Christian Pérez)
Collaborations
EDF R&D
Future Work
XtreemOS Project
Integrated Project (FP6 - Call 5)
Under evaluation
Goal
Building and Promoting a Linux-based Operating
System to Support Virtual Organisations for Next Generation Grids
18 partners
Academic & industrial partners
8 countries (including China)
XtreemOS Main Objectives
We will design, implement, evaluate and
distribute an open source Grid OS with native support for virtual organizations.
Development of a Grid Operating System
Enhance Linux to support VO
across multiple administrative domains
Manage very large and
Self-organizing and
self-healing system
Available on PC, SMP, clusters,
PDA and mobile phones
Experimentation and evaluation with a comprehensive set of real
use-cases provided by ISVs and end-users
Integration in well-known Linux
distributions
Mandriva, Red Flag Linux
XtreemOS software: 3 flavours
Standard flavour for PC
Federation flavour based on Kerrighed
SD flavour for small devices
XtreemOS software will make VO
management easy for administrators, and work within VOs easy, secure and
efficient.
Building a reference open source Grid OS
[Figure: diagram labeled Application, Middleware, XtreemOS, and four Computer/Linux pairs]