Resource management and runtime environments in the exascale computing era
Guillaume Huard
MOAIS and MESCAL INRIA projects, CNRS LIG Laboratory, Grenoble University, France
Introduction
Large-scale platforms have become a reality: using grids, parallel applications can run on thousands of processor cores
Two main models for grids:
  Structured: lightweight grids
  Unstructured: P2P overlay grids
This talk:
  High-performance computation on structured grids
  Anticipation of their evolution when growing to exaflop-range computing power
Structured approach
Computing centers interconnect their clusters: lightweight grids
Hierarchical structure
  Clusters of homogeneous resources
  Network and CPU disparity only among distinct clusters
Reasonable reliability
  Unavailability usually limited to a few machines
  Reliable backbone and services
Challenges in HPC on structured grids
Scalability: required for both algorithms and runtime
Adaptivity: computation and data must be balanced and placed to match
  Computing resource capabilities
  Communication link capacity
Efficiency: computation on a grid is expensive (energy consumption cost), so efficient platform usage is mandatory
Outline
1 Computing on lightweight grids
  OAR - Resource management
  TakTuk - Parallel remote executions
  KAAPI, TRIVA - Programming environment
2 Going to exascale
  Application safety and efficiency
  Middleware interactions and data management
  Green computing and platform administration
OAR: Managing resources
OAR is the batch scheduler used in Grid5000 clusters
Classical batch/interactive submission of parallel jobs
Elaborate resource query scheme (precise reservation of nodes/processors/cores, switch location, available memory, ...)
Job dependencies enabling computation workflow support
OAR also features low-level node management
  Effective node cleaning using cpuset
OAR scheduling snapshot
Enabling the grid with OAR
Efficient platform use
  Best-effort jobs: opportunistic computation
  Dynamic nodes: appropriate management of volatile resources
Large set of tasks abstraction
  Array jobs
  CiGri system: life cycle management for bags of tasks
Large parallel applications setup
  Advance reservations: enable cluster coordination
  Checkpoint/resubmit: to test global gang scheduling or fault tolerance
TakTuk: Adaptive Deployment of Parallel Executions
Node administration: launch the same command on all nodes of a platform
  uptime to grab statistics about recent machine availability
  dig, ping, ifconfig, ... to diagnose network issues
  ...
Parallel application development: launch the same parallel program on all nodes (like mpirun)
  Slaves of a master/slave application
  All participants of a symmetric parallel application
  Self-organizing system (P2P), daemons (monitoring), ...
Existing tools
Flat deployment tools: pdsh/dsh (IBM Cluster Tools suite)
  Similar to:
    for host in $hosts; do ssh "$host" command & done; wait
  Naturally pipelined by the OS: deployment in linear time
Distributed deployment: gexec (Ganglia Cluster Suite)
  Remote gexec daemons take part in the deployment: deployment tree, logarithmic time
  Requires daemon installation
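The linear versus logarithmic deployment times can be illustrated with a toy cost model. This is only a sketch under stated assumptions, not a description of how pdsh or gexec actually behave: every connection costs one abstract step, a flat launcher opens its connections one after another, and in a tree deployment every reached node opens `arity` connections sequentially. The function names `flat_steps` and `tree_steps` are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Flat deployment: a single root opens the n connections one after another,
// so the last node is reached after n steps (linear time).
std::uint64_t flat_steps(std::uint64_t n) { return n; }

// Tree deployment: every reached node opens `arity` connections, one per step.
// Returns the step at which the last of n nodes is reached in the worst case.
std::uint64_t tree_steps(std::uint64_t n, std::uint64_t arity) {
    assert(arity >= 2);
    std::uint64_t reached = 0;     // nodes reached so far (root excluded)
    std::uint64_t level_size = 1;  // nodes in the current level (the root alone at first)
    std::uint64_t depth = 0;       // number of tree levels below the root
    while (reached < n) {
        level_size *= arity;       // each node of the level connects `arity` children
        reached += level_size;
        ++depth;
    }
    // The last child of a node is reached `arity` steps after its parent.
    return depth * arity;
}
```

Under these assumptions, reaching 2000 nodes takes 2000 steps with the flat launcher and 20 steps with a binary tree, which is the gap the slide summarizes as linear versus logarithmic time.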
Optimal deployment
Theoretical optimal deployment on homogeneous machines mixes
  Parallel connection initiation
  Distribution of remote connection tasks
[Figure: connection timeline (processes versus time) for nodes 1 to 3, showing concurrent connections]
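One common way to reason about such a mix is a postal-style model. It is used here as an assumption for illustration and is not necessarily the model behind the slide: a deployed node starts one new connection per time unit, and a connection started at time s produces a deployed node at time s + latency. The names `deployed_nodes` and `deployment_time` are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Number of deployed nodes (root included) at integer time t, assuming that
// every deployed node starts one connection per time unit and that a
// connection takes `latency` time units to yield a new deployed node.
std::uint64_t deployed_nodes(unsigned t, unsigned latency) {
    assert(latency >= 1);
    // Before the first connection completes, only the root is deployed.
    std::vector<std::uint64_t> n(t + 1, 1);
    for (unsigned i = latency; i <= t; ++i) {
        // Nodes deployed at time i: those already there at i - 1, plus one
        // new node per connection started at time i - latency.
        n[i] = n[i - 1] + n[i - latency];
    }
    return n[t];
}

// Smallest time at which at least `target` nodes are deployed.
unsigned deployment_time(std::uint64_t target, unsigned latency) {
    unsigned t = 0;
    while (deployed_nodes(t, latency) < target) ++t;
    return t;
}
```

With a latency of 1 the number of deployed nodes doubles at every step. With a larger latency, nodes gain by starting several connections in parallel instead of waiting for each one to complete, which is why the optimal schedule combines both mechanisms.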
Dynamic deployment
The performance of nodes and network varies
  Heterogeneous architectures in different clusters
  Load due to the OS or hung processes (zombies, infinite loops)
  External contention (network, centralized services)
  Cache effects, swap, other users, ...
TakTuk algorithm: try to do things ASAP
  Distribute the engine (using remote executions)
  Nodes initiate several parallel connections
TakTuk deployment compared to other tools
Performance versus pdsh and gexec
[Figure: two plots of execution time (s) versus number of nodes (0 to 2000); series: taktuk (window 15), pdsh (window 64), taktuk over rsh (window 15), taktuk over ssh (window 15), gexec (arity 2)]
Advantages
  No installation required on remote nodes (can self-propagate)
  Adapts to node load, insensitive to node failures
TakTuk unique features for the grid
Heterogeneity and hierarchy
  Any part of the deployment can be statically specified (e.g. partial topology enforced by cluster front nodes)
  Logical numbering of deployed nodes
  Distinct machines can execute different commands
Application support using deployment connections
  Provides a control communication layer
KAAPI: Parallel programming library
Middleware for adaptive computation on
  Multi-core architectures
  Clusters and grids
High-level API for resource abstraction: Athapascan
  fork keyword to create parallel tasks
  shared keyword to declare shared data
Objectives
  Write once, run anywhere
  Guaranteed performance
KAAPI example: C++, fork and shared
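// Sum task: reads both subresults and writes their sum.
// Defined before Fibonacci, which forks it.
struct Sum {
  void operator()( a1::Shared_w<int> result,
                   a1::Shared_r<int> sr1, a1::Shared_r<int> sr2 )
  { result.write( sr1.read() + sr2.read() ); }
};

// Fibonacci task: forks two recursive subtasks and one Sum task.
// The runtime orders them through their accesses to the shared data.
struct Fibonacci {
  void operator()( int n, a1::Shared_w<int> result ) {
    if (n < 2) result.write(n);
    else {
      a1::Shared<int> subresult1;
      a1::Shared<int> subresult2;
      a1::Fork<Fibonacci>()(n-1, subresult1);
      a1::Fork<Fibonacci>()(n-2, subresult2);
      a1::Fork<Sum>()(result, subresult1, subresult2);
    }
  }
};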
KAAPI Workflow
A KAAPI application constructs a data flow graph
KAAPI maps tasks onto resources:
  work stealing (dynamic load balancing)
  static placement
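Work stealing can be sketched with one task queue per worker. The code below is a minimal illustration, not KAAPI's implementation: the `WorkQueue` class and the `next_task` helper are hypothetical names, and a plain mutex stands in for the more refined synchronization a real runtime would use.

```cpp
#include <cassert>
#include <deque>
#include <mutex>
#include <utility>

// Hypothetical per-worker task queue (illustrative, not a KAAPI class).
template <typename Task>
class WorkQueue {
public:
    // Owner side: newly created tasks go to the back.
    void push(Task t) {
        std::lock_guard<std::mutex> lock(mutex_);
        tasks_.push_back(std::move(t));
    }
    // Owner side: take the most recently created task.
    bool pop(Task& out) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (tasks_.empty()) return false;
        out = std::move(tasks_.back());
        tasks_.pop_back();
        return true;
    }
    // Thief side: take the oldest task from the other end of the queue.
    bool steal(Task& out) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (tasks_.empty()) return false;
        out = std::move(tasks_.front());
        tasks_.pop_front();
        return true;
    }
private:
    std::mutex mutex_;
    std::deque<Task> tasks_;
};

// A worker first serves its own queue and steals from a victim only when idle.
template <typename Task>
bool next_task(WorkQueue<Task>& own, WorkQueue<Task>& victim, Task& out) {
    assert(&own != &victim);  // a worker does not steal from itself
    return own.pop(out) || victim.steal(out);
}
```

Load balancing is dynamic because only idle workers pay the cost of looking for work: a busy worker keeps popping from its own queue, and stealing happens at the opposite end, where the oldest tasks sit.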
TRIVA: Application Execution Visualization
Collaboration with UFRGS (Brazil)
3D (2D resources / time) visualization outlines
  Application topology
  Network topology
TRIVA is generic and extensible
  Based on the Pajé generic trace description language
  Treemap views of synthetic data available
Large scale experiments
KAAPI/TakTuk winner of the GRID@Works Plugtest (ETSI event) for three consecutive years (2006-2008)
  N-Queens and financial applications on nearly 4000 cores
  2008 edition used G5K + Intrigger: mixed communications
    TakTuk communications between different grids
    TCP/IP within each grid
IDHAL experiments: coupling highly heterogeneous machines
  G5K grid
  Brazilian grids
  Luxembourg clusters
Evolution forecast for structured grids
Next step: interconnect several structured grids into a larger one
Several new issues
  Hierarchical network
    Nodes communicate with their neighbors only
    Front node forwarding for inter-grid communications
  More node failures (even during short executions)
This meets unstructured grid issues (as in P2P grids, PlanetLab)
Of course, former lightweight grid issues worsen
KAAPI ongoing work
Deepen the ’run anywhere’ concept
Node dynamicity
  Fault tolerance: checkpoint/restart of the application
    CCK: coordinated checkpoint protocol
    TIC: theft-induced checkpointing protocol (distributed)
  Interaction with the deployment tool: add/remove resources during computation
Heterogeneity handling
  Hierarchical work stealing (sensitive to high-latency networks)
  NUMA-aware scheduling
TRIVA ongoing work
Improve scalability
  User navigation in the large volume of information
  Well-chosen data aggregation for relevant overviews
Aggregation example: treemap
  Transform a data summary (e.g. number of steals) into a visually relevant square
  Can be applied at each level: core, processor, node, cluster, grid
Behavior pattern identification
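The treemap aggregation can be sketched as a bottom-up sum over the resource hierarchy followed by a proportional split of the drawing area. This is an illustrative sketch only: the `Resource` and `Rect` types and both functions are hypothetical, and the simple slice layout used here is not necessarily the layout TRIVA draws.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical resource hierarchy (grid > cluster > node > processor > core).
struct Resource {
    std::string name;
    double value;                    // metric of a leaf, e.g. number of steals
    std::vector<Resource> children;  // empty for a leaf
};

struct Rect { std::string name; double x, y, w, h; };

// Aggregated metric: the leaf's own value, or the sum over the children.
double aggregate(const Resource& r) {
    if (r.children.empty()) return r.value;
    double sum = 0.0;
    for (const Resource& c : r.children) sum += aggregate(c);
    return sum;
}

// Slice layout: each child gets a share of the width of `area` that is
// proportional to its aggregated metric.
std::vector<Rect> layout_children(const Resource& r, const Rect& area) {
    assert(area.w >= 0.0 && area.h >= 0.0);
    std::vector<Rect> out;
    const double total = aggregate(r);
    double x = area.x;
    for (const Resource& c : r.children) {
        const double share = total > 0.0 ? aggregate(c) / total : 0.0;
        out.push_back(Rect{c.name, x, area.y, area.w * share, area.h});
        x += area.w * share;
    }
    return out;
}
```

Applying `layout_children` recursively to each returned rectangle gives the same view at every level of the hierarchy, from the grid down to the cores.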
TakTuk ongoing work
Improve distributed application support
Application management extensions
  Deployment network union support
  Interface between batch scheduler and application
Data management
  Efficient broadcast of large data files
    using direct connections rather than the deployment network
    based on the algorithms of Santos et al. for k-item broadcast
OAR ongoing work
Flexibility and application support
Green OAR
  Dynamic machine power state changes (history and models)
  Scheduling sensitive to energy (consumption/speed tradeoff)
OAR API for interactions with applications
  Dynamic addition/removal of a job's resources
Cluster administration
  OAR live CD
Conclusion
Thanks for your attention. Any questions?
OAR: http://oar.imag.fr
  N. Capit, G. Da-Costa, Y. Georgiou, G. Huard, C. Martin, G. Mounié, P. Neyron, and O. Richard. A batch scheduler with high level components. In CCGrid 2005.
KAAPI: http://kaapi.gforge.inria.fr
  T. Gautier, X. Besseron, and L. Pigeon. KAAPI: A thread scheduling runtime system for data flow computations on cluster of multi-processors. In PASCO 2007.
TakTuk: http://taktuk.gforge.inria.fr
  B. Claudel, G. Huard, and O. Richard. TakTuk, adaptive deployment of remote executions. In HPDC 2009 (to appear).
TRIVA: http://triva.gforge.inria.fr
  L. M. Schnorr, G. Huard, and P. O. A. Navaux. 3D approach to the visualization of parallel applications and grid monitoring information.