Ressources management and runtime environments in the exascale computing era

(1)

Ressources management and runtime

environments in the exascale computing era

Guillaume Huard

MOAIS and MESCAL INRIA Projects CNRS LIG Laboratory Grenoble University, France

(2)

Introduction

Large scale platforms have become a reality : using grids, parallel applications can run on thousands of processors cores

Tow main models for grids : Structured : lightweight grids Unstructured : P2P overlay grids This talk :

High performance computation on structured grids

Anticipation of their evolution when growing to exaflop range computing power

(3)

Structured approach

Computing centers interconnect their clusters : lightweight grids Hierarchical structure

Clusters of homogeneous resources Network and CPU disparity only among distinct clusters

Reasonable reliability

Unavailability usually limited to few machines

Reliable Backbone and services

(4)

Challenges in HPC on structured grids

Scalability: required for both algorithms and runtime Adaptivity: computation and data must be balanced and placed to mach

Computing resources capabilities

Communication links capacity

Efficiency: computation on a grid is expensive (energy consumption cost), efficient platform usage mandatory

(5)

Outline

1 Computing on lightweight grids

OAR - Resources management TakTuk - Parallel remote executions

KAAPI, TRIVA - Programming environment

2 Going to exascale

Application safety and efficiency

Middlewares interactions and data management Green computing and platform administration

(6)

OAR : Managing resources

OAR is the batch scheduler used in Grid5000 clusters Classical batch/interactive submission of parallel jobs Elaborate resource query scheme (precise reservation of nodes/processors/cores, switch location, available memory, ...) Job dependencies enabling computation workflow support OAR also features low level nodes management

Effective nodes cleaning using cpuset

(7)

OAR scheduling snapshot

(8)

Enabling the grid with OAR

Efficient platform use

Best effort jobs : opportunistic computation

Dynamic nodes : appropriate management of volatile resources

Large set of tasks abstraction Array jobs

CiGri system : life cycle management for bag of tasks Large parallel applications setup

Advance reservations : enable clusters coordination

Checkpoint/resubmit : to test global gang scheduling or fault tolerance

(9)

Outline

2 Going to exascale

(10)

TakTuk : Adaptive Deployment of Parallel Executions

Nodes administration:

launch the same command on all nodes of a platform

uptimeto grab statistics about the recent machine availability

dig,ping,ifconfig ... for network issues diagnostic ...

Parallel applications development:

launch the same parallel program on all nodes (likempirun) Slaves of a master/slave application

All participants of a symmetric parallel application Self organizing system (P2P), daemons (monitoring) ...

(11)

Existing tools

Flat deployment tools : pdsh/dsh (IBM Cluster Tools suite) Similar to:

Foreach host in hosts do fork ssh $host command

Naturally pipelined by the OS : deployment in linear time Distributed deployment : gexec (Ganglia Cluster Suite)

Remote gexec daemons take part in the deployment : deployment tree, logarithmic time

Requires daemons installation

(12)

Optimal deployment

Theoretical optimal deployment on homogeneous machines mixes

Parallel connections initiation Distribution of remote connexion tasks processes Time node 3 node 2 node 1 Concurrent connection

(13)

Dynamic deployment

The performance of nodes and network vary

Heterogeneous architectures in different clusters

Load due to OS or hanged processes (zombies, infinite loop) External contention (network, centralized services)

Cache effects, swap, other users, ... TakTuk algorithm : try to do things ASAP

Distribute the engine (using remote executions) Nodes initiate several parallel connections

(14)

TakTuk deployment compared to other tools

Performance versus pdsh and gexec

taktuk, window 15 5 10 15 20 0 200 400 600 800 1000 1200 1400 1600 1800 2000 Execution time (s) Number of nodes pdsh, window 64 0 taktuk, rsh, window 15 0.5 1 1.5 2 0 200 400 600 800 1000 1200 1400 1600 1800 2000 Execution time (s) Number of nodes gexec, arity 2 taktuk, ssh, window 15 0 Advantages

No installation required on remote nodes (can self-propagate) Adapts to nodes load, insensitive to nodes failures

(15)

TakTuk unique features for the grid

Heterogeneity and hierarchy

Any part of the deployment can be statically specified (e.g. partial topology enforced by cluster front nodes)

Deployed nodes logical numbering

Distinct machines can execute different commands Applications support using deployment connexions

Provides control communications layer

(16)

Outline

2 Going to exascale

(17)

KAAPI : Parallel programming library

Middleware for adaptive computation on Multi-core architectures

Clusters and grids

High level API for resources abstraction : Athapascan

fork keyword to create parallel tasks

sharedkeyword to declare shared data Objectives

Write once, run anywhere Guaranteed performances

(18)

KAAPI example : C++,

fork

and

shared

struct Fibonacci {

void operator()( int n, a1::Shared w<int> result ) {

if (n < 2) result.write(n);

else { a1::Shared<int> subresult1;

a1::Shared<int> subresult2;

a1::Fork<Fibonacci>()(n-1, subresult1);

a1::Fork<Fibonacci>()(n-2, subresult2);

a1::Fork<Sum>()(result, subresult1, subresult2);

} } };

struct Sum {

void operator()( a1::Shared w<int> result,

a1::Shared r<int> sr1, a1::Shared r<int> sr2 )

{ result.write( sr1.read() + sr2.read() ); } }

(19)

KAAPI Workflow

KAAPI Application constructs a data flow graph

KAAPI maps tasks on resources:

workstealing (dynamic load balancing) static placement

(20)

TRIVA : Application Execution Visualization

Collaboration with UFRG

(Brasil)

3D (2D Resources / Time) for visualization outlines

Application topology Network topology

TRIVA is generic and extensible Based on Paj`e generic traces description language

Treemap views of synthetic data available

(21)

Large scale experiments

KAAPI/TakTuk winner of the GRID@Works Plugtest (ETSI Event) for three consecutive years (2006-2008)

N-Queens and Financial applications on near 4000 cores 2008 edition used G5K + Intrigger : mixed communications

TakTuk communications between different grids TCP/IP within each grid

IDHAL experiments : coupling highly heterogenous machines G5K grid

Brasilian grids Luxembourg clusters

(22)

Outline

2 Going to exascale

(23)

Evolution forecast for structured grids

Next step: interconnect several structured grids into a larger one Several new issues

Hierarchical network

Nodes communicate with their neighbors only Front node forwarding for inter grid communications

More nodes failures (even during short executions)

This meets unstructured grids issues (as in P2P grids, PlanetLab) Of course, former lightweight grid issues worsen:

(24)

KAAPI ongoing works

Deepen the ’run anywhere’ concept Nodes dynamicity

Fault tolerance : checkpoint/restart application

CCK : coordinated checkpoint protocol TIC : theft induced protocol (distributed)

Interaction with the deployment tool : add/remove resources during computation

Heterogeneity handling

Hierarchical work stealing (sensitive to high latency networks) NUMA aware sheduling

(25)

TRIVA ongoing works

Improve scalability

user navigation in the large volume of informations well chosen data aggregation for relevant overviews Aggregation example : treemap

Transform data summary (e.g. number of steals) into visually relevant square Can be applied at each level: core, processor, node, cluster, grid

Behavior patterns identification

(26)

TakTuk ongoing works

Improve distributed applications support Applications management extensions

Deployment networks union support

Interface between batch scheduler and application Data management

Efficient broadcast of large data files

using direct connections rather than deployment network based on Santos and al. algorithms for K item broadcast

(27)

OAR ongoing works

Flexibility and application support Green OAR

Dynamic machines power state changes (history and models) Scheduling sensitive to energy (consumption/speed tradeoff) OAR API for interactions with applications

dynamic job’s resources addition/removal Clusters administration

OAR live CD

(28)

Conclusion

Thanks for your attention, any question ?

OAR: http://oar.imag.fr

N. Capit, G. Da-Costa, Y. Georgiou, G. Huard, C. Martin, G. Mounier, P. Neyron, and O. Richard. A batch scheduler with high level components In CCGrid 2005

KAAPI: http://kaapi.gforge.inria.fr T. Gautier, X. Besseron, and L. Pi-geon. KAAPI: A thread scheduling runtime system for data flow compu-tations on cluster of multi-processors In PASCO 2007

TakTuk: http://taktuk.gforge.inria.fr B. Claudel, G. Huard, and O. Richard. Taktuk, adaptive deployment of re-mote executions

In HPDC 2009 (to appear) TRIVA: http://triva.gforge.inria.fr L. M. Schnorr, G. Huard, and P. O. A. Navaux. 3d approach to the visualiza-tion of parallel applicavisualiza-tions and grid monitoring information