Charm++, what’s that?!
Les Mardis du dev’
François Tessier - Runtime team
1 Introduction
2 Charm++
3 Basic examples
4 Load Balancing
Outline
1 Introduction 2 Charm++ 3 Basic examples 4 Load Balancing 5 ConclusionParallel programming
Decomposition
What to do in parallel
Mapping
Which processor execute each task
Scheduling
The order
Machine dependent expression
Introduction
Scalable execution of parallel applications
Cluster computing
Number of cores is increasing
Butmemory per coreis decreasing (or increasing slowly)
Applications need to communicate more and more...
10 100 1000 10000 100000 1e+06 1e+07 1990 1995 2000 2005 2010 2015
#Cores / Memory per core (in MB)
Years
Number of cores and amount of memory per core of the first ranking cluster in the Top500 list since 1993
#Cores Memory per core
How to communicate between nodes?
Message passing paradigm Inter-process communication MPI, SOAP, Charm++
Message passing interface
Process 1 Process 2 Process 3 Process 4
Outline
1 Introduction 2 Charm++ 3 Basic examples 4 Load Balancing 5 ConclusionCharm++
Presentation
Developed at the PPL (Urbana-Champaign, IL) High-level abstraction of parallel programming C++, Python
Presentation
Developed at the PPL (Urbana-Champaign, IL) High-level abstraction of parallel programming C++, Python
How to define Charm++?
Object-oriented
Based on the C++ programming language
Objects called chares (contain data, send and receive messages, perform task)
Charm++
Presentation
Developed at the PPL (Urbana-Champaign, IL) High-level abstraction of parallel programming C++, Python
How to define Charm++?
Object-oriented
Based on the C++ programming language
Objects called chares (contain data, send and receive messages, perform task)
Asynchronous message passing
Chares exchange messages to communicate
Presentation
Developed at the PPL (Urbana-Champaign, IL) High-level abstraction of parallel programming C++, Python
How to define Charm++?
Object-oriented
Based on the C++ programming language
Objects called chares (contain data, send and receive messages, perform task)
Asynchronous message passing
Chares exchange messages to communicate
Messages sent/received asynchronously to the code execution
Parallel
Chunks of code Messages
Charm++
Presentation
Developed at the PPL (Urbana-Champaign, IL) High-level abstraction of parallel programming C++, Python
How to define Charm++?
Object-oriented
Based on the C++ programming language
Objects called chares (contain data, send and receive messages, perform task)
Asynchronous message passing
Chares exchange messages to communicate
Messages sent/received asynchronously to the code execution
Parallel
Chunks of code Messages
Programming paradigm
Way of writing program
What is a Charm++ Program?
Collection of communicant chare objects, included the "main" chare (global object space)
No importance about the number of PU or the type of interconnect (RTS) Send a message by calling another chare’s entry point function (reception point)
Charm++
The Charm++ Compilation and the Runtime System
First strenght: it works!
Application written as a collection of communicating objects
Compilation, execution : Specific target platform, physical resources (-np) RTS manages the details of the physical resources...
... and can take some decisions about :
Mapping chare objects to physical processors Load-balancing chare objects
Checkpointing Fault-tolereance ...
1 Introduction
2 Charm++
3 Basic examples
4 Load Balancing
Charm++ - Hello World!
Listing 1: Hello.h # i f n d e f _ _ M A I N _ H _ _ # d e f i n e _ _ M A I N _ H _ _ c l a s s M a i n : p u b l i c C B a s e _ M a i n { p u b l i c: M a i n ( C k A r g M s g * msg ); M a i n ( C k M i g r a t e M e s s a g e * msg ); }; # e n d i f // _ _ M A I N _ H _ _ Listing 2: Hello.ci m a i n m o d u l e m a i n { m a i n c h a r e M a i n { e n t r y M a i n ( C k A r g M s g * msg ); }; }; Listing 3: Hello.C # i n c l u d e " m a i n . d e c l . h " # i n c l u d e " m a i n . h " // E n t r y p o i n t of C h a r m ++ a p p l i c a t i o n M a i n :: M a i n ( C k A r g M s g * msg ) { // P r i n t a m e s s a g e for the u s e r C k P r i n t f (" H e l l o ␣ W o r l d !\ n "); // E x i t the a p p l i c a t i o n C k E x i t (); } // C o n s t r u c t o r n e e d e d for c h a r e o b j e c t // m i g r a t i o n ( i g n o r e for now ) // N O T E : T h i s c o n s t r u c t o r d o e s not n e e d // to a p p e a r in the ". ci " f i l e M a i n :: M a i n ( C k M i g r a t e M e s s a g e * msg ) { } # i n c l u d e " m a i n . def . h "Listing 4: 1Darray.ci m a i n m o d u l e h e l l o { r e a d o n l y C P r o x y _ M a i n m a i n P r o x y ; r e a d o n l y int n E l e m e n t s ; m a i n c h a r e M a i n { e n t r y M a i n ( C k A r g M s g * m ); e n t r y v o i d d o n e (v o i d); }; a r r a y [1 D ] H e l l o { e n t r y H e l l o (v o i d); e n t r y v o i d S a y H i (int f r o m ); }; }; Listing 5: 1Darray.C # i n c l u d e < s t d i o . h > # i n c l u d e" h e l l o . d e c l . h " /* r e a d o n l y */ C P r o x y _ M a i n m a i n P r o x y ; /* r e a d o n l y */ int n E l e m e n t s ; /* m a i n c h a r e */ c l a s s M a i n : p u b l i c C B a s e _ M a i n { p u b l i c: M a i n ( C k A r g M s g * m ) { if( m - > a r g c >1 ) n E l e m e n t s = a t o i ( m - > a r g v [ 1 ] ) ; d e l e t e m ; C k P r i n t f (" Run ␣ on ␣ % d ␣ p r o c e s s o r s ␣ for ␣ % d ␣ el .\ n ", C k N u m P e s () , n E l e m e n t s ); m a i n P r o x y = t h i s P r o x y ; C P r o x y _ H e l l o arr = C P r o x y _ H e l l o :: c k N e w ( n E l e m e n t s ); arr [ 0 ] . S a y H i ( -1); }; v o i d d o n e (v o i d) { C k P r i n t f (" All ␣ d o n e \ n "); C k E x i t (); }; }; /* a r r a y [1 D ] */ c l a s s H e l l o : p u b l i c C B a s e _ H e l l o { p u b l i c: H e l l o () { C k P r i n t f (" H e l l o ␣ % d ␣ c r e a t e d \ n ", t h i s I n d e x ); } H e l l o ( C k M i g r a t e M e s s a g e * m ) {} v o i d S a y H i (int f r o m ) { C k P r i n t f (" \" H e l l o \" ␣ f r o m ␣ H e l l o ␣ c h a r e ␣ # ␣ % d ␣ on ␣ " " p r o c e s s o r ␣ % d ␣ ( t o l d ␣ by ␣ % d ).\ n ", t h i s I n d e x , C k M y P e () , f r o m ); if ( t h i s I n d e x < n E l e m e n t s -1) t h i s P r o x y [ t h i s I n d e x + 1 ] . S a y H i ( t h i s I n d e x ); e l s e m a i n P r o x y . d o n e (); // D o n e !
Charm++ - 1Darray
$ ./charmrun +p3 ./hello 10
Running "Hello World" with 10 elements using 3 processors. "Hello" from Hello chare # 0 on processor 0 (told by -1). "Hello" from Hello chare # 1 on processor 1 (told by 0). "Hello" from Hello chare # 2 on processor 2 (told by 1). "Hello" from Hello chare # 3 on processor 0 (told by 2). "Hello" from Hello chare # 4 on processor 1 (told by 3). "Hello" from Hello chare # 5 on processor 2 (told by 4). "Hello" from Hello chare # 6 on processor 0 (told by 5). "Hello" from Hello chare # 7 on processor 1 (told by 6). "Hello" from Hello chare # 9 on processor 0 (told by 8). "Hello" from Hello chare # 8 on processor 2 (told by 7).
1 Introduction
2 Charm++
3 Basic examples
4 Load Balancing
Charm++ - Load balancing
Load-balancing
Balance the load between the processing units
Charm++ - Load balancing
Initial state
Listing 6: NewLB.C
C r e a t e L B F u n c _ D e f ( NewLB , " C r e a t e ␣ N e w L B ")
v o i d N e w L B :: w o r k ( B a s e L B :: L D S t a t s * s t a t s ) {
C k P r i n t f (" % d ␣ c h a r e s ␣ e x e c u t e d ␣ on ␣ % d ␣ p r o c s \ n ", stats - > n_obj , stats - > n p r o c s ( ) ) ; // P r o c e s s o r a r r a y P r o c A r r a y * p a r r = new P r o c A r r a y ( s t a t s ); // O b j e c t g r a p h O b j G r a p h * ogr = new O b j G r a p h ( s t a t s ); std :: vector < Vertex >:: i t e r a t o r v _ i t ; std :: vector < Edge >:: i t e r a t o r e _ i t ;
for ( v _ i t = ogr - > v e r t i c e s . b e g i n (); v _ i t != ogr - > v e r t i c e s . end (); ++ v _ i t ) {
d o u b l e l o a d = (* v _ i t ). g e t V e r t e x L o a d (); for ( e _ i t =(* v _ i t ). s e n d T o L i s t . b e g i n (); e _ i t ! = ( * v _ i t ). s e n d T o L i s t . end (); ++ e _ i t ) { int f r o m = (* v _ i t ). g e t V e r t e x I d (); int to = (* e _ i t ). g e t N e i g h b o r I d (); if ( f r o m != to ) { C k P r i n t f (" % d ␣ m s g s ␣ s e n t ␣ f r o m ␣ % d ␣ to ␣ % d \ n ", (* e _ i t ). g e t N u m M s g s () , from , to ); } } } ogr - > v e r t i c e s [ 1 2 ] . s e t N e w P e ( 2 ) ;
Charm++ - Load balancing
Load-balancing
For iterative applications
charmrun +p64 ./App +balancer newLB +LBDebug 1 [...] Objects can migrate using a pup function (Pack and Unpack)
Listing 7: PUP function c l a s s M y C l a s s { p u b l i c: int a ; int * b ; M y C l a s s () { int a = 42;
int * b = (int *) m a l l o c ( a *s i z e o f(int)); }
v o i d pup ( PUP :: er & p ) { p | a ; P U P a r r a y ( p , b , a ); if ( p . i s U n p a c k i n g ()) { b = (int *) m a l l o c ( a *s i z e o f(int)); } } }
Load-balancing
For iterative applications
charmrun +p64 ./App +balancer newLB +LBDebug 1 [...] Objects can migrate using a pup function (Pack and Unpack) Several load balancers
GreedyLB RefineLB
Hierarchical load balancers
Thermal aware load balancers...
Charm++ - Load balancing
Load-balancing
For iterative applications
charmrun +p64 ./App +balancer newLB +LBDebug 1 [...] Objects can migrate using a pup function (Pack and Unpack) Several load balancers
GreedyLB RefineLB
Hierarchical load balancers
Thermal aware load balancers...
0 2 4 6 1 3 5 7
Groups of chares assigned to cores
Load-balancing
For iterative applications
charmrun +p64 ./App +balancer newLB +LBDebug 1 [...] Objects can migrate using a pup function (Pack and Unpack) Several load balancers
GreedyLB RefineLB
Hierarchical load balancers
Thermal aware load balancers...
Charm++ - Load balancing
Load-balancing
For iterative applications
charmrun +p64 ./App +balancer newLB +LBDebug 1 [...] Objects can migrate using a pup function (Pack and Unpack) Several load balancers
GreedyLB RefineLB
Hierarchical load balancers
Thermal aware load balancers...
0 2 4 6 1 3 5 7
Groups of chares assigned to cores
kNeighbor
Benchmarks application designed to simulate intensive communication between processes
Experiments on PlaFRIM : 8 nodes with 8 cores on each (Intel Xeon 5550)
Dumm
yLB
GreedyLB reeBased
Ex
ecution time (in seconds)
0 500 1000 1500 2000 kNeighbor on 64 cores 256 elements − 1MB message size
Outline
1 Introduction 2 Charm++ 3 Basic examples 4 Load Balancing 5 ConclusionTo conclude...
Charm++, message passing since 1984 High-level abstraction of a parallel program Interesting features : load balancing, fault-tolerance,...
Useful tools; CharmDebug, Projections Used by some important applications : ChaNGa, NAMD, OpenAtom
Want to know more?
Website : http://charm.cs.uiuc.edu/
Tutorials : http://charm.cs.illinois. edu/tutorial/TableOfContents.htm