W&M ScholarWorks
W&M ScholarWorks
Dissertations, Theses, and Masters Projects Theses, Dissertations, & Master Projects 2004Dynamic adaptation to CPU and memory load in scientific
Dynamic adaptation to CPU and memory load in scientific
applications
applications
Richard Tran MillsCollege of William & Mary - Arts & Sciences
Follow this and additional works at: https://scholarworks.wm.edu/etd
Part of the Computer Sciences Commons, and the Mathematics Commons
Recommended Citation Recommended Citation
Mills, Richard Tran, "Dynamic adaptation to CPU and memory load in scientific applications" (2004). Dissertations, Theses, and Masters Projects. Paper 1539623457.
https://dx.doi.org/doi:10.21220/s2-my1g-nn08
SCIENTIFIC APPLICATIONS
A D issertation P resented to
T h e Faculty of th e D e p artm en t of C o m puter Science T h e College of W illiam an d M ary in V irginia
In P a rtia l Fulfillm ent
O f th e R equirem ents for th e Degree of D octor of Philosophy
by
R ichard T ran Mills 2004
APPROVAL SHEET
T h is d isse rta tio n is su b m itte d in p a rtia l fulfillm ent of th e requirem ents for th e degree of
D octor of P hilosophy
R ichard T ran Mills
A pproved by th e com m itteq /N o v em b er 2004
D im itrios Nikolopemlos
( jV irg in ia Torczon
R obe'rt V o T g t^ ^ '
/ R ichard Wolski
Department of Computer Science University of California, Santa Barbara
Table o f C ontents
A ck n o w le d g m en ts viii L ist o f T ables x L ist o f F ig u res xi A b stra c t x iv 1 In tr o d u ctio n 2 1.1 C o n tr ib u tio n s ... 41.1.1 D ynam ic load balancing of inner-o u ter iterativ e s o lv e rs ... 5
1.1.2 D ynam ic m em ory a d a p ta tio n in scientific a p p l i c a t i o n s ... 5
1.2 R e l e v a n c e ... 6
1.3 O r g a n i z a t i o n ... 7
2 B ack grou n d an d rela ted w ork 9 2.1 Load b a l a n c i n g ... 9
2.1.1 Pools of in dependent tasks ... 10
2.1.2 D a ta p a r titio n in g ... 12
2.1.2.1 G eom etric p a r t i t i o n i n g ... 13
2.1.2.2 Topological p a r t i ti o n in g ... 14
2.2 A ddressing m em ory s h o r t a g e ... 19
2.2.1 O ut-of-core a lg o r ith m s ... 20
2.2.2 V irtu a l m em ory system m o d if ic a ti o n s ... 22
2.2.3 M em ory-adaptive a l g o r i t h m s ... 23
2.3 A pplication-centric ad ap tiv e scheduling ... 25
3 Load b a la n cin g flex ib le-p h a se iter a tio n s 28 3.1 Flexible-phase ite ra tio n ... 29
3.2 C oarse-grain Jacobi-D avidson im p le m e n ta tio n ... 30
3.3 Load balancing J D c g ... 33
3.3.1 E xp erim en tal e v a lu a tio n ... 34
3.4 Load balancing a fully m ultigrain Jacobi-D avidson solver ... 36
3.4.1 Benefits of m ultigrain p a r a lle lis m ... 36
3.4.2 Load im balance in th e full m u ltig rain c a s e ... 38
3.5 A load balanced Additive-Schw arz preconditioner for F G M R E S ... 39
4 Load b a la n cin g u n der m em o ry p ressu re 43 4.1 L oad-balanced JD cg un d er m em ory p r e s s u r e ... 43
4.2 A voiding th ra sh in g from w ithin th e a lg o r i th m ... 44
4.2.1 Choosing th e p aram eters ... 48
4.2.2 E xp erim en tal r e s u l t s ... 49
4.3 L im itations of th e m em ory-balancing a p p r o a c h ... 52
5 A d y n a m ic m em o ry a d a p ta tio n fram ew ork 55 5.1 A p o rta b le fram ew ork for m em ory a d a p t i v i t y ... 55
5.2 E lem ents of th e im p le m e n ta tio n ... 57
5.3 A lgorithm s for a d a p tin g to m em ory a v a il a b i l it y ... 58
5.3.2 D etecting m em ory s u r p l u s ... 63
5.3.2.1 P roblem s w ith overly aggressive p r o b in g ... 63
5.3.2.2 B alancing perform ance p enalties w ith a dynam ic probing p o l i c y ... 64
5.4 F u rth e r details of th e a d a p ta tio n a lg o r ith m s ... 68
5.4.1 E stim atin g th e sta tic m em ory s i z e ... 69
5.4.2...C alculating R p e n ... 70
5.4.2.1 A low frequency probing a p p r o a c h ... 70
5.4.2.2 P ro b in g w ith higher f r e q u e n c y ... 72
6 D esig n o f M M lib: A m em o ry -m a llea b ility library 77 6.1 T h e general fram ew ork su p p o rte d by M Ml i b ... 77
6.2 D a ta s t r u c t u r e s ... 78
6.3 Core... interface and f u n c tio n a lity ... 79
6.4 O p tim izations a n d a d d itio n al fu n c tio n a lity ... 80
6.4.1 M em ory access a t sub-panel g r a n u l a r i t y ... 80
6.4.2 E viction of “m ost-m issing” panels ... 83
7 E x p e rim e n ta l ev a lu a tio n o f m e m o ry -a d a p ta tio n 86 7.1 A pplication k e r n e l s ... 86
7.1.1 C onjugate G radient (CG)... 86
7.1.2 M odified G ram -Schm idt (MGS) ... 87
7.1.3 M onte-C arlo Ising m odel sim ulation (ISING) ... 89
7.2 E xp erim en tal environm ents a n d m e th o d o lo g y ... 91
7.3 C hoosing th e M Ml i b p a r a m e t e r s ... 94
7.3.1 Panel s i z e ... 95
7.3.2 R p e n - m a x... 97
7.3.3.2 Highly variable m em ory p r e s s u r e ... 100
7.4 G raceful d eg radation of p e r f o r m a n c e ... 102
7.4.1 Effects of panel w rite -fre q u e n c y ... 103
7.5 Q uick response to tra n sie n t m em ory p r e s s u r e ... 106
7.6 A daptive versus ad aptive j o b s ... 108
7.7 Perform ance w ith LR U -friendly access p a t t e r n s ... 110
8 C on clu sion s and fu tu re w ork 116 8.1 Flexible-phase load b a l a n c i n g ... 116
8.2 A dynam ic m em o ry -ad ap tatio n fra m e w o rk ... 117
A p p e n d ix A 120
B ib lio g ra p h y 125
ACKNOWLEDGMENTS
T his work reflects th e influence of a great m any people on my life, an d I regret th a t it is not possible to nam e all of th em here; all of th e m deserve th anks.
I wish to th a n k all of th e m em bers of my d isse rta tio n com m ittee for th e ir insightful com m ents an d guidance, which have im proved th is work trem endously. Forem ost am ong th em I th a n k my advisor, A ndreas Stathopoulos, for teaching me a great deal ab o u t scientific com puting an d for tak in g an active an d en th u siastic role in guiding my research. T h is work reflects a great deal of effort on his p a rt. I th a n k D im itris Nikolopoulos for serving as alm ost a second advisor d u rin g th e m em ory-related po rtio n s of this work. R o b ert Voigt deserves special th an k s not only for his com m ents on th is work, b u t also for valuable guidance in my career a n d professional developm ent. He has always being willing to listen to my problem s w hen I encountered bum ps on my ro ad th ro u g h g rad u a te school. I th a n k V irginia Torczon for her helpful com m ents, and for teaching me some of th e subtleties of scientific com puting w hen I was th e grader for her “Scientific C o m puting” course. Rich W olski helped m otivate th is work w ith his research on G rid m iddlew are, a n d deserves special th a n k s for agreeing to serve on my com m ittee w ithout knowing a n y th in g a b o u t me.
O utside of my com m ittee, oth er people a t W illiam & M ary deserve th anks. Evgenia Sm irni ta u g h t me me th e basics of parallel com puting an d collaborated on th e early stages of th e load balancing p o rtio n of th is work. X iaodong Zhang im pressed u p o n me th e im p ortance of th e interplay betw een algorithm s and arch itectu re in his excellent “Advanced C om puter A rchitecture” course. Shiwei Zhang of th e D e p artm en t of Physics exposed me to m any areas of c o m p u tatio n al science— of which I m ight otherw ise have rem ained ig n o ran t— in his excellent course on c o m p u tatio n al physics. Tom C rockett kept SciClone up an d ru n n in g a n d was always ready to resolve technical issues w ith skill and enthusiasm ; additionally, he provided m uch-needed friendly conversation on th e u p p er floor of Savage House. Jim M cCom bs provided th e Jacobi-D avidson code used in th is work, co llab o rated on th e load- balancing work, and was an am icable office m ate.
I spent th e sum m er of 2003 w orking a t Los Alamos N ational L ab o rato ry w ith P eter L ichtner. A lthough th e work th a t I did th ere was u n related to my d isse rta tio n research, th e experience was im p o rta n t because it rekindled my enthusiasm for research after several years in g rad u a te school h ad caused it to wane. I th a n k P e te r for serving as a m entor w ithout parallel d u rin g an incredibly intellectually stim u latin g sum m er.
I th a n k Uncle M aury for stim u latin g my interest in physics in general an d co m p u tatio n al physics in particu lar; th a t led me down th e road to becom ing a c o m p u tatio n al scientist.
I th a n k my wife, Saffron, for her com panionship, love, and su p p o rt th o u g h some difficult tim es as we suffered th ro u g h our g rad u a te studies together. T h e years th a t we have spent in g rad u a te school have been very trying, b u t th an k s to her co n stan t com panionship th ey have also been th e best ones of my life.
and learning u p o n me a t a young age, an d th eir gentle encouragem ent is responsible for all of my academ ic (and other) success. I th a n k my m other especially for th e countless hours she spent teaching me reading, arith m etic, an d so on w hen I was a young child; I am very fo rtu n a te to have h a d such a p a tie n t teacher of these fundam entals. M y fath er deserves credit for teaching me to th in k ratio n ally a n d scientifically, a n d to view th e world w ith curiosity. It is his exam ple th a t inspired me to pursue a career in th e sciences. He has served as my first a n d closest m entor, answ ering my endless questions w hen I was four years old, reading great books to me an d my sister, encouraging my am bitious science projects in prim ary an d secondary school, helping me select courses to take as an u n d e rg ra d u a te , collaborating w ith me on my senior thesis work, a n d helping me th ro u g h some to u g h tim es in g rad u a te school. His influence on my th in k in g has been inestim able.
This work was supported in part by a D epartm ent o f Energy C om putational Science Graduate Fellowship, provided under grant num ber D E -FG 02-97E R25308 and adm inistered by the wonderful folks at K rell Institu te. Som e o f the work was perform ed using the Sci- Clone com putational cluster at the College o f W illiam & M ary, which was enabled by grants from Sun M icrosystem s, the N ational Science Foundation, and V irginia’s Com monwealth
List o f Tables
3.1 Perform ance of JD cg w ith an d w ith o u t load balancing on a system subject to various s ta tic ex tern al l o a d s ... 36 3.2 Perform ance of JD cg w ith and w ith o u t load balancing w ith com e-and-go
dum m y jobs on each n o d e ... 37 3.3 Perform ance of th e m ultigrain JD ru n n in g on different node configurations,
w ith a n d w ith o u t load b a l a n c i n g ... 40 3.4 Perform ance of FG M R ES w hen preconditioned w ith ASM w ith an d w ithout
load balancing ... 42 4.1 Perform ance of JD cg w ith and w ith o u t load balancing w ith some nodes ex
tern a lly loaded by sta tic , m em ory intensive dum m y j o b s ... 45 4.2 Perform ance of JD cg using th e a n ti-th rash in g scheme w ith a com e-and-go
m em ory intensive dum m y jo b on one n o d e ... 50 4.3 Perform ance of JD cg w ith th e “ideal” a n ti-th rash in g s c h e m e ... 51 4.4 Im provem ents in execution tim e of th e dum m y jo b because of increased pro
cessor th ro u g h p u t due to our an ti-th rash in g scheme ... 52 7.1 C om parison of th e two probing types u n d er highly variable m em ory pressure 103
3.1 T h e d a ta parallel Jacobi-D avidson alg o rith m ... 32 3.2 SciClone: T h e W illiam a n d M ary heterogeneous cluster of w orkstations . . 39 5.1 D etecting and responding to m em ory shortage w ith th e naive (and incorrect)
“solution” ... 60 5.2 T h e alg o rith m for detectin g an d responding to m em ory s h o r t a g e ... 61 5.3 A n exam ple of detectin g and responding to m em ory shortage using th e algo
rith m in F igure 5 . 2 ... 62 5.4 G rap h s depicting th e necessity of a dynam ic delay p a ra m ete r in th e algorithm
for detectin g m em ory a v a il a b i l it y ... 65 5.5 A n exam ple of how peakRSS is d e te r m in e d ... 67 5.6 T h e com bined alg o rith m for detectin g and a d a p tin g to m em ory shortage or
s u r p l u s ... 68 5.7 O verhead as a function of size of th e region u p o n which m in c o r e O is called 71 5.8 An exam ple of how troughR S S is d e t e r m i n e d ... 73 5.9 T h e final, com plete alg o rith m for detectin g and a d a p tin g to m em ory shortage
or surplus, using low frequency probing ... 75 5.10 T h e final, com plete alg o rith m for detectin g a n d a d a p tin g to m em ory shortage
or surplus, using high frequency p r o b i n g ... 76 6.1 E xten d ed fram ew ork m odeling th e m em ory access needs of a wide variety of
6.2 Benefits of using sm aller m em ory tran sfer u n i t s ... 82
6.3 Benefits of evicting p a rtially sw apped out p a n e l s ... 85
7.1 M atrix-vector m ultiplication alg o rith m for a sparse m a trix of dim ension N consisting of a num ber of d ia g o n a ls ... 88
7.2 T h e alg o rith m executed by our MGS test code, which sim ulates th e behavior of a G M R E S-type solver, generating ran d o m vectors of dim ension N which are added to an orth o n o rm al basis via m odified G r a m - S c h m id t ... 89
7.3 T h e alg o rith m for executing a M etropolis sweep th ro u g h th e LxL spin lattice of th e Ising m odel ... 91
7.4 Tw o different m odes in th e behavior of th e v irtu a l m em ory system observed un d er th e sam e experim ental s e tt in g s ... 93
7.5 R elative frequencies of ite ratio n tim es for m em ory-adaptive CG ru n n in g on a Linux 2.4 m achine w ith 128 MB RAM against a memlock jo b consum ing 70 M B ... 94
7.6 Effect of panel size on perform ance of M M LlB-enabled C G ... 96
7.7 Effect of panel size on perform ance of M M LlB-enabled I S I N G ... 97
7.8 Effects of R pen.m ax on M M lib p e r f o r m a n c e ... 99
7.9 C om parison of perform ance using th e two m ethods of calculating R pen . . . 101
7.10 RSS versus tim e for th e two types of probing in m em ory ad ap tiv e CG under highly variable m em ory l o a d ... 104
7.11 Perform ance of th e ap p lication kernels un d er co n stan t m em ory pressure . . 105
7.12 Effects of w rite frequency on M Ml i b perform ance in I S I N G ... 107
7.13 Q uick response by M Ml i b to tra n sie n t m em ory p r e s s u r e ... 108
7.14 Profiles of RSS vs. tim e for two instances of m em ory ad ap tiv e ISING jobs ru n n in g sim ultaneously, using low frequency p r o b i n g ... I l l 7.15 Profiles of RSS vs. tim e for two instances of m em ory ad ap tiv e ISING jobs ru n n in g sim ultaneously, using high frequency p r o b in g ... 112
an LR U -friendly access p a t t e r n ... 114 7.17 Perform ance of m em ory-adaptive CG versus m em ory pressure w hen using
th e w rong panel replacem ent p o lic y ... 115 A .l P seudocode depicting how th e L Blib lib ra ry can be used to load balance th e
ABSTRACT
As com m odity com puters an d netw orking technologies have becom e faster a n d m ore affordable, fairly capable m achines have become nearly u biquitous while th e effective “dis tan ce” betw een th e m has decreased as netw ork connectivity a n d capacity has m ultiplied. T here is considerable interest in developing m eans to readily access such vast am ounts of com puting power to solve scientific problem s, b u t th e com plexity of these m odern com p u tin g environm ents pose problem s for conventional com puter codes designed to ru n on a static, hom ogeneous set of resources. One source of problem s is th e heterogeneity th a t is n a tu ra lly present in these settings. M ore p roblem atic is th e co m p etitio n th a t arises be tween program s for shared resources in these sem i-autonom ous environm ents. F lu c tu a tio n s in th e availability of C P U , memory, an d o th er resources can cripple applicatio n perfor m ance. C ontention for C P U tim e betw een jobs m ay introduce significant load im balance in parallel applications. C ontention for lim ited m em ory resources m ay cause even m ore severe perform ance problem s: If th e requirem ents of th e processes ru n n in g on a com pute node exceed th e am ount of m em ory present, th ra sh in g m ay increase execution tim es by an order of m agnitude or more.
O ur goal is to develop techniques th a t enable scientific applications to achieve good perform ance in non-dedicated environm ents by m onitoring system conditions and a d a p tin g th e ir behavior accordingly. In th is work, we focus on two im p o rta n t shared resources, C P U a n d memory, an d pursue our goal on two d istin ct b u t com plem entary fronts: F irst, we present some sim ple algorithm ic m odifications th a t can significantly im prove load balance in a class of iterativ e m ethods th a t form th e co m p u tatio n al core of m any scientific and engineering applications. Second, we introduce a p o rta b le fram ew ork for enabling scientific applications to dynam ically a d a p t th eir m em ory usage according to cu rren t availability of m ain memory. An application-specific caching policy is used to keep as m uch of th e d a ta set as possible in m ain memory, while th e rem ainder of th e d a ta are accessed in an out-of-core fashion. T h is allows a graceful deg rad atio n of perform ance as m em ory becomes scarce.
We have developed m odular code libraries to facilitate im plem entation of our techniques, an d have deployed th em in a variety of scientific ap plication kernels. E x p erim en tal eval uatio n of th e ir perform ance indicates th a t our techniques provide some im p o rta n t classes of scientific applications w ith ro b u st and low-overhead m eans for m itig a tin g th e effects of fluctuations in C P U a n d m em ory availability.
Chapter 1
In trod u ction
T h e advent of powerful b u t inexpensive w orkstations a n d fast, affordable netw orking tech nologies has fundam entally altered th e landscape of parallel a n d d istrib u te d com puting. Powerful com m odity com ponents have led to th e em ergence of clusters of w orkstations (COW s) as th e p rim a ry m eans of parallel com puting a t m any in stitu tio n s. O n a m uch grander scale, researchers have been tak in g steps tow ards geographically d istrib u te d com puting, building c o m p u tatio n al “G rid s” th a t link com puting resources betw een different in stitu tio n s. G rid environm ents prom ise to provide th e co m p u tatio n al resources necessary for analyzing petabyte-scale d a ta sets and conducting th e m ost co m p u tatio n ally dem anding of sim ulations. T h e im pact of these technological developm ents has been p a rticu la rly felt in scientific applications, which form a great p o rtio n of th e high-perform ance com puting workload.
T he category of “scientific ap p licatio n s” is large an d evolving, an d therefore difficult to characterize accurately. We m ake, however, th e following b ro ad observations a b o u t scientific applications in parallel an d high-perform ance com puting (H PC ) environm ents: F irst, th e applications often work w ith large q u an tities of d a ta , because realistic sim ulation of a problem often requires a large num ber of degrees of freedom. Second, th ey te n d to follow fairly regular, rep etitiv e d a ta access an d com m unication p a tte rn s. T h ird , th ey usually have synchronization points at which collective com m unications occur.
M any m odern parallel com puting environm ents possess a degree of com plexity th a t poses problem s for conventional scientific applications designed to ru n on dedicated, uniform
re-sources such as m assively-parallel processor (M P P ) m achines. O ne source of problem s is th e heterogeneity th a t arises n a tu ra lly in these new settings. For instance, upgrades to a C O W m ay occur in stages, w ith only specific m achines being replaced or added. In oth er cases, th e cluster m ay sim ply be a loosely coupled netw ork of personal w orkstations. In a co m p u tatio n al G rid, d isp a ra te resources from different in stitu tio n s m ay be used in conjunc tion. T rad itio n al parallel program s a tte m p t to d istrib u te work evenly betw een all nodes because th ey ta rg e t environm ents in which com pute nodes are hom ogeneous in processing power an d m em ory capacity. W hen nodes differ considerably in processing speed, such a stra te g y can result in severe load im balance as faster processors idle a t synchronization points, w aiting for th e ir slower b re th re n to finish. If nodes differ considerably in m em ory capacity, nodes w ith higher m em ory capacities m ay be underutilized.
Beyond a rc h ite c tu ra l heterogeneity, co m p etitio n for non-dedicated, sh ared resources poses even g reater challenges. Various economic a n d p ractical considerations have m ade m ultiprogram m ing of resources com m onplace: In decentralized system s, such as G rid en vironm ents or loose netw orks of personal w orkstations, space sh arin g sim ply m ay not be a viable option. Ow ners of sm aller CO W s m ay not be willing to sacrifice th e tim e necessary to ad m inister a b atch queuing system , or to to le rate th e slower tu rn a ro u n d tim es re su lt ing from w aiting in th e queue, especially du rin g th e developm ent a n d debugging cycles. In shared m em ory m achines, even when C PU s are space-shared, jobs still com pete for m em ory a n d I /O resources.
F lu c tu a tio n s in th e availability of C P U , m emory, and oth er resources can cripple a p plication perform ance. C ontention for C P U tim e betw een jobs m ay introduce significant load im balance in parallel applications. C ontention for lim ited m em ory resources m ay cause even m ore severe perform ance problem s: If th e requirem ents of th e processes ru n n in g on a com pute node exceed th e am ount of m em ory present, th ra sh in g m ay increase execution tim es by an order of m agnitude or more.
C H A P T E R 1. IN T R O D U C T IO N 4
m odern parallel environm ents necessitates a degree of system awareness an d a d a p tiv ity th a t is not present in to d a y ’s software. O ur goal is to develop techniques th a t allow scientific applications to achieve good perform ance in non-dedicated environm ents by m onitoring system conditions an d ad a p tin g th eir behavior accordingly. In th is work, we focus on two im p o rta n t shared resources, C P U an d memory, a n d pursue ou r goal on two d istin ct b u t com plem entary fronts:
1. Application-level dynam ic load balancing. We present some sim ple algorithm ic m odi fications which can significantly im prove load balance in a class of iterativ e m ethods. We focus on these m ethods because th ey form th e co m p u tatio n al core of m any scien tific and engineering applications; im provem ents to th e m ethods can lead to im m ediate perform ance gains in a variety of scientific codes.
2. M em ory adaptation in scientific and data-intensive codes. We introduce a p o rta b le fram ew ork for enabling applications to dynam ically a d a p t th e ir m em ory usage accord ing to cu rren t availability of m ain memory. A n application-specific caching policy is used to keep as m uch of th e d a ta set as possible in m ain memory, while th e rem ainder of th e d a ta are accessed in an out-of-core fashion. T his allows a graceful deg rad atio n of perform ance as m em ory becomes scarce.
T h e com bination of system -aw are m echanism s for dynam ic load balancing an d m em ory a d a p ta tio n has th e p o ten tia l to change th e way c o m p u tatio n al science is done by enabling efficient use of shared resources th a t have h ith e rto been underutilized. We provide some spe cific tools an d techniques for im plem enting these m echanism s in some specific cases, b u t our broader goal is to estab lish an archetype for building system -aw are, ad ap tiv e applications.
1.1
C o n tr ib u tio n s
1 .1 .1 D y n a m ic lo a d b a la n c in g o f in n e r - o u t e r it e r a t iv e s o lv e r s
We have stu d ied a novel stra te g y for dynam ic load balancing of a class of inner-o u ter ite ra tive m ethods. O ur goal has been to enable parallel, iterative, sparse m a trix m eth o d s— which form th e core of m any scientific ap plications— to perform well in heterogeneous an d m ul tiprogram m ed environm ents. O ur techniques are designed for in ner-outer iterativ e solvers th a t em ploy local solvers for th e inner iteratio n s, th o u g h th ey are also applicable to oth er algorithm s th a t possess w hat we te rm a flexible phase (defined in C h a p te r 3). We have w ritte n a C -library to facilitate use of our load balancing technique, an d have used it to im plem ent our scheme in some different solver codes.
We have experim entally d e m o n stra ted th a t our scheme can cope w ith severe load im bal ance u n d er un p red ictab le ex ternal loads in a coarse-grained, block Jacobi-D avidson eigen- solver th a t in d ependently calculates a different correction vector on each processor. T his work has been published in [63]. We have also used our load-balancing scheme to enable effective use of a collection of heterogeneous sub-clusters of w orkstations in a version of th e Jacobi-D avidson solver th a t fully em ploys th e so-called m ultigrain parallelism ; see our p a p e r [57]. Finally, we have d e m o n strated th a t our m ethod can be used in an additive- Schwarz preconditioned linear system solver to sm ooth load im balances due to differences in conditioning of th e subdom ains, a n d also to sm ooth out sm all load im balances due to differences in processor speeds.
1 .1 .2 D y n a m ic m e m o r y a d a p t a t io n in s c ie n t ific a p p lic a t io n s
We have explored a general fram ew ork th a t enables scientific a n d oth er d ata-intensive codes th a t rep etitiv ely access large d a ta sets to dynam ically a d a p t th e ir m em ory requirem ents as m em ory pressure arises. We have developed a m odular su p p o rtin g library, M Ml i b, th a t makes it easy to em bed m em ory-adaptivity in m any scientific kernels w ith little coding effort. T h e biggest technical hurdle we had to overcome to realize ou r fram ew ork is the lack of inform ation a b o u t m em ory availability provided by o p e ra tin g system s. We have
C H A P T E R 1. IN T R O D U C T IO N 6
developed an alg o rith m th a t can effectively judge m em ory availability w ith m inim al reliance on o p eratin g system -provided inform ation.
We have used our lib rary to inject m em o ry-adaptivity into th re e scientific applicatio n kernels: a conjugate-gradient linear system solver, a m odified G ram -S chm idt orthogonal- ization ro u tin e (surrounded by a GM RES-like w rap p er), and a M onte-C arlo Ising m odel sim ulation. T hough sim ple, these kernels are representative elem ents of scientific sim ula tions sp anning a diverse range of fields. E xp erim en tal evaluation of these m em ory-adaptive kernels in L inux and Solaris environm ents has shown th a t our techniques yield perform ance several tim es b e tte r th a n th a t o b tained by relying on th e v irtu a l m em ory system to handle m em ory pressure. Furtherm ore, we have d e m o n stra ted th a t these perform ance gains are not only due to enabling use of application-specific replacem ent policies, b u t from avoidance of v irtu a l m em ory system overheads and antagonism . We have also shown th a t m ultiple jobs em ploying our m em o ry -ad ap tatio n techniques can coexist on a node w ith o u t ill-effects.
O u r work is presented in [61] and [62].
1.2
R e le v a n c e
T h is work is relevant to b o th com puter scientists an d p ractitio n ers of c o m p u tatio n al science. From a com puter science perspective, it is significant because it explores a d a p ta tio n to flu ctu atin g system conditions com pletely a t th e application-level; trad itio n ally , a d a p ta tio n approaches have been m ore system -oriented. A n application-centric approach allows us to exploit knowledge ab o u t th e ap plication th a t system software does not possess.
In our work on load balancing parallel co m p u tatio n s under u n p red ictab le ex tern al loads, we utilize knowledge of th e num erics of an im p o rta n t class of iterativ e m ethods to d y n am i cally m odify th e co m p u tatio n — w ith o u t affecting th e end resu lt— to achieve good load b al ance w ith very little overhead. T his d e p a rts from conventional load balancing approaches, which adaptively schedule a co m p u tatio n b u t do not m odify it.
of th e m em ory-access characteristics of an ap plication to m anage m em ory using an a p p ro p ri a te replacem ent policy a n d to use u n its of d a ta tran sfer suited to th e g ran u la rity of m em ory access. O u r m em ory a d a p ta tio n work is ad d itio n ally significant because it has explored th e interplay of system a n d ap plication in m ultiprogram m ed system s un d er extrem e levels of m em ory dem and. System and ap plication behavior un d er such conditions can be extrem ely u np red ictab le an d has been relatively unexplored; application-specific v irtu a l m em ory m an agem ent schemes, for instance, have usually been designed w ith o u t m ultiprogram m ing in m ind.
Prom th e perspective of c o m p u tatio n al science p ractitioners, th is work is relevant b e cause it has th e p o ten tia l to increase th e ir productivity. M any researchers do not have access to dedicated supercom puters or high-end clusters and m ust m eet th eir com puting needs by using eith er local shared resources such as netw orks of w orkstations a n d sm all SM Ps, or resources sh ared across a c o m p u tatio n al grid. T he m ethods we present can en able some im p o rta n t classes of scientific applications to effectively use such resources. T hese techniques can also im prove th e p ro d u ctiv ity of researchers who do have access to dedicated su p ercom puters by allowing th em to ru n sm all to m oderate-sized jobs on readily available non-dedicated resources, allowing th em to conserve th eir sup erco m p u ter allocations and avoid long w ait tim es in com pute-center queues.
1.3
O r g a n iz a tio n
T h e rem ainder of th is docum ent is organized as follows: C h a p te r 2 relates our work to th e current s ta te of knowledge in b o th load balancing an d m echanism s for dealing w ith m em ory shortage. C h a p te r 3 explains our load balancing scheme and presents experim ental results d em o n stratin g its effectiveness in balancing C P U load. In ch a p te r 4, we illu stra te how th e load balancing scheme can be extended to a tte n u a te m em ory pressure, a n d we outline th e shortcom ings of th is approach. In chapter 5, we present th e essential details of our m em ory a d a p ta tio n fram ew ork, an d in ch ap ter 6 we provide an overview of our software
C H A P T E R 1. IN T R O D U C T IO N 8
library, M Ml ib, th a t facilitates use of th e fram ew ork. In ch ap ter 7 we present experim ental evaluations of th e m em ory a d a p ta tio n stra te g y as used in th ree scientific applicatio n kernels. Finally, conclusions are presented in ch ap ter 8.
B ackground and related work
T h e goal of our work is to allow some classes of scientific applications to perform well in m ul tiprogram m ed environm ents w here C P U a n d m em ory availability can vary unpredictably. To m eet th is goal our techniques m ust m ain tain good load balance as ex tern al C P U loads fluctuate, and a d ju st m em ory u tilizatio n gracefully as m em ory availability changes. In this chapter, we present a brief overview of related work in load balancing and m echanism s for addressing m em ory shortage. A dditionally, we describe rela te d work on application- centric ad ap tiv e scheduling techniques th a t dynam ically a d a p t schedules according to re source availability.
2.1
L oad b a la n c in g
M aintaining good load balance is essential to ensure good perform ance an d efficient u ti lization of processors by a parallel application. O therw ise, C P U cycles will be w asted as lightly loaded processors sit idle w aiting for th eir m ore heavily loaded b re th re n to finish. A significant am ount of research has focused on th e developm ent of m ethods and software for load balancing w ithin parallel applications, an d we outline some of those approaches in th is section.
Load balancing is essentially a task-m apping problem , an d is th u s inextricably linked w ith th e m anner in which a co m p u tatio n is decom posed into tasks for parallel execution. T h e problem is to m ap a num ber of tasks onto a set of processors in a way th a t ensures th a t
C H A P T E R 2. BAC K G RO U N D A N D R E L A T E D W O R K 10
each processor is assigned an am ount of work com m ensurate w ith its speed. T h is m apping usually m ust satisfy oth er objectives as well, such as lim iting th e am o u n t of com m unication required. T here are m any different strategies for accom plishing these goals, b u t we can place th em into two broad categories based on w hether th ey take a task-oriented or a d a ta - oriented approach to m apping th e tasks onto processors. T h e ty p e of approach th a t is ap p ro p riate is determ ined by th e am ount of d a ta required to describe an d perform each task, a n d th e cost of m oving th a t d a ta to th e ap p ro p riate processor. (So th e choice betw een task or d a ta o rien tatio n depends on b o th th e p roperties of th e problem to be solved and th e properties of th e ta rg e t parallel m achine.) If processors can inexpensively o b tain th e d a ta required to perform a task, a task-oriented approach can be used. O therw ise, a d a ta - oriented approach m ust be employed because th e cost of m oving th e d a ta for a task is th e m ain lim iting factor.
2 .1 .1 P o o ls o f in d e p e n d e n t ta s k s
T h e task-oriented approach is often know n as th e pool o f tasks paradigm . T h e problem to be solved is broken into a pool of in dependent tasks. Processors th a t need work fetch tasks from th e pool, an d if ad d itio n al tasks are generated d u rin g th e c o m p u tatio n , th ey are placed into th e pool. If th e size of th e tasks is sm all enough, th e pool of task s a p proach usually resu lts in excellent load balance. (Tasks th a t are too large m ay resu lt in one processor continuing to work while others sit idle because all o th er task s have been processed.) In th e sim plest im plem entation (m aster/slav e or m anager/w orker), processors fetch or receive task s from a single processor th a t acts as centralized m anager of th e pool. In a d d itio n to being very easy to im plem ent, th e m aster/slav e approach m aintains global knowledge of th e co m p u tatio n sta te , which facilitates even d istrib u tio n of tasks an d greatly simplifies te rm in a tio n detection. T h e m aster/slav e approach can n o t scale to large num bers of processors, however, because com m unication w ith th e m aster becom es a bottleneck.
divided into disjoint sets m anaged by a local su b -m aster th a t rep o rts to th e central m aster. If greater scalability is needed, a com pletely decentralized scheme can be used, m aking th e pool of task s a d istrib u te d d a ta stru c tu re . In a decentralized scheme, w hen a processor runs o u t of work it sends a p o ten tia l donor processor a work request. T h e donor processor to be polled m ay be chosen in a num ber of ways: T h e sim plest way is to choose a donor random ly, w ith all processors having equal pro b ab ility of being selected. A lternatively, donors can be chosen in a ro u n d -ro b in fashion. T h e ro u n d -ro b in scheme can em ploy a global list of processors, or m ight be restric te d to th e nearest neighbors of each processor in th e netw ork interconnect. N ote th a t if a rou n d -ro b in scheme is com pletely decentralized, a donor m ay receive work requests from m ultiple processors sim ultaneously. To prevent this, a hy b rid c e n tra liz e d /d istrib u te d system m ay be employed. In such a scheme, th e task pool is a d istrib u te d d a ta stru c tu re , b u t one processor m aintains a global variable th a t points to th e processor to which th e next work request should be sent. T h is im proves th e d istrib u tio n of work requests, though contention for access to th e global variable som ew hat lim its scalability. E x p lan atio n s and analysis of several decentralized pool of task s schemes a n d ap p ro p riate term in atio n -d etectio n algorithm s can be found in [70, 52, 51].
If a co m p u tatio n is decom posed into a large num ber of sm all tasks, fetching only one piece of work a t a tim e can resu lt in excessive scheduling overhead. In such situ atio n s, a n allocation m eth o d in which a free processor fetches a chunk of several tasks a t a tim e is usually employed. T h is reduces scheduling overhead, b u t care m ust be taken not to make th e chunks too large, because th e coarser task g ran u larity m ay resu lt in increased load im balance. K ruskal and Weiss [50] used p robabilistic analysis to d eterm ine an o p tim al chunk size th a t balances th e trade-offs betw een very large a n d very sm all chunk sizes. T heir analysis relies on th e assu m p tio n of a num ber of tasks large enough to cancel out variations in th e task execution tim es. Because in p ractice th is assu m p tio n often does not hold, o th er developm ents in chunk-scheduling em ploy schemes th a t s ta r t w ith larger chunk sizes an d th e n g radually decrease th e chunk sizes as th e co m p u tatio n progresses to prevent
C H A P T E R 2. BAC K G RO U N D A N D R E L A T E D W O R K 12
processors from finishing a t different tim es. Such approaches include guided self-scheduling [68], factoring [39], a n d trap ezo id al self-scheduling [85].
T he pool of tasks approach can m ain tain excellent load balance, even in th e face of factors such as large v ariation in th e cost of each task, processor heterogeneity, or dynam ic variation in ex tern al processor load on a tim e-shared system . U nfortunately, th e approach is inapplicable to m any scientific problem s, since th ey cannot be broken into in dependent tasks th a t can be executed asynchronously. A dditionally, th e independent tasks often cannot be described w ith an am ount of inform ation th a t is sm all enough to prevent excessive com m unication betw een m aster a n d slave processes.
2 .1 .2 D a t a p a r t it io n in g
If th e tasks in a co m p u tatio n cannot be described w ith a sm all am ount of inform ation, or if dependencies betw een task s necessitate frequent inter-processor com m unication, a d a ta p a rtitio n in g approach to load balancing should be employed: R a th e r th a n th in k in g in term s of m apping tasks to processors, we m ap th e d a ta objects in th e co m p u tatio n (e.g. m esh points or particles) onto processors. Processors th e n execute those tasks associated w ith th eir local d a ta . To load balance a co m p utation, we p a rtitio n th e objects into approxim ately equal p a rtitio n s (ensuring load balance) while m inim izing th e d a ta dependencies betw een p a rtitio n s (thus reducing com m unication costs). If processors vary in speed, th e p a rtitio n associated w ith a given processor is weighted by its relative speed. W h a t we have called d a ta p a rtitio n in g is usually referred to as dom ain p a rtitio n in g , because it is usually em ployed to assign p o rtions of a co m p u tatio n al dom ain to processors. Some of th e techniques we describe here can be applied to any a b s tra c t object th a t can be m odeled as a graph, so we use the m ore general te rm “d a ta p a rtitio n in g ” .
T h e p a rtitio n in g approaches can be grouped into two broad categories: geom etric and topological. A geom etric approach divides a dom ain based on th e locations of objects in a sim ulation, and is therefore ap p ro p riate for applications such as solid m echanics or particle
sim ulations, w here interactions betw een c o m p u tatio n al objects te n d to occur betw een neigh bors. A topological m eth o d explicitly bases its p a rtitio n in g on th e connectivity of objects in a com putation. T ypically a co m p u tatio n al dom ain is represented as a g rap h w hich is to be e q u ip artitio n ed in a way th a t m inim izes th e edge-cut, which approxim ates th e am ount of com m unication required.
2 .1 .2 .1 G eo m etric p a rtitio n in g
T h e geom etric approaches can be classified into two m ajor types: recursive bisection a p proaches an d octree approaches. T hese categories could also be described respectively as approaches th a t work w ith coarse-grained or fine-grained divisions of th e geometry. T h e basic idea of recursive bisection approaches is to cu t a dom ain into two p a rts such th a t each p a rt contains an approxim ately equal num ber of objects. Each p a rt is th e n fu rth e r cu t into two sm aller p a rts, an d so on, u n til th e required num ber of p artitio n s have been generated. T h e sim plest such m eth o d is Recursive C o o rdinate B isection (RCB) [10], also know n as C o ordinate N ested D issection (CND ). RCB recursively divides th e set of objects using a c u ttin g plane orthogonal to th e coordinate axis along which th e objects are m ost spread out. T h e cut is orthogonal to th e long axis to reduce th e size of th e boundaries betw een subdom ains, across which com m unication will have to occur. RCB is sim ple, fast, easily parallelizable, a n d requires little memory, b u t also ten d s to produce low er-quality p a rtitio n ings. A m odest im provem ent is U nbalanced Recursive B isection (URB) [42], which divides th e geometry, ra th e r th a n th e objects, in half. T h is m inim izes th e geom etric aspect ra tio of th e p a rtitio n an d th u s reduces com m unication volume, as th e am o u n t of inform ation th a t m ust be exchanged across b oundaries is approxim ately p ro p o rtio n al to th e ir length or area. T h e effectiveness of b o th RCB an d URB are lim ited by th e som ew hat artificial restrictio n th a t cuts are always m ade along coordinate axes. Recursive In ertia l B isection (RIB) [83] elim inates th is restric tio n by calculating th e p rincipal axes of in e rtia of th e collection of objects as if th ey were a collection of particles in a physical system . T h e dom ain is th e n cut
C H A P T E R 2. BAC K G RO U N D A N D R E L A T E D W O R K 14
by a bisecting plane orthogonal to th e principal axis associated w ith th e m inim um m om ent of inertia, which is a n a tu ra lly “long” direction across which to cut. G eom etric bisection m ethods need not be lim ited to th e use of c u ttin g planes. Based on some interestin g th eo retical resu lts [34], G ilbert, M iller, and Teng [32] describe a geom etric bisection scheme in which circles or spheres are used to bisect dom ains.
O ctree techniques (also know n as space-filling curve techniques) for geom etric p a rtitio n ing s ta rt w ith a fine-grained division of th e c o m p u tatio n al dom ain, an d th e n aggregate these fine-grained pieces into sets th a t will m ake u p th e p a rtitio n s. T h e fine-grained division is o b tained by recursively dividing a space into a hierarchy of o ctan ts. For a 3D dom ain, th e root o c ta n t is a cube th a t encloses th e entire dom ain. T his o c ta n t is divided into child o c ta n ts by sim ultaneously c u ttin g th e ro o t o c ta n t in h a lf along each axis (form ing eight child o c ta n ts in 3D sp ace). C hild o c ta n ts th a t contain m ultiple objects are th e n recursively divided into su b -octants, an d so on, u n til each object is enclosed by a term in al o c ta n t (i.e., one th a t is not fu rth e r su b d iv id e d ). T h e relationship of o ctan ts in th e hierarchy is described by a tree d a ta s tru c tu re know n as an octree. O ctrees are used to represent space in m esh generation, com puter graphics, and n-body sim ulations, am ong o th er applications. T hey also provide a convenient way to p a rtitio n geom etry am ong processors: By ordering the term in al o c ta n ts according to th e ir positions along a space-filling curve such as a Peano- H ilbert curve [55], a A;-way p a rtitio n in g can easily be generated by c u ttin g th e list into k equal p a rts [88]. T h e speed an d qu ality of octree p a rtitio n in g is roughly equal to th a t of geom etric recursive bisection techniques.
2 .1 .2 .2 T o p ological p a rtitio n in g
G eom etric p artitio n in g s can be calculated very quickly a n d can be qu ite effective for com p u tatio n s in which in teractio n betw een objects is a function of geom etric proxim ity, such as in n-body or solid m echanics problem s. If geom etric proxim ity is a poor in d icato r of in ter action betw een objects, however, using a m ore expensive topological m eth o d th a t explicitly
considers th e connectivity of th e objects can yield far superior partitio n in g s.
O ne of th e sim plest topological m ethods is Levelized N ested D issection (LND) [31]. LND s ta rts w ith an in itial vertex (a pseudo-peripheral vertex is the best choice), and th e n visits neighboring vertices in a b read th -first m anner. W hen h a lf of th e vertices have been visited, th e g raph is bisected into th e set of vertices th a t have been visited, an d those th a t have not. T h is process is applied recursively u n til th e desired num ber of p a rtitio n s have been created. T h e idea b eh in d LND is actu ally qu ite sim ilar to geom etric bisection m ethods: distance from th e in itial vertex is used to cut across a “long” dim ension of th e graph.
S p ectral m ethods [69, 78] use a very different approach for com puting p artitio n in g s. T h e p a rtitio n in g problem is really a discrete o p tim izatio n problem : m inim ize th e edge-cut while dividing th e g rap h into k equal p arts. T his discrete op tim izatio n problem is NP- com plete, b u t th e problem can be ap proxim ated by a continuous o p tim izatio n problem th a t is m ore easily solved. S pectral m ethods solve th e continuous problem by finding some extrem al eigenvalue/eigenvector pairs of a m atrix derived from th e connectivity inform ation of th e graph. S p ectral m ethods te n d to produce high quality p a rtitio n s, b u t solving th e eigenproblem can be co m p u tatio n ally expensive.
P a rtitio n refinem ent m ethods refine a sub-optim al p a rtitio n in g of a graph, an d are of te n used as graph p a rtitio n ers them selves by applying th em to a ran d o m p artitio n in g . K ernighan a n d Lin [48] developed an iterativ e algorithm (known as th e KL algorithm ) th a t im proves a g rap h bisection using a greedy approach to determ ine p airs of vertices to swap betw een p a rtitio n s. V ariations of th e KL alg o rith m are used for refinem ent in several g raph p a rtitio n in g contexts, some of which we discuss below.
Some of th e m ost p o p u lar g rap h p a rtitio n in g m ethods being used to d ay are m ultilevel m ethods [37, 45]. M ultilevel m ethods form a sequence of increasingly coarse approxim ations of th e original g rap h by collapsing to g eth er selected vertices from finer rep resentations of th e graph. T h e g rap h is initially p a rtitio n e d a t th e coarsest level using a single-level algorithm such as those described above. T his p a rtitio n in g is th e n pro p ag ated back th ro u g h th e finer
C H A P T E R 2. BAC K G RO U N D A N D R E L A T E D W O R K 16
approxim ations, being refined a t each level via a K L -type algorithm . M ultilevel approaches are very effective for two reasons. F irst, th e in itial p a rtitio n in g can be easily com puted for th e coarsest graph, because it hides m any edges a n d vertices of th e original graph. Second, increm ental refinem ent algorithm s can quickly m ake large changes to th e p a rtitio n in g by w orking w ith coarser versions of th e graph. Chaco [36], M E T IS /P a rM E T IS [43, 44], and JO S T L E [87] are some p o p u lar softw are packages th a t im plem ent m ultilevel m ethods.
2 .1 .2 .3 Load b a lan cin g a d a p tiv e c o m p u ta tio n s
If ex ternal load is n ot a factor, an d if th e am ount of work p e r object is known an d does n ot vary du rin g th e course of th e co m p utation, a sta tic p a rtitio n in g can be used. T he p a rtitio n in g can be com puted as a sequential preprocessing step a t th e beginning of a sim ulation, using a package such as M E T IS or Chaco, or in parallel using a package such as P arM E T IS or JO S T L E . In m any com putations, however, the am ount of work assigned to each processor can vary unpredictably, eith er because th e num ber of objects in a p a rtitio n can vary, or because th e am ount of work p er object is unknow n. For exam ple, if th e objects are grid points in an adaptive m esh refinem ent (AM R) calculation [12, 11], th e am ount of work for a p a rtitio n m ay grow as th e m esh is refined. If objects are cells in a particle-in-cell sim ulation, th e am ount of work associated w ith an object varies as particles advect into an d o ut of th e cells. Some sim ulations utilize ad ap tiv e physics, in which th e physics m odel used a t a m esh p o in t— an d th u s th e work associated w ith th a t p o in t— m ay change as th e calculation progresses. In all of these cases, th e co m p u tatio n al dom ain m ust be re p a rtitio n e d as th e co m p u tatio n progresses, because a p a rtitio n in g th a t resu lts in good load balance a t th e beginning of th e co m p u tatio n m ay yield very poor load balance later.
D ynam ic rep a rtitio n in g can be done by sim ply com puting a new global p a rtitio n in g from scratch. However, th is generally incurs an unacceptable am ount of costly d a ta m igration, because th e new p a rtitio n in g can be very different from the existing one. In stead, th e old p a rtitio n in g should be a d ju ste d in a way th a t m itigates load im balance, while also
m inim izing th e am ount of d a ta m igration needed to arrive at th e new p artitio n in g . Scratch- rem ap rep a rtitio n e rs [66] a tte m p t to do th is by calculating a new p a rtitio n in g of th e g rap h from scratch, a n d th e n p erm u tin g th e labels of th e new p a rtitio n s so th e ir overlap w ith th e old p a rtitio n s is m axim ized.
S cratch-rem ap approaches result in excellent load balance and low edge-cut, b u t still involve a relatively high am ount of d a ta m ovem ent. For th is reason, increm ental approaches th a t m ake a series of sm all changes to th e p a rtitio n s to restore load balance are usually favored instead. M ost of these m ethods em ploy a diffusive approach, in which objects move from heavily loaded processors to m ore lightly loaded neighbors by a diffusion-like process. T h e num ber of objects to be moved from a processor is based on th e processor workload, while th e choice of which objects get moved is usually m ade using KL-like criteria. In th e sim plest incarn atio n s [21], a first-order finite difference d iscretization of th e diffusion e q u ation is used to calculate th e am ount of work th a t should be moved betw een processors a t each ite ra tio n of th e m ethod. Because th e first-order stencil is com pact (only involving nearest neighbors), such m ethods can be executed locally by a processor. T hese m ethods can be slow to converge to a load balanced sta te , however, so m ore com plex m ethods use some global inform ation to accelerate convergence to load balance. T h e m eth o d of Hu, Blake, an d E m erson [38] (im plem ented in P arM E T IS an d JO S T L E ) uses inform ation from all processors to perform an im plicit solve for th e ste a d y -sta te solution of th e diffusion equation, an d th e n m igrates d a ta in one step to reach th is sta te of load balance.
A lthough local m ethods are usually slower th a n global m ethods to achieve load balance, th ey can be executed asynchronously, w hereas global m ethods cannot. T h is is a significant advantage for two reasons. F irst, load balance can be corrected as it arises, w ithout having to w ait for th e next synchronization. Second, a lth o u g h m ost scientific applications have n a tu ra l synchronization points, certain applications [28, C h a p te r 14] do not, so th e use of a global m eth o d introduces artificial (and costly) synchronizations.
C H A P T E R 2. B A C K G R O U N D A N D R E L A T E D W O R K 18
com puting increm ental repartitio n in g s, b u t once th e new p a rtitio n in g is ob tain ed , perform ing th e a c tu a l d a ta m igration can be a difficult chore for th e applicatio n program m er. Packages such as Z oltan [22] a n d P R E M A [7, 8] provide tools an d fram ew orks for m anaging a n d auto m atically m igrating th e objects in ad aptive com putations. P R E M A uses m u lti th re a d in g to su p p o rt asynchronous load balancing, an d thus is p a rtic u la rly su ited to very irregular com putations.
2 .1 .3 L o a d b a la n c in g u n d e r v a r ia b le e x t e r n a l lo a d
In our discussion so far, we have only considered th e load th a t is in tern al to an applica tion. In a m ultiprogram m ed environm ent, however, th e to ta l load on a processor can vary u n p red ictab ly due to load from processors ex tern al to a given application.
P rovided th a t th e g ran u larity of th e tasks is not too coarse, th e pool of tasks parad ig m can work well (and autom atically) in th e presence of ex tern al load. D a ta p a rtitio n in g approaches, however, face m ajor difficulties in dealing w ith such loads. E x te rn a l load can conceivably be accounted for by w eighting each processor according to its load an d th e n sizing th e p a rtitio n s based on those weights. However, th e high cost of rep a rtitio n in g com bined w ith th e dynam ic an d un p red ictab le n a tu re of ex ternal w orkloads m akes th is im practical. If ex tern al load conditions stay fairly constant, th e cost of rep a rtitio n in g to reflect th a t load can be w orthw hile. If th e decision to re p a rtitio n proves wrong— th a t is, if th e ex tern al w orkload changes d ram atically soon after p a rtitio n in g — it can be very costly. Because of th e difficulties involved in ex ternal load prediction, re p a rtitio n in g is generally n ot used to cope w ith ex tern al load.
In C h a p te r 3 we present a load balancing scheme for a very specific b u t im p o rta n t class of scientific algorithm s, in ner-outer iterativ e m ethods th a t em ploy local inner iteratio n s. T hese algorithm s are not am enable to a pool-of-tasks approach, an d — for th e reasons discussed above— load balancing by dynam ic rep a rtitio n in g is im practical. We present a scheme th a t provides good load balance for these algorithm s in th e face of dynam ic a n d unp red ictab le
load, w ith o u t any reliance on d a ta m igration or perform ance predictions.
W h at we call our “load-balancing” scheme differs fundam entally from o th er load-balancing schemes in th e conventional sense of th e word. T h e pool-of-tasks a n d d a ta p a rtitio n in g a p proaches we have described d istrib u te a fixed set of tasks in a balanced way across a set of processors. O ur load balancing scheme also d istrib u tes tasks in a balanced m anner across a set of processors, b u t th e set from which these tasks are draw n is n ot fixed. T h a t is, th e set includes optional tasks: each task (hopefully) helps move th e alg o rith m to its targ e t, b u t not all tasks m ust be com pleted for th is ta rg e t to be reached. T h e tasks to execute are chosen to m axim ize load balance an d reach th e ta rg e t as quickly (in wall-clock tim e) as possible: tasks are associated w ith d a ta on a p a rtic u la r processor, an d slower processors execute fewer of th e associated tasks. T his flexibility in choosing w hich tasks to execute stem s from th e intrinsic p roperties of th e in ner-outer iterativ e m ethods.
2.2
A d d r e s s in g m e m o r y s h o r ta g e
Jo b s in non-dedicated environm ents contend not only for C P U tim e, b u t for m em ory re sources as well. M em ory pressure can cause extrem e deg rad atio n of perform ance. Its effects are especially pronounced if an affected node is p a rt of a synchronous parallel job, as th e im m ense slowdown experienced by th a t node can cause incredible load im balance. In m any cases th is im balance can n o t be corrected by load balancing techniques th a t m ay be in place w ithin th e application. For instance, if th e jo b is ru n n in g on a shared-m em ory com puter, m igrating work to o th er processors on th a t m achine fixes nothing. Even in d istrib u te d m em ory environm ents, problem s rem ain: T h e usually excellent load balance provided by a pool-of-tasks scheme m ay d isap p ear as th e slowdown on an affected node increases th e g ran u larity of its tasks to excessive levels. If load balancing is perform ed th ro u g h dom ain rep artitio n in g , m em ory pressure m ay render th e very cost of co m puting a new p a rtitio n ing prohibitive! Even w hen calculating th e new p a rtitio n in g is practical, deciding how to weight each p a rtitio n is a difficult problem , as o p eratin g system s typically provide very little
C H A P T E R 2. BA C K G R O U N D A N D R E L A T E D W O R K 20
inform ation a b o u t m em ory usage. For these reasons and others, explicit consideration of m em ory pressure an d its m itig atio n is needed in non-dedicated environm ents, in a d d itio n to good C P U -load balancing approaches.
In th is section we discuss several approaches th a t have been used to rem edy m em ory pressure in dedicated and non-dedicated environm ents.
2 .2 .1 O u t-o f-c o r e a lg o r it h m s
Since th e earliest days of th e com puter age, researchers have been faced w ith problem s too large to solve using th e m ain m em ory of a com puter. In response to these problem s, m any out-of-core algorithm s have been developed. A n out-of-core algorithm [84, 86, 25, 71] is one th a t has been designed to provide acceptable perform ance despite th e slow n a tu re of access ing secondary storage. Such an algorithm m ay be o b tain ed by tak in g a conventional (in-core) algorithm , im plem enting it to access large, contiguous blocks of d a ta , a n d re-scheduling its independent o perations in a m anner th a t m axim izes th e re-use of d a ta th a t have been brought into m ain memory. Some out-of-core algorithm s go beyond sim ply rescheduling independent operations, alterin g conventional algorithm s so th a t th eir d a ta dependencies are m ore am enable to d a ta re-use. T hese alte ra tio n s m ay sacrifice some num erical sta b ility or require m ore C P U operations, b u t a d m it m uch b e tte r out-of-core schedules.
O ut-of-core algorithm s b ear m any sim ilarities to cache-optim ized algorithm s, as th e goal of b o th is to m axim ize use of faster levels of a m em ory hierarchy by scheduling o perations in a m anner th a t provides good sp a tia l an d tem p o ral locality. B o th types of algorithm s benefit from m any of th e sam e o p tim izatio n techniques: S tru c tu rin g loop nests to elim inate strid ed accesses is im p o rta n t a t all levels of th e m em ory hierarchy. M any dense m atrix algorithm s are optim ized for out-of-core use by stru c tu rin g th e m to o p e ra te on blocks of a m atrix a t a tim e. Such blocking techniques are also used to achieve cache optim ality in num erical libraries such as BLAS [53, 24, 23] an d LA PA CK [2]; th ey are so im p o rta n t, in fact, th a t com puter vendors m ake great efforts to precisely tu n e blocking factors in
th eir im plem entations of these libraries, an d th e ATLAS project [89, 90] provides software to au to m atically tu n e these factors. Cache-oblivious algorithm s [30, 15], such as th e one employed by th e p o p u lar F F T W lib ra ry [29], o p e ra te using a divide-and-conquer approach th a t splits a problem into subproblem s th a t fit into th e cache; th is approach elim inates th e need to tu n e hardw are-specific p aram eters. A lthough these algorithm s have been proposed w ith cache-optim ality in m ind, these techniques are also beneficial in out-of-core settings [20].
D espite th e ir sim ilarities, there are im p o rta n t differences in th e design of out-of-core versus cache-optim ized algorithm s. T hese stem from two fu n d am en tal differences betw een th e levels of th e m em ory hierarchy th a t th e two types of algorithm s targ e t. F irst, th e m ain -m em o ry /d isk b an d w id th ra tio is usually m uch higher th a n th e cache/m ain-m em ory b a n d w id th ra tio in a com puter system . T his m eans th a t m odifications th a t increase C P U o perations b u t im prove out-of-core perform ance by reducing disk I /O m ay perform worse th a n unm odified versions w hen executed in-core, despite possibly higher cache efficiency. Second, cache m em ory usually has lim ited associativity— th a t is, blocks bro u g h t from m ain m em ory cannot be placed anyw here in th e cache, b u t m ust be placed in a lim ited num ber of specific locations in th e cache. M any cache-optim ization techniques are designed to reduce so-called “conflict” misses th a t stem from lim ited cache associativity. Because m ain-m em ory is fully associative (i.e., blocks from disk m ay be placed anyw here), such o ptim izations do not im prove out-of-core perform ance.
Out-of-core algorithm s have been in use for decades to allow com puters to solve p ro b lems too big for th eir physical memory. T hese codes have been designed to o p e ra te w ith a fixed am o u n t of memory, however, an d are u n su itab le to deal w ith th e tra n sie n t m em ory shortages encountered on m ultiprogram m ed system s. A n out-of-core code could certainly be used to avoid m em ory pressure, b u t th e u tility of such an approach is dubious: T hough th e algorithm would work efficiently w hen m em ory is scarce a n d an in-core algorithm would th ra sh , w hen m em ory is plentiful th e out-of-core alg o rith m will not take advantage, contin