In our earliest memory adaptation work, we worked with JDcg and exploited its flexibility. If a memory shortage was detected on a node, that node would perform no correction phase iterations. This would allow a competing, memory-hungry job to utilize 100% of the CPU and memory resources, hopefully speeding its completion and relinquishment of resources. Our experiments yielded an appreciable performance gain, often reducing execution times by 20% compared to load-balanced JDcg without memory adaptation. The method suffers from several shortcomings, however, the biggest of which is that it is applicable only to a limited subset of flexible-phase algorithms. Additionally, the mechanisms we used for identifying when to recede and resume correction phase iterations are not reliable in general.
In more recent work [61, 62] we have developed a memory adaptation framework that is widely applicable and highly portable. We describe it in this chapter.
5.1 A portable framework for memory adaptivity
Many scientific algorithms, such as iterative methods for linear and nonlinear systems, dense matrix methods, and Monte Carlo methods, operate on large data sets in a predictable, repetitive fashion. To best utilize hierarchical memory, applications often operate in a block-wise fashion to increase locality of memory access. Algorithms designed to run in-core are blocked to effectively utilize L1 and L2 caches, while out-of-core algorithms employ a similar strategy to best utilize DRAM. In the lingo of out-of-core algorithms, blocks are sometimes referred to as panels to distinguish them from disk blocks; we adopt that terminology here. With data partitioned into P panels, the processing pattern of a blocked algorithm can be represented schematically as
    for i = 1:P
        Get panel p_i from lower level of the memory hierarchy
        Work on p_i
        Write results back and evict p_i to the lower level of the memory hierarchy
    end
The above structure suggests a simple mechanism for memory adaptation: control the resident set size by varying the number of panels cached in core. If the program has enough memory, it caches its entire data set in core and runs as fast as a standard in-core algorithm. If main memory is scarce, the number of panels cached is reduced, and an application-specific cache replacement policy is used. The magnitude of the reduction varies according to the memory shortage, so performance should degrade gracefully as memory becomes scarce (provided an appropriate cache-replacement policy is used). Thus, if the amount of physical memory available is only slightly less than the size of the adaptive program's data set, its performance should be close to its in-core performance. A non-adaptive in-core program, on the other hand, may thrash under the same conditions if the replacement policy of the VM system is inappropriate for its access pattern. This is often the case for scientific applications, which commonly access large data sets in a cyclic fashion. For such access patterns, a most recently used (MRU) replacement policy should be used, but the generic replacement policy used by the operating system is usually an approximation of least recently used (LRU) replacement.
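The effect of the replacement policy on a cyclic access pattern can be illustrated with a small simulation. The following Python sketch is not part of the framework; the function name `count_misses` and all parameter values are assumptions chosen for illustration. It counts how many panel fetches occur when P panels are swept repeatedly while only `capacity` of them fit in core.

```python
from collections import OrderedDict


def count_misses(num_panels, capacity, sweeps, policy):
    """Count panel fetches for `sweeps` cyclic passes over `num_panels` panels.

    `capacity` is the number of panels that fit in core (at least 1);
    `policy` is "LRU" or "MRU". Hypothetical helper for illustration only.
    """
    if capacity < 1:
        raise ValueError("capacity must be at least 1")
    if policy not in ("LRU", "MRU"):
        raise ValueError("policy must be 'LRU' or 'MRU'")

    cache = OrderedDict()  # keys ordered from least to most recently used
    misses = 0
    for _ in range(sweeps):
        for panel in range(num_panels):
            if panel in cache:
                cache.move_to_end(panel)  # mark as most recently used
                continue
            misses += 1
            if len(cache) >= capacity:
                # LRU evicts the oldest entry, MRU evicts the newest one.
                cache.popitem(last=(policy == "MRU"))
            cache[panel] = True
    return misses


lru_misses = count_misses(num_panels=10, capacity=8, sweeps=5, policy="LRU")
mru_misses = count_misses(num_panels=10, capacity=8, sweeps=5, policy="MRU")
print(lru_misses, mru_misses)
```

With these illustrative parameters the LRU run misses on all 50 accesses, because each panel is evicted shortly before it is needed again. The MRU run misses 18 times: 10 compulsory misses in the first sweep and 2 in each later sweep.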
The software required to support the above memory adaptation strategy can be easily encapsulated in a software library. An existing blocked code can then be modified by inserting a few library calls. Essentially, all that needs to be done is to make the code call a library function that returns a pointer to panel p_i before working with it. The function call should handle all memory management in a way that is completely transparent to the application programmer. We have written such a library, which we call MMLIB (for "memory malleability library").
5.2 Elements of the implementation
To use our adaptation strategy we must be able to grow and shrink the memory space of an application. Most implementations of malloc()/free() are not appropriate because they do not provide a mechanism to release memory back to the operating system. A successful malloc() call can grow a program's heap. When free() is used to deallocate memory, however, the heap size may not be decreased, so freed memory is not returned to the operating system. Our solution is to use memory mapping, which is universally available in modern operating systems and provides more explicit control over memory usage. Memory obtained via a memory map can be easily returned to the operating system; additionally, many operating systems allow programmers to provide hints to the OS via madvise() about how a mapped region will be used. Using named mappings to files (viz., memory-mapped I/O) confers other advantages as well. Because I/O is handled transparently, codes can be simplified: explicit I/O calls are not needed, and an adaptive code can greatly resemble an in-core one. I/O traffic is optimized because non-dirty pages in mapped regions can be freed without a write, while pages that are dirty will be written, but not to the swap device. Writing dirty pages to swap space incurs software overhead and results in poor data placement on disk, since pages that are part of a contiguous array may be scattered among several non-contiguous blocks on the swap device.
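The behavior of a named mapping can be sketched with Python's standard `mmap` module, which wraps the same operating system facility. This is only an illustration and not the MMLIB code; the file name, the panel size `PANEL_BYTES`, and the helper names are assumptions. The sketch maps a file, modifies it with ordinary memory writes, and then unmaps it, after which the data is found in the file even though no explicit write call was issued for it.

```python
import mmap
import os
import tempfile

PANEL_BYTES = 4 * mmap.PAGESIZE  # hypothetical panel size (a multiple of the page size)


def create_panel_file(path, nbytes):
    """Create a zero-filled backing file for one panel."""
    with open(path, "wb") as f:
        f.truncate(nbytes)


def update_panel(path, nbytes):
    """Map the panel file, modify it in place, then unmap it."""
    with open(path, "r+b") as f:
        region = mmap.mmap(f.fileno(), nbytes)  # read-write, file-backed mapping
        try:
            # madvise() is not available on every platform, so guard the hint.
            if hasattr(region, "madvise") and hasattr(mmap, "MADV_WILLNEED"):
                region.madvise(mmap.MADV_WILLNEED)  # hint: region will be used soon
            region[0:5] = b"hello"  # an ordinary memory write, no write() call
            region.flush()          # push dirty pages to the file (for the demo)
        finally:
            region.close()          # unmap: the pages are released


with tempfile.TemporaryDirectory() as tmp:
    panel_path = os.path.join(tmp, "panel_0.bin")
    create_panel_file(panel_path, PANEL_BYTES)
    update_panel(panel_path, PANEL_BYTES)
    with open(panel_path, "rb") as f:
        first_bytes = f.read(5)  # the data written through the mapping
```

The explicit `flush()` is there only so that the demonstration does not depend on when the operating system writes dirty pages back; it is not implied by the text above.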
In our implementation, each panel is written to a disk file. A panel to be cached in core is mapped via an mmap() call, and the VM system is asked to prefetch that panel via madvise(). A panel to be evicted is unmapped via a munmap() call. A memory adaptation decision is made each time the program calls our library function to fetch a new panel. Based on its current estimates of memory availability, the library chooses to increase, decrease, or maintain the number of panels cached in core. If panels must be evicted, victim panels are selected using a user-specified, application-specific policy. This allows an optimal replacement policy to be used, in contrast to the general policy employed by a VM system, which may be highly suboptimal for a given application.
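The scheme described above might be sketched as follows. This is a simplified illustration under stated assumptions, not the MMLIB interface: the class `PanelCache`, its methods, and the `mru` policy function are hypothetical names, and the memory availability estimate is replaced by a `capacity` attribute that the caller sets by hand.

```python
import mmap
import os
import tempfile


class PanelCache:
    """Keeps at most `capacity` panels mapped (hypothetical sketch)."""

    def __init__(self, paths, panel_bytes, capacity, choose_victim):
        self.paths = paths                  # one backing file per panel
        self.panel_bytes = panel_bytes
        self.capacity = capacity            # may be changed between calls
        self.choose_victim = choose_victim  # user-supplied replacement policy
        self.mapped = {}                    # panel index -> mmap object
        self.history = []                   # cached panels, least recent first

    def _evict(self, index):
        region = self.mapped.pop(index)
        region.flush()                      # write dirty pages to the panel file
        region.close()                      # unmap the panel
        self.history.remove(index)

    def get_panel(self, index):
        """Return a writable buffer for panel `index`, mapping it if needed."""
        hit = index in self.mapped
        if hit:
            self.history.remove(index)      # never evict the requested panel
        limit = max(self.capacity, 1)       # always keep room for one panel
        while len(self.history) + 1 > limit:
            self._evict(self.choose_victim(self.history))
        if not hit:
            with open(self.paths[index], "r+b") as f:
                region = mmap.mmap(f.fileno(), self.panel_bytes)
            if hasattr(region, "madvise") and hasattr(mmap, "MADV_WILLNEED"):
                region.madvise(mmap.MADV_WILLNEED)  # ask the VM system to prefetch
            self.mapped[index] = region
        self.history.append(index)
        return self.mapped[index]

    def close(self):
        for index in list(self.mapped):
            self._evict(index)


def mru(history):
    return history[-1]  # evict the most recently used cached panel


with tempfile.TemporaryDirectory() as tmp:
    panel_bytes = mmap.PAGESIZE
    paths = []
    for i in range(4):
        path = os.path.join(tmp, f"panel_{i}.bin")
        with open(path, "wb") as f:
            f.truncate(panel_bytes)         # zero-filled panel file
        paths.append(path)

    cache = PanelCache(paths, panel_bytes, capacity=4, choose_victim=mru)
    for sweep in range(2):
        for i in range(4):
            if sweep == 1 and i == 0:
                cache.capacity = 2          # pretend a memory shortage was detected
            panel = cache.get_panel(i)
            panel[0] = panel[0] + 1         # "work" on the panel in place
    resident_after = sorted(cache.mapped)
    cache.close()

    counters = []
    for path in paths:
        with open(path, "rb") as f:
            counters.append(f.read(1)[0])
```

In this sketch the blocked loop differs from an in-core loop only by the `get_panel` call, and the capacity is enforced on every call so that a reduction takes effect immediately. A caller must finish with a panel before requesting the next one, since an evicted mapping is closed.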