A new parallelization scheme for adaptive mesh refinement

Journal of Computational Science

journal homepage: www.elsevier.com/locate/jocs

Frank Löffler a, Zhoujian Cao b, Steven R. Brandt a,c,∗, Zhihui Du d

a Center for Computation and Technology, Louisiana State University, Baton Rouge, LA 70803, USA

b Institute of Applied Mathematics, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100190, People's Republic of China

c Department of Computer Science, Louisiana State University, Baton Rouge, LA 70803, USA

d Tsinghua National Laboratory for Information Science and Technology, Department of Computer Science and Technology, Tsinghua University, Beijing 100084, People's Republic of China

Article info

Article history:
Received 30 October 2015
Received in revised form 27 April 2016
Accepted 4 May 2016
Available online 6 May 2016

Keywords:
Parallel application frameworks
Parallel algorithms
Parallel applications
Adaptive mesh refinement

Abstract

We present a new method for parallelization of adaptive mesh refinement called Concurrent Structured Adaptive Mesh Refinement (CSAMR). This new method offers the lower computational cost (i.e. wall time × processor count) of subcycling in time, but with the runtime performance (i.e. smaller wall time) of evolving all levels at once using the time step of the finest level (which does more work than subcycling but exposes more parallelism). We demonstrate our algorithm's effectiveness using an adaptive mesh refinement code, AMSS-NCKU, and show performance on Blue Waters and other high performance clusters. For the class of problem considered in this paper, our algorithm achieves a speedup of 1.7–1.9 when the processor count for a given AMR run is doubled, consistent with our theoretical predictions.

1. Introduction

Adaptive mesh refinement (AMR) is useful when different levels of accuracy are required in different regions of a computational domain. Structured Adaptive Mesh Refinement (SAMR) methods have been widely used in computation since Berger, Oliger, and Colella pioneered the technique in the 1980s [1,2]. A number of codes and frameworks have been developed around their idea.

For a variety of cutting-edge simulations, including but not limited to astrophysical compact objects like black holes, neutron stars, and collisions between black holes and neutron stars, the number of grid points is usually about the same on each level, e.g. [3–5]. An example grid setup for this type of application is shown in Fig. 1.

This arrangement is also common in many other areas, such as climate modeling. In Fig. 2 we show the grid decomposition for a simulation done with the Weather Research and Forecasting (WRF) model. The simulation shown was taken from a presentation at the American Meteorological Society [6], and uses mesh refinement with three levels, containing 230k, 305k, and 174k grid points respectively, which are reasonably similar numbers.

Simulations in this problem domain typically use subcycling in time, even though subcycling has lower concurrency and therefore longer wall time (approximately 50% longer for WRF, which uses refinement factor 3). The main motivation for this is the substantially reduced computational cost (i.e. wall time × processors used).

∗ Corresponding author at: Center for Computation and Technology, Louisiana State University, Baton Rouge, LA 70803, USA.

E-mail address: [email protected] (S.R. Brandt).

In any AMR algorithm, the coarse level needs to be updated everywhere using data from the fine level (restriction), and the fine level needs to be updated on the boundary using data from the coarse level (prolongation). This dependency, and the fact that in subcycling levels take time steps of different sizes, means that levels are usually distributed independently from each other and evolved sequentially, e.g. level 1, then level 2, then level 3.

Even though SAMR is a sequential algorithm, parallel speed-up is achieved by dividing the mesh within each level (see Fig. 3). SAMR can scale until the sub-meshes reach a certain minimum size, beyond which overheads for ghost zones and communication become too costly.

This scaling limitation is one of the reasons why many codes eschew subcycling in favor of computing all levels at once in parallel (non-subcycling), using the time step determined by the requirements on the finest level. This achieves much more parallelism and has the advantage of being approximately twice as fast as typical subcycling algorithms (for refinement factor two and a similar grid size per process).

On the other hand, non-subcycling can be far more expensive in terms of computational cost. How much more expensive this option is depends on many aspects, e.g. the number of levels, and the ratio of grid points on each refinement level. In cases where the number of points on the finest level is much higher than on other levels, subcycling cannot offer a huge benefit. Henceforth, we will refer to this non-subcycling version of AMR as NSAMR.

http://dx.doi.org/10.1016/j.jocs.2016.05.003

Fig. 1. Example of an AMR grid hierarchy with nested levels of finer resolution, denoted by different shades of gray. (Although this example is 2D, the main application is in full 3D.) Two regions are highly resolved, each surrounded by levels of coarser resolution, which eventually merge and form one set of boxes surrounding both fine regions. Note that one refinement level cannot be represented by just one or two boxes, but has to be split into three boxes to represent its shape. We show the whole structure on the left and a zoom of the inner levels on the right hand side.

Fig. 2. Grids used to simulate Hurricane Ivan, produced using the Weather Research and Forecasting (WRF) model. The simulation shown was taken from a presentation at the American Meteorological Society, and has three levels of refinement containing reasonably similar numbers of grid points. Image courtesy of Sytske Kimball, Professor of Meteorology, Director, South Alabama Mesonet, Dept. of Earth Sciences, University of South Alabama.

Fig. 3. (a) shows that all the different mesh levels can be divided into similar sub-meshes so the calculation on each level can be executed in parallel on all its sub-meshes (the example is 2D for simplicity, but in many applications the meshes are 3D). The more sub-meshes a level is divided into, the more processors can be used to run them in parallel, which often means better performance. (b) shows that each sub-mesh must include an additional ghost zone, and the ghost zone introduces some overhead. The smaller the sub-mesh is, the larger the relative overhead, so the sub-mesh cannot be made arbitrarily small without losing performance.

In this paper, we provide a new variant of SAMR that adds concurrency directly to the algorithm, called Concurrent Structured AMR (CSAMR). CSAMR combines the speed, scaling, and concurrency of the non-subcycling, or NSAMR, with the low computational cost of subcycling structured AMR, or SAMR.

2. Related work

A multitude of packages exist that offer some AMR capability. Even limiting this to parallel AMR produces a list too long to discuss here. Therefore, we select a couple of cases which we believe to be most relevant. For a survey of high level frameworks in block-structured AMR packages the reader is referred to [7].

Paramesh [8,9] is a package of Fortran 90 subroutines designed to provide an application developer with an easy way to extend an existing serial code which uses a logically Cartesian structured mesh into a parallel code with AMR. For many problems in the domain of interest of most Paramesh users, the number of grid points on the finer levels is larger than at the coarser levels, and the throughput benefit of subcycling is not worth the time cost [10]. Thus, Paramesh users prefer to use NSAMR. Indeed, as we show below, in many cases of this nature, there is no benefit in throughput from SAMR.

Other packages require subcycling in time. One interesting example of this is Enzo [11,12]. It only supports subcycling, but it does not require that the ratios of cell sizes and time steps be identical on all levels, while other codes often do. Packages that allow users to choose between subcycling and not are, e.g. BoxLib [13], Cactus [14,15] and Chombo [16]. Other notable packages are Uintah [17,18], Jasmine [19] and SAMRAI [20,21], but many more exist.

Nyx, a parallel AMR code for computational cosmology, employs "optimal subcycling", and uses SAMR on some levels, and NSAMR on others, adapting to the performance needs of the problem at hand [22].

Diachin et al. [23] discuss a number of techniques for implementing SAMR and UAMR (i.e. Unstructured Adaptive Mesh Refinement) in parallel, but do not mention any parallelism present within the algorithm itself.

The paper from TACC on Extreme Scale AMR [24] discusses their parallel AMR structure, but the focus is on the forest of octrees methods they apply, not on parallelism within the AMR algorithm itself.

Dynamic parallelism schemes could achieve the scheduling described in this work, i.e. schemes in which all the methods of the program are scheduled as tasks capable of executing as soon as inputs are ready. However, in order for the speedup to be realized, the tasks for one group of processors need to be assigned to the finest grid, while the remaining tasks should be assigned to a second group.

The Uintah Framework has a powerful dynamic task execution system as well as SAMR capabilities, but its authors do not mention the SAMR framework being able to take advantage of this parallelism within AMR itself [25].

Discussions of dynamic parallelism are mostly concerned with load balancing the finer grids, e.g. see Balsara [26]. In that paper the Knapsack algorithm is applied to balancing the finest grid; the coarser grids are seen as a perturbation. However, we also note that for typical applications of the code used here, the sub-meshes are often of very similar size. For problems like this, simpler algorithms can be applied.

There is, however, one dynamic SAMR system that comes very close to ours. The AstroBEAR Framework [27] recognizes the value of allowing the different levels of AMR to advance independently, and in one of their figures they show an execution order identical to the one we use. They discuss methods of load balancing, but do not mention the strategy of isolating the fine grids on one group of processors. Instead, they use an extended system of ghost cells to decouple advancement on their grids.

The authors are not aware of any other system, apart from AstroBEAR, that actually parallelizes the SAMR method itself. While CSAMR has some features in common with AstroBEAR, we believe CSAMR still makes definite contributions in how to understand and parallelize the algorithm.

3. Description of the method

AMR algorithms typically study hyperbolic equations where a set of fields are evolved forward in time. The computational domain is described by a set of nested grids with increasingly higher resolution. These grids may be evolved together with a time step appropriate to the Courant condition of the finest grid, or with a time step that varies by level. In this latter case, the levels are typically evolved in an order described by the pseudocode given in Fig. 4, assuming a fixed refinement factor of two between levels.

Note that here we assume that values for level run from 0 up to, and including, maxlevel-1, with 0 indicating the coarsest level and maxlevel-1 the finest. The function one_step(level) evolves the specified level of refinement by a time step of size $(1/2)^{\mathrm{level}}\,\Delta t$.

Fig. 4. Pseudo-code depicting the execution of a basic subcycling AMR algorithm with traditional call ordering.

If this is not the finest level, the boundary of the next finer level will be provided by prolongation, after which that level can make two steps, performing another prolongation in between. After these two steps, both levels are again at the same physical time, and information from the finer grid has to be restricted to the coarser grid.
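The pseudocode of Fig. 4 is not reproduced in this text. As a rough illustration only, the traditional ordering described above might be sketched in Python as follows. The function body, the event log, and the parameter name num_levels are assumptions made for this sketch (they are not taken from the figure, and the line numbers quoted in the text refer to the figure, not to this sketch); the physics of one_step, prolongation and restriction is replaced by log entries.

```python
def recursive_step(level, num_levels, log):
    """Advance `level` by one of its own time steps, subcycling all finer levels.

    Hypothetical sketch: real evolution, prolongation and restriction are
    replaced by entries appended to `log`.
    """
    log.append(("step", level))                 # stands in for one_step(level)
    if level < num_levels - 1:                  # not the finest level
        for _ in range(2):                      # refinement factor two
            # coarse -> fine boundary data before each finer step
            log.append(("prolongate", level, level + 1))
            recursive_step(level + 1, num_levels, log)
        # both levels are at the same physical time again
        log.append(("restrict", level + 1, level))
    return log


events = recursive_step(0, 3, [])
steps = [event[1] for event in events if event[0] == "step"]
print(steps)
```

For three levels the logged steps visit the levels in the order 0, 1, 2, 2, 1, 2, 2, so each level takes twice as many steps as the next coarser one.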

If the refinement regions move, prolongation is used to initialize any new patches that are created.

In the following, we will use N as the number of levels in a problem (corresponding to maxlevel in pseudocode). We will typically use $\ell$ to identify a particular level (corresponding to level in pseudocode).

For simplicity, we assume roughly the same number of grid points for each level. Evolutions using subcycling take $2^{N-1}$ time steps on the finest level, and must evolve $2^{N}-1$ steps in total. For a large number of levels, this is roughly a factor of $2 \approx (2^{N}-1)/2^{N-1}$ compared to the number of steps on the finest level. Evolving the finest grid is about as expensive as evolving all coarser grids combined in this case.

Alternatively, evolutions which do not use subcycling must perform $2^{N-1}$ steps on each of the N levels, resulting in a total of $N \times 2^{N-1}$ steps. Compared to subcycling, this means non-subcycling methods use approximately $N \times 2^{N-1}/(2^{N}-1) \approx N/2$ times as many time steps (in the limit of large N, although ten is large in this case).
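The step counts above are easy to check numerically. The following short Python sketch (function names are chosen here for illustration) evaluates both totals under the stated assumption of equally sized levels and refinement factor two.

```python
def subcycling_steps(num_levels):
    """Total level-steps per coarse step with subcycling: sum of 2^l = 2^N - 1."""
    return sum(2 ** level for level in range(num_levels))


def non_subcycling_steps(num_levels):
    """Every one of the N levels takes the 2^(N-1) steps of the finest level."""
    return num_levels * 2 ** (num_levels - 1)


for n in (2, 5, 10):
    ratio = non_subcycling_steps(n) / subcycling_steps(n)
    print(n, subcycling_steps(n), non_subcycling_steps(n), round(ratio, 3))
```

For N = 10 the ratio is 5120/1023, already very close to N/2 = 5.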

Looking closely at data dependencies in the above evolution scheme reveals that the call to one_step(level) on line 5 and the first call of recursive_step(level+1) on line 6 of Fig. 4 are working on different levels and do not depend on each other, and their order can be inverted or parallelized. First, let us consider an inverted form, see Fig. 5.

Fig. 5. Pseudo-code depicting the execution of a basic AMR algorithm with an alternate call ordering.
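To make the effect of the inversion concrete, the sketch below lists the order in which levels are stepped for both orderings. It assumes that "inverted" means only that one_step(level) and the first recursive call swap places; the functions are illustrative and track step order only, not prolongation or restriction.

```python
def traditional_order(level, num_levels):
    """Step order when one_step(level) precedes both finer recursions."""
    if level == num_levels - 1:
        return [level]
    finer = traditional_order(level + 1, num_levels)
    return [level] + finer + finer


def inverted_order(level, num_levels):
    """Step order when the first finer recursion precedes one_step(level)."""
    if level == num_levels - 1:
        return [level]
    finer = inverted_order(level + 1, num_levels)
    return finer + [level] + finer


print(traditional_order(0, 3))
print(inverted_order(0, 3))
```

Both orderings perform the same set of steps. In the inverted form every other step is a step of the finest level, and the sequence both starts and ends with one, which is the property exploited in the following paragraphs.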

Fig. 6. Pseudo-code depicting the execution of an AMR algorithm. The recursive step is broken into a shorter function which does not include the final call to one_step(maxlevel).

Next we notice that the call to recursive_step(level) always finishes with a call to one_step(maxlevel) (line 3, for level == maxlevel). We can re-write this code to use a shorter recursive_step, one that does not include this final call to one_step(maxlevel). We show this form in Fig. 6.

The opportunity for parallelism, by running lines 8 and 9 of Fig. 6 in parallel, is now obvious. We show the fully parallel code in Fig. 7. The net effect of this is that there is only one parallel task outstanding at any time. The forked tasks only evolve the finest level; the remainder of the tasks evolve a complete SAMR evolution with N−1 levels.
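As an illustration of the resulting two-group execution (not a reproduction of Fig. 7), the following Python sketch pairs each finest-level step with the coarser step that can run beside it. The helper names and the slot-based model, in which every step takes one unit of wall time, are assumptions of this sketch; data exchange between the groups is omitted.

```python
def inverted_order(level, num_levels):
    """Step order of the inverted recursion (finer recursion first)."""
    if level == num_levels - 1:
        return [level]
    finer = inverted_order(level + 1, num_levels)
    return finer + [level] + finer


def csamr_schedule(num_levels):
    """Return one (coarse_group_step, fine_group_step) pair per wall-time slot.

    The fine group only ever steps the finest level; the coarse group works
    through the remaining steps in order and is idle (None) once it runs out.
    """
    finest = num_levels - 1
    order = inverted_order(0, num_levels)
    coarse_steps = [lvl for lvl in order if lvl != finest]
    fine_count = order.count(finest)
    schedule = []
    for slot in range(fine_count):
        coarse = coarse_steps[slot] if slot < len(coarse_steps) else None
        schedule.append((coarse, finest))
    return schedule


for coarse, fine in csamr_schedule(3):
    left = "idle" if coarse is None else f"one_step({coarse})"
    print(f"{left:12s} | one_step({fine})")
```

For three levels the coarse group steps levels 1, 0, 1 and is then idle for one slot while the fine group takes its fourth step.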

This is illustrated in the left panel of Fig. 8 for the simple case of 2 levels. The first iteration on both levels is performed in parallel. The finer level has to do one more iteration to arrive at the same time as the coarser level, during which processes owning the coarse level remain idle, as seen in Fig. 8. The same figure shows a similar schema for the case of 3 levels as well, where we want to note that in this case the relative amount of idle time is reduced by a factor of 2. Indeed, each new level reduces it by this factor, giving the amount of overhead from idle time as

$$T_{\mathrm{idle}} = \frac{1}{2^{N}} \quad \text{where } N > 1, \tag{1}$$

which falls below 1% starting with N = 7. Problems with N ≥ 7 are not unusual. In fact, some types of problems require many more levels of refinement.
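Eq. (1) can be reproduced by counting slots, assuming two equally sized process groups and steps of equal duration. The sketch below uses exact fractions; the function name is chosen here for illustration.

```python
from fractions import Fraction


def idle_fraction(num_levels):
    """Idle share of all processor slots for refinement factor two."""
    fine_steps = 2 ** (num_levels - 1)          # steps of the finest level
    coarse_steps = 2 ** (num_levels - 1) - 1    # SAMR on the other N-1 levels
    idle_slots = fine_steps - coarse_steps      # always exactly one
    return Fraction(idle_slots, 2 * fine_steps) # two groups share the slots


first_below_1pct = next(
    n for n in range(2, 20) if idle_fraction(n) < Fraction(1, 100)
)
print(idle_fraction(2), idle_fraction(3), first_below_1pct)
```

The result is 1/4 for two levels, 1/8 for three, and the first N with less than 1% idle time is 7, since 1/64 is about 1.6% while 1/128 is about 0.8%.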

The reason for the reduction of the idle time compared to two levels is that within the second time step of the finest grid, when for two levels one process group would be idle, the coarsest grid can be evolved, filling the gap. The next time this happens, however, the coarsest grid is already at the required time, and the process group will be idle. In this way, every second gap is filled with the evolution of the coarsest grid. This can be extended to more levels, always filling half of the gaps with evolutions of the new, coarser levels, confirming Eq. (1).

Fig. 7. Pseudo-code for a parallel AMR algorithm. The two steps on lines 8 and 9 of Fig. 6 are now run in parallel.

Fig. 8. The order in which different levels are evolved is shown in both diagrams. The two columns indicate the two process groups in each case. Different shades of gray indicate different levels: the lighter the color the finer the level. Numbers indicate the evolved time on that level in units of the finest level's time step. The left panel shows the case of 2 levels, while the right shows the same for 3 levels. Within each panel, the left side shows the coarser levels, the right side shows the finest level. In both, wall time is meant to increase downward. Crosses indicate when one of the process groups is idle. Red arrows (arrows originating on the left column of boxes) indicate the flow of data for prolongation, while blue arrows (arrows originating on the right column of boxes) indicate the same for restriction. Two full time steps of the coarsest level are shown in both cases. The explicit timing of prolongation and restriction is encoded in the pseudocode of Fig. 7. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Fig. 9. Two examples of states on an AMR evolution using 4 levels, at different times (left panel: after 3 iterations, right panel: after 7 iterations). Simulation time increases upwards. Rectangles depict simulation times at which a given level was already evolved at that time. If these are filled, data for that time is currently available, while empty rectangles mean that data for this time was already overwritten (assuming only two time levels are kept in memory). The hatched area depicts a range in time within which data on a given level can be constructed by interpolation. It may happen during a CSAMR evolution that there is no temporal overlap for some pair (or pairs) of levels of refinement (e.g. levels one and four on the left panel).

Restriction and prolongation might look complex in Fig. 8, but the rules dictating when to perform them are relatively simple and determined by data flow. Whenever a finer grid needs boundary data from the next coarser grid, the ordering ensures that this coarser grid is already evolved to that point and can provide the data.

Restriction, on the other hand, cannot always be performed as soon as the finer level is evolved, because the coarser level might not yet have been evolved to that point in time. This is in contrast to SAMR, where coarse levels are evolved first and are always "ready" to receive restricted data. Instead, the rule here is that restriction has to be performed when both levels are ready. An example in Fig. 8 would be restriction from the middle level to the coarsest level in the 3-level case.

3.1. Limitations

While better scalability by about a factor of two (for large numbers of levels) is certainly a big advantage, potentially halving the time for a given simulation, CSAMR also poses a couple of limitations both to implementation and usage.

First, our method is not as well-suited to problems that have many more grid points at the finest level than at the coarser ones. In this situation, the performance benefit compared to SAMR is small. Factors to consider when choosing a method here include that NSAMR eliminates the errors from the time-interpolation necessary in (C)SAMR, but on the other hand its coarse grids have to perform many more steps, accumulating error.

Second, it may not always be possible to construct data from all refinement levels at one specific simulation time using temporal interpolation (depending on the number of time levels you keep in memory). Within SAMR this is always possible because coarse levels are evolved first, and data from those levels can always be used for interpolation should this be necessary. In contrast, within CSAMR the necessary information for interpolation may only be available at specific times. When, and how often, this is possible depends on the total number of refinement levels, and the number of time levels that are kept for each refinement level. The problem is shown by an example in Fig. 9, for 4 refinement levels and 2 time levels.

In general, time interpolation on the complete domain is only possible once the coarsest level steps are ahead of the finer levels in simulation time. This happens about half-way through a complete time step. At that moment, past time levels on the finer levels have to be available for time interpolation, with the finest level requiring the highest number. The total number of time levels required ($R_{\mathrm{levels}}$) on the finest grid depends on the maximum number of refinement levels (N), and is

$$R_{\mathrm{levels}} = 2^{N-2} + 1. \tag{2}$$

Each coarser grid would only need half as many time levels for this purpose, so the total number of time levels (the sum over all refinement levels) would be about twice the number given by Eq. (2). While for a moderate number of refinement levels N this is manageable, and sometimes already exists due to other uses of past time levels (e.g. 3 time levels on the finest grid of a 3-level simulation), the cost becomes prohibitive for higher numbers of refinement levels, e.g. 257 time levels of the finest grid must be available in a 10-level simulation.
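The growth described by Eq. (2) can be tabulated with a few lines of Python. This is only an evaluation of the formula; the function name is chosen here for illustration.

```python
def time_levels_finest(num_levels):
    """Eq. (2): time levels needed on the finest grid for N refinement levels."""
    return 2 ** (num_levels - 2) + 1


for n in (3, 5, 8, 10):
    print(n, time_levels_finest(n))
```

This reproduces the two values quoted in the text, 3 time levels for a 3-level simulation and 257 for a 10-level simulation, and shows the count roughly doubling with every additional refinement level.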

Only keeping some of these time levels (e.g. always keeping the one that aligns in time with the coarsest level in addition to the ones needed for evolution) could be used to decrease the total number of time levels, but would lead to reduced accuracy in time interpolation. Similarly, extrapolation could be used whenever data for interpolation is not available. Whether or not such approaches are viable depends entirely on the purpose for requiring global data. We will now consider several.

Courant factor: Calculations of the Courant factor are typically whole grid operations. However, because one expects the fine grid to set the limit, we believe that this is not likely to be much of an issue.

Elliptic solves: Since an elliptic solve is inherently a whole grid operation, NSAMR is the only practical choice. However, elliptic equations have been solved for some problems in a SAMR framework. See Pretorius [28] for an example of how this may be done and how complex it may be.

Analysis: Each science problem has its own special needs. For black hole simulations that need to find apparent horizons, the lack of global data would limit the simulation to less frequent computations of the horizon. However, horizons typically fit entirely within only a few of the finest grids and so for this problem a full grid may not be needed.

I/O: We also note that the changed order of steps complicates I/O. Output requiring interpolation is affected by the same limitations on interpolation that have already been mentioned. Often, simulation data is not needed on every time step, so output can wait until data on all levels is present simultaneously. Alternatively, data can be reconstructed during post-processing.

Affected output types include common operations like extrema, norms, or sums of variables. On the other hand, quantities like norms or sums may be too inaccurate when time-interpolation is necessary to compute them. Because of this, even with SAMR, evaluating these is often done only at times that do not require interpolation in time.

The third limitation we would like to consider is that refinement factors higher than 2 are either less efficient with CSAMR, or more difficult to implement, and in either case the increase in scaling is smaller.

Higher refinement factors than 2 are certainly not exotic. Especially factors of 3 or 4 are often found in existing frameworks; e.g. WRF uses a factor of 3 for its horizontal mesh refinement. However, most AMR frameworks can and do use a factor of 2 for most of their applications.

In the following, we analyze the amount of idle time for the case of a refinement factor of three. We have already seen in Eq. (1) that for a factor of 2 the amount of idle time tends to zero for a large number of levels. For the three steps a fine level has to perform here, the next coarser level only does one, and it must complete that one simultaneously with the first of the three fine level steps. This leaves two idle steps for the group of processors handling the coarser levels.

Level $\ell$ in general does $3^{\ell}$ steps for each full time step. This means that the finest level will do $3^{N-1}$ steps and the total number of steps all other levels need to perform is $(3^{N-1}-1)/2$, resulting in an idle-step number of $3^{N-1}-(3^{N-1}-1)/2$ for one of the process groups. The overall idle time assuming an identical size of both groups is then (Fig. 10)

$$T_{\mathrm{idle}} = \frac{3^{N-1}-(3^{N-1}-1)/2}{2\cdot 3^{N-1}} = \frac{1}{4} + \frac{1}{4\cdot 3^{N-1}}. \tag{3}$$

Thus, without further optimization CSAMR will leave one of the two process groups idle about half of the time. This amounts to a speedup of only 1.5, compared to the speedup of 2 for a refinement factor of 2, in both cases relative to a SAMR simulation using the same size of grids but half the number of processors.
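The closed form of Eq. (3) and the quoted speedup of 1.5 can be cross-checked by counting steps directly. The Python sketch below does so with exact fractions, assuming two equally sized process groups and steps of equal duration; the function names are chosen here for illustration.

```python
from fractions import Fraction


def idle_fraction_factor3(num_levels):
    """Idle share of all processor slots for refinement factor three."""
    fine_steps = 3 ** (num_levels - 1)
    coarse_steps = (3 ** (num_levels - 1) - 1) // 2   # sum of 3^l for l < N-1
    idle_slots = fine_steps - coarse_steps
    return Fraction(idle_slots, 2 * fine_steps)


def closed_form(num_levels):
    """Right-hand side of Eq. (3)."""
    return Fraction(1, 4) + Fraction(1, 4 * 3 ** (num_levels - 1))


def speedup_over_samr(num_levels):
    """Sequential SAMR steps divided by the CSAMR wall-time slots."""
    samr_steps = (3 ** num_levels - 1) // 2           # sum of 3^l for l < N
    return Fraction(samr_steps, 3 ** (num_levels - 1))


print(idle_fraction_factor3(2), float(speedup_over_samr(10)))
```

For two levels the idle share is 1/3, and the speedup approaches 1.5 from below as the number of levels grows.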

Fig. 10. The order in which different levels are evolved is shown in both diagrams, with the same meaning of numbers, colors and arrows as in Fig. 8. The difference here, however, is a refinement factor of 3, resulting in more idle time, and that the right panel only shows one full time step. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Despite this moderate speedup of 1.5, fewer SUs are wasted in this manner than with NSAMR, and so the trade-off may be worth it. We also note that the remaining idle time could be used for other tasks, e.g. expensive analysis that otherwise would have to temporarily pause evolution, or would be performed post runtime.

One possible way to overcome this idle time problem for refinement factor three would be to divide the coarser step into two equal and sequential steps (and therefore require only half as many processors in the second group). The first sequential step would need to cover the boundary of the next finer grid, so that prolongation is possible; the second step would cover the rest of the grid. If we assume these two steps are equal, then the idle time is $T^{N}_{\mathrm{idle}} = (1/2)(1/3)^{N}$, where $T^{N}_{\mathrm{idle}}$ is the number of idle steps with N levels and N > 1, a quantity which rapidly goes to zero as the number of levels increases.

In fact, we can say generally that for refinement factor R our method shows a speedup of R/(R−1) using (R−1)/R times as many processors, provided the coarser step can be divided into R−1 equal parts, the first of which covers the boundary for the refined region it contains. Thus, for refinement factor two, the first step is done as a single part and covers the whole grid; for factor three it is done in two parts that cover half the grid each; for factor four 3 parts are required that each cover 1/3 of the grid, and so on. As R increases, the refined region needs to fit into smaller and smaller sub-regions to avoid wasted time steps, diminishing the speedup. Thus, R = 2 remains an optimal case. We note that the above discussion is theoretical, as only refinement factor two was implemented for the code used in this paper.

The final possible drawback we would like to mention is the memory imbalance created by assigning the finest level to one process group and all other levels across a second process group. This imbalance is, assuming identical level sizes, only dependent on the number of levels, and amounts to a factor of 1/(N−1). This could be a problem in cases where applications are limited by the amount of node memory, e.g. if large amounts of data that cannot easily be distributed, like equation of state tables, need to be stored in memory on each node. However, for extreme scaling this is often not a problem. Here, scaling is usually limited by a certain grain size below which the amount of work to be done on each node becomes small compared to the overhead of communication between the nodes.

3.2. Load balancing considerations

So far, we only considered the case of identical, or at least similar, numbers of grid points on each level. In this case the time it takes to evolve each of the levels is about the same, making load-balancing easy. However, practical grid setups might deviate from this ideal. In the following we describe possible problems which may arise with unequal numbers of points and how they may be alleviated.

3.2.1. Finer grids have more points

Consider the situation in which each level has twice as many grid points as the next coarser level, e.g. the number of points on level $\ell$ is given by $S_{\ell} = 2^{\ell} S_0$. The total number of grid points is

$$S_T = \sum_{\ell=0}^{N-1} S_{\ell} = (2^{N}-1)\,S_0 \approx 2^{N} S_0. \tag{4}$$

Let us assume that the time to perform a single step on a level is modeled by the formula:

$$T \propto \max(G, S/P) \tag{5}$$

where S is the number of grid points on the level, P is the total number of processors and T is the time. This formula dictates that there is an optimal "grain size" (G), the smallest number of points which a processor can efficiently handle. This flattening of the performance curve is expected, because of overheads introduced by ghost zones, etc. Note that we do not necessarily expect the scaling curve to be this bad, and also that the influence of bad scaling on coarser levels will be small for subcycling methods, and so we will compare with perfect scaling. The true result will likely lie somewhere in between. Note, however, that Eq. (5) predicts behavior quite similar to what we see in our strong scaling plots, e.g. see Fig. 11.

Fig. 11. Shown are strong scaling results for SAMR and CSAMR on the first machine, for 2, 4, 6, and 8 evolved levels (top to bottom plot). It can be clearly seen that while for only two levels SAMR performs better for fewer processes, CSAMR outperforms SAMR in every other case, especially for large numbers of processes and/or large number of levels.

Note also that the actual value of G is not important here; it is a quantity that will depend on details of the computer architecture and the code being used. In all the calculations below, we will assume that we run at maximum speed and give exactly G points to each processor, unless otherwise stated.
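The flattening implied by Eq. (5) can be seen in a small numerical example. In the Python sketch below the proportionality constant is set to one, and the grid size and grain size are arbitrary illustrative values, not measurements from any machine or code discussed here.

```python
def step_time(points, procs, grain):
    """Eq. (5) with unit proportionality constant: max(G, S/P)."""
    return max(grain, points / procs)


points = 10_000   # hypothetical number of grid points on one level
grain = 500       # hypothetical grain size G in points per processor

for procs in (5, 10, 20, 40, 80):
    print(procs, step_time(points, procs, grain))
```

The step time halves with each doubling of processors until every processor holds exactly G points (20 processors in this example) and stays constant beyond that, so additional processors only add cost.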

For NSAMR, the total number of processes is, therefore,

$$P_{\mathrm{total}}(NS) = S_T/G = (2^{N}-1)\,S_0/G \approx 2^{N} S_0/G. \tag{6}$$

The approximation is accurate to within 1% for N ≥ 7. Defining T to be the time to perform one step we have

$$T_{\mathrm{runtime}}(NS) = 2^{N-1}\,T, \tag{7}$$

and thus, a total cost in SUs of

$$C_{SU}(NS) \approx (2^{2N-1} S_0/G)\cdot T. \tag{8}$$

For SAMR, levels are evolved sequentially, each using the same number of processors. Even for problems in which levels are of equal size, finer levels have to perform more work than coarser levels for the same amount of physical time, because they must update more frequently, and so they tend to be the most important. This trend of more work on the finer levels is even more true in this example, where the finer grids are larger. It is, therefore, clear that the needs of the finest level should determine the number of processors used:

$$P_T(S) = S_{N-1}/G = 2^{N-1} S_0/G. \tag{9}$$

Note that here we assume that on all levels coarser than the finest there is not enough work available, i.e. grid points on which to operate, and some processes are idle. This gives

$$T_{\mathrm{runtime}}(S) = (2^{N-1} + 2^{N-2} + 2^{N-3} + \cdots)\cdot T = 2\cdot 2^{N-1}\,T, \tag{10}$$

resulting in a total SU cost, as the product of Eqs. (9) and (10), of

$$C_{SU}(S) = 2\cdot(2^{2N-2} S_0/G)\cdot T = (2^{2N-1} S_0/G)\cdot T. \tag{11}$$

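For comparison, if we have perfect scaling, then SAMR will have no idle time and

$$T_{\mathrm{runtime}}(IS) = \left(2^{N-1} + \tfrac{1}{2}\,2^{N-2} + \tfrac{1}{4}\,2^{N-3} + \cdots\right)\cdot T = \tfrac{4}{3}\cdot 2^{N-1}\,T, \tag{12}$$

resulting in a total SU cost, as the product of Eqs. (9) and (12), of

$$C_{SU}(IS) = \tfrac{4}{3}\cdot(2^{2N-2} S_0/G)\cdot T = \tfrac{2}{3}\cdot(2^{2N-1} S_0/G)\cdot T. \tag{13}$$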

For CSAMR, we have a better situation, in that the finest level does not dictate our total process count. The next finest level, however, has a significant impact. We have two groups of processors for CSAMR, for the finest level

$$P_{N-1} = 2^{N-1} S_0/G$$

and for the next coarsest level

$$P_{N-2} = 2^{N-2} S_0/G,$$

leading to a total of

$$P_T(PS) = \tfrac{3}{2}\cdot 2^{N-1} S_0/G. \tag{14}$$

CSAMR needs the same time per coarse level time step as NSAMR, so the total runtime is identical to Eq. (7):

$$T_{\mathrm{runtime}}(PS) = 2^{N-1}\,T, \tag{15}$$

leading to a total SU cost of

$$C_{SU}(PS) = \tfrac{3}{4}\,(2^{2N-1} S_0/G)\cdot T. \tag{16}$$

The following table summarizes the numbers of processes, run time and computational cost for each of the methods.

Method          Procs                   Runtime                  Comp. cost
Normalization   $P_T/(2^{N} S_0/G)$     $T_{rt}/(2^{N-1} T)$     $C_{SU}/(2^{2N-1} S_0 T/G)$
NSAMR           1                       1                        1
SAMR            1/2                     2                        1
SAMR, ideal     1/2                     4/3                      2/3
CSAMR           3/4                     1                        3/4
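The table entries follow from multiplying the process counts by the runtimes. As a cross-check, the Python sketch below evaluates the large-N expressions for processes and runtime in units of S_0/G = 1 and T = 1 and normalizes them as in the table; the function name and the dictionary layout are choices made for this sketch.

```python
from fractions import Fraction


def normalized_table(n):
    """Processes, runtime and cost for N = n levels in the table's units."""
    p_unit = 2 ** n              # 2^N S_0/G
    t_unit = 2 ** (n - 1)        # 2^(N-1) T
    c_unit = 2 ** (2 * n - 1)    # 2^(2N-1) S_0 T/G
    raw = {  # method: (processes, runtime)
        "NSAMR": (Fraction(2 ** n), Fraction(2 ** (n - 1))),
        "SAMR": (Fraction(2 ** (n - 1)), Fraction(2 * 2 ** (n - 1))),
        "SAMR, ideal": (Fraction(2 ** (n - 1)), Fraction(4, 3) * 2 ** (n - 1)),
        "CSAMR": (Fraction(3, 2) * 2 ** (n - 1), Fraction(2 ** (n - 1))),
    }
    return {
        name: (procs / p_unit, time / t_unit, procs * time / c_unit)
        for name, (procs, time) in raw.items()
    }


for name, row in normalized_table(8).items():
    print(name, *row)
```

Because every quantity scales with the same powers of two, the normalized values do not depend on the N passed in.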

The SAMR calculation without perfect scaling has clearly no advantage over any of the other methods, but as mentioned, this is the worst possible case for SAMR. Not only does it take the longest by far to compute, it fails to provide any speedup, because of the scaling equation Eq. (5). This result supports the conclusion of the Paramesh group cited above, that subcycling does not help for grids with larger fine grids.

Comparing perfect scaling with the other methods, however, the runtime of all three methods is almost identical. NSAMR and CSAMR have, using the criteria above for the numbers of processes, by design an identical runtime, while SAMR with ideal scaling is somewhat slower because it has to evolve the coarser levels sequentially to the finest level.

Differences, however, are clear in the number of used processes, and thus, in the total computational cost. SAMR uses only half the processes and thus half the resources compared to NSAMR. CSAMR lies between the two with 3/4 of the processes and cost of NSAMR. This shows that for this particular case, and with not too bad scaling around the grain size G, SAMR would be preferred when optimizing computational cost, but it would be at least 14% slower than the other two methods. When interested in the fastest method, CSAMR would be preferred because while NSAMR has the same runtime, CSAMR would require fewer resources.

3.2.2. Finer grids have fewer points

While black hole evolutions typically have approximately the same number of grid points per level, this is not always the case. Sometimes one of the coarser levels will have significantly more points in order to provide better data for radiation extraction.

Consider a problem similar to the above, but assume a coarse level that has three times as many points as each of the other, identical levels. In this case, and unless the coarse level completely dominates the evolution, the parallelization strategy of all three methods will stay the same, because while the coarse level may be larger, the fine grids still do many more steps. For the same reason, the influence of this larger, coarse level on the runtime of SAMR and CSAMR will be negligible, assuming a large number of levels. On the other hand, within NSAMR methods, this large level takes the same number of steps as any other level, significantly adding to the overall runtime.

On the other hand, the large level might not be the coarsest one, and the limit of a large number of levels might not apply. In order to shed light on this case, we assume only three levels in total in the following, again with the coarsest level being larger than any of the other, identical levels.

Let the size of all but the coarsest level be $S \equiv (1/3)\,S_0 = S_1 = S_2$. In this case, NSAMR will require $P_T = 5S/G$ processors and take time equal to $T_s = 4T$ to evolve. Note that we only have to do one analysis for SAMR here, regardless of the assumed scaling, because here the number of processes is determined by the size of the fine level, and no other level is smaller.

SAMR will therefore require $P_T = S/G$, and the evolution time will be $T_s = T_0 + 2T_1 + 4T_2 = 3T + 2T + 4T = 9T$.

CSAMR, on the other hand, will assign $P_T = 3S/G$ processors for the coarser processor group to evolve the large level efficiently, and $P_T = S/G$ for the fine evolution group, and thus the cost in SUs would be 4/5 of that of NSAMR.

We find again that CSAMR is able to obtain the speed of NSAMR, with an SU cost between those of NSAMR and SAMR.

Method          Procs           Runtime               Comp. cost
Normalization   $P_T/(S/G)$     $T_{\mathrm{runtime}}$   $C_{SU}/(T S/G)$
NSAMR           5               4T                    20
SAMR            1               9T                    9
CSAMR           4               4T                    16
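These three rows can be reproduced from the step-time model of Eq. (5). The Python sketch below works in units where S = 1, G = 1 and T = 1, so that processor counts are in units of S/G; the variable names and the unit choice are assumptions of this sketch.

```python
def step_time(points, procs, grain=1.0):
    """Eq. (5) with unit proportionality constant, in units of T."""
    return max(grain, points / procs)


sizes = [3.0, 1.0, 1.0]   # S_0, S_1, S_2 in units of S
steps = [1, 2, 4]         # steps per coarse time step when subcycling
grain = 1.0               # G, chosen so that S/G is one processor unit

# NSAMR: every level gets its own processors and takes the finest step count.
nsamr_procs = sum(sizes) / grain
nsamr_time = steps[-1] * max(step_time(s, s / grain) for s in sizes)

# SAMR: one processor set sized for the finest level, levels run in sequence.
samr_procs = sizes[-1] / grain
samr_time = sum(n * step_time(s, samr_procs) for n, s in zip(steps, sizes))

# CSAMR: a coarse group sized for level 0 and a fine group for level 2.
coarse_procs = sizes[0] / grain
fine_procs = sizes[-1] / grain
coarse_time = sum(
    n * step_time(s, coarse_procs) for n, s in zip(steps[:-1], sizes[:-1])
)
fine_time = steps[-1] * step_time(sizes[-1], fine_procs)
csamr_procs = coarse_procs + fine_procs
csamr_time = max(coarse_time, fine_time)

for name, procs, time in (
    ("NSAMR", nsamr_procs, nsamr_time),
    ("SAMR", samr_procs, samr_time),
    ("CSAMR", csamr_procs, csamr_time),
):
    print(name, procs, time, procs * time)
```

In this model the coarse group needs only 3T per coarse time step, so the CSAMR runtime of 4T is set by the fine group.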

However, CSAMR can be optimized further. Adding a 3rd processor group (of equal size to the other two) can assist with the evolution of the coarser grid. Let us call this group B (for "Big grid"). The schedule for a CSAMR evolution would then look like this:

Proc. B          Proc. 1          Proc. 2
one_step(0)/3    one_step(1)      one_step(2)
one_step(0)/3    one_step(0)/3    one_step(2)
idle             one_step(1)      one_step(2)
idle             idle             one_step(2)

The entry one_step(0)/3 appears three times, and each time it is invoked it performs 1/3 of the update required for the coarsest level. This execution reduces the cost in SUs to 3/5 of that of NSAMR, despite the fact that the additional processors are idle 25% of the time. With this adjustment, the performance table now looks like this:

Method          Procs           Runtime               Comp. cost
Normalization   $P_T/(S/G)$     $T_{\mathrm{runtime}}$   $C_{SU}/(T S/G)$
NSAMR           5               4T                    20
SAMR            1               9T                    9
CSAMR           3               4T                    12

4. Experimental results

4.1. Experimental environments

To evaluate the real speedup and effectiveness of our method, we implemented CSAMR within AMSS-NCKU [29,30], a code that evolves the general relativistic vacuum equations, e.g. in order to simulate black holes. It is written in C++ and Fortran 90. All the array calculations (e.g. the physics) are written in Fortran 90, while the infrastructure uses C++. Prior to this work, AMSS-NCKU could only use SAMR.

The CSAMR implementation in AMSS-NCKU was tested on two different platforms. The first is SuperMike II at LSU, a cluster of 440 compute nodes with a peak performance of 146 TFlops, each node containing two 8-core Sandy Bridge Xeon 64-bit processors operating at a core frequency of 2.6 GHz. The second is Blue Waters at NCSA. The Blue Waters system is a Cray XE/XK hybrid machine composed of AMD 6276 Interlagos processors and NVIDIA GK110 Kepler accelerators, all connected by the Cray Gemini torus interconnect. Note that these accelerators were not used for the results within this paper.

4.2. Results

To test the performance of CSAMR, we ran a series of binary black hole simulations. Only 10 steps were evolved instead of the more typical 10000. However, this should be adequate, as 10 steps are enough to show the performance of the method. In all tests, as is typical for our evolutions, the number of grid points per level is almost identical.

In our tests, we initialized the mesh size of each level to 40×40×20, and performed simulations with different numbers of mesh levels.

Fig. 11 shows the results on the first machine. The scalability of SAMR stops when the number of processes is about 32, but CSAMR continues to scale up to 64 processes. We can also see that when the number of levels is two, the performance of CSAMR is comparable to, but slightly lower than, that of SAMR for small processor counts. However, for larger numbers of levels, the grain size each processor has to evolve is larger for CSAMR, and thus there is less overhead per processor. This leads to better performance for CSAMR compared to SAMR even with the same number of processes. Fig. 11 also clearly shows the advantage of CSAMR when the number of levels is large.

Fig. 12 shows the results on the second machine with the same input setup and the same software implementation. Apart from the exact evolution times, the trend matches that on the first machine.


Fig. 12. Shown are strong scaling results for SAMR and CSAMR on the second machine, for 4, 6, and 8 evolved levels (top to bottom plot). Similar to Fig. 11, it can be clearly seen that CSAMR outperforms SAMR in all cases, especially for large numbers of processes and/or large numbers of levels.

The following tables are based on runs of AMSS-NCKU. In each case, we perform a binary black hole simulation using the SAMR method, then perform the same run using the CSAMR method but with twice as many processors. In both runs we partition the grid in exactly the same way, giving the same number of zones to each process. Ideally, the CSAMR run should be twice as fast (100% faster). However, in practice, measured speedups are between 70% and 90%.

We theorized that the prolong/restrict operations were serializing the code, so we instrumented the code to measure the fraction of time $s$ spent in these phases, then inserted this fraction into Amdahl's law:

$$\text{Predicted Speedup} = \frac{1}{s + (1 - s)/2} \qquad (17)$$
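Eq. (17) can also be inverted to estimate the serial fraction implied by a given speedup. The following worked calculation is only illustrative: it assumes that $s$ is the sole non-parallel fraction, and the sample speedups 1.7 and 1.9 are the range quoted in the abstract rather than new measurements.

```latex
\text{Speedup} = \frac{1}{s + (1 - s)/2} = \frac{2}{1 + s}
\quad\Longrightarrow\quad
s = \frac{2}{\text{Speedup}} - 1,
\qquad
s(1.7) = \frac{2}{1.7} - 1 \approx 0.18,
\qquad
s(1.9) = \frac{2}{1.9} - 1 \approx 0.05 .
```

Under this reading, a speedup between 1.7 and 1.9 corresponds to roughly 5% to 18% of the run time being spent in the serialized phase.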

For the same problem size, AMSS-NCKU reported considerable variance in the fraction of time spent in the prolong/restrict phase by each node. We therefore report three predicted speedups: a minimum speedup, which uses the average time spent in prolong/restrict across all nodes ($t_\mathrm{avg}$); a maximum speedup, based on the minimum time spent in prolong/restrict across all nodes ($t_\mathrm{min}$); and an average speedup, based on the empirically chosen weighted average $t_w = 0.80 \times t_\mathrm{avg} + 0.20 \times t_\mathrm{min}$. The results of these experiments (with the title “no barrier” below) show rough agreement between the predicted and measured performance, but the variance is wide.
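The three predictions can be sketched in a few lines. This is a minimal illustration, assuming that each node's prolong/restrict time has already been divided by the total run time to give a fraction; the function names and the sample fractions are hypothetical and are not AMSS-NCKU output.

```python
def predicted_speedup(s, factor=2):
    """Amdahl's law, Eq. (17), for a `factor`-fold increase in processors."""
    return 1.0 / (s + (1.0 - s) / factor)


def speedup_bounds(fractions, weight_avg=0.80):
    """Return (min, weighted-average, max) predicted speedups.

    `fractions` holds, per node, the fraction of run time spent in
    prolong/restrict. The average fraction gives the most pessimistic
    prediction, the minimum fraction the most optimistic one.
    """
    s_avg = sum(fractions) / len(fractions)
    s_min = min(fractions)
    s_w = weight_avg * s_avg + (1.0 - weight_avg) * s_min
    return predicted_speedup(s_avg), predicted_speedup(s_w), predicted_speedup(s_min)


# Hypothetical per-node fractions, chosen only to illustrate the calculation.
example = [0.05, 0.15, 0.20, 0.20]
low, mid, high = speedup_bounds(example)
print(f"min {low:.2f}, average {mid:.2f}, max {high:.2f}")
```

A wide spread between the per-node fractions translates directly into a wide gap between the minimum and maximum predictions.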

In an effort to improve the match between our experimental results and our predictions, we created a new version of the code with reduced asynchrony. We introduced an MPI barrier both before and after the prolong/restrict sections. Surprisingly, the introduction of barriers did not impact the overall wall time of the code, but reduced the variance in time spent in prolong/restrict to virtually nothing and vastly improved the agreement with our predictions.
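The barrier placement described here can be sketched as follows. This is not the AMSS-NCKU code (which is written in C++ and Fortran 90) but a hedged illustration: `timed_prolong_restrict` and `SerialComm` are hypothetical names, and the communicator is assumed to be any object with a `Barrier()` method, such as `MPI.COMM_WORLD` from mpi4py.

```python
import time


def timed_prolong_restrict(comm, prolong_restrict):
    """Run `prolong_restrict` between two barriers and return its wall time.

    The first barrier lines all ranks up before the phase starts; the second
    keeps any rank from leaving before all have finished, so every rank
    measures nearly the same interval.
    """
    comm.Barrier()
    start = time.perf_counter()
    prolong_restrict()
    comm.Barrier()
    return time.perf_counter() - start


class SerialComm:
    """Stand-in communicator for single-process runs (hypothetical helper)."""

    def __init__(self):
        self.barrier_calls = 0

    def Barrier(self):
        self.barrier_calls += 1


# Single-process demonstration with a placeholder workload.
comm = SerialComm()
elapsed = timed_prolong_restrict(comm, lambda: sum(range(1000)))
print(f"prolong/restrict took {elapsed:.6f} s, barriers: {comm.barrier_calls}")
```

Because waiting time is absorbed by the barriers, the interval measured on each rank reflects the slowest rank, which is consistent with the reduced variance reported above.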

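40×40×20 no barrier

| Procs (PSAMR) | Procs (SAMR) | Measured speedup | Min speedup | Average speedup | Max speedup |
|---|---|---|---|---|---|
| 32 | 16 | 1.65 | 1.65 | 1.69 | 1.86 |
| 64 | 32 | 1.74 | 1.69 | 1.73 | 1.92 |
| 128 | 64 | 1.68 | 1.60 | 1.66 | 1.95 |

40×40×20 barrier

| Procs (PSAMR) | Procs (SAMR) | Measured speedup | Min speedup | Average speedup | Max speedup |
|---|---|---|---|---|---|
| 16 | 8 | 1.83 | 1.84 | 1.84 | 1.84 |
| 32 | 16 | 1.72 | 1.73 | 1.73 | 1.73 |
| 64 | 32 | 1.63 | 1.68 | 1.68 | 1.68 |
| 128 | 64 | 1.63 | 1.66 | 1.66 | 1.66 |

32×32×16 no barrier

| Procs (PSAMR) | Procs (SAMR) | Measured speedup | Min speedup | Average speedup | Max speedup |
|---|---|---|---|---|---|
| 16 | 8 | 1.84 | 1.80 | 1.82 | 1.89 |
| 32 | 16 | 1.82 | 1.78 | 1.81 | 1.94 |

32×32×16 barrier

| Procs (PSAMR) | Procs (SAMR) | Measured speedup | Min speedup | Average speedup | Max speedup |
|---|---|---|---|---|---|
| 16 | 8 | 1.88 | 1.83 | 1.83 | 1.84 |
| 32 | 16 | 1.82 | 1.81 | 1.81 | 1.81 |

64×64×32 no barrier

| Procs (PSAMR) | Procs (SAMR) | Measured speedup | Min speedup | Average speedup | Max speedup |
|---|---|---|---|---|---|
| 16 | 8 | 1.82 | 1.81 | 1.83 | 1.90 |
| 32 | 16 | 1.85 | 1.75 | 1.78 | 1.90 |
| 64 | 32 | 1.71 | 1.66 | 1.70 | 1.90 |
| 128 | 64 | 1.81 | 1.74 | 1.77 | 1.92 |
| 256 | 128 | 1.85 | 1.76 | 1.80 | 1.95 |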

In the future, we plan to work on reducing the time spent in prolongation/restriction, in order to achieve a better parallel speedup.

5. Conclusion

In this work we have introduced Concurrent Structured Adaptive Mesh Refinement (CSAMR), an offline scheduling algorithm that optimizes an adaptive mesh refinement calculation. We have shown that it offers the speed of non-subcycling, but uses significantly fewer resources. This decrease can enable runs at an increased problem size compared to usual subcycling, or alternatively can substantially decrease the computational cost of non-subcycling simulations, increasing the number of possible simulations for a given allocation size.

We have also noted that the CSAMR scheduling introduces a certain, usually small, amount of idle time into the calculation. This idle time might be re-purposed for analysis, in-situ visualization, etc. If this idle time is well used, the result could be a shorter wall time relative to a non-subcycling code, which has no idle time to spare for these (possibly necessary) tasks.

Acknowledgments

We acknowledge the support of the following grants: the DoE XPRESS proposal (DE-SC0008714), National Science Foundation 1265449, National Science Foundation 1212401, the National Natural Science Foundation of China (Nos. 61272087, 61073008, and 60773148), and the Beijing Natural Science Foundation (Nos. 4082016 and 4122039).

References

[1] M.J. Berger, P. Colella, Local adaptive mesh refinement for shock hydrodynamics, J. Comput. Phys. 82 (1) (1989) 64–84.

[2] M.J. Berger, J. Oliger, Adaptive mesh refinement for hyperbolic partial differential equations, J. Comput. Phys. 53 (3) (1984) 484–512.

[3] F. Löffler, J. Faber, E. Bentivegna, T. Bode, P. Diener, R. Haas, I. Hinder, B.C. Mundim, C.D. Ott, E. Schnetter, G. Allen, M. Campanelli, P. Laguna, The Einstein toolkit: a community computational infrastructure for relativistic astrophysics, Class. Quantum Grav. 29 (11) (2012) 115001, http://dx.doi.org/10.1088/0264-9381/29/11/115001, arXiv:1111.3344 [gr-qc].

[4] D. Pollney, C. Reisswig, L. Rezzolla, B. Szilágyi, M. Ansorg, B. Deris, P. Diener, E.N. Dorband, M. Koppitz, A. Nagar, E. Schnetter, Recoil velocities from equal-mass binary black-hole mergers: a systematic investigation of spin–orbit aligned configurations, Phys. Rev. D 76 (2007) 124002, arXiv:0707.2559 [gr-qc].

[5] E. Schnetter, S.H. Hawley, I. Hawke, Evolutions in 3-D numerical relativity using fixed mesh refinement, Class. Quantum Grav. 21 (2004) 1465–1488, http://dx.doi.org/10.1088/0264-9381/21/6/014, arXiv:gr-qc/0310042.

[6] A numerical simulation of the evolution of the low-level wind speed and rainfall in Hurricane Ivan (2004) at landfall, Poster presentation, 30th Conference on Hurricanes and Tropical Meteorology, Ponte Vedra, Florida, 2012.

[7] A. Dubey, A. Almgren, J. Bell, M. Berzins, S. Brandt, G. Bryan, P. Colella, D. Graves, M. Lijewski, F. Löffler, B. O’Shea, E. Schnetter, B.V. Straalen, K. Weide, A survey of high level frameworks in block-structured adaptive mesh refinement packages, J. Parallel Distrib. Comput. 74 (12) (2014) 3217–3227, http://dx.doi.org/10.1016/j.jpdc.2014.07.001.

[8] P. MacNeice, K. Olson, C. Mobarry, R. de Fainchtein, C. Packer, PARAMESH: A parallel adaptive mesh refinement community toolkit, Comput. Phys. Commun. 126 (3) (2000) 330–354.

[9] Parallel Adaptive Mesh Refinement PARAMESH. URL: http://www.physics.drexel.edu/olson/paramesh-doc/Usersmanual/amr.html.

[10] L. Dursi, M. Zingale, Efficiency gains from time refinement on AMR meshes and explicit timestepping, in: T. Plewa, T. Linde, V. Gregory Weirs (Eds.), Adaptive Mesh Refinement – Theory and Applications, Lecture Notes in Computational Science and Engineering, vol. 41, Springer, Berlin, Heidelberg, 2005, pp. 103–113, http://dx.doi.org/10.1007/3-540-27039-67.

[11] G.L. Bryan, M.L. Norman, B.W. O’Shea, T. Abel, J.H. Wise, M.J. Turk, D.R. Reynolds, D.C. Collins, P. Wang, S.W. Skillman, B. Smith, R.P. Harkness, J. Bordner, J.-h. Kim, M. Kuhlen, H. Xu, N. Goldbaum, C. Hummels, A.G. Kritsuk, E. Tasker, S. Skory, C.M. Simpson, O. Hahn, J.S. Oishi, G.C. So, F. Zhao, R. Cen, Y. Li, The Enzo Collaboration, Enzo: An adaptive mesh refinement code for astrophysics, Astrophys. J. Suppl. 211 (2) (2014) 52, arXiv:1307.2265.

[12] Enzo astrophysical AMR code, 2013, URL: http://enzo-project.org/.

[13] BoxLib, 2011, URL: https://ccse.lbl.gov/BoxLib.

[14] T. Goodale, G. Allen, G. Lanfermann, J. Massó, T. Radke, E. Seidel, J. Shalf, The Cactus framework and toolkit: design and applications, in: Vector and Parallel Processing – VECPAR’2002, 5th International Conference, Lecture Notes in Computer Science, Springer, Berlin, 2003, URL: http://edoc.mpg.de/3341.

[15] Cactus Computational Toolkit. URL: http://www.cactuscode.org/.

[16] P. Colella, D. Graves, N. Keen, T. Ligocki, D. Martin, P. McCorquodale, D. Modiano, P. Schwartz, T. Sternberg, B. Van Straalen, Chombo Software Package for AMR Applications Design Document, Tech. rep., Lawrence Berkeley National Laboratory, Applied Numerical Algorithms Group, Computational Research Division, 2009.

[17] S.G. Parker, A component-based architecture for parallel multi-physics PDE simulation, Future Gener. Comput. Syst. 22 (2006) 204–216.

[18] S.G. Parker, J. Guilkey, T. Harman, A component-based parallel infrastructure for the simulation of fluid–structure interaction, Eng. Comput. 22 (2006) 277–292.

[19] Z. Mo, A. Zhang, X. Cao, Q. Liu, X. Xu, H. An, W. Pei, S. Zhu, Jasmin: a parallel software infrastructure for scientific computing, Front. Comput. Sci. China 4 (4) (2010) 480–488, http://dx.doi.org/10.1007/s11704-010-0120-5.

[20] CASC, SAMRAI Structured Adaptive Mesh Refinement Application Infrastructure, Center for Applied Scientific Computing, Lawrence Livermore National Laboratory, 2007 December, URL: https://computation.llnl.gov/casc/SAMRAI/.

[21] R. Hornung, S. Kohn, Managing application complexity in the SAMRAI object-oriented framework, Concurr. Comput. Pract. Exp. 14 (5) (2002) 347–368.

[22] A.S. Almgren, J.B. Bell, M.J. Lijewski, Z. Lukić, E.V. Andel, Nyx: A massively parallel AMR code for computational cosmology, Astrophys. J. 765 (1) (2013) 39, URL: http://stacks.iop.org/0004-637X/765/i=1/a=39.

[23] L.F. Diachin, R. Hornung, P. Plassmann, A. Wissink, Parallel adaptive mesh refinement, Parallel Process. Sci. Comput. 20 (2006) 143.

[24] C. Burstedde, O. Ghattas, M. Gurnis, T. Isaac, G. Stadler, T. Warburton, L. Wilcox, Extreme-scale AMR, in: Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, IEEE Computer Society, 2010, pp. 1–12.

[25] Q. Meng, J. Luitjens, M. Berzins, Dynamic Task Scheduling for Scalable Parallel AMR in the Uintah Framework, Tech. rep., Citeseer, 2010.

[26] D.S. Balsara, Divergence-free adaptive mesh refinement for magnetohydrodynamics, J. Comput. Phys. 174 (2) (2001) 614–648.

[27] J.J. Carroll-Nellenback, B. Shroyer, A. Frank, C. Ding, Efficient parallelization for AMR MHD multiphysics calculations; implementation in AstroBEAR, J. Comput. Phys. 236 (2013) 461–476.

[28] F. Pretorius, M.W. Choptuik, Adaptive mesh refinement for coupled elliptic-hyperbolic systems, J. Comput. Phys. 218 (1) (2006) 246–274.

[29] Z. Cao, H.-J. Yo, J.-P. Yu, Reinvestigation of moving punctured black holes with a new code, Phys. Rev. D 78 (12) (2008) 124011.

[30] P. Galaviz, B. Brügmann, Z. Cao, Numerical evolution of multiple black holes with accurate initial data, Phys. Rev. D 82 (2) (2010) 024005.

Frank Löffler received his BS degree in 1988 in physics from Chemnitz University of Technology, Germany. He received his PhD in 2005, from Albert-Einstein-Institute, Max Planck Institute, Germany. From 2005 to 2007 he was a postdoctoral scholar at SISSA – International School for Advanced Studies, Italy, and from 2007 to 2011 he was a postdoctoral scholar at Louisiana State University. From 2011 until the present, he has worked as an IT Consultant at Louisiana State University. His research areas include high performance computing and neutron star simulations.

Zhoujian Cao received the BS degree in 2001 from the Physics Department of Beijing Normal University. He received his PhD degree in theoretical physics, in 2006, from Beijing Normal University. From 2006 to 2011, he worked as an assistant professor in the Academy of Mathematics and Systems Sciences, Chinese Academy of Sciences. From 2011 to the present, he has worked as an associate professor in the Academy of Mathematics and Systems Sciences, Chinese Academy of Sciences. His research areas include numerical relativity, gravitational theory and scientific computing.

Steven R. Brandt received the BS degree in 1985 in physics from Northeastern University. He received his PhD for numerical studies of black hole spacetimes in 1996, from the University of Illinois at Urbana-Champaign. From 1997 to 1998 he was a postdoctoral scholar at the Albert Einstein Institute in Potsdam, and in 1999 he was a postdoctoral scholar at Penn State University. From 2005 until the present, he has worked as an Adjunct Professor of Computer Science and IT Consultant at Louisiana State University. His research areas include high performance computing, garbage collection, and coastal simulation.

Zhihui Du received the BE degree in 1992 from the computer department of Tianjin University. He received the MS and PhD degrees in computer science, in 1995 and 1998 respectively, from Peking University. From 1998 to 2000, he worked at Tsinghua University as a postdoctoral scholar. From 2001 to the present, he has worked at Tsinghua University as an associate professor in the Department of Computer Science and Technology. His research areas include high performance computing, grid computing, cloud computing and energy efficient computing.