
A new parallelization scheme for adaptive mesh refinement


Contents lists available at ScienceDirect

Journal of Computational Science

journal homepage: www.elsevier.com/locate/jocs


Frank Löffler a, Zhoujian Cao b, Steven R. Brandt a,c,∗, Zhihui Du d

a Center for Computation and Technology, Louisiana State University, Baton Rouge, LA 70803, USA
b Institute of Applied Mathematics, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100190, People’s Republic of China
c Department of Computer Science, Louisiana State University, Baton Rouge, LA 70803, USA
d Tsinghua National Laboratory for Information Science and Technology, Department of Computer Science and Technology, Tsinghua University, Beijing 100084, People’s Republic of China

article info

Article history:
Received 30 October 2015
Received in revised form 27 April 2016
Accepted 4 May 2016
Available online 6 May 2016

Keywords:
Parallel application frameworks
Parallel algorithms
Parallel applications
Adaptive mesh refinement

abstract

We present a new method for parallelization of adaptive mesh refinement called Concurrent Structured Adaptive Mesh Refinement (CSAMR). This new method offers the lower computational cost (i.e. wall time × processor count) of subcycling in time, but with the runtime performance (i.e. smaller wall time) of evolving all levels at once using the time step of the finest level (which does more work than subcycling but has less parallelism). We demonstrate our algorithm’s effectiveness using an adaptive mesh refinement code, AMSS-NCKU, and show performance on Blue Waters and other high performance clusters. For the class of problem considered in this paper, our algorithm achieves a speedup of 1.7–1.9 when the processor count for a given AMR run is doubled, consistent with our theoretical predictions.

© 2016 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).

1. Introduction

Adaptive mesh refinement (AMR) is useful when different levels of accuracy are required in different regions of a computational domain. Structured Adaptive Mesh Refinement (SAMR) methods have been widely used in computation since Berger, Oliger, and Colella pioneered the technique in the 1980s [1,2]. A number of codes and frameworks have been developed around their idea.

For a variety of cutting-edge simulations, including but not limited to astrophysical compact objects like black holes, neutron stars, and collisions between black holes and neutron stars, the number of grid points is usually about the same on each level, e.g. [3–5]. An example grid setup for this type of application is shown in Fig. 1.

This arrangement is also common in many other areas, such as climate modeling. In Fig. 2 we show the grid decomposition for a simulation done with the Weather Research and Forecasting model (WRF). The simulation shown was taken from a presentation at the American Meteorological Society [6], and uses mesh refinement with three levels, containing 230k, 305k, and 174k grid points respectively, which are reasonably similar numbers.

Simulations in this problem domain typically use subcycling in time, even though subcycling has lower concurrency and therefore longer wall time (approximately 50% longer for WRF, which uses refinement factor 3). The main motivation for this is the substantially reduced computational cost (i.e. wall time × processors used).

∗ Corresponding author at: Center for Computation and Technology, Louisiana State University, Baton Rouge, LA 70803, USA.
E-mail address: [email protected] (S.R. Brandt).

In any AMR algorithm, the coarse level needs to be updated everywhere using data from the fine level (restriction), and the fine level needs to be updated on the boundary using data from the coarse level (prolongation). This dependency, and the fact that in subcycling levels take time steps of different sizes, means that levels are usually distributed independently from each other and evolved sequentially, e.g. level 1, then level 2, then level 3.

Even though SAMR is a sequential algorithm, parallel speed-up is achieved by dividing the mesh within each level (see Fig. 3). SAMR can scale until the sub-meshes reach a certain minimum size, beyond which overheads for ghost zones and communication become too costly.

This scaling limitation is one of the reasons why many codes eschew subcycling in favor of computing all levels at once in parallel (non-subcycling), using the time step determined by the requirements on the finest level. This achieves much more parallelism and has the advantage of being approximately twice as fast as typical subcycling algorithms (for refinement factor two and a similar grid size per process).

On the other hand, non-subcycling can be far more costly in terms of computational cost. How much more expensive this option is depends on many aspects, e.g. the number of levels, and the ratio of grid points on each refinement level. In cases where the number of points on the finest level is much higher than on other levels, subcycling cannot offer a huge benefit. Henceforth, we will refer to this non-subcycling version of AMR as NSAMR.

http://dx.doi.org/10.1016/j.jocs.2016.05.003

Fig. 1. Example of an AMR grid hierarchy with nested levels of finer resolution, denoted by different shades of gray. (Although this example is 2D, the main application is fully 3D.) Two regions are highly resolved, each surrounded by levels of coarser resolution, which eventually merge and form one set of boxes surrounding both fine regions. Note that one refinement level cannot be represented by just one or two boxes, but has to be split into three boxes to represent its shape. We show the whole structure on the left and a zoom of the inner levels on the right-hand side.

Fig. 2. Grids used to simulate Hurricane Ivan, produced using the Weather Research and Forecasting model (WRF). The simulation shown was taken from a presentation at the American Meteorological Society, and has three levels of refinement containing reasonably similar numbers of grid points. Image courtesy of Sytske Kimball, Professor of Meteorology, Director, South Alabama Mesonet, Dept. of Earth Sciences, University of South Alabama.

Fig. 3. (a) shows that all the different mesh levels can be divided into similar sub-meshes so the calculation on each level can be executed in parallel on all its sub-meshes (the example is 2D for simplicity, but in many applications, the meshes are 3D). The more sub-meshes a level is divided into, the more processors can be used to run them in parallel, which often means a larger performance improvement. (b) shows that each sub-mesh must include additional ghost zones, which introduce some overhead. The smaller the sub-mesh is, the larger the relative overhead, so the sub-mesh size cannot be made arbitrarily small without hurting performance.

In this paper, we provide a new variant of SAMR that adds concurrency directly to the algorithm, called Concurrent Structured AMR (CSAMR). CSAMR combines the speed, scaling, and concurrency of the non-subcycling variant, NSAMR, with the low computational cost of subcycling structured AMR, or SAMR.

2. Related work

A multitude of packages exist that offer some AMR capability. Even limiting this to parallel AMR produces a list too long to discuss here. Therefore, we select a couple of cases which we believe to be most relevant. For a survey of high level frameworks in block-structured AMR packages the reader is referred to [7].

Paramesh [8,9] is a package of Fortran 90 subroutines designed to provide an application developer with an easy way to extend an existing serial code which uses a logically Cartesian structured mesh into a parallel code with AMR. For many problems in the domain of interest of most Paramesh users, the number of grid points on the finer levels is larger than at the coarser levels, and the throughput benefit of subcycling is not worth the time cost [10]. Thus, Paramesh users prefer to use NSAMR. Indeed, as we show below, in many cases of this nature, there is no benefit in throughput from SAMR.

Other packages require subcycling in time. One interesting example of this is Enzo [11,12]. It only supports subcycling, but it does not require that the ratios of cell sizes and time steps be identical on all levels, while other codes often do. Packages that allow users to choose between subcycling and non-subcycling include BoxLib [13], Cactus [14,15] and Chombo [16]. Other notable packages are Uintah [17,18], Jasmine [19] and SAMRAI [20,21], but many more exist.

Nyx, a parallel AMR code for computational cosmology, employs “optimal subcycling”, and uses SAMR on some levels, and NSAMR on others, adapting to the performance needs of the problem at hand [22].

Diachin et al. [23] discuss a number of techniques for implementing SAMR and UAMR (i.e. Unstructured Adaptive Mesh Refinement) in parallel, but do not mention any parallelism present within the algorithm itself.

The paper from TACC on Extreme Scale AMR [24] discusses their parallel AMR structure, but the focus is on the forest-of-octrees methods they apply, not on parallelism within the AMR algorithm itself.

Dynamic parallelism schemes could achieve the scheduling described in this work, i.e. schemes in which all the methods of the program are scheduled as tasks capable of executing as soon as inputs are ready. However, in order for the speedup to be realized, the tasks for one group of processors need to be assigned to the finest grid, while the remaining tasks should be assigned to a second group.

The Uintah Framework has a powerful dynamic task execution system as well as SAMR capabilities, but its authors do not mention the SAMR framework being able to take advantage of this parallelism within AMR itself [25].

Discussions of dynamic parallelism are mostly concerned with load balancing the finer grids, e.g. see Balsara [26]. In that paper the Knapsack algorithm is applied to balancing the finest grid, while the coarser grids are seen as a perturbation. However, we also note that for typical applications of the code used here, the sub-meshes are often of very similar size. For problems like this, simpler algorithms can be applied.

There is, however, one dynamic SAMR system that comes very close to ours. The AstroBEAR Framework [27] recognizes the value of allowing the different levels of AMR to advance independently, and in one of their figures they show an execution order identical to the one we use. They discuss methods of load balancing, but do not mention the strategy of isolating the fine grids on one group of processors. Instead, they use an extended system of ghost cells to decouple advancement on their grids.

The authors are not aware of any other system, apart from AstroBEAR, that actually parallelizes the SAMR method itself. While CSAMR has some features in common with AstroBEAR, we believe CSAMR still makes definite contributions in how to understand and parallelize the algorithm.

3. Description of the method

AMR algorithms typically study hyperbolic equations where a set of fields is evolved forward in time. The computational domain is described by a set of nested grids with increasingly higher resolution. These grids may be evolved together with a time step appropriate to the Courant condition of the finest grid, or with a time step that varies by level. In this latter case, the levels are typically evolved in an order described by the pseudocode given in Fig. 4, assuming a fixed refinement factor of two between levels.

Note that here we assume that values for level run from 0 up to and including maxlevel-1, with 0 indicating the coarsest level and maxlevel-1 the finest. The function onestep(level) evolves the specified level of refinement by a time step of size $(1/2)^{\mathrm{level}} \Delta t$.

Fig. 4. Pseudo-code depicting the execution of a basic subcycling AMR algorithm with traditional call ordering.

If this is not the finest level, the boundary of the next finer level will be provided by prolongation, after which that level can make two steps, performing another prolongation in between. After these two steps, both levels are again at the same physical time, and information from the finer grid has to be restricted to the coarser grid.
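The traditional call ordering described above can be sketched in Python as follows. This is a minimal illustration written from the prose description, not a transcription of Fig. 4: the operation names `prolongate` and `restrict` and the function `subcycling_order` are assumptions introduced here, and the sketch only records the order of operations instead of evolving any data.

```python
def subcycling_order(maxlevel):
    """Return the operations of one coarse time step, in execution order.

    Levels run from 0 (coarsest) to maxlevel - 1 (finest) and the
    refinement factor between levels is assumed to be two.
    """
    ops = []

    def recursivestep(level):
        # Evolve this level by one of its own time steps.
        ops.append(("onestep", level))
        if level < maxlevel - 1:
            # The next finer level takes two steps; its boundary is
            # filled by prolongation before each of them.
            for _ in range(2):
                ops.append(("prolongate", level, level + 1))
                recursivestep(level + 1)
            # Both levels are at the same time again: restrict.
            ops.append(("restrict", level + 1, level))

    recursivestep(0)
    return ops


for op in subcycling_order(3):
    print(op)
```

For three levels the printed sequence contains seven `onestep` entries: one on level 0, two on level 1 and four on level 2.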

If the refinement regions move, prolongation is used to initialize any new patches that are created.

In the following, we will use $N$ as the number of levels in a problem (corresponding to maxlevel in pseudocode). We will typically use $\ell$ to identify a particular level (corresponding to level in pseudocode).

For simplicity, we assume roughly the same number of grid points for each level. Evolutions using subcycling take $2^{N-1}$ time steps on the finest level, and must evolve $2^N - 1$ steps in total. For a large number of levels, this is roughly a factor of $2 \approx (2^N - 1)/2^{N-1}$ compared to the number of steps on the finest level. Evolving the finest grid is about as expensive as evolving all coarser grids combined in this case.

Alternatively, evolutions which do not use subcycling must perform $2^{N-1}$ steps on each of the $N$ levels, resulting in a total of $N \times 2^{N-1}$ steps. Compared to subcycling, this means non-subcycling methods use approximately $N \times 2^{N-1}/(2^N - 1) \approx N/2$ times as many time steps (in the limit of large $N$, although ten is large in this case).
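As a quick check of these counts, the following sketch tabulates the number of level steps per coarse time step for both approaches. It assumes a refinement factor of two and equal work per step on every level; the function name `step_counts` is introduced here for illustration only.

```python
def step_counts(n_levels):
    """Level steps per coarse time step for refinement factor two.

    Returns (steps on the finest level,
             total steps with subcycling,
             total steps without subcycling).
    """
    finest = 2 ** (n_levels - 1)
    # With subcycling, level l takes 2**l steps: the sum is 2**N - 1.
    subcycling = sum(2 ** level for level in range(n_levels))
    # Without subcycling, every level takes as many steps as the finest.
    non_subcycling = n_levels * finest
    return finest, subcycling, non_subcycling


finest, sub, non_sub = step_counts(10)
print(finest, sub, non_sub, non_sub / sub)
```

For ten levels the ratio of non-subcycling to subcycling steps is 5120/1023, already very close to N/2 = 5.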

Looking closely at data dependencies in the above evolution scheme reveals that the call to onestep(level) on line 5 and the first call of recursivestep(level+1) on line 6 of Fig. 4 are working on different levels and do not depend on each other, and their order can be inverted or parallelized. First, let us consider an inverted form, see Fig. 5.

Fig. 5. Pseudo-code depicting the execution of a basic AMR algorithm with an alternate call ordering.

Fig. 6. Pseudo-code depicting the execution of an AMR algorithm. The recursive step is broken into a shorter function which does not include the final call to onestep(maxlevel).

Next we notice that the call to recursivestep(level) always finishes with a call to onestep(maxlevel) (line 3, for level == maxlevel). We can re-write this code to use a shorter recursivestep, one that does not include this final call to onestep(maxlevel). We show this form in Fig. 6.

The opportunity for parallelism, by running lines 8 and 9 of Fig. 6 in parallel, is now obvious. We show the fully parallel code in Fig. 7. The net effect of this is that there is only one parallel task outstanding at any time. The forked tasks only evolve the finest level, while the remainder of the tasks evolve a complete SAMR evolution with $N-1$ levels.

This is illustrated in the left panel of Fig. 8 for the simple case of 2 levels. The first iteration on both levels is performed in parallel. The finer level has to do one more iteration to arrive at the same time as the coarser level, during which processes owning the coarse level remain idle, as seen in Fig. 8. The same figure shows a similar schema for the case of 3 levels as well, where we want to note that in this case the relative amount of idle time is reduced by a factor of 2. Indeed, each new level reduces it by this factor, giving the amount of overhead from idle time as

$$T_{\mathrm{idle}} = \frac{1}{2^N} \quad \text{where } N > 1, \tag{1}$$

which falls below 1% starting with seven levels ($N = 7$). Problems with $N \ge 7$ are not unusual. In fact, some types of problems require many more levels of refinement.

Fig. 7. Pseudo-code for a parallel AMR algorithm. The two steps on line 8 and 9 of Fig. 6 are now run in parallel.

Fig. 8. The order in which different levels are evolved is shown in both diagrams. The two columns indicate the two process groups in each case. Different shades of gray indicate different levels: the lighter the color the finer the level. Numbers indicate the evolved time on that level in units of the finest level’s time step. The left panel shows the case of 2 levels, while the right shows the same for 3 levels. Within each panel, the left side shows the coarser levels, the right side shows the finest level. In both, wall time is meant to increase downward. Crosses indicate when one of the process groups is idle. Red arrows (arrows originating on the left column of boxes) indicate the flow of data for prolongation, while blue arrows (arrows originating on the right column of boxes) indicate the same for restriction. Two full time steps of the coarsest level are shown in both cases. The explicit timing of prolongation and restriction is encoded in the pseudocode of Fig. 7. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Fig. 9. Two examples of states on an AMR evolution using 4 levels, at different times (left panel: after 3 iterations, right panel: after 7 iterations). Simulation time increases upwards. Rectangles depict simulation times at which a given level was already evolved at that time. If these are filled, data for that time is currently available, while empty rectangles mean that data for this time was already overwritten (assuming only two time levels are kept in memory). The hatched area depicts a range in time within which data on a given level can be constructed by interpolation. It may happen during a CSAMR evolution that there is no temporal overlap for some pair (or pairs) of levels of refinement (e.g. levels one and four on the left panel).

The reason for the reduction of the idle time compared to two levels is that within the second time step of the finest grid, when for two levels one process group would be idle, the coarsest grid can be evolved, filling the gap. The next time this happens, however, the coarsest grid is already at the required time, and the process group will be idle. In this way, every second gap is filled with the evolution of the coarsest grid. This can be extended to more levels, always filling half of the gaps with evolutions of the new, coarser levels, confirming Eq. (1).
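The gap-filling argument can be checked with a small scheduling sketch. It is written under stated assumptions and is not the code of Fig. 7: refinement factor two, two process groups of equal size, every level step costing one wall-time slot, and restriction and prolongation not modeled. The function names are hypothetical.

```python
from fractions import Fraction


def coarse_group_order(n_levels):
    """Order in which the coarse group steps levels 0 .. n_levels - 2.

    Uses the inverted ordering: the finer level steps first, then the
    current level, then the finer level again.
    """
    top = n_levels - 2
    order = []

    def visit(level):
        if level < top:
            visit(level + 1)
        order.append(level)
        if level < top:
            visit(level + 1)

    visit(0)
    return order


def csamr_schedule(n_levels):
    """Per-slot pairs (coarse group task, fine group task); None is idle."""
    finest = n_levels - 1
    slots = 2 ** (n_levels - 1)  # steps of the finest level
    coarse = coarse_group_order(n_levels)
    coarse += [None] * (slots - len(coarse))
    return [(task, finest) for task in coarse]


def idle_fraction(n_levels):
    """Share of all processor slots in which a group is idle."""
    schedule = csamr_schedule(n_levels)
    idle = sum(1 for task, _ in schedule if task is None)
    return Fraction(idle, 2 * len(schedule))


print(csamr_schedule(3))
print(idle_fraction(7))
```

For three levels the coarse group steps levels 1, 0, 1 and is then idle for one slot, and for seven levels the idle share is 1/128, in agreement with Eq. (1).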

Restriction and prolongation might look complex in Fig. 8, but the rules dictating when to perform them are relatively simple and determined by data flow. Whenever a finer grid needs boundary data from the next coarser grid, the ordering ensures that this coarser grid is already evolved to that point and can provide the data.

Restriction, on the other hand, cannot always be performed as soon as the finer level is evolved, because the coarser level might not yet have been evolved to that point in time. This is in contrast to SAMR, where coarse levels are evolved first and are always “ready” to receive restricted data. Instead, the rule here is that restriction has to be performed when both levels are ready. An example in Fig. 8 would be restriction from the middle level to the coarsest level in the 3-level case.

3.1. Limitations

While better scalability by about a factor of two (for large numbers of levels) is certainly a big advantage, potentially cutting the time for a given simulation in half, CSAMR also poses a couple of limitations both to implementation and usage.

First, our method is not as well-suited to problems that have many more grid points at the finest level than at the coarser ones. In this situation, the performance benefit compared to SAMR is small. Examples of factors to consider here are the elimination of errors from time-interpolation necessary in (C)SAMR, but on the other hand coarse grids have to perform many more steps in NSAMR, accumulating error.

Second, it may not always be possible to construct data from all refinement levels at one specific simulation time using temporal interpolation (depending on the number of time levels you keep in memory). Within SAMR this is always possible because coarse levels are evolved first, and data from those levels can always be used for interpolation should this be necessary. In contrast, within CSAMR the necessary information for interpolation may only be available at specific times. When, and how often, this is possible depends on the total number of refinement levels, and the number of time levels that are kept for each refinement level. The problem is shown by an example in Fig. 9, for 4 refinement levels and 2 time levels.

In general, time interpolation on the complete domain is only possible once the coarsest level steps are ahead of the finer levels in simulation time. This happens about half-way through a complete time step. At that moment, past time levels on the finer levels have to be available for time interpolation, with the finest level requiring the highest number. The total number of time levels required ($R_{\mathrm{levels}}$) on the finest grid depends on the maximum number of refinement levels ($N$), and is

$$R_{\mathrm{levels}} = 2^{N-2} + 1. \tag{2}$$

Each coarser grid would only need half as many time levels for this purpose, so the total number of time levels (the sum over all refinement levels) would be about twice the number given by Eq. (2). While for a moderate number of refinement levels $N$ this is manageable, and sometimes already exists due to other uses of past time levels (e.g. 3 time levels on the finest grid of a 3-level simulation), the cost becomes prohibitive for higher numbers of refinement levels, e.g. 257 time levels of the finest grid must be available in a 10-level simulation.
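The growth of this storage requirement is easy to tabulate. The sketch below simply evaluates Eq. (2), together with the rough "about twice" estimate for the sum over all refinement levels quoted in the text; the function name is introduced here for illustration.

```python
def time_levels_finest(n_levels):
    """Time levels needed on the finest grid according to Eq. (2)."""
    if n_levels < 2:
        raise ValueError("Eq. (2) is stated for at least two levels")
    return 2 ** (n_levels - 2) + 1


for n in (3, 5, 7, 10):
    finest = time_levels_finest(n)
    # Rough total over all refinement levels: about twice Eq. (2).
    print(n, finest, 2 * finest)
```

The values for 3 and 10 levels reproduce the 3 and 257 time levels mentioned above.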

Only keeping some of these time levels (e.g. always keeping the one that aligns in time with the coarsest level in addition to the ones needed for evolution) could be used to decrease the total number of time levels, but would lead to reduced accuracy in time interpolation. Similarly, extrapolation could be used whenever data for interpolation is not available. Whether or not such approaches are viable depends entirely on the purpose for requiring global data. We will now consider several.

Courant factor: Calculations of the Courant factor are typically whole grid operations. However, because one expects the fine grid to set the limit, we believe that this is not likely to be much of an issue.

Elliptic solves: Since an elliptic solve is inherently a whole grid operation, NSAMR is the only practical choice. However, elliptic problems have been solved in a SAMR framework in some cases. See Pretorius [28] for an example of how this may be done and how complex it may be.

Analysis: Each science problem has its own special needs. For black hole simulations that need to find apparent horizons, the lack of global data would limit the simulation to less frequent computations of the horizon. However, horizons typically fit entirely within only a few of the finest grids and so for this problem a full grid may not be needed.

I/O: We also note that the changed order of steps complicates I/O. Output requiring interpolation is affected by the same limitations on interpolation that have already been mentioned. Often, simulation data is not needed on every time step, but it can be saved until data on all levels is present simultaneously. Alternatively, data can be reconstructed during post-processing.

Affected output types include common operations like extrema, norms, or sums of variables. On the other hand, quantities like norms or sums may be too inaccurate when time-interpolation is necessary to compute them. Because of this, even with SAMR, evaluating these is often done only at times that do not require interpolation in time.

The third limitation we would like to consider is that refinement factors higher than 2 are either less efficient with CSAMR, or more difficult to implement, and in either case the increase in scaling is smaller.

Refinement factors higher than 2 are certainly not exotic. Especially factors of 3 or 4 are often found in existing frameworks; e.g. WRF uses a factor of 3 for its horizontal mesh refinement. However, most AMR frameworks can and do use a factor of 2 for most of their applications.

In the following, we analyze the amount of idle time for the case of a refinement factor of three. We have already seen in Eq. (1) that for a factor of 2 the amount of idle time tends to zero for a large number of levels. For the three steps a fine level has to perform here, the next coarser level only does one, and it must complete that one simultaneously with the first of the three fine level steps. This leaves two idle steps for the group of processors handling the coarser levels.

Level $\ell$ in general does $3^{\ell}$ steps for each full time step. This means that the finest level will do $3^{N-1}$ steps and the total number of steps all other levels need to perform is $(3^{N-1} - 1)/2$, resulting in an idle-step number of $3^{N-1} - (3^{N-1} - 1)/2$ for one of the process groups. The overall idle time assuming an identical size of both groups is then (Fig. 10)

$$T_{\mathrm{idle}} = \frac{3^{N-1} - (3^{N-1} - 1)/2}{2 \cdot 3^{N-1}} = \frac{1}{4} + \frac{1}{4 \cdot 3^{N-1}}. \tag{3}$$

Thus, without further optimization CSAMR will leave one of the two process groups idle about half of the time. This amounts to a speedup of only 1.5 compared to the speedup of 2 for a refinement factor of 2, and compared to a SAMR simulation using the same size of grids but half the number of processors.
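The step counting behind Eqs. (1) and (3) can be written once for a general refinement factor. The following sketch assumes two process groups of equal size, equal cost per level step and no splitting of coarse steps; the function name `idle_fraction_general` is hypothetical.

```python
from fractions import Fraction


def idle_fraction_general(n_levels, factor):
    """Idle share of all processor slots for a given refinement factor."""
    finest_steps = factor ** (n_levels - 1)
    # Steps of all coarser levels: 1 + factor + ... + factor**(N - 2).
    coarse_steps = (finest_steps - 1) // (factor - 1)
    idle_steps = finest_steps - coarse_steps
    # Two groups of equal size share the wall time of the finest level.
    return Fraction(idle_steps, 2 * finest_steps)


for n in (2, 3, 7):
    print(n, idle_fraction_general(n, 2), idle_fraction_general(n, 3))
```

For factor two this reproduces $1/2^N$, while for factor three the values approach 1/4 from above, i.e. one of the two groups is idle roughly half of the time.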

Fig. 10. The order in which different levels are evolved is shown in both diagrams, with the same meaning of numbers, colors and arrows as in Fig. 8. The difference here, however, is a refinement factor of 3, resulting in more idle time, and that the right panel only shows one full time step. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Despite this moderate speedup of 1.5, fewer SUs are wasted in this manner than with NSAMR, and so the trade-off may be worth it. We also note that the remaining idle time could be used for other tasks, e.g. expensive analysis that otherwise would have to temporarily pause evolution, or would be performed post runtime.

One possible way to overcome this idle time problem for refinement factor three would be to divide the coarser step into two equal and sequential steps (and therefore require only half as many processors in the second group). The first sequential step would need to cover the boundary of the next finer grid, so that prolongation is possible; the second step would cover the rest of the grid. If we assume these two steps are equal, then the idle time is $T^{N}_{\mathrm{idle}} = (1/2)(1/3)^{N}$, where $T^{N}_{\mathrm{idle}}$ is the number of idle steps with $N$ levels and $N > 1$, a quantity which rapidly goes to zero as the number of levels increases.

In fact, we can say generally that for refinement factor $R$ our method shows a speedup of $R/(R-1)$ using $(R-1)/R$ times as many processors, provided the coarser step can be divided into $R-1$ equal parts, the first of which covers the boundary for the refined region it contains. Thus, for refinement factor two, the first step is done as a single part and covers the whole grid; for factor three it is done in two parts that cover half the grid each; for factor four 3 parts are required that each cover 1/3 of the grid, and so on. As $R$ increases, the refined region needs to fit into smaller and smaller sub-regions to avoid wasted time steps, diminishing the speedup. Thus, $R = 2$ remains an optimal case. We note that the above discussion is theoretical, as only refinement factor two was implemented for the code used in this paper.

The final possible drawback we would like to mention is the memory imbalance created by assigning the finest level to one process group and all other levels across a second process group. This imbalance is, assuming identical level sizes, only dependent on the number of levels, and amounts to a factor of $1/(N-1)$. This could be a problem in cases where applications are limited by the amount of node memory, e.g. if large amounts of data that cannot easily be distributed, like equation of state tables, need to be stored in memory on each node. However, for extreme scaling this is often not a problem. Here, scaling is usually limited by a certain grain size below which the amount of work to be done on each node becomes small compared to the overhead of communication between the nodes.

3.2. Load balancing considerations

So far, we only considered the case of identical, or at least similar, numbers of grid points on each level. In this case the time it takes to evolve each of the levels is about the same, making load-balancing easy. However, practical grid setups might deviate from this ideal. In the following we describe possible problems which may arise with unequal numbers of points and how they may be alleviated.

3.2.1. Finer grids have more points

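Consider the situation in which each level has twice as many grid points as the next coarser level, e.g. the number of points on level $\ell$ is given by $S_{\ell} = 2^{\ell} S_0$. The total number of grid points is

$$S_T = \sum_{\ell=0}^{N-1} S_{\ell} = (2^N - 1) S_0 \approx 2^N S_0. \tag{4}$$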

Let us assume that the time to perform a single step on a level is modeled by the formula:

$$T \propto \max(G, S/P) \tag{5}$$

where $P$ is the total number of processors and $T$ is the time. This formula dictates that there is an optimal “grain size” ($G$), the smallest number of points which a processor can efficiently handle. This flattening of the performance curve is expected, because of overheads introduced by ghost zones, etc. Note that we do not necessarily expect the scaling curve to be this bad, and also that the influence of bad scaling on coarser levels will be small for subcycling methods, and so we will compare with perfect scaling. The true result will likely lie somewhere in between. Note, however, that Eq. (5) predicts behavior quite similar to what we see in our strong scaling plots, e.g. see Fig. 11.

Fig. 11. Shown are strong scaling results for SAMR and CSAMR on the first machine, for 2, 4, 6, and 8 evolved levels (top to bottom plot). It can be clearly seen that while for only two levels SAMR performs better for fewer processes, CSAMR outperforms SAMR in every other case, especially for large numbers of processes and/or large numbers of levels.

Note also that the actual value of $G$ is not important here; it is a quantity that will depend on details of the computer architecture and the code being used. In all the calculations below, we will assume that we run at maximum speed and give exactly $G$ points to each processor, unless otherwise stated.
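The scaling model of Eq. (5) can be illustrated with a few lines of Python. The numbers used below (64,000 points and a grain size of 1,000 points) are made up for the illustration and are not taken from the measurements in this paper; times are in arbitrary units proportional to the number of points handled per processor.

```python
def step_time(points, procs, grain):
    """Time for one level step under Eq. (5), up to a constant factor."""
    return max(grain, points / procs)


# Hypothetical level with 64,000 points and a grain size of 1,000 points.
POINTS, GRAIN = 64_000, 1_000
for procs in (16, 32, 64, 128, 256):
    print(procs, step_time(POINTS, procs, GRAIN))
```

In this example the step time halves with each doubling of the processor count up to 64 processors, where each processor holds exactly one grain, and stays flat beyond that.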

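For NSAMR, the total number of processes is, therefore,

$$P_{\mathrm{total}}(NS) = S_T/G = (2^N - 1) S_0/G \approx 2^N S_0/G. \tag{6}$$

The approximation is accurate to within 1% for $N \ge 7$. Defining $T$ to be the time to perform one step we have

$$T_{\mathrm{runtime}}(NS) = 2^{N-1} T, \tag{7}$$

and thus, a total cost in SUs of

$$C_{SU}(NS) \approx (2^{2N-1} S_0/G) \cdot T. \tag{8}$$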

For SAMR, levels are evolved sequentially, each using the same number of processors. Even for problems in which levels are of equal size, finer levels have to perform more work than coarser levels for the same amount of physical time, because they must update more frequently, and so they tend to be the most important. This trend of more work on the finer levels is even more true in this example, where the finer grids are larger. It is, therefore, clear that the needs of the finest level should determine the number of processors used:

$$P_T(S) = S_{N-1}/G = 2^{N-1} S_0/G. \tag{9}$$

Note that here we assume that on all levels coarser than the finest there is not enough work available, i.e. grid points on which to operate, and some processes are idle. The run time is then

$$T_{\mathrm{runtime}}(S) = (2^{N-1} + 2^{N-2} + 2^{N-3} + \cdots) \cdot T = 2 \cdot 2^{N-1} T, \tag{10}$$

and with this the total SU cost, the product of Eqs. (9) and (10), is

$$C_{SU}(S) = 2 \cdot (2^{2N-2} S_0/G) \cdot T = (2^{2N-1} S_0/G) \cdot T. \tag{11}$$

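For comparison, if we have perfect scaling, then SAMR will have no idle time and

$$T_{\mathrm{runtime}}(IS) = \left( 2^{N-1} + \tfrac{1}{2} 2^{N-2} + \tfrac{1}{4} 2^{N-3} + \cdots \right) \cdot T = \tfrac{4}{3} \cdot 2^{N-1} T, \tag{12}$$

and with this the total SU cost, using the processor count of Eq. (9), is

$$C_{SU}(IS) = \tfrac{4}{3} \cdot (2^{2N-2} S_0/G) \cdot T = \tfrac{2}{3} \cdot (2^{2N-1} S_0/G) \cdot T. \tag{13}$$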

For CSAMR, we have a better situation, in that the finest level does not dictate our total process count. The next finest level, however, has a significant impact. We have two groups of processors for CSAMR, one for the finest level,

$$P_{N-1} = 2^{N-1} S_0/G,$$

and one for the next coarsest level,

$$P_{N-2} = 2^{N-2} S_0/G,$$

leading to a total of

$$P_T(PS) = \tfrac{3}{2} \cdot 2^{N-1} S_0/G. \tag{14}$$

CSAMR needs the same time per coarse level time step as NSAMR, so the total runtime is identical to Eq. (7):

$$T_{\mathrm{runtime}}(PS) = 2^{N-1} T, \tag{15}$$

leading to a total SU cost of

$$C_{SU}(PS) = \tfrac{3}{4} \cdot (2^{2N-1} S_0/G) \cdot T. \tag{16}$$

The following table summarizes the numbers of processes, run time and computational cost for each of the methods.

Method          Procs                  Runtime                 Comp. cost
Normalization   $P_T/(2^N S_0/G)$      $T_{rt}/(2^{N-1} T)$    $C_{SU}/(2^{2N-1} S_0 T/G)$
NSAMR           1                      1                       1
SAMR            1/2                    2                       1
SAMR, ideal     1/2                    4/3                     2/3
CSAMR           3/4                    1                       3/4
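The normalized entries of this comparison can be recomputed from the processor counts and run times of each method. The sketch below uses exact sums for a finite number of levels, so the NSAMR and SAMR values only approach the large-N entries; processors are counted in units of $S_0/G$ and times in units of $T$. The function name and dictionary layout are assumptions made for the illustration.

```python
from fractions import Fraction


def normalized_costs(n_levels):
    """Return {method: (procs, runtime, cost)}, normalized as in the table.

    Requires n_levels >= 2. Level l has 2**l * S_0 points and takes
    2**l steps per coarse time step (refinement factor two).
    """
    n = n_levels
    methods = {
        # (processors in S_0/G, run time in T)
        "NSAMR": (2 ** n - 1, 2 ** (n - 1)),
        "SAMR": (2 ** (n - 1), 2 ** n - 1),
        # Perfect scaling: a step on level l costs 2**l / 2**(n-1) of T.
        "SAMR, ideal": (2 ** (n - 1), Fraction(4 ** n - 1, 3 * 2 ** (n - 1))),
        "CSAMR": (2 ** (n - 1) + 2 ** (n - 2), 2 ** (n - 1)),
    }
    table = {}
    for name, (procs, runtime) in methods.items():
        table[name] = (
            Fraction(procs, 2 ** n),
            Fraction(runtime) / 2 ** (n - 1),
            Fraction(procs) * runtime / 2 ** (2 * n - 1),
        )
    return table


for name, row in normalized_costs(10).items():
    print(name, [round(float(value), 3) for value in row])
```

The cost column is simply the product of processors and run time, which is why the grain-limited SAMR run ends up as expensive as NSAMR despite using half the processors.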

The SAMR calculation without perfect scaling has clearly no advantage over any of the other methods, but as mentioned, this is the worst possible case for SAMR. Not only does it take the longest by far to compute, it fails to provide any speedup, because of the scaling equation Eq. (5). This result supports the conclusion of the Paramesh group cited above, that subcycling does not help for grids with larger fine grids.

Comparing perfect scaling with the other methods, however, the runtime of all three methods is almost identical. NSAMR and CSAMR have, using the criteria above for the numbers of processes, by design an identical runtime, while SAMR with ideal scaling is somewhat slower because it has to evolve the coarser levels sequentially to the finest level.

Differences, however, are clear in the number of used processes, and thus, in the total computational cost. SAMR uses only half the processes and thus half the resources compared to NSAMR. CSAMR lies between the two with 3/4 of the processes and cost of NSAMR.

This shows that for this particular case, and with not too bad scaling around the grain size $G$, SAMR would be preferred when optimizing computational cost, but it would be at least 14% slower than the other two methods. When interested in the fastest method, CSAMR would be preferred because while NSAMR has the same runtime, CSAMR would require fewer resources.

3.2.2. Finer grids have fewer points

While black hole evolutions typically have approximately the same number of grid points per level, this is not always the case. Sometimes one of the coarser levels will have significantly more points in order to provide better data for radiation extraction.

Consider a problem similar to the above, but assume a coarse level that has three times as many points as each of the other, identical levels. In this case, and unless the coarse level completely dominates the evolution, the parallelization strategy of all three methods will stay the same, because while the coarse level may be larger, the fine grids still do many more steps. For the same reason, the influence of this larger, coarse level on the runtime of SAMR and CSAMR will be negligible, assuming a large number of levels. On the other hand, within NSAMR methods, this large level takes the same number of steps as any other level, significantly adding to the overall runtime.

On the other hand, the large level might not be the coarsest one, and the limit of a large number of levels might not apply. In order to shed light on this case, we assume only three levels in total in the following, again with the coarsest level being larger than any of the other, identical levels.

Let the size of all but the coarsest level be $S \equiv (1/3) S_0 = S_1 = S_2$. In this case, NSAMR will require $P_T = 5S/G$ processors and take time equal to $T_s = 4T$ to evolve. Note that we only have to do one analysis for SAMR here, regardless of the assumed scaling, because here the number of processes is determined by the size of the fine level, and no other level is smaller.

SAMR will therefore require $P_T = S/G$, and the evolution time will be $T_s = T_0 + 2T_1 + 4T_2 = 3T + 2T + 4T = 9T$.

CSAMR, on the other hand, will assign $P_T = 3S/G$ processors for the coarser processor group to evolve the large level efficiently, and $P_T = S/G$ for the fine evolution group, and thus the cost in SUs would be 4/5 of that of NSAMR.

We find again that CSAMR is able to obtain the speed of NSAMR, with an SU performance midway between NSAMR and SAMR.

Method          Procs            Runtime              Comp. cost
Normalization   $P_T/(S/G)$      $T_{\mathrm{runtime}}$   $C_{SU}/(T S/G)$
NSAMR           5                4T                   20
SAMR            1                9T                   9
CSAMR           4                4T                   16
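The three rows of this example follow directly from the scaling model of Eq. (5). The sketch below recomputes them for level sizes 3S, S and S, with processors in units of S/G and times in units of T. It assumes two CSAMR process groups and no further optimization; the function name is introduced for this illustration.

```python
def compare_methods(sizes):
    """Return {method: (procs, runtime, cost)} for refinement factor two.

    sizes[l] is the number of points on level l in units of S,
    coarsest level first. Processors are in units of S/G, time in T.
    """
    steps = [2 ** level for level in range(len(sizes))]

    def step_time(size, procs):
        # Eq. (5) in these units: one grain per processor takes time 1.
        return max(1, size / procs)

    # NSAMR: all levels at once, one grain per processor everywhere.
    nsamr = (sum(sizes), steps[-1])
    # SAMR: processor count set by the finest level, levels in sequence.
    p_samr = sizes[-1]
    samr = (p_samr, sum(n * step_time(s, p_samr) for n, s in zip(steps, sizes)))
    # CSAMR: one group for the finest level, one for all coarser levels.
    p_coarse = max(sizes[:-1])
    coarse_time = sum(
        n * step_time(s, p_coarse) for n, s in zip(steps[:-1], sizes[:-1])
    )
    csamr = (p_coarse + sizes[-1], max(steps[-1], coarse_time))

    return {
        name: (procs, time, procs * time)
        for name, (procs, time) in (
            ("NSAMR", nsamr), ("SAMR", samr), ("CSAMR", csamr)
        )
    }


print(compare_methods([3, 1, 1]))
```

With these inputs the coarse CSAMR group needs only three of the four wall-time slots, which is the idle capacity any further optimization of the schedule would have to exploit.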

However, CSAMR can be optimized further. Adding a 3rd processor group (of equal size to the other two) can assist with the evolution of the coarser grid. Let us call this group B (for “Big grid”). The schedule for a CSAMR evolution would then look like this:

Proc. B         Proc. 1         Proc. 2
onestep(0)/3    onestep(1)      onestep(2)
onestep(0)/3    onestep(0)/3    onestep(2)
idle            onestep(1)      onestep(2)
idle            idle            onestep(2)

Theentryonestep(0)/3appearsthreetimes,andeachtime itisinvokeditperforms1/3oftheupdaterequiredforthe coars-estlevel.Thisexecutionobtainsasavingsof3/5inSUsrelativeto NSAMR,despitethefactthattheadditionalprocessorsareidle25% ofthetime.Withthisadjustment,theperformancetablenowlooks likethis:

Method         Procs          Runtime      Comp. cost
Normalization  P_T/(N_0/G)    T_runtime    C_SU/(T N_0/G)
NSAMR          5              4T           20
SAMR           1              9T           9
CSAMR          3              4T           12

4. Experimental results

4.1. Experimental environments

To evaluate the real speedup and effectiveness of our method, we implemented CSAMR within AMSS-NCKU [29,30], a code that evolves the general relativistic vacuum equations, e.g. in order to simulate black holes. It is written in C++ and Fortran 90. All the array calculations (e.g. the physics) are written in Fortran 90, while the infrastructure uses C++. Prior to this work, AMSS-NCKU could only use SAMR.

The CSAMR implementation in AMSS-NCKU was tested on two different platforms. The first is SuperMike II at LSU, a cluster of 440 compute nodes with a peak performance of 146 TFlops, each node containing two 8-core Sandy Bridge Xeon 64-bit processors operating at a core frequency of 2.6 GHz. The second is Blue Waters at NCSA. The Blue Waters system is a Cray XE/XK hybrid machine composed of AMD 6276 Interlagos processors and NVIDIA GK110 Kepler accelerators, all connected by the Cray Gemini torus interconnect. Note that these accelerators were not used for the results in this paper.

4.2. Results

To test the performance of CSAMR, we ran a series of binary black hole simulations. Only 10 steps were evolved instead of the more typical 10000. However, this should be adequate, as 10 steps are enough to show the performance of the method. In all tests, as is typical for our evolutions, the number of grid points per level is almost identical.

In our tests, we initialized the mesh size of each level to 40×40×20, and performed simulations with different numbers of mesh levels.

Fig. 11 shows the results on the first machine. The scalability of SAMR stops when the number of processes is about 32, but CSAMR continues to scale up to 64 processes. We can also see that when the number of levels is two, the performance of CSAMR is comparable to, but slightly lower than, that of SAMR for small processor counts. However, for larger numbers of levels, the grain size each processor has to evolve is larger for CSAMR, and thus there is less overhead per processor. This leads to better performance for CSAMR compared to SAMR even with the same number of processes. Fig. 11 also clearly shows the advantage of CSAMR when the number of levels is large.

Fig. 12 shows the results on the second machine with the same input setup and the same software implementation. Apart from the exact evolution times, we can see that the trend is the same as on the first machine.


Fig. 12. Shown are strong scaling results for SAMR and CSAMR on the second machine, for 4, 6, and 8 evolved levels (top to bottom plot). Similar to Fig. 11, it can be clearly seen that CSAMR outperforms SAMR in all cases, especially for large numbers of processes and/or large numbers of levels.

The following tables are based on runs of AMSS-NCKU. In each case, we perform a binary black hole simulation using the SAMR method, then perform the same run using the CSAMR method but with twice as many processors. In both runs we partition the grid in exactly the same way, giving the same number of zones to each process. Ideally, the CSAMR run should be twice as fast (100% faster). In practice, however, the measured runs are between 70% and 90% faster.

We theorized that the time spent in prolong/restrict operations was serializing the code, so we instrumented the code to measure the fraction of time spent in this phase, then inserted this fraction, $s$, into Amdahl's law for a doubling of the processor count:

$$\text{Predicted Speedup} = \frac{1}{s + (1 - s)/2} \qquad (17)$$
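Eq. (17) is easy to evaluate and to invert. The following is a minimal sketch with hypothetical function names; `factor` is an added parameter for the increase in processor count and defaults to the doubling used in the paper.

```python
def predicted_speedup(s, factor=2.0):
    """Amdahl's law: speedup when the parallel part runs `factor` times faster.

    `s` is the serial fraction of the runtime (here: prolong/restrict).
    """
    if not 0.0 <= s <= 1.0:
        raise ValueError("serial fraction must lie in [0, 1]")
    return 1.0 / (s + (1.0 - s) / factor)


def serial_fraction(speedup, factor=2.0):
    """Invert Amdahl's law: serial fraction implied by an observed speedup."""
    return (1.0 / speedup - 1.0 / factor) / (1.0 - 1.0 / factor)


for observed in (1.7, 1.9):
    s = serial_fraction(observed)
    print(f"speedup {observed}: serial fraction {s:.3f}, "
          f"check {predicted_speedup(s):.2f}")
```

For a doubling of the processor count the inverse reduces to s = 2/S - 1, so speedups of 1.7 and 1.9 correspond to serial fractions of roughly 0.18 and 0.05.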

For the same problem size, AMSS-NCKU reported considerable variance in the fraction of time spent in the prolong/restrict phase by each node. We therefore report three predictions: a minimum speedup, which uses the average time spent in prolong/restrict across all nodes ($t_{avg}$); a maximum speedup, based on the minimum time spent in prolong/restrict across all nodes ($t_{min}$); and an average speedup, based on the empirically chosen weighted average $t_w = 0.80 \times t_{avg} + 0.20 \times t_{min}$. The results of these experiments (with the title “no barrier” below) show rough agreement between the predicted and measured performance, but the variance is wide.
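The three predictions can be sketched as follows. This is an illustration only: the per-node fractions are made-up numbers, not measurements from AMSS-NCKU, and the function names are hypothetical. It assumes that the quantities entering Amdahl's law are the per-node fractions of runtime spent in prolong/restrict.

```python
from statistics import mean


def amdahl(s, factor=2.0):
    """Predicted speedup for serial fraction `s` when processors are doubled."""
    return 1.0 / (s + (1.0 - s) / factor)


def speedup_estimates(fractions, weight_avg=0.80):
    """Return the min, average and max predicted speedups.

    `fractions` holds one prolong/restrict time fraction per node.
    """
    t_avg = mean(fractions)
    t_min = min(fractions)
    t_w = weight_avg * t_avg + (1.0 - weight_avg) * t_min
    return {
        "min": amdahl(t_avg),      # largest serial fraction, smallest speedup
        "average": amdahl(t_w),    # empirically weighted serial fraction
        "max": amdahl(t_min),      # smallest serial fraction, largest speedup
    }


# Hypothetical per-node fractions, chosen only to show the spread.
example_fractions = [0.22, 0.15, 0.08, 0.19]
estimates = speedup_estimates(example_fractions)
for name, value in estimates.items():
    print(f"{name:8s} {value:.2f}")
```

When every node reports the same fraction, the three estimates coincide, which is the situation the barrier version of the code approaches.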

In an effort to improve the match between our experimental results and predictions, we created a new version of the code with reduced asynchrony. We introduced an MPI barrier both before and after the prolong/restrict sections. Surprisingly, the introduction of barriers did not impact the overall wall time of the code, but reduced the variance in time spent in prolong/restrict to virtually nothing and vastly improved the agreement with our predictions.

40×40×20, no barrier

Procs (CSAMR)  Procs (SAMR)  Measured speedup  Min speedup  Average speedup  Max speedup
32             16            1.65              1.65         1.69             1.86
64             32            1.74              1.69         1.73             1.92
128            64            1.68              1.60         1.66             1.95

40×40×20, barrier

Procs (CSAMR)  Procs (SAMR)  Measured speedup  Min speedup  Average speedup  Max speedup
16             8             1.83              1.84         1.84             1.84
32             16            1.72              1.73         1.73             1.73
64             32            1.63              1.68         1.68             1.68
128            64            1.63              1.66         1.66             1.66

32×32×16, no barrier

Procs (CSAMR)  Procs (SAMR)  Measured speedup  Min speedup  Average speedup  Max speedup
16             8             1.84              1.80         1.82             1.89
32             16            1.82              1.78         1.81             1.94

32×32×16, barrier

Procs (CSAMR)  Procs (SAMR)  Measured speedup  Min speedup  Average speedup  Max speedup
16             8             1.88              1.83         1.83             1.84
32             16            1.82              1.81         1.81             1.81

64×64×32, no barrier

Procs (CSAMR)  Procs (SAMR)  Measured speedup  Min speedup  Average speedup  Max speedup
16             8             1.82              1.81         1.83             1.90
32             16            1.85              1.75         1.78             1.90
64             32            1.71              1.66         1.70             1.90
128            64            1.81              1.74         1.77             1.92
256            128           1.85              1.76         1.80             1.95
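One way to quantify the effect of the barriers is to compare the width of the predicted range with and without them. The sketch below uses the rows of the two 40×40×20 tables; the helper names are hypothetical and the choice of metrics is an assumption, not the authors' analysis.

```python
# Rows are (measured, min, average, max) speedups from the 40x40x20 tables.
no_barrier = [
    (1.65, 1.65, 1.69, 1.86),
    (1.74, 1.69, 1.73, 1.92),
    (1.68, 1.60, 1.66, 1.95),
]
barrier = [
    (1.83, 1.84, 1.84, 1.84),
    (1.72, 1.73, 1.73, 1.73),
    (1.63, 1.68, 1.68, 1.68),
    (1.63, 1.66, 1.66, 1.66),
]


def prediction_spread(rows):
    """Largest gap between the max and min predicted speedup."""
    return max(high - low for _, low, _, high in rows)


def worst_deviation(rows):
    """Largest gap between the measured and the average predicted speedup."""
    return max(abs(measured - average) for measured, _, average, _ in rows)


for label, rows in (("no barrier", no_barrier), ("barrier", barrier)):
    print(f"{label:10s} spread={prediction_spread(rows):.2f} "
          f"worst deviation={worst_deviation(rows):.2f}")
```

By this measure the predicted range collapses to a single value once barriers are added, while the measured speedups stay within a few hundredths of the average prediction in both variants.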

In the future, we plan to work on reducing the time spent in prolongation/restriction, in order to achieve a better parallel speedup.

5. Conclusion

In this work we have introduced Concurrent Structured Adaptive Mesh Refinement (CSAMR), an offline scheduling algorithm that optimizes an adaptive mesh refinement calculation. We have shown that it offers the speed of non-subcycling, but uses significantly fewer resources. This decrease can enable runs at an increased problem size compared to usual subcycling, or alternatively can substantially decrease the computational cost of non-subcycling simulations, increasing the number of possible simulations for a given allocation size.

We have also noted that the CSAMR scheduling introduces a certain, usually small, amount of idle time into the calculation. This idle time might be re-purposed for analysis, in-situ visualization, etc. If this idle time is well used, the result could be a shorter wall time relative to a non-subcycling code, which has no idle time to spare for these (possibly necessary) tasks.

Acknowledgments

We acknowledge the support of the following grants: the DoE XPRESS proposal (DE-SC0008714), National Science Foundation 1265449, National Science Foundation 1212401, the National Natural Science Foundation of China (Nos. 61272087, 61073008, and 60773148), and the Beijing Natural Science Foundation (Nos. 4082016 and 4122039).

References

[1] M.J. Berger, P. Colella, Local adaptive mesh refinement for shock hydrodynamics, J. Comput. Phys. 82 (1) (1989) 64–84.

[2] M.J. Berger, J. Oliger, Adaptive mesh refinement for hyperbolic partial differential equations, J. Comput. Phys. 53 (3) (1984) 484–512.

[3] F. Löffler, J. Faber, E. Bentivegna, T. Bode, P. Diener, R. Haas, I. Hinder, B.C. Mundim, C.D. Ott, E. Schnetter, G. Allen, M. Campanelli, P. Laguna, The Einstein toolkit: a community computational infrastructure for relativistic astrophysics, Class. Quantum Grav. 29 (11) (2012) 115001, http://dx.doi.org/10.1088/0264-9381/29/11/115001, arXiv:1111.3344 [gr-qc].

[4] D. Pollney, C. Reisswig, L. Rezzolla, B. Szilágyi, M. Ansorg, B. Deris, P. Diener, E.N. Dorband, M. Koppitz, A. Nagar, E. Schnetter, Recoil velocities from equal-mass binary black-hole mergers: a systematic investigation of spin–orbit aligned configurations, Phys. Rev. D 76 (2007) 124002, arXiv:0707.2559 [gr-qc].

[5] E. Schnetter, S.H. Hawley, I. Hawke, Evolutions in 3-D numerical relativity using fixed mesh refinement, Class. Quantum Grav. 21 (2004) 1465–1488, http://dx.doi.org/10.1088/0264-9381/21/6/014, arXiv:gr-qc/0310042.

[6] A numerical simulation of the evolution of the low-level wind speed and rainfall in Hurricane Ivan (2004) at landfall, Poster presentation, 30th Conference on Hurricanes and Tropical Meteorology, Ponte Vedra, Florida, 2012.

[7] A. Dubey, A. Almgren, J. Bell, M. Berzins, S. Brandt, G. Bryan, P. Colella, D. Graves, M. Lijewski, F. Löffler, B. O’Shea, E. Schnetter, B.V. Straalen, K. Weide, A survey of high level frameworks in block-structured adaptive mesh refinement packages, J. Parallel Distrib. Comput. 74 (12) (2014) 3217–3227, http://dx.doi.org/10.1016/j.jpdc.2014.07.001.

[8] P. MacNeice, K. Olson, C. Mobarry, R. de Fainchtein, C. Packer, PARAMESH: A parallel adaptive mesh refinement community toolkit, Comput. Phys. Commun. 126 (3) (2000) 330–354.

[9] Parallel Adaptive Mesh Refinement PARAMESH. URL: http://www.physics.drexel.edu/olson/paramesh-doc/Usersmanual/amr.html.

[10] L. Dursi, M. Zingale, Efficiency gains from time refinement on AMR meshes and explicit timestepping, in: T. Plewa, T. Linde, V. Gregory Weirs (Eds.), Adaptive Mesh Refinement - Theory and Applications, Lecture Notes in Computational Science and Engineering, vol. 41, Springer, Berlin, Heidelberg, 2005, pp. 103–113, http://dx.doi.org/10.1007/3-540-27039-67.

[11] G.L. Bryan, M.L. Norman, B.W. O’Shea, T. Abel, J.H. Wise, M.J. Turk, D.R. Reynolds, D.C. Collins, P. Wang, S.W. Skillman, B. Smith, R.P. Harkness, J. Bordner, J.-h. Kim, M. Kuhlen, H. Xu, N. Goldbaum, C. Hummels, A.G. Kritsuk, E. Tasker, S. Skory, C.M. Simpson, O. Hahn, J.S. Oishi, G.C. So, F. Zhao, R. Cen, Y. Li, The Enzo Collaboration, Enzo: An adaptive mesh refinement code for astrophysics, Astrophys. J. Suppl. 211 (2) (2014) 52, arXiv:1307.2265.

[12] Enzo astrophysical AMR code, 2013, URL: http://enzo-project.org/.

[13] BoxLib, 2011, URL: https://ccse.lbl.gov/BoxLib.

[14] T. Goodale, G. Allen, G. Lanfermann, J. Massó, T. Radke, E. Seidel, J. Shalf, The Cactus framework and toolkit: design and applications, in: Vector and Parallel Processing - VECPAR’2002, 5th International Conference, Lecture Notes in Computer Science, Springer, Berlin, 2003, URL: http://edoc.mpg.de/3341.

[15] Cactus Computational Toolkit. URL: http://www.cactuscode.org/.

[16] P. Colella, D. Graves, N. Keen, T. Ligocki, D. Martin, P. McCorquodale, D. Modiano, P. Schwartz, T. Sternberg, B. Van Straalen, Chombo Software Package for AMR Applications Design Document, Tech. rep., Lawrence Berkeley National Laboratory, Applied Numerical Algorithms Group, Computational Research Division, 2009.

[17] S.G. Parker, A component-based architecture for parallel multi-physics PDE simulation, Future Gener. Comput. Syst. 22 (2006) 204–216.

[18] S.G. Parker, J. Guilkey, T. Harman, A component-based parallel infrastructure for the simulation of fluid–structure interaction, Eng. Comput. 22 (2006) 277–292.

[19] Z. Mo, A. Zhang, X. Cao, Q. Liu, X. Xu, H. An, W. Pei, S. Zhu, Jasmin: a parallel software infrastructure for scientific computing, Front. Comput. Sci. China 4 (4) (2010) 480–488, http://dx.doi.org/10.1007/s11704-010-0120-5.

[20] CASC, SAMRAI Structured Adaptive Mesh Refinement Application Infrastructure, Center for Applied Scientific Computing, Lawrence Livermore National Laboratory, December 2007, URL: https://computation.llnl.gov/casc/SAMRAI/.

[21] R. Hornung, S. Kohn, Managing application complexity in the SAMRAI object-oriented framework, Concurr. Comput. Pract. Exp. 14 (5) (2002) 347–368.

[22] A.S. Almgren, J.B. Bell, M.J. Lijewski, Z. Lukić, E.V. Andel, Nyx: A massively parallel AMR code for computational cosmology, Astrophys. J. 765 (1) (2013) 39, URL: http://stacks.iop.org/0004-637X/765/i=1/a=39.

[23] L.F. Diachin, R. Hornung, P. Plassmann, A. Wissink, Parallel adaptive mesh refinement, Parallel Process. Sci. Comput. 20 (2006) 143.

[24] C. Burstedde, O. Ghattas, M. Gurnis, T. Isaac, G. Stadler, T. Warburton, L. Wilcox, Extreme-scale AMR, in: Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, IEEE Computer Society, 2010, pp. 1–12.

[25] Q. Meng, J. Luitjens, M. Berzins, Dynamic Task Scheduling for Scalable Parallel AMR in the Uintah Framework, Tech. rep., Citeseer, 2010.

[26] D.S. Balsara, Divergence-free adaptive mesh refinement for magnetohydrodynamics, J. Comput. Phys. 174 (2) (2001) 614–648.

[27] J.J. Carroll-Nellenback, B. Shroyer, A. Frank, C. Ding, Efficient parallelization for AMR MHD multiphysics calculations; implementation in AstroBEAR, J. Comput. Phys. 236 (2013) 461–476.

[28] F. Pretorius, M.W. Choptuik, Adaptive mesh refinement for coupled elliptic-hyperbolic systems, J. Comput. Phys. 218 (1) (2006) 246–274.

[29] Z. Cao, H.-J. Yo, J.-P. Yu, Reinvestigation of moving punctured black holes with a new code, Phys. Rev. D 78 (12) (2008) 124011.

[30] P. Galaviz, B. Brügmann, Z. Cao, Numerical evolution of multiple black holes with accurate initial data, Phys. Rev. D 82 (2) (2010) 024005.

Frank Löffler received his BS degree in 1988 in physics from Chemnitz University of Technology, Germany. He received his PhD in 2005 from the Albert-Einstein-Institute, Max Planck Institute, Germany. From 2005 to 2007 he was a postdoctoral scholar at SISSA (International School for Advanced Studies), Italy, and from 2007 to 2011 he was a postdoctoral scholar at Louisiana State University. Since 2011 he has worked as an IT Consultant at Louisiana State University. His research areas include high performance computing and neutron star simulations.

Zhoujian Cao received the BS degree in 2001 from the Physics Department of Beijing Normal University. He received his PhD degree in theoretical physics in 2006 from Beijing Normal University. From 2006 to 2011, he worked as an assistant professor in the Academy of Mathematics and Systems Sciences, Chinese Academy of Sciences. Since 2011 he has worked as an associate professor in the Academy of Mathematics and Systems Sciences, Chinese Academy of Sciences. His research areas include numerical relativity, gravitational theory and scientific computing.

Steven R. Brandt received the BS degree in 1985 in physics from Northeastern University. He received his PhD for numerical studies of black hole spacetimes in 1996 from the University of Illinois at Urbana-Champaign. From 1997 to 1998 he was a postdoctoral scholar at the Albert Einstein Institute in Potsdam, and in 1999 he was a postdoctoral scholar at Penn State University. Since 2005 he has worked as an Adjunct Professor of Computer Science and IT Consultant at Louisiana State University. His research areas include high performance computing, garbage collection, and coastal simulation.

Zhihui Du received the BE degree in 1992 from the computer department of Tianjin University. He received the MS and PhD degrees in computer science in 1995 and 1998, respectively, from Peking University. From 1998 to 2000, he worked at Tsinghua University as a postdoctoral scholar. Since 2001 he has worked at Tsinghua University as an associate professor in the Department of Computer Science and Technology. His research areas include high performance computing, grid computing, cloud computing and energy efficient computing.
