Data-parallel agent-based microscopic road network simulation using graphics processing units

Simulation Modelling Practice and Theory

journal homepage: www.elsevier.com/locate/simpat


Peter Heywood a,∗, Steve Maddock a, Jordi Casas b, David Garcia b, Mark Brackstone b, Paul Richmond a

a Department of Computer Science, The University of Sheffield, United Kingdom
b Transport Simulation Systems Ltd (TSS), Spain

Article info

Article history:
Available online xxx

Keywords:
Agent-based simulation
GPU
Simulation framework
Transport microsimulation

Abstract

Road network microsimulation is computationally expensive, and existing state of the art commercial tools use task parallelism and coarse-grained data-parallelism for multi-core processors to achieve improved levels of performance. An alternative is to use Graphics Processing Units (GPUs) and fine-grained data parallelism. This paper describes a GPU accelerated agent based microsimulation model of a road network transport system. The performance for a procedurally generated grid network is evaluated against that of an equivalent multi-core CPU simulation. In order to utilise GPU architectures effectively the paper describes an approach for graph traversal of neighbouring information which is vital to providing high levels of computational performance. The graph traversal approach has been integrated within a GPU agent based simulation framework as a generalised message traversal technique for graph-based communication. Speed-ups of up to 43× are demonstrated with increased performance scaling behaviour. Simulation of over half a million vehicles and nearly two million detectors at a rate of 25× faster than real-time is obtained on a single GPU.

1. Introduction

Simulations of road networks are used during the development and management of transport networks around the globe. Microscopic road network simulations are fine-grained simulations which simulate individual vehicles within the system, capturing low level behaviours. Agent Based Modelling (ABM) is one microscopic approach, where relatively simple individual behaviours are defined, which combined with interactions between agents and the environment, allows the emergence of complex behaviours. However, microscopic simulations are much more computationally expensive than the more traditional higher-level macroscopic simulations, which use a higher level of abstraction consisting of network flows rather than individual vehicles. Nonetheless, the level of detail captured by microscopic simulations is much greater than that of macroscopic and mesoscopic simulations, including the emergent behaviours enabled by the use of ABM.

∗ Corresponding author.

E-mail addresses: p.heywood@sheffield.ac.uk (P. Heywood), p.richmond@sheffield.ac.uk (P. Richmond).

https://doi.org/10.1016/j.simpat.2017.11.002


As microscopic simulations are computationally expensive, current commercial and open-source tools such as Aimsun [1], Vissim [2], MATSim [3] and SUMO [4] can have considerable execution run-times, especially for large scale simulations. This limits the overall effectiveness and uptake of microscopic simulation within the transport sector [5]. To increase simulator performance, existing simulation tools use parallel processing applied to multi-core processors to reduce the run-time of simulations. Task-parallel and coarse-grained data-parallel approaches are typically applied within existing tools. Task-parallelism distributes independent processing tasks to separate processing threads. Coarse-grained data-parallelism applies the same algorithms to different units of data, where the individual units of data are relatively large. These approaches are well-suited to multi-core Central Processing Units (CPUs), but can result in poor performance scalability.

Many-core architectures such as Graphics Processing Units (GPUs) offer significantly greater levels of parallelism than multi-core architectures, and significantly greater levels of raw compute performance. To access the high levels of performance, algorithms and data structures must expose high levels of parallelism and enable good memory-access patterns. Typically fine-grained data-parallelism is used, where the same algorithms are applied to relatively small individual units of data.

To demonstrate the simulation performance improvements which could be achieved by using many-core GPUs for microscopic road network simulations, a single-lane transport model with stop-sign-based yellow-box junctions has been implemented using FLAME GPU (Flexible Large Scale Agent Modelling Environment for Graphics Processing Units). FLAME GPU is the only general-purpose GPU-accelerated ABM framework [6]. The performance of the simulation is benchmarked using a procedurally generated artificial grid-based road network, and the simulation performance is compared to a high-performance multi-core CPU microscopic road network simulation tool, Aimsun 8.1, which implements the same models.

This paper presents two contributions: (i) a GPU accelerated data-parallel agent-based road network microsimulation model is evaluated against an equivalent model in a commercial multi-core CPU software tool, demonstrating considerable improvements to simulation performance and performance scalability; and (ii) a general-purpose graph-based communication strategy is presented for high performance agent communication for fine-grained data-parallel agent based simulations, implemented for the FLAME GPU ABM framework, which enables high performance agent based simulations of transport networks on GPUs.

Section 2 provides a summary of related work. Section 3 details the implemented model, the benchmark network used and the FLAME GPU implementation. Section 4 describes the graph-based communication strategy to enable the high levels of performance in this model. Section 5 provides the results of a set of application benchmarks used to assess the performance impact of GPUs on microscopic road network simulation using ABM. Section 6 concludes the paper.

2. Related work

Microscopic simulation of transport networks involves the simulation of individual vehicles and pieces of road network infrastructure to predict the effects of changes in vehicle behaviour or changes in the road network infrastructure. This is used as a tool in the planning and management of transport networks to improve the effectiveness of infrastructure and minimise the impact of changes in conditions on the transportation network. Microscopic simulations are computationally expensive, which has limited the adoption of microsimulation compared to higher level mesoscopic and macroscopic simulations [5]. A micro-scale simulation must include behavioural models for vehicles to follow as well as models for any dynamic infrastructure such as traffic signals and vehicle detectors, which attempt to accurately capture the behaviour of the real world. ABM is a powerful approach to defining microscopic models, which provides a natural method of describing individual behavioural models and then simulating the interaction between individuals in the simulation [7]. Some of the most important vehicle behavioural models are: (i) Car Following Models [8,9]; (ii) Lane Changing Behaviour [10,11]; and (iii) Gap Acceptance Modelling [1,12].

These behavioural models make use of several sets of input data, including: the attributes and structure of the transport network; transport demand which is used to populate the simulation and allow alternate scenarios to be simulated for predictive use; and simulation parameters which are used to modify the models within the simulator. Parameters can include distributions from which vehicle properties can be sampled, such as vehicle length, rate of acceleration or even properties such as pollution emission rates. These parameters can be manipulated to replicate observed behaviour [13,14].


Fig. 1. The average run-time from 3 repetitions for a 60 minute Aimsun 8.1 simulation containing approximately 25,000 vehicles, for different processor thread counts. As processor thread count increases, total simulation time decreases, but with diminishing returns. Note that the Intel Xeon E5-2643 is a dual-socket system. Additionally, no performance improvements were observed when using more processing threads than physical cores (Hyper-threading).

In contrast to multi-core processor architectures, highly parallel many-core processing architectures such as GPUs enable the use of fine-grained data-parallelism to offer high levels of performance. Originally developed for 2D and 3D computer graphics applications, GPUs are now commonly used to accelerate general purpose computing through data-parallelism, as they offer significant computational power. For instance, a single state of the art NVIDIA P100 GPU has a peak theoretical performance of 5.3 TFLOPS (trillion floating point operations per second) in double precision [17], compared to Intel Broadwell (22 core) Xeon CPUs which are capable of less than 1.0 TFLOPS [18], with even greater levels of raw compute performance for single and half precision floating point operations. From a device perspective, GPUs use a form of Single Instruction, Multiple Data (SIMD) parallel architecture, where the same instructions are applied to many items of data concurrently. Additionally, a large portion of the GPU die space is used for the many processing cores, leaving a relatively small portion of die space available for on-chip cache, so to achieve high levels of performance the use of memory must be carefully considered. Along with data access patterns, the SIMD model used by GPUs often requires different algorithms and data structures from those of software designed for multi-core CPUs, in order to leverage the high levels of performance required. Additionally, GPUs are typically more energy-efficient when compared to CPUs, offering greater levels of performance for the same power usage, i.e. GPUs typically offer a higher number of FLOPS per watt. This is demonstrated by the high proportion of GPU enabled super-computers in the upper ranks of the Green500, the ranking of the world’s most power-efficient super-computers [19].

Although current commercial and open source microscopic road network simulators such as Vissim or Aimsun use multi-core CPU parallelism to increase simulator performance, GPUs have yet to be adopted within transport modelling software as a whole. GPUs have been used previously in microscopic transport network simulation, as both cellular automata [20] and also using multi-agent systems [21]. Additionally other tasks associated with microscopic simulation have been accelerated using GPUs, such as the generation of demand data using activity plan generation [22], traffic signal optimisation [23] and dynamic route assignment [24]. Recently, outside of microscopic simulation, GPUs have been utilised within both mesoscopic and macroscopic road network simulation tools and performance improvements have been demonstrated [25–27], further highlighting the demand for reduced software run-times within the transport modelling sector.


Fig. 2. FLAME GPU state-based representation for vehicle agents within the simulation, illustrating the logical control-flow of the agents within a single iteration. Interaction with detector agents has been omitted for clarity.

enabling high performance simulations without explicit knowledge of the parallel processing model [6,31,32]. FLAME GPU has successfully been used in several modelling domains such as pedestrian crowd simulation [33] and cellular simulations [31,34], however the use of FLAME GPU for road network simulation has been limited, due to the unsuitability of fixed radius communication for use within transport networks [35].

3. Multi-agent GPU microsimulation

To evaluate the potential performance impact of GPU accelerated ABM on microscopic road network simulation a single-lane road network model was implemented. The implemented models include: car following model [1,15,36] (based on Gipps’ car following model [8]); constant vehicle arrival [15,36]; virtual queues [15,36]; detectors [15,36]; a yellow-box model [15,36] and a gap acceptance give-way model [1,15,36] combined with stop-signs for junctions. These features and models were selected to produce comparable behaviour for the procedurally generated grid road network described in Section 5.3, without the complexities of modelling the broad range of road network infrastructure present in real world road networks.

FLAME GPU simulations make use of a state-based representation of agent dynamics to simplify development and enable key performance optimisations on behalf of the modeller. Fig. 2 shows the state machine for the vehicular agents used to implement this model; it represents the logical process followed by each agent at each iteration of the simulation, showing the relationship between agent states, functions and message-lists. Agents enter the state machine for the current iteration in one of the possible initial states. Agent functions allow agents to transition from one state to another, with the order of these functions controlled by execution layers. Communication between agents is in the form of message lists, which are used to highlight dependencies in the state-based representation. The use of indirect communication prevents race conditions in the parallel processing environment. Agent functions can output messages to the relevant message list data structure, which is then used as input by a different agent function. The message lists are accessed using specialised message access patterns to reduce the performance impact of message list parsing; a new graph-based communication pattern is used for agent to agent communication, as described in Section 4.

For instance, in the first execution layer, the agents in each of the three possible states for vehicles output a message to a message list. Those message lists are then iterated by the different agent states in layers two and three, dependent upon the agent state. Vehicle agents which are deemed to have left the simulation are identified as dead and subsequently excluded from further iterations of the simulation.
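The separation of message output and message input into distinct execution layers can be sketched in a few lines of Python. This is an illustrative sketch only and is not FLAME GPU code: the names `Vehicle`, `output_location`, `follow_lead` and `step` are hypothetical, and the serial loops stand in for agent functions that the framework would execute as data-parallel GPU kernels.

```python
from dataclasses import dataclass


@dataclass
class Vehicle:
    ident: int
    edge: int                    # index of the road (graph edge) occupied
    position: float              # distance along the edge in metres
    gap: float = float("inf")    # distance to the lead vehicle, set in layer 2


def output_location(agents):
    """Layer 1: every agent writes one message; no agent reads yet."""
    return [(a.ident, a.edge, a.position) for a in agents]


def follow_lead(agents, messages):
    """Layer 2: every agent reads the completed, read-only message list."""
    for a in agents:
        ahead = [pos - a.position
                 for (_, edge, pos) in messages
                 if edge == a.edge and pos > a.position]
        a.gap = min(ahead, default=float("inf"))


def step(agents):
    # Writes finish before reads begin, so no agent ever observes
    # another agent's half-updated state (no race conditions).
    messages = output_location(agents)
    follow_lead(agents, messages)


vehicles = [Vehicle(0, 0, 10.0), Vehicle(1, 0, 35.0), Vehicle(2, 1, 5.0)]
step(vehicles)
print([v.gap for v in vehicles])
```

Here the first vehicle finds its leader 25 m ahead on the same edge, while the other two have no vehicle ahead of them. Because agents never read each other's state directly, the order in which agents are processed within a layer does not affect the result.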


state implement the constant or exponential vehicle arrival algorithm from Aimsun [15], ensuring that the same conditions required for an agent to be inserted into the simulation such as order of arrival and sufficient headway are met.

The Road and Junction states are used to represent vehicles which use the two graphs used to implement the road network data structure, with vehicles in the Road state exhibiting primarily the car following behavioural model, and interaction with stop signs. Agents in the Junction state execute the gap acceptance give-way model and yellow-box model used for this study, and subsequently if movement is allowed the vehicle will use the car following model to traverse the junction.
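As a rough illustration of the kind of update a vehicle in the Road state performs, the sketch below implements the speed rule from Gipps' original car following model [8]. It is not the Aimsun or FLAME GPU implementation, which is only described as being based on Gipps' model. The function name `gipps_speed` and all default parameter values other than the 0.8 s reaction time are assumptions chosen for illustration.

```python
import math


def gipps_speed(v, v_lead, gap, tau=0.8, a=2.0, b=-3.0, b_lead=-3.0,
                v_max=13.9):
    """Speed of the follower after one reaction time tau (Gipps, 1981).

    v, v_lead : current speeds of follower and leader (m/s)
    gap       : leader position minus leader effective length minus
                follower position (m)
    a         : maximum desired acceleration (m/s^2), assumed value
    b, b_lead : most severe braking of follower / estimate for leader
                (m/s^2, negative), assumed values
    v_max     : desired speed (m/s), assumed value
    """
    # Free-flow term: accelerate towards the desired speed.
    free = v + 2.5 * a * tau * (1.0 - v / v_max) * math.sqrt(0.025 + v / v_max)
    # Safety term: highest speed from which the follower can still stop
    # behind the leader if the leader brakes as hard as b_lead.
    radicand = (b * tau) ** 2 - b * (2.0 * gap - v * tau - v_lead ** 2 / b_lead)
    safe = b * tau + math.sqrt(max(radicand, 0.0))
    return max(0.0, min(free, safe))


# Close behind a stationary leader: the safety term forces braking.
print(gipps_speed(v=10.0, v_lead=0.0, gap=5.0))
# Open road: the free-flow term lets the vehicle accelerate.
print(gipps_speed(v=10.0, v_lead=13.9, gap=1000.0))
```

The only input the rule needs from other agents is the position and speed of the lead vehicle, which is why finding the lead vehicle's message efficiently matters so much for performance.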

In addition to the state machine for cars, a linear state machine exists for detector agents. Detector agents represent real-world vehicle detection systems such as induction loops placed within road network infrastructure. These detection loops are used to measure traffic conditions, the output of which can be used for calibration, validation or interactive behaviour. The detector state machine is linear, consisting of four agent functions in separate layers. These functions each iterate one of the message lists produced by the vehicle agents, and capture any statistics relevant to the individual detector for the current interaction.

The transport network is represented using a pair of graphs, each stored using the Compressed Sparse Row (CSR) data structure. The first graph represents the roads of the transport network as edges, and the junctions as vertices. The second graph models the connections between roads within each junction. Separate graphs are used to simplify the specification of the road network.
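A minimal sketch of how a pair of CSR graphs could hold such a network is given below. The helper names `build_csr` and `neighbours`, and the tiny three-junction network, are hypothetical; the paper does not specify the exact array layout it uses.

```python
def build_csr(num_vertices, edges):
    """Build CSR arrays from a list of (source, destination) pairs.

    Returns (offsets, targets): the outgoing edges of vertex v are
    targets[offsets[v]:offsets[v + 1]].
    """
    counts = [0] * num_vertices
    for src, _ in edges:
        counts[src] += 1
    offsets = [0] * (num_vertices + 1)
    for v in range(num_vertices):
        offsets[v + 1] = offsets[v] + counts[v]
    targets = [0] * len(edges)
    cursor = offsets[:-1].copy()      # next free slot for each vertex
    for src, dst in edges:
        targets[cursor[src]] = dst
        cursor[src] += 1
    return offsets, targets


def neighbours(csr, v):
    offsets, targets = csr
    return targets[offsets[v]:offsets[v + 1]]


# Graph 1: junctions are vertices, roads are directed edges.
# Three junctions in a line with two-way roads: 0 <-> 1 <-> 2.
roads = [(0, 1), (1, 0), (1, 2), (2, 1)]
road_graph = build_csr(3, roads)

# Graph 2: roads are vertices, an edge means a turn from one road onto
# another is possible inside a junction (road indices refer to `roads`).
turns = [(0, 2), (3, 1)]
turn_graph = build_csr(len(roads), turns)

print(neighbours(road_graph, 1))   # junctions reachable from junction 1
print(neighbours(turn_graph, 0))   # roads reachable from road 0
```

CSR keeps all adjacency data in two flat arrays, which suits GPU memory access better than pointer-based structures because the neighbours of a vertex sit contiguously in memory.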

4. Network communication

Within FLAME GPU, communication between agents uses message-list traversal algorithms providing specialised access patterns to enable efficient, indirect, data-parallel agent communication. The algorithms and data-structures used can have considerable impact on the performance of a model, and the development of specialised high-performance access to messages is vital to achieve high levels of performance for large-scale ABMs. This section describes a new specialisation for graph-based messages within FLAME GPU, which enables high performance for network-based communication patterns. The performance of the communication strategy is evaluated compared to existing FLAME GPU communication patterns.

4.1. FLAME GPU message-based communication

The communication method used in a FLAME GPU model can be specialised to offer increased levels of performance for different communication patterns. Existing techniques for message-based communication available within FLAME GPU allow global communication (all-to-all) or partitioned communication based on spatial locality [6]. All-to-all message specialisation enables indirect global communication between agents, while spatially partitioned messaging restricts the communication to the immediate region surrounding the agent in 2D or 3D space. These communication patterns are non-optimal for several core vehicle models for road network microsimulation, where the communication between agents is restricted to the road network infrastructure. Typically models such as car following and lane changing models only require communication with other agents on the same road section or directly connected road sections within the transport network. Other models such as give way modelling for junctions may require communication within the junction and with other vehicles on the connected road sections. In these cases both all-to-all and spatially partitioned communication strategies can result in the iteration of large message lists to find the relevant message or messages. Spatially partitioned communication would offer an improvement over all-to-all communication, however in areas with a large number of agents, or in dense sections of road network such as urban city centres, the number of messages each agent must iterate can still be significant for even relatively small communication radii, resulting in a negative impact on performance [35].
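The difference in the number of messages an agent must iterate can be illustrated with a small Python sketch. It assumes a simple uniform-grid partitioning in which an agent reads its own bin and the eight surrounding bins; the names `build_bins` and `spatial_candidates`, the 100 m radius and the vehicle layout are all hypothetical and are not taken from FLAME GPU.

```python
from collections import defaultdict

RADIUS = 100.0  # hypothetical communication radius in metres


def build_bins(messages, radius):
    """Partition message positions into square bins of side `radius`."""
    bins = defaultdict(list)
    for x, y in messages:
        bins[(int(x // radius), int(y // radius))].append((x, y))
    return bins


def spatial_candidates(bins, reader, radius):
    """Messages in the reader's bin and its eight neighbouring bins."""
    cx, cy = int(reader[0] // radius), int(reader[1] // radius)
    found = []
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            found.extend(bins.get((cx + dx, cy + dy), []))
    return found


# Vehicles every 50 m on two perpendicular roads crossing at (500, 500).
messages = [(float(x), 500.0) for x in range(0, 1001, 50)]
messages += [(500.0, float(y)) for y in range(0, 1001, 50) if y != 500]

reader = (500.0, 500.0)
bins = build_bins(messages, RADIUS)
nearby = spatial_candidates(bins, reader, RADIUS)

# All-to-all reads every message; spatial partitioning reads fewer, but
# still includes vehicles on the crossing road that cannot be the leader.
print(len(messages), len(nearby))
```

In this layout all-to-all communication iterates 41 messages and spatial partitioning iterates 11, even though a car following model only needs the single vehicle directly ahead on the same road.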

4.2. Graph-based messaging

Ideally communication should only need to occur within the constraints of the network, which is typically represented using a graph or series of graphs. For the purposes of this benchmark model, which consists of single lane sections of roads and stop-sign-based, yellow-box junctions, communication is limited to the current road network edge, the next road network edge, or the current junction (graph vertex). Techniques for graph-based communication exist within parallel and distributed agent based frameworks such as D-Mason and Repast HPC, however these strategies use individual agents as the vertices of the graph and the relationships between agents are the edges of the graph [29,37].


Fig. 3. The effects of alternate FLAME GPU communication strategies on car following models are shown for a section of a grid-based road network, for a single agent represented in white. All-to-all communication results in 42 messages being parsed by the single agent. Spatially partitioned messaging results in the 18 messages from the agents in the blue shaded region being parsed, while the graph-based communication strategy results in only 5 messages from the orange bordered agents being processed. The spatially partitioned radius used for this illustration would be insufficient for accurate modelling, and is only used for illustrative purposes.

vertex within the network. A scan can then be performed on the histogram to provide the output position for the messages, allowing messages to be sorted in parallel. The histogram which is constructed during the sorting algorithm can also be used as a part of the data structure for message access. The network-based communication strategy is general enough to be applied to any GPU multi-agent simulation, where communication along a graph-based data structure is required, such as social network modelling.
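The histogram-and-scan sort can be sketched serially in Python as a counting sort keyed on edge index. This is an assumption-laden illustration rather than the FLAME GPU implementation: the function names are hypothetical, and the loops shown here would be replaced on a GPU by parallel histogram, prefix-sum and scatter kernels.

```python
from itertools import accumulate


def sort_messages_by_edge(messages, num_edges):
    """Counting sort of (edge_index, payload) messages.

    Returns the sorted messages and the exclusive prefix sum of the
    histogram, which doubles as the per-edge start index.
    """
    # 1. Histogram: how many messages were output on each edge.
    histogram = [0] * num_edges
    for edge, _ in messages:
        histogram[edge] += 1
    # 2. Exclusive scan: first output slot for each edge (plus a total).
    starts = [0] + list(accumulate(histogram))
    # 3. Scatter: place every message at its edge's next free slot.
    cursor = starts[:-1].copy()
    ordered = [None] * len(messages)
    for msg in messages:
        edge = msg[0]
        ordered[cursor[edge]] = msg
        cursor[edge] += 1
    return ordered, starts


def messages_on_edge(ordered, starts, edge):
    """Only the messages for one edge are touched by a reader."""
    return ordered[starts[edge]:starts[edge + 1]]


msgs = [(2, "v0"), (0, "v1"), (2, "v2"), (1, "v3"), (0, "v4")]
ordered, starts = sort_messages_by_edge(msgs, num_edges=3)
print(starts)
print(messages_on_edge(ordered, starts, 2))
```

After sorting, an agent on a given edge reads one contiguous slice of the message list, so the cost of message iteration depends on the number of vehicles on that edge rather than on the total population.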

4.3. Application to road network microsimulation


Fig. 4. The average time taken for vehicle agents to output road messages is shown for three communication strategies and different scales of model. This shows the overhead cost of each communication specialisation. The application was benchmarked on a NVIDIA Titan X (Pascal).

Fig. 5. The average time taken for vehicle agents to execute the GPU kernel which includes the car following model is shown for three communication strategies and different scales of model. This kernel iterates all messages in the appropriate list to find the lead vehicle. Graph-based communication shows the highest level of performance. The application was benchmarked on a NVIDIA Titan X (Pascal).

4.4. Performance impact of the communication strategy on the car following model

To evaluate the performance improvement provided by graph-based communication for agent-based road network models, a set of road networks was benchmarked using three communication strategies. The message list output by vehicles travelling along the road sections within the road network was specialised using all-to-all, spatially partitioned and graph-based messaging. The performance of the relevant GPU kernels is extracted and compared.


Table 1
Computational hardware used for benchmarking Aimsun and FLAME GPU simulations.

Identifier | CPU | CPU cores | CPU frequency | Memory | GPU
System #1 | Intel Core i7 4770k | 4 | 3.5 GHz | 16 GB DDR3 | NVIDIA Titan X (Pascal) (TCC) & NVIDIA GeForce GTX 1080 (WDDM)
System #2 | 2× Intel Xeon E5 2698 v4 | 40 (20) | 2.20 GHz | 512 GB DDR4 | 8× NVIDIA Tesla P100 (Linux)

Table 2
Parameters used for validation and all benchmark simulations.

Parameter name | Grid size experiment | Input flow experiment | Units
Time step | 0.8 | 0.8 | seconds
Reaction time | 0.8 | 0.8 | seconds
Stop reaction time | 1.2 | 1.2 | seconds
Detector period | 600 | 600 | seconds
Input flow scale factor | 1.0 | 1.0 | %
Seed | Random value | Random value | Long integer
Section length | 1000 | 1000 | m
Junction length | 10 | 10 | m
Grid size | 2–576 | 64, 128 & 256 |
Input flow | 600 | 100–1000 |
Detectors per section | 3 | 3 |

5. Experimental results

This section summarises the cross-validation of the FLAME GPU based simulation against the existing multi-core CPU road network microsimulation tool Aimsun. It provides details of the procedurally generated Manhattan style grid road network used for performance benchmarking and two sets of benchmark experiments are discussed.

5.1. Validation

For the performance evaluation of the two simulators to be comparable it is important that the software produces comparable results. As the complex system is stochastic the results will need to be statistically equivalent. The FLAME GPU implementation was cross-validated against Aimsun 8.1 for a series of models to validate model behaviours, such as the vehicle arrival algorithm [15], car following [15] and stop sign interaction [15]. Additionally the overall behaviours and population sizes for the procedurally generated road network are evaluated. The same set of parameters is used for both simulators. The set of validation models showed that vehicle parameter sampling, the vehicle arrival algorithm, the car following model and the junction models were all correctly implemented, with no deviance observed. The procedurally generated networks showed statistically similar population sizes and turning behaviours over an average of multiple replications.

5.2. Benchmarking methodology

Multiple configurations of processing hardware were used to observe the impact of the hardware on performance. Table 1 describes the two hardware systems used for these experiments. The Aimsun benchmark models were executed using System #1, a desktop computer. FLAME GPU simulations were carried out using a range of GPUs in System #1, and also on an NVIDIA DGX-1 (System #2), a state-of-the-art multi-GPU server containing NVIDIA Tesla P100 accelerators. All simulations were executed using a single CPU and single GPU and the same parameters (Table 2) were used for both simulations of one hour of vehicle demand.

5.3. Procedurally generated benchmark road network


Fig. 6. A 5×5 example of the procedurally-generated artificial grid network, showing the overall structure of the network and the arrangement of turning sections within a junction. The network can be scaled to any size, with networks of up to 576×576 used during benchmarking.

Table 3
Parameters for generation of the procedurally generated Manhattan-style grid road network.

Parameter name | Description
Grid size (G) | The number of junctions for each row and column of the grid
Section length | The length of each road section between junctions in metres
Junction length | The length or diameter of junctions within the network
Exit turn proportion | For junctions with 1 exit destination edge, the proportion of vehicles turning onto the exit is reduced to maintain higher vehicle populations within the simulated road network.

5.4. Network scale experiment

To evaluate the performance impact of total problem scale, the Grid Size of the procedurally generated network was varied from 2 to 576. Aimsun simulations were only carried out up to a grid size of 512 due to the time taken for a single simulation to complete. As the grid size is increased, the size of the transport network increases, the number of detector agents in the simulation increases and the vehicle population size increases. Fig. 7 shows the average simulation time for 3 repetitions of the simulation. This shows that for larger scale problems the GPU accelerated FLAME GPU simulation outperforms Aimsun 8.1, by up to 37.2× using System #1 and 43.8× using System #2. The Real time ratio (RTR) of a simulation is the ratio of simulated time to execution run-time. The largest simulation executed in FLAME GPU of 576,000 vehicles and 1,994,112 detectors shows a RTR of 10.5× using a NVIDIA GeForce GTX 1080 in System #1, 16.6× using System #1 NVIDIA Titan X (Pascal) and 25.5× using the NVIDIA Tesla P100 in System #2. For very small simulations, with grid size 8 and below containing fewer than 7396 concurrent vehicles, Aimsun 8.1 shows greater levels of performance than FLAME GPU.
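To make the RTR figures concrete, the short sketch below converts them into approximate wall-clock run-times for one hour of simulated demand. It takes the RTR as simulated time divided by run-time, which is the reading consistent with results described as faster than real-time. The function name `run_time_from_rtr` is hypothetical, and the printed run-times are derived values rather than measurements reported here.

```python
SIMULATED_SECONDS = 3600.0  # one hour of vehicle demand


def run_time_from_rtr(rtr, simulated=SIMULATED_SECONDS):
    """Wall-clock seconds implied by a real-time ratio (simulated / run-time)."""
    return simulated / rtr


reported_rtr = {
    "NVIDIA GeForce GTX 1080": 10.5,
    "NVIDIA Titan X (Pascal)": 16.6,
    "NVIDIA Tesla P100": 25.5,
}

for gpu, rtr in reported_rtr.items():
    print(f"{gpu}: about {run_time_from_rtr(rtr):.0f} s of run-time")
```

Under this reading, an RTR of 25.5 corresponds to roughly 141 seconds of execution for the full hour of simulated traffic.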

5.5. Input flow experiment

Demand on transportation networks is increasing globally [39]. To evaluate the impact of this on microscopic transport network simulation, the demand on the transport network was varied for several fixed sizes of network. Due to the characteristics of our procedurally generated transport network, the vehicle population size increases non-linearly, as vehicles can only enter from the edges of the network. The model parameters used are defined in Table 2.

Figs. 8–10 show the average simulation time for 3 repetitions of each simulation. For a grid size of 64, FLAME GPU shows average speed-ups of between 3.8× and 18.9× compared to Aimsun 8.1. Speed-ups of between 9.3× and 36.6× are shown for a grid size of 128, and between 8.1× and 75.3× for a grid size of 256.

5.5.1. Discussion of application benchmark results


Fig. 7. Total simulation performance as the total scale of the simulation is increased, for each simulator. Values shown are the average from 3 repetitions. A logarithmic scale is used to improve visibility of similar total simulation times.

Fig. 8. Total simulation time against input flow for a procedurally generated road network of grid size 64. Runtimes shown are the average from 3 repetitions. A logarithmic scale is used.

of parallelism and therefore GPU utilisation increases. Additionally, GPU acceleration has associated overhead costs. Data and commands must be transferred from the host computer to the GPU on which code is executed. This is a relatively slow process which must be minimised where possible to achieve high levels of performance. Larger models quickly amortize this cost through performance gains.

The performance of individual simulation iterations, shown in Fig. 11, shows that per-iteration simulation time increases as the simulation progresses. This is due to the increasing number of vehicles within the simulation. Per-iteration simulation time grows at a higher rate within Aimsun than FLAME GPU, demonstrating the increased scalability of the GPU accelerated simulation with respect to population size. For both Aimsun and FLAME GPU the time to simulate a single iteration peaks every 750 iterations. This can be accounted for by the detector and statistical summary data being processed and written out to disk every 600 simulated seconds (with a simulation step of 0.8 s).
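The 750-iteration spacing of the peaks follows directly from the detector period and time step in Table 2, as the small worked example below shows. The helper name `iterations_per_period` is hypothetical, and integer milliseconds are used only to avoid floating-point rounding.

```python
def iterations_per_period(period_s, time_step_s):
    """Number of simulation iterations in one period of simulated time."""
    period_ms = round(period_s * 1000)
    step_ms = round(time_step_s * 1000)
    return period_ms // step_ms


time_step = 0.8          # seconds, from Table 2
detector_period = 600    # seconds, from Table 2

per_period = iterations_per_period(detector_period, time_step)
total = iterations_per_period(3600, time_step)   # one simulated hour

print(per_period)            # iterations between detector outputs
print(total // per_period)   # detector outputs in one simulated hour
```

A one hour simulation therefore spans 4500 iterations and contains six detector output events, one every 750 iterations.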


Fig. 9. Total simulation time against input flow for a procedurally generated road network of grid size 128. Runtimes shown are the average from 3 repetitions. A logarithmic scale is used.

Fig. 10. Total simulation time against input flow for a procedurally generated road network of grid size 256. Runtimes shown are the average from 3 repetitions. A logarithmic scale is used.

Table 4
GPU specifications [40–42].

GPU | Core count | Core frequency (MHz) | Memory | Memory bandwidth (GB/s)
NVIDIA GeForce GTX 1080 | 2560 | 1607 (1733) | 8 GB GDDR5X | 320
NVIDIA Titan X (Pascal) | 3584 | 1417 (1531) | 12 GB GDDR5X | 480
NVIDIA Tesla P100 (SXM2) | 3584 | 1328 (1480) | 16 GB CoWoS HBM2 | 732

high populations of agents, the NVIDIA Tesla P100 in System #2 showed the greatest levels of performance. This can be attributed to the larger number of processing cores, and the greater levels of memory bandwidth provided by the HBM2 memory, compared to GDDR5X and GDDR5. The CPUs in System #1 & #2 also differ, however the majority of processing time within the application is executed on the GPU.

6. Conclusion and future work


Fig. 11. Per-iteration simulation time for a model with grid size 128 and input flow 500 vehicles per hour.

multi-core CPU based simulation software tool commonly used within the transport planning industry. To ensure that simulator behaviour is comparable and enable simulator performance to be compared the FLAME GPU based simulation was cross validated against Aimsun using a series of models. As this is a stochastic model and additional features are present in Aimsun which could not be fully mitigated the results were non-exact but were statistically similar. The largest simulation executed using both simulators, a one hour simulation containing up to 512,000 vehicles and 1,575,936 detectors, showed a speed-up of 43.8× for the GPU accelerated simulation, with a real-time-ratio of 28.9. The GPU accelerated simulation also demonstrated greatly improved performance scaling behaviour, with the largest simulation of up to 576,000 vehicles and 1,994,112 detectors completing in 49.5% of the time of a much smaller (up to 64,000 vehicles and 24,960 detectors) simulation in Aimsun 8.1.

Critical to the performance of the GPU accelerated agent based microsimulation was the use of data parallel algorithms and data structures employed by FLAME GPU for agent management, abstracting the complexities of the GPU programming model away from the modeller. The proposed graph-based communication strategy for data-parallel agents is also vitally important, as existing methods for agent communication within FLAME GPU were non-optimal, limiting simulation performance. The proposed graph based methods also have more general implications, as other models such as social interaction models rely on graph-based communication. The increased memory-bandwidth of the NVIDIA Tesla P100 GPU, compared to the NVIDIA Titan X (Pascal) and NVIDIA GeForce GTX 1080, proved to have a dramatic effect on the performance of the application, as the application is memory bound.

6.1. Future work

Our FLAME GPU based simulator currently supports a single type of junction to support the benchmark model used. To enable this software to be used for real-world studies of transport systems, additional features must be implemented, such as the inclusion of multiple lanes, traffic signals and alternative types of junction. The impact of these additional behaviours would require further performance evaluation using procedurally generated and real-world transport networks.

The proposed graph-based communication strategy is currently limited to communication along the current edge, or around a single vertex within the graph. Although this can be used to achieve significant performance improvements, the strategy could be extended to offer high performance communication across multiple edges for graph exploration. This would aid in the wider applicability of the general communication strategy and offer performance improvements to other simulation domains. Furthermore, the existing agent based simulation uses multiple graphs to represent the transport network; combining these graphs into a single graph would simplify the development of additional models, and allow a more-generalised communication strategy to yield further performance increases.


Acknowledgement

This work was supported by a Department for Transport Transport Technology Research Innovation Grant “Accelerating Transport Microsimulation: Demonstrating the impact of future many core simulations” (T-TRIG July 2016) and additional support through EPSRC fellowship “Accelerating Scientific Discovery with Accelerated Computing” (EP/N018869/1).

References

[1] J.Barceló,J.Casas,DynamicnetworksimulationwithAimsun,in:SimulationApproachesinTransportationAnalysis,Springer,2005,pp.57–98. [2] M.Fellendorf,Vissim:amicroscopicsimulationtooltoevaluateactuatedsignalcontrolincludingbuspriority,in:64thInstituteofTransportation

EngineersAnnualMeeting,Springer,1994,pp.1–9.

[3] M.Balmer,M.Rieser,K.Meister,D.Charypar,N.Lefebvre,K.Nagel,Matsim-t:architectureandsimulationtimes,in:Multi-agentSystemsforTraﬃc andTransportationEngineering,IGIGlobal,2009,pp.57–78.

[4] D.Krajzewicz,J.Erdmann,M.Behrisch,L.Bieker,Recentdevelopmentand applicationsofSUMO-simulationofurbanmobility,Int.J.Adv.Syst. Measur.5(3&4)(2012)128–138.

[5] C.Antoniou,J.Barcelò,M.Brackstone,H.Celikoglu,B.Ciuffo,V.Punzo,P.Sykes,T.Toledo,P.Vortisch,P.Wagner,Traﬃcsimulation:caseforguidelines, 2014URLhttp://publications.jrc.ec.europa.eu/repository/bitstream/JRC88526/2014_multitude_guidelines_on-line.pdf,doi:10.2788/11382.

[6] P.Richmond,FLAMEGPUTechnicalReportand UserGuide(CS-11-03),TechnicalReport,DepartmentofComputerScience,UniversityofSheﬃeld, 2011.

[7] G.Eliasson,Modelingtheexperimentallyorganizedeconomy,1991.doi:10.1016/0167-2681(91)90047-2.

[8] P.Gipps,Abehaviouralcar-followingmodelforcomputersimulation,Transp.Res.PartB15(2)(1981)105–111,doi:10.1016/0191-2615(81)90037-0. [9] M.Treiber,A.Hennecke,D.Helbing,Congestedtraﬃcstatesinempiricalobservationsandmicroscopicsimulations,Phys.Rev.E62(2)(2000)1805. [10] P.Gipps,Amodelforthestructureoflane-changingdecisions,Transp.Res.PartB20(5)(1986)URLhttp://www.sciencedirect.com/science/article/pii/

0191261586900123.

[11]M.Brackstone, M.McDonald,J.Wu,Lanechangingon themotorway:factorsaffectingitsoccurrence,and theirimplications,in:RoadTransport InformationandControl,1998.9thInternationalConferenceon(Conf.Publ.No.454),IET,1998,pp.160–164.

[12] R. Liu, D. Van Vliet, D. Watling, Microsimulation models incorporating both demand and supply dynamics, Transp. Res. Part A 40 (2) (2006) 125–150.

[13] R. Dowling, A. Skabardonis, J. Halkias, G. McHale, G. Zammit, Guidelines for calibration of microsimulation models: framework and applications, Transp. Res. Rec. (1876) (2004) 1–9.

[14] S.-J. Kim, W. Kim, L. Rilett, Calibration of microsimulation models using nonparametric statistical techniques, Transp. Res. Rec. (1935) (2005) 111–119.

[15] Transport Simulation Systems, Aimsun 8 Users Manual, 2014.

[16] A. PTV, Vissim 5.40 user manual, Karlsruhe, Germany (2011).

[17] NVIDIA Corporation, NVIDIA Tesla P100 (whitepaper), 2016.

[18] Microway, Detailed specifications of the intel xeon e5-2600v4-broadwell-ep-processors, 2016. URL https://www.microway.com/knowledge-center-articles/detailed-specifications-of-the-intel-xeon-e5-2600v4-broadwell-ep-processors/.

[19] Green500, Green500 website, 2016. (https://www.top500.org/green500/). Last accessed 2016-08-01.

[20] M. Zamith, M. Joselli, J.R.S. Junior, R.C.P. Leal-Toledo, E. Clua, I. MediaLab, E. Soluri, N. Tecnologia, An approach for traffic forecast with GPU computing & cellular automata model.

[21] D. Strippgen, K. Nagel, Multi-agent traffic simulation with cuda, in: High Performance Computing & Simulation, 2009. HPCS’09. International Conference on, IEEE, 2009, pp. 106–114.

[22] K. Wang, Z. Shen, A GPU-based parallel genetic algorithm for generating daily activity plans, IEEE Trans. Intell. Transp. Syst. 13 (3) (2012) 1474–1480.

[23] K. Wang, Z. Shen, Artificial societies and GPU-based cloud computing for intelligent transportation management, IEEE Intell. Syst. 26 (4) (2011) 22–28.

[24] Y. Sano, N. Fukuta, A GPU-based framework for large-scale multi-agent traffic simulations, in: Advanced Applied Informatics (IIAIAAI), 2013 IIAI International Conference on, IEEE, 2013, pp. 262–267.

[25] Y. Xu, G. Tan, X. Li, X. Song, Mesoscopic traffic simulation on CPU/GPU, in: Proceedings of the 2nd ACM SIGSIM Conference on Principles of Advanced Discrete Simulation, ACM, 2014, pp. 39–50.

[26] X. Song, Z. Xie, Y. Xu, G. Tan, W. Tang, J. Bi, X. Li, Supporting real-world network-oriented mesoscopic traffic simulation on gpu, Simul. Modell. Practice Theory 74 (2017) 46–63.

[27] R. Bradley, I. Wright, D. Swain, R. Himlin, P. Heywood, P. Richmond, M. Mawson, G. Fletcher, R. Guichard, Accelerating traffic models using GPU-based technology, in: European Transport Conference 2016, Association for European Transport (AET), 2016.

[28] G. Cordasco, R.D. Chiara, A. Mancuso, D. Mazzeo, V. Scarano, C. Spagnuolo, A framework for distributing agent-based simulations, in: Euro-Par Workshops (1)’11, 2011, pp. 460–470.

[29] N. Collier, M. North, Repast HPC: A Platform for Large-Scale Agent Based Modeling, Wiley, 2011.

[30] L. Chin, D. Worth, C. Greenough, S. Coakley, M. Holcombe, M. Gheorghe, Flame-ii: a redesign of the flexible large-scale agent-based modelling environment, Technical Report, 2012.

[31] P. Richmond, D. Walker, S. Coakley, D. Romano, High performance cellular level agent-based simulation with FLAME for the GPU, Briefings Bioinf. 11 (3) (2010) 334–347. URL http://www.ncbi.nlm.nih.gov/pubmed/20123941, doi:10.1093/bib/bbp073.

[32] P. Richmond, D. Romano, Template-driven agent-based modeling and simulation with cuda, in: GPU Computing Gems Emerald Edition, in: Applications of GPU Computing Series, 2011, pp. 313–324.

[33] T. Karmakharm, P. Richmond, D.M. Romano, Agent-based large scale simulation of pedestrians with adaptive realistic navigation vector fields, TPCG 10 (2010) 67–74.

[34] S. Tamrakar, P. Richmond, R.M. DSouza, Pi-flame: a parallel immune system simulator using the flame graphic processing unit environment, SIMULATION (2016). 0037549716673724

[35] P. Heywood, P. Richmond, S. Maddock, Road network simulation using flame GPU, in: Euro-Par 2015: Parallel Processing Workshops, Springer, 2015, pp. 430–441, doi:10.1007/978-3-319-27308-2.

[36] Transport Simulation Systems, Aimsun 8 Dynamic Simulators Users Manual, 2014.

[37] G. Cordasco, C. Spagnuolo, V. Scarano, Toward the new version of d-mason: efficiency, effectiveness and correctness in parallel and distributed agent-based simulations, in: Parallel and Distributed Processing Symposium Workshops, 2016 IEEE International, IEEE, 2016, pp. 1803–1812.

[38] S. Green, Particle simulation using cuda, NVIDIA Whitepaper 6 (2010) 121–128.

[39] Department for Transport, Road traffic forecasts 2015, 2015, (https://www.gov.uk/government/uploads/system/uploads/attachment_data/file/260700/road-transport-forecasts-2013-extended-version.pdf).

[40] NVIDIA Corporation, Nvidia geforce gtx 1080 whitepaper: Gaming perfected, 2016. URL http://international.download.nvidia.com/geforce-com/international/pdfs/GeForce_GTX_1080_Whitepaper_FINAL.pdf.

[41] NVIDIA Corporation, Nvidia titan x (pascal) webpage, 2016. URL http://www.geforce.co.uk/hardware/10series/titan-x/.