Flexible Agent Based Simulation for Pedestrian Modelling on GPU Hardware

(1)


Paul Richmond

The Department of Computer Science

University of Sheffield, UK

[email protected]

www.dcs.shef.ac.uk/~paul

• Richmond Paul, Coakley Simon, Romano Daniela, "Cellular Level Agent Based Modelling on the Graphics Processing Unit (with FLAME GPU)", selected for review in the special issue "Parallel and Ubiquitous methods and tools in Systems Biology" of the international journal Briefings in Bioinformatics

• Richmond Paul, Coakley Simon, Romano Daniela (2009), "Cellular Level Agent Based Modelling on the Graphics Processing Unit", Proc. of HiBi09 - High Performance Computational Systems Biology, 14-16 October 2009, Trento, Italy

• Richmond Paul, Coakley Simon, Romano Daniela (2009), "A High Performance Agent Based Modelling Framework on Graphics Card Hardware with CUDA", Proc. of 8th Int. Conf. on Autonomous Agents and Multiagent Systems (AAMAS 2009), May 10–15, 2009, Budapest, Hungary

• Richmond Paul, Romano Daniela (2008), "A High Performance Framework For Agent Based Pedestrian Dynamics On GPU Hardware", Proceedings of EUROSIS ESM 2008 (European Simulation and Modelling),

(2)

Introduction and Scope

• Agent Based Modelling (ABM)
  • Emergence of complex natural behaviour from simple rules
  • Individuals are agents with memory
  • Agents update their own memory by considering neighbours
• Of Pedestrian Behaviour
  • Continuous space mobile agents
  • Discrete time steps
• On the GPU
  • Why? Performance and real time visualisation
  • Aim is for flexibility: we want to harness the GPU's power without modellers having to understand GPU programming

(3)

Outline

• FLAME and FLAME GPU

• About FLAME

• A simple example of a pedestrian model specification

• Implementing FLAME on the GPU

• Brief overview of GPU technology

• Mapping agent data and functions to the GPU

• Agent communication patterns

• Case Study

• Pedestrian modelling

• Discrete agents

• Performance results

(4)

What is FLAME?

• What is FLAME (and what FLAME is not)?
  • Flexible Large-scale Agent Modelling Environment
  • XML model specification based on the X-Machine
  • Template systems for generating simulation code: Single CPU, GRID, GPU
  • Not a modelling application itself (dynamically generated API)
• Why extend FLAME to the GPU?
  • Complete modelling environment (beyond that of simple swarms)
  • Formal and portable specification technique based on the X-Machine
  • Many existing models to be used for benchmarking
• What is FLAME GPU?
  • Data parallel implementation of FLAME using CUDA
  • Offers real time visualisation

(5)

FLAME and Formal Agent Specification

• The X-Machine

• Formally defined by Eilenberg (Eilenberg 74) as an 8-tuple (Σ, Γ, Q, M, Φ, F, q0, m0), where:

  Σ and Γ are the finite input and output alphabets respectively;

  Q is the finite set of states;

  M is the (possibly) infinite set called memory;

  Φ is a finite set of partial functions φ that map an input and a memory state to an output and a new memory state, φ: Σ × M → Γ × M;

  F is the next state partial function that, given a state and a function from the type Φ, provides the next state, F: Q × Φ → Q (F is often described as a state transition diagram);

  q0 and m0 are the initial state and initial memory respectively.
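The tuple can be read almost directly as a data structure. The following is a minimal sketch in C, not FLAME code: the two states and function names are borrowed from the pedestrian example, while the `float` alphabets, the `transition` table and the behaviour inside `input_locations` are assumptions made only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Q: the finite set of states (names borrowed from the pedestrian example). */
typedef enum { START_STATE, WAIT_INPUT, NUM_STATES } state;

/* M: the memory (a hypothetical two-variable memory). */
typedef struct { float x; float vel_x; } memory;

/* A member of Phi: maps (input, memory) to (output, new memory).
 * Both alphabets are assumed to be float for this sketch. */
typedef float (*phi)(float input, memory *m);

/* One arc of F: in state `from`, applying `fn` leads to state `to`. */
typedef struct { state from; phi fn; state to; } transition;

/* A running machine holds its current state q and memory m. */
typedef struct { state q; memory m; } xmachine;

/* Ignores the input and outputs the current position. */
float output_location(float input, memory *m)
{
    (void)input;
    return m->x;
}

/* Steps away from the position received as input (illustrative rule only). */
float input_locations(float input, memory *m)
{
    m->vel_x = (m->x >= input) ? 1.0f : -1.0f;
    m->x += m->vel_x;
    return m->x;
}

/* F written as a table, equivalent to a state transition diagram. */
static const transition F[] = {
    { START_STATE, output_location, WAIT_INPUT },
    { WAIT_INPUT,  input_locations, START_STATE },
};

/* Consume one input symbol: apply the function enabled in the current state. */
float step(xmachine *xm, float input)
{
    for (size_t i = 0; i < sizeof F / sizeof F[0]; i++) {
        if (F[i].from == xm->q) {
            float out = F[i].fn(input, &xm->m);
            xm->q = F[i].to;
            return out;
        }
    }
    assert(0 && "no transition defined for this state");
    return 0.0f;
}
```

Initialising an `xmachine` value supplies q0 and m0; each call to `step` then consumes one input symbol and produces one output symbol.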

(6)

Agents as Communicating X-Machines

• Each agent is a Communicating Stream X-Machine (Balanescu 99)
• Stream: input and output are streams of data
• Communicating: agents input and output messages
• State transitions (functions) describe agent behaviour
  • Update agent memory
  • Output messages (and agents) and process input messages

(7)

Specifying an Agent in XMML

<xagent>
  <name>pedestrian</name>
  <memory>
    <variable><type>float</type><name>x</name></variable>
    <variable><type>float</type><name>y</name></variable>
    <variable><type>float</type><name>velx</name></variable>
    <variable><type>float</type><name>vely</name></variable>
  </memory>
  <states>
    <state><name>start_state</name></state>
    <state><name>wait_input</name></state>
    <initialState>start_state</initialState>
  </states>
  <functions>
    <function>
      <name>output_location</name>
      <currentState>start_state</currentState><nextState>wait_input</nextState>
      <outputs>
        <output><messageName>pedestrian_location</messageName></output>
      </outputs>
    </function>
    <function>
      <name>input_locations</name>
      <currentState>wait_input</currentState><nextState>start_state</nextState>
      <inputs>
        <input><messageName>pedestrian_location</messageName></input>
      </inputs>
    </function>
  </functions>
  <type>CONTINUOUS</type>
</xagent>

(8)

Specifying Agent Communication in XMML

<message>
  <name>pedestrian_location</name>
  <variables>
    <variable>
      <type>float</type><name>x</name>
    </variable>
    <variable>
      <type>float</type><name>y</name>
    </variable>
    <variable>
      <type>float</type><name>velx</name>
    </variable>
    <variable>
      <type>float</type><name>vely</name>
    </variable>
  </variables>
  <partitioningSpatial>
    <radius>25</radius>
    <xmin>-100.0</xmin>
    <xmax>100.0</xmax>
    <ymin>-100.0</ymin>
    <ymax>100.0</ymax>
    <zmin>0.0</zmin>
    <zmax>25</zmax>
  </partitioningSpatial>
  <!-- Alternatives to partitioningSpatial named on the slide: partitioningNone, partitioningDiscrete -->
</message>

(9)

Specifying the Function Order

<layers>
  <layer>
    <layerFunction>
      <name>output_location</name>
    </layerFunction>
  </layer>
  <layer>
    <layerFunction>
      <name>input_locations</name>
    </layerFunction>
  </layer>
</layers>

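[Figure: state diagram of the pedestrian agent. output_location() moves the agent from start_state to wait_input and writes (OUT) to the pedestrian_location message list; input_locations() reads (IN) from that list and returns the agent to start_state. Agent memory shown: agent->x, agent->y, agent->vel_x]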

(10)

Simulation Process and Code Generation

• XMML Model File
  • Syntax validated through XML Schema
  • Base XMML Schema describes the basic structure of an X-Machine agent
  • GPU specific extensions (partitioning) available through an XMMLGPU Schema
  • Object orientated approach to extension of the base model
• C Function Files
• Templates translate an XMML model file into simulation source code
  • Templates are written in XML (using XSLT Schema) so can be syntax validated
  • XSLT processors implement a W3C specification: any compliant processor can be used to generate code
  • FLAME GPU is therefore not dependent on internal tools or parsers
• XML Input Data


(12)

Programming the GPU

• Purpose of the GPU
  • Data parallel device for operation on streams of data
• Programming for General Purpose Use
  • Graphics API technique: not ideal
  • High level alternatives
    Brook GPU (Buck 04): SIMD stream programming extension for C
    Sh (McCool 02): C++ language with a compiler for GPU backends
  • Hardware specific
    Stream SDK: low level ATI specific native instruction set and high level support with Brook+
    CUDA: NVIDIA programming for GPU using a compiler and a C syntax with extensions

(13)

NVIDIA CUDA Programming Model

• GPU is a coprocessor to the CPU (with its own global memory)
• Many parallel threads of execution
• Each thread runs the same kernel program (SPMD)
• Threads are grouped into regular sized blocks
• Threads within a block can communicate through shared memory
  • Simple synchronisation primitive
  • Threads across blocks cannot communicate

[Figure: a Grid of Blocks (Block 0 to Block 3) and a Block of Threads (Thread 0 to Thread N)]
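The grid and block hierarchy determines which data element each thread works on. The sketch below is plain C rather than a CUDA kernel: the `thread_coord` struct is a hypothetical stand-in for CUDA's built-in `blockIdx.x`, `blockDim.x` and `threadIdx.x`, and it shows only the usual one-dimensional index arithmetic.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for CUDA's blockIdx.x, blockDim.x and threadIdx.x. */
typedef struct {
    size_t block_idx;   /* which block in the grid */
    size_t block_dim;   /* threads per block */
    size_t thread_idx;  /* which thread within the block */
} thread_coord;

/* Global index of a thread: the element (e.g. agent) it is responsible for. */
size_t global_index(thread_coord t)
{
    assert(t.thread_idx < t.block_dim);
    return t.block_idx * t.block_dim + t.thread_idx;
}

/* Number of blocks needed to cover n elements, rounded up. */
size_t blocks_for(size_t n, size_t block_dim)
{
    assert(block_dim > 0);
    return (n + block_dim - 1) / block_dim;
}
```

Because the block count is rounded up, the last block may contain threads whose global index is past the end of the data; a kernel normally checks the index against the element count before doing any work.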

(14)

CUDA Hardware Model

[Figure: GPU device containing Multiprocessors 1 to N. Each multiprocessor holds Vector Processors 1 to N with their registers, an Instruction Unit, Shared Memory, a Texture Cache and a Constant Cache, and is connected to the GPU DRAM device memory]

• Thread blocks are mapped to Multiprocessors (MPs)
• Multiprocessors are a set of SIMD thread (vector) processors
• Limited shared memory per MP (and hence per block)
• Limited cache and registers per MP

(15)

Mapping Agent Functions to the GPU

__FLAME_GPU_FUNC__ int input_function(
    xmachine_memory_pedestrian* xmemory,
    xmachine_message_pedestrian_location_list* location_messages)
{
    /* Get the first message */
    xmachine_message_pedestrian_location* location_message =
        get_first_pedestrian_location_message(location_messages);

    /* Repeat until there are no more messages */
    while (location_message)
    {
        /* Process the message */
        if (distance_check(xmemory, location_message))
        {
            updateSteerVelocity(xmemory, location_message);
        }

        /* Get the next message */
        location_message =
            get_next_pedestrian_location_message(location_message,
                                                 location_messages);
    }

    /* Update any other xmemory variables */
    xmemory->x += xmemory->vel_x*TIME_STEP;
    ...

    /* 0 keeps the agent alive; a non-zero value removes it */
    return 0;
}

• Each transition function is wrapped by a GPU kernel
• Each agent is a thread performing the function
• Functions can input and output messages
• Functions can output new agents (agent birth)
• An agent can be removed (agent death) by returning a non-zero value
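The wrapping described above can be pictured with a sequential stand-in. This is a sketch in plain C, not the code FLAME GPU generates: `run_agent_function`, `move`, `agent_memory`, the `TIME_STEP` value and the `X_MAX` bound are all hypothetical, and the loop body corresponds to what a single GPU thread would do for its agent.

```c
#include <assert.h>
#include <stddef.h>

#define TIME_STEP 0.5f   /* assumed value; the slides do not give one */
#define X_MAX 100.0f     /* assumed environment bound */

typedef struct { float x; float vel_x; } agent_memory;

/* Agent function: returns 0 to stay alive, non-zero to be removed. */
int move(agent_memory *xmemory)
{
    xmemory->x += xmemory->vel_x * TIME_STEP;
    return xmemory->x > X_MAX;   /* agents leaving the environment die */
}

/* Stand-in for the kernel wrapper: on the GPU each loop iteration would be
 * one thread. Records a death flag per agent and returns the death count. */
size_t run_agent_function(int (*fn)(agent_memory *), agent_memory *agents,
                          int *dead, size_t n)
{
    size_t deaths = 0;
    assert(fn != NULL);
    for (size_t i = 0; i < n; i++) {
        dead[i] = fn(&agents[i]) != 0;
        deaths += (size_t)dead[i];
    }
    return deaths;
}
```

The death flags leave holes in the agent list, which is why a compaction step is needed afterwards.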

(16)

Mapping X-Machine Agent Data to the GPU

• All data (agents and messages) is mapped to global memory on the GPU
• Lists are stored using a Structure of Arrays (SoA) rather than an Array of Structures (AoS)
• Data is read from global memory to registers
• Agents and messages are referenced as C structures within function code

/* Array of Structures (AoS) */
typedef struct agent{
    float x;
    float y;
} xm_memory_agent_list [N];

/* Structure of Arrays (SoA) */
typedef struct agent_list{
    float x[N];
    float y[N];
} xm_memory_agent_list;

[Figure: element indices 0 to N laid out in memory for each of the two list layouts]
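The benefit of the SoA layout is that neighbouring threads reading the same variable touch neighbouring addresses. The sketch below is illustrative C with hypothetical names (`agent`, `agent_list`, `aos_to_soa`) and a small assumed `N`; it converts between the two layouts and reports the byte distance between the `x` values of agents i and i+1 in each.

```c
#include <assert.h>
#include <stddef.h>

#define N 4   /* assumed list capacity for the example */

typedef struct { float x; float y; } agent;              /* one AoS element */
typedef struct { float x[N]; float y[N]; } agent_list;   /* the SoA list */

/* Bytes between the x values read by neighbouring threads i and i+1. */
size_t aos_stride(void) { return sizeof(agent); }
size_t soa_stride(void) { return sizeof(float); }

/* Copy an AoS list into SoA form, one variable array at a time. */
void aos_to_soa(const agent *in, agent_list *out, size_t n)
{
    assert(n <= N);
    for (size_t i = 0; i < n; i++) {
        out->x[i] = in[i].x;
        out->y[i] = in[i].y;
    }
}
```

With only two variables the AoS stride is already twice the SoA stride, and it grows with every variable added to the agent, whereas the SoA stride stays at one float.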

(17)

Use of Parallel Compaction

• Need to avoid divergence within thread blocks
• Agents are stored and processed in state lists to avoid conditional branching
• Sparse lists still occur as a result of
  • Agent births
  • Function filters
  • Also during message outputs

[Figure: an agent list (colour represents state) before and after an agent function, and a new agent list whose agent birth output flags (1 0 1 0 1 0) are used to compact it before it joins the agent list]
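Compaction of a sparse list is commonly built from an exclusive prefix sum over the output flags followed by a scatter. The following sequential C sketch shows that pattern under the assumption that this is the kind of compaction meant here; the function name `compact` and its signature are hypothetical, and on the GPU both loops would be parallel primitives.

```c
#include <assert.h>
#include <stddef.h>

/* Compact `in` into `out`, keeping only elements whose flag is non-zero.
 * `scan` receives the exclusive prefix sum of the flags, which is the
 * output position of each kept element. Returns the compacted length. */
size_t compact(const int *flags, const float *in, float *out,
               size_t *scan, size_t n)
{
    size_t sum = 0;
    assert(flags != NULL && in != NULL && out != NULL && scan != NULL);

    /* Exclusive prefix sum (a parallel scan on the GPU). */
    for (size_t i = 0; i < n; i++) {
        scan[i] = sum;
        sum += flags[i] ? 1u : 0u;
    }

    /* Scatter: every kept element writes to its own slot, so no conflicts. */
    for (size_t i = 0; i < n; i++) {
        if (flags[i]) {
            out[scan[i]] = in[i];
        }
    }
    return sum;
}
```

Because each kept element knows its destination from the scan alone, the scatter needs no communication between threads.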

(18)

Brute Force Message Communication

• Tile message lists into shared memory to reduce global memory access (Nyland 07)
• Each thread in the thread block loads a single message into shared memory on the load_first_message function
• Each call to load_next_message then iterates through messages in shared memory
• When a call to load_next_message is made after every message in shared memory (SM) has been returned, a new batch of messages is tiled
• Repeat until all messages have been considered
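The tiling loop above can be emulated sequentially. In this C sketch the function names `load_first_message` and `load_next_message` come from the slide, but their signatures, the `message_iterator` struct, the `TILE` size and the `tile` array standing in for shared memory are assumptions; on the GPU each thread of the block would copy one message of the tile.

```c
#include <assert.h>
#include <stddef.h>

#define TILE 4   /* messages per tile; on the GPU this would be the block size */

typedef struct { float x; } message;

typedef struct {
    const message *global;  /* full message list (global memory) */
    size_t count;           /* number of messages in the list */
    message tile[TILE];     /* stands in for shared memory */
    size_t tile_start;      /* global index of tile[0] */
    size_t tile_len;        /* valid messages in the current tile */
    size_t cursor;          /* position within the current tile */
    size_t tile_loads;      /* number of batches tiled so far */
} message_iterator;

/* Copy the next batch of messages into the tile. */
static void load_tile(message_iterator *it, size_t start)
{
    size_t remaining = it->count - start;
    it->tile_start = start;
    it->tile_len = remaining < TILE ? remaining : TILE;
    for (size_t i = 0; i < it->tile_len; i++) {
        it->tile[i] = it->global[start + i];
    }
    it->cursor = 0;
    it->tile_loads++;
}

const message *load_first_message(message_iterator *it,
                                  const message *global, size_t count)
{
    assert(it != NULL);
    it->global = global;
    it->count = count;
    it->tile_start = it->tile_len = it->cursor = it->tile_loads = 0;
    if (count == 0) return NULL;
    load_tile(it, 0);
    return &it->tile[0];
}

const message *load_next_message(message_iterator *it)
{
    it->cursor++;
    if (it->cursor < it->tile_len) return &it->tile[it->cursor];
    /* Tile exhausted: fetch a new batch, or stop at the end of the list. */
    size_t next = it->tile_start + it->tile_len;
    if (next >= it->count) return NULL;
    load_tile(it, next);
    return &it->tile[0];
}
```

The iterator touches the global list once per tile rather than once per message, which is the saving the shared memory approach relies on.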

(19)

Effect of Optimisations on Brute Force Message Communication

• Simple benchmarking model
• Efficient data access methods double performance
• Massive performance gain by using shared memory

[Figure: bar chart; y-axis "Relative Speedup over FLAME" (0 to 100)]

(20)

Limited Range Message Communication

• For each message output
  • Environment is split into discrete partitions equal in size to the message range (each has a unique identifier)
  • The message list is sorted by the partition in which each message lies
  • A boundary matrix indicates how many messages are within each partition by storing the start and end index of that partition's messages within the sorted list
  • To read all messages within a partition the boundary matrix indicates the range within the message list which needs to be iterated
  • Each agent reads 27 partitions (for a 3D environment) including its own, which guarantees that all messages within the range are processed
  • Roughly 2/3 of the messages read are outside the range, but this is much better than O(n²)
• Texture cache is used to read messages from global memory
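The partition, sort and boundary steps can be sketched as follows. This is illustrative C under several assumptions: it is 2D for brevity (so an agent would read 9 cells rather than 27), the grid size is derived from the example bounds of -100 to 100 and radius 25, a counting sort replaces whatever GPU sort the implementation uses, and a single `start` array of prefix offsets stands in for the boundary matrix. All names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define GRID_DIM 8                      /* cells per axis: 200 / 25 */
#define NUM_CELLS (GRID_DIM * GRID_DIM)

typedef struct { float x, y; } message;

/* Map a position to its partition identifier (2D for brevity). */
size_t cell_of(float x, float y, float min, float radius)
{
    assert(x >= min && y >= min);
    size_t cx = (size_t)((x - min) / radius);
    size_t cy = (size_t)((y - min) / radius);
    assert(cx < GRID_DIM && cy < GRID_DIM);
    return cy * GRID_DIM + cx;
}

/* Counting sort of messages by partition. Afterwards the messages of cell c
 * occupy sorted[start[c]] up to, but not including, sorted[start[c + 1]]. */
void build_partitions(const message *in, message *sorted, size_t n,
                      size_t start[NUM_CELLS + 1], float min, float radius)
{
    size_t fill[NUM_CELLS] = { 0 };

    for (size_t c = 0; c <= NUM_CELLS; c++) start[c] = 0;

    /* Count the messages per cell, shifted by one slot. */
    for (size_t i = 0; i < n; i++) {
        start[cell_of(in[i].x, in[i].y, min, radius) + 1]++;
    }
    /* Prefix sum turns the counts into start offsets. */
    for (size_t c = 0; c < NUM_CELLS; c++) start[c + 1] += start[c];

    /* Place each message in the next free slot of its cell. */
    for (size_t i = 0; i < n; i++) {
        size_t c = cell_of(in[i].x, in[i].y, min, radius);
        sorted[start[c] + fill[c]++] = in[i];
    }
}
```

An agent in cell c would then iterate the index ranges of c and its neighbouring cells, applying the distance check to each message it reads.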

(21)

Evaluation of Limited Range Communication

N          32       64       96       128      160      192      224      256
1024       0.94     1.05     0.90     0.86     0.93     0.89     0.95     0.88
4096       1.24     1.25     1.30     1.22     1.39     1.22     1.24     1.25
16384      2.45     2.48     2.62     2.53     2.76     2.81     2.77     2.60
65536      9.09     9.34     9.47     9.23     9.22     9.31     9.45     9.42
262144     33.74    37.99    36.88    37.39    36.61    36.83    37.81    38.12
1048576    136.28   169.73   147.39   172.98   145.21   165.34   151.26   177.06

[Figure: stacked bar chart; y-axis "Percentage of GPU Time" (0% to 100%); x-axis ticks 1024, 4096, 16384, 65536, 262144, 1048576]

(22)

Discrete Agent Communication

• Discrete Agents reading Discrete Messages
  • Load messages into shared memory
• Continuous Agents reading Discrete Messages
  • Can't ensure all messages are loaded into shared memory

[Figure: a 2D message output grid (messages 0 to 63) tiled into shared memory over Message Loads 1 to 9]

(23)

Performance of Discrete Message Communication

• Cellular Automaton Model (Game of Life)
• Over 1 million agents
• Shared memory only suitable for very small interaction ranges

[Figure: chart of GPU Time (ms), axis 50 to 300; series: TEX 64, TEX 256, SMC 64, SM 256]


(25)

A Simple Pedestrian Model

• Inter agent interaction (using spatially partitioned messaging) is based on a hybrid of Reynolds and Social Forces
• Social repulsion force
  Navigates pedestrians to areas of low concentration
  Limited forward vision
  Preference over agents in direct line of sight
  Scaled depending on distance to neighbour
• Close range interaction force
  Very short range with no limited vision
  Acts as collision avoidance

(26)

Visualisation Technique

• Agent data is already on the GPU
  • Agent positions are made available to OpenGL by mapping them to a Buffer Object
• We can also store geometry on the GPU to reduce draw calls
  • For complex models (lots of vertices)
    • Store a single instance of the geometry in a Vertex Array
    • Draw the array for each agent and set a Vertex Attribute each time to indicate the agent index
    • GLSL vertex shader is used to displace vertices in the same way
  • For simple models we can use a single large Vertex Array to hold a geometry instance for each agent
    • Associate each vertex with an agent by using a Vertex Attribute stored in a Vertex Attribute Array

(27)

Animation and Level of Detail (LOD)

• Animation - very simple
  • Interpolate between 2 key frames
  • Rotate the model depending on velocity direction
  • Performed in a vertex shader
• LOD - all data is maintained on the GPU so must remain parallel
  • Set view position as a GLOBAL variable
  • Use agent script to calculate viewing distance
  • Save LOD level in an agent variable
  • Use parallel reduction function to count number of agents per level
  • Secondary sort of the agents by LOD level and render in groups
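The LOD steps above amount to a per-agent distance classification followed by a count per level. The C sketch below is a sequential illustration only: the three levels, the `limits` thresholds and both function names are assumptions, and the counting loop stands in for the parallel reduction.

```c
#include <assert.h>
#include <stddef.h>

#define LOD_LEVELS 3   /* assumed number of detail levels; 0 is most detailed */

/* Classify an agent by its distance to the view position. limits[] holds
 * ascending distance limits for levels 0 .. LOD_LEVELS - 2. */
int lod_level(float ax, float ay, float view_x, float view_y,
              const float limits[LOD_LEVELS - 1])
{
    float dx = ax - view_x;
    float dy = ay - view_y;
    float d2 = dx * dx + dy * dy;   /* squared distance avoids a sqrt */
    int level = 0;
    while (level < LOD_LEVELS - 1 && d2 >= limits[level] * limits[level]) {
        level++;
    }
    return level;
}

/* Count agents per level (a parallel reduction on the GPU). */
void count_per_level(const int *levels, size_t n, size_t counts[LOD_LEVELS])
{
    for (int l = 0; l < LOD_LEVELS; l++) counts[l] = 0;
    for (size_t i = 0; i < n; i++) {
        assert(levels[i] >= 0 && levels[i] < LOD_LEVELS);
        counts[levels[i]]++;
    }
}
```

The per-level counts give the size of each render group once the agents have been sorted by level.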

(28)

(29)

Performance Results

Observables

• Performance dependent on communication radius
  • Larger communication radius = fewer partitions = more agents considered per update
• LOD technique has a cost
  • Don’t use for small populations
• Very large population sizes possible in real time

[Figure: chart of Frames per Second (FPS), axis 50 to 500; series: Billboards, Detail Level 0, Detail Level 1, Detail Level 2, Dynamic]

(30)

Environment Collision Avoidance

• Discrete grid of agents to encode the environment

• Static Discrete Agents

• Repulsive forces direct agents away from walls

• Automatically generated in advance

• Continuous Pedestrian Agents read discrete messages

• Apply a collision force

(31)

Long Range Navigation

• Many agents follow similar paths so a global solution is used
• Fluid flow route for each path through the environment
• Calculated offline in advance by backtracking from exit point
• Smooth movement around obstacles
• Discrete agents are also responsible for pedestrian birth allocation
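One simple way to backtrack from an exit is a breadth-first flood over the discrete grid, after which each cell points towards its lowest-distance neighbour. The C sketch below uses that technique as an assumption; the slides do not say how the flow routes were actually computed, and the grid size, 4-connected neighbourhood and all names are hypothetical.

```c
#include <assert.h>

#define W 5           /* assumed grid width */
#define H 4           /* assumed grid height */
#define WALL  (-2)
#define UNSET (-1)

static const int DX[4] = { 1, -1, 0, 0 };
static const int DY[4] = { 0, 0, 1, -1 };

/* Breadth-first flood from the exit: dist[c] becomes the number of steps
 * from cell c to the exit, WALL for walls, UNSET if unreachable. */
void flood_from_exit(const int walls[W * H], int dist[W * H], int exit_cell)
{
    int queue[W * H];
    int head = 0, tail = 0;

    assert(exit_cell >= 0 && exit_cell < W * H && !walls[exit_cell]);
    for (int c = 0; c < W * H; c++) dist[c] = walls[c] ? WALL : UNSET;

    dist[exit_cell] = 0;
    queue[tail++] = exit_cell;
    while (head < tail) {
        int c = queue[head++];
        for (int k = 0; k < 4; k++) {
            int nx = c % W + DX[k];
            int ny = c / W + DY[k];
            if (nx < 0 || nx >= W || ny < 0 || ny >= H) continue;
            int n = ny * W + nx;
            if (dist[n] == UNSET) {
                dist[n] = dist[c] + 1;
                queue[tail++] = n;
            }
        }
    }
}

/* Flow direction for cell c: the neighbour closest to the exit, or c itself
 * when no neighbour is closer (for example at the exit). */
int next_cell(const int dist[W * H], int c)
{
    int best = c;
    for (int k = 0; k < 4; k++) {
        int nx = c % W + DX[k];
        int ny = c / W + DY[k];
        if (nx < 0 || nx >= W || ny < 0 || ny >= H) continue;
        int n = ny * W + nx;
        if (dist[n] >= 0 && (dist[best] < 0 || dist[n] < dist[best])) best = n;
    }
    return best;
}
```

Because the field is computed once per exit, every pedestrian heading for that exit can share it, which matches the global solution described above.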

(32)

(33)

Conclusions and Future Work

• Summary
  • Flexible agent architecture for the GPU suitable for force models
  • Easily extendible
  • Massive performance/cost benefits
• Scope for Future Work
  • Multi GPU
    • Would enable extremely large populations of systems to be simulated
    • For spatial partitioning only partition boundaries would need to be communicated between GPU devices
  • Improve pedestrian models
    • Improved collision detection (more accurate)
    • Long range individual path planning without flow grids
    • Physically accurate animation and movement

(34)

References

• A. Treuille, S. Cooper, and Z. Popović, "Continuum crowds," in SIGGRAPH '06: ACM SIGGRAPH 2006 Papers. New York, NY, USA: ACM, 2006, pp. 1160-1168.

• R. M. D’Souza, M. Lysenko, and K. Rahmani. Sugarscape on steroids: simulating over a million agents at interactive rates. In Proceedings of Agent2007, 2007.

• Samuel Eilenberg. Automata, Languages, and Machines. Academic Press, Inc., Orlando, FL, USA, 1974.

• T. Balanescu, A. J. Cowling, H. Georgescu, M. Gheorghe, M. Holcombe, and C. Vertan. Communicating stream x-machines systems are no more than x-machines. Journal of Universal Computer Science, 5(9):494–507, 1999. http://www.jucs.org/jucs_5_9/communicating_stream_x_machines

• Ian Buck, Tim Foley, Daniel Horn, Jeremy Sugerman, Kayvon Fatahalian, Mike Houston, and Pat Hanrahan. Brook for GPUs: stream computing on graphics hardware. ACM Trans. Graph., 23(3):777–786, 2004.

• Michael D. McCool, Zheng Qin, and Tiberiu S. Popa. Shader metaprogramming. In HWWS ’02: Proceedings of the ACM SIGGRAPH/EUROGRAPHICS conference on Graphics hardware, pages 57–68, Aire-la-Ville, Switzerland, 2002. Eurographics Association.

• Lars Nyland, Mark Harris, and Jan Prins. Fast n-body simulation with CUDA. In Hubert Nguyen, editor, GPU Gems 3, chapter 31. Addison Wesley Professional, August 2007.