Flexible Agent Based Simulation for Pedestrian Modelling on GPU Hardware
Paul Richmond
The Department of Computer Science
University of Sheffield, UK
www.dcs.shef.ac.uk/~paul
• Richmond Paul, Coakley Simon, Romano Daniela, "Cellular Level Agent Based Modelling on the Graphics Processing Unit (with FLAME GPU)", selected for review in the special issue "Parallel and Ubiquitous methods and tools in Systems Biology" of the international journal Briefings in Bioinformatics
• Richmond Paul, Coakley Simon, Romano Daniela (2009), "Cellular Level Agent Based Modelling on the Graphics Processing Unit", Proc. of HiBi09 - High Performance Computational Systems Biology, 14-16 October 2009, Trento, Italy
• Richmond Paul, Coakley Simon, Romano Daniela (2009), "A High Performance Agent Based Modelling Framework on Graphics Card Hardware with CUDA", Proc. of 8th Int. Conf. on Autonomous Agents and Multiagent Systems (AAMAS 2009), May 10–15, 2009, Budapest, Hungary
• Richmond Paul, Romano Daniela (2008), "A High Performance Framework For Agent Based Pedestrian Dynamics On GPU Hardware", Proceedings of EUROSIS ESM 2008 (European Simulation and Modelling),
Introduction and Scope
• Agent Based Modelling (ABM)
  • Emergence of complex natural behaviour from simple rules
  • Individuals are agents with memory
  • Agents update their own memory by considering their neighbours
• Of Pedestrian Behaviour
  • Continuous space mobile agents
  • Discrete time steps
• On the GPU
  • Why? Performance and real time visualisation
  • The aim is flexibility: harness the GPU's power without modellers having to understand GPU programming
Outline
• FLAME and FLAME GPU
  • About FLAME
  • A simple example of a pedestrian model specification
• Implementing FLAME on the GPU
  • Brief overview of GPU technology
  • Mapping agent data and functions to the GPU
  • Agent communication patterns
• Case Study
  • Pedestrian modelling
  • Discrete agents
  • Performance results
What is FLAME?
• What is FLAME (and what FLAME is not)?
  • Flexible Large-scale Agent Modelling Environment
  • XML model specification based on the X-Machine
  • Template system for generating simulation code (single CPU, GRID, GPU)
  • Not a modelling application itself (dynamically generated API)
• Why extend FLAME to the GPU?
  • Complete modelling environment (beyond that of simple swarms)
  • Formal and portable specification technique based on the X-Machine
  • Many existing models to be used for benchmarking
• What is FLAME GPU?
  • Data parallel implementation of FLAME using CUDA
  • Offers real time visualisation
FLAME and Formal Agent Specification
• The X-Machine
  • Formally defined by Eilenberg (Eilenberg 74) as an 8-tuple (Σ, Γ, Q, M, Φ, F, q0, m0), where:
    • Σ and Γ are the finite input and output alphabets respectively;
    • Q is the finite set of states;
    • M is the (possibly) infinite set called memory;
    • Φ is a finite set of partial functions φ that map an input and a memory state to an output and a new memory state, φ: Σ × M → Γ × M;
    • F is the next state partial function that, given a state and a function from the type Φ, provides the next state, F: Q × Φ → Q (F is often described as a state transition diagram);
    • q0 and m0 are the initial state and the initial memory respectively.
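The tuple above can be made concrete with a small C sketch. This is an illustration under stated assumptions, not FLAME's generated code: the state names are borrowed from the pedestrian example in these slides, while the types `xm_memory` and `xm_function` and the helpers `phi_output`, `phi_input`, `next_state` and `xm_step` are hypothetical names introduced here.

```c
#include <assert.h>
#include <stddef.h>

/* Q: the finite set of states (names borrowed from the pedestrian example). */
typedef enum { START_STATE, WAIT_INPUT, NUM_STATES } xm_state;

/* M: the memory; here just a position and a velocity along one axis. */
typedef struct { float x; float velx; } xm_memory;

/* One element of Phi: maps (input, memory) to (output, new memory).
   Returns 1 if the function is defined for this input, 0 otherwise,
   because Phi contains partial functions. */
typedef int (*xm_function)(float input, xm_memory *m, float *output);

/* Hypothetical function: ignores the input and emits the current position. */
static int phi_output(float input, xm_memory *m, float *output) {
    (void)input;
    *output = m->x;
    return 1;
}

/* Hypothetical function: treats the input as a time step and integrates
   the position. Undefined (returns 0) for a non-positive time step. */
static int phi_input(float input, xm_memory *m, float *output) {
    if (input <= 0.0f) return 0;
    m->x += m->velx * input;
    *output = m->x;
    return 1;
}

/* F: Q x Phi -> Q. Returns NUM_STATES when the pair has no transition,
   because F is also partial. */
static xm_state next_state(xm_state q, xm_function f) {
    if (q == START_STATE && f == phi_output) return WAIT_INPUT;
    if (q == WAIT_INPUT && f == phi_input) return START_STATE;
    return NUM_STATES;
}

/* Apply one function: memory and state change only if both F and the
   function itself are defined. Returns 1 on success, 0 otherwise. */
static int xm_step(xm_state *q, xm_memory *m, xm_function f,
                   float input, float *output) {
    assert(q != NULL && m != NULL && output != NULL);
    xm_state target = next_state(*q, f);
    if (target == NUM_STATES) return 0;
    if (!f(input, m, output)) return 0;
    *q = target;
    return 1;
}
```

Starting from `START_STATE`, calling `xm_step` with `phi_input` is rejected because F has no such transition, whereas `phi_output` followed by `phi_input` returns the machine to `START_STATE` with updated memory.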
Agents as Communicating X-Machines
• Each agent is a Communicating Stream X-Machine (Balanescu 99)
  • Stream: input and output are streams of data
  • Communicating: agents input and output messages
• State transitions (functions) describe agent behaviour
  • Update agent memory
  • Output messages (and agents) and process input messages
Specifying an Agent in XMML
<xagent>
  <name>pedestrian</name>
  <memory>
    <variable><type>float</type><name>x</name></variable>
    <variable><type>float</type><name>y</name></variable>
    <variable><type>float</type><name>velx</name></variable>
    <variable><type>float</type><name>vely</name></variable>
  </memory>
  <states>
    <state><name>start_state</name></state>
    <state><name>wait_input</name></state>
    <initialState>start_state</initialState>
  </states>
  <functions>
    <function>
      <name>output_location</name>
      <currentState>start_state</currentState>
      <nextState>wait_input</nextState>
      <outputs>
        <output><messageName>pedestrian_location</messageName></output>
      </outputs>
    </function>
    <function>
      <name>input_locations</name>
      <currentState>wait_input</currentState>
      <nextState>start_state</nextState>
      <inputs>
        <input><messageName>pedestrian_location</messageName></input>
      </inputs>
    </function>
  </functions>
  <type>CONTINUOUS</type>
</xagent>
Specifying Agent Communication in XMML
<message>
  <name>pedestrian_location</name>
  <variables>
    <variable><type>float</type><name>x</name></variable>
    <variable><type>float</type><name>y</name></variable>
    <variable><type>float</type><name>velx</name></variable>
    <variable><type>float</type><name>vely</name></variable>
  </variables>
  <partitioningSpatial>
    <radius>25</radius>
    <xmin>-100.0</xmin>
    <xmax>100.0</xmax>
    <ymin>-100.0</ymin>
    <ymax>100.0</ymax>
    <zmin>0.0</zmin>
    <zmax>25</zmax>
  </partitioningSpatial>
  <!-- Alternative partitioning schemes shown on the slide: -->
  <!-- <partitioningNone/> -->
  <!-- <partitioningDiscrete><radius>0</radius></partitioningDiscrete> -->
</message>
Specifying the Function Order
<layers>
  <layer>
    <layerFunction>
      <name>output_location</name>
    </layerFunction>
  </layer>
  <layer>
    <layerFunction>
      <name>input_locations</name>
    </layerFunction>
  </layer>
</layers>
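[Figure: state diagram of the pedestrian agent. output_location() moves the agent from start_state to wait_input and writes (OUT) agent->x, agent->y, agent->velx and agent->vely to the pedestrian_location message list; input_locations() reads (IN) the message list and returns the agent from wait_input to start_state.]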
Simulation Process and Code Generation
• XMML Model File
  • Syntax validated through XML Schema
  • Base XMML Schema describes the basic structure of an X-Machine agent
  • GPU specific extensions (partitioning) available through an XMMLGPU Schema
  • Object orientated approach to extension of the base model
• C Function Files
• Templates translate an XMML model file into simulation source code
  • Templates are written in XML (using the XSLT Schema) so can be syntax validated
  • XSLT processors implement a W3C specification: any compliant processor can be used to generate code
  • FLAME GPU is therefore not dependent on internal tools or parsers
• XML Input Data
Programming the GPU
• Purpose of the GPU
  • Data parallel device for operation on streams of data
• Programming for General Purpose Use
  • Graphics API technique: not ideal
  • High level alternatives
    • Brook GPU (Buck 04): SIMD stream programming extension for C
    • Sh (McCool 02): C++ language with a compiler for GPU backends
  • Hardware specific
    • Stream SDK: low level ATI specific native instruction set and high level support with Brook+
    • CUDA: NVIDIA programming for GPU using a compiler and a C syntax with extensions
NVIDIA CUDA Programming Model
• GPU is a coprocessor to the CPU (with its own global memory)
• Many parallel threads of execution
• Each thread runs the same kernel program (SPMD)
• Threads are grouped into regular sized blocks
• Threads within a block can communicate through shared memory
  • Simple synchronisation primitive
  • Threads across blocks cannot communicate
[Figure: a grid of blocks (Block 0 to Block 3); each block is a set of threads (Thread 0 to Thread N).]
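The grid/block/thread hierarchy determines which data element each thread works on. The following plain C sketch emulates that mapping sequentially; it is an illustration only, and `blocks_for`, `global_index` and `launch_increment` are hypothetical helper names. In CUDA itself the index would be computed from the built-in variables `blockIdx.x`, `blockDim.x` and `threadIdx.x`.

```c
#include <assert.h>

/* Number of blocks needed so that blocks * block_size >= n
   (the usual ceiling division used when sizing a kernel launch). */
static unsigned int blocks_for(unsigned int n, unsigned int block_size) {
    assert(block_size > 0);
    return (n + block_size - 1) / block_size;
}

/* Global index of a thread: the plain C equivalent of
   blockIdx.x * blockDim.x + threadIdx.x. */
static unsigned int global_index(unsigned int block, unsigned int block_size,
                                 unsigned int thread) {
    assert(thread < block_size);
    return block * block_size + thread;
}

/* Sequential stand-in for a kernel launch: visits every (block, thread)
   pair and applies the same "kernel" (here an increment) to one element,
   skipping the padding threads of the last block. Returns the number of
   elements touched. */
static unsigned int launch_increment(int *data, unsigned int n,
                                     unsigned int block_size) {
    unsigned int touched = 0;
    unsigned int blocks = blocks_for(n, block_size);
    for (unsigned int b = 0; b < blocks; ++b) {
        for (unsigned int t = 0; t < block_size; ++t) {
            unsigned int i = global_index(b, block_size, t);
            if (i >= n) continue;   /* bounds guard for the last block */
            data[i] += 1;
            ++touched;
        }
    }
    return touched;
}
```

With 10 elements and a block size of 4, three blocks are launched and the last two threads of the final block do no work.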
CUDA Hardware Model
[Figure: a GPU device made up of Multiprocessor 1 to Multiprocessor N. Each multiprocessor contains Vector Processors 1 to N with their own registers, an instruction unit, shared memory, a texture cache and a constant cache; all multiprocessors connect to GPU DRAM (device memory).]
• Thread blocks are mapped to Multiprocessors (MPs)
• Multiprocessors are a set of SIMD thread (vector) processors
• Limited shared memory per MP (and hence per block)
• Limited cache and registers per MP
Mapping Agent Functions to the GPU
__FLAME_GPU_FUNC__ int input_function(
    xmachine_memory_pedestrian* xmemory,
    xmachine_message_pedestrian_location_list* location_messages)
{
    /* Get the first message */
    xmachine_message_pedestrian_location* location_message =
        get_first_pedestrian_location_message(location_messages);

    /* Repeat until there are no more messages */
    while (location_message)
    {
        /* Process the message */
        if (distance_check(xmemory, location_message))
        {
            updateSteerVelocity(xmemory, location_message);
        }
        /* Get the next message */
        location_message =
            get_next_pedestrian_location_message(location_message,
                                                 location_messages);
    }

    /* Update any other xmemory variables */
    xmemory->x += xmemory->velx * TIME_STEP;
    ...
    return 0;
}
• Each transition function is wrapped by a GPU kernel
• Each agent is a thread performing the function
• Functions can input and output messages
• Functions can output new agents (agent birth)
• An agent can be removed (agent death) by returning a non-zero value
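The "one agent, one thread" mapping and the non-zero return convention can be emulated on the CPU as follows. This is a minimal sketch, not the generated FLAME GPU wrapper: the cut-down `agent_memory` type, the `move_agent` function, the `run_agent_function` wrapper and the value of `TIME_STEP` are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define TIME_STEP 0.1f   /* illustrative value */

/* Hypothetical, cut-down agent memory (one axis only). */
typedef struct { float x; float velx; } agent_memory;

/* Hypothetical transition function: integrates the position and asks
   for removal (non-zero return) once the agent leaves [0, x_max]. */
static int move_agent(agent_memory *a, float x_max) {
    a->x += a->velx * TIME_STEP;
    return (a->x < 0.0f || a->x > x_max) ? 1 : 0;
}

/* Stand-in for a kernel wrapper: one loop iteration plays the role of
   one GPU thread. Records a death flag per agent and returns how many
   agents asked to be removed. */
static size_t run_agent_function(agent_memory *agents, int *dead,
                                 size_t n, float x_max) {
    assert(agents != NULL && dead != NULL);
    size_t deaths = 0;
    for (size_t i = 0; i < n; ++i) {
        dead[i] = move_agent(&agents[i], x_max);
        deaths += (size_t)dead[i];
    }
    return deaths;
}
```

The per-agent flags produced here are exactly the kind of sparse output that later needs compacting before the next function runs.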
Mapping X-Machine Agent Data to the GPU
• All data (agents and messages) is mapped to global memory on the GPU
• Lists are stored using a Structure of Arrays (SoA) rather than an Array of Structures (AoS)
• Data is read from global memory to registers
• Agents and messages are referenced as C structures within function code

/* Array of Structures (AoS) */
typedef struct agent {
    float x;
    float y;
} xm_memory_agent_list[N];

/* Structure of Arrays (SoA) */
typedef struct agent_list {
    float x[N];
    float y[N];
} xm_memory_agent_list;

[Figure: memory layout of the two alternatives for agents 0 to N. The AoS list interleaves the x and y values of each agent; the SoA list stores all x values contiguously, followed by all y values.]
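The practical difference between the two layouts is the distance in memory between the values read by neighbouring threads. The sketch below computes that stride in plain C; the type names `agent_aos` and `agent_list_soa` and the capacity `N` are illustrative assumptions. With SoA, consecutive agents' `x` values are adjacent, which is the access pattern that allows global memory reads to be coalesced on CUDA hardware.

```c
#include <assert.h>
#include <stddef.h>

#define N 1024   /* illustrative list capacity */

/* Array of Structures: one record per agent. */
typedef struct { float x; float y; } agent_aos;

/* Structure of Arrays: one array per agent variable. */
typedef struct { float x[N]; float y[N]; } agent_list_soa;

/* Byte offset of agent i's x value in an AoS list. */
static size_t aos_x_offset(size_t i) {
    assert(i < N);
    return i * sizeof(agent_aos) + offsetof(agent_aos, x);
}

/* Byte offset of agent i's x value in an SoA list. */
static size_t soa_x_offset(size_t i) {
    assert(i < N);
    return offsetof(agent_list_soa, x) + i * sizeof(float);
}

/* Distance in bytes between the x values read by neighbouring threads. */
static size_t aos_stride(void) { return aos_x_offset(1) - aos_x_offset(0); }
static size_t soa_stride(void) { return soa_x_offset(1) - soa_x_offset(0); }
```

For this two-variable agent the AoS stride is twice the SoA stride, and the gap widens with every variable added to the agent memory.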
Use of Parallel Compaction
• Need to avoid divergence within thread blocks
• Agents are stored and processed in state lists to avoid conditional branching
• Sparse lists still occur as a result of
  • Agent births
  • Function filters
  • Also during message outputs
[Figure: an agent function applied to an agent list (colour represents state) leaves a sparse agent list. The agent birth output flags (1 0 1 1 0 1 1 0) are scanned to give output positions (0 0 1 2 2 3 4 4), which are used to compact the new agent list into the agent list.]
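Compaction of a sparse list is commonly built from a prefix sum followed by a scatter. The following sequential C sketch shows that pattern under the assumption of an exclusive scan; on the GPU both steps would run in parallel, and the function names `exclusive_scan` and `compact` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Exclusive prefix sum: scan[i] = number of set flags before index i.
   Returns the total number of set flags. */
static size_t exclusive_scan(const int *flags, size_t *scan, size_t n) {
    size_t running = 0;
    for (size_t i = 0; i < n; ++i) {
        scan[i] = running;
        running += (size_t)(flags[i] != 0);
    }
    return running;
}

/* Scatter: every flagged element writes itself to its scanned position.
   Each loop iteration is independent, so on the GPU it could be one
   thread. Returns the length of the compacted list. */
static size_t compact(const float *in, const int *flags, float *out,
                      size_t *scan, size_t n) {
    assert(in != out);
    size_t kept = exclusive_scan(flags, scan, n);
    for (size_t i = 0; i < n; ++i) {
        if (flags[i]) out[scan[i]] = in[i];
    }
    return kept;
}
```

For the flags 1 0 1 1 0 1 1 0, the five flagged elements receive output positions 0 to 4 and the compacted list has length 5.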
Brute Force Message Communication
• Tile message lists into shared memory to reduce global memory access (Nyland 07)
• Each thread in the thread block loads a single message into shared memory on the load_first_message function
• Each call to load_next_message then iterates through messages in shared memory
• When a call to load_next_message is made after every message in shared memory (SM) has been returned, a new batch of messages is tiled
• Repeat until all messages have been considered
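The tiling scheme can be emulated sequentially with an iterator that hands out messages from a small buffer and refills it when exhausted. This is a sketch under assumptions: the function names follow the slide, but their signatures, the `message_iterator` type and the `TILE` size are hypothetical, and a plain array stands in for shared memory (on the GPU each thread of the block would load one entry of the tile).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TILE 4   /* stands in for the thread block size */

typedef struct { float x; float y; } message;

/* Hypothetical iterator state; `shared` plays the role of shared memory. */
typedef struct {
    const message *list;   /* full message list ("global memory") */
    size_t count;          /* number of messages in the list */
    size_t tile_start;     /* index of the first message in the tile */
    size_t tile_size;      /* valid entries in the buffer */
    size_t cursor;         /* next buffer entry to hand out */
    size_t tile_loads;     /* number of batches loaded so far */
    message shared[TILE];
} message_iterator;

/* Copy the next batch of up to TILE messages into the buffer. */
static void load_tile(message_iterator *it, size_t start) {
    size_t remaining = it->count - start;
    it->tile_start = start;
    it->tile_size = remaining < TILE ? remaining : TILE;
    memcpy(it->shared, it->list + start, it->tile_size * sizeof(message));
    it->cursor = 0;
    it->tile_loads++;
}

/* Start iterating: loads the first tile and returns its first message,
   or NULL for an empty list. */
static const message *load_first_message(message_iterator *it,
                                         const message *list, size_t count) {
    assert(it != NULL);
    it->list = list;
    it->count = count;
    it->tile_start = it->tile_size = it->cursor = it->tile_loads = 0;
    if (count == 0) return NULL;
    load_tile(it, 0);
    return &it->shared[it->cursor++];
}

/* Return the next message, tiling a new batch when the current one is
   exhausted; NULL once every message has been returned. */
static const message *load_next_message(message_iterator *it) {
    if (it->cursor == it->tile_size) {
        size_t next = it->tile_start + it->tile_size;
        if (next >= it->count) return NULL;
        load_tile(it, next);
    }
    return &it->shared[it->cursor++];
}
```

Iterating over 10 messages with a tile of 4 returns every message once while touching the full list only three times, once per batch.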
Effect of Optimisations for Brute Force Message Communication
• Simple benchmarking model
• Efficient data access methods double performance
• Massive performance gain by using shared memory
[Figure: bar chart of relative speedup over FLAME, axis from 0 to 100.]
Limited Range Message Communication
• For each message output
  • The environment is split into discrete partitions equal in size to the message range (each has a unique identifier)
  • The message list is sorted by the partition each message lies within
  • A boundary matrix records how many messages are within each partition by storing the start and end index of that partition's messages within the sorted list
  • To read all messages within a partition, the boundary matrix gives the range of the message list that needs to be iterated
  • Each agent reads 27 partitions (for a 3D environment), including its own, which guarantees that all messages within the range are processed
  • Roughly 2/3 of the messages read are outside the range, but this is still much better than O(n²)
• Texture cache is used to read messages from global memory
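The sort-and-boundary scheme can be sketched in plain C. To keep the example short it is restricted to 2D, so each query reads 9 partitions rather than 27; the bounds and radius are taken from the pedestrian_location message specification, while the use of a counting sort, the `partitioned_list` type and the function names are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define RADIUS 25.0f
#define ENV_MIN (-100.0f)
#define GRID 8                 /* (100 - (-100)) / 25 cells per axis */
#define CELLS (GRID * GRID)
#define MAX_MESSAGES 256

typedef struct { float x; float y; } message;

typedef struct {
    message sorted[MAX_MESSAGES];  /* messages ordered by partition id */
    size_t start[CELLS + 1];       /* boundary matrix: cell c owns
                                      indices [start[c], start[c + 1]) */
} partitioned_list;

/* Cell index along one axis, clamped to the environment. */
static int cell_coord(float v) {
    int c = (int)((v - ENV_MIN) / RADIUS);
    if (c < 0) c = 0;
    if (c >= GRID) c = GRID - 1;
    return c;
}

static int cell_id(float x, float y) {
    return cell_coord(y) * GRID + cell_coord(x);
}

/* Counting sort by partition id, which also yields the boundary matrix. */
static void build(partitioned_list *p, const message *in, size_t n) {
    size_t fill[CELLS];
    assert(n <= MAX_MESSAGES);
    for (size_t c = 0; c <= CELLS; ++c) p->start[c] = 0;
    for (size_t i = 0; i < n; ++i) p->start[cell_id(in[i].x, in[i].y) + 1]++;
    for (size_t c = 0; c < CELLS; ++c) p->start[c + 1] += p->start[c];
    for (size_t c = 0; c < CELLS; ++c) fill[c] = p->start[c];
    for (size_t i = 0; i < n; ++i)
        p->sorted[fill[cell_id(in[i].x, in[i].y)]++] = in[i];
}

/* Read the 3x3 block of partitions around (x, y) and count the messages
   within RADIUS (a message at the query point itself is included).
   *examined receives how many messages had to be read in total. */
static size_t count_in_range(const partitioned_list *p, float x, float y,
                             size_t *examined) {
    size_t hits = 0;
    int cx = cell_coord(x), cy = cell_coord(y);
    *examined = 0;
    for (int dy = -1; dy <= 1; ++dy) {
        for (int dx = -1; dx <= 1; ++dx) {
            int nx = cx + dx, ny = cy + dy;
            if (nx < 0 || nx >= GRID || ny < 0 || ny >= GRID) continue;
            int c = ny * GRID + nx;
            for (size_t i = p->start[c]; i < p->start[c + 1]; ++i) {
                float ddx = p->sorted[i].x - x, ddy = p->sorted[i].y - y;
                ++*examined;
                if (ddx * ddx + ddy * ddy <= RADIUS * RADIUS) ++hits;
            }
        }
    }
    return hits;
}
```

The gap between `*examined` and the returned count is the overhead the slide refers to: messages that share a neighbouring partition but lie outside the radius are still read and rejected.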
Evaluation of Limited Range Communication
N         32       64       96       128      160      192      224      256
1024      0.94     1.05     0.90     0.86     0.93     0.89     0.95     0.88
4096      1.24     1.25     1.30     1.22     1.39     1.22     1.24     1.25
16384     2.45     2.48     2.62     2.53     2.76     2.81     2.77     2.60
65536     9.09     9.34     9.47     9.23     9.22     9.31     9.45     9.42
262144    33.74    37.99    36.88    37.39    36.61    36.83    37.81    38.12
1048576   136.28   169.73   147.39   172.98   145.21   165.34   151.26   177.06
[Figure: percentage of GPU time (0% to 100%) for N = 1024, 4096, 16384, 65536, 262144 and 1048576.]
Discrete Agent Communication
• Discrete agents reading discrete messages
  • Load messages into shared memory
• Continuous agents reading discrete messages
  • Cannot ensure all messages are loaded into shared memory
[Figure: 2D message output on an 8 × 8 grid (cells 0 to 63), read into shared memory in stages labelled Message Load 1 to Message Load 5 and Message Load 6, 7, 8 and 9.]
Performance of Discrete Message Communication
• Cellular Automaton Model (Game of Life)
• Over 1 million agents
• Shared memory only suitable for very small interaction ranges
[Figure: GPU time (ms), axis from 50 to 300, for four configurations: TEX 64, TEX 256, SMC 64, SM 256.]
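A Game of Life step is a compact example of discrete agents reading discrete messages with an interaction radius of one cell. The sketch below is a sequential C illustration rather than the benchmarked implementation: the 8 × 8 grid size, the wrap-around (toroidal) boundary and the function names are assumptions.

```c
#include <assert.h>

#define W 8
#define H 8

/* Wrap an index onto the grid (toroidal boundary: an assumption here). */
static int wrap(int v, int size) {
    assert(size > 0);
    return ((v % size) + size) % size;
}

/* Number of live cells among the 8 neighbours of (x, y): the discrete
   messages within radius 1 of the cell agent. */
static int live_neighbours(const unsigned char *state, int x, int y) {
    int count = 0;
    for (int dy = -1; dy <= 1; ++dy)
        for (int dx = -1; dx <= 1; ++dx)
            if (dx != 0 || dy != 0)
                count += state[wrap(y + dy, H) * W + wrap(x + dx, W)];
    return count;
}

/* One time step: every cell agent reads its neighbours' messages from
   `in` and writes its new state to `out` (double buffering, so all
   agents see the same previous state). */
static void life_step(const unsigned char *in, unsigned char *out) {
    for (int y = 0; y < H; ++y) {
        for (int x = 0; x < W; ++x) {
            int n = live_neighbours(in, x, y);
            int alive = in[y * W + x];
            out[y * W + x] = (unsigned char)(n == 3 || (alive && n == 2));
        }
    }
}
```

Each cell reads only its 8 neighbours, which is why a small shared memory tile plus a one-cell border is enough for this interaction range.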
A Simple Pedestrian Model
• Inter agent interaction (using spatially partitioned messaging) is based on a hybrid of Reynolds and Social Forces
• Social repulsion force
  • Navigates pedestrians to areas of low concentration
  • Limited forward vision
  • Preference over agents in direct line of sight
  • Scaled depending on distance to neighbour
• Close range interaction force
  • Very short range with no limited vision
  • Acts as collision avoidance
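One possible shape for the social repulsion term is sketched below in C. The slides do not give the force equations, so every formula here is an assumption chosen only to exhibit the three listed properties (forward vision, line-of-sight preference, distance scaling); `vec2`, `social_repulsion` and the `range` parameter are hypothetical names.

```c
#include <assert.h>

typedef struct { float x; float y; } vec2;

/* Hypothetical repulsion on a pedestrian at `pos` moving with velocity
   `vel`, caused by a neighbour at `other`. `range` is the interaction
   radius. Returns a zero vector when the neighbour is ignored. */
static vec2 social_repulsion(vec2 pos, vec2 vel, vec2 other, float range) {
    vec2 force = { 0.0f, 0.0f };
    float dx = other.x - pos.x, dy = other.y - pos.y;
    float dist2 = dx * dx + dy * dy;
    float speed2 = vel.x * vel.x + vel.y * vel.y;
    float ahead = dx * vel.x + dy * vel.y;  /* > 0: neighbour is in front */
    assert(range > 0.0f);
    /* Ignore self and anything beyond the interaction range. */
    if (dist2 <= 0.0f || dist2 > range * range) return force;
    /* Limited forward vision: ignore neighbours behind the pedestrian. */
    if (speed2 <= 0.0f || ahead <= 0.0f) return force;
    /* Squared cosine of the angle between heading and neighbour:
       1 when the neighbour is directly in the line of sight. */
    float line_of_sight = (ahead * ahead) / (dist2 * speed2);
    /* Distance scaling: 1 at contact, 0 at the edge of the range. */
    float falloff = 1.0f - dist2 / (range * range);
    float scale = line_of_sight * falloff / dist2;
    force.x = -dx * scale;   /* push away from the neighbour */
    force.y = -dy * scale;
    return force;
}
```

In an agent function, a force like this would be accumulated over every message returned by the spatially partitioned message iterator before the velocity is updated.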
Visualisation Technique
• Agent data is already on the GPU
• Agent positions are made available to OpenGL by mapping them to a Buffer Object
• We can also store geometry on the GPU to reduce draw calls
• For complex models (lots of vertices)
  • Store a single instance of the geometry in a Vertex Array
  • Draw the array for each agent and set a Vertex Attribute each time to indicate the agent index
  • GLSL vertex shader is used to displace vertices in the same way
• For simple models we can use a single large Vertex Array to hold a geometry instance for each agent
  • Associate each vertex with an agent by using a Vertex Attribute stored in a Vertex Attribute Array
Animation and Level of Detail (LOD)
• Animation - very simple
  • Interpolate between 2 key frames
  • Rotate the model depending on velocity direction
  • Performed in a vertex shader
• LOD - all data is maintained on the GPU so must remain parallel
  • Set view position as a GLOBAL variable
  • Use agent script to calculate viewing distance
  • Save LOD level in an agent variable
  • Use parallel reduction function to count number of agents per level
  • Secondary sort of the agents by LOD level and render in groups
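The LOD steps can be sketched sequentially in C. This is an illustration under assumptions: the three levels match the Detail Level 0 to 2 labels used in the results, but the distance thresholds, the counting sort and the names `lod_level` and `group_by_lod` are hypothetical; on the GPU the count would be a parallel reduction and the grouping a parallel sort.

```c
#include <assert.h>
#include <stddef.h>

#define LOD_LEVELS 3

/* Hypothetical squared-distance thresholds between detail levels. */
static const float lod_limit_sq[LOD_LEVELS - 1] = { 100.0f, 2500.0f };

/* Level 0 is the most detailed. Squared distances are compared so that
   no square root is needed. */
static int lod_level(float ax, float ay, float view_x, float view_y) {
    float dx = ax - view_x, dy = ay - view_y;
    float d2 = dx * dx + dy * dy;
    int level = 0;
    while (level < LOD_LEVELS - 1 && d2 > lod_limit_sq[level]) ++level;
    return level;
}

/* Count the agents per level, then write agent indices to `order`
   grouped by level so that each group can be rendered together. */
static void group_by_lod(const int *levels, size_t n,
                         size_t counts[LOD_LEVELS], size_t *order) {
    size_t offset[LOD_LEVELS];
    for (int l = 0; l < LOD_LEVELS; ++l) counts[l] = 0;
    for (size_t i = 0; i < n; ++i) {
        assert(levels[i] >= 0 && levels[i] < LOD_LEVELS);
        counts[levels[i]]++;
    }
    offset[0] = 0;
    for (int l = 1; l < LOD_LEVELS; ++l)
        offset[l] = offset[l - 1] + counts[l - 1];
    for (size_t i = 0; i < n; ++i) order[offset[levels[i]]++] = i;
}
```

The per-level counts give the size of each render group, and `order` lists the agents of level 0 first, then level 1, then level 2.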
Performance Results
Observables
• Performance is dependent on communication radius
  • Larger communication radius = fewer partitions = more agents considered per update
• LOD technique has a cost
  • Don’t use for small populations
• Very large population sizes possible in real time
[Figure: frames per second (FPS), axis from 50 to 500, for five rendering modes: Billboards, Detail Level 0, Detail Level 1, Detail Level 2, Dynamic.]
Environment Collision Avoidance
• Discrete grid of agents to encode the environment
  • Static discrete agents
  • Repulsive forces direct agents away from walls
  • Automatically generated in advance
• Continuous pedestrian agents read discrete messages
  • Apply a collision force
Long Range Navigation
• Many agents following similar paths so a global solution is used
• Fluid flow route for each path through the environment
• Calculated offline in advance by backtracking from exit point
• Smooth movement around obstacles
• Discrete agents also responsible for pedestrian birth allocation
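One way to precompute such a route by working backwards from the exit is a breadth-first search over the discrete grid, with each cell then pointing at its neighbour nearest the exit. The C sketch below uses that swapped-in technique as an assumption; the slides do not state which algorithm was used, and the grid size and function names are hypothetical.

```c
#include <assert.h>

#define W 5
#define H 5
#define UNREACHED (-1)

static const int step_x[4] = { 1, -1, 0, 0 };
static const int step_y[4] = { 0, 0, 1, -1 };

/* Breadth-first search outwards from the exit: dist[c] becomes the
   number of steps from cell c to the exit. Wall cells and cells cut
   off from the exit stay UNREACHED. */
static void distance_from_exit(const unsigned char *wall,
                               int exit_x, int exit_y, int *dist) {
    int queue[W * H];   /* every cell is enqueued at most once */
    int head = 0, tail = 0;
    assert(!wall[exit_y * W + exit_x]);
    for (int c = 0; c < W * H; ++c) dist[c] = UNREACHED;
    dist[exit_y * W + exit_x] = 0;
    queue[tail++] = exit_y * W + exit_x;
    while (head < tail) {
        int c = queue[head++];
        for (int k = 0; k < 4; ++k) {
            int nx = c % W + step_x[k], ny = c / W + step_y[k];
            if (nx < 0 || nx >= W || ny < 0 || ny >= H) continue;
            int n = ny * W + nx;
            if (wall[n] || dist[n] != UNREACHED) continue;
            dist[n] = dist[c] + 1;
            queue[tail++] = n;
        }
    }
}

/* Direction stored in cell (x, y): a unit step towards the neighbour
   that is closest to the exit. Writes (0, 0) at the exit itself and
   for unreachable cells. */
static void flow_direction(const int *dist, int x, int y,
                           int *dir_x, int *dir_y) {
    int best = dist[y * W + x];
    *dir_x = 0;
    *dir_y = 0;
    if (best == UNREACHED) return;
    for (int k = 0; k < 4; ++k) {
        int nx = x + step_x[k], ny = y + step_y[k];
        if (nx < 0 || nx >= W || ny < 0 || ny >= H) continue;
        int d = dist[ny * W + nx];
        if (d != UNREACHED && d < best) {
            best = d;
            *dir_x = step_x[k];
            *dir_y = step_y[k];
        }
    }
}
```

Because the field is computed once offline, each pedestrian only needs to read the direction stored in its current cell at run time, regardless of how many agents share the route.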
Conclusions and Future Work
• Summary
  • Flexible agent architecture for the GPU suitable for force models
  • Easily extendible
  • Massive performance/cost benefits
• Scope for Future Work
  • Multi GPU
    • Would enable extremely large populations of systems to be simulated
    • For spatial partitioning, only partition boundaries would need to be communicated between GPU devices
  • Improve pedestrian models
    • Improved collision detection (more accurate)
    • Long range individual path planning without flow grids
    • Physically accurate animation and movement
References
• A. Treuille, S. Cooper, and Z. Popović. Continuum crowds. In SIGGRAPH '06: ACM SIGGRAPH 2006 Papers, pages 1160–1168, New York, NY, USA, 2006. ACM.
• R. M. D’Souza, M. Lysenko, and K. Rahmani. Sugarscape on steroids: simulating over a million agents at interactive rates. In Proceedings of Agent2007, 2007.
• Samuel Eilenberg. Automata, Languages, and Machines. Academic Press, Inc., Orlando, FL, USA, 1974.
• T. Balanescu, A. J. Cowling, H. Georgescu, M. Gheorghe, M. Holcombe, and C. Vertan. Communicating stream X-machines systems are no more than X-machines. Journal of Universal Computer Science, 5(9):494–507, 1999. http://www.jucs.org/jucs_5_9/communicating_stream_x_machines
• Ian Buck, Tim Foley, Daniel Horn, Jeremy Sugerman, Kayvon Fatahalian, Mike Houston, and Pat Hanrahan. Brook for GPUs: stream computing on graphics hardware. ACM Trans. Graph., 23(3):777–786, 2004.
• Michael D. McCool, Zheng Qin, and Tiberiu S. Popa. Shader metaprogramming. In HWWS ’02: Proceedings of the ACM SIGGRAPH/EUROGRAPHICS conference on Graphics hardware, pages 57–68, Aire-la-Ville, Switzerland, 2002. Eurographics Association.
• Lars Nyland, Mark Harris, and Jan Prins. Fast N-body simulation with CUDA. In Hubert Nguyen, editor, GPU Gems 3, chapter 31. Addison Wesley Professional, August 2007.