Behavior Evaluation - Implementation and Results

2.5 Implementation and Results

2.5.1 Behavior Evaluation

Artificial scenarios were set up to evaluate local navigation and some emergent behaviors of ClearPath. The following emergent behaviors were observed from ClearPath in the benchmarks: lane-formation, vortices, slow down in congestion, respond to collisions, avoiding each other, swirl to resolve congestion, jamming at exits, and arching at narrow passages.

Circle-1K: 1,000 agents start arranged uniformly around a circle and try to move

directly through the circle to the their antipodal position on the other side (see Fig. 2.8). The scenario becomes very crowded when all the agents meet in the middle and swirling behavior occurs.

Figure 2.8: Dense Circle Scenario: 1,000agents are arranged uniformly around a circle

and move towards their antipodal position. This simulation runs at over 1,000 FPS on an Intel 3.14 GHz quad core, and over 8,000 FPS on 32 Larrabee cores.

4-Streams: 2,000agents are organized as four streams that walk along the diagonals

of the square. This is similar to the benchmark in Continuum Crowds (Treuille et al., 2006), though ClearPath results in different behaviors, including smooth motion, lane formation and some swirls.

Back&Forth: Between10and100agents are asked to move back and forth along

a line. This is test is run side-by-side with OpenSteer to compare the number collision with unmodified OpenSteer to the number of collisions with OpenSteer combined with ClearPath. With both techniques agents generally smoothly avoid each other. However, when the ClearPath local collision avoidance algorithm is added, there are no longer any

penetrations between the agents. With OpenSteer alone there are a significant number of collisions.

2.5.2 Complex Scenarios

Larger, more realistic scenarios were used to evaluate the performance of the overall system. All of these demos have a large, complex set of obstacles, with roadmaps to guide agents to their goal.

Building Evacuation: Agents were given initial positions of different rooms in an office building. The scene has 218obstacles and the roadmap consists of429 nodes and 7.2K edges. The agents move towards the goal positions corresponding to the exit signs.

The hallway quickly fills up with the agents and there is congestion at the exits, which allow for only1 2agents to leave at a time. Three versions of this scenario are used with500, 1K and5K agents and they are denoted as Evac-500, Evac-1K, and Evac-5K, respectively.

Stadium Scene: The scenario was a simulation of25K agents as they exited from

their seats out of a stadium. The scene has approximately1.4K obstacles and the roadmap

consists of almost2K nodes and3.2K edges. The agents moves towards the corridors and

produce congestion and highly-packed scenarios. This benchmark is denoted as Stad-25K. City Simulation: This scenarios uses a city model with buildings and streets with

1.5K obstacles. The roadmap has480nodes and916edges. Different agents are simulated

as they walk around the city streets. The agents move at different speeds and overtake each other and avoid collisions with oncoming agents. Three versions with10K,100K and250K

agents were performed and denote as City-10K, City-100Kand City-250K, respectively.

2.5.3 Parallel Implementation

ClearPath can be easily parallelized across multiple agents since the computation performed by each agent is local and independent of the other agents. Specifically, performance

on two kind of parallel processors and systems with different characteristics were tested. They are:

1. Multi-core Xeon processors: The algorithms were tested on a PC workstation with a Intel quad-core Xeon processor (X5460) running at 3.16 GHz with32KB L1 and 12MB L2 cache. Each core runs a single thread. There is no support for gather/scatter

operations.

2. Many-core Larrabee simulator: Larrabee (Seiler et al., 2008) is an x86-based many-core processor architecture based on small in-order cores that uniquely combines full programmability of today’s general-purpose CPU architecture with compute-throughput and memory bandwidth capabilities similar to modern GPU architectures. Larrabee processor core consists of a vector unit (VPU) together with a scalar unit augmented with 4-way multi-threading, and the VPU supports 16 32-bit float or integer operations per clock. Each core has 32KB L1 data cache, and 256KB L2 cache. Hardware support for masking, gather/scatter, and pack instructions allows a substantial amount of fine-grained parallelism to be exploited by P-ClearPath.

2.5.3.1 Data-Parallelism

Figure 2.7 shows the improvement due to SSE instructions for P-ClearPath on Xeon processors. There is only about25 50% speedup with SSE instructions as Xeon processors

do not support scatter and gather instructions, have limited support for incoherent branches. Figure 2.9 shows the performance of P-ClearPath on SSE and Larrabee architectures. For Larrabee, the performance is measured in terms of Larrabee units. A Larrabee unit is defined to be one Larrabee core running at 1 GHz. The reported performance data is derived from performance simulation of complete multithreaded execution that takes into account various execution stalls arising from instruction dependency, resource contention, and cache/memory misses.

0 20 40 60 80 100 120 140 City-100K City-250K P-ClearPath 32unit 48unit 64unit 0 20 40 60 80 100 120 140 Evac-5K Stad-25K P-ClearPath 1unit 4unit 8unit 243 FP S

Figure 2.9: Performance (FPS) of P-ClearPath on SSE (left most column) and Larrabee (with different units) architectures. Even on complex scenes, P-ClearPath achieves30 60

FPS on many-core processors.

2.5.3.2 Thread-level Parallelism (TLP)

One of the issues that affects scaling is the load balancing amongst different threads. Some of the agents in a dense scenarios may perform more computations than the ones in sparse regions, as they consider more neighbors within the optimization computation. Hence, a static partitioning of agents amongst the threads may suffer from severe load balance problems, especially in simulations with few number of agents for large number of threads. The main reason is that the agents assigned to some specific thread(s) may finish their computation early, while the remaining ones are still performing the collision avoidance and other computations.

ClearPath reduces the load imbalance by using a scheme based on dynamic partitioning of agents. Specifically, Task Queueing (Mohr et al., 1990) is used to decompose the execution into parallel tasks consisting of a small number of agents. This allows the runtime system to schedule the tasks on different cores. In practice, the scaling improves by more than 2X as compared to static partitioning for16threads. This speedup occurs in small game like

scenarios with tens or hundreds of agents. By exploiting TLP, P-ClearPath achieves around

Additionally, the combined benefits of TLP and data-level parallelism were evaluated. Running 4 threads per core, the working set fits in L1 data cache and the implementation is not sensitive to memory latency issues. 16-wide SIMD provides around 4X scaling over scalar code. Hardware gather/scatter provides 50% additional speedup, resulting in 4.5X scaling. The pack instructions provide additional 2X scaling to achieve the overall 6.4X scaling.

2.5.3.3 SIMD Scalability

ClearPath Steps % Time Breakdown SIMD Scaling

Step 1 14% 7.9X Step 2 39% 5.6X Step 3 21% 5.9X Step 4 13% 9.3X Step 5 13% 5.6X Total 100% 6.4X

Table 2.1: Average time (%) spent in various steps (Section 3.3) of Clearpath and the corresponding SIMD scalability on Larrabee simulator of execution cycles as compared to the scalar version.

Table 2.1 shows the SIMD scaling achieved by each of the first five steps of ClearPath described in Section 3.3. The table provides a breakdown of the time spent in that part and the overall speedup. By performing repeated packing and memory gather operations during these steps to negative performance effects of path divergence and early termination of the agent cone intersection and inside/outside computation can be reduced. This improves the SIMD utilization but the performance improvement is limited by the overhead of theseoperations.

2.6 Analysis and Comparisons

In document 4799.pdf (Page 58-63)