3.3 Optimization of the Broad Phase
The role of the broad phase in rigid-body simulation is to find pairs of bodies that could possibly collide. Because the broad phase favors speed over accuracy, each rigid body is represented by an axis-aligned bounding box (AABB) that encloses its real shape along the x-, y-, and z-axes. If the AABBs of two rigid bodies overlap, the broad phase outputs them as a pair with a possibility of colliding. The collision-detection stages that follow the broad phase then need to process only these candidate pairs, using the real shapes of the rigid bodies that belong to each pair.
As a result, most useless operations are avoided.
3.3.1 “Sweep and Prune” Algorithm
The mechanism of the “sweep and prune” (SAP) algorithm is very simple yet effective [Ericson 05]. In this algorithm, the minimum and maximum values of each AABB are stored in nodes of lists along the x-, y-, and z-axes, and the nodes in each list are kept sorted along that axis. To find overlaps, select one axis and look for nodes that overlap along it. When an overlap is found, also check the overlap of the same nodes along the other axes. If the nodes of two AABBs overlap on all axes, the two AABBs are output as an overlapped pair. If the state of a rigid body changes (it is moved, added, or removed), the algorithm only has to relink the pointers of the affected nodes, so the update completes quickly when there are not many moving objects in the world.
Figure 3.4. The mechanism of double buffering.
52 3. Broad Phase and Constraint Optimization for PlayStation R 3
Figure 3.5. Sweep and prune.
Figure 3.6. Structure of AABB node.
3.3.2 An Optimized SAP Algorithm
First, remove the data structure connected with many pointers, because that structure is hard for SPUs to deal with. As in Figure 3.5, linked lists are created along the x-, y-, and z-axes, and each node in these lists is linked with its previous node. Traversing such lists issues too many DMA operations and results in low performance. Therefore, to reduce pointer traversal, it is better to use a structure that holds all necessary information without pointers.
An Optimized Structure of AABB.
Figure 3.6 shows the structure of the node that holds all AABB values in one structure. For efficient use of the local storage, we store each AABB value as a 16-bit integer instead of a 32-bit float, and the whole structure is 128 bits long so that SPUs can handle it at peak performance.
Using a sorted array as a list.
To parallelize computation and DMA transfer in double-buffering mode, a sorted array is better than a linked list. In a linked list, where all nodes are connected by pointers, the current node is always needed to access the next node. Because of this dependency, we can’t use double buffering.
However, with a sorted array we can calculate the address of each node from the base address of the array and the index of the node. Thus, we can calculate the address of a node before it is needed. We prepare two buffers, assign one to the DMA transfer of data and the other to computation, and then execute computation and DMA transfer in parallel to hide the latency of the DMA transfer.
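The following is a minimal sketch of this idea in portable C++, not the chapter's SPU code. The node layout (`AabbNode`, six 16-bit bounds plus a 32-bit body ID) is only one assumed way to fill 128 bits, and `fetchChunk` is a hypothetical stand-in that copies synchronously with `memcpy`; on an SPU it would start an asynchronous DMA transfer that is waited on before the buffer is used.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Assumed 128-bit node: six 16-bit quantized bounds plus a 32-bit body ID.
struct AabbNode {
    std::uint16_t min[3];
    std::uint16_t max[3];
    std::uint32_t bodyId;
};
static_assert(sizeof(AabbNode) == 16, "node is expected to be 128 bits");

// Hypothetical stand-in for a DMA "get" from main memory to local storage.
inline void fetchChunk(AabbNode* local, const AabbNode* mainMemory,
                       std::size_t first, std::size_t count) {
    // Address of node `first` = base address + first * sizeof(AabbNode).
    std::memcpy(local, mainMemory + first, count * sizeof(AabbNode));
}

// Walks a sorted array chunk by chunk using two local buffers and returns
// the largest max[axis] seen (a placeholder for the real computation).
inline std::uint16_t scanDoubleBuffered(const std::vector<AabbNode>& nodes,
                                        int axis, std::size_t chunkSize) {
    assert(chunkSize > 0 && axis >= 0 && axis < 3);
    std::vector<AabbNode> buffers[2] = {std::vector<AabbNode>(chunkSize),
                                        std::vector<AabbNode>(chunkSize)};
    const std::size_t total = nodes.size();
    std::uint16_t largest = 0;
    int current = 0;
    if (total > 0) {
        fetchChunk(buffers[current].data(), nodes.data(), 0,
                   std::min(chunkSize, total));
    }
    for (std::size_t first = 0; first < total; first += chunkSize) {
        const std::size_t count = std::min(chunkSize, total - first);
        const std::size_t next = first + chunkSize;
        // The next address depends only on the index, so the next transfer
        // can be issued before the current chunk is processed.
        if (next < total) {
            fetchChunk(buffers[1 - current].data(), nodes.data(), next,
                       std::min(chunkSize, total - next));
        }
        for (std::size_t i = 0; i < count; ++i) {
            largest = std::max(largest, buffers[current][i].max[axis]);
        }
        current = 1 - current;  // swap the roles of the two buffers
    }
    return largest;
}
```

With a linked list the second `fetchChunk` call could not be issued early, because the address of the next node is known only after the current node has arrived.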
The following two operations are needed to replace the operations of a linked list (see Figures 3.7 and 3.8).
• Insert. Add new nodes to the end of the array, then sort the whole array.
• Remove. Set the sort keys of removed nodes to the maximum number as a sentinel value, then sort the whole array.
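The two operations can be sketched as follows. This is an illustration under assumptions: `SortedNode`, `kRemovedKey`, `insertNodes`, and `removeBody` are hypothetical names, the key is taken to be a 16-bit value whose maximum (0xFFFF) is reserved as the sentinel, and `std::stable_sort` stands in for the parallel sort described later. Trimming the sentinel nodes off the end after sorting is an added step that the text does not spell out.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical node: a sort key (AABB minimum on one axis) and a body ID.
struct SortedNode {
    std::uint16_t key;
    std::uint32_t bodyId;
};

// Sentinel: the largest 16-bit key, so removed nodes sort to the end.
// Valid keys are assumed to be smaller than this value.
const std::uint16_t kRemovedKey = 0xFFFF;

inline void sortNodes(std::vector<SortedNode>& nodes) {
    std::stable_sort(nodes.begin(), nodes.end(),
                     [](const SortedNode& a, const SortedNode& b) {
                         return a.key < b.key;
                     });
}

// Insert: append the new nodes at the end, then sort the whole array.
inline void insertNodes(std::vector<SortedNode>& nodes,
                        const std::vector<SortedNode>& added) {
    nodes.insert(nodes.end(), added.begin(), added.end());
    sortNodes(nodes);
}

// Remove: mark the nodes with the sentinel key, sort, then drop the tail.
inline void removeBody(std::vector<SortedNode>& nodes, std::uint32_t bodyId) {
    for (SortedNode& node : nodes) {
        if (node.bodyId == bodyId) {
            node.key = kRemovedKey;
        }
    }
    sortNodes(nodes);
    while (!nodes.empty() && nodes.back().key == kRemovedKey) {
        nodes.pop_back();
    }
}
```

Both operations cost a full sort, but neither follows a pointer, so the array stays suitable for chunked DMA transfer.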
Parallelize the sort algorithm.
For faster operation, the sort algorithm also needs to run in parallel on multiple SPUs (see Figure 3.9). First, we load as much data from main memory as fits into the local storage and sort it on each SPU. The implementation of this sort is straightforward because it is executed on a single SPU and all the necessary data is in its local storage. In practice, we use a combination of bitonic sort and merge sort for sorting on a single SPU.
The next step is to merge two sorted data sets using two SPUs, simply taking data from both sides while comparing the values. Repeat these procedures a few times and all the data will be sorted correctly, as shown in Figure 3.10.
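The text does not spell out how the two SPUs share the merge, so the sketch below shows one plausible reading rather than the chapter's implementation. One worker merges from the front and keeps the smallest keys, the other merges from the back and keeps the largest keys, and neither depends on the other's output. The function names and the use of bare 16-bit keys are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

typedef std::vector<std::uint16_t> Keys;

// Worker 1: merges from the front and returns the a.size() smallest keys
// of the two sorted inputs, in ascending order.
inline Keys mergeLowerHalf(const Keys& a, const Keys& b) {
    Keys out;
    out.reserve(a.size());
    std::size_t i = 0, j = 0;
    while (out.size() < a.size()) {
        if (j >= b.size() || (i < a.size() && a[i] <= b[j])) {
            out.push_back(a[i++]);
        } else {
            out.push_back(b[j++]);
        }
    }
    return out;
}

// Worker 2: merges from the back and returns the b.size() largest keys
// of the two sorted inputs, in ascending order.
inline Keys mergeUpperHalf(const Keys& a, const Keys& b) {
    Keys out(b.size());
    std::size_t i = a.size(), j = b.size();  // one past the next candidates
    for (std::size_t k = b.size(); k > 0; --k) {
        if (j == 0 || (i > 0 && a[i - 1] > b[j - 1])) {
            out[k - 1] = a[--i];
        } else {
            out[k - 1] = b[--j];
        }
    }
    return out;
}
```

After both calls, every key in the lower result is less than or equal to every key in the upper result, so repeating this exchange between neighboring data sets moves the whole array toward sorted order.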
How to find overlapped pairs using sorted AABB arrays.
First, prepare AABB arrays that are sorted along the x-, y-, and z-axes, taking the minimum value as the sort key. Then, find the axis along which the rigid bodies are most widely spread. We can calculate the variance of the rigid bodies by checking how far each rigid body’s position is from the average position (see Listing 3.1). The axis with the largest variance is selected. Then, only the AABB array along the selected axis is used to find overlapping pairs.
Figure 3.7. Insert operation.
Figure 3.8. Remove operation.
Figure 3.9. Parallel sort algorithm.
Figure 3.10. Merging data sets from two SPUs.
// Calculate average
Vector average(0.0f, 0.0f, 0.0f);
for (int i = 0; i < numRigidBodies; i++) {
    average.x += position[i].x;
    average.y += position[i].y;
    average.z += position[i].z;
}
average.x /= (float)numRigidBodies;
average.y /= (float)numRigidBodies;
average.z /= (float)numRigidBodies;

// Calculate variance (sum of squared distances from the average;
// the common 1/N factor is omitted because it does not change
// which axis is largest)
Vector totaldistance(0.0f, 0.0f, 0.0f);
for (int i = 0; i < numRigidBodies; i++) {
    Vector distance;
    distance.x = position[i].x - average.x;
    distance.y = position[i].y - average.y;
    distance.z = position[i].z - average.z;
    totaldistance.x += distance.x * distance.x;
    totaldistance.y += distance.y * distance.y;
    totaldistance.z += distance.z * distance.z;
}

// Select the axis along which all rigid bodies are
// spread most widely.
if (totaldistance.x > totaldistance.y) {
    if (totaldistance.x > totaldistance.z) {
        // Select X axis
    }
    else {
        // Select Z axis
    }
}
else {
    if (totaldistance.y > totaldistance.z) {
        // Select Y axis
    }
    else {
        // Select Z axis
    }
}
Listing 3.1. Calculate variance of rigid bodies to find the axis.
For example, Figure 3.11 shows a scene with seven rigid bodies. The x-axis is selected because the rigid bodies are spread mostly along this axis. The AABBs are sorted in the order A, B, F, C, E, D, G, so A is checked against the other AABBs in that order. Because the AABBs of A and F are already separated along the axis, no later AABB (C, E, D, or G) can overlap A, and we don’t need to check them.
Figure 3.11. AABB array on the selected axis.
Parallelize the algorithm for finding overlapped pairs.
To find the intersections of AABBs, continue to traverse the AABB array until the maximum value of the current AABB becomes smaller than the minimum value of the next AABB. This process is implemented as a simple double loop, shown in Listing 3.2. Figure 3.12 illustrates this algorithm using an extreme case. In practice, the pairs drawn with a dotted line aren’t processed, because checking the end condition removes any unnecessary processing.
for (int i = 0; i < total; i++) {
    for (int j = i + 1; j < total; j++) {
        // axis: 0, 1, 2 = X, Y, Z
        if (AABBs[i].max[axis] < AABBs[j].min[axis]) {
            break;
        }
        if (checkOverlap(AABBs[i], AABBs[j])) {
            submitOverlappedPair(i, j);
        }
    }
}
Listing 3.2. The example code of finding overlapped pairs.
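Listing 3.2 leaves `checkOverlap` and `submitOverlappedPair` undefined. The sketch below fills them in under stated assumptions so the loop can be compiled and tried on its own: the `Aabb` struct with 16-bit bounds is an assumed layout, `checkOverlap` is written as the usual per-axis interval test, and the pairs are collected into a `std::vector` instead of being submitted through a callback.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Assumed AABB representation with 16-bit quantized bounds.
struct Aabb {
    std::uint16_t min[3];
    std::uint16_t max[3];
};

// Two AABBs overlap only if their intervals overlap on all three axes.
inline bool checkOverlap(const Aabb& a, const Aabb& b) {
    for (int k = 0; k < 3; ++k) {
        if (a.max[k] < b.min[k] || b.max[k] < a.min[k]) {
            return false;
        }
    }
    return true;
}

// Expects aabbs to be sorted in ascending order of min[axis].
inline std::vector<std::pair<int, int> > findOverlappedPairs(
        const std::vector<Aabb>& aabbs, int axis) {
    assert(axis >= 0 && axis < 3);
    std::vector<std::pair<int, int> > pairs;
    const int total = static_cast<int>(aabbs.size());
    for (int i = 0; i < total; ++i) {
        for (int j = i + 1; j < total; ++j) {
            // Every later AABB starts even further along the axis,
            // so none of them can overlap AABB i.
            if (aabbs[i].max[axis] < aabbs[j].min[axis]) {
                break;
            }
            if (checkOverlap(aabbs[i], aabbs[j])) {
                pairs.push_back(std::make_pair(i, j));
            }
        }
    }
    return pairs;
}
```

The early `break` is valid only because the array is sorted by the minimum value on the selected axis; on an unsorted array it would skip real overlaps.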
The next step is to parallelize this process. Because parallelization is more efficient when the work is divided into large processing elements, we divide the outer loop instead of the inner loop. Since there are no data dependencies between AABBs, we can divide the total number of AABBs by the number of parallel batches, as in Figure 3.12. As soon as an SPU finds a batch that hasn’t been processed, it starts finding overlapping pairs within this batch.
We can’t know the execution time of each batch beforehand, because the amount of data included in each batch isn’t uniform. It is necessary to choose the number of batches carefully so that no SPU sits idle.
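The batch scheme can be sketched on a desktop machine as follows. This is an illustration under assumptions, not the PlayStation 3 implementation: `std::thread` workers stand in for SPUs, a shared `std::atomic` counter plays the role of "find a batch that hasn’t been processed", and `Box`, `boxesOverlap`, and `findPairsInBatches` are hypothetical names.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstdint>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

struct Box {  // assumed AABB layout with 16-bit quantized bounds
    std::uint16_t min[3];
    std::uint16_t max[3];
};

inline bool boxesOverlap(const Box& a, const Box& b) {
    for (int k = 0; k < 3; ++k) {
        if (a.max[k] < b.min[k] || b.max[k] < a.min[k]) return false;
    }
    return true;
}

// Expects boxes to be sorted by min[axis]. Each batch is a contiguous
// range of outer-loop indices; workers claim batches until none remain.
inline std::vector<std::pair<int, int> > findPairsInBatches(
        const std::vector<Box>& boxes, int axis, int numWorkers, int batchSize) {
    assert(numWorkers > 0 && batchSize > 0 && axis >= 0 && axis < 3);
    const int total = static_cast<int>(boxes.size());
    const int numBatches = (total + batchSize - 1) / batchSize;
    std::atomic<int> nextBatch(0);
    std::mutex resultMutex;
    std::vector<std::pair<int, int> > result;

    auto worker = [&]() {
        std::vector<std::pair<int, int> > local;
        for (;;) {
            const int batch = nextBatch.fetch_add(1);  // claim the next batch
            if (batch >= numBatches) break;
            const int begin = batch * batchSize;
            const int end = std::min(begin + batchSize, total);
            for (int i = begin; i < end; ++i) {
                for (int j = i + 1; j < total; ++j) {
                    if (boxes[i].max[axis] < boxes[j].min[axis]) break;
                    if (boxesOverlap(boxes[i], boxes[j])) {
                        local.push_back(std::make_pair(i, j));
                    }
                }
            }
        }
        std::lock_guard<std::mutex> lock(resultMutex);
        result.insert(result.end(), local.begin(), local.end());
    };

    std::vector<std::thread> workers;
    for (int w = 0; w < numWorkers; ++w) workers.emplace_back(worker);
    for (std::thread& t : workers) t.join();
    std::sort(result.begin(), result.end());  // make the order deterministic
    return result;
}
```

Smaller batches even out the load because a worker that finishes early simply claims another batch, but each claim has a cost, which is the trade-off behind choosing the number of batches carefully.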