• No results found

Kernel Performance in a Fluid Simulation

In this section we bring together the results and discussions from previous sections to look at the performance of each kernel in the context of a full fluid simulation. We present the percentage of total execution time spent on each kernel and provide reasons why the percentages for each kernel are to be expected.

For this discussion we use a 192×192×192 modified “dam break” simulation as our test case. The initial conditions are shown in Figure6.6and a selection of stills from the simulation are shown in Figure6.7. Equal dimensions ensure that no streaming direction has better performance

Figure 6.6: The scene used in this analysis. The green circle on the right represents the fluid source, which has a radius ofqw3.

due to the data dimensions. The scene is bounded by obstacle cells that contain the fluid. There is a wall of fluid against the left yz-face of the bounding box. The simulation models the fluid as it flows down the left yz-face and begins to slosh from one side of the scene to the other. In order to include the performance of the source kernel, we include a circular fluid source on the right yz-face.

Using a single scene for this discussion is sufficient because the difference in performance profiles between different simulation scenes is not significant enough to affect this analysis. We will therefore leave the discussion of the effect of different scenes to Chapter7.

Figure 6.8 and Table 6.4 show the percentage of time spent executing each of the kernels throughout the simulation. The most counter-intuitive result is that the core LBM steps (stream- ing and collide), which perform the bulk of the work for the simulation, only require 47.3% of execution time, while a large portion of the time (38.6%) is spent executing kernels that manage the fluid interface (the interface adjustment kernel and the interface movement kernels). In the discussion below, we explain this result by showing that the streaming and collide kernels are able to make efficient use of the resources available to them, whereas the rest of the kernels do not make efficient use of the available memory bandwidth because of the limitations discussed in Section6.2.

All of the LBM kernels are limited by memory bandwidth. This means that the efficiency with which individual warps access memory has a significant impact on the overall kernel per- formance. Warps that have only a few active threads will take almost the same amount of time as warps with all active threads to perform simultaneous aligned global memory operations. Warps with only a few active threads are thus wasting the resources associated with each idle thread. For the adjustment and interface movement kernels this warp inefficiency is unavoidable, but

Figure 6.7: Visuals from our test case simulation, rendered using LuxRender [48], with lattice dimensions of 192×192×192. The large frame is from the 1120th time step. The smaller frames are from time step 0 to 2000 with 181 time steps between each frame.

for the streaming and collide kernels we need to ensure that their performance within the fluid simulation is in line with expectations based on the isolated analyses.

In Section 6.1.2 we showed that the streaming made efficient use of the available memory bandwidth. Profiling confirms that this is still true within a fluid simulation. For each global memory access performed, 32 threads issued global memory instructions. This implies that all

Streaming 35.9% Streaming Adjustments 30.0% Collide 11.4% Interface Movement 22.7%

Figure 6.8: A brief overview of how much time is spent performing operations during each of the high- level stages stages of our GPU implementation. See Table6.4for per-kernel information.

Stage Kernel % of execution time

Streaming 35.9% y-axis streaming 9.9% z-axis streaming 9.7% xy-diagonal streaming 6.6% xz-diagonal streaming 6.5% x-axis streaming 3.2% Streaming Adjustments 30.0% Interface Adjustments 15.9% Obstacle Adjustments 12.0% Source Adjustments 2.1% Collide 11.4% Interface Movement 22.7%

Update Fluid Interface 9.2% Convert Full Interface Cells 8.2% Convert Empty Interface Cells 3.0% Finalise Fluid Interface 2.3%

Table 6.4: This table shows the percentage of execution time for each kernel during a fluid simulation for a 192×192×192 lattice with 2000 time steps.

memory operations were fully coalesced. This is an expected result because there is little differ- ence between the isolated test environment and a fluid simulation from the perspective of our streaming algorithms. The streaming kernels’ high memory throughput is the main reason why the streaming kernels have relatively low execution times compared to the other kernels, consid- ering the number of memory operations they are required to perform.

While the collide kernel is also shown to make efficient use of the memory bandwidth in Section6.1.3, that was in the context of a fluid-only simulation where operations are performed on all cells. In a simulation with a fluid interface and obstacles, the collide kernel warps overlap with the fluid interface and result in memory operation inefficiencies. For our test case, 13–15% of all cells require collision calculations, but 20–35% of all warps overlap with these cells. On average, only 17 threads per warp are actively performing collision operations, which allows us to set a target for expected memory throughput as 1732 = 53.1% of the available memory bandwidth (≈ 55 GB/s). The measured result surpasses this target with a memory throughput of 61.00 GB/s. We can therefore conclude that the collide kernel has excellent performance within the context of a fluid simulation.

The performance of the source adjustment kernel is lower than expected. It makes use of only 30.7% of the available memory bandwidth. Unlike the streaming kernels, each thread of the source adjustment kernel only performs a single global memory read before completion. Since there is a single thread for each lattice cell, initialisation costs (for both hardware and thread- local variables) take a non-negligible amount of the total execution time. In order to improve the memory throughput of this kernel it would have to be redesigned so that, like the streaming kernels, each thread performs the source adjustments for multiple lattice cells. This will allow for better intra-warp latency hiding to take place.

Although the obstacle adjustment kernel accesses an average of only six values per memory request, it was still able to achieve a modest throughput of 64.78 GB/s. The interface adjustment kernel accessed an average of seven values per memory request, but only achieved a throughput of 21.21 GB/s. The difference in performance between these two kernels is likely to result from the obstacle adjustment kernel performing half the number of instructions (since it performs no calculations for the LBM) that the interface adjustment kernel performs. Further investigation is required to confirm this.

As expected, the interface movement kernels also all have low memory throughput (15.29 GB/s to 45.27 GB/s). As with the source adjustment kernel, each thread for the interface movement kernels operates on a single lattice cell. For cells where no work is required, a non-negligible amount of initialisation is required. This is likely to play a role in their poor performance, but further work is required to identify the exact reasons for the poor performance and suggest spe- cific areas where improvements are possible.