
5.5 Fluid Interface Movement

5.5.4 Finalise Interface

The final fluid interface movement kernel is responsible for updating the fill fraction for interface cells and finalizing the cell types (Pseudocode 11). These actions are performed separately because they would introduce race conditions if they were included in the Update Fluid Interface kernel (Pseudocode 10).

Apart from a divide operation to calculate the new fill fraction for interface cells, there is no computation performed by this kernel. The main work done is the persisting of information to global memory. As with the previous kernels, we schedule a thread per lattice cell and we cache the values determined in different code branches, so a single coalesced memory write can be performed when all adjustment threads are following the same code branch.

Pseudocode 11 The algorithm for finalising the movement of the fluid interface. The changes to global memory are precomputed and cached so that they can be persisted to global memory by all threads in a warp simultaneously.

1: x = (tx, ty, tz)
2: Calculate and cache change to fill fraction at x
3: Cache type change for x
4:     new empty → empty
5:     new fluid → fluid
6:     new interface → interface
7: Persist change to fill fraction if necessary
8: Persist type change if necessary
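The steps of Pseudocode 11 can be illustrated with a minimal host-side sketch in plain C++, where the loop body stands in for the work of one thread per lattice cell. Everything named here is hypothetical: the `CellType` encoding, the `Lattice` layout, and in particular the assumption that the fill fraction is the cell mass divided by the cell density (the text only states that a divide is involved). It also assumes that cells that are, or are about to become, interface cells receive a new fill fraction. The sketch shows the "cache in branches, persist once" structure, not the actual kernel.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical cell type encoding (an assumption, not taken from the text).
enum class CellType : std::uint8_t {
    Empty, Fluid, Interface, NewEmpty, NewFluid, NewInterface
};

// Hypothetical structure-of-arrays lattice; one entry per cell.
struct Lattice {
    std::vector<CellType> type;
    std::vector<float> mass;     // assumed per-cell fluid mass
    std::vector<float> density;  // assumed per-cell density (must be > 0)
    std::vector<float> fill;     // fill fraction
};

// Work of one "thread": decide all changes first, write afterwards.
void finaliseCell(Lattice& lat, std::size_t i) {
    const CellType oldType = lat.type[i];

    // Step 2: calculate and cache the change to the fill fraction.
    // Assumption: fill = mass / density for (new) interface cells.
    const bool updateFill = (oldType == CellType::Interface ||
                             oldType == CellType::NewInterface);
    float cachedFill = 0.0f;
    if (updateFill) {
        cachedFill = lat.mass[i] / lat.density[i];
    }

    // Steps 3-6: cache the type change without touching memory yet.
    CellType cachedType = oldType;
    switch (oldType) {
        case CellType::NewEmpty:     cachedType = CellType::Empty;     break;
        case CellType::NewFluid:     cachedType = CellType::Fluid;     break;
        case CellType::NewInterface: cachedType = CellType::Interface; break;
        default: break;  // already a final type, nothing to change
    }

    // Steps 7-8: persist outside the branches, only if necessary.
    // On a GPU this is where neighbouring threads could write together.
    if (updateFill) {
        lat.fill[i] = cachedFill;
    }
    if (cachedType != oldType) {
        lat.type[i] = cachedType;
    }
}

// Stand-in for the kernel launch: one iteration per lattice cell.
void finalise(Lattice& lat) {
    for (std::size_t i = 0; i < lat.type.size(); ++i) {
        finaliseCell(lat, i);
    }
}
```

In a CUDA version the loop would presumably be replaced by a thread index computed from the block and thread coordinates. The point of the structure is that the writes sit after the divergent branches rather than inside them.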

—∼—

In this chapter we presented our design for a GPU-based free-surface LBM fluid simulation using CUDA. Since our design involved the implementation of multiple kernels, before investigating the overall performance of our fluid simulation in Chapter 7, we first examine the performance of the individual kernels in the following chapter.

Chapter 6

GPU LBM Kernel Analysis

The performance of an algorithm on any HPC platform is, arguably, the most important aspect of its implementation. The potential reduction of execution time is often the driving factor behind any decision to move from an implementation on a conventional single-core platform to an HPC one. This is no different for our implementation of the LBM on graphics hardware. Before we dive into any overall performance results, it is necessary to examine the individual parts of our algorithm to ensure that they are making optimal use of the resources available to them. The purpose of this chapter is to perform this examination.

Our investigation will focus on the performance of the kernels that implement the core LBM operations: the streaming kernels and the collide kernel. Each of these kernels is first analysed in isolation to show how close it can operate to the performance limits of the graphics hardware. For the streaming kernels, we compare memory throughput and execution time with that of a specially designed benchmark kernel. For the collide kernel, we discuss the effect of latency hiding for performance improvements and show that memory bandwidth is a bottleneck. We then compare the performance of these kernels against equivalent CPU code to demonstrate the possible benefits of running an LB simulation on graphics hardware. In addition to the streaming and collide kernels, three streaming adjustment kernels¹ and four fluid interface movement kernels² are required for a full free-surface simulation. We discuss why their close relationship to the shape of the fluid interface significantly reduces the value of analysing these kernels in isolation, and discuss the difficulties in measuring their performance in the context of a free-surface fluid simulation. Lastly, we bring together the results and discussion to examine the performance of all the kernels in the context of a fluid simulation to show how well the kernels make use of the available resources during normal fluid simulations. This final analysis is important for showing how well the kernels perform at the tasks for which they were specifically designed, and to explain the breakdown of execution time across all kernels.

¹ Source adjustments, obstacle adjustments and interface adjustments.

All tests in this chapter are performed on the hardware specified in Table 6.1. We used the hardware that was locally available to us to perform these tests. Since the kernel analysis is primarily concerned with efficient resource utilization rather than raw performance, this hardware was acceptable for these purposes.

CPU                      Intel Core i5-3570K
CPU Cores                4
CPU Memory               8 GiB DDR3
CPU Clock Speed          3.4 GHz
Graphics Card Model      NVIDIA GeForce GTX 560Ti
GPU Generation           GF114 (Fermi)
GPU Clock Rate           822 MHz
GPU Memory               1 GiB GDDR5
GPU Memory Clock Rate    4008 MHz
GPU Cores                384
CUDA Runtime Version     5.0

Table 6.1: The hardware used to perform this analysis.
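To make the notion of a hardware limit for memory throughput concrete, the following sketch estimates a theoretical peak bandwidth from the memory clock rate in Table 6.1 and compares a measured throughput against it. Two assumptions are made that the table does not state: that the 4008 MHz figure is the effective data rate, and that the memory bus is 256 bits wide. The function names and the sample measurement are hypothetical and serve only as an illustration.

```cpp
#include <cassert>
#include <cmath>

// Theoretical peak bandwidth in bytes per second.
// Assumes effectiveMemClockHz is the effective (data-rate) memory clock.
double peakBandwidth(double effectiveMemClockHz, int busWidthBits) {
    return effectiveMemClockHz * (busWidthBits / 8.0);
}

// Effective bandwidth achieved by a kernel, in bytes per second:
// all bytes moved to and from global memory divided by execution time.
double effectiveBandwidth(double bytesRead, double bytesWritten,
                          double seconds) {
    return (bytesRead + bytesWritten) / seconds;
}

// Fraction of the theoretical peak that a measurement reaches.
double utilisation(double effective, double peak) {
    return effective / peak;
}
```

Under these assumptions, `peakBandwidth(4008e6, 256)` gives roughly 128 GB/s. A kernel's measured throughput can then be expressed as a fraction of that figure with `utilisation`.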

We begin with the isolated analysis of the streaming and collide kernels.

6.1 Isolated Analysis

Although assessing the performance of kernels in the context of the task they are designed to perform is ideal, it is possible for the peak performance of the kernels to be obscured by the nature of the data being processed. For LB fluid simulations, the shape of the fluid can have a significant effect on the performance as various operations may or may not be performed. However, in an isolated environment without other kernels we can control the simulation data to ensure that it does not significantly impact our performance metrics. The streaming and collide kernels are best suited to this type of analysis because their calculations and branching are not highly dependent on the shape of the fluid interface. Since the code branches in the adjustment and interface movement kernels are highly dependent on the fluid surface, discussing them outside that context is not valuable and therefore they are not included in this section.

We begin by discussing our approach for choosing the lattice dimensions for each isolated analysis. We then investigate the performance of the streaming kernels and end this section with the analysis of the collide kernel.