3.2 Algorithms and implementation techniques
3.2.1 Lattice site addressing
When applying the propagation to the distribution values in the lattice sites, there needs to be some way to compute where the neighboring lattice sites reside in memory, as data from the current lattice site will be moved to the neighboring sites. The simplest way to handle this would be to allocate a three-dimensional array, the size of the bounding box of the simulation geometry used. The location of the neighboring lattice sites can then be computed by looking at the coordinate of the current lattice site and adding the corresponding offset for the current distribution value. This approach would be a direct addressing scheme for the lattice sites.
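As a minimal sketch of the direct addressing scheme described above, the neighbor of a site can be found from its coordinates and the offset of the distribution value being propagated. The function names, the row-major layout with x as the fastest index, and the periodic wrap-around at the bounding box are assumptions made for illustration; they are not taken from the solver itself.

```python
def site_index(x, y, z, nx, ny):
    """Linear index of site (x, y, z) in a 3D array stored with x fastest."""
    return x + nx * (y + ny * z)


def neighbor_index(x, y, z, offset, dims):
    """Linear index of the neighbor reached by the lattice offset.

    Periodic wrap-around at the bounding box is an assumption of this sketch.
    """
    nx, ny, nz = dims
    dx, dy, dz = offset
    return site_index((x + dx) % nx, (y + dy) % ny, (z + dz) % nz, nx, ny)


dims = (4, 3, 2)  # hypothetical bounding box: nx, ny, nz
print(neighbor_index(1, 1, 0, (1, 0, 0), dims))  # prints 6
```

No index table is needed here: the target address follows from arithmetic on the coordinates, which is what makes the scheme simple, at the price of storing every site in the bounding box.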
The direct addressing scheme, while simple to create and use, has some drawbacks, the biggest being the need to allocate space for every lattice site within the bounding box of the geometry. Since the samples used for this work consist of a porous material, a significant percentage of the volume consists of solid lattice sites; for instance, the main sample has only 13 % fluid sites. Using the direct addressing scheme would force the solver to allocate a lot of memory that is never used for the fluid simulation, and it would severely limit the amount of simulation data that fits into a compute node.
Additionally, allocating space for the solid sites has an adverse effect on the performance of the solver. The performance impact comes from the way memory accesses through the cache hierarchy work on a modern processor. At the hardware level, a memory access is never performed for an individual value: if the requested value does not currently reside in the cache, the processor fetches the entire cache line that contains it. When there are unused, solid sites in the lattice and a cache line containing data associated with such a site is requested, this unusable data is moved into the cache as well. Fetching data that the computation never uses wastes memory bandwidth.
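The cache line argument can be illustrated with a rough back-of-the-envelope model. It is an assumption of this sketch, not a result from this work, that each cache line holds the values of a fixed number of consecutive sites, that sites are solid independently with a given probability, and that a line is fetched whenever at least one of its sites is fluid.

```python
def useful_fraction(solid_fraction, sites_per_line):
    """Expected share of fetched data that belongs to fluid sites.

    Toy model: a cache line is fetched if at least one of its sites is
    fluid, and only the fluid sites in it carry useful data.
    """
    p = solid_fraction
    if p >= 1.0:
        return 0.0  # no fluid sites, so nothing useful is ever fetched
    fetched = 1.0 - p ** sites_per_line  # probability that a line is needed
    useful = 1.0 - p                     # expected fluid share of a line
    return useful / fetched


# Hypothetical numbers: 87 % solid sites, 16 double-precision values per line.
print(useful_fraction(0.87, 16))
```

Under these assumptions, most of the bandwidth spent by a direct scheme on a mostly solid geometry moves data that is never used.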
To avoid the performance degradation associated with unused lattice sites and to save memory space, the LB solver used here is implemented using indirect addressing. In practice, indirect addressing requires data to be allocated only for fluid sites. This is done by precomputing the indices of the propagation targets for all the distribution values before the simulation starts. During the simulation, these indices are read from an array containing the propagation target indices for all lattice sites. With indirect propagation, we can implement almost any conceivable data layout for the distribution values, and the lattice data can be arranged in any order.
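The precomputation step can be sketched as follows. This is an illustrative toy, not the solver's implementation: the function names, the site-major layout (`site * Q + direction`), the periodic wrap-around, and the bounce-back treatment of solid neighbors are all assumptions. For simplicity the table also stores an entry for the center direction, which a real implementation could omit.

```python
def build_propagation_table(fluid, velocities):
    """Precompute propagation targets for fluid sites only.

    fluid[z][y][x] is True for fluid sites; velocities is a list of
    (dx, dy, dz) tuples that contains the opposite of every entry.
    """
    nz, ny, nx = len(fluid), len(fluid[0]), len(fluid[0][0])
    q = len(velocities)
    opposite = [velocities.index((-dx, -dy, -dz)) for dx, dy, dz in velocities]

    # Compact numbering: only fluid sites receive an index.
    compact = {}
    for z in range(nz):
        for y in range(ny):
            for x in range(nx):
                if fluid[z][y][x]:
                    compact[(x, y, z)] = len(compact)

    # targets[i * q + k] is the slot that receives direction k of site i.
    targets = [0] * (len(compact) * q)
    for (x, y, z), i in compact.items():
        for k, (dx, dy, dz) in enumerate(velocities):
            nb = ((x + dx) % nx, (y + dy) % ny, (z + dz) % nz)
            if nb in compact:
                targets[i * q + k] = compact[nb] * q + k
            else:
                # Assumed bounce-back: reflect into the opposite direction.
                targets[i * q + k] = i * q + opposite[k]
    return len(compact), targets


def propagate(f_in, targets):
    """Move every distribution value to its precomputed target slot."""
    f_out = [0.0] * len(f_in)
    for src, dst in enumerate(targets):
        f_out[dst] = f_in[src]
    return f_out


# Toy example: four sites in a row, the third one solid, three directions.
velocities = [(0, 0, 0), (1, 0, 0), (-1, 0, 0)]
fluid = [[[True, True, False, True]]]
n_fluid, targets = build_propagation_table(fluid, velocities)
print(n_fluid, targets)  # prints 3 [0, 4, 8, 3, 5, 2, 6, 1, 7]
```

Because the propagation loop only follows the table, the fluid sites could be renumbered in any order, or the layout changed, by changing how the table is built and nothing else.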
The downside of the indirect scheme is that it requires some additional memory space for the indexing values, and it consumes extra memory bandwidth when the indexing values are read. Ideally, only one set of propagation indices is needed, covering all but the center distribution value in each lattice site. For simulations where 32-bit indices are sufficient, the solver would need (Q − 1) ∗ 32 bits of space for the indexing values per lattice site. For porous media, or other media with a large number of solid lattice sites in the simulation geometry, the extra memory space required is quickly offset by the savings from not storing solid lattice sites.
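The memory trade-off can be made concrete with a small calculation. The sketch below assumes one stored copy of double-precision distribution values, 32-bit indices, and Q = 19 as an example value; the function names and the site counts are hypothetical.

```python
def bytes_direct(n_box_sites, q, value_bytes=8):
    """Memory for distribution values of every site in the bounding box."""
    return n_box_sites * q * value_bytes


def bytes_indirect(n_fluid_sites, q, value_bytes=8, index_bytes=4):
    """Memory for distribution values plus (Q - 1) indices per fluid site."""
    return n_fluid_sites * (q * value_bytes + (q - 1) * index_bytes)


def break_even_fluid_fraction(q, value_bytes=8, index_bytes=4):
    """Fluid fraction below which indirect addressing needs less memory."""
    values = q * value_bytes
    return values / (values + (q - 1) * index_bytes)


q = 19  # assumed example value, not stated in this section
print(bytes_direct(1000, q))    # prints 152000
print(bytes_indirect(130, q))   # prints 29120
print(break_even_fluid_fraction(q))
```

With these assumptions, a geometry with 13 % fluid sites needs roughly a fifth of the memory of the direct scheme, and indirect addressing saves memory whenever fewer than about two thirds of the sites are fluid.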
The performance impact of using indirect addressing will depend on the algorithm used. Ideally, the distribution values are read and written only once per iteration. When double-precision values are used for the distribution values, 2 ∗ 64 ∗ Q bits ideally need to be accessed for the distribution values, while only (Q − 1) ∗ 32 bits are needed for the indexing. The result is that, for a fluid-only simulation, the theoretical performance of indirect addressing should be only around 24% lower than that of direct addressing.
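The ratio behind this estimate is easy to check numerically. The sketch below simply divides the index traffic by the distribution traffic from the expressions above; Q = 19 is an assumed example value and the function name is hypothetical.

```python
def index_traffic_overhead(q, value_bits=64, index_bits=32):
    """Index traffic relative to distribution traffic per site update."""
    distribution_bits = 2 * value_bits * q  # one read and one write
    indexing_bits = (q - 1) * index_bits    # no index for the center value
    return indexing_bits / distribution_bits


print(round(index_traffic_overhead(19), 3))  # prints 0.237
```

For 64-bit values and 32-bit indices the ratio equals (Q − 1) / (4 ∗ Q), so it stays below 25% for any Q.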
For porous media, using direct addressing will have a significant performance impact, in addition to the wasted memory space from storing all the solid sites. Since some of the cache lines fetched will include data that is not used by the simulation, bandwidth is wasted that could have been used for actual simulation data. Figure 3.3 shows results from testing on an Nvidia Tesla M2050, placing solid sites at random locations to achieve a certain percentage of solid sites. The performance is measured in millions of fluid lattice site updates per second, MFLUPS. We only count the fluid sites since no computation happens at the solid sites; in fact, with indirect addressing we do not even allocate memory for them. At roughly 10% of the lattice volume filled with solid sites, direct and indirect addressing performed the same [64]. Increasing the percentage of solid sites further makes the performance of the direct addressing scheme fall linearly with the number of solid sites. The indirect variant shows declining performance up to about 70% solid sites, and a performance increase for geometries that are more solid than that.
[Figure 3.3 plot. x-axis: % of solid lattice sites (0 to 80). y-axis: speed in million fluid lattice site updates/s (MFLUPS), 40 to 160. Series: direct addressing, indirect addressing.]
Figure 3.3: Performance of the direct and indirect addressing schemes at different percentages of solid sites on a Tesla C2050 GPU. Solid sites are placed at random in the simulation domain.