2.4 Results
2.4.1 Serial CPU Implementation
[Figure 2.9 plot: "Serial Performance of the LBM (Intel Xeon X5355)". x-axis: Number of Nodes (logarithmic, 10³ to 10⁶); y-axis: MLUPS (0 to 15); series: BGK, MRT, Shan-Chen.]
Figure 2.9: Serial performance for various implementations of the LBM as a function of domain size. Once the lattice arrays exceed the L2 cache size, performance is limited by available memory bandwidth.
Serial performance was analyzed for the BGK, MRT and Shan-Chen implementations of the LBM based on simulations performed using a single core of a 3 GHz Intel Xeon X5355 processor. Lattice update rates are reported in million-lattice-updates-per-second (MLUPS) as a function of lattice size in Fig. (2.9). In each case, cubic lattices were considered. Results demonstrate that performance is roughly inversely proportional to the computational intensity of the model for domains that fit within the L2 cache (2×4 MB). This corresponds to a maximum lattice size of approximately 25³ for the
Authors                                         Year  Processor        GHz   GB/s  MLUPS
Wellein, Zeiser, Hager, Donath                  2006  Intel Xeon DP    3.4   5.3   4.8
                                                      AMD Opteron      1.8   5.3   2.7
                                                      Intel Itanium 2  1.4   6.4   7.6
                                                      IBM Power4       1.7   9.1   5.9
Mattila, Hyväluoma, Rossi, Aspnäs, Westerholm   2007  AMD Opteron 246  2.0   -     2.47
Mattila, Hyväluoma, Timonen, Rossi              2008  AMD Opteron      2.0   -     4.02
                                                      Intel Xeon       3.2   -     4.67
Heuveline, Krause, Latt                         2009  AMD Opteron      2.6   6.4   1.9
McClure, Prins, Miller                          -     Intel Xeon       3.0   3.0   4.43
                                                -     Intel Nehalem    2.93  12.5  11.92
Table 2.1: Reported peak performance based on serial execution of the D3Q19 BGK LBM for a variety of processors [183, 117, 116, 77].
Model      FLOPs  Memory References
BGK        295    19 read + 19 write = 38
MRT        975    19 read + 19 write = 38
Shan-Chen  2050   114 read + 78 write = 192
Table 2.2: Basic computational and memory reference parameters per lattice site for the LBM models considered in this work.
BGK and MRT schemes, whereas the increased storage requirements of the Shan-Chen scheme lead to a maximum of about 20³. For flow in porous medium systems that are representative of a macroscale representative elementary volume (REV) [17], the more relevant performance estimates are those obtained for larger domain sizes. In these cases, execution speed is primarily memory bandwidth limited, with maximum execution speeds of 4.43 MLUPS for the BGK scheme, 3.55 MLUPS for the MRT scheme and 1.25 MLUPS for the two-component Shan-Chen scheme. This memory bandwidth limitation is key for the applications of greatest concern in porous medium science.
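The cache threshold described above can be bounded with a short calculation. The following Python sketch is illustrative only: it assumes that a single core can use one 4 MB L2 cache, that only one copy of the per-site data needs to be resident (8-byte values, 19 distributions per component, plus one density per component for Shan-Chen), and that nothing else competes for the cache. The function name `max_cubic_edge` is hypothetical.

```python
# Rough upper bound on the edge length of a cubic lattice that fits in cache.
# Assumptions (not from the text): one copy of the per-site data is resident,
# and no other data competes for the cache.

def max_cubic_edge(cache_bytes: int, bytes_per_site: int) -> int:
    """Largest integer n such that n**3 * bytes_per_site <= cache_bytes."""
    n = int(round((cache_bytes / bytes_per_site) ** (1.0 / 3.0)))
    # Correct for floating-point rounding of the cube root.
    while n ** 3 * bytes_per_site > cache_bytes:
        n -= 1
    while (n + 1) ** 3 * bytes_per_site <= cache_bytes:
        n += 1
    return n

L2_BYTES = 4 * 1024 ** 2              # one 4 MB L2 cache (assumed usable by one core)
BGK_MRT_BYTES = 19 * 8                # 19 double-precision distributions per site
SHAN_CHEN_BYTES = 2 * (19 + 1) * 8    # two components: distributions + density

edge_bgk = max_cubic_edge(L2_BYTES, BGK_MRT_BYTES)
edge_sc = max_cubic_edge(L2_BYTES, SHAN_CHEN_BYTES)
print(edge_bgk, edge_sc)
```

Under these idealized assumptions the bound is an edge length of 30 for BGK and MRT and 23 for Shan-Chen. Both exceed the observed limits of roughly 25 and 20, which is what an upper bound should do, and the ratio between the two models is similar. The gap is plausibly due to other data sharing the cache.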
Performance of the BGK LBM is consistent with results reported by other authors for the D3Q19 model, tabulated in Table 2.1 with hardware specifications noted when available. Full periodic boundary conditions for the distributions and separate execution of interior and exterior lattice sites, both of which are necessary to carry out porous medium simulations in parallel, impose a slight performance penalty on our implementation. Due to similar memory bandwidth demands, the MRT scheme achieves performance similar to that of the BGK model when the problem size exceeds the L2 cache limit. The performance deficit for the MRT scheme indicates that computational intensity does have a limited impact on performance, meaning that memory bandwidth is not the sole limiting factor. The performance of the Shan-Chen scheme is consistent with the higher memory bandwidth demand associated with this model.
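One way to see how memory bandwidth constrains the update rate is a simple bandwidth-only bound: the sustained bandwidth divided by the bytes moved per lattice update. The Python sketch below is an assumption-laden illustration, not the analysis used in this work. It takes the reference counts from Table 2.2, assumes 8-byte values, assumes every reference goes to main memory, and uses the 3.0 GB/s listed in Table 2.1 for the 3.0 GHz Intel Xeon as an illustrative input. The function name `bandwidth_bound_mlups` is hypothetical.

```python
# Bandwidth-only upper bound on the lattice update rate.
# Assumptions (not from the text): every memory reference reaches main memory,
# values are 8 bytes, and the quoted bandwidth is fully sustained.

def bandwidth_bound_mlups(bandwidth_gb_s: float,
                          refs_per_update: int,
                          bytes_per_value: int = 8) -> float:
    """Return the bandwidth-limited update rate in MLUPS."""
    bytes_per_update = refs_per_update * bytes_per_value
    updates_per_second = bandwidth_gb_s * 1e9 / bytes_per_update
    return updates_per_second / 1e6

BANDWIDTH_GB_S = 3.0   # illustrative input taken from Table 2.1
REFS_BGK_MRT = 38      # Table 2.2: 19 read + 19 write
REFS_SHAN_CHEN = 192   # Table 2.2: 114 read + 78 write

bound_bgk = bandwidth_bound_mlups(BANDWIDTH_GB_S, REFS_BGK_MRT)
bound_sc = bandwidth_bound_mlups(BANDWIDTH_GB_S, REFS_SHAN_CHEN)
print(f"BGK/MRT bound:   {bound_bgk:.2f} MLUPS")
print(f"Shan-Chen bound: {bound_sc:.2f} MLUPS")
```

With these inputs the bound is about 9.87 MLUPS for BGK and MRT and about 1.95 MLUPS for Shan-Chen. The measured large-domain rates (4.43, 3.55 and 1.25 MLUPS) fall below these values, as expected for an upper bound that ignores computation and cache effects.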
Total memory requirements are roughly equivalent for the BGK and MRT schemes because the distributions represent the only major variable that must be allocated and stored. A total of 19×8 = 152 bytes are required to store the distributions for a single component at a lattice site. An equivalent number of bytes must be accessed from either data caches or main memory to perform each lattice update in the BGK and MRT schemes. The demand for memory bandwidth is proportional to the number of values that must be accessed from main memory. Due to the fusion of the streaming and collision operations, the total number of memory references per lattice update is given by Q(R+W), where R denotes a memory read and W denotes a write. Memory references for the single-component D3Q19 BGK and MRT schemes are shown in Table 2.2, along with those for the two-component Shan-Chen scheme. In the Shan-Chen LBM, arrays must be allocated to store both distribution and density values at each lattice site for each of the two components, requiring 2×(19 + 1)×8 = 320 bytes per lattice site. Compared with the BGK and MRT schemes, significantly more memory references are required for implementation of the Shan-Chen scheme. In addition to the streaming requirement for two fluid components, additional memory references are
                                     Topsail  Franklin  MMQ
Number of cores per node             8        4         32
Total number of nodes                520      9,572     1
Aggregate mem. bandwidth (GB/s)      -        32        160
Interconnect mem. bandwidth (GB/s)   1.0      1.6       -
BGK (max. MLUPS/core)                5.31     5.04      11.92
MRT (max. MLUPS/core)                3.78     3.71      5.67
Shan-Chen (max. MLUPS/core)          1.06     0.998     2.237
Table 2.3: Overview of hardware specifications and LBM performance for the parallel systems used in this work.
required to write the post-streaming density values and separately execute the collision step. The total number of memory references per lattice update for the Shan-Chen scheme with Nc fluid components is given by:
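$$
\mathrm{MRPLU}_{\mathrm{S\text{-}C}} = \underbrace{(R+W)(Q \times N_c)}_{\text{streaming}} + \underbrace{W(N_c)}_{\text{density}} + \underbrace{(2R+W)(Q \times N_c)}_{\text{collision}}. \tag{2.30}
$$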
Note that it is possible to implement each scheme with a greater number of memory references, but it is not possible to do so with fewer. For the two-component Shan-Chen scheme, approximately five times as many values must be accessed from memory to perform a lattice update compared with the basic MRT approach. Fig. (2.9) indicates a roughly three-fold performance differential between these two methods, suggesting that use of merged storage arrays combined with various compiler optimizations decreases the relative cost of the Shan-Chen scheme.
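The reference counts in Table 2.2 and the ratios quoted above can be reproduced directly from Q(R+W) and Eq. (2.30). The Python sketch below is a bookkeeping illustration only; the names `MemRefs`, `fused_refs` and `shan_chen_refs` are hypothetical and do not come from the implementation described in this work.

```python
from typing import NamedTuple

class MemRefs(NamedTuple):
    """Memory references per lattice update, split into reads and writes."""
    reads: int
    writes: int

    @property
    def total(self) -> int:
        return self.reads + self.writes

def fused_refs(q: int) -> MemRefs:
    """Fused streaming and collision (BGK, MRT): Q(R + W)."""
    return MemRefs(reads=q, writes=q)

def shan_chen_refs(q: int, n_c: int) -> MemRefs:
    """Eq. (2.30): streaming + density + collision terms."""
    streaming = MemRefs(reads=q * n_c, writes=q * n_c)      # (R + W)(Q x Nc)
    density = MemRefs(reads=0, writes=n_c)                  # W(Nc)
    collision = MemRefs(reads=2 * q * n_c, writes=q * n_c)  # (2R + W)(Q x Nc)
    parts = (streaming, density, collision)
    return MemRefs(reads=sum(p.reads for p in parts),
                   writes=sum(p.writes for p in parts))

Q, N_C = 19, 2                      # D3Q19, two fluid components
mrt = fused_refs(Q)
sc = shan_chen_refs(Q, N_C)

reference_ratio = sc.total / mrt.total   # memory references, Shan-Chen vs MRT
measured_ratio = 3.55 / 1.25             # large-domain MLUPS, MRT vs Shan-Chen
print(mrt, mrt.total)
print(sc, sc.total)
print(round(reference_ratio, 2), round(measured_ratio, 2))
```

The counts evaluate to 38 references for MRT and 114 reads plus 78 writes, or 192 references, for Shan-Chen. This gives a reference ratio of about 5.05 against a measured performance ratio of about 2.84, in line with the five-fold and three-fold figures stated in the text.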