2.3 Implementation and Optimization Approaches
2.3.1 Hardware Overview
Scalable parallel computers combine multiple CPUs sharing memory into a node, and interconnect multiple nodes through a high-speed switching network to form a cluster. Fig. (2.2) illustrates the structure of a single node in a parallel computing cluster circa 2010. The node shown is an Intel design based on four Nehalem-EX processors; similar designs are available from other manufacturers (e.g. based on AMD Opteron or IBM Power components). Each Nehalem-EX processor contains 8 processing cores and can reference directly attached high-speed memory at about 40 GB/s. It also connects directly to the other three processors through a high-speed interconnect (100 GB/s) to access non-local memory. In aggregate, this configuration provides 32 processing cores with up to 160 GB/s of shared main memory bandwidth. Computational accelerators in the form of graphics processing units (GPUs) can also be incorporated in the node. Data is transferred to and from the GPUs at about 4–8 GB/s, and to and from other nodes in the cluster at about the same rate.
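The aggregate figures quoted for the node follow directly from the per-processor numbers. The short sketch below reproduces that arithmetic using only values stated in the text; the function and variable names are illustrative and not part of any library.

```python
# Per-processor figures quoted in the text for a four-socket Nehalem-EX node.
PROCESSORS_PER_NODE = 4
CORES_PER_PROCESSOR = 8
LOCAL_BANDWIDTH_GB_S = 40.0  # directly attached memory, per processor


def node_totals(processors, cores_per_processor, local_bandwidth_gb_s):
    """Return (total cores, aggregate memory bandwidth in GB/s) for one node."""
    total_cores = processors * cores_per_processor
    aggregate_bandwidth = processors * local_bandwidth_gb_s
    return total_cores, aggregate_bandwidth


cores, bandwidth = node_totals(
    PROCESSORS_PER_NODE, CORES_PER_PROCESSOR, LOCAL_BANDWIDTH_GB_S
)
# Average share of the aggregate bandwidth available to each core
# when all cores access memory at once.
per_core_share = bandwidth / cores
print(f"{cores} cores, {bandwidth:.0f} GB/s aggregate, "
      f"{per_core_share:.1f} GB/s per core")
```

The per-core share is an average over the whole node; it assumes that every core references only its locally attached memory, which is the best case.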
Multiple levels of cache are provided to reduce the latency associated with memory accesses. Depending on the specific processor, a particular data cache may be associated with an individual core or may be shared between multiple cores. For the setup shown in Fig. (2.2), each of the eight cores in a Nehalem-EX is equipped with a dedicated 32 KB L1 data cache and a 256 KB L2 cache. A 24 MB L3 cache is shared among the eight cores. Each cache stores a subset of the data contained in main memory, selected according to which data the processor cores require to perform computations. Accessing data in a cache takes considerably less time than accessing data in main memory. Once data has been loaded into the cache, it remains there until subsequent data accesses require it to be replaced. Temporal and spatial locality of cache references can be improved by manipulating data access patterns, which can impact performance significantly. LBM methods exhibit relatively little temporal locality.
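The effect of temporal locality can be illustrated with a toy cache model. The sketch below assumes a fully associative cache with least-recently-used replacement, tracked in units of cache lines. This is a deliberate simplification of real hardware, and the function name and the sizes used are hypothetical. It compares two sweeps over a working set that fits in the cache with two sweeps over one that does not, the latter resembling a streaming computation that touches each value once per pass.

```python
from collections import OrderedDict


def lru_hit_rate(line_sequence, capacity_lines):
    """Fraction of accesses served from a fully associative LRU cache.

    line_sequence: iterable of cache-line identifiers, in access order.
    capacity_lines: number of lines the (simplified) cache can hold.
    """
    cache = OrderedDict()
    hits = 0
    accesses = 0
    for line in line_sequence:
        accesses += 1
        if line in cache:
            hits += 1
            cache.move_to_end(line)        # mark as most recently used
        else:
            cache[line] = True
            if len(cache) > capacity_lines:
                cache.popitem(last=False)  # evict least recently used line
    return hits / accesses if accesses else 0.0


CAPACITY = 4  # hypothetical cache size, in lines

# Two passes over a working set that fits: the second pass reuses cached lines.
small_set = list(range(4)) * 2
# Two passes over a working set that does not fit: every line is evicted
# before it is referenced again.
large_set = list(range(8)) * 2

print("fits in cache:", lru_hit_rate(small_set, CAPACITY))
print("exceeds cache:", lru_hit_rate(large_set, CAPACITY))
```

In this model the larger working set gains nothing from the cache on the second pass, which is why reordering accesses to shorten the reuse distance can matter so much.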
Figure 2.2: Schematic showing a GPU-accelerated processing node populated with four Intel Nehalem Xeon CPUs, each with 8 processing cores. [Diagram labels: Memory Module, four GPUs, two I/O Hubs, four Nehalem-EX processors, Network Connection to Other Nodes.]

While the processor cores on a multi-core CPU are capable of performing computations in parallel, serial codes use only one core at a time and therefore do not take full advantage of the processor capabilities. Multi-core implementations can be constructed for shared memory using language extensions and libraries such as OpenMP or MPI. Scalability of memory-intensive computations to multiple cores is limited by the maximum memory bandwidth, which is shared by all cores. Increased cache sizes and memory bandwidth that scales better with the number of cores are primary objectives of modern processor architectures such as Nehalem. While these designs do improve aggregate bandwidth, the maximum memory bandwidth remains a critical limit for LBM methods.
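The bandwidth limit on multi-core scaling can be sketched with a simple saturation model: speedup grows linearly with the number of cores until their combined demand reaches the shared memory bandwidth, and is flat thereafter. The 40 GB/s limit is the per-processor figure quoted earlier. The per-core demand of 10 GB/s is a hypothetical value chosen for illustration, and the function name is likewise an assumption.

```python
def bandwidth_limited_speedup(cores, per_core_demand_gb_s, max_bandwidth_gb_s):
    """Speedup over one core when every core streams data at the same rate.

    Simplified model: cores scale linearly until their combined demand
    reaches the shared memory bandwidth, after which speedup is flat.
    """
    saturation_cores = max_bandwidth_gb_s / per_core_demand_gb_s
    return min(float(cores), saturation_cores)


MAX_BANDWIDTH_GB_S = 40.0    # per-processor figure quoted in the text
PER_CORE_DEMAND_GB_S = 10.0  # hypothetical demand of one core (assumption)

for cores in (1, 2, 4, 8):
    speedup = bandwidth_limited_speedup(
        cores, PER_CORE_DEMAND_GB_S, MAX_BANDWIDTH_GB_S
    )
    print(f"{cores} cores -> speedup {speedup:.1f}")
```

Under these assumed numbers, adding cores beyond the fourth yields no further speedup. The model ignores cache effects and any computation that does not touch main memory, so it should be read as an upper-bound estimate for a purely streaming kernel.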
In order to simulate large domain sizes and reduce the solution time for a given LBM simulator, parallel implementation is a necessity for most porous medium applications. A typical approach is to use the message passing interface (MPI) to develop code that runs in parallel on multiple processor nodes [177, 169, 113, 165, 161, 19, 56]. Scaling the LBM to run on a large number of processors requires a domain decomposition strategy that evenly distributes the computational load between processors while minimizing the amount of communication that must be performed [140, 182, 181, 179]. The computational load scales with the volume of lattice sites not in the solid phase within a subdomain, while the communication scales with the surface area of the subdomain.

MPI implementations are primarily targeted at large distributed-memory computers constructed from a large number of processors, and typically use every processor core when run on a large number of multi-core CPUs. For many of these systems, the amount of memory bandwidth increases with the number of processors rather than the number of processor cores. Inter-processor communication is needed and relies upon a network connecting the various processors. The bandwidth of this network determines the data transfer rate between processors, which impacts efficiency and scaling. As long as communication times are shorter than the computational time (and can be overlapped with computations), scaling is determined by load balancing of the computational work among the processors on the system. Once communication times exceed the computational time, communications can no longer be masked effectively and parallel efficiency deteriorates.
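The volume and surface-area scaling described above can be made concrete with a small estimate. The sketch below is a simplified model under stated assumptions. Work is taken as the number of non-solid sites in a box-shaped subdomain, with a uniform porosity standing in for the actual pore geometry. Communication is taken as the number of sites on the six faces, ignoring edges and corners. The efficiency function assumes that communication is fully overlapped with computation. All names are hypothetical.

```python
def subdomain_costs(nx, ny, nz, porosity=1.0):
    """Return (work, halo) for a box-shaped subdomain of nx*ny*nz sites.

    work: number of non-solid lattice sites (proportional to volume).
    halo: number of sites on the six faces (proportional to surface area).
    """
    work = porosity * nx * ny * nz
    halo = 2 * (nx * ny + ny * nz + nx * nz)
    return work, halo


def overlapped_efficiency(t_compute, t_comm):
    """Parallel efficiency when communication overlaps computation.

    The time per step is the longer of the two, so efficiency stays at 1.0
    until communication time exceeds computation time.
    """
    return t_compute / max(t_compute, t_comm)


# Halving the subdomain edge doubles the communication-to-work ratio.
for edge in (128, 64, 32):
    work, halo = subdomain_costs(edge, edge, edge)
    print(f"edge {edge:3d}: work {work:.0f}, halo {halo}, "
          f"ratio {halo / work:.4f}")

print(overlapped_efficiency(t_compute=2.0, t_comm=1.0))  # communication hidden
print(overlapped_efficiency(t_compute=1.0, t_comm=2.0))  # communication exposed
```

For cubic subdomains of edge length n, the ratio reduces to 6/n. Subdividing a fixed domain among more processors therefore raises the relative cost of communication until it can no longer be hidden.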
[Figure 2.3: schematic of the CPU and GPU setup. Diagram labels: CPU, CPU Memory, GPU, GPU Memory, Data Cache, Core 1, Core 2. Bandwidth annotations: 8 GB/s, 10-30 GB/s, 170 GB/s. Legend: individual GPU core; cache shared by multiple cores.]

GPUs represent a different approach to multiprocessing. The GPU achieves high performance through multiple processing units, each of which contains multiple arithmetic units executing identical instructions on different pieces of data, known as single-instruction multiple-data (SIMD) operation. A modern GPU can have thousands of arithmetic operations in progress concurrently. GPUs also have very high performance memory systems, provided memory is referenced appropriately. In order to streamline the development process, NVIDIA introduced CUDA, an extension to the C programming language targeted at GPU applications [134]. The CUDA programming model is based on the GPU setup shown in Fig. (2.3), in which a CPU is used to perform basic tasks such as allocating memory and performing input and output, while the GPU is used to perform intensive calculations. Main memory is divided between the CPU and GPU, and data must be copied explicitly from one location to another. In order to maximize performance, memory operations involving data transfer between the CPUs and GPUs must be minimized. Data transfer rates between the GPU and its associated memory significantly exceed those of other memory operations, especially when data accesses follow advantageous patterns. This derives from the fact that memory transactions can be coalesced into a single operation for 16 or 32 SIMD threads provided that alignment and contiguity conditions are met. While these conditions become substantially less restrictive with each new generation of GPU, data alignment remains a critical consideration for optimization of GPU-based code.
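The alignment and contiguity conditions for coalescing can be illustrated by counting how many aligned memory segments a group of SIMD threads touches in one access. The sketch below is a host-side model written in Python rather than CUDA. The 128-byte segment size, the 4-byte element size, and the stride of 19 values per lattice site are assumptions chosen for illustration, and the exact rules differ between GPU generations.

```python
def segments_touched(stride_elements, threads=32, element_bytes=4,
                     segment_bytes=128, base_offset_bytes=0):
    """Count the aligned memory segments referenced by one group of threads.

    Thread t reads the element at index t * stride_elements. A count of 1
    means the whole group could be served by a single memory transaction
    in this simplified model.
    """
    addresses = (base_offset_bytes + t * stride_elements * element_bytes
                 for t in range(threads))
    return len({address // segment_bytes for address in addresses})


# Contiguous and aligned: consecutive threads read consecutive values.
print("contiguous, aligned:   ", segments_touched(stride_elements=1))

# Contiguous but misaligned: the same access shifted by one element.
print("contiguous, misaligned:",
      segments_touched(stride_elements=1, base_offset_bytes=4))

# Strided: each thread skips over the other values stored for its site
# (19 values per site is a hypothetical layout).
print("strided by 19:         ", segments_touched(stride_elements=19))
```

In this model a layout that places the same quantity for neighboring sites in adjacent memory locations lets one group of threads be served from a single segment, whereas interleaving all quantities for a site spreads the same request over many segments.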