4.2 The lattice Boltzmann method in CUDA
4.2.1 Uniform Grid Implementations
This section first outlines some recent implementations of the LBM on uniform grids. The performance of each method is given in Million Lattice Updates Per Second (MLUPS), a common performance measure for the LBM. Tölke and Krafczyk [64,65] solve the memory misalignment problem by splitting the domain into an array of one-dimensional blocks and performing the propagations that involve shifts in the minor direction through shared memory. Since the scope of shared memory is limited to the block, the distributions leaving the block along the minor direction are temporarily stored as incoming distributions from the opposite side of the block. A separate kernel with a different block topology is then invoked to move these distributions to their correct locations. This approach ensures that all global memory accesses are fully coalesced, but has the disadvantage of requiring an additional kernel launch. Their implementation achieved a maximum performance of 592 MLUPS for the D3Q13 lattice on the GeForce 8800 Ultra.
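For reference, the MLUPS figures quoted in this section can be reproduced from the lattice size, the number of time steps, and the wall-clock time. The following is a minimal host-side sketch; the function names `mlups` and `traffic_gb_per_s` are illustrative and not taken from any of the cited works, and the traffic estimate assumes that every update reads and writes each distribution exactly once.

```cpp
#include <cassert>
#include <cstddef>

// Million Lattice Updates Per Second for a run that advanced every node of an
// nx*ny*nz lattice through `steps` time steps in `seconds` of wall-clock time.
inline double mlups(std::size_t nx, std::size_t ny, std::size_t nz,
                    std::size_t steps, double seconds) {
    assert(seconds > 0.0);
    const double updates = static_cast<double>(nx) * ny * nz * steps;
    return updates / (seconds * 1.0e6);
}

// Global-memory traffic in GB/s implied by an MLUPS figure, ASSUMING each
// update reads and writes each of the q distributions exactly once.
inline double traffic_gb_per_s(double mlups_value, int q,
                               std::size_t bytes_per_value) {
    return mlups_value * 1.0e6 * 2.0 * q *
           static_cast<double>(bytes_per_value) / 1.0e9;
}
```

As a usage example with hypothetical numbers, a 128 x 128 x 128 lattice advanced 1000 steps in 4 s gives `mlups(128, 128, 128, 1000, 4.0)` of about 524.3, and 500 MLUPS on a single-precision D3Q19 lattice corresponds to `traffic_gb_per_s(500.0, 19, 4)` = 76 GB/s under the stated assumption.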
Obrecht et al. [49] showed that a relatively straightforward “reversed” scheme can achieve high memory bandwidth on devices with higher compute capabilities. This method relies on the fact that the penalties incurred by misaligned memory accesses have been alleviated on devices of compute capability 1.2 and above, and on the observation that misaligned writes are more costly than misaligned reads. In this implementation, each thread performs a series of global memory reads that involve misalignments, applies the collision operator, and writes the updated distributions to global memory in a fully coalesced fashion. Their method showed a maximum performance of 516 MLUPS for the D3Q19 lattice on the GeForce GTX 295.
Both of the implementations above allocate arrays for two copies of the domain and alternate between the two grids after every time step to avoid overwriting data. There has also been work that performs the reads and writes in place, allowing the simulation to be contained in a single array and thereby halving the memory requirement. Myre et al. [42] compared the “ping-pong” scheme, which alternates between two sets of grids, with the “flip-flop” and “Lagrangian” schemes, which use only one set of data. The “ping-pong” pattern used in their study is similar to the “reversed” scheme used by Obrecht et al., but it performs the collision in place first and then writes the updated distributions to the neighboring sites to complete the propagation. The “Lagrangian” pattern assigns a global memory location to each fluid packet, and each thread is assigned a fixed point in the simulation domain. At each time step, each thread selectively reads the fluid packets that would arrive at its location, performs the collision step, and writes the updated values to the original memory locations. In the “flip-flop” pattern, two different read/write methods are used on alternating time steps. In the first time step, each thread reads the distribution functions from its node, applies the collision operator, and writes the updated values to the same node, but with reversed directions. In the second time step, each thread reads the incoming distributions from the adjacent nodes, applies the collision operator, and writes the results to the adjacent nodes. Their study showed that the “Lagrangian” method is the fastest of the three memory access patterns, achieving a maximum performance of 444 MLUPS for the D3Q19 lattice on the Tesla C1060.
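The “flip-flop” description above can be made concrete with a small host-side sketch on a periodic one-dimensional D1Q3 lattice. This is an illustration written from the description in the text, not the code of Myre et al.; the D1Q3 lattice, the BGK collision, and all names (`collide`, `idx`, `flip_flop_even`, `flip_flop_odd`, `reference_step`) are assumptions made for the example. The function `reference_step` is a two-array collide-then-propagate step used only for comparison.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// D1Q3 lattice (assumed for illustration): 0 = rest, 1 = +x, 2 = -x.
constexpr int Q = 3;
constexpr int CX[Q] = {0, 1, -1};
constexpr int OPP[Q] = {0, 2, 1};
constexpr double W[Q] = {2.0 / 3.0, 1.0 / 6.0, 1.0 / 6.0};
using Node = std::array<double, Q>;

// Direction-major storage f[i*n + x] with periodic wrap in x.
inline std::size_t idx(int i, int x, int n) {
    const int xp = ((x % n) + n) % n;
    return static_cast<std::size_t>(i) * static_cast<std::size_t>(n) +
           static_cast<std::size_t>(xp);
}

// BGK collision on the populations of a single node.
inline Node collide(const Node& f, double omega) {
    const double rho = f[0] + f[1] + f[2];
    assert(rho > 0.0);
    const double u = (f[1] - f[2]) / rho;
    Node out{};
    for (int i = 0; i < Q; ++i) {
        const double cu = CX[i] * u;
        const double feq = W[i] * rho * (1.0 + 3.0 * cu + 4.5 * cu * cu - 1.5 * u * u);
        out[i] = f[i] + omega * (feq - f[i]);
    }
    return out;
}

// First step: read own node, collide, write back with reversed directions.
void flip_flop_even(std::vector<double>& f, int n, double omega) {
    for (int x = 0; x < n; ++x) {
        Node in;
        for (int i = 0; i < Q; ++i) in[i] = f[idx(i, x, n)];
        const Node out = collide(in, omega);
        for (int i = 0; i < Q; ++i) f[idx(OPP[i], x, n)] = out[i];
    }
}

// Second step: read incoming values from neighbors, collide, write to neighbors.
// Each node reads and writes the same set of slots, so node order is irrelevant.
void flip_flop_odd(std::vector<double>& f, int n, double omega) {
    for (int x = 0; x < n; ++x) {
        Node in;
        for (int i = 0; i < Q; ++i) in[i] = f[idx(OPP[i], x - CX[i], n)];
        const Node out = collide(in, omega);
        for (int i = 0; i < Q; ++i) f[idx(i, x + CX[i], n)] = out[i];
    }
}

// Two-array reference: collide at each node, then propagate into a new array.
std::vector<double> reference_step(const std::vector<double>& f, int n, double omega) {
    std::vector<double> out(f.size());
    for (int x = 0; x < n; ++x) {
        Node in;
        for (int i = 0; i < Q; ++i) in[i] = f[idx(i, x, n)];
        const Node post = collide(in, omega);
        for (int i = 0; i < Q; ++i) out[idx(i, x + CX[i], n)] = post[i];
    }
    return out;
}
```

In this sketch, one call to `flip_flop_even` followed by one call to `flip_flop_odd` on a single array yields the same populations as two calls to `reference_step`, which is why the single-array pattern needs no second grid.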
Astorino et al. [3] also present a single-grid method that employs the swapping technique [33] for propagation. This technique involves a series of swapping operations between distributions in opposite directions on adjacent nodes, thus eliminating data overwriting.
The limitation of this propagation method in the context of GPU computing is that it does not allow the collision and propagation steps to be combined into one kernel. This is because the swapping technique assumes knowledge of the order in which the algorithm visits the nodes. Since all threads execute concurrently, the order in which the nodes are processed cannot be known a priori, and one must rely on kernel synchronization to ensure that the collision operation is applied only to nodes that have completed the swapping with all adjacent nodes. With this method, they report a maximum performance of 375 MLUPS on the D3Q19 lattice using the GeForce GTX 480.
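To illustrate how swaps alone can carry out the propagation on a single array, the sketch below applies a two-pass swap to a periodic D1Q3 lattice on the host. It is one possible formulation assumed for illustration and may differ in detail from the algorithm of [33] and from the implementation of Astorino et al.; the function name `swap_propagate` and the direction-major layout are hypothetical. Running the two passes as separate loops mirrors the need for a synchronization point between them.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Propagate a periodic D1Q3 lattice in place using swaps only.
// Storage is direction-major: f[i*n + x], with 0 = rest, 1 = +x, 2 = -x.
void swap_propagate(std::vector<double>& f, int n) {
    assert(n > 0 && f.size() == static_cast<std::size_t>(3) * n);
    double* f1 = f.data() + n;      // populations moving in +x
    double* f2 = f.data() + 2 * n;  // populations moving in -x

    // Pass 1: exchange the two opposite directions on every node.
    for (int x = 0; x < n; ++x) std::swap(f1[x], f2[x]);

    // Pass 2: exchange the -x slot of node x with the +x slot of node x+1.
    // Every slot takes part in exactly one swap, so the pass is race-free,
    // but it must not start before pass 1 has finished on both nodes.
    for (int x = 0; x < n; ++x) std::swap(f2[x], f1[(x + 1) % n]);
}
```

After both passes, the +x population of node x holds the value that started at node x-1, and the -x population holds the value that started at node x+1, which is the result of one propagation step. The rest population is untouched.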
The current implementation of the LBM is based on the “reversed” scheme, where each thread reads incoming distributions from its neighboring nodes, completes the collision step, and writes the updated distributions in a fully coalesced manner [49]. One-dimensional thread blocks are aligned with the minor dimension of the array (in this case, the x direction) to allow coalesced accesses. The optimal block size depends on the size and shape of the domain, but in general, 64 or 128 threads per block leads to good performance. Below is an excerpt that outlines the procedure for the current LBM implementation.
/*
  Stripped down code for running the LBM on the GPU using CUDA.
*/
#include <cuda.h>
...
/*
  Device function for converting x, y, z coordinates to memory address.
  Inputs are: f_num = distribution number (0-18);
              pitch = pitch size of array (counted in floats)
*/
inline __device__ int f_mem(int f_num, int x, int y, int z, size_t pitch)
{
    return (x + y*pitch + z*YDIM*pitch) + f_num*pitch*YDIM*ZDIM;
}
...
/*
  Kernel for streaming and colliding.
  Reads distributions from fin, and writes them to fout.
  Other inputs are: omega = relaxation rate;
                    pitch = pitch size of array (counted in floats)
*/
__global__ void update(float* fin, float* fout, float omega, size_t pitch)
{
    int x = threadIdx.x + blockIdx.x*blockDim.x;
    int y = threadIdx.y + blockIdx.y*blockDim.y;
    int z = threadIdx.z + blockIdx.z*blockDim.z;
    int j = x + y*pitch + z*YDIM*pitch;
    float f[19];
    // read from input array. array index computed from device function: f_mem
    f[0]  = fin[f_mem(0, x, y, z, pitch)];
    f[1]  = fin[f_mem(1, x-1, y, z, pitch)];
    ...
    f[18] = fin[f_mem(18, x, y+1, z+1, pitch)];
    // apply boundary conditions
    ...
    // apply collision operator
    ...
    // write updated distributions to fout
    fout[f_mem(0, x, y, z, pitch)] = f[0];
    ...
    fout[f_mem(18, x, y, z, pitch)] = f[18];
}
...
int main()
{
    ...
    // allocate two sets of device memory
    float* fd[2];
    // allocate memory
    cudaMalloc((void**)&fd[0], pitch*YDIM*ZDIM*19*sizeof(float));
    cudaMalloc((void**)&fd[1], pitch*YDIM*ZDIM*19*sizeof(float));
    // allocate host memory
    float* fh;
    // initialize host memory
    ...
    // copy host memory to device memory to initialize simulation
    // (cudaMemcpy2D expects pitches in bytes, hence pitch*sizeof(float))
    cudaMemcpy2D(fd[0], pitch*sizeof(float), fh, XDIM*sizeof(float),
                 XDIM*sizeof(float), YDIM*ZDIM*19, cudaMemcpyHostToDevice);
    cudaMemcpy2D(fd[1], pitch*sizeof(float), fh, XDIM*sizeof(float),
                 XDIM*sizeof(float), YDIM*ZDIM*19, cudaMemcpyHostToDevice);
    ...
    // define thread and block dimensions for GPU
    dim3 threads(BLOCKSIZEX, BLOCKSIZEY, BLOCKSIZEZ);
    dim3 grid(XDIM/BLOCKSIZEX, YDIM/BLOCKSIZEY, ZDIM/BLOCKSIZEZ);
    ...
    // define variables to allow pointer switching
    int A = 0;
    int B = 1;
    // time loop
    for (int t = 0; t < TMAX; t++) {
        // stream and collide to march in time: read fd[A], write fd[B]
        update<<<grid, threads>>>(fd[A], fd[B], omega, pitch);
        // synchronize device
        cudaDeviceSynchronize();
        // swap input and output arrays; fd[A] now holds the newest distributions
        swap(A, B);
    }
    // copy device memory back to host memory
    cudaMemcpy2D(fh, XDIM*sizeof(float), fd[A], pitch*sizeof(float),
                 XDIM*sizeof(float), YDIM*ZDIM*19, cudaMemcpyDeviceToHost);
    ...
    // write results to output file
    ...
    // free device memory
    cudaFree(fd[0]);
    cudaFree(fd[1]);
    // free host memory
    ...
    return 0;
}
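The effect of the addressing function in the listing on memory alignment can be checked on the host. The sketch below copies the index arithmetic of the device function into plain C++ and reports how far the first access of a block lies from a segment boundary. The extents `YDIM` and `ZDIM`, the 32-float (128-byte) segment size, and the helper `misalignment` are assumptions chosen for illustration; the listing does not specify them.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical extents for illustration; the listing does not give their values.
constexpr std::ptrdiff_t YDIM = 32;
constexpr std::ptrdiff_t ZDIM = 32;
// Floats per 128-byte memory segment (assumed segment size).
constexpr std::ptrdiff_t SEGMENT = 32;

// Host-side copy of the listing's addressing: x is the fastest-varying index,
// and each distribution occupies its own slab of pitch*YDIM*ZDIM floats.
inline std::ptrdiff_t f_mem(int f_num, std::ptrdiff_t x, std::ptrdiff_t y,
                            std::ptrdiff_t z, std::ptrdiff_t pitch) {
    assert(pitch > 0);
    return (x + y * pitch + z * YDIM * pitch) + f_num * pitch * YDIM * ZDIM;
}

// Distance (in floats) from the preceding segment boundary to the first
// element a block touches; 0 means the access starts on a boundary.
inline std::ptrdiff_t misalignment(std::ptrdiff_t first_index) {
    return ((first_index % SEGMENT) + SEGMENT) % SEGMENT;
}
```

With a pitch that is a multiple of the segment size (128 floats, say) and a block starting at x = 64, the write `f_mem(1, 64, y, z, 128)` starts on a segment boundary, as does a read shifted only in y and z. The read `f_mem(1, 63, y, z, 128)` for the distribution arriving from x-1 starts 31 floats past a boundary. Only shifts along x therefore produce misaligned accesses, and in the “reversed” scheme these fall on the reads rather than the writes.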