dering cells necessary to compute the simulation. In the update function each node will compute the
equation for its section of the field and communicate the borders to its neighbours using MPI_Isend
and MPI_Irecv. This code example processes the entire lattice at once and then communicates the data to the neighbours.
Listing 9.6: Generated MPI code for a two-dimensional finite differencing simulation using Euler
integration. euler performs a single computation step for each node's field and communicates the
borders to the neighbouring nodes.

    void euler(float *u0, float *u1, float h) {
        for (int iy = Halo; iy < Y/P + Halo; iy++) {
            for (int ix = 0; ix < X; ix++) {
                u0yx = u0[iy*X + ix];
                // compute equation for cell ix, iy
                // u1 = u0 + f(u0) * h
            }
        }
        MPI_Irecv(&u1[((Y/P)+Halo)*X], Halo*X, MPI_FLOAT, idp1, 0, MPI_COMM_WORLD, &requests[0]);
        MPI_Irecv(&u1[0], Halo*X, MPI_FLOAT, idm1, 0, MPI_COMM_WORLD, &requests[1]);
        MPI_Isend(&u1[Halo*X], Halo*X, MPI_FLOAT, idm1, 0, MPI_COMM_WORLD, &requests[2]);
        MPI_Isend(&u1[(Y/P)*X], Halo*X, MPI_FLOAT, idp1, 0, MPI_COMM_WORLD, &requests[3]);
        MPI_Waitall(4, requests, MPI_STATUSES_IGNORE);
    }

    int main(int argc, char **argv) {
        int id, P;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &id);
        MPI_Comm_size(MPI_COMM_WORLD, &P);
        idm1 = (id == 0) ? P-1 : id-1;
        idp1 = (id == P-1) ? 0 : id+1;
        ...
        float *u0 = new float[((Y/P)+2*Halo)*X];
        float *u1 = new float[((Y/P)+2*Halo)*X];
        for (int t = 0; t < 1024; t++) {
            euler(u0, u1, h);
            swap(&u0, &u1);
        }
        MPI_Finalize();
    }
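The four non-blocking calls in Listing 9.6 address four slabs of each node's buffer: two halo slabs that are received into and two interior slabs that are sent from. As a minimal single-process sketch (no MPI is involved; `HaloLayout` and `exchange_halos` are illustrative names that do not appear in the generated code), the same offsets can be exercised by copying between per-rank buffers with periodic neighbours:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Offsets (in floats) of the four slabs addressed by the MPI calls in the
// listing. rows stands for Y/P, halo for Halo, X for the row length.
struct HaloLayout {
    std::size_t X, rows, halo;
    std::size_t slab() const { return halo * X; }                // Halo*X floats per message
    std::size_t recv_lo() const { return 0; }                    // &u1[0]
    std::size_t send_lo() const { return halo * X; }             // &u1[Halo*X]
    std::size_t send_hi() const { return rows * X; }             // &u1[(Y/P)*X]
    std::size_t recv_hi() const { return (rows + halo) * X; }    // &u1[((Y/P)+Halo)*X]
    std::size_t total() const { return (rows + 2 * halo) * X; }  // allocation size
};

// Emulates the ring exchange for all ranks at once. Each rank's low halo is
// filled from the previous rank's last interior rows and its high halo from
// the next rank's first interior rows (periodic, like idm1 and idp1).
// Reads touch interior rows only and writes touch halo rows only, so the
// in-place loop is safe as long as rows >= halo.
void exchange_halos(std::vector<std::vector<float>>& u, const HaloLayout& L) {
    const std::size_t P = u.size();
    assert(L.rows >= L.halo);
    for (std::size_t id = 0; id < P; ++id) {
        assert(u[id].size() == L.total());
        const std::size_t idm1 = (id == 0) ? P - 1 : id - 1;
        const std::size_t idp1 = (id == P - 1) ? 0 : id + 1;
        for (std::size_t k = 0; k < L.slab(); ++k) {
            u[id][L.recv_lo() + k] = u[idm1][L.send_hi() + k];
            u[id][L.recv_hi() + k] = u[idp1][L.send_lo() + k];
        }
    }
}
```

In the real program each rank holds only its own buffer and the copies are replaced by the matching MPI_Isend and MPI_Irecv pairs; the sketch only illustrates which rows end up in which halo.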
9.7 Generating CUDA code
The CUDA generator has another type of parallelism to consider when constructing simulations for the GPU. The generator must create code to allocate lattices in host memory as well as in the GPU device memory. CUDA calls to copy data between these two memory areas must also be created to copy the simulation in and out of the GPU. As there is no way to synchronise between threads in different blocks, multiple CUDA kernels must also be created to compute each stage of the chosen integration method.
CHAPTER 9. AUTOMATIC CODE GENERATION
9.7.1 CUDA Template
Algorithm 18 shows the high-level template the CUDA generator uses to construct simulation code. Unlike the previous templates, a different update function is generated for each integration stage. Separate CUDA kernels are required to compute the different stages because the boundary between two kernel launches is the only way to synchronise all the threads in a CUDA application.
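The need for one update function per stage can be illustrated on the host. The sketch below is an assumption-laden stand-in, not generated code: it uses a two-stage midpoint method and a periodic one-dimensional Laplacian purely for illustration, and each function plays the role of one kernel. Because the second stage reads neighbouring cells of the intermediate lattice, every cell of the first stage must be complete before it starts, which is what the return of the first call (or the end of the first kernel) guarantees.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Periodic 1D Laplacian, standing in for the generated equation code.
static float laplacian(const std::vector<float>& u, std::size_t i) {
    const std::size_t n = u.size();
    return u[(i + n - 1) % n] - 2.0f * u[i] + u[(i + 1) % n];
}

// Stage 1 "kernel": half step from u0 into the intermediate lattice mid.
void stage1(const std::vector<float>& u0, std::vector<float>& mid, float h) {
    assert(mid.size() == u0.size());
    for (std::size_t i = 0; i < u0.size(); ++i)
        mid[i] = u0[i] + 0.5f * h * laplacian(u0, i);
}

// Stage 2 "kernel": full step using the derivative evaluated on mid.
// It reads neighbours of mid, so stage 1 must have finished for all cells.
void stage2(const std::vector<float>& u0, const std::vector<float>& mid,
            std::vector<float>& u1, float h) {
    assert(mid.size() == u0.size() && u1.size() == u0.size());
    for (std::size_t i = 0; i < u0.size(); ++i)
        u1[i] = u0[i] + h * laplacian(mid, i);
}
```

Calling stage1 and then stage2 once per time step mirrors the sequence of kernel launches that the template produces for a two-stage method.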
This generator uses patterns to implement CUDA simulations that use global memory. See Chapters 4 and 5 for more details on the CUDA memory types. This memory type was selected because it showed the highest performance on Fermi architecture GPUs.
Algorithm 18 Pseudo code for generating a CUDA finite-differencing solver. This generator creates one function for each integration step.
    generate includes
    for all stage in Steps do
        generate CUDA kernel
        generate thread id calculation
        generate neighbour access code
        for all equation in stage do
            traverse equation tree to generate equation code
        end for
    end for
    MAIN
    generate CUDA initialisation
    generate parameter allocation
    generate parameter initialisation
    generate lattice allocation
    generate lattice initialisation
    generate CUDA copy data from host to device
    generate CUDA run-time parameters
    generate time step iteration code
    for all stage in Steps do
        generate CUDA call
    end for
    generate end iteration code
    generate CUDA copy data from device to host
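The two loops over the stages in this template can be sketched as a small string-emitting function. This is only an illustration of the control flow: `Stage` and `generate_cuda_skeleton` are hypothetical names, the emitted text is a skeleton with placeholder comments rather than real generator output, and the loop bound `T` in the emitted code is likewise a placeholder.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical description of one integration stage: the kernel name and the
// equation code that has already been generated for that stage.
struct Stage {
    std::string kernel_name;
    std::string equation_code;
};

// Walks the stages twice, as the template does: once to emit one kernel per
// stage, and once inside MAIN to emit one kernel call per stage.
std::string generate_cuda_skeleton(const std::vector<Stage>& steps) {
    assert(!steps.empty());
    std::string out = "#include <cuda_runtime.h>\n\n";
    for (const Stage& s : steps) {
        out += "__global__ void " + s.kernel_name + "(/* lattices, parameters */) {\n";
        out += "    /* thread id calculation */\n";
        out += "    /* neighbour access code */\n";
        out += "    " + s.equation_code + "\n";
        out += "}\n\n";
    }
    out += "int main() {\n";
    out += "    /* allocation, initialisation, host-to-device copies */\n";
    out += "    for (int t = 0; t < T; t++) {\n";
    for (const Stage& s : steps) {
        out += "        " + s.kernel_name + "<<<grid, block>>>(/* arguments */);\n";
    }
    out += "    }\n";
    out += "    /* device-to-host copies */\n";
    out += "}\n";
    return out;
}
```

For a two-stage method the result contains two `__global__` functions and two launches inside the time loop, in stage order.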
This template requires several new patterns. The host allocation and iteration code is the same as described for the C generator. However, CUDA-specific patterns are required to generate code for allocating device memory, copying data to and from the device, configuring and launching kernels on the device, and calculating a thread's index on the device.
9.7.2 CUDA Patterns
To allocate data for the lattices, the generator must now allocate memory on the host as well as on the device. The generator can use the same pattern as the C generator to allocate lattice
memory on the host. To allocate memory on the device, the generator must produce cudaMalloc
calls. To avoid naming conflicts, this pattern inserts d_ before the lattice name. The pattern for
allocating a lattice on the GPU device is given in Algorithm 19.
Algorithm 19 CUDA Generator - lattice allocation pattern.
    ← type *d_name;
    ← cudaMalloc((void**) &d_name, size*sizeof(type));
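As a minimal sketch of how such a pattern might be instantiated, the function below substitutes a type, a lattice name and a size expression into the two lines of the allocation pattern. `emit_device_alloc` is a hypothetical helper, and the `d_` prefix for the device-side pointer is the naming convention assumed here.

```cpp
#include <cassert>
#include <string>

// Emits the two lines of the device allocation pattern for one lattice.
// type, name and size are substituted verbatim; "d_" + name is the assumed
// name of the device pointer.
std::string emit_device_alloc(const std::string& type,
                              const std::string& name,
                              const std::string& size) {
    assert(!type.empty() && !name.empty() && !size.empty());
    const std::string dname = "d_" + name;
    return type + " *" + dname + ";\n" +
           "cudaMalloc((void**) &" + dname + ", " + size + "*sizeof(" + type + "));\n";
}
```

For example, `emit_device_alloc("float", "u0", "X*Y")` yields a declaration of `d_u0` followed by a `cudaMalloc` call for `X*Y*sizeof(float)` bytes.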
To copy data between the host and device, cudaMemcpy calls must be used. These calls represent
the communication of data between the host and the device. Calls for each lattice will be generated at the start and at the end of the simulation, to copy the input into the device and the result back out. Two patterns are used to copy a lattice in and out of the device. These two patterns are shown in Algorithm 20.
Algorithm 20 CUDA Generator - lattice communication pattern.
    ← cudaMemcpy(d_name, name, size*sizeof(type), cudaMemcpyHostToDevice);
    ← cudaMemcpy(name, d_name, size*sizeof(type), cudaMemcpyDeviceToHost);
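The two copy patterns differ only in the order of the pointers and in the direction flag, which a sketch can make explicit. `CopyDirection` and `emit_lattice_copy` are hypothetical names, and the `d_` prefix for the device pointer is again an assumption; the argument order follows `cudaMemcpy(dst, src, count, kind)`.

```cpp
#include <cassert>
#include <string>

enum class CopyDirection { HostToDevice, DeviceToHost };

// Emits one cudaMemcpy line for a lattice. The destination always comes
// first, so the host and device names swap places with the direction.
std::string emit_lattice_copy(const std::string& type, const std::string& name,
                              const std::string& size, CopyDirection dir) {
    assert(!type.empty() && !name.empty() && !size.empty());
    const std::string host = name;
    const std::string dev = "d_" + name;  // assumed device-side name
    const std::string bytes = size + "*sizeof(" + type + ")";
    if (dir == CopyDirection::HostToDevice) {
        return "cudaMemcpy(" + dev + ", " + host + ", " + bytes +
               ", cudaMemcpyHostToDevice);\n";
    }
    return "cudaMemcpy(" + host + ", " + dev + ", " + bytes +
           ", cudaMemcpyDeviceToHost);\n";
}
```

A generator would call this once per lattice with `HostToDevice` before the time loop and once per lattice with `DeviceToHost` after it.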
To launch kernels in a CUDA program, code that defines the size of a block and of a grid must be generated. The dimensionality of these constructs will depend on the dimensionality of the desired simulation. The following pattern can be used to create the code for these two structures with an
array grain which specifies the size of the block in each dimension. Algorithm 21 shows the pattern
to create the constructs.
Algorithm 21 CUDA Generator - lattice partition pattern.
    size = dim[1]
    for d = 2..D-1 do
        size = size*dim[d]
    end for
    ← dim3 block(grain[0], grain[1], grain[2]);
    ← dim3 grid(dim[0]/grain[0], size/(grain[1]*grain[2]));
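The arithmetic of the partition pattern can be checked on the host with a small sketch. `Dim3`, `Launch` and `partition` are hypothetical stand-ins so that the code compiles without the CUDA toolkit. Two assumptions go beyond the pattern: the product starts at 1 so that a one-dimensional lattice is also handled, and the lattice extents are required to be exact multiples of the grain because the pattern uses integer division.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-in for CUDA's dim3.
struct Dim3 { unsigned x, y, z; };

struct Launch { Dim3 block; Dim3 grid; };

// dim: lattice extent in each dimension; grain: block extent in x, y, z.
// All dimensions beyond x are folded into grid.y, as in the pattern.
Launch partition(const std::vector<unsigned>& dim, const Dim3& grain) {
    assert(!dim.empty());
    assert(grain.x > 0 && grain.y > 0 && grain.z > 0);
    unsigned size = 1;
    for (std::size_t d = 1; d < dim.size(); ++d) size *= dim[d];
    // Assumed: extents divide evenly, otherwise cells would be left uncovered.
    assert(dim[0] % grain.x == 0 && size % (grain.y * grain.z) == 0);
    Launch l;
    l.block = {grain.x, grain.y, grain.z};
    l.grid = {dim[0] / grain.x, size / (grain.y * grain.z), 1};
    return l;
}
```

For a 64 x 32 x 16 lattice with a grain of 16 x 8 x 4, this gives a grid of 4 blocks in x and 16 blocks in y, where the y extent of the grid enumerates both the y and z block coordinates.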
CUDA threads created using these structures must calculate their unique index. Different calculations will be required to compute the index in each dimension. The patterns in Algorithm 22 will construct code to calculate the indexes of the thread depending on the dimensionality
of the simulation D.
The final pattern is the call that CUDA uses to launch a grid of threads to compute the update functions. As each stage of the integration must be computed by a different kernel, multiple kernel
calls will be required. The code to call an update function function_name with parameters parame-
Algorithm 22 CUDA Generator - index calculation pattern.
    if D == 1 then
        ← int idim[0] = (blockIdx.x*blockDim.x) + threadIdx.x;
    else if D == 2 then
        ← int idim[0] = (blockIdx.x*blockDim.x) + threadIdx.x;
        ← int idim[1] = (blockIdx.y*blockDim.y) + threadIdx.y;
    else if D == 3 then
        ← int idim[0] = (blockIdx.x*blockDim.x) + threadIdx.x;
        ← int idim[1] = ((blockIdx.y
        ← int idim[2] = ((blockIdx.y/(dim[1]/blockDim.y))*blockDim.z) + threadIdx.z;
    else
        ← int k = (threadIdx.z*(gridDim.y*blockDim.y*gridDim.x*blockDim.x)) + (((blockIdx.y*blockDim.y) + threadIdx.y)*(gridDim.x*blockDim.x)) + (blockIdx.x*blockDim.x) + threadIdx.x;
        mod = 1
        div = 1
        for d = 0..D-1 do
            mod = mod * dim[d]
            ← int idim[d] = (k%mod)/div;
            div = div * dim[d]
        end for
    end if
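The three-dimensional case can be checked on the host by emulating every thread of a launch and counting how often each lattice cell is produced. In this sketch `Idx3`, `cell_index_3d` and `coverage_3d` are hypothetical names. The x and z expressions follow the pattern; the y expression is an assumption, taken to be the remainder that matches the quotient used for z, since `blockIdx.y` has to carry both the y and the z block coordinates.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Idx3 { unsigned x, y, z; };

// CPU emulation of the D == 3 index calculation for one thread.
// blockIdx.y / (dim[1]/blockDim.y) selects the z block; the y block is
// ASSUMED to be blockIdx.y % (dim[1]/blockDim.y).
Idx3 cell_index_3d(Idx3 blockIdx, Idx3 threadIdx, Idx3 blockDim,
                   const unsigned dim[3]) {
    const unsigned blocks_y = dim[1] / blockDim.y;
    Idx3 i;
    i.x = blockIdx.x * blockDim.x + threadIdx.x;
    i.y = (blockIdx.y % blocks_y) * blockDim.y + threadIdx.y;  // assumed
    i.z = (blockIdx.y / blocks_y) * blockDim.z + threadIdx.z;
    return i;
}

// Runs every emulated thread of the launch and counts the visits per cell.
std::vector<unsigned> coverage_3d(const unsigned dim[3], Idx3 blockDim) {
    assert(dim[0] % blockDim.x == 0 && dim[1] % blockDim.y == 0 &&
           dim[2] % blockDim.z == 0);
    const unsigned gx = dim[0] / blockDim.x;
    const unsigned gy = (dim[1] * dim[2]) / (blockDim.y * blockDim.z);
    std::vector<unsigned> hits(std::size_t(dim[0]) * dim[1] * dim[2], 0u);
    for (unsigned by = 0; by < gy; ++by)
        for (unsigned bx = 0; bx < gx; ++bx)
            for (unsigned tz = 0; tz < blockDim.z; ++tz)
                for (unsigned ty = 0; ty < blockDim.y; ++ty)
                    for (unsigned tx = 0; tx < blockDim.x; ++tx) {
                        Idx3 i = cell_index_3d({bx, by, 0}, {tx, ty, tz}, blockDim, dim);
                        ++hits[(std::size_t(i.z) * dim[1] + i.y) * dim[0] + i.x];
                    }
    return hits;
}
```

If every entry of the returned vector equals one, the launch covers the lattice exactly once, which is the property the index calculation has to provide.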
Algorithm 23 CUDA Generator - kernel call pattern.
    ← function_name<<<grid, block>>>(parameters);