
dering cells necessary to compute the simulation. In the update function each node computes the equation for its section of the field and communicates the borders to its neighbours using MPI_Isend and MPI_Irecv. This code example processes the entire lattice at once and then communicates the data to the neighbours.

Listing 9.6: Generated MPI code for a two-dimensional finite differencing simulation using Euler integration. euler performs a single computation step for each node's field and communicates the borders to the neighbouring nodes.

    void euler(float *u0, float *u1, float h) {
        for (int iy = Halo; iy < Y/P+Halo; iy++) {
            for (int ix = 0; ix < X; ix++) {
                u0yx = u0[iy*X + ix];
                // compute equation for cell ix, iy
                // u1 = u0 + f(u0) * h
            }
        }

        // receive the lower halo rows from node id+1 and the upper halo rows from node id-1
        MPI_Irecv(&u1[((Y/P)+Halo)*X], Halo*X, MPI_FLOAT, idp1, 0, MPI_COMM_WORLD, &requests[0]);
        MPI_Irecv(&u1[0], Halo*X, MPI_FLOAT, idm1, 0, MPI_COMM_WORLD, &requests[1]);
        // send the first interior rows to node id-1 and the last interior rows to node id+1
        MPI_Isend(&u1[Halo*X], Halo*X, MPI_FLOAT, idm1, 0, MPI_COMM_WORLD, &requests[2]);
        MPI_Isend(&u1[Y/P*X], Halo*X, MPI_FLOAT, idp1, 0, MPI_COMM_WORLD, &requests[3]);
        // wait until all four transfers have completed
        MPI_Waitall(4, requests, MPI_STATUSES_IGNORE);
    }

    int main(int argc, char **argv) {
        int id, P;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &id);
        MPI_Comm_size(MPI_COMM_WORLD, &P);
        // neighbouring nodes, with periodic wrap-around
        idm1 = (id == 0) ? P-1 : id-1;
        idp1 = (id == P-1) ? 0 : id+1;
        ...
        float *u0 = new float[((Y/P)+2*Halo)*X];
        float *u1 = new float[((Y/P)+2*Halo)*X];
        for (int t = 0; t < 1024; t++) {
            euler(u0, u1, h);
            swap(&u0, &u1);
        }
        MPI_Finalize();
    }
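The buffer offsets and neighbour ranks used by the halo exchange in Listing 9.6 can be checked in isolation. The following is a minimal C++ sketch, not part of the generated code: the names HaloOffsets, haloOffsets, prevRank and nextRank are hypothetical, and rowsPerNode stands for Y/P in the listing.

```cpp
#include <cstddef>

// Hypothetical helper (not generated code): element offsets used by the halo
// exchange for a slab of rowsPerNode interior rows of width X, with `halo`
// ghost rows above and below the interior.
struct HaloOffsets {
    std::size_t recvBottom; // ghost rows filled by node id+1
    std::size_t recvTop;    // ghost rows filled by node id-1
    std::size_t sendTop;    // first interior rows, sent to node id-1
    std::size_t sendBottom; // last interior rows, sent to node id+1
    std::size_t count;      // number of elements in each message
};

HaloOffsets haloOffsets(std::size_t X, std::size_t rowsPerNode, std::size_t halo) {
    return { (rowsPerNode + halo) * X, 0, halo * X, rowsPerNode * X, halo * X };
}

// Periodic ring neighbours, computed as in main() of the listing.
int prevRank(int id, int P) { return (id == 0) ? P - 1 : id - 1; }
int nextRank(int id, int P) { return (id == P - 1) ? 0 : id + 1; }
```

With X = 8, four interior rows per node and a halo of one row, the lower ghost row starts at element 40 and the last interior row at element 32, so the send and receive regions do not overlap.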

9.7 Generating CUDA code

The CUDA generator has another type of parallelism to consider when constructing simulations for the GPU. The generator must create code to allocate lattices in host memory as well as in the GPU device memory. CUDA calls to copy data between these two memory areas must also be created to copy the simulation in and out of the GPU. As there is no way to synchronise between threads in different blocks, multiple CUDA kernels must also be created to compute each stage of the chosen integration method.

CHAPTER 9. AUTOMATIC CODE GENERATION

9.7.1 CUDA Template

Algorithm 18 shows the high-level template the CUDA generator uses to construct simulation code. Unlike the previous templates, a different update function is generated for each integration stage. Each stage needs its own CUDA kernel because the boundary between kernel calls is the only point at which all the threads in a CUDA application are synchronised.

This generator uses patterns to implement CUDA simulations that use global memory. See Chapters 4 and 5 for more details on the CUDA memory types. This memory type was selected because it showed the highest performance on Fermi architecture GPUs.

Algorithm 18 Pseudo code for generating a CUDA finite-differencing solver. This generator creates one function for each integration step.

    generate includes
    for all stage in Steps do
        generate CUDA kernel
        generate thread id calculation
        generate neighbour access code
        for all equation in stage do
            traverse equation tree to generate equation code
        end for
    end for
    MAIN
    generate CUDA initialisation
    generate parameter allocation
    generate parameter initialisation
    generate lattice allocation
    generate lattice initialisation
    generate CUDA copy data from host to device
    generate CUDA run-time parameters
    generate time step iteration code
    for all stage in Steps do
        generate CUDA call
    end for
    generate end iteration code
    generate CUDA copy data from device to host

This template requires the use of several new patterns. The host allocation and iteration code is the same as described for the C generator. However, CUDA-specific patterns are required to generate code for allocating device memory, copying data in and out of the device, calling kernels on the device, configuring the kernel calls and calculating an index on the device.

9.7.2 CUDA Patterns

To allocate data for the lattices, the generator must now allocate memory space on the host as well as in device memory. The generator can use the same pattern as the C generator to allocate lattice memory on the host. To allocate memory on the device the generator must produce cudaMalloc calls. To avoid naming conflicts, this pattern inserts d before the lattice name. The pattern for allocating a lattice on the GPU device is given in Algorithm 19.

Algorithm 19 CUDA Generator - lattice allocation pattern.

    ← type* dname;
    ← cudaMalloc((void**) &dname, size*sizeof(type));
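As an illustration, the allocation pattern can be written as a function that substitutes the type, name and size placeholders. This is a minimal C++ sketch; emitDeviceAlloc is a hypothetical name, and the exact whitespace of the emitted text is an assumption.

```cpp
#include <string>

// Hypothetical emitter for the lattice allocation pattern: declares a device
// pointer named "d" + name and emits the matching cudaMalloc call.
std::string emitDeviceAlloc(const std::string& type,
                            const std::string& name,
                            const std::string& size) {
    const std::string dname = "d" + name; // prefix avoids clashing with the host lattice
    return type + " *" + dname + ";\n" +
           "cudaMalloc((void**) &" + dname + ", " + size + "*sizeof(" + type + "));\n";
}
```

For example, emitDeviceAlloc("float", "u0", "X*Y") produces a declaration of du0 followed by a cudaMalloc call for X*Y*sizeof(float) bytes.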

To copy data between the host and device, cudaMemcpy calls must be used. These calls represent the communication of data between the host and the device. Calls for each lattice will be generated at the start and at the end of the simulation, to copy the input into the device and the result back out. Two patterns are used to copy a lattice in and out of the device. These two patterns are shown in Algorithm 20.

Algorithm 20 CUDA Generator - lattice communication pattern.

    ← cudaMemcpy(dname, name, size*sizeof(type), cudaMemcpyHostToDevice);
    ← cudaMemcpy(name, dname, size*sizeof(type), cudaMemcpyDeviceToHost);
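The two communication patterns differ only in the order of the pointers and in the direction flag, so a single emitter can cover both. The sketch below is a C++ illustration under assumptions: emitLatticeCopy and CopyDir are hypothetical names, and only the emitted cudaMemcpy text follows the pattern above.

```cpp
#include <string>

// Hypothetical direction selector for the two communication patterns.
enum class CopyDir { HostToDevice, DeviceToHost };

// Hypothetical emitter: destination comes first in cudaMemcpy, so the host and
// device names swap places depending on the direction.
std::string emitLatticeCopy(const std::string& type,
                            const std::string& name,
                            const std::string& size,
                            CopyDir dir) {
    const std::string dname = "d" + name;
    const std::string bytes = size + "*sizeof(" + type + ")";
    if (dir == CopyDir::HostToDevice) {
        return "cudaMemcpy(" + dname + ", " + name + ", " + bytes +
               ", cudaMemcpyHostToDevice);\n";
    }
    return "cudaMemcpy(" + name + ", " + dname + ", " + bytes +
           ", cudaMemcpyDeviceToHost);\n";
}
```

A generator built this way would call the emitter once per lattice with HostToDevice before the time-step loop and once with DeviceToHost after it.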

To construct a CUDA program, the code that defines the size of a block and of a grid must be generated. The dimensionality of these constructs depends on the dimensionality of the desired simulation. The following pattern creates the code for these two structures from an array grain, which specifies the size of the block in each dimension. Algorithm 21 shows the pattern to create the constructs.

Algorithm 21 CUDA Generator - lattice partition pattern.

    size = dim[1]
    for d = 2..D-1 do
        size = size*dim[d]
    end for
    ← dim3 block(grain[0], grain[1], grain[2]);
    ← dim3 grid(dim[0]/grain[0], size/(grain[1]*grain[2]));
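The arithmetic behind the partition pattern can be evaluated on the host to see which block and grid sizes result. This is a C++ sketch under assumptions: Dim3 is a stand-in for CUDA's dim3, partition and Launch are hypothetical names, unused entries of grain are taken to be 1, every extent is assumed to be an exact multiple of the grain, and the D = 1 case (where dim[1] does not exist) is treated as size = 1.

```cpp
#include <array>
#include <cstddef>
#include <vector>

struct Dim3 { unsigned x, y, z; };        // stand-in for CUDA's dim3
struct Launch { Dim3 block; Dim3 grid; }; // hypothetical result type

// Evaluates the lattice partition pattern numerically. All dimensions above
// the first are folded into the y extent of the grid.
Launch partition(const std::vector<unsigned>& dim,
                 const std::array<unsigned, 3>& grain) {
    unsigned size = (dim.size() > 1) ? dim[1] : 1; // assumption for D == 1
    for (std::size_t d = 2; d < dim.size(); ++d) {
        size *= dim[d];
    }
    Launch l;
    l.block = { grain[0], grain[1], grain[2] };
    l.grid  = { dim[0] / grain[0], size / (grain[1] * grain[2]), 1 };
    return l;
}
```

For a 64 x 32 x 16 lattice with grain (16, 4, 4), this gives a 4 x 32 grid of 16 x 4 x 4 blocks, so the grid launches 128 blocks of 256 threads, one thread per cell.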

CUDA threads created using these structures must calculate their unique index. Different calculations will be required to compute the index in each dimension. The patterns in Algorithm 22 construct code to calculate the indexes of the thread depending on the dimensionality of the simulation D.

The final pattern is the call that CUDA uses to launch a grid of threads to compute the update functions. As each stage of the integration must be computed by a different kernel, multiple kernel calls will be required. The code to call an update function function name with parameters parameters is given by the pattern in Algorithm 23.


Algorithm 22 CUDA Generator - index calculation pattern.

    if D == 1 then
        ← int idim[0] = (blockIdx.x*blockDim.x) + threadIdx.x;
    else if D == 2 then
        ← int idim[0] = (blockIdx.x*blockDim.x) + threadIdx.x;
        ← int idim[1] = (blockIdx.y*blockDim.y) + threadIdx.y;
    else if D == 3 then
        ← int idim[0] = (blockIdx.x*blockDim.x) + threadIdx.x;
        ← int idim[1] = ((blockIdx.y%(dim[1]/blockDim.y))*blockDim.y) + threadIdx.y;
        ← int idim[2] = ((blockIdx.y/(dim[1]/blockDim.y))*blockDim.z) + threadIdx.z;
    else
        ← int k = (threadIdx.z*(gridDim.y*blockDim.y*gridDim.x*blockDim.x)) + (((blockIdx.y*blockDim.y) + threadIdx.y)*(gridDim.x*blockDim.x)) + (blockIdx.x*blockDim.x) + threadIdx.x;
        mod = 1
        div = 1
        for d = 0..D-1 do
            mod = mod * dim[d]
            ← int idim[d] = (k%mod)/div;
            div = div * dim[d]
        end for
    end if
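The three-dimensional case can be checked on the host by emulating every block and thread of a launch. This is a C++ sketch, not generated code: Idx3, cellIndex3D and coversLatticeOnce are hypothetical names, the grid shape follows Algorithm 21, and the y component is written as the remainder of blockIdx.y by dim[1]/blockDim.y, an assumption chosen to be consistent with the z component, which uses the quotient.

```cpp
#include <cstddef>
#include <vector>

struct Idx3 { unsigned x, y, z; }; // stand-in for blockIdx, threadIdx and blockDim

// Host-side emulation of the D == 3 index calculation. blockIdx.y carries both
// the y-block (remainder) and the z-block (quotient) of the thread's block.
Idx3 cellIndex3D(Idx3 blockIdx, Idx3 threadIdx, Idx3 blockDim, unsigned dim1) {
    const unsigned blocksY = dim1 / blockDim.y;
    Idx3 i;
    i.x = blockIdx.x * blockDim.x + threadIdx.x;
    i.y = (blockIdx.y % blocksY) * blockDim.y + threadIdx.y;
    i.z = (blockIdx.y / blocksY) * blockDim.z + threadIdx.z;
    return i;
}

// Returns true if the emulated launch visits every cell of a
// dim0 x dim1 x dim2 lattice exactly once.
bool coversLatticeOnce(unsigned dim0, unsigned dim1, unsigned dim2, Idx3 blockDim) {
    std::vector<int> hits(static_cast<std::size_t>(dim0) * dim1 * dim2, 0);
    const unsigned gridX = dim0 / blockDim.x;
    const unsigned gridY = (dim1 * dim2) / (blockDim.y * blockDim.z);
    for (unsigned by = 0; by < gridY; ++by)
        for (unsigned bx = 0; bx < gridX; ++bx)
            for (unsigned tz = 0; tz < blockDim.z; ++tz)
                for (unsigned ty = 0; ty < blockDim.y; ++ty)
                    for (unsigned tx = 0; tx < blockDim.x; ++tx) {
                        Idx3 i = cellIndex3D(Idx3{bx, by, 0u}, Idx3{tx, ty, tz},
                                             blockDim, dim1);
                        if (i.x >= dim0 || i.y >= dim1 || i.z >= dim2) return false;
                        hits[(static_cast<std::size_t>(i.z) * dim1 + i.y) * dim0 + i.x]++;
                    }
    for (int h : hits) {
        if (h != 1) return false;
    }
    return true;
}
```

The check relies on every lattice extent being an exact multiple of the corresponding block extent; a generator that allows other sizes would also need a bounds test in the kernel.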

Algorithm 23 CUDA Generator - kernel call pattern.

    ← function name<<<grid, block>>>(parameters);
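The kernel call pattern amounts to joining the parameter list and wrapping it in the launch syntax. The following C++ sketch shows one way to emit it: emitKernelCall is a hypothetical name, and the emitted text assumes that variables called grid and block were produced earlier by the partition pattern.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical emitter for the kernel call pattern: the parameters are joined
// with commas and placed after the <<<grid, block>>> launch configuration.
std::string emitKernelCall(const std::string& functionName,
                           const std::vector<std::string>& parameters) {
    std::string args;
    for (std::size_t i = 0; i < parameters.size(); ++i) {
        if (i > 0) args += ", ";
        args += parameters[i];
    }
    return functionName + "<<<grid, block>>>(" + args + ");\n";
}
```

A generator would emit one such call per integration stage inside the time-step loop, for example emitKernelCall("euler", {"du0", "du1", "h"}).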

In document Generative programming methods for parallel partial differential field equation solvers : a thesis presented in partial fulfilment of the requirements for the degree of Doctor of Philosophy in Computer Science at Massey University, Albany, New Zealand (Page 146-150)