We now consider the prototypical, embarrassingly parallel problem of adding two vectors. We follow Sec 4.2.1 in [129] closely.
#include <stdio.h>
#include <stdlib.h>

// Constant, accessible by both device and host functions
#define N 5

__global__ void add( int *v, int *u, int *w )
{
    // Thread index: one block per element, so the block index selects the element
    int i = blockIdx.x;
    // This ensures we do not try to access memory locations outside of the vectors
    if (i < N)
        w[i] = u[i] + v[i];
}

int main(void)
{
    // Loop index
    int i;
    // Declare vectors (pointers, equivalent to arrays)
    int *host_u, *host_v, *host_w;
    int *dev_u, *dev_v, *dev_w;
    // Allocate host memory for the host vectors
    host_u = (int *)malloc( N * sizeof(int) );
    host_v = (int *)malloc( N * sizeof(int) );
    host_w = (int *)malloc( N * sizeof(int) );
    // Allocate device memory for the device vectors
    cudaMalloc( (void **)&dev_u, N * sizeof(int) );
    cudaMalloc( (void **)&dev_v, N * sizeof(int) );
    cudaMalloc( (void **)&dev_w, N * sizeof(int) );
    // Fill in host vectors, setting w to values different from the sum
    for( i = 0; i < N; i++ )
    {
        host_u[i] = i;
        host_v[i] = -i;
        host_w[i] = i;
    }
    // Print w for later comparison:
    printf( "Vector w is initialized to be [ " );
    for( i = 0; i < N; i++ )
    {
        printf( "%d, ", host_w[i] );
    }
    printf( "]\n\n" );
    // Copy these to device vectors u and v
    cudaMemcpy( dev_u, host_u, N * sizeof(int), cudaMemcpyHostToDevice );
    cudaMemcpy( dev_v, host_v, N * sizeof(int), cudaMemcpyHostToDevice );
    // Now we can add them
    add<<<N,1>>>( dev_u, dev_v, dev_w );
    // We want to read out the result, so copy device w to host
    cudaMemcpy( host_w, dev_w, N * sizeof(int), cudaMemcpyDeviceToHost );
    // We should get all 0s:
    printf( "The resulting vector is w = [ " );
    for( i = 0; i < N; i++ )
    {
        printf( "%d, ", host_w[i] );
    }
    printf( "]\n" );
    // Free memory
    free( host_u );
    free( host_v );
    free( host_w );
    cudaFree( dev_u );
    cudaFree( dev_v );
    cudaFree( dev_w );
    return 0;
}
This program is in some sense deceptively long. In particular, we note that 17 of the 31 lines of code are simply memory management. However, we’re using a variant of C, so we would expect nothing different! Let’s tackle these memory lines by first walking through the host function main.
F.3.1 Host side: Memory management and considerations
After we declare the variables, we allocate memory for them. malloc and cudaMalloc both allow for dynamic arrays, so that we can input the size on the command line, for instance. The cudaMalloc lines in particular are somewhat intimidating. Note that for neither memory allocation routine does the type matter; all that matters is that the right size is allocated, which is ensured by the (last) argument. For malloc, the type is “fixed” ex post facto by typecasting the returned value to the variable type int *. The cudaMalloc function instead allocates memory by accepting the address of the pointer (e.g., &dev_u) and writing the location of the allocated device memory into it. Since the type does not matter, only the amount of memory, the function only accepts void pointers. Of course, dev_u itself is a pointer, so we typecast its address as a pointer to a pointer of type void: (void **)&dev_u. Thus, we see why above, in Sec F.2.2, the host side knew about the device variable data: it was declared and allocated memory on the host side beforehand.
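The role of the (void **) cast can be made concrete with a minimal sketch. The helper name alloc_device_ints is hypothetical and not part of the listing above; it only wraps cudaMalloc and checks the cudaError_t value that the call returns, which the listing ignores for brevity.

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Hypothetical helper (an assumption for illustration, not from the listing):
// allocate n ints on the device and abort with a message if this fails.
int *alloc_device_ints( size_t n )
{
    int *dev_ptr = NULL;
    // cudaMalloc writes the device address into dev_ptr, so it needs the
    // address of the pointer itself, cast to the generic type void **.
    cudaError_t err = cudaMalloc( (void **)&dev_ptr, n * sizeof(int) );
    if (err != cudaSuccess)
    {
        fprintf( stderr, "cudaMalloc failed: %s\n", cudaGetErrorString( err ) );
        exit( EXIT_FAILURE );
    }
    return dev_ptr;
}

int main(void)
{
    // The returned pointer holds a device address; the host can pass it
    // around, but must not dereference it.
    int *dev_u = alloc_device_ints( 5 );
    printf( "Device allocation succeeded\n" );
    cudaFree( dev_u );
    return 0;
}
```

Returning the pointer, as malloc does, would hide the error code; the out-parameter form lets cudaMalloc report success or failure through its return value instead.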
Additionally, although the device functions may manipulate the data passed to them, we usually want the data to start from a chosen set of initial values. These can be set either on the host side or on the device side. Here, we see the former method: we fill in the host variables, and then copy them to the device variables. We could instead have initialized the values in the device function before adding them, saving ourselves this trouble. However, the values may come from external sources that must be read in via host code, forcing us to implement a solution like the one here. The worst this does, of course, is add a slight time and code overhead.
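For comparison, the device-side alternative mentioned above might look like the following sketch. The kernel init and the helper init_on_device are hypothetical names introduced only for illustration; they fill u and v with the same values as the host loop in the listing and copy them back so the host can inspect them.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

#define N 5

// Hypothetical kernel (not in the listing above): block i fills element i
__global__ void init( int *u, int *v )
{
    int i = blockIdx.x;
    if (i < N)
    {
        u[i] = i;
        v[i] = -i;
    }
}

// Hypothetical host helper: fill the vectors on the device, then copy them
// back to the host arrays so that they can be inspected.
void init_on_device( int *host_u, int *host_v )
{
    int *dev_u, *dev_v;
    cudaMalloc( (void **)&dev_u, N * sizeof(int) );
    cudaMalloc( (void **)&dev_v, N * sizeof(int) );
    // No host-to-device copy is needed: the values are created on the device
    init<<<N,1>>>( dev_u, dev_v );
    cudaMemcpy( host_u, dev_u, N * sizeof(int), cudaMemcpyDeviceToHost );
    cudaMemcpy( host_v, dev_v, N * sizeof(int), cudaMemcpyDeviceToHost );
    cudaFree( dev_u );
    cudaFree( dev_v );
}

int main(void)
{
    int host_u[N], host_v[N];
    int i;
    init_on_device( host_u, host_v );
    for( i = 0; i < N; i++ )
    {
        printf( "u[%d] = %d, v[%d] = %d\n", i, host_u[i], i, host_v[i] );
    }
    return 0;
}
```

This removes the two host-to-device copies, but it only works when the initial values can be computed from the index alone.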
Finally, we are in a position to take advantage of the GPU. A simple call to the device function __global__ void add is all that we need on the host side. The <<<N,1>>> notation, with N = 5 here, indicates we’d like 5 blocks with one thread each. We’ll go over the implementation of this function in just a bit. The result of the computation is stored in dev_w. Of course, we can’t access device variables on the host side, so we must use cudaMemcpy in the reverse direction to copy the values to the host vector host_w. Now we can print the values (perhaps to a file if desired) and finally free the memory that was allocated to the host and device variables. Now, let’s take a look at the device function add.
F.3.2 Device side: N instantiations
We can immediately identify the device function by the __global__ qualifier, indicating a device function that can be called from host code. Of course, since it is called from the host side, it can’t return any values, and so it is given the return type void. Its arguments are the 3 pointers (i.e., arrays; i.e., vectors). When this function is called, the block-thread structure was specified as N blocks, 1 thread per block. Thus, N instantiations of add are created, where each is indexed by its block and thread coordinates. The block indices are given as blockIdx.x and blockIdx.y, while the thread indices are similarly given as threadIdx.x, threadIdx.y, and threadIdx.z. Thus, the local variable i is set to 0 in the first instantiation, and is set to N-1 (=4) in the last instantiation. Each instantiation has the very simple task of assigning the sum of u[i] and v[i] to a third variable, w[i]. Of course, if we had accidentally called for more threads than there were elements of the array, memory locations that don’t belong to dev_w would be written to, with potentially unpleasant results ensuing. To ensure we only write to memory locations that are indeed contained within dev_w, we add a simple check condition: proceed with the operation only if the index is less than the size of the array.
Somewhat embarrassingly, this simple example very nearly covers the extent to which our simulation code employs the GPU, as the integration step is just vector addition. The coupling of the populations is the only other aspect of the network integration that will benefit from some GPU treatment, and we use the GPU for visualization as well. Of course, there are many bells and whistles that we’ve added to make it more functional, but nearly all of these are implemented on the host side.
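The index check described above matters most once blocks contain more than one thread. The following is a minimal sketch under that assumption, not the simulation code itself: the names add_general and add_on_device, the choice of 4 threads per block, and the extra size argument n are all introduced here for illustration. With n = 5 the launch creates 2 blocks of 4 threads, so 8 instantiations exist and the last 3 are stopped by the check.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// Hypothetical generalization of add: the global index combines the block
// and thread coordinates, and n is passed in instead of a compile-time N.
__global__ void add_general( const int *u, const int *v, int *w, int n )
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Instantiations with i >= n do nothing, so no memory outside w is touched
    if (i < n)
        w[i] = u[i] + v[i];
}

// Hypothetical host helper: copy in, launch, copy out, free
void add_on_device( const int *host_u, const int *host_v, int *host_w, int n )
{
    int *dev_u, *dev_v, *dev_w;
    size_t bytes = n * sizeof(int);
    int threads_per_block = 4;  // assumed value, chosen so that n = 5 does not divide evenly
    // Round up so that every element is covered by some thread
    int blocks = (n + threads_per_block - 1) / threads_per_block;

    cudaMalloc( (void **)&dev_u, bytes );
    cudaMalloc( (void **)&dev_v, bytes );
    cudaMalloc( (void **)&dev_w, bytes );
    cudaMemcpy( dev_u, host_u, bytes, cudaMemcpyHostToDevice );
    cudaMemcpy( dev_v, host_v, bytes, cudaMemcpyHostToDevice );
    add_general<<<blocks, threads_per_block>>>( dev_u, dev_v, dev_w, n );
    cudaMemcpy( host_w, dev_w, bytes, cudaMemcpyDeviceToHost );
    cudaFree( dev_u );
    cudaFree( dev_v );
    cudaFree( dev_w );
}

int main(void)
{
    int u[5] = { 0, 1, 2, 3, 4 };
    int v[5] = { 0, -1, -2, -3, -4 };
    int w[5] = { 0, 1, 2, 3, 4 };
    int i;
    add_on_device( u, v, w, 5 );
    for( i = 0; i < 5; i++ )
    {
        printf( "%d, ", w[i] );
    }
    printf( "\n" );
    return 0;
}
```

Rounding the block count up and guarding with the index check is a common pattern because it keeps the kernel correct for any n, at the cost of a few idle threads in the last block.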