2.12 Structure of the OpenCL Host Program
2.12.4 Computation
At this stage, the actual computations are performed. Here, the buffers on the device are allocated or freed and data transfers are performed. Listing 2.78 gives a very short example that computes a vector sum. The kernel has been initialized in section 2.12.2, and it is time to use it in a real computation. The example alone will not show any improvement in performance compared to the sequential version,
file = fopen(file_name, "rb");
if (file != NULL) {
    printf("Loading binary file '%s'\n", file_name);
    fread(&binary_program_size, sizeof(binary_program_size), 1, file);
    binary_program = (unsigned char *)malloc(binary_program_size);
    fread(binary_program, binary_program_size, 1, file);
    program = clCreateProgramWithBinary(context, 1, context_devices,
        &binary_program_size, (const unsigned char **)&binary_program, NULL, &r);
    if (r != CL_SUCCESS) exit(-1);
    fclose(file);
    if (clBuildProgram(program, 0, NULL, NULL, NULL, NULL) != CL_SUCCESS)
        exit(-1);
} else {
    printf("Saving binary file '%s'\n", file_name);
    // load program into context
    program = clCreateProgramWithSource(context, 1,
        (const char **)(&vectSumSrc), NULL, &r);
    // compile program
    if (clBuildProgram(program, 0, NULL, NULL, NULL, NULL) != CL_SUCCESS) exit(-1);
    // save binary data
    clGetProgramInfo(program, CL_PROGRAM_BINARY_SIZES, sizeof(size_t),
        &binary_program_size, NULL);
    binary_program = (unsigned char *)malloc(binary_program_size);
    clGetProgramInfo(program, CL_PROGRAM_BINARIES, sizeof(unsigned char **),
        &binary_program, &str_size);
    file = fopen(file_name, "wb");
    fwrite(&binary_program_size, 1, sizeof(size_t), file);
    fwrite(binary_program, 1, binary_program_size, file);
    fclose(file);
}
// create kernel object
vectSumKernel = clCreateKernel(program, "vectSum", &r);
if (r != CL_SUCCESS) exit(-1);
free(binary_program);
Listing 2.76: Selecting the source or binary form of an OpenCL program. The example shows one approach to caching the compiled program.
because the memory transfers will dominate the running time of such a small computation. As part of a larger project, however, this method can be very useful.
The first lines in listing 2.78 prepare the host buffers that store the input vectors and the buffer for the result vector. In a real application, these vectors would be much longer and initialized in some more sophisticated way, for example from a file. The variable vectorSize stores the size of the input vectors. Next, there are three variables of the type cl_mem, which will be initialized later. These are handles to the memory buffers on the OpenCL device. This is the way to reference memory objects from the host program.
The API call clCreateCommandQueue creates a command queue object. Every computation and data transfer is performed by putting commands into a queue. The third parameter – properties – is set to 0, meaning that this queue is created using default options. Commands will be executed in order, and profiling is disabled.
cl_uint program_dev_n;
clGetProgramInfo(program, CL_PROGRAM_NUM_DEVICES, sizeof(cl_uint), &program_dev_n, NULL);
size_t binaries_sizes[program_dev_n];
clGetProgramInfo(program, CL_PROGRAM_BINARY_SIZES, program_dev_n * sizeof(size_t), binaries_sizes, NULL);
// one pointer per device; each points to a buffer for that device's binary
unsigned char **binaries = malloc(sizeof(unsigned char *) * program_dev_n);
for (size_t i = 0; i < program_dev_n; i++)
    binaries[i] = malloc(binaries_sizes[i] + 1);
// param_value_size is the size of the pointer array, not of the sizes array
clGetProgramInfo(program, CL_PROGRAM_BINARIES, program_dev_n * sizeof(unsigned char *), binaries, NULL);
Listing 2.77: Getting the binary representation for all devices.
1 cl_float vector1 [3] = { 1, 2, 3.1 }; // sample input data
2 cl_float vector2 [3] = { 1.5, 0.1, -1 };
3 cl_float result [3];
4 cl_int vectorSize = 3; // size of input data
5 cl_mem vector1_dev, vector2_dev, result_dev; // memory on the device
6 printf("vector1 = (%.1f,%.1f,%.1f)\n", vector1 [0], vector1 [1], vector1 [2]);
7 printf("vector2 = (%.1f,%.1f,%.1f)\n", vector2 [0], vector2 [1], vector2 [2]);
8 // next we have to prepare command queue
9 queue = clCreateCommandQueue(context, context_devices [0], 0, &r);
10 // we have to prepare memory buffer on the device
11 vector1_dev = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
12 vectorSize * sizeof(cl_float), vector1, &r);
13 vector2_dev = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
14 vectorSize * sizeof(cl_float), vector2, &r);
15 result_dev = clCreateBuffer(context, CL_MEM_WRITE_ONLY,
16 vectorSize * sizeof(cl_float), NULL, &r);
17 // set parameters for kernel
18 clSetKernelArg(vectSumKernel, 0, sizeof(cl_mem), &vector1_dev);
19 clSetKernelArg(vectSumKernel, 1, sizeof(cl_mem), &vector2_dev);
20 clSetKernelArg(vectSumKernel, 2, sizeof(cl_mem), &result_dev);
21 clSetKernelArg(vectSumKernel, 3, sizeof(cl_int), &vectorSize);
22 // add execution of kernel to command queue
23 size_t dev_global_work_size [1] = { vectorSize };
24 clEnqueueNDRangeKernel(queue, vectSumKernel,
25 1, NULL, dev_global_work_size, NULL, 0, NULL, NULL);
26 // now we have to read the results and copy them to the result array
27 clEnqueueReadBuffer(queue, result_dev, CL_TRUE, 0,
28 vectorSize * sizeof(cl_float), result, 0, NULL, NULL);
29 printf("result = (%.1f,%.1f,%.1f)\n", result [0], result [1], result [2]);
30 // free memory buffer
31 clReleaseMemObject(vector1_dev);
32 clReleaseMemObject(vector2_dev);
33 clReleaseMemObject(result_dev);
Now it is time to create buffers in the device memory. This is done using clCreateBuffer. In the example, the device buffers that hold the operands are created and initialized with values copied from vector1 and vector2. This behavior is driven by the setting CL_MEM_COPY_HOST_PTR in flags. The flag CL_MEM_READ_ONLY marks the buffer as read-only for kernels. This can allow the driver to optimize memory access and makes the source code less error-prone. The last call, which creates result_dev, only creates the buffer and does not initialize it. That buffer is marked as write-only by the flag CL_MEM_WRITE_ONLY.
Kernel parameters are set using the function clSetKernelArg. This function takes the kernel object, the parameter number, the size of the argument and a pointer to the argument as parameters. The kernel parameters are numbered from 0. The last parameter, vectorSize, is just one element of a simple type, so it is passed directly by giving the address of the host variable, without any memory object. In a production implementation, the errors should be checked, but in the example it is assumed that if the context and kernel have been created successfully, then the rest of the application will work correctly.
Note that the parameters, once set, hold their values. The values passed to the kernel are copied. For example, changing the vectorSize value after calling clSetKernelArg does not affect the value visible from the kernel.
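The copy semantics described above can be illustrated without any OpenCL device. The following is a minimal sketch in plain C. The structure arg_slot and the helpers set_arg and arg_as_int are hypothetical names introduced only for this illustration; they are not part of the OpenCL API and merely mimic how clSetKernelArg copies the bytes that its argument pointer refers to.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define ARG_SLOT_CAPACITY 16  /* hypothetical limit, large enough for this sketch */

/* Hypothetical stand-in for one kernel argument slot (not an OpenCL type). */
typedef struct {
    unsigned char bytes[ARG_SLOT_CAPACITY];
    size_t size;
} arg_slot;

/* Copies `size` bytes from `value` into the slot, mimicking the way
   clSetKernelArg copies the data behind its argument pointer.
   Returns 0 on success and -1 on invalid input. */
int set_arg(arg_slot *slot, size_t size, const void *value)
{
    if (slot == NULL || value == NULL || size > ARG_SLOT_CAPACITY)
        return -1;
    memcpy(slot->bytes, value, size);
    slot->size = size;
    return 0;
}

/* Reads the stored bytes back as an int, as the kernel would see them. */
int arg_as_int(const arg_slot *slot)
{
    int v;
    assert(slot != NULL && slot->size == sizeof v);
    memcpy(&v, slot->bytes, sizeof v);
    return v;
}
```

If set_arg is called with vectorSize equal to 3 and the host variable is then changed, arg_as_int still yields 3. In the same way, a new value reaches the real kernel only after clSetKernelArg has been called again.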
Kernel execution is done via a command queue using clEnqueueNDRangeKernel or clEnqueueTask. The second one enqueues the execution of a single kernel instance and is not suitable for a kernel that sums vectors. In the example, the kernel will be enqueued to be executed in a one-dimensional work space of size vectorSize. This is achieved using the parameter dev_global_work_size. Note that the kernel in listing 2.73 uses only one get_global_id with 0 as a parameter. This is because of the one-dimensional computation space created by the function clEnqueueNDRangeKernel, which can be seen in line 24 of listing 2.78.
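To make the one-dimensional work space concrete, the following sketch emulates it sequentially on the host. It assumes that the kernel from listing 2.73 adds the two elements selected by get_global_id(0) and ignores indices beyond the vector size. The names vect_sum_work_item and run_ndrange_1d are hypothetical and not part of OpenCL.

```c
#include <assert.h>
#include <stddef.h>

/* Body of one work-item; `gid` plays the role of get_global_id(0).
   Assumed kernel logic: r[gid] = v1[gid] + v2[gid]. */
void vect_sum_work_item(size_t gid, const float *v1, const float *v2,
                        float *r, int size)
{
    if (size > 0 && gid < (size_t)size)
        r[gid] = v1[gid] + v2[gid];
}

/* Sequential stand-in for a one-dimensional NDRange: one call per global id.
   A real device may run these work-items in parallel and in any order. */
void run_ndrange_1d(size_t global_work_size, const float *v1,
                    const float *v2, float *r, int size)
{
    size_t gid;
    assert(v1 != NULL && v2 != NULL && r != NULL);
    for (gid = 0; gid < global_work_size; gid++)
        vect_sum_work_item(gid, v1, v2, r, size);
}
```

Run with the input vectors from listing 2.78 and a global work size of 3, this host version produces approximately (2.5, 2.1, 2.1). A reference of this kind is a convenient way to check the values read back from the device.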
Now the results should be retrieved, using clEnqueueReadBuffer. This function puts into a queue the command to copy the content of a device memory buffer into a host memory buffer. This is a blocking call because the third parameter, blocking_read, is set to CL_TRUE (1). If it were set to CL_FALSE (0), the function would return immediately, without waiting for the command to finish.
Note that at the moment when the clEnqueueReadBuffer function is invoked, the kernel can still be executing. The queue is processed in parallel with the host program, and commands put into it do not always execute immediately. Because the queue is in-order, the blocking read cannot complete before the kernel has finished, so the solution presented in listing 2.78 synchronizes the host and device programs by using this function. The other way to synchronize them is by using clFinish.
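The difference between enqueueing a command and having it completed can be shown with a small single-threaded model. This is only a sketch under strong assumptions: toy_queue, toy_enqueue and toy_finish are hypothetical names, not OpenCL functions, and the model always defers execution until toy_finish is called, whereas a real device may start a command at any time after it has been enqueued.

```c
#include <assert.h>
#include <stddef.h>

#define QUEUE_CAPACITY 8  /* hypothetical fixed capacity for this sketch */

typedef void (*command_fn)(void *data);

/* Toy in-order queue: commands are stored and executed in enqueue order. */
typedef struct {
    command_fn fn[QUEUE_CAPACITY];
    void *data[QUEUE_CAPACITY];
    size_t head, tail;
} toy_queue;

void toy_queue_init(toy_queue *q)
{
    q->head = 0;
    q->tail = 0;
}

/* Stores a command and returns at once; nothing is executed here. */
int toy_enqueue(toy_queue *q, command_fn fn, void *data)
{
    if (q->tail == QUEUE_CAPACITY)
        return -1;
    q->fn[q->tail] = fn;
    q->data[q->tail] = data;
    q->tail++;
    return 0;
}

/* Runs every pending command in order, playing the role of clFinish
   (or of the wait performed inside a blocking read). */
void toy_finish(toy_queue *q)
{
    assert(q != NULL);
    while (q->head < q->tail) {
        q->fn[q->head](q->data[q->head]);
        q->head++;
    }
}

/* Stand-in for a kernel: modifies a value in "device" memory. */
void increment_command(void *data)
{
    int *value = (int *)data;
    (*value)++;
}

/* Stand-in for a buffer read: copies a "device" value to the host. */
typedef struct {
    const int *src;
    int *dst;
} copy_job;

void copy_command(void *data)
{
    copy_job *job = (copy_job *)data;
    *job->dst = *job->src;
}
```

In this model, enqueueing increment_command and then copy_command leaves the host variable unchanged until toy_finish runs both commands in order. A non-blocking read in real OpenCL needs the same care: the host buffer must not be used until the read is known to have finished.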
The last step is to free the memory buffers. This is done by using the function clReleaseMemObject. This function decrements the memory object reference count; when it reaches 0, the memory is released.
cl_int clReleaseMemObject ( cl_mem memobj );
Decrements the memory object reference count. After the memobj reference count becomes zero and commands that are queued for execution on a command-queue(s) that use memobj have finished, the memory object is deleted.
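The reference counting behind clReleaseMemObject can be sketched in plain C. The type toy_mem and its functions are hypothetical stand-ins, not the real cl_mem implementation. The model covers only the counter and leaves out the additional condition that queued commands using the object must have finished.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a memory object (not the real cl_mem). */
typedef struct {
    unsigned ref_count;
    void *storage;
} toy_mem;

/* Creates an object with a reference count of 1, as clCreateBuffer does. */
toy_mem *toy_mem_create(size_t size)
{
    toy_mem *m = (toy_mem *)malloc(sizeof *m);
    if (m == NULL)
        return NULL;
    m->storage = malloc(size);
    if (m->storage == NULL) {
        free(m);
        return NULL;
    }
    m->ref_count = 1;
    return m;
}

/* Counterpart of clRetainMemObject: registers one more owner. */
void toy_mem_retain(toy_mem *m)
{
    assert(m != NULL && m->ref_count > 0);
    m->ref_count++;
}

/* Counterpart of clReleaseMemObject: removes one owner.
   Returns 1 if this call deleted the object, 0 otherwise. */
int toy_mem_release(toy_mem *m)
{
    assert(m != NULL && m->ref_count > 0);
    if (--m->ref_count > 0)
        return 0;
    free(m->storage);
    free(m);
    return 1;
}
```

An object that has been created and retained once survives the first release and is deleted by the second. In listing 2.78 each buffer has only its initial reference, so a single clReleaseMemObject call per buffer is enough.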
// free kernel and program
clReleaseKernel(vectSumKernel);
clReleaseProgram(program);

// free resources
clReleaseCommandQueue(queue);
r = clReleaseContext(context);
Listing 2.79: Releasing OpenCL objects.