Printf - Debugging OpenCL codes - Using OpenCL Programming Massively Parallel Computers

3.2 Debugging OpenCL codes

3.2.1 Printf

int printf ( constant char * restrict format, ... );

The printf built-in function writes output to an implementation-defined stream such as stdout under control of the string pointed to by format that specifies how subsequent arguments are converted for output.


The printf function became a standard OpenCL C built-in with the introduction of OpenCL 1.2. Even so, many implementations still do not support the newest OpenCL version, which is why this section treats printf as an extension. Once OpenCL 1.2 is implemented on a wide variety of hardware, this section will remain valid.

One useful extension defined by some implementations prior to OpenCL 1.2 is the printf function. It allows text to be printed to the console directly from OpenCL kernels and is available as cl_intel_printf or cl_amd_printf on Intel and AMD CPUs. Its usage is very similar to debugging with the ordinary C printf function, and it is the simplest approach for solving simple problems. This extension, however, is not currently available for GPUs because of where the kernel executes. The introduction of the OpenCL 1.2 standard will encourage platform vendors to implement printf in such a way that OpenCL programs containing it can be compiled even if the function produces no output. It will probably even be possible to retrieve some output from future GPUs.
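Whether a device advertises one of these extensions can also be checked from the host program. The following is a minimal sketch in plain C, assuming the space-separated extension list has already been obtained, for example with clGetDeviceInfo and CL_DEVICE_EXTENSIONS. The helper names has_extension and has_vendor_printf are hypothetical and are not part of OpenCL.

```c
#include <stdbool.h>
#include <string.h>

/* Returns true if `name` occurs as a whole, space-separated token in
 * `ext_list`. The list is assumed to be the string reported by the device
 * for CL_DEVICE_EXTENSIONS. */
static bool has_extension(const char *ext_list, const char *name)
{
    size_t len = strlen(name);
    const char *p = ext_list;

    if (len == 0)
        return false;
    while ((p = strstr(p, name)) != NULL) {
        bool starts = (p == ext_list) || (p[-1] == ' ');
        bool ends = (p[len] == ' ') || (p[len] == '\0');
        if (starts && ends)
            return true;
        p++; /* keep searching after a partial match */
    }
    return false;
}

/* True if either vendor printf extension named in the text is listed. */
static bool has_vendor_printf(const char *ext_list)
{
    return has_extension(ext_list, "cl_amd_printf") ||
           has_extension(ext_list, "cl_intel_printf");
}
```

Matching whole tokens rather than substrings avoids a false positive when a longer extension name merely begins with the one being searched for.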

The printf function is very convenient, and many programmers prefer this style of debugging. Using printf in an OpenCL program, however, brings some pitfalls that the programmer must be aware of. On older OpenCL implementations, the appropriate extension must be enabled before printf can be used. The program code that does this can be seen in listing 3.12. This code also handles the situation in which no printf extension is available: it then defines an empty macro, so the code still compiles cleanly but prints nothing on the console.
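One detail of the fallback in listing 3.12 deserves care: a macro that expands to an empty block `{}` leaves a stray semicolon after the call, so an unbraced if/else around it no longer compiles. A common C idiom is to expand to the expression `((void)0)` instead. The sketch below shows this in plain C99; the names DBG_PRINTF and HAVE_DEVICE_PRINTF are hypothetical stand-ins, and the idiom is expected (not verified here) to carry over to OpenCL C, whose preprocessor follows C99.

```c
#include <stdio.h>

/* HAVE_DEVICE_PRINTF is a hypothetical switch standing in for the
 * cl_amd_printf / cl_intel_printf checks of listing 3.12. */
#ifdef HAVE_DEVICE_PRINTF
#define DBG_PRINTF(...) printf(__VA_ARGS__)
#else
/* Expands to an expression, so "DBG_PRINTF(...);" is a single statement. */
#define DBG_PRINTF(...) ((void)0)
#endif

/* Returns 1 for even values and 0 for odd ones. The debug calls sit in an
 * if/else without braces, which would not compile with a "{}" expansion. */
static int classify(int value)
{
    if (value % 2 == 0)
        DBG_PRINTF("%d is even\n", value);
    else
        DBG_PRINTF("%d is odd\n", value);
    return value % 2 == 0;
}
```

With the switch undefined, the calls disappear entirely and classify still compiles and returns the same results.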

    #ifdef cl_amd_printf
    #pragma OPENCL EXTENSION cl_amd_printf : enable
    #endif
    #ifdef cl_intel_printf
    #pragma OPENCL EXTENSION cl_intel_printf : enable
    #endif

    #ifndef printf
    #define printf(...) {}
    #endif

Listing 3.12: The OpenCL program fragment that turns on the printf extension for AMD and Intel platforms.

    kernel void dbg_add_v1(global float *B, global float *A) {
        uint i = get_global_id(0);
        uint j = get_global_id(1);
        uint w = get_global_size(0);
        uint h = get_global_size(1);
        // Three separate calls are used to build one line of output.
        printf("[%d, %d] ", i, j);
        printf("%2.1f+%2.1f=", B[i * w + j], A[i * w + j]);
        B[i * w + j] = B[i * w + j] + A[i * w + j];
        printf("%2.1f\n", B[i * w + j]);
    }

Listing 3.13: Kernel using printf to output some information about its internal operation. The incorrect approach.

Please note that it is very easy to generate enormous amounts of output using this extension. Imagine executing a kernel in an NDRange of size 32×32, and assume that the kernel contains one printf instruction. Every work-item produces just one line of output, but a single kernel run produces 1024 lines. It is very hard to extract valuable information from this amount of data. The best solution is to execute the kernels with very limited NDRange sizes or to use conditional blocks that print debugging information only from interesting work-items. It is also possible to automate the analysis by parsing the output and presenting it graphically.
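The advice to print only from interesting work-items can be sketched as follows. This is a host-side simulation in plain C, not an OpenCL kernel: get_global_id here is a stand-in written for this example, the work-items are run one after another in a loop, and the names dbg_add_filtered, run_filtered and NDRANGE_EDGE are hypothetical. In a real kernel only the conditional around printf would be needed.

```c
#include <stdio.h>
#include <stddef.h>

#define NDRANGE_EDGE 32 /* edge of the square NDRange used in the text */

/* Host-side stand-in for the OpenCL built-in of the same name; it exists
 * only so this sketch runs as ordinary C (assumption, not OpenCL API). */
static size_t current_id[2];
static size_t get_global_id(unsigned dim) { return current_id[dim]; }

static int lines_printed;

/* Kernel-style body: every work-item adds, but only the work-item at
 * (watch_i, watch_j) prints its debugging line. */
static void dbg_add_filtered(float *B, const float *A,
                             size_t watch_i, size_t watch_j)
{
    size_t i = get_global_id(0);
    size_t j = get_global_id(1);
    size_t idx = i * NDRANGE_EDGE + j;
    float sum = B[idx] + A[idx];

    if (i == watch_i && j == watch_j) {
        printf("[%zu, %zu] %2.1f+%2.1f=%2.1f\n", i, j, B[idx], A[idx], sum);
        lines_printed++;
    }
    B[idx] = sum;
}

/* Runs the body once per work-item of the square range, sequentially,
 * and returns how many lines were printed. */
static int run_filtered(size_t watch_i, size_t watch_j)
{
    static float A[NDRANGE_EDGE * NDRANGE_EDGE];
    static float B[NDRANGE_EDGE * NDRANGE_EDGE];
    size_t count = (size_t)NDRANGE_EDGE * NDRANGE_EDGE;

    for (size_t k = 0; k < count; k++) {
        A[k] = 1.0f;
        B[k] = 2.0f;
    }
    lines_printed = 0;
    for (current_id[0] = 0; current_id[0] < NDRANGE_EDGE; current_id[0]++)
        for (current_id[1] = 0; current_id[1] < NDRANGE_EDGE; current_id[1]++)
            dbg_add_filtered(B, A, watch_i, watch_j);
    return lines_printed;
}
```

Calling run_filtered(5, 7) prints a single line for work-item (5, 7), instead of the 1024 lines that an unconditional printf would produce for the same range.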

Another obstacle is the parallel execution of work-items. It is impossible to determine the exact order of printf invocations. The execution model is also the reason why using several consecutive printf calls to produce a single line of output is useless. Consider the situation presented in listing 3.13. If it were a sequential program, the output would be readable, with just one line per kernel instance, and informative enough. Executing this code on a parallel machine, however, gives the output shown in listing 3.14 for a work size of 4×4. This is because of the race between different work-items.
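The effect of this race can be illustrated without a parallel device. The sketch below is a sequential simulation in plain C: it builds the three fragments that two work-items would print with the three calls of listing 3.13, then emits them in a round-robin order. That order is only one hypothetical schedule chosen for illustration, the input values are made up, and the names make_fragments and interleave_two are not from the original text.

```c
#include <stdio.h>
#include <string.h>

#define FRAGMENTS 3 /* listing 3.13 issues three printf calls per work-item */
#define FRAG_LEN 32

/* Fills `out` with the three fragments that work-item (id, 0) would print
 * for the made-up inputs b and a. */
static void make_fragments(int id, float b, float a,
                           char out[FRAGMENTS][FRAG_LEN])
{
    snprintf(out[0], FRAG_LEN, "[%d, 0] ", id);
    snprintf(out[1], FRAG_LEN, "%2.1f+%2.1f=", b, a);
    snprintf(out[2], FRAG_LEN, "%2.1f\n", b + a);
}

/* Interleaves the fragments of two work-items one call at a time and
 * returns the resulting text (kept in a static buffer). */
static const char *interleave_two(void)
{
    static char text[4 * FRAGMENTS * FRAG_LEN];
    char w0[FRAGMENTS][FRAG_LEN], w1[FRAGMENTS][FRAG_LEN];

    make_fragments(0, 1.0f, 2.0f, w0);
    make_fragments(1, 3.0f, 4.0f, w1);
    text[0] = '\0';
    for (int k = 0; k < FRAGMENTS; k++) {
        strncat(text, w0[k], sizeof text - strlen(text) - 1);
        strncat(text, w1[k], sizeof text - strlen(text) - 1);
    }
    return text;
}
```

Under this schedule the two lines merge into "[0, 0] [1, 0] 1.0+2.0=3.0+4.0=3.0" followed by a separate "7.0", which is the same kind of damage visible in listing 3.14.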

There are methods to cope with this problem. One is to print the debugging information with one printf call per output line. This is the simplest solution and works in most cases. The kernel that uses this method is shown in listing 3.15, and the result for work size 4×4 can be seen in listing 3.16. Note that the order of execution of the printf calls is nondeterministic, so the order of the output lines can vary.

    (...)
    [2, 3] adding 2.000000 + -1.000000 = 1.000000
    [3, 3] adding 4.000000 + -2.000000 = adding -4.000000 + 2.000000 = -2.000000
    [2, 0] 2.000000
    [0, 0] adding 4.000000 + -2.000000 = 2.000000
    adding -2.000000 + 1.000000 = -1.000000
    [3, 0] adding -0.000000 + 0.000000 = 0.000000
    (...)

Listing 3.14: The fragment of output for the kernel shown in listing 3.13.

    kernel void dbg_add_v2(global float *B, global float *A) {
        uint i = get_global_id(0);
        uint j = get_global_id(1);
        uint w = get_global_size(0);
        uint h = get_global_size(1);
        // The whole line is produced by a single printf call.
        printf("[%d, %d] %2.1f+%2.1f=%2.1f\n", i, j,
               B[i * w + j],
               A[i * w + j],
               B[i * w + j] + A[i * w + j]
        );
        B[i * w + j] += A[i * w + j];
    }

Listing 3.15: Kernel using printf to output some information about its internal operation.

    (...)
    [3, 2] 0.0+2.0=2.0
    [3, 0] 2.0+0.0=2.0
    [0, 2] 1.0+2.0=3.0
    (...)

Listing 3.16: The fragment of output for the kernel shown in listing 3.15.

The last approach, which produces output that is always in the same order, uses a loop and synchronization. This method simulates serialized execution of the code. Note that it works only for kernels run in one work-group, because there is no officially supported method to synchronize different work-groups. This kernel is shown in listing 3.17, and the example output can be seen in listing 3.18.
