Single Kernel - Template makefile - Linking Scheme code to data-parallel CUDA-C code


5.1.1 Single Kernel

• Measuring time in Scheme

We used the macro gpu-time to measure execution times for the Scheme implementations of our test cases. For example, in the following Scheme code, gpu-time measures the execution time of the kernel kernel1, which takes a constant constant and a vector src:

(gpu-time
  (kernel1 <<< (nblocks) (blockSize) >>> constant src))

Here, gpu-time takes the call to kernel1 as its argument. In our implementation, gpu-time recorded the start time stamp before either shim was called and the stop time stamp after the CUDA-C shim had finished executing. The measured time therefore covers the generated vector-length-calculation helper function and c-lambda function in the Scheme shim, the CUDA-C shim, and the supplied CUDA-C kernel.

For a Scheme implementation, we also measured execution times for the different parts of the generated foreign-function interface. To do that, we measured the combined execution time for the generated CUDA-C shim and the supplied CUDA-C kernel, and we also measured the execution time for the supplied CUDA-C kernel alone. In the Scheme implementation of a test case, the execution time is the combined execution time for the generated Scheme shim, the CUDA-C shim, and the supplied CUDA-C kernel. Therefore, the execution time of the generated Scheme shim is the difference between this total and the combined execution time for the CUDA-C shim and the kernel. Likewise, the execution time of the CUDA-C shim is the difference between the combined time for the CUDA-C shim and the kernel and the execution time for the kernel alone.

To measure execution times for the generated CUDA-C shim and the supplied CUDA-C kernel in a Scheme implementation, we provided the compiler flag -bare-time, which causes CUDA timing code involving two cudaEvent_t variables to be generated by the Gambit compiler. These two variables were used to compute the elapsed time both for the generated CUDA-C shim together with the supplied CUDA-C kernel and for the supplied CUDA-C kernel alone, as described in Appendix G. We also measured the execution time of the generated timing code itself and subtracted it from the measured execution times to obtain the actual execution times.

• Measuring time in CUDA-C

To measure execution times for the CUDA-C implementations of our test cases, we used CUDA library functions. Listing 5.1 shows a code snippet from an example CUDA-C host program that calls the CUDA-C kernel kernel1 and measures its execution times using CUDA library functions that operate on cudaEvent_t variables. On line 1, four cudaEvent_t variables (start, stop, start_k, and stop_k) are declared. The start and stop time stamps are used to measure the execution time of the whole CUDA-C implementation. This execution time includes the vector transfers between host and device memory and the allocation and deallocation of device memory. The other two cudaEvent_t variables (start_k and stop_k) are used to compute the elapsed time for the kernel alone.

On line 2, a float variable elapsed_time_ms is declared and initialized; it stores the elapsed time between two time stamps. On lines 3–6, start, stop, start_k, and stop_k are initialized using the CUDA library function cudaEventCreate(). The start time stamp for the CUDA-C implementation is then recorded on line 8:

cudaEventRecord(start, 0)

Next, device memory is allocated on line 10, the vector is transferred from host memory to device memory on line 11, and the kernel is called on lines 15–16. The start time stamp for the kernel execution alone is recorded on line 13, just before the kernel call:

cudaEventRecord(start_k, 0)

The stop time stamp stop_k is recorded on line 18, just after the kernel call:

cudaEventRecord(stop_k, 0)

After the kernel execution, the vector is transferred back to host memory on line 19 and device memory is deallocated on line 20. The stop time stamp for the CUDA-C implementation is recorded on line 22:

cudaEventRecord(stop, 0)

 1 cudaEvent_t start, stop, start_k, stop_k;
 2 float elapsed_time_ms = 0.0f;
 3 cudaEventCreate(&start);
 4 cudaEventCreate(&stop);
 5 cudaEventCreate(&start_k);
 6 cudaEventCreate(&stop_k);
 7 // taking start time stamp for CUDA-C implementation
 8 cudaEventRecord(start, 0);
 9 size_t size_u32v_src = h_u32v_src_len * sizeof(uint32_t);
10 cudaMalloc((void **) &d_u32v_src, size_u32v_src);
11 cudaMemcpy(d_u32v_src, h_u32v_src, size_u32v_src, cudaMemcpyHostToDevice);
12 // taking start time stamp only for kernel execution
13 cudaEventRecord(start_k, 0);
14 // calling kernel
15 kernel1<<<blockSize, nBlocks>>>(u32_constant, d_u32v_src,
16                                 h_u32v_src_len);
17 // taking stop time stamp only for kernel execution
18 cudaEventRecord(stop_k, 0);
19 cudaMemcpy(h_u32v_src, d_u32v_src, size_u32v_src, cudaMemcpyDeviceToHost);
20 cudaFree(d_u32v_src);
21 // taking stop time stamp for CUDA-C implementation
22 cudaEventRecord(stop, 0);
23 cudaEventSynchronize(stop);
24 cudaEventElapsedTime(&elapsed_time_ms, start, stop);
25 printf("%f, ", elapsed_time_ms);
26 cudaEventElapsedTime(&elapsed_time_ms, start_k, stop_k);
27 printf("%f, ", elapsed_time_ms);
28 cudaEventDestroy(start);
29 cudaEventDestroy(stop);
30 cudaEventDestroy(start_k);
31 cudaEventDestroy(stop_k);

Listing 5.1: Measuring execution times in CUDA-C implementation using cudaEvent_t type variables.

Next, on line 23, the host waits until the stop event recorded by the most recent call to cudaEventRecord(stop, 0) on line 22 has completed. The elapsed time between the start and stop time stamps is computed in milliseconds on line 24 and written to standard output on line 25. The execution time for the kernel alone is computed as the elapsed time between the time stamps start_k and stop_k on line 26 and written to standard output on line 27. Finally, all four time stamps are destroyed on lines 28–31. Note that the interval between the start and stop time stamps also includes the recording of the time stamps start_k and stop_k on lines 13 and 18, but their execution times are so small that they can safely be neglected.
