2.2 Domain-Specific Description for Image Processing Kernels
3.2.2 Results
Vertical Mean Filtering
A naïve parallel algorithm can run N = W × (H − D) threads, each producing a single output element, which requires Θ(ND) reads and arithmetic operations. A good parallel algorithm, however, must be efficient and scalable [LS08]. Therefore, an algorithm is used that strips the computation, where up to T outputs in the same strip are computed serially in two phases [HLDK09]: the first phase computes $O_{x,y_0}$ according to (3.1), while the second phase computes $O_{x,y}$ for $y \geq y_0 + 1$ as $O_{x,y} = O_{x,y-1} + (I_{x,y+D-1} - I_{x,y-1})/D$.
3 Domain-specific Source-to-Source Compilation for Medical Imaging
This algorithm performs Θ(N + ND/T) reads and arithmetic operations, considerably reducing memory bandwidth and compute requirements for T ≫ D, whilst allowing up to ⌈N/T⌉ threads to run in parallel. Thus, this algorithm trades off work efficiency against parallelism.
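The two phases can be illustrated with a small host-side C++ sketch. It is not the framework's kernel description; it assumes that (3.1) is the plain mean over D vertically adjacent input pixels, and the names `Image`, `meanNaive`, and `meanRolling` are hypothetical. Each (x, strip) pair in `meanRolling` plays the role of one thread.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Row-major image: pixel (x, y) is stored at data[y * width + x].
struct Image {
    std::size_t width, height;
    std::vector<float> data;
    float at(std::size_t x, std::size_t y) const { return data[y * width + x]; }
};

// Naive variant: every output sums its own D inputs, Theta(N * D) reads.
// Assumption: (3.1) is O(x, y) = (1/D) * sum_{d=0}^{D-1} I(x, y + d).
std::vector<float> meanNaive(const Image& in, std::size_t D) {
    assert(D > 0 && in.height > D);
    const std::size_t outH = in.height - D;  // N = W * (H - D), as in the text
    std::vector<float> out(in.width * outH);
    for (std::size_t y = 0; y < outH; ++y)
        for (std::size_t x = 0; x < in.width; ++x) {
            float sum = 0.0f;
            for (std::size_t d = 0; d < D; ++d) sum += in.at(x, y + d);
            out[y * in.width + x] = sum / static_cast<float>(D);
        }
    return out;
}

// Strip variant: up to T outputs of one column are computed serially.
std::vector<float> meanRolling(const Image& in, std::size_t D, std::size_t T) {
    assert(D > 0 && T > 0 && in.height > D);
    const std::size_t outH = in.height - D;
    std::vector<float> out(in.width * outH);
    for (std::size_t y0 = 0; y0 < outH; y0 += T)        // one strip per iteration
        for (std::size_t x = 0; x < in.width; ++x) {    // one "thread" per column
            // Phase 1: full sum for the first output of the strip.
            float sum = 0.0f;
            for (std::size_t d = 0; d < D; ++d) sum += in.at(x, y0 + d);
            float o = sum / static_cast<float>(D);
            out[y0 * in.width + x] = o;
            // Phase 2: add the pixel entering the window, drop the one leaving it.
            const std::size_t yEnd = std::min(y0 + T, outH);
            for (std::size_t y = y0 + 1; y < yEnd; ++y) {
                o += (in.at(x, y + D - 1) - in.at(x, y - 1)) / static_cast<float>(D);
                out[y * in.width + x] = o;
            }
        }
    return out;
}
```

Both functions produce the same result up to floating-point rounding. With T = 1 the strip variant degenerates to the naive one, and larger T amortizes the D reads of phase 1 over more outputs.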
Listing 3.4 shows the implementation of this algorithm in the proposed framework.
Since the framework supports only a 1:1 mapping of output pixels to threads, the offset specification is used to calculate the pixel location for a 1:N mapping. A special syntax for a 1:N mapping can be provided in the future.
class VerticalMeanFilterRollingSum : public Kernel {
Listing 3.4: Kernel description of the vertical mean filter using a rolling sum.
The performance of the code generated by the proposed framework is compared against that of the hand-written code reported in [HLDK09].¹ The vertical mean filter is run with different values for T, that is, with a varying number of pixels calculated by one thread. Figure 3.2 shows the throughput of the vertical mean filter applied to an image of 5120 × 3200 pixels. Processing more than one pixel per thread increases the throughput from 0.53 Gpixel/s for T = 1 up to the peak throughput of 6.6 Gpixel/s, which is reached at several points (e.g., for T = 528).
The results show that the generated CUDA code achieves the same performance as the optimized hand-written CUDA code.² However, the high-level implementation in the framework is concise and has only a fraction of the complexity of the low-level implementation of [HLDK09]. For instance, in terms of lines of code, the
¹ Note that the same configuration is used: thread block dimensions 128 × 1, kernel diameter D = 40. However, the Quadro FX 5800 Graphics Processing Unit (GPU) is used, rather than the GeForce GTX 280.
² The generated OpenCL code is slightly slower than the generated CUDA code, which is attributed to the relative immaturity of the OpenCL implementation.
3.2 Point Operators
[Plot: throughput in Gpixel/s (axis 0 to 7) over output pixels per thread T (axis 0 to 800); series: CUDA (HPPC09), CUDA (generated), OpenCL (generated).]
Figure 3.2: Throughput of the generated CUDA C/OpenCL sources in Gpixel/s for the vertical mean filter on an image of 5120 × 3200 pixels in comparison to the hand-written CUDA code from [HLDK09].
low-level implementation consists of about 500 lines of host and device code, whilst the high-level implementation consists of fewer than 50 lines of code.
In the previous example, the image width of 5120 is a multiple of the warp size of the underlying hardware (which is 32). This results in optimal memory transfers that make the best use of the available memory bandwidth. However, if the image width is not a multiple of the warp size and the image lines are not properly aligned, memory throughput drops. For instance, increasing the image width by one pixel, that is, using an image of 5121 × 3200 pixels, yields a peak throughput of 3.9 Gpixel/s, which is roughly half of the throughput obtained before. The framework can pad images and adapts the kernel source to take the padding into account. The amount of padding required for best performance depends on the underlying hardware. For the graphics hardware used here, the best memory throughput is achieved when each image line is padded to a multiple of the memory transaction size, that is, the amount of aligned memory the GPU can handle in one transaction. Transactions cover 32-, 64-, or 128-byte segments of aligned memory. Padding improves the peak throughput, as shown in Fig. 3.3 for an image of 5121 × 3200 pixels with image lines padded to the different memory transaction sizes. The peak throughput of 6.4 Gpixel/s is achieved when aligning to 256 bytes, which is double the maximum transaction size.
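The padding arithmetic can be sketched in a few lines of C++. This is an illustration, not the code the framework emits: the helper names `roundUp`, `paddedStride`, and `pixelIndex` are hypothetical, and the usage note assumes 4-byte pixels, which the text does not state.

```cpp
#include <cassert>
#include <cstddef>

// Round `bytes` up to the next multiple of `alignment`.
std::size_t roundUp(std::size_t bytes, std::size_t alignment) {
    assert(alignment > 0);
    return (bytes + alignment - 1) / alignment * alignment;
}

// Length of one padded image line (stride) in pixels.
// Assumption: the alignment is a multiple of the pixel size.
std::size_t paddedStride(std::size_t widthPixels, std::size_t bytesPerPixel,
                         std::size_t alignmentBytes) {
    assert(bytesPerPixel > 0 && alignmentBytes % bytesPerPixel == 0);
    return roundUp(widthPixels * bytesPerPixel, alignmentBytes) / bytesPerPixel;
}

// Linear index of pixel (x, y) once every line starts on an aligned address.
// A generated kernel would use the stride here instead of the image width.
std::size_t pixelIndex(std::size_t x, std::size_t y, std::size_t stride) {
    return y * stride + x;
}
```

Under the 4-byte assumption, a line of 5121 pixels occupies 20484 bytes; padding to 256 bytes gives 20736 bytes, a stride of 5184 pixels, whereas a line of 5120 pixels is already aligned and needs no padding.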
Figure 3.3: Throughput of the generated CUDA C sources in Gpixel/s for the vertical mean filter on an image of 5121 × 3200 pixels with padding. The generated CUDA C and OpenCL source pads the image width to a multiple of 32, 64, 128, 256, or 512 bytes.
OpenCV Library
One widely used library for image processing is the Open Source Computer Vision (OpenCV) library [Wil11]. Image processing algorithms are optimized in OpenCV to make use of the Single Instruction, Multiple Data (SIMD) units and multiple cores of modern processors. Beginning with version 2.2, selected algorithms (mostly convolution kernels) can also be executed on the GPU. Instead of implementing these kernels from scratch, OpenCV relies on the NVIDIA Performance Primitives (NPP) library. To compare the performance of code generated by the proposed framework to such state-of-the-art approaches, all six convolution kernels from OpenCV that utilize NVIDIA GPUs have been implemented in the framework. These kernels mostly support the 8-bit unsigned char type and the 3 × 3 and 5 × 5 window dimensions, which are also used for the evaluation. Note that OpenCV provides no 5 × 5 GPU implementation of the Laplace convolution filter. The proposed framework, however, can generate code for other configurations with only minor modifications to the high-level description, as was done for the 5 × 5 Laplace convolution filter.
Figure 3.4 shows the execution times of the OpenCV implementations on a CPU (Core 2 Quad @3.00 GHz) and three GPUs: NVIDIA's Quadro FX 5800 and Tesla C2050 and AMD's Radeon HD 5870. For the NVIDIA cards, the OpenCV implementation and the CUDA/OpenCL code generated by the proposed framework are compared, while on the AMD card only generated OpenCL code is available. The generated code is at least as fast as the OpenCV code and faster in most cases. Execution time also increases with larger filter window sizes. Again, the generated CUDA code is slightly faster than the OpenCL code. The GPU implementation of OpenCV
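[Bar charts: execution time in ms for the blur, dilate, erode, gaussian, laplace, and sobel kernels; panel (a) 3 × 3 window size, panel (b) 5 × 5 window size. Value labels that survive extraction: 5.0, 2.4, 2.8 in the first panel and 5.0, 12.4, 4.2 in the second.]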
Figure 3.4: Comparison of the execution time of convolution kernels from OpenCV and the proposed framework for an image of 1024 × 1024 pixels on a Quadro FX 5800, Tesla C2050, and Radeon HD 5870. The results for a window size of 3 × 3 are shown in (a) and for a window size of 5 × 5 in (b).
relies on the NPP library, resulting in longer execution times. While the presented Domain-Specific Language (DSL) approach generates GPU code from a high-level representation of the desired convolution kernel, the OpenCV library and the NPP library³ provide more general implementations that are not optimized for properties of the selected convolution kernel such as the kernel size. The performance of the vectorized OpenCV code varies considerably on the Central Processing Unit (CPU). For some convolution kernels, the CPU implementation is almost as fast as the generated GPU code (e.g., for dilate and erode); for most kernels, however, the CPU implementation is one order of magnitude slower than the generated GPU code (e.g., for blur, laplace, and gaussian). One big advantage of the proposed framework is that code can be generated for any pixel data type, while the OpenCV implementations are mostly restricted to unsigned char.
³ The NPP source code is not available for detailed analysis.
3.3 Local Operators
Since local operators and convolution functions typically read neighboring pixels to calculate the value of the result pixel, most of the neighboring pixels read for pixel $p_{i,j}$ are also required for pixel $p_{i+1,j}$. This results in redundant fetches of the same pixels and puts high pressure on the global memory bandwidth. To relieve global memory bandwidth, the region read by a group of threads can be a) staged into fast on-chip memory (scratchpad memory) and read from there afterwards, or b) accessed through a memory path that traverses a cache. In the latter case, only the first access to a pixel incurs the long latency of global memory; subsequent accesses are served by the cache. In the former case, the data is staged into scratchpad memory and the memory accesses of the kernel go to the fast scratchpad memory. However, synchronization is required before the calculation can begin, which separates the data transfer and calculation phases within the kernel. Massive multithreading on the underlying hardware hides memory latency by overlapping data transfers with calculations, and this benefit is lost when data is staged to scratchpad memory. Hence, staging to scratchpad memory only makes sense if the benefit of data reuse exceeds the benefit of multithreading. For local operators with small window sizes, this is rarely the case. Nonetheless, the proposed source-to-source compiler supports both options.
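The amount of redundancy can be estimated with a simple count. The following C++ sketch compares the global memory reads of one thread block with and without staging; it ignores caches, boundary handling, and the synchronization cost discussed above, and the function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Reads from global memory when each of the bsx * bsy threads of a block
// fetches its complete wx * wy window on its own.
std::size_t readsDirect(std::size_t bsx, std::size_t bsy,
                        std::size_t wx, std::size_t wy) {
    return bsx * bsy * wx * wy;
}

// Reads from global memory when the block first stages its tile plus the
// surrounding border once and then works on the staged copy.
std::size_t readsStaged(std::size_t bsx, std::size_t bsy,
                        std::size_t wx, std::size_t wy) {
    assert(wx > 0 && wy > 0);
    return (bsx + wx - 1) * (bsy + wy - 1);
}
```

For a block of 128 × 1 threads and a 3 × 3 window this gives 1152 direct reads against 390 staged reads; for a 16 × 16 block and a 5 × 5 window, 6400 against 400. The count only captures the reuse side of the trade-off, not the loss of overlap between transfers and calculations.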
Texturing memory: All graphics cards from AMD and NVIDIA support cached data access, either using texturing hardware or by default (on newer Fermi GPUs from NVIDIA). In CUDA, texturing hardware can be utilized by reading from a texture reference that is bound to global memory. In OpenCL, the texturing hardware is used when data is read from an image object. Therefore, accesses to Image(x, y) objects have to be mapped to the corresponding tex1Dfetch() and read_imagef() functions in CUDA and OpenCL, respectively. However, this is only valid if data is read. When data is written to an Image object, normal global memory array pointers are used in CUDA and the write_imagef() function in OpenCL. That is, prior to the mapping of Image accesses to their low-level equivalents, a read/write analysis of the kernel method is performed. To this end, a control-flow graph (CFG) of the instructions in the kernel method is created and then traversed. Access information is stored for each Image and Accessor object and used to select the appropriate texturing function call. This results in a mapping of read/write accesses as shown in Listing 3.5. The mapping is done using a recursive AST visitor: whenever an Image or Accessor node is visited, the transformations described above are applied.
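The selection logic described above can be sketched in C++ as follows. This is a strongly simplified stand-in, not the framework's implementation: a flat list of accesses replaces the CFG traversal and the AST visitor, all type and function names are hypothetical, and only the names of the target constructs (tex1Dfetch, read_imagef, write_imagef, read_only, write_only) are taken from the text.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

enum class Backend { CUDA, OpenCL };
enum class AccessKind { Read, Write };

// One access to an Image or Accessor object found in the kernel method.
struct ImageAccess {
    std::string image;
    AccessKind kind;
};

// Result of the read/write analysis for one image.
struct AccessInfo {
    bool read = false;
    bool written = false;
};

// Simplified analysis: collect, per image, whether it is read and/or written.
std::map<std::string, AccessInfo> analyze(const std::vector<ImageAccess>& accesses) {
    std::map<std::string, AccessInfo> info;
    for (const ImageAccess& a : accesses) {
        if (a.kind == AccessKind::Read) info[a.image].read = true;
        else info[a.image].written = true;
    }
    return info;
}

// Low-level construct an access is mapped to, following the text.
std::string lowLevelAccess(const ImageAccess& a, Backend backend) {
    if (a.kind == AccessKind::Read)
        return backend == Backend::CUDA ? "tex1Dfetch" : "read_imagef";
    return backend == Backend::CUDA ? "global memory pointer" : "write_imagef";
}

// Qualifier for an OpenCL image parameter; an empty string marks an image
// that is both read and written, which this sketch does not handle.
std::string openclQualifier(const AccessInfo& info) {
    if (info.read && !info.written) return "read_only";
    if (info.written && !info.read) return "write_only";
    return "";
}
```

For a kernel that reads an image IN and writes an image OUT, the analysis marks IN as read-only and OUT as write-only, which selects texture fetches for IN and plain stores (CUDA) or write_imagef() (OpenCL) for OUT.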
Listing 3.5: Using texturing hardware to read from an Accessor.
The image object access functions in OpenCL always take and return vector elements of size four, although only one of the four components is required for the example above. Therefore, the CL_R channel order is used, which maps only one of the four components to memory and fills the remaining three components with default values on reads. The corresponding extraction from and packing into vector elements is added by the framework as well. Once the accesses to Image objects are mapped, the low-level code can be emitted. At this point, the corresponding CUDA texture reference and OpenCL sampler definitions are created as well. The OpenCL kernel-function parameters for images are emitted with the corresponding read_only and write_only attributes obtained from the read/write analysis. Texture references are not added as kernel-function parameters in CUDA since they are static and globally visible in CUDA.
Scratchpad memory: Current graphics cards provide fast on-chip scratchpad memory, also called shared memory in CUDA and local memory in OpenCL, that is shared between all threads of a compute unit. Adding the __shared__ (CUDA) and __local (OpenCL) keywords to memory declarations within a kernel makes this memory available. Since the memory is shared between all threads mapped to one compute unit (number of threads ≫ warp size), synchronization between these threads is required so that all of them have a consistent view of the scratchpad memory. This is done by the __syncthreads() (CUDA) and barrier() (OpenCL) functions. Execution continues only after all threads have reached this synchronization point. Using scratchpad memory involves two phases: first, the data is staged from the GPU memory into scratchpad memory and, second, data accesses are redirected to the scratchpad. In Listing 3.6, the size of the scratchpad memory depends on BSY/BSX, the size of the 2D image subregion mapped to the compute unit, as well as on SY/SX, the image region accessed beyond the 2D image subregion within the local operator. A constant of 1 is added to BSX so that different banks of the scratchpad memory are accessed for row-based filters in order to avoid bank conflicts.
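The effect of the extra column can be reproduced with a small C++ model of the bank mapping. The model assumes that successive 32-bit words are assigned to successive banks and that consecutive threads access consecutive rows of the same column, which is the standard argument for this kind of padding; whether this matches the exact access pattern of the framework's row-based filters is an assumption, and the function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <set>

// Bank holding element [row][col] of a 2D array of 32-bit words whose rows
// are `pitch` words long (assumed mapping: word address modulo bank count).
std::size_t bankOf(std::size_t row, std::size_t col,
                   std::size_t pitch, std::size_t numBanks) {
    assert(numBanks > 0);
    return (row * pitch + col) % numBanks;
}

// Number of distinct banks touched when `threads` threads access the same
// column in consecutive rows. Fewer banks than threads means conflicts.
std::size_t distinctBanks(std::size_t threads, std::size_t pitch,
                          std::size_t numBanks) {
    std::set<std::size_t> banks;
    for (std::size_t t = 0; t < threads; ++t)
        banks.insert(bankOf(t, 0, pitch, numBanks));
    return banks.size();
}
```

With 16 banks and a pitch of 16 words, 16 threads all hit the same bank, whereas a pitch of 17 words spreads them over 16 different banks. The same holds for 32 banks with pitches of 32 and 33 words.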
The data from global memory is staged into scratchpad memory in multiple steps, depending on the number of additional pixels required by the kernel. When data is read from the scratchpad memory, the identifiers of the threads mapped to one compute unit are used as indices: threadIdx.x/threadIdx.y in the CUDA code and get_local_id(0)/get_local_id(1) in the OpenCL code.
// Phase 1: stage data to scratchpad memory
// CUDA
__shared__ float _smemIN[SY + BSY][SX + BSX + 1];
_smemIN[threadIdx.y][threadIdx.x] = IN[..];
if (...) _smemIN[threadIdx.y + ..][threadIdx.x + ..] = IN[..];
Listing 3.6: Staging pixels to scratchpad memory before accessing image pixels.
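The two phases of the listing can be emulated on the host to make the index arithmetic concrete. The following C++ sketch processes one thread block of a 3 × 3 mean filter sequentially; it is not generated code. The block size, the clamping at the image border, and the function name `blockMean3x3` are assumptions made for the illustration, while the array shape follows the listing.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

constexpr int BSX = 8, BSY = 4;  // assumed block size (threads per block)
constexpr int SX = 2, SY = 2;    // extra pixels for a 3 x 3 window

// Emulates one thread block at block coordinates (blockX, blockY).
// Assumption: pixels outside the image are clamped to the border.
void blockMean3x3(const std::vector<float>& in, std::vector<float>& out,
                  int width, int height, int blockX, int blockY) {
    assert(width > 0 && height > 0);
    float smem[SY + BSY][SX + BSX + 1];  // +1 padding column, as in the listing

    // Phase 1: every "thread" (tx, ty) stages pixels in steps of the block
    // size until the tile and its border are covered.
    for (int ty = 0; ty < BSY; ++ty)
        for (int tx = 0; tx < BSX; ++tx)
            for (int sy = ty; sy < SY + BSY; sy += BSY)
                for (int sx = tx; sx < SX + BSX; sx += BSX) {
                    int gx = blockX * BSX + sx - SX / 2;
                    int gy = blockY * BSY + sy - SY / 2;
                    gx = std::min(std::max(gx, 0), width - 1);
                    gy = std::min(std::max(gy, 0), height - 1);
                    smem[sy][sx] = in[gy * width + gx];
                }

    // On the GPU, __syncthreads() or barrier() separates the two phases.

    // Phase 2: every thread reads its window from the staged copy only.
    for (int ty = 0; ty < BSY; ++ty)
        for (int tx = 0; tx < BSX; ++tx) {
            const int gx = blockX * BSX + tx;
            const int gy = blockY * BSY + ty;
            if (gx >= width || gy >= height) continue;
            float sum = 0.0f;
            for (int dy = 0; dy <= SY; ++dy)
                for (int dx = 0; dx <= SX; ++dx)
                    sum += smem[ty + dy][tx + dx];
            out[gy * width + gx] = sum / static_cast<float>((SX + 1) * (SY + 1));
        }
}
```

Calling the function once per block covers the whole image. Because phase 2 never touches `in`, every input pixel needed by a block is read from global memory only during staging.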