Discussion - Evaluation and Results - SPES. Software Platform Embedded Systems 2020

3.5 Evaluation and Results

3.5.3 Discussion

Although the presented framework focuses on programmability for domain experts in medical imaging, it offers decent performance on GPUs from different manufacturers. The domain experts can express algorithms in a high-level language tailored to their domain. This allows high productivity and the mapping to different target hardware platforms from the same algorithm description. The development time for the manual implementations is in the range of several days (for non-GPU experts even weeks), while the DSL description takes only a couple of minutes.

3 Domain-specific Source-to-Source Compilation for Medical Imaging

Table 3.8: Execution times in ms for the Gaussian filters from OpenCV on the Quadro FX 5800 and the generated implementations using the CUDA and OpenCL backends for an image of 4096 × 4096 pixels and different filter window sizes.

Gaussian: 3 × 3

                  Clamp    Repeat   Mirror   Const.
OpenCV: PPT=8      4.86     5.82    10.46     6.22
OpenCV: PPT=1      7.63     9.22    20.98     9.79
CUDA(Gen)          8.60     8.63     8.64     8.67
CUDA(+Tex)         8.55     8.58     8.60     8.63
CUDA(+Smem)       11.83    11.83    11.84    11.90
OpenCL(Gen)       13.58    13.47    13.10    13.46
OpenCL(+Img)      15.42    15.47    15.06    15.24
OpenCL(+Lmem)     17.84    17.86    17.91    18.35

Gaussian: 5 × 5

                  Clamp    Repeat   Mirror   Const.
OpenCV: PPT=8      4.90     5.87    10.45     6.22
OpenCV: PPT=1      7.64     9.22    20.98     9.79
CUDA(Gen)          9.88     9.95     9.95    10.12
CUDA(+Tex)         9.91     9.97     9.98    10.20
CUDA(+Smem)       14.36    14.36    14.37    14.43
OpenCL(Gen)       16.14    16.26    16.18    16.60
OpenCL(+Img)      18.38    18.44    18.33    18.65
OpenCL(+Lmem)     23.61    23.62    23.62    24.13

For example, the source-to-source compiler generates a CUDA kernel with 317 lines of code for the kernel description shown in Listing 2.7 (16 lines of code). This comes from 9 different kernel implementations for the top, top-right, right, etc. image borders plus index adjustments for boundary handling. In addition, the generated code depends on the filter window size and image size. Writing such code by hand is often error-prone and tedious.
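To give an impression of where the nine variants and the index adjustments come from, the following is a minimal host-side sketch in C++. It is not the code emitted by the source-to-source compiler. The names Boundary, adjust, and region are hypothetical, the mirror mode is assumed to repeat the edge pixel, and the constant mode is left out because it returns a value rather than an adjusted index.

```cpp
#include <cassert>

// Hypothetical boundary handling modes, named after the columns of Table 3.8.
enum class Boundary { Clamp, Repeat, Mirror };

// Map a possibly out-of-range index i onto [0, n) for the given mode.
// Assumption: the filter window is smaller than the image, so -n <= i < 2n.
int adjust(int i, int n, Boundary mode) {
    assert(n > 0 && i >= -n && i < 2 * n);
    if (i >= 0 && i < n) return i;  // inside the image: nothing to adjust
    switch (mode) {
    case Boundary::Clamp:  return i < 0 ? 0 : n - 1;
    case Boundary::Repeat: return i < 0 ? i + n : i - n;
    case Boundary::Mirror: return i < 0 ? -i - 1 : 2 * n - i - 1;
    }
    return i;
}

// Classify pixel (x, y) into one of nine regions:
// 0 top-left, 1 top,      2 top-right,
// 3 left,     4 interior, 5 right,
// 6 bottom-left, 7 bottom, 8 bottom-right.
// half is the half-width of the filter window (1 for a 3 x 3 window).
int region(int x, int y, int width, int height, int half) {
    int col = x < half ? 0 : (x >= width - half ? 2 : 1);
    int row = y < half ? 0 : (y >= height - half ? 2 : 1);
    return row * 3 + col;
}
```

Only the eight border regions need calls to adjust, and each of them only for the sides it actually touches. Specializing the kernel body per region therefore removes the checks from the interior, which covers almost the whole image.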

[Plot: execution time (ms) over the number of threads (blocksize); optimum with 32 × 6: 167.94 ms]

Figure 3.6: Configuration space exploration for the bilateral filter (filter window size: 13 × 13) for an image of 4096 × 4096 on the Tesla C2050. Shown are the execution times of the bilateral filter as a function of the blocksize.
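One factor that differs between configurations with the same blocksize is how many threads end up in blocks that need boundary handling. The sketch below counts these threads for a given tiling. It is a simplified model written for illustration, not the heuristic of the framework, which also considers resource usage and hardware limits. The function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>

// Number of blocks along one dimension that lie completely inside the image
// interior, i.e. at least `half` pixels away from both image borders.
long long interiorBlocks1D(long long n, long long half, long long b) {
    long long count = 0;
    for (long long start = 0; start < n; start += b) {
        long long end = std::min(start + b, n);  // exclusive end of the block
        if (start >= half && end <= n - half) ++count;
    }
    return count;
}

// Threads launched in blocks that touch the image border.
// half: half-width of the filter window (6 for a 13 x 13 window).
// bx, by: block size in x and y direction.
long long borderThreads(long long width, long long height, long long half,
                        long long bx, long long by) {
    assert(width > 0 && height > 0 && half >= 0 && bx > 0 && by > 0);
    long long nbx = (width + bx - 1) / bx;   // blocks per row
    long long nby = (height + by - 1) / by;  // blocks per column
    long long interior = interiorBlocks1D(width, half, bx) *
                         interiorBlocks1D(height, half, by);
    return (nbx * nby - interior) * bx * by;
}
```

Under this model, for the 4096 × 4096 image and the 13 × 13 window, a 192 × 1 tiling places roughly five times as many threads in border blocks as the 32 × 6 tiling, although both use 192 threads per block.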

3.6 Related Work

The work closest to ours is the RapidMind multi-core development platform [Rap09], which targets standard multi-core processors as well as accelerators like the Cell Broadband Engine (Cell B.E.) and GPUs. The RapidMind technology is based on Sh [MDP⁺04], a high-level metaprogramming language for graphics cards. RapidMind provides its own data types that can be arranged in multi-dimensional arrays.

Accessors can be used to define boundary handling properties of the underlying data. A two-dimensional array in RapidMind corresponds to an Image object in the proposed framework. In addition to the boundary handling modes supported in RapidMind, the proposed framework also supports mirroring at the image border, a widely used boundary handling mode in medical imaging. Programs that operate on arrays are identified by special keywords in RapidMind, while compiler-known C++ classes are used in the proposed framework to express image processing kernels. Within a RapidMind program, neighboring elements can be accessed using the shift() method on input data. Since there are no details on code generation for border handling publicly available for RapidMind, the approach followed in this work can only be compared quantitatively with the one of RapidMind.

In 2009, Intel acquired RapidMind and incorporated the RapidMind technology into Intel Array Building Blocks (ArBB) [NSL⁺11]. Since then, RapidMind has been discontinued, as has Sh. The focus of Intel’s ArBB is on vector parallel programming, and for that reason image processing features of RapidMind like generic boundary handling support were not adopted. In ArBB, only a constant value is returned when arrays are accessed out of bounds. To access neighboring elements, the currently processed element and the offset are passed to the neighbor() function. Using the position() function, the position within the n-dimensional iteration space can be retrieved and used to implement more sophisticated boundary handling modes. However, this comes along with large overheads on GPUs. As a remedy, multiple levels of parallelism are exploited in the code-generation backend. When merging RapidMind technology with Intel’s C for Throughput Computing (Ct), the backend for graphics cards was dropped and is no longer supported.

Besides language-based frameworks, there exist library-based frameworks like OpenCV [Wil11] and the NPP library. These libraries provide predefined kernels. However, to offload new algorithms that are not available in the library, low-level code has to be written.

Other compiler-based approaches also allow code to be offloaded to GPU accelerators. The input to such compilers is typically sequential C or basic CUDA as well as annotations describing transformations applied to the code. Examples are HMPP Workbench [DBB07], PGI Accelerator [Wol10], hiCUDA [HA11], and CUDA-lite [ULBH08], just to name a few. In order to obtain decent performance using these compiler-based approaches, the programmer has to know what compiler transformations can be applied and how to rewrite code to make such transformations possible. Algorithm designers and domain experts, however, have only little knowledge of the underlying hardware and compiler transformations. As a consequence, the full potential of such frameworks is only rarely exploited. In the proposed DSL, however, the required metadata is implicitly given by the DSL syntax and does not have to be specified separately.

The proposed framework is most similar in spirit to Cornwall et al.’s work on indexed metadata for visual effects [CHK⁺09], but introduces additional device-specific optimizations such as global memory padding for memory coalescing, support for boundary handling, and the heuristic for automatic kernel configuration selection.

The main contributions of this work include a) a domain-specific description of local operators in medical imaging, b) a new code generation framework that utilizes a two-layered parallelization approach exploiting both Single Program, Multiple Data (SPMD) and Multiple Program, Multiple Data (MPMD) parallelism on current graphics card architectures, and c) a heuristic for automatic kernel configuration selection and tiling. The presented approach is not limited to algorithms stemming from the medical domain, but can also be utilized for other application domains, in particular the two-layered parallelization approach.

4 Outlook

4.1 Local Operators

The current compiler optimizations for local operators can be further extended to unroll the loops of convolutions and to propagate the constants of the filter masks. To do so, a syntax using lambda-functions has been defined as seen in Listing 4.1. However, Clang, on which the source-to-source compiler is based, does not yet support lambda-functions. As soon as this support is available, constant propagation and loop unrolling can be applied to local operators where the filter mask constants are known at compile time.

    void kernel() {
        output() = convolute(cMask, SUM, [&] () {
            return cMask() * Input(cMask);
        });
    }

Listing 4.1: Using the convolute function provided by the framework and a lambda-function to express convolution kernels.
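How a lambda-based convolute function could work can be sketched on the host in plain C++. This is an illustration under assumptions, not the framework's implementation: the Mask struct, the three-argument lambda, and filterPixel are hypothetical, only the SUM mode is covered, and the border is clamped for simplicity.

```cpp
#include <cassert>
#include <vector>

// Hypothetical square filter mask with odd size; not the framework's class.
struct Mask {
    int size;                  // the window is size x size
    std::vector<float> coeff;  // row-major coefficients
};

// Sum body(dx, dy, c) over all mask entries, where (dx, dy) is the offset
// from the window centre and c the corresponding mask coefficient.
template <typename Body>
float convolute(const Mask& mask, Body body) {
    assert(mask.size % 2 == 1);
    assert(static_cast<int>(mask.coeff.size()) == mask.size * mask.size);
    const int half = mask.size / 2;
    float sum = 0.0f;
    for (int dy = -half; dy <= half; ++dy)
        for (int dx = -half; dx <= half; ++dx)
            sum += body(dx, dy, mask.coeff[(dy + half) * mask.size + (dx + half)]);
    return sum;
}

// Filter pixel (x, y) of a row-major image, clamping at the image border.
float filterPixel(const std::vector<float>& img, int width, int height,
                  const Mask& mask, int x, int y) {
    auto clampTo = [](int v, int lo, int hi) { return v < lo ? lo : (v > hi ? hi : v); };
    return convolute(mask, [&](int dx, int dy, float c) {
        int xx = clampTo(x + dx, 0, width - 1);
        int yy = clampTo(y + dy, 0, height - 1);
        return c * img[yy * width + xx];
    });
}
```

Because the loop over the mask lives inside convolute and the body is a lambda, a compiler that knows the mask size and coefficients at compile time can unroll both loops and fold the coefficients into the generated code.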

4.2 Global Operators

In the current source-to-source compiler, support for global operators (reductions in the case of the considered application domain) is not yet implemented. A syntax has been proposed to describe global operators using functors in C++. Functors are supported by Clang. Functors are similar in nature to lambda-functions, with the difference that functors have an internal state and have no access to the surrounding scope. Listing 4.2 defines a functor that determines the minimal value within an image. The invocation of the functor is shown in Listing 4.3. To generate GPU code for this kind of reduction, parallel reductions (a pattern closely related to prefix scans) can be used. A template for reductions on graphics cards can be provided by the compiler, and the reduction function (determining the minimum in the example) can be derived from the functor.

    template<typename data_t>
    struct MinReduction : public GlobalReduction<data_t> {
        public:
            using GlobalReduction<data_t>::reduce;

            MinReduction(Accessor<data_t> &img, data_t neutral) :
                GlobalReduction<data_t>(img, neutral)
            {}

            data_t reduce(data_t left, data_t right) {
                if (left < right) return left;
                else return right;
            }
    };

Listing 4.2: Using functors to express global operations. Shown is a global operator that determines the minimal pixel value of an Image.

    // image definition
    Image<float> IN(width, height);

    // accessor definition
    Accessor<float> AccIn(IN);

    // global operation using functors
    MinReduction<float> redMin(AccIn, FLT_MAX);
    float min_pixel_functor = redMin.reduce();

Listing 4.3: Invocation of the global operator provided by the functor instance to determine the minimal pixel value of an Image.
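The reduction pattern that such a compiler-provided template could follow is sketched below as sequential C++. It is an illustration only: MinOp and treeReduce are hypothetical names, the Accessor and GlobalReduction classes of the listings are replaced by a plain std::vector, and the inner loop stands in for the threads that would run concurrently on a GPU.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Binary reduction function in the spirit of reduce() in Listing 4.2.
template <typename T>
struct MinOp {
    T operator()(T left, T right) const { return left < right ? left : right; }
};

// Tree-shaped reduction: in every step, element i is combined with element
// i + stride, which halves the number of active elements. The input is
// padded with the neutral element up to the next power of two.
template <typename T, typename Op>
T treeReduce(std::vector<T> data, T neutral, Op op) {
    if (data.empty()) return neutral;
    std::size_t n = 1;
    while (n < data.size()) n *= 2;
    data.resize(n, neutral);
    for (std::size_t stride = n / 2; stride > 0; stride /= 2)
        for (std::size_t i = 0; i < stride; ++i)  // one GPU thread per i
            data[i] = op(data[i], data[i + stride]);
    return data[0];
}
```

The neutral element plays the same role as the second constructor argument in Listing 4.3: it fills unused slots without changing the result, which is why FLT_MAX is the natural choice for a minimum.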

4.3 Vectorization

On Graphics Processing Units (GPUs) from AMD, Instruction Level Parallelism (ILP) is essential to exploit the Very Long Instruction Word (VLIW) architecture. One approach to exploit VLIW is to analyze the program code and to group independent instructions into VLIW instructions. This is done by the AMD compiler that generates assembly code for the target GPU. However, the compiler is in most cases not able to find enough independent work to keep all processing elements busy. Another approach is to write vectorized code, that is, to group independent code by hand so that the compiler can create VLIW instructions directly from the vectorized code. Instead of vectorizing the operations performed on one pixel, the operations on multiple neighboring pixels can be put into VLIW instructions. This can be done in a straightforward way for point operators, but for local and global operators, control flow prevents vectorization. Complex control flow analysis has to be performed in order to build use-def chains and detect kernel parts that can be vectorized. Kernel parts that cannot be vectorized have to be executed for each iteration point.
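For a point operator, grouping neighboring pixels can be sketched in plain C++ as follows. The example is illustrative only: Float4 is a hypothetical stand-in for a four-wide vector type such as OpenCL's float4, and the point operator (a gain and an offset) is an assumed example rather than one taken from the framework.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical four-wide vector type.
struct Float4 { float v[4]; };

// The same operation applied to four pixels; the four multiply-adds do not
// depend on each other and can therefore be packed into one wide instruction.
Float4 scaleAdd(Float4 p, float gain, float offset) {
    Float4 r;
    for (int k = 0; k < 4; ++k) r.v[k] = gain * p.v[k] + offset;
    return r;
}

// Point operator out = gain * in + offset, four neighboring pixels at a
// time. A scalar loop handles the pixels left over when the number of
// pixels is not a multiple of four.
std::vector<float> pointOp(const std::vector<float>& in, float gain, float offset) {
    std::vector<float> out(in.size());
    std::size_t i = 0;
    for (; i + 4 <= in.size(); i += 4) {
        Float4 p = {{in[i], in[i + 1], in[i + 2], in[i + 3]}};
        Float4 r = scaleAdd(p, gain, offset);
        for (int k = 0; k < 4; ++k) out[i + k] = r.v[k];
    }
    for (; i < in.size(); ++i) out[i] = gain * in[i] + offset;
    return out;
}
```

The scalar remainder loop is the simplest instance of a kernel part that is executed per iteration point. For local operators, the boundary handling branches would have to be treated in a similar way.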

5 Conclusions

This work has presented a domain-specific description of algorithms from medical imaging and an efficient mapping from this description to low-level CUDA and OpenCL code for Graphics Processing Unit (GPU) accelerators. The Domain-Specific Language (DSL) description allows expressing the three operators typically encountered in angiography: a) point operators, b) local operators, and c) global operators. Based on the metadata provided by the programmer, two-layered parallel code utilizing Single Program, Multiple Data (SPMD) and Multiple Program, Multiple Data (MPMD) parallelism is generated. Using this approach, it has been shown that code can be generated for boundary handling whose performance is constant, independent of the selected boundary handling mode, while the performance of other solutions varies significantly. Filter masks are stored to constant memory to avoid unnecessary recalculations. Padding for images stored to the global memory of GPUs is automatically added and index calculations are adjusted so that memory accesses are coalesced and the memory bandwidth can be optimally utilized. To determine a good configuration for the generated kernels, a heuristic was presented that takes boundary handling metadata, the resource usage of kernels, as well as hardware capabilities and limitations into account. The resulting kernel configuration and tiling minimizes the number of threads executing code for boundary handling. Also, the code generated by the presented framework is typically even faster than manual implementations and those relying on hardware support for boundary handling. In an experimental analysis, the generated code outperformed implementations from RapidMind, a commercial framework for multi-core and GPU programming, by a factor of two, and achieved results similar to the GPU backend of the widely used image processing library Open Source Computer Vision (OpenCV) and the NVIDIA Performance Primitives (NPP) library.
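The padding and the adjusted index calculation mentioned above can be illustrated with a short C++ sketch. The helper names and the alignment of 256 bytes used in the usage note are assumptions chosen for illustration; the alignment the framework actually uses depends on the target device.

```cpp
#include <cassert>
#include <cstddef>

// Row stride in elements after padding every image row to a multiple of
// `alignment` bytes, so that each row starts at an aligned address.
std::size_t paddedStride(std::size_t width, std::size_t elemSize, std::size_t alignment) {
    assert(elemSize > 0 && alignment >= elemSize && alignment % elemSize == 0);
    std::size_t perAlign = alignment / elemSize;  // elements per aligned chunk
    return (width + perAlign - 1) / perAlign * perAlign;
}

// Linear index of pixel (x, y) in the padded layout: the stride replaces
// the image width in the index calculation.
std::size_t paddedIndex(std::size_t x, std::size_t y, std::size_t stride) {
    return y * stride + x;
}
```

With an assumed alignment of 256 bytes, a row of 1000 float pixels is padded to a stride of 1024 elements, while a row of 4096 pixels needs no padding.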

The results presented in this work have been published to some extent in [MLT11] and [MHT⁺12]. The sources of the presented framework are going to be publicly available as open source.

Bibliography

[Ban00] I.N. Bankman. Handbook of Medical Imaging: Processing and Analysis. Elsevier, 2000.

[CHK⁺09] J.L.T. Cornwall, L. Howes, P.H.J. Kelly, P. Parsonage, and B. Nicoletti. High-Performance SIMT Code Generation in an Active Visual Effects Library. In Proceedings of the 6th ACM Conference on Computing Frontiers (CF), pages 175–184. ACM, May 2009.

[Cla11] Clang. Clang: A C Language Family Frontend for LLVM. http://clang.llvm.org, 2007–2011.

[DBB07] R. Dolbeau, S. Bihan, and F. Bodin. HMPP: A Hybrid Multi-core Parallel Programming Environment. In Proceedings of the 1st Workshop on General Purpose Processing on Graphics Processing Units (GPGPU), October 2007.

[DWL⁺11] P. Du, R. Weber, P. Luszczek, S. Tomov, G. Peterson, and J. Dongarra. From CUDA to OpenCL: Towards a Performance-portable Solution for Multi-platform GPU Programming. Parallel Computing, 2011.

[HA11] T.D. Han and T.S. Abdelrahman. hiCUDA: High-level GPGPU Programming. IEEE Transactions on Parallel and Distributed Systems, 22(1):78–90, January 2011.

[HLDK09] L. Howes, A. Lokhmotov, A. Donaldson, and P. Kelly. Towards Metaprogramming for Parallel Systems on a Chip. In Proceedings of the 3rd Workshop on Highly Parallel Processing on a Chip (HPPC), pages 36–45. Springer, August 2009.

[KEFA03] D. Kunz, K. Eck, H. Fillbrandt, and T. Aach. Nonlinear Multiresolution Gradient Adaptive Filter for Medical Images. In Proceedings of SPIE Medical Imaging 2003: Image Processing, volume 5032, pages 732–742. SPIE, February 2003.

[KWmH10] D. Kirk and W.W. Hwu. Programming Massively Parallel Processors: A Hands-on Approach. Morgan Kaufmann, 2010.

[LA04] C. Lattner and V. Adve. LLVM: A Compilation Framework for Lifelong Program Analysis & Transformation. In Proceedings of the International Symposium on Code Generation and Optimization (CGO), pages 75–86. IEEE, March 2004.

[LS08] C. Lin and L. Snyder. Principles of Parallel Programming. Addison-Wesley Publishing Company, 2008.

[Lue08] D. Luebke. CUDA: A Heterogeneous Parallel Programming Model for Manycore Computing. Tutorial at the 13th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS). http://gpgpu.org/static/asplos2008/index.shtml, March 2008.

[MDP⁺04] M. McCool, S. Du Toit, T. Popa, B. Chan, and K. Moule. Shader Algebra. ACM Transactions on Graphics (TOG), 23(3):787–795, August 2004.

[MHT⁺10] Richard Membarth, Frank Hannig, Jürgen Teich, Mario Körner, and Wieland Eckert. Comparison of Parallelization Frameworks for Shared Memory Multi-core Architectures. In Proceedings of the Embedded World Conference, Nuremberg, Germany, March 2010.

[MHT⁺11a] Richard Membarth, Frank Hannig, Jürgen Teich, Mario Körner, and Wieland Eckert. Frameworks for GPU Accelerators: A Comprehensive Evaluation using 2D/3D Image Registration. In Proceedings of the 9th IEEE Symposium on Application Specific Processors (SASP), pages 78–81, San Diego, CA, USA, June 2011. IEEE.

[MHT⁺11b] Richard Membarth, Frank Hannig, Jürgen Teich, Mario Körner, and Wieland Eckert. Frameworks for Multi-core Architectures: A Comprehensive Evaluation using 2D/3D Image Registration. In Proceedings of the 24th International Conference on Architecture of Computing Systems (ARCS), pages 62–73, Lake Como, Italy, February 2011. Springer.

[MHT⁺11c] Richard Membarth, Frank Hannig, Jürgen Teich, Gerhard Litz, and Heinz Hornegger. Detector Defect Correction of Medical Images on Graphics Processors. In Proceedings of the SPIE Medical Imaging 2011: Image Processing, volume 7962, pages 79624M 1–12, Lake Buena Vista, Orlando, FL, USA, February 2011. SPIE.

[MHT⁺12] Richard Membarth, Frank Hannig, Jürgen Teich, Mario Körner, and Wieland Eckert. Device-dependent Code Generation for GPU Accelerators in Medical Imaging for Local Operators. In Proceedings of the 26th IEEE International Parallel & Distributed Processing Symposium (IPDPS), Shanghai, China, May 2012. IEEE.

[MLH⁺12] Richard Membarth, Jan-Hugo Lupp, Frank Hannig, Jürgen Teich, Mario Körner, and Wieland Eckert. Dynamic Task-Scheduling and Resource Management for GPU Accelerators in Medical Imaging. In Proceedings of the 25th International Conference on Architecture of Computing Systems (ARCS), Munich, Germany, February 2012. Springer.

[MLT11] Richard Membarth, Anton Lokhmotov, and Jürgen Teich. Generating GPU Code from a High-level Representation for Image Processing Kernels. In Proceedings of the 5th Workshop on Highly Parallel Processing on a Chip (HPPC), Bordeaux, France, August 2011. Springer.

[NSL⁺11] C.J. Newburn, B. So, Z. Liu, M. McCool, A. Ghuloum, S. Du Toit, Z.G. Wang, Z.H. Du, Y. Chen, G. Wu, P. Guo, Z. Liu, and D. Zhang. Intel’s Array Building Blocks: A Retargetable, Dynamic Compiler and Embedded Language. In Proceedings of the 9th Annual IEEE/ACM International Symposium on Code Generation and Optimization (CGO), pages 224–235. IEEE, April 2011.

[NVI09] NVIDIA Corporation. NVIDIA’s Next Generation CUDA Compute Architecture: Fermi. White Paper, October 2009.

[OLG⁺07] J.D. Owens, D. Luebke, N. Govindaraju, M. Harris, J. Krüger, A.E. Lefohn, and T.J. Purcell. A Survey of General-Purpose Computation on Graphics Hardware. Computer Graphics Forum, 26(1):80–113, March 2007.

[Rap09] RapidMind. RapidMind Development Platform Documentation. RapidMind Inc., June 2009.

[RRS⁺08] S. Ryoo, C.I. Rodrigues, S.S. Stone, J.A. Stratton, S.Z. Ueng, S.S. Baghsorkhi, and W.W. Hwu. Program Optimization Carving for GPU Computing. Journal of Parallel and Distributed Computing, 68(10):1389–1401, October 2008.

[Rus06] J.C. Russ. The Image Processing Handbook, volume 5. CRC Press, 2006.

[TKS⁺11] P. Thoman, K. Kofler, H. Studt, J. Thomson, and T. Fahringer. Automatic OpenCL Device Characterization: Guiding Optimized Kernel Design. In Proceedings of the 17th International European Conference on Parallel and Distributed Computing (Euro-Par), pages 438–452. Springer, August 2011.

[TM98] C. Tomasi and R. Manduchi. Bilateral Filtering for Gray and Color Images. pages 839–846. IEEE Computer Society, January 1998.

[ULBH08] S.Z. Ueng, M. Lathara, S. Baghsorkhi, and W. Hwu. CUDA-lite: Reducing GPU Programming Complexity. Languages and Compilers for Parallel Computing, 5335:1–15, 2008.

[Wil11] Willow Garage. Open Source Computer Vision (OpenCV). http://opencv.willowgarage.com/wiki, 1999–2011.

[Wol10] M. Wolfe. Implementing the PGI Accelerator Model. In Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units (GPGPU), pages 43–50. ACM, March 2010.

[WPSAM10] H. Wong, M.M. Papadopoulou, M. Sadooghi-Alvandi, and A. Moshovos. Demystifying GPU Microarchitecture through Microbenchmarking. In Proceedings of the 2010 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), pages 235–246. IEEE, 2010.

A Acronyms

ALU Arithmetic Logic Unit

API Application Programming Interface

ArBB Array Building Blocks

AST Abstract Syntax Tree

CAL Compute Abstraction Layer

Cell B.E. Cell Broadband Engine

Cg C for Graphics

Ct C for Throughput Computing

CPU Central Processing Unit

CUDA Compute Unified Device Architecture

DSL Domain-Specific Language

eDSL Embedded Domain-Specific Language

FMA Fused Multiply-Add

ID IDentifier

GLSL OpenGL Shading Language

GPU Graphics Processing Unit

GPGPU General Purpose Computation on Graphics Processing Unit

HLSL High Level Shading Language

ILP Instruction Level Parallelism

NPP NVIDIA Performance Primitives

OpenCL Open Computing Language

OpenCV Open Source Computer Vision

OpenGL Open Graphics Library

MAD Multiply-Add

PCIe Peripheral Component Interconnect Express

RGBA Red Green Blue Alpha

SFU Special Function Unit

MIMD Multiple Instruction, Multiple Data

SIMD Single Instruction, Multiple Data

SPMD Single Program, Multiple Data

SIMT Single Instruction, Multiple Thread

TDP Thermal Design Power

MPMD Multiple Program, Multiple Data

VLIW Very Long Instruction Word
