SPES. Software Platform Embedded Systems 2020

(1)

SPES

Software Platform Embedded Systems 2020

- Beschreibung der Fallstudie „Programmierverfahren zur

Generierung parallelen echtzeitfähigen Codes in der

diagnostischen Bildverarbeitung“

-Phase 2

Deliverable D_MT_5.1_1

Version 1.0

Projektbezeichnung SPES 2020

Verantwortlich

Richard Membarth

QS-Verantwortlich

Mario Körner, Frank Hannig

Erstellt am

25. Januar 2012

Freigabestatus

Vertraulich für Partner

Projektöffentlich

x Öffentlich

Bearbeitungsstatus

in Bearbeitung

vorgelegt

x fertig gestellt

(2)

Weitere Produktinformationen

Erzeugend Richard Membarth

Mitwirkend Mario Körner, Frank Hannig, Wieland Eckert, Jürgen Teich

Änderungsverzeichnis

Änderung Geänderte Beschreibung der

Änderungen Autor Zustand Nr. Datum Version Kapitel

1 01.10.2011 0.1 2, 3, 4 DSL,

Codegener-ierung & Outlook RM in Bearbeitung 2 09.01.2012 0.2 2, 3, 4 DSL,

Codegener-ierung & Outlook RM in Bearbeitung 3 24.01.2012 0.3 1, 5 Introduction,

Conclusions RM in Bearbeitung 4 25.01.2012 1.0 Alle Finaler Report RM fertig

Prüfverzeichnis

Die folgende Tabelle zeigt einen Überblick über alle Prüfungen – sowohl Eigenprü-fungen wie auch PrüEigenprü-fungen durch eigenständige Qualitätssicherung – des vorliegen-den Dokumentes.

Datum Geprüfte Version Anmerkungen Prüfer Neuer Produktzustand 24.11.2011 0.1 Fragen zu Rand-behandlung und DSL FH in Bearbeitung 23.01.2012 0.2 Fragen zu Bi-lateral Filter, Nomenklatur und Randbe-handlung MK in Bearbeitung 24.01.2012 0.3 Fragen zu GPU Architektur und Formatierung FH fertig

(3)

Abstract

In this study, the programmability of GPU accelerators for medical imaging is inves-tigated. To cope with the complexity of programming GPU accelerators in medical imaging, a framework is presented that allows to describe image processing kernels in a domain-specific language, which is embedded into C++. The description is based on decoupled access/execute metadata, which allow the programmer to spec-ify both execution constraints and memory access patterns of kernels. The descrip-tion of point, local, and global operators in the domain of medical imaging and the efficient, device-dependent code generation for GPU accelerators from this descrip-tion is presented. A source-to-source compiler performs the transladescrip-tion from this high-level description into low-level CUDA and OpenCL code with automatic sup-port for device-dependent optimizations such as global memory padding for memory coalescing and optimal memory bandwidth utilization, boundary handling, and fil-ter masks. Taking the annotated metadata and the characfil-teristics of the parallel GPU execution model into account, two-layered parallel implementations—utilizing SIMD and MIMD parallelism—are generated. An abstract hardware model of graphics card architectures allows to model GPUs of multiple vendors like AMD and NVIDIA, and to generate device-dependent code for multiple targets. It is shown that the generated code is faster than manual implementations and those relying on hardware support for boundary handling. Even implementations from RapidMind, a commercial framework for GPU programming, are outperformed and similar results achieved compared to the GPU backend of the widely used image processing library Open Source Computer Vision (OpenCV), and the NVIDIA Per-formance Primitives (NPP) library.

(6)

(7)

1 Introduction

In a previous study, we evaluated different frameworks for programming current standard shared memory multi-core processors [MHT+_10,MHT+_{11b] and for} accel-erators based on Graphics Processing Units (GPUs) [MHT+_{11a] using the 2D/3D} image registration from medical imaging as case study. Different parallelization approaches (data-parallel vs. task-parallel) were considered in order to investigate different aspects like usability, performance, and overhead of the frameworks. The investigated frameworks included OpenMP, Cilk++, Threading Building Blocks, RapidMind, and OpenCL on standard multi-core processors, whereas CUDA, OpenCL, RapidMind, PGI Accelerator, and HMPP Workbench were considered for GPU ac-celerators.

Based on the outcomes of these evaluations, we decided to consider CUDA and OpenCL for further optimization and for investigations of source-to-source trans-lations. In the optimization study, we investigated the employment of stream-ing, the overlapping of data transfers with computations for applications from the medical domain (see [MHT+_{11c]) and the employment of resource management} and scheduling techniques for applications from the medical domain. A framework for scheduling and resource management was introduced, which supports a) high-priority tasks that can interrupt running GPU applications (dynamic scheduling behavior), b) assigns automatically streams to kernels for overlapping data transfers and execution kernels, and c) distributes task trees to different GPUs (multi-GPU support) [MLH+_12].

This study focuses on the source-to-source translation for medical imaging: the definition of a Domain-Specific Language (DSL) for medical imaging applications, a source-to-source compiler that understands the DSL and can map it to a target language (CUDA and OpenCL).

In this chapter the motivation for the current study—the need for a DSL and tools to generate low-level code for different targets—will be given. Since, the hardware architecture of current graphics cards have a significant impact on the way algorithms are written and optimized for GPUs, a brief overview of current GPU architectures from AMD and NVIDIA is given at the end of this chapter.

1.1 Motivation

Computer systems are increasingly heterogeneous, as many important computa-tional tasks, such as multimedia processing, can be accelerated by special-purpose processors that outperform general-purpose processors by one or two orders of

(8)

mag-1 Introduction

nitude, importantly, in terms of energy efficiency as well as in terms of execution speed.

Until recently, every accelerator vendor provided their own Application Program-ming Interface (API), typically based on the C programProgram-ming language. For exam-ple, NVIDIA’s API called Compute Unified Device Architecture (CUDA) targets systems accelerated with GPUs. In CUDA, the programmer dispatches compute-intensive data-parallel functions (kernels) to the GPU, and manages the interaction between the Central Processing Unit (CPU) and the GPU via API calls. Ryoo et al. [RRS+_{08] highlight the complexity of CUDA programming, in particular, the} need for exploring thoroughly the space of possible implementations and configura-tion opconfigura-tions. Open Compute Language (OpenCL), a new industry-backed standard API that inherits many traits from CUDA, aims to provide software portability across heterogeneous systems: correct OpenCL programs will run on any standard-compliant implementation. OpenCL per se, however, does not address the problem of performance portability; that is, OpenCL code optimized for one accelerator de-vice may perform dismally on another, since performance may significantly depend on low-level details, such as data layout and iteration space mapping [HLDK09].

Nevertheless may the achievable speedup when outsourcing workload performed typically sequentially in software on a processor to an acceleration coprocessorin hardware be limited in case memory transfers are require frequently and if these memory transfer phases turn out to be time-consuming.

CUDA and OpenCL are considered as low-level programming languages since the user has to take care about parallelization of the algorithm as well as the communica-tion between the host and the GPU accelerator. In general, low-level programming increases the cost of software development and maintenance: whilst low-level lan-guages can be robustly compiled into efficient machine code, they effectively lack support for creating portable and composable software.

While previous work shows that running the same kernels (e. g., written in OpenCL) on different hardware (from AMD and NVIDIA) can have significant impact on the performance [DWL+_{11], the framework presented in this work serves to protect} investments in software in the face of the ever changing landscape of computer systems.

High-level languages with domain-specific features are more attractive to domain experts, who do not necessarily wish to become target system experts. To compete with low-level languages for programming accelerated systems, however, domain-specific languages should have an acceptable performance penalty.

This work presents a framework for medical image processing in the domain of angiography that allows programmers to concentrate on developing algorithms and applications, rather than on mapping them to the target hardware or dealing with low-level implementations. More specific, as design entry to the framework, a domain-specific language is defined to describe image processing kernels on an abstract level. Three groups of operators that have to be supported by the domain-specific language for medical imaging have been identified: a) point operators, b) local operators, and c) global operators. Point operators are applied to the picture

(9)

1.2 On the Democratization of Parallel Computing

elements (pixels) of the image and solely the pixel the point operator is applied to contributes to the operation (e. g., adding a constant to each pixel of the image). Local operators are similar to point operators, but also neighboring pixels contribute to the operation (e. g., replace the current pixel value by the average value of its neighbors). In contrast, global operators (e. g., reduction operators) produce one output for the operator applied to all pixels of the image (e. g., compute the sum of all pixels).

The basics of the proposed framework are implemented as a library of C++ classes and will be summarized in Chapter 2. The description of point operators, local operators, and reductions—one specific type of global operators—in the DSL supported by the framework is described in Section 2.2.2. Chapter 3 introduces the Clang-based compiler to translate this description into efficient low-level CUDA and OpenCL code as well as corresponding code to talk to the GPU accelerator. The transformations applied to the target code and the run-time configuration are device-dependent and take an abstract architecture model of the target graphics card hardware into account (Section 3.4). This results in efficient code that is evaluated against highly optimized GPU frameworks from RapidMind and OpenCV, as well as hand-written implementations (Section 3.5). In Section 3.6, the approach followed in this work is compared with other approaches and an outlook is given in Chapter 4 for future extensions of the framework and code generation optimizations. Finally, this work is concluded in Chapter 5.

1.2 On the Democratization of Parallel Computing

In the following, the scope of this work is outlined. First, a short historical view on GPU architectures and GPU programming is given, before current GPU architec-tures considered in this work are explained in detail.

1.2.1 From GPGPU to GPU Computing

Driven mainly by the gaming industry, commodity graphics hardware (GPUs) have developed at a stunning pace the last decade, pushing the performance of graphics cards. Consequently, it was only a matter of time until developers discovered that graphics cards are capable of performing more than only graphics computations. Starting back in the 1980s the first attempts for General Purpose Computation on Graphics Processing Unit (GPGPU) have been made, but were not widespread until programmable shader units were introduced. It was now possible to program the different shaders of the graphics hardware pipeline (see Figure 1.1) using DirectX and Open Graphics Library (OpenGL). However, it was hard for programmers to code for the graphics cards since each shader unit was optimized only for one task and functionality was limited to those tasks [OLG+_{07]. The hardest part, however,} was that the programmers had to trick the GPU into general purpose computing by casting the problem as graphics: The data had to be turned into texture maps and

(10)

1 Introduction

Figure 1.1: Traditional graphics hardware pipeline. The vertex and fragment pro-cessor stages are both programmable. Image source: [OLG+_07].

the algorithms into rendering passes [Lue08]. Shading languages like NVIDIA’s C for Graphics (Cg), the OpenGL Shading Language (GLSL), or the High Level Shading Language (HLSL) used with DirectX emerged, but still the hardware limitations were existent and the developer still had to think in graphics. For instance, efficient scatter operations—indexed write array operations—were not supported.

This, however, changed with the introduction of the Tesla architecture and the G80 chipsets (found in the GeForce 8 series GPUs from NVIDIA) end of 2006: the hardware units dedicated for individual stages within the graphics pipeline of Figure 1.1 have been unified into a single shader core that can be used for different shaders (e. g., vertex and geometry shader). At the same time, NVIDIA introduced CUDA, a parallel computing architecture and programming language for its GPUs. CUDA was the first framework to provide a non-graphics API that allows to access the processor cores of GPU architectures. The limitations of previous architectures have been removed: device memory can be accessed using direct load and store operations, programs are not layered anymore on graphics, and a lower branching penalty that allows enhanced control flow. ATI (now part of AMD) followed the same approach and unified their shader architecture.

As a consequence, the term GPGPU is used within this work to denote the programming of GPUs for general purpose computing without a dedicated API while GPU Computing is used for programming with a dedicated API. In recent literature, the term Heterogeneous Computing is also often used for what is denoted by the term GPU Computing in this work.

(11)

1.2.2 GPU Architecture

A detailed knowledge of the underlying hardware architecture of current graphics cards is indispensable in order to achieve good performance. Therefore, a hardware architecture for current graphics cards’ architectures is introduced. This hardware model is used within this work for efficient mapping to graphics cards from AMD and NVIDIA, but is also valid for graphics cards of other vendors like ARM or Imagination Technologies. The model is used to hide low-level details of different GPU architectures. At the same time, it allows to refine the model when necessary for efficient target-specific optimizations. For programming GPUs, each hardware manufacturer provide their own low-level frameworks (e. g., Compute Abstraction Layer (CAL) from AMD and CUDA from NVIDIA) and introduced terms to name software and hardware features. This causes confusion and makes it hard to com-pare different hardware architectures. For this reason, terms within this work use the notation from OpenCL, a hardware-independent software framework for hetero-geneous platforms. Only when vendor-specific hardware features are introduced in this section, the vendor-specific term will be mentioned as well as the corresponding OpenCL term.

Figure 1.2 shows the hardware model as used within this work: the compute device denotes the graphics card including the compute device memory (GPU memory). Each compute device hosts several compute units, which in turn host the processing elements, the Arithmetic Logic Units (ALUs). The number of processing elements within one compute unit is constant for one graphics cards generation, while the number of compute units varies and is used to distinguish low-end, mid-range, and high-end cards. Each processing element has access to different memory types: a) fast on-chip private memory, which is implemented as registers, b) fast on-chip local memory, which is shared between all processing elements within one compute unit, and c) device memory with high latency. The device memory is accessible in two ways, namely read-only constant memory, and read-write global memory. Constant memory can only be modified by the host system and is cached, while global memory can be also modified by the processing elements of the compute device. To decrease memory latency, some compute devices feature L1 and L2 caches. Since the terms used to describe architectural features of graphics cards are sometimes misleading, Table 1.1 lists the corresponding CPU terms and those used by OpenCL, NVIDIA, and AMD. The ones used within this work are underlined.

Table 1.1: Terms in CPU and CPU terminology.

CPU OpenCL NVIDIA AMD

SIMD processor compute unit streaming multiprocessor compute unit

SIMD unit processing element streaming processor stream core

(12)

1 Introduction

Figure 1.2: GPU architecture model: the GPU (compute device) is sub-divided into multiple compute units, which host several Processing Elements (PEs). Private memory is associated to each PE and only accessible by the associated PE. In contrast, local memory is visible to all PEs within one compute unit and global memory by the PEs of all compute units.

For experimental evaluation, the two most recent GPU generations from AMD and NVIDIA are considered. Therefore, Table 1.2 lists the specifications of the AMD Radeon HD 5870 (Cypress architecture), the AMD Radeon HD 6970 (Cay-man architecture), NVIDIA Quadro FX 5800 (Tesla architecture), and NVIDIA Tesla C2050 (Fermi architecture) graphics cards. The biggest difference between the underlying architectures is that the processing elements of the AMD cards use a Very Long Instruction Word (VLIW) implementation (VLIW5/VLIW4), while the processing elements of NVIDIA’s cards are scalar units. The VLIW architec-ture are more suited for traditional graphics workloads, where the Red Green Blue Alpha (RGBA) components of a pixel are processed within one VLIW instruction. On NVIDIA cards, this takes four instructions. However, for non-graphics work-loads, NVIDIA’s hardware architecture is better suited: instructions can be mapped directly to the processing elements, while on AMD cards, the compiler has to look for independent scalar instructions that can be issued within one VLIW instruction. That is, although GPUs from AMD have double the theoretical peak performance, NVIDIA GPUs are preferred for GPU computing due to better software and

(13)

com-1.2 On the Democratization of Parallel Computing

piler support. The trend towards heterogeneous computing, however, becomes ap-parent when looking at upcoming architectures: NVIDIA’s upcoming Kepler and Maxwell architectures will add support for virtual memory and pre-emption, while AMD’s upcoming Graphics Core Next Gen will move to scalar processing elements and add support for virtual memory as well.

Example: Fermi Architecture

One implementation of the hardware architecture model of Figure 1.2 is the Tesla C2050 from NVIDIA with 448 processing elements distributed over 14 compute units Table 1.2: Comparison of hardware capabilities of the Cypress (Radeon HD 5870), Cayman (Radeon HD 6970), Tesla (Quadro FX 5800), and Fermi (Tesla C2050) architectures.

Cypress Cayman Tesla Fermi

Compute units up to 20 up to 24 up to 30 up to 14

Warp size 16 16 8 2× 16

Processing elements 320 (VLIW5) 384 (VLIW4) 240 (Scalar) 448 (Scalar)

ALUs 1600 1536 240 448

Shader clock 0.850GHz 0.880GHz 1.296GHz 1.150GHz

GFLOPS (SP) 27202 ₂₇₀₃2 ₉₉₃1,3 ₁₂₈₈2,4

GFLOPS (DP) 5442 ₆₇₅2 ₇₈2 ₅₁₅2

Memory type GDDR5 GDDR5 GDDR3 GDDR5

Memory interface 256-bit 256-bit 512-bit 384-bit

Memory clock 1.200GHz 1.375GHz 0.800GHz 0.750GHz Memory bandwidth 154GB/s 176GB/s 102.4GB/s 144GB/s Memory size 2GB 2GB 4GB 6GB L2 cache 512KB 512KB n/a 768KB L1 cache 8KB5 ₈_KB5 ₈_KB5 _{16/48 KB}6 L2 bandwidth 435GB/s 451GB/s n/a 589GB/s L1 bandwidth 1088GB/s 1352GB/s 768GB/s 1178GB/s Constant memory 48KB 48KB 64KB 64KB Constant cache 8KB 8KB 8KB 8KB Power (TDP)7 ₁₈₈_W ₂₅₀_W ₁₈₉_W ₂₃₈_W

1_{Counting Multiply-Add (MAD) as two operations.} 2_{Counting Fused Multiply-Add (FMA) as two operations.} 3_{Counting two operations per SFU.}

4_{Counting four operations per SFU.} 5_{Read-only texture cache.}

6_{Configurable.}

(14)

1 Introduction

Figure 1.3: Fermi architecture (see [NVI09]): Up to 512 streaming processors distributed over 16 multiprocessors. Each multiprocessor comprises 32streaming processors, local memory memory/L1 cache, and texture units.

as shown in Figure 1.3. In the Tesla C2050, each compute unit has 32 k registers (private memory) and a scratchpad memory of 64 KB that can be used as local memory and L1 cache. Thereby, the part of the scratchpad that is used as local memory is configurable. Possible configurations are 16/48 KB local memory/L1 cache and 48/16 KB local memory/L1 cache. Each compute unit has 32 ALUs for single-precision operations and 4 Special Function Units (SFUs) for hardware-accelerated transcendental operations like sine and cosine. For double-precision operations, always two single-precision ALUs are combined to perform one double-precision operations, resulting in 16 ALUs for double-double-precision operations. While the warp size for the Tesla C2050 is 32, up to 8 warps can be executed at the same time. That is, up to 1536 work-items can be assigned to one compute unit. However, one work-group can contain at most 1024 work-items, which means that multiple work-groups can be scheduled by one compute unit, namely up to 8. Table 1.3 summarizes these details also for the other graphics cards.

(15)

Table 1.3: Comparison of low-level capabilities of a single compute unit of the Cy-press (Radeon HD 5870), Cayman (Radeon HD 6970), Tesla (Quadro FX 5800), and Fermi (Tesla C2050) architectures.

Local memory 32KB 32KB 16 KB 16/48 KB1

Registers 16k 16k 16k 32k

Register file size 256KB 256KB 64 KB 128KB

ALUs (SP) 42 ₄3 ₈ ₂_{× 16}4

ALUs (DP) 22 ₂3 ₁ ₁₆4

SFUs 1 13 ₂ ₄

Warp size 64 64 32 32

Warps per compute unit 32 32 32 48

Work-items per compute unit 1024 1024 1024 1536

Work-groups per compute unit 8 8 8 8

Work-items per work-group 256 256 512 1024

1_{Configurable.}

2_{The first four units of the VLIW5 are used either for four single precision operations, or two}

double-precision operations. The fifth unit is dedicated for hardware-accelerated transcendental operation.

3_{The four units of the VLIW4 are used either for four single precision operations, two}

double-precision operations, or one hardware-accelerated transcendental operation plus one one single-precision operation.

4_{The 2}

× 16 ALUs can be used either for 32 single-precision operations, or 16 double-precision operations.

1.2.3 GPU Programming

The programming model of GPUs exploits massive data-level parallelism. The execution model is based on the concept of Single Instruction, Multiple Data (SIMD) where groups of threads (up to the warp size) execute the same instruction within one compute unit. The SIMD execution model on GPUs is extended to support divergent execution paths for different execution units within one instruction stream: if threads branch in different directions (if-then-else), the hardware uses masks to decide which processing elements execute the current instruction. Once the control converges, the same instruction is executed by all processing elements again. Therefore, NVIDIA calls this new execution model also Single Instruction, Multiple Thread (SIMT) (comparable to Single Program, Multiple Data (SPMD)).

To exploit this data parallelism, a programmer writes so-called kernels that are executed by the processing elements of a compute unit. Kernel programs written in CUDA and OpenCL are based on the C language with some limitations. For example, function pointers, recursions, or variable-length arrays are not supported. Currently, first hardware and software support is added to enable limited recursion and a subset of C++. The interaction between the host (CPU) and the compute

(16)

1 Introduction

Figure 1.4: GPU architecture model: the GPU (compute device) is sub-divided into multiple compute units, which host several Processing Elements (PEs). Private memory is associated to each PE and only accessible by the associated PE. In contrast, local memory is visible to all PEs within one compute unit and global memory by the PEs of all compute units. device (GPU) is provided via API calls. These include resource management (e. g., device and memory allocation/deallocation), execution of programs (kernels) on the compute device, as well as synchronization between activities on the compute device and the host. The communication between host system and compute device happens over the Peripheral Component Interconnect Express (PCIe) bus as shown in Figure 1.4.

Although the GPU hardware can schedule and execute different kernels to dif-ferent compute units, this low-level scheduling and resource management feature is not exposed to the programmer. That is to say, the GPUs support a modification of the SPMD execution model within single compute units and the Multiple Program, Multiple Data (MPMD) execution model taking all compute units into account. From a programmers perspective, though, only SPMD is supported.

Since in data-parallel processing a domain of data elements (e. g., pixels of an image) is processed by a program, the data domain has to be specified and provided as input to a program. In GPU computing, the size of the domain (e. g., width and

(17)

Figure 1.5: GPU programming model: the domain (e. g., image) is sub-divided into independent work-groups, representing independent sub-problems of the domain (e. g., sub-regions of the image). Work-groups are assigned to compute units and executed by the processing elements of the compute unit. The work-group is further sub-divided into work-items (e. g., pix-els), which are scheduled at warp-granularity to the processing elements. height of an image) is called global work size1 _{and defines how many work-items} (threads) are allocated on the compute device. As the number of work-items that can be mapped to a single compute unit is limited, work-items are grouped into sub-domains that get assigned to compute units. The size of these sub-domains is also called local work size. The compute unit in turn schedules work-items in groups of the warp size to its processing elements. This breakdown of the domain into sub-domains and work-items is shown in Figure 1.5. When launching a kernel for execution on the compute device, the programmer has to specify the partitioning of the domain, that is, the size and shape of the global and local work size.

Within a kernel, built-in variables (CUDA) and function calls (OpenCL) are available to identify the current work-item on a processing element. This includes the local IDentifier (ID) within the sub-domain assigned to the compute unit as well as the global ID within the entire domain. These IDs are used to determine the data element (e. g., pixel within an image), a work-item performs operations on.

(18)

1 Introduction

1.2.4 GPU Performance Considerations

Given the fact current graphics cards provide a peak performance that is a magni-tude higher compared to current high-end server processors2 _{but a bandwidth that} is only 4 − 5× higher, imposes high demands on programs in order to exploit the massive computational performance of GPUs: a) massive parallelism, b) massive multithreading, c) scalability and portability, d) memory hierarchy exploitation, and e) efficient mapping to the hardware [KWmH10].

Massive Parallelism In traditional programming for multi-core CPUs, the num-ber of threads matches typically the available cores (i. e., four threads are used on a quad-core). Thread creation on CPUs is relatively expensive and threads are long-living. In contrast, threads are lightweight and perform only little work on GPUs. On Fermi, each compute unit is capable to schedule up to 1536 threads (work-items). The Tesla C2050 hosts 14 compute units, that is, 21504 threads can be active at the same time. This means that a different work granularity is appropriate for GPUs compared to multi-core programming on CPUs. Instead of assigning slices of an image to cores, pixels are assigned to cores. Using 21504 threads, mapping one pixel to one thread, a subregion of 146 × 146 pixels can be processed concurrently. This data-parallel approach is characteristic of GPU computing.

Massive Multithreading While the architecture of CPUs is optimized for latency— to finish a task as fast as possible, GPUs are optimized for throughput. Global memory on GPUs has a latency of 400–600 cycles and whenever an instruction has to wait for data from global memory, the thread stalls and waits until the data is available. This is in particular a problem since most kernels do only little calcula-tions. Consider for example a simple kernel that adds only a constant to each pixel of an image: index calculation and the arithmetic operation itself take only a few cycles compared to the latency to fetch the pixel from global memory and to store the modified pixel back to global memory. To solve this dilemma, GPUs support massive multithreading: whenever the threads within a warp stall, the compute unit schedules the next warp. All warps (up to 48 for the Tesla C2050) are scheduled this way in a round-robin fashion. At that time, the warp that stalled first is ready to continue execution—the data has been fetched from global memory while the other warps have been scheduled. This way, execution of single threads has a high latency, but the throughput of the total workload is high.

Scalability GPU programs have to run on a variety of different graphics cards. This includes different implementations of the same hardware architecture—from low-end over mid-range up to high-end cards—as well as different hardware archi-tectures from possibly different vendors. Writing portable and scalable programs 2_{Intel Xeon E7-8870 has a peak performance of 96 GFLOPS and a memory bandwidth of}

(19)

for different graphics cards of the same hardware architecture can be cut down to provide enough parallelism: the number of compute units and available memory bandwidth varies, but the remaining hardware architecture is the same. While the low-end cards may have only 2 compute units, the high-end cards can have up to 15compute units in case of the Fermi architecture. Therefore, enough parallelism has to be provided to keep at least high-end cards busy.

Memory Hierarchy Although GPUs provide an immense compute power, it is

hard to exploit this potential in many application domains due to an imbalance be-tween memory bandwidth and peak performance. Providing a memory bandwidth of 144 GB/s on the Tesla C2050, can deliver 36 G 32-bit data elements. These data elements result in 36 GFLOPS for operations with one operand from global mem-ory, or in 18 GFLOPS for operations with two operands for global memory. The MAD/FMA operations used to characterize the peak performance require three operands from global memory, resulting in only 12 GFLOPS of the theoretically possible 1288 GFLOPS. This is less than 1% of the peak performance. To achieve higher performance, data reuse is essential and the memory hierarchy has to be used wisely. Table 1.4 shows how many operations per data fetch are required to exploit both data bandwidth and compute performance at the same time. As reference for peak performance calculations, normal single-precision operations are considered instead of add operations. Hardware vendors use the multiply-add operations to calculate peak performance since one multiply-multiply-add operation can be counted as two operations. Still, the number of operations per data fetch is enormous for global memory, ranging between 57–62 and 25–29 operations per data fetch for GPUs from AMD and NVIDIA, respectively. This gets even worse tak-ing into account that only about 80% of the available memory bandwidth can be achieved in real systems. Using caches improves the situation a lot, but still 8 and 4operations have to be performed on the fastest L1 cache and local memory for AMD and NVIDIA GPUs. Only registers can provide the required bandwidth to feed the compute units steadily with data. Therefore, it is essential to utilize the available memory hierarchy effectively for compute kernels that are memory bound. This is also the case for most image processing kernels and in particular for those considered in this work.

For memory intensive applications, the transfer to the compute device can be a bottleneck, too. The compute device is connected to the host system using the PCIe bus in current systems. PCIe can deliver a bandwidth of up to 8 GB/s. This is one order of magnitude lower than global memory bandwidth.

Efficient Mapping Writing applications for GPUs can be relatively straightfor-ward and a speedup compared to a single-core reference implementation can be achieved in a relative short time period. However, mapping and matching the algo-rithm efficiently to the underlying hardware architecture requires low-level insights

(20)

1 Introduction

and changes to the algorithm. This includes memory coalescing, bank conflicts, and control flow divergence, just to name a few.

Memory access to global memory has to be coalesced to utilize the whole memory bandwidth and avoid serialized memory accesses. Similarly, local memory is divided into banks and access is serialized when multiple threads access the same bank. Such bank conflicts degrade memory bandwidth of local memory and should be avoided. A memory access pattern that is distributed evenly across the banks can be served in the same memory transaction. Control flow has to be minimized or adapted to the warp size of the compute units. When divergence occurs within one warp, diverging paths are executed one after another. That is, execution time is longer for programs that have divergent execution paths since the instructions of both paths have to be executed by all work-items. However, in case the divergence point occurs exactly between two warps, there is no penalty for divergent paths.

Table 1.4: Arithmetic operations required per data fetch in order to exploit both peak performance and memory bandwidth for the Cypress (Radeon HD 5870), Cayman (Radeon HD 6970), Tesla (Quadro FX 5800), and Fermi (Tesla C2050) architectures.

GFLOPS (SP) 10881 ₁₃₅₂1 ₃₁₁1 ₅₁₅1

PCIe bandwidth 8GB/s 8GB/s 8GB/s 8GB/s

Global mem. bandwidth 153.6GB/s 176GB/s 102.4 GB/s 144GB/s

L2 bandwidth 435GB/s 451GB/s n/a 589GB/s

L1 bandwidth 1088GB/s 1352GB/s 768GB/s 1178GB/s

Register bandwidth 8704GB/s2 3 ₁₀₈₁₃_GB/s2 3 ₂₄₈₈_GB/s2 ₄₁₂₂_GB/s2

Ops/fetch (GDDR) 56.67 61.45 24.30 28.61

Ops/fetch (L2 cache) 20.00 23.98 3.30 6.99

Ops/fetch (L1 cache) 8.00 8.00 n/a 3.50

Ops/fetch (registers) 1 1 1 1

1_{For normal operations (no MAD/FMA, no SFU).}

2_{For an operation reading 8 bytes for two operands from register.} 3_{Using 128-bit registers.}

(21)

2 Embedded Domain-specific

Language for Medical Imaging

This chapter introduces a Domain-Specific Language (DSL) for describing and rep-resenting algorithms typically encountered in medical imaging. In order to ease the interaction and integration of the DSL with existing code, the DSL is embedded into an existing language like C/C++. This type of DSL is also referred to as Embedded Domain-Specific Language (eDSL). In this work, the DSL is based on C++ classes and is, hence, also an DSL embedded into C++. The algorithms considered in this work operate on two-dimensional images and are encountered in many medical imaging applications such as image preprocessing and image postprocessing.

2.1 Image Processing Characteristics in Medical

Imaging

In this section, image processing algorithms encountered in the considered domain of angiography are discussed and illustrated. In angiography, images are recorded by a X-ray system and the acquired image has to be preprocessed in order to compensate for errors caused by the detector (see [MHT+_{11c]). This class of algorithms are also} classified as Fundamental Enhancement Techniques in literature [Rus06, Ban00]. The acquired image is stored digitally in a computer system (e. g., using an array in the C programming language). Algorithms compensating for different image defects are applied to the image and finally, the image can be displayed to the attending doctor.

2.1.1 Image Operator Classification

Typically, an image pipeline—a sequence of algorithms—is applied to one input image, before the next image is processed. The algorithms can be categorized into three algorithm groups:

• Point Operators are applied to the picture elements (pixels) of the image and solely the pixel the point operator is applied to contributes to the operation (e. g., adding a constant to each pixel of the image).

• Local Operators are similar to point operators, but also neighboring pixels contribute to the operation (e. g., replace the current pixel value by the average value of its neighbors).

(22)

2 Embedded Domain-specific Language for Medical Imaging

• Global Operators (e. g., reduction operators) produce one output for the operator applied to all pixels of the image (e. g., compute the sum of all pixels).

2.1.2 Application of Image Operators

In the ideal case, an image operator is applied to an input image, producing an output image of the same size, that is, the operator is applied to each pixel of the input image, producing one output pixel. This is, however, not the case for all algorithms. Some algorithms consider only pixels within a region of interest, that is, a detail of a picture. For example, in case only the inner part of an image is of interest to the doctor, a region of interest is specified that defines the desired part of the image. Zooming into or out of an image (upsampling or downsampling) creates images with a higher or lower resolution. When creating an image with a lower resolution, not every pixel of the input image needs to be processed. The opposite applies when a higher resolution image should be created from an image with a lower resolution. Therefore,

Input Images: Figure 2.1 shows the following access patterns reading from input images:

• The whole image is accessed.

• The image is accessed with an offset.

• Only a certain region of interest of the image is accessed—the image is cropped. • Every n-th pixel in one direction is accessed—a stride is defined for image

access.

• Combinations of the above access patterns.

In case of local operators, neighboring pixels are accessed. When the local op-erator is applied to a pixel at the image border, for example left image border, no neighboring pixels to the left are available. Therefore, special treatment is required, so called boundary handling. Different boundary handling modes are possible, like using the last valid pixel value or using a constant value.

Output Image: Similar to input images, also the region of an output images that is written can be constrained. Figure 2.2 shows the possibilities to write only certain parts of an output image:

• The whole image is accessed.

• The image is accessed with an offset.

• Only a certain region of interest of the image is accessed—the output image is cropped.

(23)

2.1 Image Processing Characteristics in Medical Imaging

(a) Image and boundary. (b) Image offset.

(c) Image crop. (d) Image stride.

(e) Image crop with offset.

Figure 2.1: Different views of an image for image operators: (a) shows the whole image and the surrounding boundary, while in (b) an offset to the image is used. (c) only a detail of an image is considered and in (d) every second pixel in the x-direction is considered. The combination of offset and crop is seen in (e).

(24)

(a) Output image. (b) Crop of output

im-age with offset.

Figure 2.2: Region of interest of the output image to be written: in (a) the whole output image is written, while in (b) only a region of interest is written.

Figure 2.3: The size of the output image defines the Iteration Space: to each pixel of the output image is one corresponding pixel in the input image assigned. • Combinations of the above constraints.

Iteration Space: Mapping input images to output images. Since

algo-rithms produce in the considered domain always one output image, the region of the output image that is created by the algorithm is also called Iteration Space. While always one, and exactly one output image can be used for image processing algorithms, multiple input images can be used. These input images can have differ-ent access patterns, but the same number of pixels assigned. Figure 2.3 shows the Iteration Spacedefined on the output image (highlighted red) and the corresponding pixels read in the input image (highlighted dark blue).

2.2 Domain-Specific Description for Image

Processing Kernels

This section introduces a DSL for describing image processing algorithms/kernels in the considered domain. The DSL is based on built-in C++ classes that describe

(25)

2.2 Domain-Specific Description for Image Processing Kernels

the following four basic components required to express image processing on an abstract level:

• Image: Describes data storage for the image pixels. Each pixel can be stored as an integer number, a floating point number, or in another format such as RGB, depending on instantiation of this templated class. The data layout is handled internally using multi-dimensional arrays.

• Iteration Space: Describes a rectangular region of interest in the output image, for example the complete image. Each pixel in this region is a point in the iteration space.

• Kernel: Describes an algorithm to be applied to each pixel in the Iteration Space.

• Accessor: Describes which pixels of an Image are seen within the Kernel. Similar to an Iteration Space, the Accessor defines an Iteration Space on an input image.

These components are an instance of decoupled access/execute metadata [HLDK09]: the Iteration Space specification provides ordering and partitioning constraints (ex-ecute metadata); the Kernel specification provides a pattern of accesses to uniform memory (access metadata). Currently, the access/execute metadata is mostly im-plicit: it is assumed that the iteration space is independent in all dimensions and has a 1:1 mapping to work-items (threads), and that the memory access pattern is obvious from the kernel code.

2.2.1 Point Operators

The interaction between the four basic classes of the DSL is explained in the follow-ing. Listing 2.1 shows the definition of a point operator for a simple kernel: adding a constant to each pixel of the image. To define the point operator, a class is defined that inherits from the Kernel base class provided by the framework. In the exam-ple, the point operator is called AddToPixelOperator. The Kernel-base class has to be initialized with an Iteration Space object in the constructor of the derived class (line 7–8). Other parameters in the constructors are assigned to member variables of the class and can be used later on in the kernel (line 7–8). Each input image Accessor has to registered to the kernel using the addAccessor() method. This is required so that the DSL can assign input pixels to output pixels as described in Section 2.1.2. The function of the point operator itself is defined in the virtual kernel()method provided by the base class. The pixel of the input image assigned to the current iteration space point can be retrieved by the parenthesis operator (). Within the kernel() method, the current output image pixel can be written us-ing the output() method. The output image is retrieved from the Iteration Space declaration, and is explained below.

(26)

2 Embedded Domain-specific Language for Medical Imaging 1 c l a s s A d d T o P i x e l O p e r a t o r : p u b l i c K e r n e l { 2 p r i v a t e: 3 A c c e s s o r<float> & I n p u t ; 4 f l o a t val ; 5 6 p u b l i c:

7 A d d T o P i x e l O p e r a t o r (I t e r a t i o n S p a c e<float> & IS , A c c e s s o r<float> & Input , f l o a t val ) : 8 K e r n e l( IS ) , I n p u t ( I n p u t ) , val ( val ) 9 { 10 a d d A c c e s s o r (& I n p u t ) ; 11 } 12 13 v o i d k e r n e l () { 14 o u t p u t () = I n p u t () + val ; 15 } 16 };

Listing 2.1: Point operator described in the framework: add a constant to each pixel of the image.

Listing 2.1 defines the point operator for an output Image bound to an Iteration Space and an input Accessor (bound to the input Image. The missing definition of these objects for applying the point operator is shown in Listing 2.2: Image objects are defined by specifying the image resolution (line 9–10) and get initial-ized by assigning plain C array invoking the = operator of the Image class. The Accessor on the IN image object is defined by passing the IN image object to the Accessor constructor (line 16). Optional parameters can be used to specify the crop (width×height) and offset (x, y) for the Accessor. Similar, the Iteration Space is defined and bound to the OUT Image (line 19). The crop and offset can be added as optional parameters, too. An instance of the point operator is created in line 22, passing the Iteration Space bound to the output Image, the input Accessor, and the constant value to the constructor. The point operator is executed by invoking the execute() method of the point operator (line 25). To retrieve the image pixel data associated with an Image object, the getData() operator can be used to retrieve a pointer to the plain C array.

1 c o n s t int w i d t h = 1024 , h e i g h t = 1 0 2 4 ; 2 c o n s t f l o a t val = 23. f ; 3 4 // p o i n t e r s to raw i m a g e d a t a 5 f l o a t * h o s t _ i n = . . . ; 6 f l o a t * h o s t _ o u t = . . . ; 7 8 // i n p u t and o u t p u t i m a g e s 9 Image<float> IN ( width , h e i g h t ) ; 10 Image<float> OUT ( width , h e i g h t ) ; 11

12 // i n i t i a l i z e i n p u t i m a g e 13 IN = h o s t _ i n ; // o p e r a t o r =

(27)

2.2 Domain-Specific Description for Image Processing Kernels 14 15 // i n p u t a c c e s s o r 16 A c c e s s o r<float> A c c I n ( IN ) ; 17 18 // d e f i n e r e g i o n of i n t e r e s t 19 I t e r a t i o n S p a c e I s O u t ( OUT ) ; 20 21 // d e f i n e k e r n e l

22 A d d T o P i x e l O p e r a t o r A d d K e r n e l ( IsOut , AccIn , val ) ; 23 24 // e x e c u t e k e r n e l 25 A d d K e r n e l . e x e c u t e () ; 26 27 // r e t r i e v e o u t p u t i m a g e 28 h o s t _ o u t = OUT . g e t D a t a () ;

Listing 2.2: Example code that initializes and executes the point operator, adding a constant value to each pixel within an image.

2.2.2 Local Operators

In image processing and in particular in medical imaging, local operators are widely used. These operators can be applied independently to all pixels of the image and depend only on the pixel itself and the pixels in its neighborhood. The neighbor-hood read by local operators in medical imaging is typically symmetrically arranged around the pixel to be calculated with the same diameter in each direction. Further-more, local operators describe the convolution of an image by a filter mask (e. g., Gaussian function) with the following properties: a) the filter mask is centered at the pixel it is applied to [0, 0] and b) bounded to the neighborhood [−m, +m]×[−n, +n]. The latter implies a window size (2m + 1) × (2n + 1) of local operators to be uneven (e. g., 3 × 3, 5 × 5, 9 × 3). Since the filter mask used for convolution is typically constant, the values for the filter mask can be stored to a lookup table. Figure 2.4 shows the filter masks used for a window size of 3 × 3: the pixels in the neighbor-hood (f) are convoluted with the closeness (c) and similarity (s) filter masks. While the filter mask for the closeness component is constant and depends only on the distance to the center pixel, the filter mask for the similarity component depends on the pixel values and has to be calculated for each pixel separately.

Figure 2.4 also indicates a problem when local operators are applied to an image: if the 3×3 filter mask is applied to a pixel located at the image border, pixels beyond the image are accessed. Accessing pixels out of bounds leads to erroneous pixel values in the resulting image and may result in program termination. Therefore, boundary handling modes need to be defined that determine the behavior when pixels beyond the image are accessed.

(28)

Figure 2.4: Bilateral filter: Convolution of the image (f) with the closeness (c) and similarity (s) filter masks.

Example: Bilateral Filter

In the following, the image processing framework is illustrated using a bilateral filter that smooths an image while preserving edges within the image [TM98]. This filter is used in many applications, amongst others also in medical imaging and will be used within this paper to discuss various properties of local operators. The bilateral filter applies a local operator to each pixel that consists of two components: a) the closeness function that considers the distance to the center pixel to determine the weighting of pixels, and b) the similarity function that takes the difference between pixel values into account to determine the weighting of pixels. Gaussian functions of the Euclidean distance are used for the closeness and similarity weights as seen in Equation (2.1) and (2.2), respectively.

(29)

2.2 Domain-Specific Description for Image Processing Kernels closeness((x, y), (x0, y0)) = e−12( k(x,y)−(x0,y0)k σd ) 2 (2.1) similarity(p, p0) = e−12( kp−p0k σr ) 2 (2.2) The bilateral filter replaces each pixel by a weighted average of geometric nearby and photometric similar pixel values. Only pixels within the neighborhood of the relevant pixel are used. The neighborhood and consequently also the convolution size is determined by the geometric spread σd. The parameter σr (photometric spread) in the similarity function determines the amount of combination. Thereby, difference of pixel values below σrare considered more similar than differences above σr.

To execute the bilateral filter on a Graphics Processing Unit (GPU), the filter kernel applied to one pixel has to be specified in a program. Such a program is also called kernel, executed on the GPU in parallel, and applied to each pixel of the image independently. The pseudo code for the bilateral filter executed on GPUs is shown in Algorithm 1. Here, the kernel itself is described in lines 3–12, defining the bilateral filter for one pixel, taking the pixels in its neighborhood into account. This code is executed sequentially as can be seen in the double nested innermost loops. This is the code that is typically written by a programmer in either Compute Unified Device Architecture (CUDA) or Open Compute Language (OpenCL). The parallel execution on the graphics hardware is expressed by the parallel forall loop in line 1 and the parallel for loop in line 2. While the parallel for loop describes the data-parallel execution within one compute unit on the GPU (think of it as a Single Instruction, Multiple Data (SIMD) processor), the parallel forall loop depicts the parallel execution on all available groups of processors (i. e., Multiple Instruc-tion, Multiple Data (MIMD) processors). Note that this is a two-layered approach exploiting two different types of parallelism: SIMD parallel execution within the processing elements of one compute unit, and MIMD parallel execution between multiple compute units. Although the graphics hardware supports the execution of different programs (Multiple Program, Multiple Data (MPMD)) on different com-pute units, programs written in CUDA and OpenCL contain the same code for all compute units (Single Program, Multiple Data (SPMD)).

Compared to point operators, neighboring pixels are accessed in local operators. Therefore, the , the In Listing 2.3, neighboring pixels are accessed by passing the column (dx) and row (dy) offsets as optional parameters to the parenthesis operator () of the Accessor (line 18 and 23).

1 c l a s s B i l a t e r a l F i l t e r : p u b l i c Kernel<float> { 2 p r i v a t e: 3 A c c e s s o r<float> & I n p u t ; 4 int sigma_d , s i g m a _ r ; 5 6 p u b l i c:

7 B i l a t e r a l F i l t e r (I t e r a t i o n S p a c e<float> & IS , A c c e s s o r<float> & Input , int sigma_d , int s i g m a _ r ) :

(30)

Algorithm 1: Bilateral filter implementation on the graphics card. 1 forall the thread blocks B do in parallel

2 for each thread t in thread block b do in parallel 3 x, y← get_global_index(b, t);

4 p, k← 0;

5 for yf =−2 ∗ sigma_d to +2 ∗ sigma_d do 6 for xf =−2 ∗ sigma_d to +2 ∗ sigma_d do 7 c← closeness((x, y), (x + xf, y + yf));

8 s← similarity(input[x, y], input[x + xf, y + yf]); 9 k← k + c ∗ s; 10 p← p + c ∗ s ∗ input[x + xf, y + yf]; 11 end 12 end 13 output[x, y]← p/k; 14 end 15 end 8 K e r n e l( IS ) , I n p u t ( I n p u t ) , s i g m a _ d ( s i g m a _ d ) , s i g m a _ r ( s i g m a _ r ) 9 { a d d A c c e s s o r (& I n p u t ) ; } 10 11 v o i d k e r n e l () { 12 f l o a t c_r = 1.0 f / ( 2 . 0 f * s i g m a _ r * s i g m a _ r ) ; 13 f l o a t c_d = 1.0 f / ( 2 . 0 f * s i g m a _ d * s i g m a _ d ) ; 14 f l o a t d = 0.0 f , p = 0.0 f , s = 0.0 f ; 15 16 for (int yf = -2* s i g m a _ d ; yf < = 2 * s i g m a _ d ; yf ++) { 17 for (int xf = -2* s i g m a _ d ; xf < = 2 * s i g m a _ d ; xf ++) { 18 f l o a t d i f f = I n p u t ( xf , yf ) - I n p u t () ; 19 20 s = exp ( - c_r * d i f f * d i f f ) ; 21 c = exp ( - c_d * xf * xf ) * exp ( - c_d * yf * yf ) ; 22 d += s * c ; 23 p += s * c * I n p u t ( xf , yf ) ; 24 } 25 } 26 o u t p u t () = p / d ; 27 } 28 };

Listing 2.3: Kernel description of the bilateral filter.

In Listing 2.4, the BilateralFilter is invoked, similar to the point operator example in Listing 2.2.

1 c o n s t int w i d t h = 1024 , h e i g h t = 1024 , s i g m a _ d = 3 , s i g m a _ r = 5; 2

3 // p o i n t e r s to raw i m a g e d a t a 4 f l o a t * h o s t _ i n = . . . ;

(31)

2.2 Domain-Specific Description for Image Processing Kernels

5 f l o a t * h o s t _ o u t = . . . ; 6

7 // i n p u t and o u t p u t i m a g e s 8 Image<float> IN ( width , h e i g h t ) ; 9 Image<float> OUT ( width , h e i g h t ) ; 10 11 // i n i t i a l i z e i n p u t i m a g e 12 IN = h o s t _ i n ; // o p e r a t o r = 13 14 // d e f i n e r e g i o n of i n t e r e s t 15 I t e r a t i o n S p a c e<float> I s O u t ( OUT ) ; 16 17 // a c c e s s o r u s e d to a c c e s s i m a g e p i x e l s 18 A c c e s s o r<float> A c c I n ( IN ) ; 19 20 // d e f i n e k e r n e l 21 B i l a t e r a l F i l t e r BF ( IS , AccIn , sigma_d , s i g m a _ r ) ; 22 23 // e x e c u t e k e r n e l 24 BF . e x e c u t e () ; 25 26 // r e t r i e v e o u t p u t i m a g e 27 h o s t _ o u t = OUT . g e t D a t a () ;

Listing 2.4: Example code that initializes and executes the bilateral filter. The remainder of this section discusses how properties of local operators are ex-pressed in the proposed framework that relieve developers from low-level program-ming of GPU accelerators. In the following sections, the benefit of this domain-specific knowledge for efficient low-level code generation for GPU accelerators using source-to-source compilation is shown.

Boundary Handling

When an image is accessed out of bounds, there are different possibilities to handle this situation: the image is virtually expanded beyond the border and the pixel value of the expanded image is returned. This can be realized by a) adding additional pixels at each image boundary, and b) adjusting the index of the accessed pixel to a pixel that resides within the image. The approach followed in this work is the latter, which is discussed in Chapter 3.

However, it will be shown that boundary handling is completely transparent to the programmer in the proposed framework. The programmer specifies only the desired boundary handling mode on images and the source-to-source compiler will handle the low-level details. Supported boundary handling modes are listed in Ta-ble 2.1 and visualized in Figure 2.5. These include the most important boundary handling modes commonly encountered in image processing1_{. In particular,} mir-roring is important in medical imaging, for example, when multiresoultion filters 1_{However, if necessary, the framework can be easily extended to support further boundary}

(32)

Table 2.1: Supported boundary handling modes.

Mode Returned pixel value for out of bounds:

Undefined not specified, undefined

Repeat pixel value of image repeated at the border

Clamp last valid pixel within image

Mirror pixel value of image mirrored at the border Constant constant value, user defined

(see [KEFA03]) are applied to a image: the image gets upsampled multiple times and at the border occur large unnatural-looking artifacts when the border pixel gets replicated repeatedly. In contrast, using mirroring leads to natural looking images. Instead of tying the boundary handling mode to a particular image, the framework provides Accessors to describe the way an image is accessed as mentioned earlier. That is, no memory is allocated and hold by Accessors. In case boundary handling is required, the Accessor defines the view on a BoundaryCondition object rather than on an Image. The BoundaryCondition itself specifies the boundary handling mode on an input image and the size of the local operator. Listing 2.5 shows the collaboration of BoundaryCondition and Accessor in order to define clamping as boundary handling mode for a local operator with a window size of (4∗σd+ 1)×(4∗ σd+ 1). The resulting Accessor can be used in the BilateralFilter implementation defined in Listing 2.3 and instantiated in Listing 2.4 to add boundary handling support. In case a constant boundary handling is specified, the BoundaryCondition requires the constant value as an additional parameter.

1 // i n p u t i m a g e and a c c e s s o r w i t h b o u n d a r y h a n d l i n g 2 Image<float> IN ( width , h e i g h t ) ;

3 B o u n d a r y C o n d i t i o n<float> B c I n ( IN , 4* s i g m a _ d +1 , 4* s i g m a _ d +1 ,

B O U N D A R Y _ C L A M P) ;

4 A c c e s s o r<float> A c c I n ( B c I n ) ;

Listing 2.5: Usage of a BoundaryCondition to define the boundary handling mode for a local operator.

The source-to-source compiler creates then the appropriate boundary handling sup-port utilizing the access metadata provided by the framework. No further modi-fications are required. Tying the boundary handling mode to an Accessor instead of an Image has the additional benefit that multiple boundary handling modes can be defined on the same image. This allows different algorithms or kernels to access the same image, but to use different boundary handling modes—the most appro-priate one for the current algorithm—without the need to keep separate copies of the image.

(33)

2.2 Domain-Specific Description for Image Processing Kernels ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? M N O P I J K L E F G H A B C D ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? (a) Undefined. B C D F G H J K L A B C D E F G H I J K L A B C E F G I J K N O P J K L F G H B C D M N O P I J K L E F G H A B C D M N O I J K E F G A B C N O P J K L F G H M N O P I J K L E F G H M N O I J K E F G (b) Repeat. M M M M M M M M M M N O P M N O P M N O P P P P P P P P P P M M M I I I E E E A A A M N O P I J K L E F G H A B C D P P P L L L H H H D D D A A A A A A A A A A B C D A B C D A B C D D D D D D D D D D (c) Clamp. E I M F J N G K O M N O P I J K L E F G H P L H O K G N J F O N M K J I G F E C B A M N O P I J K L E F G H A B C D P O N L K J H G F D C B I E A J F B K G C A B C D E F G H I J K L D H L C G K B F J (d) Mirror. Q Q Q Q Q Q Q Q Q Q Q Q Q Q Q Q Q Q Q Q Q Q Q Q Q Q Q Q Q Q Q Q Q Q Q Q Q Q Q Q Q Q M N O P I J K L E F G H A B C D Q Q Q Q Q Q Q Q Q Q Q Q Q Q Q Q Q Q Q Q Q Q Q Q Q Q Q Q Q Q Q Q Q Q Q Q Q Q Q Q Q Q (e) Constant.

Figure 2.5: Boundary handling modes for image processing. By default, the be-havior is undefined when the image is accessed out of bounds (a). The framework allows to specify different boundary handling modes like re-peating the image (b), clamping to the last valid pixel (c), mirroring the image at the image border (d), and returning a constant value when accessed out of bounds (e).

(34)

Filter Masks

The framework provides also support for filter masks for local operators: a Mask holds the precalculated values used by the convolution filter function. Since the filter mask is constant for one kernel, this allows the source-to-source compiler to apply optimizations such as constant propagation and avoids superfluous and redundant calculations. To define a Mask in the framework, the filter mask size has to be provided. Afterwards, the precalculated values can be assigned to the Mask. Listing 2.6 shows the usage of a Mask for the closeness filter mask.

1 // pre - c a l c u l a t e d m a s k c o e f f i c i e n t s 2 f l o a t m a s k [] = { 0 . 0 1 8 3 1 6 , ... }; 3 4 // f i l t e r m a s k 5 Mask<float> C M a s k (4* s i g m a _ d +1 , 4* s i g m a _ d +1) ; 6 C M a s k = m a s k ;

Listing 2.6: Usage of a Mask to store the filter mask for a local operator. Listing 2.7 shows the usage of the Mask within the kernel of the BilateralFilter: using the indices xf and yf, the precalculated values are retrieved from the closeness filter mask. Note that also the calculation of c_d is not necessary anymore.

1 v o i d k e r n e l () { 2 f l o a t c_r = 1.0 f / ( 2 . 0 f * s i g m a _ r * s i g m a _ r ) ; 3 f l o a t d = 0.0 f , p = 0.0 f , s = 0.0 f ; 4 5 for (int yf = -2* s i g m a _ d ; yf < = 2 * s i g m a _ d ; yf ++) { 6 for (int xf = -2* s i g m a _ d ; xf < = 2 * s i g m a _ d ; xf ++) { 7 f l o a t d i f f = I n p u t ( xf , yf ) - I n p u t () ; 8 9 s = exp ( - c_r * d i f f * d i f f ) ; 10 c = C M a s k (2* s i g m a _ d + xf , 2* s i g m a _ d + yf ) ; 11 d += s * c ; 12 p += s * c * I n p u t ( xf , yf ) ; 13 } 14 } 15 o u t p u t () = p / d ; 16 }

Listing 2.7: Using a Mask within the kernel to retrieve the filter mask coefficients for a local operator.

(35)

3 Domain-specific

Source-to-Source Compilation

for Medical Imaging

This chapter describes the transformations applied by the proposed source-to-source compiler and the steps taken to create Compute Unified Device Architecture (CUDA) and Open Compute Language (OpenCL) code from a high-level description of local operators and access metadata. Based on the latest Clang/LLVM compiler frame-work [Cla11,LA04], the proposed source-to-source compiler uses the Clang frontend for C/C++ to parse the input files and to generate an Abstract Syntax Tree (AST) representation of the source code. The backend uses this AST representation to apply transformations and to generate host and device code in CUDA or OpenCL.

3.1 Source-to-Source Compiler

This section describes the design of the proposed source-to-source compiler and the single steps taken to create CUDA and OpenCL code from a high-level description of image objects, iteration space objects, and kernel objects.

3.1.1 Kernel Code

The compiler creates the kernel code AST in multiple steps.

First, the kernel declaration is created. The kernel parameters are identified from the Kernel class constructor. Each variable, reference, or pointer has to be initialized in the constructor of the Kernel class and a corresponding kernel parameter is added to the declaration. In doing so, references to Accessor objects are replaced by global memory pointers to the pixel type. Some additional parameters such as the image width/height are added for index calculations and for future uses like border handling.

Second, the kernel body is created from the kernel method of the class. To get an AST for the kernel body, the original AST is copied with certain AST nodes replaced. References to Image objects are replaced with references to corresponding arrays. The compiler adds statements at the beginning of the kernel that calculate the pixel location from the thread index and block index in CUDA or the global indices in OpenCL. Similarly, each class member expression—access to a member variable of the kernel class—is translated to a reference to the corresponding kernel

(36)

3 Domain-specific Source-to-Source Compilation for Medical Imaging

function parameter. After the translation, an AST representation is obtained that can be used for further transformations.

After traversal, the AST is pretty printed and stored to a file. During pretty printing, CUDA and OpenCL specific function and variable qualifiers are emitted. For example, the __global__ qualifier in CUDA and the __kernel qualifier in OpenCL are emitted for entry functions.

3.1.2 Host Code

Unlike for device code, no AST representation for host code is created. Rather, Clang’s Rewriter functionality is used to change the textual representation of AST nodes, whilst leaving the nodes intact.

To invoke previously generated device kernels, the framework code gets translated into corresponding CUDA or OpenCL Application Programming Interface (API) calls as follows:

• Image declarations (line 8 and 9): Get translated into device memory alloca-tion using cudaMalloc or clCreateBuffer.

• Memory assignments (line 12): Get translated into memory transfers using cudaMemcpy or clEnqueueWriteBuffer.

• IterationSpace declaration (line 15): Defines the kernel execution configura-tion.

• Kernel declaration (line 18): Gets translated into loading the kernel source. For CUDA, this step is not required. For OpenCL, the kernel source is loaded from a file, an OpenCL program for the loaded source is created, and the kernel is compiled.

• Kernel execution (line 21): Gets translated into launching the kernel, using the execution configuration obtained from the corresponding IterationSpace declaration.

• Memory assignments (line 24): Get translated into memory transfers using cudaMemcpy or clEnqueueReadBuffer.

In addition to the above changes, further changes are required to get proper CUDA or OpenCL code. First of all, include directives for the CUDA or OpenCL headers are added. In addition, the CUDA kernel sources are included at the beginning of the file. To initialize the runtime, the corresponding functionality is added at the beginning of the main function. In particular, for OpenCL this initialization is an important part since it sets up the platform and device to be used for execution. After these changes, the generated host and device files can be compiled.

(37)

3.2 Point Operators

3.1.3 Padding Support

To avoid conflicts in accessing image pixels in global memory, the source-to-source compiler adds padding to allocated host and device memory. For host code, special memory allocation functions are required to allocate memory so that each image row is padded to get the desired data alignment. For device code, array index cal-culations are changed to take padding into account. The compiler handles padding automatically, given the desired alignment amount for the target device. Listing 3.1 shows the adjusted index calculation and an iteration space check to assure only pixels within the image are processed.

1 // a d d e d i t e r a t i o n s p a c e c h e c k for p a d d e d i m a g e s k e r n e l 2 if ( g i d _ x < w i d t h ) { 3 ... 4 } 5 6 // read - a c c e s s to A c c e s s o r 7 IMG () 8 // C U D A / O p e n C L r e a d / w r i t e w it h s t r i d e 9 img [ g i d _ x + g i d _ y * s t r i d e ]

Listing 3.1: Padding images for coalesced memory access.

3.2 Point Operators

To illustrate the image processing framework for point operators, a grayscale vertical mean filter is considered.

3.2.1 Example: Vertical Mean Filter

The vertical mean image filter, calculates the average of D input column pixels for the output pixel with coordinates (x, y):

Ox,y = 1 D D−1 X k=0 Ix,y+k, where 0≤ x < W, 0 ≤ y < H − D. (3.1) • I is an input image of W × H pixels;

• O is an output image of W × H pixels;

• D is the diameter of the filter, that is, the number of input pixels over which the mean is computed (typically, D H).

Figure 3.1 shows the vertical mean filter on an image.

To express this filter in the proposed framework, the programmer derives a class from the built-in Kernel class and implements the virtual kernel function, as shown in Listing 3.2. To access the pixels of an input image, the parenthesis operator ()

(38)

3 Domain-specific Source-to-Source Compilation for Medical Imaging

Figure 3.1: Vertical mean filtering: the average value of D input column pixels is used to calculate one output pixel.

is used, taking the column (dx) and row (dy) offsets as optional parameters. The output image as specified by the Iteration Space is accessed using the output() method provided by the built-in Kernel class. The user instantiates the class with input image accessors, one iteration space, and other parameters that are member variables of the class.

1 c l a s s V e r t i c a l M e a n F i l t e r : p u b l i c K e r n e l { 2 p r i v a t e: 3 A c c e s s o r<float> & I n p u t ; 4 int d ; 5 6 p u b l i c:

7 V e r t i c a l M e a n F i l t e r (I t e r a t i o n S p a c e<float> & IS , A c c e s s o r<float> & Input , int d ) : 8 K e r n e l( IS ) , I n p u t ( I n p u t ) , d ( d ) 9 { 10 a d d A c c e s s o r (& I n p u t ) ; 11 } 12 13 v o i d k e r n e l () { 14 f l o a t sum = 0.0 f ; 15 16 for (int k =0; k < d ; ++ k ) { 17 sum += I n p u t (0 , k ) ; 18 } 19 20 o u t p u t () = sum /(f l o a t) d ; 21 } 22 };

Listing 3.2: The vertical mean filter expressed in the framework.

In Listing 3.3, the input and output Image objects IN and OUT are defined as two-dimensional W × H grayscale images, having pixels represented as floating-point numbers (lines 8–9). The Image object IN is initialized with the host_in pointer to a plain C array with raw image data, which invokes the = operator of the Image

SPES. Software Platform Embedded Systems 2020

SPES

Software Platform Embedded Systems 2020

- Beschreibung der Fallstudie „Programmierverfahren zur

Generierung parallelen echtzeitfähigen Codes in der

diagnostischen Bildverarbeitung“

-Phase 2

Deliverable D_MT_5.1_1

Version 1.0

Projektbezeichnung SPES 2020

Verantwortlich

Richard Membarth

QS-Verantwortlich

Mario Körner, Frank Hannig

Erstellt am

25. Januar 2012

Freigabestatus

Vertraulich für Partner

Projektöffentlich

x Öffentlich

Bearbeitungsstatus

in Bearbeitung

vorgelegt

x fertig gestellt

Weitere Produktinformationen

Änderungsverzeichnis

Prüfverzeichnis

Contents

Abstract

1 Introduction

1.1 Motivation

1.2 On the Democratization of Parallel Computing

1.2.1 From GPGPU to GPU Computing

1.2.2 GPU Architecture

Example: Fermi Architecture

1.2.3 GPU Programming

1.2.4 GPU Performance Considerations

2 Embedded Domain-specific

Language for Medical Imaging

2.1 Image Processing Characteristics in Medical

Imaging

2.1.1 Image Operator Classification

2.1.2 Application of Image Operators

2.2 Domain-Specific Description for Image

Processing Kernels

2.2.1 Point Operators

2.2.2 Local Operators

3 Domain-specific

Source-to-Source Compilation

for Medical Imaging

3.1 Source-to-Source Compiler

3.1.1 Kernel Code

3.1.2 Host Code

3.1.3 Padding Support

3.2 Point Operators

3.2.1 Example: Vertical Mean Filter