HIGH PERFORMANCE FOURIER VOLUME RENDERING ON GRAPHICS PROCESSING UNITS (GPUS)


HIGH PERFORMANCE

FOURIER VOLUME RENDERING

ON GRAPHICS PROCESSING UNITS (GPUS)

By

Marwan Mohamed Ahmed Abdellah

Systems & Biomedical Engineering Department Faculty of Engineering, Cairo University

A thesis submitted to the

Faculty of Engineering, Cairo University in partial fulfillment of the requirements for the degree of

MASTER OF SCIENCE in

SYSTEMS & BIOMEDICAL ENGINEERING

FACULTY OF ENGINEERING CAIRO UNIVERSITY

GIZA, EGYPT 2012


Under the supervision of

Assoc. Prof. Ayman El-Dieb

Assoc. Prof. Amr Shaarawi

Systems & Biomedical Engineering Department Faculty of Engineering, Cairo University


Approved by the Examining Committee

Prof. Dr. Yasser Kadah, Member

Prof. Dr. Mohamed El-Adawy, Member

Assoc. Prof. Dr. Ayman El-Dieb, Main Advisor

Assoc. Prof. Dr. Amr Shaarawi, Thesis Advisor


Engineer : Marwan Mohamed Ahmed Abdellah

Date of Birth : 5 / 7 / 1987

Nationality : Egyptian

E-mail : [email protected]

Phone : +2 0100 27 51 829

Address : No. 4, Muhammad Kasem St., Maadi, Cairo, Egypt.

Registration Date : 1 / 10 / 2009

Awarding Date : / /

Degree : Master of Science (M.Sc.)

Department : Systems & Biomedical Engineering Department

Supervisors : Assoc. Prof. Dr. Ayman M. Eldieb

Assoc. Prof. Dr. Amr A. Shaarawi

Examiners : Prof. Dr. Yasser Mostafa Kadah

Prof. Dr. Mohamed Ibrahim Eladawy (Faculty of Engineering – Helwan University)

Assoc. Prof. Dr. Ayman M. Eldieb

Assoc. Prof. Dr. Amr A. Shaarawi

Title of Thesis :

High Performance Fourier Volume Rendering on Graphics Processing Units (GPUs)

Key Words :

Fourier Volume Rendering, Medical Image Reconstruction, Projection-slice Theory, GPU Computing, CUDA.

Summary :

The past years have seen tremendous advances in volume visualization techniques, which have been used broadly in medical imaging. In particular, volume rendering techniques have received considerable attention in this area. Although spatial-domain volume rendering has gained wide acceptance among scientists and physicians, this category of rendering techniques is subject to several constraints due to its O(N^3) time complexity, which limits its usability in several respects. Fourier Volume Rendering (FVR) is an alternative technique that operates on the frequency spectrum of the volume and, relying on the projection-slice theorem, has a lower time complexity of order O(N^2 log N). This technique generates attenuation-only renderings, or projections, of volumetric data that look like X-ray radiographs. It has been used extensively in digital radiography. In this work, a high-performance, pure GPU-accelerated implementation of the Fourier volume rendering pipeline is proposed; by mapping the entire pipeline to the GPU, it achieves a 30X speedup over a hybrid implementation.


DECLARATION

I, Marwan Abd Ellah, hereby declare that all information in this document has been obtained and presented in accordance with academic rules and ethical conduct. I also declare that, as required by these rules and conduct, I have fully cited and referenced all material and results that are not original to this work.

Marwan Abd Ellah

Date


ACKNOWLEDGEMENTS

This work would not have been possible without the invaluable support, advice, and encouragement of my dear supervisors. I am honored to present my special thanks and deepest gratitude to Dr. Ayman El Deib and Dr. Amr Sharawi for their guidance and insightful feedback throughout this project.

I would also like to thank my professor Dr. Yasser Kadah for his outstanding Medical Image Reconstruction course, which laid the fundamentals of image reconstruction in general and Fourier volume rendering in particular, and which motivated me to investigate further and end up with this work.

I would likewise like to thank Dr. Stefano Cozzini for accepting me to attend his advanced school in High Performance Computing, held at the Abdus Salam International Centre for Theoretical Physics (ICTP) in Italy. It was a truly valuable and unforgettable experience.


ABSTRACT

The past several years have seen tremendous advances in volume visualization techniques, which have been used broadly in medical imaging. In particular, volume rendering has received considerable attention in this area. Although spatial-domain volume rendering has gained wide acceptance among scientists and physicians, this category of rendering techniques is subject to diverse constraints due to its O(N^3) time complexity, which limits its usability in several respects. Fourier Volume Rendering (FVR) is an alternative technique that operates on the frequency spectrum of the volume and, relying on the projection-slice theorem, has a lower time complexity of order O(N^2 log N). This technique generates attenuation-only renderings, or projections, of volumetric data that look like X-ray radiographs. It has been used extensively in digital radiography. In this work, a high-performance, pure GPU-accelerated implementation of the Fourier volume rendering pipeline is proposed; by mapping the entire pipeline to the GPU, it achieves a 30X speedup over a hybrid implementation.

Keywords: Fourier Volume Rendering, Medical Image Reconstruction, Projection-slice Theory, GPU Computing, CUDA.


PREFACE

In this work, an in-depth investigation is carried out to achieve a high-performance implementation of the Fourier volume rendering pipeline on the GPU. In particular, it considers CUDA-enabled GPUs as high-performance computing architectures that can boost the performance of data-parallel algorithms, a class that suits our problem well.

First, Chapter 1, Introduction, presents the volume visualization techniques that are widely used in the medical arena. It concentrates mainly on volume rendering as a scientific tool for exploring the internal structures of volumetric objects. It then focuses on frequency-domain volume rendering as an alternative to spatial-domain algorithms that reduces the rendering time complexity to the order of O(N^2 log N). Afterwards, we summarize the previous work in this area and our contribution.
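To give a rough feel for the two complexity orders quoted above, the following worked comparison counts operations for a volume of side N = 256. It is only an illustration: constant factors are ignored, and the base-2 logarithm is an assumption made here for concreteness.

```latex
\begin{align*}
N^3 &= 256^3 = 16\,777\,216, \\
N^2 \log_2 N &= 256^2 \times 8 = 524\,288, \\
\frac{N^3}{N^2 \log_2 N} &= \frac{N}{\log_2 N} = \frac{256}{8} = 32 .
\end{align*}
```

Under these assumptions the asymptotic gap grows as N / log N, so it widens as the volume gets larger.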

Chapter 2, Theory Behind Frequency Domain Volume Rendering, aims at providing a gentle introduction to the theories relevant to frequency-domain volume rendering. Sampling theory, the Fourier transform, the Hartley transform, and the projection-slice theorem are briefly discussed to set the stage for the chapters that follow.
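As a small illustration of the projection-slice idea mentioned above, the sketch below checks its discrete 2D analogue: the 1D DFT of a projection (the image summed along y) equals the ky = 0 line of the image's 2D DFT. This is not the thesis implementation. The function names and the test image are hypothetical, and a naive O(N^2) DFT written in plain Python stands in for the FFT so that the example needs no external libraries.

```python
import cmath


def dft_1d(samples):
    """Naive O(N^2) discrete Fourier transform of a 1D sequence."""
    n = len(samples)
    return [
        sum(samples[x] * cmath.exp(-2j * cmath.pi * k * x / n) for x in range(n))
        for k in range(n)
    ]


def dft_2d(image):
    """2D DFT via separability: transform each row, then each column.

    The result is indexed as spectrum[ky][kx].
    """
    n_rows, n_cols = len(image), len(image[0])
    row_spectra = [dft_1d(row) for row in image]
    spectrum = [[0j] * n_cols for _ in range(n_rows)]
    for kx in range(n_cols):
        column = dft_1d([row_spectra[y][kx] for y in range(n_rows)])
        for ky in range(n_rows):
            spectrum[ky][kx] = column[ky]
    return spectrum


def project_along_y(image):
    """Attenuation-only style projection: sum every column of the image over y."""
    return [sum(row[x] for row in image) for x in range(len(image[0]))]


def max_slice_error(image):
    """Largest difference between DFT(projection) and the ky = 0 spectrum line."""
    projection_spectrum = dft_1d(project_along_y(image))
    central_slice = dft_2d(image)[0]
    return max(abs(a - b) for a, b in zip(projection_spectrum, central_slice))


# Hypothetical 4 x 4 test image, indexed as image[y][x].
image = [
    [0, 1, 2, 0],
    [0, 3, 4, 0],
    [1, 0, 0, 2],
    [0, 0, 5, 0],
]

print(project_along_y(image))
print(max_slice_error(image))
```

The same relation in 3D is what lets a projection image be obtained from a 2D slice of the volume's spectrum instead of by integrating through the whole volume.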

High Performance Computing, as we understand it, deals with the implementation of an algorithm and the hardware it runs on. As a research tool, however, it demands at least a basic understanding of several disciplines, concepts, and methodologies, ranging from algorithms and computer programming to software and hardware architectures. In Chapter 3, High Performance Computing on Graphics Processing Units, we explain how the evolution of GPUs has turned them into high-performance platforms relying on their massively parallel architecture. Special attention is given to the CUDA architecture. Although we tried to keep this chapter both comprehensive and concise, the temptation to cover everything is overwhelming, and the reader is assumed to have some familiarity with programming and high-level computer architecture.

In Chapter 4, Algorithm & Implementation, the Fourier volume rendering algorithm is presented and demystified for the reader. This chapter is intended to summarize the Fourier volume rendering pipeline. It starts with a general description at a level independent of any specific architecture and then moves towards the strategy adopted to improve the performance of the GPU-accelerated implementation. It is the author’s belief that a good understanding of the implementation aspects of this algorithm will convey the significance of the achieved results.

In Chapter 5, Results, we discuss the reconstruction and performance benchmarking results of both the naive implementation and our proposed one, which is executed entirely on the GPU.

In Chapter 6, Conclusion & Future Work, we wrap up and conclude what has been presented in this thesis, followed by some future work that might be undertaken either by us or by future researchers working in the same area.


ACRONYMS

1D One-Dimensional

2D Two-Dimensional

3D Three-Dimensional

ALU Arithmetic Logic Unit

APIs Application Programming Interfaces

BO Buffer Object

Cg C for Graphics

CPU Central Processing Unit

CT Computed Tomography

CUDA Compute Unified Device Architecture

CUFFT CUDA FFT


DHT Discrete Hartley Transform

DRR Digitally Reconstructed Radiograph

ECC Error-Correcting Code

FBO Frame Buffer Object

FFT Fast Fourier Transform

FFTW Fastest Fourier Transform in the West

FHT Fast Hartley Transform

FVR Fourier Volume Rendering

GLSL OpenGL Shading Language

GPGPU General-Purpose computing on Graphics Processing Units

GPU Graphics Processing Unit

HPC High Performance Computing

MRI Magnetic Resonance Imaging

OpenCL Open Computing Language

OpenGL Open Graphics Library

PBO Pixel Buffer Object

SIMT Single Instruction Multiple Thread

SM Streaming Multiprocessor

SP Streaming Processor

TP Thread Processor

VBO Vertex Buffer Object


CONTENTS

1 INTRODUCTION
1.1 Medical Visualization
1.2 Volume Rendering
1.3 Frequency Domain Volume Rendering
1.4 Previous Work
1.5 Contribution & Thesis Objectives

2 THEORY BEHIND FREQUENCY DOMAIN VOLUME RENDERING
2.1 Notation
2.2 Special Functions
2.2.1 Delta Dirac
2.2.2 Shah Function
2.2.3 Sinc Function
2.2.4 Rect Function
2.3 Sampling Theory
2.3.1 Nyquist – Shannon Sampling Theorem
2.4 Fourier Transform
2.4.1 Transform Pair
2.4.2 Properties of Fourier Transform
2.4.3 Multi-Dimensional Fourier Transform
2.4.3.1 2D Fourier Transform
2.4.3.2 3D Fourier Transform
2.4.4 Separability Theorem
2.4.5 Convolution Theorem
2.4.6 Discrete Fourier Transform
2.4.7 Fast Fourier Transform
2.5 Hartley Transform
2.5.1 Definition
2.5.2 Discrete Hartley Transform
2.5.3 Pros & Cons
2.6 Projection-Slice Theory
2.6.1 Definition
2.6.2 Proof

3 HIGH PERFORMANCE COMPUTING ON GRAPHICS PROCESSING UNITS (GPUS)
3.1 High Performance Computing
3.2 The Era of GPU Computing
3.2.1 GPGPU & GPU Computing
3.2.2 GPU Architecture Evolution
3.3 CPU & GPU In Comparison
3.4 Heterogeneous Computing Model
3.5 Compute Unified Device Architecture
3.5.1 Understanding CUDA Architecture
3.5.2 CUDA Programming Model
3.5.3 Threading Hierarchy
3.5.4 Memory Model
3.5.4.1 Global Memory
3.5.4.2 Shared Memory
3.5.4.3 Register Memory
3.5.4.4 Local Memory
3.5.4.5 Constant Memory
3.5.4.6 Texture Memory
3.5.5 Execution Model
3.5.6 CUDA Software Programming Environment
3.5.7 CUDA Computing Architecture
3.5.8 Limitations of CUDA
3.6 GPU Contexts
3.7 FFT on GPU

4 ALGORITHM & IMPLEMENTATION
4.1 Objective & Flow
4.2 Algorithm
4.3 Implementation Strategy
4.4 The Naive Hybrid Approach
4.4.1 Analyzing the Naive Algorithm
4.4.2 Naive Algorithm Bottlenecks
4.4.3 Suppressing Multidimensional Arrays
4.5 Algorithm Mapping to the GPU
4.5.1 CUDA Kernels
4.5.2 FVR Pipeline on GPU
4.5.3 Mapping Analysis

5 RESULTS
5.1 Volume Reconstruction Results
5.2 Benchmarking Results
5.2.1 Eliminating Multi-Dimensional Arrays
5.2.2 Mapping Computational Context to GPU

6 CONCLUSION & FUTURE WORK
6.1 Conclusion
6.2 Future Work


LIST OF FIGURES

1.1 Computer-generated Rendering for a Skull Dataset, reference: Wikipedia
1.2 Surface Rendering of a Head Dataset, reference: Wikipedia
1.3 The Process of Volume Rendering a Tooth, reference: GPU Gems
1.4 High Definition Volume Rendering for a Skull with Volume Ray-Casting, reference: Wikipedia
1.5 Mouse Skull (CT) Rendering using the Shear Warp Algorithm, reference: Wikipedia
1.6 Example of Rendering CT Data (Visible Male Dataset) using the Fourier Volume Rendering Algorithm
1.7 A Projection of the Foot Dataset Reconstructed using the Fourier Volume Rendering Algorithm

2.1 Continuous Dirac Delta Function δ(t)
2.2 Kronecker Delta Function δ[n]
2.3 The Shah Function Ш(t)
2.4 The Sinc Function sinc(t)
2.5 The Rect or Box Function Π(t)
2.6 Sampling Process - Time Domain is on Left, and Frequency Domain is on Right.
2.7 Aliasing
2.8 Hamming Window & its Frequency Response
2.9 Projecting 3D Volume to a 2D X-ray like Image
2.10 Graphical illustration of the projection-slice theory in two dimensions. f(x, y) and F(kx, ky) are two-dimensional Fourier transform pairs, p(x) is the projection of f(x, y) on the x axis, and s(kx) is the projection slice of p(x) in the frequency domain.

3.1 High Performance Computing Interdisciplinarity
3.2 Memory Bandwidth Improvements for CPU & GPU [72]
3.3 Single & Double Precision Floating-Point Operations Per Second for CPU & GPU [72]
3.4 The GeForce 7800 architecture with 3 kinds of Programmable Engines (Courtesy of NVIDIA)
3.5 The G80 GPU with Unified Shader Architecture
3.6 CPU & GPU Computing Architectures in Comparison, GPU Devotes More Transistors to Data Processing
3.7 Heterogeneous Computing Model with CPU & GPU
3.8 Problem Decomposition for Serial Parts to be executed on CPU & Parallel Parts to be executed on the GPU
3.9 Block Diagram for CUDA Stream Multiprocessor (SM)
3.10 Three-Dimensional Blocks of Two-Dimensional Grids
3.11 CUDA Memory Model
3.12 Executing a Kernel Grid on two different GPUs
3.13 Executing Two Different Kernel Grids on the GPU, (Courtesy of NVIDIA)
3.14 Thread Index Calculations with 1D Grid & 1D Blocks
3.15 NVIDIA Compilation Process
3.16 CUDA Framework Architecture
3.17 GT200 GPU Architecture
3.18 Fermi Architecture Block Diagram
3.19 Fermi SM Architecture
3.20 CUDA Interoperability with OpenGL

4.2 FVR Pipeline
4.3 FVR Pipeline is Divided into Preprocessing Stage & Rendering Loop. The Rendering Loop is Executed 3 Times to Generate 3 Different Projections for the Same Volume.
4.4 Naive Hybrid Implementation for the FVR Pipeline
4.5 3D Wrapping-Around with 3D Arrays for Real Data
4.6 3D Wrapping-Around with 3D Arrays for Complex Data
4.7 Repacking the Complex Spectrum from FFTW Array into 1D Array Compatible with OpenGL 3D Texture
4.8 A Block Diagram Illustrating the Execution of the OpenGL Off-Screen Context
4.9 2D Wrapping-Around Involving 2D Arrays
4.10 Rendering the Projection Image
4.11 Eliminating 2D Arrays from the FVR Pipeline
4.12 Eliminating 3D Arrays from the FVR Pipeline
4.13 FVR Pipeline on GPU
4.14 Linking OpenGL Off-Screen Rendering Context with CUDA Context
4.15 Linking OpenGL Off-Screen Rendering Context with CUDA Context
4.16 Linking OpenGL CUDA Context with OpenGL On-Screen Context

5.1 Sagittal View for Visible Male Dataset (256 x 256 x 256)
5.2 The Central Part of the Visible Male Dataset (128 x 128 x 128)
5.3 Axial View for Visible Male Dataset (256 x 256 x 256)
5.4 Sagittal View for the Skull Dataset (256 x 256 x 256)
5.5 Coronal View for the Skull Dataset (256 x 256 x 256)
5.6 Foot Dataset (256 x 256 x 256)
5.7 Engine Dataset (256 x 256 x 256)
5.8 Bonsai Tree Dataset (256 x 256 x 256)
5.9 Teapot Dataset (128 x 128 x 128)
5.10 Hydrogen Atom Dataset (128 x 128 x 128)
5.11 Nieg Dataset (64 x 64 x 64)
5.12 Tri-Linear Interpolation Scheme
5.13 Nearest-Neighbor Interpolation
5.14 Orthogonal Projection for the Visible Male Dataset
5.15 Oblique Projections for the Visible Male Dataset without & with high order reconstruction filter in A & B respectively.


LIST OF TABLES

3.1 CUDA Memory Types supported by its Memory Model for NVIDIA Quadro FX 4800
3.2 A Table Summarizing the Features of the Three Main CUDA GPU Architectures

5.1 Benchmarking for the 3D Wrapping-Around Operation on CPU with 3D & 1D Arrays
5.2 Benchmarking for the 3D Wrapping-Around Operation on CPU with 3D & 1D Arrays including the Time Consumed During the Replacement of Arrays
5.3 2D Wrapping-Around of Real Data on CPU & GPU
5.4 3D Wrapping-Around Operation of Real Data on CPU & GPU
5.5 3D Wrapping-Around of Complex Data on CPU & GPU
5.6 2D FFT with FFTW & CUFFT Libraries
5.7 3D FFT with FFTW & CUFFT Libraries
5.8 Comparing Performance for a volume of 256
