Real-Time Wavelet Filtering on the GPU


Master of Science in Computer Science

Submission date: May 2007

Supervisor: Anne Cathrine Elster, IDI

Norwegian University of Science and Technology


Problem Description

Specialised graphics hardware, in the form of GPUs, has in recent years seen a rapid increase in both performance and programmability. This has made it interesting as a platform for general-purpose computations.

Ultrasound imaging has for many years been one of the most popular medical diagnostic tools. Compared to X-rays, computed tomography (CT) and magnetic resonance imaging (MRI), ultrasound has the advantages of safety, low cost and interactivity. With the recent introduction of 3D ultrasound, the requirement for data processing has increased tremendously.

In this thesis, we want to study the intersection of these two fields. The objectives are to:

* Evaluate image enhancement techniques that are currently implemented on the CPU to see if it is possible to implement them on the GPU. An important subgoal is to conduct a performance comparison of the two solutions.

* Evaluate image enhancement techniques that are currently not in use, but are made possible with the use of GPU technology.

If the first two objectives are met, other related tasks may be explored.

Assignment given: 15 January 2007

Supervisor: Anne Cathrine Elster, IDI


Abstract

The wavelet transform is used for several applications, including signal enhancement, compression (e.g. JPEG2000) and content analysis (e.g. FBI fingerprinting). Its popularity is due to fast access to high-pass details at various levels of granularity.

In this thesis, we present a novel algorithm for computing the discrete wavelet transform using consumer-level graphics hardware (GPUs). Our motivation for looking at the wavelet transform is to speed up the image enhancement calculation used in ultrasound processing. Ultrasound imaging has for many years been one of the most popular medical diagnostic tools. However, with the recent introduction of 3D ultrasound, the combination of a huge increase in data and a real-time requirement has made fast image enhancement techniques very important.

Our new methods achieve a speedup of up to 30 times compared to SIMD-optimised CPU-based implementations. They are also up to three times faster than earlier proposed GPU implementations. The speedup was made possible by analysing the underlying hardware and tailoring the algorithms to fit the GPU better than has been done earlier. For example, we avoid the lookup tables and dependent texture fetches that slowed down the earlier efforts. In addition, we use advanced GPU features like multiple render targets and texture source mirroring to minimise the number of texture fetches.

We also show that by using the GPU, it is possible to offload the CPU, reducing its load from 29% to 1%. This is especially interesting for cardiac ultrasound scanners, since they have a real-time requirement of up to 50 fps. The wavelet method developed in this thesis is so successful that GE Healthcare is including it in their next generation of cardiac ultrasound scanners, which will be released later this year.

With our proposed method, high-definition television (HDTV) denoising and other data-intensive wavelet filtering applications can be done in real time.


Acknowledgments

I would like to extend my gratitude to Dr. Anne C. Elster for being my primary supervisor during this thesis. Without her advice and high standards, this work would have ended as a collection of incoherent ideas. Gratitude is also extended to her Ph.D. student Thorvald Natvig, who provided great help during the formulation of the theoretical models in the thesis.

The help from Dr. Sevald Berg, Dr. Stein Inge Rabben, and especially my co-advisor Dr. Erik Steen, all from GE Healthcare, has been invaluable during this project.

Great gratitude goes to Jørgen Braseth and Elisabeth Dornish for having the patience to proof-read the thesis and correct my many spelling errors. Also, I would like to thank the people at ’Ugle’ computer lab for many coffee breaks and interesting discussions that made the thesis work worthwhile.


Contents

1 Introduction

1.1 Motivation

1.2 Thesis outline

2 Background

2.1 Graphics Processing Unit

2.1.1 The what and why, CPU vs. GPU

2.1.2 Stream Programming Model

2.1.3 The Graphics Pipeline

2.1.4 General-Purpose computation on GPUs (GPGPU)

2.1.5 Suitable applications for the GPU

2.2 Ultrasound

2.2.1 Ultrasound noise formulation

2.2.2 Ultrasound image restoration

2.3 Wavelets

2.3.1 Wavelets

2.3.2 Continuous wavelet transform

2.3.3 Multiresolution Analysis

2.3.4 Discrete Wavelet Transform

2.4 Filters

2.4.1 Gaussian blur

2.4.2 Edge enhancement

2.4.3 Wavelet spatial filters

3.2 DWT on the GPU

3.2.1 Forward Discrete Wavelet Transform

3.2.2 Backward Discrete Wavelet Transform

3.2.3 Single channel version

3.3 Spatial filters

3.3.1 Gaussian blur

3.3.2 Edge enhancement

3.3.3 Soft thresholding

3.3.4 Hard thresholding

3.4 Theoretic GPU Performance model

3.4.1 Example graphic card

3.4.2 Comparison with previous implementations

4 Results

4.1 Testing environment

4.2 Benchmarking

4.2.1 Raw processing

4.2.2 Further analysis

4.3 Discussion

5 Conclusions and Future Work

5.1 Conclusion

5.2 Future work

Bibliography

A Implementation

A.1 GPUVideoProcessor

A.2 Cmdgpuwavelet

B GPU microprograms

B.1 Daubechies 9/7 Mono DWT

B.2 Daubechies 9/7 Colour DWT

B.3 Spatial filters


List of Tables

2.1 Comparison of high-end CPUs and GPUs on some key figures.

3.1 GeForce 8800 GTX performance numbers

3.2 The number of fragment instructions used by the GPU implementations.

3.3 Summary of variables, GeForce 8800

4.1 The GPU specifications used in the test.

4.2 Performance comparison of different wavelet algorithms, measuring the number of elements processed per microsecond. The bold-faced values indicate the largest number of elements processed.

4.3 Comparison of different wavelet algorithms, showing the speedups of the methods relative to the single-CPU version

4.4 Comparison of different wavelet algorithms, showing the number of frames rendered per second. Bold indicates the highest fps.

4.5 Breakdown of walltime for GPU Monochrome 16-bit

4.6 Load on the CPU while rendering 20 fps.

4.7 The efficiency of the mono versions using the colour versions as reference.

4.8 Performance comparison of different wavelet algorithms, measuring the number of elements processed per microsecond. GPU used: GeForce 7600 GS

4.9 Relative speedups using GeForce 7600 GS

4.10 Performance from Table 4.2 and 4.8 adjusted for price.


List of Figures

2.1 Performance timeline, GPU vs. CPU.

2.2 The relative number of transistors used for ALU and cache by a CPU and a GPU.

2.3 The graphics pipeline as a stream model.

2.4 The new programmable graphics pipeline.

2.5 The workflow for an image filter

2.6 Interpolation example

2.7 GPU memory overview

2.8 Different Ultrasound modes

2.9 Examples of different wavelets

2.10 Time-Frequency resolution

2.11 An example with 3 levels DWT

2.12 2D Forward DWT

2.13 Multiple levels of 2D DWT

2.14 Gauss function

2.15 Example of image filtering

2.16 Figure showing where filtering may be applied between forward and backward DWT.

2.17 Thresholding functions

2.18 Example of too much hard thresholding

3.1 Schematic overview of the filter rendering process.

3.2 Explanation of different filter kernels.

3.3 Schematic overview of the DWT process.

3.4 Forward DWT addressing

3.8 Static branch resolution by drawing quads.

3.9 Filter kernels

3.10 Soft thresholding function explained

3.11 Hard thresholding function explained

3.12 A model of the system, including GPU, CPU and memory

4.1 The number of elements processed by the different methods per ms, calculated for different algorithms and data sizes.

4.2 Breakdown of walltime GPU Monochrome 16 bit on a log-log scale

4.3 Comparison of CPU load on 512² elements.

4.4 The performance of 16-bit computation relative to size.

4.5 Comparison of two different enhancement techniques

4.6 Comparison of 16-bit and 32-bit precision

A.1 The different wavelet visual output options


Chapter 1

Introduction

‘Begin at the beginning,’ the King said, very gravely, ‘and go on till you come to the end: then stop.’

- Lewis Carroll

This chapter provides the motivation for and the outline of this thesis. The intended audience is professionals in the field of medical visualisation. Experience with computer visualisation is not required, but is recommended for a thorough understanding of the thesis.

1.1 Motivation

When Röntgen discovered X-rays in 1895, he triggered a medical revolution. Today, a modern hospital without X-ray equipment such as Computed Tomography (CT) is unimaginable. Other methods, like medical ultrasound (1956) and Magnetic Resonance Imaging (1967), have since been developed. A more recent breakthrough (1980s) is the use of computers to process and enhance the images to facilitate viewing. Image enhancement drastically improves the clinical value of the equipment and is therefore ubiquitous today. As the physical imaging equipment has improved over the years, more and more data needs to be processed. At the same time, practitioners require better quality and real-time access to the data. Ultrasound has been an especially popular choice for real-time and portable imaging, mainly because of its non-invasive nature and small size.

Medical equipment used to be composed of special-purpose computer parts designed for the task. In recent years, general-purpose hardware has increasingly replaced these parts. The rationale is twofold. First, it lets the manufacturer concentrate its efforts on the software. Second, general-purpose hardware is usually much cheaper than special-purpose hardware because it is mass produced. In fact, a large share of the hardware in a modern ultrasound scanner is the same as the hardware that resides in a computer under an office desk.

The ever-increasing amount of medical data, combined with the use of general-purpose hardware, makes the task of fully utilising this hardware very important. The need for more computing power easily outgrows the hardware, so the algorithms have to be tuned to a compromise between quality and speed. This problem is not new, and much effort has been put into investigating algorithms and methods for the x86, Intel's dominant processor architecture. Such efforts are still important and will continue to be important in the future. However, there are other ways to increase the computational power of the computer system.


One component that has gone almost unnoticed so far is the graphics processor (GPU). The explanation for this oversight is simple: until recently, it could not be programmed at all. However, the gaming industry has driven the development forward, and today's GPUs are not just programmable, they are also much more powerful than the CPU in terms of raw computing power.

One important image enhancement method used in medical imaging, and for ultrasound specifically, is wavelet filtering. Wavelets are a mathematical tool that makes it possible to decompose images and describe them in terms of a coarse approximation and several levels of detail. It is often said that by using wavelets it is possible to 'see the forest and the trees'. In this thesis, we will investigate the possibility of using the GPU for wavelet image filtering in the context of medical ultrasound. The main goal is to offload the CPU so that it can be used for other tasks. Another and more ambitious goal is to make the enhancement faster than on the CPU, and to open up possibilities for even more refined methods than the ones used today.

1.2 Thesis outline

This thesis has three main parts. The first part (Chapter 2) contains background information and theory about ultrasound, GPUs and wavelet filters. The second part (Chapter 3) contains a description of the algorithms and methods developed during this thesis. The final part (Chapters 4 and 5) holds the results and a conclusion. In addition to these three parts, an appendix includes practical information about running the programs developed and listings of some of the important parts of our GPU code.


Chapter 2

Background

A dwarf on a giant’s shoulders sees farther of the two

- Didacus Stella

In this chapter, we will present the main theories behind our work. In Section 2.1, we introduce the graphics processor (GPU) and discuss its hardware features and how to program it. We will also explain why we choose to program the GPU. In Section 2.2, we present medical ultrasound, the context of this thesis. Section 2.3 discusses the theoretical background on wavelets. In Section 2.4, the theory behind the filters we have implemented is discussed.

2.1 Graphics Processing Unit

The Graphics Processing Unit (GPU) is a special-purpose computing unit designed and optimised for computation related to 3D graphics rendering. In recent years, the GPU has been transformed from a static pipeline with modest capabilities to a very flexible and powerful computation unit. Owens et al. [1] give an excellent survey of the current state of General-Purpose computation on GPUs (GPGPU). For practical programming information, tips and tricks, GPU Gems [2] is highly recommended. Both resources are used heavily in this section. We first discuss the differences between CPUs and GPUs (Section 2.1.1). Then, in Section 2.1.2, a GPU programming model is discussed. Finally, the programming techniques and GPU architecture are discussed.

2.1.1 The what and why, CPU vs. GPU

The Central Processing Unit (CPU) in a modern PC is composed of millions of transistors, and each year the number increases. In 1965, Gordon Moore, who later co-founded Intel, predicted that the number of transistors that could be fabricated on a single processor die would double every year, a rate he later revised to every other year. This prediction still holds to this day. However, the computational power does not double each year. Approximate numbers from [3] indicate that each year:

• The capability of a processor increases by 71 percent.

• Transistor speed increases by 15 percent.

• Memory bandwidth increases by 25 percent.

• Memory latency improves by 5 percent.

It is easy to see from this list that the computing power increases much faster than the communication ability. This has been a known problem in VLSI design for a long time. Until recently, the solution to this problem has been to add layers of faster memory between the processor and the main memory: we have registers, on-die caches, off-die caches, main memory, disk and tape. This additional caching requires a lot of transistors. On a modern CPU, the computing unit occupies only a small part of the total space on the chip. As an example, the ALU (Arithmetic Logic Unit) on the Itanium 2 occupies 6.5% of the total die space [4]. In addition to cache, the processor also uses prefetching to speed up memory access. To make sure that the correct memory is prefetched, additional transistors have to be used to implement techniques like branch prediction, instruction snooping, etc. For a general-purpose processor like the CPU, spending the transistor budget on the methods mentioned above has been a successful tactic for the last decades. However, a major drawback of this tactic is that it is inherently optimised for single-threaded programs with no parallelism.
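To put the growth rates listed above in perspective, the following minimal Python sketch compounds them over several years. The rates are the ones quoted from [3]; the function names and the ten-year horizon are illustrative assumptions, not part of the thesis.

```python
# Annual growth rates quoted in the text (fractions per year).
RATES = {
    "processor capability": 0.71,
    "transistor speed": 0.15,
    "memory bandwidth": 0.25,
    "memory latency": 0.05,
}


def compound(rate, years):
    """Return the total improvement factor after `years` of annual growth."""
    return (1.0 + rate) ** years


def gap(years):
    """Ratio of compute growth to memory bandwidth growth after `years`."""
    return compound(RATES["processor capability"], years) / compound(
        RATES["memory bandwidth"], years
    )


if __name__ == "__main__":
    for name, rate in RATES.items():
        print(f"{name}: x{compound(rate, 10):.1f} over 10 years")
    print(f"compute/bandwidth gap after 10 years: x{gap(10):.1f}")
```

Under these assumed rates, processor capability outgrows memory bandwidth more than twentyfold in a decade, which is the imbalance that caches and prefetching try to hide.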

For applications that actually have data parallelism, the route that the traditional CPU has taken is not the best option. A good example is media processing. A typical media operation is to execute a small series of computations on each data element in the media stream. For these kinds of operations, both the cache and the extra prefetching are useless. In fact, they may both hurt the execution speed because of cache thrashing. Another similar example of parallel computation is 3D rendering. As 3D applications, and especially games, required more and more computational resources during the 1990s, it became apparent that the CPU would not be able to handle the load by itself. The first modern consumer GPU was born in 1996. By the turn of 2000 almost all consumer PCs contained a GPU, and today it is ubiquitous. A timeline with the computing power of GPUs and CPUs is shown in Figure 2.1.

[Figure: GFLOPS (0 to 350) plotted against year (2001 to 2006) for NVIDIA, ATI and Intel processors.]

Figure 2.1: Performance timeline, GPU vs. CPU. Figure reproduced from [1].

The GPU is a good example of special-purpose hardware that is tailored for one specific task, 3D rendering. In the beginning, 3D rendering was also all it could be used for. As GPUs have evolved they have become more flexible, and it is now possible to program them to do a variety of tasks. The survey in [1] lists a number of applications, such as rigid body simulations, linear algebra solvers, fluid simulations, image segmentation, Fast Fourier Transforms (FFTs), Discrete Cosine Transforms (DCTs) and database searches, to mention a few. Not only is it possible to implement these methods on the GPU; for most of these applications the GPU will outperform the CPU several-fold. How is this possible? As mentioned, one of the main problems with the CPU is that it assumes serial data and that it focuses its transistors on techniques to 'avoid' memory latency. The GPU, on the other hand, assumes massively parallel data and a relatively simple series of computations on this data, so it can concentrate its transistors on the computational part. An illustration of this can be seen in Figure 2.2, taken from a recent NVIDIA document [5].

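[Figure: two die layouts, each attached to DRAM. Panel (a), CPU: large control and cache areas with few ALUs. Panel (b), GPU: most of the area devoted to ALUs.]

Figure 2.2: The relative number of transistors used for ALU and cache by a CPU and a GPU. Figure reproduced from [5].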

This causes the GPU to have much more raw computing power than a similarly sized and priced CPU. In Table 2.1, a number of high-end GPUs and CPUs are compared. Price estimates are gathered from [6, 7], GPU performance estimates from [8] and CPU performance estimates from [1, 9].

Processor               Cost ($)   Bandwidth (GB/s)   GFLOPS
Radeon X1900 XTX        400        51.0               240.0
GeForce 7900 GTX        340        38.4               240.0
GeForce 8800 GTX        599        86.4               345.0
Intel DualCore X6800    999         8.5                50.0

Table 2.1: Comparison of high-end CPUs and GPUs on some key figures.


2.1.2 Stream Programming Model

As we have seen in the previous section, the GPU has massive amounts of raw computational power compared to the CPU. The main reason for this is that it is tailored for 3D graphics programming, which in turn implies that we cannot apply the standard CPU programming models directly to the GPU. Instead, one typically uses a model called the Stream Programming Model, which fits nicely on the GPU [4]. The stream model consists of two key concepts [3]:

• Stream: An ordered set of data consisting of the same data type. Streams can be of any length, but they are usually long. The data type can be arbitrarily complex. A stream can be copied, divided into substreams and operated on by kernels.

• Kernel: A basic operation consisting of a small number of operations. A kernel performs a computation on one or more streams and outputs one or more streams. The kernel outputs are functions only of their inputs. The operation on one element of a stream is never dependent on the computation of another element in the same stream. This is a key property that makes the model easily parallelisable.

Applications are made by chaining multiple kernels together. In Figure 2.3, an example stream model from [3] is shown; the application is the 3D graphics pipeline. There are two kinds of parallelism exposed by this model. First, the data-level parallelism is quite easy to imagine, because each element in a stream is of the same data type and all elements undergo the same operations. Second, since we can chain multiple kernels, this model can be task-parallelised and deeply pipelined.

[Figure: the graphics pipeline drawn as a chain of kernels (vertex program, triangle assembly, clip/cull/viewport, rasterization, composite) connected by streams (vertex, transformed vertex, triangle, screen-space triangle, unprocessed fragment, fragment and pixel streams), with texture memory as input and the framebuffer image as output.]

Figure 2.3: The graphics pipeline as a stream model. Reproduced from [3].
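The stream model described above can be sketched in a few lines of Python. This is a CPU-side illustration only, under the assumption that a stream is a plain list and a kernel is a pure function of its input elements; the names `run_kernel`, `scale`, `add` and `pipeline` are hypothetical and do not come from the thesis code.

```python
def run_kernel(kernel, *streams):
    """Apply `kernel` independently to corresponding elements of the input streams."""
    return [kernel(*elements) for elements in zip(*streams)]


def scale(x):
    """Kernel: multiply each element by a constant gain."""
    return 2.0 * x


def add(x, y):
    """Kernel: combine two input streams element by element."""
    return x + y


def pipeline(a, b):
    """Chain two kernels: the output stream of `scale` feeds `add`."""
    scaled = run_kernel(scale, a)
    return run_kernel(add, scaled, b)
```

Because each output element depends only on the matching input elements, the list comprehension in `run_kernel` could process the elements in any order or all at once, which is the data-level parallelism the model exposes. Chaining `scale` into `add` illustrates the task-level pipelining.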

2.1.3 The Graphics Pipeline

We have seen that a GPU contains massive computing power and that the stream programming model fits nicely to its graphics pipeline. In this section, we will take a closer look at the internal structure of a GPU and how to actually execute kernel programs on a GPU.

The rendering process

The rendering process starts with the application creating a scene description based on polygons. The process of rendering these polygons to the screen is called display traversal [10]. We can break this process into three big parts:

1. Geometry processing

2. Fragment operation

3. Compositing and outputting to buffer

This process used to be fixed. This means each part of the process had a fixed number of steps and features, and the developer could only set some parameters controlling the rendering. These parameters usually controlled:

• Transformation from the local model's 3D coordinates to 2D screen coordinates.

• Lighting and shading. This could be the number of light sources and their colours.

• Some parameters controlling depth culling and other efficiency measures.

The last sub-process would output the end result to the GPU’s frame buffer. The frame buffer is the part of memory on the graphics board that will be rendered to the screen. Recently, however, the fixed function pipeline was replaced with a much more flexible pipeline. Such a pipeline is shown in Figure 2.4.

[Figure: the programmable pipeline. Vertices pass through vertex processing (running the vertex program) to give transformed vertices, primitive assembly to give primitives, clipping/culling and viewport mapping to give screen-space primitives, rasterization to give unprocessed fragments, fragment processing (running the fragment program) to give shaded fragments, and compositing with frame-buffer operations to give pixels.]

Figure 2.4: The new programmable graphics pipeline. Reproduced from [11].

Programmable pipeline

The display traversal in a programmable pipeline is the same as for fixed-function pipelines, but as the name implies, it is now possible to program some parts of the pipeline. This can be done with special microprograms, called shaders in the graphics literature. Two kinds of shaders exist: the fragment shader and the vertex shader. Until quite recently, these microprograms had to be developed in assembly language. However, several special-purpose high-level languages have been developed in the last three years. The three main ones are:

1. OpenGL Shading Language (GLSL), developed by the OpenGL committee.

2. High Level Shading Language (HLSL), which is Microsoft's shading language.

3. C for graphics (Cg), which was developed by NVIDIA.


The three languages are very similar and show strong similarities to C. In this thesis, HLSL will be used. The reason for this choice was the author's previous exposure to the accompanying DirectX framework. The choice of framework and language proved to have very little effect on the thesis work, and porting our methods to another framework should be fairly easy. We will now take a look at the different processing stages and their programmability.

Geometry processing

As mentioned, the scene graph is represented as polygons. The application feeds these polygons to the GPU as vertices. A vertex may contain a number of properties. The obvious ones are position, colour, opacity and texture coordinates. The developer may specify any kind and any number of extra properties by defining a custom vertex declaration. As vertices enter the GPU, the first step is to perform linear operations on these vertices. This is done by a microprogram called the vertex shader. The vertex shader may alter any of the properties of the vertex. The usual operations performed are transformation from model coordinates to screen coordinates and per-vertex lighting. In addition to the vertex properties, the vertex shader may also access per-frame parameters called uniforms. An example of a uniform is the matrix used to transform the vertex into screen space. The vertex shader may also access the texture memory. An important limitation, however, is that vertices may not be deleted or created in the shader. Only modification of the properties is allowed. After being processed by the vertex shader, the vertices are assembled into geometric primitives: points, lines and triangles. The primitives are clipped, culled, mapped to the viewport and rasterised.

Fragment operation

As the primitives are rasterised, the GPU decomposes each primitive into fragments. Each fragment corresponds to a pixel in screen space. For each fragment, a microprogram called the fragment shader is run. The inputs to the fragment shader are interpolated versions of the output values from the vertices contributing to the primitive owning the fragment. Like the vertex shader, the fragment shader may also use uniform variables and texture memory for its operation. In addition, the fragment shader may use dependent texture reads. These are two texture reads where the memory position of the latter depends on the content of the first read.
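A dependent texture read can be illustrated with a small CPU-side Python sketch. It assumes textures are nested lists indexed as `texture[y][x]`, nearest-neighbour sampling and clamped addressing; the helper names `fetch` and `dependent_read` are hypothetical and are not taken from any graphics API.

```python
def fetch(texture, x, y):
    """Nearest-neighbour texture read with coordinates clamped to the edges."""
    height, width = len(texture), len(texture[0])
    xi = min(max(int(x), 0), width - 1)
    yi = min(max(int(y), 0), height - 1)
    return texture[yi][xi]


def dependent_read(index_texture, data_texture, x, y):
    """Read an address from one texture, then use it to read another."""
    u, v = fetch(index_texture, x, y)   # first read yields coordinates
    return fetch(data_texture, u, v)    # second read depends on the first
```

The second fetch cannot be issued before the first has returned, so the hardware cannot predict its address in advance. This is one plausible reason why such reads tend to be slower than reads at interpolated coordinates.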

Compositing and outputting to buffer

After each fragment is processed in the fragment shader, we have a fragment ready to be put on the screen. Still there are some fixed processing options left. These include:

• Culling: The GPU can discard a fragment based on a depth test, scissor test, alpha test or stencil test.

• Blending: The fragment colour can be combined with the colours already in the buffer.

Blending is the act of combining the fragment created in the current rendering pass, the source fragment, with the fragment already present in the output buffer, the destination fragment. Denoting each fragment as a vector $c_i = \{r_i, g_i, b_i, a_i\}$, the fraction of the fragment used in the blend is given by a constant $f_i$, $i \in \{src, dst\}$. This gives us the weighted contributions

$$c'_{src} = f_{src}\, c_{src}, \qquad c'_{dst} = f_{dst}\, c_{dst} \tag{2.1}$$

which are added together to form the output. Any number may be used for the constants, but the most used are:

$$f_{src} = a_{src}, \qquad f_{dst} = 1 - a_{src} \tag{2.2}$$

which is called alpha blending.
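The alpha blending rule above can be sketched on the CPU as follows. The sketch assumes that fragments are RGBA tuples of floats and that the blended result is the sum of the two weighted fragments (the usual additive blend operation); the function names are hypothetical.

```python
def blend(src, dst, f_src, f_dst):
    """Weighted sum of two RGBA fragments, component by component."""
    return tuple(f_src * s + f_dst * d for s, d in zip(src, dst))


def alpha_blend(src, dst):
    """Alpha blending: f_src = a_src and f_dst = 1 - a_src."""
    a_src = src[3]
    return blend(src, dst, a_src, 1.0 - a_src)
```

For example, a half-transparent red source over an opaque blue destination gives an even mix of the two colours, while a fully opaque source replaces the destination entirely.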

If the fragment passes all culling tests and is blended, we can finally render it to the output buffer. Usually the output buffer is the frame buffer, a special memory location on the graphics board that represents the screen. With a programmable pipeline we can also render to texture memory or even to several output buffers at once.

2.1.4 General-Purpose computation on GPUs (GPGPU)

In this section, we will look at how we can employ the GPUs described in the previous section for general purpose computation (GPGPU) using the stream model.

Figure 2.5 shows the workflow of a typical GPGPU program.

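[Figure: flowchart pairing each step of the stream model with its GPU counterpart: init application / init DirectX, set stream data / upload texture, execute kernels / fragment shading, start process / draw quad, save result / render to texture, and display.]

Figure 2.5: The workflow for an image filter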

Streams

The streams of the GPU are textures. These are very similar to the arrays used on the CPU. Each element can contain up to four 32-bit floating point values, called r, g, b and a (red, green, blue and alpha). The names stem from graphics, but they are irrelevant when programming GPGPU. We hence have basically the same as a float[x][4] on the CPU. However, unlike on the CPU, the components can be swizzled without cost. To swizzle is to rearrange or select the components of a single element. In Listing 2.1, a small example of swizzle operations can be seen.

float4 v = float4(1, 2, 3, 4); // a four-component element ("vector" is a reserved type name in HLSL)
float4 same = v.rgba;          // same as just using v: (1, 2, 3, 4)
float4 reds = v.rrrr;          // replicates the first component: (1, 1, 1, 1)
float2 mid  = v.gb;            // selects only components two and three: (2, 3)

Listing 2.1: Swizzle example


Kernels

The GPU has two types of programmable processors, the vertex processor and the fragment processor. For GPGPU the fragment processor is usually the preferred choice. There are two reasons for this. First, GPUs typically have more fragment processors than vertex processors. As an example, the GeForce 7800 GTX has 8 vertex processors and 24 fragment processors. Second, the fragment processor has the ability to write directly to texture memory. The output from the vertex processor, on the other hand, has to travel through the rasteriser and fragment processor, which makes it more difficult to program. Thus, most of the computational work in GPGPU programs is done in the fragment processors, but the vertex processors can perform some auxiliary functions, as we shall see in later chapters.

Invocation

The fragment shaders are executed by drawing geometry. Usually, we draw a quadrilateral with the same number of fragments as the number of elements in the desired output. In Figure 2.6(a), we can see an example quadrilateral. One example fragment is shown with its interpolated texture coordinates. In Figure 2.6(b), a quad with interpolated colours is shown to help visualise the interpolation. By adjusting the texture coordinates in the vertices of the quadrilateral, we can adjust which texture memory each fragment shader invocation uses. We will see examples of this in the later chapters.
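The interpolation of texture coordinates across the quadrilateral can be sketched in Python using the corner values shown in Figure 2.6(a). This is a simplified model that assumes plain linear interpolation over an axis-aligned quad and ignores pixel-centre offsets used by real rasterisers; `lerp` and `interpolate_texcoord` are hypothetical names.

```python
def lerp(a, b, t):
    """Linear interpolation between a and b for t in [0, 1]."""
    return a + (b - a) * t


def interpolate_texcoord(x, y, size=512):
    """Texture coordinate of the fragment at pixel (x, y) in a size x size quad.

    Corner values follow Figure 2.6: texture.x runs from 0 to 1 with vertex.x,
    while texture.y runs from 1 down to 0 as vertex.y grows.
    """
    tx = lerp(0.0, 1.0, x / size)
    ty = lerp(1.0, 0.0, y / size)
    return tx, ty
```

With these assumptions, the example fragment at vertex position (290, 0) gets texture.x = 290/512, which rounds to the 0.566 shown in the figure, and texture.y = 1.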

Reading back

The primary method of reading back from GPU memory is by rendering to texture. It is also possible to read back from the frame buffer, but this is much slower. After the result is rasterised to the texture, we have to explicitly transfer the results back from GPU memory to main memory. This operation flushes the GPU pipeline and should be used with care. In many of the applications discussed in this thesis, including our ultrasound framework, the result is to be displayed on the screen, so it is not necessary to read back to the CPU at all. This speeds up the applications noticeably.

Multiple output streams

In the stream model, it is possible to output several streams from a kernel. This is also possible to do on the GPU with the use of multiple render targets (MRT). One important limitation is that the streams written to (the textures) have to be of the same size.

2.1.5 Suitable applications for the GPU

As mentioned in the introduction, [1] contains a comprehensive survey of applications that have been implemented on the GPU. Larsen [12] describes a GPU implementation of the Discrete Cosine Transform. This is a nice example of a transform implemented on the GPU. In this section, we will discuss which traits an application should have to be successfully ported to the GPU. 'Successfully' can be defined in various ways, but some important keywords are speedup, power efficiency and more computation per dollar.

[Figure, panel (a): a 512 x 512 quadrilateral with texture coordinates at its vertices: vertex (0, 0) has texture (0, 1), vertex (512, 0) has texture (1, 1), vertex (512, 512) has texture (1, 0) and vertex (0, 512) has texture (0, 0). An example fragment at (290, 0) receives the interpolated texture coordinate (0.566, 1). Panel (b): a quad with interpolated colours.]

(a) Example of texture coordinates

(b) Interpolation of colours

Figure 2.6: Interpolation example

Arguably, the most important trait for a successful GPU application is that it has high arithmetic intensity [13]:

$$\text{arithmetic intensity} = \frac{\text{operations}}{\text{words transferred}}$$

As an example, consider simple vector addition: even if the vectors are big, this will certainly run faster on the CPU than on the GPU. There is simply too much overhead involved in copying the vectors from main memory to GPU memory and back again compared to the number of operations. In addition, it is important that we have enough data to work with. Even for a complicated operation, if the data set is small, the overhead of invoking the GPU will dominate and the CPU will be faster. The question of how complicated a function should be, or how big the data should be, to make the GPU worthwhile is a difficult one. There is no simple answer, but a rule of thumb could be: at least a couple of thousand elements and several multiply-add operations.
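To make the arithmetic intensity ratio concrete, the following minimal Python sketch compares vector addition with a FIR filter. The function names and the way operations and transferred words are counted (one upload per input word, one read-back per output word, a multiply-add counted as two operations) are illustrative assumptions, not measurements from this thesis.

```python
def arithmetic_intensity(operations, words_transferred):
    """Operations performed per word moved between main memory and the GPU."""
    return operations / words_transferred


def vector_add_intensity(n):
    # Assumed cost model: n additions, 2n words uploaded (two input vectors)
    # and n words read back.
    return arithmetic_intensity(n, 3 * n)


def fir_filter_intensity(n, taps):
    # Assumed cost model: each output needs `taps` multiply-adds (2 operations
    # each); n words are uploaded and n words are read back.
    return arithmetic_intensity(2 * taps * n, 2 * n)


if __name__ == "__main__":
    # Vector addition stays at one third of an operation per word, however
    # large the vectors are, while the filter's intensity grows with its size.
    print(vector_add_intensity(1_000_000))
    print(fir_filter_intensity(1_000_000, 9))
```

Under this simple model, making the vectors longer never improves the ratio for vector addition, which is why the data size alone does not make an application suitable for the GPU.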

How the data is accessed is another important element. In GPU Gems 2 [14], the memory read speed on a GPU and a CPU was measured. GPU reads have become a bit faster since, but the general trend should still apply. The GPU reads sequential and random data faster than the CPU, especially sequential data, but reading from the cache is much faster on the CPU. These results are easily understood when we take the discussion from Section 2.1.1 into account, and they support the view of the GPU as a stream processor.

Figure 2.7: GPU memory overview. (a) Reading a single floating point with different memory access methods. (b) How the GPU connects to the rest of the PC. Figures reproduced from [14].

Flow control is one of the most important concepts in computation. Because of its highly parallel nature, the GPU doesn’t support flow control very well. Older architectures didn’t even support loops, but this has been remedied in recent years. It is possible to branch in a fragment or vertex program, but extensive branching is not recommended. There exist a couple of methods to avoid branching; [1] has a good overview of them.

In parallel computing, the gather and scatter memory operations are two basic primitives that can be used to explain the memory access pattern of an algorithm or the capability of an API. In stream computing, gather occurs when a kernel processing a stream element requests information from other parts of the stream [13]. Scatter, on the other hand, distributes information to other stream elements. The GPU supports gather very well, but not scatter. Hence, applications that use scatter should be rewritten. One example of this is sorting: the scattering quicksort is not a good fit for the GPU, while the gather-oriented bitonic sort is [15].
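The difference between the two primitives can be sketched in a few lines of Python. This is an illustration only: the function names are hypothetical, and the rewrite of a scatter as a gather assumes that no two elements are written to the same address.

```python
def gather(source, indices):
    """Gather: output element i reads from a computed input address."""
    return [source[j] for j in indices]


def scatter(source, indices, size):
    """Scatter: input element i writes to a computed output address."""
    out = [0] * size
    for i, j in enumerate(indices):
        out[j] = source[i]
    return out


def scatter_as_gather(source, indices, size):
    """Express a scatter as a gather by inverting the address map first.

    Assumes `indices` contains no duplicate targets.
    """
    inverse = [None] * size
    for i, j in enumerate(indices):
        inverse[j] = i
    return [0 if i is None else source[i] for i in inverse]
```

In a fragment program the output address is fixed by the fragment position, so only the gather form maps directly onto the hardware; the inverted address map would have to be prepared beforehand.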

2.2

Ultrasound

Medical ultrasound is a non-invasive, real-time, portable method of performing medical imaging. These properties make it an attractive alternative to other imaging methods like Magnetic Resonance Imaging (MRI) and Computed Tomography (CT) that require large immobile equipment. Ultrasound is used in many branches of medicine, but especially in cardiology, obstetrics and gastroenterology. An introduction to modern ultrasound can be found in [16].

An ultrasound machine shoots ultrasound pulses, in the 5-15 MHz range, from a transducer into a body. The wave propagates into the tissue, while a small amount is reflected along the path. The main condition for the reflection is ’impedance jumps’, which occur at the interface between two tissues with different sound transmission properties. The transducer then measures the strength and time of the reflected waves. Based on the time the pulse was sent, the speed of sound in soft tissue, and the time the reflection returned, the distance the returning wave has travelled into the tissue can be calculated. The resulting visualisation is a beam where the pixel intensity is based on the energy of the reflected wave. By adding more ultrasound probes in the transducer and changing the angle at which the ray is shot, 2D and 3D ultrasound become possible. In Figure 2.8, different ultrasound modes are shown.

Figure 2.8: Different ultrasound modes. (a) 1D (time on the X-axis). (b) 2D. (c) 3D.

2.2.1

Ultrasound noise formulation

One of the fundamental problems faced by ultrasound imaging is the presence of speckle noise. This kind of noise is common to all coherent systems and affects both human interpretation and computer-assisted techniques. Goodman (1976) [17] studied the statistical properties of speckle noise and showed that when the resolution cell is small compared to the spatial detail in the object and the image has been sampled coarsely enough, the degradation at any pixel can be assumed to be independent of the degradation at all other pixels. He also showed that the degradation at each pixel can be modelled as multiplicative noise. In Jain [18], this is formulated as:

$$f(x, y) = g(x, y)\,\eta_m(x, y) + \eta_a(x, y) \tag{2.3}$$

Here $g(x, y)$ is a piece-wise constant 2D function representing the unknown error-free image, $f(x, y)$ is the observed value of $g(x, y)$, $\eta_m(x, y)$ is the multiplicative noise, $\eta_a(x, y)$ is the additive noise, and $(x, y) \in \mathbb{R}^2$. The additive noise $\eta_a$ is relatively small compared to the multiplicative noise in ultrasound images, so we can approximate Equation 2.3 by

$$f(x, y) = g(x, y)\,\eta_m(x, y) \tag{2.4}$$

We can separate noise and signal by applying the log transform:

$$\log(f(x, y)) = \log(g(x, y)) + \log(\eta_m(x, y)) \tag{2.5}$$

This log transform is done anyway on commercial ultrasound systems because of the limited dynamic range of the monitors. We can write Equation 2.5 as

$$f_l(x, y) = g_l(x, y) + \eta_{a,l}(x, y) \tag{2.6}$$

where the subscript $l$ marks the log-transformed quantities. The subscript of the noise is changed from $m$ to $a$ to point out that the noise is now treated as additive white Gaussian noise. So the image enhancement algorithms have to deal with additive Gaussian noise instead of multiplicative noise. The former task is much easier than the latter.
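The effect of the log transform in Equation 2.5 can be checked numerically. The following Python sketch uses made-up pixel and noise values (assumptions for illustration, all strictly positive so that the logarithm is defined) and verifies that the multiplicative noise becomes an additive term.

```python
import math


def log_transform(image):
    """Apply the natural logarithm to every pixel (pixels must be > 0)."""
    return [[math.log(p) for p in row] for row in image]


def multiply(a, b):
    """Pixel-wise product of two images of equal size."""
    return [[x * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


# Hypothetical error-free image g and multiplicative noise eta_m.
clean = [[10.0, 10.0], [40.0, 40.0]]
noise = [[1.2, 0.8], [1.1, 0.9]]
observed = multiply(clean, noise)  # Equation 2.4

# After the log transform, the noise is the difference of the two images.
log_observed = log_transform(observed)
log_clean = log_transform(clean)
log_noise = log_transform(noise)
residual = [[fo - fc for fo, fc in zip(ro, rc)]
            for ro, rc in zip(log_observed, log_clean)]
```

Here `residual` agrees with `log_noise` up to rounding, which is exactly the statement of Equation 2.5.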

There exist other formulations of ultrasound speckle noise, most of them quite complicated. The formulation presented here is advanced enough for our purposes and has been shown to work well in practice. For these reasons, this noise model will be used.

2.2.2

Ultrasound image restoration

Image restoration is an old field in computer science and a large number of methods have been developed. Most methods proposed for ultrasound image enhancement are based on these existing methods, but adjusted to suit the characteristics of the ultrasound imaging process. In [19], some methods are listed: median filtering [20, 21], Wiener filtering [18], adaptive weighted median filtering [22], adaptive speckle reduction (ASR), wavelet shrinkage and Nonlinear Anisotropic Diffusion (NAD).

In this thesis, we will only look at some of the methods proposed. We will concentrate most of our effort on the multiresolution methods. These methods are based on wavelets, described in Section 2.3. This thesis has been written in cooperation with the medical industry and the selection of techniques is based on the techniques that are actually used in practice [23]. In the literature, it is popular to establish statistical methods that will work on all kinds of ultrasound equipment. In practice, it is often better to adjust the enhancement parameters to suit different equipment. This thesis will assume hand-tuned instead of auto-tuned parameters.

2.3

Wavelets

Transforms on digital signals are used to obtain information that is not readily available by analysing the signal in its raw form. Examples include the Hilbert transform, the Wigner distribution, the Radon Transform and the most popular one: the Fourier Transform (FT). One reason for this popularity is the Fast Fourier Transform (see Cooley et al. [24] for more information), which is a fast algorithm for computing the Fourier Transform. The FT transforms a signal from the raw format into the frequency domain. The raw format is usually the time domain, but when we are analysing images it is the spatial domain. The Fourier transform is defined as:

$$\hat{f}(\omega) \overset{\text{def}}{=} \int_{\mathbb{R}} f(t)\, e^{-i\omega t}\, dt$$

One important fact to notice is that the Fourier transform integrates over the entire time range. This gives the Fourier transform one of its most important properties: a transformed signal will contain the exact frequencies that existed in the original signal, but all time (or spatial) information is lost.³ This may or may not be a big problem. For so-called stationary signals it has no effect. Stationary signals are signals whose statistical parameters are constant over time. But for non-stationary signals we lose potentially important information in our transform. Several schemes have been devised to counter this. One approach is the Short-Time Fourier Transform (STFT). The STFT assumes that the signal is stationary for some portion of the signal. It then cuts the signal into small segments. The problem is that we no longer integrate over the whole time range. According to the Heisenberg Uncertainty Principle (HUP), we then no longer get exact frequency information. We can adjust this tradeoff by changing the segment length. If we let $\sigma_t$ denote the time spread around a time instant, and $\sigma_\omega$ denote the frequency spread around an instant, HUP states:

$$\sigma_t^2\, \sigma_\omega^2 \geq \frac{1}{4}$$

For our transformations the tradeoffs can be summed up as:

Segment size    Time resolution    Frequency resolution
Small           Good               Poor
Large           Poor               Good

Another approach is using a related transform, the wavelet transform.

2.3.1

Wavelets

In this chapter, we will present the current state as presented by [25, 26, 27, 28].

The word wavelet is a direct translation of the French ‘ondelette’, which means small wave, and this is exactly what wavelets are: functions which oscillate like a wave in a limited space. Wavelets have to satisfy certain requirements. $\psi(x)$ is a wavelet given that it oscillates, that it averages to zero and that it is well localised. We use wavelets in much the same way as sines and cosines are used in Fourier analysis. But an important concept in wavelet analysis is scale. One chooses a wavelet, scales it and translates it to see its correlation with the signal being analysed. When the signal correlates with a large (coarse) scale we see the coarse features, while small scales show the fine features. It is often said you can ’see the forest and the trees’.

³ Since the transform is reversible we don’t actually ‘lose’ any information. It is just unavailable in the frequency domain.

Figure 2.9: Examples of different wavelets. (a) Hermitian hat. (b) Morlet with $z_0 = 5$. (c) Haar. (d) Mexican hat. (e) Mexican hat with $\tau = 1$ and $s = 2$. (f) Mexican hat with $\tau = 2$ and $s = 1$.

The prototype function is called the mother wavelet. The scaled and translated wavelets are called the baby wavelets. Given a wavelet $\psi(x)$ we create baby wavelets $\psi_{s,\tau}(x)$ by scaling with $s$ and translating by $\tau$:

$$\psi_{s,\tau}(x) = \frac{1}{\sqrt{s}}\, \psi\!\left(\frac{x - \tau}{s}\right) \tag{2.7}$$
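Equation 2.7 can be sketched numerically as follows. The mother wavelet used here is the Mexican hat in its common unnormalised form $(1 - x^2)e^{-x^2/2}$; this particular form and the helper names are assumptions made for illustration.

```python
import math


def mexican_hat(x):
    """Unnormalised Mexican hat mother wavelet (assumed form)."""
    return (1.0 - x * x) * math.exp(-x * x / 2.0)


def baby_wavelet(psi, s, tau):
    """Return the baby wavelet psi_{s,tau} of Equation 2.7 as a function of x."""
    if s <= 0:
        raise ValueError("the scale s must be positive")

    def scaled(x):
        return psi((x - tau) / s) / math.sqrt(s)

    return scaled


# The two baby wavelets of Figure 2.9(e) and (f).
wide = baby_wavelet(mexican_hat, s=2.0, tau=1.0)
shifted = baby_wavelet(mexican_hat, s=1.0, tau=2.0)
```

Increasing `s` stretches the wavelet and lowers its peak by the factor $1/\sqrt{s}$, while `tau` only moves it along the axis.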

2.3.2

Continuous wavelet transform

As indicated in the introduction we are interested in using the wavelets in transforms. The continuous wavelet transform (CWT) is defined as:

$$W_\psi(s, \tau) := \int_{\mathbb{R}} f(x)\, \psi^*_{s,\tau}(x)\, dx \tag{2.8}$$

where $\psi^*$ is the complex conjugate of the wavelet $\psi$. To be used in the CWT, a wavelet has to satisfy the admissibility condition

$$0 < C_\psi := \int_{\mathbb{R}} \frac{|\hat{\psi}(\omega)|^2}{|\omega|}\, d\omega < \infty \tag{2.9}$$

where $\hat{\psi}$ denotes the Fourier transform of the wavelet $\psi$. In addition, the wavelet has to be chosen from the space $L^2(\mathbb{R})$. These conditions are usually satisfied by requiring

$$\int_{\mathbb{R}} \psi(x)\, dx = 0 \quad \text{and} \quad \hat{\psi}(0) = 0$$

and that $\hat{\psi}(\omega) \to 0$ as $\omega \to \infty$ fast enough to make $C_\psi < \infty$.

It can be shown⁴ that the CWT can approximate any function in $L^2(\mathbb{R})$ at arbitrary precision. As a matter of fact, we can restrict our scaling factor to $s = 2^j$, $j \in \mathbb{Z}$, and still be able to represent any function. This transform, $W_\psi(2^j, \tau)$, is called the Dyadic transform.

The inverse wavelet transform is defined as

$$f(x) = \frac{1}{C_\psi} \int_{\mathbb{R}} \int_{\mathbb{R}} W_\psi(s, \tau)\, \frac{\psi_{s,\tau}(x)}{s^2}\, d\tau\, ds \tag{2.10}$$

Figure 2.10: Time-Frequency resolution. (a) Wavelet decomposition. (b) Short-Time Fourier. (c) Wavelet. Figures reproduced from [27].

Some example wavelets can be seen in Figure 2.9.

Since we are working with computers, we would like to see how we could make this transform discrete. This, however, is not straightforward, and we have to make a detour into Multiresolution Analysis first.

2.3.3

Multiresolution Analysis

Multiresolution Analysis (MRA) is a method where a signal is successively decomposed into coarser details and approximations. We project a signal onto a sequence of approximation spaces

$$\{0\} \subset \dots \subset V_{j+1} \subset V_j \subset V_{j-1} \subset \dots \subset L^2(\mathbb{R}) \tag{2.11}$$

where the index $j$ indicates the resolution scale $2^j$. We can restrict ourselves to these subspaces since we are working with dyadic transformations as discussed in Section 2.3.2. The differences between two approximation spaces are contained in the detail space $W_j$:

$$V_j \cap W_j = \{0\} \quad \text{and} \quad V_{j-1} = V_j \oplus W_j \tag{2.12}$$

Another way of saying this is that the $V_j$ are the sums of the $W_j$. We can write this explicitly (this is the same as writing Equation 2.12 out) as:

$$V_J = \bigoplus_{j \geq J+1} W_j \tag{2.13}$$

If the vector spaces $V$ and $W$ also satisfy:

• The basis functions that span the space are orthogonal to their translates by integers.

• The approximation at a resolution is self-similar.

• In addition to being nested vector spaces, the subspaces together can approximate every function in $L^2(\mathbb{R})$:

$$\bigcap_{j \in \mathbb{Z}} V_j = \{0\}, \qquad \bigcup_{j \in \mathbb{Z}} V_j = L^2(\mathbb{R}) \tag{2.14}$$

they are called a multiscale approximation of $L^2(\mathbb{R})$ [27].

Scaling and Wavelet functions

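Since we would like to approximate an arbitrary signal within each subspace $V_j$, we are looking for basis functions that span all the subspaces. Actually, it can be proven⁵ that if $V_j$ is a multiscale approximation in $L^2(\mathbb{R})$, then there exists a single scaling function $\varphi$ such that

$$\left\{ \frac{1}{\sqrt{2^j}}\, \varphi\!\left(\frac{x - 2^j k}{2^j}\right) \right\}_{k \in \mathbb{Z}} \tag{2.15}$$

is an orthonormal basis of $V_j$. From this definition we can see that $f(x) \in V_{j+1} \Leftrightarrow f(2x) \in V_j$. Based on our previous assumptions we can write this out as an explicit recursive function:

$$\varphi(x) = \sqrt{2} \sum_{k \in \mathbb{Z}} h_k\, \varphi(2x - k) \;\Leftrightarrow\; \frac{1}{\sqrt{2}}\, \varphi\!\left(\frac{x}{2}\right) = \sum_{k \in \mathbb{Z}} h_k\, \varphi(x - k) \tag{2.16}$$

$h$ is called the filter mask for the scaling function $\varphi$.

For $W_j$, we call the basis functions spanning the vector space wavelets and they are denoted $\psi$. Every basis function $\psi_{j-1}$ of $W_{j-1}$ is orthogonal to every basis function $\varphi_{j-1}$. Analogous to Equation 2.15 we have that

$$\left\{ \frac{1}{\sqrt{2^j}}\, \psi\!\left(\frac{x - 2^j k}{2^j}\right) \right\}_{k \in \mathbb{Z}} \tag{2.17}$$

is an orthonormal basis of $W_j$. And analogous to Equation 2.16 we have

$$\psi(x) = \sqrt{2} \sum_{k \in \mathbb{Z}} g_k\, \varphi(2x - k) \;\Leftrightarrow\; \frac{1}{\sqrt{2}}\, \psi\!\left(\frac{x}{2}\right) = \sum_{k \in \mathbb{Z}} g_k\, \varphi(x - k) \tag{2.18}$$

where $g$ is the detail filter, which can be calculated as:

$$g_k = \langle \psi(x), \sqrt{2}\, \varphi(2x - k) \rangle$$

The link between multiscale analysis and wavelet theory was first made by Mallat [29], and it enables us to create a Discrete Wavelet Transform.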

2.3.4

Discrete Wavelet Transform

The Multiscale Analysis presented above gives us a method of decomposing a signal into components of different resolutions. We get the details by applying the wavelet function $\psi$ (or its filter mask $g$) and the approximations by applying the scaling function $\varphi$ (or its filter mask $h$). We can then apply the filters recursively to the approximation at the coarser level. If $\lambda_j(x)$ is the approximation at level $j$ and $\gamma_j(x)$ the detail, we can write this as:

$$\lambda_{j+1}(x) = \sum_{k=-\infty}^{\infty} h(k)\, \lambda_j(2x + k) \qquad \gamma_{j+1}(x) = \sum_{k=-\infty}^{\infty} g(k)\, \lambda_j(2x + k + 1) \tag{2.19}$$

This algorithm is called the Discrete Wavelet Transform. Note that unlike the CWT, the DWT contains no redundancy. A block diagram showing three levels of the DWT is shown in Figure 2.11.

Figure 2.11: An example with 3 levels DWT

The Discrete Wavelet Transform is sometimes called the Fast Wavelet Transform in analogy to the Fast Fourier Transform (FFT). Actually, the DWT is faster than the FFT: its complexity is $O(n)$ compared to the FFT’s $O(n \log n)$.
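The recursive structure of the DWT can be sketched in Python with the Haar wavelet, whose filters have only two taps. This is an illustration under stated assumptions: the Haar filter is chosen for brevity (it is not one of the wavelets used in this thesis), the function names are hypothetical, and the signal length is assumed to be divisible by $2^{levels}$.

```python
import math

SQRT2 = math.sqrt(2.0)


def haar_forward(signal):
    """One analysis level: filter with h and g, then keep every second sample."""
    half = len(signal) // 2
    approx = [(signal[2 * i] + signal[2 * i + 1]) / SQRT2 for i in range(half)]
    detail = [(signal[2 * i] - signal[2 * i + 1]) / SQRT2 for i in range(half)]
    return approx, detail


def haar_inverse(approx, detail):
    """One synthesis level: rebuild the finer approximation."""
    out = []
    for a, d in zip(approx, detail):
        out.append((a + d) / SQRT2)
        out.append((a - d) / SQRT2)
    return out


def dwt(signal, levels):
    """Return (coarsest approximation, [finest detail, ..., coarsest detail])."""
    approx = list(signal)
    details = []
    for _ in range(levels):
        approx, detail = haar_forward(approx)
        details.append(detail)
    return approx, details


def idwt(approx, details):
    """Invert dwt() by applying the synthesis step from coarse to fine."""
    for detail in reversed(details):
        approx = haar_inverse(approx, detail)
    return approx
```

Each level touches half as many samples as the previous one, so the total work is proportional to n + n/2 + n/4 + ..., which stays below 2n. The number of output coefficients also equals the number of input samples, which is the lack of redundancy mentioned above.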

It is important to understand that the DWT is not simply sampling the CWT. The wavelets (and the associated filters) now have to be chosen carefully so that they are a basis of $L^2(\mathbb{R})$ and in the form of Equation 2.17. So our choice of wavelets is quite restricted. If we further restrict our filters $g$ and $h$ to have finite impulse response (FIR) and be orthogonal, we can create the inverse filters $h'$ and $g'$ in such a way that we get perfect reconstruction. We call this reconstruction synthesis and it can be written as

$$\lambda_j(x) = \sum_{k=-\infty}^{\infty} h'(k)\, \lambda_{j-1}(x + k) + \sum_{k=-\infty}^{\infty} g'(k)\, \gamma_{j-1}(x + k) \tag{2.20}$$

Creating a wavelet that meets all the required conditions is quite tedious and outside the scope of this thesis. We shall instead use some of the already designed wavelets that have been proven to satisfy the needed conditions.

2D Discrete Wavelet Transform

Since this thesis concentrates on images, we would like to extend the DWT to two dimensions. There exist several methods for doing this. The easiest and most used is the separable approach. The separable 2D DWT is achieved by first applying the 1D DWT on the rows, and then on the columns. This gives us four decomposed signals for each level of DWT:

• LL: The approximation. This is the signal that will be recursed further upon.

• LH: Horizontal approximation, vertical detail. This signal contains specifically the vertical details and can be used if one wants to apply a special filter to the vertical details (like searching for vertical lines).

• HL: Horizontal detail, vertical approximation.

• HH: Detail in both the vertical and horizontal directions.

Figure 2.12: 2D Forward DWT. (a) One level of 2D forward DWT (horizontal pass followed by vertical pass). (b) How (a) is usually visualised on screen.

In Figure 2.12, the 2D Forward DWT is shown; Figure 2.12(b) shows how the four signals are visualised. LL is used for further decomposition, and Figure 2.13 shows how a three-level forward DWT is visualised. The Backward DWT is done in exactly the opposite way, first on the columns and then on the rows. The other approach to 2D is called the non-separable approach and is still at a basic research level. It will not be discussed further in this thesis.
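One level of the separable approach can be sketched in Python as a horizontal pass over the rows followed by a vertical pass over the columns. The Haar filter and all function names are assumptions chosen to keep the sketch short; the image is assumed to have even width and height.

```python
import math


def haar_step(values):
    """1D analysis step: returns (low pass, high pass), each half as long."""
    s = math.sqrt(2.0)
    half = len(values) // 2
    low = [(values[2 * i] + values[2 * i + 1]) / s for i in range(half)]
    high = [(values[2 * i] - values[2 * i + 1]) / s for i in range(half)]
    return low, high


def transpose(matrix):
    return [list(column) for column in zip(*matrix)]


def vertical_pass(block):
    """Filter every column of `block`; returns (low rows, high rows)."""
    columns = [haar_step(column) for column in transpose(block)]
    low = transpose([lo for lo, _ in columns])
    high = transpose([hi for _, hi in columns])
    return low, high


def dwt2_level(image):
    """One level of the separable 2D DWT; returns the four subbands."""
    # Horizontal pass: filter every row.
    rows = [haar_step(row) for row in image]
    horizontal_low = [lo for lo, _ in rows]
    horizontal_high = [hi for _, hi in rows]
    # Vertical pass on both halves.
    ll, lh = vertical_pass(horizontal_low)
    hl, hh = vertical_pass(horizontal_high)
    return {"LL": ll, "LH": lh, "HL": hl, "HH": hh}
```

For an image that only varies from row to row, all of the detail energy ends up in LH (horizontal approximation, vertical detail), while HL and HH are zero. Further levels would call `dwt2_level` again on the LL subband.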

Choice of Wavelets

Unlike the FT, we have to choose which wavelets to use for the wavelet transform. The main reason we are using wavelets in this thesis is for denoising. Therefore, the wavelet we choose has to fit this purpose. In [28], there is a discussion of the important properties the wavelet should have for denoising. There are two main points.

Figure 2.13: Multiple levels of 2D DWT. (a) Three level forward DWT. (b) The forward DWT on a real picture.

• Denoising is facilitated by a sparse representation, i.e. the detail should contain few non-negligible coefficients.

• The visual quality is important.

The first objective is met by having many vanishing moments $\tilde{N}$ and as small a support size $K$ as possible. The theoretical limit is $K = 2\tilde{N} - 1$ and is achieved by the Daubechies wavelets. This is a family of wavelets usually denoted db$\tilde{N}$. Visual quality is usually achieved by having a regular and symmetric wavelet. Daubechies symlets, sym$\tilde{N}$, are the most compactly supported symmetric wavelets. In this thesis, the Daubechies 9/7 and 5/3 wavelets are used. They are chosen because they are the most used in the denoising literature and because they are used in the JPEG2000 standard (indicating good visual quality) as well.

2.4

Filters

As part of this thesis, one goal was to assess the possibility of running the simple spatial filters used in ultrasound processing on the GPU. In this section, we will take a look at these filters. The theory of these filters can be found in any introductory textbook on image processing, like Gonzalez and Woods [30].

2.4.1

Gaussian blur

Gaussian blur is a widely used image filter. The effect of a Gaussian blur is to reduce the noise and detail of an image, i.e. it acts as a low-pass filter. The name Gaussian blur comes from the use of the normal distribution

$$G(r) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-r^2 / (2\sigma^2)} \tag{2.21}$$

which is also called the Gaussian distribution. Here $r$ is the blur radius, $r^2 = x^2 + y^2$, and $\sigma$ is the standard deviation or ‘the amount of blurring’. Applying a Gaussian blur to an image is simply convolving the image with the Gaussian distribution.
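A discrete version of this filter can be sketched in Python as follows. The sketch makes a few assumptions that are not taken from this thesis: the kernel is truncated to a square of size (2 * radius + 1), the weights are renormalised to sum to one (which replaces the constant factor in Equation 2.21), and pixels outside the image are clamped to the nearest border pixel.

```python
import math


def gaussian_kernel(radius, sigma):
    """Sample the Gaussian on a (2*radius+1)^2 grid and normalise the sum to 1."""
    size = 2 * radius + 1
    kernel = [[math.exp(-((x - radius) ** 2 + (y - radius) ** 2)
                        / (2.0 * sigma ** 2))
               for x in range(size)]
              for y in range(size)]
    total = sum(sum(row) for row in kernel)
    return [[w / total for w in row] for row in kernel]


def convolve(image, kernel):
    """Convolve `image` with a symmetric `kernel`, clamping at the borders."""
    height, width = len(image), len(image[0])
    r = len(kernel) // 2
    out = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            acc = 0.0
            for ky in range(-r, r + 1):
                for kx in range(-r, r + 1):
                    sy = min(max(y + ky, 0), height - 1)
                    sx = min(max(x + kx, 0), width - 1)
                    acc += image[sy][sx] * kernel[ky + r][kx + r]
            out[y][x] = acc
    return out
```

Because the weights sum to one, flat regions keep their brightness while isolated spikes are spread out over their neighbours. Each output pixel reads a fixed neighbourhood of input pixels, which is the gather pattern that suits a fragment program.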

Figure 2.14: Gauss function. (a) $\sigma = 0.3$. (b) $\sigma = 0.5$.

Figure 2.15: Example of image filtering. (a) The original picture, Lena. (b) Gaussian filter applied. (c) Laplacian edge enhancement applied.

2.4.2

Edge enhancement

To do edge enhancement we have chosen to use the Laplacian. The Laplacian $\nabla^2 f$ of an image $f(x, y)$ is given by [30]:

$$\nabla^2 f = \frac{\partial^2 f}{\partial x^2} + \frac{\partial^2 f}{\partial y^2} \tag{2.22}$$

We are working on images, so we need a discrete representation. The discrete partial derivatives are:

$$\frac{\partial^2 f}{\partial x^2} = f(x + 1, y) + f(x - 1, y) - 2f(x, y) \qquad \frac{\partial^2 f}{\partial y^2} = f(x, y + 1) + f(x, y - 1) - 2f(x, y) \tag{2.23}$$

which gives us:

$$\nabla^2 f = f(x, y + 1) + f(x, y - 1) + f(x + 1, y) + f(x - 1, y) - 4f(x, y) \tag{2.24}$$

Using only these equations directly would give us the features as greyish lines on a dark background. If we add the original image back again we will get images with feature enhancement. So the final filter equation is given in Equation 2.25.

$$g(x, y) = \nabla^2 f + f(x, y) \tag{2.25}$$

2.4.3

Wavelet spatial filters

“The purpose of subband filtering is of course not to just decompose and reconstruct. The goal of the game is to do some compression or processing between the decomposition and reconstruction stages.” – Daubechies [25]

In Section 2.3, we introduced wavelets as a method of decomposing an image into a number of detail coefficients and approximations. By applying the backward DWT we get the original image back. If we apply filters to the decomposed results before composition it is possible to refine the end result. There are some options on which of the coefficients to filter. In Figure 2.16, these options are shown on a two-level wavelet.

Figure 2.16: Figure showing where filtering may be applied between forward and backward DWT.

The most used option is to decompose the image entirely and then apply filters to the decomposed coefficients ($f_2$, $f_3$ and $f_4$). It is also possible to apply a filter to the approximation before further decomposing it ($f_1$). Usually we apply different kinds of filters to the low pass part ($f_2$) and the high pass part ($f_3$ and $f_4$). We will now discuss a popular wavelet filter called Wavelet Shrinkage.

Wavelet Shrinkage (WS)

Wavelet Shrinkage is a filtering method that was first applied to denoising by Donoho [31]. The idea is that most speckle noise lies in the high frequencies of the image. In addition, the noise will usually contribute less than the features. So if we reduce the detail coefficients, we can reduce the noise energy while still preserving most of the features. Since the noise is most evident in the higher frequencies, WS applies more reduction to the finer detail coefficients and less to the coarser ones. In terms of levels, the low-numbered levels are reduced more. The reduction method used is thresholding. Two thresholding methods in particular are widely used in the literature: soft and hard thresholding. From [19] we have a definition of soft thresholding:

$$u = S(v, t) = \operatorname{sgn}(v)\,(|v| - t)_+ \tag{2.26}$$

where

$$(|v| - t)_+ = \begin{cases} |v| - t & \text{if } |v| > t \\ 0 & \text{otherwise} \end{cases} \tag{2.27}$$

Hard thresholding is defined as:

$$u = H(v, t) = \begin{cases} 0 & \text{if } t > v > -t \\ v & \text{otherwise} \end{cases} \tag{2.28}$$

The thresholding functions are shown in Figure 2.17. The difference between the two thresholding functions is small in practice. Usually hard thresholding preserves features better, but it also produces more artefacts from the wavelet synthesis. An extreme example of these artefacts can be seen in Figure 2.18. As can be seen, these artefacts are quite similar to the artefacts produced by hard filters in the Fourier domain.

Figure 2.17: Thresholding functions. (a) The original function. (b) Soft thresholding. (c) Hard thresholding.
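The two thresholding rules can be sketched in Python as below. The soft rule is written in its common form, sign(v) times the positive part of |v| - t, and the hard rule follows Equation 2.28. The `shrink` helper and its per-level threshold list are hypothetical, standing in for the hand-tuned parameters assumed in this thesis.

```python
import math


def soft_threshold(v, t):
    """Shrink v towards zero by t; values with |v| <= t become zero."""
    magnitude = abs(v) - t
    if magnitude <= 0.0:
        return 0.0
    return math.copysign(magnitude, v)


def hard_threshold(v, t):
    """Zero out values strictly inside (-t, t); keep the rest unchanged."""
    if -t < v < t:
        return 0.0
    return v


def shrink(detail_levels, thresholds, rule=soft_threshold):
    """Apply one threshold per detail level (finest level first)."""
    return [[rule(v, t) for v in level]
            for level, t in zip(detail_levels, thresholds)]
```

A typical use would pass a larger threshold for the finest level than for the coarser ones, for example `shrink(details, [2.0, 1.0, 0.5])`, so that the levels carrying most of the noise are reduced the most.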

Figure 2.18: Example of too much hard thresholding. (a) The original image. (b) Thresholded image.


2.5

Previous GPU work on wavelets

The simple spatial filters have been implemented on the GPU by numerous authors. [32] lists sample implementations of a Gaussian filter. The Laplacian can be implemented in a similar manner. Our implementation of these techniques is similar and exhibits similar performance. As far as the author knows, there has only been one previous attempt at implementing the DWT on the GPU, presented in [33]. The method presented there uses a lookup table to check which data values and kernel values should be used for each result. This table takes care of boundary conditions and also stores information about which kernel (high or low) should be used. One of the positive results of this design is that the decomposition and composition are very similar. The design also allows for wavelet kernels of different sizes without too much recoding. The downside of this approach is that it uses many texture reads and all but one of them are dependent. Dependent texture reads are texture reads that depend on earlier texture reads in the same kernel computation. A more detailed comparison of the previous method and our contribution will be presented in Section 3.4.

All of the methods presented in this thesis have previously been implemented on the CPU. We will hence also compare all our methods to CPU-equivalent implementations.


Chapter

3

Novel Wavelet Techniques

The devil is in the details -Proverb

The background theory for GPUs, wavelets and ultrasound was discussed in the previous chapter. In this chapter, our novel methods will be presented, including a texture addressing optimisation that we have used for both the DWT and spatial filtering (Section 3.1). The main technique developed, DWT on the GPU, is described in Section 3.2. We have also implemented spatial filters that are used between the forward and backward DWT. These are described in Section 3.3. In Section 3.4, we derive a theoretical model that can be used to analyse the performance of the algorithms on the GPU.

3.1

Texture addressing on the GPU

One of the novel techniques developed is the texture addressing optimisation used in our filters, both for the DWT and the spatial filters. The idea itself is not new, but it has not been used in conjunction with DWT before.

In their most basic form, filters are just functions that take an input image and output a transformed image. Usually, image filters are defined by their filter kernels. A filter kernel defines how to construct one pixel in the output image. The input to the kernel is usually at least the pixel at the same position in the input image as the output value currently being calculated, but the surrounding pixels are often used as well. The inputs are often called ’taps’. This kind of computation maps perfectly to the GPU: one output and several inputs. In Figure 3.1, a schematic overview of the filtering process on a GPU is shown. In order to execute a filter kernel (fragment program), we have to know the address of the corresponding pixel in the input texture. This is achieved using the texture input coordinates. As discussed in Section 2.1.4, the texture coordinates are given as arguments to the vertices in the quad. They are then interpolated across the quad. Texture coordinate addresses in DirectX are in $[0, 1]$ regardless of the size of the source texture.

Figure 3.1: Schematic overview of the filter rendering process.

Figure 3.2: Explanation of different filter kernels. (a) A 3x3 tap. (b) A 1x15 tap.

If we use a quad with texture coordinates, we can access the input pixel with the same coordinates as the current output pixel with:

float4 value = tex2D(Source, Tex);

We now have the address of the centre pixel. Almost all filter kernels use more taps, e.g. a simple 3x3 kernel like the one in Figure 3.2(a). To compute the addresses of the other taps, we have to add or subtract the texel¹ width and height. The obvious way to do this would be in the fragment shader, like this (in this example for tap 5):

tap[5] = tex2D(Source, float2(Tex.x + texelWidth, Tex.y));

Since source texture coordinates go from 0 to 1, texelWidth is 1/sourceWidth. However, this is quite inefficient. For a picture with $1024^2$ elements, each coordinate will be computed 1 048 576 times! We can instead precompute the offsets in the vertex shader and pass them along as parameters to the fragment shader. Listing 3.1 shows a vertex shader calculating the addresses for the sample 9-tap kernel.


// Precomputes the nine tap addresses (row-major, Tap4 is the centre) once per vertex.
// x is offset by texelWidth and y by texelHeight, as in the tap 5 example.
NINE_TAP_STRUCT NineTapVertexShader(INPUT_TEX input) {
    NINE_TAP_STRUCT output = (NINE_TAP_STRUCT)0;
    output.Position = mul(input.Position, mvpMatrix);
    // Row at y - texelHeight
    output.Tap0 = input.Tex + float2(-texelWidth, -texelHeight);
    output.Tap1 = input.Tex + float2(0, -texelHeight);
    output.Tap2 = input.Tex + float2(texelWidth, -texelHeight);
    // Centre row
    output.Tap3 = input.Tex + float2(-texelWidth, 0);
    output.Tap4 = input.Tex;
    output.Tap5 = input.Tex + float2(texelWidth, 0);
    // Row at y + texelHeight
    output.Tap6 = input.Tex + float2(-texelWidth, texelHeight);
    output.Tap7 = input.Tex + float2(0, texelHeight);
    output.Tap8 = input.Tex + float2(texelWidth, texelHeight);
    return output;
}

Listing 3.1: Nine Tap Vertex Shader

Since it is only possible to transfer a small number of texture coordinates, typically no more than nine, from the vertex shader to the fragment shader, this method doesn’t always work. In that case, some of the addresses can be precomputed in the vertex shader, while the rest are computed in the fragment shader. In Figure 3.2(b), a kernel that can be used for separable 15x15 kernels is shown. Here, the dots are precomputed in the vertex shader, while the crosses are computed in the fragment shader.

The GPU allows for several interpolation methods when accessing texture memory. This means that if you try to access an address between two elements, you can get a blend of the two elements. For our application this is rarely what we want, so we have disabled this option and chosen ‘point’ sampling. This mode chooses the texture element closest to the address given.
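The addressing scheme can be mimicked on the CPU with a short Python sketch. It assumes that texel centres lie at (i + 0.5) / size in normalised coordinates and that addresses outside the texture are clamped; the exact texel-to-pixel mapping of a particular DirectX version is not modelled, and the function names are hypothetical.

```python
import math


def tap_coordinates(u, v, width, height):
    """Normalised addresses of the 3x3 neighbourhood of (u, v), row-major."""
    texel_width = 1.0 / width
    texel_height = 1.0 / height
    return [(u + dx * texel_width, v + dy * texel_height)
            for dy in (-1, 0, 1)
            for dx in (-1, 0, 1)]


def point_sample(texture, u, v):
    """Return the texel whose cell contains (u, v), with clamp addressing."""
    height, width = len(texture), len(texture[0])
    x = min(max(int(math.floor(u * width)), 0), width - 1)
    y = min(max(int(math.floor(v * height)), 0), height - 1)
    return texture[y][x]


# A 4x4 test texture whose value encodes its own position as 10 * y + x.
texture = [[10 * y + x for x in range(4)] for y in range(4)]
# Centre of texel x = 1, y = 2 under the assumed mapping.
centre_u, centre_v = 1.5 / 4, 2.5 / 4
taps = [point_sample(texture, u, v)
        for u, v in tap_coordinates(centre_u, centre_v, 4, 4)]
```

With point sampling, each of the nine addresses resolves to exactly one neighbouring texel, so no blending between elements takes place.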

3.2

DWT on the GPU

In this section, we will discuss the main contribution of this thesis: a novel fast algorithm for computing the DWT on the GPU. The method presented here is faster than any other method, CPU or GPU, that the author knows about.

To recapitulate from Section 2.3.4, the two algorithms, or equations that the DWT consists of are: Equation 2.19 (Forward DWT)

λ(x)j+1= k=∞ X k=−∞ h(k)λj(2x+k) γ(x)j+1= k=∞ X k=−∞ g(k)λj(2x+k+ 1) (3.1)

(44)

[Figure: (a) CPU Version: Analysis, Apply filters, Synthesis, Display, Save. (b) GPU Version: Init DirectX, Upload Textures, Analysis, Apply Filters, Synthesis, Display, Save.]

Figure 3.3: Schematic overview of the DWT process.

and Equation 2.20 (Backward DWT)

$$\lambda_j(x) = \sum_{k=-\infty}^{\infty} h'(k)\,\lambda_{j-1}(x+k) + \sum_{k=-\infty}^{\infty} g'(k)\,\gamma_{j-1}(x+k) \tag{3.2}$$

There are two known ways of implementing the DWT on the CPU. The most obvious is to implement the convolutions in Equations 3.1 and 3.2 directly. The other is the lifting scheme [34], an optimised version that uses approximately half of the floating-point operations of the direct version. On the CPU, the lifting scheme is clearly the faster algorithm. However, it uses two sub-steps for each forward DWT level, called lifting and update, and the update step depends on the result of the lifting step. To implement this on the GPU we would have to use two render passes for each direction of a forward DWT, for a total of four passes (2 steps * 2 directions) for each level. The direct computation needs only one pass per direction, two in total for each forward DWT level. The situation is the same for the inverse transform. A render pass causes a switch in the rendering context, which adds considerable overhead, so the number of passes should be kept low.

Our implementation of the DWT therefore uses the direct computation method. The algorithm contains five phases: initialisation, four computational phases and display. The phases can be seen in Figure 3.3(b). The initialisation, uploading of textures and display phases will not be described here as they have already been covered in Chapter 2.1. The forward and backward DWT are discussed in Sections 3.2.1 and 3.2.2 respectively.
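As a rough illustration of the pass counts compared above, consider a 2D transform with $L$ decomposition levels. The symbol $L$ and the totals below are introduced here only for illustration, and only the forward and inverse transform passes are counted:

$$P_{\text{lifting}} = \underbrace{2}_{\text{steps}} \cdot \underbrace{2}_{\text{directions}} \cdot \underbrace{2}_{\text{forward + inverse}} \cdot L = 8L, \qquad P_{\text{direct}} = 1 \cdot 2 \cdot 2 \cdot L = 4L$$

For $L = 3$, for example, this gives 24 render passes for the lifting scheme against 12 for the direct computation. Passes spent on filtering or display are not included in this count.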


3.2.1

Forward Discrete Wavelet Transform

The equations presented in the previous section use filters of infinite size, but for practical applications we will always have a finite filter. For a filter of size K, we can rewrite Equation 3.1 as:

$$\lambda_{j+1}(x) = \sum_{k=0}^{K-1} h(k)\,\lambda_j\!\left(2x - \tfrac{K}{2} + k\right), \qquad \gamma_{j+1}(x) = \sum_{k=0}^{K-1} g(k)\,\lambda_j\!\left(2x + 1 - \tfrac{K}{2} + k\right) \tag{3.3}$$

A straightforward implementation of this would be to compute the high and low values as shown in Listing 3.2.

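// Low-pass (approximation) coefficients
for(int n=0;n<N;n+=2) {
    lowResult[n/2] = 0;
    for(int k=0;k<K;k++) {
        lowResult[n/2] += source[n+k-K/2] * lowKernel[k];
    }
}

// High-pass (detail) coefficients
for(int n=0;n<N;n+=2) {
    highResult[n/2] = 0;
    for(int k=0;k<K;k++) {
        highResult[n/2] += source[n+k-K/2+1] * highKernel[k];
    }
}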

Listing 3.2: Forward DWT Pseudocode using two for-loops
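The two-loop pseudocode can be turned into a small, self-contained C function. This is a minimal sketch rather than the thesis implementation: the function name `forward_dwt_two_loops` and the helper `clamp_index` are hypothetical, and clamping out-of-range indices to the nearest valid sample is just one possible boundary rule, chosen here so that the code never reads outside `source`.

```c
#include <assert.h>

/* Assumed boundary rule: clamp an index into [0, N-1]. */
static int clamp_index(int i, int N) {
    return i < 0 ? 0 : (i >= N ? N - 1 : i);
}

/* One forward DWT level in one dimension, using two separate loops.
   lowResult and highResult must each have room for N/2 values. */
void forward_dwt_two_loops(const float *source, int N,
                           const float *lowKernel,
                           const float *highKernel, int K,
                           float *lowResult, float *highResult) {
    assert(N > 0 && N % 2 == 0 && K > 0);

    /* First loop (first render pass): low-pass coefficients. */
    for (int n = 0; n < N; n += 2) {
        float acc = 0.0f;
        for (int k = 0; k < K; k++) {
            acc += source[clamp_index(n + k - K / 2, N)] * lowKernel[k];
        }
        lowResult[n / 2] = acc;
    }

    /* Second loop (second render pass): high-pass coefficients. */
    for (int n = 0; n < N; n += 2) {
        float acc = 0.0f;
        for (int k = 0; k < K; k++) {
            acc += source[clamp_index(n + k - K / 2 + 1, N)] * highKernel[k];
        }
        highResult[n / 2] = acc;
    }
}
```

Each iteration of an outer loop corresponds to one fragment of a quad of size N/2, and each outer loop to one render pass.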

On the CPU this algorithm runs fast, and it is possible to implement it on the GPU using two passes. Drawing a quad of size N/2 results in an equal number of fragment program executions, so the inner loops can be implemented directly as fragment programs. However, this approach has several drawbacks. First, it needs two render passes, which, as discussed above, hinders performance. Secondly, it does not take boundary conditions into account: when n < K/2, the index into the source becomes negative. In the previous implementation of the DWT on the GPU, both of these problems were solved by combining the for-loops and using a lookup table. The essence of their approach can be seen in Listing 3.3.

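for(int n=0;n<N;n++) {
    allResults[n] = 0;
    if(n < N/2) {
        // first half of the output: low-pass coefficients
        for(int k=0;k<K;k++) {
            allResults[n] += source[2*n+k-K/2] * lowKernel[k];
        }
    } else {
        // second half of the output: high-pass coefficients
        for(int k=0;k<K;k++) {
            allResults[n] += source[2*(n-N/2)+k-K/2+1] * highKernel[k];
        }
    }
}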


The if-else statement is written out for clarity; it is not in their code, where a lookup table is used instead. This approach has some performance problems, which we will discuss later, so we have chosen another route. Like the previous implementation, we rewrite Listing 3.2 so that it consists of one for-loop instead of two. Our version is shown in Listing 3.4.

for(int n=0;n<N;n+=2) {
    lowResult[n/2] = 0;
    highResult[n/2] = 0;
    for(int k=0;k<K;k++) {
        lowResult[n/2] += source[n+k-K/2] * lowKernel[k];
        highResult[n/2] += source[n+k-K/2+1] * highKernel[k];
    }
}

Listing 3.4: Forward DWT Pseudocode using one for-loop
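A runnable C sketch of the one-loop form is given here for illustration. As before, this is not the thesis code: the names `forward_dwt_one_loop` and `clamp_to_signal` are hypothetical, and clamping at the signal edges is an assumed boundary rule. The point of the sketch is that both coefficients for an output position are accumulated in the same loop body, which corresponds to a single render pass.

```c
#include <assert.h>

/* Assumed boundary rule: clamp an index into [0, N-1]. */
static int clamp_to_signal(int i, int N) {
    return i < 0 ? 0 : (i >= N ? N - 1 : i);
}

/* One forward DWT level in one dimension, using a single outer loop.
   lowResult and highResult must each have room for N/2 values. */
void forward_dwt_one_loop(const float *source, int N,
                          const float *lowKernel,
                          const float *highKernel, int K,
                          float *lowResult, float *highResult) {
    assert(N > 0 && N % 2 == 0 && K > 0);

    for (int n = 0; n < N; n += 2) {
        float low = 0.0f;
        float high = 0.0f;
        /* Both filters walk the same neighbourhood; the high-pass
           taps are offset by one sample, as in the listing. */
        for (int k = 0; k < K; k++) {
            low  += source[clamp_to_signal(n + k - K / 2, N)] * lowKernel[k];
            high += source[clamp_to_signal(n + k - K / 2 + 1, N)] * highKernel[k];
        }
        lowResult[n / 2] = low;    /* two results per iteration */
        highResult[n / 2] = high;
    }
}
```

Note that each iteration now produces two values, one low-pass and one high-pass coefficient.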

This solves the problem of two render passes, but another problem arises because a fragment shader can normally only output one value (the pi