Comparison of a Multithreaded CPU Realization and a CUDA

3. Applications of Monogenic Wavelet Frames

3.3. High Performance Implementation within the CUDA Architecture

3.3.2. Comparison of a Multithreaded CPU Realization and a CUDA

Although monogenic wavelet frames have been studied for more than a decade, no implementation is known that focuses on fast computation on consumer hardware. The existing ones are proofs of concept, implemented as plugins for ImageJ [41, 40] or in Matlab [80, 78, 37]. In both cases, either the computation time is unacceptable (more than 40 seconds) or a very large amount of RAM (more than 64GB) is needed for an image of 1024x1024 pixels on an Intel Core i5-7600 with four cores at 3.50 GHz and 16 GB of RAM.

For a fast implementation of our own, C/C++ was chosen as the programming language, both to allow a fast migration from CPU to GPU and to achieve platform independence. To use several parallel threads on modern CPUs, the OpenMP framework [14] was used. It provides a platform independent abstraction of multithreading and is supported directly by the compiler.

To obtain a process which can be benchmarked, a decomposition and reconstruction algorithm is considered. This forms the most basic part of every algorithm based on monogenic wavelet frames. Furthermore, several basic assumptions were made in the implementation process:

• The algorithm of choice is the stationary monogenic wavelet transform.

• Input data is a square grey valued image, quantized as 8 bit values.

²⁰ An introducing “Hello World” application in [35] significantly changed into [36] in just five


• The exact size of the input images is $2^N \times 2^N$ with $N = 9, \dots, 12, (13)$.

• The resulting image has to have exactly the same 8 bit values as the input image. Thus, no numerical error may exceed the quantization.

The exact workload is the complete decomposition of the image into its filtered parts. After that, the phase as well as the amplitude are computed on each scale/filter. These values are the basic properties of the whole image and are then used to compute new wavelet coefficients. Afterwards, a reconstruction is performed to recover the original image. This approach represents a typical workflow of monogenic wavelet frames and also provides a prototype from which the software can be extended into further specializations.
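The amplitude and phase step of this workload can be sketched for a single coefficient as follows. This is a minimal illustration which assumes the common definition of the monogenic amplitude and phase from a band-pass value f and its two Riesz components r1 and r2; the names MonogenicFeatures, analyze and synthesize are hypothetical and not taken from the implementation described here.

```cpp
#include <cassert>
#include <cmath>

// Local features of one monogenic coefficient (f, r1, r2), where f is the
// band-pass filtered value and r1, r2 are its two Riesz components.
// Assumption: the common definition of amplitude, phase and orientation.
struct MonogenicFeatures {
    float amplitude;    // sqrt(f^2 + r1^2 + r2^2)
    float phase;        // angle in [0, pi]
    float orientation;  // angle in [-pi, pi]
};

inline MonogenicFeatures analyze(float f, float r1, float r2) {
    const float rieszNorm = std::hypot(r1, r2);
    MonogenicFeatures m;
    m.amplitude = std::sqrt(f * f + rieszNorm * rieszNorm);
    m.phase = std::atan2(rieszNorm, f);
    m.orientation = std::atan2(r2, r1);
    return m;
}

// Inverse step: recover a real coefficient from amplitude and phase.
inline float synthesize(const MonogenicFeatures& m) {
    return m.amplitude * std::cos(m.phase);
}
```

A processing step such as a brightness equalization would modify the amplitude between analyze and synthesize; without any modification, synthesize returns the original value f up to rounding.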

One aspect relevant for optimization, which is a core feature of frames in general, is numerical stability. Therefore single precision data types are used. The experiments have shown that no significant error is produced by the use of 32 bit precision²¹.

Multi Threaded C/C++ Implementation

The most important task of the algorithm is the Fourier transform. Here, the FFTW library provides an efficient framework [24, 25]. A minimal example is shown in Listing 3.2. The library has its own data types, which are binary compatible with the native C++ complex types from the standard library. Therefore, appropriate casts make it possible to use the overloaded operators and thus the complex arithmetic that is already implemented. Moreover, memory allocated via fftwf_malloc(...) is suitably aligned so that SIMD instructions can be used, which form a kind of micro-parallelization. In addition to this concurrency approach, the function fftwf_execute_dft() is thread safe²², which guarantees interoperability with OpenMP.
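The binary compatibility mentioned above can be illustrated without linking against FFTW. The following sketch uses a stand-in type raw_complex (a hypothetical name) with the same float[2] layout that FFTW uses by default for fftwf_complex, lets a C-style routine fill the buffer, and then applies std::complex arithmetic to the same memory. The helper names writeRamp and rampTimesI are illustrative only.

```cpp
#include <cassert>
#include <complex>
#include <vector>

// Stand-in for fftwf_complex (hypothetical name): two floats per element,
// real part first, imaginary part second.
typedef float raw_complex[2];

// Stand-in for a C routine that writes interleaved (re, im) pairs.
inline void writeRamp(raw_complex* out, int n) {
    for (int i = 0; i < n; ++i) {
        out[i][0] = static_cast<float>(i);   // real part
        out[i][1] = static_cast<float>(-i);  // imaginary part
    }
}

// Fill a std::complex buffer through the raw view, then multiply by i.
inline std::vector<std::complex<float>> rampTimesI(int n) {
    std::vector<std::complex<float>> values(n);
    // Same memory, viewed as float[2] elements for the C-style routine.
    writeRamp(reinterpret_cast<raw_complex*>(values.data()), n);
    const std::complex<float> imaginaryUnit(0.0f, 1.0f);
    for (std::complex<float>& v : values) {
        v *= imaginaryUnit;  // overloaded operator of the standard library
    }
    return values;
}
```

Element k is written as k - ki by the raw routine and becomes k + ki after the multiplication, so no copy between the two representations is needed.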

The creation of the filter bank is the next computationally intensive task of the algorithm and is shown in Listing 3.3. To gain speed when the filter bank is used more than once, e.g. in batch processing of images or videos, all filters are kept in RAM. On the one hand, this consumes a large amount of memory; on the other hand, the filter bank can be applied quickly in a single loop, which can also be executed in parallel.

After the application of the filters and the Riesz transform, all parts need to be inverse Fourier transformed. The resulting frame coefficients are the raw material for several different algorithms, such as the equalization of brightness. These produce new real frame coefficients, which need to be composed again to obtain a complete signal. Distributing the work over the areas of the image instead of over the scales makes it possible to parallelize this step completely without communication (see Listing 3.4). After this composition, the result only has to be inverse Fourier transformed and rescaled. Furthermore, reencoding is necessary to spread the values from the interval [0, 1) to the interval [0, 256), which can then be converted to unsigned 8 bit values through the rounding functions of the standard math library.
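The reencoding step, together with the requirement that the 8 bit values are reproduced exactly, can be sketched as follows. The mapping value / 256 and the function names are assumptions made for illustration; the clamping is added here only to keep the sketch safe against numerical over- and undershoot.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Map an 8 bit grey value to the interval [0, 1) (assumed mapping).
inline float decodePixel(std::uint8_t value) {
    return static_cast<float>(value) / 256.0f;
}

// Map a value from [0, 1) back to 8 bit, rounding to the nearest level.
inline std::uint8_t encodePixel(float value) {
    float scaled = std::round(value * 256.0f);
    if (scaled < 0.0f) scaled = 0.0f;      // clamp numerical undershoot
    if (scaled > 255.0f) scaled = 255.0f;  // clamp numerical overshoot
    return static_cast<std::uint8_t>(scaled);
}

// True if every grey value survives decoding, an additive error and encoding.
inline bool roundTripIsExact(float error) {
    for (int v = 0; v < 256; ++v) {
        const std::uint8_t in = static_cast<std::uint8_t>(v);
        if (encodePixel(decodePixel(in) + error) != in) return false;
    }
    return true;
}
```

Under this mapping, an error below half a quantization step (0.5 / 256) in the [0, 1) representation does not change any 8 bit value, which is the sense in which an error counts as not significant.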

²¹ In this case, significant is meant in the sense that an error occurs which changes the 8 bit values.

²² While the execution of an FFTW plan is thread safe, the creation of one is not. Therefore


    #include <complex>
    #include <fftw3.h>
    // ...

    int main(int argc, char* argv[])
    {
        fftwf_complex *image, *fourierImage;
        fftwf_plan p_forward, p_inverse;
        std::complex<float> *imageComplex, *fourierImageComplex;

        // from input image:
        int imageWidth = 512, imageHeight = 512;
        int imagePixels = imageWidth * imageHeight;
        float fftwBackScalingFactor = 1.0f / imagePixels;
        // single precision allocator, matching fftwf_free below
        image = (fftwf_complex*) fftwf_malloc(
            sizeof(fftwf_complex) * imagePixels);
        fourierImage = (fftwf_complex*) fftwf_malloc(
            sizeof(fftwf_complex) * imagePixels);
        // row-major layout: number of rows first (identical for square images)
        p_forward = fftwf_plan_dft_2d(imageHeight, imageWidth,
            image, fourierImage, FFTW_FORWARD, FFTW_ESTIMATE);
        p_inverse = fftwf_plan_dft_2d(imageHeight, imageWidth,
            fourierImage, image, FFTW_BACKWARD, FFTW_ESTIMATE);
        // views on the same memory with overloaded complex arithmetic
        imageComplex =
            reinterpret_cast<std::complex<float>*>(image);
        fourierImageComplex =
            reinterpret_cast<std::complex<float>*>(fourierImage);

        // read input, save as values in interval [0,1) in image

        fftwf_execute_dft(p_forward, image, fourierImage);
        // ... calculations on the frequency spectrum
        fftwf_execute_dft(p_inverse, fourierImage, image);
        // ... rescale pixels with fftwBackScalingFactor

        fftwf_destroy_plan(p_forward);
        fftwf_destroy_plan(p_inverse);
        fftwf_free(image);
        fftwf_free(fourierImage);

        return 0;
    }

Listing 3.2: Minimal example of the FFTW workflow. Here, the modern array execution syntax via fftwf_execute_dft(...) is used. It has the advantage that an FFTW plan can be reused with different input data, as long as the data has the same size and data types. Furthermore, this listing shows how to use the native std::complex interface from <complex> via reinterpret_cast<...>(...), which is possible because the data types provided by FFTW are binary compatible.
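The rescaling with fftwBackScalingFactor in the listing above is needed because FFTW computes unnormalized transforms, so a forward transform followed by a backward transform multiplies the data by the number of elements. The following sketch demonstrates this with a naive one dimensional DFT instead of FFTW; the names Signal, dft and roundTripError are hypothetical and the code is meant as an illustration, not as a fast transform.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <complex>
#include <vector>

typedef std::vector<std::complex<float>> Signal;

// Naive unnormalized DFT; sign = -1 is forward, sign = +1 is backward.
inline Signal dft(const Signal& in, int sign) {
    const std::size_t n = in.size();
    const double twoPi = 6.283185307179586;
    Signal out(n);
    for (std::size_t k = 0; k < n; ++k) {
        std::complex<double> sum(0.0, 0.0);
        for (std::size_t j = 0; j < n; ++j) {
            const double angle = sign * twoPi * double(k * j % n) / double(n);
            sum += std::complex<double>(in[j]) * std::polar(1.0, angle);
        }
        out[k] = std::complex<float>(sum);
    }
    return out;
}

// Largest deviation after forward transform, backward transform, rescaling.
inline float roundTripError(const Signal& in) {
    const Signal back = dft(dft(in, -1), +1);
    const float backScalingFactor = 1.0f / static_cast<float>(in.size());
    float worst = 0.0f;
    for (std::size_t i = 0; i < in.size(); ++i) {
        worst = std::max(worst, std::abs(back[i] * backScalingFactor - in[i]));
    }
    return worst;
}
```

For a two dimensional image the same argument applies with the number of pixels as the factor, which matches the value 1.0f / imagePixels used in the listing.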

    #pragma omp parallel
    {
        int numthreads = omp_get_num_threads();
        int threadID = omp_get_thread_num();
        int imPixels = width * height;
        // contiguous index range [low, high) handled by this thread
        int low = imPixels * threadID / numthreads;
        int high = imPixels * (threadID + 1) / numthreads;
        for (size_t scale = 0; scale < numFilters; ++scale) {
            // scaling factor for the actual scale
            // (note: 2 << k evaluates to 2^(k+1); 1 << k yields 2^k)
            float scalingFactor = 2 << scales[scale];
            for (size_t i = low; i < high; ++i) {
                float scaledPosition = scalingFactor * radius[i];
                if (scale == 0) { // highpass filter
                    if (scaledPosition >= 0.25f) (*filterBank)[i] = 1.0f;
                    else (*filterBank)[i] = held(scaledPosition);
                }
                else if (scale == numFilters-1) { // low pass filter
                    if (scaledPosition <= 0.25f)
                        (*filterBank)[i + scale * imPixels] = 1.0f;
                    else
                        (*filterBank)[i + scale * imPixels] =
                            held(scaledPosition);
                }
                else { // bandpass filters
                    (*filterBank)[i + scale * imPixels] =
                        held(scaledPosition);
                }
            }
        }
    }

Listing 3.3: Workflow of the creation of the complete filter bank in use. For the stationary monogenic filter bank, the filter parameter set named scales is used according to Table 3.1. An efficient way to realize the operation $2^{K[\mathrm{scale}]}$ is bit shifting. The result is one large data field containing all filters. Within each scale, the computations are distributed among the threads.
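The remark on shifting can be checked with a very small sketch. The function names are hypothetical; the point is only how the two shift expressions relate to powers of two, so that the exponent used in the listing can be compared with the intended one.

```cpp
#include <cassert>

// 2^k for 0 <= k < 31, realized as a left shift of the value 1.
inline unsigned int powerOfTwo(unsigned int k) {
    return 1u << k;
}

// The expression 2 << k shifts the value 2 instead and yields 2^(k+1).
inline unsigned int twoShifted(unsigned int k) {
    return 2u << k;
}
```

Whether the additional factor of two is intended depends on how the entries of the parameter set in Table 3.1 are defined, which is not visible in this section.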

    #pragma omp parallel
    {
        int numthreads = omp_get_num_threads();
        int threadID = omp_get_thread_num();
        // contiguous index range [low, high) handled by this thread
        int low = imagePixels * threadID / numthreads;
        int high = imagePixels * (threadID + 1) / numthreads;
        for (size_t scale = 0; scale < numFilters; ++scale) {
            for (size_t idx = low; idx < high; ++idx) {
                // accumulate the filtered contribution of this scale
                resultImageComplex[idx] +=
                    filteredSpectrumComplex[idx + scale * imagePixels]
                    * filterbank[idx + scale * imagePixels];
            }
        }
    }

Listing 3.4: Synthesis of the image from the real coefficients. Because each thread sums over all scales for its own area of the image, no communication is needed.
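The index partition used in Listings 3.3 and 3.4 can be examined in isolation. The sketch below computes the half-open range of each thread and checks that all ranges tile the image without gaps or overlaps, which is why no communication is needed. The function names are hypothetical, and 64 bit integers are used here as a precaution that is not part of the listings.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

// Half-open index range [low, high) handled by one thread.
inline std::pair<std::int64_t, std::int64_t>
threadRange(std::int64_t numPixels, int threadID, int numThreads) {
    const std::int64_t low = numPixels * threadID / numThreads;
    const std::int64_t high = numPixels * (threadID + 1) / numThreads;
    return std::make_pair(low, high);
}

// True if the ranges of all threads tile [0, numPixels) without gap or overlap.
inline bool rangesTile(std::int64_t numPixels, int numThreads) {
    std::int64_t expectedLow = 0;
    for (int t = 0; t < numThreads; ++t) {
        const std::pair<std::int64_t, std::int64_t> r =
            threadRange(numPixels, t, numThreads);
        if (r.first != expectedLow || r.second < r.first) return false;
        expectedLow = r.second;
    }
    return expectedLow == numPixels;
}
```

With 32 bit int arithmetic, as in the listings, the product of the pixel count and the thread index would exceed the int range for an 8192 × 8192 image once 32 or more threads are used, so a wider type is the safer choice on machines with many cores.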

To optimize the implementation, it was analyzed for performance issues and memory leaks with tools like ompP and valgrind. Exact benchmarks are presented in Section 3.3.3.

Implementation on CUDA Enabled Devices

Before the details of the implementation on CUDA are described, some common techniques are shown in Listing 3.5, which are used in most programs written in this framework. Two functions are introduced which report errors when they occur. The reason is that the graphics card does not report errors on its own if no such functionality is provided. In addition to these helpers, several operators are overloaded to provide cleaner and more understandable code when dealing with complex numbers.

As already mentioned, implementation details often differ between classic CPU approaches and realizations on CUDA enabled devices. First of all, the amount of data which has to be moved to the memory of the graphics card needs to be as low as possible because of the high latency. Therefore it is natural to copy the image as 8 bit values and to apply rescaling and reencoding on the graphics card. In addition to the image, several symbols, e.g. the image dimensions as well as the configuration parameters for the frame generation, also need to be copied to the device. Here, optimization can be accomplished by using two different streams, streamMain and streamSupport. In this way, asynchronous copy operations can be started and run concurrently with the first preprocessing computations, like the generation of the radius mesh and the Riesz Fourier kernels. A complete precomputation of all filter values on all scales is not useful in the case of CUDA code. The reason is the limited amount of memory²³ which is provided by consumer graphics devices of NVIDIA in contrast to the amount

²³ In the case of the device used for the later tests, i.e. an NVIDIA GTX 980 Ti, just 6GB of global memory is provided. For an image of 1024 × 1024 pixels, already more than 1GB is used just to store filters and transform matrices. For bigger images, this amount grows faster than quadratically.

    #include <cassert>
    #include <cuda_runtime.h>
    #include <device_launch_parameters.h>
    #include <cufft.h>
    #include <cuComplex.h>

    // abort if a CUDA runtime call did not succeed
    inline void checkCuda(cudaError_t result)
    {
        if (result != cudaSuccess)
        {
            // ...
            assert(result == cudaSuccess);
        }
    }

    // abort if a cuFFT call did not succeed
    inline void checkCufft(cufftResult result)
    {
        if (result != CUFFT_SUCCESS)
        {
            // ...
            assert(result == CUFFT_SUCCESS);
        }
    }

    // ...
    inline __device__ __host__ cufftComplex operator*(
        cufftComplex const& a, cufftComplex const& b) {
        return cuCmulf(a, b);
    }
    inline __device__ __host__ cufftComplex operator*(
        cufftComplex const& a, float const& b) {
        return cuCmulf(a, make_cuFloatComplex(b, 0.0f));
    }
    inline __device__ __host__ cufftComplex& operator*=(
        cufftComplex& a, float b) {
        return a = cuCmulf(a, make_cuFloatComplex(b, 0.0f));
    }
    // ...

Listing 3.5: Additional functions and helpers which are commonly used in signal processing implementations on CUDA enabled devices. Here, the error checking functions are defined and operators for calculations with cuFFT data types are overloaded.
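What the overloaded operators compute can be shown on the host without the CUDA toolkit. The following sketch uses a stand-in struct Complex2f (a hypothetical name) with the same two float members x and y as cufftComplex and spells out the arithmetic behind the complex product. It is an illustration of the operator pattern, not a replacement for cuCmulf.

```cpp
#include <cassert>

// Hypothetical stand-in for cufftComplex: x is the real, y the imaginary part.
struct Complex2f {
    float x;
    float y;
};

// Complex product (a.x + i a.y) * (b.x + i b.y).
inline Complex2f operator*(const Complex2f& a, const Complex2f& b) {
    return Complex2f{a.x * b.x - a.y * b.y, a.x * b.y + a.y * b.x};
}

// Scaling by a real factor does not need the full complex product.
inline Complex2f operator*(const Complex2f& a, float b) {
    return Complex2f{a.x * b, a.y * b};
}

// Compound assignment returns a reference to the modified operand.
inline Complex2f& operator*=(Complex2f& a, float b) {
    a.x *= b;
    a.y *= b;
    return a;
}
```

With such operators, a filter application reads like spectrum[idx] * filter[idx] instead of a nested function call, which is the readability gain the listing aims at.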


    __global__ void fftShift(cufftComplex* data) {
        // offsets to the diagonally opposite quadrant
        int xshift = (d_imPixels + d_imWidth) / 2;
        // d_imHeight equals d_imWidth for the square images assumed here
        int yshift = (d_imPixels - d_imHeight) / 2;
        cufftComplex temp;
        // grid-stride loops over the upper half of the image
        for (int yIdx = blockIdx.y * blockDim.y + threadIdx.y;
             yIdx < d_imHeight / 2;
             yIdx += blockDim.y * gridDim.y) {
            for (int xIdx = blockIdx.x * blockDim.x + threadIdx.x;
                 xIdx < d_imWidth;
                 xIdx += blockDim.x * gridDim.x) {
                int idx = yIdx * d_imWidth + xIdx;
                if (xIdx < d_imWidth / 2) {
                    temp = data[idx];
                    data[idx] = data[idx + xshift];
                    data[idx + xshift] = temp;
                } else {
                    temp = data[idx];
                    data[idx] = data[idx + yshift];
                    data[idx + yshift] = temp;
                }
            }
        }
    }

    int main(int argc, char* argv[]) {
        cufftHandle fftPlan;
        cufftComplex* d_data;
        // ...
        cudaMalloc(&d_data, imPixels * sizeof(cufftComplex));
        cudaMemcpyAsync(..., cudaMemcpyHostToDevice, streamMain);

        // slowest varying dimension first (identical for square images)
        cufftPlan2d(&fftPlan, imHeight, imWidth, CUFFT_C2C);
        cufftSetStream(fftPlan, streamMain);
        cufftExecC2C(fftPlan, d_data, d_data, CUFFT_FORWARD);

        dim3 dimGrid(8, 8, 1);    // number of blocks in the grid
        dim3 dimBlock(32, 32, 1); // number of threads in a block
        fftShift<<<dimGrid, dimBlock, 0, streamMain>>>(d_data);
        // ...
        cudaFree(d_data);
        // ...
    }

Listing 3.6: Exemplary CUDA code to give an idea of how the kernel execution syntax works and how an FFT is performed. While the main function is only illustrative, the fftShift kernel is the same as in the production code.
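The effect of the fftShift kernel can be cross-checked against a serial host version. The sketch below is such a reference under the assumption of a row-major image with even width and height; the function name fftShiftReference is hypothetical. It performs the same pairwise quadrant swap, written with explicit partner coordinates instead of the precomputed offsets xshift and yshift.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// In-place quadrant swap of a row-major image with even width and height.
// Only the upper half is traversed, so every pair is swapped exactly once.
inline void fftShiftReference(std::vector<float>& data, int width, int height) {
    const int halfW = width / 2;
    const int halfH = height / 2;
    for (int y = 0; y < halfH; ++y) {
        for (int x = 0; x < width; ++x) {
            // Partner pixel: shifted by half the size in both directions.
            const int partnerX = (x + halfW) % width;
            const int partnerY = y + halfH;
            std::swap(data[y * width + x], data[partnerY * width + partnerX]);
        }
    }
}
```

Applying the function twice restores the original image, which gives a simple test for both the reference and a device implementation.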


    // ...

    int main(int argc, char const* argv[]) {
        // ...
        cudaEventRecord(startEvent, streamMain);

        cufftExecC2C(..., CUFFT_FORWARD);
        fftShift<<<...>>>(...);

        for (int idx = 0; idx < numFilters; idx++) {
            mainKernelAnalysis<<<...>>>(...);
            fftShift<<<...>>>(idx, ...);
            cufftExecC2C(..., CUFFT_INVERSE);
            mainKernelEob<<<...>>>(...);
            cufftExecC2C(..., CUFFT_FORWARD);
            fftShift<<<...>>>(...);
            mainKernelSynthesis<<<...>>>(idx, ...);
        }
        fftShift<<<...>>>(d_spectrumResult);
        cufftExecC2C(..., CUFFT_INVERSE);

        cudaEventRecord(stopEvent, streamMain);
        cudaEventSynchronize(stopEvent);
        cudaEventElapsedTime(
            &elapsedMiliseconds, startEvent, stopEvent);
        // ...
    }

Listing 3.7: The main execution sequence of the kernels responsible for the main algorithm. The events which measure the execution time of the whole procedure are also included. The function cudaEventSynchronize waits for an event to complete. This is necessary because control returns to the main program immediately after a kernel is launched.


needed. Moreover, even if devices with a substantially larger amount of memory were available, it is more typical to save memory at the expense of computational workload than the other way round.
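The memory argument can be made concrete with a rough estimate. The sketch below computes the size of one single precision complex array for a $2^N \times 2^N$ image and how many such arrays fit into a given amount of device memory. The function names are hypothetical, the number of arrays actually held by the implementation is not stated here, and reading 6GB as 6 · 2^30 bytes is an assumption.

```cpp
#include <cassert>
#include <cstdint>

// Bytes of one single precision complex array for a 2^N x 2^N image.
inline std::uint64_t complexArrayBytes(unsigned int N) {
    const std::uint64_t side = std::uint64_t(1) << N;
    return side * side * 2u * sizeof(float);  // real and imaginary part
}

// Number of such arrays that fit into a given amount of device memory.
inline std::uint64_t arraysThatFit(std::uint64_t memoryBytes, unsigned int N) {
    return memoryBytes / complexArrayBytes(N);
}
```

For N = 10 one array takes 8 MiB, for N = 13 already 512 MiB, so at most twelve complex arrays of the largest image size fit into 6 · 2^30 bytes, before any other buffers are counted.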

The exact procedure to run CUDA code is shown by way of example in Listing 3.6. It shows how to perform a Fourier transform with the cuFFT library provided by NVIDIA, followed by an additional shift operation. After the memory has been allocated and an FFT plan has been set up to work on two dimensional data of dimensions imWidth and imHeight as a complex to complex operation²⁴ on single precision data (CUFFT_C2C), this plan can be executed on a specific stream. Lastly, the fftShift function is executed with the kernel execution syntax. This syntax allows the number of blocks and threads to be specified in one to three dimensions. Here, the limits of the hardware need to be respected (see Listing 3.1) to avoid errors. The third parameter is the amount of shared memory to be reserved for this kernel per block. Finally, the associated stream is given as a parameter. Modern CUDA programs do not use the null stream, which has some specific properties to guarantee compatibility with legacy code. The CUDA kernel²⁵ itself has a special specifier. There are three specifiers to consider: __host__, __global__ and __device__. Global functions run on the device but are launched from the host and thus form the main entry points.

In Listing 3.7 the implementation of the complete algorithm is introduced. The time is measured through the event API provided by the CUDA toolkit. The kernels themselves are implemented with grid-stride loops (also used in the kernel of Listing 3.6). In this way, their results do not depend on the exact configuration of the grid and block dimensions. These loops also make it possible to debug applications in a rudimentary way without having two GPUs²⁶. Except for the event synchronization, no communication needs to be implemented by hand at any point of the algorithm (the Fourier transform has the communication it needs built in). The main advantage of this is fast computation: communication causes bottlenecks in most cases and thus slows down the whole algorithm.
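The independence from the launch configuration can be demonstrated with a serial emulation of a one dimensional grid-stride loop. In the sketch below, the loops over blockIdx and threadIdx stand in for the threads that would run concurrently on the device; the function name scaleWithGrid is hypothetical and the code runs entirely on the host.

```cpp
#include <cassert>
#include <vector>

// Serial emulation of a one dimensional grid-stride loop: every emulated
// thread starts at its global index and advances by the total thread count.
inline std::vector<int> scaleWithGrid(std::vector<int> data, int factor,
                                      int gridDim, int blockDim) {
    const int n = static_cast<int>(data.size());
    for (int blockIdx = 0; blockIdx < gridDim; ++blockIdx) {
        for (int threadIdx = 0; threadIdx < blockDim; ++threadIdx) {
            for (int i = blockIdx * blockDim + threadIdx; i < n;
                 i += blockDim * gridDim) {
                data[i] *= factor;  // each element is visited exactly once
            }
        }
    }
    return data;
}
```

A configuration with a single block and a single thread degenerates into an ordinary serial loop over all elements, which is what makes this pattern convenient for simple debugging.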

To identify performance and memory issues, the whole program was checked with cuda-memcheck, a tool to identify memory leaks. The performance was also checked with the profiler tools of NVIDIA, and the distribution of the workload over the different streams was optimized. Further aspects of the capabilities of this implementation on CUDA devices are shown in the next section.

In document Applications of Riesz Transforms and Monogenic Wavelet Frames in Imaging and Image Processing (Page 88-96)