random number generation for OpenMP - TESTS AND COMPARISONS

TESTS AND COMPARISONS

Algorithm 11 random number generation for OpenMP

________________________________________________________

1: pragma omp parallel 2: srand()

3: pragma omp for

4: for(i → 0 to i → arraySize)

5: dataArray ← rand() % arraySize 7.7.2. New memory management model in CUDA

Another, research question was the usage of new memory model. Although most of the papers in our pool are belonged to post 2010 era, the memory model in CUDA has been just changed, in 2015, with the introduction of CUDA dynamic parallelism.

Not surprisingly, the new model was not present in any of the papers. Therefore, explaining the difference between the new memory model and the old memory techniques in CUDA decided to be part of this thesis.

The old memory model is involved the explicit memory copying between the global and device memory, it has already been discussed in the previous chapters. However, for a sorting algorithm, the actual speed-up was being occurred where the local memory is used in CUDA.

Accordingly, in old memory model it was the case of making as much as operations before deferring the control back to the CPU. In addition, the local memory (i.e.

registers, and not card memory) have been being used for making small but faster kernel calculations. However, the new dynamic parallelism, automatically distributes the available registers among each thread. Which also means, a kernel code that successfully runs on a new GPU, with using local memory, will not work on an older GPU.

An example for the above paragraph is present in the previous section, Section 7.5, where the Nsight profiler shows the register usage, and automatic memory management for the HBquick sort program, which was introduced in the Chapter 6.

Briefly, the new memory model makes coding in CUDA easier. For the reader’s attention the properties of GPU, which is used in this thesis, is present in Appendix A.1.

7.7.3. Scalability issues in shared memory languages

This is actually a very general subject to all shared memory systems, and might be the one of the reasons why there is increasing demand for data-level parallelism instead of shared memory systems. That is why; finding a direct answer for this question seems a little out of the scope of this thesis.

The so-called scalability issue is only present using shared memory systems. In CUDA, kernel code has to be written in such way that every thread executes the kernel code at least once. Otherwise, it is called branch divergence, this is both mentioned in Section 3.7 and Section 7.4.

For OpenMP based sorting algorithms, which are used in this thesis, the appropriate information obtained from measuring wall-clock execution times of these algorithms is present in Section 7.3. The test results show a speed-up proportional to the thread count for our algorithms, however, the CPU used for the tests has only 4 actual threads. This means, a system with more actual threads may reveal opposite results, such as slow-down instead of speed-up when the algorithms are run with more than 4 threads.

75 7.7.4. Testing the outputs

In our pool of papers, another issue was the absence of the mentioning of the word test in its actual meaning. Thus far, nearly all of the papers mentioned to test as,

“testing” the program for desired execution time. For this reason, the actual meaning of test (i.e. testing the output validity) for a sorting algorithm will be examined, in order to figure out the reasons for the absence of the testing phrase from papers.

Actually, to test a parallel function could be quite challenging. The reasons for that involves but not limited to the following:

1. Test Array size in device or system memory.

2. Time needed to test the function.

3. Accurately understand the error codes returned by GPU in runtime.

In most of the literature papers, it is mentioned that a sorting algorithm coded with a parallel language is usually superior to a sequential (i.e. single thread) code in terms of time and memory space [40]. Therefore, it will be very problematic to use the traditional code for testing purposes of the parallel code. Of course, a simple algorithm like the Algorithm 12 is easy to code but it has flaws in itself. These flaws are incompatibility of some C language statements in both CUDA and OpenMP. For example, break clause cannot be used in both parallel languages. On Windows, the code line is redundant in runtime, and on Linux, the code does not compile!

Therefore, if the array size was too long, and if the error occurred in just the beginning, the test function does not terminate at the occurrence of the first error, which obviously means unnecessary computations are made.

_________________________________________________________

Algorithm 12 CUDA or OpenMP test case with intentional redundant code ________________________________________________________

1: for(i → 0 to i → arraySize) 2: if ( array[i] > array [ i+1] )

3: print “Array not sorted on”, i

3: break

4: endif 5: endfor

The answer for this question is simple, although only using dynamic parallelism. The Algorithm 13 shows the test case only for the CUDA based sorting algorithms.

Again break clause cannot be used in CUDA language. Therefore, a simple loop control is constructed using a Boolean variable that is isSorted, and a recursive function testFunc. The only consideration here is the stack size, but the stack size can grow up to device memory size. Moreover, it seems, an industrial sized application is needed to overwhelm it.

_________________________________________________________

In document A STUDY OF PARALLEL SORTING ALGORITHMS. USING CUDA AND OpenMP (Page 90-93)