CONCLUSIONS AND FUTURE WORK - A STUDY OF PARALLEL SORTING ALGORITHMS USING CUDA AND OpenMP

In this thesis, parallel sorting is studied using two parallel programming languages that are common in the literature, CUDA and OpenMP. The choice of these two languages was deliberate: a review of the related papers showed that most of their effort is spent comparing parallel versions of algorithms with their single-threaded counterparts. However, speed-up and coding effort should be compared not only against the sequential languages but also among the parallel languages themselves. Therefore, this thesis compares the parallel languages against each other and against their sequential versions.

The first five chapters give background information about the parallel languages. In Chapter 2, a small systematic literature review (SLR) is carried out on a pool of around seventy papers. The SLR showed that information about testing the programs, and about metrics other than speed-up, is missing completely from those papers. Therefore, in Chapter 7, this missing information is provided using the data available from this thesis. Providing these data turned out to be very easy. It is thus concluded that information about the initialization and testing code of parallel algorithms is missing simply because the authors chose to omit it.

In Chapter 6, the algorithms that are suitable for comparison are introduced, and in the next chapter the timing results are obtained by comparing the algorithms. The algorithms are then examined for the reasons that might cause slowdowns. It was found that in CUDA there is a hard limit on the number of threads a GPU can execute concurrently, even when these threads are grouped in warps. Therefore, a future study can re-evaluate the timing results found in this thesis when more advanced hardware becomes available on the market.

The timing comparisons also show a large difference between the CUDA and OpenMP results, with the same algorithms running faster in OpenMP. This result can be explained by the higher clock frequencies of the CPU compared to the GPU and by the fact that CPUs are designed for low latency. This is consistent with what CUDA claims: its developers never state that CUDA can beat a CPU when the comparison is made on latency.
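As a rough illustration of the hard limit on concurrently executing threads mentioned above, the following C++ sketch estimates how many successive batches a launch needs once the resident-thread capacity of a device is exceeded. The struct `GpuLimits`, both function names and any numbers passed in are hypothetical and are not taken from the thesis. The model is also a simplification: it ignores register, shared-memory and block-size limits, and real hardware replaces finished blocks continuously rather than in strict batches.

```cpp
#include <cstdint>

// Hypothetical description of a GPU. Field names and any values supplied by
// the caller are illustrative assumptions, not measurements from the thesis.
struct GpuLimits {
    std::uint64_t sm_count;            // number of streaming multiprocessors
    std::uint64_t max_threads_per_sm;  // resident-thread limit of one SM
};

// Upper bound on the number of threads that can be resident at once.
std::uint64_t max_resident_threads(const GpuLimits& gpu) {
    return gpu.sm_count * gpu.max_threads_per_sm;
}

// Number of successive batches needed to run total_threads threads when at
// most max_resident_threads(gpu) can be resident at a time (ceiling division).
std::uint64_t batches_needed(const GpuLimits& gpu, std::uint64_t total_threads) {
    const std::uint64_t capacity = max_resident_threads(gpu);
    if (capacity == 0) {
        return 0;  // degenerate description: nothing can run
    }
    return (total_threads + capacity - 1) / capacity;
}
```

Under this simplified model, launching more threads than the device can hold does not add concurrency; the extra threads wait for earlier ones to finish, which is one reason larger inputs may not scale as the thread count suggests.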

What CUDA does claim is much higher throughput than a CPU, sustained over a very long time. In addition, if one considers that a GPU has much lower power consumption per unit of computation, it should be more favorable to build a cluster of GPUs than a cluster of CPUs; only then can the two be compared fairly. The results obtained in this thesis also show that the OpenMP-based algorithms work as intended because of their close relation to the C language. CUDA, however, is much harder both to write and to debug, largely because it was introduced only recently.
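The throughput and power-consumption argument above can be made concrete with a few ratios. The following C++ sketch is only an illustration: the struct `RunMeasurement` and the function names are hypothetical, the power figure is assumed to be measured externally, and no values from the thesis are used.

```cpp
#include <cstddef>

// One measured run. All values are supplied by the caller; nothing here is
// data from the thesis.
struct RunMeasurement {
    std::size_t elements;  // number of keys sorted
    double seconds;        // wall-clock time of the run
    double average_watts;  // average power draw, assumed measured externally
};

// Speed-up of a parallel run over a baseline run (ratio of wall-clock times).
double speed_up(const RunMeasurement& baseline, const RunMeasurement& parallel) {
    return baseline.seconds / parallel.seconds;
}

// Throughput in elements sorted per second.
double throughput(const RunMeasurement& run) {
    return static_cast<double>(run.elements) / run.seconds;
}

// Energy efficiency in elements sorted per joule (1 W = 1 J/s).
double elements_per_joule(const RunMeasurement& run) {
    return static_cast<double>(run.elements) / (run.seconds * run.average_watts);
}
```

Keeping the three quantities separate makes it visible that a run can win on wall-clock time while losing on elements per joule, or the reverse, which is why speed-up alone is an incomplete basis for comparison.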

This study revealed that data-level parallelism has a promising future even for algorithms that involve many data-dependent memory I/O operations.

Although the timing results in Chapter 6 reveal that the OpenMP-based algorithms perform significantly better, that conclusion holds only if the comparison is based solely on speed-up in wall-clock time. This thesis also reports other metrics for assessing parallel languages against each other: memory efficiency, throughput and computations per watt. On these metrics, data-level parallelism shows a significant benefit over task-level parallelism.

In addition, current CPUs have multiple identical cores on the same chip, which gives them a head start when the computation involves many operations whose latency is almost impossible to hide (e.g. a sorting algorithm in which a computation uses the output of the previous computation). Future improvements to GPU hardware can therefore follow the same approach: fewer cores (SMs, in GPU terms), each more heavyweight in terms of computational capability. Indeed, in the newer architectures of Nvidia devices the SM count decreases while the number of SPs (streaming processors) per SM increases. This suggests that in the future data-level parallelism will give much better results than task-level parallelism, even for data-dependent computations.

Finally, Section 7.7.4.1 shows that different test inputs could change the output of the same function significantly, and that most of these outputs were erroneous. Therefore, a thin wrapper for unit testing CUDA code is a necessity. Future work to make this happen could positively affect coding in CUDA.
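A minimal host-side sketch of what such a thin test wrapper could look like is given below in C++. It is not the wrapper proposed as future work, only an illustration under stated assumptions: the names `SortFn`, `make_test_inputs` and `failing_inputs` are hypothetical, the chosen input shapes are arbitrary, and the sort under test is assumed to be exposed as an ordinary host function (for CUDA code, one that copies data to the device, launches the kernels and copies the result back).

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <vector>

// Signature of the sort under test (hypothetical). For CUDA code this would
// be a host-side function that hides the device copies and kernel launches.
using SortFn = std::function<void(std::vector<int>&)>;

// A few deterministic input shapes; the selection is an assumption made for
// illustration, not the test set used in the thesis.
std::vector<std::vector<int>> make_test_inputs() {
    std::vector<int> ascending;
    std::vector<int> descending;
    for (int i = 0; i < 1000; ++i) {
        ascending.push_back(i);
        descending.push_back(999 - i);
    }
    return {
        {},                           // 0: empty input
        {42},                         // 1: single element
        std::vector<int>(257, 7),     // 2: all keys equal, non-power-of-two size
        ascending,                    // 3: already sorted
        descending,                   // 4: reverse sorted
        {3, -1, 3, 0, -1, 2, 2, 5},   // 5: duplicates and negative keys
    };
}

// Runs sort_under_test on every input and compares the result with std::sort.
// Returns the indices of the inputs for which the two outputs differ.
std::vector<std::size_t> failing_inputs(const SortFn& sort_under_test) {
    std::vector<std::size_t> failures;
    const std::vector<std::vector<int>> inputs = make_test_inputs();
    for (std::size_t i = 0; i < inputs.size(); ++i) {
        std::vector<int> expected = inputs[i];
        std::vector<int> actual = inputs[i];
        std::sort(expected.begin(), expected.end());
        sort_under_test(actual);
        if (actual != expected) {
            failures.push_back(i);
        }
    }
    return failures;
}
```

Note that a function that does nothing at all passes four of these six inputs, because they are already in order; this is the kind of input dependence that makes several input shapes necessary.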

