Effective Sorting Using Parallel Computing
Prof. Sangita Lade
Computer Engineering Department, Vishwakarma Institute of Technology
Pune, India
Krushna Patil
Computer Engineering Department, Vishwakarma Institute of Technology
Pune, India
Nikhil Chhadikar
Computer Engineering Department, Vishwakarma Institute of Technology
Pune, India
Apeksha Chaudhari
Computer Engineering Department, Vishwakarma Institute of Technology
Pune, India
Abstract— Various sorting algorithms are available in the literature, such as merge sort, bucket sort and heap sort. The fundamental issues with all of these algorithms are their time and space complexities for large amounts of data. In this paper we provide an optimized counting sort algorithm for sorting billions of integers using serial and parallel processing. Comparisons are made between counting sort, bucket sort and merge sort. Counting sort and merge sort are implemented on the CPU and in parallel on a graphics processing unit (GPU). It is observed that our optimized counting sort takes only 6 milliseconds to sort 100 million integers and is 23 times faster than the previously implemented counting sort algorithm.
Keywords—counting sort, merge sort, parallel processing, GPU
I. INTRODUCTION
Nowadays, computers have multi-core specifications. This has increased the degree of multiprogramming and ultimately the CPU utilization [1]. Earlier, more than half of computer time was spent on sorting, but with these advancements this is no longer a major concern [2]. Sorting plays a very important role in operations such as searching, frequency distribution and convex hull computation. Many sorting algorithms, such as bubble sort, quick sort and merge sort, ease the sorting process [3]. However, when sorting methods are applied to large data sizes, computation time and space still remain an issue. Computation time grows with the space required, and for data sizes in the millions the machine can freeze. The problems usually faced are:
1. Time required to sort millions of numbers using serial computing is quite large (0.5 sec).
2. Space complexity increases rapidly.
3. For a data size of 10^9, the machine freezes up.
4. The entire computation process slows down.
With parallel computing methods, problems which do not fit on a single machine can be solved, and operations on very large data can be performed in a reasonable amount of time.
In this paper, a comparative study is carried out by executing various sorting algorithms developed for sorting billions of integers. The algorithms are run on different machines using serial and parallel processing techniques to find the most efficient one. The aim is to sort billions of integers in as little time as possible while also limiting the space required for computation, since the data size is on the order of 10^8.
II. SORTING METHODS
A. Serial Algorithms
Sorting of data is one of the fundamental computing processes. Various sorting algorithms are available, such as merge sort, quick sort, insertion sort, list sort and bucket sort. Each has its own advantages and disadvantages. In a comparative study on the basis of complexity, bucket sort and counting sort were found appropriate for this work. In this research, counting sort and bucket sort are used with optimization, and their time and space complexities are analyzed. The algorithms developed are executed on different machines, and the data generation time, sorting time and total time are calculated for data of up to 10^8 integers. The algorithms are compared after execution on machines with different specifications, and observations are made.
1. Counting Sort
2. The count array is modified so that the element at each index stores the sum of the previous counts. The modified count array indicates the position of each object in the output sequence.
3. Each object from the input sequence is output, after which its count is decreased by 1.
The pseudo code for Counting Sort is [4]:
1. int key(int x) // function for extracting the key value from each input number, here the units place
2.     return (x % 10)
3. for i ← 0 to n-1
4.     x ← in[i]
5.     count[key(x)] ← count[key(x)] + 1
6. nc ← 0
7. for i ← 0 to k-1
8.     oc ← count[i]
9.     count[i] ← nc
10.     nc ← nc + oc
11. for i ← 0 to n-1
12.     x ← in[i]
13.     out[count[key(x)]] ← x
14.     count[key(x)] ← count[key(x)] + 1
Asymptotic analysis of counting sort:
In this algorithm, the initialization of the count array and the loop which performs a prefix sum on the count array take O(k) time. The other two loops, which count the keys and fill the output array, take O(n) time. Therefore, the total time complexity of the algorithm is O(n+k).
Worst Case Time complexity: O(n+k)
Average Case Time complexity: O(n+k)
Best Case Time complexity: O(n+k)
Space Complexity: 2n
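The pseudo code above can be sketched in C++ as follows. This is a minimal illustration rather than the implementation measured in this paper: the names `counting_sort_by_key` and `key_of_units_digit` are hypothetical, and the inputs are assumed to be non-negative so that `x % 10` is a valid index.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical key function mirroring the pseudo code: the units digit.
// Assumes x >= 0 so that the result is a valid index in [0, 10).
inline int key_of_units_digit(int x) { return x % 10; }

// Stable counting sort of `in` by key(x), where every key lies in [0, k).
std::vector<int> counting_sort_by_key(const std::vector<int>& in, int k,
                                      int (*key)(int)) {
    std::vector<std::size_t> count(static_cast<std::size_t>(k), 0);
    for (int x : in) ++count[key(x)];            // histogram of keys, O(n)

    std::size_t next = 0;                         // exclusive prefix sum, O(k)
    for (int i = 0; i < k; ++i) {
        std::size_t occurrences = count[i];
        count[i] = next;                          // first output slot for key i
        next += occurrences;
    }

    std::vector<int> out(in.size());
    for (int x : in) out[count[key(x)]++] = x;    // placement, O(n)
    return out;
}
```

Because equal keys are written in input order, the sort is stable; the input and output arrays together account for the 2n space term, in addition to the k counters.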
2. Bucket Sort
Bucket sort is mainly useful when the input is uniformly distributed over a range. It can be used to sort a large set of floating point numbers which lie in the range from 0.0 to 1.0 and are uniformly distributed across that range. A simple alternative is to apply a comparison-based sorting algorithm. The lower bound for comparison-based sorting algorithms (merge sort, heap sort, quick sort, etc.) is Ω(n log n), i.e., they cannot do better than n log n. Counting sort cannot be applied here because it uses keys as indices, and here the keys are floating point numbers. The bucket sort algorithm is as follows:
1. Create n empty buckets or lists.
2. Do the following for every array element arr[i]: insert arr[i] into bucket[n*arr[i]].
3. Sort individual buckets using insertion sort.
4. Concatenate all sorted buckets.
Pseudo code of Bucket Sort is as follows [5]:
1. begin
2. B[0..n-1]
3. n = A.length
4. For i = 0 to n-1
5.     B[i] = NULL
6. For i = 1 to n
7.     Insert A[i] into list B[⌊n·A[i]⌋]
8. For i = 0 to n-1
9.     Sort list B[i] with insertion sort.
10. Concatenate the lists B[0], B[1], ..., B[n-1] in order.
Asymptotic analysis of bucket sort:
In this algorithm, if we assume that insertion into a bucket takes O(1) time, then steps 1 and 2 of the above algorithm clearly take O(n) time. O(1) insertion is easily achieved if a linked list is used to represent a bucket. Step 4 also takes O(n) time, as there are n items in all buckets together. The main step to analyze is step 3. This step also takes O(n) time on average if all numbers are uniformly distributed.
Worst Case Time complexity: O(n^2)
Average Case Time complexity: O(n+k)
Best Case Time complexity: O(n+k)
Space Complexity: 10n
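The four steps of bucket sort can be sketched in C++ as below. This is an illustrative sketch under the stated assumption that all values lie in [0.0, 1.0); the function name `bucket_sort_unit_interval` is hypothetical, and vectors are used for the buckets instead of linked lists.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Bucket sort for values assumed to lie in [0.0, 1.0).
std::vector<double> bucket_sort_unit_interval(const std::vector<double>& a) {
    const std::size_t n = a.size();
    if (n == 0) return {};

    // Step 1: create n empty buckets.
    std::vector<std::vector<double>> buckets(n);

    // Step 2: place each element in bucket floor(n * x).
    for (double x : a) {
        std::size_t b = static_cast<std::size_t>(x * n);
        buckets[std::min(b, n - 1)].push_back(x);   // clamp in case of rounding
    }

    std::vector<double> out;
    out.reserve(n);
    for (auto& bucket : buckets) {
        // Step 3: insertion sort within the bucket.
        for (std::size_t i = 1; i < bucket.size(); ++i) {
            double v = bucket[i];
            std::size_t j = i;
            while (j > 0 && bucket[j - 1] > v) {
                bucket[j] = bucket[j - 1];
                --j;
            }
            bucket[j] = v;
        }
        // Step 4: concatenate the sorted buckets in order.
        out.insert(out.end(), bucket.begin(), bucket.end());
    }
    return out;
}
```

With uniformly distributed input each bucket holds about one element, so the insertion sorts are cheap; if all values fall into one bucket, the insertion sort dominates and the quadratic worst case appears.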
3. Execution of algorithms on machines with different specifications
i. Machine 1: ASUS-GL551VW
a. Machine Description:
Architecture: x86_64
Model name: Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz
b. OS Description:
NAME="Ubuntu"
VERSION="16.04.2 LTS (Xenial Xerus)"
Data size    Bucket sort (time in sec)    Counting sort (time in sec)
10^8         5.42711                      1.989685
10^8         6.08566                      2.02002
10^9         59.842                       21.72842
10^9         60.06382                     20.53312
Fig. 1. Time in sec required for execution of bucket and counting sort on machine 1
ii. Machine 2: DELL Latitude E6220
a. Machine Description:
Architecture: x86_64
Model name: Intel(R) Core(TM) i7-2640M CPU @ 2.80GHz
CPU MHz: 1345.895
b. OS Description:
NAME= Ubuntu
VERSION= 18.04.1 LTS (Bionic Beaver)
Data size    Bucket sort (time in sec)    Counting sort (time in sec)
10^4         0.020152                     0.003281
10^4         0.05008                      0.002879
10^5         0.041172                     0.011148
10^5         0.051398                     0.012658
10^6         0.226285                     0.04136
10^6         0.193598                     0.037202
10^7         0.858906                     0.314428
10^7         0.858906                     0.311893
10^8         6.22311                      2.93093
10^8         6.51261                      2.93093
Fig. 2. Time in sec required for execution of bucket and counting sort on machine 2
iii. Machine 3: DELL Inspiron 15 3000 series
a. Machine Description
Card name: NVIDIA GeForce 820M
Chip type: GeForce 820M
Display Memory: 4044 MB
Dedicated Memory: 1996 MB
Shared Memory: 2047 MB
b. OS Description
Windows 8.1 Pro 64-bit (6.3, Build 9600)
System Model: Dell Inspiron 3542
Processor: Intel(R) Core(TM) i5-4210U CPU @ 1.70GHz (4 CPUs), ~1.7GHz
Memory: 8192MB RAM
Data size    Bucket sort (time in sec)    Counting sort (time in sec)
10^4 0.157 0.005
10^4 0.158 0.006
10^5 0.318 0.011
10^5 0.305 0.013
10^6 0.866 0.063
10^6 1.078 0.065
10^7 2.602 0.629
10^7 2.532 0.602
10^8 17.383 6.511
10^8 16.284 5.675
It is observed that counting sort outperformed bucket sort in total sorting time on every machine on which it was tested. Counting sort was then compared with the standard sorting algorithm from <algorithm.h> to test its performance. The following results were observed:
Standard algorithm <algorithm.h>
Data size    Generation Time (in sec)    Sorting Time (in sec)    Total Time (in sec)
10^4         0.000595                    0.005973                 0.007923
10^4         0.000523                    0.0086                   0.011069
10^5         0.005471                    0.036376                 0.04398
10^5         0.005754                    0.033353                 0.036974
10^6         0.020607                    0.328578                 0.35789
10^6         0.022227                    0.324432                 0.355501
10^7         0.155219                    3.56678                  3.80538
10^7         0.158648                    3.55816                  3.80237
10^8         1.48517                     39.2959                  41.8451
10^8         1.44944                     40.1986                  42.3975
Fig. 5. Time in sec required for generation, sorting and total time in sec for standard algorithm
Comparison between total time of counting sort and standard algorithm
Data size    Counting sort (in sec)    Standard sort (in sec)
10^4         0.002183                  0.007923
10^4         0.002479                  0.011069
10^5         0.007295                  0.04398
10^5         0.007833                  0.036974
10^6         0.024988                  0.35789
10^6         0.030717                  0.355501
10^7         0.223943                  3.80538
10^7         0.217985                  3.80237
10^8         1.989685                  41.8451
10^8         2.002228                  42.3975
Fig. 6. Total time in sec for counting and standard sorting algorithms
It is observed that the counting sort algorithm developed outperformed the standard algorithm, as the total time (generation time + sorting time) for counting sort is less than the total time of the standard sort.
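The measurement code is not listed in this paper. As a rough sketch of how generation, sorting and total times of this kind could be recorded, the following C++ fragment times a sorter with `std::chrono`. The names `Timing`, `time_sort` and `counting_sort_values`, the fixed random seed and the key range `k` are all assumptions made for illustration; the standard sort is taken from the C++ header `<algorithm>`.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

struct Timing { double generate_s; double sort_s; double total_s; };

// Histogram-style counting sort for values assumed to lie in [0, k).
void counting_sort_values(std::vector<std::uint32_t>& v, std::uint32_t k) {
    std::vector<std::size_t> count(k, 0);
    for (std::uint32_t x : v) ++count[x];
    std::size_t j = 0;
    for (std::uint32_t value = 0; value < k; ++value)
        for (std::size_t c = count[value]; c > 0; --c) v[j++] = value;
}

// Generates n random integers in [0, k), sorts them with `sorter`,
// and reports generation, sorting and total wall-clock time in seconds.
template <class Sorter>
Timing time_sort(std::size_t n, std::uint32_t k, Sorter sorter,
                 std::vector<std::uint32_t>& data) {
    using clock = std::chrono::steady_clock;
    auto secs = [](clock::duration d) {
        return std::chrono::duration<double>(d).count();
    };

    auto t0 = clock::now();
    std::mt19937 rng(12345);  // fixed seed so both sorters see the same data
    std::uniform_int_distribution<std::uint32_t> dist(0, k - 1);
    data.resize(n);
    for (auto& x : data) x = dist(rng);
    auto t1 = clock::now();
    sorter(data);
    auto t2 = clock::now();

    return {secs(t1 - t0), secs(t2 - t1), secs(t2 - t0)};
}
```

A call such as `time_sort(n, k, [&](std::vector<std::uint32_t>& v) { std::sort(v.begin(), v.end()); }, data)` would give the standard-sort timings, and passing a lambda that calls `counting_sort_values(v, k)` would give the counting sort timings on identical input.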
B. Parallel Algorithms
1. Merge Sort
In this algorithm, the list is sorted recursively by dividing it into smaller pieces, which are sorted and then merged during reassembly of the list. The pseudo code [5] for parallel merge sort is given below:
parallel_mergesort(DataArray, SizeofData)
1. begin
2. Create processors Pi where i = 1 to n
3. if i > 0 then receive size and parent from the root
   receive the list, size and parent from the root
9. send list, midvalue, first child
10. send list from midvalue, listsize-midvalue, second child
11. call mergelist(list,0,midvalue,list, midvalue+1,listsize,temp,0,listsize)
12. store temp in another array list2
13. else
14. call parallelMergeSort(list,0,listsize)
15. endif
16. if i>0 then
17. send list, listsize, parent
18. endif
19. end
Time complexity of merge sort is as follows: $T_{computation} = \sum_{i=0}^{\log p} (2^{i} - 1)$.
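Assuming the summand in the expression above is read as 2^i − 1 (the exponent appears to have been flattened in typesetting) and the logarithm is base 2, the sum has a simple closed form. The derivation below is a worked illustration under that assumption, not a result stated in [5]:

```latex
\begin{aligned}
T_{computation} &= \sum_{i=0}^{\log p} \left(2^{i} - 1\right)
                 = \sum_{i=0}^{\log p} 2^{i} \;-\; (\log p + 1) \\
                &= \left(2^{\log p + 1} - 1\right) - (\log p + 1)
                 = 2p - \log p - 2 .
\end{aligned}
```

For example, with p = 4 the terms are 0 + 1 + 3 = 4, which agrees with 2(4) − 2 − 2 = 4. Under this reading the computation time grows linearly with p.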
2. Counting Sort
The sources of parallelism in the standard counting sort algorithm are partitioning the input array, counting occurrences in privatized count arrays, and reducing the privatized counts to global counts.
The pseudo code for the parallel counting sort algorithm is as below
accessing the array we have broken down each iteration to datasize/number_of_threads.
We created n threads per block, with each thread accessing its array location simultaneously and incrementing the count array at the index given by the value it reads. The count array is then decremented, while the final array is filled, until every element is zero. The thread index is computed as follows:
int t = blockIdx.x * blockDim.x + threadIdx.x;
where the thread index t must be less than the size of the generated array, so that no thread accesses a location beyond the data size.
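The index arithmetic above can be checked on the host with a few lines of C++. This is a sketch under the assumption that one thread handles one array element and that the number of blocks is obtained by ceiling division; the paper does not state how the grid was sized, and the helper names `blocks_for`, `global_index` and `in_range` are hypothetical.

```cpp
#include <cstddef>

// Number of blocks needed so that blocks * threads_per_block >= n.
constexpr std::size_t blocks_for(std::size_t n, std::size_t threads_per_block) {
    return (n + threads_per_block - 1) / threads_per_block;
}

// Host-side mirror of blockIdx.x * blockDim.x + threadIdx.x.
constexpr std::size_t global_index(std::size_t block_idx, std::size_t block_dim,
                                   std::size_t thread_idx) {
    return block_idx * block_dim + thread_idx;
}

// Bounds guard: threads whose index is >= n must not touch the array.
constexpr bool in_range(std::size_t t, std::size_t n) { return t < n; }
```

For 10^8 elements and 1024 threads per block this gives 97657 blocks. The last block is only partly used, which is why the bounds check on the thread index is needed.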
We generated arrays of sizes ranging from 10^3 to 10^8 unsigned integers. The counting sort algorithm developed was tested on these data sizes with the number of threads per block varying from 1 to 1024. The time required for sorting was measured using the built-in cudaEventElapsedTime API. The sorting time referred to here is the time required for the kernel launch to perform the required operation.
These results were carefully observed and noted, and the analysis is shown in the next section.
1. Begin
2. Create threads Ti where i = 1 to arraysize
3. Create a count array
4. Create a final array
5. Increment the value at count array
6. Insert the elements in the final array until the count is 0
7. Return the final array
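The three sources of parallelism named in this section (partitioning the input, counting into privatized count arrays, and reducing to global counts) can be illustrated with a sequential C++ emulation. This is not the CUDA kernel used in this work: each "worker" stands in for a thread, the workers run one after another, and the name `privatized_counting_sort` and its parameters are assumptions made for the sketch. Values are assumed to lie in [0, k).

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

std::vector<std::uint32_t> privatized_counting_sort(
        const std::vector<std::uint32_t>& in, std::uint32_t k,
        std::size_t workers) {
    const std::size_t n = in.size();
    workers = std::max<std::size_t>(workers, 1);
    // Each worker handles about datasize / number_of_threads elements.
    const std::size_t chunk = (n + workers - 1) / workers;

    // Partition + privatized counting: one private histogram per worker,
    // so no two workers ever update the same counter.
    std::vector<std::vector<std::size_t>> local(
        workers, std::vector<std::size_t>(k, 0));
    for (std::size_t w = 0; w < workers; ++w) {
        const std::size_t begin = std::min(n, w * chunk);
        const std::size_t end = std::min(n, begin + chunk);
        for (std::size_t i = begin; i < end; ++i) ++local[w][in[i]];
    }

    // Reduction: add the private histograms into the global counts.
    std::vector<std::size_t> global(k, 0);
    for (const auto& h : local)
        for (std::uint32_t v = 0; v < k; ++v) global[v] += h[v];

    // Expansion: emit each value as many times as it was counted.
    std::vector<std::uint32_t> out;
    out.reserve(n);
    for (std::uint32_t v = 0; v < k; ++v) out.insert(out.end(), global[v], v);
    return out;
}
```

Private histograms trade extra memory (workers × k counters) for the absence of conflicting updates; on a GPU the same idea is typically realised with per-block counters in shared memory followed by atomic additions to the global count array.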
III. GPU
We executed the new version of the sequential and parallel counting sort algorithm on a machine with the Windows 10 64-bit operating system and an Intel® Core™ i7-6700HQ processor @ 2.97 GHz. The system has a GeForce GTX 960M GPU with 5 multiprocessors and 640 CUDA cores. There is a maximum of 2048 threads per multiprocessor and 1024 threads per block. The CUDA runtime version is 9.2. The total amount of global memory is 4 GB, the total amount of constant memory is 65536 bytes, and the total amount of shared memory per block is 49152 bytes. The total number of registers available per block is 65532 and the warp size is 32. The maximum size of each dimension of a block is 1024 x 1024 x 64 and the maximum size of each dimension of a grid is 2147483647 x 65535 x 65535.
IV. PROPOSED METHODOLOGY
The loop in the code that consumes the most time is:
for (i = 0; i < arraysize; i++) {
    while (count[i] > 0) {
        Insert(element, array[j++]);
        count[i]--;  /* decrement until the count reaches zero */
    }
}
These two loops are the ones that consume the maximum
V. RESULTS AND OBSERVATIONS
The time required for sorting data using the uniform test case (unsigned integers generated randomly in the array) is measured in milliseconds and is shown in figure (8). Figure (8) clearly shows that parallel counting sort outperformed the serial sorting method by a factor of 10x.
As the number of threads per block increases from T = 1, the time required for sorting decreases rapidly up to about T = 32, after which it levels off through T = 1024. The same result is also plotted in figure (9), where the X-axis represents the data size from 10^4 to 10^8 and the Y-axis represents the sorting time in milliseconds. The serial and parallel sorting times are plotted as two line graphs distinguished by color. From the graph, it is clearly observed that the parallel sorting method is better than the serial sorting method.
DATA SIZE T = 1 T = 2 T = 4 T = 8 T = 16 T = 32 T = 64 T = 128 T = 256 T = 512 T = 1024
10^3 0.02538 0.01683 0.01229 0.00886 0.00774 0.00771 0.00794 0.00774 0.00758 0.01485 0.02131
10^3 0.02506 0.01702 0.01219 0.00906 0.00813 0.00806 0.00762 0.00787 0.00845 0.00906 0.01088
10^3 0.02570 0.02298 0.01238 0.00874 0.00781 0.00762 0.00790 0.00774 0.00794 0.01808 0.01024
10^4 0.17405 0.09165 0.05075 0.02928 0.01946 0.01971 0.02634 0.01667 0.01594 0.02083 0.01706
10^4 0.17530 0.09299 0.04995 0.02970 0.01872 0.01501 0.01690 0.01789 0.01898 0.01741 0.01722
10^4 0.17552 0.09251 0.05040 0.03494 0.01744 0.01464 0.01664 0.02755 0.01827 0.01725 0.01728
10^5 1.67082 0.84128 0.42458 0.21754 0.11293 0.06816 0.06941 0.06986 0.06944 0.07107 0.07434
10^5 1.66973 0.83904 0.42725 0.21706 0.11242 0.07043 0.07011 0.06970 0.07075 0.08013 0.07590
10^5 1.66710 0.85584 0.42525 0.21728 0.11306 0.08179 0.07062 0.07366 0.07056 0.07181 0.07360
10^6 16.62071 8.34890 4.19926 2.12054 1.06643 0.60048 0.60275 0.61034 0.60560 0.61107 0.66662
10^6 16.63021 8.35549 4.20080 2.11894 1.08051 0.60016 0.60243 0.60384 0.61533 0.61344 0.66922
10^6 16.62928 8.35299 4.20211 2.13043 1.06752 0.60109 0.60243 0.60365 0.60458 0.62813 0.66224
10^7 166.28694 83.33117 41.94672 21.10419 10.57040 5.86110 5.88227 5.88973 5.88051 5.93338 6.43488
10^7 166.20921 83.42659 41.95021 21.10765 10.56950 5.88915 5.87261 5.87757 5.88467 5.93386 6.44358
10^7 166.27329 83.71888 41.96624 21.12365 10.56117 5.87210 5.87962 5.88285 5.88464 5.93133 6.43862
10^8 1558.54993 787.23749 399.62909 206.41553 105.61843 58.62192 58.64154 58.61737 58.55430 59.05235 64.28595
10^8 1557.87488 787.46484 401.26260 206.82118 105.61530 58.66173 58.64747 58.57686 58.60781 59.05677 64.21223
10^8 1558.40748 787.02414 400.95035 206.20204 105.63517 58.64931 58.60838 58.59607 58.62307 59.02445 64.22752
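As a worked example using the first 10^8 row of the table above, the ratio of sorting times between configurations can be computed directly. Note that this compares the T = 1 GPU configuration with larger thread counts; it is not a comparison against the serial CPU implementation.

```latex
\frac{t_{T=1}}{t_{T=32}} = \frac{1558.54993}{58.62192} \approx 26.6,
\qquad
\frac{t_{T=1}}{t_{T=1024}} = \frac{1558.54993}{64.28595} \approx 24.2
```

The gain is therefore largest near T = 32, which equals the warp size of the GPU used, and adding more threads per block beyond that point gives no further reduction in this data.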
[Figure: line graph "Comparison of parallel and serial counting sort"; X-axis: Data Size (10^4 to 10^8); Y-axis: Time (0 to 700); series: Parallel, Serial]
Fig. 9. Comparison of parallel and serial counting sort with time in milliseconds
VI. CONCLUSION
For sorting very large data sets, the parallel counting sort algorithm outperforms the other algorithms. For a data size of 10^8 integers, the time required for sorting is 64.28595 milliseconds. If the same algorithm is executed on a GPU with better specifications, it can give even better results.
VII. REFERENCES
[1] Neetu Faujdar and Satya Prakash Ghrera, “Performance Evaluation of Parallel Count Sort using GPU Computing with CUDA”, Indian Journal of Science and Technology, Vol 9(15), DOI: 10.17485/ijst/2016/v9i15/80080, April 2016.
[2] Jehad Alnihoud and Rami Mansi, “An enhancement of major sorting algorithms”, The International Arab Journal of Information Technology, Vol. 7, No. 1, January 2010.
[3] Khaled Thabit and Afnan Bawazir, “A novel approach for selection sort algorithm with parallel computing and dynamic programming concepts”, JKAU Comp. IT, Vol 2, pp. 27-44.
[4] Aishwarya Kaul, “Space Optimization of Counting Sort”, International Journal of Computer Applications (0975-8887).
[5] https://nptel.ac.in/courses/106106112/Module4/Lectures%201-4/Lecture%203.pdf