An Efficient Methodology To Sort Large Volume Of Data


S.Bharathiraja, G.Suganya, M.Premalatha, R.Kumar, Sakkaravarthi Ramanathan

Abstract: Sorting is a basic data processing technique used in day-to-day applications. To keep pace with technological advancement and the rapid growth in data acquisition and storage, sorting must be improved to minimize processing time, response time and the space required for processing. Researchers have proposed various sorting techniques, but their applicability to large volumes of data is not assured. The main focus of this work is to propose a new sorting technique, titled Neutral Sort, that reduces sorting time and response time for large volumes of data. Neutral Sort is designed as an enhancement to Merge Sort. The advantages and disadvantages of existing techniques in terms of performance, efficiency and throughput are discussed, and a comparative study shows that Neutral Sort drastically reduces the time taken for sorting and hence the response time.

Index Terms: Chunking, Banding, Sorting, Efficiency, Merge Sort, Big Data Sorting, Neutral Sort.


1 INTRODUCTION

In data science, ordering elements is useful in many applications and in database querying. Researchers have proposed different approaches for sorting effectively, depending on the type of application. These approaches are normally evaluated using space complexity, which refers to the temporary space occupied by the algorithm during sorting, and time complexity, the time taken to sort and return the final sorted data. Algorithms are designed with the intent of minimizing both space and time complexity. The literature discussed in the following section asserts that quick sort and merge sort are efficient and effective for real-time use. In addition, depending on where the sorting is executed, the techniques are broadly categorized into internal and external sorting. The existing sorting techniques differ in throughput and efficiency; of these, Quick, Merge and Binary sorting techniques perform well.

Chunking and banding are widely used in different applications to create modularity and to ensure specificity. Merge sort, a technique that divides the data into chunks until each chunk has size one and then combines the chunks in sorted order, has been shown to be efficient by many researchers. Since the divided chunks may already be sorted and hence need no further splitting, we propose a modified Merge sort algorithm, titled “Neutral sort”, to reduce time complexity and hence response time.

Section 2 elaborates on the algorithms that exist in practice, with a detailed analysis of their merits and limitations. Section 3 presents a detailed discussion of the proposed methodology with an algorithmic explanation. The prototype is tested and the results are discussed in Section 4.

2 EXISTING ALGORITHMS

2.1 Bubble Sort

Bubble sort is one of the simplest algorithms for sorting data. The algorithm compares each element with its next element and swaps the two if they are out of order [1]. The comparison and swapping operations are repeated during each pass until all the elements in the list are sorted. It is inefficient because, in the worst case, the number of swaps grows quadratically with the number of elements. The worst-case time complexity of Bubble sort is O(n^2).

2.1.1 Benefits and Drawbacks

Though Bubble sort is inefficient, its simplicity makes it a reasonable choice for sorting a small set of data. Bubble sort uses O(1) auxiliary space. Because of its running time, it is inefficient for sorting large sets of data [1].
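The pass-and-swap procedure of Section 2.1 can be sketched in a few lines of Python. This is only an illustration: the function name `bubble_sort` and the early exit when a pass makes no swap are assumptions of this sketch, not details given in the paper.

```python
def bubble_sort(items):
    """Return a sorted copy of items using repeated compare-and-swap passes."""
    data = list(items)
    for end in range(len(data) - 1, 0, -1):
        swapped = False
        for i in range(end):
            # Compare each element with its next element; swap if out of order.
            if data[i] > data[i + 1]:
                data[i], data[i + 1] = data[i + 1], data[i]
                swapped = True
        if not swapped:
            # A pass without swaps means the list is already sorted.
            break
    return data
```

The nested loops make the quadratic worst case visible: a reversed list forces a swap on every comparison.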

2.2 Selection Sort

The number of swaps is reduced in Selection sort when compared to Bubble sort. During every pass, at most one swap occurs to place the appropriate element in its position, and hence Selection sort is better than Bubble sort in terms of swapping. Even so, when applied to large datasets, the performance of this sort is O(n^2) in the worst case [2].

• S. Bharathiraja is currently working as an Assistant Professor at Vellore Institute of Technology, Chennai, and is pursuing his Ph.D. PH-04439931120.

• Dr. G. Suganya is currently working as an Associate Professor at Vellore Institute of Technology, Chennai, and specializes in Software Engineering and Machine Learning. PH-04439931399.

• M. Premalatha is currently working as an Assistant Professor at Vellore Institute of Technology, Chennai, and is pursuing his Ph.D. PH-04439931071.

• Dr. R. Kumar is currently working as an Associate Professor at Vellore Institute of Technology, Chennai, and is pursuing his Ph.D. PH-0443993.

2.2.1 Performance Analysis

Selection sort needs O(n^2) comparisons and at most (n-1) swaps to sort a list of n elements [2][3]. Two other popular variations of selection sort are Quadratic Sort [5] and Tree Sort [2][3].

Though selection sort performs poorly on large sets of data, it is better than bubble sort in that it makes fewer swaps, which greatly reduces data movement. Selection sort uses O(1) auxiliary space.
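The swap bound can be checked with a small Python sketch. The function name `selection_sort` and the returned swap counter are hypothetical additions for illustration only.

```python
def selection_sort(items):
    """Return (sorted copy, number of swaps) using selection sort."""
    data = list(items)
    swaps = 0
    for i in range(len(data) - 1):
        # Find the smallest element in the unsorted tail data[i:].
        smallest = i
        for j in range(i + 1, len(data)):
            if data[j] < data[smallest]:
                smallest = j
        # At most one swap per pass places that element at position i.
        if smallest != i:
            data[i], data[smallest] = data[smallest], data[i]
            swaps += 1
    return data, swaps
```

For a list of n elements the counter never exceeds n-1, while the inner loop still performs a quadratic number of comparisons.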

2.3 Insertion Sort

Insertion sort is a more efficient algorithm than bubble and selection sort. It works much like arranging playing cards in a hand. When a new element is picked up, the algorithm finds its appropriate position and inserts it there by shifting all the elements after the insertion position one place forward, maintaining the sorted order of the list. This insertion process continues for all the elements in the list. Insertion sort is well suited to lists that are almost or partially sorted. Its time complexity is O(n^2).

2.3.1 Performance Analysis

Insertion sort degrades to quadratic complexity if the elements are in reverse order. Its advantage is efficient performance when the elements in the list are partially sorted. Insertion sort is inefficient as the size grows, since its average-case running time is also O(n^2). A popular variant of this sort is Shell sort [7], an unstable, in-place comparison sorting algorithm with a best-case performance of O(n log n). Insertion sort uses O(1) auxiliary space.

2.3.2 Benefits And Drawbacks

Like bubble and selection sort, insertion sort is inefficient for large sets of data. To locate the insertion position, it has to scan through a number of elements in the sorted portion of the list. Its best-case running time is O(n) [13], which occurs only when the elements are already in sorted order.
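The shift-and-insert step of Section 2.3 can be sketched as follows. The function name `insertion_sort` is an assumption of this sketch; the code works on an in-memory copy of the list.

```python
def insertion_sort(items):
    """Return a sorted copy of items using insertion sort."""
    data = list(items)
    for i in range(1, len(data)):
        key = data[i]          # the newly picked-up element
        j = i - 1
        # Shift larger elements of the sorted prefix one position forward.
        while j >= 0 and data[j] > key:
            data[j + 1] = data[j]
            j -= 1
        data[j + 1] = key      # insert at the position that was opened up
    return data
```

On input that is already sorted the inner `while` loop never runs, which corresponds to the O(n) best case; on reversed input every element is shifted past the whole prefix.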

2.4 Quick Sort

Quick sort is regarded as the fastest internal sorting technique [8]. It is widely used on many application data sets, since it does not require an auxiliary array for sorting the elements. It follows the divide and conquer strategy.

Quick sort selects an element called the pivot and divides the array into two parts by placing the pivot in its sorted location using an exchange procedure [8]. Once the pivot is placed in its sorted location, all the elements to its left are smaller than the pivot and all the elements to its right are greater. This method of splitting the array into two is called partitioning.

The process is repeated recursively for both the left and right partitions until each partition holds a single element. Performance suffers when the pivot is always selected as either the leftmost or the rightmost element. This drawback was later eliminated by choosing a more suitable method of selecting the pivot. One of the best methods, which tends to divide the list into two almost equal halves during the recursive partitions, is to choose the pivot as the median of the first, middle and last elements. Quick sort works efficiently even in a virtual memory environment, and it is an in-place algorithm.

Quick sort has an average running time of O(n log n) [8]. If either the first or the last element is always selected as the pivot, it can exhibit the worst-case running time of O(n^2). However, these worst-case scenarios are not that frequent. One variant of Quick sort is Qsort [10], which is more robust and faster than the traditional Quick sort algorithm. Quick sort is well suited for large data sets.

It is a fast and efficient algorithm for large sets of data [8]. However, with a first-element or last-element pivot it is inefficient if the elements in the list are already sorted, which results in the worst-case time complexity of O(n^2). Quick sort uses O(log n) [13] additional space for recursive function calls, so it might be expensive in terms of space for large sets of data. Moreover, quick sort traverses the array elements sequentially, which results in good locality of reference and cache behavior for arrays [9].
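The partitioning step with a median-of-three pivot can be sketched as below. The paper does not specify a partition scheme; this sketch assumes the Lomuto scheme, and the names `median_of_three` and `quick_sort` are hypothetical.

```python
def median_of_three(data, lo, hi):
    """Return the index of the median of the first, middle and last elements."""
    mid = (lo + hi) // 2
    candidates = sorted([(data[lo], lo), (data[mid], mid), (data[hi], hi)])
    return candidates[1][1]


def quick_sort(data, lo=0, hi=None):
    """Sort data[lo..hi] in place."""
    if hi is None:
        hi = len(data) - 1
    if lo >= hi:
        return
    # Choose the pivot and park it at the end of the range.
    p = median_of_three(data, lo, hi)
    data[p], data[hi] = data[hi], data[p]
    pivot = data[hi]
    store = lo
    # Partition: move every element smaller than the pivot to the left part.
    for i in range(lo, hi):
        if data[i] < pivot:
            data[i], data[store] = data[store], data[i]
            store += 1
    # Place the pivot in its sorted location.
    data[store], data[hi] = data[hi], data[store]
    quick_sort(data, lo, store - 1)
    quick_sort(data, store + 1, hi)
```

With this pivot choice an already sorted list is split near the middle at every level, which avoids the quadratic behavior described for first-element or last-element pivots.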

2.5 Merge Sort

Merge sort [11] also uses the divide and conquer approach. The procedure splits the set of elements to be sorted into two equal halves if the number of elements is even; if the number is odd, one partition contains one element more than the other. The procedure continues recursively until every partition contains a single element.

The merge operation is applied after partitioning. It merges corresponding left and right partitions, growing them until a single partition contains all the elements of the original set in sorted order. Merge sort is mostly implemented recursively, though it can also be implemented non-recursively. It is a stable sorting technique, since it preserves the relative order of elements with equal keys.
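A minimal recursive sketch of the split and merge phases is given below. The function names `merge` and `merge_sort` are assumptions of this sketch; taking from the left partition on ties is what keeps the sort stable.

```python
def merge(left, right):
    """Merge two sorted lists into one sorted list."""
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        # "<=" takes equal keys from the left partition first (stability).
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged


def merge_sort(items):
    """Return a sorted copy of items by splitting down to single elements."""
    if len(items) <= 1:
        return list(items)
    mid = len(items) // 2
    return merge(merge_sort(items[:mid]), merge_sort(items[mid:]))
```

Note that the split phase always recurses down to single elements, even when a partition is already sorted.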


3 PROPOSED METHODOLOGY

3.1 The Working Principle

The working principle of Neutral Sort is as follows: divide the whole list of elements into chunks that are each already in sorted order; if the total number of chunks is greater than 1, apply the banding algorithm given below repeatedly until the total number of chunks equals 1. In general, merge sort restricts itself to sub files of sizes 1, 2, 4, 8, ..., 2^k, where 2^k ≥ the size of the file F. It also always needs the same number of split-merge phases for a given number of elements, regardless of their order. Unnecessary split and merge work is therefore carried out whenever a particular chunk is already in sorted order. For chunks that are already sorted, we restrict further splitting, thereby reducing the split-merge phases for those chunks. If such sorted chunks are considerably large, this greatly reduces the time required to perform sorting. Neutral sort is a version of merge sort that identifies such sorted chunks and minimizes the unnecessary split-merge operations on them. The strategy is best suited to sorting large volumes of data.

An example is given below to demonstrate the working principle of Neutral Sort. Consider a file F containing the following elements:

F: 7 9 0 2 1 8 6 4 3 5 10

Segments of F may already be in sorted order. Marking the boundaries between such segments with "|" gives:

F: 7 9 | 0 2 | 1 8 | 6 | 4 | 3 5 10

Divide the elements of F into chunks such that the elements in each chunk are sorted. Copy the chunks alternately into F1 and F2 (chunk 1 into F1, chunk 2 into F2, chunk 3 into F1, chunk 4 into F2, and so on):

F1: 7 9 | 1 8 | 4

F2: 0 2 | 6 | 3 5 10

Now apply the banding algorithm to combine corresponding chunks in F1 and F2. Banding is similar to the merge algorithm: it sorts the elements of the two chunks while merging them.

F: 0 2 7 9 | 1 6 8 | 3 4 5 10

After F1 and F2 are combined, F contains 3 chunks, to which the same procedure is applied. Distribute the three chunks into F1 and F2:

F1: 0 2 7 9 | 3 4 5 10

F2: 1 6 8

Perform the banding operation, which merges the first chunk of F1 with the chunk in F2 and copies the remaining chunk of F1 unchanged. The resulting number of chunks is 2.

F: 0 1 2 6 7 8 9 | 3 4 5 10

Again divide the two chunks by copying them into F1 and F2:

F1: 0 1 2 6 7 8 9

F2: 3 4 5 10

Apply the banding algorithm: the two chunks are merged and the elements are now sorted.

F: 0 1 2 3 4 5 6 7 8 9 10

The above procedure is iterated until the number of chunks becomes equal to one.
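The number of passes in this example can be estimated with a short Python sketch. It assumes that each banding pass halves the number of chunks (rounding up), which gives an upper bound; the helper names `count_chunks` and `passes_needed` are hypothetical.

```python
def count_chunks(items):
    """Count the maximal already-sorted chunks in items."""
    if not items:
        return 0
    chunks = 1
    for previous, current in zip(items, items[1:]):
        if current < previous:   # a descent starts a new chunk
            chunks += 1
    return chunks


def passes_needed(chunks):
    """Passes required when every pass halves the chunk count (rounded up)."""
    # (chunks - 1).bit_length() equals ceil(log2(chunks)) for chunks > 1.
    return (chunks - 1).bit_length() if chunks > 1 else 0


F = [7, 9, 0, 2, 1, 8, 6, 4, 3, 5, 10]
print(count_chunks(F))                 # 6
print(passes_needed(count_chunks(F)))  # 3
print(passes_needed(len(F)))           # 4
```

Starting from the 6 sorted chunks of F takes 3 passes (6, 3, 2, 1), as in the trace above, whereas starting from 11 single-element chunks, as conventional merge sort does, takes 4 levels of merging.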

3.2 Algorithm

3.2.1 Chunking Pseudocode for Neutral Sort

/* File F containing the data is split and written to two sub files F1 and F2 by copying segments of sorted data in F alternately to F1 and F2. */

1. Open file F for input and files F1 and F2 for output.

2. While EOF of file F has not been reached:

a. Copy a sorted sub file of F into F1 as follows: read one element at a time from F and write it onto F1 until the next element in F is less than the last item copied to F1, or EOF is reached.

b. If EOF has not been reached, copy the next sorted sub file of F into F2 in the same way.

End While
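The chunking step can be sketched in Python as follows. This is an in-memory illustration, not the file-based procedure itself: F is assumed to be a list, and F1 and F2 are returned as lists of chunks so that the chunk boundaries stay explicit. The function name `chunk` is hypothetical.

```python
def chunk(F):
    """Split F into sorted chunks and deal them alternately to F1 and F2."""
    F1, F2 = [], []
    targets = (F1, F2)
    written = 0        # number of chunks written so far
    current = []       # the chunk being collected
    for item in F:
        # A smaller element than the last one copied ends the current chunk.
        if current and item < current[-1]:
            targets[written % 2].append(current)
            written += 1
            current = []
        current.append(item)
    if current:
        targets[written % 2].append(current)
    return F1, F2


F1, F2 = chunk([7, 9, 0, 2, 1, 8, 6, 4, 3, 5, 10])
print(F1)  # [[7, 9], [1, 8], [4]]
print(F2)  # [[0, 2], [6], [3, 5, 10]]
```

The output reproduces the first distribution of the sample file in Section 3.1.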

3.2.2 Merge Pseudocode for Neutral Sort

/* Merge corresponding sorted sub files from F1 and F2 to produce the output file F. */

1. Open F1 and F2 for input and F for output.

2. Initialize a global variable “__sub__files” = 0.

3. While neither EOF of F1 nor EOF of F2 has been reached:

a. While the end of the current sub file has been reached in neither F1 nor F2:

If the next element in F1 is less than the next element in F2, copy the next element from the F1 sub file into the output file F.

Else copy the next element from the F2 sub file into the output file F.

End While

b. If the end of the sub file in F1 has been reached:

Copy the remaining elements of the corresponding sub file in F2 into F.

Else

Copy the remaining elements of the corresponding sub file in F1 into F.

c. Increment the global variable “__sub__files” by 1.

End While

4. Copy any sub files remaining in F1 or F2 into F, incrementing “__sub__files” by 1 for each.
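The banding (merge) step can be sketched in the same in-memory style. The sketch assumes F1 and F2 are lists of sorted chunks, returns the sub file count instead of using a global variable, and copies an unpaired trailing chunk unchanged; the function name `band` is hypothetical.

```python
def band(F1, F2):
    """Merge corresponding chunks of F1 and F2; return (F, sub_files)."""
    F = []
    sub_files = 0
    for k in range(max(len(F1), len(F2))):
        left = F1[k] if k < len(F1) else []
        right = F2[k] if k < len(F2) else []
        i = j = 0
        # Merge the two corresponding chunks element by element.
        while i < len(left) and j < len(right):
            if left[i] <= right[j]:
                F.append(left[i])
                i += 1
            else:
                F.append(right[j])
                j += 1
        # Copy whatever remains of the chunk that was not exhausted.
        F.extend(left[i:])
        F.extend(right[j:])
        sub_files += 1
    return F, sub_files


F, sub_files = band([[7, 9], [1, 8], [4]], [[0, 2], [6], [3, 5, 10]])
print(F)          # [0, 2, 7, 9, 1, 6, 8, 3, 4, 5, 10]
print(sub_files)  # 3
```

The result matches the file F after the first banding pass in the worked example.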


3.2.3 Pseudocode for Neutral Sort

/* Use neutral sort to sort the file F. */

Repeat the following steps until the global variable “__sub__files” is equal to 1:

1. Call the chunking algorithm to split the file F into two files F1 and F2.

2. Merge corresponding chunks in files F1 and F2 using the merge algorithm; the merged content is written to file F.
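Putting the two steps together, the whole procedure can be sketched compactly in Python. This is an in-memory approximation under stated assumptions, not the authors' .NET implementation: lists stand in for files, the standard-library `heapq.merge` performs the banding of two chunks, and the names `sorted_chunks` and `neutral_sort` are hypothetical. The loop also stops when there are no sub files at all, so that an empty input terminates.

```python
from heapq import merge


def sorted_chunks(F):
    """Split F into its maximal already-sorted chunks."""
    chunks = []
    for item in F:
        if chunks and item >= chunks[-1][-1]:
            chunks[-1].append(item)    # extends the current sorted chunk
        else:
            chunks.append([item])      # a descent starts a new chunk
    return chunks


def neutral_sort(items):
    """Return a sorted copy of items by repeated chunking and banding."""
    F = list(items)
    while True:
        chunks = sorted_chunks(F)
        # Chunking: odd-numbered chunks go to F1, even-numbered ones to F2.
        F1, F2 = chunks[0::2], chunks[1::2]
        # Banding: merge corresponding chunks of F1 and F2.
        banded = [list(merge(a, b)) for a, b in zip(F1, F2)]
        banded.extend(F1[len(F2):])    # unpaired last chunk is copied as is
        sub_files = len(banded)
        F = [x for sub_file in banded for x in sub_file]
        if sub_files <= 1:
            return F


print(neutral_sort([7, 9, 0, 2, 1, 8, 6, 4, 3, 5, 10]))
# [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
```

For input that is already sorted, the first pass finds a single chunk and the loop ends after one linear scan, which is the case the method is designed to exploit.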

4 RESULTS AND DISCUSSIONS

The run times of all the existing algorithms are listed in Table 1. The execution time for each algorithm is given in microseconds, measured on the Windows operating system with the .NET Framework on an i3 processor. The number of elements was gradually increased and the corresponding run time was recorded for each algorithm. As Table 1 shows, the execution time increases with the number of elements for most of the conventional algorithms. Neutral sort achieves a relatively lower execution time for the same set of elements. Introducing the chunking and banding algorithm into the existing merge sort technique considerably reduces the execution time, which increases the efficiency of the neutral sort algorithm. The performance of neutral sort relative to the other sorting algorithms is also shown as a line chart in Figure 1.

As Table 1 shows, the performance of bubble, selection and insertion sort is reasonable up to 1000 elements. When the number of elements increases considerably beyond 1000, the performance of bubble, selection and insertion sort starts degrading. The time required to sort grows greatly as the volume of elements increases. Since our aim is to analyze the effect of introducing the chunking and banding algorithm into merge sort, the comparison between merge sort and neutral sort in Table 1 gives an insight in terms of execution time. When the number of elements is around 20000, merge sort required 176036 microseconds, whereas for the same number of elements Neutral sort required 18596 microseconds. It is evident from the results that introducing the chunking and banding algorithm greatly reduces the execution time. We can also see that the performance of Neutral sort is almost on par with that of Quick sort.


Table 1. Execution time in microseconds for various sorting algorithms in comparison with Neutral sort.

5 CONCLUSION

We have presented the conventional sorting algorithms and discussed the quick sort and merge sort algorithms in detail in this work. Since our basic aim is to improve the performance of merge sort by introducing the concepts of chunking and banding, we have discussed merge sort extensively. We have also elaborated our Neutral sort algorithm with an example and pseudocode. The results show a convincing improvement in execution time measured in microseconds. It is evident from the results that our Neutral sort algorithm performs better than the other conventional algorithms. As the number of elements increases, performance degrades in all the conventional algorithms. By introducing the chunking and banding algorithm into conventional merge sort, we could increase efficiency considerably. The overhead involved in introducing the chunking and banding algorithm is not discussed in this paper and could be addressed in a subsequent paper.

REFERENCES

[1] Donald E. Knuth, “The Art of Computer Programming”, 2nd Edition, Volume 3. Addison-Wesley, 1998.

[2] E. H. Friend, “Sorting on electronic computers”, Journal of Applied and Computational Mathematics, Vol. 3, Issue 2, pp. 34-168, 1956.

[3] Debasis Samanta, “Classic Data Structure”, 2nd Edition. PHI, 2009.

[4] D. L. Shell, “A high-speed sorting procedure”, Association for Computing Machinery, Vol. 2, Issue 7, pp. 30-32, July 1959.

[5] C. A. R. Hoare, “Quicksort”, Communications of the ACM, 4:321, 1961.

[6] Omar Khan Durrani, Shreelakshmi V, Sushma Shetty, and Vinutha D C, “Analysis and determination of asymptotic behavior range for popular sorting algorithms”, International Journal of Computer Science Informatics, ISSN: 2231-5292, Vol. 2, Issue 1.

[7] Jon Louis Bentley and M. Douglas McIlroy, “Engineering a sort function”, Software: Practice and Experience, Vol. 23(11), pp. 1249-1265, 1993.

[8] Coenraad Bron, “Merge sort algorithm [M1] (Algorithm 426)”, Communications of the ACM, Vol. 15(5), pp. 357-358, 1972.

[9] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein, “Introduction to Algorithms”, 2nd Edition, MIT Press and McGraw-Hill, 2001. ISBN 0-262-03293-7, page 17.

[10] Bender, Michael A.; Farach-Colton, Martín; Mosteiro, Miguel (2006).

[11] A. D. Mishra and D. Garg, “Selection of Best Sorting Algorithm”, International Journal of Intelligent Processing, 2(2):363-368, July-December 2008.

[12] M. S. Garai, Canaan C. and M. Daya, “Popular sorting algorithms”, World Applied Programming, Vol. 1, pp. 62-71, April 2011.

[13] Pankaj Sareen, “Comparison of Sorting Algorithms (On the Basis of Average Case)”, International Journal of Advanced Research in Computer Science and Software Engineering, Volume 3, Issue 3, March 2013, ISSN: 2277 128X.

Execution time in microseconds (Windows platform, .NET Framework, i3 processor)

Total Elements  Neutral    Bubble  Selection  Insertion  Quick    Merge
100                1537      1333        874        766    735     1222
200                1745      1250       1005       1471    212      937
300                1890      1826       1386       1527    467      944
500                1941      4062       2911       2567    720     1863
1000               2909     13071       7412       2693    823     4604
2000               2959     52545      25149      23531    914     7083
5000               2225    326268     140043     100694   1931    17314
10000              2470   1097879     598587     498780   3598    51063
20000              2596   4637621    2479249    2681191   7362   176036
50000              4433  27555655   13896535   12612051  18769  1039849