Micro-Benchmarks - Efficient compilation of a verification-friendly programming language

The micro-benchmark suite consists of five Whiley programs that exercise the code optimisations and measure the performance of the generated C code: Reverse (see Appendix B.1), TicTacToe (see Appendix B.2), BubbleSort (see Appendix B.3), MergeSort (see Appendix B.4) and MatrixMult (see Appendix B.5).

Each test case takes command-line arguments that vary the array size of the benchmark program. For each case we choose three sizes at which to measure memory leaks and execution time. Note that the TicTacToe program varies the number of repeats, rather than the size of the game board, for benchmarking.
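The measurement procedure described above could be driven by a small harness such as the following. This is a minimal sketch, not the tooling used for the reported results: the names `parse_size`, `average_time` and `reverse_workload`, the default of three repeats, and the use of `time.perf_counter` are all assumptions made for illustration.

```python
import time


def parse_size(argv, default=100_000):
    """Read the problem size from the first command-line argument.

    The argument layout is hypothetical; the benchmark programs may differ.
    """
    return int(argv[1]) if len(argv) > 1 else default


def average_time(workload, size, repeats=3):
    """Run workload(size) `repeats` times and return the mean wall-clock seconds."""
    total = 0.0
    for _ in range(repeats):
        start = time.perf_counter()
        workload(size)
        total += time.perf_counter() - start
    return total / repeats


def reverse_workload(size):
    """Stand-in workload: build a list of `size` integers and reverse it."""
    return list(range(size))[::-1]
```

A call such as `average_time(reverse_workload, parse_size(sys.argv))` (after `import sys`) would then report one averaged timing for the size given on the command line.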

Table 8.1: Memory leaks (bytes) of micro-benchmarks

N = naive code, C = copy-eliminated code, + D = with deallocation.

Test Case    Problem Size                 N    N + D              C    C + D
Reverse      100,000              4,800,416        0      1,600,408        0
Reverse      1,000,000           48,000,424        0     16,000,416        0
Reverse      10,000,000         480,000,432        0    160,000,424        0
TicTacToe    100,000            276,000,296        0    204,000,288        0
TicTacToe    200,000            552,000,296        0    408,000,288        0
TicTacToe    300,000            828,000,296        0    612,000,288        0
BubbleSort   1,000                   32,408        0          8,400        0
BubbleSort   10,000                 320,416        0         80,408        0
BubbleSort   100,000              3,200,424        0        800,416        0
MergeSort    1,000                  320,376        0         80,368        0
MergeSort    10,000                 640,648        0        160,544        0
MergeSort    100,000                961,144        0        240,776        0
MatrixMult   1,000 × 1,000      112,000,464        0     24,000,456        0
MatrixMult   2,000 × 2,000      448,000,464        0     96,000,456        0
MatrixMult   3,000 × 3,000    1,008,000,464        0    216,000,456        0

Memory Leaks. Table 8.1 shows that, on our benchmark suite, our deallocation analysis eliminates memory leaks in both naive and copy-eliminated code for all test cases. Copy elimination alone also reduces copies in all test cases, and removes all unnecessary copies in at least four of them: Reverse, BubbleSort, MergeSort and MatrixMult. Note that, without deallocation, each case retains a small residual leak, e.g. 424 bytes in the Reverse case, which does not grow with the problem size, because the program needs to allocate some extra memory to store the values of the command-line arguments.

function reverse(int[] arr) -> int[]:
    int i = |arr|
    int[] r = [0; |arr|]
    while i > 0 where i <= |arr| && |r| == |arr|:
        int item = arr[|arr|-i]
        i = i - 1
        r[i] = item
    return r

Listing 8.1: Reverse program
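To make the loop in Listing 8.1 easier to trace, here is a Python transliteration. It is only a sketch: in Whiley the `where` clause is a loop invariant for the verifier, whereas here it is replaced by a run-time `assert`, which is a substitution made purely for illustration.

```python
def reverse(arr):
    """Return a reversed copy of arr, following the loop of Listing 8.1."""
    i = len(arr)
    r = [0] * len(arr)
    while i > 0:
        # Run-time stand-in for the Whiley loop invariant (the `where` clause).
        assert i <= len(arr) and len(r) == len(arr)
        item = arr[len(arr) - i]  # read from the front of arr
        i = i - 1
        r[i] = item               # write towards the front of r
    return r
```

The input `arr` is only read and the result is built in the separate array `r`, which matches the two arrays the program is said to need.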

The Reverse program uses two arrays (arr and r) to run function reverse (see Listing 8.1). Because each array holds signed 64-bit integers (int64_t), the amount of leaked memory gives an estimate of the number of arrays used in the program.

Consider the array size of $1 \times 10^7$ as an example. Each array takes up 80 MB, and the memory leaks in Table 8.1 show that our copy elimination analysis reduces six arrays down to only two, and thus removes all redundant array copies. Leaks in the Reverse program also grow linearly with array size, so we obtain $3.3 \times 10^8 \approx 16\,\text{GB} / 48\,\text{bytes}$ as the estimated maximal array size for the naive Reverse code. We can therefore choose $1 \times 10^8$, $2 \times 10^8$ and $3 \times 10^8$ as array sizes to benchmark speed-ups.
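The arithmetic behind these estimates can be checked with a short sketch. The function names are hypothetical, and 16 GB is interpreted as $16 \times 10^9$ bytes, which is an assumption about the unit rather than something the text states.

```python
ELEMENT_BYTES = 8  # size of one int64_t element


def estimated_array_count(leaked_bytes, array_size):
    """Estimate how many arrays of `array_size` int64_t elements were leaked."""
    return round(leaked_bytes / (array_size * ELEMENT_BYTES))


def max_array_size(memory_bytes, leaked_bytes_per_element):
    """Largest array size before the leaks alone exhaust `memory_bytes`."""
    return memory_bytes // leaked_bytes_per_element


# Reverse at 10,000,000 elements, leak figures taken from Table 8.1.
naive_arrays = estimated_array_count(480_000_432, 10_000_000)      # naive code (N)
optimised_arrays = estimated_array_count(160_000_424, 10_000_000)  # copy-eliminated (C)

# Six leaked arrays of 8-byte elements give 48 leaked bytes per element.
limit = max_array_size(16 * 10**9, naive_arrays * ELEMENT_BYTES)
```

Rounding absorbs the few hundred bytes of constant overhead, so the counts come out as whole arrays.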

function bubbleSort(int[] items) -> int[]:
    int length = |items|
    int last_swapped = 0
    // Until no items are swapped
    while length > 0:
        last_swapped = 0
        int index = 1
        while index < length:
            if items[index-1] > items[index]:
                int tmp = items[index-1]
                items[index-1] = items[index]
                items[index] = tmp
                last_swapped = index
            index = index + 1
        // Skip the remaining items as they are ordered.
        length = last_swapped
    return items

Listing 8.2: Bubble sort program
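The early-termination logic of Listing 8.2 can be traced with the Python sketch below. It is an illustrative transliteration only; the explicit `list(items)` copy is an assumption used to mimic the fact that the Whiley function returns a sorted array without changing the caller's value.

```python
def bubble_sort(items):
    """Sort a list of integers using the last-swap optimisation of Listing 8.2."""
    items = list(items)  # work on a copy so the caller's list is untouched
    length = len(items)
    while length > 0:
        last_swapped = 0
        index = 1
        while index < length:
            if items[index - 1] > items[index]:
                items[index - 1], items[index] = items[index], items[index - 1]
                last_swapped = index
            index += 1
        # Everything from last_swapped onwards is already in order.
        length = last_swapped
    return items
```

When a pass performs no swap, `last_swapped` stays at 0 and the outer loop ends, so an already sorted input costs a single pass.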

The BubbleSort program creates and sorts one array of int64_t type. Table 8.1 indicates that copy elimination analysis removes all copies and keeps only one array to do bubble sorting. We choose $1 \times 10^5$, $2 \times 10^5$ and $3 \times 10^5$ as benchmark levels to measure the speed-ups of code optimisation.

function sortV1(int[] items, int start, int end) -> int[]:
    if (start+1) < end:
        int pivot = (start+end) / 2
        int[] lhs = Array.slice(items,start,pivot)
        lhs = sortV1(lhs, 0, pivot)
        int[] rhs = Array.slice(items,pivot,end)
        rhs = sortV1(rhs, 0, (end-pivot))
        ...
        // Merge 'lhs' and 'rhs' arrays
        while i < (end-start) && l < (pivot-start) && r < (end-pivot):
            ...
    return items

Listing 8.3: Merge sort program

Similarly, in the MergeSort program our copy elimination removes all unnecessary copies and reduces four arrays down to one.

Table 8.1 shows that the memory leaks are not severe in the MergeSort and BubbleSort programs, so we can benchmark speed-ups on larger array sizes. Since the memory leaks in both cases increase linearly with array size, we can predict that naive MergeSort code runs out of memory at an array size of $5.0 \times 10^7 = 16\,\text{GB} / 320\,\text{bytes}$, taking 320 bytes per array element as the estimate of memory leaks. Therefore, we can set benchmark levels to $1.0 \times 10^7$, $2.0 \times 10^7$ and $3.0 \times 10^7$ for the MergeSort case (Table 8.2 reports BubbleSort at the smaller $10^5$-scale levels).

function mat_mult(int[] a, int[] b, int[] data, int width, int height) -> (int[] c):
    int i = 0
    while i < height:
        int j = 0
        while j < width:
            int k = 0
            int sub_total = 0
            while k < width:
                sub_total = sub_total + a[i*width+k] * b[k*width+j]
                k = k + 1
            data[i*width+j] = sub_total
            j = j + 1
        i = i + 1
    return data

Listing 8.4: Matrix multiplication program
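Listing 8.4 stores each matrix in a flat array and addresses element (i, j) as `i*width+j`. The Python sketch below shows the same indexing scheme on a small input. It is illustrative only: unlike the listing, it allocates the result array itself instead of receiving `data` as a parameter.

```python
def mat_mult(a, b, width, height):
    """Multiply two row-major matrices stored as flat lists."""
    data = [0] * (width * height)  # result matrix, also flat and row-major
    for i in range(height):
        for j in range(width):
            sub_total = 0
            for k in range(width):
                # Row i of a times column j of b.
                sub_total += a[i * width + k] * b[k * width + j]
            data[i * width + j] = sub_total
    return data


# 2 x 2 example: [[1, 2], [3, 4]] times [[5, 6], [7, 8]]
product = mat_mult([1, 2, 3, 4], [5, 6, 7, 8], 2, 2)
```

The innermost loop does all the arithmetic and no allocation, which is why the run time of this benchmark is dominated by computation.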

The MatrixMult program creates three matrices of int64_t type and represents each of them as a one-dimensional array (see Listing 8.4). For the 1,000 × 1,000 case, each matrix amounts to 8 MB. The results show that our copy elimination removes all redundant copies and keeps only the three matrices needed to compute the matrix multiplication. Without memory deallocation, the naive C code has severe leaks. For example, when the matrix size is increased to 4,000 × 4,000, the memory used by the naive MatrixMult code amounts to 17.92 GB, which exceeds the memory limit and causes a system breakdown.

Table 8.2: Average execution time (seconds) of micro-benchmarks

Columns N, N + D, C and C + D give the execution time of each implementation; the last two columns give the speed-ups N/C and (N+D)/(C+D).

Test Case    Problem Size           N    N + D        C    C + D    N/C  (N+D)/(C+D)
Reverse      $1 \times 10^8$    0.903    1.195    0.351    0.371   2.58         3.22
Reverse      $2 \times 10^8$    1.744    1.735    0.694    0.694   2.51         2.50
Reverse      $3 \times 10^8$    2.609    2.608    1.015    1.027   2.57         2.54
TicTacToe    100,000            0.241    0.193    0.156    0.118   1.54         1.64
TicTacToe    200,000            0.412    0.353    0.277    0.225   1.49         1.57
TicTacToe    300,000            0.615    0.517    0.405    0.342   1.52         1.51
BubbleSort   100,000            6.659    6.627    6.634    6.616   1.00         1.00
BubbleSort   200,000           26.399   26.396   26.418   26.398   1.00         1.00
BubbleSort   300,000           59.358   59.372   59.377   59.364   1.00         1.00
MergeSort    $1 \times 10^7$    0.078    0.077    0.040    0.035   1.95         2.19
MergeSort    $2 \times 10^7$    0.148    0.149    0.046    0.067   3.21         2.21
MergeSort    $3 \times 10^7$    0.196    0.191    0.063    0.073   3.13         2.62
MatrixMult   1,000 × 1,000       1.28     1.27     1.29     1.39   1.00         0.92
MatrixMult   2,000 × 2,000       19.3     19.2     19.1     19.1   1.01         1.01
MatrixMult   3,000 × 3,000       47.9     47.7     47.9     48.0   1.00         0.99

Execution Time and Speed-up. Table 8.2 shows that our de-allocation macro (N+D) does not slow down the execution of naive code in nearly all cases; the one exception is Reverse at $1 \times 10^8$, where N+D takes 1.195 s against 0.903 s for N. Copy elimination (C) and the combined optimised (C+D) code both achieve speed-ups across array sizes in Reverse, TicTacToe and MergeSort.
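The two speed-up columns of Table 8.2 appear to be the ratios N/C and (N+D)/(C+D). The sketch below recomputes them from the reported times; the function names are hypothetical. Because the table lists rounded averages, a recomputed ratio can differ from the printed one in the last digit.

```python
def speedup(baseline_seconds, optimised_seconds):
    """Ratio of baseline time to optimised time; values above 1 mean faster."""
    return baseline_seconds / optimised_seconds


def speedup_columns(n, n_d, c, c_d):
    """Return the pair (N/C, (N+D)/(C+D)) for one row of Table 8.2."""
    return speedup(n, c), speedup(n_d, c_d)


# Reverse at 1e8 elements, times in seconds from Table 8.2.
without_dealloc, with_dealloc = speedup_columns(0.903, 1.195, 0.351, 0.371)
```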

In conclusion, our combined optimised (C+D) code runs as fast as copy-eliminated code in Reverse and TicTacToe, but runs slower in the MergeSort case. Our de-allocation macro takes time to free allocated memory and thus introduces delays in execution. Since the execution time in the MergeSort case is comparatively small, these delays become more significant than in the other two cases.

The flat speed-ups in the BubbleSort and MatrixMult cases require further profiling to find the performance bottlenecks. Using the gprof tool, we find that naive BubbleSort code spends almost 100% of its time on sorting and swapping array items. Likewise, naive MatrixMult code takes 99% of its time to calculate the products of rows and columns, and spends only 0.1% on array copying. Since computation dominates the overheads of array copies and memory deallocation in both programs, our code optimisation has little effect on speed-ups.
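The last point can be quantified with Amdahl's law, which is introduced here only for illustration and is not part of the original analysis. If a fraction p of the run time is spent on work that an optimisation removes entirely, the overall speed-up is at most 1 / (1 - p). The function name below is hypothetical.

```python
def max_speedup(removable_fraction):
    """Upper bound on overall speed-up when a fraction of run time is eliminated."""
    if not 0.0 <= removable_fraction < 1.0:
        raise ValueError("removable_fraction must be in [0, 1)")
    return 1.0 / (1.0 - removable_fraction)


# MatrixMult: gprof attributes about 0.1% of the time to array copying.
matrix_mult_bound = max_speedup(0.001)
```

With only 0.1% of the time spent on copying, the bound is about 1.001, which is consistent with the speed-ups of roughly 1.00 reported for MatrixMult.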
