Evaluation of STAR - Dynamic Memory Allocation Techniques for High-Level Synthesis. Nicholas V.

  ret void
}

(a)

@.str = private unnamed_addr constant [18 x i8] c"Addresses: %d %d\0A\00", align 1

; Function Attrs: noinline nounwind
define void @foo() #0 {
  %malloccall = tail call i8* @malloc(i32 320)
  %1 = bitcast i8* %malloccall to [80 x i32]*
  %malloccall1 = tail call i8* @malloc(i32 280)
  %2 = bitcast i8* %malloccall1 to [70 x i32]*
  %3 = getelementptr inbounds [80 x i32]* %1, i32 0, i32 0
  %4 = ptrtoint i32* %3 to i32
  %5 = getelementptr inbounds [70 x i32]* %2, i32 0, i32 0
  %6 = ptrtoint i32* %5 to i32
  %7 = call i32 (i8*, ...)* @printf(i8* getelementptr inbounds ([18 x i8]* @.str, i32 0, i32 0), i32 %4, i32 %6) #4

Figure 8.3: An example of STAR’s stack-allocated replacement technique. (a) shows the unmodified program; (b) shows STAR’s modification.
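At the source level, the transformation in Figure 8.3 can be pictured roughly as follows. This is a minimal C sketch inferred from the IR in panel (b), not the original listing: the function names foo_stack and foo_star and the array names a and b are hypothetical, the standard malloc/free stand in for the libmem allocator that STAR would target, and the free calls are an assumption because the IR excerpt ends before the function returns. The sketch prints pointers with %p so that it is portable to 64-bit hosts, whereas the IR casts them to 32-bit integers.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* (a) Before: two 32-bit arrays reserved on the stack. */
void foo_stack(void) {
    int32_t a[80];
    int32_t b[70];
    printf("Addresses: %p %p\n", (void *)a, (void *)b);
}

/* (b) After: the same arrays requested from a shared heap.
 * Returns 0 on success and -1 if either request fails. */
int foo_star(void) {
    int32_t *a = malloc(80 * sizeof *a); /* 320 bytes, as in the IR */
    int32_t *b = malloc(70 * sizeof *b); /* 280 bytes, as in the IR */
    if (a == NULL || b == NULL) {
        free(a); /* free(NULL) is a no-op */
        free(b);
        return -1;
    }
    printf("Addresses: %p %p\n", (void *)a, (void *)b);
    free(b); /* assumed: release the arrays when the function ends */
    free(a);
    return 0;
}
```

Because the heap requests are released when the function finishes, independent functions rewritten this way can reuse the same heap memory instead of each reserving its own arrays.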

8.3 Evaluation of STAR

To evaluate STAR, we explore a variety of applications. We elaborate on each benchmark, outlining its high-level behaviour. Although our LLVM pass is HLS-agnostic, we modified the STAR algorithm to only replace stack-allocated arrays that use a 32-bit datatype (e.g., int, uint32_t). This is due to limitations within LegUp, where casting between types has limited support.

• aes: This benchmark is taken from the CHStone benchmark suite [64] and implements the AES cipher. It exhibits highly dependent function invocations and therefore serves as a stress test for STAR. The sum of all 32-bit stack-allocated arrays is 2176 bytes.

Chapter 8. Replacing Stack Allocated Arrays with Dynamic Memory Calls 66

• matrixmult2d: There are four independent compute kernels, each computing a multiply between two stack-allocated arrays (representing matrices) and storing the result in a global array. The sum of all 32-bit stack-allocated arrays is 102400 bytes. This benchmark demonstrates STAR’s ability to reduce BRAM utilization with independent functions that use stack-allocated arrays.

• qsort: This benchmark implements the quick-sort algorithm. This implementation has four independent compute kernels, each computing a quick-sort on a different segment of a shared list. The sum of all 32-bit stack-allocated arrays is 7680000 bytes.

• hash: Four different hash-functions are implemented and evaluated for collisions with two different hash-table sizes. The sum of all 32-bit stack-allocated arrays is 8619008 bytes.

• cpu test: A number of different CPU-like operations are emulated in this benchmark, testing for ROR, ROL, etc. on stack-defined arrays. The sum of all 32-bit stack-allocated arrays is 2560000 bytes.

Although aes is the only benchmark to exhibit highly dependent function invocations, we wanted to demonstrate the effectiveness of STAR with a large number of independent function calls, highlighting potential BRAM/memory savings. This motivated our choice of the majority of the benchmarks. We use LegUp to evaluate STAR, along with libmem, to conduct our studies. We opted to use gnumem from our allocator library; however, any allocator can be employed. We set LegUp to target an Intel/Altera Stratix V FPGA (5SGXEA7N2F45C2) and set a target clock period of 1 ns, using Quartus Prime 18.0. We use the Stratix V ALM (adaptive logic module) count as our area metric. We characterize performance through cycle latency, F_max, and wall-clock time ($T_{wall} = \text{cycle latency} / F_{max}$). The results when STAR is applied to this benchmark suite are outlined in Table 8.1.
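The wall-clock definition above can be checked against Table 8.1 with a few lines of C. This is only an illustrative sketch: the function names t_wall_us and t_wall_ratio are hypothetical, and the microsecond unit is inferred from dividing a cycle count by a frequency in MHz, since the table does not state a unit.

```c
#include <assert.h>

/* Wall-clock time in microseconds: cycle count divided by the clock
 * frequency in MHz (cycles / (1e6 cycles per second) = microseconds). */
double t_wall_us(double cycles, double fmax_mhz) {
    assert(fmax_mhz > 0.0);
    return cycles / fmax_mhz;
}

/* Relative change in wall-clock time, given the relative change in
 * cycle count and the relative change in F_max. */
double t_wall_ratio(double cycle_ratio, double fmax_ratio) {
    assert(fmax_ratio > 0.0);
    return cycle_ratio / fmax_ratio;
}
```

For the aes baseline (14603 cycles at 290.28 MHz), t_wall_us gives approximately 50.31, matching the table. With STAR applied, the cycle ratio of 1.17 and the F_max ratio of 0.77 combine to roughly 1.52, which shows that the wall-clock penalty comes from both the extra cycles and the lower clock frequency.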

First, we highlight that for each benchmark, we were able to reduce the overall BRAM usage when replacing stack-allocated arrays with dynamic memory allocation calls. These are expected results, since the STAR algorithm can identify whether a reduction in memory reservation is possible prior to compilation and only conducts the replacement if the memory reservation can be reduced. These benchmarks highlight the effectiveness of this approach, reducing the BRAM usage by up to ∼75% (e.g., the BRAM usage of qsort reduces from 514 BRAMs to 129). This is desirable, especially for applications which require large amounts of on-chip memory for processing. This area-saving approach is not a “free lunch”; the overall wall-clock time can increase to 1.75× the original time. The decrease in performance is mainly attributed to an increase in the number of cycles required to search for free memory. There is an observable decrease in F_max across all the benchmarks: as mentioned in Chapter 6, this is due to a critical path introduced from gnumem’s free to a newly added memory controller, which manages the heap (BRAMs). LegUp automatically adds a memory controller for any memory that is accessed by more than one function and whose pointers may alias. This helps to arbitrate access from pointers which may point to different arrays at runtime, and also helps to perform limited type-casting.

| Bench | STAR? | ALMs | Memory bits (Kb) | BRAMs | F_max (MHz) | Cycles | T_wall (µs) |
|---|---|---|---|---|---|---|---|
| aes | N | 3608 | 37.06 | 22 | 290.28 | 14603 | 50.31 |
| | Y | 1.31× | 1.16× | 0.91× | 0.77× | 1.17× | 1.52× |
| matrixmult2d | N | 1134 | 137.5 | 22 | 235.63 | 152337 | 646.51 |
| | Y | 1.87× | 0.51× | 0.46× | 0.97× | 1.00× | 1.04× |
| qsort | N | 2883 | 7531.25 | 514 | 266.95 | 52109 | 195.20 |
| | Y | 1.63× | 0.28× | 0.25× | 0.67× | 1.18× | 1.75× |
| hash | N | 3507 | 8473.25 | 532 | 176.03 | 1957153 | 11118.29 |
| | Y | 1.44× | 0.49× | 0.49× | 0.91× | 1.15× | 1.27× |
| cpu test | N | 869 | 2500 | 256 | 240.33 | 520030 | 2163.82 |
| | Y | 2.83× | 0.41× | 0.25× | 0.77× | 1.21× | 1.58× |

Table 8.1: Performance and Area Results when Stack-Allocated Arrays are Replaced with Dynamic Memory Allocation Algorithms, using STAR. Rows marked Y are reported relative to the corresponding N (baseline) row.

Although there is a substantial increase in the number of ALMs consumed, each benchmark is relatively small, with the largest design consuming 3608 ALMs, while this specific device has 234,720 ALMs (the largest design takes up 1.53% of the ALMs). Referring to Chapter 4.1, the combined ALM consumption of both gnu_malloc() and gnu_free() is 241 ALMs, which is a fixed cost when using STAR. The main contributors to the ALM increase are the allocator pair, the auto-instantiated memory controller, and the additional logic required for interfacing these components. Therefore, with larger designs, the additional cost in ALMs can be amortized; this can be seen as a general trend across these benchmarks.

Inspecting the results for aes, STAR reduced the BRAM reservation by 2 BRAMs. However, there was an increase in the number of reserved memory bits. This is a consequence of how memory is physically mapped to on-chip BRAMs, described below.

The stack-allocated memory may not require all bits in the reserved BRAM. However, physical BRAMs can implement a desired stack-allocated array in only a limited number of ways [64], and therefore the selected configuration may reserve more bits than required. The effective bits (i.e., used bits) per BRAM could therefore be less than the possible 20 Kb. When using STAR, multiple stack-allocated arrays can be reduced to a shared heap, which reduces the number of BRAMs. However, the underlying dynamic memory allocation algorithm may need to use the entire BRAM (e.g., to account for book-keeping logic which may reside on the heap). Therefore, the number of effective bits used on the heap’s BRAMs may increase, but with a lower number of BRAMs. As an example, suppose we have two stack-allocated arrays, each implemented in its own M20K BRAM and each using only an effective 8 Kb. Using STAR, these two stack-allocated arrays can be combined into 1 BRAM. Therefore, we expect the new BRAM to have an effective bit usage of 16 Kb. However, additional memory overhead may be required to employ the dynamic memory allocation. As long as the memory overhead plus the 16 Kb is less than the possible 20 Kb in an M20K, we have reduced BRAM usage but increased total memory bits.
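The two-array example above can be written out as a small calculation. This is an idealized sketch, not a model of how Quartus or LegUp maps memories: it only counts bits against the 20 Kb capacity of an M20K and ignores width/depth configurations. The function names and the 2 Kb allocator overhead used in the usage note are hypothetical.

```c
#define M20K_BITS (20u * 1024u) /* capacity of one M20K block: 20 Kb */

/* BRAMs needed when every array is mapped to its own block(s). */
unsigned brams_separate(const unsigned *array_bits, unsigned n) {
    unsigned total = 0;
    for (unsigned i = 0; i < n; i++) {
        /* round each array up to a whole number of blocks */
        total += (array_bits[i] + M20K_BITS - 1u) / M20K_BITS;
    }
    return total;
}

/* BRAMs needed when all arrays share one heap that also stores
 * overhead_bits of allocator book-keeping. */
unsigned brams_shared(const unsigned *array_bits, unsigned n,
                      unsigned overhead_bits) {
    unsigned bits = overhead_bits;
    for (unsigned i = 0; i < n; i++) {
        bits += array_bits[i];
    }
    return (bits + M20K_BITS - 1u) / M20K_BITS;
}
```

With two arrays of 8 Kb each, brams_separate reports 2 blocks holding 16 Kb of effective bits. With an assumed 2 Kb of overhead, brams_shared reports 1 block holding 18 Kb, so the BRAM count falls while the memory bits in use rise, which is the effect observed for aes.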

Lastly, we observe a general trend (and one which we expected): functions which have many independent paths in their call graphs and have large stack-allocated arrays can be modified to share memory using STAR, thereby reducing on-chip BRAM usage.

8.4 Summary

In this chapter, we introduced a method to reduce the total BRAM usage through the replacement of stack-allocated arrays with dynamic memory allocation requests. We demonstrated that this methodology can reduce the total BRAM usage when there are many independent functions in the function call graph with large stack-allocated arrays.

Chapter 9

Conclusions and Future Work

In this thesis, we explored the idea of dynamic memory allocation as a supported high-level synthesis construct. We provided a C-library of dynamic memory allocation algorithms, libmem, amenable to several high-level synthesis tools. To evaluate dynamic memory allocation algorithms in the HLS context, we curated and generated a number of benchmarking applications and tests: this is the dmbenchhls benchmark suite. We explored the usage of our C-library with dmbenchhls. From our findings, we provided guidelines on how to select dynamic memory allocation algorithms in HLS-intended applications with regard to a design objective (e.g., area-optimized, performance-optimized).

The results indicated that performance and area optimization opportunities exist for dynamic memory allocation algorithms in the HLS context. We provided a static-analysis framework, ASAP (Automatic Sizing and Partitioning), to partition and parallelize dynamic memory heaps at compile time to improve performance. The main contributor to performance degradation is the high cycle latency required for the search of free memory in a heap. ASAP was able to reduce this cycle latency through parallelization, thereby reducing the wall-clock time of real-world applications, with some improvements of up to ∼2×.

The ASAP framework reduces BRAM usage by applying a modified variant of Valgrind’s memcheck, which determines and sets the minimum required heap-depth for a given input to an application. Additionally, dynamic memory allocation could be used to improve resource utilization. By modifying C functions to share a common memory resource, rather than using stack-allocated arrays, BRAM usage could be reduced. We introduced another framework, STAR (Stack-Allocated Array Replacement), which can conserve on-chip BRAMs through the automatic replacement of stack-allocated arrays with dynamically allocated arrays. STAR led to a substantial BRAM reduction in some applications, recovering up to 75% of BRAMs which would have been reserved with a traditional HLS flow.


Our findings suggest the inclusion of dynamic memory allocation techniques in the HLS-supported subset of C and C++ is useful and may provide additional benefits to the HLS-generated circuit. Supporting dynamic memory allocation with high-level synthesis tools removes the need to rewrite HLS applications. This prevents users from introducing additional software/hardware bugs, and improves design time. Additionally, these high-level constructs allow for more abstract designs, lowering the barrier to broad uptake of HLS as an accepted design methodology.

9.1 Future Work

With our HLS-friendly C-library of dynamic memory allocation algorithms, libmem, we investigated how these algorithms impact the area and performance of an HLS-generated circuit. We provided design guidelines for HLS developers. However, our work only considers the performance and area impacts for one allocator per application. Methods to automatically select allocator(s) for an application may prove useful, and would minimize a user’s design time. These methods could incorporate ideas and methods from dynamic analysis techniques and other runtime-profiling methods. Additionally, libmem uses on-chip BRAMs as heap memory; exploring the use of off-chip memories has been left as future work.

In Chapter 5, we provided a set of 8 benchmarks to evaluate dynamic memory allocation techniques in the HLS context. Three benchmarks simulate common dynamic memory request patterns and five benchmarks explore real-world applications. In future work, we can extend this set of benchmarks to include more tests covering a broader range of dynamic memory request patterns (e.g., varying the patterns based on request size, combining patterns, etc.) and more real-world applications (e.g., Bellman-Ford routing, Dijkstra’s algorithm, etc.).

The ASAP framework, presented in Chapter 7, discovers all safe parallelism opportunities within an application that uses dynamic memory. Parallelism is provided by partitioning the heap into separate memories and replicating an allocator to handle each separate memory. However, the user is responsible for exploring and selecting the number of heap-memory partitions which provide a performance benefit. Additionally, our methodology currently replicates only one style of allocator (which is also user-selected). Therefore, it is possible to improve this framework by automatically selecting the number of partitions which improves performance. For example, the results collected in Chapter 7 show that partitioning provides the most benefit when many dynamic memory requests are issued per partition. Therefore, our framework could be modified to identify these cases. Additionally, it is possible that each partition may be best suited to one type of allocation algorithm. This could be addressed with an estimation engine, which could employ dynamic analysis techniques and other runtime-profiling methods. Lastly, our framework is not able to determine if the ranges of address-space for two or more pointers on the same heap are disjoint. If the ranges of address-space are disjoint, the heap can be further partitioned. By combining our approach with the methods presented in [51] and [26], it may be possible to improve upon the heap-partitioning result and/or reduce the time required for analysis.
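The idea of partitioning the heap into separate memories, each served by its own allocator instance, can be sketched in C as follows. This is an illustration under stated assumptions, not ASAP's generated code: the names partition_t, part_alloc and part_reset, the partition count, and the partition size are all hypothetical, and a simple bump allocator stands in for whichever libmem allocator the user would select.

```c
#include <stddef.h>
#include <stdint.h>

#define NUM_PARTITIONS 2
#define PARTITION_WORDS 256

/* One private memory and one private allocator state per partition. */
typedef struct {
    int32_t mem[PARTITION_WORDS];
    size_t next; /* index of the first free word */
} partition_t;

static partition_t heaps[NUM_PARTITIONS];

/* Bump-allocate `words` 32-bit words from partition `p`.
 * Returns NULL if `p` is invalid or the partition is exhausted. */
int32_t *part_alloc(unsigned p, size_t words) {
    if (p >= NUM_PARTITIONS) {
        return NULL;
    }
    partition_t *h = &heaps[p];
    if (words > PARTITION_WORDS - h->next) {
        return NULL;
    }
    int32_t *ptr = &h->mem[h->next];
    h->next += words;
    return ptr;
}

/* Release every allocation in partition `p` at once; a bump
 * allocator cannot free individual blocks. */
void part_reset(unsigned p) {
    if (p < NUM_PARTITIONS) {
        heaps[p].next = 0;
    }
}
```

Requests that are statically assigned to different partitions touch disjoint memories and disjoint allocator state, which is what would allow an HLS tool to schedule them in parallel. Choosing NUM_PARTITIONS and the allocator per partition is the step that the text above proposes to automate.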

The STAR framework provides a substantial reduction in BRAM usage. However, an application’s overall performance degrades. Recall that this framework replaces all stack-allocated arrays with dynamically allocated arrays only if there is a reduction in reserved memory. By modifying STAR to select only a subset of stack-allocated arrays to replace, performance degradation could be lessened while still reducing the overall BRAM usage. Modifying STAR to reflect this would require a method to explore this large solution space. This is a combinatorial optimization problem, for which a number of algorithms exist to explore the solution space (e.g., branch and bound algorithms [65], etc.).

Appendix A

BRAM Fragmentation with dmbenchhls and libmem

A.1 Memory Patterns

Table A.1: BRAM Usage and Effective Memory Bits for the memory-patterns in dmbenchhls when used with libmem

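| Memory pattern | Metric | gnumem | linmem | bitmem | lutmem | budmem |
|---|---|---|---|---|---|---|
| random | MemBits | 526848 | - | 596480 | 544663 | 1106432 |
| | BRAMs | 36 | - | 40 | 51 | 74 |
| square | MemBits | 526336 | - | 595968 | 544151 | 1105920 |
| | BRAMs | 34 | - | 38 | 49 | 72 |
| triangular | MemBits | 526336 | 526336 | 595968 | 544151 | 1105920 |
| | BRAMs | 34 | 36 | 38 | 49 | 72 |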
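One way to read Table A.1 is as a utilization figure: the fraction of the reserved BRAM capacity that holds effective memory bits. The sketch below assumes that every BRAM in the table is an M20K block of 20480 bits (the block type of the Stratix V device used in Section 8.3) and that MemBits counts effective bits; the function name bram_utilization is hypothetical.

```c
#include <assert.h>

#define M20K_CAPACITY_BITS 20480.0 /* 20 Kb per M20K block */

/* Fraction of the reserved BRAM capacity occupied by mem_bits. */
double bram_utilization(double mem_bits, unsigned brams) {
    assert(brams > 0);
    return mem_bits / (brams * M20K_CAPACITY_BITS);
}
```

Under these assumptions, the random pattern with gnumem (526848 bits in 36 BRAMs) gives a utilization of roughly 0.71, while lutmem (544663 bits in 51 BRAMs) gives roughly 0.52, so lutmem leaves a larger share of its reserved blocks unused.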

In document Dynamic Memory Allocation Techniques for High-Level Synthesis. Nicholas V. Giamblanco (Page 75-82)