
7.3 Evaluation

7.3.3 Comparison to State-of-the-Art

Table 7.1: ASAP’s analysis runtime for each benchmark.

In this section, we compare the quality of result of ASAP to (1) Liang et al.’s Hi-DMM heap-partitioning algorithm [33] and (2) Winterstein’s HeapAnalyzer [51]. The heap-partitioning techniques presented in [33] and [51] are tailored to operate with their specialized (and inflexible) allocation mechanisms and environment, prohibiting a one-to-one comparison to our work. However, we comment on the results reported in [33] and [51], and compare them to our findings with ASAP. Table 7.1 outlines the total execution time to perform our analysis and modification (i.e. the time spent with Valgrind’s memcheck and our static analysis).

Hi-DMM’s heap-partitioning technique reduces the cycle latency (and thereby improves the overall performance) of a matrix-multiply application by 6.01% when partitioning the heap in two [33, p. 11]. The authors did not report the area impact that heap-partitioning imposes on the resulting circuit; however, we suspect that ∼2× more allocator area is required. Additionally, the authors did not report the elapsed time for analysis and modification. We evaluated ASAP with another matrix-multiply application; the results are given in Table 7.2. Restricting ASAP to two partitions, a speed-up of 1.66× was observed. When the heap was maximally partitioned (i.e. partitioned into 4 heaps), ASAP provided a speed-up of 2.03×. Additionally, our framework automatically detects all safe heap-partitioning opportunities and does not require user input, unlike [33].

Winterstein’s HeapAnalyzer achieved an average speed-up of 2.72× for memory-intensive applications when partitioning the heap [51, p. 104]. Additionally, its area utilization scales linearly with the degree of heap partitioning. However, HeapAnalyzer’s heap-partitioning procedure employs symbolic execution, which is known to be slow: HeapAnalyzer’s analysis takes an average runtime of 476.1 s (7.935 minutes). With our benchmark suite (selecting those benchmarks which are suitable for heap partitioning: matrixmult, pca, dfs and the memory patterns), ASAP achieves an average speed-up of 2.84×. ASAP also exhibits a linear relationship between allocator area and heap partitioning.

Lastly, ASAP’s analysis and modification runtime is fast: the analysis took an average of 16.3 ms across the benchmarks evaluated in this work. The runtimes are tabulated in Table 7.1. ASAP’s analysis is 3-5 orders of magnitude faster than the average runtime of HeapAnalyzer, with an average runtime speed-up of 28,899×, while providing the same or better quality of result for heap partitioning.

From the results collected in this section, we draw the following conclusions: (1) applications which issue many memory requests (i.e. calls to malloc and free) benefit from ASAP, and (2) conditional invocations of malloc and free may not reveal the benefits of using ASAP, since the reduction in cycle latency from the allocation mechanisms may be hidden by unpredictable control-flow.

7.4 Summary

In this chapter, we explore a method whose primary aim is to improve the performance of applications which use dynamic memory and are to undergo the high-level synthesis process. Our work also attempts to reduce the required on-chip BRAM reservation (which will serve as the heap). We encapsulate our method as a framework, which (1) uses dynamic analysis with known inputs to the program to determine the minimum required heap-depth, (2) automatically and safely partitions a program’s heap, and (3) assigns independent dynamic memory allocators to these heap-partitions to increase parallelism (thereby reducing the cycle latency). With our framework, we are able to improve the performance of memory-request-intensive applications by ∼2×. We now move on to techniques that aim to optimize for area; we explore a technique to recover over-reserved block RAMs (BRAMs) by converting stack-allocated arrays to dynamic memory calls.
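Step (1) of the framework can be illustrated with a minimal sketch in C. It assumes the dynamic analysis yields an ordered trace of allocation and release events, and it replays that trace to find the peak number of simultaneously live bytes, which bounds the heap depth needed for that input. The trace format, the `trace_event` type, the `MAX_IDS` limit and the function name are hypothetical; this is not ASAP's actual implementation, which relies on Valgrind's memcheck.

```c
#include <stddef.h>

/* One event in a hypothetical allocation trace (not ASAP's real format). */
typedef struct {
    int    is_alloc;  /* 1 = malloc, 0 = free */
    int    id;        /* identifies the allocation being made or released */
    size_t bytes;     /* request size; ignored for free events */
} trace_event;

#define MAX_IDS 64    /* assumed upper bound on distinct allocation ids */

/* Replay the trace and return the peak number of simultaneously live bytes. */
static size_t peak_live_bytes(const trace_event *ev, size_t n)
{
    size_t size_of[MAX_IDS] = {0};  /* live size per allocation id */
    size_t live = 0, peak = 0;

    for (size_t i = 0; i < n; i++) {
        int id = ev[i].id;
        if (id < 0 || id >= MAX_IDS)
            continue;               /* skip ids outside the assumed range */
        if (ev[i].is_alloc) {
            size_of[id] = ev[i].bytes;
            live += ev[i].bytes;
            if (live > peak)
                peak = live;
        } else {
            live -= size_of[id];    /* releasing an unknown id subtracts 0 */
            size_of[id] = 0;
        }
    }
    return peak;
}
```

For a trace that allocates 64 bytes, allocates 32 bytes, frees the first block and then allocates 16 bytes, the peak is 96 bytes even though only 48 bytes are live at the end. A heap sized this way is only valid for the inputs that were profiled.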

Chapter 7. Automatic Sizing and Partitioning of Dynamic Memory Heaps 59

Bench       Heaps  BRAMs  MBits (Kb)  DSPs  ALMs   ∆ALM   ∆ALM1  Fmax (MHz)  Lat. (Cycles)  Wall Clk. (µs)  Speed-Up
matrixmult  1      32     150.55      32    4569   1.00×  1.00×  174.16      16908          97.08           1.00×
            2      42     295.55      32    6117   1.34×  1.26×  190.33      11103          58.34           1.66×
            3      52     440.55      32    6647   1.45×  1.30×  179.53      9639           53.69           1.81×
            4      62     585.55      32    7118   1.56×  1.33×  171.09      8175           47.78           2.03×
pca         1      26     145.77      2     1877   1.00×  1.00×  193.42      10539          54.49           1.00×
            2      37     290.77      2     2223   1.18×  1.00×  195.58      9794           50.08           1.09×
            3      46     435.77      2     3038   1.62×  1.25×  229.78      9078           39.51           1.38×
            4      56     580.77      2     3428   1.83×  1.27×  235.24      8859           37.66           1.45×
dfs         1      10     145.00      0     1058   1.00×  1.00×  261.37      1218           4.66            1.00×
            2      20     290.00      0     1560   1.47×  1.14×  238.66      999            4.19            1.11×
            3      30     435.00      0     1977   1.87×  1.21×  249.07      915            3.67            1.27×
            4      40     580.00      0     2547   2.41×  1.42×  238.55      867            3.63            1.28×
rsa         1      36     149.74      90    11307  1.00×  1.00×  123.84      1890461        15265.35        1.00×
            2      46     294.74      90    12424  1.10×  1.07×  90.03       1889516        20987.63        0.73×
            3      56     439.74      90    12639  1.12×  1.06×  131.48      1889179        14368.57        1.06×
            4      66     584.74      90    12781  1.13×  1.04×  116.95      1889088        16152.95        0.95×
histogram   1      74     1142.00     0     2210   1.00×  1.00×  264.27      409655         1550.14         1.00×
            2      77     1159.00     0     2734   1.24×  1.08×  273.37      409582         1498.27         1.03×
            3      80     1176.00     0     3223   1.46×  1.14×  273.67      409563         1496.56         1.04×
            4      83     1193.00     0     3692   1.67×  1.20×  260.01      409544         1575.11         0.98×
hash        1      78     1216.25     7     1801   1.00×  1.00×  217.11      396242         1825.07         1.00×
            2      152    2376.25     7     2405   1.34×  1.14×  219.35      379848         1731.70         1.05×
            3      226    3536.25     7     2936   1.63×  1.24×  214.27      375755         1753.65         1.04×
            4      300    4696.25     7     3455   1.92×  1.34×  206.83      371662         1796.94         1.02×

Table 7.2: Area and Performance Metrics when using ASAP with LegUp with the updated dmbenchhls. ∆ALM tabulates the area consumed for each heap divided by the baseline ALM consumption (1 Heap). ∆ALM1 computes the same comparison, except with the area of libbitmem removed for each additional allocator.
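The performance columns of Table 7.2 can be cross-checked with a short sketch. It assumes, as the tabulated values suggest, that the wall-clock time is the cycle latency divided by Fmax (cycles divided by MHz gives microseconds) and that the speed-up is the 1-heap wall-clock time divided by the partitioned wall-clock time. The function names are illustrative only.

```c
/* Wall-clock time in microseconds: cycles / MHz = microseconds. */
static double wall_clock_us(double latency_cycles, double fmax_mhz)
{
    return latency_cycles / fmax_mhz;
}

/* Speed-up of a partitioned design relative to the 1-heap baseline. */
static double speedup(double baseline_us, double partitioned_us)
{
    return baseline_us / partitioned_us;
}
```

For matrixmult, `wall_clock_us(16908, 174.16)` and `wall_clock_us(8175, 171.09)` reproduce the 1-heap and 4-heap wall-clock entries to two decimal places, and their ratio matches the reported 2.03× speed-up. The same relation also explains the rsa rows: the cycle latency barely changes with partitioning, so the speed-up is dominated by the variation in Fmax.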

Chapter 8

Replacing Stack Allocated Arrays with Dynamic Memory Calls

Most modern HLS tools require that arrays be statically sized. In software, arrays declared in functions are stack allocated, and therefore only contribute to the program’s memory footprint while the corresponding function executes. However, an HLS-generated circuit reserves BRAMs to implement function-local, stack-allocated arrays. These BRAMs are not reused to implement other local arrays (in other functions), thereby contributing a fixed and constant amount to the total circuit memory usage.

In this chapter, we explore an alternative to this: we propose a methodology which replaces stack-allocated arrays with dynamically allocated arrays, in an effort to (a) reduce the total BRAM usage and (b) enable reuse of memory between various functions.

8.1 Motivation

Suppose we wish to investigate four different hash functions, and check how many collisions (if any) exist, when we are hashing to two hash-tables of different sizes. We can implement this program in a high-level language (such as C or C++), and hopefully profit from the HLS methodology. For each hash function, we can implement a test-bench function. Each test-bench would have two statically sized arrays (which are stack-allocated), to implement the two hash-tables. An example is depicted in Fig. 8.1.

Depending on how large the hash-tables are and how many values are being hashed, these hash-table arrays could be large. Recall that these stack-allocated arrays will be realized using on-chip BRAMs [5].

This could potentially consume a majority of on-chip BRAM resources. Instead of declaring stack-allocated arrays, dynamic memory allocation requests could be used in their place, as a method to reduce the total reserved memory. Since HLS tools will schedule these functions to execute sequentially, the heap required for the dynamic memory allocation algorithm only needs to be as large as the largest stack-allocated reservation of any of these independent functions. For example, if each function in Fig. 8.1 requires two allocated arrays, the total memory reservation with stack-allocated arrays would be (4 functions) × (1800 × (4 bytes) + 65536 × (4 bytes)) = 1077376 bytes. However, if the stack-allocated arrays were replaced with dynamic memory requests, we would only need to host a heap that is 1/4 as large, totalling 269344 bytes. Reducing the total memory reservation by this factor is beneficial for large designs, or for designs which require a lot of scratch-pad memory. However, we must address the following issues:

• When should we replace stack-allocated arrays with dynamic memory allocation requests?

• How large should we make the heap?

• What are the performance and area impacts on the resulting circuit?
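The arithmetic in the example above, and the kind of rewrite it motivates, can be sketched in C as follows. The table sizes come from the text; everything else (the function names, the identity hash used for illustration, and the body of the test-bench) is a hypothetical stand-in for the code in Fig. 8.1, which is not reproduced here. The sketch is plain software C and does not model any particular HLS allocator.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define SMALL_TABLE 1800u    /* entries, from the example in the text */
#define LARGE_TABLE 65536u

/* Static reservation: every function's arrays get their own memory. */
static size_t sum_bytes(const size_t *per_fn, size_t n)
{
    size_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += per_fn[i];
    return total;
}

/* Shared heap: functions run one after another, so the largest one decides. */
static size_t max_bytes(const size_t *per_fn, size_t n)
{
    size_t largest = 0;
    for (size_t i = 0; i < n; i++)
        if (per_fn[i] > largest)
            largest = per_fn[i];
    return largest;
}

/* Illustrative hash function only; the hash functions under test are not given. */
static uint32_t identity_hash(uint32_t key) { return key; }

/* Hypothetical test-bench after the rewrite: both hash-tables live on the heap
 * and are released before returning, so the next test-bench can reuse the space.
 * Returns the number of collisions, or -1 if an allocation fails. */
static long count_collisions(uint32_t (*hash)(uint32_t), uint32_t n_keys)
{
    uint32_t *small_tbl = malloc(SMALL_TABLE * sizeof *small_tbl);
    uint32_t *large_tbl = malloc(LARGE_TABLE * sizeof *large_tbl);
    long collisions = -1;

    if (small_tbl != NULL && large_tbl != NULL) {
        memset(small_tbl, 0, SMALL_TABLE * sizeof *small_tbl);
        memset(large_tbl, 0, LARGE_TABLE * sizeof *large_tbl);
        collisions = 0;
        for (uint32_t k = 0; k < n_keys; k++) {
            uint32_t h = hash(k);
            if (small_tbl[h % SMALL_TABLE]++)  /* slot already occupied */
                collisions++;
            if (large_tbl[h % LARGE_TABLE]++)
                collisions++;
        }
    }
    free(large_tbl);   /* free(NULL) is a no-op */
    free(small_tbl);
    return collisions;
}
```

With four test-benches that each need (1800 + 65536) × 4 = 269344 bytes, `sum_bytes` gives the 1077376-byte static reservation and `max_bytes` gives the 269344-byte heap. The saving depends on the functions never being live at the same time, which is exactly what the first question in the list above has to establish.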