
Figure 7.8: Illustration of ASAP’s partitioning. (a) represents an application which uses one heap, (b) the same application, with partitioned heaps.

7.3 Evaluation

We evaluated our framework with LegUp [5]; however, any HLS tool that uses LLVM [14] as its compilation environment can be used with the ASAP framework, since the heap-partitioning analysis and modifications occur at the LLVM-IR level. We set LegUp to target an Intel/Altera Stratix V FPGA (5SGXEA7N2F45C2) with a target clock period of 1 ns, using Quartus Prime 18.0. We use the Stratix V ALM (adaptive logic module) count as our area metric. We characterize performance through cycle latency, Fmax, and wall-clock time (Twall = cycle latency / Fmax).
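As a minimal sketch of the wall-clock metric defined above, the helper below divides a cycle count by a clock frequency. The function name and the choice of MHz and microseconds as units are assumptions made for illustration; they do not come from the evaluation scripts.

```c
#include <assert.h>
#include <stdint.h>

/* Wall-clock time = cycle latency / Fmax.
 * With Fmax expressed in MHz (cycles per microsecond), the quotient is in
 * microseconds. The name and units are illustrative assumptions. */
double wall_clock_us(uint64_t cycle_latency, double fmax_mhz)
{
    assert(fmax_mhz > 0.0);
    return (double)cycle_latency / fmax_mhz;
}
```

For instance, a hypothetical design that needs 1000 cycles at 100 MHz would take `wall_clock_us(1000, 100.0)`, that is, 10 microseconds.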

We chose to use libbitmem from libmem. We selected this allocation scheme for its low ALM consumption (it is desirable to keep ALM consumption low, especially when heap-partitioning) and high Fmax [9]. However, any allocation mechanism can be used. Using LegUp, we synthesized libbitmem and isolated the synthesized hardware to determine its area utilization and minimum clock period. The allocator requires 349 ALMs and has a maximum operating frequency of 461.5 MHz.

We perform two studies in which we (1) investigate ASAP’s abilities with two benchmark suites, a suite of memory access patterns and dmbenchhls, and (2) compare ASAP to state-of-the-art dynamic memory allocation frameworks [33], [51].

7.3.1 Evaluation of ASAP with dmbenchhls

First, we evaluate the ASAP framework with three memory patterns, inspired by dmbenchhls and [54], to verify that our analysis can handle a variety of programs and to explore the performance improvements these memory patterns allow. The three memory patterns are labelled random4, triangle4 and square4.

Chapter 7. Automatic Sizing and Partitioning of Dynamic Memory Heaps 53

random4: Here, four compute kernels randomly generate malloc requests at runtime. Each randomly generated malloc is given a randomly generated request size as input. Each request is also given a random lifetime, which dictates the number of iterations to wait until free is invoked on that request. The kernels are independent of one another.
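The random4 behaviour described above can be sketched as a single kernel in C. This is an illustration under stated assumptions and not the benchmark source: the slot count, the size and lifetime ranges, the 50% request probability and the small linear congruential generator are all hypothetical choices made so that the sketch is reproducible.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define MAX_LIVE 16   /* hypothetical cap on outstanding requests per kernel */

typedef struct {
    void *ptr;        /* NULL when the slot is unused */
    int   lifetime;   /* iterations left before free() is invoked */
} request_t;

/* Small linear congruential generator so the sketch is reproducible. */
static uint32_t next_random(uint32_t *state)
{
    *state = *state * 1664525u + 1013904223u;
    return *state >> 16;
}

/* One random4-style kernel. Returns the number of requests still live at
 * the end, which is 0 because leftovers are drained before returning. */
int random_kernel(uint32_t seed, int iterations)
{
    request_t live[MAX_LIVE] = {0};
    int outstanding = 0;

    assert(iterations >= 0);
    for (int it = 0; it < iterations; ++it) {
        /* Age every live request and free those whose lifetime expired. */
        for (int s = 0; s < MAX_LIVE; ++s) {
            if (live[s].ptr != NULL && --live[s].lifetime <= 0) {
                free(live[s].ptr);
                live[s].ptr = NULL;
                --outstanding;
            }
        }
        /* Randomly decide whether to issue a malloc in this iteration. */
        if (next_random(&seed) % 2u == 0u) {
            for (int s = 0; s < MAX_LIVE; ++s) {
                if (live[s].ptr == NULL) {
                    size_t size = 1u + next_random(&seed) % 64u; /* random size */
                    void *p = malloc(size);
                    if (p != NULL) {
                        live[s].ptr = p;
                        live[s].lifetime = 1 + (int)(next_random(&seed) % 8u);
                        ++outstanding;
                    }
                    break;
                }
            }
        }
    }
    /* Drain anything that is still allocated. */
    for (int s = 0; s < MAX_LIVE; ++s) {
        if (live[s].ptr != NULL) {
            free(live[s].ptr);
            --outstanding;
        }
    }
    return outstanding;
}
```

The four kernels of random4 would then correspond to four such calls with different seeds, for example `random_kernel(1u, 1000)` through `random_kernel(4u, 1000)`, none of which shares pointers with the others.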

triangle4: This memory pattern iteratively requests memory upfront and then iteratively releases the reserved memory. The process can vary in the following ways: request sizes can be constant or computed by a function (e.g. randomly generated, or linear and increasing), and memory need not be released in the order in which it was requested. Our implementation has four independent compute kernels, each of which iteratively requests memory in a linear fashion, where the loop iteration index dictates the request size. Additionally, we release memory in the same order in which it was allocated.
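A minimal sketch of one triangle4-style kernel follows, assuming a hypothetical request count of eight and a request size of `(i + 1) * sizeof(int)` bytes at loop index `i`. The function name and return value are illustrative and are not taken from the benchmark.

```c
#include <assert.h>
#include <stdlib.h>

#define N_REQUESTS 8  /* hypothetical number of requests per kernel */

/* One triangle4-style kernel: request everything upfront, then release.
 * Returns the total number of bytes requested, or 0 if a malloc failed. */
size_t triangle_kernel(void)
{
    void *blocks[N_REQUESTS];
    size_t total = 0;

    /* Ramp up: the request size grows linearly with the loop index. */
    for (int i = 0; i < N_REQUESTS; ++i) {
        size_t size = (size_t)(i + 1) * sizeof(int);
        blocks[i] = malloc(size);
        if (blocks[i] == NULL) {
            while (i-- > 0) {          /* undo the requests made so far */
                free(blocks[i]);
            }
            return 0;
        }
        total += size;
    }
    assert(total > 0);

    /* Ramp down: release in the same order the blocks were allocated. */
    for (int i = 0; i < N_REQUESTS; ++i) {
        free(blocks[i]);
    }
    return total;
}
```

With these assumed parameters each kernel holds all eight blocks at its peak, which is what makes this pattern demanding for a single shared allocator.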

square4: This pattern requests memory, executes program logic, and then releases the memory immediately afterwards. This request-do-release pattern is iterative. As with random4 and triangle4, the request size can be constant or produced by a function. Our implementation has four independent compute kernels, each of which requests memory based on a loop index, executes program logic (which does not contain any other memory requests), and then releases the memory.
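The request-do-release loop can be sketched as follows. The "program logic" here (filling the buffer and folding its last element into a checksum) is a placeholder assumption, as are the function name and the size rule of `i + 1` integers at loop index `i`.

```c
#include <assert.h>
#include <stdlib.h>

/* One square4-style kernel. Returns a checksum of the placeholder
 * computation, or -1 if a malloc failed. */
long square_kernel(int iterations)
{
    long checksum = 0;

    assert(iterations >= 0);
    for (int i = 0; i < iterations; ++i) {
        size_t count = (size_t)i + 1;            /* size from the loop index */
        int *buf = malloc(count * sizeof *buf);  /* request */
        if (buf == NULL) {
            return -1;
        }
        for (size_t k = 0; k < count; ++k) {     /* do: no other requests here */
            buf[k] = (int)k;
        }
        checksum += buf[count - 1];
        free(buf);                               /* release right after use */
    }
    return checksum;
}
```

Unlike the triangle sketch, at most one block per kernel is live at any time, so contention on a shared allocator comes from the frequency of requests and not from the number of outstanding blocks.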


Figure 7.9: (a) - (d) explore Cycle Latency, Fmax, Wall-Clock Time and Area Utilization for the three memory access pattern benchmarks.

As noted, each memory-pattern benchmark has four independent compute kernels which perform the same operation, and therefore each benchmark can have its heap partitioned into a maximum of four.

This provides a test for ASAP to (1) check whether four heaps can be detected (i.e. as connected components), and (2) evaluate performance improvements, in the hope of achieving a ∼4× improvement in runtime.
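To illustrate what detecting heaps as connected components could look like, the sketch below counts components with a union-find structure. The graph model is an assumption made only for this example: nodes stand for malloc and free call sites, and an edge joins two sites that may touch the same pointer. This passage does not describe ASAP’s analysis at that level of detail, and all names are hypothetical.

```c
#include <assert.h>

#define MAX_NODES 64  /* hypothetical cap on the number of call sites */

static int parent[MAX_NODES];

/* Find the representative of x, shortening the path as we go. */
static int find_root(int x)
{
    while (parent[x] != x) {
        parent[x] = parent[parent[x]];  /* path halving */
        x = parent[x];
    }
    return x;
}

/* Count connected components of an undirected graph given as an edge list.
 * Each component is a candidate heap partition under the assumed model. */
int count_components(int n_nodes, const int edges[][2], int n_edges)
{
    int components = n_nodes;

    assert(n_nodes >= 0 && n_nodes <= MAX_NODES);
    for (int i = 0; i < n_nodes; ++i) {
        parent[i] = i;
    }
    for (int e = 0; e < n_edges; ++e) {
        assert(edges[e][0] >= 0 && edges[e][0] < n_nodes);
        assert(edges[e][1] >= 0 && edges[e][1] < n_nodes);
        int a = find_root(edges[e][0]);
        int b = find_root(edges[e][1]);
        if (a != b) {                   /* merge two previously separate sets */
            parent[a] = b;
            --components;
        }
    }
    return components;
}
```

Under this model, four independent kernels with one malloc site and one free site each give eight nodes and four edges, hence four components. A single pointer shared between two kernels would add an edge and reduce the count to three.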

Additionally, we explored the effects of sweeping the number of user-requested heap partitions from one (no partitioning) to four (maximum possible partitioning). We show results in Fig. 7.9.

Our results indicate that ASAP was able to detect all four possible heap partitions for all of these memory patterns. This is evident from the incremental performance improvement as we swept through all possible heap-partition counts. For triangle4 and square4, maximally partitioning the heap provides a runtime improvement of 5.11× and 5.77×, respectively. The largest contribution to the performance improvement is the reduction in the number of cycles required to execute the application. With one heap, only one allocator manages the heap, which places strain on the management strategy (i.e. searching for free memory). Once the heap is partitioned, each heap-specific allocator serves only the program segments assigned to its partition, so it no longer faces the same strain. Reducing this strain reduces the number of cycles required to manage memory. When the heap is maximally partitioned, ASAP can reduce the cycle latency by ∼4×. Although the critical path of the application (with no heap partitioning) lies within the allocator and the heap (which is mapped to BRAM during the HLS process), partitioning the heap may yield improvements to Fmax. When partitioning the heap into four, wall-clock time improvements for random4 are modest: the maximum speed-up is ∼1.4×, and Fmax drops to ∼0.93× of the original. We attribute this to random4’s malloc and free invocations, which sit within control flow and are exercised randomly at runtime. Partitioning the heap may provide little reduction in cycle latency if malloc and free invocations are not deterministic (i.e. the number of times malloc is called and/or the size of the requests are variable).
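Since wall-clock time is cycle latency divided by Fmax, the wall-clock speed-up factors into a cycle-count ratio and an Fmax ratio. The short derivation below makes this explicit and applies it to the approximate random4 figures quoted above. The symbols S, C and F with superscripts for the number of partitions are notation introduced only for this illustration, and the resulting cycle ratio is a back-of-the-envelope estimate, not a reported measurement.

```latex
\[
S_{\text{wall}}
  = \frac{T_{\text{wall}}^{(1)}}{T_{\text{wall}}^{(n)}}
  = \frac{C^{(1)} / F_{\max}^{(1)}}{C^{(n)} / F_{\max}^{(n)}}
  = \frac{C^{(1)}}{C^{(n)}} \cdot \frac{F_{\max}^{(n)}}{F_{\max}^{(1)}}
\]
\[
\text{random4, } n = 4:\qquad
1.4 \approx \frac{C^{(1)}}{C^{(4)}} \times 0.93
\;\Longrightarrow\;
\frac{C^{(1)}}{C^{(4)}} \approx \frac{1.4}{0.93} \approx 1.5
\]
```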

Lastly, the improvement in performance is not free: the increase in area utilization is approximately n× the size of the allocation mechanism, where n is the number of heap partitions. In these designs, the allocator is as large as the program logic, and we therefore see an average increase in area of ∼4×. However, if we exclude the area added by replicating the allocator n times, we observe a marginal area increase of ∼1.4×, which is the overhead of extra control logic in the program’s auto-generated state machine, which now has to control n allocators.

7.3.2 Evaluation of ASAP with dmbenchhls

We evaluate ASAP with dmbenchhls. However, we modified this suite in the following ways: (1) only memory-intensive applications from the suite were evaluated (hash and dfs), and (2) we added more applications which request memory at runtime and are memory intensive (rsa, matrixmult, histogram, and pca).

We modified hash to evaluate several hash algorithms for collisions with large datasets. As before, the hash map used for collision detection is dynamically generated. dfs dynamically generates a binary tree from user data and then traverses the tree to return an ordered list. rsa implements the Rivest-Shamir-Adleman public key encryption algorithm described in [60]. This implementation generates a public and private key pair and conducts the following process: (a) it encrypts a string of plaintext with the public key, producing ciphertext (which is stored dynamically), and (b) it decrypts the ciphertext with the private key and compares the original string to the decrypted message. matrixmult multiplies two matrices, where the resulting matrix is stored dynamically in an array. An operation is then carried out on the resultant matrix to check that the matrix operation is correct. histogram dynamically generates a histogram of user-defined data. The number of bins into which the data is separated is also user-defined. pca performs principal components analysis on user-provided data, and was taken from Stanford’s Phoenix [61]. We modified this implementation to be HLS-amenable (i.e. removed low-level operating system calls).
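As an illustration of the kind of dynamic-memory behaviour the dfs benchmark exercises, the sketch below builds a binary search tree with one malloc per element and walks it in order to produce a sorted list. It is not the dmbenchhls source: the node layout, the function names, the fixed traversal-stack depth and the choice of an iterative traversal (HLS flows often restrict recursion) are assumptions.

```c
#include <assert.h>
#include <stdlib.h>

#define MAX_DEPTH 64  /* hypothetical bound on tree height for the stack */

typedef struct node {
    int value;
    struct node *left, *right;
} node_t;

/* Insert a value iteratively; one malloc per element. A value is silently
 * dropped if malloc fails, which a real benchmark would likely report. */
static node_t *insert(node_t *root, int value)
{
    node_t *n = malloc(sizeof *n);
    if (n == NULL) {
        return root;
    }
    n->value = value;
    n->left = n->right = NULL;
    if (root == NULL) {
        return n;
    }
    for (node_t *cur = root;;) {
        node_t **next = (value < cur->value) ? &cur->left : &cur->right;
        if (*next == NULL) {
            *next = n;
            return root;
        }
        cur = *next;
    }
}

/* Build the tree from data, write the in-order sequence to out, and free
 * every node once it has been visited. Returns the number of values written. */
int tree_sort(const int *data, int n, int *out)
{
    node_t *root = NULL;
    node_t *stack[MAX_DEPTH];
    int top = 0, written = 0;

    for (int i = 0; i < n; ++i) {
        root = insert(root, data[i]);
    }
    node_t *cur = root;
    while (cur != NULL || top > 0) {
        while (cur != NULL) {             /* descend to the leftmost node */
            assert(top < MAX_DEPTH);
            stack[top++] = cur;
            cur = cur->left;
        }
        cur = stack[--top];
        out[written++] = cur->value;      /* visit */
        node_t *right = cur->right;
        free(cur);                        /* left subtree is already done */
        cur = right;
    }
    return written;
}
```

In a four-kernel arrangement like the one used in the evaluation, each kernel would build and free its own tree, so no node pointer crosses from one kernel to another.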

All benchmarks have four independent compute kernels which perform the operations described above. These compute kernels do not divide up the overall application calculation; they are independent copies of the same computation. Therefore, each benchmark can have its heap partitioned into a maximum of four. This evaluates whether ASAP can correctly detect the maximum number of partitions in real-life applications, while exploring performance and area impacts.

Results of ASAP with the modified dmbenchhls benchmark suite are tabulated in Table 7.2. A global trend in the Lat(Cycle) (cycle latency) column of Table 7.2 shows that, when sweeping through all partitioning possibilities, the number of clock cycles required by the program decreases as more partitions are used. The decrease in cycles indicates that ASAP was able to detect the maximum number of heap partitions. However, only three of the six benchmarks achieve a reasonable speed-up. These speed-ups range from 1.28× to 2.03×. matrixmult’s performance improved with ASAP by 2.03×. The majority of this benchmark’s execution time is spent allocating an empty, square matrix.

This memory-intensive task issues many memory requests to an allocator, performs a computation on the allocated space, and then issues many deallocation requests. The application resembles the triangle4 memory pattern. Hence, compared to using one heap (and equivalently, one allocator), heap partitioning reduces the overall strain on the allocation scheme’s management system, reducing the cycle latency (in this case, by half). Likewise, pca and dfs issue a large number of allocation and deallocation requests, as observed through the cycle reductions when sweeping through the partition possibilities. The magnitude of the latency reduction is smaller than for matrixmult, which indicates that these applications do not stress the underlying memory management system of a given allocator as heavily.

However, three benchmarks (rsa, histogram and hash) do not show a notable speed-up. rsa is known to be multiplication and compute heavy (e.g. its use of the Euclidean Algorithm [62] is compute heavy). This is demonstrated by its cycle latency and the number of DSPs it consumes. Although this application employs dynamic memory to store the ciphertext, the majority of the application runtime is spent on other logic. Therefore, the latency reduction achieved with ASAP does not provide a global benefit. The same is observed in the histogram and hash benchmarks.

ALM utilization increases approximately linearly with the number of heap partitions selected. With every new partition, a heap-specific allocator is generated, which contributes the allocator area plus additional overhead (for control logic), as seen in Table 7.2. However, larger applications can amortize the cost of multiple heap partitions, where utilization reaches 1.13× and 1.56× when 4 heap partitions are used. We also analyze the additional ALM overhead of each extra heap partition by removing the ALM consumption of the additional allocators (column ∆ALM1). Again, the ALM overhead is larger with more heap partitions; however, the relationship is not linear. As noted before, this increase is due to extra control logic for the heap-specific allocators. Although ASAP is able to provide the optimal heap-depth for an application through dynamic analysis, BRAM utilization (as well as the number of Memory Bits) grows linearly with the number of heap partitions. Valgrind’s memcheck determines the minimum heap-depth required for the application; it does not provide information on the heap-depth required for each heap partition (since it knows nothing of ASAP’s partitioning mechanism). Therefore, we make a conservative estimate and size each heap at the reported minimum heap-depth.