Dynamic Memory Allocation Techniques for High-Level Synthesis. Nicholas V. Giamblanco

Dynamic Memory Allocation Techniques for High-Level Synthesis

by

Nicholas V. Giamblanco

A thesis submitted in conformity with the requirements

for the degree of Master of Applied Science

Graduate Department of Electrical and Computer Engineering

University of Toronto

Abstract

Dynamic Memory Allocation Techniques for High-Level Synthesis

Nicholas V. Giamblanco

Master of Applied Science

Graduate Department of Electrical and Computer Engineering

University of Toronto

2020

The omission of support for several software-defined constructs within High-Level Synthesis (HLS) has hindered its broad adoption. Dynamic memory allocation is one feature that is not within the traditionally supported subset. This work expands the synthesizable subset of the C language for HLS tools by including dynamic memory allocation (i.e. malloc and free). We define a variety of high-level synthesis approaches to support malloc and free. We first provide a library of HLS-friendly allocation algorithms, libmem, which is studied for performance and area impacts. We also define a framework to improve the performance and area of dynamic memory allocation mechanisms in HLS applications. Our framework employs a variety of dynamic and static program analyses to identify the required size of the heap memory space, parallelize dynamic memory allocators and save precious on-chip BRAM. We evaluate our work with a variety of benchmarks, and demonstrate the usefulness of including dynamic memory allocation within HLS.

To my parents, Joseph and Gerarda, and my sister, Diana. Thank-you for helping me push forward with all your love and support. It’s been a wild ride so far.

Acknowledgements

Professor Jason Anderson, thank you for supervising me in pursuit of this research. Your guidance, insight and support provided an invaluable skill set that has helped me along as an individual and as a researcher. Thanks for always encouraging me to push to “new heights” and to not be afraid. You rock! I would also like to thank past and present members of Jason’s group for all your help. You guys also rock.

Many thanks go to Ian Taras, Dylan Stuart, Ciaran Bannon, Erik Tillberg and Austin Liolli. Thanks for sitting with me through the “good times, bad times. You know I had my share”. To all my colleagues and friends at the University of Toronto and Ryerson University, thank you for joining me for afternoon walks, “evening beers”, and all your help with academic and personal life.

Without Dr. Catherine Benes, I wouldn’t be the man I am today. I am indebted to her extreme kindness, helpfulness and support. Thank-you “T-Mom”.

I would like to thank Dr. Prathap Siddavaatam for believing in me through the high and low times; you are an inspiration. Your encouragement led me here. I hope that I can follow in your footsteps.

Dr. Joseph Geraci, thank you for those late-night discussions where we explored ideas in mathematics, sciences, and life. You are a gem, dude!

I’d like to thank my dear friends Eric Post & Patrick Beaudry, for sticking by my side despite the hundreds of kilometers which separate us, you two are “lit, fam” ®. Here’s to the future. To Dr. Susanna McCarthy, thanks for keeping me alive, well, and healthy. Miss you!

To the woman that I love very much, Jennifer Hao Yue Kou, thank-you for everything you do. You are magic.

I’d like to thank Benjamin, my dog. You helped me get up in the morning, and to take time for myself so that I don’t burn out. You really are a man’s best friend. Bork bork, bud!

Lastly, I’d like to thank my family for supporting me unconditionally. Tanto amore!

List of Tables

4.1 List of modern, C-based HLS tools and their support for synthesizing dynamic memory allocation constructs. Entries marked as -u indicate that the documentation is unclear.

5.1 Characteristics of dmbenchhls.

6.1 Fmax, ALM and BRAM usage for each dynamic memory allocation algorithm in isolation. This excludes heap-memory and global-memory used for book-keeping.

6.2 Effect of tree depth (12 vs 6 levels) on Cycle Latency, Fmax and Wall Clock Time of budmem with three memory patterns.

6.3 Design guidelines for specific memory access patterns. We identify that linmem has limited use cases with (*).

7.1 ASAP’s analysis runtime for each benchmark.

7.2 Area and Performance Metrics when using ASAP with LegUp with the updated dmbenchhls. ∆ALM tabulates the area consumed for each heap divided by the baseline ALM consumption (1 Heap). ∆ALM1 computes the same comparison, except with the area of libbitmem removed for each additional allocator.

8.1 Performance and Area Results when Stack-Allocated Arrays are Replaced with Dynamic Memory Allocation Algorithms, using STAR.

A.1 BRAM Usage and Effective Memory Bits for the memory-patterns in dmbenchhls when used with libmem.

A.2 BRAM Usage and Effective Memory Bits for the memory-patterns in dmbenchhls when used with libmem.

B.1 Exploration of BRAM Usage and Effective Memory Bits when ASAP is applied for the memory-patterns in dmbenchhls with bitmem.

List of Figures

2.1 An example of a C program (a) and the equivalent algorithm encoded in LLVM’s IR (b). The control flow of this program is outlined in (c).

2.2 Overview of LLVM Architecture.

2.3 Overview of an LLVM-based HLS Compiler.

3.1 Doug Lea’s malloc() implementation, where CHUNKs stored on the heap are used to identify reserved or free memory segments. (a) Demonstrates an empty heap, with the initialized doubly linked-list. (b) Demonstrates how the doubly linked-list is modified after a request is served.

3.2 Simplistic Behaviour of a Linear DMAS.

3.3 Representation of a bitmap allocator.

3.4 The structure of a Buddy Allocation scheme.

3.5 The general structure of a pre-allocated address allocation scheme.

5.1 The triangle memory pattern.

5.2 The square memory pattern.

5.3 The random memory pattern.

6.1 Hardware Architecture of gnu malloc.

6.2 Hardware Architecture of lin malloc.

6.3 Hardware Architecture of bit malloc.

6.4 Hardware Architecture of lut malloc.

6.5 (a) - (c) explore Cycle Latency, Fmax and wall-clock time for three memory access patterns.

6.6 (a) - (c) explore Cycle Latency, Fmax and wall-clock time for each benchmark in dmbenchhls.

6.7 (a) - (c) explore Cycle Latency, Fmax and wall-clock time for each benchmark in dmbenchhls, with the results for budmem removed.

7.1 ASAP Framework.

7.2 LLVM’s store instruction.

7.3 LLVM’s call instruction.

7.4 An example LLVM-IR program where m_i ↦ m_j.

7.5 Example graph of malloc()s and free()s within a program is shown in (a) (malloc()s are labelled M0-M3, free()s are labelled F0-F4). (b) visualizes the connected components of (a), identifying which malloc and free groups can be assigned an independent heap.

7.6 An example of the recursive DFS on malloc()’s users. (a) is an example program encoded in LLVM-IR. (b) depicts the result of the DFS from %5 tail call i8 * @malloc(). The DFS was able to locate the @free() call related to %5.

7.7 An example of ASAP’s heap partitioning. (a) shows the unmodified program, (b) shows ASAP’s modification before executing Line 31 of Algorithm 1 (here uc stands for user-chosen), (c) shows the fully transformed code (after Line 31 has been executed).

7.8 Illustration of ASAP’s partitioning. (a) represents an application which uses one heap, (b) the same application, with partitioned heaps.

7.9 (a) - (d) explore Cycle Latency, Fmax, Wall-Clock Time and Area Utilization for three memory access pattern benchmarks.

8.1 An example program which tests four hash functions for collisions, written in C.

8.2 An example of two different program call graphs. (a) demonstrates a call graph where functions have very shallow (or no) function invocations, (b) a deep and dependent call graph.

8.3 An example of STAR’s stack-allocated replacement technique. (a) shows the unmodified program, (b) shows STAR’s modification.

Chapter 1

Introduction

1.1 Motivation

Today’s computer architects cannot rely on Moore’s Law and Dennard Scaling for improvements in performance [1, 2, 3]. Performance gains governed by these laws have hit a limit, and we need to seek alternatives to address the computing needs of the present and future. Alternative computing platforms (e.g. Field-Programmable Gate Arrays (FPGAs), Coarse-Grained Reconfigurable Architectures (CGRAs), Application Specific Integrated Circuits (ASICs)) are being considered as a solution to meet our computing needs [2, 1]. These alternative computing platforms require users to describe applications in an application-specific language, such as a hardware description language or low-level assembly. Designing applications with these languages is tedious and error-prone, as it exposes fine-grained design parameters that the designer must set. Mapping an algorithm or application to a low-level language can also obscure its overall readability. Additionally, the user may be required to describe a custom computing architecture in addition to the application (e.g. [4]) or provide additional design details, complicating the design process. These drawbacks have placed a barrier on adoption and usage, even though these computing platforms can improve performance or reduce power consumption for a variety of applications [5] compared to traditional computing platforms.

One approach to lower this barrier to entry is High-Level Synthesis (HLS). HLS is a design methodology which can construct a hardware circuit (described in RTL) from an algorithm, usually described in C/C++ [6]. Only a limited subset of C or C++ is supported by modern high-level synthesis tools. Many commercial and open-source HLS tools ([5, 7, 8]) currently lack support for dynamic memory allocation, recursion and function pointers. All of these programming constructs are commonly applied by programmers; they are staples in a programmer’s toolkit. As a result, these unsupported features have slowed the broad uptake of HLS as an accepted design methodology.

Unsupported software constructs must be removed from any application which will undergo the HLS process. This is costly to the user, who must extensively rework the program to replace unsupported constructs with additional logic or other programmatic treatments. The modified application must then be verified for correct functionality, which is also time-consuming. Lastly, some programming constructs do not have an easy fix (e.g. realloc), and their replacements may consume more resources than necessary (e.g. excess BRAM may be reserved to make up for a memory bound known only at runtime). In this work, we focus on bridging the gap in HLS-supported C/C++ constructs by exploring dynamic memory allocation as a feature of HLS.

It is clear why HLS tools have forgone support for dynamic memory allocation: there is no obvious way to include this programming paradigm. High-level synthesis tools compile a high-level language to FPGAs or ASICs, which do not natively include an operating system. Therefore, explicit memory managers are required to replace the memory management behaviour of an operating system. Many memory management algorithms exist; however, it is not clear which algorithm would fare best in a given application or map best to either an FPGA or an ASIC. Additionally, the location and size of the memory to be managed are open questions. Should the memory be placed on-chip, or held off-chip? How large should the memory be?

There are several benefits if dynamic memory allocation schemes are made available within the HLS process. Users would no longer be required to undertake massive code overhauls, reducing design time in exchange for some cost in circuit area and performance. There would be no need to over-reserve memory; users can request the memory they need at runtime. Application code remains portable (users no longer need to modify the application to suit an HLS tool), removing a source of software (and, in turn, hardware) bugs.

1.2 Contributions

In this thesis, we add support for dynamic memory allocation constructs. We develop a tool-agnostic dynamic memory allocation library, libmem, which implements five memory management algorithms. Four of these algorithms come from the literature on dynamic memory allocation in software. The fifth is a novel memory management algorithm that we designed, inspired by that literature as well as by software techniques which are HLS-friendly. Each algorithm was developed in HLS-friendly C and optimized for the HLS process. We explore how the usage of these algorithms affects the area and performance of an HLS-generated circuit. All allocators in libmem employ BRAMs as heap memory.

We also introduce an automatic, tool-independent analysis and modification suite, ASAP (Automatic Sizing and Partitioning of Dynamic Memory Heaps), which (a) automatically resolves unsupported dynamic memory constructs with a user-selected replacement scheme from libmem, (b) determines the minimum memory depth for a program, and (c) applies a performance-driven optimization to these allocation mechanisms, parallelizing dynamic memory allocators by automatically partitioning the heap and replicating a user-selected allocator for each heap partition.

Lastly, we introduce STAR (Stack Allocated Array Replacement), an analysis and modification technique which can replace stack-allocated arrays in C and C++ with a dynamic memory allocation request of the required static size, in hopes of reducing the reservation of on-chip BRAMs.

We present a benchmark-suite to aid in the evaluation of dynamic memory allocation schemes, dmbenchhls. This suite consists of two benchmark categories, (1) dynamic memory request patterns and (2) realistic applications which are suited for acceleration purposes and employ dynamic memory allocation.

We evaluate libmem, ASAP and STAR with LegUp [5], developed at the University of Toronto. To summarize the contributions of this thesis:

• A dynamic memory allocation library of five memory allocation schemes tailored for HLS:

– The implementation of four HLS-optimized, software-defined memory allocation schemes from literature.

– A novel dynamic memory allocation scheme.

• An automatic heap sizing and partitioning approach, ASAP, to improve the performance of dynamic memory allocation in HLS applications while reducing the reservation of on-chip BRAM.

• Automated conversion of stack-allocated arrays to heap-allocated memory, STAR, to further reduce BRAM usage.

• A benchmark suite, dmbenchhls, which hosts a variety of applications that use dynamic memory allocation constructs and explore different memory request patterns.

A preliminary version of a portion of the work presented in this thesis has been published in [9] and [10].

1.3 Thesis Organization

The remainder of this thesis is organized as follows. Chapter 2 provides an overview of compiler technologies, program analysis techniques, a specific compilation framework (LLVM), HLS and dynamic memory allocation. Chapter 3 outlines the algorithmic description of four dynamic memory allocation mechanisms from the literature and provides an algorithmic description of our new dynamic memory allocation mechanism, lutmem. Chapter 4 reviews related work. Chapter 5 describes the memory request patterns and real-life applications used to evaluate the high-level synthesis of dynamic memory allocation algorithms. Chapter 6 details how we implemented the algorithms presented in Chapter 3 as an HLS-friendly library, and provides experimental results with this library and the benchmarks outlined in Chapter 5. We introduce a method to improve the performance of dynamic memory allocators in Chapter 7. We then explore a methodology to reduce over-utilization of on-chip BRAM through the employment of dynamic memory allocation in Chapter 8. Chapter 9 concludes this thesis with a discussion of possible future work.

Chapter 2

Background

We first review compiler technologies and associated program analysis techniques used by compilers. We then review HLS from a generic perspective (not tied to a specific HLS tool). Although we implemented our work within the LegUp HLS tool [5], we keep our discussion general, since our techniques are HLS-compiler agnostic¹. We then review the concept of dynamic memory allocation.

2.1 Compiler Technologies

Compilers facilitate the transformation from one programming language (e.g., C or C++), referred to as the source, to another language (e.g., the machine code of an instruction-set architecture (ISA) such as x86 or ARM), referred to as the target [11, 12]. Compilers initiate the transformation process by parsing a source into an intermediate representation (IR)². A compiler’s IR provides a way to represent, reason about and optimize a program while being independent of the source and target. For example, the compiler can implement a single analysis which inspects the IR for opportunities to improve performance, instead of separate analyses for each supported source or target language. Compilers perform a suite of analyses on a user program to warn users of misleading behaviour and syntactical errors, or to improve performance and security. Finally, the compiler maps the IR to a user-selected target. Many compiler frameworks exist, most notably GCC (the GNU Compiler Collection) [13] and LLVM (Low-Level Virtual Machine) [14]; we focus our attention on the latter, as many HLS tools are built within LLVM’s compiler framework [5, 7, 8]. As we discuss the LLVM framework below, we also detail associated compiler terms which are used in this thesis.

¹ Some of the work in this thesis requires that the HLS compiler be built around the LLVM framework. We will explicitly outline this requirement when necessary.

² For more information on this process, we direct the interested reader to [11, 12].

2.1.1 LLVM

The LLVM compiler framework [14] hosts a suite of compiler tool-chains, enabling the compilation of any input language to any target, provided there exists a frontend for the desired input language and a backend for the desired target. We now describe how LLVM compiles an input language into LLVM-IR.

LLVM’s Intermediate Representation

LLVM’s IR represents programs in a strongly-typed, reduced instruction-set computing (RISC) style assembly language [15, 14]. LLVM’s IR is input-language and ISA agnostic. During IR construction, the frontend maps an application encoded in the input language to a sequence of the IR’s instructions. These are simple instructions such as add, shift-left, branch, etc. Each instruction may have zero or more inputs, and zero or one outputs. The output of an instruction is assigned to a register. The set of registers which can be accessed and assigned is unlimited. Once a register has been assigned by an instruction, it cannot be reassigned. For example, if we wish to add three numbers a, b, c, LLVM-IR will represent this as:

; The next two commented instructions would be illegal in LLVM-IR.
; %sum = add i32 %a, %b   ; This is okay.
; %sum = add i32 %sum, %c ; Cannot reassign %sum!
; End of illegal operation.

; The next two instructions are a legal sequence.
%sum  = add i32 %a, %b
%sum2 = add i32 %sum, %c
; End of legal sequence.

This representation of the IR is called static single assignment (SSA) form [16]. Expressing a low-level program in SSA form enables many program analyses [17].
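As a loose analogy in C (an illustration added here, not an LLVM construct), the single-assignment rule resembles declaring every intermediate value const, so that the compiler rejects any attempt to redefine it. The function name add3_ssa is hypothetical.

```c
/* Adds three numbers the way SSA-form IR would: every intermediate value
 * is defined exactly once, which `const` lets the C compiler enforce. */
int add3_ssa(int a, int b, int c)
{
    const int sum = a + b;      /* corresponds to %sum  = add %a, %b   */
    const int sum2 = sum + c;   /* corresponds to %sum2 = add %sum, %c */
    /* Writing `sum = sum + c;` here would be rejected by the compiler,
     * because sum has already been defined. */
    return sum2;
}
```

The analogy is imperfect, since C variables are storage locations rather than values, but it conveys why a second definition needs a fresh name.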

LLVM’s IR is structured to explicitly expose control flow in a program, through the use of basic blocks and control flow graphs. A basic block has the following structure: (a) an entry to the block, (b) control-flow-independent instructions, and (c) a terminator of the block. Terminator instructions include branch (br) and return (ret) instructions. The ordering of these parts is strict, i.e. (c) must follow (b) and (b) must follow (a). A connection between two basic blocks exists if one block branches to the other; a block may also branch to itself.

The connections between basic blocks can be represented as a graph, which we refer to as a control flow graph (CFG). CFGs represent the conditional flow of a program. A complete example of LLVM’s IR is provided in Fig. 2.1.

(a)
//=---=//
// An example program, where
// a variable is the sum of the
// natural numbers up to ten,
// and doubled
#include <stdio.h>
int t = 0;
int i = 0;
//...
for (i = 0; i < 10; ++i) {
    t += i + i;
}
//..

(b)
; <label>:1
;...
br label %2

; <label>:2                         ; preds = %1, %2
%i = phi i32 [0, %1], [%inxt, %2]
%t = phi i32 [0, %1], [%tnxt, %2]
%dbl = add nsw i32 %i, %i
%tnxt = add nsw i32 %t, %dbl
%inxt = add nsw i32 %i, 1
%cond = icmp eq i32 %inxt, 10
br i1 %cond, label %3, label %2

; <label>:3
store i32 %tnxt, i32* %4, align 4
;...

(c)
BB1: ... (branches to BB2)
BB2: %i = phi i32 ... ; (%inxt == 10)? If false, branch back to BB2; if true, branch to BB3.
BB3: store i32 ...

Figure 2.1: An example of a C program (a) and the equivalent algorithm encoded in LLVM’s IR (b). The control flow of this program is outlined in (c).

In this figure, we show how a C program (Fig. 2.1(a)) is represented in LLVM’s IR (Fig. 2.1(b)), and explicitly outline the CFG of the example program (Fig. 2.1(c)). Note that each instruction in Fig. 2.1(b) is assigned to a unique register, showing the SSA form of the IR. Also note:

• Basic blocks are denoted with comments similar to: ; <label>:X ; preds = %A, ...

• Branch instructions, br ..., represent the edges in the control flow graph in Fig. 2.1(c).

• Phi instructions, %W = phi ..., allow a register to be set depending on which path was taken to enter the basic block.
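One way to build intuition for basic blocks and phi instructions is to hand-translate the loop of Fig. 2.1(a) back into C with one label per basic block and a variable that records the predecessor block. The sketch below is illustrative only: the names run_fig21, pred and the BB* enumerators are assumptions introduced here and are not LLVM constructs.

```c
/* Hypothetical block identifiers mirroring the blocks in Fig. 2.1(c). */
enum block { BB1, BB2, BB3 };

/* Emulates the loop of Fig. 2.1(a) one basic block at a time. The two
 * selections at the top of bb2 play the role of phi instructions: the
 * value chosen depends on which block control arrived from. */
int run_fig21(void)
{
    enum block pred = BB1;       /* block we arrived from */
    int i, t, inxt = 0, tnxt = 0;

bb2:
    /* phi: initial value when entering from BB1, otherwise the value
     * produced by the previous iteration of BB2. */
    i = (pred == BB1) ? 0 : inxt;
    t = (pred == BB1) ? 0 : tnxt;

    tnxt = t + (i + i);          /* loop body: t += i + i */
    inxt = i + 1;

    pred = BB2;
    if (inxt == 10)              /* terminator: conditional branch */
        goto bb3;
    goto bb2;

bb3:
    return tnxt;                 /* value that is stored after the loop */
}
```

Calling run_fig21() walks the same three blocks as the CFG, taking the back edge from BB2 to itself until the exit condition holds.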

Program Analysis Techniques

Analyzing programs may reveal information regarding safety or correctness, or provide insight for optimizing programs to improve performance. Methods which compute this information are referred to as program analyses [18]. Program analysis techniques fall into one of two fields: static analysis or dynamic analysis [18]. Static analysis techniques inspect the source code of a program [19]. Dynamic analysis techniques inspect the execution of a program [20].

The LLVM framework provides (a) a standard suite of static and dynamic analysis techniques which can be applied to programs encoded in LLVM-IR and (b) the ability for users to write their own LLVM-IR program analyses. When an analysis is applied to LLVM’s IR, we refer to it as a compiler pass. The LLVM compiler runs a sequence of compiler passes, applying only one pass to the IR at a time. Each pass is administered to the IR by LLVM’s pass manager, driven by the LLVM optimizer tool, opt.

Some analyses require the IR to be rewritten to reflect a performance optimization or safety improvement, whereas others simply inform users or other programs of the information collected by the analysis. To handle this, LLVM provides two types of compiler passes: (a) Analysis Passes and (b) Transformation Passes. Analysis passes gather information about the program and do not modify the IR. Transformation passes modify the IR to reflect possible performance or security optimizations.

In this work, we define three static analyses and implement them as compiler passes within the LLVM framework. Our first analysis inspects LLVM’s IR for calls to the standard dynamic memory functions (i.e. malloc() and free()) and remaps these calls to a user-selected function. Our second analysis examines dynamic memory function invocations to determine whether the address spaces used by these programming constructs are disjoint, exposing opportunities for parallelism to the HLS tool. Our last analysis examines a program for stack-allocated arrays, and evaluates whether these arrays can be replaced with dynamically-allocated arrays.

We also use and modify Valgrind, a dynamic instrumentation framework. Valgrind compiles a program into a custom intermediate representation amenable to the insertion of metadata and debugging information [21]. The framework can then execute the instrumented code just-in-time [22] to assist in analyzing programs for memory leaks, security vulnerabilities, etc. In this thesis, we modify Valgrind’s memcheck, a dynamic analysis tool, to automatically gauge the heap size required by a given program with known inputs that will undergo the high-level synthesis process.

The LLVM compilation framework and associated IR provides a long-term solution to compilation needs of the present and future. An overview of LLVM’s software-architecture and logical behaviour is outlined in Fig. 2.2. This figure demonstrates LLVM’s ability to compile any supported input language (as designated by Input Language) to any supported target language.

Figure 2.2: Overview of LLVM Architecture. Input languages (C/C++, go, ruby) are compiled through the LLVM Optimizer to output targets (x86, ARM, PowerPC).

2.2 High-Level Synthesis Overview

High-Level Synthesis (HLS) is a process which creates a hardware circuit from an algorithmic description. The algorithmic description is generally written in C/C++³ [7, 5, 8, 25], and the hardware circuit is described in a hardware description language, typically Verilog, VHDL or SystemC. HLS tools are built upon pre-existing compiler frameworks to leverage the compiler’s parser, IR, and analysis/transform passes. A combination of analyses is applied on this IR, with some originating from the compiler framework (e.g. dead-code elimination, constant propagation), and others being HLS-specific. HLS-specific passes aim to expose hardware-amenable transformations while preserving the correctness of the original input program. One example is automatic array partitioning [26], which exposes parallelism between operations in the IR. Upon completion of the program analyses and optimizations, the compiler IR is mapped to a hardware description. This mapping takes place through three distinct steps: (a) allocation, (b) scheduling and (c) binding, which we elaborate on below.

Figure 2.3: Overview of an LLVM-based HLS Compiler. Input languages (C/C++, go, ruby) pass through the LLVM Optimizer (typical program optimizations followed by HLS-specific optimizations); allocation, scheduling and binding then produce Verilog, shown alongside the conventional output ISAs (x86, ARM).

2.2.1 Allocation

During the allocation phase of HLS compilation, hardware resources (i.e. functional units, storage elements) are allocated to implement the hardware circuit. This step decides, for example, how many divider units are permitted in the synthesized circuit.

2.2.2 Scheduling

During scheduling, untimed instructions from the IR are assigned to particular clock cycles and a finite-state machine is created for each function. To do this, the program’s control and data flow are analyzed to identify when instructions (or synonymously, operations) can be executed. As discussed previously, an instruction may have zero or more inputs. Each input to an instruction must be available (i.e. the instruction producing it must have finished executing) before the instruction can execute. Resolving this chain of dependencies can be viewed as a scheduling problem. There are various ways to generate an instruction execution schedule for hardware: as-soon-as-possible scheduling, as-late-as-possible (ALAP) scheduling, and others [27].
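The dependency rule described above can be sketched as an as-soon-as-possible schedule over a small data-flow graph. The code below is a minimal illustration, assuming that every operation takes exactly one clock cycle, that resources are unlimited, and that operations are listed in topological order. The struct op layout and the asap_schedule name are hypothetical and are not taken from any HLS tool.

```c
#include <stddef.h>

#define MAX_INPUTS 2

/* Hypothetical IR operation: the indices of the operations that produce
 * its inputs. Each index must be smaller than the operation's own index. */
struct op {
    size_t num_inputs;
    size_t inputs[MAX_INPUTS];
};

/* Assigns each operation the earliest cycle in which all of its inputs
 * are available, assuming a latency of one cycle per operation.
 * Writes the start cycle of operation k to cycle[k] and returns the
 * total number of cycles in the schedule. */
unsigned asap_schedule(const struct op *ops, size_t n, unsigned *cycle)
{
    unsigned length = 0;
    for (size_t k = 0; k < n; ++k) {
        unsigned start = 0;
        for (size_t j = 0; j < ops[k].num_inputs; ++j) {
            /* The producer finishes one cycle after it starts. */
            unsigned ready = cycle[ops[k].inputs[j]] + 1;
            if (ready > start)
                start = ready;
        }
        cycle[k] = start;
        if (start + 1 > length)
            length = start + 1;
    }
    return length;
}
```

For a chain in which operation 1 consumes operation 0 and operation 2 consumes both, the operations land in cycles 0, 1 and 2 respectively. A real scheduler must also respect resource limits from the allocation step and multi-cycle operation latencies, which this sketch ignores.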

2.2.3 Binding

After scheduling the operations of the given program, each operation must be bound to an available (and allocated) resource, i.e. an operation is bound to a functional unit, and data to a storage element. Several techniques have been explored to solve this problem; the most notable is weighted bipartite matching [28].

2.2.4 RTL Generation

Once allocation, scheduling and binding⁴ have completed, the information collected from these steps can be used to construct a hardware description. The high-level view of an LLVM-based HLS compiler is depicted in Fig. 2.3.

2.3 Dynamic Memory Allocation

Dynamic memory allocation is a programming construct where programs can request memory during execution to assist with their processing or storage tasks. Generally, programmers employ these constructs in programs whose memory bounds are known only at runtime; otherwise, stack-allocated memory would suffice. These requests are generally handled by some managerial process, typically an operating system or memory manager. We will refer to this managerial process as the memory manager throughout our discussion. Memory managers provide programs with memory dynamically by controlling a segment of memory, commonly referred to as heap memory. There are two main types of dynamic memory allocation requests: (a) a request for memory, and (b) the release of dynamically allocated memory. Although programmers can also ask the memory manager to resize previously-allocated memory, the memory manager serves such a request with the following process: (1) request memory of the new size, (2) copy data from the previously-allocated memory to the freshly-allocated memory and (3) release the previously-allocated memory. This process only involves the two main memory allocation requests; therefore, we restrict our discussion to these. We now discuss how a memory manager processes these two requests.

⁴ This need not be a strict ordering; however, it is how several tools have implemented their HLS compilers [5, 25, 6].
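The three-step resize process described above can be written directly in terms of the two primitive requests. This is a sketch rather than a description of how any particular C library implements realloc. The helper name resize_block and its explicit old_size parameter are assumptions made here; a real memory manager would look up the old size in its own bookkeeping.

```c
#include <stdlib.h>
#include <string.h>

/* Resizes a block using only an allocation request and a release request.
 * old_size is passed explicitly, which is an assumption of this sketch.
 * Returns NULL, leaving the old block untouched, if allocation fails. */
void *resize_block(void *old, size_t old_size, size_t new_size)
{
    void *fresh = malloc(new_size);            /* (1) request the new size */
    if (fresh == NULL)
        return NULL;

    if (old != NULL) {
        size_t n = old_size < new_size ? old_size : new_size;
        memcpy(fresh, old, n);                 /* (2) copy the old contents */
        free(old);                             /* (3) release the old block */
    }
    return fresh;
}
```

Because the helper is built only from a request and a release, an allocator that supports those two operations can support resizing without further mechanisms, at the cost of a copy.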

2.3.1 Asking for Memory

Programs which request memory at runtime must send their request to a memory manager. The memory manager inspects the request for the requested size, and employs a tracking algorithm, designed to update and index regions of the heap, to reserve memory of the corresponding size for the program. We keep the discussion of these algorithms limited here, as they are discussed in Chapter 3. Once memory has been reserved, the memory manager returns the starting address of the freshly-allocated heap memory. This means that the program must implicitly trust that the memory manager has reserved at least the requested size, unless the returned address is 0 (NULL), indicating that the memory could not be reserved.
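As a deliberately simplistic illustration of the request path, the sketch below serves requests from a fixed array that stands in for the heap, and returns NULL once the heap is exhausted. The names REQ_HEAP_WORDS, req_heap and simple_request are hypothetical, and the bump-pointer policy is chosen only for brevity; it is not presented as one of the libmem allocators and it never reclaims memory.

```c
#include <stddef.h>

#define REQ_HEAP_WORDS 64                /* assumed heap size, in int words */

static int req_heap[REQ_HEAP_WORDS];     /* stands in for heap memory */
static size_t req_next_free = 0;         /* index of first unreserved word */

/* Reserves `words` contiguous words and returns their starting address,
 * or NULL when the request cannot be satisfied. */
int *simple_request(size_t words)
{
    if (words == 0 || words > REQ_HEAP_WORDS - req_next_free)
        return NULL;
    int *start = &req_heap[req_next_free];
    req_next_free += words;
    return start;
}
```

The caller receives only a starting address; nothing in the returned pointer records how many words were reserved, which is why the program must trust the memory manager and check for NULL.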

2.3.2 Releasing Memory

Dynamically allocated memory should be released when it is no longer required by the program. Releasing allocated memory allows it to be reused by later requests. To release previously-allocated memory, the memory manager is supplied with the previously-allocated address. The memory manager employs the dynamic memory allocation algorithm to look up the size of the reservation at the provided address. This memory segment is then marked as free, permitting its reuse. Some dynamic memory allocation algorithms attempt to reorder/merge available memory segments into larger contiguous chunks. This process is known as coalescing.
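To make the size lookup and coalescing steps concrete, the following sketch keeps a one-word header in front of every segment and merges a released segment with its right neighbour when that neighbour is free. It is an illustration under stated assumptions, not an allocator from the literature or from libmem: the names SEG_HEAP_WORDS, seg_heap, seg_request and seg_release are hypothetical, the policy is first-fit, only rightward coalescing is performed, and the released address is not validated.

```c
#include <stddef.h>

#define SEG_HEAP_WORDS 64

/* Each segment starts with one header word holding its payload size:
 * positive when the segment is free, negative when it is reserved.
 * Initially the whole heap is a single free segment. */
static int seg_heap[SEG_HEAP_WORDS] = { SEG_HEAP_WORDS - 1 };

/* First-fit request: returns the payload address, or NULL on failure. */
int *seg_request(int words)
{
    if (words <= 0)
        return NULL;
    for (int h = 0; h < SEG_HEAP_WORDS; ) {
        int size = seg_heap[h] > 0 ? seg_heap[h] : -seg_heap[h];
        if (seg_heap[h] > 0 && size >= words) {
            if (size - words >= 2) {           /* split off the remainder */
                seg_heap[h + 1 + words] = size - words - 1;
                size = words;
            }
            seg_heap[h] = -size;               /* mark as reserved */
            return &seg_heap[h + 1];
        }
        h += 1 + size;                         /* step to the next header */
    }
    return NULL;
}

/* Release: look up the size from the header, coalesce with a free right
 * neighbour if there is one, and mark the segment as free. */
void seg_release(int *addr)
{
    int h = (int)(addr - seg_heap) - 1;        /* header precedes payload */
    int size = -seg_heap[h];                   /* size of the reservation */
    int next = h + 1 + size;
    if (next < SEG_HEAP_WORDS && seg_heap[next] > 0)
        size += 1 + seg_heap[next];            /* absorb neighbour + header */
    seg_heap[h] = size;                        /* mark as free */
}
```

With a full heap holding segments of 10, 20 and 31 words, releasing the 20-word segment and then the 10-word segment leaves a single free segment of 31 words (the two payloads plus the absorbed header), so a 31-word request that neither freed segment could satisfy alone now succeeds.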

2.3.3 Summary of Dynamic Memory

In summary, a dynamic memory allocation algorithm only needs to manage memory, and only returns an address to the program requesting memory. The program implicitly trusts that the dynamic memory allocation algorithm has reserved the appropriate size, since the program will use the provided address and compute address offsets within the bound of the requested size. When releasing allocated memory, a dynamic memory allocation algorithm locates information about the reservation at the provided address and marks the associated memory space as reusable via its memory-accounting method.

2.4 Summary

In this chapter, we have summarized the background information necessary to understand the contents of this thesis. In particular, we have reviewed compiler technologies, with emphasis on LLVM. We described LLVM’s software architecture and usage, and its intermediate representation. We then reviewed concepts on program analyses and detailed how LLVM facilitates program analyses through analysis and transform passes. We then outlined the essential concepts of HLS, with our focus being compiler-centric. Lastly, we introduced the concepts of dynamic memory allocation.


Chapter 3

Dynamic Memory Allocation Schemes

In this chapter, we review four dynamic memory allocation schemes from the literature that are widely used in software applications. We also present a novel dynamic memory allocation scheme, inspired by the literature and optimized for high-level synthesis. We describe the algorithmic behaviour of each scheme. We will refer to a reservable segment of contiguous memory as the heap. A heap must have some alignment (e.g. int or byte aligned).

3.1 Linked-List Memory Allocation

Figure 3.1: Doug Lea’s malloc() implementation, where CHUNKs stored on the heap are used to identify reserved or free memory segments. (a) Demonstrates an empty heap, with the initialized doubly linked list: bot at 0x23222320 and top at 0x23229340. (b) Demonstrates how the doubly linked list is modified after a request is served: a new CHUNK at 0x23228300 sits between bot and top.

We begin our review of dynamic memory allocation techniques from the literature with an algorithm employed in many software ecosystems, Doug Lea’s malloc() and free() [29]. Lea’s memory allocation scheme employs a doubly linked list to maintain state of used and unused regions of memory in a heap.

The nodes of this doubly linked list (DLL) contain only: (a) a left neighbour, (b) a right neighbour and (c) a free flag which indicates whether the segment of memory between the current node and its right neighbour is reserved (0 for free, 1 for reserved). We refer to these nodes as CHUNKs. CHUNK nodes consume a fixed size of B_CHUNK bytes in memory and are stored in the heap at an address, a_CHUNK, to identify a region of reserved memory, which begins at a_CHUNK + B_CHUNK. Upon reception of a memory request of β bytes, this scheme must verify that the requested quantity of memory is aligned to the heap memory, so as not to overwrite part of a CHUNK. Therefore, memory requests are adjusted to ensure they are a multiple of B_CHUNK bytes. Once aligned, this scheme searches for the requested amount of memory by traversing the DLL, depicted in Fig. 3.1(a).

An empty heap is initialized with two CHUNKs, bot and top, which point to the start and end of our heap. These CHUNKs are reserved and cannot be removed. Initially, bot has its left neighbour pointing to NULL (indicating the beginning of the heap), its right neighbour linked to top, and its free flag set to 0, making the entire heap available. Similarly, top’s left neighbour is set to bot, its right neighbour is linked to NULL (indicating the end of our heap), and its free flag is set to 1. We search for β free bytes of memory by commencing traversal from bot and continuing rightwards in the DLL. As we inspect each CHUNK in our traversal, we review the CHUNK’s free flag, as well as its right neighbour. If the CHUNK’s flag is free, we check whether the memory segment between the current node and its right neighbour is sufficient to serve the request. Referring to Fig. 3.1(a), the node at address 0x23222320 in the doubly linked list is free for use. Therefore, we can compute the available free space by inspecting this node’s right neighbour (located at 0x23229340) and subtracting the two addresses. In this example, the available free space for reservation is 0x23229340 − 0x23222320 = 0x7020 = 28704 bytes.

If the free space is equal to β, we return the address a_CHUNK + B_CHUNK to the requester. An offset of B_CHUNK is added to the address so as not to overwrite the CHUNK. However, it is possible that the free space is greater than the incoming request. We cannot naively return the address a_CHUNK + B_CHUNK to the requester in this case, as the available memory would quickly vanish (the requester would receive much more memory than needed). Instead, this scheme inserts and links a new CHUNK into the heap at the boundary of the request size to (1) use only the requested memory and (2) identify the remaining free space. The scheme can then return the address. This process is depicted in Fig. 3.1(b). For example, if β = 24544 and our heap is in the same state as Fig. 3.1(a), then the free space is larger than the request size. Hence, a new CHUNK is inserted into the doubly linked list at 0x23222320 + 24544 = 0x23228300, which provides us with the state of the heap in Fig. 3.1(b). However, if the computed free space is less than the memory request, traversal of the doubly linked list continues until memory of sufficient size is found, or until the end of the DLL is encountered, in which case we return the address 0 (NULL pointer).

To free memory, a program can invoke a free() call with the address, a_CHUNK + B_CHUNK, returned by malloc. We must update this CHUNK’s availability by setting its free flag to 0. The CHUNK is fetched by subtracting B_CHUNK bytes from the provided address, and then updated. However, to reduce the effects of memory fragmentation, the left and right neighbours of this CHUNK are inspected to see whether memory can be coalesced into a larger segment of contiguous memory. This is done by inspecting the free flags of the left and right neighbours. If coalescing is possible, the DLL is restructured.
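The traversal, CHUNK insertion and coalescing steps of this section can be sketched in C as follows. This is a simplified illustration under stated assumptions, not Lea's implementation: the heap is a small static array viewed as B_CHUNK-sized slots, neighbours are stored as slot indices rather than pointers, and the names ll_init, ll_malloc and ll_free, as well as the sizes, are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

enum { B_CHUNK = 16, SLOTS = 16, NIL = -1 };   /* assumed: 256-byte heap */

/* CHUNK header, padded to B_CHUNK bytes. Neighbours are slot indices. */
typedef struct {
    int32_t left, right;   /* neighbouring CHUNKs, NIL if none */
    int32_t reserved;      /* 0: segment up to the right neighbour is free */
    int32_t pad;
} Chunk;

static Chunk heap[SLOTS];  /* the heap, viewed as B_CHUNK-sized slots */

void ll_init(void) {
    heap[0]         = (Chunk){ NIL, SLOTS - 1, 0, 0 };  /* bot: whole heap free */
    heap[SLOTS - 1] = (Chunk){ 0, NIL, 1, 0 };          /* top: always reserved */
}

void *ll_malloc(size_t bytes) {
    if (bytes == 0 || bytes > sizeof heap) return NULL;
    /* Round the request up to a multiple of B_CHUNK (counted in slots). */
    int32_t need = (int32_t)((bytes + B_CHUNK - 1) / B_CHUNK);
    for (int32_t c = 0; c != NIL; c = heap[c].right) {
        int32_t avail = heap[c].right - c - 1;   /* free slots after header */
        if (heap[c].reserved || avail < need) continue;
        if (avail > need) {                      /* split: insert a new CHUNK */
            int32_t n = c + 1 + need;
            heap[n] = (Chunk){ c, heap[c].right, 0, 0 };
            heap[heap[c].right].left = n;
            heap[c].right = n;
        }
        heap[c].reserved = 1;
        return &heap[c + 1];                     /* a_CHUNK + B_CHUNK */
    }
    return NULL;                                 /* end of DLL reached */
}

void ll_free(void *p) {
    if (p == NULL) return;
    int32_t c = (int32_t)((Chunk *)p - heap) - 1;   /* step back over header */
    heap[c].reserved = 0;
    int32_t r = heap[c].right;
    if (!heap[r].reserved) {                     /* coalesce with right */
        heap[c].right = heap[r].right;
        heap[heap[r].right].left = c;
    }
    int32_t l = heap[c].left;
    if (l != NIL && !heap[l].reserved) {         /* coalesce with left */
        heap[l].right = heap[c].right;
        heap[heap[c].right].left = l;
    }
}
```

After ll_init(), a request that fits exactly reuses the existing CHUNK, while a smaller request splits it. Freeing two adjacent reservations restores a single free segment between bot and top.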

3.2 Linear Allocation

Figure 3.2: Simplistic Behaviour of a Linear DMAS. (a) Initial state of linear allocator. (b) State after a memory request. Each panel shows the heap cells (0xAA.., 0x00.., 0xCF..) and the position of curAddr.

A linear dynamic memory allocator sacrifices memory reusability to minimize the memory allocation overhead. This scheme maintains the state of free memory within a heap by employing a single pointer, curAddr, which points to the end of the reserved region. This is depicted in Fig. 3.2(a), where initially the pointer sits at the memory cell labelled 0xAA... Cells highlighted in green represent reserved memory; non-highlighted cells represent free memory. Upon issuance of a memory request to this allocator, this scheme checks the remaining free space in the heap against the request size. The remaining free space is calculated by subtracting curAddr from the end of the heap. If there is enough free space, the pointer is incremented as shown in Fig. 3.2(b); otherwise we return an address of 0 (NULL). In this example, we issued a memory request of 2 words, using the rest of the heap, as indicated by all cells being green in Fig. 3.2(b). Observe that this allocator keeps no record of individual memory requests. Hence, no free() mechanism exists. However, there is functionality which can reset the curAddr pointer to the beginning of the heap, which we call freeall(); it is the programmer’s responsibility to employ this call. Intuitively, this allocation approach is not useful for general applications; it is only useful for applications which allocate all needed memory upfront, and then deallocate all used memory at the end of execution (i.e. interleaved calls to malloc and free will lead to over-reserved memory).
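A minimal C sketch of this scheme is shown below. The heap size, the word-granular interface and the names linear_malloc and linear_freeall are assumptions made for illustration; cur plays the role of curAddr, stored as a word index.

```c
#include <stddef.h>
#include <stdint.h>

#define HEAP_WORDS 4u                 /* assumed heap size, in 32-bit words */

static uint32_t heap[HEAP_WORDS];
static size_t cur = 0;                /* curAddr: end of the reserved region */

/* Reserve `words` words, or return NULL if the remaining space is too small. */
void *linear_malloc(size_t words) {
    size_t remaining = HEAP_WORDS - cur;      /* end of heap minus curAddr */
    if (words == 0 || words > remaining)
        return NULL;
    void *p = &heap[cur];
    cur += words;                             /* bump the pointer */
    return p;
}

/* There is no per-request free(); the whole heap is released at once. */
void linear_freeall(void) {
    cur = 0;
}
```

Allocation is a comparison and an addition, which is why this scheme has so little overhead. The cost is that nothing can be reclaimed until linear_freeall() is called.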

Figure 3.3: Representation of a bitmap allocator. A bit-vector (1, 0, 0, ...) tracks the heap memory cells (0xBAAD, 0xC0DE, 0xD00D) at addresses 0xAA000002, 0xAA000004 and 0xAA000006.

3.3 Bitmap Memory Allocation

A bitmap dynamic memory allocator employs a bit-vector, V_b = {b_0, b_1, ..., b_n}, to maintain state on memory reservations within a heap. Each bit b_i in this bit-vector maps to X bytes at a unique location in the heap. Using Fig. 3.3 as reference, each bit in this bit-vector maps to a unique cell position, with each cell occupying 2 bytes. If a bit is 1, the cell is reserved; otherwise it is free for use. When issued a memory request of β bytes, this scheme maps the size of the request to a corresponding number of required bits, η_bits, by the relationship η_bits = ceil(β / X). For example, if each bit represents a unique location of X = 16 bytes, then a request for 17 bytes needs two bits. Once η_bits has been computed, a linear search through the bit-vector takes place to locate a contiguous segment of η_bits free bits. If such a segment can be located within the bit-vector, (1) these bits are marked as reserved and a calculation takes place to return the starting address of the reserved space in the heap and (2) the index of the start of this contiguous segment is recorded. We define the index of the start of this contiguous segment as s, and the bit at this location as b_s. Otherwise, we return an address of 0. Our address calculation first multiplies s × X, which gives an offset value. The offset is added to the starting address of our heap, providing the address to return to the requester. Upon successful reservation, {η_bits, s} is recorded in a key-value table, with the key being the address returned to the user (this can be the whole address, since it is guaranteed to be unique). When free() is called, the allocator looks up {η_bits, s} in the key-value table using the user-provided address. Finally, the bit-vector is updated to release the hold on our memory, setting the η_bits bits b_s through b_(s + η_bits − 1) to zero.
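The bitmap scheme can be sketched in C as below, assuming X = 16 bytes per bit and a 32-bit bit-vector. As a simplification that is not taken from the thesis, the key-value table is replaced by an array indexed by s, since s can be recomputed from the returned address; all names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

enum { X = 16, NBITS = 32 };              /* assumed granularity and vector width */

static unsigned char heap[X * NBITS];
static uint32_t bitvec;                   /* bit i == 1: cell i is reserved */
static uint8_t run_len[NBITS];            /* stand-in table: eta_bits for run at s */

/* Mask of eta consecutive bits starting at bit s (requires s + eta <= 32). */
static uint32_t run_mask(unsigned s, unsigned eta) {
    uint32_t ones = (eta >= 32u) ? 0xFFFFFFFFu : ((1u << eta) - 1u);
    return ones << s;
}

void *bitmap_malloc(size_t bytes) {
    if (bytes == 0 || bytes > sizeof heap) return NULL;
    unsigned eta = (unsigned)((bytes + X - 1) / X);      /* ceil(beta / X) */
    for (unsigned s = 0; s + eta <= NBITS; ++s) {        /* linear search */
        uint32_t mask = run_mask(s, eta);
        if ((bitvec & mask) == 0) {
            bitvec |= mask;                              /* mark as reserved */
            run_len[s] = (uint8_t)eta;                   /* record eta for free */
            return heap + (size_t)s * X;                 /* heap start + s * X */
        }
    }
    return NULL;
}

void bitmap_free(void *p) {
    if (p == NULL) return;
    unsigned s = (unsigned)(((unsigned char *)p - heap) / X);
    bitvec &= ~run_mask(s, run_len[s]);                  /* clear the run's bits */
    run_len[s] = 0;
}
```

With these parameters, a 17-byte request reserves two bits, matching the example in the text.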

3.4 The Buddy Allocation Scheme

A buddy allocation scheme searches for free memory in a heap by recursively partitioning the heap, and continuing the search in these partitions. An allocation request is served by walking through the partitions to identify a free partition that meets the memory request [30]. Although several algorithms exist which permit flexible exploration of the heap, we focus on the description of buddy allocation systems which strictly employ recursive bipartitioning when searching for free memory segments, as these are most amenable to hardware usage. Specifically, we review buddy dynamic memory allocation schemes that have been implemented in hardware and that employ a bitmap-tree, as studies claim this specific style of buddy allocation enables increased performance and flexibility [31, 32, 33].

Figure 3.4: The structure of a Buddy Allocation scheme. The heap segments C1 to C4 are shown at each level of partitioning, alongside Bitmap Level 0, Bitmap Level 1 and Bitmap Level 2.

A bitmap-tree is a data structure with L total levels and a minimum requestable byte size λ. Level i of the bitmap-tree has 2^i bits, where 0 ≤ i < L. At level i, the associated bitmap is b_(0,i), ..., b_(j,i), where 0 ≤ j < 2^i, and each bit maps to a corresponding requestable byte size of λ · 2^(L−i−1). Each bit in this bitmap-tree is either 0 (free) or 1 (reserved). This is demonstrated in Fig. 3.4, where the heap is represented by the coalescing of four segments, C1 to C4, and the related bitmap for each level of partitioning. Hence, if we require λ = 256, and the largest bit-vector available for use is 32 bits, then we can have a maximum tree depth of 5 levels, limiting our heap to 2^5 × 256 = 8192 bytes. The buddy allocator is limited by the maximum bitmap size (this limits the total number of levels in the tree) and by the minimum requestable byte size (i.e. the granularity of requestable memory sizes), which, as we will see later, can induce memory underutilization.

When a memory request of β bytes is issued, the buddy allocator identifies which level of partitions will best serve the request by computing the difference between the total number of levels in the tree and the logarithm of the memory request. Once a level has been selected, the associated bitmap is searched to identify a free bit (i.e., a partition of the heap which is available for reservation). If no partition is available, the allocator will return 0. Otherwise, the bit is marked as reserved (logic-1). All bitmaps must be updated to reflect which levels have the selected partition. For example, if the second partition in Bitmap Level 1 from Fig. 3.4 is reserved, it is not possible to request the entire heap (i.e., Bitmap Level 0) or partitions 3 and 4 from Bitmap Level 2. Therefore, a change in any level of the bitmap-tree must propagate to upper and lower levels. Only the affected portions of the bitmaps are updated. The procedure for updating the bitmap-tree is taken from [33]. Once the changes have been propagated to the entire tree, this allocator stores the level associated with the recent reservation in a key-value table, with the key being the address to return to the user. The address to return is the starting address of the reserved partition. Storing the level information ensures the bitmap-tree will only be updated from the corresponding level when the memory reservation is released. After recording the level in the key-value table, the allocator returns the address to the requester.

Likewise, when the reserved memory is to be released, the supplied address is matched with the corresponding partition and bitmap using the key-value table. The bit which represents the partition is marked as free (logic-0). Again, all bitmaps must be updated to reflect this change, employing the strategy outlined in [33].
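The bitmap-tree bookkeeping can be illustrated with the C sketch below, sized like Fig. 3.4 (three levels, four leaf segments) with an assumed λ = 256. The propagation shown here is a plain loop over ancestors and descendants written for illustration; it is not the update procedure of [33], and all names are hypothetical. A bit value of 1 means the partition cannot be reserved, either because it is reserved itself or because an ancestor or descendant is.

```c
#include <stddef.h>
#include <stdint.h>

enum { LEVELS = 3, LAMBDA = 256, LEAVES = 1 << (LEVELS - 1) };

static unsigned char heap[LAMBDA * LEAVES];  /* 1024 bytes: segments C1..C4 */
static uint32_t bitmap[LEVELS];              /* bitmap[i] uses 2^i bits */
static uint8_t level_of[LEAVES];             /* stand-in table: level + 1, 0 = none */

/* Set (v = 1) or clear (v = 0) partition j of level lvl and everything below. */
static void set_subtree(int lvl, unsigned j, int v) {
    for (int k = 0; lvl + k < LEVELS; ++k)
        for (unsigned d = j << k; d < ((j + 1u) << k); ++d) {
            if (v) bitmap[lvl + k] |= 1u << d;
            else   bitmap[lvl + k] &= ~(1u << d);
        }
}

/* Deepest level whose partition size still covers the request, or -1. */
static int pick_level(size_t bytes) {
    int lvl = LEVELS - 1;
    size_t part = LAMBDA;
    while (part < bytes && lvl > 0) { part *= 2; --lvl; }
    return part >= bytes ? lvl : -1;
}

void *buddy_malloc(size_t bytes) {
    int lvl = (bytes == 0) ? -1 : pick_level(bytes);
    if (lvl < 0) return NULL;
    for (unsigned j = 0; j < (1u << lvl); ++j) {
        if (bitmap[lvl] & (1u << j)) continue;
        set_subtree(lvl, j, 1);                       /* propagate downwards */
        for (int up = lvl - 1, a = (int)(j >> 1); up >= 0; --up, a >>= 1)
            bitmap[up] |= 1u << a;                    /* propagate upwards */
        unsigned leaf = j << (LEVELS - 1 - lvl);
        level_of[leaf] = (uint8_t)(lvl + 1);          /* remember the level */
        return heap + (size_t)leaf * LAMBDA;
    }
    return NULL;
}

void buddy_free(void *p) {
    if (p == NULL) return;
    unsigned leaf = (unsigned)(((unsigned char *)p - heap) / LAMBDA);
    if (level_of[leaf] == 0) return;
    int lvl = level_of[leaf] - 1;
    unsigned j = leaf >> (LEVELS - 1 - lvl);
    level_of[leaf] = 0;
    set_subtree(lvl, j, 0);
    for (int up = lvl - 1; up >= 0; --up) {           /* free parents while */
        j >>= 1;                                      /* both children are free */
        if ((bitmap[up + 1] >> (2u * j)) & 3u) break;
        bitmap[up] &= ~(1u << j);
    }
}
```

Reserving a 512-byte partition blocks a later request for the whole heap, and that request only succeeds again once every partition below Level 0 has been released.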

3.5 A Pre-allocated Address Allocation Scheme

In this section, we describe a novel dynamic memory allocation algorithm that is amenable to the high-level synthesis process. Through the review and analysis of pre-existing dynamic memory allocation mechanisms, we identified short-comings of prior approaches, including:

• Searching for free memory is costly (e.g., having to inspect a doubly-linked list for free memory, reading the bitmap to find a free segment of memory, etc.)

• Once free memory is located, calculating the address can be computationally complex (e.g. refer to the bitmap allocator in Section 3.3).

• Repeated requests with similar sizes are not handled well with pre-existing approaches (i.e. there is no pool of preallocated, similar-sized memory segments).

Additionally, we have not described how these algorithms translate to hardware, and their performance or area impacts. So far, our discussion of these algorithms has remained algorithmic. The implementation of these algorithmic features may prohibit possible performance optimizations during the high-level synthesis process. From these findings, we propose a new dynamic memory allocation scheme, which is both flexible and intended for hardware.

Our algorithm makes use of a key-value data structure, which we represent as:

f : K → V (3.1)

Where f is a function which maps members (keys) in K to a member in V. Using this definition, K is the set of positive integers, Z+, and represents all possible memory request sizes. We wish to map a request size to a pre-allocated segment of memory which meets or exceeds the request size. Pre-allocated segments are logically separated by their allocation size (which are all powers of two) into bins. Each bin has up to n pre-allocated memory segments of the corresponding size. All pre-allocated memory segments are from a contiguous memory location (i.e. the beginning of the second memory segment in the bin occurs directly after the end of the first memory segment’s reservation). Each bin’s memory space is disjoint (i.e. an array is reserved for each bin). To track the reservation status of a pre-allocated segment, an n-bit vector is included with each bin to mark whether an address has been reserved. The set of all pre-allocated addresses is the set V. This is shown in Fig. 3.5, where a request size (e.g. β = 16) maps to a list of possible addresses (A16_0, A16_1, ...) which can be reserved via a bit-vector (b16_0, b16_1, ...).

Figure 3.5: The general structure of a pre-allocated address allocation scheme. A request of β bytes selects level ceil(log2(β)); Levels 4, 5 and 6 hold the addresses A16_0 ... A16_k, A32_0 ... A32_k and A64_0 ... A64_k, with reservation bits b16_0 ... b16_k, b32_0 ... b32_k and b64_0 ... b64_k.

We now define the mapping function, f below.

Suppose a memory request of β bytes is issued. Our algorithm must identify which bin of pre-allocated segments to select from. Since each bin corresponds to pre-allocated segments of memory whose sizes are powers of two, we can compute the base-2 logarithm of β and take the ceiling to determine the bin of pre-allocated addresses which meet or exceed the request size. Once this is determined, the n-bit vector associated with that bin is inspected to locate an available address. To avoid performing a linear search through n bits for a reservable address, we use the following process to locate a free, pre-allocated address, which was taken from [33]. Using the bin’s bit-vector, Bv = {bv0, bv1, ..., bvn}, the following operation is performed:

(¬Bv) ∧ (¬((¬Bv) − 1))

This will isolate the first (lowest-position) 0-valued bit in this bit-vector. Using this information, we can identify which address is free (each bit location maps to a unique address in the bin, i.e., bv0 represents the reservation status of Av0). We provide an example, written with the lowest-position bit on the right:

Bv= {1, 1, 1, 0, 1, 1, 0, 1}

¬Bv= {0, 0, 0, 1, 0, 0, 1, 0}

¬Bv− 1 = {0, 0, 0, 1, 0, 0, 0, 1}

¬(¬Bv− 1) = {1, 1, 1, 0, 1, 1, 1, 0}

(¬Bv) ∧ (¬((¬Bv) − 1)) = {0, 0, 0, 0, 0, 0, 1, 0}

After a free address has been uncovered, the bit representing this address is marked as reserved (updated to a logic-1), and the address is returned to the requester.
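The operation above maps directly onto C's bitwise operators. The sketch below assumes a 32-bit bit-vector; the function name lowest_free_bit is hypothetical.

```c
#include <stdint.h>

/* Returns a word in which only the lowest 0-valued bit of bv is set,
 * or 0 if every bit of bv is already 1 (no free address in the bin). */
uint32_t lowest_free_bit(uint32_t bv) {
    uint32_t nb = ~bv;          /* free positions become 1s */
    return nb & ~(nb - 1u);     /* isolate the lowest 1 of nb */
}
```

For the example above, lowest_free_bit(0xED) evaluates to 0x02, since 0xED is 11101101 in binary and its lowest 0-valued bit is at position 1. No loop over the n bits is needed.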

We define another key-value mapping, f^(−1) : V → B, where we map an address back to the corresponding bit representing its reservation. This key-value structure is used to reset the state of the bit-vector, releasing the reservation. f^(−1) is described below.

When a reserved memory-address is released, the following process occurs: since every address is unique to one of the available bins, all bins are searched against the incoming address. Once the bin is identified, the bit representing this address is also identified by searching the bit-vector and checking for this address (recall that the bit-to-address mapping is also unique). Once found, the corresponding bit is set to 0.

We modify our algorithm to handle several additional cases. When this scheme is issued a request for β bytes, the memory request is inspected to check whether it is larger than the largest bin size. If the request exceeds the largest size in our scheme, address 0 is returned. Lastly, it is possible that all pre-allocated segments for a particular bin are reserved, in which case a linear search takes place through V, iterating over all members of V whose pre-allocated memory segments are equal to or larger than the memory request. If an address is located during this search, we follow the above procedure. Otherwise, we return an address of 0.
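Putting the pieces of this section together, a pre-allocated address allocator might be sketched in C as follows. The bin sizes (16, 32 and 64 bytes), n = 8 segments per bin, and all names are assumptions for illustration. Two simplifications are not taken from the thesis: the fallback walks the larger bins in order rather than searching V directly, and the release path computes the bit index from the address offset instead of searching the bit-vector.

```c
#include <stddef.h>
#include <stdint.h>

enum { MIN_LOG2 = 4, NUM_BINS = 3, N = 8 };   /* assumed: 16/32/64-byte bins */

static unsigned char bin16[N * 16], bin32[N * 32], bin64[N * 64];
static unsigned char *const bin_base[NUM_BINS] = { bin16, bin32, bin64 };
static uint32_t bin_bits[NUM_BINS];           /* bit k == 1: segment k reserved */

/* Index of the lowest free segment in a bin's bit-vector, or -1 if full. */
static int first_free(uint32_t bv) {
    uint32_t nb = ~bv & ((1u << N) - 1u);     /* only N bits are valid */
    uint32_t one = nb & ~(nb - 1u);           /* isolate the lowest free bit */
    if (one == 0) return -1;
    int k = 0;
    while ((one >> k) != 1u) ++k;
    return k;
}

void *pa_malloc(size_t bytes) {
    if (bytes == 0 || bytes > ((size_t)1 << (MIN_LOG2 + NUM_BINS - 1)))
        return NULL;                          /* exceeds the largest bin */
    int lg = 0;
    while (((size_t)1 << lg) < bytes) ++lg;   /* ceil(log2(bytes)) */
    int bin = lg > MIN_LOG2 ? lg - MIN_LOG2 : 0;
    for (; bin < NUM_BINS; ++bin) {           /* fall back to larger bins */
        int k = first_free(bin_bits[bin]);
        if (k < 0) continue;
        bin_bits[bin] |= 1u << k;             /* mark as reserved */
        return bin_base[bin] + ((size_t)k << (MIN_LOG2 + bin));
    }
    return NULL;
}

void pa_free(void *p) {
    uintptr_t a = (uintptr_t)p;
    for (int bin = 0; bin < NUM_BINS; ++bin) {          /* find owning bin */
        uintptr_t lo = (uintptr_t)bin_base[bin];
        uintptr_t span = (uintptr_t)N << (MIN_LOG2 + bin);
        if (a >= lo && a - lo < span) {
            bin_bits[bin] &= ~(1u << ((a - lo) >> (MIN_LOG2 + bin)));
            return;
        }
    }
}
```

Because every segment size is a power of two, both the address calculation and the reverse mapping reduce to shifts, which is the property that makes the scheme attractive for hardware.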

3.6 Summary

In this chapter, we reviewed four dynamic memory allocation algorithms from the literature. From our investigation, we proposed a new dynamic memory allocation algorithm which addresses some shortcomings of previous methods and is both flexible and intended for hardware.


Chapter 4

Related Work

We review work which has attempted to bridge the gap between dynamic data structures and dynamic memory allocation techniques and the world of high-level synthesis. We highlight work which also aims to improve the performance and area of dynamic memory allocation schemes within High-Level Synthesis.

4.1 Dynamic Memory Allocation Support in Modern High-Level Synthesis Tools

We review modern (as of 2019) commercial and academic HLS tools and explicitly show whether each tool supports the high-level synthesis of dynamic memory allocation schemes. We limit our review to tools which accept C or C++ as input. We provide a summary of this information in Table 4.1, and briefly describe the high-level synthesis tools which support synthesizable dynamic memory allocation (i.e. the tool can synthesize a C-implemented dynamic memory allocation algorithm to a hardware description language).

CHC is a high-level synthesis compiler and associated tool chain from Altium [36]. This tool accepts a large subset of the C language as input (we believe recursion and function pointers remain unsupported). This tool supports dynamic memory allocation, although the exact methodology of how it does so is unclear. Therefore, we cannot identify whether, and which, optimizations this tool applies to programs using dynamic memory allocation.

Table 4.1: List of modern, C-based HLS tools and their support for synthesizing dynamic memory allocation constructs. Entries marked as -u indicate that the documentation is unclear.

HLS Tool                  Proprietor                  Synthesizable Dynamic Memory Support
eXCite [34]               Y Explorations              No
Catapult HLS [35]         Mentor Graphics             No
CHC [36]                  Altium                      Yes
Stratus [37]              Cadence                     -u
DK Design Suite [38]      Mentor Graphics             -u
ROCCC [39]                Jacquard Comp.              No
Synphony C [40]           Synopsys                    No
Vivado HLS [7]            Xilinx                      No
Intel HLS Compiler [8]    Intel                       No
LegUp [5]                 University of Toronto       No
Bambu [25]                Politecnico di Milano       Yes
DWARV [41]                TU Delft                    No
Trident [42]              Los Alamos NL               No
CHiMPS [43]               University of Washington    No
gcc2verilog [44]          University of Korea         Yes

Both Bambu [25] and gcc2verilog [44] support dynamic memory allocation constructs during the HLS process. These two frameworks were built upon the GCC (GNU Compiler Collection) framework and operate on GCC’s IR, gimple [45, 13]. gimple representations of dynamic memory allocation algorithms can be constructed and linked into the original gimple representation of the user program. If the user does not provide alternate definitions of malloc and free, system-level software (i.e. malloc() and free() from the operating system) or tool-included definitions are supported. These definitions are written in the C programming language and make use of on-chip BRAMs for heap memory; thereby no additional on-chip processor is required to use these algorithms. For example, Bambu has a C implementation of malloc() and free() taken from [46], which employs a linked-list memory allocation algorithm and implements heap memory as on-chip BRAMs. However, this can be undesirable since it is not clear whether the provided algorithm is best suited for a particular HLS application [9].

Although these three modern-day tools support synthesizing dynamic memory constructs to hardware, there are many limitations with these works. Users may not be able to (1) dictate the size of the heap, (2) specify how many heaps could be assigned to the program to facilitate parallelism, or (3) explore the performance and area impacts of using various allocation strategies. The work presented in this thesis differs from these: we present an exploration of a variety of dynamic memory allocation mechanisms and their hardware equivalents, as well as automated performance and area optimizations for dynamic memory allocation mechanisms. Our work also allows the user to express a variety of design constraints pertinent to dynamic memory allocation algorithms.


4.2 Synthesis of Hardware Models in C With Pointers and Complex Data Structures

Séméria et al. explored methods to handle complex data structures and pointer operations during the HLS process in [47]. They investigated a variety of techniques and demonstrated the ability to map complex data structures and pointer operations to hardware representations. As an experiment, they synthesized and explored a linked-list style of malloc and free with their HLS tool, SpC. Our work expands on this vision, and explores how dynamic memory allocation algorithms may affect performance and area of a synthesized circuit. Additionally, we explore performance and area optimizations with dynamic memory allocation and HLS.

4.3 A Dynamic Memory Manager for FPGA Applications

In this work, a dynamic memory management device is developed by Ozer [31] to allow for dynamic memory allocation requests within FPGA applications. His dynamic memory allocation algorithm implemented a specialized linked-list allocator. By using a linked list to keep track of blocks which are powers of two in size, he enabled a computationally efficient method to track free blocks within a heap, with low latency. His implementation provided flexibility for allocatable memory regions, which did not need to be contiguous. However, this flexibility reduces the performance of his dynamic memory allocation mechanism. This work claims to achieve an operating frequency of 175 MHz, with allocation taking 21 clock cycles, deallocation 10 clock cycles and address translation 2 clock cycles. This work was implemented using VHDL and was not intended for HLS tools.

4.4 Reconfigurable Fast Memory Management System Design for Application Specific Processors

Agun & Chang [32] design and describe a hardware memory management unit to assist a typical processor by accelerating the processing of dynamic memory allocation requests. They argued that a dynamic memory allocation hardware accelerator may improve system performance. Their design consisted of a buddy allocator strategy which used an AND-OR tree structure [32]. The accelerator was used beside an OpenRISC-1000 processor with a modified instruction set, in which new instructions redirect dynamic memory allocation requests to their memory manager. With a 256-bit minimum allocatable unit, they achieved a maximum operating frequency of 13 MHz, with allocation and deallocation taking 1 cycle. This was designed with a hardware description language, and is not directly amenable to an HLS flow.

4.5 Dynamic Memory Management in Vivado HLS for Scalable Many-Accelerator Architectures

Diamantopoulos et al. propose a dynamic memory management system for FPGA accelerator systems, which enabled dynamic reuse of BRAMs among several accelerators alongside a processor (either hardened or defined in soft logic) [48]. This allocation system employed a bitmap allocation scheme. Each bit of the bitmap points to one byte on a particular heap. Several heaps are available in this memory management scheme, allowing for partitioning between possible threads. However, this scheme suffers from high latency, and its operating frequency is unknown, as it depends on the underlying accelerators.

4.6 Adaptive Dynamic On-Chip Memory Management for FPGA-based Reconfigurable Architectures

Dessouky et al. highlight the issue of memory ‘under-utilization’ and how it can severely impact digital design and overall system performance [49]. Specifically, the lack of support for dynamic memory causes designers to overestimate the usage of a resource, which is then ‘under-utilized’ at runtime. They present a possible solution where a number of processing elements (PEs) are connected to their DOMMU (Dynamic On-Chip Memory Management Unit) device. This device holds the state of logically connected BRAMs and is accessed by PEs through arbitration and paging mechanisms. Due to arbitration, latency is not known precisely but is rather a function of the number of PEs and the number of BRAMs. Their design can be clocked at 140 MHz (with 2 PEs and 40 BRAMs), and the frequency drops as the number of PEs increases. This design provides fixed-size blocks of data.

4.7 SysAlloc: A Hardware Manager for Dynamic Memory Allocation in Heterogeneous Systems

Xue and Thomas explore dynamic memory allocation techniques with the use of off-chip DDR memories [50]. Their design aimed to minimize the unnecessary utilization of BRAMs available on-chip by permitting access to a collection of off-chip DDR and external BRAMs (on other FPGAs) for dynamic memory requests. Allocators can submit dynamic memory requests over an AXI or Avalon bus, which will direct memory to and from off-chip DDR and BRAMs, or free the corresponding blocks. They employ a binary buddy system to handle dynamic memory usage. SysAlloc can be run at 150 MHz at the cost of hundreds of clock cycles. This is amenable to HLS flows.

4.8 Separation Logic for High-Level Synthesis

In his Ph.D. thesis, Felix Winterstein investigates the issue of dynamic data structures present in software descriptions which may be placed into an HLS flow. His work primarily studies the parallelization of pointers relating to complex data structures within a program, with emphasis on pointers contained within loops [51]. By employing separation logic [52] and symbolic program execution, he is able to find provable disjointness between pointers (and their relative address spaces), meaning he can prove that two pointers will never point to the same address space, which exposes parallelism. These methodologies are encompassed in a tool he referred to as the Heap Analyzer. His work also explored the use of a linked-list dynamic memory allocator. However, this was not his focus and therefore limited detail was provided.

4.9 Hi-DMM

Liang et al.’s Hi-DMM is a framework which modifies source code containing dynamic memory allocations to be high-level synthesis friendly [33]. By performing a source-to-source transformation, HLS-amenable specialized buddy allocators are implemented in place of generic malloc and free calls. Through their program analysis, they distinguish several types of allocation requests, which can be one of the following:

1. Constant-Coarse-Grained Allocation (CCGA): Requested byte size is large and known at compile time

2. Constant-Fine-Grained Allocation (CFGA): Requested byte size is small and known at compile time

3. Variable-Grained Allocation (VGA): Requested size is not known.
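As a rough illustration of these three categories, the C call sites below show what each kind of request might look like in user code. The function names and sizes are hypothetical, and the threshold that separates "large" from "small" is not stated in the text, so the labels in the comments are only indicative.

```c
#include <stdlib.h>

enum { FRAME_BYTES = 1 << 20, NODE_BYTES = 16 };   /* hypothetical sizes */

/* CCGA-like: the size is a large compile-time constant. */
void *alloc_frame(void) { return malloc(FRAME_BYTES); }

/* CFGA-like: the size is a small compile-time constant. */
void *alloc_node(void) { return malloc(NODE_BYTES); }

/* VGA-like: the size is only known at runtime. */
void *alloc_buffer(size_t n) { return malloc(n); }
```

The first two sizes are visible to a static analysis of the source, while the third depends on a runtime value.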

After their analysis, their specialized allocators are paired with one of the allocation request types, in an attempt to improve performance. These specialized buddy allocators can only be accessed through an HLS handshake protocol, which introduces additional latency to the design. Additionally, Hi-DMM automatically partitions the heap through a graph-based analysis of the program using Karger’s algorithm [53]. This algorithm iteratively cuts the graph until the desired number of heaps is met. The authors’ approach to heap partitioning and dynamic memory support is optimized for their allocation library. In this thesis, we present a static analysis which can determine the number of safe heap partitions. Our framework is also general for any well-defined allocation mechanism.

4.10 Summary

In this chapter, we reviewed previous work which investigates the usage and optimization of dynamic memory allocation routines, either as a hardware module or in the high-level synthesis process. We also commented on state-of-the-art tools and whether they support the synthesis of dynamic memory allocation routines during the HLS process. From our research, we have identified some gaps in the study of dynamic memory allocation in HLS. There has been no comparative evaluation of dynamic memory allocation algorithms in the HLS context. Our work evaluates the high-level synthesis of five unique dynamic memory allocation algorithms in a number of environments. From our evaluation, we provide a guideline to assist HLS designers in selecting an allocator. Additionally, our work explores performance and area optimizations that are possible with dynamic memory allocation algorithms in HLS.


Chapter 5

dmbenchhls: A Dynamic Memory Allocation Benchmark Suite

There is no standardized way to evaluate the high-level synthesis of dynamic memory allocation techniques and associated techniques. In this chapter, we explore a variety of methods to evaluate dynamic memory allocation within high-level synthesis. We first review several memory request patterns. We then review applications that require dynamic memory allocation.

5.1 Memory Request Patterns

We define a number of memory request patterns that typically appear within applications and were previously suggested in [54]. These patterns are as follows:

5.1.1 Triangle

//==-- Triangular Memory Pattern --==//
int *arr[BOUND];
for (int i = 0; i < BOUND; ++i)
    arr[i] = malloc(choose_size());

//.. Do computation with 2-D Array
for (int i = 0; i < BOUND; ++i)
    free(arr[i]);

Figure 5.1: The triangle memory pattern.

The triangle memory pattern, presented in Fig. 5.1, iteratively requests memory up front. Once the memory can be released, this pattern iteratively releases the reserved memory. This pattern can vary in a number of ways. Request sizes can be constant or computed by a function (e.g., randomly generated or linearly increasing), represented by choose_size(). Additionally, memory need not be released in the same order in which it was requested. The number of design choices for this pattern is large, and dynamic memory allocation algorithms may therefore perform differently depending on the allocation sizes and ordering. We implement the triangle memory pattern to request memory iteratively in a linear fashion, where the loop iteration index dictates the request size. Additionally, we release memory in the same order it was allocated. This provides a realistic evaluation of an allocator, in the sense that a linearly increasing request size stresses the underlying allocation algorithm [54].
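Two of the design choices described above, the size function and the release order, can be made explicit in a small sketch. It assumes a linear size function in which iteration i requests i + 1 integers, and it exposes the release order as a parameter. The names `run_triangle` and `release_order`, the value of `BOUND`, and the index argument to `choose_size` are illustrative assumptions, not the benchmark's actual code.

```c
#include <stdlib.h>

#define BOUND 16  /* assumed number of requests */

typedef enum { RELEASE_FIFO, RELEASE_LIFO } release_order;

/* Linear request size: iteration i asks for room for (i + 1) ints. */
static size_t choose_size(int i) {
    return (size_t)(i + 1) * sizeof(int);
}

/*
 * Allocate BOUND blocks up front, touch each one, then release them
 * either in allocation order (FIFO) or in reverse (LIFO).
 * Returns a checksum of the stored values, or -1 if a request fails.
 */
long run_triangle(release_order order) {
    int *arr[BOUND];
    long checksum = 0;

    for (int i = 0; i < BOUND; ++i) {
        arr[i] = malloc(choose_size(i));
        if (arr[i] == NULL) {
            while (i-- > 0) free(arr[i]);  /* undo earlier requests */
            return -1;
        }
        arr[i][i] = i;                     /* write the last element */
    }

    for (int i = 0; i < BOUND; ++i)        /* stand-in for computation */
        checksum += arr[i][i];

    for (int i = 0; i < BOUND; ++i) {
        int idx = (order == RELEASE_FIFO) ? i : BOUND - 1 - i;
        free(arr[idx]);
    }
    return checksum;
}
```

Calling `run_triangle(RELEASE_FIFO)` corresponds to the configuration described above, where memory is released in the order it was allocated; `RELEASE_LIFO` exercises the reverse ordering with the same request sizes.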

5.1.2 Square

//==-- Square Memory Pattern --==//
for (int i = 0; i < BOUND; ++i) {
    int *arr = (int *)malloc(choose_size());
    //.. do some things
    free(arr);
}

Figure 5.2: The square memory pattern.

The square memory pattern, presented in Fig. 5.2, requests memory, executes program logic, and then immediately releases the memory. This request-do-release pattern is iterative. As with the triangle pattern, the request size can be constant or produced by a function. Our implementation requests memory based on the loop index, executes program logic (which does not contain any other memory requests), and then releases the memory.
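One way to see how the triangle and square patterns load an allocator differently is to count how many allocations are outstanding at once. The sketch below wraps malloc and free with counters; the wrapper names, `size_for`, and the value of `BOUND` are hypothetical, and the standard C library allocator stands in for an HLS allocator.

```c
#include <stdlib.h>

#define BOUND 8  /* assumed number of requests */

static int live = 0;  /* allocations currently outstanding */
static int peak = 0;  /* highest value of live seen so far */

static void *counted_malloc(size_t n) {
    void *p = malloc(n);
    if (p != NULL && ++live > peak)
        peak = live;
    return p;
}

static void counted_free(void *p) {
    if (p != NULL) {
        --live;
        free(p);
    }
}

/* Request size driven by the loop index, as in both patterns. */
static size_t size_for(int i) {
    return (size_t)(i + 1) * sizeof(int);
}

/* Triangle: all requests are made before any release. */
int peak_live_triangle(void) {
    int *arr[BOUND];
    live = peak = 0;
    for (int i = 0; i < BOUND; ++i)
        arr[i] = counted_malloc(size_for(i));
    for (int i = 0; i < BOUND; ++i)
        counted_free(arr[i]);
    return peak;
}

/* Square: each request is released before the next one is made. */
int peak_live_square(void) {
    live = peak = 0;
    for (int i = 0; i < BOUND; ++i) {
        int *arr = counted_malloc(size_for(i));
        counted_free(arr);
    }
    return peak;
}
```

Provided every request succeeds, the triangle pattern reaches `BOUND` simultaneous allocations, while the square pattern never holds more than one.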

5.1.3 Random

The random memory pattern, depicted in Fig. 5.3, consists of a compute kernel that randomly generates malloc requests at runtime (Lines 35 and 36). Each randomly generated malloc is given a randomly generated request size as input (Line 36). Each request is given a random lifetime (Line 14), which dictates the number of iterations to wait until free is invoked on this request (Lines 20 to 31). Our implementation holds the state of randomly generated mallocs in a List structure of fixed size (Line 8).

 1 //==-- Random Memory Pattern --==//
 2
 3 typedef struct List {
 4     int lifetime;
 5     int *address;
 6 } List;
 7
 8 List L[SIZE] = {0};
 9
10 void List_InsertNode(int *address) {
11     for (int i = 0; i < SIZE; ++i) {
12         if (L[i].address == NULL) {
13             L[i].address = address;
14             L[i].lifetime = rand();
15             break; // fill only the first free slot
16         }
17     }
18 }
19
20 void List_DecrementLifetime() {
21     for (int i = 0; i < SIZE; ++i) {
22         if (L[i].address != NULL) {
23             if (L[i].lifetime > 0) {
24                 L[i].lifetime--;
25             } else {
26                 free(L[i].address);
27                 L[i].address = NULL;
28             }
29         }
30     }
31 }
32
33 // Main Kernel ...
34 for (int i = 0; i < LARGE_BOUND; ++i) {
35     if (rand() > DEF1) {
36         List_InsertNode(malloc(rand()));
37     }
38     List_DecrementLifetime();
39 }

Figure 5.3: The random memory pattern.

5.2 Applications

As an additional evaluation methodology, we created five C applications that require dynamic memory allocation routines and are amenable to hardware implementation. Each benchmark in the suite differs significantly in high-level behavior and was inspired by real-life applications [55, 56].

• priq: A priority queue. An array of random numbers is queued (one at a time) and then popped, exhibiting a square memory pattern.

