Automatic Accelerator Selection
5.3 HLS-Aware Model
This model approaches the prediction of accelerator performance from a somewhat different direction than the PMU-based model. At a high level, the intuition behind this model is that if we have knowledge of an HLS tool’s capabilities, and of the hardware platform on which the accelerator will be used, it should be possible to accurately estimate accelerator runtime. Specifically, if we know how the HLS tool schedules the accelerator, and how often each part of the accelerator is active, we can determine how long the accelerator will take to operate on a specific dataset, provided it does not stall. For LegUp accelerators, stalls are generally due to memory operations; if we know how many times an accelerator will stall for memory, and for how long, it should be possible to determine the runtime of any accelerator. We explore this idea in the current section, and use it to build an HLS-aware model for estimating accelerator performance.
The HLS-aware model employs an approach similar to that used by LegUp for predicting accelerator latency for hybrid systems using the Tiger MIPS processor [14]. Although the overall approach is very similar, the implementation is completely separate; we must take several factors into account that were not relevant to the work in [14] due to differences in the processor architecture and memory system.
5.3.1 Theory
Figure 5.2 illustrates the three components that compose the total runtime for a LegUp accelerator:
1. ‘overhead cycles’ where arguments are transferred to the accelerator and return values are read back;
2. ‘scheduled cycles’ where the accelerator is doing useful (scheduled) work; and
3. ‘stall cycles’ where the accelerator is stalled waiting for memory operations.
Once all three of these components are known, the total runtime can be expressed as:
\[
\text{Latency} = o + \sum_{i \in BB} (r_i + m_i) \tag{5.1}
\]
where:
• $o$ is the accelerator overhead,
• $r_i$ is the runtime of basic block $i$, and
• $m_i$ is the memory overhead for basic block $i$.
As described in Section 2.3, a basic block is a code sequence with a single entry point and single exit point. An accelerator may consist of one or more basic blocks. The following sections describe how each of the three components of accelerator runtime can be estimated.
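As a minimal sketch of Equation 5.1, the sum can be written as a short function. The function name and the example values below are hypothetical and serve only to illustrate the arithmetic; all quantities are assumed to be expressed in the same unit (for example, FPGA cycles).

```python
def total_latency(overhead, blocks):
    """Equation 5.1: o plus the sum over basic blocks of (r_i + m_i).

    overhead: accelerator overhead o
    blocks:   iterable of (r_i, m_i) pairs, one pair per basic block
    All quantities must share one unit (e.g. FPGA cycles).
    """
    return overhead + sum(r + m for r, m in blocks)


# Hypothetical accelerator with two basic blocks (illustrative values only).
example_blocks = [(40, 6), (100, 12)]
latency = total_latency(3, example_blocks)
```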
Overhead Cycles
Overhead cycles can be computed in two parts: initialization and finalization. Initialization cycles can be determined statically by counting the number of arguments that must be sent to the accelerator, while finalization is related to the size of the return value. Each write to, or read from, the accelerator takes a fixed number of cycles. One-, two-, and four-byte arguments each take one FPGA cycle to transfer, while larger arguments are transferred four bytes per cycle. This is true for both function arguments and the return value. Table 5.3 shows the transfer times for a number of standard data types. Some additional overhead cycles are required for each accelerator invocation on the software side. For our ARM hybrid system, this value was experimentally determined to be approximately 120 processor cycles per accelerator invocation.
Figure 5.2: Components of the runtime of a LegUp accelerator.
Table 5.3: FPGA cycles required to transfer standard data types.
Data Type Accelerator Cycles
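The total overhead, in nanoseconds, can be computed as:
\[
\text{Overhead} = \left( \sum_{a \in args} \left\lceil \frac{sizeof(a)}{4} \right\rceil + \left\lceil \frac{sizeof(ret)}{4} \right\rceil \right) \times T_{FPGA} + 120 \times T_{ARM} \tag{5.2}
\]
where: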
• $sizeof()$ computes the size, in bytes, of its input,
• $args$ is the list of function arguments,
• $ret$ is the return value of the function,
• $T_{FPGA}$ is the clock period of the FPGA accelerator, and
• $T_{ARM}$ is the clock period of the ARM processor.
For example, an accelerator that takes a (32-bit) pointer argument and returns a double would require $\lceil sizeof(pointer)/4 \rceil + \lceil sizeof(double)/4 \rceil = 1 + 2 = 3$ FPGA cycles of transfer overhead. Assuming an accelerator frequency of 100 MHz and an ARM core frequency of 800 MHz, the total overhead would be 3 × 10 ns + 120 × 1.25 ns = 180 ns.
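The overhead calculation can be sketched in a few lines of Python. This is an illustration rather than the thesis implementation: the function and constant names are hypothetical, and treating a void return as size 0 (zero transfer cycles) is an assumption. The final call reproduces the pointer-argument, double-return example from the text.

```python
import math

ARM_CYCLES_PER_CALL = 120  # software-side overhead reported in the text
BYTES_PER_TRANSFER = 4     # values are moved four bytes per FPGA cycle


def transfer_cycles(size_bytes):
    """FPGA cycles needed to move one value of the given size."""
    return math.ceil(size_bytes / BYTES_PER_TRANSFER)


def overhead_ns(arg_sizes, ret_size, t_fpga_ns, t_arm_ns):
    """Invocation overhead in nanoseconds.

    arg_sizes: sizes in bytes of the function arguments
    ret_size:  size in bytes of the return value (assumed 0 for void)
    """
    fpga_cycles = sum(transfer_cycles(s) for s in arg_sizes)
    fpga_cycles += transfer_cycles(ret_size)
    return fpga_cycles * t_fpga_ns + ARM_CYCLES_PER_CALL * t_arm_ns


# 32-bit pointer argument, double return value, 100 MHz FPGA, 800 MHz ARM.
example = overhead_ns(arg_sizes=[4], ret_size=8, t_fpga_ns=10.0, t_arm_ns=1.25)
```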
Scheduled Cycles
To compute the ‘scheduled cycles’, it is necessary to know the number of cycles for which each basic block is scheduled, as well as the total number of times each basic block is executed when running on a representative set of inputs. This is true for both sequential accelerators and pipelined accelerators; however, for pipelined basic blocks we also need to know the initiation interval (II) of the basic block. The schedule length (and the II, where applicable) is available from the reports generated by LegUp, while the execution counts are obtained from a control flow trace of the program.
The control flow trace is obtained by running the target program through the QEMU emulator described in Section 3.4.6.
Figure 5.3 shows trip count annotations on a simple graph of basic blocks. These trip counts are obtained from the application control flow trace. The operations in each basic block in Figure 5.3 can be represented as a dataflow graph. For example, a basic block may have a dataflow graph as shown in Figure 5.4. LegUp may schedule this graph as shown in Figure 5.5.
Given the trip counts and schedule of each basic block, we can estimate the overall scheduled runtime of the basic block.
The scheduled runtime of the basic block, $r_i$, is computed differently for non-pipelined basic blocks and basic blocks that are part of a loop pipeline. We can compute $r_i$ as follows:
\[
r_i =
\begin{cases}
L_{BB} \times E_{BB} & \text{for regular BB} \\
II_{BB} \times TC_{BB} + L_{BB} - 1 & \text{for pipelined BB}
\end{cases}
\tag{5.3}
\]
where:
• $L_{BB}$ is the scheduled latency of the basic block,
• $E_{BB}$ is the number of times the basic block is executed,
• $II_{BB}$ is the initiation interval of a basic block representing a pipelined loop body, and
• $TC_{BB}$ is the trip count of the loop.
$L_{BB}$ and $II_{BB}$ are obtained by running the HLS tool, while $E_{BB}$ is obtained from the control-flow trace. As a reminder, the initiation interval of a loop is the number of cycles between successive loop iterations, as described in Section 2.2.1. For loops with a static trip count, $TC_{BB}$ is obtained from the HLS tool; otherwise, it can be obtained from the control-flow trace.
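The two cases of Equation 5.3 can be sketched as a single function. The function name and the numeric values are hypothetical; `count` stands for the execution count of a regular basic block or the trip count of a pipelined one.

```python
def scheduled_cycles(latency, count, ii=None):
    """Scheduled runtime r_i of one basic block, following Equation 5.3.

    latency: scheduled latency L_BB of the basic block
    count:   execution count E_BB (regular) or trip count TC_BB (pipelined)
    ii:      initiation interval II_BB; None marks a non-pipelined block
    """
    if ii is None:
        return latency * count
    return ii * count + latency - 1


# Illustrative values only.
regular = scheduled_cycles(latency=4, count=100)          # 4 * 100
pipelined = scheduled_cycles(latency=6, count=100, ii=2)  # 2 * 100 + 6 - 1
```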
One advantage of estimating scheduled runtime in this way is that the model adapts as LegUp evolves. For example, if LegUp implements more aggressive use of loop-pipelining, this will be reflected in the generated schedule, which is then used by the model to produce estimates which will in turn reflect any changes to loop-pipelining.
Figure 5.3: The basic blocks for a simple program annotated with trip counts.
Figure 5.4: A basic dataflow graph.
Figure 5.5: The scheduled dataflow graph.
Memory Stall Cycles
The final component of accelerator runtime is memory stall cycles. This is where the majority of error is introduced into the HLS-aware model since there are many uncertainties involved when estimating the number of stall cycles. For example, if the third load in Figure 5.5 stalls, the schedule may end up looking more like that in Figure 5.6. In general, the number of memory stall cycles can be determined as follows:
\[
m_i = \sum w_p + 2 \times \sum r_{p,h} + L_{p,m} \times \sum r_{p,m} \tag{5.4}
\]
where:
• $w_p$ is the number of writes to processor memory,
• $r_{p,h}$ is the number of reads from processor memory that hit in the FPGA-side cache,
• $r_{p,m}$ is the number of reads from processor memory that miss in the FPGA-side cache, and
• $L_{p,m}$ is the latency of processor memory reads that miss in the FPGA-side cache.
For reference, the hybrid system we target is shown with memory components highlighted in Figure 5.7.
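Equation 5.4 can be sketched as follows for a single basic block. The function name is hypothetical, and the miss latency of 30 cycles in the example is an assumed value chosen only to show the arithmetic; the source does not give a fixed figure for it.

```python
WRITE_STALL = 1     # stall cycles per write to processor memory (Equation 5.4)
READ_HIT_STALL = 2  # stall cycles per read that hits the FPGA-side cache


def memory_stall_cycles(writes, read_hits, read_misses, miss_latency):
    """Memory stall cycles m_i for one basic block, following Equation 5.4."""
    return (WRITE_STALL * writes
            + READ_HIT_STALL * read_hits
            + miss_latency * read_misses)


# Illustrative counts; miss_latency=30 is an assumption, not a measured value.
stalls = memory_stall_cycles(writes=10, read_hits=50, read_misses=5,
                             miss_latency=30)
```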
From the memory trace of the program, we are able to determine how many read and write operations are performed in each basic block, and the associated memory locations. The memory trace is obtained by running an instrumented version of the target program through the QEMU emulator described in Section 3.4.6. The HLS tool determines whether the memory locations are local to the accelerator or shared with the processor. If the memory is shared with the processor, a cache simulator predicts whether the memory access will hit or miss in the FPGA-side cache.² We use the cache simulator described in [14] in this work. Accesses to local memories within the accelerator do not contribute any extra cycles. The latency of reads and writes to the processor memory is determined by the system architecture.
² The cache simulator is not always accurate since memories have different addresses during the memory trace and when generating hardware.
Figure 5.6: The scheduled dataflow graph with an unscheduled stall.
Figure 5.7: Hybrid system memory components.
In our target architecture, outlined in Section 3.5, all writes to processor memory ($w_p$) take two cycles; one cycle is absorbed by the state in which the write is scheduled, while the second cycle will stall the accelerator. However, the accelerator will stall for longer if more than one write is scheduled in the same cycle. For example, if the HLS tool schedules two writes in the same cycle, the accelerator will stall for three cycles since only one of the four required write cycles is absorbed by the state in which the write operations were scheduled.
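The write-stall rule can be sketched as below. The text gives only the one-write and two-write cases; extending the same pattern to more than two writes in a state is an assumption, and the function name is hypothetical.

```python
WRITE_CYCLES = 2  # cycles taken by each write to processor memory


def write_stall_cycles(writes_in_state):
    """Stall cycles caused by writes scheduled in a single state.

    One cycle of the total is absorbed by the state itself. The formula for
    more than two writes extrapolates the two cases given in the text.
    """
    if writes_in_state == 0:
        return 0
    return WRITE_CYCLES * writes_in_state - 1


one_write = write_stall_cycles(1)   # 2 cycles needed, 1 absorbed
two_writes = write_stall_cycles(2)  # 4 cycles needed, 1 absorbed
```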
Accelerator reads that hit in the FPGA-side cache ($r_{p,h}$) stall the accelerator for two cycles, while misses ($r_{p,m}$) stall the accelerator for much longer. Cache miss latency ($L_{p,m}$) depends on a number of factors including the state of the cache, the position of the requested data in the cache line, and whether the requested data is in the processor caches. The number of cache hits and misses is estimated using the memory trace and the cache simulator.
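To illustrate how hit and miss counts might be derived from a trace of read addresses, the sketch below models a simple direct-mapped cache. It is not the simulator from [14]: the direct-mapped organisation, the default line size and line count, and the decision to ignore writes are all assumptions made for the sake of a short example.

```python
def count_read_hits_misses(addresses, num_lines=512, line_bytes=32):
    """Replay read addresses through a direct-mapped cache model.

    addresses:  iterable of byte addresses read by the accelerator
    num_lines:  number of cache lines (assumed value)
    line_bytes: bytes per cache line (assumed value)
    Returns (hits, misses).
    """
    tags = [None] * num_lines  # tag stored in each line; None means empty
    hits = misses = 0
    for addr in addresses:
        line_addr = addr // line_bytes
        index = line_addr % num_lines
        tag = line_addr // num_lines
        if tags[index] == tag:
            hits += 1
        else:
            misses += 1
            tags[index] = tag  # fill the line on a miss
    return hits, misses


# Hypothetical trace: three reads in one line, one in the next, one repeat.
hits, misses = count_read_hits_misses([0, 4, 8, 32, 0])
```

The resulting counts would play the role of $r_{p,h}$ and $r_{p,m}$ in Equation 5.4.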
Several heuristics were developed to help estimate the memory overhead for our target architecture. However, these heuristics do not perform well in all situations and prediction accuracy suffers as a result.
Although we attempt to model all of these factors, the memory trace does not provide enough information to always know whether a read will result in a cache hit or miss. For example, the memory trace is only capable of showing reads and writes at a basic block level, not at the cycle-by-cycle level at which LegUp schedules each basic block. To add complexity, the latency of a cache miss also depends on where the requested data lies in the cache line. If the requested data is at the beginning of a cache line, the cache will return the data as soon as it is available. However, if the data is at the end of the cache line, the cache cannot return the data until it receives the entire cache line. In addition, if the accelerator issues a read while the cache is performing a linefill, the cache will finish the linefill before servicing the read.
5.3.2 HLS-Aware Prediction Flow
As described in the previous section, a large number of variables are required to compute an estimate of accelerator latency. To collect the required information, we run a memory and control flow trace of the program, run the program through the LegUp HLS tool flow, and perform a static analysis of the program. Figure 5.8 provides an overview of the full flow. The input to all stages of the flow is the C-code source file. The LegUp tool is run to obtain the scheduling information for the corresponding hardware accelerator. A program trace, using a set of representative inputs, is used to obtain the basic block execution count and memory access trace of the program. The schedule obtained from LegUp and the basic block execution count are combined to determine the first component of accelerator runtime: the ‘scheduled cycles’. The memory access trace is run through a cache simulator to determine an estimate of the number of ‘memory stall cycles’. Finally, a static analysis of the C source code provides the number of ‘overhead cycles’.
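The way the three sources of information come together can be sketched as follows. The dictionary layout, the function name, and the example numbers are hypothetical; the sketch also assumes that scheduled and stall cycles are FPGA cycles converted to nanoseconds with the FPGA clock period, which the source does not state explicitly.

```python
def predict_latency_ns(schedule, counts, stall_cycles, overhead_ns, t_fpga_ns):
    """Combine schedule, trace, and stall estimates into one latency figure.

    schedule:     {block: {"latency": L_BB, "ii": II_BB or None}} from the HLS tool
    counts:       {block: execution count, or trip count for pipelined blocks}
    stall_cycles: {block: estimated memory stall cycles m_i}
    overhead_ns:  invocation overhead from static analysis, in nanoseconds
    t_fpga_ns:    FPGA clock period in nanoseconds
    """
    total_cycles = 0
    for block, info in schedule.items():
        if info["ii"] is None:
            scheduled = info["latency"] * counts[block]
        else:
            scheduled = info["ii"] * counts[block] + info["latency"] - 1
        total_cycles += scheduled + stall_cycles.get(block, 0)
    return overhead_ns + total_cycles * t_fpga_ns


# Hypothetical two-block accelerator: a regular entry block and a pipelined loop.
schedule = {"entry": {"latency": 3, "ii": None}, "loop": {"latency": 6, "ii": 2}}
counts = {"entry": 1, "loop": 100}
stalls = {"loop": 260}
estimate = predict_latency_ns(schedule, counts, stalls,
                              overhead_ns=180.0, t_fpga_ns=10.0)
```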