HLS-Aware Model - Automatic Accelerator Selection


5.3 HLS-Aware Model

This model approaches prediction of accelerator performance from a somewhat different direction than the PMU-based model. At a high level, the intuition behind this model is that if we have knowledge of an HLS tool’s capabilities, and of the hardware platform on which the accelerator will be used, it should be possible to accurately estimate accelerator runtime. Specifically, if we know how the HLS tool schedules the accelerator, and how often each part of the accelerator is active, we can determine how long the accelerator will take to operate on a specific dataset, provided it does not stall. For LegUp accelerators, stalls are generally due to memory operations; if we know how many times an accelerator will stall for memory, and for how long, it should be possible to determine the runtime of any accelerator. We explore this idea in the current section, and use it to build an HLS-aware model for estimating accelerator performance.

The HLS-aware model employs an approach similar to that used by LegUp for predicting accelerator latency for hybrid systems using the Tiger MIPS processor [14]. Although the overall approach is very similar, the implementation is completely separate; we must take several factors into account that were not relevant to the work in [14] due to differences in the processor architecture and memory system.

5.3.1 Theory

Figure 5.2 illustrates the three components that compose the total runtime for a LegUp accelerator:

1. ‘overhead cycles’ where arguments are transferred to the accelerator and return values are read back;

2. ‘scheduled cycles’ where the accelerator is doing useful (scheduled) work; and

3. ‘stall cycles’ where the accelerator is stalled waiting for memory operations.

Once all three of these components are known, the total runtime can be expressed as:

$$\text{Latency} = o + \sum_{i \in BB} (r_i + m_i) \tag{5.1}$$

where:

• o is the accelerator overhead,

• r_i is the runtime of basic block i, and

• m_i is the memory overhead for basic block i.

As described in Section 2.3, a basic block is a code sequence with a single entry point and single exit point. An accelerator may consist of one or more basic blocks. The following sections describe how each of the three components of accelerator runtime can be estimated.
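As a minimal sketch of Equation 5.1, the sum can be written in a few lines of Python. The function name, the `(r_i, m_i)` tuple layout, and the numbers in the example are illustrative assumptions, and all values are assumed to be expressed in the same unit (for instance, FPGA cycles).

```python
def total_latency(overhead, blocks):
    """Equation 5.1: overhead plus the sum of (r_i + m_i) over all basic blocks.

    `blocks` is an iterable of (r_i, m_i) pairs. All values are assumed to
    share one unit (e.g. FPGA cycles); this layout is hypothetical.
    """
    return overhead + sum(r + m for r, m in blocks)


# Hypothetical accelerator with three basic blocks (made-up numbers).
example_blocks = [(40, 6), (1200, 310), (12, 0)]
print(total_latency(18, example_blocks))
```

The remaining subsections describe how each input to this sum might be estimated.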

Overhead Cycles

Overhead cycles can be computed in two parts: initialization and finalization. Initialization cycles can be determined statically by counting the number of arguments that must be sent to the accelerator, while finalization is related to the size of the return value. Each write to, or read from, the accelerator takes a fixed number of cycles. One-, two-, and four-byte arguments each take one cycle on the FPGA to be transferred, while larger arguments are transferred four bytes per cycle. This is true for both function arguments and the return value. Table 5.3 shows the transfer times for a number of standard data types. Some additional overhead cycles are required for each accelerator invocation on the software side. For our ARM hybrid system, this value was experimentally determined to be approximately 120 processor cycles per accelerator invocation.

Figure 5.2: Components of the runtime of a LegUp accelerator.

Table 5.3: FPGA cycles required to transfer standard data types.

Data Type Accelerator Cycles

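The total overhead, in nanoseconds, can be computed as:

$$\text{Overhead} = \left( \sum_{a \in args} \left\lceil \frac{sizeof(a)}{4} \right\rceil + \left\lceil \frac{sizeof(ret)}{4} \right\rceil \right) \times T_{FPGA} + 120 \times T_{ARM} \tag{5.2}$$

where:

• sizeof() computes the size, in bytes, of the input,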

• args is the list of function arguments,

• ret is the return value of the function,

• T_FPGA is the clock period of the FPGA accelerator, and

• T_ARM is the clock period of the ARM processor.

For example, an accelerator that takes a (32-bit) pointer argument and returns a double would require ⌈sizeof(pointer)/4⌉ + ⌈sizeof(double)/4⌉ = 1 + 2 = 3 FPGA cycles of transfer overhead. Assuming an accelerator frequency of 100 MHz and an ARM core frequency of 800 MHz, the total overhead would be 3 × 10 ns + 120 × 1.25 ns = 180 ns.
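The worked example above can be reproduced with a short Python sketch. The function names are hypothetical, and treating a function with no return value as contributing zero transfer cycles is an assumption not stated in the text.

```python
import math

ARM_CYCLES_PER_CALL = 120  # software-side cost per invocation, as measured in the text


def transfer_cycles(size_bytes):
    """FPGA cycles to transfer one value: one cycle for 1, 2, or 4 bytes,
    and four bytes per cycle for anything larger."""
    return max(1, math.ceil(size_bytes / 4))


def overhead_ns(arg_sizes, ret_size, t_fpga_ns, t_arm_ns):
    """Total invocation overhead in nanoseconds.

    arg_sizes: sizes in bytes of the function arguments.
    ret_size:  size in bytes of the return value (0 is assumed to mean
               "no return value" and adds no cycles; this is an assumption).
    """
    fpga_cycles = sum(transfer_cycles(s) for s in arg_sizes)
    if ret_size > 0:
        fpga_cycles += transfer_cycles(ret_size)
    return fpga_cycles * t_fpga_ns + ARM_CYCLES_PER_CALL * t_arm_ns


# 32-bit pointer argument, double return value, 100 MHz FPGA, 800 MHz ARM.
print(overhead_ns([4], 8, t_fpga_ns=10.0, t_arm_ns=1.25))  # 180.0
```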

Scheduled Cycles

To compute the ‘scheduled cycles’, it is necessary to know the number of cycles for which each basic block is scheduled, as well as the total number of times each basic block is executed when running on a representative set of inputs. This is true for both sequential accelerators and pipelined accelerators; however, for pipelined basic blocks we also need to know the initiation interval (II) of the basic block. The first piece of information is available from the reports generated by LegUp, while a control flow trace of the program is used to obtain the second piece of information.

The control flow trace is obtained by running the target program through the QEMU emulator described in Section 3.4.6.

Figure 5.3 shows trip count annotations on a simple graph of basic blocks. These trip counts are obtained from the application control flow trace. The operations in each basic block in Figure 5.3 can be represented as a dataflow graph. For example, a basic block may have a dataflow graph as shown in Figure 5.4. LegUp may schedule this graph as shown in Figure 5.5.

Given the trip counts and schedule of each basic block, we can estimate the overall scheduled runtime of the basic block.
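As an illustration of how execution counts could be derived from a control flow trace, the sketch below tallies basic-block identifiers with a `Counter`. The trace format (a flat sequence of basic-block names) and the block names are assumptions made for the example; the actual QEMU trace format is not described here.

```python
from collections import Counter


def execution_counts(trace):
    """Count how many times each basic block appears in a control flow trace.

    `trace` is assumed to be an iterable of basic-block identifiers in
    execution order (a hypothetical format).
    """
    return Counter(trace)


# Hypothetical trace: the loop body executes three times.
trace = ["entry", "loop", "loop", "loop", "exit"]
counts = execution_counts(trace)
print(counts["loop"])  # 3
```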

The scheduled runtime of a basic block, r_i, is computed differently for non-pipelined basic blocks and for basic blocks that are part of a loop pipeline. We can compute r_i as follows:

$$r_i = \begin{cases} L_{BB} \times E_{BB} & \text{for regular BB} \\ II_{BB} \times TC_{BB} + L_{BB} - 1 & \text{for pipelined BB} \end{cases} \tag{5.3}$$

where:

• L_BB is the scheduled latency of the basic block,

• E_BB is the number of times the basic block is executed,

• II_BB is the initiation interval of a basic block representing a pipelined loop body, and

• TC_BB is the trip count of the loop.

L_BB and II_BB are obtained by running the HLS tool, while E_BB is obtained from the control-flow trace. As a reminder, the initiation interval of a loop is the number of cycles between successive loop iterations, as described in Section 2.2.1. For loops with a static trip count, TC_BB is obtained from the HLS tool; otherwise it can be obtained from the control-flow trace.
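Equation 5.3 can be sketched in Python as two small functions, one per case. The function and parameter names are illustrative, and the numeric inputs in the example are made up.

```python
def regular_block_cycles(latency, executions):
    """Non-pipelined basic block: L_BB * E_BB."""
    return latency * executions


def pipelined_block_cycles(ii, trip_count, latency):
    """Pipelined loop body, as given in Equation 5.3: II_BB * TC_BB + L_BB - 1."""
    return ii * trip_count + latency - 1


# Hypothetical values: a 3-cycle block executed 100 times, and a pipelined
# loop with II = 1, 100 iterations, and a 5-cycle body latency.
print(regular_block_cycles(latency=3, executions=100))            # 300
print(pipelined_block_cycles(ii=1, trip_count=100, latency=5))    # 104
```

With these example numbers, pipelining reduces the scheduled cycles from 300 to 104, which is the kind of change the model picks up directly from the schedule.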

One advantage of estimating scheduled runtime in this way is that the model adapts as LegUp evolves. For example, if LegUp implements more aggressive use of loop-pipelining, this will be reflected in the generated schedule, which is then used by the model to produce estimates which will in turn reflect any changes to loop-pipelining.


Figure 5.3: The basic blocks for a simple program annotated with trip counts.

Figure 5.4: A basic dataflow graph.

Figure 5.5: The scheduled dataflow graph.

Memory Stall Cycles

The final component of accelerator runtime is memory stall cycles. This is where the majority of error is introduced into the HLS-aware model since there are many uncertainties involved when estimating the number of stall cycles. For example, if the third load in Figure 5.5 stalls, the schedule may end up looking more like that in Figure 5.6. In general, the number of memory stall cycles can be determined as follows:

$$m_i = \sum w_p + 2 \times \sum r_{p,h} + L_{p,m} \times \sum r_{p,m} \tag{5.4}$$

where:

• w_p is the number of writes to processor memory

• r_p,h is the number of reads from processor memory that hit in the FPGA-side cache

• r_p,m is the number of reads from processor memory that miss in the FPGA-side cache

• L_p,m is the latency of processor memory reads that miss in the FPGA-side cache
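A minimal sketch of Equation 5.4 follows, taking the per-block access counts as inputs. The function name and the miss latency of 30 cycles used in the example are assumptions for illustration only; the text does not give a specific value for L_p,m.

```python
def memory_stall_cycles(writes, read_hits, read_misses, miss_latency):
    """Equation 5.4 for one basic block.

    writes:       number of writes to processor memory (w_p)
    read_hits:    reads that hit in the FPGA-side cache (r_p,h)
    read_misses:  reads that miss in the FPGA-side cache (r_p,m)
    miss_latency: stall cycles per miss (L_p,m)
    """
    return writes + 2 * read_hits + miss_latency * read_misses


# Hypothetical block: 10 writes, 20 read hits, 3 read misses, and an
# assumed miss latency of 30 cycles.
print(memory_stall_cycles(10, 20, 3, miss_latency=30))  # 140
```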

For reference, the hybrid system we target is shown with memory components highlighted in Figure 5.7.

From the memory trace of the program we are able to determine how many read and write operations are performed in each basic block, and the associated memory locations. The memory trace is obtained by running an instrumented version of the target program through the QEMU emulator described in Section 3.4.6. The HLS tool determines whether the memory locations are local to the accelerator or shared with the processor. If the memory is shared with the processor a cache simulator predicts whether the memory access will hit or miss in the FPGA-side cache². We use the cache simulator described in [14] in this work. Accesses to local memories within the accelerator do not contribute any extra cycles. The latency of reads and writes to the processor memory is determined by the system architecture.

² The cache simulator is not always accurate since memories have different addresses during the memory trace and when generating hardware.

Figure 5.6: The scheduled dataflow graph with an unscheduled stall.

Figure 5.7: Hybrid system memory components.

In our target architecture, outlined in Section 3.5, all writes to processor memory (w_p) take two cycles; one cycle is absorbed by the state in which the write is scheduled, while the second cycle will stall the accelerator. However, the accelerator will stall for longer if more than one write is scheduled in the same cycle. For example, if the HLS tool schedules two writes in the same cycle, the accelerator will stall for three cycles, since only one of the four required write cycles is absorbed by the state in which the write operations were scheduled.
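The write-stall reasoning above can be captured in a small helper. Generalizing from the one-write and two-write cases to n writes per state (2n required cycles, one of which is absorbed) is an assumption that follows the same argument; the text only states the first two cases.

```python
def write_stall_cycles(writes_in_state):
    """Stall cycles caused by writes scheduled in a single state.

    Each write needs two cycles and the state itself absorbs one cycle in
    total, so n writes are assumed to stall for 2n - 1 cycles.
    """
    if writes_in_state == 0:
        return 0
    return 2 * writes_in_state - 1


print(write_stall_cycles(1))  # 1
print(write_stall_cycles(2))  # 3
```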

Accelerator reads that hit in the FPGA-side cache (r_p,h) stall the accelerator for two cycles, while misses (r_p,m) stall the accelerator for much longer. Cache miss latency (L_p,m) depends on a number of factors including the state of the cache, the position of the requested data in the cache line, and whether the requested data is in the processor caches. The number of cache hits and misses is estimated using the memory trace and the cache simulator.
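To illustrate how hit and miss counts might be estimated from a memory trace, the sketch below models a simple direct-mapped cache. It is not the cache simulator of [14]; the direct-mapped organization, the 32-byte line size, and the 512-line capacity are all assumptions chosen for the example.

```python
def simulate_reads(addresses, line_bytes=32, num_lines=512):
    """Count read hits and misses for a direct-mapped cache.

    addresses:  byte addresses of reads, in trace order.
    line_bytes: cache line size in bytes (assumed value).
    num_lines:  number of cache lines (assumed value).
    Returns a (hits, misses) tuple.
    """
    tags = [None] * num_lines  # tag currently stored in each line, if any
    hits = misses = 0
    for addr in addresses:
        line = addr // line_bytes
        index = line % num_lines
        tag = line // num_lines
        if tags[index] == tag:
            hits += 1
        else:
            misses += 1
            tags[index] = tag  # fill the line on a miss
    return hits, misses


# Three reads within one line, one read in the next line, then a re-read.
print(simulate_reads([0, 4, 8, 32, 0]))  # (3, 2)
```

The resulting counts correspond to r_p,h and r_p,m for the traced reads.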

Several heuristics were developed to help estimate the memory overhead for our target architecture. However, these heuristics do not perform well in all situations and prediction accuracy suffers as a result.

Although we attempt to model all of these factors, the memory trace does not provide enough information to always know whether a read will result in a cache hit or miss. For example, the memory trace is only capable of showing reads and writes at a basic block level, not at the cycle-by-cycle level at which LegUp schedules each basic block. To add complexity, the latency of a cache miss also depends on where the requested data lies in the cache line. If the requested data is at the beginning of a cache line, the cache will return the data as soon as it is available. However, if the data is at the end of the cache line, the cache cannot return the data until it receives the entire cache line. In addition, if the accelerator issues a read while the cache is performing a linefill, the cache will finish the linefill before servicing the read.

5.3.2 HLS-Aware Prediction Flow

As described in the previous section, there are a large number of variables required to compute an estimate of accelerator latency. To collect the required information, we run a memory and control flow trace of the program, run the program through the LegUp HLS tool flow, and perform a static analysis of the program. Figure 5.8 provides an overview of the full flow. The input to all stages of the flow is the C-code source file. The LegUp tool is run to obtain the scheduling information for the corresponding hardware accelerator. A program trace, using a set of representative inputs, is used to obtain the basic block execution count and memory access trace of the program. The schedule obtained from LegUp and the basic block execution count are combined to determine the first component of accelerator runtime: the ‘scheduled cycles’. The memory access trace is run through a cache simulator to determine an estimate of the number of ‘memory stall cycles’. Finally, a static analysis of the C source code provides the number of ‘overhead cycles’.

In document Automatic Accelerator Selection for the LegUp ARM Hybrid Flow. Bain A. Syrowik (Page 72-80)