Performance Comparison with ALU Based Accelerators

Modulo Scheduling onto Customized Single-Row Accelerators

6.5 Performance Comparison with ALU Based Accelerators

Many existing approaches rely on architectures that allow for some programmability, for instance mesh arrays of multiple-function FUs with rich interconnects (such as nearest neighbour connections aided by less numerous longer connections to distant units). Architectures such as these are usually designed once, possibly by a quantitative approach, to be flexible enough that future sub-graphs can be successfully mapped and executed. That is, these designs must be rich enough in terms of resources, and especially interconnection capability, to increase applicability. On the other hand, this may incur considerable resource costs.

Figure 6.12: Resource requirements for multi-loop accelerators vs. sum of resources for individual loop accelerators
[Bar chart. Axis: resources (thousands), 0 to 15. Groups: fp_2_8, fp_3_9, fp_4_5_6, fp_7_10, int_5_6, int_8_9, int_10_11. Series: Multi-Loop and Sum values for LUTs, FFs and Slices.]

Table 6.4: Accelerator generation scenarios

Scenario  Description
a1        Allocation of any number of resources of any type
a2        Allocation of any number of ALUs (+ other units)
b1        2 ALUs + 1 Multiplier + 1 Branch Unit (+ 1 FPU)
b2        4 ALUs + 1 Multiplier + 1 Branch Unit (+ 1 FPU)
b3        8 ALUs + 1 Multiplier + 1 Branch Unit (+ 1 FPU)

The results presented so far are relative to fully customized accelerators, that is, instances with any number of operation-specific FUs. The objective of generating fully customized designs is to maximize performance and to decrease resource usage. In this section the advantages of this customization are evaluated by comparing fully customized accelerators with instances containing a fixed number of ALUs, to establish a comparison with existing static resource accelerators.

Table 6.4 summarizes the five types of accelerators generated for this experiment using the developed scheduler. Scenario a1 was presented in the previous section: fully customized generation of accelerators with boundless resource allocation. Scenarios b1, b2 and b3 contain a fixed number of resources. To achieve this, the scheduler was tuned so that scheduling starts with the given units and no further FUs are allocated. Instead, as per typical modulo-scheduling approaches, the II is increased until a valid schedule is possible. An additional scenario, a2, uses boundless allocation of ALUs.
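The fixed-resource behaviour described above (no extra units are allocated, the II grows until a schedule fits) can be illustrated with a minimal sketch. This is not the scheduler developed in this work: it is a simplified greedy modulo scheduler that ignores loop-carried dependences and interconnect constraints, and the function name, input format and example loop are all hypothetical.

```python
def modulo_schedule(ops, deps, units, max_ii=64):
    """Greedy modulo scheduling onto a fixed set of units (illustrative only).

    ops:   dict mapping op name -> (unit type, latency in cycles); the ops
           must be listed in topological order (hypothetical input format)
    deps:  list of (producer, consumer) pairs within one iteration
    units: dict mapping unit type -> number of available instances
    Returns (ii, start_times) for the first II that admits a schedule.
    """
    ii = 1
    while ii <= max_ii:
        # Modulo reservation table: usage[kind][slot] counts ops issued
        # on units of that kind in that slot of the II window.
        usage = {kind: [0] * ii for kind in units}
        start = {}
        scheduled_all = True
        for name, (kind, _latency) in ops.items():
            # Earliest start that respects intra-iteration dependences.
            earliest = max(
                (start[p] + ops[p][1] for p, c in deps if c == name),
                default=0,
            )
            # Trying ii consecutive cycles visits every modulo slot once.
            for t in range(earliest, earliest + ii):
                if usage[kind][t % ii] < units[kind]:
                    usage[kind][t % ii] += 1
                    start[name] = t
                    break
            else:
                scheduled_all = False
                break
        if scheduled_all:
            return ii, start
        ii += 1  # no additional units are allocated: relax the II instead
    raise ValueError("no valid schedule found up to max_ii")


# Hypothetical loop body: three ALU operations and one multiplication.
example_ops = {"a": ("alu", 1), "b": ("alu", 1), "c": ("alu", 1), "d": ("mul", 3)}
example_deps = [("a", "c"), ("b", "d")]
ii_two_alus, _ = modulo_schedule(example_ops, example_deps, {"alu": 2, "mul": 1})
ii_four_alus, _ = modulo_schedule(example_ops, example_deps, {"alu": 4, "mul": 1})
```

In this toy loop, two ALUs cannot issue three ALU operations in one slot, so the II has to grow, whereas four ALUs fit all of them at the smallest II. This mirrors the trade-off between scenarios b1 to b3.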

Comparison with fixed resource scenarios

The employed ALU essentially contains one instance of each type of individual FU, with the required additional control logic. Supported operations include all integer arithmetic (except division and multiplication), logic, and comparison operations. To prevent the synthesis tools from optimizing the ALU logic on a per-instance basis (due to constant propagation of the instructions feeding each ALU), a black-box instance was synthesised. This means that each ALU represents a fixed cost of approximately 640 LUTs and no FFs.

In addition to the ALUs, these cases also use a single branch unit (which evaluates all types of exit conditions) and a single integer multiplier, also instantiated as black boxes. For the benchmarks of the Livermore set, a single Floating Point Unit (FPU) is also added, which supports all floating-point arithmetic, comparison and float/integer conversion. The ALU has a latency of 1 clock cycle, while the latency of the FPU varies according to the issued operation, although operations can still be pipelined. Like the ALU, the FPU is constructed from one instance of each type of floating-point unit, and a black box was used to instantiate it. The cost of the FPU is 1460 LUTs and 525 FFs. Note that although the number of units is fixed, the interconnections between them are still specialized by the scheduler based on the dataflow between scheduled operations.

Figure 6.13 shows the speedups for these cases. The speedup in scenario b3 equals that of scenario a1 for all integer cases. In other words, the Megablocks that can be detected for these benchmarks can be executed at the minimum possible II with 8 ALUs, and in most cases 4 ALUs suffice. Conversely, fully customized accelerators perform equivalently to generalized accelerators with 4 or 8 ALUs. A noticeable difference exists only between b1 and b2, where the average IIs are 19.6 and 10.3, respectively. Regardless, for the integer cases, an accelerator with only 2 ALUs still achieves a geometric mean speedup of 2.08×.

However, since the accelerators in these scenarios contain only 1 FPU, the speedup decreases for loops with floating-point operations. For 6 out of the 13 floating-point benchmarks, the speedup decreases to approximately half on average, while for the remaining benchmarks it decreases only marginally. The average II of the former 6 cases increases to 17.2 (for b1, b2 and b3), versus an average II of 7.2 for scenario a1. The highest decrease in speedup occurs for f12, whose accelerated loop contains 16 floating-point operations, scheduled onto 4 floating-point units for the accelerator in scenario a1. With only one FPU, the speedup decreases by approximately four times, from 18.9× to 4.8× (regardless of the number of ALUs). Although the accelerator for f5 in scenario a1 contains 5 floating-point units, the speedup only decreases by half in the remaining scenarios. This is because three loops are accelerated, only one of which uses all 5 floating-point units frequently.
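The roughly fourfold drop for f12 is consistent with a simple resource bound on the II. The sketch below is an illustration under stated assumptions, not a result from the experiments: it assumes every floating-point operation occupies one issue slot of a pipelined unit, that the 16 operations spread evenly over the 4 units of scenario a1, and that the II of this loop is limited only by floating-point resources. The function name is hypothetical.

```python
from math import ceil


def resource_min_ii(num_ops, num_units):
    """Lower bound on the II when num_ops share num_units pipelined units."""
    return ceil(num_ops / num_units)


# f12: 16 floating-point operations in the accelerated loop.
ii_a1 = resource_min_ii(16, 4)     # scenario a1: 4 floating-point units
ii_fixed = resource_min_ii(16, 1)  # scenarios b1 to b3: a single FPU

# Ratio of the two bounds (assumed model) next to the ratio of the
# reported speedups, 18.9x versus 4.8x.
bound_ratio = ii_fixed / ii_a1
reported_ratio = 18.9 / 4.8
print(f"bound ratio {bound_ratio:.2f}, reported ratio {reported_ratio:.2f}")
```

Under these assumptions the bound predicts a slowdown factor close to the reported one, which suggests that the single FPU is the bottleneck for this loop.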

This can be analysed in terms of resources as follows: in order to schedule the benchmark loops with minimum II, an average of 2.3 floating-point FUs are instantiated for the accelerators of scenario a1. This means that, for a fixed-resource accelerator, at least 2 FPUs would be required to prevent increasing the II. A fixed-resource accelerator with 4 ALUs and 2 FPUs would incur a cost of 7707 LUTs and 3697 FFs. In comparison, the accelerators for the floating-point set in scenario a1 require an average of 2829 LUTs and 2863 FFs.
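The arithmetic behind this comparison can be sketched from the per-unit costs quoted earlier (approximately 640 LUTs and no FFs per ALU, 1460 LUTs and 525 FFs per FPU). The text does not break down the remainder of the 7707 LUTs and 3697 FFs. Attributing it to the multiplier, the branch unit, the interconnect and the registers is an assumption made here for illustration, and the function name is hypothetical.

```python
# Per-unit costs quoted in the text (the ALU figure is approximate).
ALU_LUTS, ALU_FFS = 640, 0
FPU_LUTS, FPU_FFS = 1460, 525


def fu_cost(num_alus, num_fpus):
    """LUT and FF cost of the ALUs and FPUs alone."""
    luts = num_alus * ALU_LUTS + num_fpus * FPU_LUTS
    ffs = num_alus * ALU_FFS + num_fpus * FPU_FFS
    return luts, ffs


fu_luts, fu_ffs = fu_cost(4, 2)

# Remainder of the reported total for 4 ALUs and 2 FPUs. Assumption: this
# covers the multiplier, branch unit, interconnect and registers.
other_luts = 7707 - fu_luts
other_ffs = 3697 - fu_ffs

# Average cost of the scenario a1 floating-point accelerators relative to
# the fixed-resource alternative.
lut_ratio = 2829 / 7707
ff_ratio = 2863 / 3697
print(f"ALUs+FPUs: {fu_luts} LUTs, {fu_ffs} FFs")
print(f"a1 vs fixed: {lut_ratio:.2f}x LUTs, {ff_ratio:.2f}x FFs")
```

The ALUs and FPUs alone account for most of the LUTs but only a minority of the FFs, which is in line with the later observation that the number of FFs varies little between scenarios.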

As a summary, Fig. 6.14 shows the average required resources for all scenarios, distinguishing between floating-point and integer sets. The values are normalized to the resource requirements of one MicroBlaze. Considering that the number of slices represents area on the device, the accelerators in scenario a1 are roughly 0.62×, 0.47× and 0.32× the size of those of cases b1, b2 and b3, respectively. The number of required FFs varies little, since the same amount of data needs to be transported between FUs regardless of their type. For the floating-point cases, the reduction in the number of LUTs is higher due to the higher cost of each FPU.

Figure 6.13: Speedups for several types of accelerators vs. a single MicroBlaze processor

In short, a specialized accelerator performs on par with a fixed-resource accelerator. When floating-point support is required, it allows for better exploitation of ILP than deploying a single fully fledged FPU, and it is less costly in terms of resources than employing two FPUs.

The configuration word memory size also varies between scenarios, since the word width depends on each particular instance. For the sake of brevity, consider only that the average memory size is 8.73 kB, 15.88 kB, 12.48 kB, and 17.30 kB for scenarios a1, b1, b2 and b3, respectively. The size of the configuration word memory of the accelerators depends on three factors. The first is the number and type of computational resources. Specialized FUs receive at most one or two configuration bits per schedule step. The ALUs, FPUs and branch unit, on the other hand, require 18, 13 and 8 bits, respectively. Secondly, the connections between units affect the multiplexer complexity and therefore the number of bits in a configuration word required to control them. Lastly, a larger number of loop operations (i.e., processor instructions) to schedule generally means longer schedule lengths.
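The three factors above can be combined into a rough size estimate, sketched below. Only the per-unit bit counts (18, 13 and 8) come from the text. The model itself is an assumption made for illustration: one select field of ceil(log2(n)) bits per multiplexer with n inputs, one configuration word per schedule step, no multiplier field (its width is not given), and a hypothetical multiplexer list and schedule length. It is not meant to reproduce the reported memory sizes.

```python
from math import ceil

# Configuration bits per unit per schedule step, as quoted in the text.
FU_CONFIG_BITS = {"alu": 18, "fpu": 13, "branch": 8}


def config_word_bits(fu_counts, mux_inputs, fu_bits=FU_CONFIG_BITS):
    """Width of one configuration word under the assumed model.

    fu_counts:  dict mapping unit type -> number of instances
    mux_inputs: list with the number of inputs of each multiplexer
                (hypothetical; in practice it depends on the schedule)
    """
    fu_part = sum(fu_bits[kind] * count for kind, count in fu_counts.items())
    # A mux with n > 1 inputs needs ceil(log2(n)) select bits, which for
    # integers equals (n - 1).bit_length().
    mux_part = sum((n - 1).bit_length() for n in mux_inputs if n > 1)
    return fu_part + mux_part


def config_memory_bytes(word_bits, num_words):
    """Memory size assuming one word per schedule step."""
    return ceil(word_bits * num_words / 8)


# Units of a b2-like instance (4 ALUs, 1 FPU, 1 branch unit), with ten
# hypothetical 4-input multiplexers and a hypothetical 40-step schedule.
b2_like = {"alu": 4, "fpu": 1, "branch": 1}
word_bits = config_word_bits(b2_like, [4] * 10)
memory_bytes = config_memory_bytes(word_bits, 40)
```

The sketch makes the trade-off visible: general-purpose units contribute many bits per word, whereas specialized FUs contribute one or two bits each but may increase the number of multiplexer fields.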

Consider then that scenario a1 instantiates more units, but each with fewer control bits, and that having more units allows for shorter schedule lengths and therefore fewer total configuration words (approximately half relative to the remaining three cases). For the fixed-resource cases, neither the number of configuration words nor the word width varies in a predictable fashion with the number of


In document Generation of Custom Run-Time Reconfigurable Hardware for Transparent Binary Acceleration (Page 144-148)