
Customized Multi-Row Accelerators


4.1.3 Execution Model

Execution on the accelerator begins after all configuration information and inputs have been received. Inputs are sent by the MicroBlaze, while configuration values are sent via different mechanisms, depending on the accelerator interface (this is further explained in Section 4.3). A configuration sets all the multiplexer selections, which remain constant throughout execution. That is, each loop that the accelerator is capable of executing corresponds to one global multiplexer context.

After execution begins there is very little control beyond counting the number of clock cycles required to complete an iteration, and writing output values to the output registers at that point.

One iteration is complete after a number of clock cycles equal to the depth of the array. A single row of FUs is active per cycle; that is, execution is not pipelined in this initial design. During the first iteration, the input multiplexer receives N inputs from the input registers. After this, the input multiplexer instead fetches some of these values from the feedback lines. The remaining values stay constant throughout all iterations, and are always fetched from the read-only input registers.
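The timing and input selection just described can be illustrated with a small C model. This is a minimal sketch and not the accelerator's HDL: the names `compute_cycles` and `select_input` and the `uses_feedback` flag are hypothetical, and the cycle count covers only the computation itself, excluding configuration and data-transfer overhead.

```c
#include <assert.h>
#include <stdint.h>

/* Cycles spent computing: one row is active per cycle (no pipelining),
 * so each iteration takes `depth` cycles. Overheads are not modelled. */
uint64_t compute_cycles(uint32_t depth, uint64_t iterations)
{
    assert(depth > 0);
    return (uint64_t)depth * iterations;
}

/* Value seen by one input of the first row.
 * uses_feedback: nonzero if this input is updated through a feedback line
 * (hypothetical flag; in hardware this is fixed by the configuration). */
int32_t select_input(uint64_t iteration, int uses_feedback,
                     int32_t input_reg, int32_t feedback)
{
    if (iteration == 0 || !uses_feedback)
        return input_reg;   /* first iteration, or a loop-invariant input */
    return feedback;        /* later iterations read the fed-back value */
}
```

Under these assumptions, an array of depth 4 running 3 iterations spends `compute_cycles(4, 3)`, i.e. 12 cycles, in computation.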

The array can execute an arbitrary number of iterations. If the number of iterations is determined by a constant in the Megablock trace, this propagates into the accelerator as a constant value operator. In order to terminate execution, the accelerator always has at least one exit condition.

The control module receives single-bit outputs from all exit FUs, and execution ends when any of them is true. As the previous chapter explained, Megablock execution on the accelerator is atomic: an iteration either fully executes or is discarded. Support for non-atomic iterations would require discriminating which exit condition triggered, recovering the correct set of outputs and returning to a particular software address in the middle of the accelerated trace.

For an accelerator which supports multiple CDFGs, data is still propagated through FUs which may not be in use for a particular configuration. For data FUs this is not an issue since these results are simply not routed to the following rows or registered at the outputs. But the configuration must ensure that only the exit FUs relevant to a particular configuration are active. This is done by the 32-bit mask register, which disables or enables each such FU. This limits the number of exits allowed on the accelerator to 32. However, no observed combination of Megablocks in the utilized benchmarks exceeded this value.
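The role of the mask register can be sketched in C as follows. This is an illustrative model under the assumption that bit i of the mask corresponds to exit FU i; the bit assignment and the function names `make_exit_mask` and `execution_ends` are hypothetical and not taken from the implementation.

```c
#include <assert.h>
#include <stdint.h>

/* Builds the mask for one configuration from the indices of the exit FUs
 * that belong to it. The 32-bit register allows at most 32 exit FUs. */
uint32_t make_exit_mask(const unsigned *exit_ids, unsigned count)
{
    uint32_t mask = 0;
    for (unsigned i = 0; i < count; i++) {
        assert(exit_ids[i] < 32);          /* one mask bit per exit FU */
        mask |= (uint32_t)1 << exit_ids[i];
    }
    return mask;
}

/* exit_bits: bit i is the single-bit output of exit FU i.
 * Exit FUs disabled by the mask cannot terminate execution. */
int execution_ends(uint32_t exit_bits, uint32_t exit_mask)
{
    return (exit_bits & exit_mask) != 0;
}
```

With a mask enabling exit FUs 0 and 3, a true output from exit FU 2 (which belongs to another configuration) is ignored, while a true output from exit FU 3 ends execution.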

Once execution completes, the results are read from the output registers by the CR and the control module resets the inter-row registers. The accelerator can then be invoked again. If the loop to accelerate is the same as the last one, the configuration process is skipped to reduce overhead.

4.2 Architecture Specific Tool Flow

Figure 4.3 shows the tool flow for generation of the custom accelerators, support architecture and CRs. As the previous chapter introduced, the accelerator generation flow is supported by CDFGs which are produced by the Megablock extractor tool. The tools presented here receive as inputs the Megablocks that are manually selected as acceleration candidates. The outputs of the tools are given to vendor synthesis tools and compilers.

Figure 4.3: Architecture-specific flow for 2D accelerator design and supporting hardware

This version of the translation step processes each CDFG file individually, and produces a binary file with the accelerator specification. For each subsequent CDFG to translate, this file is read and the existing structure is updated. The final run outputs a Verilog include file with parameters for the accelerator HDL template, and the routing register values for each configuration.

The overall execution flow of this tool is as follows. The CDFG file is parsed and, according to its depth and maximum width (i.e., ILP), a number of data structures are pre-allocated. The CDFG nodes are then translated into FUs by assigning them positions on a two-dimensional grid.

Since unlimited connectivity is assumed, placement is unrestricted. In this implementation nodes are placed in the earliest possible row, and rows are filled from left to right.
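The placement policy above (earliest possible row, rows filled from left to right) can be sketched in C. This is a minimal sketch assuming the nodes are supplied in topological order and that nodes fed only by input registers go in row 0; the `Node` structure, the size limits and `place_nodes` are hypothetical names introduced for illustration, not the tool's actual data structures.

```c
#include <assert.h>

#define MAX_NODES 64
#define MAX_PREDS 3

typedef struct {
    int n_preds;
    int preds[MAX_PREDS];  /* indices of producer nodes (each < own index) */
    int row, col;          /* grid position, filled in by place_nodes() */
} Node;

/* Places nodes given in topological order and returns the array depth. */
int place_nodes(Node *nodes, int n)
{
    int next_col[MAX_NODES] = {0};  /* next free column in each row */
    int depth = 0;
    assert(n <= MAX_NODES);
    for (int i = 0; i < n; i++) {
        int row = 0;  /* no producers: fed directly by input registers */
        for (int p = 0; p < nodes[i].n_preds; p++) {
            int pred = nodes[i].preds[p];
            assert(pred >= 0 && pred < i);
            if (nodes[pred].row + 1 > row)
                row = nodes[pred].row + 1;  /* one row below latest producer */
        }
        nodes[i].row = row;
        nodes[i].col = next_col[row]++;     /* fill the row left to right */
        if (row + 1 > depth)
            depth = row + 1;
    }
    return depth;
}
```

Because connectivity is assumed to be unlimited, the column index carries no routing constraint in this sketch; it only records the order in which nodes were assigned to a row.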

Each node results in one FU, as in this architecture re-utilization of FUs only applies across different configurations, i.e., CDFGs. In other words, re-utilization of FUs across configurations is possible if two nodes of the same type in two different CDFGs occur in the same topological level. A single type of FU may support different types of CDFG operations. The most common is the implementation of the MicroBlaze add and addi (addition with an immediate constant value) via the same FU, since the distinction only exists at the level of the MicroBlaze ISA.
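The sharing rule can be made concrete with a short C estimate: if FUs are shared only between nodes of the same type in the same row of different configurations, the number of FUs needed for each row and type is the maximum count over all configurations. This is a sketch under that assumption; `fus_required`, `MAX_ROWS` and `N_TYPES` are hypothetical, and the mapping from CDFG operations to FU types (e.g., add and addi onto one type) is assumed to have been applied already.

```c
#include <assert.h>

#define MAX_ROWS 8
#define N_TYPES  4   /* hypothetical number of FU types */

/* count[c][r][t]: number of nodes of FU type t in row r of CDFG c.
 * Returns the total FUs needed when each FU is shared across CDFGs. */
int fus_required(int n_configs, int n_rows, int count[][MAX_ROWS][N_TYPES])
{
    int total = 0;
    assert(n_rows <= MAX_ROWS);
    for (int r = 0; r < n_rows; r++) {
        for (int t = 0; t < N_TYPES; t++) {
            int need = 0;  /* largest demand for this row and type */
            for (int c = 0; c < n_configs; c++)
                if (count[c][r][t] > need)
                    need = count[c][r][t];
            total += need;
        }
    }
    return total;
}
```

Within a single configuration no sharing takes place, so for one CDFG the function simply returns its node count.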

After the placement of CDFG nodes, the auxiliary passthrough FUs are placed; if a connection is required between FUs on non-adjacent rows, then passthroughs are added to all in-between rows. The tool performs passthrough re-utilization at this point; for instance, if two FUs require the same output from a FU several rows above, only one chain of passthroughs is introduced.

This process is implemented by checking the array from bottom to top: as passthroughs are added to row N they are themselves checked for the need of another passthrough when row N − 1 is processed. Due to the nature of the CDFGs, passthroughs tend to be created in an inverted pyramid shape. This leads to a frequent re-utilization of passthroughs between configurations.
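The saving obtained from sharing a passthrough chain can be illustrated with a simplified count in C. This sketch does not reproduce the tool's bottom-to-top pass; it only compares, for one producer FU, the number of passthroughs needed with and without re-utilization. The function names are hypothetical.

```c
#include <assert.h>

/* Passthroughs needed to deliver one FU output, produced in row
 * `producer_row`, to consumers in the given rows when all consumers
 * share a single chain. Consumers in the adjacent row need none. */
int shared_passthroughs(int producer_row, const int *consumer_rows, int n)
{
    int farthest = producer_row + 1;
    for (int i = 0; i < n; i++) {
        assert(consumer_rows[i] > producer_row);
        if (consumer_rows[i] > farthest)
            farthest = consumer_rows[i];
    }
    return farthest - producer_row - 1;  /* one per in-between row */
}

/* Same connections without re-utilization: one chain per consumer. */
int unshared_passthroughs(int producer_row, const int *consumer_rows, int n)
{
    int total = 0;
    for (int i = 0; i < n; i++) {
        assert(consumer_rows[i] > producer_row);
        total += consumer_rows[i] - producer_row - 1;
    }
    return total;
}
```

For a producer in row 0 with consumers in rows 3 and 4, the shared chain needs 3 passthroughs (rows 1 to 3), whereas separate chains would need 2 + 3 = 5.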

At this point, the position of every CDFG node is known, as well as their connections and the bit-widths of each row’s crossbar, which follow from the number of inputs and outputs of neighbouring rows. With this, the number of required 32-bit configuration registers is calculated so that enough bits are available to control all multiplexers.
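The register count can be estimated as in the following C sketch. It assumes that each FU input multiplexer needs ceil(log2(k)) select bits to choose among the k values available from the row above, and that select fields are packed back to back into 32-bit registers; the actual encoding used by the tool is not specified here, so both points, as well as the function names, are assumptions.

```c
#include <assert.h>

/* Select bits for a multiplexer choosing among n_sources values:
 * ceil(log2(n_sources)), i.e. 0 bits when there is a single source. */
unsigned select_bits(unsigned n_sources)
{
    unsigned bits = 0;
    assert(n_sources > 0 && n_sources <= (1u << 31));
    while ((1u << bits) < n_sources)
        bits++;
    return bits;
}

/* sources[r]:   values each multiplexer of row r can choose from.
 * mux_count[r]: number of FU input multiplexers in row r. */
unsigned routing_registers(const unsigned *sources,
                           const unsigned *mux_count, unsigned n_rows)
{
    unsigned total_bits = 0;
    for (unsigned r = 0; r < n_rows; r++)
        total_bits += mux_count[r] * select_bits(sources[r]);
    return (total_bits + 31) / 32;  /* round up to whole 32-bit registers */
}
```

For example, three rows with 6, 5 and 3 multiplexers choosing among 4, 6 and 6 values need 6·2 + 5·3 + 3·3 = 36 select bits, which under these assumptions occupy two registers.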

The output Verilog file produced specifies only the coordinates of each FU to instantiate and interface-level aspects such as the number of input, output and routing registers. The generate-based Verilog template instantiates as many FU input multiplexers as required, fetching the appropriate control bits from the routing registers. An auxiliary file is also produced and given to the CR generation tool. The number of accelerator interface registers needs to be known in order to compute each register’s address for the PLB CR.

Listing 4.1: Reconfiguration information placed in C containers

    #include "graphroutes.h"

    // Routing registers for megablock 0
    int graph0routeregs[NUMROUTEREGS +

    input [clog2b(N_REGS) - 1 : 0] addr;
    output reg [31 : 0] dataout;
    dataout <= (rst) ? 0 : cfgmem[addr];
    endmodule

As Section 4.3 will show, this accelerator architecture was integrated into three different system architectures. Based on which system module reconfigures the accelerator, the routing information may be output in several formats. Listing 4.1 shows how this information is produced in order for it to be used by the auxiliary MicroBlaze. The per-configuration routing register values are held in C structures which are written via bus to the accelerator. It is also possible to have the CR generation tool include this process into each CR itself (not shown). Finally, Listing 4.2 shows a read-only memory module used to include this information into the injector.

Two additional steps are required when translating multiple CDFGs. Firstly, before the placement of new nodes, the existing accelerator’s depth (stored in the binary file) is compared to the depth of the new CDFG; if the former value is lower than the latter, the existing configurations have to be updated by inserting passthrough chains which transport data from the previous maximum depth to the new one. This is due to an architectural limitation of this accelerator design, which only allows for feedback of values from the very last row of the array. Secondly, previously generated routing register values need to be re-generated if the width of any row changes (as the number of inputs and outputs varies), or if new passthroughs are inserted as explained, since routing is necessary through the new rows.
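The first of these steps can be sketched in C as a simple count of the passthroughs that an existing configuration needs when the array becomes deeper. This is a minimal sketch: `padding_passthroughs` and its `live_values` parameter (the number of values that leave the old last row and must reach the new one) are hypothetical names, and re-utilization with passthroughs of other configurations is not modelled.

```c
#include <assert.h>

/* Passthrough FUs to append to one existing configuration so that the
 * values leaving its old last row reach the new last row, from which
 * feedback and outputs are taken. */
int padding_passthroughs(int old_depth, int new_depth, int live_values)
{
    assert(old_depth > 0 && new_depth > 0 && live_values >= 0);
    if (new_depth <= old_depth)
        return 0;                       /* array is already deep enough */
    return (new_depth - old_depth) * live_values;
}
```

Growing the array from depth 3 to depth 5 with 4 live values would thus add 8 passthroughs to that configuration, after which its routing register values have to be re-generated.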

The CR generation process explained in Section 3.3 applies in this implementation: either PLB or FSL based CRs are generated along with the injector address table. The only additional purpose of this tool is to generate the read-only memory modules for injector based reconfiguration of the accelerator (as per Section 3.4) or to include this information into the CRs.

(a) System1 - External code/data system memory and PLB based accelerator

(b) System2 - Local code/data system memory and PLB based accelerator

(c) System3 - Local code/data system memory and FSL based accelerator

Figure 4.4: System level variants used for evaluation, with minor accelerator and auxiliary hardware differences

These tools complete execution in the order of seconds, and the runtime scales in proportion to the number of CDFGs to translate. Most of the runtime is due to file handling. For this implementation, CDFGs with memory access operations or floating-point operations are not supported, and the lack of more sophisticated node scheduling leaves resource re-utilization under-exploited. In later design iterations, the translation tool is extensively overhauled to address these limitations and to support new accelerator architectures.