Customized Multi-Row Accelerators
4.1 Accelerator Architecture
Like all accelerator architectures presented throughout this work, this first architecture relies on synthesis-time parameters, generated by the developed translation tools, which customize an architecture template. This customization varies the number, layout and type of FUs and their interconnections. Aspects such as interfaces and control are equal for all instantiations, whilst the array of FUs is tailored for a set of Megablock CDFGs. At runtime, the migration and accelerator reconfiguration mechanisms use the accelerator to execute the target CDFGs. This section presents the architecture template and overall execution model.
Figure 4.1: Synthetic example of 2D accelerator instance
4.1.1 Structure
Figure 4.1 shows a synthetic example of the first accelerator architecture, with interface details omitted. This design is essentially composed of rows of FUs of several types, which propagate data downwards via full crossbar interconnects. Each row may contain any number and type of units. The depth of the array, i.e., its number of rows, is also unbounded. Both aspects vary with the CDFGs used to generate the accelerator instance. Data is exchanged only between neighbouring rows. To transport data between distant rows, passthrough units are used. All FUs register their outputs, meaning data is propagated row-to-row synchronously, as a group. Data feedback, which provides operands to the following iterations, is only performed at the last row of the array. All accumulated data is aggregated and routed backwards and, together with the input registers, constitutes the data available at the start of an iteration.
Typically, the array contains more passthroughs in the bottom rows, taking on a triangular shape. This is due to redirecting all of the produced data back into the first row for the following iteration. The number of passthroughs typically increases with each row as data accumulates.
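The row-to-row propagation described above can be sketched in software. The following is only an illustrative model, not the accelerator's actual implementation: the FU names in FU_OPS, the function run_rows and the tuple-based row description are all hypothetical. Each FU selects its operands from the outputs of the preceding row (the role of the crossbar), and a passthrough unit carries a value across a row that does not otherwise use it.

```python
import operator

MASK32 = 0xFFFFFFFF  # FUs operate on 32-bit integers

# Hypothetical FU table: name -> (number of inputs, operation).
FU_OPS = {
    "add": (2, operator.add),
    "sub": (2, operator.sub),
    "xor": (2, operator.xor),
    "pass": (1, lambda a: a),  # passthrough unit
}


def run_rows(inputs, rows):
    """Propagate values through the array, one row per clock cycle.

    rows is a list of rows. Each row is a list of (fu_name, selectors),
    where selectors are indices into the outputs of the preceding row.
    """
    prev = list(inputs)
    for row in rows:
        nxt = []
        for fu_name, selectors in row:
            arity, fn = FU_OPS[fu_name]
            assert len(selectors) == arity
            operands = [prev[s] for s in selectors]
            nxt.append(fn(*operands) & MASK32)
        prev = nxt  # all outputs of a row are registered together
    return prev


# Two rows: (a + b) in row 1, then c - (a + b) in row 2.
# c is not used in row 1, so a passthrough carries it to row 2.
example_rows = [
    [("add", (0, 1)), ("pass", (2,))],
    [("sub", (1, 0))],
]
result = run_rows([5, 3, 10], example_rows)  # [2]
```

In this model the selectors play the part of the runtime multiplexer configuration, while the list of rows corresponds to the structure fixed at synthesis time.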
The FUs are single-operation and single-cycle. Each type of FU corresponds to one CDFG node type (e.g., a single MicroBlaze instruction). This accelerator implementation relies on a set of FUs implementing all 32-bit integer arithmetic (save for division) and comparison operations.
The supported arithmetic includes operations involving carry, since the value of the GPP’s carry bit can be fed into, and retrieved from, the accelerator. Also supported are pattern comparison and exit (i.e., branch) operations. This last class of FUs implements the equivalent of the branch instructions on the Megablocks, and is used to signal the end of execution. Unsupported operations in this first design include all floating-point arithmetic and memory accesses.
The accelerator template supports FUs with any number of inputs or outputs (e.g., the 3-input adder with carry). Each input is fed by a multiplexer (part of the crossbars in Fig. 4.1) which receives all outputs of the preceding row. A possible multiplexer configuration is shown for Crossbar 2 and Crossbar 3. Some trace instructions receive constant input operands, and the Megablock extraction tools also perform constant propagation. As a result, some FU input multiplexers are removed (e.g., the bra FU in Fig. 4.1). The multiplexers within the array are runtime controllable via writeable configuration registers. The configuration values to write to the registers, per supported CDFG, are also computed by the offline translation tools.
The register file values received from the MicroBlaze when the accelerator is invoked are represented at the top as input registers, which remain read-only throughout execution. Likewise, the values to be fed back are stored in a final set of output registers. The number of output register values that are fed back is equal to 2M: one set produced in the current iteration, and another produced in the previous one. Values in the output register set can also be re-assigned amongst themselves. This emulates the behaviour of re-assigning values between registers in the GPP’s register file.
The Iteration control module is responsible for controlling the input multiplexer. After the first iteration, the respective enable bit is set so that new values can be fetched from the feedback wires according to the input switching register, instead of the input registers. By counting clock cycles the control module determines when an iteration is completed and controls the write-enable of the output registers. Finally, it sets status bits when execution is over.
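The behaviour of the input multiplexer across iterations can be approximated by the sketch below. It is a simplified model under stated assumptions, not the control logic itself: run_accelerator, feedback_sel and count_up are hypothetical names, cycle counting is abstracted into one function call per iteration, and only the current-iteration outputs are fed back (the previous-iteration set is omitted).

```python
def run_accelerator(input_regs, feedback_sel, iteration, max_iterations=1000):
    """Model the iteration control and input multiplexer.

    feedback_sel[i] is None if array input i always reads input
    register i, or the index of the output register fed back to it
    after the first iteration (the role of the input switching register).
    iteration maps the array inputs to (outputs, exit_taken).
    """
    outputs = None
    for count in range(1, max_iterations + 1):
        if count == 1:
            # First iteration: all operands come from the input registers.
            array_in = list(input_regs)
        else:
            # Later iterations: selected inputs switch to the feedback wires.
            array_in = [
                input_regs[i] if sel is None else outputs[sel]
                for i, sel in enumerate(feedback_sel)
            ]
        outputs, exit_taken = iteration(array_in)
        if exit_taken:  # an exit FU signalled the end of execution
            return outputs, count
    return outputs, max_iterations


def count_up(values):
    """Hypothetical loop body: increment i until it reaches n."""
    i, n = values
    nxt = i + 1
    return [nxt], nxt >= n


final_outputs, iterations = run_accelerator([0, 4], [0, None], count_up)
# final_outputs == [4], iterations == 4
```

Here the loop bound n stays in a read-only input register for the whole execution, whereas the counter i is taken from the feedback path from the second iteration onwards.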
4.1.2 Interface
Two types of accelerator interfaces are explored. Figure 4.2 shows the interface for the version based on a Processor Local Bus (PLB). The Fast Simplex Link (FSL) version contains the same internal registers, but the interface is a low-overhead point-to-point connection. Having two interface types allows the impact of communication overhead on performance to be observed.
The interface-level registers of the accelerator include an instance-dependent number of input, routing and output registers (N, M and L). The input and output registers contain data inputs/outputs and the routing registers control the inter-row connectivity. The input multiplexer and output multiplexer are each controlled by a separate register. The masks register controls which exit FUs in the array are enabled (for reasons explained below) and the status register indicates if the accelerator is busy and how execution terminated. The start register is used to begin execution and the two context registers are used as scratch-pad memory by the MicroBlaze while it executes the CR.
Figure 4.2: PLB type interface for the accelerator
The number of input and output registers depends on the Megablock traces. Given a set of CDFGs used to generate an accelerator, N and L will be equal to the maximum number of inputs and outputs throughout all graphs, respectively. The number of routing registers depends on the array itself. Each row requires a different number of routing bits, as a function of the number of outputs of the previous row. Per row, each output is given a numeric identifier. Each FU input multiplexer of the following row uses a binary number to select any of those values. For simplicity of implementation, the bit-width of these binary selection values is the same for all multiplexers, and is determined by the maximum number of outputs throughout all rows. In the case of Fig. 4.1, this would be the first row, with 5 outputs, which leads to 3 bits. Thus, to drive all 12 FU inputs a total of 36 bits are needed, i.e., two 32-bit registers. Additional bits are needed for the last multiplexer, which needs to drive M registers. Consequently, the overhead of invoking the accelerator scales with its width and depth, due to the configuration values that need to be written to it. The generate constructs that the accelerator relies on connect the specific bits of the registers to the multiplexers, so a single register may hold configurations relative to several rows.
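The sizing rule for the routing registers can be checked with a short calculation. The sketch below is an illustration only: selector_bits, routing_registers and pack_selectors are hypothetical helpers, and the packing order (first selector in the least significant bits, selectors allowed to straddle register boundaries) is an assumption, since the actual bit assignment is fixed by the generated hardware description.

```python
def selector_bits(max_row_outputs):
    """Bits needed to select one of max_row_outputs values."""
    return max(1, (max_row_outputs - 1).bit_length())


def routing_registers(max_row_outputs, fu_inputs, reg_width=32):
    """Return (bits per selector, total routing bits, registers needed)."""
    bits = selector_bits(max_row_outputs)
    total = bits * fu_inputs
    return bits, total, (total + reg_width - 1) // reg_width


def pack_selectors(selectors, bits, reg_width=32):
    """Pack equal-width selectors into a list of register values.

    Assumed layout: selector 0 occupies the least significant bits.
    """
    word = 0
    for position, value in enumerate(selectors):
        if not 0 <= value < (1 << bits):
            raise ValueError("selector does not fit in the chosen bit-width")
        word |= value << (position * bits)
    count = max(1, (len(selectors) * bits + reg_width - 1) // reg_width)
    mask = (1 << reg_width) - 1
    return [(word >> (k * reg_width)) & mask for k in range(count)]


# Values quoted for Fig. 4.1: widest row has 5 outputs, 12 FU inputs.
bits, total, registers = routing_registers(5, 12)
# bits == 3 and registers == 2
```

Using a single selector width for every multiplexer wastes some bits on narrow rows, but it keeps the mapping from register bits to multiplexers uniform.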
For simplicity, the input and output multiplexers are each controlled by a single 32-bit register for every instance. This imposes a limitation on the joint number of inputs and outputs. For instance, consider that M = 1 (as per Fig. 4.1). For the input multiplexer, the number of possible choices, per output it drives, is 3: the current M = 1 output result, the same output from the previous iteration, and one starting value originating from the N input registers. This selection range requires 2 bits and, since a 32-bit register holds sixteen 2-bit selections, the total number of inputs supported is N = 16.
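The limit imposed by the single 32-bit control register follows from the same kind of arithmetic. A minimal sketch, assuming the number of choices per driven output is given directly (the helper name max_driven_outputs is hypothetical):

```python
def max_driven_outputs(choices, reg_width=32):
    """Return (bits per selection, selections that fit in one register)."""
    bits = max(1, (choices - 1).bit_length())
    return bits, reg_width // bits


# Three choices per driven output, as in the M = 1 example.
bits_per_selection, supported = max_driven_outputs(3)
# bits_per_selection == 2, supported == 16
```

A larger number of choices per driven output widens each selection and therefore lowers the number of inputs that one register can control.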
The start register is written while executing the CR, after operands are sent. For the FSL interface version, this is replaced with a signal sent by the injector, which is capable of determining when the MicroBlaze has sent all operands by detecting an FSL get instruction on the bus, i.e., the point at which the MicroBlaze stalls waiting for accelerator results. The status register indicates if more than one iteration was performed on the accelerator. If this is not the case, then the context registers are used to restore the values of the MicroBlaze registers which were used during CR execution.