

5.1.1 Stream Interface

The data wires carry the payload, while the remaining signals are for control. All signals are associated with a stream; the current stream is announced through the stream_id-signal. end marks the end of the stream, while flush causes a pipeline flush of the current stream, aborting processing of the stream and clearing the pipeline immediately. The stall-signal is a bit vector with one bit for each individual stream, controlling which streams can be pushed into the pipeline. Inside a stream, chunk_id helps to identify the chunks. A chunk ID may be reused inside a tuple, which reduces the number of chunk IDs needed, and no explicit order is enforced between chunk IDs. Reuse is suitable for all chunks that require the same processing; a very common case is variable-length strings.
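The per-stream gating by the stall-signal can be sketched in Python. This is only an illustration: the bit ordering (bit i belongs to stream i), the active-high polarity, and the helper name `can_push` are assumptions, not part of the interface description.

```python
def can_push(stall_vector: int, stream_id: int) -> bool:
    """Return True if the stall bit of the given stream is clear.

    Assumption: bit i of stall_vector belongs to stream i, and a set bit
    means that the stream is currently stalled.
    """
    return (stall_vector >> stream_id) & 1 == 0


# Streams 0 and 2 are stalled (0b0101); streams 1 and 3 may still be pushed.
stall = 0b0101
pushable = [s for s in range(4) if can_push(stall, s)]
```

With these assumptions, only the streams whose bit is cleared are eligible for dispatch, while all other streams are held back independently of each other.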

The transferType-signal identifies what the data wires are used for. The data can either be invalid (transferType no data), data to be processed by the module, setup data for a module, or a memory request. Chunks with transferType no data may be overwritten unless end or flush is active (driven high). This makes it possible to generate a flush or to signal the end of a stream without any data. The latter is required to allow empty streams. In the following, a chunk that may be overwritten (transferType no data and both end and flush low) is referred to as a free chunk. To avoid confusion, the term configuration is reserved for the configuration of the FPGA, and the runtime-adjustable parameters of modules are called setup [data]. In case of a memory transfer or setup data, stream_id and chunk_id are interpreted differently: they are concatenated and the lower bits carry a module_id, as shown in Figure 5.2. The remaining bits in this alternate addressing mode can be used by the user, and the interface specification mandates at least one user bit. For instance, the user bit can be used to differentiate between register addresses and payloads. This is of particular interest for modules that need the entire data vector to load setup data and thus have to send register addresses ahead of the data. This is described in detail in the Usage Notes on page 91.
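The free-chunk rule and the alternate addressing mode can be illustrated with a short Python sketch. The numeric transferType encodings, the concatenation order (stream_id in the upper bits), the field widths in the example, and all names are assumptions chosen for illustration; the text does not fix them.

```python
from dataclasses import dataclass
from enum import Enum


class TransferType(Enum):
    # The four cases follow the text; the numeric values are placeholders.
    NO_DATA = 0
    DATA = 1
    SETUP = 2
    MEMORY = 3


@dataclass
class Control:
    transfer_type: TransferType
    end: bool = False
    flush: bool = False


def is_free_chunk(ctrl: Control) -> bool:
    """A chunk is free if it carries no data and end and flush are both low."""
    return (ctrl.transfer_type is TransferType.NO_DATA
            and not ctrl.end and not ctrl.flush)


def split_alternate_address(stream_id, chunk_id, chunk_bits, module_bits):
    """Concatenate stream_id and chunk_id, then split off module_id.

    Assumption: stream_id occupies the upper bits of the concatenation.
    The lower module_bits carry module_id; the rest are user bits.
    """
    combined = (stream_id << chunk_bits) | chunk_id
    module_id = combined & ((1 << module_bits) - 1)
    user_bits = combined >> module_bits
    return module_id, user_bits


# Hypothetical widths: 4-bit stream_id, 5-bit chunk_id, 8-bit module_id.
module_id, user = split_alternate_address(0b1010, 0b00011, 5, 8)
```

In this example the nine concatenated bits leave exactly one user bit above the 8-bit module_id, which matches the minimum of one user bit that the interface specification mandates.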

To relax timing, data is transferred one cycle after the control data. This gives an additional cycle for decoding chunk_id and stream_id before the data arrives. Figure 5.3 shows a module-initiated read transaction. The blue region shows the stream interface, with streaming data arriving in the first half and the response to a module-triggered read request arriving in the second half. Note the one-clock-cycle offset between control signals and payload on the stream interface.

[Bit-field diagram with bit positions 8, 4 and 0 marked: the chunk_id and stream_id fields and, below them, the alternate layout with module_id and the user bit U.]

Figure 5.2: Module_ID mapping onto chunk_id and stream_id wires. “U” stands for the user bit.

[Timing diagram with the signals clk, clk_2x, transferType, end, flush, stream_id, chunk_id, data, valid, write/read, address, module_id and bytes.]

Figure 5.3: Example of dispatching a data read during streaming operation, and subsequent arrival of requested data. While end and flush are drawn as don’t care (X), they should be driven low whenever they are unused. The stream interface has a blue background and the memory interface has a red background.
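The one-cycle offset between control signals and payload on the stream interface can be modelled in a few lines of Python. This is a behavioural sketch only; the per-cycle list representation and the function name are assumptions made for illustration.

```python
def pair_control_with_data(control_by_cycle, data_by_cycle):
    """Pair each control word with the data word presented one cycle later."""
    return [(ctrl, data_by_cycle[t + 1])
            for t, ctrl in enumerate(control_by_cycle)
            if t + 1 < len(data_by_cycle)]


# The control word of chunk A is seen in cycle 0, its payload in cycle 1.
control = ["ctrl_A", "ctrl_B", "ctrl_C"]
data = [None, "data_A", "data_B", "data_C"]
pairs = pair_control_with_data(control, data)
```

A receiving module can therefore decode the control word of a chunk in one cycle and consume the matching payload in the next.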

The width of the data vector was chosen based on the width of the bus for Peripheral Component Interconnect Express (PCIe). Furthermore, due to the choice of interface wires in Section 5.2.1, the number of wires available for the interface is limited to 400, unless modules span more clock regions. As the maximal frequency of some modules, e.g. sorting of strings, depends directly on the bus width, the interface width was not scaled up.

stream_id and chunk_id are scaled based on the number of modules to be supported, as the sum of their bits must be smaller than or equal to the number of bits allocated for the module_id signal. stream_id is kept a bit smaller, as its width exponentially increases the number of required wires, due to the associated stall signals (one per stream). With the chosen parameters, the meta signals associated with the data vector encompass 29 bits, just shy of the 32 bits that can fit in a BRAM. This is important, as it allows the metadata to be buffered in a single block RAM, which eases buffering a stream as required by stalls, as discussed in the next section.
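The exponential growth of the stall wires with the stream_id width can be made concrete with a small Python calculation. It assumes only what the text states, namely one stall signal per stream; the function name and the range of widths are illustrative.

```python
def stall_wires(stream_id_bits: int) -> int:
    """Number of stall signals when every addressable stream has its own bit."""
    return 2 ** stream_id_bits


# Each additional stream_id bit doubles the number of stall wires.
growth = {bits: stall_wires(bits) for bits in range(2, 7)}
```

Going from a 4-bit to a 6-bit stream_id, for example, raises the number of stall wires from 16 to 64, whereas widening chunk_id by the same two bits adds only two wires.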

Stalling

Some modules might stall an input stream because they need longer for processing or because they output additional data and hence require free chunks to insert a result into the stream. To avoid pipeline stalls, a module may only stall streams it operates on, and only user data can be stalled. This implies that a stalling module has to pass all other streams through. A second implication is that the module stalling a stream has to buffer all chunks that are in flight, in addition to one chunk for each cycle it takes the stall-signal to propagate through the pipeline.

The required buffer size depends on the position of a module in the pipeline and on the modules preceding it, as the number of tokens in flight depends on the cumulative pipeline length of the preceding modules. Moreover, the stall signal has to propagate from the stalling module to the static system. During this time, additional tokens of the stalled stream might be dispatched. Thus, the total number of tokens that can potentially arrive after generating a stall is calculated as the sum of the preceding modules’ pipeline depths $P_{\text{module}}$ and the stall-signal delay $\Delta t_{\text{stall}}$ from the module to the static system, as shown in Equation (5.1).

$$C_{\text{inFlight}} = \Delta t_{\text{stall}} + \sum_{\text{modules}} P_{\text{module}} \qquad (5.1)$$
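As a worked example of Equation (5.1), the following Python sketch adds the stall-signal delay to the pipeline depths of the preceding modules. The function name and the numbers are hypothetical and serve only to illustrate the arithmetic.

```python
def chunks_in_flight(stall_delay_cycles, preceding_pipeline_depths):
    """Equation (5.1): stall-signal delay plus the sum of preceding pipeline depths."""
    return stall_delay_cycles + sum(preceding_pipeline_depths)


# Hypothetical pipeline: three preceding modules with depths 4, 2 and 6,
# and a stall signal that needs 3 cycles to reach the static system.
required = chunks_in_flight(stall_delay_cycles=3,
                            preceding_pipeline_depths=[4, 2, 6])
```

For these values the stalling module has to buffer up to 3 + 4 + 2 + 6 = 15 chunks of the stalled stream.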

90 CHAPTER 5. PIPELINE FOR DYNAMIC STREAM PROCESSING

Rather than applying Equation (5.1) blindly between a stalling module and the static system, it is sufficient to apply it between the stall buffer and the previous stall buffer that might have been inserted. Furthermore, streams can also be generated inside the pipeline. To handle generated streams, Equation (5.1) is applied with the generating module considered as the starting point rather than the static system.
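The segment-wise application of Equation (5.1) can be sketched as follows. This is a minimal Python model under stated assumptions: the `Module` record, the `has_stall_buffer` flag, and the choice to exclude the depth of the module that holds the previous stall buffer are illustrative, and the stall delay to that previous buffer is supplied by the caller.

```python
from dataclasses import dataclass


@dataclass
class Module:
    name: str
    pipeline_depth: int
    has_stall_buffer: bool = False  # hypothetical flag for this sketch


def buffer_requirement(pipeline, stalling_index, stall_delay_cycles):
    """Apply Equation (5.1) back to the nearest upstream stall buffer.

    Walks upstream from the stalling module and sums pipeline depths until
    a module with a stall buffer (or the start of the pipeline) is reached.
    Assumption: the depth of the module holding that buffer is not counted.
    """
    depth_sum = 0
    for module in reversed(pipeline[:stalling_index]):
        if module.has_stall_buffer:
            break
        depth_sum += module.pipeline_depth
    return stall_delay_cycles + depth_sum


pipeline = [
    Module("A", 4),
    Module("B", 2, has_stall_buffer=True),
    Module("C", 6),
    Module("D", 3),  # the stalling module
]
segment_size = buffer_requirement(pipeline, stalling_index=3, stall_delay_cycles=2)
```

Here only module C lies between the stall buffer in B and the stalling module D, so the requirement is 2 + 6 = 8 chunks instead of 2 + 4 + 2 + 6 = 14 for the full pipeline. A stream generated inside the pipeline could presumably be handled in the same way by treating the generating module as the boundary.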