Computational Model - The Morpheus Platform

2 The Morpheus Platform

2.2 Computational Model

The computational model of Morpheus is based on the Molen paradigm [14].

The whole architecture is considered as a single virtual processor, in which the reconfigurable accelerators are functional units providing a virtually infinite instruction set. Tasks (i.e., application kernels) running on the reconfigurable units or on the ARM itself should be seen as instructions of the virtual processor. The configuration bitstream of a reconfigurable engine represents the microcode of the virtual instruction, with the added value of being statically or dynamically reprogrammable. According to this paradigm, as the granularity of operators increases from ALU-like instructions to tasks running on reconfigurable engines, the granularity of the operands is forced to increase accordingly. Operands can no longer be scalar C-type data; they become structured data chunks, referenced through their addressing pattern, be it simple (a share of the addressing space) or complex (vectorized and/or circular addressing based on multi-dimensional step/stride/mask parameters).
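To make the idea of an operand "referenced through its addressing pattern" more concrete, the following is a minimal sketch in C of a two-dimensional step/stride/mask address generator. The `addr_pattern` structure, its field names and the `pattern_address` function are hypothetical and introduced only for illustration; they are not taken from the Morpheus tool-chain, and the mask is assumed to describe a power-of-two circular window.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical descriptor of a 2-D operand access pattern.
   All names are illustrative assumptions, not Morpheus definitions. */
typedef struct {
    size_t base;     /* address of the first element of the chunk        */
    size_t step;     /* distance between consecutive elements in a row   */
    size_t stride;   /* distance between the starts of consecutive rows  */
    size_t row_len;  /* number of elements per row                       */
    size_t mask;     /* circular window size minus one (power of two),
                        or 0 for plain linear addressing                 */
} addr_pattern;

/* Return the address of the i-th element of the operand. */
size_t pattern_address(const addr_pattern *p, size_t i)
{
    assert(p != NULL && p->row_len > 0);

    size_t row = i / p->row_len;
    size_t col = i % p->row_len;
    size_t offset = row * p->stride + col * p->step;

    if (p->mask != 0)
        offset &= p->mask;   /* wrap around inside the circular window */

    return p->base + offset;
}
```

A "simple" operand in the sense used above would correspond to `step = 1`, `stride = row_len` and `mask = 0`, i.e. a contiguous share of the addressing space.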

Operands can also be of unknown or virtually infinite length, thus introducing the concept of stream-based computation. From the architectural point of view, the Morpheus handling of operands can be described at two levels:

 Macro-operands are the granularity handled by extension instructions, controlled by the end user through the ARM program written in C. Macro-operands can be data streams, image frames, network packets or other types of data chunks whose nature and size depend largely on the application.

 Micro-operands are the native types used in the description of the extension instruction, and tend to comply with the native data types of the entry language of the specific reconfigurable engine. Micro-operands are only handled when programming the extensions.

As the Morpheus platform is required to process data streams under given real-time constraints, the work of the user at system level is to schedule tasks so as to optimize the partitioning of the application's computational demands over the available hardware units. The aim of the mapping task is to build a balanced pipelined flow that induces as few stalls as possible in the data flow, so that the required run-time specifications are sustained. The computation should be partitioned across the three reconfigurable engines and the ARM core in as balanced a way as possible. Figure 2.2 provides a generic example of application mapping, utilizing only two reconfigurable engines for simplicity. It is evident that the overall performance is constrained by the slowest stage, where a stage can be either computation or data transfer. The timing budget of each stage is flexible and can be refined by the user, depending on the features of the application. The interface between the user and all hardware facilities is the main processor core.
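The statement that performance is constrained by the slowest stage can be illustrated with a small sketch. It assumes an idealized pipeline in which every chunk takes the same number of cycles in a given stage and stages are decoupled by buffers; the function names and the cycle-count model are assumptions made for illustration only.

```c
#include <assert.h>
#include <stddef.h>

/* Cycles taken by the slowest stage (computation or data transfer).
   In steady state one chunk leaves the pipeline every such interval. */
unsigned long pipeline_interval(const unsigned long *stage_cycles,
                                size_t n_stages)
{
    assert(stage_cycles != NULL && n_stages > 0);

    unsigned long worst = stage_cycles[0];
    for (size_t i = 1; i < n_stages; i++)
        if (stage_cycles[i] > worst)
            worst = stage_cycles[i];
    return worst;
}

/* Total cycles to process n_chunks identical chunks: the first chunk
   traverses every stage, each later chunk adds one bottleneck interval. */
unsigned long pipeline_total(const unsigned long *stage_cycles,
                             size_t n_stages, unsigned long n_chunks)
{
    if (n_chunks == 0)
        return 0;

    unsigned long fill = 0;
    for (size_t i = 0; i < n_stages; i++)
        fill += stage_cycles[i];

    return fill + (n_chunks - 1) * pipeline_interval(stage_cycles, n_stages);
}
```

Under this model, shortening any stage other than the slowest one only reduces the fill latency, which is why a balanced partitioning matters for long streams.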

Hardware resources are triggered and explicitly synchronized by software routines running on the ARM. In order to preserve data dependencies in the data flow without overly constraining the size and nature of each application kernel, the computation flow can be modeled according to two different design description formalisms: Petri Nets (PN) and Kahn Process Networks (KPN) [59]. In the first case the synchronization described above is made explicit, and each computation node is triggered by a specific set of events. In the second case synchronization is implicit, by means of FIFO buffers that decouple the different stages of computation/data transfer.

Figure 2.2: Morpheus computational model.
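The implicit, FIFO-based synchronization of a KPN-style flow can be sketched as follows. This is a single-threaded illustration under stated assumptions: the FIFO depth, the `fifo` type and the `node_step` function are hypothetical, and the FIFOs are bounded, so a node stalls on a full output as well as on an empty input (a strict KPN assumes unbounded channels).

```c
#include <assert.h>
#include <stddef.h>

#define FIFO_DEPTH 4   /* illustrative depth, not a Morpheus parameter */

typedef struct {
    int    data[FIFO_DEPTH];
    size_t head;    /* index of the oldest token  */
    size_t count;   /* number of tokens stored    */
} fifo;

/* Store one token; returns 0 (no effect) if the FIFO is full. */
int fifo_put(fifo *f, int v)
{
    assert(f != NULL);
    if (f->count == FIFO_DEPTH)
        return 0;
    f->data[(f->head + f->count) % FIFO_DEPTH] = v;
    f->count++;
    return 1;
}

/* Remove the oldest token; returns 0 (no effect) if the FIFO is empty. */
int fifo_get(fifo *f, int *v)
{
    assert(f != NULL && v != NULL);
    if (f->count == 0)
        return 0;
    *v = f->data[f->head];
    f->head = (f->head + 1) % FIFO_DEPTH;
    f->count--;
    return 1;
}

/* One step of a processing node: it runs only when a token is available
   on its input and there is room on its output; otherwise it stalls.
   The "computation" (doubling the token) is a placeholder. */
int node_step(fifo *in, fifo *out)
{
    int v;
    if (in->count == 0 || out->count == FIFO_DEPTH)
        return 0;               /* stall: no explicit event is needed */
    fifo_get(in, &v);
    fifo_put(out, v * 2);
    return 1;
}
```

No controller decides when `node_step` may run: the state of the two FIFOs alone determines it, which is the sense in which synchronization is implicit.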

Generally speaking, the XPP array appears suited to a KPN-oriented flow, as its inputs are organized with a streaming protocol. Unlike XPP, DREAM is a computation-intensive engine: input data are iteratively processed inside the reconfigurable engine's local memory. Finally, M2K is an eFPGA device programmed in HDL, so that any computation running on it can be modeled according to either formalism. A KPN can be described as a sub-net of a larger PN, while the converse is not possible: if the target application fits the KPN formalism well, it appears relatively easy to map it on XPP and eFPGA, exploiting the local IO buffers as FIFOs, while if the application should exploit DREAM the pattern will have to be extended to a PN, with XPP/eFPGA implementing a sub-net organized as a KPN. In other cases a streaming approach cannot be applied, because different reconfigurable engine operations may be required to run iteratively on the local buffers to describe a given computation kernel; a full PN approach must then be applied.

The rules of a generic PN can be briefly described as follows: a given node can compute (trigger) when all preceding nodes have concluded computation and all successive nodes have read the results of the previous computation. In the context of Morpheus these rules can be rewritten as follows. A given computation can be triggered on a given reconfigurable engine when:

 The bitstream for the application has been successfully loaded;

 All input data chunks have been successfully uploaded to the reconfigurable engine's local buffers;

 All output data chunks that would be overwritten by the current iteration have been successfully copied from the reconfigurable engine's local buffers to their respective destinations.
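The three triggering conditions listed above can be expressed as a single predicate evaluated by the control software. The sketch below is only an illustration: the `engine_state` structure and its fields are hypothetical bookkeeping variables, not registers or API calls of the actual platform.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical software-side view of one reconfigurable engine. */
typedef struct {
    bool   bitstream_loaded;  /* configuration for this kernel is in place */
    size_t inputs_expected;   /* input chunks the kernel needs per round   */
    size_t inputs_uploaded;   /* input chunks already in the local buffers */
    size_t outputs_pending;   /* output chunks still in the local buffers
                                 that this round would overwrite           */
} engine_state;

/* True when the computation may be triggered on the engine. */
bool engine_can_fire(const engine_state *e)
{
    assert(e != NULL && e->inputs_uploaded <= e->inputs_expected);

    return e->bitstream_loaded
        && e->inputs_uploaded == e->inputs_expected
        && e->outputs_pending == 0;
}
```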

In the case of PN, the ARM is required to verify the PN consistency and to produce the preceding/successive tokens that trigger the computation stages. Of course, if data chunks are large enough, this monitoring will not be required very often.

Each reconfigurable engine computation round is applied to a finite input data chunk and creates an output data chunk. In order to ensure maximum parallelism, during computation round N the following input chunks N+1, N+2, ... should be loaded, filling all available space in the local buffers while taking care not to overwrite unprocessed chunks.

Similarly, the previously available output chunks ..., N-2, N-1 should be concurrently downloaded, taking care not to access chunks that have not yet been processed.

This mechanism is called ping-pong buffering, and is used to provide a sort of processor-controlled, coarse-grained FIFO access.

In document Multi Processor Systems On Chip with Configurable Hardware Acceleration (Page 48-51)