The IMORC Programming Model - Performance modeling and analysis in high-performance reconfigura

The IMORC programming model consists of an architecture model and an execution model. The goal of the architecture model is to characterize the target platform. The execution model specifies the application to be implemented. Both models can be created at different levels of abstraction, depending on the current stage of the design phase. The models are used for a first estimation of an algorithm’s suitability for FPGA acceleration as well as for the design phase of the accelerator.

3.2.1 The Architecture Model

The architecture model underlying IMORC is a network of communicating cores. A core typically consists of some local storage, such as embedded memory blocks or registers, and an execution unit. Cores can communicate with other cores using an on-chip network. For accessing the network, each core provides an arbitrary number of communication ports, which are either master or slave ports. The network connects master to slave ports using links. Communication may be only initiated by master ports by sending a read or write request to the slave port. For write requests, the master has to send data corresponding to the request to the slave, for read request, the slave has to reply with a corresponding data packet. Master ports may connect to multiple slave ports, in which case addressing is required to select the target slave port of a communication request. Conversely, slave ports may connect to multiple master ports through an arbiter.

Figure 3.1(a) shows the block diagram of a typical execution (compute) core. The core comprises a slave and two master ports, each of which is separated into a request channel (REQ) providing control information and a data channel (DAT) for the actual data that has

3.2• The IMORC Programming Model 25 to be transmitted. The slave communication interface is connected to a block of registers and to internal memory. A communication request controller exists for each of the master links, and each of them is connected to the register block for retrieving control information. An execution unit is responsible for performing the actual calculations and is connected to the data channels of the master ports. If required, the execution unit may also be connected to the register block and the internal memory block to store local data. Additionally, it may be connected to the communication request controller. This is, for example, useful if the core implements an iterative function where the number of iterations necessary cannot be calculated in advance, but depends on parameters that are calculated during runtime (e. g., an approximation algorithm). In such a case, the request controller has to be informed by the execution unit if further iterations are required or not.

The host processor (Figure 3.1(b)) is modeled as a simplified core that provides, depending on the actual target platform, one or multiple master ports and optionally one slave port. The core comprises a CPU that is able to execute arbitrary tasks and a local memory. The CPU can access this memory and send messages to the master ports. Additionally other cores may access the memory using the optional slave port.

Shared memory is modeled as a specific kind of core with no execution unit but a large amount of storage. A memory core provides exactly one slave port (see Figure 3.1(c)). When a master core sends a write request to a memory core, the corresponding data is stored in the core’s storage, read request are replied with a response message containing the corresponding data from the core’s storage.

Execution Unit Memory Internal Registers REQ DAT Slave REQ DAT REQ CTRL CTRL REQ REQ DAT Master Master

(a) Compute core

Master Slave REQ DAT REQ DAT CPU Memory (b) Host processor Memory Internal Slave REQ DAT (c) Memory core

Figure 3.1:Block diagram of an example compute core and a memory core Figure 3.2 presents a sample architecture diagram including three compute cores, a host processor core and a memory core. The host interface core can directly access the slave port of core 0, which provides two master ports for accessing the slave ports of core 1 and

core 2. These master ports of core 1 and core 2 access the memory core, the master port of core 2 is additionally connected to the slave port of the host interface core.

Core 2 Core 1

Core 0 MEM Core

S S M M S M S M S M Host Processor

Figure 3.2:Sample architecture diagram with three compute cores, one memory core and one host processor core. S denotes a slave port, M denotes a master port.

3.2.2 The Execution Model

The intention of the execution model is to specify the application to be implemented and to support the definition of the architecture model. While the architecture model of IMORC comprises a set of cores that are connected to some kind of network, the execution model has to specify the actual behavior of these cores and their communication pattern. This model consists of a set of tasks, which are able to communicate to each other. Tasks are composed of a number of operations, which are classified into three distinct groups:

• Incoming communication operations, so the task has to process incoming messages, • outgoing communication operations, so the task sends messages to other tasks and

eventually has to wait for a response and

• local operations, such as the addition of two values.

Figure 3.3 shows an example for such a task. The tall box in the middle represents local operations performed by the task, the smaller boxes on the left and the right represent communication points with other tasks. Tasks can communicate with other tasks using messages. Messages can be read (rd) or write (wr), the first kind is used for requesting

data from another task, the other one for sending data to another task. Read requests need to be followed by a response message (rd resp). Tasks can be modeled at different

levels of abstraction, depending on the actual requirements. On a rather abstract level, the tasks computations may be described by pseudo code or a code segment in a high-level language. On a fairly detailed level, the computations may be expressed as a sequence of micro operations or RTL code. Tasks may directly operate on local memory. Shared memory accessed by multiple tasks is modeled as a separate task that communicates with the tasks accessing the memory.

Figure 3.4 shows a sample task graph with two computation tasks accessing a block of shared memory. Modeling memory as separate tasks simplifies the runtime analysis, especially in the case that multiple tasks access a shared block of memory and therefore

3.2• The IMORC Programming Model 27 LO CAL OPS OU T GOI NG MSG I NCOMI NG MSG wr rd rd resp wr rd resp rd

Figure 3.3:Diagram of a sample task

contention may occur. Additionally, due to the request-response nature of the communication model, tasks operating on streams of data can be modeled in a natural way. The communication subtasks send read requests to the appropriate memory task, and as soon as the data becomes available, the local operation subtasks start processing.

MEM TASK TASK 2

TASK 1 TASK 0

Figure 3.4:Sample task graph: task 0 starts two computation tasks, which in turn access a memory task

It has to be noted that the IMORC architecture and execution models are rather general and do not pose any restriction on the behavior of tasks or cores, respectively. For example, there is no need to enforce blocking reads on incoming messages as in the Kahn process

network (KPN) model, or to specify rates for the production of outgoing messages as in the synchronous data flow (SDF) model. In contrast to formal models of computation, IMORC provides more degrees of freedom but misses formally provable characteristics. However, formal models such as KPN and SDF can easily be embedded into IMORC by imposing corresponding rules.

In document Performance modeling and analysis in high-performance reconfigurable computing (Page 36-40)