
[Block diagram labels: STxP70 cores #0 to #N-1 with a shared I$ subsystem; HW IP #0 to #K-1 attached through I/F #0 to #K-1, with IP-specific interconnect, interfaces, and peripherals; DMA; tightly-coupled peripherals (synchronization, events, etc.); shared memory banks bank0 to bankM; link to/from the SoC interconnect.]

Figure 2.2: The StreamDrive Cluster Block Diagram

2.2 The StreamDrive Architecture

The StreamDrive architecture evolved from the STMicroelectronics Platform 2012 project [15]. Figure 2.2 shows the block diagram of the StreamDrive cluster. StreamDrive is a heterogeneous tightly-coupled cluster composed of a number of programmable Processing Elements (PEs), application-specific Hardware Processing Elements (HWPEs), and a DMA, all connected to a shared Tightly-Coupled Data Memory (TCDM). Using shared memory instead of a cache is a power-saving feature, because caches consume more power than scratchpad memories.

The TCDM contains the application working set used by both PEs and HWPEs. In this way, the TCDM storage replaces (fully or partially) the hardware elements' dedicated storage, with advantages both in terms of area (no buffer duplication) and performance (no need to copy buffers between different memories). The size of the TCDM has an important impact on the area-efficiency (GOPS/mm²) of the system: the larger the TCDM, the lower the area-efficiency. A relatively small TCDM memory (up to 512KB in the current implementation) cannot hold the entire application working set and requires frequent data movement between the TCDM and the external memory. The DMA is used for transferring data between the TCDM and external memory. The DMA also supports stream synchronization of data transfers.

We use simple in-house RISC processing cores running at a relatively low frequency (typically 500 MHz). Lowering the frequency of the programmable cores improves the power-efficiency of the cluster. The cores are extended with the synchronization event handling extension. The programmable cores are connected directly to the TCDM memory, with an access latency compatible with the core's internal pipeline.


While the application working set is loaded into the TCDM, the application instruction code is kept in off-cluster external memory. The instruction cache sub-system ensures efficient fetching of application code from external memory to the StreamDrive cores. We have chosen to use an instruction cache rather than a dedicated program memory for two reasons: (1) even though the energy consumption of an instruction cache is somewhat higher, it generally requires a smaller size than a program memory for the same application; (2) an instruction cache can gracefully handle programs exceeding its hardware size. The StreamDrive instruction cache is shared by all processing cores, which significantly reduces the number of external memory accesses and conflicts on the external memory bus. Because StreamDrive is used in SoC systems with potentially high memory latency, we need to carefully dimension the instruction cache size. In our experience the biggest application is about 100KB, and a 64KB instruction cache is sufficient to eliminate virtually all replacement misses.

The application-specific HWPEs are essential for achieving the required performance while keeping the cost and the power consumption low. To further optimize the power-efficiency of the system, each HWPE can run in its own dedicated clock domain, so that its frequency can be adjusted in accordance with application requirements. The connection between the HWPEs and the shared memory is provided by the Hardware Block Bridge (HBB; I/F 0 .. I/F K-1 in the figure), which serves as a bridge for streaming hardware blocks. The PEs, the HBB, and the DMA all support the StreamDrive communication protocol based on shared memory. This creates a common infrastructure for core-to-core, core-to-hardware-block, and hardware-block-to-hardware-block communication.

The StreamDrive cluster also includes a small number of tightly-coupled peripherals that accelerate synchronization, event handling, etc.

The key element of the StreamDrive cluster is its logarithmic interconnect [149], which allows multiple concurrent accesses to the multi-bank TCDM memory. To minimize the number of stalls due to conflicting simultaneous accesses to the same bank, the banking factor (i.e. the ratio between the number of TCDM memory banks and the number of access ports) needs to be correctly dimensioned. Such a shared memory organization, although it has limited scalability, corresponds well to the small-scale cluster architecture that we target. Our experience, confirmed by other studies on similar architectures [7], shows that this type of interconnect can support up to 32 access ports, each with a throughput close to 32 bits/cycle and a latency compatible with the RISC core internal pipeline, at the embedded IP target frequencies. As a result, the logarithmic interconnect technology constraints limit the scale of a StreamDrive cluster to around 32 processing elements. When a single StreamDrive cluster cannot deliver the necessary performance, multiple clusters can be put together, thus allowing massive upscaling in performance while maintaining the initial power- and area-efficiency.
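The effect of the banking factor on stalls can be illustrated with a small Monte Carlo sketch. It is not a model of the actual interconnect: it assumes that every port issues one request per cycle to a uniformly random bank and that each bank serves a single request per cycle. The function names and parameters are illustrative only.

```python
import random

def banking_factor(num_banks, num_ports):
    """Ratio between the number of memory banks and the number of access ports."""
    return num_banks / num_ports

def conflict_rate(num_banks, num_ports, cycles=10000, seed=0):
    """Estimate the fraction of requests that stall on a bank conflict.

    Assumptions (not taken from the text): uniformly random bank selection,
    one request per port per cycle, one request served per bank per cycle.
    """
    rng = random.Random(seed)
    stalled = 0
    for _ in range(cycles):
        targets = [rng.randrange(num_banks) for _ in range(num_ports)]
        served = len(set(targets))        # one winner per distinct bank
        stalled += num_ports - served     # every other request must retry
    return stalled / (cycles * num_ports)

if __name__ == "__main__":
    for banks in (16, 32, 64):
        print(banks, banking_factor(banks, 16), round(conflict_rate(banks, 16), 3))
```

Under these assumptions the stall fraction falls as the banking factor grows, which is the trade-off against the additional banks that has to be dimensioned.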

The following StreamDrive architecture elements ensure efficient implementation of the dataflow execution model: (1) the Synchronization Event Network together with the processing element's Event Handling Extension (EVTx) ISA extension, (2) the HBB for connecting the application-specific hardware elements to shared memory, and (3) the Dataflow-Aware DMA.

2.2.1 The Event Synchronization Network

An essential extension to the processing element's ISA that ensures efficient implementation of the dataflow synchronization is the EVTx. The EVTx is built around the concept of hardware events. An event is similar to a processor interrupt in that both interrupts and events are signals delivered to the processor asynchronously with respect to the normal execution flow. However, there is one important difference, which makes events much more efficient than interrupts for implementing multiprocessor synchronization primitives. When an interrupt occurs, the interrupt handler executes code that is not part of the normal execution flow. An interrupt is handled by the processor as soon as it arrives (possibly subject to the interrupt priority level); normal execution is then suspended, which implies a costly context switch while processing the interrupt. In contrast, the handling of a hardware event may be delayed for as long as the normal execution flow does not request that the event be handled. Thus, the event handler is a part of normal application execution. Event handling does not require a context switch and allows an extremely efficient implementation (a few processor cycles) of parallel synchronization primitives.
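The deferred-handling semantics described above can be sketched in software. The `EventUnit` class and `run_actor` function below are hypothetical names, not the EVTx instruction set: an arriving event only latches a pending bit, and the program consumes it at a point of its own choosing.

```python
class EventUnit:
    """Hypothetical software model of a per-core pending-event register."""

    def __init__(self):
        self.pending = 0

    def raise_event(self, event_id):
        # Asynchronous arrival only latches a bit; execution is not diverted.
        self.pending |= 1 << event_id

    def consume(self, event_id):
        """Report whether the event was pending and clear it."""
        was_set = bool(self.pending & (1 << event_id))
        self.pending &= ~(1 << event_id)
        return was_set


def run_actor(unit, steps, arrivals, sync_points, event_id=0):
    """Run `steps` work steps and handle the event only at sync points."""
    trace = []
    for step in range(steps):
        if step in arrivals:
            unit.raise_event(event_id)    # the event arrives during this step
        trace.append(f"work{step}")
        if step in sync_points and unit.consume(event_id):
            trace.append("handle")        # handled inline, no context switch
    return trace


if __name__ == "__main__":
    print(run_actor(EventUnit(), 4, arrivals={1}, sync_points={3}))
```

In this run the event arrives during step 1 but is handled only after step 3, where the program asks for it; an interrupt would have diverted execution at step 1.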

In StreamDrive, the hardware events are used to avoid active polling of shared memory locations while waiting for dataflow tokens to become available. It has been noticed previously that polling for dataflow firing rules may incur significant overhead in terms of performance and energy consumption [150]. One interesting solution for reducing this overhead has been proposed by Martin et al. [150]. The authors developed the concept of Notifying Memories, where special interconnect components can trigger and receive notifications according to some events. The particular events of interest are changes in the state of the dataflow communication channels. In StreamDrive, instead of special interconnect components, it is up to the PEs, the HWPEs, and the DMA to generate a hardware event every time the dataflow graph state (the number of tokens in the communication channels) changes. The hardware event approach is more lightweight, more flexible, and more scalable than Notifying Memories. In addition, a PE or an HWPE can enter an energy-saving idle state while waiting for a hardware event. For event generation, the PEs use the EVTx, a tiny extension to the processor instruction set that implements instructions for generating hardware events and for inquiring about event status; the HWPEs rely on the HBB for generating these events (see the Hardware Block Bridge description below); the DMA also integrates event generation functionality.
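As a rough software analogy of waiting on an event instead of polling, the sketch below models a channel whose state is its token count. It assumes a single consumer, and `threading.Event` stands in for a hardware event line; the `Channel` class is hypothetical and does not describe the StreamDrive protocol itself.

```python
import threading

class Channel:
    """Hypothetical dataflow channel whose state is a token count."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.tokens = 0
        self._lock = threading.Lock()
        self._event = threading.Event()   # stands in for a hardware event line

    def push(self):
        """Producer side: add one token, then signal the state change."""
        with self._lock:
            if self.tokens == self.capacity:
                raise RuntimeError("channel full")
            self.tokens += 1
        self._event.set()

    def pop_blocking(self):
        """Consumer side (single consumer assumed): sleep until a token exists."""
        while True:
            with self._lock:
                if self.tokens > 0:
                    self.tokens -= 1
                    return
                self._event.clear()       # nothing available: arm the wait
            self._event.wait()            # idle here instead of polling memory


if __name__ == "__main__":
    ch = Channel(capacity=2)
    consumer = threading.Thread(target=ch.pop_blocking)
    consumer.start()
    ch.push()                             # wakes the sleeping consumer
    consumer.join()
    print(ch.tokens)
```

The consumer re-reads the token count only after being woken, so it touches shared state once per state change rather than once per polling iteration.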

The StreamDrive Event Synchronization Network connects the hardware events from all platform elements. It selectively delivers hardware events from a set of sources to a set of destinations, depending on the connections of the StreamDrive dataflow graph. Functionally, the Event Network essentially ORs the events from a set of sources and sends the result to a set of destination elements.
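The OR-and-route behaviour can be written down as a minimal functional sketch. The `EventNetwork` class and the element names (`pe0`, `dma`, `hbb`, and so on) are made up for illustration; the real network is a hardware block configured from the dataflow graph.

```python
class EventNetwork:
    """Hypothetical functional model: each destination sees the OR of its sources."""

    def __init__(self):
        self.routes = {}                  # destination -> set of connected sources

    def connect(self, source, destination):
        self.routes.setdefault(destination, set()).add(source)

    def propagate(self, active_sources):
        """Return the destinations that receive an event for the given sources."""
        active = set(active_sources)
        return {dst for dst, srcs in self.routes.items() if srcs & active}


if __name__ == "__main__":
    net = EventNetwork()
    net.connect("pe0", "pe1")
    net.connect("dma", "pe1")
    net.connect("hbb", "pe2")
    print(net.propagate({"dma"}))
```

Here `pe1` is notified when either `pe0` or `dma` raises an event, while `pe2` only listens to `hbb`.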

2.2.2 The Hardware Block Bridge

The HBB provides the HWPEs with an interface that abstracts the system memory addresses into a simpler token-based representation, which can then be implemented using a streaming type of communication. Such a token representation may range from very simple, e.g. linear streaming with standard FIFO read and write operations, to more complex access patterns, such as a sliding convolution window. The HBB performs the following tasks:

1. It transforms streaming HWPE read and write requests without an address into a sequence of naturally aligned LOAD and STORE requests with a full system address to pre-allocated buffers in shared memory.

2. It pipelines the generated memory LOAD and STORE transactions.

3. It manages the HWPE working set as rotating buffers of tokens by implementing dataflow synchronization compliant with the StreamDrive communication protocol.

4. It multiplexes multiple streaming requests from HWPEs into a limited number of shared memory ports.

5. It ensures the clock domain crossing between the HWPEs and the StreamDrive cluster.
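Tasks 1 and 3 can be illustrated with a small address generator. This is a sketch under stated assumptions, not the HBB design: the class name, the 4-byte word size, and the buffer parameters are hypothetical, and back-pressure (stalling when the rotating buffer is full or empty) is deliberately left out.

```python
class StreamAddressGenerator:
    """Hypothetical model: address-less stream words -> aligned addresses
    inside a rotating buffer of tokens."""

    def __init__(self, base, token_bytes, num_tokens, word_bytes=4):
        # Natural alignment: base and token size are multiples of the word size.
        assert base % word_bytes == 0 and token_bytes % word_bytes == 0
        self.base = base
        self.token_bytes = token_bytes
        self.size = token_bytes * num_tokens   # total rotating buffer size
        self.word = word_bytes
        self.offset = 0

    def push_word(self):
        """Return (address, token_done) for the next streamed word.

        token_done is True when this word completes a token, which is the
        point where an event would be generated for the peer element.
        """
        addr = self.base + self.offset
        self.offset = (self.offset + self.word) % self.size   # wrap around
        token_done = self.offset % self.token_bytes == 0
        return addr, token_done


if __name__ == "__main__":
    gen = StreamAddressGenerator(base=0x1000, token_bytes=8, num_tokens=2)
    for _ in range(5):
        addr, done = gen.push_word()
        print(hex(addr), done)
```

With two 8-byte tokens the generated addresses run from 0x1000 to 0x100C and then wrap back to 0x1000, and every second word completes a token.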

Among other things, the token abstraction allows the implementation of a rotating buffer storage model. This model is very useful in image processing. Each HWPE input and output channel has an associated rotating buffer inside the TCDM memory. For example, an HWPE applying a filter to an image could require access to several lines of the input image at a time as a temporal window. When the entire image does not fit in the relatively small TCDM memory, the rotating buffer would


In document A Dataflow Framework For Developing Flexible Embedded Accelerators A Computer Vision Case Study. (Page 54-58)