
Definition 2. A DPN is a directed graph G = (N, C), where:

1.3 Landscape of Dataflow Implementations

Due to its elegance and simplicity, the dataflow model of computation has been the subject of many research efforts. Since the early 1970s, a number of computer prototypes based on the dataflow model of computation have been built and evaluated. Representative dataflow architectures include the Manchester Dataflow machine [66], the MIT Tagged-token Dataflow architecture [67], the SIGMA-1 [68], and the Monsoon dataflow processor [69]. Several dataflow embedded systems have also been designed, such as the Hughes Dataflow Multiprocessor (HDFM) [70] and the AT&T Enhanced Modular Signal Processor [71]. These systems implemented elaborate hardware to execute dynamic scheduling and employed expensive communication networks to route data tokens. These early dataflow computers failed to deliver the promised performance, mainly due to the following limitations: (1) too fine-grained (instruction-level) synchronization, (2) difficulty in exploiting memory hierarchies, and (3) inefficient use of the pipeline.

With the proliferation of multiprocessor computing platforms, renewed interest in the dataflow computation model has emerged, particularly in the embedded systems context. Very broadly, the existing dataflow frameworks fall into three categories: (1) domain-specific platforms with specialized languages and tools; (2) model-based frameworks for hardware/software codesign; and (3) API-based frameworks provided in the form of runtime libraries.

The domain-specific platforms and the model-based frameworks rely on automated analysis and generation tools. However, efficient implementations are usually possible only for decidable dataflow models, for which automated analysis is feasible. Moreover, these frameworks require a complete rewrite of the original reference application in a new language. Embedded system developers have long been familiar with sequential programming languages such as C. In fact, around 85% of embedded system developers still use C/C++ [72]. Therefore, apart from the very specialized signal processing domain, no new parallel programming models or languages have been widely adopted on embedded platforms so far.

The API-based approaches are mostly developed for existing fixed platform architectures, some of which include specialized application-specific hardware units. These approaches do not require the developer to heavily modify the original source code, which reduces the parallelization effort. They usually support expressive models of computation, such as KPN or DPN, but have not yet been able to close the gap between specification and implementation so as to achieve the computational performance and the energy efficiency of handcrafted solutions.

Overall, apart from a few specialized application domains, the dataflow model has not been widely adopted by industry. There are three main reasons for this:

• Motivated by the need to amortize development cost over a large number of units, and by intensified time-to-market constraints, IC and system companies are pushing toward platform-based designs, where new applications can be developed much more efficiently. The existing platforms lack the necessary built-in hardware support for efficient dataflow execution.

• The main premise of existing dataflow tools, that a dataflow application can be specified at a high abstraction level and automatically transformed into an efficient implementation, has not been fulfilled. While statically decidable dataflow models cannot represent all of the functionality required by many streaming applications, the greater expressive power of dynamic dataflow models leads to problems with unbounded buffers and runtime efficiency. Thus, the main challenge that the dataflow computation model has to face is the demonstration of efficient implementations that can meet the functionality and performance constraints imposed by modern applications.

• Dataflow computing is associated with reliance on development frameworks which are hard or inefficient to use for many practical applications. Reference implementations are typically developed by research teams and specified as sequential programs using imperative programming languages such as C/C++ or Matlab. Transforming a sequential reference algorithm into a dataflow representation is a complex, manual, time-consuming process. In particular, the adoption of dynamic dataflow has been hampered by the need to start from scratch with all (software and hardware) components of computing. A paper by Denning and Dennis [73] discusses many of the issues related to the canonical parallel processing model and dataflow computing.


Framework     Model                                     Target                   Programming
RAW Machine   SDF                                       Streaming Applications   StreamIt
Imagine       Streaming                                 Streaming Applications   KernelC+StreamC
Ptolemy       SDF, CSDF, HDF, PSDF, KPN, BDF, DPN, SDF  Simulation and design    Specialized Language
PeaCE         SDF                                       HW/SW Codesign           Specialized Language
DIF           BDF, PSDF, EIDF, CSDF                     DSP                      DIF Language
TDIF          CFDF                                      GPU                      DIF Language
Daedalus      PPN                                       HW/SW Codesign           C+coordination
CompSOC       SDF, CSDF, KPN                            HW/SW Codesign           C+coordination
Koski         KPN                                       HW/SW Codesign           UML
PREESM        PiSDF                                     TI Keystone              C+coordination
SHIM          Rendez-vous KPN                           Multi-core               C Extension
DOL           KPN                                       Embedded                 C API
RVC           SDF, CSDF, DPN                            Video Coding             CAL
OpenDF        SDF, CSDF, DPN                            General                  CAL

Table 1.1: Summary of selected related work

Below, we examine a few selected state-of-the-art dataflow implementations from three perspectives:

• Expressiveness of the dataflow model versus the implementation efficiency.

• Ease of parallelizing a sequential C reference implementation so that the parallel efficiency of the target hardware platform can be exploited.

• Integration of specialized and application-specific hardware units with the hardware platform.

Table 1.1 summarizes the dataflow implementations reviewed in this chapter. The first column lists the dataflow framework; the second column contains the dataflow model used by the framework; the third column specifies the target architecture platform; and the last column lists the dataflow description language where applicable. Three types of dataflow specification exist: (1) a specialized language (e.g., Ptolemy, StreamIt), (2) a standard language (C, C++, etc.) extended with a dataflow API, and (3) a combination of a standard language (C, C++, etc.) for actor description along with a coordination language for specifying the dataflow network.


1.3.1 Domain-specific Architectures

The MIT Reconfigurable Architecture Workstation (Raw) [74] and Stanford Imagine [75] were the two early foundational works in the area of stream processing. The stream model is derived from the SDF dataflow model of computation. In addition to the SDF expressiveness problems, non-linear communication patterns are difficult to implement efficiently with streaming architectures because of the linear stream abstraction. In order to overcome this limitation, the Raw programming language, StreamIt [76], introduced the notion of teleport messages [77]. Teleport messages allow one actor to sporadically send a message to another; that is, rather than sending a message on every firing, only some firings send messages. As another example, Imagine implemented conditional streams, which are accessed conditionally based on condition codes (CC) [78]. Conditional streams enable implementation in the presence of data-dependent conditions.
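To make the idea of a conditional stream more concrete, the following minimal sequential sketch models a conditional output stream in C. It assumes that an element is appended to the output stream only when its condition code is set, so the output is compacted; the function name `cond_stream_write` and its signature are hypothetical and are not taken from the Imagine tool chain.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Sequential model of a conditional output stream (an assumption made
 * for illustration): element in[i] is appended to out only if its
 * condition code cc[i] is non-zero. out must have room for n elements.
 * Returns the number of elements actually written.
 */
size_t cond_stream_write(const int *in, const unsigned char *cc,
                         size_t n, int *out)
{
    size_t i, written = 0;

    assert(n == 0 || (in != NULL && cc != NULL && out != NULL));

    for (i = 0; i < n; i++) {
        if (cc[i]) {
            out[written++] = in[i];  /* data-dependent stream access */
        }
    }
    return written;
}
```

The number of tokens produced depends on the data, which is exactly the kind of behavior that a fixed-rate SDF actor cannot express directly.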

More examples of streaming architectures include the Reconfigurable Streaming Vector Processor (RSVP) [79], which exposes streams in a core's ISA to communicate with reconfigurable hardware; Triggered Instructions [80], which feature some streaming memory capability to feed their dataflow fabric; and Stream-Dataflow [81], a reconfigurable dataflow architecture that uses streams as the underlying communication abstraction.

Several other domain-specific accelerators use the dataflow computation model, such as Eyeriss [82, 32], a domain-specific accelerator for convolutional neural networks, or the Kalray MPPA multiprocessor platform, programmed using the dataflow ΣC language [83] that implements the CSDF model.

Overall, such domain-specific platforms failed to attract the embedded community because (1) they suffer from reduced expressiveness and flexibility, and (2) they rely on specialized languages and development tools unfamiliar to the vast majority of developers in the field.

1.3.2 Model Driven Frameworks

The model-driven dataflow frameworks are designed to support efficient design space exploration, a systematic methodology for selecting an embedded system implementation from a set of alternatives. In these tools, system designers are able to develop complete functional applications formally specified as a dataflow model and perform automated performance analysis, simulation, synthesis and verification of the implementation. Design space exploration is performed by iteratively analyzing and optimizing the application along with the underlying hardware and software architecture. In the embedded domain, decidable dataflow models are particularly popular because the reduced runtime overhead and the analyzability of compile-time scheduling are considered a major advantage.
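As an illustration of why decidable models lend themselves to compile-time scheduling, the following minimal sketch computes the repetition vector of a small SDF graph from its balance equations: for every edge, the number of tokens produced per graph iteration must equal the number consumed. The function names and the restriction to a linear chain of actors are assumptions made for this example; they are not taken from any of the tools discussed in this section.

```c
#include <assert.h>
#include <stddef.h>

/* Greatest common divisor (Euclid); gcd_u(a, 0) == a. */
unsigned long gcd_u(unsigned long a, unsigned long b)
{
    while (b != 0) {
        unsigned long t = a % b;
        a = b;
        b = t;
    }
    return a;
}

/*
 * Hypothetical helper: actors 0..n-1 form a chain. Edge i connects
 * actor i to actor i+1; actor i produces prod[i] tokens per firing and
 * actor i+1 consumes cons[i] tokens per firing. On success, q[0..n-1]
 * holds the smallest positive firing counts that satisfy
 *     q[i] * prod[i] == q[i+1] * cons[i]   for every edge i.
 * Returns 0 on success, -1 on invalid input (n == 0 or a zero rate).
 */
int sdf_chain_repetitions(const unsigned long *prod,
                          const unsigned long *cons,
                          size_t n, unsigned long *q)
{
    size_t i, j;
    unsigned long g;

    if (n == 0) {
        return -1;
    }
    q[0] = 1;
    for (i = 0; i + 1 < n; i++) {
        unsigned long produced, f;

        if (prod[i] == 0 || cons[i] == 0) {
            return -1;
        }
        produced = q[i] * prod[i];
        /* Scale all counts so far so that the consumer fires an
         * integer number of times. */
        f = cons[i] / gcd_u(produced, cons[i]);
        for (j = 0; j <= i; j++) {
            q[j] *= f;
        }
        q[i + 1] = (q[i] * prod[i]) / cons[i];
    }
    /* Divide by the common factor to obtain the minimal solution. */
    g = q[0];
    for (i = 1; i < n; i++) {
        g = gcd_u(g, q[i]);
    }
    for (i = 0; i < n; i++) {
        q[i] /= g;
    }
    return 0;
}
```

For a chain in which the first actor produces 2 tokens and the second consumes 3, and the second produces 1 token while the third consumes 2, the sketch yields the firing counts 3, 2 and 1. Because the rates are fixed, this computation needs no runtime information, which is what makes a static schedule possible.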

The Ptolemy environment (and its successor Ptolemy II) [84, 85, 86] was developed at the University of California at Berkeley. Ptolemy is targeted towards hardware/software codesign and in particular towards system synthesis and verification. Ptolemy supports a wide collection of computation models, including the majority of dataflow models. Ptolemy evolved into Ptolemy II, which proposes a modal approach where finite state machines (FSM) are combined with a dataflow model in a hierarchical fashion. The modal approach overcomes certain limitations of decidable dataflow models in expressive power, while it can still be refined to a final implementation, since both FSM and decidable dataflow models provide methods of system synthesis. In Ptolemy, not all models can be used for system implementation. In particular, synthesis from decidable dataflow models has been extensively researched, while other models serve simulation purposes only.

PeaCE (Ptolemy extension as a Codesign Environment) [87] is an extension to Ptolemy II that provides a hardware/software codesign framework. PeaCE uses extended SDF and finite-state machines to model the data flow and control flow of multimedia applications. The platform architecture consists of a number of processors and synthesizable IP cores, which are connected through a communication infrastructure. A two-step design space exploration is used: (1) selection of processing elements and mapping of application tasks on these processing elements, and (2) exploration of the communication architecture, such as bus and memory allocation. However, the framework is still limited in applicability by its SDF semantics.

The Dataflow Interchange Format (DIF) [88], developed at the University of Maryland, is a textual language for specifying dataflow models for DSP systems. DIF captures essential modeling information that is required in dataflow-based analyses and optimization, such as algorithms for consistency analysis, scheduling, memory management, etc. DIF provides an extensive repository of models, analyses, and transformations for a number of dataflow models, including dynamic models such as BDF, PSDF, EIDF, and CFDF. DIF itself does not generate implementations of dataflow descriptions but can be used by different DSP tools. For example, the DIF-to-C tool [89] allows generation of C code for the target DSP platform from an SDF dataflow specification. For the final implementation, DIF-to-C relies on the C compiler and optimized libraries provided by the DSP processor vendor. Shen et al. proposed the Targeted Dataflow Interchange Format (TDIF) [90], extending DIF with CFDF software synthesis. This implementation targeted CUDA code generation for NVIDIA GPUs, and has not demonstrated its efficiency in the context of more constrained embedded platforms.


Nikolov et al. [91, 92] presented the Daedalus framework for architectural exploration, synthesis, and prototyping of multicore platforms. Daedalus combines the KPNgen [93], Sesame [94], and ESPAM [95, 91] tools. Applications are specified as KPN networks, which are derived either manually or automatically using KPNgen. However, automatic generation of KPN networks from the application's sequential C code is only possible if the application is specified as a Static Affine Nested Loop Program (SANLP). A SANLP is a nested loop program in which loop bounds, conditions and variable index expressions are affine expressions in the iterators of enclosing loops and static parameters [93, 96]. Because many applications are not static, i.e., they include nested loops which can contain if-then-else constructs with no restrictions on the condition, loops with no condition on the bounds, while statements other than while(1), dynamic parameters, etc., the usability of KPNgen remains limited. For the resulting process networks, a static schedule can be computed. The input KPN network is fed to the Sesame modeling and simulation tool [94, 97] to perform architectural design space exploration for mapping and scheduling the KPN processes. Daedalus uses a heterogeneous platform (created from a library of components) where the processing elements communicate via distributed memories. A set of KPNs and platform configurations from Sesame is passed to ESPAM for prototyping on FPGA. ESPAM generates C code for the KPN software processes and synthesizable VHDL for the platform hardware components from an RTL component library. As explained later in this chapter, although it is relatively straightforward to manually derive a KPN network from sequential code, KPN execution in software suffers from high performance overhead, making the KPN model less suitable for embedded implementation than the DPN model.
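The SANLP restriction can be illustrated with a minimal sketch that contrasts a loop nest whose bounds, condition, and index expressions are affine in the enclosing iterators and a static parameter with a loop whose trip count and condition depend on the data. Both functions, their names, and the parameter `SIZE_N` are hypothetical examples written for this comparison; they are not taken from KPNgen or Daedalus.

```c
#include <assert.h>
#include <stddef.h>

#define SIZE_N 4  /* static parameter, known at compile time */

/*
 * SANLP-style loop nest: the bounds (0, SIZE_N, i), the condition
 * (i + j < SIZE_N + 1) and the index expressions a[i][j] are all
 * affine in the iterators i, j and the static parameter SIZE_N.
 */
long sum_affine(int a[SIZE_N][SIZE_N])
{
    long sum = 0;
    int i, j;

    for (i = 0; i < SIZE_N; i++) {
        for (j = i; j < SIZE_N; j++) {
            if (i + j < SIZE_N + 1) {
                sum += a[i][j];
            }
        }
    }
    return sum;
}

/*
 * Not a SANLP: the trip count of the while loop and the if condition
 * depend on the values stored in data, which are only known at run
 * time.
 */
size_t count_positive_until_zero(const int *data, size_t len)
{
    size_t k = 0, positives = 0;

    while (k < len && data[k] != 0) {
        if (data[k] > 0) {
            positives++;
        }
        k++;
    }
    return positives;
}
```

In the first function the set of executed iterations can be determined entirely at compile time, whereas in the second it cannot, which is why only the first form is amenable to the automated derivation described above.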

The Composable and Predictable Multi-Processor System on Chip (CompSOC) [98] is another design flow that supports a range of execution models, including the KPN model. The environment includes a complete multicore architecture, platform support libraries, libraries for synchronization and communication, and tools for formal verification. The main focus of CompSOC is to provide a design flow that supports simultaneous execution of multiple independent applications. In CompSOC, each application is given its own reconfigurable virtual platform. CompSOC employs two-level scheduling along with a resource sharing model in order to eliminate interference between different applications. At the level of a single application, CompSOC relies on the SDF3 toolset [99] for mapping and scheduling a dataflow application on the hardware platform. However, while decidable CSDF applications can be automatically mapped, verified, and executed on the CompSOC platform, mapping and analysis of KPN applications is not automated. The DPN model is not supported in the CompSOC environment.


Similarly, the Koski framework [100] provides an environment for modeling, automated design space exploration, synthesis, and FPGA prototyping of the selected design. The input specification is given as a KPN modeled in UML. The target architecture consists of synthesizable communication and processing resources, application software, and platform-dependent and platform-independent software.

The Parallel and Real-time Embedded Executives Scheduling Method (PREESM) [101] is a framework used to prototype and generate code for applications specified in the PiSDF dataflow model, and it targets heterogeneous multi-core embedded platforms. PREESM works with three inputs: a PiSDF dataflow graph defining the application; a System-Level Architecture Model (S-LAM) describing the target architecture; and a scenario including a set of parameters and constraints to link both of them. S-LAM supports the description of parallel architectures as a set of heterogeneous processing elements transmitting data through a set of communication nodes and data links. PREESM automatically schedules, maps and simulates the execution of the application and generates compilable C/C++ code for the target architecture. PREESM has been used to generate code for x86 multiprocessors, the Texas Instruments Keystone DSPs, the Kalray MPPA many-core, the Xilinx Zynq SoC, and ARM big.LITTLE and multi-core ARM platforms. The runtime responsible for managing runtime reconfigurations of the PiSDF dataflow graph is called SPIDER (Synchronous Parameterized and Interfaced Dataflow Embedded Runtime) [102]. SPIDER exploits the trade-off between dynamicity and predictability of the PiSDF model to verify application properties or to perform optimization at runtime. As with other parametric dataflow models, in the PREESM/SPIDER framework the token production and consumption rates cannot change arbitrarily: the existence of a dataflow graph iteration must be guaranteed. In practice, this leads to a restrictive coding style which may be difficult to use, while the implementation efficiency has not yet been clearly demonstrated.

In pursuit of compile-time analyzability and schedulability, the model-based approaches often resort to dataflow models with restricted expressiveness and flexibility. Moreover, they rely on unfamiliar languages and specialized development tools.

1.3.3 C Based Frameworks

An alternative to frameworks based on decidable dataflow models with limited expressiveness is to integrate dynamic dataflow programming structures into familiar languages, using a lightweight API with an associated runtime environment. Such an approach reduces the software development impact by allowing the tools for creating and debugging dataflow applications to be essentially the same as those for standard software: compilers, assemblers, debuggers, and cross-compilers. The vast majority of such APIs implement the KPN model because of the low effort required to transform a sequential reference algorithm into the KPN form.
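A minimal sketch of what such a lightweight C API might look like is given below: a statically sized FIFO channel and a process whose sequential body (here simply y = 2 * x) is wrapped between a channel read and a channel write. All names (`fifo_t`, `fifo_push`, `fifo_pop`, `scale_step`) are hypothetical and do not correspond to the API of any framework cited in this section. As a simplifying assumption, the step function returns 0 at the point where a KPN process would block, instead of suspending a thread.

```c
#include <assert.h>
#include <stddef.h>

#define FIFO_CAPACITY 8  /* fixed size, no dynamic allocation */

typedef struct {
    int    data[FIFO_CAPACITY];
    size_t head;   /* index of the oldest token */
    size_t count;  /* number of tokens currently stored */
} fifo_t;

/* Appends one token. Returns 1 on success, 0 if the channel is full. */
int fifo_push(fifo_t *f, int token)
{
    if (f->count == FIFO_CAPACITY) {
        return 0;
    }
    f->data[(f->head + f->count) % FIFO_CAPACITY] = token;
    f->count++;
    return 1;
}

/* Removes the oldest token. Returns 1 on success, 0 if empty. */
int fifo_pop(fifo_t *f, int *token)
{
    if (f->count == 0) {
        return 0;
    }
    *token = f->data[f->head];
    f->head = (f->head + 1) % FIFO_CAPACITY;
    f->count--;
    return 1;
}

/*
 * One step of a process: read a token, apply the sequential body,
 * write the result. Returns 1 if the step executed, 0 if it could not
 * proceed (no input token or no space on the output channel).
 */
int scale_step(fifo_t *in, fifo_t *out)
{
    int x;

    if (in->count == 0 || out->count == FIFO_CAPACITY) {
        return 0;
    }
    (void)fifo_pop(in, &x);
    (void)fifo_push(out, 2 * x);
    return 1;
}
```

The body of the process remains ordinary sequential C; only the points where data enters and leaves are replaced by channel operations, which reflects the low porting effort mentioned above.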

Many such implementations target large computing systems and rely on off-the-shelf OSes. For example, QUeuing And Runtime for Kernels (QUARK) [103], TIDeFlow [104], and OpenStream [105] have been developed in the context of High-Performance Computing (HPC) applications. YAPI [106] and Nornir [107] support the KPN execution model on workstation computers. XKaapi [108, 109] is a runtime system for scheduling dataflow programs on multi-processors and clusters of multi-processors. The work in [110] proposed a design flow allowing implementation of dataflow applications on a multi-GPU computer cluster. Intel Concurrent Collections [111] has been used for developing applications on large-scale heterogeneous platforms that include general-purpose CPUs, GPUs, custom processors, and FPGAs. These implementations come with heavy performance and memory footprint overheads. Such overheads are acceptable for applications running on large computers. In the embedded domain, however, we need a lightweight approach: the small memory and the high performance requirements preclude the use of a full OS, a kernel-level scheduler, and dynamic data structures.

The Software/Hardware Integration Medium (SHIM) [112] was initially developed as a design space exploration dataflow model for specifying, validating, and synthesizing heterogeneous embedded systems. It has later been turned into a language development effort centered around scheduling and static analysis for programming shared memory multiprocessors [113, 114]. SHIM relies on restricting the KPN semantics to help both programming and automated program analysis. SHIM implements a KPN restricted to support synchronous (rendezvous) communication. This choice eases scheduling, and guarantees that KPN programs are always executable in finite space because synchronous communication does not need buffering. The Tiny-Shim language is based on C (but is not a C subset) augmented with a few constructs for concurrency, communication, and exceptions. SHIM imposes many syntactic restrictions on the input language, which makes porting existing reference applications difficult. While it has been able to devise effective mechanisms for static scheduling and analysis (e.g., deadlock detection), the implementation relies on costly standard runtime support such as the POSIX Pthreads library.
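The observation that rendezvous communication needs no buffering can be sketched with a small single-threaded model. The channel below stores no tokens; it only records a pointer to the value held by a waiting sender, and the transfer happens at the moment the receiver arrives. The type and function names are hypothetical, and the cooperative (non-threaded) style is an assumption made to keep the example self-contained; it does not reflect how SHIM itself is implemented.

```c
#include <assert.h>
#include <stddef.h>

/* The channel holds no tokens, only a reference to a waiting sender. */
typedef struct {
    const int *offer;  /* sender's value while it waits, NULL otherwise */
} rendezvous_t;

/*
 * Sender side: announce that *value is ready to be transferred.
 * Returns 1 if the offer was registered, 0 if another offer is still
 * pending, in which case the sender has to retry later.
 */
int rv_send_begin(rendezvous_t *ch, const int *value)
{
    if (ch->offer != NULL) {
        return 0;
    }
    ch->offer = value;
    return 1;
}

/*
 * Sender side, to be called after a successful rv_send_begin:
 * returns 1 once the receiver has taken the value, so the sender may
 * continue, and 0 while the sender must keep waiting.
 */
int rv_send_done(const rendezvous_t *ch)
{
    return ch->offer == NULL;
}

/*
 * Receiver side: if a sender is waiting, copy its value directly into
 * *dst and release the sender. Returns 1 on transfer, 0 if no sender
 * is waiting.
 */
int rv_receive(rendezvous_t *ch, int *dst)
{
    if (ch->offer == NULL) {
        return 0;
    }
    *dst = *ch->offer;
    ch->offer = NULL;
    return 1;
}
```

Since the sender cannot proceed until its value has been taken, at most one value per channel is ever in flight, so the memory needed for communication is bounded by construction.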

The Distributed Operation Layer (DOL) [115, 116, 23] is also a design flow framework based on the KPN model of computation and targeted at real-time multimedia and signal processing applications. The DOL design flow follows the Y-chart approach [117], in which the application specification is platform-independent and needs to be explicitly mapped on a target architecture. DOL supports the Cell Broadband Engine [118], the tile-based MPSoC Atmel Diopsis 940 [119], the MPARM platform [120], and the Intel SCC many-core architecture [121]. In DOL, KPN processes are described in C/C++ based on a simple API, while the network is described using XML.

In document A Dataflow Framework For Developing Flexible Embedded Accelerators A Computer Vision Case Study. (Page 35-47)