Synthesis of Parallel Specifications - Programming heterogeneous MPSoCs : tool flows to close t

parallelization does not restrict itself to loops and is aware of the entire call hierarchy. The parallelization framework supports heterogeneous platforms, with different means of sequential performance estimation as discussed in Section 2.2.2. Last but not least, the new backend added to MAPS within this thesis, makes the solution complete.

3.3 Synthesis of Parallel Specifications

The alternative to sequential programming is parallel programming, in which an application is expressed as a set of parallel tasks. A myriad of parallel programming languages have been proposed along the years (see [274]). This section starts by providing back- ground on the main parallel languages used in the general-purpose domain. The section then focuses on abstract parallel programming models used in the embedded domain.

3.3.1 General-Purpose Parallel Programming Models

There are manifold general-purpose parallel programming models. Among the most used, one finds models that are built on top of sequential languages like C by providing libraries (e.g., Pthreads [227], MPI [225]) or language extensions (e.g., OpenMP [247]). Models are typically classified by the way tasks or processes interact, with the most common ways being shared memory (e.g., Pthreads or OpenMP) and distributed memory (e.g., MPI).

Pthreads and MPI are the most traditional programming models. In the former, the programmer is responsible for synchronizing the threads to prevent data corruption in shared memory regions. This is known to be an error-prone task [154]. In the latter, the programmer does not have to care for synchronization but for explicitly distributing the data among the processors. This is a time consuming task that requires a very good under- standing of the application. Although MPI has its roots in HPC for regular computations, it is also used in desktop computing and even in embedded systems. In both MPI and Pthreads programming models, the user has to specify the task mapping explicitly. The work of the parallel compiler is therefore simplified. In contrast to MPI and Pthreads, OpenMP eases programming by allowing the programmer to implicitly define tasks, task mapping and even synchronization. It is the job of the compiler and the runtime system to insert communication and synchronization barriers as well as to map tasks to processors. However, if the parallelization directive entered by the user is not correct, wrong code will be generated. Tracking down these wrong assumptions is a cumbersome task.

From the late 2000s, programming models that were initially dedicated for graphic processing in GPUs started to be used for general-purpose applications (see GPGPU [195]). In this area, two prominent programming models are worth mentioning, namely the Open Computing Language (OpenCL) [134] and Nvidia’s Compute Unified Device Archi-

tecture (CUDA) [192]. OpenCL strives at portability and may therefore be outperformed

by CUDA, which has a stronger compiler support. Both programming models are tar- geted for the large arrays of processing elements that characterize GPUs, usually with an identical flow of control. For this reason, they are not in the focus of this thesis.

General-purpose parallel languages have not gained much acceptance in deeply em- bedded systems. They are neither safe nor thin, i.e., their implementations typically in- cur in a high runtime overhead. The safeness issue has also motivated research in new programming models in the desktop and HPC communities, e.g., Transactional Memory

50 Chapter 3. Related Work (TM) [2] and Thread Level Speculation (TLS) [141]. However, these models require hardware support in order to attain good performance. The area and power overhead added by this extra hardware may become a showstopper for a later adoption in the embedded domain.

3.3.2 Embedded Parallel Programming

In contrast to the application-centric or programmatic approach followed in the general- purpose domain, programming models in the embedded domain are either, hardware-

centric or formalism-centric [11]. Hardware-centric approaches strive for efficiency and

require expert programmers. They expose architectural features in the language, e.g., memory and register banks. SoC companies typically offer hardware-centric programming models. In academia, in turn, formalism-centric models are more widespread, with the intention of producing “correct by construction” code.

Initial research on formal parallel programming was based on theoretical models of concurrent systems, e.g., process calculi. A prominent example of such models are Hoare’s

Communicating Sequential Processes (CSP) [105, 106]. These works mainly focused on ren- dezvous or synchronous communication, in which a process or task that produces a data

item would wait until the item has been read by the consumer. Later, more interaction models were analyzed, described and formalized, making it possible to reason about concurrent processes with asynchronous communication,e.g., in [157].

In the field of digital signal processing, graphical programming models became quite popular, since they provide algorithm designers with a natural way of specifying an application. Commercial examples include MATLAB Simulink [172], National Instruments LabVIEW [189] as well as Synopsys Signal Processing Worksystem (SPW) [240] and Sys- tem Studio [241]. Dataflow semantics (introduced in Section 2.5.1) are a common un- derpinning of most graphical programming models. Several publications addressed the problems of scheduling, buffer sizing and deadlock-free execution for different types of dataflows. Pioneering works on SDF graphs, BDF and CSDF were published by Lee et al. in [155], in [153] and Bilsen et al. in [26] respectively. Synthesis of SDF graphs for DSPs can be found in [24, 226]. A mature tool set for generating and analyzing SDF, CSDF and so-called Scenario-Aware Dataflow (SADF) can be found in the SDF3 project [73].

As mentioned in Section 2.5.1.4, static models are restricted in the kinds of applications they can express. Therefore, with embedded software becoming more complex, the analyzability-expressiveness tradeoff moved towards more expressive models. Two main lines of recent related work relevant to this thesis can be distinguished; (1) works that extend static dataflow with dynamic behavior, and (2) works that directly address PN applications. They are discussed in the following.

3.3.2.1 From Static to Dynamic Models

Several authors have proposed extensions to static models that retain some of the original formal properties. Bhattacharyya et al. [21] presented extensions towards dynamic, multi- dimensional (MDSDF) and windowed dataflow models. They derived balance equations, similar to those of SDF, for multi-dimensional and windowed (C)SDFs.

For more general models, synthesizing a static schedule becomes more difficult. In order to allow synthesis of DDF graphs, authors normally describe or automatically find clusters in the graph that feature static behavior. Plishker et al. [203] proposed a schedul- ing algorithm for a special kind of DDF, called Core Functional Dataflow (CFDF) [204]. The

3.3. Synthesis of Parallel Specifications 51 application is specified by using the functional Dataflow Interchange Format (DIF) [204], which allows to mix several kinds of dataflow actors, e.g., SDF, CSDF and BDF. This format is complex and is therefore not likely to become widely accepted in industry. In their approach, they decompose the graph and its actor modes into subgraphs that can be scheduled statically, leveraging existing techniques. The switch between static portions of the application happens dynamically at runtime. With this strategy, the authors obtain a reduced makespan and buffer sizes as compared with a dynamic Round-Robin (RR) scheduler. Results were obtained on host machines [203, 221] and not in target DSP sys- tems. Falk et al. [77] proposed a clustering algorithm with a similar goal. After clustering, they synthesize the QSS as a Finite State Machine (FSM) that can be later used to per- form dynamic scheduling at runtime. In [78], Falk et al. improved the clustering technique by using a rule-based approach. Rules allow to describe the clusters implicitly, without having to enumerate them. They present a considerable improvement in the application execution time and throughput. The algorithm is tested on an FPGA, where a MicroBlaze processor is used to execute every actor, and communication links are synthesized directly in hardware [77]. Therefore, neither actor nor communication mapping are addressed.

Clustering approaches can sometimes reduce the parallelism available in the initial specification and may introduce deadlocks. Wiggers et al. [270–272] follow a different approach for scheduling DDF graphs. Instead of searching for static clusters within a DDF graph, they restrict the dynamic nature of the input graph from the specification itself. They propose several incremental extensions to the SDF model, with the more expressive being Variable-rate Phased Dataflow (VPDF) [271]. A VPDF model allows to represent actors with different execution phases, each with a variable number of tokens. The analysis is embedded in the Omphale tool [25], which uses the sequential Omphale

Input Language (OIL) for application specification. By using a sequential language, the

user is relieved from the duty of writing directly the data flow graph. For their different models, they devised a strategy that determines the buffer capacities required to meet a throughput constraint. Their strategy starts by computing lower and upper linear bounds on a static schedule. These bounds are used to determine the starting time of the first firing of the actors through a constrained linear optimization. Thereafter, the buffer sizes are computed. The implementation is scheduled via Time Division Multiplexing (TDM) on a cycle-true simulator of a two-ARM7 platform.

3.3.2.2 Dynamic Models

Other authors directly address dynamic models without making any assumption on the application communication patterns. The rather theoretical works on KPN of Parks [197], Basten et al. [17] as well as Geilen and Basten [86] focused on devising a dynamic schedul- ing policy and buffer resizing strategy to ensure a non terminating execution on bounded memory. The initial resizing strategy of Parks consisted in computing arbitrary buffer sizes at design time. At runtime, once a deadlock is reached, the sizes of all the buffers are incremented. Parks proved that if the program can run infinitely and in bounded memory, it will do so when using his technique. Parks then refined his strategy by increasing the size of each full channel at a time, instead of increasing it for all channels. In this way, a similar schedule is obtainable with less memory consumption. Basten et al. [17] improved Parks technique with a method to determine exactly which channel buffer has to be in- creased. These works considered neither process mapping nor application constraints. A later work by Cheung et al. [50] analyzed the tradeoff between buffer sizes and application

52 Chapter 3. Related Work makespan. Instead of attempting to find the buffer sizes that ensure a non terminating execution, they modify the buffer sizes in order to achieve a desired application runtime.

Nikolov [191] presented the Daedalus framework for whole system synthesis from a PN specification. This work, however, addresses a more restricted PN model, namely Polyhe-

dral Process Networks (PPN). PPNs can be automatically derived from SANLPs (see Sec-

tion 3.2.2). For these special process networks, it is possible to compute a static schedule. Daedalus accepts mappings generated by the Sesame tool [74,202] which uses evolutionary algorithms to compute an application mapping. Additionally, Daedalus provides means for design space exploration and hardware synthesis. Merging and splitting processes in a PPN application for improving throughput have been presented in [175, 176].

Haubelt and Keinert et al. in [102, 130] presented the SystemCoDesigner framework for full SoC synthesis. SystemCoDesigner is a library-based approach, based on Sys- temC [110], that allows to represent various MoCs. It includes means for high-level performance estimation to accelerate the optimization process, which uses evolutionary algorithms. SystemCoDesigner and Daedalus differ from the approach in this thesis, since they synthesize hardware rather than attempting to map to an existing platform.

Edwards et al. [70] proposed the so-called Software/Hardware Integration Medium (SHIM) for KPN applications with rendezvous communication. In their original publication they proposed a dynamic scheduler and addressed neither mapping nor timing constraints. In later publications they present backends for homogeneous (using Pthreads) and heterogeneous architectures, e.g., for the Cell Broadband Engine [201], in [262].

Thiele et al. proposed the Distributed Operation Layer (DOL) in [249] for mapping PN applications onto heterogeneous platforms. In order to compute a mapping, they use evolutionary algorithms as well. The performance estimation of the parallel execution can be obtained by using a simulator or by using the Modular Performance Analysis (MPA)1 framework [48, 97, 109]. MPA implements a formal compositional performance analysis, based on so-called service curves from real-time calculus. By using these methods, bounds on metrics such as throughput and platform utilization can be computed. However, it does not support general KPNs and, due to the formal analysis, might report pessimistic results. Besides, it is not clear if the formal analysis can be applied to arbitrary architectures and schedulers. A trace-based performance estimation approach, similar to the one used in this thesis, is also presented in [108].

In a later publication, Geilen et al. [87] introduced an analysis framework based on timed actor interfaces. This framework unifies (C)SDF models, automata and service curves. The tracing mechanism and the time-checkpoints proposed for KPN applications in this thesis fit in the more general considerations formalized in [87].

3.3.3 MAPS in Perspective

Synthesis approaches for dynamic dataflows and process networks are summarized in Ta- ble 3.3. The table shows the main features of the approaches mentioned in the previous section along with MAPS. The second column in the table contains the MoC treated by each of the authors. When possible, the input language is also included. Three types of application specification can be identified: domain-specific languages (OIL, DIF, Tiny- SHIM [70]), C/C++ extensions (MAPS CPN) and pure C along with a so-called coordination

language [88], like C and XML in the DOL framework [249]. Coordination languages are 1_{Not to confuse with IMEC’s MPA (MPSoC Parallelization Assist)}

In document Programming heterogeneous MPSoCs : tool flows to close the software productivity gap (Page 59-63)