Exploration heuristics - Systematic Design Space Exploration of Dynamic Dataflow Programs for M

The design space exploration problem tackled by the frameworks is usually related to evalu- ation and exploration of different design alternatives. These design alternatives are related either to the parameters of the platform (i .e., when different types of architectural compo- nents are considered in the model) or to the configurations applied to a program when it is ported onto a platform (i .e., partitioning of the program components onto the processing elements). However, even when the exploration of available architectural options is tackled, it implies defining an assignment of the program components to the processing elements of the platform. In some cases, the exploration process is supported by various methods to establish a critical path of the design and/or to identify the bottlenecks of the execution. Different approaches to the DSE in terms of the formulation of the problem, the applied set of steps, along with the available heuristics are summarized for the chosen frameworks throughout this Section.

MAPS

The DSE flow consists of several phases related to mapping and scheduling. At each stage, various heuristics are available. In general, they put an emphasis on the advantages of light- weight heuristics over evolutionary methods, and aim at satisfying the specified constraints. The following stages are defined:

• Pre-Scheduling: this is devoted to a specification of finite buffer sizes. It is assumed that appropriate sizes must be found, so that no deadlocks occur. Such a configuration can be established using two heuristics; Simulated Execution relies on observing channel utilization for different inputs. Traffic ratio relies on allocating memory to every channel proportionally to the traffic on this channel;

• Scheduling: the supported policies are all data-driven. The first heuristic relies on the idea of computing a time slot, so that context switches before potential channel

Framework application handling architecture handling features additional Daedalus applications are mod-

eled using a C/C++ imperative specification and converted to KPN with KPNgen- tool [110]

multimedia M P SoC are considered as target architectures - DSE is per- formed with Sesame framework

MAPS applications are mod-

eled as execution traces, where dependencies between different firings are related to tokens ex- change and internal variables

simplified view of the target M P SoC (list of processing elements and communication primitives modeled by a cost function)

• exploration of different configuration options determining the efficiency of a program;

• performance estimation using trace

replay module: a

discrete-event simulator that schedules the segments of the execution trace (Chapter 6 of [111]);

• several heuristics for mapping, scheduling, synchronization and search algorithm; • composability analy-

sis for the purpose of executing simulta- neously multiple applications on a given platform without in- terference [82]; . integrated with High- Level Virtual Platform simulator (HVP) [112] Mescal - a micro-architecture description includes

the memory sub-

system based on

an architecture

description language

- -

Framework application handling architecture handling features additional Metropolis applications are mod-

eled as a set of processes that communi- cate through media

architectural build- ing blocks are rep- resented by performance models where the events are annotated with the costs of interest

• a third network corre- lates the two models by synchronizing the events between them; • non-deterministic

behavior and con-

straints can be

considered;

• possible transla- tion into a Petri net specification and interprocess communication removal; .

PeaCE 3 models provided:

computation tasks, control tasks and

a task model to

describe interactions

- - based on

Ptolemy

PREESM application rep-

resentation as parametrized and interfaced dataflow meta-model P i M M [113] high-level architecture description using system-level architecture model S − L AM

• based on the concept of algorithm architecture adequation matching methodol- ogy (A A A/M ) [114]; • a scenario determines

the set of parameters specifying the deploy- ment conditions and constraints;

Ptolemy tokens are used as un- derlying communication mechanism regu- lating how the actors fire - - DSE performed with ex- ternal tools (i .e., PeaCE)

Sesame application model

describing functional behavior

architecture model describing the available resources and their performance constraints

• trace-driven simulation;

• intermediate map-

ping layer for

scheduling and

event-refinement purposes; .

Framework application handling architecture handling features additional Space

Codesign

- - • 1st layer: application

specification and veri- fication; • 2nd layer: hardware/- software partitioning; • 3rd layer: emulation of a more sophisti- cated architecture model using a cycle- accurate simulation; .

SPADE an intermediate language for the compo- sition of parallel and distributed dataflow graphs a toolkit of type- generic, built-in stream processing operators supporting scalar and vectorized processing • a compiler automatically mapping applications into appropriately-sized execution units; • communication overhead minimiza- tion and parallelism exploitation; .

SynDEx algorithm graph architecture graph

and system con-

straints • algorithm architecture adequation matching methodol- ogy (A A A/M ); • distribution and scheduling of the algorithm manually or automatically with optimization heuristics, based on multi-periodic distributed real-time scheduling analyses; . - System CoDe- signer high-level model translated into behavioral SystemC

- • performance esti-

mation using per-

formance models

generated auto-

matically from the SystemC behavioral model;

• HW and SW synthesis; • DSE using state- of-the-art multi- objective algorithms; . HW synthesis delegated to a com- mercial tool: Forte Cynthe- sizer [115];

writes are avoided. The second heuristic considers the process importance, which can be implemented in multiple ways, including, for instance, topology, output rate, execution weight et c;

• Mapping phase: a static assignment of the processes to the processing elements is considered. The available strategies are: Computation balancing, Affinity, Output rate

balancing, Simulated mapping;

• Post-scheduling phase: this consists of making final adjustments to the schedule de- scriptors (i .e., fine tuning of buffer sizes).

PeaCE

The DSE is considered as a two-steps process. The communication architecture, including the memory system, is explored after the processing components are selected and the HW /SW partitioning decisions are made. Global feedback forms an iterative DSE loop. The loop is applied only to dataflow tasks that are computationally intensive, whereas the processing elements to execute control tasks are determined manually. An iteration of the co-synthesis loop (the first inner loop of the proposed flow) solves three subproblems: selecting appropriate processing elements, mapping function blocks to the selected processing elements, and evaluating the estimated performance or examining schedulability to check whether the given time constraints are met.

Since the considered design space of architectural solutions is very wide, it is traversed in an iterative fashion. First, a subspace of architecture candidates is explored quickly to build a reduced set of design points to be carefully examined. Within a given set of architecture candidates, all design points are visited by varying the priorities and conditions. Design points with a performance difference of less than 10% compared to the best one (the value of 10% comes from the accuracy of the used estimation method) are collected. The second step applies trace-driven simulation to the design points selected from the first step. It accurately evaluates the performances in the reduced space and determines the best point. The process continues as long as the iterations bring an improvement to the solution. The third step generates the next set of architecture candidates relying on the architecture of the best design point and applying small modifications to it. It results in a set of Pareto-optimal design points (that is, not dominating in terms of all optimization criteria). Although this heuristic does not explore the entire design space, its main objective is to prune the design space aggressively and arrive quickly to a high-quality solution.

During the process of simulation performed for different candidates, a performance profiler is used. The execution time and the number of executions of different tasks are recorded. From this information, the performance bottleneck of the implemented code is identified.

PREESM

The process of rapid prototyping consists of exploring the design space of a target system in order to minimize its cost and guarantee the respect of different constraints. The most common ones are: latency, throughput, memory, and energy consumption. Other constraints may exist, such as jitter or signal simultaneity.

The problems of mapping and scheduling are considered jointly by different optimization algorithms, ranging from simple to evolutionary methods. The problem of buffer dimensioning is related to a bounded memory execution guaranteeing a deadlock-free execution. Due to the supported MoC , it is possible to statically distribute the tasks. A static scheduling algorithm is usually described as a monolithic process carrying two distinct functionalities: choosing the core to execute a specific function and evaluating the set of the generated solutions. The implemented scheduling algorithms include:

• list scheduling: the tasks are scheduled in the order dictated by a list constructed from es- timating a critical path. Once made, a mapping choice is never modified. List scheduling is used as a starting point for other refinement algorithms;

• FAST algorithm: it is used as a refinement of the list scheduling solutions by utilizing probabilistic hops. It changes the mapping choices of randomly chosen tasks keeping the best latency found, until stopped by the user;

• genetic algorithm: is coded as a refinement of the FAST algorithm, since the n best solutions found by FAST are used as the base population for the genetic algorithm.

A necessity to analyze the impact of application and architecture bottlenecks on the system performance is also emphasized. The high-level architecture description facilitates studying the bottlenecks arising on the platform side.

Sesame

In the context of this framework, the mapping is related to an intermediate layer between the application and the architecture. Hence, defining a mapping is necessary in order to evaluate a candidate architecture. The mapping problem is defined as Multiprocessor Mappings of Process Networks (M M P N ), which is defined as: mi n f (x) = ( f1(x), f2(x), f3(x)), subject to

gi(x), i ∈ {1,...,n}, x ∈ Xf, where:

• f1is the maximum processing time;

• f3is the total cost of the system.

The functions giare the constraints and x ∈ Xf are the decision variables. They represent decisions, i .e., which processes are mapped onto which processors or which processors are used in a particular architecture. The constraints make sure that the decision variables are valid, i .e., result in a feasible solution. The optimization goal is to identify a set of solutions which are superior to all other solutions when all three objective functions are minimized. This is accomplished by a Strength Pareto Evolutionary Algorithm (SPEO) that finds a set

of approximated Pareto-optimal mapping solutions. Scheduling of the processes can be performed in a static, semi-static or dynamic manner. A mapping configuration contains also an assignment of buffers (with limited sizes) that are parameterized and dependent on the architecture. The assignment of buffers is performed in a safe way, that is, guaranteeing a deadlock-free execution. The performance numbers provided by the framework are intended to inspire the designer to improve the architecture, restructure/adapt the application, or modify the mapping of the application.

The output of system simulations in Sesame provides the designer with performance estimates of the system(s) under investigation together with statistical information such as utilization of architectural components (idle/busy times), the contention in a system (e.g ., network contention), profiling information (time spent in different executions), critical path analysis, and average bandwidth between architecture components. Such results allow an early eval- uation of different design choices, identifying trends in the systems’ behavior and revealing performance bottlenecks early in the design cycle.

SystemCoDesigner

The process of automatic DSE consists of finding optimal or near optimal solutions in terms of throughput, latency or required chip size by allocating processors, memories, buses, and hardware accelerators, and binding the actors and channels to the resources. As the first step, the particular instance of the system synthesis problem is formalized by providing a so-called architecture template specifying the architectural components and their interconnections. From this set, the automatic DSE has to select a subset in order to form an implementation. The target architecture allows for hardware only, software only, and mixed hardware/software designs of an application.

Next, a formal model serving as an input to the DSE is built. It consists of (a) the application, (b) the architecture template and (c) the mapping constraints. For each possible mapping, the execution times of an actor in a specific binding are annotated. Then, the exploration is performed using Multi-Objective Evolutionary Algorithms (MOE A) [116]. The problem addressed by MOE A can be stated as follows:

• The architecture is modeled as a graph representing possible interconnected hardware resources;

• The application is modeled as a graph describing the behavior of the system (vertices describe tasks, directed edges - dependencies). Data–dependent tasks have to be executed on the same or adjacent resources to ensure correct communication;

• The set of mapping edges indicates whether a specific task can be executed on a hardware resource;

• The allocation setα is the set of hardware resources;

• The bindingβ determines on which allocated resource each task is executed. For each task from the problem graph exactly one mapping has to be used.

Due to the data dependencies, a binding can be infeasible. A binding is called feasible if it guarantees that the data communications imposed by the problem graph can be established by the allocated resources. Furthermore, a feasible allocation is an allocationα that allows at least one feasible bindingβ. The task of design space exploration is formulated as a multi–objective optimization problem: minimizef (α,β), subject to: (1) α is a feasible

allocation, (2)β is a feasible binding. Unlike for the case of a single-objective optimization problems, in multi–objective optimization problems, the feasible set is only partially ordered and, thus, there is generally not only one global optimum, but a set of Pareto–optimal solutions. The MOE A does not make any assumption about the objective function.

In document Systematic Design Space Exploration of Dynamic Dataflow Programs for Multi-core Platforms (Page 67-74)