Scheduling cost - Systematic Design Space Exploration of Dynamic Dataflow Programs for Multi-co

8.3 Scheduling

8.3.3 Scheduling cost

Since all scheduling policies (both I ASP and I P SP ) are applied to a dynamic execution, they are related to making some decisions at run-time. Hence, the number of conditions and constraints considered when making a decision impacts the program performance. Let cc(SX) be the scheduling cost related to a given configuration of S. It consists of two components:

c(SX) =P cc(SX)+c f (SX), where cc(SX) denotes the number of conditions checked and c f (SX) denotes the number of conditions failed. The conditions in this case are related to the input (availability of the input tokens) and output (availability of the spaces in output buffers). The value of cc(SX) is calculated as:

cc(SX) = P i P cce(SiX) ni mX (8.1)

Consequently, the value of c f (SX) is calculated as:

c f (SX) = P i P c fe(SiX) ni mX (8.2)

The value of cce(SX) (c fe(SX)) corresponds to the number of conditions checked (failed) which is elapsed between two consecutive successful firings, respectively. Let’s consider the following example. Figure 8.5 illustrates a set of actors in a partition. For the next firing in each actor, it is indicated from how many input buffers it reads the tokens and to how many output buffers it writes the tokens. The buffers marked with red indicate that the tokens/spaces are not available in these buffers. Assuming that the last successful firing took place for actor C and the scheduling policy is RR, as defined earlier, Table 8.5 presents the process of updating the values of cce(S_ix) and c fe(S_ix) when the scheduler makes the next attempt. In this case, actor C is the next one chosen for execution and the respective values of cce(Sx_i) and c fe(S_ix) are 8 and 2.

Figure 8.5 – Sample partition with the actors to be considered by the scheduler. Attempt actor c. checked c. failed cce(SX_i ) c fe(S_iX)

last firing cj

1 A 4 1 4 1

2 B 1 1 5 2

3 C 3 0 8 2

next firing c_{j +1}

Table 8.5 – Conditions numbers updates.

This procedure is continued for every firing in the partition i . The sum of the elapsed con- ditions checked/failed is divided by the number of firings is this partitionni. This value is a metric for a given partition i . In order to provide the metrics for the configuration SX in total, the values obtained for each partition are summed and divided by the number of partitions mX. As a result, the values of cc(SX) and c f (SX) express the numbers of conditions checked/failed during the execution, per firing, per partition.

The defined values should be considered as a lower bound on the number of conditions for two reasons. First, they consider a limited set of conditions (i .e., no guard conditions considered). Second, the update of the values of cce(S_ix) and c fe(S_ix) is performed only with certain attempts

of the scheduler. The cases when the scheduler works infinitely, i .e., when none of the actors in the partition can be executed and the scheduler keeps iterating over them without any success are difficult to model reasonably.

8.4 Conclusions

The heuristics described in this Chapter aim at design space exploration of dynamic dataflow programs. Each of them took one of the subproblems discussed in Chapter 6 and explored the design points in order to establish a high-quality solution. For each of the subproblems, that is, partitioning, scheduling and buffer dimensioning, several approaches of different complexity have been proposed. Naturally, they may lead to different qualities of the solutions and operate within different time requirements. In each case, the base for the heuristics are the rich performance metrics provided by a timed E T G.

In the first part, the partitioning problem was considered. Although it is a popular problem described in the literature, it can be observed that most of the general purpose approaches for graph partitioning are not applicable in this case, since they are not aware of the semantics at the edges of dataflow networks. Hence, specific heuristics must be developed instead. The partitioning algorithms introduced in this Chapter are grouped as greedy heuristics, descent local search methods and tabu search. Greedy heuristics usually provide a solution quickly, but only the search methods explore different variations of the solutions.

In the second part, the problem of finding a finite buffer size for the buffers in the network was considered. Analyzing the related work, it was observed that so far this problem has been tackled in a very limited way, that is, reducing the MoC to the models analyzable at compile-time or simplifying it to finding a deadlock-free configuration, disregarding the effect on the performance. In some works the impact on the throughput was considered, but not in the context of a multidimensional exploration. The buffer dimensioning heuristics introduced in this Chapter aim at finding a trade-off between the program performance and resource utilization, with regards to the specific partitioning and scheduling configurations. They can be applied directly to the optimization scenarios introduced in Section 7.6.

In the last part, the scheduling problem was considered. Different aspects of the problem differentiating it from the others were presented. An overview of related work discussed different approaches aimed at eliminating the scheduling problem or maximally reducing its impact. Since the dataflow applications are expected to have a dynamic and data-dependent behavior, it is only possible to define scheduling policies. Hence, several policies have been proposed. The run-time scheduling is subject to a cost related to establishing the schedule. Therefore, a figure of merit for the scheduling cost, relying on the number of conditions checked and failed throughout the execution was also introduced.

The process of design space exploration consists essentially of performing moves from one design point to another. If the moves are properly driven, the whole exploration procedure becomes much more efficient and the final design points can be of a higher quality. An im- portant role in driving the optimization heuristics is played by the performance estimation. First, it allows calculating the performance of a program on a given platform without having to physically execute the program on this platform. Second, it evaluates different design points in order to make a decision about whether to perform a certain move or not. Finally, if the per- formance estimation simulates the entire behavior of a program, it can allow extracting some execution properties to be used by the optimization algorithms or identify the most critical parts of the execution that should be considered for optimization. This Chapter presents a performance estimation tool, developed as a module within the Turnus co-design framework, serving as a basis for the V SS methodology described in Chapter 7 and the heuristics described in Chapter 8. It is one of the stages of the design flow discussed in Section 3.5, as illustrated in Figure 9.1.

9.1 Related work

An accurate performance estimation methodology is usually built upon two stages: appropri- ate modeling and its evaluation. The quality and the level of detail in the model determine the accuracy of the estimated results when referred to the actual execution. Usually, the most accurate results can be achieved when the model is very detailed. It implies, however, longer evaluation times. On the other hand, less detailed models, which can be evaluated in shorter times, provide a lower level of accuracy [209, 210, 211]. The performance estima- tion methodologies used in various dataflow-oriented DSE frameworks (if available) were presented earlier in Section 3.2. The overview provided in this Section aims at discussing more generally different approaches to performance estimation and prediction in the field of

Performance Estimation

performance metrics

Compiler Infrastructure

Profiling and Analysis

CAL

program

Architecture Constraints ETG generation Profiling Design Space Exploration Partitioning Scheduling Buffer dimensioning timed ETG

parallel programming and multi-core platforms, with an emphasis on some very recent works and the achievable level of accuracy.

The work discussed in [155] relies on the usage of abstract models of the target system and the structure of the program. They are represented as a finite state machine with some input and output events. The estimation is performed with regards to the execution time and the code size since a trade-off between these two properties is considered as the objective of DSE . It is emphasized that the main role of performance estimation is to reduce the exploration time of the considered design space. The achieved accuracy of estimation is around 20%.

Another approach, discussed in [212], employs analytical models in order to perform the DSE of pipelined MultiProcessor System-on-Chips (M P SoC ), where the architecture favors the im- plementation of applications characterized as repetitive executions of some sub-kernels [213]. The models are used for an estimation of the execution time, latency and throughput, avoiding slow full-system cycle-accurate simulations of all the design points by extracting the latencies of individual processors. The two described methodologies include a single simulation of all the processor configurations and, on the other hand, multiple simulations of a subset of processors. It is stated that the considered design space consists of ca. 1012to 1018design points, hence a complete exploration is infeasible, as it would be expected to take years to complete. Narrowing the set of design points reduces the simulation times to several hours with the estimation accuracy between 12.95% and 18.67% (maximum absolute error). A similar concept of analytical models is used in [214], where the estimation relies on defining several figures of merit for the considered properties. They are based on certain equations using metrics measured on the platform. In this work, the estimation is applied to a different problem, namely the slowdown caused by multiple applications running simultaneously.

One of the methods employed for the purpose of performance estimation is source code analysis using the concept of elementary operations. The work described in [215] performs the profiling of different sets of operations and uses this information for estimation in het- erogeneous M P SoC achieving an estimation error around 6%. Similarly, [216] describes a methodology for source code profiling at the level of intermediate representation, where dataflow graphs are used to capture the dependencies between different operations. Another possibility to create a structural representation of operators is to use UML activity diagrams so that each operator is decomposed into operational units of different granularity. The operations are modeled with regards to the latency, power and area to form a pre-characterized operators library. It is stated that shifting the estimation level from the code to the model allows a fast DSE in the early design steps.

The objective of DSE described in [217] is to find the most efficient target System-on-Chip (SoC ) for a given application. The core of the used performance estimation methodology is the classification and learning method called Regression Random Forest. Thanks to applying

this learning technique, the number of analyzed configurations is reduced by learning the relationships between different design parameters. Using machine learning techniques for the purpose of performance prediction has recently become a popular topic of research. For instance, [218] and [219] use Artificial Neural Networks (AN N ) in order to provide the mapping and/or scheduling heuristics with some reliable performance measures. In both cases, the

AN N -based performance prediction is used to drive the heuristics and results in remarkably

better-quality configurations. In [219], the process of retraining the artificial networks aims at analyzing the discrepancy between the real and predicted performance in order to apply an error-correction learning rule and hence, maximally reduce the prediction inaccuracy. Among the possibilities to reduce the time required for a single run of performance estimation, one opportunity is to define two models with different levels of detail and accuracy. The work described in [164] introduces two estimation models: an accurate, but time consuming model and a simpler one, enabling a short estimation with a higher discrepancy. The complexity of these models depends on the number of parameters which are considered in the estimation. The models are used interchangeably in order to search for various trade-offs related to design implementations in hardware and software. The term fidelity of estimation is introduced and corresponds to the percentage of correctly predicted comparisons between different design points.

Finally, an interesting approach for a cross-platform performance and power consumption estimation is described in [220]. It employs a learning algorithm that synthesizes analytical proxy models that predict the performance and power of the workload in each program phase from performance statistics obtained through hardware counter measurements on the host. The objective of the investigation is to verify if a few example runs on a slow, detailed simulator (commonly available to software developers) and the corresponding runs on real hardware can give an insight into the correlation between the two [221]. The learning approach based on this concept is considered to provide over 97% prediction accuracy.

In document Systematic Design Space Exploration of Dynamic Dataflow Programs for Multi-core Platforms (Page 153-160)