5.6 Evaluation
5.6.4 Average performance
We have also evaluated WaW+WaP against a regular wNoC in terms of average performance. Results show that WaW+WaP incurs negligible average performance degradation (less than 1%) for both single-threaded and parallel applications. The degradation originates in the overhead introduced by packetization, which is small because it only affects packets with more than one flit.
5.7 Related Work
Customized NoCs for real-time systems, such as Time Division Multiple Access (TDMA)-based or time-triggered ones, will face difficulties in being adopted by the real-time industry [73], since their implementation incurs high non-recurring costs (see Section 4.8). This is the case for [93,95,96,
In best-effort wNoCs, the use of virtual channel prioritization has been proposed as an effective way to provide tight latency bounds [97] and [109]. The same logic applies to [110], where the authors provide bandwidth guarantees for Guaranteed Service (GS) connections per port. However, the provided guarantees require detailed knowledge of the applications/tasks that will run in the final system and thus fail to satisfy incremental verification requirements.
In [3, 80] the authors provide realistic bounds for wNoCs without using flit-level virtual channel preemption. The model in [3] requires knowing all communication flows integrated in the system to derive safe upper bounds, so those bounds are not time-composable. Interference-free designs based on wormhole NoCs have recently been proposed in [111] and [112]. Although [111] shows lower best-effort traffic degradation than [112] by smartly multiplexing virtual channels, the degradation of best-effort traffic performance remains significant.
We follow a different approach to fulfill hard real-time requirements: we derive time-composable WCTT bounds in wNoCs without sacrificing average performance. Further, we address the scalability problems of latency bounds in wNoCs by proposing a mesh design that significantly improves default mesh WCTT values with low hardware complexity.
5.8 Conclusions
The use of wormhole-based NoCs in the context of CRTES applications complicates timing analysis, making the WCET estimates of those applications increase rapidly with the network size. The latency bounds achieved by our design, in contrast, are scalable. Our proposal enables a fair sharing of the available bandwidth across the different flows in the network. This makes time-composable WCET estimates less affected by the core count in manycore systems (objective O3). Our results with benchmarks and a real application (objective O4) confirm that the proposed mesh achieves tight, uniform and scalable WCET values with negligible average performance degradation. Furthermore, the hardware modifications required for the proposed design w.r.t. regular mesh designs are few, easing its adoption (objective O1).
Part III
Software Support for Exploiting Manycore Potential – Scheduling
Chapter 6
Intra-GRP Scheduling Strategy for Parallelization of Complex Automotive Applications
This chapter tackles the improvement of guaranteed performance for complex legacy applications by parallelizing them and allocating them to the many-core processor design described in Chapter 3. We focus on control applications from the automotive domain: they were built with single-core architectures in mind, and their complexity makes them good candidates for parallelization. For these applications, minimizing the parallelization effort and avoiding re-validation are imperative.
6.1 Introduction
Modern road vehicles carry up to 100 single-core Electronic Control Units (ECUs) performing various functions, from opening a window to controlling the engine. This leads the automotive industry to pay special attention to minimizing Size, Weight, and Power (SWaP) costs while increasing the services delivered per ECU. Multi- and many-core processor architectures, which are nowadays a reality in other embedded domains [4,19,113], are considered a promising solution to cope with such performance and cost constraints.
Many-core ECUs aim at providing the performance required to run a high number of complex functions by:
• Integrating distributed applications into a single ECU;
• Parallelizing the computation of complex systems, such as the Engine Management System (EMS) or Advanced Driver Assistance System (ADAS);
• Combining both approaches.
Figure 6.1: Inter-runnable dependencies existing among three of the twelve tasks that compose the EMS (tasks 1, 4 and 8 ms). Nodes represent runnables and lines the dependencies among them.
We focus on the parallelization of complex applications, i.e. improving the performance of an automotive legacy application by effectively parallelizing it over several cores (Objective O2) of a single Guaranteed Resource Partition (GRP), considering the EMS as a case study (Objective O4). This problem is particularly relevant because a significant part of automotive software is composed of legacy code (Goal 3).
Automotive applications increasingly rely on the AUTomotive Open System ARchitecture (AUTOSAR) [24], a standardized system software architecture upon which applications are built and executed. In AUTOSAR, applications comprise a set of functions, named runnables, that are either executed periodically or triggered by an interrupt. When developing an AUTOSAR application, runnables are grouped into AUTOSAR tasks¹, which are the Unit of Scheduling (UoS) of the AUTOSAR Operating System (AR-OS). The runnable-to-task mapping and the single-core task scheduling of an application are known as the application configuration, which is static and known at system integration time. Developing an application configuration is costly because its functional and timing correctness must be validated [29], so it is done infrequently: only once for most applications, with exceptions such as Formula 1 engine control applications, which have several application configurations.
The current strategy of using tasks as UoS works well on applications running on single-core ECUs, because it facilitates scheduling runnables with the same timing properties by grouping runnables with the same release period or interrupt into the same task.
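The grouping described above can be illustrated with a minimal Python sketch. The runnable names, the periods and the `group_by_activation` helper are hypothetical and chosen only for illustration; they are not taken from an actual AUTOSAR configuration.

```python
from collections import defaultdict

# Hypothetical runnables as (name, activation) pairs, where the activation
# is a release period in milliseconds (an interrupt name would work as well).
runnables = [
    ("r1", 1), ("r2", 4), ("r3", 1), ("r4", 1),
    ("r5", 4), ("r6", 4), ("r7", 5),
]

def group_by_activation(runnables):
    """Group runnables that share an activation into the same task.

    The listing order of the input is kept inside each task, standing in
    for the execution order of the runnables within that task.
    """
    tasks = defaultdict(list)
    for name, activation in runnables:
        tasks[activation].append(name)
    return dict(tasks)

tasks = group_by_activation(runnables)
for period, members in sorted(tasks.items()):
    print(f"task {period}ms: {members}")
```

Each resulting task has a single timing property (its period), which is what makes the task a convenient unit of scheduling on a single core.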
A single GRP is equivalent to a multi-core ECU in terms of scheduling. Current approaches targeting multi-core ECUs also consider tasks as the UoS [114][115], allocating them to different cores. To do so, all dependent runnables are grouped into a single task, minimizing or even removing all inter-task communication, so that the different tasks can be scheduled independently on the processor cores. This approach, which is in line with the latest AUTOSAR guide for developing and configuring AUTOSAR-compliant software for multi-core systems [24], works well for integrating multiple applications into a single ECU or for parallelizing applications with little inter-runnable communication. However, using tasks as the UoS on many-core processors to extract parallelism from applications with highly connected runnables is inefficient, as most runnables are allocated to a single task and thus executed sequentially on one core. This is the case for the EMS application, in which almost all runnables depend on each other. Figure 6.1 provides an intuition of the level of communication existing in the EMS, showing the inter-runnable dependencies of three of the twelve tasks that compose the EMS (concretely, time-triggered tasks with periods of 1, 4 and 8 ms; see Section 6.4.1 for further details).
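To make the limitation of task-level allocation concrete, the following sketch groups dependent runnables into tasks by computing the connected components of the dependency graph with a union-find structure. This is only an assumed reading of "all dependent runnables are grouped into a single task"; the function name and the two dependency sets are hypothetical.

```python
def group_dependent_runnables(runnables, dependencies):
    """Return groups of runnables connected by dependencies (largest first)."""
    parent = {r: r for r in runnables}

    def find(r):
        # Follow parent links to the group representative (path halving).
        while parent[r] != r:
            parent[r] = parent[parent[r]]
            r = parent[r]
        return r

    for a, b in dependencies:
        parent[find(a)] = find(b)  # merge the two groups

    groups = {}
    for r in runnables:
        groups.setdefault(find(r), []).append(r)
    return sorted(groups.values(), key=len, reverse=True)

runnables = ["r1", "r2", "r3", "r4", "r5", "r6"]
# Hypothetical application with little inter-runnable communication.
sparse = [("r1", "r2"), ("r3", "r4")]
# Hypothetical highly connected application (every runnable is reachable).
dense = [("r1", "r2"), ("r2", "r3"), ("r3", "r4"), ("r4", "r5"), ("r5", "r6")]

sparse_groups = group_dependent_runnables(runnables, sparse)
dense_groups = group_dependent_runnables(runnables, dense)
print(len(sparse_groups), "independent tasks for the sparse graph")
print(len(dense_groups), "independent task for the dense graph")
```

With the sparse graph, four groups can be placed on different cores. With the dense graph, everything collapses into one group, so task-level allocation leaves all but one core idle.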
Moreover, current approaches require changing the runnable-to-task mapping and/or the single-core task scheduling to execute tasks in parallel as a means to improve application performance. This, in turn, implies changing the application configuration, resulting in extra effort to verify and validate the new configuration [116]. The reason is that the sequential execution model of tasks abstracts, and may hide, mutual exclusion constraints when accessing shared resources, critical sections, etc. The parallel execution of tasks can then break the mutual exclusion relations present in applications configured for execution on single-core processors [116].
In this chapter, we exploit the performance opportunities of multi-core ECUs through a new allocation strategy in which legacy automotive applications (objective O2) with highly connected runnables are parallelized while maintaining the single-core application configuration (Goal 3). We present RunPar, a new allocation algorithm that considers runnables, and not tasks, as the UoS. RunPar assigns runnables of the same task to different cores respecting inter-runnable dependencies, and forces tasks to execute sequentially following the task ordering of the application’s single-core task scheduling. To do so, RunPar does not allow runnables from different tasks to be executed in parallel. This approach significantly improves on state-of-the-art techniques, under which runnables cannot be executed in parallel.
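The idea of using runnables as the UoS while keeping tasks sequential can be sketched as follows. This is a simple greedy list scheduler written for illustration, not the RunPar algorithm itself; the function names, execution times and dependencies are assumptions. Runnables are taken in their single-core order, which is assumed to be consistent with the dependencies.

```python
def schedule_task(runnables, wcet, deps, n_cores, start_time=0):
    """Greedily place one task's runnables on n_cores.

    runnables: names in single-core execution order (assumed topological)
    wcet:      dict name -> execution time (hypothetical values)
    deps:      dict name -> predecessors within the same task
    Returns (entries, finish_time) with entries = [(name, core, start, end)].
    """
    core_free = [start_time] * n_cores  # earliest free time of each core
    end_of = {}
    entries = []
    for r in runnables:
        # A runnable is ready once all of its predecessors have finished.
        ready = max((end_of[p] for p in deps.get(r, [])), default=start_time)
        # Choose the core on which the runnable can start earliest.
        core = min(range(n_cores), key=lambda c: max(core_free[c], ready))
        start = max(core_free[core], ready)
        end = start + wcet[r]
        core_free[core] = end
        end_of[r] = end
        entries.append((r, core, start, end))
    return entries, max(core_free)

def schedule_tasks_in_order(task_order, wcet, deps, n_cores):
    """Schedule tasks one after another; tasks never overlap in time."""
    time, table = 0, []
    for task_name, runnables in task_order:
        entries, time = schedule_task(runnables, wcet, deps, n_cores, time)
        table.extend((task_name,) + entry for entry in entries)
    return table, time

# Hypothetical example: two tasks in their single-core order.
task_order = [("task_1ms", ["r1", "r3", "r4"]), ("task_4ms", ["r2", "r5"])]
wcet = {"r1": 2, "r3": 3, "r4": 1, "r2": 2, "r5": 2}
deps = {"r4": ["r1"], "r5": ["r2"]}

table, makespan = schedule_tasks_in_order(task_order, wcet, deps, n_cores=2)
_, sequential = schedule_tasks_in_order(task_order, wcet, deps, n_cores=1)
print(makespan, sequential)
```

In this example the first task benefits from the second core, whereas the second task is a dependency chain and gains nothing. The task order and the composition of each task are the same in the one-core and the two-core schedule.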
This runnable scheduling strategy, i.e. the allocation of tasks, guarantees that the composition of tasks and the order in which they are executed remain the same in the single-core and in the multi-core ECU. Therefore, the same functional behavior is guaranteed on both platforms. We evaluate the benefits of RunPar on an EMS, a real automotive application (Objective O4) that controls the injection time and amount of fuel in a diesel engine and is composed of more than one thousand highly connected runnables grouped into twelve tasks (see Section 2.2.2). Our results confirm that RunPar effectively increases the performance of EMS tasks, providing an increase in Central Processing Unit (CPU) capacity of 31% and 42% for the two-core and four-core ECU, respectively. This extra capacity can then be exploited for executing new application functionality or other automotive applications, which ultimately contributes to allocating more functionality per ECU, reducing size, weight and power costs.
We consider RunPar a necessary step towards porting current legacy software to many-cores, exploiting the many-core performance potential while containing verification effort (Objective O5). The use of runnables as the UoS implies minimal modifications at AR-OS level: the scheduling tables used in the AR-OS to execute tasks are extended to incorporate the core, the order and the time in which runnables are executed, so that inter-runnable dependencies are respected. How to better exploit multi-core ECU and GRP capabilities for new AUTOSAR applications, minimizing inter-runnable communication and so increasing parallelism, is a challenging problem that is out of the scope of this thesis and part of our future work.
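A minimal sketch of what such an extended scheduling table could look like is given below. The `TableEntry` fields and the two checking functions are hypothetical and do not reproduce the actual AR-OS table format; they only show that storing the core, the order and the start time per runnable is enough to check that dependencies are respected.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TableEntry:
    task: str       # task the runnable belongs to
    runnable: str
    core: int       # core on which the runnable executes
    order: int      # position of the runnable on that core
    start: int      # start offset, in arbitrary time units
    duration: int   # execution time budget (hypothetical values below)

def respects_dependencies(table, deps):
    """True if every runnable starts after all of its predecessors end."""
    by_name = {e.runnable: e for e in table}
    return all(
        by_name[p].start + by_name[p].duration <= by_name[succ].start
        for succ, preds in deps.items()
        for p in preds
    )

def cores_do_not_overlap(table):
    """True if, on each core, entries taken by order never overlap in time."""
    per_core = {}
    for e in table:
        per_core.setdefault(e.core, []).append(e)
    for entries in per_core.values():
        entries.sort(key=lambda e: e.order)
        for prev, nxt in zip(entries, entries[1:]):
            if prev.start + prev.duration > nxt.start:
                return False
    return True

table = [
    TableEntry("task_1ms", "r1", core=0, order=0, start=0, duration=2),
    TableEntry("task_1ms", "r3", core=1, order=0, start=0, duration=3),
    TableEntry("task_1ms", "r4", core=0, order=1, start=2, duration=1),
]
deps = {"r4": ["r1"]}  # hypothetical: r4 consumes data produced by r1

table_ok = respects_dependencies(table, deps) and cores_do_not_overlap(table)
print(table_ok)
```

Because the table is static, checks of this kind can be run offline at integration time rather than by the operating system at run time.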
Figure 6.2: Part of the runnable flow-graph of an automotive application composed of 3 SWC, 7 runnables and 3 tasks executed in a single-core processor. (a) Structure of the application; (b) application configuration from an AR-OS point of view; (c) a possible single-core task scheduling of the three tasks.