Bounding Communication Latency - Scratchpad Memory Management For Multicore Real-Time Embedded

As discussed in the previous sections, tasks communicate asynchronously without any precedence constraint between them. In other words, when a job of a higher-priority task τ1 is ready, it will be scheduled and start loading as soon as there is a free partition regardless of any other running task that sends data to it. Since the access to main memory is serialized using DMA, integrity of the communication data is assured as there will be no data race condition (lock-less data sharing). Suppose τ1 is a receiver task for data sent by τ2. Then τ1 will access the previous (old) communication data from τ2 if τ1 is loaded while τ2 is still running or not yet unloaded from the SPM partition to main memory.

From a schedulability point of view, this communication model does not affect task scheduling. However, we still need to bound the worst-case end-to-end communication latency. As an example, we show how to determine the communication latency under the same DMA scheduling and response time model used in Section 3.3; as a reminder, this means that we consider fixed-size DMA operations, and the computed response time of a task is the time that elapsed from when a task becomes ready (released) to the time it finishes and is fully unloaded. Therefore, communication data sent by a sender task τi will be available to a receiver task after Ri, which is the worst-case response time of τi, accounting for interference and overheads.

As mentioned earlier, we consider sets of communicating tasks; each set, which we call a flow λk, is a chain of n tasks: λk = {τ1, τ2, ..., τn}, such that a task τi in the flow receives data from τi−1 and sends data to τi+1. In other words, the communication data in the flow passes through all the tasks in the flow in sequence; consequently, the end-to-end communication latency (Lλk) is the time from when the data is consumed (loaded) by the first task in the flow, i.e., τ1, until the last task in the flow, i.e., τn, writes the data to main memory (unload). The end-to-end latency of a flow λk is computed as in Equation 6.1.

\[
L_{\lambda_k} =
\begin{cases}
R_1 + \sum_{i=2}^{n} \left( T_i - 2 \cdot \sigma + R_i \right) & \text{Different Cores} \\
R_1 + \sum_{i=2}^{n} \left( T_i - (2 + m - 1) \cdot \sigma + R_i \right) & \text{Same Core}
\end{cases}
\tag{6.1}
\]

Theorem 6.1. The worst-case total end-to-end latency of a communication flow λk is \( L_{\lambda_k} = R_1 + \sum_{i=2}^{n} \left( T_i - 2 \cdot \sigma + R_i \right) \).

Proof. As shown in Figure 6.1, the worst-case alignment (critical instant) between any two different jobs that communicate is when the receiving job (τ2) starts loading right before the unload phase of the producer job (τ1). This behavior prevents τ2 from loading the fresh communication data until its next invocation. In addition, in the worst case, τ2 starts loading right after its release, which maximizes latency by minimizing the overlapping region (see Figure 6.1) between τ2 and τ1. Note that, in our system, this only happens if the two tasks are on different cores. If they are on the same core, that core has to wait for (m − 1) TDMA slots to perform the unload of τ1, which increases the overlapping region between the two tasks and thus reduces the latency.

Therefore, to compute the communication latency between τ1 and τ2, we add one period and the worst-case response time (as detailed in the previous section) of τ2, and subtract the overlapping region between τ1 and τ2. In the worst case, the minimum overlapping region always consists of the loading slot of τ2 and the unloading slot of τ1, which together amount to 2 · σ. To compute the total end-to-end latency of a chain of tasks in a flow, we simply apply the same method to each pair of successive tasks in the flow. For the first task in the flow (τ1), we only consider the response time.
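The computation described in the proof can be sketched in a few lines of Python. This is only an illustration of Equation 6.1 under stated assumptions: the `Task` class, the function name `flow_latency`, and the numeric values are hypothetical, σ is taken to be the length of one DMA slot, and m is read as the number of slots in the TDMA round (one per core), which is how the (m − 1) term appears to be used above.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Task:
    """Hypothetical container for the two per-task quantities in Equation 6.1."""
    period: int          # T_i
    response_time: int   # R_i, worst-case response time (assumed precomputed)


def flow_latency(flow: Sequence[Task], sigma: int, m: int, same_core: bool) -> int:
    """Worst-case end-to-end latency of a flow, following Equation 6.1."""
    if not flow:
        raise ValueError("a flow needs at least one task")
    # Minimum overlap between successive tasks, counted in DMA slots:
    # 2 slots on different cores, (2 + m - 1) slots on the same core.
    overlap_slots = (2 + m - 1) if same_core else 2
    # The first task contributes only its response time.
    latency = flow[0].response_time
    # Every later task adds one period plus its response time, minus the overlap.
    for task in flow[1:]:
        latency += task.period - overlap_slots * sigma + task.response_time
    return latency


# Illustrative values only (not taken from the evaluation).
flow = [Task(period=20, response_time=12),
        Task(period=30, response_time=15),
        Task(period=40, response_time=18)]

print(flow_latency(flow, sigma=1, m=2, same_core=False))  # 12 + 43 + 56 = 111
print(flow_latency(flow, sigma=1, m=2, same_core=True))   # 12 + 42 + 55 = 109
```

With these made-up numbers, the same-core bound is smaller by (m − 1) · σ for each task after the first, which matches the larger overlapping region discussed in the proof.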

Figure 6.1: Worst-case communication latency between two tasks. (Timeline of τ1 and τ2 over time units 0 to 24, with the latency interval marked.)

6.4 Evaluation

We evaluate the communication latency by generating random task sets as discussed in Section 4.2.3. From a generated task set, we compute the response time of each task using the analysis in Section 3.3. After that, we generate four random communication flows with 5, 10, 15, and 20 tasks and compute the communication latency of each flow. Each synthetic evaluation is repeated 1,000 times and the average worst-case communication latency is reported. Figure 6.2 shows the estimated worst-case communication latencies for the generated flows, which comprise a mix of the applications provided in Table 4.7. As one can observe, there is a slight improvement when the communicating tasks are scheduled on the same core. In this specific evaluation, the system comprises two application cores, as this is the setting of the implemented COTS platform in Section 4.2.

In addition, the yellow line in the figure represents the average communication bandwidth, that is, the total amount of data transferred divided by the end-to-end communication latency. For the sake of this evaluation, we assume that each task in a flow sends data equal to the size of its data section as reported in Table 4.7; hence, the amount of data transferred between any two successive tasks in a flow might differ, which resembles real-life scenarios. Each point in the line is generated by randomly picking 5, 10, 15, or 20 tasks based on the number of tasks in the flow. Then, we compute the total amount of data transferred within the end-to-end latency window. We repeat the test 1,000 times and take the average to capture all the application benchmarks in Table 4.7.
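The bandwidth metric is a simple ratio, sketched below for a single flow. The function name, the data sizes, the latency value, and the units (bytes and milliseconds) are all assumptions made for illustration; they are not the values of Table 4.7 or Figure 6.2.

```python
from typing import Sequence


def average_bandwidth(data_sizes: Sequence[int], latency: int) -> float:
    """Total data sent by the tasks of one flow divided by the flow's end-to-end latency."""
    if latency <= 0:
        raise ValueError("latency must be positive")
    return sum(data_sizes) / latency


# Hypothetical flow of three tasks: data section sizes in bytes, latency in ms.
sizes = [4096, 8192, 2048]
print(average_bandwidth(sizes, latency=2))  # 14336 bytes / 2 ms = 7168.0 bytes per ms
```

Averaging this value over many randomly drawn flows of the same length would give one point of the bandwidth line described above.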

6.5 Summary

We showed how to incorporate inter-task communication in the proposed 3-phase task model. We consider an asynchronous communication model, where pairs of sender and receiver tasks exchange data. In particular, communication data is written by the sender task to main memory during its unload phase, and read from main memory by the receiver task during its load phase. Since there are no precedence constraints among communicating tasks, the schedulability analyses presented in previous chapters do not need to be changed. We computed bounds on the end-to-end latency for a chain of tasks based on the analysis for partitioned systems in Section 3.3, and evaluated the obtained latency on the platform introduced in Section 4.2.

Chapter 7 Bundled Scheduling of Parallel Real-time Tasks

After discussing inter-task communication in Chapter 6, in this chapter and the next one we focus on intra-task communication for parallel real-time tasks. In particular, in this chapter we focus on how to schedule parallel tasks to simplify synchronization among parallel threads of the same task, while in Chapter 8 we introduce a predictable interconnection design and derive latency bounds for messages exchanged between communicating threads. With the increased demand for high-performance applications such as autonomous driving and computer vision [158], parallel processing is becoming relevant to the real-time community. However, most related works on scheduling of real-time parallel tasks [142, 89, 92, 109] assume that application threads are scheduled independently. In practice, there is ample evidence [123, 60, 77, 150] that parallel threads often need to be executed concurrently. This is especially true for threads that are tightly synchronized through the use of either shared resources or message-passing primitives. Some synchronization primitives, such as intra-thread locks, cannot be modelled by most related work [39, 109]. Many primitives can cause an unnecessary number of context switches and increase thread execution time due to blocking when threads are not executed together. Furthermore, the need to account for such synchronization mechanisms greatly increases the complexity of the task model.

For instance, consider the synchronizing Thread #1 and Thread #2 in Figure 7.1. If the scheduling policy does not guarantee that they are scheduled in parallel at the same time, their WCETs are prone to inflation. Under a preemptive policy they might suffer an excessive number of preemptions, while under a non-preemptive policy a thread might spin-wait for the other thread. Even in the best case, where, depending on the synchronization primitives and the scheduling policy, the WCET is not inflated, independent scheduling still complicates accounting for the threads' communication time. Therefore, we argue that parallel threads need to be co-scheduled to reduce the synchronization overheads and to simplify their communication analysis. As will be detailed in Section 8.6, we can compose the total WCET as the WCET on CPUs plus the worst-case communication time between the parallel threads. In particular, all parallel threads of a task are scheduled concurrently; hence their communication time can be accounted for simply, as long as the inter-core interconnect provides provable real-time bounds.

Figure 7.1: Illustrations of the negative impact on synchronized parallel threads if not co-scheduled at the same time. (Panels: preemptive, non-preemptive, co-scheduled; each shows Thread #1, Thread #2, and other workloads on cores C1 and C2 over time.)

To address this issue, gang scheduling [123] has been extensively studied in the HPC and general-purpose domains [60, 77, 150]. Under gang scheduling, an application is scheduled only if there are sufficient cores to execute all the application’s threads in parallel. Gang scheduling of real-time tasks has been investigated in [81, 55, 64, 38, 47]. In particular, [81, 55, 64] consider a rigid task model, where in the worst case the number of threads required by an application is assumed to remain constant over its entire execution time. While the rigid model has the benefit of simplicity, it can incur a significant loss of performance by overestimating the computational demand of an application: many parallel applications change their required number of threads during execution. For example, in the common fork-join model¹, the application progresses through a set of phases, where each phase can require a different number of threads. In the similarly common Directed Acyclic Graph (DAG) model, the application comprises a set of precedence-constrained subtasks, and the number of threads used by the application at any one time depends on the subtask scheduling.

Hence, in this chapter we introduce a novel task model, which we call the bundled model, that supports gang scheduling of parallel threads without incurring undue pessimism in modelling the application’s demand. In this model, a real-time task is composed of a sequence of bundles, where each bundle is characterized by a known worst-case execution time (WCET) and number of required cores. All threads within a bundle are then gang scheduled. Our model thus represents a generalization of the traditional real-time gang model, in the sense that a traditional gang task is equivalent to a bundled task with a single bundle.

¹ Note that we use the term fork-join to refer to the model that is also known as a multi-threaded task

Figure 7.2: Illustrations of how a fork-join parallel task can be scheduled according to different scheduling strategies. (Panels: Federated, Global, Gang, Bundled; each plots cores against time for the task's threads and other workloads. The task has three phases: Phase 1, Phase 2, Phase 3.)

Figure 7.2 shows an illustrative example of a parallel fork-join task and how its parallel threads might be scheduled under different scheduling strategies. The important point to note here is that our objective is to guarantee that parallel threads are scheduled concurrently 1) to reduce the synchronization overheads and 2) to simplify accounting for their communication. On the left side of the figure, global thread scheduling [109] and federated scheduling [93] are shown. As can be seen, neither scheme guarantees that the parallel threads in the same phase are scheduled concurrently. Consequently, these scheduling schemes do not meet our objectives. As illustrated in the figure, federated scheduling allocates a dedicated cluster of cores to the parallel task based on a metric related to the parallel task's utilization. However, federated scheduling might suffer from core overprovisioning compared to the greedy global thread scheduling.

On the right side of the figure, gang scheduling of the rigid model [81] and the proposed bundled scheduling are shown. Gang scheduling does meet our objectives. However, as mentioned earlier, gang scheduling of the rigid model suffers a significant loss of performance due to core overprovisioning. As illustrated in the gang panel of the figure, four cores are reserved for the entire execution time of the task, even though at times only one or two cores are actually needed. In the proposed bundled scheduling, we segment the shown fork-join task into three different bundles based on the number of required cores in each phase. All threads in each bundle are guaranteed to be scheduled in parallel, similar to the gang case. However, since bundles can be independently scheduled, we only reserve the required number of cores in each bundle, which leads to improved CPU utilization. Note that we improve over gang scheduling of the rigid model at the cost of a more complex schedulability analysis, which must deal with the precedence constraints between bundles.
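The overprovisioning argument can be made concrete with a small sketch that compares the core-time reserved for one task under the two models. The `Bundle` class, the function names, and the three-bundle example are hypothetical; the per-phase core counts and WCETs are made up rather than read from Figure 7.2, and the comparison considers the task in isolation, ignoring interference from other workloads.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Bundle:
    """One bundle of a task: a WCET and the number of cores it needs."""
    wcet: int    # worst-case execution time of the bundle
    cores: int   # cores required; all threads of the bundle are gang scheduled


def rigid_core_time(bundles: Sequence[Bundle]) -> int:
    """Rigid gang model: the maximum core count is reserved for the whole task."""
    if not bundles:
        raise ValueError("a task needs at least one bundle")
    return max(b.cores for b in bundles) * sum(b.wcet for b in bundles)


def bundled_core_time(bundles: Sequence[Bundle]) -> int:
    """Bundled model: each bundle reserves only the cores it requires."""
    if not bundles:
        raise ValueError("a task needs at least one bundle")
    return sum(b.cores * b.wcet for b in bundles)


# Hypothetical fork-join task split into three bundles, one per phase.
task = [Bundle(wcet=2, cores=4), Bundle(wcet=3, cores=1), Bundle(wcet=2, cores=2)]

print(rigid_core_time(task))    # 4 cores * 7 time units = 28
print(bundled_core_time(task))  # 8 + 3 + 4 = 15
```

A task with a single bundle yields the same value from both functions, which reflects the statement that a traditional gang task is a bundled task with one bundle.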

In more detail, we provide the following contributions: (A) We introduce the bundled task model, and discuss how bundles are scheduled on an identical multiprocessor according to a preemptive, fixed-priority gang scheme. (B) We show how applications coded according to different programming models can be executed within bundles. (C) We derive a sufficient schedulability analysis for sporadic bundled task sets. (D) We evaluate the performance of the derived analysis and compare it against the rigid gang model, as well as a state-of-the-art analysis for global (non-gang) scheduled parallel tasks [109]. (E) Finally, we discuss how to integrate bundled scheduling with the 3-phase task model to ensure predictable access to memory resources and reduce the overhead of preemptions.

The rest of the chapter is organized as follows. We first introduce the proposed system model in Section 7.1. In Section 7.2, we detail the schedulability analysis for the proposed system. The evaluation results are reported in Section 7.3. In Section 7.4, we discuss how bundled scheduling could be integrated with the 3-phase execution model. Finally, we conclude with some remarks in Section 7.5.
