Chapter 3: Real-Time Nested Locking Protocol (RNLP)
6.2 Other Related Work
During the course of my graduate education, I have contributed a number of results outside of the scope of this document. This section briefly summarizes these results, which appeared in ten other peer-reviewed papers. These contributions fall into three broad categories: (i) the analysis and management of shared hardware components, (ii) multiprocessor synchronization, and (iii) soft-real-time schedulability and tardiness analysis.
6.2.1 Shared-Hardware Management and Analysis1
This dissertation has described several synchronization algorithms that are motivated by the application of managing shared hardware devices. In work outside of the scope of this dissertation, I have co-authored several other publications in which shared-hardware-management techniques have been applied in practice. Many of these publications even apply algorithms described herein.
GPUs. In collaboration with Elliott and Anderson (Elliott et al., 2013), I helped design and analyze GPUsync, a framework for managing GPUs in multi-GPU multiprocessor real-time systems. My contributions to this work were predominantly to the design and analysis of GPUsync, which leverages the R2DGLP described in Chapter 5; Glenn Elliott implemented, debugged, and evaluated the entire system.
GPUsync takes a synchronization-oriented approach to GPU management and applies locking protocols to GPUs, as well as hardware components therein. GPUsync was designed with flexibility, predictability, and parallelism in mind. Specifically, it can be applied under either static- or dynamic-priority CPU scheduling; can allocate CPUs/GPUs on a partitioned, clustered, or global basis; provides flexible mechanisms for allocating GPUs to tasks; enables task state to be migrated among different GPUs, with the potential of breaking such state into smaller “chunks”; provides migration cost predictors that determine when migrations can be effective; enables a single GPU’s different engines to be accessed in parallel; properly supports GPU-related interrupt and worker threads according to the sporadic task model, even when GPU drivers are closed-source; and provides budget policing to the extent possible, given that GPU access is non-preemptive. No prior real-time GPU management framework provides a comparable range of features.
1 This subsection summarizes results from the following papers:
Elliott, G., Ward, B., and Anderson, J. (2013). GPUSync: A framework for real-time GPU management. In Proceedings of the 34th IEEE Real-Time Systems Symposium, pages 33–44.
Ward, B., Herman, J., Kenna, C., and Anderson, J. (2013b). Making shared caches more predictable on multicore platforms. In Proceedings of the 25th Euromicro Conference on Real-Time Systems, pages 157–167.
Ward, B., Thekkilakattil, A., and Anderson, J. (2014). Optimizing preemption-overhead accounting in multiprocessor real-time systems. In Proceedings of the 22nd Conference on Real-Time Networks and Systems, pages 235–243.
Chisholm, M., Ward, B., Kim, N., and Anderson, J. (2015). Cache sharing and isolation tradeoffs in multicore mixed-criticality systems. In Proceedings of the 36th Real-Time Systems Symposium, pages 306–316.
Kim, N., Ward, B., Chisholm, M., Fu, C.-Y., Anderson, J., and Smith, F. (2016). Attacking the one-out-of-m multicore problem by combining hardware management with mixed-criticality provisioning. In Proceedings of the 22nd Real-Time Embedded Technology and Applications Symposium, pages 149–160.
Chisholm, M., Kim, N., Ward, B., Otterness, N., Anderson, J., and Smith, F. (2016). Reconciling the tension between hardware isolation and data sharing in mixed-critical, multicore systems. In Proceedings of the 37th IEEE Real-Time Systems Symposium (to appear).
Multiprocessor cache-related preemption delay analysis. Preemptions (and migrations, in which a task is scheduled on one processor and later moved to another) can increase the execution time of tasks due to cache interference. When a task is preempted, another task executes instead and may evict cache lines that the preempted task would otherwise have reused. When the preempted task resumes, it must reload from memory data that, had it not been preempted, would still have been cached and therefore accessed more quickly. A number of techniques are known for quantifying this effect; they are referred to as cache-related preemption delay (CRPD) or cache-related preemption and migration delay (CPMD) analysis. Broadly, these techniques can be classified as either preemption-centric, in which the additional delays are analytically accounted for by inflating the execution time of the preempting task, or task-centric, in which the additional delays are analytically accounted for by inflating the execution time of the preempted task. We presented a new CRPD analysis technique that combines these two approaches in a linear program, which can be solved to minimize the utilization loss associated with CRPDs (Ward et al., 2014).
Dynamic shared-cache management. In collaboration with Herman, Kenna, and Anderson (Ward et al., 2013b), I helped design shared-cache management techniques that improve predictability with respect to shared-cache accesses, thereby improving analytical platform utilization. In that work, we developed and analyzed two techniques for minimizing or eliminating interference among concurrently executing tasks through the shared cache: cache locking and cache scheduling. Cache locking applies a variant of the k-exclusion RNLP described in Chapter 4 to individual ways of shared-cache colors. Cache scheduling applies a scheduling algorithm, such as EDF, to the ways of cache colors. These techniques are also applied in a mixed-criticality scheduling framework.
Mixed-criticality shared-cache and memory-bank management. In a series of papers (Chisholm et al., 2015; Kim et al., 2016; Chisholm et al., 2016), we developed support in the mixed-criticality on multicore (MC2) scheduling framework for shared-cache and DRAM-bank partitioning. Extensive experimental evaluations were conducted to quantify the effects of isolation vs. sharing with respect to both the shared cache and the DRAM banks. Furthermore, these evaluations were conducted to reflect the analysis assumptions used at different criticality levels: higher-criticality tasks are provisioned more pessimistically, while lower-criticality tasks may be provisioned based on observed average-case performance. Based on the results of these experimental evaluations, a linear-programming-based optimization framework was developed that optimizes shared-cache and DRAM-bank partitions to improve the overall platform utilization within the context of the mixed-criticality analysis framework. This optimization balances the cache and memory-bank allocation among the criticality levels, based on tradeoffs observed in experimental evaluations of the execution time of tasks given different cache and memory-bank allocations. This work combined recent advances in mixed-criticality scheduling with work on hardware management, and showed how these two orthogonal approaches can be used together to realize better overall platform utilization.
More recently in this line of work, we have considered challenges raised by data sharing among tasks in the context of both mixed-criticality scheduling and hardware management. Hardware management techniques have predominantly focused on providing isolation with respect to hardware resources, but data sharing must fundamentally break some forms of isolation, thereby causing capacity loss, or a reduction in schedulability. To address this issue, we developed inter- and intra-criticality data-sharing mechanisms, and evaluated the extent to which they reduce data-sharing-related capacity loss.
6.2.2 Contention-Sensitive Locking2
While the RNLP enables fine-grained locking, there exist pathological resource-access patterns in which the worst-case blocking behavior is the same as or quite similar to that of existing coarse-grained locking protocols. One such pathological case occurs when there is a high degree of transitive blocking. For example, suppose that $R_1$ requests $D_1 = \{\ell_a, \ell_b\}$, $R_2$ requests $D_2 = \{\ell_b, \ell_c\}$, $R_3$ requests $D_3 = \{\ell_c, \ell_d\}$, etc. In this case, a long transitive blocking chain can arise that serializes all requests, even though some requests in the chain do not conflict with one another. To address this issue, we developed the C-RNLP (Ward and Anderson, 2014a; Jarrett et al., 2015), which allows some requests to “cut ahead” of others, so long as doing so does not increase the blocking of earlier-timestamped requests. In the previous example, it may be possible for $R_3$ to cut ahead of $R_2$ and be satisfied concurrently with $R_1$. By supporting such cutting ahead, we showed that the worst-case blocking bound for each request is then contention sensitive, i.e., a function of only the number of requests with which it directly conflicts.
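The distinction between transitive and direct conflicts in the example above can be made concrete with a small sketch. The code below is only an illustration of the conflict structure, not an implementation of the C-RNLP; the request names, resource labels, and helper functions are hypothetical and chosen to mirror the three-request example.

```python
from itertools import combinations

# Hypothetical requests and the resource sets they need, mirroring the
# example in the text: D1 = {a, b}, D2 = {b, c}, D3 = {c, d}.
requests = {
    "R1": {"a", "b"},
    "R2": {"b", "c"},
    "R3": {"c", "d"},
}


def direct_conflicts(reqs):
    """Map each request to the requests that share at least one resource with it."""
    conflicts = {name: set() for name in reqs}
    for (n1, d1), (n2, d2) in combinations(reqs.items(), 2):
        if d1 & d2:  # non-empty intersection means a direct conflict
            conflicts[n1].add(n2)
            conflicts[n2].add(n1)
    return conflicts


def can_run_concurrently(reqs, names):
    """Return True if the named requests are pairwise resource-disjoint."""
    return all(not (reqs[x] & reqs[y]) for x, y in combinations(names, 2))


conflicts = direct_conflicts(requests)
print(conflicts["R3"])                               # {'R2'}
print(can_run_concurrently(requests, ["R1", "R3"]))  # True
```

Here R3 directly conflicts only with R2, so R1 and R3 are resource-disjoint. If requests were served strictly in timestamp order, R3 would nevertheless wait behind R2, which waits behind R1; this is the chain that cutting ahead is meant to break.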
2 This subsection summarizes results from the following papers:
Jarrett, C., Ward, B., and Anderson, J. (2015). A contention-sensitive fine-grained locking protocol for multiprocessor real-time systems. In Proceedings of the 23rd International Conference on Real-Time Networks and Systems, pages 3–12.
Ward, B. and Anderson, J. (2014a). A contention-sensitive multi-resource locking protocol for multiprocessor real-time systems. In Proceedings of the 35th IEEE Real-Time Systems Symposium Work-in-Progress Session, pages 11–12.
6.2.3 Improved Soft Real-Time Tardiness Bounds3
As discussed briefly in Chapter 2, G-EDF, or other G-EDF-like schedulers, can schedule soft-real-time systems with no capacity loss and bounded deadline tardiness. The tightest known bounds on deadline tardiness are derived through compliant vector analysis (CVA), which derives per-task tardiness bounds. Using CVA, Erickson and Anderson (2012) showed that by setting the priority point (or the “effective deadline” used as the scheduling priority) differently from the actual deadline, which G-EDF uses, it is possible to create schedulers whose tardiness, as computed via CVA, is less than that of G-EDF. Based on this work, we showed how to formulate CVA as a linear program, which in turn allows priority points to be determined based on different optimization objectives and constraints (Ward et al., 2013a; Erickson et al., 2014). For example, we considered minimizing the maximum per-task lateness and the maximum relative lateness, $l_i/p_i$.
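To illustrate how the two objectives mentioned above can differ, the following sketch evaluates both on a toy task set. It assumes that l_i denotes the lateness bound of a task and p_i its period; the task names and numeric values are hypothetical and are not taken from the cited papers. The sketch only evaluates the objectives for fixed bounds and does not perform the linear-program optimization itself.

```python
# Hypothetical per-task lateness bounds l_i and periods p_i, in the same
# time unit. Lateness may be negative (a task that finishes early).
tasks = {
    "T1": {"lateness": 2.0, "period": 10.0},
    "T2": {"lateness": 3.0, "period": 40.0},
    "T3": {"lateness": -1.0, "period": 5.0},
}


def max_lateness(task_set):
    """Return (task name, value) for the largest lateness bound l_i."""
    name = max(task_set, key=lambda n: task_set[n]["lateness"])
    return name, task_set[name]["lateness"]


def max_relative_lateness(task_set):
    """Return (task name, value) for the largest ratio l_i / p_i."""
    def ratio(n):
        return task_set[n]["lateness"] / task_set[n]["period"]
    name = max(task_set, key=ratio)
    return name, ratio(name)


print(max_lateness(tasks))           # ('T2', 3.0)
print(max_relative_lateness(tasks))  # ('T1', 0.2)
```

In this toy set the two objectives single out different tasks: T2 has the largest absolute lateness, while T1 is the latest relative to its period, which is why the choice of objective can lead to different priority-point assignments.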