We used synthetic workloads to illustrate the applicability and benefits of gFPca on our implementation platform (four cores, 16 cache partitions). We focused on tasks that are sensitive to shared-cache interference (for which cache isolation is critical), and evaluated four algorithms: gFP (cache-agnostic global scheduling), pFP (partitioned scheduling with static core-level cache allocation), nFPca (cache-aware non-preemptive global scheduling with dynamic task-level cache allocation), and gFPca (cache-aware preemptive global scheduling with dynamic job-level cache allocation).
Workload generation. We first constructed two real-time programs in our implementation: the first randomly accesses every 32 bytes (the size of a cache line) of a 960KB array 200 times and was used for the highest-priority task; the second randomly accesses every 32 bytes of a 192KB array 2000 times and was used for each lower-priority task. We separately measured the WCET of each program under the gFPca scheduler when it was allocated different numbers of cache partitions.
We then constructed a reference taskset τref with n = 5 tasks τ1, τ2, · · · , τn, listed in decreasing priority order, where τ1 = (p1 = 5000, d1 = 500) and τi = (pi = 5000, di = 1550) for all 1 < i ≤ n. (We observed similar results when varying the number of tasks.)
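The two synthetic programs described under Workload generation can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes that "randomly accesses every 32 bytes" means touching each cache line of the array once per round in a freshly shuffled order, and the names `run_workload`, `CACHE_LINE`, and `seed` are hypothetical.

```python
import random

CACHE_LINE = 32  # bytes; the cache-line size stated in the text


def run_workload(array_kb, rounds, seed=0):
    """Touch every cache line of an array in random order, `rounds` times.

    Returns the total number of accesses performed. The function name,
    the seed parameter, and the use of a bytearray are illustrative
    assumptions, not details taken from the original programs.
    """
    rng = random.Random(seed)
    data = bytearray(array_kb * 1024)
    # One byte offset per cache line in the array.
    offsets = list(range(0, len(data), CACHE_LINE))
    accesses = 0
    checksum = 0
    for _ in range(rounds):
        rng.shuffle(offsets)       # new random visiting order each round
        for off in offsets:
            checksum += data[off]  # read one byte from each cache line
            accesses += 1
    return accesses
```

Under these assumptions, the highest-priority task would correspond to `run_workload(960, 200)` and each lower-priority task to `run_workload(192, 2000)`. A measurement-grade version would be written in a compiled language and timed under the scheduler being evaluated.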
[Plot for Figure 4.12. x-axis: number of cache partitions (0 to 16); y-axis: WCET (ms); series: WCET of high-priority task, WCET of low-priority task, deadline of high-priority task, deadline of low-priority task.]
Figure 4.12: Measured WCET vs. Number of cache partitions.
Analysis of WCET and the number of cache partitions. Fig. 4.12 shows that the WCET of τ1 is 430ms with 16 cache partitions and 501ms with 15 cache partitions. Since its deadline is 500ms, τ1 needs all 16 cache partitions to meet its deadline. Each lower-priority task has a WCET of 800ms with 4 cache partitions, 1059ms with 3 cache partitions, and 1958ms with no cache partitions.
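The comparison between measured WCETs and deadlines can be written out as a small check. The sketch below contains only the data points quoted in the text (the full curves of Fig. 4.12 are not reproduced), and the names `WCET_HIGH`, `WCET_LOW`, and `meets_deadline` are hypothetical.

```python
# Measured WCETs in ms, keyed by the number of cache partitions.
# Only the values quoted in the text are listed here.
WCET_HIGH = {16: 430, 15: 501}          # highest-priority task (tau_1)
WCET_LOW = {4: 800, 3: 1059, 0: 1958}   # each lower-priority task

DEADLINE_HIGH = 500   # ms
DEADLINE_LOW = 1550   # ms


def meets_deadline(wcet_by_partitions, partitions, deadline_ms):
    """Return True if the WCET measured with `partitions` cache
    partitions does not exceed the deadline (task run in isolation)."""
    return wcet_by_partitions[partitions] <= deadline_ms


# tau_1 meets its deadline with 16 partitions but not with 15.
high_ok_16 = meets_deadline(WCET_HIGH, 16, DEADLINE_HIGH)
high_ok_15 = meets_deadline(WCET_HIGH, 15, DEADLINE_HIGH)
# A lower-priority task meets its deadline with 4 partitions but not with 0.
low_ok_4 = meets_deadline(WCET_LOW, 4, DEADLINE_LOW)
low_ok_0 = meets_deadline(WCET_LOW, 0, DEADLINE_LOW)
```

This check considers each task in isolation; it says nothing about interference from other tasks sharing a core.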
From the above analysis, we can feasibly assign the number of partitions of each task under gFPca and nFPca, i.e., A1 = 16 and Ai = 4 (i > 1). We set the WCET of each task to an upper bound on the WCET measured under the assigned number of partitions (see footnote 19), i.e., e1 = 500 and ei = 1050; these values were used in our experiment investigating the impact of task density. (Note that these WCETs are safe under gFP as well, since gFP allows every task to access the entire cache.)
19 The upper bound is to account for potential sources of interference, such as TLB overhead, and
Observation: No feasible static partitioning strategy exists. Under pFP, tasks are statically assigned to cores (e.g., as done in [36, 65]) and shared-cache isolation among tasks on different cores is achieved via static cache partitioning. However, this static approach cannot schedule the example workload. Specifically, since τ1 requires all 16 cache partitions to meet its deadline, it will miss its deadline if we allocate fewer than 16 partitions to its core. If we allocate all 16 cache partitions to τ1's core, then either (i) some lower-priority task is assigned to a different core, receives zero cache partitions, and misses its deadline, or (ii) all tasks are packed onto the same core as τ1, in which case the taskset is unschedulable (since the core utilization is more than 1). In other words, no feasible static partitioning strategy exists for the workload.
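The case analysis above can be enumerated mechanically. The sketch below is illustrative only and rests on three labeled assumptions: WCET does not decrease when a task gets fewer partitions; for case (ii) the WCET bounds e1 = 500 and ei = 1050 stand in for the (unreported) WCETs on a shared 16-partition core; and case (ii) is checked with a simple synchronous-release completion-time test, which is a swapped-in criterion rather than the utilization argument used in the text. All names are hypothetical.

```python
TOTAL_PARTITIONS = 16
N_LOW = 4                       # number of lower-priority tasks
D_HIGH, D_LOW = 500, 1550       # deadlines in ms
WCET_HIGH_AT_15 = 501           # measured WCET of tau_1 with 15 partitions
WCET_LOW_AT_0 = 1958            # measured WCET of a low task with 0 partitions
E_HIGH, E_LOW = 500, 1050       # WCET bounds from the text (assumed for case ii)


def pfp_outcome(k, co_located):
    """Outcome of giving k partitions to tau_1's core and placing
    `co_located` lower-priority tasks on that same core."""
    if k < TOTAL_PARTITIONS:
        # Assumption: WCET with k < 16 partitions is at least the
        # 15-partition value, which already exceeds tau_1's deadline.
        assert WCET_HIGH_AT_15 > D_HIGH
        return "tau1 misses its deadline"
    if co_located < N_LOW:
        # Case (i): the other cores have no partitions left.
        assert WCET_LOW_AT_0 > D_LOW
        return "a low-priority task with zero partitions misses its deadline"
    # Case (ii): all five tasks share one core. With synchronous release,
    # the lowest-priority task finishes after all the others have run.
    finish = E_HIGH + N_LOW * E_LOW
    if finish > D_LOW:
        return "the lowest-priority task on the shared core misses its deadline"
    return "feasible"


def any_feasible_static_allocation():
    """Try every partition count for tau_1's core and every co-location."""
    return any(
        pfp_outcome(k, c) == "feasible"
        for k in range(TOTAL_PARTITIONS + 1)
        for c in range(N_LOW + 1)
    )
```

Under these assumptions every combination fails, which matches the conclusion stated in the text.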
Experiment. The reference taskset illustrates the scenario where the high-priority task has a very high density (ratio of WCET to deadline) and is thus extremely sensitive to interference. To investigate the impact of task density on the performance of the algorithms, we varied the density of τ1 from 1 to 0.1 by increasing its deadline (while keeping all other parameters unchanged), which produced 10 tasksets. Cache partitions were assigned for gFPca and nFPca as above (A1 = 16 and Ai = 4 for i > 1). Although our analysis shows that no feasible partitioning strategy exists for pFP, for validation we evenly distributed the four low-priority tasks and the 16 cache partitions across the four cores, and assigned τ1 to one of the four cores. We ran each generated taskset for one minute under each of the four schedulers (gFPca, nFPca, gFP, and pFP), collected the scheduling traces, and derived the observed schedulability under each scheduler.
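The ten tasksets can be derived from the reference taskset by solving density = WCET / deadline for τ1's deadline. The sketch below assumes a density step of 0.1 (inferred from "from 1 to 0.1" producing 10 tasksets) and uses e1 = 500, ei = 1050, A1 = 16, and Ai = 4 from the text; the dictionary layout and function names are hypothetical.

```python
E1, E_LOW = 500, 1050      # WCET bounds (ms)
PERIOD = 5000              # period of every task (ms)
D_LOW = 1550               # deadline of each lower-priority task (ms)
N_LOW = 4                  # number of lower-priority tasks


def deadline_for_density(wcet, density):
    """Deadline that gives a task the requested density (WCET / deadline)."""
    return wcet / density


def make_taskset(density_tau1):
    """Reference taskset with tau_1's deadline adjusted to the given density.
    Each task is a dict of period p, deadline d, WCET e, and partitions A."""
    tau1 = {"p": PERIOD, "d": deadline_for_density(E1, density_tau1),
            "e": E1, "A": 16}
    low = [{"p": PERIOD, "d": D_LOW, "e": E_LOW, "A": 4}
           for _ in range(N_LOW)]
    return [tau1] + low


# Densities 1.0, 0.9, ..., 0.1 (assumed step of 0.1).
densities = [k / 10 for k in range(10, 0, -1)]
tasksets = [make_taskset(rho) for rho in densities]
tau1_deadlines = [ts[0]["d"] for ts in tasksets]
```

With these values τ1's deadline grows from 500ms at density 1 to its full period of 5000ms at density 0.1, so the deadline never exceeds the period.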
Table 4.2: Impact of task density on schedulability.
Density   ≥ 0.8   0.7   0.6   0.5   0.4   0.3   0.2   0.1
gFPca     Yes     Yes   Yes   Yes   Yes   Yes   Yes   Yes
gFP       No      No    Yes   Yes   Yes   Yes   Yes   Yes
nFPca     No      No    No    No    No    No    No    Yes
Results. Table 4.2 shows the observed schedulability of each taskset under each scheduler. The gFPca scheduler performed best: it was able to schedule all tasksets. The gFP scheduler performed well when the high-priority task's density was low; however, as the task's deadline became tighter, its tolerance to cache interference from other tasks decreased, and it began to miss its deadline. The nFPca scheduler performed very poorly: it was able to schedule only one taskset, which we attribute to its poor utilization of cache and CPU resources due to its non-preemptive nature. As predicted by our analysis, the pFP scheduler could not schedule any taskset.
4.8 Conclusion
We have presented the design, implementation and analysis of gFPca, a cache-aware global preemptive fixed-priority scheduling algorithm with dynamic cache allocation. Our implementation has reasonable run-time overhead, and our overhead analysis integrates several novel ideas that enable highly accurate analysis results. Our numerical evaluation, using overhead data from real measurements on our implementation, shows that gFPca improves schedulability substantially compared to the cache-agnostic gFP, and that it outperforms the existing cache-aware nFPca in most cases. Through our empirical evaluation, we illustrated the applicability and benefits of gFPca. For future work, we plan to enhance both gFPca and its implementation to improve their
Chapter 5
Dynamic shared cache management for virtualization systems by virtualizing Intel CAT
We have developed a shared-cache management and analysis solution for non-virtualized systems that allocates non-overlapping cache partitions to tasks; we now explore the solution for virtualization systems. A natural question is: can we simply apply the shared-cache management solution developed for non-virtualized systems in Chapter 4 to the hypervisor in order to mitigate shared-cache interference in virtualization systems? That solution can be applied in the hypervisor to allocate shared-cache partitions to VMs, but tasks within the same VM would still share the cache area allocated to their VM and would still suffer from shared-cache interference.
In order to mitigate shared-cache interference, concurrently running tasks must be allocated non-overlapping cache areas. Recall that resources are distributed hierarchically in virtualization systems: a hardware resource (say, the CPU) is first distributed to VMs by the hypervisor and then redistributed to tasks by the OS in each VM. Observing that the cache is not managed in virtualization systems, we need to establish a hierarchical cache allocation framework in order to allocate non-overlapping cache areas to tasks in virtualization systems.
Recent work has developed a hierarchical cache allocation framework that allocates non-overlapping cache areas to tasks in virtualization systems using page coloring (e.g., [75, 37]); however, it is restricted to static cache partitioning, where a fixed set of partitions is statically assigned to each task at initialization. While this approach is simple and easy to implement, it can substantially under-utilize the cache and CPU resources, and it does not work well for systems where the tasks' timing constraints and CPU/cache demands vary dynamically at run time, such as multi-mode systems (as we shall illustrate in Section 5.4.4).
To bridge this gap, we present a new approach to cache management of real-time virtualization systems that can deliver strong (shared) cache isolation at both VM and task levels, and that can be configured for both static and dynamic allocations. Unlike existing work, which is software-based, our approach takes advantage of the Cache Allocation Technology (CAT), a hardware feature recently added in Intel multicore hardware for achieving core-level cache partitioning; therefore, it is much more efficient than software-based techniques. Since CAT only provides core-level cache isolation, we introduce vCAT, a novel design for CAT virtualization that can be used to achieve hypervisor- and VM-level cache allocations. Our approach to virtualizing cache partitions is analogous to memory virtualization: as the hardware provides a number of (indistinguishable) physical partitions, we can expose some number of “virtual partitions” to each VM and then transparently map them to physical partitions in the hypervisor; each VM can then allocate its virtual partitions to its tasks statically or dynamically at runtime.
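The analogy with memory virtualization can be made concrete with a small mapping table. The sketch below is a hypothetical illustration, not the vCAT design: it assumes the hypervisor gives each VM one contiguous region of physical partitions (CAT capacity bitmasks must consist of contiguous set bits, so shifting a contiguous virtual mask keeps it valid) and translates a VM's virtual partition bitmask by shifting it to the region's base. The class and method names are invented for this example.

```python
class VirtualPartitionMap:
    """Hypothetical hypervisor-side table mapping each VM's virtual cache
    partitions onto a contiguous region of physical partitions."""

    def __init__(self, num_physical):
        self.num_physical = num_physical
        self.next_free = 0
        self.regions = {}  # VM name -> (base physical index, partition count)

    def assign(self, vm, count):
        """Reserve `count` physical partitions for `vm`."""
        if count <= 0 or self.next_free + count > self.num_physical:
            raise ValueError("not enough physical partitions")
        self.regions[vm] = (self.next_free, count)
        self.next_free += count

    def to_physical_mask(self, vm, virtual_mask):
        """Translate a bitmask over the VM's virtual partitions into a
        bitmask over physical partitions."""
        base, count = self.regions[vm]
        if virtual_mask <= 0 or virtual_mask >> count:
            raise ValueError("mask refers to partitions the VM does not own")
        return virtual_mask << base


# Example: 16 physical partitions shared by two VMs.
vmap = VirtualPartitionMap(16)
vmap.assign("vm0", 4)   # physical partitions 0..3
vmap.assign("vm1", 8)   # physical partitions 4..11
mask_a = vmap.to_physical_mask("vm0", 0b0011)  # vm0 task on virtual 0..1
mask_b = vmap.to_physical_mask("vm1", 0b0011)  # vm1 task on virtual 0..1
```

Because the two VMs own disjoint physical regions, identical virtual masks in different VMs translate to non-overlapping physical masks; within a VM, the guest can hand out its virtual partitions to tasks statically or change the masks at run time.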