Multimedia Application Performance - LITMUS RT Based Experiments

6.2 LITMUS RT Based Experiments

6.2.3 Multimedia Application Performance

We next present the results of an experiment conducted on machine A to compare our cache- aware scheduler toGEDF_{in terms of job execution times when executing a multimedia server}

workload. This workload was simulated with multiple instances of the mplayer application, modified so that one frame is processed per job. Thus, the period of themplayer application, as specified toLITMUSRT, determined the frame rate.

Multiple copies of the same segment of high-quality video (taken from [13]) were processed by different applications so that they appeared to be from different sources. All copies were resident in main memory using RAM disks, to avoid issues with hard disk accesses. The resolution of the video is 1920x1080 with a 24 fps frame rate, resulting in a period of 41 ms. Based on our observations of the amount of execution time that an application needed to process a frame of this video, we decided to assign an execution cost of 14 ms to each application (though we note that execution costs vary widely per frame). (Our observations came from running the application multiple times, each time specifying a different execution cost.) We also observed that the maximum WSS of the application was roughly 3 MB per frame. Therefore, co-scheduling three of more MTTs has the potential to thrash the 8192K shared cache on machine A. Each application was represented as a single-threaded MTT.

In our experiments, we ran one, six, and twelvemplayer applications under bothGEDF_and

our scheduler. Table 6.11 shows the measured worst-case job execution times for each case. (Note that, in the case of twelve applications, the system is overloaded since (14/41)∗12 ≈ 4.098 >4, although execution times are often lower in practice, which allows all twelve applications to be supported. Additionally, and also in the case of twelve applications, pairs of

mplayer applications were forced to share the same video copy due to RAM disk space constraints.) When one video is scheduled in isolation, the execution time under both schedulers

Machine A

Algorithm Tick AVG Sched. AVG Context Sw. AVG Release AVG

GEDF _1.166 _3.358 _0.953 _4.592

CA-SCHED _2.762 _2.052 _0.972 _—

Algorithm Tick WC Sched. WC Context Sw. WC Release WC

GEDF _1.847 _11.478 _1.532 _8.922

CA-SCHED _14.445 _6.278 _1.507 _—

Machine B

Algorithm Tick AVG Sched. AVG Context Sw. AVG Release AVG

GEDF _4.179 _7.011 _2.125 _7.410

CA-SCHED _2.803 _1.761 _1.052 _—

Algorithm Tick WC Sched. WC Context Sw. WC Release WC

GEDF _7.602 _34.005 _3.483 _13.869

CA-SCHED _9.153 _6.183 _1.938 _—

Machine C

Algorithm Tick AVG Sched. AVG Context Sw. AVG Release AVG

GEDF _4.283 _50.001 _4.661 _48.413

CA-SCHED _62.858 _8.633 _3.783 _—

Algorithm Tick WC Sched. WC Context Sw. WC Release WC GEDF _8.890 _370.023 _18.647 _214.613

CA-SCHED _361.983 _31.890 _7.103 _—

Table 6.10: Average and worst-case kernel overheads for each machine, inµs.

Algorithm 1 Video 6 videos 12 videos GEDF _14.004 _14.947 _14.956 CA-SCHED _13.771 _13.935 _13.972

is about 14 ms. UnderGEDF_{, this increases to roughly 15 ms when scheduling six and twelve}

applications, while it remains about 14 ms under our scheduler. Thus, through the use of our cache-aware scheduler, we are able to ensure that worst-case execution costs do not increase as a result of thrashing. In doing so, the worst-case cost for each application is roughly 6-7% less than GEDF _{in the case of six and twelve applications. As a result, since the system is}

able to support twelve video applications under GEDF_{, the same system should be capable}

of supporting an additional mplayer application under our cache-aware scheduler. If a large number of these systems were used within a cluster of systems that support a multimedia server workload, then the ability to support an additional application on each machine could be quite significant.

Interestingly, the memory reference patterns of these mplayer applications are arguably more sequential than random (since the memory where a video frame is stored is often scanned in predictable manner, such as strides across memory, instead of randomly), yet we still see a significant performance improvement when using our scheduler on machine A. We would expect that, if sufficient hardware support was available to allow more accurate profiling for sequential reference patterns, the performance benefit of our scheduler would be even greater on that machine.

6.3 Conclusion

In this chapter, we presented the results of experiments wherein the performance of our cache- aware scheduler was evaluated. First, we presented a set of experiments, conducted within SESC, that showed that, when the correct heuristic is selected, system performance can im- prove substantially as compared to GEDF_{. These experiments also allowed us to identify}

a “best-performing” heuristic, which was implemented within LITMUSRT along with our cache profiler. Our second set of experiments then evaluated this LITMUSRT implementation in terms if its capability to generate accurate WSS estimates, and its performance as compared to GEDF_{. We found that the accuracy of the profiler depended heavily on the}

performance monitoring features that were provided by hardware; however, when sufficient hardware support existed, our profiler was typically quite accurate. We also found that, when

our cache-aware scheduler was used on platforms where the profiler was accurate, reductions in cache miss rates were often significant. Further, the overheads of our scheduler were roughly equivalent to those in GEDF_{. The “true” cost for achieving these reductions in cache miss}

rates was an increase in deadline tardiness; however, with a combination of early-releasing and buffering, this tardiness potentially could be hidden from an end user. Finally, we conducted experiments with a simulated multimedia workload, and found that execution costs were lower as compared toGEDF_{when our scheduler was used; these lower costs could allow}

CHAPTER 7

CONCLUSION AND FUTURE WORK

In this dissertation, we extended research on multiprocessor real-time systems to support multicore platforms, specifically by designing a cache-aware real-time scheduler that reduces miss rates and avoids thrashing for shared caches on such platforms while ensuring real-time constraints. Prior work in the area of cache-aware scheduling for multicore platforms did not address support for real-time workloads, and prior work in the area of real-time scheduling did not address shared caches on multicore platforms. Further, the cache-aware real-time scheduling methods that we created earlier, described in Chapter 3, were only partial solutions and had not been evaluated within the context of a real operating system.

As multicore platforms are quickly becoming ubiquitous within many domains, creating scheduling policies to address bottlenecks that are common in these platforms, such as shared caches, is necessary to ensure acceptable system performance and permit a state-of-the-art real-time scheduler to remain relevant. Effectively addressing a bottleneck such as the shared cache can enable a larger real-time workload to be supported on a multicore platform, or may allow hardware-related costs (related to processor components, energy, or chip area) to be reduced.

7.1 Summary of Results

In Chapter 1, we presented the following thesis statement, which was to be supported by this dissertation.

Multiprocessor real-time scheduling algorithms can more efficiently utilize multicore platforms when scheduling techniques are used that reduce shared cache miss

rates. Such techniques can result in decreased execution times for real-time tasks, thereby allowing a larger real-time workload to be supported using the same hardware, or enabling costs (related to hardware, energy, or chip area) to be reduced.

In support of this thesis statement, we have presented a cache-aware scheduler, consisting of both a set of cache-aware scheduling policies, embodied in a single heuristic, and a cache profiler that allows the cache behavior of each MTT to be quantified at runtime. The choice of exactly which heuristic to implement was supported by experimentation with roughly one hundred combinations of policy choices within an architecture simulator; however, all heuristics operate similarly in that they attempt to influence the co-scheduling of MTTs through job promotions so that cache miss rates are reduced. Namely, tasks within the same MTTs are co-scheduled to facilitate cache reuse, and the co-scheduling of tasks from different MTTs is avoided when it would cause cache thrashing.

Our cache profiler quantifies the cache behavior of each MTT as a per-job WSS; these WSSs are used by the heuristic to make scheduling decisions. Per-job WSS estimates are obtained during execution using performance counters. By profiling MTTs during execution, the need to profile MTTs offline is eliminated, which might make profiling less of an inconvenience for certain types of workloads. In experiments, we found that our profiler is quite accurate when sufficient hardware support is provided; otherwise, accuracy varied considerably depending on the nature of the limited support that was available.

As both a “proof of concept” that our cache-aware scheduler was viable in practice, and to allow the empirical evaluation presented in Chapter 6 to be conducted, we implemented our scheduler within Linux. In experiments, our cache-aware scheduler often outperformed

GEDF _{under a wide variety of real-time workloads and machine architectures. This is best}

demonstrated in the experiments that were conducted on machines A and B in Chapter 6. Additionally, since overheads were found to be comparable toGEDF_{, we would not expect the}

scheduling-related overheads of our scheduler to offset any reductions in cache miss rates when compared with other scheduling approaches. We also described how deadline tardiness, while higher under our method, can often be managed through a combination of early-releasing jobs and buffering results. Lastly, in an effort to directly support the second half of the

thesis statement of this dissertation, we conducted experiments involving a multimedia server workload, and showed that the use of our cache-aware scheduler allowed execution costs to be reduced, thereby allowing the size of the workload to be increased. Overall, the presented evaluation suggests that eliminating thrashing and reducing miss rates in shared caches should be first-class concerns when designing real-time scheduling algorithms for multicore platforms with shared caches.

Finally, we consider the introduction of the MTT abstraction in this dissertation to be a contribution in itself, as it allows a notion of concurrency to be incorporated into the periodic and sporadic task models, which traditionally handle only sequentially-executing tasks. If the chip manufacturing industry continues to adhere to the multicore paradigm (which is likely, given current projections), then increasing per-chip core counts will be the primary way in which the potential processing power of chips will be increased. In this case, exploiting the available parallelism within these chips will be essential to achieving performance gains from the use of new processor technology. As end users continue to demand applications with greater features and sophistication, or of higher quality, such as improved video quality in multimedia applications, there will be increasing pressure to address issues related to concurrency at some level within virtually all areas of computer science.

In document On the design and implementation of a cache-aware soft real-time scheduler for multicore platforms (Page 179-185)