
6.2 KUTHS Algorithm extension for Many-Core Systems

6.3.4 Performance Evaluation on Private LLC System

Fig. 6.7 shows the speedups of our proposal for 8/16/32 simulated cores, with as many simulated threads per application, relative to the ACMP configuration. Each group of four simulated cores (1 large + 3 small) shares a 4MB L3 cache. For simulated configurations of 8 (2 groups), 16 (4 groups) and 32 (8 groups) cores we obtain performance improvements of 24, 30 and 34 percent, respectively. In comparison, the BIS [13] proposal with 52 small cores and 3 large cores gives an average performance benefit of 42 percent, while UBA [14] outperforms it by 8 percent. Compared with the KUTHS 32-core (8 groups) configuration, BIS [13] is better by 8 percent and UBA [14] by 16 percent on average, with fewer large cores in the system. This is a consequence of KUTHS's lightweight yet coarser-grained approach, which makes it unable to identify all bottlenecks in the multithreaded applications and to send only these to be executed on the large cores.

Figure 6.7: Speedup comparison of the KUTHS and the Linux OS (ACMP) Scheduler for the SPLASH-2 benchmark suite in the private last-level L3 cache system configurations with 8/16/32 cores, where each group of one large and three small cores shares a 4MB L3 cache
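The grouping used in this evaluation can be illustrated with a minimal Python sketch. It only encodes what the text states (groups of one large and three small cores, each sharing a 4MB L3 cache); the names `CoreGroup` and `build_groups`, and the choice that the first core id in each group is the large core, are assumptions made for illustration and do not come from the simulator.

```python
from dataclasses import dataclass

GROUP_SIZE = 4        # 1 large + 3 small cores per group (from the text)
L3_MB_PER_GROUP = 4   # each group shares a 4MB L3 cache (from the text)


@dataclass(frozen=True)
class CoreGroup:
    """Hypothetical description of one private-LLC group."""
    group_id: int
    large_core: int
    small_cores: tuple
    l3_mb: int = L3_MB_PER_GROUP


def build_groups(total_cores):
    """Split total_cores into groups of one large and three small cores."""
    if total_cores % GROUP_SIZE != 0:
        raise ValueError("total_cores must be a multiple of the group size")
    groups = []
    for g in range(total_cores // GROUP_SIZE):
        base = g * GROUP_SIZE
        # Assumption: the first core id of each group is the large core.
        small = tuple(range(base + 1, base + GROUP_SIZE))
        groups.append(CoreGroup(g, base, small))
    return groups


for n in (8, 16, 32):
    groups = build_groups(n)
    total_l3 = sum(group.l3_mb for group in groups)
    print(f"{n} cores: {len(groups)} groups, {total_l3} MB of L3 in total")
```

For the three evaluated configurations this yields 2, 4 and 8 groups, matching the group counts given above.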

6.4 Summary

We have presented the Kernel to User mode code Transition aware Hardware Scheduling (KUTHS) method. Our work is heavily influenced by Fairness-aware Scheduling as well as bottleneck identification techniques. We seek to provide performance benefits from running parallel workloads on ACMPs without the need for substantial hardware extensions, sampling, or runtime overheads. With minimal hardware additions, our KUTHS policy promotes the execution of the critical sections of code on the larger rather than the smaller cores within an ACMP. This results in performance gains of 11.1 percent and 30 percent (geometric mean) compared to the state-of-the-art Fairness-aware Scheduler and the Linux OS Scheduler, respectively, while being 8 percent slower than one of the most complex and sophisticated bottleneck identification techniques when running the SPLASH-2 benchmarks.

Although KUTHS offers significant performance benefits over other conventional approaches, there are still several challenges whose resolution could enhance the proposal. Improvements in both the scheduling design and the implementation layout of the KUTHS mechanism could lead to efficiency gains and power reduction. Therefore, we seek to incorporate a smarter criticality predictor based on monitoring the cores' instructions per cycle, cache hit/miss ratios, and thread activity. The following chapter presents the scheduler with some of these improvements applied. These enhancements require additional hardware counters but let us better identify and manage a significant share of the critical sections of the user code while keeping the hardware implementation feasible. They also allow us to make better scheduling decisions based on the characteristics of the user code currently running.

7 TCS: Trait-aware Criticality Scheduler for Hardware-Threads

The parallel execution of multi-threaded applications, as well as multiprocess workloads, on single-ISA asymmetric multi-cores offers a potential speedup gain. In Section 1.2 we noted the importance and the impact that scheduling based on fairness, criticality and a workload's characteristics may have on the potential speedup. These features, especially a workload's characteristics, are relevant to asymmetric systems because workloads perform differently on different core types depending on their characteristics. Consequently, in an asymmetric system, it may be beneficial to correlate a workload's execution behaviour with a particular core type in order to ascertain the workload's characteristics dynamically and enhance the scheduling strategy.

We propose the Trait-aware Criticality Scheduling (TCS) policy, which is an improvement upon the HRRS scheduling policy. While the HRRS scheduling already tackles the problem of fairness, the TCS scheduling uses the short-term traits of the hardware threads to enhance the decision-making process during scheduling in a manner that is more in tune with a workload's specific characteristics. Additionally, in order to adapt the scheduling policy to tackle the software thread criticality, the TCS scheduler uses the long-term traits of the software threads during the scheduling decision-making process. In the next sections we describe the proposed Trait-aware Criticality Scheduling (TCS) policy and discuss its hardware implementation.

7.1 TCS Algorithm Basis

Since the basis of the TCS scheduler is the HRRS scheduling policy, which we presented in Chapter 5, we briefly describe it here again to help compare the two approaches. To explain the mechanics of the HRRS approach, Fig. 7.1 shows an x86 ACMP system containing one large out-of-order (OoO) core and three smaller, identical in-order cores. The four identical logical cores in the figure correspond to four identical hardware threads. The HRRS maps the threads running on the logical cores to the physical cores after every hardware-quantum. This provides an abstracted homogeneous hardware view to the operating system. The OS scheduler maps software threads to the logical cores, i.e. the hardware threads. This allows the OS scheduling policies and implementation to be left unchanged. The OS scheduler, which is triggered after every software-quantum, executes much less frequently than the HRRS, which is triggered after every hardware-quantum. In essence, the HRRS can be viewed as mapping the logical cores, which the OS sees and schedules threads onto, to the physical cores of the underlying hardware, which actually execute the threads.

It is important to note the defining characteristic of the HRRS algorithm: it evenly rotates the threads (scheduled onto the logical cores by the OS scheduler) across the physical cores after every hardware-quantum.
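The even rotation described above can be sketched in a few lines of Python. This is a software illustration only, under stated assumptions: the core names in `PHYSICAL_CORES`, the function name `hrrs_rotation`, and the direction of rotation are hypothetical, since the text only specifies that the logical cores are rotated evenly over one large and three small physical cores after every hardware-quantum.

```python
from collections import deque

# Assumed naming for the 1 large + 3 small physical cores of Fig. 7.1.
PHYSICAL_CORES = ["large", "small0", "small1", "small2"]


def hrrs_rotation(num_quanta):
    """Yield one logical-to-physical core mapping per hardware-quantum.

    Logical cores are identified by the integers 0..3. After every
    hardware-quantum the assignment is rotated by one position, so each
    logical core spends the same number of quanta on every physical core.
    """
    order = deque(range(len(PHYSICAL_CORES)))
    for _ in range(num_quanta):
        # Slot i of the rotating order runs on physical core i.
        yield {logical: PHYSICAL_CORES[slot] for slot, logical in enumerate(order)}
        order.rotate(1)  # assumption: rotate by one position per quantum


for quantum, mapping in enumerate(hrrs_rotation(4)):
    on_large = [lc for lc, pc in mapping.items() if pc == "large"]
    print(f"hardware-quantum {quantum}: logical core {on_large[0]} runs on the large core")
```

Over any window of four consecutive hardware-quanta, each logical core occupies the large core exactly once, which is the fairness property the OS-visible homogeneous view relies on. The OS scheduler is not modelled here; it would only change which software thread sits on each logical core.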

In document Hardware thread scheduling algorithms for single-ISA asymmetric CMPs (Page 90-94)