3.8 Conclusion and Implications for Predictable Execution

This work presented the first investigation of collaborative execution of computational kernels on a fused CPU-GPU architecture with a shared LLC using fine-grained SVM. We contributed two novel device-side co-scheduling methods that perform scheduling within the kernel code. We showed that device-side enqueuing introduces considerable overhead, which stems from evaluating the block syntax used to enqueue kernels from the device (up to 6× execution time increase); this is too much for device-side enqueuing to be suitable for implementing co-scheduling methods. Our host-side co-scheduling method achieved on average 96.8% of the performance of the xor-Oracle, a clairvoyant and thus hypothetical scheduler that makes the optimal per-kernel choice of exclusive CPU or GPU usage, as well as speedups of 1.43× and 1.25× over execution on the GPU only and the CPU only, respectively. It also provided a 1.29× speedup over ‘atomic counting’, the best device-side co-scheduling method, because it does not add overhead to kernel execution once profiling is done. This makes our host-side co-scheduling method the most competitive practical scheme to date. We further showed that cache coherency is the major performance bottleneck in current fused CPU-GPU architectures with a shared LLC. When CPU and GPU execute kernels in parallel on an Intel architecture, cache-related stalls observed on the CPU can increase by up to 1.75×, while the number of cache misses remains the same, compared to executing the same work first on the CPU and only then on the GPU (while the CPU is idle).

However, some benchmarks benefited considerably from collaborative execution on CPU and GPU (up to 1.23× speedup) compared to using the most suitable device. Whether cache coherency becomes a performance bottleneck depends on the memory access patterns of the kernels. In future work, it will be crucial to categorize these memory access patterns and to design optimizations that alleviate this bottleneck, enabling even more effective co-scheduling of kernels on fused CPU-GPU architectures. The trend of processor integration in high-performance architectures thus proves to be a double-edged sword: it can eliminate data transfers to private memories of heterogeneous compute devices and enable co-computation of kernels by, e.g., CPU and GPU, resulting in high performance within a limited power and area budget (which is crucial, e.g., for embedded
systems). At the same time, the potential for resource conflicts (and their complexity) increases. While these conflicts can most certainly be resolved for average-case performance, resolving them for predictable performance will be more challenging for future research. The presented cache coherency bottleneck adds a shared last-level cache between CPU and GPU to the growing list of microarchitectural features that can benefit average-case performance but lead to resource conflicts so complex that they are virtually infeasible to analyze for execution time guarantees.

This chapter presented novel co-scheduling approaches for fused CPU-GPU architectures in a case study on how performance is achieved in an off-the-shelf platform. It provided further evidence that high-performance architectures, which were designed for average-case performance, are not suitable for hard real-time systems that require execution time guarantees. Thus, the following chapters take a different approach to obtain predictable performance and build on a system that is already amenable to WCET analysis. As motivated in Chapter 1, the architectural design of such a system lags years behind that of current platforms like the one discussed in this chapter. The focus of the following chapters will therefore be to achieve both high performance and WCET guarantees by introducing runtime-reconfigurable accelerators.
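As a brief aside on the evaluation summarized in this section, the xor-Oracle baseline can be made concrete with a minimal sketch. It relies only on the definition given above (optimal per-kernel choice of exclusive CPU or GPU usage); the function names, kernel names, and timings are hypothetical placeholders and not measurements from this work.

```python
from typing import Dict, Tuple

# Hypothetical per-kernel execution times in milliseconds: (cpu_only, gpu_only).
KernelTimes = Dict[str, Tuple[float, float]]


def xor_oracle_time(times: KernelTimes) -> float:
    """Total time if every kernel runs exclusively on its faster device."""
    return sum(min(cpu, gpu) for cpu, gpu in times.values())


def device_only_time(times: KernelTimes, device: int) -> float:
    """Total time if every kernel runs on one device (0 = CPU, 1 = GPU)."""
    return sum(pair[device] for pair in times.values())


def speedup(baseline_time: float, candidate_time: float) -> float:
    """Speedup of the candidate over the baseline (greater than 1 is faster)."""
    return baseline_time / candidate_time


# Illustrative values only, chosen so that neither device wins every kernel.
kernel_times: KernelTimes = {
    "kernel_a": (10.0, 4.0),
    "kernel_b": (3.0, 9.0),
    "kernel_c": (6.0, 6.5),
}

oracle = xor_oracle_time(kernel_times)
cpu_only = device_only_time(kernel_times, 0)
gpu_only = device_only_time(kernel_times, 1)

print(f"xor-Oracle: {oracle} ms")
print(f"speedup over CPU only: {speedup(cpu_only, oracle):.2f}x")
print(f"speedup over GPU only: {speedup(gpu_only, oracle):.2f}x")
```

Because the oracle always picks the faster device per kernel, it can never be slower than either single-device execution. It is, however, not an upper bound for collaborative execution, which uses both devices at once and may therefore outperform it on individual kernels.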

4 Runtime Reconfiguration under WCET Guarantees

The goal of this and the following chapters is to achieve timing-analyzable performance by employing hardware accelerators that speed up the tasks’ most compute-intensive parts, so-called computational kernels (also known as hotspots), which consist of one or more nested loops. If these accelerators were implemented as application-specific integrated circuits, the system would lack flexibility with respect to revised standards or new algorithms. Instead, using a runtime-reconfigurable architecture (which employs an FPGA, see Section 2.3) maintains high flexibility and even allows for reconfiguring the accelerators at runtime, thereby increasing performance as well as computing efficiency (compared to a static set of accelerators) at the cost of a more complex timing analysis. The aim of this chapter is to enable guaranteed reconfiguration delays, which were previously unavailable, for configuring accelerators onto the reconfigurable area. The following chapters will build on these guaranteed reconfiguration delays to achieve guaranteed WCETs of tasks that employ runtime reconfiguration of accelerators.

Existing work on runtime reconfiguration in the context of real-time systems implicitly assumes that the process of reconfiguration itself complies with timing guarantees [16, 27, 29, 36, 93], e.g., the time it takes to configure a hardware accelerator on the reconfigurable fabric (reconfiguration delay) is assumed to be constant and free from conflicts with other system components that could affect WCET guarantees. A runtime reconfiguration controller that fulfills these assumptions and is amenable to WCET guarantees has so far been unavailable. However, guaranteed reconfiguration delays are crucial to realize runtime-reconfigurable real-time systems. The novel contributions of this chapter are as follows:

• A runtime reconfiguration controller called “Command-based Reconfiguration Queue” (CoRQ) that provides guaranteed latencies for its operations and supports timing analysis for WCET guarantees. It was released as an open-source project, including examples and benchmarks.

• We show that conflicts while accessing a shared main memory during reconfiguration can lead to a slowdown of more than 21× in reconfiguration bandwidth. In contrast, CoRQ guarantees constant reconfiguration delays even under heavy system bus load.
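To illustrate why the bandwidth slowdown named in the second contribution matters for timing analysis, consider the following back-of-the-envelope sketch. It assumes that the reconfiguration delay is simply the bitstream size divided by the sustained reconfiguration bandwidth, which is a simplification and not the timing model of CoRQ. The bitstream size and the nominal bandwidth are hypothetical values; only the slowdown factor of 21× is taken from the text.

```python
def reconfiguration_delay_ms(bitstream_bytes: int, bandwidth_bytes_per_s: float) -> float:
    """Delay for streaming a bitstream at a given sustained bandwidth.

    Simplifying assumption: delay = size / bandwidth, with no fixed setup cost.
    """
    return bitstream_bytes * 1e3 / bandwidth_bytes_per_s


BITSTREAM_BYTES = 500_000      # hypothetical accelerator bitstream size
NOMINAL_BANDWIDTH = 100e6      # hypothetical conflict-free bandwidth (bytes/s)
SLOWDOWN_FACTOR = 21.0         # bandwidth slowdown under memory conflicts (from the text)

# Delay when the reconfiguration controller never waits for the shared memory.
conflict_free = reconfiguration_delay_ms(BITSTREAM_BYTES, NOMINAL_BANDWIDTH)

# Delay when conflicts reduce the sustained bandwidth by the slowdown factor.
under_conflicts = reconfiguration_delay_ms(
    BITSTREAM_BYTES, NOMINAL_BANDWIDTH / SLOWDOWN_FACTOR
)

print(f"conflict-free delay:   {conflict_free:.1f} ms")
print(f"delay under conflicts: {under_conflicts:.1f} ms")
print(f"ratio:                 {under_conflicts / conflict_free:.1f}x")
```

Under this simple model, the delay grows by the same factor as the bandwidth shrinks. A timing analysis without a guaranteed reconfiguration delay would therefore have to budget for the slow case on every reconfiguration, which is the pessimism that a constant delay avoids.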

In document Worst-Case Execution Time Guarantees for Runtime-Reconfigurable Architectures (Page 44-47)