5.6.6 Cloud Operator Revenue
The throughput gains achieved by Scavenger can provide a cloud operator with additional profit compared to temporal partitioning or selling GPU time as a unit. Figure 5.15 shows the additional revenue that Scavenger generates. This model assumes that the client with the primary workload pays for its targeted fraction of performance (e.g., $0.70 for 70% performance), even if Scavenger exceeds that level of performance. For this simple model, the batch tier pays $0.01 for each percentage point of its alone-run performance that the system achieves. Some of the additional value created could be used to discount batch tier service to compensate for the decreased performance of batch jobs relative to running them alone, with the rest captured as profit. The results show that matching complementary workloads is key to realizing maximum revenue, with a COM workload able to gain up to an extra $0.23 per dollar when matched with a MEM secondary as opposed to another COM.
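The pricing rule above can be written out as a short calculation. The following is a minimal sketch, not the model code behind Figure 5.15: the function names are hypothetical, the $1.00 baseline for a GPU sold as a unit is an assumption inferred from the "$0.70 for 70% performance" example, and the input values are illustrative rather than measured results.

```python
# Sketch of the simple revenue model described in the text.
# Assumption: every percentage point of alone-run performance is worth $0.01,
# so a GPU sold as a whole unit earns 100 cents per unit of time.
CENTS_PER_PERCENT = 1.0
BASELINE_CENTS = 100.0  # assumed revenue when the GPU is sold as a unit


def primary_revenue_cents(target_pct: float) -> float:
    """The primary client pays for its performance target, even if exceeded."""
    return CENTS_PER_PERCENT * target_pct


def batch_revenue_cents(batch_pct_of_alone: float) -> float:
    """The batch tier pays per percentage point of alone-run performance achieved."""
    return CENTS_PER_PERCENT * batch_pct_of_alone


def shared_revenue_cents(target_pct: float, batch_pct_of_alone: float) -> float:
    """Total revenue per unit of time when both tiers share the GPU."""
    return primary_revenue_cents(target_pct) + batch_revenue_cents(batch_pct_of_alone)


def extra_revenue_cents(target_pct: float, batch_pct_of_alone: float) -> float:
    """Revenue gained over the assumed whole-GPU baseline."""
    return shared_revenue_cents(target_pct, batch_pct_of_alone) - BASELINE_CENTS


if __name__ == "__main__":
    # Illustrative inputs only: a 70% primary target and a batch job
    # reaching 45% of its alone-run performance.
    print(shared_revenue_cents(70, 45), extra_revenue_cents(70, 45))
```

Under these assumptions, sharing adds revenue whenever the batch tier's achieved percentage exceeds the performance the primary client gave up relative to its target, which is why pairing complementary workloads matters in this model.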
5.7 Related Work
Multi-Application GPUs: Several previous works have detailed how to run multiple applications simultaneously on a GPU while optimizing for overall throughput or fairness. One line of research uses compiler techniques to interleave instructions from multiple kernels. Guevara et al. [40] merged the source code of multiple kernels into one. Another set of techniques scheduled data transfers and kernel executions for multiple kernels or applications on a non-preemptible GPU. Rossbach et al. [120] created a dataflow programming model that can build schedules with fairness guarantees. A third stream developed ways to execute multiple applications simultaneously with hardware support. Adriaens et al. [3] showed throughput benefits with spatial partitioning. Wang et al. [141] and Xu et al. [146] developed methods to share GPU resources inside SMs. Park et al. [108] developed techniques for finding allocations that optimize throughput or fairness, Jog et al. [57] create a fair DRAM scheduler for co-running applications, and Dai et al. partition memory requests [23], complementing work on CMP memory requests [26]. Preemption techniques detailed by Tanasic et al. [134] and Park et al. [106] allow these resource partitions to be adjusted.
GPU QoS and performance targets: Previous work has implemented quality-of-service support either in software, in the runtime system or device driver, or in hardware for spatial partitioning and SMK systems. On the software side, Kato et al. [63] use the device driver to provide a soft real-time guarantee for graphics workloads when running with a compute task. Lee et al. [70] build a real-time scheduler for launching non-preemptible kernels, and Chen et al. [21, 20] create systems for building kernel launch schedules that achieve QoS goals while maximizing utilization. In hardware, Aguilera et al. [4] develop a QoS system for spatial multitasking, and Wang et al. [142] create a QoS system for SMK GPUs by adjusting the thread block allocation while tracking performance quota. Scavenger manages not only thread blocks but also active warps and memory resources, and can determine the IPC goal at run time without it being provided by an OS-level scheduler.
CPU resource partitioning: Multi-core CPUs share memory system resources, including cache capacity and bandwidth. Guo et al. [41] provide a quality-of-service framework that steals resources from jobs with excess resource allocations in CMPs. Nesbit et al. [94] create virtual private caches that prevent threads from interfering with other threads’ cache bandwidth, and extend their concepts to virtual private machines in [95]. Xu et al. [145] create a performance model that finds cache allocation sizes for multiple processes online. Lee et al. [71] measure the resource allocations needed to reach a given QoS goal in a CMP. GPUs have different memory access characteristics and caching is useful for different purposes on GPUs, requiring different approaches.
Feedback control has been used for resource allocation in data centers and CPUs. In data center scheduling, Lo et al. [88] use feedback control and offline profiling data to maintain a latency quality-of-service target while batch tasks also execute, managing cache, bandwidth, network, and power resources. Sharifi et al. [125] use feedback control in the operating system to manage resources inside a CMP, including cores, cache capacity, and memory bandwidth. Li et al. [81] also integrate feedback control into a CMP. Scavenger’s feedback controllers partition resources inside cores at a finer granularity, are implemented in hardware rather than in the OS, and are designed for the decentralized GPU architecture.
Online performance estimation: Subramanian et al. [130] estimate performance of applications when run alone on CMPs using cache metrics and memory controller priority. Eyerman et al. [29] track waiting cycles for applications co-running using SMT to estimate alone execution time. Besides alone execution time, online techniques have estimated power consumption [16] using performance counters.
5.8 Conclusion
Data center and public cloud operators must maximize the utilization of their hardware, which increasingly includes accelerators and GPUs. Sharing a GPU among multiple workloads can increase throughput and utilization, but the interference between the workloads must be controlled. Scavenger is a system that controls interference to create two tiers of service on a shared GPU while still increasing throughput: one with a performance target and one for batch jobs. The key techniques enabling Scavenger are an online performance predictor for the primary workload and a set of dynamic resource allocation controllers. Scavenger can increase the throughput of the batch tier by 1.35x relative to temporal partitioning while maintaining a primary workload at 90% of its performance relative to running alone, for an overall GPU throughput increase of 9.3%.