Chapter 2: Background and Prior Work
2.5 Prior Work on Accelerator Scheduling
2.5.1.3 GPU Resource Scheduling
The final category of real-time GPU research concerns the scheduling of GPU resources, that is, the problem of scheduling both data movement and GPU computations on one or more GPUs shared by competing jobs of different priorities. Work in this area seeks to develop real-time GPU scheduling algorithms, as well as analytical models to support schedulability analysis. This dissertation falls within this category.
TimeGraph is an early approach to the real-time scheduling of modern GPUs (Kato et al., 2011b). TimeGraph plugs into an open-source GPU driver, where it intercepts the GPU commands issued by GPU-using applications. TimeGraph schedules a GPU as a single processor, scheduling intercepted commands according to a configurable scheduling policy. TimeGraph supports two scheduling policies: the “high-throughput” (HT) policy and the “predictable-response-time” (PRT) policy. The HT policy allows commands from a task to be scheduled immediately, provided that the GPU is idle, or that commands from that task are currently scheduled on the GPU and no commands from higher-priority tasks are waiting to be scheduled. This policy promotes throughput at the risk of introducing priority inversions: the scheduling of new commands may extend the delay experienced by higher-priority commands issued soon after. The PRT policy decreases the risk of lengthy priority inversions, as new GPU commands of a task are not scheduled until all of its prior commands have completed. TimeGraph monitors the completion status of commands by plugging into the interrupt handler of the open-source device driver.
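The two admission rules described above can be illustrated with a minimal sketch. This is not TimeGraph's implementation: the `GpuState` bookkeeping, the function names, and the convention that a larger number means a higher priority are all assumptions made for illustration, and the PRT check models only the condition stated in the text.

```python
from dataclasses import dataclass, field


@dataclass
class GpuState:
    # Hypothetical bookkeeping, not TimeGraph's actual data structures.
    in_flight: dict = field(default_factory=dict)  # task -> commands dispatched, not yet completed
    waiting: dict = field(default_factory=dict)    # task -> priority, for tasks with queued commands


def ht_admits(state, task, priority):
    """HT rule: dispatch if the GPU is idle, or if this task already has
    commands on the GPU and no higher-priority task is waiting.
    Assumption: a larger number means a higher priority."""
    gpu_idle = not any(state.in_flight.values())
    if gpu_idle:
        return True
    on_gpu = state.in_flight.get(task, 0) > 0
    higher_waiting = any(p > priority for t, p in state.waiting.items() if t != task)
    return on_gpu and not higher_waiting


def prt_admits(state, task):
    """PRT rule as stated in the text: a task's new commands wait until
    all of its previously dispatched commands have completed."""
    return state.in_flight.get(task, 0) == 0
```

With task A already running and no one waiting, `ht_admits` lets A keep issuing commands while `prt_admits` holds them back; this is the throughput versus response-time trade-off between the two policies.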
GPU-using tasks under TimeGraph may be assigned a fixed priority and a GPU utilization budget.[26] A task’s GPU budget is drained as it executes commands on a GPU. TimeGraph supports two budget enforcement mechanisms: “posterior enforcement” (PE) and “a priori enforcement” (AE). Under PE, the budgetary deficit incurred by a task’s budget overrun is recouped by delaying further scheduling of the offending task until its budget has been replenished. The PE strategy is to recover from budget overruns. Under AE, TimeGraph attempts to anticipate budget exhaustion. This is done by matching the sequence of a task’s requested GPU commands against a historical record of previously issued GPU command sequences. The historical record contains an average execution time, which is taken as the predicted execution time of the requested commands. A task’s requested GPU commands are not eligible for scheduling until the task’s budget is sufficient to cover the predicted execution time. Thus, the AE strategy is to avoid budget overruns.
[26] TimeGraph also supports online priority assignment for graphics (i.e., non-compute) applications, where the foreground application
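The difference between posterior and a priori enforcement can be sketched as follows. The class and method names are hypothetical. The rule that a task becomes eligible under PE once its budget is positive again, and the choice to predict zero cost for a command sequence with no history, are assumptions of this sketch and are not taken from TimeGraph.

```python
from statistics import mean


class BudgetedTask:
    """Hypothetical model of a task with a GPU execution-time budget."""

    def __init__(self, budget):
        self.budget = budget
        self.history = {}  # command-sequence key -> observed execution times

    def charge(self, seq, elapsed):
        """Drain the budget by the measured execution time and record it."""
        self.budget -= elapsed
        self.history.setdefault(seq, []).append(elapsed)

    def replenish(self, amount):
        self.budget += amount

    def eligible_pe(self):
        """Posterior enforcement: an overrun leaves a deficit, and the task
        is held back until replenishment has repaid it (assumed: budget > 0)."""
        return self.budget > 0

    def eligible_ae(self, seq):
        """A priori enforcement: predict the cost as the average of past
        runs of the same command sequence and require the budget to cover it.
        Assumption: an unseen sequence is predicted to cost nothing."""
        past = self.history.get(seq)
        predicted = mean(past) if past else 0.0
        return self.budget >= predicted
```

For example, a task with a budget of 10 that runs a 12-unit command sequence ends with a deficit of 2. After one replenishment of 10 it passes the PE check, but AE still rejects the same sequence because the predicted cost of 12 exceeds the remaining budget of 8.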
TimeGraph is somewhat limited in the real-time GPGPU domain. As presented by Kato et al. (2011b), TimeGraph is targeted to graphics applications, such as video games and movie players, rather than GPGPU applications. It focuses on providing a configurable quality-of-service for applications, while loosely adhering to a fixed-priority real-time task model. TimeGraph also unifies the GPU EE and CE processors into one; we discussed the negative effects on schedulability of such an approach in Section 2.4.4. TimeGraph is limited in this way because it schedules GPU commands without inspecting them to determine their function. Separate EE and CE scheduling is impossible without this inspection.
The developers of TimeGraph later developed RGEM, a real-time GPGPU scheduler (Kato et al., 2011a). RGEM is similar to TimeGraph in that it also supports fixed-priority scheduling. However, RGEM operates entirely in user space through a user-level API. The RGEM API provides functions for issuing DMA operations and launching GPU kernels. These functions invoke GPU scheduler routines. Because RGEM is implemented in user space, GPU scheduler state is maintained in shared memory accessed by each GPGPU task. Tasks that cannot be scheduled immediately on the GPU are suspended from the CPU, waiting for a message to proceed (delivered through a POSIX message queue). Perhaps the most notable of RGEM’s contributions is how it addresses schedulability problems caused by long non-preemptive DMA operations. Here, RGEM breaks large DMA operations into smaller chunks, reducing the duration of priority inversions and thus improving schedulability.
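The effect of chunking on priority-inversion length can be sketched as follows. This is an illustration of the idea only, not RGEM's API: the function names, the chunk size, and the constant-bandwidth cost model (which ignores per-transfer setup overhead) are assumptions.

```python
def chunk_transfer(total_bytes, chunk_bytes):
    """Yield (offset, length) pairs that cover total_bytes using pieces
    of at most chunk_bytes each."""
    if chunk_bytes <= 0:
        raise ValueError("chunk_bytes must be positive")
    offset = 0
    while offset < total_bytes:
        length = min(chunk_bytes, total_bytes - offset)
        yield offset, length
        offset += length


def worst_case_blocking(total_bytes, chunk_bytes, bytes_per_us):
    """Longest non-preemptive interval, in microseconds, that one chunk can
    impose on a higher-priority request under a constant-bandwidth model."""
    longest = max((n for _, n in chunk_transfer(total_bytes, chunk_bytes)), default=0)
    return longest / bytes_per_us
```

Under this model, the blocking that a higher-priority task can suffer from a lower-priority copy is bounded by the duration of one chunk rather than the whole transfer. A scheduler can make a new decision at each chunk boundary, at the cost of issuing more transfers.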
RGEM has several advantages over TimeGraph for GPGPU applications. Unlike TimeGraph, RGEM utilizes techniques that make it amenable to schedulability analysis under rate-monotonic scheduling. Also, RGEM separately schedules a GPU’s EE and CEs. However, as presented in Kato et al. (2011a), RGEM provides no budget enforcement mechanisms.
The notion of breaking large non-preemptive GPU operations into smaller ones has also been explored by Basaran and Kang (2012). In addition to chunked DMA, Basaran and Kang also developed a mechanism for breaking large GPU kernels into smaller ones. Here, the kernel’s grid of thread blocks is programmatically split into smaller sub-grids that are launched as separate kernels. Unfortunately, this kernel-splitting requires developers to modify GPGPU kernel code. As we discussed in Section 2.4.2.1, threads must compute their spatial location (index) within a grid. Kernel-splitting requires the kernel code to include additional spatial offsets in the index computation. Zhong and He (2014) recently developed a method to make these offset calculations transparent to the programmer in a framework called Kernelet. Kernelet programmatically analyzes kernel code and patches indexing calculations at runtime. However, no one has yet attempted to apply this technique in a real-time setting; Kernelet’s just-in-time patching of GPU kernel code may present a challenge to real-time analysis.
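The index offset that kernel-splitting adds can be shown with a small CPU-side emulation of a one-dimensional grid. This is a sketch for illustration, not the code of Basaran and Kang or of Kernelet: the sequential `launch` loop stands in for a GPU kernel launch, and `scale_kernel`, `launch_split`, and the `block_offset` parameter are hypothetical names.

```python
def scale_kernel(block_idx, thread_idx, block_dim, data, out, block_offset=0):
    """Emulated kernel body for one thread. block_offset is the extra term
    that kernel-splitting adds to the global index computation."""
    i = (block_idx + block_offset) * block_dim + thread_idx
    if i < len(data):
        out[i] = 2 * data[i]


def launch(kernel, grid_dim, block_dim, *args, **kwargs):
    """Emulate one kernel launch by running every thread of every block in turn."""
    for b in range(grid_dim):
        for t in range(block_dim):
            kernel(b, t, block_dim, *args, **kwargs)


def launch_split(kernel, grid_dim, block_dim, sub_grid, *args):
    """Launch the same grid as several smaller sub-grids, passing each one
    the index of its first block so threads still find their global index."""
    for first in range(0, grid_dim, sub_grid):
        n = min(sub_grid, grid_dim - first)
        launch(kernel, n, block_dim, *args, block_offset=first)
```

Calling `launch_split(scale_kernel, 3, 4, 2, data, out)` produces the same output as `launch(scale_kernel, 3, 4, data, out)`, but as two separate launches. On a real GPU, the boundary between sub-grid launches is where a scheduler could interleave higher-priority work.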
Kato et al. and Basaran and Kang examined GPU scheduling strictly in terms of the sporadic task model. A different approach has been taken by Verner et al. (2012), where GPU operations of various jobs of sporadic tasks are combined into a batch at runtime and scheduled jointly. Here, batches of GPU work execute in a four-stage pipeline: data aggregation, DMA data transfer from host to device memory, kernel execution, and DMA data transfer of results from device to host memory. GPU work is batched once every $\frac{1}{4}d_{\min}$ time units, or one quarter of the shortest relative deadline in the task set. Consecutive batches may execute concurrently, each in a different stage of the pipeline.[27] Verner et al. (2014a,b) have continued research on batched scheduling for multi-GPU real-time systems. Although their work is targeted to hard real-time systems, Verner et al. consider schedulability only in terms of the GPUs. The real-time scheduling of CPU-side GPGPU work (i.e., triggering DMA and launching kernels) remains unaddressed.
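The batching interval and the overlap of consecutive batches can be sketched as follows. The function names are hypothetical, and the sketch assumes that each pipeline stage occupies exactly one batching interval. That assumption is a simplification introduced here, not a statement of how Verner et al. bound stage durations.

```python
STAGES = ("aggregate", "copy_in", "kernel", "copy_out")


def batch_period(relative_deadlines):
    """One quarter of the shortest relative deadline in the task set."""
    return min(relative_deadlines) / 4


def pipeline_snapshot(tick):
    """Return {stage: batch index} for the batches in flight during interval
    number `tick`, assuming each stage takes exactly one batching interval."""
    return {stage: tick - k for k, stage in enumerate(STAGES) if tick - k >= 0}
```

For relative deadlines of 40, 100, and 60 time units, the batching interval is 10. By the fourth interval, four consecutive batches are in flight at once, one per stage. Under the one-interval-per-stage assumption, a batch leaves the pipeline four intervals after aggregation starts, which equals the shortest relative deadline.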
Thus far, we have discussed research in GPU resource scheduling largely in terms of systems development, i.e., the design and implementation of real-time GPU scheduling algorithms. Research on developing new analytical models has also been pursued. Kim et al. (2013) point out that conventional rate-monotonic schedulability analysis cannot be applied to task sets where tasks may self-suspend, unless suspensions are modeled as CPU execution time (i.e., suspension-oblivious analysis, as discussed in Section 2.1.6.3). In order to reclaim CPU utilization that would otherwise be lost in suspension-oblivious analysis, Kim et al. devised a task model whereby jobs are broken into sub-jobs, along the phases of CPU and GPU execution. The challenge then becomes assigning a unique fixed priority to each sub-job. Kim et al. showed that the determination of an optimal priority assignment is NP-hard. They presented and evaluated several priority-assignment heuristics.
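The utilization lost to suspension-oblivious analysis can be made concrete with a short sketch. It uses the classic Liu and Layland utilization bound as one example of a conventional rate-monotonic test; that choice, the function names, and the task parameters are assumptions made for illustration and are not the analysis of Kim et al.

```python
def rm_utilization_bound(n):
    """Liu and Layland sufficient utilization bound for n implicit-deadline
    periodic tasks under rate-monotonic scheduling."""
    return n * (2 ** (1 / n) - 1)


def cpu_only_utilization(tasks):
    """Utilization counting CPU execution only.
    tasks: iterable of (cpu_time, suspension_time, period)."""
    return sum(c / t for c, _, t in tasks)


def suspension_oblivious_utilization(tasks):
    """Utilization when each suspension (e.g., waiting on the GPU) is
    charged as if it were CPU execution time."""
    return sum((c + s) / t for c, s, t in tasks)


def passes_oblivious_test(tasks):
    tasks = list(tasks)
    return suspension_oblivious_utilization(tasks) <= rm_utilization_bound(len(tasks))
```

For two hypothetical tasks with (CPU time, suspension time, period) of (1, 2, 10) and (2, 3, 20), the CPU-only utilization is 0.2, while the suspension-oblivious utilization is 0.55. The difference of 0.35 is capacity that the analysis treats as consumed even though the CPU is idle during those suspensions.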