4.3 GPU Architecture with Multi-Context Register File
4.3.3 Context-Aware Warp Scheduling
In our architecture, a scheduler needs to perform two major functions. First, similar to the baseline GPU, the scheduler needs to select which warp will be issued in each cycle. However, unlike the baseline, the scheduler is restricted to choose a warp from the active contexts of two register file sub-banks. Addition- ally, the scheduler needs to also manage multiple contexts for each sub-bank by deciding when a context switch should happen and which context should become active. The context switches need to be carefully managed consider- ing multiple factors including DRAM refreshes, context switch overheads, and fairness.
In this subsection we discuss how the warp scheduling algorithm can be augmented to accommodate additional constraints from the hybrid memory. More specifically, there are three major design considerations for the scheduler:
• Correctness: The scheduler must ensure that dormant contexts in DRAM are preserved by reloading each context before it reaches the DRAM re- tention limit.
• Efficiency: To minimize pipeline stalls and context switch energy, the scheduler needs to minimize unnecessary context switches and intelli- gently schedule context switches to hide their latencies.
• Fairness: The scheduler needs to be fair to all warps, providing equal op- portunities to execute. Unfair scheduling can lead to a noticeable perfor- mance degradation by leaving only a small number of warps towards the end of program execution.
While it is relatively straightforward to deal with the DRAM refresh require- ment, we found that there is a fundamental tension between reducing the num- ber of context switches and improving fairness. In order to minimize context switches, the scheduler should issue warps from one active context as long as there is a “ready warp” (warp that is ready to be issued) in the context. How- ever, such a greedy algorithm may lead to unfairness, especially when a sub- set of contexts have more active warps than others; contexts with more active warps are more likely to get chances to execute. We found that the unfairness can significantly degrade performance for some applications by leaving a small number of warps to run towards the end of an execution without enough warps to fully utilize the pipeline.
Our scheduling algorithm deals with the tension between context switch overheads and fairness by setting a threshold on how long that each context can run without switching to another context with ready warps. While this thresh- old does not strictly guarantee fairness, we found that the scheme works quite well in practice, close to more complex scheduling algorithms with stronger fairness guarantees. The evaluation section provides further discussions on the trade-off between context switches and fairness based on simulations.
Figure 4.5 shows the flowchart for the new scheduler. To illustrate how a scheduling algorithm is changed for the multi-context register files, we use a simple baseline scheduling algorithm, which selects a ready warp in a round- robin fashion. To aid its decisions, the scheduler maintains three types of coun- ters per context: refresh counter (rcnt), schedule counter (scnt), and ready-warp counter (wcnt). The refresh counter keeps track of when a DRAM context needs to be refreshed. The schedule counter indicates how many times a warp in a
Refresh required? No
Initiate a context switch for refresh Start
Initiate a context switch in B1 to the dormant context w/ most ready warps
B1: most recently used sub-bank B2: stand-by sub-bank
C1: active context in B1 C2: active context in B2
Reset the schedule counter (scnt[C1] = 0)
Issue the next ready warp from C1
Done for a cycle
Issue the next ready warp from C1. Set the
scnt[C1] to TS+1 Update the C1 schedule counter C1 time-out? (scnt[C1] > TS?) Ready warp in C1? Ready warp in C2? Dormant Context with a ready
warp in B2?
Issue the next ready warp from C2. Update scnt[C2].
Dormant Context with a ready
warp in B1? Ready warp in C1? Initiate a context switch in B2 to the dormant context w/ most ready warps
Initiate a context switch in B1 to the dormant context w/ most ready warps Yes Yes Yes Yes Yes Yes No No No No No No Yes
Figure 4.5: Flow chart for warp scheduling and register file context switch- ing.
context has been checked for possible issue by the round-robin scheduler with- out a forced context switch. The ready-warp counter indicates how many warps are ready for each context.
As shown in the figure, in each cycle, the scheduler first checks if any dor- mant context needs to be refreshed, and initiates a context switch to make the dormant context active if necessary. The corresponding sub-bank is excluded from scheduling decisions until the switch is completed.
To determine which warp to issue in a given cycle, the scheduler first tries to select a ready warp from the active context (C1) of the most recently used sub-bank (B1) in a round-robin fashion. If the schedule counter for C1 (scnt[C1]) reaches a fairness threshold (TS) or there is no ready warp in C1, the scheduler
tries to pick a ready warp from the active context (C2) of the other sub-bank (B2, called stand-by sub-bank) and initiates a background context switch in B1 if successful. If there is no ready warp in the stand-by sub-bank (B2), the scheduler searches for a dormant context with a ready warp and initiates a context switch. To reduce future context switches, the scheduler selects a dormant context with the largest number of ready warps on a context switch. Finally, if no warp from another context is ready, a ready warp from the current active context (C1) is allowed to issue even if the threshold is reached.
While not shown in the flowchart to keep the chart simple, our scheduler im- plementation contains one more optimization to hide context switch latencies. In each cycle, while issuing a warp from one sub-bank, the scheduler checks how many ready warps are left in the current active context of that sub-bank. If the number is low, just enough to hide a context switch latency, and there is no ready warp in the active context of the stand-by sub-bank, the scheduler
initiates a context switch in the stand-by sub-bank.
4.4
Evaluation
This section evaluates the performance, area, and energy consumption of the proposed hybrid memory and GPU architecture through detailed simulations.