Related Work - Data Resource Management in Throughput Processors

CPU virtual register files and instruction clustering: Oehmke et al. [103] created a virtual context architecture for CPUs that services registers from a cache of a register space in memory. Because many threads are active at the same time, the amount of data in a GPU’s register working set is much larger, so our technique requires more active management of the register cache. Roth [121] describes techniques for releasing virtual registers when they are no longer needed. Architectures such as TRIPS [122,34] and others using block-structured ISAs, described by Melvin et al. [90], execute blocks of code similar to our regions. Work such as that of Ponomarev et al. [112] handles short-lived values separately from other registers. Yan et al. [147] allow short-lived values to be communicated through a CPU’s forwarding network. We use regions as an overlay on a traditional ISA.

GPU register caching and RF size reduction: Vijaykumar et al. [136] oversubscribe resources, including registers, by annotating kernel phases. Our work focuses on reducing the size of hardware structures, and identifies more precisely the set of registers that need to be present. RegLess would be able to oversubscribe the register file without any design changes. Gebhart et al. [30] proposed a register cache in front of the main register file and a 2-level scheduling scheme to control access to the cache, to save the dynamic power of accessing the main register file. Other work by Gebhart et al. [32,31] sorted registers at compile time into a 3-level register storage hierarchy, also to save dynamic power. The novel contribution of our work is eliminating the main register file as a level in the register hierarchy. Gebhart et al. [33] also propose sharing the same SRAM structures between registers, shared memory, and L1 cache. Jeon et al. [50] allow new values to replace other warps’ dead values in the register file, allowing the size of the register file to be reduced. By removing the main register file and caching the active set, our technique reduces register storage to the minimum size needed to maintain performance.

Compiler-assisted GPU scheduling: Park et al. [107] use compiler annotations so the warp scheduler can prioritize warps that will soon issue a load. Wu et al. [143] expose hardware scheduling decisions on GPUs to programmers. Xie et al. [144] use a compiler to make optimal register allocation and thread throttling decisions. We add a layer of scheduling that makes dynamic decisions based on static analysis. Hsieh et al. [47] use compiler analysis to determine offload candidates for near-data processing. Li et al. [82] use compiler analysis to place data in different on-chip memory resources.

Resource-aware GPU scheduling: Jog et al. [61] classify warps into short and long latency to determine memory scheduling policy. Jog et al. coordinate warp scheduling with DRAM bank-level parallelism [59] and prefetching [60]. Li et al. [83] allocate cache space to a set of prioritized warps. Narasiman et al. [93] describe two-level scheduling to allow for larger warp sizes. Pichai et al. [111] show the need to coordinate warp scheduling and MMU designs. Pai et al. [105] use elastic kernels in order to better utilize registers. Gregg et al. [38] merge kernels to increase register utilization. Lee et al. [76] coordinate warp priority and access to cache resources. Liu et al. [86] prioritize warps to reduce time waiting for barriers. Kayiran et al. [64] adjust TLP for highest performance. Rogers et al. [116] use variable warp sizing and warp ganging to decrease the impact of memory divergence. Ausavarungnirun et al. [7] change cache and memory controller policies based on warp divergence.

Divergence-aware compiler techniques: ElTantawy et al. [28] track register dependencies for control-divergent threads separately in hardware, and use compiler analysis [27] to analyze control divergence to eliminate deadlocks. Rhu et al. [114] analyze divergence patterns to allow for better SIMD lane permutation. Anantpur et al. [5] transform control divergence using linearization. Jablin et al. [49] use traces for instruction scheduling on GPUs.

Value compression and scalarization: Lee et al. [74] compress register values using base-delta-immediate encoding introduced by Pekhimenko et al. [110], which reduces the number of register file banks needed to load and store registers. Gilani et al. [35] propose a GPU architecture with scalar units and 16-bit register reads. Abdel-Majeed et al. [2] use the redundant computations done between lanes for error detection. Kim et al. [67] exploit value structure using an affine functional unit. Stephenson et al. [126] show that a large fraction of register writes are constant across warps and threads. Pekhimenko et al. [109] compress data over the GPU interconnect while minimizing the number of toggles. Vijaykumar et al. [138] propose using excess GPU computation resources for memory compression. Keckler et al. [66] propose temporal SIMT, where scalar computations do not need to be computed by all threads.
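To illustrate the idea behind base-delta encoding of register values, the following is a minimal sketch and not the scheme of [74] or [110]. It assumes a single base (the first lane's value), 32-bit registers, and candidate delta widths of 0, 1, or 2 bytes; the function names (compress, decompress, compressed_size) are hypothetical. Published designs may use several bases and different widths.

```python
from typing import List, Tuple

WORD_BYTES = 4  # assumption: one 32-bit register value per thread


def fits_signed(value: int, nbytes: int) -> bool:
    """Return True if value is representable as a signed nbytes-byte integer."""
    if nbytes == 0:
        return value == 0
    limit = 1 << (8 * nbytes - 1)
    return -limit <= value < limit


def compress(values: List[int]) -> Tuple[int, int, List[int]]:
    """Encode per-lane values as (base, delta_bytes, deltas).

    The base is the first lane's value. delta_bytes is the narrowest width
    (0, 1, or 2 bytes) that holds every delta; WORD_BYTES means the values
    are treated as incompressible.
    """
    base = values[0]
    deltas = [v - base for v in values]
    for nbytes in (0, 1, 2):
        if all(fits_signed(d, nbytes) for d in deltas):
            return base, nbytes, deltas
    return base, WORD_BYTES, deltas


def decompress(base: int, deltas: List[int]) -> List[int]:
    """Recover the original per-lane values from the base and deltas."""
    return [base + d for d in deltas]


def compressed_size(delta_bytes: int, count: int) -> int:
    """Bytes needed to store count lanes at the chosen delta width."""
    if delta_bytes == WORD_BYTES:
        return WORD_BYTES * count  # stored uncompressed
    return WORD_BYTES + delta_bytes * count  # one base plus narrow deltas


if __name__ == "__main__":
    # Example: a warp of 32 lanes holding addresses with a stride of 4.
    lanes = [1000 + 4 * i for i in range(32)]
    base, width, deltas = compress(lanes)
    print(width, compressed_size(width, len(lanes)), WORD_BYTES * len(lanes))
```

In this sketch, values that are uniform or regularly strided across a warp need far fewer bytes than the uncompressed form, which is the property that lets a compressed register occupy fewer banks.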

Register file implementation: Abdel-Majeed et al. [1] reduce register file dynamic and leakage power by adding a drowsy state to the storage circuits and only reading register values for active lanes in a warp. Jing et al. [53] propose register file bank scheduling techniques that reduce bank conflicts. Namaki-Shoushtari et al. [92] power gate unused register file banks. Other work by Jing et al. [55] implemented the register file using eDRAM instead of SRAM and proposed refreshing the DRAM during bank bubbles [54]. Mao et al. [89] and Wang et al. [140] implement a register file using racetrack memory, and Goswami et al. [36] implement it using resistive memory. Tan et al. [132] implement the GPU register file using STT-RAM for energy savings, and Yu et al. [149] implement it with an SRAM-DRAM hybrid memory. Tan et al. [131] develop a method for classifying registers as fast or slow due to process variation, and Liang et al. [85] introduce a variable-latency register file to mitigate process variation. Li et al. [84] implement register files using a hybrid CMOS-TFET process. Because of its small size, our design can be implemented using conventional techniques.

Register file voltage: Kayiran et al. [65] tune down the performance of GPU register file and operand collector components to save energy. Tan et al. [133] reduce GPU register file energy with aggressive voltage reduction. Leng et al. [80, 79] throttle the register file when it causes voltage droop, in order to reduce the GPU voltage guardband.
