9.0.1 Spawn Memory Caching
Using off-chip memory for spawn memory frees up all of the on-chip memory for the CUDA application. To reduce the performance loss from using off-chip memory for the spawn memory space, a memory cache designed specifically for spawn memory could be investigated. Spawn memory is used in two ways when creating a thread: one area stores the threads’ register state, and the other is used during warp formation. The cache would have to account for both usages while still maintaining a high hit rate. In addition, pre-fetch logic could be implemented to improve cache hit rates when restoring thread memory state.
9.0.2 Register and Shared Memory Utilization
Three limiting factors determine the maximum number of threads that can be executed on an SM: the maximum number of hardware-supported threads, the number of registers, and the amount of shared memory. Typically, once one of these resources is used up (not enough of it remains for an entire block), no more threads can be scheduled on that SM. An application could be written so that all three resources are depleted at the same time, but this is very unlikely to happen. As a result, at least one of the memory spaces (registers or shared memory) will not be fully utilized by the application. Future research could investigate whether dynamic thread spawning can use the remaining free memory resources to reduce the required off-chip memory storage and bandwidth.
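The resource accounting described above can be illustrated with a minimal sketch. The class names, the function name, and all numeric limits are hypothetical and chosen only for illustration; the model also ignores allocation granularity and any per-SM block limit that real hardware may impose.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SMLimits:
    """Per-SM resources (hypothetical values are supplied by the caller)."""
    max_threads: int       # hardware-supported thread slots
    registers: int         # registers in the register file
    shared_mem_bytes: int  # bytes of shared memory


@dataclass(frozen=True)
class KernelUsage:
    """Per-kernel resource demand."""
    threads_per_block: int
    regs_per_thread: int
    shared_bytes_per_block: int


def occupancy(sm: SMLimits, k: KernelUsage) -> dict:
    """Return how many whole blocks fit on one SM and what is left over."""
    regs_per_block = k.threads_per_block * k.regs_per_thread
    by_threads = sm.max_threads // k.threads_per_block
    # A resource the kernel does not use never becomes the limiter.
    by_regs = sm.registers // regs_per_block if regs_per_block else by_threads
    by_shared = (sm.shared_mem_bytes // k.shared_bytes_per_block
                 if k.shared_bytes_per_block else by_threads)
    limits = {"threads": by_threads, "registers": by_regs, "shared": by_shared}
    blocks = min(limits.values())
    return {
        "blocks": blocks,
        "limiter": min(limits, key=limits.get),
        "free_registers": sm.registers - blocks * regs_per_block,
        "free_shared_bytes": sm.shared_mem_bytes - blocks * k.shared_bytes_per_block,
    }


# Hypothetical SM and kernel, for illustration only.
sm = SMLimits(max_threads=1024, registers=16384, shared_mem_bytes=16384)
kernel = KernelUsage(threads_per_block=256, regs_per_thread=20,
                     shared_bytes_per_block=2048)
result = occupancy(sm, kernel)
print(result)
```

With these assumed numbers the register file is the limiter (three blocks fit), which leaves 10240 bytes of shared memory unused. That leftover is the kind of free on-chip storage that dynamic thread spawning might exploit.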
To take advantage of unused registers, the processing of spawn instructions could leave the required data in the register file instead of saving the register state to spawn memory. During warp formation, care could be taken to guarantee that each thread is scheduled on the same SP so that it can re-use its original registers. Additionally, new instructions could be added that allow SPs to quickly pass register data to one another, so that threads can be assigned to different SPs and the required register data then transferred between them. New warps can continue to be scheduled in this way until there are no more free registers. When all of the register space has been consumed, registers used for spawned threads can be moved to spawn memory to make room for the new warp. Saving registers to spawn memory could be done by spawning a micro-kernel designed to save register state to spawn memory. The hardware scheduler would add this micro-kernel to the scheduled warps. Once this warp has finished executing (saving all registers to spawn memory), the new warp can be scheduled.
Shared memory can be used in a similar manner. When registers are being saved to spawn memory, they can first be saved into shared memory that is not used by the application. When shared memory is full, the register state is then saved to off-chip memory (through the cache). When additional shared memory is required by new warps, another micro-kernel designed to flush shared memory can be executed, in the same fashion as the one for clearing registers. This micro-kernel saves the register state held in shared memory to off-chip memory.
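The three-level placement described in the last two paragraphs (registers first, then unused shared memory, then off-chip spawn memory) can be sketched as a toy software model. This is only an illustration under stated assumptions: the class and method names are hypothetical, a "slot" stands for the storage needed by one thread's saved state, and oldest-first eviction is an assumed policy that the text does not specify.

```python
from collections import deque


class SpawnStateStore:
    """Toy model of where saved thread state lives (hypothetical design)."""

    def __init__(self, free_register_slots: int, free_shared_slots: int):
        self.reg_cap = free_register_slots    # unused register-file slots
        self.shared_cap = free_shared_slots   # unused shared-memory slots
        self.in_registers = deque()
        self.in_shared = deque()
        self.off_chip = []                    # spawn memory, assumed unbounded

    def save(self, thread_id: int) -> None:
        """Leave a spawned thread's state in the register file if possible."""
        self.in_registers.append(thread_id)
        excess = len(self.in_registers) - self.reg_cap
        if excess > 0:
            self._spill_registers(excess)

    def reclaim_registers(self, slots: int) -> None:
        """A new warp needs registers: models the register-saving micro-kernel."""
        self.reg_cap = max(0, self.reg_cap - slots)
        excess = len(self.in_registers) - self.reg_cap
        if excess > 0:
            self._spill_registers(excess)

    def reclaim_shared(self, slots: int) -> None:
        """A new warp needs shared memory: models the flush micro-kernel."""
        self.shared_cap = max(0, self.shared_cap - slots)
        excess = len(self.in_shared) - self.shared_cap
        if excess > 0:
            self._flush_shared(excess)

    def location(self, thread_id: int) -> str:
        if thread_id in self.in_registers:
            return "registers"
        if thread_id in self.in_shared:
            return "shared"
        return "off-chip" if thread_id in self.off_chip else "unknown"

    def _spill_registers(self, n: int) -> None:
        # Oldest states move from registers to shared memory (assumed policy).
        for _ in range(min(n, len(self.in_registers))):
            self.in_shared.append(self.in_registers.popleft())
        excess = len(self.in_shared) - self.shared_cap
        if excess > 0:
            self._flush_shared(excess)

    def _flush_shared(self, n: int) -> None:
        # Oldest states in shared memory move to off-chip spawn memory.
        for _ in range(min(n, len(self.in_shared))):
            self.off_chip.append(self.in_shared.popleft())


store = SpawnStateStore(free_register_slots=2, free_shared_slots=2)
for tid in range(5):
    store.save(tid)
print(list(store.in_registers), list(store.in_shared), store.off_chip)
```

The model also makes the bookkeeping cost visible: the scheduler has to know, per thread, which of the three levels currently holds its state before the thread can be placed into a warp.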
This method requires more complex scheduling hardware to keep track of the memory resources required by each thread and the location where each thread's data is stored. To help limit the additional hardware functionality, additional software micro-kernels are introduced. These hardware-assisted micro-kernels are used for moving thread register state in memory and can be generated by a compiler.
9.1 Programmable Thread Grouping
Diverging control flow is not the only characteristic that affects processor performance. Un-coalesced memory accesses can also decrease efficiency. An un-coalesced memory access occurs when two or more threads in a warp perform a memory operation with addresses that fall outside a single hardware memory block. The accesses are then serialized, resulting in multiple memory accesses for a single warp. When performing warp formation, processor performance could be improved if threads could also be grouped based on future memory operations. The exact variable values and the number of variables that could be used to help form warps will be application-specific, and the logic used to organize threads can vary. To allow application developers to optimize how warps are formed, a sorting algorithm can be executed that orders all threads waiting to be placed into warps. The sorting algorithm would be another micro-kernel, written by the developer, that is executed when the processors are about to run idle due to a lack of work. Waiting until the processor is about to run idle allows a larger number of threads to accumulate before warps are formed. Unlike the previous micro-kernels, which are used to alleviate hardware complexity, this micro-kernel would have to be written by the application developer.
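The effect of sorting pending threads before warp formation can be shown with a small host-side model. It is a sketch under assumptions, not the proposed micro-kernel itself: the warp size of 32, the 128-byte memory segment, the function names, and the choice of the segment index as the sort key are all hypothetical stand-ins for what an application developer would supply.

```python
WARP_SIZE = 32        # assumed number of threads per warp
SEGMENT_BYTES = 128   # assumed size of one hardware memory block


def form_warps(pending, warp_size=WARP_SIZE):
    """Group pending (thread_id, address) pairs into warps in list order."""
    return [pending[i:i + warp_size] for i in range(0, len(pending), warp_size)]


def count_transactions(warps, segment_bytes=SEGMENT_BYTES):
    """One memory transaction per distinct segment touched by each warp."""
    return sum(len({addr // segment_bytes for _tid, addr in warp})
               for warp in warps)


def segment_key(thread):
    """Developer-supplied sort key: the segment of the next memory access."""
    _tid, addr = thread
    return addr // SEGMENT_BYTES


# 64 pending threads whose next accesses alternate between two distant regions.
pending = [(tid, (tid % 2) * 4096 + (tid // 2) * 4) for tid in range(64)]

arrival_order = count_transactions(form_warps(pending))
sorted_order = count_transactions(form_warps(sorted(pending, key=segment_key)))
print(arrival_order, sorted_order)
```

In this constructed example, forming warps in arrival order costs four transactions (each warp touches two segments), while forming them after sorting costs two. The model does not account for the time spent sorting.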
With this approach, the overhead of running a sorting algorithm could negate the long-term benefits of more efficient warp formation. Alternative implementations could use fixed-function hardware that sorts on a limited number of variables using a fixed set of sorting operations.
9.2 Programming Model for Spawning Threads
Previous sections also introduce new functionality that is not currently supported in modern-day SIMT programming models. Features such as thread creation and programmable warp formation can be performed at the PTX level; this, however, does not integrate well with the industry recommendation to use a high-level programming model. Development of a programming model that is similar to CUDA but includes the additional functionality would also increase the feasibility of this work.