2.5 Hardware Acceleration

2.5.2 General-Purpose computation on Graphics Processing Units

While the history of general video acceleration starts as early as the 1970s (the Gun Fight arcade game used the Fujitsu MB14241 video shifter to accelerate 2D sprites), this section will focus on the more recent 3D accelerators, known as Graphics Processing Units (or GPUs). We refer the interested readers to Kumar et al.’s overview of the [...] These, eventually, gave rise to General-Purpose computation on Graphics Processing Units (GPGPU), where the GPU is used as a very wide Single Instruction Multiple Data (SIMD) coprocessor with a specialized memory model.

At first, the computing power and high bandwidth of the GPU came from its highly fixed functionality. The main goal was to solve primary visibility, that is, to determine for each pixel which triangle in the scene is the closest to the camera and to display its color. This was done via fixed function rasterization units that, for each triangle of the scene, determined which pixels are covered by the triangle and how far the triangle is from the camera. Then, the color of each of these pixel-triangle intersections (fragments) was determined. Often, a fixed lighting model (e.g., Lambert or Phong) would be evaluated at each corner of a triangle and the result interpolated inside the triangle. Lastly, the distance (depth) of the fragment would be compared to that of the currently closest fragment for the given pixel (using a dedicated depth buffer) and, if the new fragment was closer, its color was written to the frame buffer.
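The depth test at the end of this fixed function pipeline can be sketched in a few lines. The following Python sketch is purely illustrative: the function name `resolve_visibility`, the `(x, y, depth, color)` fragment layout, and the list-based buffers are assumptions made for this example and do not correspond to any real graphics API or hardware interface.

```python
import math

def resolve_visibility(fragments, width, height, background=(0, 0, 0)):
    """Keep, for every pixel, the color of the closest fragment.

    `fragments` is an iterable of (x, y, depth, color) tuples, i.e. the
    pixel-triangle intersections a rasterizer would produce (hypothetical
    layout chosen for this sketch).
    """
    depth_buffer = [[math.inf] * width for _ in range(height)]
    frame_buffer = [[background] * width for _ in range(height)]
    for x, y, depth, color in fragments:
        # Depth test: overwrite only if the new fragment is closer.
        if depth < depth_buffer[y][x]:
            depth_buffer[y][x] = depth
            frame_buffer[y][x] = color
    return frame_buffer

# Two fragments compete for pixel (1, 0); the closer one (depth 2.0) wins.
fragments = [
    (1, 0, 5.0, (255, 0, 0)),
    (1, 0, 2.0, (0, 255, 0)),
    (0, 1, 3.0, (0, 0, 255)),
]
image = resolve_visibility(fragments, width=2, height=2)
```

Note that the result does not depend on the order in which the fragments arrive, which is what allows the hardware to process triangles independently.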

While this method of determining which triangles should be displayed was extremely efficient and is still used on the most recent desktop GPUs, the control a programmer had over the triangle’s color was limited to just a few parameters, e.g., surface color or light position. The fixed function pipeline could not accommodate the ever-increasing demand for higher quality and flexibility, which led to the GPUs becoming more and more programmable. Initially, the capabilities were extended by simple assembly-like languages, with standards like the OpenGL ARB assembly language and the DirectX Shader Assembly Language. Later, these were superseded by higher-level, fully featured C-like languages such as HLSL [Gray 2003] and GLSL [Rost et al. 2009] (see [Buck 2010] for the history of programmable shading). The shading hardware of the current generations of GPUs can be considered a fully programmable wide SIMD coprocessor whose full functionality can be accessed using modern languages such as OpenCL [Khronos OpenCL Working Group and Munshi 2015] and CUDA [NVIDIA 2015]. The only two components that are left mainly as fixed function are the rasterization and texturing units.

The programming models of the GPU and the CPU can be seen as fairly similar, especially if the latter is programmed using ispc or a similar tool to generate SIMD code. The main differences lie in the number of threads that can be in flight at the same time and in the access to main memory; these differences, however, can amount to an order of magnitude.

Let us first note that, in this discussion, we will not be using NVIDIA’s definition of a thread as a single SIMD lane, but rather the more universal definition of a thread as a scheduling unit. Initially, the common x86 CPUs would have a single hardware thread running on each CPU core. The operating system would use its scheduling algorithms to assign the many software threads (e.g., an email client, a word processor, a renderer) to these hardware threads. A common practice is to reschedule threads on a page fault, that is, when the active thread requests a memory page that is not currently in main memory, but on disk. As loading the page from a disk is a relatively long process (in terms of CPU cycles) and the CPU would be idling until it gets the data, it is more efficient to assign the core to another software thread in the meantime.

The principle was later extended to Hyper-Threading, where multiple hardware threads share the compute resources of a single core. The number of threads per core is often fixed and based on the number of register sets available in hardware, as each thread has its own logical registers. A modern x86 CPU with 6 cores and two hardware threads per core can, therefore, have up to 12 hardware threads running at the same time. This allows for faster and more fine-grained switching controlled by the CPU itself and can hide not only page faults, but also cache misses. The principle where another thread is scheduled while the original thread waits for a memory request is also called latency hiding, as it effectively hides memory latency from the user. Given that most modern applications rely heavily on accessing large amounts of data, this can significantly increase the utilization of CPU resources. While the theoretical speedup of 2x from doubling the number of hardware threads is rarely achieved, in the context of rendering applications a speedup of 1.5x is fairly common when Hyper-Threading is enabled.
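To make the latency hiding argument concrete, consider a deliberately idealized toy model, sketched below in Python. It assumes that every thread alternates between a fixed number of compute cycles and a fixed number of stall cycles, that switching between threads is free, and that threads do not compete for caches or execution units. The function names and the cycle counts are hypothetical and chosen only for illustration; they are not measurements.

```python
def core_utilization(threads, compute_cycles, stall_cycles):
    """Fraction of cycles a core spends computing in the toy model.

    A single thread keeps the core busy for compute/(compute + stall) of
    the time; additional threads fill the stalls until the core saturates.
    """
    busy_fraction = compute_cycles / (compute_cycles + stall_cycles)
    return min(1.0, threads * busy_fraction)

def speedup(threads, compute_cycles, stall_cycles):
    """Throughput gain over a single hardware thread on the same core."""
    return (core_utilization(threads, compute_cycles, stall_cycles)
            / core_utilization(1, compute_cycles, stall_cycles))

# A thread stalled one third of the time: the second thread fills the gaps.
mixed = speedup(2, compute_cycles=2, stall_cycles=1)         # about 1.5
# A thread stalled most of the time: the ideal doubling is reached.
memory_bound = speedup(2, compute_cycles=1, stall_cycles=3)  # 2.0
# A thread that never stalls: the second thread brings nothing.
compute_bound = speedup(2, compute_cycles=5, stall_cycles=0) # 1.0
```

In this model the full 2x is only reached when a single thread leaves the core idle at least half of the time, which is one way to see why the observed gain usually stays below the theoretical maximum.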

The same latency-hiding approach is used on most modern GPUs. Unlike on the CPU, the number of threads per core can vary based on the resources required by the threads, e.g., there is a global register pool on the core that can be divided arbitrarily between the threads. NVIDIA’s GeForce GTX 980 currently has a maximum of 64 threads (warps in the NVIDIA terminology) per core (Streaming Multiprocessor, SM) and 16 cores, for a total of 1024 hardware threads [NVIDIA 2014]. At the same time, while the CPU SIMD units have 4 to 8 lanes, those of the GTX 980 have 32 lanes. On the CPU we can therefore schedule 12 · 8 = 96 lanes of computation simultaneously, while the GPU can arrive at the significantly higher number of 64 · 16 · 32 = 32768 lanes. In practice, the number of scheduled lanes on the GPU is significantly lower (as the code usually needs more than the minimal amount of resources), but the discrepancy is still large, and to fully utilize the GPU resources the algorithm has to account for the need for large numbers of concurrently active lanes.
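The lane counts above, and the way a shared register pool reduces them, can be reproduced with a short Python sketch. The core, thread, and lane counts are the ones quoted in the text. The register pool size of 65536 registers per core and the helper names are assumptions made for this example; the sketch also ignores the allocation granularity and the other resource limits (such as shared memory) of real hardware.

```python
def concurrent_lanes(cores, threads_per_core, simd_width):
    """Upper bound on the number of SIMD lanes scheduled at the same time."""
    return cores * threads_per_core * simd_width

cpu_lanes = concurrent_lanes(cores=6, threads_per_core=2, simd_width=8)
gpu_lanes = concurrent_lanes(cores=16, threads_per_core=64, simd_width=32)

def resident_warps(registers_per_lane, register_pool=65536,
                   simd_width=32, max_warps=64):
    """Warps that fit on one core if registers are the only limit.

    register_pool=65536 is an assumed value for illustration; consult the
    vendor documentation for the actual figure of a given GPU.
    """
    if registers_per_lane <= 0:
        raise ValueError("registers_per_lane must be positive")
    registers_per_warp = registers_per_lane * simd_width
    return min(max_warps, register_pool // registers_per_warp)

light_kernel = resident_warps(registers_per_lane=32)   # reaches the maximum
heavy_kernel = resident_warps(registers_per_lane=128)  # a quarter of it
```

Under these assumptions a kernel that needs 128 registers per lane keeps only 16 of the 64 possible warps resident on each core, which illustrates why the number of lanes scheduled in practice is lower than the theoretical maximum.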

The second major difference is in the memory access itself. Most high performance GPUs, such as the aforementioned GTX 980, are expansion cards connected to the main memory via PCI-Express. And while the bandwidth of a 16 lane PCI-Express 3.0 link is an impressive 15.75 GB/s, it is still more than an order of magnitude lower than the bandwidth available when accessing the GPU’s on-card memory (225 GB/s for the GTX 980). Coupled with the higher latency of accessing memory through the PCI-Express bus, it is obvious that high performance applications need to mainly target the on-card memory. On the other hand, the on-card memory itself has a significantly higher bandwidth than even dual channel DDR3-2400 main memory (38.5 GB/s), which helps with memory bound algorithms.
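The consequence of these bandwidth figures can be illustrated with a small Python calculation. The three bandwidths are the peak values quoted above. The 1 GB data size and the helper name `transfer_time_ms` are assumptions made for this example, and the estimate ignores latency and protocol overhead, so real transfers will be slower.

```python
# Peak bandwidths quoted in the text, in GB/s.
PCIE_3_X16 = 15.75
GTX_980_ON_CARD = 225.0
DDR3_2400_DUAL_CHANNEL = 38.5

def transfer_time_ms(size_gb, bandwidth_gb_per_s):
    """Idealized time to stream `size_gb` gigabytes at peak bandwidth."""
    return size_gb / bandwidth_gb_per_s * 1000.0

on_card_vs_pcie = GTX_980_ON_CARD / PCIE_3_X16              # roughly 14x
on_card_vs_main = GTX_980_ON_CARD / DDR3_2400_DUAL_CHANNEL  # roughly 6x

# Streaming 1 GB of data once over each path:
over_pcie_ms = transfer_time_ms(1.0, PCIE_3_X16)            # roughly 63 ms
from_on_card_ms = transfer_time_ms(1.0, GTX_980_ON_CARD)    # roughly 4 ms
```

Data that is read repeatedly therefore pays the PCI-Express cost only once if it is first copied to the on-card memory.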

CPUs are primarily optimized for single-threaded performance and their memory hierarchy is tuned to help with this goal. On the other hand, the memory hierarchy on the GPU is mostly biased towards providing maximum throughput, and the memory latency is hidden by other threads (similarly to Hyper-Threading), rather than actively reduced by complex caching and out-of-order instruction issue mechanisms. To utilize this throughput, it is important to provide the GPU with comparatively large sets of work, to keep all threads occupied as much as possible. For an extreme example, it is obviously much more effective to have the GPU process a single set of a million rays, rather than a million sets of a single ray, as the latter would leave the majority of the GPU unoccupied. Realistic scenarios of work set sizes are examined in Section 4.3.4.
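The extreme example of the million rays can be quantified with a small Python sketch. It assumes, for illustration only, that each work set is processed on its own without overlapping with other sets, that one ray occupies one SIMD lane, and that the GPU offers the 32768 lanes derived earlier for the GTX 980. The function names are hypothetical.

```python
import math

def lane_utilization(rays_per_set, simd_width=32):
    """Fraction of lanes doing useful work in the warps a set occupies."""
    warps = math.ceil(rays_per_set / simd_width)
    return rays_per_set / (warps * simd_width)

def device_occupancy(rays_per_set, max_lanes=32768):
    """Fraction of all schedulable lanes that a single set can fill."""
    return min(1.0, rays_per_set / max_lanes)

# One set of a million rays fills every lane of every warp.
big_set = (lane_utilization(1_000_000), device_occupancy(1_000_000))
# A set of a single ray uses one lane of one warp.
tiny_set = (lane_utilization(1), device_occupancy(1))
```

In this model a set of a single ray uses 1 of the 32 lanes of its warp and 1 of the 32768 schedulable lanes of the device, while a set of a million rays saturates both.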

Lastly, we will mention an issue that could be encountered on large production scenes. The on-board memory is fairly small when compared to maximum main memory sizes, as few GPUs have more than 10 GB, while 32 GB or more of main memory is quite common. All the scenes presented in our work fit into the on-board memory of the hardware used, but the ever-present production drive for higher model detail, more complex scenes, and more detailed textures means that some production scenes would not fit. So, while the topic is beyond the scope of this work, let us briefly describe one of the basic approaches that could be used to address this issue. The idea is to look at the out-of-core approaches used on the CPU when the scene does not fit into main memory and has to be stored on disk. In this case, the memory is used as a cache and various tiling and sorting schemes are employed to minimize the number of cache evictions while still servicing the requests. We refer the readers to the papers by Pantaleoni et al. [2010], Eisenacher et al. [2013], and Laine et al. [2013] for inspiration.
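The benefit of sorting requests before servicing them from a small cache can be illustrated with a generic least-recently-used (LRU) cache simulation in Python. This is a minimal sketch of the general idea only, not the method of any of the cited papers. The tile identifiers, the cache capacity, and the function name `count_tile_loads` are assumptions made for this example.

```python
from collections import OrderedDict

def count_tile_loads(requests, capacity):
    """Count the loads from slow storage for an LRU cache of `capacity` tiles."""
    cache = OrderedDict()
    loads = 0
    for tile in requests:
        if tile in cache:
            cache.move_to_end(tile)        # mark as most recently used
        else:
            loads += 1                     # miss: fetch the tile
            cache[tile] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used
    return loads

# Sixteen requests that cycle through four tiles, with room for only two.
requests = [0, 1, 2, 3] * 4
unsorted_loads = count_tile_loads(requests, capacity=2)
sorted_loads = count_tile_loads(sorted(requests), capacity=2)
```

In the incoherent order every one of the 16 requests misses the cache, while the sorted order loads each of the four tiles exactly once.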

In document Light transport simulation on special hardware (Page 56-59)